The COCA corpus has a special feature: ten out of every 200 tokens are removed, and each removed token is replaced with an `@` symbol.

So we want to pass through the corpus and remove the sentences that contain those symbols.
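
Before filtering, it can help to see how many masked spans a document contains. Here is a minimal sketch, assuming a mask always appears as ten `@` symbols separated by single spaces, as in the sample document printed further down. `count_masked_spans` is a hypothetical helper, not part of any COCA tooling:

```python
import re

# Assumption: a masked span is exactly ten "@" tokens separated by single spaces.
MASK_PATTERN = re.compile(r"@(?: @){9}")

def count_masked_spans(text):
    """Return the number of ten-token "@" masks found in text."""
    return len(MASK_PATTERN.findall(text))

sample = "the temperature was dropping fast . @ @ @ @ @ @ @ @ @ @ of minutes and there was no way"
print(count_masked_spans(sample))
```

If the corpus uses a different spacing somewhere, the pattern would need loosening (for example `\s+` between the symbols).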

We have a copy of the corpus stored in `/Volumes/data_gabriella_chronis/corpora/COCA`.

In [91]:
#from __future__ import print_function
#import time
import numpy as np
import pandas as pd
import os

#from sklearn.decomposition import PCA
#from sklearn.manifold import TSNE
import pyarrow
import fastparquet

import csv
import re

import spacy
from collections import defaultdict 
from tqdm import tqdm

from spacy.matcher import PhraseMatcher, Matcher

_COCA_PATH = "/Volumes/data_gabriella_chronis/corpora/COCA/"
_COCA_METADATA = "/Volumes/data_gabriella_chronis/corpora/COCA/shared_files/coca-sources.txt"

_OUT_DIR = '/Volumes/data_gabriella_chronis/corpora/coca.2017.parquet'


First, let's explore the data by reading in one of the files

In [2]:
path = "texts/text_fiction_awq/w_fic_2012.txt"
example_text = open(_COCA_PATH+path, "r")


In [3]:
example_text.readline()


'\n'

Our goal is to turn this corpus into one large data frame and store that data frame in a parquet file.

I want the data frame to have the following format/columns:


| doc_id | doc_text | year | genre |

1. for each file
2. split the file on the doc id
3. keep the doc id
4. sentencize the doc with spacy
5. remove the sentences that have the @@@@@ symbols
6. (don't actually do this yet---add the genre)
7. put into a dataframe
8. write/append to a parquet file.

The metadata lives in `COCA/shared_files/coca-sources.txt`.

In [53]:
columns = ["textID", "#words", "year", "genre", "subgen", "source", "title", "publication_info"]
#dtype = {"textID": int, "#words": int, "year": int, "subgen": int }
metadata = pd.read_csv( _COCA_METADATA, skiprows=3, sep='\t', names = columns, encoding='unicode_escape')
#metadata = metadata.convert_dtypes(infer_objects=False)
#metadata["textID"] = metadata["textID"].astype(np.int64, errors = 'ignore')

# We need clean primary-key (textID) values for the join

# Convert the column to numeric, invalid parsing will be set as NaN
metadata['textID'] = pd.to_numeric(metadata['textID'], errors='coerce')

# Drop rows where 'textID' is NaN
metadata = metadata.dropna(subset=['textID'])

# Convert 'textID' back to integer type if desired
metadata['textID'] = metadata['textID'].astype(np.int64)
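
To make the effect of the coercion concrete, here is a minimal sketch on a toy column. The values are made up for illustration and are not taken from `coca-sources.txt`:

```python
import numpy as np
import pandas as pd

# Toy column mixing valid IDs with a stray non-numeric string (illustrative values).
toy = pd.DataFrame({"textID": ["221118", "textID", "221119"]})

# errors="coerce" turns anything unparseable into NaN instead of raising.
toy["textID"] = pd.to_numeric(toy["textID"], errors="coerce")

# Drop the NaN rows, then cast back to a clean integer key.
toy = toy.dropna(subset=["textID"]).copy()
toy["textID"] = toy["textID"].astype(np.int64)
print(toy["textID"].tolist())
```

The cast to `int64` only succeeds after the `NaN` rows are gone, which is why the drop comes first.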



In [54]:
metadata.head(5)

Unnamed: 0,textID,#words,year,genre,subgen,source,title,publication_info
0,221118,8101.0,1990.0,SPOK,200.0,ABC_2020,Is He a Killer?; Who Will Love My Pet?; The Tw...,--
1,221119,8358.0,1990.0,SPOK,200.0,ABC_2020,Golden Years Behind Bars; The Joker; Goodbye W...,--
2,221120,7824.0,1990.0,SPOK,200.0,ABC_2020,Too Old Too Soon; Danger on the Half Shell; Mi...,--
3,221121,8559.0,1990.0,SPOK,200.0,ABC_2020,Chicken at Any Price?; The Daytop Solution; Su...,--
4,221122,8199.0,1990.0,SPOK,200.0,ABC_2020,Children of Terror; Against All Odds; Buck Fev...,--


In [60]:
# Specify the path to your text file
#input_file_path = 'path/to/your/file.txt'

# Read the entire content of the text file into a string
with open(_COCA_PATH+path, 'r') as file:
    text = file.read()

# Define the regular expression pattern to match the delimiters
pattern = r'\n##\d{7}'

# Find all matches of the pattern
matches = list(re.finditer(pattern, text))

# create a list of match texts, which are our document ids
doc_ids = [match.group()[3:] for match in matches] # get rid of the initial special characters on str_id
print(len(doc_ids))

# Initialize a list to store the captured segments, which are our document texts
doc_texts = []

# Add the text between matches
start_idx = 0
for match in matches:
    end_idx = match.start()
    if end_idx > start_idx:
        doc_texts.append(text[start_idx:end_idx].strip())
    start_idx = match.end()

# Add the text after the last match
if start_idx < len(text):
    doc_texts.append(text[start_idx:].strip())

# Create a DataFrame from the captured segments
df = pd.DataFrame(data = {'textID': doc_ids, 'doc_text': doc_texts})

"""
we need nice pkey values for our join
"""

# Convert the column to numeric, invalid parsing will be set as NaN
df['textID'] = pd.to_numeric(df['textID'], errors='coerce')

# Drop rows where 'textID' is NaN
df = df.dropna(subset=['textID'])

# Convert 'textID' back to integer type if desired
df['textID'] = df['textID'].astype(np.int64)

508
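
Because `doc_ids` and `doc_texts` are built in separate passes, their lengths can drift apart if a document body happens to be empty. One way to keep them paired is `re.split` with a capturing group, which returns IDs and bodies interleaved. Here is a minimal sketch on a toy string, assuming the same `##` plus seven-digit delimiter as above. `split_docs` is a hypothetical helper:

```python
import re

def split_docs(text):
    """Split raw corpus text into (doc_id, doc_text) pairs."""
    # With a capturing group, re.split returns
    # [preamble, id1, body1, id2, body2, ...].
    parts = re.split(r"\n##(\d{7})", text)
    ids, bodies = parts[1::2], parts[2::2]
    return [(int(doc_id), body.strip()) for doc_id, body in zip(ids, bodies)]

toy = "\n##4120102 First doc . <p> More text .\n##4120103 Second doc ."
pairs = split_docs(toy)
print(pairs)
```

An empty body then shows up as an empty string paired with its ID instead of silently shifting every later document.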


In [61]:
df

Unnamed: 0,textID,doc_text
0,4120102,""" The only problem with leaving four car lengt..."
1,4120103,"Waiting for spring in Reno , Nevada , is like ..."
2,4120106,One hour down . Three hours to go . <p> The af...
3,4120107,"After dinner with the first couple , Ida , Mav..."
4,4120108,"The letter , contemplated and worried about fo..."
...,...,...
503,4122123,The milky dhatura gum would siphon out my soul...
504,4122124,A flux of vertical wind pushed Guido Tarkenen ...
505,4122125,"I was eleven , the age when I knew everything ..."
506,4122126,Margo was sitting on one of the wooden benches...


Now we want to remove the sentences that contain `@` masks from each doc. Let's grab an example doc.

In [8]:
nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"])
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x171c5d150>

In [9]:
df.iloc[0].doc_text[:1000]



'" The only problem with leaving four car lengths in front of you is that four cars come in to fill up the space ! " Hannah Swensen complained to her sister Michelle , who was riding in the passenger seat of her cookie truck . " I \'m going forty . Do you think that \'s too slow ? " <p> " Absolutely not . It \'s nasty out there , and anybody who drives faster than forty on a night like this is crazy . " <p> " Or they come from other states and they do n\'t know anything about winter driving in Minnesota . I think I \'ll pull over as far as I can and let that whole herd of cars behind me pass . " <p> " Good idea . " <p> Hannah signaled and moved over as far as she could to encourage the other drivers to pass her . They probably thought she was being too cautious , but a thin film of water glistened on the asphalt surface of the highway , and the temperature was dropping fast . @ @ @ @ @ @ @ @ @ @ of minutes and there was no way Hannah wanted to sail off into the ditch and land in the mu

In [10]:
def strip_censored_sents(text):
    censor_string = "@ @ @ @ @ @ @ @ @ @"
    # Process the text with spaCy
    doc = nlp(text)
    
    # Initialize a list to store the sentences that do NOT contain the censor string
    filtered_sentences = []
    
    # Iterate over the sentences in the document
    for sent in doc.sents:
        #print(sent)
        if censor_string not in sent.text:
            filtered_sentences.append(sent.text.strip())

    return ' '.join(filtered_sentences)

In [11]:
strip_censored_sents(df.iloc[0].doc_text)



In [12]:
cleaned_doctexts = []

for text in df.doc_text:
    cleaned_doctext = strip_censored_sents(text)
    cleaned_doctexts.append(cleaned_doctext)

df.doc_text = cleaned_doctexts

We should now see that all sentences containing the `@` mask have been removed from the documents.

In [13]:
df.iloc[0].doc_text



In [22]:
df.head(5)

Unnamed: 0,doc_id,doc_text
0,4120102,""" The only problem with leaving four car lengt..."
1,4120103,"Waiting for spring in Reno , Nevada , is like ..."
2,4120106,One hour down . Three hours to go . <p> The af...
3,4120107,"After dinner with the first couple , Ida , Mav..."
4,4120108,"The letter , contemplated and worried about fo..."


In [23]:
metadata.head(5)

Unnamed: 0,textID,#words,year,genre,subgen,source,title,publication_info
0,221118,8101.0,1990,SPOK,200,ABC_2020,Is He a Killer?; Who Will Love My Pet?; The Tw...,--
1,221119,8358.0,1990,SPOK,200,ABC_2020,Golden Years Behind Bars; The Joker; Goodbye W...,--
2,221120,7824.0,1990,SPOK,200,ABC_2020,Too Old Too Soon; Danger on the Half Shell; Mi...,--
3,221121,8559.0,1990,SPOK,200,ABC_2020,Chicken at Any Price?; The Daytop Solution; Su...,--
4,221122,8199.0,1990,SPOK,200,ABC_2020,Children of Terror; Against All Odds; Buck Fev...,--


In [64]:
merged_df = pd.merge(df, metadata, on='textID', how='left')
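
A left join keeps every document even when its `textID` has no metadata row, so it is worth checking for unmatched keys. Here is a minimal sketch on toy frames, using the `indicator` option of `pd.merge`. The toy values are made up for illustration:

```python
import pandas as pd

# Toy stand-ins for df and metadata (values are illustrative only).
docs = pd.DataFrame({"textID": [1, 2, 3], "doc_text": ["a", "b", "c"]})
meta = pd.DataFrame({"textID": [1, 2], "genre": ["FIC", "FIC"]})

# indicator=True adds a "_merge" column saying where each row came from.
checked = pd.merge(docs, meta, on="textID", how="left", indicator=True)

# Rows marked "left_only" are documents with no matching metadata.
unmatched = checked.loc[checked["_merge"] == "left_only", "textID"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would suggest that every document found its metadata row.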

In [65]:
merged_df

Unnamed: 0,textID,doc_text,#words,year,genre,subgen,source,title,publication_info
0,4120102,""" The only problem with leaving four car lengt...",3009.0,2012.0,FIC,222.0,Cinnamon Roll murder,Cinnamon Roll murder,"New York : Kensington Books,"
1,4120103,"Waiting for spring in Reno , Nevada , is like ...",2538.0,2012.0,FIC,222.0,Murder unleashed :a novel,Murder unleashed :a novel,"New York : Ballantine Books,Edition: 1st ed."
2,4120106,One hour down . Three hours to go . <p> The af...,1657.0,2012.0,FIC,222.0,Oath of office,Oath of office,"New York : St. Martin's Press,Edition: 1st ed."
3,4120107,"After dinner with the first couple , Ida , Mav...",2964.0,2012.0,FIC,222.0,Deadline,Deadline,"New York : Kensington Publishing Corp.,"
4,4120108,"The letter , contemplated and worried about fo...",605.0,2012.0,FIC,222.0,Letter from a stranger :[a novel],Letter from a stranger :[a novel],"New York : St. Martin's Press,Edition: 1st ed."
...,...,...,...,...,...,...,...,...,...
503,4122123,The milky dhatura gum would siphon out my soul...,3126.0,2012.0,FIC,222.0,India Currents,Dhatura,Mar 2012
504,4122124,A flux of vertical wind pushed Guido Tarkenen ...,5029.0,2012.0,FIC,222.0,The Antioch Review,Guido's Tale: The Job,Spring 2012
505,4122125,"I was eleven , the age when I knew everything ...",4211.0,2012.0,FIC,222.0,The Antioch Review,Eleven,Spring 2012
506,4122126,Margo was sitting on one of the wooden benches...,4510.0,2012.0,FIC,222.0,The Antioch Review,Bringing Up Baby,Spring 2012


In [62]:
metadata.dtypes

textID                int64
#words               object
year                float64
genre                object
subgen              float64
source               object
title                object
publication_info     object
dtype: object

In [63]:
df.dtypes

textID       int64
doc_text    object
dtype: object

In [56]:
metadata[metadata.textID ==4122123]

Unnamed: 0,textID,#words,year,genre,subgen,source,title,publication_info
60423,4122123,3126.0,2012.0,FIC,222.0,India Currents,Dhatura,Mar 2012


We want to do all of the above for every text file in the COCA corpus.

In [89]:
def get_text_file_names():
    # Define the root directory
    root_dir = os.path.join(_COCA_PATH, "texts")

    # List to store paths of all text files
    text_files = []

    # os.walk already visits every subdirectory, so collect the files it
    # reports for each directory instead of re-listing the subdirectories
    for root, dirs, files in os.walk(root_dir):
        for file_name in sorted(files):
            # Keep only .txt files (skips hidden files such as .DS_Store)
            if file_name.endswith(".txt"):
                text_files.append(os.path.join(root, file_name))
    return text_files
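
Step 6 of the plan mentions adding the genre. The file names in this notebook (for example `w_fic_2012.txt` and `2016_acad.txt`) appear to carry both a year and a genre tag, so they could serve as a cross-check on the metadata join. Here is a minimal sketch, assuming every file name contains a four-digit year and one of the five tags seen in the listing. `parse_year_genre` is a hypothetical helper:

```python
import os
import re

# Assumption: genre tags are limited to the five seen in the file listing.
_GENRES = ("acad", "fic", "mag", "news", "spok")

def parse_year_genre(file_path):
    """Return (year, genre) parsed from a file name, or None for a missing part."""
    name = os.path.basename(file_path).lower()
    # A run of exactly four digits, not embedded in a longer number.
    year = re.search(r"(?<!\d)(\d{4})(?!\d)", name)
    # A genre tag that is not part of a longer alphabetic word.
    genre = re.search(r"(?<![a-z])(" + "|".join(_GENRES) + r")(?![a-z])", name)
    return (int(year.group(1)) if year else None,
            genre.group(1) if genre else None)

print(parse_year_genre("texts/text_fiction_awq/w_fic_2012.txt"))
print(parse_year_genre("texts/coca2017_text_qpj/2016_acad.txt"))
```

The parsed values are lowercase, so they would need upper-casing before comparison with the `genre` column in the metadata.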

In [90]:
get_text_file_names()

['/Volumes/data_gabriella_chronis/corpora/COCA/texts/coca2017_text_qpj/2016_acad.txt',
 '/Volumes/data_gabriella_chronis/corpora/COCA/texts/coca2017_text_qpj/2016_fic.txt',
 '/Volumes/data_gabriella_chronis/corpora/COCA/texts/coca2017_text_qpj/2016_mag.txt',
 '/Volumes/data_gabriella_chronis/corpora/COCA/texts/coca2017_text_qpj/2016_news.txt',
 '/Volumes/data_gabriella_chronis/corpora/COCA/texts/coca2017_text_qpj/2016_spok.txt',
 '/Volumes/data_gabriella_chronis/corpora/COCA/texts/coca2017_text_qpj/2017_acad.txt',
 '/Volumes/data_gabriella_chronis/corpora/COCA/texts/coca2017_text_qpj/2017_fic.txt',
 '/Volumes/data_gabriella_chronis/corpora/COCA/texts/coca2017_text_qpj/2017_mag.txt',
 '/Volumes/data_gabriella_chronis/corpora/COCA/texts/coca2017_text_qpj/2017_news.txt',
 '/Volumes/data_gabriella_chronis/corpora/COCA/texts/coca2017_text_qpj/2017_spok.txt',
 '/Volumes/data_gabriella_chronis/corpora/COCA/texts/text_2012-2015_ksr/2012_acad.txt',
 '/Volumes/data_gabriella_chronis/corpora/COCA