# Step 1: Pre-process the Data
Apply basic text preprocessing and engineer features from the Wikipedia training data and the other data sets provided.

In [1]:
# Import modules
import pandas as pd
from feature_extractor import feature_mods

In [2]:
# Gather the training and testing data
train = pd.read_csv("assets/WikiLarge_Train.csv")
test = pd.read_csv("assets/WikiLarge_Test.csv")

In [3]:
# Look at the data frames
train.head()

Unnamed: 0,original_text,label
0,There is manuscript evidence that Austen conti...,1
1,"In a remarkable comparative analysis , Mandaea...",1
2,"Before Persephone was released to Hermes , who...",1
3,Cogeneration plants are commonly found in dist...,1
4,"Geneva -LRB- , ; , ; , ; ; -RRB- is the second...",1


In [4]:
test.head()

Unnamed: 0,id,original_text,label
0,0,-2011.0,
1,1,-2011.0,
2,2,-2000.0,
3,3,-1997.0,
4,4,1.636,


## Engineer Features

**To preprocess the text, we first tokenized each sentence. We then created a second column holding the tokens lowercased and lemmatized. We kept stop words at this stage because, in our testing, leaving them in the sentences gave better results.**

In [5]:
# Create the tokens
train = feature_mods.tokens(train)
test = feature_mods.tokens(test)

In [6]:
# Create the token_lemmas column (lowercased, lemmatized tokens)
train = feature_mods.token_lemmas(train)
test = feature_mods.token_lemmas(test)

100%|██████████| 416768/416768 [09:01<00:00, 769.57it/s] 
100%|██████████| 119092/119092 [01:44<00:00, 1142.37it/s]
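
The internals of `feature_mods.tokens` and `feature_mods.token_lemmas` are not shown in this notebook, so the following is only a minimal sketch of the idea. It assumes whitespace tokenization (the WikiLarge sentences appear to be space-separated already) and uses a toy lemma lookup, `TOY_LEMMAS`, as a hypothetical stand-in for a real lemmatizer. Note that stop words such as "are" are kept.

```python
# Toy lemma lookup standing in for a real lemmatizer (hypothetical data).
TOY_LEMMAS = {"plants": "plant", "are": "be", "found": "find"}


def tokenize(sentence):
    """Split a sentence on whitespace (assumes pre-tokenized, space-separated text)."""
    return sentence.split()


def lemmatize_tokens(tokens, lemma_lookup=TOY_LEMMAS):
    """Lowercase each token and map it to its lemma when the lookup knows one."""
    lowered = [token.lower() for token in tokens]
    return [lemma_lookup.get(token, token) for token in lowered]


sentence = "Cogeneration plants are commonly found"
tokens = tokenize(sentence)
token_lemmas = lemmatize_tokens(tokens)
print(tokens)
print(token_lemmas)
```

On a data frame, functions like these could be applied row by row, for example with `train['original_text'].apply(tokenize)`.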


**We also created features containing the count and ratio of "hard words" in each sentence. A word counts as hard if it appears in none of the following: the concreteness lemmas, the Dale-Chall word list, or the lemmas in the AoA (age of acquisition) data that people under the age of 12 knew. We used the token_lemmas column in our data frame to calculate this.**

In [7]:
# Add the hard word count and ratio (AoA cutoff: age 12)
train = feature_mods.get_hard_word_ratio(train, 12)
test = feature_mods.get_hard_word_ratio(test, 12)
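
As a rough illustration of the hard-word rule described above, the sketch below treats a lemma as hard when it is missing from the union of the three word lists. The function name `hard_word_stats` and the tiny word sets are hypothetical; the real implementation lives in `feature_mods.get_hard_word_ratio` and may differ.

```python
def hard_word_stats(token_lemmas, known_words):
    """Return (count, ratio) of lemmas that are absent from the known-word set."""
    hard_words = [lemma for lemma in token_lemmas if lemma not in known_words]
    count = len(hard_words)
    ratio = count / len(token_lemmas) if token_lemmas else 0.0
    return count, ratio


# Hypothetical stand-ins for the three resources described above.
concreteness_lemmas = {"plant", "district"}
dale_chall_words = {"be", "find", "in"}
aoa_known_by_12 = {"commonly"}  # lemmas known before the age cutoff of 12
known_words = concreteness_lemmas | dale_chall_words | aoa_known_by_12

lemmas = ["cogeneration", "plant", "be", "commonly", "find"]
count, ratio = hard_word_stats(lemmas, known_words)
print(count, ratio)
```

Here only "cogeneration" is missing from every list, so one of the five lemmas is counted as hard.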

**Similarly, we created columns containing the count and ratio of "advanced words": words that people learn after the age of 8, as well as words that fewer than 92% of people knew. We calculated this from the regular tokens. Even though the AoA and concreteness data are lowercased and the concreteness data includes lemmas, the regular tokens gave better results for this feature.**

In [11]:
# Add columns with the advanced words, their count, and their ratio to all tokens
train = feature_mods.add_difficulty_columns(train, 8, .92)
test = feature_mods.add_difficulty_columns(test, 8, .92)
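
The two thresholds passed above (8 and .92) can be read as an age cutoff and a familiarity cutoff. The sketch below is one possible interpretation, not the actual `add_difficulty_columns` code: the lookup tables `aoa` and `percent_known` hold made-up values, and treating tokens missing from both tables as not advanced is an assumption.

```python
def advanced_word_stats(tokens, aoa, percent_known, age_cutoff=8, known_cutoff=0.92):
    """Return (advanced_words, count, ratio) for one tokenized sentence.

    A token is flagged when its age of acquisition exceeds age_cutoff or when
    the share of people who know it falls below known_cutoff. Tokens missing
    from a table are not flagged by that table (an assumption of this sketch).
    """
    advanced = [
        token
        for token in tokens
        if aoa.get(token, 0.0) > age_cutoff
        or percent_known.get(token, 1.0) < known_cutoff
    ]
    count = len(advanced)
    ratio = count / len(tokens) if tokens else 0.0
    return advanced, count, ratio


# Made-up lookup values for illustration only.
aoa = {"the": 3.0, "plant": 4.5, "cogeneration": 14.0, "warms": 5.0}
percent_known = {"the": 1.0, "plant": 1.0, "cogeneration": 0.95, "hamlet": 0.90}

tokens = ["the", "cogeneration", "plant", "warms", "the", "hamlet"]
words, count, ratio = advanced_word_stats(tokens, aoa, percent_known)
print(words, count, ratio)
```

In this toy input, "cogeneration" is flagged by the age cutoff and "hamlet" by the familiarity cutoff.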

**We also retrieved the part-of-speech tag for every word in the original text. Our testing, data exploration, and research suggested that different combinations of parts of speech can make a sentence easier or harder to comprehend.**

In [13]:
# Add a column with the part of speech tags for the original text
train = feature_mods.pos_tokens(train)
test = feature_mods.pos_tokens(test)

**Length is an obvious feature to include, since more complex sentences are likely to be longer. To account for length, we simply count the number of tokens.**

In [16]:
# Add a column with the number of words in a sentence
train = feature_mods.length(train)
test = feature_mods.length(test)

**Because more difficult words tend to contain more characters, we computed the average token length in each sentence.**

In [19]:
# Add a column with the average length of a word in a sentence
train = feature_mods.average_word_length(train)
test = feature_mods.average_word_length(test)
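
Both length features described above reduce to simple arithmetic over the token list. The helper names below are illustrative and are not taken from `feature_mods`; returning 0.0 for an empty sentence is an assumption of this sketch.

```python
def token_count(tokens):
    """Sentence length measured as the number of tokens."""
    return len(tokens)


def average_word_length(tokens):
    """Mean number of characters per token (0.0 for an empty sentence)."""
    if not tokens:
        return 0.0
    return sum(len(token) for token in tokens) / len(tokens)


tokens = ["Geneva", "is", "a", "city"]
print(token_count(tokens))          # 4 tokens
print(average_word_length(tokens))  # (6 + 2 + 1 + 4) / 4 = 3.25
```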

**A larger number of named entities could indicate that a sentence is harder to understand, since the reader has more nouns to keep track of.**

In [22]:
# Add columns with the named entities, their count, and their ratio in the original text
train = feature_mods.named_entity_counter(train)
test = feature_mods.named_entity_counter(test)

**More punctuation could indicate a more complex sentence structure.**

In [23]:
# Add a column for the count and ratio of punctuation to regular characters in the original text
train = feature_mods.add_punctuation(train)
test = feature_mods.add_punctuation(test)
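
A punctuation count and ratio along these lines could be computed as in the sketch below. It assumes ASCII punctuation from `string.punctuation` and excludes whitespace from the "regular" characters; whether `feature_mods.add_punctuation` makes the same choices is not shown in this notebook.

```python
import string

PUNCTUATION = set(string.punctuation)


def punctuation_stats(text):
    """Return (punctuation_count, ratio of punctuation to regular characters)."""
    punctuation_count = sum(1 for ch in text if ch in PUNCTUATION)
    # Regular characters: everything that is neither punctuation nor whitespace.
    regular_count = sum(
        1 for ch in text if ch not in PUNCTUATION and not ch.isspace()
    )
    ratio = punctuation_count / regular_count if regular_count else 0.0
    return punctuation_count, ratio


count, ratio = punctuation_stats("Hi , there !")
print(count, ratio)  # 2 punctuation marks over 7 regular characters
```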

**Because the relative frequency of parts of speech might provide information about the difficulty of a sentence, we created one column per part-of-speech tag holding the ratio of tokens in the sentence that carry that tag.**

In [25]:
# Add a column with the ratio of each part of speech in the original text
train = feature_mods.add_pos_count_columns(train)
test = feature_mods.add_pos_count_columns(test)

100%|██████████| 35/35 [00:16<00:00,  2.11it/s]
100%|██████████| 35/35 [00:04<00:00,  8.21it/s]
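
The per-tag ratio columns can be pictured with the sketch below, which takes a sentence that has already been tagged as `(token, tag)` pairs and returns one ratio per tag. The function name `pos_ratios`, the four-tag subset, and the hand-tagged example are all assumptions for illustration; the notebook itself produces 35 tag columns.

```python
from collections import Counter

# Small subset of Penn Treebank tags for illustration.
EXAMPLE_TAGS = ["DT", "JJ", "NN", "VBZ"]


def pos_ratios(tagged_tokens, tags=EXAMPLE_TAGS):
    """Return {tag: share of tokens carrying that tag} for one tagged sentence."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = len(tagged_tokens)
    return {tag: (counts[tag] / total if total else 0.0) for tag in tags}


# Hand-tagged example sentence (tags assigned manually, not by a tagger).
tagged = [
    ("The", "DT"), ("old", "JJ"), ("plant", "NN"),
    ("heats", "VBZ"), ("the", "DT"), ("district", "NN"),
]
ratios = pos_ratios(tagged)
print(ratios)
```

Each value in `ratios` would become one column of the data frame, keyed by its tag name.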


In [26]:
# Filter down to only the columns needed for the model
columns = ['original_text', 'label', 'tokens', 'token_lemmas', 'length', 'advanced_word_count', 'advanced_words_ratio','hard_word_count', 
           'hard_word_ratio', 'named_entity_count', 'named_entity_ratio', 'punctuation_score', 'punctuation_ratio', 'CC', 
           'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNS', 'NNP', 'NNPS', 
           'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 
           'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB', 'avg_word_length']
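# Copy the selections so that later column assignments do not trigger SettingWithCopyWarning
train = train[columns].copy()
test = test[columns + ['id']].copy()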

In [31]:
# Create a string version of tokens and token_lemmas columns
train['token_string'] = train['tokens'].apply(lambda x: ' '.join(x))
train['token_lemmas_string'] = train['token_lemmas'].apply(lambda x: ' '.join(x))

In [27]:
# Get the word embeddings feature representation
train_word_embeddings = feature_mods.create_embeddings(train)
test_word_embeddings = feature_mods.create_embeddings(test)

## Save the data to CSVs

In [32]:
train.to_csv('assets/wiki_train_preprocessed5.csv', index = False)
test.to_csv('assets/wiki_test_preprocessed5.csv', index = False)

train_word_embeddings.to_csv('assets/embeddings_train.csv', index = False)
test_word_embeddings.to_csv('assets/embeddings_test.csv', index = False)