# General Assembly Data Science Immersive - Capstone Project #

## Creating an automated English language error detector ##

## Part 2: Data cleaning and feature engineering - overview of process

This is the second part of my data science immersive Capstone Project, covering the data cleaning and feature engineering process. 

Recall that in part 1, I used the exam scripts in the FCE dataset to create overlapping ngrams of length 1 to 5 to query the Google Books Ngrams database. From this, I extracted "match counts" (frequency within the corpus) for each ngram, as well as match counts for the left and right context.
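
As a reminder of what "overlapping ngrams" means here, the following is a minimal sketch (not the Part 1 code) of how the ngrams of length 1 to 5 that contain a given word could be enumerated. The function name `ngrams_containing` is hypothetical and only used for illustration.

```python
def ngrams_containing(tokens, index, max_n=5):
    """Return, for each length n, the ngrams of `tokens` that contain tokens[index]."""
    result = {}
    for n in range(1, max_n + 1):
        grams = []
        # An ngram starting at `start` covers positions start .. start + n - 1.
        for start in range(index - n + 1, index + 1):
            if start >= 0 and start + n <= len(tokens):
                grams.append(" ".join(tokens[start:start + n]))
        result[n] = grams
    return result


tokens = ["Dear", "Sir", "or", "Madam", ","]
grams = ngrams_containing(tokens, index=1)
print(grams[2])  # the bigrams that contain "Sir"
```

Words near the start or end of a sentence naturally end up with fewer ngrams per length, which is where the "No score" entries later in this notebook come from.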

In this phase, I combine these match counts and context counts with the FCE dataframe, create the features, check for errors / outliers, clean the data and then save the results as new dataframes for EDA and modelling purposes in Part 3.

**This entire process is completed twice - once for the training set and once for the test set**

In [606]:
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
import math
from scipy import stats
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")
import ErrorDetection as ed
import importlib

## 1. Load the FCE dataset

In [546]:
# Load FCE dataset and read in DataFrame
my_file = "fce_train.csv"
fce = pd.read_csv(my_file, index_col=[0])

Note that we have already removed the null value sentence separators in part 1.

In [7]:
fce.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 452833 entries, 0 to 481562
Data columns (total 2 columns):
0    452833 non-null object
1    452833 non-null object
dtypes: object(2)
memory usage: 10.4+ MB


## 2. Load data created in part 1

In [567]:
# Load sentence indices, i.e. indices that mark out the sentences in our dataframe
with open("sentence_indices_train.pickle", 'rb') as f:
    sentence_indices = pickle.load(f)

In [290]:
# Open dictionaries of ngram match counts
with open('unigram_scores_col_train.pickle','rb') as f:
    unigram_scores = pickle.load(f)
with open('bigram_scores_col_train.pickle', 'rb') as f:
    bigram_scores = pickle.load(f)
with open('trigram_scores_col_train.pickle', 'rb') as f:
    trigram_scores = pickle.load(f)
with open('fourgram_scores_col_train.pickle', 'rb') as f:
    fourgram_scores = pickle.load(f)
with open('fivegram_scores_col_train.pickle', 'rb') as f:
    fivegram_scores = pickle.load(f)

In [291]:
# Open dictionaries of ngram context scores
with open('bigram_train_context.pickle','rb') as f:
    bigram_context = pickle.load(f)
with open('trigram_train_context.pickle', 'rb') as f:
    trigram_context = pickle.load(f)
with open('fourgram_train_context.pickle', 'rb') as f:
    fourgram_context = pickle.load(f)
with open('fivegram_train_context.pickle', 'rb') as f:
    fivegram_context = pickle.load(f)

In [292]:
# Open tagged ngrams
with open('tagged_unigrams_train.pickle','rb') as f:
    tagged_unigrams = pickle.load(f)
with open('tagged_bigrams_train.pickle', 'rb') as f:
    tagged_bigrams = pickle.load(f)
with open('tagged_trigrams_train.pickle', 'rb') as f:
    tagged_trigrams = pickle.load(f)
with open('tagged_fourgrams_train.pickle', 'rb') as f:
    tagged_fourgrams = pickle.load(f)
with open('tagged_fivegrams_train.pickle', 'rb') as f:
    tagged_fivegrams = pickle.load(f)

In [293]:
# Open tagged ngram boundaries
with open('tagged_unigrams_boundaries_train.pickle','rb') as f:
    tagged_unigrams_boundaries = pickle.load(f)
with open('tagged_bigrams_boundaries_train.pickle', 'rb') as f:
    tagged_bigrams_boundaries = pickle.load(f)
with open('tagged_trigrams_boundaries_train.pickle', 'rb') as f:
    tagged_trigrams_boundaries = pickle.load(f)
with open('tagged_fourgrams_boundaries_train.pickle', 'rb') as f:
    tagged_fourgrams_boundaries = pickle.load(f)
with open('tagged_fivegrams_boundaries_train.pickle', 'rb') as f:
    tagged_fivegrams_boundaries = pickle.load(f)

## 3. Create columns in the FCE dataset using ngram count dictionaries



In [547]:
# Create a column for each ngram length
for key in unigram_scores:
    fce[key] = unigram_scores[key]
for key in bigram_scores:
    fce[key] = bigram_scores[key]
for key in trigram_scores:
    fce[key] = trigram_scores[key]
for key in fourgram_scores:
    fce[key] = fourgram_scores[key]
for key in fivegram_scores:
    fce[key] = fivegram_scores[key]

With regard to the left and right context, to calculate probabilities I'm only interested in the first and last ngram for each ngram length respectively.

This is easy to extract for our left contexts. However, if a word is at the beginning of a sentence (and therefore doesn't have a full complement of ngrams), the correct right context ngram will be mapped to the "ngram_1_right_context" key, otherwise it will be mapped to the "ngram_2_right_context" key.

This will require an extra step of forward filling.
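
The forward-filling step can be illustrated on a toy frame. This is a sketch with made-up counts rather than the project data; the column names simply mirror the naming pattern used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy right-context counts for two words; NaN marks a missing ("No score") entry.
toy = pd.DataFrame({
    "trigram_1_right_context": [100.0, 100.0],
    "trigram_2_right_context": [np.nan, 250.0],
    "trigram_3_right_context": [np.nan, 400.0],
})

# Fill along each row so the last column always holds the right-most available value.
filled = toy.ffill(axis=1)
print(filled["trigram_3_right_context"].tolist())
```

After filling, the last column can be used as the single right-context column regardless of how many ngrams a word actually had.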

In [548]:
# Create a column for each ngram left context score
fce['bigram_left_context'] = bigram_context['bigram_1_left_context']
fce['trigram_left_context'] = trigram_context['trigram_1_left_context']
fce['fourgram_left_context'] = fourgram_context['fourgram_1_left_context']
fce['fivegram_left_context'] = fivegram_context['fivegram_1_left_context']

In [549]:
fce.reset_index(inplace=True, drop=True)

In [550]:
# create temporary dataframe with all right context columns
fce_rc = pd.DataFrame()

for key in bigram_context:
    if "right" in key:
        fce_rc[key] = bigram_context[key]
for key in trigram_context:
    if "right" in key:
        fce_rc[key] = trigram_context[key]
for key in fourgram_context:
    if "right" in key:
        fce_rc[key] = fourgram_context[key]
for key in fivegram_context:
    if "right" in key:
        fce_rc[key] = fivegram_context[key]

# replace "no score" with np.nan
fce_rc.replace("No score", np.nan, inplace=True)

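# forward fill along each row and assign the result back
# (fillna(inplace=True) on a .loc slice may only modify a temporary copy)
fce_rc.loc[:, "bigram_1_right_context":"bigram_2_right_context"] = fce_rc.loc[
    :, "bigram_1_right_context":"bigram_2_right_context"].fillna(method="ffill", axis=1)
fce_rc.loc[:, "trigram_1_right_context":"trigram_3_right_context"] = fce_rc.loc[
    :, "trigram_1_right_context":"trigram_3_right_context"].fillna(method="ffill", axis=1)
fce_rc.loc[:, "fourgram_1_right_context":"fourgram_4_right_context"] = fce_rc.loc[
    :, "fourgram_1_right_context":"fourgram_4_right_context"].fillna(method="ffill", axis=1)
fce_rc.loc[:, "fivegram_1_right_context":"fivegram_5_right_context"] = fce_rc.loc[
    :, "fivegram_1_right_context":"fivegram_5_right_context"].fillna(method="ffill", axis=1)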

# create new columns in main dataframe
fce['bigram_right_context'] = fce_rc['bigram_2_right_context']
fce['trigram_right_context'] = fce_rc['trigram_3_right_context']
fce['fourgram_right_context'] = fce_rc['fourgram_4_right_context']
fce['fivegram_right_context'] = fce_rc['fivegram_5_right_context']

# delete temporary dataframe
del fce_rc

In [298]:
# view created columns
fce.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 452833 entries, 0 to 481562
Data columns (total 25 columns):
0                         452833 non-null object
1                         452833 non-null object
unigram_1                 452833 non-null object
bigram_1                  452833 non-null object
bigram_2                  452833 non-null object
trigram_1                 452833 non-null object
trigram_2                 452833 non-null object
trigram_3                 452833 non-null object
fourgram_1                452833 non-null object
fourgram_2                452833 non-null object
fourgram_3                452833 non-null object
fourgram_4                452833 non-null object
fivegram_1                452833 non-null object
fivegram_2                452833 non-null object
fivegram_3                452833 non-null object
fivegram_4                452833 non-null object
fivegram_5                452833 non-null object
bigram_left_context       452833 non-null object
trigram

## 4. Create a separate dataframe with POS tags

As I will be processing the POS tags differently from the match counts, I will keep the POS tags in a separate dataframe at this stage.

In [551]:
# take an explicit copy so that columns added to fce_pos do not touch fce
fce_pos = fce.iloc[:, 0:2].copy()

In [552]:
for key in tagged_unigrams:
    fce_pos[key] = tagged_unigrams[key]
for key in tagged_bigrams:
    fce_pos[key] = tagged_bigrams[key]
for key in tagged_trigrams:
    fce_pos[key] = tagged_trigrams[key]
for key in tagged_fourgrams:
    fce_pos[key] = tagged_fourgrams[key]
for key in tagged_fivegrams:
    fce_pos[key] = tagged_fivegrams[key]

In [553]:
fce_pos["tagged_unigram_boundaries"] = tagged_unigrams_boundaries
fce_pos["tagged_bigram_boundaries"] = tagged_bigrams_boundaries
fce_pos["tagged_trigram_boundaries"] = tagged_trigrams_boundaries
fce_pos["tagged_fourgram_boundaries"] = tagged_fourgrams_boundaries
fce_pos["tagged_fivegram_boundaries"] = tagged_fivegrams_boundaries

## 5. Initial data cleaning

Taking an overview of the columns, it's clear that a number of issues will need to be resolved:
- column names: some columns have a number not a name,
- column types: our match count columns are objects, but I would expect these to be floats or integers (as they should contain only numerical values)
- "No score" values: we will have to convert these to null values.
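
The last two points can be sketched on a toy column. This is only an illustration with made-up values; using `pd.to_numeric` for the type conversion is an assumption on my part and not necessarily how the columns are converted in this notebook.

```python
import numpy as np
import pandas as pd

# Toy column mixing numeric strings with the "No score" placeholder.
toy = pd.DataFrame({"bigram_2": ["18492", "No score", "20109"]})

# Replace the placeholder with NaN, then cast the column to a numeric dtype.
toy["bigram_2"] = pd.to_numeric(toy["bigram_2"].replace("No score", np.nan))
print(toy["bigram_2"].dtype)
```

Because NaN is a float, any column containing missing values ends up as a float column, which is consistent with the `float64` dtypes reported by `fce.info()` after cleaning.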

In [302]:
fce.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 452833 entries, 0 to 481562
Data columns (total 25 columns):
0                         452833 non-null object
1                         452833 non-null object
unigram_1                 452833 non-null object
bigram_1                  452833 non-null object
bigram_2                  452833 non-null object
trigram_1                 452833 non-null object
trigram_2                 452833 non-null object
trigram_3                 452833 non-null object
fourgram_1                452833 non-null object
fourgram_2                452833 non-null object
fourgram_3                452833 non-null object
fourgram_4                452833 non-null object
fivegram_1                452833 non-null object
fivegram_2                452833 non-null object
fivegram_3                452833 non-null object
fivegram_4                452833 non-null object
fivegram_5                452833 non-null object
bigram_left_context       452833 non-null object
trigram

In [303]:
fce.head(10)

Unnamed: 0,0,1,unigram_1,bigram_1,bigram_2,trigram_1,trigram_2,trigram_3,fourgram_1,fourgram_2,...,fivegram_4,fivegram_5,bigram_left_context,trigram_left_context,fourgram_left_context,fivegram_left_context,bigram_right_context,trigram_right_context,fourgram_right_context,fivegram_right_context
0,Dear,c,25307860,1152194,No score,10042,No score,No score,8972,No score,...,No score,No score,25307860,1152194,10042,8972,44272603,18492,12343,2487
1,Sir,c,44272603,1152194,18492,10042,12343,No score,8972,2487,...,No score,No score,25307860,1152194,10042,8972,1371154850,20109,5128,2487
2,or,c,1371154850,18492,20109,10042,12343,5128,8972,2487,...,No score,No score,44272603,1152194,10042,8972,2652177,1303813,5128,2487
3,Madam,c,2652177,20109,1303813,12343,5128,No score,8972,2487,...,No score,No score,1371154850,18492,10042,8972,22046138853,1303813,5128,2487
4,",",c,22046138853,1303813,No score,5128,No score,No score,2487,No score,...,No score,No score,2652177,20109,12343,8972,22046138853,1303813,5128,2487
6,I,c,1755464606,74006239,No score,314367,No score,No score,12886,No score,...,No score,No score,1755464606,74006239,314367,12886,41540111,1662948,12203,10977
7,am,c,102062498,74006239,333632,314367,13805,No score,12886,144,...,No score,No score,1755464606,74006239,314367,12886,6051835146,52058492,42643038,73825
8,writing,c,41540111,333632,1662948,314367,13805,12203,12886,144,...,No score,No score,102062498,74006239,314367,12886,123021129,44786667,74793,1069
9,in,c,6051835146,1662948,52058492,13805,12203,42643038,12886,144,...,1062,No score,41540111,333632,314367,12886,7171979288,6584602,374649,313
10,order,c,123021129,52058492,44786667,12203,42643038,74793,144,10977,...,1062,0,6051835146,1662948,13805,12886,18384490,554113,812,0


In [554]:
# rename the first two columns
fce.columns = ["word", "y"] + [col for col in fce.columns[2:]]

In [555]:
# Change our outcome variable to a binary variable
fce.loc[:,"y"] = fce["y"].map(lambda x: 0 if x == "i" else 1)

### Dealing with "no score" / null values


In [556]:
# replace "no score" with null 
fce.replace("No score", np.nan, inplace=True)

### Changing column types

In [557]:
# there is one particular sentence where an error was returned from Phrasefinder and wasn't
# picked up at an earlier stage. 
for col in fce.columns[2:]:
    fce[col] = fce[col].map(lambda x: 0 if 'improve the disadvantages' 
                                    in str(x) else 22425 if 'the disadvantages' in str(x) else 
                                    2852366 if 'Dear Sir' in str(x) else 12139192 if 'Madame' in str(x)
                                   else 0 if 'Sally' in str(x) else 256424470 if 'error' in str(x) else x)
    

In [308]:
# check columns
fce.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 452833 entries, 0 to 481562
Data columns (total 25 columns):
word                      452833 non-null object
y                         452833 non-null int64
unigram_1                 452833 non-null int64
bigram_1                  452569 non-null float64
bigram_2                  395635 non-null float64
trigram_1                 450437 non-null float64
trigram_2                 394216 non-null float64
trigram_3                 342252 non-null float64
fourgram_1                446180 non-null float64
fourgram_2                392600 non-null float64
fourgram_3                341465 non-null float64
fourgram_4                292691 non-null float64
fivegram_1                442948 non-null float64
fivegram_2                390239 non-null float64
fivegram_3                339869 non-null float64
fivegram_4                291737 non-null float64
fivegram_5                246467 non-null float64
bigram_left_context       452569 non-null fl

## 7. Feature engineering - analysis and explanation

### A. Creating features from language model probabilities

Recall that a key hypothesis was that ngram language model probabilities would indicate whether a word is correct or incorrect given the context. For example, we could build on the intuition that where introducing a new word to a given context causes the ngram probability to hit zero or become very low, the word is likely to be incorrect. More specifically, the hypothesis was that a machine learning classifier could learn the decision threshold based on some combination of ngram probabilities.

Given that I already have the word match counts and the left / right context counts generated from Google Ngrams, all I need to do is divide the former by the latter and insert the result as a feature column in the dataset. In addition, I will take the natural log of all the probabilities: as we'll be dealing with extremely small values, working with logarithms is simpler and numerically more stable.
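
As a worked example of this calculation, the sketch below computes the log probability log(count(ngram) / count(context)) for the word "Sir" in the second row of the `fce.head(10)` output above. The helper name `log_probability` is hypothetical.

```python
import math


def log_probability(ngram_count, context_count):
    """Natural log of ngram_count / context_count."""
    return math.log(ngram_count / context_count)


# Values from the head() output: for "Sir", bigram_1 ("Dear Sir") is 1152194
# and bigram_left_context ("Dear") is 25307860.
log_p = log_probability(1152194, 25307860)
print(log_p)
```

A zero match count cannot be passed through the logarithm directly, which is why zero probabilities are given a fixed low value later in the notebook.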


### Unigram probabilities - naive assumptions

In terms of determining the probability for a unigram, the standard language model approach would be to take the count of the word in the corpus divided by the total number of words in the corpus. However, as we're only interested in the *correctness* of the word, I will substitute this for the following naive assumptions: 

**if the natural logarithm of a word's count is above Q1 - 3 * IQR, assign a probability of 1**: the assumption is that words appearing not at all or very infrequently in Google Ngrams are very likely to be incorrect. I have decided to use the Tukey method to define a very low word count, as it is more robust to a heavily skewed distribution.

**if the natural logarithm of a word's count is below the Tukey threshold**: assign a probability of 0.
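
The rule can be sketched as follows. The helper `tukey_lower_bound` is a hypothetical stand-in written for this example; the notebook itself uses `ed.tukey_outlier_bounds` from my own module, whose implementation is not shown here. The log counts are made up.

```python
import numpy as np


def tukey_lower_bound(values, k=3.0):
    """Return Q1 - k * IQR for a 1-D array, ignoring NaNs."""
    q1, q3 = np.nanpercentile(values, [25, 75])
    return q1 - k * (q3 - q1)


# Made-up log counts: most words are frequent, one is extremely rare.
log_counts = np.array([17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 2.0])
lower = tukey_lower_bound(log_counts)

# Naive unigram "probability": 1 above the threshold, 0 otherwise.
proba_word = (log_counts > lower).astype(int)
print(lower, proba_word)
```

With k = 3 only very extreme values fall below the bound, so the feature flags just the rarest words.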

### Cleaning and imputing probabilities for higher order contexts

Another issue I will have to deal with is data sparsity and null values. 

Depending on their position in a sentence (and the length of that sentence), certain words do not have as much left or right context to work with. For example, 
- first word: no words before it, therefore *no left context at all*. 
- second word: only bigram context (i.e. only one word before it)
- third word: only bigram and trigram context (i.e. only two words before it)...

and so on.

- last word: no words after it, so no right context at all
- penultimate word: only bigram context (i.e. only one word after it)...

and so on.


Where words have less context, the best approach is either a) to assign these as null values or b) impute by carrying forward the probability from the highest available ngram context. 

The reason why I have taken this approach is that these words have fewer syntactic dependencies - the context we have available is the only context we want to take into account for that word. Given that we are only interested in the probability and ultimately *correctness* of a word in *this specific context only*, we should not make assumptions about contexts that don't exist. 
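
To make the position rule concrete, here is a small sketch listing which ngram lengths have a complete left context for a word at a given position in its sentence. The function name is hypothetical; positions are 0-based.

```python
def available_left_context(position, max_n=5):
    """Ngram lengths (2..max_n) whose left context fits before a word
    at 0-based `position` within its sentence."""
    return [n for n in range(2, max_n + 1) if position >= n - 1]


for position in range(5):
    print(position, available_left_context(position))
```

Any length missing from the list for a given position corresponds to a probability that is set to null rather than estimated. The mirrored rule applies to the right context, counting back from the end of the sentence.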


### B. Creating features from raw match counts

Ngram probabilities can be helpful in identifying errors, but they only take into account the ngrams where the word appears at either the beginning or the end of the ngram. Recall that we also have counts for ngrams where the word appears in the middle.

It might add some additional, useful signal to our model if I include features based on the "raw" match counts, which take these other ngrams into account. This would include:

- **Mean log match counts for each ngram length**

My hypothesis here is that incorrect words will appear in more zero / low count ngrams than a correct word. Therefore, lower ngram count means should be correlated with incorrect words - and some combination of ngram means should provide a classifier with signal to discover the decision boundary. 

Another benefit of taking the mean is that, to a large extent, it avoids the need for imputing, although for very short sentences there will still be unavoidable null mean values for higher order ngrams (e.g. four and fivegrams).

- **Sum of ngrams where the log count is above a certain threshold**

There are some potential issues with using ngram count means. The main one is that an incorrect word might nevertheless appear in at least one very common ngram, which would skew the mean.

For example, take the phrase: "the mistake what I did . ". This is incorrect due to the word "what", but the trigram "what I did" is so common that it might completely counteract low counts for "the mistake what" and "mistake what I". In fact, it may end up getting a higher mean than a correct phrase.

Taking the logarithm of the counts before calculating the mean might help lessen this issue. However, another way to get around this would be manually creating a feature that creates a binary difference between a significant and insignificant / zero count (as we did with the unigram probabilities). 

More specifically, for each ngram length, we could count the number of ngrams whose count is significant, i.e. above the threshold.
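
The difference between the two features can be seen in a small sketch. The log counts and the threshold below are purely illustrative and are not real Google Books values.

```python
import numpy as np

# Illustrative log counts only (not real Google Books values).
# Trigrams containing "what" in "the mistake what I did":
#   "the mistake what", "mistake what I", "what I did"
incorrect_word = np.array([0.0, 0.0, 13.0])
# A correct word whose three trigrams are all moderately common.
correct_word = np.array([4.0, 4.0, 4.0])

threshold = 2.0  # hypothetical cut-off for a "significant" log count

for name, counts in [("incorrect", incorrect_word), ("correct", correct_word)]:
    mean_log_count = counts.mean()
    n_significant = int(np.sum(counts > threshold))
    print(name, round(mean_log_count, 2), n_significant)
```

In this toy case the incorrect word has the higher mean, but only one of its three trigrams clears the threshold, so the threshold count separates the two words where the mean does not.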

- **Directly using the scaled log counts as features**

Another potential option is to simply use the log match counts themselves as features rather than doing any further processing.

The issue here, however, is that we are again faced with sparsity and null values - not all words have the full complement of values.

To address this, I could impute by forward filling again. But different length ngrams operate on different scales (i.e. lower-order ngrams have much higher counts than higher-order ones) - imputing would completely bias the data. A better option is to first standard scale the match counts and then forward fill.
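
A minimal sketch of "scale first, then forward fill" on toy log counts is shown below. It assumes column-wise z-scores with the population standard deviation (`ddof=0`); the notebook's own `ed.manual_zscore` helper is not shown here and may differ in detail.

```python
import numpy as np
import pandas as pd

# Toy log counts: the second word has no second bigram and no trigram.
toy = pd.DataFrame({
    "bigram_1": [14.0, 16.0, 12.0],
    "bigram_2": [15.0, np.nan, 13.0],
    "trigram_1": [9.0, np.nan, 7.0],
})

# Column-wise z-scores; pandas skips NaNs when computing mean and std.
# ddof=0 (population std) is an assumption made for this sketch.
scaled = (toy - toy.mean()) / toy.std(ddof=0)

# Forward fill across columns so each gap takes the previous scaled value.
scaled_filled = scaled.ffill(axis=1)
print(scaled_filled)
```

Because every column is on the same scale before filling, a value carried over from a bigram column into a trigram column means "as far above or below average as the bigram was", rather than importing a raw count from a different scale.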
 

### C. Creating features using POS tags

I will include two features based on the POS tags I previously extracted using spaCy:

- **Binary feature: is the word a proper noun or not?** 

Proper nouns are much less likely to appear in the corpus (especially in longer ngram contexts) than other words. Due to this sparsity issue, the presence of proper nouns could significantly bias my results - causing correct phrases to return a zero count. To try and counter this, I will include in the model a binary feature indicating whether the word is a proper noun.

- **Part of speech trigrams** 

Another useful feature will be to identify whether certain part of speech ngrams are associated with language errors. For example, a trigram of **adjective, adjective, full stop** is very likely to contain an error. 

The best way to extract such a feature will be to use a Count Vectorizer to map individual words to the ngrams in which they appear. I will take this step in Part 3 of the project.

To not overload my model with complexity, I will start by only looking at trigrams.
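
Both POS features can be sketched on a toy tagged sentence. The (word, tag) pairs below are hand-written Penn Treebank style tags for illustration only, not tagger output, and the variable names are hypothetical.

```python
# Toy (word, tag) pairs; in the project the tags come from the tagger.
tagged = [("I", "PRP"), ("met", "VBD"), ("Sally", "NNP"),
          ("yesterday", "NN"), (".", ".")]

# Binary proper-noun flag per word (NNP = singular, NNPS = plural proper noun).
proper_noun = [1 if tag in ("NNP", "NNPS") else 0 for _, tag in tagged]

# POS trigrams as space-joined tag strings, one per window of three tags.
tags = [tag for _, tag in tagged]
pos_trigrams = [" ".join(tags[i:i + 3]) for i in range(len(tags) - 2)]

print(proper_noun)
print(pos_trigrams)
```

Strings of this form are the kind of input that a count-based vectorizer can turn into one column per POS trigram at the modelling stage.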

## 7A. Creating features from language model probabilities


In [558]:
# Divide match count by the left context and create new feature columns
count_1 = ["bigram_left_context", "trigram_left_context", "fourgram_left_context", "fivegram_left_context"]
count_2 = ["bigram_1", "trigram_1", "fourgram_1", "fivegram_1"]
for i in range(len(count_1)):
    fce["proba_" + count_2[i]] = (fce[count_2[i]] / (fce[count_1[i]]))

In [559]:
# Impute null ngram match counts and right context match counts by forward filling 
fce_ff = fce.loc[:, "bigram_1": "fivegram_5"].copy()
fce_ff.loc[:, "bigram_1":"bigram_2"] = fce_ff.loc[
    :, "bigram_1":"bigram_2"].fillna(method="ffill", axis=1)
fce_ff.loc[:, "trigram_1":"trigram_3"] = fce_ff.loc[
    :, "trigram_1":"trigram_3"].fillna(method="ffill", axis=1)
fce_ff.loc[:, "fourgram_1":"fourgram_4"] = fce_ff.loc[
    :, "fourgram_1":"fourgram_4"].fillna(method="ffill", axis=1)
fce_ff.loc[:, "fivegram_1":"fivegram_5"] = fce_ff.loc[
    :, "fivegram_1":"fivegram_5"].fillna(method="ffill", axis=1)
# fce_ff_2 = fce.loc[:, "bigram_right_context":"fivegram_right_context"].fillna(method="ffill", axis=1)

In [560]:
# Divide match count by the right context and create new feature columns
count_1 = ["bigram_right_context", "trigram_right_context", "fourgram_right_context", "fivegram_right_context"]
count_2 = ["bigram_2", "trigram_3", "fourgram_4", "fivegram_5"]
for i in range(len(count_1)):
    fce["proba_" + count_2[i]] = (fce_ff[count_2[i]] / (fce[count_1[i]]))

# delete temporary dataframes
del fce_ff

In [511]:
# reset index
fce.reset_index(inplace=True, drop=True)

In [568]:
# adjust sentence indices
sentence_indices = [[i, j-1] for i, j in sentence_indices]

In [562]:
# calculate outlier thresholds using tukey method
unigram_upper, unigram_lower = ed.tukey_outlier_bounds(np.log(fce["unigram_1"]), 3)
bigram_upper, bigram_lower = ed.tukey_outlier_bounds(np.log(fce["bigram_1"]), 3)
trigram_upper, trigram_lower = ed.tukey_outlier_bounds(np.log(fce["trigram_1"]), 3)
fourgram_upper, fourgram_lower = ed.tukey_outlier_bounds(np.log(fce["fourgram_1"]), 3)
fivegram_upper, fivegram_lower = ed.tukey_outlier_bounds(np.log(fce["fivegram_1"]), 3)

In [569]:
# Impute probability for unigrams
fce["proba_word"] = fce["unigram_1"].map(lambda x: 1 if np.log(x)>unigram_lower else 0)

fce.loc[[i[0] for i in sentence_indices], "proba_bigram_1"] = fce.loc[
    [i[0] for i in sentence_indices], "unigram_1"].map(lambda x: 1 if np.log(x)>unigram_lower else 0)

In [570]:
# Correct word probabilities at beginning of sentences by assigning null values 
fce.loc[[i[0] for i in sentence_indices], "proba_trigram_1":"proba_fivegram_1"] = np.nan

fce.loc[[i[0]+1 if i[0]+1 <= i[1] else i[0] for i in sentence_indices], 
        "proba_trigram_1":"proba_fivegram_1"] = np.nan
fce.loc[[i[0]+2 if i[0]+2 <= i[1] else i[0] for i in sentence_indices], 
            "proba_fourgram_1":"proba_fivegram_1"] = np.nan

fce.loc[[i[0]+3 if i[0]+3 <= i[1] else i[0] for i in sentence_indices], "proba_fivegram_1"] = np.nan

In [571]:
# Correct word probabilities at end of sentences by assigning null values 
fce.loc[[i[1] for i in sentence_indices], "proba_bigram_2"] = fce.loc[
    [i[1] for i in sentence_indices], "unigram_1"].map(lambda x: 1 if np.log(x)>0 else 0)

fce.loc[[i[1] for i in sentence_indices], "proba_trigram_3":"proba_fivegram_5"] = np.nan
fce.loc[[i[1]-1 if i[1]-1 >= i[0] else i[1] for i in sentence_indices], 
        "proba_trigram_3":"proba_fivegram_5"] = np.nan
fce.loc[[i[1]-2 if i[1]-2 >= i[0] else i[1] for i in sentence_indices], 
        "proba_fourgram_4":"proba_fivegram_5"] = np.nan
fce.loc[[i[1]-3 if i[1]-3 >= i[0] else i[1] for i in sentence_indices ], "proba_fivegram_5"] = np.nan

In [574]:
# take log of probabilities, setting a sufficiently low negative value (-20) where probability is 0
fce.loc[:,"proba_bigram_1":"proba_fivegram_5"] = fce.loc[
    :,"proba_bigram_1":"proba_fivegram_5"].applymap(lambda x: -20 if (x==0) else np.log(x)) 
fce["proba_word"] = fce["proba_word"].map(lambda x: -20 if (x==0) else np.log(x))

## 7B. Create raw match count based features


In [575]:
# take log of counts, first setting any zero counts to value 0 (to avoid issues with log(0))
fce.loc[:,"unigram_1":"fivegram_5"] = fce.loc[
    :,"unigram_1":"fivegram_5"].applymap(lambda x: 0 if (x==0) else np.log(x)) 

In [576]:
# Find the log means and place into new columns in the dataframe
fce["unigram_mean"] = fce.loc[:, "unigram_1"]
fce["bigram_mean"] = fce.loc[:, "bigram_1" : "bigram_2"].mean(axis=1)
fce["trigram_mean"] = fce.loc[:, "trigram_1" : "trigram_3"].mean(axis=1)
fce["fourgram_mean"] = fce.loc[:, "fourgram_1" : "fourgram_4"].mean(axis=1)
fce["fivegram_mean"] = fce.loc[:, "fivegram_1" : "fivegram_5"].mean(axis=1)

In [577]:
# Sum number of ngrams where value is above lower threshold
fce["unigram_sum_threshold"] = fce.loc[:, "unigram_1"].map(
    lambda x: np.sum(x > unigram_lower))
fce["bigram_sum_threshold"] = fce.loc[:, "bigram_1" : "bigram_2"].apply(
    lambda x: np.sum(x > bigram_lower), axis=1)
fce["trigram_sum_threshold"] = fce.loc[:, "trigram_1" : "trigram_3"].apply(
    lambda x: np.sum(x > trigram_lower), axis=1)
fce["fourgram_sum_threshold"] = fce.loc[:, "fourgram_1" : "fourgram_4"].apply(
    lambda x: np.sum(x > fourgram_lower), axis=1)
fce["fivegram_sum_threshold"] = fce.loc[:, "fivegram_1" : "fivegram_5"].apply(
    lambda x: np.sum(x > fivegram_lower), axis=1)

Standardise and impute match counts - insert them as new feature columns

In [578]:
# start by manually calculating standard scores (StandardScaler won't work with null values)
fce_ss = ed.manual_zscore(fce.loc[:,"unigram_1":"fivegram_5"])

# forward fill
fce_ss = fce_ss.fillna(method='ffill', axis=1)

# rename columns
fce_ss.columns = [col + "_scaled" for col in fce_ss.columns]

# concatenate dataframes
fce = pd.concat([fce, fce_ss], axis=1)

# delete fce_ss
del fce_ss

## 7C. Create POS tagged features

Recall that at the outset, I created a second dataframe (fce_pos) containing the POS tagged ngrams and ngram boundaries.

I will now use this dataframe to:
- extract a binary "proper_noun" feature and include it within the fce dataframe
- insert the trigram boundaries into the fce dataframe for processing at the modelling stage

In [579]:
# create binary POS feature
fce["proper_noun"] = fce_pos["unigrams_1"].map(lambda x: 1 if "NNP" in x else 0)

In [580]:
# merge trigram boundaries
fce["trigram_boundaries"] = fce_pos["tagged_trigram_boundaries"]

## 8. Check for and clean anomalies and outliers

Next, I will check the summary statistics to identify any outliers or anomalies before dealing with them.

In [399]:
# check summary statistics
fce.iloc[:,0:20].describe()

Unnamed: 0,y,unigram_1,bigram_1,bigram_2,trigram_1,trigram_2,trigram_3,fourgram_1,fourgram_2,fourgram_3,fourgram_4,fivegram_1,fivegram_2,fivegram_3,fivegram_4,fivegram_5,bigram_left_context,trigram_left_context,fourgram_left_context
count,452833.0,452833.0,452569.0,395635.0,450437.0,394216.0,342252.0,446180.0,392600.0,341465.0,292691.0,442948.0,390239.0,339869.0,291737.0,246467.0,452569.0,450437.0,446180.0
mean,0.869937,19.68217,14.133499,14.052845,9.291744,9.103233,9.033668,5.193553,4.93589,4.810206,4.778222,2.316988,2.070622,1.929279,1.876632,1.865009,2932671000.0,52769660.0,1312524.0
std,0.336374,3.179168,4.085849,4.090901,4.741899,4.73125,4.741062,4.655509,4.548952,4.518285,4.520664,3.591037,3.383543,3.282896,3.254113,3.239136,5600165000.0,233602400.0,34240150.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,17.770366,12.185186,12.134561,6.626718,6.461468,6.371612,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,46924780.0,189293.0,740.0
50%,1.0,20.147392,14.850568,14.719223,10.150387,9.916626,9.829787,5.55296,5.236442,5.087596,5.023881,0.0,0.0,0.0,0.0,0.0,496107100.0,2637777.0,24525.5
75%,1.0,21.821203,17.030865,16.879635,12.847521,12.624192,12.561429,9.019815,8.672144,8.527342,8.480944,5.099866,4.590044,4.290459,4.174387,4.158883,2087000000.0,22923020.0,358968.0
max,1.0,23.816403,23.816403,23.816403,23.816403,23.816403,23.816403,23.816403,23.816403,23.816403,23.816403,23.816403,23.816403,23.816403,23.816403,23.816403,22046140000.0,22046140000.0,22046140000.0


In [401]:
fce.iloc[:,20:40].describe()

Unnamed: 0,fivegram_left_context,bigram_right_context,trigram_right_context,fourgram_right_context,fivegram_right_context,proba_bigram_1,proba_trigram_1,proba_fourgram_1,proba_fivegram_1,proba_bigram_2,proba_trigram_3,proba_fourgram_4,proba_fivegram_5,proba_word,unigram_mean,bigram_mean,trigram_mean,fourgram_mean,fivegram_mean,unigram_sum_threshold
count,442948.0,425540.0,423518.0,419524.0,416502.0,451055.0,385148.0,317055.0,202167.0,427193.0,370391.0,328837.0,240069.0,452833.0,452833.0,452569.0,450437.0,446180.0,442948.0,452833.0
mean,185488.8,4378326000.0,86593750.0,1991162.0,201701.1,inf,inf,inf,inf,inf,inf,inf,inf,-0.094295,19.68217,14.171805,9.298811,5.149432,2.231384,0.995285
std,33944820.0,6734600000.0,394659500.0,36258970.0,35005120.0,,,,,,,,,1.370043,3.179168,3.448339,3.896302,3.760589,2.812585,0.068502
min,0.0,0.0,0.0,0.0,0.0,-20.10283,-20.0,-20.0,-20.0,-20.12752,-20.07873,-20.07873,-20.0,-20.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,60899140.0,232574.0,914.0,0.0,-7.054926,-7.133454,-20.0,-20.0,-8.432718,-9.345668,-20.0,-20.0,0.0,17.770366,12.461696,6.866836,1.902363,0.0,1.0
50%,249.0,609129200.0,3215607.0,28991.0,322.0,-4.712522,-4.397605,-4.503173,-9.181357,-5.16857,-5.033237,-5.953056,-20.0,0.0,20.147392,14.73523,9.896638,4.953866,1.101878,1.0
75%,8068.0,6051835000.0,29309740.0,463531.0,9366.75,-2.64287,-2.452737,-2.171249,-2.256229,-1.706916,-1.238425,-0.427444,-0.8396701,0.0,21.821203,16.559352,12.175507,8.10235,3.719312,1.0
max,22046140000.0,22046140000.0,22046140000.0,22046140000.0,22046140000.0,inf,inf,inf,inf,inf,inf,inf,inf,0.0,23.816403,23.816403,19.935604,18.099192,22.299355,1.0


In [403]:
fce.iloc[:,40:].describe()

Unnamed: 0,bigram_sum_threshold,trigram_sum_threshold,fourgram_sum_threshold,fivegram_sum_threshold,unigram_1_scaled,bigram_1_scaled,bigram_2_scaled,trigram_1_scaled,trigram_2_scaled,trigram_3_scaled,fourgram_1_scaled,fourgram_2_scaled,fourgram_3_scaled,fourgram_4_scaled,fivegram_1_scaled,fivegram_2_scaled,fivegram_3_scaled,fivegram_4_scaled,fivegram_5_scaled,proper_noun
count,452833.0,452833.0,452833.0,452833.0,452833.0,452833.0,452833.0,452833.0,452833.0,452833.0,452833.0,452833.0,452833.0,452833.0,452833.0,452833.0,452833.0,452833.0,452833.0,425780.0
mean,1.873106,2.621066,3.252713,3.779009,1.927964e-12,-0.00095,0.035036,-0.006889,0.028389,0.061138,-0.010184,0.024194,0.057679,0.087217,-0.012662,0.020439,0.050783,0.078106,0.102562,0.028919
std,0.334602,0.717913,1.140584,1.563177,1.000001,1.00153,1.007168,1.007349,1.009911,1.011067,1.008266,1.017503,1.025008,1.031344,1.008802,1.029563,1.049353,1.067717,1.084771,0.167578
min,0.0,0.0,0.0,0.0,-6.190988,-6.190988,-6.190988,-6.190988,-6.190988,-6.190988,-6.190988,-6.190988,-6.190988,-6.190988,-6.190988,-6.190988,-6.190988,-6.190988,-6.190988,0.0
25%,2.0,3.0,3.0,3.0,-0.6013541,-0.478121,-0.442608,-0.572908,-0.535833,-0.502986,-1.115573,-1.085062,-1.06461,-1.056975,-0.645215,-0.61197,-0.587677,-0.576696,-0.576696,0.0
50%,2.0,3.0,4.0,5.0,0.1463349,0.174458,0.204275,0.175923,0.206681,0.240622,0.070434,0.105987,0.145598,0.180924,-0.645215,-0.61197,-0.587677,-0.576696,-0.575775,0.0
75%,2.0,3.0,4.0,5.0,0.6728286,0.708773,0.738919,0.745837,0.781629,0.818093,0.815275,0.856571,0.891985,0.923284,0.762805,0.807852,0.850629,0.891947,0.938606,0.0
max,2.0,3.0,4.0,5.0,1.300415,2.369866,2.386655,3.06305,3.109789,3.118026,4.000179,4.150525,4.206513,4.211375,5.986973,6.426935,6.667027,6.742179,6.776942,1.0


There appear to be two issues:

- **Infinity values in the probabilities:** all probabilities should lie between 0 and 1 (or, as log probabilities, at or below zero), yet some have infinite values.
- **Identical maximum values for the raw counts:** every ngram length has the same maximum match count. This is highly unexpected and suggestive of an error. We would expect maximum counts to fall as ngram length increases.

I will now check each of these in more detail.
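Both symptoms can also be detected programmatically rather than by reading the `describe()` output. The following is a minimal sketch on a toy frame; the values are invented for illustration, and only the column names mirror those used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for fce; the values are illustrative, not real counts
toy = pd.DataFrame({
    "unigram_1": [500.0, 900.0, 120.0],
    "bigram_1": [40.0, 900.0, 3.0],          # shares the unigram maximum: suspicious
    "proba_bigram_1": [-2.5, np.inf, -3.7],  # contains an infinite value
})

# Count infinite values in each column
inf_counts = np.isinf(toy).sum()

# Flag count columns whose maximum equals the unigram maximum
count_cols = ["bigram_1"]
same_max = [col for col in count_cols
            if toy[col].max() == toy["unigram_1"].max()]

print(inf_counts)
print(same_max)
```

On the real dataframe, `count_cols` would be the bigram to fivegram match count columns.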

### Check the anomalous maximum values


In [168]:
# Select only rows where fivegram match count equals the max value for the column
fce.loc[fce["fivegram_1"] == fce["fivegram_1"].max(), ["word", "fivegram_1"]]

Unnamed: 0,word,fivegram_1
236197,me,22046140000.0


In [173]:
# Select only rows where fourgram match count equals the max value for the column
fce.loc[fce["fourgram_1"] == fce["fourgram_1"].max(), ["word", "fourgram_1"]]

Unnamed: 0,word,fourgram_1
236196,write,22046140000.0


In [175]:
# Select only rows where trigram match count equals the max value for the column
fce.loc[fce["trigram_1"] == fce["trigram_1"].max(), ["word", "trigram_1"]]

Unnamed: 0,word,trigram_1
236195,will,22046140000.0


In [242]:
# Select only rows where bigram match count equals the max value for the column
fce.loc[fce["bigram_1"] == fce["bigram_1"].max(), ["word", "bigram_1"]]

Unnamed: 0,word,bigram_1
77880,.,22046140000.0
190032,.,22046140000.0
236194,&,22046140000.0


Interestingly, almost all of these appear to be in the same sentence. I will look at the surrounding context to try to isolate the cause.

In [171]:
# analyse the surrounding context
fce.loc[236192:236198, ["word", "fivegram_1"]]

Unnamed: 0,word,fivegram_1
236192,forget,0.0
236193,",",0.0
236194,&,72.0
236195,will,231808.0
236196,write,783286.0
236197,me,22046140000.0
236198,soon,0.0


The problem appears to be with the ampersand. I confirmed this by checking how I created the queries in Part 1: Phrasefinder takes percent-encoded queries, and I failed to replace the ampersands with their percent-encoded value.
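For reference, this is roughly what the missing encoding step could look like using the standard library. It is a sketch only: the helper name `encode_ngram` is hypothetical, and the rest of the query (endpoint and parameters) is omitted because it is not shown in this part of the project.

```python
from urllib.parse import quote

def encode_ngram(ngram):
    """Percent-encode an ngram so reserved characters such as '&' survive in a URL."""
    # safe="" means no reserved character is left unencoded
    return quote(ngram, safe="")

# The fourgram from the sentence above that starts with the ampersand
print(encode_ngram("& will write me"))  # %26%20will%20write%20me
```

Left unencoded, an `&` in a URL query string normally acts as a parameter separator, which would explain why these queries returned counts for something other than the intended ngram.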

As there are relatively few cases, I will remove these rather than requery them.

Finally, I will also check the surrounding context of the other bigram anomalies to see if these are also caused by the ampersand.

In [244]:
fce.loc[77875:77884, ["word", "bigram_1"]]

Unnamed: 0,word,bigram_1
77875,sit,0.0
77876,down,3444019.0
77877,!,516558.0
77878,",",0.0
77879,",",7791411.0
77880,.,22046140000.0
77881,We,1830.0
77882,ca,1830.0
77883,n't,62619.0
77884,talk,1952998.0


In these cases, a sequence of punctuation marks (here, consecutive commas) appears to be causing the problem. After checking how I created the queries against the Phrasefinder API, I found that this was again caused by malformed queries.

Again, I will deal with these by removing them.

### Check anomalies in the probabilities

In [136]:
# select first 10 rows where trigrams have a probability greater than 1
fce.loc[fce["proba_trigram_1"] > 0, ["word", "proba_trigram_1", "trigram_1", "trigram_left_context"]].head(10)

Unnamed: 0,word,proba_trigram_1,trigram_1,trigram_left_context
12426,.,2.828653,5943.0,2101.0
17279,is,1.380616,8205.0,5943.0
28997,and,4.111111,259.0,63.0
37007,to,21.207765,71555.0,3374.0
48217,",",5.031008,649.0,129.0
54392,.,2.828653,5943.0,2101.0
63123,",",16.772794,45991.0,2742.0
94054,/,222.553207,256424470.0,1152194.0
94511,",",46.822581,2903.0,62.0
101603,.,2.828653,5943.0,2101.0


As before, punctuation marks and symbols appear to be causing the issues through malformed queries.

I'll check the context words to the left to confirm this.

In [181]:
fce.loc[17277:17283, ["word", "proba_trigram_1", "trigram_1", "trigram_left_context"]]

Unnamed: 0,word,proba_trigram_1,trigram_1,trigram_left_context
17277,the,0.210194,1239337.0,5896165.0
17278,U.S.A.,0.0,0.0,2365910.0
17279,is,1.380616,8205.0,5943.0
17280,July,0.0,0.0,12403.0
17281,due,0.0,0.0,27080.0
17282,to,0.862233,363.0,421.0
17283,the,0.324716,11438013.0,35224703.0


In this case, it is the word U.S.A. causing an error due to the presence of dots.

I will see if this also applies to right context ngrams where the probability is greater than 1.

In [245]:
# select only rows where bigram right context probability is greater than 1
fce.loc[fce["proba_bigram_2"] > 0, ["word", "proba_bigram_2", "bigram_2", "bigram_right_context"]].head(10)

Unnamed: 0,word,proba_bigram_2,bigram_2,bigram_right_context
22391,we,98.578632,11019710.0,111786.0
31081,they,34.0627,3807733.0,111786.0
35573,it,17.38902,1943849.0,111786.0
49448,we,91.595562,4284932.0,46781.0
50900,the,19.592211,1287365.0,65708.0
51513,it,17.38902,1943849.0,111786.0
73304,we,98.578632,11019710.0,111786.0
77879,",",1.430438,22046140000.0,15412160000.0
94054,/,37.11186,256424500.0,6909502.0
99976,Development,inf,85261600.0,0.0


In [246]:
# select the surrounding context for one of the anomalous words
fce.loc[22390:22395]

Unnamed: 0,word,y,unigram_1,bigram_1,bigram_2,trigram_1,trigram_2,trigram_3,fourgram_1,fourgram_2,...,fourgram_right_context,fivegram_right_context,proba_bigram_1,proba_trigram_1,proba_fourgram_1,proba_fivegram_1,proba_bigram_2,proba_trigram_3,proba_fourgram_4,proba_fivegram_5
22390,that,1,2997974108,5516337.0,45662178.0,834791.0,144841.0,139063.0,102209.0,24980.0,...,54769.0,2963.0,0.12953,0.091773,0.058583,0.088196,0.05660246,0.012619,0.008015,0.0
22391,we,1,806717136,45662178.0,11019711.0,144841.0,139063.0,54769.0,24980.0,1994.0,...,0.0,0.0,0.015231,0.026257,0.029924,0.032052,98.57863,264.584541,inf,inf
22392,'ll,1,111786,11019711.0,207.0,139063.0,54769.0,0.0,1994.0,439.0,...,1545325.0,,0.01366,0.003045,0.013767,0.021417,9.228361e-07,0.0,0.0,0.001917
22393,work,1,224308513,207.0,1545325.0,54769.0,0.0,1545325.0,439.0,2963.0,...,,,0.001852,0.00497,0.003157,0.0,0.01998092,0.019981,0.0,
22394,together,1,77340044,1545325.0,77340044.0,0.0,1545325.0,,2963.0,0.0,...,,,0.006889,0.0,0.0541,0.0,0.005018118,0.0001,,
22395,.,1,15412159799,77340044.0,,1545325.0,,,0.0,,...,,,1.0,1.0,,1.0,,,,


In this case, it is the presence of the "'ll" token causing the problem. Checking with Phrasefinder, this comes from a parsing inconsistency: parts of the Google Ngrams dataset treat "we'll" and "they'll" as a single token, while other parts split them into "we" and "'ll".
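To see why this produces an impossible probability, recall that each probability is the ngram match count divided by the context match count. Every occurrence of the ngram contains its context, so the ratio should never exceed 1. A short sketch using the counts for "we" (row 22391) from the table above; the helper name `is_plausible` is hypothetical.

```python
def is_plausible(ngram_count, context_count):
    """An ngram cannot occur more often than the context it contains."""
    return ngram_count <= context_count

# Counts taken from the table above
bigram_count = 11019711       # bigram_2 for "we", i.e. "we 'll"
right_context_count = 111786  # match count for the right context "'ll"

ratio = bigram_count / right_context_count
print(round(ratio, 2))                                  # 98.58
print(is_plausible(bigram_count, right_context_count))  # False
```

The bigram is counted roughly 98 times more often than its own context, which is consistent with the two queries being matched against differently tokenised text.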

Next, I'll check infinity probabilities and contexts.

In [360]:
# select rows where the trigram right context probability is an infinity value
print(len(fce.loc[fce["proba_trigram_3"] == np.inf, ["word","trigram_3","trigram_right_context"]]),
         "rows of trigram infinity values")
fce.loc[fce["proba_trigram_3"] == np.inf, ["word","trigram_3","trigram_right_context"]].head(15)

8197 rows of trigram infinity values


Unnamed: 0,word,trigram_3,trigram_right_context
70,the,7.232733,0.0
121,the,13.858978,0.0
151,promised,,0.0
227,in,11.111686,0.0
228,Capital,12.682984,0.0
261,a,16.434304,0.0
292,who,8.288283,0.0
446,show,11.2936,0.0
449,saw,7.937375,0.0
450,last,14.589597,0.0


In each of these rows the right context count is 0, so dividing the match count by the context count produces an infinity value instead of a 0 probability.

We will have to manually reassign these as 0 probabilities (or a log probability of -20, as explained above).
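The behaviour is easy to reproduce on a toy example. The sketch below uses made-up counts, not values from the dataset: pandas does not raise on division by zero, so a non-zero count over a zero context gives `inf`, and the log of `inf` is still `inf`.

```python
import numpy as np
import pandas as pd

# Made-up match counts and context counts
counts = pd.Series([5.0, 0.0, 8.0])
context = pd.Series([0.0, 0.0, 16.0])

proba = counts / context    # 5/0 -> inf, 0/0 -> NaN, 8/16 -> 0.5
log_proba = np.log(proba)   # inf stays inf, NaN stays NaN

# Reassign infinity values to the floor log probability of -20
cleaned = log_proba.replace(np.inf, -20)
print(cleaned)
```

Note that `0 / 0` gives a null value rather than infinity, so those rows are not affected by the replacement.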

### Clean anomalies

As stated before, I will deal with the anomalies as follows:
- **max count issue:** remove rows
- **probabilities greater than 1:** remove rows
- **infinity values:** convert to -20.

However, when putting the model into production and applying it to real data, the parsing and encoding problems behind the first two issues will have to be corrected. A language checker can't skip words!

In [581]:
# Change infinity values to -20
fce.replace(np.inf, -20, inplace=True)

In [582]:
# Loop through probability columns and drop where probability is greater than 0 (log 1)
for col in fce.columns[25:33]:
    mask = fce[col] > 0
    fce = fce.loc[~mask, :]

In [585]:
# Loop through match count columns and drop where bigrams and larger contexts are equal to unigram max value
for col in fce.columns[3:17]:
    mask = fce[col] == fce["unigram_1"].max()
    fce = fce.loc[~mask, :]

In [None]:
# remove ampersands
mask = fce["word"]=="&"
fce = fce[~mask]
mask = fce_pos["0"]=="&"
fce_pos = fce_pos[~mask]

In [591]:
# apply same cleaning to pos tagged dataframe
fce_pos = fce_pos.loc[fce.index]
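Since the same cleaning is repeated for the test set, the three rules could be wrapped in one function that takes column names instead of positional indices. This is a sketch under assumptions, not the code used here: `clean_anomalies` and its parameters are hypothetical, and it computes the unigram maximum once rather than once per column as the loop above does.

```python
import numpy as np
import pandas as pd

def clean_anomalies(df, proba_cols, count_cols, floor=-20):
    """Apply the three cleaning rules to a copy of df and return it."""
    # Infinity values: replace with the floor log probability
    out = df.replace(np.inf, floor)
    # Probabilities greater than 1: drop rows with a log probability above 0
    out = out.loc[~(out[proba_cols] > 0).any(axis=1)]
    # Max count issue: drop rows where a longer ngram equals the unigram maximum
    unigram_max = out["unigram_1"].max()
    out = out.loc[~(out[count_cols] == unigram_max).any(axis=1)]
    return out

# Toy data: "b" has the max count issue, "c" a probability above 1, "d" an infinity
toy = pd.DataFrame({
    "word": ["a", "b", "c", "d"],
    "unigram_1": [9.0, 7.0, 5.0, 6.0],
    "bigram_1": [3.0, 9.0, 2.0, 1.0],
    "proba_bigram_1": [-1.0, -2.0, 0.5, np.inf],
})
cleaned = clean_anomalies(toy, ["proba_bigram_1"], ["bigram_1"])
print(cleaned["word"].tolist())  # ['a', 'd']
```

The POS-tagged dataframe can then be aligned with the result through its index, as in the cell above.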

## 9. Save dataframes as CSVs for later use


In [592]:
fce.to_csv("fce_train_final.csv")
fce_pos.to_csv("fce_pos_train_final.csv")

## 10. Repeat process for test set

In [598]:
# Load FCE test dataset and read in DataFrame
my_file = "fce_test.csv"
fce = pd.read_csv(my_file, index_col=[0])

In [599]:
# Load sentence indices, i.e. indices that mark out the sentences in our dataframe
with open("sentence_indices_test.pickle", 'rb') as f:
    sentence_indices = pickle.load(f)

In [600]:
# Open dictionaries of ngram match counts
with open('unigram_scores_col_test.pickle','rb') as f:
    unigram_scores = pickle.load(f)
with open('bigram_scores_col_test.pickle', 'rb') as f:
    bigram_scores = pickle.load(f)
with open('trigram_scores_col_test.pickle', 'rb') as f:
    trigram_scores = pickle.load(f)
with open('fourgram_scores_col_test.pickle', 'rb') as f:
    fourgram_scores = pickle.load(f)
with open('fivegram_scores_col_test.pickle', 'rb') as f:
    fivegram_scores = pickle.load(f)

# Open dictionaries of ngram context scores
with open('bigram_test_context.pickle','rb') as f:
    bigram_context = pickle.load(f)
with open('trigram_test_context.pickle', 'rb') as f:
    trigram_context = pickle.load(f)
with open('fourgram_test_context.pickle', 'rb') as f:
    fourgram_context = pickle.load(f)
with open('fivegram_test_context.pickle', 'rb') as f:
    fivegram_context = pickle.load(f)

# Open tagged ngrams
with open('tagged_unigrams_test.pickle','rb') as f:
    tagged_unigrams = pickle.load(f)
with open('tagged_bigrams_test.pickle', 'rb') as f:
    tagged_bigrams = pickle.load(f)
with open('tagged_trigrams_test.pickle', 'rb') as f:
    tagged_trigrams = pickle.load(f)
with open('tagged_fourgrams_test.pickle', 'rb') as f:
    tagged_fourgrams = pickle.load(f)
with open('tagged_fivegrams_test.pickle', 'rb') as f:
    tagged_fivegrams = pickle.load(f)

# Open tagged ngram boundaries
with open('tagged_unigrams_boundaries_test.pickle','rb') as f:
    tagged_unigrams_boundaries = pickle.load(f)
with open('tagged_bigrams_boundaries_test.pickle', 'rb') as f:
    tagged_bigrams_boundaries = pickle.load(f)
with open('tagged_trigrams_boundaries_test.pickle', 'rb') as f:
    tagged_trigrams_boundaries = pickle.load(f)
with open('tagged_fourgrams_boundaries_test.pickle', 'rb') as f:
    tagged_fourgrams_boundaries = pickle.load(f)
with open('tagged_fivegrams_boundaries_test.pickle', 'rb') as f:
    tagged_fivegrams_boundaries = pickle.load(f)

In [601]:
# Create a column for each ngram length
for key in unigram_scores:
    fce[key] = unigram_scores[key]
for key in bigram_scores:
    fce[key] = bigram_scores[key]
for key in trigram_scores:
    fce[key] = trigram_scores[key]
for key in fourgram_scores:
    fce[key] = fourgram_scores[key]
for key in fivegram_scores:
    fce[key] = fivegram_scores[key]

In [602]:
# Create a column for each ngram left context score
fce['bigram_left_context'] = bigram_context['bigram_1_left_context']
fce['trigram_left_context'] = trigram_context['trigram_1_left_context']
fce['fourgram_left_context'] = fourgram_context['fourgram_1_left_context']
fce['fivegram_left_context'] = fivegram_context['fivegram_1_left_context']

fce.reset_index(inplace=True, drop=True)

# create temporary dataframe with all right context columns
fce_rc = pd.DataFrame()

for key in bigram_context:
    if "right" in key:
        fce_rc[key] = bigram_context[key]
for key in trigram_context:
    if "right" in key:
        fce_rc[key] = trigram_context[key]
for key in fourgram_context:
    if "right" in key:
        fce_rc[key] = fourgram_context[key]
for key in fivegram_context:
    if "right" in key:
        fce_rc[key] = fivegram_context[key]

# replace "no score" with np.nan
fce_rc.replace("No score", np.nan, inplace=True)

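# forward fill across columns, assigning the result back
# (inplace=True on a .loc slice may only modify a temporary copy)
fce_rc.loc[:, "bigram_1_right_context":"bigram_2_right_context"] = fce_rc.loc[
    :, "bigram_1_right_context":"bigram_2_right_context"].fillna(method="ffill", axis=1)
fce_rc.loc[:, "trigram_1_right_context":"trigram_3_right_context"] = fce_rc.loc[
    :, "trigram_1_right_context":"trigram_3_right_context"].fillna(method="ffill", axis=1)
fce_rc.loc[:, "fourgram_1_right_context":"fourgram_4_right_context"] = fce_rc.loc[
    :, "fourgram_1_right_context":"fourgram_4_right_context"].fillna(method="ffill", axis=1)
fce_rc.loc[:, "fivegram_1_right_context":"fivegram_5_right_context"] = fce_rc.loc[
    :, "fivegram_1_right_context":"fivegram_5_right_context"].fillna(method="ffill", axis=1)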

# create new columns in main dataframe
fce['bigram_right_context'] = fce_rc['bigram_2_right_context']
fce['trigram_right_context'] = fce_rc['trigram_3_right_context']
fce['fourgram_right_context'] = fce_rc['fourgram_4_right_context']
fce['fivegram_right_context'] = fce_rc['fivegram_5_right_context']

# delete temporary dataframe
del fce_rc

In [603]:
# copy the first two columns so that new columns are not added to a slice of fce
fce_pos = fce.iloc[:, 0:2].copy()

for key in tagged_unigrams:
    fce_pos[key] = tagged_unigrams[key]
for key in tagged_bigrams:
    fce_pos[key] = tagged_bigrams[key]
for key in tagged_trigrams:
    fce_pos[key] = tagged_trigrams[key]
for key in tagged_fourgrams:
    fce_pos[key] = tagged_fourgrams[key]
for key in tagged_fivegrams:
    fce_pos[key] = tagged_fivegrams[key]

fce_pos["tagged_unigram_boundaries"] = tagged_unigrams_boundaries
fce_pos["tagged_bigram_boundaries"] = tagged_bigrams_boundaries
fce_pos["tagged_trigram_boundaries"] = tagged_trigrams_boundaries
fce_pos["tagged_fourgram_boundaries"] = tagged_fourgrams_boundaries
fce_pos["tagged_fivegram_boundaries"] = tagged_fivegrams_boundaries

# rename the first two columns
fce.columns = ["word", "y"] + [col for col in fce.columns[2:]]

# Change our outcome variable to a binary variable
fce.loc[:,"y"] = fce["y"].map(lambda x: 0 if x == "i" else 1)

# replace "no score" with null 
fce.replace("No score", np.nan, inplace=True)


# Divide match count by the left context and create new feature columns
count_1 = ["bigram_left_context", "trigram_left_context", "fourgram_left_context", "fivegram_left_context"]
count_2 = ["bigram_1", "trigram_1", "fourgram_1", "fivegram_1"]
for i in range(len(count_1)):
    fce["proba_" + count_2[i]] = (fce[count_2[i]] / (fce[count_1[i]]))

# Impute null ngram match counts and right context match counts by forward filling 
fce_ff = fce.loc[:, "bigram_1": "fivegram_5"]
fce_ff.loc[:, "bigram_1":"bigram_2"] = fce_ff.loc[
    :, "bigram_1":"bigram_2"].fillna(method="ffill", axis=1)
fce_ff.loc[:, "trigram_1":"trigram_3"] = fce_ff.loc[
    :, "trigram_1":"trigram_3"].fillna(method="ffill", axis=1)
fce_ff.loc[:, "fourgram_1":"fourgram_4"] = fce_ff.loc[
    :, "fourgram_1":"fourgram_4"].fillna(method="ffill", axis=1)
fce_ff.loc[:, "fivegram_1":"fivegram_5"] = fce_ff.loc[
    :, "fivegram_1":"fivegram_5"].fillna(method="ffill", axis=1)
# fce_ff_2 = fce.loc[:, "bigram_right_context":"fivegram_right_context"].fillna(method="ffill", axis=1)

# Divide match count by the right context and create new feature columns
count_1 = ["bigram_right_context", "trigram_right_context", "fourgram_right_context", "fivegram_right_context"]
count_2 = ["bigram_2", "trigram_3", "fourgram_4", "fivegram_5"]
for i in range(len(count_1)):
    fce["proba_" + count_2[i]] = (fce_ff[count_2[i]] / (fce[count_1[i]]))

# delete temporary dataframes
del fce_ff

# reset index
fce.reset_index(inplace=True, drop=True)

# adjust sentence indices
sentence_indices = [[i, j-1] for i, j in sentence_indices]

# calculate outlier thresholds using the Tukey method
unigram_upper, unigram_lower = ed.tukey_outlier_bounds(np.log(fce["unigram_1"]), 3)
bigram_upper, bigram_lower = ed.tukey_outlier_bounds(np.log(fce["bigram_1"]), 3)
trigram_upper, trigram_lower = ed.tukey_outlier_bounds(np.log(fce["trigram_1"]), 3)
fourgram_upper, fourgram_lower = ed.tukey_outlier_bounds(np.log(fce["fourgram_1"]), 3)
fivegram_upper, fivegram_lower = ed.tukey_outlier_bounds(np.log(fce["fivegram_1"]), 3)

# Impute probability for unigrams
fce["proba_word"] = fce["unigram_1"].map(lambda x: 1 if np.log(x)>unigram_lower else 0)

fce.loc[[i[0] for i in sentence_indices], "proba_bigram_1"] = fce.loc[
    [i[0] for i in sentence_indices], "unigram_1"].map(lambda x: 1 if np.log(x)>unigram_lower else 0)

# Correct word probabilities at beginning of sentences by assigning null values 
fce.loc[[i[0] for i in sentence_indices], "proba_trigram_1":"proba_fivegram_1"] = np.nan

fce.loc[[i[0]+1 if i[0]+1 <= i[1] else i[0] for i in sentence_indices], 
        "proba_trigram_1":"proba_fivegram_1"] = np.nan
fce.loc[[i[0]+2 if i[0]+2 <= i[1] else i[0] for i in sentence_indices], 
            "proba_fourgram_1":"proba_fivegram_1"] = np.nan

fce.loc[[i[0]+3 if i[0]+3 <= i[1] else i[0] for i in sentence_indices], "proba_fivegram_1"] = np.nan

# Correct word probabilities at end of sentences by assigning null values 
fce.loc[[i[1] for i in sentence_indices], "proba_bigram_2"] = fce.loc[
    [i[1] for i in sentence_indices], "unigram_1"].map(lambda x: 1 if np.log(x)>0 else 0)

fce.loc[[i[1] for i in sentence_indices], "proba_trigram_3":"proba_fivegram_5"] = np.nan
fce.loc[[i[1]-1 if i[1]-1 >= i[0] else i[1] for i in sentence_indices], 
        "proba_trigram_3":"proba_fivegram_5"] = np.nan
fce.loc[[i[1]-2 if i[1]-2 >= i[0] else i[1] for i in sentence_indices], 
        "proba_fourgram_4":"proba_fivegram_5"] = np.nan
fce.loc[[i[1]-3 if i[1]-3 >= i[0] else i[1] for i in sentence_indices ], "proba_fivegram_5"] = np.nan

# take log of probabilities, setting a sufficiently low negative value (-20) where probability is 0
fce.loc[:,"proba_bigram_1":"proba_fivegram_5"] = fce.loc[
    :,"proba_bigram_1":"proba_fivegram_5"].applymap(lambda x: -20 if (x==0) else np.log(x)) 
fce["proba_word"] = fce["proba_word"].map(lambda x: -20 if (x==0) else np.log(x))

# take log of counts, first setting any zero counts to value 0 (to avoid issues with log(0))
fce.loc[:,"unigram_1":"fivegram_5"] = fce.loc[
    :,"unigram_1":"fivegram_5"].applymap(lambda x: 0 if (x==0) else np.log(x)) 

# Find the log means and place into new columns in the dataframe
fce["unigram_mean"] = fce.loc[:, "unigram_1"]
fce["bigram_mean"] = fce.loc[:, "bigram_1" : "bigram_2"].mean(axis=1)
fce["trigram_mean"] = fce.loc[:, "trigram_1" : "trigram_3"].mean(axis=1)
fce["fourgram_mean"] = fce.loc[:, "fourgram_1" : "fourgram_4"].mean(axis=1)
fce["fivegram_mean"] = fce.loc[:, "fivegram_1" : "fivegram_5"].mean(axis=1)

# Sum number of ngrams where value is above lower threshold
fce["unigram_sum_threshold"] = fce.loc[:, "unigram_1"].map(
    lambda x: np.sum(x > unigram_lower))
fce["bigram_sum_threshold"] = fce.loc[:, "bigram_1" : "bigram_2"].apply(
    lambda x: np.sum(x > bigram_lower), axis=1)
fce["trigram_sum_threshold"] = fce.loc[:, "trigram_1" : "trigram_3"].apply(
    lambda x: np.sum(x > trigram_lower), axis=1)
fce["fourgram_sum_threshold"] = fce.loc[:, "fourgram_1" : "fourgram_4"].apply(
    lambda x: np.sum(x > fourgram_lower), axis=1)
fce["fivegram_sum_threshold"] = fce.loc[:, "fivegram_1" : "fivegram_5"].apply(
    lambda x: np.sum(x > fivegram_lower), axis=1)

# start by manually calculating standard scores (StandardScaler won't work with null values)
fce_ss = ed.manual_zscore(fce.loc[:,"unigram_1":"fivegram_5"])

# forward fill
fce_ss = fce_ss.fillna(method='ffill', axis=1)

# rename columns
fce_ss.columns = [col + "_scaled" for col in fce_ss.columns]

# concatenate dataframes
fce = pd.concat([fce, fce_ss], axis=1)

# delete fce_ss
del fce_ss


# create binary POS feature
fce["proper_noun"] = fce_pos["unigrams_1"].map(lambda x: 1 if "NNP" in x else 0)

# merge trigram boundaries
fce["trigram_boundaries"] = fce_pos["tagged_trigram_boundaries"]
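For reference, Tukey's method places the outlier thresholds at Q1 - k * IQR and Q3 + k * IQR, where IQR is the interquartile range and k = 3 in the cell above. The sketch below is a minimal stand-alone version, not the code in the `ErrorDetection` module: the name `tukey_bounds`, the (upper, lower) return order and the handling of nulls and infinities are all assumptions.

```python
import numpy as np
import pandas as pd

def tukey_bounds(values, k=1.5):
    """Return (upper, lower) Tukey fences: Q3 + k*IQR and Q1 - k*IQR."""
    # Drop nulls and infinities (np.log of a zero count is -inf)
    clean = pd.Series(values).replace([np.inf, -np.inf], np.nan).dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    return q3 + k * iqr, q1 - k * iqr

# Toy values: Q1 = 2, Q3 = 4, IQR = 2
upper, lower = tukey_bounds([1.0, 2.0, 3.0, 4.0, 5.0], k=3)
print(upper, lower)  # 10.0 -4.0
```

A larger k widens the fences, so k = 3 only flags values that are far outside the bulk of the distribution.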

In [604]:
# Change infinity values to -20
fce.replace(np.inf, -20, inplace=True)

# Loop through probability columns and drop where probability is greater than 0 (log 1)
for col in fce.columns[25:33]:
    mask = fce[col] > 0
    fce = fce.loc[~mask, :]

# Loop through match count columns and drop where bigrams and larger contexts are equal to unigram max value
for col in fce.columns[3:17]:
    mask = fce[col] == fce["unigram_1"].max()
    fce = fce.loc[~mask, :]

# apply same cleaning to pos tagged dataframe
fce_pos = fce_pos.loc[fce.index]

# remove ampersands
mask = fce["word"]=="&"
fce = fce[~mask]
mask = fce_pos["0"]=="&"
fce_pos = fce_pos[~mask]

fce.to_csv("fce_test_final.csv")
fce_pos.to_csv("fce_pos_test_final.csv")