# General Assembly Data Science Immersive - Capstone Project #

## Creating an automated English language error detector ##

## Part 1: Data wrangling and API querying - overview of process

This is the first part of my Data Science Immersive capstone project, covering the initial data wrangling process. I use the Cambridge FCE dataset as the foundation for my project. For each word in the dataset, I generate the overlapping n-grams of length 1 to 5 that contain it and use these to query the Phrasefinder.io API (built on the Google Books Ngram dataset), which returns a JSON file.

From the JSON file I extract the "match count" (i.e. the number of appearances the ngram makes in the Google Ngrams dataset) for each ngram that overlaps a given word. I then map these counts back to the original word and its position within the dataframe using Python dictionaries. 

I also extract the match counts of the ngrams to the left and right of each ngram and map these back to the word. These contextual ngram counts will be used at a later stage to calculate ngram probabilities, which will be used as features within a classifier.

All of these counts are entered as lists into Python dictionaries with each key representing a column in a pandas dataframe. This will allow for easy concatenation with the FCE dataframe at a later stage.

Finally, I also parse the sentences using the Spacy NLP library to get the Part of Speech tags, which will also be used in my modelling and feature selection process. 

As above, these are also entered as lists into Python dictionaries to be later merged into the FCE dataframe.

**This entire process is completed twice - once for the training set and once for the test set**

In [387]:
import pandas as pd
import numpy as np
import re
import nltk
import spacy
from tqdm import tqdm
import string
import requests
import time
import json
import pickle
import ErrorDetection as ed
import importlib

### 1. Extract sentences from FCE dataset (training set)###

In [368]:
# Load FCE dataset from the tab-separated (TSV) file
my_file = "./fce-public.train.original.tsv"
fce = pd.read_csv(my_file, header=None, sep='\t', skip_blank_lines=False, quoting=3)

In [361]:
fce.head()

|   | 0 | 1 |
|---|---|---|
| 0 | Dear | c |
| 1 | Sir | c |
| 2 | or | c |
| 3 | Madam | c |
| 4 | , | c |


Note that the dataset has just two columns: the word and a label marking it as correct or incorrect (c or i). Each sentence is usefully separated by a row of NaN values, so let's keep those null values for now.

Using a helper function, I find the row indices for each sentence and then use these indices to extract the sentences themselves. These sentences will be needed later to generate the n-grams.
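
As a rough illustration of the idea (not the actual `ErrorDetection` code, whose internals are not shown here), the NaN separator rows can be used to group words into sentences. The function name `split_sentences` and the toy dataframe are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd


def split_sentences(df, word_col=0):
    """Group consecutive non-null words into sentences; a NaN row ends a sentence."""
    sentences, current = [], []
    for word in df[word_col]:
        if pd.isnull(word):
            if current:
                sentences.append(" ".join(current))
                current = []
        else:
            current.append(str(word))
    if current:  # the last sentence may not be followed by a NaN row
        sentences.append(" ".join(current))
    return sentences


# Toy data in the same two-column layout as the FCE file (hypothetical rows)
toy = pd.DataFrame({
    0: ["Dear", "Sir", np.nan, "I", "saws", "the", "show", np.nan],
    1: ["c", "c", np.nan, "c", "i", "c", "c", np.nan],
})
print(split_sentences(toy))
```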

In [369]:
# Find the sentence lengths using the indices
sentence_length = ed.find_sentence_length(fce, 0)

In [370]:
# create our sentences and save for later use
sentences = ed.create_save_sentences(fce, 0, 'sentences_train')

Here's what our list of sentences looks like:

In [236]:
sentences[0:5]

['Dear Sir or Madam ,',
 'I am writing in order to express my disappointment about your musical show " Over the Rainbow " .',
 "I saws the show 's advertisement hanging up of a wall in London where I was spending my holiday with some friends .",
 'I convinced them to go there with me because I had heard good references about your Company and , above all , about the main star , Danny Brook .',
 "The problems started in the box office , where we asked for the discounts you announced in the advertisement , and the man who was selling the tickets said that they did n't exist ."]

### 2. Remove null values from dataset (training set) and save for later###

With the null values having now served their purpose, let's remove them and create / pickle a new set of sentence indices that will work with the non-null dataset. This will come in useful when we need to combine our Google Ngrams data with our original dataframe later in the process.
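
One way to picture these indices, as a sketch under my own assumptions rather than the implementation in `ErrorDetection`: once the separator rows are gone, each sentence occupies a contiguous block of rows, so its start and end positions follow from the cumulative sentence lengths. The name `sentence_spans` is hypothetical.

```python
from itertools import accumulate


def sentence_spans(sentence_lengths):
    """Return (start, end) row positions, end exclusive, for each sentence
    in a dataframe from which the NaN separator rows have been dropped."""
    ends = list(accumulate(sentence_lengths))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


# Three sentences of 5, 18 and 23 tokens (illustrative lengths)
spans = sentence_spans([5, 18, 23])
print(spans)
```

These are positional offsets, so they would be used with `DataFrame.iloc` rather than with the original index labels.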

In [237]:
# create sentence indices to be used again later with our FCE dataframe without Null values
sentence_indices_2 = ed.find_df_sentence_indices(fce, 0, 'sentence_indices_train')

In [238]:
# remove null values from dataframe
mask = fce[0].isnull()
fce = fce[~mask]

In [239]:
# save and delete our dataframe from current notebook
fce.to_csv("fce_train.csv")
del fce

### 3. Create n-grams from our sentences and run the queries through Phrasefinder###

The next stage of the process is to:
- create a set of overlapping 1-5 length ngrams for each word;
- transform these ngrams into a percent encoded format that can be used as queries in Phrasefinder API;
- create a *score* dictionary that maps the ngrams (keys) to their queries (values) for future reference;
- create a *master reference* dictionary that maps words in the original dataframe to their corresponding ngrams. This master dictionary will enable me to preserve the word order in the dataframe and easily create feature columns at a later stage from the returned Phrasefinder scores.
- query Phrasefinder and save (pickle) the resulting JSON file for later use.
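
The first two steps above can be sketched as follows. This is a minimal illustration, assuming whitespace-tokenised sentences; `overlapping_ngrams` is a hypothetical helper, not a function from `ErrorDetection`, and the percent encoding uses the standard library's `urllib.parse.quote`.

```python
from urllib.parse import quote


def overlapping_ngrams(tokens, position, n):
    """All n-grams of length n in `tokens` that contain the token at `position`."""
    first_start = max(0, position - n + 1)
    last_start = min(position, len(tokens) - n)
    return [" ".join(tokens[start:start + n])
            for start in range(first_start, last_start + 1)]


tokens = "I saws the show 's advertisement".split()

# Trigrams that overlap the word "saws" (position 1)
trigrams = overlapping_ngrams(tokens, 1, 3)

# Map each ngram to its percent-encoded query string
queries = {ngram: quote(ngram) for ngram in trigrams}
print(queries)
```

Words near the start or end of a sentence get fewer overlapping n-grams, and a sentence shorter than `n` yields none.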


In [240]:
# create our trigrams and corresponding queries
trigrams_dict, trigrams_reference = ed.create_ngram_dicts(sentences, 3, 'trigram')
trigram_queries = ed.create_query_list(trigrams_dict)

100%|██████████| 28731/28731 [00:03<00:00, 8036.44it/s]
100%|██████████| 452833/452833 [00:02<00:00, 213743.44it/s]
100%|██████████| 452833/452833 [00:35<00:00, 12872.04it/s]


In [None]:
# query trigrams
ed.run_api_queries('trigram', 'train', trigram_queries)

In [244]:
# create fourgrams and corresponding queries
fourgrams_dict, fourgrams_reference = ed.create_ngram_dicts(sentences, 4, 'fourgram')
fourgram_queries = ed.create_query_list(fourgrams_dict)


100%|██████████| 28731/28731 [00:04<00:00, 7031.79it/s]
100%|██████████| 452833/452833 [00:02<00:00, 163286.62it/s]
100%|██████████| 452833/452833 [00:42<00:00, 10600.29it/s]


In [None]:
# query fourgrams
ed.run_api_queries('fourgram', 'train', fourgram_queries)

In [248]:
# create fivegrams and corresponding queries
fivegrams_dict, fivegrams_reference = ed.create_ngram_dicts(sentences, 5, 'fivegram')
fivegram_queries = ed.create_query_list(fivegrams_dict)



100%|██████████| 28731/28731 [00:04<00:00, 6408.00it/s]
100%|██████████| 452833/452833 [00:05<00:00, 82811.52it/s] 
100%|██████████| 452833/452833 [00:50<00:00, 8907.48it/s]


In [None]:
# query fivegrams
ed.run_api_queries('fivegram', 'train', fivegram_queries, start_value=0, batch_size=1000)

In [255]:
# create bigrams and corresponding queries
bigrams_dict, bigrams_reference = ed.create_ngram_dicts(sentences, 2, 'bigram')
bigram_queries = ed.create_query_list(bigrams_dict)


100%|██████████| 28731/28731 [00:06<00:00, 4406.19it/s]
100%|██████████| 452833/452833 [00:04<00:00, 105643.06it/s]
100%|██████████| 452833/452833 [00:35<00:00, 12632.45it/s]


In [None]:
# query bigrams
ed.run_api_queries('bigram', 'train', bigram_queries, start_value=0, batch_size=1000)

In [257]:
# create unigrams and corresponding queries
unigrams_dict, unigrams_reference = ed.create_ngram_dicts(sentences, 1, 'unigram')
unigram_queries = ed.create_query_list(unigrams_dict)

100%|██████████| 28731/28731 [00:06<00:00, 4720.33it/s]
100%|██████████| 452833/452833 [00:02<00:00, 163671.79it/s]
100%|██████████| 452833/452833 [00:13<00:00, 34273.19it/s]


In [None]:
# query unigrams
ed.run_api_queries('unigram', 'train', unigram_queries, start_value=0, batch_size=1000)

### 4. Extract the match scores from the Phrasefinder JSON files###

The JSON files returned by Phrasefinder (and now stored in pickled files) contain the match scores along with other information that I won't need for the purposes of this project. 

The code below will extract the relevant scores from each of our pickled files and, in a dictionary, map them to their query.
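
To make the extraction step concrete, here is a small sketch. It assumes each response is a JSON object with a `"phrases"` list whose entries carry an `"mc"` (match count) field; that layout, the choice to sum over phrases, and the numbers in the sample are assumptions for illustration, not taken from the project's data.

```python
import json


def extract_match_count(response_text, field="mc"):
    """Sum `field` over all phrases in one response; 0 if nothing matched."""
    payload = json.loads(response_text)
    return sum(phrase.get(field, 0) for phrase in payload.get("phrases", []))


# Made-up response bodies with the assumed shape
sample = '{"phrases": [{"mc": 120, "vc": 45}, {"mc": 30, "vc": 12}]}'
empty = '{"phrases": []}'

# Map each query string to its extracted score
score_by_query = {
    "I%20saws%20the": extract_match_count(empty),
    "saws%20the%20show": extract_match_count(sample),
}
print(score_by_query)
```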

In [261]:
uni_gram_score = ed.load_extract_query_data(
    unigram_queries, 'unigram_train', 'unigram_train_final', 1000, "mc")

In [262]:
bi_gram_score = ed.load_extract_query_data(
    bigram_queries, 'bigram_train', 'bigram_train_final', 1000, "mc")

In [263]:
tri_gram_score = ed.load_extract_query_data(
    trigram_queries, 'trigram_train', 'trigram_train_final', 1000, "mc")

In [264]:
four_gram_score = ed.load_extract_query_data(
    fourgram_queries, 'fourgram_train', 'fourgram_train_final', 1000, "mc")

In [266]:
five_gram_score = ed.load_extract_query_data(
    fivegram_queries, 'fivegram_train', 'fivegram_train_final', 1000, "mc")

### 4. Move scores into our ngram dictionary. Separate out / re-query any errors###

I move the scores into the ngram dictionaries I created in step 3, mapping them back to their original ngram.

As part of this process, I also capture any errors in the returned JSON files and re-run the queries.
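
A simplified picture of this step, assuming that a failed query shows up as a missing or non-integer score (the real error format is not shown in this notebook). `merge_scores` is a hypothetical stand-in for the helper in `ErrorDetection`.

```python
def merge_scores(ngram_to_query, query_to_score):
    """Map each ngram to its score and collect ngrams whose query has no usable score."""
    scored, errors = {}, []
    for ngram, query in ngram_to_query.items():
        score = query_to_score.get(query)
        if isinstance(score, int):
            scored[ngram] = score
        else:
            errors.append(ngram)  # candidates for re-querying
    return scored, errors


ngram_to_query = {
    "I saws the": "I%20saws%20the",
    "saws the show": "saws%20the%20show",
}
# Illustrative scores: the second query is treated as having failed
query_to_score = {"I%20saws%20the": 0, "saws%20the%20show": None}

scored, errors = merge_scores(ngram_to_query, query_to_score)
print(scored, errors)
```

Note that a score of 0 is a valid result (the ngram simply never appears in the corpus) and is kept separate from errors.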

In [267]:
# move scores into dictionaries and separate out any errors
trigrams_dict, trigram_errors = ed.create_score_dict(trigrams_dict, tri_gram_score)
fourgrams_dict, fourgram_errors = ed.create_score_dict(fourgrams_dict, four_gram_score)
bigrams_dict, bigram_errors = ed.create_score_dict(bigrams_dict, bi_gram_score)
fivegrams_dict, fivegram_errors = ed.create_score_dict(fivegrams_dict, five_gram_score)
unigrams_dict, unigram_errors = ed.create_score_dict(unigrams_dict, uni_gram_score)

100%|██████████| 226683/226683 [00:00<00:00, 605041.55it/s]
100%|██████████| 290204/290204 [00:00<00:00, 653140.53it/s]
100%|██████████| 105564/105564 [00:00<00:00, 641121.53it/s]
100%|██████████| 304843/304843 [00:00<00:00, 647105.23it/s]
100%|██████████| 14532/14532 [00:00<00:00, 581028.43it/s]


### 5. Create a dictionary of scores to be used as columns in the FCE dataframe ###

Recall that the ultimate aim of this initial data wrangling process is to get the ngram scores into a format that can be merged with the original FCE dataframe. The final step for achieving this is to create a dictionary of lists where each key represents a proposed column name and each value represents a list of scores.

Each key/column will be an ngram that overlaps the word, e.g. "five_gram_1" (the first fivegram that overlaps the word), "five_gram_2" (the second fivegram)... "four_gram_1"... "uni_gram_1" (the number of times the word itself appears in the corpus).
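
The dictionary-of-lists layout is convenient because pandas can turn it straight into columns. A minimal sketch with made-up scores and a toy stand-in for the FCE dataframe:

```python
import pandas as pd

# Toy stand-in for the FCE dataframe (one row per word)
words = pd.DataFrame({"word": ["I", "saws", "the"], "label": ["c", "i", "c"]})

# Each key is a proposed column name, each value a list with one score per word
# (the numbers are invented for illustration)
feature_columns = {
    "tri_gram_1": [0, 0, 12],
    "uni_gram_1": [500, 3, 900],
}

# Build the feature frame on the same index, then concatenate column-wise
features = pd.DataFrame(feature_columns, index=words.index)
combined = pd.concat([words, features], axis=1)
print(combined)
```

This only works if every list has exactly one entry per word, in the same order as the dataframe rows, which is why the master reference dictionary preserves word order.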

In [276]:
# Assign word scores to the master ngram reference dictionary for future merging with dataframe
bigram_wordscore = ed.assign_word_scores(bigrams_reference, bigrams_dict, 'bigram', 'train')
trigram_wordscore = ed.assign_word_scores(trigrams_reference, trigrams_dict, 'trigram','train')
fourgram_wordscore = ed.assign_word_scores(fourgrams_reference, fourgrams_dict, 'fourgram','train')
fivegram_wordscore = ed.assign_word_scores(fivegrams_reference, fivegrams_dict, 'fivegram','train')
unigram_wordscore = ed.assign_word_scores(unigrams_reference, unigrams_dict, 'unigram','train')

### 6. Find ngram context scores and create further dictionaries as in step 5 ###

As mentioned previously, one of the key features of my model will be the language model probabilities of each ngram containing the word, *given its left and right context*.

To give an example, let's take the sentence "The quick, brown fox jumped over the lazy dog". For the word "jumped", I'd be looking to calculate firstly the *left context* ngram probabilities: 
- the bigram "fox jumped" given the unigram "fox"; 
- the trigram "brown fox jumped" given the bigram "brown fox"
- the fourgram ", brown fox jumped" given the trigram ", brown fox"
- the fivegram "quick , brown fox jumped" given the fourgram "quick , brown fox"

and secondly, the *right context* ngram probabilities:
- the bigram "jumped over" given the unigram "over"; 
- the trigram "jumped over the" given the bigram "over the"
- the fourgram "jumped over the lazy" given the trigram "over the lazy"
- the fivegram "jumped over the lazy dog" given the fourgram "over the lazy dog".

So along with the ngram scores I mapped across in step 5, I will also map across the left and right context scores.
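
The probability calculation itself happens at a later stage, but the idea can be sketched now: an ngram's probability given its context is estimated as the ngram's count divided by the context's count. The function below is a hypothetical illustration with invented counts, and returning 0.0 for an unseen context is my own simplifying assumption.

```python
def conditional_probability(ngram_count, context_count):
    """Estimate P(ngram | context) as count(ngram) / count(context).

    Returns 0.0 when the context never appears, to avoid dividing by zero.
    """
    if context_count == 0:
        return 0.0
    return ngram_count / context_count


# e.g. the bigram "fox jumped" given the unigram "fox" (made-up counts)
left_context_prob = conditional_probability(ngram_count=50, context_count=200)

# e.g. the bigram "jumped over" given the unigram "over" (made-up counts)
right_context_prob = conditional_probability(ngram_count=30, context_count=120)

print(left_context_prob, right_context_prob)
```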

In [287]:
# Use custom functions to assign context scores and map across to master dictionary
bigram_context_wordscore = ed.assign_context_word_scores(
    bigrams_reference, unigrams_dict, 'bigram', 'train')
trigram_context_wordscore = ed.assign_context_word_scores(
    trigrams_reference, bigrams_dict, 'trigram', 'train')
fourgram_context_wordscore = ed.assign_context_word_scores(
    fourgrams_reference, trigrams_dict, 'fourgram', 'train')
fivegram_context_wordscore = ed.assign_context_word_scores(
    fivegrams_reference, fourgrams_dict, 'fivegram', 'train')

### 7. Parse sentences using Spacy and extract Part of Speech tags ###
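
spaCy exposes a coarse part-of-speech tag on each parsed token through its `pos_` attribute. As a rough sketch of how tag n-grams might then be built, the example below slides a window over (word, tag) pairs. The tags are assigned by hand so that the snippet runs without a spaCy model, and `tag_ngrams` is a hypothetical helper rather than the function used in `ErrorDetection`.

```python
def tag_ngrams(tagged_tokens, n):
    """Slide a window of size n over (word, tag) pairs and join the tags."""
    tags = [tag for _, tag in tagged_tokens]
    return ["_".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]


# Hand-tagged tokens (not actual spaCy output)
tagged = [("I", "PRON"), ("saws", "VERB"), ("the", "DET"), ("show", "NOUN")]

pos_bigrams = tag_ngrams(tagged, 2)
pos_trigrams = tag_ngrams(tagged, 3)
print(pos_bigrams, pos_trigrams)
```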


In [371]:
# instantiate a Spacy object
nlp = spacy.load('en')

In [372]:
# use multithreading to efficiently parse all sentences
parsed_sentences = ed.parse(sentences, nlp)

28731it [01:38, 291.19it/s]


In [378]:
# create part of speech tagged ngrams dictionaries and save / pickle them
tagged_unigrams, tagged_unigram_boundaries = ed.create_pos_ngram_dicts(
    sentences, parsed_sentences, 1, 'unigrams', 'train')
tagged_bigrams, tagged_bigram_boundaries = ed.create_pos_ngram_dicts(
    sentences, parsed_sentences, 2, 'bigrams', 'train')
tagged_trigrams, tagged_trigram_boundaries = ed.create_pos_ngram_dicts(
    sentences, parsed_sentences, 3, 'trigrams', 'train')
tagged_fourgrams, tagged_fourgram_boundaries = ed.create_pos_ngram_dicts(
    sentences, parsed_sentences, 4, 'fourgrams', 'train')
tagged_fivegrams, tagged_fivegram_boundaries = ed.create_pos_ngram_dicts(
    sentences, parsed_sentences, 5, 'fivegrams', 'train')

100%|██████████| 28731/28731 [00:04<00:00, 6127.51it/s]
100%|██████████| 452833/452833 [00:01<00:00, 288083.55it/s]
100%|██████████| 28731/28731 [00:02<00:00, 13661.96it/s]
100%|██████████| 452833/452833 [00:02<00:00, 225886.32it/s]
100%|██████████| 28731/28731 [00:02<00:00, 12740.43it/s]
100%|██████████| 452833/452833 [00:02<00:00, 184327.43it/s]
100%|██████████| 28731/28731 [00:02<00:00, 12261.36it/s]
100%|██████████| 452833/452833 [00:02<00:00, 167215.96it/s]
100%|██████████| 28731/28731 [00:01<00:00, 22827.26it/s]
100%|██████████| 452833/452833 [00:01<00:00, 287099.96it/s]


### 8. Repeat process for test set ###

In [381]:
# Load FCE dataset from the tab-separated (TSV) file
my_file = "./fce-public.test.original.tsv"
fce = pd.read_csv(my_file, header=None, sep='\t', skip_blank_lines=False, quoting=3)

In [382]:
# Find the sentence lengths using the indices
sentence_length = ed.find_sentence_length(fce, 0)

In [383]:
# create our sentences and save for later use
sentences = ed.create_save_sentences(fce, 0, 'sentences_test')

In [326]:
# create sentence indices to be used again later with our FCE dataframe without Null values
sentence_indices_2 = ed.find_df_sentence_indices(fce, 0, 'sentence_indices_test')

In [327]:
# remove null values from dataframe
mask = fce[0].isnull()
fce = fce[~mask]

In [328]:
# save and delete our dataframe from current notebook
fce.to_csv("fce_test.csv")
del fce

In [329]:
# create ngrams and corresponding queries
trigrams_dict, trigrams_reference = ed.create_ngram_dicts(sentences, 3, 'trigram')
trigram_queries = ed.create_query_list(trigrams_dict)
fourgrams_dict, fourgrams_reference = ed.create_ngram_dicts(sentences, 4, 'fourgram')
fourgram_queries = ed.create_query_list(fourgrams_dict)
fivegrams_dict, fivegrams_reference = ed.create_ngram_dicts(sentences, 5, 'fivegram')
fivegram_queries = ed.create_query_list(fivegrams_dict)
bigrams_dict, bigrams_reference = ed.create_ngram_dicts(sentences, 2, 'bigram')
bigram_queries = ed.create_query_list(bigrams_dict)
unigrams_dict, unigrams_reference = ed.create_ngram_dicts(sentences, 1, 'unigram')
unigram_queries = ed.create_query_list(unigrams_dict)

100%|██████████| 2720/2720 [00:00<00:00, 9320.81it/s]
100%|██████████| 41477/41477 [00:00<00:00, 210760.18it/s]
100%|██████████| 41477/41477 [00:03<00:00, 13250.49it/s]
100%|██████████| 2720/2720 [00:00<00:00, 9866.61it/s]
100%|██████████| 41477/41477 [00:00<00:00, 260337.89it/s]
100%|██████████| 41477/41477 [00:03<00:00, 10809.60it/s]
100%|██████████| 2720/2720 [00:00<00:00, 9486.71it/s]
100%|██████████| 41477/41477 [00:00<00:00, 239754.61it/s]
100%|██████████| 41477/41477 [00:04<00:00, 9177.16it/s]
100%|██████████| 2720/2720 [00:00<00:00, 9106.71it/s]
100%|██████████| 41477/41477 [00:00<00:00, 369413.47it/s]
100%|██████████| 41477/41477 [00:02<00:00, 17891.94it/s]
100%|██████████| 2720/2720 [00:01<00:00, 2188.62it/s]
100%|██████████| 41477/41477 [00:00<00:00, 535891.97it/s]
100%|██████████| 41477/41477 [00:01<00:00, 32870.04it/s]


In [None]:
# query Phrasefinder
ed.run_api_queries('trigram', 'test', trigram_queries)

In [None]:
# query Phrasefinder
ed.run_api_queries('fourgram', 'test', fourgram_queries)

In [None]:
# query Phrasefinder
ed.run_api_queries('fivegram', 'test', fivegram_queries)

In [None]:
# query Phrasefinder
ed.run_api_queries('bigram', 'test', bigram_queries)

In [None]:
# query Phrasefinder
ed.run_api_queries('unigram', 'test', unigram_queries)

In [339]:
# extract match scores from JSON files and map across to a dictionary
uni_gram_score = ed.load_extract_query_data(
    unigram_queries, 'unigram_test', 'unigram_test_final', 1000, "mc")

bi_gram_score = ed.load_extract_query_data(
    bigram_queries, 'bigram_test', 'bigram_test_final', 1000, "mc")

tri_gram_score = ed.load_extract_query_data(
    trigram_queries, 'trigram_test', 'trigram_test_final', 1000, "mc")

four_gram_score = ed.load_extract_query_data(
    fourgram_queries, 'fourgram_test', 'fourgram_test_final', 1000, "mc")

five_gram_score = ed.load_extract_query_data(
    fivegram_queries, 'fivegram_test', 'fivegram_test_final', 1000, "mc")

In [341]:
# move scores into dictionaries and identify / separate out any errors
trigrams_dict, trigram_errors = ed.create_score_dict(trigrams_dict, tri_gram_score)
fourgrams_dict, fourgram_errors = ed.create_score_dict(fourgrams_dict, four_gram_score)
bigrams_dict, bigram_errors = ed.create_score_dict(bigrams_dict, bi_gram_score)
fivegrams_dict, fivegram_errors = ed.create_score_dict(fivegrams_dict, five_gram_score)
unigrams_dict, unigram_errors = ed.create_score_dict(unigrams_dict, uni_gram_score)

100%|██████████| 27558/27558 [00:00<00:00, 397239.04it/s]
100%|██████████| 30170/30170 [00:00<00:00, 506795.01it/s]
100%|██████████| 17656/17656 [00:00<00:00, 527345.72it/s]
100%|██████████| 29683/29683 [00:00<00:00, 549717.75it/s]
100%|██████████| 3871/3871 [00:00<00:00, 433229.74it/s]
100%|██████████| 30170/30170 [00:00<00:00, 887635.13it/s]
100%|██████████| 29683/29683 [00:00<00:00, 1250962.35it/s]
100%|██████████| 17656/17656 [00:00<00:00, 791503.30it/s]
100%|██████████| 3871/3871 [00:00<00:00, 950594.31it/s]
100%|██████████| 27558/27558 [00:00<00:00, 974534.64it/s]


In [343]:
# Assign word scores to the master ngram reference dictionary for future merging with dataframe
bigram_wordscore = ed.assign_word_scores(bigrams_reference, bigrams_dict, 'bigram', 'test')
trigram_wordscore = ed.assign_word_scores(trigrams_reference, trigrams_dict, 'trigram','test')
fourgram_wordscore = ed.assign_word_scores(fourgrams_reference, fourgrams_dict, 'fourgram','test')
fivegram_wordscore = ed.assign_word_scores(fivegrams_reference, fivegrams_dict, 'fivegram','test')
unigram_wordscore = ed.assign_word_scores(unigrams_reference, unigrams_dict, 'unigram','test')

In [354]:
# Use custom functions to assign context scores for future probability calculations
# and map across to master dictionary
bigram_context_wordscore = ed.assign_context_word_scores(
    bigrams_reference, unigrams_dict, 'bigram', 'test')
trigram_context_wordscore = ed.assign_context_word_scores(
    trigrams_reference, bigrams_dict, 'trigram', 'test')
fourgram_context_wordscore = ed.assign_context_word_scores(
    fourgrams_reference, trigrams_dict, 'fourgram', 'test')
fivegram_context_wordscore = ed.assign_context_word_scores(
    fivegrams_reference, fourgrams_dict, 'fivegram', 'test')

In [384]:
# instantiate a Spacy object
nlp = spacy.load('en')

In [385]:
# use multithreading to efficiently parse all sentences
parsed_sentences = ed.parse(sentences, nlp)

2720it [00:08, 314.59it/s]


In [386]:
# create part of speech tagged ngrams dictionaries and save / pickle them
tagged_unigrams, tagged_unigram_boundaries = ed.create_pos_ngram_dicts(
    sentences, parsed_sentences, 1, 'unigrams', 'test')
tagged_bigrams, tagged_bigram_boundaries = ed.create_pos_ngram_dicts(
    sentences, parsed_sentences, 2, 'bigrams', 'test')
tagged_trigrams, tagged_trigram_boundaries = ed.create_pos_ngram_dicts(
    sentences, parsed_sentences, 3, 'trigrams', 'test')
tagged_fourgrams, tagged_fourgram_boundaries = ed.create_pos_ngram_dicts(
    sentences, parsed_sentences, 4, 'fourgrams', 'test')
tagged_fivegrams, tagged_fivegram_boundaries = ed.create_pos_ngram_dicts(
    sentences, parsed_sentences, 5, 'fivegrams', 'test')

100%|██████████| 2720/2720 [00:00<00:00, 18257.47it/s]
100%|██████████| 41477/41477 [00:00<00:00, 539118.74it/s]
100%|██████████| 2720/2720 [00:00<00:00, 23420.85it/s]
100%|██████████| 41477/41477 [00:00<00:00, 482999.23it/s]
100%|██████████| 2720/2720 [00:00<00:00, 22087.25it/s]
100%|██████████| 41477/41477 [00:00<00:00, 321806.20it/s]
100%|██████████| 2720/2720 [00:00<00:00, 20494.79it/s]
100%|██████████| 41477/41477 [00:00<00:00, 264469.35it/s]
100%|██████████| 2720/2720 [00:00<00:00, 25883.83it/s]
100%|██████████| 41477/41477 [00:00<00:00, 217261.77it/s]
