### NLP Classifier
1. [Objective](#obj)
1. [Modules](#import)
2. [Raw Text](#raw)
3. [Tokenizer](#toke)
4. [Feature](#feat)
4. [Classifier](#classify)
5. [Scratch](#scratch)

### Objective <a name="obj"></a>

Challenge: Build your own NLP model
For this challenge, you will need to choose a corpus of data from nltk or another source that includes categories you can predict and create an analysis pipeline that includes the following steps:

- Data cleaning / processing / language parsing
- Create features using two different NLP methods: For example, BoW vs tf-idf.
- Use the features to fit supervised learning models for each feature set to predict the category outcomes.
- Assess your models using cross-validation and determine whether one model performed better.
- Pick one of the models and try to increase accuracy by at least 5 percentage points.
- Write up your report in a Jupyter notebook. Be sure to explicitly justify the choices you make throughout, and submit it below.

### Module <a name="import"></a>

In [1]:
import spacy
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

In [2]:
from nltk.corpus import gutenberg

In [3]:
import pandas as pd
import numpy as np
import string
import re
from collections import Counter, defaultdict

In [47]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer, TfidfTransformer
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier

### Raw Text <a name="raw"></a>

In [5]:
# use all of the gutenberg texts
texts = ['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt',
         'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 
         'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt',
         'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt',
         'whitman-leaves.txt']
authors = [re.split(r'\W',text)[0] for text in texts]
titles  = [re.split(r'\W',text)[1] for text in texts]
raws =    [gutenberg.raw(text)     for text in texts]

#### Data cleaning / processing / language parsing

- Clean the raw text to minimize further processing after the tokenization.  Also reduces the text size allowing for more data while limited by a maximum char size of 100,000.

- use SpaCy to parse the text to get list of lemmas for each sentence.  I could have just used sklearn text feature extraction, but it seems more "black box" to me.  Spacy allows greater opportunity to check the work product and re-process as required.  




In [6]:
# specific cleaning re
rx0 = re.compile(r'\d+:\d+')                           # remove chapter:verse annotation for the bible
rx1 = re.compile(r'\[.*\]')                            # title at begining of all texts
rx2 = re.compile(r'[A-Z]{3}')                          # 3 or more capitals, all
rx3 = re.compile(r'Actus\s+\w+\.')                     # shakespeare
rx4 = re.compile(r'Scoena\s+\w+\.')                    # shakespeare
rx5 = re.compile(r'VOLUME\s[IVX]+\s+CHAPTER\s+[IVX]+') # emma,
rx6 = re.compile(r'Chapter \d+', flags=re.I)           #persuasion, sense
rx7 = re.compile(r'CHAPTER\s+[IVX]+')                  # alice
rx8 = re.compile(r'CHAPTER\s+[IVX]+')                  # leaves
rx9 = re.compile(r'\s{2}')                             # two or more white space

In [7]:
# text specific cleaning
raws = [rx1.sub("", raw) for raw in raws]
raws = [rx2.sub("", raw) for raw in raws]
raws[3] = rx0.sub("", raws[3])
raws[0] = rx5.sub("", raws[0])
raws[7] = rx7.sub("", raws[7])
raws[17] = rx8.sub("", raws[17])
for i in [1,2]:
    raws[i] = rx6.sub("", raws[i])
for i in [14,15,16]:
    raws[i] = rx3.sub("", raws[i])
raws = [rx9.sub("", raw) for raw in raws]    

In [8]:
# guttenberg specific text cleaning
def txt_clean(text):
    text = re.sub(r'--',' ',text)
    text = re.sub("[\[].*?[\]]", "", text)
    text = ' '.join(text.split())
    return text
    
cleans = [txt_clean(raw) for raw in raws]
# reduce size of each to under 100,000 to avoid allocation errors... 
# or actually just warnings about possible allocation errors
for (i,clean) in enumerate(cleans):
    cleans[i] = clean[len(clean) -100000:]

### Tokenizer <a name="toke"></a>

In [32]:
# run nlp model by sentence
nlp = spacy.load("en_core_web_sm")
dct = defaultdict(dict).fromkeys(titles)
for (i,clean) in enumerate(cleans):
    dct[titles[i]] = {'author':authors[i],'sents':None, 'title':titles[i]}
    dct[titles[i]].update(sents=[S for S in nlp(clean).sents])
                    

In [33]:
# further process .sents eliminate ntlk stops, lemmatize, lower()
# can then delete samples where result is []
def sentence_process(S):
    stops = STOP_WORDS
    A = [s for s in S if s.is_alpha and not s.is_stop]
    B = [a for a in A if a.lemma_ not in stops]
    return [b.lemma_.lower() if b.pos_ != 'PROPN' else b.lemma_ for b in B]

def text_df(sents, title, author):
    S1 = pd.Series(sents)
    S2 = np.repeat(title, len(S1)); S3 = np.repeat(author, len(S1))
    df = pd.DataFrame([S1, S2, S3], index=['sentence','title', 'author']).T

    df['lemma'] = df.sentence.apply(lambda x: sentence_process(x))
    df['check'] = df.lemma.apply(lambda x: x.__len__())
    df = df.drop(np.where(df.check == 0)[0], axis=0)
    df = df.drop('check', axis=1)
    df.index = range(len(df))
    return df

In [34]:
dfs = [text_df(**dct[title]) for title in titles]
df1 = pd.concat(dfs,axis=0)
df1.index = (range(len(df1)))

In [35]:
df1.tail()

Unnamed: 0,sentence,title,author,lemma
14764,"(Yet, let, me, not, be, too, hasty, ,, Long, i...",leaves,whitman,"[let, hasty, long, live, sleep, blend, die, di..."
14765,"(If, we, go, anywhere, we, 'll, go, together, ...",leaves,whitman,"[meet, happen, blither, learn]"
14766,"(May, -, be, it, is, yourself, now, really, us...",leaves,whitman,"[usher, true, song, know]"
14767,"(May, -, be, it, is, you, the, mortal, knob, r...",leaves,whitman,"[mortal, knob, undo, turn, finally, Good, bye,..."
14768,"(my, Fancy, .)",leaves,whitman,[Fancy]


### Features <a name="feat"></a>

##### Create features using two different NLP methods: For example, BoW vs tf-idf.

- Used bag of words and tf-idf (an engineer most have come up with that lame name) methods to create features.
- sklearn CountVectorizer fit and transform for the bag of words sparse matrix and then TfidfTransformer to transform bag of words to the tf-idf sparse matrix.  

In [16]:
# sklearn does not take list of words, each sample back to text string of lemmas
lemmas = [' '.join(x) for x in df1.lemma.tolist()]
vectorizer = CountVectorizer(lowercase=False,min_df=0.002, max_df=0.50)
X = vectorizer.fit_transform(lemmas)

In [50]:
tfr = TfidfTransformer()
Xt = tfr.fit_transform(X)
Xt.shape

(14769, 677)

In [42]:
# data for 2 targets x 2 feature sets
y1 = df1.title; y2 = df1.author
X1_trn, X1_tst, y1_trn, y1_tst = train_test_split(X, y1,  test_size=0.2, random_state=9)
X2_trn, X2_tst, y2_trn, y2_tst = train_test_split(X, y2,  test_size=0.2, random_state=9)
X3_trn, X3_tst, y3_trn, y3_tst = train_test_split(Xt, y1, test_size=0.2, random_state=9)
X4_trn, X4_tst, y4_trn, y4_tst = train_test_split(Xt, y2, test_size=0.2, random_state=9)

In [None]:
# save train / test data as tuples to more easily document initial 8 classifcations
A = train_test_split(X,  y1,  test_size=0.2, random_state=9)
B = train_test_split(X,  y2,  test_size=0.2, random_state=9)
C = train_test_split(Xt, y1,  test_size=0.2, random_state=9)
D = train_test_split(Xt, y2,  test_size=0.2, random_state=9)

### Classifier <a name="classify"></a>

In [69]:
# check two different classifiers 
clf1 = LogisticRegression(solver='lbfgs', multi_class='auto')
clf2 = RandomForestClassifier(n_estimators=20, random_state=1)

In [73]:
scores = []
dct = {'A':('title', 'BoW'), 'C':('title', 'tfidf'), 'B':('author', 'BoW'), 'D':('author', 'tfidf'), 
       1:'LogisticRegression', 2:'RandomForestClassifier'}
for c in 'ABCD':
    for i in range(1,3):
        tup = eval(c); clf = eval('clf' + str(i))
        clf.fit(tup[0], tup[2])
        scores.append(pd.Series({'classifier': dct[i], 'target': dct[c][0], 'method': dct[c][1], 
                                 'train':clf.score(tup[0], tup[2]), 'test':clf.score(tup[1], tup[3])}))
    



#### initial look at accuracy scores
Reviewing eight sets of results for the two targets (author, title), two NLP methods (bag of words and that poorly named tf-idf) and two classifiers.  As the Random Forest Classifier seems to be overfitting, continue with logit classifier and the author target.  

In [74]:
pd.DataFrame(scores)

Unnamed: 0,classifier,target,method,train,test
0,LogisticRegression,title,BoW,0.629285,0.495261
1,RandomForestClassifier,title,BoW,0.834025,0.433649
2,LogisticRegression,author,BoW,0.719255,0.633378
3,RandomForestClassifier,author,BoW,0.880152,0.561273
4,LogisticRegression,title,tfidf,0.599577,0.49323
5,RandomForestClassifier,title,tfidf,0.832924,0.433649
6,LogisticRegression,author,tfidf,0.687177,0.626608
7,RandomForestClassifier,author,tfidf,0.877782,0.573798


In [None]:
clf_bow = LogisticRegressionCV(cv=5, random_state=0, solver='lbfgs', multi_class='auto', max_iter=200).fit(X,  y2)
#clf_idf = LogisticRegressionCV(cv=5, random_state=0, solver='lbfgs', multi_class='auto', max_iter=200).fit(Xt, y2)

####  Assess your models using cross-validation and determine whether one model performed better.
 - Continue with the bag of words based opun the cross validation score and try to improve the accuracy from 63% to 68%
 - Logit classifier does not converge at 300 iterations or with 'saga' solver and 'elastic net' penalty.  Score does not change with increased iterations.  

In [105]:
print('Bag of words cross validation score: %.4f' % clf_bow.score(X, y2))
print('tf-idf cross validation score: %.4f' % clf_idf.score(X, y2))

Bag of words cross validation score: 0.7239
tf-idf cross validation score: 0.6970


####  Pick one of the models and try to increase accuracy by at least 5 percentage points.
1. Check if PCA helps with convergence. Or I meant TruncatedSVD since it is a sparse matrix.
  - At n_components=100 accuracy reduced from 63% to 55%. 
  - At n_components=300 the accuracy is 60%.  Plus the logit did converge with 127 iterations.
  - Decomposition likely need for the next step  

2. Increase the Bag of Words features.
 - increase bag of words features from 677 to 1400,  features then reduced by decomposition to 700.  Accuracy 66%
 - bag of words 1400 features then reduced by decomposition to 1000.  Accuracy 67%
 - bag of words 2485 features then reduced by decomposition to 1000.  Accuracy 68%
 
 A better approach would have been to add to the volume of text that was processed.  I stayed within the SpaCy default limit of 100,000 characters per doc for each of the tiles, but could have have processed the entire title with multiple iterations.  Or changed the setting.  

In [125]:
# classifier runs prior to increasing BoW feature number
svd = TruncatedSVD(n_components=300, random_state=3)
Xs = svd.fit_transform(X)
svd.explained_variance_ratio_.sum()

0.7655132936626161

In [127]:
clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)
clf.fit(X_trn, y_trn)
clf.score(X_tst, y_tst)

0.5995260663507109

In [138]:
# X1 features used to increase accuracy
vectorizer = CountVectorizer(lowercase=False,min_df=0.0005, max_df=0.50)
X1 = vectorizer.fit_transform(lemmas)
svd = TruncatedSVD(n_components=1000, random_state=3)
X1s = svd.fit_transform(X1)
print('explained variance ration: %.4f' % svd.explained_variance_ratio_.sum())
X_trn, X_tst, y_trn, y_tst = train_test_split(X1s, y2,  test_size=0.2, random_state=9)
clf = LogisticRegression(solver='lbfgs', multi_class='auto', max_iter=200)
clf.fit(X_trn, y_trn)
clf.score(X_tst, y_tst)

explained variance ration: 0.8348


0.6844955991875423

### Scratch   <a name="scratch"></a>

Save some of the superfluous code snippets that may be useful in the future. Plus any unit tests.  

In [41]:
# check that samples still match up with features
assert df1.shape[0] == X.shape[0]
assert df1.shape[0] == Xt.shape[0]
assert X.shape      == Xt.shape

In [None]:
scores = {}
clf1.fit(X3_trn, y3_trn)
scores = {'train': clf1.score(X3_trn, y3_trn), 'test': clf1.score(X3_tst, y3_tst)}
scores

In [18]:
len(vectorizer.get_feature_names())

677

In [47]:
df1.sentence[2]

It was all the service she could now render her poor friend; for as to any of that heroism of sentiment which might have prompted her to entreat him to transfer his affection from herself to Harriet, as infinitely the most worthy of the two or even the more simple sublimity of resolving to refuse him at once and for ever, without vouchsafing any motive, because he could not marry them both, Emma had it not.

In [49]:
df1.sentence[2].lemma_

'-PRON- be all the service -PRON- could now render -PRON- poor friend ; for as to any of that heroism of sentiment which may have prompt -PRON- to entreat -PRON- to transfer -PRON- affection from -PRON- to Harriet , as infinitely the most worthy of the two or even the more simple sublimity of resolve to refuse -PRON- at once and for ever , without vouchsafe any motive , because -PRON- could not marry -PRON- both , Emma have -PRON- not .'

In [37]:
lemmas[0:4]

['ad language feeling agitation doubt reluctance discouragement receive discouragement',
 'time conviction glow attendant happiness time rejoice harriet secret escape resolve nee',
 'service render poor friend heroism sentiment prompt entreat transfer affection harriet infinitely worthy simple sublimity resolve refuse vouchsafe motive marry emma',
 'feel harriet pain contrition flight generosity run mad oppose probable reasonable enter brain']

In [None]:
class sklearn.feature_extraction.text.CountVectorizer(input=’content’, encoding=’utf-8’, decode_error=’strict’,
strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=’(?u)\b\w\w+\b’,
ngram_range=(1, 1), analyzer=’word’, max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, 
                                dtype=<class ‘numpy.int64’>)

In [None]:
rx4.sub('test',raws[14][0:100])

In [None]:
bow.isna().any().any()

In [None]:
# entity dectection 
entities=[(i, i.label_, i.label) for i in nytimes.ents]