### NLP Classifier
1. [Objective](#obj)
2. [Module Imports](#import)
3. [Raw Text](#raw)
4. [Tokenizer](#toke)
5. [Features](#feat)
6. [Classifier](#classify)
7. [Scratch](#scratch)

#### Objective <a name="obj"></a>

#### Challenge 0:

Recall that the logistic regression model's best performance on the test set was **93%**.  See what you can do to improve performance.  Suggested avenues of investigation include: Other modeling techniques (SVM?), making more features that take advantage of the spaCy information (include grammar, phrases, POS, etc), making sentence-level features (number of words, amount of punctuation), or including contextual information (length of previous and next sentences, words repeated from one sentence to the next, etc), and anything else your heart desires.  Make sure to design your models on the test set, or use cross_validation with multiple folds, and see if you can get accuracy above 90%.  
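
As one illustration of the sentence-level features suggested above (number of words, amount of punctuation), here is a minimal sketch in plain Python. The function name `sentence_features` and the example sentence are hypothetical and not part of the notebook; it works on raw strings with whitespace splitting rather than spaCy tokens, so the counts would differ slightly from `len(S)` on a spaCy span.

```python
import string

def sentence_features(sentence):
    """Return simple sentence-level features for one raw sentence string."""
    words = sentence.split()  # whitespace split, not spaCy tokenization
    n_punct = sum(1 for ch in sentence if ch in string.punctuation)
    return {"n_words": len(words), "n_punct": n_punct}

# Hypothetical example sentence
example = "Oh dear! Oh dear! I shall be late!"
features = sentence_features(example)
print(features)
```

Features like these could be added as extra columns next to the bag-of-words counts before fitting a classifier.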


#### Notes

The best accuracy score was actually **0.8707**.  The spaCy and sklearn versions used in the online notebook are outdated.  Besides, it would not be much of a challenge to get above 90% starting from a score of 93%.

It did not take much to get above 90%.  My "bag of words" was "cleaner" because I corrected the tokenizing errors.  Reducing the vocabulary to 668 words, by restricting which words go into the bag of words, actually improved the score.
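
A minimal sketch of how a vocabulary can be shrunk by keeping only words that appear at least a minimum number of times, similar to the count threshold used in the Features section. The toy `sentences` list and the `min_count` value are assumptions for illustration, not the notebook's data.

```python
from collections import Counter

# Hypothetical, already-lemmatized sentences standing in for the real corpus
sentences = [
    ["alice", "rabbit", "hole"],
    ["rabbit", "watch", "late"],
    ["alice", "fall", "hole"],
]

# Count every word across all sentences
counts = Counter()
for tokens in sentences:
    counts.update(tokens)

min_count = 2  # assumed cutoff; words seen fewer times are dropped
vocab = [word for word, n in counts.most_common() if n >= min_count]
print(len(counts), "distinct words,", len(vocab), "kept:", vocab)
```

Raising `min_count` trades vocabulary size against coverage, so it is worth checking the test score at a few cutoffs.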


Logistic regression with the default hyperparameters gave the following [result](#result1):

{'logit': {'train': 0.9665653495440729, 'test': 0.9036144578313253}}
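
The score above comes from a single train/test split; the challenge text also suggests cross-validation with multiple folds. The sketch below shows, in plain Python, how k folds partition the row indices. The helper `k_fold_indices` is hypothetical; in practice scikit-learn's `cross_val_score` would normally handle this. It uses contiguous folds without shuffling, which is an assumption that would not suit this notebook as is, because the rows are ordered by author and would need shuffling first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) index lists for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train size:", len(train_idx), "test rows:", test_idx)
```

Each row lands in exactly one test fold, so averaging the per-fold scores uses every sentence for evaluation once.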

#### Challenge 1:
Find out whether your new model is good at identifying Alice in Wonderland vs any other work, Persuasion vs any other work, or Austen vs any other work.  This will involve pulling a new book from the Project Gutenberg corpus (print(gutenberg.fileids()) for a list) and processing it.

Record your work for each challenge in a notebook and submit it below.
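
A hedged sketch of the "one work vs any other work" framing: turn author labels into binary targets and compute the accuracy of always predicting the majority class, which is a useful reference point when the classes are unbalanced. The label list and both helper functions are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical author labels, one per sentence
who = ["Austen", "Carroll", "Melville", "Melville", "Austen", "Melville"]

def one_vs_rest(labels, target):
    """Return 1 where the label equals target, else 0."""
    return [1 if label == target else 0 for label in labels]

def majority_baseline(binary_labels):
    """Accuracy obtained by always predicting the most common class."""
    most_common_count = Counter(binary_labels).most_common(1)[0][1]
    return most_common_count / len(binary_labels)

y_carroll = one_vs_rest(who, "Carroll")
print(y_carroll, round(majority_baseline(y_carroll), 3))
```

A model's test accuracy is only informative to the extent that it beats this baseline.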

In [None]:
'melville-moby_dick.txt'

#### Module Imports <a name="import"></a>

In [3]:
import spacy
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

In [4]:
from nltk.corpus import gutenberg

In [5]:
import pandas as pd
import numpy as np
import string
import re
from collections import Counter

In [6]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn import metrics

#### Raw Text <a name="raw"></a>

In [7]:
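# clean Gutenberg-specific text
def txt_clean(text):
    text = re.sub(r'--', ' ', text)       # replace double dashes with a space
    text = re.sub(r'\[.*?\]', '', text)   # drop bracketed text such as the title line
    text = ' '.join(text.split())         # collapse all runs of whitespace
    return text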
    
persn = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')
#remove the chapter headings
persn = re.sub(r'Chapter \d+', '', persn)
alice = re.sub(r'CHAPTER .*',  '', alice)
 
# keep only the first tenth of each text
alice = txt_clean(alice[:len(alice) // 10])
persn = txt_clean(persn[:len(persn) // 10])

In [43]:
mobyd = gutenberg.raw('melville-moby_dick.txt')
mobyd = re.sub(r'Chapter \d+', '',mobyd)
mobyd = txt_clean(mobyd[:int(len(mobyd)/10)]) 

In [44]:
sense = gutenberg.raw('austen-sense.txt')
sense = re.sub(r'Chapter \d+', '',sense)
sense = txt_clean(sense[:int(len(sense)/10)]) 

#### Tokenizer <a name="toke"></a>

In [46]:
nlp = spacy.load("en_core_web_sm")
# keep only sentences that contain at least one alphabetic token
snt1 = [S for S in nlp(persn).sents if any(tok.is_alpha for tok in S)]
snt2 = [S for S in nlp(alice).sents if any(tok.is_alpha for tok in S)]
snt3 = [S for S in nlp(mobyd).sents if any(tok.is_alpha for tok in S)]
snt4 = [S for S in nlp(sense).sents if any(tok.is_alpha for tok in S)]

In [47]:
astn = [[S,  len(S), "Austen"]   for S in snt1] 
crll = [[S,  len(S), "Carroll"]  for S in snt2]
mlvl = [[S,  len(S), "Melville"] for S in snt3]
snse = [[S,  len(S), "Sense"]    for S in snt4]

dfs = pd.DataFrame((astn + crll),               columns = ['sentence','length','who'])
df3 = pd.DataFrame((astn + crll + mlvl),        columns = ['sentence','length','who'])
df4 = pd.DataFrame((astn + crll + mlvl + snse), columns = ['sentence','length','who'])

#### Features <a name="feat"></a>

In [26]:
def sentence_process(S):
    stops = STOP_WORDS
    A = [s for s in S if s.is_alpha and not s.is_stop]
    B = [a for a in A if a.lemma_ not in stops]
    return [b.lemma_.lower() if b.pos_ != 'PROPN' else b.lower_ for b in B]

In [11]:
# use the processed sentences to create word list and count words
F = [(sentence_process(S)) for S in dfs.sentence]
assert len(dfs) == len(F)
ctr = Counter()
for f in F: ctr.update(f)
fte = [k for (k, v) in ctr.most_common() if v >= 2]

In [29]:
# use the processed sentences to create word list and count words
F = [(sentence_process(S)) for S in df3.sentence]
assert len(df3) == len(F)
ctr = Counter()
for f in F: ctr.update(f)
fts = [k for (k, v) in ctr.most_common() if v >= 3]

In [48]:
# use the processed sentences to create word list and count words
F = [(sentence_process(S)) for S in df4.sentence]
assert len(df4) == len(F)
ctr = Counter()
for f in F: ctr.update(f)
ft4 = [k for (k, v) in ctr.most_common() if v >= 4]

In [49]:
len(ft4)

1106

In [36]:
# zero-filled DataFrame: one row per sentence, one column per vocabulary word
bow = pd.DataFrame(np.zeros((len(dfs), len(fte))), columns=fte, dtype=int)
# assign the count per word from .lemma_ for .sents
bow['lemma'] = dfs.sentence.apply(lambda x: sentence_process(x))
for i in range(len(bow)):
    ctr = Counter(bow.lemma[i])
    for (j,f) in enumerate(fte):
        bow.iat[i,j] = ctr[f]

In [38]:
# zero-filled DataFrame: one row per sentence, one column per vocabulary word
bw3 = pd.DataFrame(np.zeros((len(df3), len(fts))), columns=fts, dtype=int)
# assign the count per word from .lemma_ for .sents
bw3['lemma'] = df3.sentence.apply(lambda x: sentence_process(x))
for i in range(len(bw3)):
    ctr = Counter(bw3.lemma[i])
    for (j,f) in enumerate(fts):
        bw3.iat[i,j] = ctr[f]

In [62]:
# zero-filled DataFrame: one row per sentence, one column per vocabulary word
bw4 = pd.DataFrame(np.zeros((len(df4), len(ft4))), columns=ft4, dtype=int)
# assign the count per word from .lemma_ for .sents
bw4['lemma'] = df4.sentence.apply(lambda x: sentence_process(x))
for i in range(len(bw4)):
    ctr = Counter(bw4.lemma[i])
    for (j,f) in enumerate(ft4):
        bw4.iat[i,j] = ctr[f]

In [63]:
# check that it all adds up
for i in range(0, len(bow), 50):
    assert bow.loc[i][:-1].sum() == len([word for word in bow.lemma[i] if word in fte])
# check that it all adds up
for i in range(0, len(bw3), 50):
    assert bw3.loc[i][:-1].sum() == len([word for word in bw3.lemma[i] if word in fts])
for i in range(0, len(bw4), 50):
    assert bw4.loc[i][:-1].sum() == len([word for word in bw4.lemma[i] if word in ft4])

In [15]:
X = bow[bow.columns[:-1]]
y = dfs.who
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=9)

In [64]:
X3 = bw3[bw3.columns[:-1]]
X4 = bw4[bw4.columns[:-1]]

In [57]:
y1 = np.where(df3.who == "Carroll", 1, 0)
y2 = np.where(df3.who == "Austen",  1, 0)
y3 = np.where(df4.who == "Austen",  1, 0)

In [65]:
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=9)
X1_trn, X1_tst, y1_trn, y1_tst = train_test_split(X3, y1, test_size=0.2, random_state=9)
X2_trn, X2_tst, y2_trn, y2_tst = train_test_split(X3, y2, test_size=0.2, random_state=9)
X3_trn, X3_tst, y3_trn, y3_tst = train_test_split(X4, y3, test_size=0.2, random_state=9)

#### Classifier <a name="classify"></a>

In [66]:
# run classifiers with default hyperparameters for algorithm selection
clf0 = LogisticRegression()
clf1 = RandomForestClassifier(random_state=1)
clf2 = SVC(random_state=1)

N = ['logit', 'Random Forest', 'SVM classifier']
scores = {}

####  Challenge 0  Result <a name="result1"></a>

In [17]:
# pair each name with its classifier directly instead of looking it up with eval
for n, clf in zip(N, [clf0, clf1, clf2]):
    clf.fit(X_trn, y_trn)
    scores[n] = {'train': clf.score(X_trn, y_trn), 'test': clf.score(X_tst, y_tst)}

scores



{'logit': {'train': 0.9665653495440729, 'test': 0.9036144578313253},
 'Random Forest': {'train': 0.9878419452887538, 'test': 0.8674698795180723},
 'SVM classifier': {'train': 0.6960486322188449, 'test': 0.6746987951807228}}

####  Challenge 1  Result <a name="result2"></a>

In [69]:
clf = LogisticRegression()
scores = {}
clf.fit(X1_trn, y1_trn)
scores[1] = {'train': clf.score(X1_trn, y1_trn), 'test': clf.score(X1_tst, y1_tst)}
scores[1]



{'train': 0.9657643312101911, 'test': 0.9426751592356688}

In [70]:
clf = LogisticRegression()
clf.fit(X2_trn, y2_trn)
scores[2] = {'train': clf.score(X2_trn, y2_trn), 'test': clf.score(X2_tst, y2_tst)}
scores[2]



{'train': 0.9585987261146497, 'test': 0.9299363057324841}

In [None]:
# third split (X4 with y3), defined above but not yet run
clf = LogisticRegression()
clf.fit(X3_trn, y3_trn)
scores[3] = {'train': clf.score(X3_trn, y3_trn), 'test': clf.score(X3_tst, y3_tst)}
scores[3]

#### Scratch   <a name="scratch"></a>

In [18]:
gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [None]:
bow.isna().any().any()

In [None]:
dfs.loc[np.where(dfs.length <= 2)]

In [None]:
print(dfs.groupby('who').length.mean(), dfs.groupby('who').length.std())

In [None]:
stops = STOP_WORDS
A1 = [tok for tok in doc1 if tok.is_alpha and not tok.is_stop]
B1 = [a for a in A1 if a.lemma_ not in stops]
C1 = [b.lemma_.lower() if b.pos_ != 'PROPN' else b.lower_ for b in B1]
D1 = list(set(C1))
print( len(A1), len(B1),len(C1), len(D1))

In [None]:
# docv, puncs and stops are not defined in this notebook
A = [tok for tok in docv if tok.is_alpha and not tok.is_stop]
B = [tok for tok in docv if tok.lemma_ not in puncs and tok.lemma_ not in stops]
C = [a.lemma_.lower() if a.pos_ != 'PROPN' else a.lower_ for a in A]
D = set(B)

In [None]:
# bag of words
bow_vector = CountVectorizer(tokenizer = spacy_tokens, ngram_range=(1,1))
# TF-IDF
tdf_vct = TfidfVectorizer(tokenizer = spacy_tokens)

In [None]:
tfidf = tdf_vct.fit_transform(all_txt)

In [None]:
alice[0]

In [None]:
X1 = tdf_vct.fit_transform(alice)

In [None]:
pd.DataFrame(X1).head()

In [None]:
len(doc1)

In [None]:
X1.shape

In [None]:
# POS tagging needs the model

nlp = spacy.load("en_core_web_sm")

for word in docs:
    print(word.text, word.pos_)

In [None]:
# entity detection
entities=[(i, i.label_, i.label) for i in nytimes.ents]

In [None]:
# Dependency Parsing
from spacy import displacy

for chunk in docp.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)
displacy.render(docp, style="dep", jupyter=True)

In [None]:
# Word Vector Representation
mango = toks[5]
print(mango.vector.shape)
print(mango.vector)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, ylabels, test_size=0.3)

In [None]:
# Logistic Regression Classifier

classifier = LogisticRegression()

# Create pipeline using Bag of Words
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])

# model generation
pipe.fit(X_train,y_train)

In [None]:
# Predicting with a test dataset
predicted = pipe.predict(X_test)

# Model Accuracy
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))

In [None]:
# Custom transformer using spaCy
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

# Basic function to clean the text
def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()