<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Document Classification
## *Data Science Unit 4 Sprint 1 Lesson 3*

Today's lesson will be different. You already know how to do classification, and you already know how to extract features from documents. That means you're ready to combine and practice those skills in a Kaggle competition. We will open with a 5-minute sprint explaining the competition, and then give you 25 minutes to work. After those 25 minutes are up, I will give a 5-minute demo of an NLP technique that will help you with document classification (*and **maybe** the competition*). 

Today's all about having fun and practicing your skills.

## Learning Objectives
* <a href="#p0">Part 0</a>: Kaggle Competition
* <a href="#p1">Part 1</a>: Text Feature Extraction & Classification Pipelines
* <a href="#p2">Part 2</a>: Latent Semantic Indexing
* <a href="#p3">Part 3</a>: Word Embeddings with spaCy

## Text Feature Extraction & Classification Pipelines
<a id="p1"></a>

In [1]:
# Dataset
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism',
              'talk.religion.misc']

data = fetch_20newsgroups(subset='train', categories=categories)

### Sklearn Pipeline Objects

In [3]:
# Import Statements
from sklearn.pipeline import Pipeline

from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer


In [9]:
# Create Pipeline

vect = TfidfVectorizer(stop_words='english')
sgdc = SGDClassifier()

pipe = Pipeline([('vect', vect), ('clf', sgdc)])

In [5]:
# Fit Pipeline
pipe.fit(data.data, data.target)



Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...m_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False))])

In [None]:
pipe.predict(['Send me lots of money now', 'you won the lottery in Nigeria'])
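`predict` returns integer class indices, not newsgroup names. The sketch below shows one way to map them back to readable labels. It uses a tiny made-up corpus instead of the 20 newsgroups download, so `docs`, `labels`, and `toy_pipe` are hypothetical stand-ins; with the real data you would index into `data.target_names` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for data.data / data.target
docs = [
    "god faith church prayer",
    "prayer church bible faith",
    "atheism reason evidence science",
    "science evidence skeptic reason",
]
labels = [1, 1, 0, 0]

# Assumed to match the order of data.target_names for the two categories above
target_names = ["alt.atheism", "talk.religion.misc"]

toy_pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", SGDClassifier(random_state=0)),
])
toy_pipe.fit(docs, labels)

# predict() yields integer indices; look each one up in target_names
preds = toy_pipe.predict(["church prayer", "science evidence"])
named = [target_names[i] for i in preds]
print(named)
```

Because the pipeline owns the vectorizer, raw strings go straight into `predict`; there is no separate `transform` step to remember.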

### Tuning a Pipeline Object with GridSearch

In [6]:
# Experiment Management
from sklearn.model_selection import GridSearchCV

In [7]:
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'clf__max_iter':(20, 10, 100)
}

In [10]:
grid_search = GridSearchCV(pipe,parameters, cv=5, n_jobs=-1, verbose=1)

In [11]:
grid_search.fit(data.data, data.target)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  45 out of  45 | elapsed:    5.1s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...m_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'vect__max_df': (0.5, 0.75, 1.0), 'clf__max_iter': (20, 10, 100)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)
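Once a grid search has been fit, the useful information lives in its attributes rather than in the printed repr. Here is a minimal sketch of how to read them, run on a hypothetical toy corpus with a smaller grid (`toy_pipe`, `toy_grid`, and the documents are illustrative assumptions, not the lesson's data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 6 documents per class so 3-fold CV can stratify
religion = [
    "church prayer faith", "faith bible church", "prayer bible god",
    "god church faith", "bible prayer god", "faith god prayer",
]
atheism = [
    "science evidence reason", "reason skeptic science", "evidence skeptic logic",
    "logic science reason", "skeptic evidence logic", "reason logic evidence",
]
docs = religion + atheism
labels = [1] * 6 + [0] * 6

toy_pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Keys follow the <step name>__<parameter> convention
toy_grid = {
    "vect__max_df": (0.75, 1.0),
    "clf__max_iter": (10, 100),
}

search = GridSearchCV(toy_pipe, toy_grid, cv=3)
search.fit(docs, labels)

# Best combination and its mean cross-validated accuracy
print(search.best_params_)
print(round(search.best_score_, 3))

# Mean score for every candidate in the grid
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, round(score, 3))
```

With the default `refit=True`, the search object is also a fitted model: `search.predict(...)` uses the best pipeline retrained on all of the data.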

## Latent Semantic Indexing
<a id="p2"></a>

In [20]:
# Import

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100, 
                   algorithm='randomized',
                   n_iter=10)

In [21]:
# LSI

lsi = Pipeline([('vect', vect), ('svd', svd)])

In [24]:
# Pipe

pipe = Pipeline([('lsi', lsi), ('clf', sgdc)])

params = {
    # Assumed values: these mirror the 'vect__max_df' grid used earlier
    'lsi__vect__max_df': (0.5, 0.75, 1.0)
}

In [23]:
# Fit
pipe.fit(data.data, data.target)



Pipeline(memory=None,
     steps=[('lsi', Pipeline(memory=None,
     steps=[('vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=...m_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False))])
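To see what the LSI step actually does to the features, it can help to run the vectorizer and SVD by hand on a few documents. The following is a sketch under stated assumptions: the corpus is made up, and `n_components=2` is chosen only because the toy vocabulary is tiny (the lesson uses 100).

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with two loose "topics"
docs = [
    "church prayer faith bible",
    "faith bible god church",
    "prayer god faith",
    "science evidence reason logic",
    "reason skeptic science evidence",
    "logic evidence skeptic",
]

# Sparse TF-IDF matrix: one row per document, one column per vocabulary term
tfidf = TfidfVectorizer().fit_transform(docs)

# Project the sparse term space down to 2 dense latent dimensions
svd = TruncatedSVD(n_components=2, random_state=0)
topics = svd.fit_transform(tfidf)

print(tfidf.shape)    # (n_documents, n_terms)
print(topics.shape)   # (n_documents, 2)
print(svd.explained_variance_ratio_)
```

The classifier downstream then sees a small dense matrix instead of a wide sparse one, which is the trade being made: fewer, denser features in exchange for some lost detail.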

## Word Embeddings with spaCy
<a id="p3"></a>

In [3]:
import spacy
nlp = spacy.load("en_core_web_md")

In [26]:
doc = nlp("Two bananas in pyjamas")

In [31]:
bananas_vector = doc.vector
print(len(bananas_vector))

300


In [32]:
def get_word_vectors(docs):
    return [nlp(doc).vector for doc in docs]

In [33]:
X = get_word_vectors(data.data)

sgdc.fit(X, data.target)



SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)
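By default, spaCy's `doc.vector` is the average of the token vectors in the document. The NumPy sketch below illustrates that averaging with hypothetical 4-dimensional vectors (the real `en_core_web_md` vectors have 300 dimensions, and the numbers here are invented for the example). The whitespace tokenizer is also a simplification of what spaCy does.

```python
import numpy as np

# Hypothetical 4-dimensional token vectors, invented for illustration
token_vectors = {
    "two": np.array([1.0, 0.0, 2.0, 0.0]),
    "bananas": np.array([0.0, 3.0, 0.0, 1.0]),
    "in": np.array([1.0, 1.0, 1.0, 1.0]),
    "pyjamas": np.array([2.0, 0.0, 1.0, 2.0]),
}

def doc_vector(text):
    """Average the vectors of the whitespace-separated tokens in text."""
    tokens = text.lower().split()
    return np.mean([token_vectors[t] for t in tokens], axis=0)

vec = doc_vector("Two bananas in pyjamas")
print(vec)  # [1. 1. 1. 1.]

# Averaging discards word order: a shuffled sentence gives the same vector
shuffled = doc_vector("pyjamas in bananas two")
print(np.allclose(vec, shuffled))  # True
```

This is why averaged embeddings work as fixed-length features for a classifier, and also why they cannot distinguish documents that differ only in word order.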

# Which Whiskey

In [4]:
import pandas as pd

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

# Remove training examples with missing labels
train = train[pd.notnull(train['category'])]

In [38]:
y_train = train['category']

X_train = train['description']
X_test = test['description']

In [39]:
X_train.head()

0    A marriage of 13 and 18 year old bourbons. A m...
1    There have been some legendary Bowmores from t...
2    This bottling celebrates master distiller Park...
3    What impresses me most is how this whisky evol...
5    A caramel-laden fruit bouquet, followed by une...
Name: description, dtype: object

In [40]:
y_train.value_counts(normalize=True)

1.0    0.633024
2.0    0.173627
3.0    0.116009
4.0    0.077340
Name: category, dtype: float64

In [41]:
## TF-IDF

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1,1), stop_words='english')

vectorizer.fit(X_train)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=1000,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words='english', strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [43]:
train_tf = vectorizer.transform(X_train)
test_tf = vectorizer.transform(X_test)

In [49]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rfc = RandomForestClassifier(max_depth=10, n_estimators=1000)

rfc.fit(train_tf, y_train)

train_pred = rfc.predict(train_tf)
print(f'Train Accuracy: {accuracy_score(y_train, train_pred)}')

Train Accuracy: 0.8623356535189481
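Training accuracy is measured on data the model has already seen, so it tends to be optimistic, and with about 63% of the labels in one category a model that always predicts that category already scores about 0.63. One possible way to get a fairer picture is to cross-validate the whole pipeline and compare it with a majority-class baseline. This is a sketch on a hypothetical toy dataset (`descriptions` and `categories` are invented, and the forest is kept small), not the competition data.

```python
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical, imbalanced toy data: 6 examples of class 1, 3 of class 2
descriptions = [
    "smoky peat with sea salt",
    "heavy peat smoke and brine",
    "peat smoke iodine and salt",
    "salty smoke with earthy peat",
    "brine iodine and peat smoke",
    "earthy peat and sea smoke",
    "sweet vanilla caramel and oak",
    "caramel oak and sweet corn",
    "vanilla corn and toasted oak",
]
categories = [1] * 6 + [2] * 3

# Baseline: always predict the most frequent class in the training fold
baseline = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("clf", DummyClassifier(strategy="most_frequent")),
])

# Keeping the vectorizer inside the pipeline means it is refit on each
# training fold, so no vocabulary or IDF statistics leak from held-out data
model = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("clf", RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)),
])

baseline_scores = cross_val_score(baseline, descriptions, categories, cv=3)
model_scores = cross_val_score(model, descriptions, categories, cv=3)

print(f"Baseline CV accuracy: {baseline_scores.mean():.3f}")
print(f"Model CV accuracy:    {model_scores.mean():.3f}")
```

A model is only adding value to the extent that its cross-validated score beats the baseline line.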
