<a href="https://colab.research.google.com/github/bs3537/DS-Unit-4-Sprint-1-NLP/blob/master/OK_Copy_Spacy_RFC_GridSearch_whiskey.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 4, Sprint 1, Module 3*

---

# Document Classification (Assignment)

This notebook is for you to practice skills during lecture.

Today's guided module project and assignment will be different. You already know how to do classification. You already know how to extract features from documents. So? That means you're ready to combine and practice those skills in a Kaggle competition. We will open with a five-minute sprint explaining the competition, and then give you 25 minutes to work. After those 25 minutes are up, I will give a 5-minute demo of an NLP technique that will help you with document classification (*and **maybe** the competition*).

Today's all about having fun and practicing your skills.

## Sections
* <a href="#p1">Part 1</a>: Text Feature Extraction & Classification Pipelines
* <a href="#p2">Part 2</a>: Latent Semantic Indexing
* <a href="#p3">Part 3</a>: Word Embeddings with Spacy
* <a href="#p4">Part 4</a>: Post Lecture Assignment

# Text Feature Extraction & Classification Pipelines (Learn)
<a id="p1"></a>

## Follow Along 

What you should be doing now:
1. Join the Kaggle Competition
2. Download the data
3. Train a model (try using the pipe method I just demoed)

### Load Competition Data

In [0]:
import pandas as pd

# You may need to change the path
train = pd.read_csv('https://raw.githubusercontent.com/bs3537/DS-Unit-4-Sprint-1-NLP/master/module3-document-classification/train.csv')
#test = pd.read_csv('test.csv')

In [2]:
train.head()

Unnamed: 0,id,description,ratingCategory
0,1321,"\nSometimes, when whisky is batched, a few lef...",1
1,3861,\nAn uncommon exclusive bottling of a 6 year o...,0
2,655,\nThis release is a port version of Amrut’s In...,1
3,555,\nThis 41 year old single cask was aged in a s...,1
4,1965,"\nQuite herbal on the nose, with aromas of dri...",1


In [0]:
train['text'] = train['description'].str.strip('\n')

In [0]:
train2 = train.drop(columns=['description', 'id'])

In [0]:
train3 = train2.rename(columns={"ratingCategory": "label"})

In [6]:
from sklearn.model_selection import train_test_split

df_trn, df_val = train_test_split(train3, stratify = train3['label'], test_size = 0.30, random_state=42)

df_trn.shape, df_val.shape

((2860, 2), (1227, 2))
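
The `stratify` argument above keeps the share of each label roughly the same in the training and validation parts. The sketch below illustrates that idea in plain Python. It is not scikit-learn's implementation; the helper name `stratified_split` and the toy labels are made up for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.30, seed=42):
    """Split row indices so each class keeps roughly the same share in both parts."""
    # Group row indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for validation.
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels (hypothetical): 70 rows of class 1 and 30 rows of class 0.
toy_labels = [1] * 70 + [0] * 30
train_idx, val_idx = stratified_split(toy_labels)
train_counts = Counter(toy_labels[i] for i in train_idx)
val_counts = Counter(toy_labels[i] for i in val_idx)
print(train_counts, val_counts)
```

Both parts end up with the same 70/30 class mix as the full toy list, which is the property you want when the labels are imbalanced.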

In [7]:
#Vectorize text using spacy
!pip install -U spacy[cuda92]

Requirement already up-to-date: spacy[cuda92] in /usr/local/lib/python3.6/dist-packages (2.2.4)


In [8]:
!python3 -m spacy download en_core_web_lg

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [0]:
import spacy

spacy.prefer_gpu()
nlp = spacy.load('en_core_web_lg')

In [0]:
doc = nlp("NLP is awesome!")

In [0]:
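def get_word_vectors(docs):
    # nlp.pipe processes the texts as a stream, which is faster than calling nlp() on each one;
    # doc.vector is spaCy's document vector (by default the average of the token vectors)
    return [doc.vector for doc in nlp.pipe(docs)]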
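
By default, spaCy's `doc.vector` is the average of the vectors of the tokens in the document, so every description becomes a single fixed-length feature vector regardless of its length. Here is a minimal sketch of that averaging step, using made-up 3-dimensional token vectors rather than real spaCy output (`mean_pool` is a hypothetical helper name):

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one document vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n_tokens = len(token_vectors)
    # zip(*...) walks the vectors one dimension at a time.
    return [sum(dim) / n_tokens for dim in zip(*token_vectors)]

# Hypothetical token vectors; real spaCy vectors are much longer.
toy_token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
]
doc_vector = mean_pool(toy_token_vectors)
print(doc_vector)  # [2.0, 2.0, 2.0]
```

Averaging throws away word order, which is one reason this approach works best as a baseline.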

In [0]:
X = df_trn['text']
Y = df_trn['label']
X_spacy = get_word_vectors(X)

In [0]:
# Import Statements
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

### Define Pipeline Components

In [0]:

rfc = RandomForestClassifier()

# Define the Pipeline
pipe = Pipeline([
    ('clf', rfc)         # RandomForest Classifier
])


In [15]:
parameters = {
    'clf__n_estimators': (25, 50, 75, 100),  # number of trees in the forest
    'clf__max_depth': (15, 20, 25, 30)
}

grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=1)

grid_search.fit(X_spacy, Y)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   28.1s
[Parallel(n_jobs=-1)]: Done  80 out of  80 | elapsed:   56.2s finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('clf',
                                        RandomForestClassifier(bootstrap=True,
                                                               ccp_alpha=0.0,
                                                               class_weight=None,
                                                               criterion='gini',
                                                               max_depth=None,
                                                               max_features='auto',
                                                               max_leaf_nodes=None,
                                                               max_samples=None,
                                                               min_impurity_decrease=0.0,
                                                               min_impurity_split=None,
                                              

In [16]:
grid_search.best_score_

0.7388111888111888

In [17]:
grid_search.best_params_

{'clf__max_depth': 20, 'clf__n_estimators': 100}
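
The "16 candidates, totalling 80 fits" line in the log above follows directly from the parameter grid: every combination of the two parameters is one candidate, and each candidate is fit once per cross-validation fold. A small sketch of that bookkeeping, using only the standard library (the variable names `candidates` and `total_fits` are chosen here for illustration):

```python
from itertools import product

# Same grid as the one passed to GridSearchCV above.
parameters = {
    'clf__n_estimators': (25, 50, 75, 100),
    'clf__max_depth': (15, 20, 25, 30),
}
cv_folds = 5

# One candidate per combination of parameter values.
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)  # 16 80
```

With the default `refit=True`, `GridSearchCV` then fits the best candidate once more on all of the training data, which is why `grid_search.predict` can be called directly afterwards.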

In [0]:
#Predict on val data

X_val = df_val['text']
Y_val = df_val['label']
X_val_spacy = get_word_vectors(X_val)

In [0]:
y_pred = grid_search.predict(X_val_spacy)


In [25]:
from sklearn.metrics import accuracy_score
val_accuracy = accuracy_score(Y_val, y_pred)
val_accuracy

0.7245313773431132
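
The validation accuracy (about 0.725) is a little below the cross-validated score (about 0.739). Before reading much into either number, it helps to compare against a majority-class baseline, since the labels are not balanced. The sketch below shows how accuracy and such a baseline are computed, on made-up toy labels rather than the competition data (`accuracy` and `majority_baseline` are hypothetical helper names):

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_train, y_val):
    """Accuracy of always predicting the most frequent training label."""
    most_common = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_val, [most_common] * len(y_val))

# Toy labels and predictions (hypothetical, for illustration only).
toy_train = [1, 1, 1, 0]
toy_val = [1, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 1, 0]

model_acc = accuracy(toy_val, toy_pred)
baseline_acc = majority_baseline(toy_train, toy_val)
print(model_acc, baseline_acc)  # 0.8 0.6
```

On the real data, the same comparison could be made by passing `Y` and `Y_val` to `majority_baseline`.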

In [0]:
# For comparison: spaCy vectorization + a Keras Sequential model reached 75.5% validation accuracy on the same train/val split.

# The spaCy vectors + RandomForest + GridSearchCV approach above had 82.6% test accuracy when trained on the whole training set, so the Keras model is giving slightly better results than this notebook's model.

# Overall, Keras with epoch tuning can be used for text classification and may give better accuracy than spaCy + RandomForest + GridSearchCV, which should be used as a baseline.

# Can also try TPOTClassifier to search other sklearn models that may have better accuracy, and then rerun using the TPOT winner model.

