<a href="https://colab.research.google.com/github/ChanceDurr/DS-Unit-4-Sprint-1-NLP/blob/master/module3-document-classification/Chance_Dare_LS_DS_413_Document_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 4, Sprint 1, Module 3*

---

# Document Classification (Prepare)

Today's guided module project will be different. You already know how to do classification. You already know how to extract features from documents. That means you're ready to combine and practice those skills in a Kaggle competition. We will open with a five-minute sprint explaining the competition, and then give you 25 minutes to work. After those 25 minutes are up, I will give a five-minute demo of an NLP technique that will help you with document classification (*and **maybe** the competition*).

Today's all about having fun and practicing your skills. The competition will begin

## Learning Objectives
* <a href="#p0">Part 0</a>: Kaggle Competition
* <a href="#p1">Part 1</a>: Text Feature Extraction & Classification Pipelines
* <a href="#p2">Part 2</a>: Latent Semantic Indexing
* <a href="#p3">Part 3</a>: Word Embeddings with Spacy

# Text Feature Extraction & Classification Pipelines (Learn)
<a id="p1"></a>

## Overview

In [0]:
# Dataset
import pandas as pd
train = pd.read_csv('https://raw.githubusercontent.com/ChanceDurr/DS-Unit-4-Sprint-1-NLP/master/module3-document-classification/train.csv')
test = pd.read_csv('https://raw.githubusercontent.com/ChanceDurr/DS-Unit-4-Sprint-1-NLP/master/module3-document-classification/test.csv')

In [0]:
train = train.dropna()

### Sklearn Pipeline Objects

In [0]:
# Import Statements
from sklearn.pipeline import Pipeline

from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

In [0]:
# Create Pipeline

vect = TfidfVectorizer(stop_words='english')
sgdc = SGDClassifier()

pipe = Pipeline([('vect', vect), ('clf', sgdc)])
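
To make the idea of a pipeline concrete, here is a minimal sketch of the same two-step structure on a tiny dataset. The documents and labels are hypothetical toy data, not the competition data, and `random_state=0` is only there to make the run repeatable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data (assumption): two "peaty" and two "sweet" descriptions
toy_docs = [
    "smoky peat and brine on the nose",
    "heavy peat smoke with iodine",
    "sweet vanilla and caramel from the oak",
    "caramel, honey and vanilla fudge",
]
toy_labels = [0, 0, 1, 1]

# Same structure as the notebook's pipeline: vectorizer first, classifier second
toy_pipe = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("clf", SGDClassifier(random_state=0)),
])

# fit() runs fit_transform on the vectorizer, then fit on the classifier
toy_pipe.fit(toy_docs, toy_labels)

# predict() takes raw strings; the fitted vectorizer transforms them first
toy_preds = toy_pipe.predict(["peat smoke and brine", "vanilla and honey"])
print(toy_preds)
```

Because the vectorizer lives inside the pipeline, the same raw-text interface works for both training and prediction, which is what lets the notebook pass `test['description']` straight to `predict`.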

In [0]:
test.head()

Unnamed: 0,id,author,description,price,ratingValue,pert_alcohol
0,955,Fred Minnick,"Think carnival aromas—the good ones, anyway...",36.0,90,50.0
1,3532,Lew Bryson,"A blend of three bourbons, between 6 and 12 ye...",90.0,82,49.3
2,1390,Davin de Kergommeaux,"The nose is focused on cereal, hints of fresh ...",48.0,89,45.0
3,1024,Gavin Smith,Swiss-based Chapter 7 released this 19 year ol...,180.0,90,55.8
4,1902,Gavin Smith,Valkyrie replaces the current Dark Origins exp...,71.0,87,45.9


In [0]:
# Fit Pipeline
pipe.fit(train['description'], train['category'])

Pipeline(memory=None,
         steps=[('vect',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=False,
                                 token_patte...
                 SGDClassifier(alpha=0.0001, average=False, class_weight=None,
                               early_stopping=False, epsilon=0.1, eta0=0.0,
                               fit_intercept=True, l1_ratio=0.15,
                               learning_rate='optimal', loss='hinge',
           

In [0]:
preds = pipe.predict(test['description'])

In [0]:
submission = pd.DataFrame({'id': test['id'], 'category': preds})

In [0]:
submission['category'] = submission['category'].astype('int')

In [0]:
submission.head()

Unnamed: 0,id,category
0,955,2
1,3532,2
2,1390,4
3,1024,1
4,1902,1


In [0]:
from google.colab import files
submission.to_csv('sub.csv', index=False) 
files.download('sub.csv')

### Tuning a Pipeline Object with GridSearch

In [0]:
# Experiment Management
from sklearn.model_selection import GridSearchCV

In [0]:
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__min_df': (.02, .05),
    'vect__max_features': (100, 500, 1000),
    'clf__max_iter': (20, 10, 100)
}
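
As a quick sanity check on the size of this search, the number of candidates is the product of the option counts: 3 × 2 × 3 × 3 = 54. The sketch below counts them with scikit-learn's `ParameterGrid`; the names `grid`, `n_candidates`, and `n_folds` are introduced here for illustration only, and the fold count of 5 matches the `cv=5` used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, repeated so this snippet is self-contained
grid = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__min_df': (.02, .05),
    'vect__max_features': (100, 500, 1000),
    'clf__max_iter': (20, 10, 100),
}

# Every combination of the listed values is one candidate
n_candidates = len(ParameterGrid(grid))

# Each candidate is fit once per cross-validation fold
n_folds = 5
print(n_candidates, n_candidates * n_folds)  # 54 270
```

Adding one more three-valued parameter would triple the number of fits, so it helps to estimate the cost before launching a search.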

In [0]:
grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=1)

In [0]:
grid_search.fit(train['description'], train['category'])

Fitting 5 folds for each of 54 candidates, totalling 270 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    9.1s
[Parallel(n_jobs=-1)]: Done 196 tasks      | elapsed:   31.2s
[Parallel(n_jobs=-1)]: Done 270 out of 270 | elapsed:   42.4s finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                          

In [0]:
grid_search.best_params_

{'clf__max_iter': 20,
 'vect__max_df': 0.5,
 'vect__max_features': 500,
 'vect__min_df': 0.02}

In [0]:
grid_search.best_score_

0.9030694668820679

In [0]:
preds = grid_search.predict(test['description'])
submission = pd.DataFrame({'id': test['id'], 'category': preds})
submission['category'] = submission['category'].astype('int')
submission.head()

Unnamed: 0,id,category
0,955,2
1,3532,2
2,1390,1
3,1024,1
4,1902,1


In [0]:
from google.colab import files
submission.to_csv('sub_gridsearch.csv', index=False) 
files.download('sub_gridsearch.csv')

## Challenge

1. Join the Kaggle Competition
2. Download the data
3. Train a model & try: 
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
4. Make a submission to Kaggle
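
The notebook only demonstrates `GridSearchCV`, so here is a hedged sketch of the `RandomizedSearchCV` alternative mentioned in the list. The toy documents, the parameter values, and `n_iter=4` are all assumptions chosen to keep the example small; they are not tuned for the competition.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data (assumption): four documents per class
toy_docs = [
    "smoky peat and brine on the nose",
    "heavy peat smoke with iodine",
    "peat smoke and sea salt",
    "brine, smoke and ash on the finish",
    "sweet vanilla and caramel from the oak",
    "caramel, honey and vanilla fudge",
    "honey and toffee with soft vanilla",
    "toffee, caramel and sweet oak",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipe = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("clf", SGDClassifier(random_state=0)),
])

# 2 x 2 x 3 = 12 possible combinations; the random search samples only 4
toy_distributions = {
    "vect__max_df": [0.75, 1.0],
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [1e-4, 1e-3, 1e-2],
}

search = RandomizedSearchCV(
    toy_pipe, toy_distributions, n_iter=4, cv=2, random_state=0
)
search.fit(toy_docs, toy_labels)

print(search.best_params_)
print(len(search.cv_results_["params"]))  # number of sampled candidates
```

The parameter names follow the same `step__parameter` convention as the grid search; the main difference is that `n_iter` caps how many combinations are tried.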

# Latent Semantic Indexing (Learn)
<a id="p2"></a>

## Overview

In [0]:
# Import

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100, 
                   algorithm='randomized',
                   n_iter=10)
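
To see what the SVD step does to the features, the following sketch applies a small `TruncatedSVD` to a TF-IDF matrix and compares shapes. The toy documents and `n_components=2` are assumptions for illustration; the notebook itself uses 100 components on the real descriptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents (assumption)
toy_docs = [
    "smoky peat and brine on the nose",
    "heavy peat smoke with iodine",
    "peat smoke and sea salt",
    "sweet vanilla and caramel from the oak",
    "caramel, honey and vanilla fudge",
    "honey and toffee with soft vanilla",
]

# Sparse document-term matrix: one column per vocabulary term
tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

# Project each document onto 2 dense components
svd_small = TruncatedSVD(n_components=2, algorithm="randomized",
                         n_iter=10, random_state=0)
reduced = svd_small.fit_transform(tfidf)

print(tfidf.shape)    # (number of documents, vocabulary size)
print(reduced.shape)  # (number of documents, 2)
```

The classifier downstream then sees a handful of dense columns per document instead of one sparse column per term.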

In [0]:
params = { 
    'lsi__svd__n_components': [10,100,250],
}

In [0]:
# LSI

lsi = Pipeline([('vect', vect), ('svd', svd)])

In [0]:
# Pipe

pipe = Pipeline([('lsi', lsi), ('clf', sgdc)])

In [0]:
# Fit
pipe.fit(train['description'], train['category'])

Pipeline(memory=None,
         steps=[('lsi',
                 Pipeline(memory=None,
                          steps=[('vect',
                                  TfidfVectorizer(analyzer='word', binary=False,
                                                  decode_error='strict',
                                                  dtype=<class 'numpy.float64'>,
                                                  encoding='utf-8',
                                                  input='content',
                                                  lowercase=True, max_df=1.0,
                                                  max_features=None, min_df=1,
                                                  ngram_range=(1, 1), norm='l2',
                                                  preprocessor=None,
                                                  smooth_idf=True,
                                                  stop_words='english',
                                                  strip_a

## Follow Along
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try: 
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (LSI) to your pipeline. *Note:* You can grid search a nested pipeline, but you have to chain the step names with double underscores, for example `lsi__svd__n_components`
4. Make a submission to Kaggle 
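
The nested-parameter note can be sketched as follows: an inner `lsi` pipeline sits inside an outer pipeline, and the grid reaches the SVD step through `lsi__svd__n_components`. The toy data, the candidate values `[2, 3]`, and `cv=2` are assumptions picked so the example runs on eight short documents.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data (assumption): four documents per class
toy_docs = [
    "smoky peat and brine on the nose",
    "heavy peat smoke with iodine",
    "peat smoke and sea salt",
    "brine, smoke and ash on the finish",
    "sweet vanilla and caramel from the oak",
    "caramel, honey and vanilla fudge",
    "honey and toffee with soft vanilla",
    "toffee, caramel and sweet oak",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Inner pipeline: TF-IDF followed by SVD (together, LSI)
lsi_toy = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("svd", TruncatedSVD(random_state=0)),
])

# Outer pipeline: the LSI features feed the classifier
nested = Pipeline([("lsi", lsi_toy), ("clf", SGDClassifier(random_state=0))])

# outer step name + "__" + inner step name + "__" + parameter name
toy_params = {"lsi__svd__n_components": [2, 3]}

search = GridSearchCV(nested, toy_params, cv=2)
search.fit(toy_docs, toy_labels)
print(search.best_params_)
```

Calling `nested.get_params().keys()` lists every name that can be used in the grid, which is a handy way to check the spelling of a nested key.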


## Challenge

Continue to apply Latent Semantic Indexing (LSI) to various datasets. 

# Word Embeddings with Spacy (Learn)
<a id="p3"></a>

## Overview

In [0]:
import spacy
nlp = spacy.load("en_core_web_lg")

In [0]:
embeddings = [nlp(doc).vector for doc in train['description']]
embeddings_test = [nlp(doc).vector for doc in test['description']]
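
spaCy's `Doc.vector` defaults to an average of the token vectors, so each description becomes one fixed-length row of features. The sketch below imitates that idea with NumPy only; the three-dimensional vectors and the `doc_vector` helper are hypothetical, and skipping unknown tokens is a simplification of this sketch rather than a description of spaCy's internals.

```python
import numpy as np

# Hypothetical 3-dimensional token vectors (assumption); real spaCy vectors are much longer
token_vectors = {
    "vanilla": np.array([1.0, 0.0, 2.0]),
    "fudge": np.array([3.0, 2.0, 0.0]),
}


def doc_vector(tokens, vectors, dim=3):
    """Average the vectors of the known tokens; return zeros if none are known."""
    rows = [vectors[t] for t in tokens if t in vectors]
    if not rows:
        return np.zeros(dim)
    return np.mean(rows, axis=0)


vec = doc_vector(["vanilla", "fudge"], token_vectors)
print(vec)  # [2. 1. 1.]

# Stacking one vector per document gives a feature matrix for a classifier
features = np.vstack([vec, doc_vector(["unknown"], token_vectors)])
print(features.shape)  # (2, 3)
```

Averaging discards word order, which is one reason embedding features and TF-IDF features can behave differently on the same task.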

In [0]:
# Create Pipeline

vect = TfidfVectorizer(stop_words='english')  # not used below: the spaCy vectors are already numeric features
sgdc = SGDClassifier()

pipe = Pipeline([('clf', sgdc)])

In [0]:
pipe.fit(embeddings, train['category'])

Pipeline(memory=None,
         steps=[('clf',
                 SGDClassifier(alpha=0.0001, average=False, class_weight=None,
                               early_stopping=False, epsilon=0.1, eta0=0.0,
                               fit_intercept=True, l1_ratio=0.15,
                               learning_rate='optimal', loss='hinge',
                               max_iter=1000, n_iter_no_change=5, n_jobs=None,
                               penalty='l2', power_t=0.5, random_state=None,
                               shuffle=True, tol=0.001, validation_fraction=0.1,
                               verbose=0, warm_start=False))],
         verbose=False)

In [0]:
preds = pipe.predict(embeddings_test)
submission = pd.DataFrame({'id': test['id'], 'category': preds})
submission['category'] = submission['category'].astype('int')
submission.head()

Unnamed: 0,id,category
0,955,2
1,3532,2
2,1390,4
3,1024,1
4,1902,1


In [0]:
from google.colab import files
submission.to_csv('sub_embeddings.csv', index=False) 
files.download('sub_embeddings.csv')

## Follow Along

In [0]:
doc = nlp(train['description'][10])
for chunk in doc.noun_chunks:
  print(chunk.lemma_)

the complete package
note
toffee - coat nut
vanilla fudge
polished leather
cedar - tinge tobacco
barrel char
cocoa powder
a hint
fig
a firm oak grip
the finish
the premium price
this commemorative release
editor 's choice


In [0]:
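def tokenize(doc):
  """Return the lemmatized noun chunks of a document."""

  d = nlp(doc)
  tokens = []

  for chunk in d.noun_chunks:
    # A spaCy Span has no `is_stop` attribute, so test the chunk's root token.
    # Assumption: skipping chunks whose root is a stop word is the intended filter.
    if not chunk.root.is_stop:
      tokens.append(chunk.lemma_)

  return tokens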

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(analyzer=tokenize, max_df=.9, min_df=.1)

In [0]:
vect.fit(train['description'])

CountVectorizer(analyzer=<function tokenize at 0x7fe6bb0ef950>, binary=False,
                decode_error='strict', dtype=<class 'numpy.int64'>,
                encoding='utf-8', input='content', lowercase=True, max_df=0.9,
                max_features=None, min_df=0.1, ngram_range=(1, 1),
                preprocessor=None, stop_words=None, strip_accents=None,
                token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None,
                vocabulary=None)

In [0]:
vect.get_feature_names()

['-PRON-',
 'a hint',
 'caramel',
 'cinnamon',
 'honey',
 'the finish',
 'the nose',
 'the palate',
 'vanilla',
 'water']
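
Only ten features survive because `min_df=.1` keeps a term only if it appears in at least 10% of the documents, and `max_df=.9` drops terms that appear in more than 90% of them. The sketch below shows the same filtering with a callable analyzer on toy data; the `comma_phrases` analyzer, the documents, and the threshold `min_df=.5` are hypothetical stand-ins for the spaCy tokenizer and the real descriptions.

```python
from sklearn.feature_extraction.text import CountVectorizer


def comma_phrases(doc):
    """Hypothetical analyzer: treat each comma-separated phrase as one token."""
    return [p.strip().lower() for p in doc.split(",") if p.strip()]


# Hypothetical toy documents (assumption)
toy_docs = [
    "vanilla, honey, the finish",
    "vanilla, caramel",
    "vanilla, honey, water",
    "vanilla, cinnamon, the finish",
]

# A callable analyzer replaces the built-in preprocessing and tokenization
toy_vect = CountVectorizer(analyzer=comma_phrases, max_df=.9, min_df=.5)
toy_vect.fit(toy_docs)

# "vanilla" is in 4/4 documents (above max_df); "caramel", "water", and
# "cinnamon" are each in 1/4 (below min_df); the rest are in 2/4 and are kept
kept = sorted(toy_vect.vocabulary_)
print(kept)  # ['honey', 'the finish']
```

Lowering `min_df` is the quickest way to let more noun chunks into the vocabulary, at the cost of a wider and sparser feature matrix.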

## Challenge

What you should be doing now:
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try: 
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (LSI) to your pipeline. *Note:* You can grid search a nested pipeline, but you have to chain the step names with double underscores, for example `lsi__svd__n_components`
    - Try to extract word embeddings with Spacy and use those embeddings as your features for a classification model.
4. Make a submission to Kaggle 

# Review

To review this module: 
* Continue working on the Kaggle competition
* Find another text classification task to work on