<a href="https://colab.research.google.com/github/bgoueti/BloomTechSprint/blob/main/DS_413_Document_Classification_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 4, Sprint 1, Module 3*

---

# Document Classification (Assignment)

This notebook is for you to practice skills during lecture.

Today's guided module project and assignment will be different. You already know how to do classification. You already know how to extract features from documents. That means you're ready to combine and practice those skills in a [Kaggle competition](https://www.kaggle.com/c/whiskey-201911/). We will open with a five-minute sprint explaining the competition, and then give you 25 minutes to work. After those 25 minutes are up, I will give a 5-minute demo of an NLP technique that will help you with document classification (*and **maybe** the competition*).

Today's all about having fun and practicing your skills.

## Sections
* <a href="#p1">Part 1</a>: Text Feature Extraction & Classification Pipelines
* <a href="#p2">Part 2</a>: Latent Semantic Indexing
* <a href="#p3">Part 3</a>: Word Embeddings with Spacy
* <a href="#p4">Part 4</a>: Post Lecture Assignment

# Text Feature Extraction & Classification Pipelines (Learn)
<a id="p1"></a>

We are going to run increasingly sophisticated classification models on our whisky reviews in parts 1, 2, and 3. For each of parts 1, 2, and 3, submit your best model's results to the Kaggle competition to measure `generalization accuracy` -- i.e. how well the model performs on new data.

## 1. Classifier based on TF-IDF vectorization of reviews

### Follow Along

1. Join the Kaggle Competition
2. Download the data
3. Train and hyperparameter tune a model using an sklearn pipeline

### 1.0 Setup

#### 1.0.1 Get spacy and restart runtime

In [1]:
# The medium model (en_core_web_md) is used here; en_core_web_lg also works but takes a while to download
!python -m spacy download en_core_web_md
# On Colab, restart the runtime after this step!

Collecting en-core-web-md==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.7.1/en_core_web_md-3.7.1-py3-none-any.whl (42.8 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.7.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


#### 1.0.2 import necessary packages, load spacy

In [2]:
%pip install catboost

Collecting catboost
  Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.7


In [3]:
%pip install lightgbm



In [4]:
%pip install xgboost



In [5]:
import pandas as pd
import numpy as np
import re

from sklearn.model_selection import RandomizedSearchCV
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
import xgboost as xgb
from xgboost import XGBClassifier

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
import spacy

from scipy.stats import randint
from scipy.stats import uniform

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



Load `spacy`

In [6]:
from spacy.lang.en import English
nlp = English()

#### 1.0.3 Load Kaggle Whisky Competition Data
The goal is to predict the rating from the review text

In [7]:
#!wget https://www.kaggle.com/competitions/whiskey-201911/data

In [8]:
!unzip '/content/whiskey-201911.zip'

Archive:  /content/whiskey-201911.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


In [9]:
# !!!!! You may need to change the path !!!!!
# You can download these datasets from the Kaggle in-class
# competition for your cohort.

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [10]:
train.head()

Unnamed: 0,id,description,category
0,1,A marriage of 13 and 18 year old bourbons. A m...,2
1,2,There have been some legendary Bowmores from t...,1
2,3,This bottling celebrates master distiller Park...,2
3,4,What impresses me most is how this whisky evol...,1
4,9,"A caramel-laden fruit bouquet, followed by une...",2


In [11]:
test.head()

Unnamed: 0,id,description
0,955,"Think carnival aromas—the good ones, anyway—me..."
1,3532,"A blend of three bourbons, between 6 and 12 ye..."
2,1390,"The nose is focused on cereal, hints of fresh ..."
3,1024,Swiss-based Chapter 7 released this 19 year ol...
4,1902,Valkyrie replaces the current Dark Origins exp...


In [12]:
train.shape, test.shape

((2586, 3), (288, 2))

### 1.1 Clean Text

In [13]:
def clean_doc(text):
  # COMPLETE THE CODE IN THIS CELL
  # Replace everything except ASCII letters and spaces (digits, punctuation,
  # newlines) with a space
  text = re.sub('[^a-zA-Z ]', ' ', text)
  # Collapse runs of multiple spaces into one
  text = re.sub('[ ]{2,}', ' ', text)

  # Lowercase and strip leading and trailing whitespace
  text = text.lower().strip()
  return text

# before cleaning
print(train['description'][0])

# after cleaning
train['description'] = train['description'].apply(clean_doc)
test['description'] = test['description'].apply(clean_doc)


A marriage of 13 and 18 year old bourbons. A mature yet very elegant whiskey, with a silky texture and so easy to embrace with a splash of water. Balanced notes of honeyed vanilla, soft caramel, a basket of complex orchard fruit, blackberry, papaya, and a dusting of cocoa and nutmeg; smooth finish. Sophisticated, stylish, with well-defined flavors. A classic!


In [14]:
print(train['description'][0])

a marriage of and year old bourbons a mature yet very elegant whiskey with a silky texture and so easy to embrace with a splash of water balanced notes of honeyed vanilla soft caramel a basket of complex orchard fruit blackberry papaya and a dusting of cocoa and nutmeg smooth finish sophisticated stylish with well defined flavors a classic
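
To see what the two regular expressions in `clean_doc` do, here is a minimal standalone sketch on a made-up string. The sample text and the name `clean_doc_sketch` are illustrative only and are not part of the assignment. Because the first pattern keeps only ASCII letters and spaces, digits, punctuation, and newlines all turn into spaces before the runs of spaces are collapsed.

```python
import re

def clean_doc_sketch(text):
    # Replace every character that is not an ASCII letter or a space
    text = re.sub(r"[^a-zA-Z ]", " ", text)
    # Collapse runs of two or more spaces into a single space
    text = re.sub(r" {2,}", " ", text)
    # Lowercase and trim both ends
    return text.lower().strip()

# Made-up review text containing digits, hyphens, a newline, and punctuation
sample = "A 12-year-old malt;\nsmooth finish. A classic!"
cleaned = clean_doc_sketch(sample)
print(cleaned)
```

One consequence worth keeping in mind: age statements such as "12" disappear entirely, so the vectorizer never sees them.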


### 1.2 Split training data into Feature Matrix `X` and Target Vector `y`

In [15]:
target = 'category'
# COMPLETE THE CODE IN THIS CELL
y = train[target] - 1
X = train['description']
X_test = test['description']

In [16]:
X.shape, y.shape, X_test.shape

((2586,), (2586,), (288,))

In [17]:
y.unique()

array([1, 0, 3, 2])
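
The `- 1` shift above maps the categories 1 to 4 onto 0 to 3, which is the label range that `XGBClassifier` expects. A possible alternative, sketched below on toy labels (not the competition data), is `LabelEncoder` from `sklearn`, which performs the same mapping and can undo it before you write a submission file.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for train['category']; the values are invented
raw_labels = np.array([2, 1, 2, 1, 4, 3])

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(raw_labels)    # sorted classes mapped to 0..3
restored = encoder.inverse_transform(y_encoded)  # back to the original 1..4

print(y_encoded)
print(restored)
```

For contiguous labels like these, the encoder and the manual shift give identical results; the encoder is mainly useful when the labels have gaps or are strings.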

### 1.3 Specify the Model and Define the Pipeline Components

For the classifier model, you can try any or several of
* `RandomForestClassifier()` or `GradientBoostingClassifier()` from the `sklearn` library
* `XGBClassifier()` from the `xgboost` library
* `CatBoostClassifier()` from the `catboost` library
* `LGBMClassifier()` from the `lightgbm` library


In [18]:
# limit max_features to 500 to speed up training on Colab.
# COMPLETE THE CODE IN THIS CELL
vect = TfidfVectorizer(stop_words='english', max_features=500, ngram_range=(1,2))
clf = XGBClassifier(learning_rate=0.1,
                    max_depth=5,
                    random_state=42,
                    use_label_encoder=False,
                    eval_metric='logloss'
                    )

pipe = Pipeline([
    ('vect', vect),
    ('clf', clf)
])
pipe

### 1.4 Define Your Search Space
You're looking for both the best hyperparameters of your vectorizer and your classification model.
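
As a rough illustration of how one search space can cover both the vectorizer and the classifier, the sketch below runs a small `RandomizedSearchCV` over a toy pipeline. The corpus, the labels, and the parameter ranges are all invented for the example, and `RandomForestClassifier` stands in for whichever model you choose. Each key has the form `<step name>__<parameter name>`.

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (made up for illustration, not the competition data)
docs = [
    "smoky peat and brine", "peat smoke with sea salt",
    "heavy smoke and peat", "brine peat and ash",
    "sweet vanilla and caramel", "caramel honey and vanilla",
    "honeyed vanilla oak", "soft caramel and honey",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", RandomForestClassifier(n_estimators=20, random_state=0)),
])

# Lists are sampled uniformly; scipy distributions are sampled via .rvs()
param_distributions = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__max_df": [0.75, 1.0],
    "clf__max_depth": randint(2, 6),
}

search = RandomizedSearchCV(
    toy_pipe, param_distributions, n_iter=4, cv=2, random_state=0
)
search.fit(docs, labels)
print(search.best_params_)
```

If a key is misspelled, `toy_pipe.get_params().keys()` lists every valid name.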

In [20]:
# COMPLETE THE CODE IN THIS CELL
# Parameters to search in dictionary
parameters = {
    'vect__max_features': [500],
    'vect__max_df': [0.75],
    'vect__analyzer':['word'],
    'clf__max_depth':[15],
    'clf__n_estimators':[1500],
    'clf__learning_rate':[0.1]
}

# Implement a grid search with cross-validation
grid_search = GridSearchCV(pipe, parameters, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X, y)

# Display the best score from the grid search
print(grid_search.best_score_)

Fitting 3 folds for each of 1 candidates, totalling 3 fits


Parameters: { "use_label_encoder" } are not used.



0.87122969837587


In [21]:
# Display the best parameters from the grid search
print(grid_search.best_params_)

{'clf__learning_rate': 0.1, 'clf__max_depth': 15, 'clf__n_estimators': 1500, 'vect__analyzer': 'word', 'vect__max_df': 0.75, 'vect__max_features': 500}


### 1.5 Make a Submission File
*Note:* In a typical Kaggle competition, you are only allowed two submissions a day, so only submit when your predicted test accuracy is the highest you can make it. For this competition the max daily submissions are capped at **20**.  The submission file is made from the results of running your best model on the **test data set**, for which we don't get the targets.<br><br>

In [22]:
# COMPLETE THE CODE IN THIS CELL
# Predictions on **test** sample
pred = grid_search.predict(X_test)
pred = pred + 1

In [23]:
# COMPLETE THE CODE IN THIS CELL
submission = pd.DataFrame({'id': test['id'], 'category': pred})
submission['category'] = submission['category'].astype('int64')

In [24]:
# Make Sure the Category is an Integer
submission.head()

Unnamed: 0,id,category
0,955,2
1,3532,2
2,1390,1
3,1024,1
4,1902,1


In [25]:
# import os
# print(os.listdir())

In [61]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model
submission_number = 0

submission.to_csv(f'submission{submission_number}.csv', index=False)

In [60]:
# Download submission to local machine from this Google Colab notebook
from google.colab import files
files.download(f'submission{submission_number}.csv')


### 1.6 Submit your results to `kaggle` and get your score

First, upload the `kaggle.json` API token file from your local machine.<br>
Do this by clicking the file icon in the left sidebar, <br>
then clicking the file icon with an up arrow inside it at the upper left, <br>
then navigating to and selecting the `kaggle.json` file on your local machine.<br>
If you have a Kaggle account, `kaggle.json` is usually found in a folder called `.kaggle` on your local machine.<br>
Note that there is no cost involved in registering for a Kaggle account.<br><br>

Then make a folder `/root/.kaggle` in this notebook,<br>
and move your `kaggle.json` file into the `/root/.kaggle/` folder.

In [34]:
!mkdir -p /root/.kaggle/
# Assumes kaggle.json was uploaded to Colab's default working directory, /content
!mv /content/kaggle.json /root/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json # to safeguard your privacy
!ls -l /root/.kaggle/


Now you can submit your predictions to Kaggle
directly from this Colab notebook!<br>


In [35]:
!kaggle competitions submit whiskey-201911 -f submission0.csv -m "submission0"

Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python3.10/dist-packages/kaggle/__init__.py", line 7, in <module>
    api.authenticate()
  File "/usr/local/lib/python3.10/dist-packages/kaggle/api/kaggle_api_extended.py", line 407, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /root/.config/kaggle. Or use the environment method. See setup instructions at https://github.com/Kaggle/kaggle-api/


You can check your score on the [Kaggle Whisky Competition website](https://www.kaggle.com/c/whiskey-201911/submissions),
<br>on the "My Submissions" tab:

_This submission got a score of $0.94186$_

## Challenge

You're trying to achieve a minimum of 75% Accuracy on your model.

## 2. Add Latent Semantic Indexing to your pipeline (Learn)
<a id="p2"></a>

### Follow Along
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try:
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (LSI) into your pipeline. *Note:* You can grid search a nested pipeline, but you have to chain the step names with double underscores, e.g. `lsa__svd__n_components` when the outer step is named `lsa` and the inner step is named `svd`
4. Make a submission to Kaggle
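
The double-underscore naming for nested pipelines can be sketched as follows. The corpus is made up, and the step names `lsa`, `vect`, and `svd` are simply the ones chosen for this example; the parameter path always mirrors whatever names you give your own steps.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy corpus (invented for illustration)
docs = [
    "smoky peat and brine", "peat smoke with sea salt",
    "sweet vanilla and caramel", "caramel honey and vanilla",
    "dried fruit and sherry oak", "sherry cask with raisin fruit",
]

# Inner pipeline: TF-IDF followed by truncated SVD
lsa_sketch = Pipeline([
    ("vect", TfidfVectorizer()),
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
])

# Outer pipeline wraps the inner one under the step name "lsa"
outer = Pipeline([("lsa", lsa_sketch)])

# outer step -> inner step -> parameter, joined by double underscores
outer.set_params(lsa__svd__n_components=3)

topics = outer.fit_transform(docs)
print(topics.shape)  # one row per document, one column per SVD component
```

The same `lsa__svd__n_components` string is what you would use as a key in a grid search dictionary.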


### 2.1 Define Pipeline Components

Nest pipelines to perform SVD on our vectorization (LSA)

In [66]:
# COMPLETE THE CODE IN THIS CELL
# Transforming our Vectorization with SVD is how LSA generates topic columns
svd = TruncatedSVD(n_components=2, algorithm='randomized', n_iter=10)

# vectorizer and classifier like before
vect = TfidfVectorizer(stop_words='english', max_features=500, ngram_range=(1,2))
clf = XGBClassifier()

# LSA pipeline with vectorizer & truncated SVD
lsa = Pipeline([('vect', vect), ('svd', svd)])

# combine LSA pipeline together with classifier
pipe = Pipeline([('lsa', lsa), ('clf', clf)])

### 2.2 Define Your grid search space and run a grid search with cross-validation
You're looking for both the best hyperparameters of your vectorizer and your classification model.

In [67]:
# COMPLETE THE CODE IN THIS CELL
parameters = {
    'lsa__svd__n_components': [50],
    'lsa__vect__max_df': [0.75],
    'clf__max_depth': [15],
    'clf__n_estimators': [1500],
    'clf__learning_rate': [0.1]
}

grid_search = GridSearchCV(pipe,parameters, cv=3, n_jobs=-1, verbose=1)
grid_search.fit(X, y)

Fitting 3 folds for each of 1 candidates, totalling 3 fits


In [68]:
grid_search.best_score_

0.8959783449342614

In [69]:
grid_search.best_params_

{'clf__learning_rate': 0.1,
 'clf__max_depth': 15,
 'clf__n_estimators': 1500,
 'lsa__svd__n_components': 50,
 'lsa__vect__max_df': 0.75}

### 2.3 Make a Submission File
See section 1.6 above for instructions on how to submit your results file to `kaggle` and get your score.

In [70]:
# Predictions on test sample
pred = grid_search.predict(test['description'])
pred = pred + 1

In [71]:
submission = pd.DataFrame({'id': test['id'], 'category':pred})
submission['category'] = submission['category'].astype('int64')

In [72]:
# Make Sure the Category is an Integer
submission.head()

Unnamed: 0,id,category
0,955,2
1,3532,2
2,1390,1
3,1024,1
4,1902,1


In [82]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model
submission_number = 1
submission.to_csv(f'submission{submission_number}.csv', index=False)

In [83]:
# Download submission to your local machine from this Colab notebook
from google.colab import files
files.download(f'submission{submission_number}.csv')


## Challenge

Continue to apply Latent Semantic Indexing (LSI) to various datasets.

## 3. Add Spacy Word Embeddings
<a id="p3"></a>

## Challenge

What you should be doing now:
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try:
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (LSI) into your pipeline. *Note:* You can grid search a nested pipeline, but you have to chain the step names with double underscores, e.g. `lsa__svd__n_components` when the outer step is named `lsa` and the inner step is named `svd`
    - Try to extract word embeddings with Spacy and use document vectors made from those word embeddings as your features for a classification model.
4. Make a submission to Kaggle
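
The idea of building a document vector from word embeddings can be sketched without loading a model. In the example below, the three-dimensional "embeddings" and the helper name `doc_vector_sketch` are hypothetical. By default, spaCy's `Doc.vector` is the average of the token vectors; the sketch mimics that averaging, and, as an assumption about spaCy's behavior, gives out-of-vocabulary tokens a zero vector.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; real model vectors are much wider
toy_vectors = {
    "smoky": np.array([1.0, 0.0, 0.0]),
    "sweet": np.array([0.0, 1.0, 0.0]),
    "finish": np.array([0.0, 0.0, 1.0]),
}

def doc_vector_sketch(text, vectors, dim=3):
    # Look up each whitespace-separated token; unknown tokens get zeros
    token_vecs = [vectors.get(tok, np.zeros(dim)) for tok in text.split()]
    if not token_vecs:
        return np.zeros(dim)
    # The document vector is the mean of its token vectors
    return np.mean(token_vecs, axis=0)

texts = ["smoky finish", "sweet sweet finish"]
X_sketch = np.vstack([doc_vector_sketch(t, toy_vectors) for t in texts])
print(X_sketch)
```

Every document ends up with the same number of features regardless of its length, which is what lets a classifier consume these vectors directly.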

### 3.1 Process the data set with spacy

In [65]:
# Apply to your Dataset

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier

from scipy.stats import randint

param_dist = {

    'max_depth' : randint(3,10),
    'min_samples_leaf': randint(2,15)
}

In [87]:
# Continue Word Embedding Work Here
nlp = spacy.load("en_core_web_md")

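def get_word_vectors(docs):
    """Return one spaCy document vector (the average of its token vectors) per text."""
    # nlp.pipe streams the texts through the model, which is faster than calling nlp() per document
    return np.array([doc.vector for doc in nlp.pipe(docs)])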

X_train_emb = get_word_vectors(train['description'])
X_test_emb = get_word_vectors(test['description'])

In [88]:
rfc = RandomForestClassifier(oob_score=True)

rfc.fit(X_train_emb, y)

In [90]:
# massively overfit with the Random Forest
print('Training Accuracy: ', rfc.score(X_train_emb, y))

Training Accuracy:  1.0


Here we use oob_score_ (out-of-bag score) as a **proxy** for the test score;<br>
for your submission, you will predict on the test set, as before
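
To illustrate why the out-of-bag score is a more honest estimate than training accuracy, here is a small sketch on synthetic data from `make_classification`. The dataset and all settings are invented for the example. Each tree is scored only on the rows left out of its bootstrap sample, so the OOB score behaves like a validation score without needing a separate split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 4-class problem standing in for the embedding features
X_syn, y_syn = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    n_classes=4, random_state=0,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0, stratify=y_syn
)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_tr, y_tr)

train_acc = forest.score(X_tr, y_tr)   # optimistic: trees have seen these rows
oob_acc = forest.oob_score_            # rows scored only by trees that skipped them
val_acc = forest.score(X_val, y_val)   # genuinely held-out rows

print("train accuracy:   ", train_acc)
print("OOB accuracy:     ", oob_acc)
print("held-out accuracy:", val_acc)
```

Comparing the three printed numbers shows how far the training accuracy can sit above the other two estimates.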

In [91]:
# validation looks decent without any tuning

rfc.oob_score_

0.679814385150812

### 3.2 Make a Submission File
See section 1.6 above for instructions on how to submit your results file to `kaggle` and get your score.


In [92]:
# YOUR CODE HERE
# Perform RandomizedSearchCV
random_search = RandomizedSearchCV(rfc, param_distributions=param_dist, n_iter=10, cv=3, n_jobs=-1, verbose=1)
random_search.fit(X_train_emb, y)

# Output the best score and parameters
print("Best Score:", random_search.best_score_)
print("Best Parameters:", random_search.best_params_)


Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best Score: 0.6666666666666666
Best Parameters: {'max_depth': 7, 'min_samples_leaf': 6}


In [93]:

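# Make predictions on test set; the model was trained on y = category - 1,
# so shift the predictions back to the original 1-based labels
test_pred = random_search.predict(X_test_emb) + 1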


In [94]:
# Create and save submission
submission = pd.DataFrame({
    'id': test['id'],
    'category': test_pred
})

In [95]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model
submission_number = 2
submission.to_csv(f'submission{submission_number}.csv', index=False)

In [96]:
# Download submission to local machine from Google Colab
from google.colab import files
files.download(f'submission{submission_number}.csv')


### 3.3 Submit your predictions to Kaggle


---



In [100]:
# YOUR CODE HERE


# Post Lecture Assignment (Stretch)
<a id="p4"></a>

Your primary assignment this afternoon is to achieve a minimum of 80% accuracy on the Kaggle competition. <br>
Once you've accomplished that, do (1), and either (2) or (3):

1. Research "Sentiment Analysis". Provide answers in markdown to the following questions:
    - What is "Sentiment Analysis"?
    - Is Document Classification different from "Sentiment Analysis"? Provide evidence for your response
    - How do people create labeled sentiment data? Are those labels really sentiment?
    - What are common applications of sentiment analysis?

2. Singular Value Decomposition (SVD) is one of the most important and powerful methods in Applied Mathematics and in all of Machine Learning. Principal Components Analysis (PCA), which we used in Module 2, is closely related to SVD. Research SVD using the resources below. Then write a few paragraphs explaining, in your own words, your understanding of SVD and why it has become so important in Machine Learning. As you write, pretend that you will be presenting this summary orally as an answer to a question during a job interview.<br>

* [Daniela Witten](https://www.danielawitten.com/), a Professor of Mathematical Statistics at the University of Washington, recently penned a highly amusing and informative [tweetstorm](https://twitter.com/WomenInStat/status/1285611042446413824) about SVD, well worth reading!<br>
* [Stanford University Lecture on SVD](https://www.youtube.com/watch?v=P5mlg91as1c) <br>
* [StatQuest Principal Components Analysis](https://www.youtube.com/watch?v=FgakZw6K1QQ)<br>
* [Luis Serrano Principal Components Analysis](https://www.youtube.com/watch?v=g-Hb26agBFg)<br>

3. Research which other models can be used for text classification -- see [Multi-Class Text Classification Model Comparison and Selection](https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568)
  - Try a few other classical machine learning models, and compare with the gradient boosting results
  - Neural Networks are becoming more popular for document classification. Why is that the case?
  - If you have the time and interest, check out this [text classification documentation](https://developers.google.com/machine-learning/guides/text-classification/step-2-5) from Google
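
As a starting point for option (2), the sketch below checks three standard SVD facts numerically on a random matrix: the factors rebuild the original matrix, truncating to the top `k` singular values gives an approximation whose Frobenius error equals the root sum of squares of the dropped singular values, and the squared singular values of centered data match the covariance eigenvalues that PCA uses. The matrix and the choice `k = 2` are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))  # arbitrary example matrix

# Thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_rebuilt = U @ np.diag(s) @ Vt

# Rank-k approximation keeps only the k largest singular values
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
err = np.linalg.norm(A - A_k)               # Frobenius norm of the residual
expected_err = np.sqrt(np.sum(s[k:] ** 2))  # energy in the dropped values

# PCA link: covariance eigenvalues equal s_c**2 / (n - 1) for centered data
A_centered = A - A.mean(axis=0)
s_c = np.linalg.svd(A_centered, compute_uv=False)
cov_eigs = np.sort(np.linalg.eigvalsh(np.cov(A_centered, rowvar=False)))[::-1]

print(np.allclose(A, A_rebuilt))
print(err, expected_err)
print(s_c ** 2 / (A.shape[0] - 1), cov_eigs)
```

`TruncatedSVD` in Part 2 applies the same rank-`k` idea to the TF-IDF matrix without centering it, which keeps the sparse matrix sparse.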
   