Lambda School Data Science

*Unit 4, Sprint 1, Module 3*

---

# Document Classification (Assignment)

This notebook is for you to practice skills during lecture.

Today's guided module project and assignment will be different. You already know how to do classification. You already know how to extract features from documents. So? That means you're ready to combine and practice those skills in a Kaggle competition. We will open with a five-minute sprint explaining the competition, and then give you 25 minutes to work. After those 25 minutes are up, I will give a 5-minute demo of an NLP technique that will help you with document classification (*and **maybe** the competition*).

Today's all about having fun and practicing your skills.

## Sections
* <a href="#p1">Part 1</a>: Text Feature Extraction & Classification Pipelines
* <a href="#p2">Part 2</a>: Latent Semantic Indexing
* <a href="#p3">Part 3</a>: Word Embeddings with Spacy
* <a href="#p4">Part 4</a>: Post Lecture Assignment

# Text Feature Extraction & Classification Pipelines (Learn)
<a id="p1"></a>

## Follow Along 

What you should be doing now:
1. Join the Kaggle Competition
2. Download the data
3. Train a model (try using the pipe method I just demoed)

### Load Competition Data

In [28]:
import os

In [3]:
PATH_Train = os.path.join(os.path.curdir,  "data", "train.csv")
PATH_TEST = os.path.join(os.path.curdir,  "data", "test.csv")

In [4]:
os.path.curdir

'.'

In [5]:
import pandas as pd

# You may need to change the path
train = pd.read_csv(PATH_Train)
test = pd.read_csv(PATH_TEST)
print(train.shape, test.shape)

(4087, 3) (1022, 2)


In [12]:
train.head()

Unnamed: 0,id,description,ratingCategory
0,1321,"\nSometimes, when whisky is batched, a few lef...",1
1,3861,\nAn uncommon exclusive bottling of a 6 year o...,0
2,655,\nThis release is a port version of Amrut’s In...,1
3,555,\nThis 41 year old single cask was aged in a s...,1
4,1965,"\nQuite herbal on the nose, with aromas of dri...",1


In [13]:
# Distribution of ratingCategory: 0 (Excellent), 1 (Good), 2 (Poor)
train.ratingCategory.value_counts()

1    2881
0    1141
2      65
Name: ratingCategory, dtype: int64
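
The counts above show a strong class imbalance, so a model that always predicts the majority class already lands near the 70% mark. A minimal sketch of that baseline, rebuilding the labels from the printed counts (the `labels` array and the all-zero feature matrix are illustrative stand-ins, not the competition data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Rebuild a label vector from the value_counts() output above
counts = {1: 2881, 0: 1141, 2: 65}
labels = np.concatenate([np.full(n, label) for label, n in counts.items()])

# Features are irrelevant for a majority-class baseline, so use zeros
dummy_features = np.zeros((len(labels), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(dummy_features, labels)
baseline_accuracy = baseline.score(dummy_features, labels)

print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any model worth submitting should beat this number on a held-out split.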

In [14]:
# Read a few reviews from the "Excellent" category
pd.set_option('display.max_colwidth', 0)
train[train.ratingCategory == 0].sample(3)

Unnamed: 0,id,description,ratingCategory
2073,4274,"\nThe first release from Wolfburn distillery is 3 years old. Matured in a mix of Spanish and American oak quarter casks previously used by an Islay distillery. The nose is soft and belies its youth, offering vanilla, lemon, ginger, and light smoke. The early palate is grassy. Sweeter fruit notes soon develop with more vanilla and ginger, plus white pepper. The finish is quite long and slightly smoky. Much to look forward to as this ages!",0
2799,53,"\nIf Canada made bourbon (it doesn’t), it would taste like this massive dram. The mashbill of 60% corn, 36% rye, and 4% malted barley is identical to that used for Crown Royal Hand Selected Barrel Coffey rye. Beer still distillation and virgin oak barrels yield huge vanillas, rye spices, barrel tones, cherries, dark fruits, soaring floral esters, and gingery, peppery spices. Strong woodiness, slightly pulling tannins, and something almost chocolaty.",0
3965,3930,"\nA very different whisky to its unaged namesake, and most unlike any of the other blends tasted for this issue. That’s no bad thing.\r\nThis is less sweet than most blends, with tobacco leaf and ashtray to the fore, and a dusty, grainy note with a touch of oak, grape skin, and sweet heather. That said, not a lot of evidence of the 12 years in cask.",0


In [15]:
# Read a few reviews from the "Poor" category
train[train.ratingCategory == 2].sample(3)

Unnamed: 0,id,description,ratingCategory
2990,5105,"\nWith its overt floral perfume notes and the scent of children’s powdered candy, this whisky is difficult to enjoy. Its unctuous artificial flavors are equally unsuitable for cocktails, mixing, or sipping. Fruity, winey, lavender notes duke it out with baby cereal and artificial coconut. The saving graces? A late lovely bitterness, long gingery burn, and creamy body. But then jujubes, grape gum and artificial bananas kick in and it’s over.",2
2606,5074,"\nPerhaps my least favorite of all the Experimental Collection releases to date. The nose shows nicely, but it comes across as rather aggressive and harsh on the palate toward the finish, which the label describes as being “earthy.” Otherwise, the whiskey is pleasantly sweet, with molasses, date, and fig, plus charcoal, leather, and bitter resin in the mix. Price is per 375 ml.",2
3508,5099,"\nReal maple syrup has an earthy, woodsy aroma; maple flavoring has strong overtones of coconut, and so does Cabin Fever. The nose evokes dried, sweetened baking coconut, while the sweet and spicy palate is a hot, liquid, coconut macaroon. Peppery notes suggest that it’s whisky, but without any traces of barrel aging that’s as close as you get. Best part? The long, spicy finish with its confection-sweet coconut.",2


### Split the Training Set into Train/Validation

In [6]:
from sklearn.model_selection import train_test_split
# Doing a train test split
X_train, X_test, y_train, y_test = train_test_split(train['description'], 
                                                    train['ratingCategory'], 
                                                    test_size=0.2, 
                                                    stratify=train['ratingCategory'],
                                                    random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(3269,) (818,) (3269,) (818,)
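
To see what `stratify` buys you, the sketch below repeats the split on synthetic labels with the same class counts as the training file and compares class proportions in the two parts. The `X` and `y` series here are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the training file
y = pd.Series([1] * 2881 + [0] * 1141 + [2] * 65, name="ratingCategory")
X = pd.Series([f"doc {i}" for i in range(len(y))], name="description")

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Class proportions should be nearly identical in both parts
proportions = pd.DataFrame({
    "train": y_tr.value_counts(normalize=True),
    "validation": y_val.value_counts(normalize=True),
}).sort_index()
print(proportions)
```

Without stratification, the rare class 2 (65 rows) could easily end up over- or under-represented in the validation part.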


### Define Pipeline Components

In [7]:
# Imports for the random forest classifier
# and the TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

In [9]:
vect = TfidfVectorizer(stop_words="english", ngram_range=(1,2))
clf = RandomForestClassifier()

pipe = Pipeline([('vect', vect), ('clf', clf)])
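
As a rough illustration of how such a pipeline behaves end to end, here is a sketch on a tiny made-up corpus. The documents, the labels, and the `toy_` names are assumptions for the example only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Tiny made-up corpus; labels are arbitrary (0 = liked, 1 = disliked)
toy_docs = [
    "rich vanilla and honey with a long finish",
    "sweet honey, vanilla, and soft oak",
    "harsh artificial flavors and a thin body",
    "thin, harsh, and artificial on the palate",
]
toy_labels = [0, 0, 1, 1]

toy_pipe = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# fit() vectorizes the raw strings, then trains the classifier
toy_pipe.fit(toy_docs, toy_labels)

# predict() reuses the fitted vocabulary on unseen text
toy_pred = toy_pipe.predict(["vanilla and honey on the nose"])
print(toy_pred)
```

Because the vectorizer lives inside the pipeline, raw strings can be passed to both `fit` and `predict`, and cross-validation refits the vocabulary on each training fold.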

In [18]:
from sklearn.model_selection import GridSearchCV

### Define Your Search Space
You're looking for both the best hyperparameters of your vectorizer and your classification model. 

In [26]:
parameters = {
    'vect__max_df': (0.75, 1.0),
    'clf__max_depth':(5,10,15,None),
    'vect__min_df': (2,5,10),
    'vect__max_features': (5000, 10000, 7000),
    'clf__n_estimators': (100, 500)

}

grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=4, verbose=1)
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 144 candidates, totalling 720 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   32.6s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:  2.4min
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:  4.7min
[Parallel(n_jobs=4)]: Done 720 out of 720 | elapsed: 12.7min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 2),
                                                        no

In [27]:
grid_search.best_score_

0.7421283092384712

In [28]:
grid_search.best_params_

{'clf__max_depth': None,
 'clf__n_estimators': 100,
 'vect__max_df': 1.0,
 'vect__max_features': 5000,
 'vect__min_df': 2}
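
The 144 candidates and 720 fits reported in the fit log follow directly from the size of the search space. A quick way to check the cost of a grid before launching it is `ParameterGrid`, sketched here with the same dictionary:

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    'vect__max_df': (0.75, 1.0),
    'clf__max_depth': (5, 10, 15, None),
    'vect__min_df': (2, 5, 10),
    'vect__max_features': (5000, 10000, 7000),
    'clf__n_estimators': (100, 500),
}

# One candidate per combination: 2 * 4 * 3 * 3 * 2
n_candidates = len(ParameterGrid(parameters))
n_fits = n_candidates * 5  # cv=5 folds per candidate
print(n_candidates, n_fits)  # 144 720
```

If the count is too high for the time available, trimming one list or switching to `RandomizedSearchCV` is usually the simplest fix.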

### Make a Submission File
*Note:* In a typical Kaggle competition, you are only allowed two submissions a day, so you only submit if you feel you cannot achieve higher test accuracy. For this competition the max daily submissions are capped at **20**. Submit for each demo and for your assignment. 

In [29]:
# Predictions on test sample
pred = grid_search.predict(test['description'])

In [31]:
submission = pd.DataFrame({'id': test['id'], 'ratingCategory':pred})
submission['ratingCategory'] = submission['ratingCategory'].astype('int64')

In [32]:
# Make Sure the Category is an Integer
submission.head()

Unnamed: 0,id,ratingCategory
0,3461,1
1,2604,1
2,3341,1
3,3764,1
4,2306,1


In [33]:
subNumber = 0

In [36]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model

submission.to_csv(f'./submission{subNumber}.csv', index=False)
subNumber += 1
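
If you prefer the timestamp option mentioned in the comment, one possible sketch is below. The `submission_path` helper and its naming scheme are hypothetical, not part of the course materials.

```python
from datetime import datetime
from pathlib import Path

def submission_path(directory=".", prefix="submission"):
    """Build a file name like submission_20240131-153000.csv (hypothetical scheme)."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    return Path(directory) / f"{prefix}_{stamp}.csv"

path = submission_path()
print(path)
# Then save as before: submission.to_csv(path, index=False)
```

A timestamp survives kernel restarts, whereas a counter such as `subNumber` resets whenever its cell is rerun.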

## Challenge

You're trying to achieve a minimum of 70% Accuracy on your model.

# Latent Semantic Indexing (Learn)
<a id="p2"></a>

## Follow Along
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try: 
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (lsi) into your pipeline. *Note:* You can grid search a nested pipeline, but you have to use double underscores ie `lsi__svd__n_components`
4. Make a submission to Kaggle 
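
The double-underscore naming for nested pipelines can be checked with `get_params()` before running a search. A small sketch, with step names chosen to match the note above:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

lsi = Pipeline([("vect", TfidfVectorizer()), ("svd", TruncatedSVD(n_components=2))])
pipe = Pipeline([("lsi", lsi), ("clf", RandomForestClassifier())])

# Every tunable parameter is addressable as <step>__<substep>__<param>
nested_keys = [k for k in pipe.get_params() if k.startswith("lsi__svd__")]
print(nested_keys)

# set_params uses the same names that GridSearchCV expects
pipe.set_params(lsi__svd__n_components=5)
print(pipe.named_steps["lsi"].named_steps["svd"].n_components)
```

A misspelled key in a search grid raises an error at fit time, so listing the valid keys first can save a failed run.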


In [1]:
# Imports for latent semantic indexing
from sklearn.decomposition import TruncatedSVD

In [None]:
# These are the best parameters for the model from
# the grid search above.
#{'clf__max_depth': None,
# 'clf__n_estimators': 100,
# 'vect__max_df': 1.0,
# 'vect__max_features': 5000,
# 'vect__min_df': 2}

In [11]:
svd = TruncatedSVD(n_components=100, algorithm="randomized",
                   n_iter=10,
                   random_state=49)
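
To get a feel for what `TruncatedSVD` does to a TF-IDF matrix, the sketch below reduces a six-document toy corpus from nine term columns to two components. The corpus and the `toy_` names are made up for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "vanilla honey oak",
    "honey oak smoke",
    "smoke peat brine",
    "peat brine iodine",
    "vanilla caramel toffee",
    "caramel toffee honey",
]

tfidf = TfidfVectorizer().fit_transform(toy_docs)  # sparse, one column per term
toy_svd = TruncatedSVD(n_components=2, n_iter=10, random_state=49)
reduced = toy_svd.fit_transform(tfidf)             # dense, one column per component

print(tfidf.shape, "->", reduced.shape)
print(toy_svd.explained_variance_ratio_.sum())
```

The explained variance ratio gives a rough sense of how much information the chosen number of components retains, which helps when picking values for `n_components` in a search.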

### Define Pipeline Components

In [12]:
# Reuse the best vectorizer parameters from the first grid search
vect = TfidfVectorizer(stop_words="english", ngram_range=(1,2), min_df=2, max_features=5000)
clf = RandomForestClassifier()


In [20]:
# Alternative search space, kept for reference; the grid search
# below uses `parameters`, not `params`
params = {
    "lsi__svd__n_components": [65, 100, 250],
    "lsi__vect__max_df": [.9, .95, 1.0],
    "clf__n_estimators": [5, 10, 20, 100]
}

parameters = {
    'lsi__svd__n_components': [10,100,250],
    'lsi__vect__max_df': (0.75, 1.0),
    'clf__max_depth':(5,10,15,20)
}

In [16]:
lsi = Pipeline([("vect", vect), ("svd", svd)])

pipe = Pipeline([('lsi', lsi), ('clf', clf)])

### Define Your Search Space
You're looking for both the best hyperparameters of your vectorizer and your classification model. 

In [21]:


grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=3, verbose=1)
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:   43.4s
[Parallel(n_jobs=3)]: Done 120 out of 120 | elapsed:  2.4min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('lsi',
                                        Pipeline(memory=None,
                                                 steps=[('vect',
                                                         TfidfVectorizer(analyzer='word',
                                                                         binary=False,
                                                                         decode_error='strict',
                                                                         dtype=<class 'numpy.float64'>,
                                                                         encoding='utf-8',
                                                                         input='content',
                                                                         lowercase=True,
                                                                         max_df=1.0,
             

In [22]:
# finding the best parameters
grid_search.best_params_

{'clf__max_depth': 10, 'lsi__svd__n_components': 10, 'lsi__vect__max_df': 1.0}

In [23]:
grid_search.best_score_

0.7265254225381794

### Make a Submission File

In [40]:
X_train.head(2)

1782    \nVery fragrant, with notes of fresh ripe plum...
3414    \nThe good news: This is one of the best Highl...
Name: description, dtype: object

In [24]:
# Predictions on test sample
pred = grid_search.predict(test['description'])

In [25]:
submission = pd.DataFrame({'id': test['id'], 'ratingCategory':pred})
submission['ratingCategory'] = submission['ratingCategory'].astype('int64')

In [26]:
# Make Sure the Category is an Integer
submission.head()

Unnamed: 0,id,ratingCategory
0,3461,1
1,2604,1
2,3341,1
3,3764,1
4,2306,1


In [None]:
PATH_SUBMISSION = os.path.join(".","submission",)

In [34]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model
subNumber = 1
submission.to_csv(os.path.join(".",  f"submission{subNumber}.csv"), index=False)
subNumber += 1

## Challenge

Continue to apply Latent Semantic Indexing (LSI) to various datasets. 

# Word Embeddings with Spacy (Learn)
<a id="p3"></a>

## Follow Along

In [35]:
# Apply to your Dataset
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from scipy.stats import randint

param_dist = {
    'max_depth' : randint(3,10),
    'min_samples_leaf': randint(2,15)
}
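
The distributions above are meant for `RandomizedSearchCV`, which samples a fixed number of candidates instead of trying every combination. A minimal sketch of how they could be used, on synthetic data standing in for a matrix of document vectors (`X_demo`, `y_demo`, and the small `n_iter` are assumptions for the example):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for a matrix of document vectors
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=0)

param_dist = {
    "max_depth": randint(3, 10),         # integers 3..9 (upper bound excluded)
    "min_samples_leaf": randint(2, 15),  # integers 2..14
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_distributions=param_dist,
    n_iter=4,        # number of sampled candidates
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

The cost is controlled by `n_iter` rather than by the size of the search space, which makes it easier to fit a search into a fixed time budget.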

In [36]:
# doing the import for spacy here
import spacy
nlp = spacy.load("en_core_web_lg")

In [37]:
# getting all the vectors for the words from spacy
def get_vects(descrip_list):
    return [nlp(doc).vector for doc in descrip_list]
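
By default, spaCy's `Doc.vector` is an average of the token vectors, so each description becomes one fixed-length vector regardless of its length. The NumPy sketch below mimics that idea with hypothetical 3-dimensional word vectors and a plain whitespace tokenizer; it is an illustration of the averaging, not spaCy's actual implementation.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors; real spaCy vectors are much larger
toy_vectors = {
    "smoky": np.array([1.0, 0.0, 0.0]),
    "sweet": np.array([0.0, 1.0, 0.0]),
    "finish": np.array([0.0, 0.0, 1.0]),
}

def average_vector(text, vectors, dim=3):
    """Average the vectors of whitespace tokens; unknown tokens count as zeros."""
    tokens = text.lower().split()
    if not tokens:
        return np.zeros(dim)
    return np.mean([vectors.get(tok, np.zeros(dim)) for tok in tokens], axis=0)

doc_vector = average_vector("Smoky sweet finish", toy_vectors)
print(doc_vector)
```

Averaging discards word order, which is one reason these features may not beat TF-IDF n-grams on short, vocabulary-driven reviews.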

In [41]:
# Use the function to get the vectors
# for all the descriptions
vect_list = get_vects(X_train)

In [42]:
# Going to make a fresh new random forest classifier
rf_classifier = RandomForestClassifier()


In [44]:
# Fit the classifier on the document vectors
rf_classifier.fit(vect_list, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

### Make a Submission File

In [46]:
# Create the vectors for the validation split
# (X_test here is the held-out 20% of train.csv)
test_vect = get_vects(X_test)

In [47]:
# checking the prediction 
predictions = rf_classifier.predict(test_vect)

In [48]:
# Check the accuracy for the score
from sklearn.metrics import accuracy_score

In [49]:
score = accuracy_score(y_test, predictions)
print(f"The accuracy score is {score}")

The accuracy score is 0.7371638141809291


In [50]:
# Try the random forest again with the best parameters from the first grid search
rf_classifier = RandomForestClassifier(max_depth=None, n_estimators=100)

In [51]:
rf_classifier.fit(vect_list, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [52]:
predictions = rf_classifier.predict(test_vect)

In [53]:
score = accuracy_score(y_test, predictions)
print(f"The accuracy score is {score}")

The accuracy score is 0.7383863080684596


In [61]:
# Will recombine all the data and then will retrain the same model again

# Combining all the data
all_vect = vect_list + test_vect
all_target = y_train.to_list() + y_test.to_list()


In [62]:
rf_classifier.fit(all_vect, all_target)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [64]:
test_vect = get_vects(test['description'])

In [65]:
# Predictions on test sample
pred = rf_classifier.predict(test_vect)

In [67]:
submission = pd.DataFrame({'id': test['id'], 'ratingCategory':pred})
submission['ratingCategory'] = submission['ratingCategory'].astype('int64')

In [68]:
# Make Sure the Category is an Integer
submission.head()

Unnamed: 0,id,ratingCategory
0,3461,1
1,2604,1
2,3341,1
3,3764,1
4,2306,1


In [69]:
subNumber

2

In [70]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model
submission.to_csv(f"submission{subNumber}.csv", index=False)
subNumber += 1

## Challenge

What you should be doing now:
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try: 
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (lsi) into your pipeline. *Note:* You can grid search a nested pipeline, but you have to use double underscores ie `lsi__svd__n_components`
    - Try to extract word embeddings with Spacy and use those embeddings as your features for a classification model.
4. Make a submission to Kaggle 

# Post Lecture Assignment
<a id="p4"></a>

Your primary assignment this afternoon is to achieve a minimum of 70% accuracy on the Kaggle competition. Once you have achieved 70% accuracy, please work on the following: 

1. Research "Sentiment Analysis". Provide answers in markdown to the following questions: 
    - What is "Sentiment Analysis"? 
    - Is Document Classification different than "Sentiment Analysis"? Provide evidence for your response
    - How do you create labeled sentiment data? Are those labels really sentiment?
    - What are common applications of sentiment analysis?
2. Research why word embeddings worked better for the lecture notebook than on the whiskey competition.
    - This [text classification documentation](https://developers.google.com/machine-learning/guides/text-classification/step-2-5) from Google might be of interest
    - Neural Networks are becoming more popular for document classification. Why is that the case?

Answers for part 1

Sentiment analysis is a form of text classification that labels text by the emotion or sentiment it expresses, for example whether the text is negative or positive (happy).

Sentiment analysis is a flavour of document classification. You are classifying the text into a type of sentiment: deciding whether a sentence (or a word) is negative, positive, or neutral.

One way to create labeled data for sentiment analysis is to write rules yourself and then apply those rules to the data to label it.

A common application of sentiment analysis is tracking the sentiment toward a brand on the internet, or monitoring whether reviews for a product are positive or negative.

Answers for question 2

It may be that word embeddings generally work better when the ratio of the number of samples to the number of words per sample is more than 1500. The whiskey training set has only 4,087 samples, so its ratio is less than 1500.
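
The ratio in question can be computed in a few lines. The sketch below takes "words per sample" as the median whitespace-separated word count, which is an assumption of this example; `samples_to_words_ratio` and the toy texts are hypothetical.

```python
import numpy as np

def samples_to_words_ratio(texts):
    """Number of samples divided by the median number of words per sample."""
    words_per_sample = [len(text.split()) for text in texts]
    return len(texts) / np.median(words_per_sample)

# Made-up corpus: 4 samples with 2, 4, 4, and 6 words
toy_texts = [
    "smoky finish",
    "sweet vanilla and oak",
    "long dry peppery finish",
    "a soft nose with light smoke",
]
ratio = samples_to_words_ratio(toy_texts)
print(ratio)  # 1.0
```

Applied to the competition data, the call would be `samples_to_words_ratio(train['description'])`.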

One reason neural networks are becoming more popular for classification is that they can learn from what is in the data and also from what is not.
They may also be more resistant to skewed predictions when the classes in the training data are not well balanced.