Lambda School Data Science

*Unit 4, Sprint 1, Module 3*

---

# Document Classification (Assignment)

This notebook is for you to practice skills during lecture.

Today's guided module project and assignment will be different. You already know how to do classification. You already know how to extract features from documents. So? That means you're ready to combine and practice those skills in a Kaggle competition. We will open with a five-minute sprint explaining the competition, and then give you 25 minutes to work. After those 25 minutes are up, I will give a 5-minute demo of an NLP technique that will help you with document classification (*and **maybe** the competition*).

Today's all about having fun and practicing your skills.

## Sections
* <a href="#p1">Part 1</a>: Text Feature Extraction & Classification Pipelines
* <a href="#p2">Part 2</a>: Latent Semantic Indexing
* <a href="#p3">Part 3</a>: Word Embeddings with spaCy
* <a href="#p4">Part 4</a>: Post Lecture Assignment

# Text Feature Extraction & Classification Pipelines (Learn)
<a id="p1"></a>

## Follow Along 

What you should be doing now:
1. Join the Kaggle Competition
2. Download the data
3. Train a model (try using the pipe method I just demoed)

### Load Competition Data

In [1]:
import pandas as pd

# You may need to change the path
train = pd.read_csv('./whiskey-reviews-dspt7/train.csv')
test = pd.read_csv('./whiskey-reviews-dspt7/test.csv')
print(train.shape, test.shape)

(4087, 3) (1022, 2)


In [2]:
train.head()

Unnamed: 0,id,description,ratingCategory
0,1321,"\nSometimes, when whisky is batched, a few lef...",1
1,3861,\nAn uncommon exclusive bottling of a 6 year o...,0
2,655,\nThis release is a port version of Amrut’s In...,1
3,555,\nThis 41 year old single cask was aged in a s...,1
4,1965,"\nQuite herbal on the nose, with aromas of dri...",1


In [3]:
import re

SEQ_TO_REMOVE = ["\\\\n", "\\\\r"]

def clean_text(text):
    """
    Removes unnecessary sequences from the description.
    """
    for seq in SEQ_TO_REMOVE:
        text = re.sub(seq, '', text)
    return text.strip()

train["description"]  = train["description"].apply(clean_text)
test["description"]  = test["description"].apply(clean_text)

train.head()

Unnamed: 0,id,description,ratingCategory
0,1321,"Sometimes, when whisky is batched, a few lefto...",1
1,3861,An uncommon exclusive bottling of a 6 year old...,0
2,655,This release is a port version of Amrut’s Inte...,1
3,555,This 41 year old single cask was aged in a she...,1
4,1965,"Quite herbal on the nose, with aromas of dried...",1
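A quick check of what the patterns in `SEQ_TO_REMOVE` actually match may help here. The sketch below uses made-up strings (not competition data) and suggests that `"\\\\n"` is the regex for a literal backslash followed by the letter `n`, not for a real newline character. Real leading and trailing newlines are removed by `.strip()` instead.

```python
import re

pattern = "\\\\n"  # regex \\n: a literal backslash followed by the letter n

literal = "Peaty\\nand sweet"    # contains the two characters '\' and 'n'
newline = "\nPeaty and sweet\n"  # contains real newline characters

removed_literal = re.sub(pattern, "", literal)
untouched = re.sub(pattern, "", newline)

print(repr(removed_literal))    # the backslash-n pair is gone
print(repr(untouched))          # real newlines survive re.sub
print(repr(untouched.strip()))  # strip() drops leading/trailing newlines
```

If a description contained a real newline in the middle of the text, neither step would remove it, so it is worth inspecting a few raw rows before relying on this cleanup.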


In [4]:
# Distribution of ratingCategory: 0 (Excellent), 1 (Good), 2 (Poor)
train.ratingCategory.value_counts(normalize=True)

1    0.704918
0    0.279178
2    0.015904
Name: ratingCategory, dtype: float64

In [5]:
# Read a few reviews from the "Excellent" category
pd.set_option('display.max_colwidth', 0)
train[train.ratingCategory == 0].sample(3)

Unnamed: 0,id,description,ratingCategory
3230,4684,"The finish in question here is Muscatel casks and you can tell that from the start, as the nose is filled with a rich, sweet, and very pronounced dusky fruitiness — sloes and plums. The smoke as a result is diminished as are the grassy/bacony notes. While the smoke does emerge from its fruity bubble on the tongue, the effect is almost liqueur-like. It’s a very pleasing dram, but the question is, is it Caol Ila?",0
2791,4403,"South Africa’s most established distillery now makes Scotch-style single malt whisky that the country can be proud of. The downside is that it plays it safe, and the flavors on offer are subdued and subtle. That said though, there’s plenty to like here — delicate floral notes including rose, with a rich and honeyed heart, traces of exotic fruits including kumquat and kiwi, wispy smoke, and some cinnamon and paprika. Solid. €47.50 (Not available in the U.S.)",0
3049,4330,"Here’s a whisky not seen very often in the U.S. When it is seen, it’s from one of the independent bottlers. I have always felt that younger Glen Grant whiskies make a nice introduction to the single malt category-especially for a blend drinker trading up. The whisky is usually light to medium in body and uncomplicated-with no harsh edges to be particularly offensive. And so it is with this whisky. A soft, cereal grain maltiness marries nicely with floral, delicately fruity notes throughout. Gentle, dry but malty finish, with suggestions of shortbread cookies and vanilla. A nice representation of a younger Glen Grant. The flavors are clean and tight.",0


In [6]:
# Read a few reviews from the "Poor" category
train[train.ratingCategory == 2].sample(3)

Unnamed: 0,id,description,ratingCategory
2396,5088,"One of the most popular flavored whiskies today is Fireball cinnamon, so it’s no surprise that other cinnamon whiskies are entering the market. This one has a lovely, woody, cinnamon nose that bursts into sweet, blistering cinnamon on the palate. Cinnamon is a natural whisky flavor, but here, rather than complement the underlying whisky, it completely masks it. This is a cinnamon liqueur, and a good one. A fun shooter, perhaps, but it’s barely whisky.",2
3890,5032,"Citrus peel, light maple syrup, and almonds, with emerging grape and vanilla. Somewhat elegant in nature, but the flavors do not especially complement each other. (Exclusive to Binny’s Beverage Depot.)",2
1209,5092,"The sherry is very dominant and cloying, which is unfortunate. And I’m not crazy about the quality of the sherry (or perhaps even the wood it was aged in). I have great respect for both Highland Park and Binny’s, but this is somewhat disappointing for a Highland Park. Tasted twice, with the same opinion. (Bottled for Binny’s Beverage Depot)",2


### Split the Training Set into Train/Validation

In [7]:
from sklearn.model_selection import train_test_split

feature = 'description'
target = 'ratingCategory'

X_train, X_test, y_train, y_test = train_test_split(train[feature], 
                                                    train[target], 
                                                    test_size=0.2, 
                                                    stratify=train[target],
                                                    random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(3269,) (818,) (3269,) (818,)
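The `stratify` argument matters here because the smallest class is under 2% of the data. As a rough illustration, the sketch below uses synthetic labels with approximately the same class shares (an assumption for the demo, not the real data) and compares the class proportions in the two splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels roughly matching the class shares printed earlier
# (about 28% / 70% / 2%); illustrative only, not the competition data.
y = np.array([0] * 280 + [1] * 700 + [2] * 20)
X = np.arange(len(y)).reshape(-1, 1)

_, _, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)


def class_shares(labels):
    """Return the fraction of samples in each class, ordered by label."""
    return np.bincount(labels, minlength=3) / len(labels)


print(class_shares(y_tr))
print(class_shares(y_val))
```

Without `stratify`, a random split could leave the validation set with very few (or no) examples of the rarest class.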


### Define Pipeline Components

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline

vect = TfidfVectorizer(stop_words='english', ngram_range = (1,2))
clf = LinearSVC()

pipe = Pipeline([('vect', vect), ('clf', clf)])

### Define Your Search Space
You're looking for the best hyperparameters of both your vectorizer and your classification model.

In [9]:
from sklearn.model_selection import GridSearchCV

parameters = {
    'vect__max_df': (0.3, 1.0), # 0.3 and 0.5 may be better values to search
    'vect__min_df': (2, 5, 10), # ignore terms that appear in fewer than this many documents
    'vect__max_features': (5000, 20000),
    # 'l1' is a lasso-style penalty and 'l2' a ridge-style penalty. With dual=True
    # (shown in the estimator output below), LinearSVC rejects 'l1', so those
    # candidates fail to fit and score NaN.
    'clf__penalty': ('l1','l2'),
    'clf__C': (0.1, 0.5, 1., 2.) # C is the inverse of regularization strength: a higher C means less regularization
}

grid_search = GridSearchCV(pipe, parameters, cv=5, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 96 candidates, totalling 480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   20.5s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed:  2.7min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 2),
                                                        no

In [10]:
grid_search.best_score_

0.7540586612716653
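Beyond `best_score_`, a fitted search also exposes `best_params_` (just the winning combination) and `cv_results_` (scores for every candidate). The sketch below shows both on a made-up mini corpus with a much smaller grid than the one above; the documents, labels, and `toy_*` names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up mini corpus, only to make the sketch runnable.
docs = [
    "rich sherry and dark fruit", "sweet honey and vanilla",
    "lovely smoke and rich oak", "elegant fruit and sweet malt",
    "harsh and thin and bitter", "cloying and flat finish",
    "bitter wood and harsh burn", "thin flat and disappointing",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipe = Pipeline([("vect", TfidfVectorizer()), ("clf", LinearSVC())])
toy_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0],
}

toy_search = GridSearchCV(toy_pipe, toy_grid, cv=2)
toy_search.fit(docs, labels)

print(toy_search.best_params_)                    # winning combination only
print(toy_search.cv_results_["mean_test_score"])  # one score per candidate
```

Reading `best_params_` is usually easier than scanning the full `best_estimator_` repr, which also lists every default.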

In [11]:
grid_search.best_estimator_

Pipeline(memory=None,
         steps=[('vect',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=20000,
                                 min_df=2, ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 LinearSVC(C=0.5, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
          

### Make a Submission File
*Note:* In a typical Kaggle competition, you are only allowed two submissions a day, so you only submit if you feel you cannot achieve higher test accuracy. For this competition the max daily submissions are capped at **20**. Submit for each demo and for your assignment. 

In [12]:
# Predictions on test sample
pred = grid_search.predict(test['description'])

In [13]:
submission = pd.DataFrame({'id': test['id'], 'ratingCategory':pred})
submission['ratingCategory'] = submission['ratingCategory'].astype('int64')

In [14]:
# Make Sure the Category is an Integer
submission.head()

Unnamed: 0,id,ratingCategory
0,3461,1
1,2604,1
2,3341,1
3,3764,1
4,2306,1


In [15]:
subNumber = 0

In [16]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model

submission.to_csv(f'./whiskey-reviews-dspt7/submission{subNumber}.csv', index=False)
subNumber += 1
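The comment above mentions timestamps as an alternative to an integer counter. A minimal sketch of that option follows; the placeholder frame, the temporary folder, and the `submission-<timestamp>.csv` naming are assumptions, not a competition requirement.

```python
from datetime import datetime
from pathlib import Path
import tempfile

import pandas as pd

# Placeholder predictions; in the notebook this would be the `submission` frame.
demo_submission = pd.DataFrame({"id": [1, 2, 3], "ratingCategory": [1, 0, 1]})

out_dir = Path(tempfile.mkdtemp())  # stand-in for the competition folder
stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
out_path = out_dir / f"submission-{stamp}.csv"

demo_submission.to_csv(out_path, index=False)
print(out_path.name)
```

A timestamp survives kernel restarts, whereas a counter such as `subNumber` resets to 0 whenever its cell is re-run.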

## Challenge

You're trying to achieve a minimum of 70% Accuracy on your model.
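To put the 70% target in context: class 1 makes up about 70.5% of the training set, so always predicting class 1 would already score close to the target. The sketch below illustrates that baseline with `DummyClassifier` on synthetic labels that only approximate the real class shares.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with roughly the class shares printed earlier
# (28% / 70% / 2%); illustrative only, not the competition data.
y_demo = np.array([0] * 280 + [1] * 700 + [2] * 20)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_accuracy = baseline.score(X_demo, y_demo)

print(baseline_accuracy)  # equals the share of the majority class
```

A model is only learning something from the text if it beats this majority-class score on held-out data.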

# Latent Semantic Indexing (Learn)
<a id="p2"></a>

## Follow Along
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try: 
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (LSI) into your pipeline. *Note:* You can grid search a nested pipeline, but you have to chain the step names with double underscores, e.g. `lsi__svd__n_components`
4. Make a submission to Kaggle 
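The note about nested pipelines can be sketched as follows. The step names `lsi`, `vect`, `svd`, and `clf` and the candidate values are assumptions chosen for the example; the point is only how the double-underscore names are built.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Inner pipeline: TF-IDF followed by truncated SVD, i.e. the LSI step.
lsi = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("svd", TruncatedSVD(n_components=2, random_state=0)),
])

# Outer pipeline: LSI features feed the classifier.
nested_pipe = Pipeline([("lsi", lsi), ("clf", LinearSVC())])

# Parameter names chain step names with double underscores.
nested_params = {
    "lsi__svd__n_components": [2, 3],
    "lsi__vect__max_df": [0.75, 1.0],
    "clf__C": [0.5, 1.0],
}

valid_names = nested_pipe.get_params().keys()
print(all(name in valid_names for name in nested_params))
```

Checking candidate names against `get_params()` before launching a search is a cheap way to catch a misspelled step name.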


### Define Pipeline Components

In [17]:
from sklearn.decomposition import TruncatedSVD

vect = TfidfVectorizer(stop_words='english', 
                       ngram_range=(1,2),
                       min_df=2, 
                       max_df=1.0,
                       max_features=20000)

svd = TruncatedSVD(algorithm='randomized', n_iter=10)

clf = LinearSVC(C=0.5, penalty='l2')

### Define Your Search Space
You're looking for the best hyperparameters of your vectorizer and your SVD step (the classifier's settings are fixed above).

In [18]:
import scipy.stats as stats
from sklearn.model_selection import RandomizedSearchCV

params = {
    'vect__max_df': (0.75, 1.0),
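    # The third positional argument of randint is `loc`, so this samples
    # integers from [260, 350), not from [10, 100).
    'svd__n_components': stats.randint(10, 100, 250)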
}

pipe = Pipeline([
    ('vect', vect),      # TF-IDF Vectorizer
    ('svd', svd),        # Truncated SVD Dimensionality Reduction
    ('clf', clf)         # LinearSVC Classifier
])

random_search = RandomizedSearchCV(pipe, params, cv=2, n_iter=5, n_jobs=-1, verbose=1)
random_search.fit(X_train, y_train)

Fitting 2 folds for each of 5 candidates, totalling 10 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:   25.6s finished


RandomizedSearchCV(cv=2, error_score=nan,
                   estimator=Pipeline(memory=None,
                                      steps=[('vect',
                                              TfidfVectorizer(analyzer='word',
                                                              binary=False,
                                                              decode_error='strict',
                                                              dtype=<class 'numpy.float64'>,
                                                              encoding='utf-8',
                                                              input='content',
                                                              lowercase=True,
                                                              max_df=1.0,
                                                              max_features=20000,
                                                              min_df=2,
                                                       

In [19]:
random_search.best_estimator_

Pipeline(memory=None,
         steps=[('vect',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=20000,
                                 min_df=2, ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=False,
                                 token_patt...
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('svd',
                 TruncatedSVD(algorithm='randomized', n_components=323,
                              n_iter=10, random_state=None, tol=0.0)),
                ('cl

In [20]:
random_search.best_score_

0.7454884544409883
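The `n_components=323` in the best estimator above may look surprising next to `stats.randint(10, 100, 250)`. The short check below shows why: the third positional argument is `loc`, which shifts the whole range, so the distribution draws from 260 up to 349.

```python
import scipy.stats as stats

# Same call as in the search space: low=10, high=100, loc=250.
dist = stats.randint(10, 100, 250)
samples = dist.rvs(size=1000, random_state=0)

print(samples.min(), samples.max())  # both fall inside [260, 350)
```

If the intent is to sample between 10 and 100 components, `stats.randint(10, 100)` without the third argument would do that.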

### Make a Submission File

In [21]:
# Predictions on test sample
pred = random_search.predict(test['description'])

In [22]:
submission = pd.DataFrame({'id': test['id'], 'ratingCategory':pred})
submission['ratingCategory'] = submission['ratingCategory'].astype('int64')

In [23]:
# Make Sure the Category is an Integer
submission.head()

Unnamed: 0,id,ratingCategory
0,3461,1
1,2604,1
2,3341,1
3,3764,1
4,2306,1


In [24]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model

submission.to_csv(f'./whiskey-reviews-dspt7/submission{subNumber}.csv', index=False)
subNumber += 1

## Challenge

Continue to apply Latent Semantic Indexing (LSI) to various datasets. 

# Word Embeddings with spaCy (Learn)
<a id="p3"></a>

## Follow Along

In [25]:
# Apply to your Dataset
from sklearn.ensemble import GradientBoostingClassifier
from scipy.stats import randint

param_dist = {
    'max_depth': randint(3,10),
    'min_samples_leaf': randint(2,15)
}

In [26]:
import spacy

nlp = spacy.load('en_core_web_lg')

In [27]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV

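class Embedderizer:
    """Turns each document into its spaCy document vector."""

    def fit(self, docs, y=None):
        # Nothing to learn; return self so the call can be chained
        return self
    
    def transform(self, docs):
        # doc.vector defaults to the average of the token vectors
        return [nlp(doc).vector for doc in docs]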

vect = Embedderizer()

X_train_vect = vect.transform(X_train)
X_test_vect  = vect.transform(X_test)

clf = GradientBoostingClassifier()
clf.fit(X_train_vect, y_train)

print("Accuracy score: ", clf.score(X_test_vect, y_test))

Accuracy score:  0.7506112469437652
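The embedding step above is applied by hand, outside a `Pipeline`. One way to make such a step pipeline-friendly is sketched below. To keep it runnable without downloading a spaCy model, it uses a hypothetical `toy_embed` function in place of `nlp(doc).vector`, and `LogisticRegression` in place of the gradient boosting model; both substitutions are assumptions for the demo.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def toy_embed(doc):
    """Hypothetical stand-in for nlp(doc).vector: three crude text statistics."""
    words = doc.split()
    title_words = sum(w.istitle() for w in words)
    return np.array([len(words), len(doc), title_words], dtype=float)


class DocVectorizer(BaseEstimator, TransformerMixin):
    """Turns raw documents into a 2-D array of document vectors."""

    def __init__(self, embed=toy_embed):
        self.embed = embed

    def fit(self, docs, y=None):
        return self  # nothing to learn; returning self lets Pipeline chain calls

    def transform(self, docs):
        return np.vstack([self.embed(doc) for doc in docs])


docs = ["Rich and smoky", "thin", "Sweet Sherry finish here", "flat"]
labels = [0, 1, 0, 1]

embed_pipe = Pipeline([("embed", DocVectorizer()), ("clf", LogisticRegression())])
embed_pipe.fit(docs, labels)

doc_vectors = embed_pipe.named_steps["embed"].transform(docs)
print(doc_vectors.shape)
```

With the transformer inside a pipeline, the classifier's hyperparameters could be tuned with `RandomizedSearchCV` using names such as `clf__C`, in the same way as the earlier sections.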


### Make a Submission File

In [28]:
# Predictions on test sample
pred = clf.predict(vect.transform(test['description']))

In [29]:
submission = pd.DataFrame({'id': test['id'], 'ratingCategory': pred})
submission['ratingCategory'] = submission['ratingCategory'].astype('int64')

In [30]:
# Make Sure the Category is an Integer
submission.head()

Unnamed: 0,id,ratingCategory
0,3461,1
1,2604,1
2,3341,1
3,3764,1
4,2306,1


In [31]:
# Save your Submission File
# Best to Use an Integer or Timestamp for different versions of your model
submission.to_csv(f'./whiskey-reviews-dspt7/submission{subNumber}.csv', index=False)
subNumber += 1

## Challenge

What you should be doing now:
1. Join the Kaggle Competition
2. Download the data
3. Train a model & try: 
    - Creating a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (LSI) into your pipeline. *Note:* You can grid search a nested pipeline, but you have to chain the step names with double underscores, e.g. `lsi__svd__n_components`
    - Try to extract word embeddings with spaCy and use those embeddings as your features for a classification model.
4. Make a submission to Kaggle 

# Post Lecture Assignment
<a id="p4"></a>

Your primary assignment this afternoon is to achieve a minimum of 70% accuracy on the Kaggle competition. Once you have achieved 70% accuracy, please work on the following: 

1. Research "Sentiment Analysis". Provide answers in markdown to the following questions: 
    - What is "Sentiment Analysis"? 
    - Is Document Classification different than "Sentiment Analysis"? Provide evidence for your response
    - How do you create labeled sentiment data? Are those labels really sentiment?
    - What are common applications of sentiment analysis?
2. Research why word embeddings worked better for the lecture notebook than on the whiskey competition.
    - This [text classification documentation](https://developers.google.com/machine-learning/guides/text-classification/step-2-5) from Google might be of interest
    - Neural Networks are becoming more popular for document classification. Why is that the case?

### What is Sentiment Analysis?

Sentiment analysis is the use of natural language processing, text analysis, and computational linguistics to extract subjective (i.e., opinion-based) information from a document. In practice, it often means determining how positive or negative the tone of a document is.

### Is Document Classification different than "Sentiment Analysis"?

Sentiment analysis is a common form of document classification in which the classes describe the opinion (or sentiment) a document expresses about a given topic. Many tools use it to analyze social media posts and product reviews on e-commerce websites, for example to decide whether a review is positive, negative, or neutral.
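To make the connection concrete, the sketch below treats sentiment analysis as an ordinary text classification pipeline whose labels happen to be polarities. The reviews and their labels are made up for illustration, and the model choice (`TfidfVectorizer` plus `LogisticRegression`) is just one reasonable assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up reviews with hand-assigned polarity labels (illustrative only).
reviews = [
    "a very pleasing dram, rich and honeyed",
    "lovely nose, clean and elegant finish",
    "solid whisky with delicate floral notes",
    "cloying and somewhat disappointing",
    "harsh edges and a flat, bitter finish",
    "the flavors do not complement each other",
]
polarity = ["positive", "positive", "positive", "negative", "negative", "negative"]

sentiment_pipe = Pipeline([("vect", TfidfVectorizer()), ("clf", LogisticRegression())])
sentiment_pipe.fit(reviews, polarity)

predicted = sentiment_pipe.predict(["rich and elegant", "flat and harsh"])
print(predicted)
```

Nothing in the pipeline is specific to sentiment: swapping the polarity labels for topic labels or rating categories would turn it into a different document classifier.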

### How do you create labeled sentiment data? Are those labels really sentiment?

Supervised sentiment analysis depends on pre-labeled data: without labels there is no way to determine whether your analysis is accurate. One way to label sentiment data is to have a human annotate it.

### What are common applications of sentiment analysis?

Common applications of sentiment analysis include rating social media mentions and determining whether a product review was positive, negative, or neutral.

### Why did word embeddings work better for the lecture notebook than on the whiskey competition?



In [32]:
from explore_data import get_num_words_per_sample

median_words_per_sample = get_num_words_per_sample(X_train)
sw_ratio = len(X_train) / median_words_per_sample

print(f'Number of Samples / Median Words per Sample ratio: {int(sw_ratio)}')

Number of Samples / Median Words per Sample ratio: 46


As the ratio of the number of samples to the number of words per sample (S/W) increases, the effectiveness of using pre-trained word embeddings decreases. Since the S/W ratio for the whiskey competition is twice the ratio calculated in the lecture notebook, the drop in accuracy is expected.
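The `explore_data` module is not part of pandas or scikit-learn, so the import above only works if that helper file is available locally. A minimal stand-in is sketched below, assuming the helper returns the median count of whitespace-separated words per sample; the function name and sample texts here are hypothetical.

```python
import numpy as np


def median_words_per_sample(texts):
    """Median number of whitespace-separated words across the samples."""
    return np.median([len(text.split()) for text in texts])


# Made-up samples; in the notebook this would be X_train.
samples = ["one two three", "one two", "one two three four five"]

median_words = median_words_per_sample(samples)
ratio = len(samples) / median_words

print(median_words, ratio)
```

Applied to `X_train`, the same two lines would reproduce the samples-to-words ratio without the external dependency, provided the assumption about the helper holds.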

### Why are neural networks becoming more popular for document classification?

Neural networks, such as Convolutional Neural Networks (CNNs), are non-linear models, so they can often predict better than classical classifiers, especially when used with pre-trained word embeddings.