# Cross Validation Considerations for Movie Review Sentiment Analysis

This kernel describes and compares stratified k-fold cross validation and group k-fold cross validation for the [Stanford parsed Rotten Tomatoes dataset](https://www.kaggle.com/c/movie-review-sentiment-analysis-kernels-only/data).

#### Table of Contents:
- Cross Validation Methods
    - Stratified K-Fold
    - Group K-Fold    
- Model Example


In [1]:
import pandas as pd

pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated in newer pandas

data = pd.read_csv('../input/train.tsv', delimiter='\t')

#### Note about Class Balance

In [2]:
print("Sentiment Count:", data['Sentiment'].size)
print("Sentiment Distribution:", data['Sentiment'].value_counts(normalize=True), sep='\n')

Sentiment Count: 156060
Sentiment Distribution:
2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
Name: Sentiment, dtype: float64


The classes are imbalanced: about half of the phrases are neutral (2), while the two extreme classes (0 and 4) account for only about 5% and 6% of the data. Validation splits should therefore be stratified, i.e. each split should have roughly the same class distribution.

Also, while outside the scope of this kernel, [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) might be better metrics than accuracy. It may be worth trying training methods that boost the importance of the underrepresented classes, such as [oversampling](https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis) or class weighting.
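
As a rough illustration of those two ideas (not something this kernel actually does), the sketch below uses a small made-up label array, not the competition data. It computes scikit-learn's "balanced" class weights and shows how a majority-class predictor can look acceptable on accuracy while having zero recall on a rare class. The toy labels and the always-predict-2 baseline are assumptions for the example.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.utils.class_weight import compute_class_weight

# Made-up, imbalanced labels (assumption: NOT the real Sentiment column).
y_true = np.array([2] * 10 + [3] * 4 + [1] * 3 + [4] * 2 + [0] * 1)
# Baseline that always predicts the majority class.
y_pred = np.full_like(y_true, 2)

# "balanced" weights are n_samples / (n_classes * class_count),
# so rare classes receive larger weights.
classes = np.unique(y_true)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_true)
class_weights = dict(zip(classes.tolist(), weights.tolist()))

# Per-class precision/recall exposes what accuracy hides.
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
accuracy = report["accuracy"]
rare_class_recall = report["0"]["recall"]

print("Class weights:", class_weights)
print("Accuracy:", accuracy)
print("Recall for class 0:", rare_class_recall)
```

In a pipeline like the one later in this kernel, the same weighting could be requested with `LogisticRegression(class_weight='balanced')`. Whether it helps on this data set is untested here.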

## Cross Validation Methods

### Stratified K-Fold

Preserve the Sentiment distribution in each fold.

http://scikit-learn.org/stable/modules/cross_validation.html#stratified-k-fold

In [3]:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for f, split in enumerate(skf.split(data, data['Sentiment'])):
    print("Fold", f + 1, "-----")
    test = data.iloc[split[1]]
    print("Size:", test.size)  # note: DataFrame.size counts cells (rows x columns), not rows
    print("Sentiment Distribution:", test['Sentiment'].value_counts(normalize=True), "", sep='\n')

Fold 1 -----
Size: 124860
Sentiment Distribution:
2    0.509915
3    0.210988
1    0.174756
4    0.059010
0    0.045331
Name: Sentiment, dtype: float64

Fold 2 -----
Size: 124856
Sentiment Distribution:
2    0.509931
3    0.210995
1    0.174761
4    0.058980
0    0.045332
Name: Sentiment, dtype: float64

Fold 3 -----
Size: 124844
Sentiment Distribution:
2    0.509948
3    0.210983
1    0.174778
4    0.058986
0    0.045305
Name: Sentiment, dtype: float64

Fold 4 -----
Size: 124840
Sentiment Distribution:
2    0.509965
3    0.210990
1    0.174752
4    0.058988
0    0.045306
Name: Sentiment, dtype: float64

Fold 5 -----
Size: 124840
Sentiment Distribution:
2    0.509965
3    0.210990
1    0.174752
4    0.058988
0    0.045306
Name: Sentiment, dtype: float64



There's a possible issue with stratified k-fold for this data set, however.
Let's look at the distribution of SentenceId 1 across folds.

In [4]:
Id = 1

print("SentenceId", Id, "Sentiment Counts:\n")
for f, split in enumerate(skf.split(data, data['Sentiment'])):
    print("Fold ", f + 1, ":", sep='')
    test = data.iloc[split[1]]
    if Id in test['SentenceId'].values:
        print(test['Sentiment'][test['SentenceId'] == Id].value_counts(sort=False), "\n")
    else:
        print("None\n")
    
    # use this later
    if f == 0:
        split1 = split

SentenceId 1 Sentiment Counts:

Fold 1:
1    2 
2    11
3    1 
Name: Sentiment, dtype: int64 

Fold 2:
2    10
3    1 
Name: Sentiment, dtype: int64 

Fold 3:
1    1 
2    10
Name: Sentiment, dtype: int64 

Fold 4:
1    1 
2    16
3    1 
Name: Sentiment, dtype: int64 

Fold 5:
2    9
Name: Sentiment, dtype: int64 



This is what the train and test set will look like for fold 1, SentenceId 1.

In [5]:
train = data.iloc[split1[0]]
test = data.iloc[split1[1]]

print("Train -----\nSentenceId =", Id, "Counts:")
print(train['Sentiment'][train['SentenceId'] == Id].value_counts(sort=False))
display(train[(train['SentenceId'] == Id)])

print("Test -----\nSentenceId =", Id, "Counts:")
print(test['Sentiment'][test['SentenceId'] == Id].value_counts(sort=False))
display(test[(test['SentenceId'] == Id)])

Train -----
SentenceId = 1 Counts:
1    2 
2    45
3    2 
Name: Sentiment, dtype: int64


|  | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 1 | 2 | 1 | A series of escapades demonstrating the adage that what is good for the goose | 2 |
| 3 | 4 | 1 | A | 2 |
| 4 | 5 | 1 | series | 2 |
| 5 | 6 | 1 | of escapades demonstrating the adage that what is good for the goose | 2 |
| 6 | 7 | 1 | of | 2 |
| 7 | 8 | 1 | escapades demonstrating the adage that what is good for the goose | 2 |
| 8 | 9 | 1 | escapades | 2 |
| 9 | 10 | 1 | demonstrating the adage that what is good for the goose | 2 |
| 10 | 11 | 1 | demonstrating the adage | 2 |
| 12 | 13 | 1 | the adage | 2 |


Test -----
SentenceId = 1 Counts:
1    2 
2    11
3    1 
Name: Sentiment, dtype: int64


|  | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 0 | 1 | 1 | A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . | 1 |
| 2 | 3 | 1 | A series | 2 |
| 11 | 12 | 1 | demonstrating | 2 |
| 27 | 28 | 1 | is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story . | 2 |
| 28 | 29 | 1 | is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story | 2 |
| 31 | 32 | 1 | good for the gander , some of which occasionally amuses but none of which amounts to much of a story | 2 |
| 34 | 35 | 1 | the gander , | 2 |
| 36 | 37 | 1 | gander | 2 |
| 38 | 39 | 1 | some of which occasionally amuses but none of which amounts to much of a story | 2 |
| 42 | 43 | 1 | which | 2 |


Notice how similar the phrases are between the train and test sets: they are overlapping fragments of the same parsed sentence. Since a large proportion of the phrases in both sets are neutral, the model will appear to perform decently on SentenceId 1 if it classifies all phrases as neutral. I think that with this data set, a favorable metric from stratified k-fold cross validation may be telling us how well the model has learned to recognize a SentenceId, rather than how well it recognizes sentiment.

When folds are cross-contaminated like this, models get a misleading boost in performance. What we want is for the cross validation metrics to tell us how the model will generalize to unseen data.
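
One way to quantify this kind of contamination is to count, for each fold, how many group ids appear in both the train and the test indices. The sketch below does that on synthetic data. The 20 hypothetical "sentences" of 10 "phrases" each, the random labels, and the helper name `shared_groups` are all assumptions for illustration, not part of the original kernel.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

rng = np.random.RandomState(0)

# Synthetic stand-in for SentenceId: 20 groups with 10 rows each (assumption).
groups = np.repeat(np.arange(20), 10)
y = rng.randint(0, 3, size=groups.size)   # random 3-class labels
X = np.zeros((groups.size, 1))            # features are irrelevant for splitting


def shared_groups(cv, **split_kwargs):
    """Return, per fold, the number of group ids present in both train and test."""
    counts = []
    for train_idx, test_idx in cv.split(X, y, **split_kwargs):
        overlap = np.intersect1d(groups[train_idx], groups[test_idx])
        counts.append(len(overlap))
    return counts


skf_overlap = shared_groups(StratifiedKFold(n_splits=5, shuffle=True, random_state=1))
gkf_overlap = shared_groups(GroupKFold(n_splits=5), groups=groups)

print("Stratified k-fold, groups shared per fold:", skf_overlap)
print("Group k-fold, groups shared per fold:", gkf_overlap)
```

The same check could be run on the real data by replacing `groups` with `data['SentenceId'].values`.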

### Group K-Fold

Confine each SentenceId to a single fold.

http://scikit-learn.org/stable/modules/cross_validation.html#group-k-fold

In [6]:
from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)
print("SentenceId", Id, "Sentiment Counts:\n")
for f, split in enumerate(gkf.split(data, groups=data['SentenceId'])):
    print("Fold ", f + 1, ":", sep='')
    test = data.iloc[split[1]]
    if Id in test['SentenceId'].values:
        print(test['Sentiment'][test['SentenceId'] == Id].value_counts(sort=False), "\n")
    else:
        print("None\n")

SentenceId 1 Sentiment Counts:

Fold 1:
1    4 
2    56
3    3 
Name: Sentiment, dtype: int64 

Fold 2:
None

Fold 3:
None

Fold 4:
None

Fold 5:
None



Now, with group k-fold, SentenceId 1 is kept in one fold.

Group k-fold doesn't explicitly stratify, though. Fortunately, each fold still has a sentiment distribution that is close to the overall distribution.

In [7]:
for f, split in enumerate(gkf.split(data, groups=data['SentenceId'])):
    print("Fold", f + 1, "-----")
    test = data.iloc[split[1]]
    print("Size:", test.size)  # note: DataFrame.size counts cells (rows x columns), not rows
    print("Sentiment Distribution:", test['Sentiment'].value_counts(normalize=True), "", sep='\n')

Fold 1 -----
Size: 124848
Sentiment Distribution:
2    0.507273
3    0.208574
1    0.178938
4    0.059625
0    0.045591
Name: Sentiment, dtype: float64

Fold 2 -----
Size: 124848
Sentiment Distribution:
2    0.515859
3    0.213412
1    0.170960
4    0.058503
0    0.041266
Name: Sentiment, dtype: float64

Fold 3 -----
Size: 124848
Sentiment Distribution:
2    0.507721
3    0.207677
1    0.182654
4    0.057318
0    0.044630
Name: Sentiment, dtype: float64

Fold 4 -----
Size: 124848
Sentiment Distribution:
2    0.510829
3    0.206171
1    0.176503
4    0.057286
0    0.049212
Name: Sentiment, dtype: float64

Fold 5 -----
Size: 124848
Sentiment Distribution:
2    0.508042
3    0.219114
1    0.164744
4    0.062220
0    0.045880
Name: Sentiment, dtype: float64



## Model Example

This is a basic logistic regression [pipeline](http://scikit-learn.org/stable/modules/pipeline.html#pipeline) that uses tf-idf for features.
I use a grid search with group k-fold to find the best value for the strength of the L2 penalty.
Once the best value for `C` is selected, I'll get the pipeline accuracy reported by stratified k-fold and group k-fold cross validation. Then I'll use the same pipeline to predict on the test set.

Hypotheses:
- stratified k-fold will report the highest accuracy even though the pipeline is the same
- the test set accuracy will be closer to the group k-fold accuracy
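
To make the first hypothesis concrete before running the real pipeline, here is a toy sketch on synthetic data (not the Rotten Tomatoes data, and not this kernel's model). Each hypothetical group gets a random label that is unrelated to its features, so there is nothing generalizable to learn. A 1-nearest-neighbour classifier can still score well under shuffled stratified k-fold, simply by finding a sibling row from the same group in the training fold. All sizes, the noise scale, and the choice of classifier are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
n_groups, per_group, n_features = 40, 10, 5

# Rows in a group cluster tightly around a shared center (like phrases
# cut from one sentence), and the label is assigned per group at random.
groups = np.repeat(np.arange(n_groups), per_group)
centers = rng.normal(size=(n_groups, n_features))
group_labels = rng.randint(0, 2, size=n_groups)
X = centers[groups] + 0.05 * rng.normal(size=(groups.size, n_features))
y = group_labels[groups]

model = KNeighborsClassifier(n_neighbors=1)

# Shuffled stratified folds: siblings of each test row sit in the train fold.
shuffled_scores = cross_val_score(
    model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1))

# Grouped folds: no test row has a sibling in the train fold.
grouped_scores = cross_val_score(
    model, X, y, groups=groups, cv=GroupKFold(n_splits=5))

print("Stratified k-fold mean accuracy:", shuffled_scores.mean())
print("Group k-fold mean accuracy:", grouped_scores.mean())
```

The gap in this toy setup is far more extreme than anything I'd expect on real text, but the mechanism is the one the hypotheses describe.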

In [8]:
import nltk
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf',  TfidfVectorizer()),
    ('lr', LogisticRegression())
])

analyzer = TfidfVectorizer().build_analyzer()
stemmer = nltk.stem.SnowballStemmer('english')

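pipeline.set_params(
    tfidf__analyzer=lambda x: (stemmer.stem(w) for w in analyzer(x)),
    tfidf__ngram_range=(1,2),  # note: scikit-learn only applies ngram_range when analyzer is not a callable, so this setting has no effect here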
    lr__solver='sag',
    lr__multi_class='multinomial',
    lr__penalty='l2', 
    lr__tol=0.001, 
    lr__verbose=False)

param_grid = {
    'lr__C': np.linspace(1, 3, 11) # I think the value is close to 2 based on previous testing, and want to reduce runtime
}

gs = GridSearchCV(
    pipeline, 
    param_grid=param_grid,
    cv=gkf,
    verbose=1,
    return_train_score=False)

gs.fit(data['Phrase'], y=data['Sentiment'], groups=data['SentenceId'])
print("Best C:", gs.best_params_['lr__C'])

Fitting 5 folds for each of 11 candidates, totalling 55 fits


[Parallel(n_jobs=1)]: Done  55 out of  55 | elapsed: 21.9min finished


Best C: 2.6
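
One caveat about the features used above: the scikit-learn documentation states that `ngram_range` only applies if `analyzer` is not callable, so a pipeline configured with a callable analyzer ends up with unigram features regardless of `ngram_range=(1,2)`. The sketch below illustrates this on two made-up documents and shows one possible workaround, moving the stemming into the `tokenizer` argument instead. `toy_stem` is a hypothetical stand-in for a real stemmer (so the example runs without nltk), and I have not re-run the grid search with this change.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movies amuse", "bad movies bore"]  # made-up documents


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strips a trailing "s".
    return word[:-1] if word.endswith("s") else word


# 1) Callable analyzer: ngram_range is not applied, only unigrams come out.
base_analyzer = TfidfVectorizer().build_analyzer()
callable_vec = TfidfVectorizer(
    analyzer=lambda doc: [toy_stem(w) for w in base_analyzer(doc)],
    ngram_range=(1, 2),
)
callable_vec.fit(docs)
callable_vocab = sorted(callable_vec.vocabulary_)

# 2) Stemming inside the tokenizer: the default 'word' analyzer still
#    builds n-grams from the stemmed tokens.
token_re = re.compile(r"(?u)\b\w\w+\b")  # same pattern as the default token_pattern
tokenizer_vec = TfidfVectorizer(
    tokenizer=lambda doc: [toy_stem(w) for w in token_re.findall(doc)],
    token_pattern=None,  # unused when a tokenizer is given
    ngram_range=(1, 2),
)
tokenizer_vec.fit(docs)
tokenizer_vocab = sorted(tokenizer_vec.vocabulary_)

print("Callable analyzer vocabulary:", callable_vocab)
print("Tokenizer-based vocabulary:", tokenizer_vocab)
```

Switching to the tokenizer-based variant would change the feature set, so the best `C` and the accuracies reported in this kernel would not necessarily carry over.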


Using the best value for `C`, I'll get the pipeline accuracy reported by stratified k-fold and group k-fold.

In [9]:
from sklearn.model_selection import cross_validate

pipeline.set_params(lr__C=gs.best_params_['lr__C'])

print("Running stratified k-fold...", end='')
skf_results = cross_validate(
    pipeline, 
    X=data['Phrase'], 
    y=data['Sentiment'], 
    cv=skf, 
    return_train_score=False, 
    verbose=False)
print(" done.")

print("Running group k-fold...", end='')
gkf_results = cross_validate(
    pipeline, 
    X=data['Phrase'], 
    y=data['Sentiment'], 
    groups=data['SentenceId'], 
    cv=gkf, 
    return_train_score=False, 
    verbose=False)
print(" done.\n")

print("Stratified k-fold average accuracy:", np.mean(skf_results['test_score']))
print("Group k-fold average accuracy:", np.mean(gkf_results['test_score']))

Running stratified k-fold... done.
Running group k-fold... done.

Stratified k-fold average accuracy: 0.6455080993411368
Group k-fold average accuracy: 0.6086953735742664


Now I'll fit the same pipeline on the entire train set and then predict on the test set.

In [10]:
test = pd.read_csv('../input/test.tsv', delimiter='\t')
test_pred = pipeline.fit(data['Phrase'], y=data['Sentiment']).predict(test['Phrase'])

submission = pd.concat([test['PhraseId'], pd.Series(test_pred, name='Sentiment')], axis=1)
submission.to_csv('submission.csv', index=False)