# [Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/leaderboard)

## Part 1: For Beginners - Bag of Words

In [1]:
import csv
import pandas

train = pandas.read_csv(
    'labeledTrainData.tsv.gz',
    compression='gzip',
    header=0,
    delimiter='\t',
    quoting=csv.QUOTE_NONE,
)

In [2]:
train.head()

|  | id | sentiment | review |
|---|---|---|---|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` |
| 3 | `"3630_4"` | 0 | `"It must be assumed that those who praised thi...` |
| 4 | `"9495_8"` | 1 | `"Superbly trashy and wondrously unpretentious ...` |


In [3]:
import re
from typing import List
from bs4 import BeautifulSoup

stop_words = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

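def clean_text(review: str) -> str:
    # Strip HTML markup and keep only the text.
    review = BeautifulSoup(review, 'html5lib').get_text()
    # Replace every non-letter character with a space.
    review = re.sub('[^a-zA-Z]', ' ', review)
    # Lowercase so that "Bad" and "bad" count as the same word.
    return review.lower()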
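
To make the effect of the last two steps concrete, here is a minimal sketch that applies the same regular expression and lowercasing to a made-up string. The sample text is an assumption for illustration, and the HTML-stripping step is left out so the snippet needs only the standard library.

```python
import re

# Hypothetical review fragment, not taken from the dataset.
sample = 'A "great" film, 10/10!'

# Same two steps as in clean_text, minus the BeautifulSoup call:
# every non-letter becomes a space, then everything is lowercased.
cleaned = re.sub('[^a-zA-Z]', ' ', sample).lower()

print(repr(cleaned))
print(cleaned.split())  # ['a', 'great', 'film']
```

Note that the string keeps its original length: each removed character is replaced by a space, and the extra whitespace is harmless because the vectorizer tokenizes on word boundaries.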

Here, instead of writing a loop, we use the built-in `apply` method, which calls a function on each value of the column:

In [4]:
train['review'] = train['review'].apply(clean_text)
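
A small sketch of why `apply` can replace the loop: both produce the same values. The two-row `Series` is hypothetical, and `str.lower` stands in for `clean_text` to keep the example short.

```python
import pandas

# Hypothetical mini-column; str.lower is a stand-in for clean_text.
reviews = pandas.Series(['Great MOVIE', 'Awful Plot'])

# Explicit loop over the values.
looped = [review.lower() for review in reviews]

# The same transformation expressed with apply.
applied = reviews.apply(str.lower)

print(applied.tolist() == looped)  # True
```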

In [5]:
train.head()

|  | id | sentiment | review |
|---|---|---|---|
| 0 | `"5814_8"` | 1 | with all this stuff going down at the moment ... |
| 1 | `"2381_9"` | 1 | the classic war of the worlds by timothy ... |
| 2 | `"7759_3"` | 0 | the film starts with a manager nicholas bell... |
| 3 | `"3630_4"` | 0 | it must be assumed that those who praised thi... |
| 4 | `"9495_8"` | 1 | superbly trashy and wondrously unpretentious ... |


Again, instead of removing stop words by hand, we pass `stop_words` straight to the vectorizer:

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', stop_words=stop_words, max_features=5000)
train_features = vectorizer.fit_transform(train['review'])



Setting `n_jobs=-1` makes the forest use all available CPU cores in parallel (by default it runs a single job). This dramatically decreased training time for me.

Also, I set `random_state` to get a reproducible result.

In [7]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,
    n_jobs=-1,
    random_state=42,
).fit(train_features, train['sentiment'])
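
As a rough check of the reproducibility claim, the sketch below trains two forests with the same `random_state` on synthetic data and compares them. The dataset from `make_classification`, the helper `fit_forest`, and the small `n_estimators` are assumptions for illustration, not part of the tutorial.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the bag-of-words features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)


def fit_forest(seed: int) -> RandomForestClassifier:
    """Hypothetical helper: train a small forest with a fixed seed."""
    return RandomForestClassifier(
        n_estimators=20,
        n_jobs=-1,
        random_state=seed,
    ).fit(X, y)


first, second = fit_forest(42), fit_forest(42)

# Same seed, same data: the two forests should agree.
same_importances = np.allclose(first.feature_importances_, second.feature_importances_)
same_predictions = np.array_equal(first.predict(X), second.predict(X))
print(same_importances, same_predictions)
```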

In [8]:
test = pandas.read_csv(
    'testData.tsv.gz',
    header=0,
    compression='gzip',
    delimiter='\t',
    quoting=csv.QUOTE_NONE,
)
test['review'] = test['review'].apply(clean_text)

In [9]:
test_features = vectorizer.transform(test['review'])
sentiment = forest.predict(test_features)

In [10]:
output = pandas.DataFrame({'id': test['id'], 'sentiment': sentiment})
output.to_csv('part_1.csv', index=False, quoting=csv.QUOTE_NONE)

### Bonus

We can also look at which features (that is, words) influence the predicted sentiment the most. A fitted random forest exposes a `feature_importances_` attribute whose name speaks for itself:

In [11]:
from operator import itemgetter

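# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names().
importances = zip(vectorizer.get_feature_names_out(), forest.feature_importances_)
pandas.DataFrame(
    sorted(importances, key=itemgetter(1), reverse=True)[:15],
    columns=['feature', 'importance'],
)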

|  | feature | importance |
|---|---|---|
| 0 | worst | 0.020051 |
| 1 | bad | 0.018778 |
| 2 | waste | 0.011403 |
| 3 | great | 0.010839 |
| 4 | awful | 0.009708 |
| 5 | terrible | 0.007697 |
| 6 | boring | 0.007548 |
| 7 | excellent | 0.007332 |
| 8 | best | 0.006787 |
| 9 | nothing | 0.005559 |


Makes sense, right?