# [Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/)

## [Part 1 For Beginners Bag Of Words](https://www.kaggle.com/c/word2vec-nlp-tutorial#part-1-for-beginners-bag-of-words)

In [1]:
import csv
import pandas

train = pandas.read_csv(
    'labeledTrainData.tsv.gz',
    compression='gzip',
    header=0,
    delimiter='\t',
    quoting=csv.QUOTE_NONE,
)

In [2]:
train.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
| 2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
| 3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
| 4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |


In [3]:
import re
from typing import List
from bs4 import BeautifulSoup

stop_words = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

def clean_text(review: str) -> str:
    review = BeautifulSoup(review, 'html5lib').get_text()
    review = re.sub('[^a-zA-Z]', ' ', review)
    review = review.lower()
    return review

Here, instead of an explicit loop over the reviews, we use the pandas `apply` method, which calls a function on each value:

In [4]:
train['review'] = train['review'].apply(clean_text)
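
For readers new to pandas, `apply` on a column behaves much like a list comprehension: the function is called once per value and the results are collected. The sketch below shows that equivalence on a plain Python list. `simple_clean` is a hypothetical stand-in for `clean_text` that skips the BeautifulSoup step, so the example needs only the standard library.

```python
import re


def simple_clean(review: str) -> str:
    # Hypothetical stand-in for clean_text: no HTML stripping here.
    review = re.sub('[^a-zA-Z]', ' ', review)  # replace non-letters with spaces
    return review.lower()


raw_reviews = ['Great movie!', 'Worst. Film. Ever.']

# What a manual loop would do...
looped = []
for review in raw_reviews:
    looped.append(simple_clean(review))

# ...and the per-element mapping that `apply` performs on a column.
mapped = [simple_clean(review) for review in raw_reviews]

print(mapped)
```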

In [5]:
train.head()

| | id | sentiment | review |
|---|---|---|---|
| 0 | "5814_8" | 1 | with all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | the classic war of the worlds by timothy ... |
| 2 | "7759_3" | 0 | the film starts with a manager nicholas bell... |
| 3 | "3630_4" | 0 | it must be assumed that those who praised thi... |
| 4 | "9495_8" | 1 | superbly trashy and wondrously unpretentious ... |


Again, instead of removing stop words by hand, we pass `stop_words` straight to the vectorizer:

In [6]:
import warnings
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word', stop_words=stop_words, max_features=5000)
train_features = vectorizer.fit_transform(train['review'])
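
To make the bag-of-words idea concrete, here is a plain-Python sketch of roughly what the vectorizer does with `stop_words` and `max_features`: drop the stop words, keep only the most frequent words in the corpus, and count them per document. This is not scikit-learn's implementation. The function `bag_of_words` and the toy documents are made up for illustration, and the tokenization is simplified to lowercase ASCII words of two or more letters.

```python
import re
from collections import Counter


def bag_of_words(docs, stop_words, max_features):
    # Tokenize into words of two or more letters and drop stop words.
    tokenized = [
        [t for t in re.findall(r'[a-z]{2,}', doc.lower()) if t not in stop_words]
        for doc in docs
    ]
    # Keep only the most frequent words across the whole corpus.
    totals = Counter(t for tokens in tokenized for t in tokens)
    vocabulary = sorted(word for word, _ in totals.most_common(max_features))
    # One row of counts per document, one column per vocabulary word.
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


docs = ['the movie was great great fun', 'the movie was bad']
vocabulary, rows = bag_of_words(docs, stop_words={'the', 'was'}, max_features=2)
print(vocabulary)  # ['great', 'movie']
print(rows)        # [[2, 1], [0, 1]]
```

The real `CountVectorizer` returns a sparse matrix rather than nested lists, which matters when there are 25,000 reviews and 5,000 columns.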

Setting `n_jobs=-1` runs the training in parallel on all available cores (the default is a single job). This dramatically decreased the training time for me.

I also set `random_state` to get a reproducible result.

In [7]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,
    n_jobs=-1,
    random_state=42,
).fit(train_features, train['sentiment'])
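
Why a fixed `random_state` helps: a random forest by default trains each tree on a bootstrap sample of the rows, so the random draws decide what each tree sees. The sketch below is only an illustration of that principle using the standard `random` module; `bootstrap_indices` is a hypothetical helper, not something scikit-learn exposes.

```python
import os
import random


def bootstrap_indices(n_samples: int, seed: int) -> list:
    # Draw n_samples row indices with replacement, as bagging does per tree.
    rng = random.Random(seed)
    return [rng.randrange(n_samples) for _ in range(n_samples)]


first_run = bootstrap_indices(10, seed=42)
second_run = bootstrap_indices(10, seed=42)
print(first_run == second_run)  # the same seed yields the same sample

# Number of logical CPUs visible to Python; `n_jobs=-1` typically uses all of them.
print(os.cpu_count())
```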

In [8]:
test = pandas.read_csv(
    'testData.tsv.gz',
    header=0,
    compression='gzip',
    delimiter='\t',
    quoting=csv.QUOTE_NONE,
)
test['review'] = test['review'].apply(clean_text)

In [9]:
test_features = vectorizer.transform(test['review'])
sentiment = forest.predict(test_features)

In [10]:
output = pandas.DataFrame({'id': test['id'], 'sentiment': sentiment})
output.to_csv('part_1.csv', index=False, quoting=csv.QUOTE_NONE)

### Bonus

We can look at which features (that is, words) influence the predicted sentiment the most. A fitted random forest exposes a `feature_importances_` attribute whose name speaks for itself:

In [11]:
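from operator import itemgetter

# Note: `get_feature_names()` was removed in scikit-learn 1.2;
# on recent versions call `get_feature_names_out()` instead.
feature_importances = zip(vectorizer.get_feature_names(), forest.feature_importances_)
pandas.DataFrame(
    sorted(feature_importances, key=itemgetter(1), reverse=True)[:15],
    columns=['feature', 'importance'],
)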

| | feature | importance |
|---|---|---|
| 0 | worst | 0.020051 |
| 1 | bad | 0.018778 |
| 2 | waste | 0.011403 |
| 3 | great | 0.010839 |
| 4 | awful | 0.009708 |
| 5 | terrible | 0.007697 |
| 6 | boring | 0.007548 |
| 7 | excellent | 0.007332 |
| 8 | best | 0.006787 |
| 9 | nothing | 0.005559 |


Makes sense, doesn't it?
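
The ranking in the bonus cell is just "pair each word with its score, sort by score, take the top few". A small standalone sketch, using four values copied from the output above in shuffled order; `heapq.nlargest` is shown as an alternative that avoids sorting the full list when only a few entries are needed.

```python
import heapq
from operator import itemgetter

# Values copied from the output above, deliberately out of order.
features = ['worst', 'great', 'bad', 'waste']
importances = [0.020051, 0.010839, 0.018778, 0.011403]

pairs = list(zip(features, importances))

# Full sort, then slice: the approach used in the cell above.
top_sorted = sorted(pairs, key=itemgetter(1), reverse=True)[:3]

# Same result without sorting everything.
top_heap = heapq.nlargest(3, pairs, key=itemgetter(1))

print(top_sorted)
```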

## [Part 2 Word Vectors](https://www.kaggle.com/c/word2vec-nlp-tutorial#part-2-word-vectors)

In [12]:
train: pandas.DataFrame = pandas.read_csv(
    'labeledTrainData.tsv.gz',
    compression='gzip',
    header=0,
    delimiter='\t',
    quoting=csv.QUOTE_NONE,
)
unlabeled_train: pandas.DataFrame = pandas.read_csv(
    'unlabeledTrainData.tsv.gz',
    compression='gzip',
    header=0,
    delimiter='\t',
    quoting=csv.QUOTE_NONE,
)

In [13]:
from pathlib import Path
import nltk.data

# Change the path if you wish.
nltk_path = Path.home() / 'nltk_data'
nltk_path.mkdir(exist_ok=True)
nltk.download('punkt', download_dir=str(nltk_path))

[nltk_data] Downloading package punkt to /Users/eigenein/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [14]:
from typing import List

from tqdm import tqdm_notebook

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

reviews: List[str] = [*unlabeled_train['review'], *train['review']]
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    sentences: List[List[str]] = [
        clean_text(sentence).split()
        for review in tqdm_notebook(reviews, desc='Cleaning…')
        for sentence in tokenizer.tokenize(review)
    ]
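
The shape of the data produced here matters for the next step: each review is split into sentences, and each sentence becomes a list of lowercase words. The sketch below reproduces that shape with the standard library only. `naive_sentences` is a hypothetical, much cruder stand-in for the punkt tokenizer, and `simple_clean` again omits the HTML stripping done by `clean_text`.

```python
import re


def naive_sentences(review: str) -> list:
    # Hypothetical stand-in for punkt: split after '.', '!' or '?' followed
    # by whitespace. Punkt handles abbreviations and similar cases far better.
    return [s for s in re.split(r'(?<=[.!?])\s+', review.strip()) if s]


def simple_clean(text: str) -> str:
    # Replace non-letters with spaces and lowercase, as clean_text does.
    return re.sub('[^a-zA-Z]', ' ', text).lower()


review = 'I loved it. Would watch again! Ten out of 10.'
toy_sentences = [simple_clean(s).split() for s in naive_sentences(review)]
print(toy_sentences)
# [['i', 'loved', 'it'], ['would', 'watch', 'again'], ['ten', 'out', 'of']]
```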

In [15]:
import logging

logging.basicConfig(
    format='%(asctime)s (%(name)s) [%(levelname)s] %(message)s',
    level=logging.INFO,
    datefmt='%d-%m %H:%M:%S',
)

In [16]:
from multiprocessing import cpu_count

from gensim.models.word2vec import Word2Vec

# Written against gensim 3.x; in gensim 4 and later the `size` argument is named `vector_size`.
model = Word2Vec(sentences, workers=cpu_count(), size=300, min_count=40, window=10, sample=1e-3)

# If you don't plan to train the model any further,
# calling `init_sims` will make the model much more memory-efficient.
# (`init_sims` is deprecated in gensim 4 and later.)
model.init_sims(replace=True)
model.save('300features_40minwords_10context.model.gz')

10-08 15:13:32 (summarizer.preprocessing.cleaner) [INFO] 'pattern' package not found; tag filters are not available for English
10-08 15:13:32 (gensim.models.word2vec) [INFO] collecting all words and their counts
10-08 15:13:32 (gensim.models.word2vec) [INFO] PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
10-08 15:13:32 (gensim.models.word2vec) [INFO] PROGRESS: at sentence #10000, processed 228386 words, keeping 17381 word types
10-08 15:13:32 (gensim.models.word2vec) [INFO] PROGRESS: at sentence #20000, processed 450696 words, keeping 24739 word types
10-08 15:13:32 (gensim.models.word2vec) [INFO] PROGRESS: at sentence #30000, processed 676534 words, keeping 29983 word types
10-08 15:13:32 (gensim.models.word2vec) [INFO] PROGRESS: at sentence #40000, processed 901155 words, keeping 34165 word types
10-08 15:13:32 (gensim.models.word2vec) [INFO] PROGRESS: at sentence #50000, processed 1122654 words, keeping 37737 word types
[...]

10-08 15:13:36 (gensim.models.word2vec) [INFO] PROGRESS: at sentence #630000, processed 14126362 words, keeping 111683 word types
10-08 15:13:36 (gensim.models.word2vec) [INFO] PROGRESS: at sentence #640000, processed 14346469 words, keeping 112448 word types
10-08 15:13:36 (gensim.models.word2vec) [INFO] PROGRESS: at sentence #650000, processed 14569962 words, keeping 113311 word types
10-08 15:13:36 (gensim.models.word2vec) [INFO] PROGRESS: at sentence #660000, processed 14787210 words, keeping 114008 word types
10-08 15:13:36 (gensim.models.word2vec) [INFO] PROGRESS: at sentence #670000, processed 15008818 words, keeping 114731 word types
10-08 15:13:36 (gensim.models.word2vec) [INFO] PROGRESS: at sentence #680000, processed 15232354 words, keeping 115531 word types
10-08 15:13:36 (gensim.models.word2vec) [INFO] PROGRESS: at sentence #690000, processed 15455464 words, keeping 116273 word types
[...]

10-08 15:14:09 (gensim.models.base_any2vec) [INFO] EPOCH 2 - PROGRESS: at 89.88% examples, 755619 words/s, in_qsize 14, out_qsize 1
10-08 15:14:10 (gensim.models.base_any2vec) [INFO] EPOCH 2 - PROGRESS: at 95.59% examples, 753131 words/s, in_qsize 13, out_qsize 2
10-08 15:14:11 (gensim.models.base_any2vec) [INFO] worker thread finished; awaiting finish of 7 more threads
10-08 15:14:11 (gensim.models.base_any2vec) [INFO] worker thread finished; awaiting finish of 6 more threads
10-08 15:14:11 (gensim.models.base_any2vec) [INFO] worker thread finished; awaiting finish of 5 more threads
10-08 15:14:11 (gensim.models.base_any2vec) [INFO] worker thread finished; awaiting finish of 4 more threads
10-08 15:14:11 (gensim.models.base_any2vec) [INFO] worker thread finished; awaiting finish of 3 more threads
10-08 15:14:11 (gensim.models.base_any2vec) [INFO] worker thread finished; awaiting finish of 2 more threads
[...]

10-08 15:14:50 (gensim.models.base_any2vec) [INFO] EPOCH 5 - PROGRESS: at 47.36% examples, 746547 words/s, in_qsize 15, out_qsize 0
10-08 15:14:51 (gensim.models.base_any2vec) [INFO] EPOCH 5 - PROGRESS: at 53.08% examples, 744268 words/s, in_qsize 16, out_qsize 1
10-08 15:14:52 (gensim.models.base_any2vec) [INFO] EPOCH 5 - PROGRESS: at 58.94% examples, 743564 words/s, in_qsize 16, out_qsize 0
10-08 15:14:53 (gensim.models.base_any2vec) [INFO] EPOCH 5 - PROGRESS: at 65.05% examples, 746386 words/s, in_qsize 13, out_qsize 2
10-08 15:14:54 (gensim.models.base_any2vec) [INFO] EPOCH 5 - PROGRESS: at 71.31% examples, 750560 words/s, in_qsize 15, out_qsize 0
10-08 15:14:55 (gensim.models.base_any2vec) [INFO] EPOCH 5 - PROGRESS: at 77.47% examples, 752652 words/s, in_qsize 15, out_qsize 0
10-08 15:14:56 (gensim.models.base_any2vec) [INFO] EPOCH 5 - PROGRESS: at 83.78% examples, 755235 words/s, in_qsize 14, out_qsize 1
[...]
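
Two of the `Word2Vec` arguments used above are easy to picture with a toy example: `min_count` drops words that occur fewer than that many times in the corpus, and `window` bounds how far from a word its context words may be. The sketch below is a simplified illustration, not gensim's code. `build_vocab` and `context_pairs` are hypothetical helpers, and the window here is fixed, whereas the real training loop also varies the effective window and downsamples frequent words (the `sample` argument).

```python
from collections import Counter


def build_vocab(sentences, min_count):
    # Keep words whose total corpus frequency is at least min_count.
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


def context_pairs(sentence, window):
    # Pair every word with its neighbours at most `window` positions away.
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        pairs.extend((center, sentence[j]) for j in range(lo, hi) if j != i)
    return pairs


toy = [['good', 'movie'], ['good', 'plot'], ['bad', 'movie']]
vocab = build_vocab(toy, min_count=2)
print(sorted(vocab))  # ['good', 'movie']

pairs = context_pairs(['a', 'b', 'c'], window=1)
print(pairs)  # [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b')]
```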

In [17]:
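with warnings.catch_warnings():
    warnings.simplefilter('ignore')

    # Query the trained vectors through `model.wv`: calling these methods
    # on the model itself is deprecated in gensim 3.x and removed in gensim 4.
    print('Extra word:', model.wv.doesnt_match('man woman child kitchen'.split()))
    print('Similar:', model.wv.most_similar(positive='woman king'.split(), negative='man'.split(), topn=3))
    print('Similarity:', model.wv.similarity('darth', 'vader'))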

Extra word: kitchen
Similar: [('queen', 0.5384258031845093), ('princess', 0.5378516316413879), ('prince', 0.49575579166412354)]
Similarity: 0.8874878
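
The `similarity` score above is the cosine of the angle between two word vectors, and the "woman + king - man" query ranks words by cosine similarity to a combined vector. Here is a minimal sketch of both ideas. The 2-d vectors are invented by hand for illustration and have nothing to do with the trained model, and the sketch adds raw vectors, whereas gensim normalizes them first.

```python
import math


def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, chosen by hand.
vectors = {
    'man': [1.0, 0.0],
    'woman': [1.0, 1.0],
    'king': [3.0, 0.0],
    'queen': [3.0, 1.0],
    'kitchen': [-1.0, 2.0],
}

# king - man + woman, computed component by component.
target = [
    k - m + w
    for k, m, w in zip(vectors['king'], vectors['man'], vectors['woman'])
]

# Rank the remaining words by similarity to the combined vector.
candidates = [word for word in vectors if word not in ('king', 'man', 'woman')]
best = max(candidates, key=lambda word: cosine(target, vectors[word]))
print(best)  # queen
```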


## [Part 3 More Fun With Word Vectors](https://www.kaggle.com/c/word2vec-nlp-tutorial#part-3-more-fun-with-word-vectors)

TODO

## [Part 4 Comparing Deep And Non Deep Learning Methods](https://www.kaggle.com/c/word2vec-nlp-tutorial#part-4-comparing-deep-and-non-deep-learning-methods)

TODO