# `word2vec`
Feature extraction is done via `word2vec`. `word2vec` model is trained from scratch with both labeled and unlabeled train data.

Submission accuracy (average): `0.83104`  
Submission accuracy (clustering): `0.80764`

Download necessary NLTK files:
* `stopwords`: Stopwords Corpus
* `wordnet`: WordNet
* `punkt`: Punkt Tokenizer Models

In [1]:
%%script false
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.donwload('punkt')

In [26]:
import numpy as np
import pandas as pd
import nltk
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from gensim.models import word2vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from os.path import join
from tqdm import tqdm

Since `word2vec` is unsupervised, we can use the unlabeled training data here.

In [2]:
src = 'data'
df_labeled_train = pd.read_csv(join(src, 'labeledTrainData.tsv'), sep='\t')
df_unlabeled_train = pd.read_csv(join(src, 'unlabeledTrainData.tsv'), sep='\t', quoting=3)
df_test = pd.read_csv(join(src, 'testData.tsv'), sep='\t')

df_labeled_train.shape, df_unlabeled_train.shape, df_test.shape

((25000, 3), (50000, 2), (25000, 2))

## Data cleaning
We clean the data similarly to how it was done previously. However, we avoid removing numbers and stop words. We also leave each sentence as a list of words, instead of an entire string.

In [41]:
stopwords = set(stopwords.words('english'))

def clean_sentence(sentence, remove_stopwords):
    removed_markup = BeautifulSoup(sentence, 'html.parser').get_text()
    removed_punctuation = re.sub(r'[^a-zA-Z0-9]', ' ', removed_markup)
    tokens = removed_punctuation.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return tokens

`word2vec` expects single sentences, each one as a list of words. In order to split a paragraph into sentences, we use NLTK's `punkt` tokenizer.

In [4]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
tokenizer

<nltk.tokenize.punkt.PunktSentenceTokenizer at 0x2239d065748>

In [5]:
def review_to_sentences(review, remove_stopwords):
    raw_sentences = tokenizer.tokenize(review.strip())
    return [clean_sentence(sentence, remove_stopwords) for sentence in raw_sentences if len(sentence) > 0]

In [6]:
sentences = []
[sentences.extend(review_to_sentences(review, remove_stopwords=False)) for review in tqdm(df_labeled_train['review'])]
[sentences.extend(review_to_sentences(review, remove_stopwords=False)) for review in tqdm(df_unlabeled_train['review'])];

  ' Beautiful Soup.' % markup)
  ' Beautiful Soup.' % markup)
  ' that document to Beautiful Soup.' % decoded_markup
100%|███████████████████████████████████████████████████████████████████████████| 25000/25000 [00:43<00:00, 571.67it/s]
  ' that document to Beautiful Soup.' % decoded_markup
  ' that document to Beautiful Soup.' % decoded_markup
  ' Beautiful Soup.' % markup)
  ' Beautiful Soup.' % markup)
  ' that document to Beautiful Soup.' % decoded_markup
  ' Beautiful Soup.' % markup)
  ' that document to Beautiful Soup.' % decoded_markup
  ' Beautiful Soup.' % markup)
  ' that document to Beautiful Soup.' % decoded_markup
100%|███████████████████████████████████████████████████████████████████████████| 50000/50000 [01:27<00:00, 568.33it/s]


_Note: Errors are from URLs in the sentences._

In [7]:
len(sentences)

795872

## Training `word2vec`
**Parameters**  
`num_features`: The dimension of each word vector  
`min_word_count`: Words that occur fewer than this value across all training samples are ignored. Reasonable values are [10, 100].  
`num_workers`: Parallel processing threads  
`window`: Size of context window  
`downsampling`: Amount of downsampling to use for frequent words. Google documentation recommends [1e-5, 1e-3].  

In [8]:
num_features = 300
min_word_count = 40
num_workers = 4
window = 10
downsampling = 1e-3

In [9]:
%%time
model = word2vec.Word2Vec(sentences,
                          size=num_features,
                          window=window,
                          min_count=min_word_count,
                          sample=downsampling)

Wall time: 1min 38s


In [10]:
# makes the model more memory efficient, but only if not training any further
model.init_sims(replace=True)
model.save(f'model/features{num_features}_minword{min_word_count}_window{window}')

## Evaluate model

In [11]:
model.wv.doesnt_match('france england germany berlin'.split())

  if np.issubdtype(vec.dtype, np.int):


'berlin'

In [12]:
model.wv.most_similar('man')

  if np.issubdtype(vec.dtype, np.int):


[('woman', 0.6391713619232178),
 ('lady', 0.59377121925354),
 ('lad', 0.5750746726989746),
 ('monk', 0.5367291569709778),
 ('millionaire', 0.5239022970199585),
 ('farmer', 0.5203163623809814),
 ('guy', 0.519966721534729),
 ('person', 0.507594645023346),
 ('soldier', 0.50559401512146),
 ('men', 0.4995683431625366)]

## Vectorizing paragraphs: averaging
For all words in each paragraph, we obtain their corresponding word vectors. The vector of the paragraph is then taken to be the average of all its word vectors.

In [13]:
def compute_sentence_vec(sentence):
    model_vocab = set(model.wv.vocab.keys())
    word_vecs = [model.wv[word] for word in sentence if word in model_vocab]
    if len(word_vecs) > 0:
        feature_vec = np.stack([model.wv[word] for word in sentence if word in model_vocab])
        return np.average(feature_vec, axis=0)
    return np.zeros([model.vector_size])

def compute_review_vec(review):
    feature_vec = np.stack([compute_sentence_vec(sentence) 
                            for sentence in review_to_sentences(review, remove_stopwords=True)
                            if len(sentence) > 0])
    return np.average(feature_vec, axis=0)

In [14]:
train_vecs = np.stack([compute_review_vec(review) for review in tqdm(df_labeled_train['review'])])
test_vecs = np.stack([compute_review_vec(review) for review in tqdm(df_test['review'])])
train_vecs.shape, test_vecs.shape

  ' Beautiful Soup.' % markup)
  ' Beautiful Soup.' % markup)
  ' that document to Beautiful Soup.' % decoded_markup
100%|████████████████████████████████████████████████████████████████████████████| 25000/25000 [08:49<00:00, 47.18it/s]
  ' Beautiful Soup.' % markup)
  ' Beautiful Soup.' % markup)
100%|████████████████████████████████████████████████████████████████████████████| 25000/25000 [09:30<00:00, 43.79it/s]


((25000, 300), (25000, 300))

### Random forest classifier

In [19]:
%%time
clf = RandomForestClassifier(n_estimators=100)
clf.fit(train_vecs, df_labeled_train['sentiment'])

Wall time: 55.2 s


In [20]:
%%time
pred = clf.predict(test_vecs)

Wall time: 1.07 s


In [21]:
output = pd.DataFrame({'id': df_test['id'], 'sentiment': pred})
output

Unnamed: 0,id,sentiment
0,12311_10,1
1,8348_2,0
2,5828_4,1
3,7186_2,0
4,12128_7,1
5,2913_8,0
6,4396_1,0
7,395_2,0
8,10616_1,0
9,9074_9,1


In [22]:
output.to_csv('submission/word2vec_avg_randomforest.csv', index=False)

## Vectorizing paragraphs: clustering
The words in the `word2vec` model vocabulary are first clustered. The vector for each paragraph will then be the sum of occurrences of each cluster, based on the words in that paragraph.

In [31]:
%%time
word_vectors = model.wv.vectors
n_clusters = word_vectors.shape[0] // 5
clf = KMeans(n_clusters=n_clusters)
idx = clf.fit_predict(word_vectors)

Wall time: 10min 32s


Examine the words in 10 clusters.

In [34]:
model_vocab = np.array(list(model.wv.vocab.keys()))
for i in range(10):
    print(model_vocab[idx == i])

['just' 'because']
['mm' 'pong' 'tod' 'della']
['outings']
['pirates' 'favourites' 'acceptance' 'shenanigans']
['underworld' 'complained' 'complicated' 'slick' 'drew' 'ferrell' 'wasps'
 'braindead' 'ancestor']
['stuck' 'history' 'lesbian' 'rush' 'sequels' 'mentally' 'refugee' 'nell'
 'arresting' 'sirk']
['commits' 'rescues' 'jacobi']
['judy' 'cream' 'wayans' 'revolve' 'maniacs' 'maestro' 'ammo']
['precision' 'richness' 'seductress' 'indigo']
['pleasure' 'leave']


Create a dictionary where the keys are the words in the `word2vec` vocabulary, and the values are the cluster indices of those words.

In [35]:
cluster_mapping = dict(zip(model_vocab, idx))
cluster_mapping

{'with': 700,
 'all': 1207,
 'this': 1114,
 'stuff': 2490,
 'going': 2896,
 'down': 2573,
 'at': 2801,
 'the': 1214,
 'moment': 697,
 'mj': 2801,
 'i': 1014,
 've': 2967,
 'started': 1360,
 'listening': 1540,
 'to': 1952,
 'his': 2003,
 'music': 1537,
 'watching': 145,
 'odd': 1537,
 'documentary': 355,
 'here': 452,
 'and': 747,
 'there': 2957,
 'watched': 1553,
 'again': 693,
 'maybe': 825,
 'just': 0,
 'want': 2778,
 'get': 2222,
 'a': 2394,
 'certain': 2543,
 'insight': 995,
 'into': 2567,
 'guy': 1660,
 'who': 1557,
 'thought': 1927,
 'was': 2699,
 'really': 1838,
 'cool': 2545,
 'in': 825,
 'eighties': 2894,
 'make': 340,
 'up': 2985,
 'my': 1772,
 'mind': 2681,
 'whether': 2097,
 'he': 562,
 'is': 826,
 'guilty': 1650,
 'or': 1810,
 'innocent': 244,
 'part': 891,
 'biography': 1902,
 'feature': 1553,
 'film': 648,
 'which': 1891,
 'remember': 2412,
 'see': 2055,
 'cinema': 2904,
 'when': 1627,
 'it': 1014,
 'originally': 1651,
 'released': 257,
 'some': 2043,
 'of': 340,
 'has':

In [42]:
def review_centroid_vec(review):
    centroid_counts = np.zeros([len(cluster_mapping)])
    for sentence in review_to_sentences(review, remove_stopwords=True):
        for word in sentence:
            if word in cluster_mapping:
                centroid_counts[cluster_mapping[word]] += 1
    return centroid_counts

In [43]:
train_vecs = np.stack([review_centroid_vec(review) for review in tqdm(df_labeled_train['review'])])
test_vecs = np.stack([review_centroid_vec(review) for review in tqdm(df_test['review'])])



  0%|                                                                                        | 0/25000 [00:00<?, ?it/s]Exception ignored in: <bound method tqdm.__del__ of   0%|                                                                                        | 0/25000 [01:35<?, ?it/s]>
Traceback (most recent call last):
  File "d:\program files\python36\lib\site-packages\tqdm\_tqdm.py", line 879, in __del__
    self.close()
  File "d:\program files\python36\lib\site-packages\tqdm\_tqdm.py", line 1098, in close
    self._decr_instances(self)
  File "d:\program files\python36\lib\site-packages\tqdm\_tqdm.py", line 438, in _decr_instances
    cls._instances.remove(instance)
  File "d:\program files\python36\lib\_weakrefset.py", line 109, in remove
    self.data.remove(ref(item))
KeyError: <weakref at 0x000002238E7926D8; to 'tqdm' at 0x000002238245B780>


  0%|                                                                              | 32/25000 [00:00<01:18, 316.24it/s]

  0%|▎  

 26%|███████████████████▍                                                        | 6397/25000 [00:12<00:37, 500.78it/s]

 26%|███████████████████▌                                                        | 6448/25000 [00:12<00:37, 500.64it/s]

 26%|███████████████████▊                                                        | 6498/25000 [00:12<00:36, 500.55it/s]

 26%|███████████████████▉                                                        | 6557/25000 [00:13<00:36, 501.15it/s]

 26%|████████████████████                                                        | 6610/25000 [00:13<00:36, 501.13it/s]

 27%|████████████████████▎                                                       | 6662/25000 [00:13<00:36, 500.73it/s]

 27%|████████████████████▍                                                       | 6713/25000 [00:13<00:36, 500.81it/s]

 27%|████████████████████▌                                                       | 6764/25000 [00:13<00:36, 500.72it/s]

 27%|████████████████████▋      

 54%|████████████████████████████████████████▏                                  | 13399/25000 [00:27<00:23, 495.57it/s]

 54%|████████████████████████████████████████▎                                  | 13448/25000 [00:27<00:23, 495.53it/s]

 54%|████████████████████████████████████████▍                                  | 13496/25000 [00:27<00:23, 495.46it/s]

 54%|████████████████████████████████████████▋                                  | 13550/25000 [00:27<00:23, 495.62it/s]

 54%|████████████████████████████████████████▊                                  | 13605/25000 [00:27<00:22, 495.82it/s]

 55%|████████████████████████████████████████▉                                  | 13663/25000 [00:27<00:22, 496.08it/s]

 55%|█████████████████████████████████████████▏                                 | 13718/25000 [00:27<00:22, 496.24it/s]

 55%|█████████████████████████████████████████▎                                 | 13772/25000 [00:27<00:22, 496.36it/s]

 55%|███████████████████████████

 83%|█████████████████████████████████████████████████████████████▉             | 20636/25000 [00:41<00:08, 499.78it/s]

 83%|██████████████████████████████████████████████████████████████             | 20688/25000 [00:41<00:08, 499.81it/s]

 83%|██████████████████████████████████████████████████████████████▏            | 20738/25000 [00:41<00:08, 498.54it/s]

 83%|██████████████████████████████████████████████████████████████▎            | 20780/25000 [00:41<00:08, 497.96it/s]

 83%|██████████████████████████████████████████████████████████████▍            | 20819/25000 [00:41<00:08, 497.63it/s]

 83%|██████████████████████████████████████████████████████████████▌            | 20857/25000 [00:41<00:08, 496.93it/s]

 84%|██████████████████████████████████████████████████████████████▋            | 20893/25000 [00:42<00:08, 496.50it/s]

 84%|██████████████████████████████████████████████████████████████▊            | 20929/25000 [00:42<00:08, 496.14it/s]

 84%|███████████████████████████

### Random forest classifier

In [45]:
%%time
clf = RandomForestClassifier(n_estimators=100)
clf.fit(train_vecs, df_labeled_train['sentiment'])

Wall time: 2min 25s


In [46]:
%%time
pred = clf.predict(test_vecs)

Wall time: 4.1 s


In [47]:
output = pd.DataFrame({'id': df_test['id'], 'sentiment': pred})
output

Unnamed: 0,id,sentiment
0,12311_10,1
1,8348_2,0
2,5828_4,0
3,7186_2,0
4,12128_7,1
5,2913_8,1
6,4396_1,0
7,395_2,0
8,10616_1,0
9,9074_9,1


In [48]:
output.to_csv('submission/word2vec_cluster_randomforest.csv', index=False)