# `word2vec`
Feature extraction is done via `word2vec`. `word2vec` model is trained from scratch with both labeled and unlabeled train data.

Submission accuracy (average): `0.82724`  
Submission accuracy (clustering): `0.80076`

Download necessary NLTK files:
* `stopwords`: Stopwords Corpus
* `wordnet`: WordNet
* `punkt`: Punkt Tokenizer Models

In [1]:
%%script false
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.donwload('punkt')

In [2]:
import numpy as np
import pandas as pd
import nltk
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from gensim.models import word2vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from os.path import join
from tqdm import tqdm

tqdm.pandas()

Suppress warnings by BeautifulSoup:  
```UserWarning: <string> looks like a filename, not markup. You should probably open this file and pass the filehandle into Beautiful Soup.```  

```UserWarning: <string> looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.```

In [3]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

Since `word2vec` is unsupervised, we can use the unlabeled training data here.

In [4]:
src = 'data'
df_labeled_train = pd.read_csv(join(src, 'labeledTrainData.tsv'), sep='\t')
df_unlabeled_train = pd.read_csv(join(src, 'unlabeledTrainData.tsv'), sep='\t', quoting=3)
df_test = pd.read_csv(join(src, 'testData.tsv'), sep='\t')

df_labeled_train.shape, df_unlabeled_train.shape, df_test.shape

((25000, 3), (50000, 2), (25000, 2))

## Data cleaning
We clean the data similarly to how it was done previously. However, we avoid removing numbers and stop words. We also leave each sentence as a list of words, instead of an entire string.

In [5]:
stopwords = set(stopwords.words('english'))

def clean_sentence(sentence, remove_stopwords):
    removed_markup = BeautifulSoup(sentence, 'html.parser').get_text()
    removed_punctuation = re.sub(r'[^a-zA-Z0-9]', ' ', removed_markup)
    tokens = removed_punctuation.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    return tokens

`word2vec` expects single sentences, each one as a list of words. In order to split a paragraph into sentences, we use NLTK's `punkt` tokenizer.

In [6]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

def review_to_sentences(review, remove_stopwords):
    raw_sentences = tokenizer.tokenize(review.strip())
    return [clean_sentence(sentence, remove_stopwords) for sentence in raw_sentences if len(sentence) > 0]

In [7]:
sentences = []
for review in tqdm(df_labeled_train['review']):
    sentences.extend(review_to_sentences(review, remove_stopwords=False))
for review in tqdm(df_unlabeled_train['review']):
    sentences.extend(review_to_sentences(review, remove_stopwords=False))

100%|██████████| 25000/25000 [00:35<00:00, 694.49it/s]
100%|██████████| 50000/50000 [01:13<00:00, 677.38it/s]


In [8]:
len(sentences)

795872

## Training `word2vec`
**Parameters**  
`num_features`: The dimension of each word vector  
`min_word_count`: Words that occur fewer than this value across all training samples are ignored. Reasonable values are [10, 100].  
`num_workers`: Parallel processing threads  
`window`: Size of context window  
`downsampling`: Amount of downsampling to use for frequent words. Google documentation recommends [1e-5, 1e-3].  

In [9]:
num_features = 300
min_word_count = 100
num_workers = 4
window = 10
downsampling = 1e-3

In [10]:
%%time
model = word2vec.Word2Vec(sentences,
                          size=num_features,
                          window=window,
                          min_count=min_word_count,
                          sample=downsampling)

CPU times: user 4min 5s, sys: 559 ms, total: 4min 6s
Wall time: 1min 26s


In [11]:
len(model.wv.vocab)

9469

In [12]:
# makes the model more memory efficient, but only if not training any further
model.init_sims(replace=True)
model.save(f'model/features{num_features}_minword{min_word_count}_window{window}')

## Evaluate model

In [13]:
model.wv.doesnt_match('france england germany berlin'.split())

  if np.issubdtype(vec.dtype, np.int):


'berlin'

In [14]:
model.wv.most_similar('man')

  if np.issubdtype(vec.dtype, np.int):


[('woman', 0.6231356859207153),
 ('lady', 0.6052083969116211),
 ('monk', 0.5642334222793579),
 ('lad', 0.5614923238754272),
 ('millionaire', 0.5309739112854004),
 ('guy', 0.5214076638221741),
 ('farmer', 0.5165209770202637),
 ('soldier', 0.5162621736526489),
 ('boxer', 0.5141708254814148),
 ('doctor', 0.5076218843460083)]

## Vectorizing paragraphs: averaging
For all words in each paragraph, we obtain their corresponding word vectors. The vector of the paragraph is then taken to be the average of all its word vectors.

In [15]:
def compute_sentence_vec(sentence):
    word_vecs = [model.wv[word] for word in sentence if word in model.wv.vocab]
    
    if len(word_vecs) == 0:
        return np.zeros([model.vector_size])
    
    feature_vec = np.stack(word_vecs)
    return np.average(feature_vec, axis=0)

def compute_review_vec(review):
    feature_vec = np.stack([compute_sentence_vec(sentence) 
                            for sentence in review_to_sentences(review, remove_stopwords=True)
                            if len(sentence) > 0])
    return np.average(feature_vec, axis=0)

In [16]:
train_vecs = np.stack([compute_review_vec(review) for review in tqdm(df_labeled_train['review'])])
test_vecs = np.stack([compute_review_vec(review) for review in tqdm(df_test['review'])])
train_vecs.shape, test_vecs.shape

100%|██████████| 25000/25000 [00:58<00:00, 423.92it/s]
100%|██████████| 25000/25000 [00:56<00:00, 446.29it/s]


((25000, 300), (25000, 300))

### Random forest classifier

In [17]:
%%time
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train_vecs, df_labeled_train['sentiment'])

CPU times: user 41.1 s, sys: 9.31 ms, total: 41.1 s
Wall time: 41.1 s


In [18]:
%%time
pred = clf.predict(test_vecs)

CPU times: user 644 ms, sys: 7.73 ms, total: 651 ms
Wall time: 650 ms


In [19]:
output = pd.DataFrame({'id': df_test['id'], 'sentiment': pred})
output.head()

Unnamed: 0,id,sentiment
0,12311_10,1
1,8348_2,0
2,5828_4,1
3,7186_2,0
4,12128_7,1


In [20]:
output.shape

(25000, 2)

In [21]:
output.to_csv('submission/word2vec_avg_randomforest.csv', index=False)

## Vectorizing paragraphs: clustering
The words in the `word2vec` model vocabulary are first clustered. The vector for each paragraph will then be the sum of occurrences of each cluster, based on the words in that paragraph.

In [22]:
%%time
word_vectors = model.wv.vectors
n_clusters = word_vectors.shape[0] // 5
clf = KMeans(n_clusters=n_clusters, random_state=0)
idx = clf.fit_predict(word_vectors)

CPU times: user 9min 40s, sys: 5min 27s, total: 15min 8s
Wall time: 2min 56s


Examine the words in 10 clusters.

In [23]:
model_vocab = np.array(list(model.wv.vocab.keys()))
for i in range(10):
    print(model_vocab[idx == i])

['day' '75' 'block' 'chock']
['imagining' 'weather' 'gangster' '85' 'protection' 'uncover' 'bradley'
 'toe' 'recurring']
['glorious' 'storm' 'abuse' 'indiana' 'mcdonald']
['topless']
['elegant' 'guts' 'ingredients' 'shed' 'owner' 'irish' 'warfare'
 'realization' 'slater']
['overly' 'shown' 'weller' 'aura' 'defending']
['spelled' 'vinny' 'practices']
['lyrics']
['pursuit']
['valuable' 'busey' 'disturbing' 'follow' 'holding' 'transfer'
 'lackluster' 'resemble' 'concert' 'absent' 'falcon' 'overs']


Create a dictionary where the keys are the words in the `word2vec` vocabulary, and the values are the cluster indices of those words.  

Look at a few sample words and their clusters.

In [24]:
cluster_mapping = dict(zip(model_vocab, idx))
for i, k in enumerate(cluster_mapping):
    print(k, cluster_mapping[k])
    if i > 5:
        break

with 860
all 1325
this 1803
stuff 543
going 91
down 430
at 930


In [25]:
def review_centroid_vec(review):
    centroid_counts = np.zeros([len(cluster_mapping)])
    for sentence in review_to_sentences(review, remove_stopwords=True):
        for word in sentence:
            if word in cluster_mapping:
                centroid_counts[cluster_mapping[word]] += 1
    return centroid_counts

In [26]:
train_vecs = np.stack([review_centroid_vec(review) for review in tqdm(df_labeled_train['review'])])
test_vecs = np.stack([review_centroid_vec(review) for review in tqdm(df_test['review'])])

100%|██████████| 25000/25000 [00:36<00:00, 685.52it/s]
100%|██████████| 25000/25000 [00:35<00:00, 698.35it/s]


In [27]:
train_vecs.shape

(25000, 9469)

### Random forest classifier

In [28]:
%%time
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train_vecs, df_labeled_train['sentiment'])

CPU times: user 1min 54s, sys: 351 ms, total: 1min 55s
Wall time: 1min 55s


In [29]:
%%time
pred = clf.predict(test_vecs)

CPU times: user 1.58 s, sys: 236 ms, total: 1.81 s
Wall time: 1.81 s


In [30]:
output = pd.DataFrame({'id': df_test['id'], 'sentiment': pred})
output.head()

Unnamed: 0,id,sentiment
0,12311_10,1
1,8348_2,0
2,5828_4,1
3,7186_2,1
4,12128_7,1


In [31]:
output.shape

(25000, 2)

In [32]:
output.to_csv('submission/word2vec_cluster_randomforest.csv', index=False)