# NLP Using Word Vectors

Now, we are going to build the neural-network equivalents of the simple count-based NLP models we just went over, and extend our understanding of how to build NLP models by using contextual information from our corpora.

For this portion, I will introduce two additional NLP libraries:

- [spaCy](https://spacy.io/): Industrial-grade NLP library for building NLP pipelines over large corpora.
- [gensim](https://radimrehurek.com/gensim/index.html): A library that allows us to generate our own word vector embeddings.

Use the following code to install the libraries you will need if they aren't already installed:

```
conda install -c anaconda gensim
conda install -c conda-forge spacy
```

You also need to download a `spaCy` model that we can use as part of our explorations. 
In order to do so, run the following from a terminal window:

```
python3 -m spacy download en_core_web_sm # the small English model (the en shortcut was removed in spaCy 3)
python3 -m spacy download en_core_web_lg # the large English model
```

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display
import base64
import string #use for punctuation removal
import re
from collections import Counter
from time import time


from tqdm import tqdm #allows to monitor progress for long-running computations
tqdm.pandas(desc="progress-bar")

#nlp specific libraries
#nltk.download('stopwords')
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
tweet_tok = TweetTokenizer(preserve_case=False,strip_handles=True,reduce_len=True)

#package for building word2vec models ourselves
import gensim
from gensim.models.word2vec import Word2Vec # the word2vec model gensim class
from gensim.models.doc2vec import Doc2Vec
LabeledSentence = gensim.models.doc2vec.TaggedDocument # we'll talk about this down below
gensim.models.word2vec.FAST_VERSION
#industrial-strength NLP package that will allow us to use precomputed word vectors
import spacy
nlp = spacy.load('en_core_web_lg')


## Using another dataset for generating word vectors when your dataset is too small
Our goal will be to compare a couple of different strategies for creating features from text, feed them to a simple linear classifier, and see the results.

We will attempt 2 distinct strategies on the data we have:

1. Compute our own word vector representations from a larger collection of preprocessed versions of the tweets.
2. Use precomputed word vector representations on lightly preprocessed versions of the tweets.


As we explore each strategy, we will get a feel for some of the limitations of both, and some overall strategies for using word vectors for NLP tasks.

In [2]:
semeval_data = pd.read_csv("../data/semeval_sampled_cleaned_data.csv",sep="\t",names=["sentiment","tweet"],index_col=0)
semeval_data.head()

Unnamed: 0,sentiment,tweet
633616008910082048,negative,Donald Trump and Scott Walker would Negros bac...
639974060466663424,negative,@YidVids2 probably not cause he played with th...
664049298624114688,negative,"Woaw just because briana is ""having"" louis' ba..."
665627471899975680,negative,I wrote this about the 'SAS response' after th...
522951269271339008,negative,@MasterDebator_ @NFLosophy 2nd best in luck dr...


In [3]:
semeval_data["target"] = (semeval_data.sentiment == "positive").astype(int)

In [4]:
semeval_data.head()

Unnamed: 0,sentiment,tweet,target
633616008910082048,negative,Donald Trump and Scott Walker would Negros bac...,0
639974060466663424,negative,@YidVids2 probably not cause he played with th...,0
664049298624114688,negative,"Woaw just because briana is ""having"" louis' ba...",0
665627471899975680,negative,I wrote this about the 'SAS response' after th...,0
522951269271339008,negative,@MasterDebator_ @NFLosophy 2nd best in luck dr...,0


## An overview of frequency embeddings

We just used `CountVectorizer` and `TfidfVectorizer` to generate features from our texts.

Both of the above feature engineering strategies are a kind of **embedding** of the text. 

**Loosely speaking, an embedding is a mapping of some data into some other fixed-dimensional space.**

In the case of the 2 above strategies, we are creating what are called **frequency-based embeddings** where we **embed** each individual variable-length text into a fixed-length vector (whose size is the number of distinct tokens we are using).

So, we are completely ignoring the context (words surrounding each word) and simply creating a mapping of texts onto token frequency within that text.
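
As a rough illustration of a frequency-based embedding, here is a minimal sketch in plain Python. The helper name `count_embed` and the whitespace tokenization are assumptions made for this example; `CountVectorizer` does more than this.

```python
from collections import Counter

def count_embed(texts):
    """Map each text to a fixed-length vector of token counts."""
    tokenized = [text.lower().split() for text in texts]
    # The vocabulary fixes the vector length: one slot per distinct token.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[tok] for tok in vocab])
    return vocab, vectors

vocab, vectors = count_embed(["the cat sat", "the cat saw the other cat"])
print(vocab)    # ['cat', 'other', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [2, 1, 0, 1, 2]]
```

Both texts end up as vectors of length 5 even though they have different lengths, and word order plays no role in the result.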

## Context-based Embeddings

**However, there are other ways to generate embeddings!** 

One simple way to do this is to **create word co-occurrence matrices with some (usually fixed) context window**.

What this means is that you can simply count how often a given word appears next to some other word, where *next to* is defined by the context window size (how many tokens away the context word is allowed to be in order to be considered part of the given word's context).

**E.g.:**

In the sentence **My name is Sergey and my home is in Brooklyn.**, the co-occurrence value of the terms (Sergey, home) is 1 with a context window of 3 (*home* is three tokens after *Sergey*), but would be 0 if the context window were 2.
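
The example sentence can be checked with a few lines of Python. This is a minimal sketch: the function name `cooccurrence_counts` is made up for illustration, and it treats pairs as unordered and tokenizes on whitespace.

```python
from collections import Counter

def cooccurrence_counts(tokens, window):
    """Count unordered word pairs that appear within `window` tokens of each other."""
    counts = Counter()
    for i, word in enumerate(tokens):
        # Only look to the right so each pair of positions is counted once.
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            pair = tuple(sorted((word, tokens[j])))
            counts[pair] += 1
    return counts

tokens = "my name is sergey and my home is in brooklyn".split()
pair = ("home", "sergey")
print(cooccurrence_counts(tokens, window=3)[pair])  # 1
print(cooccurrence_counts(tokens, window=2)[pair])  # 0
```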

**After we have computed all of our co-occurrence counts, we transform each of our texts into some aggregate of all of the co-occurrence contexts that exist for that text (usually either the sum or the average).**

However, there are two **huge** problems with this approach:
- This yields a massive co-occurrence matrix. It is expensive to store (the number of distinct word/context pairs grows very quickly with the size of the corpus vocabulary).
- This massive matrix is also incredibly sparse. Massive, sparse feature spaces are hard for machine learning models to learn from.

So, we need some way to shrink this matrix. One way is to reduce its dimensionality with SVD or PCA. This actually works reasonably well, but we can do even better...
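
To make the SVD idea concrete, here is a small sketch using NumPy. The co-occurrence counts are made up, and `svd_embed` is a hypothetical helper, not part of any library.

```python
import numpy as np

# Toy symmetric co-occurrence matrix for a 4-word vocabulary (made-up counts).
vocab = ["good", "great", "bad", "awful"]
cooc = np.array([
    [0, 8, 1, 0],
    [8, 0, 0, 1],
    [1, 0, 0, 7],
    [0, 1, 7, 0],
], dtype=float)

def svd_embed(matrix, k):
    """Return one k-dimensional vector per row using a truncated SVD."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    # Keep the k largest singular directions, scaled by their singular values.
    return u[:, :k] * s[:k]

embeddings = svd_embed(cooc, k=2)
print(embeddings.shape)  # (4, 2)
```

Each word is now described by 2 numbers instead of a full row of the co-occurrence matrix. On a real vocabulary the matrix would be stored in a sparse format and factorized with a truncated solver.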

## Context-based Embeddings: Neural-Network based Predictive Embeddings

Now, let's go just a bit further. 

Instead of attempting to deterministically use the co-occurrence statistics of some text, which is a very high-dimensional representation of the corpus, we can try to create a model that predicts the most likely context given a specific word (or the most likely word given some context). 

The idea here is to learn some fixed-dimensional representation of the probability distribution of words given their contexts (**CBOW Word2Vec**) or contexts given a word (**skip-gram Word2Vec**).

**In both cases, what we are actually doing is training a shallow (single hidden layer) neural network on (word, context) pairs drawn from our corpus and extracting the weights of the learned hidden layer after training.**

So, the result of training word2vec is a fixed-length distributed representation of each distinct word in the vocabulary defined over the corpus.
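
To see what the two variants are trained on, here is a minimal sketch that only builds the training examples. It does not train a network, and the helper names `skipgram_pairs` and `cbow_examples` are invented for this illustration.

```python
def skipgram_pairs(tokens, window):
    """(center word, one context word) pairs: predict the context from the word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """(all context words, center word) examples: predict the word from its context."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = "today is the last day".split()
print(skipgram_pairs(tokens, window=1)[:3])  # [('today', 'is'), ('is', 'today'), ('is', 'the')]
print(cbow_examples(tokens, window=1)[2])    # (['is', 'last'], 'the')
```

Skip-gram produces one example per (word, context word) pair, while CBOW produces one example per position with the whole context as input.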

Once we have such a representation, we can do interesting things with it like:

- Find most similar words (since this is a vector, we simply compute the cosine similarity)
- Word math (this is the famous $king - man + woman \approx queen$ example)
- Odd one out (given a list of words, which is least like the others)
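
The three operations in this list can be sketched with hand-made toy vectors. The vectors and helper names here are assumptions for illustration only; real embeddings are learned and have hundreds of dimensions. The odd-one-out rule used (lowest cosine similarity to the mean of the unit-normalized vectors) is one common way to do it.

```python
import math

# Hand-made toy vectors (dimensions: royal, male, female, food); not learned embeddings.
vectors = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "lunch": [0.0, 0.0, 0.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def most_similar(query, exclude=()):
    """Word whose vector has the highest cosine similarity to `query`."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(vectors[w], query))

def analogy(a, b, c):
    """Solve a - b + c, e.g. king - man + woman."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    return most_similar(target, exclude={a, b, c})

def odd_one_out(words):
    """Word least similar to the mean of the unit-normalized vectors."""
    units = {}
    for w in words:
        norm = math.sqrt(sum(x * x for x in vectors[w]))
        units[w] = [x / norm for x in vectors[w]]
    mean = [sum(col) / len(words) for col in zip(*units.values())]
    return min(words, key=lambda w: cosine(units[w], mean))

print(most_similar(vectors["king"], exclude={"king"}))  # man
print(analogy("king", "man", "woman"))                  # queen
print(odd_one_out(["king", "queen", "man", "lunch"]))   # lunch
```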

We can also then use these new word representations as inputs into ML pipelines (either simple models or deep neural nets, as we will see below).

## CBOW Word2Vec Architecture


In CBOW, we are trying to predict the most likely word given some context.

That is, we are trying to **maximize the probability of the target word by looking at the context.**

Single word context (what is the most likely word given the word "really"?):
![word2vec](../images/word2vec_network.png)

Multiple Word Context, predicting most likely word (what is the most likely word in the context "Today is the last ... of the session"?):
![word2vec2](../images/word2vec_network_2.png)


In both of these cases, rare words are a problem because the model is designed to predict the most probable word. A rare word's contribution gets smoothed over the many examples that involve more frequent words.

## Skip-gram Word2Vec Architecture

In this case, we can turn the problem on its head, predicting the most likely context given some word.
![word2vec3](../images/word2vec_network_3.png)

Because the skip-gram model is designed to predict the context (rather than the most likely word), rare words do not compete with more frequent words.


Ultimately, both of these models have tradeoffs:
- **CBOW**: Very fast training (much faster than skip-gram), very good representation of the most frequent words in your vocabulary.
- **Skip-gram**: Slower to train, but preferred when training data is limited; represents rare words or phrases well.




## Other Embedding Architectures

- Doc2Vec
- Sense2Vec
- GloVe - word2vec ultimately loses the statistical properties of the corpus (co-occurrence statistics are not kept, since word2vec turns this into a prediction problem). GloVe factorizes the co-occurrence matrix in an interesting way (sort of like PCA/SVD) and can sometimes lead to better embeddings.
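
For reference, and as a sketch rather than a full treatment, the objective GloVe minimizes (as stated in the GloVe paper) is a weighted least-squares fit to the log co-occurrence counts. Here $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

The weighting function $f$ keeps very frequent pairs from dominating the loss; the paper uses $x_{\max} = 100$ and $\alpha = 3/4$.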

## Train your own word2vec model on a larger corpus of tweets using [gensim](https://radimrehurek.com/gensim/index.html)


Generating stable, usable word vectors requires a lot of data. We only have ~7k tweets. What we will do instead is use a much larger (1.6M tweets) [dataset available at kaggle](https://www.kaggle.com/kazanova/sentiment140/version/2#).

The process will be as follows:

- We generate word vectors using the large dataset above.
- We can then apply these word vectors to the preprocessed data from earlier, and combine each document as some kind of aggregation of each of the individual word vectors.

Regarding the tweet preprocessing, gensim expects the corpus in a specific form:
- We need to generate a list of token lists (one per tweet) across the entire corpus.
- We will lowercase and tokenize each tweet, dropping @-handles and URLs, as below.

Let's get to it. Here are a couple functions for converting individual tweets into processed documents:

In [99]:
def custom_tokenize(tweet):
    """Lowercase and tokenize a tweet, dropping @-handles and URLs."""
    try:
        tokens = tweet_tok.tokenize(tweet.lower())
    except Exception: # e.g. a non-string value such as NaN
        return 'NC'
    #tokens = [t for t in tokens if not t.startswith('#')]
    return [t for t in tokens if not t.startswith(('@', 'http'))]

def postprocess(data,column="tweet"):
    data['tokens'] = data[column].progress_map(custom_tokenize)
    # drop rows that could not be tokenized, then renumber the remaining rows
    data = data[data.tokens != 'NC'].reset_index(drop=True)
    return data

Let's process the individual tweets from the new dataset, as well as our original semeval dataset:

In [54]:
kaggle_data = pd.read_csv("../data/kaggle_1600000_tweets.csv")

In [100]:
kaggle_data_proc = postprocess(kaggle_data)

progress-bar: 100%|██████████| 1600000/1600000 [03:04<00:00, 8657.70it/s]


In [101]:
kaggle_data_proc.head()

Unnamed: 0,sentiment,tweet,tokens
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...","[-, awww, ,, that's, a, bummer, ., you, should..."
1,0,is upset that he can't update his Facebook by ...,"[is, upset, that, he, can't, update, his, face..."
2,0,@Kenichan I dived many times for the ball. Man...,"[i, dived, many, times, for, the, ball, ., man..."
3,0,my whole body feels itchy and like its on fire,"[my, whole, body, feels, itchy, and, like, its..."
4,0,"@nationwideclass no, it's not behaving at all....","[no, ,, it's, not, behaving, at, all, ., i'm, ..."


In [116]:
semeval_data_proc = postprocess(semeval_data)

progress-bar: 100%|██████████| 7105/7105 [00:01<00:00, 6083.59it/s]


In [198]:
semeval_data_proc.tokens.head().map(lambda x: " ".join(x))

0    donald trump and scott walker would negros bac...
1    probably not cause he played with the equivale...
2    woaw just because briana is " having " louis '...
3    i wrote this about the ' sas response ' after ...
4    2nd best in luck draft ? rg3 if u are talking ...
Name: tokens, dtype: object

In [102]:
def labelizeTweets(tweets, label_type):
    labelized = []
    for i,v in tqdm(enumerate(tweets)):
        label = '%s_%s'%(label_type,i)
        labelized.append(LabeledSentence(v, [label]))
    return labelized

In [103]:
kaggle_labelized = labelizeTweets(kaggle_data_proc.tokens, 'TRAIN')

1600000it [00:07, 216389.62it/s]


In [117]:
semeval_labelized  = labelizeTweets(semeval_data_proc.tokens, "TEST")

7105it [00:00, 339855.05it/s]


In [104]:
kaggle_labelized[0]

TaggedDocument(words=['-', 'awww', ',', "that's", 'a', 'bummer', '.', 'you', 'shoulda', 'got', 'david', 'carr', 'of', 'third', 'day', 'to', 'do', 'it', '.', ';d'], tags=['TRAIN_0'])

In [118]:
semeval_labelized[0]

TaggedDocument(words=['donald', 'trump', 'and', 'scott', 'walker', 'would', 'negros', 'back', 'to', 'africa', ';', 'they', 'would', 'try', 'to', 'change', 'the', '14th', 'amendment', 'to', 'the', 'constitution', '.'], tags=['TEST_0'])

In [232]:
#build the model on the kaggle data here, using 200-dimensional vectors
#note: gensim >= 4.0 renames the `size` argument to `vector_size`
n_dim=200
tweet_w2v = Word2Vec(size=n_dim, min_count=2,workers=4)
tweet_w2v.build_vocab([x.words for x in tqdm(kaggle_labelized)])






In [108]:
tweet_w2v.train([x.words for x in tqdm(kaggle_labelized)],total_examples=tweet_w2v.corpus_count,epochs=5)

100%|██████████| 1600000/1600000 [00:00<00:00, 1970467.56it/s]


(87481499, 117871440)

In [193]:
print("Words most similar to good:\n",tweet_w2v.wv.most_similar(positive=['good']))
print()
print("Words most similar to awesome minus bad:\n",tweet_w2v.wv.most_similar(positive=["awesome"],negative=["bad"]))
print()
print("Which of the words awesome, great, super, lunch don\'t match: ",tweet_w2v.wv.doesnt_match("awesome great super lunch".split()))
print()
print("How related is awesome to great: ",tweet_w2v.wv.similarity('awesome', 'great'))
print()
print("How related is great to bad: ",tweet_w2v.wv.similarity('great', 'bad'))


Words most similar to good:
 [('goood', 0.7672454118728638), ('great', 0.7289575338363647), ('rough', 0.6473965644836426), ('terrible', 0.6394548416137695), ('gd', 0.6380974650382996), ('fantastic', 0.6364789009094238), ('tough', 0.6343778371810913), ('fabulous', 0.6292765140533447), ('nice', 0.6236681342124939), ('horrible', 0.6177318096160889)]

Words most similar to awesome minus bad:
 [('amazing', 0.5577377080917358), ('incredible', 0.4800090193748474), ('awsome', 0.458223432302475), ('amaaazing', 0.4472930431365967), ('amazingg', 0.40331149101257324), ('amazinggg', 0.39434438943862915), ('amazin', 0.38376984000205994), ('unforgettable', 0.3774961233139038), ('awesomeee', 0.3774644434452057), ('hilarious', 0.3742592930793762)]

Which of the words awesome, great, super, lunch don't match:  lunch

How related is awesome to great:  0.68985933

How related is great to bad:  0.39900693


In [115]:
tweet_w2v.wv.most_similar("shawty")

[('foo', 0.65892493724823),
 ('juiceman', 0.6547018885612488),
 ('0n', 0.6425812244415283),
 ('nich', 0.6377840042114258),
 ('jeezy', 0.6357232332229614),
 ('tua', 0.6331261396408081),
 ('bilang', 0.6303133964538574),
 ('hao', 0.6293398141860962),
 ('l0l', 0.6287763714790344),
 ('yo', 0.6266430616378784)]

In [119]:
def buildWordVector(tokens, size):
    """Average the word2vec vectors of all in-vocabulary tokens."""
    vec = np.zeros((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += tweet_w2v.wv[word].reshape((1, size))
            count += 1.
        except KeyError: # the token is not in the word2vec vocabulary
            continue
    if count != 0:
        vec /= count
    return vec

Now that we have vectors representing individual words, we need to do something to combine a sequence of words (all represented as individual vectors) into a single vector representing an individual tweet. One way to do this is to simply average all word vectors for individual words in each tweet. The function above does exactly that.
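
To make the averaging concrete, here is a tiny worked example in plain Python. The 3-dimensional `toy_vectors` and the helper `average_vector` are hypothetical stand-ins for the trained model and the function above.

```python
# Hypothetical 3-dimensional word vectors standing in for a trained model.
toy_vectors = {
    "good":  [1.0, 0.0, 2.0],
    "movie": [3.0, 2.0, 0.0],
}

def average_vector(tokens, vectors, size):
    """Mean of the vectors of the known tokens; all zeros if none are known."""
    total = [0.0] * size
    known = 0
    for tok in tokens:
        if tok in vectors:  # skip out-of-vocabulary tokens
            total = [t + v for t, v in zip(total, vectors[tok])]
            known += 1
    return [t / known for t in total] if known else total

print(average_vector(["good", "movie", "zzz"], toy_vectors, size=3))  # [2.0, 1.0, 1.0]
print(average_vector(["zzz"], toy_vectors, size=3))                   # [0.0, 0.0, 0.0]
```

Note that the unknown token `zzz` is ignored rather than counted as a zero vector, so it does not drag the average toward zero.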

This step will take a while (~7 minutes on my machine). At this point, we are creating averaged vectors across all of the tokens that appear in every tweet in both our semeval and the kaggle dataset. Once that's done, we can finally build our model!

In [124]:
kaggle_vecs_w2v = np.concatenate([buildWordVector(z, n_dim) for z in tqdm(map(lambda x: x.words, kaggle_labelized))])
semeval_vecs_w2v = np.concatenate([buildWordVector(z, n_dim) for z in tqdm(map(lambda x: x.words, semeval_labelized))])

  
1600000it [06:46, 3931.65it/s]
7105it [00:02, 2394.79it/s]


In [122]:
semeval_vecs_w2v.shape

(7105, 200)

Since the kaggle data is also labeled by sentiment, we can see if training on that data and testing on ours provides better performance than simply training/testing on the 7K tweets in the smaller semeval dataset we have:

In [129]:
kaggle_data["target"] = (kaggle_data.sentiment >=1).astype(int)

Let's import what we need and build our model:

In [172]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [131]:
lr = LogisticRegression()
lr.fit(kaggle_vecs_w2v,kaggle_data.target)

In [132]:
accuracy_score(semeval_data.target,lr.predict(semeval_vecs_w2v))

0.685432793807178

Ok, so that didn't work too well. Let's see what happens if we cross-validate within just our own data:

In [133]:
from sklearn.model_selection import cross_val_score,StratifiedKFold
sk = StratifiedKFold(n_splits=10)
np.mean(cross_val_score(LogisticRegression(),
                        semeval_vecs_w2v,
                        semeval_data.target,
                        cv=sk,
                        scoring="accuracy"))

0.7952122580772963
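
The `StratifiedKFold` splitter used above keeps roughly the same class proportions in every fold. The sketch below illustrates that idea with a simple round-robin assignment; it is a simplified stand-in written for this explanation, not scikit-learn's actual algorithm, and `stratified_folds` is a made-up name.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold so every fold keeps the class balance."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal the samples of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

labels = [0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0]  # 8 negatives, 4 positives
for fold in stratified_folds(labels, n_splits=4):
    print(sorted(fold), [labels[i] for i in sorted(fold)])
```

Every fold ends up with two negatives and one positive, matching the 2:1 ratio of the full label list.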

One other thing we can do is create a weighted average vector using the "importance" of a given word. We can compute a word's importance in a variety of ways; one is to use a tf-idf-style weight for that word, given the dataset we have. The code below uses the inverse document frequency (the `idf_` attribute of a fitted `TfidfVectorizer`) as that weight. Let's try it using the kaggle data:

In [195]:
from sklearn.feature_extraction.text import TfidfVectorizer

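def generate_tfidf_word_dict(labeled_sentences,vectorizer=None):
    # build the default vectorizer per call so a fitted instance is never shared between calls
    if vectorizer is None:
        vectorizer = TfidfVectorizer(analyzer=lambda x: x,min_df=10)
    vectorizer.fit([x.words for x in tqdm(labeled_sentences)])
    # map each token to its IDF weight (scikit-learn < 1.0 calls this get_feature_names())
    tfidf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))
    print('vocab size :', len(tfidf))
    return tfidf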

In [175]:
tfidf_kaggle = generate_tfidf_word_dict(kaggle_labelized)




vocab size : 35203


In [186]:
tfidf_semeval = generate_tfidf_word_dict(semeval_labelized,TfidfVectorizer(analyzer=lambda x: x))






vocab size : 16725


In [180]:
#same as the buildWordVector function above, except that each word vector
#is scaled by the word's weight from the tfidf dictionary we pass in
def buildWordVector_tfidf(tokens, size,tfidf_dict):
    vec = np.zeros((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += tweet_w2v.wv[word].reshape((1, size)) * tfidf_dict[word]
            count += 1.
        except KeyError: # the token is missing from the word2vec vocabulary
                         # or from the tfidf dictionary
            continue
    if count != 0:
        vec /= count
    return vec
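
For intuition about the weights used here, the sketch below computes smoothed IDF values by hand, using the formula scikit-learn documents for its default `smooth_idf=True` setting. The helper `idf_weights` and the toy documents are made up for illustration.

```python
import math

def idf_weights(docs):
    """Smoothed IDF per token: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = {}
    for doc in docs:
        for tok in set(doc):  # count each token once per document
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log((1 + n) / (1 + count)) + 1 for tok, count in df.items()}

docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
idf = idf_weights(docs)
print(idf["movie"], idf["plot"])
```

Tokens that appear in fewer documents (`plot`) get a larger weight than tokens that appear in many (`movie`), so rarer words contribute more to the weighted tweet vector.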

In [181]:
semeval_vecs_w2v_tfidf = np.concatenate([buildWordVector_tfidf(z,
                                                               n_dim,
                                                               tfidf_kaggle) for z in tqdm(map(lambda x: x.words,
                                                                                               semeval_labelized))])





  





In [187]:
semeval_vecs_w2v_tfidf2 = np.concatenate([buildWordVector_tfidf(z,
                                                                n_dim,
                                                                tfidf_semeval) for z in tqdm(map(lambda x: x.words,
                                                                                                 semeval_labelized))])





  





Performance using kaggle tfidf scores:

In [184]:
np.mean(cross_val_score(LogisticRegression(),
                        semeval_vecs_w2v_tfidf,
                        semeval_data.target,
                        cv=sk,
                        scoring="accuracy"))

0.793242408034706

Performance using semeval tfidf scores:

In [188]:
np.mean(cross_val_score(LogisticRegression(),
                        semeval_vecs_w2v_tfidf2,
                        semeval_data.target,
                        cv=sk,
                        scoring="accuracy"))

0.7939456429151561

## Bonus: Instead of using word2vec and averaging vectors, use Doc2Vec

In [146]:
#build the model on the kaggle data here, using 200-dimensional vectors
#note: gensim >= 4.0 renames the `size` argument to `vector_size`
n_dim=200
tweet_doc2v = Doc2Vec(size=n_dim, min_count=2,workers=4)
#tweet_doc2v.build_vocab([x.words for x in tqdm(kaggle_labelized)])



In [148]:
tweet_doc2v.build_vocab(tqdm(kaggle_labelized))

100%|██████████| 1600000/1600000 [00:16<00:00, 94347.15it/s]


In [151]:
tweet_doc2v.train(tqdm(kaggle_labelized),total_examples=tweet_doc2v.corpus_count,epochs=5)




In [166]:
semeval_vecs_doc2v = np.concatenate([tweet_doc2v.infer_vector(z.words).reshape((1,-1)) for z in tqdm(semeval_labelized)],axis=0)




In [171]:
np.mean(cross_val_score(LogisticRegression(),
                        semeval_vecs_doc2v,
                        semeval_data.target,
                        cv=sk,
                        scoring="accuracy",
                        n_jobs=-1)
       )

0.754679582417147

## TODO 

1. Alter the size of the embedding dimension. How does this change performance of our model?

In [None]:
pass

## Use precomputed spaCy vectors

Here, we will use vector-space representations of words (word embeddings) that have already been computed for us. Since each word has its own vector, and each of our documents is a collection of words, the typical strategy is to look up the vector for each word in a given document (tweet) and average across all of the vectors found, as we did before. This is also what the `spaCy` package does by default.

However, for spaCy to work on our lightly preprocessed data, we need to rejoin each of our "cleaned" tweets back into "sentences". We will also filter the text further by removing pronouns, stopwords, and punctuation, and by lemmatizing all of our terms.
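
The averaging strategy described above can be sketched with plain NumPy. This is only an illustration: the `toy_vectors` dictionary and the `average_vector` helper are hypothetical names made up for this example, and the 3-dimensional vectors stand in for real 300-dimensional embeddings. Note that this sketch skips words that have no vector, which is an assumption of the example; spaCy's own handling of out-of-vocabulary tokens may differ.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings (real pretrained vectors are much larger).
toy_vectors = {
    "wish": np.array([1.0, 0.0, 2.0]),
    "bolton": np.array([0.0, 2.0, 0.0]),
    "tonight": np.array([2.0, 1.0, 1.0]),
}

def average_vector(tokens, vectors, dim=3):
    """Average the vectors of all known tokens; return zeros if none are known."""
    found = [vectors[tok] for tok in tokens if tok in vectors]
    if not found:
        # Same fallback idea as filling a vector-less document with zeros.
        return np.zeros(dim, dtype="float32")
    return np.mean(found, axis=0).astype("float32")

# ":(" has no entry in toy_vectors, so it is ignored in the average.
doc_vec = average_vector(["wish", "bolton", "tonight", ":("], toy_vectors)
empty_vec = average_vector(["unknownword"], toy_vectors)
print(doc_vec.shape, empty_vec.shape)
```

Every document ends up with a vector of the same length regardless of how many words it contains, which is what lets us feed the result straight into a linear model.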

## Clean up our text using spaCy

In [211]:
# Clean text before feeding it to spaCy
punctuations = string.punctuation
# Build the stopword set once; calling stopwords.words() for every document is very slow.
# Note: with no language argument, NLTK returns the stopwords of every language it ships.
stop_words = set(stopwords.words())

# Define function to clean up text by removing pronouns, stopwords, and punctuation
def cleanup_text(docs, logging=True):
    proc_docs = []
    # disable the parser and named entity recognizer, which we don't need here
    for counter, doc in enumerate(nlp.pipe(docs, disable=['parser', 'ner']), start=1):
        if counter % 100 == 0 and logging:
            print("Processed %d out of %d documents." % (counter, len(docs)))
        # lemmatize and lowercase; spaCy 2.x gives pronouns the lemma '-PRON-'
        tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
        # remove stopwords and punctuation
        tokens = [tok for tok in tokens if tok not in stop_words and tok not in punctuations]
        proc_docs.append(' '.join(tokens))
    return proc_docs

The code above does the following:
- tokenizes each document with spaCy's tokenizer
- lemmatizes each token (finds the root form of each word) and drops pronouns
- lowercases every token
- removes stopwords and punctuation

This is a fairly involved process and can take some time.
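
To make the filtering step concrete, here is a minimal sketch of the stopword and punctuation check on its own, without spaCy. The tiny `stop_words` set and the `filter_tokens` helper are hypothetical stand-ins for this example only. One detail worth noticing: because `string.punctuation` is a single string, `tok not in punctuations` is a substring test, so single punctuation marks and empty strings are dropped, while an emoticon such as `:(` survives because it is not a contiguous substring of `string.punctuation`.

```python
import string

punctuations = string.punctuation

# Hypothetical, tiny stopword set standing in for the NLTK stopword list.
stop_words = {"i", "was", "in", "be"}

def filter_tokens(tokens):
    """Keep tokens that are neither stopwords nor substrings of string.punctuation."""
    return [tok for tok in tokens if tok not in stop_words and tok not in punctuations]

# Tokens as they might look after lemmatizing and lowercasing.
kept = filter_tokens(["wish", "be", "in", "bolton", "tonight", ":(", "!", ""])
print(" ".join(kept))
```

If you would rather remove emoticons as well, you would need an explicit per-character check instead of the substring test.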

In [199]:
semeval_data_proc["tokens_joined"] = semeval_data_proc.tokens.map(lambda x: " ".join(x))

In [200]:
semeval_data_proc.head()

Unnamed: 0,sentiment,tweet,target,tokens,tokens_joined
0,negative,Donald Trump and Scott Walker would Negros bac...,0,"[donald, trump, and, scott, walker, would, neg...",donald trump and scott walker would negros bac...
1,negative,@YidVids2 probably not cause he played with th...,0,"[probably, not, cause, he, played, with, the, ...",probably not cause he played with the equivale...
2,negative,"Woaw just because briana is ""having"" louis' ba...",0,"[woaw, just, because, briana, is, "", having, ""...","woaw just because briana is "" having "" louis '..."
3,negative,I wrote this about the 'SAS response' after th...,0,"[i, wrote, this, about, the, ', sas, response,...",i wrote this about the ' sas response ' after ...
4,negative,@MasterDebator_ @NFLosophy 2nd best in luck dr...,0,"[2nd, best, in, luck, draft, ?, rg3, if, u, ar...",2nd best in luck draft ? rg3 if u are talking ...


In [212]:
#semeval_data_spacy_clean = pd.Series(cleanup_text(semeval_data_proc.tokens_joined,logging=True))


In [222]:
# Clean up all text DO NOT RUN THIS AS IT TAKES SOME TIME

#semeval_data_spacy_clean = pd.Series(cleanup_text(semeval_data_proc.tokens_joined,logging=True))

# LOAD IN THE CLEANED DATA INSTEAD :)
semeval_data_spacy_proc = pd.read_csv("../data/semeval_data_cleaned_spacy.csv")
semeval_data_spacy_proc.head()

Unnamed: 0,tweet,spacy_cleaned_tweet,sentiment,target
0,Donald Trump and Scott Walker would Negros bac...,donald trump scott walker would negros back af...,negative,0
1,@YidVids2 probably not cause he played with th...,probably play equivalent real madrid 1st team ...,negative,0
2,"Woaw just because briana is ""having"" louis' ba...",woaw briana louis baby sun shin ass,negative,0
3,I wrote this about the 'SAS response' after th...,write sas response charlie hebdo murder januar...,negative,0
4,@MasterDebator_ @NFLosophy 2nd best in luck dr...,2nd best luck draft rg3 talk prospect russell ...,negative,0


An example original tweet:

In [225]:
semeval_data_spacy_proc.tweet[20]

'I wish I was in Bolton tonight :('

The same tweet after our earlier tokenization and the spaCy cleanup:

In [227]:
semeval_data_spacy_proc.spacy_cleaned_tweet[20]

'wish bolton tonight :('

Now we can let spaCy compute the averaged word vector for each cleaned tweet:

In [229]:
start = time()
semeval_spacy_vecs = []
for doc in nlp.pipe(semeval_data_spacy_proc.spacy_cleaned_tweet, batch_size=500):
    if doc.has_vector:
        # doc.vector is the average of the document's token vectors
        semeval_spacy_vecs.append(doc.vector)
    else:
        # If doc doesn't have a vector, then fill it with zeros.
        semeval_spacy_vecs.append(np.zeros((300,), dtype="float32"))

# stack the per-document vectors into a (n_documents, 300) matrix
semeval_spacy_vecs = np.array(semeval_spacy_vecs)

end = time()
print('Total time passed parsing documents: {} seconds'.format(end - start))
print('Total number of documents parsed: {}'.format(len(semeval_spacy_vecs)))
print('Size of vector embeddings: ', semeval_spacy_vecs.shape[1])
print('Shape of vectors embeddings matrix: ', semeval_spacy_vecs.shape)

Total time passed parsing documents: 23.35044002532959 seconds
Total number of documents parsed: 7105
Size of vector embeddings:  300
Shape of vectors embeddings matrix:  (7105, 300)


Again, let's build a simple linear model:

In [231]:
sk = StratifiedKFold(n_splits=10)
np.mean(cross_val_score(LogisticRegression(),
                        semeval_spacy_vecs,
                        semeval_data_spacy_proc.target,
                        cv=sk,
                        scoring="accuracy"))

0.8153404251104377

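Look at that, we have a bit of improvement: mean cross-validated accuracy rose from about 0.755 with the doc2vec features to about 0.815 with the precomputed spaCy vectors.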