# Word2Vec for Text Classification
We will use the Sentiment Labelled Sentences dataset from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

The dataset consists of 1500 positive and 1500 negative sentences from Amazon, Yelp and IMDB.

In [1]:
#basic imports
import warnings
warnings.filterwarnings('ignore')
import os
import wget
import gzip
import shutil
from time import time

#pre-processing imports
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from string import punctuation

#imports related to modeling
import numpy as np
from gensim.models import Word2Vec, KeyedVectors
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

[nltk_data] Downloading package stopwords to /home/rachel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/rachel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Download and load the pre-trained embedding model. We will use the Google News vectors:

In [2]:
path_to_model = 'DATAPATH/GoogleNews-vectors-negative300.bin'

if not os.path.exists(path_to_model):
    !mkdir DATAPATH
    !wget -P DATAPATH https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
    !gunzip DATAPATH/GoogleNews-vectors-negative300.bin.gz

#Load W2V model. This will take some time. 
%time w2v_model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)
print('done loading Word2Vec')

CPU times: user 2.1 s, sys: 1min 10s, total: 1min 13s
Wall time: 1min 13s
done loading Word2Vec


"Word2Vec... model that embeds words in a lower-dimensional vector space using a shallow neural network. The result is a set of word-vectors where vectors close together in vector space have similar meanings based on context, and word-vectors distant to each other have differing meanings. For example, strong and powerful would be close together and strong and Paris would be relatively far." https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py

This site provides a simple way to get the model (noted here for future reference):

In [25]:
# import gensim.downloader as api
# wv = api.load('word2vec-google-news-300')



Download the data:

In [3]:
training_data_path = "Data/sentiment_sentences.txt"
if not os.path.exists(training_data_path):
    github_prefix = 'https://github.com/practical-nlp/practical-nlp-code/blob/master/Ch4/Data/sentiment%20labelled%20sentences'
    !mkdir Data
    !wget -P Data {github_prefix}/amazon_cells_labelled.txt?raw=True
    !wget -P Data {github_prefix}/imdb_labelled.txt?raw=True
    !wget -P Data {github_prefix}/yelp_labelled.txt?raw=True
    # Create an empty output file; it is filled in below
    file = open(training_data_path, 'w')
    file.close()
    filenames = ['amazon_cells_labelled.txt?raw=True', 'imdb_labelled.txt?raw=True', 'yelp_labelled.txt?raw=True']
    with open(training_data_path, 'w') as outfile:
        for fname in filenames:
            with open(f'Data/{fname}') as infile:
                outfile.write(infile.read())
    print("File created")
else:
    print("File already exists")

File already exists


Load the data:

In [4]:
#the file consists of tab separated sentences and categories {1: positive, 0:negative}
texts = []
cats = []
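with open(training_data_path) as fh:  # the context manager closes the file for us
    for line in fh:
        text, sentiment = line.split("\t")
        texts.append(text)
        cats.append(int(sentiment[:1]))  # first character of the label field: '0' or '1'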
#Inspect the dataset
print(len(cats), len(texts))
list(zip(texts[:5], cats[:5]))

3000 3000


[('So there is no way for me to plug it in here in the US unless I go by a converter.',
  0),
 ('Good case, Excellent value.', 1),
 ('Great for the jawbone.', 1),
 ('Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!',
  0),
 ('The mic is great.', 1)]

### Preprocess the Text

In [10]:
def preprocess_corpus(texts):
    mystopwords = set(stopwords.words("english"))
    def remove_stops_digits(tokens):
        #Nested function that lowercases, removes stopwords and digits from a list of tokens
        return [token.lower() for token in tokens if token.lower() not in mystopwords and not token.isdigit()
               and token not in punctuation]
    #The return statement below uses the above function to further process the word_tokenize output.
    return [remove_stops_digits(word_tokenize(text)) for text in texts]

texts_processed = preprocess_corpus(texts)



In [14]:
print(f'Length training data: {len(texts_processed)}\nLength labelled data: {len(cats)}\n')
for i in range(5):
    print(f'Original text: {texts[i]}\nPre-processed text: {texts_processed[i]}')
    print(f'Labelled data positive (1) or negative (0)?: {cats[i]}\n')

Length training data: 3000
Length labelled data: 3000

Original text: So there is no way for me to plug it in here in the US unless I go by a converter.
Pre-processed text: ['way', 'plug', 'us', 'unless', 'go', 'converter']
Labelled data positive (1) or negative (0)?: 0

Original text: Good case, Excellent value.
Pre-processed text: ['good', 'case', 'excellent', 'value']
Labelled data positive (1) or negative (0)?: 1

Original text: Great for the jawbone.
Pre-processed text: ['great', 'jawbone']
Labelled data positive (1) or negative (0)?: 1

Original text: Tied to charger for conversations lasting more than 45 minutes.MAJOR PROBLEMS!!
Pre-processed text: ['tied', 'charger', 'conversations', 'lasting', 'minutes.major', 'problems']
Labelled data positive (1) or negative (0)?: 0

Original text: The mic is great.
Pre-processed text: ['mic', 'great']
Labelled data positive (1) or negative (0)?: 1



### Data Engineering/Text Representation
Create one feature vector per sentence by averaging the embeddings of the words in that sentence.

First let's have a play about with the word2vec model:

In [23]:
w2v_model.most_similar('France')

[('French', 0.7000749707221985),
 ('extradites_Noriega', 0.6946742534637451),
 ('Belgium', 0.6933181285858154),
 ('Villebon_Sur_Yvette', 0.6776413321495056),
 ('PARIS_AFX_Gaz_de', 0.662800133228302),
 ('called_Xynthia_blew', 0.6588140726089478),
 ('Brive_la', 0.644013524055481),
 ('COLVILLE_SUR_MER', 0.6336530447006226),
 ('Paris', 0.6334909200668335),
 ('Germany', 0.6270756125450134)]

In [29]:
pairs = [
    ('car', 'minivan'),   # a minivan is a kind of car
    ('car', 'bicycle'),   # still a wheeled vehicle
    ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('car', 'cereal'),    # ... and so on
    ('car', 'communism'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, w2v_model.similarity(w1, w2)))

'car'	'minivan'	0.69
'car'	'bicycle'	0.54
'car'	'airplane'	0.42
'car'	'cereal'	0.14
'car'	'communism'	0.06
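The `similarity` scores above are cosine similarities between the two word vectors. As a minimal sketch of what that means, here is the same calculation on made-up 3-dimensional vectors (the vectors and the helper name `cosine_similarity` are illustrative assumptions, not values taken from the Google News model):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up toy vectors, chosen only so that the first two point in similar directions
strong = np.array([0.9, 0.8, 0.1])
powerful = np.array([0.8, 0.9, 0.2])
paris = np.array([0.1, 0.2, 0.9])

sim_close = cosine_similarity(strong, powerful)
sim_far = cosine_similarity(strong, paris)
print(f"strong/powerful: {sim_close:.2f}")
print(f"strong/paris:    {sim_far:.2f}")
```

Because the measure depends only on direction, not on vector length, scaling a vector leaves its similarity to other vectors unchanged.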


In [55]:
print(w2v_model.most_similar(positive=['dog', 'cat'], topn=10))

[('puppy', 0.8089798092842102), ('dogs', 0.8045638203620911), ('cats', 0.7861028909683228), ('beagle', 0.7763327360153198), ('pup', 0.7641833424568176), ('pooch', 0.7634377479553223), ('pit_bull', 0.7533083558082581), ('kitten', 0.7526556849479675), ('pet', 0.7490030527114868), ('golden_retriever', 0.7472557425498962)]


In [31]:
print(w2v_model.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car']))

car


In [27]:
good_vec = w2v_model.get_vector('good')
print(good_vec[-5:])  # look at the last 5 components of the embedding for 'good'
print(good_vec.shape)

[ 0.05444336 -0.01184082 -0.15625     0.00193024 -0.06982422]
(300,)


Note that the vector has 300 dimensions. That is the case for ALL vectors in this word2vec model.

Use this pre-learned embedding to represent features by averaging the embeddings for individual words in preprocessed text. The code snippet below shows a simple function to do this:

In [42]:
def embedding_feats(list_of_lists):
    DIMENSION = 300
    zero_vector = np.zeros(DIMENSION)
    feats = []
    for tokens in list_of_lists:
        feat_for_this =  np.zeros(DIMENSION)
        count_for_this = 0
        for token in tokens:
            if token in w2v_model:  # Use embeddings only for the words that are present in the vocabulary
                feat_for_this += w2v_model[token]  # get the 300-dim vector for the token
                count_for_this += 1
        if count_for_this > 0:
            feats.append(feat_for_this / count_for_this)  # average over the known tokens only
        else:
            # No token in this sentence is in the vocabulary (or the sentence is empty): use zeros
            feats.append(zero_vector)
    return feats

train_vectors = embedding_feats(texts_processed)
len(train_vectors)

3000

If we’re working on a custom domain whose vocabulary is remarkably different from that of the pre-trained news embeddings we used here, it would make sense to train our own embeddings to extract features.

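Note that tokens missing from the model's vocabulary are skipped without any message, so getting 3000 vectors back does not by itself show that every word was covered. Checking coverage needs an explicit count.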
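A quick way to check coverage explicitly is to count how many tokens the vocabulary actually contains. The sketch below is one possible helper; the name `vocabulary_coverage`, the toy vocabulary and the toy sentences are all assumptions made for illustration, and a plain `set` stands in for the model so the example runs without the 300-dimensional vectors loaded:

```python
from collections import Counter

def vocabulary_coverage(tokenized_texts, vocab):
    """Return (fraction of tokens found in vocab,
               Counter of tokens not found,
               number of texts with no known token at all)."""
    total = 0
    missing = Counter()
    texts_without_known_tokens = 0
    for tokens in tokenized_texts:
        known = 0
        for token in tokens:
            total += 1
            if token in vocab:
                known += 1
            else:
                missing[token] += 1
        if known == 0:
            texts_without_known_tokens += 1
    covered = (total - sum(missing.values())) / total if total else 0.0
    return covered, missing, texts_without_known_tokens

# Toy stand-ins (hypothetical): a tiny vocabulary and three tokenized sentences
toy_vocab = {"good", "case", "excellent", "value", "tied", "charger"}
toy_texts = [
    ["good", "case", "excellent", "value"],
    ["tied", "charger", "minutes.major"],
    [],
]
coverage, missing, empty_texts = vocabulary_coverage(toy_texts, toy_vocab)
print(f"coverage: {coverage:.2%}")
print("missing tokens:", missing.most_common(5))
print("texts with no known token:", empty_texts)
```

Since the notebook already relies on `token in w2v_model`, calling `vocabulary_coverage(texts_processed, w2v_model)` should work the same way on the real data.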

### Model and Evaluate

In [68]:
#Take any classifier (LogisticRegression here), and train/test it like before.
classifier = LogisticRegression(random_state=42)
train_data, test_data, train_cats, test_cats = train_test_split(train_vectors, cats)
classifier.fit(train_data, train_cats)
print("Accuracy: ", classifier.score(test_data, test_cats))
preds = classifier.predict(test_data)
print(classification_report(test_cats, preds))

Accuracy:  0.8266666666666667
              precision    recall  f1-score   support

           0       0.84      0.82      0.83       392
           1       0.81      0.83      0.82       358

    accuracy                           0.83       750
   macro avg       0.83      0.83      0.83       750
weighted avg       0.83      0.83      0.83       750
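The per-class figures in the report come from counts of true positives, false positives and false negatives. As a minimal sketch of that arithmetic, here it is on made-up labels (the helper name `precision_recall_f1` and the toy labels are illustrative assumptions, not the notebook's predictions):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up labels: 3 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Passing `positive=0` gives the figures for the negative class, which is how the report gets one row per label.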



### Disadvantages of this word2vec model

If a word in our dataset is not present in the pre-trained model’s vocabulary, how do we get a representation for it? This is known as the out-of-vocabulary (OOV) problem.

In [51]:
w2v_model.most_similar('Flabbergast')

KeyError: "Key 'Flabbergast' not present"

One solution is fastText:
- This approach can handle words that did not appear in training data (OOV).
- The implementation facilitates extremely fast learning on even very large corpora.

fastText is a general-purpose library for learning embeddings. It also supports off-the-shelf text classification by providing end-to-end classifier training and testing; i.e., we don’t have to handle feature extraction separately.
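fastText can handle OOV words because it builds a word's vector from the vectors of its character n-grams, so an unseen word still shares pieces with words seen in training. The sketch below shows only the n-gram extraction step in plain Python. The function name `char_ngrams` is hypothetical, the range 3 to 6 is meant to mirror fastText's usual defaults, and the real library additionally hashes n-grams into a fixed number of buckets, which is left out here:

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, using '<' and '>' as boundary markers."""
    padded = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(padded) - n + 1):
            ngrams.append(padded[start:start + n])
    return ngrams

# Trigrams only, to keep the output short
print(char_ngrams("where", min_n=3, max_n=3))

# An unseen word still overlaps with a related word through shared n-grams
shared = set(char_ngrams("flabbergast")) & set(char_ngrams("flabbergasted"))
print(len(shared), "shared n-grams, e.g.", sorted(shared)[:5])
```

With a trained fastText model, the vector for an unseen word would then be assembled from the learned vectors of these n-grams instead of raising a `KeyError`.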