# Word2Vec for Text Classification

In this short notebook, we will see how to use a pre-trained Word2Vec model to extract features and perform text classification.

We will use the Sentiment Labelled Sentences dataset from the UCI repository:
http://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

The dataset consists of 1500 positive and 1500 negative sentences from Amazon, Yelp, and IMDB. Let us first combine the three separate data files into one using the following Unix command:

```cat amazon_cells_labelled.txt imdb_labelled.txt yelp_labelled.txt > sentiment_sentences.txt```
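
If a Unix shell is not available, the same concatenation can be sketched in Python. This is only an illustration: `combine_files` is a hypothetical helper name, and unlike `cat` it adds a newline when an input file does not end with one, so rows from adjacent files cannot merge.

```python
from pathlib import Path

def combine_files(input_paths, output_path):
    """Concatenate text files in order, similar to `cat a b c > out`."""
    with open(output_path, "w", encoding="utf-8") as out:
        for path in input_paths:
            text = Path(path).read_text(encoding="utf-8")
            out.write(text)
            # Assumption: each row should stay on its own line, so add a
            # newline if the file does not already end with one.
            if text and not text.endswith("\n"):
                out.write("\n")
```

With the three dataset files in the working directory, it could be called as `combine_files(["amazon_cells_labelled.txt", "imdb_labelled.txt", "yelp_labelled.txt"], "sentiment_sentences.txt")`.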

For a pre-trained embedding model, we will use the Google News vectors.
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM

Let us get started!

In [11]:
#basic imports
import os
from time import time

#pre-processing imports
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from string import punctuation

#imports related to modeling
import numpy as np
import pandas as pd
from gensim.models import Word2Vec, KeyedVectors
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Word2Vec from another pre-trained model (DNN)

In [34]:
from gensim.models import Word2Vec, KeyedVectors #To load the model
import warnings
warnings.filterwarnings('ignore') #ignore any generated warnings
import numpy as np
# load model
path='https://github.com/practical-nlp/practical-nlp/blob/master/Ch3/Models/word2vec_cbow.bin?raw=true'
model_new = KeyedVectors.load_word2vec_format(path, binary=True)

In [35]:
model_new

<gensim.models.keyedvectors.Word2VecKeyedVectors at 0x7fa9713a8ef0>

In [37]:
# Inspect the model
word2vec_vocab_new = model_new.vocab.keys()
word2vec_vocab_lower_new = [item.lower() for item in word2vec_vocab_new]
print(len(word2vec_vocab_new))

161018


This model has only about 161,000 words, far fewer than the 3,000,000 in Google's model.
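
A smaller vocabulary means more dataset tokens may have no vector at all. One way to quantify this is to measure what fraction of tokens a vocabulary covers. The sketch below uses toy data only; `vocabulary_coverage`, `toy_vocab`, and `toy_texts` are hypothetical names, not part of the notebook.

```python
def vocabulary_coverage(tokenized_texts, vocabulary):
    """Return the fraction of tokens that appear in `vocabulary`."""
    total = 0
    found = 0
    for tokens in tokenized_texts:
        for token in tokens:
            total += 1
            if token in vocabulary:
                found += 1
    # Guard against an empty corpus.
    return found / total if total else 0.0

# Toy stand-ins for a model vocabulary and a tokenized corpus.
toy_vocab = {"good", "case", "excellent", "value", "great"}
toy_texts = [["good", "case", "excellent", "value"], ["great", "jawbone"]]
print(vocabulary_coverage(toy_texts, toy_vocab))
```

With a real model, the vocabulary argument could be a set built from the model's words, and the same tokenized texts could be scored against both models for comparison.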

# Word2Vec with the Google News vectors

In [4]:
#download the pre-trained Google News Word2Vec vectors (about 1.5 GB)
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2020-08-01 04:02:29--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.129.237
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.129.237|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2020-08-01 04:03:20 (31.2 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [6]:
path_to_model = '/content/GoogleNews-vectors-negative300.bin.gz'
#Load W2V model. This will take some time. 
%time w2v_model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)
print('done loading Word2Vec')


CPU times: user 1min 53s, sys: 4.09 s, total: 1min 57s
Wall time: 1min 57s
done loading Word2Vec


In [19]:
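training_data_path = '/content/drive/My Drive/Data/NLP/sentiment_all.txt'
#Read the text data and its labels (cats).
#Each line of the file holds a sentence and its label, separated by a tab.
texts = []
cats = []
with open(training_data_path) as fh: #the with-block closes the file when done
    for line in fh:
        #strip the trailing newline so that labels are '0'/'1' rather than '0\n'/'1\n'
        text, sentiment = line.rstrip("\n").split("\t")
        texts.append(text)
        cats.append(sentiment)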


In [20]:
# Inspect the model
word2vec_vocab = w2v_model.vocab.keys()
word2vec_vocab_lower = [item.lower() for item in word2vec_vocab]
print(len(word2vec_vocab))

3000000
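
The `vocab` attribute used above belongs to gensim 3.x; gensim 4.0 and later expose the vocabulary as `key_to_index` instead. Below is a minimal sketch of a helper that works with either. `get_vocabulary` is a hypothetical name, and the `SimpleNamespace` objects merely stand in for loaded models.

```python
from types import SimpleNamespace

def get_vocabulary(keyed_vectors):
    """Return the list of words for gensim 3.x or 4.x style objects."""
    if hasattr(keyed_vectors, "key_to_index"):  # gensim >= 4.0
        return list(keyed_vectors.key_to_index)
    return list(keyed_vectors.vocab)            # gensim 3.x

# Stand-ins that mimic only the attribute each gensim version provides.
old_style = SimpleNamespace(vocab={"good": None, "case": None})
new_style = SimpleNamespace(key_to_index={"good": 0, "case": 1, "value": 2})
print(len(get_vocabulary(old_style)), len(get_vocabulary(new_style)))
```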


In [21]:
#Inspect the dataset
print(len(cats), len(texts))
print(texts[1])
print(cats[1])

3000 3000
Good case, Excellent value.
1
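
Beyond the total count, it can be worth confirming that the two classes are balanced, as the dataset description states. The sketch below runs on toy labels; `label_distribution` and `toy_cats` are hypothetical names. It strips whitespace first so that a label read with a trailing newline is counted together with the same label without one.

```python
from collections import Counter

def label_distribution(labels):
    """Count labels after stripping whitespace such as a trailing newline."""
    return Counter(label.strip() for label in labels)

# Toy labels, some with the trailing newline left by reading a file line by line.
toy_cats = ["0\n", "1\n", "1", "0\n"]
print(label_distribution(toy_cats))
```

Applied to the notebook's `cats` list, the two counts should each be 1500 if the combined file matches the description.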



In [26]:
#Download the NLTK resources needed for preprocessing: the stopword list and the punkt tokenizer
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [27]:
#preprocess the text.
def preprocess_corpus(texts):
    mystopwords = set(stopwords.words("english"))
    def remove_stops_digits(tokens):
        #Nested function that removes stopwords, digits and punctuation from a list of tokens,
        #then lowercases what remains. The stopword check runs on the original casing,
        #so capitalized stopwords such as "So" or "The" are kept.
        return [token.lower() for token in tokens if token not in mystopwords and not token.isdigit()
               and token not in punctuation]
    #Tokenize each text with word_tokenize, then clean its tokens with the function above.
    return [remove_stops_digits(word_tokenize(text)) for text in texts]

texts_processed = preprocess_corpus(texts)
print(len(cats), len(texts_processed))
print(texts_processed[1])
print(cats[1])

3000 3000
['good', 'case', 'excellent', 'value']
1



In [31]:
texts_processed[:5]

[['so', 'way', 'plug', 'us', 'unless', 'i', 'go', 'converter'],
 ['good', 'case', 'excellent', 'value'],
 ['great', 'jawbone'],
 ['tied', 'charger', 'conversations', 'lasting', 'minutes.major', 'problems'],
 ['the', 'mic', 'great']]
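
The output above still contains 'so', 'the', and 'i' because the stopword test sees the original casing ("So", "The", "I"), which the lowercase stopword list does not match. Lowercasing first would remove them. The sketch below shows that variant with a tiny, hypothetical stopword set (`TOY_STOPWORDS`) in place of NLTK's English list. Adopting it in the notebook would change the token lists and possibly the accuracy reported later.

```python
from string import punctuation

# Tiny illustrative stopword set; the notebook uses NLTK's English list instead.
TOY_STOPWORDS = {"so", "the", "i", "is", "to", "in", "there", "no", "for", "me"}

def remove_stops_digits_lowercase_first(tokens, stopword_set=TOY_STOPWORDS):
    """Lowercase each token, then drop stopwords, digits, and punctuation."""
    lowered = (token.lower() for token in tokens)
    return [t for t in lowered
            if t not in stopword_set and not t.isdigit() and t not in punctuation]

print(remove_stops_digits_lowercase_first(["The", "mic", "is", "great", "."]))
# prints ['mic', 'great']
```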

In [53]:
# Creating a feature vector for each sentence by averaging the embeddings of its words
def embedding_feats(list_of_lists):
    DIMENSION = 300
    zero_vector = np.zeros(DIMENSION)
    feats = []
    for tokens in list_of_lists: # loop over the tokenized sentences in the corpus
        feat_for_this =  np.zeros(DIMENSION)
        count_for_this = 0
        # look up each word of the sentence and add its 300-dimensional vector
        for token in tokens: # e.g. ['so', 'way', 'plug', 'us', 'unless', 'i', 'go', 'converter']
            if token in w2v_model: # skip words that are not in the model's vocabulary
                feat_for_this += w2v_model[token] # 300 dimensions
                count_for_this +=1
        # divide the summed word vectors by the number of words found to get the average;
        # fall back to the zero vector when no word was found, to avoid dividing by zero
        if count_for_this != 0:
            feats.append(feat_for_this/count_for_this)
        else:
            feats.append(zero_vector)
    return feats # one 300-dimensional vector per sentence

train_vectors = embedding_feats(texts_processed)

In [55]:
print(len(train_vectors))
print(len(train_vectors[0]))
# train_vectors holds the 3000 sentences, each converted to a 300-dimensional vector

3000
300
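
To make the averaging step concrete, here is a small sketch with made-up 3-dimensional vectors in place of the 300-dimensional model. `toy_embeddings` and `average_embedding` are hypothetical names. Words missing from the embeddings are skipped, and a sentence with no known words gets the zero vector.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the real model.
toy_embeddings = {
    "good": np.array([1.0, 0.0, 2.0]),
    "case": np.array([3.0, 2.0, 0.0]),
}

def average_embedding(tokens, embeddings, dimension):
    """Average the vectors of known tokens; return zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(dimension)
    return np.mean(vectors, axis=0)

# "jawbone" is unknown, so only "good" and "case" contribute to the average.
print(average_embedding(["good", "case", "jawbone"], toy_embeddings, 3))
print(average_embedding(["jawbone"], toy_embeddings, 3))
```

The first call averages (1, 0, 2) and (3, 2, 0) to give (2, 1, 1); the second returns the zero vector.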


In [44]:
len(train_vectors[0])

300

In [None]:
#Take any classifier (LogisticRegression here) and train/test it like before.
classifier = LogisticRegression(random_state=1234)
train_data, test_data, train_cats, test_cats = train_test_split(train_vectors, cats)
classifier.fit(train_data, train_cats)
print("Accuracy: ", classifier.score(test_data, test_cats))
preds = classifier.predict(test_data)
print(classification_report(test_cats, preds))

Accuracy:  0.812
             precision    recall  f1-score   support

          0       0.82      0.80      0.81       374
          1       0.80      0.83      0.82       376

avg / total       0.81      0.81      0.81       750
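
The call to `train_test_split` above sets no `random_state`, so the held-out 25% (the default test size, 750 of 3000 here) differs between runs and the accuracy can shift slightly. The sketch below shows a seeded, stratified split on synthetic data. The data, the `split_and_score` helper, and the `max_iter` setting are assumptions for illustration, not the notebook's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sentence vectors and labels (not the real data).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))
labels = (features[:, 0] + 0.5 * features[:, 1] > 0).astype(int)

def split_and_score(features, labels, seed):
    """Train on a seeded, stratified split and return the test accuracy."""
    train_x, test_x, train_y, test_y = train_test_split(
        features, labels, test_size=0.25, random_state=seed, stratify=labels)
    clf = LogisticRegression(random_state=seed, max_iter=1000)
    clf.fit(train_x, train_y)
    return clf.score(test_x, test_y)

# The same seed gives the same split, and therefore the same score.
first = split_and_score(features, labels, seed=1234)
second = split_and_score(features, labels, seed=1234)
print(first == second)
```

Stratifying keeps the class proportions the same in the training and test sets, which makes the score less sensitive to the particular split.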



Not bad. With little effort we got 81% accuracy. That's a great starting model to have!