# Text Classification: Sentiment Analysis

<img src="cover.jpg" width="700">

A recent project at my work seeks to explore the possibility of detecting sales potential in messages from customer support's live chat (i.e., binary classification: sales or non-sales potential). Essentially, it is a text classification problem that could be solved using word embeddings. While I did not have sufficient data to conduct such an experiment, I still wanted to understand how a pre-trained word embedding model could make it possible to train an accurate text classification model.

The idea of word embedding is to capture the context of a word in a document, its semantic and syntactic similarity, and its relation to other words. While it doesn't "understand" the meaning of a word the way a human would, it assigns nearby positions in the vector space to words that appear in similar contexts.
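To make "nearby positions" concrete, here is a minimal sketch that compares word vectors with cosine similarity. The three-dimensional vectors are made up for illustration (they are not real GloVe values), and `cosine_similarity` is a helper written for this example.

```python
import math

# Hypothetical 3-dimensional vectors, invented for illustration only;
# real GloVe vectors have 300 dimensions and are learned from a corpus.
toy_vectors = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "charger": [0.1, 0.9, 0.0],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_close = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_far = cosine_similarity(toy_vectors["good"], toy_vectors["charger"])
print(sim_close, sim_far)
```

With these toy values, "good" and "great" point in almost the same direction, while "good" and "charger" do not.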


## Loading libraries

In [8]:
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk import punkt




In [10]:
pd.set_option('display.max_colwidth', 100)

## Loading GloVe embedding 

### GloVe vs word2vec

**word2vec** is a predictive model, meaning that it trains by trying to predict a target word given a context (CBOW) or the context words from the target (skip-gram). The model uses trainable embedding weights to map words to their corresponding embeddings, which are used to help the model make predictions.
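The two training setups can be sketched by listing the examples each one would generate from a sentence. This is only an illustration of how the training pairs are formed, not word2vec itself; the function names and the `window` parameter are assumptions made for this example.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs: predict each context word from the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """(context words, target) examples: predict the target from its context."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

tokens = "the mic is great".split()
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
print(sg)
print(cb)
```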

The **GloVe model** builds its embeddings from a co-occurrence count matrix. Each row of the matrix represents a word, and each column represents a context in which words can appear. Each value is the number of times a word appears in a given context. The embeddings are then obtained by factorizing this matrix into a lower-dimensional one, so that each row of the result is a word's embedding vector. (More precisely, GloVe fits word vectors whose dot products approximate the logarithm of the co-occurrence counts.)
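As a rough illustration of the first step, the sketch below counts co-occurrences in a two-sentence toy corpus. It covers only the counting, not the fitting of the vectors; the function name, the whitespace tokenization, and the symmetric `window` are assumptions chosen for this example.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=1):
    """Count how often each (word, context word) pair occurs within the window."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = ["the mic is great", "the case is great"]
counts = cooccurrence_counts(corpus, window=1)
print(counts[("is", "great")], counts[("the", "mic")], counts[("mic", "great")])
```

Pairs that appear next to each other in both sentences, such as ("is", "great"), get a higher count than pairs that appear together only once or never.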

What I am importing here is a **pre-trained GloVe model from the Stanford NLP Group**. It was trained on Common Crawl data (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download).

Download the GloVe vectors used in this notebook [here](http://nlp.stanford.edu/data/glove.840B.300d.zip)

In [5]:
embeddings_index = {}
with open('glove.840B.300d.txt', encoding="utf8") as f:
    for line in tqdm(f):
        values = line.split()
        word = values[0]
        try:
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
        except ValueError:
            # skip lines where values[1:] is not purely numeric
            pass

2196017it [03:05, 11837.83it/s]


In [6]:
print('Found %s word vectors.' % len(embeddings_index))

Found 2195884 word vectors.
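The progress counter reports 2,196,017 lines, but only 2,195,884 vectors end up in the dictionary. One likely cause (an assumption, not verified here) is that some tokens in the file contain spaces, so `line.split()` puts non-numeric pieces into `values[1:]` and the line is skipped; repeated tokens overwriting each other would also lower the count. A possible alternative is to take the last `dim` fields as the vector and treat everything before them as the token. `parse_glove_line` is a hypothetical helper written for this sketch, shown here with `dim=3` on made-up lines.

```python
def parse_glove_line(line, dim=300):
    """Split a GloVe text line into (token, vector), allowing spaces in the token."""
    parts = line.rstrip().split(' ')
    word = ' '.join(parts[:-dim])              # everything before the last `dim` fields
    vector = [float(x) for x in parts[-dim:]]  # can be wrapped in np.asarray(..., dtype='float32')
    return word, vector

# Made-up example lines with 3-dimensional vectors
word, vector = parse_glove_line(". . . 0.1 0.2 0.3", dim=3)
plain_word, plain_vector = parse_glove_line("good 0.5 -0.25 1.0", dim=3)
print(repr(word), vector)
print(repr(plain_word), plain_vector)
```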


## Positive/Negative Reviews Dataset

For this experiment, I'm using the [Sentiment Labelled Sentences Dataset](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#) from the UC Irvine Machine Learning Repository. It contains sentences labelled with positive or negative sentiment, which I will use to train a sentiment classification model.

## Reading The Data

In [10]:
data1 = pd.read_csv('./reviews/amazon_cells_labelled.txt', sep="\t", header=None)
data2 = pd.read_csv('./reviews/imdb_labelled.txt', sep="\t", header=None)
data3 = pd.read_csv('./reviews/yelp_labelled.txt', sep="\t", header=None)
data1.columns = ["review", 'positive']
data2.columns = ["review", 'positive']
data3.columns = ["review", 'positive']
data= pd.concat([data1,data2,data3])
data.reset_index(drop=True, inplace=True)

In [11]:
data.head(5)

| | review | positive |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


In [12]:
print('There are %s reviews in this dataset' %(len(data)))

There are 2748 reviews in this dataset


In [13]:
stop_words = stopwords.words('english')
class_names = ['positive']
train = data
train.positive.value_counts(normalize=True)

1    0.504367
0    0.495633
Name: positive, dtype: float64

## Training the Model

In [14]:
train_all, test_all = train_test_split(train,test_size=0.20, random_state=42)
train_text = train_all['review']
test_text = test_all['review']

all_text = pd.concat([train_text, test_text])


In [16]:
# this function creates a normalized vector for the whole sentence
# reference: https://stackoverflow.com/questions/30795944/how-can-a-sentence-or-a-document-be-converted-to-a-vector
# reference: https://github.com/abhishekkrthakur/is_that_a_duplicate_quora_question/blob/master/feature_engineering.py

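def sent2vec(s):
    # lowercase, tokenize, then drop stop words and non-alphabetic tokens
    words = word_tokenize(str(s).lower())
    words = [w for w in words if w not in stop_words and w.isalpha()]
    # look up the GloVe vector of every token that is in the vocabulary
    M = [embeddings_index[w] for w in words if w in embeddings_index]
    if not M:
        # no known words: fall back to a zero vector
        return np.zeros(300)
    v = np.sum(M, axis=0)
    norm = np.sqrt((v ** 2).sum())
    # scale the summed vector to unit length (guarding against a zero norm)
    return v / norm if norm > 0 else np.zeros(300)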
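To see what the sentence vector looks like, here is a small sketch of the same sum-then-normalize idea on made-up three-dimensional embeddings. `toy_index` and `toy_sent2vec` are hypothetical stand-ins for `embeddings_index` and `sent2vec`, and the tokenization and stop-word steps are left out.

```python
import numpy as np

# Hypothetical 3-d embeddings standing in for the 300-d GloVe vectors
toy_index = {
    "mic": np.array([1.0, 0.0, 2.0]),
    "great": np.array([0.0, 2.0, 2.0]),
}

def toy_sent2vec(words, index, dim=3):
    """Sum the vectors of known words and scale the result to unit length."""
    vectors = [index[w] for w in words if w in index]
    if not vectors:
        return np.zeros(dim)
    v = np.sum(vectors, axis=0)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else np.zeros(dim)

vec = toy_sent2vec(["mic", "great", "unknownword"], toy_index)
empty = toy_sent2vec(["unknownword"], toy_index)
print(vec, np.linalg.norm(vec))
print(empty)
```

Because the result is scaled to unit length, summing and averaging the word vectors give the same sentence vector; only the direction matters.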

In [17]:
# create sentence vectors using the above function for the training and test sets
xtrain_glove = [sent2vec(x) for x in tqdm(train_text)]
xtest_glove = [sent2vec(x) for x in tqdm(test_text)]


100%|██████████| 2198/2198 [00:00<00:00, 3105.87it/s]
100%|██████████| 550/550 [00:00<00:00, 3838.53it/s]


In [18]:
xtrain_glove = np.array(xtrain_glove)
xtest_glove = np.array(xtest_glove)

In [24]:
scores = []
result = pd.DataFrame.from_dict({'id': test_all.index})

class_name = class_names[0]  # 'positive', the only target column
train_target = train_all[class_name]
classifier = XGBClassifier(n_estimators=400, random_state = 0)

cv_score = np.mean(cross_val_score(classifier, xtrain_glove, train_target, cv=3, scoring='roc_auc'))
scores.append(cv_score)
print('CV score for class {} is {}'.format(class_name, cv_score))

classifier.fit(xtrain_glove, train_target)
result[class_name] = classifier.predict(xtest_glove)


CV score for class positive is 0.8861175709054306




## Predictions

In [26]:
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
print("Training Accuracy = {:.3f}".format(classifier.score(xtrain_glove, train_target)))

print("Test Accuracy = {:.3f}".format(classifier.score(xtest_glove, test_all[class_name])))
print("ROC_AUC_score : %.6f" % (roc_auc_score(test_all[class_name], result[class_name])))

#Confusion Matrix
print(confusion_matrix(test_all[class_name], result[class_name]))
print("-"*15,"CLASSIFICATION REPORT","-"*15)
print(classification_report(test_all[class_name], result[class_name]))

Training Accuracy = 0.998
Test Accuracy = 0.833
ROC_AUC_score : 0.835131
[[231  60]
 [ 32 227]]
--------------- CLASSIFICATION REPORT ---------------
             precision    recall  f1-score   support

          0       0.88      0.79      0.83       291
          1       0.79      0.88      0.83       259

avg / total       0.84      0.83      0.83       550
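The figures in the classification report can be recomputed by hand from the confusion matrix above, where rows are the true labels and columns are the predictions. The variable names below are chosen for this sketch.

```python
# Confusion matrix from the output above:
# [[231  60]    true 0: 231 predicted 0, 60 predicted 1
#  [ 32 227]]   true 1:  32 predicted 0, 227 predicted 1
tn, fp = 231, 60
fn, tp = 32, 227

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_pos = tp / (tp + fp)   # of all predicted positives, how many are right
recall_pos = tp / (tp + fn)      # of all true positives, how many are found
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print("accuracy      = {:.3f}".format(accuracy))
print("class 1: precision = {:.2f}, recall = {:.2f}, f1 = {:.2f}".format(
    precision_pos, recall_pos, f1_pos))
print("class 0: precision = {:.2f}, recall = {:.2f}".format(
    precision_neg, recall_neg))
```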





**note**:

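ROC-AUC measures how well the model separates the two classes: it is the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, so the higher the AUC, the better the model is at distinguishing positive reviews from negative ones. Note that the score above was computed from the hard 0/1 predictions rather than from predicted probabilities.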
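The pairwise definition of AUC can be sketched in a few lines. `pairwise_auc` is a helper written for this example, and the labels and probabilities are made up. The sketch also shows why it matters whether AUC is computed from probabilities or from thresholded 0/1 labels: with hard labels, AUC reduces to the average of the two per-class recalls, which is exactly the 0.835131 reported above.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1]
probabilities = [0.2, 0.3, 0.4, 0.9]
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

auc_from_probabilities = pairwise_auc(y_true, probabilities)
auc_from_labels = pairwise_auc(y_true, hard_labels)

# With hard labels, AUC equals the mean of the per-class recalls.
# Using the confusion matrix reported above: recall 0 = 231/291, recall 1 = 227/259.
mean_recall = (231 / 291 + 227 / 259) / 2

print(auc_from_probabilities, auc_from_labels, round(mean_recall, 6))
```

In this toy case the probabilities rank every positive above every negative, but thresholding at 0.5 throws part of that ranking away. To score the classifier on its ranking, one option would be to pass `classifier.predict_proba(xtest_glove)[:, 1]` to `roc_auc_score` instead of the output of `predict`; this was not run here, so no number is given for it.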

In [27]:
result.columns=['id','prediction']
test_merge = test_all.reset_index(drop=False)
pd.concat([test_merge,result['prediction']], axis=1)

| | index | review | positive | prediction |
|---|---|---|---|---|
| 0 | 2516 | It's close to my house, it's low-key, non-fanc... | 1 | 1 |
| 1 | 2642 | If you stay in Vegas you must get breakfast he... | 1 | 0 |
| 2 | 1359 | Let's start with all the problemsthe acting, ... | 0 | 0 |
| 3 | 1702 | It's too bad that everyone else involved didn'... | 0 | 1 |
| 4 | 2660 | i felt insulted and disrespected, how could yo... | 0 | 0 |
| 5 | 564 | Yet Plantronincs continues to use the same fla... | 0 | 1 |
| 6 | 1330 | Whatever prompted such a documentary is beyond... | 0 | 0 |
| 7 | 2375 | Any grandmother can make a roasted chicken bet... | 0 | 1 |
| 8 | 695 | Do NOT buy if you want to use the holster. | 0 | 1 |
| 9 | 321 | I ordered this product first and was unhappy w... | 0 | 0 |


## Conclusion

Without any hyperparameter tuning, I obtained decent results (about 83% test accuracy) with the pre-trained GloVe model and an out-of-the-box XGBoost classifier, although the gap between training accuracy (0.998) and test accuracy (0.833) suggests that the model overfits and would benefit from tuning. This simple experiment shows how well word embeddings retain the semantics of words. This approach will be a good starting point for my original classification problem (i.e., sales/non-sales) once I have collected an adequate amount of data.