In this notebook, we load a dataset of movie reviews, each annotated with its sentiment (positive or negative).

We then turn each review into a vector using a bag-of-words (BoW) representation, where every position in the vector holds the count of one vocabulary word.

Finally, we train a Support Vector Machine (SVM) model to classify the reviews by sentiment.
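
To make the bag-of-words idea concrete before working with the real data, here is a minimal sketch on two made-up sentences. The sentences and the helper name `tokenize` are assumptions for illustration only; the cleaning regex is the same one used later in this notebook.

```python
import re
from collections import Counter

def tokenize(text):
    # Keep letters only, lowercase, and split on whitespace
    return re.sub(r"[^a-zA-Z]+", " ", text).lower().split()

# Made-up sentences, not taken from the dataset
sentences = ["The movie was good", "The movie was bad, really bad"]

# The vocabulary fixes which word each vector position stands for
vocabulary = sorted({w for s in sentences for w in tokenize(s)})

# Each sentence becomes a vector of word counts over the vocabulary
vectors = []
for s in sentences:
    counts = Counter(tokenize(s))
    vectors.append([counts[w] for w in vocabulary])

print(vocabulary)
for v in vectors:
    print(v)
```

Word order is discarded: only how often each vocabulary word occurs survives in the vector.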



In [1]:
import time
from sklearn import svm
from sklearn.metrics import classification_report
import pandas as pd
from tqdm import tqdm
import numpy as np
import re
from nltk.stem.snowball import SnowballStemmer

In [2]:
# train Data
trainData = pd.read_csv("https://raw.githubusercontent.com/Vasistareddy/sentiment_analysis/master/data/train.csv")
# test Data
testData = pd.read_csv("https://raw.githubusercontent.com/Vasistareddy/sentiment_analysis/master/data/test.csv")

In [3]:
trainData.sample(frac=1).head(5)

Unnamed: 0,Content,Label
1704,"late in down to you , the lead female characte...",neg
1389,jackie chan kicks his way into van damme terri...,neg
852,underrated movies are a common reoccurrence in...,pos
1472,* * * the following review contains some hars...,neg
842,"since 1990 , the dramatic picture has undergon...",pos


Let us convert the labels 'pos' and 'neg' to 1 and 0, respectively.

In [4]:
trainData['num_Label'] = trainData.Label.map(lambda x: 1 if x=='pos' else 0)
testData['num_Label'] = testData.Label.map(lambda x: 1 if x=='pos' else 0)

In [5]:
trainData.sample(frac=1).head(5)

Unnamed: 0,Content,Label,num_Label
1619,when it comes to the average teenage romantic ...,neg,0
1603,i love movies . \ni really do . \nevery time i...,neg,0
763,"in intolerance , d . w . griffith told four di...",pos,1
162,plot : odin is a great high school basketball ...,pos,1
611,it is often said by his fans that hal hartley ...,pos,1


In [6]:
words_list = {}  # stem -> number of training reviews that contain it
stemmer = SnowballStemmer("english")
for content in tqdm(trainData.Content.values):
  # Keep letters only, lowercase, and split into words
  words = re.sub(r"[^a-zA-Z]+",
                 " ", content).lower().strip().split(' ')
  # Stem before de-duplicating so that each stem is counted once per review
  for word in {stemmer.stem(w) for w in words}:
    words_list[word] = words_list.get(word, 0) + 1

100%|██████████| 1800/1800 [00:28<00:00, 62.77it/s] 


In [7]:
len(words_list)

23917

# Above we see that the training data contains about 24,000 unique word stems. We could trim the vocabulary to, say, the 1,000 most frequent stems, but for this example we keep every one.
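
The trimming mentioned above is not applied in this notebook. As a rough sketch of how it could look, the snippet below keeps only the `top_k` most frequent entries of a count dictionary. The names `toy_counts`, `top_k`, and `trimmed_signature` are hypothetical, and the toy counts stand in for `words_list`.

```python
from collections import Counter

# Toy stand-in for words_list (stem -> count); the values are made up
toy_counts = {"movi": 5, "good": 3, "bad": 2, "plot": 1}

top_k = 2  # the text suggests 1000 for the real vocabulary

# most_common returns (stem, count) pairs sorted by descending count
kept = [stem for stem, _ in Counter(toy_counts).most_common(top_k)]

# Map each kept stem to a column index, as BoW_signature does
trimmed_signature = {stem: i for i, stem in enumerate(kept)}
print(trimmed_signature)
```

With a trimmed vocabulary, the training loop would also have to skip stems that are missing from the mapping, the same way the test loop skips unseen words.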

In [8]:
# Map each stem to its column index in the BoW vectors
BoW_signature = {w: i for i, w in enumerate(words_list)}

### Let's convert our training data into vectors now

In [9]:
# One row per review, one column per vocabulary stem
train_X = np.zeros((len(trainData), len(BoW_signature)))
train_y = trainData.num_Label.values
for idx, content in enumerate(trainData.Content.values):
  c_list = re.sub(r"[^a-zA-Z]+",
                  " ", content).lower().strip().split(' ')
  stems = [stemmer.stem(word) for word in c_list]
  # Count how often each stem occurs in this review
  rwords, counts = np.unique(stems, return_counts=True)
  for w, count in zip(rwords, counts):
    train_X[idx, BoW_signature[w]] = count

In [10]:
# The test vectors reuse the vocabulary built from the training data
test_X = np.zeros((len(testData), len(BoW_signature)))
test_y = testData.num_Label.values
for idx, content in enumerate(testData.Content.values):
  c_list = re.sub(r"[^a-zA-Z]+",
                  " ", content).lower().strip().split(' ')
  stems = [stemmer.stem(word) for word in c_list]
  rwords, counts = np.unique(stems, return_counts=True)
  for w, count in zip(rwords, counts):
    # Stems never seen during training have no column, so skip them
    if w in BoW_signature:
      test_X[idx, BoW_signature[w]] = count
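
The two loops above build dense count matrices by hand. For comparison, here is a hedged sketch of the same idea using scikit-learn's `CountVectorizer`, which returns a sparse matrix and ignores unseen words at transform time. This is an alternative technique, not what this notebook runs. The toy documents and the `stem` hook are assumptions; stemming is left as an identity function here, and `stemmer.stem` could be passed in its place.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def tokenize(text, stem=lambda w: w):
    # Same cleaning as the loops above; `stem` is a hypothetical hook
    # that defaults to leaving each word unchanged
    words = re.sub(r"[^a-zA-Z]+", " ", text).lower().split()
    return [stem(w) for w in words]

# Toy stand-ins for trainData.Content and testData.Content
train_docs = ["A good, good movie!", "A bad movie."]
test_docs = ["good plot, unseen words"]

# token_pattern=None because the custom tokenizer replaces the default pattern
vectorizer = CountVectorizer(tokenizer=tokenize, lowercase=False,
                             token_pattern=None)
toy_train_X = vectorizer.fit_transform(train_docs)  # learns the vocabulary
toy_test_X = vectorizer.transform(test_docs)        # reuses that vocabulary

print(sorted(vectorizer.vocabulary_))
print(toy_train_X.toarray())
print(toy_test_X.toarray())
```

A sparse matrix stores only the non-zero counts, which matters when the vocabulary has tens of thousands of columns and each review uses a small fraction of them.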

In [11]:
classifier = svm.SVC(kernel='rbf')

classifier.fit(train_X, train_y)

In [12]:
predictions = classifier.predict(test_X)

In [13]:
report = classification_report(test_y, predictions)
print(report)

              precision    recall  f1-score   support

           0       0.76      0.84      0.80       100
           1       0.82      0.73      0.77       100

    accuracy                           0.79       200
   macro avg       0.79      0.78      0.78       200
weighted avg       0.79      0.79      0.78       200
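
To see where the numbers in the report come from, the sketch below recomputes them from raw counts. Because each class has a support of 100, the recalls of 0.84 and 0.73 imply 84 and 73 correctly classified reviews respectively; the helper name `prf` is an assumption made for this illustration.

```python
# Counts implied by the report: recall * support, with support = 100 per class
tn, fp = 84, 16   # true label 0: predicted 0 / predicted 1
fn, tp = 27, 73   # true label 1: predicted 0 / predicted 1

def prf(correct, wrongly_predicted, missed):
    # precision: share of predictions for this class that are right
    precision = correct / (correct + wrongly_predicted)
    # recall: share of this class's true members that were found
    recall = correct / (correct + missed)
    # f1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

class0 = prf(tn, fn, fp)   # class 0 treated as the class of interest
class1 = prf(tp, fp, fn)   # class 1 treated as the class of interest
accuracy = (tn + tp) / (tn + fp + fn + tp)

print([round(v, 2) for v in class0])
print([round(v, 2) for v in class1])
print(accuracy)
```

Rounded to two decimals, these values match the per-class rows of the report, and the accuracy of 157 correct out of 200 is the 0.79 shown there.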

