<h1><center> Sentiment Analysis </center></h1>

<h2> Table of Contents </h2>

* [Introduction](#introduction)
    * [NLP](#introduction_nlp)
    * [Word2Vec](#introduction_w2v)
    * [Tokenizer](#introduction_tokenizer)
* [Prerequisites](#prerequisites)
    * [Packages](#prerequisites_packages)
    * [Prompt](#prerequisites_prompt)
    * [Read Data](#prerequisites_data)
* [EDA](#eda)    
* [Preprocess Data](#preprocess_data)
    * [Process Text](#preprocess_tweet)
    * [Split Dataset](#preprocess_split)

## Introduction <a class="anchor" id="introduction"></a>

> <h3> Natural Language Processing <a class="anchor" id="introduction_nlp"></a></h3>
> <h3> Word2Vec <a class="anchor" id="introduction_w2v"></a></h3>
> <h3> Tokenizer <a class="anchor" id="introduction_tokenizer"></a></h3>

## Prerequisites <a class="anchor" id="prerequisites"></a>

### Packages <a class="anchor" id="prerequisites_packages"></a>
- pandas
- numpy
- matplotlib
- scikit-learn
- gensim
- tensorflow
- nltk
- keras

In [61]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import TfidfVectorizer

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Activation, Dense, Dropout, Embedding, Flatten, Conv1D, MaxPooling1D, LSTM
from keras import utils
from keras.callbacks import ReduceLROnPlateau, EarlyStopping

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

import gensim
from gensim.models import word2vec

import re

import os
from collections import Counter
import logging
import time
import pickle
import itertools

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

### Prompt  <a class="anchor" id="prerequisites_prompt"></a>

- Python 3.6 or earlier is required.

### Read Data <a class="anchor" id="prerequisites_data"></a>

In [62]:
col_names = ["Sentiment","Text"]
data = pd.read_csv(r'D:\CSC590_Design_Project\Data\data.csv',names = col_names,encoding="ISO-8859-1")

| Unnamed: 0 | Sentiment | Text |
|---|---|---|
| 0 | -1 | going to say goodbye to CP |
| 1 | 1 | I want some new nail polish! - Hey, that remin... |
| 2 | -1 | @OhCost yeah, this twitter cell phone app suck... |
| 3 | 1 | listening to crawl thru fire, btw |
| 4 | 1 | I Love TECHNOLOGY! you have access to all info... |
| ... | ... | ... |
| 1599995 | -1 | JUST TOOK THE HARDEST TEST EVER UGHHH AND IT W... |
| 1599996 | -1 | @machdemonic shut yur face. |
| 1599997 | -1 | Mutalating a dog: no nose,lips,toes,ears and c... |
| 1599998 | 1 | In Seattle |


## Exploratory Data Analysis <a class="anchor" id="eda"></a>

## Preprocess Data <a class="anchor" id="preprocess_data"></a>

### Process Text <a class="anchor" id="preprocess_tweet"></a>

In [63]:
stops = set(stopwords.words("english"))
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
stemmer = SnowballStemmer("english")

def process_text(text,remove_stops = False, stem = False):
    text = str(text).lower().strip()
    text = re.sub(r'https?://\S+|www\.\S+', '', text) # remove URLs
    text = re.sub(r"@[\w]*", '', text) # remove "@user" mentions
    text = re.sub(r'[^a-zA-Z]', ' ', text) # keep letters only
    words =[]
    for word in text.split():
        if not remove_stops or word not in stops:
            if not stem:
                words.append(word)
            else:
                words.append(stemmer.stem(word))
    return words    

data['Text'] = data['Text'].apply(lambda x: process_text(x,remove_stops = True))
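To make the effect of the cleaning steps concrete, here is a minimal sketch that mirrors the regular expressions in `process_text` using only the standard library. The names `clean_tweet`, `DEMO_STOPS`, and the sample tweet are hypothetical; the notebook itself uses NLTK's full English stop-word list, so real output will differ in which words are dropped.

```python
import re

# Hypothetical mini stop-word list; the notebook uses NLTK's English list instead.
DEMO_STOPS = {"i", "to", "the", "a", "is", "my"}

def clean_tweet(text, stops=DEMO_STOPS):
    """Apply the same cleaning order as process_text, without NLTK."""
    text = str(text).lower().strip()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop URLs
    text = re.sub(r"@\w*", "", text)                    # drop @mentions
    text = re.sub(r"[^a-zA-Z]", " ", text)              # keep letters only
    return [word for word in text.split() if word not in stops]

sample = "@OhCost I LOVE my new phone!! see https://example.com :)"
tokens = clean_tweet(sample)
print(tokens)  # ['love', 'new', 'phone', 'see']
```

Note that URLs are removed before punctuation is stripped; reversing the order would leave fragments such as `https` and `com` behind as tokens.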

### Split Dataset  <a class="anchor" id="preprocess_split"></a>

In [64]:
train_rows = round(len(data)*0.6)
val_rows = round(len(data)*0.2)
test_rows = len(data)-train_rows-val_rows  # remainder goes to the test set

# Slice sequentially and reset each index on a new object
# (avoids modifying a slice of `data` in place).
train = data.iloc[:train_rows].reset_index(drop=True)
val = data.iloc[train_rows:train_rows+val_rows].reset_index(drop=True)
test = data.iloc[train_rows+val_rows:].reset_index(drop=True)
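The 60/20/20 split arithmetic can be checked in isolation. The sketch below is an illustration only: `split_sizes` is a hypothetical helper, and a plain list of ten items stands in for the DataFrame. The test size is computed as the remainder so the three parts always cover every row.

```python
def split_sizes(n_rows, train_frac=0.6, val_frac=0.2):
    """Return (train, val, test) row counts; test takes whatever is left."""
    train_rows = round(n_rows * train_frac)
    val_rows = round(n_rows * val_frac)
    test_rows = n_rows - train_rows - val_rows
    return train_rows, val_rows, test_rows

rows = list(range(10))  # stand-in for the DataFrame rows
train_n, val_n, test_n = split_sizes(len(rows))

# Sequential slices, in the same order as the notebook's iloc calls.
train_part = rows[:train_n]
val_part = rows[train_n:train_n + val_n]
test_part = rows[train_n + val_n:]

print(train_n, val_n, test_n)  # 6 2 2
```

Because the slices are sequential rather than shuffled, the split is only representative if the rows are not ordered by label or time.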

## Word2Vec Model

### Setup Model
> All hyperparameters are set here, but `sentences` is not passed, so the model is created untrained and with an empty vocabulary.

#### Parameters:
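- `size`: dimensionality of the word vectors
- `window`: maximum distance between the current word and a context word within a sentence
- `min_count`: words that occur fewer times than this in the corpus are ignored
- `workers`: number of worker threads used for training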
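As a rough illustration of what `window` controls, the sketch below lists the (center, context) pairs that a fixed symmetric window would produce for one tokenized sentence. The function `context_pairs` is hypothetical and simplified: gensim's actual training also shrinks the window randomly per position and subsamples frequent words, which is not modeled here.

```python
def context_pairs(tokens, window):
    """List (center, context) pairs for a fixed symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = context_pairs(["love", "new", "nail", "polish"], window=1)
for pair in pairs:
    print(pair)
```

With `window=1` each word is paired only with its direct neighbours; a larger value such as 7 lets most words in a short tweet act as context for each other.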

In [65]:
W2V_SIZE = 300 
W2V_WINDOW = 7  
W2V_MIN_COUNT = 10 
W2V_WORKERS=8

w2v_model = word2vec.Word2Vec(size=W2V_SIZE, window=W2V_WINDOW, min_count=W2V_MIN_COUNT, workers=W2V_WORKERS)

### Build Vocab

> The model scans the training sentences once and builds its vocabulary table
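Conceptually, building the vocabulary is a word count followed by a `min_count` filter. The sketch below shows only that idea; `build_vocab_counts` and the toy sentences are hypothetical, and gensim's real `build_vocab` does additional work (for example preparing sampling tables) that is omitted here.

```python
from collections import Counter

def build_vocab_counts(sentences, min_count):
    """Count words over tokenized sentences and keep the frequent ones."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

demo_sentences = [["good", "day"], ["good", "morning"], ["bad", "day"]]
vocab = build_vocab_counts(demo_sentences, min_count=2)
print(vocab)  # {'good': 2, 'day': 2}
```

This is why the number of word types collected during the scan can be much larger than the vocabulary that is finally trained on: rare words are dropped by the threshold.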

In [66]:
train_sentences = train['Text'].tolist()
w2v_model.build_vocab(train_sentences)

2021-05-13 00:57:27,670 : INFO : collecting all words and their counts
2021-05-13 00:57:27,675 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-05-13 00:57:27,701 : INFO : PROGRESS: at sentence #10000, processed 70496 words, keeping 13104 word types
2021-05-13 00:57:27,723 : INFO : PROGRESS: at sentence #20000, processed 140023 words, keeping 19729 word types
2021-05-13 00:57:27,754 : INFO : PROGRESS: at sentence #30000, processed 210227 words, keeping 25304 word types
2021-05-13 00:57:27,779 : INFO : PROGRESS: at sentence #40000, processed 279656 words, keeping 29821 word types
2021-05-13 00:57:27,805 : INFO : PROGRESS: at sentence #50000, processed 349619 words, keeping 34109 word types
2021-05-13 00:57:27,833 : INFO : PROGRESS: at sentence #60000, processed 420075 words, keeping 37930 word types
2021-05-13 00:57:27,858 : INFO : PROGRESS: at sentence #70000, processed 490052 words, keeping 41471 word types
2021-05-13 00:57:27,883 : INFO : PROGRESS: at s

2021-05-13 00:57:29,780 : INFO : PROGRESS: at sentence #720000, processed 5028762 words, keeping 163880 word types
2021-05-13 00:57:29,810 : INFO : PROGRESS: at sentence #730000, processed 5099143 words, keeping 165225 word types
2021-05-13 00:57:29,839 : INFO : PROGRESS: at sentence #740000, processed 5169237 words, keeping 166567 word types
2021-05-13 00:57:29,863 : INFO : PROGRESS: at sentence #750000, processed 5238906 words, keeping 167871 word types
2021-05-13 00:57:29,894 : INFO : PROGRESS: at sentence #760000, processed 5308502 words, keeping 169303 word types
2021-05-13 00:57:29,920 : INFO : PROGRESS: at sentence #770000, processed 5378932 words, keeping 170649 word types
2021-05-13 00:57:29,946 : INFO : PROGRESS: at sentence #780000, processed 5448913 words, keeping 171980 word types
2021-05-13 00:57:29,976 : INFO : PROGRESS: at sentence #790000, processed 5518771 words, keeping 173185 word types
2021-05-13 00:57:30,002 : INFO : PROGRESS: at sentence #800000, processed 558917

### Train Model

#### Parameters:

- `total_examples`: number of sentences in the corpus
- `epochs`: number of iterations over the corpus

In [67]:
W2V_EPOCH = 32 
w2v_model.train(train_sentences, total_examples=len(train_sentences), epochs=W2V_EPOCH)

2021-05-13 00:57:37,542 : INFO : training model with 8 workers on 24713 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=7
2021-05-13 00:57:38,678 : INFO : EPOCH 1 - PROGRESS: at 10.87% examples, 638551 words/s, in_qsize 16, out_qsize 2
2021-05-13 00:57:39,694 : INFO : EPOCH 1 - PROGRESS: at 22.64% examples, 664887 words/s, in_qsize 14, out_qsize 1
2021-05-13 00:57:40,696 : INFO : EPOCH 1 - PROGRESS: at 34.12% examples, 670748 words/s, in_qsize 15, out_qsize 0
2021-05-13 00:57:41,711 : INFO : EPOCH 1 - PROGRESS: at 45.18% examples, 665377 words/s, in_qsize 15, out_qsize 0
2021-05-13 00:57:42,712 : INFO : EPOCH 1 - PROGRESS: at 56.80% examples, 670768 words/s, in_qsize 13, out_qsize 2
2021-05-13 00:57:43,733 : INFO : EPOCH 1 - PROGRESS: at 68.42% examples, 672208 words/s, in_qsize 15, out_qsize 0
2021-05-13 00:57:44,748 : INFO : EPOCH 1 - PROGRESS: at 78.71% examples, 662525 words/s, in_qsize 15, out_qsize 0
2021-05-13 00:57:45,752 : INFO : EPOCH 1 - PROGRESS:

2021-05-13 00:58:21,087 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-05-13 00:58:21,092 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-05-13 00:58:21,116 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-05-13 00:58:21,117 : INFO : worker thread finished; awaiting finish of 4 more threads
2021-05-13 00:58:21,120 : INFO : worker thread finished; awaiting finish of 3 more threads
2021-05-13 00:58:21,122 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-13 00:58:21,128 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-13 00:58:21,135 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-13 00:58:21,137 : INFO : EPOCH - 5 : training on 6707215 raw words (5966484 effective words) took 8.6s, 695938 effective words/s
2021-05-13 00:58:22,168 : INFO : EPOCH 6 - PROGRESS: at 9.53% examples, 561556 words/s, in_qsize 14, out_qsize 1
2021-05-13 00:58:23,17

2021-05-13 00:58:58,670 : INFO : EPOCH 10 - PROGRESS: at 34.57% examples, 669205 words/s, in_qsize 14, out_qsize 5
2021-05-13 00:58:59,678 : INFO : EPOCH 10 - PROGRESS: at 47.12% examples, 686887 words/s, in_qsize 14, out_qsize 1
2021-05-13 00:59:00,684 : INFO : EPOCH 10 - PROGRESS: at 57.70% examples, 675274 words/s, in_qsize 16, out_qsize 0
2021-05-13 00:59:01,735 : INFO : EPOCH 10 - PROGRESS: at 68.72% examples, 666804 words/s, in_qsize 13, out_qsize 4
2021-05-13 00:59:02,749 : INFO : EPOCH 10 - PROGRESS: at 80.93% examples, 674217 words/s, in_qsize 15, out_qsize 0
2021-05-13 00:59:03,761 : INFO : EPOCH 10 - PROGRESS: at 91.94% examples, 671315 words/s, in_qsize 16, out_qsize 0
2021-05-13 00:59:04,489 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-05-13 00:59:04,494 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-05-13 00:59:04,500 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-05-13 00:59:04,502 : INFO : work

2021-05-13 00:59:43,730 : INFO : EPOCH 14 - PROGRESS: at 42.33% examples, 623516 words/s, in_qsize 14, out_qsize 1
2021-05-13 00:59:44,737 : INFO : EPOCH 14 - PROGRESS: at 54.27% examples, 640111 words/s, in_qsize 15, out_qsize 1
2021-05-13 00:59:45,766 : INFO : EPOCH 14 - PROGRESS: at 66.03% examples, 647324 words/s, in_qsize 15, out_qsize 0
2021-05-13 00:59:46,833 : INFO : EPOCH 14 - PROGRESS: at 76.77% examples, 640291 words/s, in_qsize 13, out_qsize 3
2021-05-13 00:59:47,842 : INFO : EPOCH 14 - PROGRESS: at 87.77% examples, 641622 words/s, in_qsize 15, out_qsize 0
2021-05-13 00:59:48,847 : INFO : EPOCH 14 - PROGRESS: at 96.71% examples, 629483 words/s, in_qsize 14, out_qsize 1
2021-05-13 00:59:49,073 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-05-13 00:59:49,077 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-05-13 00:59:49,085 : INFO : worker thread finished; awaiting finish of 5 more threads
2021-05-13 00:59:49,086 : INFO : work

2021-05-13 01:00:29,869 : INFO : EPOCH 18 - PROGRESS: at 43.83% examples, 419515 words/s, in_qsize 15, out_qsize 0
2021-05-13 01:00:30,881 : INFO : EPOCH 18 - PROGRESS: at 50.25% examples, 413672 words/s, in_qsize 15, out_qsize 2
2021-05-13 01:00:31,886 : INFO : EPOCH 18 - PROGRESS: at 56.95% examples, 411855 words/s, in_qsize 16, out_qsize 1
2021-05-13 01:00:32,949 : INFO : EPOCH 18 - PROGRESS: at 63.95% examples, 409667 words/s, in_qsize 15, out_qsize 0
2021-05-13 01:00:33,957 : INFO : EPOCH 18 - PROGRESS: at 73.34% examples, 423925 words/s, in_qsize 13, out_qsize 2
2021-05-13 01:00:34,996 : INFO : EPOCH 18 - PROGRESS: at 83.31% examples, 437623 words/s, in_qsize 16, out_qsize 0
2021-05-13 01:00:36,004 : INFO : EPOCH 18 - PROGRESS: at 92.84% examples, 447955 words/s, in_qsize 15, out_qsize 0
2021-05-13 01:00:36,625 : INFO : worker thread finished; awaiting finish of 7 more threads
2021-05-13 01:00:36,643 : INFO : worker thread finished; awaiting finish of 6 more threads
2021-05-13 01

2021-05-13 01:01:14,930 : INFO : EPOCH 22 - PROGRESS: at 8.63% examples, 502856 words/s, in_qsize 15, out_qsize 0
2021-05-13 01:01:15,946 : INFO : EPOCH 22 - PROGRESS: at 17.74% examples, 518545 words/s, in_qsize 13, out_qsize 2
2021-05-13 01:01:16,953 : INFO : EPOCH 22 - PROGRESS: at 27.71% examples, 542570 words/s, in_qsize 15, out_qsize 0
2021-05-13 01:01:17,981 : INFO : EPOCH 22 - PROGRESS: at 36.81% examples, 538729 words/s, in_qsize 15, out_qsize 3
2021-05-13 01:01:19,014 : INFO : EPOCH 22 - PROGRESS: at 46.68% examples, 544684 words/s, in_qsize 13, out_qsize 2
2021-05-13 01:01:20,047 : INFO : EPOCH 22 - PROGRESS: at 57.40% examples, 557430 words/s, in_qsize 16, out_qsize 1
2021-05-13 01:01:21,126 : INFO : EPOCH 22 - PROGRESS: at 67.67% examples, 559165 words/s, in_qsize 15, out_qsize 0
2021-05-13 01:01:22,173 : INFO : EPOCH 22 - PROGRESS: at 77.52% examples, 559308 words/s, in_qsize 14, out_qsize 1
2021-05-13 01:01:23,230 : INFO : EPOCH 22 - PROGRESS: at 86.15% examples, 551170 

2021-05-13 01:02:00,723 : INFO : EPOCH 26 - PROGRESS: at 7.29% examples, 432302 words/s, in_qsize 16, out_qsize 0
2021-05-13 01:02:01,728 : INFO : EPOCH 26 - PROGRESS: at 15.65% examples, 464060 words/s, in_qsize 16, out_qsize 0
2021-05-13 01:02:02,746 : INFO : EPOCH 26 - PROGRESS: at 25.18% examples, 496135 words/s, in_qsize 14, out_qsize 1
2021-05-13 01:02:03,789 : INFO : EPOCH 26 - PROGRESS: at 35.16% examples, 515126 words/s, in_qsize 15, out_qsize 0
2021-05-13 01:02:04,792 : INFO : EPOCH 26 - PROGRESS: at 45.47% examples, 534287 words/s, in_qsize 14, out_qsize 1
2021-05-13 01:02:05,794 : INFO : EPOCH 26 - PROGRESS: at 56.20% examples, 551626 words/s, in_qsize 16, out_qsize 1
2021-05-13 01:02:06,852 : INFO : EPOCH 26 - PROGRESS: at 66.92% examples, 559479 words/s, in_qsize 15, out_qsize 0
2021-05-13 01:02:07,865 : INFO : EPOCH 26 - PROGRESS: at 76.91% examples, 563170 words/s, in_qsize 15, out_qsize 0
2021-05-13 01:02:08,897 : INFO : EPOCH 26 - PROGRESS: at 86.59% examples, 562715 

2021-05-13 01:02:48,769 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-05-13 01:02:48,770 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-05-13 01:02:48,773 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-05-13 01:02:48,774 : INFO : EPOCH - 29 : training on 6707215 raw words (5965359 effective words) took 11.1s, 535022 effective words/s
2021-05-13 01:02:49,822 : INFO : EPOCH 30 - PROGRESS: at 7.73% examples, 451229 words/s, in_qsize 15, out_qsize 2
2021-05-13 01:02:50,824 : INFO : EPOCH 30 - PROGRESS: at 17.59% examples, 517451 words/s, in_qsize 14, out_qsize 1
2021-05-13 01:02:51,841 : INFO : EPOCH 30 - PROGRESS: at 27.41% examples, 537526 words/s, in_qsize 15, out_qsize 0
2021-05-13 01:02:52,867 : INFO : EPOCH 30 - PROGRESS: at 36.51% examples, 535240 words/s, in_qsize 15, out_qsize 0
2021-05-13 01:02:53,880 : INFO : EPOCH 30 - PROGRESS: at 45.47% examples, 533490 words/s, in_qsize 14, out_qsize 2
2021-05-13 01:

(190910928, 214630880)
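The tuple printed above appears to be the pair (effective words trained, raw words seen) summed over all epochs; that reading is an assumption based on gensim's usual return value, but it is consistent with the per-epoch figures in the log, as this small arithmetic check shows.

```python
RAW_WORDS_PER_EPOCH = 6_707_215    # "training on 6707215 raw words" in the log
EPOCHS_RUN = 32                    # W2V_EPOCH

total_effective, total_raw = 190_910_928, 214_630_880  # tuple returned by train()

# Raw words per epoch times the number of epochs reproduces the second value.
assert RAW_WORDS_PER_EPOCH * EPOCHS_RUN == total_raw

effective_per_epoch = total_effective / EPOCHS_RUN
kept_fraction = total_effective / total_raw
print(f"{effective_per_epoch:,.1f} effective words per epoch")
print(f"{kept_fraction:.2%} of raw words were used for training")
```

The average of roughly 5.97 million effective words per epoch matches the per-epoch log lines, and the gap to the raw count reflects words skipped by `min_count` filtering and frequent-word subsampling.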

In [69]:
w2v_model.save('w2v_model.model')

2021-05-13 01:08:33,477 : INFO : saving Word2Vec object under w2v_model.model, separately None
2021-05-13 01:08:33,488 : INFO : not storing attribute vectors_norm
2021-05-13 01:08:33,489 : INFO : not storing attribute cum_table
2021-05-13 01:08:34,192 : INFO : saved w2v_model.model


In [71]:
w2v_model = word2vec.Word2Vec.load("w2v_model.model")

2021-05-13 01:09:53,291 : INFO : loading Word2Vec object from w2v_model.model
2021-05-13 01:09:53,914 : INFO : loading wv recursively from w2v_model.model.wv.* with mmap=None
2021-05-13 01:09:53,915 : INFO : setting ignored attribute vectors_norm to None
2021-05-13 01:09:53,916 : INFO : loading vocabulary recursively from w2v_model.model.vocabulary.* with mmap=None
2021-05-13 01:09:53,918 : INFO : loading trainables recursively from w2v_model.model.trainables.* with mmap=None
2021-05-13 01:09:53,918 : INFO : setting ignored attribute cum_table to None
2021-05-13 01:09:53,919 : INFO : loaded w2v_model.model


In [None]:
# Query the KeyedVectors object; calling most_similar on the model itself is deprecated.
w2v_model.wv.most_similar('good')

2021-05-13 01:10:20,469 : INFO : precomputing L2-norms of word weight vectors


In [19]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train['Text'])
vocab_size = len(tokenizer.word_index)+1
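The `+1` in `vocab_size` exists because the Keras `Tokenizer` numbers words from 1 in order of decreasing frequency and leaves index 0 free for padding. The sketch below reproduces that numbering with the standard library only; `fit_word_index` and the toy texts are hypothetical, and the real `Tokenizer` also handles filtering and lowercasing of raw strings.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = Counter(word for tokens in texts for word in tokens)
    ranked = counts.most_common()  # ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

demo_texts = [["good", "day"], ["good", "morning"]]
word_index = fit_word_index(demo_texts)
demo_vocab_size = len(word_index) + 1  # +1 for the reserved padding index 0

print(word_index)       # {'good': 1, 'day': 2, 'morning': 3}
print(demo_vocab_size)  # 4
```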



In [None]:
SEQUENCE_LENGTH = 300
x_train = pad_sequences(tokenizer.texts_to_sequences(train['Text']), maxlen=SEQUENCE_LENGTH)
x_val = pad_sequences(tokenizer.texts_to_sequences(val['Text']), maxlen=SEQUENCE_LENGTH)
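By default `pad_sequences` pads and truncates at the front of each sequence, so every row ends with the last words of the tweet. A minimal pure-Python sketch of that default behaviour follows; `pad_pre` is a hypothetical helper and assumes `maxlen` is at least 1.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]  # truncation drops items from the front
    return [value] * (maxlen - len(seq)) + seq

short = pad_pre([5, 3], maxlen=4)
long_ = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short)  # [0, 0, 5, 3]
print(long_)  # [3, 4, 5, 6]
```

With a sequence length of 300 and short tweets, almost every row is mostly zeros, which is why index 0 must not be assigned to a real word.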

In [None]:
labels = [-1,1,0]
encoder = LabelEncoder()
encoder.fit(train['Sentiment'].tolist())

y_train = encoder.transform(train['Sentiment'].tolist())
y_val = encoder.transform(val['Sentiment'].tolist())

y_train = y_train.reshape(-1,1)
y_val = y_val.reshape(-1,1)
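`LabelEncoder` maps the sorted distinct labels to consecutive integers, so the sentiments -1 and 1 become 0 and 1, which is the form a single sigmoid output expects. The sketch below imitates that mapping without scikit-learn; `fit_classes` and `encode` are hypothetical names.

```python
def fit_classes(labels):
    """Return the sorted distinct labels, as LabelEncoder stores them."""
    return sorted(set(labels))

def encode(labels, classes):
    """Replace each label by its position in `classes`."""
    lookup = {label: index for index, label in enumerate(classes)}
    return [lookup[label] for label in labels]

classes = fit_classes([-1, 1, 1, -1])
encoded = encode([-1, 1, 1], classes)
print(classes)  # [-1, 1]
print(encoded)  # [0, 1, 1]
```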

In [None]:
embedding_matrix = np.zeros((vocab_size, W2V_SIZE))
for word, i in tokenizer.word_index.items():
  if word in w2v_model.wv:
    embedding_matrix[i] = w2v_model.wv[word]
print(embedding_matrix.shape)
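The loop above fills one row per tokenizer index and leaves a row of zeros for every word the Word2Vec model does not know, as well as for the padding index 0. A small stand-alone sketch of the same idea, using plain lists and a dict in place of NumPy and the trained vectors (all names and values here are made up for illustration):

```python
def build_embedding_rows(word_index, vectors, dim):
    """Row i holds the vector of the word with index i, or zeros if unknown."""
    rows = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        if word in vectors:
            rows[i] = list(vectors[word])
    return rows

demo_index = {"good": 1, "zzz": 2}       # hypothetical tokenizer indices
demo_vectors = {"good": [0.1, 0.2]}      # hypothetical 2-d word vectors
rows = build_embedding_rows(demo_index, demo_vectors, dim=2)
print(rows)  # [[0.0, 0.0], [0.1, 0.2], [0.0, 0.0]]
```

Words that fall below `min_count` in Word2Vec still receive a tokenizer index, so they end up as all-zero rows in the matrix.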

In [None]:
embedding_layer = Embedding(vocab_size, W2V_SIZE, weights=[embedding_matrix], input_length=SEQUENCE_LENGTH, trainable=False)

In [None]:
model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.5))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

model.summary()

In [None]:
SEQUENCE_LENGTH = 300
EPOCHS = 8
BATCH_SIZE = 1024
model.compile(loss='binary_crossentropy',
              optimizer="adam",
              metrics=['accuracy'])
callbacks = [ ReduceLROnPlateau(monitor='val_loss', patience=5, cooldown=0),
              EarlyStopping(monitor='val_acc', min_delta=1e-4, patience=5)]
history = model.fit(x_train, y_train,
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS,
                    validation_split=0.1,
                    verbose=1,
                    callbacks=callbacks)
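The patience logic behind `EarlyStopping` can be sketched in a few lines. This is a simplified model of the callback for a metric that should increase, not the Keras implementation; `stopping_epoch` and the score list are hypothetical.

```python
def stopping_epoch(val_scores, patience, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + min_delta:   # counts as an improvement
            best = score
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

scores = [0.70, 0.75, 0.75, 0.74, 0.75]
print(stopping_epoch(scores, patience=2))  # 4
print(stopping_epoch(scores, patience=5))  # None
```

With `EPOCHS = 8` and `patience=5`, stopping early requires five consecutive epochs without improvement, so the callback can only trigger in the last part of the run.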