# Twitter Sentiment Analysis

Adapted from [Paolo Ripamonti's Kaggle notebook](https://www.kaggle.com/paoloripamonti/twitter-sentiment-analysis/)

![Twitter](https://miro.medium.com/max/900/1*VT7AxioAGXplMe7RAEYfSA.png)

In [0]:
# !pip install gensim

# Read Dataset

### Dataset details
* **target**: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
* **ids**: the ID of the tweet (2087)
* **date**: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
* **flag**: the query (lyx). If there is no query, this value is NO_QUERY.
* **user**: the user that tweeted (robotickilldozr)
* **text**: the text of the tweet (Lyx is cool)

In [0]:
import sys
from google.colab import drive
drive.mount('/gdrive')
drive_path = '/gdrive/My Drive/Open Source Spotlight/Flask/'
sys.path.append(drive_path)

import pandas as pd

DATASET_COLUMNS = ["target", "ids", "date", "flag", "user", "text"]
DATASET_ENCODING = "ISO-8859-1"

df = pd.read_csv(drive_path+'/training.1600000.processed.noemoticon.csv', 
                 encoding=DATASET_ENCODING , names=DATASET_COLUMNS)

print('Dataset loaded successfully!')
print("Dataset size:", len(df))

Mounted at /gdrive
Dataset loaded successfully!
Dataset size: 1600000


In [0]:
df.head()

| | target | ids | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


### Map target label to string
* **0** -> **Negative**
* **2** -> **Neutral**
* **4** -> **Positive**

In [0]:
decode_map = {0: "Negative", 2: "Neutral", 4: "Positive"}
def decode_sentiment(label):
    return decode_map[int(label)]

In [0]:
df.target = df.target.apply(lambda x: decode_sentiment(x))

In [0]:
df.target.value_counts()

Negative    800000
Positive    800000
Name: target, dtype: int64

### Pre-process dataset

In [0]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

stop_words = stopwords.words("english")
stemmer = SnowballStemmer("english")

In [0]:
import re
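# Matches @mentions, http/https links, and any run of non-alphanumeric characters.
# Raw string so that \S is passed to the regex engine unchanged.
TEXT_CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9]+"

def preprocess(text, stem=False):
    # Remove links, user mentions and special characters
    text = re.sub(TEXT_CLEANING_RE, ' ', str(text).lower()).strip()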
    tokens = []
    for token in text.split():
        if token not in stop_words:
            if stem:
                tokens.append(stemmer.stem(token))
            else:
                tokens.append(token)
    return " ".join(tokens)

In [0]:
%%time
df.text = df.text.apply(lambda x: preprocess(x))

CPU times: user 40 s, sys: 111 ms, total: 40.1 s
Wall time: 40.2 s
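
To make the cleaning step concrete, here is a minimal self-contained sketch of the same regex-and-filter logic. It assumes a tiny hand-picked stop-word set (`DEMO_STOP_WORDS`, a hypothetical stand-in for NLTK's English list) so that it runs without any downloads; `clean_tweet` is an illustrative name and not part of the notebook.

```python
import re

# Same idea as TEXT_CLEANING_RE: mentions, links, then any non-alphanumeric run.
CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9]+"

# Assumption: a tiny stand-in for nltk's English stop-word list.
DEMO_STOP_WORDS = {"i", "the", "is", "a", "to", "my", "at"}

def clean_tweet(text):
    """Lowercase, replace mentions/links/punctuation with spaces, drop stop words."""
    text = re.sub(CLEANING_RE, " ", str(text).lower()).strip()
    return " ".join(tok for tok in text.split() if tok not in DEMO_STOP_WORDS)

cleaned = clean_tweet("@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer!")
print(cleaned)  # awww that s bummer
```

Note the leftover `s`: the apostrophe is replaced by a space, so contractions are split into fragments. With the full NLTK list most of these fragments are filtered out as stop words.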


### Split train and test

In [0]:
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
print("Train size:", len(df_train))
print("Test size:", len(df_test))

Train size: 1280000
Test size: 320000


### Word2Vec 

In [0]:
documents = [_text.split() for _text in df_train.text] 

In [0]:
import gensim

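# Word2Vec Parameters
W2V_SIZE = 300       # dimensionality of each word vector
W2V_WINDOW = 7       # context window size
W2V_EPOCH = 32       # training passes over the corpus
W2V_MIN_COUNT = 10   # ignore words seen fewer than 10 times

# Note: `size` is the gensim 3.x argument name; gensim 4+ renames it to `vector_size`.
w2v_model = gensim.models.word2vec.Word2Vec(size=W2V_SIZE, 
                                            window=W2V_WINDOW, 
                                            min_count=W2V_MIN_COUNT, 
                                            workers=8)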

In [0]:
w2v_model.build_vocab(documents)

2020-03-13 18:20:17,764 : INFO : collecting all words and their counts
2020-03-13 18:20:17,766 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-03-13 18:20:17,788 : INFO : PROGRESS: at sentence #10000, processed 72565 words, keeping 14005 word types
2020-03-13 18:20:17,804 : INFO : PROGRESS: at sentence #20000, processed 144393 words, keeping 21587 word types
2020-03-13 18:20:17,822 : INFO : PROGRESS: at sentence #30000, processed 215826 words, keeping 27541 word types
2020-03-13 18:20:17,840 : INFO : PROGRESS: at sentence #40000, processed 288271 words, keeping 32764 word types
2020-03-13 18:20:17,857 : INFO : PROGRESS: at sentence #50000, processed 359772 words, keeping 37587 word types
2020-03-13 18:20:17,875 : INFO : PROGRESS: at sentence #60000, processed 431431 words, keeping 42198 word types
2020-03-13 18:20:17,894 : INFO : PROGRESS: at sentence #70000, processed 503103 words, keeping 46458 word types
2020-03-13 18:20:17,912 : INFO : PROGRESS: at s

In [0]:
words = w2v_model.wv.vocab.keys()
vocab_size = len(words)
print("Vocab size", vocab_size)

Vocab size 30369


In [0]:
%%time
w2v_model.train(documents, total_examples=len(documents), epochs=W2V_EPOCH)

2020-03-13 18:20:26,229 : INFO : training model with 8 workers on 30369 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=7
2020-03-13 18:20:27,311 : INFO : EPOCH 1 - PROGRESS: at 5.22% examples, 402154 words/s, in_qsize 15, out_qsize 0
2020-03-13 18:20:28,320 : INFO : EPOCH 1 - PROGRESS: at 11.29% examples, 447248 words/s, in_qsize 15, out_qsize 0
2020-03-13 18:20:29,336 : INFO : EPOCH 1 - PROGRESS: at 17.02% examples, 452765 words/s, in_qsize 14, out_qsize 3
2020-03-13 18:20:30,375 : INFO : EPOCH 1 - PROGRESS: at 23.09% examples, 459775 words/s, in_qsize 15, out_qsize 0
2020-03-13 18:20:31,415 : INFO : EPOCH 1 - PROGRESS: at 29.16% examples, 463810 words/s, in_qsize 15, out_qsize 0
2020-03-13 18:20:32,472 : INFO : EPOCH 1 - PROGRESS: at 35.32% examples, 466593 words/s, in_qsize 14, out_qsize 1
2020-03-13 18:20:33,488 : INFO : EPOCH 1 - PROGRESS: at 41.37% examples, 469990 words/s, in_qsize 15, out_qsize 0
2020-03-13 18:20:34,489 : INFO : EPOCH 1 - PROGRESS: 

CPU times: user 17min 48s, sys: 3.93 s, total: 17min 52s
Wall time: 9min 8s


(263120805, 295270528)

In [0]:
# sanity check: nearest neighbours of "love"
w2v_model.wv.most_similar("love")

2020-03-13 18:29:35,126 : INFO : precomputing L2-norms of word weight vectors


[('luv', 0.5816342234611511),
 ('loves', 0.5675166845321655),
 ('adore', 0.5288617014884949),
 ('loved', 0.515922486782074),
 ('amazing', 0.5006871223449707),
 ('looove', 0.49863100051879883),
 ('loooove', 0.46007949113845825),
 ('loveee', 0.45978081226348877),
 ('awesome', 0.458271861076355),
 ('lovee', 0.4533945918083191)]
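
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough illustration of what that number means, the sketch below computes cosine similarity by hand on made-up 3-dimensional vectors (the real vectors here have 300 dimensions; the values and the helper name `cosine_similarity` are assumptions for demonstration only).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, invented for this example.
love = [0.9, 0.1, 0.3]
luv = [0.8, 0.2, 0.4]
rain = [-0.2, 0.9, -0.1]

print(round(cosine_similarity(love, luv), 3))   # close to 1: similar direction
print(round(cosine_similarity(love, rain), 3))  # negative: dissimilar direction
```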

### Tokenize Text

In [0]:
%%time
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(df_train.text)

vocab_size = len(tokenizer.word_index) + 1
print("Total words", vocab_size)

Total words 290419
CPU times: user 16.8 s, sys: 165 ms, total: 17 s
Wall time: 16.9 s


In [0]:
%%time
from keras.preprocessing.sequence import pad_sequences

SEQUENCE_LENGTH = 300

x_train = pad_sequences(tokenizer.texts_to_sequences(df_train.text), maxlen=SEQUENCE_LENGTH)
x_test = pad_sequences(tokenizer.texts_to_sequences(df_test.text), maxlen=SEQUENCE_LENGTH)

CPU times: user 25.2 s, sys: 494 ms, total: 25.7 s
Wall time: 25.7 s
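
With its default arguments (`padding='pre'`, `truncating='pre'`, `value=0`), `pad_sequences` left-pads short sequences with zeros and keeps only the last `maxlen` items of long ones. The pure-Python sketch below mimics that behaviour for a single sequence; `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items if too long."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short_seq = pad_pre([12, 7, 431], maxlen=5)
long_seq = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short_seq)  # [0, 0, 12, 7, 431]
print(long_seq)   # [3, 4, 5, 6, 7]
```

The Word2Vec log above reports 72,565 words in the first 10,000 cleaned tweets, about 7 tokens per tweet, so with `SEQUENCE_LENGTH = 300` most positions in each row are padding.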


### Label Encoder 

In [0]:
labels = df_train.target.unique().tolist()
labels.append('Neutral')
labels

['Positive', 'Negative', 'Neutral']

In [0]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(df_train.target.tolist())

y_train = encoder.transform(df_train.target.tolist())
y_test = encoder.transform(df_test.target.tolist())

y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
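
`LabelEncoder` assigns integer codes in sorted label order, so here `Negative` becomes 0 and `Positive` becomes 1. That ordering is what lets a sigmoid output near 1 be read as positive sentiment. A minimal stand-in for the fitting step, assuming only the standard library (`fit_label_encoder` is a hypothetical name):

```python
def fit_label_encoder(labels):
    """Map each distinct label to its index in sorted order."""
    classes = sorted(set(labels))
    return classes, {label: idx for idx, label in enumerate(classes)}

classes, mapping = fit_label_encoder(["Positive", "Negative", "Negative", "Positive"])
encoded = [mapping[label] for label in ["Negative", "Positive", "Positive"]]

print(classes)  # ['Negative', 'Positive']
print(encoded)  # [0, 1, 1]
```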

In [0]:
print("x_train", x_train.shape)
print("y_train", y_train.shape)
print('------------------------')
print("x_test", x_test.shape)
print("y_test", y_test.shape)

x_train (1280000, 300)
y_train (1280000, 1)
------------------------
x_test (320000, 300)
y_test (320000, 1)


### Embedding layer

In [0]:
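import numpy as np

# Row i holds the Word2Vec vector for the word with tokenizer index i.
# Words missing from the Word2Vec vocabulary keep an all-zero row.
embedding_matrix = np.zeros((vocab_size, W2V_SIZE))
for word, i in tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix[i] = w2v_model.wv[word]

print(embedding_matrix.shape)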

(290419, 300)
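
Because the Word2Vec vocabulary (30,369 words after `min_count` filtering) is much smaller than the tokenizer vocabulary (290,418 words plus the padding index), at most roughly 10% of the rows can be non-zero. The toy sketch below shows the same fill logic with plain lists and dicts; all words and vectors in it are made up for illustration.

```python
# Hypothetical toy data: tokenizer indices start at 1; row 0 is reserved for padding.
word_index = {"love": 1, "rain": 2, "zzzrare": 3}
w2v_vectors = {"love": [0.1, 0.2], "rain": [0.3, 0.4]}  # "zzzrare" was too rare
dim = 2

matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
for word, i in word_index.items():
    if word in w2v_vectors:
        matrix[i] = list(w2v_vectors[word])

covered = sum(1 for word in word_index if word in w2v_vectors)
print(matrix)
print("coverage:", covered, "of", len(word_index))
```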


### Build Model

In [0]:
from keras.models import Sequential
from keras.layers import Dropout, LSTM, Dense, Embedding
model = Sequential()
model.add(Embedding(vocab_size, W2V_SIZE, weights=[embedding_matrix], input_length=SEQUENCE_LENGTH, trainable=False))
model.add(Dropout(0.5))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

In [0]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 300)          87125700  
_________________________________________________________________
dropout_1 (Dropout)          (None, 300, 300)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               160400    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 87,286,201
Trainable params: 160,501
Non-trainable params: 87,125,700
_________________________________________________________________
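
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard formulas: an embedding has one weight per (word, dimension) pair, an LSTM has four gates that each carry input weights, recurrent weights and a bias, and a dense layer has one weight per input plus a bias.

```python
vocab_size, embed_dim, lstm_units = 290419, 300, 100

# Embedding: one 300-d vector per vocabulary index (frozen, hence non-trainable).
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates x (input weights + recurrent weights + bias).
lstm_params = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)

# Dense(1): one weight per LSTM unit plus one bias.
dense_params = lstm_units * 1 + 1

print(embedding_params)                             # 87125700
print(lstm_params)                                  # 160400
print(dense_params)                                 # 101
print(embedding_params + lstm_params + dense_params)  # 87286201
```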


In [0]:
# compile model
model.compile(loss='binary_crossentropy',
              optimizer="adam",
              metrics=['accuracy'])

In [0]:
from keras.callbacks import ReduceLROnPlateau, EarlyStopping

callbacks = [ ReduceLROnPlateau(monitor='val_loss', patience=5, cooldown=0),
              EarlyStopping(monitor='val_acc', min_delta=1e-4, patience=5)]

### Train

In [0]:
%%time

EPOCHS = 8
BATCH_SIZE = 1024

training_log = model.fit(x_train, y_train,
                    batch_size=BATCH_SIZE,
                    epochs=EPOCHS,
                    validation_split=0.1,
                    verbose=1,
                    callbacks=callbacks)

Train on 1152000 samples, validate on 128000 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
CPU times: user 1h 52min 34s, sys: 7min 6s, total: 1h 59min 41s
Wall time: 1h 31min 21s


### Evaluate

In [0]:
%%time
score = model.evaluate(x_test, y_test, batch_size=BATCH_SIZE)
print()
print("ACCURACY:", score[1])
print("LOSS:", score[0])


ACCURACY: 0.79049375
LOSS: 0.4472036733627319
CPU times: user 1min 21s, sys: 5.65 s, total: 1min 27s
Wall time: 1min 15s


### Predict

In [0]:
predict("I love the music")

In [0]:
predict("I hate the rain")

In [0]:
predict("i don't know what i'm doing")
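
The cells above call `predict`, which is not defined anywhere in this notebook. The sketch below is one plausible shape for it, not the author's implementation: it turns a sigmoid score into a label, with an optional neutral band for scores that are neither clearly low nor clearly high. The thresholds in `NEUTRAL_BAND`, the helper names, and the `score_fn` argument are all assumptions; a stub stands in for the trained model so the example runs on its own.

```python
import time

# Assumption: illustrative cut-offs for the neutral band, not taken from this notebook.
NEUTRAL_BAND = (0.4, 0.7)

def score_to_label(score, include_neutral=True):
    """Map a sigmoid score in [0, 1] to a sentiment label (1 = Positive)."""
    if include_neutral:
        if score <= NEUTRAL_BAND[0]:
            return "Negative"
        if score >= NEUTRAL_BAND[1]:
            return "Positive"
        return "Neutral"
    return "Negative" if score < 0.5 else "Positive"

def make_predict(score_fn, include_neutral=True):
    """Build predict(text) from a hypothetical callable that scores raw text."""
    def predict(text):
        start = time.time()
        score = float(score_fn(text))
        return {"label": score_to_label(score, include_neutral),
                "score": score,
                "elapsed_time": time.time() - start}
    return predict

# In the notebook, score_fn would wrap the trained objects, for example:
#   lambda t: model.predict(pad_sequences(tokenizer.texts_to_sequences([t]),
#                                         maxlen=SEQUENCE_LENGTH))[0][0]
predict = make_predict(lambda text: 0.93)  # stub score, for demonstration only
result = predict("I love the music")
print(result["label"], result["score"])
```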

### Save model

In [0]:
import pickle

KERAS_MODEL = "model.h5"
TOKENIZER_MODEL = "tokenizer.pkl"
WORD2VEC_MODEL = "model.w2v"
ENCODER_MODEL = "encoder.pkl"

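model.save(KERAS_MODEL)
w2v_model.save(WORD2VEC_MODEL)  # only needed for fine-tuning / re-training

# Use context managers so the files are flushed and closed.
with open(TOKENIZER_MODEL, "wb") as f:
    pickle.dump(tokenizer, f, protocol=0)
with open(ENCODER_MODEL, "wb") as f:
    pickle.dump(encoder, f, protocol=0)  # only needed for fine-tuning / re-training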