Twitter Sentiment Analysis

Import the libraries that will be used in this project.

In [12]:
import pandas as pd
import numpy as np

pd.options.display.max_colwidth = 1000

In [13]:
import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

Read the Twitter data into a pandas DataFrame. I originally ran into an encoding issue and had to save the CSV as UTF-8. The CSV file does not include a header row, so the column names are added manually.

In [14]:
df = pd.read_csv("twitter_data.csv", header=None, names=["sentiment", "tweet_id", "date", "query", "user", "tweet"])
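
The encoding issue mentioned above can often be handled at read time instead of re-saving the file. The following is a minimal sketch, assuming the original file was Latin-1 encoded (an assumption; the notebook does not say which encoding the raw file used). The in-memory bytes and the `demo` name are hypothetical stand-ins for the real CSV:

```python
import io
import pandas as pd

columns = ["sentiment", "tweet_id", "date", "query", "user", "tweet"]

# Hypothetical two-row file encoded as Latin-1; "café" contains a byte
# that is not valid UTF-8 on its own.
raw = (
    '0,1,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,user_a,"no café today"\n'
    '4,2,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,user_b,"great day"\n'
).encode("latin-1")

# header=None because the file has no header row; names supplies the labels.
demo = pd.read_csv(io.BytesIO(raw), header=None, names=columns, encoding="latin-1")
print(demo[["sentiment", "tweet"]])
```

If the guess at the encoding is wrong, the text may decode without an error but contain garbled characters, so it is worth spot-checking a few tweets with accented letters.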

In [15]:
df.head()

Unnamed: 0,sentiment,tweet_id,date,query,user,tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there."


Remove columns that don't seem useful for sentiment analysis.

In [16]:
df.drop(["tweet_id", "date", "query", "user"], axis=1, inplace=True)
df.head()

Unnamed: 0,sentiment,tweet
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
1,0,is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
2,0,@Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there."


TEMPORARY: DO NOT KEEP THIS IN THE PROJECT!
Sampling 300,000 rows to speed up preprocessing while testing. Remove this cell before the final run.

In [17]:
df = df.sample(300000)
df.shape

(300000, 2)
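
While the subsample is in use, fixing the seed makes each run draw the same rows, so scores stay comparable between runs. A small sketch on a toy frame; `toy` is a hypothetical stand-in for `df`, and `random_state=42` is an arbitrary choice that the original cell does not set:

```python
import pandas as pd

toy = pd.DataFrame({"sentiment": [0, 4] * 5,
                    "tweet": [f"tweet {i}" for i in range(10)]})

# Same seed, same rows: the two calls return identical samples.
first = toy.sample(4, random_state=42)
second = toy.sample(4, random_state=42)
print(first.index.tolist() == second.index.tolist())
```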

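Perform data preprocessing: strip mentions, hashtags, and URLs, collapse repeated characters, remove punctuation and very short words, and lowercase the text.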

In [19]:
import re

def processTweet(tweet):
    """Clean a raw tweet before vectorizing or tokenizing it."""
    tweet = re.sub(r"[@#]\w+\S", "", tweet)  # remove @usernames and #hashtags
    tweet = re.sub(r"http[s]?://[\S]+", "", tweet)  # remove urls
    tweet = re.sub(r"(.)\1\1+", r"\1\1", tweet)  # collapse characters repeated 3+ times down to 2
    tweet = re.sub(r"[!&()+,\-./:;<=>?\[\]_{|}~]", " ", tweet)  # replace certain punctuation with a space
    tweet = re.sub(r"""[!"#$%&'()*+,\-./:;<=>?@\[\]^_`{|}~]""", "", tweet)  # remove remaining punctuation
    tweet = re.sub(r"\b\w{1,2}\b", "", tweet)  # remove words shorter than 3 characters
    tweet = tweet.lower()  # lowercase
    return tweet
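
To see what each substitution does, the same steps can be applied one at a time to the first tweet in the dataset. This is only an illustrative trace: the `steps` list and the `trace` helper are hypothetical names, and the patterns follow the function above, written as raw strings with the stray `|` dropped from the first character class.

```python
import re

# (description, pattern, replacement), in the order used by the cleaning function
steps = [
    ("mentions/hashtags", r"[@#]\w+\S", ""),
    ("urls", r"http[s]?://[\S]+", ""),
    ("repeated chars", r"(.)\1\1+", r"\1\1"),
    ("punctuation -> space", r"[!&()+,\-./:;<=>?\[\]_{|}~]", " "),
    ("remaining punctuation", r"""[!"#$%&'()*+,\-./:;<=>?@\[\]^_`{|}~]""", ""),
    ("short words", r"\b\w{1,2}\b", ""),
]

def trace(tweet):
    """Apply each step in turn, printing the intermediate text."""
    for name, pattern, repl in steps:
        tweet = re.sub(pattern, repl, tweet)
        print(f"{name:>22}: {tweet!r}")
    return tweet.lower()

sample = ("@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. "
          "You shoulda got David Carr of Third Day to do it. ;D")
cleaned = trace(sample)
print(cleaned.split())
```

The cleaned string keeps runs of spaces where tokens were removed. That is harmless here because both `CountVectorizer` and the Keras `Tokenizer` split on whitespace, but it matters if the text is compared or deduplicated as raw strings.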

In [20]:
df['tweet'] = df['tweet'].map(processTweet)

df.head()

Unnamed: 0,sentiment,tweet
1355349,4,just looked the other stuff that shop and the name persephoneplus not surprised you like that stuff
1460405,4,the cutest thing ever two mice running their wheel together wish had photo
410754,0,with vortex2 bumblebee stayed alive windshield wiper through golfball size hail succumbed highway speeds later
702594,0,nice and not nice
302007,0,stinky men are sitting beside the jeep this sucks bad


In [21]:
from sklearn.model_selection import train_test_split

y = df['sentiment'].map({0: 0, 4: 1})
X = df.drop(['sentiment'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=42)

In [22]:
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()

train_vector = count_vector.fit_transform(X_train['tweet'])
test_vector = count_vector.transform(X_test['tweet'])
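
The vectorizer is fitted on the training tweets only and then reused on the test tweets, so words that appear only in the test set are ignored rather than leaking into the vocabulary. A toy sketch of that behavior; the two tiny document lists are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good day", "bad day"]
test_docs = ["good night"]

vec = CountVectorizer()
train_counts = vec.fit_transform(train_docs)  # learns the vocabulary from train only
test_counts = vec.transform(test_docs)        # reuses it; "night" is unseen and dropped

print(sorted(vec.vocabulary_))
print(test_counts.toarray())
```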

In [23]:
from sklearn.naive_bayes import MultinomialNB
naive_bayes = MultinomialNB()
naive_bayes.fit(train_vector, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [24]:
predictions = naive_bayes.predict(test_vector)

In [25]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
print(accuracy_score(y_test, predictions))

[[29717  7793]
 [ 9455 28035]]
              precision    recall  f1-score   support

           0       0.76      0.79      0.78     37510
           1       0.78      0.75      0.76     37490

   micro avg       0.77      0.77      0.77     75000
   macro avg       0.77      0.77      0.77     75000
weighted avg       0.77      0.77      0.77     75000

0.7700266666666666
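
The numbers in the classification report can be recomputed by hand from the confusion matrix, which is a useful sanity check on how to read it. In scikit-learn's layout, rows are true labels and columns are predicted labels. A short sketch using the counts printed above:

```python
# Row 0 = true negative class, row 1 = true positive class.
tn, fp = 29717, 7793
fn, tp = 9455, 28035

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision_neg = tn / (tn + fn)   # of everything predicted 0, how much was 0
recall_neg = tn / (tn + fp)      # of everything truly 0, how much was found
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)

print(round(accuracy, 4))
print(round(precision_neg, 2), round(recall_neg, 2))
print(round(precision_pos, 2), round(recall_pos, 2))
```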


RNN model (LSTM) starts below.

In [26]:
# import nltk
# from nltk.corpus import stopwords 
# from nltk.tokenize import word_tokenize 

# nltk.download('stopwords')

# stop_words = set(stopwords.words('english')) 

# X_train['tokens'] = X_train['tweet'].apply(word_tokenize)

# X_train['tokens'] = X_train['tokens'].apply(lambda x: [item for item in x if item not in stop_words])

# X_train.head()

In [27]:
from keras.preprocessing import sequence, text
from keras.utils import to_categorical

from sklearn.feature_extraction.text import TfidfVectorizer

vocabulary_size = 5000


# count_vector2 = TfidfVectorizer(max_features=vocabulary_size, stop_words="english")

# train_vector2 = count_vector2.fit_transform(X_train['tweet'])
# test_vector2 = count_vector2.transform(X_test['tweet'])


tokenizer = text.Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(X_train['tweet'])

train_token = tokenizer.texts_to_sequences(X_train['tweet'])
test_token = tokenizer.texts_to_sequences(X_test['tweet'])

max_words = 50
train_padded = sequence.pad_sequences(train_token, maxlen=max_words)
test_padded = sequence.pad_sequences(test_token, maxlen=max_words)

# train_padded = tokenizer.texts_to_matrix(X_train['tweet'], mode="count")
# test_padded = tokenizer.texts_to_matrix(X_test['tweet'], mode="count")

# y_train2= to_categorical(y_train, 2)
# y_test2 = to_categorical(y_test, 2)

Using TensorFlow backend.
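
With its default settings, `pad_sequences` pads short sequences on the left with zeros and truncates long ones from the front, keeping the last `maxlen` tokens. The function below is a plain-Python illustration of that behavior, not the Keras implementation; `pad_pre` is a hypothetical name:

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` tokens."""
    padded = []
    for seq in seqs:
        kept = list(seq)[-maxlen:]  # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

print(pad_pre([[5, 9], [1, 2, 3, 4, 5, 6]], maxlen=4))
```

Index 0 is safe to use as padding because the Keras `Tokenizer` starts its word indices at 1.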


In [28]:
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout, Activation

embedding_size=32
model=Sequential()
model.add(Embedding(vocabulary_size, embedding_size, input_length=max_words))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())


# model = Sequential()
# model.add(Dense(512, input_dim=vocabulary_size))
# model.add(Activation("relu"))
# model.add(Dropout(0.5))
# # model.add(Dense(256))
# # model.add(Activation("relu"))          
# # model.add(Dropout(0.5))
# model.add(Dense(2))
# model.add(Activation("softmax"))
# model.summary()




_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 50, 32)            160000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
=================================================================
Total params: 168,353
Trainable params: 168,353
Non-trainable params: 0
_________________________________________________________________
None
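
The parameter counts in the summary follow directly from the layer sizes, which makes it easy to estimate how the model grows if `vocabulary_size` or the LSTM width changes. A short arithmetic sketch; `lstm_units` is a name introduced here for the `32` passed to `LSTM`:

```python
vocabulary_size, embedding_size, lstm_units = 5000, 32, 32

# One embedding vector per vocabulary index.
embedding_params = vocabulary_size * embedding_size

# Four weight sets (input, forget, and output gates plus the cell candidate),
# each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * ((embedding_size + lstm_units) * lstm_units + lstm_units)

# One output unit: a weight per LSTM unit plus a bias.
dense_params = lstm_units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
```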


In [29]:
loss_func = 'binary_crossentropy'
# loss_func = 'categorical_crossentropy'
model.compile(loss=loss_func, 
             optimizer='adam', 
             metrics=['accuracy'])



In [30]:
# ADD IN A CHECKPOINTER HERE AND SHUFFLE = TRUE
# from keras.callbacks import ModelCheckpoint  
# checkpointer = ModelCheckpoint(filepath='model.weights.best.hdf5', 
#                                save_best_only=True)

batch_size = 64
num_epochs = 3
model.fit(train_padded, y_train, validation_split=0.2, batch_size=batch_size, epochs=num_epochs)
# model.fit(train_vector2, y_train2, validation_split=0.2, batch_size=batch_size, epochs=num_epochs)




Train on 180000 samples, validate on 45000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fc7e8d73128>
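
The sample counts in the training log follow from the two splits applied so far: 25% of the 300,000 sampled rows are held out for testing, and `validation_split=0.2` then sets aside 20% of what remains. As far as I know, Keras takes those validation rows from the end of the arrays passed to `fit`, before any shuffling; that is fine here because `train_test_split` has already shuffled the data. A quick arithmetic check (rounding is used for simplicity and is not meant to mirror the libraries' exact rounding rules):

```python
total_rows = 300_000
test_size = 0.25
validation_split = 0.2

test_rows = round(total_rows * test_size)
fit_rows = total_rows - test_rows          # rows passed to model.fit
val_rows = round(fit_rows * validation_split)
train_rows = fit_rows - val_rows

print(train_rows, val_rows, test_rows)
```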

In [31]:
scores = model.evaluate(test_padded, y_test, verbose=0)
# scores = model.evaluate(test_vector2, y_test2, verbose=0)
print('Test accuracy:', scores[1])

Test accuracy: 0.7914266666666666


In [37]:
rnn_pred = model.predict_classes(test_padded)
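
For a single sigmoid output, `predict_classes` amounts to thresholding the predicted probability at 0.5. Newer TensorFlow releases no longer provide `predict_classes` on `Sequential`, so the explicit form may be useful there. A sketch using made-up probabilities in place of `model.predict(test_padded)`; `probs` and `labels` are hypothetical names:

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n_samples, 1), as model.predict would return.
probs = np.array([[0.91], [0.12], [0.50], [0.73]])

# Strictly greater than 0.5 maps to class 1; ravel gives a 1-D label array.
labels = (probs > 0.5).astype("int32").ravel()
print(labels)
```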

In [40]:
print(confusion_matrix(y_test, rnn_pred))
print(classification_report(y_test, rnn_pred))
print(accuracy_score(y_test, rnn_pred))

[[30177  7333]
 [ 8310 29180]]
              precision    recall  f1-score   support

           0       0.78      0.80      0.79     37510
           1       0.80      0.78      0.79     37490

   micro avg       0.79      0.79      0.79     75000
   macro avg       0.79      0.79      0.79     75000
weighted avg       0.79      0.79      0.79     75000

0.7914266666666666
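
Because both models were scored on the same 75,000 test tweets, their confusion matrices can be compared directly. The sketch below counts misclassifications for each model from the matrices printed above. It reflects a single run on a single random sample, so the gap should be read as indicative rather than conclusive:

```python
# Off-diagonal entries of the two confusion matrices printed above.
nb_errors = 7793 + 9455     # Naive Bayes: false positives + false negatives
rnn_errors = 7333 + 8310    # LSTM: false positives + false negatives
n_test = 75000

nb_accuracy = 1 - nb_errors / n_test
rnn_accuracy = 1 - rnn_errors / n_test
relative_error_reduction = (nb_errors - rnn_errors) / nb_errors

print(nb_errors, rnn_errors)
print(round(nb_accuracy, 4), round(rnn_accuracy, 4))
print(round(relative_error_reduction, 3))
```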
