## Building a Sentiment Predicting Model on a Social Media Corpus

### Using the 1.6 Million Tweets Positive/Negative Sentiment Corpus


Helpful Resources:
https://medium.com/@thoszymkowiak/how-to-implement-sentiment-analysis-using-word-embedding-and-convolutional-neural-networks-on-keras-163197aef623

Most of this code is built on this post:
https://ahmedbesbes.com/sentiment-analysis-on-twitter-using-word2vec-and-keras.html

Corpus (1.6 million tweets labeled positive/negative) and the paper describing it:
https://drive.google.com/uc?id=0B04GJPshIjmPRnZManQwWEdTZjg&export=download
http://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf





In [1]:
# Standard python helper libraries.
import collections
import itertools
import json
import os
import re
import sys
import time

# Numerical manipulation libraries.
import numpy as np
import pandas as pd
from scipy import stats
import scipy.optimize

# NLTK
import nltk
from nltk.tokenize import word_tokenize

# Helper libraries (from w266 Materials).
# import segment
#from shared_lib import utils
from shared_lib import vocabulary

# Machine Learning Packages
from sklearn.model_selection import train_test_split

# Word2Vec Model
import gensim
from gensim.models.word2vec import Word2Vec # the word2vec model gensim class

# Keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, Flatten
from keras.layers import LSTM, Conv1D, Conv2D
from keras.preprocessing import sequence
from keras.callbacks import TensorBoard


Using TensorFlow backend.


In [2]:
# Load the SemEval data and the Sentiment140 corpus (tweets2 is used later to train the word vectors)

cols = ['TweetID', 'Sentiment', 'SentimentText']
tweets = pd.read_table("Data/twitter_download-master/ALL_SEMEVAL_TRAIN_DATA.txt", header=None,
                       names=cols, encoding='ISO-8859-1', error_bad_lines=False)
# num_tweets = 1000000
tweets.drop(['TweetID'], axis=1, inplace=True)
tweets = tweets[tweets.Sentiment.notnull()]
tweets['Sentiment'] = tweets['Sentiment'].map({'negative':0, 'neutral':2, 'positive':4})

cols2 = ['Sentiment', 'ItemID', 'DateTime', 'Query', 'SentimentSource', 'SentimentText']
tweets2 = pd.read_csv('Data/milliontweetCorpus/training.1600000.processed.noemoticon.csv',
                     header=None, names=cols2, encoding='ISO-8859-1', error_bad_lines=False) #, nrows=num_tweets)
tweets2.drop(['ItemID', 'DateTime', 'Query', 'SentimentSource'], axis=1, inplace=True)
tweets2 = tweets2[tweets2.Sentiment.notnull()]
tweets2['Sentiment'] = tweets2['Sentiment'].map(int)

# Stack both datasets into one frame
tweets = pd.concat([tweets, tweets2], axis=0)

tweets = tweets[tweets['SentimentText'].notnull()]
tweets.reset_index(drop=True, inplace=True)
print('dataset loaded with shape', tweets.shape)
tweets = tweets[tweets['SentimentText'] != 'Not Available']
tweets[tweets['Sentiment'] == 0].head(5)


dataset loaded with shape (1621665, 2)


| | Sentiment | SentimentText |
|---|---|---|
| 0 | 0 | dear @Microsoft the newOoffice for Mac is grea... |
| 1 | 0 | @Microsoft how about you make a system that do... |
| 6 | 0 | @MikeWolf1980 @Microsoft I will be downgrading... |
| 7 | 0 | @Microsoft 2nd computer with same error!!! #Wi... |
| 9 | 0 | After attempting a reinstall, it still bricks,... |


In [3]:
# Load the SemEval dataset (used to train the model; it gives better context for
# negative/positive tweets than Sentiment140)

cols = ['TweetID', 'Sentiment', 'SentimentText']
tweets = pd.read_table("Data/twitter_download-master/ALL_SEMEVAL_TRAIN_DATA.txt", header=None,
                       names=cols, encoding='ISO-8859-1', error_bad_lines=False)
tweets.drop(['TweetID'], axis=1, inplace=True)
tweets = tweets[tweets.Sentiment.notnull()]
tweets['Sentiment'] = tweets['Sentiment'].map({'negative':0, 'neutral':2, 'positive':4})

tweets = tweets[tweets['SentimentText'].notnull()]
tweets.reset_index(drop=True, inplace=True)
print('dataset loaded with shape', tweets.shape)
tweets = tweets[tweets['SentimentText'] != 'Not Available']
tweets[tweets['Sentiment'] == 0].head(5)

# SemEval Dataset is actually relatively small (6000 tweets in 2016). 
# We can group all of the Train/Test/Dev data from 2013 through 2016 to get more.

dataset loaded with shape (21665, 2)


| | Sentiment | SentimentText |
|---|---|---|
| 0 | 0 | dear @Microsoft the newOoffice for Mac is grea... |
| 1 | 0 | @Microsoft how about you make a system that do... |
| 6 | 0 | @MikeWolf1980 @Microsoft I will be downgrading... |
| 7 | 0 | @Microsoft 2nd computer with same error!!! #Wi... |
| 9 | 0 | After attempting a reinstall, it still bricks,... |


In [4]:
# Print the first 51 negative tweets (encoded as 0)
for i,text in enumerate(tweets[tweets['Sentiment'] == 0]['SentimentText']):
    print('Negative Tweet:', text)
    if i >= 50:
        break

# Print the first 15 positive tweets (encoded as 4)
for i,text in enumerate(tweets[tweets['Sentiment'] == 4]['SentimentText']):
    print('Positive Tweet:', text)
    if i >= 14:
        break

Negative Tweet: dear @Microsoft the newOoffice for Mac is great and all, but no Lync update? C'mon.
Negative Tweet: @Microsoft how about you make a system that doesn't eat my friggin discs. This is the 2nd time this has happened and I am so sick of it!
Negative Tweet: @MikeWolf1980 @Microsoft I will be downgrading and let #Windows10 be out for almost the 1st yr b4 trying it again. #Windows10fail
Negative Tweet: @Microsoft 2nd computer with same error!!! #Windows10fail Guess we will shelve this until SP1! http://t.co/QCcHlKuy8Q
Negative Tweet: After attempting a reinstall, it still bricks, says, "Windows cannot finish installing," or somesuch. @Microsoft may have cost me $600.
Negative Tweet: Did @Microsoft break Windows 10? Was working fine on Wednesday but now I can't get passed the login screen without it freezing up.
Negative Tweet: For the 1st time @Skype has a "High Startup impact"   Does anyone at @Microsoft have a clue? #Windows10Fail http://t.co/loO3yd5rwe
Negative Tweet: #teen

### Cleaning

Good idea on using the tokenizer. We can wrap it in a function and pass that function to df.apply to speed this up. The Stack Overflow answer below offers some inspiration, and some exploratory code follows.

https://stackoverflow.com/questions/33098040/how-to-use-word-tokenize-in-data-frame
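
As a rough illustration of the cleaning step, the sketch below filters mentions, hashtags, and links out of a tweet. It uses plain `str.split` as a stand-in for NLTK's `TweetTokenizer` so that it runs without extra packages. The function name `filter_tokens` and the shortened link in the sample tweet are made up for this example.

```python
def filter_tokens(tweet):
    """Lowercase a tweet, split on whitespace, and drop @mentions, #hashtags, and links."""
    # Cast to str first so non-string values (such as NaN) do not raise.
    tokens = str(tweet).lower().split()
    # str.startswith accepts a tuple, so one pass covers all three prefixes.
    return [tok for tok in tokens if not tok.startswith(('@', '#', 'http'))]


sample = "@Microsoft 2nd computer with same error!!! #Windows10fail http://t.co/example"
print(filter_tokens(sample))  # ['2nd', 'computer', 'with', 'same', 'error!!!']
```

With pandas, a function like this can be applied to a whole column, for example `tweets['SentimentText'].apply(filter_tokens)`.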

In [5]:
from nltk.tokenize import TweetTokenizer # a tweet tokenizer from nltk.
t = TweetTokenizer()


def create_tokens(tweet):
    """Lowercase and tokenize a tweet, dropping @mentions, #hashtags, and links."""
    tweet = str(tweet).lower() # cast first so non-string values (e.g. NaN) don't break .lower()
    tokens = t.tokenize(tweet)
    return [tok for tok in tokens if not tok.startswith(('@', '#', 'http'))]


#Clean SemEval Dataset
tweets['SentimentTextTokenized'] = tweets['SentimentText'].apply(create_tokens)
tweets.head()

#Clean Sentiment140 Dataset
tweets2['SentimentTextTokenized'] = tweets2['SentimentText'].apply(create_tokens)
tweets2.head()

X = tweets.SentimentTextTokenized
Y = tweets.Sentiment

X2 = tweets2.SentimentTextTokenized
Y2 = tweets2.Sentiment

ValueError: invalid literal for int() with base 10: 'dear'

### Exploration

In [None]:

import matplotlib.pyplot as plt

num_bins = 5
n, bins, patches = plt.hist(tweets.Sentiment, num_bins, facecolor='blue', alpha=0.5)
plt.show()

# The Sentiment140 training data has only two categories (negative = 0, positive = 4).
# The SemEval data in `tweets` also carries neutral = 2, per the mapping above.
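
A histogram can hide how the labels are actually distributed, so a plain count per class is a useful cross-check. The sketch below uses a small made-up label list in place of `tweets.Sentiment`; the `label_names` mapping follows the 0/2/4 encoding used when loading the data.

```python
from collections import Counter

# Hypothetical labels standing in for tweets.Sentiment
# (0 = negative, 2 = neutral, 4 = positive).
labels = [0, 4, 4, 2, 4, 0, 2, 4]
label_names = {0: 'negative', 2: 'neutral', 4: 'positive'}

counts = Counter(labels)
total = len(labels)
for value in sorted(counts):
    share = counts[value] / total
    print('{:<8} {:>3} ({:.1%})'.format(label_names[value], counts[value], share))
```

A strongly skewed count here would be worth keeping in mind when reading the accuracy numbers later on.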

### Preprocessing & (future) Feature Engineering

### Split Train and Test Sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.20,random_state=100)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2,Y2,test_size=0.20,random_state=100)
print(X_train[:10])
print(y_train[:10])

### Word2Vec

In [None]:
# Convert each word into a vector representation. I couldn't get Keras working with
# raw word indexes, so I followed the steps laid out here:
# https://ahmedbesbes.com/sentiment-analysis-on-twitter-using-word2vec-and-keras.html

vec_dim = 100

# vector size and minimum count threshold for including rare words
# (the `size` argument is named `vector_size` in gensim >= 4.0)
tweet_w2v = Word2Vec(size=vec_dim, min_count=5)
sentences = list(X_train2) # a list can be iterated repeatedly, unlike a generator
tweet_w2v.build_vocab(sentences)
tweet_w2v.train(sentences, total_examples=tweet_w2v.corpus_count, epochs=1)

In [None]:
tweet_w2v.wv.most_similar('good')

### Create Word Vectors (using SemEval Dataset now)
https://ahmedbesbes.com/sentiment-analysis-on-twitter-using-word2vec-and-keras.html

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

print('building tf-idf matrix ...')
vectorizer = TfidfVectorizer(analyzer=lambda x: x, min_df=10)
matrix = vectorizer.fit_transform([x for x in X_train])
tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
print('vocab size :', len(tfidf))

In [None]:
def buildWordVector(tokens, size):
    """Average the word vectors of `tokens`, each scaled by its idf weight; returns shape (1, size)."""
    vec = np.zeros((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += tweet_w2v.wv[word].reshape((1, size)) * tfidf[word]
            count += 1.
        except KeyError: # handling the case where the token is not
                         # in the corpus. useful for testing.
            continue
    if count != 0:
        vec /= count
    return vec
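
To make the arithmetic in `buildWordVector` concrete, here is a small pure-Python sketch of the same idea: scale each known word's vector by its idf weight, sum, and divide by the number of words that were found. The helper name `weighted_average`, the 2-dimensional vectors, and the idf values are all made up for illustration.

```python
def weighted_average(tokens, vectors, idf, size):
    """Average the vectors of known tokens, each scaled by its idf weight."""
    total = [0.0] * size
    count = 0
    for word in tokens:
        if word not in vectors or word not in idf:
            continue  # unknown words are skipped, like the KeyError branch
        for i in range(size):
            total[i] += vectors[word][i] * idf[word]
        count += 1
    if count:
        total = [x / count for x in total]
    return total


# Made-up 2-dimensional vectors and idf weights, for illustration only.
vectors = {'good': [1.0, 0.0], 'bad': [0.0, 1.0]}
idf = {'good': 2.0, 'bad': 4.0}

print(weighted_average(['good', 'bad', 'unseen'], vectors, idf, 2))  # [1.0, 2.0]
```

Note that a tweet whose tokens are all unknown comes out as an all-zero vector, so such tweets are indistinguishable to the model.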

In [None]:
from sklearn.preprocessing import StandardScaler

train_vecs_w2v = np.concatenate([buildWordVector(z, vec_dim) for z in X_train])

# scaler = StandardScaler()
# scaler.fit(train_vecs_w2v)
# train_vecs_w2v = scaler.transform(train_vecs_w2v)

test_vecs_w2v = np.concatenate([buildWordVector(z, vec_dim) for z in X_test])
# test_vecs_w2v = scaler.transform(test_vecs_w2v)

In [None]:
# This section only needed if doing more than binary classification (multi-class)
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

print("Original Y:", y_train[:10])
encoder = LabelEncoder()
encoder.fit(Y)
y_train = encoder.transform(y_train)
y_test = encoder.transform(y_test)
print("Encoded Y:", y_train[:10])

y_train = to_categorical(y_train, 3)
y_test = to_categorical(y_test, 3)
print("One Hot Y:", y_train[:100])
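
The two encoding steps can be sketched without any libraries: first map the sorted distinct labels (0, 2, 4) to consecutive class indexes (0, 1, 2), then turn each index into a one-hot row. The helper name `one_hot` and the sample labels are made up for this example.

```python
def one_hot(labels):
    """Map sorted distinct labels to 0..k-1, then to one-hot rows."""
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[index[label]] = 1
        rows.append(row)
    return classes, rows


classes, rows = one_hot([0, 4, 2, 4])
print(classes)  # [0, 2, 4]
print(rows)     # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]]
```

Because the classes are sorted, column 0 is negative, column 1 is neutral, and column 2 is positive, which is the order an `argmax` over the model output has to be read in.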

### Dense Model (really simple; CNN layers commented out)

In [None]:
# Pad the sequence to the same length (this hurt accuracy?)
# train_vecs_w2v = sequence.pad_sequences(train_vecs_w2v, maxlen=vec_dim)
# test_vecs_w2v = sequence.pad_sequences(test_vecs_w2v, maxlen=vec_dim)

# Build Keras Model

# # assemble the embedding_weights in one numpy array
# # vocab_dim = 300 # dimensionality of your word vectors
# n_symbols = len(index_dict) + 1 # adding 1 to account for 0th index (for masking)
# embedding_weights = np.zeros((n_symbols, vocab_dim))
# for word,index in index_dict.items():
#     embedding_weights[index, :] = word_vectors[word]

# # define inputs here
# embedding_layer = Embedding(output_dim=vocab_dim, input_dim=n_symbols, trainable=True)
# embedding_layer.build((None,)) # if you don't do this, the next step won't work
# embedding_layer.set_weights([embedding_weights])

# embedded = embedding_layer(input_layer)



model = Sequential()
model.add(Dense(32, activation='relu', input_dim=vec_dim))
model.add(Dropout(0.2))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(3, activation='softmax')) # softmax if multi-class

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy', # categorical_crossentropy if multi-class
              metrics=['accuracy'])

# model.add(Embedding(top_words, vec_dim, input_length=max_review_length))
# model.add(Conv1D(64, 3, padding='same'))
# model.add(Conv1D(32, 3, border_mode='same'))
# model.add(Conv1D(16, 3, border_mode='same'))
# model.add(Flatten())
# model.add(Dropout(0.2))
# model.add(Dense(180,activation='sigmoid'))
# model.add(Dropout(0.2))
# model.add(Dense(1,activation='softmax'))
# model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

model.fit(train_vecs_w2v, y_train, epochs=9, batch_size=32, verbose=2)

In [None]:
score = model.evaluate(test_vecs_w2v, y_test, batch_size=32, verbose=2)
print("Accuracy: %.2f%%" % (score[1]*100))

### Model without Tweet Vectors

In [None]:
print(train_vecs_w2v[0])

In [None]:
embedding_matrix = np.zeros((len(tweet_w2v.wv.vocab), vec_dim))
for i in range(len(tweet_w2v.wv.vocab)):
    embedding_vector = tweet_w2v.wv[tweet_w2v.wv.index2word[i]]
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

# model.add(Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1],
#                       weights=[embedding_matrix]))

# Using embedding from Keras
top_words = 10000 # same value as in the next cell
embedding_vector_length = 300
max_review_length = 30
# train_vecs_w2v = sequence.pad_sequences(train_vecs_w2v, maxlen=max_review_length)
# test_vecs_w2v = sequence.pad_sequences(test_vecs_w2v, maxlen=max_review_length)

model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))


# model.add(Embedding(input_dim=embedding_matrix.shape[0], output_dim=embedding_matrix.shape[1], input_length=max_review_length,
#                       weights=[embedding_matrix]))
# model.add(Dense(32, activation='relu', input_dim=vec_dim))

# Convolutional model (3x conv, flatten, 2x dense)
model.add(Conv1D(64, 3, padding='same'))
model.add(Conv1D(32, 3, padding='same'))
model.add(Conv1D(16, 3, padding='same'))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(180,activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(3,activation='softmax'))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy', # categorical_crossentropy if multi-class
              metrics=['accuracy'])

# NOTE: Embedding expects padded sequences of integer word indexes with shape
# (n_samples, max_review_length). train_vecs_w2v holds averaged float vectors,
# so it needs to be swapped for index-encoded input before this will train.
model.fit(train_vecs_w2v, y_train, epochs=3, batch_size=64)
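
An Embedding layer looks up rows by integer index, so tokenized tweets have to be converted to fixed-length index sequences before they can be fed to it. The sketch below shows one way to do that in plain Python. The helper names `build_index` and `encode_and_pad` and the tiny corpus are made up for illustration; the left-padding and left-truncation mirror the defaults of Keras `pad_sequences`.

```python
from collections import Counter


def build_index(token_lists, max_words):
    """Assign indexes 1..max_words-1 to the most frequent tokens (0 is left for padding)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok: i + 1 for i, (tok, _) in enumerate(counts.most_common(max_words - 1))}


def encode_and_pad(tokens, index, max_len):
    """Convert tokens to indexes, drop unknown tokens, and left-pad with zeros to max_len."""
    ids = [index[tok] for tok in tokens if tok in index][-max_len:]
    return [0] * (max_len - len(ids)) + ids


# Tiny made-up corpus of already-tokenized tweets.
corpus = [['good', 'day'], ['bad', 'day', 'today']]
index = build_index(corpus, max_words=10)
print([encode_and_pad(tokens, index, max_len=4) for tokens in corpus])
```

Every row has the same length, and all indexes stay below `max_words`, which is what the `input_dim` of the Embedding layer has to cover.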

In [None]:
# Keep only the top_words most frequent tokens
top_words = 10000
# (X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=top_words)

# Map each token to an integer index (0 is reserved for padding); rarer tokens are dropped
counts = collections.Counter(itertools.chain.from_iterable(X_train))
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common(top_words - 1))}

def encode(tokens):
    return [word_index[w] for w in tokens if w in word_index]

# Pad the sequence to the same length
max_review_length = 40
X_train_seq = sequence.pad_sequences([encode(x) for x in X_train], maxlen=max_review_length)
X_test_seq = sequence.pad_sequences([encode(x) for x in X_test], maxlen=max_review_length)

# Using embedding from Keras
embedding_vector_length = 300
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))

# Convolutional model (3x conv, flatten, 2x dense)
model.add(Conv1D(64, 3, padding='same'))
model.add(Conv1D(32, 3, padding='same'))
model.add(Conv1D(16, 3, padding='same'))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(180,activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(3,activation='softmax')) # three classes, matching the one-hot y_train

# Log to tensorboard

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

model.fit(X_train_seq, y_train, epochs=3, batch_size=64)

# Evaluation on the test set
scores = model.evaluate(X_test_seq, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

### Applied to Reddit

In [None]:
# Function to Predict Positive/Neutral/Negative
# (`model` must be the dense model trained on the averaged word vectors)

def prediction(text):
    sentiment = ["Negative", "Neutral", "Positive"]
    tokens = create_tokens(text)
    vecs = buildWordVector(tokens, vec_dim)
    # vecs = scaler.transform(vecs) # scaling is commented out above, so `scaler` is undefined here
    predic = model.predict(vecs)
    result = sentiment[predic.argmax(axis=1)[0]]
    return result

In [None]:
import praw

#this is a read-only instance
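# Credentials are read from environment variables rather than hardcoded
# (the variable names REDDIT_CLIENT_ID / REDDIT_CLIENT_SECRET are placeholders)
reddit = praw.Reddit(user_agent='first_scrape (by /u/dswald)',
                     client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'])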

subreddit = reddit.subreddit('tensorflow')
hot_python = subreddit.hot(limit = 3) #need to view >2 to get past promoted posts

for submission in hot_python:
    if not submission.stickied: #top 2 are promoted posts, labeled as 'stickied'
        print('Title: {}, ups: {}, downs: {}, Have we visited: {}'.format(submission.title,
                                                                          submission.ups,
                                                                          submission.downs,
                                                                          submission.visited))
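        submission.comments.replace_more(limit=0) #drop unexpanded MoreComments placeholders, which have no body
        comments = submission.comments.list() #unstructured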
        for comment in comments:
            print (20*'-')
            print ('Parent ID:', comment.parent())
            print ('Comment ID:', comment.id)
            print (comment.body)
            print("#"*10,'PREDICTED SENTIMENT:', prediction(comment.body),"#"*10)

It seems that everything is predicted as positive or neutral. Is this because of the unigram model, overly specific training data, or something else? Food for thought. Committing now.
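
One way to start narrowing this down might be to compare the label distribution of the held-out tweets with the distribution of the predictions on them. If negative examples are never predicted even on the test set, the cause is more likely the training data or the model than Reddit itself. The sketch below does this with made-up labels; in the notebook the real values would come from `y_test` and `model.predict`, and the helper name `confusion_counts` is hypothetical.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count (true, predicted) label pairs."""
    return Counter(zip(true_labels, predicted_labels))


# Made-up labels for illustration only.
true_labels = ['Negative', 'Negative', 'Neutral', 'Positive', 'Positive', 'Negative']
predicted = ['Neutral', 'Positive', 'Neutral', 'Positive', 'Positive', 'Neutral']

print('true:     ', Counter(true_labels))
print('predicted:', Counter(predicted))
for (truth, guess), n in sorted(confusion_counts(true_labels, predicted).items()):
    print('{:<8} -> {:<8} {}'.format(truth, guess, n))
```

In this toy data, "Negative" never appears among the predictions, and the pair counts show where the true negatives ended up instead.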