## Building a Sentiment Predicting Model on a Social Media Corpus

### Using the SemEval 2017 Task 4A: Positive/Negative/Neutral Classifier Corpus

This model represents each tweet as a TF-IDF-weighted average of Word2Vec word vectors and feeds that vector to a small dense network. It does not apply any smoothing.

Helpful Resources:
https://medium.com/@thoszymkowiak/how-to-implement-sentiment-analysis-using-word-embedding-and-convolutional-neural-networks-on-keras-163197aef623

SemEval Tweet Download: 
https://github.com/seirasto/twitter_download

Most of this code is built off of this post:
https://ahmedbesbes.com/sentiment-analysis-on-twitter-using-word2vec-and-keras.html

Another corpus to consider (1.6 million tweets labeled positive/negative):
https://drive.google.com/uc?id=0B04GJPshIjmPRnZManQwWEdTZjg&export=download




In [5]:
!pip install keras

Collecting keras
  Downloading Keras-2.0.9-py2.py3-none-any.whl (299kB)
Installing collected packages: keras
Successfully installed keras-2.0.9


In [2]:

# Standard python helper libraries.
import collections
import itertools
import json
import os
import re
import sys
import time

# Numerical manipulation libraries.
import numpy
import pandas as pd
from scipy import stats
import scipy.optimize

# NLTK
import nltk
from nltk.tokenize import word_tokenize

# Helper libraries (from w266 Materials).
# import segment
#from shared_lib import utils
from shared_lib import vocabulary

# Machine Learning Packages
from sklearn.model_selection import train_test_split

# Word2Vec Model
import gensim
from gensim.models.word2vec import Word2Vec # the word2vec model gensim class

# Keras
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM, Conv1D, Flatten, Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.callbacks import TensorBoard


Using TensorFlow backend.


In [3]:
# Pull in Tweet Data (Must be downloaded using https://github.com/seirasto/twitter_download)
tweets = pd.read_table("Data/twitter_download-master/2016train.txt_semeval_tweets.txt", header=None)
tweets

# SemEval Dataset is actually relatively small (6000 tweets in 2016). 
# We can group all of the Train/Test/Dev data from 2013 through 2016 to get more.
# Additionally, we could consider using this data which has 1.6 million rows but it is only a binary positive/negative class 
# https://drive.google.com/uc?id=0B04GJPshIjmPRnZManQwWEdTZjg&export=download

Unnamed: 0,0,1,2
0,628949369883000832,negative,dear @Microsoft the newOoffice for Mac is grea...
1,628976607420645377,negative,@Microsoft how about you make a system that do...
2,629023169169518592,negative,Not Available
3,629179223232479232,negative,Not Available
4,629186282179153920,neutral,If I make a game as a #windows10 Universal App...
5,629226490152914944,positive,"Microsoft, I may not prefer your gaming branch..."
6,629345637155360768,negative,@MikeWolf1980 @Microsoft I will be downgrading...
7,629394528336637953,negative,@Microsoft 2nd computer with same error!!! #Wi...
8,629650766580609026,positive,Just ordered my 1st ever tablet; @Microsoft Su...
9,629797991826722816,negative,"After attempting a reinstall, it still bricks,..."


In [4]:
# Segregate X and Y
#X = tweets[2]
Y = tweets[1]

### Cleaning

In [5]:
# Tokenize each Tweet on whitespace (really slow, need to optimize for larger corpora?)
# X is built directly from the raw tweet text, since the assignment above is commented out.
X = [tweet.split() for tweet in tweets[2]]


# print(X)

# Alternatively, use this?
# from nltk.tokenize import TweetTokenizer # a tweet tokenizer from nltk.
# tokenizer = TweetTokenizer()

Good idea on using the tokenizer. We can pass it as a function to df.apply to speed this up. The Stack Overflow answer below gave some inspiration, and some exploratory code follows.

https://stackoverflow.com/questions/33098040/how-to-use-word-tokenize-in-data-frame

In [6]:
#rewriting as a new dataframe to avoid overlap with your tokenized solution
df = pd.read_table("Data/twitter_download-master/2016train.txt_semeval_tweets.txt", header=None)
print (df[2].head(2))

0    dear @Microsoft the newOoffice for Mac is grea...
1    @Microsoft how about you make a system that do...
Name: 2, dtype: object


In [7]:
from nltk.tokenize import TweetTokenizer # a tweet tokenizer from nltk.
t = TweetTokenizer()
print (t.tokenize(df[2][0])) #proof that tokenize method works

['dear', '@Microsoft', 'the', 'newOoffice', 'for', 'Mac', 'is', 'great', 'and', 'all', ',', 'but', 'no', 'Lync', 'update', '?', "C'mon", '.']


In [8]:
#proof in the pudding - let's time it
start = time.time() 
df['unigrams'] = df[2].apply(t.tokenize)
print ('df.apply', (time.time() - start))


df.apply 0.334456205368042


In [10]:
#print (X[0]) 
print (df.unigrams[0]) # TweetTokenizer is better (see comma)

['dear', '@Microsoft', 'the', 'newOoffice', 'for', 'Mac', 'is', 'great', 'and', 'all', ',', 'but', 'no', 'Lync', 'update', '?', "C'mon", '.']


In [11]:
df.head()

Unnamed: 0,0,1,2,unigrams
0,628949369883000832,negative,dear @Microsoft the newOoffice for Mac is grea...,"[dear, @Microsoft, the, newOoffice, for, Mac, ..."
1,628976607420645377,negative,@Microsoft how about you make a system that do...,"[@Microsoft, how, about, you, make, a, system,..."
2,629023169169518592,negative,Not Available,"[Not, Available]"
3,629179223232479232,negative,Not Available,"[Not, Available]"
4,629186282179153920,neutral,If I make a game as a #windows10 Universal App...,"[If, I, make, a, game, as, a, #windows10, Univ..."


Based on above, I'm going to replace X with df.unigrams

In [12]:
X = df.unigrams

### Preprocessing & (future) Feature Engineering

### Split Train and Test Sets

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.20,random_state=100)
print(X_train[:10])

4395    [#Apology, to, Jeb, Bush, for, John, Dempsey, ...
4068    [Top, 5, Gambling, Apps, for, the, iPad, http:...
3710                                     [Not, Available]
4516    [@Milbank, doesn't, think, Vice, President, Jo...
4243    [1st, day, back, at, work, after, a, terrible,...
4288                                     [Not, Available]
4916                                     [Not, Available]
1360    [1, ), may, be, wrong, ,, but, if, I, read, it...
1012    [@GailSimone, Donald, Trump, may, think, he's,...
1538    [#nowplaying, Bob, Marley, -, Sun, Is, Shining...
Name: unigrams, dtype: object



### Word2Vec

In [15]:
# Convert each word into a vector representation. Couldn't get Keras working with straight indexes for each word so I followed the steps laid out here:
# https://ahmedbesbes.com/sentiment-analysis-on-twitter-using-word2vec-and-keras.html

vec_dim = 10

tweet_w2v = Word2Vec(size=vec_dim, min_count=2) #vector size and minimum threshold to include for rare words
tweet_w2v.build_vocab(x for x in X_train)
tweet_w2v.train((x for x in X_train), total_examples=tweet_w2v.corpus_count, epochs=1)

57893

In [16]:
tweet_w2v.most_similar('yes')

[('profile', 0.92231285572052),
 ('note', 0.8966584205627441),
 ('DONT', 0.8914517164230347),
 ('companies', 0.8803945779800415),
 ('loud', 0.868857741355896),
 ('Work', 0.8656846284866333),
 ('simple', 0.8618882298469543),
 ('include', 0.8560652136802673),
 ('British', 0.8373156785964966),
 ('Committee', 0.8337675333023071)]

### Create Word Vectors 
https://ahmedbesbes.com/sentiment-analysis-on-twitter-using-word2vec-and-keras.html

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

print('building tf-idf matrix ...')
vectorizer = TfidfVectorizer(analyzer=lambda x: x, min_df=10)
matrix = vectorizer.fit_transform([x for x in X_train])
tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
print('vocab size :', len(tfidf))

building tf-idf matrix ...
vocab size : 918
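
For reference, with its default settings (`smooth_idf=True`) scikit-learn computes each `idf_` value as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the token. The sketch below reproduces that formula in plain Python on already-tokenized documents. `smoothed_idf` is an illustrative helper, not part of scikit-learn, and its `min_df` argument only mimics the document-frequency cutoff used above, which is likely why the vocabulary here shrinks to 918 entries.

```python
import math

def smoothed_idf(tokenized_docs, min_df=1):
    """Return {token: idf} using ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        for token in set(tokens):  # count each token once per document
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {
        token: math.log((1 + n_docs) / (1 + df)) + 1
        for token, df in doc_freq.items()
        if df >= min_df  # drop tokens seen in fewer than min_df documents
    }
```

A token that appears in every document gets the minimum weight of 1.0, and rarer tokens get larger weights.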


In [18]:
def buildWordVector(tokens, size):
    vec = numpy.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += tweet_w2v[word].reshape((1, size)) * tfidf[word]
            count += 1.
        except KeyError: # handling the case where the token is not
                         # in the corpus. useful for testing.
            continue
    if count != 0:
        vec /= count
    return vec
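
To make the averaging in `buildWordVector` concrete, here is a small dependency-free sketch of the same steps on toy data. The dictionaries `vectors` and `weights` stand in for `tweet_w2v` and `tfidf`, and `weighted_tweet_vector` is a hypothetical name. As in the function above, a token is skipped when it is missing from either lookup, and the sum is divided by the number of tokens kept rather than by the sum of their weights.

```python
def weighted_tweet_vector(tokens, vectors, weights, size):
    """Average the weight-scaled vectors of the tokens found in both lookups."""
    total = [0.0] * size
    count = 0
    for token in tokens:
        if token not in vectors or token not in weights:
            continue  # out-of-vocabulary token: contributes nothing
        for i in range(size):
            total[i] += vectors[token][i] * weights[token]
        count += 1
    if count:
        total = [value / count for value in total]
    return total, count


# Toy data, invented purely for illustration.
toy_vectors = {"good": [1.0, 0.0], "bad": [0.0, 1.0]}
toy_weights = {"good": 2.0, "bad": 1.0}
vec, used = weighted_tweet_vector(["good", "bad", "unseen"], toy_vectors, toy_weights, 2)
print(vec, used)  # [1.0, 0.5] 2
```

A tweet with no known tokens comes back as the all-zero vector, so every such tweet looks identical to the classifier. The `count` value returned here is one way to check how often that happens.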

In [19]:
from sklearn.preprocessing import StandardScaler

# One averaged vector per tweet, stacked into an array of shape (n_tweets, vec_dim)
train_vecs_w2v = numpy.concatenate([buildWordVector(z, vec_dim) for z in X_train])

# Fit the scaler on the training vectors only, then reuse it for the test vectors
scaler = StandardScaler()
scaler.fit(train_vecs_w2v)
train_vecs_w2v = scaler.transform(train_vecs_w2v)

test_vecs_w2v = numpy.concatenate([buildWordVector(z, vec_dim) for z in X_test])
test_vecs_w2v = scaler.transform(test_vecs_w2v)

In [20]:
from keras.utils.np_utils import to_categorical
from sklearn.preprocessing import LabelEncoder

print("Original Y:", y_train[:10])
encoder = LabelEncoder()
encoder.fit(Y)
y_train = encoder.transform(y_train)
y_test= encoder.transform(y_test)
print("Encoded Y:", y_train[:10])

y_train = to_categorical(y_train, 3)
y_test = to_categorical(y_test, 3)
print("One Hot Y:", y_train[:10])

Original Y: 4395     neutral
4068     neutral
3710    positive
4516     neutral
4243    negative
4288     neutral
4916     neutral
1360     neutral
1012     neutral
1538    positive
Name: 1, dtype: object
Encoded Y: [1 1 2 1 0 1 1 1 1 2]
One Hot Y: [[ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 0.  1.  0.]
 [ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
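
The mapping in the output above (negative = 0, neutral = 1, positive = 2) follows from `LabelEncoder` sorting the class names. A plain-Python sketch of the same two steps, integer encoding and then one-hot encoding, is shown below. `encode_labels` is an illustrative helper, not a scikit-learn or Keras function.

```python
def encode_labels(labels):
    """Map string labels to sorted integer ids and to one-hot rows."""
    classes = sorted(set(labels))  # alphabetical order, as LabelEncoder uses
    index = {label: i for i, label in enumerate(classes)}
    ids = [index[label] for label in labels]
    one_hot = [
        [1.0 if column == class_id else 0.0 for column in range(len(classes))]
        for class_id in ids
    ]
    return classes, ids, one_hot


classes, ids, one_hot = encode_labels(["neutral", "positive", "negative", "neutral"])
print(classes)  # ['negative', 'neutral', 'positive']
print(ids)      # [1, 2, 0, 1]
```

Keeping `classes` around makes it easy to turn a predicted column index back into a label name.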


### Dense Feed-Forward Model (really simple)

In [21]:
# Pad the sequence to the same length
# train_vecs_w2v = sequence.pad_sequences(train_vecs_w2v, maxlen=vec_dim)
# test_vecs_w2v = sequence.pad_sequences(test_vecs_w2v, maxlen=vec_dim)

# Build Keras Model
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=vec_dim))
model.add(Dense(3, activation='softmax'))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_vecs_w2v, y_train, epochs=9, batch_size=32, verbose=2)

Epoch 1/9
 - 0s - loss: 1.0253 - acc: 0.4962
Epoch 2/9
 - 0s - loss: 0.9917 - acc: 0.5155
Epoch 3/9
 - 0s - loss: 0.9893 - acc: 0.5155
Epoch 4/9
 - 0s - loss: 0.9880 - acc: 0.5149
Epoch 5/9
 - 0s - loss: 0.9872 - acc: 0.5146
Epoch 6/9
 - 0s - loss: 0.9861 - acc: 0.5146
Epoch 7/9
 - 0s - loss: 0.9843 - acc: 0.5129
Epoch 8/9
 - 0s - loss: 0.9849 - acc: 0.5178
Epoch 9/9
 - 0s - loss: 0.9847 - acc: 0.5151


<keras.callbacks.History at 0x11a8c4438>

In [22]:
score = model.evaluate(test_vecs_w2v, y_test, batch_size=128, verbose=2)
print("Accuracy: %.2f%%" % (score[1]*100))

Accuracy: 50.72%
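
Training accuracy sits near 0.515 from the second epoch onward, which could mean the network is mostly predicting a single class. One way to check is to compare the test accuracy against a majority-class baseline. The sketch below is a minimal version of that check. `majority_baseline` is a hypothetical helper, and it expects the original string labels (before `LabelEncoder` and `to_categorical` are applied), not the one-hot arrays.

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


# Toy labels, invented for illustration only.
label, accuracy = majority_baseline(
    ["neutral", "neutral", "positive"],
    ["neutral", "negative", "neutral", "positive"],
)
print(label, accuracy)  # neutral 0.5
```

If the baseline on the real split lands close to the model's test accuracy, the averaged word vectors are adding little beyond the class prior.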


In [16]:
# # LSTM for sequence classification

# from keras.models import Sequential
# from keras.layers import Dense
# from keras.layers import LSTM, Conv1D, Flatten, Dropout
# from keras.layers.embeddings import Embedding
# from keras.preprocessing import sequence
# from keras.callbacks import TensorBoard

# # # Using keras to load the dataset with the top_words
# # top_words = 10000


# # # Pad the sequence to the same length
# max_review_length = vec_dim
# X_train = sequence.pad_sequences(train_vecs_w2v, maxlen=max_review_length)
# X_test = sequence.pad_sequences(test_vecs_w2v, maxlen=max_review_length)

# # Using embedding from Keras
# # embedding_vecor_length = 300
# model = Sequential()
# # model.add(Embedding(top_words, embedding_vecor_length, input_length=max_review_length))


# # Convolutional model (3x conv, flatten, 2x dense)
# model.add(Conv1D(64, 3, padding='same', input_shape=(None, vec_dim)))
# model.add(Conv1D(32, 3, padding='same'))
# model.add(Conv1D(16, 3, padding='same'))
# model.add(Flatten())
# model.add(Dropout(0.2))
# model.add(Dense(180,activation='relu'))
# model.add(Dropout(0.2))
# model.add(Dense(3,activation='softmax')) # one output unit per class (negative/neutral/positive)

# # Log to tensorboard
# tensorBoardCallback = TensorBoard(log_dir='./logs', write_graph=True)
# model.compile(loss='categorical_crossentropy', optimizer='adagrad', metrics=['accuracy'])

# model.fit(train_vecs_w2v, y_train, nb_epoch=3, callbacks=[tensorBoardCallback], batch_size=64)

# # Evaluation on the test set
# scores = model.evaluate(test_vecs_w2v, y_test, verbose=0)
# print("Accuracy: %.2f%%" % (scores[1]*100))

### Applied to Reddit

In [23]:
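# Function to Predict Positive/Neutral/Negative

def prediction(text):
    # Order matches LabelEncoder's alphabetical classes: 0=negative, 1=neutral, 2=positive
    sentiment = ["Negative", "Neutral", "Positive"]
    # Whitespace tokenization. The training data was tokenized with TweetTokenizer,
    # so tokens with attached punctuation (e.g. "all,") will not match the vocabulary.
    text = text.split()
    text = buildWordVector(text, vec_dim) # TF-IDF-weighted average vector, shape (1, vec_dim)
    text = scaler.transform(text) # apply the scaling fitted on the training vectors
    predic = model.predict(text, batch_size=32)
    result = sentiment[predic.argmax(axis=1)[0]] # most probable class
    return result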

In [27]:
import praw

#this is a read-only instance
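# Credentials are read from environment variables instead of being hard-coded.
# REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are placeholder variable names.
reddit = praw.Reddit(user_agent='first_scrape (by /u/dswald)',
                     client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'])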

subreddit = reddit.subreddit('politics')
hot_python = subreddit.hot(limit = 3) #need to view >2 to get past promoted posts

for submission in hot_python:
    if not submission.stickied: #top 2 are promoted posts, labeled as 'stickied'
        print('Title: {}, ups: {}, downs: {}, Have we visited: {}'.format(submission.title,
                                                                          submission.ups,
                                                                          submission.downs,
                                                                          submission.visited))
        comments = submission.comments.list() #unstructured
        for comment in comments:
            print (20*'-')
            print ('Parent ID:', comment.parent())
            print ('Comment ID:', comment.id)
            print (comment.body)
            print("#"*10,'PREDICTED SENTIMENT:', prediction(comment.body),"#"*10)

Title: More than 400 millionaires tell Congress: Don’t cut our taxes, ups: 9758, downs: 0, Have we visited: False
--------------------
Parent ID: 7ck1f6
Comment ID: dpqhmjh

As a reminder, this subreddit [is for civil discussion.](/r/politics/wiki/index#wiki_be_civil)

In general, be courteous to others. Attack ideas, not users. Personal insults, shill or troll accusations, hate speech, and other incivility violations can result in a permanent ban. 

If you see comments in violation of our rules, please report them.

***


*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/politics) if you have any questions or concerns.*
########## PREDICTED SENTIMENT: Positive ##########
--------------------
Parent ID: 7ck1f6
Comment ID: dpqhqap
But four **billionaires** (who keep the GOP in power) say: Do.
########## PREDICTED SENTIMENT: Positive ##########
--------------------
Parent ID: 7ck1f6
Comment ID: dpqhor8
And 

AttributeError: 'MoreComments' object has no attribute 'parent'
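
The traceback above indicates that `submission.comments.list()` can return `MoreComments` placeholders alongside real comments, and those placeholders do not have the attributes the loop relies on. A minimal way around this, sketched below, is to skip anything without a `body` attribute. `comments_with_body` is a hypothetical helper written with duck typing so it does not depend on PRAW itself. PRAW also provides `submission.comments.replace_more(limit=0)`, which removes the placeholders before `.list()` is called.

```python
def comments_with_body(comments):
    """Yield only the items that look like real comments (they have a body)."""
    for item in comments:
        if hasattr(item, "body"):
            yield item


# Stand-in classes for illustration; these are not PRAW's real classes.
class FakeComment:
    def __init__(self, body):
        self.body = body


class FakeMoreComments:
    pass


mixed = [FakeComment("first"), FakeMoreComments(), FakeComment("second")]
bodies = [comment.body for comment in comments_with_body(mixed)]
print(bodies)  # ['first', 'second']
```

In the scraping loop this would read `for comment in comments_with_body(submission.comments.list()):`, leaving the rest of the loop body unchanged.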

It seems that everything is predicted as positive or neutral. Is this because of the unigram model, training data that is too specific, or something else? This is food for thought. Committing now.