## Tweet sentiment analysis

In this section we will see how to extract features from tweets and use a classifier to classify the tweet as positive or negative.

We will use pandas DataFrames (http://pandas.pydata.org/) to store and process the tweets.
A pandas DataFrame is a powerful Python data structure, similar to an Excel spreadsheet that can be manipulated with Python code.


In [None]:
# Let's create a DataFrame with each tweet using pandas
import pandas as pd
import json
import numpy as np


def getTweetID(tweet):
    """ If properly included, get the ID of the tweet """
    return tweet.get('id')
    
def getUserIDandScreenName(tweet):
    """ If properly included, get the tweet 
        user ID and Screen Name """
    user = tweet.get('user')
    if user is not None:
        uid = user.get('id')
        screen_name = user.get('screen_name')
        return uid, screen_name
    else:
        return (None, None)
    

    
filename = 'tweets.txt'

# create a list of dictionaries with the data that interests us
tweet_data_list = []
with open(filename, 'r') as fopen:
    # each line corresponds to a tweet
    for line in fopen:
        if line != '\n':
            tweet = json.loads(line.strip('\n'))
            tweet_id = getTweetID(tweet)
            user_id = getUserIDandScreenName(tweet)[0]
            text = tweet.get('text')
            if tweet_id is not None:
                tweet_data_list.append({'tweet_id' : tweet_id,
                           'user_id' : user_id,
                           'text' : text})

# put everything in a DataFrame (one row per dictionary of the list)
tweet_df = pd.DataFrame(tweet_data_list)



In [None]:
print(tweet_df.shape)
print(tweet_df.columns)

# print the first 5 elements of one of the columns
print(tweet_df.text.iloc[:5])
# or
print(tweet_df['text'].iloc[:5])


In [None]:
#show the first 10 rows
tweet_df.head(10)

### Extracting features from the tweets

#### 1) Tokenize the tweet into a list of words

This part uses concepts from [Natural Language Processing](https://en.wikipedia.org/wiki/Natural_language_processing).
We will use a tweet tokenizer I built based on TweetTokenizer from NLTK (http://www.nltk.org/).
You can see how it works by opening the file TwSentiment.py. The goal is to process any tweet and extract a list of words, taking into account usernames, hashtags, URLs, emoticons and all the informal text we can find in tweets. We also want to reduce the number of features by applying transformations such as converting all words to lower case.
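
To make these transformations concrete, here is a minimal sketch of the kind of normalization described above, written with regular expressions only. It is not the implementation found in TwSentiment.py: the function name `simple_tweet_tokenize` and the regular expressions are illustrative assumptions, and a real tweet tokenizer handles many more cases.

```python
import re

# Hypothetical patterns, much simpler than a real tweet tokenizer
URL_RE = re.compile(r'https?://\S+')
USER_RE = re.compile(r'@\w+')
REPEAT_RE = re.compile(r'(.)\1{3,}')  # a character repeated four times or more
TOKEN_RE = re.compile(r"[:;=][-o]?[()DPp]|[@#]?\w+|[^\w\s]")

def simple_tweet_tokenize(text):
    """Illustrative normalization: URLs, mentions, long repetitions, case."""
    text = URL_RE.sub('URL', text)
    text = USER_RE.sub('@USER', text)
    # keep at most three repetitions of the same character
    text = REPEAT_RE.sub(r'\1\1\1', text)
    tokens = []
    for tok in TOKEN_RE.findall(text):
        # keep the placeholders and the words written entirely in upper case,
        # convert everything else to lower case
        if tok in ('URL', '@USER') or (tok.isupper() and len(tok) > 1):
            tokens.append(tok)
        else:
            tokens.append(tok.lower())
    return tokens

print(simple_tweet_tokenize('Hey @bob! This is SO cooooool :) http://example.com'))
# ['hey', '@USER', '!', 'this', 'is', 'SO', 'coool', ':)', 'URL']
```

Replacing every mention and every URL by a single placeholder means that thousands of distinct strings collapse into two features, which is one way of reducing the number of features.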

In [None]:
from TwSentiment import CustomTweetTokenizer

In [None]:
tokenizer = CustomTweetTokenizer(preserve_case=False, # False: convert words to lower case
                                 reduce_len=True, # reduce repetitions of a letter to a maximum of three
                                 strip_handles=False, # False: keep usernames (@mentions)
                                 normalize_usernames=True, # replace all mentions with "@USER"
                                 normalize_urls=True, # replace all urls with "URL"
                                 keep_allupper=True) # keep upper case for words that are entirely in upper case

In [None]:
# example
tweet_df.text.iloc[0]

In [None]:
tokenizer.tokenize(tweet_df.text.iloc[0])

In [None]:
# other examples
tokenizer.tokenize('Hey! This is SO cooooooooooooooooool! :)')

In [None]:
tokenizer.tokenize('Hey! This is so cooooooool! :)')

#### 2) Define the features that will represent the tweet
We will use the occurrence of words and pairs of words (bigrams) as features.

This corresponds to a bag-of-words representation (https://en.wikipedia.org/wiki/Bag-of-words_model): we just count each word (or [n-gram](https://en.wikipedia.org/wiki/N-gram)) without taking its position in the text into account. For document classification, the frequency of occurrence of each word is usually taken as a feature. Tweets are so short that we can simply record whether each word is present.

Using pairs of words captures some of the context in which each word appears, which helps to capture the correct meaning of words.
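
As an illustration, a feature extractor of this kind could look like the sketch below. The function name `sketch_bag_of_words_and_bigrams` is hypothetical, and the output format (a dictionary whose keys are words or tuples of two words, all mapped to `True`) is an assumption based on how features are looked up in the rest of this section; the function actually used here lives in TwSentiment.py and may differ.

```python
def sketch_bag_of_words_and_bigrams(tokens):
    """Return a dict mapping each word and each pair of consecutive words to True."""
    features = {word: True for word in tokens}
    # zip(tokens, tokens[1:]) yields every pair of consecutive tokens
    for bigram in zip(tokens, tokens[1:]):
        features[bigram] = True
    return features

features = sketch_bag_of_words_and_bigrams(['i', 'am', 'not', 'sad'])
print(features)
# {'i': True, 'am': True, 'not': True, 'sad': True,
#  ('i', 'am'): True, ('am', 'not'): True, ('not', 'sad'): True}
```

With bigrams, the pair `('not', 'sad')` becomes a feature of its own, distinct from the single word `'sad'`.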

In [None]:
from TwSentiment import bag_of_words_and_bigrams

# this will return a dictionary of features,
# we just list the features present in this tweet
bag_of_words_and_bigrams(tokenizer.tokenize(tweet_df.text.iloc[0]))

#### Download the logistic regression classifier

https://www.dropbox.com/s/09rw6a85f7ezk31/sklearn_SGDLogReg_.pickle.zip?dl=1

I trained this classifier on this dataset: http://help.sentiment140.com/for-students/, following the approach from this paper: http://cs.stanford.edu/people/alecmgo/papers/TwitterDistantSupervision09.pdf

This is a set of 14 million tweets with emoticons. Tweets containing "sad" emoticons (7 million) are considered negative and tweets with "happy" emoticons (7 million) are considered positive.

I used a logistic regression classifier with L2 regularization, which I tuned with 10-fold cross-validation using the $F_1$ score as the metric.
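
For reference, a standard way of writing these two ingredients is given below. The notation is introduced here for illustration only ($x_i$ is the feature vector of tweet $i$, $y_i \in \{-1, +1\}$ its label, $w$ the weights, $b$ the intercept and $\lambda$ the regularization strength); the exact objective and hyper-parameters used for the downloaded classifier may differ.

$$
\min_{w,\, b} \; \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i \left(w^\top x_i + b\right)}\right) + \frac{\lambda}{2} \lVert w \rVert_2^2
$$

$$
F_1 = 2 \, \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
$$

The first term measures how well the model fits the training labels, and the second term penalizes large weights, which limits over-fitting when the number of features is large.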


In [None]:
# the classifier is saved in a "pickle" file
import pickle

with open('sklearn_SGDLogReg_.pickle', 'rb') as fopen:
    classifier_dict = pickle.load(fopen)



In [None]:
# classifier_dict contains the classifier and the label mappers
# that I added so that we remember how the classes are 
# encoded
classifier_dict

The classifier is in fact contained in a pipeline.
A sklearn pipeline chains several transformations of your data, followed by a final estimator (http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

In [None]:
pipline = classifier_dict['sklearn_pipeline']

In our case we have two steps: 

- Vectorize the textual features (using http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html)
- Classify the vectorized features (using http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html)

In [None]:
pipline.steps

In [None]:
# this is the step that will transform a list of textual features into a vector of zeros and ones
dict_vect = pipline.steps[0][1]

In [None]:
dict_vect.feature_names_

In [None]:
# number of features
len(dict_vect.feature_names_)

In [None]:
# a little example
text = 'Hi all, I am very happy today'
# first tokenize
tokens = tokenizer.tokenize(text)
print(tokens)

# list features
features = bag_of_words_and_bigrams(tokens)
print(features)

# vectorize features
X = dict_vect.transform(features)

print(X.shape)

In [None]:
# X is a SciPy sparse matrix: because it is extremely sparse,
# only its non-zero entries are stored, which takes less space in memory.
# If we want to see it fully, we can use .toarray()

# number of non-zero values in X (every non-zero value is equal to one):
X.toarray().sum()


The mapping between the list of features and the vector of zeros and ones is done when you train the pipeline with its `.fit` method.
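
The idea behind this mapping can be sketched in a few lines of plain Python. The class `SketchDictVectorizer` below is a toy stand-in written for this explanation, not the scikit-learn implementation: it returns a plain list instead of a sparse matrix and only handles features whose value is `True`.

```python
class SketchDictVectorizer:
    """Toy stand-in for a dict vectorizer (not the scikit-learn implementation)."""

    def fit(self, feature_dicts):
        # assign one column index to every feature name seen during training
        self.vocabulary_ = {}
        for feats in feature_dicts:
            for name in feats:
                self.vocabulary_.setdefault(name, len(self.vocabulary_))
        return self

    def transform(self, feats):
        # features that were never seen during fit are ignored
        row = [0] * len(self.vocabulary_)
        for name in feats:
            if name in self.vocabulary_:
                row[self.vocabulary_[name]] = 1
        return row

vect = SketchDictVectorizer().fit([
    {'happy': True, 'today': True, ('happy', 'today'): True},
    {'sad': True, 'today': True, ('sad', 'today'): True},
])
print(vect.vocabulary_)
print(vect.transform({'sad': True, 'monday': True}))  # [0, 0, 0, 1, 0]
```

In this sketch the word `'monday'` has no column because it was absent from the training data, so it does not contribute to the vector.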

### Classifying the tweet
Now that we have a vector representing the presence of features in a tweet, we can apply our logistic regression classifier to compute the probability that the tweet belongs to the "sad" or "happy" category.
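
As a reminder of what a logistic regression computes, the sketch below turns a feature vector into two probabilities. The function `sketch_predict_proba`, the three features and their weights are all made up for illustration; they are not taken from the downloaded classifier.

```python
import math

def sketch_predict_proba(x, weights, intercept):
    """Return [P(negative), P(positive)] for a binary feature vector x."""
    # weighted sum of the features that are present, plus the intercept
    score = intercept + sum(w * xi for w, xi in zip(weights, x))
    # the logistic (sigmoid) function maps the score to a probability
    p_pos = 1.0 / (1.0 + math.exp(-score))
    return [1.0 - p_pos, p_pos]

# made-up weights for three features: 'happy', 'sad', ('not', 'sad')
weights = [1.5, -2.0, 0.8]
intercept = 0.1
x = [0, 1, 1]  # the tweet contains 'sad' and ('not', 'sad')
probs = sketch_predict_proba(x, weights, intercept)
print(probs)
```

A positive weight pushes the probability toward the "happy" class and a negative weight toward the "sad" class. With these made-up numbers the negative weight of `'sad'` outweighs the positive weight of `('not', 'sad')`, so the "happy" probability stays below 0.5.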

In [None]:
classifier = pipline.steps[1][1]

In [None]:
classifier

In [None]:
# access the weights of the logistic regression
classifier.coef_

In [None]:
# we have as many weights as features
classifier.coef_.shape

In [None]:
# plus the intercept
classifier.intercept_

In [None]:
# let's check the weight associated with a given feature
x = dict_vect.transform({('sad'): True})
_, ind = np.where(x.todense())
print(classifier.coef_[0,ind])

In [None]:
x = dict_vect.transform({('good'): True})
_, ind = np.where(x.todense())
print(classifier.coef_[0,ind])

In [None]:
x = dict_vect.transform({('not', 'sad'): True})
_, ind = np.where(x.todense())
print(classifier.coef_[0,ind])

In [None]:
# find the probability for a specific tweet
classifier.predict_proba(X)

We can use the sklearn pipeline to group the last two steps:

In [None]:
pipline.predict_proba(features)

We see two numbers: the first one is the probability of the tweet being sad, the second one is the probability of the tweet being happy.

In [None]:
# note that:
pipline.predict_proba(features).sum()

### Putting it all together:

We will use the class `TweetClassifier` from TwSentiment.py that puts together this process for us:

In [None]:
from TwSentiment import TweetClassifier

In [None]:
twClassifier = TweetClassifier(pipline,
                              tokenizer=tokenizer,
                              feature_extractor=bag_of_words_and_bigrams)

In [None]:
# example
text = 'Hi all, I am very happy today'
print(twClassifier.classify_text(text))

In [None]:
# the classify_text method also accepts a list of texts as input
print(twClassifier.classify_text(['great day today!', "bad day today..."]))

In [None]:
# you'll see that if the sentence becomes more complicated, 
# the classifier is not as accurate
print(twClassifier.classify_text(["I am not sad"]))

In [None]:
print(twClassifier.classify_text(["I am not bad"]))

### We can now classify our tweets:

In [None]:
emo_clas, prob = twClassifier.classify_text(tweet_df.text.tolist())


In [None]:
# add the result to the dataframe
tweet_df['pos_class'] = (emo_clas == 'pos')
tweet_df['pos_prob'] = prob[:,1]

In [None]:
tweet_df.head()

In [None]:
# plot the distribution of probability
import matplotlib.pyplot as plt
%matplotlib inline
h = plt.hist(tweet_df.pos_prob, bins=50)


We want to classify users based on the class of their tweets.
Pandas makes it easy to group tweets by user with the [groupby](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) method of DataFrames:
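
As a small illustration of what a group-by computes, the sketch below uses a toy DataFrame with made-up values standing in for `tweet_df`. It takes the mean of the boolean `pos_class` column within each group, which gives the fraction of positive tweets of each user.

```python
import pandas as pd

# toy data standing in for tweet_df (values are made up)
toy_df = pd.DataFrame({'user_id': [1, 1, 1, 2, 2, 3],
                       'pos_class': [True, True, False, False, False, True]})

# fraction of positive tweets per user: the mean of a boolean column
pos_fraction = toy_df.groupby('user_id')['pos_class'].mean()
print(pos_fraction)
```

Here user 1 has two positive tweets out of three, user 2 has none and user 3 has only positive tweets. Comparing this fraction to 0.5 is one way of assigning a class to each user.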

In [None]:
user_group = tweet_df.groupby('user_id')

In [None]:
print(type(user_group))

In [None]:
# let's look at one of the groups
groups = user_group.groups
uid = list(groups.keys())[5]
user_group.get_group(uid)

In [None]:
# we need a function that takes the DataFrame of the tweets of one user
# (one group) and returns the class of that user
def get_user_emo(group):
    num_pos = group.pos_class.sum()
    num_tweets = group.pos_class.size
    if num_pos/num_tweets > 0.5:
        return 'pos'
    elif num_pos/num_tweets < 0.5:
        return 'neg'
    else:
        return 'NA'

In [None]:
# apply the function to each group
user_df = user_group.apply(get_user_emo)

In [None]:
# This is a pandas Series whose index is the user_id
user_df.head(10)

### Let's add this information to the graph we created earlier

In [None]:
import networkx as nx

G = nx.read_graphml('twitter_lcc.graphml', node_type=int)

# G.nodes, G.edges and G[u][v] are used here because the networkx 1.x accessors
# (nodes_iter, edges_iter, G.node, G.edge) are no longer available in current releases
for n in G.nodes():
    if n in user_df.index:
        # look up the value of the user_df Series whose index is
        # equal to the user_id of the node
        G.nodes[n]['emotion'] = user_df.loc[n]
        
# we can also add the emotion associated with tweets to the edges of the graph
for u, v, tweet_id in G.edges(data='tweet_id'):
    match = tweet_df.loc[tweet_df.tweet_id == tweet_id]
    if not match.empty:
        G[u][v]['pos_class'] = int(match.pos_class.values[0])
        G[u][v]['pos_prob'] = float(match.pos_prob.values[0])

In [None]:
# we have added an attribute 'emotion' to the nodes
G.nodes[n]

In [None]:
G[u][v]

In [None]:
# save the graph to open it with Gephi
nx.write_graphml(G, 'twitter_lcc_emo.graphml')

We can now open this file with [Gephi](https://gephi.org/) to visualize it.

Here is an example where the size of the nodes is proportional to their in-degree, their color indicates their out-degree (from white to dark green) and the color of the edges indicates the probability of the tweet carrying a "happy" sentiment (blue = sad, orange = happy).

<img src="emo_network.png" style="width: 1024px;"/>

A very incomplete list of references to go further:

- Perkins, J. Python 3 Text Processing with NLTK 3 Cookbook (2014).
- Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning (Springer New York, 2009).
- Serrano-Guerrero, J., Olivas, J. A., Romero, F. P. & Herrera-Viedma, E. Sentiment analysis: A review and comparative analysis of web services. Inf. Sci. 311, 18–38 (2015).
- Go, A., Bhayani, R. & Huang, L. Twitter Sentiment Classification using Distant Supervision. Tech. Rep. 150, 1–6 (2009).
- O’Connor, B., Balasubramanyan, R., Routledge, B. R. & Smith, N. A. From tweets to polls: Linking text sentiment to public opinion time series. Proc. 4th Int. AAAI Conf. Weblogs Soc. Media 122–129 (2010).
- Hannak, A., Anderson, E., Barrett, L. F., Lehmann, S., Mislove, A. & Riedewald, M. Tweetin’ in the Rain: Exploring societal-scale effects of weather on mood. in Proc. of the 6th International AAAI Conference on Weblogs and Social Media 479–482 (2012).
- Jungherr, A., Schoen, H., Posegga, O. & Jürgens, P. Digital Trace Data in the Study of Public Opinion: An Indicator of Attention Toward Politics Rather Than Political Support. Soc. Sci. Comput. Rev. 894439316631043 (2016).
- Gayo-Avello, D. A Meta-Analysis of State-of-the-Art Electoral Prediction From Twitter Data. Soc. Sci. Comput. Rev. 31, 649–679 (2013).
- Ceron, A., Curini, L. & Iacus, S. M. ISA: A fast, scalable and accurate algorithm for sentiment analysis of social media content. Inf. Sci. 367–368, 105–124 (2016).
- Bohannon, J. The pulse of the people. Science 355, 470–472 (2017).
- Bovet, A., Morone, F. & Makse, H. A. Validation of Twitter opinion trends with national polling aggregates: Hillary Clinton vs Donald Trump. arXiv:1610.01587 (2017).