# Analyzing Sentiment

What is sentiment analysis? Given some text, we try to determine whether the writer's sentiment was positive, negative, or neutral. This is fairly easy for humans, but we want to automate it so that we can tell, in real time, whether something (e.g., our restaurant, movie, book, or other product) is being talked about on social media and, more importantly, what people think of it. Knowing whether a tweet or post is positive or negative helps us see how our product is being received and how we can improve it.

Example: sentiment about election candidates in Belgium: http://www.clips.ua.ac.be/pages/pattern-examples-elections

 * some data: http://help.sentiment140.com/for-students

**Columns:**

    0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
    1 - the id of the tweet (2087)
    2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
    3 - the query (lyx). If there is no query, then this value is NO_QUERY.
    4 - the user that tweeted (robotickilldozr)
    5 - the text of the tweet (Lyx is cool)

In [6]:
import pandas as pd
import numpy as np

### Load data, clean up

In [7]:
cols = ['polarity','id', 'date', 'query', 'user', 'tweet']

data = pd.read_csv('data/sentiment.train.csv',names=cols, encoding='ISO-8859-1')
print('length of data {}'.format(len(data)))
test = pd.read_csv('data/sentiment.test.csv',names=cols, encoding='ISO-8859-1')
print('length of test {}'.format(len(test)))
data = pd.concat([data,test])
print('length of both {}'.format(len(data)))

length of data 1600000
length of test 498
length of both 1600498


In [8]:
data = data.sample(frac=0.005, random_state=200) # this is a lot of data; while we develop, let's only use 0.5% of it
data = data.drop(['id', 'date', 'query', 'user'], axis=1)
data = data[data.polarity != 2]

In [9]:
data.shape

(8001, 2)

In [10]:
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer(strip_handles=True, reduce_len=True) 
data['split_tweet'] = data.tweet.map(lambda x: tknzr.tokenize(x))

In [11]:
data[:2]

Unnamed: 0,polarity,tweet,split_tweet
888312,4,Breaky burrito at Whole Foods is a good way to...,"[Breaky, burrito, at, Whole, Foods, is, a, goo..."
516573,0,i'm out! gonna check my facebook. please!!!!!!...,"[i'm, out, !, gonna, check, my, facebook, ., p..."


In [12]:
from collections import Counter

Counter(data.polarity) # the counts should be about the same for 0 and 4

Counter({4: 3960, 0: 4041})

### Baseline Test using Naive Bayes and word counts as Features

In [13]:
import nltk
from nltk.classify.naivebayes import NaiveBayesClassifier


data['polarity'] = data.polarity.map(lambda x: 'neg' if x == 0 else 'pos')
data['feats'] = data['split_tweet'].map(lambda x: Counter(x))

dev=data.sample(frac=0.1,random_state=200)
train = data.drop(dev.index)

train_data = list(zip(train['feats'], train['polarity']))
dev_data = list(zip(dev['feats'], dev['polarity']))

classifier = NaiveBayesClassifier.train(train_data)

nltk.classify.util.accuracy(classifier, dev_data)

0.71125

In [14]:
dev_len = len(dev) # number of tweets in the dev set

## How can we represent words as numerical/continuous features?

### 1.) Try a LabelEncoder (where ['A','B','C'] = [1,2,3])

In [15]:
data['polarity'] = data.polarity.map(lambda x: 0 if x == 'neg' else 1)
data['id'] = data.index

In [16]:
words = list(set([x for line in data.split_tweet for x in line]))
ndata = data.copy()

In [17]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(words) 

LabelEncoder()

In [18]:
le.transform(['the'])[0]

13976

In [None]:
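# calling le.transform once per word is very slow, so look the ids up in a dict built from le.classes_
word_to_id = {word: i for i, word in enumerate(le.classes_)}

data['le'] = data.split_tweet.map(lambda x: [word_to_id[i] for i in x])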

In [None]:
data[:5]

In [None]:
dev=data.sample(frac=0.1,random_state=200)
train=data.drop(dev.index)

train.shape, dev.shape, Counter(train.polarity)

In [None]:
neg = train[train.polarity == 0]
pos = train[train.polarity == 1]

In [None]:
neg.shape, pos.shape

In [None]:
neg[:2]

In [None]:
max_tweet = max([len(v) for v in data.split_tweet])

max_tweet

In [None]:
neg = list(neg['le'].map(lambda v: np.pad(np.array(v), (0, max_tweet-len(v)), 'constant')))
pos = list(pos['le'].map(lambda v: np.pad(np.array(v), (0, max_tweet-len(v)), 'constant')))
labels = len(neg) * [0] + len(pos) * [1]

In [None]:
train_data = neg + pos

In [None]:
from sklearn import linear_model

regr = linear_model.LogisticRegression(penalty='l2')

logres = regr.fit(train_data, labels)

In [None]:
logres.coef_

In [None]:
neg = dev[dev.polarity == 0]
pos = dev[dev.polarity == 1]

In [None]:
neg_dev = list(neg['le'].map(lambda v: np.pad(np.array(v), (0, max_tweet-len(v)), 'constant')))
pos_dev = list(pos['le'].map(lambda v: np.pad(np.array(v), (0, max_tweet-len(v)), 'constant')))

In [None]:
neg_guess = [logres.predict(v.reshape(1, -1)) for v in neg_dev]
pos_guess = [logres.predict(v.reshape(1, -1)) for v in pos_dev]

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(neg_guess+pos_guess, len(neg_guess)*[0] + len(pos_guess) * [1])

#### What went wrong? It's about the same as a random baseline!

Answer: Semantics. 

The classifier draws its decision boundary over the numbers that stand in for the words, so it treats the distance between two words (thinking of the numbers as points in some n-dimensional space) as meaningful. But the numbers assigned to the words are completely arbitrary. Words with similar meanings should be grouped close together, and in this case they aren't.
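
To make this concrete, here is a minimal sketch using a hypothetical five-word vocabulary (invented for illustration, not taken from the tweet data). It mimics the way `LabelEncoder` hands out integers in sorted order and shows that the resulting numeric gaps say nothing about meaning.

```python
# Hypothetical toy vocabulary. LabelEncoder assigns ids in sorted order,
# so sorting the list reproduces the ids it would hand out.
vocab = sorted(["awful", "bad", "good", "great", "terrible"])
word_to_id = {word: i for i, word in enumerate(vocab)}

def id_distance(a, b):
    """Absolute difference between the integer ids of two words."""
    return abs(word_to_id[a] - word_to_id[b])

# Opposites end up adjacent, while near-synonyms end up far apart.
print(word_to_id)
print("bad vs good:", id_distance("bad", "good"))
print("awful vs terrible:", id_distance("awful", "terrible"))
```

Here "bad" and "good" are neighbours on the number line while "awful" and "terrible" sit at opposite ends, which is the opposite of what a distance-based model would need.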

### 2.) Try a One-hot encoder (where ['A','B','C'] = [[1,0,0],[0,1,0],[0,0,1]])

In [None]:
data.drop(['le'], inplace=True, axis=1) # we just proved the uselessness of this column

In [None]:
from collections import Counter

dev=data.sample(frac=0.1,random_state=200)
train=data.drop(dev.index)

train.shape, dev.shape, Counter(train.polarity)

In [None]:
s = data.split_tweet.apply(lambda x: pd.Series(x)).stack().reset_index(level=1,drop=True)
s.name = 'word'
data = data.drop('split_tweet', axis=1).join(s)
data.dropna(inplace=True)

In [None]:
data[:5]

In [None]:
# this takes a very long time
data = pd.get_dummies(data=data, columns=['word'])

In [None]:
#data.replace([np.inf, -np.inf], np.nan, inplace=True)
#data.dropna(inplace=True)

In [None]:
train = data[data.id.isin(train.id)]
neg = train[train.polarity == 0].groupby('id').sum()
pos = train[train.polarity == 1].groupby('id').sum()

In [None]:
labels = len(neg) * [0] + len(pos) * [1]

len(labels)

In [None]:
neg.drop(['polarity'], axis=1, inplace=True)
pos.drop(['polarity'], axis=1, inplace=True)


train_data = pd.concat((neg,pos))

In [None]:

train_data = train_data.to_numpy()

In [None]:
train_data.shape

In [None]:
regr = linear_model.LogisticRegression(penalty='l2')

logres = regr.fit(train_data, labels)

In [None]:
dev = data[data.id.isin(dev.id)]
neg_dev = dev[dev.polarity == 0].groupby('id').sum()
pos_dev = dev[dev.polarity == 1].groupby('id').sum()

In [None]:
neg_dev.drop(['polarity'], axis=1, inplace=True)
pos_dev.drop(['polarity'], axis=1, inplace=True)

In [None]:
labels = len(neg_dev) * [0] + len(pos_dev) * [1]

len(labels)

In [None]:
dev_data = pd.concat((neg_dev,pos_dev))
dev_data = dev_data.to_numpy()

dev_data.shape

In [None]:
guess = [logres.predict(np.array(v).reshape(1, -1)) for v in dev_data]

In [None]:
accuracy_score(guess, labels)

### Better than chance, but still not very useful. 

Why does the NaiveBayesClassifier do better?

* Answer: Because representing the words as meaningless vectors (even if they are all equidistant) instead of string symbols loses some of the semantic information.
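
The "equidistant" remark can be checked with a small sketch. The three-word vocabulary and the helper names (`one_hot`, `euclidean`, `tweet_vector`) are hypothetical and exist only for this illustration. It shows that every pair of distinct one-hot vectors is the same distance apart, and that summing a tweet's one-hot rows, as the `groupby('id').sum()` step does, gives per-word counts.

```python
import math

vocab = ["bad", "good", "great"]  # hypothetical toy vocabulary

def one_hot(word):
    """Return a list with a 1 at the word's position and 0 elsewhere."""
    return [1 if word == v else 0 for v in vocab]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def tweet_vector(tokens):
    """Sum the one-hot rows of a tweet, giving a count per vocabulary word."""
    rows = [one_hot(t) for t in tokens]
    return [sum(col) for col in zip(*rows)]

# "good" is exactly as far from "great" as it is from "bad".
d_good_great = euclidean(one_hot("good"), one_hot("great"))
d_good_bad = euclidean(one_hot("good"), one_hot("bad"))
counts = tweet_vector(["good", "good", "bad"])
print(d_good_great, d_good_bad, counts)
```

Because all distinct words are equally far apart, the encoding carries no notion of similarity between words; it only records which words occurred and how often.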

### 3.) Try word2vec model

In [None]:
from gensim.models import KeyedVectors

# pretrained vectors are loaded through KeyedVectors in gensim 1.0 and later
w2v = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)

In [None]:
veclen = len(w2v['red'])

veclen

In [None]:
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer(strip_handles=True, reduce_len=True) 
data['split_tweet'] = data.tweet.map(lambda x: tknzr.tokenize(x.lower()))

In [None]:
data['w2v'] = data.split_tweet.map(lambda v: np.sum(np.array([np.array(w2v[x]) for x in v if x in w2v]).T, axis=-1))
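
The cell above turns each tweet into a single vector by summing the word2vec vectors of its in-vocabulary tokens. The sketch below illustrates that idea with hypothetical 3-dimensional embeddings whose values are invented for this example (the real GoogleNews vectors have 300 dimensions). One assumed difference: for a tweet with no known words this sketch returns a zero vector, whereas the notebook line yields a scalar that is filtered out later.

```python
import math

# Hypothetical toy embeddings, invented for illustration only.
toy_w2v = {
    "good":  [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "bad":   [-0.9, 0.1, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def tweet_vector(tokens, vectors, dim=3):
    """Sum the vectors of in-vocabulary tokens; unknown tokens are skipped."""
    total = [0.0] * dim
    for token in tokens:
        if token in vectors:
            total = [a + b for a, b in zip(total, vectors[token])]
    return total

# Unlike one-hot vectors, similar words now point in similar directions.
sim_good_great = cosine(toy_w2v["good"], toy_w2v["great"])
sim_good_bad = cosine(toy_w2v["good"], toy_w2v["bad"])
summed = tweet_vector(["good", "great", "zzz"], toy_w2v)
print(sim_good_great, sim_good_bad, summed)
```

With embeddings like these, a linear classifier can place a boundary that separates regions of related words, which neither the integer ids nor the one-hot vectors allow.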

In [None]:
data[:2]

In [None]:
data.replace([np.inf, -np.inf], np.nan, inplace=True)
data.dropna(inplace=True)

In [None]:
data.to_pickle('data/sentiment.pkl')

In [None]:
dev=data.sample(frac=0.1,random_state=200)
train=data.drop(dev.index)

train.shape, dev.shape, Counter(train.polarity)

In [None]:
train = data[data.index.isin(train.index)]
neg = train[train.polarity == 0]
pos = train[train.polarity == 1]

neg.w2v.shape, pos.w2v.shape

In [None]:
pos = [x for x in pos.w2v if type(x) is not np.float64]
neg = [x for x in neg.w2v if type(x) is not np.float64]
labels = len(neg) * [0] + len(pos) * [1]

train_data =  list(neg) + list(pos)

In [None]:
len(train_data)

In [None]:
from sklearn import linear_model

regr = linear_model.LogisticRegression(penalty='l2')

logres = regr.fit(train_data, labels)

In [None]:
dev = data[data.index.isin(dev.index)]

neg = dev[dev.polarity == 0].w2v
pos = dev[dev.polarity == 1].w2v

pos = [x for x in pos if type(x) is not np.float64]
neg = [x for x in neg if type(x) is not np.float64]

In [None]:
neg_guess = [logres.predict(v.reshape(1, -1)) for v in neg]
pos_guess = [logres.predict(v.reshape(1, -1)) for v in pos]

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(neg_guess+pos_guess, len(neg_guess)*[0] + len(pos_guess) * [1])