## Cmpe 545 - Term Project: Twitter Sentiment Classification

**Student:** Gonul Ayci - 2016800003<br>
**Instructor:** Prof. Ethem Alpaydin

## Technique:
The following cells describe each step in detail. In brief, I approach the problem with the following steps:

**Step 0:** Split the dataset into training and test sets,

**Step 1:** Clean the data by applying the following:
* Tokenization
* Punctuation removal
* Stop word removal

**Step 2:** Feature representation,
* **Step 2.1:** Use the n-gram (baseline) method with unigrams and bigrams, applied to the cleaned data.

* **Step 2.2:** Use word embeddings (second method).

**Step 3:** Feature extraction,

**Step 4:** Dimensionality reduction,

**Step 5:** Use a multi-layer perceptron to classify tweets.

## Abstract: 
Sentiment analysis is a popular topic in both artificial neural network (ANN) and natural language processing (NLP) research. The task is to classify subjective statements in text into categories such as "positive", "negative", and "neutral". <br>

The aim of this project is to perform sentiment analysis on the Twitter dataset available at http://help.sentiment140.com/for-students.

In [120]:
from __future__ import division  # must come before all other statements

import math
import string
import numpy as np
import pandas as pd

from collections import Counter

import nltk
from nltk import ngrams
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords

# train_test_split lives in sklearn.model_selection since scikit-learn 0.18;
# the old sklearn.cross_validation module was removed in 0.20
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.decomposition import PCA
from sklearn.decomposition import IncrementalPCA

from scipy.sparse import csr_matrix

## Data 

In this project, I use the labeled data from http://help.sentiment140.com/for-students.

The data is a CSV file with emoticons removed. Each row has 6 fields: <br>
0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive) <br>
1 - the id of the tweet (2087) <br>
2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009) <br>
3 - the query (lyx). If there is no query, then this value is NO_QUERY. <br>
4 - the user that tweeted (robotickilldozr) <br>
5 - the text of the tweet (Lyx is cool) <br>
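
As a rough illustration of this layout, the sketch below parses one row in the six-field format using only the standard library. The column names in `COLUMNS`, the `POLARITY_LABELS` mapping, and the polarity value of the sample row are assumptions made for illustration; the remaining sample values are the examples from the field list above.

```python
import csv
import io

# Column names are assumptions for illustration, one per field described above.
COLUMNS = ["polarity", "id", "date", "query", "user", "tweet"]

# Human-readable labels for the polarity codes.
POLARITY_LABELS = {0: "negative", 2: "neutral", 4: "positive"}

# One sample row built from the example values in the field list
# (the polarity value 4 is an assumption).
sample = io.StringIO(
    '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'
)

rows = []
for fields in csv.reader(sample):
    row = dict(zip(COLUMNS, fields))
    row["polarity"] = int(row["polarity"])
    row["label"] = POLARITY_LABELS[row["polarity"]]
    rows.append(row)

print(rows[0]["label"], "-", rows[0]["tweet"])
```

Only the `polarity` and `tweet` fields are used in the rest of this notebook.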

In [87]:
df = pd.read_csv('trainingandtestdata/testdata.manual.2009.06.14.csv')
len(df[['polarity', 'tweet']])

498

In [88]:
df2 = df[['polarity', 'tweet']]
df2

Unnamed: 0,polarity,tweet
0,4,@stellargirl I loooooooovvvvvveee my Kindle2. ...
1,4,Reading my kindle2... Love it... Lee childs i...
2,4,"Ok, first assesment of the #kindle2 ...it fuck..."
3,4,@kenburbary You'll love your Kindle2. I've had...
4,4,@mikefish Fair enough. But i have the Kindle2...
5,4,@richardebaker no. it is too big. I'm quite ha...
6,0,Fuck this economy. I hate aig and their non lo...
7,4,Jquery is my new best friend.
8,4,Loves twitter
9,4,how can you not love Obama? he makes jokes abo...


### Twitter word embedding:

I convert all tweets to lowercase because the pre-trained word embeddings are uncased (lowercase only).

I use the GloVe vectors pre-trained on Twitter: 2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download <br>
https://nlp.stanford.edu/projects/glove/
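
The GloVe text files store one word per line, followed by its vector components separated by spaces. The sketch below parses that format and averages the vectors of a tweet's tokens into a single fixed-length feature vector. Averaging is one common way to turn word vectors into a tweet representation and is an assumption here, not necessarily the method used in this project; the three 3-dimensional vectors are made-up toy values, and `load_vectors` and `tweet_vector` are hypothetical helper names.

```python
def load_vectors(lines):
    """Parse lines of the form 'word v1 v2 ... vd' into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


def tweet_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Toy stand-in for a GloVe file (made-up numbers, 3 dimensions).
toy_file = [
    "love 1.0 0.0 2.0",
    "kindle 0.0 1.0 0.0",
    "hate -1.0 0.0 -2.0",
]
vectors = load_vectors(toy_file)

# Tokens must be lowercased because the vocabulary is uncased.
tokens = "Love my Kindle".lower().split()
features = tweet_vector(tokens, vectors, dim=3)
print(features)
```

Words missing from the vocabulary (here `my`) are simply skipped, which is why lowercasing the tweets first matters: a capitalized token would never match an uncased vocabulary.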

In [4]:
df2 = df2.copy()  # work on an independent copy to avoid SettingWithCopyWarning
df2['tweet'] = df2['tweet'].str.lower()


## Split the data into training (75%) and test (25%) sets

In [None]:
data = pd.read_csv('trainingandtestdata/training.1600000.processed.noemoticon.csv')
len(data)

In [None]:
train_data, test_data = train_test_split(data, test_size=0.25)
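
To make the split reproducible, it helps to fix the random seed. The dependency-free sketch below shows the same idea on a toy list: shuffle with a fixed seed, then cut off 25% for testing. The helper name `split_train_test` and the seed value are assumptions for illustration; with scikit-learn, passing `random_state` to `train_test_split` serves the same purpose.

```python
import random


def split_train_test(items, test_size=0.25, seed=0):
    """Shuffle a copy of items with a fixed seed and split off the test portion."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


toy_data = list(range(100))
train_part, test_part = split_train_test(toy_data)
print(len(train_part), len(test_part))
```

Calling the function again with the same seed returns the same partition, so results obtained on the test portion stay comparable between runs.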

## Tokenization, Punctuation, Stopwords

**Stopwords:** Texts contain many non-informative words, such as articles, prepositions, and conjunctions, that occur repeatedly in almost every document. These are called stop words. Because they contribute nothing to the classification task, they are filtered out using a stop word list [1].

**Tokenization:** Tokenization is the task of splitting a sequence of words or a document into pieces called tokens.

**Punctuation:** Punctuation marks are removed from the token list.
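
The three cleaning steps can be sketched without external dependencies as follows. This is a simplified stand-in for illustration only: the regular expression is a crude tokenizer rather than NLTK's `TweetTokenizer`, and `TOY_STOP_WORDS` is a tiny hypothetical list, not NLTK's English stop word list.

```python
import re
import string

# Tiny hypothetical stop word list, for illustration only.
TOY_STOP_WORDS = {"i", "my", "is", "the", "a", "and", "this"}


def clean(text):
    """Lowercase, tokenize, then drop punctuation tokens and stop words."""
    # Tokenization: word-character runs (optionally prefixed by @ or #), or single symbols.
    tokens = re.findall(r"[@#]?\w+|[^\w\s]", text.lower())
    # Punctuation: drop tokens made up entirely of punctuation characters.
    tokens = [t for t in tokens if not all(ch in string.punctuation for ch in t)]
    # Stop words: drop tokens found in the stop word list.
    return [t for t in tokens if t not in TOY_STOP_WORDS]


print(clean("Jquery is my new best friend."))
```

Mentions and hashtags keep their `@` and `#` prefixes in this sketch, since those tokens contain non-punctuation characters.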

In [5]:
df3 = df2.copy() # copy the dataframe 

In [6]:
tknzr = TweetTokenizer()
stop_words = set(stopwords.words("english"))
tokenized_text = []
punctuated_text = []

## Represent tweets as N-grams
N-grams are sequences of n consecutive letters or words extracted from documents. I use word N-grams, for which common values of n are 1, 2, or 3. In this project, I extract both unigrams and bigrams and merge them. The main idea behind this method is that words which are similar to each other will have a high proportion of N-grams in common [2].
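
As a small illustration of word N-grams, the sketch below builds the unigrams and bigrams of a token list and merges them, joining each bigram with an underscore. The helper name `word_ngrams` is hypothetical; the project itself uses `nltk.ngrams`.

```python
def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*[tokens[i:] for i in range(n)]))


tokens = ["jquery", "new", "best", "friend"]
unigrams = list(tokens)
bigrams = ["_".join(pair) for pair in word_ngrams(tokens, 2)]
merged = unigrams + bigrams
print(merged)
```

A list of k tokens yields k - 1 bigrams, so a one-word tweet contributes only its unigram.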

In [7]:
for index in range(len(df3)):
    # Tokenize the lowercased tweet; encode to UTF-8 byte strings (Python 2)
    tokens = [x.encode('UTF8') for x in tknzr.tokenize(df2.at[index, 'tweet'])]
    tokenized_text.append(tokens)
    # Drop tokens that are substrings of string.punctuation (single marks such
    # as "." or "!"); multi-character tokens such as "..." and ":)" are kept
    no_punctuation = [x for x in tokens if x not in string.punctuation]
    punctuated_text.append(no_punctuation)
    # Drop English stop words and store the tokens as a comma-separated string
    cleaned_text = ", ".join([y for y in no_punctuation if y not in stop_words])
    # .at sets a single cell directly and avoids pandas' SettingWithCopyWarning
    df3.at[index, 'tweet'] = cleaned_text


In [8]:
df3

Unnamed: 0,polarity,tweet
0,4,"@stellargirl, loooooooovvvvvveee, kindle, 2, d..."
1,4,"reading, kindle, 2, ..., love, ..., lee, child..."
2,4,"ok, first, assesment, #kindle2, ..., fucking, ..."
3,4,"@kenburbary, love, kindle, 2, i've, mine, mont..."
4,4,"@mikefish, fair, enough, kindle, 2, think, per..."
5,4,"@richardebaker, big, i'm, quite, happy, kindle, 2"
6,0,"fuck, economy, hate, aig, non, loan, given, asses"
7,4,"jquery, new, best, friend"
8,4,"loves, twitter"
9,4,"love, obama, makes, jokes"


## Merge bigrams to unigrams

In [13]:
n = 2  # bigrams
for index in range(len(df3)):
    sentence = df3.at[index, 'tweet']
    # Tokens are comma-separated, so strip the commas before joining each pair with "_"
    bigram_tokens = ['{}_{}'.format(a.replace(',', ''), b.replace(',', ''))
                     for a, b in ngrams(sentence.split(), n)]
    # Append the bigrams to the unigram string, keeping the ", " separator;
    # .at sets a single cell directly and avoids pandas' SettingWithCopyWarning
    df3.at[index, 'tweet'] = ', '.join([sentence] + bigram_tokens)


In [36]:
df3['tweet']

0      @stellargirl, loooooooovvvvvveee, kindle, 2, d...
1      reading, kindle, 2, ..., love, ..., lee, child...
2      ok, first, assesment, #kindle2, ..., fucking, ...
3      @kenburbary, love, kindle, 2, i've, mine, mont...
4      @mikefish, fair, enough, kindle, 2, think, per...
5      @richardebaker, big, i'm, quite, happy, kindle...
6      fuck, economy, hate, aig, non, loan, given, as...
7      jquery, new, best, friend, jquery_new, new_bes...
8                          loves, twitter, loves_twitter
9      love, obama, makes, jokes, love_obama, obama_m...
10     check, video, president, obama, white, house, ...
11     @karoli, firmly, believe, obama, pelosi, zero,...
12     house, correspondents, dinner, last, night, wh...
13     watchin, espn, .., jus, seen, new, nike, comme...
14     dear, nike, stop, flywire, shit, waste, scienc...
15     #lebron, best, athlete, generation, time, bask...
16     talking, guy, last, night, telling, die, hard,...
17     love, lebron, http://bit

## Extract features by using Tf-idf
http://scikit-learn.org/stable/modules/feature_extraction.html

In [40]:
corpus = [x for x in df3['tweet']]
corpus[:200]

['@stellargirl, loooooooovvvvvveee, kindle, 2, dx, cool, 2, fantastic, right, @stellargirl_loooooooovvvvvveee, loooooooovvvvvveee_kindle, kindle_2, 2_dx, dx_cool, cool_2, 2_fantastic, fantastic_right',
 'reading, kindle, 2, ..., love, ..., lee, childs, good, read, reading_kindle, kindle_2, 2_..., ..._love, love_..., ..._lee, lee_childs, childs_good, good_read',
 'ok, first, assesment, #kindle2, ..., fucking, rocks, ok_first, first_assesment, assesment_#kindle2, #kindle2_..., ..._fucking, fucking_rocks',
 "@kenburbary, love, kindle, 2, i've, mine, months, never, looked, back, new, big, one, huge, need, remorse, :), @kenburbary_love, love_kindle, kindle_2, 2_i've, i've_mine, mine_months, months_never, never_looked, looked_back, back_new, new_big, big_one, one_huge, huge_need, need_remorse, remorse_:)",
 '@mikefish, fair, enough, kindle, 2, think, perfect, :), @mikefish_fair, fair_enough, enough_kindle, kindle_2, 2_think, think_perfect, perfect_:)',
 "@richardebaker, big, i'm, quite, happ

In [80]:
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus[:10])
idf = vectorizer.idf_

print dict(zip(vectorizer.get_feature_names(), idf))

{u'childs': 2.7047480922384253, u'kindle2': 2.7047480922384253, u'new_best': 2.7047480922384253, u'_fucking': 2.7047480922384253, u'good_read': 2.7047480922384253, u'ok_first': 2.7047480922384253, u'hate': 2.7047480922384253, u'cool_2': 2.7047480922384253, u'perfect': 2.7047480922384253, u'mikefish': 2.7047480922384253, u'_love': 2.7047480922384253, u'fucking_rocks': 2.7047480922384253, u'aig_non': 2.7047480922384253, u'good': 2.7047480922384253, u'read': 2.7047480922384253, u'big': 2.2992829841302607, u'2_': 2.7047480922384253, u'dx': 2.7047480922384253, u'kindle2_': 2.7047480922384253, u'cool': 2.7047480922384253, u'hate_aig': 2.7047480922384253, u'perfect_': 2.7047480922384253, u'happy_kindle': 2.7047480922384253, u'rocks': 2.7047480922384253, u'jokes': 2.7047480922384253, u'non_loan': 2.7047480922384253, u'right': 2.7047480922384253, u'fair': 2.7047480922384253, u'fucking': 2.7047480922384253, u'twitter': 2.7047480922384253, u'back': 2.7047480922384253, u'remorse_': 2.7047480922384
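
The two distinct values in this output can be reproduced by hand. With its default settings (`smooth_idf=True`), scikit-learn's `TfidfVectorizer` computes the inverse document frequency as idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The sketch below applies that formula to the 10 tweets used above; the function name `smoothed_idf` is hypothetical.

```python
import math


def smoothed_idf(n_documents, document_frequency):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1.0 + n_documents) / (1.0 + document_frequency)) + 1.0


# A term appearing in one of the 10 tweets, e.g. 'childs'.
idf_one_doc = smoothed_idf(10, 1)
# A term appearing in two of the 10 tweets, e.g. 'big'.
idf_two_docs = smoothed_idf(10, 2)
print(idf_one_doc, idf_two_docs)
```

Because almost every term occurs in only one of the 10 tweets, nearly all terms share the same IDF; the weights become more informative on a larger corpus.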

In [114]:
print vectorizer.get_feature_names()
print "shape:", X.toarray().shape
#print X.toarray()

[u'2_', u'2_dx', u'2_fantastic', u'2_i', u'2_think', u'_fucking', u'_lee', u'_love', u'aig', u'aig_non', u'asses', u'assesment', u'assesment_', u'back', u'back_new', u'best', u'best_friend', u'big', u'big_i', u'big_one', u'childs', u'childs_good', u'cool', u'cool_2', u'dx', u'dx_cool', u'economy', u'economy_hate', u'enough', u'enough_kindle', u'fair', u'fair_enough', u'fantastic', u'fantastic_right', u'first', u'first_assesment', u'friend', u'fuck', u'fuck_economy', u'fucking', u'fucking_rocks', u'given', u'given_asses', u'good', u'good_read', u'happy', u'happy_kindle', u'hate', u'hate_aig', u'huge', u'huge_need', u'jokes', u'jquery', u'jquery_new', u'kenburbary', u'kenburbary_love', u'kindle', u'kindle2', u'kindle2_', u'kindle_2', u'lee', u'lee_childs', u'loan', u'loan_given', u'looked', u'looked_back', u'loooooooovvvvvveee', u'loooooooovvvvvveee_kindle', u'love', u'love_', u'love_kindle', u'love_obama', u'loves', u'loves_twitter', u'm_quite', u'makes', u'makes_jokes', u'mikefish', u'
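
Several entries in this vocabulary differ from the tokens in the corpus: `@stellargirl` appears as `stellargirl`, the standalone token `2` is missing, and `..._love` has become `_love`. This happens because `TfidfVectorizer` re-tokenizes each string with its default `token_pattern`, the regular expression `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters. The sketch below applies that pattern directly with `re` to show the effect; the sample string is a shortened mix of entries from the corpus above.

```python
import re

# Default token_pattern of scikit-learn's text vectorizers.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sample = "@stellargirl, kindle, 2, ..._love, kindle_2, perfect_:)"
vocabulary_tokens = re.findall(DEFAULT_TOKEN_PATTERN, sample)
print(vocabulary_tokens)
```

One possible remedy, if the original tokens are to be preserved, is to pass a custom `tokenizer` or `token_pattern` to the vectorizer so that the strings are split only on the `, ` separator.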

## Dimensionality Reduction

**fit_transform(X[, y]):** Fit the model with X and apply the dimensionality reduction on X. <br>
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

In [115]:
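X_dense = X.toarray()
# PCA can return at most min(n_samples, n_features) components; with only
# 10 tweets, asking for 50 yields 10 (recent scikit-learn versions raise an error)
n_components = min(50, X_dense.shape[0], X_dense.shape[1])
pca = PCA(n_components=n_components, svd_solver='full')
principalComponents = pca.fit_transform(X_dense)
principalDf = pd.DataFrame(data=principalComponents)
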
print "explained_variance_ratio_: ", pca.explained_variance_ratio_
print "singular_values: ", pca.singular_values_

explained_variance_ratio_:  [  1.23632427e-01   1.16407456e-01   1.14912259e-01   1.12887316e-01
   1.12887316e-01   1.07912517e-01   1.06551974e-01   1.03761082e-01
   1.01047653e-01   9.98816534e-33]
singular_values:  [  1.04651058e+00   1.01547170e+00   1.00892900e+00   1.00000000e+00
   1.00000000e+00   9.77717384e-01   9.71534384e-01   9.58726366e-01
   9.46107623e-01   2.97454324e-16]
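
The two printed arrays are directly related: each explained variance ratio is the squared singular value divided by the sum of all squared singular values. The sketch below checks this relationship using the singular values printed above; the near-zero tenth value is set to 0.0 for simplicity.

```python
# Singular values copied from the PCA output above (the tenth value is ~3e-16).
singular_values = [
    1.04651058, 1.01547170, 1.00892900, 1.00000000, 1.00000000,
    0.977717384, 0.971534384, 0.958726366, 0.946107623, 0.0,
]

squared = [s ** 2 for s in singular_values]
total = sum(squared)
ratios = [value / total for value in squared]

# Compare against the first reported ratio, 1.23632427e-01.
print(round(ratios[0], 6))
```

The tenth component carries essentially no variance because PCA centers the data first, and 10 centered samples span at most 9 independent directions.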


In [116]:
principalDf

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.318329,-0.314108,-0.001672,-4.0538550000000004e-17,-3.689064e-15,0.513173,-0.611184,-0.147906,-0.122438,9.406332000000001e-17
1,0.296854,0.223728,-0.328827,-5.998976e-15,-3.29652e-15,0.084619,0.067494,0.773808,-0.030125,9.406332000000001e-17
2,-0.392847,-0.247869,-0.097611,-0.5530957,-0.6006262,-0.132157,-0.016534,0.047552,-0.046888,9.406332000000001e-17
3,0.207245,0.356022,0.293709,6.052741e-15,4.785524e-15,-0.439843,-0.123739,-0.132185,-0.609859,9.406332000000001e-17
4,0.34471,-0.300264,0.000232,-7.720761e-16,-2.56902e-15,0.224037,0.728307,-0.243694,-0.149544,9.406332000000001e-17
5,0.427301,-0.139887,0.134751,3.392141e-15,2.034353e-15,-0.488394,-0.103142,-0.083745,0.61899,9.406332000000001e-17
6,-0.392847,-0.247869,-0.097611,-0.2436097,0.779308,-0.132157,-0.016534,0.047552,-0.046888,9.406332000000001e-17
7,-0.28507,0.317287,0.712893,8.980694e-15,9.86453e-15,0.361896,0.092607,0.128488,0.240946,9.406332000000001e-17
8,-0.392847,-0.247869,-0.097611,0.7967054,-0.1786818,-0.132157,-0.016534,0.047552,-0.046888,9.406332000000001e-17
9,-0.130829,0.60083,-0.518251,-9.582074e-15,-1.919979e-15,0.140983,-0.000742,-0.43742,0.192693,9.406332000000001e-17


## Save the feature matrix

In [85]:
matrix_file = X.toarray()

In [86]:
np.savetxt("matrix.txt", matrix_file)

## Additional resources

GloVe word vectors: https://nlp.stanford.edu/projects/glove/

Paper: http://crowdsourcing-class.org/assignments/downloads/pak-paroubek.pdf

## References

[1] C. Silva, B. Ribeiro, “The Importance of Stop Word Removal on Recall Values in Text Categorization”, In International Joint Conference on Neural Networks, Vol. 3, p. 1661-1666, 2003.

[2] P. Majumder, M. Mitra, B. B. Chaudhuri, “N-gram: a language independent approach to IR and NLP”.