Data Preprocessing
==================

Use a small subset of the data to experiment with preprocessing and feature extraction. First, test the csv module and look at the data.

In [6]:
import csv
import re
subsetData = open("SAsubset.csv", "r")
for row in csv.DictReader(subsetData):
    print row['Sentiment'], row['SentimentText']
subsetData.close()

0                      is so sad for my APL friend.............
0                    I missed the New Moon trailer...
1               omg its already 7:30 :O
0           .. Omgaga. Im sooo  im gunna CRy. I've been at this dentist since 11.. I was suposed 2 just get a crown put on (30mins)...
0          i think mi bf is cheating on me!!!       T_T
0          or i just worry too much?        
1        Juuuuuuuuuuuuuuuuussssst Chillin!!
0        Sunny Again        Work Tomorrow  :-|       TV Tonight
1       handed in my uniform today . i miss you already
1       hmmmm.... i wonder how she my number @-)
0       I must think about positive..
1       thanks to all the haters up in my face all day! 112-102
0       this weekend has sucked so far
0      jb isnt showing in australia any more!
0      ok thats it you win.
0     &lt;-------- This is the way i feel right now...
0     awhhe man.... I'm completely useless rt now. Funny, all I can do is twitter. http://myloc.me/27HX
1     Feeling stran

Typical noisy data:

- HTML escape sequences (e.g. &lt;)
- URLs
- @handles
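The last two kinds of noise can be handled with regular expressions before any other cleaning. The following is a minimal sketch, not part of the original notebook: the names `URL_RE`, `HANDLE_RE` and `strip_noise` are hypothetical, and the handle `@someone` in the sample is made up for illustration.

```python
import re

# Hypothetical patterns for illustration; real tweets may need stricter rules.
URL_RE = re.compile(r"https?://\S+")
HANDLE_RE = re.compile(r"@\w+")

def strip_noise(text):
    """Replace URLs and @handles with a space before further cleaning."""
    text = URL_RE.sub(" ", text)
    return HANDLE_RE.sub(" ", text)

sample = "all I can do is twitter. http://myloc.me/27HX thanks @someone"
cleaned = strip_noise(sample)

# An emoticon such as "@-)" is left alone because "-" is not a word character.
emoticon = strip_noise("i wonder how she my number @-)")
```

Removing URLs as a whole avoids leaving fragments such as "http" and "myloc" behind as tokens once punctuation is stripped.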


In [7]:
def getData(csvFname):
    sent = []
    tweet = []
    dataSource = open(csvFname, "r")
    for row in csv.DictReader(dataSource):
        sent.append(row['Sentiment'])
        tweet.append(row['SentimentText'])
    dataSource.close()
    return sent, tweet

In [8]:
sent, tweet = getData("SAsubset.csv")

from scipy.stats import itemfreq
itemfreq(sent)

array([['0', '133'],
       ['1', '66']], 
      dtype='|S21')
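Note that `itemfreq` has been deprecated and dropped from newer SciPy releases, so the cell above may not run on a current installation. A minimal standard-library sketch of the same class count follows; `label_frequencies` and `toy_sent` are hypothetical names, and the toy labels stand in for the `sent` list returned by `getData`.

```python
from collections import Counter

def label_frequencies(labels):
    """Return (label, count) pairs sorted by label, similar in spirit to itemfreq."""
    return sorted(Counter(labels).items())

# Toy labels standing in for the `sent` list; not the real data.
toy_sent = ['0', '1', '0', '0', '1']
freqs = label_frequencies(toy_sent)
```

With NumPy available, `np.unique(labels, return_counts=True)` gives the same information as two arrays.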

In [9]:
tweet

['                     is so sad for my APL friend.............',
 '                   I missed the New Moon trailer...',
 '              omg its already 7:30 :O',
 "          .. Omgaga. Im sooo  im gunna CRy. I've been at this dentist since 11.. I was suposed 2 just get a crown put on (30mins)...",
 '         i think mi bf is cheating on me!!!       T_T',
 '         or i just worry too much?        ',
 '       Juuuuuuuuuuuuuuuuussssst Chillin!!',
 '       Sunny Again        Work Tomorrow  :-|       TV Tonight',
 '      handed in my uniform today . i miss you already',
 '      hmmmm.... i wonder how she my number @-)',
 '      I must think about positive..',
 '      thanks to all the haters up in my face all day! 112-102',
 '      this weekend has sucked so far',
 '     jb isnt showing in australia any more!',
 '     ok thats it you win.',
 '    &lt;-------- This is the way i feel right now...',
 "    awhhe man.... I'm completely useless rt now. Funny, all I can do is twitter. http://m

Ballpark preprocessing: unescape HTML entities, lowercase, and remove all punctuation.

In [10]:
tweet[15]

'    &lt;-------- This is the way i feel right now...'

In [11]:
from HTMLParser import HTMLParser
h = HTMLParser()
print h.unescape(tweet[15])

    <-------- This is the way i feel right now...


In [12]:
re.sub("[^\w\s]", " ", h.unescape(tweet[15])).lower()

u'              this is the way i feel right now   '
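The cells above use the Python 2 `HTMLParser` module; `HTMLParser.unescape` was removed in Python 3.9. As a rough Python 3 equivalent of the same three steps, here is a sketch using `html.unescape` from the standard library. The function name `clean_tweet` is hypothetical.

```python
import html
import re

def clean_tweet(text):
    """Unescape HTML entities, replace punctuation with spaces, lowercase."""
    unescaped = html.unescape(text)
    # Every character that is neither a word character nor whitespace
    # becomes a single space, so the string length is preserved.
    return re.sub(r"[^\w\s]", " ", unescaped).lower()

raw = "    &lt;-------- This is the way i feel right now..."
cleaned = clean_tweet(raw)
```

Because each punctuation character is replaced by exactly one space, the result keeps the leading and trailing padding seen in the output above.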

Modify getData slightly and load the 200K-tweet dataset.

In [13]:
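def getData(csvFname):
    h = HTMLParser()
    corpus = []
    dataSource = open(csvFname, "r")
    for row in csv.DictReader(dataSource):
        try:
            # Unescape HTML entities, replace everything except letters and
            # whitespace (so digits too) with a space, then lowercase.
            text = re.sub("[^a-zA-Z\s]", " ", h.unescape(row['SentimentText'])).lower()
            corpus.append({"tweet": text, "sent": int(row['Sentiment'])})
        except Exception:
            # Skip rows that cannot be unescaped or have a non-integer label.
            continue
    dataSource.close()
    return corpus
corpus = getData("SA200K.csv")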

In [14]:
print len(corpus)
print corpus[2]


199720
{'tweet': '              omg its already       o', 'sent': 1}


Feature extraction
==================

Convert the tweets to a bag-of-words (BOW) feature matrix, using only the default settings of CountVectorizer.

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [16]:
X = vectorizer.fit_transform([item['tweet'] for item in corpus])
X

<199720x164787 sparse matrix of type '<type 'numpy.int64'>'
	with 2388088 stored elements in Compressed Sparse Row format>
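To make the matrix above less opaque, here is a small pure-Python sketch of what a bag-of-words conversion does. It is a simplified illustration, not scikit-learn's implementation: it reuses the default `token_pattern` shown in the CountVectorizer repr above, and `bag_of_words` plus the two toy documents are hypothetical.

```python
import re
from collections import Counter

# Same default token pattern that the CountVectorizer repr above reports:
# tokens are runs of two or more word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def bag_of_words(docs):
    """Build a sorted vocabulary and one token-count row per document."""
    tokenized = [TOKEN_RE.findall(doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, count in Counter(toks).items():
            row[index[tok]] = count
        rows.append(row)
    return vocab, rows

docs = ["omg its already o", "i miss you already"]
vocab, rows = bag_of_words(docs)
```

Single-character tokens such as "o" and "i" do not match the pattern, so they never become features. The real vectorizer stores the rows as a sparse matrix, which is why only the non-zero counts are reported as stored elements.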

In [17]:
#X.toarray()

In [18]:
vectorizer.get_feature_names()

[u'aa',
 u'aaa',
 u'aaaa',
 u'aaaaa',
 u'aaaaaa',
 u'aaaaaaaaa',
 u'aaaaaaaaaaaaaaaaaa',
 u'aaaaaaaaaaaaaaaaaaaaaaa',
 u'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaah',
 u'aaaaaaaaaaaaaaaaaaaah',
 u'aaaaaaaaaaaaaaah',
 u'aaaaaaaaaaarggggggggggghhhhhhhhh',
 u'aaaaaaaaaahhhhhhhhhhhhh',
 u'aaaaaaaaaaw',
 u'aaaaaaaaah',
 u'aaaaaaaaahh',
 u'aaaaaaaaahhhhhhhaaaa',
 u'aaaaaaaaaw',
 u'aaaaaaaah',
 u'aaaaaaaahhhhh',
 u'aaaaaaah',
 u'aaaaaaahahahaha',
 u'aaaaaaahh',
 u'aaaaaaahhh',
 u'aaaaaaalcohol',
 u'aaaaaaarrgghh',
 u'aaaaaaawwwwwwwww',
 u'aaaaaah',
 u'aaaaaahh',
 u'aaaaaahhh',
 u'aaaaaahhhhhhhh',
 u'aaaaaahhhhhhhhh',
 u'aaaaaainnnn',
 u'aaaaaakkkh',
 u'aaaaaalllat',
 u'aaaaaand',
 u'aaaaaaw',
 u'aaaaaawwwwwww',
 u'aaaaabbey',
 u'aaaaah',
 u'aaaaahhh',
 u'aaaaahhhh',
 u'aaaaahhhhh',
 u'aaaaahhhhhh',
 u'aaaaalison',
 u'aaaaand',
 u'aaaaargh',
 u'aaaaarrgh',
 u'aaaaaw',
 u'aaaaawwwwwww',
 u'aaaae',
 u'aaaages',
 u'aaaah',
 u'aaaahahahahahahahahahahahahahahahah',
 u'aaaahh',
 u'aaaahhh',


In [19]:
y = [item['sent'] for item in corpus]

Randomly split X and y into training and test sets.

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=1697)

In [21]:
X_train

<139804x164787 sparse matrix of type '<type 'numpy.int64'>'
	with 1670944 stored elements in Compressed Sparse Row format>

In [22]:
X_test

<59916x164787 sparse matrix of type '<type 'numpy.int64'>'
	with 717144 stored elements in Compressed Sparse Row format>

In [23]:
y_train

[1,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 0,
 0,
 0,
 1,
 1,
 0,
 1,
 1,
 0,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 1,
 1,
 0,
 0,
 1,
 1,
 1,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 0,
 0,
 1,
 0,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 1,
 0,
 1,
 1,
 1,
 0,
 1,
 0,


Training
========

Try to fit a naive Bayes classifier $$h_{\theta}(X)$$.
Naive Bayes approaches its asymptotic error with roughly $$\sim O(\log{n})$$ training examples, where $$n$$ is the number of features.
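Before using the library class, it may help to see the idea in miniature. The sketch below is a toy multinomial naive Bayes with additive smoothing, written for illustration only; it is not scikit-learn's implementation, and `train_nb`, `predict_nb` and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit a toy multinomial naive Bayes model on tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {}
    for label, n_class in class_counts.items():
        total = sum(token_counts[label].values())
        log_prior = math.log(n_class / float(len(docs)))
        # Additive (Laplace) smoothing keeps unseen tokens from zeroing a class.
        log_lik = {
            tok: math.log((token_counts[label][tok] + alpha)
                          / (total + alpha * len(vocab)))
            for tok in vocab
        }
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Return the label with the highest log posterior; unknown tokens are skipped."""
    scores = {
        label: log_prior + sum(log_lik[tok] for tok in doc if tok in log_lik)
        for label, (log_prior, log_lik) in model.items()
    }
    return max(scores, key=scores.get)

# Toy tokenized documents; 0 = negative, 1 = positive.
toy_docs = [["so", "sad"], ["so", "cool"], ["sad", "bad"], ["cool", "good"]]
toy_labels = [0, 1, 0, 1]
toy_model = train_nb(toy_docs, toy_labels)
```

Here `predict_nb(toy_model, ["sad"])` favours class 0 because "sad" appears only in the negative documents. The `alpha` argument plays the same role as the `alpha=1.0` default of `MultinomialNB`.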

In [24]:
from sklearn.naive_bayes import MultinomialNB
hx_nb = MultinomialNB()

In [25]:
hx_nb.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [26]:
hx_nb.predict(X_train)

array([1, 0, 1, ..., 1, 0, 0])

Evaluate the effectiveness of hx_nb using the F1 score, first on the training set.

In [27]:
from sklearn.metrics import confusion_matrix, f1_score

In [28]:
print confusion_matrix(y_train, hx_nb.predict(X_train))
print f1_score(y_train, hx_nb.predict(X_train))

[[48923  9949]
 [ 7941 72991]]
0.890829427846


Repeat the evaluation on the test set.

In [29]:
print confusion_matrix(y_test, hx_nb.predict(X_test))
print f1_score(y_test, hx_nb.predict(X_test))

[[17633  7711]
 [ 6802 27770]]
0.792828287154
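The F1 scores above can be reproduced by hand from the confusion matrices, which is a useful sanity check. The sketch below assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, so a binary matrix reads [[TN, FP], [FN, TP]]. The helper name `f1_from_confusion` is hypothetical.

```python
def f1_from_confusion(cm):
    """F1 for the positive class from a [[TN, FP], [FN, TP]] matrix."""
    (tn, fp), (fn, tp) = cm
    precision = tp / float(tp + fp)   # share of predicted positives that are right
    recall = tp / float(tp + fn)      # share of true positives that are found
    return 2 * precision * recall / (precision + recall)

# Confusion matrices copied from the naive Bayes outputs above.
train_f1 = f1_from_confusion([[48923, 9949], [7941, 72991]])
test_f1 = f1_from_confusion([[17633, 7711], [6802, 27770]])
```

Both values agree with the `f1_score` outputs printed above, and the drop from training to test set gives a rough idea of how much the model overfits.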


Classify a new tweet

In [30]:
newTweetFeatureVector = vectorizer.transform(["I feel so bad now. Let's go to hell!"])

In [31]:
newTweetFeatureVector

<1x164787 sparse matrix of type '<type 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [32]:
hx_nb.predict(newTweetFeatureVector)

array([0])

In [33]:
newTweetFeatureVector = vectorizer.transform(["scikit learn is so cool!"])
hx_nb.predict(newTweetFeatureVector)

array([1])

In [34]:
newTweetFeatureVector = vectorizer.transform(["I am feeling not good with scikit learn"])
hx_nb.predict(newTweetFeatureVector)

array([1])

In [35]:
hx_nb.predict_proba(newTweetFeatureVector)

array([[ 0.36319366,  0.63680634]])

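Logistic regression with L2 regularization (C is the inverse of the regularization strength, so smaller values mean stronger regularization).
Logistic regression needs roughly $$ \sim O(n)$$ training examples to approach its asymptotic error, where $$n$$ is the number of features.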
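For reference, the role of C can be read off the L2-regularized objective as scikit-learn's documentation states it, up to notation. The symbols $$m$$ (number of training examples), $$w$$ (weights), $$b$$ (intercept) and the label encoding $$y_i \in \{-1, +1\}$$ are notation assumed for this sketch:

$$ \min_{w, b} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{m} \log\left(1 + \exp\left(-y_i \left(x_i^{T} w + b\right)\right)\right) $$

C multiplies the data-fit term rather than the penalty, so lowering C gives the penalty $$\frac{1}{2} w^{T} w$$ relatively more weight.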

In [36]:
from sklearn.linear_model import LogisticRegression

In [37]:
hx_log = LogisticRegression(C=0.6)

In [38]:
hx_log.fit(X_train, y_train)

LogisticRegression(C=0.6, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [39]:
confusion_matrix(y_train, hx_log.predict(X_train))

array([[47061, 11811],
       [ 7364, 73568]])

In [40]:
print "Training set F1: %s" %f1_score(y_train, hx_log.predict(X_train))
print "Test set F1: %s" %f1_score(y_test, hx_log.predict(X_test))

Training set F1: 0.884703958247
Test set F1: 0.807792388506


### Tuning

Tune the value of C in the LogisticRegression model above.
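One common way to do this is a cross-validated grid search. The sketch below is an assumption about how the tuning could be done, not the notebook's own procedure: the candidate values in `param_grid` are arbitrary, and a tiny made-up corpus stands in for `X_train` and `y_train` so that the block runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy corpus standing in for the tweets; 1 = positive, 0 = negative.
toy_tweets = [
    "so happy today", "love this song", "what a great day", "feeling good",
    "this is cool", "thanks so much",
    "so sad today", "hate this weather", "what a bad day", "feeling sick",
    "this sucks", "missed the bus",
]
toy_labels = [1] * 6 + [0] * 6

X_toy = CountVectorizer().fit_transform(toy_tweets)

# Hypothetical grid of candidate values; smaller C means stronger regularization.
param_grid = {"C": [0.01, 0.1, 0.6, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid,
    scoring="f1",   # same metric as used elsewhere in this notebook
    cv=3,           # 3-fold cross-validation on the training data only
)
search.fit(X_toy, toy_labels)
best_C = search.best_params_["C"]
```

On the real data, `search.fit(X_train, y_train)` would replace the toy call, and the test set would be touched only once, after C has been chosen.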

## Bigram tokenization

In [41]:
bigramvect = CountVectorizer(ngram_range = (1,2))

In [42]:
X_bi = bigramvect.fit_transform([item['tweet'] for item in corpus])

In [43]:
X_bi

<199720x1046715 sparse matrix of type '<type 'numpy.int64'>'
	with 4677871 stored elements in Compressed Sparse Row format>
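The jump in the number of columns comes from adding every pair of adjacent words as a feature. The sketch below shows, in plain Python, which features `ngram_range = (1,2)` produces for one sentence. It is a simplified illustration that assumes the default token pattern, and `word_ngrams` is a hypothetical helper.

```python
import re

# Default CountVectorizer token pattern: runs of two or more word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def word_ngrams(text, ngram_range=(1, 2)):
    """Return word n-grams, joined by a space, for every n in the range."""
    tokens = TOKEN_RE.findall(text.lower())
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

grams = word_ngrams("I am feeling not good with scikit learn")
```

The result contains the seven unigrams plus six bigrams, including "not good", which lets a classifier treat the negated phrase as a feature of its own instead of seeing "not" and "good" separately.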

In [44]:
X

<199720x164787 sparse matrix of type '<type 'numpy.int64'>'
	with 2388088 stored elements in Compressed Sparse Row format>

In [45]:
X_train_bi, X_test_bi, y_train_bi, y_test_bi = train_test_split(X_bi, y, test_size = 0.3, random_state=1697)

In [46]:
bnb = MultinomialNB()
bi_nbhx = bnb.fit(X_train_bi, y_train_bi)

In [47]:
confusion_matrix(y_train_bi, bi_nbhx.predict(X_train_bi))

array([[55823,  3049],
       [ 2141, 78791]])

In [48]:
f1_score(y_train_bi, bi_nbhx.predict(X_train_bi))

0.96811490919814236

In [49]:
f1_score(y_test_bi, bi_nbhx.predict(X_test_bi))

0.80256599193793077

In [50]:
newTweetFeatureVector = bigramvect.transform(["I am feeling not good with scikit learn"])
bi_nbhx.predict(newTweetFeatureVector)[0]

0

In [51]:
def Ask_regina(talk):
    # Echo the input, then reply according to the predicted sentiment
    print "@you: " + talk
    newTweetFeatureVector = bigramvect.transform([talk])
    sent = bi_nbhx.predict(newTweetFeatureVector)[0]
    if sent == 0:
        # 0 = negative sentiment
        print "@regina: You have slacked off for ten years!"
    elif sent == 1:
        # 1 = positive sentiment
        print "@regina: You lack administrative experience."

Ask_regina("Don't cheat on me, Eliza! You son of a bitch! Girl, I love you. Don't do this to me.")

@you: Don't cheat on me, Eliza! You son of a bitch! Girl, I love you. Don't do this to me.
@regina: You lack administrative experience.


In [52]:
Ask_regina("You sucks, Regina!")

@you: You sucks, Regina!
@regina: You have slacked off for ten years!
