Data Preprocessing
==================

Use a small subset of the data to experiment with preprocessing and feature extraction. First, test the csv module and take a look at the data.

In [None]:
import csv
import re
# The with-block closes the file even if reading fails part-way.
with open("SAsubset.csv", "r") as subsetData:
    for row in csv.DictReader(subsetData):
        print("%s %s" % (row['Sentiment'], row['SentimentText']))

Typical noise in the data:

- HTML escape sequences (e.g. &amp;)
- URLs
- @handles
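The three kinds of noise listed above could be stripped with a couple of regular expressions. This is only a minimal sketch: the helper name strip_noise and both patterns are assumptions for illustration, not part of the notebook's pipeline, which later simply replaces punctuation with spaces.

```python
import re

try:
    from html import unescape  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 fallback
    unescape = HTMLParser().unescape

# Illustrative patterns (assumptions): real tweets may need broader rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HANDLE_RE = re.compile(r"@\w+")

def strip_noise(text):
    """Hypothetical helper: unescape entities, then drop URLs and @handles."""
    text = unescape(text)
    text = URL_RE.sub(" ", text)
    text = HANDLE_RE.sub(" ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(text.split())

example = "@bob I &lt;3 this &amp; that http://t.co/abc"
cleaned = strip_noise(example)
```

Unescaping first matters: an entity such as &amp; would otherwise leave a stray "amp" token once punctuation is removed.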


In [None]:
def getData(csvFname):
    """Return parallel lists of sentiment labels and tweet texts from a CSV file."""
    sent = []
    tweet = []
    with open(csvFname, "r") as dataSource:
        for row in csv.DictReader(dataSource):
            sent.append(row['Sentiment'])
            tweet.append(row['SentimentText'])
    return sent, tweet

In [None]:
sent, tweet = getData("SAsubset.csv")

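# Label frequencies. collections.Counter stands in for scipy.stats.itemfreq,
# which was deprecated in SciPy 1.0 and removed in SciPy 1.3.
from collections import Counter
Counter(sent)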

In [None]:
tweet

Rough preprocessing: unescape HTML entities, lowercase, and remove all punctuation.

In [None]:
tweet[15]

In [None]:
from HTMLParser import HTMLParser
h = HTMLParser()
print(h.unescape(tweet[15]))

In [None]:
re.sub(r"[^\w\s]", " ", h.unescape(tweet[15])).lower()

Modify getData slightly and load the 200K-tweet dataset.

In [None]:
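def getData(csvFname):
    """Return a list of {"tweet": cleaned text, "sent": int label} dicts."""
    h = HTMLParser()
    corpus = []
    with open(csvFname, "r") as dataSource:
        for row in csv.DictReader(dataSource):
            try:
                # Unescape HTML entities, replace non-letters with spaces, lowercase.
                text = re.sub(r"[^a-zA-Z\s]", " ", h.unescape(row['SentimentText'])).lower()
                corpus.append({"tweet": text, "sent": int(row['Sentiment'])})
            except Exception:
                # Skip rows that fail to parse (e.g. a non-integer label).
                continue
    return corpus
corpus = getData("SA200K.csv")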

In [None]:
print(len(corpus))
print(corpus[2])


Feature extraction
==================

Convert the tweets to a bag-of-words (BOW) feature matrix, using CountVectorizer with its default settings.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
vectorizer

In [None]:
X = vectorizer.fit_transform([item['tweet'] for item in corpus])
X

In [None]:
#X.toarray()

In [None]:
vectorizer.get_feature_names()
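To make the contents of the BOW matrix concrete, here is a pure-Python sketch on two toy sentences. It is not CountVectorizer's implementation. It only mimics the default behaviour of lowercasing, keeping tokens of two or more word characters, and ordering the vocabulary alphabetically. The function name bag_of_words and the toy sentences are assumptions for illustration.

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count_rows) for a list of strings."""
    # Tokens of 2+ word characters, mirroring CountVectorizer's default pattern.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    vocabulary = sorted(set(tok for toks in tokenized for tok in toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # One column per vocabulary word, holding that word's count in the doc.
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

toy_docs = ["i love this movie", "i hate this movie so much"]
vocabulary, rows = bag_of_words(toy_docs)
```

Each row corresponds to one tweet and each column to one vocabulary word. On the real corpus most entries are zero, which is why fit_transform returns a sparse matrix.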

In [None]:
y = [item['sent'] for item in corpus]

Randomly split X and y into training and test sets.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=1697)

In [None]:
X_train

In [None]:
X_test

In [None]:
y_train

Training
========

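Fit a naive Bayes classifier $$h_{\theta}(X)$$.
Naive Bayes approaches its asymptotic error quickly: the number of training examples it needs grows as $$\sim O(\log{n})$$, where $$n$$ is the number of features.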

In [None]:
from sklearn.naive_bayes import MultinomialNB
hx_nb = MultinomialNB()

In [None]:
hx_nb.fit(X_train, y_train)

In [None]:
hx_nb.predict(X_train)
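For intuition about what the fit and predict calls compute, the following is a small from-scratch sketch of multinomial naive Bayes on whitespace-split toy documents. It assumes Laplace smoothing with alpha = 1.0 (the same default value MultinomialNB uses). The function names and toy data are hypothetical, and scikit-learn's own implementation works on sparse count matrices instead.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Return {class: (log_prior, {word: log_likelihood})}."""
    vocab = sorted(set(w for d in docs for w in d.split()))
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label in class_docs:
        # Smoothed denominator: all word occurrences in the class + alpha per word.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(class_docs[label] / float(len(docs)))
        log_lik = dict((w, math.log((word_counts[label][w] + alpha) / total))
                       for w in vocab)
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log posterior; unknown words are ignored."""
    scores = {}
    for label, (log_prior, log_lik) in model.items():
        scores[label] = log_prior + sum(log_lik[w] for w in doc.split()
                                        if w in log_lik)
    return max(scores, key=scores.get)

toy_docs = ["good great fun", "great love", "bad awful", "awful boring bad"]
toy_labels = [1, 1, 0, 0]
toy_model = train_nb(toy_docs, toy_labels)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.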

Evaluate the effectiveness of hx_nb using the F1 score, first on the training set.

In [None]:
from sklearn.metrics import confusion_matrix, f1_score

In [None]:
y_train_pred = hx_nb.predict(X_train)  # predict once, reuse for both metrics
print(confusion_matrix(y_train, y_train_pred))
print(f1_score(y_train, y_train_pred))
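The relationship between the confusion matrix and the F1 score can be checked by hand. The sketch below assumes binary labels with 1 as the positive class (f1_score's default) and uses made-up label lists. The helper names are hypothetical.

```python
def binary_counts(y_true, y_pred, positive=1):
    """Return (tn, fp, fn, tp), the row-major order of a 2x2 confusion matrix."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0, 1, 0]
tn, fp, fn, tp = binary_counts(toy_true, toy_pred)
toy_f1 = f1_from_counts(tp, fp, fn)
```

With rows as true labels and columns as predictions, confusion_matrix lays the counts out as [[tn, fp], [fn, tp]], so the F1 score uses only the second row and second column.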

Repeat the evaluation on the test set, which the model did not see during training.

In [None]:
y_test_pred = hx_nb.predict(X_test)  # predict once, reuse for both metrics
print(confusion_matrix(y_test, y_test_pred))
print(f1_score(y_test, y_test_pred))

Classify a new tweet

In [None]:
newTweetFeatureVector = vectorizer.transform(["I feel so bad now. Let's go to hell!"])

In [None]:
newTweetFeatureVector

In [None]:
hx_nb.predict(newTweetFeatureVector)

In [None]:
newTweetFeatureVector = vectorizer.transform(["scikit learn is so cool!"])
hx_nb.predict(newTweetFeatureVector)

In [None]:
newTweetFeatureVector = vectorizer.transform(["I am feeling not good with scikit learn"])
hx_nb.predict(newTweetFeatureVector)

In [None]:
hx_nb.predict_proba(newTweetFeatureVector)

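Logistic regression with regularization (in scikit-learn, C is the inverse of the regularization strength, so a smaller C means stronger regularization).
Training examples needed to approach its asymptotic error: $$ \sim O(n)$$, where $$n$$ is the number of features.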
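
To see how C enters the model, the L2-penalized objective can be written as below. This follows the form given in the scikit-learn user guide. It assumes the labels are recoded as $$y_i \in \{-1, +1\}$$, and the symbols $$w$$ (weights), $$b$$ (intercept) and $$m$$ (number of training examples) are notation introduced here, not taken from the notebook.

$$\min_{w, b} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{m} \log\left(1 + e^{-y_i (x_i^{T} w + b)}\right)$$

C multiplies the data-fit term, so a large C lets the loss dominate (weak regularization), while a small C lets the penalty $$\frac{1}{2} w^{T} w$$ dominate.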

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
hx_log = LogisticRegression(C=0.6)

In [None]:
hx_log.fit(X_train, y_train)

In [None]:
confusion_matrix(y_train, hx_log.predict(X_train))

In [None]:
print("Training set F1: %s" % f1_score(y_train, hx_log.predict(X_train)))
print("Test set F1: %s" % f1_score(y_test, hx_log.predict(X_test)))

### Tuning

Tune the value of C in the LogisticRegression model above, for example by comparing cross-validated F1 scores across several candidate values.
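
One possible way to do this tuning is a cross-validated grid search. The sketch below is an assumption about how it might look, not the notebook's own method. It runs on synthetic data from make_classification so that it is self-contained, and the candidate values of C are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the tweet features (assumption, for a runnable demo).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1697)

# Arbitrary candidate values; 0.6 is the value used earlier in the notebook.
param_grid = {"C": [0.01, 0.1, 0.6, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="f1", cv=5)
search.fit(X_tr, y_tr)

best_C = search.best_params_["C"]
best_cv_f1 = search.best_score_
```

On the tweet data, X_train and y_train would take the place of X_tr and y_tr. The test set stays out of the search so that the final F1 score remains an honest estimate.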

## Bigram tokenization

In [None]:
bigramvect = CountVectorizer(ngram_range = (1,2))
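
Setting ngram_range = (1,2) keeps the single words and adds every pair of adjacent words as an extra feature. A pure-Python sketch of that expansion, with a hypothetical helper name and a toy token list, looks like this:

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams for n in [n_min, n_max], joined by spaces."""
    features = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features

toy_tokens = "am feeling not good".split()
toy_features = ngrams(toy_tokens)
```

The bigram "not good" becomes a feature of its own, which gives the classifier a chance to treat negation differently from the unigram "good".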

In [None]:
X_bi = bigramvect.fit_transform([item['tweet'] for item in corpus])

In [None]:
X_bi

In [None]:
X

In [None]:
X_train_bi, X_test_bi, y_train_bi, y_test_bi = train_test_split(X_bi, y, test_size = 0.3, random_state=1697)

In [None]:
bnb = MultinomialNB()
bi_nbhx = bnb.fit(X_train_bi, y_train_bi)

In [None]:
confusion_matrix(y_train_bi, bi_nbhx.predict(X_train_bi))

In [None]:
f1_score(y_train_bi, bi_nbhx.predict(X_train_bi))

In [None]:
f1_score(y_test_bi, bi_nbhx.predict(X_test_bi))

In [None]:
newTweetFeatureVector = bigramvect.transform(["I am feeling not good with scikit learn"])
bi_nbhx.predict(newTweetFeatureVector)[0]

### Your move

Create a bot that talks back based on the sentiment of your input sentence.
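
As a possible starting point, the skeleton below separates the reply logic from the classifier. Everything in it is hypothetical: the function names, the canned replies, the keyword-based stand-in predictor, and the assumption that label 1 means positive and 0 means negative.

```python
def reply_for(label):
    """Map a predicted label to a canned reply (assumes 1 = positive, 0 = negative)."""
    if label == 1:
        return "Glad to hear it!"
    return "Sorry to hear that."

def make_bot(predict):
    """Wrap any sentence -> label function into a sentence -> reply function."""
    def bot(sentence):
        return reply_for(predict(sentence))
    return bot

def toy_predict(sentence):
    # Stand-in for a trained model: flags a sentence as negative on one keyword.
    return 0 if "bad" in sentence.lower() else 1

bot = make_bot(toy_predict)
```

To plug in the classifier trained above, toy_predict could be replaced with something like lambda s: hx_nb.predict(vectorizer.transform([s]))[0].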