# Sentiment Analysis - Question A - Voulgari Eleni - A.M. 17005

## Create and Evaluate Classifier for Tweets

### Step 1: Prepare Project

   1. Load libraries
   2. Load dataset

In [2]:
# Import all the libraries needed.
import re
import os
import nltk
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from sklearn.svm import LinearSVC
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from nltk.tokenize import WordPunctTokenizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [3]:
# Load the training sets from 2013, 2015 and 2016 into pandas dataframes and concatenate them into one training
# dataframe. Also load the two 2016 dev sets and concatenate them into one development set, and load the test set.
cols = ['id','sentiment','text']
file2013 = pd.read_csv("./sets/twitter-2013train-A.tsv", sep='\t', header=None, names=cols)
file2015 = pd.read_csv("./sets/twitter-2015train-A.tsv", sep='\t', header=None, names=cols)
file2016 = pd.read_csv("./sets/twitter-2016train-A.tsv", sep='\t', header=None, names=cols)
test2016 = pd.read_csv("./sets/twitter-2016test-A.tsv", sep='\t', header=None, names=cols)
dev = pd.read_csv("./sets/twitter-2016dev-A.tsv", sep='\t', header=None, names=cols)
devtest = pd.read_csv("./sets/twitter-2016devtest-A.tsv", sep='\t', header=None, names=cols)
training = pd.concat([file2013, file2015, file2016])
develop = pd.concat([dev, devtest])

### Step 2: Define Problem

##### What is your task? What are your goals? What do you want to achieve?

Our task is to create and evaluate a supervised classifier that labels tweets by polarity (positive, negative, neutral). The goal is to recognize the sentiment behind a tweet's text.

### Step 3: Exploratory Analysis

##### Understand your data: Take a “peek” at your data, answer basic questions about the dataset. Summarise your data. Explore descriptive statistics and visualisations.

In [4]:
# Check what the datasets look like.
print("Training dataset:\n", training.head(3))
print("\n")
print("Development dataset:\n", develop.head(3))
print("\n")
print("Test dataset:\n", test2016.head(3))

Training dataset:
                   id sentiment  \
0  264183816548130816  positive   
1  263405084770172928  negative   
2  262163168678248449  negative   

                                                text  
0  Gas by my house hit $3.39!!!! I'm going to Cha...  
1                                      Not Available  
2                                      Not Available  


Development dataset:
                   id sentiment  \
0  638060586258038784   neutral   
1  638061181823922176  positive   
2  638083821364244480   neutral   

                                                text  
0  05 Beat it - Michael Jackson - Thriller (25th ...  
1  Jay Z joins Instagram with nostalgic tribute t...  
2  Michael Jackson: Bad 25th Anniversary Edition ...  


Test dataset:
                   id sentiment  \
0  619950566786113536   neutral   
1  619969366986235905   neutral   
2  619971047195045888  negative   

                                                text  
0  Picturehouse's, Pink F

In [6]:
# Check the shape of the datasets, and thus their size.
print("Training dataset shape:", training.shape)
print("Development dataset shape:", develop.shape)
print("Test dataset shape:", test2016.shape)

Training dataset shape: (16045, 3)
Development dataset shape: (3947, 3)
Test dataset shape: (20342, 3)


In [7]:
# Check if there are any null values in the datasets.
print("Training null values:\n", pd.isnull(training).any())
print("\n")
print("Development null values:\n", pd.isnull(develop).any())
print("\n")
print("Test null values:\n", pd.isnull(test2016).any())

Training null values:
id           False
sentiment    False
text         False
dtype: bool


Development null values:
id           False
sentiment    False
text         False
dtype: bool


Test null values:
id           False
sentiment    False
text         False
dtype: bool


In [8]:
# Check the number of instances that have each sentiment label in the training set.
training.groupby('sentiment').size()

sentiment
negative    2373
neutral     6829
positive    6843
dtype: int64

### Step 4: Prepare Data

##### Data Cleaning/Data Wrangling/Collect more data (if necessary).

In [9]:
# Set up helpers: a stop-word set that keeps negation words (they carry sentiment), a lemmatizer and the vectorizers.
old_stop_words = set(stopwords.words('english'))
no_words = ['not', 'no', 'nor', 'against']
new_stop_words = set(word for word in old_stop_words if word not in no_words)

wordnet_lemmatizer = WordNetLemmatizer()
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
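
The reason for keeping negation words out of the stop-word list can be seen on a toy sentence. The sketch below is only an illustration: it uses a tiny hand-written stop-word set (an assumption, so that it runs without downloading the NLTK corpus) and a hypothetical helper named `remove_stop_words` that is not part of the notebook.

```python
# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
toy_stop_words = {"the", "is", "a", "this", "not", "no"}
negation_words = {"not", "no", "nor", "against"}

# Keep negation words as features by removing them from the stop-word set.
kept_stop_words = toy_stop_words - negation_words


def remove_stop_words(text, stop_words):
    """Hypothetical helper: drop every token that appears in stop_words."""
    return " ".join(tok for tok in text.lower().split() if tok not in stop_words)


sentence = "This movie is not good"
print(remove_stop_words(sentence, toy_stop_words))   # negation is lost
print(remove_stop_words(sentence, kept_stop_words))  # negation is preserved
```

With the unmodified set the sentence collapses to the same tokens as a positive review, which is the case the modified list is meant to avoid.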

In [10]:
# Remove the instances where the tweet text is not available
training = training[training.text != 'Not Available']
develop = develop[develop.text != 'Not Available']
test2016 = test2016[test2016.text != 'Not Available']

print("Training shape:", training.shape)
print("Development shape:", develop.shape)
print("Test shape:", test2016.shape)

Training shape: (11918, 3)
Development shape: (3083, 3)
Test shape: (15437, 3)


In [11]:
# Remove column id which is not needed for the purpose of sentiment analysis
training = training.drop(['id'], axis=1)
develop = develop.drop(['id'], axis=1)
test2016 = test2016.drop(['id'], axis=1)

In [12]:
# Check the number of instances per sentiment label in the training set again. The 'negative' class has far fewer
# instances than the other two, so the dataset is somewhat imbalanced.
training.groupby('sentiment').size()

sentiment
negative    1658
neutral     5143
positive    5117
dtype: int64

In [13]:
# Check the number of instances per sentiment label in the development set: if its class distribution differed much
# from that of the training set, the results might be misleading. The two distributions are roughly similar.
develop.groupby('sentiment').size()

sentiment
negative     552
neutral     1109
positive    1422
dtype: int64
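
To make the comparison between the two class distributions concrete, the counts printed above can be turned into percentages. This is a minimal sketch that hard-codes the counts from the outputs; `class_shares` is a hypothetical helper introduced only for this illustration.

```python
# Class counts copied from the outputs above (after removing 'Not Available' rows).
training_counts = {"negative": 1658, "neutral": 5143, "positive": 5117}
develop_counts = {"negative": 552, "neutral": 1109, "positive": 1422}


def class_shares(counts):
    """Return each class's share of the total as a percentage, rounded to 1 decimal."""
    total = sum(counts.values())
    return {label: round(100.0 * n / total, 1) for label, n in counts.items()}


print("training:", class_shares(training_counts))
print("develop: ", class_shares(develop_counts))
```

On the dataframes themselves, `training.sentiment.value_counts(normalize=True)` should give the same shares as fractions.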

In [14]:
# Create some regular expressions for pattern recognition, so as to exclude or replace some words in the text.
pat1 = r'@[A-Za-z0-9_]+'
pat2 = r'https?://[^ ]+'
combined_pat = r'|'.join((pat1, pat2))
www_pat = r'www\.[^ ]+'  # escape the dot so that it matches a literal '.'
negations_dic = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                "mustn't":"must not", "ain't":"is not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')
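
The effect of these patterns is easiest to see on a single string. The sketch below applies the same mention, URL and negation patterns to a made-up tweet; the negation dictionary is shortened and the variable names are chosen for the example only.

```python
import re

# Same patterns as in the cell above, with a shortened negation dictionary.
mention_pat = r'@[A-Za-z0-9_]+'
url_pat = r'https?://[^ ]+'
combined_pat = r'|'.join((mention_pat, url_pat))
negations = {"can't": "can not", "isn't": "is not", "won't": "will not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations.keys()) + r')\b')

tweet = "@bob this isn't working, see https://example.com/help"  # made-up tweet
no_mentions_or_urls = re.sub(combined_pat, '', tweet)
expanded = neg_pattern.sub(lambda m: negations[m.group()], no_mentions_or_urls)
print(expanded.strip())

# The pattern is case-sensitive, so a capitalised contraction is left untouched.
unchanged = neg_pattern.sub(lambda m: negations[m.group()], "Can't stop")
print(unchanged)
```

The second call shows a limitation worth keeping in mind: contractions are only expanded when they appear in lowercase.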

In [15]:
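# Function that takes the text of a tweet, cleans it (URLs, @names) or transforms it (negations) according to the
# patterns, tokenizes it, lowercases it, removes punctuation and stop words, lemmatizes the words and returns a
# cleaned string. Note that the negation pattern is case-sensitive and is applied before lowercasing, so capitalised
# contractions such as "Can't" are not expanded.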

def text_cleaning(text):

    pulled_data = BeautifulSoup(text, 'lxml')
    pulled_text = pulled_data.get_text()
    stripped = re.sub(combined_pat, '', pulled_text)
    stripped = re.sub(www_pat, '', stripped)
    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], stripped)
    tokens = nltk.word_tokenize(neg_handled)
    lower_case = [word.lower() for word in tokens]
    nonPunct = re.compile('.*[A-Za-z].*')
    raw_words = [tok for tok in lower_case if nonPunct.match(tok)]
    filtered_result = list(filter(lambda l: l not in new_stop_words, raw_words))
    lemmas = [wordnet_lemmatizer.lemmatize(t) for t in filtered_result]
    final_string = (" ".join(lemmas)).strip()
    return final_string

In [16]:
# Perform the cleaning of the datasets and create an extra column to store the cleaned tweet.
pd.set_option('display.max_colwidth', None) # Show the full content of cells (pandas before 1.0 used -1 for this)
training['cleaned_tweet'] = training.text.apply(text_cleaning)
develop['cleaned_tweet'] = develop.text.apply(text_cleaning)
test2016['cleaned_tweet'] = test2016.text.apply(text_cleaning)

In [17]:
# Check if everything went well with the data cleaning
print("Training set:", training[['sentiment','text','cleaned_tweet']].head(), "\n")
print("Development set:", develop[['sentiment','text','cleaned_tweet']].head(), "\n")
print("Test set:", test2016[['sentiment','text','cleaned_tweet']].head())

Training set:   sentiment  \
0  positive   
3  negative   
6  positive   
7  negative   
9  negative   

                                                                                                                                               text  \
0  Gas by my house hit $3.39!!!! I'm going to Chapel Hill on Sat. :)                                                                                  
3  Iranian general says Israel's Iron Dome can't deal with their missiles (keep talking like that and we may end up finding out)                      
6  with J Davlar 11th. Main rivals are team Poland. Hopefully we an make it a successful end to a tough week of training tomorrow.                    
7  Talking about ACT's &amp;&amp; SAT's, deciding where I want to go to college, applying to colleges and everything about college stresses me out.   
9  They may have a SuperBowl in Dallas, but Dallas ain't winning a SuperBowl. Not with that quarterback and owner. @S4NYC @RasmussenPoll    

In [18]:
# Make sure nothing is left null after the cleaning of the text.
print("Null Training:\n", training[training.isnull().any(axis=1)].head(), "\n")
print("Null Develop:\n", develop[develop.isnull().any(axis=1)].head(), "\n")
print("Null Test:\n", test2016[test2016.isnull().any(axis=1)].head())

Null Training:
Empty DataFrame
Columns: [sentiment, text, cleaned_tweet]
Index: [] 

Null Develop:
Empty DataFrame
Columns: [sentiment, text, cleaned_tweet]
Index: [] 

Null Test:
Empty DataFrame
Columns: [sentiment, text, cleaned_tweet]
Index: []


### Step 5: Feature Engineering

##### Feature selection/feature engineering (as in new features)/data transformations.

A form of feature selection has already been carried out in the cleaning stage, by lemmatizing the words and removing stop words and other tokens that do not seem necessary.

In [19]:
# Create the vectors out of the words using the count vectorizer. We transform the data so that it can be used as
# input for the algorithms.
vectorized_data = count_vectorizer.fit_transform(training.cleaned_tweet)
vectorized_develop = count_vectorizer.transform(develop.cleaned_tweet)
vectorized_test = count_vectorizer.transform(test2016.cleaned_tweet)

In [20]:
# Create the vectors out of the words using the tfidf vectorizer.
tfidf_vectorized_data = tfidf_vectorizer.fit_transform(training.cleaned_tweet)
tfidf_vectorized_develop = tfidf_vectorizer.transform(develop.cleaned_tweet)
tfidf_vectorized_test = tfidf_vectorizer.transform(test2016.cleaned_tweet)
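
What the vectorizers produce with `ngram_range=(1, 2)` can be inspected on a toy corpus. The sketch below uses three made-up sentences in place of the cleaned tweets; it shows that both unigrams and bigrams become columns, and that `transform` reuses the vocabulary learned from the training text instead of refitting.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus standing in for the cleaned tweets (made-up sentences).
train_docs = ["good movie", "not good movie", "bad movie"]
new_docs = ["not bad film"]

count_vec = CountVectorizer(ngram_range=(1, 2))
train_counts = count_vec.fit_transform(train_docs)  # learn the vocabulary on training text only
new_counts = count_vec.transform(new_docs)          # reuse that vocabulary, no refitting

# Unigrams and bigrams both become columns.
print(sorted(count_vec.vocabulary_))
print(train_counts.shape, new_counts.shape)

# TF-IDF uses the same vocabulary but weights each term by how rare it is across documents.
tfidf_vec = TfidfVectorizer(ngram_range=(1, 2))
train_tfidf = tfidf_vec.fit_transform(train_docs)
print(train_tfidf.shape)
```

Terms that never occurred in the training text, such as "film" or the bigram "not bad" here, are silently ignored when new text is transformed.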

In [21]:
# Transform the sentiment (positive, negative, neutral) in numerical form to be able to compare the results easily.
def sentiment2target(sentiment):
    return {
        'negative': 0,
        'neutral': 1,
        'positive' : 2
    }[sentiment]

training_targets = training.sentiment.apply(sentiment2target)
develop_targets = develop.sentiment.apply(sentiment2target)
test_targets = test2016.sentiment.apply(sentiment2target)

### Step 6: Algorithm Selection

##### Select a set of algorithms to apply, select evaluation metrics, and evaluate/compare algorithms.

We are going to compare the following algorithms by their accuracy:
1. Random Forests
2. K-Nearest Neighbors
3. LinearSVC
4. Multinomial Naive Bayes
5. Logistic Regression
6. Decision Trees

In [24]:
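# Evaluation metric: accuracy
scoring = 'accuracy'
# random_state only has an effect when shuffle=True (recent scikit-learn versions raise an error otherwise), so it is
# omitted here; the folds remain contiguous blocks of rows, as before.
kfold = KFold(n_splits=10)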
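
`KFold` without `shuffle=True` splits the rows in their existing order, and `random_state` only matters once shuffling is switched on. Since the training dataframe is a concatenation of three yearly files, each unshuffled fold is a contiguous slice of that concatenation, which may be one reason for the fairly large spread between folds. The sketch below illustrates the difference on ten stand-in row indices; it is not the notebook's configuration.

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(10)  # stand-in for 10 training rows in file order

# Without shuffling, each test fold is a contiguous block of rows.
for _, test_idx in KFold(n_splits=5).split(indices):
    print("unshuffled test fold:", test_idx)

# With shuffling, rows are permuted first; random_state makes the split repeatable.
shuffled = KFold(n_splits=5, shuffle=True, random_state=7)
for _, test_idx in shuffled.split(indices):
    print("shuffled test fold:  ", test_idx)
```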

In [25]:
# Perform 10-fold cross-validation on the training set to find the algorithm with the best performance. Even though we
# have a development set, we choose to perform this cross-validation and use the development set to tune the
# hyperparameters of the chosen model. We first try the count-vectorized set and then the tfidf-vectorized set.

# count vectorizer
models = []
models.append(('RF',  RandomForestClassifier()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVM', LinearSVC()))
models.append(('MNB', MultinomialNB()))
models.append(('LR',  LogisticRegression()))
models.append(('DT',  DecisionTreeClassifier()))

results = []
names   = []

for name, model in models:
    cv_results = cross_val_score(model, vectorized_data, training_targets, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    print("%03s: %f (+/- %f)" % (name, cv_results.mean(), cv_results.std()))

 RF: 0.603856 (+/- 0.081373)
KNN: 0.447209 (+/- 0.075410)
SVM: 0.617535 (+/- 0.070093)
MNB: 0.594719 (+/- 0.040343)
 LR: 0.624750 (+/- 0.072598)
 DT: 0.595214 (+/- 0.077542)


In [26]:
#tfidf vectorizer
models = []
models.append(('RF',  RandomForestClassifier()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVM', LinearSVC()))
models.append(('MNB', MultinomialNB()))
models.append(('LR',  LogisticRegression()))
models.append(('DT',  DecisionTreeClassifier()))

results = []
names   = []

for name, model in models:
    cv_results = cross_val_score(model, tfidf_vectorized_data, training_targets, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    print("%03s: %f (+/- %f)" % (name, cv_results.mean(), cv_results.std()))

 RF: 0.599578 (+/- 0.076483)
KNN: 0.510566 (+/- 0.025548)
SVM: 0.622485 (+/- 0.070648)
MNB: 0.582808 (+/- 0.033425)
 LR: 0.603607 (+/- 0.070324)
 DT: 0.587244 (+/- 0.071337)
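
In the two runs above the vectorizers were fitted on the whole training set before cross-validation, so each fold's vocabulary and IDF weights also reflect its held-out rows. A common alternative, sketched here on a made-up toy corpus and not taken from the notebook, is to wrap the vectorizer and the classifier in a pipeline so that the vectorizer is refitted on the training part of every fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up cleaned tweets and labels (0 = negative, 2 = positive), for illustration only.
texts = ["good day", "great game", "love this", "nice win", "happy today", "best show",
         "bad day", "awful game", "hate this", "sad loss", "angry today", "worst show"]
labels = [2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0]

# The pipeline refits the vectorizer on the training part of every fold.
pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
folds = KFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, texts, labels, cv=folds, scoring="accuracy")
print("%s: %f (+/- %f)" % ("SVM", scores.mean(), scores.std()))
```

Whether this changes the ranking of the algorithms on the real data is not something the toy example can show; it only demonstrates the mechanics.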


### Step 7: Model Training

##### Apply ensembles and improve performance by hyperparameter optimisation.

In [35]:
# Try to find the best hyperparameters for LinearSVC, the model we chose, using grid search on the development set.
params = {'C': [0.01, 0.1, 1, 10, 100], 'loss': ['hinge', 'squared_hinge']}
linearSVC = GridSearchCV(LinearSVC(), params, scoring='accuracy')
linearSVC.fit(tfidf_vectorized_develop, (np.array(develop_targets)).ravel())
print(linearSVC.best_params_)

{'loss': 'squared_hinge', 'C': 0.1}
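
Grid search fits the model once per parameter combination and per internal cross-validation fold (here 5 values of `C` times 2 loss functions), then keeps the combination with the highest mean score. The sketch below runs the same grid on synthetic 3-class data, which is an assumption standing in for the TF-IDF matrix; `cv=3` and `max_iter=5000` are choices made for the example, not settings from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic 3-class data standing in for the TF-IDF matrix (an assumption for illustration).
X, y = make_classification(n_samples=150, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

params = {'C': [0.01, 0.1, 1, 10, 100], 'loss': ['hinge', 'squared_hinge']}
# cv is set explicitly; max_iter is raised to reduce convergence warnings for large C.
search = GridSearchCV(LinearSVC(max_iter=5000), params, scoring='accuracy', cv=3)
search.fit(X, y)

print(search.best_params_)           # the winning combination
print(round(search.best_score_, 3))  # its mean cross-validated accuracy
```

`search.cv_results_` holds the mean score of every combination, which helps to judge how much the winner actually stands out.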


### Step 8: Finalise Model

##### Predictions on validation set, create model from the entire (training) dataset.

In [40]:
# Train LinearSVC with the tuned hyperparameters on the full training set and evaluate it on the test set.
linearSVC = LinearSVC(loss='squared_hinge', C=0.1)
model = linearSVC.fit(tfidf_vectorized_data, training_targets)
pred = model.predict(tfidf_vectorized_test)
print("Accuracy for test data is", accuracy_score(test_targets, pred))

Accuracy for test data is 0.593444322083


### Procedure and choices made

After loading and exploring the datasets, we clean the tweets' text using regular expressions and a few helper objects. We remove URLs and @names because they are not useful for sentiment analysis, and we expand contracted negations according to the patterns (for example, "can't" becomes "can not") so that the negation survives as a separate word. To reduce the number of features we also remove stop words, except for negation words, and perform lemmatization.

Then we vectorize the remaining features, once with the count vectorizer and once with the tfidf vectorizer, to check which representation makes each algorithm perform better. We also encode the sentiment of the tweets as integers in the range 0 to 2, so that predictions and true labels can be compared easily.

Then, we perform algorithm selection using 10-fold cross-validation on the training set to find the algorithm with the best performance. Even though we have a development set, we choose to use it only to tune the hyperparameters of the chosen model. The algorithms we try are: RandomForestClassifier, KNeighborsClassifier, LinearSVC, MultinomialNB, LogisticRegression and DecisionTreeClassifier.

As the cross-validation results for the two vectorized versions of the training data show, LogisticRegression produces the best mean accuracy on count-vectorized data (0.625) and LinearSVC produces the best mean accuracy on tfidf-vectorized data (0.622). Because LinearSVC has the slightly smaller standard deviation of the two (0.071 against 0.073), we choose it for the next step, the tuning of the hyperparameters.
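
The selection reasoning can be retraced from the reported numbers. The sketch below hard-codes the means and standard deviations printed in Step 6 and picks the best model per vectorizer; `best_by_mean` is a hypothetical helper used only for this illustration.

```python
# (mean accuracy, standard deviation) copied from the two cross-validation outputs in Step 6.
count_results = {"RF": (0.603856, 0.081373), "KNN": (0.447209, 0.075410),
                 "SVM": (0.617535, 0.070093), "MNB": (0.594719, 0.040343),
                 "LR": (0.624750, 0.072598), "DT": (0.595214, 0.077542)}
tfidf_results = {"RF": (0.599578, 0.076483), "KNN": (0.510566, 0.025548),
                 "SVM": (0.622485, 0.070648), "MNB": (0.582808, 0.033425),
                 "LR": (0.603607, 0.070324), "DT": (0.587244, 0.071337)}


def best_by_mean(results):
    """Return the name of the model with the highest mean accuracy."""
    return max(results, key=lambda name: results[name][0])


best_count = best_by_mean(count_results)
best_tfidf = best_by_mean(tfidf_results)
print("count:", best_count, count_results[best_count])
print("tfidf:", best_tfidf, tfidf_results[best_tfidf])
```

The two leading candidates differ by roughly 0.002 in both mean and standard deviation, far less than the spread between folds, so the choice between them is a close call.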

We use grid search on the development set to find the optimal hyperparameters. We tune C, the penalty parameter of the error term, and the loss function. The best hyperparameters are C = 0.1 and loss = squared_hinge.

Finally, we train the algorithm with these hyperparameters on the training set and evaluate the model on the test set, where it reaches an accuracy of about 0.593.