# Sentiment Analysis on Tweets

For this project, we use a large dataset of tweets about US airlines, each labeled as positive, negative, or neutral. Our goal is to train a model on this dataset that predicts the sentiment (positive, negative, or neutral) of new tweets about airlines.
 

## Load Data
The standard libraries are loaded.

The data is first loaded using `pd.read_csv()`. Then the two columns `text` and `airline_sentiment` are extracted from the file and stored in `df2`.
We only want to analyze the tweets that are labeled with 100% confidence in the dataset, so `df2` is filtered on `airline_sentiment_confidence`. The tweet texts are stored separately in an array, `textArray`.

In [17]:
import pandas as pd
import numpy as np

df = pd.read_csv('Twitter_analysis_sentiment.csv')
df2 = df[['text','airline_sentiment']]
# the tweets we are using are those labelled with 100% confidence
df2 = df2[df['airline_sentiment_confidence'] == 1]
textArray = np.array(df2['text'])
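
The column selection and the confidence filter can also be written as a single `.loc` call. The sketch below uses a small made-up DataFrame (the three rows are hypothetical; only the column names come from the dataset) to show the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the column names match the real CSV.
df = pd.DataFrame({
    "text": ["great flight", "lost my bag", "on time today"],
    "airline_sentiment": ["positive", "negative", "neutral"],
    "airline_sentiment_confidence": [1.0, 0.65, 1.0],
})

# Build the boolean mask once, then select rows and columns together.
confident = df["airline_sentiment_confidence"] == 1
df2 = df.loc[confident, ["text", "airline_sentiment"]]

# Keep the tweet texts as a plain NumPy array for the vectorizer.
text_array = df2["text"].to_numpy()
print(text_array)
```

Selecting rows and columns in one step avoids applying a mask built from `df` to the separate frame `df2`, which only works because the two share the same index.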

## Vectorize Tweets
Here the class `CountVectorizer` is imported from `sklearn`, along with the library `nltk`, which is used to tokenize the text. `CountVectorizer` converts a collection of texts into a matrix of token counts: the number of times each token occurs in each tweet.

The array of text is transformed using the `CountVectorizer` with `min_df=2` and `tokenizer = nltk.word_tokenize` and stored in `tweet_counts`. With `min_df=2`, tokens that appear in fewer than two tweets are dropped from the vocabulary.

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
import nltk
tweet_vec = CountVectorizer(min_df=2, tokenizer=nltk.word_tokenize) 
tweet_counts = tweet_vec.fit_transform(textArray)
print(tweet_counts.shape)

(10445, 5442)
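
To make the effect of `min_df=2` concrete, here is a minimal pure-Python sketch of document-frequency filtering followed by counting. It is not scikit-learn's implementation: the toy tweets are made up, and whitespace splitting stands in for `nltk.word_tokenize`.

```python
from collections import Counter

# Hypothetical toy corpus; str.split() stands in for nltk.word_tokenize.
tweets = [
    "flight delayed again",
    "flight was great",
    "delayed and rude staff",
]
tokenized = [t.lower().split() for t in tweets]

# Document frequency: number of tweets containing each token at least once.
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

# Keep only tokens that appear in at least min_df tweets.
min_df = 2
vocab = sorted(tok for tok, n in doc_freq.items() if n >= min_df)

# One row per tweet, one column per kept token.
counts = [[toks.count(tok) for tok in vocab] for toks in tokenized]
print(vocab)
print(counts)
```

Only `delayed` and `flight` survive the filter here, so the toy matrix has three rows and two columns. The shape printed above has the same meaning: 10445 tweets and 5442 tokens that each occur in at least two tweets.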


## Create TF-IDF Vector
Next the class `TfidfTransformer` is imported from `sklearn`. As with `CountVectorizer`, an instance is stored in a variable, `tfidf_transformer`, which is fitted to `tweet_counts` and used to transform it.

`TfidfTransformer` converts the count matrix to a tf-idf representation, i.e. term frequency times inverse document frequency. This applies a weighting factor to the counts that reduces the impact of words that occur frequently across the corpus and are therefore less informative than other words.

In [19]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
tweet_tfidf = tfidf_transformer.fit_transform(tweet_counts)
print(tweet_tfidf.shape)

(10445, 5442)
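
The weighting can be worked through by hand on a tiny matrix. The sketch below assumes scikit-learn's documented defaults for `TfidfTransformer` (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization); the count matrix itself is hypothetical.

```python
import numpy as np

# Hypothetical count matrix: 3 documents (rows) x 2 tokens (columns).
counts = np.array([[1.0, 1.0],
                   [0.0, 2.0],
                   [0.0, 1.0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)   # documents containing each token

# Smoothed inverse document frequency (assumed default formula).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit Euclidean length.
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)
print(np.round(tfidf, 4))
```

In the first row both tokens occur once, but the token found in only one document ends up with the larger weight, which is the down-weighting of common words described above. The shape is unchanged, as in the notebook output.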


## Create Target Vector
We create a `target` vector that maps `positive` tweets to `1`, `negative` tweets to `-1`, and `neutral` tweets to `0`.

In [20]:
# map negative to -1, positive to 1, and neutral to 0
target = np.zeros(textArray.shape[0])
target[df2['airline_sentiment'] == 'positive'] = 1
target[df2['airline_sentiment'] == 'negative'] = -1
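
The same mapping on a short, made-up label array shows why neutral needs no explicit assignment: the vector starts as zeros, and only the positive and negative positions are overwritten.

```python
import numpy as np

# Hypothetical labels in the same three categories as the dataset.
labels = np.array(["positive", "negative", "neutral", "negative"])

# Start from zeros so that neutral tweets keep the value 0.
target = np.zeros(labels.shape[0])
target[labels == "positive"] = 1
target[labels == "negative"] = -1
print(target)
```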

## Using SVM from SKlearn

### Use k-fold cross validation
We first use an SVM to classify the data. K-fold cross-validation is used to train and test the model. The data is shuffled before cross-validation so that a fold does not consist of consecutive tweets about one airline, since the dataset is organized by airline.

The cross-validation is run for `10 folds`. During cross-validation, the `confusion matrix` is accumulated across folds using `confusion_matrix` from sklearn.

In [21]:
from sklearn import svm
svc = svm.SVC(probability=False,  kernel="rbf", C=2.8, gamma=.0073,verbose=10)

from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
nfold = 10 
kf = KFold(n_splits=nfold, shuffle = True)
C = np.zeros([3, 3])
acc = []
i = 0
for train, test in kf.split(tweet_tfidf):
    i = i+1
    Xtr = tweet_tfidf[train,:]
    ytr = target[train]
    Xts = tweet_tfidf[test,:]
    yts = target[test] 
    
    svc.fit(Xtr,ytr) 
    yhat = svc.predict(Xts)
    C = C + confusion_matrix(yts, yhat, labels=[1,-1,0])
    acci = np.mean(yhat == yts)
    acc.append(acci)
    print(i)

[LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][LibSVM]
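
What shuffled k-fold splitting does can be sketched with NumPy alone. The helper `shuffled_kfold` below is a hypothetical illustration, not scikit-learn's code: it permutes the sample indices once, cuts them into folds, and uses each fold in turn as the test set.

```python
import numpy as np

def shuffled_kfold(n_samples, n_splits, seed=0):
    """Yield (train, test) index arrays after shuffling the indices once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, n_splits)
    for k in range(n_splits):
        test = folds[k]
        train = np.concatenate(
            [folds[j] for j in range(n_splits) if j != k]
        )
        yield train, test

splits = list(shuffled_kfold(n_samples=10, n_splits=5))
for train, test in splits:
    print(test)
```

Every sample appears in exactly one test fold, so the accumulated confusion matrix counts each tweet once. In scikit-learn, `KFold` accepts a `random_state` argument that plays the role of `seed` here and makes the shuffled folds reproducible between runs.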

### Confusion Matrix and Accuracy
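The accuracy of the SVM classifier is computed along with its standard error, and the confusion matrix is printed. Rows of the confusion matrix are the true labels and columns are the predicted labels, both in the order positive, negative, neutral. The SVM method gave an accuracy of around `75%`, but it predicts negative for almost every tweet: nearly all negative tweets are classified correctly, while only about 28% of positive tweets and about 2% of neutral tweets are.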

In [22]:
accm= np.mean(acc) 
acc_se = np.std(acc)/np.sqrt(nfold-1)
Cm = C / np.sum(C, axis = 1)[:,None] 
print('C = ')
print(np.array_str(C, precision=4, suppress_small=True))
print('Cm = ')
print(np.array_str(Cm, precision=4, suppress_small=True))
print('Accuracy =  {0:.4f}, SE={1:.4f}'.format(accm, acc_se)) 


C = 
[[ 418. 1096.    1.]
 [   3. 7378.    1.]
 [  10. 1506.   32.]]
Cm = 
[[0.2759 0.7234 0.0007]
 [0.0004 0.9995 0.0001]
 [0.0065 0.9729 0.0207]]
Accuracy =  0.7495, SE=0.0063
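
As a check on how to read these numbers, the summary quantities can be recomputed from the printed confusion matrix. The sketch below copies `C` from the output above; the variable names are ours.

```python
import numpy as np

# Confusion matrix printed above; rows (true) and columns (predicted)
# are ordered [1, -1, 0], i.e. positive, negative, neutral.
C = np.array([[418., 1096., 1.],
              [3., 7378., 1.],
              [10., 1506., 32.]])

total = C.sum()
pooled_accuracy = np.trace(C) / total   # correct predictions / all tweets
class_share = C.sum(axis=1) / total     # fraction of tweets per true class
recall = np.diag(C) / C.sum(axis=1)     # same values as the diagonal of Cm

print(round(pooled_accuracy, 4))
print(np.round(class_share, 4))
print(np.round(recall, 4))
```

The pooled accuracy comes out close to the mean of the per-fold accuracies reported above. The class shares show that roughly 71% of the tweets are negative, so a classifier that always answered negative would already score about 71%; the SVM's 75% is only a modest improvement over that baseline.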


## Using Multinomial Naive Bayes Classifier from SKlearn

### Use k-fold cross validation
Next, we also use a classifier more commonly applied to text, the `Multinomial Naive Bayes` classifier from sklearn. This classifier uses `Bayes' theorem`, together with the assumption that the features are conditionally independent given the class, to estimate the probability of each class given the features of a tweet, and it predicts the most probable class.

The tweets are shuffled and run through the k-fold cross-validation. The accuracy vector and confusion matrix are also created.

In [23]:
from sklearn.naive_bayes import MultinomialNB
nfold = 10 
kf = KFold(n_splits=nfold, shuffle = True)
C = np.zeros([3, 3])
acc = []
i = 0
for train, test in kf.split(tweet_tfidf):
    i = i+1
    Xtr = tweet_tfidf[train,:]
    ytr = target[train]
    Xts = tweet_tfidf[test,:]
    yts = target[test] 
    clf = MultinomialNB().fit(Xtr,ytr)
    yhat = clf.predict(Xts)
    C = C + confusion_matrix(yts, yhat, labels=[1,-1,0])
    acci = np.mean(yhat == yts)
    acc.append(acci)
    # print(i)
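
The decision rule behind multinomial Naive Bayes can be worked through by hand. The sketch below is a simplified two-class illustration with made-up token counts and priors, not sklearn's implementation; `alpha = 1.0` is Laplace smoothing, which is also the default value of `alpha` in `MultinomialNB`.

```python
import math

# Hypothetical token counts per class over a 3-word vocabulary.
vocab = ["delayed", "great", "flight"]
class_counts = {
    "negative": [8, 1, 6],
    "positive": [1, 7, 5],
}
# Hypothetical class priors (fraction of training tweets in each class).
priors = {"negative": 0.7, "positive": 0.3}
alpha = 1.0  # Laplace smoothing

def log_posterior(x, label):
    """Unnormalized log P(label | x) for a vector of token counts x."""
    counts = class_counts[label]
    denom = sum(counts) + alpha * len(vocab)
    log_lik = sum(
        n * math.log((c + alpha) / denom) for n, c in zip(x, counts)
    )
    return math.log(priors[label]) + log_lik

x = [0, 1, 1]  # a tweet containing "great" and "flight" once each
scores = {label: log_posterior(x, label) for label in class_counts}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Even though the prior favors the negative class, the word `great` is much more likely under the positive class, so the positive class gets the higher score for this tweet. The same formula applies when the counts are replaced by fractional tf-idf weights, as in the notebook.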

### Confusion Matrix and Accuracy
Finally, the confusion matrix and accuracy of the Naive Bayes classifier are printed. This method gave an accuracy of around `77%`, slightly better than the SVM method.

As with the SVM method, the confusion matrix shows that most of the tweets were accurately predicted as negative (since most of the tweets were negative). Also as with the SVM method, most neutral tweets were predicted as negative rather than neutral (about 83% here, compared with about 97% for the SVM), so both methods have trouble distinguishing negative from neutral sentiment.

In [24]:
accm= np.mean(acc) 
acc_se = np.std(acc)/np.sqrt(nfold-1)
Cm = C / np.sum(C, axis = 1)[:,None] 
print('C = ')
print(np.array_str(C, precision=4, suppress_small=True))
print('Cm = ')
print(np.array_str(Cm, precision=4, suppress_small=True))
print('Accuracy =  {0:.4f}, SE={1:.4f}'.format(accm, acc_se)) 


C = 
[[ 450. 1058.    7.]
 [   0. 7372.   10.]
 [  19. 1284.  245.]]
Cm = 
[[0.297  0.6983 0.0046]
 [0.     0.9986 0.0014]
 [0.0123 0.8295 0.1583]]
Accuracy =  0.7723, SE=0.0046
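
The standard error is computed as `np.std(acc)/np.sqrt(nfold-1)`, which may look unusual. Since `np.std` divides by `n` by default, this is the same quantity as the textbook sample standard deviation divided by `sqrt(n)`. A quick check with the standard library, using made-up fold accuracies:

```python
import math
import statistics

# Hypothetical per-fold accuracies; the real values come from the loop.
acc = [0.75, 0.78, 0.76, 0.77, 0.79]
n = len(acc)

# Form used in the notebook: population std divided by sqrt(n - 1).
se_notebook = statistics.pstdev(acc) / math.sqrt(n - 1)

# Textbook form: sample std (n - 1 in the denominator) divided by sqrt(n).
se_textbook = statistics.stdev(acc) / math.sqrt(n)

print(se_notebook, se_textbook)
```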
