# Sentiment Analysis

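Plenty of Python libraries can score sentiment. We'll try two ready-made ones, NLTK and TextBlob, and then build our own classifier.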

In [None]:
# !pip install nltk
# !pip install textblob

# NLTK

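NLTK (the Natural Language Toolkit) does much more than sentiment analysis, such as tokenising, stemming, and part-of-speech tagging. Here we only use its VADER sentiment analyser.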

In [None]:
import nltk
nltk.download('vader_lexicon')

### A Sentiment Analyser

The `SentimentIntensityAnalyzer` below gives us the `polarity_scores` for each piece of text. The output is a dictionary with four values:
- `neg`: the share of the text that reads as negative
- `neu`: the share of the text that reads as neutral
- `pos`: the share of the text that reads as positive
- `compound`: the aggregated sentiment, normalised to run from -1 (most negative) to +1 (most positive).
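
If you want a single label rather than four numbers, a common convention is to cut the `compound` score at ±0.05. Here is a minimal sketch of that idea. The helper name `label_from_compound` is made up for this example, and the dictionaries are hand-written stand-ins, not real VADER output.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER-style score dictionary to a coarse sentiment label."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hand-written dictionaries in the same shape as polarity_scores() output.
# The numbers are invented for illustration, not produced by VADER.
examples = [
    {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.62},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.48},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]
labels = [label_from_compound(s) for s in examples]
print(labels)
```

The threshold is a parameter on purpose: whether 0.05 is the right cut-off depends on how much "neutral" you want to allow.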

The analyser uses VADER (Valence Aware Dictionary and sEntiment Reasoner). Rather than being trained like a typical classifier, VADER is rule-based: it is "a type of sentiment analysis that is based on lexicons of sentiment-related words. In this approach, each of the words in the lexicon is rated as to whether it is positive or negative, and in many cases, how positive or negative."
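
To make the lexicon idea concrete, here is a toy sketch: look up each word's rating, add the ratings up, and squash the total into the range -1 to 1. The words and ratings in `TOY_LEXICON` are invented for this example and are not taken from VADER's real lexicon. The real analyser also applies extra rules (negation, capitals, punctuation) that this sketch ignores.

```python
import math
import re

# Hypothetical mini-lexicon: words and ratings are invented for this sketch.
TOY_LEXICON = {"great": 3.0, "good": 2.0, "bad": -2.5, "terrible": -3.0}


def toy_compound(text, alpha=15):
    """Sum word ratings, then squash the total into the range (-1, 1)."""
    words = re.findall(r"[a-z']+", text.lower())
    total = sum(TOY_LEXICON.get(word, 0.0) for word in words)
    # total / sqrt(total^2 + alpha) keeps the result strictly between -1 and 1.
    return total / math.sqrt(total * total + alpha)


print(toy_compound("This restaurant was great."))
print(toy_compound("The food was bad and the service was terrible."))
print(toy_compound("We sat by the window."))
```

Words missing from the lexicon contribute nothing, which is why the third sentence comes out at exactly zero.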

You can read more about the classifier [here](http://t-redactyl.io/blog/2017/04/using-vader-to-handle-sentiment-analysis-with-social-media-text.html).

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

sia = SIA()
sia.polarity_scores("This restaurant was great.")

Let's use an example from the post linked above to see how the analyser handles "real" data.

In [None]:
text = "I just got a call from my boss - does he realise it's Saturday?"
sia.polarity_scores(text)

# Add emoticon
text = "I just got a call from my boss - does he realise it's Saturday? :("
sia.polarity_scores(text)

# Add emoji

# TextBlob

TextBlob is similar to NLTK and offers many of the same functions. spaCy is another library in the same space.

The thing is, a lot of these libraries are _almooooost_ the same, but just a little bit different. So, for example, if we try to do the same sentiment analysis with `TextBlob`, we don't get four items in our output but only two: the `polarity` (from -1 to 1) and the `subjectivity` (from 0 to 1).

In [None]:
from textblob import TextBlob

In [None]:
phrase = TextBlob("This restaurant was great.")
phrase.sentiment

# But how did they learn?

It's just a classification (ish) problem.

In [None]:
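# TextBlob's default analyzer uses a lexicon of adjectives drawn from product reviews
# (its optional NaiveBayesAnalyzer is trained on NLTK's movie reviews corpus)
# NLTK's VADER uses a human-rated lexicon aimed at social media text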

# DIY

We can do it ourselves! We'll use the Sentiment140 dataset (http://help.sentiment140.com/for-students) and build our own classifier.

In [None]:
# Do the same thing, again, where we load in all the data, drop NAs, and see how much data we have.

import pandas as pd

columns = ['polarity', 'id', 'datetime', 'query', 'username', 'content']
df = pd.read_csv("trainingandtestdata/training.1600000.processed.noemoticon.csv", 
                 names=columns,
                 encoding='latin-1')
df = df.dropna()
# A notebook only displays the last expression in a cell, so print the shape explicitly
print(df.shape)
df.head()

In [None]:
# See the breakdown of the label, which, in this case, is the polarity. 

df.polarity.value_counts()

A polarity of 0 is negative and 4 is positive. We'll recode that as 0 and 1 instead.

In [None]:
df.polarity = df.polarity.replace({4: 1})
df.polarity.value_counts()

We're in for some long run times below, so take a sample.

In [None]:
# The sample size can go up or down, depending on how long you're willing to wait
df = df.sample(30000)
df.polarity.value_counts()
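
A random sample changes every time the cell runs, so the class counts above will wobble a little. In pandas, passing `random_state=42` to `df.sample` makes the draw repeatable. Here is the same idea as a small standard-library sketch, using a made-up list of labels in place of the real `polarity` column.

```python
import random
from collections import Counter

# Stand-in for the polarity column: a perfectly balanced list of made-up labels.
labels = [0] * 5000 + [1] * 5000

rng = random.Random(42)           # fixed seed so the sample is repeatable
sample = rng.sample(labels, 1000)

counts = Counter(sample)
print(counts)
```

Even with a balanced source, the sample is only roughly balanced, which is worth checking before training on it.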

# Vectorize

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
%%time

# Create a TfidfVectorizer that keeps only the 5,000 most frequent terms.
# max_features ranks terms by how often they appear across the whole corpus.
vectorizer = TfidfVectorizer(max_features=5000)
vectors = vectorizer.fit_transform(df.content)
# get_feature_names_out() needs scikit-learn 1.0 or newer
words_df = pd.DataFrame(vectors.toarray(), columns=vectorizer.get_feature_names_out())
words_df.head()
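
What are the numbers in that table? Roughly: how often a word appears in a sentence, scaled up if the word is rare across all sentences, then normalised so each row has length 1. Here is a from-scratch sketch on a made-up three-sentence corpus. It follows what I understand to be scikit-learn's default formula (smoothed IDF plus L2 normalisation), but treat it as an illustration and not as the library's exact implementation.

```python
import math
from collections import Counter

docs = ["good movie", "bad movie", "good good fun"]  # made-up toy corpus
tokenised = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenised for word in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for doc in tokenised for word in set(doc))
# Smoothed inverse document frequency: rarer words get a larger weight.
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}


def tfidf_row(tokens):
    """Weight each word count by its IDF, then scale the row to unit length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


rows = [tfidf_row(doc) for doc in tokenised]
for doc, row in zip(docs, rows):
    print(doc, [round(v, 2) for v in row])
```

In "bad movie", the word "bad" ends up with a higher weight than "movie" because "movie" shows up in more documents.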

## Question: should the output be a category, a probability, an amount?

Let's train a few different models.
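
To see how the three kinds of output relate, here is a small sketch: an unbounded score (an amount) is squashed into a probability with the sigmoid function, and the probability is cut at 0.5 to get a category. The raw scores are made up, and the 0.5 cut-off is an assumption for this example.

```python
import math


def sigmoid(score):
    """Squash any real-valued score into a probability between 0 and 1."""
    return 1 / (1 + math.exp(-score))


# Made-up raw scores, standing in for what a linear model might compute.
raw_scores = [-3.0, 0.0, 2.0]
probabilities = [sigmoid(s) for s in raw_scores]
categories = [1 if p >= 0.5 else 0 for p in probabilities]

for s, p, c in zip(raw_scores, probabilities, categories):
    print(f"amount={s:+.1f}  probability={p:.2f}  category={c}")
```

Each step throws information away: the category hides how confident the probability was.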

In [None]:
# Create our X and y

X = words_df
y = df.polarity

In [None]:
# import models 

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

In [None]:
%%time
# Run a linear regression (the %%time magic must be the first line of the cell)
linreg = LinearRegression()
linreg.fit(X, y)

In [None]:
%%time
# Run a logistic regression (the %%time magic must be the first line of the cell)
logreg = LogisticRegression(solver='lbfgs')
logreg.fit(X, y)

In [None]:
%%time
# Run a RandomForestClassifier (the %%time magic must be the first line of the cell)
forest = RandomForestClassifier(n_estimators=20)
forest.fit(X, y)

In [None]:
%%time
# Run a LinearSVC (the %%time magic must be the first line of the cell)
svc = LinearSVC()
svc.fit(X, y)

# Use our models on some new data

These sentences are horrible; we need better ones!

In [None]:
# Create some test data

unknown = pd.DataFrame({
   'sentence': [
       "I'm not sure how I feel about toast",
       "Did you see the baseball game yesterday?",
       "I find chirping birds irritating, but I know I'm not the only one"
   ] 
})
unknown

In [None]:
# Put it through the vectoriser

# transform, not fit_transform, because we already learned all our words
unknown_vectors = vectorizer.transform(unknown.sentence)
# get_feature_names_out() needs scikit-learn 1.0 or newer
unknown_words_df = pd.DataFrame(unknown_vectors.toarray(), columns=vectorizer.get_feature_names_out())
unknown_words_df.head()
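
Why `transform` and not `fit_transform`? The models expect exactly the columns they were trained on, so new text has to be squeezed into the old vocabulary. Here is a stripped-down sketch using plain word counts instead of TF-IDF. The function names and sentences are made up for this example.

```python
from collections import Counter


def fit_vocabulary(sentences):
    """Learn the set of known words from the training sentences."""
    return sorted({word for s in sentences for word in s.lower().split()})


def transform(sentences, vocabulary):
    """Count only words already in the vocabulary; unseen words are dropped."""
    rows = []
    for s in sentences:
        counts = Counter(s.lower().split())
        rows.append([counts[word] for word in vocabulary])
    return rows


train = ["i love toast", "i hate mondays"]      # made-up training text
vocabulary = fit_vocabulary(train)
new_rows = transform(["i love baseball"], vocabulary)

print(vocabulary)
print(new_rows)
```

The word "baseball" was never seen during fitting, so it silently disappears. The new row still has one column per known word.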

In [None]:
# Predict using all our models. 

# Linear Regression predictions
unknown['prediction_linreg'] = linreg.predict(unknown_words_df)

# Logistic Regression predictions + probabilities
unknown['prediction_logreg'] = logreg.predict(unknown_words_df)
unknown['prediction_logreg_proba'] = logreg.predict_proba(unknown_words_df)[:,1]

# Random forest predictions + probabilities
unknown['prediction_forest'] = forest.predict(unknown_words_df)
unknown['prediction_forest_proba'] = forest.predict_proba(unknown_words_df)[:,1]

# SVC predictions
unknown['prediction_svc'] = svc.predict(unknown_words_df)

In [None]:
unknown

## Thoughts and feelings on those numbers and how they agree or disagree? What's a 0.5 mean? Do we like the 0/1 or the percent? What's that mean compared to how we usually deal with sentiment?

# Maybe we should have tested this?

## Split and train

Linear regression outputs a continuous number, not a class, so it doesn't fit into the confusion matrix scheme very well. We're skipping it.
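
If you did want to squeeze a regression into a confusion matrix, you would first have to pick a cut-off and turn each number into a 0 or 1. Here is a sketch with made-up outputs. The 0.5 threshold is an assumption, not something the data dictates.

```python
# Made-up regression outputs: note that some fall outside the 0-1 range.
linreg_outputs = [-0.2, 0.35, 0.5, 0.81, 1.4]

# Assumed cut-off: 0.5 is a common default, not a rule.
THRESHOLD = 0.5
linreg_labels = [1 if value >= THRESHOLD else 0 for value in linreg_outputs]

print(linreg_labels)
```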

In [None]:
# Create test and training data 

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
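
By default, `train_test_split` shuffles the rows and holds out 25% of them for testing. Passing `random_state` makes the split repeatable. Here is a rough standard-library sketch of the same idea. `simple_split` is a made-up helper, not scikit-learn's implementation.

```python
import random


def simple_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle row indices, then hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(values, idx):
        return [values[i] for i in idx]

    return (pick(rows, train_idx), pick(rows, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


rows = [[i] for i in range(20)]          # made-up features
labels = [i % 2 for i in range(20)]      # made-up labels
X_tr, X_te, y_tr, y_te = simple_split(rows, labels)
print(len(X_tr), len(X_te))
```

The key point is that features and labels are picked with the same shuffled indices, so each row keeps its own label.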

In [None]:
logreg.fit(X_train, y_train)
svc.fit(X_train, y_train)
forest.fit(X_train, y_train)

## Confusion matrices

In [None]:
from sklearn.metrics import confusion_matrix

### Logistic Regression

In [None]:
# Have a look at the confusion matrix for logistic regression

y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)
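
The table is just a tally of (actual, predicted) pairs, with actual labels on the rows and predictions on the columns. Here is a from-scratch sketch with made-up labels that shows where each count comes from.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for labels coded 0 (negative) and 1 (positive)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Made-up labels and predictions for illustration.
y_true_demo = [0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 1, 0, 1, 1, 0, 1]
demo_matrix = confusion_counts(y_true_demo, y_pred_demo)
print(demo_matrix)
```

The diagonal holds the correct predictions. Everything off the diagonal is a mistake of one kind or the other.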

### Random forest

In [None]:
# Have a look at the confusion matrix for Random Forest. 

y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

### SVC

In [None]:
# ...and finally, for SVC. 

y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names)

## Percentage-based confusion matrices
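
"Percentage-based" here means dividing each cell by the total of its row, so every row sums to 1 and reads as "of the truly negative (or positive) tweets, what share did we get right?" Here is a tiny sketch with made-up counts.

```python
def row_normalise(matrix):
    """Divide every cell by the total of its own row."""
    return [[cell / sum(row) for cell in row] for row in matrix]


counts = [[80, 20], [30, 70]]   # made-up counts: rows are the true classes
shares = row_normalise(counts)
print(shares)
```

Dividing by column totals instead would answer a different question: of everything predicted positive, how much really was?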

### Logistic Regression

In [None]:
y_true = y_test
y_pred = logreg.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
# Divide each row by its own total (axis=0 matches the row sums to the rows)
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

### Random forest

In [None]:
y_true = y_test
y_pred = forest.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
# Divide each row by its own total (axis=0 matches the row sums to the rows)
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

### SVC

In [None]:
y_true = y_test
y_pred = svc.predict(X_test)
matrix = confusion_matrix(y_true, y_pred)

label_names = pd.Series(['negative', 'positive'])
# Divide each row by its own total (axis=0 matches the row sums to the rows)
pd.DataFrame(matrix,
     columns='Predicted ' + label_names,
     index='Is ' + label_names).div(matrix.sum(axis=1), axis=0)

## What do we think about training time vs performance? What can that mean about feature selection or training set size?

## What do we think about increasing features (more words) vs more sentences?