# Sentiment analysis

### What is sentiment analysis?

In this notebook, we're going to perform [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) on a dataset of tweets about US airlines. Sentiment analysis is the task of extracting [affective states][1] from text. It is most often used to answer questions like:

[1]: https://en.wikipedia.org/wiki/Affect_(psychology)

- _is this tweet positive or negative?_
- _what do our customers think of us?_
- _do our users like the look of our product?_
- _what aspects of our service are users dissatisfied with?_

### Sentiment analysis as text classification

We're going to treat sentiment analysis as a text classification problem. Text classification is just like other instances of classification in data science. We use the term "text classification" when the features come from natural language data. (You'll also hear it called "document classification" at times.) What makes text classification interestingly different from other instances of classification is the way we extract numerical features from text. 
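
To make "extracting numerical features from text" concrete, here is a minimal sketch of the bag-of-words idea using only the standard library. The two documents and the helper name `to_counts` are made up for illustration; later in the notebook we'd let scikit-learn do this for us.

```python
from collections import Counter

# Two toy documents (made up for illustration, not taken from the dataset).
docs = ["the flight was late", "the crew was great great"]

# The vocabulary gives us one column per distinct word, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_counts(doc):
    """Map a document to a list of word counts aligned with `vocab`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

# One row per document, one column per vocabulary word.
matrix = [to_counts(doc) for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each row is now a fixed-length numeric vector, which is what a classifier needs as input.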

### Dataset

The dataset was collected by [Crowdflower](https://www.crowdflower.com/), which then made it public through [Kaggle](https://www.kaggle.com/crowdflower/twitter-airline-sentiment). I've downloaded it for you and put it in the "data" directory.

### Tasks

I've already read the data into two lists, `tweets` and `sentiments`.

You'll need to:
- Preprocess the dataset
- Perform classification

In [None]:
%matplotlib inline
import os
import re
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from string import punctuation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
sns.set()

DATA_DIR = 'data'

## Read in the dataset

In [None]:
def read_airline_tweets():
    fname = os.path.join(DATA_DIR, 'tweets.csv')
    df = pd.read_csv(fname)
    tweets = list(df['text'])
    sentiment = list(df['airline_sentiment'])
    return tweets, sentiment

In [None]:
tweets, sentiments = read_airline_tweets()
tweets[:5]

In [None]:
sentiments[:5]

## Preprocess

#### Challenge

You can preprocess this dataset however you like, as long as you wrap your steps in a function called `clean` that takes one tweet and returns a string (the cell further down expects it). Some recommended steps are:
- remove hashtags, URLs and user mentions
- remove punctuation
- lower case everything
- remove extra whitespace
- instead of removing URLs, replace them with something like " URL "
- replace any digits with " DIGIT "
- remove any stopwords
- remove any words less than 3 characters in length
- stem/lemmatize words

We can use regular expressions to find hashtags and user mentions in a tweet. We first write the pattern we're looking for as a (raw) string, using regular expression syntax. The `twitter_handle_pattern` says "find me an @ sign immediately followed by one or more upper or lower case letters, digits or underscores". The `hashtag_pattern` is a little more complicated; it says "find me exactly one ＃ or #, immediately followed by one or more upper or lower case letters, digits or underscores, but only if it's at the beginning of the string or immediately after a whitespace character".

In [None]:
twitter_handle_pattern = r'@(\w+)'
hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'
url_pattern = r'https?://\S+'  # http(s):// followed by any run of non-whitespace characters
example_tweet = "lol @justinbeiber and @BillGates are like soo #yesterday #amiright saw it on https://twitter.com #yolo"

In [None]:
re.findall(twitter_handle_pattern, example_tweet)

In [None]:
re.findall(hashtag_pattern, example_tweet)

In [None]:
re.findall(url_pattern, example_tweet)
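
The next cell calls a `clean` function that you're expected to write yourself. Here is one possible sketch covering a few of the recommended steps. The compiled pattern names and the order of the steps are my own choices, and the URL pattern is a simplifying assumption (anything non-whitespace after `http://` or `https://`). Stopword removal, the minimum word length and stemming are left out.

```python
import re
from string import punctuation

# Patterns modelled on the ones in this notebook (names are illustrative).
HANDLE = re.compile(r'@\w+')
HASHTAG = re.compile(r'(?:^|\s)[＃#]\w+')
URL = re.compile(r'https?://\S+')  # assumption: a URL ends at the next whitespace

def clean(tweet):
    """Return a lower-cased tweet with URLs, handles, hashtags,
    digits, punctuation and extra whitespace normalised."""
    text = URL.sub(' URL ', tweet)           # URLs first, before punctuation goes
    text = HANDLE.sub(' ', text)             # drop user mentions
    text = HASHTAG.sub(' ', text)            # drop hashtags
    text = re.sub(r'\d+', ' DIGIT ', text)   # replace runs of digits
    text = text.translate(str.maketrans('', '', punctuation))
    text = text.lower()                      # placeholders become 'url' and 'digit'
    return ' '.join(text.split())            # collapse extra whitespace

print(clean("I love @VirginAmerica so much #friendlystaff"))
print(clean("Flight 123 delayed!! see https://t.co/abc"))
```

Replacing URLs before stripping punctuation matters here: once the `://` is gone, the URL pattern can no longer find anything.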

In [None]:
cleaned_texts = [clean(tweet) for tweet in tweets]
assert isinstance(cleaned_texts, list), "cleaned_texts should be a list"
assert isinstance(cleaned_texts[0], str), "each element in cleaned_texts should be a string"

## DTM/TF-IDF

#### Challenge
Now let's take our list of strings `cleaned_texts` and turn it into a DTM, with either counts or TF-IDF scores. It's up to you which one you choose. Here's the documentation for the [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and here's the documentation for [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). I'd suggest limiting `max_features` to 5000 and setting `binary=True`. Store the fitted vectorizer as `vectorizer` and the resulting matrix as `features`, since later cells rely on both names. Feel free to play around with other options too!
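
As a rough sketch of what this step could look like, the snippet below builds a binary DTM with `CountVectorizer`. The three strings in `toy_texts` are made up and stand in for `cleaned_texts`; swapping in `TfidfVectorizer` would work the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-in for `cleaned_texts`.
toy_texts = ["flight delayed again", "great crew great flight", "lost my bag"]

# binary=True records whether a word is present, not how often it occurs.
vectorizer = CountVectorizer(max_features=5000, binary=True)
features = vectorizer.fit_transform(toy_texts)  # sparse matrix: documents x words

print(features.shape)
print(sorted(vectorizer.vocabulary_))
```

Note that "great" occurs twice in the second string but is still stored as 1 because of `binary=True`.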

## Classification

Now we're on to the actual classification step. The first thing we need to do here is split our data into a training and a test set. This is so we can evaluate the quality of our classifier.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features, sentiments, test_size=0.2)

In [None]:
def fit_logistic_regression(X_train, y_train):
    model = LogisticRegressionCV(Cs=5, penalty='l1', cv=3, solver='liblinear', refit=True)
    model.fit(X_train, y_train)
    return model

def conmat(model, X_test, y_test):
    """Wrapper for sklearn's confusion matrix."""
    labels = model.classes_
    y_pred = model.predict(X_test)
    c = confusion_matrix(y_test, y_pred, labels=labels)  # keep rows/columns in the same order as the tick labels
    sns.heatmap(c, annot=True, fmt='d', 
                xticklabels=labels, 
                yticklabels=labels, 
                cmap="YlGnBu", cbar=False)
    plt.ylabel('Ground truth')
    plt.xlabel('Prediction')
    
def test_model(model, X_test, y_test):
    """Plot the confusion matrix and print accuracy on the held-out test set."""
    conmat(model, X_test, y_test)
    print('Accuracy: ', model.score(X_test, y_test))
    
def interpret(vectorizer, model):
    vocab = [(v,k) for k,v in vectorizer.vocabulary_.items()]
    vocab = sorted(vocab, key=lambda x: x[0])
    vocab = [word for num,word in vocab]
    important = pd.DataFrame(model.coef_).T
    if len(model.classes_) == 2:
        # In the binary case coef_ has a single row, which describes classes_[1]
        important.columns = [model.classes_[1]]
    else:
        important.columns = model.classes_
    important['word'] = vocab
    return important

### Train classification model
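
On the real data this step would be a call to `fit_logistic_regression(X_train, y_train)`. To show the moving parts in isolation, here is a self-contained sketch on six made-up tweets. It swaps in plain `LogisticRegression`, because six examples are too few for the cross-validated version used above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up stand-ins for the cleaned tweets and their labels.
toy_texts = ["great flight", "love the crew", "great service",
             "flight delayed", "lost my bag", "terrible delayed service"]
toy_labels = ["positive", "positive", "positive",
              "negative", "negative", "negative"]

vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(toy_texts)

# Plain logistic regression (not LogisticRegressionCV) for this tiny example.
model = LogisticRegression()
model.fit(X, toy_labels)

# New text must go through the *same* fitted vectorizer before predicting.
prediction = model.predict(vectorizer.transform(["great crew"]))
print(model.classes_)
print(prediction)
```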

### Test classification model
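
Before reading the heatmap that `conmat` draws, it may help to see a confusion matrix as plain numbers. The labels and predictions below are made up by hand, so no model is needed; rows are the ground truth and columns are the predictions, matching the axis labels in `conmat`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written ground truth and predictions (illustrative only).
y_true = ['negative', 'negative', 'neutral', 'positive', 'positive', 'negative']
y_pred = ['negative', 'neutral', 'neutral', 'positive', 'negative', 'negative']
labels = ['negative', 'neutral', 'positive']

# Entry [i, j] counts items whose true label is labels[i]
# and whose predicted label is labels[j].
cm = confusion_matrix(y_true, y_pred, labels=labels)
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
```

The diagonal holds the correct predictions (2 + 1 + 1 = 4 out of 6 here); everything off the diagonal is a mistake, and its position tells you which class was confused with which.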

### Interpreting what our model learnt
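
One way to see what a linear model has learnt is to sort its coefficients. The sketch below does this on six made-up tweets with plain `LogisticRegression` (an assumption made to keep the example small); on the real model, the `interpret` function above returns a similar table.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up stand-ins for the cleaned tweets and their labels.
toy_texts = ["great flight", "love the crew", "great service",
             "flight delayed", "lost my bag", "terrible delayed service"]
toy_labels = ["positive", "positive", "positive",
              "negative", "negative", "negative"]

vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(toy_texts)
model = LogisticRegression().fit(X, toy_labels)

# Words ordered by their column index in the feature matrix.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# With two classes, coef_ has one row; positive weights push towards classes_[1].
weights = pd.DataFrame({'word': vocab, 'weight': model.coef_[0]})
weights = weights.sort_values('weight', ascending=False)

print(model.classes_[1])
print(weights.head(3))
print(weights.tail(3))
```

Words at the top of the table pull a tweet towards "positive", and words at the bottom pull it towards "negative".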

### Challenge

Use the `test_tweets` function below to test your classifier's performance on a list of tweets. Write your own tweets in the `my_tweets` list.

In [None]:
def test_tweets(tweets, model):
    tweets = [clean(tweet) for tweet in tweets]
    features = vectorizer.transform(tweets)
    predictions = model.predict(features)
    return list(zip(tweets, predictions))

In [None]:
my_tweets = [example_tweet,
            'omg I am never flying on Delta again',
            'I love @VirginAmerica so much #friendlystaff']

test_tweets(my_tweets, model)

### Challenge

Use the `fit_random_forest` function below to train a random forest classifier on the training set and test the model on the test set. Which performs better?

In [None]:
def fit_random_forest(X_train, y_train):
    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    return model

## Extra: Exploratory Data Analysis

Exploratory data analysis (EDA) is an important stage in any computational text analysis project. It's when we look closely at our data to understand it: what each column holds and how the values within each column are distributed. Normally, we would explore our dataset before doing any preprocessing or classification. For today's workshop, though, we wanted to focus on preprocessing and classification, so we skipped the EDA stage. The way I like to do EDA is to ask a series of questions about the dataset and then answer them. Unlike above, we'll work with the whole dataset as a pandas DataFrame.

In [None]:
fname = os.path.join(DATA_DIR, 'tweets.csv')
df = pd.read_csv(fname)
df.head(3)

Which airlines are tweeted about, and how many tweets does each one have in this dataset?
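
A question like this usually comes down to `value_counts`. The sketch below uses a small made-up DataFrame; the column name `airline` is an assumption about the real file, so check `df.columns` before relying on it.

```python
import pandas as pd

# Made-up rows standing in for the real dataset.
toy = pd.DataFrame({
    'airline': ['United', 'Delta', 'United', 'Virgin America', 'United', 'Delta'],
})

# Count how many rows mention each airline, most frequent first.
counts = toy['airline'].value_counts()
print(counts)
```

On the real data, the equivalent would be `df['airline'].value_counts()`, assuming that column exists.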

### Challenge

- How many tweets are in the dataset?
- How many tweets are positive, neutral and negative?
- What **proportion** of tweets are positive, neutral and negative?
- Visualize these last two questions.
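
If you get stuck on the counts and proportions, here is a hedged sketch on made-up labels. The `airline_sentiment` column name comes from `read_airline_tweets` above; the six rows are invented.

```python
import pandas as pd

# Made-up labels standing in for df['airline_sentiment'].
toy = pd.DataFrame({
    'airline_sentiment': ['negative', 'neutral', 'negative',
                          'positive', 'negative', 'neutral'],
})

n_tweets = len(toy)                                               # number of rows
counts = toy['airline_sentiment'].value_counts()                  # tweets per class
props = toy['airline_sentiment'].value_counts(normalize=True)     # share per class

print(n_tweets)
print(counts)
print(props)
```

For the visualization, calling `counts.plot(kind='bar')` or `props.plot(kind='bar')` is one simple option.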

### Extra challenge

- When did the tweets come from?
- Who gets more retweets: positive, negative or neutral tweets?
- What are the three main reasons why people are tweeting negatively? What could airline companies do to improve this?
- What's the distribution of time zones in which people are tweeting?
- Is this distribution consistent depending on what airlines they're tweeting about?

What other questions would you like to ask about this dataset?