# Classifying Tweets: Supervised Machine Learning

## Problem 

In this problem, Twitter data extracted with the [Twitter API](https://dev.twitter.com/overview/api) is analyzed. The data contains tweets posted by the following six Twitter accounts: `realDonaldTrump`, `mike_pence`, `GOP`, `HillaryClinton`, `timkaine`, `TheDemocrats`.

For every tweet, there are two pieces of information:
- `screen_name`: the Twitter handle of the user tweeting and
- `text`: the content of the tweet.

The tweets are split into a training set and a test set, each available as a CSV file. The training set provides both the `screen_name` and `text` attributes; in the test set, `screen_name` is hidden.

The goal of the problem is to predict the political inclination (Republican/Democratic) of a Twitter user from one of their tweets. The ground truth (true class labels) is determined from the `screen_name` of the tweet as follows:
- `realDonaldTrump, mike_pence, GOP` are Republicans
- `HillaryClinton, timkaine, TheDemocrats` are Democrats

This is a binary classification problem. 

The problem proceeds in three stages:
- **Text processing**: clean up the raw tweet text using the various functions offered by the [nltk](http://www.nltk.org/genindex.html) package.
- **Feature construction**: construct bag-of-words feature vectors and training labels from the processed text of tweets and the `screen_name` columns respectively.
- **Classification**: use the [sklearn](http://scikit-learn.org/stable/modules/classes.html) package to learn a model that classifies the tweets as desired. 


In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
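import sklearn
import sklearn.feature_extraction.text  # TfidfVectorizer
import sklearn.model_selection          # KFold
import sklearn.svm                      # SVC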
import string
import re
from sklearn.metrics import accuracy_score
from IPython.display import display, Latex, Markdown
import copy
from scipy import stats

`nltk` and `sklearn` are used in this problem. NLTK comes with many toy grammars, trained models, etc., which have to be downloaded manually. This problem requires NLTK's stopwords list, the WordNet data used by `WordNetLemmatizer`, the `punkt` tokenizer, and the POS tagger. Download them with:

In [2]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')


# Verify that the following commands work for you, before moving on.

lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()
stopwords=nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\danie\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


## A. Text Processing

A function was created to process and tokenize raw text. The generated list of tokens meets the following specifications:
  1. The tokens are all in lower case.
  2. The tokens appear in the same order as in the raw text.
  3. The tokens are in their lemmatized form. If a word cannot be lemmatized (i.e., lemmatization raises an exception), the exception is caught and the word is left out of the token list.
  4. The tokens do not contain any punctuation. Punctuation is handled as follows: 
    1. Apostrophes of the form `'s` are dropped together with the `s`, e.g., `She's` becomes `she`. 
    2. Other apostrophes are omitted, e.g., `don't` becomes `dont`. 
    3. Words are broken at hyphens and other punctuation marks. 
  5. The tokens do not contain any part of a URL.

`string.punctuation` provides the full set of punctuation symbols. 
[Regular expressions](https://docs.python.org/3/library/re.html) are used for capturing URLs in the text. 

Tokens must be of type `str`. `nltk.word_tokenize()` is used for tokenization once punctuation is handled. 
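
The punctuation and URL rules listed above can be tried in isolation with the standard library alone. The following is a minimal sketch, not the notebook's `process()` function: `strip_punctuation` and `URL_PATTERN` are hypothetical names, the URL pattern is a deliberately simplified assumption, and a plain `split()` stands in for `nltk.word_tokenize()`.

```python
import re
import string

# Simplified assumption: a URL is "http(s)://" followed by non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")

def strip_punctuation(text):
    """Lower-case text, drop URLs, and apply the apostrophe and punctuation rules."""
    text = text.lower()
    text = URL_PATTERN.sub("", text)      # rule 5: no part of a URL survives
    text = re.sub(r"'s\b", "", text)      # rule 4.1: "she's" -> "she"
    text = text.replace("'", "")          # rule 4.2: "don't" -> "dont"
    # rule 4.3: every remaining punctuation mark becomes a space,
    # so words are broken at hyphens, colons, periods, and so on
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table).split()

print(strip_punctuation("She's sure they don't mind self-confidence: http://t.co/dummyurl"))
# ['she', 'sure', 'they', 'dont', 'mind', 'self', 'confidence']
```

The URL is removed before any punctuation handling; otherwise the `://` and dots would be turned into spaces and fragments of the URL would end up as tokens.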

The `lemmatize()` function is documented [here](https://www.nltk.org/_modules/nltk/stem/wordnet.html).

For `lemmatize()` to return the root form of a word, it needs the context in which to lemmatize, supplied through the `pos` parameter (`lemmatizer.lemmatize(word, pos=SOMEVALUE)`). The context is the part of speech (POS) of that word. [nltk.pos_tag()](https://www.nltk.org/book/ch05.html) gives the lexical category of each word, and its results are then used for the `pos` parameter.

However, the POS tags returned by `pos_tag()` are in a different format from the `pos` values expected by the lemmatizer:
> pos
(Syntactic category): n for noun files, v for verb files, a for adjective files, r for adverb files.

The tags need to be mapped appropriately. `nltk.help.upenn_tagset()` provides a description of each tag returned by `pos_tag()`.
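
One way to do this mapping is to look only at the first letter of the Penn Treebank tag. The sketch below is a standalone illustration that needs no NLTK data; `penn_to_wordnet` is a hypothetical helper name, and falling back to `"n"` is an assumption that mirrors the lemmatizer's default `pos` value.

```python
def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet pos letter (e.g. 'v')."""
    first_letter_to_pos = {
        "N": "n",  # NN, NNS, NNP, ...  -> noun
        "V": "v",  # VB, VBD, VBG, ...  -> verb
        "J": "a",  # JJ, JJR, JJS       -> adjective
        "R": "r",  # RB, RBR, RBS       -> adverb
    }
    # Tags with no WordNet counterpart (DT, IN, PRP, ...) fall back to the default.
    return first_letter_to_pos.get(tag[:1], default)

for tag in ["NNS", "VBG", "JJ", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Keying on the first letter keeps the table short, at the cost of some coarseness: any tag that merely starts with one of the four letters is mapped, whether or not it is a true noun, verb, adjective, or adverb tag.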


In [3]:
def process(text, lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()):
  """ Normalizes case and handles punctuation
  Inputs:
  text: str: raw text
  lemmatizer: an instance of a class implementing the lemmatize() method
              (the default argument is of type nltk.stem.wordnet.WordNetLemmatizer)
  Outputs:
  list(str): tokenized text
  """

  text = text.lower()
  text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text) # remove url
  text = text.replace("'s", '') # remove possessive 's 
  text = text.replace("'", '') # remove other apostrophes
  text = text.translate(str.maketrans(string.punctuation, ' '*len(string.punctuation))) # replace punctuation with spaces so words break there
  text = re.sub(' +', ' ', text) # collapse repeated spaces between words
  text = text.strip() # remove leading or trailing whitespace

  tokens = nltk.word_tokenize(text)
  tokens = nltk.pos_tag(tokens)

  pos_mapping = {
    "N":'n',
    "V":'v',
    "J":'a',
    "R":'r'
  }

  lemmatized = []
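  for word, pos in tokens:
    # map the first letter of the Penn Treebank tag to a WordNet pos, if there is one
    tag = pos_mapping.get(pos[0])
    if tag is not None:
      lemmatized.append(lemmatizer.lemmatize(word, tag))
    else:
      # no matching WordNet pos: fall back to the lemmatizer's default
      lemmatized.append(lemmatizer.lemmatize(word))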

  return lemmatized


Test the above function 

In [4]:
print(process("I'm doing well! How about you?"))
# ['im', 'do', 'well', 'how', 'about', 'you']

print(process("Education is the ability to listen to almost anything without losing your temper or your self-confidence."))
# ['education', 'be', 'the', 'ability', 'to', 'listen', 'to', 'almost', 'anything', 'without', 'lose', 'your', 'temper', 'or', 'your', 'self', 'confidence']

print(process("been had done languages cities mice"))
# ['be', 'have', 'do', 'language', 'city', 'mice']

print(process("It's hilarious. Check it out http://t.co/dummyurl"))
# ['it', 'hilarious', 'check', 'it', 'out']

['im', 'do', 'well', 'how', 'about', 'you']
['education', 'be', 'the', 'ability', 'to', 'listen', 'to', 'almost', 'anything', 'without', 'lose', 'your', 'temper', 'or', 'your', 'self', 'confidence']
['be', 'have', 'do', 'language', 'city', 'mice']
['it', 'hilarious', 'check', 'it', 'out']


Next, the `process()` function is applied to the pandas data frame loaded from the `tweets_train.csv` file. `process_all()` handles any data frame that contains a column called `text`: the data frame it returns replaces every string in `text` with the result of `process()` and leaves all other columns unchanged. 

In [5]:
tweets = pd.read_csv("tweets_train.csv", na_filter=False)
print(tweets.head())

      screen_name                                               text
0             GOP  RT @GOPconvention: #Oregon votes today. That m...
1    TheDemocrats  RT @DWStweets: The choice for 2016 is clear: W...
2  HillaryClinton  Trump's calling for trillion dollar tax cuts f...
3  HillaryClinton  .@TimKaine's guiding principle: the belief tha...
4        timkaine  Glad the Senate could pass a #THUD / MilCon / ...


In [6]:
def process_all(df, lemmatizer=nltk.stem.wordnet.WordNetLemmatizer()):
  """ process all text in the dataframe using process() function.
  Inputs
    df: pd.DataFrame: dataframe containing a column 'text' loaded from the CSV file
    lemmatizer: an instance of a class implementing the lemmatize() method
      (the default argument is of type nltk.stem.wordnet.WordNetLemmatizer)
  Outputs
    pd.DataFrame: dataframe in which the values of text column have been changed from str to list(str),
      the output from process() function. Other columns are unaffected.
  """
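  # Pass the supplied lemmatizer through to process().
  # Note: the 'text' column is overwritten in place, so df itself is modified.
  df['text'] = df['text'].apply(lambda text: process(text, lemmatizer))
  return df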

In [7]:
# test code
processed_tweets = process_all(tweets)
print(processed_tweets.head())

#       screen_name                                               text
# 0             GOP  [rt, gopconvention, oregon, vote, today, that,...
# 1    TheDemocrats  [rt, dwstweets, the, choice, for, 2016, be, cl...
# 2  HillaryClinton  [trump, call, for, trillion, dollar, tax, cut,...
# 3  HillaryClinton  [timkaine, guide, principle, the, belief, that...
# 4        timkaine  [glad, the, senate, could, pass, a, thud, milc...

      screen_name                                               text
0             GOP  [rt, gopconvention, oregon, vote, today, that,...
1    TheDemocrats  [rt, dwstweets, the, choice, for, 2016, be, cl...
2  HillaryClinton  [trump, call, for, trillion, dollar, tax, cut,...
3  HillaryClinton  [timkaine, guide, principle, the, belief, that...
4        timkaine  [glad, the, senate, could, pass, a, thud, milc...


## B. Feature Construction

The next step is to derive feature vectors from the tokenized tweets. In this section, a bag-of-words TF-IDF feature vector is constructed. Before that, however, the vocabulary is pruned: the number of possible words is large, and not all of them are useful for the classification task. The words to retain and the words to omit are determined as follows. 
> "A common heuristic is to construct a frequency distribution of words in the corpus and prune out the head and tail of the distribution. The intuition of the above operation is as follows." Very common words (i.e., stopwords) add almost no information about the similarity of two pieces of text, and the same holds for very rare words. NLTK has a built-in list of stop words, which is a good substitute for the head of the distribution. A word is considered rare if it occurs in only a single document (row) in the whole of `tweets_train.csv`. 

A sparse matrix of features is constructed for each tweet with the help of `sklearn.feature_extraction.text.TfidfVectorizer` (documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)). `min_df=2` is passed to filter out the words occurring in only one document in the whole training set. Stop words are ignored. Other optional parameters are left at their default values. 
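
To make the pruning rule concrete, the sketch below counts document frequencies by hand on a toy corpus. It is a pure-Python illustration of the effect of `min_df=2` combined with a stop-word list, not how `TfidfVectorizer` is implemented; `prune_vocabulary`, the toy documents, and the three-word stop list are all made up for the example.

```python
from collections import Counter

def prune_vocabulary(docs, stop_words, min_df=2):
    """Keep words that occur in at least min_df documents and are not stop words."""
    document_frequency = Counter()
    for tokens in docs:
        # set() ensures a word is counted once per document, however often it repeats
        document_frequency.update(set(tokens))
    return sorted(
        word
        for word, count in document_frequency.items()
        if count >= min_df and word not in stop_words
    )

docs = [
    ["the", "senate", "vote", "today"],
    ["the", "vote", "be", "clear"],
    ["tax", "cut", "for", "the", "senate"],
]
print(prune_vocabulary(docs, stop_words={"the", "be", "for"}))
# ['senate', 'vote']
```

`the` appears in all three documents but is removed as a stop word (the head of the distribution), while `today`, `clear`, `tax`, and `cut` each appear in a single document and are removed as rare words (the tail).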

In [8]:
def create_features(processed_tweets, stop_words):
  """ creates the feature matrix using the processed tweet text
  Inputs:
    processed_tweets: pd.DataFrame: tweets read from train/test csv file, containing the column 'text'
    stop_words: list(str): stop_words by nltk stopwords
  Outputs:
    sklearn.feature_extraction.text.TfidfVectorizer: the TfidfVectorizer object used
      we need this to transform test tweets in the same way as train tweets
    scipy.sparse.csr.csr_matrix: sparse bag-of-words TF-IDF feature matrix
  """
  # TfidfVectorizer expects one string per document, so re-join the token lists
  docs = [' '.join(tokens) for tokens in processed_tweets['text']]

  vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(use_idf = True, min_df = 2, stop_words = stop_words)
  vectors = vectorizer.fit_transform(docs)

  return vectorizer, vectors

In [9]:
# execute this code 
(tfidf, X) = create_features(processed_tweets, stopwords)
print(X.shape)

(17298, 7923)


For each tweet, assign a class label (0 or 1) using its `screen_name`. Use 0 for realDonaldTrump, mike_pence, GOP and 1 for the rest.

In [10]:
def create_labels(processed_tweets):
  """ creates the class labels from screen_name
  Inputs:
    processed_tweets: pd.DataFrame: tweets read from train file, containing the column 'screen_name'
  Outputs:
    numpy.ndarray(int): dense binary numpy array of class labels
  """
  republicans = ['realDonaldTrump', 'mike_pence', 'GOP']
  # 0 for the Republican accounts, 1 for everything else
  is_republican = processed_tweets['screen_name'].isin(republicans)
  return np.where(is_republican, 0, 1)

In [11]:
# execute this code
y = create_labels(processed_tweets)
print(len(y))

17298


## C. Classification

The classifier used is [`sklearn.svm.SVC`](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) (Support Vector Machine). 

At the heart of SVMs is the concept of kernel functions, which determine how the similarity/distance between two data points is computed. `sklearn`'s SVM provides four kernel functions: `linear`, `poly`, `rbf`, `sigmoid` (details [here](http://scikit-learn.org/stable/modules/svm.html#svm-kernels)) (**IDEA: implement own distance function and pass it as an argument to the classifier**).
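
For intuition, the four kernels can be written out for a single pair of dense vectors. This is a minimal sketch of the standard formulas, not `sklearn`'s implementation; the function names are hypothetical, and the `gamma`, `coef0`, and `degree` values used here are illustrative assumptions rather than the defaults `SVC` would choose.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def poly_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)

u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v))                         # 11.0
print(poly_kernel(u, v, coef0=1.0, degree=2))      # 144.0
print(rbf_kernel(u, u))                            # 1.0, identical points are maximally similar
print(sigmoid_kernel(u, v, gamma=0.1))
```

The linear, polynomial, and sigmoid kernels are all functions of the inner product, whereas the RBF kernel depends on the distance between the two points and decays toward 0 as they move apart.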

Tasks for classifier:

1. Implement and evaluate a simple baseline classifier MajorityLabelClassifier.
2. Implement the `learn_classifier()` function assuming `kernel` is always one of {`linear`, `poly`, `rbf`, `sigmoid`}. 
3. Implement the `evaluate_classifier()` function which scores a classifier based on accuracy of a given dataset.
4. Implement `best_model_selection()` to perform cross-validation by calling `learn_classifier()` and `evaluate_classifier()` for different folds and determine which of the four kernels performs the best.
5. Go back to `learn_classifier()` and fill in the best kernel. 


To determine whether the classifier is performing well, compare it to a baseline classifier. A baseline is generally a simple or trivial classifier and the classifier implemented should beat the baseline in terms of a performance measure such as accuracy. The implemented classifier, `MajorityLabelClassifier`, always predicts the class equal to the **mode** of the labels (i.e., the most frequent label) in training data.

In [12]:
class MajorityLabelClassifier():
  """
  A classifier that predicts the mode of training labels
  """
  def __init__(self):
    """
    Initialize
    """
    self.mode = None


  def fit(self, X, y):
    """
    Implement fit by taking training data X and their labels y and finding the mode of y
    """

    # stats.mode returns the mode as an array in older SciPy releases and as a
    # scalar in newer ones; np.atleast_1d handles both
    self.mode = int(np.atleast_1d(stats.mode(y)[0])[0])


  def predict(self, X):
    """
    Implement to give the mode of training labels as a prediction for each data instance in X
    return labels
    """

    # one copy of the majority label per row of X
    return np.full(X.shape[0], self.mode)

# Report the accuracy of your classifier by comparing the predicted label of each example to its true label
obj = MajorityLabelClassifier()
obj.fit(X,y)
preds = obj.predict(X)

# accuracy = fraction of predictions that match the true label
pred_acc = accuracy_score(y, preds)
print(round(pred_acc, 6))

# training accuracy = 0.500173

0.500173


Implement the `learn_classifier()` function assuming `kernel` is always one of {`linear`, `poly`, `rbf`, `sigmoid`}. Default values are used for any other optional parameters.

In [13]:
def learn_classifier(X_train, y_train, kernel):
  """ learns a classifier from the input features and labels using the kernel function supplied
  Inputs:
    X_train: scipy.sparse.csr.csr_matrix: sparse matrix of features, output of create_features()
    y_train: numpy.ndarray(int): dense binary vector of class labels, output of create_labels()
    kernel: str: kernel function to be used with classifier. [linear|poly|rbf|sigmoid]
  Outputs:
    sklearn.svm.classes.SVC: classifier learnt from data
  """

  classifier = sklearn.svm.SVC(kernel = kernel)
  classifier.fit(X_train, y_train)
  return classifier

In [14]:
# execute code
classifier = learn_classifier(X, y, 'linear')
print('done')

done


The next step is to evaluate the classifier, i.e., to characterize how good its classification performance is. (This step is necessary to select the best model among a given set of models, or even to tune hyperparameters for a given model.)

There are two questions that now come to mind:
1. **What data to use?** 
    - **Validation Data**: The data used to evaluate a classifier is called **validation data** (or hold-out data), and it is usually different from the data used for training. The model or hyperparameter with the best performance on the held-out data is chosen. This approach is relatively fast and simple but vulnerable to biases in the validation set.
    - **Cross-validation**: This approach divides the dataset into $k$ groups (hence the name k-fold cross-validation). Each group in turn is used as the test set for evaluation, with the other groups as the training set. The model or hyperparameter with the best average performance across all $k$ folds is chosen. Here, 4-fold cross-validation is used to determine the best kernel; all other hyperparameters are kept at their defaults. This approach is more robust to bias in the validation set, but it takes more time.
    

2. **And what metric?** 
    - There are several evaluation measures available in the literature (e.g., accuracy, precision, recall, F-1, etc.), and different fields prefer different metrics because their goals differ. Accuracy is used here. According to Wikipedia, the **accuracy** of a classifier measures the fraction of all data points that are correctly classified by it; it is the ratio of the number of correct classifications to the total number of (correct and incorrect) classifications. `sklearn.metrics` provides a number of performance metrics.
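
The definition of accuracy is short enough to write out directly. The sketch below is a standard-library illustration of the ratio described above, with a hypothetical function name and made-up labels; the notebook itself relies on `sklearn.metrics.accuracy_score`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the corresponding true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]
print(accuracy(y_true, y_pred))  # 0.75, three of the four predictions are correct
```

Because accuracy counts matches, its complement (the fraction of mismatches) is the error rate; on a balanced two-class dataset such as this one, a classifier that always predicts one class scores close to 0.5 on both.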

In [15]:
def evaluate_classifier(classifier, X_validation, y_validation):
  """ evaluates a classifier based on a supplied validation data
  Inputs:
    classifier: sklearn.svm.classes.SVC: classifier to evaluate
    X_validation: scipy.sparse.csr.csr_matrix: sparse matrix of features
    y_validation: numpy.ndarray(int): dense binary vector of class labels
  Outputs:
    double: accuracy of classifier on the validation data
  """

  predictions = classifier.predict(X_validation)
  accuracy = accuracy_score(y_validation, predictions) 
  return accuracy

In [16]:
# test code by evaluating the accuracy on the training data
accuracy = evaluate_classifier(classifier, X, y)
print(accuracy) 
# should give 0.956700196554515

0.951959764134582


Now it is time to decide which kernel works best by using cross-validation. The training data is shuffled randomly and split into 4 folds, so that each round uses 75% of the data for training and 25% for validation. For each kernel, the average accuracy over all folds is recorded, and the kernel with the highest average is chosen. Since the dataset is balanced (both classes occur in almost equal proportion), `sklearn.model_selection.KFold` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)) can be used for cross-validation.
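
The splitting scheme itself can be sketched in a few lines of plain Python. This is an illustration of shuffled k-fold index generation under assumed names (`k_fold_indices` is hypothetical); it will not reproduce the exact folds that `KFold` produces for the same seed.

```python
import random

def k_fold_indices(n_samples, n_splits=4, seed=1):
    """Yield (train_indices, validation_indices) for each of n_splits folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # deal the shuffled indices into n_splits folds of near-equal size
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i in range(n_splits):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

for train, validation in k_fold_indices(n_samples=8, n_splits=4):
    print(len(train), len(validation))  # 6 2 on every round: 75% train, 25% validation
```

Every sample lands in the validation set exactly once across the four rounds, which is what makes the averaged accuracy less sensitive to any single unlucky split.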

In [17]:
kf = sklearn.model_selection.KFold(n_splits=4, random_state=1, shuffle=True)
kf

KFold(n_splits=4, random_state=1, shuffle=True)

Then determine which classifier is the best. 

In [19]:
def best_model_selection(kf, X, y):
  """
  Select the kernel giving best results using k-fold cross-validation.
  Other parameters should be left default.
  Input:
  kf (sklearn.model_selection.KFold): kf object defined above
  X (scipy.sparse.csr.csr_matrix): training data
  y (array(int)): training labels
  Return:
  best_kernel (string)
  """
  best_kernel = 'linear'
  best_accuracy = 0

  for kernel in ['linear', 'rbf', 'poly', 'sigmoid']:
    accuracy = []
    for train, test in kf.split(X):
      # Split train-test
      X_train, X_test = X[train], X[test]
      y_train, y_test = y[train], y[test]
      classifier = learn_classifier(X_train, y_train, kernel)
      accuracy.append(evaluate_classifier(classifier, X_test, y_test))

    # average over however many folds kf produced, rather than a hard-coded 4
    mean_accuracy = np.mean(accuracy)
    if mean_accuracy > best_accuracy:
      best_accuracy = mean_accuracy
      best_kernel = kernel

  return best_kernel


#Test your code
best_kernel = best_model_selection(kf, X, y)
print(best_kernel)

poly


Finally, a wrapper function uses the model to classify the unlabeled tweets from the `tweets_test.csv` file. 

In [93]:
def classify_tweets(tfidf, classifier, unlabeled_tweets):
  """ predicts class labels for raw tweet text
  Inputs:
    tfidf: sklearn.feature_extraction.text.TfidfVectorizer: the TfidfVectorizer object used on training data
    classifier: sklearn.svm.classes.SVC: classifier learnt
    unlabeled_tweets: pd.DataFrame: tweets read from tweets_test.csv
  Outputs:
    numpy.ndarray(int): dense binary vector of class labels for unlabeled tweets
  """
  # Process a copy so the caller's dataframe keeps its raw text
  processed = process_all(unlabeled_tweets.copy())
  docs = [' '.join(tokens) for tokens in processed['text']]

  # Reuse the vectorizer fitted on the training data: refitting on the test
  # tweets would give a vocabulary the classifier was not trained on
  X_test = tfidf.transform(docs)

  return classifier.predict(X_test)

In [None]:
# **TO-DO** Fill in best classifier in the function and re-train classifier using all training data
# **TO-DO** Get predictions for unlabelled test data
#classifier = learn_classifier(X, y, best_kernel)
#unlabeled_tweets = pd.read_csv("tweets_test.csv", na_filter=False)
#y_pred = classify_tweets(tfidf, classifier, unlabeled_tweets)

## Closing Questions

Did the SVM classifier perform better than the baseline?