<h2>Final Project: Identifying Trump's Tweets</h2>

<center>
<img src="white_house.jpg"/>
</center>


<h3>Introduction</h3>

<p>The goal is to classify the device used to write each tweet. It has been hypothesized that President Trump tweets only from his Android phone and that someone else (his staff) tweets from his account using an iPhone. Analyze the text of each tweet, as well as other contextual information, to predict which device it came from.</p>

<h3>Rules</h3>

<p>Rules of the competition: You may use any techniques you've learned in class, including open-source implementations in packages such as scikit-learn and tensorflow, or pre-trained models. If you use any open-source implementations, <b>please cite them in your comments</b>. Sharing personal code between teams is strictly prohibited. Additionally, obtaining a copy of the labeled test set through any means is expressly forbidden.</p>

<p><b>NOTE: You are only allowed 10 submissions for this project. Please use them carefully. We will use your 10th and final submission (not your best one) for grading.</b></p>

<h3>Grading</h3>

<p>We have implemented two baselines: <code>Baseline 1 = 0.7</code> and <code>Baseline 2 = 0.82</code>. If you beat the first baseline, you will get 90 points. If you beat the second baseline, you will get 100 points.</p>
<p>The top 30 teams on the leaderboard will receive an extra 5 bonus points.</p>

Unfortunately, our initially high accuracy was due to overfitting. When no kernel is specified, sklearn's SVC uses an RBF kernel. With that kernel, our accuracy dropped from 98% on the training set to 67% under k-fold cross-validation. A linear kernel performs slightly better, at a little over 70% accuracy.
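The gap between training accuracy and cross-validated accuracy can be reproduced with a minimal sketch. It runs on synthetic data from `make_classification`, not on the tweet features, and the `gamma=5.0` and `C=10.0` values are assumptions chosen only to make the RBF kernel very flexible. The helper `train_vs_cv_accuracy` is a hypothetical name introduced for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the tweet features (NOT the project data).
X_syn, y_syn = make_classification(n_samples=300, n_features=8, n_informative=3,
                                   flip_y=0.2, random_state=0)

def train_vs_cv_accuracy(model, X, y, k=10):
    """Return (accuracy on the data the model was fit on, mean k-fold CV accuracy)."""
    train_acc = model.fit(X, y).score(X, y)
    cv_acc = cross_val_score(model, X, y, cv=k).mean()
    return train_acc, cv_acc

# gamma and C are illustrative values, not the settings used in this notebook.
rbf_train, rbf_cv = train_vs_cv_accuracy(SVC(kernel="rbf", gamma=5.0, C=10.0), X_syn, y_syn)
lin_train, lin_cv = train_vs_cv_accuracy(SVC(kernel="linear"), X_syn, y_syn)

print(f"rbf:    train={rbf_train:.2f}  cv={rbf_cv:.2f}")
print(f"linear: train={lin_train:.2f}  cv={lin_cv:.2f}")
```

A large difference between the two numbers for a model suggests that it memorizes the training set, so the cross-validated value is the one to compare against the baselines.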

### To do (added by Martin)

Implement multiple learning algorithms.

Optimize functions.

Feature ideas:
- Average number of words per sentence
- Average word length
- Are URLs or emojis in the tweet?
- Tokenize words using word2vec or bag of words
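Some of the feature ideas above could be prototyped roughly as follows. This is a sketch under stated assumptions: `text_features` is a hypothetical helper, the emoji check is a rough heuristic that only covers a few common Unicode blocks, and the example tweet uses a made-up URL. Bag-of-words and word2vec tokenization are not covered here.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Rough emoji heuristic (assumption): a few common pictograph blocks only.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
SENTENCE_END_RE = re.compile(r"[.!?]+\s+")

def text_features(tweet):
    """Return a dict of simple per-tweet text features."""
    has_url = bool(URL_RE.search(tweet))
    has_emoji = bool(EMOJI_RE.search(tweet))
    # Remove URLs first so their dots and slashes do not distort the counts.
    text = URL_RE.sub("", tweet).strip()
    sentences = [s for s in SENTENCE_END_RE.split(text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text)
    return {
        "words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "avg_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "has_url": has_url,
        "has_emoji": has_emoji,
    }

# Hypothetical example tweet (the URL is a placeholder).
example = text_features("Great poll in Florida! Thank you! https://example.com/x")
print(example)
```

With pandas, something like `df_X['text'].apply(text_features).apply(pd.Series)` would turn these dicts into feature columns.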

### What has been done

Implemented features:
- Number of sentences per tweet
- Number of characters per tweet
- Number of characters per sentence
- Day of the week
- Time of day
- Number of capital letters
- Sentiment analysis (strangely, this reduces performance!)
- Number of punctuation symbols (only exclamation points so far, but we can extend this)

Implemented sklearn's SVM as the learning algorithm.

Implemented k-fold cross-validation. Added by Cole: grid search with cross-validation over the C and kernel parameters.
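A grid search over both C and the kernel might look like the sketch below. It uses synthetic data and an illustrative grid, so the values of `C` and the names `X_demo`, `y_demo`, `param_grid`, and `search` are assumptions rather than the settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the extracted tweet features (NOT the project data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Every combination of C and kernel is scored with 3-fold cross-validation.
param_grid = {
    "C": 2.0 ** np.arange(-4, 5, 2),  # illustrative range: 2^-4, 2^-2, ..., 2^4
    "kernel": ["linear", "rbf"],
}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X_demo, y_demo)

# best_params_ holds the winning combination; best_score_ is its mean CV accuracy.
print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.predict` uses the best combination refit on all of the data passed to `fit`, which is the default behavior of `GridSearchCV`.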

In [22]:
#<GRADED>
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import KFold, GridSearchCV
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
#</GRADED>

## include your imports as necessary and cite open-source implementations appropriately

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\marti\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [23]:
def read_files(train_file):
    """
    Output:
    df_X : pandas data frame of training data
    Y    : numpy array of labels
    """
    df = pd.read_csv(train_file, index_col=0)
    df_X = df[df.columns[0:17]]
    Y = np.array(df['label'])
        
    return df_X, Y

<h3> Training Data </h3>

<p> Take a look at the file <code>train.csv</code>. Here are the first 4 tweets in the train dataset.</p>

In [24]:
df_X_train, Y_train = read_files('train.csv')
df_X_train[:4]

| id | text | favorited | favoriteCount | replyToSN | created | truncated | replyToSID | id.1 | replyToUID | statusSource | screenName | retweetCount | isRetweet | retweeted | longitude | latitude | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Senior United States District Judge Robert E. ... | False | 14207 | | 7/12/2016 0:56 | False | | 752668000000000000 | | `<a href="http://twitter.com/download/iphone" r...` | realDonaldTrump | 5256 | False | False | | | -1 |
| 1 | Speech on Veterans' Reform: https://t.co/XB7R... | False | 9666 | | 7/11/2016 22:18 | False | | 752628000000000000 | | `<a href="http://twitter.com/download/iphone" r...` | realDonaldTrump | 3432 | False | False | | | -1 |
| 2 | Great poll- Florida! Thank you! https://t.co/4... | False | 25531 | | 7/11/2016 21:40 | False | | 752619000000000000 | | `<a href="http://twitter.com/download/iphone" r...` | realDonaldTrump | 8810 | False | False | | | -1 |
| 3 | Thoughts and prayers with the victims; and the... | False | 28850 | | 7/11/2016 19:51 | False | | 752591000000000000 | | `<a href="http://twitter.com/download/iphone" r...` | realDonaldTrump | 9112 | False | False | | | -1 |


<h3> Train and Classify </h3>

<p>Implement <code>train_and_classify</code>. It should extract feature vectors from the given pandas data frames, train a model, and return the predicted labels of the test data. The choice of feature vectors and models is up to you.</p>

<p><b>Your final score will be determined by executing <code>train_and_classify</code> with the provided training set for training and a hidden test set for classification. We will then evaluate the accuracy of your output.</b></p>
<p><b>NOTE: Please limit your training time to 10 minutes.</b></p>

In [25]:
def extract_length(df_X_train):
    # the tweets themselves are in the zero-th column. Extract from dataframe
    tweets = df_X_train.iloc[:,0]
    length = tweets.str.len()
    return length/max(length)

In [26]:
def split_sentences(string):
    '''
    Splits a tweet on ". " such that sentences are separated from each other.
    Does not split on "." so that periods in URLs are not misunderstood as the end of a sentence
    
    We should add additional characters to split on like exclamation marks and question marks. We could also separate out URLs
    so we can split on punctuation without worrying about spaces.
    '''
    return string.split(". ")

def extract_number_of_sentences(df_X_train):
    tweets = df_X_train.iloc[:,0]
    splitted_tweets = tweets.apply(split_sentences)
    n_sentences = splitted_tweets.apply(len)
    
    return n_sentences/max(n_sentences)

def extract_number_of_characters_per_sentence(df_X_train):    
    def average_characters_per_string(list_of_strings):
        return np.mean(list(map(len,list_of_strings)))

    tweets = df_X_train.iloc[:,0]
    splitted_tweets = tweets.apply(split_sentences)
    n_characters_per_sentence = splitted_tweets.apply(average_characters_per_string)
    
    return n_characters_per_sentence/max(n_characters_per_sentence)
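The improvement suggested in the `split_sentences` docstring (also splitting on exclamation and question marks, with URLs handled separately) could be sketched as below. `split_sentences_multi` is a hypothetical name, the URL pattern is a simplifying assumption, and the example tweet uses a made-up URL. Abbreviations such as "Robert E. " would still be split incorrectly.

```python
import re

# Simplifying assumption: a URL is "http(s)://" followed by non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")

def split_sentences_multi(string):
    """Split on '.', '!' or '?' followed by whitespace, after removing URLs."""
    # Dropping URLs first means their periods cannot be mistaken for sentence ends.
    text = URL_PATTERN.sub(" ", string).strip()
    # The lookbehind keeps the punctuation attached to the sentence it ends.
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p for p in parts if p]

# Hypothetical example tweet (the URL is a placeholder).
print(split_sentences_multi("Great poll- Florida! Thank you! https://example.com/a.b"))
```

Because it returns a list of strings like `split_sentences` does, it could be passed to `tweets.apply` in the same way.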

In [27]:
def extract_weekday(df_X_train):
    # parse into a local Series so the caller's data frame is not modified
    created = pd.to_datetime(df_X_train['created'])
    day_of_week = created.dt.dayofweek
    
    return day_of_week/max(day_of_week)

def extract_time_of_day(df_X_train):
    # parse into a local Series so the caller's data frame is not modified
    created = pd.to_datetime(df_X_train['created'])
    hour = created.dt.hour
    
    return hour/max(hour)

In [28]:
def count_uppercase_letters(string):
    uppercase = filter(str.isupper, string) 
    return len(list(uppercase))

def extract_number_of_uppercase_letters(df_X_train):
    tweets = df_X_train.iloc[:,0]
    n_uppercase_characters = tweets.apply(count_uppercase_letters)
    
    return n_uppercase_characters/max(n_uppercase_characters)

In [29]:
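# Build the analyzer once; constructing it for every tweet reloads the VADER lexicon each time
sid = SentimentIntensityAnalyzer()

def find_sentiment(string):
    # return the compound sentiment score of a string, looked up by key rather than by position
    return sid.polarity_scores(string)['compound']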

def extract_sentiment(df_X_train):
    tweets = df_X_train.iloc[:,0]
    
    return tweets.apply(find_sentiment)

In [30]:
def extract_exclamation_points(df_X_train):
    n_exclamation_points = df_X_train['text'].str.count('!')
    return n_exclamation_points/max(n_exclamation_points)

In [31]:
#<GRADED>
def train_and_classify(df_X_train, Y_train, df_X_test):
    """
    Extracts features from df_X_train. Train a model
    on training data and training labels (Y_train).
    Predict the labels of df_X_test.
    
    df_X_train : pandas data frame of training data
    Y_train    : numpy array of labels for training data
    df_X_test  : pandas data frame of test data
    
    Output:
    Y_test : numpy array of labels for test data
    """
    
    ## fill in code here
    def extract_feature_vec(df_X):
        # extracts feature vectors
        features = []
        
        features.append(extract_length(df_X))
        features.append(extract_number_of_sentences(df_X))
        features.append(extract_number_of_characters_per_sentence(df_X))
        features.append(extract_weekday(df_X))
        features.append(extract_time_of_day(df_X))
        features.append(extract_number_of_uppercase_letters(df_X))
        features.append(extract_sentiment(df_X))
        features.append(extract_exclamation_points(df_X))
        
        return pd.concat(features, axis=1)
    
    X_train = extract_feature_vec(df_X_train)
    X_test  = extract_feature_vec(df_X_test)
    
    # create and train model (consider doing k-fold cross validation as well)
    svc = svm.SVC(kernel = 'linear')
    parameters = {'C':(2.0**np.linspace(-10, 4, 5))}
    clf = GridSearchCV(svc, parameters, cv = 3)
    clf.fit(X_train, Y_train)
    
    # predict labels for the test data
    Y_test = clf.predict(X_test) 

    return Y_test
#</GRADED>
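Each `extract_*` helper divides by the maximum of whichever data frame it receives, so training and test features are scaled by different constants when their maxima differ. One possible alternative, sketched here with hypothetical helpers `fit_max_scale` and `apply_max_scale` and made-up numbers, is to compute the maxima on the training features only and reuse them for the test features.

```python
import numpy as np

def fit_max_scale(train_features):
    """Return the per-column maxima of the training features."""
    col_max = np.asarray(train_features, dtype=float).max(axis=0)
    # Guard against division by zero for columns that are all zero.
    return np.where(col_max == 0, 1.0, col_max)

def apply_max_scale(features, col_max):
    """Scale features by maxima that were computed on the training data."""
    return np.asarray(features, dtype=float) / col_max

# Made-up raw features: [tweet length, number of sentences].
raw_train = np.array([[140.0, 3.0], [70.0, 1.0], [35.0, 2.0]])
raw_test = np.array([[70.0, 1.0], [105.0, 2.0]])

col_max = fit_max_scale(raw_train)
X_train_scaled = apply_max_scale(raw_train, col_max)
X_test_scaled = apply_max_scale(raw_test, col_max)

# The same raw tweet now maps to the same feature vector in both sets.
print(X_train_scaled[1], X_test_scaled[0])
```

Test values can exceed 1 when a test tweet is longer than every training tweet, which is usually harmless for an SVM.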

<h3> Evaluation</h3>

<p>Below is some code to see your accuracy when trained and tested on the training data set.</p>

In [32]:
# evaluate and classify on training set
Y_pred = train_and_classify(df_X_train, Y_train, df_X_train)

def accuracy(Y_pred, Y_true):
    return np.sum(Y_pred == Y_true) / Y_pred.shape[0]

acc = accuracy(Y_pred, Y_train)
print('accuracy: ' + str(round(acc * 100, 2)) + '%')

accuracy: 71.63%


In [12]:
k = 10
kf = KFold(n_splits=k, shuffle=False)
acc = np.zeros(k)

# kf.split yields positional indices, so select rows with .iloc rather than .loc
for i, (train_index, test_index) in enumerate(kf.split(df_X_train)):
    
    df_X_train_selection = df_X_train.iloc[train_index]
    Y_train_selection = Y_train[train_index]
    
    df_X_test_selection = df_X_train.iloc[test_index]
    Y_test_selection = Y_train[test_index]
    
    Y_pred = train_and_classify(df_X_train_selection, Y_train_selection, df_X_test_selection)
    
    acc[i] = accuracy(Y_pred, Y_test_selection)
    
print("Accuracy: ", round(np.mean(acc)*100,2), "+/-", round(np.std(acc)*100/np.sqrt(k),2), "%")

Accuracy:  70.42 +/- 1.88 %
