# Tweet classification: Trump vs. Trudeau

For this data science project, I aimed to classify tweets from two prominent and polarising North American politicians: Donald Trump and Justin Trudeau. 

![Donald Trump and Justin Trudeau shaking hands.](https://upload.wikimedia.org/wikipedia/commons/4/47/President_Donald_Trump_and_Prime_Minister_Justin_Trudeau_Joint_Press_Conference%2C_February_13%2C_2017.jpg)

The task is an exercise in social media text classification. Tweets pose specific challenges: they are short, they use platform-specific syntax (e.g. mentions, hashtags, emoji, and links), and they need to be processed into a form that an ML model can read and learn from.

### Data Collection and Preparation

To start the project, I gathered tweets from both Donald Trump and Justin Trudeau. The tweets were fetched via the Twitter API in November 2017 and stored in a CSV file.

### Data Preprocessing

Once the data was collected, I performed extensive preprocessing to clean and prepare the text data for analysis. This included removing irrelevant information such as links, emoji, and special characters, as well as standardizing text format and handling platform-specific conventions like mentions and hashtags.
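
As an illustration of this kind of cleaning, here is a minimal sketch that uses only the standard library. The function name `clean_tweet` and the `URL` and `MENTION` placeholder tokens are assumptions made for this example, not the exact steps used in the project, and emoji handling is left out.

```python
import re

# Hypothetical patterns for tweet-specific syntax
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def clean_tweet(text: str) -> str:
    """Replace links and mentions with placeholders and unwrap hashtags."""
    text = URL_RE.sub("URL", text)          # links carry little lexical signal
    text = MENTION_RE.sub("MENTION", text)  # keep the fact that a mention occurred
    text = HASHTAG_RE.sub(r"\1", text)      # "#MAGA" -> "MAGA"
    return " ".join(text.split())           # collapse repeated whitespace


print(clean_tweet("Thank you @FLOTUS!   #MAGA https://t.co/abc123"))
```

Replacing links and mentions with placeholders rather than deleting them keeps the information that a link or mention was present, which may itself differ between authors.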

### Model Selection

For tweet classification, I experimented with several classification models, including Multinomial Naive Bayes, Linear Support Vector Classifier (SVC), and Passive Aggressive Classifier. Each model was evaluated based on its performance metrics and ability to accurately classify tweets.

### Feature Engineering

To represent the text data in a format suitable for machine learning models, I employed two vectorization techniques: CountVectorizer and TfidfVectorizer. Both transform each tweet into a numerical feature vector. CountVectorizer records raw token counts, while TfidfVectorizer additionally down-weights tokens that appear in many tweets.
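
To make the difference between counts and TF-IDF weights concrete, the sketch below computes both by hand for a three-document toy corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is my understanding of scikit-learn's default behaviour. The corpus and function names are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus (invented for this example)
docs = ["great great america", "canada great", "canada merci"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each token occur?
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))


def idf(token):
    # Smoothed inverse document frequency (assumed to match sklearn's default)
    return math.log((1 + n_docs) / (1 + doc_freq[token])) + 1


def tfidf(doc_tokens):
    counts = Counter(doc_tokens)  # this is what a count vectorizer stores
    raw = {tok: n * idf(tok) for tok, n in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {tok: w / norm for tok, w in raw.items()}  # L2-normalised weights


weights = tfidf(tokenized[2])  # "canada merci"
print(weights)
```

Both tokens occur once in the third document, so their counts are equal, but `merci` receives the larger TF-IDF weight because it appears in fewer documents than `canada`.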

### Model Evaluation and Optimization

To assess the performance of each classification model, I utilized metrics such as accuracy, precision, recall, and F1-score. Additionally, I employed techniques like cross-validation and grid search to optimize model parameters and improve classification performance.
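
For reference, these metrics can be computed directly from the predictions. The sketch below is a plain-Python illustration with invented labels; the helper name `binary_metrics` is hypothetical, and in practice `sklearn.metrics` provides equivalent functions.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 with one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented labels, not results from the project
y_true = ["Trump", "Trump", "Trump", "Trudeau", "Trudeau", "Trudeau"]
y_pred = ["Trump", "Trump", "Trudeau", "Trudeau", "Trump", "Trump"]

print(binary_metrics(y_true, y_pred, positive="Trump"))
```

Here two of the four tweets predicted as Trump are correct (precision 0.5), and two of his three tweets are found (recall of about 0.67).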

### Importing the Data

In [None]:
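# Set seeds for reproducibility. scikit-learn draws from NumPy's global
# random state when no random_state is passed, so seed NumPy as well.
import random
import numpy as np

random.seed(53)
np.random.seed(53)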

# Import all we need from sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn import metrics

### Transforming the Collected Data

Initially, I worked with a corpus of tweets collected in November 2017, stored in a CSV file. Utilizing a Pandas DataFrame, I imported this data to facilitate its preparation for machine learning with scikit-learn.

Because the dataset was fetched via the Twitter API, it had no pre-existing split into training and testing sets, so I performed the split myself with `train_test_split()`. Setting `random_state=53` makes the split reproducible across executions, and a test size of `0.33` holds out roughly a third of the tweets for the evaluation phase.

In [None]:
import pandas as pd

# Load data
tweet_df = pd.read_csv('datasets/trump_trudeau/trump_trudeau_tweets.csv')

# Create target
y = tweet_df['author']

# Split training and testing data
X_train, X_test, y_train, y_test = train_test_split(
    tweet_df['status'], y, 
    test_size=0.33, 
    random_state=53)
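
The effect of `random_state` can be checked on toy data. The sketch below also passes `stratify`, which is not used in the cell above; it is shown here only as an optional way to keep the author proportions equal in both splits. The texts and labels are placeholders.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 6 tweets per author
texts = [f"tweet {i}" for i in range(12)]
authors = ["Donald J. Trump"] * 6 + ["Justin Trudeau"] * 6


def split(seed):
    # stratify is an optional addition for this sketch
    return train_test_split(
        texts, authors,
        test_size=0.33,
        random_state=seed,
        stratify=authors)


first = split(53)
second = split(53)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = first
print(len(X_train_demo), len(X_test_demo))
print(first == second)  # same seed, same split
```

With 12 samples, a test size of `0.33` rounds up to 4 test tweets, and stratification assigns 2 to each author.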

### Vectorizing the Tweets

With the training and testing data prepared, the next step was to convert the tweets into a format suitable for machine learning. I used CountVectorizer and TfidfVectorizer to create vectorized representations of the tweets. 

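I fitted each vectorizer on the training tweets only and then used the fitted vocabulary to transform the test tweets, so that no information from the test set leaked into the features. Once the tweets were vectorized, I was ready to move forward with modeling using these new vector representations.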

In [None]:
# Initialize count vectorizer
count_vectorizer = CountVectorizer(stop_words='english', min_df=0.05, max_df=0.9)

# Create count train and test variables
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)

# Initialize tfidf vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', min_df=0.05, max_df=0.9)

# Create tfidf train and test variables
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)
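
The `min_df` and `max_df` arguments prune the vocabulary by document frequency. With `min_df=0.05`, a token must appear in at least 5% of the training tweets to be kept, which can remove a large share of rare words. The sketch below shows the mechanism on an invented four-document corpus, with a deliberately high `min_df=0.5` so the effect is visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented training corpus
train_docs = [
    "great america great jobs",
    "great trade deal",
    "canada merci great",
    "canada climate",
]

# Keep tokens that occur in at least 50% and at most 90% of the documents
demo_vectorizer = CountVectorizer(min_df=0.5, max_df=0.9)
demo_train = demo_vectorizer.fit_transform(train_docs)

vocab = sorted(demo_vectorizer.vocabulary_)
print(vocab)             # only tokens frequent enough to survive pruning
print(demo_train.shape)  # (number of documents, vocabulary size)

# Unseen or pruned tokens are silently ignored at transform time
demo_test = demo_vectorizer.transform(["canada loves hockey", "hockey night"])
print(demo_test.toarray())
```

Only `canada` and `great` occur in at least two of the four documents, so every other token is dropped, and a test tweet made up entirely of dropped or unseen tokens becomes an all-zero vector.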

### Training a Multinomial Naive Bayes Model

After vectorizing the data, I trained my first model using the Multinomial Naive Bayes approach with both CountVectorizer and TfidfVectorizer data. My goal was to determine which vectorization method would enhance the model's performance and why.

To evaluate the effectiveness of each model, I compared the accuracy scores from the test sets for both the CountVectorizer and TfidfVectorizer implementations.

In [None]:
# Create a MultinomialNB model for the TF-IDF features
tfidf_nb = MultinomialNB()

tfidf_nb.fit(tfidf_train, y_train)

# Predict on the TF-IDF test data
tfidf_nb_pred = tfidf_nb.predict(tfidf_test)

# Calculate the accuracy of the TF-IDF predictions
tfidf_nb_score = metrics.accuracy_score(y_test, tfidf_nb_pred)

# Create a MultinomialNB model for the count features
count_nb = MultinomialNB()

count_nb.fit(count_train, y_train)

# Predict on the count test data
count_nb_pred = count_nb.predict(count_test)

# Calculate the accuracy of the count predictions
count_nb_score = metrics.accuracy_score(y_test, count_nb_pred)

print('NaiveBayes Tfidf Score: ', tfidf_nb_score)
print('NaiveBayes Count Score: ', count_nb_score)

### Evaluating the Model with a Confusion Matrix

I observed that the TF-IDF model outperformed the count-based approach. From what I learned in the NLP fundamentals course, this likely stems from TF-IDF's ability to emphasize unique tokens that might be key identifiers for each tweeter.

To assess the model more comprehensively, I relied not just on accuracy scores but also on the confusion matrix. This matrix shows the number of correct and incorrect classifications for each class, that is, the counts of True Positives, False Positives, False Negatives, and True Negatives.

This detailed view helped me understand the model’s performance, particularly how often Trump was misclassified as Trudeau.

In [None]:
%matplotlib inline

from datasets.trump_trudeau.helper_functions import plot_confusion_matrix

# Calculate the confusion matrices for the tfidf_nb model and count_nb models
tfidf_nb_cm = metrics.confusion_matrix(y_test, tfidf_nb_pred, labels=['Donald J. Trump', 'Justin Trudeau']) 
count_nb_cm = metrics.confusion_matrix(y_test, count_nb_pred, labels=['Donald J. Trump', 'Justin Trudeau'])

# Plot the tfidf_nb_cm confusion matrix
plot_confusion_matrix(tfidf_nb_cm, classes=['Donald J. Trump', 'Justin Trudeau'], title="TF-IDF NB Confusion Matrix")

# Plot the count_nb_cm confusion matrix without overwriting the first plot 
plot_confusion_matrix(count_nb_cm, classes=['Donald J. Trump', 'Justin Trudeau'], title="Count NB Confusion Matrix", figure=1)
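
To read the matrices returned by `metrics.confusion_matrix`, it helps to know the layout: rows are the true authors and columns are the predicted authors, both in the order given by `labels`. The sketch below uses invented labels, not the project's predictions, to show which cell holds which kind of error.

```python
from sklearn import metrics

labels = ['Donald J. Trump', 'Justin Trudeau']
T, J = labels

# Invented example: 4 Trump tweets and 3 Trudeau tweets
y_true_demo = [T, T, T, T, J, J, J]
y_pred_demo = [T, T, T, J, J, J, T]

cm = metrics.confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)

# Row = true author, column = predicted author
trump_as_trudeau = cm[0, 1]
trudeau_as_trump = cm[1, 0]
print("Trump tweets predicted as Trudeau:", trump_as_trudeau)
print("Trudeau tweets predicted as Trump:", trudeau_as_trump)
```

The diagonal holds the correct predictions, so the off-diagonal cells are the ones to watch when comparing models.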

### Experimenting with Linear SVC

The Naive Bayes models showed minimal difference between TF-IDF and count vectors: their predictions differed on only one tweet. Both still produced some misclassifications, particularly tweets by Trudeau that were predicted as Trump's, which made me want to look more closely at the tokens responsible.

Before doing so, I decided to try LinearSVC, which is known for its effectiveness in text classification, to see whether applying it to the TF-IDF vectors would improve accuracy further.

In [None]:
# Create a LinearSVC model
tfidf_svc = LinearSVC()

tfidf_svc.fit(tfidf_train, y_train)

# Run predict on your tfidf test data to get your predictions
tfidf_svc_pred = tfidf_svc.predict(tfidf_test)

# Calculate your accuracy using the metrics module
tfidf_svc_score = metrics.accuracy_score(y_test, tfidf_svc_pred)

print("LinearSVC Score:   %0.3f" % tfidf_svc_score)

# Calculate the confusion matrices for the tfidf_svc model
svc_cm = metrics.confusion_matrix(y_test, tfidf_svc_pred, labels=['Donald J. Trump', 'Justin Trudeau'])

# Plot the confusion matrix using the plot_confusion_matrix function
plot_confusion_matrix(svc_cm, classes=['Donald J. Trump', 'Justin Trudeau'], title="TF-IDF LinearSVC Confusion Matrix")

### Introspecting the Best-Performing Model

The LinearSVC model performed even better than the Multinomial Naive Bayes one.

The confusion matrix still showed instances where Trudeau's tweets were misclassified as Trump's, but the False Positive rate was lower than with the previous model.

There is room for further refinement, which I could potentially achieve by enhancing all the previous models through parameter optimization and implementing more effective preprocessing strategies for the tweets.

I wanted to find out what the model had learned. LinearSVC is a linear classifier, so it assigns a weight to every feature (token) when separating the two classes, Trump and Trudeau. By sorting the tokens by weight, I was able to identify the most significant tokens for each politician.

This exploration aimed to uncover whether the model had learned distinguishing characteristics that were genuinely useful for differentiating between these two prominent figures.

What I found raised intriguing questions about the most 'Trump-like' or 'Trudeau-like' words and whether the model had effectively captured the essential distinctions between these two politicians and their tweet vocabulary.

In [None]:
from datasets.trump_trudeau.helper_functions import plot_and_return_top_features

# Import pprint from pprint
from pprint import pprint

# Get the top features using the plot_and_return_top_features function and your top model and tfidf vectorizer
top_features = plot_and_return_top_features(tfidf_svc, tfidf_vectorizer)

# pprint the top features
pprint(top_features)
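
The helper `plot_and_return_top_features` is specific to this project, so its internals are not shown here. As a rough idea of what such a ranking involves, the sketch below sorts the entries of `coef_` from a LinearSVC trained on invented toy tweets. For a two-class problem, negative weights point towards `classes_[0]` and positive weights towards `classes_[1]`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Invented toy tweets with disjoint vocabularies per author
toy_texts = [
    "great wall great jobs", "fake news media", "great america jobs",
    "canada merci les canadiens", "les canadiens climate", "canada climate merci",
]
toy_authors = ["Donald J. Trump"] * 3 + ["Justin Trudeau"] * 3

toy_vectorizer = TfidfVectorizer()
toy_model = LinearSVC(random_state=53).fit(
    toy_vectorizer.fit_transform(toy_texts), toy_authors)

# Map column index -> token, then rank columns by learned weight
index_to_token = {i: tok for tok, i in toy_vectorizer.vocabulary_.items()}
weights = toy_model.coef_[0]
ranked = sorted(range(len(weights)), key=lambda i: weights[i])

negative_class, positive_class = toy_model.classes_
top_negative = [index_to_token[i] for i in ranked[:3]]
top_positive = [index_to_token[i] for i in ranked[-3:][::-1]]

print(negative_class, top_negative)
print(positive_class, top_positive)
```

Because `classes_` is sorted alphabetically, the most negative weights belong to the Trump-like tokens and the most positive weights to the Trudeau-like ones in this toy setup.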

### Tweet Impersonations

My model recognized Trudeau's tendency to tweet in French. I challenged myself to write a tweet that could deceive the model into thinking it was authored by either Trump or Trudeau. With further tinkering and a collection of the words that the model associates with each politician, it could be possible to craft tweets that would be misclassified.

For those proficient in French, I suggest trying a Trudeau impersonation in French. Interestingly, while removing both English and French stop words might streamline preprocessing, it could lower the model's accuracy, since Trudeau is the sole French speaker in the dataset and French stop words are therefore a strong signal of his authorship.

This suggests, however, that if the dataset were expanded to include more French-speaking politicians, removing French stop words would probably become a worthwhile preprocessing step.

In [None]:
trump_tweet = "America is great!"
trudeau_tweet = "Canada les"

trump_tweet_vectorized = tfidf_vectorizer.transform([trump_tweet])
trudeau_tweet_vectorized = tfidf_vectorizer.transform([trudeau_tweet])

trump_tweet_pred = tfidf_svc.predict(trump_tweet_vectorized)
trudeau_tweet_pred = tfidf_svc.predict(trudeau_tweet_vectorized)

print("Predicted Trump tweet", trump_tweet_pred)
print("Predicted Trudeau tweet", trudeau_tweet_pred)
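
When crafting impersonation tweets, it can help to check which words the vectorizer actually knows and how strongly the model leans towards either author. The sketch below does this on invented toy data; the helper names `known_tokens` and `score` are hypothetical. `decision_function` returns a signed score, where larger values favour `classes_[1]`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Invented toy tweets, not the project's dataset
demo_texts = [
    "great wall great jobs", "fake news media", "great america jobs",
    "canada merci les canadiens", "les canadiens climate", "canada climate merci",
]
demo_authors = ["Donald J. Trump"] * 3 + ["Justin Trudeau"] * 3

demo_vectorizer = TfidfVectorizer(stop_words="english")
demo_model = LinearSVC(random_state=53).fit(
    demo_vectorizer.fit_transform(demo_texts), demo_authors)


def known_tokens(text):
    """Tokens of `text` that survive preprocessing and are in the vocabulary."""
    analyzer = demo_vectorizer.build_analyzer()
    return [tok for tok in analyzer(text) if tok in demo_vectorizer.vocabulary_]


def score(text):
    """Signed distance from the decision boundary (positive favours classes_[1])."""
    return float(demo_model.decision_function(demo_vectorizer.transform([text]))[0])


for tweet in ["America is great!", "Canada les", "Hockey tonight"]:
    print(tweet, known_tokens(tweet), round(score(tweet), 3))
```

A tweet with no known tokens, such as the third one, is vectorized as all zeros, so its score is just the model's intercept. A prediction for such a tweet says nothing about its content.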

### Results and Conclusion

After experimenting with and evaluating the models above, I found that the most effective classifier for distinguishing between tweets from Donald Trump and Justin Trudeau was LinearSVC with TF-IDF features.

The project provided valuable insights into the challenges of tweet classification and demonstrated the feasibility of building accurate classifiers for social media text analysis.

Future work on this dataset could involve:
- adding extra preprocessing to the current workflow, in order to observe how these modifications affect the performance of the classifiers:
  - removing URLs, which do not contribute meaningful information to text classification
  - removing French stop words, which might reduce noise in the data, since Trudeau occasionally tweets in French
- using GridSearchCV to improve both the Naive Bayes and LinearSVC models by finding the optimal parameters
- introspecting the Naive Bayes model to determine which words are used more often by Trump or Trudeau
- using tweepy to retrieve more recent tweets, add them to the dataset, and retrain so that the classifiers remain effective and relevant
- continuing to write impersonation tweets, both as a fun application of the project and as a practical test of the classifier's effectiveness
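
As a starting point for the GridSearchCV item, the sketch below wraps the vectorizer and classifier in a `Pipeline` so that vectorizer settings and model parameters are tuned together, with the vectorizer refitted inside each cross-validation fold. The toy tweets and the parameter grid are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented toy tweets, 6 per author
trump_texts = [
    "great wall great jobs", "fake news media", "great america jobs",
    "america first trade deal", "fake news again", "jobs jobs jobs great",
]
trudeau_texts = [
    "canada merci les canadiens", "les canadiens climate", "canada climate merci",
    "merci canada", "climate action canada", "les canadiens merci",
]
texts = trump_texts + trudeau_texts
authors = ["Donald J. Trump"] * 6 + ["Justin Trudeau"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", LinearSVC(random_state=53)),
])

# Hypothetical grid: step name + double underscore + parameter name
param_grid = {
    "tfidf__stop_words": [None, "english"],
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "svc__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, authors)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real dataset the same pattern would take `X_train` and `y_train`, leaving the held-out test set untouched until the final evaluation.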