# Assignment 2: POTUS

---

## Task 1) President of the United States (Trump vs. Obama)

Surely, you're aware that the 45th President of the United States (@POTUS45) was an active user of Twitter until he was permanently banned on Jan 8, 2021.
You can still enjoy his greatness at the [Trump Twitter Archive](https://www.thetrumparchive.com/). We will be using original tweets only, so make sure to remove all retweets.
Another fan of Twitter was Barack Obama (@POTUS44), who used the platform in a rather professional way.
Please also consider the POTUS Tweets of Joe Biden; we will be using those for testing.

### Data

There are multiple ways to get the data, but the easiest way is to download the files from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group. 
Another way is to directly use the data from [Trump Twitter Archive](https://www.thetrumparchive.com/), [Obama Kaggle](https://www.kaggle.com/jayrav13/obama-white-house), and [Biden Kaggle](https://www.kaggle.com/rohanrao/joe-biden-tweets).
Before you get started, please download the files; you can put them into the data folder.

### N-gram Models

In this assignment, you will do some Twitter-related preprocessing and train n-gram models to distinguish between Tweets of Trump, Obama, and Biden.
We will be using [NLTK](https://www.nltk.org), more specifically its [`lm`](https://www.nltk.org/api/nltk.lm.html) module.
Install the NLTK package within your working environment.
You can use some of the NLTK functions, but you have to implement the functions for likelihoods and perplexity from scratch.

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [1]:
# Dependencies
import nltk
import csv
import json
import math
import random
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt

### Prepare the Data

1.1 Prepare all the Tweets. Since the `lm` module works on tokenized data, implement a tokenization method that strips unnecessary tokens but retains special words such as mentions (@...) and hashtags (#...).

1.2 Partition into training and test sets; select about 100 tweets each, which we will be testing on later. As with any Machine Learning task, training and test must not overlap.
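
As a rough illustration of 1.1 and 1.2, the sketch below uses only the standard library rather than NLTK's `TweetTokenizer`. The regular expression, the helper names `simple_tweet_tokenize` and `disjoint_split`, and the fixed seed are assumptions made for this example; they are not part of the assignment template.

```python
import random
import re

# Hypothetical pattern: @mentions, #hashtags, or plain alphanumeric words.
TOKEN_PATTERN = re.compile(r"[@#]\w+|\w+")


def simple_tweet_tokenize(text):
    """Tokenize a tweet, keep @mentions and #hashtags, add boundary tokens."""
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs before tokenizing
    return ["<s>"] + TOKEN_PATTERN.findall(text) + ["</s>"]


def disjoint_split(tweets, num_test=100, seed=0):
    """Shuffle a copy of the tweets and hold out num_test of them for testing."""
    shuffled = list(tweets)
    random.Random(seed).shuffle(shuffled)
    return shuffled[num_test:], shuffled[:num_test]


tokens = simple_tweet_tokenize("Thanks @NASA! #Mars is next https://t.co/abc")
print(tokens)

train, test = disjoint_split([f"tweet {i}" for i in range(10)], num_test=3)
print(len(train), len(test))
```

Splitting by position keeps the two sets apart only if the input contains no duplicate tweets, so it may be worth de-duplicating the raw text first.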

In [2]:
# Notice: ignore retweets 

def load_tweets_from_csv(filepath, column_name):
    tweets = []

    with open(filepath, 'r', encoding='utf-8') as file:
        reader = csv.DictReader(file)

        for row in reader:
            if column_name in row: 
                tweets.append(row[column_name])

    return tweets

def load_tweets_from_json(filepath, key_name):

    tweets = []

    with open(filepath, 'r', encoding='utf-8') as file:
        data = json.load(file) 
        for entry in data:
            if entry.get('isRetweet') == 'f' and key_name in entry:
                tweets.append(entry[key_name])

    return tweets


def load_trump_tweets(filepath):
    """Loads all Trump tweets and returns them as a list."""
    ### YOUR CODE HERE
    return load_tweets_from_json(filepath, "text")
    ### END YOUR CODE


def load_obama_tweets(filepath):
    """Loads all Obama tweets and returns them as a list."""
    ### YOUR CODE HERE
    
    return load_tweets_from_csv(filepath, "Tweet-text")

    ### END YOUR CODE
    

def load_biden_tweets(filepath):
    """Loads all Biden tweets and returns them as a list."""
    ### YOUR CODE HERE
    
    return load_tweets_from_csv(filepath, "tweet")
    
    ### END YOUR CODE

In [3]:
# Notice: think about start and end tokens

NUM_TEST = 100

nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english')) 

def tokenize(text):
    """Tokenizes a single Tweet."""
    ### YOUR CODE HERE
    tokenizer = TweetTokenizer(preserve_case=True, reduce_len=True, strip_handles=False)
    tokens = tokenizer.tokenize(text)
    tokens = [token for token in tokens if token.isalnum()]
    filtered = [token for token in tokens if token.lower() not in stop_words]
    filtered = ['<s>'] + filtered + ['</s>']
    return filtered
    ### END YOUR CODE
    

def split_and_tokenize(data, num_test=NUM_TEST):
    """Splits and tokenizes the given list of Twitter tweets."""
    ### YOUR CODE HERE
    
    split_index = len(data) - num_test
    
    train_data = data[:split_index]
    test_data = data[split_index:]
    
    tokenized_train = [tokenize(tweet) for tweet in train_data]
    tokenized_test = [tokenize(tweet) for tweet in test_data]
    
    return tokenized_train, tokenized_test
    
    ### END YOUR CODE

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\André\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\André\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
trump_tweets = split_and_tokenize(load_trump_tweets("./data/tweets_01-08-2021.json"))
obama_tweets = split_and_tokenize(load_obama_tweets("./data/Tweets-BarackObama.csv"))
biden_tweets = split_and_tokenize(load_biden_tweets("./data/JoeBidenTweets.csv"))

In [5]:
print(obama_tweets[1][:2])

[['<s>', 'ultimate', 'goal', 'agreement', 'gets', 'deficit', 'control', 'way', 'fair', 'balanced', 'President', 'Obama', '</s>'], ['<s>', 'voices', 'American', 'people', 'part', 'debate', 'President', 'Obama', '</s>']]


### Train N-gram Models

2.1 Train n-gram models with n = [1, ..., 5] for Obama, Trump, and Biden.

2.2 Also train a joint model on the combined training data; it will serve as a background model.
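
One possible reading of 2.1 and 2.2 is sketched below with toy data: each model is a relative-frequency estimate, and the joint model is trained on the concatenation of all training sets. The helper names `count_ngrams` and `mle_model` and the toy tweets are assumptions for illustration only.

```python
from collections import Counter


def count_ngrams(tweets, n):
    """Count all n-grams of order n over a list of tokenized tweets."""
    counts = Counter()
    for tokens in tweets:
        for j in range(len(tokens) - n + 1):
            counts[tuple(tokens[j:j + n])] += 1
    return counts


def mle_model(tweets, n):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    ngram_counts = count_ngrams(tweets, n)
    prefix_counts = Counter()
    for ngram, count in ngram_counts.items():
        prefix_counts[ngram[:-1]] += count
    return {ng: c / prefix_counts[ng[:-1]] for ng, c in ngram_counts.items()}


# Toy training sets standing in for the per-author tokenized tweets.
author_a = [["<s>", "make", "it", "great", "</s>"]]
author_b = [["<s>", "make", "it", "fair", "</s>"]]

# The joint (background) model sees the training data of every author.
background = mle_model(author_a + author_b, 2)
print(background[("it", "great")])  # 0.5
```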

In [6]:
def build_n_gram_models(n, data):
    """
    To predict the first few words of a Tweet, we need the smaller n-grams as
    well. This function therefore builds models for all orders from 1 up to n.
    """
    ### YOUR CODE HERE

    models = {}
    for i in range(1, n + 1):
        n_gram_counts = {}
        prefix_counts = {}
        for tweet in data:
            for j in range(len(tweet) - i + 1):
                # Store n-grams as tuples so they can serve as dictionary keys
                n_gram = tuple(tweet[j:j + i])
                prefix = n_gram[:-1]  # empty tuple for unigrams
                n_gram_counts[n_gram] = n_gram_counts.get(n_gram, 0) + 1
                prefix_counts[prefix] = prefix_counts.get(prefix, 0) + 1

        # Maximum-likelihood estimate: count(n-gram) / count(prefix)
        models[i] = {
            n_gram: count / prefix_counts[n_gram[:-1]]
            for n_gram, count in n_gram_counts.items()
        }

    return models
    
    ### END YOUR CODE


def get_suggestion(prev, n_gram_model):
    """
    Gets the next random word for the given n_grams.
    The size of the previous tokens must be exactly one less than the n-value
    of the n-gram, or it will not be able to make a prediction.
    """
    ### YOUR CODE HERE
    
    if len(prev) != len(next(iter(n_gram_model.keys()))) - 1:
        raise ValueError("The size of the previous tokens must match the n-1 value of the n-gram model.")

    possible_continuations = {key[-1]: n_gram_model[key] for key in n_gram_model if key[:-1] == prev}

    if not possible_continuations:
        return None

    next_words = list(possible_continuations.keys())
    probabilities = list(possible_continuations.values())

    next_word = random.choices(next_words, weights=probabilities, k=1)[0]

    return next_word
    
    ### END YOUR CODE


def get_random_tweet(n, n_gram_models):
    """Generates a random tweet by sampling from the given n-gram models."""
    ### YOUR CODE HERE

    if n not in n_gram_models:
        raise ValueError("Specified order n is not available in the provided n-gram models.")

    current_tweet = ['<s>']
    sentence_finished = False

    while not sentence_finished:
        # Use up to n-1 previous tokens as context. At the start of the tweet
        # fewer are available, so fall back to the matching lower-order model.
        current_context = tuple(current_tweet[-(n - 1):]) if n > 1 else ()
        next_word = get_suggestion(current_context, n_gram_models[len(current_context) + 1])
        if next_word == '</s>' or next_word is None:
            sentence_finished = True
        else:
            current_tweet.append(next_word)

    current_tweet.append("</s>")
    final_tweet = ' '.join(current_tweet)
    return final_tweet
    
    ### END YOUR CODE

In [18]:
n_gram_models = build_n_gram_models(2, trump_tweets[0])
random_tweet_trump = get_random_tweet(2, n_gram_models)
print(random_tweet_trump)

<s> playing Dallas way violent big release American time resort Trump run </s>


### Classify the Tweets

3.1 Use the log-ratio method to classify the Tweets for Trump vs. Biden. Trump should be easy to spot; but what about Obama vs. Biden?

3.2 Analyze: At what context length (n) does the system perform best?
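
To make the log-ratio idea concrete: for every position with a full context, take the difference of the two models' log-probabilities and sum these differences over the Tweet; a positive sum favors the first model. The sketch below uses toy bigram models. The name `log_ratio_score`, the floor value for unseen n-grams, and the toy probabilities are assumptions for illustration.

```python
import math

FLOOR = 1e-10  # assumed probability floor for n-grams a model has never seen


def log_ratio_score(tokens, model1, model2, n):
    """Sum of per-token log-probability differences between two n-gram models."""
    score = 0.0
    for i in range(n - 1, len(tokens)):
        ngram = tuple(tokens[i - n + 1:i + 1])
        score += math.log(model1.get(ngram, FLOOR)) - math.log(model2.get(ngram, FLOOR))
    return score


# Toy bigram models: n-gram tuples mapped to conditional probabilities.
model_a = {("<s>", "great"): 0.8, ("great", "</s>"): 1.0}
model_b = {("<s>", "great"): 0.2, ("great", "</s>"): 1.0}

score = log_ratio_score(["<s>", "great", "</s>"], model_a, model_b, n=2)
print(score > 0)  # True: the first model explains the tweet better
```

A fixed floor is a crude stand-in for proper smoothing: with larger n, more test n-grams are unseen in both models and contribute nothing to the score, which is one thing to keep in mind when answering 3.2.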

In [8]:
def calculate_single_token_log_ratio(prev, token, n_gram_model1, n_gram_model2):
    """Calculates the log ratio of a token for two different n-gram models."""
    ### YOUR CODE HERE
    
    epsilon = 1e-10

    prob1 = n_gram_model1.get((prev + (token,)), epsilon)
    prob2 = n_gram_model2.get((prev + (token,)), epsilon)
    
    log_ratio = math.log(prob1 + epsilon) - math.log(prob2 + epsilon)
    
    return log_ratio
    
    ### END YOUR CODE


def classify(n, tokens, n_gram_models1, n_gram_models2):
    """
    Checks which of the two given datasets is more likely for the given Tweet.
    If true is returned, the first one is more likely, otherwise the second.
    """
    ### YOUR CODE HERE
    
    log_ratio_sum = 0
    for i in range(n - 1, len(tokens) - 1):
        prev = tuple(tokens[i - (n - 1):i])
        token = tokens[i]
        log_ratio = calculate_single_token_log_ratio(prev, token, n_gram_models1, n_gram_models2)
        log_ratio_sum += log_ratio
    
    # A positive sum favors the first model; ties go to the second
    return log_ratio_sum > 0
    
    ### END YOUR CODE


In [9]:
def validate(n, data1, data2, classify_fn):
    """
    Trains the n-gram models on the train data and validates on the test data.
    Uses the implemented classification function to predict the Tweeter.
    """
    ### YOUR CODE HERE
    
    train_data_1, test_data_1 = data1[0], data1[1]
    train_data_2, test_data_2 = data2[0], data2[1]

    n_gram_model_1 = build_n_gram_models(n, train_data_1)[n]
    n_gram_model_2 = build_n_gram_models(n, train_data_2)[n]

    test_data = test_data_1 + test_data_2
    test_labels = [True] * len(test_data_1) + [False] * len(test_data_2)  # True for 1, False for 2

    # Validate on test data
    correct_predictions = 0
    for tweet, label in zip(test_data, test_labels):
        prediction = classify_fn(n, tweet, n_gram_model_1, n_gram_model_2)
        if prediction == label:
            correct_predictions += 1

    accuracy = correct_predictions / len(test_data)
    print(f"Accuracy: {accuracy:.2f}")
    
    ### END YOUR CODE

In [10]:
context_length = 2
validate(context_length, trump_tweets, biden_tweets, classify_fn=classify)
validate(context_length, obama_tweets, biden_tweets, classify_fn=classify)

Accuracy: 0.86
Accuracy: 0.74


### Compute Perplexities

4.1 Compute (and plot) the perplexities for each of the test tweets and models. Is picking the model with minimum perplexity a better classifier than the log-ratio method in 3.1?
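
As a reminder of the quantity involved, perplexity is the exponential of the negative mean log-probability of the predicted tokens; the logarithm and the exponential must use the same base. The sketch below assumes the per-token probabilities are already available. The function name `perplexity` and the toy numbers are assumptions for illustration.

```python
import math


def perplexity(probabilities):
    """Perplexity as exp of the negative mean natural-log probability."""
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))


# A model that spreads its mass uniformly over 4 options has perplexity of about 4.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
print(uniform)

# Classification by minimum perplexity: pick the model that is least "surprised".
scores = {"A": perplexity([0.5, 0.5]), "B": perplexity([0.1, 0.1])}
best = min(scores, key=scores.get)
print(best)  # A
```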

In [11]:
def calculate_perplexity_of_sequence(tokens, n_gram_model, n):
    """
    Calculate the perplexity of a sequence given an n-gram model.
    """
    if len(tokens) < n:
        return float('inf')
    log_prob_of_sequence = 0
    total_n_grams = len(tokens) - n + 1

    for i in range(total_n_grams):
        n_gram = tuple(tokens[i:i+n])
        if n_gram in n_gram_model:
            probability = n_gram_model[n_gram]
        else:
            probability = 1e-10 

        log_prob_of_sequence += math.log(probability)

    # math.log is the natural logarithm, so exponentiate with base e as well
    return math.exp(-log_prob_of_sequence / total_n_grams)


def classify_with_perplexity(n, tokens, n_gram_models1, n_gram_models2):
    """
    Checks which of the two given datasets is more likely for the given Tweet.
    If true is returned, the first one is more likely, otherwise the second.
    """
    ### YOUR CODE HERE
    perplexity1 = calculate_perplexity_of_sequence(tokens, n_gram_models1, n)
    perplexity2 = calculate_perplexity_of_sequence(tokens, n_gram_models2, n)
    return perplexity1 < perplexity2 
    
    ### END YOUR CODE

In [12]:
context_length = 2
validate(context_length, trump_tweets, biden_tweets, classify_fn=classify_with_perplexity)
validate(context_length, obama_tweets, biden_tweets, classify_fn=classify_with_perplexity)

Accuracy: 0.89
Accuracy: 0.72
