# Assignment 2: POTUS

---

## Task 1) President of the United States (Trump vs. Obama)

Surely, you're aware that the 45th President of the United States (@POTUS45) was an active user of Twitter, until (permanently) banned on Jan 8, 2021.
You can still enjoy his greatness at the [Trump Twitter Archive](https://www.thetrumparchive.com/). We will be using original tweets only, so make sure to remove all retweets.
Another fan of Twitter was Barack Obama (@POTUS43 and @POTUS44), who used the platform in a rather professional way.
Please also consider the POTUS Tweets of Joe Biden; we will be using those for testing.

### Data

There are multiple ways to get the data, but the easiest way is to download the files from the `Supplemental Materials` in the `Files` section of our Microsoft Teams group. 
Another way is to directly use the data from [Trump Twitter Archive](https://www.thetrumparchive.com/), [Obama Kaggle](https://www.kaggle.com/jayrav13/obama-white-house), and [Biden Kaggle](https://www.kaggle.com/rohanrao/joe-biden-tweets).
Before you get started, please download the files; you can put them into the data folder.

### N-gram Models

In this assignment, you will be doing some Twitter-related preprocessing and training n-gram models to be able to distinguish between Tweets of Trump, Obama, and Biden.
We will be using [NLTK](https://www.nltk.org), more specifically it's [`lm`](https://www.nltk.org/api/nltk.lm.html) module. 
Install the NLTK package within your working environment.
You can use some of the NLTK functions, but you have to implement the functions for likelihoods and perplexity from scratch.

*In this Jupyter Notebook, we will provide the steps to solve this task and give hints via functions & comments. However, code modifications (e.g., function naming, arguments) and implementation of additional helper functions & classes are allowed. The code aims to help you get started.*

---

In [27]:
# Dependencies
import nltk
import pandas
import numpy

### Prepare the Data

1.1 Prepare all the Tweets. Since the `lm` modules will work on tokenized data, implement a tokenization method that strips unnecessary tokens but retains special words such as mentions (@...) and hashtags (#...).

1.2 Partition into training and test sets; select about 100 tweets each, which we will be testing on later. As with any Machine Learning task, training and test must not overlap.

In [28]:
# Notice: ignore retweets 

def load_trump_tweets(filepath) -> list:
    """Loads all Trump tweets and returns them as a list."""
    ### YOUR CODE HERE

    #     {
    #     "id": 98454970654916600,
    #     "text": "Republicans and Democrats have both created our economic problems.",
    #     "isRetweet": "f",
    #     "isDeleted": "f",
    #     "device": "TweetDeck",
    #     "favorites": 49,
    #     "retweets": 255,
    #     "date": "2011-08-02 18:07:48",
    #     "isFlagged": "f"
    #   },

    df = pandas.read_json(filepath)
    filtered_df = df[df['isRetweet'] == "f"]
    
    return(filtered_df['text'].to_list())
    
    ### END YOUR CODE


def load_obama_tweets(filepath) -> list:
    """Loads all Obama tweets and returns them as a list."""
    ### YOUR CODE HERE
    
    # Date,Username,Tweet-text,Tweet Link,Retweets,Likes,TweetImageUrl,Image
    
    with open(filepath, "r") as f:
        df = pandas.read_csv(filepath)
    return df['Tweet-text'].to_list()

    ### END YOUR CODE
    

def load_biden_tweets(filepath) -> list:
    """Loads all Biden tweets and returns them as a list."""
    ### YOUR CODE HERE
    
    # id,timestamp,url,tweet,replies,retweets,quotes,likes

    with open(filepath, "r") as f:
        df = pandas.read_csv(filepath)
    return df['tweet'].to_list()
    
    ### END YOUR CODE

In [53]:
# Notice: think about start and end tokens

import string


NUM_TEST = 100

def tokenize(text) -> list:
    """Tokenizes a single Tweet."""
    ### YOUR CODE HERE

    tk = nltk.tokenize.TweetTokenizer()
    tokens = tk.tokenize(text)
    
    return ['<s>'] + tokens + ['</s>']

    ### END YOUR CODE
    

def split_and_tokenize(data, num_test=NUM_TEST):
    """Splits and tokenizes the given list of Twitter tweets."""
    ### YOUR CODE HERE
    
    tokenizedTweets: list = []
    for tweet in data:
        tokenizedTweets.append(tokenize(tweet))
    
    return tokenizedTweets
    
    ### END YOUR CODE

In [54]:
basepath = "C:\\Dev\\uni\\seqlrn-assignments\\2-markov-chains\\data\\"
trump_tweets = split_and_tokenize(load_trump_tweets(basepath + "tweets_01-08-2021.json"))
obama_tweets = split_and_tokenize(load_obama_tweets(basepath + "Tweets-BarackObama.csv"))
biden_tweets = split_and_tokenize(load_biden_tweets(basepath + "JoeBidenTweets.csv"))

### Train N-gram Models

2.1 Train n-gram models with n = [1, ..., 5] for Obama, Trump, and Biden.

2.2 Also train a joint model, that will serve as background model.

In [126]:
def build_n_gram_models(n, data):
    """
    To predict the first few words of the Tweet, we need the smaller n-grams as
    well. This method does calculate all n-grams up to the given n.
    """
    ### YOUR CODE HERE

    ngrams = []
    for i in range(1, n+1):
        freq = nltk.FreqDist()
        for tweet in data: 
            ngram = list(nltk.ngrams(tweet, i))
            freq.update(ngram)
             
        ngrams.append(freq)
    return ngrams
    
    ### END YOUR CODE

def get_suggestion(prev, n_gram_model):
    """
    Gets the next random word for the given n_grams.
    The size of the previous tokens must be exactly one less than the n-value
    of the n-gram, or it will not be able to make a prediction.
    """
    ### YOUR CODE HERE
    count = 1 #for laplace
    possible_words = []
    for ngram in n_gram_model.keys():
        if list(ngram[:len(prev)]) == prev:
            possible_words.append(ngram) 
            count += n_gram_model.get(ngram)

    suggestions = []
    for word in possible_words:
        x = n_gram_model.get(word) / count  
        suggestions.append((word[-1], x)) 
    suggestions.sort(key=lambda x: x[1], reverse=True)

    selected = numpy.random.choice(len(suggestions))
    return suggestions[selected][0]
    
    ### END YOUR CODE


def get_random_tweet(n, n_gram_models):
    """Generates a random tweet using the given data set."""
    ### YOUR CODE HERE
    
    tweet = ['<s>']
    while tweet[-1] != '</s>':    
        # use smaller ngram for start
        if len(tweet) < n:
            tweet.append(get_suggestion(tweet, n_gram_models[len(tweet)]))

        # biggest ngram model for rest  
        else:
            tweet.append(get_suggestion(tweet[-n+1:], n_gram_models[n-1]))
    
    return tweet         
    
    ### END YOUR CODE

In [67]:
#ngram models
n_gram_models = {}
n_gram_models['trump'] = build_n_gram_models(5, trump_tweets)
n_gram_models['obama'] = build_n_gram_models(5, obama_tweets)
n_gram_models['biden'] = build_n_gram_models(5, biden_tweets)
n_gram_models['all'] = build_n_gram_models(5, trump_tweets + obama_tweets + biden_tweets)

In [130]:
random_tweet_trump = get_random_tweet(5, n_gram_models['trump'])
print(random_tweet_trump)

['<s>', 'Busy', 'doing', 'phoners', 'this', 'week', 'with', 'Neil', 'Cavuto', ',', '4', 'p', '.', 'm', '.', "We'll", 'be', 'discussing', 'current', 'affairs', 'and', 'politics', '.', 'Tune', 'in', '.', '</s>']


### Classify the Tweets

3.1 Use the log-ratio method to classify the Tweets for Trump vs. Biden. Trump should be easy to spot; but what about Obama vs. Biden?

3.2 Analyze: At what context length (n) does the system perform best?

In [191]:
import math

def calculate_single_token_log_ratio(prev, token, n_gram_model1, n_gram_model2):
    """Calculates the log ration of a token for two different n-grams"""
    ### YOUR CODE HERE

    prob1 = n_gram_model1.freq(tuple(numpy.append(prev, token)))
    prob2 = n_gram_model2.freq(tuple(numpy.append(prev, token)))
    
    return math.log((prob1 + 1e-10) / (prob2 + 1e-10))  # 1e-10 to avoid 0 division errors 
   
    ### END YOUR CODE


def classify(n, tokens, n_gram_models1, n_gram_models2):
    """
    Checks which of the two given datasets is more likely for the given Tweet.
    If true is returned, the first one is more likely, otherwise the second.
    """
    ### YOUR CODE HERE

    m1_log_ratio = 0
    m2_log_ratio = 0

    for i in range(0, len(tokens)):
        m1_log_ratio += calculate_single_token_log_ratio(tokens[i-n:i], tokens[i], n_gram_models1[n-1], n_gram_models2[n-1])
        m2_log_ratio = calculate_single_token_log_ratio(tokens[i-n:i], tokens[i], n_gram_models2[n-1], n_gram_models1[n-1])

    return m1_log_ratio > m2_log_ratio
    
    ### END YOUR CODE


In [192]:
def validate(n, data1, data2, classify_fn):
    """
    Trains the n-gram models on the train data and validates on the test data.
    Uses the implemented classification function to predict the Tweeter.
    """
    ### YOUR CODE HERE

    match = 0

    # indexes where to split the data into train and test (80% to 20%)
    testsplit_index_1 = int(len(data1) * 0.8)
    testsplit_index_2 = int(len(data2) * 0.8)

    d1_train = data1[0:testsplit_index_1]
    d1_test = data1[testsplit_index_1:]
    d2_train = data2[0:testsplit_index_2]
    d2_test = data2[testsplit_index_2:]

    d1_ngram_models = build_n_gram_models(n, d1_train)
    d2_ngram_models = build_n_gram_models(n, d2_train)

    for data in d1_test: 
        result = classify_fn(n, data, d1_ngram_models, d2_ngram_models)
        if result == True: match += 1
    for data in d2_test:
        result = classify_fn(n, data, d1_ngram_models, d2_ngram_models)
        if result == True: match += 1

    value = match / (len(data1) + len(data2))
    
    print("score: {}".format(value))
    
    ### END YOUR CODE

In [160]:
trump_tweet = get_random_tweet(5, n_gram_models['trump'])
obama_tweet = get_random_tweet(5, n_gram_models['obama'])
biden_tweet = get_random_tweet(5, n_gram_models['biden'])

In [193]:
context_length = 5
for i in range(1, context_length+1):
    print('Context length: {}'.format(i))
    validate(i, trump_tweets, biden_tweets, classify_fn=classify)
    validate(i, obama_tweets, biden_tweets, classify_fn=classify)

Context length: 1
score: 0.20000758178854391
score: 0.20007742934572204
Context length: 2
score: 0.0
score: 0.0
Context length: 3
score: 0.0
score: 0.0
Context length: 4
score: 0.0
score: 0.0
Context length: 5
score: 0.0
score: 0.0


### Compute Perplexities

4.1 Compute (and plot) the perplexities for each of the test tweets and models. Is picking the Model with minimum perplexity a better classifier than in 3.1?

In [None]:
def classify_with_perplexity(n, tokens, n_gram_models1, n_gram_models2):
    """
    Checks which of the two given datasets is more likely for the given Tweet.
    If true is returned, the first one is more likely, otherwise the second.
    """
    ### YOUR CODE HERE
    
    raise NotImplementedError()
    
    ### END YOUR CODE

In [None]:
context_length = 5
for i in range(1, context_length+1):
    validate(context_length, trump_tweets, biden_tweets, classify_fn=classify_with_perplexity)
    validate(context_length, obama_tweets, biden_tweets, classify_fn=classify_with_perplexity)