## Assignment 4
The selected implementation is the one published by the GitHub user [spro](https://github.com/spro/char-rnn.pytorch), which uses PyTorch.

Some customizations were made to adapt the code to the experiment; they concern presentation and usage rather than functionality.

### Preliminaries

In [3]:
import sys
import csv
import random
import string
import math
import torch
import torch.nn as nn
import numpy as np
from torch.autograd import Variable
from tqdm import tqdm

First, we define the CharRNN model. It's customizable in input and hidden sizes, as well as LSTM or GRU usage and the number of layers.

In [4]:
class CharRNN(nn.Module):  # https://github.com/spro/char-rnn.pytorch
    def __init__(self, input_size, hidden_size, output_size, model="gru", n_layers=1):
        super(CharRNN, self).__init__()
        self.model = model.lower()  # to select between LSTM and GRU layer
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers

        self.encoder = nn.Embedding(input_size, hidden_size)  # encodes the input
        if self.model == "gru":  # recurrent passage with GRU
            self.rnn = nn.GRU(hidden_size, hidden_size, n_layers)
        elif self.model == "lstm":  # recurrent passage with LSTM
            self.rnn = nn.LSTM(hidden_size, hidden_size, n_layers)
        self.decoder = nn.Linear(hidden_size, output_size)  # decodes the output state

    def forward(self, input, hidden):
        batch_size = input.size(0)
        encoded = self.encoder(input)
        output, hidden = self.rnn(encoded.view(1, batch_size, -1), hidden)
        output = self.decoder(output.view(batch_size, -1))
        return output, hidden

    def forward2(self, input, hidden):
        encoded = self.encoder(input.view(1, -1))
        output, hidden = self.rnn(encoded.view(1, 1, -1), hidden)
        output = self.decoder(output.view(1, -1))
        return output, hidden

    def init_hidden(self, batch_size):
        if self.model == "lstm":
            return (Variable(torch.zeros(self.n_layers, batch_size, self.hidden_size)),
                    Variable(torch.zeros(self.n_layers, batch_size, self.hidden_size)))
        return Variable(torch.zeros(self.n_layers, batch_size, self.hidden_size))

Then we define some utility functions. The first turns a given string into a Tensor of character indexes.

In [5]:
all_characters = string.printable
def char_tensor(string):
    tensor = torch.zeros(len(string)).long()  # inits the tensor
    for c in range(len(string)):
        # encodes each character with its string.printable index
        try:
            tensor[c] = all_characters.index(string[c])
        except ValueError:
            # characters outside string.printable keep index 0
            continue
    return tensor
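
As a rough illustration of this encoding (a sketch, not the notebook's code), the same mapping can be written with a precomputed dictionary. The names `ALL_CHARACTERS`, `CHAR_TO_INDEX` and `encode` are hypothetical. In `char_tensor`, a character outside `string.printable` leaves its slot at 0, which is also the index of `"0"`; the sketch makes that fallback explicit.

```python
import string

# Hypothetical helper names; only string.printable comes from the notebook.
ALL_CHARACTERS = string.printable
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALL_CHARACTERS)}

def encode(text, unknown_index=0):
    """Map each character to its position in string.printable.

    Characters outside the vocabulary fall back to `unknown_index`
    (0 mirrors what char_tensor does when the lookup fails).
    """
    return [CHAR_TO_INDEX.get(ch, unknown_index) for ch in text]

print(len(ALL_CHARACTERS))  # vocabulary size, used as input and output size
print(encode("ab A"))
```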

The models take some time to train (about 1 hour each), so we define a function to save them, which makes quick experiments possible later.

In [None]:
def save(model, filename):
    torch.save(model, filename)
    print('Saved as %s' % filename)

During training, we sample a random chunk from the training set in order to speed up the training while still being able to see most of the training corpus.

In [None]:
def random_training_set(text, chunk_len, batch_size):
    text_len = len(text)
    inp = torch.LongTensor(batch_size, chunk_len)
    target = torch.LongTensor(batch_size, chunk_len)
    for bi in range(batch_size):  # we produce batch_size sequences
        # each sequence is a chunk_len-long portion of the corpus;
        # the upper bound leaves room for the extra target character
        start_index = random.randint(0, text_len - chunk_len - 1)
        end_index = start_index + chunk_len + 1
        chunk = text[start_index:end_index]
        inp[bi] = char_tensor(chunk[:-1])
        target[bi] = char_tensor(chunk[1:])
    inp = Variable(inp)
    target = Variable(target)
    return inp, target
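
To make the input/target relationship concrete: each chunk holds `chunk_len + 1` characters, the input drops the last one and the target drops the first, so the target at position `i` is the character that follows input position `i`. A pure-Python sketch under that reading; `make_pair` is a hypothetical helper that works on plain strings instead of tensors.

```python
import random

def make_pair(text, chunk_len, rng=random):
    """Return (input, target) strings of length chunk_len, shifted by one."""
    # The last valid start leaves room for chunk_len + 1 characters.
    start = rng.randint(0, len(text) - chunk_len - 1)
    chunk = text[start:start + chunk_len + 1]
    return chunk[:-1], chunk[1:]

inp, target = make_pair("four score and seven years ago", chunk_len=8)
print(repr(inp), repr(target))
```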

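Then we define the training function, which performs one optimization step on a batch: it feeds the chunk to the model one character at a time, accumulates the loss, and updates the weights.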

In [None]:
def train(model, inp, target, chunk_len, batch_size):
    hidden = model.init_hidden(batch_size)
    model.zero_grad()
    loss = 0

    for c in range(chunk_len):
        output, hidden = model(inp[:,c], hidden)
        loss += criterion(output.view(batch_size, -1), target[:,c])

    loss.backward()
    decoder_optimizer.step()

    return loss.data / chunk_len

Now we define some global parameters.

In [None]:
model = "gru"
n_epochs = 2000
hidden_size = 250
n_layers = 2
learning_rate = 0.015
chunk_len = 250
batch_size = 100

Then we load the CSV with the presidential speeches and build two lists with the sentences of each president we're interested in: Bill Clinton and Donald Trump.

In [10]:
clinton, trump = [], []
csv.field_size_limit(sys.maxsize)
with open("corpus.csv") as file:
    reader = csv.reader(file, delimiter=",", quotechar="\"")
    for row in reader:
        if (row[0] == "Bill Clinton"): clinton = row
        if (row[0] == "Donald Trump"): trump = row

# to build the sentences we split at each sentence delimiter: ".", "!", "?" and "\n"
clinton_sentences = clinton[2].replace(".", "\n").replace("!", "\n").replace("?", "\n").split("\n")
print(clinton[0], len(clinton[2]))
print(random.choice(clinton_sentences))

trump_sentences = trump[2].replace(".", "\n").replace("!", "\n").replace("?", "\n").split("\n")
print(trump[0], len(trump[2]))
print(random.choice(trump_sentences))

Bill Clinton 768815
 I challenge every community, every school, and every state to adopt national standards of excellence, to measure whether schools are meeting those standards, to cut bureaucratic red tape so that schools and teachers have more flexibility for grassroots reform, and to hold them accountable for results
Donald Trump 475677
 That is why I am here today, to break the logjam and provide Congress with a path forward to end the government shutdown and solve the crisis on the southern border
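
The chained `replace` calls above also produce empty or whitespace-only fragments (for example after an ellipsis, or when a period is followed by a newline), and those fragments count toward the sentence statistics computed later. A minimal sketch of an alternative splitter, assuming such fragments should be discarded; `split_sentences` is a hypothetical name, not part of the notebook.

```python
import re

def split_sentences(text):
    """Split on '.', '!', '?' and newlines, dropping empty fragments."""
    fragments = re.split(r"[.!?\n]", text)
    return [fragment.strip() for fragment in fragments if fragment.strip()]

sample = "We will win. Will we?\nYes! ..."
print(split_sentences(sample))
```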


### Recurrent Neural Presidents training

The first president examined is Bill Clinton. We start by building his model.

In [None]:
clinton_RNP = CharRNN(
    len(all_characters),  # input size
    hidden_size, # 250 hidden units per layer
    len(all_characters),  # output size
    model=model,  # gru
    n_layers=n_layers,  # 2 layers
)
decoder_optimizer = torch.optim.Adam(clinton_RNP.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()

Then we train it, saving the model when done.

In [None]:
print("Training \"Clinton Recurrent Neural President\" for %d epochs..." % n_epochs)
for epoch in tqdm(range(1, n_epochs + 1)):
    loss = train(clinton_RNP, *random_training_set(clinton[2], chunk_len, batch_size), chunk_len, batch_size)

save(clinton_RNP, "RNP_clinton.pt")

The same process is repeated for Donald Trump.

In [None]:
trump_RNP = CharRNN(
    len(all_characters),  # input size
    hidden_size, # 250 hidden units per layer
    len(all_characters),  # output size
    model=model,  # gru
    n_layers=n_layers,  # 2 layers
)
decoder_optimizer = torch.optim.Adam(trump_RNP.parameters(), lr=learning_rate)
criterion = nn.CrossEntropyLoss()
print("Training \"Trump Recurrent Neural President\" for %d epochs..." % n_epochs)
for epoch in tqdm(range(1, n_epochs + 1)):
    loss = train(trump_RNP, *random_training_set(trump[2], chunk_len, batch_size), chunk_len, batch_size)

save(trump_RNP, "RNP_trump.pt")

### Analysis

Before analyzing the two RNPs, we define the generation function.

This function feeds a "starter" string to the specified model and generates a string of length `predict_len`.

In order to have a nicer output, if `proper_ending` is `True`, the model generates at least the specified number of characters and stops only when a `.` is generated.



In [6]:
def generate(decoder, prime_str='A', predict_len=100, temperature=0.8, proper_ending=True):
    hidden = decoder.init_hidden(1)
    prime_input = Variable(char_tensor(prime_str).unsqueeze(0))
    predicted = prime_str

    # Use priming string to "build up" hidden state
    for p in range(len(prime_str) - 1):
        _, hidden = decoder(prime_input[:,p], hidden)

    inp = prime_input[:,-1]

    for p in range(predict_len):
        output, hidden = decoder(inp, hidden)

        # Sample from the network as a multinomial distribution
        output_dist = output.data.view(-1).div(temperature).exp()
        # ugly fix for "probability tensor contains either `inf`, `nan` or element < 0"
        output_dist[output_dist == float("Inf")] = 1.0e+10
        top_i = torch.multinomial(output_dist, 1)[0]

        # Add predicted character to string and use as next input
        predicted_char = all_characters[top_i]
        predicted += predicted_char
        inp = Variable(char_tensor(predicted_char).unsqueeze(0))

    if (proper_ending):
        while (predicted[-1] != "."):  # generates until the next "."
            output, hidden = decoder(inp, hidden)

            output_dist = output.data.view(-1).div(temperature).exp()
            output_dist[output_dist == float("Inf")] = 1.0e+10
            top_i = torch.multinomial(output_dist, 1)[0]

            predicted_char = all_characters[top_i]
            predicted += predicted_char
            inp = Variable(char_tensor(predicted_char).unsqueeze(0))

    return predicted[len(prime_str):]
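
The role of `temperature` in the sampling step can be seen in isolation. Dividing the network's outputs by the temperature before exponentiating sharpens the distribution when the temperature is below 1 and flattens it when it is above 1. The following is a standalone sketch with made-up scores (not model outputs); `temperature_probs` is a hypothetical helper. It subtracts the maximum score before exponentiating, which is one common way to avoid the overflow that the clamp in `generate` works around.

```python
import math

def temperature_probs(scores, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    scaled = [s / temperature for s in scores]
    # Subtracting the maximum leaves the probabilities unchanged
    # but keeps exp() from overflowing.
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

scores = [2.0, 1.0, 0.5]  # illustrative values only
for t in (0.25, 0.8, 2.0):
    print(t, [round(p, 3) for p in temperature_probs(scores, t)])
```

At low temperatures almost all of the probability mass goes to the top-scoring character, so sampling behaves close to always picking the most likely next character.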

For the generation experiments, we set the temperature value and define a custom sentence to use as a prompt.

In [None]:
custom_sentence = "People keep telling me orange but I still prefer pink."
temperature = 0.25

We now analyze the RNP Clinton.

In [13]:
clinton_sentences_lengths = [len(s) for s in clinton_sentences]  # avoid shadowing the built-in str
clinton_sentences_avglen = float(sum(clinton_sentences_lengths))/len(clinton_sentences_lengths)
print ("The real President Clinton has an average sentence length of " + "{:5.5}".format(clinton_sentences_avglen) + " characters")
clinton_sentences_avglen = int(clinton_sentences_avglen)
clinton_RNP = torch.load("RNP_clinton.pt")

The real President Clinton has an average sentence length of 110.28 characters


To evaluate the CharRNN model, we pick 100 random sentences of the real president, feed the first five characters of each to the CharRNN, and then measure the similarity between the real and the generated sentence using the `SequenceMatcher` ratio from the `difflib` library.

In [None]:
from difflib import SequenceMatcher
scores = []
for i in range(100):
    rnd_sentence = random.choice(clinton_sentences).strip()
    gen_sentence = rnd_sentence[0:5] + generate(clinton_RNP, rnd_sentence[0:5], len(rnd_sentence) - 5, temperature = temperature, proper_ending = False)
    scores.append(SequenceMatcher(None, rnd_sentence, gen_sentence).ratio())
print("The Recurrent Neural President Clinton resembles the real President Clinton at", "{:.3}%".format(np.mean(scores)*100))

The Recurrent Neural President Clinton resembles the real President Clinton at 35.2%
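
For reference, `SequenceMatcher.ratio()` returns `2*M/T`, where `M` is the number of matched characters and `T` is the total length of both strings. A small sketch with toy strings (not taken from the corpus); `similarity` is a hypothetical wrapper name.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity in [0, 1]: 2 * matches / (len(a) + len(b))."""
    return SequenceMatcher(None, a, b).ratio()

print(similarity("abcd", "bcde"))  # 3 matching characters out of 8 -> 0.75
print(similarity("abcd", "abcd"))  # identical strings -> 1.0
```

Because the score counts any characters shared in the same order, two unrelated English sentences can still score well above zero, so the percentages reported here are best read relative to each other.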


For the sake of analysis, let's see how well the CharRNN of the president Clinton resembles the real president Trump.

In [None]:
scores = []
for i in range(100):
    rnd_sentence = random.choice(trump_sentences).strip()
    gen_sentence = rnd_sentence[0:5] + generate(clinton_RNP, rnd_sentence[0:5], len(rnd_sentence) - 5, temperature = temperature, proper_ending = False)
    scores.append(SequenceMatcher(None, rnd_sentence, gen_sentence).ratio())
print("The Recurrent Neural President Clinton resembles the real President Trump at", "{:.3}%".format(np.mean(scores)*100))

The Recurrent Neural President Clinton resembles the real President Trump at 41.2%


A surprising result: after training on almost 770k characters spoken by the real President Clinton, the artificial President Clinton scores 6 percentage points higher against the real President Trump (41.2%) than against the real Clinton (35.2%).

Let's see some sample speeches. First, a simple speech about 15 average sentences long, starting with "Dear Americans,".

In [None]:
print("\nSample speech")
print("\"Dear Americans," + generate(clinton_RNP, "Dear Americans,", clinton_sentences_avglen*15, temperature=temperature) + '\"\n')


Sample speech
"Dear Americans, the Congress the states to do the Congress with the Congress the people who are provide the first time to provide the Senator Kennedy and Congress for the Congress the American people to strengthen the people who have a chance to see the people to our people to see the streets and the first time to provide the health care and the security of the people the states to strengthen the first time to help the programs the problems of the states and the people in the people to do the people to pay the conversations to the Union and Congress and the Congress to see the first time to all the Congress to see the Congress the strength of the Congress to see the political secure the people to the most of the environment and the first time to see the people to provide the people for a single people are the street of the people are the communities and the environment of the people the people the Congress the commitment to pay to pay to see the first time and the Congr

Then a speech about 5 average sentences long, where we feed RNP Clinton a random sentence from its real counterpart. The real sentence fed to the model is shown in square brackets.

In [None]:
rnd_sentence = random.choice(clinton_sentences).strip()
print("Clinton as Clinton:\"[" + rnd_sentence + "]", end = "")
print(generate(clinton_RNP, rnd_sentence, clinton_sentences_avglen*5, temperature=temperature) + '\"\n')

Clinton as Clinton:"[If we want America to lead, we've got to set a good example] and the problems and the problems of the first time to provide the streets and the streets that the police officers of the streets of the programs that is the problems to the people the community and the community of the first time to the programs that the streets and the people to continued to the same stronger and the next people to the streets and the people to the people to the programs the people to the same time of the problems of the streets and the police officer to do the first time to pay to the first time the world to the programs to the problems and the streets and the programs to the productive and the programs that is a children of the proposal to end the first time and the problems of the people who work to stop the people to the last about the problems of the people to have a communities and the streets to the common and the first time to provide the first time to help you to the enduring 

A noticeable element is the frequent repetition of word sequences. This may be due to insufficient training, or it may be an effect of the low temperature value, which can be tuned.

Next, another speech of about 5 sentences, this time primed with a random sentence by the real President Donald Trump.

In [None]:
rnd_sentence = random.choice(trump_sentences).strip()
print("Clinton as Trump:\"[" + rnd_sentence + "]", end = "")
print(generate(clinton_RNP, rnd_sentence, clinton_sentences_avglen*5, temperature=temperature) + '\"\n')

Clinton as Trump:"[The biggest thing] to help to the first time to strengthen the people to be here to do the private training the programs to the same true the strength of the problems and the problems of the people to provide and the first time that the people who work and the world's who are states the problems of the streets to be the conversity in the productivity and the streets and the community for the streets and the profound the different companies and the world to do a secure the opportunity to stop the programs the continue to support the people who work with heart and the people to the programs the same time to do the streets and the streets and the law about the families and the change the programs the communities to help the programs to the police officers of the same truth the communities that the people to be able to make the police officers of the people of the State of the Congress to the people to help the people of the Congress the American people to provide the st

Finally, a speech with the custom random sentence stated before, "*People keep telling me orange but I still prefer pink.*"

In [None]:
print("Clinton:\"[" + custom_sentence + "]", end = "")
print(generate(clinton_RNP, custom_sentence, clinton_sentences_avglen*5, temperature=temperature) + '\"\n')

Clinton:"[People keep telling me orange but I still prefer pink.] I want to strengthen the first time to stay the first time to provide the things the streets and the streets and the Senator Somalia and the Senators and the first time to the Congress to the people are stronger and the police officers of the streets and the people to strengthen the first time to help the problems of the police officers of the people who are all the first time in the streets and the streets of the streets of the people to the commitment to save the people who work to provide the things the students when the teachers of the problems of the first time to achieve the security to the people who are some of the people who are and the programs to the end the politics of the people who have the continue to stay that the streets and the problems that is the first time the people to help the problems of the people who are and the people who are the first time to stop the provides the people the police officers of

Switching to RNP Trump.

In [14]:
trump_sentences_lengths = [len(s) for s in trump_sentences]  # avoid shadowing the built-in str
trump_sentences_avglen = float(sum(trump_sentences_lengths))/len(trump_sentences_lengths)
print ("The real President Trump has an average sentence length of " + "{:5.5}".format(trump_sentences_avglen) + " characters")
trump_sentences_avglen = int(trump_sentences_avglen)
trump_RNP = torch.load("RNP_trump.pt")

The real President Trump has an average sentence length of 70.681 characters


We can see a major difference in the style of the two presidents: Clinton's average sentence is about 40 characters longer than Trump's (110.28 versus 70.68). This may reflect a change in rhetoric over time, or simply a difference between the two presidents, from longer, more elaborate sentences to shorter, simpler, more impactful ones. It has likely contributed to RNP Clinton scoring better against Trump than against its real counterpart.

In [None]:
scores = []
for i in range(100):
    rnd_sentence = random.choice(trump_sentences).strip()
    gen_sentence = rnd_sentence[0:5] + generate(trump_RNP, rnd_sentence[0:5], len(rnd_sentence) - 5, temperature = temperature, proper_ending = False)
    scores.append(SequenceMatcher(None, rnd_sentence, gen_sentence).ratio())
print("The Recurrent Neural President Trump resembles the real President Trump at", "{:.3}%".format(np.mean(scores)*100))
scores = []
for i in range(100):
    rnd_sentence = random.choice(clinton_sentences).strip()
    gen_sentence = rnd_sentence[0:5] + generate(trump_RNP, rnd_sentence[0:5], len(rnd_sentence) - 5, temperature = temperature, proper_ending = False)
    scores.append(SequenceMatcher(None, rnd_sentence, gen_sentence).ratio())
print("The Recurrent Neural President Trump resembles the real President Clinton at", "{:.3}%".format(np.mean(scores)*100))

The Recurrent Neural President Trump resembles the real President Trump at 41.6%
The Recurrent Neural President Trump resembles the real President Clinton at 32.0%


As expected, RNP Trump imitates the real Trump better than it imitates the real Clinton. The accuracy of the imitation is also interesting: while RNP Clinton imitates the real Clinton with a score of about 35%, RNP Trump does noticeably better at almost 42%.

There may be several causes. The first that comes to mind is the considerably shorter average sentence length of President Trump compared to President Clinton. This may help the CharRNN in reconstructing the sentences, since shorter sentences are generally easier to generate than longer ones.

Let's see the sample speeches of RNP Trump.

In [None]:
print("\nSample speech")
print("\"Dear Americans," + generate(trump_RNP, "Dear Americans,", trump_sentences_avglen*15, temperature=temperature) + '\"\n')


Sample speech
"Dear Americans, we are working our country. We will never been a lot of stand our country. We have a lot of people are now include of the world. They love the presidents and the world. We will not strong and sore warrior. We will not be all of the world war on the privilege of American people that are and the world. We have to be a lot of life, and they were all care of our country. We will not be all of the world with the world to the same workers and the world to protect our country. We will be all of the progress with the president of the presidency of our country. We have to be a lot of surproted and the world to the United States and the world to the brave to the world. We will seek to protect the United States with the world to the United States of American people who are the world to the presidency of the world because they want to thank the love the world. We have to be able to be a lot of people in the world. We will never been a lot of people that the way. We 

In [None]:
rnd_sentence = random.choice(trump_sentences).strip()
print("Trump as Trump:\"[" + rnd_sentence + "]", end = "")
print(generate(trump_RNP, rnd_sentence, trump_sentences_avglen*5, temperature=temperature) + '\"\n')

Trump as Trump:"[And American grit will ensure that what we dream, and what we build, will truly be second to none] only also said, we will be all of the world. We have a lot of people are now protect our country. We will never been better than any thought it was a lot of people that is why they want to thank you very much. We will not be all over the world. We have a lot of the world. We have to be a lot of people in the United States and the world to protect the world."



In [None]:
rnd_sentence = random.choice(clinton_sentences).strip()
print("Trump as Clinton:\"[" + rnd_sentence + "]", end = "")
print(generate(trump_RNP, rnd_sentence, trump_sentences_avglen*5, temperature=temperature) + '\"\n')

Trump as Clinton:"[Anyone, anyone who takes on our troops will suffer the consequences] and the world to be a lot of people with the presidency. We will be all over a big problem. We will be all over the best things to defend the world. We will never been the world. We will be all of the world. We have to special countries to the people who are an incredible private special progress to the country. We will never been better than any thought it was a lot of people."



In [None]:
print("Trump:\"[" + custom_sentence + "]", end = "")
print(generate(trump_RNP, custom_sentence, trump_sentences_avglen*5, temperature=temperature) + '\"\n')

Trump:"[People keep telling me orange but I still prefer pink.] We will not seek to the last train and the way they have to make the world of American people that is the long the world. We will starting to the future of the presidents to be all the privilege and the future that we have a lot of people are now includes that we will be all over the world to be a lot of countries of the region and allies and sore work to protect our country and the people that are the same prosperity and strength and conflict and defends of the world our borders and the world."



We now change the temperature parameter from 0.25 to 0.8.

In [1]:
temperature = 0.8

In [16]:
rnd_sentence = random.choice(clinton_sentences).strip()
print("Clinton as Clinton:\"[" + rnd_sentence + "]", end = "")
print(generate(clinton_RNP, rnd_sentence, clinton_sentences_avglen*5, temperature=temperature) + '\"\n')
rnd_sentence = random.choice(trump_sentences).strip()
print("Trump as Trump:\"[" + rnd_sentence + "]", end = "")
print(generate(trump_RNP, rnd_sentence, trump_sentences_avglen*5, temperature=temperature) + '\"\n')

Clinton as Clinton:"[They're going to see their own doctor when Bob Dole is President]s of Congress here and it a performing millions of America a choices. For the American of the fix the middle far that if democracy to provide at 150 iploble encourage, the resources who proposed it will move the Union I want to help paying the service that we have other preserve the same true a way that they camparts. Now the work uniforta $ 1,000 firms a forward for them to the even expensus together ais. As we know and the one party of them. Abring specitive there are we have to system in the looking the world, and the honor in America to discriminationship of this every harder, because you fighting it."

Trump as Trump:"[Your loved ones will never be forgotten, we will always honor their memory] of our citizens. We're also partnership outside of the lower. We're helped 0 - and their days on of the people, lost held and we have to be the Scout that. We than you have a far biils that would have all. 

We notice a difference in the quality of the generated speeches. Both now contain punctuation marks, use a more diverse vocabulary, and overall make more sense. The comparison between the two Recurrent Neural Presidents imitating their real counterparts highlights the difference in sentence length between their respective styles.

### Conclusions
Thanks to the trained CharRNN models, we have been able to assess some core differences between the two real Presidents, Bill Clinton and Donald Trump. These differences range from vocabulary to sentence length, and some of these evaluations may be used to fine-tune the models to achieve better results. For example, we could measure the entropy of the real sentences and aim for a similar value in the sentences generated by our models.
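
As a sketch of the entropy idea, assuming that character-level Shannon entropy is the intended measure (the text above does not specify which one), the value could be computed for both a real and a generated sentence and then compared. `char_entropy` is a hypothetical helper, and the strings below are toy inputs.

```python
import math
from collections import Counter

def char_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # p * log2(1 / p) summed over the distinct characters
    return sum((n / total) * math.log2(total / n) for n in counts.values())

print(char_entropy("aaaa"))  # a single repeated character -> 0.0
print(char_entropy("aabb"))  # two equally likely characters -> 1.0
```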

This work was also a useful way to see and understand some implementations of a model studied in class, in order to choose which one to use based on personal preferences like coding style, features implemented, libraries used and so on.

Lastly, it was fun seeing how the models can generate realistic sentences that could have been said by one of the American presidents, followed after a few words by almost completely random ones. Whether some generated sentences recall specific real speeches is outside the scope of this analysis.