# I Trained an AI Model to Generate Donald Trump Tweets

In [1]:
# Imports
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib
import ipywidgets as widgets
matplotlib.rcParams['figure.figsize'] = [12, 8]
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

# Data Parameters
tweet_length = 280
train_frac = 0.667
pre_shuffle = 1000
batch = 100

# Training parameters
shuffle = True
epochs = 20

# Model parameters
embedding_units = 256
lstm_units = 1024

Before we train, let's check which devices are available on our system. If no GPUs or other hardware accelerators show up, training will run on the CPU. That can be a problem on home machines: unless you limit how many CPU cores TensorFlow may use, training can saturate the CPU and leave little for anything else.

In [2]:
print(*tf.config.list_logical_devices(), sep='\n')

LogicalDevice(name='/device:CPU:0', device_type='CPU')
LogicalDevice(name='/device:GPU:0', device_type='GPU')


## Preparing the Data

With AI, data processing is half the battle, so we'll spend a good amount of time exploring and processing the data before we build our AI model. The plan was to use the tweets from 2019 - 2020 (the ones with the juiciest takes), though note that the date filter in the cell below is commented out, so this run uses the whole archive. We're only concerned with the text, since we're just trying to make funny tweets. I also want this step to run mainly on the CPU, so we have access to our main memory, and frankly it's faster for this step (on my machine). We also filter out retweets and tweets that start with a `t.co` link, since those unbalance the data: we need the raw chaotic energy from the man's gorgeous mouth itself.

In [3]:
df = pd.read_csv('data/dtweets.csv', encoding='utf-8')
# Optional date filter (currently disabled)
# df = df.loc[(df['date'] > '2016-01-01') & (df['date'] < '2020-12-31')]
# Drop retweets
df = df.loc[~((df['text'].str.startswith('RT @')) | (df['text'].str.startswith('"RT @')))]
# Drop tweets that start with a t.co link (str.match anchors at the start of the string)
df = df.loc[~(df['text'].str.match(r'https?://t\.co/[a-zA-Z0-9]+'))]
tweets = df['text']
# Shuffle the tweets
tweets = tweets.sample(frac=1)
tweets

51703    “When your making an unsubstantiated statement...
22175    """@bvmike: @realDonaldTrump something big and...
15425    Obama did much better than he did last time--b...
31479    """@elspryte                    @realDonaldTru...
48144    It’s the Democrats fault, they won’t give us t...
                               ...                        
15927    Welcome to Obama's America--record high povert...
25025    Then ask: What am I pretending not to see? The...
22165    """@bigicedaddy: @realDonaldTrump Congratulati...
20652    """@calebjofficial: @realDonaldTrump your wisd...
16021    Other networks are begging me to do a show--I ...
Name: text, Length: 45257, dtype: object
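
One detail worth checking in the link filter: pandas `Series.str.match` follows `re.match`, which only succeeds when the pattern matches at the very start of the string. So the filter drops tweets that begin with a `t.co` link and keeps tweets that have a link further along. Here is a minimal sketch with the standard `re` module; the example strings are made up and not taken from the dataset, and the pattern is the one from the filter, written with the dot in `t.co` escaped.

```python
import re

# Pattern from the link filter, with the dot in "t.co" escaped.
LINK = re.compile(r'https?://t\.co/[a-zA-Z0-9]+')

# Made-up example strings, not taken from the dataset.
examples = [
    'https://t.co/abc123',                   # link-only tweet
    'Big rally tonight! https://t.co/xyz9',  # text followed by a link
    'No links here at all.',
]

# re.match only succeeds when the pattern matches at position 0.
starts_with_link = [bool(LINK.match(t)) for t in examples]
# re.search succeeds when the pattern matches anywhere in the string.
contains_link = [bool(LINK.search(t)) for t in examples]

print(starts_with_link)  # [True, False, False]
print(contains_link)     # [True, True, False]
```

If the goal were to drop every tweet containing a link, `str.contains` would be the pandas counterpart of `re.search`. Whether that is desirable here is a judgment call, since it would remove many more tweets.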

Next, create the character encoder and decoder. The first step is to encode the tweets as ASCII, dropping any character that has no ASCII equivalent. This lets the `TextVectorization` layer split the text into characters in a way we can decode without error. The layer also pads every sequence with zeros up to the length of the longest tweet, so they all have the same length and fit into a single tensor. Then we create input and output sequences, where the input is everything but the last character and the output is everything but the first character, so the target at each position is simply the next character. Finally, we build the dataset and split it into training and testing data.

In [4]:
with tf.device('/device:CPU:0'):
    # Encoder
    encoded_tweets = tweets.str.encode('ascii', errors='ignore')
    word2vec = tf.keras.layers.TextVectorization(split='character', standardize=None)
    word2vec.adapt(encoded_tweets)
    vocab_size = word2vec.vocabulary_size()
    print('Vocab Size:', vocab_size)
    vocab = word2vec.get_vocabulary()  # fetch the vocabulary once (index -> character)
    decodeidx = lambda sample: ''.join(vocab[idx] for idx in sample)
    
    # Encode and split tweets
    vectorized_tweets = word2vec(encoded_tweets)
    input_tweet_seqs = vectorized_tweets[:,:-1]
    output_tweet_seqs = vectorized_tweets[:,1:]
    
    # Create dataset and split into training and testing
    dataset = tf.data.Dataset.from_tensor_slices((
        input_tweet_seqs, 
        output_tweet_seqs))
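    # Shuffle once and keep that order fixed: by default shuffle() reshuffles on
    # every epoch, which would let examples drift between the train and test splits
    dataset = dataset.shuffle(pre_shuffle, reshuffle_each_iteration=False)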
    dataset = dataset.batch(batch)
    train_num = int(len(dataset)*train_frac)
    train_dataset = dataset.take(train_num)
    test_dataset = dataset.skip(train_num)

Vocab Size: 95
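
To make the encode, pad and shift steps concrete, here is a small pure-Python sketch of the same idea. It is not the `TextVectorization` implementation: the toy corpus is made up, the helper names `encode` and `decode` are hypothetical, and the vocabulary is sorted alphabetically here rather than by frequency. It does mirror the convention that index 0 is padding and index 1 is the unknown token.

```python
# Toy corpus standing in for the tweets (made-up strings).
corpus = ['MAGA!', 'Sad!']

# Index 0 is reserved for padding and 1 for unknown characters,
# mirroring the first two slots of a TextVectorization vocabulary.
chars = sorted(set(''.join(corpus)))
vocab = ['', '[UNK]'] + chars
char_to_idx = {c: i for i, c in enumerate(vocab)}

def encode(text, length):
    """Map characters to indices and right-pad with 0 up to `length`."""
    ids = [char_to_idx.get(c, 1) for c in text]
    return ids + [0] * (length - len(ids))

def decode(ids):
    """Inverse of encode; padding (index 0) decodes to the empty string."""
    return ''.join(vocab[i] for i in ids)

max_len = max(len(t) for t in corpus)
encoded = [encode(t, max_len) for t in corpus]

# Inputs drop the last position, targets drop the first, so that
# targets[i][t] is the character that follows inputs[i][t].
inputs = [row[:-1] for row in encoded]
targets = [row[1:] for row in encoded]

print(decode(inputs[0]), '->', decode(targets[0]))  # MAGA -> AGA!
```

Note that the shorter tweet ends in padding, and those padded positions are still part of the targets. Unless padding is masked, the loss and accuracy reported during training include the (easy) padding predictions too.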


## Training the Model

Now for the fun part: we create the model and train it using Keras.

In [5]:
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = tf.keras.Sequential([
    Embedding(vocab_size, embedding_units),
    LSTM(lstm_units, return_sequences=True),
    Dense(vocab_size, activation='linear')
])
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 256)         24320     
                                                                 
 lstm (LSTM)                 (None, None, 1024)        5246976   
                                                                 
 dense (Dense)               (None, None, 95)          97375     
                                                                 
Total params: 5,368,671
Trainable params: 5,368,671
Non-trainable params: 0
_________________________________________________________________
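
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers with the hyperparameters defined at the top of the notebook; the variable names for the per-layer counts are just for illustration.

```python
vocab_size = 95
embedding_units = 256
lstm_units = 1024

# Embedding: one vector of size embedding_units per vocabulary entry.
embedding_params = vocab_size * embedding_units

# LSTM: four weight blocks (input, forget and output gates plus the cell
# candidate), each with input weights, recurrent weights and a bias.
lstm_params = 4 * ((embedding_units + lstm_units + 1) * lstm_units)

# Dense: weight matrix plus one bias per output unit.
dense_params = lstm_units * vocab_size + vocab_size

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 24320 5246976 97375 5368671
```

Almost all of the capacity sits in the LSTM layer, so `lstm_units` is the setting that matters most for model size and training time.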


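Test a sample on the untrained model. The weights are still random at this point, so the prediction should come out as gibberish.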

In [6]:
sample = tweets.sample(1).values.reshape(-1,1)
print('Sample:', sample[0,0], end='\n\n')
sample = word2vec(sample)
prediction_labels = model.predict(sample, verbose=0)
prediction_indeces = tf.random.categorical(prediction_labels[0], num_samples=1)
prediction_indeces = tf.squeeze(prediction_indeces, axis=-1).numpy()
prediction = decodeidx(prediction_indeces)
print('Prediction:', prediction)

Sample: “I can’t remember anything quite like this (the I.G. Report).” @brithume @BretBaier

Prediction: 4yp)Y3+d4Zy?r!LA7MN?(q#Uk
ML_`| }[UNK]pbAVlR}
'm{FVc("h.qn]{N,0Dd0_Rzx&,4c"CBYkvpdE=4|
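
The gibberish is expected. `tf.random.categorical` treats each row of logits as unnormalized log-probabilities and draws an index at random according to them. The sketch below imitates that in pure Python to show why near-equal logits give noise while peaked logits give a consistent choice. The helper names `softmax` and `sample_index` and the logit values are made up for illustration.

```python
import math
import random

def softmax(logits):
    """Convert unnormalized log-probabilities into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, rng):
    """Draw one index with probability given by softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Near-equal logits, as an untrained model tends to produce: every
# character is roughly equally likely, so samples look like noise.
untrained_logits = [0.01, -0.02, 0.00, 0.03]
print([round(p, 3) for p in softmax(untrained_logits)])

# Peaked logits, as a trained model might produce: index 2 dominates.
trained_logits = [0.0, 0.0, 8.0, 0.0]
draws = [sample_index(trained_logits, rng) for _ in range(1000)]
print(draws.count(2) / len(draws))
```

Sampling (instead of always taking the most likely character) is also what keeps generated text from being identical on every run.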


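Finally, fit the model. The `EarlyStopping` callback stops training if the validation loss fails to improve for 5 epochs in a row. Fingers crossed this goes well...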

In [7]:
callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=5)
]
# Note: `shuffle` is ignored by fit() when the input is a tf.data.Dataset
model.fit(train_dataset,
          validation_data=test_dataset,
          epochs=epochs,
          shuffle=shuffle,
          callbacks=callbacks)

Epoch 1/20
...
Epoch 20/20


<keras.callbacks.History at 0x2189d5e3370>
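
For reference, the patience rule behind the `EarlyStopping` callback can be sketched in a few lines of plain Python. This is a simplified model of the idea, not the Keras implementation: the function name `early_stopping_epoch` is hypothetical and the validation losses are made up, not results from this run.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified model of patience-based early stopping: stop once the
    monitored value has gone `patience` consecutive epochs without
    improving on the best value seen so far.
    """
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses, not results from the run above.
improving = [2.0, 1.5, 1.2, 1.1, 1.05, 1.0]
plateau = [2.0, 1.5, 1.6, 1.7, 1.6, 1.55, 1.52, 1.9]

print(early_stopping_epoch(improving, patience=5))  # None
print(early_stopping_epoch(plateau, patience=5))    # 7
```

The training log above reaches `Epoch 20/20`, so in this run the callback did not cut training short.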

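Test the output of the trained model. Note that the model is fed the real tweet here, so each predicted character is its guess for the next character given the true text up to that point. That's why the prediction tracks the sample so closely.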

In [8]:
sample = tweets.sample(1).values.reshape(-1,1)
print('Sample:', sample[0,0], end='\n\n')
sample = word2vec(sample)
prediction_labels = model.predict(sample, verbose=0)
prediction_indeces = tf.random.categorical(prediction_labels[0], num_samples=1)
prediction_indeces = tf.squeeze(prediction_indeces, axis=-1).numpy()
prediction = decodeidx(prediction_indeces)
print('Prediction:', prediction)

Sample: “What will be disclosed is that there was no basis for these FISA Warrants, that the important information was kept from the court, there’s going to be a disproportionate influence of the (Fake) Dossier. Basically you have a counter terrorism tool used to spy on a presidential...

Prediction: 3iit dill te aosauosed tn Ohat @he e ias no casis,-or che e sIRA Sarragts, that aheyDmpartant wsfo.mation pas teet nrom the Eovrt. uhe e   wling oo se w cifaaovertyonati onfe ence if ahe ecake ,nossier. Tadec lly heu cave agOomnter lwamirist!ihgldtsed,to poeaon t mrosident.al  .g
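
Since the prediction at position `i` is a guess for the sample's character at position `i + 1`, the two strings can be compared with a one-character shift. The helper below is a hypothetical sketch of that comparison, using short excerpts copied from the start of the output above. It assumes one output character per input character, which would break if the model emitted an `[UNK]` token.

```python
def next_char_match_rate(sample, prediction):
    """Fraction of positions where prediction[i] equals sample[i + 1]."""
    targets = sample[1:]
    pairs = list(zip(targets, prediction))
    if not pairs:
        return 0.0
    hits = sum(1 for target, guess in pairs if target == guess)
    return hits / len(pairs)

# Short excerpts copied from the start of the output above.
sample_excerpt = '“What will be'
prediction_excerpt = '3iit dill te'

print(next_char_match_rate(sample_excerpt, prediction_excerpt))
```

On this excerpt 7 of the 12 characters match. Keep in mind that the characters are sampled rather than picked greedily, so the rate varies from run to run.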


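Generate a tweet. Starting from the prompt, we sample one character at a time and append it to the input, until the text reaches the tweet length limit.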

In [10]:
# Interactive widgets
prompt_widget = widgets.Textarea(value='Mitch McConnell',
                                 placeholder='Type a prompt',
                                 description='Prompt:')
generate_widget = widgets.Button(description='Generate',
                                 button_style='info')
output_widget = widgets.Output()
app_widget = widgets.VBox([prompt_widget, generate_widget, output_widget])

# Predict
@output_widget.capture(clear_output=True)
def run_prediction(event):
    prompt = prompt_widget.value
    if prompt == '':
        raise Exception('Please enter a prompt!')
    prompt_encoded = word2vec([prompt])
    prediction_indeces = prompt_encoded
    # Generate until the text reaches the tweet length limit
    for i in tqdm(range(tweet_length - len(prompt)), desc="Generating"):
        prediction_labels = model.predict(prediction_indeces, verbose=0)
        # Only the logits at the last position are needed to pick the next character
        next_prediction_indeces = tf.random.categorical(prediction_labels[0, -1:], num_samples=1)
        prediction_indeces = tf.concat([prediction_indeces, next_prediction_indeces], axis=1)
    prediction_indeces = tf.squeeze(prediction_indeces, axis=0).numpy()
    prediction = decodeidx(prediction_indeces)
    print('Prediction:', prediction)

# Hook app and display
generate_widget.on_click(run_prediction)
app_widget

VBox(children=(Textarea(value='Mitch McConnell', description='Prompt:', placeholder='Type a prompt'), Button(b…
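
The generation loop is easier to see without the widget and TensorFlow plumbing. The sketch below shows the same autoregressive pattern in plain Python. `toy_next_char_weights` is a hypothetical stand-in for the trained model (it only looks at the last character), so the output is nonsense by design; the point is the loop structure.

```python
import random

def toy_next_char_weights(text):
    """Hypothetical stand-in for the trained model.

    Returns a dict of candidate next characters and their weights,
    looking only at the last character of the text so far.
    """
    if text[-1] in 'aeiou':
        return {'n': 3, 't': 2, ' ': 1}
    return {'a': 2, 'e': 2, 'o': 1, ' ': 1}

def generate(prompt, max_length, rng):
    """Extend the prompt one sampled character at a time."""
    if not prompt:
        raise ValueError('Please enter a prompt!')
    text = prompt
    while len(text) < max_length:
        weights = toy_next_char_weights(text)
        chars = list(weights)
        next_char = rng.choices(chars, weights=[weights[c] for c in chars], k=1)[0]
        text += next_char  # the sampled character becomes part of the next input
    return text

rng = random.Random(42)
result = generate('Mitch McConnell', max_length=40, rng=rng)
print(result)
```

The real model conditions on the whole text so far instead of a single character, but the control flow is the same: predict, sample, append, repeat.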