# I Trained an AI Model to Generate Donald Trump Tweets

In [1]:
# Imports
import numpy as np
import pandas as pd
import tensorflow as tf
from tqdm.notebook import tqdm

# Data Parameters
tweet_length = 280
train_frac = 0.75
batch = 200

# Training parameters
shuffle = True
epochs = 5

# Model parameters
embedding_units = 64
lstm_1_units = 256
lstm_2_units = 256
dense_units = 256
dropout_rate = 0.15

Before we train, let's check which devices are available on our system. If we don't see any GPUs or other hardware accelerators, training will run on the CPU. That could be a problem on home machines that cannot throttle the number of CPU cores available for training, since training will then exhaust the CPU's resources.

In [2]:
print(*tf.config.list_logical_devices(), sep='\n')

LogicalDevice(name='/device:CPU:0', device_type='CPU')
LogicalDevice(name='/device:GPU:0', device_type='GPU')


## Preparing the Data

With AI, data processing is half the battle, so we'll spend a lot of time exploring and processing the data before we build our model. I'm going to take the tweets from the year 2020 (the ones with the juiciest takes), and we only care about the text, since we're just trying to make funny tweets. I also want these steps to run mainly on the CPU, so that we have access to main memory; frankly, it's also faster for this step (on my machine).

In [3]:
df = pd.read_csv('data/dtweets.csv')
df = df.loc[(df['date'] > '2020-01-01') & (df['date'] < '2020-12-31')]
tweets = df['text']
tweets = tweets.sample(frac=1)
tweets

11052    .@RepKevinBrady (R) of Texas-08 loves Texas &a...
4841     RT @GOPChairwoman: “As one grateful nation, we...
7418     Disgraceful Anarchists. We are watching them c...
11352    RT @charliekirk11: All the lockdowns must end ...
6375     RT @TrumpWarRoom: HISTORIC: After 49 years, Is...
                               ...                        
1982     Under my leadership, our ECONOMY is now growin...
9802     RT @charliekirk11: One week ago today, Democra...
11867    RT @PeteHegseth: This Atlantic “story” is noth...
5638     Many Democrats want to Defund and Abolish Poli...
631      Look at this in Wisconsin! A day AFTER the ele...
Name: text, Length: 12234, dtype: object

Next, we split each tweet into characters and fit a character encoder and decoder. Then we create input and output sequences, where the input is everything but the last character and the output is everything but the first character. We also pad the sequences so that they're all the same length (easier to work with, maybe). Finally, we create the dataset and split it into training and testing data.
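To make the shift-by-one idea concrete, here is a minimal sketch in plain Python on a toy string. The `pad_pre` helper is hypothetical: it only imitates what `pad_sequences` does with `padding='pre'` and `truncating='pre'`, and the character ids are just `ord()` values rather than the `StringLookup` ids used in this notebook.

```python
def pad_pre(seq, maxlen, value=0):
    """Hypothetical stand-in for pad_sequences(padding='pre', truncating='pre')."""
    seq = seq[-maxlen:]                          # truncate from the front, keep the tail
    return [value] * (maxlen - len(seq)) + seq   # left-pad with `value`

text = "Tweet"
ids = [ord(ch) for ch in text]   # toy encoding, not the notebook's vocabulary

inputs = ids[:-1]    # everything but the last character
targets = ids[1:]    # everything but the first character

# At every position, the target is the character that follows the input.
pairs = [(chr(i), chr(t)) for i, t in zip(inputs, targets)]
print(pairs)

maxlen = 8
padded_inputs = pad_pre(inputs, maxlen)
padded_targets = pad_pre(targets, maxlen)
print(padded_inputs)
print(padded_targets)
```

With pre-padding, the real characters sit at the end of each row and the zeros come first. Assuming the default `StringLookup` settings, id 0 is also the out-of-vocabulary token, so padding and unknown characters share an id here.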

In [41]:
with tf.device('/device:CPU:0'):
    # Encode chars
    tweet_chars = tf.strings.unicode_split(tweets, input_encoding='UTF-8')
    encode_chars = tf.keras.layers.StringLookup()
    encode_chars.adapt(tweet_chars)
    vocab_size = encode_chars.vocabulary_size()
    print('Vocab Size:', vocab_size)
    decode_chars = tf.keras.layers.StringLookup(invert=True, 
                                                vocabulary=encode_chars.get_vocabulary())
    
    # Create padded input and output sequences
    tweet_charids = encode_chars(tweet_chars).to_list()
    input_tweet_charids = [ list(tensor)[:-1] for tensor in tweet_charids ]
    output_tweet_charids = [ list(tensor)[1:] for tensor in tweet_charids ]
    input_sequences = tf.keras.utils.pad_sequences(input_tweet_charids, 
                                                   maxlen=(tweet_length - 1), 
                                                   padding='pre', 
                                                   truncating='pre')
    output_sequences = tf.keras.utils.pad_sequences(output_tweet_charids, 
                                                    maxlen=(tweet_length - 1), 
                                                    padding='pre', 
                                                    truncating='pre')
    output_labels = tf.one_hot(output_sequences, depth=vocab_size)
    
    # Create training and testing dataset
    dataset = tf.data.Dataset.from_tensor_slices((input_sequences, output_labels))
    dataset = dataset.batch(batch)
    print('Dataset:', tf.data.DatasetSpec.from_value(dataset))
    train_num = int(train_frac*len(dataset))
    train_dataset = dataset.take(train_num)
    test_dataset = dataset.skip(train_num)
    
    # Get top start characters
    start_chars = np.array([ seq[0] for seq in input_tweet_charids ])
    uqsc, counts = np.unique(start_chars, return_counts=True)
    order = np.flip(np.argsort(counts))[:9]
    top_start_chars = uqsc[order]
    print('Top Start Chars:', decode_chars(top_start_chars).numpy())

Vocab Size: 389
Dataset: DatasetSpec((TensorSpec(shape=(None, 279), dtype=tf.int32, name=None), TensorSpec(shape=(None, 279, 389), dtype=tf.float32, name=None)), TensorShape([]))
Top Start Chars: [b'R' b'T' b'h' b'.' b'I' b'W' b'S' b'C' b'G']
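The dataset spec above shows one-hot labels of shape `(None, 279, 389)` in `float32`. As a rough back-of-the-envelope check, assuming the whole label tensor is held in memory at once, its size can be estimated as below. The figures come straight from the outputs above (12234 tweets, 279 positions, 389 vocabulary entries).

```python
num_tweets = 12234      # length of the tweets Series
seq_len = 279           # tweet_length - 1
vocab_size = 389        # printed vocab size
bytes_per_value = 4     # float32 one-hot values and int32 ids are both 4 bytes

# One-hot labels store a full vocab-sized vector per character position.
one_hot_bytes = num_tweets * seq_len * vocab_size * bytes_per_value
# Integer labels store a single id per character position.
int_label_bytes = num_tweets * seq_len * bytes_per_value

print(f"one-hot labels: {one_hot_bytes / 2**30:.2f} GiB")
print(f"integer labels: {int_label_bytes / 2**20:.2f} MiB")
```

If memory becomes a concern, one option (not what this notebook does) is to keep the integer labels and compile the model with `loss='sparse_categorical_crossentropy'`, which Keras provides for that label format.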


## Training the Model

Now for the fun part: we create the model and train it using Keras.

In [5]:
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

model = tf.keras.Sequential([
    Embedding(vocab_size, embedding_units),
    Dropout(dropout_rate),
    LSTM(lstm_1_units, return_sequences=True),
    Dropout(dropout_rate),
    LSTM(lstm_2_units, return_sequences=True),
    Dropout(dropout_rate),
    Dense(dense_units, activation='relu'),
    Dense(vocab_size, activation='softmax')
])
model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 64)          24896     
                                                                 
 dropout (Dropout)           (None, None, 64)          0         
                                                                 
 lstm (LSTM)                 (None, None, 256)         328704    
                                                                 
 dropout_1 (Dropout)         (None, None, 256)         0         
                                                                 
 lstm_1 (LSTM)               (None, None, 256)         525312    
                                                                 
 dropout_2 (Dropout)         (None, None, 256)         0         
                                                                 
 dense (Dense)               (None, None, 256)         6
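The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an embedding table and for an LSTM layer (four gates, each with input weights, recurrent weights, and a bias); the helper names are made up for illustration.

```python
def embedding_params(vocab_size, units):
    # One vector of length `units` per vocabulary entry.
    return vocab_size * units

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)

vocab_size, embedding_units = 389, 64
lstm_1_units, lstm_2_units = 256, 256

print(embedding_params(vocab_size, embedding_units))  # 24896
print(lstm_params(embedding_units, lstm_1_units))     # 328704
print(lstm_params(lstm_1_units, lstm_2_units))        # 525312
```

These match the `embedding`, `lstm`, and `lstm_1` rows above.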

Finally fit the model. Fingers crossed this goes well...

In [6]:
# Note: Keras ignores `shuffle` when the input is a tf.data.Dataset
model.fit(train_dataset,
          validation_data=test_dataset,
          epochs=epochs,
          shuffle=shuffle)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1a74eb0d210>

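Finally, let's test the model by generating a tweet one character at a time, starting from one of the most common start characters.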

In [51]:
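from random import choice

# Seed the sequence with one of the most common start characters
model.reset_states()
sequence = np.array([[choice(top_start_chars)]])
for i in tqdm(range(tweet_length - 1), desc='Generating'):
    # Predict from the most recent character and greedily take the likeliest next one
    labels = model.predict(sequence[0,-1:].reshape(-1,1), verbose=0)
    nextch = np.argmax(labels, axis=2)
    sequence = np.hstack((sequence, nextch))
    
# Decode the character ids back into a single string
sequence = decode_chars(sequence)
sequence = tf.strings.reduce_join(sequence)
print(sequence)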

tf.Tensor(b'W                                                                                                                                                                                                                                                                                       ', shape=(), dtype=string)
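The run of spaces above is what greedy decoding can look like when the model leans heavily on the most frequent character: `np.argmax` always picks the same top choice. Note also that the LSTM layers here are built without `stateful=True`, so predicting from only the last character gives the model no earlier context. One common alternative to greedy decoding is to sample from the predicted distribution with a temperature. The sketch below is plain Python on a made-up probability vector; `sample_with_temperature` is a hypothetical helper, not part of this notebook or of Keras.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from `probs` after rescaling by `temperature`.

    Lower temperatures sharpen the distribution (closer to argmax);
    higher temperatures flatten it (more variety).
    """
    # Rescale log-probabilities; the small epsilon avoids log(0).
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    # Numerically stable softmax over the rescaled logits.
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    scaled = [w / total for w in weights]
    # Draw one index according to the rescaled probabilities.
    index = rng.choices(range(len(scaled)), weights=scaled, k=1)[0]
    return index, scaled

# Made-up distribution over four characters; index 0 plays the role of ' '.
probs = [0.6, 0.2, 0.15, 0.05]
rng = random.Random(0)

_, sharp = sample_with_temperature(probs, temperature=0.5, rng=rng)
_, flat = sample_with_temperature(probs, temperature=2.0, rng=rng)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])

# At temperature 1.0 the original probabilities are used as-is.
draws = [sample_with_temperature(probs, 1.0, rng)[0] for _ in range(1000)]
print(sorted(set(draws)))
```

In the generation loop, the `np.argmax` line could be swapped for a call like this on the last timestep's probabilities, and the whole sequence so far could be passed to `model.predict` instead of only the last character. Whether that yields readable tweets would still depend on how well the model trained.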
