# I Trained an AI Model to Generate Donald Trump Tweets

In [1]:
# Imports
import numpy as np
import pandas as pd
import tensorflow as tf
from tqdm.notebook import tqdm

# Data/training Parameters
train_frac = 0.75
batch = 200
shuffle = 200
epochs = 5

# Model parameters
embedding_units = 64
lstm_units = 256
dense_units = 256
dropout_rate = 0.15

Before we train, let's check the devices available on our system. If we don't see any GPUs or other hardware accelerators, training will run on the CPU. That could be a problem on home machines that cannot limit the number of CPU cores used for training, since training can then exhaust the CPU's resources.

In [2]:
print(*tf.config.list_logical_devices(), sep='\n')

LogicalDevice(name='/device:CPU:0', device_type='CPU')
LogicalDevice(name='/device:GPU:0', device_type='GPU')


## Preparing the Data

With AI, data processing is half the battle, so we'll spend a lot of time exploring and processing the data before we build our AI model. I'm going to take the tweets from 2020 (the year with the juiciest takes), and we only need the text, since we're just trying to make funny tweets. I also want these steps to run mainly on the CPU: that gives us access to main memory, and frankly it's faster for this step (on my machine).

In [3]:
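# Load the tweet archive
df = pd.read_csv('data/dtweets.csv')
# Keep tweets dated within 2020 (strict comparisons on the date strings)
df = df.loc[(df['date'] > '2020-01-01') & (df['date'] < '2020-12-31')]
tweets = df['text']
# frac=1 samples every row, which shuffles the tweets
tweets = tweets.sample(frac=1)
tweets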

1952     Biden wants to LOCKDOWN our Country, maybe for...
3173     Will be doing a press conference today at 5:00...
3343     RT @TomFitton: ICYMI, because of dishonest lib...
2026     In my opinion, these patriots did nothing wron...
580      RT @KLoeffler: Court packing. Defunding police...
                               ...                        
7719     RT @LATAMforTRUMP: Our sincere condolences to ...
11433                              https://t.co/vVSkTSlM1X
7663     RT @MazurikL: Make an entrance. Make a run. TH...
12147    RT @DineshDSouza: BIG ANNOUNCEMENT: My new fil...
670                                https://t.co/bjC5XWlfOJ
Name: text, Length: 12234, dtype: object

The next step is encoding. We first split each tweet into a sequence of characters using TensorFlow's `unicode_split` function. We then encode those characters as sequences of integers using TensorFlow's `StringLookup` layer. This also tells us the size of our input and output spaces (our vocab size), which determines the size of our input and output layers.

In [4]:
with tf.device('/device:CPU:0'):
    tweet_chars = tf.strings.unicode_split(tweets, input_encoding='UTF-8')
    encode_chars = tf.keras.layers.StringLookup()
    encode_chars.adapt(tweet_chars)
    vocab_size = encode_chars.vocabulary_size()
    print('Vocab Size:', vocab_size)
    tweet_char_ids = encode_chars(tweet_chars)

Vocab Size: 389
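
To make the encoding step concrete, here is a minimal pure-Python sketch of the same idea: build a character vocabulary ordered by frequency and map each character to an integer. It is not what `StringLookup` runs internally. The helper names (`build_char_vocab`, `encode`), the sample strings, and the alphabetical tie-break are assumptions made for illustration. ID 0 is reserved for unknown characters, mirroring the out-of-vocabulary slot that `StringLookup` uses by default.

```python
from collections import Counter

def build_char_vocab(texts):
    """Map each character to an integer ID, most frequent character first.

    ID 0 is left free for characters that were never seen (hypothetical
    helper; the alphabetical tie-break is an assumption of this sketch).
    """
    counts = Counter(ch for text in texts for ch in text)
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i for i, ch in enumerate(ordered, start=1)}

def encode(text, vocab):
    """Turn a string into a list of integer IDs; unknown characters become 0."""
    return [vocab.get(ch, 0) for ch in text]

sample_tweets = ["MAGA!", "SAD!"]      # made-up stand-ins for real tweets
vocab = build_char_vocab(sample_tweets)
vocab_size = len(vocab) + 1            # +1 for the unknown-character slot

print(vocab_size)                      # 7
print(encode("MAGA!", vocab))          # [5, 1, 4, 1, 2]
print(encode("SAD?", vocab))           # [6, 1, 3, 0]  ('?' was never seen)
```

The vocab size counts the unknown-character slot as well, which is why it is one larger than the number of distinct characters.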


The next part is lengthy and also shamelessly copied from TensorFlow's NLP Zero to Hero course on YouTube. We're going to record every prefix of each tweet's sequence (from the start up to a given character) into a larger dataset. We're also going to pad each sequence with leading zeros so that every sequence is the same length (the length of a tweet, or 280 characters). Since this cell takes a long time, I've used tqdm to show progress as it runs.

In [5]:
tweet_seqs = []
with tf.device('/device:CPU:0'):
    for tweet in tqdm(tweet_char_ids, desc='Creating padded n-gram sequences', total=tweet_char_ids.shape[0]):
        subseqs = [tweet[:i+1] for i in range(1, len(tweet))]
        subseqs = tf.keras.utils.pad_sequences(subseqs, maxlen=280, padding='pre', truncating='pre', value=0)
        tweet_seqs.extend(subseqs)
tweet_seqs = np.array(tweet_seqs)
print(tweet_seqs.shape, tweet_seqs.dtype)
print(tweet_seqs)

Creating padded n-gram sequences:   0%|          | 0/12234 [00:00<?, ?it/s]

(1614323, 280) int32
[[ 0  0  0 ...  0 42  7]
 [ 0  0  0 ... 42  7 12]
 [ 0  0  0 ...  7 12  2]
 ...
 [ 0  0  0 ... 43 11 22]
 [ 0  0  0 ... 11 22 40]
 [ 0  0  0 ... 22 40 49]]
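
The prefix-and-pad step is easier to see on a tiny input. The sketch below reproduces the same logic in plain Python with a maximum length of 4 instead of 280. The function name `padded_prefixes` and the example IDs are made up for illustration, and the real cell uses `pad_sequences` rather than this helper.

```python
def padded_prefixes(ids, maxlen, pad_value=0):
    """Return every prefix of `ids` with at least two elements,
    left-padded (or left-truncated) to exactly `maxlen` entries."""
    rows = []
    for end in range(2, len(ids) + 1):
        # Keep only the most recent `maxlen` IDs, like truncating='pre'.
        prefix = ids[:end][-maxlen:]
        # Fill the front with the pad value, like padding='pre'.
        padding = [pad_value] * (maxlen - len(prefix))
        rows.append(padding + prefix)
    return rows

rows = padded_prefixes([5, 1, 4, 1, 2], maxlen=4)  # made-up character IDs
for row in rows:
    print(row)
# [0, 0, 5, 1]
# [0, 5, 1, 4]
# [5, 1, 4, 1]
# [1, 4, 1, 2]
```

A sequence of n characters yields n - 1 rows, which is how 12,234 tweets grow into 1,614,323 training sequences.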


Now the final preparations. We will create a TensorFlow `Dataset`. The last column of each sequence is the output we're trying to predict, and the remaining columns are the input sequence. We one-hot encode the outputs, then shuffle the dataset and split it into batches. Finally, we split the batches into training and testing data, which concludes our data preparation step.

In [6]:
# Create dataset
outputs = tweet_seqs[:,-1]
sequences = tweet_seqs[:,:-1]
dataset = tf.data.Dataset.from_tensor_slices((sequences, outputs))
dataset = dataset.map(lambda seq, out: (seq, tf.one_hot(out, depth=vocab_size)))

# Shuffle and batch
dataset = dataset.shuffle(shuffle)
dataset = dataset.batch(batch)

# Display data sample
print(dataset)
for sequence_batch, output_batch in dataset.take(1):
    print(f'Input sequences: {sequence_batch}')
    print(f'Output labels: {output_batch}')

# Split into training and testing
train_num = int(train_frac*len(dataset))
train_dataset = dataset.take(train_num)
test_dataset = dataset.skip(train_num)

<BatchDataset element_spec=(TensorSpec(shape=(None, 279), dtype=tf.int32, name=None), TensorSpec(shape=(None, 389), dtype=tf.float32, name=None))>
Input sequences: [[ 0  0  0 ...  6  1 33]
 [ 0  0  0 ...  3  4  1]
 [ 0  0  0 ...  1  7  9]
 ...
 [ 0  0  0 ... 38 20 41]
 [ 0  0  0 ...  1 12  7]
 [ 0  0  0 ... 24  5 11]]
Output labels: [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]]
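
As a rough illustration of what this cell does to each row, the sketch below splits two made-up padded rows into inputs and one-hot labels, then works out how many batches land in each split given the shapes printed earlier. The helper `split_and_one_hot` and the toy vocab size of 7 are assumptions for the example. The batch counts assume the final partial batch is kept, which is the default for `batch`.

```python
import math

def split_and_one_hot(rows, vocab_size):
    """Split each row into (all but last ID, one-hot vector of the last ID)."""
    inputs = [row[:-1] for row in rows]
    labels = []
    for row in rows:
        one_hot = [0.0] * vocab_size
        one_hot[row[-1]] = 1.0   # the final ID is the character to predict
        labels.append(one_hot)
    return inputs, labels

rows = [[0, 0, 5, 1], [0, 5, 1, 4]]   # made-up padded sequences
inputs, labels = split_and_one_hot(rows, vocab_size=7)
print(inputs)   # [[0, 0, 5], [0, 5, 1]]
print(labels[0])  # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# Batch arithmetic for the real data: 1,614,323 sequences, batch size 200.
num_batches = math.ceil(1_614_323 / 200)
train_batches = int(0.75 * num_batches)
test_batches = num_batches - train_batches
print(num_batches, train_batches, test_batches)  # 8072 6054 2018
```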


## Building the Model

Now for the fun part. We'll create an LSTM model with an embedding layer for our input sequences. The output width will be our vocab size. We will also put dropout in between our layers to reduce overfitting in training.

In [7]:
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Create model
model = tf.keras.Sequential([
    Embedding(vocab_size, embedding_units),
    Dropout(dropout_rate),
    LSTM(lstm_units),
    Dropout(dropout_rate),
    Dense(dense_units, activation='relu'),
    Dense(vocab_size, activation='softmax')
])
model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 64)          24896     
                                                                 
 dropout (Dropout)           (None, None, 64)          0         
                                                                 
 lstm (LSTM)                 (None, 256)               328704    
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 dense (Dense)               (None, 256)               65792     
                                                                 
 dense_1 (Dense)             (None, 389)               99973     
                                                                 
Total params: 519,365
Trainable params: 519,365
Non-trainable params: 0
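
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM (four gates, each with input weights, recurrent weights, and a bias), and dense layers. The helper function names are made up for this example, and the sizes come from the parameters defined at the top of the notebook.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

vocab_size, embedding_units, lstm_units, dense_units = 389, 64, 256, 256

counts = [
    embedding_params(vocab_size, embedding_units),  # 24896
    lstm_params(embedding_units, lstm_units),       # 328704
    dense_params(lstm_units, dense_units),          # 65792
    dense_params(dense_units, vocab_size),          # 99973
]
print(counts, sum(counts))  # total: 519365
```

The dropout layers contribute no parameters, so these four numbers account for the whole total.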

Now, fingers crossed, we can train this model and not run into issues.

In [None]:
# Train for the number of epochs set in the parameters cell
model.fit(train_dataset, epochs=epochs)