# NLP Core 3 Exercise: Learning how to tweet from Trump

In this exercise, we will train a character-level RNN language model in Keras to generate text, using a dataset of tweets written by Donald Trump.

**Note:** if you are solving this exercise in Google Colab, you must first upload the file to your runtime (use the 'Files' tab in the expanding menu to the left). Also make sure that you have the runtime type set to use GPU (in the menu *Runtime > Change runtime type*).

## Loading and cleaning the data

The accompanying file *trump_tweets.txt* contains a list of newline-separated Trump tweets. 

**Questions:**
1. Load these tweets into a pandas DataFrame with column 'text'.
2. Add a new column 'cleaned' to the dataframe containing the text of the tweets with some noise cleaned -- remove URLs and replace every character that is not a basic English letter or punctuation symbol (A-Za-z.,!?@:; or a space) with the character '?'. Feel free to clean any other kind of noise that you find.
3. Add the character '^' to the beginning of each cleaned tweet and '$' to the end.
4. Filter the dataframe to contain only 2000 tweets between 50 and 180 characters long. Now plot a histogram of the number of characters in Trump's tweets. What pattern do you see? (Hint: use Pandas df.column.str.len() and df.column.hist()).

In [1]:
import pandas as pd

In [2]:
with open('trump_tweets.txt') as open_file:
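    # splitlines() drops the trailing newline, which would otherwise be cleaned into a stray '?'
    my_lines = open_file.read().splitlines()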

In [3]:
trump_df = pd.DataFrame()

In [4]:
trump_df["text"] = my_lines

In [5]:
trump_df

|   | text |
|---|------|
| 0 | The race for DNC Chairman was, of course, tota... |
| 1 | For first time the failing @nytimes will take ... |
| 2 | Russia talk is FAKE NEWS put out by the Dems, ... |
| 3 | Big dinner with Governors tonight at White Hou... |
| 4 | Congressman John Lewis should spend more time ... |
| 5 | mention crime infested) rather than falsely co... |
| 6 | INTELLIGENCE INSIDERS NOW CLAIM THE TRUMP DOSS... |
| 7 | Congressman John Lewis should finally focus on... |
| 8 | Inauguration Day is turning out to be even big... |
| 9 | I am now going to the brand new Trump Internat... |


In [6]:
import re

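# regex=True is required in recent pandas versions, where str.replace defaults to literal matching
trump_df['cleaned'] = trump_df['text'].str.replace(r"http\S+", "", regex=True).str.replace(r'[^A-Za-z.,!?@:; ]', '?', regex=True)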
trump_df['cleaned'] = "^" + trump_df['cleaned'] + "$"
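
To see what the cleaning step does to a single string, here is a minimal sketch using only the standard `re` module. The helper name `clean_tweet` and the example tweet are made up for illustration and are not part of the exercise data.

```python
import re

def clean_tweet(text):
    """Remove URLs, map disallowed characters to '?', and add '^'/'$' markers."""
    text = re.sub(r"http\S+", "", text)             # drop URLs
    text = re.sub(r"[^A-Za-z.,!?@:; ]", "?", text)  # everything else becomes '?'
    return "^" + text + "$"

# Invented example tweet, not taken from the dataset.
print(clean_tweet("Great rally in Ohio #MAGA 2016! https://t.co/abc123"))
# prints: ^Great rally in Ohio ?MAGA ????! $
```

Note that digits and '#' are not in the allowed set, so they turn into '?', and the space that preceded the URL is kept.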

In [7]:
len_vec = trump_df.cleaned.apply(len)
trump_df = trump_df[(len_vec>=50) & (len_vec <= 180)]
trump_df = trump_df.sample(2000)

In [8]:
len_vec_2 = trump_df.cleaned.apply(len)

In [9]:
len_vec_2.hist(bins = 50)

<matplotlib.axes._subplots.AxesSubplot at 0x11f9fffd0>

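We can see that the lengths of Trump's tweets cluster around 144 characters. This is close to 140, which used to be the maximum number of characters in a tweet until Twitter raised the limit to 280 (the cleaned length also counts the added '^' and '$' markers).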

## Computing feature vectors

We will now convert the tweets into feature vectors that can be used to train a network.

**Questions:**

5. Create a variable *charset* containing a string with all of the unique characters used in the cleaned tweets, and with the padding character '0' at the beginning (so it should have a value similar to '0 !$,.:;?@ABCDEFGHIJKLMNOPQRSTUVWXYZ^abcdefghijklmnopqrstuvwxyz').
6. Convert the tweets to vectors of character indices (starting at 1). Hint: use the function charset.index().
7. Use the function *pad_sequences()* from keras.preprocessing.sequence to make each feature vector of length 200, by adding zeros at the end of the vectors (use the arguments value = 0, padding = 'post', maxlen = 200). Save the output matrix of feature vectors in a numpy array *data*. Each row of *data* should be the feature vector for one tweet. What is the shape of *data*?
8. Use *data* to generate input and target matrices *X* and *Y*. X should be a matrix of character indices of shape (2000, 199). *Y* should be a one-hot encoded tensor of shape (2000, 199, #) for some number # (use *to_categorical()* from keras.utils to one-hot encode Y). Note: Y should be offset one character from X since we want to predict the next character in a string given what came before.

In [10]:
# sorted() makes the character order (and therefore the indices) reproducible across runs
charset = sorted(set(trump_df['cleaned'].str.cat()))

In [11]:
charset = '0' + "".join(charset)

In [12]:
trump_vectors = trump_df['cleaned'].apply(lambda x: [charset.index(c) for c in x])

In [13]:
from keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend.


In [14]:
data = pad_sequences(trump_vectors.to_list(), value = 0, padding = 'post', maxlen = 200)

In [15]:
X = data[:, :-1]

In [16]:
y = data[:, 1:]

In [17]:
from keras.utils import to_categorical

In [18]:
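# Fix the number of classes explicitly so the one-hot depth always matches the vocabulary size
y = to_categorical(y, num_classes=len(charset))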
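
The offset between inputs and targets is easier to see on a toy example. The sketch below uses plain Python lists instead of pandas and Keras; the two tweets, the `toy_` names, and the padded length of 6 (standing in for 200) are assumptions made for illustration.

```python
# Toy "cleaned" tweets, invented for illustration.
toy_tweets = ["^Hi!$", "^No$"]
toy_charset = "0" + "".join(sorted(set("".join(toy_tweets))))
toy_len = 6  # stands in for maxlen = 200 in the exercise

def encode(tweet):
    """Map characters to indices (starting at 1) and zero-pad on the right."""
    ids = [toy_charset.index(c) for c in tweet]
    return ids + [0] * (toy_len - len(ids))

def decode(ids):
    """Map indices back to characters; padding shows up as '0'."""
    return "".join(toy_charset[i] for i in ids)

toy_data = [encode(t) for t in toy_tweets]
toy_X = [row[:-1] for row in toy_data]  # inputs: every position but the last
toy_Y = [row[1:] for row in toy_data]   # targets: the same row shifted left by one

for x_row, y_row in zip(toy_X, toy_Y):
    print(decode(x_row), "->", decode(y_row))
```

At every position, the target is the character that follows the input character, which is exactly what the language model is asked to predict.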

## Building and training the language model

We will start by using the following imports and hyperparameter settings:

In [19]:
from keras.models import Sequential
from keras.layers import Activation, Dense, LSTM, Embedding, TimeDistributed
hidden_size = 128
embedding_size = 8

We will now build a character-level RNN language model that can learn from the given data.

**Questions:**

9.  Build a sequential model *model*, and using *model.add()* add the following four layers:
  * Embedding layer with output dimension *embedding_size*. Use mask_zero = True since we zero-padded the input, and set input_length and input_dim to match the dimensions of X and the number of possible values that features in X can take.
  * LSTM layer with hidden state dimension *hidden_size*. Use return_sequences = True to make the layer output the sequence of hidden states.
  * Fully-connected layer -- use *TimeDistributed(Dense(#))* for some number #
  * Softmax activation layer
  
10. Compile the model (*model.compile()*) with loss function 'categorical_crossentropy' and optimizer 'adam'. Examine the output shapes of the model's layers with *model.summary()*. How do you interpret the output of the final layer?

In [20]:
vocab_size = len(charset)

model = Sequential()
model.add(Embedding(mask_zero = True, input_dim = vocab_size, output_dim=embedding_size, input_length=199))
model.add(LSTM(hidden_size, return_sequences=True))
model.add(TimeDistributed(Dense(vocab_size)))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 199, 8)            504       
_________________________________________________________________
lstm_1 (LSTM)                (None, 199, 128)          70144     
_________________________________________________________________
time_distributed_1 (TimeDist (None, 199, 63)           8127      
_________________________________________________________________
activation_1 (Activation)    (None, 199, 63)           0         
Total params: 78,775
Trainable params: 78,775
Non-trainable params: 0
_________________________________________________________________
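
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias) and takes the sizes from the summary above.

```python
vocab_size, embedding_size, hidden_size = 63, 8, 128

# Embedding: one vector of size embedding_size per vocabulary entry.
embedding_params = vocab_size * embedding_size

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * ((embedding_size + hidden_size) * hidden_size + hidden_size)

# Dense: a weight matrix from hidden state to vocabulary, plus a bias per class.
dense_params = hidden_size * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# prints: 504 70144 8127 78775
```

The final output shape (None, 199, 63) means that, for each of the 199 input positions, the model produces a probability distribution over the 63 characters in the vocabulary.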


## Generating tweets

Now we will train our language model and use it to generate tweets.

We can generate a tweet using our model as follows:
* Start with the beginning-of-string token '^' as the initial input.
* Predict the distribution of the next character using the model, and sample the next character in the tweet from that distribution (you can use *np.random.choice()* for this).
* Repeat the process until the end-of-string token '$' is predicted or until 200 characters are generated
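
The procedure above can be sketched without a trained network. In the example below, `toy_next_char_dist` is a hypothetical stand-in for the model's prediction, the five-character alphabet is invented, and `random.choices` from the standard library plays the role of *np.random.choice()*.

```python
import random

toy_charset = "0$^ab"  # toy alphabet; index 0 is the padding character

def toy_next_char_dist(prefix_ids):
    """Hypothetical stand-in for the model: returns P(next character)."""
    if len(prefix_ids) < 4:
        return [0.0, 0.1, 0.0, 0.5, 0.4]  # mostly letters early on
    return [0.0, 0.8, 0.0, 0.1, 0.1]      # mostly '$' later

def sample_tweet(max_len=200, seed=None):
    rng = random.Random(seed)
    ids = [toy_charset.index("^")]        # start with the beginning-of-string token
    while len(ids) < max_len:
        dist = toy_next_char_dist(ids)
        # Draw the next character index according to the predicted distribution.
        next_id = rng.choices(range(len(dist)), weights=dist)[0]
        ids.append(next_id)
        if next_id == toy_charset.index("$"):
            break                         # end-of-string token ends the tweet
    return "".join(toy_charset[i] for i in ids)

print(sample_tweet(seed=0))
```

Sampling from the distribution, rather than always taking the most likely character, is what makes each generated tweet different.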

**Questions:**
11. Make a function *generate_tweet()* that returns the string of a tweet generated by the model, using the above procedure. What does its output look like (before the model is trained)?
12. Train the model using *model.fit*, with batch_size = 128 and validation_split = 0.2, for 100 epochs. Generate a tweet using the model every 20 epochs. What do you see?
13. Keep training the model as long as it is underfitting (or until you get bored) and observe how the model learns to generate better tweets.

**Bonus exercises:**
* Try changing the model hyperparameters (embedding and hidden dimensions, batch size). How does this affect how quickly the model learns and/or the output?
* Add a temperature parameter T in the generation step by adding a Lambda layer before the softmax layer with lambda x: x / T. How do you expect this would affect output, and why?
* Use https://faketrumptweet.com/ to fool your family and friends with your best randomly-generated Trump tweet.
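
As a hint for the temperature exercise, the effect of dividing the pre-softmax scores by T can be explored in isolation. This is a plain-Python sketch rather than a Keras Lambda layer; the function name `softmax_with_temperature` and the example scores are assumptions chosen for illustration.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T, computed in a numerically stable way."""
    scaled = [z / T for z in logits]
    m = max(scaled)                       # subtract the max to avoid overflow
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

example_logits = [2.0, 1.0, 0.0]          # made-up scores for three characters
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, T)
    print(T, [round(p, 3) for p in probs])
```

A temperature below 1 concentrates probability on the highest-scoring character, while a temperature above 1 flattens the distribution and makes sampling more random.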

In [23]:
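#11
import numpy as np

maxlen = 200

def generate_tweet():
    # Start from the beginning-of-string token.
    indices = [charset.index('^')]
    while len(indices) < maxlen:
        # Pad the characters generated so far to the model's input length.
        seq = pad_sequences([indices], maxlen=maxlen-1, padding="post", value=0)
        # Read the prediction at the last real (non-padded) position.
        next_char_dist = model.predict(seq)[0, len(indices) - 1, :].astype('float64')
        # Renormalize so float32 rounding cannot make np.random.choice reject p.
        next_char_dist /= next_char_dist.sum()
        next_char = np.random.choice(len(next_char_dist), p=next_char_dist)
        indices.append(next_char)
        # Stop once the end-of-string token is generated.
        if next_char == charset.index('$'):
            break
    return ''.join(charset[i] for i in indices)

generate_tweet()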


'^$'

In [24]:
epoch,epochs_per_iteration=0,20

In [None]:
#12,13
for i in range(400//epochs_per_iteration):
    model.fit(X, y, epochs=epochs_per_iteration, batch_size=128, validation_split=0.2)
    epoch += epochs_per_iteration
    print(f'Result of {epoch}-th epoch:')
    print(generate_tweet())
    print()
    print()

Train on 1600 samples, validate on 400 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Result of 420-th epoch:
^@larleveilllig:  @petmlase @ReCYoshe Couls Jot halls in tirnaby, ? polls of The Carson alvess ROP on @No newtupply trithon runiot Mirson phing tominament? Digg...?$


Train on 1600 samples, validate on 400 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
Result of 440-th epoch:
^Tere crazy, in Colmonst kipt Othirrat mor in Rissine forwing you at ?:?? P.M. jush sunestist are good to wlrefh, can?t on you!?$


Train on 1600 samples, validate on 400 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch