# Recurrent Neural Networks

Neural networks are another powerful way of learning functions.

Recurrent neural networks (RNNs) are particularly useful for applications that involve _sequences_ of data. When data comes in meaningful sequences, like natural language (a sequence of words), the context of any data point (i.e. the history that leads up to it) is typically just as relevant as the data point itself. The neural network must therefore have some kind of "memory", which is essentially what RNNs provide (though their memory is very short-term).

So we can use RNNs to learn something like a language model, which can in some sense learn the style of a particular author and riff off of an input sequence of words (e.g. a sentence) in that style. Examples abound, such as generated texts in the style of [Finnegans Wake](https://www.countbayesie.com/blog/2015/5/24/writing-finnegans-wake-with-a-recurrent-neural-net), endless new episode scripts for [Full House](http://fullest.house/), and algorithmically composed [sheet music](https://medium.com/dbrs-innovation-labs/in-his-novel-galatea-2-2-e9d11c9b7c2a#.xd00cremz).

So pretty much anything that can be thought of as a sequence can be handled by an RNN. Ultimately we are still just learning a function that describes the dynamics of whatever sequence we're looking at; the function is just more complex.
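
To make the idea of "memory" concrete, here is a minimal sketch of a plain (non-LSTM) recurrent step in NumPy. The sizes, weight names (`W_xh`, `W_hh`, `b_h`) and the helper `rnn_forward` are illustrative assumptions and are not part of the model trained later in this notebook. The point is only that each hidden state is computed from the current input and the previous hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden = 3, 4  # illustrative sizes only
W_xh = rng.normal(scale=0.5, size=(n_in, n_hidden))      # input -> hidden
W_hh = rng.normal(scale=0.5, size=(n_hidden, n_hidden))  # hidden -> hidden (the "memory" path)
b_h = np.zeros(n_hidden)

def rnn_forward(xs):
    """Run a vanilla RNN over a sequence and return every hidden state."""
    h = np.zeros(n_hidden)
    states = []
    for x in xs:
        # the new state depends on the current input AND the previous state
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.array(states)

xs = rng.normal(size=(5, n_in))  # a toy sequence of 5 steps
states = rnn_forward(xs)
print(states.shape)
```

Changing the first element of `xs` changes the last hidden state too, which is the sense in which the network carries history forward.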

In [3]:
import numpy as np
import pandas as pd

In [26]:
df_ = pd.read_csv('events.csv')

In [3]:
df_.head()

Unnamed: 0,Event ID,Event Date,Source Name,Source Sectors,Source Country,Event Text,CAMEO Code,Intensity,Target Name,Target Sectors,Target Country,Story ID,Sentence Number,Publisher,City,District,Province,Country,Latitude,Longitude
0,20718170,2014-01-01,Police (Australia),"Police,Government",Australia,"Arrest, detain, or charge with legal action",173,-5,Men (Australia),"Social,General Population / Civilian / Social",Australia,32493690,2,Daily Telegraph,Surfers Paradise,Gold Coast,State of Queensland,Australia,-28.0027,153.43
1,20718171,2014-01-01,Police (Australia),"Police,Government",Australia,"Arrest, detain, or charge with legal action",173,-5,Children (Australia),"Social,General Population / Civilian / Social",Australia,32493693,1,Daily Telegraph,Maroubra,Randwick,State of New South Wales,Australia,-33.95,151.233
2,20718172,2014-01-01,Government Official (Democratic Republic of Co...,Government,Democratic Republic of Congo,Make statement,10,0,Attacker (Democratic Republic of Congo),"Criminals / Gangs,Dissident",Democratic Republic of Congo,32495112,3,The Australian,Kinshasa,,Kinshasa City,Democratic Republic of Congo,-4.32142,15.3081
3,20718174,2014-01-01,Military (South Sudan),"Military,Government",South Sudan,Use conventional military force,190,-10,Armed Rebel (South Sudan),"Rebel,Dissident",South Sudan,32495113,1,The Australian,Juba,,Central Equatoria State,South Sudan,4.85165,31.5825
4,20718173,2014-01-01,Armed Rebel (South Sudan),"Rebel,Dissident",South Sudan,Use unconventional violence,180,-9,Military (South Sudan),"Military,Government",South Sudan,32495113,1,The Australian,Juba,,Central Equatoria State,South Sudan,4.85165,31.5825


In [27]:
len(df_)

120000

In [5]:
# Source and Target Sectors are kind of redundant with Source and Target Names
cols = ['Source Name', 'Source Country', 'Event Text', 'Target Name', 'Target Country']
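df = df_.loc[:, cols].copy()  # copy so the edits below do not touch df_

# the category label space is massive, this reduces it
# by stripping the parenthesized qualifier, e.g. "Police (Australia)" -> "Police"
df['Source Name'] = df['Source Name'].replace(to_replace=r'\s\(.+', value='', regex=True)
df['Target Name'] = df['Target Name'].replace(to_replace=r'\s\(.+', value='', regex=True)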

# NaN values are when source/target don't have a country
# df[df.isnull().any(axis=1)]
df.fillna('NONE', inplace=True)

df.head()

Unnamed: 0,Source Name,Source Country,Event Text,Target Name,Target Country
0,Police,Australia,"Arrest, detain, or charge with legal action",Men,Australia
1,Police,Australia,"Arrest, detain, or charge with legal action",Children,Australia
2,Government Official,Democratic Republic of Congo,Make statement,Attacker,Democratic Republic of Congo
3,Military,South Sudan,Use conventional military force,Armed Rebel,South Sudan
4,Armed Rebel,South Sudan,Use unconventional violence,Military,South Sudan


In [6]:
# encode every column as integer category codes,
# keeping the code -> label lookup for each column in cat_map
cat_map = {}
for col in cols:
    cat = df[col].astype('category')
    cat_map[col] = cat.cat.categories
    df[col] = cat.cat.codes

In [7]:
cat_map['Source Country'][df['Source Country'][0]]

'Australia'
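
As a small illustration of the encoding step above, the following sketch runs the same round trip on a toy column. The `toy` frame and its values are made up for the example; only the `astype('category')`, `.cat.categories` and `.cat.codes` calls mirror the notebook.

```python
import pandas as pd

# a made-up column, standing in for one column of df
toy = pd.DataFrame({'Source Country': ['Australia', 'South Sudan', 'Australia']})

cat = toy['Source Country'].astype('category')
categories = cat.cat.categories  # the distinct labels, in sorted order
codes = cat.cat.codes            # one integer code per row

# decoding is just an index lookup into the categories
decoded = [categories[c] for c in codes]
print(codes.tolist())
print(decoded)
```

Note that each column gets its own code space starting at 0, so the same integer means different things in different columns.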

In [10]:
n_samples = len(df)
data = df.values  # as_matrix() was removed in pandas 1.0
data


In [9]:
Y = data[1:]
Y.shape

(119999, 5)

In [10]:
data = data[:-1]
data.shape

(119999, 5)
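
The two slices above set up next-event prediction: row `i` of the input is paired with row `i + 1` as its target. A toy sketch with made-up numbers (the names `events`, `X_toy` and `Y_toy` are hypothetical):

```python
import numpy as np

events = np.array([[0, 1],
                   [2, 3],
                   [4, 5],
                   [6, 7]])  # 4 toy events, 2 columns each

X_toy = events[:-1]  # every event except the last
Y_toy = events[1:]   # every event except the first

# each input row is paired with the event that follows it
for x, y in zip(X_toy, Y_toy):
    print(x, '->', y)
```

Both arrays end up one row shorter than the original, which is why the shapes above are `(119999, 5)` rather than `(120000, 5)`.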

In [11]:
vocab_size = np.max(data) + 1
vocab_size

11132

In [12]:
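# check how many GB of memory will be needed
# using int16 (16 bits per entry; 1.25e-10 converts bits to GB)
print(Y.shape[0] * Y.shape[1] * (Y.max() + 1) * 16 * 1.25e-10)

# one-hot encode the targets; int16 matches the estimate above
# (np.zeros defaults to float64, which would need four times as much)
y_onehot = np.zeros((Y.shape[0], Y.shape[1], Y.max() + 1), dtype=np.int16)
layer_idx = np.arange(Y.shape[0]).reshape(Y.shape[0], 1)
component_idx = np.tile(np.arange(Y.shape[1]), (Y.shape[0], 1))
y_onehot[layer_idx, component_idx, Y] = 1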

13.35828868
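
The index arithmetic in the one-hot step can be hard to read at full scale, so here is the same trick on a tiny made-up array. The names `Y_toy`, `rows` and `cols_idx` are hypothetical. This sketch relies on NumPy broadcasting instead of `np.tile`, which should produce the same result.

```python
import numpy as np

Y_toy = np.array([[0, 2],
                  [1, 1],
                  [2, 0]])  # 3 samples, 2 columns, codes in 0..2
n_classes = Y_toy.max() + 1

onehot = np.zeros(Y_toy.shape + (n_classes,), dtype=np.int8)
rows = np.arange(Y_toy.shape[0]).reshape(-1, 1)  # shape (3, 1)
cols_idx = np.arange(Y_toy.shape[1])             # shape (2,)

# the three index arrays broadcast to shape (3, 2), so this sets
# onehot[i, j, Y_toy[i, j]] = 1 for every sample i and column j
onehot[rows, cols_idx, Y_toy] = 1

print(onehot.shape)
```

Taking `onehot.argmax(axis=-1)` recovers `Y_toy`, which is a quick way to check the encoding.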


In [13]:
from keras.layers import Embedding, Dropout, Dense, LSTM, TimeDistributed
from keras.models import Sequential

embed_size = 100
batch_size = 128
epochs = 100

model = Sequential()
model.add(Embedding(vocab_size, embed_size, input_length=data.shape[1]))
model.add(Dropout(0.3))
model.add(LSTM(embed_size, return_sequences=True))
model.add(Dropout(0.3))
model.add(LSTM(embed_size, return_sequences=True))
model.add(Dropout(0.3))
model.add(TimeDistributed(Dense(vocab_size, activation='softmax')))

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# `epochs` is the Keras 2+ name for the older `nb_epoch` argument
model.fit(data, y_onehot, batch_size=batch_size, epochs=epochs, validation_split=0.05)

Train on 113999 samples, validate on 6000 samples
Epoch 1/100
...
Epoch 74/1

Using TensorFlow backend.


<keras.callbacks.History at 0x7f44157cdd68>
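
For intuition about what the final `TimeDistributed(Dense(..., activation='softmax'))` layer and the `categorical_crossentropy` loss compute, here is a NumPy sketch on toy numbers. It is not Keras's implementation; the `softmax` helper, the random `logits` and the `targets` are assumptions made up for the example. The layer produces one probability distribution per position in the sequence, and the loss is the negative log-probability of the true class, averaged over positions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(1, 5, 4))   # 1 sample, 5 positions, 4 classes (toy sizes)
targets = np.array([[0, 3, 1, 1, 2]]) # true class at each position

probs = softmax(logits)      # one distribution per position
onehot = np.eye(4)[targets]  # shape (1, 5, 4)

# cross-entropy per position, then averaged
loss = -(onehot * np.log(probs)).sum(axis=-1).mean()
print(loss)
```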

In [14]:
model.save_weights('icews_weights.h5', overwrite=True)

[TIP] Next time specify overwrite=True in save_weights!


In [15]:
def render(arr):
    parts = []
    for idx, col in enumerate(cols):
        parts.append(cat_map[col][int(arr[idx])])
    return parts

In [39]:
def ramble(seed, n):
    # the seed only primes the model; it holds raw codes, so it is not part of the output
    rambling = []
    for _ in range(n):
        # seed has shape (5, 1); predict expects (1, 5) and returns (1, 5, vocab_size)
        probs = model.predict(seed.T)[0]
        seed = []
        # random draw based on probs
        cs = np.cumsum(probs, axis=1)
        for row in cs:
            idx = row.searchsorted(np.random.random() * row[-1], 'right')
            seed.append(idx)
        rambling.append(render(seed))
        seed = np.array(seed).reshape(5,1)
    return rambling
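
The random draw inside `ramble` is inverse-CDF sampling: build the cumulative sum of the probabilities, draw a uniform number, and find the first bin it falls into. A standalone sketch on a made-up three-class distribution (the `sample` helper and `probs` values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.6, 0.3])  # toy distribution over 3 classes

def sample(probs, rng):
    cdf = np.cumsum(probs)
    # scale by cdf[-1] so small rounding errors in the total do not matter
    u = rng.random() * cdf[-1]
    return int(cdf.searchsorted(u, side='right'))

draws = np.array([sample(probs, rng) for _ in range(10000)])
freqs = np.bincount(draws, minlength=len(probs)) / len(draws)
print(freqs)
```

With enough draws the empirical frequencies should land close to `probs`. Sampling this way, rather than always taking the most likely class, is what keeps the generated sequence from repeating itself.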

In [44]:
seed = np.random.randint(0, 20, (5,1))
for source_name, source_country, event_text, target_name, target_country in ramble(seed, 10):
    print('{} ({}) -> {} -> {} ({})'.format(source_name, source_country, event_text, target_name, target_country))

Citizen (Turkey) -> Protest violently, riot -> Argentina (Italy)
Communist Party of India (Turkey) -> Refuse to release persons or property -> Activist (Indonesia)
High Ranking Military Personnel (NONE) -> Make statement -> Militant (Switzerland)
Domestic Affairs (South Korea) -> Accuse -> Indonesia (Turkey)
Media (South Korea) -> Appeal for judicial cooperation -> Japan (India)
Protester (United Kingdom) -> Consult -> Men (United States)
Citizen (Australia) -> Consult -> Protester (Thailand)
Barack Obama (Thailand) -> Consult -> Men (South Korea)
Barack Obama (United States) -> Consult -> Unspecified Actor (United States)
Barack Obama (France) -> Cooperate economically -> Lakhdar Brahimi (Saudi Arabia)
