# Generating new Norwegian girl baby names with Deep Learning and RNNs 

This post walks through using deep recurrent neural networks in TensorFlow to generate new Norwegian girl baby names. This might be useful for upcoming (scared-to-death) parents who cannot decide on a name :) The main goal is to train a model that can generate "sensible" character combinations, including names that do not exist in the training data. This is just a high-level, hands-on example of how to use TensorFlow; the underlying components of the net are not explained in detail.

## Task

We will build a character-level language model to generate character sequences, a la Andrej Karpathy's [char-rnn](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) (see also, e.g., the TensorFlow implementation by Sherjil Ozair). In the end we want the model to spit out names, both existing ones and new character combinations, that hopefully sound like sensible Norwegian girl names.

We will use data from SSB (Statistics Norway): names of babies born between 2006 and 2015 that occurred at least 4 times in at least one of the 10 years. Names that the model "invents" may therefore already exist: a name given to fewer than 4 babies in every one of those years does not appear in the data.

https://www.ssb.no/statistikkbanken/selecttable/hovedtabellHjem.asp?KortNavnWeb=navn&CMSSubjectArea=befolkning&checked=true


In [1]:
#Imports
import numpy as np
import tensorflow as tf
%matplotlib inline
import matplotlib.pyplot as plt
import time
import os
import urllib
from tensorflow.models.rnn.ptb import reader
import pandas as pd
import io
import requests


## Data

Download the data from SSB, which is temporarily stored on Dropbox, and list some simple statistics.
The list consists of 630 girl names (not really a deep learning problem :) ) given to babies born in Norway between 2006 and 2015, each occurring 4 or more times in at least one of those years. Because the model only learns from the most common names, it may be less likely to create more exotic combinations.

In [136]:
#Load data and functions
# -*- coding: utf-8 -*-
url = "https://www.dropbox.com/s/rxe3vsvdt03jtvi/2017%20-%2001%20-%2009%20-%20SSBJentenavn20062015V2.csv?dl=1"  # dl=1 is important
import urllib
import csv
import codecs
import sys  
#sys.setdefaultencoding('utf8')

#Download file
u = urllib.urlopen(url)
data = u.read()
u.close()
#print (u.headers.getparam("charset"))

#Save csv to disk (the with-block closes the file automatically)
with open('SSBJentenavn20062015.csv', "wb") as f:
    f.write(data)

#Convert data to dataset 
df = pd.read_csv('SSBJentenavn20062015.csv', encoding='iso-8859-1', delimiter=";")

#Convert df to text and remove numbers
df_to_text = ''.join([i for i in df["Navn"].to_string() if not i.isdigit()])
df_to_text = df_to_text.replace(" ", "")

#List the data frame and number of names 
print 'Number of Norwegian girl names of babies born between 2006-2015 with 4 or more occurrences: ' + str(len(df))
print 'List of names and counts each year:'
print df
print df_to_text

Number of Norwegian girl names of babies born between 2006-2015 with 4 or more occurrences: 630
List of names and counts each year:
           Navn  2015  2014  2013  2012  2011  2010  2009  2008  2007  2006
0       Abigail     8    11    15    18     6    10    10     8     8     4
1           Ada   131   110   128   128   109    95    99    91   101    71
2         Adela     0     7     6     5     6     4    11     4     0     0
3         Adele    46    46    69    78    59    49    50    58    45    41
4        Adelen    48    37    50     6     7     9     0     0     7     0
5       Adelina     9    10     6     7     6     6     5     5     6     0
6         Adina     7     7     4     6     0     0    10     8     5     0
7         Adine     4     9     8    10     7     8    11    12     8    10
8          Adna    10     7     0     0     0     5     7     4     5     6
9       Adriana    23    19    19    25    24    24    28    17    19    13
10       Agathe     8    12    

## Generating vocabulary 

In this step we collect the unique characters used across all the names. We also create translation dictionaries between characters and integer indices, since it is easier to work with numbers than with characters. This is quite a small data set, but it is always good practice to index your vocabulary.

The data variable is just a number representation of all the text.

In [138]:
# -*- coding: utf-8 -*-
vocab = set(df_to_text)
vocab_size = len(vocab)
idx_to_vocab = dict(enumerate(vocab))
vocab_to_idx = dict(zip(idx_to_vocab.values(), idx_to_vocab.keys()))
data = [vocab_to_idx[c] for c in df_to_text]
print vocab
print vocab_size
print idx_to_vocab
print vocab_to_idx
print data

set([u'\xe6', u'\n', u'\xc5', u'V', u'A', u'C', u'B', u'E', u'D', u'G', u'F', u'I', u'H', u'K', u'J', u'M', u'L', u'O', u'N', u'P', u'S', u'R', u'U', u'T', u'W', u'\xf8', u'Y', u'\xe5', u'Z', u'a', u'c', u'b', u'e', u'd', u'g', u'f', u'i', u'h', u'k', u'j', u'm', u'l', u'o', u'n', u'q', u'p', u's', u'r', u'u', u't', u'w', u'v', u'y', u'x', u'z'])
55
{0: u'\xe6', 1: u'\n', 2: u'\xc5', 3: u'V', 4: u'A', 5: u'C', 6: u'B', 7: u'E', 8: u'D', 9: u'G', 10: u'F', 11: u'I', 12: u'H', 13: u'K', 14: u'J', 15: u'M', 16: u'L', 17: u'O', 18: u'N', 19: u'P', 20: u'S', 21: u'R', 22: u'U', 23: u'T', 24: u'W', 25: u'\xf8', 26: u'Y', 27: u'\xe5', 28: u'Z', 29: u'a', 30: u'c', 31: u'b', 32: u'e', 33: u'd', 34: u'g', 35: u'f', 36: u'i', 37: u'h', 38: u'k', 39: u'j', 40: u'm', 41: u'l', 42: u'o', 43: u'n', 44: u'q', 45: u'p', 46: u's', 47: u'r', 48: u'u', 49: u't', 50: u'w', 51: u'v', 52: u'y', 53: u'x', 54: u'z'}
{u'j': 39, u'\n': 1, u'E': 7, u'x': 53, u'A': 4, u'C': 5, u'B': 6, u'\xc5': 2, u'D': 8, u'G': 
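
The indexing step can be tried in isolation. The following is a minimal sketch on a made-up three-name string (the sample text is an assumption, not the SSB data). It sorts the character set before enumerating, which is a small deviation from the cell above: a bare `set` has no guaranteed order, so sorting keeps the index assignment reproducible between runs.

```python
# Toy stand-in for df_to_text: names separated by newlines (hypothetical sample).
text = "Ada\nAdele\nAnna\n"

# Sorting makes the character-to-index mapping deterministic.
vocab = sorted(set(text))
idx_to_vocab = dict(enumerate(vocab))
vocab_to_idx = {ch: idx for idx, ch in idx_to_vocab.items()}

# Encode the text as integers and decode it back again.
encoded = [vocab_to_idx[ch] for ch in text]
decoded = "".join(idx_to_vocab[i] for i in encoded)

print(vocab)
print(encoded)
print(decoded == text)
```

Note that the newline character is part of the vocabulary. That is what later lets the model learn where one name ends and the next begins.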

## The Black Magic

In this step we create all the functions we need to use the TensorFlow framework for deep learning. Together with the code I will provide a high-level overview of the components used to train the model. Later this year I will create a more detailed step-by-step guide for DNNs in Norwegian. For now, links and references are provided if you want to dig into the details.

The following material may be useful in order to better understand what goes on under the hood:

1. [Neural Networks Demystified YouTube series](https://www.google.com)
2. [Udacity Deep Learning Course](https://classroom.udacity.com/courses/ud730/)
3. [Udacity Linear Algebra Course](https://classroom.udacity.com/courses/ud953)

### So what are Deep Learning Nets (Warning: High-Level Wikipedia Definition)?

Deep learning is characterized as a class of machine learning algorithms that:

* use a cascade of many layers of nonlinear processing units for feature extraction and transformation. Each successive layer uses the output from the previous layer as input. The algorithms may be supervised or unsupervised and applications include pattern analysis (unsupervised) and classification (supervised).
* are based on the (unsupervised) learning of multiple levels of features or representations of the data. Higher level features are derived from lower level features to form a hierarchical representation.
* are part of the broader machine learning field of learning representations of data.
* learn multiple levels of representations that correspond to different levels of abstraction; the levels form a hierarchy of concepts.

For supervised learning tasks, deep learning methods obviate feature engineering, by translating the data into compact intermediate representations akin to principal components, and derive layered structures which remove redundancy in representation.

Many deep learning algorithms are applied to unsupervised learning tasks. This is an important benefit because unlabeled data are usually more abundant than labeled data. Examples of deep structures that can be trained in an unsupervised manner are neural history compressors and deep belief networks.

GPUs have revived the area of neural networks, as they are great at performing big matrix multiplications (also required for graphics in gaming). The use of RELUs (rectified linear units) and dropout has also helped NNs evolve and boosted the performance of deep learning. RELUs are essentially simple nonlinear functions, max(0, x), that are inserted between layers in order to capture non-linearity. They serve NNs well because their gradient is trivial to compute: 1 for positive inputs and 0 otherwise.
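
To make the rectified linear unit concrete, here is a small NumPy sketch of the function and the gradient used during backpropagation. The function names `relu` and `relu_grad` are chosen for this illustration only, and the gradient at exactly 0 is set to 0 by convention.

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x): negative values are clipped to zero.
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient for backpropagation: 1 where x > 0, otherwise 0
    # (the value at exactly 0 is a convention; 0 is used here).
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))
print(relu_grad(x))
```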

![alt text](https://oakmachine.com/img/network-of-relus.png "Deep Learning with RELUs")

Some terms regarding NN and Deep Learning that might be useful to learn: 
* Weights
* Gradient Descent
* Loss
* Derivatives
* Chain Rule
* Forward Propagation 
* Backward Propagation
* Softmax
* Regularization
* Cross entropy
* Epochs
* Batch size: the number of training examples sampled at each step of stochastic gradient descent to estimate the loss
* Dropout
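
Two of the terms above, softmax and cross entropy, show up directly in the model code later in this post, so a tiny worked example may help. This is a plain-Python sketch with made-up scores for three classes. It is not how TensorFlow implements these operations internally.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_idx):
    # Negative log-probability assigned to the correct class.
    return -math.log(probs[target_idx])

logits = [2.0, 1.0, 0.1]   # hypothetical scores for three characters
probs = softmax(logits)
print(probs)
print(cross_entropy(probs, 0))  # low loss: class 0 has the highest probability
print(cross_entropy(probs, 2))  # higher loss: class 2 was considered unlikely
```

The softmax turns arbitrary scores into probabilities that sum to one, and the cross entropy penalizes the model more the less probability it gave to the character that actually came next.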

This is just meant to be a short tutorial on how to get a simple example running in TensorFlow.
If you want to read about the different components in more detail, have a look at this blog as well: 
[R2RT Blog](http://r2rt.com/recurrent-neural-networks-in-tensorflow-i.html)

In this example we will use a special flavor of NNs called RNNs. The magic behind them is that they also take the history of a sequence into account. In a traditional neural network we assume that all inputs (and outputs) are independent of each other, but for many tasks that is a very bad idea. If you want to predict the next word in a sentence, you had better know which words came before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a "memory" which captures information about what has been calculated so far. In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps:

![alt text](http://karpathy.github.io/assets/rnn/charseq.jpeg "RNN")
![alt text](http://img.youtube.com/vi/H3ciJF2eCJI/0.jpg "RNN")
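
The "memory" idea can be sketched in a few lines of NumPy. Note the assumptions: this is a plain (vanilla) RNN cell with random, untrained weights and toy sizes, not the LSTM cells used in the model further down. It only illustrates that the hidden state after reading a character depends on what came before it.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, hidden_size = 5, 8           # toy sizes, chosen only for illustration

W_xh = rng.randn(vocab_size, hidden_size) * 0.1   # input  -> hidden
W_hh = rng.randn(hidden_size, hidden_size) * 0.1  # hidden -> hidden (the recurrence)
b_h = np.zeros(hidden_size)

def rnn_step(x_onehot, h_prev):
    # The new state depends on the current input AND the previous state.
    return np.tanh(x_onehot.dot(W_xh) + h_prev.dot(W_hh) + b_h)

def run(sequence):
    # Feed the character indices one at a time, carrying the state along.
    h = np.zeros(hidden_size)
    for idx in sequence:
        h = rnn_step(np.eye(vocab_size)[idx], h)
    return h

# Same final character (index 2), different history.
h_a = run([0, 1, 2])
h_b = run([3, 4, 2])
print(np.allclose(h_a, h_b))
```

Because the two sequences end in the same character but differ earlier on, the two final states differ, and a prediction layer on top of them can therefore give different next-character probabilities.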

Some additional neural network terminology:

* one epoch = one forward pass and one backward pass of all the training examples
* batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you'll need.
* number of iterations = number of passes, each pass using [batch size] number of examples. To be clear, one pass = one forward pass + one backward pass (we do not count the forward pass and backward pass as two different passes).

* Example: if you have 1000 training examples, and your batch size is 500, then it will take 2 iterations to complete 1 epoch.

An epoch usually means one iteration over all of the training data. For instance, if you have 20,000 images and a batch size of 100 then the epoch should contain 20,000 / 100 = 200 steps. However, I usually just set a fixed number of steps like 1000 per epoch even though I have a much larger data set. At the end of the epoch I check the average cost and if it improved I save a checkpoint. There is no difference between steps from one epoch to another. I just treat them as checkpoints.
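
The arithmetic in the two examples above can be checked with a one-line helper. The helper name is made up for this sketch, and rounding up (so that a final, smaller batch still counts as an iteration) is an assumption; some pipelines drop the incomplete batch instead.

```python
def iterations_per_epoch(num_examples, batch_size):
    # Ceiling division: a final, smaller batch still counts as one iteration.
    return -(-num_examples // batch_size)

print(iterations_per_epoch(1000, 500))    # 2
print(iterations_per_epoch(20000, 100))   # 200
```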

In [139]:
def gen_epochs(n, num_steps, batch_size):
    for i in range(n):
        yield reader.ptb_iterator(data, batch_size, num_steps)

def reset_graph():
    if 'sess' in globals() and sess:
        sess.close()
    tf.reset_default_graph()

def train_network(g, num_epochs, num_steps = 200, batch_size = 32, verbose = True, save=False):
    tf.set_random_seed(2345)
    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        training_losses = []
        for idx, epoch in enumerate(gen_epochs(num_epochs, num_steps, batch_size)):
            training_loss = 0
            steps = 0
            training_state = None
            for X, Y in epoch:
                steps += 1

                feed_dict={g['x']: X, g['y']: Y}
                if training_state is not None:
                    feed_dict[g['init_state']] = training_state
                training_loss_, training_state, _ = sess.run([g['total_loss'],
                                                      g['final_state'],
                                                      g['train_step']],
                                                             feed_dict)
                training_loss += training_loss_
            if verbose:
                print("Average training loss for Epoch", idx, ":", training_loss/steps)
            training_losses.append(training_loss/steps)

        if isinstance(save, str):
            g['saver'].save(sess, save)

    return training_losses
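
The training loop above receives `(X, Y)` pairs from `reader.ptb_iterator`. As a rough illustration of what such pairs look like, here is a hypothetical NumPy re-implementation (named `batch_iterator` for this sketch; it is not the library source). The sequence is cut into `batch_size` parallel rows, and the targets are simply the inputs shifted one position ahead, so the model is always asked to predict the next character.

```python
import numpy as np

def batch_iterator(raw_data, batch_size, num_steps):
    # Hypothetical re-implementation for illustration, not the library source.
    raw_data = np.asarray(raw_data, dtype=np.int32)
    batch_len = len(raw_data) // batch_size
    # Lay the sequence out as batch_size parallel rows.
    rows = raw_data[:batch_size * batch_len].reshape(batch_size, batch_len)
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        x = rows[:, i * num_steps:(i + 1) * num_steps]
        # Targets are the inputs shifted one position ahead.
        y = rows[:, i * num_steps + 1:(i + 1) * num_steps + 1]
        yield x, y

toy = list(range(20))
for x, y in batch_iterator(toy, batch_size=2, num_steps=3):
    print(x.tolist(), y.tolist())
```

With a small data set, a large `num_steps` combined with a batch size of 32 can leave too few characters per row to form even one batch, which is one reason to keep `num_steps` small here.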

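Let's see what performance we get using a 3-layer LSTM network with 512 hidden units per layer. We train with a batch size of 32. Note that `num_steps` is the length of the character sequence fed to the network in each training step, not the number of steps per epoch; it defaults to 200 in the functions here, but the run below sets it to 4.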

In [57]:
def build_multilayer_lstm_graph_with_dynamic_rnn(
    state_size = 512,
    num_classes = vocab_size,
    batch_size = 32,
    num_steps = 200,
    num_layers = 3,
    learning_rate = 1e-4):

    reset_graph()

    x = tf.placeholder(tf.int32, [batch_size, num_steps], name='input_placeholder')
    y = tf.placeholder(tf.int32, [batch_size, num_steps], name='labels_placeholder')

    embeddings = tf.get_variable('embedding_matrix', [num_classes, state_size])

    # Note that our inputs are no longer a list, but a tensor of dims batch_size x num_steps x state_size
    rnn_inputs = tf.nn.embedding_lookup(embeddings, x)

    cell = tf.nn.rnn_cell.LSTMCell(state_size, state_is_tuple=True)
    cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)
    init_state = cell.zero_state(batch_size, tf.float32)
    rnn_outputs, final_state = tf.nn.dynamic_rnn(cell, rnn_inputs, initial_state=init_state)

    with tf.variable_scope('softmax'):
        W = tf.get_variable('W', [state_size, num_classes])
        b = tf.get_variable('b', [num_classes], initializer=tf.constant_initializer(0.0))

    #reshape rnn_outputs and y so we can get the logits in a single matmul
    rnn_outputs = tf.reshape(rnn_outputs, [-1, state_size])
    y_reshaped = tf.reshape(y, [-1])

    logits = tf.matmul(rnn_outputs, W) + b
    predictions = tf.nn.softmax(logits)

    total_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits, y_reshaped))
    train_step = tf.train.AdamOptimizer(learning_rate).minimize(total_loss)

    return dict(
        x = x,
        y = y,
        init_state = init_state,
        final_state = final_state,
        total_loss = total_loss,
        train_step = train_step,
        preds = predictions,
        saver = tf.train.Saver()
    )
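
Two steps in the graph above are easy to lose track of: the embedding lookup and the reshape before the final matrix multiplication. The following NumPy sketch follows only the tensor shapes, using toy sizes and random values (all assumptions for illustration), and it replaces the LSTM with a pass-through since only the shapes matter here.

```python
import numpy as np

batch_size, num_steps, state_size, num_classes = 2, 4, 8, 5  # toy sizes

x = np.random.randint(0, num_classes, size=(batch_size, num_steps))
embeddings = np.random.randn(num_classes, state_size)

# Embedding lookup: each integer id selects one row of the embedding matrix.
rnn_inputs = embeddings[x]                      # (batch_size, num_steps, state_size)

# Stand-in for the LSTM outputs, which have the same shape as rnn_inputs here.
rnn_outputs = rnn_inputs

# Flatten batch and time so that one matmul scores every position at once.
flat = rnn_outputs.reshape(-1, state_size)      # (batch_size * num_steps, state_size)
W = np.random.randn(state_size, num_classes)
b = np.zeros(num_classes)
logits = flat.dot(W) + b                        # (batch_size * num_steps, num_classes)

print(rnn_inputs.shape, flat.shape, logits.shape)
```

Each row of `logits` holds one score per character in the vocabulary for one position in one sequence, which is why the labels `y` are flattened to a single vector in the same way.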

In [58]:
t = time.time()
g=build_multilayer_lstm_graph_with_dynamic_rnn(num_steps=4)
print("It took", time.time() - t, "seconds to build the graph.")
losses = train_network(g, num_epochs=100, num_steps=4, save="LSTM_30_epochs_variousscripts")


('It took', 1.0260069370269775, 'seconds to build the graph.')
('Average training loss for Epoch', 0, ':', 3.522979659418906)
('Average training loss for Epoch', 1, ':', 3.1210772914271199)
('Average training loss for Epoch', 2, ':', 3.0855817794799805)
('Average training loss for Epoch', 3, ':', 3.0510836955039733)
('Average training loss for Epoch', 4, ':', 2.9330338278124408)
('Average training loss for Epoch', 5, ':', 2.8085998565919938)
('Average training loss for Epoch', 6, ':', 2.6748207307630971)
('Average training loss for Epoch', 7, ':', 2.5456942204506166)
('Average training loss for Epoch', 8, ':', 2.4470833040052846)
('Average training loss for Epoch', 9, ':', 2.380311612159975)
('Average training loss for Epoch', 10, ':', 2.3284218003672938)
('Average training loss for Epoch', 11, ':', 2.2575367958314958)
('Average training loss for Epoch', 12, ':', 2.206624754013554)
('Average training loss for Epoch', 13, ':', 2.1657388094932801)
('Average training loss for Epoch', 14, 
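
To put these loss values in perspective, it can help to compare them with a model that guesses uniformly among all characters. The sketch below assumes the loss is the mean cross entropy in nats (natural logarithm), which is what the softmax cross-entropy loss used above computes, and it reuses the vocabulary size of 55 printed earlier. The `perplexity` helper is introduced here for illustration only.

```python
import math

vocab_size = 55                      # from the vocabulary step above
uniform_loss = math.log(vocab_size)  # cross entropy of guessing uniformly, in nats

def perplexity(loss):
    # exp(loss) reads as the "effective number of equally likely choices".
    return math.exp(loss)

print(uniform_loss)
# Rounded losses for epoch 0 and epoch 13 from the log above.
print(perplexity(3.5230), perplexity(2.1657))
```

A uniform guess gives a loss of about 4.0, so the model is already better than chance after the first epoch, and the falling loss corresponds to the model narrowing down the number of plausible next characters.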

![alt text](https://somyasinghal.files.wordpress.com/2016/02/try.png?w=660 "Interpreting softmax")

In [28]:
def generate_characters(g, checkpoint, num_chars, prompt='A', pick_top_chars=None):
    """ Accepts a current character, initial state"""

    with tf.Session() as sess:
        sess.run(tf.initialize_all_variables())
        g['saver'].restore(sess, checkpoint)

        state = None
        current_char = vocab_to_idx[prompt]
        chars = [current_char]

        for i in range(num_chars):
            if state is not None:
                feed_dict={g['x']: [[current_char]], g['init_state']: state}
            else:
                feed_dict={g['x']: [[current_char]]}

            preds, state = sess.run([g['preds'],g['final_state']], feed_dict)

            if pick_top_chars is not None:
                p = np.squeeze(preds)
                p[np.argsort(p)[:-pick_top_chars]] = 0
                p = p / np.sum(p)
                current_char = np.random.choice(vocab_size, 1, p=p)[0]
            else:
                current_char = np.random.choice(vocab_size, 1, p=np.squeeze(preds))[0]

            chars.append(current_char)

    chars = map(lambda x: idx_to_vocab[x], chars)
    print("".join(chars))
    return("".join(chars))
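
The `pick_top_chars` branch above restricts sampling to the most probable characters. Isolated from TensorFlow, the same idea looks like this. The helper name `sample_top_k` and the probability vector are made up for this sketch.

```python
import numpy as np

def sample_top_k(probs, k, rng=np.random):
    # Keep the k most probable entries, zero out the rest, renormalize, sample.
    p = np.array(probs, dtype=float)
    p[np.argsort(p)[:-k]] = 0.0
    p = p / p.sum()
    return int(rng.choice(len(p), p=p))

probs = [0.05, 0.40, 0.10, 0.30, 0.15]       # hypothetical next-character distribution
rng = np.random.RandomState(0)
draws = [sample_top_k(probs, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))
```

With `k=2` only the two most likely indices (1 and 3) can ever be drawn. A small `k` keeps the generated names conservative, while a larger `k` (or no limit) allows more unusual character combinations at the cost of more nonsense.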

In [60]:
g=build_multilayer_lstm_graph_with_dynamic_rnn(num_steps=1, batch_size=1)
generated_names = generate_characters(g, "LSTM_30_epochs_variousscripts", 500, prompt='A', pick_top_chars=5)



Anita
Agnes
Agne
Anneta
Angele
Annabica
Annebeste
Ingele
Inga
Andrika
Amena
Helda
Vedina
Hedidika
Jenenek
Heslia
Recina
June
Juni
Roniana
Lorja
Lana
Leja
Lejaa
Lena
Lene
Lenkke
Leona
Leina
Licke
Line
Line
Lina
Linnea
Line
Lina
Linn
Linea
Line
Line
ina
Lina
Line
Line
Tine
Line
Tine
Tine
Sine
Sine
Sire
Sirin
Simom
Sikone
Sofia
Sorie
Sofie
Sofje
Sofia
Sofie
Soja
Soisa
Ushile
Suma
Amøy
Astra
Avami
Asane
Asta
Ajeliana
Anetre
Anne
Anne
Anetianea
Dine
Renise
Anetine
Jina
Junate
Runa
Robannike
Rnenne
Ren


In [128]:
# -*- coding: utf-8-*-
#Check which names are not in the original list
generated_names_list = generated_names.split('\n')
names_from_ssb_list = df_to_text.split('\n')

#Exclude names that already exist in the SSB list 
new_names_generated = list(set(generated_names_list) - set(names_from_ssb_list))

#Output number of names generated etc. 
print "Number of generated names: " + str(len(generated_names_list))
print "Number of new names generated that does not exist in the original list: " + str(len(new_names_generated))
print "Rate of new names: " + str(len(new_names_generated)/float(len(generated_names_list))) + '\n'


Number of generated names: 83
Number of new names generated that does not exist in the original list: 46
Rate of new names: 0.55421686747
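
The set difference above is case-sensitive, so a generated name that differs from a training name only by capitalization counts as new. A possible variant, sketched below on made-up lists (both lists and the helper name `new_names` are assumptions for illustration), compares names case-insensitively and also skips empty strings.

```python
def new_names(generated, known):
    # Compare case-insensitively so that "ina" is recognised as the known name "Ina".
    known_lower = {name.lower() for name in known}
    return sorted({name for name in generated
                   if name and name.lower() not in known_lower})

known = ["Ina", "Line", "Sofie"]                   # hypothetical excerpt of training names
generated = ["Line", "ina", "Sofje", "Sofje", ""]  # hypothetical model output
print(new_names(generated, known))
```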



In [130]:
#Print out names
print "New Norwegian girl baby names produced by RNN 3 Layer Deep Neural Net: " + str(len(new_names_generated)) + '\n'

for x in sorted(new_names_generated):
    print x
   

New Norwegian girl baby names produced by RNN 3 Layer Deep Neural Net: 46

Agne
Ajeliana
Amena
Amøy
Andrika
Anetianea
Anetine
Anetre
Angele
Annabica
Annebeste
Anneta
Asane
Astra
Avami
Dine
Hedidika
Helda
Heslia
Ingele
Jenenek
Jina
Junate
Leina
Lejaa
Lenkke
Licke
Lorja
Recina
Ren
Renise
Rnenne
Robannike
Roniana
Sikone
Simom
Sire
Sirin
Sofje
Soisa
Soja
Sorie
Suma
Ushile
Vedina
ina


## Conclusion

The model has for the most part learned to generate sensible character combinations that form girl names, and even new girl names, which is quite impressive. 55% of the generated names (46 of 83) were new names not contained in the training data. If the model were trained more extensively it might lose this flair, and it would be interesting to see the results.

Some of the names generated are indeed weird and interesting from a Norwegian language perspective:

* Agne
* Ajeliana
* Amena
* Amøy - Sounds like an island up north or on the west coast of Norway :)
* Andrika
* Anetianea - This one is weird
* Anetine
* Anetre - This one does not make sense 
* Angele
* Annabica
* Annebeste
* Anneta
* Asane
* Astra
* Avami - Probably a result of ethnic minority names increasing in Norway
* Dine
* Hedidika
* Helda
* Heslia
* Ingele
* Jenenek - Polish sounding :) 
* Jina
* Junate
* Leina - Sounds like a dog name
* Lejaa
* Lenkke
* Licke
* Lorja
* Recina
* Ren
* Renise
* Rnenne
* Robannike
* Roniana
* Sikone
* Simom - It is a boy's name in the US
* Sire
* Sirin
* Sofje
* Soisa
* Soja
* Sorie
* Suma
* Ushile - Probably a result of ethnic minority names increasing in Norway
* Vedina
* ina - The only name the model did not capitalize. Ina is also a common Norwegian name.

The model also learned, for the most part (82 out of 83 names), that new names should start with a capital letter, without me telling it to.

Nevertheless, when arguing about potential baby names one can always turn to trivial projects such as this. I find it mind-blowing, even though I grasp (at least some of) the math and concepts involved. This is indeed the start of something big.


In [None]:
#TODO:
#Dropouts
#Normalization
#Do for boys names as well
#Train deeper networks