# 01 Baby name

![](https://images.unsplash.com/photo-1519689680058-324335c77eba?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1050&q=80)

Photo by [Valeria Zoncoll](https://unsplash.com/photos/AVGc87j_vNA)

In this challenge, you will generate baby names using recurrent neural networks!

The dataset is in the file `names.txt`, encoded in `'ISO-8859-1'` and containing more than 10 000 names.

First load it, have a look at the names, and clean the dataset if needed.

In [1]:
# TODO: Load the dataset and explore it
### STRIP_START ###
import pandas as pd
names = pd.read_csv('names.txt', encoding='ISO-8859-1')

names = names.drop_duplicates()

names.head()
### STRIP_END ###

Unnamed: 0,name
0,aaliyah
1,aapeli
2,aapo
3,aaren
4,aarne


The RNN needs to know where a word begins and where it ends. So we add a special character at the beginning of every word, for example `'\t'` (it could be anything else, as long as it is easy to identify and does not appear in the names). We also add `'\n'` at the end of every word to mark its end.

In [2]:
# TODO: add '\t' at the beginning and '\n' at the end of every word
### STRIP_START ###
names['name'] = names['name'].apply(lambda x: '\t' + str(x) + '\n')
### STRIP_END ###
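
As a small illustration of the marker step, here is a minimal sketch that works on a hand-written list instead of the dataframe. The helper name `add_markers` and the sample names are assumptions made for this example only.

```python
START, END = '\t', '\n'  # same marker characters as in the text above

def add_markers(name):
    """Wrap a name with the start and end markers (hypothetical helper)."""
    return START + str(name) + END

sample = ['aapo', 'aaren']  # hand-written sample, not read from names.txt
marked = [add_markers(n) for n in sample]

# repr() makes the invisible tab and newline characters visible
print([repr(m) for m in marked])
```

Printing with `repr()` is a convenient way to check that both markers are really there, since they are invisible otherwise.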

To generate names, we will work at the character level: we will train an RNN to predict the next character, knowing the previous ones. So, compute a list of all the possible characters.

In [3]:
# TODO: Compute and display the list of all possible characters
### STRIP_START ###
# Collect every distinct character. '\t' and '\n' are already included,
# since they were added to every name in the previous step.
all_chars = set()
for name in names['name']:
    all_chars.update(name)

print('number of characters', len(all_chars))
print(all_chars)
### STRIP_END ###

number of characters 55
{'ö', 'à', 'ä', 'ô', 'ã', 'x', 'f', 'ø', 'm', 'ì', 'c', 'l', 'k', 'b', 'u', 'ï', 'i', 'y', 'ú', 'ò', 'ê', '\t', 'n', 'é', 'ë', 'z', 'ñ', 'a', 'ü', 'r', 'o', 'ù', 'd', 'õ', 'h', 't', 'æ', 'w', 'á', 's', 'p', '-', 'è', 'ð', 'ç', 'ó', 'í', 'j', 'þ', 'e', 'q', 'v', 'g', 'å', '\n'}


You should get 55 characters, right?

As usual when working with characters (or words), we will convert them into integers. So build a dictionary `char_to_idx` that, given a character as key, returns an integer. Then build the inverse dictionary `idx_to_char` that, given an integer as key, returns the corresponding character.

In [4]:
# TODO: Compute the idx_to_char and char_to_idx dict
### STRIP_START ###
# max length of a name is 11
char_to_idx = { ch:i for i,ch in enumerate(sorted(all_chars)) }
idx_to_char = { i:ch for i,ch in enumerate(sorted(all_chars)) }
char_to_idx
### STRIP_END ###

{'\t': 0,
 '\n': 1,
 '-': 2,
 'a': 3,
 'b': 4,
 'c': 5,
 'd': 6,
 'e': 7,
 'f': 8,
 'g': 9,
 'h': 10,
 'i': 11,
 'j': 12,
 'k': 13,
 'l': 14,
 'm': 15,
 'n': 16,
 'o': 17,
 'p': 18,
 'q': 19,
 'r': 20,
 's': 21,
 't': 22,
 'u': 23,
 'v': 24,
 'w': 25,
 'x': 26,
 'y': 27,
 'z': 28,
 'à': 29,
 'á': 30,
 'ã': 31,
 'ä': 32,
 'å': 33,
 'æ': 34,
 'ç': 35,
 'è': 36,
 'é': 37,
 'ê': 38,
 'ë': 39,
 'ì': 40,
 'í': 41,
 'ï': 42,
 'ð': 43,
 'ñ': 44,
 'ò': 45,
 'ó': 46,
 'ô': 47,
 'õ': 48,
 'ö': 49,
 'ø': 50,
 'ù': 51,
 'ú': 52,
 'ü': 53,
 'þ': 54}
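
To check that two such dictionaries are really inverses of each other, a round trip can help. The sketch below uses a tiny toy vocabulary; the names `toy_chars`, `encode` and `decode` are hypothetical and only serve the illustration.

```python
# Tiny vocabulary for illustration; the real one has 55 characters
toy_chars = {'\t', '\n', 'a', 'o', 'p'}
toy_char_to_idx = {ch: i for i, ch in enumerate(sorted(toy_chars))}
toy_idx_to_char = {i: ch for ch, i in toy_char_to_idx.items()}

def encode(text, mapping):
    """Turn a string into a list of integer indices."""
    return [mapping[c] for c in text]

def decode(indices, mapping):
    """Turn a list of integer indices back into a string."""
    return ''.join(mapping[i] for i in indices)

encoded = encode('\taapo\n', toy_char_to_idx)
print(encoded)  # [0, 2, 2, 4, 3, 1]

# Decoding the encoded name must give back the original string
assert decode(encoded, toy_idx_to_char) == '\taapo\n'
```

Sorting the characters before enumerating them keeps the mapping identical from one run to the next, which a plain `set` iteration order does not guarantee.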

Before going into the neural network part, we have one more step: **create the X and y data**!

The **X** data will be, for every name, all characters but the final `'\n'`. The **y** data will be all characters but the initial `'\t'`.

Indeed, we try to predict each character knowing the previous ones. So **X** does not need the final character, and **y** does not need the first character: `y` is `X` shifted by one position.

Add the columns `X` and `y` to the dataframe.

In [5]:
# TODO: Create the columns X and y
### STRIP_START ###
names['X'] = names['name'].apply(lambda x: x[:-1])  # drop the final '\n'
names['y'] = names['name'].apply(lambda x: x[1:])   # drop the initial '\t'
names.head()
### STRIP_END ###

Unnamed: 0,name,X,y
0,\taaliyah\n,\taaliyah,aaliyah\n
1,\taapeli\n,\taapeli,aapeli\n
2,\taapo\n,\taapo,aapo\n
3,\taaren\n,\taaren,aaren\n
4,\taarne\n,\taarne,aarne\n
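
To see why this shift gives the right training pairs, it can help to line up the two strings character by character. This is a minimal sketch on a single name; the variable names are chosen for the example only.

```python
name = '\taapo\n'

X_chars = name[:-1]  # everything but the final '\n'
y_chars = name[1:]   # everything but the initial '\t'

# At each position, the model sees X_chars[i] and must predict y_chars[i]
pairs = list(zip(X_chars, y_chars))
for x_c, y_c in pairs:
    print(repr(x_c), '->', repr(y_c))
```

The first pair asks the model to predict the first letter from the start marker, and the last pair asks it to predict the end marker from the last letter.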


Now, using your `char_to_idx` dict, compute the corresponding `X` and `y` containing, for each name, a list of integers.

In [6]:
# TODO: Create the X and y variables containing integers only
### STRIP_START ###
X = names['X'].apply(lambda x: [char_to_idx[c] for c in x])
y = names['y'].apply(lambda x: [char_to_idx[c] for c in x])
### STRIP_END ###

That was the complicated part. We are now back in a familiar case: use Keras and its `pad_sequences()` function to get proper `X` and `y` arrays with `maxlen=16`.

In [7]:
# TODO: Use pad_sequences to get only sequences of length 16 for each name
### STRIP_START ###
from tensorflow.keras.preprocessing import sequence

maxlen = 16

X_train = sequence.pad_sequences(X,
                                 value=0,
                                 padding='post',
                                 maxlen=maxlen)

y_train = sequence.pad_sequences(y,
                                 value=0,
                                 padding='post',
                                 maxlen=maxlen)
X_train.shape, y_train.shape
### STRIP_END ###

((11497, 16), (11497, 16))
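
What post-padding does to a single sequence can be sketched in plain Python. The function `pad_post` below is a hypothetical stand-in written for illustration, not the Keras implementation; it assumes that sequences longer than `maxlen` are cut from the start, as with the Keras default `truncating='pre'`.

```python
def pad_post(seq, maxlen, value=0):
    """Pad a sequence at the end with `value`, up to `maxlen` items.

    Illustrative stand-in for padding='post'; longer sequences keep
    only their last `maxlen` items.
    """
    seq = list(seq)[-maxlen:]
    return seq + [value] * (maxlen - len(seq))

# '\taapo' encoded with the char_to_idx dictionary shown earlier
x_aapo = [0, 3, 3, 18, 17]
padded = pad_post(x_aapo, maxlen=16)
print(padded)
print(len(padded))
```

Note that with `value=0` the padded positions carry the same index as `'\t'` in `char_to_idx`, so after padding the two cannot be told apart by value alone.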

Finally, use the function `to_categorical()` to one-hot encode both arrays.

In [8]:
# TODO: use to_categorical to perform one hot encoding
### STRIP_START ###
from tensorflow.keras.utils import to_categorical


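# Set num_classes explicitly so both arrays always get one column per character
X_train = to_categorical(X_train, num_classes=len(char_to_idx))
y_train = to_categorical(y_train, num_classes=len(char_to_idx))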

X_train.shape, y_train.shape
### STRIP_END ###

((11497, 16, 55), (11497, 16, 55))

You should finally have arrays of shape `(number of names, 16, 55)`:
- `16` is the sequence length
- `55` is the number of possible characters
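
Where the third dimension comes from can be shown with a plain-Python version of one-hot encoding. The helpers `one_hot` and `one_hot_sequences` are hypothetical names for this sketch and are not part of Keras.

```python
def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

def one_hot_sequences(sequences, num_classes):
    """One-hot encode every integer of every sequence."""
    return [[one_hot(i, num_classes) for i in seq] for seq in sequences]

# One padded sequence of length 16, as produced by the previous step
batch = [[0, 3, 3, 18, 17] + [0] * 11]
encoded = one_hot_sequences(batch, num_classes=55)

shape = (len(encoded), len(encoded[0]), len(encoded[0][0]))
print(shape)  # (1, 16, 55)
```

Each integer becomes a vector of 55 values, so a batch of sequences of length 16 gains one extra axis of size 55.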

Now you have to build a neural network. You can for example use one or two layers of GRU (or LSTM). Do not forget to set `return_sequences=True`. 

Then you will have to add a `TimeDistributed(Dense(55))` layer with a softmax activation function. This layer applies the same dense layer at each time step, giving a softmax prediction of the next character.

In [13]:
# TODO: Build the neural network
### STRIP_START ###
from tensorflow.keras.layers import GRU, Dense, TimeDistributed
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(GRU(32, input_shape=(maxlen, len(all_chars)), return_sequences=True))
model.add(GRU(32, return_sequences=True))
model.add(TimeDistributed(Dense(len(all_chars), activation='softmax')))
### STRIP_END ###
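
The effect of a softmax applied independently at each time step can be illustrated without any neural network. In the sketch below the scores are made up and the vocabulary has only 4 characters; it shows the normalization only, not the dense layer itself.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for 3 time steps over a 4-character vocabulary
scores_per_step = [
    [2.0, 1.0, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 3.0, 0.5, 0.2],
]

# The same function is applied to every time step independently
probs_per_step = [softmax(scores) for scores in scores_per_step]
for probs in probs_per_step:
    print([round(p, 3) for p in probs], 'sum =', round(sum(probs), 3))
```

Each time step therefore gets its own probability distribution over the characters, which is what the `categorical_crossentropy` loss compares with the one-hot `y` at that position.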

Finally, train your model!

In [14]:
# TODO: fit the model
### STRIP_START ###
model.compile(loss='categorical_crossentropy', optimizer='adam')

model.fit(X_train, y_train, batch_size=64, epochs=50)
### STRIP_END ###

Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7feafd7b9e10>

The final step will be to generate names, through a function `generate_names()`. 

To do so, you will have to give the output of the previous time step prediction as input to the next time step.

You will have to use the method `predict_proba` of your model (or `predict` in recent TensorFlow versions, which no longer provide `predict_proba`), as well as the function `numpy.random.choice`.

Finally, use your function to generate some names!

In [15]:
# TODO: implement the function generate_names
### STRIP_START ###
from generate import generate_n_names

generate_n_names(20, maxlen, char_to_idx, model)
### STRIP_END ###

	lida
	ivaert
	gerdadd
	grand
	hada
	pekra
	marim
	sim
	vede
	inceline
	dedwa
	cenka
	groxen
	agn
	luken
	perzs
	klildit
	avsgeron
	ozima
	ahdel


In case this looks too complicated (indeed it is far from being simple), you can use the function `generate_n_names()` in the file `generate.py`. But first have a look at it and try to understand what it does!
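
The general idea of sampling a name character by character can be sketched as follows. This is not the content of `generate.py`: the function `sample_name`, the callable `next_char_probs` and the dummy model are all assumptions made for illustration, and the standard library `random.choices` stands in for `numpy.random.choice`.

```python
import random

def sample_name(next_char_probs, char_to_idx, idx_to_char, maxlen=16, rng=random):
    """Sample one name, feeding each sampled character back as input.

    `next_char_probs` is a hypothetical callable: given the indices
    generated so far, it returns one probability per vocabulary entry.
    """
    indices = [char_to_idx['\t']]  # always start from the start marker
    chars = []
    while len(indices) < maxlen:
        probs = next_char_probs(indices)
        # Draw the next character index according to the probabilities
        idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if idx_to_char[idx] == '\n':  # the end marker stops the name
            break
        chars.append(idx_to_char[idx])
        indices.append(idx)
    return ''.join(chars)

# Toy vocabulary and a dummy "model" standing in for the trained network
toy_chars = ['\t', '\n', 'a', 'b']
toy_char_to_idx = {c: i for i, c in enumerate(toy_chars)}
toy_idx_to_char = {i: c for i, c in enumerate(toy_chars)}

def dummy_probs(indices):
    # Never emits '\t'; '\n', 'a' and 'b' are equally likely
    return [0.0, 1 / 3, 1 / 3, 1 / 3]

rng = random.Random(0)  # fixed seed so that runs are repeatable
toy_names = [sample_name(dummy_probs, toy_char_to_idx, toy_idx_to_char, rng=rng)
             for _ in range(5)]
print(toy_names)
```

With the trained network, `next_char_probs` would presumably pad and one-hot encode the indices, call the model, and return the probabilities predicted at the last filled position.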

If you have more time, you can try to improve the results by tuning your neural network hyperparameters.

You can also use the original file, `Prenoms.csv`, and keep only the names from a given origin, for example to build a more specific model.

**Conclusion**: This method can be applied to almost any sequence data: you can generate music, Shakespeare-style text, lyrics and more. All it takes is to change the data preprocessing and adapt the dimensions.