To generate baby names from scratch, you need a system that can quickly generate short texts. These texts should have a distinctive style and be plausible as names for newborn babies. In this exercise, you'll start preprocessing a dataset of people's names so that it can be used to train such a system.

In [3]:
import pandas as pd
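add = "C:/Users/ANTHONY/Desktop/CSV&XLSX/names.txt"
# Passing names= supplies the column name, so pandas treats the file as having no header row
names_df = pd.read_csv(add, names=["input"])
# Display the DataFrame (pandas shows only the first and last rows)
names_df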

Unnamed: 0,input
0,John
1,William
2,James
3,Charles
4,George
...,...
257995,Carleigh
257996,Iyana
257997,Kenley
257998,Sloane


You'll prepend the start token (a tab, '\t') to each name and update the input column in place. You'll also create a second column, target, in which the end token (a newline, '\n') is appended to each name.

In [4]:
# Prepend a tab (the start token) to every name
names_df['input'] = names_df['input'].apply(lambda x: '\t' + x)

# Append a newline (the end token) to every name
# The input column already starts with a tab, so the name itself starts at index 1
names_df['target'] = names_df['input'].apply(lambda x: x[1:] + '\n')

names_df

Unnamed: 0,input,target
0,\tJohn,John\n
1,\tWilliam,William\n
2,\tJames,James\n
3,\tCharles,Charles\n
4,\tGeorge,George\n
...,...,...
257995,\tCarleigh,Carleigh\n
257996,\tIyana,Iyana\n
257997,\tKenley,Kenley\n
257998,\tSloane,Sloane\n
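
The same two columns can also be built without apply, using pandas' element-wise string concatenation. The following is a minimal sketch on a hypothetical three-name DataFrame (toy_df is an assumption made for illustration, not part of the dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for names_df
toy_df = pd.DataFrame({"input": ["John", "William", "James"]})

# Build the target first (name + end token), then prepend the start token
toy_df["target"] = toy_df["input"] + "\n"
toy_df["input"] = "\t" + toy_df["input"]

print(toy_df["input"].tolist())
print(toy_df["target"].tolist())
```

Building the target before modifying the input column avoids having to slice the tab off again.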


Now you have a DataFrame with two columns: input, where each name has the start token prepended, and target, where each name has the end token appended. The next step is to encode these strings as numeric values, because the model can only work with numeric inputs.

In this exercise, you'll create two dictionaries, char_to_idx and idx_to_char. The first maps characters to integers, e.g., {'\t': 0, '\n': 1, 'A': 2, 'B': 3, ...}, and the second holds the reverse mapping of integers to characters, e.g., {0: '\t', 1: '\n', 2: 'A', 3: 'B', ...}.

In [5]:
# Get vocabulary of Names dataset
def get_vocabulary(names):  
    # Define vocabulary to be set
    all_chars=set()
    
    # Add the start and end token to the vocabulary
    all_chars.add('\t')
    all_chars.add('\n')  
    
    # Iterate over each name
    for name in names:
        # Iterate over each character of the name
        for c in name:
            # Adding to a set is a no-op if the character is already present
            all_chars.add(c)

    # Return the vocabulary
    return all_chars


vocabulary = get_vocabulary(names_df['input'])
vocabulary

{'\t',
 '\n',
 'A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z'}
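
Because a string is itself an iterable of characters, the nested loops above can also be expressed as a single set union. This is only a sketch on hypothetical toy names (toy_names and toy_vocab are illustrative names, not part of the notebook):

```python
# Hypothetical toy names, already carrying the start token
toy_names = ["\tJohn", "\tAnn"]

# Union of the characters of every name, plus the two special tokens
toy_vocab = set().union(*toy_names) | {"\t", "\n"}

print(sorted(toy_vocab))
```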

In [6]:
# Sort the vocabulary so that every character always gets the same index
vocabulary_sorted = sorted(vocabulary)

# Create the mapping of the vocabulary chars to integers
char_to_idx = { char : idx for idx, char in enumerate(vocabulary_sorted) }

# Create the mapping of the integers to vocabulary chars
idx_to_char = { idx : char for idx, char in enumerate(vocabulary_sorted) }

# Print the dictionaries
print(char_to_idx,"\n")
print(idx_to_char)

{'\t': 0, '\n': 1, 'A': 2, 'B': 3, 'C': 4, 'D': 5, 'E': 6, 'F': 7, 'G': 8, 'H': 9, 'I': 10, 'J': 11, 'K': 12, 'L': 13, 'M': 14, 'N': 15, 'O': 16, 'P': 17, 'Q': 18, 'R': 19, 'S': 20, 'T': 21, 'U': 22, 'V': 23, 'W': 24, 'X': 25, 'Y': 26, 'Z': 27, 'a': 28, 'b': 29, 'c': 30, 'd': 31, 'e': 32, 'f': 33, 'g': 34, 'h': 35, 'i': 36, 'j': 37, 'k': 38, 'l': 39, 'm': 40, 'n': 41, 'o': 42, 'p': 43, 'q': 44, 'r': 45, 's': 46, 't': 47, 'u': 48, 'v': 49, 'w': 50, 'x': 51, 'y': 52, 'z': 53} 

{0: '\t', 1: '\n', 2: 'A', 3: 'B', 4: 'C', 5: 'D', 6: 'E', 7: 'F', 8: 'G', 9: 'H', 10: 'I', 11: 'J', 12: 'K', 13: 'L', 14: 'M', 15: 'N', 16: 'O', 17: 'P', 18: 'Q', 19: 'R', 20: 'S', 21: 'T', 22: 'U', 23: 'V', 24: 'W', 25: 'X', 26: 'Y', 27: 'Z', 28: 'a', 29: 'b', 30: 'c', 31: 'd', 32: 'e', 33: 'f', 34: 'g', 35: 'h', 36: 'i', 37: 'j', 38: 'k', 39: 'l', 40: 'm', 41: 'n', 42: 'o', 43: 'p', 44: 'q', 45: 'r', 46: 's', 47: 't', 48: 'u', 49: 'v', 50: 'w', 51: 'x', 52: 'y', 53: 'z'}
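
To see how the two dictionaries work together, here is a small round-trip sketch. It rebuilds the same 54-character vocabulary shown in the output above; encode and decode are hypothetical helper names introduced only for this illustration.

```python
import string

# Rebuild the vocabulary from the output above: start token, end token, A-Z, a-z
vocab_sorted = sorted(
    ["\t", "\n"] + list(string.ascii_uppercase) + list(string.ascii_lowercase)
)
char_to_idx = {char: idx for idx, char in enumerate(vocab_sorted)}
idx_to_char = {idx: char for idx, char in enumerate(vocab_sorted)}


def encode(text):
    """Hypothetical helper: map each character to its integer index."""
    return [char_to_idx[c] for c in text]


def decode(indices):
    """Hypothetical helper: map integer indices back to a string."""
    return "".join(idx_to_char[i] for i in indices)


encoded = encode("\tJohn")
print(encoded)                 # [0, 11, 42, 35, 41]
print(repr(decode(encoded)))   # '\tJohn'
```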


Create input and target tensors
In this exercise, you'll create two tensors to encode the input and the target sequences. Both are three-dimensional. The first dimension is the number of names in the dataset. The second dimension is the number of character positions: every name is treated as a sequence padded to the length of the longest name (the code below allocates max_len+1 positions). The third dimension is the size of the vocabulary, because each character is represented as a one-hot encoded vector. The target tensor has the same shape as the input tensor.

In [7]:
def get_max_len(names):
    """
    Function to return length of the longest name.
    Input: list of names
    Output: length of the longest name
    """

    # Collect the length of every name
    length_list = [len(name) for name in names]

    # Find the maximum length (built-in max, so this cell does not depend on NumPy)
    max_len = max(length_list)

    # Return the maximum length
    return max_len

In [9]:
import numpy as np
# Find the length of longest name
max_len = get_max_len(names_df['input'])

# Initialize the input tensor with zeros: (number of names, positions, vocabulary size)
input_data = np.zeros((len(names_df['input']), max_len+1, len(vocabulary)), dtype='float32')

# Initialize the target tensor with zeros, using the same shape
target_data = np.zeros((len(names_df['input']), max_len+1, len(vocabulary)), dtype='float32')
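
To make the three dimensions concrete, the sketch below allocates the same kind of tensor for a hypothetical three-name dataset and reports its shape and memory footprint. The toy names are assumptions for illustration; the vocabulary size of 54 matches the vocabulary printed earlier.

```python
import numpy as np

# Hypothetical toy names, each already carrying the start token
toy_names = ["\tJo", "\tAnna", "\tLi"]
toy_vocab_size = 54

# Longest name, including its start token
toy_max_len = max(len(name) for name in toy_names)

# Same layout as input_data: (names, positions, vocabulary size)
toy_input = np.zeros(
    (len(toy_names), toy_max_len + 1, toy_vocab_size), dtype="float32"
)

print(toy_input.shape)    # (3, 6, 54)
print(toy_input.nbytes)   # 3 * 6 * 54 * 4 bytes = 3888
```

The memory cost grows as names × positions × vocabulary size × 4 bytes for float32, which is worth keeping in mind for a dataset with several hundred thousand names.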

Initialize input and target vectors with values
In the last exercise, you created the input and target tensors with the appropriate shape, filled with zeros. Now you'll fill them with actual values. Each name occupies one slice along the first dimension, and each character in a name is a one-hot encoded vector whose length is the size of the vocabulary. Positions past the end of a name stay all zeros and act as padding.

The tensors can be filled in as follows: input_data[n_idx, p_idx, char_to_idx[char]] is set to 1 whenever the name at index n_idx in the dataset contains the character char at position p_idx.

The dataset and the character to integer mapping are available in names_df and char_to_idx. The zero tensors from the last exercise are available in input_data and target_data.

In [None]:
# Iterate over each name in the input column
for n_idx, name in enumerate(names_df['input']):
    # Iterate over each character and set its one-hot position to 1
    for p_idx, char in enumerate(name):
        input_data[n_idx, p_idx, char_to_idx[char]] = 1

# Iterate over each name in the target column
for n_idx, name in enumerate(names_df['target']):
    # Iterate over each character and set its one-hot position to 1
    for p_idx, char in enumerate(name):
        target_data[n_idx, p_idx, char_to_idx[char]] = 1
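
A quick way to check the encoding is to decode a one-hot matrix back into text. The sketch below does this for a single name with a hypothetical six-character vocabulary (all toy_ names are assumptions for illustration): rows containing a 1 are real characters, and all-zero rows are padding.

```python
import numpy as np

# Hypothetical mini vocabulary built from one name plus the two tokens
toy_vocab = sorted(set("\t\nJohn"))
toy_char_to_idx = {c: i for i, c in enumerate(toy_vocab)}
toy_idx_to_char = {i: c for c, i in toy_char_to_idx.items()}

name = "\tJohn"

# One extra position is left as all-zero padding
one_hot = np.zeros((len(name) + 1, len(toy_vocab)), dtype="float32")
for p_idx, char in enumerate(name):
    one_hot[p_idx, toy_char_to_idx[char]] = 1

# Recover the text: take the argmax of every non-padding row
decoded = "".join(
    toy_idx_to_char[int(row.argmax())] for row in one_hot if row.sum() > 0
)
print(repr(decoded))   # '\tJohn'
```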

Build and compile RNN network
So far, you have completed all the data preprocessing steps and have the input and target tensors ready. It is time to build the recurrent neural network. You'll create a small architecture: a SimpleRNN layer with 50 units, followed by a dense layer. The dense layer generates a probability distribution over the vocabulary for the next character, so its size is the same as the size of the vocabulary.

The dataset is available as the DataFrame names_df. The length of the longest name is saved in the variable max_len, and the vocabulary is available in the variable vocabulary. The SimpleRNN, Dense and TimeDistributed layers come from keras.layers and the Sequential model comes from keras.models; they must be imported before the model is built, otherwise the cell fails with NameError: name 'Sequential' is not defined.

In [10]:
# Import the model class and the layers used below
from keras.models import Sequential
from keras.layers import SimpleRNN, Dense, TimeDistributed

# Create a Sequential model
model = Sequential()

# Add SimpleRNN layer of 50 units
model.add(SimpleRNN(50, input_shape=(max_len+1, len(vocabulary)), return_sequences=True))

# Add a TimeDistributed Dense layer of size same as the vocabulary
model.add(TimeDistributed(Dense(len(vocabulary), activation='softmax')))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Print the model summary
model.summary()
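
As a sanity check on the size of this architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes the default bias terms in both layers and the 54-character vocabulary printed earlier. Because the recurrent weights and the TimeDistributed dense layer are shared across time steps, the count does not depend on max_len.

```python
vocab_size = 54   # size of the vocabulary shown earlier
rnn_units = 50    # units in the SimpleRNN layer

# SimpleRNN: input weights + recurrent weights + one bias per unit
rnn_params = rnn_units * (vocab_size + rnn_units + 1)

# Dense: one weight per (unit, output) pair + one bias per output
dense_params = rnn_units * vocab_size + vocab_size

print(rnn_params, dense_params, rnn_params + dense_params)   # 5250 2754 8004
```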