# Gender Classification by Name
In this project, I will train a recurrent neural network to classify a given first name as either male or female.

In [1]:
import numpy as np
import pandas as pd
import sklearn.utils
import random
import tensorflow as tf
from keras.models import load_model, Model, Sequential
from keras.layers import Dense, Activation, Dropout, LSTM, Reshape, Bidirectional
from keras.initializers import glorot_uniform
from keras.utils import to_categorical
from keras.optimizers import Adam
from keras import backend as K

Using TensorFlow backend.


## Dataset and Preprocessing

We will first address the issue of unisex names. We read the file and build a dictionary that maps each name to its gender and the number of people who were assigned that name and gender at birth. If a name is unisex, it is assigned the gender most commonly given to people with that name.

In [0]:
# Each line of name_data.txt is assumed to have the form "Name,Gender,Count"
with open('name_data.txt', 'r') as file:
    dataList = file.readlines()

name_dict = {}

for name_data in dataList:
    # Strip the trailing newline before splitting, and avoid shadowing the built-in `list`
    fields = name_data.strip().split(',')
    name = fields[0].lower()
    gender = fields[1]
    data_num_people = int(fields[2])

    # If the name is unisex, keep whichever gender has the larger count
    if name in name_dict and name_dict[name][1] > data_num_people:
        continue
    name_dict[name] = [gender, data_num_people]
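
To illustrate the rule for unisex names, here is a minimal sketch that applies the same logic to a few made-up records. The names and counts are hypothetical and are not taken from `name_data.txt`; the "Name,Gender,Count" line format is an assumption based on how the cell above parses each line.

```python
# Hypothetical records in the assumed "Name,Gender,Count" format
sample_lines = ["Alex,M,900\n", "Alex,F,300\n", "Emma,F,1200\n", "Emma,M,5\n"]

sample_dict = {}
for line in sample_lines:
    name, gender, count = line.strip().split(',')
    name, count = name.lower(), int(count)
    # Skip this record if the gender already stored for the name has a larger count
    if name in sample_dict and sample_dict[name][1] > count:
        continue
    sample_dict[name] = [gender, count]

print(sample_dict)
```

Each name ends up with a single gender label, which is what lets us treat the task as ordinary binary classification.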

We will continue our preprocessing by creating a pandas DataFrame from our dictionary of names. We will then shuffle the rows of the DataFrame. Lastly, we will assign the names to our 'X' array and the genders to our 'Y' array.

In [0]:
name_data = pd.DataFrame.from_dict(data = name_dict, orient = 'index' )
name_data = sklearn.utils.shuffle(name_data)
X_string = name_data.index
Y_char = name_data[0].values

In the cell below, we create a Python dictionary (i.e., a hash table) that maps each index from 0 to 26 to a character: index 0 is the newline character, which also serves as the padding index, and indices 1-26 are the letters a-z. We also create a second Python dictionary that maps each character back to its index. These mappings let us convert each name into a sequence of indices that can then be one-hot encoded.

In [4]:
index_to_char = {0: '\n', 1: 'a', 2: 'b', 3: 'c', 4: 'd', 
                 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 
                 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 
                 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 
                 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 
                 25: 'y', 26: 'z'}
char_to_index = {v: k for k, v in index_to_char.items()}
print(char_to_index)

{'\n': 0, 'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26}
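
As an aside, the same two mappings could be built programmatically rather than typed out by hand. The sketch below is one way to do it; the `_alt` names are my own and are used only to avoid clashing with the dictionaries above.

```python
import string

# Index 0 is reserved for '\n'; the letters a-z take indices 1-26
index_to_char_alt = dict(enumerate('\n' + string.ascii_lowercase))
char_to_index_alt = {ch: i for i, ch in index_to_char_alt.items()}

# Round trip: encode a name to indices, then decode it back
encoded = [char_to_index_alt[ch] for ch in 'emma']
decoded = ''.join(index_to_char_alt[i] for i in encoded)
print(encoded, decoded)
```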


We will now convert each name into a list of indices and pad that list with zeros up to the maximum name length. After this, we will create a one-hot representation of each index, so that each name becomes a matrix of shape (max_name_length, 27) whose rows are one-hot character vectors. Finally, we will convert Y to a binary array where 'M' maps to 0 and 'F' maps to 1.

In [0]:
max_name_length = len(max(X_string, key = len))

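def convert_name_to_list_of_indices(name):
    # Map each character to its index (avoids shadowing the built-in `list`)
    indices = [char_to_index[char] for char in name]
    # Pad with zeros up to the length of the longest name
    while len(indices) < max_name_length:
        indices.append(0)
    return np.asarray(indices)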

X_index = np.asarray([convert_name_to_list_of_indices(name) for name in X_string])

X_one_hot = tf.one_hot(indices = X_index, depth = 27)

Y_binary = (Y_char == "F").astype(int)
Y_one_hot = tf.one_hot(indices = Y_binary, depth = 2, dtype = 'float32')
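
To make the shape of the encoded data concrete, here is a NumPy-only sketch of the same padding and one-hot steps for a single name. It is an illustration rather than the code used for training: `one_hot_name` is a hypothetical helper, it assumes the name contains only the lowercase letters a-z, and it uses a small `max_len` for readability.

```python
import numpy as np

def one_hot_name(name, max_len, depth=27):
    """Return a (max_len, depth) one-hot matrix for a lowercase a-z name."""
    # 'a' -> 1, ..., 'z' -> 26; index 0 is used for padding
    indices = [ord(ch) - ord('a') + 1 for ch in name]
    indices += [0] * (max_len - len(indices))
    # Row i of the identity matrix is the one-hot vector for index i
    return np.eye(depth, dtype=np.float32)[indices]

matrix = one_hot_name('jeff', max_len=6)
print(matrix.shape)            # one row per character position
print(matrix.argmax(axis=1))   # recovers the padded index sequence
```

Note that padding positions are not all-zero rows: they are one-hot vectors for index 0, which matches what `tf.one_hot` produces for a zero index.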

## Learning Model
We will now move on to creating our neural network architecture using Keras with the TensorFlow backend. For this project I have decided to stack two bidirectional LSTM layers with dropout regularization, followed by a dense layer with a softmax activation that outputs a probability for each of the two classes.

In [0]:
model = Sequential()
model.add(Bidirectional(LSTM(512, return_sequences=True), input_shape=(max_name_length, 27)))
model.add(Dropout(rate = 0.2))
model.add(Bidirectional(LSTM(512, return_sequences=False)))
model.add(Dropout(rate = 0.2))
model.add(Dense(2))
model.add(Activation('softmax'))
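
As a rough sanity check on the size of this architecture, the sketch below counts trainable parameters by hand. It uses the standard LSTM count of four gates, each with an input kernel, a recurrent kernel, and a bias. The helper names are my own, and the totals should agree with `model.summary()` only under the assumption that the layers keep their default settings (biases enabled).

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm_params(input_dim, units):
    # A bidirectional wrapper holds one forward and one backward LSTM
    return 2 * lstm_params(input_dim, units)

layer1 = bidirectional_lstm_params(27, 512)        # input: one-hot characters
layer2 = bidirectional_lstm_params(2 * 512, 512)   # input: concatenated forward/backward outputs
dense = (2 * 512) * 2 + 2                          # weights plus biases of Dense(2)
print(layer1, layer2, dense, layer1 + layer2 + dense)
```

Most of the parameters sit in the second recurrent layer, because its input is the 1024-dimensional concatenated output of the first layer rather than a 27-dimensional one-hot vector.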

We will use Adam optimization, categorical cross-entropy loss, and an accuracy metric. We will also clip each gradient component to the range [-10, 10] to avoid the exploding gradients that can occur with this character-level RNN.

In [0]:
opt = Adam(clipvalue = 10)
# Pass the configured optimizer object; the string 'adam' would ignore clipvalue
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
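
For intuition, clipping by value with a threshold of 10 limits each gradient component independently to the range [-10, 10]. The sketch below shows the equivalent operation in NumPy on made-up gradient values; it illustrates the idea and is not the optimizer's internal code.

```python
import numpy as np

clip = 10.0
grad = np.array([-25.0, -3.0, 0.5, 12.0])   # made-up gradient values
clipped = np.clip(grad, -clip, clip)        # element-wise clipping to [-10, 10]
print(clipped)
```

Because each component is clipped separately, this can change the direction of the gradient vector, unlike clipping by norm, which rescales the whole vector.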

We will now split our name and gender sets into training, development, and test sets using an 80/10/10 split. We first evaluate the X and Y tensors in a session to obtain NumPy arrays that can be passed to our model.

In [0]:
length = X_one_hot.get_shape().as_list()[0]
tenth = length//10

sess = tf.Session()
X = sess.run(X_one_hot)
Y = sess.run(Y_one_hot)

X_train = X[:tenth*8]
X_dev = X[tenth*8:tenth*9]
X_test = X[tenth*9:length]

Y_train = Y[:tenth*8]
Y_dev = Y[tenth*8:tenth*9]
Y_test = Y[tenth*9:length]
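
The slicing above can be wrapped in a small helper, sketched here on a stand-in array. `split_80_10_10` is a hypothetical name, and the 105-element array is only there to show what happens when the number of rows is not a multiple of ten.

```python
import numpy as np

def split_80_10_10(array):
    # Same slicing as above: the test set absorbs any remainder rows
    tenth = len(array) // 10
    return array[:tenth * 8], array[tenth * 8:tenth * 9], array[tenth * 9:]

data = np.arange(105)   # stand-in for the real X or Y array
train, dev, test = split_80_10_10(data)
print(len(train), len(dev), len(test))
```

Since the rows were shuffled before this step, taking contiguous slices still gives a random split.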

We are now ready to train our model.

In [91]:
model.fit(X_train, Y_train, epochs=5, batch_size = 32, validation_data=(X_dev, Y_dev))
model.save_weights('gender_model',overwrite=True)

Train on 23928 samples, validate on 2991 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


## Creating Predictions based on Model
We will now take the steps necessary to classify new names using our prediction model. We will start by displaying the score and accuracy of our model on the development set.

In [92]:
score, acc = model.evaluate(X_dev, Y_dev)
print('Dev score:', score)
print('Dev accuracy:', acc)

Dev score: 0.3063549758179536
Dev accuracy: 0.8990304246470108


We will now run our model on our test set.

In [93]:
score, acc = model.evaluate(X_test, Y_test)
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 0.3137703503664821
Test accuracy: 0.8913406887926493


We will now use the model to make predictions on new names.


In [121]:
def create_one_hot_matrix(name):
    # Encode a single name as a (max_name_length, 27) one-hot matrix
    list_indices = convert_name_to_list_of_indices(name)
    one_hot_matrix = tf.one_hot(indices = list_indices, depth = 27)
    # Use a context manager so the session is closed after each call
    with tf.Session() as sess:
        return sess.run(one_hot_matrix)

list_of_names = ['derrick', 'alexis', 'brittany', 'sierra', 'jeff', 'emma', 'kamara']

prediction_input = []
for name in list_of_names:
    one_hot_matrix = create_one_hot_matrix(name)
    prediction_input.append(one_hot_matrix)

pred = model.predict(np.asarray(prediction_input))
# Column 0 of each prediction is P(M), since 'M' maps to class 0
predicted_genders = ["F" if i[0] < 0.5 else "M" for i in pred]
print(predicted_genders)

['M', 'F', 'F', 'F', 'M', 'F', 'F']
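
An alternative way to turn the softmax outputs into labels is to take the argmax of each row, which also makes it easy to report the model's confidence. The sketch below uses made-up probabilities in place of `pred`, so the values are illustrative only.

```python
import numpy as np

# Made-up softmax outputs; column 0 is P(M) and column 1 is P(F)
example_pred = np.array([[0.9, 0.1], [0.2, 0.8], [0.45, 0.55]])
labels = np.array(['M', 'F'])

predicted = labels[example_pred.argmax(axis=1)]   # most probable class per row
confidence = example_pred.max(axis=1)             # probability of that class
for label, p in zip(predicted, confidence):
    print(label, round(float(p), 2))
```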
