<a href="https://colab.research.google.com/github/ZhouyaoXie/cnn_gender_prediction_from_first_name/blob/main/Character_level_CNN_Model_for_Gender_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Character Level CNN Model for Gender Classification

Zhouyao Xie


---



# 0. Overview

The goal is to train a binary classifier that predicts gender from a first name. A literature search led me to two papers on convolutional neural nets for text classification: [Zhang, Zhao, and LeCun](https://arxiv.org/abs/1509.01626), who work at the character level, and [Kim](https://arxiv.org/abs/1408.5882). First names contain almost no semantic or syntactic information, which makes inferring gender from a first name quite different from understanding ordinary words. Since little information is lost by viewing a name as a sequence of characters, a character-level CNN seems appropriate for the task.

The training dataset is the [national name dataset](https://www.ssa.gov/oact/babynames/limits.html) provided by the U.S. Social Security Administration. I used the data from 1950 to 2018. After dropping exact duplicate rows, 2% of the rows were randomly sampled as the test set, and every row whose name appears in the test set was removed from the rest. The remaining 517,490 rows (including ambiguous names) were split at random into 80% training data and 20% validation data.
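
The split can be illustrated on toy data. The sketch below is only an illustration of the idea in plain Python: the rows, the fixed seed, and the 20% test fraction (chosen so the tiny sample yields a test set at all) are assumptions, while the notebook itself uses pandas and scikit-learn with a 2% test fraction.

```python
import random

# Toy rows standing in for the SSA files: (name, gender, frequency).
rows = [
    ("Mary", "F", 100), ("Mary", "F", 90), ("James", "M", 120),
    ("Jordan", "M", 40), ("Jordan", "F", 30), ("Linda", "F", 80),
    ("Robert", "M", 110), ("Taylor", "F", 25), ("Taylor", "M", 20),
    ("Susan", "F", 70),
]

rng = random.Random(0)  # fixed seed: an assumption, for reproducibility only

# 1. Sample a fraction of the rows as the test set (20% here, 2% in the notebook).
test_rows = rng.sample(rows, k=max(1, round(0.2 * len(rows))))
test_names = {name for name, _, _ in test_rows}

# 2. Drop every remaining row whose name appears in the test set.
remaining = [row for row in rows if row[0] not in test_names]

# 3. Split what is left 80/20 into training and validation rows.
rng.shuffle(remaining)
n_valid = round(0.2 * len(remaining))
valid_rows, train_rows = remaining[:n_valid], remaining[n_valid:]

print(len(test_rows), len(train_rows), len(valid_rows))
```

Step 2 is what keeps test names out of the training and validation rows, so the reported test accuracy is measured on names the model has not seen.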

The neural network I implemented below is a slightly modified version of Zhang's design. It consists of one 128-filter and three 64-filter 1-D convolution layers, followed by two fully connected layers and an output layer. Each convolution layer is followed by a max pooling layer with a pool size of 3. I also used dropout between the three dense layers for regularization, and trained the model with the Adam optimizer at a learning rate of 0.0005.

The model attained an accuracy of **86.11%** on the test set.

For the implementation of the character-level CNN, I also referred to [this](https://github.com/mhjabreel/CharCnn_Keras), [this](https://github.com/Irvinglove/char-CNN-text-classification-tensorflow), and [this](https://github.com/BrambleXu/nlp-beginner-guide-keras/blob/master/char-level-cnn/char_cnn.py) GitHub repo.

In [44]:
import tensorflow as tf
import pandas as pd
import re
from sklearn.model_selection import train_test_split
import keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Embedding, Activation
from keras.layers import Conv1D, MaxPooling1D
from keras.layers import Dense, Flatten, Dropout
from keras.models import Model
from sklearn.metrics import confusion_matrix, accuracy_score
import numpy as np
from keras import optimizers
from keras.utils import to_categorical

# I. Data Preprocessing


In [46]:
# import training data: one file per year, 1950 through 2018
data_dir = '/content/drive/MyDrive/name_gender_classification/'
gender_ds = pd.concat(
    [pd.read_csv(data_dir + 'yob' + str(year) + '.txt',
                 header=None, names=['name', 'gender', 'frequency'])
     for year in range(1950, 2019)],
    ignore_index=True)

# drop rows that are identical in name, gender, and frequency
gender_ds.drop_duplicates(inplace = True)

# female = gender_ds.loc[gender_ds['gender']=='F'].name.values
# male = gender_ds.loc[gender_ds['gender']=='M'].name.values
# strictly_female_names = set(female) - set(male)
# gender_ds = gender_ds.loc[~gender_ds['name'].isin(list(strictly_female_names)[:len(female) - len(male)])]
# print(gender_ds.gender.value_counts())

# sample 2% of the rows as the test set,
# then drop every remaining row that shares a name with a test row
test = gender_ds.sample(frac = 0.02)
gender_ds = gender_ds.loc[~gender_ds.name.isin(test.name.values)]

# train-validate split
names_train, names_valid, y_train, y_valid = train_test_split(
        gender_ds['name'], gender_ds['gender'], test_size=0.20)

In [47]:
len(gender_ds)

517490

In [48]:
# Preprocessing
# lower case all texts
names_train = [s.lower() for s in names_train]
names_valid = [s.lower() for s in names_valid]

tk = Tokenizer(num_words=None, char_level=True, oov_token='UNK')
tk.fit_on_texts(names_train)
tk.fit_on_texts(names_valid)

# Index each letter in the alphabet
alphabet = "abcdefghijklmnopqrstuvwxyz"
char_dict = {}
for i, char in enumerate(alphabet):
    char_dict[char] = i + 1
tk.word_index = char_dict.copy()

# Add 'UNK' to the vocabulary
tk.word_index[tk.oov_token] = max(char_dict.values()) + 1

# Convert text to sequences of integers
# (test_texts / test_data hold the validation split, not the held-out test set)
train_sequences = tk.texts_to_sequences(names_train)
test_texts = tk.texts_to_sequences(names_valid)

maxlen = max(gender_ds.name.apply(len))
# Apply padding
train_data = pad_sequences(train_sequences, maxlen=maxlen, padding='post')
test_data = pad_sequences(test_texts, maxlen=maxlen, padding='post')
train_data = np.array(train_data, dtype='float32')
test_data = np.array(test_data, dtype='float32')
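
The encoding above can be reproduced without Keras. The following is a minimal plain-Python sketch of the same idea, not the `Tokenizer` implementation itself; `encode_name` and the constant names are hypothetical helpers introduced for illustration, and the front truncation assumes the `pad_sequences` default of `truncating='pre'`.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # 'a' -> 1 ... 'z' -> 26
UNK_INDEX = len(ALPHABET) + 1  # 27, shared by every character outside a-z
PAD_INDEX = 0                  # index used for padding positions


def encode_name(name, maxlen):
    """Lowercase a name, map characters to indices, then pad or truncate to maxlen."""
    seq = [CHAR_TO_INDEX.get(ch, UNK_INDEX) for ch in name.lower()]
    seq = seq[-maxlen:]  # keep the last maxlen characters if the name is too long
    return seq + [PAD_INDEX] * (maxlen - len(seq))  # pad at the end ('post')


print(encode_name("Ada", 6))     # [1, 4, 1, 0, 0, 0]
print(encode_name("Jo-Ann", 8))  # [10, 15, 27, 1, 14, 14, 0, 0]
```

Index 0 is reserved for padding, which is why the embedding layer later needs `vocab_size + 1` rows.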

In [49]:
# format classes
train_class_list = np.where(y_train=='F',0,1)
test_class_list = np.where(y_valid=='F',0,1)

train_classes = to_categorical(train_class_list)
test_classes = to_categorical(test_class_list)
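
The label convention matters later when predictions are decoded, so here is a small plain-Python sketch of it. `one_hot` and `decode` are hypothetical helpers for illustration; they mirror `np.where(y == 'F', 0, 1)` followed by `to_categorical`, and the `x[0] > x[1]` comparison used in the prediction cell.

```python
LABELS = ["F", "M"]  # index 0 = female, index 1 = male


def one_hot(label):
    """Return [1.0, 0.0] for 'F' and [0.0, 1.0] for 'M'."""
    vec = [0.0] * len(LABELS)
    vec[LABELS.index(label)] = 1.0
    return vec


def decode(scores):
    """Return 'F' only when the first score is strictly larger; ties go to 'M'."""
    return "F" if scores[0] > scores[1] else "M"


print(one_hot("F"), one_hot("M"))
print(decode([0.8, 0.3]), decode([0.2, 0.9]))
```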

# II. Model Training

In [50]:
# CNN Model
# Parameter
input_size = np.shape(train_data)[1] #15
vocab_size = len(tk.word_index) #27
embedding_size = len(tk.word_index) #27
conv_layers = [[128, 3, 3],
               [64, 3, 3],
               [64, 3, 3],
               [64, 3, 3]]
dense_1 = 128
dense_2 = 128
num_of_classes = 2
dropout_p = 0.5
optimizer = optimizers.Adam(learning_rate=.0005)
# optimizer = optimizers.SGD(lr=0.001, clipvalue=0.5)
# optimizer = optimizers.Adagrad(learning_rate = 0.005)
# optimizer = optimizers.Ftrl(learning_rate = 0.001)
# optimizer = optimizers.Adamax()

loss = 'binary_crossentropy'

# # Embedding weights
# embedding_weights = []
# embedding_weights.append(np.zeros(vocab_size))

# # creating one-hot vector for each char
# for char, i in tk.word_index.items(): 
#     onehot = np.zeros(vocab_size)
#     onehot[i - 1] = 1
#     embedding_weights.append(onehot) #(28,27)

# Embedding layer
embedding_layer = Embedding(vocab_size + 1, #28
                            embedding_size, #27
                            input_length=input_size,
                            embeddings_initializer ='random_normal')
# Instantiate keras tensor
inputs = Input(shape=(input_size,), 
               name='input', 
               dtype='int64')
# Embedding
x = embedding_layer(inputs)
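# 1D CNN
# Note: with data_format='channels_first', MaxPooling1D pools over the last
# (filter) axis, e.g. (None, 13, 128) -> (None, 13, 42) in the summary below.
for filter_num, filter_size, pooling_size in conv_layers:
    x = Conv1D(filter_num, filter_size)(x)
    x = Activation('relu')(x)
    x = MaxPooling1D(pool_size=pooling_size,data_format='channels_first')(x)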
x = Flatten()(x)
# Fully connected layers
x = Dense(dense_1, activation='relu')(x)
x = Dropout(dropout_p)(x)
x = Dense(dense_2, activation='sigmoid')(x)
x = Dropout(dropout_p)(x)
# Output Layer
predictions = Dense(num_of_classes, activation='sigmoid')(x)
# Build model
model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
model.summary()

# shuffle the training and validation sets
indices = np.arange(train_data.shape[0])
np.random.shuffle(indices)
x_train = train_data[indices]
y_train = train_classes[indices]

indices = np.arange(test_data.shape[0])
np.random.shuffle(indices)
x_test = test_data[indices]
y_test = test_classes[indices]

# Train
model.fit(x_train, y_train,
          validation_data=(x_test, y_test),
          batch_size=32,
          epochs=12,
          verbose=2,
          #class_weight = {0:0.4, 1:0.6}
          )


Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           [(None, 15)]              0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 15, 27)            756       
_________________________________________________________________
conv1d_6 (Conv1D)            (None, 13, 128)           10496     
_________________________________________________________________
activation_6 (Activation)    (None, 13, 128)           0         
_________________________________________________________________
max_pooling1d_6 (MaxPooling1 (None, 13, 42)            0         
_________________________________________________________________
conv1d_7 (Conv1D)            (None, 11, 64)            8128      
_________________________________________________________________
activation_7 (Activation)    (None, 11, 64)            0   

<tensorflow.python.keras.callbacks.History at 0x7fdd15caa0d0>
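
The shapes and parameter counts in the summary above can be checked by hand. The sketch below is a plain-Python calculation under two assumptions: the convolutions use 'valid' padding with stride 1 (the Keras defaults), and the pooling window slides over the last axis because of `data_format='channels_first'`. The helper names are hypothetical, and the rows past `activation_7` are derived from the code rather than taken from the truncated output.

```python
def conv1d_valid(length, channels, filters, kernel):
    """Output shape and parameter count of a Conv1D layer ('valid' padding, stride 1)."""
    params = kernel * channels * filters + filters  # weights plus biases
    return (length - kernel + 1, filters), params


def pool_over_last_axis(length, channels, pool):
    """Max pooling whose window and stride of size `pool` run over the last axis."""
    return (length, (channels - pool) // pool + 1)


shape = (15, 27)            # (input length, embedding size)
embedding_params = 28 * 27  # (vocab_size + 1) * embedding_size

conv_params = []
for filters, kernel, pool in [(128, 3, 3), (64, 3, 3), (64, 3, 3), (64, 3, 3)]:
    shape, params = conv1d_valid(*shape, filters, kernel)
    conv_params.append(params)
    print("conv:", shape, "params:", params)
    shape = pool_over_last_axis(*shape, pool)
    print("pool:", shape)

flattened = shape[0] * shape[1]
print("embedding params:", embedding_params, "flattened size:", flattened)
```

The first two convolution rows reproduce the 10496 and 8128 parameters reported by `model.summary()`, and the pooling step reproduces the drop from 128 to 42 on the last axis.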

In [51]:
model.save('/content/drive/MyDrive/name_gender_classification/gender_classifier.h5')

# III. Prepare Testset

In [52]:
# load model
model = keras.models.load_model('/content/drive/MyDrive/name_gender_classification/gender_classifier.h5')

# convert an array of first names to the format of NN inputs
def get_input_expr(names, tk):
  names = [s.lower() for s in names]
  sequences = tk.texts_to_sequences(names)
  data = pad_sequences(sequences, maxlen=maxlen, padding='post')
  return np.array(data, dtype='float32')

# x_names = get_input_expr(data.name.values)
x_names = get_input_expr(test.name.values, tk)

# IV. Predict and Evaluate

In [53]:
# predict: column 0 is the score for 'F', column 1 the score for 'M'
prediction = model.predict(x_names)
gender_pred = np.where(prediction[:, 0] > prediction[:, 1], 'F', 'M')
test['pred'] = gender_pred

# check accuracy
print('Accuracy: ', accuracy_score(test.gender.values, test.pred.values))
print('Confusion Matrix: \n',confusion_matrix(test.gender.values, test.pred.values, labels = ['F', 'M']))

Accuracy:  0.8611001187178472
Confusion Matrix: 
 [[9575 1352]
 [1105 5657]]
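
The confusion matrix also gives per-class figures. The sketch below recomputes the accuracy from the four counts printed above and adds recall and precision for each class, assuming the scikit-learn convention that rows are true labels and columns are predicted labels, in the order `['F', 'M']`.

```python
labels = ["F", "M"]
cm = [[9575, 1352],   # true F: predicted F, predicted M
      [1105, 5657]]   # true M: predicted F, predicted M

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Recall: share of each true class that was predicted correctly (row-wise).
recall = {lab: cm[i][i] / sum(cm[i]) for i, lab in enumerate(labels)}
# Precision: share of each predicted class that was correct (column-wise).
precision = {lab: cm[i][i] / (cm[0][i] + cm[1][i]) for i, lab in enumerate(labels)}

print(f"test names: {total}, accuracy: {accuracy:.4f}")
for lab in labels:
    print(f"{lab}: recall {recall[lab]:.4f}, precision {precision[lab]:.4f}")
```

On these counts, recall is higher for female names than for male names, which is consistent with the larger number of female rows in the test set.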
