# Notebook to build a deep learning model to predict Gender from name

This notebook follows these steps:
1. Download the dataset
2. Explore and pre-process the dataset
3. Showcase the encoding (names, character-to-integer encoding, character one-hot encoding)
4. Submit a SageMaker training job


### Step 1: Download the data from https://www.ssa.gov/oact/babynames/names.zip
When you unzip the download, you will find one file per year, with names such as 'yob1880.txt'.
In the file name, 'yob' stands for 'Year of Birth' and is followed by the year,
so each file contains the popular names of babies born in that year.

We will first create a folder called data, then download and unzip the archive into it. We will set the
2016 file aside as test data and concatenate the remaining files into a single file named 'allnames.txt'.

In [1]:
! mkdir data ;  cd data ; wget https://www.ssa.gov/oact/babynames/names.zip ; unzip names.zip 

--2018-03-21 16:26:10--  https://www.ssa.gov/oact/babynames/names.zip
Resolving www.ssa.gov (www.ssa.gov)... 137.200.4.16, 2001:1930:d07::aaaa
Connecting to www.ssa.gov (www.ssa.gov)|137.200.4.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8207194 (7.8M) [application/zip]
Saving to: ‘names.zip’


2018-03-21 16:26:12 (5.09 MB/s) - ‘names.zip’ saved [8207194/8207194]

Archive:  names.zip
  inflating: yob1884.txt             
  inflating: yob1885.txt             
  inflating: yob1886.txt             
  inflating: yob1887.txt             
  inflating: yob1888.txt             
  inflating: yob1889.txt             
  inflating: yob1890.txt             
  inflating: yob1891.txt             
  inflating: yob1892.txt             
  inflating: yob1893.txt             
  inflating: yob1894.txt             
  inflating: yob1895.txt             
  inflating: yob1896.txt             
  inflating: yob1897.txt             
  inflating: yob1898.txt             
  inflating

In [2]:
! mv data/yob2016.txt data/test_data.txt
! cat data/yob* > data/allnames.txt
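
The concatenation above relies on the shell. As a rough Python equivalent, the sketch below joins every file matching `yob*.txt` in a directory. The helper name `concat_year_files` and the two tiny demo files are hypothetical and exist only for illustration; they are not part of the original notebook.

```python
import glob
import os
import tempfile


def concat_year_files(data_dir, out_name="allnames.txt", pattern="yob*.txt"):
    """Concatenate every file matching `pattern` in `data_dir` into `out_name`."""
    out_path = os.path.join(data_dir, out_name)
    # Sort so the years are written in a stable, ascending order.
    paths = sorted(glob.glob(os.path.join(data_dir, pattern)))
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as src:
                out.write(src.read())
    return out_path, len(paths)


# Demonstration on two placeholder files in a temporary directory.
demo_dir = tempfile.mkdtemp()
for year, line in [(1880, "Mary,F,7065\n"), (1881, "Mary,F,6919\n")]:
    with open(os.path.join(demo_dir, "yob%d.txt" % year), "w") as f:
        f.write(line)

combined_path, n_files = concat_year_files(demo_dir)
with open(combined_path) as f:
    combined = f.read()
print(n_files, repr(combined))
```

Because the output file is named 'allnames.txt', it does not match the `yob*.txt` pattern and is never read back into itself on a second run.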

### Step 2: Explore and pre-process the data

In [3]:
import numpy as np
import pandas as pd
from numpy import genfromtxt

filename = 'data/allnames.txt'
df=pd.read_csv(filename, sep=',', names = ["Name", "Gender", "Count"])

Let's look at the size of the data.

In [4]:
df.shape

(1859026, 3)

There are about 1.86 million rows and 3 columns. Now let's see what the data looks like.

In [5]:
df.head(10)

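|   | Name | Gender | Count |
|---|------|--------|-------|
| 0 | Mary | F | 7065 |
| 1 | Anna | F | 2604 |
| 2 | Emma | F | 2003 |
| 3 | Elizabeth | F | 1939 |
| 4 | Minnie | F | 1746 |
| 5 | Margaret | F | 1578 |
| 6 | Ida | F | 1472 |
| 7 | Alice | F | 1414 |
| 8 | Bertha | F | 1320 |
| 9 | Sarah | F | 1288 |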


The dataset has 3 columns: Name, Gender, and Count. Count is the number of times the name was registered with the
United States Social Security Administration in a given year. The names are typical of the United States. Since we
concatenated one file per year of birth, the same name can occur many times. Let us check how many times Mary occurs.

In [6]:
df.loc[df['Name'] == 'Mary'].head(10)

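|   | Name | Gender | Count |
|---|------|--------|-------|
| 0 | Mary | F | 7065 |
| 1273 | Mary | M | 27 |
| 2000 | Mary | F | 6919 |
| 3238 | Mary | M | 29 |
| 3935 | Mary | F | 8148 |
| 5276 | Mary | M | 30 |
| 6062 | Mary | F | 8012 |
| 7407 | Mary | M | 32 |
| 8146 | Mary | F | 9217 |
| 9610 | Mary | M | 36 |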


## Looking at sample data
The name 'Mary' occurs multiple times, and it is also
listed as a male name. The data does not say why, but the counts show
that Mary is far more popular as a female name than as a male name
(7065 versus 27 in the first two rows shown above). So it is not always
possible to determine a person's gender from the name alone.

The second problem is that the name Mary appears multiple times
in the dataset, once per year and gender. We will remove the redundant entries.
Before we do that, we will drop the counts, as
we will not be using them for training.

In [7]:
# We do not need 'Count', so drop it from the dataframe
df = df.drop(['Count'], axis=1)

In [8]:
# Remove duplicate (Name, Gender) rows
df = df.drop_duplicates()

# Check the presence of Mary again (not displayed, since it is not the last expression in the cell)
df.loc[df['Name'] == 'Mary']

# Shuffle the dataset
df = df.sample(frac=1).reset_index(drop=True)
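
To make the effect of these two steps concrete, here is a minimal sketch on a hypothetical five-row frame (the rows are illustrative, not the real dataset). Once Count is gone, `drop_duplicates()` compares (Name, Gender) pairs, so a name recorded under both genders survives as two rows with conflicting labels.

```python
import pandas as pd

# Hypothetical miniature of the concatenated data: names repeat across years.
demo = pd.DataFrame(
    {
        "Name": ["Mary", "Mary", "Mary", "Anna", "Anna"],
        "Gender": ["F", "M", "F", "F", "F"],
        "Count": [7065, 27, 6919, 2604, 2698],
    }
)

# Same two steps as in the notebook: drop Count, then drop duplicate rows.
deduped = demo.drop(["Count"], axis=1).drop_duplicates().reset_index(drop=True)
print(deduped)

# Names that still carry more than one gender label after de-duplication.
labels_per_name = deduped.groupby("Name")["Gender"].nunique()
ambiguous = sorted(labels_per_name[labels_per_name > 1].index)
print(ambiguous)
```

The notebook keeps such ambiguous names in the training data as they are; the `ambiguous` list here is only a way to see how many there are.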

In [9]:
# Find the number of rows we have now. We want a
# reasonable number of rows to train our deep learning model
num_names = df.shape[0]
print ('Number of names in the training dataset', num_names)

Number of names in the training dataset 105431


In [10]:
# Find the longest name
max_name_length = (df['Name'].map(len).max())
print("Longest name:", max_name_length)

Longest name: 15


In [11]:
!mkdir namesdata

In [12]:
df.to_csv('namesdata/train_names.csv',index=False)

In [13]:
test_file = 'data/test_data.txt'
df_test=pd.read_csv(test_file, sep=',', names = ["Name", "Gender", "Count"])
df_test = df_test.drop(['Count'], axis=1)
df_test.to_csv('namesdata/test_names.csv',index=False,header=False)

In [14]:
df_test.shape

(32868, 2)

### Assumption
Beyond this point, the model assumes that names contain only
the 26 letters of the English alphabet. The encoding has to be modified slightly to
use the same model for other languages.
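
One way to check this assumption before encoding is sketched below. The helper `uses_only_english_letters` and the sample names are hypothetical; the notebook itself does not run such a check.

```python
import string

# The 26 lowercase letters the encoding further down can represent.
ALLOWED = set(string.ascii_lowercase)


def uses_only_english_letters(name):
    """Return True if the lowercased name is non-empty and contains only a-z."""
    lowered = name.lower()
    return len(lowered) > 0 and all(ch in ALLOWED for ch in lowered)


# Illustrative names: the last two would break a 26-letter encoding.
sample_names = ["Mary", "Kaben", "José", "Anne-Marie"]
flags = [uses_only_english_letters(n) for n in sample_names]
print(list(zip(sample_names, flags)))
```

Names that fail the check could be filtered out or mapped to a larger alphabet; which option is appropriate depends on the target language.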

# One-hot encoding of characters
We cannot feed character symbols directly into the neural network,
so we convert each name into a one-hot encoded sequence, based on a character-to-integer mapping.

First we encode each character as an integer, and then encode each integer as a one-hot vector.
In one-hot encoding, 'a' is represented as an array with the first column set to 1, 'e' as an array with the fifth column set to 1, and so on.

a => [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

e => [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
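
As a small worked example of the two-stage encoding, the sketch below encodes a single name into a padded array. The function `one_hot_name`, the mapping name `demo_char_to_int`, and the truncation to `max_len` are assumptions made for illustration; the default of 15 matches the longest name found earlier.

```python
import string

import numpy as np

# Character-to-integer mapping: 'a' -> 0, 'b' -> 1, ..., 'z' -> 25.
demo_char_to_int = {ch: i for i, ch in enumerate(string.ascii_lowercase)}


def one_hot_name(name, max_len=15, alphabet_size=26):
    """Encode a name as a (max_len, alphabet_size) array of zeros and ones."""
    encoded = np.zeros((max_len, alphabet_size))
    # Truncating to max_len is an assumption; the notebook sizes the array
    # from the longest name in the training data instead.
    for t, ch in enumerate(name.lower()[:max_len]):
        encoded[t, demo_char_to_int[ch]] = 1
    return encoded


mary = one_hot_name("Mary")
print(mary.shape)                 # one row per character position
print(mary.argmax(axis=1)[:4])    # integer codes of 'm', 'a', 'r', 'y'
print(int(mary.sum()))            # number of encoded characters
```

Rows beyond the end of the name stay all zeros, which is how shorter names are padded to a common length.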


In [15]:
# Define a dictionary to help with the character-to-integer encoding
char_to_int = {'a':0,'b':1,'c':2,'d':3,'e':4,'f':5,'g':6,'h':7,'i':8,'j':9,'k':10,'l':11,'m':12,'n':13,'o':14,'p':15,'q':16,'r':17,'s':18,'t':19,'u':20,'v':21,'w':22,'x':23,'y':24,'z':25}

In [16]:
# X will be the input to the neural network. It is a 3D NumPy array of shape
# (num_names, max_name_length, alphabet_size), initialized with zeros
alphabet_size = 26
names = df['Name'].values
genders = df['Gender']
X = np.zeros((num_names, max_name_length, alphabet_size))

# For each character position in a name, set a 1 in the column that represents the character
for i,name in enumerate(names):
    name = name.lower()
    for t, char in enumerate(name):
        X[i, t,char_to_int[char]] = 1


In [17]:
# Let's look at the first name.
# Every name is encoded as an array of the same size, 15 x 26
# (max_name_length x alphabet_size). For a 4-letter name such as 'Mary',
# only the first 4 rows contain a 1; the rest of the rows are all zeros.

print ('first name is: ', names[0])
X[0,:,:]

first name is:  Kaben


array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,
         0.,  0.,  0.,  0.,  0.,  0.,  0.,  0

In [18]:
# Now let's encode the gender in a NumPy array Y: 0 for female ('F'), 1 otherwise ('M')
Y = np.ones((num_names,1))
Y[df['Gender'] == 'F',0] = 0
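
The target encoding in the cell above can also be written as a single vectorized expression. The sketch below uses three hypothetical labels and, like the cell above, maps 'F' to 0 and every other value to 1.

```python
import pandas as pd

# Hypothetical labels standing in for df['Gender'].
demo_genders = pd.Series(["F", "M", "F"])

# True where the label is not 'F', converted to floats and shaped as a column vector.
demo_Y = (demo_genders != "F").to_numpy(dtype=float).reshape(-1, 1)
print(demo_Y.ravel())
```

Keeping this convention in mind matters later, because a model trained on these targets produces scores that are read relative to class 1 (male).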

### Training job setup
The exercise above was only meant to show how to create the input and target
for the model. We will not train in this notebook instance; instead, we will
submit a training job to SageMaker.

In [19]:
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

In [20]:
inputs = sagemaker_session.upload_data(path='namesdata', key_prefix='namesdata')

INFO:sagemaker:Created S3 bucket: sagemaker-us-east-1-514539217087



In [22]:
from sagemaker.tensorflow import TensorFlow

gender_estimator = TensorFlow(entry_point='highlevel-tensorflow-helper.py',
                               role=role,
                               training_steps= 4000,                                  
                               evaluation_steps= 10,
                               hyperparameters={'learning_rate': 0.01},
                               train_instance_count=1,
                               train_instance_type='ml.p2.xlarge',
                               base_job_name='tf-names')

gender_estimator.fit(inputs, run_tensorboard_locally=True)

INFO:sagemaker:Created S3 bucket: sagemaker-us-east-1-514539217087
INFO:sagemaker:Creating training-job with name: tf-names-2018-03-21-17-49-32-325


.

INFO:sagemaker:TensorBoard 0.1.7 at http://localhost:6006


.................................................................................................
executing startup script (first run)
2018-03-21 17:57:33,589 INFO - root - running container entrypoint
2018-03-21 17:57:33,589 INFO - root - starting train task
2018-03-21 17:57:36,577 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTP connection (1): 169.254.170.2
2018-03-21 17:57:37,891 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): s3.amazonaws.com
2018-03-21 17:57:38,013 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): s3.amazonaws.com
INFO:tensorflow:----------------------TF_CONFIG--------------------------
INFO:tensorflow:{"environment": "cloud", "cluster": {"master": ["algo-1:2222"]}, "task": {"index": 0, "type": "master"}}
INFO:tensorflow:------------------------

INFO:tensorflow:global_step/sec: 14.9729
INFO:tensorflow:loss = 0.69034934, step = 2801 (6.679 sec)
INFO:tensorflow:global_step/sec: 15.2957
INFO:tensorflow:loss = 0.67213434, step = 2901 (6.538 sec)
INFO:tensorflow:global_step/sec: 14.3294
INFO:tensorflow:loss = 0.66357255, step = 3001 (6.979 sec)
INFO:tensorflow:global_step/sec: 15.3856
INFO:tensorflow:loss = 0.65175617, step = 3101 (6.500 sec)
INFO:tensorflow:global_step/sec: 15.4313
INFO:tensorflow:loss = 0.59832513, step = 3201 (6.480 sec)
INFO:tensorflow:global_step/sec: 15.3885
INFO:tensorflow:loss = 0.5295191, step = 3301 (6.498 sec)
INFO:tensorflow:global_step/sec: 15.3998
INFO:tensorflow:loss = 0.555522, step = 3401 (6.493 sec)
INFO:tensorflow:global_step/sec: 15.4156
INFO:tensorflow:loss = 0.58799, step = 3501 (6.875 sec)
INFO:tensorflow:global_step/sec: 14.545
INFO:ten

In [None]:
#To load a preexisting model

#from sagemaker.tensorflow import TensorFlowPredictor
#predictor = TensorFlowPredictor('sagemaker-tensorflow-py2-cpu-2018-03-09-19-27-43-438')

In [123]:
gender_predictor = gender_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

INFO:sagemaker:Creating model with name: tensorboard-names-2018-03-20-22-40-47-154
INFO:sagemaker:Creating endpoint with name tensorboard-names-2018-03-20-22-40-47-154


--------------------------------------------------------------------------------------------------------------!

In [118]:
sagemaker.Session().delete_endpoint(gender_predictor.endpoint)

INFO:sagemaker:Deleting endpoint with name: tensorboard-names-2018-03-20-19-18-18-841


In [168]:
import json

data = {}
data['name'] = 'pratap'
json_obj = json.loads('{"names": {"name1":"pratap","name2":"swetha"}}')
json_data = json.dumps(data)
print (json_obj['names'])

{'name1': 'pratap', 'name2': 'swetha'}


In [27]:
!rm -f output.json
!aws sagemaker-runtime invoke-endpoint --endpoint-name tensorboard-names-2018-03-20-22-40-47-154 --body '{"name":"swetha"}' --content-type "application/json" output.json
! cat output.json

{
    "ContentType": "*/*",
    "InvokedProductionVariant": "AllTraffic"
}
{
  "outputs": {
    "Gender": {
      "dtype": "DT_FLOAT", 
      "floatVal": [
        0.8597514033317566
      ], 
      "tensorShape": {
        "dim": [
          {
            "size": "1"
          }
        ]
      }
    }
  }
}
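
The response body above can be unpacked with a few lines of Python. The sketch below parses a copy of that body and turns the score into a label. It assumes the score is relative to class 1 (male, per the Y encoding) and uses a 0.5 threshold; both are assumptions, since the training script is not shown here, and the helper names `gender_score` and `score_to_label` are hypothetical.

```python
import json

# Copy of the response body printed by `cat output.json` above.
response_text = """
{
  "outputs": {
    "Gender": {
      "dtype": "DT_FLOAT",
      "floatVal": [0.8597514033317566],
      "tensorShape": {"dim": [{"size": "1"}]}
    }
  }
}
"""


def gender_score(response):
    """Pull the single float score out of the parsed response."""
    return float(response["outputs"]["Gender"]["floatVal"][0])


def score_to_label(score, threshold=0.5):
    """Map a score to 'M' or 'F'. The 0.5 threshold is an assumption."""
    return "M" if score >= threshold else "F"


score = gender_score(json.loads(response_text))
label = score_to_label(score)
print(score, label)
```

If the model's output layer turns out to be defined differently, only `score_to_label` needs to change.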

In [None]:
from sagemaker.tensorflow import TensorFlowPredictor
predictor = TensorFlowPredictor('tensorflowgendermodel571')
sagemaker.Session().delete_endpoint(predictor.endpoint)
