### !!! Important Note: This model is trained on full names instead of first names

## Training a multi-class logistic regression model to predict the gender of Chinese names

To train a logistic regression, the names and genders in the train set are converted into word vectors and numerical values respectively. 

As there are three gender classes in the dataset, namely M (male), F (female), and U (undefined), the logistic regression model is a multi-class one. In other words, to enable the model to distinguish among these three classes, we train three separate binary logistic regression classifiers (one-vs-rest), each of which evaluates the probability of one gender class against the other two. During training, M, F and U are represented by 0, 1 and 2 respectively. 
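
As a minimal sketch of the one-vs-rest setup described above (the label array here is made up for illustration), each integer label can be turned into three binary targets, one per classifier:

```python
import numpy as np

# Integer labels as used in this notebook: 0 = M, 1 = F, 2 = U.
# These five labels are hypothetical examples.
labels = np.array([0, 1, 2, 1, 0])

# One binary target vector per class: 1 where the example belongs to the
# class, 0 otherwise. Column k is the target for the k-th classifier.
ovr_targets = np.stack([(labels == k).astype(int) for k in range(3)], axis=1)
print(ovr_targets)
```

Each row contains exactly one 1, so every example is a positive case for one classifier and a negative case for the other two.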

As for the Chinese names, they are converted into word vectors using a one-hot encoding approach. That is, the total number of unique characters seen in the training set is first counted so that the dimension of the word vector can be determined. For example, suppose we have 3 unique characters in total, say `'春'`, `'杰'`, `'乐'`; we can use `(1, 0, 0)`, `(0, 1, 0)` and `(0, 0, 1)` respectively to represent these three characters. To represent `'春杰'`, we combine the two vectors (vector addition, with each entry capped at 1 in the code below) and use `(1, 1, 0)`. Furthermore, to be able to represent any character, seen or unseen, we add an extra dimension so that the model can deal with names containing unseen characters. Continuing the example above, we then use `(1, 0, 0, 0)` to represent `'春'`, so that when `'春'` comes with an unseen character, say `A`, we can use `(1, 0, 0, 1)` to represent `'春A'`. However, to keep the model simple, no matter how many unseen characters there are, they are all represented by a single 1 in the additional dimension of the one-hot word vector. Therefore, `'春ABCDEFGAAAAAA'` is still `(1, 0, 0, 1)`.
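
The toy example above can be sketched in a few lines. This is only an illustration: `vocab`, `UNSEEN` and `encode` are hypothetical names, and unlike the notebook's own `name2vec` further down, this sketch does not add the bias column $x_0 = 1$.

```python
import numpy as np

# Toy vocabulary from the example above; index 3 is the shared slot
# reserved for every unseen character.
vocab = {'春': 0, '杰': 1, '乐': 2}
UNSEEN = len(vocab)

def encode(name):
    vec = np.zeros(len(vocab) + 1, dtype=int)
    for char in name:
        # Set the entry to 1 rather than adding, so repeats stay at 1.
        vec[vocab.get(char, UNSEEN)] = 1
    return vec

print(encode('春杰'))              # [1 1 0 0]
print(encode('春A'))               # [1 0 0 1]
print(encode('春ABCDEFGAAAAAA'))   # [1 0 0 1]
```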

## Overview

- [Data Loading and Preprocessing](#data)
    - [Data Loading](#loading)
    - [Preprocessing](#prep)
- [Model Training](#model)
- [Testing](#test)
    - [Testing against the train/dev sets](#train_dev)
    - [Testing against the test set](#test_set)
- [Test on first names](#first_name)
    - [Testing against the train/dev sets](#train_dev2)
    - [Testing against the test set](#test_set2)

<a name='data'></a>
## Data Loading and Preprocessing

It turns out that the 633857 names in the original train set, split previously (see <ins>train_dev_test_split.ipynb</ins> within this directory), contain over 6500 unique characters, which means that the word vector for any given name has over 6500 dimensions. For the entire train set, the names as a matrix would therefore have at least 6500 by 633857 entries, which is overwhelmingly large to train on because of the enormous amount of computation involved!

However, it also turns out that the performance of logistic regression in predicting the gender of Chinese names soon plateaus after being fed some ten thousand examples. Therefore, without changing the algorithm, for example by adding extra hidden layers (i.e., neural networks), more examples do not make any worthwhile difference. As a tradeoff, I only selected the first 100,000 examples from the original train set here. The rest of the original train set is treated as an additional dev set to test the performance of the model on unseen names. 
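
To give a rough sense of scale, here is a back-of-the-envelope estimate, assuming the names are stored as a dense `float64` matrix (8 bytes per entry, which is what `np.zeros` produces by default). The helper name `dense_gigabytes` is made up for this sketch, and 6500 is used as a round figure for the number of dimensions.

```python
# Back-of-the-envelope memory estimate for a dense float64 matrix.
BYTES_PER_FLOAT64 = 8

def dense_gigabytes(n_rows, n_cols):
    return n_rows * n_cols * BYTES_PER_FLOAT64 / 1e9

# Full train set: 633857 names, roughly 6500 character dimensions.
print(round(dense_gigabytes(633857, 6500), 1))   # 33.0
# First 100000 names only, same dimension for comparison.
print(round(dense_gigabytes(100000, 6500), 1))   # 5.2
```

The subset of 100,000 names is smaller still in practice, since it contains fewer unique characters than the full train set.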

**Note**: With more hidden layers between the input names and the output genders, one-hot word vectors can be even more expensive to train. 

<a name='loading'></a>
### Data Loading 

In [1]:
from utils import *

# data_loader (from utils) will return [[name, gender], [name, gender]...]
train_ds_all = data_loader('data/train_ds.txt')
# the model will only be trained on train_ds (the first 100000 examples of the entire train set)
train_ds, train_ds_dev = train_ds_all[:100000], train_ds_all[100000:]
assert(len(train_ds_all) == len(train_ds) + len(train_ds_dev))

In [2]:
# check the first 5 examples
for i in range(5):
    print(train_ds[i])

['阎莹暂', 'F']
['吕荣辉', 'M']
['曾泽彬', 'M']
['董二庄', 'M']
['华治权', 'M']


<a name='prep'></a>
### Preprocessing 

In [3]:
from collections import defaultdict


# make the char_dict that will be useful in converting name vectors 
def char_dict(ds):
    '''Returns a char dict that stores the indices for the seen
    characters in the selected train set and returns a constant index
    for all unseen characters.
    
    Param:
        ds: dataset --> [list, list,...] where list=[name, gender]
    '''
    dic = {}
    idx = 1
    for item in ds:
        for char in item[0]:
            if char not in dic:
                dic[char] = idx
                idx += 1
                
    # the size of the dic = idx, which also includes the one index 
    # reserved for all unseen characters
    dic['size'] = idx
    dic = defaultdict(lambda :idx, dic)
    return dic

In [4]:
# checking the char dic
dic = char_dict(train_ds)
assert(dic['size'] == len(dic))
dic['size'], dic

(3918,
 defaultdict(<function __main__.char_dict.<locals>.<lambda>()>,
             {'阎': 1,
              '莹': 2,
              '暂': 3,
              '吕': 4,
              '荣': 5,
              '辉': 6,
              '曾': 7,
              '泽': 8,
              '彬': 9,
              '董': 10,
              '二': 11,
              '庄': 12,
              '华': 13,
              '治': 14,
              '权': 15,
              '顾': 16,
              '哧': 17,
              '天': 18,
              '彦': 19,
              '成': 20,
              '乔': 21,
              '其': 22,
              '荷': 23,
              '温': 24,
              '志': 25,
              '峰': 26,
              '王': 27,
              '诚': 28,
              '贾': 29,
              '晔': 30,
              '秋': 31,
              '紫': 32,
              '俊': 33,
              '景': 34,
              '絮': 35,
              '雨': 36,
              '黄': 37,
              '洪': 38,
              '君': 39,
              '戴': 40,
              '世':

In [5]:
# save the char dic for ease of reusing them later on 
import json


with open('data/char_dic.json', 'w') as f:
    json.dump(dic, f)
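
One thing to keep in mind when reusing the saved file: JSON stores only the key-value pairs, so the `defaultdict` fallback for unseen characters is lost on reload. The sketch below uses a small hypothetical dict and an in-memory round trip (`json.dumps`/`json.loads`) instead of the real file, and shows one way to restore the fallback from the stored `'size'` entry.

```python
import json
from collections import defaultdict

# Toy stand-in for the char dict built above (hypothetical values).
toy = defaultdict(lambda: 3, {'春': 1, '杰': 2, 'size': 3})

# Round trip through JSON in memory; loading the saved file with
# json.load would give the same kind of object.
restored = json.loads(json.dumps(toy))
print(type(restored).__name__)   # dict
print('乐' in restored)          # False

# Re-wrap to get the constant index for unseen characters back.
unseen_idx = restored['size']
restored = defaultdict(lambda: unseen_idx, restored)
print(restored['乐'])            # 3
```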

**Converting the names and genders into vectors.**

In [6]:
import numpy as np


def name2vec(name, char_dic, add_one=True):
    '''Converts a given name into a vector with one-hot encoding. 
    
    Params:
        name: str
        char_dic: dict
            this dict contains the indices for chars in the selected train set that 
            can be used as a one-hot encoder
        add_one: bool, currently unused (x0 = 1 is always added)
    Returns:
        name_vec: array-like (dim=(1, n+1)) where n = char_dic['size']
            (num of seen chars plus one slot for unseen chars);
            note that the first column of this row vector is equal to 1 (i.e., x0=1)
    '''
    name_vec = np.zeros((1, char_dic['size']+1))
    # x0 = 1
    name_vec[0,0] = 1
    for char in name:
        name_vec[0, char_dic[char]] = 1
    return name_vec


def convert_example(ds, char_dic):
    '''Converts the dataset and returns both names and gender as vectors.
    '''
    # m = num of examples, n = num of dimensions
    m, n = len(ds), char_dic['size']
    name_vec = np.zeros((m, n+1))
    gender_vec = np.zeros((m, 1))
    for i in range(m):
        name_vec[i] = name2vec(ds[i][0], char_dic)
        if ds[i][1] == 'F': gender_vec[i] = 1
        elif ds[i][1] == 'U': gender_vec[i] = 2
    
    return name_vec, gender_vec

In [7]:
# Converting the dataset 
name_vec, gender_vec = convert_example(train_ds, dic)
assert(len(name_vec) == len(gender_vec))
print(f'Training sample size: {len(name_vec)}')

Training sample size: 100000


<a name='model'></a>
## Model Training

To train the logistic regression model, we first define the **sigmoid function** as follows:

$$\mathrm{sigmoid}(z) = \frac{1}{1+e^{-z}} \quad \text{where } z \text{ is a scalar or } z(x) = \theta^T x$$

where $\theta^T x = \theta_0 x_0 + \theta_1 x_1 + ... + \theta_n x_n = \sum_{i=0}^{n}{\theta_i x_i}$ and $x_0 = 1$. Since the $X$ in our model contains only $0$ or $1$, separating $\theta$ into $w$ and $b$ is less useful here.


Then the **cost function** for the sigmoid is defined as follows (cross-entropy) without regularization:

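$$ J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log \left( h_\theta \left( x^{(i)} \right) \right) + \left( 1 - y^{(i)} \right) \log \left( 1 - h_\theta \left( x^{(i)} \right) \right) \right] $$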

And the **gradient** with respect to each $\theta_j$ (partial derivative) can be written as follows, without regularization:

$$ \frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta \left( x^{(i)} \right) - y^{(i)} \right) x_j^{(i)} $$

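We do not add regularization terms to the cost and gradient functions here. Note that `scipy.optimize.fmin_tnc` only minimizes the function it is given and does not add regularization on its own, so the model below is trained without regularization. 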
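
A quick way to convince oneself that the gradient formula matches the cost function is a finite-difference check. The sketch below re-implements both formulas on a small random problem (the data is made up, and the function names are local to this example) and compares the analytic gradient with a central-difference approximation.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# Small random problem (hypothetical data), with x0 = 1 in the first column.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.integers(0, 2, size=(20, 4))])
y = rng.integers(0, 2, size=20)
theta = rng.normal(scale=0.1, size=5)

# Central finite differences approximate each partial derivative.
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(5)
])
max_diff = np.max(np.abs(numeric - gradient(theta, X, y)))
print(max_diff)
```

If the two agree, `max_diff` should be tiny compared with the size of the gradient entries themselves.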

In [8]:
from scipy import optimize


def sigmoid(z):
    return 1/(1+np.exp(-z))


def costFunc(theta, X, y):
    '''
    Params:
        theta: array-like
        X: array-like
            X should already have a first column filled with ones.
        y: array-like
    Returns:
        float
    '''
    y_pred = sigmoid(X @ theta)
    J = np.sum(np.multiply(y, np.log(y_pred)) + np.multiply((1-y), np.log(1 - y_pred)))
    return -J/len(y)


def gradient(theta, X, y):
    '''
    Params:
        theta: array-like
        X: array-like
            X should already have a first column filled with ones.
        y: array-like
            y should be a flat (1-D) array, as passed by optimized_theta. 
    Returns:
        array-like
    '''
    return X.T @ (sigmoid(X @ theta) - y) / len(y)


def optimized_theta(X, y, theta):
    
    opt_theta = optimize.fmin_tnc(func = costFunc,
                                  x0 = theta, 
                                  fprime = gradient,
                                  args = (X, y.flatten()))
    return opt_theta[0]
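
When the sigmoid saturates to exactly 0 or 1 in floating point, `np.log` in a cost function of this form evaluates `log(0)`, which typically triggers runtime warnings and can yield `nan`. One possible alternative, shown here only as a sketch and not as the notebook's method, rewrites the cross-entropy as $\log(1 + e^{z}) - y z$ and evaluates it with `np.logaddexp`. The names `stable_cost` and `naive_cost` and the tiny dataset are assumptions for illustration.

```python
import numpy as np

def stable_cost(theta, X, y):
    # -[y*log(s(z)) + (1-y)*log(1-s(z))] simplifies to log(1 + e^z) - y*z,
    # and np.logaddexp(0, z) computes log(1 + e^z) without overflow.
    z = X @ theta
    return np.mean(np.logaddexp(0, z) - y * z)

def naive_cost(theta, X, y):
    h = 1 / (1 + np.exp(-(X @ theta)))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]])
y = np.array([1, 0, 1])

# For moderate parameters the two formulations agree.
small = np.array([0.1, -0.2, 0.3])
print(np.isclose(stable_cost(small, X, y), naive_cost(small, X, y)))  # True

# Large parameters saturate the sigmoid; the stable form stays finite.
large = np.array([0.0, 50.0, -50.0])
print(np.isfinite(stable_cost(large, X, y)))  # True
```

The gradient formula is unaffected by this rewrite, so only the cost function would need to change.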

In [9]:
# the three sets of theta values used to train the three LR classifiers  
# To save time, you can simply load the already trained parameters in the cell after next
theta_all = np.zeros((dic['size']+1, 3))
# train every classifier one at a time: one vs the rest
for i in [0, 1, 2]:
    theta_all[:, i]=optimized_theta(name_vec, gender_vec==i, theta_all[:, i])



In [10]:
# saving the trained model's parameters
np.save('data/params.npy', theta_all)

In [None]:
# uncomment the two lines below to load the trained parameters directly
# theta_all = np.load('data/params.npy')
# theta_all

<a name='test'></a>
## Testing

- First, define the function that makes predictions based on the trained model. 
- Second, define an accuracy function that reports the accuracy score of the model as well as the mismatched cases.  
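
Since the three one-vs-rest classifiers are trained independently, their sigmoid outputs do not sum to 1. A minimal sketch of one way to turn them into comparable scores is to rescale by their sum and take the largest. `pick_class` and the raw scores below are hypothetical; note also that `np.argmax` returns the first class on ties.

```python
import numpy as np

def pick_class(scores, classes=('M', 'F', 'U')):
    # scores: raw theta^T x for each one-vs-rest classifier
    prob = 1 / (1 + np.exp(-np.asarray(scores, dtype=float)))
    prob = prob / prob.sum()   # rescale so the three values sum to 1
    best = int(np.argmax(prob))
    return classes[best], prob

label, prob = pick_class([2.0, -1.0, -3.0])
print(label)   # M
```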

In [11]:
def predict(name, char_dic, theta=theta_all, show_all=True):
    '''Predicting the gender of Chinese names based on the model trained. 
    
    Params:
        name: str, can be both only first name or full name
        char_dic: dict that contain char-index pairs
        theta: trained multi-class logistic regression model's parameters 
        show_all: bool, defaults to True. 
            if changed to False, the output will only show the optimal prediction.
    '''
    # getFirstName (from utils) returns the first name of a given name
    # no matter whether the last name is included or not
    # (it is not called here, since this model is trained on full names)

    X = name2vec(name, char_dic)
    prob = sigmoid(np.squeeze(X @ theta))
    prob = prob/np.sum(prob)
    if show_all:
        return name, {'M': prob[0], 'F': prob[1], 'U': prob[2]}
    else:
        M, F, U = prob
        if M==F and F==U: return name, 'M=F=U', M
        elif M == np.max(prob): return name, 'M', M
        elif F>U: return name, 'F', F
        else: return name, 'U', U
        
        
def accuracy(examples, char_dic, theta, exclude_U=False):
    right = 0
    mismatch = [['name', 'gender', 'pred', 'prob']]
    smp_sz = len(examples)
    if not exclude_U:
        for example in examples:
            name, gender = example
            _, pred, prob = predict(name, char_dic, theta, show_all=False)
            if gender == pred: right += 1
            else: mismatch.append([name, gender, pred, prob])
    else:
        for example in examples:
            name, gender = example
            if gender != 'U':
                _, pred, prob = predict(name, char_dic, theta, show_all=False)
                if gender == pred: 
                    right += 1
                else: mismatch.append([name, gender, pred, prob])
            else:
                smp_sz -= 1
    return right/smp_sz, mismatch
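
Beyond the overall accuracy, it can help to see which kinds of errors dominate. The sketch below counts (true gender, predicted gender) pairs in a list shaped like the `mismatch` list returned by `accuracy`; the rows here are placeholders, not real results.

```python
from collections import Counter

# Hypothetical rows in the same format as the `mismatch` list.
mismatch = [
    ['name', 'gender', 'pred', 'prob'],
    ['a', 'M', 'F', 0.9],
    ['b', 'U', 'M', 0.7],
    ['c', 'M', 'F', 0.6],
]

# Skip the header row, then count (true gender, predicted gender) pairs.
confusions = Counter((gender, pred) for _, gender, pred, _ in mismatch[1:])
print(confusions.most_common())   # [(('M', 'F'), 2), (('U', 'M'), 1)]
```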

<a name='train_dev'></a>
### Testing against the train/dev sets

In [12]:
# first test the predict function 
names = ['李柔落', '许健康', '黄恺之', '周牧', '梦娜', '爱富']

for name in names:
    print(predict(name, dic, theta=theta_all, show_all=False))

('李柔落', 'F', 0.8282688845040158)
('许健康', 'M', 0.9964575515758208)
('黄恺之', 'M', 0.9451322745226411)
('周牧', 'M', 0.7347239512831906)
('梦娜', 'F', 0.9999998462342853)
('爱富', 'F', 0.4765655923345157)


In [13]:
# used to be = 0.96199 (trained on first names, provided the model knows the inputted names are first names)

# test the accuracy of the model on the selected train set
# including those gender undefined cases
accu, mismatch = accuracy(train_ds, dic, theta_all)
accu, mismatch[:20]

(0.93879,
 [['name', 'gender', 'pred', 'prob'],
  ['秋紫俊', 'M', 'F', 0.978642577432092],
  ['黄洪君', 'M', 'F', 0.4684807029242576],
  ['秋梓韦', 'M', 'F', 0.7349659046290781],
  ['林文冰', 'U', 'M', 0.7755468523083373],
  ['宗政昕雨', 'F', 'M', 0.7726569348042753],
  ['宋文会', 'U', 'M', 0.5855739797870683],
  ['吴乐懿', 'U', 'M', 0.7997276780342996],
  ['林水荣', 'U', 'M', 0.9459097662859194],
  ['张彩龙', 'M', 'F', 0.9638426230748547],
  ['施正漪', 'M', 'F', 0.5594720346983055],
  ['柯逸', 'U', 'M', 0.7497511120495094],
  ['张晓华', 'M', 'U', 0.5530268501136897],
  ['吴宝懿', 'U', 'M', 0.8771868925990866],
  ['溥睿舒', 'F', 'M', 0.5057040784622785],
  ['魏俊华', 'U', 'M', 0.8072405600913962],
  ['王景朋', 'M', 'F', 0.48697135123435425],
  ['陈游', 'U', 'M', 0.530722922547336],
  ['于树玉', 'U', 'M', 0.8663626479621284],
  ['段常桃', 'M', 'F', 0.8428272922172572]])

In [14]:
# used to be = 0.9799027255207855 (trained on first names, provided the model knows the inputted names are first names)

# test the accuracy of the model on the selected train set
# excluding those gender undefined cases
accu, mismatch = accuracy(train_ds, dic, theta_all, exclude_U=True)
accu, mismatch[:20]

(0.9586229521641959,
 [['name', 'gender', 'pred', 'prob'],
  ['秋紫俊', 'M', 'F', 0.978642577432092],
  ['黄洪君', 'M', 'F', 0.4684807029242576],
  ['秋梓韦', 'M', 'F', 0.7349659046290781],
  ['宗政昕雨', 'F', 'M', 0.7726569348042753],
  ['张彩龙', 'M', 'F', 0.9638426230748547],
  ['施正漪', 'M', 'F', 0.5594720346983055],
  ['张晓华', 'M', 'U', 0.5530268501136897],
  ['溥睿舒', 'F', 'M', 0.5057040784622785],
  ['王景朋', 'M', 'F', 0.48697135123435425],
  ['段常桃', 'M', 'F', 0.8428272922172572],
  ['童惠康', 'M', 'F', 0.5515323815542275],
  ['吉栩缘', 'F', 'M', 0.7320513034001833],
  ['陶甄', 'F', 'M', 0.5256535793651373],
  ['李旭彤', 'F', 'M', 0.5965216041972269],
  ['乔海云', 'F', 'M', 0.713114646136835],
  ['王越M', 'F', 'U', 0.7260726797468378],
  ['夏政芝', 'M', 'F', 0.9930304371276503],
  ['仲天苑', 'F', 'M', 0.9042388396102958],
  ['吉润秋', 'F', 'M', 0.8541009647265594]])

**For the rest of the original train set**

In [15]:
# used to be = 0.9399856890247558 (trained on first names, provided the model knows the inputted names are first names)

# test the accuracy of the model on the rest of the original train set
# including those gender undefined cases
accu, mismatch = accuracy(train_ds_dev, dic, theta_all)
accu, mismatch[:20]

(0.9249067570120637,
 [['name', 'gender', 'pred', 'prob'],
  ['车禹含', 'M', 'F', 0.5902672327562544],
  ['滕工巧', 'F', 'M', 0.5088430089259336],
  ['强改青', 'F', 'M', 0.9283118267748325],
  ['刘博今', 'U', 'M', 0.9444498146818915],
  ['德春', 'F', 'M', 0.868517936396465],
  ['吴秀全', 'M', 'F', 0.9432043603946353],
  ['王水', 'U', 'M', 0.5718363323194758],
  ['白若清', 'M', 'F', 0.7091115903204372],
  ['萧丁', 'U', 'M', 0.5837274513251047],
  ['果叶', 'U', 'F', 0.5427028535663759],
  ['张玉乐', 'U', 'M', 0.47145126163887],
  ['舒心棋', 'M', 'F', 0.5097107631627963],
  ['麻盱东', 'M', 'F', 0.5657613345526245],
  ['廖朱雨', 'M', 'F', 0.563882517109989],
  ['冯惠贤', 'F', 'M', 0.5643180831653223],
  ['高冉然', 'F', 'M', 0.5434077342880294],
  ['聂秀国', 'M', 'F', 0.6246737145889762],
  ['卓礼萱', 'M', 'F', 0.8326541949241154],
  ['兰坡', 'M', 'F', 0.7679628820175514]])

In [16]:
# used to be = 0.9518350451315245 (trained on first names, provided the model knows the inputted names are first names)

# test the accuracy of the model on the rest of the original train set
# excluding those gender undefined cases
accu, mismatch = accuracy(train_ds_dev, dic, theta_all, exclude_U=True)
accu, mismatch[:20]

(0.9466477397763199,
 [['name', 'gender', 'pred', 'prob'],
  ['车禹含', 'M', 'F', 0.5902672327562544],
  ['滕工巧', 'F', 'M', 0.5088430089259336],
  ['强改青', 'F', 'M', 0.9283118267748325],
  ['德春', 'F', 'M', 0.868517936396465],
  ['吴秀全', 'M', 'F', 0.9432043603946353],
  ['白若清', 'M', 'F', 0.7091115903204372],
  ['舒心棋', 'M', 'F', 0.5097107631627963],
  ['麻盱东', 'M', 'F', 0.5657613345526245],
  ['廖朱雨', 'M', 'F', 0.563882517109989],
  ['冯惠贤', 'F', 'M', 0.5643180831653223],
  ['高冉然', 'F', 'M', 0.5434077342880294],
  ['聂秀国', 'M', 'F', 0.6246737145889762],
  ['卓礼萱', 'M', 'F', 0.8326541949241154],
  ['兰坡', 'M', 'F', 0.7679628820175514],
  ['邱春荣', 'F', 'M', 0.7214029963348159],
  ['董欣武', 'M', 'F', 0.6428967144464044],
  ['倪钰滋', 'M', 'F', 0.8845204321196758],
  ['唐缜', 'M', 'U', 0.5096159948448584],
  ['濮阳晓晓', 'F', 'M', 0.5728163627514946]])

**For the original dev set**

In [17]:
# loading the original dev set
dev_ds = data_loader('data/dev_ds.txt')
len(dev_ds), dev_ds[:5]

(365811,
 [['冯瑞琳', 'F'], ['曹凯棋', 'M'], ['危义祥', 'M'], ['强识闻', 'M'], ['钮缤鲃', 'M']])

In [18]:
# used to be = 0.9483429635642997 (trained on first names, provided the model knows the inputted names are first names)

# test the accuracy of the model on the original dev set
# including those gender undefined cases
accu, mismatch = accuracy(dev_ds, dic, theta_all)
accu, mismatch[:20]

(0.9251717416917479,
 [['name', 'gender', 'pred', 'prob'],
  ['宰玉墨', 'F', 'M', 0.5536449346577309],
  ['卞佳臻', 'F', 'M', 0.5669373403716166],
  ['付睿', 'U', 'M', 0.556971322316718],
  ['班柳淳', 'F', 'M', 0.8845782550863046],
  ['井雁林', 'M', 'F', 0.5241512154512104],
  ['郭连', 'U', 'F', 0.39477043974537906],
  ['游丁', 'U', 'M', 0.7284637241418767],
  ['贝学敏', 'M', 'F', 0.9275515171953749],
  ['童雨杨', 'M', 'F', 0.520700926465257],
  ['梅必霏', 'M', 'F', 0.6403583813371776],
  ['林乔', 'U', 'M', 0.7496937229717253],
  ['尹腕', 'F', 'U', 0.5144769504657397],
  ['闫韶华', 'U', 'M', 0.4692820009375148],
  ['盛建梅', 'F', 'M', 0.6282129188679001],
  ['李思宁', 'M', 'F', 0.6329770950549312],
  ['奚雷筠', 'M', 'F', 0.8402054040990207],
  ['连国萌', 'M', 'F', 0.5143946257084303],
  ['南英', 'U', 'F', 0.7198246970528164],
  ['伊岩', 'M', 'F', 0.5607781084719637]])

In [19]:
# used to be = 0.9651273759520524 (trained on first names, provided the model knows the inputted names are first names)

# test the accuracy of the model on the original dev set
# excluding those gender undefined cases
accu, mismatch = accuracy(dev_ds, dic, theta_all, exclude_U=True)
accu, mismatch[:20]

(0.9472190564946694,
 [['name', 'gender', 'pred', 'prob'],
  ['宰玉墨', 'F', 'M', 0.5536449346577309],
  ['卞佳臻', 'F', 'M', 0.5669373403716166],
  ['班柳淳', 'F', 'M', 0.8845782550863046],
  ['井雁林', 'M', 'F', 0.5241512154512104],
  ['贝学敏', 'M', 'F', 0.9275515171953749],
  ['童雨杨', 'M', 'F', 0.520700926465257],
  ['梅必霏', 'M', 'F', 0.6403583813371776],
  ['尹腕', 'F', 'U', 0.5144769504657397],
  ['盛建梅', 'F', 'M', 0.6282129188679001],
  ['李思宁', 'M', 'F', 0.6329770950549312],
  ['奚雷筠', 'M', 'F', 0.8402054040990207],
  ['连国萌', 'M', 'F', 0.5143946257084303],
  ['伊岩', 'M', 'F', 0.5607781084719637],
  ['武亭', 'F', 'M', 0.6606498431132384],
  ['尚子烟', 'F', 'M', 0.7961143935137059],
  ['梁颜', 'F', 'M', 0.506179886081871],
  ['容博君', 'M', 'F', 0.6665849006561815],
  ['孙坤玥', 'M', 'F', 0.6625383047495877],
  ['鲍家容', 'F', 'M', 0.5188534322767071]])

<a name='test_set'></a>
### Testing against the test set

In [20]:
# loading the original test set
# please note that the test set contains full names
test_ds = data_loader('data/test_ds.txt')
len(test_ds), test_ds[:5]

(365811,
 [['邬爱清', 'F'], ['杜文吕', 'M'], ['任千焱', 'M'], ['鲍梦冉', 'F'], ['薛俊霖', 'M']])

In [21]:
# used to be = 0.946272655203521 (trained on first names, provided the model knows the inputted names are full names)

# test the accuracy of the model on the original test set
# including those gender undefined cases
accu, mismatch = accuracy(test_ds, dic, theta_all)
accu, mismatch[:20]

(0.9241903605960455,
 [['name', 'gender', 'pred', 'prob'],
  ['顾仁疋', 'F', 'M', 0.7066731462680538],
  ['幸路', 'U', 'M', 0.738155870343337],
  ['邓白', 'F', 'M', 0.365399496096957],
  ['商禹', 'U', 'M', 0.7200626894164673],
  ['易宛其', 'M', 'F', 0.8030107586410502],
  ['常杨榴', 'F', 'M', 0.7206600791937212],
  ['蔚韦君', 'M', 'F', 0.9564539560244072],
  ['任爰好', 'M', 'F', 0.9999350951945652],
  ['封萧', 'F', 'M', 0.6820436028485304],
  ['洪常佳', 'M', 'F', 0.5277110801943192],
  ['童仕君', 'U', 'M', 0.8295342209158066],
  ['梅彦云', 'M', 'F', 0.9527268083948703],
  ['杨晓一', 'U', 'M', 0.6717443239197195],
  ['舒海华', 'U', 'F', 0.671713168486229],
  ['贺小草', 'F', 'M', 0.5037333198383865],
  ['卢少冰', 'U', 'M', 0.7046783352521785],
  ['范文佳', 'M', 'F', 0.5762414235361206],
  ['刘宗缨', 'M', 'F', 0.9755349767439488],
  ['吴昕阳', 'U', 'M', 0.7313588211155064]])

In [23]:
# used to be = 0.9701746013923198 (trained on first names, provided the model knows the inputted names are full names)

# test the accuracy of the model on the original test set
# excluding those gender undefined cases
accu, mismatch = accuracy(test_ds, dic, theta_all, exclude_U=True)
accu, mismatch[:20]

(0.9462610985259978,
 [['name', 'gender', 'pred', 'prob'],
  ['顾仁疋', 'F', 'M', 0.7066731462680538],
  ['邓白', 'F', 'M', 0.365399496096957],
  ['易宛其', 'M', 'F', 0.8030107586410502],
  ['常杨榴', 'F', 'M', 0.7206600791937212],
  ['蔚韦君', 'M', 'F', 0.9564539560244072],
  ['任爰好', 'M', 'F', 0.9999350951945652],
  ['封萧', 'F', 'M', 0.6820436028485304],
  ['洪常佳', 'M', 'F', 0.5277110801943192],
  ['梅彦云', 'M', 'F', 0.9527268083948703],
  ['贺小草', 'F', 'M', 0.5037333198383865],
  ['范文佳', 'M', 'F', 0.5762414235361206],
  ['刘宗缨', 'M', 'F', 0.9755349767439488],
  ['易琪千', 'M', 'F', 0.6145151604590868],
  ['王春清', 'F', 'U', 0.4037118310758777],
  ['农月部', 'M', 'F', 0.9042110713781923],
  ['林汝桂', 'F', 'M', 0.7409999530752821],
  ['任利君', 'F', 'M', 0.3664572918008011],
  ['辜永红', 'F', 'M', 0.738371037162651],
  ['沈舵', 'M', 'U', 0.6025547161463135]])

<a name='first_name'></a>
## Test on first names

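Rerun the evaluation above on first names only, including splitting the train set into `train_ds` and `train_ds_dev`. The model is not retrained here: the same `dic` and `theta_all` trained on full names are reused.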

<a name='train_dev2'></a>
### Testing against the train/dev sets

In [25]:
train_ds_all = data_loader('data/train_ds.txt', full_name=False)
train_ds, train_ds_dev = train_ds_all[:100000], train_ds_all[100000:]
len(train_ds), train_ds[:5]

(100000, [['莹暂', 'F'], ['荣辉', 'M'], ['泽彬', 'M'], ['二庄', 'M'], ['治权', 'M']])

In [26]:
# The corresponding full names accuracy = 0.93879

# test the accuracy of the model on the selected train set
# including those gender undefined cases
accu, mismatch = accuracy(train_ds, dic, theta_all)
accu, mismatch[:20]

(0.91322,
 [['name', 'gender', 'pred', 'prob'],
  ['紫俊', 'M', 'F', 0.8157475023351292],
  ['洪君', 'M', 'U', 0.407583844798154],
  ['玉新', 'M', 'U', 0.44121113213586827],
  ['文冰', 'U', 'F', 0.5476496315616565],
  ['墨', 'M', 'U', 0.5002026320492552],
  ['乐懿', 'U', 'M', 0.7010292904988477],
  ['水荣', 'U', 'M', 0.7905228459855879],
  ['彩龙', 'M', 'F', 0.9634296437458582],
  ['正漪', 'M', 'F', 0.771087502182137],
  ['晗', 'M', 'F', 0.5169714448693695],
  ['君烨', 'F', 'U', 0.498546849741367],
  ['青平', 'M', 'U', 0.5389533022405638],
  ['晨云', 'M', 'U', 0.6307108857190767],
  ['晓华', 'M', 'U', 0.5806031425347951],
  ['宝懿', 'U', 'M', 0.8170942531945977],
  ['佳霖', 'M', 'U', 0.5422359381606827],
  ['双牛', 'M', 'F', 0.4760595044888528],
  ['木', 'M', 'U', 0.49100975735911123],
  ['张昕', 'M', 'F', 0.4370659471201789]])

In [27]:
# The corresponding full names accuracy = 0.9586229521641959

# test the accuracy of the model on the selected train set
# excluding those gender undefined cases
accu, mismatch = accuracy(train_ds, dic, theta_all, exclude_U=True)
accu, mismatch[:20]

(0.9212851899694491,
 [['name', 'gender', 'pred', 'prob'],
  ['紫俊', 'M', 'F', 0.8157475023351292],
  ['洪君', 'M', 'U', 0.407583844798154],
  ['玉新', 'M', 'U', 0.44121113213586827],
  ['墨', 'M', 'U', 0.5002026320492552],
  ['彩龙', 'M', 'F', 0.9634296437458582],
  ['正漪', 'M', 'F', 0.771087502182137],
  ['晗', 'M', 'F', 0.5169714448693695],
  ['君烨', 'F', 'U', 0.498546849741367],
  ['青平', 'M', 'U', 0.5389533022405638],
  ['晨云', 'M', 'U', 0.6307108857190767],
  ['晓华', 'M', 'U', 0.5806031425347951],
  ['佳霖', 'M', 'U', 0.5422359381606827],
  ['双牛', 'M', 'F', 0.4760595044888528],
  ['木', 'M', 'U', 0.49100975735911123],
  ['张昕', 'M', 'F', 0.4370659471201789],
  ['一诺', 'M', 'F', 0.5190393581767695],
  ['景朋', 'M', 'U', 0.3649086632077081],
  ['渝', 'M', 'U', 0.4893212989432761],
  ['郡', 'M', 'U', 0.6162315380954463]])

In [28]:
# The corresponding full names accuracy = 0.9249067570120637

# test the accuracy of the model on the rest of the original train set
# including those gender undefined cases
accu, mismatch = accuracy(train_ds_dev, dic, theta_all)
accu, mismatch[:20]

(0.8998112851080812,
 [['name', 'gender', 'pred', 'prob'],
  ['禹含', 'M', 'F', 0.8877296719740336],
  ['工巧', 'F', 'M', 0.5017063378305321],
  ['博今', 'U', 'M', 0.8920227963744335],
  ['春', 'F', 'U', 0.649434358569899],
  ['秀全', 'M', 'F', 0.8822946232766238],
  ['翦敏', 'M', 'U', 0.5540017538891048],
  ['水', 'U', 'M', 0.5079179006935647],
  ['若清', 'M', 'F', 0.7438410004977143],
  ['子', 'M', 'U', 0.5577160289084869],
  ['韶', 'M', 'U', 0.6369869984536615],
  ['买宁', 'M', 'U', 0.6330389011942824],
  ['路', 'M', 'U', 0.48938674221422185],
  ['敏敏', 'F', 'U', 0.5540017538891048],
  ['锦', 'M', 'U', 0.5031310563877914],
  ['盱东', 'M', 'F', 0.8922536932994531],
  ['朱雨', 'M', 'F', 0.5018197282717076],
  ['猷', 'M', 'U', 0.723957977881521],
  ['秀国', 'M', 'F', 0.8184444535444713],
  ['礼萱', 'M', 'F', 0.9590124258230304]])

In [30]:
# The corresponding full names accuracy = 0.9466477397763199

# test the accuracy of the model on the rest of the original train set
# excluding those gender undefined cases
accu, mismatch = accuracy(train_ds_dev, dic, theta_all, exclude_U=True)
accu, mismatch[:20]

(0.910031542396307,
 [['name', 'gender', 'pred', 'prob'],
  ['禹含', 'M', 'F', 0.8877296719740336],
  ['工巧', 'F', 'M', 0.5017063378305321],
  ['春', 'F', 'U', 0.649434358569899],
  ['秀全', 'M', 'F', 0.8822946232766238],
  ['翦敏', 'M', 'U', 0.5540017538891048],
  ['若清', 'M', 'F', 0.7438410004977143],
  ['子', 'M', 'U', 0.5577160289084869],
  ['韶', 'M', 'U', 0.6369869984536615],
  ['买宁', 'M', 'U', 0.6330389011942824],
  ['路', 'M', 'U', 0.48938674221422185],
  ['敏敏', 'F', 'U', 0.5540017538891048],
  ['锦', 'M', 'U', 0.5031310563877914],
  ['盱东', 'M', 'F', 0.8922536932994531],
  ['朱雨', 'M', 'F', 0.5018197282717076],
  ['猷', 'M', 'U', 0.723957977881521],
  ['秀国', 'M', 'F', 0.8184444535444713],
  ['礼萱', 'M', 'F', 0.9590124258230304],
  ['安勤', 'M', 'U', 0.4989259753355388],
  ['亚卓', 'M', 'F', 0.5035591254830749]])

**For the original dev set**

In [31]:
# loading the original dev set
dev_ds = data_loader('data/dev_ds.txt', full_name=False)
len(dev_ds), dev_ds[:5]

(365811, [['瑞琳', 'F'], ['凯棋', 'M'], ['义祥', 'M'], ['识闻', 'M'], ['缤鲃', 'M']])

In [32]:
# The corresponding full names accuracy = 0.9251717416917479

# test the accuracy of the model on the original dev set
# including those gender undefined cases
accu, mismatch = accuracy(dev_ds, dic, theta_all)
accu, mismatch[:20]

(0.9009980563733732,
 [['name', 'gender', 'pred', 'prob'],
  ['穗君', 'F', 'U', 0.5184177657307677],
  ['楠', 'F', 'U', 0.5363644612276439],
  ['涵之', 'M', 'F', 0.6362171017076287],
  ['刘敏', 'F', 'U', 0.5691554092844591],
  ['平', 'M', 'U', 0.6380979534121062],
  ['柳淳', 'F', 'M', 0.7382002195239615],
  ['雁林', 'M', 'F', 0.5879787898570178],
  ['学敏', 'M', 'U', 0.5970204872760623],
  ['雨杨', 'M', 'F', 0.4330292766322597],
  ['苏', 'F', 'U', 0.5989972136986694],
  ['乔', 'M', 'U', 0.49913805600337074],
  ['丁', 'M', 'U', 0.5534870286741153],
  ['腕', 'F', 'U', 0.723957977881521],
  ['路', 'M', 'U', 0.48938674221422185],
  ['思宁', 'M', 'F', 0.5699373636220598],
  ['奕渝', 'M', 'F', 0.47664839392618313],
  ['雷筠', 'M', 'F', 0.964506348430902],
  ['国萌', 'M', 'F', 0.529412411434513],
  ['苏平', 'M', 'U', 0.4753317822222272]])

In [33]:
# The corresponding full names accuracy = 0.9472190564946694

# test the accuracy of the model on the original dev set
# excluding those gender undefined cases
accu, mismatch = accuracy(dev_ds, dic, theta_all, exclude_U=True)
accu, mismatch[:20]

(0.9108958416375201,
 [['name', 'gender', 'pred', 'prob'],
  ['穗君', 'F', 'U', 0.5184177657307677],
  ['楠', 'F', 'U', 0.5363644612276439],
  ['涵之', 'M', 'F', 0.6362171017076287],
  ['刘敏', 'F', 'U', 0.5691554092844591],
  ['平', 'M', 'U', 0.6380979534121062],
  ['柳淳', 'F', 'M', 0.7382002195239615],
  ['雁林', 'M', 'F', 0.5879787898570178],
  ['学敏', 'M', 'U', 0.5970204872760623],
  ['雨杨', 'M', 'F', 0.4330292766322597],
  ['苏', 'F', 'U', 0.5989972136986694],
  ['乔', 'M', 'U', 0.49913805600337074],
  ['丁', 'M', 'U', 0.5534870286741153],
  ['腕', 'F', 'U', 0.723957977881521],
  ['路', 'M', 'U', 0.48938674221422185],
  ['思宁', 'M', 'F', 0.5699373636220598],
  ['奕渝', 'M', 'F', 0.47664839392618313],
  ['雷筠', 'M', 'F', 0.964506348430902],
  ['国萌', 'M', 'F', 0.529412411434513],
  ['苏平', 'M', 'U', 0.4753317822222272]])

<a name='test_set2'></a>
### Testing against the test set

In [34]:
# loading the original test set
test_ds = data_loader('data/test_ds.txt', full_name=False)
len(test_ds), test_ds[:5]

(365811, [['爱清', 'F'], ['文吕', 'M'], ['千焱', 'M'], ['梦冉', 'F'], ['俊霖', 'M']])

In [35]:
# The corresponding full names accuracy = 0.9241903605960455

# test the accuracy of the model on the original test set
# including those gender undefined cases
accu, mismatch = accuracy(test_ds, dic, theta_all)
accu, mismatch[:20]

(0.8995109496433951,
 [['name', 'gender', 'pred', 'prob'],
  ['付利', 'M', 'U', 0.5023307792153112],
  ['誉钓', 'M', 'U', 0.532105030459336],
  ['白', 'F', 'U', 0.5656655421635373],
  ['晓海', 'M', 'U', 0.47158518030179114],
  ['禹', 'U', 'M', 0.5487360243476288],
  ['建', 'M', 'U', 0.5073244792521341],
  ['宛其', 'M', 'F', 0.8959079506701031],
  ['韦君', 'M', 'U', 0.44370157458416504],
  ['爰好', 'M', 'F', 0.9999331966728348],
  ['枫', 'M', 'U', 0.6338212981047525],
  ['萧', 'F', 'U', 0.4403297161561554],
  ['捷', 'M', 'U', 0.6148766821414646],
  ['常佳', 'M', 'F', 0.6022250400125727],
  ['仕君', 'U', 'M', 0.5677469126101281],
  ['彦云', 'M', 'U', 0.6135852520109305],
  ['晓一', 'U', 'M', 0.4596048955982566],
  ['子华', 'M', 'U', 0.504439593793137],
  ['少冰', 'U', 'M', 0.38389848849914554],
  ['文佳', 'M', 'U', 0.45808656256111335]])

In [37]:
# The corresponding full names accuracy = 0.9462610985259978

# test the accuracy of the model on the original test set
# excluding those gender undefined cases
accu, mismatch = accuracy(test_ds, dic, theta_all, exclude_U=True)
accu, mismatch[:20]

(0.9098250903180712,
 [['name', 'gender', 'pred', 'prob'],
  ['付利', 'M', 'U', 0.5023307792153112],
  ['誉钓', 'M', 'U', 0.532105030459336],
  ['白', 'F', 'U', 0.5656655421635373],
  ['晓海', 'M', 'U', 0.47158518030179114],
  ['建', 'M', 'U', 0.5073244792521341],
  ['宛其', 'M', 'F', 0.8959079506701031],
  ['韦君', 'M', 'U', 0.44370157458416504],
  ['爰好', 'M', 'F', 0.9999331966728348],
  ['枫', 'M', 'U', 0.6338212981047525],
  ['萧', 'F', 'U', 0.4403297161561554],
  ['捷', 'M', 'U', 0.6148766821414646],
  ['常佳', 'M', 'F', 0.6022250400125727],
  ['彦云', 'M', 'U', 0.6135852520109305],
  ['子华', 'M', 'U', 0.504439593793137],
  ['文佳', 'M', 'U', 0.45808656256111335],
  ['梵', 'M', 'U', 0.6482237123592558],
  ['宗缨', 'M', 'F', 0.9855381600524019],
  ['琪千', 'M', 'F', 0.5851414698921656],
  ['春清', 'F', 'U', 0.5272978538499282]])