## Training a multi-class logistic regression model to predict the gender of Chinese names

To train a logistic regression, the names and genders in the train set are converted into word vectors and numerical values respectively. 

As there are three gender classes in the dataset, namely M (male), F (female), and U (undefined), the logistic regression model is a multi-class one. To let the model distinguish among these three classes, we train three separate binary logistic regression classifiers (one vs. rest), each of which estimates the probability of one gender class at a time. During training, M, F and U are represented by 0, 1 and 2 respectively. 
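
The one-vs-rest setup described above can be sketched as follows. This is a minimal illustration, not the notebook's own code: the `labels` array is made-up toy data, and `ovr_targets` is a name introduced here only for the example.

```python
import numpy as np

# Toy integer labels using the mapping above: M -> 0, F -> 1, U -> 2
labels = np.array([0, 1, 2, 1, 0])

# One binary target column per class: 1.0 where the example belongs to
# that class and 0.0 otherwise. Each column is the target of one
# binary logistic regression classifier.
ovr_targets = np.stack([(labels == k).astype(float) for k in range(3)], axis=1)
print(ovr_targets)
```

Every row contains exactly one 1, because each example belongs to exactly one of the three classes.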

The Chinese names are converted into word vectors using one-hot encoding. First, the total number of unique characters seen in the training set is counted, which determines the dimension of the word vector. For example, suppose we have 3 unique characters in total, say `'春'`, `'杰'`, `'乐'`; we can use `(1, 0, 0)`, `(0, 1, 0)` and `(0, 0, 1)` respectively to represent them. To represent `'春杰'`, we combine the two vectors (vector addition) and use `(1, 1, 0)`. To be able to represent any character, seen or unseen, we add one extra dimension so that the model can also deal with names containing unseen characters. Continuing the example above, `'春'` becomes `(1, 0, 0, 0)`, and when `'春'` comes with an unseen character, say `A`, we use `(1, 0, 0, 1)` to represent `'春A'`. To keep the model simple, no matter how many unseen characters a name contains, they are all represented by a single 1 in the extra dimension. Therefore, `'春ABCDEFGAAAAAA'` is still `(1, 0, 0, 1)`.
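
A minimal sketch of this encoding on the three-character toy vocabulary is shown below. The names `vocab`, `UNSEEN` and `toy_name2vec` are made up for this illustration, and the sketch leaves out the constant `x0 = 1` column that the notebook's own `name2vec` adds later.

```python
import numpy as np

# Toy vocabulary from the example above; index 3 is the shared slot
# for every unseen character.
vocab = {'春': 0, '杰': 1, '乐': 2}
UNSEEN = len(vocab)


def toy_name2vec(name):
    vec = np.zeros(len(vocab) + 1)
    for ch in name:
        # assignment rather than addition, so repeated or multiple
        # unseen characters still leave a single 1 in the slot
        vec[vocab.get(ch, UNSEEN)] = 1
    return vec


print(toy_name2vec('春杰'))
print(toy_name2vec('春A'))
print(toy_name2vec('春ABCDEFGAAAAAA'))
```

The last two calls return the same vector, which is exactly the simplification described above: the model only knows that at least one unseen character is present.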

## Overview

- [Data Loading and Preprocessing](#data)
    - [Data Loading](#loading)
    - [Preprocessing](#prep)
- [Model Training](#model)
- [Testing](#test)
    - [Testing against the train/dev sets](#train_dev)
    - [Testing against the test set](#test_set)
- [A problem](#problem)

<a name='data'></a>
## Data Loading and Preprocessing

It turns out there are over 6500 unique characters across the 633857 examples in the original train set split previously (see <ins>train_dev_test_split.ipynb</ins> within this directory), which means the word vector for any given name has over 6500 dimensions. For the entire train set, the names as a matrix would therefore have at least 633857 rows and 6500 columns, which is too large to train on comfortably because of the amount of computation and memory involved!
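
To give a rough sense of scale, the arithmetic below estimates the memory such a matrix would need. It assumes the matrix is stored densely as `float64` (8 bytes per value, the default dtype of `np.zeros`); the variable names are only for this illustration.

```python
n_examples = 633857    # examples in the original train set
n_features = 6500      # lower bound on the number of unique characters
bytes_per_value = 8    # assumption: dense float64 storage

total_bytes = n_examples * n_features * bytes_per_value
total_gb = total_bytes / 1e9
print(f'{total_gb:.1f} GB')
```

Under that assumption the dense matrix alone takes roughly 33 GB, before any intermediate arrays created during training.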

However, it also turns out that the performance of logistic regression in predicting the gender of Chinese names soon plateaus after it has been fed some ten thousand examples. Therefore, without changing the algorithm, for example by adding extra hidden layers (i.e., neural networks), more examples do not make a worthwhile difference. As a tradeoff, I only selected the first 100,000 examples from the original train set here. The rest of the original train set is treated as an additional dev set to test the performance of the model on unseen names. 

**Note**: With more hidden layers between the input names and the output genders, one-hot word vectors can be even more expensive to train. 

<a name='loading'></a>
### Data Loading 

In [1]:
from utils import *

# data_loader (from utils) will return [[name, gender], [name, gender]...]
train_ds_all = data_loader('data/train_ds.txt')
# the model will only be trained on train_ds (the first 100000 examples of the entire train set)
train_ds, train_ds_dev = train_ds_all[:100000], train_ds_all[100000:]
assert(len(train_ds_all) == len(train_ds) + len(train_ds_dev))

In [2]:
# check the first 5 examples
for i in range(5):
    print(train_ds[i])

['莹暂', 'F']
['荣辉', 'M']
['泽彬', 'M']
['二庄', 'M']
['治权', 'M']


<a name='prep'></a>
### Preprocessing 

In [3]:
from collections import defaultdict


# make the char_dict that will be useful in converting name vectors 
def char_dict(ds):
    '''Returns a char dict that stores the indices for the seen
    characters in the selected train set and returns a constant index
    for all unseen characters.
    
    Param:
        ds: dataset --> [list, list,...] where list=[name, gender]
    '''
    dic = {}
    idx = 1
    for item in ds:
        for char in item[0]:
            if char not in dic:
                dic[char] = idx
                idx += 1
                
    # the size of the dic = idx, which also includes the one index 
    # reserved for all unseen characters
    dic['size'] = idx
    dic = defaultdict(lambda :idx, dic)
    return dic

In [4]:
# checking the char dic
dic = char_dict(train_ds)
assert(dic['size'] == len(dic))
dic['size'], dic

(4376,
 defaultdict(<function __main__.char_dict.<locals>.<lambda>()>,
             {'莹': 1,
              '暂': 2,
              '荣': 3,
              '辉': 4,
              '泽': 5,
              '彬': 6,
              '二': 7,
              '庄': 8,
              '治': 9,
              '权': 10,
              '哧': 11,
              '天': 12,
              '彦': 13,
              '成': 14,
              '其': 15,
              '荷': 16,
              '志': 17,
              '峰': 18,
              '诚': 19,
              '晔': 20,
              '紫': 21,
              '俊': 22,
              '絮': 23,
              '雨': 24,
              '洪': 25,
              '君': 26,
              '世': 27,
              '佳': 28,
              '讯': 29,
              '谨': 30,
              '德': 31,
              '会': 32,
              '川': 33,
              '嘉': 34,
              '艺': 35,
              '贤': 36,
              '晓': 37,
              '晴': 38,
              '银': 39,
              '芳': 40,
              '琼':

In [5]:
# save the char dic for ease of reusing them later on 
import json


with open('data/char_dic.json', 'w') as f:
    json.dump(dic, f)
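
One caveat when reusing the saved file: `json.load` returns a plain `dict`, so the default index for unseen characters is lost unless it is restored. A minimal sketch of a loader follows; `load_char_dic` is a hypothetical helper introduced here, not part of `utils`.

```python
import json
from collections import defaultdict


def load_char_dic(path):
    '''Load a saved char dict and restore the default index
    that is returned for unseen characters.'''
    with open(path, encoding='utf-8') as f:
        plain = json.load(f)
    # 'size' holds the index reserved for all unseen characters
    unseen_idx = plain['size']
    return defaultdict(lambda: unseen_idx, plain)
```

Note also that looking up a missing key on a `defaultdict` inserts that key, so the dict grows as unseen characters are encountered. The stored index stays the same, but `len(dic)` no longer equals `dic['size']` afterwards.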

**Converting the names and genders into vectors.**

In [6]:
import numpy as np


def name2vec(name, char_dic, add_one=True):
    '''Converts a given name into a vector with one-hot encoding. 
    
    Params:
        name: str
        char_dic: dict
            this dict contains the indices for chars in the selected train set that 
            can be used as a one-hot encoder
        add_one: bool, currently unused
    Returns:
        name_vec: array-like (dim=(1, char_dic['size']+1))
            note that the first column of this row vector is always 1 (i.e., x0=1)
    '''
    name_vec = np.zeros((1, char_dic['size']+1))
    # x0 = 1
    name_vec[0,0] = 1
    for char in name:
        name_vec[0, char_dic[char]] = 1
    return name_vec


def convert_example(ds, char_dic):
    '''Converts the dataset and returns both names and gender as vectors.
    '''
    # m = num of examples, n = num of dimensions
    m, n = len(ds), char_dic['size']
    name_vec = np.zeros((m, n+1))
    gender_vec = np.zeros((m, 1))
    for i in range(m):
        name_vec[i] = name2vec(ds[i][0], char_dic)
        if ds[i][1] == 'F': gender_vec[i] = 1
        elif ds[i][1] == 'U': gender_vec[i] = 2
    
    return name_vec, gender_vec

In [7]:
# Converting the dataset 
name_vec, gender_vec = convert_example(train_ds, dic)
assert(len(name_vec) == len(gender_vec))
print(f'Training sample size: {len(name_vec)}')

Training sample size: 100000


<a name='model'></a>
## Model Training

To train the logistic regression model, we first define the functions for the **sigmoid function** as follows:

$$\mathrm{sigmoid}(z) = \frac{1}{1+e^{-z}} \quad \text{where } z \text{ is a scalar or } z(x) = \theta^T x$$

where $\theta^T x = \theta_0 x_0 + \theta_1 x_1 + ... + \theta_n x_n = \sum_{i=0}^{n}{\theta_i x_i}$ and $x_0 = 1$. Since the $X$ in our model contains only $0$s and $1$s, separating $\theta$ as $w$ and $b$ is less useful here.


Then the **cost function** for the sigmoid is defined as follows (cross-entropy) without regularization:

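$$ J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log \left( h_\theta \left( x^{(i)} \right) \right) + \left( 1 - y^{(i)} \right) \log \left( 1 - h_\theta \left( x^{(i)} \right) \right) \right] $$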

Here $h_\theta(x) = \mathrm{sigmoid}(\theta^T x)$ is the model's prediction. The **gradient** with respect to each $\theta_j$ (partial derivative), again without regularization, is:

$$ \frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m \left( h_\theta \left( x^{(i)} \right) - y^{(i)} \right) x_j^{(i)} $$

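No regularization terms are added to the cost or gradient functions here. Note that `scipy.optimize.fmin_tnc` (a truncated Newton optimizer) minimizes the function it is given and does not add regularization on its own; it only takes care of the optimization itself, such as choosing the steps and deciding when to stop. 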
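
One way to sanity-check that the gradient formula above matches the cost function is to compare it against finite differences on a small toy problem. The sketch below uses stand-alone versions of the functions; the random data and the helper `numerical_grad` are illustrative assumptions, not part of the notebook.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))


def grad(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / len(y)


def numerical_grad(theta, X, y, eps=1e-6):
    # central differences, one parameter at a time
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        g[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return g


rng = np.random.default_rng(0)
# 20 toy examples: a column of ones (x0) plus 4 binary features
X = np.hstack([np.ones((20, 1)), rng.integers(0, 2, size=(20, 4)).astype(float)])
y = rng.integers(0, 2, size=20).astype(float)
theta = rng.normal(scale=0.1, size=5)

max_diff = np.max(np.abs(grad(theta, X, y) - numerical_grad(theta, X, y)))
print(max_diff)
```

If the analytic gradient is correct, `max_diff` should be tiny compared with the gradient entries themselves.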

In [8]:
from scipy import optimize


def sigmoid(z):
    return 1/(1+np.exp(-z))


def costFunc(theta, X, y):
    '''
    paras:
        theta: array-like
        X: array-like
            the X should have an added first column filled with ones before inputted.
        y: array-like
    return:
        float
    '''
    y_pred = sigmoid(X @ theta)
    J = np.sum(np.multiply(y, np.log(y_pred)) + np.multiply((1-y), np.log(1 - y_pred)))
    return -J/len(y)


def gradient(theta, X, y):
    '''
    paras:
        theta: array-like
        X: array-like
            X should have an added first column filled with ones before inputted.
        y: array-like
            y should be converted into a column vector before inputted. 
    return:
        array-like
    '''
    return X.T @ (sigmoid(X @ theta) - y) / len(y)


def optimized_theta(X, y, theta):
    
    opt_theta = optimize.fmin_tnc(func = costFunc,
                                  x0 = theta, 
                                  fprime = gradient,
                                  args = (X, y.flatten()))
    return opt_theta[0]

In [9]:
# the three sets of theta values, one for each of the three LR classifiers
# To save time, you can simply load the already trained parameters two cells below
theta_all = np.zeros((dic['size']+1, 3))
# train the classifiers one at a time: one vs. rest
for i in [0, 1, 2]:
    theta_all[:, i]=optimized_theta(name_vec, gender_vec==i, theta_all[:, i])
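
When $|\theta^T x|$ becomes large, the sigmoid saturates to exactly 0 or 1 in floating point, so `np.log` in `costFunc` can return `-inf` and raise runtime warnings during training. A possible alternative, sketched below as an assumption rather than as the notebook's method, computes the two log terms directly with `np.logaddexp`; `stable_cost` is a name introduced only for this example.

```python
import numpy as np


def stable_cost(theta, X, y):
    z = X @ theta
    # log(sigmoid(z))     = -log(1 + exp(-z)) = -logaddexp(0, -z)
    # log(1 - sigmoid(z)) = -log(1 + exp(z))  = -logaddexp(0, z)
    log_h = -np.logaddexp(0, -z)
    log_1mh = -np.logaddexp(0, z)
    return -np.mean(y * log_h + (1 - y) * log_1mh)


# toy case where the second example has theta^T x = 1000,
# which saturates the plain sigmoid to exactly 1.0
X = np.array([[1.0, 0.0],
              [1.0, 1.0]])
y = np.array([1.0, 1.0])
theta = np.array([0.0, 1000.0])
value = stable_cost(theta, X, y)
print(value)
```

The result stays finite here, whereas the direct formula would evaluate `0 * log(0)` for the second example.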



In [10]:
# saving the trained model's parameters
np.save('data/params.npy', theta_all)

In [None]:
# uncomment the two lines below to load the trained parameters directly
# theta_all = np.load('data/params.npy')
# theta_all

<a name='test'></a>
## Testing

- First, define a function that makes predictions based on the model that has already been trained. 
- Second, define an accuracy function that returns the accuracy score of the model as well as the mismatched cases.  
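
The core of the prediction step can be sketched on toy numbers as follows: each of the three classifiers produces a sigmoid score, the scores are rescaled to sum to 1, and the class with the largest value wins. The parameters in `theta` below are made up, and `toy_predict` is a simplified stand-in that, unlike the notebook's `predict`, does not report three-way ties separately.

```python
import numpy as np


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def toy_predict(x, theta):
    '''x: (n,) feature vector; theta: (n, 3), one column per class (M, F, U).'''
    scores = sigmoid(x @ theta)
    probs = scores / scores.sum()   # rescale so the three scores sum to 1
    label = 'MFU'[int(np.argmax(probs))]
    return label, probs


# hypothetical parameters: row 0 multiplies x0, row 1 a single character feature
theta = np.array([[0.0, 0.0, -2.0],
                  [3.0, -3.0, 0.0]])
x = np.array([1.0, 1.0])
label, probs = toy_predict(x, theta)
print(label, probs)
```

The rescaled values are convenient for comparing classes, but since the three classifiers are trained independently they should be read as relative scores rather than calibrated probabilities.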

In [11]:
def predict(name, char_dic, theta=theta_all, show_all=True, full_name=False):
    '''Predicts the gender of a Chinese name based on the trained model. 
    
    Params:
        name: str, either a first name only or a full name
        char_dic: dict that contains char-index pairs
        theta: trained multi-class logistic regression model's parameters 
        show_all: bool, defaults to True. 
            if changed to False, the output will only show the optimal prediction.
        full_name: bool, defaults to False.
            if True, the first name is extracted with getFirstName before predicting.
    '''
    # getFirstName (from utils) returns the first name of a given name
    # no matter whether the last name is included or not
    if not full_name:
        X = name2vec(name, char_dic)
    else:
        fname = getFirstName(name)
        X = name2vec(fname, char_dic)
    prob = sigmoid(np.squeeze(X @ theta))
    prob = prob/np.sum(prob)
    if show_all:
        return name, {'M': prob[0], 'F': prob[1], 'U': prob[2]}
    else:
        M, F, U = prob
        if M==F and F==U: return name, 'M=F=U', M
        elif M == np.max(prob): return name, 'M', M
        elif F>U: return name, 'F', F
        else: return name, 'U', U
        
        
def accuracy(examples, char_dic, theta, exclude_U=False, full_name=False):
    right = 0
    mismatch = [['name', 'gender', 'pred', 'prob']]
    smp_sz = len(examples)
    if not exclude_U:
        for example in examples:
            name, gender = example
            _, pred, prob = predict(name, char_dic, theta, show_all=False, full_name=full_name)
            if gender == pred: right += 1
            else: mismatch.append([name, gender, pred, prob])
    else:
        for example in examples:
            name, gender = example
            if gender != 'U':
                _, pred, prob = predict(name, char_dic, theta, show_all=False, full_name=full_name)
                if gender == pred: 
                    right += 1
                else: mismatch.append([name, gender, pred, prob])
            else:
                smp_sz -= 1
    return right/smp_sz, mismatch

<a name='train_dev'></a>
### Testing against the train/dev sets

In [12]:
# first test the predict function 
names = ['李柔落', '许健康', '黄恺之', '周牧', '梦娜', '爱富']

for name in names:
    print(predict(name, dic, theta=theta_all, show_all=False))

('李柔落', 'F', 0.7406189716495426)
('许健康', 'M', 0.9999990047182503)
('黄恺之', 'M', 0.9985069564065047)
('周牧', 'M', 0.9939343959114006)
('梦娜', 'F', 0.9999985293819316)
('爱富', 'M', 0.9655679649000578)


In [13]:
# test the accuracy of the model on the selected train set
# including those gender undefined cases
accu, mismatch = accuracy(train_ds, dic, theta_all)
accu, mismatch[:20]

(0.96199,
 [['name', 'gender', 'pred', 'prob'],
  ['晔', 'U', 'M', 0.40640049019849533],
  ['文冰', 'U', 'M', 0.6743175415253797],
  ['文会', 'U', 'M', 0.6135669859259315],
  ['乐懿', 'U', 'M', 0.8300395610175295],
  ['水荣', 'U', 'M', 0.9270598936609558],
  ['正漪', 'M', 'F', 0.7704180986894983],
  ['晗', 'M', 'F', 0.6536717990218155],
  ['逸', 'U', 'M', 0.7354662573578914],
  ['晨云', 'M', 'F', 0.5007160117983905],
  ['晓华', 'M', 'F', 0.5262898528073406],
  ['宝懿', 'U', 'M', 0.939861336175106],
  ['珣', 'U', 'M', 0.47723471469191386],
  ['俊华', 'U', 'M', 0.8885340496765125],
  ['郡', 'M', 'F', 0.4492423468024235],
  ['游', 'U', 'M', 0.8091822301550559],
  ['树玉', 'U', 'M', 0.7308590990958171],
  ['常桃', 'M', 'F', 0.5039338213337725],
  ['英', 'U', 'F', 0.8609835356439166],
  ['杨华', 'U', 'M', 0.7864069411425665]])

In [14]:
# test the accuracy of the model on the selected train set
# excluding those gender undefined cases
accu, mismatch = accuracy(train_ds, dic, theta_all, exclude_U=True)
accu, mismatch[:20]

(0.9799027255207855,
 [['name', 'gender', 'pred', 'prob'],
  ['正漪', 'M', 'F', 0.7704180986894983],
  ['晗', 'M', 'F', 0.6536717990218155],
  ['晨云', 'M', 'F', 0.5007160117983905],
  ['晓华', 'M', 'F', 0.5262898528073406],
  ['郡', 'M', 'F', 0.4492423468024235],
  ['常桃', 'M', 'F', 0.5039338213337725],
  ['旭彤', 'F', 'M', 0.7795938004611246],
  ['海云', 'F', 'M', 0.6367215582947406],
  ['松菱', 'M', 'F', 0.6828052234934916],
  ['思嘉', 'M', 'F', 0.529137167730791],
  ['华昕', 'M', 'F', 0.44526170742995663],
  ['艳忠', 'M', 'F', 0.6786791188738392],
  ['琦', 'M', 'F', 0.3581180542375616],
  ['新捧', 'F', 'M', 0.5762718239593515],
  ['寿丹', 'M', 'F', 0.8798694099727873],
  ['云清', 'M', 'U', 0.3532734617770906],
  ['桂阳', 'F', 'M', 0.5339663638265105],
  ['祥英', 'F', 'M', 0.6135133617781545],
  ['晶源', 'F', 'M', 0.5065397044341299]])

**For the rest of the original train set**

In [15]:
# test the accuracy of the model on the rest of the original train set
# including those gender undefined cases
accu, mismatch = accuracy(train_ds_dev, dic, theta_all)
accu, mismatch[:20]

(0.9399856890247558,
 [['name', 'gender', 'pred', 'prob'],
  ['喜凰', 'M', 'F', 0.8309708425166227],
  ['贤云', 'U', 'M', 0.7071264765266768],
  ['秋田', 'U', 'F', 0.6289107487859427],
  ['痄闪', 'M', 'F', 0.9999877462419087],
  ['晖敏', 'M', 'U', 0.45086867020702753],
  ['启连', 'F', 'M', 0.9484412963561113],
  ['海螺', 'F', 'M', 0.7080750455083059],
  ['铃庚', 'M', 'F', 0.8383498813784845],
  ['元琦', 'U', 'M', 0.9322678356667757],
  ['满华', 'F', 'M', 0.7872297253423495],
  ['裎鹭', 'M', 'F', 0.9460165883672208],
  ['嵛涵', 'M', 'F', 0.5011353607034297],
  ['颖辉', 'U', 'F', 0.6657750799137295],
  ['向云', 'U', 'M', 0.8847812811524083],
  ['云卓', 'M', 'F', 0.4217788799448591],
  ['邦美', 'M', 'F', 0.7689001251900788],
  ['芊江', 'M', 'F', 0.5320734732173031],
  ['俊亚', 'U', 'M', 0.894946746291144],
  ['任红', 'F', 'M', 0.6166244729718314]])

In [16]:
# test the accuracy of the model on the rest of the original train set
# excluding those gender undefined cases
accu, mismatch = accuracy(train_ds_dev, dic, theta_all, exclude_U=True)
accu, mismatch[:20]

(0.9518350451315245,
 [['name', 'gender', 'pred', 'prob'],
  ['喜凰', 'M', 'F', 0.8309708425166227],
  ['痄闪', 'M', 'F', 0.9999877462419087],
  ['晖敏', 'M', 'U', 0.45086867020702753],
  ['启连', 'F', 'M', 0.9484412963561113],
  ['海螺', 'F', 'M', 0.7080750455083059],
  ['铃庚', 'M', 'F', 0.8383498813784845],
  ['满华', 'F', 'M', 0.7872297253423495],
  ['裎鹭', 'M', 'F', 0.9460165883672208],
  ['嵛涵', 'M', 'F', 0.5011353607034297],
  ['云卓', 'M', 'F', 0.4217788799448591],
  ['邦美', 'M', 'F', 0.7689001251900788],
  ['芊江', 'M', 'F', 0.5320734732173031],
  ['任红', 'F', 'M', 0.6166244729718314],
  ['离', 'M', 'F', 0.6067424662448887],
  ['条条', 'F', 'U', 0.554112070901956],
  ['秩荧', 'M', 'F', 0.9769377797564126],
  ['晓嬗', 'F', 'U', 0.47603372050175013],
  ['楠希', 'M', 'F', 0.5002595657968659],
  ['楫本', 'M', 'U', 0.5020391814162953]])

**For the original dev set**

In [17]:
# loading the original dev set
dev_ds = data_loader('data/dev_ds.txt')
len(dev_ds), dev_ds[:5]

(167528, [['瑞琳', 'F'], ['凯棋', 'M'], ['义祥', 'M'], ['识闻', 'M'], ['缤鲃', 'M']])

In [18]:
# test the accuracy of the model on the original dev set
# including those gender undefined cases
accu, mismatch = accuracy(dev_ds, dic, theta_all)
accu, mismatch[:20]

(0.9483429635642997,
 [['name', 'gender', 'pred', 'prob'],
  ['睿', 'U', 'M', 0.8311858450077128],
  ['连', 'U', 'F', 0.6128853366948264],
  ['丁', 'U', 'M', 0.7661898310914489],
  ['乔', 'U', 'M', 0.46442977434012034],
  ['韶华', 'U', 'M', 0.8835874436603273],
  ['思宁', 'M', 'F', 0.5513951408603314],
  ['詹晨', 'M', 'U', 0.4289339476065103],
  ['萃卜', 'F', 'M', 0.5218822803705685],
  ['畏', 'U', 'M', 0.5632632879903952],
  ['云夕', 'M', 'F', 0.5139186479273077],
  ['英', 'U', 'F', 0.8609835356439166],
  ['亭', 'F', 'M', 0.5835804599462253],
  ['云瑞', 'M', 'F', 0.5278466163026296],
  ['庚', 'U', 'M', 0.7025191336050127],
  ['润培', 'U', 'M', 0.948242222916413],
  ['雪明', 'U', 'F', 0.5222983314582598],
  ['九', 'U', 'M', 0.8759067846656167],
  ['棣华', 'U', 'M', 0.7546019635927789],
  ['文', 'U', 'M', 0.6451627573058977]])

In [19]:
# test the accuracy of the model on the original dev set
# excluding those gender undefined cases
accu, mismatch = accuracy(dev_ds, dic, theta_all, exclude_U=True)
accu, mismatch[:20]

(0.9651273759520524,
 [['name', 'gender', 'pred', 'prob'],
  ['思宁', 'M', 'F', 0.5513951408603314],
  ['詹晨', 'M', 'U', 0.4289339476065103],
  ['萃卜', 'F', 'M', 0.5218822803705685],
  ['云夕', 'M', 'F', 0.5139186479273077],
  ['亭', 'F', 'M', 0.5835804599462253],
  ['云瑞', 'M', 'F', 0.5278466163026296],
  ['坤玥', 'M', 'F', 0.5255140277152626],
  ['乔楚', 'M', 'F', 0.5397862223520067],
  ['继红', 'M', 'F', 0.5753134922560033],
  ['会会', 'F', 'U', 0.5074093787912801],
  ['琳伟', 'M', 'F', 0.7279461459535833],
  ['军红', 'F', 'M', 0.9009563144219227],
  ['文瑜', 'M', 'F', 0.6038500746856287],
  ['佳桤', 'M', 'F', 0.7706763303877168],
  ['滠芳', 'M', 'F', 0.9982434513411969],
  ['银', 'F', 'M', 0.6033869728115671],
  ['晗', 'M', 'F', 0.6536717990218155],
  ['舍明', 'M', 'U', 0.5650848281413636],
  ['砰', 'M', 'U', 0.554112070901956]])

<a name='test_set'></a>
### Testing against the test set

In [20]:
# loading the original test set
# please note that the test set contains full names
test_ds = data_loader('data/test_ds.txt')
len(test_ds), test_ds[:5]

(365810,
 [['邬爱清', 'F'], ['杜文吕', 'M'], ['任千焱', 'M'], ['鲍梦冉', 'F'], ['薛俊霖', 'M']])

In [21]:
# test the accuracy of the model on the original test set
# including those gender undefined cases
accu, mismatch = accuracy(test_ds, dic, theta_all, full_name=True)
accu, mismatch[:20]

(0.946272655203521,
 [['name', 'gender', 'pred', 'prob'],
  ['顾仁疋', 'F', 'M', 0.951494650027016],
  ['幸路', 'U', 'M', 0.7046140119549571],
  ['商禹', 'U', 'M', 0.9143276590605068],
  ['易宛其', 'M', 'F', 0.7085827978822331],
  ['蔚韦君', 'M', 'F', 0.7219030270992229],
  ['任爰好', 'M', 'F', 0.7521378056877494],
  ['童仕君', 'U', 'M', 0.8478630748258673],
  ['杨晓一', 'U', 'M', 0.7286339835935126],
  ['舒海华', 'U', 'M', 0.7885843833905349],
  ['卢少冰', 'U', 'M', 0.7701917415910329],
  ['范文佳', 'M', 'F', 0.45505228616429405],
  ['史梵', 'M', 'F', 0.39515143017123255],
  ['吴昕阳', 'U', 'M', 0.9279654532438039],
  ['王春清', 'F', 'M', 0.7151642962315684],
  ['农月部', 'M', 'F', 0.9903755821554568],
  ['淳于文灵', 'M', 'F', 0.5427887968864502],
  ['井建华', 'U', 'M', 0.8129042376544283],
  ['辜永红', 'F', 'M', 0.702495246245912],
  ['安彬', 'U', 'M', 0.8379973144736247]])

In [22]:
# test the accuracy of the model on the original test set
# excluding those gender undefined cases
accu, mismatch = accuracy(test_ds, dic, theta_all, exclude_U=True, full_name=True)
accu, mismatch[:20]

(0.9701746013923198,
 [['name', 'gender', 'pred', 'prob'],
  ['顾仁疋', 'F', 'M', 0.951494650027016],
  ['易宛其', 'M', 'F', 0.7085827978822331],
  ['蔚韦君', 'M', 'F', 0.7219030270992229],
  ['任爰好', 'M', 'F', 0.7521378056877494],
  ['范文佳', 'M', 'F', 0.45505228616429405],
  ['史梵', 'M', 'F', 0.39515143017123255],
  ['王春清', 'F', 'M', 0.7151642962315684],
  ['农月部', 'M', 'F', 0.9903755821554568],
  ['淳于文灵', 'M', 'F', 0.5427887968864502],
  ['辜永红', 'F', 'M', 0.702495246245912],
  ['聂驯侘', 'M', 'U', 0.554112070901956],
  ['李来月', 'F', 'M', 0.788230634374804],
  ['宁文蔚', 'M', 'F', 0.5391734909647223],
  ['常海仪', 'M', 'F', 0.7240201061900376],
  ['蓝柳扬', 'F', 'M', 0.5368824575412814],
  ['尹昕', 'F', 'M', 0.4691869057902933],
  ['查群', 'F', 'U', 0.5271229497252596],
  ['雷琦琦', 'M', 'F', 0.3581180542375616],
  ['家雅凯', 'M', 'F', 0.6117890475114353]])

<a name='problem'></a>
## A problem

When first names are processed as if they were full names (`full_name=True`), the accuracy on the dev set drops by nearly 8 percentage points, from about 94.8% to about 87.1%. This is because the rule-based `getFirstName` function treats a first name as a full name whenever it starts with a character that can also be used as a last name. 

Things to do: either improve the accuracy of the `getFirstName` function or train the model with last names as well. 
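
The failure mode can be illustrated with a deliberately naive rule. The sketch below is hypothetical: `SURNAMES` is a made-up three-entry list and `naive_first_name` is not the actual `getFirstName` from `utils`, whose implementation may differ.

```python
# Hypothetical surname list, for illustration only
SURNAMES = {'梅', '李', '王'}


def naive_first_name(name):
    '''Strip the first character if it could be a surname.'''
    if len(name) > 1 and name[0] in SURNAMES:
        return name[1:]
    return name


# full name: the surname is removed as intended
print(naive_first_name('李梅云'))
# first name only: the first character is wrongly treated as a surname
print(naive_first_name('梅云'))
```

In the second call the name loses a character before it reaches the classifier, which is the kind of truncation that can turn a correct prediction into a mismatch.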

In [23]:
# originally = 0.9483429635642997
# now = 0.8710424526049377
accu, mismatch = accuracy(dev_ds, dic, theta_all, full_name=True)
accu, mismatch[:20]

(0.8710424526049377,
 [['name', 'gender', 'pred', 'prob'],
  ['梅云', 'F', 'U', 0.5925564159943968],
  ['穗君', 'F', 'U', 0.5206657693467208],
  ['梦饶', 'F', 'M', 0.5512089884475259],
  ['佳嘉', 'F', 'M', 0.6504799666787816],
  ['玉墨', 'F', 'M', 0.6682442165247393],
  ['承隍', 'M', 'U', 0.554112070901956],
  ['艳砚', 'F', 'M', 0.5814055986164764],
  ['天秘', 'M', 'F', 0.5832588205825965],
  ['凤辰', 'F', 'M', 0.7723805377370389],
  ['子懿', 'M', 'F', 0.5040915502527811],
  ['宇烟', 'M', 'F', 0.9808327176375053],
  ['奇', 'M', 'U', 0.554112070901956],
  ['佳臻', 'F', 'M', 0.7452530112063541],
  ['明华', 'M', 'U', 0.5883437014656338],
  ['凤桐', 'F', 'M', 0.5139461732815358],
  ['智', 'M', 'U', 0.554112070901956],
  ['睿', 'U', 'M', 0.8311858450077128],
  ['磊', 'M', 'U', 0.554112070901956],
  ['文藻', 'M', 'U', 0.6607274090794647]])

In [24]:
# originally = 0.9651273759520524
# now = 0.8831277771361534
accu, mismatch = accuracy(dev_ds, dic, theta_all, exclude_U=True, full_name=True)
accu, mismatch[:20]

(0.8831277771361534,
 [['name', 'gender', 'pred', 'prob'],
  ['梅云', 'F', 'U', 0.5925564159943968],
  ['穗君', 'F', 'U', 0.5206657693467208],
  ['梦饶', 'F', 'M', 0.5512089884475259],
  ['佳嘉', 'F', 'M', 0.6504799666787816],
  ['玉墨', 'F', 'M', 0.6682442165247393],
  ['承隍', 'M', 'U', 0.554112070901956],
  ['艳砚', 'F', 'M', 0.5814055986164764],
  ['天秘', 'M', 'F', 0.5832588205825965],
  ['凤辰', 'F', 'M', 0.7723805377370389],
  ['子懿', 'M', 'F', 0.5040915502527811],
  ['宇烟', 'M', 'F', 0.9808327176375053],
  ['奇', 'M', 'U', 0.554112070901956],
  ['佳臻', 'F', 'M', 0.7452530112063541],
  ['明华', 'M', 'U', 0.5883437014656338],
  ['凤桐', 'F', 'M', 0.5139461732815358],
  ['智', 'M', 'U', 0.554112070901956],
  ['磊', 'M', 'U', 0.554112070901956],
  ['文藻', 'M', 'U', 0.6607274090794647],
  ['平', 'M', 'U', 0.554112070901956]])