# Gender prediction based on voice with _k_-nearest neighbors

We use the [Gender Recognition by Voice](https://www.kaggle.com/primaryobjects/voicegender/home) dataset from Kaggle and try to predict the gender of a person from at most 7 variables. To do this we use the _k_-NN algorithm.

## Import and check-out dataset

In [76]:
import seaborn as sns
import sklearn as sk
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

In [77]:
df = pd.read_csv('voice.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3168 entries, 0 to 3167
Data columns (total 21 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   meanfreq  3168 non-null   float64
 1   sd        3168 non-null   float64
 2   median    3168 non-null   float64
 3   Q25       3168 non-null   float64
 4   Q75       3168 non-null   float64
 5   IQR       3168 non-null   float64
 6   skew      3168 non-null   float64
 7   kurt      3168 non-null   float64
 8   sp.ent    3168 non-null   float64
 9   sfm       3168 non-null   float64
 10  mode      3168 non-null   float64
 11  centroid  3168 non-null   float64
 12  meanfun   3168 non-null   float64
 13  minfun    3168 non-null   float64
 14  maxfun    3168 non-null   float64
 15  meandom   3168 non-null   float64
 16  mindom    3168 non-null   float64
 17  maxdom    3168 non-null   float64
 18  dfrange   3168 non-null   float64
 19  modindx   3168 non-null   float64
 20  label     3168 non-null   object 

We have a dataset with 3168 cases and 21 columns: 20 float columns describing voice attributes, plus a final column, label, which is the one we will try to predict. It contains 'male' or 'female'.

In [78]:
df['label'].value_counts()

male      1584
female    1584
Name: label, dtype: int64

Checking the value counts shows how many males and females there are, and therefore what accuracy we would get by always choosing one specific label. In this case that is 50%.
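The baseline mentioned above can be worked out directly from the counts. This is a minimal sketch that hard-codes the two numbers from the `value_counts()` output instead of reading `voice.csv`:

```python
# Class counts copied from the value_counts() output above.
counts = {"male": 1584, "female": 1584}

# Always predicting the most frequent label gets this fraction right.
baseline_accuracy = max(counts.values()) / sum(counts.values())
print(f"baseline accuracy: {baseline_accuracy:.2f}")  # baseline accuracy: 0.50
```

Any model we train should score clearly above this number to be worth using.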

## Pre-processing

In [79]:
df_subset = pd.get_dummies(df, columns=['label'])
corr = df_subset.corr()
corr

| | meanfreq | sd | median | Q25 | Q75 | IQR | skew | kurt | sp.ent | sfm | ... | meanfun | minfun | maxfun | meandom | mindom | maxdom | dfrange | modindx | label_female | label_male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| meanfreq | 1.0 | -0.739039 | 0.925445 | 0.911416 | 0.740997 | -0.627605 | -0.322327 | -0.316036 | -0.601203 | -0.784332 | ... | 0.460844 | 0.383937 | 0.274004 | 0.536666 | 0.229261 | 0.519528 | 0.51557 | -0.216979 | 0.337415 | -0.337415 |
| sd | -0.739039 | 1.0 | -0.562603 | -0.846931 | -0.161076 | 0.87466 | 0.314597 | 0.346241 | 0.71662 | 0.838086 | ... | -0.466281 | -0.345609 | -0.129662 | -0.482726 | -0.357667 | -0.482278 | -0.475999 | 0.12266 | -0.479539 | 0.479539 |
| median | 0.925445 | -0.562603 | 1.0 | 0.774922 | 0.731849 | -0.477352 | -0.257407 | -0.243382 | -0.502005 | -0.66169 | ... | 0.414909 | 0.337602 | 0.251328 | 0.455943 | 0.191169 | 0.438919 | 0.435621 | -0.213298 | 0.283919 | -0.283919 |
| Q25 | 0.911416 | -0.846931 | 0.774922 | 1.0 | 0.47714 | -0.874189 | -0.319475 | -0.350182 | -0.648126 | -0.766875 | ... | 0.545035 | 0.320994 | 0.199841 | 0.467403 | 0.302255 | 0.459683 | 0.454394 | -0.141377 | 0.511455 | -0.511455 |
| Q75 | 0.740997 | -0.161076 | 0.731849 | 0.47714 | 1.0 | 0.009636 | -0.206339 | -0.148881 | -0.174905 | -0.378198 | ... | 0.155091 | 0.258002 | 0.285584 | 0.359181 | -0.02375 | 0.335114 | 0.335648 | -0.216475 | -0.066906 | 0.066906 |
| IQR | -0.627605 | 0.87466 | -0.477352 | -0.874189 | 0.009636 | 1.0 | 0.249497 | 0.316185 | 0.640813 | 0.663601 | ... | -0.534462 | -0.22268 | -0.069588 | -0.333362 | -0.357037 | -0.337877 | -0.331563 | 0.041252 | -0.618916 | 0.618916 |
| skew | -0.322327 | 0.314597 | -0.257407 | -0.319475 | -0.206339 | 0.249497 | 1.0 | 0.97702 | -0.195459 | 0.079694 | ... | -0.167668 | -0.216954 | -0.080861 | -0.336848 | -0.061608 | -0.305651 | -0.30464 | -0.169325 | -0.036627 | 0.036627 |
| kurt | -0.316036 | 0.346241 | -0.243382 | -0.350182 | -0.148881 | 0.316185 | 0.97702 | 1.0 | -0.127644 | 0.109884 | ... | -0.19456 | -0.203201 | -0.045667 | -0.303234 | -0.103313 | -0.2745 | -0.272729 | -0.205539 | -0.087195 | 0.087195 |
| sp.ent | -0.601203 | 0.71662 | -0.502005 | -0.648126 | -0.174905 | 0.640813 | -0.195459 | -0.127644 | 1.0 | 0.866411 | ... | -0.513194 | -0.305826 | -0.120738 | -0.293562 | -0.294869 | -0.324253 | -0.319054 | 0.198074 | -0.490552 | 0.490552 |
| sfm | -0.784332 | 0.838086 | -0.66169 | -0.766875 | -0.378198 | 0.663601 | 0.079694 | 0.109884 | 0.866411 | 1.0 | ... | -0.421066 | -0.3621 | -0.192369 | -0.428442 | -0.289593 | -0.436649 | -0.43158 | 0.211477 | -0.357499 | 0.357499 |


First we make dummies (0s and 1s) from the label. This creates two columns, label_female and label_male, each containing a 1 when the voice has that gender. I then use a correlation matrix to see how strongly each of the other columns is related to these labels. The larger the absolute value of the correlation, the stronger the relationship, whether positive or negative. From those I select the subset of columns with the strongest correlations.
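One way to make this selection less manual is to rank the features by the absolute value of their correlation with label_female. The sketch below is only an illustration: it hard-codes the label_female values for the ten rows shown in the output above (the remaining rows of the matrix are not listed here), and the variable names are my own.

```python
import pandas as pd

# Correlation with label_female, copied from the ten rows shown above.
corr_with_female = pd.Series({
    "meanfreq": 0.337415,
    "sd": -0.479539,
    "median": 0.283919,
    "Q25": 0.511455,
    "Q75": -0.066906,
    "IQR": -0.618916,
    "skew": -0.036627,
    "kurt": -0.087195,
    "sp.ent": -0.490552,
    "sfm": -0.357499,
})

# A strong negative correlation is as informative as a strong positive one,
# so rank on the absolute value.
ranked = corr_with_female.abs().sort_values(ascending=False)
print(ranked.head(6))
```

In the notebook itself the same ranking could be taken from the full matrix with `corr['label_female'].drop(['label_female', 'label_male']).abs().sort_values(ascending=False)`.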

In [80]:
df_subset = df_subset[['meanfreq', 'sd', 'Q25', 'IQR', 'sp.ent', 'sfm', 'meanfun', 'label_female']]
df_subset.head()

| | meanfreq | sd | Q25 | IQR | sp.ent | sfm | meanfun | label_female |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.059781 | 0.064241 | 0.015071 | 0.075122 | 0.893369 | 0.491918 | 0.084279 | 0 |
| 1 | 0.066009 | 0.06731 | 0.019414 | 0.073252 | 0.892193 | 0.513724 | 0.107937 | 0 |
| 2 | 0.077316 | 0.083829 | 0.008701 | 0.123207 | 0.846389 | 0.478905 | 0.098706 | 0 |
| 3 | 0.151228 | 0.072111 | 0.096582 | 0.111374 | 0.963322 | 0.727232 | 0.088965 | 0 |
| 4 | 0.13512 | 0.079146 | 0.07872 | 0.127325 | 0.971955 | 0.783568 | 0.106398 | 0 |


In [81]:
X = df_subset[['meanfreq', 'sd', 'Q25', 'IQR', 'sp.ent', 'sfm', 'meanfun']] #create the X matrix
y = df_subset['label_female'] #create the y-variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) #split the data, store it into different variables

We predict whether a voice is female, using 'meanfreq', 'sd', 'Q25', 'IQR', 'sp.ent', 'sfm' and 'meanfun' as variables. We define $X$ with these variables and $y$ as label_female, which is 1 for a female voice and 0 for a male voice. After that we split the data into 70% train and 30% test. Setting random_state=1 gives us the same split every time we run it.
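The effect of `random_state` can be checked on a small example. The arrays below are toy data made up for this sketch, not the voice dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for X and y (hypothetical values).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Split twice with the same seed.
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)

# Each split is (X_train, X_test, y_train, y_test); compare them part by part.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)  # True
```

Leaving `random_state` out would shuffle the rows differently on each run, so the reported accuracy would change slightly from run to run.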

## What is _k_-NN
_k_-NN is an algorithm that works in an n-dimensional space, with one dimension per variable. It classifies each case by looking up its closest neighbours in that space and assigning the label that most of them share. You can choose the number of neighbours to look at; with two classes an odd number works best, because an even number can produce ties. The default is 5 neighbours. I tested a few values and noticed that 3 neighbours gives the best results in this case.
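To make the idea concrete, here is a minimal from-scratch sketch of the nearest-neighbour vote. It is not how scikit-learn implements it internally; it assumes Euclidean distance and a plain majority vote, and the function name `knn_predict` and the toy points are made up for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]


# Toy 2-dimensional data (hypothetical): two well-separated groups
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.15, 0.15])))  # 0
print(knn_predict(X_toy, y_toy, np.array([0.95, 1.0])))   # 1
```

With k=3 and two classes, the three votes can never split evenly, which is the reason for preferring an odd k.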

## Training the model

In [82]:
knn = KNeighborsClassifier(n_neighbors=3) #create a KNN-classifier with 3 neighbors
knn = knn.fit(X_train, y_train) #this fits the k-nearest neighbor model with the train data

## Evaluating the model

In [83]:
knn.score(X_test, y_test) #calculate the accuracy on the test data

0.9747634069400631

97% of the voices are predicted accurately. So, is that good or bad?

Well, given that about 50% of the voices are female, predicting that _everything_ is 'female' would get only about half of them right. So 97% is pretty good. We are now going to look at a confusion matrix to calculate recall and precision.

In [84]:
y_test_pred = knn.predict(X_test) #the predicted values
cm = confusion_matrix(y_test, y_test_pred) #creates a "confusion matrix"
conf_matrix = pd.DataFrame(cm, index=['male', 'female'], columns = ['male.p', 'female.p']) 
conf_matrix

| | male.p | female.p |
|---|---|---|
| male | 481 | 13 |
| female | 11 | 446 |


The way to read this is that of the female voices, 446 are correctly predicted as 'female' and 11 are instead predicted as 'male'. Of the male voices, 13 are wrongly predicted as 'female'. The _recall_ and _precision_ for the category 'female' are:

$recall = \frac{446}{446 + 11} = .98$

$precision = \frac{446}{446 + 13} = .97$
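The same numbers can be computed from the confusion matrix in code. This sketch hard-codes the four counts from the matrix shown above and treats 'female' as the positive class:

```python
import numpy as np

# Confusion matrix from above: rows are true labels, columns are predictions,
# both ordered (male, female).
cm = np.array([[481, 13],
               [11, 446]])

# With 'female' as the positive class:
# tn = male predicted male, fp = male predicted female,
# fn = female predicted male, tp = female predicted female.
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)          # share of female voices that were found
precision = tp / (tp + fp)       # share of 'female' predictions that were right
accuracy = (tp + tn) / cm.sum()  # share of all predictions that were right

print(f"recall={recall:.3f} precision={precision:.3f} accuracy={accuracy:.3f}")
# recall=0.976 precision=0.972 accuracy=0.975
```

The accuracy recovered here matches the value returned by `knn.score(X_test, y_test)`.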

Different values for _k_ delivered roughly the same accuracy, precision and recall, but in the end 3 scored best.
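A comparison like this can be run as a simple loop over candidate values of _k_. The sketch below uses synthetic data from `make_classification` as a stand-in, because it does not load `voice.csv`; the list of _k_ values is an assumption, not the exact set tried for this notebook, and the scores it prints will differ from the ones reported here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption); in the notebook, reuse
# X_train, X_test, y_train and y_test instead.
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Fit one classifier per candidate k and record its test accuracy.
scores = {}
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = model.score(X_te, y_te)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```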