The aim of this notebook is to demonstrate how to implement logistic regression with stochastic gradient descent. 

In this project, we use logistic regression to recognize a speaker's gender from their voice. The data comes from Kaggle; each voice signal has already been analyzed, and the acoustic properties of each sample are recorded as numeric features.

# Prepare notebook and import data

In [7]:
import pandas as pd
import numpy as np

In [8]:
# import the data and check its dimensions (rows, columns)

df_voice=pd.read_csv("voice.csv")
df_voice.shape

(3168, 21)

Taking a look at the dataframe itself:

In [9]:
df_voice.head()

,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,...,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx,label
0,0.059781,0.064241,0.032027,0.015071,0.090193,0.075122,12.863462,274.402906,0.893369,0.491918,...,0.059781,0.084279,0.015702,0.275862,0.007812,0.007812,0.007812,0.0,0.0,male
1,0.066009,0.06731,0.040229,0.019414,0.092666,0.073252,22.423285,634.613855,0.892193,0.513724,...,0.066009,0.107937,0.015826,0.25,0.009014,0.007812,0.054688,0.046875,0.052632,male
2,0.077316,0.083829,0.036718,0.008701,0.131908,0.123207,30.757155,1024.927705,0.846389,0.478905,...,0.077316,0.098706,0.015656,0.271186,0.00799,0.007812,0.015625,0.007812,0.046512,male
3,0.151228,0.072111,0.158011,0.096582,0.207955,0.111374,1.232831,4.177296,0.963322,0.727232,...,0.151228,0.088965,0.017798,0.25,0.201497,0.007812,0.5625,0.554688,0.247119,male
4,0.13512,0.079146,0.124656,0.07872,0.206045,0.127325,1.101174,4.333713,0.971955,0.783568,...,0.13512,0.106398,0.016931,0.266667,0.712812,0.007812,5.484375,5.476562,0.208274,male


All the variables are listed below:

In [10]:
list(df_voice.columns)

['meanfreq',
 'sd',
 'median',
 'Q25',
 'Q75',
 'IQR',
 'skew',
 'kurt',
 'sp.ent',
 'sfm',
 'mode',
 'centroid',
 'meanfun',
 'minfun',
 'maxfun',
 'meandom',
 'mindom',
 'maxdom',
 'dfrange',
 'modindx',
 'label']

The label column is categorical, with the values male and female. It is convenient to convert these to numbers, where 0 represents females and 1 represents males.

In [11]:
gender_dict = {"male": 1, "female": 0}
df_voice["gender"] = df_voice["label"].map(gender_dict)
df_voice = df_voice.drop(columns="label")

In [12]:
df_voice.head()

,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,...,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx,gender
0,0.059781,0.064241,0.032027,0.015071,0.090193,0.075122,12.863462,274.402906,0.893369,0.491918,...,0.059781,0.084279,0.015702,0.275862,0.007812,0.007812,0.007812,0.0,0.0,1
1,0.066009,0.06731,0.040229,0.019414,0.092666,0.073252,22.423285,634.613855,0.892193,0.513724,...,0.066009,0.107937,0.015826,0.25,0.009014,0.007812,0.054688,0.046875,0.052632,1
2,0.077316,0.083829,0.036718,0.008701,0.131908,0.123207,30.757155,1024.927705,0.846389,0.478905,...,0.077316,0.098706,0.015656,0.271186,0.00799,0.007812,0.015625,0.007812,0.046512,1
3,0.151228,0.072111,0.158011,0.096582,0.207955,0.111374,1.232831,4.177296,0.963322,0.727232,...,0.151228,0.088965,0.017798,0.25,0.201497,0.007812,0.5625,0.554688,0.247119,1
4,0.13512,0.079146,0.124656,0.07872,0.206045,0.127325,1.101174,4.333713,0.971955,0.783568,...,0.13512,0.106398,0.016931,0.266667,0.712812,0.007812,5.484375,5.476562,0.208274,1


Before fitting the model, the data needs a few preparation steps:

1. The rows are ordered so that the male voices come first and the female voices come later. We need to shuffle the rows so that both male and female voices are seen throughout the learning process.
2. The labels were the strings male and female. They have been converted to 1 and 0 above so that the data can be fed into the logistic regression algorithm.
3. We need to hold out a set of testing data to evaluate the algorithm.

In [13]:
df_voice = df_voice.iloc[np.random.permutation(len(df_voice))]

We also split the data into a training set (the first 2100 shuffled rows) and a testing set (the remaining 1068 rows), so that we can verify whether logistic regression accurately predicts the gender of a voice it has not seen.

In [14]:
df_train = df_voice[:2100]
df_test = df_voice[2100:]

In [15]:
df_train.shape

(2100, 21)
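
Because the shuffle above is unseeded, the rows that end up in the training set differ from run to run. The following is a minimal sketch of a reproducible shuffle and split, not the notebook's own method. The toy frame `df_demo`, the seed value, and the split size `n_train` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_voice: 10 rows, sorted by class like the real file
df_demo = pd.DataFrame({
    "meanfreq": np.linspace(0.05, 0.25, 10),
    "gender": [1] * 5 + [0] * 5,
})

# A seeded generator gives the same permutation on every run (seed is arbitrary)
rng = np.random.default_rng(0)
shuffled = df_demo.iloc[rng.permutation(len(df_demo))]

# Positional split: first n_train rows for training, the rest for testing
n_train = 7  # the notebook uses 2100 of 3168 rows
df_demo_train = shuffled.iloc[:n_train]
df_demo_test = shuffled.iloc[n_train:]

print(df_demo_train.shape, df_demo_test.shape)
```

With a fixed seed, results such as the fitted coefficients can be compared across runs.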

In [16]:
# Split the training data into a label vector (gender) and a feature matrix
y_train = df_train["gender"].to_numpy()
X_train = df_train.drop(columns="gender").to_numpy()

In [17]:
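# Put one sample per column, so X_train has shape (features, samples).
# y_train is 1-D, so transposing it has no effect and it is left as is.
X_train = X_train.transpose()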

In [18]:
X_train.shape

(20, 2100)

In [19]:
def sigmoid(x):
    return 1/(1+np.exp(-x))
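
For inputs that are large and negative, `np.exp(-x)` overflows to `inf` and NumPy emits a `RuntimeWarning`, although the result still rounds to 0. The sketch below shows one common way to avoid this. `stable_sigmoid` is a hypothetical helper introduced here, not part of the original notebook.

```python
import numpy as np

def sigmoid(x):
    # Same definition as in the notebook, repeated so the block is self-contained
    return 1 / (1 + np.exp(-x))

def stable_sigmoid(x):
    """Sigmoid that only ever exponentiates non-positive numbers."""
    x = np.asarray(x, dtype=float)
    e = np.exp(-np.abs(x))  # always in [0, 1], so it cannot overflow
    # For x >= 0 use 1 / (1 + e^-x); for x < 0 use e^x / (1 + e^x)
    return np.where(x >= 0, 1 / (1 + e), e / (1 + e))

z = np.array([-1000.0, 0.0, 1000.0])
print(stable_sigmoid(z))
```

Both functions agree on moderate inputs; they differ only in how they behave at the extremes.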

In [20]:
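def stochGradDescent(outputVector, inputMatrix, learningSpeed):
    """Fit logistic regression weights with one pass of stochastic gradient descent.

    outputVector: 1-D array of binary labels (0 or 1), one per sample
    inputMatrix: array of shape (features, samples), one sample per column
    learningSpeed: step size applied to every update
    """
    a = learningSpeed
    y = outputVector
    X = inputMatrix
    numberOfTrials = y.shape[0]
    numberOfEstimators = X.shape[0]
    t = np.zeros(numberOfEstimators)  # parameter vector (theta)

    for i in range(numberOfTrials):
        # prediction error for sample i: label minus predicted probability
        error = y[i] - sigmoid(np.dot(t, X[:, i]))
        # update theta using this single sample
        t = t + a * error * X[:, i]

    return t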
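
The update inside the loop can be read as one gradient step on the log-likelihood of a single sample. As a sketch of that reasoning, write $\theta$ for `t`, $\alpha$ for `learningSpeed`, $x_i$ for the column `X[:,i]`, and $\sigma$ for the sigmoid (this notation is introduced here and does not appear in the code):

$$
\ell_i(\theta) = y_i \log \sigma(\theta^\top x_i) + (1 - y_i) \log\bigl(1 - \sigma(\theta^\top x_i)\bigr)
$$

$$
\nabla_\theta \, \ell_i(\theta) = \bigl(y_i - \sigma(\theta^\top x_i)\bigr)\, x_i
$$

$$
\theta \leftarrow \theta + \alpha \bigl(y_i - \sigma(\theta^\top x_i)\bigr)\, x_i
$$

Adding the gradient of the log-likelihood is the same as subtracting the gradient of the log loss $-\ell_i$, which is why the step has a plus sign even though the method is called gradient descent.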

In [27]:
theta = stochGradDescent(y_train,X_train,1)
theta

array([  -202.57869256,    -52.05485541,   -207.45971622,   -175.32922156,
         -236.90627494,    -61.57705338,  -3144.80138173, -25990.06534875,
         -925.58584899,   -366.81777735,   -189.99888655,   -202.57869256,
         -179.90325718,    -41.68140347,   -280.02746032,   -995.63588855,
          -70.58916016,  -6082.68222656,  -6012.09306641,   -184.31173083])
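
The coefficients above are very large in magnitude. A likely contributor is that the features sit on very different scales (`kurt` reaches the hundreds while most other columns stay below 1) combined with a learning rate of 1. One common remedy, which the original notebook does not apply, is to standardize each feature before training. The sketch below assumes the same (features, samples) layout as `X_train`; `standardize` and the toy matrix are hypothetical.

```python
import numpy as np

def standardize(X):
    """Scale each feature (row) to zero mean and unit variance.

    Returns the scaled matrix together with the per-feature mean and
    standard deviation, so the same transformation can be reused on
    the testing data.
    """
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # avoid dividing by zero for constant features
    return (X - mean) / std, mean, std

# Toy matrix with made-up values: two features on very different scales
X_demo = np.array([[0.05, 0.10, 0.15, 0.20],
                   [4.0, 270.0, 630.0, 1020.0]])
X_scaled, mean, std = standardize(X_demo)

print(X_scaled.mean(axis=1))  # close to 0 for every feature
print(X_scaled.std(axis=1))   # close to 1 for every feature
```

If this were used, the testing data would be scaled with the training `mean` and `std` rather than its own, so that both sets go through the same transformation.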

In [33]:
def predict(inputVector, estimatorVector):
    # Return 1 (male) if the predicted probability exceeds 0.5, else 0 (female)
    t = estimatorVector
    x = inputVector
    confidence = sigmoid(np.dot(t, x))
    return 1 if confidence > 0.5 else 0

In [38]:
predict(X_train[:,1223],theta)

1

In [39]:
y_train[1223]

0

3.7200759760208356e-44
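
A single prediction says little about how well the model works overall, and the held-out rows in `df_test` have not been used yet. As a minimal sketch of how they could be used, the hypothetical helper `accuracy` below applies the same 0.5 threshold as `predict` to every column of a (features, samples) matrix at once. The toy inputs are made up for illustration.

```python
import numpy as np

def sigmoid(x):
    # Same definition as in the notebook, repeated so the block is self-contained
    return 1 / (1 + np.exp(-x))

def accuracy(outputVector, inputMatrix, estimatorVector):
    """Fraction of samples whose thresholded prediction matches the label."""
    # One dot product per column gives the predicted probability per sample
    confidence = sigmoid(np.dot(estimatorVector, inputMatrix))
    predictions = (confidence > 0.5).astype(int)
    return np.mean(predictions == outputVector)

# Toy check: one feature, four samples, theta = [2.0]
X_demo = np.array([[-1.0, 0.5, 2.0, -3.0]])
y_demo = np.array([0, 1, 1, 1])
theta_demo = np.array([2.0])

print(accuracy(y_demo, X_demo, theta_demo))  # 0.75
```

On the notebook's data, the call would look like `accuracy(y_test, X_test, theta)`, where `y_test` and `X_test` are assumed names for arrays built from `df_test` with the same steps used for the training data.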