The aim of this notebook is to demonstrate how to implement logistic regression with stochastic gradient descent. 

In this project, we use logistic regression to recognize a speaker's gender from a voice sample. The data comes from Kaggle; each voice signal has already been analyzed, and its acoustic properties are recorded as numeric features.
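
For reference, the model and the update rule that the code in this notebook implements can be written out as follows. This is a sketch in the notebook's own terms; the symbol names are chosen here for illustration: $w$ is the weight vector, $x_i$ is one column of the input matrix (with a leading 1 for the intercept), $y_i \in \{0, 1\}$ is the label, and $a$ is the learning speed.

```latex
\begin{aligned}
\sigma(z) &= \frac{1}{1 + e^{-z}} \\
\ell_i(w) &= y_i \log \sigma(w^\top x_i) + (1 - y_i) \log\bigl(1 - \sigma(w^\top x_i)\bigr) \\
\nabla_w \ell_i(w) &= \bigl(y_i - \sigma(w^\top x_i)\bigr)\, x_i \\
w &\leftarrow w + a\,\bigl(y_i - \sigma(w^\top x_i)\bigr)\, x_i
\end{aligned}
```

Because the log likelihood is maximized, the update adds the gradient instead of subtracting it.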

# Prepare notebook and import data

In [161]:
import pandas as pd
import numpy as np
from random import randint  # used to pick a random sample in stochGradDescent
from math import log        # used in log_likelihood

In [162]:
# Import the data and check its dimensions (rows, columns)

df_voice = pd.read_csv("voice.csv")
df_voice.shape

(3168, 21)

Taking a look at the dataframe itself:

In [163]:
df_voice.head()

Unnamed: 0,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,...,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx,label
0,0.059781,0.064241,0.032027,0.015071,0.090193,0.075122,12.863462,274.402906,0.893369,0.491918,...,0.059781,0.084279,0.015702,0.275862,0.007812,0.007812,0.007812,0.0,0.0,male
1,0.066009,0.06731,0.040229,0.019414,0.092666,0.073252,22.423285,634.613855,0.892193,0.513724,...,0.066009,0.107937,0.015826,0.25,0.009014,0.007812,0.054688,0.046875,0.052632,male
2,0.077316,0.083829,0.036718,0.008701,0.131908,0.123207,30.757155,1024.927705,0.846389,0.478905,...,0.077316,0.098706,0.015656,0.271186,0.00799,0.007812,0.015625,0.007812,0.046512,male
3,0.151228,0.072111,0.158011,0.096582,0.207955,0.111374,1.232831,4.177296,0.963322,0.727232,...,0.151228,0.088965,0.017798,0.25,0.201497,0.007812,0.5625,0.554688,0.247119,male
4,0.13512,0.079146,0.124656,0.07872,0.206045,0.127325,1.101174,4.333713,0.971955,0.783568,...,0.13512,0.106398,0.016931,0.266667,0.712812,0.007812,5.484375,5.476562,0.208274,male


All the column names are listed below:

In [164]:
list(df_voice.columns)

['meanfreq',
 'sd',
 'median',
 'Q25',
 'Q75',
 'IQR',
 'skew',
 'kurt',
 'sp.ent',
 'sfm',
 'mode',
 'centroid',
 'meanfun',
 'minfun',
 'maxfun',
 'meandom',
 'mindom',
 'maxdom',
 'dfrange',
 'modindx',
 'label']

Since label is a categorical variable (male or female), it is convenient to convert it to numbers, where 0 represents female and 1 represents male:

In [165]:
gender_dict = {"male": 1.0, "female": 0.0}
df_voice["gender"] = df_voice["label"].map(gender_dict)
df_voice = df_voice.drop("label", axis=1)

In [166]:
df_voice.head()

Unnamed: 0,meanfreq,sd,median,Q25,Q75,IQR,skew,kurt,sp.ent,sfm,...,centroid,meanfun,minfun,maxfun,meandom,mindom,maxdom,dfrange,modindx,gender
0,0.059781,0.064241,0.032027,0.015071,0.090193,0.075122,12.863462,274.402906,0.893369,0.491918,...,0.059781,0.084279,0.015702,0.275862,0.007812,0.007812,0.007812,0.0,0.0,1.0
1,0.066009,0.06731,0.040229,0.019414,0.092666,0.073252,22.423285,634.613855,0.892193,0.513724,...,0.066009,0.107937,0.015826,0.25,0.009014,0.007812,0.054688,0.046875,0.052632,1.0
2,0.077316,0.083829,0.036718,0.008701,0.131908,0.123207,30.757155,1024.927705,0.846389,0.478905,...,0.077316,0.098706,0.015656,0.271186,0.00799,0.007812,0.015625,0.007812,0.046512,1.0
3,0.151228,0.072111,0.158011,0.096582,0.207955,0.111374,1.232831,4.177296,0.963322,0.727232,...,0.151228,0.088965,0.017798,0.25,0.201497,0.007812,0.5625,0.554688,0.247119,1.0
4,0.13512,0.079146,0.124656,0.07872,0.206045,0.127325,1.101174,4.333713,0.971955,0.783568,...,0.13512,0.106398,0.016931,0.266667,0.712812,0.007812,5.484375,5.476562,0.208274,1.0


This is the data we have. Looking at it, there are several things to address:

1. The rows are ordered so that the male voices come first and the female voices come later, so we need to shuffle the rows to make sure both male and female voices are represented in the learning process.
2. The labels male and female are text, and they must be converted to 0 and 1 before the data can be fed to the logistic regression algorithm (this was done above).
3. We need a set of testing data to test the algorithm.

In [167]:
df_voice = df_voice.iloc[np.random.permutation(len(df_voice))]

Next we split the data into training and testing sets, so that we can verify whether logistic regression accurately predicts the gender of a voice.

In [168]:
# The first 2100 shuffled rows are used for training, the remaining 1068 for testing
df_train = df_voice[:2100]
df_test = df_voice[2100:]

In [169]:
df_train.shape

(2100, 21)

In [170]:
# Split each set into a label vector (gender) and a feature matrix (all other columns)
# as_matrix() was removed from pandas; to_numpy() is its replacement
y_train = df_train["gender"].to_numpy()
X_train = df_train.drop("gender", axis=1).to_numpy()
y_test = df_test["gender"].to_numpy()
X_test = df_test.drop("gender", axis=1).to_numpy()

In [171]:
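# Arrange the data with one sample per column and prepend a row of ones for the intercept term
# (y_train and y_test are 1-D arrays, so they need no transpose)
X_train = X_train.transpose()
X_train = np.append(np.ones((1, X_train.shape[1])), X_train, axis=0)
X_test = X_test.transpose()
X_test = np.append(np.ones((1, X_test.shape[1])), X_test, axis=0)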

In [193]:
def sigmoid(x):
    # Logistic function; the print is debugging output
    result = 1.0/(1.0+np.exp(-x))
    print("SIGMOID: ", result)
    return result
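
The expression above calls np.exp(-x), which overflows when x is a large negative number: NumPy emits a RuntimeWarning, although the result still rounds to 0.0. Below is a minimal sketch of a variant that avoids the overflow. It is not part of the original notebook, and the name stable_sigmoid is hypothetical.

```python
import numpy as np

def stable_sigmoid(x):
    """Logistic function evaluated without overflowing np.exp."""
    x = np.asarray(x, dtype=float)
    # exp(-|x|) always lies in [0, 1], so it can never overflow
    e = np.exp(-np.abs(x))
    # for x >= 0: 1 / (1 + exp(-x)); for x < 0: exp(x) / (1 + exp(x))
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))
```

Both branches are algebraically equal to 1/(1+exp(-x)); the function only chooses the form whose exponent is non-positive. Note that the result can still round to exactly 0.0 or 1.0 for extreme inputs, so this alone does not make log(sigmoid(x)) safe.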

In [194]:
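def log_likelihood(x,y,w):
    # Log likelihood of a single sample x with label y under the weights w
    item = np.dot(np.transpose(w),x)
    print("NP.DOT: ", item)
    print(sigmoid(item))
    # math.log raises "ValueError: math domain error" if sigmoid(item) rounds to exactly 0.0 or 1.0
    result = y*log(sigmoid(item))+(1-y)*log(1.0-(sigmoid(item)))
    print(result)
    return result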
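
The per-sample log likelihood can also be computed without ever taking the log of a rounded probability. The sketch below uses the identity y*log(s(z)) + (1-y)*log(1-s(z)) = y*z - log(1 + exp(z)), where s is the sigmoid and z is the dot product of w and x. It is an alternative formulation, not the notebook's own code, and the name stable_log_likelihood is hypothetical.

```python
import numpy as np

def stable_log_likelihood(x, y, w):
    """Log likelihood of one sample, computed without calling log(0)."""
    z = np.dot(w, x)
    # np.logaddexp(0, z) evaluates log(exp(0) + exp(z)) = log(1 + exp(z))
    # without overflowing, even when z is in the thousands
    return y * z - np.logaddexp(0.0, z)
```

For a dot product of -0.515586743319 and a label of 0 this gives approximately -0.4682, and for very large dot products it returns a finite number instead of raising an error.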

In [198]:
def stochGradDescent(outputVector,inputMatrix,learningSpeed):
    # outputVector should be a binary vector of classifications, one entry per trial
    # inputMatrix: every column is a single trial, every row is a feature
    a = learningSpeed
    y = outputVector
    X = inputMatrix
    numberOfTrials = y.shape[0]
    numberOfEstimators = X.shape[0]
    prev_weight = np.ones(numberOfEstimators) # previous estimate of the weights (theta)
    weight = np.zeros(numberOfEstimators)     # current estimate of the weights
    prev_training_cost = 1000.0
    training_cost = 0 # This is what we're trying to maximize, the log likelihood
    # Outer loop: observe the training_cost
    while ((prev_training_cost < training_cost) or (abs(prev_training_cost - training_cost) > 1e-10)):
        training_cost = 0.0
        # Inner loop: update until a single step changes the weights by less than 1e-10
        while (np.linalg.norm(prev_weight-weight,ord=2) > 1e-10):
            prev_weight = weight
            # Randomly choose one trial (randint includes both endpoints)
            i = randint(0, numberOfTrials - 1)
            # grad is the scalar residual; the gradient for trial i is grad * X[:,i]
            grad = (y[i] - sigmoid(np.dot(np.transpose(prev_weight),X[:,i])))
            weight = prev_weight + a * grad * X[:,i]
            training_cost += log_likelihood(X[:,i],y[i],weight)
        prev_training_cost = training_cost
    return weight
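
The function above stops as soon as one update changes the weights by less than 1e-10, which depends heavily on which sample happens to be drawn. A common alternative is to run a fixed number of passes (epochs) over the shuffled data. The sketch below shows that variant on synthetic data; the function name sgd_logistic, its parameters, and the toy data are all assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

def sgd_logistic(X, y, learning_rate=0.1, epochs=50, seed=0):
    """Fit logistic regression by SGD. X has one sample per column."""
    rng = np.random.default_rng(seed)
    n_features, n_samples = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        # visit every sample once per epoch, in a fresh random order
        for i in rng.permutation(n_samples):
            z = np.dot(w, X[:, i])
            p = 0.5 * (1.0 + np.tanh(0.5 * z))  # equals sigmoid(z), overflow-free
            w = w + learning_rate * (y[i] - p) * X[:, i]
    return w

# Synthetic two-feature data (made up for illustration, not the voice data)
rng = np.random.default_rng(1)
features = rng.normal(size=(2, 200))
labels = (features[0] + features[1] > 0).astype(float)
X_demo = np.vstack([np.ones((1, 200)), features])  # row of ones for the intercept
w_demo = sgd_logistic(X_demo, labels)
accuracy = np.mean((np.dot(w_demo, X_demo) > 0) == (labels == 1.0))
```

The update line is the same rule as in stochGradDescent; only the stopping criterion and the sampling order differ.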

In [200]:
theta = stochGradDescent(y_train,X_train,1.0/200.0)
theta

SIGMOID:  0.5
NP.DOT:  -0.515586743319
SIGMOID:  0.373884778424
0.373884778424
SIGMOID:  0.373884778424
SIGMOID:  0.373884778424
-0.468220864794
SIGMOID:  0.464033857608
NP.DOT:  -0.280632720662
SIGMOID:  0.430298662992
0.430298662992
SIGMOID:  0.430298662992
SIGMOID:  0.430298662992
-0.562643025634
SIGMOID:  0.399266403097
NP.DOT:  -0.631121641142
SIGMOID:  0.347256252603
0.347256252603
SIGMOID:  0.347256252603
SIGMOID:  0.347256252603
-0.426570650312
SIGMOID:  0.303717550658
NP.DOT:  -0.268293822234
SIGMOID:  0.433326007041
0.433326007041
SIGMOID:  0.433326007041
SIGMOID:  0.433326007041
-0.836264931172
SIGMOID:  0.358139327148
NP.DOT:  -2.55451723152
SIGMOID:  0.0721235992536
0.0721235992536
SIGMOID:  0.0721235992536
SIGMOID:  0.0721235992536
-0.0748567439172
SIGMOID:  4.48880631096e-20
NP.DOT:  4108.28514751
SIGMOID:  1.0
1.0
SIGMOID:  1.0
SIGMOID:  1.0


ValueError: math domain error
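
The error follows from the last debug lines: a dot product of 4108.28514751 makes the sigmoid round to exactly 1.0 in floating point, so log_likelihood ends up evaluating log(1.0 - 1.0), that is log(0). The sketch below reproduces this in isolation, using only the value printed above.

```python
import math
import numpy as np

z = 4108.28514751                 # last NP.DOT value printed above
p = 1.0 / (1.0 + np.exp(-z))      # exp(-z) underflows to 0.0, so p is exactly 1.0

try:
    math.log(1.0 - p)             # log(0.0) is undefined
    raised = False
except ValueError:
    raised = True                 # the same ValueError as in the notebook
```

A likely contributor to the jump from a dot product of about -2.55 to about 4108 in one step is that the features are not scaled: each update multiplies the learning speed by raw feature values, and kurt alone reaches 1024.927705 in the rows shown earlier.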

In [176]:
def predict(inputVector,estimatorVector):
    t = estimatorVector
    x = inputVector
    confidence = sigmoid(np.dot(np.transpose(t) , x))
    if confidence > 0.5:
        return 1
    else:
        return 0

In [177]:
# Count correct and wrong predictions over every column (sample) of the test set
correct = 0
wrong = 0
total = 0
for j in range(X_test.shape[1]):
    # print(predict(X_test[:,j],theta), y_test[j])
    if (y_test[j] == predict(X_test[:,j],theta)):
        correct += 1
    else:
        wrong += 1
    total += 1
print(correct, wrong, total)

535 533 1068
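
The three counts above can be turned into an accuracy figure. The snippet below only restates the printed numbers; it does not rerun the model.

```python
correct, wrong, total = 535, 533, 1068   # counts printed above
accuracy = correct / total
print(f"accuracy = {accuracy:.3f}")      # prints accuracy = 0.501
```

An accuracy of about 50% is roughly what guessing would achieve if the two classes are about equally common in the test set, so these weights have not learned a useful decision rule. The execution counters also suggest that this cell (In [177]) was last run before the training cell (In [200]) that ended in the error above.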
