# **Import Datasets**

### Fetching UCI Diabetes dataset

I have used [Pima Indians Diabetes Dataset](https://www.kaggle.com/uciml/pima-indians-diabetes-database) for implementing the models. The dataset is available with this repository. You can also download from the link given above.

**Context** : This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.

**Content** : The datasets consists of several medical predictor variables and one target variable ```Outcome```. Predictor variables (Features) includes,
- ```Number of pregnancies```
- ```Glucose```
- ```Blood Pressure```
- ```Skin thickness```
- ```Insulin level```
- ```BMI```
- ```Diabetes Pedegree Function```
- ```Age```

**Classification task** : In each instance ${x_i}$, from the given feature values, the task is to predict if the patient instance has diabetes or not.

In [1]:
import pandas as pd
df = pd.read_csv('./datasets/diabetes.csv')
df.head(5)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


# **Half Space using LP**

### Data Preparation

In [16]:
import pandas as pd
import numpy as np
df = pd.read_csv('./datasets/diabetes.csv')

Y = np.array(df['Outcome'])
X = np.array([list(df.loc[i][:-1]) for i in range(len(df))])

In [17]:
for i in range(len(Y)):
  if Y[i] == 0:
    Y[i]=-1

size = len(Y)
random_indice = np.random.permutation(size)
num_train = int(size*0.7)
num_test = int(size*0.3)

X_train = X[random_indice[:num_train]]
y_train = Y[random_indice[:num_train]]
X_test = X[random_indice[-num_test:]]
y_test = Y[random_indice[-num_test:]]

### Formulation of LP

Maximize ${1}$ with subject to ${<\underline{w}, y_i\underline{x_i}> \ge 1}$

In [18]:
# yixi matrix (not inner product)
res = np.array([[0 for a in range(len(X_train[0]))] for b in range(num_train)])
for i in range(num_train):
  for j in range(len(X_train[0])):
    res[i][j] = y_train[i] * X_train[i][j]

In [19]:
import pulp as p

Lp_prob = p.LpProblem('HSLP',p.LpMinimize)
Lp_prob+=1

w = np.array([p.LpVariable('w'+str(i)) for i in range(len(X_train[0]))])
w = np.transpose(w)
inner_prod = list(np.inner(w,res)) # <w, yixi>
for i in inner_prod:
  Lp_prob+=i>=1

status = Lp_prob.solve()
print(p.LpStatus[status])

for i in range(len(X_train[0])):
  print('w'+str(i)+' = '+str(p.value(w[i])))

Infeasible
w0 = 0.026120028
w1 = 0.0067761417
w2 = -0.023588319
w3 = -0.0029847554
w4 = -0.00010776914
w5 = -0.0051484147
w6 = 0.054173608
w7 = -0.00059618779


### Prediction using Test set

Using the obtained weight vector $\underline{w}$, the Test set is predicted and the obtained accuracy is around ${69.56\%}$

In [20]:
w_val = [p.value(w[i]) for i in range(len(w))]
y_pred = list()
for i in range(num_test):
  if np.inner(w_val, X_test[i])>0:
    y_pred.append(1)
  else :
    y_pred.append(-1)

cclf=0
for i in range(num_test):
  if y_pred[i]==y_test[i]:
    cclf+=1
print("Accuracy = "+str(cclf/num_test))

Accuracy = 0.6521739130434783


# **Half Space using Perceptron**

The Perceptron model starts with some initial random $\underline{w}$ and gets updated as it scans each instance of the Training set.

### Data Preparation

In [21]:
import pandas as pd
import numpy as np
df = pd.read_csv('./datasets/diabetes.csv')

Y = np.array(df['Outcome'])
X = np.array([list(df.loc[i][:-1]) for i in range(len(df))])

In [22]:
for i in range(len(Y)):
  if Y[i] == 0:
    Y[i]=-1

size = len(Y)
random_indice = np.random.permutation(size)
num_train = int(size*0.7)
num_test = int(size*0.3)

X_train = X[random_indice[:num_train]]
y_train = Y[random_indice[:num_train]]
X_test = X[random_indice[-num_test:]]
y_test = Y[random_indice[-num_test:]]

### Training iteratively using Train set

In [23]:
import random
w = np.array([random.random() for _ in range(len(X_train[0]))])
print("Initial weight : ",w)
for i in range(num_train):
  inner_prod = np.inner(w, X_train[i])
  if y_train[i]*inner_prod <= 0 :
    w = np.add(w, np.dot(y_train[i], X_train[i]))
print("Updated weight : ",w)

Initial weight :  [0.01457531 0.38077669 0.81243782 0.2058043  0.01818444 0.94797902
 0.12066552 0.03133738]
Updated weight :  [  76.01457531  -28.61922331 -337.18756218  -74.7941957    -7.98181556
    7.04797902    7.54266552 -118.96866262]


### Prediction using Test set

Using the updated weight vector $\underline{w}$, the prediction is performed from Test set instances and the obtained accuracy is around ${63.47\%}$

In [24]:
y_pred=list()
for i in range(num_test):
  if np.inner(w, X_test[i]) > 0 :
    y_pred.append(1)
  else :
    y_pred.append(-1)
cclf=0
for i in range(num_test):
  if y_pred[i]==y_test[i]:
    cclf+=1
print("Accuracy = "+str(cclf/num_test))

Accuracy = 0.6565217391304348


# **Logistic Regression using Gradient Descent**

It starts with some initial $\underline{w}$ and applies gradient descent using the training set till convergence. Here, around ${10^6}$ iterations have been performed and the learning rate $\alpha = 0.00015$

### Data preparation

In [25]:
import pandas as pd
import numpy as np
df = pd.read_csv('./datasets/diabetes.csv')

Y = np.array(df['Outcome'])
X = np.array([list(df.loc[i][:-1]) for i in range(len(df))])

In [26]:
import numpy as np
size = len(Y)
random_indice = np.random.permutation(size)
num_train = int(size*0.7)
num_test = int(size*0.3)

X_train = X[random_indice[:num_train]]
y_train = Y[random_indice[:num_train]]
X_test = X[random_indice[-num_test:]]
y_test = Y[random_indice[-num_test:]]

### Applying Gradient descent using Training set

In [27]:
def sigmoid(z):
  return 1/(1+np.exp(-z))

import random
epoch = 1000000
alpha = 0.00015
w = [0 for _ in range(len(X_train[0]))]
b=0
print("Initial weight : ",w)
num_features = len(X_train[0])
for i in range(epoch):
  z = np.dot(X_train, np.transpose(w))+b
  h = sigmoid(z)
  cost = -(1/num_features)*(y_train*np.log(h) + (1-y_train)*np.log(1-h))
  d_cost = (1/num_features)*np.dot(np.transpose(X_train), (h-y_train))
  d_b = (1/num_features)*(h-y_train)
  w = w - alpha*np.transpose(d_cost)
  b = b - alpha*b
print("Updated weight : ",w)

Initial weight :  [0, 0, 0, 0, 0, 0, 0, 0]


  cost = -(1/num_features)*(y_train*np.log(h) + (1-y_train)*np.log(1-h))
  cost = -(1/num_features)*(y_train*np.log(h) + (1-y_train)*np.log(1-h))


Updated weight :  [ 5.68597219 -0.08103908 -1.3511854  -0.2805684  -0.18509202 -0.16750063
 13.336995   -0.04240825]


### Predicting using Test set

Using the obtained weight vector $\underline{w}$, the Test set is predicted and the obtained accuracy is around ${63.47\%}$

In [28]:
z_pred = np.dot(X_test, np.transpose(w))+b
h_pred = sigmoid(z_pred)

cclf = 0
for i in range(num_test) :
  if y_test[i]==1 and h_pred[i]>0.5:
    cclf+=1
  if y_test[i]==0 and h_pred[i]<0.5:
    cclf+=1

print("Accuracy : "+str(cclf/num_test))

Accuracy : 0.6304347826086957
