# Diabetes Data Set

Dataset file: 'diabetes.data'  
Reference link for description of dataset: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

### Preview of the Data Set

Load the data set.

a) Analyse the data set. Print the number of features, feature names, data types of the features, number of data points and the values of the first 10 data points.

In [55]:
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

In [56]:
df = pd.read_table("./diabetes.data")
df.describe()

| | AGE | SEX | BMI | BP | S1 | S2 | S3 | S4 | S5 | S6 | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 |
| mean | 48.5181 | 1.468326 | 26.375792 | 94.647014 | 189.140271 | 115.43914 | 49.788462 | 4.070249 | 4.641411 | 91.260181 | 152.133484 |
| std | 13.109028 | 0.499561 | 4.418122 | 13.831283 | 34.608052 | 30.413081 | 12.934202 | 1.29045 | 0.522391 | 11.496335 | 77.093005 |
| min | 19.0 | 1.0 | 18.0 | 62.0 | 97.0 | 41.6 | 22.0 | 2.0 | 3.2581 | 58.0 | 25.0 |
| 25% | 38.25 | 1.0 | 23.2 | 84.0 | 164.25 | 96.05 | 40.25 | 3.0 | 4.2767 | 83.25 | 87.0 |
| 50% | 50.0 | 1.0 | 25.7 | 93.0 | 186.0 | 113.0 | 48.0 | 4.0 | 4.62005 | 91.0 | 140.5 |
| 75% | 59.0 | 2.0 | 29.275 | 105.0 | 209.75 | 134.5 | 57.75 | 5.0 | 4.9972 | 98.0 | 211.5 |
| max | 79.0 | 2.0 | 42.2 | 133.0 | 301.0 | 242.4 | 99.0 | 9.09 | 6.107 | 124.0 | 346.0 |


In [57]:
df.head(10)

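| | AGE | SEX | BMI | BP | S1 | S2 | S3 | S4 | S5 | S6 | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 59 | 2 | 32.1 | 101.0 | 157 | 93.2 | 38.0 | 4.0 | 4.8598 | 87 | 151 |
| 1 | 48 | 1 | 21.6 | 87.0 | 183 | 103.2 | 70.0 | 3.0 | 3.8918 | 69 | 75 |
| 2 | 72 | 2 | 30.5 | 93.0 | 156 | 93.6 | 41.0 | 4.0 | 4.6728 | 85 | 141 |
| 3 | 24 | 1 | 25.3 | 84.0 | 198 | 131.4 | 40.0 | 5.0 | 4.8903 | 89 | 206 |
| 4 | 50 | 1 | 23.0 | 101.0 | 192 | 125.4 | 52.0 | 4.0 | 4.2905 | 80 | 135 |
| 5 | 23 | 1 | 22.6 | 89.0 | 139 | 64.8 | 61.0 | 2.0 | 4.1897 | 68 | 97 |
| 6 | 36 | 2 | 22.0 | 90.0 | 160 | 99.6 | 50.0 | 3.0 | 3.9512 | 82 | 138 |
| 7 | 66 | 2 | 26.2 | 114.0 | 255 | 185.0 | 56.0 | 4.55 | 4.2485 | 92 | 63 |
| 8 | 60 | 2 | 32.1 | 83.0 | 179 | 119.4 | 42.0 | 4.0 | 4.4773 | 94 | 110 |
| 9 | 29 | 1 | 30.0 | 85.0 | 180 | 93.4 | 43.0 | 4.0 | 5.3845 | 88 | 310 |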


In [58]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
AGE    442 non-null int64
SEX    442 non-null int64
BMI    442 non-null float64
BP     442 non-null float64
S1     442 non-null int64
S2     442 non-null float64
S3     442 non-null float64
S4     442 non-null float64
S5     442 non-null float64
S6     442 non-null int64
Y      442 non-null int64
dtypes: float64(6), int64(5)
memory usage: 38.1 KB


In [59]:
print("Total number of data points : {}".format(len(df)))

Total number of data points : 442


In [60]:
df.columns[:-1]

Index(['AGE', 'SEX', 'BMI', 'BP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6'], dtype='object')

### Training and Testing Data Sets

b) Split the data set into training and testing data sets with an 80:20 ratio.

(Hint: What precautions must you take before you split the data set?)

In [61]:
train_test_split_ratio = 0.8
X = df[df.columns[:-1]].values
y = df[df.columns[-1]].values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_test_split_ratio, random_state=42)

Before splitting, the data must be shuffled; otherwise any ordering in the file could bias the 80:20 split. `train_test_split` shuffles by default, and `random_state=42` makes the split reproducible. Both splits should also come from the same distribution; otherwise the test error gives a misleading impression of the model.
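The effect of ordering can be seen in a small sketch. It uses hypothetical synthetic targets stored in ascending order (not the diabetes data) and a hand-written helper, `split_80_20`, that is an illustration only and not part of scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 100 targets stored in ascending order, as a file might be.
y = np.sort(rng.normal(loc=150.0, scale=75.0, size=100))

def split_80_20(values, shuffle):
    """Return (train, test) for an 80:20 split, optionally shuffling first."""
    idx = np.arange(len(values))
    if shuffle:
        idx = np.random.default_rng(0).permutation(len(values))
    cut = len(values) * 4 // 5
    return values[idx[:cut]], values[idx[cut:]]

train_o, test_o = split_80_20(y, shuffle=False)
train_s, test_s = split_80_20(y, shuffle=True)

# Compare how far apart the train and test means are in each case.
gap_ordered = abs(train_o.mean() - test_o.mean())
gap_shuffled = abs(train_s.mean() - test_s.mean())
print(f"mean gap without shuffling: {gap_ordered:.1f}")
print(f"mean gap with shuffling:    {gap_shuffled:.1f}")
```

Without shuffling, the test split contains only the largest targets, so its mean is far from the training mean; after shuffling the two splits look much more alike.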

### Linear Regression

c) Using linear regression, seek a model for the response of interest ($Y$), as a function of the baseline variables such as age, sex, body mass index, etc. Compute the training error and testing error.

In [62]:
#Fitting using linear regression
reg = LinearRegression().fit(X_train, y_train)
coefficients = reg.coef_.copy()
print("The parameters are([1,x]): {}".format(coefficients))
#MSE
y_pred_train = reg.predict(X_train)
y_pred_test = reg.predict(X_test)
print("MSE_train:{}".format(np.mean((y_train-y_pred_train)**2)))
print("MSE_test:{}".format(np.mean((y_test-y_pred_test)**2)))

The parameters are([1,x]): [  0.13768782 -23.06446772   5.84636265   1.19709252  -1.28168474
   0.81115203   0.60165319  10.15953917  67.1089624    0.20159907]
MSE_train:2868.549702835578
MSE_test:2900.1936284934827


### Data Preprocessing

d) Normalize the data set and perform linear regression again. Compute the training error and testing error. Comment.

In [63]:
#Standardize every column (population std, ddof=0, which is what np.std computes)
df_norm = (df - df.mean()) / df.std(ddof=0)
X = df_norm[df_norm.columns[:-1]].values
y = df_norm[df_norm.columns[-1]].values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_test_split_ratio, random_state=42)
#Fitting using linear regression
reg = LinearRegression().fit(X_train, y_train)
coefficients = reg.coef_.copy()
print("The parameters are([1,x]): {}".format(coefficients))
#MSE
y_pred_train = reg.predict(X_train)
y_pred_test = reg.predict(X_test)
print("MSE_train:{}".format(np.mean((y_train-y_pred_train)**2)))
print("MSE_test:{}".format(np.mean((y_test-y_pred_test)**2)))

The parameters are([1,x]): [ 0.02341267 -0.1494573   0.33504909  0.2147708  -0.57536494  0.31999832
  0.10094177  0.17005922  0.45473761  0.03006304]
MSE_train:0.48374458403571424
MSE_test:0.4890809314029949


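The training and test errors are on a different scale than in c), but the fit itself is unchanged. Ordinary least squares with an intercept is invariant to shifting and rescaling the features and the target, so standardizing only changes the units: each error here equals the corresponding error in c) divided by the population variance of $Y$ (about $77.0^2 \approx 5930$), that is, $2868.55 / 5930 \approx 0.484$ and $2900.19 / 5930 \approx 0.489$. The gap between training and test error is small in both cases, which suggests the linear model is not overfitting. Note that the normalization statistics are computed on the full data set before splitting, so the test set influences the scaling; computing them on the training split alone would avoid that leakage.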
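As a sanity check on how standardization affects a least-squares fit, here is a minimal sketch on hypothetical synthetic data (not the diabetes data). The helper `ols_mse` is an assumption made for illustration; it fits an intercept plus coefficients with `np.linalg.lstsq` and returns the training MSE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: three features on very different scales, plus noise.
n = 200
X = rng.normal(size=(n, 3)) * np.array([1.0, 50.0, 0.1]) + np.array([10.0, 200.0, 5.0])
y = X @ np.array([3.0, 0.2, 40.0]) + rng.normal(scale=5.0, size=n)

def ols_mse(features, target):
    """Fit least squares with an intercept and return the training MSE."""
    design = np.column_stack([np.ones(len(features)), features])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.mean((target - design @ beta) ** 2)

# Standardize with population statistics (ddof=0).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_std = (y - y.mean()) / y.std()

mse_raw = ols_mse(X, y)
mse_std = ols_mse(X_std, y_std)

# The standardized error should equal the raw error divided by Var(y).
print(mse_raw / y.var(), mse_std)
```

The two printed numbers agree up to floating-point error, which is the same relationship that links the errors in c) and d).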

### Feature Reduction

e) Rank the features in order of importance (based on the study in d)). Comment.

In [64]:
#Based on the coefficients of the normalized model, from high to low importance
df.columns[:-1][np.argsort(abs(coefficients))][::-1]

Index(['S1', 'S5', 'BMI', 'S2', 'BP', 'S4', 'SEX', 'S3', 'S6', 'AGE'], dtype='object')

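Because every feature has been standardized to unit variance, the coefficients are directly comparable: a larger absolute value means that a one-standard-deviation change in that feature moves the predicted $Y$ more. By this measure S1, S5 and BMI matter most and AGE matters least. The ranking should be read with care, since correlated features can share or trade off coefficient weight between them.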
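To see the magnitudes alongside the order, the following sketch pairs each feature name with its coefficient. The values are copied from the printed output of the normalized model in d); the variable names are illustrative.

```python
import numpy as np

# Feature names and standardized coefficients copied from the output in d).
names = np.array(["AGE", "SEX", "BMI", "BP", "S1", "S2", "S3", "S4", "S5", "S6"])
coefs = np.array([0.02341267, -0.1494573, 0.33504909, 0.2147708, -0.57536494,
                  0.31999832, 0.10094177, 0.17005922, 0.45473761, 0.03006304])

# Sort by absolute value, largest first.
order = np.argsort(-np.abs(coefs))
ranking = [(str(names[i]), float(coefs[i])) for i in order]

for name, coef in ranking:
    print(f"{name:>3}  {coef:+.3f}")

ranked_names = [name for name, _ in ranking]
```

The sign is kept in the printout because it shows the direction of the effect, even though only the magnitude is used for ranking.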

### Polynomial Regression

f) Repeat the exercise in d) with quadratic features. List the features you would add to the existing data set. Compute the training error and the testing error. Comment.

In [65]:
#Given degree
degree = 2
#Generating quadratic features 
poly = PolynomialFeatures(degree)
x_poly_train = poly.fit_transform(X_train)  # X_train is already normalized
x_poly_test = poly.transform(X_test)  # reuse the transformer fitted on the training set
#Fitting using linear regression
reg = LinearRegression().fit(x_poly_train, y_train)
coefficients = reg.coef_.copy()
coefficients[0] = reg.intercept_
print("The parameters are([1,x1,x2,x1^2,x2^2,x1x2 ....]): {}".format(coefficients))
#MSE
y_pred_train = reg.predict(x_poly_train)
y_pred_test = reg.predict(x_poly_test)
print("MSE train:{}".format(np.mean((y_train-y_pred_train)**2)))
print("MSE test:{}".format(np.mean((y_test-y_pred_test)**2)))

The parameters are([1,x1,x2,x1^2,x2^2,x1x2 ....]): [-1.12235040e+00  6.72214753e-02 -1.89605629e-01  2.60650414e-01
  2.41710347e-01 -1.04993046e+01  9.15358301e+00  3.74258830e+00
  2.74078125e-02  3.85639373e+00  6.08856798e-03  7.02270314e-02
  4.47947305e-02 -3.79148269e-02  1.94484668e-02 -2.75815317e-02
 -2.12802768e-01  1.73563607e-01  2.87852421e-01  3.60041436e-02
  2.52060913e-02 -2.40707795e-02  1.01917444e-02  3.68516819e-02
  4.72530327e-02  4.86186091e-02 -1.03692315e-01 -2.23353507e-01
  6.26225636e-02  2.02187025e-02  2.35305458e-02  9.83629779e-02
 -1.80860261e-01  2.14391814e-01 -6.21200202e-03 -1.05307933e-01
  1.08577793e-01  1.39347696e-02 -7.53818382e-03  5.85598533e-01
 -4.30068414e-01 -2.43465913e-01 -4.10180644e-02 -2.31927001e-01
 -9.70101671e-02  2.74379102e+00 -3.78289120e+00 -1.80846970e+00
 -7.41508726e-01 -1.18382880e+00 -2.30272930e-01  1.34215970e+00
  1.01226226e+00  1.89956593e-01  6.95116226e-01  1.21620331e-01
  3.24261196e-01  4.75168727e-01  3.901

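The features I would add are the squares of each feature (`AGE^2`, `BMI^2`, `S1^2`, ...) and all pairwise products (`AGE*BMI`, `AGE*SEX`, `AGE*S1`, `S1*S2`, `S3*S4`, etc.). With 10 original features this gives 10 squares and 45 products, so 55 new features in total.

This is a clear case of overfitting in comparison with the linear model: the training error is lower than for the linear model, but the test error is considerably larger than the training error, so the gap between the two is large.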
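The full list of added terms can be enumerated directly. The sketch below is a NumPy-only illustration, with hypothetical helpers `quadratic_terms` and `expand_quadratic`. Its column order (bias, linear terms, then degree-2 terms) is intended to mirror a degree-2 `PolynomialFeatures` expansion, and `X_demo` holds placeholder values rather than real data.

```python
from itertools import combinations_with_replacement
import numpy as np

names = ["AGE", "SEX", "BMI", "BP", "S1", "S2", "S3", "S4", "S5", "S6"]

def quadratic_terms(feature_names):
    """List every degree-2 term: squares and pairwise products."""
    terms = []
    for a, b in combinations_with_replacement(feature_names, 2):
        terms.append(f"{a}^2" if a == b else f"{a}*{b}")
    return terms

def expand_quadratic(X):
    """Return columns [1, x_i, x_i * x_j (i <= j)] for a 2-D array X."""
    pairs = combinations_with_replacement(range(X.shape[1]), 2)
    products = [X[:, i] * X[:, j] for i, j in pairs]
    return np.column_stack([np.ones(len(X)), X] + products)

new_terms = quadratic_terms(names)
X_demo = np.arange(30, dtype=float).reshape(3, 10)  # placeholder values
X_quad = expand_quadratic(X_demo)

print(len(new_terms), X_quad.shape)
print(new_terms[:5])
```

With 66 columns and only about 350 training rows, the quadratic model has far more freedom than the 10-feature linear model, which is consistent with the overfitting described above.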