# Diabetes Data Set

Dataset file: 'diabetes.data'  
Reference link for description of dataset: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

### Preview of the Data Set

Load the data set.

a) Analyse the data set. Print the number of features, feature names, data types of the features, number of data points and the values of the first 10 data points.

In [17]:
import pandas as pd
import numpy as np
import csv
from sklearn import preprocessing
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [2]:
data = pd.read_csv('diabetes.data', delimiter = '\t')

In [3]:
print('number of features: ', len(data.columns))

number of features:  11


In [4]:
data.columns

Index(['AGE', 'SEX', 'BMI', 'BP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'Y'], dtype='object')

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
AGE    442 non-null int64
SEX    442 non-null int64
BMI    442 non-null float64
BP     442 non-null float64
S1     442 non-null int64
S2     442 non-null float64
S3     442 non-null float64
S4     442 non-null float64
S5     442 non-null float64
S6     442 non-null int64
Y      442 non-null int64
dtypes: float64(6), int64(5)
memory usage: 38.1 KB


In [7]:
data.head(10)

Unnamed: 0,AGE,SEX,BMI,BP,S1,S2,S3,S4,S5,S6,Y
0,59,2,32.1,101.0,157,93.2,38.0,4.0,4.8598,87,151
1,48,1,21.6,87.0,183,103.2,70.0,3.0,3.8918,69,75
2,72,2,30.5,93.0,156,93.6,41.0,4.0,4.6728,85,141
3,24,1,25.3,84.0,198,131.4,40.0,5.0,4.8903,89,206
4,50,1,23.0,101.0,192,125.4,52.0,4.0,4.2905,80,135
5,23,1,22.6,89.0,139,64.8,61.0,2.0,4.1897,68,97
6,36,2,22.0,90.0,160,99.6,50.0,3.0,3.9512,82,138
7,66,2,26.2,114.0,255,185.0,56.0,4.55,4.2485,92,63
8,60,2,32.1,83.0,179,119.4,42.0,4.0,4.4773,94,110
9,29,1,30.0,85.0,180,93.4,43.0,4.0,5.3845,88,310


### Training and Testing Data Sets

b) Split the data set into training and testing data set with a 80:20 ratio.

(Hint: What precautions must you take before you split the data set?)

The data should be shuffled properly

In [8]:
train, test = train_test_split(data, test_size = 0.2)

### Linear Regression

c) Using linear regression, seek a model for the response of interest ($Y$), as a function of the baseline variables such as age, sex, body mass index, etc. Compute the training error and testing error.

In [9]:
xtrain = train.drop(columns = ['Y'], axis = 1)
ytrain = train['Y']
xtest = test.drop(columns = ['Y'], axis = 1)
ytest = test['Y']

In [10]:
model = LinearRegression()
model.fit(xtrain,ytrain)
print('train_error: ', mean_squared_error(model.predict(xtrain), ytrain))
print('test_error: ', mean_squared_error(model.predict(xtest), ytest))

train_error:  2897.6530536113396
test_error:  2751.6307478959934


In [11]:
scaler = preprocessing.MinMaxScaler()
data_scaled = scaler.fit_transform(data)

In [12]:
train, test = train_test_split(data_scaled, test_size = 0.2)

In [13]:
xtrain = train[:,:-1]
ytrain = train[:,-1]
xtest = test[:,:-1]
ytest = test[:,-1]

In [14]:
model2 = LinearRegression()
model2.fit(xtrain,ytrain)
print('train error: ',mean_squared_error(model2.predict(xtrain), ytrain))
print('test error: ',mean_squared_error(model2.predict(xtest), ytest))

train error:  0.02748195973854094
test error:  0.030104986164587674


#The train and test error for the model are much lower for the normalized data compared to the original data

### Feature Reduction

e) Rank the features in order of importance (based on the study in d)). Comment.

In [15]:
def regression(data):
    train, test = train_test_split(data, test_size = 0.2)
    xtrain = train[:,:-1]
    ytrain = train[:,-1]
    xtest = test[:,:-1]
    ytest = test[:,-1]
    model.fit(xtrain, ytrain)
    train_error = mean_squared_error(model.predict(xtrain), ytrain)
    test_error = mean_squared_error(model.predict(xtest), ytest)
    return train_error, test_error

In [16]:
feature_rank = {}
for i in range(10):
    data2 = np.delete(data_scaled, [i], axis = 1)
    train_error, test_error = regression(data2)
    feature_rank[i] = train_error
    #print(train_error, test_error)
    
rank =sorted(feature_rank.items(),key = lambda x:x[1], reverse=True)
print(rank)

[(2, 0.033329229749865213), (8, 0.030624029724290517), (7, 0.02935608076712675), (3, 0.028441452857575416), (0, 0.02843095044583374), (4, 0.02804731263203712), (9, 0.027881125609874886), (6, 0.02781606239329534), (1, 0.02760661560119698), (5, 0.027166144176737548)]


The dictionary shows the training error after the specific feature is removed and the sorted accordinly to show the importance of each feature.

### Polynomial Regression

f) Repeat the exercise in d) with quadratic features. List the features you would add to the existing data set. Compute the training error and the testing error. Comment.

In [35]:
def polynomial_reg(x, y):
    poly = PolynomialFeatures(2)
    xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size = 0.2)
    model = LinearRegression()
    model.fit(xtrain, ytrain)
    train_error = mean_squared_error(model.predict(xtrain), ytrain)
    test_error = mean_squared_error(model.predict(xtest), ytest)
    return train_error, test_error

In [54]:
poly = PolynomialFeatures(2)
data_quad = poly.fit_transform(data_scaled[:,:-1])
linear_train_error, linear_test_error = polynomial_reg(data_quad, data_scaled[:,-1])

In [55]:
print(linear_train_error, linear_test_error)

0.023285962174396345 0.03324387273802605


Clearly it shows that the model is overfitting because of the high test error and low training error compared to simple linear regression model.

In [61]:
feature_ranking = {}
delete_features = []
for i in range(78):
    data2 = np.delete(data_quad, [i],axis = 1)
    train_error, test_error = polynomial_reg(data2, data_scaled[:,-1])
    feature_ranking[i] = test_error
    if train_error < linear_train_error:
        delete_features.append(i)

sorted(feature_ranking.items(), key = lambda x:x[1], reverse = True)

  after removing the cwd from sys.path.


[(3, 0.8742410664825541),
 (16, 0.7855236992977713),
 (62, 0.48579530134879684),
 (19, 0.29873552670489545),
 (76, 0.1637622850391086),
 (42, 0.13334013330464903),
 (30, 0.12707578412696569),
 (44, 0.09728042625978034),
 (39, 0.09144273103696436),
 (12, 0.09073021344095814),
 (57, 0.07740002198830387),
 (31, 0.06567961267031058),
 (45, 0.06395587545321656),
 (7, 0.06184591853676619),
 (35, 0.057178000541244695),
 (29, 0.05642248142077081),
 (66, 0.054877565879045),
 (25, 0.05418300221713954),
 (20, 0.05080927513241872),
 (52, 0.045975714302597566),
 (32, 0.04410266611292314),
 (54, 0.04389048391816163),
 (65, 0.04221773471533749),
 (0, 0.04119100014869592),
 (63, 0.04111620029142311),
 (26, 0.041006158779809024),
 (43, 0.04097066291416401),
 (64, 0.04063459989793153),
 (77, 0.04034493704970894),
 (2, 0.03984930074937661),
 (13, 0.03972032797802737),
 (28, 0.03954460771130657),
 (37, 0.03927187882815178),
 (50, 0.03884377886886601),
 (46, 0.03826398034769101),
 (56, 0.03775759684573211)

remove the feature which decrease the test error after training the model.

In [62]:
final_xtrain = np.delete(data_quad, delete_features, axis = 1)

  """Entry point for launching an IPython kernel.


In [64]:
train_error, test_error = polynomial_reg(final_xtrain, data_scaled[:,-1])

In [65]:
print(train_error, test_error)

0.025228450654240907 0.029208412395730243


Ranking the features based on the test accuracy of the model decreased the overfitting of the model by removing certain features which increase the complexity of the model.