# AI 1 Exercise 3: "Building AI Solutions"

## 3.3 Diabetes

#### a) You start to explore and analyze the data. Does the Diabetes Pedigree function predict diabetes well?

In [21]:
import pandas as pd

# import the data
data = pd.read_csv("diabetes.csv")
data

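| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |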


In [22]:
# Get the correlation matrix to check the correlation of the pedigree function with the outcome
data.corr()

# Result: There is only a weak positive correlation (0.173844) between the pedigree function and the outcome,
# so on its own the pedigree function does not predict diabetes well.

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| Pregnancies | 1.0 | 0.129459 | 0.141282 | -0.081672 | -0.073535 | 0.017683 | -0.033523 | 0.544341 | 0.221898 |
| Glucose | 0.129459 | 1.0 | 0.15259 | 0.057328 | 0.331357 | 0.221071 | 0.137337 | 0.263514 | 0.466581 |
| BloodPressure | 0.141282 | 0.15259 | 1.0 | 0.207371 | 0.088933 | 0.281805 | 0.041265 | 0.239528 | 0.065068 |
| SkinThickness | -0.081672 | 0.057328 | 0.207371 | 1.0 | 0.436783 | 0.392573 | 0.183928 | -0.11397 | 0.074752 |
| Insulin | -0.073535 | 0.331357 | 0.088933 | 0.436783 | 1.0 | 0.197859 | 0.185071 | -0.042163 | 0.130548 |
| BMI | 0.017683 | 0.221071 | 0.281805 | 0.392573 | 0.197859 | 1.0 | 0.140647 | 0.036242 | 0.292695 |
| DiabetesPedigreeFunction | -0.033523 | 0.137337 | 0.041265 | 0.183928 | 0.185071 | 0.140647 | 1.0 | 0.033561 | 0.173844 |
| Age | 0.544341 | 0.263514 | 0.239528 | -0.11397 | -0.042163 | 0.036242 | 0.033561 | 1.0 | 0.238356 |
| Outcome | 0.221898 | 0.466581 | 0.065068 | 0.074752 | 0.130548 | 0.292695 | 0.173844 | 0.238356 | 1.0 |
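
Reading one column out of a 9x9 matrix is error-prone, so it can help to extract only the correlations with the outcome and rank them. The following is a minimal sketch: the helper name `correlation_with_target` is hypothetical, and the small `toy` DataFrame is made-up illustration data, not the diabetes data set.

```python
import pandas as pd


def correlation_with_target(df, target="Outcome"):
    """Return each feature's Pearson correlation with `target`, strongest first."""
    corr = df.corr()[target].drop(target)
    # Rank by absolute value so strong negative correlations are not buried.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)


# Made-up toy data for illustration only (NOT the diabetes data set).
toy = pd.DataFrame({
    "Glucose": [85, 90, 150, 160, 170, 95],
    "DiabetesPedigreeFunction": [0.2, 0.6, 0.3, 0.7, 0.4, 0.5],
    "Outcome": [0, 0, 1, 1, 1, 0],
})

ranking = correlation_with_target(toy)
print(ranking)
```

Applied to the notebook's `data`, the same call would be `correlation_with_target(data)`. Going by the matrix above, Glucose (0.466581) would come first and the pedigree function (0.173844) fifth out of the eight features.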


#### b) Then, you use a regression algorithm to predict the BMI of a person. Does it work well?

In [23]:
# You can train, e.g., a linear regression:
from sklearn.linear_model import LinearRegression
import sklearn.model_selection as model_selection
from sklearn.metrics import mean_squared_error, explained_variance_score
from math import sqrt

X = data.drop(['BMI'], axis=1)  # features: every column except the label (note that this includes Outcome)
y = data["BMI"]                 # label: the value to predict
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.4, random_state=42)

linReg=LinearRegression()
linReg.fit(X_train, y_train)

linReg_predict = linReg.predict(X_test)

print("Root Mean Squared Error (RMSE):")
print(sqrt(mean_squared_error(y_test, linReg_predict)))
print("")
print("Computed weights:")
print(linReg.coef_)

# It works acceptably, but not impressively: an RMSE of about 7.25 means the predictions
# are typically off by roughly 7 BMI points.

Root Mean Squared Error (RMSE):
7.248284405975125

Computed weights:
[ 1.94773340e-03  2.27978996e-02  7.14802371e-02  1.75848472e-01
 -6.29697075e-03  7.21595873e-01 -5.67860774e-02  4.37915491e+00]
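
An RMSE is easier to judge next to a trivial baseline. The sketch below compares a linear regression with a model that always predicts the mean of the training labels, using scikit-learn's `DummyRegressor`. The helper names `rmse` and `compare_to_mean_baseline` are hypothetical, and the data is synthetic and only there to make the example self-contained.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


def compare_to_mean_baseline(X, y, test_size=0.4, random_state=42):
    """Return (model_rmse, baseline_rmse) measured on the same train/test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    # The baseline ignores X and always predicts the mean of y_train.
    baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
    return rmse(y_test, model.predict(X_test)), rmse(y_test, baseline.predict(X_test))


# Synthetic data for illustration only: y depends linearly on X plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

model_rmse, baseline_rmse = compare_to_mean_baseline(X_demo, y_demo)
print(f"model RMSE:    {model_rmse:.3f}")
print(f"baseline RMSE: {baseline_rmse:.3f}")
```

With the notebook's data, `compare_to_mean_baseline(data.drop(['BMI'], axis=1), data['BMI'])` uses the same split settings as the cell above, so it should reproduce that RMSE next to the baseline value. The closer the two numbers are, the less the regression has actually learned.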


#### c) Finally, you try to build an AI that can predict whether a person has diabetes.
1. Using 10-fold cross-validation (with KFold from sklearn.model_selection), you train an algorithm. Is your AI good enough to change the world?

In [29]:
# As an alternative to the "train_test_split" function, you can use k-fold cross-validation.
# 10-fold cross-validation is simply k-fold cross-validation with k = 10
# (i.e., the number of splits is set to 10).

# This approach gives a more reliable estimate of your model's performance, because every sample is used
# for testing exactly once and the scores are averaged over all 10 train/test configurations.

from sklearn.model_selection import cross_validate
from sklearn import tree

#data["Outcome"]= data["Outcome"].astype(str) 

dt = tree.DecisionTreeClassifier()
X = data.drop(['Outcome'], axis=1)
y = data["Outcome"]

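# cv=10 requests 10 folds; for a classifier, scikit-learn then uses stratified folds (StratifiedKFold) by default.
performance = cross_validate(dt, X, y, cv=10, scoring=('accuracy', 'f1', 'precision', 'recall'),
                             return_train_score=True)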

print("Average Accuracy:")
print(performance["test_accuracy"].mean())

print("Average F1:")
print(performance["test_f1"].mean())

print("Average Precision:")
print(performance["test_precision"].mean())

print("Average Recall:")
print(performance["test_recall"].mean())

# Unfortunately, the decision tree does not work well: its accuracy is just under 70%,
# and its F1 score is even lower at approx. 55%.

Average Accuracy:
0.6978810663021189
Average F1:
0.5511562025476389
Average Precision:
0.5710017000543316
Average Recall:
0.5367521367521368
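
The exercise text mentions `KFold`, so here is a minimal sketch of passing an explicit `KFold` splitter to `cross_validate` instead of an integer. The data is synthetic and for illustration only, and fixing `random_state=42` is an assumption made here for repeatability, not something the exercise prescribes.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data for illustration only (NOT the diabetes data set).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# shuffle=True mixes the rows before splitting; random_state makes the split repeatable.
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

scores = cross_validate(
    DecisionTreeClassifier(random_state=42),
    X_demo,
    y_demo,
    cv=kfold,
    scoring=("accuracy", "f1", "precision", "recall"),
)

# Report the mean and the spread across the 10 folds for each metric.
for metric in ("accuracy", "f1", "precision", "recall"):
    fold_scores = scores[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Printing the standard deviation next to the mean shows how much the score varies from fold to fold, which the averages alone hide.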


#### c) Finally, you try to build an AI that can predict whether a person has diabetes.
2. Define a function that directly evaluates a predefined model based on a set of training and test data. Use this function to test a couple of algorithms.

In [46]:
from sklearn.ensemble import RandomForestClassifier
import sklearn.svm as svm

# Function definition
def get_score(model, X, y):
    """Run 10-fold cross-validation and return the result dict of cross_validate."""
    performance = cross_validate(model, X, y, cv=10, scoring='accuracy', return_train_score=True)
    return performance

# Using the defined function
X = data.drop(['Outcome'], axis=1)
y = data["Outcome"]

models = [
    ('DT', tree.DecisionTreeClassifier()),
    ('RF', RandomForestClassifier(n_estimators=150)),
    ('SVM', svm.SVC(gamma='auto')),
]
for name, model in models:
    print("Average Accuracy of " + name + ":")
    print(get_score(model, X, y)['test_score'].mean())  # print the average accuracy of each algorithm

Average Accuracy of DT:
0.7109193438140806
Average Accuracy of RF:
0.7643369788106631
Average Accuracy of SVM:
0.6510594668489406
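
The task asks for a function that evaluates a model on a given set of training and test data. One possible reading of that is sketched below: `score_on_split` takes the four arrays explicitly, and `kfold_accuracy` calls it once per fold of a `KFold` split. Both function names are hypothetical, and the data is synthetic and for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier


def score_on_split(model, X_train, X_test, y_train, y_test):
    """Fit `model` on the training data and return its accuracy on the test data."""
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


def kfold_accuracy(model, X, y, n_splits=10, random_state=42):
    """Average `score_on_split` over the folds of a shuffled K-fold split."""
    X, y = np.asarray(X), np.asarray(y)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    fold_scores = [
        score_on_split(model, X[train_idx], X[test_idx], y[train_idx], y[test_idx])
        for train_idx, test_idx in kfold.split(X)
    ]
    return float(np.mean(fold_scores))


# Synthetic binary data for illustration only (NOT the diabetes data set).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

demo_models = [
    ("DT", DecisionTreeClassifier(random_state=0)),
    ("RF", RandomForestClassifier(n_estimators=50, random_state=0)),
]
results = {name: kfold_accuracy(model, X_demo, y_demo) for name, model in demo_models}
for name, accuracy in results.items():
    print(f"Average accuracy of {name}: {accuracy:.3f}")
```

Because `kfold_accuracy` converts its inputs with `np.asarray`, it should also accept the notebook's `X` and `y` directly. Fixing `random_state` for both the split and the models makes repeated runs comparable, which the unseeded cells above are not.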
