#### Create a model that will predict whether a person does or does not have diabetes. 

- Use the diabetes.csv dataset. The target column in the dataset is "Outcome". Assume no features leak information about the target.

Your solution should include the following. You may use whichever Python libraries you wish to complete the task:

- Feature engineering
- Model fitting and performance evaluation
- A function that takes as arguments: a model, train data, test data, and returns the model's predictions on the test data
- A function that takes a set of predictions and true values and that validates the predictions using appropriate metrics
- Anything else you feel is necessary for modelling or improving the performance of your model
- This exercise is intended for you to show your proficiency in machine learning, your understanding of the various techniques that can be employed to improve the performance of a model, and your ability to implement those techniques. Please, therefore, show your working at all times. You will be judged more on the above than on the performance of the final model you produce.

In [64]:
import pandas as pd
import sklearn 
import numpy as np
from sklearn import metrics
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

In [65]:
data = pd.read_csv("test_diabetes.csv", sep = ";")
data

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,,148.0,72.0,35.0,0,33.6,0.627,50.0,1
1,1.0,85.0,66.0,29.0,0,26.6,0.351,31.0,0
2,8.0,183.0,64.0,0.0,0,23.3,0.672,32.0,1
3,1.0,89.0,66.0,23.0,94,28.1,0.167,21.0,0
4,0.0,,40.0,35.0,168,43.1,2.288,,1
...,...,...,...,...,...,...,...,...,...
763,10.0,101.0,76.0,,180,32.9,0.171,63.0,0
764,2.0,122.0,70.0,27.0,Zero,36.8,0.340,27.0,0
765,5.0,121.0,72.0,23.0,112,26.2,0.245,30.0,N
766,1.0,126.0,60.0,0.0,Zero,30.1,0.349,47.0,1


### Data transformations

In [66]:
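# Normalise the string tokens that appear in otherwise numeric columns.
data = data.replace("Zero", "0")
data = data.replace({"N": "0", "Y": "1"})

# Cast to numeric dtypes now that every value parses as a number.
data['Outcome'] = data['Outcome'].astype('int')
data['Insulin'] = data['Insulin'].astype('float')
data.info()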

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               731 non-null    float64
 1   Glucose                   730 non-null    float64
 2   BloodPressure             734 non-null    float64
 3   SkinThickness             734 non-null    float64
 4   Insulin                   717 non-null    float64
 5   BMI                       733 non-null    float64
 6   DiabetesPedigreeFunction  728 non-null    float64
 7   Age                       717 non-null    float64
 8   Outcome                   768 non-null    int64  
dtypes: float64(8), int64(1)
memory usage: 54.1 KB
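
The token clean-up above can also be done column by column, with a fallback for any token that was not anticipated. The following is a minimal sketch on a toy frame (the values are illustrative, not rows from the dataset); it relies on `pd.to_numeric(..., errors="coerce")` to turn anything unparseable into `NaN` rather than raising.

```python
import pandas as pd

# Toy frame mimicking the mixed string/numeric columns seen above (illustrative values).
toy = pd.DataFrame({
    "Insulin": ["0", "94", "Zero", None],
    "Outcome": ["1", "0", "N", "Y"],
})

# Map the known tokens explicitly, one column at a time.
toy["Insulin"] = toy["Insulin"].replace({"Zero": "0"})
toy["Outcome"] = toy["Outcome"].replace({"N": "0", "Y": "1"})

# errors="coerce" converts any remaining unparseable token to NaN instead of raising.
toy["Insulin"] = pd.to_numeric(toy["Insulin"], errors="coerce")
toy["Outcome"] = pd.to_numeric(toy["Outcome"], errors="coerce").astype("int64")

print(toy.dtypes)
```

Restricting each replacement to a single column avoids accidentally rewriting a matching token in an unrelated column.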


### Imputing missing data and scaling

In [67]:
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
scaler = StandardScaler()
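
The preview above contains zeros in columns such as `SkinThickness`, where a value of zero is physiologically implausible. One option, sketched below under the assumption that such zeros stand for "not measured", is to convert them to `NaN` before imputation so that the mean imputer treats them like any other missing value. The column list is a hypothetical choice for illustration and should be checked against the data.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; column names follow the dataset shown above.
demo = pd.DataFrame({
    "Glucose": [148.0, 0.0, 183.0],
    "SkinThickness": [35.0, 29.0, 0.0],
    "Pregnancies": [0.0, 1.0, 8.0],
})

# Assumption: a zero in these columns means the measurement is missing.
# Pregnancies is excluded because zero is a legitimate count there.
zero_as_missing = ["Glucose", "SkinThickness"]
demo[zero_as_missing] = demo[zero_as_missing].replace(0, np.nan)

print(demo.isna().sum())
```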

In [68]:
X, y = data.iloc[:, :-1], data.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=0)

In [69]:
print(len(X_train), len(y_train), len(X_test), len(y_test))

537 537 231 231


In [70]:
X_train = imp_mean.fit_transform(X_train)
X_train = scaler.fit_transform(X_train)

In [71]:
X_train

array([[-1.15250789e+00,  9.28846747e-01,  1.07177528e+00, ...,
         1.25204374e+00, -3.21419911e-01, -8.66854863e-01],
       [-8.51264139e-01, -1.25255110e+00, -7.51092988e-02, ...,
        -1.77056690e+00,  4.56239868e-01, -4.81839168e-01],
       [-5.50020389e-01, -1.45527760e-03,  2.91529354e-02, ...,
         5.81757697e-01, -4.16706288e-01, -4.81839168e-01],
       ...,
       [ 5.24671111e-02, -8.99677922e-01, -2.31502650e-01, ...,
        -9.48517984e-01, -1.00686707e+00, -8.66854863e-01],
       [ 0.00000000e+00, -1.18839234e+00,  0.00000000e+00, ...,
        -2.65585036e-01, -5.39656450e-01,  3.15150914e-02],
       [ 3.53710861e-01,  4.47656045e-01,  6.54726341e-01, ...,
        -4.07230388e+00,  5.05419933e-01,  2.21327069e+00]])

In [72]:
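# Reuse the imputer and scaler fitted on the training set so that no test-set statistics enter preprocessing.
X_test = imp_mean.transform(X_test)
X_test = scaler.transform(X_test)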
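
Preprocessing statistics (column means for imputation, means and standard deviations for scaling) are normally estimated on the training rows and then applied unchanged to the test rows. A minimal sketch on toy arrays, not the diabetes data, shows the pattern; the `*_demo` names are introduced here for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Tiny illustrative arrays (not the diabetes data).
X_train_demo = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
X_test_demo = np.array([[np.nan, 20.0], [3.0, 40.0]])

imputer = SimpleImputer(strategy="mean")
demo_scaler = StandardScaler()

# Learn the column means and standard deviations from the training rows only.
X_train_scaled = demo_scaler.fit_transform(imputer.fit_transform(X_train_demo))

# Apply the already-fitted transformers to the test rows.
X_test_scaled = demo_scaler.transform(imputer.transform(X_test_demo))

print(imputer.statistics_)  # column means learned from the training rows
```

Calling `fit_transform` on the test rows instead would re-estimate these statistics from the test set, so the model would see test features on a different scale from the one it was trained on.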

### Declaring the model

In [73]:
logreg = LogisticRegression(solver='liblinear', random_state=1)

In [74]:
logreg.fit(X_train, y_train)

LogisticRegression(random_state=1, solver='liblinear')

In [75]:
predictions = logreg.predict(X_test)

In [76]:
predictions[:10]

array([1, 0, 0, 1, 0, 0, 1, 1, 1, 0])

In [77]:
accuracy_score(y_test, predictions)

0.7748917748917749

In [78]:
X = np.vstack((X_train,X_test))
len(X)

768

In [79]:
#print(y_test.ndim)
#print(y_test.shape[0])
#y = np.vstack((y_train,y_test))

In [80]:
y_test_list = y_test.to_list()

y_train_list = y_train.to_list()

print(type(y_train_list))
y_fin = y_train_list + y_test_list

<class 'list'>


In [81]:
y = pd.Series(y_fin)
type(y)

pandas.core.series.Series

In [83]:
scores = cross_val_score(logreg, X, y,  cv = 10)

In [84]:
scores

array([0.79220779, 0.72727273, 0.74025974, 0.81818182, 0.66233766,
       0.76623377, 0.75324675, 0.81818182, 0.76315789, 0.71052632])

In [85]:
np.std(scores)

0.04595015019035371
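
When cross-validating, one way to keep preprocessing inside each fold is to wrap the imputer, scaler, and classifier in a single pipeline and pass that to `cross_val_score`. The sketch below uses synthetic data from `make_classification` as a stand-in for the diabetes features (an assumption made so the example is self-contained), and reports the mean alongside the standard deviation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 8 numeric features and a binary target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Blank out roughly 5% of the values so the imputer has something to do.
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# The pipeline is re-fitted inside every fold, so each validation fold is
# imputed and scaled with statistics from its own training folds only.
pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(solver="liblinear", random_state=1),
)

cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=10)
print(f"accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```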

In [86]:
# Function that takes a model, the training data, and the test features,
# and returns the model's predictions on the test data.

def get_predictions(model, X_train, y_train, X_test):
    """Fit `model` on the training data and return its predictions for `X_test`."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)

    return predictions

### Using the model

In [87]:
predictions = get_predictions(logreg, X_train, y_train, X_test)    
    

In [88]:
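# Function that takes a set of predictions and the true values, and validates
# the predictions using an appropriate metric (accuracy here).

def evaluate(predictions, true_values):
    """Return the accuracy of `predictions` against `true_values`."""
    # scikit-learn metrics expect the true labels first.
    score = accuracy_score(true_values, predictions)

    return score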

In [89]:
evaluate(predictions, y_test)

0.7748917748917749
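
Accuracy alone can hide how a classifier treats the positive class, which matters when the classes are imbalanced. A possible extension, sketched below, returns precision, recall, and F1 alongside accuracy. `evaluate_extended` is a hypothetical name introduced here, and the labels are illustrative rather than output from the model above.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate_extended(true_values, predictions):
    """Return several classification metrics for binary labels (1 = positive class)."""
    return {
        "accuracy": accuracy_score(true_values, predictions),
        "precision": precision_score(true_values, predictions, zero_division=0),
        "recall": recall_score(true_values, predictions, zero_division=0),
        "f1": f1_score(true_values, predictions, zero_division=0),
    }

# Illustrative labels only, not output from the model above.
demo_true = [1, 0, 1, 1, 0, 0]
demo_pred = [1, 0, 0, 1, 1, 0]
demo_report = evaluate_extended(demo_true, demo_pred)
print(demo_report)
```

Returning a dictionary keeps the function easy to extend, for example with ROC AUC computed from predicted probabilities.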

In [90]:
metrics.confusion_matrix(y_test, predictions)

array([[138,  19],
       [ 33,  41]])
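
The four counts in the confusion matrix are enough to derive the usual per-class metrics by hand. Reading the matrix as 138 true negatives, 41 true positives, 33 diabetic cases the model missed, and 19 non-diabetic cases flagged as diabetic, the arithmetic can be sketched as follows.

```python
# Counts read off the confusion matrix reported above.
tn, fn, fp, tp = 138, 33, 19, 41

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: accuracy=0.775 precision=0.683 recall=0.554 f1=0.612
```

The recall of roughly 0.55 indicates that the model misses a substantial share of the positive cases, which the accuracy figure on its own does not reveal.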