# Data

Read in the three files: clients.csv, loans.csv, payments.csv. These files are related by the following:
1. The clients file is the parent of the loans file. Each client can have multiple distinct loans. The client_id column links the two files.
2. The loans file is the child of the clients file and the parent of the payments file. Each loan can have multiple distinct payments associated with it. The loan_id column links the two files.
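
The parent/child relationships above can be sketched as two chained one-to-many merges. The tiny tables below are made up for illustration; only `client_id` and `loan_id` come from the task text, and `payment_amount` is an assumed column name.

```python
import pandas as pd

# Made-up stand-ins for clients.csv, loans.csv and payments.csv.
clients = pd.DataFrame({"client_id": [1, 2]})
loans = pd.DataFrame({"client_id": [1, 1, 2], "loan_id": [10, 11, 12]})
payments = pd.DataFrame({
    "loan_id": [10, 10, 11, 12],
    "payment_amount": [100.0, 150.0, 200.0, 50.0],  # assumed column name
})

# clients -> loans is one-to-many on client_id.
client_loans = clients.merge(loans, on="client_id", how="left", validate="one_to_many")

# loans -> payments is one-to-many on loan_id.
full = client_loans.merge(payments, on="loan_id", how="left", validate="one_to_many")
```

Passing `validate="one_to_many"` makes pandas raise an error if a key is unexpectedly duplicated on the parent side, which is a cheap check on the stated relationships.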

With the above datasets, answer the following questions. Show the steps taken to produce your final answer.

# Section 1 Questions

1. Give the 5 client IDs with the highest mean payment amount.
2. How many unique loans have been given out to clients who joined prior to 2001?
3. What is the mean number of payments missed by clients with a credit score of less than 700 and who have missed more than 50 payments?
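
A minimal sketch of how questions 1 and 2 might be approached, using made-up data. The column names `payment_amount` and `joined` are assumptions, since the real files are not shown here; question 3 would follow the same filter-then-aggregate pattern once the relevant columns are known.

```python
import pandas as pd

# Made-up data; payment_amount and joined are assumed column names.
clients = pd.DataFrame({
    "client_id": [1, 2, 3],
    "joined": pd.to_datetime(["1999-05-01", "2003-01-15", "2000-12-31"]),
})
loans = pd.DataFrame({"client_id": [1, 1, 2, 3], "loan_id": [10, 11, 12, 13]})
payments = pd.DataFrame({
    "loan_id": [10, 11, 12, 12, 13],
    "payment_amount": [100.0, 300.0, 50.0, 150.0, 400.0],
})

# Question 1: attach client_id to each payment, then average per client.
payments_by_client = payments.merge(loans, on="loan_id", how="inner")
top_clients = (
    payments_by_client.groupby("client_id")["payment_amount"]
    .mean()
    .nlargest(5)
)

# Question 2: unique loans belonging to clients who joined before 2001.
early_clients = clients.loc[clients["joined"] < pd.Timestamp("2001-01-01"), "client_id"]
n_early_loans = loans.loc[loans["client_id"].isin(early_clients), "loan_id"].nunique()
```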

# Section 2 Questions

Create the following visualizations:
    
1. Create a histogram of the payment amounts. Briefly describe the distribution.
2. Produce a line plot of the cumulative sum of the number of clients by year.
3. Produce a scatter plot of the percentage of payments missed in December for each year in the dataset.
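
As an illustration of the second visualization, the cumulative client count can be derived with a group-by followed by a cumulative sum. This is only a sketch on made-up data: the `joined` column name is an assumption, and the non-interactive `Agg` backend is chosen just so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up clients table; joined is an assumed column name.
clients = pd.DataFrame({
    "client_id": [1, 2, 3, 4],
    "joined": pd.to_datetime(["2000-03-01", "2000-07-15", "2002-01-10", "2003-11-30"]),
})

# Count new clients per joining year, then accumulate over the years.
clients_per_year = clients.groupby(clients["joined"].dt.year)["client_id"].nunique()
cumulative_clients = clients_per_year.cumsum()

fig, ax = plt.subplots()
ax.plot(cumulative_clients.index, cumulative_clients.values, marker="o")
ax.set_xlabel("Year joined")
ax.set_ylabel("Cumulative number of clients")
```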

# Section 3 - Modelling

Create a model that will predict whether a person does or does not have diabetes. Use the diabetes.csv dataset. The target column in the dataset is "Outcome". Assume no features leak information about the target.

Your solution should include the items below. You may use whichever Python libraries you wish to complete the task:
1. Feature engineering
2. Model fitting and performance evaluation
3. A function that takes as arguments: a model, train data, test data, and returns the model's predictions on the test data
4. A function that takes a set of predictions and true values and that validates the predictions using appropriate metrics
5. Anything else you feel is necessary for modelling or improving the performance of your model


__This exercise is intended for you to show your proficiency in machine learning, your understanding of the various techniques that can be employed to improve the performance of a model, and your ability to implement those techniques. Please, therefore, show your working at all times. You will be judged more on the above than on the performance of the final model you produce.__

In [161]:
import pandas as pd
import sklearn 
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

In [162]:
data = pd.read_csv("test_diabetes.csv", sep = ";")
data

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,,148.0,72.0,35.0,0,33.6,0.627,50.0,1
1,1.0,85.0,66.0,29.0,0,26.6,0.351,31.0,0
2,8.0,183.0,64.0,0.0,0,23.3,0.672,32.0,1
3,1.0,89.0,66.0,23.0,94,28.1,0.167,21.0,0
4,0.0,,40.0,35.0,168,43.1,2.288,,1
...,...,...,...,...,...,...,...,...,...
763,10.0,101.0,76.0,,180,32.9,0.171,63.0,0
764,2.0,122.0,70.0,27.0,Zero,36.8,0.340,27.0,0
765,5.0,121.0,72.0,23.0,112,26.2,0.245,30.0,N
766,1.0,126.0,60.0,0.0,Zero,30.1,0.349,47.0,1


In [163]:
data = data.replace("Zero", "0")
data

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,,148.0,72.0,35.0,0,33.6,0.627,50.0,1
1,1.0,85.0,66.0,29.0,0,26.6,0.351,31.0,0
2,8.0,183.0,64.0,0.0,0,23.3,0.672,32.0,1
3,1.0,89.0,66.0,23.0,94,28.1,0.167,21.0,0
4,0.0,,40.0,35.0,168,43.1,2.288,,1
...,...,...,...,...,...,...,...,...,...
763,10.0,101.0,76.0,,180,32.9,0.171,63.0,0
764,2.0,122.0,70.0,27.0,0,36.8,0.340,27.0,0
765,5.0,121.0,72.0,23.0,112,26.2,0.245,30.0,N
766,1.0,126.0,60.0,0.0,0,30.1,0.349,47.0,1


In [164]:
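# Map the stray 'N'/'Y' labels to '0'/'1' in the target column only,
# so that other columns cannot be altered by accident.
data['Outcome'] = data['Outcome'].replace({'N': '0', 'Y': '1'})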

In [165]:
data['Outcome'] = data.Outcome.astype('int')
data['Insulin'] = data.Insulin.astype('float')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               731 non-null    float64
 1   Glucose                   730 non-null    float64
 2   BloodPressure             734 non-null    float64
 3   SkinThickness             734 non-null    float64
 4   Insulin                   717 non-null    float64
 5   BMI                       733 non-null    float64
 6   DiabetesPedigreeFunction  728 non-null    float64
 7   Age                       717 non-null    float64
 8   Outcome                   768 non-null    int64  
dtypes: float64(8), int64(1)
memory usage: 54.1 KB
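
The cleaning steps above could also be written per column, which avoids replacing values across the whole frame. The sketch below runs on a small made-up frame that mimics the `Zero`, `N` and `Y` entries seen in the data; `pd.to_numeric(..., errors="coerce")` is used so that any other unparseable entry would become `NaN` instead of raising.

```python
import pandas as pd

# Made-up frame mimicking the dirty values seen in the notebook.
raw = pd.DataFrame({
    "Insulin": ["0", "94", "Zero", None],
    "Outcome": ["1", "0", "N", "Y"],
})

cleaned = raw.copy()

# Replace the textual zero, then coerce the column to a numeric dtype.
cleaned["Insulin"] = pd.to_numeric(cleaned["Insulin"].replace("Zero", "0"), errors="coerce")

# Map the N/Y labels onto 0/1 and convert the target to integers.
cleaned["Outcome"] = cleaned["Outcome"].replace({"N": "0", "Y": "1"}).astype(int)
```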


In [166]:
# index=False avoids writing the row index as an extra unnamed column.
data.to_csv("diabetes_data_corrected.csv", index=False)

In [129]:
X

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,,148.0,72.0,35.0,0.0,33.6,0.627,50.0
1,1.0,85.0,66.0,29.0,0.0,26.6,0.351,31.0
2,8.0,183.0,64.0,0.0,0.0,23.3,0.672,32.0
3,1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0
4,0.0,,40.0,35.0,168.0,43.1,2.288,
...,...,...,...,...,...,...,...,...
763,10.0,101.0,76.0,,180.0,32.9,0.171,63.0
764,2.0,122.0,70.0,27.0,0.0,36.8,0.340,27.0
765,5.0,121.0,72.0,23.0,112.0,26.2,0.245,30.0
766,1.0,126.0,60.0,0.0,0.0,30.1,0.349,47.0


In [130]:
X = data.iloc[:, :-1]
y = data['Outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [131]:
X_train

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
663,9.0,145.0,80.0,46.0,130.0,37.9,0.637,
712,10.0,129.0,62.0,36.0,0.0,41.2,0.441,38.0
161,7.0,102.0,,40.0,105.0,37.2,0.204,45.0
509,8.0,120.0,78.0,0.0,0.0,25.0,,64.0
305,2.0,120.0,76.0,37.0,105.0,39.7,0.215,29.0
...,...,...,...,...,...,...,...,...
645,2.0,157.0,74.0,35.0,440.0,39.4,0.134,30.0
715,7.0,187.0,50.0,33.0,392.0,33.9,0.826,34.0
72,13.0,126.0,,0.0,0.0,43.4,0.583,42.0
235,4.0,171.0,,0.0,0.0,43.6,0.479,26.0


In [141]:
# Impute missing values with the column mean, standardise the features,
# then fit a logistic regression on the result.

imp_mean = SimpleImputer(strategy='mean')
scaler = StandardScaler()

imp_mean_pipe = make_pipeline(imp_mean, scaler)

ct = make_column_transformer(
    (imp_mean_pipe, ['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age'])
)

logreg = LogisticRegression(solver='liblinear', random_state=1)

pipe = make_pipeline(ct, logreg)
pipe.fit(X_train, y_train)
pipe.predict(X_test)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0])

In [142]:
from sklearn.model_selection import cross_val_score
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

0.7513114336643748
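
Mean accuracy alone can hide how well the positive class is detected. The sketch below shows one way to report several metrics with stratified folds via `cross_validate`. It uses synthetic data from `make_classification` with a few values blanked out, not the diabetes data, so the numbers it produces say nothing about the model above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with roughly 5% missing values.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=1)
rng = np.random.default_rng(1)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Imputation and scaling live inside the pipeline, so they are refit on each fold.
demo_pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(solver="liblinear", random_state=1),
)

metric_names = ["accuracy", "recall", "roc_auc"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_validate(demo_pipe, X_demo, y_demo, cv=cv, scoring=metric_names)
mean_scores = {name: scores[f"test_{name}"].mean() for name in metric_names}
```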

In [143]:
# A function that takes as arguments: a model, train data, test data, 
# and returns the model's predictions on the test data

In [150]:
def get_predictions(model, train, test): 
    
    imp_mean = SimpleImputer(strategy='mean')
    scaler = StandardScaler()
    imp_mean_pipe = make_pipeline(imp_mean, scaler)

    ct = make_column_transformer(
        (imp_mean_pipe, ['Pregnancies','Glucose','BloodPressure','SkinThickness','Insulin','BMI','DiabetesPedigreeFunction','Age']),
        remainder='passthrough')

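    # Split the training frame into features and target (the last column).
    X_train, y_train = train.iloc[:, :-1], train.iloc[:, -1]
    pipe = make_pipeline(ct, model)
    pipe.fit(X_train, y_train)
    # Note: X_test is the global from the earlier train_test_split, not the `test` argument.
    prediction = pipe.predict(X_test)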
    
    return prediction
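
The function above predicts on the global `X_test` and never uses its `test` argument. A possible corrected variant is sketched below. The name `get_predictions_fixed`, the `target` parameter and the demo frame are hypothetical, and the column transformer is replaced by a plain pipeline because every feature here is numeric.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def get_predictions_fixed(model, train, test, target="Outcome"):
    """Fit on `train` and predict on `test`; both frames contain the target column."""
    X_train, y_train = train.drop(columns=target), train[target]
    X_test = test.drop(columns=target)  # taken from the argument, not a global
    pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), model)
    pipe.fit(X_train, y_train)
    return pipe.predict(X_test)


# Made-up demo frame with a few missing values.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Glucose": rng.normal(120, 30, 100),
    "BMI": rng.normal(30, 5, 100),
})
demo["Outcome"] = (demo["Glucose"] > 120).astype(int)
demo.loc[::10, "BMI"] = np.nan

demo_predictions = get_predictions_fixed(
    LogisticRegression(solver="liblinear", random_state=1), demo[:70], demo[70:]
)
```

With this variant the number of predictions always matches the number of rows in the frame passed as `test`.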

In [151]:
logreg = LogisticRegression(solver='liblinear', random_state=1)

In [152]:
predicted = get_predictions(logreg, data[:550], data[550:])
len(predicted)

154

In [153]:
def evaluate(predictions, true_values):
    metrics = accuracy_score(true_values, predictions)
    return metrics
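
Accuracy is a single number, and for a medical screening task precision and recall are often reported alongside it. The sketch below extends the idea of `evaluate` to several metrics; the name `evaluate_more` and the demo labels are hypothetical.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)


def evaluate_more(predictions, true_values):
    """Return several classification metrics for binary 0/1 labels."""
    return {
        "accuracy": accuracy_score(true_values, predictions),
        "precision": precision_score(true_values, predictions, zero_division=0),
        "recall": recall_score(true_values, predictions, zero_division=0),
        "f1": f1_score(true_values, predictions, zero_division=0),
        "confusion_matrix": confusion_matrix(true_values, predictions),
    }


# Made-up labels, only to show the call.
demo_true = [1, 0, 1, 1, 0, 0]
demo_pred = [1, 0, 0, 1, 0, 1]
demo_report = evaluate_more(demo_pred, demo_true)
```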

In [154]:
data[550:].isna().sum()

Pregnancies                 16
Glucose                     11
BloodPressure                9
SkinThickness                9
Insulin                     19
BMI                          7
DiabetesPedigreeFunction    12
Age                         16
Outcome                      0
dtype: int64

In [155]:
m = evaluate(predicted, data.iloc[550:, :-1])

ValueError: Found input variables with inconsistent numbers of samples: [218, 154]
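
The error appears to have two causes. First, `predicted` has 154 entries because `get_predictions` predicted on the global `X_test` from the earlier 80/20 split, while `data[550:]` has 218 rows. Second, `data.iloc[550:, :-1]` selects the feature columns, whereas the true values are in the last column, `data.iloc[550:, -1]`. The difference between the two selections is illustrated below on a made-up frame with placeholder predictions.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Made-up frame; the last column plays the role of the target.
demo = pd.DataFrame({
    "Glucose": [148.0, 85.0, 183.0, 89.0],
    "Outcome": [1, 0, 1, 0],
})
holdout = demo.iloc[2:]

features = holdout.iloc[:, :-1]  # 2-D: every column except the last
labels = holdout.iloc[:, -1]     # 1-D: the last column (Outcome)

demo_predictions = [1, 1]  # placeholder predictions, one per hold-out row

# The metric needs one prediction per true label.
assert len(demo_predictions) == len(labels)
demo_accuracy = accuracy_score(labels, demo_predictions)
```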