# Data

Read in the three files: clients.csv, loans.csv, payments.csv. These files are related by the following:
1. The clients file is the parent of the loans file. Each client can have multiple distinct loans. The client_id column links the two files.
2. The loans file is the child of the clients file and the parent of the payments file. Each loan can have multiple distinct payments associated with it. The loan_id column links the two files.
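
The two parent-child links described above can be checked mechanically before any analysis. The sketch below uses small made-up frames (the values are illustrative assumptions, not rows from the real files) and relies on the `validate` argument of `pd.merge`, which raises a `MergeError` when the stated relationship does not hold.

```python
import pandas as pd

# Toy stand-ins for the real files; the values are made up for illustration.
toy_clients = pd.DataFrame({'client_id': [1, 2]})
toy_loans = pd.DataFrame({'loan_id': [10, 11, 12], 'client_id': [1, 1, 2]})
toy_payments = pd.DataFrame({'loan_id': [10, 10, 11, 12],
                             'payment_amount': [100.0, 150.0, 80.0, 60.0]})

# validate='one_to_many' checks that the key is unique in the left (parent) frame.
client_loans = pd.merge(toy_clients, toy_loans, on='client_id', validate='one_to_many')
loan_payments = pd.merge(toy_loans, toy_payments, on='loan_id', validate='one_to_many')

# Every child row should find a parent: count orphaned loans and payments.
orphan_loans = ~toy_loans['client_id'].isin(toy_clients['client_id'])
orphan_payments = ~toy_payments['loan_id'].isin(toy_loans['loan_id'])
print(orphan_loans.sum(), orphan_payments.sum())
```

The same two checks could be run on `clients`, `loans` and `payments` once they are loaded, assuming the key columns are named as in the description.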

In [None]:
# Import basic packages.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
# Read the data sets into data frames.
clients = pd.read_csv('clients.csv')
loans = pd.read_csv('loans.csv')
payments = pd.read_csv('payments.csv')

In [None]:
# Quick check of the structure of the data.
clients.info()
loans.info()
payments.info()


In [None]:
print(clients.shape)
print(loans.shape)
print(payments.shape)

With the above datasets, answer the following questions. Show the steps taken to produce your final answer.

# Section 1 Questions

1. Give the 5 client IDs with the highest mean payment amount
2. How many unique loans have been given out to clients who joined prior to 2001?
3. What is the mean number of payments missed by clients with a credit score of less than 700 and who have missed more than 50 payments?

In [None]:
# Attach client_id to every payment, so that the mean is taken per client rather than per loan.
payments_by_client = pd.merge(payments[['loan_id', 'payment_amount']],
                              loans[['loan_id', 'client_id']], on='loan_id')
# Client ids with the 5 highest mean payment amounts.
payments_by_client.groupby('client_id')['payment_amount']. \
                   mean(). \
                   sort_values(ascending=False). \
                   head(5)

In [None]:
# Convert the 'joined' column of the clients dataframe to datetime format.
clients['joined'] = pd.to_datetime(clients['joined'])
# "Prior to 2001" means a joining year strictly before 2001.
clients_prior_to_2001 = clients.loc[clients['joined'].dt.year < 2001, ['client_id']]
# Number of unique loans given to these clients.
pd.merge(clients_prior_to_2001, loans, on='client_id')['loan_id'].nunique()

In [None]:
# Clients with credit score < 700.
clients_poor_cs = clients.loc[clients['credit_score'] < 700]['client_id']
# Loan ids of these clients.
loan_ids_poor_cs = pd.merge(clients_poor_cs, loans, on='client_id')[['client_id', 'loan_id']]
# Payment details of these clients.
payments_poor_cs = pd.merge(loan_ids_poor_cs, payments, on='loan_id')
# Get count of missed payments by client.
missed_payments = payments_poor_cs.loc[payments_poor_cs['missed'] == 1].groupby('client_id')['missed'].count()
# Get the mean number of missed payments where the missed count > 50.
missed_payments[missed_payments > 50].mean()


# Section 2 Questions

Create the following visualizations:
    
1. Create a histogram of the payment amounts. Briefly describe the distribution.
2. Produce a line plot of the cumulative sum of the number of clients by year.
3. Produce a scatter plot of the percentage of payments missed in December for each year in the dataset.

In [None]:
payments['payment_amount'].hist(bins=20)

In [None]:
clients['joined_year'] = clients['joined'].dt.year
ax = clients[['client_id', 'joined_year']].groupby('joined_year').count().cumsum().reset_index().plot('joined_year', 'client_id')
ax.set_xlabel('year')
ax.set_ylabel('# clients')
ax.set_title('Cumulative sum of clients by year')


In [None]:
payments['payment_date'] = pd.to_datetime(payments['payment_date'])
payments['payment_year'] = payments['payment_date'].dt.year
payments['payment_month'] = payments['payment_date'].dt.month
# Number of payments due in December of each year.
dec_count = payments.loc[payments['payment_month'] == 12][['loan_id', 'payment_year']].groupby('payment_year').count().rename(columns={'loan_id': 'n_loans'}).reset_index()
# Number of payments missed in December of each year.
dec_missed_count = payments.loc[((payments['payment_month'] == 12) & (payments['missed'] == 1))][['loan_id', 'payment_year']].groupby('payment_year').count().rename(columns={'loan_id': 'n_loans_missed'}).reset_index()

# A left merge keeps years without any missed December payment (counted as 0).
X = pd.merge(dec_count, dec_missed_count, on='payment_year', how='left').fillna(0)
X['pct_missed'] = round(X['n_loans_missed']/X['n_loans'] * 100, 2)
# The question asks for a scatter plot, not a line plot.
ax = X.plot.scatter('payment_year', 'pct_missed')
ax.set_xlabel('year')
ax.set_ylabel('% missed')
ax.set_title('Percentage of payments missed in December')
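
Because `missed` is treated as a 0/1 flag (the filter `missed == 1` above assumes this), the percentage missed is the mean of the flag times 100. The sketch below shows this shortcut on made-up payments; the dates and flags are illustrative assumptions, not rows from payments.csv.

```python
import pandas as pd

# Made-up payments for illustration; 'missed' is assumed to be a 0/1 flag.
toy = pd.DataFrame({
    'payment_date': pd.to_datetime(['2005-12-01', '2005-12-15', '2005-06-01',
                                    '2006-12-03', '2006-12-20', '2006-12-28']),
    'missed': [1, 0, 1, 0, 0, 1],
})

december = toy[toy['payment_date'].dt.month == 12]
# The mean of a 0/1 flag is the fraction of ones; years with no misses stay at 0%.
pct_missed = (december.groupby(december['payment_date'].dt.year)['missed']
              .mean().mul(100).round(2))
print(pct_missed)
```

This avoids building two count tables and merging them, at the cost of hiding the raw counts.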

# Section 3 - Modelling

Create a model that will predict whether a person does or does not have diabetes. Use the diabetes.csv dataset. The target column in the dataset is "Outcome". Assume no features leak information about the target.

Your solution should include the below. You may use whichever python libraries you wish to complete the task:
1. Feature engineering
2. Model fitting and performance evaluation
3. A function that takes as arguments: a model, train data, test data, and returns the model's predictions on the test data
4. A function that takes a set of predictions and true values and that validates the predictions using appropriate metrics
5. Anything else you feel is necessary for modelling or improving the performance of your model


__This exercise is intended for you to show your proficiency in machine learning, your understanding of the various techniques that can be employed to improve the performance of a model, and your ability to implement those techniques. Please, therefore, show your working at all times. You will be judged more on the above than on the performance of the final model you produce.__

In [None]:
# Read the data, conduct a preliminary examination
all_diab = pd.read_csv('test_diabetes.csv', sep=';')

In [None]:
all_diab.info()

In [None]:
all_diab.head()

In [None]:
# Why is Insulin not numeric?
all_diab['Insulin'].unique()

In [None]:
# We must replace 'Zero' in the column 'Insulin' with '0'
all_diab.loc[all_diab['Insulin'] =='Zero', 'Insulin'] = 0

In [None]:
all_diab.info()

In [None]:
# Convert 'Insulin' to float
all_diab['Insulin'] = all_diab['Insulin'].astype(float)

In [None]:
all_diab.info()

In [None]:
# Why is 'Outcome' an object?
all_diab['Outcome'].unique()

In [None]:
# Replace 'N' with '0' and 'Y' with '1'
all_diab.loc[all_diab['Outcome'] == 'N', 'Outcome'] = '0'
all_diab.loc[all_diab['Outcome'] == 'Y', 'Outcome'] = '1'
all_diab['Outcome'].unique()

In [None]:
# Convert '1' and '0' to the respective integers
all_diab['Outcome'] = all_diab['Outcome'].astype(int)

In [None]:
all_diab.info()

In [None]:
# By this stage, we have cleaned the data and all values are numeric. A few NaNs remain and
# need separate consideration. Show the rows that contain at least one NaN.
all_diab[all_diab.isna().any(axis=1)]

In [None]:
all_diab.describe()

In [None]:
# Find the number of NaNs in each column
C = all_diab.isna().sum()
C

In [None]:
# What do the medians look like?
all_diab.median()

In [None]:
# Let us impute all the NaN's with respective medians.
all_diab = all_diab.fillna(all_diab.median())
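
The imputation above uses medians computed on the full data set, so rows that later land in the test set influence the values filled into the training rows. One alternative is to compute the medians on the training rows only and reuse them for the test rows. The sketch below illustrates this on a made-up frame; the values and the fixed four/two split are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({'Glucose': [90.0, np.nan, 120.0, 150.0, np.nan, 200.0],
                    'BMI': [22.0, 30.0, np.nan, 28.0, 35.0, np.nan]})

# Hypothetical split: first four rows are train, last two are test.
toy_train, toy_test = toy.iloc[:4], toy.iloc[4:]

# Medians are learned from the training rows only...
train_medians = toy_train.median()
# ...and applied unchanged to both parts.
toy_train_filled = toy_train.fillna(train_medians)
toy_test_filled = toy_test.fillna(train_medians)
print(train_medians)
```

Whether the difference matters here depends on how many values are missing; with only a few NaNs the effect is likely small.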

In [None]:
# Does the data look good now?
all_diab.describe()

In [None]:
# Check the population of each class
all_diab.groupby('Outcome').count()

In [None]:
# How skewed are the X's?
all_diab.describe().loc[['mean', '50%']]

In [None]:
# There is a huge difference between mean and median of Insulin. The data seems to be skewed to the right.
all_diab['Insulin'].hist()
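
A right-skewed, non-negative feature such as Insulin is often compressed with a log transform before modelling. The sketch below uses `np.log1p`, which tolerates the zeros present in the column, on a synthetic sample that only stands in for Insulin; whether the transform helps this particular model would need to be checked empirically.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for Insulin (not the real data).
rng = np.random.default_rng(0)
toy_insulin = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=500))

# log1p(x) = log(1 + x), so a zero maps to 0 instead of -inf.
toy_insulin_log = np.log1p(toy_insulin)

raw_skew = toy_insulin.skew()
log_skew = toy_insulin_log.skew()
print(round(raw_skew, 2), round(log_skew, 2))
```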

In [None]:
all_Xs = ['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age']
sm = pd.plotting.scatter_matrix(all_diab[all_Xs], figsize=(15, 15))

In [None]:
# Only BMI and SkinThickness seem to be correlated. Quite naturally!!
# Next, run a one-way ANOVA (equivalent to a pooled-variance t-test for two groups)
# to check whether the mean of each X differs between groups 0 and 1.
from scipy import stats

Y = 'Outcome'
for x in all_Xs:
  print(all_diab[[x, Y]].groupby(Y).mean().round(2).transpose())
  print(stats.f_oneway(all_diab.loc[all_diab[Y]==0][x], all_diab.loc[all_diab[Y]==1][x]))
  print('-'*80)
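
The test above pools the variance of the two groups. When the groups may have unequal variances, Welch's t-test (`stats.ttest_ind` with `equal_var=False`) is a common alternative. The sketch below compares the three calls on synthetic groups; the group sizes, means and spreads are made-up assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic groups with different sizes and spreads (not the real data).
rng = np.random.default_rng(0)
group0 = rng.normal(loc=110.0, scale=25.0, size=200)
group1 = rng.normal(loc=140.0, scale=35.0, size=100)

# One-way ANOVA on two groups and the pooled-variance t-test agree: F equals t squared.
f_stat, f_p = stats.f_oneway(group0, group1)
t_stat, t_p = stats.ttest_ind(group0, group1)

# Welch's t-test does not assume equal variances.
welch_stat, welch_p = stats.ttest_ind(group0, group1, equal_var=False)
print(f_stat, t_stat ** 2, welch_p)
```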

In [None]:
# SkinThickness and BloodPressure are not significantly different in diabetics and non-diabetics. We can drop these
# variables. We will now try a logistic regression model.
import random
from patsy import dmatrices
import statsmodels.api as sm

def get_split(N, p):
    """ Split data indexed 0 to (N-1) into training and test."""
    n_train = int(N * p)
    train_indices = set(random.sample(range(N), n_train))
    test_indices = set(range(N)) - train_indices
    
    assert len(train_indices.intersection(test_indices)) == 0
    assert len(train_indices) + len(test_indices) == N
    
    # Return both as lists
    return [[i for i in train_indices], [i for i in test_indices]]

train_indices, test_indices = get_split(all_diab.shape[0], 0.70)

trn_data = all_diab.iloc[train_indices, :]
tst_data = all_diab.iloc[test_indices, :]

assert trn_data.shape[0] + tst_data.shape[0] == all_diab.shape[0]
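
`get_split` samples rows uniformly and without a seed, so the class balance can differ between the training and test sets and the split changes from run to run. A stratified, seeded alternative can be sketched with `DataFrameGroupBy.sample`; the toy frame and the seed value below are assumptions for illustration.

```python
import pandas as pd

# Made-up frame with a 2:1 class ratio, for illustration only.
toy = pd.DataFrame({'Glucose': range(30), 'Outcome': [0, 0, 1] * 10})

# Sample 70% of the rows within each Outcome class; random_state fixes the draw.
toy_train = toy.groupby('Outcome', group_keys=False).sample(frac=0.7, random_state=0)
# The test set is every row that was not drawn for training.
toy_test = toy.drop(toy_train.index)

print(toy_train['Outcome'].value_counts().to_dict())
print(toy_test['Outcome'].value_counts().to_dict())
```

Both parts keep the 2:1 ratio of the full frame, which makes metrics on the two sets easier to compare.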

In [None]:
def print_cm_results(cm):
    """Print diagnostics from a 2x2 confusion matrix.

    Rows are actual classes and columns are predicted classes, as returned by
    pred_table() and by pd.crosstab(actual, predicted).
    """
    tn, fp, fn, tp = np.asarray(cm).ravel()
    recall = tp/(tp + fn)
    precision = tp/(tp + fp)
    specificity = tn/(tn + fp)
    f1_score = 2*recall*precision/(recall + precision)
    accuracy = np.trace(cm)/np.sum(cm)

    print(f'recall (% of actual +ves detected) = {round(recall * 100, 2)}')
    print(f'precision (% of predicted +ves that are correct) = {round(precision * 100, 2)}')
    print(f'specificity (% of actual -ves detected) = {round(specificity * 100, 2)}')
    print(f'f1 score = {round(f1_score, 2)}')
    print(f'accuracy = {round(accuracy * 100, 2)}')

In [None]:
# Let us try using all X's.
def get_model_matrices_v0(D):
    return dmatrices('Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + Insulin + BMI + DiabetesPedigreeFunction + Age', \
                     data = D, return_type='dataframe')

# Training model matrices
yn, Xn = get_model_matrices_v0(trn_data)
yt, Xt = get_model_matrices_v0(tst_data)

# Version 0 of the model.
model_v0 = sm.Logit(yn, Xn)
results_v0 = model_v0.fit()
print(results_v0.summary())

In [None]:
# Training data performance
cm_v0 = results_v0.pred_table()
print(cm_v0)
print_cm_results(cm_v0)

In [None]:
def get_model_matrices_v1(D):
    return dmatrices('Outcome ~ Pregnancies + Glucose + Insulin + BMI + DiabetesPedigreeFunction + Age', \
                    data = D, return_type='dataframe')

# Training model matrices
yn, Xn = get_model_matrices_v1(trn_data)
yt, Xt = get_model_matrices_v1(tst_data)

In [None]:
# Version 1 of the model.
model_v1 = sm.Logit(yn, Xn)
results_v1 = model_v1.fit()
print(results_v1.summary())

In [None]:
# Training data performance
cm_v1 = results_v1.pred_table()
print(cm_v1)
print_cm_results(cm_v1)

In [None]:
# Version 2 drops Insulin and Age and keeps Pregnancies, Glucose, BMI and DiabetesPedigreeFunction.
def get_model_matrices_v2(D):
    return dmatrices('Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction', \
                     data = D, return_type='dataframe')

# Training model matrices
yn, Xn = get_model_matrices_v2(trn_data)
yt, Xt = get_model_matrices_v2(tst_data)

# Version 2 of the model.
model_v2 = sm.Logit(yn, Xn)
results_v2 = model_v2.fit()
print(results_v2.summary())

In [None]:
# Training data performance
cm_v2 = results_v2.pred_table()
print(cm_v2)
print_cm_results(cm_v2)

In [None]:
# Print all confusion matrices
cms = [cm_v0, cm_v1, cm_v2]
for i, cm in enumerate(cms):
    print(f'Confusion matrix for version {i}:')
    print_cm_results(cm)
    print('-' * 80)

In [None]:
# Going by the scores on training data, version 1 looks to be the best.
# We will check how it does on the test data.
yt, Xt = get_model_matrices_v1(tst_data)
yt_pred = results_v1.predict(Xt)

def prob_to_outcome(y, threshold = 0.5):
    if y < threshold:
        return 0
    else:
        return 1
    
yt['predicted'] = [prob_to_outcome(y) for y in yt_pred]
yt.columns = ['actual', 'predicted']
yt['actual'] = yt['actual'].astype(int)
# Rows are actual classes, columns are predicted classes (same layout as pred_table()).
tcm_v1 = pd.crosstab(yt['actual'], yt['predicted']).to_numpy()
print('Test confusion matrix for version 1:')
print_cm_results(tcm_v1)
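
Items 3 and 4 of the task list ask for a prediction function and a validation function. The following is a minimal NumPy-only sketch, not the method used above: `fit_and_predict`, `validate_predictions` and the stand-in `ThresholdModel` are hypothetical names, and the model is assumed to expose scikit-learn-style `fit` and `predict` methods. The data is made up.

```python
import numpy as np


class ThresholdModel:
    """Hypothetical stand-in model: predicts 1 when the first feature exceeds a cut-off."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        # Cut-off halfway between the class means of the first feature.
        self.cutoff_ = (X[y == 0, 0].mean() + X[y == 1, 0].mean()) / 2
        return self

    def predict(self, X):
        return (np.asarray(X, dtype=float)[:, 0] > self.cutoff_).astype(int)


def fit_and_predict(model, X_train, y_train, X_test):
    """Fit the model on the training data and return its predictions on the test data."""
    return model.fit(X_train, y_train).predict(X_test)


def validate_predictions(y_true, y_pred):
    """Return accuracy, precision, recall, specificity and F1 for binary 0/1 labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))

    def ratio(num, den):
        # Undefined ratios (zero denominator) are reported as NaN.
        return num / den if den else float('nan')

    precision, recall = ratio(tp, tp + fp), ratio(tp, tp + fn)
    return {
        'accuracy': ratio(tp + tn, tp + tn + fp + fn),
        'precision': precision,
        'recall': recall,
        'specificity': ratio(tn, tn + fp),
        'f1': ratio(2 * precision * recall, precision + recall),
    }


# Made-up one-feature data for illustration.
X_train = [[80], [90], [100], [150], [160], [170]]
y_train = [0, 0, 0, 1, 1, 1]
X_test = [[85], [155], [130], [95]]
y_test = [0, 1, 0, 0]

y_pred = fit_and_predict(ThresholdModel(), X_train, y_train, X_test)
metrics = validate_predictions(y_test, y_pred)
print(metrics)
```

Returning the metrics as a dictionary, rather than printing them, makes it easier to compare several model versions in one table.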