# Data

Read in the three files: clients.csv, loans.csv, and payments.csv. The files are related as follows:
1. The clients file is the parent of the loans file. Each client can have multiple distinct loans. The client_id column links the two files.
2. The loans file is the child of the clients file and the parent of the payments file. Each loan can have multiple distinct payments associated with it. The loan_id column links the two files.
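
To make the parent/child relationships concrete, the sketch below builds three tiny tables and joins them along the two link columns. The values and the `toy_` names are invented for illustration and are not taken from the real files; only the column names `client_id`, `loan_id` and `payment_amount` mirror the data described here. Passing `validate='one_to_many'` asks pandas to raise an error if a key is duplicated on the parent side of a join.

```python
import pandas as pd

# Toy stand-ins for the three files (all values are made up).
toy_clients = pd.DataFrame({'client_id': [1, 2]})
toy_loans = pd.DataFrame({'loan_id': [10, 11, 12],
                          'client_id': [1, 1, 2]})
toy_payments = pd.DataFrame({'loan_id': [10, 10, 11, 12],
                             'payment_amount': [100.0, 200.0, 50.0, 75.0]})

# clients -> loans: each client_id appears once in clients, possibly many times in loans.
toy_client_loans = pd.merge(toy_clients, toy_loans,
                            on='client_id', validate='one_to_many')

# loans -> payments: each loan_id appears once in loans, possibly many times in payments.
toy_full = pd.merge(toy_client_loans, toy_payments,
                    on='loan_id', validate='one_to_many')

# Number of distinct loans held by each client.
loans_per_client = toy_full.groupby('client_id')['loan_id'].nunique()

print(toy_full)
print(loans_per_client)
```

After both joins every payment row carries its `client_id`, which is what makes per-client aggregations over payments possible.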

In [None]:
# Import basic packages.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
# Read the data sets into data frames.
clients = pd.read_csv('clients.csv')
loans = pd.read_csv('loans.csv')
payments = pd.read_csv('payments.csv')

In [None]:
# Quick check of the structure of the data.
clients.info()
loans.info()
payments.info()


In [None]:
print(clients.shape)
print(loans.shape)
print(payments.shape)

With the above datasets, answer the following questions. Show the steps taken to produce your final answer.

# Section 1 Questions

1. Give the 5 client IDs with the highest mean payment amount.
2. How many unique loans have been given out to clients who joined prior to 2001?
3. What is the mean number of payments missed by clients with a credit score of less than 700 and who have missed more than 50 payments?

In [None]:
# Attach a client_id to every payment by joining payments to loans on loan_id.
payments_by_client = pd.merge(payments[['loan_id', 'payment_amount']],
                              loans[['loan_id', 'client_id']],
                              on='loan_id')
# Mean payment amount per client (not per loan); keep the 5 highest.
(payments_by_client.groupby('client_id')['payment_amount']
                   .mean()
                   .sort_values(ascending=False)
                   .head(5))

In [None]:
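# Convert the 'joined' column of the clients data frame to datetime format.
clients['joined'] = pd.to_datetime(clients['joined'])
# Clients who joined prior to 2001, i.e. with a join year strictly below 2001.
clients_prior_to_2001 = clients.loc[clients['joined'].dt.year < 2001, ['client_id']]
# Number of unique loans given out to those clients.
pd.merge(clients_prior_to_2001, loans, on='client_id')['loan_id'].nunique()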

In [None]:
# Section 1, question 3.
# Clients with credit score < 700.
clients_poor_cs = clients.loc[clients['credit_score'] < 700]['client_id']
# Loan ids of these clients.
loan_ids_poor_cs = pd.merge(clients_poor_cs, loans, on='client_id')[['client_id', 'loan_id']]
# Payment details of these clients.
payments_poor_cs = pd.merge(loan_ids_poor_cs, payments, on='loan_id')
# Get count of missed payments by client.
missed_payments = payments_poor_cs.loc[payments_poor_cs['missed'] == 1].groupby('client_id').count()['missed']
# Get the mean number of missed payments where the missed count > 50.
missed_payments[missed_payments > 50].mean()

# Section 2 Questions

Create the following visualizations:
    
1. Create a histogram of the payment amounts. Briefly describe the distribution.
2. Produce a line plot of the cumulative sum of the number of clients by year.
3. Produce a scatter plot of the percentage of payments missed in December for each year in the dataset.

In [None]:
# Histogram of the payment amounts (pandas uses 10 bins by default).
payments['payment_amount'].hist()

In [None]:
clients['joined_year'] = clients['joined'].dt.year
ax = clients[['client_id', 'joined_year']].groupby('joined_year').count().cumsum().reset_index().plot('joined_year', 'client_id')
ax.set_xlabel('year')
ax.set_ylabel('# clients')
ax.set_title('Cumulative sum of clients by year')


In [None]:
payments['payment_date'] = pd.to_datetime(payments['payment_date'])
payments['payment_year'] = payments['payment_date'].dt.year
payments['payment_month'] = payments['payment_date'].dt.month
# Number of December payments per year.
december = payments.loc[payments['payment_month'] == 12]
dec_count = december.groupby('payment_year')['loan_id'].count().rename('n_payments').reset_index()
# Number of missed December payments per year.
dec_missed_count = december.loc[december['missed'] == 1].groupby('payment_year')['loan_id'].count().rename('n_payments_missed').reset_index()

# A left merge keeps years with no missed December payments; their missed count is set to 0.
X = pd.merge(dec_count, dec_missed_count, on='payment_year', how='left').fillna({'n_payments_missed': 0})
X['pct_missed'] = (X['n_payments_missed'] / X['n_payments'] * 100).round(2)
# The question asks for a scatter plot, so use plot.scatter rather than the default line plot.
ax = X.plot.scatter('payment_year', 'pct_missed')
ax.set_xlabel('year')
ax.set_ylabel('% missed')
ax.set_title('Percentage of payments missed in December')

# Section 3 - Modelling

Create a model that will predict whether a person does or does not have diabetes. Use the diabetes.csv dataset. The target column in the dataset is "Outcome". Assume no features leak information about the target.

Your solution should include the below. You may use whichever python libraries you wish to complete the task:
1. Feature engineering
2. Model fitting and performance evaluation
3. A function that takes as arguments: a model, train data, test data, and returns the model's predictions on the test data
4. A function that takes a set of predictions and true values and that validates the predictions using appropriate metrics
5. Anything else you feel is necessary for modelling or improving the performance of your model
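
Items 3 and 4 above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a reference solution: the names `predict_on_test`, `evaluate_predictions` and `MajorityClassModel` are hypothetical, the model is assumed to expose scikit-learn style `fit` and `predict` methods, train and test data are assumed to be passed as `(features, target)` pairs, and the positive class in "Outcome" is assumed to be labelled 1.

```python
def predict_on_test(model, train_data, test_data):
    """Fit `model` on the training data and return its predictions on the test data.

    Assumes `model` has fit(X, y) and predict(X) methods and that
    `train_data` and `test_data` are (features, target) pairs.
    """
    X_train, y_train = train_data
    X_test, _ = test_data  # test targets are deliberately not used for prediction
    model.fit(X_train, y_train)
    return model.predict(X_test)


def evaluate_predictions(y_pred, y_true, positive=1):
    """Return accuracy, precision, recall and F1 for binary predictions."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("no predictions to evaluate")
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision,
            "recall": recall,
            "f1": f1}


class MajorityClassModel:
    """Hypothetical baseline that always predicts the most common training label."""

    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_] * len(X)


# Tiny made-up example: the baseline learns that label 1 is the most common.
train = ([[0], [1], [2]], [0, 1, 1])
test = ([[3], [4]], [1, 0])
baseline_predictions = predict_on_test(MajorityClassModel(), train, test)
baseline_metrics = evaluate_predictions(baseline_predictions, test[1])
print(baseline_predictions, baseline_metrics)
```

Reporting precision and recall alongside accuracy matters when the two outcome classes are imbalanced, since a majority-class baseline like the one above can reach a deceptively high accuracy. In practice the same metrics are available in `sklearn.metrics` (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`).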


__This exercise is intended for you to show your proficiency in machine learning, your understanding of the various techniques that can be employed to improve the performance of a model, and your ability to implement those techniques. Please, therefore, show your working at all times. You will be judged more for the above than for the performance of the final model you produce.__