![COUR_IPO.png](attachment:COUR_IPO.png)

# Welcome to the Data Science Coding Challenge!

Test your skills in a real-world coding challenge. Coding Challenges provide CS & DS Coding Competitions with Prizes and achievement badges!

CS & DS learners want to be challenged as a way to evaluate if they’re job ready. So, why not create fun challenges and give winners something truly valuable such as complimentary access to select Data Science courses, or the ability to receive an achievement badge on their Coursera Skills Profile - highlighting their performance to recruiters.

## Introduction

In this challenge, you'll get the opportunity to tackle one of the most industry-relevant machine learning problems with a unique dataset that will put your modeling skills to the test. Financial loan services are leveraged by companies across many industries, from big banks to financial institutions to government loans. One of the primary objectives of companies with financial loan services is to decrease payment defaults and ensure that individuals are paying back their loans as expected. In order to do this efficiently and systematically, many companies employ machine learning to predict which individuals are at the highest risk of defaulting on their loans, so that proper interventions can be effectively deployed to the right audience.

In this challenge, we will tackle the loan default prediction problem for a unique group of individuals who have taken out financial loans.

Imagine that you are a new data scientist at a major financial institution and you are tasked with building a model that can predict which individuals will default on their loan payments. We have provided a dataset that is a sample of individuals who received loans in 2021. 

This financial institution has a vested interest in understanding the likelihood of each individual to default on their loan payments so that resources can be allocated appropriately to support these borrowers. In this challenge, you will use your machine learning toolkit to do just that!

## Understanding the Datasets

### Train vs. Test
In this competition, you’ll gain access to two datasets that are samples of past borrowers of a financial institution that contain information about the individual and the specific loan. One dataset is titled `train.csv` and the other is titled `test.csv`.

`train.csv` contains 70% of the overall sample (255,347 borrowers to be exact) and importantly, will reveal whether or not the borrower has defaulted on their loan payments (the “ground truth”).

The `test.csv` dataset contains the exact same information about the remaining segment of the overall sample (109,435 borrowers to be exact), but does not disclose the “ground truth” for each borrower. It’s your job to predict this outcome!

Using the patterns you find in the `train.csv` data, predict whether the borrowers in `test.csv` will default on their loan payments, or not.

### Dataset descriptions
Both `train.csv` and `test.csv` contain one row for each unique loan. Each loan, identified by `LoanID`, contributes a single observation taken while the loan was active.

In addition to this identifier column, the `train.csv` dataset also contains the target label for the task, a binary column `Default` which indicates if a borrower has defaulted on payments.

Besides that column, both datasets have an identical set of features that can be used to train your model to make predictions. Below you can see descriptions of each feature. Familiarize yourself with them so that you can harness them most effectively for this machine learning task!

In [1]:
import pandas as pd
data_descriptions = pd.read_csv('data_descriptions.csv')
pd.set_option('display.max_colwidth', None)


data_descriptions

Unnamed: 0,Column_name,Column_type,Data_type,Description
0,LoanID,Identifier,string,A unique identifier for each loan.
1,Age,Feature,integer,The age of the borrower.
2,Income,Feature,integer,The annual income of the borrower.
3,LoanAmount,Feature,integer,The amount of money being borrowed.
4,CreditScore,Feature,integer,"The credit score of the borrower, indicating their creditworthiness."
5,MonthsEmployed,Feature,integer,The number of months the borrower has been employed.
6,NumCreditLines,Feature,integer,The number of credit lines the borrower has open.
7,InterestRate,Feature,float,The interest rate for the loan.
8,LoanTerm,Feature,integer,The term length of the loan in months.
9,DTIRatio,Feature,float,"The Debt-to-Income ratio, indicating the borrower's debt compared to their income."


## How to Submit your Predictions to Coursera
Submission Format:

In this notebook you should follow the steps below to explore the data, train a model using the data in `train.csv`, and then score your model using the data in `test.csv`. Your final submission should be a dataframe (call it `prediction_df`) with two columns and exactly 109,435 rows (plus a header row). The first column should be `LoanID` so that we know which prediction belongs to which observation. The second column should be called `predicted_probability` and should be a numeric column representing the __likelihood that the borrower will default__.

Your submission will show an error if you have extra columns (beyond `LoanID` and `predicted_probability`) or extra rows. The order of the rows does not matter.

The naming convention of the dataframe and columns are critical for our autograding, so please make sure to use the exact naming conventions of `prediction_df` with column names `LoanID` and `predicted_probability`!
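The format rules above can be checked locally before submitting. The following is a minimal sketch, not the autograder's actual logic: `check_submission` is a hypothetical helper, and the three-row frame and `expected_rows=3` are made up for illustration (for the real submission the expected row count is 109,435).

```python
import pandas as pd

def check_submission(df, expected_rows):
    """Return a list of format problems; an empty list means the frame looks valid."""
    problems = []
    if list(df.columns) != ["LoanID", "predicted_probability"]:
        problems.append("columns must be exactly LoanID, predicted_probability")
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(df)}")
    if "predicted_probability" in df.columns:
        probs = df["predicted_probability"]
        if not pd.api.types.is_numeric_dtype(probs):
            problems.append("predicted_probability must be numeric")
        elif not probs.between(0, 1).all():
            problems.append("predicted_probability must lie in [0, 1]")
    return problems

# Toy frame standing in for prediction_df (illustrative IDs and values).
toy = pd.DataFrame({
    "LoanID": ["A1", "B2", "C3"],
    "predicted_probability": [0.10, 0.85, 0.40],
})
print(check_submission(toy, expected_rows=3))
print(check_submission(toy.assign(extra=1), expected_rows=3))
```

Returning a list of problems rather than raising on the first one makes it easier to see every issue at once.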

To determine your final score, we will compare your `predicted_probability` predictions to the ground-truth labels for the observations in `test.csv` and calculate the [ROC AUC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html). We choose this metric because we not only want to be able to predict which loans will default, but also want a likelihood score that ranks borrowers by risk, so that interventions and support can be targeted most accurately.
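One way to read ROC AUC is as the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter, with ties counted as half. The sketch below computes it that way in plain Python on made-up labels and scores. `pairwise_auc` is a hypothetical helper written for illustration; in practice scikit-learn's `roc_auc_score` does this job.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = default) and predicted probabilities.
labels = [0, 0, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.20, 0.80]

auc = pairwise_auc(labels, scores)

# Squaring is monotonic on [0, 1], so the ordering of the scores,
# and therefore the AUC, does not change.
auc_squared = pairwise_auc(labels, [s ** 2 for s in scores])

print(auc, auc_squared)
```

Because only the ordering of the scores matters, rescaling the predictions with any increasing function leaves the score unchanged.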

## Import Python Modules

First, import the primary modules that will be used in this project. Remember that this is an open-ended project, so feel free to use any of your favorite libraries that may be useful for this challenge. For example, some of the following popular packages may be useful:

- pandas
- numpy
- Scipy
- Scikit-learn
- keras
- matplotlib
- seaborn
- etc, etc

In [2]:
# Import required packages

# Data packages
import pandas as pd
import numpy as np

# Machine Learning / Classification packages
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier

# Visualization Packages
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
# Import any other packages you may want to use

## Load the Data

Let's start by loading the dataset `train.csv` into a dataframe `train_df`, and `test.csv` into a dataframe `test_df` and display the shape of the dataframes.

In [4]:
train_df = pd.read_csv("train.csv")
print('train_df Shape:', train_df.shape)
train_df.head()

train_df Shape: (255347, 18)


Unnamed: 0,LoanID,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner,Default
0,I38PQUQS96,56,85994,50587,520,80,4,15.23,36,0.44,Bachelor's,Full-time,Divorced,Yes,Yes,Other,Yes,0
1,HPSK72WA7R,69,50432,124440,458,15,1,4.81,60,0.68,Master's,Full-time,Married,No,No,Other,Yes,0
2,C1OZ6DPJ8Y,46,84208,129188,451,26,3,21.17,24,0.31,Master's,Unemployed,Divorced,Yes,Yes,Auto,No,1
3,V2KKSFM3UN,32,31713,44799,743,0,3,7.07,24,0.23,High School,Full-time,Married,No,No,Business,No,0
4,EY08JDHTZP,60,20437,9139,633,8,4,6.51,48,0.73,Bachelor's,Unemployed,Divorced,No,Yes,Auto,No,0


In [5]:
test_df = pd.read_csv("test.csv")
print('test_df Shape:', test_df.shape)
test_df.head()

test_df Shape: (109435, 17)


Unnamed: 0,LoanID,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner
0,7RYZGMKJIR,32,131645,43797,802,23,2,6.1,24,0.13,High School,Full-time,Divorced,Yes,No,Other,No
1,JDL5RH07AM,61,134312,18402,369,87,2,12.99,60,0.59,High School,Self-employed,Single,No,No,Business,Yes
2,STAL716Y79,55,115809,151774,563,3,3,5.51,48,0.82,Bachelor's,Full-time,Single,Yes,Yes,Other,Yes
3,SO0KKJ3IQB,58,94970,55789,337,24,1,23.93,36,0.77,Bachelor's,Unemployed,Divorced,No,No,Business,No
4,T99CWTYDCP,63,71727,189798,451,52,3,22.05,48,0.44,PhD,Unemployed,Single,Yes,No,Auto,No


## Explore, Clean, Validate, and Visualize the Data (optional)

Feel free to explore, clean, validate, and visualize the data however you see fit for this competition to help determine or optimize your predictive model. Please note: the final autograding is based only on the ROC AUC of the predictions in `prediction_df`.
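For a binary target, two quick checks that are often useful are the class balance and the target rate within each category of a feature. The sketch below runs both on a small made-up frame that only borrows the `Default` and `EmploymentType` column names. The values are illustrative and are not taken from `train.csv`.

```python
import pandas as pd

# Made-up rows; only the column names mirror the real dataset.
toy_train = pd.DataFrame({
    "EmploymentType": ["Full-time", "Unemployed", "Full-time", "Unemployed",
                       "Self-employed", "Full-time"],
    "Default": [0, 1, 0, 0, 0, 1],
})

# Share of each class in the target.
class_share = toy_train["Default"].value_counts(normalize=True)

# Default rate within each category of one feature.
rate_by_group = toy_train.groupby("EmploymentType")["Default"].mean()

print(class_share)
print(rate_by_group)
```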

In [6]:
# your code here (optional)
train_df.isna().sum()


LoanID            0
Age               0
Income            0
LoanAmount        0
CreditScore       0
MonthsEmployed    0
NumCreditLines    0
InterestRate      0
LoanTerm          0
DTIRatio          0
Education         0
EmploymentType    0
MaritalStatus     0
HasMortgage       0
HasDependents     0
LoanPurpose       0
HasCoSigner       0
Default           0
dtype: int64

In [7]:
test_df.isna().sum()


LoanID            0
Age               0
Income            0
LoanAmount        0
CreditScore       0
MonthsEmployed    0
NumCreditLines    0
InterestRate      0
LoanTerm          0
DTIRatio          0
Education         0
EmploymentType    0
MaritalStatus     0
HasMortgage       0
HasDependents     0
LoanPurpose       0
HasCoSigner       0
dtype: int64

In [8]:
train_df.describe()

Unnamed: 0,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Default
count,255347.0,255347.0,255347.0,255347.0,255347.0,255347.0,255347.0,255347.0,255347.0,255347.0
mean,43.498306,82499.304597,127578.865512,574.264346,59.541976,2.501036,13.492773,36.025894,0.500212,0.116128
std,14.990258,38963.013729,70840.706142,158.903867,34.643376,1.117018,6.636443,16.96933,0.230917,0.320379
min,18.0,15000.0,5000.0,300.0,0.0,1.0,2.0,12.0,0.1,0.0
25%,31.0,48825.5,66156.0,437.0,30.0,2.0,7.77,24.0,0.3,0.0
50%,43.0,82466.0,127556.0,574.0,60.0,2.0,13.46,36.0,0.5,0.0
75%,56.0,116219.0,188985.0,712.0,90.0,3.0,19.25,48.0,0.7,0.0
max,69.0,149999.0,249999.0,849.0,119.0,4.0,25.0,60.0,0.9,1.0


In [9]:
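# Restrict to numeric columns so that corr() does not fail on the string columns.
# Rough guide to |r|: 0.0-0.2 negligible, 0.2-0.4 weak, 0.4-0.6 moderate, 0.6-0.8 strong
train_df.select_dtypes(include='number').corr()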

Unnamed: 0,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Default
Age,1.0,-0.001244,-0.002213,-0.000548,-0.000341,-0.00089,-0.001127,0.000263,-0.004689,-0.167783
Income,-0.001244,1.0,-0.000865,-0.00143,0.002675,-0.002016,-0.002303,-0.000998,0.000205,-0.099119
LoanAmount,-0.002213,-0.000865,1.0,0.001261,0.002817,0.000794,-0.002291,0.002538,0.001122,0.086659
CreditScore,-0.000548,-0.00143,0.001261,1.0,0.000613,1.6e-05,0.000436,0.00113,-0.001039,-0.034166
MonthsEmployed,-0.000341,0.002675,0.002817,0.000613,1.0,0.001267,9.6e-05,-0.001166,0.001765,-0.097374
NumCreditLines,-0.00089,-0.002016,0.000794,1.6e-05,0.001267,1.0,-0.000297,-0.000226,-0.000586,0.02833
InterestRate,-0.001127,-0.002303,-0.002291,0.000436,9.6e-05,-0.000297,1.0,0.000892,0.000575,0.131273
LoanTerm,0.000263,-0.000998,0.002538,0.00113,-0.001166,-0.000226,0.000892,1.0,0.002273,0.000545
DTIRatio,-0.004689,0.000205,0.001122,-0.001039,0.001765,-0.000586,0.000575,0.002273,1.0,0.019236
Default,-0.167783,-0.099119,0.086659,-0.034166,-0.097374,0.02833,0.131273,0.000545,0.019236,1.0


In [10]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 255347 entries, 0 to 255346
Data columns (total 18 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   LoanID          255347 non-null  object 
 1   Age             255347 non-null  int64  
 2   Income          255347 non-null  int64  
 3   LoanAmount      255347 non-null  int64  
 4   CreditScore     255347 non-null  int64  
 5   MonthsEmployed  255347 non-null  int64  
 6   NumCreditLines  255347 non-null  int64  
 7   InterestRate    255347 non-null  float64
 8   LoanTerm        255347 non-null  int64  
 9   DTIRatio        255347 non-null  float64
 10  Education       255347 non-null  object 
 11  EmploymentType  255347 non-null  object 
 12  MaritalStatus   255347 non-null  object 
 13  HasMortgage     255347 non-null  object 
 14  HasDependents   255347 non-null  object 
 15  LoanPurpose     255347 non-null  object 
 16  HasCoSigner     255347 non-null  object 
 17  Default   

In [11]:
train_df.head()

Unnamed: 0,LoanID,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner,Default
0,I38PQUQS96,56,85994,50587,520,80,4,15.23,36,0.44,Bachelor's,Full-time,Divorced,Yes,Yes,Other,Yes,0
1,HPSK72WA7R,69,50432,124440,458,15,1,4.81,60,0.68,Master's,Full-time,Married,No,No,Other,Yes,0
2,C1OZ6DPJ8Y,46,84208,129188,451,26,3,21.17,24,0.31,Master's,Unemployed,Divorced,Yes,Yes,Auto,No,1
3,V2KKSFM3UN,32,31713,44799,743,0,3,7.07,24,0.23,High School,Full-time,Married,No,No,Business,No,0
4,EY08JDHTZP,60,20437,9139,633,8,4,6.51,48,0.73,Bachelor's,Unemployed,Divorced,No,Yes,Auto,No,0


In [12]:
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()

train_df_encoded = train_df.copy()
test_df_encoded = test_df.copy()

train_df_encoded.drop(columns = 'LoanID',inplace = True)
train_df_loanid = train_df.LoanID.copy()

test_df_encoded.drop(columns = 'LoanID', inplace = True)
test_df_loanid = test_df.LoanID.copy()

categorical_columns = [col for col in train_df_encoded.columns if train_df[col].dtype == 'object']

train_df_encoded[categorical_columns] = encoder.fit_transform(train_df[categorical_columns])
test_df_encoded[categorical_columns] = encoder.transform(test_df[categorical_columns])

train_df_encoded.head()

Unnamed: 0,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner,Default
0,56,85994,50587,520,80,4,15.23,36,0.44,0.0,0.0,0.0,1.0,1.0,4.0,1.0,0
1,69,50432,124440,458,15,1,4.81,60,0.68,2.0,0.0,1.0,0.0,0.0,4.0,1.0,0
2,46,84208,129188,451,26,3,21.17,24,0.31,2.0,3.0,0.0,1.0,1.0,0.0,0.0,1
3,32,31713,44799,743,0,3,7.07,24,0.23,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0
4,60,20437,9139,633,8,4,6.51,48,0.73,0.0,3.0,0.0,0.0,1.0,0.0,0.0,0
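In the encoded rows above, `Bachelor's` becomes 0, `High School` 1, `Master's` 2 and `PhD` 3. This matches `OrdinalEncoder`'s default behaviour of sorting each column's categories, so the codes follow alphabetical order rather than level of education. If a meaningful order is wanted, it can be passed explicitly. The sketch below assumes that the four education levels visible above are the only ones present, and the chosen ordering is itself an assumption.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({"Education": ["Bachelor's", "Master's", "High School", "PhD"]})

# Default behaviour: categories are sorted, so the codes follow alphabetical order.
default_codes = OrdinalEncoder().fit_transform(toy[["Education"]]).ravel()

# Explicit order (an assumption about what "lowest to highest" should mean here).
levels = ["High School", "Bachelor's", "Master's", "PhD"]
ordered_codes = OrdinalEncoder(categories=[levels]).fit_transform(toy[["Education"]]).ravel()

print(default_codes)
print(ordered_codes)
```

For columns with no natural order, such as `LoanPurpose`, one-hot encoding is a common alternative, since integer codes suggest an ordering that a model may pick up on.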


In [13]:
train_df_encoded.astype(float)

Unnamed: 0,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner,Default
0,56.0,85994.0,50587.0,520.0,80.0,4.0,15.23,36.0,0.44,0.0,0.0,0.0,1.0,1.0,4.0,1.0,0.0
1,69.0,50432.0,124440.0,458.0,15.0,1.0,4.81,60.0,0.68,2.0,0.0,1.0,0.0,0.0,4.0,1.0,0.0
2,46.0,84208.0,129188.0,451.0,26.0,3.0,21.17,24.0,0.31,2.0,3.0,0.0,1.0,1.0,0.0,0.0,1.0
3,32.0,31713.0,44799.0,743.0,0.0,3.0,7.07,24.0,0.23,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
4,60.0,20437.0,9139.0,633.0,8.0,4.0,6.51,48.0,0.73,0.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255342,19.0,37979.0,210682.0,541.0,109.0,4.0,14.11,12.0,0.85,0.0,0.0,1.0,0.0,0.0,4.0,0.0,0.0
255343,32.0,51953.0,189899.0,511.0,14.0,2.0,11.55,24.0,0.21,1.0,1.0,0.0,0.0,0.0,3.0,0.0,1.0
255344,56.0,84820.0,208294.0,597.0,70.0,3.0,5.29,60.0,0.50,1.0,2.0,1.0,1.0,1.0,0.0,1.0,0.0
255345,42.0,85109.0,60575.0,809.0,40.0,1.0,20.90,48.0,0.44,1.0,1.0,2.0,1.0,1.0,4.0,0.0,0.0


In [14]:
test_df_encoded.astype(float)

Unnamed: 0,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner
0,32.0,131645.0,43797.0,802.0,23.0,2.0,6.10,24.0,0.13,1.0,0.0,0.0,1.0,0.0,4.0,0.0
1,61.0,134312.0,18402.0,369.0,87.0,2.0,12.99,60.0,0.59,1.0,2.0,2.0,0.0,0.0,1.0,1.0
2,55.0,115809.0,151774.0,563.0,3.0,3.0,5.51,48.0,0.82,0.0,0.0,2.0,1.0,1.0,4.0,1.0
3,58.0,94970.0,55789.0,337.0,24.0,1.0,23.93,36.0,0.77,0.0,3.0,0.0,0.0,0.0,1.0,0.0
4,63.0,71727.0,189798.0,451.0,52.0,3.0,22.05,48.0,0.44,3.0,3.0,2.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109430,67.0,76970.0,108110.0,404.0,67.0,3.0,10.51,36.0,0.18,3.0,2.0,1.0,1.0,1.0,0.0,0.0
109431,44.0,108272.0,238508.0,335.0,28.0,1.0,9.65,24.0,0.32,3.0,0.0,0.0,1.0,1.0,2.0,0.0
109432,54.0,73526.0,18513.0,576.0,75.0,2.0,17.22,36.0,0.62,2.0,1.0,0.0,0.0,1.0,2.0,0.0
109433,60.0,75296.0,38414.0,369.0,71.0,4.0,17.69,36.0,0.66,3.0,1.0,1.0,0.0,0.0,1.0,1.0


In [15]:
X = train_df_encoded.drop(columns = 'Default')
y = train_df_encoded.Default
X_valid = test_df_encoded

## Make predictions (required)

Remember you should create a dataframe named `prediction_df` with exactly 109,435 entries plus a header row, predicting the likelihood that each borrower in `test_df` will default on their loan. Your submission will throw an error if you have extra columns (beyond `LoanID` and `predicted_probability`) or extra rows.

The file should have exactly 2 columns:
- `LoanID` (sorted in any order)
- `predicted_probability` (contains your numeric predicted probabilities between 0 and 1, e.g. from `estimator.predict_proba(X)[:, 1]`)

The naming convention of the dataframe and columns are critical for our autograding, so please make sure to use the exact naming conventions of `prediction_df` with column names `LoanID` and `predicted_probability`!
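As a rough illustration of how the two columns come together, the sketch below fits a scikit-learn `LogisticRegression` on synthetic data and builds a frame in the required shape. The data, the IDs and the choice of model are all assumptions made for the example. The frame is named `toy_prediction_df` so that it is not confused with the real `prediction_df`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for the real feature matrices and IDs.
X_toy_train = rng.normal(size=(200, 3))
y_toy_train = (X_toy_train[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_toy_test = rng.normal(size=(5, 3))
toy_ids = pd.Series(["ID0", "ID1", "ID2", "ID3", "ID4"])

estimator = LogisticRegression().fit(X_toy_train, y_toy_train)

# predict_proba returns one column per class, ordered as in estimator.classes_;
# column 1 is the probability of class 1 (default).
proba = estimator.predict_proba(X_toy_test)[:, 1]

toy_prediction_df = pd.DataFrame({"LoanID": toy_ids, "predicted_probability": proba})
print(toy_prediction_df)
```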

### Example prediction submission:

The code below is a very naive prediction method that simply predicts loan defaults using a Dummy Classifier. This is used as just an example showing the submission format required. Please change/alter/delete this code below and create your own improved prediction methods for generating `prediction_df`.
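For reference, a Dummy Classifier baseline of the kind described here could look like the sketch below. It uses made-up labels and is not the notebook's original example cell. With `strategy="prior"` every row receives the same predicted probability, so the scores carry no ranking information and the ROC AUC is 0.5.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

# Made-up labels with 2 defaults out of 10.
y_toy = np.array([0, 0, 0, 0, 1, 0, 0, 1, 0, 0])
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by DummyClassifier

baseline = DummyClassifier(strategy="prior").fit(X_toy, y_toy)

# Every row gets the default rate seen during fitting.
baseline_proba = baseline.predict_proba(X_toy)[:, 1]

print(baseline_proba[:3])
print(roc_auc_score(y_toy, baseline_proba))
```

Any real model should score above this baseline on held-out data.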

**PLEASE CHANGE CODE BELOW TO IMPLEMENT YOUR OWN PREDICTIONS**

In [16]:
def plot_confusion_matrix(y, y_predict):
    """Plot the confusion matrix of true labels `y` against predicted labels `y_predict`."""
    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(y, y_predict)
    ax = plt.subplot()
    # annot=True writes the count in each cell; fmt='d' shows it as an integer
    sns.heatmap(cm, annot=True, fmt='d', ax=ax, cmap='Blues')
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')
    ax.set_title('Confusion Matrix')
    ax.xaxis.set_ticklabels(['no default', 'default'])
    ax.yaxis.set_ticklabels(['no default', 'default'])
    plt.show()

In [17]:
from sklearn.preprocessing import StandardScaler

In [18]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [19]:
X_scaled = pd.DataFrame(X_scaled)
X_scaled.columns = X.columns

In [20]:
X_scaled

Unnamed: 0,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner
0,0.833990,0.089693,-1.086833,-0.341492,0.590533,1.341937,0.261771,-0.001526,-0.260753,-1.335708,-1.342541,-1.225315,0.999973,0.999464,1.415354,0.999785
1,1.701221,-0.823021,-0.044309,-0.731666,-1.285731,-1.343791,-1.308350,1.412793,0.778585,0.451884,-1.342541,0.000101,-1.000027,-1.000537,1.415354,0.999785
2,0.166888,0.043854,0.022715,-0.775718,-0.968209,0.446694,1.156831,-0.708685,-0.823728,0.451884,1.342369,-1.225315,0.999973,0.999464,-1.416063,-1.000215
3,-0.767053,-1.303452,-1.168538,1.061875,-1.718715,0.446694,-0.967805,-0.708685,-1.170174,-0.441912,-1.342541,0.000101,-1.000027,-1.000537,-0.708209,-1.000215
4,1.100830,-1.592855,-1.671921,0.369631,-1.487790,1.341937,-1.052188,0.705634,0.995114,-1.335708,1.342369,-1.225315,-1.000027,0.999464,-1.416063,-1.000215
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
255342,-1.634285,-1.142632,1.173101,-0.209337,1.427636,1.341937,0.093006,-1.415845,1.514783,-1.335708,-1.342541,0.000101,-1.000027,-1.000537,1.415354,-1.000215
255343,-0.767053,-0.783984,0.879724,-0.398130,-1.314597,-0.448549,-0.292744,-0.708685,-1.256785,-0.441912,-0.447571,-1.225315,-1.000027,-1.000537,0.707499,-1.000215
255344,0.833990,0.059562,1.139391,0.143078,0.301877,0.446694,-1.236022,1.412793,-0.000918,-0.441912,0.447399,0.000101,0.999973,0.999464,-1.416063,0.999785
255345,-0.099952,0.066979,-0.945840,1.477221,-0.564091,-1.343791,1.116146,0.705634,-0.260753,-0.441912,-0.447571,1.225517,0.999973,0.999464,1.415354,-1.000215


In [21]:
X_valid = pd.DataFrame(scaler.transform(X_valid))
X_valid.columns = X_scaled.columns
X_valid.head()

Unnamed: 0,Age,Income,LoanAmount,CreditScore,MonthsEmployed,NumCreditLines,InterestRate,LoanTerm,DTIRatio,Education,EmploymentType,MaritalStatus,HasMortgage,HasDependents,LoanPurpose,HasCoSigner
0,-0.767053,1.261345,-1.182682,1.433169,-1.054806,-0.448549,-1.113968,-0.708685,-1.603231,-0.441912,-1.342541,-1.225315,0.999973,-1.000537,1.415354,-1.000215
1,1.16754,1.329794,-1.541163,-1.291754,0.792592,-0.448549,-0.07576,1.412793,0.388833,-0.441912,0.447399,1.225517,-1.000027,-1.000537,-0.708209,0.999785
2,0.767279,0.854907,0.341543,-0.070888,-1.632118,0.446694,-1.202872,0.705634,1.384866,-1.335708,-1.342541,1.225517,0.999973,0.999464,1.415354,0.999785
3,0.96741,0.320066,-1.0134,-1.493134,-1.025941,-1.343791,1.572717,-0.001526,1.168337,-1.335708,1.342369,-1.225315,-1.000027,-1.000537,-0.708209,-1.000215
4,1.30096,-0.276476,0.878298,-0.775718,-0.217704,0.446694,1.289432,0.705634,-0.260753,1.34568,1.342369,1.225517,0.999973,-1.000537,-1.416063,-1.000215


In [22]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled,y, test_size = 0.2)
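The `describe()` output earlier shows a `Default` mean of about 0.116, so defaults are the minority class. One option, which is not what the cell above does, is to pass `stratify` so that both parts of the split keep the same default rate, and `random_state` so that the split is reproducible. The sketch below shows this on synthetic data. The sizes, the 12% rate and the seed are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 4))
y_toy = (rng.random(1000) < 0.12).astype(int)  # roughly 12% positives, made up

# stratify keeps the class proportions (almost) equal in both parts;
# random_state makes the split repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

print(y_toy.mean(), y_tr.mean(), y_te.mean())
```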

In [24]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.losses import BinaryCrossentropy

In [70]:
from tensorflow.keras import regularizers


In [110]:
model = keras.Sequential([
    layers.Dense(units=32, activation='sigmoid'),
    layers.Dense(units=16, activation='sigmoid'),
    layers.Dense(units=8, activation='sigmoid'),
    layers.Dense(units=1, activation='sigmoid'),
])

In [111]:
model.compile(loss=BinaryCrossentropy(), optimizer='adam')

In [112]:
X_train_array = np.asarray(X_train)

In [113]:
X_train_array.shape

(204277, 16)

In [114]:
y_train_array = np.asarray(y_train)
y_train_array.shape

(204277,)

In [115]:
model.fit(X_train_array,y_train_array, epochs = 10)

Train on 204277 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x72968d3c42d0>

In [116]:
prediction = model.predict(np.asarray(X_test))

roc_auc_score(y_test, prediction)

0.7535689126201243

In [117]:
model.fit(
    np.asarray(X_scaled), 
    np.asarray(y), 
    epochs = 100,
    validation_split = 0.2
)

Train on 204277 samples, validate on 51070 samples
Epoch 1/100
Epoch 2/100
...
Epoch 74/

<tensorflow.python.keras.callbacks.History at 0x72968ba7a190>

In [118]:
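# Note: X_scaled is the full training set and includes rows the model was fitted on,
# so scores computed from these predictions are in-sample, not held-out.
test_prediction = model.predict(np.asarray(X_scaled))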

In [119]:
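# Load the TensorBoard notebook extension
%load_ext tensorboard

# Open an embedded TensorBoard viewer.
# Note: `logdir` is not defined in this notebook and no TensorBoard callback was
# passed to model.fit above, so there are no logs for the viewer to display.
%tensorboard --logdir {logdir}/sizes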

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


In [120]:
test_prediction

array([[0.01621174],
       [0.01831801],
       [0.29204455],
       ...,
       [0.01981489],
       [0.06590788],
       [0.02218826]], dtype=float32)

In [121]:
roc_auc_score(y, test_prediction)

0.7609379127019275

In [122]:
final_prediction = model.predict(np.asarray(X_valid))

In [123]:
final_prediction = final_prediction.flatten()
final_prediction.shape

(109435,)

In [124]:
test_df_loanid.shape

(109435,)

In [125]:
prediction_df = pd.DataFrame({
    'LoanID': test_df_loanid, 
    'predicted_probability': final_prediction
})

In [126]:
# final submission should be a dataframe 
# (call it prediction_df) with two columns and exactly 109,435 rows (plus a header row).
# The first column should be LoanID so that we know which prediction belongs to which observation. 
# The second column should be called predicted_probability 
# and should be a numeric column representing the likelihood that the borrower will default.

print('Shape:',prediction_df.shape,',should be equal to 109435,2')
print('Columns:',prediction_df.columns,',should be LoanID and "predicted_probability"')
print('Last check:')

prediction_df.head()

Shape: (109435, 2) ,should be equal to 109435,2
Columns: Index(['LoanID', 'predicted_probability'], dtype='object') ,should be LoanID and "predicted_probability"
Last check:


Unnamed: 0,LoanID,predicted_probability
0,7RYZGMKJIR,0.067909
1,JDL5RH07AM,0.015496
2,STAL716Y79,0.033215
3,SO0KKJ3IQB,0.150631
4,T99CWTYDCP,0.12278


**PLEASE CHANGE CODE ABOVE TO IMPLEMENT YOUR OWN PREDICTIONS**

## Final Tests - **IMPORTANT** - the cells below must be run prior to submission

Below are some tests to ensure your submission is in the correct format for autograding. The autograding process accepts a csv `prediction_submission.csv` which we will generate from our `prediction_df` below. Please run the tests below and ensure no assertion errors are thrown.
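The CSV round trip used by these tests can be tried on a toy frame first. The sketch below writes to an in-memory buffer instead of a file and uses made-up IDs and probabilities. It also shows why `index=False` matters: without it, the row index is written as an extra first column and the two-column check would fail.

```python
import io
import pandas as pd

# Toy frame standing in for prediction_df (illustrative values).
toy_prediction_df = pd.DataFrame({
    "LoanID": ["A1", "B2", "C3"],
    "predicted_probability": [0.1, 0.85, 0.4],
})

# index=False keeps the row index out of the output, so only two columns are written.
buffer = io.StringIO()
toy_prediction_df.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# Without index=False the index is written as an extra, unnamed first column.
buffer_with_index = io.StringIO()
toy_prediction_df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)

print(list(round_trip.columns))
print(list(with_index.columns))
```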

In [127]:
# FINAL TEST CELLS - please make sure all of your code is above these test cells

# Writing to csv for autograding purposes
prediction_df.to_csv("prediction_submission.csv", index=False)
submission = pd.read_csv("prediction_submission.csv")

assert isinstance(submission, pd.DataFrame), 'You should have a dataframe named prediction_df.'

In [128]:
# FINAL TEST CELLS - please make sure all of your code is above these test cells

assert submission.columns[0] == 'LoanID', 'The first column name should be LoanID.'
assert submission.columns[1] == 'predicted_probability', 'The second column name should be predicted_probability.'

In [129]:
# FINAL TEST CELLS - please make sure all of your code is above these test cells

assert submission.shape[0] == 109435, 'The dataframe prediction_df should have 109435 rows.'

In [130]:
# FINAL TEST CELLS - please make sure all of your code is above these test cells

assert submission.shape[1] == 2, 'The dataframe prediction_df should have 2 columns.'

In [131]:
# FINAL TEST CELLS - please make sure all of your code is above these test cells

## This cell calculates the auc score and is hidden. Submit Assignment to see AUC score.

## SUBMIT YOUR WORK!

Once we are happy with our `prediction_df` and `prediction_submission.csv` we can submit for autograding! Submit by using the blue **Submit Assignment** button at the top of your notebook. Don't worry if your initial submission isn't perfect: you have multiple submission attempts and will receive some feedback after each submission!