# Project: Creditworthiness

In [None]:
# Load Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn import metrics

from sklearn.linear_model import LogisticRegression

# plt.style.use('seaborn-whitegrid')
plt.rcParams['figure.figsize'] = [13, 13]

## Step 1: Business and Data Understanding

In [None]:
# load past applications
past_applications = pd.read_excel('credit-data-training.xlsx')
new_customers = pd.read_excel('customers-to-score.xlsx')


In [None]:
past_applications.head()

### Key Decisions:

* What decisions need to be made?
  * I need to evaluate the creditworthiness of the 500 new loan applicants.

* What data is needed to inform those decisions?
  * I need data on past loan applicants: the result of each credit application and the fields used to reach that result, such as duration of credit, credit amount, installment, and age of the applicant.

* What kind of model (Continuous, Binary, Non-Binary, Time-Series) do we need to use to help make these decisions?
  * The model type will be Binary as I will be predicting an applicant to be either creditworthy or non-creditworthy.


## Step 2: Building the Training Set

### Guidelines:
* For numerical data fields, are there any fields that correlate highly with each other? The correlation should be at least .70 to be considered “high”.
* Is any data missing in each of the data fields? Fields with a lot of missing data should be removed.
* Are there only a few values in a subset of your data field? Does the data field look very uniform (is there only one value for the entire field)? This is called “low variability”, and you should remove fields that have low variability. Refer to the "Tips" section to find examples of data fields with low variability.
* Your clean data set should have 13 columns, and the average of Age Years should be 36 (rounded up).
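
The three checks in these guidelines can be sketched on a toy DataFrame. This is a minimal sketch, not the cleanup actually performed in this notebook: the column names, the toy values, and the 0.30 (missing data) and 0.80 (dominant value) cutoffs are assumptions chosen for illustration. Only the 0.70 correlation threshold comes from the guidelines.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not the credit data set)
toy = pd.DataFrame({
    'amount':   [100, 200, 300, 400, 500, 600],
    'duration': [10, 21, 29, 41, 50, 62],                # tracks 'amount' closely
    'age':      [25, np.nan, 40, np.nan, 31, np.nan],    # half of the values missing
    'foreign':  ['no', 'no', 'no', 'no', 'no', 'yes'],   # one dominant value
    'purpose':  ['car', 'home', 'car', 'other', 'home', 'car'],
})

# 1. Pairs of numeric fields with |correlation| >= 0.70
corr = toy.select_dtypes(include='number').corr().abs()
high_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] >= 0.70
]

# 2. Share of missing values per field (assumed cutoff: 0.30)
missing_share = toy.isna().mean()
mostly_missing = list(missing_share[missing_share > 0.30].index)

# 3. Share of the most frequent value per field (assumed cutoff: 0.80)
top_share = toy.apply(lambda s: s.value_counts(normalize=True).iloc[0])
low_variability = list(top_share[top_share > 0.80].index)

print(high_pairs)
print(mostly_missing)
print(low_variability)
```

The cutoffs for "a lot of missing data" and "low variability" are judgment calls; the histograms and bar charts in this step are still the main evidence for each decision.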


In [None]:
# Variables Non Null Count
past_applications.info()

columns_to_drop = []

In [None]:
# Data Visualization
fig, axes = plt.subplots(4, 5, figsize=(23, 23))

for i, column in enumerate(past_applications.columns):
    ax = axes[i // 5][i % 5]
    if past_applications[column].dtype == np.dtype('O'):
        # categorical field: bar chart of value counts
        past_applications[column].value_counts().plot(kind='bar', rot=0, ax=ax).set_title(column)
    else:
        # numeric field: histogram
        past_applications[column].hist(ax=ax).set_title(column)


In [None]:
# drop Duration-in-Current-address due to many missing values
columns_to_drop.append('Duration-in-Current-address')
# drop Concurrent-Credits due to low variability
columns_to_drop.append('Concurrent-Credits')
# drop Occupation due to low variability
columns_to_drop.append('Occupation')

# drop due to low variability
columns_to_drop.append('Guarantors')
columns_to_drop.append('Type-of-apartment')
columns_to_drop.append('No-of-dependents')
columns_to_drop.append('Foreign-Worker')

clean_data = past_applications.drop(columns=columns_to_drop)

past_applications[columns_to_drop].info()

In [None]:
# Removed Data Visualization
fig, axes = plt.subplots(2, 4, figsize=(15, 9))

for i, column in enumerate(columns_to_drop):
    ax = axes[i // 4][i % 4]
    if past_applications[column].dtype == np.dtype('O'):
        # categorical field: bar chart of value counts
        past_applications[column].value_counts().plot(kind='bar', rot=0, ax=ax).set_title(column)
    else:
        # numeric field: histogram
        past_applications[column].hist(ax=ax).set_title(column)

fig.savefig('droped_variables_graph.png')

In [None]:
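# Correlation (numeric columns only; recent pandas versions raise an error on text columns otherwise)
clean_data.corr(numeric_only=True)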

In [None]:
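# Impute missing numeric values with the column mean, then show the rounded means
clean_data = clean_data.fillna(clean_data.mean(numeric_only=True))
clean_data.mean(numeric_only=True).round(0)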

### Answer this question:

* In your cleanup process, which fields did you remove or impute? Please justify why you removed or imputed these fields. Visualizations are encouraged.
  * The only imputed field is Age-years. There are 12 applicants with no age data. I cannot remove these applicants because I would lose 2.4% of the data, so I fill the missing values with the average age of 36.
  * I removed Duration in current address because it has many missing values.
  * I removed the following fields because of low variability, which could bias my model:
    - Concurrent credits
    - Occupation
    - Guarantors
    - Type of apartment
    - No of dependents
    - Foreign worker
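
The imputation described above can be sketched on a toy column. This is a minimal sketch with made-up ages, chosen so that the observed mean is 36; it is not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical values); NaN marks a missing age
ages = pd.DataFrame({'Age-years': [30.0, np.nan, 42.0, 36.0, np.nan, 36.0]})

n_missing = int(ages['Age-years'].isna().sum())
share_missing = n_missing / len(ages)

# Mean of the observed ages; NaN values are skipped by default
mean_age = ages['Age-years'].mean()

# Fill only the missing entries, leaving observed ages untouched
imputed = ages.fillna({'Age-years': mean_age})

print(n_missing, round(share_missing, 3), mean_age)
print(imputed['Age-years'].tolist())
```

Passing a dictionary to `fillna` restricts the fill to the named column, so text columns are never touched.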

## Step 3: Train your Classification Models

First, create your Estimation and Validation samples where 70% of your dataset should go to Estimation and 30% of your entire dataset should be reserved for Validation. Set the Random Seed to 1.

Create all of the following models: Logistic Regression, Decision Tree, Forest Model, Boosted Model
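
As a rough illustration of this step, the sketch below splits a synthetic data set 70/30 with random seed 1 and fits the four model types with scikit-learn. Several things here are assumptions: the data comes from `make_classification` rather than the credit data, "Forest Model" and "Boosted Model" are mapped to `RandomForestClassifier` and `GradientBoostingClassifier`, and the hyperparameters are illustrative rather than tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned data (not the credit data set)
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=1)

# 70% Estimation, 30% Validation, random seed 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Assumed mapping from the model names in the text to scikit-learn estimators
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=1),
    'Forest Model': RandomForestClassifier(n_estimators=100, random_state=1),
    'Boosted Model': GradientBoostingClassifier(random_state=1),
}

accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the given data
    accuracy[name] = model.score(X_test, y_test)
    print(f'{name}: {accuracy[name]:.3f}')
```

Keeping the models in a dictionary makes it easy to compare them on the same Validation sample.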

In [None]:
# replace target to binary
target_column = 'Credit-Application-Result'
target_label = ['Creditworthy', 'Non-Creditworthy'] # list(clean_data[target_column].unique())
# assign the result back: an inplace replace on a selected column may not update clean_data
clean_data[target_column] = clean_data[target_column].replace({'Creditworthy': 0, 'Non-Creditworthy': 1})

# Categorical Columns
categorical_columns = ['Account-Balance', 'Payment-Status-of-Previous-Credit', 'Purpose', 'Value-Savings-Stocks', 'Length-of-current-employment', 'No-of-Credits-at-this-Bank']
prefix = ['AccountB', 'PaymentSPC', 'Purpose', 'ValueSS', 'LengthCE', 'NumberCB']

# convert categorical variables into dummy (indicator) variables
clean_data_with_dummies = pd.get_dummies(clean_data, prefix=prefix, columns=categorical_columns, drop_first=False)

clean_data_with_dummies.info()

In [None]:
# Split data set to train and test subsets
train, test = train_test_split(clean_data_with_dummies, test_size=0.3, random_state=1)

# Training Data
Y_train = train['Credit-Application-Result']
X_train = train.drop(columns='Credit-Application-Result')

# Test Data
Y_test = test['Credit-Application-Result']
X_test = test.drop(columns='Credit-Application-Result')

In [None]:
X_train.head()

In [None]:
# Logistic Regression Model


In [None]:
# Decision Tree Model


In [None]:
# Forest Model


In [None]:
# Boosted Tree Model


### Answer these questions for each model you created:

* Which predictor variables are significant or the most important? Please show the p-values or variable importance charts for all of your predictor variables.

* Validate your model against the Validation set. What was the overall percent accuracy? Show the confusion matrix. Is there any bias in the model’s predictions?
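
One possible way to produce the numbers these questions ask for is sketched below: a variable importance ranking, a confusion matrix, the overall accuracy, and the accuracy within each actual class. The data is synthetic and imbalanced on purpose, the feature names are hypothetical, and comparing per-class accuracy is only one common way to look for bias, not necessarily the one the project rubric has in mind.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data: about 70% class 0, 30% class 1
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           weights=[0.7, 0.3], random_state=1)
X = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(6)])  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

# Variable importance, largest first (impurity-based; values sum to 1)
importance = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance)

# Rows are actual classes, columns are predicted classes
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
overall_accuracy = accuracy_score(y_test, y_pred)

# Per-class accuracy: share of each actual class predicted correctly.
# A large gap between the two values suggests bias toward one class.
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(overall_accuracy, per_class_accuracy)
```

For Logistic Regression, p-values are not exposed by scikit-learn's `LogisticRegression`, so a separate statistics package would be needed for that part of the question.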


## Step 4: Writeup

Decide on the best model and score your new customers. For reviewing consistency, if Score_Creditworthy is greater than Score_NonCreditworthy, the person should be labeled as “Creditworthy”.
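
The labeling rule can be sketched with a small table of scores. The score values below are made up for illustration; in practice they would come from the chosen model, for example from `predict_proba`, whose columns follow the sorted class labels (0 = Creditworthy, 1 = Non-Creditworthy under the encoding used in Step 3).

```python
import pandas as pd

# Hypothetical scores for five new customers (made-up values)
scores = pd.DataFrame({
    'Score_Creditworthy':    [0.91, 0.40, 0.50, 0.73, 0.18],
    'Score_NonCreditworthy': [0.09, 0.60, 0.50, 0.27, 0.82],
})

# Strictly greater, as the rule states: a tie is labeled Non-Creditworthy
is_creditworthy = scores['Score_Creditworthy'] > scores['Score_NonCreditworthy']
scores['Label'] = is_creditworthy.map({True: 'Creditworthy', False: 'Non-Creditworthy'})

# Number of customers labeled Creditworthy
n_creditworthy = int(is_creditworthy.sum())
print(scores)
print(n_creditworthy)
```

When scoring the real `new_customers` data, the dummy columns must match the training columns exactly, or the model will reject the input.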

### Answer these questions
* Which model did you choose to use? 
* How many individuals are creditworthy?