In [1]:
# dependencies and path(s)
import pandas as pd
path = 'Resources/lending_data.csv'

## Data Exploration

In [2]:
# load the data into dataframe
df = pd.read_csv(path)
df

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.430740,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0
...,...,...,...,...,...,...,...,...
77531,19100.0,11.261,86600,0.653580,12,2,56600,1
77532,17700.0,10.662,80900,0.629172,11,2,50900,1
77533,17600.0,10.595,80300,0.626401,11,2,50300,1
77534,16300.0,10.068,75300,0.601594,10,2,45300,1


In [3]:
# print columns info for the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77536 entries, 0 to 77535
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   loan_size         77536 non-null  float64
 1   interest_rate     77536 non-null  float64
 2   borrower_income   77536 non-null  int64  
 3   debt_to_income    77536 non-null  float64
 4   num_of_accounts   77536 non-null  int64  
 5   derogatory_marks  77536 non-null  int64  
 6   total_debt        77536 non-null  int64  
 7   loan_status       77536 non-null  int64  
dtypes: float64(3), int64(5)
memory usage: 4.7 MB


In [6]:
# ranges defined below: prints min - max of each column in dataframe
ranges(df)

The range of "loan_size" is 18,800.0
The range of "interest_rate" is 7.98
The range of "borrower_income" is 75,200
The range of "debt_to_income" is 0.71
The range of "num_of_accounts" is 16
The range of "derogatory_marks" is 3
The range of "total_debt" is 75,200
The range of "loan_status" is 1


In [7]:
# show unique values for 'loan_status' column
set(df['loan_status'])

{0, 1}

Note:
- There is no null data in the original dataset.
- The dataframe contains 77,536 rows and 8 columns.
- Every column holds numeric data, and each dtype (float64 or int64) matches its contents.
- The ranges of values for each of the input columns span different orders of magnitude.
- The possible values of 'loan_status' are '0' or '1'. 
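
The notes above can also be recomputed programmatically. The sketch below is one way to do that; the helper name `check_notes` is hypothetical, and the small `sample` frame is a stand-in built from three rows of the preview shown earlier (with only three of the columns), not the full dataset.

```python
import pandas as pd

def check_notes(frame, target="loan_status"):
    """Recompute the facts listed in the notes for any dataframe."""
    return {
        # total number of missing values across all columns
        "null_count": int(frame.isnull().sum().sum()),
        # (rows, columns)
        "shape": frame.shape,
        # True only if every column has a numeric dtype
        "all_numeric": all(pd.api.types.is_numeric_dtype(t) for t in frame.dtypes),
        # distinct values found in the target column
        "target_values": sorted(frame[target].unique().tolist()),
    }

# Stand-in data: rows 0, 1 and 77531 of the preview, three columns only.
sample = pd.DataFrame({
    "loan_size": [10700.0, 8400.0, 19100.0],
    "interest_rate": [7.672, 6.692, 11.261],
    "loan_status": [0, 0, 1],
})
print(check_notes(sample))
```

On the real data, the same call would be `check_notes(df)`.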

## Methodology
The target, 'loan_status', is a binary value that represents whether the loan is current or delinquent. A classifier is used to model a system with discrete outcomes such as 'loan_status'. The other 7 columns contain numeric data about the loan and the debtor, and the goal is to use this data to predict loan status.
 
A logistic regression (classifier) maps data from N dimensions to a value between 0 and 1 using the logistic function, with coefficients chosen to minimize error on the training data. The output value is then used to classify the result via a threshold (e.g. outputs less than 0.5 classify as a 0 and outputs greater than 0.5 classify as a 1). The fit is sensitive to feature scale: features that span larger ranges can dominate the optimization and are penalized unevenly by regularization. Because of this, a logistic model trained on data that is **not** scaled is at risk of converging slowly and fitting poorly.
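
To make the mapping and the threshold step concrete, here is a minimal sketch in plain Python. The function names, the weights, and the bias are all hypothetical and hand-picked for illustration; they are not fitted to the lending data, and the tie at exactly 0.5 is resolved to class 0 as an assumption of this sketch.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one sample under a linear logistic model."""
    # linear combination of the inputs
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = logistic(z)
    # outputs above the threshold are classified as 1, everything else as 0
    label = 1 if probability > threshold else 0
    return probability, label

# Hypothetical coefficients, not fitted to any data.
weights = [0.8, -0.5]
bias = 0.1
probability, label = predict_label([1.0, 2.0], weights, bias)
print(probability, label)
```

Here the linear combination is slightly negative, so the probability falls just below 0.5 and the sample is classified as 0.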
 
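A random forest classifier classifies data from N dimensions by selecting the most popular outcome (0 or 1) among the predictions made by many individual decision trees. This type of model does not require scaling, because each tree splits on thresholds within a single feature, which depend only on the ordering of that feature's values.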
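
The voting idea can be sketched in a few lines. The helper `majority_vote` and the list of votes are hypothetical; note also that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this is an illustration of the concept, not of that implementation.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the most common class label among the individual tree predictions."""
    label, _count = Counter(tree_predictions).most_common(1)[0]
    return label

# Hypothetical predictions from five trees for a single loan.
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))  # three of the five trees say 1
```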
 
Since the data spans different orders of magnitude across the columns, a logistic classifier may not work well on the raw data. Therefore, I predict that a random forest model will score a higher accuracy than an unscaled logistic classifier on the provided credit risk data. However, when the data is scaled first, I expect the logistic classifier to reach an accuracy similar to that of the random forest.
 
The logistic model will be fit both with and without scaling. For both logistic models, I expect to have to adjust 'max_iter' to get convergence. For the random forest model, I will leave 'n_estimators' at its default (100), though it is worth noting that this could also affect the accuracy of the model.
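
The plan above could be sketched with scikit-learn roughly as follows. This is an outline under assumptions, not the analysis itself: the helper name `compare_models`, the `max_iter=1000` value, the `random_state`, the choice of `StandardScaler`, and the default train/test split are all illustrative, and the demo uses synthetic data from `make_classification` so that the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def compare_models(X, y, max_iter=1000, random_state=1):
    """Fit the three planned models and return their test-set accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state, stratify=y
    )
    models = {
        "logistic_unscaled": LogisticRegression(max_iter=max_iter),
        # The pipeline fits the scaler on the training split only.
        "logistic_scaled": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=max_iter)
        ),
        "random_forest": RandomForestClassifier(
            n_estimators=100, random_state=random_state
        ),
    }
    return {
        name: model.fit(X_train, y_train).score(X_test, y_test)
        for name, model in models.items()
    }

# Synthetic stand-in data (assumption): 7 features, binary target.
X_demo, y_demo = make_classification(n_samples=500, n_features=7, random_state=1)
scores = compare_models(X_demo, y_demo)
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
```

With the lending data, the inputs would be `X = df.drop(columns='loan_status')` and `y = df['loan_status']`. Wrapping the scaler and the classifier in one pipeline keeps the scaling statistics from leaking out of the test split.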


In [5]:
def ranges(dataframe):
    """Print the range (max - min) of every column in the dataframe."""
    for column in dataframe.columns:
        min_ = dataframe[column].min()
        max_ = dataframe[column].max()
        range_ = max_ - min_

        # round fractional ranges to 2 decimals; leave whole-number ranges as-is
        if range_ != int(range_):
            range_ = round(range_, 2)
        print('The range of "{0}" is {1:,}'.format(column, range_))