In [1]:
# Import the modules
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.metrics import confusion_matrix, classification_report

---

## Split the Data into Training and Testing Sets

### Step 1: Read the `lending_data.csv` data from the `Resources` folder into a Pandas DataFrame.

In [11]:
# Read the CSV file into a Pandas DataFrame
# (this path expects lending_data.csv in the working directory)
df = pd.read_csv(
    Path('lending_data.csv')
)

# Review the DataFrame
df.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt,loan_status
0,10700.0,7.672,52800,0.431818,5,1,22800,0
1,8400.0,6.692,43600,0.311927,3,0,13600,0
2,9000.0,6.963,46100,0.349241,3,0,16100,0
3,10700.0,7.664,52700,0.43074,5,1,22700,0
4,10800.0,7.698,53000,0.433962,5,1,23000,0


In [12]:
# Count how many loans fall into each class of the target variable
df.value_counts(['loan_status'])

loan_status
0              75036
1               2500
dtype: int64

### Step 2: Create the labels set (`y`) from the `loan_status` column, and then create the features (`X`) DataFrame from the remaining columns.

In [3]:
# Separate the data into labels and features

# Separate the y variable, the labels
y = df['loan_status']

# Separate the X variable, the features
X = df.drop(columns='loan_status')

In [4]:
# Review the y variable Series
y[:5]

0    0
1    0
2    0
3    0
4    0
Name: loan_status, dtype: int64

In [5]:
# Review the X variable DataFrame
X.head()

Unnamed: 0,loan_size,interest_rate,borrower_income,debt_to_income,num_of_accounts,derogatory_marks,total_debt
0,10700.0,7.672,52800,0.431818,5,1,22800
1,8400.0,6.692,43600,0.311927,3,0,13600
2,9000.0,6.963,46100,0.349241,3,0,16100
3,10700.0,7.664,52700,0.43074,5,1,22700
4,10800.0,7.698,53000,0.433962,5,1,23000


### Step 3: Split the data into training and testing datasets by using `train_test_split`.

In [6]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split

# Split the data using train_test_split
# Assign a random_state of 1 to the function
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

---

## Create a Logistic Regression Model with the Original Data

### Step 1: Fit a logistic regression model by using the training data (`X_train` and `y_train`).

In [7]:
# Import the LogisticRegression class from scikit-learn
from sklearn.linear_model import LogisticRegression

# Instantiate the Logistic Regression model
# Assign a random_state parameter of 1 to the model
logistic_regression_model = LogisticRegression(random_state=1)

# Fit the model using training data
lr_model = logistic_regression_model.fit(X_train, y_train)


### Step 2: Save the predictions on the testing data labels by using the testing feature data (`X_test`) and the fitted model.

In [8]:
# Make a prediction using the testing data
testing_predictions = lr_model.predict(X_test)

### Step 3: Evaluate the model’s performance by doing the following:

* Generate a confusion matrix.

* Print the classification report.

In [9]:
# Generate a confusion matrix for the testing data
# (rows are the actual classes, columns are the predicted classes)
test_matrix = confusion_matrix(y_test, testing_predictions)

# Print the confusion matrix for the testing data
print(test_matrix)

[[18663   102]
 [   56   563]]


In [10]:
# Create and save the testing classification report
testing_report = classification_report(y_test, testing_predictions)

# Print the testing classification report
print(testing_report)

              precision    recall  f1-score   support

           0       1.00      0.99      1.00     18765
           1       0.85      0.91      0.88       619

    accuracy                           0.99     19384
   macro avg       0.92      0.95      0.94     19384
weighted avg       0.99      0.99      0.99     19384



# Overview of Analysis

The purpose of this analysis was to use historical lending data from a peer-to-peer lending services company to train and evaluate a loan-risk model that can identify the creditworthiness of borrowers.

The dataset contained the following financial information: loan size, interest rate, borrower income, debt-to-income ratio, number of accounts, derogatory marks, total debt, and loan status (0 for a healthy loan, 1 for a high-risk loan). From this information, the model had to predict whether each borrower was going to fulfill their loan or was at high risk of defaulting.

In the original data (before it was split into training and testing sets), the loan status variable had 75,036 records in class 0 and 2,500 records in class 1.

The stages of machine learning I went through as part of the analysis were:

- Preprocessing: loading the data, separating the labels from the features, and splitting the data into training and testing sets
- Training: fitting the logistic regression model on the training set
- Predicting: using the fitted logistic regression model to make predictions on the testing set
- Evaluating: assessing the performance of the model using a confusion matrix and a classification report

The model used for this analysis was a logistic regression model, which was fitted on the training set and then used to predict a binary outcome of either 0 or 1 for each loan in the testing set.





# Results


## Machine Learning Model

### Accuracy
The accuracy score achieved using the logistic regression model was extremely high at 0.99. In other words, 99% of all predictions on the testing set were correct (19,226 of 19,384 loans).
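As a cross-check, the accuracy can be recomputed by hand from the confusion matrix printed earlier. The snippet below is a minimal sketch, not part of the original notebook: it hard-codes the matrix values and assumes scikit-learn's layout, where rows are the actual classes and columns are the predicted classes.

```python
# Confusion matrix from the notebook output (rows = actual, columns = predicted)
test_matrix = [[18663, 102],
               [56, 563]]

# Treat class 1 (high-risk loan) as the positive class
(tn, fp), (fn, tp) = test_matrix

total = tn + fp + fn + tp          # every loan in the testing set
correct = tn + tp                  # predictions on the diagonal are correct
accuracy = correct / total

print(f"{correct} correct out of {total}: accuracy = {accuracy:.4f}")
```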


### Precision
The precision score for the 0 class rounded to 1.00, whereas the precision score for the 1 class was lower at 0.85. In other words, almost every loan the model labeled as healthy really was healthy, while 85% of the loans it labeled as high-risk really were high-risk, which makes the predictions of class 0 more precise than those of class 1. Precision = TRUE POSITIVE / (TRUE POSITIVE + FALSE POSITIVE).
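The precision formula can be applied directly to the confusion matrix. This is an illustrative sketch with a hypothetical helper named `precision`. The matrix values are copied from the notebook output, and each class is treated in turn as the "positive" class.

```python
def precision(true_pos, false_pos):
    """Share of predicted positives that are actually positive."""
    return true_pos / (true_pos + false_pos)

# Rows = actual class, columns = predicted class
test_matrix = [[18663, 102],
               [56, 563]]
(tn, fp), (fn, tp) = test_matrix

# Class 1 as positive: 563 correct out of 563 + 102 predicted high-risk
precision_1 = precision(tp, fp)

# Class 0 as positive: 18663 correct out of 18663 + 56 predicted healthy
precision_0 = precision(tn, fn)

print(f"precision (class 0) = {precision_0:.4f}")
print(f"precision (class 1) = {precision_1:.4f}")
```

The unrounded class 0 value is just under 1, so the 1.00 in the report is a rounding effect, not a truly perfect score.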


### Recall
The recall score for the 0 class was 0.99, whereas the recall score for class 1 was not far off at 0.91. In other words, the model correctly identified 99% of the loans that were actually healthy and 91% of the loans that were actually high-risk. Recall = TRUE POSITIVE / (TRUE POSITIVE + FALSE NEGATIVE).
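Recall can be checked against the confusion matrix in the same way. The sketch below uses a hypothetical helper named `recall` and hard-coded values from the notebook output. It also reports how many high-risk loans the model missed, which is the number that matters most for this problem.

```python
def recall(true_pos, false_neg):
    """Share of actual positives that the model found."""
    return true_pos / (true_pos + false_neg)

# Rows = actual class, columns = predicted class
test_matrix = [[18663, 102],
               [56, 563]]
(tn, fp), (fn, tp) = test_matrix

# Class 1 as positive: 563 found out of 619 actual high-risk loans
recall_1 = recall(tp, fn)

# Class 0 as positive: 18663 found out of 18765 actual healthy loans
recall_0 = recall(tn, fp)

missed_high_risk = fn  # high-risk loans predicted as healthy

print(f"recall (class 0) = {recall_0:.4f}")
print(f"recall (class 1) = {recall_1:.4f}")
print(f"high-risk loans missed: {missed_high_risk}")
```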

# Summary

Based on its evaluation, the logistic regression model made good predictions on the testing set after being fitted on the training set. Between the two classes, 0's (borrowers with healthy loans) were predicted more accurately than 1's (borrowers with high-risk loans).

Furthermore, the loan status counts in the data were heavily skewed towards 0's rather than 1's (0 = 75,036 and 1 = 2,500). This could have played a part in predicting 0's more accurately, and could also have inflated the model's accuracy because of the imbalanced data. Nevertheless, loans in the real world would probably be skewed the same way, with healthy loans far more prevalent than high-risk loans.
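One way to see how much the imbalance flatters the accuracy score is to compare it with a trivial baseline that labels every loan as healthy. The sketch below is illustrative only: it takes the class supports from the classification report (18,765 healthy and 619 high-risk loans in the testing set) and evaluates that hypothetical baseline.

```python
# Class supports in the testing set, from the classification report
support_healthy = 18765    # class 0
support_high_risk = 619    # class 1
total = support_healthy + support_high_risk

# Hypothetical baseline: predict class 0 for every loan
baseline_accuracy = support_healthy / total
baseline_recall_1 = 0 / support_high_risk  # it never flags a high-risk loan

print(f"baseline accuracy       = {baseline_accuracy:.4f}")
print(f"baseline class 1 recall = {baseline_recall_1:.2f}")
```

The baseline already reaches about 97% accuracy while catching no high-risk loans at all, so the per-class precision and recall say more about this model than the overall accuracy does.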

To possibly counteract that, a larger dataset could have been used for training, which might have reduced the skew somewhat. Alternatively, a different machine learning model could have been used that is specifically suited to skewed data.
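One option along these lines, which this notebook did not try, is class weighting. scikit-learn's `LogisticRegression` accepts `class_weight="balanced"`, which weights each class by `n_samples / (n_classes * class_count)`. The sketch below only reproduces that formula in plain Python. It assumes the full-dataset counts, because the notebook does not report the counts for the training split, so the real weights would differ slightly.

```python
# Class counts from the full dataset (assumption: the training split
# has roughly the same proportions)
class_counts = {0: 75036, 1: 2500}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" weighting: rarer classes get proportionally larger weights
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

for label, weight in class_weights.items():
    print(f"class {label}: weight = {weight:.4f}")

# With these weights both classes carry the same total weight
weighted_totals = {
    label: class_weights[label] * count
    for label, count in class_counts.items()
}
```

Under this scheme each high-risk loan counts about 30 times as much as a healthy loan during fitting. That typically raises recall for class 1 at the cost of some precision, and whether the trade-off is worthwhile here would need to be tested.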

Finally, the problem we are trying to solve is identifying high-risk loans among healthy loans, and the logistic regression model slightly underperforms on the more important class of 1's in comparison to 0's. Nonetheless, the model still predicts quite well according to its evaluation scores, and modest adjustments could be made to improve its predictions of 1's.



