# Homework 19
## William Vann

# My Prediction: 

My prediction is that **Logistic Regression (LR) will perform better than Random Forest Classification (RFC)**. 

One reason is that the credit risk data problem is similar to other problems (like medical disease risk) we've used LR effectively for. We are looking for a prediction of 0 or 1 for credit risk, not a continuous numeric value, and LR is a standard and trusty choice (often the first choice, at least for testing) for such situations. 

Of course, RFC is a good choice for such situations as well, so I've considered what I know about the math underlying each approach to favor LR. My understanding is that LR works by calculating probabilities, while RFC relies on "majority voting" among some number of decision trees. This particular data problem seems more probabilistic to me: the question "Is person A a good credit risk (loan_status=0) or a bad credit risk (loan_status=1)?" seems better answered with "Yes (with an x% probability)" than with "Yes (a majority of the trees vote Yes)". I am persuaded on this point even more when I consider that LR has a probability "threshold" that the data scientist can change, which suggests it might be more "fine-tunable" than RFC.
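
To make the contrast concrete, here is a minimal sketch of the two decision rules described above. It is a simplified picture, not how either scikit-learn model is implemented: the function names, the example probability, and the list of tree votes are all hypothetical values chosen for illustration.

```python
from collections import Counter


def classify_by_threshold(probability_of_bad_risk, threshold=0.5):
    """Return 1 (bad risk) when the probability meets the threshold, else 0."""
    return 1 if probability_of_bad_risk >= threshold else 0


def classify_by_majority_vote(tree_votes):
    """Return the label that the largest number of trees voted for."""
    return Counter(tree_votes).most_common(1)[0][0]


# Hypothetical applicant with a 35% estimated probability of being a bad risk.
probability = 0.35
print(classify_by_threshold(probability))        # uses the default threshold of 0.5
print(classify_by_threshold(probability, 0.3))   # a more cautious, lower threshold

# Hypothetical votes from five trees (0 = good risk, 1 = bad risk).
votes = [0, 0, 1, 0, 1]
print(classify_by_majority_vote(votes))
```

Lowering the threshold flags the same applicant as a bad risk without retraining anything, which is the kind of "fine-tuning" I have in mind. An odd number of votes is used here so that no tie can occur.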

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset

In [None]:
# Import the data

df = pd.read_csv("Resources/lending_data.csv")
df.head()

In [None]:
df.describe()

In [None]:
df["loan_status"].value_counts()

In [None]:
df.groupby(["loan_status"]).mean()

In [None]:
df.groupby(["loan_status"]).median()

# Preprocess the data

In [None]:
X = df.drop("loan_status", axis=1).values

y = df["loan_status"].values

target_names = ["good risk", "bad risk"]  # loan_status=0 or 1, respectively

In [None]:
# Split the data into X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [None]:
# Create a scaler to standardize the data

scaler = StandardScaler()

# Train the scaler with the X_train data.

scaler.fit(X_train)

In [None]:
# Transform X_train and X_test.
# Note that the scaler used to transform X_train and X_test was trained on X_train.

X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create predictions with Logistic Regression

In [None]:
# Train a Logistic Regression model 

model = LogisticRegression().fit(X_train_scaled, y_train)

In [None]:
# Score the model

print(f"Logistic Regression Training Data Score: {model.score(X_train_scaled, y_train)}")
print(f"Logistic Regression Testing Data Score: {model.score(X_test_scaled, y_test)}")

In [None]:
y_true = y_test
y_pred = model.predict(X_test_scaled)

cm_lr = confusion_matrix(y_true, y_pred)
cr_lr = classification_report(y_true, y_pred, target_names=target_names)

print(f"Logistic Regression CONFUSION MATRIX: \n\n {cm_lr}\n\n")
print(f"Logistic Regression CLASSIFICATION REPORT: \n\n {cr_lr}\n\n")

# Create predictions with Random Forest Classifier

In [None]:
# Train a Random Forest Classifier model

clf = RandomForestClassifier(random_state=0, n_estimators=500).fit(X_train_scaled, y_train)

In [None]:
# Score the model

print(f"Random Forest Classifier Training Score: {clf.score(X_train_scaled, y_train)}")
print(f"Random Forest Classifier Testing Score: {clf.score(X_test_scaled, y_test)}")

In [None]:
y_true = y_test
y_pred = clf.predict(X_test_scaled)

cm_rfc = confusion_matrix(y_true, y_pred)
cr_rfc = classification_report(y_true, y_pred, target_names=target_names)

print(f"Random Forest Classifier CONFUSION MATRIX: \n\n {cm_rfc}\n\n")
print(f"Random Forest Classifier CLASSIFICATION REPORT: \n\n {cr_rfc}\n\n")

## Results and Reflection


In [None]:
print(f"LR: {cr_lr}")
print(f"RFC: {cr_rfc}")

## Logistic Regression is the Better-Performing Model

Looking at the classification reports for the two models, both models perform well on precision, recall, and f1-score for "good risk", but less well for the same metrics for "bad risk". Nevertheless, LR performed better than RFC on "bad risk" recall and f1-score. Therefore, my prediction was confirmed.  

After exploring the dataset, it appears that the dataset is somewhat lopsided: only about 3% of the sample had a loan_status value of 1 (i.e., the loan was not approved because the applicant was judged to be a "bad" credit risk). Hence, about 97% of the sample was approved for a loan and judged to be a "good" risk. My intuition is that most classification models would score well overall with such a skew in the training dataset, since the majority class dominates the results, and that was borne out in our comparison of two of the most often used binary classification models.
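
One way to see why high overall scores are expected under this kind of skew is to score a baseline that labels every applicant a good risk. The sketch below uses the roughly 3% bad-risk share noted above. The sample size of 1,000 and the function name `always_good_baseline` are assumptions made for illustration, and it does not use the actual lending data.

```python
def always_good_baseline(n_samples, bad_fraction):
    """Score a baseline that predicts 'good risk' (0) for every sample.

    Returns (accuracy, bad_risk_recall) for a synthetic label set in which
    `bad_fraction` of the samples are bad risks (1).
    """
    n_bad = round(n_samples * bad_fraction)
    n_good = n_samples - n_bad
    y_true = [0] * n_good + [1] * n_bad
    y_pred = [0] * n_samples  # the baseline never predicts "bad risk"

    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / n_samples

    # Recall for "bad risk": the share of true bad risks the baseline caught.
    bad_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    bad_recall = bad_caught / n_bad if n_bad else 0.0
    return accuracy, bad_recall


# Hypothetical sample of 1,000 applicants, 3% of them bad risks.
accuracy, bad_recall = always_good_baseline(1000, 0.03)
print(f"accuracy: {accuracy:.2f}")
print(f"bad-risk recall: {bad_recall:.2f}")
```

Because a baseline like this can reach high accuracy while catching no bad risks at all, the "bad risk" recall and f1-score are probably the more informative numbers for comparing LR and RFC on this dataset.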