**Note:** Much of the original starter file has been removed, to slim down the end product; this is in line with the rubric, which states to follow DRY principles and use concise, relevant notes. Every resource I've found on the matter of good commenting practices insists that the code should be almost entirely self-explanatory, with as few comments as possible. I have adhered to this.

---

In [1]:
import pandas as pd
from pathlib import Path
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import RandomOverSampler

In [2]:
def logistic_regression_evaluation(X_train, X_test, y_train, y_test):
    model = LogisticRegression(random_state=1)
    model.fit(X_train, y_train)
    
    predictions = model.predict(X_test)

    # Balanced accuracy score not required here by rubric; left for use in analysis
    print(f"""\
Balanced Accuracy Score:
{balanced_accuracy_score(y_test, predictions)}

Confusion Matrix:
{confusion_matrix(predictions, y_test)}

Classification Report:
{classification_report(predictions, y_test)}""")

---

## Split the Data into Training and Testing Sets

In [3]:
loan_df = pd.read_csv(Path('Resources/lending_data.csv'))

y = loan_df["loan_status"]
X = loan_df.drop(columns="loan_status")

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

---

## Create a Logistic Regression Model with the Original Data

In [4]:
logistic_regression_evaluation(X_train, X_test, y_train, y_test)

Balanced Accuracy Score:
0.9520479254722232

Confusion Matrix:
[[18663    56]
 [  102   563]]

Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      1.00     18719
           1       0.91      0.85      0.88       665

    accuracy                           0.99     19384
   macro avg       0.95      0.92      0.94     19384
weighted avg       0.99      0.99      0.99     19384



**Question:** How well does the logistic regression model predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The model predicts healthy loans with a precision of 99% and high-risk loans with a precision of 91%.

---

## Predict a Logistic Regression Model with Resampled Training Data

In [5]:
# This entire section is missing from the rubric, which feels like an error, so I left it in.
ros = RandomOverSampler(random_state=1)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

logistic_regression_evaluation(X_resampled, X_test, y_resampled, y_test)

Balanced Accuracy Score:
0.9936781215845847

Confusion Matrix:
[[18649     4]
 [  116   615]]

Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      1.00     18653
           1       0.99      0.84      0.91       731

    accuracy                           0.99     19384
   macro avg       0.99      0.92      0.95     19384
weighted avg       0.99      0.99      0.99     19384



**Question:** How well does the logistic regression model, fit with oversampled data, predict both the `0` (healthy loan) and `1` (high-risk loan) labels?

**Answer:** The model predicts both healthy loans and high-risk loans with a precision of 99%.