# Logistic Regression

In [None]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_curve,
    roc_auc_score
)

In [None]:
# import the necessary functions from src
from rice_ml.processing.preprocessing import (load_data, preprocess_data)
from rice_ml.supervised_learning.logistic_regression import (train_model,
                                                             evaluate_model,
                                                             plot_confusion_matrix,
                                                             plot_roc_curve,
                                                             plot_top_coefficients)

In [None]:
# load in and preprocess data
df = load_data("adult.csv")
X, y, preprocessor = preprocess_data(df)

Once the data is loaded and processed, we can train the model!

In [None]:

model, X_train, X_test, y_train, y_test = train_model(
    X, y, preprocessor
)
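The internals of `train_model` are not shown in this notebook. As a rough sketch of what a training step like this commonly involves, the block below splits a small made-up table, wraps a `ColumnTransformer` and a `LogisticRegression` in a `Pipeline`, and fits it. The toy data, the column names, the split size, and every `toy_` variable are assumptions for illustration only, not the actual `rice_ml` implementation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (not the Adult dataset)
toy = pd.DataFrame({
    "age": [25, 38, 47, 52, 29, 61, 33, 45, 23, 58, 41, 36],
    "occupation": ["sales", "exec", "exec", "exec", "sales", "exec",
                   "sales", "exec", "sales", "exec", "sales", "exec"],
    "high_income": [0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})
toy_X = toy[["age", "occupation"]]
toy_y = toy["high_income"]

# Scale numeric columns and one-hot encode categorical columns
toy_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["occupation"]),
])

# Hold out a stratified test set so both classes appear in each split
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=42
)

# Chain preprocessing and the classifier so both are fit on training data only
toy_model = Pipeline([
    ("preprocess", toy_preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])
toy_model.fit(toy_X_train, toy_y_train)

# Predicted probability of the positive (higher-income) class
toy_prob = toy_model.predict_proba(toy_X_test)[:, 1]
```

Keeping the preprocessor inside the pipeline means the scaler and encoder never see the test rows during fitting, which avoids leaking test information into training.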

Now we can evaluate the model. The output below shows per-class scores of about 88% for income below $50,000 and about 74% for income above $50,000, so the model is solid overall but noticeably weaker on the higher-income class.

In [None]:

y_pred, y_prob = evaluate_model(model, X_test, y_test)
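Per-class scores and overall accuracy measure different things, so it can help to see them side by side. The sketch below uses scikit-learn's `classification_report` on a handful of made-up labels; the labels and the `toy_` names are assumptions for illustration and are unrelated to the results above.

```python
from sklearn.metrics import classification_report

# Hypothetical toy labels (1 = income above $50,000), not the Adult data
toy_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# output_dict=True returns the report as a nested dictionary
toy_report = classification_report(toy_true, toy_pred, output_dict=True)

# Precision: of the rows predicted as a class, the share that truly belong to it
precision_high = toy_report["1"]["precision"]
# Recall: of the rows that truly belong to a class, the share the model found
recall_high = toy_report["1"]["recall"]
# Accuracy: share of all rows predicted correctly, regardless of class
toy_accuracy = toy_report["accuracy"]

print(f"precision (high income): {precision_high:.2f}")
print(f"recall (high income):    {recall_high:.2f}")
print(f"overall accuracy:        {toy_accuracy:.2f}")
```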


The confusion matrix below gives further insight into the model's errors. Most observations fall on the diagonal (true positives and true negatives), which is good. Most observations also have an actual income below $50,000, so a model that always guessed the lower-income class would still look fairly accurate. This model does not do that very often, but there are a number of false negatives (higher-income people predicted as lower income). Overall, though, the matrix shows a relatively accurate model.

In [None]:

plot_confusion_matrix(y_test, y_pred)
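To make the point about class imbalance concrete, the sketch below reads the four cells out of a confusion matrix and compares the model's accuracy with the accuracy of always guessing the majority class. The labels are made up for illustration and the `cm_` names are hypothetical.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: 8 lower-income (0) and 4 higher-income (1) rows
cm_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
cm_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(cm_true, cm_pred).ravel()

# Accuracy is the diagonal (correct predictions) over all rows
cm_accuracy = (tn + tp) / (tn + fp + fn + tp)

# Baseline: accuracy of a model that always predicts the majority class (0)
majority_baseline = (tn + fp) / (tn + fp + fn + tp)

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"accuracy={cm_accuracy:.2f}, majority baseline={majority_baseline:.2f}")
```

A model is only informative to the extent that it beats the majority baseline, which is why the false negatives deserve attention even when overall accuracy looks high.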

ROC curves plot the true positive rate against the false positive rate across classification thresholds. Ideally the curve reaches as close to the top-left corner (many true positives, few false positives) as possible, and the model does a decent job of this. The area under the curve is 0.91, meaning that a randomly chosen higher-income person receives a higher predicted probability than a randomly chosen lower-income person about 91% of the time.

In [None]:
plot_roc_curve(y_test, y_prob)
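The area under the ROC curve can be read as a ranking probability: the chance that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch below checks this on a few made-up scores by counting pairs directly and comparing against `roc_auc_score`. The data and the `auc_` names are assumptions for illustration.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Hypothetical toy labels and predicted probabilities
auc_true = [0, 0, 0, 1, 1]
auc_scores = [0.1, 0.4, 0.8, 0.35, 0.9]

positives = [s for s, t in zip(auc_scores, auc_true) if t == 1]
negatives = [s for s, t in zip(auc_scores, auc_true) if t == 0]

# Compare every positive with every negative; ties count as half a win
pairs = list(product(positives, negatives))
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
pairwise_auc = wins / len(pairs)

# scikit-learn computes the same quantity from the ROC curve
sklearn_auc = roc_auc_score(auc_true, auc_scores)

print(f"pairwise estimate: {pairwise_auc:.4f}")
print(f"roc_auc_score:     {sklearn_auc:.4f}")
```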

The final graph shows the variables (and their specific categories) that had the largest impact on this logistic regression model. As we can see, the features that most increase a person's predicted chance of a higher income are capital gain, being married, being from France, and working an executive or managerial job.

In [None]:
plot_top_coefficients(model, preprocessor)
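A logistic regression coefficient is a change in log-odds, so exponentiating it gives an odds ratio that is often easier to interpret. The sketch below fits a model on synthetic data with one feature that raises the odds and one that lowers them. The data, the true weights, and the `or_` names are assumptions for illustration and say nothing about the coefficients plotted above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: feature 0 raises the log-odds, feature 1 lowers them
rng = np.random.default_rng(0)
or_X = rng.normal(size=(500, 2))
true_logits = 2.0 * or_X[:, 0] - 1.0 * or_X[:, 1]
or_y = (rng.random(500) < 1 / (1 + np.exp(-true_logits))).astype(int)

or_model = LogisticRegression().fit(or_X, or_y)

# coef_ has shape (1, n_features) for a binary problem
or_coefs = or_model.coef_[0]

# exp(coefficient) is the multiplicative change in odds per one-unit increase
odds_ratios = np.exp(or_coefs)

for name, coef, ratio in zip(["feature_0", "feature_1"], or_coefs, odds_ratios):
    print(f"{name}: coefficient={coef:+.2f}, odds ratio={ratio:.2f}")
```

An odds ratio above 1 means the feature pushes predictions toward the positive class, and a ratio below 1 pushes them away. Coefficient sizes are only comparable across features that share a scale, which is one reason numeric columns are usually standardized first.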