In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.pipeline import Pipeline
import pickle

# Logistic Regression Model

In this section, we explore Logistic Regression, a fundamental classification algorithm in machine learning. Logistic Regression predicts the probability that a given input belongs to a certain category. It is particularly well-suited for binary classification problems.

## Key Characteristics of Logistic Regression:

- **Outcome Modelling**: It models the probability of a binary outcome, typically represented as 0 or 1.
- **Probability Estimation**: The output is a probability score between 0 and 1, obtained through a logistic function.
- **Feature Relationship**: It estimates the relationship between features and the probability of the outcome.
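
The characteristics above can be illustrated with a minimal sketch of the logistic function. The coefficients, intercept, and feature rows below are hypothetical values chosen for illustration; they are not taken from the model fitted later in this notebook.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real-valued score into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical coefficients and intercept for a two-feature model (illustrative only).
weights = np.array([1.5, -0.8])
intercept = -0.2

# Three hypothetical feature rows.
X_demo = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 3.0]])

scores = X_demo @ weights + intercept          # linear combination of the features
probabilities = sigmoid(scores)                # estimated P(outcome = 1)
labels = (probabilities >= 0.5).astype(int)    # 0.5 decision threshold

print(probabilities)
print(labels)
```

A score of zero maps to a probability of exactly 0.5, so the sign of the linear score decides the predicted class at that threshold.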

Through careful feature selection and regularization, we can enhance the predictive power of Logistic Regression for a variety of applications.

The goal of this notebook is to build, evaluate, and optimize a Logistic Regression model, which will later be compared with other models in a comprehensive analysis.

For a detailed account of how we've set up our analysis, refer to the notebook "Final_Project_Data_Gen."

---


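Here we define our custom `scorer`. We use an F2 score, which weights recall more heavily than precision: we prioritize recall (minimizing the chance of not giving a coupon to someone who would use it) while still accounting for precision (minimizing the chance of giving a coupon to someone who will not use it). Because we pass `average="weighted"`, the F2 score is computed for each class and then averaged, weighted by the number of true instances of each class.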


In [None]:
def f2_func(y_true, y_pred):
    f2_score = fbeta_score(y_true, y_pred, beta=2, average="weighted")
    return f2_score


def my_f2_scorer():
    return make_scorer(f2_func)
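
To show why `beta=2` favors recall, here is a small sketch of the F-beta formula written out by hand. The helper `f_beta` and the two precision/recall pairs are hypothetical and exist only for this illustration; the notebook itself relies on scikit-learn's `fbeta_score`.

```python
def f_beta(precision, recall, beta):
    """F-beta: harmonic-style mean of precision and recall, with recall weighted beta times as much."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


# Two hypothetical models with mirrored precision and recall.
candidates = {
    "high recall": {"precision": 0.5, "recall": 1.0},
    "high precision": {"precision": 1.0, "recall": 0.5},
}

for name, pr in candidates.items():
    f1 = f_beta(pr["precision"], pr["recall"], beta=1)
    f2 = f_beta(pr["precision"], pr["recall"], beta=2)
    print(f"{name}: F1={f1:.3f}, F2={f2:.3f}")
```

F1 cannot separate the two candidates (both score about 0.667), whereas F2 ranks the high-recall one clearly higher (about 0.833 versus 0.556).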

Pulling in the cleaned-up dataset


In [4]:
X_train = pd.read_csv("train_X_In-Car-Rec.csv")
y_train = pd.read_csv("train_y_In-Car-Rec.csv")
X_test = pd.read_csv("test_X_In-Car-Rec.csv")
y_test = pd.read_csv("test_y_In-Car-Rec.csv")
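
Note that `pd.read_csv` always returns a DataFrame, so the label files load as two-dimensional objects even when they hold a single column. The sketch below uses a hypothetical in-memory CSV (the column name `Y` is an assumption, not the real file's header) to show the shape before and after flattening.

```python
import io

import pandas as pd

# Hypothetical stand-in for a label file with one column named "Y".
csv_text = "Y\n0\n1\n1\n0\n"
y_demo = pd.read_csv(io.StringIO(csv_text))

print(type(y_demo).__name__, y_demo.shape)    # DataFrame (4, 1)

# scikit-learn estimators expect a 1-D target, so flatten the single column.
y_flat = y_demo.values.ravel()
print(type(y_flat).__name__, y_flat.shape)    # ndarray (4,)
```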

To tune our Logistic Regression classifier, we've defined a `param_grid` that specifies the hyperparameters we want to experiment with. We vary the regularization `penalty` (`l1`, `l2`, `elasticnet`, or `none`), the inverse regularization strength `C` over 20 logarithmically spaced values from `1e-4` to `1e4`, the optimization `solver`, and the iteration cap `max_iter`. This yields 4 × 20 × 5 × 5 = 2000 candidate combinations.


In [5]:
param_grid = [
    {
        "penalty": ["l1", "l2", "elasticnet", "none"],
        "C": np.logspace(-4, 4, 20),
        "solver": ["lbfgs", "newton-cg", "liblinear", "sag", "saga"],
        "max_iter": [100, 1000, 2500, 5000, 100000],
    }
]

We've instantiated a `LogisticRegression` model and wrapped it in `GridSearchCV`, which works through every hyperparameter combination defined in `param_grid` and runs 3-fold cross-validation on each one. We set `verbose=True` so that we can follow the progress, and `n_jobs=-1` to use all available CPU cores to speed up the search. Candidates are ranked with our custom F2 scorer, passed through the `scoring` parameter, which accounts for both precision and recall but favors recall.


In [6]:
logModel = LogisticRegression()
clf = GridSearchCV(
    logModel,
    param_grid=param_grid,
    cv=3,
    verbose=True,
    n_jobs=-1,
    scoring=my_f2_scorer(),
)

In [7]:
best_clf = clf.fit(X_train, y_train.values.ravel())

Fitting 3 folds for each of 2000 candidates, totalling 6000 fits


2700 fits failed out of a total of 6000.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
300 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\baker\miniconda3\envs\snowflakes\Lib\site-packages\sklearn\model_selection\_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\baker\miniconda3\envs\snowflakes\Lib\site-packages\sklearn\base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\baker\miniconda3\envs\snowflakes\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1169, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
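
The failure count in the log is consistent with solver/penalty incompatibilities: each (solver, penalty) pair covers 20 × 5 = 100 candidates, or 300 fits with 3 folds, and 2700 failed fits correspond to 9 such pairs. The sketch below is one way to restrict the grid to compatible pairs. It assumes the usual scikit-learn rules (`lbfgs`, `newton-cg`, and `sag` accept `l2`; `liblinear` accepts `l1` and `l2`; `saga` accepts all three, with `elasticnet` requiring `l1_ratio`). The `l1_ratio` values are hypothetical, and the unpenalized case is left out because its spelling differs between scikit-learn versions.

```python
import numpy as np

# Values shared by every block of the grid (same as in the original param_grid).
shared = {
    "C": np.logspace(-4, 4, 20),
    "max_iter": [100, 1000, 2500, 5000, 100000],
}

# One dict per group of compatible solver/penalty pairs (assumed compatibility rules).
compatible_grid = [
    {"solver": ["lbfgs", "newton-cg", "sag"], "penalty": ["l2"], **shared},
    {"solver": ["liblinear", "saga"], "penalty": ["l1", "l2"], **shared},
    {
        "solver": ["saga"],
        "penalty": ["elasticnet"],
        "l1_ratio": [0.25, 0.5, 0.75],  # hypothetical mixing values
        **shared,
    },
]


def n_candidates(grid):
    """Count the parameter combinations a list-of-dicts grid expands to."""
    return sum(int(np.prod([len(values) for values in block.values()])) for block in grid)


print(n_candidates(compatible_grid))
```

A list of dicts like this can be passed to `GridSearchCV` as `param_grid`; each dict is expanded separately, so no incompatible pair is ever generated.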

Displaying the best parameters


In the cell above, we fitted our `GridSearchCV` object, `clf`, to the training data by calling the `fit` method with `X_train` and the `y_train` labels flattened to a one-dimensional array using `ravel()`. The grid search trains a model for each combination of hyperparameters and cross-validates its performance; as the log notes, combinations whose fits fail receive a score of `nan`. The most effective settings found are stored in `best_params_`, shown below.


In [8]:
best_clf.best_params_

{'C': 0.615848211066026,
 'max_iter': 100000,
 'penalty': 'l1',
 'solver': 'liblinear'}
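
Beyond `best_params_`, the fitted search object also exposes `cv_results_`, which helps to see how close the runners-up were. The following is a minimal sketch on synthetic data from `make_classification` with a deliberately small, hypothetical grid; it does not reproduce the search or the scores from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the In-Car-Rec dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0, 10.0]},
    cv=3,
    scoring=make_scorer(fbeta_score, beta=2, average="weighted"),
)
search.fit(X_demo, y_demo)

# One row per candidate, best-ranked first.
columns = ["param_penalty", "param_C", "mean_test_score", "std_test_score", "rank_test_score"]
cv_table = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")[columns]

print(search.best_params_)
print(cv_table.head())
```

Comparing `mean_test_score` against `std_test_score` across the top rows gives a rough sense of whether the winning combination is meaningfully better than its neighbors.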

Below are the results for actual versus predicted


Let's now construct a DataFrame named `results` to compare the actual and predicted values side by side. This DataFrame holds the true labels from `y_test` and the predictions made by our best-performing Logistic Regression model, `best_clf`, on the test data `X_test`. This side-by-side view gives a first impression of where the model's predictions agree or disagree with the true labels.


In [9]:
results = pd.DataFrame()
results["actual"] = y_test
results["predicted"] = best_clf.predict(X_test)
results

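|      | actual | predicted |
|------|--------|-----------|
| 0    | 0      | 0         |
| 1    | 1      | 1         |
| 2    | 1      | 1         |
| 3    | 0      | 1         |
| 4    | 0      | 0         |
| ...  | ...    | ...       |
| 2532 | 0      | 1         |
| 2533 | 0      | 0         |
| 2534 | 0      | 1         |
| 2535 | 1      | 1         |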
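
A side-by-side listing is hard to scan for thousands of rows; a cross-tabulation summarizes the same comparison as a confusion table. The sketch below uses only the nine rows visible in the output above, so its counts are illustrative and do not describe the full test set.

```python
import pandas as pd

# Only the rows visible in the truncated output above (not the full test set).
visible = pd.DataFrame(
    {
        "actual":    [0, 1, 1, 0, 0, 0, 0, 0, 1],
        "predicted": [0, 1, 1, 1, 0, 1, 0, 1, 1],
    }
)

# Rows are true labels, columns are predicted labels.
confusion = pd.crosstab(visible["actual"], visible["predicted"])
print(confusion)
```

Applied to the full `results` DataFrame, the same call would give the counts of true negatives, false positives, false negatives, and true positives.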


We calculate the weighted F2 score for our Logistic Regression model using the `fbeta_score` from the `metrics` module, setting `beta` to 2 to put more emphasis on recall. We then print the F2 score, providing us with a measure of the model's performance that takes into account both precision and recall, with a bias towards minimizing false negatives.


In [10]:
f2_score = metrics.fbeta_score(y_test, results["predicted"], average="weighted", beta=2)
print(f"F2Score for the Logistic Regressor Classification Model is: {f2_score}")

F2Score for the Logistic Regressor Classification Model is: 0.692038482245298


We have computed the weighted F2 score for our Logistic Regression classification model, and it stands at approximately 0.692. Because we use `average="weighted"`, this figure is the support-weighted mean of the per-class F2 scores rather than the F2 score of the positive class alone, so it summarizes recall-weighted performance across both classes. The score leaves room for improvement; the rows shown earlier, for example, include several false positives (actual 0, predicted 1).
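
To see what the weighted average hides, `fbeta_score` can also report one F2 score per class with `average=None`. The sketch below again uses only the nine rows visible in the earlier output as toy data, so the numbers are illustrative and not the model's real per-class scores.

```python
from sklearn.metrics import fbeta_score

# Toy data: the nine rows visible in the truncated results output.
actual = [0, 1, 1, 0, 0, 0, 0, 0, 1]
predicted = [0, 1, 1, 1, 0, 1, 0, 1, 1]

# One F2 score per class, in label order (class 0, then class 1).
per_class = fbeta_score(actual, predicted, beta=2, average=None)

# Support-weighted mean of the per-class scores, as used in this notebook.
weighted = fbeta_score(actual, predicted, beta=2, average="weighted")

print(per_class)
print(weighted)
```

On this toy sample, class 1 scores higher than class 0, and the weighted figure falls between the two, closer to class 0 because it has more rows.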


### Pickling Our Model


In [3]:
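# Final pipeline: a single LogisticRegression step configured with the
# best hyperparameters reported by the grid search.
# Note: the pipeline is not fitted in this cell, so pickling it as-is
# stores an untrained estimator unless `fit` is called first.
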
pipeline_logReg = Pipeline(
    [
        (
            "logReg",
            LogisticRegression(
                C=0.615848211066026, max_iter=100000, penalty="l1", solver="liblinear"
            ),
        ),
    ]
)

In [4]:
# Specify the filename where you want to save the model
filename = "LogReg_Model.pkl"

# Export the model to the file using pickle.dump
with open(filename, "wb") as file:
    pickle.dump(pipeline_logReg, file)
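
Since no `fit` call appears on `pipeline_logReg` in the cells above, whoever loads the pickle would need to train it before predicting. As a hedged sketch of the full round trip, the example below fits an equivalent pipeline on synthetic data, saves it to a temporary file, and reloads it. The data, the temporary path, and the `_demo` names are assumptions made for illustration.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (not the In-Car-Rec dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

pipeline_demo = Pipeline(
    [
        (
            "logReg",
            LogisticRegression(
                C=0.615848211066026, max_iter=100000, penalty="l1", solver="liblinear"
            ),
        ),
    ]
)
pipeline_demo.fit(X_demo, y_demo)  # fit before pickling so the file holds a trained model

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "LogReg_Model_demo.pkl")
    with open(path, "wb") as file:
        pickle.dump(pipeline_demo, file)
    with open(path, "rb") as file:
        restored = pickle.load(file)

# The restored pipeline should reproduce the original predictions.
same_predictions = bool((restored.predict(X_demo) == pipeline_demo.predict(X_demo)).all())
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and a pickled estimator is generally expected to be loaded with the same scikit-learn version that created it.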