# Model Training - Logistic Regression

This notebook is a simple example of fitting and serializing a logistic regression model. We will use the `LogReg` class from `turtles`, which is my own Python package:

https://github.com/adammotzel/glms

The purpose of this project is to demonstrate the entire ML model lifecycle, with a focus on deployment. Because of this, we will skip the usual (albeit *very* important) model training steps (EDA, CV, tuning, etc.).

We will be using the [breast cancer Wisconsin dataset](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic), loaded through [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html).

In [1]:
import pickle as pkl

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from turtles.stats.glms import LogReg

In [2]:
random_state = 5
test_size = 0.2

In [3]:
# load sample data
data = load_breast_cancer(as_frame=True)

X = data["data"]
y = data["target"]
feature_names = data["feature_names"]

print("X:", X.shape)
print("y:", y.shape)
print("Feature Names: ", feature_names)

X: (569, 30)
y: (569,)
Feature Names:  ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


In [4]:
# let's just select a few features
features = [str(feat) for feat in feature_names if "mean" in feat]
print(features)

# slice df
X = X[features].copy()
print("New X:", X.shape)

['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension']
New X: (569, 10)


In [5]:
# create splits
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, 
    y, 
    test_size=test_size, 
    random_state=random_state
)

print("Xtrain:", Xtrain.shape)
print("Xtest:", Xtest.shape)
print("Ytrain:", Ytrain.shape)
print("Ytest:", Ytest.shape)

Xtrain: (455, 10)
Xtest: (114, 10)
Ytrain: (455,)
Ytest: (114,)
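
The split above is purely random. As an optional variation (not what this notebook does), `train_test_split` accepts a `stratify` argument that keeps the class proportions roughly equal in both splits. A minimal sketch, assuming the same `test_size` and `random_state` values as above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# reload the data so this sketch is self-contained
data = load_breast_cancer(as_frame=True)
X, y = data["data"], data["target"]

# stratify=y asks scikit-learn to preserve the class ratio in both splits
Xtr, Xte, ytr, yte = train_test_split(
    X, y, test_size=0.2, random_state=5, stratify=y
)

# fraction of positive labels (1 = benign in this dataset) per split
train_pos_rate = float(ytr.mean())
test_pos_rate = float(yte.mean())
print("train positive rate:", round(train_pos_rate, 3))
print("test positive rate:", round(test_pos_rate, 3))
```

With a dataset this small, stratifying mainly reduces the chance that the test set ends up with an unrepresentative class mix.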


In [6]:
# convert to numpy
Xtrain = Xtrain.to_numpy()
Xtest = Xtest.to_numpy()
Ytrain = Ytrain.to_numpy().reshape(Ytrain.shape[0], 1)
Ytest = Ytest.to_numpy().reshape(Ytest.shape[0], 1)

print("Xtrain:", Xtrain.shape)
print("Xtest:", Xtest.shape)
print("Ytrain:", Ytrain.shape)
print("Ytest:", Ytest.shape)

Xtrain: (455, 10)
Xtest: (114, 10)
Ytrain: (455, 1)
Ytest: (114, 1)


In [7]:
# fit model using Newton's method
model = LogReg(
    method="newton",
    learning_rate=0.01,
    tolerance=0.00001
)

model.fit(Xtrain, Ytrain, var_names=features)

print("Model Summary:\n")
display(model.summary())

Model Summary:



| | Variable | Coefficient | Std Error | z-statistic | p-value | [0.025 | 0.075] |
|---|---|---|---|---|---|---|---|
| 0 | Intercept | 13.5702 | 15.6411 | 0.8676 | 0.3856 | -17.0857 | 44.2261 |
| 1 | mean radius | 2.719 | 4.1212 | 0.6598 | 0.5094 | -5.3584 | 10.7965 |
| 2 | mean texture | -0.4304 | 0.0794 | -5.423 | 0.0 | -0.586 | -0.2749 |
| 3 | mean perimeter | -0.0941 | 0.5763 | -0.1634 | 0.8702 | -1.2237 | 1.0354 |
| 4 | mean area | -0.0363 | 0.0197 | -1.8412 | 0.0656 | -0.0749 | 0.0023 |
| 5 | mean smoothness | -87.6111 | 37.9331 | -2.3096 | 0.0209 | -161.9585 | -13.2637 |
| 6 | mean compactness | 12.0804 | 24.414 | 0.4948 | 0.6207 | -35.7702 | 59.9309 |
| 7 | mean concavity | -7.9231 | 9.4823 | -0.8356 | 0.4034 | -26.5081 | 10.6618 |
| 8 | mean concave points | -70.6945 | 32.9 | -2.1488 | 0.0317 | -135.1773 | -6.2117 |
| 9 | mean symmetry | -11.0483 | 11.868 | -0.9309 | 0.3519 | -34.3092 | 12.2126 |
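
The z-statistic, p-value, and interval columns can be recomputed from the coefficient and its standard error. The sketch below does this for the `mean smoothness` row, assuming a two-sided Wald test against a standard normal and a 95% interval (bounds at the 0.025 and 0.975 quantiles). The printed values are consistent with that assumption, but the helper name `wald_summary` is hypothetical and is not part of `turtles`.

```python
import math
from statistics import NormalDist

def wald_summary(coef, std_err, alpha=0.05):
    """Hypothetical helper: Wald z-test and normal-approximation interval."""
    z = coef / std_err
    # two-sided p-value under a standard normal: P(|Z| > |z|)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # critical value, about 1.96 for alpha = 0.05
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lower = coef - z_crit * std_err
    upper = coef + z_crit * std_err
    return z, p_value, lower, upper

# values copied from the "mean smoothness" row of the summary
z, p, lo, hi = wald_summary(-87.6111, 37.9331)
print(round(z, 4), round(p, 4), round(lo, 3), round(hi, 3))
```

Because the inputs are the rounded values from the table, the results should match the corresponding row only to about three decimal places.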

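For readers unfamiliar with Newton's method in this setting, here is a minimal NumPy sketch of how it fits a logistic regression (equivalently, iteratively reweighted least squares). It is an illustration under stated assumptions, not the `turtles` implementation: the function name `newton_logreg`, the stopping rule, and the synthetic data are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logreg(X, y, tolerance=1e-5, max_iter=50):
    """Hypothetical sketch: fit logistic regression with Newton's method.

    X is (n, p) without an intercept column; y is (n, 1) with 0/1 labels.
    """
    n = X.shape[0]
    Xb = np.hstack([np.ones((n, 1)), X])       # prepend intercept column
    beta = np.zeros((Xb.shape[1], 1))
    for _ in range(max_iter):
        p = sigmoid(Xb @ beta)                 # predicted probabilities, (n, 1)
        gradient = Xb.T @ (y - p)              # gradient of the log-likelihood
        # X^T W X with W = diag(p * (1 - p)): the negative Hessian
        information = Xb.T @ (Xb * (p * (1 - p)))
        step = np.linalg.solve(information, gradient)
        beta = beta + step                     # full Newton step
        if np.max(np.abs(step)) < tolerance:   # assumed stopping rule
            break
    return beta

# synthetic, non-separable data so the example is self-contained
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_beta = np.array([[0.5], [1.5], [-2.0]])
probs = sigmoid(np.hstack([np.ones((500, 1)), X]) @ true_beta)
y = (rng.random((500, 1)) < probs).astype(float)

beta_hat = newton_logreg(X, y)
print(beta_hat.ravel())
```

This sketch takes full Newton steps and has no learning rate; how `LogReg` uses its `learning_rate` argument together with the Newton option is not shown in this notebook.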

In [8]:
# ensure we logged the feature names
print(model.variable_names)

['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension']


In [9]:
# results on test data using 0.5 as threshold
preds = model.predict(Xtest) > 0.5

print("TEST Accuracy:", np.sum(preds == Ytest) / len(Ytest))

TEST Accuracy: 0.9649122807017544
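
The accuracy line relies on `preds` and `Ytest` having the same shape. The sketch below, on made-up arrays, shows that `np.mean` over the element-wise comparison gives the same value as the sum-and-divide form, and why a shape mismatch is dangerous: comparing an `(n, 1)` array with an `(n,)` array broadcasts to `(n, n)` and silently produces a wrong number. It assumes `model.predict` returns probabilities with shape `(n, 1)`, which this notebook does not show directly.

```python
import numpy as np

# made-up probabilities and labels, both as (n, 1) column vectors
probs = np.array([[0.9], [0.2], [0.7], [0.4]])
labels = np.array([[1], [0], [0], [0]])

preds = probs > 0.5                      # boolean column vector, shape (4, 1)

acc_sum = np.sum(preds == labels) / len(labels)
acc_mean = np.mean(preds == labels)      # same value, slightly more direct
print(acc_sum, acc_mean)

# pitfall: a flat (n,) label array broadcasts against (n, 1) predictions
flat_labels = labels.ravel()             # shape (4,)
mismatch = preds == flat_labels          # shape (4, 4), not (4, 1)
print(mismatch.shape)

# guard against it by checking shapes before scoring
assert preds.shape == labels.shape
```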


In [10]:
# save the model as a pickle file
with open("../models/model.pkl", "wb") as file:
    pkl.dump(model, file)

In [11]:
# test the saved model to ensure it works as expected
with open("../models/model.pkl", "rb") as file:
    loaded_model = pkl.load(file)

print(loaded_model)
print(loaded_model.betas)
print(loaded_model.variable_names)

<turtles.stats.glms._logreg.LogReg object at 0x0000017A415F3F50>
[[ 1.35702105e+01  2.71903741e+00 -4.30439215e-01 -9.41448707e-02
  -3.62963324e-02 -8.76111312e+01  1.20803569e+01 -7.92314349e+00
  -7.06945121e+01 -1.10483115e+01  2.70747687e+01]]
['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension']
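
Printing the attributes is a reasonable smoke test. A slightly stricter check compares the reloaded object against the original and records a checksum of the file, which can help confirm that the artifact being deployed is the one that was trained. The sketch below uses a `SimpleNamespace` stand-in rather than a real `LogReg` object, a temporary directory rather than `../models/`, and a hypothetical helper named `sha256_of`. Keep in mind that `pickle.load` can execute arbitrary code, so only unpickle files from a trusted source.

```python
import hashlib
import pickle as pkl
import tempfile
from pathlib import Path
from types import SimpleNamespace

def sha256_of(path):
    """Hypothetical helper: hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# stand-in for the fitted model (NOT a real LogReg object)
model = SimpleNamespace(
    betas=[13.5702, 2.7190, -0.4304],
    variable_names=["mean radius", "mean texture"],
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.pkl"
    with open(path, "wb") as file:
        pkl.dump(model, file)
    digest = sha256_of(path)          # could be stored next to the artifact

    with open(path, "rb") as file:
        loaded = pkl.load(file)
    same_file = sha256_of(path) == digest

# the reloaded object should carry the same fitted state
roundtrip_ok = (
    loaded.betas == model.betas
    and loaded.variable_names == model.variable_names
)
print(same_file, roundtrip_ok)
```

For a real model whose `betas` is a NumPy array, `np.array_equal` or `np.allclose` would be the appropriate comparison instead of `==`.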
