<a href="https://colab.research.google.com/github/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning/blob/main/GB886_V_3_LogisticRegressionWithTransformations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Logistic Regression Example

We go back to our male/female example, where we modeled:
$$
\text{Probability Male} = p(\underbrace{\text{height},\text{weight}}_x).
$$

We will now check whether including a squared term (height squared) and an interaction term (height times weight) improves our predictions. One might expect the squared term to help because very tall individuals are much more likely to be male, whereas the relationship is weaker at moderate heights.

This is complicated a little by our link function (the logistic function), which already accounts for some degree of non-linearity. In a way, this example shows that developing intuition for GLMs is tricky because, as we will see, the squared term does not help here.
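
To make the role of the link function explicit, the two specifications compared in this notebook can be written out as follows. The shorthand $h$ (height), $w$ (weight), the linear predictor $\eta$, and the coefficients $\beta_j$ are notation introduced here for illustration only:

$$
p(h, w) = \frac{1}{1 + e^{-\eta(h, w)}}, \qquad
\eta_{\text{baseline}} = \beta_0 + \beta_1 h + \beta_2 w, \qquad
\eta_{\text{extended}} = \beta_0 + \beta_1 h + \beta_2 w + \beta_3 h^2 + \beta_4 h w.
$$

Even when $\eta$ is linear in height, the probability $p$ is S-shaped in height, which is the built-in non-linearity mentioned above.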

### Libraries and Data

As always, let's start by importing the libraries:

In [None]:
!pip install scipy

from scipy import stats
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

import statsmodels.api as sm

And let's load the data:

In [None]:
!git clone https://github.com/danielbauer1979/MSDIA_PredictiveModelingAndMachineLearning.git

In [None]:
hw_data = pd.read_csv('MSDIA_PredictiveModelingAndMachineLearning/GB886_IV_3_Davis.csv')

As before, let's recast the `sex` variable as a dummy variable, because logistic regression packages expect a numeric 0/1 response.

In [None]:
# Cast to int so the dummy is numeric 0/1 (recent pandas versions return booleans)
hw_data['sex'] = pd.get_dummies(hw_data['sex'], drop_first=True).astype(int)
hw_data.head()
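
As a small illustration of what the recoding does, here is a sketch on a made-up four-row column. It assumes the labels are `F` and `M`; the frame `toy` and the column name `male` are hypothetical and not part of the notebook's data.

```python
import pandas as pd

# Hypothetical toy column, assuming the labels are 'F' and 'M'.
toy = pd.DataFrame({'sex': ['F', 'M', 'M', 'F']})

# drop_first=True drops the alphabetically first level ('F'),
# so the remaining column 'M' is 1 for male and 0 for female.
dummies = pd.get_dummies(toy['sex'], drop_first=True, dtype=int)
toy['male'] = dummies['M']
print(toy)
```

With this coding, the fitted model describes the probability of the level that was kept (here: male).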

### Fitting the Baseline Logistic Regression Model

Let's run our baseline classification model again:

In [None]:
X = hw_data[['height','weight']]
X = sm.add_constant(X)
y = hw_data[['sex']]

logit_mod1 = sm.Logit(y, X)
logit_mod1_res = logit_mod1.fit()

print(logit_mod1_res.summary())

Here is the confusion matrix:

In [None]:
pred = (logit_mod1_res.predict(X) > 0.5)
conf_mat = pd.crosstab(y['sex'], pred, rownames=['Actual Sex'], colnames=['Predicted Sex'])
# Add row and column sums
conf_mat.loc['Column_Total']= conf_mat.sum(numeric_only=True, axis=0)
conf_mat.loc[:,'Row_Total'] = conf_mat.sum(numeric_only=True, axis=1)
print(conf_mat)
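
The entries of a confusion matrix are often condensed into a few summary rates. The following is a minimal sketch on made-up labels and probabilities (the arrays `y_true` and `p_hat` are hypothetical, not the Davis data), using the same 0.5 cutoff as above:

```python
import numpy as np

# Hypothetical labels (1 = male) and fitted probabilities, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p_hat = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3])

# Classify with the 0.5 cutoff.
y_pred = (p_hat > 0.5).astype(int)

# Count the four cells of the confusion matrix.
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

accuracy = (tp + tn) / len(y_true)   # share of correct predictions
sensitivity = tp / (tp + fn)         # true positive rate
specificity = tn / (tn + fp)         # true negative rate

print(accuracy, sensitivity, specificity)
```

In the notebook itself, the same counts could be read off `conf_mat` instead of being recomputed.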

And the ROC Curve:

In [None]:
# Compute the ROC curve and AUC from the fitted probabilities and the true labels
fpr, tpr, thresholds = roc_curve(y, logit_mod1_res.predict(X))
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
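
To see what `roc_curve` and `auc` compute, the curve can also be traced by hand: sweep the cutoff over the fitted probabilities, record the false and true positive rates at each cutoff, and integrate with the trapezoidal rule. This is a simplified sketch on made-up data, not scikit-learn's actual implementation; the helper `roc_points` and the toy arrays are hypothetical.

```python
import numpy as np

# Hypothetical labels (1 = male) and fitted probabilities, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p_hat = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3])

def roc_points(y, scores):
    """Return (fpr, tpr) obtained by sweeping the cutoff from high to low."""
    thresholds = np.sort(np.unique(scores))[::-1]
    n_pos = np.sum(y == 1)
    n_neg = np.sum(y == 0)
    fpr, tpr = [0.0], [0.0]  # start at the origin (nothing classified as positive)
    for t in thresholds:
        pred = scores >= t
        tpr.append(np.sum(pred & (y == 1)) / n_pos)
        fpr.append(np.sum(pred & (y == 0)) / n_neg)
    return np.array(fpr), np.array(tpr)

fpr_toy, tpr_toy = roc_points(y_true, p_hat)

# Area under the curve via the trapezoidal rule.
auc_value = float(np.sum(np.diff(fpr_toy) * (tpr_toy[1:] + tpr_toy[:-1]) / 2))
print(auc_value)
```

Without ties in the scores, this area equals the share of (male, female) pairs in which the male observation receives the higher fitted probability.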


### Including New Features

So, as indicated above, let's include height-squared and the interaction between height and weight:

In [None]:
# Copy the columns so that adding features does not write to a view of hw_data
X2 = hw_data[['height','weight']].copy()
X2['height_sq'] = X2['height']**2
X2['height_weight'] = X2['height'] * X2['weight']
X2 = sm.add_constant(X2)
y = hw_data[['sex']]

logit_mod2 = sm.Logit(y, X2)
logit_mod2_res = logit_mod2.fit()

print(logit_mod2_res.summary())

Let's check the predictions:

In [None]:

pred2 = (logit_mod2_res.predict(X2) > 0.5)
conf_mat = pd.crosstab(y['sex'], pred2, rownames=['Actual Sex'], colnames=['Predicted Sex'])
conf_mat.loc['Column_Total']= conf_mat.sum(numeric_only=True, axis=0)
conf_mat.loc[:,'Row_Total'] = conf_mat.sum(numeric_only=True, axis=1)
print(conf_mat)

fpr, tpr, thresholds = roc_curve(y, logit_mod2_res.predict(X2))
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (AUC = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

So the predictions are essentially unchanged! This arguably suggests that we should not include the new variables. Whether or not to include certain features is exactly the type of question we discuss in the next part.
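
One way to make such a comparison concrete is to count how many class predictions differ between the two models, and to check how far the fitted probabilities are apart. The sketch below uses made-up probabilities (`p_model1` and `p_model2` are hypothetical, not output from the fitted models):

```python
import numpy as np

# Hypothetical fitted probabilities from two models on the same five observations.
p_model1 = np.array([0.91, 0.12, 0.55, 0.48, 0.30])
p_model2 = np.array([0.88, 0.15, 0.61, 0.45, 0.27])

# Class predictions with the 0.5 cutoff.
labels1 = p_model1 > 0.5
labels2 = p_model2 > 0.5

# Number of observations on which the two models disagree.
n_disagree = int(np.sum(labels1 != labels2))

# Largest absolute difference between the fitted probabilities.
max_prob_gap = float(np.max(np.abs(p_model1 - p_model2)))

print(n_disagree, round(max_prob_gap, 2))
```

In the notebook, the same check could be applied to `logit_mod1_res.predict(X)` and `logit_mod2_res.predict(X2)`. Note that the class predictions can agree everywhere even when the underlying probabilities differ slightly.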