# **Categorical, Continuous, and Multiple Predictors**

---
1. Categorical Predictor
2. Continuous Predictor
3. Multiple Predictors

## **1. Categorical Predictor**
---

### **1.1. Binary Predictor**
---

- Predictor $(X)$ : independent variable, explanatory variable, feature.
- Binary predictor : categorical predictor with only 2 categories.
- Reference cell coding ("zero-one" coding)
  - Assigns the zero value to the lower code for $x$ as a reference, and one to the higher code.
  - $x=0$ as the reference.
- **Interpretation**:

> The estimate of the odds ratio between category 1 and category 0 is $\text{OR}=\exp(\beta_{1})$.
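
A short derivation of this result, assuming the simple logit model $\text{logit}(\pi(x)) = \beta_{0}+\beta_{1}x$ with zero-one coding for $x$:

$$
\begin{align*}
\text{logit}(\pi(1)) - \text{logit}(\pi(0)) &= (\beta_{0}+\beta_{1}) - \beta_{0} = \beta_{1} \\
\text{OR} = \frac{\text{odds}(x=1)}{\text{odds}(x=0)} &= \exp(\beta_{1})
\end{align*}
$$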

### **1.2. Multicategory Predictor**
---

- Multicategory predictor : categorical predictor with more than 2 categories.
- Reference cell coding:
  - Assigns zero to the reference category of $x$ and uses indicator/dummy variables for the others.
  - Thus, a predictor with $c$ categories will have one reference category and $c-1$ indicator/dummy variables.
- For example, the original label/code for predictor Color $(c)$ with 4 categories:

<center>

|Color $(x)$|Code|
|:--:|:--:|
|Medium Light|1|
|Medium|2|
|Medium Dark|3|
|Dark|4|

</center>

- With reference cell coding, the predictor Color becomes 3 indicator/dummy variables, with **Dark as the reference**:

<center>

|Color|$c_{1}$|$c_{2}$|$c_{3}$|
|:--:|:--:|:--:|:--:|
|Medium Light|1|0|0|
|Medium|0|1|0|
|Medium Dark|0|0|1|
|Dark|0|0|0|

</center>

- Thus, the logit model becomes:

$$
\text{log(odds)} = \beta_{0}+\beta_{1}c_{1}+\beta_{2}c_{2}+\beta_{3}c_{3}
$$

- With reference cell coding,
  - $c_{1}=1$ for color = medium light, $0$ otherwise
  - $c_{2}=1$ for color = medium, $0$ otherwise
  - $c_{3}=1$ for color = medium dark, $0$ otherwise
  - Color is dark when $c_{1}=c_{2}=c_{3}=0$
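
The coding above can be reproduced with pandas. This is a minimal sketch on a small hypothetical sample of colors (the values and the names `color` and `color_coded` are assumptions for illustration), with Dark dropped as the reference category:

```python
import pandas as pd

# Hypothetical sample of the Color predictor (illustrative values only)
color = pd.Series(['Medium Light', 'Medium', 'Medium Dark', 'Dark', 'Medium'],
                  name = 'color')

# Create one indicator column per category
dummies = pd.get_dummies(color, dtype = int)

# Drop the reference category 'Dark', then order and rename as c1, c2, c3
color_coded = dummies.drop(columns = 'Dark')
color_coded = color_coded[['Medium Light', 'Medium', 'Medium Dark']]
color_coded.columns = ['c1', 'c2', 'c3']

print(color_coded)
```

The row for Dark contains only zeros, so its log odds equal the intercept $\beta_{0}$.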

- **Interpretation**:
> The estimate of the odds ratio between category $k$ and the reference category is $\text{OR}(k,\text{reference})=\exp(\beta_{k})$, with $k=1,2,\dots,c-1$.

- In general, to interpret the odds ratio between two categories:
  1. Calculate the logit difference.
  2. Interpret in terms of the odds ratio.
- Thus, the odds ratio between two categories, say category $a$ and $b$, is $\text{OR}(a,b)=\exp(\beta_{a}-\beta_{b})$

- **Interpretation**:
> The odds of success at $x$ = category $a$  equals $\exp(\beta_{a}-\beta_{b})$ times the odds of success at $x$ = category $b$.
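
The two steps can be sketched in Python. The coefficient values below are hypothetical and chosen only for illustration (they are not fitted estimates); the reference category Dark is given a coefficient of 0 because its logit is just the intercept:

```python
import math

# Hypothetical coefficient estimates for the Color example (Dark = reference)
# These values are illustrative assumptions, not fitted results
beta = {'Medium Light': 1.2, 'Medium': 0.8, 'Medium Dark': 0.3, 'Dark': 0.0}

def odds_ratio(category_a, category_b):
    """Odds ratio of category a versus category b: exp(beta_a - beta_b)."""
    # Step 1: calculate the logit difference
    logit_difference = beta[category_a] - beta[category_b]
    # Step 2: express it as an odds ratio
    return math.exp(logit_difference)

# Against the reference, the odds ratio reduces to exp(beta_k)
print(f"OR(Medium, Dark)         = {odds_ratio('Medium', 'Dark'):.2f}")

# Between two non-reference categories
print(f"OR(Medium Light, Medium) = {odds_ratio('Medium Light', 'Medium'):.2f}")
```

The intercept cancels in the logit difference, which is why it does not appear in the function.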


### **1.3. Ordinal Predictor**
---

- Ordinal predictor : predictor with ordered scale categories, e.g: low, medium, high.
- Ordinal predictors are treated in a quantitative manner (continuous scale).
- Thus, the coding is:
  - $x=1$ --> low
  - $x=2$ --> medium
  - $x=3$ --> high
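
A minimal sketch of this coding with pandas, using a small hypothetical series (the names `level` and `score_map` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical ordinal predictor (illustrative values only)
level = pd.Series(['low', 'high', 'medium', 'low'], name = 'level')

# Map each ordered category to its numeric score
score_map = {'low': 1, 'medium': 2, 'high': 3}
x = level.map(score_map)

print(x.tolist())
```

Note that equally spaced scores assume the step from low to medium has the same effect on the logit as the step from medium to high.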

## **2. Continuous Predictor**
---

### **2.1. The Intercept**
---

- Intercept value is the logit value when $x=0$.
$$
\begin{align*}
\text{logit}(\pi(x)) &= \beta_{0}+\beta_{1}(x) \\
\text{logit}(\pi(0)) &= \beta_{0}+\beta_{1}(0) \\
\text{logit}(\pi(0)) &= \beta_{0} \\
\end{align*}
$$

<br>

- For example, the logit model from the horseshoe crab data:
$$
\begin{align*}
\text{logit(Width=0)} &= -12.3508 + 0.4972 (0) \\
\text{logit(Width=0)} &= -12.3508
\end{align*}
$$
  - Interpretation: the estimated odds of a crab having any satellite is $\exp(-12.3508)$ when its width is 0 cm.
  - A crab with zero width is not realistic.
  - Thus, the intercept is not meaningful and is difficult to interpret.
- To make it interpretable, we can transform predictor $x$ by centering its value.
  - `c_width = width - np.mean(width)`
  - Thus, `c_width = 0` represents the mean value of width.
  - In general, zero value for the centered predictor represents the average value of the predictor.
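
Centering changes only the intercept, not the slope. The sketch below reuses the estimates from the horseshoe crab model; the mean width is an assumed value for illustration, since the sample mean is not reported here:

```python
import math

# Estimates from the horseshoe crab model above
b0 = -12.3508
b1 = 0.4972

# Assumed sample mean of width in cm (hypothetical value for illustration)
mean_width = 26.3

# Substituting width = c_width + mean_width into the logit gives
# logit = (b0 + b1 * mean_width) + b1 * c_width
b0_centered = b0 + b1 * mean_width

# Odds of having any satellite for a crab of average width (c_width = 0)
odds_at_mean = math.exp(b0_centered)

print(f"Centered intercept : {b0_centered:.4f}")
print(f"Odds at mean width : {odds_at_mean:.2f}")
```

After centering, the intercept describes a crab of average width rather than a crab of width zero.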

### **2.2. The Effect of Continuous Predictor**
---

- Under the assumption that the logit is linear in the continuous predictor $x$, the equation for the logit is:

$$
\text{logit}(\pi(x)) = \beta_{0}+\beta_{1}(x)
$$

<br>

- Slope coefficient, $\beta_{1}$, gives the change in the log odds for an increase of 1 unit in $x$.
- Thus, the odds of success multiply by $\exp(\beta_{1})$ for every 1 unit increase in $x$.
  - But if $x$ lies in the range [0,1], a change of 1 is too large. A change of 0.01 may be more realistic.
  - In another case, a 1 ml increase in coffee consumption may be too small to be considered important. A change of 50 or 100 ml may be more realistic.
  - **Solution : use the term “$c$” for the change of $x$.**
- The interpretation for $c$ unit change in $x$:
> The odds of success multiply by $\exp(c\times\beta_{1})$ for every $c$ unit increase in $x$.
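
As an illustration, the sketch below applies this formula to the slope of the horseshoe crab model; the step sizes are arbitrary choices for demonstration:

```python
import math

# Slope estimate from the horseshoe crab model (width in cm)
b1 = 0.4972

# Candidate step sizes c for the change in x (chosen for illustration)
for c in [0.5, 1, 2, 5]:
    multiplier = math.exp(c * b1)
    print(f"c = {c:>3} cm -> odds multiply by {multiplier:.2f}")
```

The multiplier for a $c$ unit change equals $\exp(\beta_{1})$ raised to the power $c$, so it grows multiplicatively rather than additively.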

## **3. Multiple Predictors**
---

- Denote the $k$ predictors for a binary response $Y$ by $X = x_{1}, x_{2}, \dots, x_{k}$. The model for the log odds is:
$$
\text{logit}(\pi(X)) = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \dots + \beta_{k}x_{k}
$$

<br>

- The parameter $\beta_{j}$ refers to the effect of $x_{j}$ on the log odds that $Y = 1$, controlling for the other predictors.
- Interpretation:

> The odds of success multiply by $\exp(\beta_{j})$ for 1 unit increase in $x_{j}$, at fixed levels of the other $x$.
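
The phrase "at fixed levels of the other $x$" can be checked numerically. The sketch below uses hypothetical coefficients for a model with two predictors (assumed values, not fitted estimates):

```python
import math

# Hypothetical coefficients for a model with two predictors (illustration only)
b0, b1, b2 = -2.0, 0.5, -0.3

def odds(x1, x2):
    """Odds of success implied by the logit model."""
    return math.exp(b0 + b1 * x1 + b2 * x2)

# Increase x1 by one unit while holding x2 fixed at different levels
for x2 in [0, 1, 4]:
    ratio = odds(x1 = 3, x2 = x2) / odds(x1 = 2, x2 = x2)
    print(f"x2 = {x2}: odds ratio for +1 in x1 = {ratio:.4f}")

print(f"exp(b1) = {math.exp(b1):.4f}")
```

The ratio is the same at every level of `x2` because this model has no interaction term.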



# **Case: Categorical Predictor**
---

Note:
- Ideally, we perform the significance test for each parameter estimate before interpreting the logistic regression model.
- We skip the significance tests in this section because they were covered last week.
- Fortunately, all models in this section yield significant parameter estimates (P-value < $\alpha=0.05$).

## **Load Data**
---

This example uses a fictitious credit risk dataset from [Kaggle](https://www.kaggle.com/datasets/laotse/credit-risk-dataset).

The sample consists of demographic, bureau, and financial information.

Note that we are not defining the default or bad status from our dataset here. Instead, we already have the binary response variable:

- `loan_status`
  - `loan_status = 0` for non default loan.
  - `loan_status = 1` for default loan.

The potential predictors for predicting the response variable are:

1. `person_age` : age of the debtor.
2. `person_income` : annual income of the debtor.
3. `person_home_ownership`
  - `RENT`
  - `MORTGAGE`
  - `OWN`
  - `OTHER`
4. `person_emp_length` : employment length of debtor (in years).
5. `loan_intent` : purpose of the loan.
  - `EDUCATION`
  - `MEDICAL`
  - `VENTURE`
  - `PERSONAL`
  - `DEBTCONSOLIDATION`
  - `HOMEIMPROVEMENT`
6. `loan_grade`
7. `loan_amnt` : amount of the loan.
8. `loan_int_rate` : interest rate of the loan.
9. `loan_percent_income` : loan amount as a proportion of the debtor's income.
10. `cb_person_default_on_file` : historical default.
  - `N` : the debtor does not have any history of defaults.
  - `Y` : the debtor has a history of defaults on their credit file.
11. `cb_person_cred_hist_length` : length of the credit history.

First, load the data from `credit_risk_dataset.csv` file.

In [1]:
# Load data manipulation package
import numpy as np
import pandas as pd

# Load data visualization package
import matplotlib.pyplot as plt
import seaborn as sns

# Load Statistics package
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression

In [2]:
# Import dataset from csv file
data = pd.read_csv('../data/credit_risk_dataset.csv')

# Table check
data.head().T

Unnamed: 0,0,1,2,3,4
person_age,22,21,25,23,24
person_income,59000,9600,9600,65500,54400
person_home_ownership,RENT,OWN,MORTGAGE,RENT,RENT
person_emp_length,123.0,5.0,1.0,4.0,8.0
loan_intent,PERSONAL,EDUCATION,MEDICAL,MEDICAL,MEDICAL
loan_grade,D,B,C,C,C
loan_amnt,35000,1000,5500,35000,35000
loan_int_rate,16.02,11.14,12.87,15.23,14.27
loan_status,1,0,1,1,1
loan_percent_income,0.59,0.1,0.57,0.53,0.55


In [3]:
# Check the data shape
data.shape

(32581, 12)

Our sample contains 12 variables from 32,581 credit records.
- 1 response variable, `loan_status`,
- and 11 potential predictors/characteristics.

Before modeling, make sure you split the data first for model validation.

In the classification case, check the proportion of response variable first to decide the splitting strategy.

In [4]:
# Define response variable
response_variable = 'loan_status'

# Check the proportion of response variable
data[response_variable].value_counts(normalize = True)

loan_status
0    0.781836
1    0.218164
Name: proportion, dtype: float64

The response variable, `loan_status`, is imbalanced (a ratio of about 78:22).

To keep the same ratio in the training and testing sets, use a stratified split based on the response variable, `loan_status`.

## **Sample Splitting**
---

First, define the predictors (X) and the response (y).

In [5]:
# Split response and predictors
y = data[response_variable]
X = data.drop(columns = [response_variable])

# Validate the splitting
print('y shape :', y.shape)
print('X shape :', X.shape)

y shape : (32581,)
X shape : (32581, 11)


Next, split the predictors (X) and the response (y) into training and testing sets.
- Set `stratify = y` to stratify the split by the proportion of the response y.
- Set `test_size = 0.3` for holding 30% of the sample as a testing set.
- Set `random_state = 42` for reproducibility.

In [6]:
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify = y,
                                                    test_size = 0.3,
                                                    random_state = 42)

# Validate splitting
print('X train shape :', X_train.shape)
print('y train shape :', y_train.shape)
print('X test shape  :', X_test.shape)
print('y test shape  :', y_test.shape)

X train shape : (22806, 11)
y train shape : (22806,)
X test shape  : (9775, 11)
y test shape  : (9775,)


Check the proportion of the response y in the training and testing sets.

In [7]:
y_train.value_counts(normalize = True)

loan_status
0    0.781856
1    0.218144
Name: proportion, dtype: float64

In [8]:
y_test.value_counts(normalize = True)

loan_status
0    0.78179
1    0.21821
Name: proportion, dtype: float64

In [9]:
# Concatenate X_train and y_train as data_train
data_train = pd.concat((X_train, y_train),
                       axis = 1)

# Validate data_train
print('Train data shape:', data_train.shape)
data_train.head()

Train data shape: (22806, 12)


Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,loan_status
11491,26,62000,RENT,1.0,DEBTCONSOLIDATION,B,10000,11.26,0.16,N,2,0
3890,23,39000,MORTGAGE,3.0,EDUCATION,C,5000,12.98,0.13,N,4,0
17344,24,35000,RENT,1.0,DEBTCONSOLIDATION,A,12000,6.54,0.34,N,2,1
13023,24,86000,RENT,1.0,HOMEIMPROVEMENT,B,12000,10.65,0.14,N,3,0
29565,42,38400,RENT,4.0,MEDICAL,B,13000,,0.34,N,11,1


## **Binary Predictor**
---

- We will use variable `cb_person_default_on_file` as the binary predictor for response `loan_status`.
- `cb_person_default_on_file` $(x)$ has 2 categories:
  - $x$ = N --> the debtor does not have any history of defaults.
  - $x$ = Y --> the debtor has a history of defaults on their credit file.
- Thus, the logit model becomes: $\text{logit} =\beta_{0} + \beta_{1}x$

### Model Fitting
---

In [10]:
# Define the response 'default' and binary predictor 'history'
default = data_train[response_variable]
history = data_train['cb_person_default_on_file']

history.head()

11491    N
3890     N
17344    N
13023    N
29565    N
Name: cb_person_default_on_file, dtype: object

In [11]:
# Make dummy variable where 'N' (never default) is the reference for 'history'
# history_default = 0 --> does not have any history of defaults
history_default = pd.get_dummies(history,
                                 drop_first = True)

# Rename the column
history_default.columns = ['history_default']

history_default.head()

Unnamed: 0,history_default
11491,False
3890,False
17344,False
13023,False
29565,False


In [14]:
# Modeling with statsmodels
# Use dummy variable 'history_default' as predictor

# Load the package
import statsmodels.api as sm

# Add constant to predictor
history_default_sm = sm.add_constant(history_default.astype(float))

# Model fitting
model_history_default = sm.Logit(endog = default,
                                 exog = history_default_sm)
result_history_default = model_history_default.fit()

# Print the result
print(result_history_default.summary())

Optimization terminated successfully.
         Current function value: 0.510410
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:            loan_status   No. Observations:                22806
Model:                          Logit   Df Residuals:                    22804
Method:                           MLE   Df Model:                            1
Date:                Mon, 11 Nov 2024   Pseudo R-squ.:                 0.02695
Time:                        09:15:40   Log-Likelihood:                -11640.
converged:                       True   LL-Null:                       -11963.
Covariance Type:            nonrobust   LLR p-value:                2.876e-142
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
const              -1.4879      0.019    -79.001      0.000      -1.525      -1.451
history_defaul

In [15]:
# Modeling with statsmodels.formula
# You can use the predictor as it is

# Load the package
import statsmodels.formula.api as smf

# Model fitting
model_history = smf.logit('default ~ history',
                          data = pd.DataFrame({'default': default,
                                               'history': history}))
result_model_history = model_history.fit()

# Print the result
print(result_model_history.summary())

Optimization terminated successfully.
         Current function value: 0.510410
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:                default   No. Observations:                22806
Model:                          Logit   Df Residuals:                    22804
Method:                           MLE   Df Model:                            1
Date:                Mon, 11 Nov 2024   Pseudo R-squ.:                 0.02695
Time:                        09:15:53   Log-Likelihood:                -11640.
converged:                       True   LL-Null:                       -11963.
Covariance Type:            nonrobust   LLR p-value:                2.876e-142
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -1.4879      0.019    -79.001      0.000      -1.525      -1.451
history[T.Y]     0.9781

- Thus, we have the logit model:
$$
\begin{align*}
\text{logit(historydefault)} &=\beta_{0} + \beta_{1} \text{historydefault} \\
\text{logit(historydefault)} &=-1.4879 + 0.9781(\text{historydefault})
\end{align*}
$$
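
To read the fitted model on the probability scale, the logit can be passed through the inverse logit. This is a minimal sketch using the rounded estimates above; the function name `predicted_probability` is an assumption for illustration:

```python
import math

# Rounded estimates from the fitted model above
b0 = -1.4879
b1 = 0.9781

def predicted_probability(history_default):
    """Convert the logit to a probability with the inverse logit."""
    logit = b0 + b1 * history_default
    return 1 / (1 + math.exp(-logit))

# history_default = 0 --> no history of defaults (reference)
p_never = predicted_probability(0)
# history_default = 1 --> has a history of defaults
p_ever = predicted_probability(1)

print(f"P(default | no default history) = {p_never:.3f}")
print(f"P(default | default history)    = {p_ever:.3f}")
```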

In [18]:
# Modelling with sklearn

# Load Statistics package
from sklearn.linear_model import LogisticRegression

# Create the object
model_history_sk = LogisticRegression(penalty = None)

# Use dummy variable 'history_default' as predictor
model_history_sk.fit(X = history_default,
                     y = default)

Extract $\beta_{0}$ estimate.

In [19]:
# Print the parameter estimate of b0
b0_history = model_history_sk.intercept_
b0_history

array([-1.48802833])

Extract $\beta_{1}$ estimate.

In [20]:
# Print the parameter estimate of b1
b1_history = model_history_sk.coef_
b1_history

array([[0.97818755]])

We already fit logistic regression model with:
1. `statsmodels.api`
2. `statsmodels.formula.api`
3. `sklearn.linear_model`

To keep the interpretation of the parameter estimates simple, from here on we will use only `statsmodels.api` to fit the logistic regression model, since it provides a summary of the fitted model.

### Interpretation
---

Interpretation for $\beta_{0}$.
- $\beta_{0}$ is the logit value when $x=0$ or $x$= reference.
- Thus, $\beta_{0}$ is the logit or log odds of success of reference category.
- $\text{odds(reference)}=\exp(\beta_{0})$

In [22]:
# Calculate odds of reference: does not have any history of defaults
odds_never_default = np.exp(b0_history)

print(f"The odds of default for those who have never been in default is {odds_never_default[0]:.2f}.")

The odds of default for those who have never been in default is 0.23.


Interpretation:
> For debtors who have never been in default, the odds of default are 0.23. Since the odds are below 1, these debtors are more likely to repay than to default.

Interpretation for $\beta_{1}$.

In [23]:
# Calculate the OR between history_default=1 and history_default=0
odds_ratio_history = np.exp(b1_history)

print(f"OR (ever default, never default) = {odds_ratio_history[0][0]:.2f}")

OR (ever default, never default) = 2.66


Interpretation:
> Debtors who have been in default are more likely to default again than those who have never been in default. The odds of default for debtors with a default history are 2.66 times the odds for those without one.

Let's verify the odds ratio above using the contingency table.

In [24]:
# Contingency table 'default' and 'history'
crosstab_history = pd.crosstab(history,
                               default,
                               margins = True)
crosstab_history

loan_status,0,1,All
cb_person_default_on_file,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
N,15302,3456,18758
Y,2529,1519,4048
All,17831,4975,22806


In [25]:
# Calculate the odds of default for debtors who have been in default
default_and_ever_default = crosstab_history[1]['Y']
non_default_and_ever_default  = crosstab_history[0]['Y']

odds_ever_default = default_and_ever_default/non_default_and_ever_default
odds_ever_default

0.6006326611308818

In [26]:
# Calculate the odds of default for debtors who have NEVER been in default
default_and_never_default = crosstab_history[1]['N']
non_default_and_never_default  = crosstab_history[0]['N']

odds_never_default = default_and_never_default/non_default_and_never_default
odds_never_default

0.22585282969546464

This matches the odds of default estimated from the model for those who have never been in default (0.23).

In [27]:
# Calculate the OR
odds_ratio_history = odds_ever_default / odds_never_default
odds_ratio_history

2.6593984318937363

This matches OR (ever default, never default) = 2.66 obtained from the model.
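
The same check can be run in the other direction: with a single binary predictor, taking logs of the odds from the contingency table should recover the parameter estimates. A minimal sketch, with the counts copied from the table above and the variable names chosen for illustration:

```python
import math

# Counts from the contingency table above (training set)
never_default = {'non_default': 15302, 'default': 3456}
ever_default = {'non_default': 2529, 'default': 1519}

# Odds of default in each category of the predictor
odds_never = never_default['default'] / never_default['non_default']
odds_ever = ever_default['default'] / ever_default['non_default']

# b0 is the log odds of the reference, b1 is the log odds ratio
b0_from_table = math.log(odds_never)
b1_from_table = math.log(odds_ever / odds_never)

print(f"b0 from table = {b0_from_table:.4f}")
print(f"b1 from table = {b1_from_table:.4f}")
```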