# **Categorical, Continuous, and Multiple Predictors**

---
1. Categorical Predictor
2. Continuous Predictor
3. Multiple Predictors

## **1. Categorical Predictor**
---

### **1.1. Binary Predictor**
---

- Predictor $(X)$ : independent variable, explanatory variable, feature.
- Binary predictor : categorical predictor with only 2 categories.
- Reference cell coding ("zero-one" coding)
  - Assigns the zero value to the lower code for $x$ as a reference, and one to the higher code.
  - $x=0$ as the reference.
- **Interpretation**:

> The estimate of the odds ratio between category 1 and category 0 is $\text{OR}=\exp(\beta_{1})$.
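
As a minimal sketch of this interpretation, the snippet below uses a hypothetical 2x2 table (the counts are made up for illustration) and checks that $\exp(\beta_{1})$ equals the cross-product odds ratio. It relies on the fact that, with a single binary predictor, the fitted logits reproduce the sample log odds of each category.

```python
import math

# Hypothetical counts (illustration only): successes and failures per category
success_0, failure_0 = 30, 70   # x = 0 (reference)
success_1, failure_1 = 60, 40   # x = 1

# With a single binary predictor, the fitted logits equal the sample log odds
beta_0 = math.log(success_0 / failure_0)            # logit at x = 0
beta_1 = math.log(success_1 / failure_1) - beta_0   # logit difference

odds_ratio = math.exp(beta_1)
cross_product = (success_1 * failure_0) / (failure_1 * success_0)

print(f"OR from exp(beta_1)   : {odds_ratio:.2f}")
print(f"OR from cross-product : {cross_product:.2f}")
```

Both lines print the same value, because the logit difference between the two categories is exactly the log of the odds ratio.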

### **1.2. Multicategory Predictor**
---

- Multicategory predictor : categorical predictor with more than 2 categories.
- Reference cell coding:
  - Chooses one category of $x$ as the reference (all indicators equal to zero) and uses an indicator/dummy variable for each of the other categories.
  - Thus, a predictor with $c$ categories will have one reference category and $c-1$ indicator/dummy variables.
- For example, the original label/code for predictor Color $(x)$ with 4 categories:

<center>

|Color $(x)$|Code|
|:--:|:--:|
|Medium Light|1|
|Medium|2|
|Medium Dark|3|
|Dark|4|

</center>

- With reference cell coding, predictor Color becomes 3 indicator/dummy variables, with color **Dark as the reference**:

<center>

|Color|$c_{1}$|$c_{2}$|$c_{3}$|
|:--:|:--:|:--:|:--:|
|Medium Light|1|0|0|
|Medium|0|1|0|
|Medium Dark|0|0|1|
|Dark|0|0|0|

</center>

- Thus, the logit model becomes:

$$
\text{log(odds)} = \beta_{0}+\beta_{1}c_{1}+\beta_{2}c_{2}+\beta_{3}c_{3}
$$

- With reference cell coding,
  - $c_{1}=1$ for color = medium light, $0$ otherwise
  - $c_{2}=1$ for color = medium, $0$ otherwise
  - $c_{3}=1$ for color = medium dark, $0$ otherwise
  - Color is dark when $c_{1}=c_{2}=c_{3}=0$
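
The coding table above can be sketched in a few lines of plain Python. The helper `encode` is a hypothetical name introduced only for this illustration; it returns the indicator values $(c_{1}, c_{2}, c_{3})$ for one observation, with Dark as the reference.

```python
# Categories of the predictor; "Dark" is chosen as the reference
categories = ["Medium Light", "Medium", "Medium Dark", "Dark"]
reference = "Dark"

# One indicator per non-reference category (c - 1 indicators in total)
indicator_levels = [cat for cat in categories if cat != reference]

def encode(color):
    """Return the indicator values (c1, c2, c3) for one observation."""
    return [1 if color == level else 0 for level in indicator_levels]

for color in categories:
    print(f"{color:<12} -> {encode(color)}")
```

In pandas, the same kind of coding is commonly obtained with `pd.get_dummies(..., drop_first = True)`, which drops the first category in sorted order and uses it as the reference.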

- **Interpretation**:
> The estimate of the odds ratio between category $k$ and the reference category is $\text{OR}(k,\text{reference})=\exp(\beta_{k})$, with $k=1,2,\dots,c-1$.

- In general, to interpret the odds ratio between two categories:
  1. Calculate the logit difference.
  2. Interpret in terms of the odds ratio.
- Thus, the odds ratio between two categories, say category $a$ and $b$, is $\text{OR}(a,b)=\exp(\beta_{a}-\beta_{b})$, where the reference category has $\beta=0$.

- **Interpretation**:
> The odds of success at $x$ = category $a$  equals $\exp(\beta_{a}-\beta_{b})$ times the odds of success at $x$ = category $b$.
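
A short sketch of the rule $\text{OR}(a,b)=\exp(\beta_{a}-\beta_{b})$ follows. The coefficient values are hypothetical and chosen only for illustration; the reference category (Dark) is given a coefficient of zero so that the same function covers comparisons with the reference.

```python
import math

# Hypothetical coefficients (illustration only); the reference category has beta = 0
beta = {"Medium Light": 1.2, "Medium": 0.8, "Medium Dark": 0.3, "Dark": 0.0}

def odds_ratio(a, b):
    """Odds ratio of category a versus category b: exp(beta_a - beta_b)."""
    return math.exp(beta[a] - beta[b])

print(f"OR(Medium, Dark)        = {odds_ratio('Medium', 'Dark'):.2f}")
print(f"OR(Medium, Medium Dark) = {odds_ratio('Medium', 'Medium Dark'):.2f}")
```

Swapping the two categories inverts the odds ratio, since $\exp(\beta_{b}-\beta_{a}) = 1/\exp(\beta_{a}-\beta_{b})$.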


### **1.3. Ordinal Predictor**
---

- Ordinal predictor : predictor with ordered scale categories, e.g: low, medium, high.
- Ordinal predictors can be treated in a quantitative manner (continuous scale).
- Thus the code:
  - $x=1$ --> low
  - $x=2$ --> medium
  - $x=3$ --> high

## **2. Continuous Predictor**
---

### **2.1. The Intercept**
---

- Intercept value is the logit value when $x=0$.
$$
\begin{align*}
\text{logit}(\pi(x)) &= \beta_{0}+\beta_{1}(x) \\
\text{logit}(\pi(0)) &= \beta_{0}+\beta_{1}(0) \\
\text{logit}(\pi(0)) &= \beta_{0} \\
\end{align*}
$$

<br>

- For example, the logit model from the horseshoe crab data:
$$
\begin{align*}
\text{logit(Width=0)} &= -12.3508 + 0.4972 (0) \\
\text{logit(Width=0)} &= -12.3508
\end{align*}
$$
  - Interpretation: the estimated odds of a crab having any satellite are $\exp(-12.3508)$ when its width is 0 cm.
  - A crab with zero width is not realistic.
  - Therefore, the intercept is not meaningful and is difficult to interpret.
- To make it interpretable, we can transform predictor $x$ by centering its value.
  - `c_width = width - np.mean(width)`
  - Thus, `c_width = 0` represents the mean value of width.
  - In general, zero value for the centered predictor represents the average value of the predictor.
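
The effect of centering can be sketched with the coefficients quoted above. The mean width used here is an assumed value for illustration, not a figure taken from the data. Centering only shifts the intercept to $\beta_{0}+\beta_{1}\bar{x}$; the slope and the fitted logits are unchanged.

```python
import math

# Coefficients of the uncentered model quoted above
b0, b1 = -12.3508, 0.4972

# Assumed sample mean of width in cm (hypothetical value, for illustration)
mean_width = 26.3

# Centering shifts the intercept; the slope is unchanged
b0_centered = b0 + b1 * mean_width

def logit_raw(width):
    """Logit from the model on the original width scale."""
    return b0 + b1 * width

def logit_centered(width):
    """Logit from the equivalent model on the centered width scale."""
    return b0_centered + b1 * (width - mean_width)

odds_at_mean = math.exp(b0_centered)
print(f"Centered intercept     : {b0_centered:.4f}")
print(f"Odds at the mean width : {odds_at_mean:.2f}")
```

The centered intercept is now the log odds for a crab of average width, which is a quantity that can be interpreted.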

### **2.2. The Effect of Continuous Predictor**
---

- Under the assumption that the logit is linear in the continuous predictor $x$, the equation for the logit is:

$$
\text{logit}(\pi(x)) = \beta_{0}+\beta_{1}(x)
$$

<br>

- Slope coefficient, $\beta_{1}$, gives the change in the log odds for an increase of 1 unit in $x$.
- Thus, the odds of success multiply by $\exp(\beta_{1})$ for every 1 unit increase in $x$.
  - But if $x$ lies in the range [0,1], a change of 1 is too large. A change of 0.01 may be more realistic.
  - In another case, a 1 ml increase in coffee consumption may be too small to be considered important. A change of 50 or 100 ml may be more realistic.
  - **Solution : use the term “$c$” for the change of $x$.**
- The interpretation for $c$ unit change in $x$:
> The odds of success multiply by $\exp(c\times\beta_{1})$ for every $c$ unit increase in $x$.
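
A minimal sketch of the $c$-unit interpretation, using the slope of the horseshoe crab model quoted earlier. The function name `odds_multiplier` is introduced here only for illustration.

```python
import math

def odds_multiplier(beta_1, c=1.0):
    """Factor by which the odds change when x increases by c units."""
    return math.exp(c * beta_1)

# Slope of the horseshoe crab model quoted earlier (width in cm)
beta_width = 0.4972

for c in (0.5, 1, 2):
    print(f"c = {c:>3} cm -> odds multiply by {odds_multiplier(beta_width, c):.2f}")
```

Note that the multiplier for $c$ units is the one-unit multiplier raised to the power $c$, not $c$ times the one-unit multiplier.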

## **3. Multiple Predictors**
---

- Denote the $k$ predictors for a binary response $Y$ by $X = x_{1}, x_{2}, \dots, x_{k}$. The model for the log odds is:
$$
\text{logit}(\pi(X)) = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \dots + \beta_{k}x_{k}
$$

<br>

- The parameter $\beta_{j}$ refers to the effect of $x_{j}$ on the log odds that $Y = 1$, controlling for the other predictors.
- Interpretation:

> The odds of success multiply by $\exp(\beta_{j})$ for 1 unit increase in $x_{j}$, at fixed levels of the other $x$.
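
The "at fixed levels of the other $x$" part can be checked numerically. The sketch below uses hypothetical coefficients for a model with two predictors and compares the odds before and after a one-unit increase in $x_{1}$ while $x_{2}$ is held constant.

```python
import math

# Hypothetical coefficients (illustration only): intercept and two slopes
beta_0 = -2.0
beta = [0.5, -0.3]

def odds(x):
    """Odds of success for predictor values x = [x1, x2]."""
    logit = beta_0 + sum(b * v for b, v in zip(beta, x))
    return math.exp(logit)

# Increase x1 by one unit while x2 stays fixed
ratio = odds([3.0, 1.0]) / odds([2.0, 1.0])

print(f"Odds ratio for +1 in x1 : {ratio:.4f}")
print(f"exp(beta_1)             : {math.exp(beta[0]):.4f}")
```

The ratio does not depend on the value at which $x_{2}$ is fixed, because its contribution cancels in the logit difference.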



# **Case: Categorical Predictor**
---

Note:
- Ideally, we perform the significance test for each parameter estimate before interpreting the logistic regression model.
- We skip the significance tests in this section because they were covered last week.
- Fortunately, all models in this section yield significant parameter estimates (P-value < $\alpha=0.05$).

## **Load Data**
---

The sample we will use in this example is a fictitious dataset from [here](https://www.kaggle.com/datasets/laotse/credit-risk-dataset).

The sample consists of demographic, credit bureau, and financial information.

Note that we are not defining the default or bad status from our dataset here. Instead, we already have the binary response variable:

- `loan_status`
  - `loan_status = 0` for non default loan.
  - `loan_status = 1` for default loan.

The potential predictors for predicting the response variable are:

1. `person_age` : age of the debtor.
2. `person_income` : annual income of the debtor.
3. `person_home_ownership`
  - `RENT`
  - `MORTGAGE`
  - `OWN`
  - `OTHER`
4. `person_emp_length` : employment length of debtor (in years).
5. `loan_intent` : purpose of the loan.
  - `EDUCATION`
  - `MEDICAL`
  - `VENTURE`
  - `PERSONAL`
  - `DEBTCONSOLIDATION`
  - `HOMEIMPROVEMENT`
6. `loan_grade`
7. `loan_amnt` : amount of the loan.
8. `loan_int_rate` : interest rate of the loan.
9. `loan_percent_income` : loan amount as a proportion of the debtor's income.
10. `cb_person_default_on_file` : historical default.
  - `N` : the debtor does not have any history of defaults.
  - `Y` : the debtor has a history of defaults on their credit file.
11. `cb_person_cred_hist_length` : length of the credit history.

First, load the data from `credit_risk_dataset.csv` file.

In [1]:
# Load data manipulation package
import numpy as np
import pandas as pd

# Load data visualization package
import matplotlib.pyplot as plt
import seaborn as sns

# Load Statistics package
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression

In [2]:
# Import dataset from csv file
data = pd.read_csv('../data/credit_risk_dataset.csv')

# Table check
data.head().T

Unnamed: 0,0,1,2,3,4
person_age,22,21,25,23,24
person_income,59000,9600,9600,65500,54400
person_home_ownership,RENT,OWN,MORTGAGE,RENT,RENT
person_emp_length,123.0,5.0,1.0,4.0,8.0
loan_intent,PERSONAL,EDUCATION,MEDICAL,MEDICAL,MEDICAL
loan_grade,D,B,C,C,C
loan_amnt,35000,1000,5500,35000,35000
loan_int_rate,16.02,11.14,12.87,15.23,14.27
loan_status,1,0,1,1,1
loan_percent_income,0.59,0.1,0.57,0.53,0.55


In [3]:
# Check the data shape
data.shape

(32581, 12)

Our sample contains 12 variables from 32,581 credit records.
- 1 response variable, `loan_status`,
- and 11 potential predictors/characteristics.

Before modeling, make sure you split the data first for model validation.

In the classification case, check the proportion of response variable first to decide the splitting strategy.

In [4]:
# Define response variable
response_variable = 'loan_status'

# Check the proportion of response variable
data[response_variable].value_counts(normalize = True)

loan_status
0    0.781836
1    0.218164
Name: proportion, dtype: float64

The response variable, `loan_status`, is imbalanced (a ratio of about 78:22).

To keep the same ratio in the training and testing sets, use stratified splitting based on the response variable, `loan_status`.

## **Sample Splitting**
---

First, define the predictors (X) and the response (y).

In [5]:
# Split response and predictors
y = data[response_variable]
X = data.drop(columns = [response_variable])

# Validate the splitting
print('y shape :', y.shape)
print('X shape :', X.shape)

y shape : (32581,)
X shape : (32581, 11)


Next, split the predictors (X) and the response (y) into training and testing sets.
- Set `stratify = y` to stratify the split so that both sets keep the class proportions of the response y.
- Set `test_size = 0.3` for holding 30% of the sample as a testing set.
- Set `random_state = 42` for reproducibility.

In [6]:
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify = y,
                                                    test_size = 0.3,
                                                    random_state = 42)

# Validate splitting
print('X train shape :', X_train.shape)
print('y train shape :', y_train.shape)
print('X test shape  :', X_test.shape)
print('y test shape  :', y_test.shape)

X train shape : (22806, 11)
y train shape : (22806,)
X test shape  : (9775, 11)
y test shape  : (9775,)


Check the proportion of response y in each training and testing set.

In [7]:
y_train.value_counts(normalize = True)

loan_status
0    0.781856
1    0.218144
Name: proportion, dtype: float64

In [8]:
y_test.value_counts(normalize = True)

loan_status
0    0.78179
1    0.21821
Name: proportion, dtype: float64

In [9]:
# Concatenate X_train and y_train as data_train
data_train = pd.concat((X_train, y_train),
                       axis = 1)

# Validate data_train
print('Train data shape:', data_train.shape)
data_train.head()

Train data shape: (22806, 12)


Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length,loan_status
11491,26,62000,RENT,1.0,DEBTCONSOLIDATION,B,10000,11.26,0.16,N,2,0
3890,23,39000,MORTGAGE,3.0,EDUCATION,C,5000,12.98,0.13,N,4,0
17344,24,35000,RENT,1.0,DEBTCONSOLIDATION,A,12000,6.54,0.34,N,2,1
13023,24,86000,RENT,1.0,HOMEIMPROVEMENT,B,12000,10.65,0.14,N,3,0
29565,42,38400,RENT,4.0,MEDICAL,B,13000,,0.34,N,11,1


## **Binary Predictor**
---

- We will use variable `cb_person_default_on_file` as the binary predictor for response `loan_status`.
- `cb_person_default_on_file` $(x)$ has 2 categories:
  - $x$ = N --> the debtor does not have any history of defaults.
  - $x$ = Y --> the debtor has a history of defaults on their credit file.
- Thus, the logit model becomes: $\text{logit} =\beta_{0} + \beta_{1}x$

### Model Fitting
---

In [10]:
# Define the response 'default' and binary predictor 'history'
default = data_train[response_variable]
history = data_train['cb_person_default_on_file']

history.head()

11491    N
3890     N
17344    N
13023    N
29565    N
Name: cb_person_default_on_file, dtype: object

In [11]:
# Make dummy variable where 'N' (never default) is the reference for 'history'
# history_default = 0 --> does not have any history of defaults
history_default = pd.get_dummies(history,
                                 drop_first = True)

# Rename the column
history_default.columns = ['history_default']

history_default.head()

Unnamed: 0,history_default
11491,False
3890,False
17344,False
13023,False
29565,False


In [12]:
# Modeling with statsmodels
# Use dummy variable 'history_default' as predictor

# Load the package
import statsmodels.api as sm

# Add constant to predictor
history_default_sm = sm.add_constant(history_default.astype(float))

# Model fitting
model_history_default = sm.Logit(endog = default,
                                 exog = history_default_sm)
result_history_default = model_history_default.fit()

# Print the result
print(result_history_default.summary())

Optimization terminated successfully.
         Current function value: 0.510410
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:            loan_status   No. Observations:                22806
Model:                          Logit   Df Residuals:                    22804
Method:                           MLE   Df Model:                            1
Date:                Tue, 12 Nov 2024   Pseudo R-squ.:                 0.02695
Time:                        19:32:16   Log-Likelihood:                -11640.
converged:                       True   LL-Null:                       -11963.
Covariance Type:            nonrobust   LLR p-value:                2.876e-142
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
const              -1.4879      0.019    -79.001      0.000      -1.525      -1.451
history_defaul

In [13]:
# Modeling with statsmodels.formula
# You can use the predictor as it is

# Load the package
import statsmodels.formula.api as smf

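# Collect the response and the predictor in one DataFrame for the formula API
data_history = pd.DataFrame({'default': default,
                             'history': history})

# Model fitting
model_history = smf.logit('default ~ history',
                          data = data_history)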
result_model_history = model_history.fit()

# Print the result
print(result_model_history.summary())

Optimization terminated successfully.
         Current function value: 0.510410
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:                default   No. Observations:                22806
Model:                          Logit   Df Residuals:                    22804
Method:                           MLE   Df Model:                            1
Date:                Tue, 12 Nov 2024   Pseudo R-squ.:                 0.02695
Time:                        19:32:16   Log-Likelihood:                -11640.
converged:                       True   LL-Null:                       -11963.
Covariance Type:            nonrobust   LLR p-value:                2.876e-142
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -1.4879      0.019    -79.001      0.000      -1.525      -1.451
history[T.Y]     0.9781

- Thus, we have the logit model:
$$
\begin{align*}
\text{logit(historydefault)} &=\beta_{0} + \beta_{1} \text{historydefault} \\
\text{logit(historydefault)} &=-1.4879 + 0.9781(\text{historydefault})
\end{align*}
$$

In [14]:
# Modelling with sklearn

# Load Statistics package
from sklearn.linear_model import LogisticRegression

# Create the object
model_history_sk = LogisticRegression(penalty = None)

# Use dummy variable 'history_default' as predictor
model_history_sk.fit(X = history_default,
                     y = default)

Extract $\beta_{0}$ estimate.

In [15]:
# Print the parameter estimate of b0
b0_history = model_history_sk.intercept_
b0_history

array([-1.48802833])

Extract $\beta_{1}$ estimate.

In [16]:
# Print the parameter estimate of b1
b1_history = model_history_sk.coef_
b1_history

array([[0.97818755]])

We already fit logistic regression model with:
1. `statsmodels.api`
2. `statsmodels.formula.api`
3. `sklearn.linear_model`

To keep the interpretation of the parameter estimates simple, the remaining models are fitted with `statsmodels.api` only, since it provides a summary table of the fitted model.

### Interpretation
---

Interpretation for $\beta_{0}$.
- $\beta_{0}$ is the logit value when $x=0$ or $x$= reference.
- Thus, $\beta_{0}$ is the logit or log odds of success of reference category.
- $\text{odds(reference)}=\exp(\beta_{0})$

In [17]:
# Calculate odds of reference: does not have any history of defaults
odds_never_default = np.exp(b0_history)

print(f"The odds of default for those who have never been in default is {odds_never_default[0]:.2f}.")

The odds of default for those who have never been in default is 0.23.


Interpretation:
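> For debtors who have never been in default, the odds of default are 0.23, i.e. about 0.23 defaults for every non-default. These debtors are more likely to repay than to default.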
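
Odds are easier to read once they are converted to a probability with $\pi = \text{odds}/(1+\text{odds})$. The sketch below applies this conversion to the intercept reported in the statsmodels summary above; the helper name `odds_to_probability` is introduced only for this illustration.

```python
import math

def odds_to_probability(odds):
    """Convert the odds of an event into its probability."""
    return odds / (1.0 + odds)

# Intercept reported in the statsmodels summary above
b0 = -1.4879

odds_reference = math.exp(b0)
prob_reference = odds_to_probability(odds_reference)

print(f"Odds of default        : {odds_reference:.2f}")
print(f"Probability of default : {prob_reference:.2f}")
```

An odds value below 1 always corresponds to a probability below 0.5, which is why the reference group is described as more likely to repay than to default.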

Interpretation for $\beta_{1}$.

In [18]:
# Calculate the OR between history_default=1 and history_default=0
odds_ratio_history = np.exp(b1_history)

print(f"OR (ever default, never default) = {odds_ratio_history[0][0]:.2f}")

OR (ever default, never default) = 2.66


Interpretation:
> Debtors who have been in default are more likely to default again than those who have never been in default. The odds of default for debtors who have been in default are 2.66 times the odds for those who have never been in default.

Let's prove the Odds Ratio formula above.

In [19]:
# Contingency table 'default' and 'history'
crosstab_history = pd.crosstab(history,
                               default,
                               margins = True)
crosstab_history

loan_status,0,1,All
cb_person_default_on_file,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
N,15302,3456,18758
Y,2529,1519,4048
All,17831,4975,22806


In [20]:
# Calculate the odds of default for debtors who have been in default
default_and_ever_default = crosstab_history[1]['Y']
non_default_and_ever_default  = crosstab_history[0]['Y']

odds_ever_default = default_and_ever_default/non_default_and_ever_default
odds_ever_default

0.6006326611308818

In [21]:
# Calculate the odds of default for debtors who have NEVER been in default
default_and_never_default = crosstab_history[1]['N']
non_default_and_never_default  = crosstab_history[0]['N']

odds_never_default = default_and_never_default/non_default_and_never_default
odds_never_default

0.22585282969546464

The odds of default for those who have never been in default is proven.

In [22]:
# Calculate the OR
odds_ratio_history = odds_ever_default / odds_never_default
odds_ratio_history

2.6593984318937363

OR (ever default, never default) = 2.66 is proven.

## **Multicategory Predictor**
---

- We use variable `person_home_ownership` as the multicategory predictor for response `loan_status`.
- `person_home_ownership` has 4 categories, where `MORTGAGE` is the **reference**.
  - $k_{1}$ = `OTHER`
  - $k_{2}$ = `OWN`
  - $k_{3}$ = `RENT`
  - $k_{4}$ = `MORTGAGE` (**reference**)
- Thus, the logit model becomes:
$$
\text{logit} =\beta_{0} + \beta_{1}k_{1} + \beta_{2}k_{2} + \beta_{3}k_{3}
$$

### Model Fitting
---

In [23]:
# Define the response 'default' and multicategory predictor 'home'
default = data_train[response_variable]
home = data_train['person_home_ownership']

home.head()

11491        RENT
3890     MORTGAGE
17344        RENT
13023        RENT
29565        RENT
Name: person_home_ownership, dtype: object

In [24]:
# MORTGAGE as the reference
# Results 3 dummies predictor; OTHER, OWN, RENT
home_mortgage = pd.get_dummies(home,
                               drop_first = True)

home_mortgage.head()

Unnamed: 0,OTHER,OWN,RENT
11491,False,False,True
3890,False,False,False
17344,False,False,True
13023,False,False,True
29565,False,False,True


In [26]:
# Modeling with statsmodels
# Use 3 dummy variables as predictors

# Add constant to predictors
home_mortgage_sm = sm.add_constant(home_mortgage.astype(float))

# Model fitting
model_home = sm.Logit(endog = default,
                      exog = home_mortgage_sm)
result_home = model_home.fit()

# Print the result
print(result_home.summary())

Optimization terminated successfully.
         Current function value: 0.493191
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:            loan_status   No. Observations:                22806
Model:                          Logit   Df Residuals:                    22802
Method:                           MLE   Df Model:                            3
Date:                Tue, 12 Nov 2024   Pseudo R-squ.:                 0.05978
Time:                        19:35:28   Log-Likelihood:                -11248.
converged:                       True   LL-Null:                       -11963.
Covariance Type:            nonrobust   LLR p-value:                7.866e-310
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.9419      0.031    -62.383      0.000      -2.003      -1.881
OTHER          1.2865      0.

- Thus, we have the logit model:
$$
\begin{align*}
\text{logit(home)} &=\beta_{0} + \beta_{1}k_{1} + \beta_{2}k_{2} + \beta_{3}k_{3} \\
\\
&=\beta_{0} + \beta_{1}(\text{OTHER}) + \beta_{2}(\text{OWN}) + \beta_{3}(\text{RENT}) \\
\\
\text{logit(home)} &=-1.9419 + 1.2865(\text{OTHER}) - 0.6242(\text{OWN}) + 1.1716(\text{RENT}) \\
\end{align*}
$$

Extract parameter estimate $\beta_{0}, \beta_{1}, \dots, \beta_{3}$

In [27]:
# Print the parameter estimate of bk for k=0,1,2,3
bk_home = result_home.params
bk_home

const   -1.941907
OTHER    1.286500
OWN     -0.624225
RENT     1.171554
dtype: float64

Extract $\beta_{0}$ estimate.

In [28]:
# Print the parameter estimate of b0
b0_home = bk_home['const']
b0_home

-1.9419069897092203

Extract $\beta_{1}$ estimate for OTHER.

In [29]:
# Print the parameter estimate of b1 for OTHER
b1_other = bk_home['OTHER']
b1_other

1.286500137132122

Extract $\beta_{2}$ estimate for OWN.

In [30]:
# Print the parameter estimate of b2 for OWN
b2_own = bk_home['OWN']
b2_own

-0.6242251000013656

Extract $\beta_{3}$ estimate for RENT.

In [31]:
# Print the parameter estimate of b3 for RENT
b3_rent = bk_home['RENT']
b3_rent

1.1715544633640351

### Interpretation
---

Interpretation for $\beta_{0}$.
- $\beta_{0}$ is the logit value when $k_{1} = k_{2} = k_{3} =0$ or $x =$ reference.
- Thus, $\beta_{0}$ is the logit or log odds of success of reference category.
- $\text{odds(reference)}=\exp(\beta_{0})$

In [32]:
# Calculate odds of reference: MORTGAGE
odds_mortgage = np.exp(b0_home)

print(f"The odds of default for debtors who live in a mortgage is {odds_mortgage:.2f}.")

The odds of default for debtors who live in a mortgage is 0.14.


Interpretation:
> Debtors with a mortgage are less likely to default than not: their probability of default is only 0.14 times their probability of not defaulting.

Interpretation for $\beta_{1}$, $\beta_{2}$, and $\beta_{3}$.

In [33]:
# Calculate the odds of OTHER, OWN, RENT compared with MORTGAGE
odds_ratio_home = np.exp(bk_home[1:4])

print(f"1. OR (OTHER, MORTGAGE) = {odds_ratio_home['OTHER']:.2f}")
print(f"2. OR (OWN, MORTGAGE) = {odds_ratio_home['OWN']:.2f}")
print(f"3. OR (RENT, MORTGAGE) = {odds_ratio_home['RENT']:.2f}")

1. OR (OTHER, MORTGAGE) = 3.62
2. OR (OWN, MORTGAGE) = 0.54
3. OR (RENT, MORTGAGE) = 3.23


Interpretation:
> Compared with MORTGAGE, the odds of default are 3.62 times as high for OTHER and 3.23 times as high for RENT, while for OWN they are only 0.54 times the odds for MORTGAGE.

Let's prove the Odds Ratio formula above.

In [34]:
# Proportion of 'default' by 'home'
crosstab_home = pd.crosstab(home,
                            default,
                            margins = True)
crosstab_home

loan_status,0,1,All
person_home_ownership,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
MORTGAGE,8227,1180,9407
OTHER,52,27,79
OWN,1692,130,1822
RENT,7860,3638,11498
All,17831,4975,22806


In [35]:
# Calculate odds of each category
crosstab_home['Odds'] = crosstab_home[1]/crosstab_home[0]

crosstab_home

loan_status,0,1,All,Odds
person_home_ownership,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
MORTGAGE,8227,1180,9407,0.14343
OTHER,52,27,79,0.519231
OWN,1692,130,1822,0.076832
RENT,7860,3638,11498,0.46285
All,17831,4975,22806,0.279008


In [36]:
# Extract the odds of each category
odds_mortgage = crosstab_home.loc['MORTGAGE', 'Odds']
odds_other = crosstab_home.loc['OTHER', 'Odds']
odds_own = crosstab_home.loc['OWN', 'Odds']
odds_rent = crosstab_home.loc['RENT', 'Odds']

print(f"The odds of default for MORTGAGE is {odds_mortgage:.2f}.")
print(f"The odds of default for OTHER is {odds_other:.2f}.")
print(f"The odds of default for OWN is {odds_own:.2f}.")
print(f"The odds of default for RENT is {odds_rent:.2f}.")

The odds of default for MORTGAGE is 0.14.
The odds of default for OTHER is 0.52.
The odds of default for OWN is 0.08.
The odds of default for RENT is 0.46.


In [37]:
# Calculate the OR between each category and MORTGAGE
crosstab_home['OR'] = crosstab_home['Odds']/crosstab_home['Odds']['MORTGAGE']

crosstab_home

loan_status,0,1,All,Odds,OR
person_home_ownership,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
MORTGAGE,8227,1180,9407,0.14343,1.0
OTHER,52,27,79,0.519231,3.620095
OWN,1692,130,1822,0.076832,0.535676
RENT,7860,3638,11498,0.46285,3.227005
All,17831,4975,22806,0.279008,1.945256


In [38]:
# Extract the OR of each category
OR_other = crosstab_home.loc['OTHER', 'OR']
OR_own = crosstab_home.loc['OWN', 'OR']
OR_rent = crosstab_home.loc['RENT', 'OR']

print(f"Odds Ratio (OTHER, MORTGAGE) = {OR_other:.2f}")
print(f"Odds Ratio (OWN, MORTGAGE) = {OR_own:.2f}")
print(f"Odds Ratio (RENT, MORTGAGE) = {OR_rent:.2f}")

Odds Ratio (OTHER, MORTGAGE) = 3.62
Odds Ratio (OWN, MORTGAGE) = 0.54
Odds Ratio (RENT, MORTGAGE) = 3.23


$\exp(\beta_{k})$ for odds ratio between category $k$ and reference category is proven.

---
What about the odds ratio between category $k=3$ (RENT) and category $k=2$ (OWN)?
- $\text{OR}(3,2) = \exp(\beta_{3}-\beta_{2})$

In [39]:
# Calculate the OR between RENT and OWN
odds_ratio_rent_own = np.exp(b3_rent - b2_own)

print(f"The odds of default for RENT are {odds_ratio_rent_own:.2f} times the odds for OWN.")

The odds of default for RENT are 6.02 times the odds for OWN.


In [40]:
# Proof with the odds from contingency table
print(f" Odds Ratio (RENT, OWN) = {(odds_rent / odds_own):.2f} is proven.")

 Odds Ratio (RENT, OWN) = 6.02 is proven.
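
The same rule gives the odds ratio for every pair of categories. The sketch below hard-codes the coefficients reported above, so it runs on its own. Assigning a coefficient of zero to the reference category is a convenience introduced for this illustration.

```python
import math

# Coefficients reported above; the reference (MORTGAGE) has coefficient 0
beta_home = {
    "MORTGAGE": 0.0,
    "OTHER": 1.286500,
    "OWN": -0.624225,
    "RENT": 1.171554,
}

def odds_ratio(a, b):
    """Odds ratio of category a versus category b: exp(beta_a - beta_b)."""
    return math.exp(beta_home[a] - beta_home[b])

# Print every pair of categories once
levels = list(beta_home)
for i, a in enumerate(levels):
    for b in levels[i + 1:]:
        print(f"OR({a}, {b}) = {odds_ratio(a, b):.2f}")
```

With $c$ categories there are $c(c-1)/2$ such pairs, all recoverable from the $c-1$ fitted coefficients.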


# **Case: Continuous Predictor**
---

## **Ordinal Predictor**
---
An ordinal predictor can be treated in a quantitative manner (continuous scale).

- `loan_grade` has a natural ordering of categories, grade A to G.
- We can treat variable `loan_grade` as the ordinal predictor for response `loan_status`.
- Thus, we don't have dummies for `loan_grade`.
- Code the categories with ordered scale from 1 to 7.
  - $x=1$ for grade A
  - $x=2$ for grade B
  - $x=3$ for grade C
  - $x=4$ for grade D
  - $x=5$ for grade E
  - $x=6$ for grade F
  - $x=7$ for grade G
- The logit model becomes: $\text{logit} =\beta_{0} + \beta_{1}x$

### Model Fitting
---

In [41]:
# Label the categories of 'loan_grade'
label = {'A' : 1,
         'B' : 2,
         'C' : 3,
         'D' : 4,
         'E' : 5,
         'F' : 6,
         'G' : 7}

loan_grade = data_train['loan_grade'].map(label)

In [42]:
# Modeling with statsmodels

# Add constant to predictor
loan_grade_sm = sm.add_constant(loan_grade)

# Model fitting
model_loan_grade = sm.Logit(endog = default,
                            exog = loan_grade_sm)
result_loan_grade = model_loan_grade.fit()

# Print the result
print(result_loan_grade.summary())

Optimization terminated successfully.
         Current function value: 0.459908
         Iterations 6
                           Logit Regression Results                           
Dep. Variable:            loan_status   No. Observations:                22806
Model:                          Logit   Df Residuals:                    22804
Method:                           MLE   Df Model:                            1
Date:                Tue, 12 Nov 2024   Pseudo R-squ.:                  0.1232
Time:                        19:39:08   Log-Likelihood:                -10489.
converged:                       True   LL-Null:                       -11963.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -3.1300      0.043    -72.434      0.000      -3.215      -3.045
loan_grade     0.7515      0.

- Thus, we have the logit model:
$$
\begin{align*}
\text{logit(loan-grade)} &=\beta_{0} + \beta_{1}(\text{loan-grade}) \\
\text{logit(loan-grade)} &=-3.1300 + 0.7515(\text{loan-grade}) \\
\end{align*}
$$

Extract $\beta_{0}$ estimate.

In [43]:
# Print the parameter estimate of b0
b0_loan_grade = result_loan_grade.params['const']
b0_loan_grade

-3.1300040049471676

Extract $\beta_{1}$ estimate.

In [44]:
# Print the parameter estimate of b1 for loan grade
b1_loan_grade = result_loan_grade.params['loan_grade']
b1_loan_grade

0.7514926219585687

### Interpretation
---

Interpretation for $\beta_{1}$.

In [45]:
# Calculate the OR of default
odds_ratio_loan_grade = np.exp(b1_loan_grade)

print(f"OR = {odds_ratio_loan_grade:.2f}")

OR = 2.12


- Interpretation:
> - The odds of default roughly double (multiply by 2.12) for every one-level downgrade in loan grade, e.g. from A to B.
> - In other words, the better the loan grade (the closer to A), the lower the odds of default.

- To verify the interpretation, calculate the odds at each level of loan grade.

In [46]:
# Function to calculate odds of default
def odds(b0, b1, X):
    """
    Function to calculate odds of success.

    Parameters
    ----------
    b0  : float
        The optimum parameter estimate of intercept

    b1  : {array-like} of shape (n_predictors, 1)
        The optimum parameter estimate of slope/weights

    X   : {array-like} of shape (n_sample, n_predictors)
        The independent variable or predictor

    Returns
    -------
    odds  : {array-like} of shape (n_sample, 1)
          The odds of success
    """
    # Calculate logit (log odds)
    logit = b0 + b1*X
    odds = np.exp(logit)

    return odds

In [47]:
# Define x=1,2,...,7
label_loan_grade = np.unique(loan_grade)

# Calculate the odds of x
odds_label_loan_grade = odds(b0 = b0_loan_grade,
                             b1 = b1_loan_grade,
                             X = label_loan_grade)


# Extract the odds of each value
odds_A = odds_label_loan_grade[0]
odds_B = odds_label_loan_grade[1]
odds_C = odds_label_loan_grade[2]
odds_D = odds_label_loan_grade[3]
odds_E = odds_label_loan_grade[4]
odds_F = odds_label_loan_grade[5]
odds_G = odds_label_loan_grade[6]

print(f"Odds x=1 (grade A) : {odds_A:.2f}")
print(f"Odds x=2 (grade B) : {odds_B:.2f}")
print(f"Odds x=3 (grade C) : {odds_C:.2f}")
print(f"Odds x=4 (grade D) : {odds_D:.2f}")
print(f"Odds x=5 (grade E) : {odds_E:.2f}")
print(f"Odds x=6 (grade F) : {odds_F:.2f}")
print(f"Odds x=7 (grade G) : {odds_G:.2f}")

Odds x=1 (grade A) : 0.09
Odds x=2 (grade B) : 0.20
Odds x=3 (grade C) : 0.42
Odds x=4 (grade D) : 0.88
Odds x=5 (grade E) : 1.87
Odds x=6 (grade F) : 3.97
Odds x=7 (grade G) : 8.42


Interpretation:
> The odds of default increase with each downgrade, from 0.09 at grade A to 8.42 at grade G. The better the loan grade, the lower the odds of default.

Treating an ordinal predictor in a quantitative manner is advantageous because the model is simpler and easier to interpret.
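
The fitted odds per grade can also be expressed as probabilities, and a change of several levels follows the $c$-unit rule $\exp(c\times\beta_{1})$. The sketch below hard-codes the rounded coefficients from the summary above, so small rounding differences from the notebook values are expected.

```python
import math

# Rounded coefficients reported in the summary above
b0, b1 = -3.1300, 0.7515

grades = "ABCDEFG"

for x, grade in enumerate(grades, start=1):
    odds_x = math.exp(b0 + b1 * x)
    prob_x = odds_x / (1.0 + odds_x)
    print(f"Grade {grade} (x={x}) : odds = {odds_x:.2f}, probability = {prob_x:.2f}")

# Moving across c levels multiplies the odds by exp(c * b1)
odds_ratio_G_vs_A = math.exp((7 - 1) * b1)
print(f"OR (grade G, grade A) = {odds_ratio_G_vs_A:.1f}")
```

A single slope describes all six grade-to-grade comparisons, whereas dummy coding of the same predictor would require six separate coefficients.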