<a href="https://colab.research.google.com/github/francji1/01ZLMA/blob/main/code/01ZLMA_ex07_Binary_Data_1a_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Exercise 07 of the course 01ZLMA.

# GLM for Discrete response - Binary Data Analysis

Alternative and Binomial responses

**Bernoulli (Alternative) Model**

$$Y_{i,j} \sim Be(\pi_i) \ i = 1,\ldots,K \ \text{and} \ j = 1,\ldots, n_i.$$
$K$ is the number of groups, $n_i$ is the number of observations in group $i$, and $\sum_{i=1}^{K} n_i = N$.
$$ E[Y_{i,j}] = \pi_i \ \text{and} \ g(\pi_i) = \eta_i =x_i^T \beta $$


**Binomial Model**
$$Y_i = \sum_{j=1}^{n_i} Y_{i,j} \sim Bi(n_i, \pi_i)$$

**Without continuous covariates (only factor variables)**

$K$ is constant and $n_i \rightarrow \infty $

**With at least one continuous covariate**

$n_i \approx 1$ (each $n_i$ stays small) and $K \rightarrow \infty$



## Link functions for binary data

**Logistic function:**

* The logit is the canonical link function for binary responses; its inverse, the logistic function, is the CDF of the standard logistic distribution.

$$g(\pi_i)=\log\left(\frac{\pi_i}{1-\pi_i}\right) $$

$$\pi_i = \frac{1}{1+e^{-x_i^T \beta}} = \frac{e^{x_i^T \beta}}{1+e^{x_i^T \beta}} $$

* **Logistic regression** is a GLM with binomial family and logit link


**Probit function:**

* The inverse link is the CDF of the standard normal distribution.
$$\pi_i = \Phi({x_i^T \beta}) $$


**Cauchit function:**

* The inverse link is the CDF of the standard Cauchy distribution.

$$\pi_i = \frac{1}{\pi}\text{arctan}(x_i^T \beta) + \frac{1}{2} $$


**Complementary log-log (cloglog) function:**

* The inverse of the complementary log-log link (the CDF of the Gumbel, i.e. minimum extreme value, distribution)

$$\pi_i = 1 - e^{-e^{x_i^T \beta}}$$

* The counterpart of the cloglog function is the log-log link function.
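As a quick numerical check of the formulas above, the inverse links can be written directly with the standard library. This is only a minimal sketch: the function names are illustrative and not part of `statsmodels`, and the log-log inverse $\pi = e^{-e^{-\eta}}$ is one common parameterisation assumed here.

```python
import math

def inv_logit(eta):
    """Logistic function: CDF of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    """CDF of the standard normal distribution, written via erf."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def inv_cauchit(eta):
    """CDF of the standard Cauchy distribution."""
    return math.atan(eta) / math.pi + 0.5

def inv_cloglog(eta):
    """Inverse of the complementary log-log link."""
    return 1.0 - math.exp(-math.exp(eta))

def inv_loglog(eta):
    """Inverse of the log-log link (assumed parameterisation)."""
    return math.exp(-math.exp(-eta))

inverse_links = {
    "logit": inv_logit,
    "probit": inv_probit,
    "cauchit": inv_cauchit,
    "cloglog": inv_cloglog,
    "loglog": inv_loglog,
}

# Evaluate every inverse link at eta = 0
for name, f in inverse_links.items():
    print(f"{name:8s} pi(0) = {f(0.0):.4f}")
```

The three symmetric links map $\eta = 0$ to $0.5$, while cloglog and log-log do not; under the parameterisation assumed here they mirror each other through `inv_loglog(eta) == 1 - inv_cloglog(-eta)`.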

Likelihood:
$$L_n (\beta, Y) = \prod_{i=1}^{K} \prod_{j=1}^{n_i} \pi_i^{Y_{ij}} (1-\pi_i)^{1-Y_{ij}}$$
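The logarithm of this likelihood depends on the data only through the group totals $Y_i$ and sizes $n_i$, which is why the Bernoulli and binomial formulations lead to the same estimates. A small sketch with made-up data and assumed probabilities illustrates this; the helper name `bernoulli_loglik` is hypothetical.

```python
import numpy as np

def bernoulli_loglik(y, pi):
    """Sum of y*log(pi) + (1-y)*log(1-pi) over individual observations."""
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    return float(np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi)))

# Made-up data: K = 2 groups with n_1 = 4 and n_2 = 3 observations
group = np.array([0, 0, 0, 0, 1, 1, 1])
y = np.array([1, 0, 1, 1, 0, 0, 1])
pi_group = np.array([0.7, 0.4])   # assumed success probabilities per group

# Individual-level (Bernoulli) log-likelihood
loglik_individual = bernoulli_loglik(y, pi_group[group])

# Grouped form: only the totals Y_i and the sizes n_i are needed
n_i = np.bincount(group)
y_i = np.bincount(group, weights=y)
loglik_grouped = float(
    np.sum(y_i * np.log(pi_group) + (n_i - y_i) * np.log(1 - pi_group))
)

print(loglik_individual, loglik_grouped)
```

The binomial log-likelihood differs from this value only by the constant $\sum_i \log \binom{n_i}{Y_i}$, which does not involve $\beta$.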

In [None]:
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.genmod.families.links import Logit, Probit, Cauchy, CLogLog, LogLog
import statsmodels.api as sm
import statsmodels.formula.api as smf

from scipy.stats import norm


In [None]:
# Set the display width to avoid automatic wrapping of the display
pd.set_option('display.width', 200)

# Increase maximum column width to see more data within each cell
pd.set_option('display.max_colwidth', 50)

In [None]:
# Define the link functions and names
links = [Logit(), Probit(), Cauchy(), CLogLog(), LogLog()]
link_names = ['logit', 'probit', 'cauchit', 'cloglog', 'loglog']

# Create a sequence of x values and DataFrame
x = np.linspace(-5, 5, 101)
df = pd.DataFrame()

for link, name in zip(links, link_names):
    y = link.inverse(x)
    temp_df = pd.DataFrame({'x': x, 'y': y, 'link_name': name})
    df = pd.concat([df, temp_df])

# Plotting
plt.figure(figsize=(10, 6))
for name in link_names:
    subset = df[df['link_name'] == name]
    plt.plot(subset['x'], subset['y'], label=name)

plt.title('Inverse Link Functions')
plt.xlabel(r'linear predictor $\eta$')
plt.ylabel(r'$\pi$')
plt.legend(title='Link Function')

plt.axvline(0, color='gray', linestyle='--', linewidth=1)
plt.axhline(0, color='gray', linestyle='--', linewidth=1)
plt.axhline(0.5, color='gray', linestyle='--', linewidth=1)
plt.axhline(1, color='gray', linestyle='--', linewidth=1)
plt.show()


## Logistic regression with Titanic dataset

https://www.kaggle.com/c/titanic/data

| Variable |                 Definition                 |                       Key                      |
|:--------:|:------------------------------------------:|:----------------------------------------------:|
| survival | Survival                                   | 0 = No, 1 = Yes                                |
| pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex      | Sex                                        |                                                |
| Age      | Age in years                               |                                                |
| sibsp    | # of siblings / spouses aboard the Titanic |                                                |
| parch    | # of parents / children aboard the Titanic |                                                |
| ticket   | Ticket number                              |                                                |
| fare     | Passenger fare                             |                                                |
| cabin    | Cabin number                               |                                                |
| embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

In [None]:
# Load the Titanic dataset from seaborn package
titanic = sns.load_dataset('titanic')

# Display the first few rows of the dataset
print(titanic.head())

In [None]:
# Load the R magic extension
%load_ext rpy2.ipython

In [None]:
%%R
install.packages("titanic")
library(titanic)

In [None]:
%%R -o titanic_train -o titanic_test

In [None]:
titanic_train

In [None]:
titanic_test

In [None]:
# Summarize the datasets
print("Summary of Train Dataset (Numeric):")
print(titanic_train.describe())
print("\nSummary of Train Dataset (Categorical):")
print(titanic_train.describe(include=['O']))

#print("\nSummary of Test Dataset (Numeric):")
#print(titanic_test.describe())
#print("\nSummary of Test Dataset (Categorical):")
#print(titanic_test.describe(include=['O']))

# Count missing values
print("\nNumber of NAs in Train Dataset:")
print(titanic_train.isna().sum())
#print("\nNumber of NAs in Test Dataset:")
#print(titanic_test.isna().sum())

We will modify `titanic_train` for our academic purposes :-)

### Model where all covariates are discrete (Use DF called `data_dis`)

In [None]:
# Load data and select columns
data_dis = titanic_train[['Survived', 'Pclass', 'Sex', 'Embarked']]

# Convert string columns to categorical
data_dis = data_dis.apply(lambda col: col.astype('category') if col.dtype == 'object' else col)
print("Summary after conversion to categorical:")
print(data_dis.describe(include='all'))

# Filter and transmute
data_dis = data_dis[data_dis['Embarked'].isin(['C', 'Q', 'S'])]
data_dis = data_dis.assign(
    survived=data_dis['Survived'].astype('category'),
    classes=data_dis['Pclass'].astype('category'),
    sex=data_dis['Sex'].astype('category'),
    embarked=data_dis['Embarked'].astype('category')
)
data_dis = data_dis[['survived', 'classes', 'sex', 'embarked']]



In [None]:
sns.countplot(x='sex', hue='survived', data=data_dis)

# Output the summary and structure
print("\nSummary of final data:")
print(data_dis.describe(include='all'))
print("\nStructure of final data:")
print(data_dis.info())


In [None]:
# Creating a contingency table and heatmap for each pair of categorical variables
for col1 in data_dis.columns:
    for col2 in data_dis.columns:
        if col1 > col2:
            contingency_table = pd.crosstab(data_dis[col1], data_dis[col2])
            print(contingency_table)
            print("-------------------")
            #sns.heatmap(contingency_table, annot=True, fmt="d", cmap="YlGnBu")
            #plt.title(f'Contingency Table between {col1} and {col2}')
            #plt.show()


In [None]:
#Print all heat maps
fig, axes = plt.subplots(nrows=len(data_dis.columns)*(len(data_dis.columns)-1)//2, ncols=3, figsize=(18, 6*len(data_dis.columns)))

row_index = 0  # Initialize row index for subplots
for col1 in data_dis.columns:
    for col2 in data_dis.columns:
        if col1 > col2:  # To avoid repeated pairings and self-pairings
            # Compute the contingency table
            contingency_table = pd.crosstab(data_dis[col1], data_dis[col2])

            # Compute row proportions (margin=1 in R's prop.table)
            row_proportions = pd.crosstab(data_dis[col1], data_dis[col2], normalize='index') * 100

            # Compute column proportions (margin=2 in R's prop.table)
            column_proportions = pd.crosstab(data_dis[col1], data_dis[col2], normalize='columns') * 100

            # Plotting the raw count contingency table
            ax = axes[row_index, 0]
            sns.heatmap(contingency_table, annot=True, fmt="d", cmap="YlGnBu", ax=ax)
            ax.set_title(f'Counts: {col1} vs {col2}')
            ax.set_ylabel(col1)
            ax.set_xlabel(col2)

            # Plotting row proportions
            ax = axes[row_index, 1]
            sns.heatmap(row_proportions, annot=True, fmt=".2f", cmap="YlGnBu", ax=ax)
            ax.set_title(f'Row Proportions: {col1} vs {col2}')
            ax.set_ylabel(col1)
            ax.set_xlabel(col2)

            # Plotting column proportions
            ax = axes[row_index, 2]
            sns.heatmap(column_proportions, annot=True, fmt=".2f", cmap="YlGnBu", ax=ax)
            ax.set_title(f'Column Proportions: {col1} vs {col2}')
            ax.set_ylabel(col1)
            ax.set_xlabel(col2)

            row_index += 1  # Increment row index for the next set of plots

plt.tight_layout()
plt.show()


### Odds Ratio from a 2×2 Contingency Table

|              | Exposure = 0 (Female) | Exposure = 1 (Male) |
|--------------|-----------------------|----------------------|
| Outcome = 0  |           c           |          d           |
| Outcome = 1  |           a           |          b           |

$$
\text{OR} = \frac{b \cdot c}{a \cdot d}
$$
The odds of survival for males compared to females.

In [None]:
# Manual computation of the OR
def odds_ratio(table2x2):
    """Calculate the odds ratio for a 2x2 contingency table."""
    # Extract values from the table
    c = table2x2.iloc[0, 0]
    d = table2x2.iloc[0, 1]
    a = table2x2.iloc[1, 0]
    b = table2x2.iloc[1, 1]

    # Calculate the odds ratio
    return (b * c)/(a * d)


In [None]:
# 2x2 Contingency table
table_sex = pd.crosstab(data_dis['survived'], data_dis['sex'])

# Display the contingency table
print("Contingency Table for Survival vs Sex:")
print(table_sex)

# Calculate and print the odds ratio
or_value = odds_ratio(table_sex)
print(f"Odds Ratio: {or_value:.4f}")

In [None]:
from statsmodels.stats.contingency_tables import Table2x2

# Convert the crosstab to a Table2x2 object
table_data = table_sex.values  # Extract the values from the crosstab
table = Table2x2(table_data)

# Calculate odds ratio and confidence interval
# (stored as or_table so the odds_ratio() function defined above is not overwritten)
or_table = table.oddsratio
confidence_interval = table.oddsratio_confint()

print(f"Odds Ratio: {or_table}")
print(f"Confidence Interval: {confidence_interval}")

The odds ratio of survival (males vs. females) is well below 1. Taking its reciprocal, males had about 12× higher odds of dying than females.


### Task
Delta Method (https://en.wikipedia.org/wiki/Delta_method): Roughly, if there is a sequence of random variables $X_n$ satisfying

$$
\sqrt{n}(X_n - \theta) \xrightarrow{D} \mathcal{N}(0, \sigma^2),
$$

where $\theta$ and $\sigma^2$ are finite-valued constants and $\xrightarrow{D}$ denotes **convergence in distribution**, then for any function $g$ that is differentiable at $\theta$ with non-zero derivative $g'(\theta)$:

$$
\sqrt{n}(g(X_n) - g(\theta)) \xrightarrow{D} \mathcal{N}\left(0, \sigma^2 \cdot [g'(\theta)]^2\right).
$$

This result is useful for approximating the distribution of smooth (nonlinear) transformations of asymptotically normal estimators.
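A short simulation can make the statement concrete. The sketch below is an illustration under assumed settings (exponential data with $\theta = 1$, $\sigma^2 = 1$ and $g = \log$, chosen only because $g'(\theta) = 1$); it is not part of the exercise data.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 400          # sample size behind each mean
reps = 20_000    # number of simulated sample means

# The mean of n Exponential(1) draws follows Gamma(shape=n, scale=1/n),
# so theta = 1 and sigma^2 = 1 in the notation above.
x_bar = rng.gamma(shape=n, scale=1.0 / n, size=reps)

# g = log, hence g(theta) = log(1) = 0 and g'(theta) = 1
z = np.sqrt(n) * (np.log(x_bar) - 0.0)

sd_simulated = float(z.std(ddof=1))
sd_delta = 1.0   # sqrt(sigma^2 * g'(theta)^2)
print(f"simulated sd: {sd_simulated:.3f}, delta-method sd: {sd_delta:.3f}")
```

The simulated standard deviation of $\sqrt{n}\,(g(X_n) - g(\theta))$ should be close to the delta-method value, with a small discrepancy from the finite sample size and Monte Carlo noise.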

* Assume that the cell counts are independent and Poisson distributed. Using the delta method, derive the formula behind the code `se_log_or = np.sqrt(1/a + 1/b + 1/c + 1/d)`.

* Assume logistic regression and let

  - $Y_i \in \{0, 1\}$ be the binary outcome
  - $X_i \in \{0, 1\}$ be a binary predictor
  - $\text{logit}(p_i) = \beta_0 + \beta_1 X_i$

  So:

  - $p_i = \mathbb{P}(Y_i = 1 | X_i)$

  We organize the data into a 2×2 contingency table:

| Group ($X$) | Successes ($Y=1$) | Failures ($Y=0$) | Total |
|-------------|-------------------|------------------|--------|
| $X=0$       | a               | b              | a+b  |
| $X=1$       | c               | d              | c+d  |

  and define:

  - $\pi_0 = \frac{e^{\beta_0}}{1 + e^{\beta_0}}$
  - $\pi_1 = \frac{e^{\beta_0 + \beta_1}}{1 + e^{\beta_0 + \beta_1}}$

  Then the log-likelihood is:

  $$
  \ell(\beta_0, \beta_1) =
  a \log(\pi_0) + b \log(1 - \pi_0) +
  c \log(\pi_1) + d \log(1 - \pi_1)
  $$

  Derive the Fisher Information matrix for a simple logistic regression with a **single binary predictor**, and show that the variance of the coefficient corresponds to:

$$
\text{SE}^2(\hat{\beta}_1) = \frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}
$$
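Before attempting the derivation, the claimed identity can be checked numerically. The sketch below uses assumed, illustrative counts and builds the Fisher information $X^T W X$ for the two groups, with weights $n_i \pi_i (1 - \pi_i)$ evaluated at the group proportions; it is a sanity check, not the derivation itself.

```python
import numpy as np

# Illustrative counts (assumed, not taken from the data)
a, b = 230, 80    # X = 0: successes, failures
c, d = 110, 470   # X = 1: successes, failures

# MLE success probabilities in the two groups
pi0 = a / (a + b)
pi1 = c / (c + d)

# Design matrix rows (intercept, X) for the two groups and the
# binomial weights n_i * pi_i * (1 - pi_i)
X = np.array([[1.0, 0.0],
              [1.0, 1.0]])
w = np.array([(a + b) * pi0 * (1 - pi0),
              (c + d) * pi1 * (1 - pi1)])

fisher_info = X.T @ (w[:, None] * X)   # X^T W X
cov_beta = np.linalg.inv(fisher_info)

var_beta1_fisher = cov_beta[1, 1]
var_beta1_cells = 1 / a + 1 / b + 1 / c + 1 / d
print(var_beta1_fisher, var_beta1_cells)
```

Both numbers agree up to floating-point error, which is what the analytical derivation should confirm.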

---


In [None]:
def combined_or_ci(data, formula='survived ~ sex'):
    """
    Computes odds ratio and 95% confidence intervals using:
    - Contingency table approximation
    - Empirical delta method (cell-based)
    - GLM model coefficient (Fisher Information)
    - Exact Delta method matching Fisher Information
    """

    # Parse the formula (assumes a single binary predictor)
    target, predictor = [s.strip() for s in formula.split('~')]

    # Use a 0/1 integer response so that the GLM models P(target = 1);
    # a categorical response would be dummy-coded and P(target = 0) modelled instead
    data = data.copy()
    data[target] = data[target].astype(int)

    # Fit logistic regression
    model = smf.glm(formula=formula, data=data, family=sm.families.Binomial()).fit()

    # Extract predictor term name (assumes 1 binary predictor)
    term = [t for t in model.params.index if t != 'Intercept'][0]
    beta = model.params[term]
    se_beta = model.bse[term]
    z = norm.ppf(0.975)

    # OR and CI from regression coefficient (Fisher Information)
    or_model = np.exp(beta)
    ci_model = np.exp([beta - z * se_beta, beta + z * se_beta])

    # Build contingency table from data
    table = pd.crosstab(data[target], data[predictor])

    # Assume format: rows=[0 (failure), 1 (success)], columns=[baseline, group]
    a = table.iloc[1, 0]  # successes baseline
    b = table.iloc[0, 0]  # failures baseline
    c = table.iloc[1, 1]  # successes group
    d = table.iloc[0, 1]  # failures group

    # OR of group vs baseline, (c/d) / (a/b), matching exp(beta) from the model
    or_tab = (b * c) / (a * d)
    log_or_tab = np.log(or_tab)

    # --- Approximate SE from table
    se_table_approx = np.sqrt(1/a + 1/b + 1/c + 1/d)
    ci_table_approx = np.exp([log_or_tab - z * se_table_approx, log_or_tab + z * se_table_approx])

    # --- Empirical delta method (cell-based)
    def binomial_var_log(n, x):
        if x == 0 or n == 0:
            return np.nan
        p = x / n
        return (1 - p) / (n * p)

    row0 = a + b
    row1 = c + d

    var_log_a = binomial_var_log(row0, a)
    var_log_b = binomial_var_log(row0, b)
    var_log_c = binomial_var_log(row1, c)
    var_log_d = binomial_var_log(row1, d)

    se_empirical_delta = np.sqrt(var_log_a + var_log_b + var_log_c + var_log_d)
    ci_empirical_delta = np.exp([log_or_tab - z * se_empirical_delta, log_or_tab + z * se_empirical_delta])

    # --- Exact Delta method matching Fisher Information
    # Group-level predictions (MLE probabilities)
    p0_hat = a / (a + b)
    p1_hat = c / (c + d)

    var_exact_delta = (1 / ((a + b) * p0_hat * (1 - p0_hat)) +
                       1 / ((c + d) * p1_hat * (1 - p1_hat)))

    se_exact_delta = np.sqrt(var_exact_delta)
    ci_exact_delta = np.exp([log_or_tab - z * se_exact_delta, log_or_tab + z * se_exact_delta])

    return {
        'odds_ratio_model': or_model,
        'ci_model_fisher': ci_model,
        'ci_table_approx': ci_table_approx,
        'ci_empirical_delta': ci_empirical_delta,
        'ci_exact_delta_fisher': ci_exact_delta,
        'se_model_fisher': se_beta,
        'se_table_approx': se_table_approx,
        'se_empirical_delta': se_empirical_delta,
        'se_exact_delta_fisher': se_exact_delta
    }


In [None]:
res = combined_or_ci(data_dis)

print(f"Odds Ratio (model) = {res['odds_ratio_model']:.4f}")
print(f"95% CI from GLM Fisher SE = [{res['ci_model_fisher'][0]:.4f}, {res['ci_model_fisher'][1]:.4f}]")
print(f"95% CI Table Approx = [{res['ci_table_approx'][0]:.4f}, {res['ci_table_approx'][1]:.4f}]")
print(f"95% CI Empirical Delta = [{res['ci_empirical_delta'][0]:.4f}, {res['ci_empirical_delta'][1]:.4f}]")
print(f"95% CI Exact Delta (matches Fisher) = [{res['ci_exact_delta_fisher'][0]:.4f}, {res['ci_exact_delta_fisher'][1]:.4f}]")


Chi-squared test of independence:
* Null hypothesis (H₀): The variables sex and survived are independent (i.e., survival does not depend on gender).
* Alternative hypothesis (H₁): The variables are dependent (i.e., survival does depend on gender).

In [None]:
table_sex

In [None]:
from scipy.stats import chi2_contingency

# Perform the chi-squared test
chi2, p_value, dof, expected = chi2_contingency(table_sex)

print(f"Chi-squared Test Statistic: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Degrees of Freedom: {dof}")
print("Expected Frequencies:")
print(expected)
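The quantities reported by `chi2_contingency` can be reproduced by hand. The sketch below uses assumed, illustrative counts rather than `table_sex`, and computes the Pearson statistic both without and with Yates' continuity correction, since `chi2_contingency` applies that correction by default to 2×2 tables (`correction=True`). The helper `chi2_sf_df1` is a hypothetical stand-in for `scipy.stats.chi2.sf(x, 1)`.

```python
import math
import numpy as np

# Illustrative 2x2 counts (assumed): rows = outcome 0/1, columns = two groups
observed = np.array([[80.0, 470.0],
                     [230.0, 110.0]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n_total = observed.sum()

# Expected counts under independence: E_ij = (row_i total * column_j total) / N
expected = row_totals @ col_totals / n_total

# Pearson statistic without and with Yates' continuity correction
chi2_plain = float(((observed - expected) ** 2 / expected).sum())
chi2_yates = float(((np.abs(observed - expected) - 0.5) ** 2 / expected).sum())

def chi2_sf_df1(x):
    """Upper tail of the chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

print(f"plain: {chi2_plain:.2f}, Yates: {chi2_yates:.2f}, "
      f"p (Yates): {chi2_sf_df1(chi2_yates):.3g}")
```

The corrected statistic is always slightly smaller than the uncorrected one, so the default test is a little more conservative.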


### Null model

* Compute the null model (assume that the probability of survival was the same for all passengers).

* How do we interpret the estimated parameter?

In [None]:
# Convert 'survived' to integer for statsmodels compatibility
#data_dis['survived'] = data_dis['survived'].astype(int)

# Adding a constant term for the intercept
#data_dis['intercept'] = 1

# Define the model
#model = sm.GLM(data_dis['survived'], data_dis[['intercept']], family=sm.families.Binomial(link=sm.genmod.families.links.logit()))

data_dis['survived'] = data_dis['survived'].astype(int)
# Define the model
model = smf.glm(formula='survived ~ 1', family=sm.families.Binomial(link=sm.genmod.families.links.Logit()), data = data_dis)

# Fit the model
mod0 = model.fit()

# Print the summary of the model
print(mod0.summary())

In [None]:
# The odds of survival according to the training data.
# (exp(intercept) of the null model is an odds, not an odds ratio)
odds = np.exp(mod0.params)
print("Odds:")
print(odds)

In [None]:
# The probability of survival.
probabilities = odds / (1 + odds)
print("Probabilities:")
print(probabilities)

In [None]:
data_dis.dtypes

In [None]:
data_dis['survived'].value_counts(normalize=True)


In [None]:
np.mean(data_dis['survived'].astype(int))
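The agreement between the back-transformed intercept and the sample mean is not a coincidence: in the null model the MLE of the intercept is the logit of the observed proportion. A minimal sketch with assumed, illustrative counts compares the closed form with a few Newton-Raphson steps on the log-likelihood.

```python
import math

# Illustrative counts (assumed): 340 survivors among 890 passengers
n_success, n_total = 340, 890
y_bar = n_success / n_total

# Closed-form MLE of the intercept in the null model
beta0_closed = math.log(y_bar / (1 - y_bar))

# The same value from Newton-Raphson on the log-likelihood
beta0 = 0.0
for _ in range(25):
    pi = 1 / (1 + math.exp(-beta0))
    score = n_success - n_total * pi   # derivative of the log-likelihood
    info = n_total * pi * (1 - pi)     # Fisher information
    beta0 += score / info

print(f"closed form: {beta0_closed:.6f}, Newton: {beta0:.6f}")
print(f"back-transformed: {1 / (1 + math.exp(-beta0)):.6f} vs mean {y_bar:.6f}")
```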

### Model with variable: sex

* Compute the model with one covariate sex.

* How can we interpret estimated coefficients?

* Did survival depend on gender (`sex`) ?

* Perform appropriate tests.

* Did women have a better chance of survival?


In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit the logistic regression model where 'survived' is the response and 'sex' is the predictor
data_dis['survived'] = data_dis['survived'].astype(int)
data_dis['sex'] = data_dis['sex'].astype('category')

mod_sex = smf.glm(formula='survived ~ sex', data=data_dis, family=sm.families.Binomial()).fit()

# Display the summary of the model
print(mod_sex.summary())


In [None]:
# Calculate the odds ratios for the coefficients
odds_ratios = np.exp(mod_sex.params)
print("Odds Ratios:")
print(odds_ratios)

In [None]:
# Extract coefficient and standard error for 'sex'
beta = mod_sex.params['sex[T.male]']
se = mod_sex.bse['sex[T.male]']

# Compute 95% CI on log-odds scale
z = norm.ppf(0.975)  # z-score for 95% CI
ci_lower_logit = beta - z * se
ci_upper_logit = beta + z * se

# Convert to OR scale
or_estimate = np.exp(beta)
ci_lower_or = np.exp(ci_lower_logit)
ci_upper_or = np.exp(ci_upper_logit)

# Display results
print(f"Odds Ratio (sex = male vs female): {or_estimate:.4f}")
print(f"95% CI for OR: [{ci_lower_or:.4f}, {ci_upper_or:.4f}]")

In [None]:
# Please note that this cell may not work in environments other than Google Colab
!pip install wget
import wget
url = "https://github.com/francji1/01ZLMA/raw/main/code/helpers.py"
wget.download(url, '../content/helpers.py')  # path where Colab can find libraries

In [None]:
from helpers import Anova

anova = Anova()

anova(mod0, mod_sex, test='chisq')

In [None]:
anova(mod_sex, test="Chisq")

In [None]:
import scipy.stats as stats

# Compute the likelihood ratio test statistic
lr_stat = 2 * (mod_sex.llf - mod0.llf)  # Twice the difference in log-likelihoods
lr_df = mod_sex.df_model - mod0.df_model  # Difference in degrees of freedom

lr_p_value = stats.chi2.sf(lr_stat, lr_df)

print(f"Likelihood Ratio Statistic: {lr_stat:.4f}")
print(f"Degrees of Freedom: {lr_df}")
print(f"P-value: {lr_p_value:.4f}")
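With a single binary predictor, the fitted probabilities equal the group proportions, so the likelihood ratio statistic can also be reproduced from a 2×2 table alone. The sketch below uses assumed, illustrative counts and the closed-form chi-squared tail for one degree of freedom; the names `groups` and `binom_loglik` are hypothetical.

```python
import math

# Illustrative counts (assumed): (successes, failures) in two groups
groups = {"female": (230, 80), "male": (110, 470)}

def binom_loglik(successes, failures, pi):
    """Bernoulli log-likelihood of a group with common success probability pi."""
    return successes * math.log(pi) + failures * math.log(1 - pi)

total_s = sum(s for s, f in groups.values())
total_f = sum(f for s, f in groups.values())

# Null model: one common probability; sex model: one probability per group
pi_null = total_s / (total_s + total_f)
llf_null = binom_loglik(total_s, total_f, pi_null)
llf_sex = sum(binom_loglik(s, f, s / (s + f)) for s, f in groups.values())

lr_stat = 2 * (llf_sex - llf_null)
p_value = math.erfc(math.sqrt(lr_stat / 2))   # chi-squared tail, 1 df
print(f"LR statistic: {lr_stat:.2f}, p-value: {p_value:.3g}")
```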

In [None]:
from scipy.stats import norm

def OR_coef(variable, model, CI):
    """
    Estimate the odds ratio for a variable with the lower and upper limits of its
    Wald confidence interval; `CI` is the confidence level (e.g. 0.95).
    """
    # Extract parameter estimate and standard error
    beta = model.params[variable]
    se = model.bse[variable]

    # Compute the odds ratio
    or_value = np.exp(beta)

    # Compute the confidence interval
    ci_bounds = norm.ppf(0.5 + CI/2) * se
    ci_lower = np.exp(beta - ci_bounds)
    ci_upper = np.exp(beta + ci_bounds)

    return pd.DataFrame({
        'OR': [or_value],
        'LCL': [ci_lower],
        'UCL': [ci_upper]
    }, index=[variable])

or_estimate = OR_coef('sex[T.male]', mod_sex, 0.95)
print(or_estimate)


Compare with `table.oddsratio_confint()` from the `Table2x2` object above.

### Your turn:

Estimate a model with one covariate `classes` and answer:

1. Did survival depend on class (`classes`)?

2. Perform appropriate tests.

3. Compute odds ratios between classes.

4. Did passengers in second class have a better chance of survival than those in third class?


### Model with all discrete covariates without interactions

In [None]:
# Simple Logistic Regression model with all discrete covariates without interactions

# Fit the logistic regression model
#predictors = data_dis.columns.drop(['survived', 'intercept'])  # Exclude the response variable
predictors = data_dis.columns.drop(['survived'])  # Exclude the response variable

data_dis['survived'] = data_dis['survived'].astype(int)
mod1 = smf.glm(formula='survived ~ ' + ' + '.join(predictors), data=data_dis, family=sm.families.Binomial()).fit()

# Print the summary of the model
print(mod1.summary())


Deviance tests to add/drop independent variables. (These R functions have no direct equivalent in Python.)

`drop1(model,test="Chisq")`

`add1(model,terms.to.add,test="Chisq")`

`drop1(mod1,test="Chisq")`

`add1(mod0,survived~sex+class+embarked, test="Chisq")`

In [None]:
# Reset categories to only those that are present after filtering
data_dis['embarked'] = data_dis['embarked'].cat.set_categories(['C', 'Q', 'S'])

# Optionally, you can remove unused categories directly
data_dis['embarked'] = data_dis['embarked'].cat.remove_unused_categories()

# Check the categories after resetting
print("Categories after filtering and resetting:", data_dis['embarked'].cat.categories)



In [None]:
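data_dis2 = data_dis.copy()
# Merge ports C and Q into one level; going through str leaves no stale 'C' category
data_dis2['embarked'] = data_dis2['embarked'].astype(str).replace('C', 'Q').astype('category')
print(data_dis2['embarked'].value_counts())
print(data_dis2.info())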

In [None]:
predictors = '+'.join(data_dis2.drop('survived', axis=1).columns)
formula = f'survived ~ {predictors}'
mod1b = smf.glm(formula=formula, data=data_dis2, family=sm.families.Binomial()).fit()
print(mod1b.summary())


In [None]:
or_estimate_sex_1 = OR_coef('sex[T.male]', mod1b, 0.95)
print(or_estimate_sex_1)

Interpret previous result:

* By what percentage are the odds of survival lower for men?

* Interpret the confidence interval and its significance.
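The arithmetic behind both questions is short. The sketch below uses hypothetical numbers, not the output of `OR_coef`, and the two helper names are illustrative.

```python
def percent_change_in_odds(odds_ratio):
    """Percentage change in the odds implied by an odds ratio."""
    return (odds_ratio - 1.0) * 100.0

def ci_excludes_one(lower, upper):
    """A CI for an OR indicates significance at its level when it excludes 1."""
    return not (lower <= 1.0 <= upper)

# Hypothetical values for illustration, not results from the model above
or_hat, lcl, ucl = 0.08, 0.06, 0.11
print(f"change in odds: {percent_change_in_odds(or_hat):.1f} %")
print(f"CI excludes 1: {ci_excludes_one(lcl, ucl)}")
```

Note that the percentage refers to the odds, not to the probability of survival.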


Let's try a model with second-order interactions.


In [None]:
def create_interaction_formula(data, response_var):
    predictors = data.columns.drop(response_var)
    formula = f"{response_var} ~ " + " + ".join(predictors)
    interaction_terms = []

    # Generate all pairwise interaction terms
    for i in range(len(predictors)):
        for j in range(i + 1, len(predictors)):
            interaction_terms.append(f"{predictors[i]}:{predictors[j]}")

    # Add interaction terms to the formula
    formula += " + " + " + ".join(interaction_terms)
    return formula


In [None]:

# Assuming 'survived' is the response variable and data_dis is preprocessed
formula_2nd = create_interaction_formula(data_dis, 'survived')
print("Generated Formula:", formula_2nd)

In [None]:
# Fit the logistic regression model with all pairwise interactions
mod2_all = smf.glm(formula=formula_2nd, data=data_dis, family=sm.families.Binomial()).fit()

# Print the summary of the model
print(mod2_all.summary())


In [None]:
mod2 = smf.glm(formula='survived ~ classes + sex + embarked + classes:sex + sex:embarked', data=data_dis, family=sm.families.Binomial()).fit()
print(mod2.summary())


In [None]:
anova(mod2_all, mod2, test='chisq')

Interpretation by OR in models with interactions is more complicated; see the lecture notes.
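The main reason is that a single coefficient no longer describes the effect of a factor. As a sketch, consider a model with only a `classes:sex` interaction: the odds ratio male vs. female then differs by class and is obtained by adding the interaction coefficient to the main effect on the log-odds scale. The coefficient values below are hypothetical and serve only to show the arithmetic; the term names follow the usual formula-API naming and are assumed.

```python
import math

# Hypothetical coefficients on the log-odds scale (illustration only)
coef = {
    "sex[T.male]": -3.4,
    "classes[T.2]:sex[T.male]": -0.2,
    "classes[T.3]:sex[T.male]": 2.1,
}

def or_male_vs_female(pclass):
    """Odds ratio male vs female within a given class (class 1 is the baseline)."""
    log_or = coef["sex[T.male]"]
    if pclass != 1:
        log_or += coef[f"classes[T.{pclass}]:sex[T.male]"]
    return math.exp(log_or)

for pclass in (1, 2, 3):
    print(f"class {pclass}: OR male vs female = {or_male_vs_female(pclass):.4f}")
```

In a model such as `mod2`, which also contains `sex:embarked`, the same odds ratio would additionally depend on the port of embarkation.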

### Task
Let's try a model with merged factor levels.




## Model with continuous independent variable.


Discuss the difference from models without a continuous variable (again)!

In [None]:
# Selecting specific columns
data_con = titanic_train[['Survived', 'Pclass', 'Sex', 'Embarked', 'Age', 'Fare']].copy()

# Drop rows with any missing values
data_con.dropna(inplace=True)

# Filter rows where 'Embarked' is one of 'C', 'Q', 'S'
data_con = data_con[data_con['Embarked'].isin(['C', 'Q', 'S'])]

# Transform and create new columns as needed
data_con['survived'] = data_con['Survived'].astype('category')
data_con['class'] = data_con['Pclass'].astype('category')
data_con['sex'] = data_con['Sex'].astype('category')  # string column converted to categorical
data_con['embarked'] = data_con['Embarked'].astype('category')  # string column converted to categorical
data_con['age'] = data_con['Age']  # Rename Age to age
data_con['fare'] = data_con['Fare']  # Rename Fare to fare

# Drop original columns if not needed, ensure only transformed/new columns are retained
data_con = data_con[['survived', 'class', 'sex', 'embarked', 'age', 'fare']]

In [None]:
# Print basic statistics and info about the DataFrame
print(data_con.describe(include='all'))  # Basic stats for numeric and categorical data
print(data_con.info())  # Info on DataFrame structure, similar to str() in R




In [None]:
import statsmodels.api as sm

# Using DescrStatsW for 'fare'
desc_stats = sm.stats.DescrStatsW(data_con['fare'])

# You can access various summary statistics directly:
print("Mean:", desc_stats.mean)
print("Variance:", desc_stats.var)
print("Standard Deviation:", desc_stats.std)
print("Confidence Interval of Mean:", desc_stats.tconfint_mean())


In [None]:
data_pair_plot = data_con[['survived', 'age', 'fare', 'class']]

# Create a pair plot
sns.pairplot(data_pair_plot, hue='survived', diag_kind='kde', kind='scatter', palette='viridis')
plt.suptitle('Pair Plot of Variables', y=1.02)  # Adjust title and its position
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
ax = sns.boxplot(x='sex', y='age', hue='survived', data=data_con, palette='coolwarm')
sns.pointplot(x='sex', y='age', hue='survived', data=data_con, dodge=True, join=False, markers='D', capsize=0.1, estimator=np.mean)

# Add enhancements
plt.title('Gender Boxplot by Age')
plt.xlabel('Gender')
plt.ylabel('Age')
plt.legend(title='Survived')
plt.grid(True)
plt.show()

In [None]:
# Create a box plot with mean points
plt.figure(figsize=(10, 6))
ax = sns.boxplot(x='class', y='fare', hue='survived', data=data_con, palette='Set2')
sns.pointplot(x='class', y='fare', hue='survived', data=data_con, dodge=True, join=False,  markers='o', capsize=0.1, estimator=np.mean)

# Add enhancements
plt.title('Class vs Fare Boxplot')
plt.xlabel('Class')
plt.ylabel('Fare')
plt.legend(title='Survived')
plt.grid(True)
plt.show()


Continuous variable as factor

In [None]:
data_con_fac = data_con.copy()

# Categorize 'age' using pandas.cut
data_con_fac['age'] = pd.cut(data_con['age'],
                             bins=[-float('inf'), 15, 50, float('inf')],
                             labels=['child', 'adult', 'senior'])
data_con_fac['survived'] = data_con_fac['survived'].astype(int)


In [None]:
mod_0 = smf.glm(formula='survived ~ 1', data=data_con_fac,
                family=sm.families.Binomial()).fit()
print("Baseline Model Summary:")
print(mod_0.summary())

In [None]:
mod_age_fac = smf.glm(formula='survived ~ age', data=data_con_fac,
                      family=sm.families.Binomial()).fit()
print("Model with Age Factor Summary:")
print(mod_age_fac.summary())

In [None]:
# Calculate the exponentiated coefficients for odds ratios
odds_ratios = np.exp(mod_age_fac.params)
print("Odds Ratios:")
print(odds_ratios)

Do the odds of survival decrease with increasing age?

In [None]:
anova(mod_age_fac,mod_0,test="Chisq")

Age as continuous variable

In [None]:
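# The model below is fitted on data_con, so its response must be a 0/1 integer;
# a categorical response would be dummy-coded and P(survived = 0) modelled instead
data_con['survived'] = data_con['survived'].astype(int)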

mod_age = smf.glm(formula='survived ~ I(age / 10)', data=data_con,
                  family=sm.families.Binomial()).fit()
print("Model Summary with Age Scaled by Decades:")
print(mod_age.summary())

odds_ratios = np.exp(mod_age.params)
print("Odds Ratios for Age Increase by Decade:")
print(odds_ratios)

Question:

* With each 10-year increase in age, the odds of survival decreased by about 11 %.

* What do you think about causality in this result?
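The link between a per-year coefficient and the per-decade odds ratio is a simple rescaling on the log-odds scale. The sketch below uses a hypothetical per-year coefficient chosen for illustration; it is not the fitted value from `mod_age`.

```python
import math

# Hypothetical per-year coefficient on the log-odds scale (illustration only)
beta_per_year = -0.0117

# Rescaling age by 10 multiplies the coefficient by 10
beta_per_decade = 10 * beta_per_year   # what `I(age / 10)` would estimate
or_per_decade = math.exp(beta_per_decade)
percent_change = (or_per_decade - 1) * 100

print(f"OR per 10 years: {or_per_decade:.4f} "
      f"({percent_change:.1f} % change in odds)")

# Odds ratios multiply: 20 years corresponds to the squared 10-year OR
or_per_20_years = or_per_decade ** 2
print(f"OR per 20 years: {or_per_20_years:.4f}")
```

Because the effect is multiplicative on the odds, the percentage change over 20 years is not simply twice the change over 10 years.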

In [None]:
anova(mod_age,mod_0,test="Chisq")

Question:

* Can we compare the models `mod_age` and `mod_age_fac` by a deviance test?
* Which model do you prefer and why?
* For which approach (factorized or continuous) is the saturated model useful and why?


In [None]:
#mod_sat_fac <- glm(survived ~ sex*age*embarked*class, family = binomial,data = (data_con %>% mutate(age= as.factor(age), fare = as.factor(fare)) ))
#summary(mod_sat_fac)

In [None]:
# It is useful for the factorized model, but with the continuous one we can answer questions that the factorized model cannot (e.g. the difference between a 25-year-old and a 35-year-old)

Your turn:

Consider a model with continuous variables `age`, `fare`, and any factor variable.

* Create a factor `child`, which takes values 1 (child) and 0 (adult).
* Create a factor from the variable `fare`, where each level spans 10-20 pounds, with a maximum of 140.
* Estimate a model where the odds of survival depend on the factorized `fare`, `sex`, and `child`.
* By what percentage are the odds of survival lower for an adult compared to a child?
* Does the probability of survival depend on fare? Test it.
* Assume that the odds of survival increase with exponentially increasing fare. How did the odds of survival change if the person spent an extra 10 pounds on a ticket?
* Build a model where the probability of survival depends on both `age` and `fare`. Are both covariates significant?

* We use *data_con*, where the variables *fare* and *age* are continuous and *sex* is a factor.

