Found from here:
https://www.kaggle.com/code/fehiepsi/bayesian-imputation-for-age/notebook

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In this notebook, we will use logistic regression to predict `Survived` using the `Age` variable, whose missing values are imputed within the model. For simplicity, I'll skip the EDA part (which has been done nicely in many other popular kernels). I'll use [NumPyro](https://github.com/pyro-ppl/numpyro) for modelling, sampling, and making predictions.

In [2]:
#!pip install numpyro

In [3]:
from jax import ops, random
from jax.scipy.special import expit

import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS, Predictive

### prepare data

After loading the data, I noticed that the `Age` column has many missing values. We know (by intuition or from other kernels) that `Age` is correlated with the title in the name: e.g. those with `Mrs.` tend to be older than those with `Miss.`. Let's make a new column `Title` for that purpose.
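
To make the title idea concrete, here is a minimal sketch of the extraction on three names. The helper name `extract_title` and the third (made-up) name are my own assumptions for illustration; they are not part of the original notebook.

```python
import pandas as pd


def extract_title(names: pd.Series) -> pd.Series:
    """Take the first word after ', ' in each name; map rare titles to 'Misc.'."""
    common = ["Mr.", "Miss.", "Mrs.", "Master."]
    title = names.str.split(", ").str.get(1).str.split(" ").str.get(0)
    # Keep the title where it is one of the common ones, otherwise use "Misc."
    return title.where(title.isin(common), "Misc.")


names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Doe, Dr. John",  # hypothetical name, only here to show the fallback
])
titles = extract_title(names)
print(titles.tolist())
```

Grouping the rare titles into one `Misc.` level keeps the number of per-title parameters small.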

In [4]:
train_df = pd.read_csv(
    "https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv"
)
train_df.info()
train_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


In [5]:
#train_df = pd.read_csv("../input/titanic/train.csv")
d = train_df.copy()
d["Embarked"] = d.Embarked.fillna("S")  # fill the 2 missing values with the mode "S"
d["Title"] = d.Name.str.split(", ").str.get(1).str.split(" ").str.get(0).apply(
    lambda x: x if x in ["Mr.", "Miss.", "Mrs.", "Master."] else "Misc.")
title_cat = pd.CategoricalDtype(categories=["Mr.", "Miss.", "Mrs.", "Master.", "Misc."], ordered=True)
age_mean, age_std = d.Age.mean(), d.Age.std()
embarked_cat = pd.CategoricalDtype(categories=["S", "C", "Q"], ordered=True)
data = dict(age=d.Age.pipe(lambda x: (x - age_mean) / age_std).values,
            pclass=d.Pclass.values - 1,
            title=d.Title.astype(title_cat).cat.codes.values,
            sex=(d.Sex == "male").astype(int).values,
            sibsp=d.SibSp.clip(0, 1).values,
            parch=d.Parch.clip(0, 2).values,
            embarked=d.Embarked.astype(embarked_cat).cat.codes.values,
            survived=d.Survived.values)

Note that I don't use other features such as `Fare` or `Cabin`, for simplicity. I also don't do much feature engineering for the same reason.
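
Two details of the preparation step are easy to miss: standardizing `Age` leaves the missing entries as NaN (so the model can still find them), and the explicit category order fixes which integer code each level gets. A small sketch on a toy frame, with made-up values, shows both:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from the Titanic set)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0],
    "Embarked": ["S", "Q", "C"],
})

# pandas skips NaN in mean() and std(), so the statistics come from the
# observed ages only and the missing entry stays NaN after standardizing.
age_mean, age_std = toy.Age.mean(), toy.Age.std()
age_z = ((toy.Age - age_mean) / age_std).values

# An explicit category order pins the integer code of every level.
embarked_cat = pd.CategoricalDtype(categories=["S", "C", "Q"], ordered=True)
embarked_codes = toy.Embarked.astype(embarked_cat).cat.codes.values

print(age_z)
print(embarked_codes)
```

Reusing the same `CategoricalDtype` objects and the same `age_mean`/`age_std` on the test set keeps the codes and the scale consistent between training and prediction.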

### modelling

If you are not familiar with NumPyro, take a look at [its documentation](https://github.com/pyro-ppl/numpyro#numpyro), which includes tutorials, examples, and translated code for the *Statistical Rethinking* book (a good reference IMO if you are new to Bayesian methods).

In [10]:
def model(age, pclass, title, sex, sibsp, parch, embarked, survived=None):
    # create a coefficient vector for each of Pclass, Title, Sex, SibSp, Parch, Embarked
    b_pclass = numpyro.sample("b_Pclass", dist.Normal(0, 1), sample_shape=(3,))
    b_title = numpyro.sample("b_Title", dist.Normal(0, 1), sample_shape=(5,))
    b_sex = numpyro.sample("b_Sex", dist.Normal(0, 1), sample_shape=(2,))
    b_sibsp = numpyro.sample("b_SibSp", dist.Normal(0, 1), sample_shape=(2,))
    b_parch = numpyro.sample("b_Parch", dist.Normal(0, 1), sample_shape=(3,))
    b_embarked = numpyro.sample("b_Embarked", dist.Normal(0, 1), sample_shape=(3,))

    # impute Age by Title
    age_mu = numpyro.sample("age_mu", dist.Normal(0, 1), sample_shape=(5,))
    age_mu = age_mu[title]
    age_sigma = numpyro.sample("age_sigma", dist.Normal(0, 1), sample_shape=(5,))
    age_sigma = age_sigma[title]
    age_isnan = np.isnan(age)
    age_nanidx = np.nonzero(age_isnan)[0]
    age_impute = numpyro.sample("age_impute", dist.Normal(age_mu[age_nanidx], age_sigma[age_nanidx]))

    print('age.shape', age.shape)
    print('age_nanidx.shape', age_nanidx.shape)
    print('age_impute.shape', age_impute.shape)
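    # jax.ops.index_update was removed from recent JAX releases;
    # .at[...].set(...) is the replacement (local import keeps this cell self-contained)
    import jax.numpy as jnp
    age = jnp.asarray(age).at[age_nanidx].set(age_impute)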
    numpyro.sample("age", dist.Normal(age_mu, age_sigma), obs=age)

    a = numpyro.sample("a", dist.Normal(0, 1))
    b_age = numpyro.sample("b_Age", dist.Normal(0, 1))
    logits = a + b_age * age

    logits = logits + b_title[title] + b_pclass[pclass] + b_sex[sex] \
        + b_sibsp[sibsp] + b_parch[parch] + b_embarked[embarked]
    # for prediction, we will convert `logits` to `probs` and record that result
    if survived is None:
        probs = expit(logits)
        numpyro.sample("probs", dist.Delta(probs))
    numpyro.sample("survived", dist.Bernoulli(logits=logits), obs=survived)
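
The imputation step in the model overwrites the missing entries of `age` with sampled values before `age` is used as an observation and as a predictor. Here is a minimal sketch of just that indexing on a toy array, using the `.at[...].set(...)` syntax of JAX arrays. The numbers are made up, and `age_impute` is a plain array standing in for the sampled site.

```python
import numpy as np
import jax.numpy as jnp

# Toy standardized ages with two missing entries (hypothetical values)
age = np.array([0.5, np.nan, -1.2, np.nan])
age_nanidx = np.nonzero(np.isnan(age))[0]  # positions of the missing ages

# Stand-in for the values drawn at the "age_impute" sample site
age_impute = jnp.array([0.1, -0.3])

# JAX arrays are immutable, so .at[...].set(...) returns a new array
# with the missing positions filled in.
age_filled = jnp.asarray(age).at[age_nanidx].set(age_impute)
print(age_filled)
```

Because the filled array depends on the sampled values, the uncertainty about the missing ages flows through to the survival logits.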

### sampling

After making a model, sampling is pretty fast in NumPyro.

In [11]:
mcmc = MCMC(NUTS(model), num_warmup=1000, num_samples=1000)
#mcmc = MCMC(NUTS(model), 1000, 1000)
mcmc.run(random.PRNGKey(0), **data)
mcmc.print_summary()

age_nanidx.shape (177,)


sample: 100%|█| 2000/2000 [00:16<00:00, 117.79it/s, 63 steps of size 6.51e-02. a



                     mean       std    median      5.0%     95.0%     n_eff     r_hat
              a      0.12      0.83      0.12     -1.11      1.56   1144.71      1.00
  age_impute[0]      0.18      0.52      0.18     -0.63      1.04   2397.74      1.00
  age_impute[1]      0.09      0.53      0.10     -0.72      0.96   1635.02      1.00
  age_impute[2]      0.40      0.50      0.39     -0.44      1.19   3001.60      1.00
  age_impute[3]      0.21      0.53      0.21     -0.61      1.15   1490.23      1.00
  age_impute[4]     -0.57      0.59     -0.56     -1.56      0.32   1331.34      1.00
  age_impute[5]      0.20      0.55      0.19     -0.67      1.09   1770.44      1.00
  age_impute[6]      0.44      0.55      0.43     -0.56      1.26   1943.11      1.00
  age_impute[7]     -0.58      0.54     -0.57     -1.51      0.28   1847.12      1.00
  age_impute[8]      0.06      0.56      0.05     -0.84      1.01   2298.66      1.00
  age_impute[9]      0.21      0.53      0.21     -0.

As we can see, with a Bayesian approach we get uncertainty estimates for our results: e.g. for the imputed values and for the coefficients of being male or female (and you can make nice plots with them ;)
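
As a rough illustration of how such an uncertainty estimate comes out of posterior draws, the sketch below computes a mean and a 90% equal-tailed interval with NumPy. The draws are synthetic stand-ins, not output from the run above, and NumPyro's summary may compute its interval differently (I believe it reports a highest-density interval), so treat this as an approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for 1000 posterior draws of a single coefficient
draws = rng.normal(loc=0.2, scale=0.5, size=1000)

# 90% equal-tailed interval: cut 5% of the draws off each end
lower, upper = np.percentile(draws, [5, 95])
print(f"mean={draws.mean():.2f}, 90% interval=({lower:.2f}, {upper:.2f})")
```

With real results, `draws` would be one entry of `mcmc.get_samples()`, e.g. a single column of `age_impute`.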

### make predictions

To make predictions on new data, we will marginalize out the `age_impute` variables (in other words, remove them from the posterior samples) and use the remaining variables for predictions.
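
To show why the training-set imputations are dropped, here is a hand-rolled NumPy sketch of the idea under strong simplifications: a model with only an intercept and an age coefficient, one shared `age_mu`/`age_sigma` instead of one per title, and synthetic numbers in place of real posterior samples. It is not how `Predictive` is implemented, only the same logic spelled out.

```python
import numpy as np

rng = np.random.default_rng(1)
num_draws = 500

# Synthetic stand-in for posterior samples (hypothetical values, not MCMC output)
posterior = {
    "a": rng.normal(0.0, 0.1, num_draws),
    "b_Age": rng.normal(-0.5, 0.1, num_draws),
    "age_mu": rng.normal(0.0, 0.1, num_draws),
    "age_sigma": np.abs(rng.normal(1.0, 0.1, num_draws)),
    "age_impute": rng.normal(size=(num_draws, 3)),  # tied to training rows
}
# The training-set imputations say nothing about the new rows, so drop them.
posterior.pop("age_impute")

new_age = np.array([-1.0, np.nan, 1.5])  # one missing age in the new data
missing = np.isnan(new_age)

# For every posterior draw, draw the missing ages afresh from the age model.
age = np.tile(new_age, (num_draws, 1))
age[:, missing] = rng.normal(
    posterior["age_mu"][:, None],
    posterior["age_sigma"][:, None],
    size=(num_draws, missing.sum()),
)

# Survival probability per draw and per passenger, then average over draws.
logits = posterior["a"][:, None] + posterior["b_Age"][:, None] * age
probs = 1.0 / (1.0 + np.exp(-logits))
survived = (probs.mean(axis=0) >= 0.5).astype(np.uint8)
print(survived)
```

Averaging `probs` over the draw axis is what integrates out both the coefficients and the freshly imputed ages.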

In [12]:
test_df = pd.read_csv(
    "https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/test.csv"
)
test_df.info()
test_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


|   | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 |  | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 |  | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 |  | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 |  | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 |  | S |


In [13]:
#test_df = pd.read_csv("../input/titanic/test.csv")
d = test_df.copy()
d["Title"] = d.Name.str.split(", ").str.get(1).str.split(" ").str.get(0).apply(
    lambda x: x if x in ["Mr.", "Miss.", "Mrs.", "Master."] else "Misc.")
test_data = dict(age=d.Age.pipe(lambda x: (x - age_mean) / age_std).values,
                 pclass=d.Pclass.values - 1,
                 title=d.Title.astype(title_cat).cat.codes.values,
                 sex=(d.Sex == "male").astype(int).values,
                 sibsp=d.SibSp.clip(0, 1).values,
                 parch=d.Parch.clip(0, 2).values,
                 embarked=d.Embarked.astype(embarked_cat).cat.codes.values)

posterior = mcmc.get_samples().copy()
#print(posterior)
posterior.pop("age_impute")
predicted = Predictive(model, posterior)(random.PRNGKey(2), **test_data)
print(predicted)
survived_probs = predicted["probs"]
d["Survived"] = (survived_probs.mean(axis=0) >= 0.5).astype(np.uint8)
d[["PassengerId", "Survived"]].to_csv("submission.csv", index=False)

age_nanidx.shape (86,)
{'age': DeviceArray([[ 0.3304914 ,  1.190988  ,  2.2235837 , ...,  0.6058503 ,
              -0.33162794, -1.6635623 ],
             [ 0.3304914 ,  1.190988  ,  2.2235837 , ...,  0.6058503 ,
               0.70584744, -1.8773663 ],
             [ 0.3304914 ,  1.190988  ,  2.2235837 , ...,  0.6058503 ,
               0.9738562 , -1.4498053 ],
             ...,
             [ 0.3304914 ,  1.190988  ,  2.2235837 , ...,  0.6058503 ,
               0.18454845, -1.764892  ],
             [ 0.3304914 ,  1.190988  ,  2.2235837 , ...,  0.6058503 ,
              -0.9107513 , -1.2319694 ],
             [ 0.3304914 ,  1.190988  ,  2.2235837 , ...,  0.6058503 ,
               0.20563367, -1.3692285 ]], dtype=float32), 'age_impute': DeviceArray([[ 0.04010631,  0.44885987,  0.0212229 , ...,  0.54038113,
              -0.33162794, -1.6635623 ],
             [ 0.07183257,  1.3224967 , -0.4964431 , ..., -0.45723763,
               0.70584744, -1.8773663 ],
             [-1.7759464

Submitting the result gives me a score of about 79 (top 16%). That is great for a first attempt. :)

### further improvements

+ Using other features such as `Cabin` or `Fare`.
+ The above model assumes a linear relationship between `Survived` (on the logit scale) and the other variables. The result is intuitive but is not enough to beat tree-based models. We can build more complicated models or construct a [Bayesian neural network](http://pyro.ai/numpyro/bnn.html) model to capture more complex relationships.