# [Multilevel Modeling](https://en.wikipedia.org/wiki/Multilevel_model)
Idea: Decompose the sum of random numbers into its contributions

For a given set of $x_i$ and $a_{ijk}$ with
$$x_i = \sum_{j=0}^{n} \sum_{k=1}^{m_j} a_{ijk} y_{jk}$$
where
* $m_j$ is the number of contributors in layer $j$
* $y_{jk}$ is the $k$-th contribution of layer $j$
* $m_0 = 1$ by default
* $a_{ijk} \in \{0, 1\}$
* $\sum_{k=1}^{m_j} a_{ijk} = 1$, i.e. exactly one contributor per layer acts on each sample

estimate the distributions $y_{jk} \sim N(μ_{jk}, σ_{jk})$.

Boundary conditions:
* $\sum_{k=1}^{m_j} μ_{jk} = 0$ for each layer $j$, enforced via $μ_{j m_j} = -\sum_{k=1}^{m_j-1} μ_{jk}$
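
The decomposition above can be sketched in NumPy for a tiny case. The layer sizes, means and standard deviations are made-up illustration values, and the zero-sum condition is imposed by subtracting the layer mean (the same approach `align_coefficients` takes further down) rather than by fixing the last mean explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 6

# Hypothetical layers: layer 0 is the single baseline, layer 1 has three contributors.
mu = [np.array([0.0]), np.array([1.0, 2.0, 3.0])]
sigma = [np.array([5.0]), np.array([1.0, 1.0, 1.0])]

# Boundary condition: center each layer so that its means sum to zero.
mu = [m - m.mean() for m in mu]

x = np.zeros(n_samples)
for mu_j, sigma_j in zip(mu, sigma):
    m_j = len(mu_j)
    # a[i, k] = 1 if contributor k of this layer acts on sample i, else 0
    chosen = rng.integers(0, m_j, size=n_samples)
    a = np.eye(m_j, dtype=int)[chosen]
    # y[i, k]: a fresh draw from contributor k for sample i
    y = rng.normal(mu_j, sigma_j, size=(n_samples, m_j))
    # the indicator keeps exactly one contribution per layer
    x += (a * y).sum(axis=1)

print(x)
```

Each row of `a` contains exactly one 1, so every sample receives one contribution per layer.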


Goal of this workbook: gather the basics
* which package to use
* how to solve
* how to model

Answer: Use PyMC with nutpie as the sampler :-)

In [2]:
import numpy as np
import warnings
import matplotlib.pyplot as plt
import polars as pl

from sklearn.linear_model import BayesianRidge
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline

%matplotlib inline
plt.style.use("default")


## Initialize the data

In [3]:
n_samples = 10000

## Set up the individual random generators

In [4]:
def align_coefficients(coefs):
    """ensure that the average over all contributers per level is 0"""
    mem = {}
    for level, lcoefs in coefs.items():
        avg = np.mean([m for (m, s) in lcoefs.values()])
        mem[level] = {k: (m - avg, s) for k, (m, s) in lcoefs.items()}

    return mem


def gen_contributer_coefficients(n_contributers, contributer_avg, contributer_sig):
    """Create the mean and standard deviation for every contributer of each level.

    Input:
    * n_contributers: list with the number of contributers for each level
    * contributer_avg: mean of the normal distribution the contributer means are drawn from
    * contributer_sig: mean of the underlying normal of the lognormal distribution
      the contributer standard deviations are drawn from

    Output:
    Dictionary {level: {contributer: (mu, sigma)}} with the coefficients for every
    contributer in each level
    """
    # assert n_contributers[0] == 1, "First level is allowed to have only one contributer"
    return align_coefficients(
        {
            lvl: {
                i: (
                    np.random.normal(contributer_avg),
                    np.random.lognormal(contributer_sig),
                )
                for i in range(num)
            }
            for lvl, num in enumerate(n_contributers)
        }
    )


def print_coefficients(contributer_coefficients):
    """print mu and sigma of every contributer after centering each level"""
    contributer_coefficients = align_coefficients(contributer_coefficients)
    for m, stage in contributer_coefficients.items():
        for i, (mu, sig) in stage.items():
            print(f"Stage {m}: Contributer {i} mu={mu:.2f}, sig={sig:.2f}")

In [5]:
# easy as a start
cc_02 = {0: {0: (0, 5)}, 1: {0: (0, 1), 1: (0, 1), 2: (0, 1), 3: (4, 2)}}
cc_01 = {0: {0: (0, 5)}, 1: {0: (1, 1), 1: (-1, 1)}}

In [6]:
print_coefficients(cc_02)

Stage 0: Contributer 0 mu=0.00, sig=5.00
Stage 1: Contributer 0 mu=-1.00, sig=1.00
Stage 1: Contributer 1 mu=-1.00, sig=1.00
Stage 1: Contributer 2 mu=-1.00, sig=1.00
Stage 1: Contributer 3 mu=3.00, sig=2.00
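
The shifted Stage 1 means follow from the centering in `align_coefficients`: the level-1 means of `cc_02` are (0, 0, 0, 4), their average is 1, and subtracting it gives (-1, -1, -1, 3). The standard deviations are left untouched. A quick check of that arithmetic:

```python
import numpy as np

level1_means = np.array([0.0, 0.0, 0.0, 4.0])  # mu values of level 1 in cc_02
centered = level1_means - level1_means.mean()

print(centered)        # [-1. -1. -1.  3.]
print(centered.sum())  # 0.0
```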


## Create the Data

In [7]:
def gen_data(contributer_coefficients, n_samples):
    """Generate random data.
    The first level defines the baseline that holds for all random numbers.

    Inputs:
    * contributer_coefficients: dictionary with (mean, sig) per level per contributer
    * n_samples: number of samples

    Outputs:
    * data: array with the final numbers (sum of one contribution per level)
    * contributers: integer matrix (n_samples x n_levels) holding the index of the
      selected contributer per level; the first column belongs to the first level"""

    # data = np.random.normal(gen_avg, gen_sig, n_samples)
    data = np.zeros((n_samples,))
    contributers = np.zeros((n_samples, len(contributer_coefficients)))
    for lvl, cdict in contributer_coefficients.items():
        print(f"creating level {lvl}")
        lvl_influencers = len(cdict)  # number of influencers in this level
        lvldata = np.zeros((n_samples, lvl_influencers))

        for i, (mu, sig) in cdict.items():
            lvldata[:, i] = np.random.normal(mu, sig, n_samples)

        selection = np.random.randint(low=0, high=lvl_influencers, size=(n_samples))
        contributers[:, lvl] = selection

        # pick, for every sample, the value drawn from its selected contributer
        data += lvldata[np.arange(n_samples), selection]
        # Note: The first level
    return data, contributers.astype(int)
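
A quick way to sanity-check data of this kind is to compare group statistics with the coefficients. The sketch below is a self-contained variant, not the notebook's `gen_data`: it uses the `cc_01` structure, assumes the contributer keys run from 0 to n-1, and draws only the selected contributer's value per sample, which should be distributionally equivalent.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 10_000

# Same structure as cc_01: {level: {contributer: (mu, sigma)}}
coefficients = {0: {0: (0, 5)}, 1: {0: (1, 1), 1: (-1, 1)}}

data = np.zeros(n_samples)
contributers = np.zeros((n_samples, len(coefficients)), dtype=int)
for lvl, cdict in coefficients.items():
    mus = np.array([mu for mu, _ in cdict.values()])
    sigs = np.array([sig for _, sig in cdict.values()])
    selection = rng.integers(0, len(cdict), size=n_samples)
    contributers[:, lvl] = selection
    # Draw only the selected contributer's value for each sample.
    data += rng.normal(mus[selection], sigs[selection])

# Group means should be near +1 and -1; the contributions are independent, so the
# group standard deviations should be near sqrt(5**2 + 1**2), roughly 5.1.
for k in (0, 1):
    group = data[contributers[:, 1] == k]
    print(k, round(group.mean(), 2), round(group.std(), 2))
```

The baseline's standard deviation of 5 dominates the spread, which is why the level-1 means can only be recovered with a fairly large sample.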

# Modeling

Use the PyMC [Radon analysis](https://www.pymc.io/projects/examples/en/latest/generalized_linear_models/multilevel_modeling.html) example as a reference.
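
A minimal sketch of how such a model might look in PyMC, assuming a recent PyMC 5 release with `pm.ZeroSumNormal` and the `nuts_sampler` argument of `pm.sample`. The function names, prior scales and variable names are hypothetical choices for illustration, and the single shared noise scale is a simplification of the per-contributer $σ$ in the problem statement.

```python
import numpy as np


def build_multilevel_model(data, contributers, n_per_level):
    """Sketch: baseline plus zero-sum offsets for every level with >= 2 contributers."""
    import pymc as pm  # imported lazily; requires PyMC to be installed

    with pm.Model() as model:
        # Level 0 acts as the baseline shared by all samples.
        mu = pm.Normal("baseline", mu=0.0, sigma=10.0)
        for lvl, n in enumerate(n_per_level):
            if n < 2:
                continue  # a single contributer is absorbed by the baseline
            # ZeroSumNormal constrains the offsets of a level to sum to zero.
            offsets = pm.ZeroSumNormal(f"mu_level_{lvl}", sigma=5.0, shape=n)
            mu = mu + offsets[contributers[:, lvl]]
        # Simplification: one shared noise scale instead of one sigma per contributer.
        noise = pm.HalfNormal("noise", sigma=10.0)
        pm.Normal("x", mu=mu, sigma=noise, observed=data)
    return model


def sample_with_nutpie(model, **kwargs):
    """Draw posterior samples with nutpie (needs the nutpie package as well)."""
    import pymc as pm

    with model:
        return pm.sample(nuts_sampler="nutpie", **kwargs)
```

With data shaped like the output of `gen_data` for `cc_01`, a call such as `sample_with_nutpie(build_multilevel_model(data, contributers, [1, 2]))` would be the intended usage.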

In [10]:
n_contributers = [2, 3, 5, 10, 10]

In [11]:
# generate the data
data, contributers = gen_data(n_samples=n_samples, contributer_coefficients=cc_01)

creating level 0
creating level 1


In [12]:
pipe = Pipeline(
    [
        ("ohe", OneHotEncoder(sparse_output=False)),
        ("regression", BayesianRidge(fit_intercept=False)),
    ]
)
pipe.fit(contributers, data)

In [13]:
pipe[0].get_feature_names_out()

array(['x0_0', 'x1_0', 'x1_1'], dtype=object)

In [14]:
pipe[1].coef_

array([-0.02040709,  0.9998263 , -1.0202334 ])
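
One property of this design is worth keeping in mind when reading the coefficients: level 0 has a single contributer, so its one-hot column `x0_0` is all ones and equals `x1_0 + x1_1`. The design matrix is therefore rank deficient, and without an intercept the split between baseline and level-1 offsets is only pinned down by the regularization in `BayesianRidge`. The sketch below illustrates the rank deficiency with hypothetical random assignments.

```python
import numpy as np

rng = np.random.default_rng(0)
selection = rng.integers(0, 2, size=1000)  # hypothetical level-1 assignments

# One-hot columns in the order x0_0, x1_0, x1_1
design = np.column_stack(
    [np.ones(1000), selection == 0, selection == 1]
).astype(float)

# The baseline column is the sum of the two level-1 columns.
print(np.allclose(design[:, 0], design[:, 1] + design[:, 2]))
print(design.shape[1], np.linalg.matrix_rank(design))  # 3 columns, rank 2
```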

In [35]:
df = (
    pl.DataFrame({"parameter": pipe[0].get_feature_names_out(), "value": pipe[1].coef_})
    .with_columns(pl.col("parameter").str.split("_").alias("splits"))
    .with_columns(
        pl.col("splits").list.first().alias("level"),
        pl.col("splits").list.last().alias("con"),
    )
    .drop("splits")
)

df

| parameter | value | level | con |
| --- | --- | --- | --- |
| str | f64 | str | str |
| "x0_0" | -0.020407 | "x0" | "0" |
| "x1_0" | 0.999826 | "x1" | "0" |
| "x1_1" | -1.020233 | "x1" | "1" |


In [16]:
df

| parameter | value |
| --- | --- |
| str | f64 |
| "x0_0" | -0.020407 |
| "x1_0" | 0.999826 |
| "x1_1" | -1.020233 |
