# Fitting Data with LRMoE

## Introduction

In this notebook, we will demonstrate how to fit data with LRMoE. We start by simulating data from a two-component mixture, and then artificially impose data truncation. We then fit the data with LRMoE and assess the model fit.

In [None]:
using DrWatson
@quickactivate "LRMoEjl Demo"

using CategoricalArrays, DataFrames, Distributions
using GLM, LRMoE, JLD2, PrettyTables, Random

# some helper functions are hidden in a separate source file
include(srcdir("2023-iCAS-new-util.jl"))
using .continuous_util_jl:
    plot_simulated_obs,
    plot_LRMoE_fit,
    plot_simulated_T_delay,
    plot_simulated_T_claim_T_delay,
    plot_observed_losses

## Data Simulation

Let us consider a synthetic dataset with continuous observations, such as loss severities or claim reporting delays. We first simulate some covariates typically considered in auto insurance.

In [None]:
Random.seed!(7777)
sample_size = 10_000

X = DataFrame(
    intercept = fill(1.0, sample_size),
    sex = rand(Binomial(1, 0.50), sample_size),
    aged = rand(Uniform(20, 80), sample_size),
    agec = rand(Uniform(0, 10), sample_size),
    region = rand(Binomial(1, 0.50), sample_size)
)

pretty_table(first(X, 5), nosubheader=true)

We assume the observations are generated from a two-component LRMoE model with the following parameters.

In [None]:
# logit regression coefficients
α = [-0.5 1.0 -0.05 0.1 1.25;
     0.0 0.0   0.0 0.0  0.0]
# expert functions
comp_dist = [LogNormalExpert(4.0, 0.3) InverseGaussianExpert(20, 20)];
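
To make the role of `α` concrete, the sketch below evaluates the logit (softmax) gating weights for a single policyholder. The matrix `α` is copied from the cell above; the covariate vector `x` and the helper `gating_probs` are hypothetical and are not part of the LRMoE package.

```julia
# Softmax (multinomial logit) gating for one covariate vector.
# α is copied from the cell above; `x` and `gating_probs` are illustrative assumptions.
α = [-0.5 1.0 -0.05 0.1 1.25;
      0.0 0.0  0.0  0.0 0.0]

# Column order matches X: intercept, sex, aged, agec, region.
x = [1.0, 1.0, 40.0, 5.0, 0.0]

function gating_probs(α::AbstractMatrix, x::AbstractVector)
    η = α * x                  # one linear predictor per component
    w = exp.(η .- maximum(η))  # subtract the maximum for numerical stability
    return w ./ sum(w)         # normalise so the weights sum to one
end

p = gating_probs(α, x)
println("P(component 1, LogNormal)       = ", round(p[1], digits = 4))
println("P(component 2, InverseGaussian) = ", round(p[2], digits = 4))
```

The second row of `α` is all zeros, so the second component acts as the reference class and only the first row needs to be estimated.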

Now we are ready to simulate the observations and visualize the distribution.

In [None]:
exact_obs = LRMoE.sim_dataset(α, X, comp_dist)
exact_obs = vec(exact_obs);

In [None]:
plot_simulated_obs(exact_obs)

## Fitting LRMoE to Exact Data

Although the data are simulated from a mixture of Lognormal and Inverse Gaussian distributions, we fit a two-component LRMoE with Gamma experts. The goal is to assess the model fit when a 'wrong' distribution is used, which is almost always the case in practice because the true distribution is never known.

To prepare for model fitting, the first step is to convert the original dataframe into matrix form (support for the `@formula` interface is a feature under development).

In [None]:
# convert observations to a matrix Y, which is needed for LRMoE
Y_mat = reshape(exact_obs, length(exact_obs), 1)
X_mat = Matrix(X);

The `LRMoE.jl` package provides an initialization function, `cmm_init`, which returns initial values `α_init` for the logit regression parameters and `params_init` for all candidate expert functions. From these we pick out the Gamma initializations.

In [None]:
# Random.seed!(20230315)
Random.seed!(7777)
n_comp = 2
model_init = cmm_init(Y_mat, X_mat, n_comp, ["continuous"]; exact_Y = true, n_random = 0)

model_init.params_init

# pick out the desired parameter initializations
α_init = model_init.α_init
experts_init = vcat([hcat([model_init.params_init[1][j][2] for j in 1:n_comp]...) for d in 1:1]...)
# view
println("α_init: $(α_init)")
println("experts_init: $(experts_init[1, :])")

# dump(model_init, maxdepth = 0)

There are several settings to control the fitting function. For example, `ϵ` controls when to stop the EM algorithm based on the increment in loglikelihood, while `ecm_iter_max` provides a hard stop after a certain number of iterations. More details can be found in the package [documentation](https://actsci.utstat.utoronto.ca/LRMoE.jl/stable/fit/).

In [None]:
LRMoE_model = fit_LRMoE(Y_mat, X_mat, α_init, experts_init;
    exact_Y=true, ϵ=0.01, ecm_iter_max=1000, print_steps=10)
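
The two stopping rules can be illustrated with a toy EM loop. The sketch below fits a two-component exponential mixture without covariates; it is not the algorithm implemented in `fit_LRMoE`, and the names `toy_em`, `mix_loglik`, and `iter_max` are hypothetical. It only mirrors the idea of stopping on a small log-likelihood increment (`ϵ`) or after a fixed number of iterations.

```julia
using Random, Statistics

# Log-likelihood of a two-component exponential mixture with means θ1 and θ2.
mix_loglik(y, π1, θ1, θ2) =
    sum(log.(π1 .* exp.(-y ./ θ1) ./ θ1 .+ (1 - π1) .* exp.(-y ./ θ2) ./ θ2))

function toy_em(y; ϵ = 0.01, iter_max = 1000)
    π1, θ1, θ2 = 0.5, 0.5 * mean(y), 2.0 * mean(y)  # crude starting values
    ll_old = mix_loglik(y, π1, θ1, θ2)
    iter = 0
    while iter < iter_max                  # hard stop, in the spirit of ecm_iter_max
        iter += 1
        # E-step: posterior probability that each point belongs to component 1
        f1 = π1 .* exp.(-y ./ θ1) ./ θ1
        f2 = (1 - π1) .* exp.(-y ./ θ2) ./ θ2
        z = f1 ./ (f1 .+ f2)
        # M-step: weighted updates of the mixing weight and the two means
        π1 = mean(z)
        θ1 = sum(z .* y) / sum(z)
        θ2 = sum((1 .- z) .* y) / sum(1 .- z)
        ll_new = mix_loglik(y, π1, θ1, θ2)
        converged = ll_new - ll_old < ϵ    # small increment, in the spirit of ϵ
        ll_old = ll_new
        converged && break
    end
    return (π1 = π1, θ1 = θ1, θ2 = θ2, loglik = ll_old, iterations = iter)
end

rng = MersenneTwister(1)
y = vcat(5.0 .* randexp(rng, 500), 50.0 .* randexp(rng, 500))  # simulated toy data
res = toy_em(y)
println(res)
```

A smaller `ϵ` generally means more iterations and a tighter fit, while the iteration cap guards against slow convergence.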

In [None]:
summary(LRMoE_model)

We now visualize the model fit. The LRMoE model fits the data reasonably well even though it does not use the true expert distributions.

In [None]:
plot_LRMoE_fit(exact_obs, X_mat, LRMoE_model)

## Right-Truncated Reporting Delay

Claim reporting delays are inherently right-truncated. A claim is observed only if its reporting delay is shorter than the time between the accident date and the valuation date; otherwise the claim is still unreported (incurred but not reported, IBNR).

Let us assume accidents occur uniformly within a year, i.e. `T_accident` follows a uniform distribution from 0 to 365.

In [None]:
Random.seed!(7777)
T_accident = rand(Uniform(0, 365), sample_size)
T_delay = exact_obs;

In [None]:
# pick a few unreported and reported claims to display later
idx_unreported = findall(T_accident .+ T_delay .> 365)
idx_reported = findall(T_accident .+ T_delay .<= 365)
plot_idx = [idx_unreported[1], idx_reported[2], idx_unreported[3], idx_reported[4]]

plot_simulated_T_claim_T_delay(T_accident, T_delay)

Let us also assume that claims reserving is done at the end of the year. With accident times and reporting delays simulated as above, any claim with `T_accident + T_delay > 365` is not yet reported at the valuation date and is therefore right-truncated; this affects roughly 8% of the data.

In [None]:
println("Number of truncated observations: $(sum((T_accident .+ T_delay) .> 365))")
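
Ignoring the truncation would bias any estimate of the delay distribution, because long delays are more likely to be unreported. The following sketch shows the effect on a toy dataset. The exponential delay distribution with a mean of 60 days and all variable names are assumptions made for illustration; they are unrelated to the LRMoE model simulated above.

```julia
using Random, Statistics

rng = MersenneTwister(2023)
n = 10_000
toy_accident = 365 .* rand(rng, n)     # accident dates, uniform over the year
toy_delay = 60.0 .* randexp(rng, n)    # hypothetical exponential delays, mean 60 days

# A claim is observed only if it is reported by the valuation date (day 365).
is_reported = toy_accident .+ toy_delay .<= 365

println("share unreported:       ", round(mean(.!is_reported), digits = 3))
println("mean delay, all claims: ", round(mean(toy_delay), digits = 1))
println("mean delay, reported:   ", round(mean(toy_delay[is_reported]), digits = 1))
```

The mean delay among reported claims understates the mean over all claims, which is why the fitting step needs the truncation levels.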

Unlike the case of exact observations above, we require a special format for censored and/or truncated data.

Each observed reporting delay is accompanied by a right-truncation level `t_u = 365 - T_accident`. Claims whose reporting delay `T_delay` exceeds this threshold are not observed in practice. The LRMoE model can also account for left-truncation at a level `t_l` (see below), but in this example `t_l` is set to zero.

In [None]:
# construct data frame for truncated observations
t_l = fill(0.0, sample_size)
t_u = 365 .- T_accident
truncation_idx = (T_accident .+ T_delay) .> 365

# view constructed data
df_view = DataFrame(
    T_accident = T_accident,
    T_delay = T_delay,
    T_report = T_accident .+ T_delay,
    truncation = truncation_idx,
    t_l = t_l,
    t_u = t_u
)

pretty_table(df_view[plot_idx,:], nosubheader=true)

The LRMoE package also requires two additional bounds `(y_l, y_u)` to indicate potential data censoring. In this example of reporting delays there is no censoring, so we set `y_l = y_u = T_delay` for all reported claims. Incurred but not reported claims are dropped from the simulated dataset, since an unreported claim is never observed.

In [None]:
y_l = T_delay
y_u = T_delay

# construct the complete dataframe
Y = DataFrame(t_l=t_l, y_l=y_l, y_u=y_u, t_u=t_u)
# drop truncated observations
Y_truncated = Y[.!truncation_idx, :]
X_truncated = X[.!truncation_idx, :]
pretty_table(first(Y, 5), nosubheader=true)
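
For reference, the four columns `(t_l, y_l, y_u, t_u)` correspond to the standard likelihood contribution for censored and truncated data. The expression below is a sketch of that textbook form, assuming $f(\cdot \mid x_i)$ and $F(\cdot \mid x_i)$ denote the mixture density and distribution function given covariates $x_i$; we assume, rather than verify here, that it matches the objective maximised by the package.

$$
L_i =
\begin{cases}
\dfrac{f(y_i \mid x_i)}{F(t_{u,i} \mid x_i) - F(t_{l,i} \mid x_i)}, & y_{l,i} = y_{u,i} = y_i \quad \text{(exact observation)},\\[2ex]
\dfrac{F(y_{u,i} \mid x_i) - F(y_{l,i} \mid x_i)}{F(t_{u,i} \mid x_i) - F(t_{l,i} \mid x_i)}, & y_{l,i} < y_{u,i} \quad \text{(censored observation)}.
\end{cases}
$$

In the reporting delay example only the first case applies, with $t_{l,i} = 0$ and $t_{u,i} = 365 - T_{\text{accident},i}$.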

## Fitting LRMoE with Right-Truncated Data

We are now ready to fit the LRMoE to the truncated data, similar to the procedure above.

In [None]:
# Random.seed!(20230315)
Random.seed!(42)
n_comp = 2
model_init = cmm_init(Matrix(Y_truncated), Matrix(X_truncated), n_comp, ["continuous"]; exact_Y = false, n_random = 0)

model_init.params_init

# pick out the desired parameter initializations
α_init = model_init.α_init
experts_init = vcat([hcat([model_init.params_init[1][j][2] for j in 1:n_comp]...) for d in 1:1]...)
# view
println("α_init: $(α_init)")
println("experts_init: $(experts_init[1, :])")

In [None]:
LRMoE_model_truncated = fit_LRMoE(Matrix(Y_truncated), Matrix(X_truncated), α_init, experts_init;
    exact_Y=false, ϵ=0.01, ecm_iter_max=1000, print_steps=10)

In [None]:
summary(LRMoE_model_truncated)

## Additional Example: Left Truncation

Left-truncation is also common in insurance applications, e.g. due to policy deductibles.
Let us simulate from the same LRMoE model, but assume the dataset now represents the distribution of incurred losses.

In [None]:
actual_loss = LRMoE.sim_dataset(α, X, comp_dist);

Assuming there is a policy deductible of 10, all losses below the deductible will not be observed by the insurer.
This is represented by setting `t_l = 10` and `t_u = Inf`, whereby the lower bound represents the level of left-truncation, i.e. policy deductible.

In [None]:
left_truncation_idx = actual_loss .< 10
println("Number of truncated observations: $(sum(left_truncation_idx))")
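
The effect of a deductible on estimation can be seen in a case with a closed-form answer. The sketch below assumes exponentially distributed losses with a hypothetical mean of 20, which is not the LRMoE model used in this notebook. For an exponential distribution, the excess over the deductible is again exponential with the same mean, so the maximum likelihood estimate under left truncation is the average excess.

```julia
using Random, Statistics

rng = MersenneTwister(10)
θ_true = 20.0                            # hypothetical mean loss (assumption)
d = 10.0                                 # deductible, as in the text
toy_loss = θ_true .* randexp(rng, 10_000)
toy_observed = toy_loss[toy_loss .>= d]  # losses below the deductible are never seen

θ_naive = mean(toy_observed)             # ignores the truncation
θ_trunc = mean(toy_observed .- d)        # MLE under left truncation (memoryless property)

println("observed share:  ", round(length(toy_observed) / length(toy_loss), digits = 3))
println("naive estimate:  ", round(θ_naive, digits = 2))
println("truncation-aware ", round(θ_trunc, digits = 2))
```

The naive estimate is pulled upwards by roughly the size of the deductible, while the truncation-aware estimate stays close to the true mean.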

The input data for fitting LRMoE can now be constructed with `y_l = y_u = actual_loss` (since observed losses are exact), `t_l = 10`, and `t_u = Inf`. We can then call the fitting function as before.

In [None]:
n_observed = sample_size - sum(left_truncation_idx)
t_l = fill(10.0, n_observed)
y_l = actual_loss[.!left_truncation_idx]
y_u = actual_loss[.!left_truncation_idx]
t_u = fill(Inf, n_observed)

# construct the complete dataframe
X_truncated2 = X[.!vec(left_truncation_idx), :]
Y_truncated2 = DataFrame(t_l=t_l, y_l=y_l, y_u=y_u, t_u=t_u)
pretty_table(Y_truncated2[1:5,:], nosubheader=true)

Note that losses below the deductible are dropped from the dataset, which can be observed from the following histogram.

In [None]:
plot_observed_losses(Y_truncated2[:, 2])

We now fit LRMoE to the left-truncated data, starting from a point reasonably close to the true parameters. The goal is solely to verify that the LRMoE package can recover the true parameters in the presence of data truncation.

In [None]:
α_init = fill(0.0, 2, 5)
experts_init = [LogNormalExpert(3.5, 1.0) InverseGaussianExpert(25.0, 25.0)];

LRMoE_model_truncated2 = fit_LRMoE(Matrix(Y_truncated2), Matrix(X_truncated2), α_init, experts_init;
    exact_Y=false, ϵ=0.01, ecm_iter_max=1000, print_steps=10)

In [None]:
plot_LRMoE_fit(Y_truncated2[:, 2], Matrix(X_truncated2), LRMoE_model_truncated2)

In [None]:
summary(LRMoE_model_truncated2)

In general, the LRMoE package can handle various combinations of data censoring and truncation, e.g. right-censored data due to policy limits. For more details, please refer to the [documentation](https://actsci.utstat.utoronto.ca/LRMoE.jl/stable/).
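
As a sketch of the policy limit case, the snippet below encodes right-censored losses in the same four-column layout used throughout this notebook. The losses and the limit are hypothetical, and the convention of setting `y_l` to the limit and `y_u` to `Inf` for censored records is our assumption about the expected input; please confirm it against the package documentation before use.

```julia
# Hypothetical losses and policy limit, for illustration only.
loss = [12.0, 250.0, 80.0, 400.0]
limit = 200.0

is_censored = loss .> limit                # losses above the limit are capped
t_l = fill(0.0, length(loss))              # no left truncation
y_l = min.(loss, limit)                    # exact loss, or the limit if censored
y_u = ifelse.(is_censored, Inf, loss)      # upper bound unknown when censored
t_u = fill(Inf, length(loss))              # no right truncation

# One row per claim, columns ordered as (t_l, y_l, y_u, t_u).
Y_censored = hcat(t_l, y_l, y_u, t_u)
display(Y_censored)
```

Uncensored rows have `y_l == y_u`, while censored rows only record that the loss lies somewhere above the limit.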