# Fitting Censored and Truncated Data with LRMoE

## Introduction

In this notebook, we will demonstrate how to fit censored and truncated data with LRMoE. We start by simulating data from a two-component mixture, and then artificially impose data censoring and truncation. We then fit the data with LRMoE, and compare the results with the true parameters.

In [None]:
using DrWatson
@quickactivate "LRMoEjl Demo"

using CategoricalArrays, DataFrames, Distributions
using GLM, LRMoE, JLD2, PrettyTables, Random

# some helper functions are hidden in a separate source file
include(srcdir("2023-iCAS-Censor-Truncation-util.jl"))
using .cnesor_truncation_util_jl:
    plot_simulated_T_delay,
    plot_simulated_T_claim_T_delay,
    plot_observed_losses

## Data Simulation

Data censoring is common in insurance applications, e.g. in claim reporting delays. Let us consider a synthetic dataset that describes the distribution of reporting delay. We first simulate some covariates typically used in automobile insurance.

In [None]:
Random.seed!(7777)
sample_size = 10_000

X = DataFrame(
    intercept = fill(1.0, sample_size),
    sex = rand(Binomial(1, 0.50), sample_size),
    aged = rand(Uniform(20, 80), sample_size),
    agec = rand(Uniform(0, 10), sample_size),
    region = rand(Binomial(1, 0.50), sample_size)
)

pretty_table(first(X, 5), nosubheader=true)

We assume the claim reporting delay is generated from a two-component LRMoE model with the following parameters.

In [None]:
# logit regression coefficients
α = [-0.5 1.0 -0.05 0.1 1.25;
     0.0 0.0   0.0 0.0  0.0]
# expert functions
comp_dist = [LogNormalExpert(4.0, 0.3) InverseGaussianExpert(20, 20)];
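
To make the role of `α` concrete, here is a minimal sketch of how the logit coefficients translate into component weights, assuming the usual multinomial-logit (softmax) gating with the last row as the reference class. The covariate vector `x` is a made-up policyholder, not a row of `X`.

```julia
# Logit regression coefficients, one row per component (copied from the cell above).
α = [-0.5 1.0 -0.05 0.1 1.25;
      0.0 0.0  0.0  0.0 0.0]

# Hypothetical covariates in the same order as X: intercept, sex, aged, agec, region.
x = [1.0, 1.0, 40.0, 5.0, 0.0]

scores = α * x                             # one linear score per component
weights = exp.(scores .- maximum(scores))  # subtract the max for numerical stability
weights ./= sum(weights)                   # normalize so the weights sum to 1

println(weights)
```

Under this assumption, the first weight is the probability that the observation is drawn from the lognormal expert and the second that it is drawn from the inverse Gaussian expert.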

Now we are ready to simulate the actual claim reporting delay `T_delay` and visualize its distribution.

In [None]:
T_delay = LRMoE.sim_dataset(α, X, comp_dist)
T_delay = vec(T_delay);

In [None]:
plot_simulated_T_delay(T_delay)

## Data Censoring

Although we can simulate complete data from the true model, reporting delay is right-censored in practice. For some observations, we only know that the delay exceeds a certain threshold (e.g. the claim is reported after the valuation date), not its exact value.

Let us assume accidents occur uniformly within a year, i.e. `T_accident` follows a uniform distribution from 0 to 365.

In [None]:
T_accident = rand(Uniform(0, 365), sample_size);

In [None]:
plot_simulated_T_claim_T_delay(T_accident, T_delay)

Let us also assume that reserving is done at the end of the year (day 365). With accident and reporting delay times simulated as above, any claim with `T_accident + T_delay > 365` has not yet been reported at the valuation date, so its delay is right-censored. This applies to roughly 8% of the data.

In [None]:
println("Number of censored observations: $(sum((T_accident .+ T_delay) .> 365))")

With data censoring, each reporting delay is represented by a pair of values `(y_l, y_u)`, which denote the lower and upper bounds of its possible values.
* For exact values (i.e. no censoring), `y_l = y_u = T_delay`.
* For right-censored values, `y_l = 365 - T_accident` and `y_u = Inf`.
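
The pair `(y_l, y_u)` is enough to write down a likelihood: an exact value contributes its density, while a censored value contributes the probability of the interval it is known to lie in. The sketch below illustrates this for a single `LogNormal` distribution from Distributions.jl. The helper `interval_loglik` is hypothetical and is not LRMoE's internal code, which works with the full mixture.

```julia
using Distributions

# Hypothetical helper (not part of LRMoE): log-likelihood contribution of one
# observation known only to lie in [y_l, y_u], under a single distribution d.
function interval_loglik(d, y_l, y_u)
    if y_l == y_u
        return logpdf(d, y_l)                  # exact observation: use the density
    elseif isinf(y_u)
        return logccdf(d, y_l)                 # right-censored: log P(Y > y_l)
    else
        return log(cdf(d, y_u) - cdf(d, y_l))  # interval-censored: log P(y_l < Y <= y_u)
    end
end

d = LogNormal(4.0, 0.3)                        # same parameters as the first expert
exact_ll = interval_loglik(d, 50.0, 50.0)      # delay observed exactly at 50 days
censored_ll = interval_loglik(d, 50.0, Inf)    # delay only known to exceed 50 days
```

A right-censored record therefore still carries information: it tells the model how much probability mass must sit above `y_l`.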

In [None]:
# construct data frame for censoring
y_l = copy(T_delay)
y_u = copy(T_delay)
censor_idx = (T_accident .+ T_delay) .> 365
y_l[censor_idx] .= 365 .- T_accident[censor_idx]
y_u[censor_idx] .= Inf

# view constructed data
df_view = DataFrame(
    T_accident = T_accident,
    T_delay = T_delay,
    censored = censor_idx,
    y_l = y_l,
    y_u = y_u
)
pretty_table(df_view[[5, 10, 2500, 9998],:], nosubheader=true)

The LRMoE package also requires two additional bounds `(t_l, t_u)` for each observation to indicate potential data truncation (see Data Truncation below). In this example of reporting delay, there is no truncation, so we set `t_l = 0` and `t_u = Inf`.

In [None]:
t_l = fill(0.0, sample_size)
t_u = fill(Inf, sample_size)

# construct the complete dataframe
Y = DataFrame(t_l=t_l, y_l=y_l, y_u=y_u, t_u=t_u)
pretty_table(Y[[5, 10, 2500, 9998],:], nosubheader=true)

## Fitting LRMoE with Censored Data

We will now fit the censored data with LRMoE, from a starting point reasonably close to the true parameters. The goal is to verify that the LRMoE package can recover the true parameters in the presence of data censoring.

In [None]:
α_init = fill(0.0, 2, 5)
experts_init = [LogNormalExpert(3.5, 1.0) InverseGaussianExpert(25.0, 25.0)];

In [None]:
LRMoE_model = fit_LRMoE(Matrix(Y), X, α_init, experts_init;
    exact_Y=false, ϵ=0.01, ecm_iter_max=1000, print_steps=10)

As the summary below shows, the true parameters are recovered reasonably well. The fitted LRMoE model for reporting delay can then be integrated into a general framework for claims reserving.

In [None]:
summary(LRMoE_model)

In [None]:
α

In [None]:
comp_dist

## Data Truncation

Data truncation is also common in insurance applications, e.g. due to policy deductibles.
Let us simulate from the same LRMoE model, but assume the dataset now represents the distribution of incurred losses.

In [None]:
actual_loss = LRMoE.sim_dataset(α, X, comp_dist);

Assuming there is a policy deductible of 10, losses below the deductible are never observed by the insurer.
This is represented by setting `t_l = 10` and `t_u = Inf`, where the lower bound `t_l` is the level of left-truncation, i.e. the policy deductible.
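
Left-truncation changes the likelihood by conditioning on the loss exceeding the deductible: the density is rescaled by the probability of being observed at all. The sketch below checks this for a single `InverseGaussian` distribution, using `truncated` from Distributions.jl with its `lower` keyword (available in recent releases). It is an illustration for one distribution only and does not reproduce how LRMoE handles truncation for the full mixture.

```julia
using Distributions

d = InverseGaussian(20.0, 20.0)   # same parameters as the second expert
deductible = 10.0

# Log-density of a loss y given that it exceeds the deductible:
# log f(y) - log P(Y > deductible).
y = 25.0
manual = logpdf(d, y) - logccdf(d, deductible)

# Distributions.jl provides the same construction through `truncated`.
d_trunc = truncated(d; lower=deductible)
builtin = logpdf(d_trunc, y)

println((manual, builtin))
```

Both values should agree up to floating-point error, and the truncated density is zero below the deductible.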

In [None]:
truncation_idx = actual_loss .< 10
println("Number of truncated observations: $(sum(truncation_idx))")

The input data for fitting LRMoE can now be constructed, with `y_l = y_u = actual_loss` (since observed losses are exact), `t_l = 10`, and `t_u = Inf`. Afterwards, we can call the fitting function as before.

In [None]:
# number of losses that exceed the deductible and are therefore observed
n_observed = sample_size - sum(truncation_idx)

t_l = fill(10.0, n_observed)
y_l = actual_loss[.!truncation_idx]
y_u = actual_loss[.!truncation_idx]
t_u = fill(Inf, n_observed)

# construct the complete dataframe
X_truncated = X[.!vec(truncation_idx), :]
Y_truncated = DataFrame(t_l=t_l, y_l=y_l, y_u=y_u, t_u=t_u)
pretty_table(Y_truncated[1:5,:], nosubheader=true)

Note that losses below the deductible are dropped from the dataset, which can be observed from the following histogram.

In [None]:
plot_observed_losses(y_l)

In [None]:
# pass the response as a matrix, consistent with the censored fit above
LRMoE_model_truncated = fit_LRMoE(Matrix(Y_truncated), X_truncated, α_init, experts_init;
    exact_Y=false, ϵ=0.01, ecm_iter_max=1000, print_steps=10)

In [None]:
summary(LRMoE_model_truncated)

In general, the LRMoE package can handle various combinations of data censoring and truncation. For more details, please refer to the [documentation](https://actsci.utstat.utoronto.ca/LRMoE.jl/stable/).