# Population growth
## Stepping beyond the constant rate logistic growth

    Author: Fábio Hipólito
    Contact: fabio.hipolito@gmail.com
    github: https://github.com/f-hipolito


## The motivation and purpose

This notebook was written to refresh my knowledge of population dynamics 
(primarily population growth) and of how to apply first-order
differential equations to characterize such problems.

The interest in this class of problems is tightly connected with the 2019-2020 
Corona Virus Disease (CoViD19) outbreak and the spread of information and
statistical analysis of dubious quality. 
Given the impact of this infection, many people are trying to make sense of 
the propagation of the disease with either over-simplistic models, such as a
simple logistic model, or with overly complex ML based models.

It is my understanding that before addressing this problem from an ML 
perspective we can, and should, analyze it with fairly simple models 
containing few parameters that can be mapped to our understanding of 
reality, e.g. incubation times, policy change dates, etc.
In this notebook I delve into the generalization of a logistic model to
overcome a key limitation of this model, namely the assumption of constant
environment.

As discussed (and demonstrated) below, the logistic function is the solution 
to the population growth problem if and only if the *environment is 
constant*. 
This assumption is incompatible with the propagation of the corona virus, as 
countries around the world are implementing some sort of mitigation policy to 
reduce the growth rate of the infection.
The present notebook makes no claim about the effectiveness of such policies.
Clear and solid information on the impact of policy on the propagation of
infections can be found on the WHO website, both in 
[general](https://www.who.int/infection-prevention/en/) and for 
[CoViD19](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/technical-guidance/infection-prevention-and-control).

In this notebook, we demonstrate how to generalize the logistic differential
equation to accommodate time-dependent rates and show how it can be used to 
accurately characterize the corona virus propagation.

__WIP:__ Beyond the time-dependent environment

## Contents
1. Models under consideration
    1. the model
        1. Time-dependent environment: $m$-steady state model
        2. Beyond the time-dependent environment
    2. application and discussion
        1. Korea
        2. Denmark
        3. Portugal
2. Review of the differential equations
    1. Constant environments
        1. exponential growth
        2. logistic equation
        3. growth rate coefficient
    2. Time dependent environments
        1. $m-$transitions between steady states
        2. smooth transitions
        3. the Gamma distribution

### Notes to the reader

__Mathematics:__ I will try to provide a clear and easy-to-follow presentation 
of the mathematics behind population dynamics, aimed at readers with basic 
knowledge of differential calculus. For those unfamiliar with differential 
calculus, I recommend referring to:
1. [*Advanced Mathematical Methods for Engineering and Science Students* by Stephenson and Radmore](https://doi.org/10.1017/CBO9781139168120)
2. any edition of [*Mathematical Methods for Physicists* by Arfken, Weber and Harris](
https://doi.org/10.1016/C2009-0-30629-7).

__Notation:__
* $N(t)$ population at time $t$
* $N_t$ total population
* $n \equiv n(t) \doteq N(t)/N_t$ population density, note that the population 
    density is dimensionless $[n] = \mathbb{1}$.
* $dn/dt$ stands for the population rate of change, with dimension 
    $[ dn/dt ] = \mathbb{1}/T$, where $T$ stands for time.
* $r(t)$ and $r_i$ are the general and $i^\mathrm{th}$ growth rates, with 
    dimension $[ r ] = \mathbb{1}/T$.
* $\mathrm{d}$ stands for days, unless otherwise stated all time scales and
    growth rates are in units of $\mathrm{d}$ and $\mathrm{d}^{-1}$,
    respectively.
* $f_X(t)$ probability density function (PDF) for a _continuous_ variable $t$
    distributed according to $X-$distribution.
* $F_X(t)$ cumulative distribution function (CDF) corresponding to $f_X(t)$ 

__Coding:__
As this project grew in size, I decided to move all code to external files,
thus dedicating this primary notebook to theory and discussion. The results
discussed in this file are generated by the following notebooks, which can be
found in the root folder of this project:
* korea.ipynb
* denmark.ipynb
* portugal.ipynb

All programming instructions use Julia 1.4.0; for reference see the [official documentation](https://docs.julialang.org/en/v1.4/).
A Python 3.x version is in the pipeline, but priority is given to project
development using Julia.
Additional required packages and modules (for Julia only):
* SpecialFunctions
* MyFunctions, Mrate (local modules provided in the repository)
* DataFrames, Query, CSV, Dates
* LsqFit, Optim
* Plots, LaTeXStrings

__Data sources:__
We collected our data from the publicly available _European Centre for 
Disease Prevention and Control_ (ECDC) 
[site](https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide),
which publishes daily updates on CoViD19 cases. 
The dataset copyright belongs to ECDC (see https://www.ecdc.europa.eu/en/copyright 
or refer to the attached file _ECDC_Copyright_ for full copyright information).

Datasets _aRnzjSwT_ and _6u83xREK_ were collected on 2020-04-17T21:29 and 
2020-04-25T16:41, respectively. The sha256 checksums are available in the 
attached file _sha256sums_.

## 1. Generalization of the Logistic growth


A step-by-step generalization of the logistic differential equation and its
solution is provided below. 
Here, we offer a brief presentation of the key arguments supporting this model
and present the solution that is used to characterize the propagation of 
CoViD19 in a few countries.

The manifestation of infection is not immediate due to finite (distribution of) 
incubation time(s). 
Consequently, any change to the environment, i.e. the growth rate,
will necessarily be _delayed_ and also _smoothed_ by the distribution of 
incubation times.
Consider the following growth rate with time dependence governed by a 
cumulative distribution function for the incubation times
![Time dependent R](img/time_dependent_R.svg)
where the rate transitions smoothly between rates 
$r_i = \{ 2, 0.5, 1.5\} \, \mathrm{d}^{-1}$ at times 
$t_i = \{ 10, 35 \} \, \mathrm{d}$.

This approach is useful to characterize growth in environments where the growth
rate evolves, smoothly, between different _constant_ or _steady state_ values.
While at first glance this might appear to be a mathematician's toy model, 
an inspection of the number of cases of CoViD19 in South Korea reveals that 
multi steady state logistic models can be useful.
In the log plot depicted below, we plot number of cases in South Korea and two
logistic regressions fitted to a subset of the original dataset.
![Truncated logistic regression in log scale](img/South_Korea_truncated_log.svg)
To begin, this plot clearly shows why it is important to generalize the 
logistic model to accommodate multiple rates, as no single rate logistic model 
can represent this data accurately.
More subtly, it also offers the possibility to infer that the transition occurs

Assuming that the transition between _constant_ rates is governed by the CDF 
for the incubation times, we require a characterization of the incubation times
to define our model.
This task has already been completed recently in
[*Euro Surveill. 2020;25(5):pii=2000062*](https://doi.org/10.2807/1560-7917.ES.2020.25.5.2000062)
where the distribution of incubation times is shown to be characterizable by
multiple distributions, namely Weibull, Gamma and LogNormal distributions.
For the purpose of this work, we consider Gamma distributed incubation times
with mean $\mu_{inc} = 6.5\; \mathrm{d}$ and standard 
deviation $\sigma_{inc} = 2.6 \; \mathrm{d}$.

At the present moment we do not account for recovered or deceased cases; our 
analysis is strictly limited to the growth of the number of infected people.
The impact of finite recovery and death times is under consideration and
could be introduced soon.


### A _very_ short introduction to the $m+1$ _steady state_ solution

We introduce the $m+1$ _steady state_ logistic function
$$n_m(t) = \sigma( R_m(t) ) = 1\big/\Big[ 1 + e^{-R_m(t)} \Big] \, ,$$
where the time dependence is entirely contained within the argument of the 
logistic/sigmoid function.
We adopt the following representation for the argument of the logistic
function, representing the $m$ transitions between _steady states_
$$ R_m(t) = r_0(t-t_0) +\sum_{i=1}^m \big( r_i -r_{i-1} \big) I_X(t -t_i) \; , $$
where $r_0$ is the initial rate, $t_0$ is the global integration constant,
$m$ is the number of transitions, and $r_i$ and $t_i$ are the rate and
transition time of the $i^\mathrm{th}$ steady state after the initial steady
state.
Finally, the function $I_X(t -t_i)$ is the integral of the CDF of the
distribution $X$ that governs the transition, evaluated at $t-t_i$.
For $t_{i\neq0} \to \infty $ the general solution reduces to the solution of
the logistic differential equation.

For Gamma distributed incubation times, the integral of the
respective CDF reads
$$ I_\Gamma(t;\alpha,\beta) = \frac{1}{\Gamma(\alpha)}\bigg[ 
\bigg( t-\frac{\alpha}{\beta} \bigg)\gamma(\alpha,\beta t)
+\beta^{\alpha-1} t^\alpha e^{-\beta t} \bigg] \, , $$
where $\alpha$ and $\beta$ are shape and rate parameters for the Gamma
distribution; $\gamma(\alpha,\beta t)$ and $\Gamma(\alpha)$ are the
lower incomplete and complete Gamma functions.

### Generalizing beyond the time-dependent environment

__WIP:__ introduce notes on competing mechanisms for active and passive members
of the population.
In the context of CoViD19, the infected people are the active members, while
the combination of cured, immunized and deceased cases makes up the passive 
members. 
In the simplest form: $ n(t) \doteq a(t) +p(t) $.

### Analysis of CoViD19 cases

Below we apply this model to characterize the spread of CoViD19 in several
different countries, using data from the _ECDC_ (see the note on data sources
for details and copyright).

For the analysis at hand, the preprocessing of the dataset for each country
reduces to:
1. extract the following data fields for each country:
    1. date;
    2. number of daily infections;
    3. total population;
2. make a cumulative sum of daily infections;
3. normalize data to the total population;
4. convert dates to time in units of days and set initial time to date of first 
    entry in the date field for the relevant country;
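
The four preprocessing steps above can be sketched as follows. This is only an
illustration in Python using the standard library: the function name
`preprocess`, the record layout and the toy numbers are assumptions made here,
not the project's Julia code or ECDC data.

```python
from datetime import date
from itertools import accumulate

def preprocess(records, population):
    """Turn (date, daily_cases) records into (t_days, density) pairs.

    records    : iterable of (datetime.date, int), in any order
    population : total population N_t of the country
    """
    ordered = sorted(records)                       # oldest entry first
    first_day = ordered[0][0]                       # t = 0 at the first entry
    days = [(d - first_day).days for d, _ in ordered]
    cumulative = accumulate(c for _, c in ordered)  # running total of cases
    density = [total / population for total in cumulative]
    return list(zip(days, density))

# Made-up numbers for illustration only (not ECDC data).
toy = [(date(2020, 1, 3), 4), (date(2020, 1, 1), 1), (date(2020, 1, 2), 2)]
series = preprocess(toy, population=1000)
print(series)  # [(0, 0.001), (1, 0.003), (2, 0.007)]
```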

At present, we use the model to characterize the propagation of CoViD19
in three countries, namely South Korea, Denmark and Portugal.

#### South Korea


As shown in the previous figure, the growth of infections in South Korea is
clearly incompatible with a simple logistic regression model.
To clarify this beyond any reasonable doubt, the following figure shows the
best nonlinear least squares fit of a logistic model to the number of infections
![Logistic regression](img/South_Korea_logistic.svg)
where a simple visual inspection shows that this model fails to reproduce the
real data.

As discussed earlier, the logistic model can be shown to fit the data properly
by truncating the dataset, i.e. restricting the analysis to periods of time
during which the _environment_ is kept constant.
Below we address the impact of introducing additional parameters, as well as
the effect of truncating the dataset.
![multi steady state logistic regression](img/South_Korea_fits_converged.svg)

In this plot we display data for $t \in [ 40, 117 ] \, \mathrm{d}$ to facilitate
the interpretation of the plots; unless stated otherwise, our calculations 
use the entire dataset.

Starting from the simplest model, the blue curve shows that, by truncating
the analysis at $t = 66 \, \mathrm{d}$, the traditional logistic regression
represents the propagation of the infection rather well. It indicates a
growth rate $r_0 = 0.193 \, \mathrm{d}^{-1} $, while simultaneously predicting 
that the infection would spread to half of the population by day $t\sim 111$,
i.e. 2020-04-20!
While the estimate for the rate in the initial phase is likely to be correct,
the model clearly fails beyond day $t\sim 65$, urging us to consider
a model with at least two rates.

The green curve represents the best fit of a two steady state model,
parametrized with 
$r_i \simeq \{ 0.212, 0.00742 \} \, \mathrm{d}^{-1}$ and 
$t_i \simeq \{ 107,  58.9 \} \, \mathrm{d}$. 
This model follows the data closely until the transition
occurring circa day $t=65$, but fails to reproduce the data accurately beyond 
this transition.
Careful observation of the raw data shows that the slope of
$n(t)$ decreases circa day $t=95$, indicating another possible transition.
By truncating the analysis with the two steady state model at circa day $t=95$,
we obtain the orange curve, which closely follows all data within this domain.
This curve is parametrized by 
* $r_i \simeq \{ 0.241, 0.0112 \} \, \mathrm{d}^{-1}$,
* $t_i \simeq \{ 101, 57.8 \} \, \mathrm{d}$.

Finally, the magenta curve shows that the multiple steady state solution is
a very good candidate to characterize the propagation of CoViD19 in South
Korea, as this model accurately reproduces the data, while providing 
meaningful information compatible with the simpler truncated models.
The model is parametrized as follows
* $r_i \simeq \{ 0.250, 0.0122,  0.00211 \} \, \mathrm{d}^{-1}$, 
* $t_i \simeq \{ 99.5,  57.5,  88.5 \} \, \mathrm{d}$.

The respective standard errors and error margins for the parameters, to 3
significant digits, read
* $ \sigma_\mathrm{error}^{r_i} = \{ 0.00887,\; 0.000348, \; 0.000428 \}$;
* $ \sigma_\mathrm{error}^{t_i} = \{ 0.536, \; 0.167, \; 0.854 \}$;
* $ \delta_\mathrm{error}^{r_i} = \{ 0.0177, \; 0.000696, \; 0.000856 \}$;
* $ \delta_\mathrm{error}^{t_i} = \{ 1.07, \; 0.335, \; 1.71\}$.


While the significance of $t_0$ is lost in this model, i.e. it no longer
represents the time at which the half population mark is reached, as in the 
logistic model, the remaining times should represent the dates, 
namely 2020-02-26 and 2020-03-29, around which social behavior should
have changed significantly in South Korea.
The latter remains to be demonstrated... 
__COMPARE THESE DATES WITH POLICY CHANGE IN KOREA.__

__Note:__
It is important to note that nonlinear fits using multiple parameters are
notoriously problematic, as eloquently put by 
_John von Neumann_:
*"...with four parameters I can fit an elephant, and with five I can make him wiggle his trunk."*
[Nature volume 427, page 297 (2004)](https://www.nature.com/articles/427297a).

I am confident that all models discussed here have been properly fit, i.e.
the optimization process converges and outputs a set of meaningful parameters.
The nonlinear fit converges without difficulty in all cases, but it is
important to state that the results shown above depend on the initial
guess provided to the fitting routine. Please refer to the relevant
notebook (Korea.ipynb) for full details.
Furthermore, the three steady state model required bounding the parameters to 
positive values to ensure meaningful parameters. 






#### Denmark

WIP

#### Portugal

WIP


## 2. Brief review of the relevant differential equations

The present notebook concerns only population dynamics where the rate of change
of the population is proportional to the product of a polynomial in the 
population and a time-dependent term, i.e. $dn/dt \propto Q(n) P(t) $, 
where $Q(n)$ is a polynomial of degree one or higher in $n$ with constant 
coefficients. 
Without loss of generality the differential equation that governs such processes
reads $$ \frac{dn}{dt} = F(t,n) \, ,$$ where $F(t,n)$ encodes the dependence on
$t$ and  $n$. Since we are considering a polynomial dependence on $n$ with
constant coefficients, $F(t,n)$ can be expressed by two separable functions for
$t$ and $n$, namely $F(t,n) = P(t) Q(n)$.
First order nonlinear differential equations in the form $dn/dt = P(t)Q(n)$, 
known as separable first order differential equations, have straightforward
solutions in the form 
$$ \int \frac{dn}{Q(n)} = \int P(t) dt + C\, : C \in \mathbb{R} \, ,$$ where 
$C$ is an integration constant.
To proceed, we must define $Q(n)$ and $P(t)$. For constant environment (steady
state) problems, the rate of change has to be independent of time, i.e. 
$dP(t)/dt=0$.
In this scenario, we recover the differential equations of classic 
undergraduate problems, namely exponential and logistic growth for 
$Q(n) = n$ and $Q(n) = n(1-n)$, respectively.

### Constant environment assumption
In constant environments, $dP(t)/dt=0$, the solution reduces to 
$$\int \frac{dn}{Q(n)} = r_0 (t -t_0) \, ,$$ where we set 
$P(t) = r_0\,;\;C=-r_0 t_0\, :\; r_0, t_0 \in \mathbb{R}$.
All that is left to be done is to evaluate the indefinite integral 
$\int 1/Q(n) dn$.
An easy-to-follow list of integrals of rational functions is available at 
[List of integrals of rational functions](https://en.wikipedia.org/wiki/List_of_integrals_of_rational_functions).


#### The exponential growth/decay
Exponential growth, or decay, arises when the rate of change is linearly
proportional to the population, i.e. $Q(n)=n$. In this case the indefinite 
integral has a trivial solution $\int dn/n = \ln (n)$, hence the solution reads 
$$ \ln(n) = r_0(t-t_0) \Leftrightarrow n_E(t) = \exp[ r_0(t-t_0)] \, , $$ the 
so-called exponential growth/decay for positive/negative $r_0$ (we introduce the
subscript $E$ to identify this particular solution).

In a naïve interpretation of the evolution of the CoViD19 infections, one might 
be tempted to assume (or worse, to model and fit) an exponential growth of the 
infections. 
By taking the limit 
$$\lim_{t\to \infty} \exp[ r_0(t-t_0) ] = \infty \Leftarrow r_0>0$$ it should 
become obvious that exponential growth will inevitably lead to solutions where 
the infected population is larger than the total population!
Alternatively, we can show that this solution exhibits a constant 'p-multiply' 
time-scale
$$ n( t+\tau_p) = p n( t ) \Leftrightarrow \exp( r_0 \tau_p ) = p \Leftrightarrow \tau_p = \frac{ \ln( p ) }{ r_0 } \, .$$
Considering growth rates $r_0 = \{ 0.1, 0.15, 0.2, 0.25 \} \, \mathrm{d}^{-1} $,
the doubling times $(p=2)$ are 
$ \tau_2 \simeq \{ 6.9, 4.6, 3.5, 2.8 \} \,  \mathrm{d} $.
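
As a quick check of the doubling times quoted above, the relation 
$\tau_p = \ln(p)/r_0$ can be evaluated directly. The helper name 
`multiply_time` is introduced here only for illustration.

```python
import math

def multiply_time(p, r0):
    """Time for exponential growth at rate r0 (in 1/d) to multiply by p."""
    return math.log(p) / r0

rates = [0.1, 0.15, 0.2, 0.25]                 # growth rates in 1/d, from the text
doubling = [multiply_time(2, r) for r in rates]
print([round(tau, 1) for tau in doubling])     # [6.9, 4.6, 3.5, 2.8]
```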

__WIP:__ 
discuss exponential growth, maybe with plot and/or simple arithmetic examples 
such as $2^3 = 8$ vs $2^{(2\times3)} = 64$: doubling the elapsed time, measured
in units of $\tau_2$, squares the growth factor (here a result 8 times larger).

The following natural step is the introduction of a saturation mechanism.

#### The logistic equation
The logistic equation is a simple and beautiful solution to the above-mentioned
problem: instead of considering a rate of change of the population that is 
linear in the population, it considers the product of the population $n$ with 
the remaining population $1-n$, i.e. $Q(n) = n(1-n)$.
We can immediately verify that the rate of change vanishes as $n$ approaches the 
total population, thus ensuring that the population has an upper limit, 
$n(t) \leq 1$.

To solve the differential equation we must solve the indefinite integral 
$ \int 1/Q(n) dn $ that reads
$$\int \frac{dn}{n(1-n)} = -\ln \bigg|\frac{1-n}{n}\bigg|\,.$$
In turn, the solution to the differential equation reads
$$ \begin{align}
-\ln \bigg|\frac{1-n}{n}\bigg| &= r_0(t-t_0) \\
\Leftrightarrow \bigg|\frac{1-n}{n}\bigg| &= \exp[ -r_0(t-t_0) ] \\
\Leftrightarrow \frac{1-n}{n} &= \exp[ -r_0(t-t_0) ] \quad \text{for} \quad 0 < n < 1 \, ; \; r_0,t,t_0 \in \mathbb{R} \\
\Leftrightarrow n_L(t) &= \sigma(R)= \frac{1}{1+\exp[ -r_0(t-t_0) ]} = \frac{1}{1+e^{-R}} \; : R = r_0(t-t_0) \, ,
\end{align} $$
the so-called logistic function. 
We introduce the subscript $L$ to identify this particular solution and also 
highlight that this is the sigmoid function $\sigma(R)$, see (and references 
therein)
* https://mathworld.wolfram.com/LogisticEquation.html
* https://en.wikipedia.org/wiki/Logistic_function .
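
The closed-form solution can be cross-checked numerically. The sketch below
integrates $dn/dt = r_0 n(1-n)$ with a classical fourth-order Runge-Kutta
scheme and compares the result with $n_L(t)$. The parameter values and the
function names are arbitrary choices made for this illustration.

```python
import math

def logistic(t, r0, t0):
    """Closed-form solution n_L(t) = 1 / (1 + exp[-r0 (t - t0)])."""
    return 1.0 / (1.0 + math.exp(-r0 * (t - t0)))

def integrate_rk4(r0, n_start, t_start, t_end, steps):
    """Integrate dn/dt = r0 * n * (1 - n) with classical Runge-Kutta steps."""
    def rhs(n):
        return r0 * n * (1.0 - n)

    h = (t_end - t_start) / steps
    n = n_start
    for _ in range(steps):
        k1 = rhs(n)
        k2 = rhs(n + 0.5 * h * k1)
        k3 = rhs(n + 0.5 * h * k2)
        k4 = rhs(n + h * k3)
        n += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return n

r0, t0 = 0.2, 50.0                              # illustrative values only
n_numeric = integrate_rk4(r0, logistic(0.0, r0, t0), 0.0, 100.0, 2000)
n_exact = logistic(100.0, r0, t0)
print(abs(n_numeric - n_exact))                 # small if n_L solves the equation
```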

As in the previous case, the sign of $r_0$ (positive/negative) defines 
growth/decay.

__WIP:__ plot $\sigma( r_0( t -t_0 ) )$ for several $r_0$.

Since the logistic growth is bounded, $ n(t) \in [0,1] $, the analysis based on 
$p-$multiplying time scales $\tau_p$ is, in general, flawed. 
Alternatively, we can invert the solution and compute the time necessary to 
reach a given fraction of the total population, $n(t) = a/b$, which reads
$ t = t_0 + r_0^{-1} \ln[ a/(b-a)] \; \forall \; b>a>0 \, .$ 
Notably, the time to reach half of the total population is $t_0$, and the total 
population is reached only asymptotically, as $t \to \infty$.
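
Inverting $n_L(t) = a/b$ gives $t = t_0 + \ln[a/(b-a)]/r_0$. The sketch below
evaluates this expression and feeds the result back into the logistic function
as a consistency check. The values $r_0 = 0.193 \, \mathrm{d}^{-1}$ and 
$t_0 = 111 \, \mathrm{d}$ are borrowed from the truncated South Korea fit purely
as an illustration, and the function names are assumptions of this example.

```python
import math

def logistic(t, r0, t0):
    return 1.0 / (1.0 + math.exp(-r0 * (t - t0)))

def time_to_fraction(a, b, r0, t0):
    """Time at which the logistic density reaches n = a/b, for 0 < a < b."""
    if not 0 < a < b:
        raise ValueError("requires 0 < a < b")
    return t0 + math.log(a / (b - a)) / r0

r0, t0 = 0.193, 111.0                      # illustrative values (1/d and d)
t_half = time_to_fraction(1, 2, r0, t0)    # half population: equals t0
t_tenth = time_to_fraction(1, 10, r0, t0)  # 10% of the population: earlier than t0
print(t_half, t_tenth, logistic(t_tenth, r0, t0))
```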

Notwithstanding the above, in the initial phase of population growth, 
$ t \ll t_0 $ or equivalently $n(t) \ll 1/2$, the exponential term in the
denominator of the logistic function dominates and the solution can be
approximated by $n_L(t) \sim \exp[ r_0(t-t_0) ] = n_E(t)$.
Therefore, in this regime a $p-$multiplying time scale can be defined, 
$\tau_p = \ln( p )/ r_0$.

This solution is known to accurately characterize the growth of populations 
when the environment (parametrized via $r_0$) is kept constant. 
This constraint is crucial, as small changes in $r_0$ can drastically 
change the time necessary to infect a large portion of the population.

#### Growth rate coefficient $r_0$ (a very simple model)

As discussed in the motivation, the constant environment assumption is broken
when public health measures impose new rules and procedures designed to 
mitigate the propagation of the infection.
This is not novel: it is well documented that policy changes the propagation
rates. Please refer to the WHO infection prevention website for 
[general](https://www.who.int/infection-prevention/en/) and 
[CoViD19 specific](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/technical-guidance/infection-prevention-and-control) 
information.

The role of $r$ in the logistic differential equation $dn/dt = r n(1-n)$ can be 
understood as a representation of the likelihood of an infected individual
passing the infection to a non-infected person, which can be mapped to a (very)
simple model such as $r \propto E p$, where $E$ is an estimate for the number of 
interactions between individuals per unit of time and $p$ the probability of 
transmitting the infection at each interaction.

Considering such model, it becomes evident that the growth rate will change as 
governments impose restrictions, e.g.:
* forcing people to stay at home reduces the number of interactions
* washing hands and face more frequently will likely reduce the probability of 
    transmitting the infection as the virus is literally washed away from skin
* increasing the distance at which people have interactions

Is there evidence for a change in the rate? Yes, consider for instance the log
plot of the number of infections in South Korea.

__WIP:__ ADD PLOT WITH LINEAR FITS AND EXPLAIN

### Time dependent environment
As discussed previously, we expect $d P(t)/dt = 0$ to no longer hold. 
Therefore, we must define $P(t)$ and evaluate the respective integral. 
There are infinitely many options to introduce an explicit time dependence on 
$P(t)$. 
Below, we consider generalizations that obey, as far as possible, the 
following:
* the constant rate model should be a limit of $P(t)$;
* the model should contain as few parameters as possible;
* the model should incorporate the merits of the constant rate model;
* the model should be compatible with real data.

#### $m$ transitions between $m+1$ steady states

The success of the constant rate logistic model suggests that even models with
simple time dependence could have potential to accurately characterize the 
population growth. 
A fairly simple model could consist of a series of $m+1$ steady states, each 
with a specific constant rate, that change discontinuously at $m$ transition
times.
This model can be represented by
$$ P(t) = r_0 +\sum_{i=1}^m \big( r_i -r_{i-1} \big) \Theta( t -t_i ) \; , $$
where $\Theta(t)$ is the Heaviside theta function, i.e. the step function, see: 
1. https://en.wikipedia.org/wiki/Heaviside_step_function;
2. https://mathworld.wolfram.com/HeavisideStepFunction.html.

This model naturally incorporates the merits of the constant rate model and can 
accommodate discrete changes in policy, at the cost of adding $2m$ new 
parameters ($m$ rates and $m$ times). 
Furthermore, it is trivial to show that this model reduces to the constant rate 
model when either:
* $r_i =r_0 \; \forall \,i\geq 1$, or
* $t_i \to \infty \; \forall \,i\geq 1$.

Finally, this model has an additional merit: the discrete change of rates would 
be clearly visible in log plots, as different rates generate straight lines with
different slopes at this plotting scale.
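
A minimal sketch of the step model follows. Each Heaviside step in $P(t)$
integrates to a ramp, $(t-t_i)\,\Theta(t-t_i)$, so the argument of the logistic
function becomes piecewise linear. The function names, the list layout
(`rates = [r_0, ..., r_m]`, `times = [t_1, ..., t_m]`) and the parameter values
are assumptions made for this example.

```python
import math

def step_rate(t, rates, times):
    """P(t): rates[0] before times[0], then rates[i] after the i-th transition."""
    p = rates[0]
    for i, t_i in enumerate(times, start=1):
        if t >= t_i:                              # Heaviside step at t_i
            p += rates[i] - rates[i - 1]
    return p

def step_argument(t, t0, rates, times):
    """R(t): antiderivative of P(t); each step integrates to a ramp."""
    big_r = rates[0] * (t - t0)
    for i, t_i in enumerate(times, start=1):
        big_r += (rates[i] - rates[i - 1]) * max(t - t_i, 0.0)
    return big_r

def density(t, t0, rates, times):
    return 1.0 / (1.0 + math.exp(-step_argument(t, t0, rates, times)))

# Illustrative parameters: one transition from 0.2/d to 0.05/d at day 60.
rates, times, t0 = [0.2, 0.05], [60.0], 100.0

# While n << 1, the slope of ln n(t) follows the current rate.
slope_before = (math.log(density(50, t0, rates, times))
                - math.log(density(40, t0, rates, times))) / 10
slope_after = (math.log(density(80, t0, rates, times))
               - math.log(density(70, t0, rates, times))) / 10
print(slope_before, slope_after)                  # close to 0.2 and 0.05
```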

While this model appears to tick all boxes, it always forces the rate change to 
take place instantly at times $t_i$, which is not consistent with presently 
available data, which changes smoothly.
Data from Denmark and other countries appears to include some transitions that
occur on time scales significantly shorter than the mean incubation time.
Nonetheless, I would not consider this a solid argument to use this model.

In the case of an infection there is a natural mechanism that introduces a 
variable delay on the propagation of the disease, the incubation time! 


#### Smooth transitions via distribution of incubation times

The manifestation of infection is not immediate due to finite (distribution of) 
incubation time(s). Consequently, any change to the environment, i.e. the
growth rate, will necessarily be _delayed_ and also _smoothed_ by the 
distribution of incubation times.

In short, from the moment that someone gets infected, it takes a given time 
interval, on average $\mu_{inc}$, for the infection to manifest. 
Furthermore, this incubation time is not constant, varying from case to case.

The details pertaining to disease incubation and its characterization lie
outside the scope of this project.
Hence, rather than trying to explain something that also lies outside my area
of knowledge, we refer to the following journal article 
[*Euro Surveill. 2020;25(5):pii=2000062*](https://doi.org/10.2807/1560-7917.ES.2020.25.5.2000062).
In this article, the authors characterize the distribution of incubation times
with several distributions.
For the purposes of this project, we will start by considering only the Gamma
distribution, but other distributions might be considered in the future.

Here, we propose using the distribution of incubation times, namely its 
cumulative distribution function (CDF), __as a proxy__ for the probability of 
getting infected. In this model
$$ P(t) = r_0 +\sum_{i=1}^m \big( r_i -r_{i-1} \big) F_X( t-t_i; \alpha, \beta ) \, , $$
where $F_X( t-t_i; \alpha, \beta )$ is the CDF for the incubation times, with $X$ labeling the distribution.

>__Important notes:__ 
> * this model is a conjecture; even if it can be used to accurately model real 
    data, it still requires evidence for its validity;
> * the relation between $F_X( t-t_i; \alpha, \beta )$ and the probability of 
    transmitting the infection, $p$, at a given transition is, to the best of my 
    knowledge, only a correlation.
    At the present moment I have no evidence to substantiate a causal relation
    (see the discussion: *Growth rate coefficient $r_0$*).

The above-mentioned article characterizes the incubation time with three 
distributions, namely Weibull, Gamma and LogNormal. In the present notebook, we
consider only the Gamma distribution,
but a similar analysis can be carried out with the Weibull distribution.
This reference provides the mean incubation time $\mu_{inc} = 6.5\; \mathrm{d}$ 
and the standard deviation $\sigma_{inc} = 2.6 \; \mathrm{d}$, from which we 
determine the parametrization of the Gamma distribution, namely 
$\alpha = \mu_{inc}^2/\sigma_{inc}^2 = 6.25 $ and  
$\beta = \mu_{inc}/\sigma_{inc}^2 \simeq 0.96 \; \mathrm{d}^{-1}$.
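
The conversion from mean and standard deviation to shape and rate is a
one-liner. The helper name `gamma_shape_rate` is made up for this sketch.

```python
def gamma_shape_rate(mean, std):
    """Shape alpha and rate beta of a Gamma distribution with given mean and std."""
    variance = std ** 2
    return mean ** 2 / variance, mean / variance

alpha, beta = gamma_shape_rate(6.5, 2.6)   # incubation time statistics, in days
print(round(alpha, 2), round(beta, 2))     # 6.25 0.96
```

As a sanity check, the ratio `alpha / beta` recovers the mean incubation time.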

##### The Gamma distribution

The CDF for a Gamma distribution reads 
$$ F_\Gamma(t;\alpha,\beta) = \gamma(\alpha, \beta t) / \Gamma(\alpha)$$
where $ \gamma(\alpha, \beta t) $ and $\Gamma(\alpha)$ are the lower incomplete
and the complete gamma functions, see:
* [*Abramowitz & Stegun* 1970 26.1.32](http://www.math.ubc.ca/~cbm/aands/abramowitz_and_stegun.pdf)
* https://en.wikipedia.org/wiki/Gamma_distribution
* https://dlmf.nist.gov/8.2
* https://dlmf.nist.gov/5.2.

The Gamma distribution is formally defined for $t \geq 0$, $\alpha>0$ and 
$\beta>0$, where $t=0$ represents the instant at which an individual is 
infected. 
It is natural to extend the domain to negative times by setting 
$F_\Gamma(t<0;\alpha,\beta) = 0$. 
This extension can lead to discontinuities in $F_\Gamma(t;\alpha,\beta)$ and 
in its derivative 
$\partial F_\Gamma(t;\alpha,\beta)/\partial t \doteq F_\Gamma '(t;\alpha,\beta)$
at $t=0$. Requiring both to be continuous there, i.e. imposing 
$\lim_{t\to 0^-} F_\Gamma  (t;\alpha,\beta) = \lim_{t\to 0^+} F_\Gamma  (t;\alpha,\beta) $ 
and 
$\lim_{t\to 0^-} F_\Gamma '(t;\alpha,\beta) = \lim_{t\to 0^+} F_\Gamma '(t;\alpha,\beta) $,
we find the additional constraint $\alpha > 1$.
The respective probability density function (PDF) is obtained by differentiating
the CDF; it reads
$$ f_\Gamma(t;\alpha,\beta) = \frac{ \partial F_\Gamma(t;\alpha,\beta) }{ \partial t}
= \frac{ \beta^\alpha t^{\alpha-1} e^{-\beta t} }{ \Gamma(\alpha) } \, .$$
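
The PDF above is easy to evaluate with the standard library alone. The sketch
below implements it and recovers the CDF by numerically integrating the PDF
with Simpson's rule, which avoids relying on an incomplete gamma routine. This
is an illustrative substitute for a special-function library, and the function
names are assumptions of this example.

```python
import math

def gamma_pdf(t, alpha, beta):
    """Gamma PDF f(t; alpha, beta), extended with f = 0 for t <= 0."""
    if t <= 0.0:
        return 0.0  # matches the t -> 0+ limit only when alpha > 1
    log_f = (alpha * math.log(beta) + (alpha - 1.0) * math.log(t)
             - beta * t - math.lgamma(alpha))
    return math.exp(log_f)

def gamma_cdf(t, alpha, beta, steps=2000):
    """Gamma CDF F(t), from composite Simpson integration of the PDF."""
    if t <= 0.0:
        return 0.0
    h = t / steps  # steps must be even for Simpson's rule
    total = gamma_pdf(0.0, alpha, beta) + gamma_pdf(t, alpha, beta)
    for k in range(1, steps):
        weight = 4.0 if k % 2 else 2.0
        total += weight * gamma_pdf(k * h, alpha, beta)
    return total * h / 3.0

alpha, beta = 6.25, 6.5 / 2.6 ** 2           # parameters used in this notebook
print(gamma_cdf(6.5, alpha, beta))           # fraction of incubation times below the mean
print(gamma_cdf(60.0, alpha, beta))          # should be very close to 1
```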

Having defined $P(t)$, we can proceed with our calculation. 
To highlight the variable number of parameters in this model, we add the 
subscript $m$ to our definition, hence the integral in the time domain of the 
new growth rate function reads
$$ R_m(t) = \int_{-\infty}^t P(t')dt' = r_0(t-t_0)
+\sum_{i=1}^m \big( r_i -r_{i-1} \big) I_\Gamma(t -t_i,\alpha,\beta) \; , $$
and the integral of the CDF is
$$ I_\Gamma(t,\alpha,\beta) = \int_{-\infty} ^t F_\Gamma(t',\alpha,\beta) dt'
= \frac{1}{\Gamma(\alpha)}\bigg[ \bigg( t-\frac{\alpha}{\beta} \bigg)\gamma(\alpha,\beta t)
+\beta^{\alpha-1} t^\alpha e^{-\beta t} \bigg] \, . $$
Despite its nonlinear nature, in the limit of large $t$ the integral of the 
CDF reduces to a shifted linear function, namely 
$ I_\Gamma(t,\alpha,\beta) \simeq t -\alpha/\beta $ when $\beta t \gg 1$. 
Furthermore, this model reduces to the simple constant rate model whenever 
$r_i = r_0$ or $t_i\to \infty$.
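
The closed form of $I_\Gamma$ can be recovered with a single integration by
parts. The following is a sketch of the steps for $t \geq 0$, using the
recurrence $\gamma(\alpha+1,x) = \alpha\,\gamma(\alpha,x) - x^\alpha e^{-x}$ of
the lower incomplete gamma function; the lower limit moves from $-\infty$ to
$0$ because $F_\Gamma$ vanishes for negative times.
$$ \begin{align}
I_\Gamma(t,\alpha,\beta) &= \int_0^t F_\Gamma(t',\alpha,\beta)\, dt'
= t\, F_\Gamma(t,\alpha,\beta) - \int_0^t t' f_\Gamma(t',\alpha,\beta)\, dt' \\
&= \frac{t\, \gamma(\alpha,\beta t)}{\Gamma(\alpha)}
- \frac{\gamma(\alpha+1,\beta t)}{\beta\, \Gamma(\alpha)} \\
&= \frac{1}{\Gamma(\alpha)}\bigg[ \bigg( t-\frac{\alpha}{\beta} \bigg)\gamma(\alpha,\beta t)
+\beta^{\alpha-1} t^\alpha e^{-\beta t} \bigg] \, .
\end{align} $$
For $\beta t \gg 1$, $\gamma(\alpha,\beta t) \to \Gamma(\alpha)$ and the
exponential term vanishes, which leaves the shifted linear behaviour 
$t - \alpha/\beta$.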

Since this model only changes the time-dependent part of the logistic equation, 
$R_m(t)$, the solution to this modified logistic differential equation, with
$m+1$ constant rates, reads
$$n_m(t) = 1\big/\Big[ 1 + e^{-R_m(t)} \Big] 
= 1 \big /\bigg( 1 + \exp\Big[ -r_0(t-t_0)
-\sum_{i=1}^m \big( r_i -r_{i-1} \big) I_\Gamma(t-t_i,\alpha,\beta) \Big] \bigg) \, . $$
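
To make the final expression concrete, here is a minimal Python sketch of 
$n_m(t)$ with Gamma distributed transitions. It is not the project's Julia
implementation: the regularized incomplete gamma function is approximated with
its standard power series (adequate for the moderate values of $\beta t$ that
occur here), and all function names are assumptions of this example. The
parameters are the rounded three steady state values reported for South Korea,
used only as an illustration.

```python
import math

def lower_reg_gamma(a, x, tol=1e-14, max_terms=10000):
    """gamma(a, x) / Gamma(a) from the series sum_k x^k / (a (a+1) ... (a+k))."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for k in range(1, max_terms):
        term *= x / (a + k)
        total += term
        if term < tol * total:
            break
    # Prefactor x^a e^(-x) / Gamma(a), evaluated in log space.
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def gamma_cdf_integral(t, alpha, beta):
    """I_Gamma(t): integral of the Gamma CDF up to t (zero for t <= 0)."""
    if t <= 0.0:
        return 0.0
    x = beta * t
    tail = math.exp(alpha * math.log(x) - x - math.lgamma(alpha)) / beta
    return (t - alpha / beta) * lower_reg_gamma(alpha, x) + tail

def density(t, t0, rates, times, alpha, beta):
    """n_m(t) for rates = [r_0, ..., r_m] and times = [t_1, ..., t_m]."""
    big_r = rates[0] * (t - t0)
    for i, t_i in enumerate(times, start=1):
        big_r += (rates[i] - rates[i - 1]) * gamma_cdf_integral(t - t_i, alpha, beta)
    return 1.0 / (1.0 + math.exp(-big_r))

alpha, beta = 6.25, 6.5 / 2.6 ** 2        # Gamma parameters of the incubation times
rates = [0.250, 0.0122, 0.00211]          # r_0, r_1, r_2 in 1/d (rounded fit values)
t0, times = 99.5, [57.5, 88.5]            # t_0 and transition times t_1, t_2 in d
for t in (40, 60, 80, 100, 117):
    print(t, density(t, t0, rates, times, alpha, beta))
```

Setting all rates equal, or pushing the transition times far beyond the range
of interest, makes `density` collapse onto the constant rate logistic function,
as stated above.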