# Population growth
## Stepping beyond the constant rate logistic growth

    Author: Fábio Hipólito
    Contact: fabio.hipolito@gmail.com
    github: https://github.com/f-hipolito


## The motivation and purpose

This notebook was written to refresh my knowledge of population dynamics 
(primarily population growth) and of how first order differential equations
can be applied to characterize such problems.

The interest in this class of problems is tightly connected with the 2019-2020 
Corona Virus Disease (CoViD19) outbreak and with the spread of information and
statistical analysis of dubious quality. 
Given the impact of this infection, many people are trying to make sense of 
the propagation of the disease with either over-simplistic models, such as a
simple logistic model, or with overly complex ML based models.

It is my understanding that before addressing this problem from an ML 
perspective we can, and should, analyze it with fairly simple models 
containing few parameters that can be mapped to our understanding of 
reality, _e.g._ incubation times, policy change dates, etc.

As discussed (and demonstrated) below, the logistic function is the solution 
of the population growth problem if and only if the *environment is 
constant*. 
This assumption is incompatible with the propagation of CoViD19, as every country
in the world is implementing some sort of mitigation policy to reduce the
growth rate of the infection.
The present notebook makes no claim regarding the latter. Clear and 
solid information on the impact of policy on the propagation of infections
can be found at the WHO website, both in 
[general](https://www.who.int/infection-prevention/en/) and for 
[CoViD19](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/technical-guidance/infection-prevention-and-control).


In this notebook I delve into some generalizations of a logistic differential
equation to overcome the key limitations of this model, namely the 
assumption of constant environment and infinite (mean) lifetimes.
We compare the merits and limitations of the methods by applying them to the
characterization of the propagation of CoViD19.
Using the multi-steady-state model we identify the 'transition' dates where a
clear change in the growth rate constants can be observed.

Furthermore, we extend our model beyond the simple differential equation for
_infected_ cases by considering a system of coupled differential equations 
for the *Susceptible*, *Infected*, *Recovered* and *Deceased*, *i.e.* a
generalization of the **SIR** model.

### Notes to the reader

__Mathematics:__ I will try to provide a clear and easy to follow presentation of 
the mathematics behind population dynamics for readers with basic knowledge of 
differential calculus. For those unfamiliar with differential calculus, I 
recommend referring to:
1. [*Advanced Mathematical Methods for Engineering and Science Students* by Stephenson and Radmore](https://doi.org/10.1017/CBO9781139168120)
2. any edition of [*Mathematical Methods for Physicists* by Arfken, Weber and Harris](
https://doi.org/10.1016/C2009-0-30629-7).

__Notation:__
* $N(t)$ population at time $t$
* $N_t$ total population
* $n \equiv n(t) \doteq N(t)/N_t$ population density, note that the population 
    density is dimensionless $[n] = \mathbb{1}$.
* $dn/dt$ stands for the population rate of change, with dimension 
    $[ dn/dt ] = \mathbb{1}/T$, where $T$ stands for Time
* $\omega(t)$ and $\omega_i$ are the general and $i^\mathrm{th}$ growth rates, with 
    dimension $[ \omega ] = \mathbb{1}/T$.
* $\mathrm{d}$ stands for days, unless otherwise stated all time scales and
    growth rates are in units of $\mathrm{d}$ and $\mathrm{d}^{-1}$,
    respectively.
* $f_X(t)$ probability density function (PDF) for a _continuous_ variable $t$
    distributed according to $X-$distribution.
* $F_X(t)$ cumulative distribution function (CDF) corresponding to $f_X(t)$ 

__Real number rounding:__ 
Figures are rounded to 3 significant digits by default. Please refer to the attached
notebooks for figures with machine precision.

__Data sources:__
We collected our data from the _European Centre for Disease Prevention and 
Control_ [(ECDC)](https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide), refer to README.md file for
further details. 
Latest data collection 2020-04-25.

## 1. Generalization(s) of the Logistic differential equation


A step-by-step generalization of the logistic differential equation,
$dn/dt = \omega n(1-n)$, and the respective solutions are provided below. 
Here, we offer a brief presentation of the key arguments supporting this model
and present the solution that is used to characterize the propagation of 
CoViD19 in a few countries.

The manifestation of infection is not immediate due to finite (distribution of) 
incubation time(s). 
Consequently, the effect of any change to the environment, _i.e._ to the growth rate,
will necessarily be _delayed_ and also _smoothed_ by the distribution of 
incubation times.
Consider the following growth rate with time dependence governed by a 
cumulative distribution function for the incubation times
![Time dependent R](img/time_dependent_R.svg)
where the rate transitions smoothly between rates 
$\omega_i = \{ 2, 0.5, 1.5\} \, \mathrm{d}^{-1}$ at times 
$t_i = \{ 10, 35 \} \, \mathrm{d}$.
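The smooth rate described above can be sketched in a few lines of Python. This is only an illustration, not the notebook's implementation: the function names are hypothetical, the Gamma CDF is evaluated with a simple power series, and the Gamma parameters are assumed to follow from the mean of 6.5 d and standard deviation of 2.6 d adopted later in this section.

```python
import math

def gamma_cdf(t, alpha, beta):
    """CDF of a Gamma(shape=alpha, rate=beta) variable, from the power series
    of the lower incomplete gamma function (adequate for moderate alpha)."""
    x = beta * t
    if x <= 0.0:
        return 0.0
    if x > 500.0:  # far tail: the CDF equals 1 to machine precision here
        return 1.0
    term = 1.0 / alpha
    total = term
    k = 0
    while term > 1e-16 * total and k < 10_000:
        k += 1
        term *= x / (alpha + k)
        total += term
    log_prefactor = alpha * math.log(x) - x - math.lgamma(alpha)
    return min(1.0, total * math.exp(log_prefactor))

def growth_rate(t, omegas, t_switch, alpha, beta):
    """omega(t) = omega_0 + sum_i (omega_i - omega_{i-1}) * F(t - t_i)."""
    rate = omegas[0]
    for w_prev, w_next, t_i in zip(omegas, omegas[1:], t_switch):
        rate += (w_next - w_prev) * gamma_cdf(t - t_i, alpha, beta)
    return rate

OMEGAS = [2.0, 0.5, 1.5]   # d^-1, rates quoted in the text
T_SWITCH = [10.0, 35.0]    # d, transition times quoted in the text
ALPHA = (6.5 / 2.6) ** 2   # assumed shape, from mean 6.5 d and std 2.6 d
BETA = 6.5 / 2.6 ** 2      # assumed rate (d^-1), same assumption

for t in (0, 10, 20, 35, 45, 60):
    print(t, round(growth_rate(t, OMEGAS, T_SWITCH, ALPHA, BETA), 3))
```

Each change of rate starts at $t_i$ and is completed only after a couple of incubation times, which is the delay and smoothing discussed above.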

This approach is useful to characterize growth in environments where the growth
rate evolves, smoothly, between different _constant_ or _steady state_ values.
While at first glance this might appear to be a toy model for a mathematician, 
an inspection of the number of cases of CoViD19 in South Korea reveals that multi 
steady state logistic models can be useful.
In the log plot depicted below, we plot the number of cases in South Korea and three
logistic regressions, each fitted to a subset of the original dataset.
![Truncated logistic regression in log scale](img/Korea_South_truncated_log.svg)
To begin, this plot clearly shows why it is important to generalize the 
logistic model to accommodate multiple rates, as no single rate logistic model 
will ever represent this data accurately.
More subtly, it also offers the possibility to infer that the transition occurs

Assuming that the transition between _constant_ rates is governed by the CDF 
for the incubation times, we require a characterization of the incubation times
to define our model.
This task has already been completed recently in
[*Euro Surveill. 2020;25(5):pii=2000062*](https://doi.org/10.2807/1560-7917.ES.2020.25.5.2000062)
where the distribution of incubation times is shown to be characterizable by
multiple distributions, namely Weibull, Gamma and LogNormal distributions.
For the purpose of this work, we consider Gamma-distributed
incubation times with mean $\mu_{inc} = 6.5\; \mathrm{d}$ and standard 
deviation $\sigma_{inc} = 2.6 \; \mathrm{d}$.
_Note:_ recently published results 
[*Ann Intern Med. 2020;172(9):577-582*](https://doi.org/10.7326/M20-0504)
set the shape and scale parameters at $\alpha = 5.81$ and 
$\beta = 0.948 \, \mathrm{d}^{-1}$; the corresponding mean 
$\mu_{inc} = \alpha/\beta = 6.13 \, \mathrm{d}$ and standard deviation 
$\sigma_{inc} = \sqrt{ \alpha/\beta^2 } = 2.54 \, \mathrm{d}$ are in line with
the previous reference.
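The conversion between the moments and the Gamma parameters can be sketched as follows. The helper names are hypothetical, and $\beta$ is treated as a rate, as in the formulas $\mu_{inc} = \alpha/\beta$ and $\sigma_{inc} = \sqrt{\alpha/\beta^2}$ used above.

```python
import math

def gamma_from_moments(mean, std):
    """Shape alpha and rate beta of a Gamma distribution with given mean and std."""
    alpha = (mean / std) ** 2
    beta = mean / std ** 2
    return alpha, beta

def gamma_moments(alpha, beta):
    """Mean and standard deviation for shape alpha and rate beta."""
    return alpha / beta, math.sqrt(alpha) / beta

# Moments adopted in this work: mean 6.5 d, std 2.6 d
alpha, beta = gamma_from_moments(6.5, 2.6)
print(f"alpha = {alpha:.3g}, beta = {beta:.3g} d^-1")

# Parameters quoted in the note, read as shape and rate
mean, std = gamma_moments(5.81, 0.948)
print(f"mean = {mean:.3g} d, std = {std:.3g} d")
```

With the moments adopted here this gives a shape of about 6.25 and a rate of about 0.962 $\mathrm{d}^{-1}$.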

At present, this part of the analysis does not account for recovered or deceased cases; 
it is strictly limited to the growth of the number of infected people.
The impact of finite recovery and death times is under consideration and
could be introduced soon.


### A _very_ short introduction to the $m+1$ _steady state_ solution

Without loss of generality, we introduce the $m+1$ _steady state_ logistic 
function
$$n_m(t) = \sigma\big( \Omega_m(t) \big) 
= 1\big/\Big[ 1 + e^{-\Omega_m(t)} \Big] \, ,$$
where the time dependence is entirely contained within the argument of the 
logistic/sigmoid function.
We adopt the following representation for the argument of the logistic function,
which describes the $m$ transitions between _steady states_
$$ \Omega_m(t) = \omega_0(t-t_0) 
+\sum_{i=1}^m \big( \omega_i -\omega_{i-1} \big) I_X(t -t_i) \; , $$
where $\omega_0$ is the initial rate, $t_0$ is the global integration constant,
$m$ is the number of transitions, and $\omega_i$ and $t_i$ are the 
rate and transition time of the $i^\mathrm{th}$ steady state after the
initial steady state.
Finally, the function $I_X(t -t_i)$ is the integral of the CDF of 
distribution $X$, evaluated at $t-t_i$, and represents the transition that starts at time $t_i$.
For $t_{i\neq0} \to \infty $ no transition ever takes place, and the general solution
reduces to the solution of the logistic differential equation.
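As a consistency check, differentiating this solution recovers a logistic equation with a time dependent rate. The step below is a sketch that assumes $I_X(t) = \int_0^t F_X(t')\,dt'$ with $F_X(t) = 0$ for $t \le 0$, which is how $I_X$ is described above:
$$ \begin{align}
\frac{d n_m}{dt} &= \frac{d\Omega_m}{dt}\,
\sigma\big(\Omega_m\big)\Big[ 1 -\sigma\big(\Omega_m\big) \Big]
= \omega(t)\, n_m (1 -n_m) \, ,\\
\omega(t) &\doteq \frac{d\Omega_m}{dt}
= \omega_0 +\sum_{i=1}^m \big( \omega_i -\omega_{i-1} \big) F_X(t -t_i) \, .
\end{align}$$
Well before the first transition every $F_X$ vanishes and $\omega(t) = \omega_0$; well after the last transition every $F_X \to 1$, the sum telescopes and $\omega(t) \to \omega_m$.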

For a set of Gamma-distributed incubation times, the integral of the
respective CDF reads
$$ I_\Gamma(t;\alpha,\beta) = \frac{1}{\Gamma(\alpha)}\bigg[ 
\bigg( t-\frac{\alpha}{\beta} \bigg)\gamma(\alpha,\beta t)
+\beta^{\alpha-1} t^\alpha e^{-\beta t} \bigg] \, , $$
where $\alpha$ and $\beta$ are shape and rate parameters for the Gamma
distribution; $\gamma(\alpha,\beta t)$ and $\Gamma(\alpha)$ are the
lower incomplete and complete Gamma functions.
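The closed form above can be cross-checked numerically. The sketch below is not the notebook's code: function names are hypothetical, the incomplete Gamma function is evaluated with a plain power series, $I_\Gamma$ is assumed to vanish for $t \le 0$, and the parameters passed to `n_multi` are purely illustrative.

```python
import math

def lower_gamma(a, x):
    """Lower incomplete gamma function gamma(a, x) from its power series."""
    if x <= 0.0:
        return 0.0
    if x > 500.0:  # far tail: gamma(a, x) equals Gamma(a) to machine precision
        return math.gamma(a)
    term = 1.0 / a
    total = term
    k = 0
    while term > 1e-16 * total and k < 10_000:
        k += 1
        term *= x / (a + k)
        total += term
    return total * math.exp(a * math.log(x) - x)

def cdf_integral(t, alpha, beta):
    """Closed form I_Gamma(t; alpha, beta); assumed to be zero for t <= 0."""
    if t <= 0.0:
        return 0.0
    g = lower_gamma(alpha, beta * t)
    boundary = beta ** (alpha - 1.0) * t ** alpha * math.exp(-beta * t)
    return ((t - alpha / beta) * g + boundary) / math.gamma(alpha)

def cdf_integral_numeric(t, alpha, beta, steps=2000):
    """Trapezoidal integral of the Gamma CDF from 0 to t (cross-check only)."""
    h = t / steps
    cdf = lambda x: lower_gamma(alpha, beta * x) / math.gamma(alpha)
    inner = sum(cdf(j * h) for j in range(1, steps))
    return h * (0.5 * cdf(0.0) + inner + 0.5 * cdf(t))

def n_multi(t, t0, omegas, t_switch, alpha, beta):
    """m+1 steady state logistic function n_m(t) = sigma(Omega_m(t))."""
    big_omega = omegas[0] * (t - t0)
    for w_prev, w_next, t_i in zip(omegas, omegas[1:], t_switch):
        big_omega += (w_next - w_prev) * cdf_integral(t - t_i, alpha, beta)
    return 1.0 / (1.0 + math.exp(-big_omega))

ALPHA = (6.5 / 2.6) ** 2   # shape from mean 6.5 d and std 2.6 d
BETA = 6.5 / 2.6 ** 2      # rate (d^-1) from the same moments

print(cdf_integral(12.0, ALPHA, BETA), cdf_integral_numeric(12.0, ALPHA, BETA))
# Illustrative two-rate example: t0 = 30 d, one transition at 40 d
print(n_multi(50.0, 30.0, [0.2, 0.05], [40.0], ALPHA, BETA))
```

For $t$ much larger than the mean incubation time, $I_\Gamma(t) \to t -\alpha/\beta$, so each transition eventually acts as a change of slope in $\Omega_m(t)$ delayed by the mean incubation time.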

Below we show that the $m+1$ steady state model can be used to accurately 
characterize the propagation of the infection in several countries.
While this simple model is useful and permits the identification of the _steady
state_ rates and respective transition times, it is built under the assumption 
of endless lifetime for the infected individuals.
As patients recover or die, the pool of infected individuals shrinks, leading to
an inevitable breakdown of the model.
The constraints associated with this assumption are discussed below in the 
analysis of the data from South Korea.

The natural next step for any model is the inclusion of a finite lifetime, or 
alternatively a dissipation mechanism.
It is worth noting that tentative figures for mean infected time $\tau_a = 21 \, \mathrm{d}$ 
[DOI:10.1016/S1473-3099(20)30287-5](https://doi.org/10.1016/S1473-3099(20)30287-5)
have been published in scientific literature.

### General time dependent rate and decay times

As discussed above any model built upon the assumption of boundless lifetime for
the infection will inevitably fail in the long run.
In this section we generalize our model to accommodate mechanisms that represent
the finite lifetime of the infection and also a more elaborate model for the
time dependent rates.

To begin, we introduce a new representation for the population, $n(t)$, in which 
the population is split into three non-intersecting groups, defined by 
$N(t)/ N_t = s(t) +i(t) +r(t) = 1$, where $s(t)$, $i(t)$, $r(t)$ are the 
*susceptible*, _infected_ and _removed_ portions of the population, 
respectively.
Note that the _removed_ group contains both recovered and deceased cases.
This representation is the basis of the so-called _SIR_ models for population
dynamics with constant total population, _i.e._ $d N/dt = 0$.
It follows immediately that $ds(t)/dt +di(t)/dt +dr(t)/dt = 0$.

Once again, the best approach to characterize the dynamics of the population 
is to start by identifying the set of coupled differential equations and then 
integrate them over the time domain.
It is natural to expect that the rate of change of the susceptible population,
$ds/dt$, will be proportional to the number of infected cases and will vanish with 
$s(t)$.
Conversely, the rate of change of the infected population evolves in the opposite
direction with respect to the interplay of $s$ and $i$, but also decays as
time passes.

This is formally represented by **SIR differential equations** that read
$$ \begin{align}
\frac{ \partial s(t) }{\partial t } &= -\omega i(t) s(t) \, ,\\
\frac{ \partial i(t) }{\partial t } &=  \omega i(t) s(t) -\nu_a i(t) \, ,\\
\frac{ \partial r(t) }{\partial t } &=  \nu_a i(t)\, ,
\end{align}$$
where $\omega,\nu_a \in \mathbb{R}$ are the infection growth and decay rates
with arbitrary dependence on $t$.

Considering a constant rate, $d \omega/dt =0$, and $\nu_a=0$, the third equation 
vanishes. With $r(t) = 0$ we have $s = 1-i$, and the second differential equation reduces to our initial
$d i /dt = \omega i ( 1-i)$, where $n(t)=i(t)$.
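This reduction can be verified numerically. The sketch below is a stand-in for illustration only: it uses a hand-written fixed-step Runge-Kutta (RK4) integrator rather than the solvers used in the notebooks, a constant rate $\omega$, and hypothetical values for $\omega$ and the initial infected fraction.

```python
import math

def sir_rhs(state, omega, nu_a):
    """Right-hand side of the SIR equations for constant omega and nu_a."""
    s, i, r = state
    infection = omega * i * s
    removal = nu_a * i
    return (-infection, infection - removal, removal)

def rk4(rhs, state, t_end, dt):
    """Fixed-step fourth order Runge-Kutta integration from t = 0 to t_end."""
    for _ in range(round(t_end / dt)):
        k1 = rhs(state)
        k2 = rhs(tuple(y + 0.5 * dt * k for y, k in zip(state, k1)))
        k3 = rhs(tuple(y + 0.5 * dt * k for y, k in zip(state, k2)))
        k4 = rhs(tuple(y + dt * k for y, k in zip(state, k3)))
        state = tuple(y + dt * (a + 2.0 * b + 2.0 * c + d) / 6.0
                      for y, a, b, c, d in zip(state, k1, k2, k3, k4))
    return state

def logistic(t, omega, i0):
    """Closed-form solution of di/dt = omega i (1 - i) with i(0) = i0."""
    return 1.0 / (1.0 + (1.0 / i0 - 1.0) * math.exp(-omega * t))

OMEGA, I0 = 0.25, 1e-4            # hypothetical rate (d^-1) and initial fraction
start = (1.0 - I0, I0, 0.0)

# nu_a = 0: the infected fraction follows the logistic function
s, i, r = rk4(lambda y: sir_rhs(y, OMEGA, 0.0), start, 40.0, 0.1)
print(i, logistic(40.0, OMEGA, I0))

# nu_a = 1/21 d^-1, i.e. the mean active lifetime of 21 d quoted in the text
s2, i2, r2 = rk4(lambda y: sir_rhs(y, OMEGA, 1.0 / 21.0), start, 40.0, 0.1)
print(s2 + i2 + r2)               # stays equal to 1 up to rounding
```

A time dependent rate would require passing $t$ to the right-hand side as well; it is omitted here to keep the comparison with the closed form simple.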

Furthermore, in this model we assume that surviving cases present in the 
*removed* group are __permanently immunized__ and thus no longer susceptible.
While it might be tempting to immediately consider the more general case, it is
worth keeping in mind that, while the personal impact of re-infection is huge,
the impact on the differential equations will be small.
The small magnitude of the impact arises from the fact that the susceptible 
term only impacts the evolution of the disease when the infection has covered
a very large portion of the population.
Hence, introducing this parameter could just add a redundant parameter to our 
model that only makes over-fitting more likely.

__Note__ that the infection growth rate $\omega$ is not restricted to a 
constant. 
In fact, we can just as easily generalize to a time dependent rate, as in 
the $m+1$ steady state model discussed in the context of the logistic equation,
by setting $ \omega \equiv \omega(t) = d\Omega_m(t)/dt $ (see $\S 2$).

All coupled differential equations are integrated numerically using the open 
source suite [DifferentialEquations.jl](https://docs.sciml.ai/stable/) 
provided by https://sciml.ai/.
The parameter estimation is performed with 
[DiffEqParamEstim.jl](https://docs.sciml.ai/latest/analysis/parameter_estimation/).
The software packages are written in Julia and are available for Julia, Python 
and R.

#### Leaky immunization and deaths

The permanent immunization constraint can be lifted by separating the removed 
cases into two non-intersecting groups, the _recovered_ and the *deceased*,
which we label here as **SIRD** model.
The elements of the former _leak_ back to the susceptible group with a finite
decay rate, $\mu$, while the latter simply accumulates the growing number of
deceased people
$$ \begin{align}
\frac{ \partial s(t) }{\partial t } &= -\omega i(t) s(t) +\mu r(t) \, ,\\
\frac{ \partial i(t) }{\partial t } &= 
    \omega i(t) s(t) -(\nu_a +\nu_d) i(t) \, ,\\
\frac{ \partial r(t) }{\partial t } &=  \nu_a i(t) -\mu r(t) \, ,\\
\frac{ \partial d(t) }{\partial t } &=  \nu_d i(t) \, .
\end{align}$$
Note that $r(t)$ stands for the recovered only, $d(t)$ for the deceased, 
$\nu_d$ for the rate of deaths from the infection and $\mu$ for the rate
at which recovered people become susceptible once again.
As in the previous case, the generalization to time-dependent infection growth
rates is obtained straightforwardly by setting 
$ \omega \equiv \omega(t) = d\Omega_m(t)/dt $ (see $\S 2$).
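A minimal sketch of the SIRD right-hand side with a time dependent rate is given below. It is illustrative only: the explicit Euler loop is a crude stand-in for a proper solver, the sharp switch in `omega` stands in for the smooth transition discussed earlier, and the values of $\nu_d$, $\mu$ and the initial condition are hypothetical.

```python
def sird_rhs(t, state, omega, nu_a, nu_d, mu):
    """Right-hand side of the SIRD equations; omega is a function of time."""
    s, i, r, d = state
    infection = omega(t) * i * s
    return (-infection + mu * r,
            infection - (nu_a + nu_d) * i,
            nu_a * i - mu * r,
            nu_d * i)

def euler(rhs, state, t_end, dt):
    """Explicit Euler integration from t = 0 to t_end (crude, for illustration)."""
    t = 0.0
    for _ in range(round(t_end / dt)):
        state = tuple(y + dt * dy for y, dy in zip(state, rhs(t, state)))
        t += dt
    return state

# Hypothetical parameters, except nu_a = 1/21 d^-1 from the tau_a quoted in the text
omega = lambda t: 0.25 if t < 30.0 else 0.05   # sharp switch, for illustration
NU_A, NU_D, MU = 1.0 / 21.0, 0.001, 0.01
rhs = lambda t, y: sird_rhs(t, y, omega, NU_A, NU_D, MU)

final = euler(rhs, (1.0 - 1e-4, 1e-4, 0.0, 0.0), 120.0, 0.1)
print(final, sum(final))
```

The four derivatives sum to zero by construction, so the total population is conserved, while $d(t)$ can only grow as long as $\nu_d > 0$.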

## Analysis of CoViD19 cases

Below we apply this model to characterize the spread of CoViD19 in several
different countries, using data from the *JHU CSSE* public records on CoViD19.
In a previous version, we used public data from *ECDC*.
Please refer to the [README.md](README.md) for further details and copyright.

For the analysis at hand, the preprocessing of the dataset for each country
reduces to:
1. extract the following data fields for each country:
    1. date;
    2. number of daily infections, recoveries and deaths;
    3. total population;
2. make a cumulative sum of each time series;
3. normalize data to the total population;
4. convert dates to time in units of days and set initial time to date of first 
    entry in the date field for the relevant country;
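The preprocessing steps above can be sketched with the Python standard library. The record layout, the function name and the sample numbers are hypothetical and only serve to illustrate the four steps; they are not taken from the actual datasets.

```python
from datetime import date
from itertools import accumulate

def preprocess(records, population):
    """records: iterable of (iso_date, new_infections, new_recoveries, new_deaths)."""
    rows = sorted(records)                   # ISO dates sort chronologically
    day0 = date.fromisoformat(rows[0][0])    # first entry defines t = 0
    # step 4: dates converted to days elapsed since the first entry
    t = [(date.fromisoformat(row[0]) - day0).days for row in rows]

    def cumulative_fraction(column):
        # steps 2 and 3: cumulative sum, normalized to the total population
        totals = accumulate(row[column] for row in rows)
        return [total / population for total in totals]

    return t, cumulative_fraction(1), cumulative_fraction(2), cumulative_fraction(3)

# Hypothetical records, for illustration only (not real data)
records = [
    ("2020-03-01", 2, 0, 0),
    ("2020-03-02", 3, 0, 0),
    ("2020-03-03", 5, 1, 0),
    ("2020-03-05", 10, 2, 1),
]
t, infections, recoveries, deaths = preprocess(records, population=1000)
print(t)
print(infections)
```

Missing days simply appear as gaps in `t`, so the time axis remains in units of days even when the reporting is irregular.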

At present, we use the model to characterize the propagation of CoViD19
in three countries, namely: [KOREA.ipynb](Korea.ipynb), 
[Denmark.ipynb](Denmark.ipynb) and [Portugal.ipynb](Portugal.ipynb).

### South Korea

#### Basic models: logistic and generalizations

As shown in the previous figure, the growth of infections in South Korea is
clearly incompatible with a simple logistic regression model.
To clarify this beyond any reasonable doubt, we show below the best
nonlinear least squares fit of a logistic model to the number of infections
![Logistic regression](img/Korea_South_logistic.svg)
where a simple visual inspection shows that this model fails to reproduce real
data.

As discussed earlier, the logistic model can be shown to fit properly the data
by truncating the dataset, _i.e._ restricting the analysis to segments of
periods of time when the _environment_ is kept constant.
Below we address the impact of introducing additional parameters, as well as
the effect of truncating the dataset.
![multi steady state logistic regression](img/Korea_South_fits_converged.svg)

Starting from the simplest model, the green, blue and purple curves show
the logistic regressions applied to truncated subsets of data.
In each subset, the traditional logistic regression represents the 
propagation of the infection rather well. For instance, in the first interval
it indicates a growth rate $\omega_0 = 0.159 \, \mathrm{d}^{-1} $, while
simultaneously predicting that the infection would spread to half the population by 
day $t\sim 111$, _i.e._ 2020-05-01.
While the estimate for the rate in the initial phase is likely to be correct,
the model clearly fails beyond day $t\sim 45$, urging us to consider
a model with at least two rates.

The red curve represents the best fit considering a truncated two steady 
state model, parametrized with 
$\omega_i \simeq \{ 0.258, 0.0115 \} \, \mathrm{d}^{-1}$ and 
$t_i \simeq \{ 19.6,  35.0 \} \, \mathrm{d}$. 
It can be seen that this model follows data closely until the transition
occurring circa day $t\sim 75$, but fails to reproduce accurately data beyond 
this transition.

By introducing additional steady states we obtain the orange and pink curves,
which allow for a more "accurate" representation of the total number
of cases of CoViD19 in South Korea.
The three steady-state model is parametrized as follows
* $\omega_i^{(3)} \simeq \{ 0.261, \; 0.012, \; 0.00183 \} \, \mathrm{d}^{-1}$, 
* $t_i^{(3)} \simeq \{ 19.7, \; 34.9, \; 66.4 \} \, \mathrm{d}$

while the parameters for the four steady-state model read
* $\omega_i^{(4)} \simeq \{ 0.260, \; 0.0118, \; 0.00142, \; 0.00371 \} 
\, \mathrm{d}^{-1}$, 
* $t_i^{(4)} \simeq \{ 19.7, \; 35.0, \; 66.7, \; 114.0 \} \, \mathrm{d}$.

The standard errors ($\sigma_\mathrm{error}$) and error margins
($\delta_\mathrm{error}$) of the parameters read
* $ \sigma_\mathrm{error}^{\omega_i} 
= \{ 0.00552,\; 0.000183, \; 7.54\times 10^{-5}, \; 0.000384 \}$;
* $ \sigma_\mathrm{error}^{t_i} = \{ 0.209, \; 0.0951, \; 0.367, \; 1.99 \}$;
* $ \delta_\mathrm{error}^{\omega_i} 
= \{ 0.0109, \; 0.000363, \; 0.000149, \; 0.000759 \}$;
* $ \delta_\mathrm{error}^{t_i} = \{ 0.414, \; 0.188, \; 0.726, \; 3.93 \}$.

The difference between the coefficients found with the three and four 
steady-state models lies within the confidence intervals for all common
parameters.


While the significance of $t_0$ is lost in this model, _i.e._ it no longer
represents the time at which the half population mark is reached as in the 
case of the logistic model, the remaining times should represent the dates, 
namely 2020-02-26 and 2020-03-29, around which social behavior
changed significantly in South Korea.
This interpretation remains to be demonstrated and requires comparison with
changes in policy in South Korea.

#### Limitations of the generalized Logistic models

##### Finite infection mean lifetime

Among other flaws, the logistic model is built under the assumption that 
individuals remain infected for arbitrarily long periods of time.
In reality, people recover or eventually die from this disease, which 
leads to a finite *active* lifetime during which a person can infect others.
Reports indicate that the active lifetime $\tau_a \sim 21 \, \mathrm{d}$
[DOI:10.1016/S1473-3099(20)30287-5](https://doi.org/10.1016/S1473-3099(20)30287-5).


In its initial phase, valid only for $n\ll 1/2$, the growth of the logistic
function exhibits a characteristic $p$-multiple time scale 
$\tau_p = \ln(p)/\omega$ (see $\S 2$).
In this regime, we can estimate the time required for doubling the number of 
infected people, *i.e.* $\tau_2^{(i)} = \ln(2) /\omega_i $.
Therefore, for the propagation to be _"sustainable"_, the doubling 
time has to be smaller than the mean active lifetime, *i.e.* 
$ \tau_2 \ll \tau_a$, which in turn defines a limit to the "validity" of the 
logistic model.

From our nonlinear regression on the Korean dataset, we find
$\tau_2^{(i)} = \{ 2.66, \; 57.8, \; 379.0 \} \, \mathrm{d}$.
On the bright side, since the first transition date (2020-02-26) the doubling
time has exceeded the mean active lifetime $\tau_a$, *i.e.* CoViD19 has been
spreading more slowly than infected people recover or die.
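The doubling times quoted above follow directly from the fitted rates. The snippet below is a small sketch with a hypothetical helper name; it uses the rates of the three steady-state fit for South Korea and the mean active lifetime $\tau_a \sim 21 \, \mathrm{d}$ quoted earlier.

```python
import math

def p_fold_time(omega, p=2.0):
    """Time scale tau_p = ln(p) / omega for a p-fold increase (valid for n << 1/2)."""
    return math.log(p) / omega

TAU_A = 21.0                       # d, mean active lifetime quoted in the text
OMEGAS = [0.261, 0.012, 0.00183]   # d^-1, three steady-state fit for South Korea

for k, omega in enumerate(OMEGAS):
    tau_2 = p_fold_time(omega)
    regime = "faster" if tau_2 < TAU_A else "slower"
    print(f"steady state {k}: tau_2 = {tau_2:.3g} d, doubling {regime} than tau_a")
```

Only the initial steady state doubles faster than the mean active lifetime; both later steady states lie outside the regime in which the logistic model can be considered valid.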

On the other hand, despite the nice close fit to the data, this simple model
fails to realistically characterize the propagation of the disease since 
the first transition date.


__Note:__
It is important to note that nonlinear fits using multiple parameters are
known for being notoriously problematic, as it was eloquently put by 
_John von Neumann_
    
    "...with four parameters I can fit an elephant, and with five I can make 
    him wiggle his trunk."
    
[Nature volume 427, page 297 (2004)](https://www.nature.com/articles/427297a).


I am confident that all models discussed here have been properly fit, _i.e._
the optimization process converges and outputs a set of meaningful parameters
for the specified model.
Convergence of the nonlinear fit is straightforward in all cases, but it is
important to state that the results shown above depend on the initial
guess provided to the fitting routine. Please refer to the relevant
notebook (Korea.ipynb) for full details.

Nonetheless, it is important to keep in mind that an incorrect model can 
*fit* data accurately; therefore, *fit quality* estimates should always be 
interpreted with care.

#### SIR-like models (WIP)

To address the finite _active_ mean lifetime of an infected individual, we
also analyze data using two models, which can accommodate time-dependent
growth rates for the infection. Please refer to the notebook
[SIR_KOREA.ipynb](SIR_Korea.ipynb).

The first model is the most simple incarnation of the SIR model, with the 
population split into three non-overlapping groups, the *susceptible* $s(t)$,
the *infected* $i(t)$ and the *removed* $r(t)$, as defined in the 
introduction.

In the attached notebook, we define the differential equations, using the
$m-$steady state model for the infection growth rate, and estimate the 
parameters $\omega_i$, $t_i$ and $\nu_a$ using different optimization 
algorithms.
The numerical integration of the differential equations relies on
[DifferentialEquations.jl](https://docs.sciml.ai/stable/)
while the optimization process uses 
[DiffEqParamEstim.jl](https://docs.sciml.ai/latest/analysis/parameter_estimation/)
and [Optim.jl](https://julianlsolvers.github.io/Optim.jl/stable/#).

Based on the results from the previous section, we jump straight to a *four
steady-state* model ($m=3$) and compare the results obtained via two
optimization algorithms, namely *Nelder-Mead* (dashed red) and *BFGS* (solid
black).
In both cases we use the standard $L_2$ loss function.

Below we compare the respective results with data for the total cases 
$1-s(t)$, infected $i(t)$ and removed $r(t)$.

![SIR regression](img/Korea_South_SIR.svg)

While at first glance the results from both methods appear similar
and reach comparable minima, 
$L_2^\mathrm{NM}   = 9.79\times 10^{-8}$ and 
$L_2^\mathrm{BFGS} = 5.88\times 10^{-8}$, it is important to highlight
several differences.
First, the former is considerably more sensitive to the initial guess.
Second, the algorithms converge to solutions that are considerably 
different for several parameters, namely $\omega_2$, $t_2$, $t_3$ and
$\nu_a$.

$ \omega_i^\mathrm{NM} = \{ 0.249, \, 0.0254, \, 0.0162, \, 0.0317 \}
\, \mathrm{d}^{-1} \; ;\quad
t_i^\mathrm{NM} = \{ 35.2, \, 65.3, \, 111.0 \} \, \mathrm{d}$

$ \nu_a^\mathrm{NM} = 0.0427 \, \mathrm{d}^{-1} \; \Leftrightarrow \;
\tau_a^\mathrm{NM} = 23.4 \, \mathrm{d}$

$ \omega_i^\mathrm{BFGS} = \{ 0.244, \, 0.0238, \, 0.000582, \, 0.0229 \} 
\, \mathrm{d}^{-1} \; ;\quad
t_i^\mathrm{BFGS} = \{ 34.9, \, 66.4, \, 105.0 \} \, \mathrm{d}$

$ \nu_a^\mathrm{BFGS} = 0.0331 \, \mathrm{d}^{-1} \; \Leftrightarrow \;
\tau_a^\mathrm{BFGS} = 30.2 \, \mathrm{d}$

Furthermore, preliminary tests with *Gradient Descent* algorithm return 
results that converge towards the solution of the *BFGS* algorithm, but at a 
much slower rate.

While at this moment we do not have enough information to decide which 
algorithm provides the best set of parameters, we can extract relevant
information.
First, this model is capable of reproducing the general trend for 
the total cases, the infected and removed cases.

Second, both algorithms indicate that the infection growth rate for the
fourth steady state, *i.e.* $\omega_3$, is approximately one order of magnitude
larger than that estimated via the generalized logistic model. 
Even a quick inspection of the logistic model should make it evident that
it will likely underestimate the growth rate as the difference between
the total number of cases and the number of active cases grows.
Hence, the figures provided by the SIR model are more likely to be
realistic.

#### Further refining:  SIRD models (WIP)

The natural next step is to separate the recovered cases from the deaths.
This can be easily achieved by redefining the third differential equation and
introducing a new one for the fraction of the population whose death is 
attributed to CoViD19.
Therefore, we now divide the population into four groups:
susceptible $s(t)$, infected $i(t)$, recovered $r(t)$ and deceased $d(t)$.

As in the case of the SIR model, we compare with data the results for the
total cases $1-s(t)$, infected $i(t)$, recovered $r(t)$ and deceased $d(t)$,
obtained via multiple methods.

![SIRD regression](img/Korea_South_SIRD.svg)

While one could claim some degree of compatibility between the results 
obtained via the *Nelder-Mead* (dashed-red) and *BFGS* (solid black) 
algorithms for the *SIR* model, the same cannot be said for the *SIRD* model.

Note that both algorithms converge without any numerical issue, but
the former leads to solutions that are not compatible
with reality, with the negative rate for deaths $\nu_d$ being the prime
example.

Thus far, we have only considered **unconstrained** optimization; constrained
optimization could be used to restrict the parameter search to a *"reasonable"* domain.

__Note:__ this is work in progress which requires an in-depth analysis before
any strong claim can be made.

For the sake of record we display the results obtained with both algorithms,
but it goes without saying that little importance should be assigned to the
results obtained via the *Nelder-Mead* algorithm.
Furthermore, the preliminary results obtained with *Gradient descent*
algorithm converge, albeit slowly, to those obtained with *BFGS*.

$ \omega_i^\mathrm{NM} = \{ 0.218, \, 0.0192, \, 0.00201, \, 0.0139 \} 
\, \mathrm{d}^{-1} \; ;\quad
t_i^\mathrm{NM} = \{ 39.2, \, 77.0, \, 169.0 \} \, \mathrm{d}$

$ \nu_a^\mathrm{NM} = 0.0379 \, \mathrm{d}^{-1} \; \Leftrightarrow \;
\tau_a^\mathrm{NM} = 26.4 \, \mathrm{d}$

$ \nu_d^\mathrm{NM} = -0.00217 \, \mathrm{d}^{-1} \; \Leftrightarrow \;
\tau_d^\mathrm{NM} = -461.0 \, \mathrm{d}$

$ \omega_i^\mathrm{BFGS} = \{ 0.229, \, 0.0146, \, 1.93\times 10^{-5}, \, 0.0231 \} 
\, \mathrm{d}^{-1} \; ;\quad
t_i^\mathrm{BFGS} = \{ 37.7, \, 74.9, \, 105.0 \} \, \mathrm{d}$

$ \nu_a^\mathrm{BFGS} = 0.0326 \, \mathrm{d}^{-1} \; \Leftrightarrow \;
\tau_a^\mathrm{BFGS} = 30.6 \, \mathrm{d}$

$ \nu_d^\mathrm{BFGS} = 0.000568 \, \mathrm{d}^{-1} \; \Leftrightarrow \;
\tau_d^\mathrm{BFGS} = 1760.0 \, \mathrm{d}$

$L_2^\mathrm{NM}   = 9.03\times 10^{-8}$ and 
$L_2^\mathrm{BFGS} = 5.11\times 10^{-8}$


One could, and probably should, consider if we are just splitting hairs or
making valid refinements of the model.
Given the results and scope of the refinements, it is my assessment that
this level of refinement is relevant, particularly as recent news indicates
that immunity might be short-lived.
This introduces a new degree of freedom into the propagation of this disease,
namely a leakage of recovered individuals back to the susceptible population.

### Denmark

Direct inspection of the raw data for the number of infected cases in Denmark 
indicates at least two evident transitions occurring circa days $\sim\{67,95\}$.
While it is possible to fit truncated linear regressions to any of the 
intervals defined by these dates, we find that the transition occurring circa 
day $67$, _i.e._ 2020-03-07, is not compatible with a transition governed 
by the distribution of incubation times that we are considering.

Below, we compare two truncated logistic regressions and a regression using
the two steady state solution with data for Denmark.
![2 steady state Denmark](img/Denmark_fits_converged.svg)
The two steady state model reproduces closely the number of cases for 
$t>67\,\mathrm{d}$, providing an estimate for the growth rates in these two 
regions $\omega_i \simeq \{ 0.0889, \; 0.0236 \} \, \mathrm{d}^{-1}$, while the 
offset and the transition times read
$t_i \simeq \{ 15.8  , \;   93.1 \} \, \mathrm{d}$.

From the growth rates, we estimate the doubling times 
$\tau_2^{(i)} = \ln(2) /\omega_i =\{ 7.8, \; 29.4 \} \, \mathrm{d}$
and verify that since the transition to the last steady state, 2020-04-02, 
$\tau_2 > \tau_a$.
Holding the present conditions constant, we are led to believe that the
number of new infections will drop, but significantly slower than in South 
Korea.

The respective standard errors and error margins, up to 3 significant digits, 
for the parameters read
* $ \sigma_\mathrm{error}^{\omega_i} 
= \{ 0.00168, \; 0.000945 \} \, \mathrm{d}^{-1}$;
* $ \sigma_\mathrm{error}^{t_i} = \{ 0.155, \; 0.33 \} \, \mathrm{d}$;
* $ \delta_\mathrm{error}^{\omega_i} 
= \{ 0.00332 , \; 0.00187 \} \, \mathrm{d}^{-1}$;
* $ \delta_\mathrm{error}^{t_i} = \{ 0.306, \; 0.653 \} \, \mathrm{d}$.

### Portugal

#### Generalized logistic

As in the Danish case, analysis of the number of cases in Portugal indicates the
presence of two transitions between different steady states.
Unlike the Danish case, both transitions can be modeled by the distribution of
incubation times.

Below, we compare how different models characterize the number of infections
in Portugal.
![3 steady state Portugal](img/Portugal_fits_converged.svg)


From the three steady state model, the growth rate coefficients read 
$\omega_i \simeq \{ 0.166, \; 0.037, \; 0.00867 \} \, \mathrm{d}^{-1}$, while the 
offset and transition times read
$t_i \simeq \{ 12.0, \; 24.8, \; 45.1 \} \, \mathrm{d}$, thus indicating
that the transitions occurred circa 2020-03-23 and 2020-04-06.

The doubling time estimates for Portugal
$\tau_2^{(i)} = \ln(2) /\omega_i 
=\{ 4.17, \; 18.7, \; 80.0 \} \, \mathrm{d}$
are in line with those for Denmark but, after the first transition, remain
significantly shorter than in South Korea, reflecting higher growth rates.
In all cases, if the present conditions hold, the number of new infections 
will drop as the doubling time is larger than the mean active time $\tau_a$.

The respective standard errors and error margins, up to 3 significant digits,
for the parameters read
* $ \sigma_\mathrm{error}^{\omega_i} 
= \{ 0.00764, \; 0.00112, \; 0.000141 \} \, \mathrm{d}^{-1}$;
* $ \sigma_\mathrm{error}^{t_i} = \{ 0.204, \; 0.382, \; 0.413 \} \, \mathrm{d}$;
* $ \delta_\mathrm{error}^{\omega_i} 
= \{ 0.0152, \; 0.00223, \; 0.00028 \} \, \mathrm{d}^{-1}$;
* $ \delta_\mathrm{error}^{t_i} = \{ 0.406, \; 0.759, \; 0.82\} \, \mathrm{d}$.

#### Limitations

The limitations are the same as those discussed in the context of the Korean
dataset, and thus an analysis based on a more general model is required.

In addition, the Portuguese authorities changed the criteria that define the 
transition from infected to recovered status on 2020-05-24, leading to a large 
conversion of infected to recovered cases.
This change in criteria breaks the characterization of the infected and recovered
time series. 
Therefore, from this date onwards, we split our analysis into two parts:
the first contains data up to 2020-05-23, while the second contains data from 2020-05-24 onwards.

Consider the following reports in the Portuguese press
[Público](https://www.publico.pt/2020/05/24/sociedade/noticia/nova-contagem-casos-traz-recorde-recuperados-covid19-1917895)
and
[Sabado](https://www.sabado.pt/portugal/detalhe/covid-19-numero-de-recuperados-vai-disparar-avisa-ministra).

#### SIR(D) models (WIP)

![SIR regression](img/Portugal_SIR_truncated.svg)

## 2. Brief review of the relevant differential equations

The present notebook concerns only population dynamics where the rate of change
of the population is proportional to the product of some power of the 
population with a given time-dependent term, _i.e._ $dn/dt \propto Q(n) P(t) $, 
where $Q(n)$ is a polynomial of degree one or higher in $n$ with constant 
coefficients. 
Without loss of generality the differential equation that governs such processes
reads $$ \frac{dn}{dt} = F(t,n) \, ,$$ where $F(t,n)$ encodes the dependence on
$t$ and  $n$. Since we are considering a polynomial dependence on $n$ with
constant coefficients, $F(t,n)$ can be expressed by two separable functions for
$t$ and $n$, namely $F(t,n) = P(t) Q(n)$.
First order nonlinear differential equations in the form $dn/dt = P(t)Q(n)$, 
known as separable first order differential equations, have straightforward
solutions in the form 
$$ \int \frac{dn}{Q(n)} = \int P(t) dt + C\, : C \in \mathbb{R} \, ,$$ where 
$C$ is an integration constant.
To proceed, we must define $Q(n)$ and $P(t)$. For constant environment (steady
state) problems, the rate of change has no explicit time dependence, _i.e._ 
$dP(t)/dt=0$.
In this scenario, we find the differential equation for classic undergraduate 
problems, namely the exponential and logistic growths for 
$Q(n) = n$ and $Q(n) = n(1-n)$, respectively.

### Constant environment assumption
In constant environments, $dP(t)/dt=0$ and the solution reduces to 
$$\int \frac{dn}{Q(n)} = \omega_0 (t -t_0) \, ,$$ where we set 
$P(t) = \omega_0\,;\;C=-\omega_0 t_0\, :\; \omega_0, t_0 \in \mathbb{R}$.
All that is left to be done is to evaluate the indefinite integral 
$\int 1/Q(n) dn$.
An easy-to-follow list of integrals of rational functions can be found at 
[List of integrals of rational functions](https://en.wikipedia.org/wiki/List_of_integrals_of_rational_functions).


#### The exponential growth/decay
Exponential growth, or decay, arises when the rate of change is linearly
proportional to the population, _i.e._ $Q(n)=n$. In this case the indefinite 
integral has the trivial solution $\int dn/n = \ln (n)$, hence the solution reads 
$$ \ln(n) = \omega_0(t-t_0) \Leftrightarrow n_E(t) 
= \exp[ \omega_0(t-t_0)] \, , $$ the so-called exponential growth/decay for 
positive/negative $\omega_0$ (we introduce the subscript $E$ to identify this
particular solution).

In a naïve interpretation of the evolution of the CoViD19 infections, one might 
be tempted to assume (or worse, to model and fit) an exponential growth of the 
infections. 
By taking the limit $$\lim_{t\to \infty} \exp[ \omega_0(t-t_0) ] 
= \infty \Leftarrow \omega_0>0$$ it should become obvious that the exponential 
growth will inevitably lead to solutions where the infected population is larger 
than the total population!
Alternatively, we can show that this solution exhibits a constant 'p-multiple' 
time-scale
$$ n( t+\tau_p) = p n( t ) \Leftrightarrow \exp( \omega_0 \tau_p ) 
= p \Leftrightarrow \tau_p = \frac{ \ln( p ) }{ \omega_0 } \, .$$
Considering growth rates 
$\omega_0 = \{ 0.1, 0.15, 0.2, 0.25 \} \, \mathrm{d}^{-1} $, the doubling times
$(p=2)$ are $ \tau_2 \simeq \{ 6.9, 4.6, 3.5, 2.8 \} \,  \mathrm{d} $.
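
The doubling times quoted above follow directly from $\tau_p = \ln(p)/\omega_0$. 
A minimal sketch of that arithmetic, where the helper name `p_multiple_time` is
introduced here purely for illustration:

```python
import math

def p_multiple_time(omega0, p=2.0):
    """Time for n_E(t) to grow by a factor p: tau_p = ln(p) / omega0."""
    return math.log(p) / omega0

# Growth rates in 1/day, as listed in the text above.
rates = [0.1, 0.15, 0.2, 0.25]
doubling_times = [p_multiple_time(w) for w in rates]

# Rounded to one decimal place, in days.
print([round(tau, 1) for tau in doubling_times])  # [6.9, 4.6, 3.5, 2.8]
```

Note that $\tau_p$ depends only on $\omega_0$ and $p$, not on $t$ or $t_0$, which
is what makes the time scale constant for a purely exponential solution.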

The natural next step is the introduction of a saturation mechanism.

#### The logistic equation
The logistic equation is a simple and beautiful solution to the above-mentioned
problem, where instead of considering a rate of change of population linear 
with the population, it considers the product of the population $n$ with the 
remaining population $1-n$, _i.e._ $Q(n) = n(1-n)$.
Immediately, we can verify that the rate of change vanishes as $n$ approaches the 
total population, thus ensuring that the population has an upper limit $n(t) \leq 1$.

To solve the differential equation we must solve the indefinite integral 
$ \int 1/Q(n) dn $ that reads
$$\int \frac{dn}{n(1-n)} = -\ln \bigg|\frac{1-n}{n}\bigg|\,.$$
In turn, the solution to the differential equation reads
$$ \begin{align}
-\ln \bigg|\frac{1-n}{n}\bigg| &= \omega_0(t-t_0) \\
\Leftrightarrow \bigg|\frac{1-n}{n}\bigg| &= \exp[ -\omega_0(t-t_0) ] \\
\Leftrightarrow \frac{1-n}{n} &= \exp[ -\omega_0(t-t_0) ] \quad \forall 
\quad \omega_0,t,t_0 \in \mathbb{R} \\
\Leftrightarrow n_L(t) &= \sigma(t)= \frac{1}{1+\exp[ -\omega_0(t-t_0) ]} 
= \frac{1}{1+e^{-\Omega}} \; : \Omega = \omega_0(t-t_0) \, ,
\end{align} $$
the so-called logistic equation. 
We introduce the subscript $L$ to identify this particular solution and also 
highlight that this is the sigmoid function $\sigma(t)$, see (and references 
therein)
* https://mathworld.wolfram.com/LogisticEquation.html
* https://en.wikipedia.org/wiki/Logistic_function .

As in the previous case, the sign of $\omega_0$ (positive/negative) defines 
growth/decay.
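
As a sanity check on the derivation, one can integrate $dn/dt = \omega_0 n(1-n)$
numerically and compare with the closed form $n_L(t)$. 
The sketch below uses a classic fourth-order Runge-Kutta step; the function names
and the values of $\omega_0$ and $t_0$ are illustrative assumptions, not fitted
quantities from this notebook.

```python
import math

def logistic(t, omega0, t0):
    """Closed-form solution n_L(t) = 1 / (1 + exp[-omega0 (t - t0)])."""
    return 1.0 / (1.0 + math.exp(-omega0 * (t - t0)))

def rk4_logistic(n_start, omega0, t_start, t_end, steps=1000):
    """Integrate dn/dt = omega0 * n * (1 - n) with 4th-order Runge-Kutta."""
    def rhs(n):
        return omega0 * n * (1.0 - n)

    h = (t_end - t_start) / steps
    n = n_start
    for _ in range(steps):
        k1 = rhs(n)
        k2 = rhs(n + 0.5 * h * k1)
        k3 = rhs(n + 0.5 * h * k2)
        k4 = rhs(n + h * k3)
        n += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return n

omega0, t0 = 0.2, 30.0  # illustrative values (1/day and days)

# Start on the analytic curve at t = 0 and integrate up to t = 60 days.
n_numeric = rk4_logistic(logistic(0.0, omega0, t0), omega0, 0.0, 60.0)
n_exact = logistic(60.0, omega0, t0)
print(abs(n_numeric - n_exact))  # small: the two curves agree
```

The right-hand side has no explicit time dependence here, which is precisely the
constant environment assumption.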

Since the logistic growth is bounded, $ n(t) \in [0,1] $, the analysis based on 
$p$-multiplying time scales $\tau_p$ is, in general, flawed. 
Alternatively, we can invert the solution and compute the time necessary to 
reach a given fraction of the total population, $n(t) = a/b$, which reads
$ t = t_0 + \omega_0^{-1} \ln[ a/(b-a)] \; \forall \; b>a>0 \, .$ 
Notably, the time to reach half of the total population is $t_0$ and the total 
population is reached asymptotically as $t \to \infty$.
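
Setting $n_L(t) = a/b$ and solving for $t$ gives 
$\omega_0 (t - t_0) = \ln[a/(b-a)]$. 
A minimal sketch of this inversion follows; the helper `time_to_fraction` and the
parameter values are hypothetical and chosen only to illustrate the formula.

```python
import math

def logistic(t, omega0, t0):
    """n_L(t) = 1 / (1 + exp[-omega0 (t - t0)])."""
    return 1.0 / (1.0 + math.exp(-omega0 * (t - t0)))

def time_to_fraction(a, b, omega0, t0):
    """Time at which n_L(t) = a / b; requires 0 < a < b."""
    if not 0 < a < b:
        raise ValueError("need 0 < a < b")
    return t0 + math.log(a / (b - a)) / omega0

omega0, t0 = 0.2, 30.0  # illustrative values (1/day and days)

t_half = time_to_fraction(1, 2, omega0, t0)  # half of the population: t = t0
t_90 = time_to_fraction(9, 10, omega0, t0)   # 90% of the population

print(t_half, logistic(t_90, omega0, t0))
```

As $a \to b$ the logarithm diverges, consistent with the total population being
reached only asymptotically.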

Notwithstanding the above, in the initial phase of population growth, 
$ t \ll t_0 $ or equivalently $n(t) \ll 1/2$, the exponential term in the logistic
function dominates and the solution can be approximated by 
$n_L(t) \sim \exp[ \omega_0(t-t_0) ] = n_E(t)$.
Therefore, in this regime a $p-$multiplying time scale can be defined 
$\tau_p = \ln( p )/ \omega_0$.
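
The approximation can be made explicit with a short expansion. 
Writing $\Omega = \omega_0(t-t_0)$ and assuming $\omega_0 > 0$, so that 
$e^{\Omega} \ll 1$ in the initial phase,
$$ n_L(t) = \frac{1}{1+e^{-\Omega}} = \frac{e^{\Omega}}{1+e^{\Omega}}
= e^{\Omega} \big( 1 -e^{\Omega} +e^{2\Omega} -\dots \big) 
\simeq e^{\Omega} = n_E(t) \, . $$
The leading correction is of relative order $e^{\Omega} \simeq n_L(t)$, so the 
exponential picture holds only while the infected fraction remains small.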

This solution is known to accurately characterize the growth of populations, 
when the environment (parametrized via $\omega_0$) is kept constant. 
This constraint is crucial, as small changes in $\omega_0$ can drastically 
change the time necessary to infect a large portion of the 
population.

#### Growth rate coefficient $\omega_0$ (a very simple model)

As discussed in the motivation, the constant environment assumption is broken
when public health measures impose new rules and procedures designed to 
mitigate the propagation of the infection.
This is not novel: it is well documented that policy changes the propagation 
rates. Please refer to the WHO infection prevention website for 
[general](https://www.who.int/infection-prevention/en/) and 
[CoViD19 specific](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/technical-guidance/infection-prevention-and-control) 
information.

The role of $\omega$ in the logistic differential equation $dn/dt = \omega n(1-n)$ 
can be understood as a representation of the likelihood of an infected individual 
passing the infection to a non-infected person.
This can then be mapped to a (very) simple model such as $\omega \propto E p$,
where $E$ is an estimate for the number of interactions between individuals 
per unit of time and $p$ the probability of transmitting the infection at each 
interaction.

Considering such a model, it becomes evident that the growth rate will change as 
governments impose restrictions, _e.g._:
* forcing people to stay at home reduces the number of interactions
* washing hands and face more frequently will likely reduce the probability of 
    transmitting the infection as the virus is literally washed away from skin
* increasing the distance at which people have interactions

Is there evidence for a change in the rate? Yes, consider any of the plots 
presented for South Korea, Denmark or Portugal.

### Time dependent environment
As discussed previously, we expect $d P(t)/dt = 0$ to no longer hold. 
Therefore, we must define $P(t)$ and evaluate the respective integral. 
There are infinitely many options to introduce an explicit time dependence on 
$P(t)$. 
Below, we will try to consider generalizations that obey, as far as possible, the 
following:
* the constant rate model should be a limit of $P(t)$;
* contain as few parameters as possible;
* incorporate the merits of the constant rate model;
* be compatible with real data.

#### $m$ transitions between $m+1$ steady states

The success of the constant rate logistic model suggests that even models with
simple time dependence could have potential to accurately characterize the 
population growth. 
A fairly simple model could consist of a series of $m+1$ steady states, each 
with a specific constant rate, that change discontinuously at $m$ transition times.
This model can be represented by
$$ P(t) = \omega_0 +\sum_{i=1}^m \big( \omega_i 
-\omega_{i-1} \big) \Theta( t -t_i ) \; , $$
where $\Theta(t)$ is the Heaviside theta function, _i.e._ the step function,
see: 
1. https://en.wikipedia.org/wiki/Heaviside_step_function
2. https://mathworld.wolfram.com/HeavisideStepFunction.html

This model naturally incorporates the merits of the constant rate model and can 
accommodate discrete changes in policy, at the cost of adding $2m$ new 
parameters ($m$ rates and $m$ times). 
Furthermore, it is trivial to show that this model reduces to the constant rate 
model when either of the following holds:
* $\omega_i =\omega_0 \; \forall \,i\geq 1$;
* $t_i \to \infty \; \forall \,i\geq 1$.

Finally, this model has an additional merit, the discrete change of rates would 
be clearly visible in log plots, as different rates generate straight lines with
different slopes at this plotting scale.
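
A minimal sketch of the step model and of its time integral is given below. 
The integral follows from $\int \Theta(t'-t_i)\,dt' = \max(t-t_i, 0)$. 
The function names, the convention $\Theta(0)=1$ and all numerical values are
assumptions made for illustration only.

```python
def rate_step(t, omegas, times):
    """P(t) = omega_0 + sum_i (omega_i - omega_{i-1}) * Theta(t - t_i).

    omegas holds the m+1 rates [omega_0, ..., omega_m];
    times holds the m transition times [t_1, ..., t_m].
    """
    rate = omegas[0]
    for i, t_i in enumerate(times, start=1):
        if t >= t_i:  # Heaviside step, taking Theta(0) = 1 (a convention choice)
            rate += omegas[i] - omegas[i - 1]
    return rate

def big_omega_step(t, omegas, times, t0):
    """Integral of P: omega_0 (t - t0) + sum_i (omega_i - omega_{i-1}) max(t - t_i, 0)."""
    total = omegas[0] * (t - t0)
    for i, t_i in enumerate(times, start=1):
        total += (omegas[i] - omegas[i - 1]) * max(t - t_i, 0.0)
    return total

# Hypothetical example: three steady states, two transitions.
omegas = [0.2, 0.05, 0.01]  # 1/day
times = [20.0, 40.0]        # days
t0 = 30.0                   # days

for t in (10.0, 30.0, 50.0):
    print(t, rate_step(t, omegas, times), big_omega_step(t, omegas, times, t0))
```

While $n \ll 1$ we have $\ln n \simeq \Omega(t)$, so each steady state appears in
a log plot as a straight segment whose slope is the value returned by `rate_step`.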

While this model appears to tick all boxes, it always forces the rate change to 
take place instantly at times $t_i$, which is not consistent with presently 
available data, which changes smoothly.
Data from Denmark and other countries appears to include some transitions that
occur in time scales significantly shorter than the mean incubation time.
Nonetheless, I would not consider this as a solid argument to use this model.

In the case of an infection there is a natural mechanism that introduces a 
variable delay on the propagation of the disease, the incubation time! 


#### Smooth transitions and incubation times

The manifestation of infection is not immediate due to finite (distribution of) 
incubation time(s). 
Consequently, any change to the environment, _i.e._ the growth rate, will 
necessarily be _delayed_ and also _smoothed_ by the distribution of incubation
times.

In short, from the moment that someone gets infected, it takes a given time 
interval, $\mu_{inc}$, for the infection to manifest. 
Furthermore, this incubation time is not constant, varying from case to case.

The details pertaining to disease incubation and its characterization lie
outside the scope of this project.
Hence, rather than trying to explain something that also lies outside my area
of knowledge, we refer to the following journal article 
[*Euro Surveill. 2020;25(5):pii=2000062*](https://doi.org/10.2807/1560-7917.ES.2020.25.5.2000062).
In this article, the authors characterize the distribution of incubation times
with several distributions.
For the purposes of this project, we will start by considering only the Gamma
distribution, but other distributions might be considered in the future.

Here, we propose using the distribution of incubation times, namely its 
cumulative distribution function (CDF), __as a proxy__ for the probability of 
getting infected. In this model

$$ P(t) = \omega_0 +\sum_{i=1}^m \big( \omega_i 
-\omega_{i-1} \big) F_X( t-t_i; \alpha, \beta ) \, ,$$
where $F_X( t-t_i; \alpha, \beta )$ is the CDF for the incubation times, with $X$ labeling the distribution.

>__Important notes:__ 
> * this model is a conjecture, even if it can be used to accurately model real 
    data, it still requires evidence for its validity;
> * correlation between $F_X( t-t_i; \alpha, \beta )$ and the probability of 
    transmitting infection, $p$, at a given transition is, to the best of my 
    knowledge, only a correlation.
    At the present moment I have little robust evidence to substantiate a causal 
    relation (see the discussion: *Growth rate coefficient $\omega_0$*).

The above-mentioned article characterizes the incubation time with three 
distributions, namely Weibull, Gamma
and LogNormal. In the present notebook, we consider only the Gamma distribution,
but a similar analysis can be computed with the Weibull distribution.
This reference provides the mean incubation time $\mu_{inc} = 6.5\; \mathrm{d}$ 
and the standard deviation $\sigma_{inc} = 2.6 \; \mathrm{d}$, from which we 
determine the parametrization of the Gamma distribution, namely 
$\alpha = \mu_{inc}^2/\sigma_{inc}^2 \sim 6.25 $ and 
$\beta = \mu_{inc}/\sigma_{inc}^2 \sim 0.96 \; \mathrm{d}^{-1}$.
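
The parametrization follows from the mean $\alpha/\beta$ and variance 
$\alpha/\beta^2$ of a Gamma distribution with shape $\alpha$ and rate $\beta$. 
A short sketch of the conversion, with the helper name `gamma_shape_rate` chosen
here for illustration:

```python
import math

def gamma_shape_rate(mean, std):
    """Shape alpha and rate beta of a Gamma distribution from its mean and std.

    Uses mean = alpha / beta and variance = alpha / beta**2.
    """
    variance = std ** 2
    return mean ** 2 / variance, mean / variance

# Mean and standard deviation of the incubation time quoted above, in days.
alpha, beta = gamma_shape_rate(6.5, 2.6)
print(round(alpha, 2), round(beta, 2))  # 6.25 0.96

# Round trip: recover the mean and standard deviation from (alpha, beta).
print(alpha / beta, math.sqrt(alpha) / beta)
```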

##### The Gamma distribution

The CDF for a Gamma distribution reads 
$$ F_\Gamma(t;\alpha,\beta) = \gamma(\alpha, \beta t) / \Gamma(\alpha)$$
where $ \gamma(\alpha, \beta t) $ and $\Gamma(\alpha)$ are the lower incomplete
and the complete gamma functions, see:
* [*Abramowitz & Stegun* 1970 26.1.32](http://www.math.ubc.ca/~cbm/aands/abramowitz_and_stegun.pdf)
* https://en.wikipedia.org/wiki/Gamma_distribution
* https://dlmf.nist.gov/8.2
* https://dlmf.nist.gov/5.2.

The Gamma distribution is formally defined for $t \geq 0$, $\alpha>0$ and 
$\beta>0$, where $t=0$ represents the instant at which an individual is 
infected. 
It is natural to extend the domain to negative times, by setting 
$F_\Gamma(t<0;\alpha,\beta) = 0$. 
This extension can lead to discontinuities in $F_\Gamma(t;\alpha,\beta)$ and 
in its derivative 
$\partial F_\Gamma(t;\alpha,\beta)/\partial t \doteq F_\Gamma '(t;\alpha,\beta)$
at $t=0$. By requiring continuity of both, _i.e._ imposing 
$\lim_{t\to 0^-} F_\Gamma  (t;\alpha,\beta) = \lim_{t\to 0^+} F_\Gamma  (t;\alpha,\beta) $ 
and 
$\lim_{t\to 0^-} F_\Gamma '(t;\alpha,\beta) = \lim_{t\to 0^+} F_\Gamma '(t;\alpha,\beta) $,
we find the additional constraint $\alpha > 1$.
The respective probability density function (PDF) is computed by differentiating 
the CDF; it reads
$$ f_\Gamma(t;\alpha,\beta) = \frac{ \partial F_\Gamma(t;\alpha,\beta) }{ \partial t}
= \frac{ \beta^\alpha t^{\alpha-1} e^{-\beta t} }{ \Gamma(\alpha) } \, .$$
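
The constraint $\alpha > 1$ can be seen numerically from the behaviour of the PDF
as $t \to 0^+$: it vanishes for $\alpha > 1$, tends to $\beta$ for $\alpha = 1$ 
and diverges for $\alpha < 1$, whereas the extension to $t<0$ has zero density. 
The sketch below illustrates this; the function name and the test values are
illustrative assumptions.

```python
import math

def gamma_pdf(t, alpha, beta):
    """f_Gamma(t) = beta^alpha t^(alpha-1) exp(-beta t) / Gamma(alpha).

    Returns 0 for t <= 0, following the extension F_Gamma(t < 0) = 0
    (t = 0 is placed in that branch to avoid evaluating log(0)).
    """
    if t <= 0.0:
        return 0.0
    log_f = (alpha * math.log(beta) + (alpha - 1.0) * math.log(t)
             - beta * t - math.lgamma(alpha))
    return math.exp(log_f)

beta = 1.0  # illustrative rate, 1/day
for alpha in (0.5, 1.0, 2.0):
    # Evaluate the density ever closer to t = 0 from the right.
    values = [gamma_pdf(t, alpha, beta) for t in (1e-2, 1e-4, 1e-6)]
    print(alpha, values)
```

Only for $\alpha > 1$ do the values approach zero, matching the left-hand limit.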

Having defined $P(t)$, we can proceed with our calculation. 
To highlight the variable number of parameters in this model, we add the 
subscript $m$ to our definition, hence the integral in the time domain of the 
new growth rate function reads
$$ \Omega_m(t) = \int_{-\infty}^t P(t')dt' = \omega_0(t-t_0)
+\sum_{i=1}^m \big( \omega_i -\omega_{i-1} \big) 
I_\Gamma(t -t_i,\alpha,\beta) \; , $$ and the integral of the CDF is
$$ I_\Gamma(t,\alpha,\beta) = \int_{-\infty} ^t F_\Gamma(t',\alpha,\beta) dt'
= \frac{1}{\Gamma(\alpha)}\bigg[ \bigg( t-\frac{\alpha}{\beta} \bigg)\gamma(\alpha,\beta t)
+\beta^{\alpha-1} t^\alpha e^{-\beta t} \bigg] \, . $$
Despite its nonlinear nature, in the limit of large $t$ the integral of the 
CDF reduces to a shifted linear function, namely 
$ I_\Gamma(t,\alpha,\beta) \simeq t -\alpha/\beta $ when $\beta t \gg 1$. 
Furthermore, this model reduces to the simple constant rate model whenever 
$\omega_i = \omega_0$ or $t_i\to \infty$.
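
For completeness, the closed form of $I_\Gamma$ can be obtained with one 
integration by parts, using $F_\Gamma(t<0;\alpha,\beta)=0$ and the recurrence 
$\gamma(\alpha+1,x) = \alpha\gamma(\alpha,x) -x^\alpha e^{-x}$ 
(see https://dlmf.nist.gov/8.8):
$$ \begin{align}
I_\Gamma(t,\alpha,\beta) &= t F_\Gamma(t;\alpha,\beta) 
-\int_0^t t' f_\Gamma(t';\alpha,\beta) dt' \\
&= \frac{ t \gamma(\alpha,\beta t) }{ \Gamma(\alpha) } 
-\frac{ \gamma(\alpha+1,\beta t) }{ \beta \Gamma(\alpha) } \\
&= \frac{1}{\Gamma(\alpha)}\bigg[ \bigg( t-\frac{\alpha}{\beta} \bigg)\gamma(\alpha,\beta t)
+\beta^{\alpha-1} t^\alpha e^{-\beta t} \bigg] \, .
\end{align} $$
For $\beta t \gg 1$, $\gamma(\alpha,\beta t) \to \Gamma(\alpha)$ and the 
exponential term vanishes, leaving $t -\alpha/\beta$. 
Since $\alpha/\beta = \mu_{inc}$, each transition acts asymptotically as a step 
change delayed by the mean incubation time.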

Since this model only changes the time-dependent part of the logistic equation, 
$\Omega(t)$, the solution to this modified logistic differential equation, with $m+1$
constant rates, reads
$$n_m(t) = 1\big/\Big[ 1 + e^{-\Omega_m(t)} \Big] 
= 1 \big /\bigg( 1 + \exp\Big[ -\omega_0(t-t_0)
-\sum_{i=1}^m \big( \omega_i -\omega_{i-1} \big) I_\Gamma(t-t_i,\alpha,\beta) \Big] \bigg) \, . $$
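
The pieces above can be assembled into a short, dependency-free sketch of 
$n_m(t)$. 
This is not the implementation used for the fits in this notebook: the function
names, the power-series evaluation of the incomplete gamma function and the
example rates are assumptions made for illustration. 
In practice a library routine such as `scipy.special.gammainc` (the regularized
lower incomplete gamma function) could replace `gamma_cdf`.

```python
import math

def gamma_cdf(t, alpha, beta, tol=1e-14, max_terms=10_000):
    """F_Gamma(t) = gamma(alpha, beta t) / Gamma(alpha), with F = 0 for t <= 0.

    Power series: gamma(a, x) = x^a e^-x * sum_k x^k / (a (a+1) ... (a+k)).
    Adequate for beta * t up to a few hundred; not tuned for extreme arguments.
    """
    if t <= 0.0:
        return 0.0
    x = beta * t
    term = 1.0 / alpha  # k = 0 term of the series
    total = term
    for k in range(1, max_terms):
        term *= x / (alpha + k)
        total += term
        if term < tol * total:
            break
    log_prefactor = alpha * math.log(x) - x - math.lgamma(alpha)
    return min(1.0, total * math.exp(log_prefactor))

def gamma_cdf_integral(t, alpha, beta):
    """I_Gamma(t): closed-form integral of the CDF from -infinity to t."""
    if t <= 0.0:
        return 0.0
    boundary = math.exp((alpha - 1.0) * math.log(beta) + alpha * math.log(t)
                        - beta * t - math.lgamma(alpha))
    return (t - alpha / beta) * gamma_cdf(t, alpha, beta) + boundary

def big_omega(t, omegas, times, t0, alpha, beta):
    """Omega_m(t) = omega_0 (t - t0) + sum_i (omega_i - omega_{i-1}) I_Gamma(t - t_i)."""
    total = omegas[0] * (t - t0)
    for i, t_i in enumerate(times, start=1):
        total += (omegas[i] - omegas[i - 1]) * gamma_cdf_integral(t - t_i, alpha, beta)
    return total

def n_m(t, omegas, times, t0, alpha, beta):
    """n_m(t) = 1 / (1 + exp[-Omega_m(t)]), evaluated without overflow."""
    x = big_omega(t, omegas, times, t0, alpha, beta)
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# Gamma parameters from the incubation time mean (6.5 d) and std (2.6 d).
alpha, beta = 6.5**2 / 2.6**2, 6.5 / 2.6**2
# Hypothetical scenario: a single transition from 0.2 to 0.05 per day at t_1 = 20 d.
omegas, times, t0 = [0.2, 0.05], [20.0], 30.0

for t in (10.0, 30.0, 50.0):
    print(t, n_m(t, omegas, times, t0, alpha, beta))
```

Setting all entries of `omegas` equal recovers the constant rate logistic curve,
which provides a quick consistency check of the implementation.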