# EG_births_model

E. Quinn 4/5/2021

Probability model for births

### Import standard python datascience packages

In [78]:
import sys
import math
import re
import copy as cp
import numpy as np
import scipy as sc
import pandas as pd
import matplotlib.pyplot as plt
#plt.switch_backend('WebAgg')
import seaborn as sns
import pickle
%matplotlib inline

In [79]:
from datetime import datetime, timedelta, date
from datascience import *
import uuid
import random

In [80]:
from scipy.stats import nbinom
import numpy.random as npr

### Overview

This notebook builds a probability model for EG births using the 2020-2021 NESDEC data.

It uses birth counts for the 17 years 2004-2020.

To summarize the results:

* A negative binomial model fits the data well
* The model produces single-year confidence intervals that are consistent with the data
* The model extends to total births in a multiyear period. These confidence intervals are also consistent with the data. 

### Implications

* The fact that a model that assumes an independent, identical distribution for each year fits the data well is evidence that the birth rate in EG has remained fairly constant over the data period, and it is reasonable to consider the variation we see as purely random fluctuations.

* Under this model the expected number of births in a year is 105, and with 90% confidence we expect the number of births to fall within roughly 20 of that (the 90% CI being [85,127]).




### Set path to data files

In [81]:
data_path = '../'
!pwd

/home/gquinn/EG/school_committee/enrollment/notebooks


### Get EG Birth Counts

* Read births for 2005-2020 from 2020-2021 NESDEC data into an array
* Add 98 births for 2004 from Milone and MacBroom 2019 demographic study

In [82]:
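# Load the NESDEC table; births are read from the third column (index 2)
ta = np.genfromtxt(data_path + 'NESDEC_2020_2021.csv', delimiter=",", skip_header=1)
births_array = np.empty((17, 2))            # columns: year, births
births_array[:, 0] = np.arange(2004, 2021)  # years 2004-2020
births_array[0, 1] = 98                     # 2004 count from Milone and MacBroom 2019
births_array[1:, 1] = ta[0:16, 2]           # 2005-2020 counts from NESDEC
print(births_array)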

[[2004.   98.]
 [2005.  102.]
 [2006.  112.]
 [2007.   99.]
 [2008.  107.]
 [2009.   91.]
 [2010.   97.]
 [2011.  116.]
 [2012.   99.]
 [2013.  121.]
 [2014.  119.]
 [2015.  115.]
 [2016.  119.]
 [2017.   81.]
 [2018.   83.]
 [2019.  126.]
 [2020.  105.]]


### Get mean and variance of counts

In [83]:
mu = np.mean(births_array[:, 1])
var = np.var(births_array[:, 1])   # population variance (NumPy default, ddof=0)

print('Mean births: ',mu,' births variance: ',var)

Mean births:  105.29411764705883  births variance:  165.3840830449827


### Determine model and parameters

The variance is considerably larger than the mean, indicating *overdispersion*. A Poisson model, whose variance equals its mean, would understate the spread in the data.

As an alternative, we consider a __[negative binomial distribution](https://en.wikipedia.org/wiki/Negative_binomial_distribution)__ with probability function:
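
As a quick check of the overdispersion claim, the ratio of variance to mean (the dispersion index) can be computed directly. This is a minimal sketch that hardcodes the 17 counts printed above instead of reading the CSV; the name `dispersion_index` is introduced here for illustration only.

```python
import numpy as np

# Birth counts for 2004-2020, copied from the printed births_array above
births = np.array([98, 102, 112, 99, 107, 91, 97, 116, 99,
                   121, 119, 115, 119, 81, 83, 126, 105])

mean = births.mean()
variance = births.var()              # population variance (ddof=0), as in the notebook
dispersion_index = variance / mean   # equals 1 for a Poisson distribution

print(round(mean, 2), round(variance, 2), round(dispersion_index, 2))
```

For these counts the ratio comes out near 1.57, so the variance is more than half again as large as a Poisson model would allow.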

$$P(Y=y) = {y + r -1 \choose y}(1-p)^{r}p^y,\quad y=0,1,2,3,...$$

We can estimate the parameters $r$ and $p$ using the *method of moments*.  We equate the sample mean and variance to the formulas for the expected value and variance of $y$:

$$ \frac{pr}{1-p} = 105.29 $$

$$ \frac{pr}{(1-p)^2} = 165.38$$

and solve the system for $r$ and $p$ to obtain:
* $\hat{p}$ = 0.3633
* $\hat{r}$ = 184.49

These are the parameters for a negative binomial random variable with mean $105.29$ and variance $165.38$.
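
The algebra behind these estimates is short: dividing the mean equation by the variance equation gives $1-p = 105.29/165.38$, and substituting back into the mean equation gives $r = 105.29(1-p)/p$. The sketch below wraps that in a function and checks the result against SciPy. The function name `nb_method_of_moments` is hypothetical, and note that SciPy's `nbinom` takes the complement, $1-p$, as its second parameter.

```python
from scipy.stats import nbinom

def nb_method_of_moments(mean, variance):
    """Return (r, p) for the parameterization used above, assuming variance > mean."""
    if variance <= mean:
        raise ValueError("method of moments needs variance > mean (overdispersion)")
    p = (variance - mean) / variance   # from (1 - p) = mean / variance
    r = mean * (1 - p) / p             # from mean = p * r / (1 - p)
    return r, p

r_hat, p_hat = nb_method_of_moments(105.29, 165.38)

# SciPy's nbinom takes (n, prob) with prob = 1 - p in this notebook's notation
check_mean, check_var = nbinom.stats(r_hat, 1 - p_hat, moments="mv")
print(r_hat, p_hat, float(check_mean), float(check_var))
```

The round trip through `nbinom.stats` should return the mean and variance that went in, which confirms that the parameter mapping to SciPy is the intended one.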

### Compute method of moments estimates for r and p

In [84]:
p = (165.38 - 105.29)/165.38
print('Method of moments estimate for p:',p)
r = 105.29*(1-p)/p
print('Method of moments estimate for r:',r)

Method of moments estimate for p: 0.3633450235820534
Method of moments estimate for r: 184.4896671659178


### Compute single-year quartiles and confidence interval bounds

With 17 observations, we expect 1.7 observations to fall outside a 90% CI.  In the sample, there were two (81 and 83).

We expect about 0.85 observations to fall outside a 95% CI.  In this sample none did, although one year landed on the lower bound, 81, so these results are in line with expectations.
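
The comparison of expected and observed counts outside each interval can be reproduced as follows. This is a sketch under two assumptions: the 17 counts are hardcoded from the printed array, and the interval bounds are copied from the output of the cell that follows rather than recomputed.

```python
import numpy as np

# Birth counts for 2004-2020, copied from the printed births_array
births = np.array([98, 102, 112, 99, 107, 91, 97, 116, 99,
                   121, 119, 115, 119, 81, 83, 126, 105])
n_obs = births.size

# Interval bounds copied from the notebook's single-year output
intervals = {0.90: (85, 127), 0.95: (81, 131)}

results = {}
for level, (lo, hi) in intervals.items():
    expected_outside = n_obs * (1 - level)
    # A value equal to a bound counts as inside the interval
    observed_outside = int(np.sum((births < lo) | (births > hi)))
    results[level] = (expected_outside, observed_outside)
    print(f"{level:.0%} interval [{lo}, {hi}]: "
          f"expected {expected_outside:.2f} outside, observed {observed_outside}")
```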

In [85]:
q25 = nbinom.ppf(0.25,r,1-p)               #first quartile
q50 = nbinom.ppf(0.5,r,1-p)                #median
q75 = nbinom.ppf(0.75,r,1-p)               #third quartile
q05 = nbinom.ppf(0.05,r,1-p)               #lower bound - 90% CI           
q95 = nbinom.ppf(0.95,r,1-p)               #upper bound - 90% CI                              
q025 = nbinom.ppf(0.025,r,1-p)             #lower bound - 95% CI
q975 = nbinom.ppf(0.975,r,1-p)             #upper bound - 95% CI
plo = nbinom.cdf(81,r,1-p)                 #percentile of smallest observed value
phi= nbinom.cdf(126,r,1-p)                 #percentile of largest observed value
print('Single year quantiles - first quartile',q25,'median',q50,'third quartile',q75)
print()
print('90 percent confidence interval',q05,q95)
print()
print('95 percent confidence interval',q025,q975)
print()
print('Percentiles of smallest and largest observed values:  smallest:',plo,' largest:',phi)

Single year quantiles - first quartile 96.0 median 105.0 third quartile 114.0

90 percent confidence interval 85.0 127.0

95 percent confidence interval 81.0 131.0

Percentiles of smallest and largest observed values:  smallest: 0.02698009384805631  largest: 0.9459781710220091


### Derive distribution of multi-year totals

If we assume that the births in successive years are independently distributed, we can derive the distribution of the total births in two or more years.

With the independence assumption, the moment generating function of the distribution of the sum will be the product of the moment generating functions for the years in the sum.

Since this model has no trends, the years have identical distributions for the number of births. A common form of the moment generating function for a negative binomial random variable with parameters $r$ and $p$, valid for $t < -\ln p$, is:

$$ M_y(t) = \left(\frac{1-p}{1-pe^t}\right)^r$$

For the sum of two years, the moment generating function of the sum is the product:

$$ M_{2y}(t) = \left(\frac{1-p}{1-pe^t}\right)^r\left(\frac{1-p}{1-pe^t}\right)^r = \left(\frac{1-p}{1-pe^t}\right)^{2r}$$

which we can recognize as the moment generating function of a negative binomial random variable with parameters $2r$ and $p$.

We can extend this to sums of more than two years, the result being that the sum of $n$ independent negative binomial random variables with parameters $r$ and $p$ has a negative binomial distribution with parameters $n\cdot r$ and $p$. 
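
The closure property can also be checked numerically without simulation: convolving the single-year probability function with itself should reproduce the probability function with parameters $2r$ and $p$. The sketch below assumes the rounded estimates from earlier in the notebook and truncates the support at 800, an arbitrary cutoff far above any plausible two-year total.

```python
import numpy as np
from scipy.stats import nbinom

r, p = 184.49, 0.3633          # rounded method-of-moments estimates
k = np.arange(0, 800)          # truncated support (assumed cutoff)

pmf_one = nbinom.pmf(k, r, 1 - p)

# Distribution of the sum of two independent years, by direct convolution;
# the first k.size entries of the convolution are exact for this support
pmf_two_conv = np.convolve(pmf_one, pmf_one)[:k.size]

# Distribution predicted by the moment generating function argument
pmf_two_direct = nbinom.pmf(k, 2 * r, 1 - p)

max_abs_diff = float(np.max(np.abs(pmf_two_conv - pmf_two_direct)))
print(max_abs_diff)
```

The largest difference should be at the level of floating-point rounding, which is what the derivation predicts.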

### Compute quartiles and confidence interval bounds for sums of two consecutive years

The smallest total (164, for 2017-2018) was outside both the 90% and 95% CIs.  Its lower-tail probability under the model was 0.004, or 1 in 250. 

The largest total was inside both CIs.

The smallest value is an outlier.
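
The smallest and largest totals of consecutive years can be recomputed from the single-year counts with a sliding window. This is a sketch that hardcodes the 17 counts from the printed array; the helper name `window_totals` is introduced here for illustration. The same helper works for the three-year and four-year windows.

```python
import numpy as np

years = np.arange(2004, 2021)
births = np.array([98, 102, 112, 99, 107, 91, 97, 116, 99,
                   121, 119, 115, 119, 81, 83, 126, 105])

def window_totals(counts, width):
    """Totals of every run of `width` consecutive entries."""
    return np.convolve(counts, np.ones(width, dtype=int), mode="valid")

for width in (2, 3, 4):
    totals = window_totals(births, width)
    i_min, i_max = int(totals.argmin()), int(totals.argmax())
    print(f"{width}-year totals: "
          f"min {totals[i_min]} starting {years[i_min]}, "
          f"max {totals[i_max]} starting {years[i_max]}")
```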

In [86]:
years = 2
q25 = nbinom.ppf(0.25,years*r,1-p)               #first quartile
q50 = nbinom.ppf(0.5,years*r,1-p)                #median
q75 = nbinom.ppf(0.75,years*r,1-p)               #third quartile
q05 = nbinom.ppf(0.05,years*r,1-p)               #lower bound - 90% CI           
q95 = nbinom.ppf(0.95,years*r,1-p)               #upper bound - 90% CI                              
q025 = nbinom.ppf(0.025,years*r,1-p)             #lower bound - 95% CI
q975 = nbinom.ppf(0.975,years*r,1-p)             #upper bound - 95% CI
plo = round(nbinom.cdf(164,years*r,1-p),4)       #percentile of smallest observed sum
phi = round(nbinom.cdf(240,years*r,1-p),4)       #percentile of largest observed sum
print('Two year total quantiles - first quartile',q25,'median',q50,'third quartile',q75)
print()
print('90 percent confidence interval',q05,q95)
print()
print('95 percent confidence interval',q025,q975)
print()
print('Percentiles of largest and smallest sums of consecutive years:  largest:',phi,' smallest:',plo)

Two year total quantiles - first quartile 198.0 median 210.0 third quartile 223.0

90 percent confidence interval 181.0 241.0

95 percent confidence interval 176.0 247.0

Percentiles of largest and smallest sums of consecutive years:  largest: 0.9468  smallest: 0.004


### Compute quartiles and confidence interval bounds for sums of three consecutive years

The smallest total (283) is within both CIs. The largest total (355) is within the 95% CI but sits just above the 90% upper bound of 353.

The lowest sum of three consecutive years contains the two-year outlier, but when you add the third year, it falls into the CI.  

These results are in line with expectations.
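
The multi-year cells in this notebook repeat the same quantile calculations with a different `years` value. One way to factor that out is a small helper like the one below. It is a sketch, not the notebook's own code: the name `multiyear_summary` and its return format are assumptions, and the parameter estimates are recomputed from the rounded mean and variance exactly as in the earlier cell.

```python
from scipy.stats import nbinom

# Method-of-moments estimates, computed as in the notebook
p = (165.38 - 105.29) / 165.38
r = 105.29 * (1 - p) / p

def multiyear_summary(years, observed_low=None, observed_high=None):
    """Quantiles for the total over `years` independent years."""
    dist = nbinom(years * r, 1 - p)   # sum of n years has parameters n*r and p
    summary = {
        "quartiles": tuple(dist.ppf([0.25, 0.5, 0.75])),
        "ci90": tuple(dist.ppf([0.05, 0.95])),
        "ci95": tuple(dist.ppf([0.025, 0.975])),
    }
    # Optional lower-tail probabilities of the observed extremes
    if observed_low is not None:
        summary["p_low"] = float(dist.cdf(observed_low))
    if observed_high is not None:
        summary["p_high"] = float(dist.cdf(observed_high))
    return summary

three_year = multiyear_summary(3, observed_low=283, observed_high=355)
print(three_year)
```

Calling `multiyear_summary(2, 164, 240)` or `multiyear_summary(4, ...)` would cover the other windows with the same code path.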

In [87]:
years = 3
q25 = nbinom.ppf(0.25,years*r,1-p)               #first quartile
q50 = nbinom.ppf(0.5,years*r,1-p)                #median
q75 = nbinom.ppf(0.75,years*r,1-p)               #third quartile
q05 = nbinom.ppf(0.05,years*r,1-p)               #lower bound - 90% CI           
q95 = nbinom.ppf(0.95,years*r,1-p)               #upper bound - 90% CI                              
q025 = nbinom.ppf(0.025,years*r,1-p)             #lower bound - 95% CI
q975 = nbinom.ppf(0.975,years*r,1-p)             #upper bound - 95% CI
plo = round(nbinom.cdf(283,years*r,1-p),4)       #percentile of smallest observed sum
phi = round(nbinom.cdf(355,years*r,1-p),4)       #percentile of largest observed sum
print('Three year total quantiles - first quartile',q25,'median',q50,'third quartile',q75)
print()
print('90 percent confidence interval',q05,q95)
print()
print('95 percent confidence interval',q025,q975)
print()
print('Percentiles of largest and smallest sums of consecutive years:  largest:',phi,' smallest:',plo)

Three year total quantiles - first quartile 301.0 median 316.0 third quartile 331.0

90 percent confidence interval 280.0 353.0

95 percent confidence interval 273.0 361.0

Percentiles of largest and smallest sums of consecutive years:  largest: 0.9597  smallest: 0.0704


### Compute four-year quartiles and confidence interval bounds

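The smallest sum of four consecutive years (394, for 2007-2010) is well within both the 90% and 95% CIs. The cell below evaluates 455 (the 2011-2014 total) as the largest sum, and that value is also inside both CIs. However, the 2013-2016 total is 121 + 119 + 115 + 119 = 474, which is just above the 95% upper bound of 473.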

In [88]:
years = 4
q25 = nbinom.ppf(0.25,years*r,1-p)               #first quartile
q50 = nbinom.ppf(0.5,years*r,1-p)                #median
q75 = nbinom.ppf(0.75,years*r,1-p)               #third quartile
q05 = nbinom.ppf(0.05,years*r,1-p)               #lower bound - 90% CI           
q95 = nbinom.ppf(0.95,years*r,1-p)               #upper bound - 90% CI                              
q025 = nbinom.ppf(0.025,years*r,1-p)             #lower bound - 95% CI
q975 = nbinom.ppf(0.975,years*r,1-p)             #upper bound - 95% CI
plo = round(nbinom.cdf(394,years*r,1-p),4)       #percentile of smallest observed sum
phi = round(nbinom.cdf(455,years*r,1-p),4)       #percentile of 455, the 2011-2014 sum (the 2013-2016 sum is 474)
print('Four year total quantiles - first quartile',q25,'median',q50,'third quartile',q75)
print()
print('90 percent confidence interval',q05,q95)
print()
print('95 percent confidence interval',q025,q975)
print()
print('Percentiles of largest and smallest sums of consecutive years:  largest:',phi,' smallest:',plo)

Four year total quantiles - first quartile 404.0 median 421.0 third quartile 438.0

90 percent confidence interval 379.0 464.0

95 percent confidence interval 372.0 473.0

Percentiles of largest and smallest sums of consecutive years:  largest: 0.9074  smallest: 0.1497
