# Lifespan of people born in 1900

In this notebook we will use real data on lifespans for people who were born in 1900 and died in Sweden. The data come from Sveriges dödbok (the Swedish death index) and are provided by the Federation of Swedish Genealogical Societies: https://www.rotter.se/swedish-roots

In [None]:
import numpy as np
from matplotlib import pyplot as plt
import pandas as pd

## set larger default font sizes for axes titles and axis labels
plt.rc('axes', titlesize=16)
plt.rc('axes', labelsize=16)

Load the file with dates of birth and death:

In [None]:
## location of the dataset
url="https://raw.githubusercontent.com/aledberg/methodology/main/born1900.csv"
## read this into a pandas dataframe
dat=pd.read_csv(url)

## look at the first 10 entries
print(dat.head(10))

To compute the lifespan (date of death - date of birth) we need to convert the dates from string to datetime format. 

In [None]:
## convert string dates to datetime
dat['birthDate'] = pd.to_datetime(dat['birthDate'])
dat['deathDate'] = pd.to_datetime(dat['deathDate'])

## express the lifespan (age) in fractions of a year
dat['age']=((dat['deathDate']-dat['birthDate']).dt.days)/365.25

## look again at the first 10 rows
print(dat.head(10))
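
To make the conversion concrete, here is a minimal sketch on a made-up two-row table. The dates are invented for illustration and are not rows from the dataset, and the sketch assumes the dates are stored as ISO-formatted strings.

```python
import pandas as pd

# Toy data, invented for illustration; not rows from born1900.csv
toy = pd.DataFrame({
    "birthDate": ["1900-01-01", "1900-06-15"],
    "deathDate": ["1901-01-01", "1980-06-15"],
})

# Parse the strings into datetime values
toy["birthDate"] = pd.to_datetime(toy["birthDate"])
toy["deathDate"] = pd.to_datetime(toy["deathDate"])

# Subtracting two datetime columns gives a timedelta; .dt.days extracts whole days
days = (toy["deathDate"] - toy["birthDate"]).dt.days

# Dividing by 365.25 (the average year length, counting leap years) gives age in years
toy["age"] = days / 365.25
print(toy)
```

Because 1900 was not a leap year, the first toy row spans 365 days and comes out slightly under one year (365/365.25), which shows the small approximation involved in using a fixed year length.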

Now we can make a histogram showing the distribution of lifespans in this cohort.

In [None]:
ax=dat.hist('age',bins=100,grid=False,figsize=(12,8))
ax=ax[0][0]
ax.set(xlabel="age", ylabel="number of deaths")
plt.title("Lifespan of people born in 1900")
plt.show()

## Survival functions

The histogram makes it easy to see at what age the highest and lowest number of deaths occurred. It's harder to see what fraction of people in the cohort lived to be 60. Another representation of the data will make that easy: the survival function.

In [None]:
## sort the lifespans; just before the i-th death, n-i+1 people are still alive
sortAge=dat['age'].sort_values()
nAlive=np.arange(len(sortAge),0,-1)
plt.figure(figsize=(12,8))
plt.plot(sortAge,nAlive)
plt.xlabel("age")
plt.ylabel("persons still alive")
plt.grid()
plt.show()

If we divide the number of people still alive by the total number of people in the cohort at the start, we get the survival function for the cohort.

In [None]:
pAlive=nAlive/len(nAlive)
plt.figure(figsize=(12,8))
plt.plot(sortAge,pAlive)
plt.xlabel("age")
plt.ylabel("probability of still being alive")
plt.title("Survival of persons born in 1900")
plt.grid()
plt.show()
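
The same computation can be wrapped in a small reusable function. The sketch below is one possible way to do it, checked on invented toy lifespans; the function name `survival_function` is hypothetical and not part of the notebook.

```python
import numpy as np

def survival_function(ages):
    """Return the sorted ages and the fraction of the cohort alive just before each death."""
    ages = np.sort(np.asarray(ages, dtype=float))
    n = len(ages)
    # n, n-1, ..., 1 people are alive just before the 1st, 2nd, ..., last death
    n_alive = np.arange(n, 0, -1)
    return ages, n_alive / n

# Toy lifespans in years, invented for illustration
ages, p_alive = survival_function([70.0, 2.5, 81.0, 55.0])
print(ages)     # sorted ages: 2.5, 55, 70, 81
print(p_alive)  # fractions alive: 1, 0.75, 0.5, 0.25
```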

### Questions: 
At what age had half the cohort died? <br> 
What was the probability of living until 20 years of age?
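
Questions like these can also be answered numerically instead of by reading the plot. The sketch below shows one approach on invented toy lifespans: the fraction surviving to a given age is the share of lifespans greater than that age, and the age at which half the cohort had died is the median lifespan. The helper name `frac_surviving` is hypothetical.

```python
import numpy as np

# Toy lifespans in years, invented for illustration
ages = np.array([0.5, 19.0, 45.0, 62.0, 70.0, 78.0, 83.0, 91.0])

def frac_surviving(ages, a):
    """Fraction of the cohort whose lifespan exceeded age a."""
    return np.mean(np.asarray(ages) > a)

p20 = frac_surviving(ages, 20)   # 6 of the 8 toy lifespans exceed 20
median_age = np.median(ages)     # age by which half the toy cohort had died
print(p20, median_age)
```

On the real data, the same calls would take `dat['age']` in place of the toy array.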

## Differences in survival between men and women
Let us next look at differences in survival between men and women. Do you think the survival functions differ? If so, how? <br> Let's take a look:

In [None]:
fig, ax = plt.subplots(figsize=(12,8))

## men (coded "m")
datm=dat[dat['sex']=="m"]
sortAge=datm['age'].sort_values()
## fraction of men still alive just before each death
pAlive=np.arange(len(sortAge),0,-1)/len(sortAge)
ax.plot(sortAge,pAlive)
## women (coded "k", presumably for the Swedish "kvinna")
datw=dat[dat['sex']=="k"]
sortAge=datw['age'].sort_values()
## fraction of women still alive just before each death
pAlive=np.arange(len(sortAge),0,-1)/len(sortAge)
ax.plot(sortAge,pAlive)

plt.xlabel("age")
plt.ylabel("probability of still being alive")
plt.grid()
ax.legend(["men","women"])
plt.show()
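
The per-group computation repeats the same steps for each sex, so it could also be written as a loop over groups. The sketch below illustrates this with `groupby` on a small invented table; the codes "m" and "k" mirror those in the `sex` column, and everything else is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "sex": ["m", "k", "m", "k", "k"],
    "age": [60.0, 85.0, 72.0, 40.0, 90.0],
})

curves = {}
for sex, group in toy.groupby("sex"):
    ages = np.sort(group["age"].to_numpy())
    # fraction of this group still alive just before each death
    p_alive = np.arange(len(ages), 0, -1) / len(ages)
    curves[sex] = (ages, p_alive)

for sex, (ages, p_alive) in curves.items():
    print(sex, ages, p_alive)
```

Each entry of `curves` holds the x and y values for one survival curve, so plotting them amounts to one `ax.plot(ages, p_alive, label=sex)` call per group.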

## Estimate the hazard function

In [None]:
## use a simple estimate: of those alive at the start of each year of age,
## the fraction who die during that year
dat['rage']=np.floor(dat['age'])  ## completed years of age at death
## number of deaths at each age, ordered by age
ndead=dat['rage'].value_counts().sort_index()

## number alive at the start of each age: the whole cohort minus deaths at younger ages
nalive=len(dat)-(ndead.cumsum()-ndead)
## here we calculate the hazard (a Series indexed by age, so plots show age on the x-axis)
haz=ndead/nalive

plt.figure(figsize=(12,8))
plt.plot(haz)
plt.xlabel("age")
plt.ylabel("hazard rate (per year)")
plt.show()
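
As a check of the idea on data small enough to verify by hand, the sketch below estimates the yearly hazard as the number of deaths at each completed year of age divided by the number of people still alive at the start of that year. The lifespans are invented and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy lifespans in years, invented for illustration
age = pd.Series([0.3, 0.8, 1.5, 2.1, 2.6, 2.9])

# Deaths per completed year of age: 2 at age 0, 1 at age 1, 3 at age 2
n_dead = np.floor(age).value_counts().sort_index()

# Alive at the start of each year = cohort size minus deaths at younger ages
n_alive = len(age) - (n_dead.cumsum() - n_dead)

# Hazard: share of those alive at the start of the year who die during it
hazard = n_dead / n_alive
print(hazard)
```

Here 6, 4 and 3 people are alive at the start of ages 0, 1 and 2, giving hazards of 2/6, 1/4 and 3/3. The hazard in the last year is always 1 with this estimate, since everyone still alive at that point dies within it.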

In [None]:
## use a log scale: an exponential increase in hazard shows up as a straight line
plt.figure(figsize=(12,8))
plt.plot(haz)
plt.yscale("log")
plt.xlabel("age")
plt.ylabel("hazard rate (per year)")
plt.show()
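
If the hazard grows exponentially with age, its logarithm is a straight line in age, and the slope of that line is the growth rate. The sketch below illustrates this on synthetic hazard values generated from an assumed exponential curve. The parameters 1e-4 and 0.09 are made up for the example and are not estimated from the cohort.

```python
import numpy as np

# Synthetic hazard: h(age) = a * exp(b * age), with made-up parameters
ages = np.arange(40, 90)
a_true, b_true = 1e-4, 0.09
hazard = a_true * np.exp(b_true * ages)

# Fit a straight line to log(hazard); the slope estimates b, the intercept log(a)
slope, intercept = np.polyfit(ages, np.log(hazard), 1)

# With a constant growth rate b, the hazard doubles every ln(2)/b years
doubling_time = np.log(2) / slope
print(slope, np.exp(intercept), doubling_time)
```

Applying the same fit to an estimated hazard would likely require restricting it to the age range where the log-scale plot looks roughly linear and dropping any ages with zero hazard, since the logarithm of zero is undefined.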