# Prediction using Swedish mortality data

In this notebook we will look at how predictions can be used to investigate excess mortality due to influenza or other viruses. The data come from [Statistics Sweden](https://www.scb.se/hitta-statistik/corona/corona-i-statistiken/#Statistik). The dataset we will work with has been edited: all deaths without a determined date have been removed. This means that the total number of deaths will be slightly lower than it is in reality.

Start by loading the dataset and printing the first few lines.

In [None]:
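## suppress warnings and set the plot size
options(warn=-1)
options(repr.plot.width=14, repr.plot.height=8)
## library() stops with an error if a package is missing
library(ggplot2)
library(dplyr)
data <- readRDS("mortality_data_from_SCB.rds")
## print the first few lines
head(data)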

The dataset contains daily counts of deaths in Sweden from 2015-01-01 to 2021-01-31. We have seen that the counts vary with the season. We will now build a simple model that takes this variation into account.

In [None]:
## prepare a temporary dataset: both sexes, all ages, 2015-2019
tdata <- data %>%
  mutate(month=format(date,"%m")) %>%
  filter(sex=="both" & agegr=="all" & date < as.Date("2020-01-01")) %>%
  ## time = days since 2015-01-01, the first date in the data
  mutate(time=as.integer(date-as.Date("2015-01-01")))
## use poisson regression to fit the model (month effect plus linear trend)
fit <- glm(count ~ 1+month+time, data=tdata, family="poisson")
## predict using the model
tdata$pred <- predict(fit,type="response")
## and plot the data plus prediction
ggplot(data=tdata,aes(x=date,y=count)) +
  geom_point(size=0.2) +
  geom_line(aes(y=pred)) +
  ylab("deaths per day") + xlab("calendar date") +
  theme_bw(base_size=18)
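
To make the model explicit, here is one way to write down what the `glm` call fits. This assumes the default log link of the Poisson family and R's default treatment contrasts for `month`. The symbols are our own notation and do not appear elsewhere in the notebook.

$$
Y_t \sim \mathrm{Poisson}(\mu_t), \qquad \log \mu_t = \beta_0 + \beta_{m(t)} + \gamma\, t
$$

Here $Y_t$ is the number of deaths on day $t$, $t$ is the variable `time` (days since 2015-01-01), $m(t)$ is the calendar month of day $t$, and $\beta_{m}$ is the effect of month $m$ relative to January (so $\beta_{01} = 0$). The factor $e^{\gamma}$ is the multiplicative change in the expected number of daily deaths from one day to the next.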

Next, we use the model we fitted to the data from 2015-2019 to predict the deaths from 2020 onwards.

In [None]:
## create a dataset with data from 2020-01-01 onwards
tdata <- data %>%
  mutate(month=format(date,"%m")) %>%
  filter(sex=="both" & agegr=="all" & date >= as.Date("2020-01-01")) %>%
  ## time must be counted from the same origin as in the fitted model
  mutate(time=as.integer(date-as.Date("2015-01-01")))
## use the model (contained in fit) to predict deaths
tdata$pred <- predict(fit,newdata=tdata,type="response")
## plot the observed counts (points) and the prediction (line)
ggplot(data=tdata,aes(x=date,y=count)) +
  geom_point(size=0.2) +
  geom_line(aes(y=pred)) +
  ylab("deaths per day") + xlab("calendar date") +
  theme_bw(base_size=18)
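
The gap between the points and the line can also be summarised numerically. The sketch below shows one possible way to do this, by summing observed minus predicted deaths per calendar month. It uses a small hand-made data frame `toy` with hypothetical values, not the SCB data. The column names `excess`, `ym` and `ratio` are our own choices.

```r
## hypothetical toy values, NOT the SCB data; columns mirror tdata
toy <- data.frame(
  date  = as.Date(c("2020-03-30", "2020-03-31", "2020-04-01", "2020-04-02")),
  count = c(260, 270, 300, 310),   # observed deaths per day
  pred  = c(250, 250, 245, 245)    # deaths per day predicted by the model
)
## excess deaths per day = observed minus predicted
toy$excess <- toy$count - toy$pred
## year-month label used for grouping
toy$ym <- format(toy$date, "%Y-%m")
## sum observed, predicted and excess deaths within each month
excess_by_month <- aggregate(cbind(count, pred, excess) ~ ym, data = toy, FUN = sum)
## ratio of observed to predicted deaths (1 means no excess)
excess_by_month$ratio <- excess_by_month$count / excess_by_month$pred
excess_by_month
```

Applied to `tdata` from the cell above instead of `toy`, the same steps would give a monthly table of excess deaths. A negative value means that fewer deaths were observed than the model predicted.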


How do you explain the two waves of excess mortality?