# Prepare COVID-19 data from the Robert Koch Institute (RKI)

## 1. Prepare spatial data

RKI COVID-19 data are reported per district (**LK = "Landkreis"**) but do not contain coordinates.
We therefore read a polygon shapefile of the districts and compute the centroid of each polygon.
These coordinates are then joined to the COVID-19 data via the Landkreis ID.

Data source: [Robert Koch-Institut (RKI), dl-de/by-2-0](https://npgeo-corona-npgeo-de.hub.arcgis.com/datasets/917fc37a709542548cc3be077a786c17_0)
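
The centroid step described above can be illustrated in base R. This is a minimal sketch of the area-weighted centroid of a single simple polygon ring (shoelace formula), treating the coordinates as planar. It is not the `rgeos` implementation: it ignores holes and multi-part polygons, and the function name `polygon_centroid` and the toy rectangle are assumptions made for illustration.

```r
# Area-weighted centroid of a simple (non-self-intersecting) polygon ring.
# x, y: vertex coordinates; the ring may be open or closed.
polygon_centroid <- function(x, y) {
  n <- length(x)
  # close the ring if the last vertex does not repeat the first
  if (x[1] != x[n] || y[1] != y[n]) {
    x <- c(x, x[1])
    y <- c(y, y[1])
    n <- n + 1
  }
  i <- seq_len(n - 1)
  cross <- x[i] * y[i + 1] - x[i + 1] * y[i]  # shoelace terms
  area  <- sum(cross) / 2                     # signed area
  c(x = sum((x[i] + x[i + 1]) * cross) / (6 * area),
    y = sum((y[i] + y[i + 1]) * cross) / (6 * area))
}

# hypothetical toy polygon: a 4 x 2 rectangle with its corner at the origin
cen_toy <- polygon_centroid(c(0, 4, 4, 0), c(0, 0, 2, 2))
print(cen_toy)  # x = 2, y = 1
```

For district polygons in geographic coordinates (Lat/Lon), a planar centroid like this is only an approximation, which is usually acceptable for placing a marker per district.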

In [1]:
require(rgdal) # for reading shapefile
require(rgeos) # for computing centroids

# read Landkreis Shapefile
lk <- readOGR("data/Landkreise.shp")
#str(lk@data)

# compute centroid per polygon
cen <- gCentroid(lk, byid=TRUE)
#str(cen)

# make table with coordinates and Landkreis ID
lk_ <- cbind(as.data.frame(cen@coords), RS=as.integer(gsub("^0", "", lk$RS)), POP=lk$EWZ)
str(lk_)

Loading required package: rgdal

Loading required package: sp

rgdal: version: 1.4-7, (SVN revision 845)
 Geospatial Data Abstraction Library extensions to R successfully loaded
 Loaded GDAL runtime: GDAL 2.2.2, released 2017/09/15
 Path to GDAL shared files: /usr/share/gdal/2.2
 GDAL binary built with GEOS: TRUE 
 Loaded PROJ.4 runtime: Rel. 4.9.2, 08 September 2015, [PJ_VERSION: 492]
 Path to PROJ.4 shared files: (autodetected)
 Linking to sp version: 1.3-2 

Loading required package: rgeos

rgeos version: 0.5-3, (SVN revision 634)
 GEOS runtime version: 3.5.1-CAPI-1.9.1 
 Linking to sp version: 1.3-2 
 Polygon checking: TRUE 




OGR data source with driver: ESRI Shapefile 
Source: "/home/frantzda/cor/covid19/data/Landkreise.shp", layer: "Landkreise"
with 412 features
It has 39 fields
'data.frame':	412 obs. of  4 variables:
 $ x  : num  9.44 10.13 10.73 9.98 9.11 ...
 $ y  : num  54.8 54.3 53.9 54.1 54.1 ...
 $ RS : int  1001 1002 1003 1004 1051 1053 1054 1055 1056 1057 ...
 $ POP: int  89504 247548 217198 79487 133210 197264 165507 200581 314391 128647 ...


We simplify the geometry of the states (Bundesländer) for more efficient visualization.

In [2]:
bl <- readOGR("data/BL_mit_EW_und_Faellen.shp")
bl_simple <- gSimplify(bl, topologyPreserve=TRUE, tol = 0.025)
writeOGR(as(bl_simple, "SpatialPolygonsDataFrame"), "data/BL_simple.shp", "states", "ESRI Shapefile")

OGR data source with driver: ESRI Shapefile 
Source: "/home/frantzda/cor/covid19/data/BL_mit_EW_und_Faellen.shp", layer: "BL_mit_EW_und_Faellen"
with 16 features
It has 14 fields
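
According to the `rgeos` documentation, `gSimplify` is based on the Douglas-Peucker algorithm. The following base R sketch shows the basic idea for a single open line: a vertex is dropped if it lies within `tol` of the line through the endpoints of its segment. It does not reproduce the topology-preserving variant used above, and the helper names and toy coordinates are assumptions for illustration only.

```r
# perpendicular distance of points (px, py) from the line through a and b
point_line_dist <- function(px, py, a, b) {
  dx  <- b[[1]] - a[[1]]
  dy  <- b[[2]] - a[[2]]
  len <- sqrt(dx^2 + dy^2)
  # degenerate case: a and b coincide, fall back to point distance
  if (len == 0) return(sqrt((px - a[[1]])^2 + (py - a[[2]])^2))
  abs(dy * (px - a[[1]]) - dx * (py - a[[2]])) / len
}

# recursive Douglas-Peucker simplification of an open line
# xy: two-column matrix of vertices, tol: distance tolerance
douglas_peucker <- function(xy, tol) {
  n <- nrow(xy)
  if (n < 3) return(xy)
  inner <- 2:(n - 1)
  d <- point_line_dist(xy[inner, 1], xy[inner, 2], xy[1, ], xy[n, ])
  k <- which.max(d)
  # all interior vertices are close enough: keep only the endpoints
  if (d[k] <= tol) return(xy[c(1, n), , drop = FALSE])
  split <- k + 1  # row index of the farthest vertex in xy
  left  <- douglas_peucker(xy[1:split, , drop = FALSE], tol)
  right <- douglas_peucker(xy[split:n, , drop = FALSE], tol)
  # the split vertex is in both halves; keep it only once
  rbind(left[-nrow(left), , drop = FALSE], right)
}

# hypothetical toy line: small wiggles along the x axis, then a sharp corner
xy <- cbind(x = c(0, 1, 2, 3, 4, 4), y = c(0, 0.01, 0, 0.01, 0, 1))
simple_toy <- douglas_peucker(xy, tol = 0.025)
print(simple_toy)
```

With `tol = 0.025`, the small wiggles are removed and the corner at (4, 0) is kept, so the six vertices reduce to three.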


## 2. Prepare COVID-19 data

### 2.1 Download and clean data

This code snippet downloads and prepares the **latest COVID-19 data** from RKI.
RKI updates the COVID-19 case data daily.

The cases are reported per district ("Landkreis"), with the exception of Berlin, where cases are reported for the 12 "Bezirke".

Data source: [Robert Koch-Institut (RKI), dl-de/by-2-0](https://npgeo-corona-npgeo-de.hub.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0)
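
The notebook keeps the Berlin Bezirke separate; the step that would dissolve them into a single Berlin unit is present but commented out in the code cell of this section. As a hedged illustration of what that step does, the sketch below applies the same ID range and regular expression to a small toy table. The district names and IDs in `toy` are illustrative assumptions, and the coordinate table would need a matching entry for ID 11000 for the later lookup to work.

```r
# hypothetical toy records; IDs between 11000 and 12000 stand for Berlin Bezirke
toy <- data.frame(
  RS        = c(11001L, 11012L, 12051L, 1001L),
  Landkreis = c("SK Berlin Mitte", "SK Berlin Reinickendorf",
                "SK Brandenburg a.d.Havel", "SK Flensburg"),
  stringsAsFactors = FALSE
)

# map all Berlin Bezirk IDs to one ID for the whole city
is_berlin <- toy$RS > 11000 & toy$RS < 12000
toy$RS[is_berlin] <- 11000L

# strip the Bezirk name, keeping only "SK Berlin"
toy$Landkreis <- gsub("(SK Berlin).*", "\\1", toy$Landkreis)
print(toy)
```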

In [3]:
# download latest data
#download.file("https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv", "data/RKI_COVID19.csv")
download.file("https://www.arcgis.com/sharing/rest/content/items/f10774f1c63e40168479a1feb6c7ca74/data", "data/RKI_COVID19.csv")

# read COVID-19 statistics
rki <- read.csv("data/RKI_COVID19.csv")
#str(rki)

# add new column for Landkreis ID
rki <- cbind(rki, RS=as.integer(rki$IdLandkreis))
#str(rki)

# dissolve Berlin districts
#rki$RS[which(rki$RS > 11000 & rki$RS < 12000)] <- 11000L
#rki$Landkreis <- gsub("(SK Berlin).*", "\\1", rki$Landkreis)

# fix case counts below 1 (e.g. -1)
# the cause is unclear, but setting these values to 1 reproduces the case numbers reported in the news
rki$AnzahlFall[which(rki$AnzahlFall < 1)] <- 1

# add new columns with a parsed date and the day of year (DOY)
rki <- cbind(rki, date=as.POSIXct(rki$Meldedatum))
rki <- cbind(rki, doy=as.integer(format(rki$date, "%j")))
str(rki)


'data.frame':	152938 obs. of  21 variables:
 $ FID                 : int  13365240 13365241 13365242 13365243 13365244 13365245 13365246 13365247 13365248 13365249 ...
 $ IdBundesland        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Bundesland          : Factor w/ 16 levels "Baden-Württemberg",..: 15 15 15 15 15 15 15 15 15 15 ...
 $ Landkreis           : Factor w/ 412 levels "LK Ahrweiler",..: 336 336 336 336 336 336 336 336 336 336 ...
 $ Altersgruppe        : Factor w/ 7 levels "A00-A04","A05-A14",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Geschlecht          : Factor w/ 3 levels "M","unbekannt",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ AnzahlFall          : num  1 1 1 1 1 1 1 1 1 1 ...
 $ AnzahlTodesfall     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Meldedatum          : Factor w/ 129 levels "2020/01/28 00:00:00",..: 31 36 36 38 44 52 54 55 57 107 ...
 $ IdLandkreis         : int  1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 ...
 $ Datenstand          : Factor w/ 1 level "21.06.2020, 00:00 Uhr": 1 1 1 1 1 1 1 1 1 1
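
The date handling can be checked in isolation. The sketch below parses two values in the `Meldedatum` format shown above and derives the day of year. Passing an explicit `format` and `tz = "UTC"` is an assumption added here so that the result does not depend on the local time zone; the notebook itself relies on the defaults.

```r
# example values in the format of the Meldedatum column
meldedatum <- c("2020/03/14 00:00:00", "2020/03/19 00:00:00")

# explicit format and time zone (UTC is an assumption, not taken from the source)
date_toy <- as.POSIXct(meldedatum, format = "%Y/%m/%d %H:%M:%S", tz = "UTC")

# "%j" is the day of year as a zero-padded string, e.g. "074"
doy_toy <- as.integer(format(date_toy, "%j"))
print(doy_toy)  # 74 79
```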

### 2.2 Convert to dataframe

The RKI data are very detailed.
To keep things simple, we keep only the necessary variables:
- coordinates (Lat/Lon)
- time
- Landkreis name and ID
- population

If all properties (e.g. district, date, age group, gender) are identical, RKI appears to group the cases into a single record with a case count.
However, we need the number of cases per district and day, regardless of age or gender.
1. We first explode these groups into individual cases.
   The dataframe ``df`` has one row for each COVID-19 case in Germany, so many of its rows are identical.
2. We then count identical rows.
   The new dataframe ``df_day`` has one row per district and day, with the case count as an extra column.
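
The two steps above can be tried on a small toy table. This sketch uses base R `aggregate` as a stand-in for `plyr::count`, which the notebook uses; the table `grouped` and its values are made up for illustration.

```r
# hypothetical grouped records: several rows per district and day
# (e.g. one per age group), each with a case count
grouped <- data.frame(
  ID         = c(1001L, 1001L, 1001L, 1002L),
  DOY        = c(74L, 74L, 79L, 74L),
  AnzahlFall = c(3, 1, 2, 5)
)

# step 1: explode, repeating each record AnzahlFall times (one row per case)
idx   <- rep(seq_len(nrow(grouped)), grouped$AnzahlFall)
cases <- grouped[idx, c("ID", "DOY")]

# step 2: count identical rows (one row per district and day)
cases$N <- 1L
per_day <- aggregate(N ~ ID + DOY, data = cases, FUN = sum)
per_day <- per_day[order(per_day$ID, per_day$DOY), ]
print(per_day)
```

A simple consistency check is that the counts add up to the number of exploded rows, i.e. `sum(per_day$N) == nrow(cases)`.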

In [4]:
require(plyr) # count cases for each LK and day

# compile dataframe with one row per case
# a record can hold several cases (AnzahlFall) -> repeat each record AnzahlFall times
df <- data.frame(X=0, 
                 Y=0, 
                 T=rep(rki$date,       rki$AnzahlFall),
                 DOY=rep(rki$doy,      rki$AnzahlFall),
                 LK=rep(rki$Landkreis, rki$AnzahlFall),
                 ID=rep(rki$RS,        rki$AnzahlFall),
                 POP=0)
# look up the row of lk_ for each Landkreis ID (NA if an ID has no match)
pos <- match(df$ID, lk_$RS)
df$X   <- lk_$x[pos]
df$Y   <- lk_$y[pos]
df$POP <- lk_$POP[pos]
str(df)
              
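# compile analysis-ready data: count identical rows (one row per LK and day)
df_day <- count(df)
# plyr::count stores the counts in a column named "freq"; rename it to "N"
colnames(df_day)[colnames(df_day) == "freq"] <- "N"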
str(df_day)

Loading required package: plyr



'data.frame':	189849 obs. of  7 variables:
 $ X  : num  9.44 9.44 9.44 9.44 9.44 ...
 $ Y  : num  54.8 54.8 54.8 54.8 54.8 ...
 $ T  : POSIXct, format: "2020-03-14" "2020-03-19" ...
 $ DOY: int  74 79 79 81 87 95 97 98 100 150 ...
 $ LK : Factor w/ 412 levels "LK Ahrweiler",..: 336 336 336 336 336 336 336 336 336 336 ...
 $ ID : int  1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 ...
 $ POP: int  89504 89504 89504 89504 89504 89504 89504 89504 89504 89504 ...
'data.frame':	24985 obs. of  8 variables:
 $ X  : num  9.44 9.44 9.44 9.44 9.44 ...
 $ Y  : num  54.8 54.8 54.8 54.8 54.8 ...
 $ T  : POSIXct, format: "2020-03-14" "2020-03-19" ...
 $ DOY: int  74 79 81 87 95 97 98 100 150 78 ...
 $ LK : Factor w/ 412 levels "LK Ahrweiler",..: 336 336 336 336 336 336 336 336 336 336 ...
 $ ID : int  1001 1001 1001 1001 1001 1001 1001 1001 1001 1001 ...
 $ POP: int  89504 89504 89504 89504 89504 89504 89504 89504 89504 89504 ...
 $ N  : int  4 4 1 1 1 1 1 1 1 2 ...
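
The coordinate lookup by Landkreis ID can be sketched with `match()`, which returns `NA` for IDs that are missing from the lookup table, so unmatched districts can be listed before they silently end up without coordinates. The tables below are toy data: the two lookup rows reuse the rounded values printed in section 1, and the ID 99999 is made up.

```r
# toy lookup table (rounded coordinates and population per district ID)
lookup <- data.frame(
  RS  = c(1001L, 1002L),
  x   = c(9.44, 10.13),
  y   = c(54.8, 54.3),
  POP = c(89504L, 247548L)
)

# toy case IDs; 99999 is a made-up ID without a match
ids <- c(1002L, 1001L, 1001L, 99999L)

# row of the lookup table for each ID, NA where there is no match
pos <- match(ids, lookup$RS)

joined <- data.frame(ID = ids, X = lookup$x[pos],
                     Y = lookup$y[pos], POP = lookup$POP[pos])

# IDs that could not be matched
unmatched <- unique(ids[is.na(pos)])
print(joined)
print(unmatched)
```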


## 3. Export analysis-ready data

In [5]:
# write table with n cases per day and LK
write.csv(df_day, "data/covid19-deu.csv", row.names=FALSE)
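
When the exported file is read back, the date column arrives as text and needs to be converted again. The following round trip is a minimal sketch on a toy table written to a temporary file; the table `toy_day`, the use of `tempfile()`, and the conversion to `Date` are assumptions for illustration, not part of the notebook.

```r
# hypothetical toy table with a POSIXct date column, similar to df_day
toy_day <- data.frame(
  T  = as.POSIXct(c("2020-03-14", "2020-03-19"), tz = "UTC"),
  ID = c(1001L, 1001L),
  N  = c(4L, 4L)
)

# write to a temporary file instead of data/covid19-deu.csv
path <- tempfile(fileext = ".csv")
write.csv(toy_day, path, row.names = FALSE)

# read back; dates come in as character and are converted explicitly
back <- read.csv(path, stringsAsFactors = FALSE)
back$T <- as.Date(back$T)
str(back)
```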