# [Workshop 5. Assessment of Survey Data Quality](https://wapor.org/events/annual-conference/current-conference/training-workshops/)

## Part II: Design effect and weighting

Author: <a href="mailto:alexander.seymer@plus.ac.at?subject=Regarding the WAPOR 2023 workshop">Alexander Seymer @ PLUS</a>

Date: September 19, 2023

### Abstract

In this session of the workshop, we discuss the design effect, a measure of the impact of the sampling design on the variance of an estimator, and weighting as a tool to account for complex sampling designs.

### Preparing the R session


In [2]:
source("install.R")

Loading required package: pacman



### Weighting

Although we will use R, other software packages also provide means to apply weights:

| Software | Reference |
| -------- | --------- |
| Python | Paczkowski, W. R. (2022). Modern survey analysis: Using Python for deeper insights. Springer Nature. [https://link.springer.com/book/10.1007/978-3-030-76267-4](https://link.springer.com/book/10.1007/978-3-030-76267-4) | 
| SAS | Lewis, T. H. (2016). [Complex survey data analysis with SAS. CRC Press.](https://www.routledge.com/Complex-Survey-Data-Analysis-with-SAS/Lewis/p/book/9781498776776) |
| SPSS | Zou, D., Lloyd, J. E. V., & Baumbusch, J. L. (2019). Using SPSS to Analyze Complex Survey Data: A Primer. Journal of Modern Applied Statistical Methods, 18. [http://jmasm.com/index.php/jmasm/article/view/1026](http://jmasm.com/index.php/jmasm/article/view/1026) | 
| Stata | Weighing Data in Stata—Stata Help—Reed College. (2014). Retrieved September 7, 2023, from [https://www.reed.edu/psychology/stata/gs/tutorials/weights.html](https://www.reed.edu/psychology/stata/gs/tutorials/weights.html) |


Complex sampling designs for surveys are commonly applied by employing:

- stratification (divide the population into homogeneous groups and sample a specified number of units from each group);
- clustering (divide the population into groups, e.g. regions, and sample from a random subset of these groups);
- unequal sampling (oversampling of subgroups of interest).

Treating such samples as simple random samples will result in biased standard errors.

A common approach to account for this bias is to provide design weights. Most surveys that use complex sampling strategies provide weights to adjust for the deviation from simple random sampling.
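
To make the idea of a design weight concrete, here is a minimal sketch under stated assumptions: a hypothetical two-stratum design in which the rural stratum is oversampled. All numbers and names (`design`, `incl_prob`, `ybar`) are invented for illustration and do not come from the Values in Crisis data. The design weight is taken as the inverse of the inclusion probability.

```r
# Hypothetical stratified design (all numbers are invented for illustration)
design <- data.frame(
  stratum = c("urban", "rural"),
  N = c(8000, 2000),  # assumed population size per stratum
  n = c(400, 400)     # assumed sample size per stratum (rural is oversampled)
)

# Inclusion probability and design weight (its inverse) per stratum
design$incl_prob <- design$n / design$N
design$weight <- 1 / design$incl_prob

# Hypothetical stratum means of some survey variable
ybar <- c(urban = 0.60, rural = 0.30)

# The unweighted mean treats both strata as if they were equally large
unweighted <- sum(design$n * ybar) / sum(design$n)

# The weighted mean restores the population shares of the strata
weighted <- sum(design$n * design$weight * ybar) /
  sum(design$n * design$weight)

print(design)
print(c(unweighted = unweighted, weighted = weighted))
```

In this toy setting the unweighted mean gives the oversampled rural stratum too much influence, while the weighted mean reflects the 80/20 population split.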


#### Example 1: Weighting Values in Crisis data for Austrian Sample

Weighting requires information about the target population or the inferential population. In this example, official statistics from Statistics Austria will be used. 

The data extracted from STATcube are:

-   [Abgestimmte Erwerbsstatistik - Personen - Zeitreihe ab 2011](https://statcube.at/statistik.at/ext/statcube/openinfopage?id=deaest_aest_zr_personen)

-   [Bevölkerung zu Jahresbeginn ab 2002 (einheitlicher Gebietsstand 2020)](https://statcube.at/statistik.at/ext/statcube/openinfopage?id=debevstandjbab2002)


##### Import data from EUROSTAT

We will start with importing the

In [1]:
search_eurostat("Population by sex, age and educational attainment level") %>%
  datatable()

ERROR: Error in search_eurostat("Population by sex, age and educational attainment level") %>% : could not find function "%>%"



```{r setup, include=FALSE}
library(tidyverse)
library(haven)
library(readxl)
library(knitr)
library(anesrake)
library(here)
library(readODS)
library(flextable)
opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  error = FALSE,
  message = FALSE
)

#VIC_data <- read_spss(here::here("import/Values in Crisis Survey Austria_merged_Welle1_Welle2.sav_aktualisiert.sav"))
#VIC_data <- read_spss(here("import/Values in Crisis Survey #Austria_Welle1_Welle2_2021-08-21_BEARB_MH3.sav"))
#VIC_data <- read_spss(here("import/VIC Datensatz gesamt_Version1_2022-10-14.sav"))
```

# Import Data from STATcube


## Education

The most recent data from official records are from 2020. Hence, they serve as the reference for all three waves.

```{r}
DatEdu <- read_xlsx(here("import/Kathi/VIC3_Gewichtung_offizielle Statistiken_KS.xlsx"),
                    sheet = "Tabelle1",
                    range = "E109:F113",
                    col_names = FALSE) %>%
  setNames(c("Category","Count")) 

# Calculate share
DatEdu$Share <- DatEdu$Count/sum(DatEdu$Count)

DatEdu %>%
  setNames(c("Ausprägung","Anzahl","Prozent")) %>%
  flextable(cwidth = 2) %>%
  set_formatter( Prozent = function(x) sprintf( "%.1f%%", x*100 ) )
```

## Gender

For Wave 1 (2020):

```{r}
tmp <- read_xlsx(here::here("import/StatCube/Gender.xlsx"),
                    range = "B15:D26",
                    col_names = FALSE,
                    sheet = "2020") %>%
  setNames(c("Category","Age","Count"))

DatGndr20 <- data.frame(
  Category = tmp[c(1,7),"Category"],
  Count = c(sum(tmp[1:6,"Count"]),
            sum(tmp[7:12,"Count"]))
)
# Calculate share
DatGndr20$Share <- DatGndr20$Count/sum(DatGndr20$Count)

DatGndr20 %>%
  setNames(c("Ausprägung","Anzahl","Prozent")) %>%
  flextable(cwidth = 1) %>%
  set_formatter( Prozent = function(x) sprintf( "%.1f%%", x*100 ) )
```

For Wave 2 (2021):

```{r}
tmp <- read_xlsx(here::here("import/StatCube/Gender.xlsx"),
                    range = "B15:D26",
                    col_names = FALSE,
                    sheet = "2021") %>%
  setNames(c("Category","Age","Count"))

DatGndr21 <- data.frame(
  Category = tmp[c(1,7),"Category"],
  Count = c(sum(tmp[1:6,"Count"]),
            sum(tmp[7:12,"Count"]))
)

# Calculate share
DatGndr21$Share <- DatGndr21$Count/sum(DatGndr21$Count)

DatGndr21 %>%
  setNames(c("Ausprägung","Anzahl","Prozent")) %>%
  flextable(cwidth = 1) %>%
  set_formatter( Prozent = function(x) sprintf( "%.1f%%", x*100 ) )

```

For Wave 3 (2022):

```{r}
DatGndr22 <- read_xlsx(here("import/Kathi/VIC3_Gewichtung_offizielle Statistiken_KS.xlsx"),
                    sheet = "Tabelle1",
                    range = "C4:D5",
                    col_names = FALSE) %>%
  setNames(c("Category","Count"))

# Calculate share
DatGndr22$Share <- DatGndr22$Count/sum(DatGndr22$Count)

DatGndr22 %>%
  setNames(c("Ausprägung","Anzahl","Prozent")) %>%
  flextable(cwidth = 1) %>%
  set_formatter( Prozent = function(x) sprintf( "%.1f%%", x*100 ) )
```

## Age

Define age groups according to previous weighting procedures:

```{r}
Age_labels <- c("14 - 19 Jahre",
                "20 - 29 Jahre",
                "30 - 39 Jahre",
                "40 - 49 Jahre",
                "50 - 59 Jahre",
                "60 - 69 Jahre",
                "70 Jahre oder älter")
```

For Wave 1 (2020):

```{r}
DatAge20 <- read_xlsx(here::here("import/StatCube/Age.xlsx"),
                    range = "B28:C126",
                    col_names = FALSE,
                    sheet = "2020") %>%
  setNames(c("Category","Count"))

DatAge20$Count <- as.numeric(DatAge20$Count)
DatAge20$Age <- substr(DatAge20$Category,1,3) %>%
  as.numeric()
DatAge20$AgeGroup[DatAge20$Age < 150] <- Age_labels[7]
DatAge20$AgeGroup[DatAge20$Age < 70] <- Age_labels[6]
DatAge20$AgeGroup[DatAge20$Age < 60] <- Age_labels[5]
DatAge20$AgeGroup[DatAge20$Age < 50] <- Age_labels[4]
DatAge20$AgeGroup[DatAge20$Age < 40] <- Age_labels[3]
DatAge20$AgeGroup[DatAge20$Age < 30] <- Age_labels[2]
DatAge20$AgeGroup[DatAge20$Age < 20] <- Age_labels[1]

DatAge20 <- DatAge20 %>%
  group_by(AgeGroup) %>%
  summarise(Count = sum(Count, na.rm = TRUE)) %>%
  setNames(c("Category","Count"))

# Calculate share
DatAge20$Share <- DatAge20$Count/sum(DatAge20$Count)

DatAge20 %>%
  setNames(c("Ausprägung","Anzahl","Prozent")) %>%
  flextable(cwidth = 1.5) %>%
  set_formatter( Prozent = function(x) sprintf( "%.1f%%", x*100 ) )
```

For Wave 2 (2021):

```{r}
DatAge21 <- read_xlsx(here::here("import/StatCube/Age.xlsx"),
                    range = "B28:C126",
                    col_names = FALSE,
                    sheet = "2021") %>%
  setNames(c("Category","Count"))

DatAge21$Count <- as.numeric(DatAge21$Count)
DatAge21$Age <- substr(DatAge21$Category,1,3) %>%
  as.numeric()
DatAge21$AgeGroup[DatAge21$Age < 150] <- Age_labels[7]
DatAge21$AgeGroup[DatAge21$Age < 70] <- Age_labels[6]
DatAge21$AgeGroup[DatAge21$Age < 60] <- Age_labels[5]
DatAge21$AgeGroup[DatAge21$Age < 50] <- Age_labels[4]
DatAge21$AgeGroup[DatAge21$Age < 40] <- Age_labels[3]
DatAge21$AgeGroup[DatAge21$Age < 30] <- Age_labels[2]
DatAge21$AgeGroup[DatAge21$Age < 20] <- Age_labels[1]

DatAge21 <- DatAge21 %>%
  group_by(AgeGroup) %>%
  summarise(Count = sum(Count, na.rm = TRUE)) %>%
  setNames(c("Category","Count"))

# Calculate share
DatAge21$Share <- DatAge21$Count/sum(DatAge21$Count)

DatAge21 %>%
  setNames(c("Ausprägung","Anzahl","Prozent")) %>%
  flextable(cwidth = 1.5) %>%
  set_formatter( Prozent = function(x) sprintf( "%.1f%%", x*100 ) )
```

For Wave 3 (2022):

```{r}
DatAge22 <- read_xlsx(here("import/Kathi/VIC3_Gewichtung_offizielle Statistiken_KS.xlsx"),
                    sheet = "Tabelle1",
                    range = "E7:F13",
                    col_names = FALSE) %>%
  setNames(c("Category","Count"))

# Calculate share
DatAge22$Share <- DatAge22$Count/sum(DatAge22$Count)

DatAge22 %>%
  setNames(c("Ausprägung","Anzahl","Prozent")) %>%
  flextable(cwidth = 1.5) %>%
  set_formatter( Prozent = function(x) sprintf( "%.1f%%", x*100 ) )
```

## County

For Wave 1 (2020):

```{r, message=FALSE}
DatCnty20 <- read_xlsx(here::here("import/StatCube/County.xlsx"),
                    range = "B12:C20",
                    col_names = FALSE,
                    sheet = "2020") %>%
  setNames(c("Category","Count"))

# Fix county labels
DatCnty20$Category <- DatCnty20$Category %>%
  str_split(" ") %>%
  unlist() %>%
  .[seq(1,18,2)]

# Calculate share
DatCnty20$Share <- DatCnty20$Count/sum(DatCnty20$Count)

DatCnty20 %>%
  setNames(c("Ausprägung","Anzahl","Prozent")) %>%
  flextable(cwidth = 1.5) %>%
  set_formatter( Prozent = function(x) sprintf( "%.1f%%", x*100 ) )
```

For Wave 2 (2021):

```{r, message=FALSE}
DatCnty21 <- read_xlsx(here::here("import/StatCube/County.xlsx"),
                    range = "B12:C20",
                    col_names = FALSE,
                    sheet = "2021") %>%
  setNames(c("Category","Count"))

# Fix county labels
DatCnty21$Category <- DatCnty21$Category %>%
  str_split(" ") %>%
  unlist() %>%
  .[seq(1,18,2)]

# Calculate share
DatCnty21$Share <- DatCnty21$Count/sum(DatCnty21$Count)

DatCnty21 %>%
  setNames(c("Ausprägung","Anzahl","Prozent")) %>%
  flextable(cwidth = 1.5) %>%
  set_formatter( Prozent = function(x) sprintf( "%.1f%%", x*100 ) )
```

For Wave 3 (2022):

```{r, message=FALSE}
DatCnty22 <- read_xlsx(here("import/Kathi/VIC3_Gewichtung_offizielle Statistiken_KS.xlsx"),
                       sheet = "Tabelle1",
                       range = "C118:D126",
                       col_names = FALSE) %>%
  setNames(c("Category","Count"))

# Calculate share
DatCnty22$Share <- DatCnty22$Count/sum(DatCnty22$Count)

DatCnty22 %>%
  setNames(c("Ausprägung","Anzahl","Prozent")) %>%
  flextable(cwidth = 1.5) %>%
  set_formatter( Prozent = function(x) sprintf( "%.1f%%", x*100 ) )
```

# Import Data from Survey

The current data is from November 17, 2022.

```{r}
VIC_data <- read_spss(here("import/VIC1-2-3_AUSSDA_Dataset_INTERN_2022-11-17.sav"))
```

## Prepare variables

### Gender

::: {.callout-warning}
Each wave has a separate gender variable (`VIC1_Q1, VIC2_Q1, VIC3_Q1`), which raises the question of whether mismatches exist over time and how to treat them.
:::

```{r}
VIC_data$VIC1_Q1 <- as_factor(VIC_data$VIC1_Q1) %>%
  factor()
VIC_data$VIC2_Q1 <- as_factor(VIC_data$VIC2_Q1) %>%
  factor()
VIC_data$VIC3_Q1 <- as_factor(VIC_data$VIC3_Q1) %>%
  factor()

levels(VIC_data$VIC1_Q1) <- DatGndr20$Category
levels(VIC_data$VIC2_Q1) <- DatGndr20$Category
levels(VIC_data$VIC3_Q1) <- DatGndr20$Category

VIC_data %>%
  dplyr::select(VIC1_Q1, VIC2_Q1, VIC3_Q1) %>%
  table() %>%
  as.data.frame() %>%
  flextable()
```

Currently, the gender variable for each wave is used.
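
One way to check for such mismatches is to count, per respondent, how many distinct non-missing answers occur across the three wave variables. The sketch below uses invented toy data with hypothetical labels (`toy`, `"female"`, `"male"`) standing in for the real columns; it illustrates the check and is not part of the original weighting procedure.

```r
# Toy data standing in for VIC1_Q1, VIC2_Q1 and VIC3_Q1 (values are invented)
toy <- data.frame(
  VIC1_Q1 = c("female", "male", "female", "male", NA),
  VIC2_Q1 = c("female", "male", "male",   NA,     "female"),
  VIC3_Q1 = c("female", "male", "female", "male", "female")
)

# Number of distinct non-missing answers per respondent
n_distinct_answers <- apply(toy, 1, function(row) {
  length(unique(row[!is.na(row)]))
})

# Respondents who report more than one category across waves
mismatch <- unname(n_distinct_answers > 1)

which(mismatch)  # row numbers of respondents with inconsistent answers
sum(mismatch)    # number of respondents with inconsistent answers
```

Missing values are ignored here, so a respondent who skipped a wave is not flagged; whether that is the appropriate treatment is a substantive decision.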

### Age

Calculate age group for Wave 1 based on `VIC1_Q2_Alter`, for Wave 2 based on `VIC2_Q2_Alter` and with `VIC3_Q2_Alter` for Wave 3.

```{r}
VIC_data <- lapply(1:3, function(Wave){
  var_lab <- paste0("VIC",Wave,"_Q2_Alter")
  tmp <- NA
  tmp[VIC_data[var_lab] < 150] <- Age_labels[7]
  tmp[VIC_data[var_lab] < 70] <- Age_labels[6]
  tmp[VIC_data[var_lab] < 60] <- Age_labels[5]
  tmp[VIC_data[var_lab] < 50] <- Age_labels[4]
  tmp[VIC_data[var_lab] < 40] <- Age_labels[3]
  tmp[VIC_data[var_lab] < 30] <- Age_labels[2]
  tmp[VIC_data[var_lab] < 20] <- Age_labels[1]
  tmp %>%
    factor(levels = Age_labels,
           ordered = TRUE) %>%
    as.data.frame() %>%
    return()
}) %>%
  do.call("cbind",.) %>%
  setNames(paste0("Age_W",1:3)) %>%
  cbind(VIC_data,.)
```
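
As a sketch of an alternative, the same threshold logic can be expressed with base R's `cut()`. The break points below are assumed to mirror the cascading `<` comparisons above (left-closed intervals, everything below 20 in the first group, ages of 150 or more set to missing); the `age` vector is invented test input.

```r
# Age group labels as defined earlier in this document
Age_labels <- c("14 - 19 Jahre",
                "20 - 29 Jahre",
                "30 - 39 Jahre",
                "40 - 49 Jahre",
                "50 - 59 Jahre",
                "60 - 69 Jahre",
                "70 Jahre oder älter")

# Invented ages covering the interval boundaries and a missing value
age <- c(14, 19, 20, 29.5, 69, 70, 95, NA)

# right = FALSE makes the intervals left-closed: [20, 30), [30, 40), ...
age_group <- cut(age,
                 breaks = c(-Inf, 20, 30, 40, 50, 60, 70, 150),
                 labels = Age_labels,
                 right = FALSE,
                 ordered_result = TRUE)

data.frame(age, age_group)
```

Using `cut()` keeps the break points in one place, which makes it harder for the wave-specific recodes to drift apart.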

Show result of the recoding.

```{r}
tab_data <- VIC_data %>%
  dplyr::select(Age_W1, Age_W2, Age_W3) %>%
  gather() %>%
  count(key, value) %>%
  group_by(key) %>%
  na.omit() %>% # now required with changes to dplyr::count()
  mutate(prop = prop.table(n))
  
data.frame(
  Categories = Age_labels,
  Wave1.n = tab_data[tab_data$key == "Age_W1","n"],
  Wave1.p = tab_data[tab_data$key == "Age_W1","prop"],
  Wave2.n = tab_data[tab_data$key == "Age_W2","n"],
  Wave2.p = tab_data[tab_data$key == "Age_W2","prop"],
  Wave3.n = tab_data[tab_data$key == "Age_W3","n"],
  Wave3.p = tab_data[tab_data$key == "Age_W3","prop"]
) %>%
  flextable(cwidth = 1.5) %>%
  set_header_labels(values = list(
    Categories = "",
    n = "n",
    prop = "Share",
    n.1 = "n",
    prop.1 = "Share",
    n.2 = "n",
    prop.2 = "Share"
  )) %>%
  add_header_row(values = c("Categories", paste0("Wave ",1:3)),
                 colwidths = c(1,2,2,2)) %>%
  colformat_double(digits = 3)

```

### Education

Recode education for Wave 1 based on `VIC1_Q6`.

```{r}
# Recode education for Wave 1
VIC_data$Edu_W1 <- as_factor(VIC_data$VIC1_Q6)
## Combine: 
## "Allgemeinbildende höhere Schule (AHS)" &           
## "Berufsbildende höhere Schule (BHS, z,B, HAK, HTL)"
VIC_data$Edu_W1[VIC_data$Edu_W1 %in% levels(VIC_data$Edu_W1)[4:5]] <- levels(VIC_data$Edu_W1)[4]
## Combine:
## "Bachelor an Fachhochschule/ Pädagogische Hochschule" &
## "Bachelor an Universität" &
## "Diplomabschluss/ Master an Fachhochschule" &
## "Diplomabschluss/ Master an Universität" &
## "Postgradualen Universitätslehrgang (aufbauend auf Diplomabschluss, z.B. MBA) bzw. anderer Abschluss" &
## "Doktorat" 
VIC_data$Edu_W1[VIC_data$Edu_W1 %in% levels(VIC_data$Edu_W1)[6:11]] <- levels(VIC_data$Edu_W1)[5]
## Build factor and assign proper labels
VIC_data$Edu_W1 <- factor(VIC_data$Edu_W1)
levels(VIC_data$Edu_W1) <- DatEdu$Category
```

Recode education for Wave 2 based on `VIC2_Q15`.

```{r}
# Recode education for wave 2
VIC_data$Edu_W2 <- as_factor(VIC_data$VIC2_Q15)
## Reassign "Berufsbildende mittlere Schule (z.B. Handelsschule)" to level 3
VIC_data$Edu_W2[VIC_data$Edu_W2 %in% levels(VIC_data$Edu_W2)[4]] <- levels(VIC_data$Edu_W2)[3]
## Combine & Reassign: 
## "Allgemeinbildende höhere Schule (AHS)" &           
## "Berufsbildende höhere Schule (BHS, z,B, HAK, HTL)"
VIC_data$Edu_W2[VIC_data$Edu_W2 %in% levels(VIC_data$Edu_W2)[5:6]] <- levels(VIC_data$Edu_W2)[4]
## Combine & Reassign:
## "Abschluss einer Akademie (Sozialakademie, Pädagogische Akademie, ...)"
## "Bachelor an Fachhochschule/ Pädagogische Hochschule" &
## "Bachelor an Universität" &
## "Diplomabschluss/ Master an Fachhochschule" &
## "Diplomabschluss/ Master an Universität" &
## "Postgradualen Universitätslehrgang (aufbauend auf Diplomabschluss, z.B. MBA) bzw. anderer Abschluss" &
## "Doktorat" 
VIC_data$Edu_W2[VIC_data$Edu_W2 %in% levels(VIC_data$Edu_W2)[7:13]] <- levels(VIC_data$Edu_W2)[5]
## Build factor and assign proper labels
VIC_data$Edu_W2 <- factor(VIC_data$Edu_W2)
levels(VIC_data$Edu_W2) <- DatEdu$Category
```

Recode education for Wave 3 based on `VIC3_Q13`.

```{r}
# Recode education for wave 3
VIC_data$Edu_W3 <- as_factor(VIC_data$VIC3_Q13)
## Reassign "Berufsbildende mittlere Schule (z.B. Handelsschule)" to level 3
VIC_data$Edu_W3[VIC_data$Edu_W3 %in% levels(VIC_data$Edu_W3)[4]] <- levels(VIC_data$Edu_W3)[3]
## Combine & Reassign: 
## "Allgemeinbildende höhere Schule (AHS)" &           
## "Berufsbildende höhere Schule (BHS, z,B, HAK, HTL)"
VIC_data$Edu_W3[VIC_data$Edu_W3 %in% levels(VIC_data$Edu_W3)[5:6]] <- levels(VIC_data$Edu_W3)[4]
## Combine & Reassign:
## "Bachelor an Fachhochschule/ Pädagogische Hochschule" &
## "Bachelor an Universität" &
## "Diplomabschluss/ Master an Fachhochschule" &
## "Diplomabschluss/ Master an Universität" &
## "Postgradualen Universitätslehrgang (aufbauend auf Diplomabschluss, z.B. MBA) bzw. anderer Abschluss" &
## "Doktorat" 
## "Anderer Abschluss"
VIC_data$Edu_W3[VIC_data$Edu_W3 %in% levels(VIC_data$Edu_W3)[7:13]] <- levels(VIC_data$Edu_W3)[5]
## Build factor and assign proper labels
VIC_data$Edu_W3 <- factor(VIC_data$Edu_W3)
levels(VIC_data$Edu_W3) <- DatEdu$Category
```

Show result of the recoding.

```{r}
tab_data <- VIC_data %>%
  dplyr::select(Edu_W1, Edu_W2, Edu_W3) %>%
  gather() %>%
  count(key, value) %>%
  group_by(key) %>%
  na.omit() %>% # now required with changes to dplyr::count()
  mutate(prop = prop.table(n))

tab_data$value <- factor(tab_data$value,
                         levels = DatEdu$Category,
                         ordered = TRUE)

tab_data <- tab_data[order(tab_data$value),]
  
data.frame(
  Categories = tab_data[tab_data$key == "Edu_W1","value"],
  Wave1.n = tab_data[tab_data$key == "Edu_W1","n"],
  Wave1.p = tab_data[tab_data$key == "Edu_W1","prop"],
  Wave2.n = tab_data[tab_data$key == "Edu_W2","n"],
  Wave2.p = tab_data[tab_data$key == "Edu_W2","prop"],
  Wave3.n = tab_data[tab_data$key == "Edu_W3","n"],
  Wave3.p = tab_data[tab_data$key == "Edu_W3","prop"]
) %>%
  flextable(cwidth = c(3,1,1,1,1,1,1)) %>%
  set_header_labels(values = list(
    value = "",
    n = "n",
    prop = "Share",
    n.1 = "n",
    prop.1 = "Share",
    n.2 = "n",
    prop.2 = "Share"
  )) %>%
  add_header_row(values = c("Categories", paste0("Wave ",1:3)),
                 colwidths = c(1,2,2,2)) %>%
  colformat_double(digits = 3)

```

### Region

Adjust the variable labels to match the reference categories.

```{r}
# Wave 1
VIC_data$VIC1_Q4 <- VIC_data$VIC1_Q4 %>%
  as_factor() %>%
  factor()

# Wave 2
VIC_data$VIC2_Q8 <- VIC_data$VIC2_Q8 %>%
  as_factor() %>%
  factor()

# Wave 3
VIC_data$VIC3_Q8 <- VIC_data$VIC3_Q8 %>%
  as_factor() %>%
  factor()
```

# Weighting

The procedure uses the [anesrake](http://cran.r-project.org/web/packages/anesrake/index.html) package by @anesrake for R and follows the blog entry on raking weights by @daza_2012.
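
To illustrate the principle behind raking (iterative proportional fitting), the following is a minimal base R sketch. It is not the `anesrake` implementation, which offers further options such as the `cap` and `type` arguments used in this document; the function `rake()`, the toy sample `smp` and the target shares are all hypothetical.

```r
# Toy sample: 10 respondents with two weighting variables (invented data)
smp <- data.frame(
  gender = c("f", "f", "f", "f", "f", "f", "m", "m", "m", "m"),
  age    = c("young", "young", "young", "old", "old", "old",
             "young", "young", "old", "old")
)

# Hypothetical population margins (shares sum to 1 for each variable)
targets <- list(
  gender = c(f = 0.5, m = 0.5),
  age    = c(young = 0.4, old = 0.6)
)

# Minimal raking: adjust weights variable by variable until they are stable
rake <- function(data, targets, max_iter = 100, tol = 1e-8) {
  w <- rep(1, nrow(data))  # start with equal weights
  for (iter in seq_len(max_iter)) {
    w_old <- w
    for (v in names(targets)) {
      # Current weighted share of each category of variable v
      current <- vapply(split(w, data[[v]]), sum, numeric(1)) / sum(w)
      # Scale weights so the weighted margin of v matches its target
      ratio <- targets[[v]][names(current)] / current
      w <- unname(w * ratio[as.character(data[[v]])])
    }
    if (max(abs(w - w_old)) < tol) break  # stop once weights no longer change
  }
  w * length(w) / sum(w)  # normalise weights to a mean of 1
}

w <- rake(smp, targets)

# Weighted margins after raking
gender_share <- vapply(split(w, smp$gender), sum, numeric(1)) / sum(w)
age_share    <- vapply(split(w, smp$age), sum, numeric(1)) / sum(w)
print(gender_share)
print(age_share)
```

After convergence, the weighted margins of both variables match the targets, although the joint distribution of gender and age is not controlled.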

## Define lists of targets

For Wave 1 (2020):

```{r}
# Prepare list of population marginal proportions
target20 <- list(
  "Geschlecht" = DatGndr20$Share %>%
    setNames(DatGndr20$Category),
  "Alter" = DatAge20$Share %>%
    setNames(DatAge20$Category),
  "Bildung" = DatEdu$Share %>%
    setNames(DatEdu$Category),
  "County" = DatCnty20$Share %>%
    setNames(DatCnty20$Category)
)
```

For Wave 2 (2021):

```{r}
# Prepare list of population marginal proportions
target21 <- list(
  "Geschlecht" = DatGndr21$Share %>%
    setNames(DatGndr21$Category),
  "Alter" = DatAge21$Share %>%
    setNames(DatAge21$Category),
  "Bildung" = DatEdu$Share %>%
    setNames(DatEdu$Category),
  "County" = DatCnty21$Share %>%
    setNames(DatCnty21$Category)
)
```

For Wave 3 (2022):

```{r}
# Prepare list of population marginal proportions
target22 <- list(
  "Geschlecht" = DatGndr22$Share %>%
    setNames(DatGndr22$Category),
  "Alter" = DatAge22$Share %>%
    setNames(DatAge22$Category),
  "Bildung" = DatEdu$Share %>%
    setNames(DatEdu$Category),
  "County" = DatCnty22$Share %>%
    setNames(DatCnty22$Category)
)
```

## Subset the data by wave and run raking

```{r}
# Subset data relevant for weighting by wave
subsamples <- c("Teilnahme_W1",
                "Teilnahme_W2",
                "Teilnahme_W3",
                "Teilnahme_W12",
                "Teilnahme_W13",
                "Teilnahme_W23",
                "Teilnahme_W123")

Raking <- lapply(subsamples, 
  function(x){
    if(grepl("W3",x)){ # Only "Teilnahme_W3" contains "W3": use Wave 3 info on person
      subset_vars <- c("ID_U0", # ID
                       "VIC3_Q1", # Gender
                       "Age_W3", # Age groups
                       "Edu_W3", # Education
                       "VIC3_Q8") # Region
      subset_target <- target22
    } else {
      if(grepl("W2",x)){ # "Teilnahme_W2" and "Teilnahme_W23": use Wave 2 info on person
        subset_vars <- c("ID_U0", # ID
                         "VIC2_Q1", # Gender
                         "Age_W2", # Age groups
                         "Edu_W2", # Education
                         "VIC2_Q8") # Region
        subset_target <- target21
      } else { # All remaining subsamples (W1, W12, W13, W123): use Wave 1 info on person
        subset_vars <- c("ID_U0", # ID
                         "VIC1_Q1", # Gender
                         "Age_W1", # Age groups
                         "Edu_W1",# Education
                         "VIC1_Q4") # Region
        subset_target <- target20
      }
    }
    
    # Subset data
    Weight_Data <- VIC_data %>%
      .[.[,x] == 1,] %>%
      filter(is.na(ID_U0) == FALSE) %>%
      select(all_of(subset_vars)) %>%
      set_names(c("ID_U0",
                  "Geschlecht", 
                  "Alter",
                  "Bildung",
                  "County"))
    
    # Align variable labels of both data for
    # County:
    Weight_Data$County <- as_factor(Weight_Data$County) %>%
      factor(levels = DatCnty21$Category)

    DatWave <- Weight_Data %>% as.data.frame()
    
    set.seed(21072021)
    
    RakingResult <- anesrake(subset_target,
                             DatWave,
                             DatWave$ID_U0,
                             cap = 10,
                             type = "nolim",
                             force1 = FALSE)
    
    return(list(
      DatWave = DatWave,
      RakingResult = RakingResult
    ))
  }) %>%
  setNames(subsamples)
```

## Print some summary statistics:

### Wave 1 weights

```{r}
summary(Raking$Teilnahme_W1$RakingResult)
```

### Wave 2 weights

```{r}
summary(Raking$Teilnahme_W2$RakingResult)
```

### Wave 3 weights

```{r}
summary(Raking$Teilnahme_W3$RakingResult)
```

### Wave 1 + 2 (panel data) weights

```{r}
summary(Raking$Teilnahme_W12$RakingResult)
```

### Wave 1 + 3 (panel data) weights

```{r}
summary(Raking$Teilnahme_W13$RakingResult)
```

### Wave 2 + 3 (panel data) weights

```{r}
summary(Raking$Teilnahme_W23$RakingResult)
```

### Wave 1 + 2 + 3 (panel data) weights

```{r}
summary(Raking$Teilnahme_W123$RakingResult)
```

# Export results to SPSS

```{r, message=FALSE, warning=FALSE}
names(Raking) <- names(Raking) %>%
  str_remove_all("_") %>%
  str_remove_all("Teilnahme")

lapply(names(Raking), function(subname){
  data.frame(
    Raking[[subname]]$RakingResult$caseid,
    Raking[[subname]]$RakingResult$weightvec) %>%
    setNames(c("U0", paste0("Weight_",subname))) %>%
    write_sav(here(paste0("export/",subname,"_weights.sav"))
  )
}) %>%
  invisible()
```

```{r, include=FALSE, echo=FALSE}
lapply(names(Raking), function(subname){
  file.copy(
    from = here(paste0("export/",subname,"_weights.sav")),
    to = here(paste0("../",subname,"_weights.sav")),
          overwrite = TRUE)
})
```

# Store key data

```{r}
list(DatAge20,  DatAge21,  DatAge22,
     DatCnty20, DatCnty21, DatCnty22,
     DatEdu,
     DatGndr20, DatGndr21, DatGndr22,
     target20,  target21,  target22) %>%
saveRDS(here("data/targets_w123.rda"))

saveRDS(Raking, here("data/Raking_w123.rda"))
Raking <- readRDS(here("data/Raking_w123.rda"))
```

# Descriptives of weighting variables

```{r}
lapply(Raking, function(x){
  desc_data <- psych::describe(x$RakingResult$weightvec) %>%
    as.data.frame() %>%
    dplyr::select(n, sd) %>%
    c(summary(x$RakingResult)$weight.summary[c(2,3,5,1,6)],
      summary(x$RakingResult)$general.design.effect,.) %>%
    setNames(c("1st Quartil",
               "Median",
               "3rd Quartil",
               "Minimum",
               "Maximum",
               "Design effect",
               "n",
               "Standard deviation"
    )) 
  desc_data <- desc_data[c("n",
                           "Standard deviation",
                           "1st Quartil",
                           "Median",
                           "3rd Quartil",
                           "Minimum",
                           "Maximum",
                           "Design effect"
  )] %>%
    as.data.frame()
  # selection of weights bigger than 3
  big_sel <- grep(TRUE,(x$RakingResult$weightvec>3))
  desc_data$'No. of weights > 3 (W>3)' <- length(x$RakingResult$weightvec[big_sel]) 
  desc_data$'Sum of W>3' <- sum(x$RakingResult$weightvec[big_sel])
  return(desc_data)
}) %>% do.call("rbind",.) %>%
  setNames(c("n",
             "Standard deviation",
             "1st Quartil",
             "Median",
             "3rd Quartil",
             "Minimum",
             "Maximum",
             "Design effect",
             'No. of weights > 3 (W>3)',
             'Sum of W>3'
  )) %>%
  t() %>%
  as.data.frame() %>%
  cbind(rownames(.),.) %>%
  setNames(c("Parameter",colnames(.)[-1])) %>%
  flextable() %>%
  autofit() %>%
  colformat_double(digits = 3) %>%
  colformat_double(i = c(1,8), digits = 0)

```



# Plot weight distributions

```{r}
plotdata <- lapply(names(Raking), function(x){
  cbind(x,
        Raking[[x]]$RakingResult$weightvec) %>%
    as.data.frame(row.names = FALSE) %>%
    setNames(c("Sample","Weight"))
}) %>%
  do.call("rbind",.)

plotdata$Sample <- factor(plotdata$Sample,
                          levels = c("W1","W2","W3","W12",
                                     "W13","W23","W123"),
                          ordered = TRUE)

p1 <- plotdata %>%   
  ggplot(aes(as.numeric(Weight))) +
  geom_histogram(color="black", fill="white") +
  facet_grid(rows = vars(Sample)) +
  theme_minimal() +
  labs(x = "",
       y = " ") +
  scale_x_continuous(breaks = 1:10) +
  theme(legend.position="none",
        plot.margin=unit(c(0,0,-0.5,0), "cm"))

p2 <- plotdata %>%
  ggplot(aes(as.numeric(Weight))) +
  geom_boxplot(width=0.6) +
  facet_grid(rows = vars(Sample)) +
  theme_minimal() +
  labs(x = "Weight",
       y = "") +
  #coord_cartesian(ylim = c(-0.4, 0.4)) +
  scale_y_continuous(limits = c(-0.5,0.5)) +
  scale_x_continuous(breaks = 1:10) +
  theme(legend.position="none",
        #axis.text.x=element_blank(),
        #axis.ticks.x=element_blank(),
        plot.margin=unit(c(0.25,0,0,0), "cm"),
        strip.text.x = element_blank())

p3 <- egg::ggarrange(p2, p1, widths  = 2:1)

ggsave(here::here("export/weights_plot_2022.png"),
       p3,
       dpi = 300)
```


### Design effect

As with all biases, the first question is how serious the issue is. The answer is complicated and depends on several factors. Calculating the design effect helps to gauge the gravity of the issue.
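
For the weighting component, a widely used approximation is Kish's design effect, deff = n · Σw² / (Σw)², together with the effective sample size n / deff. The sketch below applies it to invented weight vectors; the helper `kish_deff()` is hypothetical, and whether it reproduces the `general.design.effect` reported by `anesrake` is an assumption that should be checked against the package documentation.

```r
# Kish's approximate design effect due to unequal weights
kish_deff <- function(w) {
  w <- w[!is.na(w)]                 # ignore missing weights
  length(w) * sum(w^2) / sum(w)^2
}

# Equal weights: no loss of precision
w_equal <- rep(1, 100)
deff_equal <- kish_deff(w_equal)

# Invented unequal weights: half the sample at 0.5, half at 1.5
w_unequal <- c(rep(0.5, 50), rep(1.5, 50))
deff_unequal <- kish_deff(w_unequal)

# Effective sample size: nominal n divided by the design effect
n_eff <- length(w_unequal) / deff_unequal

print(c(deff_equal = deff_equal,
        deff_unequal = deff_unequal,
        n_eff = n_eff))
```

In this toy case the design effect of 1.25 means that the 100 weighted cases carry roughly the information of 80 equally weighted cases.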
