# [Workshop 5. Assessment of Survey Data Quality](https://wapor.org/events/annual-conference/current-conference/training-workshops/)

# Part C: Weighting and design effect

Author: <a href="mailto:alexander.seymer@plus.ac.at?subject=Regarding the WAPOR 2023 workshop">Alexander Seymer @ PLUS</a>

Date: September 19, 2023


## Abstract

In this session of the workshop, we will discuss the design effect as a measure of the impact of the sampling design on the variance of an estimator, and weighting as a tool to account for complex sampling designs.


## Preparing the R session

In [None]:
source("install.R")


Although we will use R, alternative software packages provide means to employ weights:

| Software | Reference |
| -------- | --------- |
| Python | Paczkowski, W. R. (2022). Modern survey analysis: Using Python for deeper insights. Springer Nature. [https://link.springer.com/book/10.1007/978-3-030-76267-4](https://link.springer.com/book/10.1007/978-3-030-76267-4) | 
| SAS | Lewis, T. H. (2016). [Complex survey data analysis with SAS. CRC Press.](https://www.routledge.com/Complex-Survey-Data-Analysis-with-SAS/Lewis/p/book/9781498776776) |
| SPSS | Zou, D., Lloyd, J. E. V., & Baumbusch, J. L. (2019). Using SPSS to Analyze Complex Survey Data: A Primer. Journal of Modern Applied Statistical Methods, 18. [http://jmasm.com/index.php/jmasm/article/view/1026](http://jmasm.com/index.php/jmasm/article/view/1026) | 
| STATA | Weighing Data in Stata—Stata Help—Reed College. (2014). Retrieved September 7th, 2023, from [https://www.reed.edu/psychology/stata/gs/tutorials/weights.html](https://www.reed.edu/psychology/stata/gs/tutorials/weights.html) |

## Weighting


Complex sampling designs for surveys commonly employ:

- stratification (divide the population into homogeneous groups and sample a specific number of units from each group);
- clustering (divide the population into groups, e.g. regions, and sample from a random subset of these groups);
- unequal sampling (oversampling of subgroups of interest).

Treating these samples as simple random samples will result in biased standard errors. 

A common approach to account for this bias is to provide design weights. Most surveys using complex sampling strategies provide weights to adjust for the deviation from simple random sampling.
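To make the idea of design weights more concrete, the following minimal sketch computes them as inverse inclusion probabilities for a stratified sample with oversampling. All numbers, the strata names and the outcome `y` are made up for illustration. Note that unequal selection probabilities can distort point estimates as well as standard errors, which is what the comparison at the end shows.

```r
# Hypothetical population with two strata of unequal size
pop_size    <- c(urban = 8000, rural = 2000)
# Hypothetical sample in which the rural stratum is oversampled
sample_size <- c(urban = 400, rural = 400)

# Inclusion probability per stratum and the resulting design weight
incl_prob     <- sample_size / pop_size
design_weight <- 1 / incl_prob
design_weight

# Toy outcome whose mean differs between the strata (10 vs. 20)
set.seed(1)
toy <- data.frame(
  stratum = rep(names(pop_size), times = sample_size),
  y       = c(rnorm(400, mean = 10), rnorm(400, mean = 20))
)
toy$w <- unname(design_weight[toy$stratum])

# The true population mean is 0.8 * 10 + 0.2 * 20 = 12
unweighted_mean <- mean(toy$y)                 # pulled towards the oversampled stratum
weighted_mean   <- weighted.mean(toy$y, toy$w) # close to the population mean
c(unweighted = unweighted_mean, weighted = weighted_mean)
```

Each respondent stands for as many population members as the design weight indicates, so respondents from the oversampled stratum count less in the weighted estimate.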


## Example: Weighting Values in Crisis data for Austrian Sample


Weighting requires information about the target population or the inferential population. In this example, official statistics from [EUROSTAT](https://ec.europa.eu/eurostat/web/main/home) will be used as reference for the inferential population including sex, age, education and region. Hence, we will:

### 1. Import data from EUROSTAT

We will start by searching the EUROSTAT database for the relevant data. Let's check all databases containing the phrase `Population` in the title.

In [None]:
search_eurostat("Population") %>%
  datatable(filter = "top")

Now import the data based on the database id.

In [None]:
EuroDataAT <- get_eurostat_json(
  "lfst_r_lfsd2pop",
  filters = list(
    geo = c("AT", paste0("AT",c(11:13,21,22,31:34))),
    time = 2020
  ))

datatable(EuroDataAT,
         filter = "top")

### 2. Prepare reference points for weighting

#### Sex

To subset the sex distribution, we will use the `TOTAL` category of education (`isced11`), the age group of 15 years and older (`Y_GE15`), remove the totals (`T`) from sex and use only `AT` for region (`geo`).

In [None]:
# Subset sex distribution
DatGndr <- EuroDataAT %>%
  filter(isced11 == "TOTAL") %>%
  filter(age == "Y_GE15") %>%
  filter(sex != "T") %>%
  filter(geo == "AT") %>%
  dplyr::select(sex, values) %>%
  setNames(c("Category", "Count"))

# Calculate share
DatGndr$Share <- DatGndr$Count/sum(DatGndr$Count)

datatable(DatGndr) %>%
    formatPercentage(3) %>% 
    formatStyle(3, background = styleColorBar(c(0,1), 'lightblue'),
                         backgroundSize = '98% 88%',
                         backgroundRepeat = 'no-repeat',
                         backgroundPosition = 'center')

#### Age

A limitation for age is that we have to use the age groups predefined by official statistics. Apart from that, we use the total count (`T`) for sex, select the specific age groups and apply the same filters for education and region as for sex.

In [None]:
# Subset age group distribution
DatAge <- EuroDataAT %>%
  filter(isced11 == "TOTAL") %>%
  filter(age %in% c("Y15-24","Y25-34","Y35-44","Y45-54","Y55-64","Y_GE65")) %>%
  filter(sex == "T") %>%
  filter(geo == "AT") %>%
  dplyr::select(age, values) %>%
  setNames(c("Category", "Count"))

# Calculate share
DatAge$Share <- DatAge$Count/sum(DatAge$Count)

datatable(DatAge) %>%
    formatPercentage(3) %>% 
    formatStyle(3, background = styleColorBar(c(0,1), 'lightblue'),
                         backgroundSize = '98% 88%',
                         backgroundRepeat = 'no-repeat',
                         backgroundPosition = 'center')

#### Education

The limitation of predefined categories is even more severe for education, as the EUROSTAT database provides only three categories: ISCED 0-2, ISCED 3-4 and ISCED 5-8. 

In [None]:
# Subset education distribution
DatEdu <- EuroDataAT %>%
  filter(isced11 %in% c("ED0-2", "ED3_4", "ED5-8")) %>%
  filter(age == "Y_GE15") %>%
  filter(sex == "T") %>%
  filter(geo == "AT") %>%
  dplyr::select(isced11, values) %>%
  setNames(c("Category", "Count"))

# Calculate share
DatEdu$Share <- DatEdu$Count/sum(DatEdu$Count)

datatable(DatEdu) %>%
    formatPercentage(3) %>% 
    formatStyle(3, background = styleColorBar(c(0,1), 'lightblue'),
                         backgroundSize = '98% 88%',
                         backgroundRepeat = 'no-repeat',
                         backgroundPosition = 'center')

#### Region

The region needs only relabelling.

In [None]:
# Subset NUTS2 distribution
DatReg <- EuroDataAT %>%
  filter(isced11 == "TOTAL") %>%
  filter(age == "Y_GE15") %>%
  filter(sex == "T") %>%
  filter(geo != "AT") %>%
  dplyr::select(geo, values) %>%
  setNames(c("Category", "Count"))

# Relabel
DatReg$Category <- DatReg$Category %>%
  str_replace("AT11","Burgenland") %>%
  str_replace("AT12","Lower Austria") %>%
  str_replace("AT13","Vienna") %>%
  str_replace("AT21","Carinthia") %>%
  str_replace("AT22","Styria") %>%
  str_replace("AT31","Upper Austria") %>%
  str_replace("AT32","Salzburg") %>%
  str_replace("AT33","Tyrol") %>%
  str_replace("AT34","Vorarlberg")

# Calculate share
DatReg$Share <- DatReg$Count/sum(DatReg$Count)

datatable(DatReg) %>%
    formatPercentage(3) %>% 
    formatStyle(3, background = styleColorBar(c(0,1), 'lightblue'),
                         backgroundSize = '98% 88%',
                         backgroundRepeat = 'no-repeat',
                         backgroundPosition = 'center')
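The four blocks above repeat the same share calculation. As a minimal sketch, the step could be wrapped in a small helper that also checks that the shares of a variable add up to 1. The function name `add_share` and the toy counts are hypothetical and not part of the workshop material.

```r
# Hypothetical helper: add a Share column to a Category/Count table
add_share <- function(dat) {
  stopifnot(all(c("Category", "Count") %in% names(dat)))
  dat$Share <- dat$Count / sum(dat$Count)
  dat
}

# Toy counts (made up, not EUROSTAT figures)
toy_counts <- data.frame(Category = c("F", "M"),
                         Count    = c(520, 480))
toy_share <- add_share(toy_counts)
toy_share

# The shares of one variable should add up to 1
share_total <- sum(toy_share$Share)
share_total
```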

### 3. Import Values in Crisis data

You need to have executed the `DataDownloader` notebook before!

In [None]:
VIC_data <- readRDS("data/10742_da01_en_v2_0.Rdata") %>%
    filter(country == 1) %>% # 1 for Austria
    # Subset VIC data
    dplyr::select(ID_merge,Q01, Q02_age_grouped, Q05_ISCED_3_Levels, Q09_AUT)     

Let's investigate the distribution and attributes of these variables in the VIC data.

In [None]:
table1(~ as_factor(Q01) + 
         as_factor(Q02_age_grouped) + 
         as_factor(Q05_ISCED_3_Levels) + 
         as_factor(Q09_AUT),
       data = VIC_data) %>%
  display_html()

### 4. Align both data sources

We need to align the labels and coding of both sources. We will use the EUROSTAT data as the point of reference here and adapt the labels and coding of the VIC data.

#### Sex

For the sex variable, we simply overwrite the labels.

In [None]:
# Make it a factor and drop empty levels
VIC_data$Q01 <- VIC_data$Q01 %>%
  as_factor() %>%
  factor()

# Set factor labels of VIC data to EUROSTAT labels.
levels(VIC_data$Q01) <- DatGndr$Category

#### Age

For age, we need to regroup the variable slightly.

In [None]:
# Make it a factor and drop empty levels
VIC_data$Q02_age_grouped <- VIC_data$Q02_age_grouped %>%
  as_factor() %>%
  factor()

# Get original categories from VIC data
old_labs <- levels(VIC_data$Q02_age_grouped)

# Create new variable 
VIC_data$age_grouped <- NA
VIC_data$age_grouped[VIC_data$Q02_age_grouped %in% old_labs[1:2]] <- DatAge$Category[1]
VIC_data$age_grouped[VIC_data$Q02_age_grouped %in% old_labs[3:4]] <- DatAge$Category[2]
VIC_data$age_grouped[VIC_data$Q02_age_grouped %in% old_labs[5:6]] <- DatAge$Category[3]
VIC_data$age_grouped[VIC_data$Q02_age_grouped %in% old_labs[7:8]] <- DatAge$Category[4]
VIC_data$age_grouped[VIC_data$Q02_age_grouped %in% old_labs[9:10]] <- DatAge$Category[5]
VIC_data$age_grouped[VIC_data$Q02_age_grouped %in% old_labs[11]] <- DatAge$Category[6]

# Make the new variable an ordered factor
VIC_data$age_grouped <- factor(VIC_data$age_grouped,
                              levels = DatAge$Category,
                              labels = DatAge$Category,
                              ordered = TRUE)

# A quick check if the recoding worked
table(VIC_data$Q02_age_grouped,
     VIC_data$age_grouped)

# Drop old variable (dplyr:: prefix as above, in case select() is masked)
VIC_data <- VIC_data %>%
    dplyr::select(!Q02_age_grouped)
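The recoding above relies on the position of the labels in `old_labs`. A named lookup vector is one alternative that makes the mapping explicit. The sketch below uses made-up age labels, since the actual labels in the VIC data may differ, and ends with the same kind of cross-tabulation check.

```r
# Hypothetical fine-grained age labels (the real VIC labels may differ)
old_labs <- c("18-19", "20-24", "25-29", "30-34")
new_labs <- c("Y15-24", "Y25-34")

# Named lookup vector: old label -> new label
lookup <- setNames(rep(new_labs, each = 2), old_labs)

# Toy variable in the old coding
toy_age <- factor(c("20-24", "30-34", "18-19", "25-29", "30-34"),
                  levels = old_labs)

# Recode by name and turn the result into an ordered factor
recoded <- factor(unname(lookup[as.character(toy_age)]),
                  levels = new_labs,
                  ordered = TRUE)

# Cross-tabulate old against new categories to verify the mapping
table(toy_age, recoded)
```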

#### Education

Education is already coded as needed, hence we just need to relabel the variable.

In [None]:
# Make it a factor and drop empty levels
VIC_data$Q05_ISCED_3_Levels <- VIC_data$Q05_ISCED_3_Levels %>%
  as_factor() %>%
  factor()

# Set factor labels of VIC data to EUROSTAT labels.
levels(VIC_data$Q05_ISCED_3_Levels) <- DatEdu$Category

#### Region

In [None]:
# Make it a factor and drop empty levels
VIC_data$Q09_AUT <- VIC_data$Q09_AUT %>%
  as_factor() %>%
  factor()

##### Rename variables

The variable names need to be identical to the names used in the target definition later on.

In [None]:
colnames(VIC_data) <- colnames(VIC_data) %>%
    str_replace("Q01","Sex") %>%
    str_replace("age_grouped", "Age") %>%
    str_replace("Q05_ISCED_3_Levels", "Education") %>%
    str_replace("Q09_AUT","Region")

### 5. Weighting

The procedure uses the [anesrake](http://cran.r-project.org/web/packages/anesrake/index.html) package by Pasek (2018) and follows the blog entry on raking weights by Daza (2012).

> Pasek, Josh. 2018. Anesrake: ANES Raking Implementation. https://CRAN.R-project.org/package=anesrake.
>
> Daza, Sebastian. 2012. “Raking Weights with R.” https://sdaza.com/blog/2012/raking/


We will proceed in the following steps:

#### 5.1. Define a target list

The anesrake package requires a target list consisting of named vectors with the marginal proportions. Hence, we will create this list first:

In [None]:
# Prepare list of population marginal proportions
target <- list(
  "Sex" = DatGndr$Share %>%
    setNames(DatGndr$Category),
  "Age" = DatAge$Share %>%
    setNames(DatAge$Category),
  "Education" = DatEdu$Share %>%
    setNames(DatEdu$Category),
  "Region" = DatReg$Share %>%
    setNames(DatReg$Category)
)
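Because the names in the target list have to match the variable names and factor levels in the data, a quick consistency check before raking can save some debugging. The following is a minimal sketch with toy data. The helper `check_target` is hypothetical and not part of anesrake.

```r
# Toy data and targets (made-up values, same structure as the target list)
toy_data <- data.frame(
  Sex = factor(c("F", "M", "F", "F")),
  Age = factor(c("Y15-24", "Y25-34", "Y25-34", "Y15-24"))
)
toy_target <- list(
  Sex = c("F" = 0.51, "M" = 0.49),
  Age = c("Y15-24" = 0.4, "Y25-34" = 0.6)
)

# Hypothetical helper: TRUE if variable names, levels and sums are consistent
check_target <- function(target, data) {
  vars_ok <- all(names(target) %in% names(data))
  levels_ok <- all(vapply(names(target), function(v) {
    setequal(names(target[[v]]), levels(data[[v]]))
  }, logical(1)))
  sums_ok <- all(vapply(target, function(p) {
    isTRUE(all.equal(sum(p), 1))
  }, logical(1)))
  vars_ok && levels_ok && sums_ok
}

target_ok <- check_target(toy_target, toy_data)
target_ok
```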

#### 5.2. Run the raking function

Raking is an iterative procedure. To keep the results reproducible, we set the seed of R's random number generator before running it.

In [None]:
set.seed(19092023)

RakingResult <- anesrake(inputter = target,
                         dataframe = as.data.frame(VIC_data), # make sure it is a dataframe!
                         caseid = VIC_data$ID_merge,
                         cap = 10,
                         type = "nolim",
                         force1 = FALSE)
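To illustrate what happens inside such a function, here is a minimal sketch of plain raking (iterative proportional fitting) in base R. It is not the anesrake implementation: it has no cap, no variable selection and no convergence diagnostics, and the data, targets and the helper `rake_simple` are made up.

```r
# Toy sample with two categorical variables (made-up data)
toy <- data.frame(
  Sex = c("F", "F", "F", "M", "M", "F", "M", "F"),
  Edu = c("low", "high", "high", "low", "high", "high", "high", "low")
)

# Made-up population margins, one named vector per variable
toy_target <- list(
  Sex = c("F" = 0.5, "M" = 0.5),
  Edu = c("low" = 0.6, "high" = 0.4)
)

# Hypothetical helper: adjust the weights variable by variable until stable
rake_simple <- function(data, target, max_iter = 100, tol = 1e-8) {
  w <- rep(1, nrow(data))
  for (i in seq_len(max_iter)) {
    w_old <- w
    for (v in names(target)) {
      categories <- as.character(data[[v]])
      # Current weighted share of each category of variable v
      current <- tapply(w, categories, sum) / sum(w)
      # Factor that moves each category's share to its target
      ratio <- target[[v]] / as.vector(current[names(target[[v]])])
      w <- w * unname(ratio[categories])
    }
    if (max(abs(w - w_old)) < tol) break
  }
  w / mean(w)  # normalise to a mean of 1
}

w <- rake_simple(toy, toy_target)

# Weighted margins after raking, to be compared with toy_target
sex_share <- tapply(w, toy$Sex, sum) / sum(w)
edu_share <- tapply(w, toy$Edu, sum) / sum(w)
round(sex_share, 3)
round(edu_share, 3)
```

Each pass matches one margin exactly and slightly disturbs the others, which is why the adjustment is repeated until the weights stop changing.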
    

#### 5.3. Access raking results

In [None]:
summary(RakingResult)

#### 5.4. Trimming weights

Sometimes weights are very large, which might distort the results as single cases receive too much weight in the analysis. A commonly applied threshold for trimming weights is 5 (Stapleton 2012). We can see from the descriptives of the weights in the raking summary that our maximum weight is less than 2.

> Stapleton, L. M. (2012). Analysis of data from complex surveys. In International handbook of survey methodology (pp. 342-369). Routledge.
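Our weights do not require trimming, but the idea can be sketched with made-up weights. The helper `trim_weights` below is hypothetical and shows only one simple variant: weights above the cap are truncated and all weights are rescaled to a mean of 1, repeated until no weight exceeds the cap. Other trimming procedures exist, and the `cap` argument in the raking call above addresses the same concern during raking.

```r
# Hypothetical helper: cap weights and rescale to a mean of 1
trim_weights <- function(w, cap = 5, max_iter = 100, tol = 1e-8) {
  for (i in seq_len(max_iter)) {
    w <- pmin(w, cap)   # truncate weights above the cap
    w <- w / mean(w)    # restore a mean of 1
    if (all(w <= cap + tol)) break
  }
  w
}

# Made-up weights with one extreme value, normalised to a mean of 1
toy_w <- c(0.4, 0.6, 0.8, 1.0, 1.2, 9.0)
toy_w <- toy_w / mean(toy_w)

trimmed <- trim_weights(toy_w, cap = 2)

range(toy_w)    # before trimming
range(trimmed)  # after trimming
mean(trimmed)
```

Rescaling after truncation can push a weight above the cap again, hence the loop.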

#### 5.5. Export the results (Optional)

Exporting the results is straightforward. Combining the ID and the weighting vector provides a very flexible way of managing the data.

In [None]:
# Extract relevant information from list
# (data.frame() keeps the weights numeric even if the IDs are not)
WeightExport <- data.frame(caseid = unname(RakingResult$caseid),
                           weight = unname(RakingResult$weightvec))

# Write SPSS file
write_sav(WeightExport, "data/VIC_AT_Weights.sav")

# Print exported data
datatable(WeightExport)
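Keeping the ID next to the weight means the weights can later be joined to any extract of the survey. A minimal sketch with made-up data follows. The variable `trust` and all values are hypothetical.

```r
# Made-up survey answers and weights, both keyed by the same ID
toy_survey  <- data.frame(caseid = 1:5,
                          trust  = c(1, 0, 1, 1, 0))
toy_weights <- data.frame(caseid = c(3, 1, 5, 2, 4),
                          weight = c(1.0, 1.5, 0.9, 0.7, 0.9))

# Join on the ID, so the row order of the two tables does not matter
toy_merged <- merge(toy_survey, toy_weights, by = "caseid")

# Compare the unweighted and the weighted share of trust == 1
unweighted_share <- mean(toy_merged$trust)
weighted_share   <- weighted.mean(toy_merged$trust, toy_merged$weight)
c(unweighted = unweighted_share, weighted = weighted_share)
```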

## Weights as survey quality measure

Weights are essentially measures of bias with respect to a set of reference variables. As such, we can use them to get an idea of the survey quality. Of course, conclusions about the quality of the survey as a whole are only indicative: the weighting procedure considers a small set of variables, and the bias in variables that were not included in the weighting might be completely unrelated. Still, smaller weights indicate that less adjustment was needed for the reference variables, even though the full answer is complicated and depends on different factors. 


The summary of the raking results provided us with some descriptives of the weight and the design effect, which we can use to get a graphical and numerical measure of the bias our data suffers from.

### 1. Graphical assessment

In [None]:
data.frame(Weight = RakingResult$weightvec) %>%
    ggplot(aes(as.numeric(Weight))) +
    geom_histogram(color="black", fill="white") +
    theme_minimal() +
    labs(x = "Weight",
         y = "Frequency") +
    theme(legend.position="none",
          plot.margin=unit(c(0,0,-0.5,0), "cm")) +
    geom_vline(xintercept = 1, colour = "red")

In [None]:
data.frame(Weight = RakingResult$weightvec) %>%
  ggplot(aes(as.numeric(Weight))) +
  geom_boxplot(width=0.6) +
  theme_minimal() +
  labs(x = "Weight",
       y = "") +
  scale_y_continuous(limits = c(-0.5,0.5)) +
  scale_x_continuous(limits = c(0.5,2)) +
  theme(legend.position="none",
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        plot.margin=unit(c(0.25,0,0,0), "cm"),
        strip.text.x = element_blank()) +
    geom_vline(xintercept = 1, colour = "red")

### 2. Numerical Assessment

We can extract the descriptives and the general design effect directly from the raking results.

In [None]:
# Get the summary statistics and general design effect
RakingSummary <- RakingResult %>%
    summary() %>%
    .[c("weight.summary",
        "general.design.effect")] %>%
    unlist() %>%
    round(5) %>%
    as.data.frame()

# Reformat the labels
RakingSummary <- rownames(RakingSummary) %>%
    str_remove_all("summary") %>%
    str_remove_all("general") %>%
    gsub("[.]", " ", .) %>%
    str_replace_all("weight","Weight") %>%
    str_replace_all("design", "Design") %>%
    cbind(RakingSummary) %>%
    setNames(c("Attribute","Value"))

In addition, it is helpful to get some information about outliers in the weight variable. We will identify the number of weights greater than 3 and 5 and calculate the sum of these weights. 

In [None]:
RakingSummary <- rbind(
    # Add sample size
    data.frame(Attribute = 'n',
               Value = length(RakingResult$weightvec)),
    # Add standard deviation of weights
    data.frame(Attribute = 'Standard deviation',
               Value = sd(RakingResult$weightvec)),
    RakingSummary,
    # Diagnostics for weights greater than 3
    data.frame(Attribute = 'No. of weights > 3 (W>3)',
               Value = sum(RakingResult$weightvec > 3)),
    # Sum the weights themselves, not the logical flags
    data.frame(Attribute = 'Sum of W>3',
               Value = sum(RakingResult$weightvec[RakingResult$weightvec > 3])),
    # Diagnostics for weights greater than 5
    data.frame(Attribute = 'No. of weights > 5 (W>5)',
               Value = sum(RakingResult$weightvec > 5)),
    data.frame(Attribute = 'Sum of W>5',
               Value = sum(RakingResult$weightvec[RakingResult$weightvec > 5]))
     )

rownames(RakingSummary) <- NULL

In [None]:
RakingSummary

### 3. Design effect


As with most statistics, descriptives are useful to get a first idea, but we usually want to know how serious the issue is. Of course, the answer is complicated and depends on different factors. Calculating the design effect helps to get an understanding of the gravity of the issue. 

Biemer and Christ (2012) call it the _Unequal Weighting Effect (UWE)_ defined by:

$$
UWE = 1 + cv^2
$$

with $cv$ as the coefficient of variation of the weights:

$$
cv = \frac{sd_{weights}}{mean_{weights}}
$$

> Biemer, P. P., & Christ, S. L. (2012). _Weighting survey data_. In International handbook of survey methodology (pp. 317-341). Routledge.

We can derive $cv$ for our weights.

In [None]:
sd_weights  <- sd(RakingResult$weightvec)
mean_weights  <- mean(RakingResult$weightvec)
cv  <- sd_weights/mean_weights
cv

And now the $UWE$:

In [None]:
UWE <- 1 + (cv * cv)
UWE
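One detail worth knowing is that `sd()` in R divides by $n - 1$, whereas some presentations of the coefficient of variation use $n$. The sketch below compares both variants on made-up weights. Which variant a given reference intends is not settled here, and for samples of typical survey size the difference is negligible.

```r
# Made-up weights with a mean of 1
toy_w <- c(0.6, 0.8, 1.0, 1.0, 1.2, 1.4)
n <- length(toy_w)

# cv based on the sample standard deviation (n - 1 in the denominator)
cv_sample <- sd(toy_w) / mean(toy_w)

# cv based on the population standard deviation (n in the denominator)
sd_pop <- sqrt(mean((toy_w - mean(toy_w))^2))
cv_pop <- sd_pop / mean(toy_w)

uwe_sample <- 1 + cv_sample^2
uwe_pop    <- 1 + cv_pop^2
c(uwe_sample = uwe_sample, uwe_pop = uwe_pop)

# The two squared cv values differ exactly by the factor n / (n - 1)
ratio <- cv_sample^2 / cv_pop^2
ratio
```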

Effective Sample Size might be a better way to understand the impact of the weights.

$$
Effective\ Size = \frac{n}{D_{eff}}
$$

Our data contain this number of cases:

In [None]:
length(RakingResult$weightvec)

The effective sample size is:

In [None]:
length(RakingResult$weightvec)/summary(RakingResult)$general.design.effect

Alternatively, if we only have the weights:

$$
Effective\ Size = \frac{(\sum_{i=1}^n w_i)^2}{\sum_{i=1}^n w_i^2}
$$

For our example this translates into:

In [None]:
(sum(RakingResult$weightvec)^2)/sum(RakingResult$weightvec^2)
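The two expressions for the effective sample size can be connected by a short derivation. It assumes that the design effect is taken to be $1 + cv^2$ and that $cv$ is computed with the variance that divides by $n$; with the $n - 1$ version the result holds only approximately. Writing $\bar{w} = \frac{1}{n}\sum_{i=1}^n w_i$:

$$
1 + cv^2 = 1 + \frac{\frac{1}{n}\sum_{i=1}^n (w_i - \bar{w})^2}{\bar{w}^2} = \frac{\frac{1}{n}\sum_{i=1}^n w_i^2}{\bar{w}^2} = \frac{n \sum_{i=1}^n w_i^2}{(\sum_{i=1}^n w_i)^2}
$$

Dividing $n$ by this quantity gives:

$$
\frac{n}{1 + cv^2} = \frac{(\sum_{i=1}^n w_i)^2}{\sum_{i=1}^n w_i^2}
$$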

### 4. Summary assessment


The raking worked and the created weights indicate only small adjustments. This is only partially surprising, as the survey applied a stratified quota sampling strategy for gender, age, education and region.

But we need to keep in mind that addressing sampling biases has limitations.

### 5. Putting things into perspective

<img src="ressources/Tab_Weights_VIC123_AT.png"  width=600 height=600 />

<img src="ressources/Plot_Weights_VIC123_AT.png"  width=700 height=700 />