# POLSCI PS137L Spring 2025

## In-Class Assignment: Challenges with subjective measures of democracy

This is the first part of a two-part notebook on a debate regarding the current state of democracy as well as broader questions of measuring democratic health (and backsliding).

In this exercise we will focus on some things that can go wrong with subjective measures. But we will be balanced and look at some problems with objective measures in the homework!

Here is a quick overview/reminder of a well-known democracy measurement project that we have used a few times throughout the semester: **Varieties of Democracy (V-Dem).**

- **Overview:**
  - V-Dem is a comprehensive dataset that measures various dimensions of democracy across the world.
  - It covers more than **200 countries** and spans from **1789 to the present**, making it one of the most extensive sources of democratic indicators.
  - The project is based at the V-Dem Institute at the University of Gothenburg, Sweden, and involves collaboration with scholars worldwide.

- **Data Collection Process:**
  - V-Dem relies on **thousands of country experts** (coders) to assess democratic indicators.
  - Each country-year observation is coded by multiple experts, ensuring **diverse perspectives and reducing bias.**
  - Coders assess both **objective facts** (e.g., election turnover) and **subjective perceptions** (e.g., freedom of expression, judicial independence).

- **Aggregation of Indices:**
  - To ensure reliability, V-Dem uses a **Bayesian measurement model** that accounts for coder differences.
  - It does this by treating democracy as something we can't observe directly (a latent concept) and using expert ratings as imperfect signals of it. The model accounts for differences between coders, such as their personal biases or tendencies to rate things too high or too low, by estimating how reliable each coder is based on their consistency and agreement with others.
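
To make the idea of "expert ratings as imperfect signals" concrete, here is a minimal simulation sketch. It is not V-Dem's actual model: the latent trend, the coder biases, the noise level, and the seed are all made-up assumptions, chosen only to show how ratings can be generated from something we never observe directly.

```r
# Toy illustration only: this is NOT V-Dem's actual measurement model.
set.seed(137)                                   # hypothetical seed, for reproducibility
n_years <- 10
latent <- seq(3.5, 2.5, length.out = n_years)   # assumed "true" (unobserved) level
coder_bias <- c(-1, -0.5, 0, 0.5, 1)            # assumed tendency of each coder to rate low or high

# Each rating = latent level + coder bias + random noise, rounded onto the 0-4 scale
ratings <- sapply(coder_bias, function(b) {
  raw <- latent + b + rnorm(n_years, sd = 0.3)
  pmin(pmax(round(raw), 0), 4)
})

dim(ratings)       # rows are years, columns are coders
rowMeans(ratings)  # raw average per year: an imperfect signal of `latent`
```

A measurement model works in the opposite direction: it starts from a matrix like `ratings` and tries to recover both `latent` and the coder-specific quirks.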

Here we will walk through a simplified version of their measurement model and think about what problems it can and can't solve.

First, let's look at one tiny slice of the index: how the United States fared on one specific indicator, the _Freedom of discussion for men (C) (v2cldiscm)_ indicator. For each year, experts are asked to give the United States one of the following five scores.

*Question:* Are men able to openly discuss political issues in private homes and in public spaces?

*Responses:*
- **0:** Not respected. Hardly any freedom of expression exists for men. Men are subject to immediate and harsh intervention and harassment for expression of political opinion.
- **1:** Weakly respected. Expressions of political opinions by men are frequently exposed to intervention and harassment.
- **2:** Somewhat respected. Expressions of political opinions by men are occasionally exposed to intervention and harassment.
- **3:** Mostly respected. There are minor restraints on the freedom of expression in the private sphere, predominantly limited to a few isolated cases or only linked to soft sanctions. But as a rule, there is no intervention or harassment if men make political statements.
- **4:** Fully respected. Freedom of speech for men in their homes and in public spaces is not restricted.  


## Part 1: The trend
Let's load up the data.

In [None]:
# Run this
suppressPackageStartupMessages(library(tidyverse))
usa <- read_rds("data/VDemUS-CY.rds")
head(usa)


Just a refresher on the variable names: 

| Variable         | Type    | Description                                                       |
|------------------|---------|-------------------------------------------------------------------|
| `country_text_id` | String  | The country identifier in text format. Example: "USA".          |
| `year`            | String  | The year associated with the observation, stored as text.        |
| `v2cldiscm`       | Number | Compiled Freedom of Discussion For Men Score (0 to 4). |

**Question 1.1. Make a plot of the `v2cldiscm` variable over time. Be sure to label the axes.**

In [None]:
# Code for 1.1

**Question 1.2. How did this measure change in the last 10-20 years of the data? Think about what might have caused this drop. Do you think this is reasonable?**

*Answer to 1.2*

## Part 2: Coder level data
Now, let's dig a little bit deeper into how this index is compiled. A nice thing about V-Dem is that they make the individual coder data available. This creates a massive dataset, but let's just look at the coders for the US on this variable for the last 10 years.

In [None]:
# Load the data
cy <- read_rds("data/coder_USA_contemp.rds")
head(cy)


| Variable         | Type    | Description                                                       |
|------------------|---------|-------------------------------------------------------------------|
| `country_text_id` | String  | The country identifier in text format. Example: "USA".          |
| `coder_id`        | Integer | Unique identifier for the coder who provided the data.           |
| `year`            | String  | The year associated with the observation, stored as text.        |
| `v2cldiscm`       | Integer | Freedom of Discussion Score for men (0, 1, 2, 3 or 4) given by that coder in that year |



Each row here corresponds to an individual coder, and the 0-4 point answer they gave to this question for a given year. 

By making a table of the `year` variable we can count how many coders there are in each year.

In [None]:
table(cy$year)

The number ranges from 10 to 20. This is actually high in a comparative sense: the minimum number of coders for recent years is 5, and most countries are close to that.


**Question 2.1. Make a table of the `coder_id` variable. Interpret the output.**

In [None]:
# Code for 2.1

*Answer to 2.1*

**Question 2.2. Recall we saw a dip in the data from around 2014 to 2017. Create subsets of this data for these individual years, and then make a table of the `v2cldiscm` variable for each subset to see the distribution of coder ratings in each year.**

**Question 2.3. Compare the range of responses across these years.**

*Words for 2.3*

We can compute the "raw average" coder rating using the `tapply` function. Let's put this in a year-level data frame for later use.

In [None]:
rawavg <- tapply(cy$v2cldiscm, cy$year, mean)
# setting up the data frame with a year column
yr <- data.frame(year=2014:2023)
# adding the raw average as a column to the data frame
yr$rawavg <- rawavg
yr
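
If `tapply` is unfamiliar, the small sketch below shows what it does on made-up data. The `toy` data frame and its values are hypothetical and have nothing to do with the V-Dem files.

```r
# Hypothetical toy data: three coders rating two years
toy <- data.frame(year   = c(2016, 2016, 2016, 2017, 2017, 2017),
                  rating = c(4, 3, 4, 3, 2, 4))

# tapply(values, groups, function): split `rating` by `year`,
# then apply `mean` to each group
toy_avg <- tapply(toy$rating, toy$year, mean)
toy_avg

# The names of the result are the group labels, so a year-level
# data frame can be rebuilt from them
toy_yr <- data.frame(year   = as.numeric(names(toy_avg)),
                     rawavg = as.numeric(toy_avg))
toy_yr
```

Building the year column from `names(toy_avg)` is a slightly more defensive alternative to typing the years by hand, since it cannot get out of step with the order of the averages.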

**Question 2.4. Plot the raw average of the coder rating of the US on this variable**

In [None]:
# Code for 2.4

You should find a relatively modest dip in the coding from 2014 to 2017, compared to the one in the original data. I think what is going on here is that those who gave a higher rating happened to get a lower weight for some reason. (It probably has to do with how they answered other questions.)

To get something a bit closer to the main data and explore how this weighting might work in more detail, let's create a version of the data where we drop some of the people who answered "4".

In [None]:
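# Which coders gave a 4 in 2017?
code4s <- cy$coder_id[cy$year == 2017 & cy$v2cldiscm == 4]
# Dropping the first 7 of them (from every year, not just 2017)
todrop <- code4s[1:7]
cy2 <- subset(cy, !coder_id %in% todrop)
# Comparing the dimensions before and after dropping
dim(cy)
dim(cy2)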

**Question 2.5. Compute the average coder rating in the `cy2` data (after dropping some of the coders who answered "4"), and add that to the `yr` data frame, in a column called `newavg`. Then plot the `newavg` variable over time.**

In [None]:
# Code for 2.5


## Part 3: Weighting

In a [response article to Little and Meng](https://www.cambridge.org/core/journals/ps-political-science-and-politics/article/conceptual-and-measurement-issues-in-assessing-democratic-backsliding/7A620BD91885C932B48E6783BC32CA24), authors who work at V-Dem state:
> If a few experts for a particular country shift their scores downward due
to bad-vibes bias, the bad-vibing experts likely will be considered less reliable and contribute less to the estimation process [of the Bayesian Measurement model]. A
country’s score on an indicator therefore is unlikely to experience a large decline unless the majority of its experts experience similar bad vibes.

Let's explore what they mean by this and why it may not be as simple as the authors suggest.

First, we are going to add a hypothetical "bad vibe" coder who starts coding the US as a 0 on this variable once Trump takes office in 2017.

In [None]:
badvibe <- data.frame(country_text_id = "USA", coder_id=789, year=2014:2023,
                      v2cldiscm=c(4,4,4,0,0,0,0,0,0,0))
# This adds the coder to the data
cy_wb <- rbind(cy2, badvibe)

Since you have already done lots of related coding, I'll just compute the average with the bad vibe coder and compare that to the version without the bad vibe coder for you.

In [None]:
avg_wb <- tapply(cy_wb$v2cldiscm, cy_wb$year, mean)
yr$avg_wb <- avg_wb
plot(yr$year, yr$newavg, type="l", ylim=c(0,4), xlab="Year", ylab="Average coder rating")
lines(yr$year, yr$avg_wb, col="red")

Unsurprisingly, adding the bad vibe coder lowers the average score (red) starting in 2017.

Now let's see how assigning a lower weight to coders who are out of step with others changes this picture. First we will add the average score on this variable for each year to the coder-level data.

In [None]:
# Making a data frame with the average 
avg_wb_df <- data.frame(year=2014:2023, avgdisc =avg_wb)
# Merging this into the coder-year data
cy_wb <- merge(cy_wb, avg_wb_df, by="year", all.x=TRUE)
head(cy_wb)
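
As a reminder of what this kind of merge does, here is a small sketch on made-up data (the data frames and values are hypothetical). Each coder-level row picks up the value from the matching year-level row, and `all.x=TRUE` keeps every coder-level row even if a year had no match.

```r
# Hypothetical coder-level rows (several per year) and year-level rows (one per year)
coder_rows <- data.frame(year     = c(2016, 2016, 2017),
                         coder_id = c(1, 2, 1),
                         rating   = c(4, 3, 2))
year_rows <- data.frame(year = c(2016, 2017),
                        avg  = c(3.5, 2))

# The year-level average is repeated on every coder row from that year
merged <- merge(coder_rows, year_rows, by = "year", all.x = TRUE)
merged
```

The merged data has the same number of rows as `coder_rows`, with one extra column.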

How should we compute how far off each coder is from the average? A common way to do that is to use the squared difference between their score and the average.

In [None]:
# Computing the squared difference between individual coder choices and the average
# among all coders
cy_wb$sqdiff <- (cy_wb$v2cldiscm - cy_wb$avgdisc)^2
head(cy_wb)
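
To see why we square the differences rather than using them directly, consider this small sketch with made-up ratings from four hypothetical coders in a single year.

```r
# Hypothetical ratings from four coders in one year
ratings <- c(4, 4, 3, 0)
avg <- mean(ratings)      # 2.75

# Plain differences: positive and negative values cancel out
diffs <- ratings - avg
sum(diffs)

# Squared differences: every distance is positive, and the
# coder who is far from the average stands out strongly
sq <- diffs^2
sq
```

The coder who answered 0 is 2.75 points from the average, but their squared difference is about 7.56, so squaring penalizes large disagreements much more heavily than small ones.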

Next we want to compute the average squared difference for each coder, which measures how close they are to what others say in general. We will put this into a coder-level data frame in the next step.

In [None]:
# Computing the coder average squared distance
coder_meandiff <- tapply(cy_wb$sqdiff, cy_wb$coder_id, mean)


We want to place a higher weight on coders who are closer to the average, that is, those whose mean `sqdiff` is low. Here we make a data frame with the weight we want to give each coder, which is decreasing in their mean squared difference. In particular, it gives a weight of 1 to a coder who is always exactly at the average, and the weight gets lower the further they are from it:

In [None]:
coderweight_df <- data.frame(coder_id = as.numeric(names(coder_meandiff)), 
                            coderweight=1/(1 + coder_meandiff))
coderweight_df
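
To get a feel for this weighting rule, the sketch below applies the same formula, `1/(1 + d)`, to a few made-up values of the mean squared difference `d`. The numbers are hypothetical and chosen only to make the arithmetic easy.

```r
# Hypothetical mean squared differences for four coders
meandiff <- c(0, 0.25, 1, 4)

# Same rule as above: weight 1 at the average, smaller further away
w <- 1 / (1 + meandiff)
w   # 1.0, 0.8, 0.5, 0.2
```

Note that the weight shrinks toward zero but never reaches it, so under this rule every coder still contributes something to the weighted average.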

**Question 3.1. Recall our "bad vibe" coder was given id 789. How does the weight of this coder compare to others? Is this consistent with the quote from the V-Dem author reply?**

*Words for 3.1*

Now let's recompute the score with the weighted average. First we merge the coder weights into our coder-level data

In [None]:
cy_wb <- merge(cy_wb, coderweight_df, by="coder_id", all.x=TRUE)
head(cy_wb)

The weighted average of a variable $x_1,x_2,...,x_n$ with weights $w_1, w_2, ...,w_n$ is:
$$
\frac {\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}
$$

I'll spare you from writing the code for this, which computes the weighted sum of the ratings (the numerator) and the sum of the weights (the denominator) for each year, and then divides the first by the second.

In [None]:
# Computing the numerator of our weighted average formula
num <- tapply(cy_wb$v2cldiscm*cy_wb$coderweight, cy_wb$year, sum)
# And the denominator
denom <- tapply(cy_wb$coderweight, cy_wb$year, sum)
# Now adding this to our year data
cwa_disc <- num/denom
yr$wa <- cwa_disc
yr
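
As a sanity check on the formula, here is a small sketch for a single year using made-up ratings and weights. It also compares the hand-computed result with base R's `weighted.mean` function.

```r
# Hypothetical ratings and weights for four coders in one year
x <- c(4, 4, 3, 0)
w <- c(1, 1, 0.8, 0.2)

# Weighted average by hand: sum of weight * rating, over sum of weights
manual <- sum(w * x) / sum(w)
manual

# Base R has a built-in function that does the same thing
builtin <- weighted.mean(x, w)
all.equal(manual, builtin)

# For comparison, the unweighted average
mean(x)
```

In this toy case the low-weight coder who answered 0 pulls the weighted average down less than the unweighted one, but still pulls it below what the other three coders said.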

**Question 3.2. Plot the `newavg` over time in black, the average with the bad vibe coder in red, and the weighted average with the bad vibe coder in blue.**

In [None]:
# Code for 3.2

It might seem surprising that the "bad vibe" coder can drag the average down so much given that they got a low weight. To see why, let's look at the relationship between the rating given in 2017 and the weight each coder gets.

In [None]:
cy_2017 <- subset(cy_wb, year == 2017)
plot(cy_2017$v2cldiscm, cy_2017$coderweight, xlab="2017 Rating", ylab="Weight")


**Question 3.3. Suppose that in reality a score of 3 or 4 is reasonable here, and those who answered 2 are "semi-bad vibes coders." Who generally gets a higher weight here, those who scored 2 or those who scored 4? How might the bad vibe coder affect this? What does this imply about the quote from the V-Dem reply?**

*Words for 3.3*