### 1) Name

Gabriella Messenger

### 2) Project Topic/Title

Analysis of Global Protests from 1990-2020

### 3) Inspiration

Explain the reason for your choice of data. Why is it interesting to you? Why do you think it is worth exploring for this project? Include any motivations or background that you believe is relevant. **(0.5 points)**

This dataset contains information on protests against the state worldwide from 1990 to 2020. It combines data from three sources: the Mass Mobilization (MM) Dataset, the Varieties of Democracy (V-Dem) Dataset, and the Human Development Index (HDI). The MM dataset records the year, country, and region of each protest, along with protester demands, state responses, and whether violence occurred. V-Dem is a comprehensive and widely used dataset that measures various indices of democracy and governance, providing electoral, liberal, participatory, deliberative, and egalitarian democracy scores for the country and year of each protest. HDI is a composite measure that ranks countries based on their level of human development. Together, these data can provide valuable insight into the nature of citizen movements against the state and how the demands of both interact in protests. Within the past 5 years, we have seen a global shift towards right-wing control, exemplified by the Trump administration in the U.S. and the rise of Germany's far-right AfD party, among others. In the U.S. in particular, censorship, repeals of progressive policies, and social unrest are at levels unprecedented in recent history. Thus, understanding the efficacy and nature of citizen protests is more relevant than ever.

### 4)  Data 

- Identify and describe your data source(s). Share any related links and/or citations.
- **Using some code**, demonstrate that you have been able to read the data and (**using some code again**) give a brief overview of the data:
    - How many observations?
    - How many variables? How many of them are continuous and how many of them are categorical?
    - Are there any missing values? Will cleaning/imputing be necessary?
    - Is there any substantial correlation between the variables? This can be all variables in the data or only your variables of interest.

**(2 points)**

Data source: https://www.kaggle.com/code/devraai/protests-analysis-1990-to-march-2020

In [8]:
#### Initial exploration



import pandas as pd
import numpy as np

data = pd.read_csv('data.csv')
data.head()

data.stateresponse1.value_counts()

stateresponse1
ignore             6841
crowd dispersal    3313
arrests             874
accomodation        830
shootings           366
killings            219
beatings            186
Name: count, dtype: int64

In [7]:
print(f'Data has shape: {data.shape[0]} observations and {data.shape[1]} columns.')

print(data.columns)

print('Year range:', data.Year.min(), data.Year.max() )
print('Electoral_Score:', data.Electoral_Score.min(), data.Electoral_Score.max() )
print('Liberal_Score:', data.Liberal_Score.min(), data.Liberal_Score.max() )
print('Participatory_Score:', data.Participatory_Score.min(), data.Participatory_Score.max() )
print('Deliberative_Score:', data.Deliberative_Score.min(), data.Deliberative_Score.max() )
print('Egalitarian_Score:', data.Egalitarian_Score.min(), data.Egalitarian_Score.max() )
print('HDI_Score:',data.HDI_Score.min(), data.HDI_Score.max() )

Data has shape: 12652 observations and 25 columns.
Index(['id', 'Country', 'Year', 'region', 'protest', 'protesterviolence',
       'protesterdemand1', 'protesterdemand2', 'protesterdemand3',
       'protesterdemand4', 'stateresponse1', 'stateresponse2',
       'stateresponse3', 'stateresponse4', 'stateresponse5', 'stateresponse6',
       'stateresponse7', 'Electoral_Score', 'Liberal_Score',
       'Participatory_Score', 'Deliberative_Score', 'Egalitarian_Score',
       'HDI_Score', 'violenceStatus', 'predicted_prob'],
      dtype='object')
Year range: 1990 2019
Electoral_Score: 0.014 0.922
Liberal_Score: 0.006 0.896
Participatory_Score: 0.009 0.807
Deliberative_Score: 0.006 0.886
Egalitarian_Score: 0.034 0.885
HDI_Score: 0.197 0.955


Categorical variables: Country, region, protesterdemand1 - protesterdemand4, stateresponse1 - stateresponse7

Numerical variables: Year, protesterviolence, violenceStatus, Electoral_Score, Liberal_Score, Participatory_Score, Deliberative_Score, Egalitarian_Score, and HDI_Score

- protesterviolence and violenceStatus are discrete, 1 indicating violence
- Year ranges from 1990-2019
- democracy indices and HDI scores are continuous, ranging from 0 to 1
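The grouping above was written out by hand; a rough way to get the same split from code is sketched below. The toy frame only borrows a few column names from the dataset (its values are made up), and `summarize_types` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Toy frame reusing a few of the dataset's column names (values are made up).
toy = pd.DataFrame({
    "Country": ["Canada", "Kenya", "Brazil"],
    "region": ["North America", "Africa", "South America"],
    "Year": [1990, 2005, 2019],
    "protesterviolence": [0, 1, 0],
    "Electoral_Score": [0.85, 0.40, 0.70],
    "HDI_Score": [0.85, 0.55, 0.76],
})

def summarize_types(df):
    """Split columns into categorical, binary (0/1 flags), and other numeric."""
    categorical = df.select_dtypes(exclude="number").columns.tolist()
    numeric = df.select_dtypes(include="number").columns
    # Numeric columns with at most two distinct values are treated as flags.
    binary = [c for c in numeric if df[c].nunique() <= 2]
    other = [c for c in numeric if c not in binary]
    return {"categorical": categorical, "binary": binary, "numeric": other}

types = summarize_types(toy)
for group, cols in types.items():
    print(f"{group}: {len(cols)} -> {cols}")
```

On the real data, calling `summarize_types(data)` should reproduce the lists above, with `id` and `protest` showing up as extra numeric columns.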

In [9]:
print(data.isnull().sum())

id                         0
Country                    0
Year                       0
region                     0
protest                    0
protesterviolence          0
protesterdemand1           1
protesterdemand2       10091
protesterdemand3       12317
protesterdemand4       12011
stateresponse1            23
stateresponse2         10280
stateresponse3         11896
stateresponse4         12453
stateresponse5         11995
stateresponse6         12639
stateresponse7         11893
Electoral_Score            0
Liberal_Score              0
Participatory_Score        0
Deliberative_Score         0
Egalitarian_Score          0
HDI_Score                216
violenceStatus             0
predicted_prob           216
dtype: int64


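There are missing values in this dataset. Most of them fall in the secondary demand and response columns (protesterdemand2 - protesterdemand4 and stateresponse2 - stateresponse7), which is expected because many protests record only one demand and one response. Further exploration will be necessary to determine which rows to discard, that is, which observations are missing values for ALL protester demand columns or ALL state response columns. Most observations have at least 1 protester demand and 1 state response, and these will be used for my analysis. HDI_Score is also missing for 216 observations, which will need to be dropped or imputed before HDI_Score is used as a predictor. There are also certain columns I will remove: protest (all observations are protests, so every value is 1) and predicted_prob, since this is the output of a logit classification model and not collected data.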
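A minimal sketch of this cleaning step, assuming the column names printed earlier. The toy rows and the helper name `clean_protests` are illustrative only.

```python
import pandas as pd

demand_cols = [f"protesterdemand{i}" for i in range(1, 5)]
response_cols = [f"stateresponse{i}" for i in range(1, 8)]

def clean_protests(df):
    """Drop rows missing every demand or every response, then drop unused columns."""
    no_demand = df[demand_cols].isna().all(axis=1)
    no_response = df[response_cols].isna().all(axis=1)
    kept = df.loc[~(no_demand | no_response)]
    return kept.drop(columns=["protest", "predicted_prob"], errors="ignore")

# Toy rows (made-up values): only the first has both a demand and a response.
rows = [
    {"protesterdemand1": "labor wage dispute", "stateresponse1": "ignore",
     "protest": 1, "predicted_prob": 0.2},
    {"protesterdemand1": "police brutality", "protest": 1, "predicted_prob": 0.4},
    {"stateresponse2": "arrests", "protest": 1, "predicted_prob": 0.1},
]
toy = pd.DataFrame(rows).reindex(
    columns=demand_cols + response_cols + ["protest", "predicted_prob"]
)

cleaned = clean_protests(toy)
print(f"kept {len(cleaned)} of {len(toy)} rows")
```

The same function can be applied to the full frame as `clean_protests(data)`. How to handle the missing HDI_Score values is left open here.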

In [11]:
num_data = data[['Liberal_Score', 'Electoral_Score', 'Participatory_Score', 'Deliberative_Score', 'Egalitarian_Score', 'HDI_Score', 'Year']]
num_data.corr()

| | Liberal_Score | Electoral_Score | Participatory_Score | Deliberative_Score | Egalitarian_Score | HDI_Score | Year |
|---|---|---|---|---|---|---|---|
| Liberal_Score | 1.0 | 0.977991 | 0.97757 | 0.979241 | 0.967448 | 0.652654 | -0.028444 |
| Electoral_Score | 0.977991 | 1.0 | 0.980828 | 0.963404 | 0.942761 | 0.600305 | -0.036454 |
| Participatory_Score | 0.97757 | 0.980828 | 1.0 | 0.963755 | 0.948148 | 0.634805 | -0.019305 |
| Deliberative_Score | 0.979241 | 0.963404 | 0.963755 | 1.0 | 0.950908 | 0.620824 | -0.029175 |
| Egalitarian_Score | 0.967448 | 0.942761 | 0.948148 | 0.950908 | 1.0 | 0.711389 | -0.030508 |
| HDI_Score | 0.652654 | 0.600305 | 0.634805 | 0.620824 | 0.711389 | 1.0 | 0.211394 |
| Year | -0.028444 | -0.036454 | -0.019305 | -0.029175 | -0.030508 | 0.211394 | 1.0 |


All democracy indices (liberal, electoral, participatory, deliberative, and egalitarian scores) are highly correlated with one another, with every pairwise correlation above 0.94. This is expected since the ideals of a democratic government are, in theory, inclusive of liberalism, electoralism, participation, deliberation, and egalitarianism.

HDI_Score is moderately correlated with all democracy indices (roughly 0.60 to 0.71), which is intuitive since we usually associate democratic nations with a higher standard of living than their non-democratic counterparts.

Year shows almost zero (although slightly negative) correlation with all democracy indices (between about -0.04 and -0.02) and a slightly positive correlation with HDI_Score (0.21), *potentially* indicating that while human development has increased since 1990, levels of democracy have not. This will be further explored in my analysis.

### 5) Questions

Answer the following questions to describe your plan for the project content.

- Is this a regression or a classification problem?
- Which variable will be the response and which variables will be your predictors?
- What is your plan to develop a model? How many models are you planning to train? **Note that using a single model with all predictors at once is not acceptable. You need to start with a simple model. Keep adding predictors and observe the changes.**
- How will you explore the non-linearities in the data?
- Are there any variables that you are planning to exclude from your models? If yes, explain why.
- How will you evaluate the prediction performance of your models? Justify your choice.
- How will you perform inference on your models?

**(2 points)**



#### Overview
This project will be a classification problem. My goal is to predict whether a protest is successful (class 1) or unsuccessful (class 0). Success will be determined by the state response; if state response is 'accomodation' for any protest, that protest will be classified as successful. My goal is to uncover nuances in the relationship between the nature of a protest -- determined by the protester demands and whether or not protester violence occurred -- and the response of the state. I am also interested in determining what kinds of protests incite state violence, specifically for non-violent protests; this would be interesting to examine in the context of a certain country's democracy and HDI scores. This is also a classification problem, where state violence will be class 1, and the absence of state violence will be class 0. 

#### Model Development
First, I will define a new variable, 'success', which equals 1 if 'accomodation' appears in any of the state response variables for a given protest and 0 otherwise. I will also define a new variable, 'stateviolence', which will have a value of 1 if violenceStatus = 1 and protesterviolence = 0, since this signals violence on the part of the government, and 0 otherwise.
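The two derived variables can be sketched as follows. The toy rows are made up, `add_targets` is a hypothetical helper name, and 'accomodation' is spelled the way it appears in the stateresponse1 value counts.

```python
import pandas as pd

response_cols = [f"stateresponse{i}" for i in range(1, 8)]

def add_targets(df):
    """Add the binary 'success' and 'stateviolence' columns described above."""
    out = df.copy()
    present = [c for c in response_cols if c in out.columns]
    # success = 1 if any response slot holds 'accomodation' (dataset spelling).
    out["success"] = out[present].eq("accomodation").any(axis=1).astype(int)
    # stateviolence = 1 if violence occurred but protesters were not violent.
    out["stateviolence"] = (
        (out["violenceStatus"] == 1) & (out["protesterviolence"] == 0)
    ).astype(int)
    return out

# Toy rows (made-up values).
toy = pd.DataFrame({
    "stateresponse1": ["ignore", "crowd dispersal", "arrests"],
    "stateresponse2": [None, "accomodation", None],
    "violenceStatus": [0, 1, 1],
    "protesterviolence": [0, 1, 0],
})
labeled = add_targets(toy)
print(labeled[["success", "stateviolence"]])
```

Missing response slots compare as unequal to 'accomodation', so they never count toward success.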

**Part 1**

Can the success of a protest be determined by the nature of the protest itself? Furthermore, what kinds of protests are more likely to result in success depending on the global region in which they occur? Are protests of any kind more likely to succeed in countries with higher democracy scores?

For my first model, the response will be 'success'. I will start with a simple model: logistic regression with predictors 'protesterviolence' and 'protesterdemand1' through 'protesterdemand4'. This will provide a rough outline of the relationship between protest nature and success. Then, I will add 'region', and likely increase complexity by including an interaction between region and protest nature, as captured by the initial two sets of predictors. This will provide insight into the regional trends in citizen protest. To further examine *where* certain kinds of protests are more successful, I will add all democracy score indices and HDI scores. This will provide insight into the relationship between democracy and protest success, i.e., whether democratic nations are actually more receptive to citizen petitions and demands. Lastly, I will include 'Year' as a predictor to see how receptivity to protests has changed since 1990.
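Because a protest's demands are spread across four columns, the same demand can sit in a different slot from one protest to the next. One way to turn them into predictors is a single 0/1 indicator per demand type. The sketch below assumes that encoding, which is one choice among several; the demand labels in the toy rows are illustrative.

```python
import pandas as pd

demand_cols = [f"protesterdemand{i}" for i in range(1, 5)]

def demand_indicators(df):
    """One 0/1 column per demand type, regardless of which slot (1-4) holds it."""
    # Long format: one row per (protest, slot); the original index is kept.
    long = df[demand_cols].melt(ignore_index=False)["value"].dropna()
    dummies = pd.get_dummies(long, prefix="demand", dtype=int)
    # Collapse the slots back to one row per protest.
    per_protest = dummies.groupby(level=0).max()
    # Protests with no recorded demand get all zeros.
    return per_protest.reindex(df.index, fill_value=0)

# Toy rows (illustrative labels).
toy = pd.DataFrame({
    "protesterdemand1": ["labor wage dispute", "police brutality", None],
    "protesterdemand2": ["police brutality", None, None],
    "protesterdemand3": [None, None, None],
    "protesterdemand4": [None, None, None],
})
indicators = demand_indicators(toy)
print(indicators)
```

The resulting columns can be joined to 'protesterviolence' for the first model, with 'region', the democracy and HDI scores, and 'Year' added in later stages.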


**Part 2**

Can we predict the use of state violence at a protest based on a given government's democracy scores? In countries with higher democracy scores, are protests more likely to be non-violent? How has this relationship changed since 1990?

The second part of my exploration will pertain to the relationship between democracy and HDI scores and state violence. Here, 'stateviolence' will be the response, and the initial predictors will be all democracy indices and the HDI score. Then, I will increase complexity by including 'Year' and 'region', potentially as interaction terms, to examine how state violence varies across geography and time. Finally, I will include protester demands as predictors to examine the potential relationship between protest nature and state violence.
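The stepwise build-up can be sketched with the statsmodels formula interface on synthetic data. Everything in the toy frame is randomly generated, only one democracy index is included to keep the example small, and `year_c` (decades since 1990) is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 600
toy = pd.DataFrame({
    "Liberal_Score": rng.uniform(0, 1, n),
    "HDI_Score": rng.uniform(0.2, 0.95, n),
    "Year": rng.integers(1990, 2020, n),
    "region": rng.choice(["Africa", "Asia", "Europe"], n),
})
# Synthetic response: violence made less likely at higher Liberal_Score
# (an arbitrary choice for the toy data, not a finding).
eta = 0.5 - 2.0 * toy["Liberal_Score"]
toy["stateviolence"] = rng.binomial(1, 1 / (1 + np.exp(-eta)))
toy["year_c"] = (toy["Year"] - 1990) / 10  # decades since 1990

# Nested models: each formula adds terms to the previous one.
formulas = [
    "stateviolence ~ Liberal_Score + HDI_Score",
    "stateviolence ~ Liberal_Score + HDI_Score + year_c + C(region)",
    "stateviolence ~ Liberal_Score + HDI_Score + year_c * C(region)",
]
fits = [smf.logit(f, data=toy).fit(disp=0) for f in formulas]

for f, m in zip(formulas, fits):
    print(f"{m.df_model:.0f} terms  AIC={m.aic:.1f}  LLR p-value={m.llr_pvalue:.3g}")
```

Comparing AIC and the coefficient tables (`m.summary()`) from one step to the next shows whether the added terms change the picture. The `llr_pvalue` attribute reports the test of each fitted model against the intercept-only model.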

#### Other Considerations

The following variables will be excluded from my analysis: id, protest, predicted_prob, and potentially Country. id is redundant, protest equals 1 for every observation, and predicted_prob is the output of an unknown logistic model, so none of them is relevant for my exploration. To examine geographic trends in protest success and state violence, I will experiment with using 'region' and 'Country' and determine which is better suited for this analysis. If regional trends miss significant variation between countries in that region, I will use country instead. If trends within a region are generally homogeneous, I will use region for simplicity.
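One rough way to judge whether region hides country-level variation is to compare country-level success rates within each region. The sketch below uses made-up rows and a hypothetical helper, `within_region_spread`. A large spread for a region would suggest that 'Country' carries information 'region' misses.

```python
import pandas as pd

def within_region_spread(df, outcome="success"):
    """Standard deviation of country-level outcome rates inside each region."""
    country_rates = df.groupby(["region", "Country"])[outcome].mean()
    return country_rates.groupby(level="region").std()

# Toy rows (made-up): countries in region A differ, those in region B do not.
toy = pd.DataFrame({
    "region":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Country": ["X", "X", "Y", "Y", "P", "P", "Q", "Q"],
    "success": [1, 1, 0, 0, 1, 0, 0, 1],
})
spread = within_region_spread(toy)
print(spread)
```

What counts as a large spread is a judgment call, and countries with very few protests will have noisy rates.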

I will evaluate the prediction performance of my models with several of the metrics discussed thus far: recall, precision, accuracy, and the area under the precision-recall curve. To perform inference, I will examine the coefficients and statistical significance of all predictors mentioned above. When appropriate, I will conduct a likelihood-ratio (LLR) test to determine the significance of the overall model. Furthermore, I will apply Lasso and Ridge regularization to shrink the coefficients and improve the model's prediction performance.
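A minimal sketch of this evaluation with scikit-learn on synthetic data. The features and response are randomly generated, the 0.5 threshold and `C=0.1` are arbitrary choices, and average precision is used as the summary of the precision-recall curve. Because 'accomodation' accounts for only 830 of the first-slot state responses shown earlier, the positive class is likely to be rare, which is why precision and recall are reported alongside accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(800, 5))
# Synthetic response: only the first two features matter, positives are rarer.
eta = -1.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

results = {}
for penalty, solver in [("l2", "lbfgs"), ("l1", "liblinear")]:  # Ridge, Lasso
    clf = LogisticRegression(penalty=penalty, solver=solver, C=0.1, max_iter=1000)
    clf.fit(X_tr, y_tr)
    prob = clf.predict_proba(X_te)[:, 1]
    pred = (prob >= 0.5).astype(int)
    results[penalty] = {
        "accuracy": accuracy_score(y_te, pred),
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred),
        "pr_auc": average_precision_score(y_te, prob),
        "nonzero_coefs": int(np.count_nonzero(clf.coef_)),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

In practice the regularization strength `C` would be chosen by cross-validation rather than fixed, and the predictors would be standardized first so the penalty treats them evenly.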


### 6) Stakeholders

Who would be the stakeholders of this project, i.e. who would be interested in hearing your results and how would those results benefit them? **(0.5 points)**

The main stakeholders of this project are those in the field of comparative politics and political science as a whole. A quantitative analysis of civic behaviors and state responses such as this one is also relevant to the social sciences more broadly, such as sociology. However, in the context of a global right-wing shift, political literacy and informed skepticism of self-described democracies are more important than ever as more citizens take to protest to express their beliefs and outrage. Thus, I believe stakeholders include constituent populations as a whole, although academically this project falls most directly within the study of political science.