In [1]:
import pandas as pd
import numpy as np
import plotly.express as px


In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("mysterymeatie/protests-from-1990-to-march-2020")

print("Path to dataset files:", path)

# load the data and print all the columns with their data types
data=pd.read_csv('/kaggle/input/protests-from-1990-to-march-2020/data.csv')
print(data.dtypes)
print(data.shape)


Path to dataset files: /kaggle/input/protests-from-1990-to-march-2020
id                       int64
Country                 object
Year                     int64
region                  object
protest                  int64
protesterviolence        int64
protesterdemand1        object
protesterdemand2        object
protesterdemand3        object
protesterdemand4        object
stateresponse1          object
stateresponse2          object
stateresponse3          object
stateresponse4          object
stateresponse5          object
stateresponse6          object
stateresponse7          object
Electoral_Score        float64
Liberal_Score          float64
Participatory_Score    float64
Deliberative_Score     float64
Egalitarian_Score      float64
HDI_Score              float64
violenceStatus           int64
predicted_prob         float64
dtype: object
(12652, 25)


In [3]:
data.head()


Unnamed: 0,id,Country,Year,region,protest,protesterviolence,protesterdemand1,protesterdemand2,protesterdemand3,protesterdemand4,...,stateresponse6,stateresponse7,Electoral_Score,Liberal_Score,Participatory_Score,Deliberative_Score,Egalitarian_Score,HDI_Score,violenceStatus,predicted_prob
0,201990001,Canada,1990,North America,1,0,"political behavior, process",labor wage dispute,,,...,,,0.834,0.759,0.58,0.756,0.719,0.85,0,0.299385
1,201990002,Canada,1990,North America,1,0,"political behavior, process",,,,...,,,0.834,0.759,0.58,0.756,0.719,0.85,0,0.299385
2,201990003,Canada,1990,North America,1,0,"political behavior, process",,,,...,,,0.834,0.759,0.58,0.756,0.719,0.85,0,0.299385
3,201990004,Canada,1990,North America,1,1,land farm issue,,,,...,,,0.834,0.759,0.58,0.756,0.719,0.85,1,0.299385
4,201990005,Canada,1990,North America,1,1,"political behavior, process",,,,...,,,0.834,0.759,0.58,0.756,0.719,0.85,1,0.299385


## Before starting the analysis, let's check for missing values.

In [4]:
# print the NaN values in each column
print(data.isnull().sum())

id                         0
Country                    0
Year                       0
region                     0
protest                    0
protesterviolence          0
protesterdemand1           1
protesterdemand2       10091
protesterdemand3       12317
protesterdemand4       12011
stateresponse1            23
stateresponse2         10280
stateresponse3         11896
stateresponse4         12453
stateresponse5         11995
stateresponse6         12639
stateresponse7         11893
Electoral_Score            0
Liberal_Score              0
Participatory_Score        0
Deliberative_Score         0
Egalitarian_Score          0
HDI_Score                216
violenceStatus             0
predicted_prob           216
dtype: int64


### Initial observations from a first scan of the data

- None of the 12,652 rows is entirely empty.
- All but one record have at least one demand (`protesterdemand1` is missing in a single row).
- About 80% of records have no second demand (`protesterdemand2` is null in 10,091 of 12,652 rows).
- Violence status seems a good target for a binary classification task.
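The share of protests with one, two, or more demands can be checked directly by counting the non-null demand columns per row. The sketch below uses a small made-up frame named `toy` in place of the real `data`; only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for the real `data` frame; the values are made up for illustration.
toy = pd.DataFrame({
    "protesterdemand1": ["labor wage dispute", "political behavior, process", None, "land farm issue"],
    "protesterdemand2": [None, "labor wage dispute", None, None],
    "protesterdemand3": [None, None, None, None],
    "protesterdemand4": [None, "police brutality", None, None],
})

demand_cols = [f"protesterdemand{i}" for i in range(1, 5)]

# Number of non-null demand columns in each row
n_demands = toy[demand_cols].notna().sum(axis=1)

# Share of rows for each demand count (0, 1, 2, ...)
demand_share = n_demands.value_counts(normalize=True).sort_index()
print(demand_share)
```

Counting across all four columns, rather than looking only at `protesterdemand2`, also catches rows where a later demand column is filled while an earlier one is empty.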

### Now let's dive into this dataset and start our Exploratory Data Analysis

### You can leverage libraries like ydata_profiling, sweetviz, autoviz to get a detailed EDA report of the data


In [5]:
from ydata_profiling import ProfileReport

profile = ProfileReport(data, title="Pandas Profiling Report")
# profile.to_file("output.html")


In [6]:
# compare the number of records in 1990 and 2019
data_1990 = data[data['Year'] == 1990]
print(data_1990.shape)
data_2019 = data[data['Year'] == 2019]
print(data_2019.shape)

(413, 25)
(712, 25)


In [7]:
# group the data by country,year and count the number of records in each country,keep the region column
data_country_year = data.groupby(['Country', 'Year', 'region']).size().reset_index(name='count')
# display the 10 country-year pairs with the most records
print(data_country_year.sort_values('count', ascending=False).head(10))


         Country  Year  region  count
1279       Kenya  2015  Africa    143
2294     Ukraine  2014  Europe     91
42       Algeria  2019    MENA     66
1964     Romania  1990  Europe     63
1358     Lebanon  2019    MENA     62
842      Germany  2015  Europe     62
177   Bangladesh  2011    Asia     60
180   Bangladesh  2014    Asia     59
2378       Yemen  2011    MENA     58
1760     Nigeria  2016  Africa     58


In [8]:
# find the top 10 countries with the most records in all years
data_country = data.groupby('Country').size().reset_index(name='count')
print(data_country.sort_values('count', ascending=False).head(10))


            Country  count
126  United Kingdom    574
43           France    542
59          Ireland    431
46          Germany    362
65            Kenya    350
9        Bangladesh    338
25            China    316
48           Greece    312
120        Thailand    249
86          Namibia    225


### Among the top 10 countries with the highest protest counts, five are from Europe (United Kingdom, France, Ireland, Germany, and Greece), three are from Asia (China, Bangladesh, and Thailand), and two are from Africa (Kenya and Namibia).
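A regional tally like the one above can be produced in code by keeping the `region` column while counting per country. This is a minimal sketch on a made-up frame (`toy`, with placeholder country names); on the real data the same chain would run on `data` with `top_n = 10`.

```python
import pandas as pd

# Toy stand-in for `data`; countries and counts are placeholders.
toy = pd.DataFrame({
    "Country": ["X"] * 4 + ["Y"] * 3 + ["Z"] * 2 + ["W"],
    "region": ["Europe"] * 4 + ["Asia"] * 3 + ["Europe"] * 2 + ["Africa"],
})

top_n = 3  # hypothetical cutoff for the toy data

# Count records per country, keeping the region alongside it
top = (
    toy.groupby(["Country", "region"]).size()
    .reset_index(name="count")
    .nlargest(top_n, "count")
)

# How many of the top countries fall in each region
region_tally = top["region"].value_counts()
print(top)
print(region_tally)
```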

In [9]:
# build a heatmap to visualize the number of records in each region per year
data_region_year = data.groupby(['region', 'Year']).size().reset_index(name='count')
fig = px.imshow(data_region_year.pivot(index='region', columns='Year', values='count').fillna(0), title='Number of records in each region per year')
fig.show()


Regional Protest Analysis:

- Europe has the highest number of recorded protests of any region
- Africa and Asia show clear upward trends in recorded protest events
- This is consistent with an established protest culture in Europe and growing protest movements in developing regions
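One way to put a number on "upward trend" is to fit a straight line to each region's yearly counts and compare the slopes. The sketch below does this on made-up counts; `toy_counts` and `yearly_slope` are illustrative names, and on the real data `data_region_year` would take the place of `toy_counts`.

```python
import numpy as np
import pandas as pd

# Toy yearly counts per region; made-up numbers for illustration only.
toy_counts = pd.DataFrame({
    "region": ["Africa"] * 4 + ["Europe"] * 4,
    "Year": [1990, 1991, 1992, 1993] * 2,
    "count": [10, 12, 14, 16, 40, 40, 40, 40],
})

def yearly_slope(group: pd.DataFrame) -> float:
    """Least-squares slope of count against Year (records per year)."""
    # Centering the years keeps the fit numerically well conditioned
    years = group["Year"] - group["Year"].mean()
    slope, _intercept = np.polyfit(years, group["count"], deg=1)
    return float(slope)

slopes = {region: yearly_slope(g) for region, g in toy_counts.groupby("region")}
print(slopes)
```

A positive slope means the region gains records over time. A linear fit is only a rough summary and will miss spikes in individual years.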


In [10]:

data_country_year = data.groupby(['Country', 'Year', 'region']).size().reset_index(name='count')
data_country_year = data_country_year.sort_values(['Year', 'count'], ascending=[True, False])
data_country_year_top10 = data_country_year.groupby('Year').head(10)

fig = px.bar(
    data_country_year_top10, 
    y='Country', 
    x='count', 
    color='region', 
    animation_frame='Year', 
    title='Top 10 Countries with the Highest Records per Year'
)

fig.update_layout(
    yaxis={'categoryorder': 'total ascending'},
    yaxis_title=None,
    updatemenus=[{
        'type': 'buttons',
        'showactive': False,
        'buttons': [
            {
                'label': 'Play',
                'method': 'animate',
                'args': [None, {'frame': {'duration': 2000, 'redraw': True}, 'fromcurrent': True}]
            },
            {
                'label': 'Pause',
                'method': 'animate',
                'args': [[None], {'frame': {'duration': 0, 'redraw': False}, 'mode': 'immediate', 'transition': {'duration': 0}}]
            }
        ]
    }]
)

fig.show()



In [11]:
# find the violent records in the data for each year
data_violence = data[data['protesterviolence'] == 1]
data_violence_count = data_violence.groupby('Year').size().reset_index(name='vio_count')

# group data by year and count total records per year
data_violence_trend = data.groupby('Year').size().reset_index(name='total_count')

# merge the yearly violent counts with the yearly totals
data_violence_ratio = data_violence_count.merge(data_violence_trend, on='Year')

# calculate the share of violent records for each year
data_violence_ratio['ratio'] = data_violence_ratio['vio_count'] / data_violence_ratio['total_count']

# display the first few rows of the resulting dataframe
print(data_violence_ratio.head())

# plot the total and violent record counts per year, with the violence ratio on a secondary axis
fig = px.line(data_violence_ratio, x='Year', y='total_count', title='Number of records per year', labels={'total_count': 'Total Records'})
fig.add_scatter(x=data_violence_ratio['Year'], y=data_violence_ratio['vio_count'], name='Violence Records', mode='lines')
fig.add_scatter(x=data_violence_ratio['Year'], y=data_violence_ratio['ratio'], name='Violence Ratio', mode='lines', yaxis='y2', line=dict(dash='dash'))
fig.update_layout(yaxis2={'overlaying': 'y', 'side': 'right', 'title': 'Violence Ratio', 'range': [0, 1]})
fig.show()

   Year  vio_count  total_count     ratio
0  1990        127          413  0.307506
1  1991         76          276  0.275362
2  1992         86          302  0.284768
3  1993         79          275  0.287273
4  1994        102          337  0.302671


Protest and Violence Trends (1990-2019):

- Overall protest events show an upward trend
- Violent protest events also increased proportionally
- Violence ratio remained stable at approximately 30% throughout the period
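Because `protesterviolence` is a 0/1 flag, the yearly violent count, total count, and ratio can also be computed in a single grouped aggregation. The following is a sketch on a made-up frame (`toy`), not the notebook's original method.

```python
import pandas as pd

# Toy stand-in for `data`; values are made up for illustration.
toy = pd.DataFrame({
    "Year": [1990, 1990, 1990, 1990, 1991, 1991],
    "protesterviolence": [1, 0, 0, 1, 0, 1],
})

# For a 0/1 column: sum = violent records, size = all records, mean = violent share
yearly = (
    toy.groupby("Year")["protesterviolence"]
    .agg(vio_count="sum", total_count="size", ratio="mean")
    .reset_index()
)
print(yearly)
```

Unlike an inner merge of two separate count tables, this keeps years that have no violent records at all, with a ratio of 0.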


In [12]:
import plotly.graph_objects as go

# count the primary protester demands
data_demand = data.groupby('protesterdemand1').size().reset_index(name='count')
data_demand = data_demand.sort_values('count', ascending=False)

# count and print the primary state responses to the protests
data_response = data.groupby('stateresponse1').size().reset_index(name='count')
data_response = data_response.sort_values('count', ascending=False)
print(data_response)

# show both distributions as side-by-side donut charts
fig = go.Figure()

fig.add_trace(go.Pie(
    labels=data_demand['protesterdemand1'],
    values=data_demand['count'],
    name='Protester Demand',
    hole=0.4,
    domain={'x': [0, 0.45]}
))

fig.add_trace(go.Pie(
    labels=data_response['stateresponse1'],
    values=data_response['count'],
    name='State Response',
    hole=0.4,
    domain={'x': [0.55, 1]}
))

fig.update_layout(
    title_text='Protester Demand and State Response',
    annotations=[
        {'text': 'Protester Demand', 'x': 0.18, 'y': 0.5, 'showarrow': False},
        {'text': 'State Response', 'x': 0.82, 'y': 0.5, 'showarrow': False}
    ],
    showlegend=True
)

fig.show()

    stateresponse1  count
4           ignore   6841
3  crowd dispersal   3313
1          arrests    874
0     accomodation    830
6        shootings    366
5         killings    219
2         beatings    186


In [13]:
# for each region, find the most common protester demand and its ratio
data_region_demand = data.groupby(['region', 'protesterdemand1']).size().reset_index(name='count')
data_region_demand = data_region_demand.sort_values(['region', 'count'], ascending=[True, False])
data_region_demand_top = data_region_demand.groupby('region').head(1)
data_region_demand_total = data_region_demand.groupby('region')['count'].sum().reset_index(name='total_count')
data_region_demand_top = data_region_demand_top.merge(data_region_demand_total, on='region')
data_region_demand_top['ratio'] = data_region_demand_top['count'] / data_region_demand_top['total_count']
print(data_region_demand_top)



            region             protesterdemand1  count  total_count     ratio
0           Africa  political behavior, process   1684         2760  0.610145
1             Asia  political behavior, process   1426         2485  0.573843
2  Central America  political behavior, process    270          451  0.598670
3           Europe  political behavior, process   2723         4218  0.645567
4             MENA  political behavior, process    644          983  0.655137
5    North America  political behavior, process    357          520  0.686538
6          Oceania  political behavior, process     21           38  0.552632
7    South America  political behavior, process    839         1196  0.701505


In [14]:
# create a heatmap to visualize the number of records for each state response and protester demand,apply log transformation
data_response_demand = data.groupby(['stateresponse1', 'protesterdemand1']).size().reset_index(name='count')
data_response_demand['log_count'] = np.log1p(data_response_demand['count'])
fig = px.imshow(data_response_demand.pivot(index='stateresponse1', columns='protesterdemand1', values='log_count').fillna(0), title='Number of records for each state response and protester demand')
# set size of the heatmap
fig.update_layout(width=800, height=800)
fig.show()

- Common Government Responses:
  - "Ignore" and "crowd dispersal" are the most frequent responses across all types of protests (6,841 and 3,313 records overall)
  - "Beatings" (186), "killings" (219), and "shootings" (366) are the least common responses, indicating less frequent use of violent or lethal force
  - This suggests authorities tend to prefer non-violent or passive approaches to handling protests

- Protest Satisfaction Patterns:
  - Excluding political demands, labor and wage-related protests show the highest satisfaction levels in the heatmap, which plots log-scaled counts rather than rates
  - Social restriction protests show the lowest likelihood of achieving satisfaction
  - This pattern might indicate that concrete, economic demands (like labor issues) are more likely to be addressed than broader social policy changes
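The heatmap above plots log-scaled counts, so demand types with many records look strong in every row. To compare how often each demand type receives a given response, the counts can be normalized within each demand. The sketch below uses `pd.crosstab` with `normalize="columns"` on a made-up frame (`toy`); the category labels are placeholders.

```python
import pandas as pd

# Toy stand-in for `data`; values are made up for illustration.
toy = pd.DataFrame({
    "protesterdemand1": ["labor wage dispute"] * 4 + ["social restrictions"] * 4,
    "stateresponse1": [
        "accomodation", "accomodation", "ignore", "ignore",
        "ignore", "ignore", "ignore", "crowd dispersal",
    ],
})

# Each column sums to 1: the share of every response within one demand type
response_share = pd.crosstab(
    toy["stateresponse1"], toy["protesterdemand1"], normalize="columns"
)
print(response_share.round(2))
```

Reading along one response row then gives rates that are comparable between demand types with very different record counts.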

In [15]:
# check the correlation between the numerical columns
data_numeric = data.select_dtypes(include=['int64', 'float64'])
# exclude the Year, id, and protest columns
data_numeric = data_numeric.drop(['Year', 'id', 'protest'], axis=1)

correlation = data_numeric.corr()
# plot the correlation matrix as a heatmap
fig = px.imshow(correlation, title='Correlation Matrix')
fig.show()


The analysis of democracy scores reveals two key findings:

- Relationship between Democracy Scores and Violence:
  - All democracy scores (Electoral, Liberal, Participatory, Deliberative, and Egalitarian) show strong positive correlations with each other
  - These scores show negative correlations with violence status
  - This suggests that protests in countries with higher democracy scores tend to be less violent

- Democracy Scores and Protest Satisfaction:
  - Democracy scores show negative correlations with predicted probability of satisfaction
  - This indicates that protests in more democratic countries are less likely to achieve satisfaction
  - This could suggest that in more democratic societies, protesters might have higher expectations or face more complex processes for achieving their demands
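In the `data.head()` output, all Canada 1990 rows carry the same democracy scores, so a row-level correlation counts each country-year once per protest. A possible robustness check, sketched below on made-up values, is to aggregate to one row per country-year before correlating. The frame `toy` and the derived column names are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for `data`; values are made up for illustration.
toy = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "C"],
    "Year": [1990] * 6,
    "Liberal_Score": [0.8, 0.8, 0.8, 0.4, 0.4, 0.2],
    "violenceStatus": [0, 0, 1, 1, 0, 1],
})

# One row per country-year: the shared score and the share of violent protests
country_year = (
    toy.groupby(["Country", "Year"])
    .agg(
        Liberal_Score=("Liberal_Score", "first"),
        violence_share=("violenceStatus", "mean"),
        n_protests=("violenceStatus", "size"),
    )
    .reset_index()
)

# Pearson correlation at the country-year level
corr = country_year["Liberal_Score"].corr(country_year["violence_share"])
print(country_year)
print(round(corr, 3))
```

If the sign and rough size of the correlation hold at this level, the finding is less likely to be driven by a few country-years with many records.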