<a href="https://colab.research.google.com/github/XicaFelix/gdelt_global_protests_2017_2022/blob/dev/GDLET_protests_2017_2021.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Trends in Global Protests (2017-2021)

## Motivation
I was interested in exploring protest trends before and after the pandemic, as we are currently in a precarious time in America and elsewhere in the world. Nationalism and populism are on the rise throughout the world, and arguably have been for nearly a decade. There were considerable protests during the American President's first term beginning in 2017, as well as during the Covid-19 pandemic, which took place during his presidency. People in America and throughout the world protested their governments, pandemic restrictions, and structural racism in America and other countries, the last of these spurred by the Black Lives Matter (BLM) Movement.

Once again, we face unpopular government activity, and emerging health crises like a measles resurgence. My goal is to see what the motivations were for protests prior to the pandemic versus after.

## Data Source
To accomplish this lofty goal, I needed a massive dataset that kept track of protests throughout the world before, during, and after the pandemic. The [GDELT Project](https://www.gdeltproject.org/) fits this need exactly. GDELT monitors news media around the world. It tracks "over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day" (GDELT Project, 2025).

GDELT is accessible via BigQuery, where its "events" table records news events globally. Each event is categorized using an "EventRootCode"; protests have a root code of "14". Protests are further subcategorized with an "EventCode", such as "141" for policy change or "142" for anti-government protests. GDELT includes information on the protestors and the protest target, as well as some information on the role these actors play in society. Actor roles are encoded as "ActorTypes". Each actor can have up to three actor types; however, only about 50% of actors have any type, and in my observation they generally have just one type listed, if any.

## Data Selection
In selecting my time range, I wanted to get at least two years before the declaration of the Covid-19 pandemic in March of 2020, and ideally two years after. This would give me a time range of 2018-2022. Initially, I faced issues with accessing data in this time frame: my initial SQL query showed gaps in the 2022 data. With this in mind, I shifted my time range one year earlier, to 2017-2021.
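The SQL query itself is not shown in this notebook. As a rough sketch only, a query of the following shape could pull protest events for a range of years from GDELT's public BigQuery events table. The table name `gdelt-bq.gdeltv2.events`, the use of its integer `SQLDATE` column, and the helper `build_protest_query` are assumptions for illustration, not the query that was actually run here.

```python
def build_protest_query(start_year: int, end_year: int) -> str:
    """Return a BigQuery Standard SQL string for protest events (illustrative)."""
    # Assumption: the public GDELT 2.0 events table, whose SQLDATE column
    # is an integer in YYYYMMDD form. Adjust to the table you actually query.
    return f"""
    SELECT
      PARSE_DATE('%Y%m%d', CAST(SQLDATE AS STRING)) AS Date,
      Actor1Name, Actor1Type1Code, NumMentions,
      Actor2Name, Actor2Type1Code,
      EventRootCode, EventCode, GoldsteinScale, AvgTone,
      ActionGeo_CountryCode, ActionGeo_Lat, ActionGeo_Long
    FROM `gdelt-bq.gdeltv2.events`
    WHERE EventRootCode = '14'
      AND SQLDATE BETWEEN {start_year}0101 AND {end_year}1231
    """


query = build_protest_query(2017, 2021)
print(query)
```

The `CountryName` column used later does not appear to be part of that table, so it would presumably come from a join against a separate country-code lookup.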

## Imports and Authentication

We will upgrade the pandas BigQuery package (pandas-gbq), which will allow us to read in our 2017-2021 data set saved in BigQuery. Accessing BigQuery in Colab also requires authentication.

In [None]:
# Upgrade pandas-gbq and import the Colab auth helper
!pip install --upgrade pandas-gbq
from google.colab import auth



In [None]:
# authenticate
auth.authenticate_user()

#### Packages to be Installed
- **numpy**: Statistical computation package. Works in conjunction with pandas to manipulate the data

- **pandas**: Data analysis and manipulation package. Performs data frame operations like deleting columns, finding unique values, etc.

- **pandas-gbq**: Allows us to load the BigQuery table as a pandas data frame

- **pycountry**: Geographic package that provides country names and their subdivisions. This is helpful for matching countries and cities to a role (ex. government, civilians) when they are listed as protest targets or protestors and GDELT does not list an ActorType

- **rapidfuzz**: Fuzzy matching package. Fuzzy matching can help us match actor names to known names and locations even when the spelling differs

- **spacy**: Natural Language Processing (NLP) package. We will use this to match protestors and their targets to a role (government, healthcare, etc.) when that information is missing in GDELT


In [None]:
# Imports
!pip install pycountry rapidfuzz

import pycountry
import pandas as pd
import numpy as np
import spacy
from pandas_gbq import read_gbq
from spacy.matcher import PhraseMatcher
from rapidfuzz import process, fuzz



## Load the Data Set

We are loading in the raw data generated from the SQL query on the GDELT events table. The resulting data has been saved in BigQuery as a separate table called "gdelt_2017_2021_raw". This table covers protests from 2017 to 2021.

In [None]:
# Load the data into a data frame
df = read_gbq(
    "SELECT * FROM `gdelt-protests-2019-2022.gdelt_analysis.gdelt_2017_2021_raw`",
    project_id="gdelt-protests-2019-2022",
    dialect="standard"
)

Downloading: 100%|██████████|


We can use the `shape` attribute to identify the size of our pandas data frame and how much data we will be working with. We can also use `head()` to see the first 5 rows along with all the columns. This dataset has the following columns:

- **Date**: Date of the protest event

- **Actor1Name**: Name of the protestor

- **Actor1Type1Code**: The 3 letter code detailing the role of the actor as categorized by GDELT (ex. LAB -> Labor)

- **NumMentions**: How many times this protest was mentioned in news media

- **Actor2Name**: Name of the protest target

- **Actor2Type1Code**: The 3 letter code detailing the role of the actor as categorized by GDELT

- **EventRootCode**: Type of event. "14" is for protests

- **EventCode**: Code specifying what type of protest the event is (ex. "142" -> Anti-government)

- **GoldsteinScale**: The intensity of the protest

- **AvgTone**: The tone of the protest (Negative to Positive, 0 is neutral)

- **ActionGeo_CountryCode**: FIPS 2 letter country code

- **CountryName**: Canonical name of the country where protest occurred

- **ActionGeo_Lat**: Latitude where protest occurred

- **ActionGeo_Long**: Longitude where protest occurred

In [None]:
# Check row count
len1 = df.shape[0]

print(f"File1: {len1} rows")

# Check df head
df.head()

File1: 4398214 rows


Unnamed: 0,Date,Actor1Name,Actor1Type1Code,NumMentions,Actor2Name,Actor2Type1Code,EventRootCode,EventCode,GoldsteinScale,AvgTone,ActionGeo_CountryCode,CountryName,ActionGeo_Lat,ActionGeo_Long
0,2021-09-28,WORKER,LAB,10,NURSE,HLH,14,141,-6.5,-1.634473,NZ,New Zealand,-42.0,174.0
1,2021-10-29,PROTESTER,OPP,2,POLICE,COP,14,140,-6.5,-3.389831,TH,Thailand,13.75,100.517
2,2021-10-26,POLICE,COP,6,,,14,141,-6.5,-2.371218,PO,Portugal,39.5,-8.0
3,2017-03-28,VILLAGE,CVL,2,,,14,141,-6.5,-6.313646,RI,Serbia,43.1322,20.5647
4,2017-08-17,BOLIVIAN,MIL,2,,,14,141,-6.5,-5.414013,BL,Bolivia,-17.0,-65.0


## Pre-processing

### Remove Duplicates

We need to remove duplicate events to avoid double counting them in the final data set. We can identify duplicates by looking for events that have the same combination of date, latitude, longitude, actors, and event code.

In [None]:
# Drop duplicates based on key columns
key_cols = ['Date', 'ActionGeo_Lat', 'ActionGeo_Long',
            'Actor1Name', 'Actor2Name', 'EventCode']
unique_before = df.shape[0]
df = df.drop_duplicates(subset=key_cols)
print(f"Dropped {unique_before - df.shape[0]} duplicates based on {key_cols}")

Dropped 454684 duplicates based on ['Date', 'ActionGeo_Lat', 'ActionGeo_Long', 'Actor1Name', 'Actor2Name', 'EventCode']


### Drop Missing Data

Likewise, we should remove incomplete rows that are missing values in columns we will analyze, such as date, latitude, longitude, and average tone.

To make the data easier to read, we can round Average Tone and Goldstein Scale to two decimal places.

In [None]:
# Drop rows with missing data
unique_before = df.shape[0]
df = df.dropna(subset=['Date', 'ActionGeo_Lat', 'ActionGeo_Long', 'AvgTone']).copy()
print(f"Dropped {unique_before - df.shape[0]} rows with missing data")

# Round data to make charts easier to read
df['AvgTone'] = df['AvgTone'].round(2)
df['GoldsteinScale'] = df['GoldsteinScale'].round(2)


df.head()

Dropped 201906 rows with missing data


Unnamed: 0,Date,Actor1Name,Actor1Type1Code,NumMentions,Actor2Name,Actor2Type1Code,EventRootCode,EventCode,GoldsteinScale,AvgTone,ActionGeo_CountryCode,CountryName,ActionGeo_Lat,ActionGeo_Long
0,2021-09-28,WORKER,LAB,10,NURSE,HLH,14,141,-6.5,-1.63,NZ,New Zealand,-42.0,174.0
1,2021-10-29,PROTESTER,OPP,2,POLICE,COP,14,140,-6.5,-3.39,TH,Thailand,13.75,100.517
2,2021-10-26,POLICE,COP,6,,,14,141,-6.5,-2.37,PO,Portugal,39.5,-8.0
3,2017-03-28,VILLAGE,CVL,2,,,14,141,-6.5,-6.31,RI,Serbia,43.1322,20.5647
4,2017-08-17,BOLIVIAN,MIL,2,,,14,141,-6.5,-5.41,BL,Bolivia,-17.0,-65.0


### Drop Event Root Code

Since all protests have a "14" EventRootCode, this column does not provide useful information.

In [None]:
# Drop event Root code
df = df.drop(columns=['EventRootCode'])
df.head()

Unnamed: 0,Date,Actor1Name,Actor1Type1Code,NumMentions,Actor2Name,Actor2Type1Code,EventCode,GoldsteinScale,AvgTone,ActionGeo_CountryCode,CountryName,ActionGeo_Lat,ActionGeo_Long
0,2021-09-28,WORKER,LAB,10,NURSE,HLH,141,-6.5,-1.63,NZ,New Zealand,-42.0,174.0
1,2021-10-29,PROTESTER,OPP,2,POLICE,COP,140,-6.5,-3.39,TH,Thailand,13.75,100.517
2,2021-10-26,POLICE,COP,6,,,141,-6.5,-2.37,PO,Portugal,39.5,-8.0
3,2017-03-28,VILLAGE,CVL,2,,,141,-6.5,-6.31,RI,Serbia,43.1322,20.5647
4,2017-08-17,BOLIVIAN,MIL,2,,,141,-6.5,-5.41,BL,Bolivia,-17.0,-65.0


### Fill in Unknown Actors

We will replace missing actor names (shown as 'None') with a string like "Unknown Actor 1" to prevent future errors on NaN values and to provide more human-readable information.

In [None]:
# Replace missing actors with an "Unknown Actor" placeholder
df['Actor1Name'] = df['Actor1Name'].fillna('Unknown Actor 1')
df['Actor2Name'] = df['Actor2Name'].fillna('Unknown Actor 2')

df.head()

Unnamed: 0,Date,Actor1Name,Actor1Type1Code,NumMentions,Actor2Name,Actor2Type1Code,EventCode,GoldsteinScale,AvgTone,ActionGeo_CountryCode,CountryName,ActionGeo_Lat,ActionGeo_Long
0,2021-09-28,WORKER,LAB,10,NURSE,HLH,141,-6.5,-1.63,NZ,New Zealand,-42.0,174.0
1,2021-10-29,PROTESTER,OPP,2,POLICE,COP,140,-6.5,-3.39,TH,Thailand,13.75,100.517
2,2021-10-26,POLICE,COP,6,Unknown Actor 2,,141,-6.5,-2.37,PO,Portugal,39.5,-8.0
3,2017-03-28,VILLAGE,CVL,2,Unknown Actor 2,,141,-6.5,-6.31,RI,Serbia,43.1322,20.5647
4,2017-08-17,BOLIVIAN,MIL,2,Unknown Actor 2,,141,-6.5,-5.41,BL,Bolivia,-17.0,-65.0


### Convert Date to Pandas DateTime

The date column will be easier to work with if it is in pandas datetime format.

In [None]:
# Convert date to pd datetime (dates display as YYYY-MM-DD, so let pandas infer the format)
df['Date'] = pd.to_datetime(df['Date'])


### Create Covid Era Column

Since we want to see the change in protests due to the pandemic, we can create a new column based on the protest date that tells us if the protest occurred before the pandemic or during.

In [None]:
# Create a new column to see if the protest was pre-COVID (3/1/2020)
df['COVID_Era'] = np.where(df['Date'] < '2020-03-01', 'Pre-COVID',
                           'COVID-Era')
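As a quick sanity check on the cutoff, the small sketch below applies the same comparison to three made-up dates. It suggests that a protest dated exactly 2020-03-01 lands in the COVID era, since the comparison is strictly "less than". The sample dates and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical dates on either side of the cutoff used above
dates = pd.to_datetime(pd.Series(["2020-02-29", "2020-03-01", "2021-09-28"]))
cutoff = pd.Timestamp("2020-03-01")

# Same rule as the COVID_Era column: strictly before the cutoff is Pre-COVID
era = np.where(dates < cutoff, "Pre-COVID", "COVID-Era")
print(list(era))
```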

### Map Event Codes to Protest Motivation

We can infer what motivated the protest from the EventCodes. GDELT provides descriptions of each event code and what they signify:

- **141**: Policy Change
- **142**: Anti-Government
- **143**: Anti-Business
- **144**: Group Rights
- **145**: Anti-Discrimination

Using these event codes, we can create a protest motivation column. Protests that don't have a specific code will default to "General Protest".


In [None]:
# Track motivations of the protest using the Event Code

# Convert EventCode to string if it's numeric
df['EventCode'] = df['EventCode'].astype(str)

# Define conditions and corresponding motivations
conditions = [
    df['EventCode'] == '141',
    df['EventCode'] == '142',
    df['EventCode'] == '143',
    df['EventCode'] == '144',
    df['EventCode'] == '145'
]

motivations = [
    'Policy Change',
    'Anti-Government',
    'Anti-Business',
    'Group Rights',
    'Anti-Discrimination'
]

# Default fallback if no match
df['ProtestMotivation'] = np.select(conditions, motivations,
                                    default='General Protest')
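An equivalent way to express the same mapping is a dictionary lookup with a fallback. The sketch below is a minimal alternative, not a replacement for the cell above. The names `MOTIVATION_BY_CODE` and `label_motivation` are made up for illustration, and the codes and labels are copied from the list above.

```python
import pandas as pd

# Codes and labels copied from the list in this section
MOTIVATION_BY_CODE = {
    "141": "Policy Change",
    "142": "Anti-Government",
    "143": "Anti-Business",
    "144": "Group Rights",
    "145": "Anti-Discrimination",
}


def label_motivation(event_codes: pd.Series) -> pd.Series:
    """Map event codes to motivations, defaulting to 'General Protest'."""
    codes = event_codes.astype(str)
    return codes.map(MOTIVATION_BY_CODE).fillna("General Protest")


# Mixed int/str input, plus codes that are not in the mapping
demo = pd.Series([141, "142", "140", "1411"])
labels = label_motivation(demo)
print(labels.tolist())
```

Because the lookup is an exact match, any code not listed (including longer four-digit codes, if they occur in the data) falls through to "General Protest", just as with `np.select`.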


In [None]:
# Check the results
df.head()

Unnamed: 0,Date,Actor1Name,Actor1Type1Code,NumMentions,Actor2Name,Actor2Type1Code,EventCode,GoldsteinScale,AvgTone,ActionGeo_CountryCode,CountryName,ActionGeo_Lat,ActionGeo_Long,COVID_Era,ProtestMotivation
0,2021-09-28,WORKER,LAB,10,NURSE,HLH,141,-6.5,-1.63,NZ,New Zealand,-42.0,174.0,COVID-Era,Policy Change
1,2021-10-29,PROTESTER,OPP,2,POLICE,COP,140,-6.5,-3.39,TH,Thailand,13.75,100.517,COVID-Era,General Protest
2,2021-10-26,POLICE,COP,6,Unknown Actor 2,,141,-6.5,-2.37,PO,Portugal,39.5,-8.0,COVID-Era,Policy Change
3,2017-03-28,VILLAGE,CVL,2,Unknown Actor 2,,141,-6.5,-6.31,RI,Serbia,43.1322,20.5647,Pre-COVID,Policy Change
4,2017-08-17,BOLIVIAN,MIL,2,Unknown Actor 2,,141,-6.5,-5.41,BL,Bolivia,-17.0,-65.0,Pre-COVID,Policy Change


### Convert Actor Types to Human Readable Names

The 3-letter Actor Type Codes are not easy to decipher, so it would be more helpful to convert them to their names. We can use the GDELT lookup table to convert each actor's Actor Type Code to an Actor Type Name, then join these names to the data set.

In [None]:
# 1) Load the CAMEO actor‐type lookup
type_url = "https://www.gdeltproject.org/data/lookups/CAMEO.type.txt"
type_code = pd.read_csv(
    type_url,
    sep="\t",
    header=None,
    names=["TypeCode","TypeLabel"],
    dtype=str
)

# 2) Create two separate lookup tables for each actor (avoid lookup naming conflict)
type_code1 = type_code.rename(
    columns={"TypeCode":"Actor1Type1Code","TypeLabel":"PrimaryActorType"}
)
type_code2 = type_code.rename(
    columns={"TypeCode":"Actor2Type1Code","TypeLabel":"SecondaryActorType"}
)

# 3) Merge type names for each actor
df = df.merge(type_code1, how="left", on="Actor1Type1Code")
df = df.merge(type_code2, how="left", on="Actor2Type1Code")
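A left merge keeps every event row, but it would also duplicate rows if the lookup ever contained the same code twice. As a hedged sketch on toy data, pandas' `validate="many_to_one"` option can guard against that. The tiny `events` and `lookup` frames below are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the events table and the CAMEO type lookup
events = pd.DataFrame({
    "Actor1Name": ["WORKER", "POLICE", "VILLAGE"],
    "Actor1Type1Code": ["LAB", "COP", None],
})
lookup = pd.DataFrame({
    "Actor1Type1Code": ["LAB", "COP"],
    "PrimaryActorType": ["Labor", "Police forces"],
})

# validate="many_to_one" raises if a code appears more than once in the lookup
merged = events.merge(lookup, how="left", on="Actor1Type1Code",
                      validate="many_to_one")
print(merged)

# A lookup with a repeated code is rejected instead of silently adding rows
dup_lookup = pd.concat([lookup, lookup.iloc[[0]]], ignore_index=True)
try:
    events.merge(dup_lookup, how="left", on="Actor1Type1Code",
                 validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
print("Duplicate lookup rejected:", raised)
```

Rows whose code is missing simply get an empty actor type, which is what we see for the unknown actors in the output below.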


In [None]:
# Check data
df.head()

Unnamed: 0,Date,Actor1Name,Actor1Type1Code,NumMentions,Actor2Name,Actor2Type1Code,EventCode,GoldsteinScale,AvgTone,ActionGeo_CountryCode,CountryName,ActionGeo_Lat,ActionGeo_Long,COVID_Era,ProtestMotivation,PrimaryActorType,SecondaryActorType
0,2021-09-28,WORKER,LAB,10,NURSE,HLH,141,-6.5,-1.63,NZ,New Zealand,-42.0,174.0,COVID-Era,Policy Change,Labor,Health
1,2021-10-29,PROTESTER,OPP,2,POLICE,COP,140,-6.5,-3.39,TH,Thailand,13.75,100.517,COVID-Era,General Protest,Political Opposition,Police forces
2,2021-10-26,POLICE,COP,6,Unknown Actor 2,,141,-6.5,-2.37,PO,Portugal,39.5,-8.0,COVID-Era,Policy Change,Police forces,
3,2017-03-28,VILLAGE,CVL,2,Unknown Actor 2,,141,-6.5,-6.31,RI,Serbia,43.1322,20.5647,Pre-COVID,Policy Change,Civilian,
4,2017-08-17,BOLIVIAN,MIL,2,Unknown Actor 2,,141,-6.5,-5.41,BL,Bolivia,-17.0,-65.0,Pre-COVID,Policy Change,Military,


Now that we have the human readable Actor Type Names, we don't need the type codes.

In [None]:
# 4) Drop the extra TypeCode columns
df = df.drop(columns=["Actor1Type1Code","Actor2Type1Code"])



### Address Rows without Actor Types

As mentioned previously, GDELT does not include actor types for every single actor. We will need a way to address the missing actor types.

First, we can get a count of how many there are.

In [None]:
# Find number of rows without actor types
unmatched_actor1_type = df[df['PrimaryActorType'].isna()]['Actor1Name'].count()
unmatched_actor2_type = df[df['SecondaryActorType'].isna()]['Actor2Name'].count()

print(f"Unmatched actor1 types: {unmatched_actor1_type}")
print(f"Unmatched actor2 types: {unmatched_actor2_type}")

Unmatched actor1 types: 1945166
Unmatched actor2 types: 2407450


Of the roughly 3.7 million rows remaining after cleaning, about half of Actor1s (protestors) and nearly two-thirds of Actor2s (protest targets) have no actor type listed.

We can use NER, phrase matching, and fuzzy matching to try to match the remaining rows to an actor type.

#### Get Top 10 Raw Actor Names by Actor Type for Fuzzy Matching

If we can find the top Actor Names for each Actor type, we can use fuzzy matching to categorize similar occurrences of Actor Names to the same type.

In [None]:
# Count how often each Actor1 name appears under each actor type
counts1 = (
    df
    .groupby(['PrimaryActorType','Actor1Name'])
    .size()
    .reset_index(name='count')
)

# Pick the 10 most frequent actor names per actor type
top10_actor1 = (
    counts1
    .groupby('PrimaryActorType', group_keys=False)
    .apply(lambda grp: grp.nlargest(10, 'count'))
    .reset_index(drop=True)
)

# Count how often each Actor2 name appears under each actor type
counts2 = (
    df
    .groupby(['SecondaryActorType','Actor2Name'])
    .size()
    .reset_index(name='count')
)

# Pick the 10 most frequent actor names per actor type
top10_actor2 = (
    counts2
    .groupby('SecondaryActorType', group_keys=False)
    .apply(lambda grp: grp.nlargest(10, 'count'))
    .reset_index(drop=True)
)
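The same "top N per group" result can also be reached without `groupby().apply()`, by sorting once and taking the head of each group. The sketch below shows this on a tiny made-up counts table. The helper name `top_n_per_group` and the sample counts are assumptions for illustration.

```python
import pandas as pd

# Made-up counts in the same shape as counts1 / counts2
counts = pd.DataFrame({
    "PrimaryActorType": ["Agriculture"] * 3 + ["Business"] * 3,
    "Actor1Name": ["FARMER", "FISHER", "RANCHER", "COMPANY", "BANK", "TRADER"],
    "count": [50, 5, 20, 7, 30, 1],
})


def top_n_per_group(frame: pd.DataFrame, group_col: str, n: int) -> pd.DataFrame:
    """Return the n rows with the largest 'count' within each group."""
    return (
        frame
        .sort_values([group_col, "count"], ascending=[True, False])
        .groupby(group_col)
        .head(n)
        .reset_index(drop=True)
    )


top2 = top_n_per_group(counts, "PrimaryActorType", 2)
print(top2)
```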



In [None]:
# Check data
top10_actor1.head(20)

Unnamed: 0,PrimaryActorType,Actor1Name,count
0,Agriculture,FARMER,19834
1,Agriculture,FISHER,316
2,Agriculture,FARM WORKER,173
3,Agriculture,FRENCH,113
4,Agriculture,MAHARASHTRA,108
5,Agriculture,UNITED STATES,104
6,Agriculture,RANCHER,102
7,Agriculture,IRISH,85
8,Agriculture,GREEK,84
9,Agriculture,TAMIL NADU,80


Some of the top Actor Names are locations. This is problematic, as not all locations will correspond to one particular actor type. We will need to filter out locations from this list.

Now that we have the top actor names by type, we can use them to match other rows that have not matched to an actor type.

In [None]:
# Create lookup dictionaries with top actor names by type
name_to_code_primary = dict(zip(top10_actor1['Actor1Name'], top10_actor1['PrimaryActorType']))
name_to_code_secondary = dict(zip(top10_actor2['Actor2Name'], top10_actor2['SecondaryActorType']))

In [None]:
# Check data
name_to_code_primary

{'FARMER': 'Agriculture',
 'FISHER': 'Agriculture',
 'FARM WORKER': 'Agriculture',
 'FRENCH': 'Labor',
 'MAHARASHTRA': 'Agriculture',
 'UNITED STATES': 'Radical',
 'RANCHER': 'Agriculture',
 'IRISH': 'Agriculture',
 'GREEK': 'Agriculture',
 'TAMIL NADU': 'Agriculture',
 'COMPANY': 'Business',
 'BUSINESS': 'Business',
 'BANK': 'Business',
 'COMPANIES': 'Business',
 'INDUSTRY': 'Business',
 'TRADER': 'Business',
 'INVESTOR': 'Business',
 'EMPLOYER': 'Business',
 'CORPORATION': 'Business',
 'PRODUCER': 'Business',
 'RESIDENTS': 'Civilian',
 'CITIZEN': 'Civilian',
 'COMMUNITY': 'Civilian',
 'VILLAGE': 'Civilian',
 'POPULATION': 'Civilian',
 'MIGRANT': 'Civilian',
 'CIVILIAN': 'Civilian',
 'IMMIGRANT': 'Civilian',
 'VOTER': 'Civilian',
 'SCIENTIST': 'Civilian',
 'CRIMINAL': 'Criminal',
 'GANG': 'Criminal',
 'ROBBER': 'Criminal',
 'THIEVES': 'Criminal',
 'PERPETRATOR': 'Criminal',
 'DEALER': 'Criminal',
 'MAFIA': 'Criminal',
 'PIRATE': 'Criminal',
 'BANDIT': 'Criminal',
 'THIEF': 'Criminal',
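In the output above, 'FRENCH' maps to 'Labor' and 'UNITED STATES' maps to 'Radical', even though both also appear among the top Agriculture names. This happens because a dictionary keeps only the last value it sees for a repeated key. The sketch below reproduces the effect with made-up rows and shows one possible alternative, keeping the type with the highest count. All names and counts here are hypothetical.

```python
# Made-up (name, type, count) rows; FRENCH appears under two types
rows = [
    ("FARMER", "Agriculture", 100),
    ("FRENCH", "Agriculture", 40),
    ("FRENCH", "Labor", 10),
    ("WORKER", "Labor", 90),
]

# dict(zip(...)) keeps whichever pair comes last for a repeated name
names = [name for name, _, _ in rows]
labels = [label for _, label, _ in rows]
last_wins = dict(zip(names, labels))

# Alternative: keep the type under which the name is most frequent
best = {}
for name, label, count in rows:
    if name not in best or count > best[name][1]:
        best[name] = (label, count)
highest_count = {name: label for name, (label, _) in best.items()}

print(last_wins["FRENCH"])
print(highest_count["FRENCH"])
```

Location-like names are filtered out in the next step anyway, but ordinary names shared across types are affected by the same behavior.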

#### Create a Combined List of Top/Most Frequent Actor Names

Some of the top Actor Names are locations, as mentioned previously. We can filter them out using the country and subdivision lists from the `pycountry` package.

Before we do, however, note that some of the countries listed in the raw data will not match what is in `pycountry`. `pycountry` primarily identifies countries by their official names, while GDELT often uses canonical names. We can create a dictionary mapping the canonical names to their official names.

Afterwards, we can use pycountry to filter out the locations.

In [None]:
# Create a supplementary country dictionary to match countries where official and canonical names differ
supp_country_map = {
    "RUSSIA": "Russian Federation",
    "UKRAINE": "Ukraine",
    "VENEZUELA": "Venezuela, Bolivarian Republic of"
}

We will merge the Actor1 and Actor2 lookups into a single combined list so we can reduce the workload and remove duplicates before removing the locations.

In [None]:
#  We can combine the two dicts, remove duplicates, and filter out locations

# 1) Combine the two dicts
combined_lookup = name_to_code_primary.copy()
for actor, label in name_to_code_secondary.items():
  combined_lookup.setdefault(actor, label)

# 2) Get list of country names and major subdivisions to filter out from dict
country_dict = {c.name.upper() for c in pycountry.countries}
subdiv_dict = {sub.name.upper() for sub in pycountry.subdivisions}

# We need to include our supplementary country list as pycountry uses official, not canonical names unlike GDELT
locations = country_dict | subdiv_dict | supp_country_map.keys()

# 3) Filter out locations
name_to_label_lookup = {
    actor: label
    for actor, label in combined_lookup.items()
    if actor not in locations
}

# 4) Check the change in size
print(f"Before: {len(combined_lookup)} entries")
print(f"After:  {len(name_to_label_lookup)} entries")

Before: 328 entries
After:  304 entries


According to the data, there were 24 locations removed from the combined Actor Names list.

### Categorize Rows without Actor Types


As mentioned, we will use NLP and fuzzy matching to categorize protests where the actor types are missing. We specifically will use the Spacy NLP package to perform Named Entity Recognition (NER) and pattern matching.

NER allows us to match known entities, such as countries, famous people, world leaders, and organizations, to categories like Geo-Political Entity (GPE), Location (LOC), etc. We can then map these categories to our specific Actor Types. For example, "California" would be categorized by Spacy as an LOC. If "California" is in the Actor1Name column, we know that it is the protestor, so it should be mapped to the "Civilian" actor type. If it is in Actor2Name, it should most likely map to "Government", as it is more likely that the California government is the *target* of a protest than the *protestor*. We can safely assume this, as there are more civilians protesting governments than vice versa. This is confirmed at the end of the analysis.

We can also use Spacy's pattern matching capability. By creating a dictionary that maps each actor type to keywords found in actor names, we can match actors to their categories. GDELT has more than 20 actor types, which would make this pattern dictionary quite tedious to create. For simplicity's sake, we will use 10 broader custom categories, some of which overlap with GDELT categories (Non-governmental organization, Civilian, Media, Health). A smaller dictionary is also faster to run over our millions of rows. These custom categories still capture nuance; however, we do lose some of the granularity provided by GDELT.

We are going to use Spacy's small English model for its speed and quicker load time. We could use a larger model, but it takes up more space and is slower to run.

We will set the phrase matcher to check against lowercase strings, this way our matching is not impacted by differences in capitalization.

In [None]:
# Setup spacy and phrase matcher
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

The **10 custom categories** we will use in phrase matching are:
1. Civilian
2. Government
3. Political Party
4. Non-Governmental Organization
5. Corporate/Business
6. Agriculture
7. Health
8. Criminal Justice
9. Media
10. Religious

In [None]:
# 1) Define patterns we will use in the phrase matcher

patterns = {
    "Civilian":           ["protester","demonstrator","student","worker",
                            "citizen","residents","village","employee"],
    "Government":          ["police", "republic", "kingdom","governor","regime",
                            "parliament","army","military","government",
                            "security","state","president","authorities",
                            "authority","prime minister","chancellor",
                            "congress","legislature","court","judiciary",
                            "the white house", "west bank", "minist"],
    "Political Party":     ["party","minister","candidate","politician",
                            "congressman","congresswoman"],
    "Non-Governmental Organization":      ["ngo","nonprofit","human rights","activist",
                            "charity","organization"],
    "Corporate / Business":["company","corporation","bank","industry","firm",
                            "business","companies", "employer"],
    "Agriculture":         ["farm","farmer","agriculture","landowner"],
    "Health":          ["hospital","medical","healthcare","nurse","doctor"],
    "Criminal Justice":       ["prison","incarceration","inmate","detention"],
    "Media":        ["media","press","journalist","news agency"],
    "Religious":           ["christian","muslim","hindu","jewish","buddhism","jain", "episcopal"]
}

# 1a) Add patterns to the matcher with their labels (ex. Label: Agriculture, Term: farm)
for label, terms in patterns.items():
    matcher.add(label, [nlp.make_doc(t) for t in terms])

We want to match all countries in the world. We can do so by adding them to the phrase matcher under the "government" category initially. While not all countries will match to the government type (they won't if they are the protestor), it is more likely that they will than not.

In [None]:
# 1b) Seed every ISO country name to the matcher with the "Government" label
country_docs = [nlp.make_doc(c.name) for c in pycountry.countries]
matcher.add("Government", country_docs)

We also want to match cities and other location names. We can do this using pycountry's list of subdivisions. There are spelling mistakes, and different names used in the GDELT database. To improve our chance of matching actor names to subdivisions, we can use fuzzy matching.

In [None]:

# 2) Build a lookup of all subdivision names (e.g. California, Île‑de‑France, Delhi)

# These subdivisions will not automatically be captured by Spacy NER
subdiv_names = [sub.name.lower() for sub in pycountry.subdivisions]

# 2a) We can map the subdivisions correctly to GPE/LOC by using a fuzzy matcher
def is_subdivision(actor_name, threshold=75):
    """Return True if actor_name fuzzy‑matches a subdivision."""
    if not isinstance(actor_name, str):
        return False
    match = process.extractOne(
        actor_name.lower(),
        subdiv_names,
        scorer=fuzz.WRatio,
        score_cutoff=threshold
    )
    return bool(match)
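To get a feel for what the threshold does, the sketch below uses the standard library's `difflib` as a stand-in scorer. This is a swapped-in technique for illustration only: rapidfuzz's `WRatio` combines several ratios, so its scores will generally differ. The helper `best_match` and the tiny subdivision list are made up.

```python
from difflib import SequenceMatcher


def best_match(name, choices, threshold=75):
    """Return (choice, score) for the closest choice, or None below threshold."""
    best_choice, best_score = None, 0.0
    for choice in choices:
        # Similarity on a 0-100 scale, comparable in spirit to a fuzzy ratio
        score = 100 * SequenceMatcher(None, name.lower(), choice.lower()).ratio()
        if score > best_score:
            best_choice, best_score = choice, score
    return (best_choice, best_score) if best_score >= threshold else None


subdivisions = ["delhi", "california", "corsica"]  # tiny made-up list

print(best_match("DEHLI", subdivisions))   # misspelling still clears the cutoff
print(best_match("BANANA", subdivisions))  # unrelated name is rejected
```

A lower threshold accepts more misspellings but also more false positives, so the cutoff of 75 used above is a trade-off worth spot-checking against real actor names.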



We can also fuzzy match our raw actor names with the combined list of top actor names we generated earlier.

In [None]:
# 3) Define fuzzy lookup function to match raw actor names to top actor names by category
def fuzzy_lookup(name, top_actor_lookup, threshold=80):
  """Return the actor type of the closest top actor name, or None if nothing meets the threshold."""
  choices = list(top_actor_lookup.keys())
  match = process.extractOne(name, choices, scorer=fuzz.WRatio, score_cutoff=threshold)
  return top_actor_lookup[match[0]] if match else None

We can bundle the fuzzy matching and phrase matching into a reusable function coupled with our Spacy NER.

In our categorizer function we will perform the following order of operations:

1. Match raw name to instance in our combined top actor names list
2. Match supplemental countries list (where canonical and official names differ)
3. Map sub-divisions with fuzzy matching
4. Phrase Matching with our pattern dictionary
5. Spacy NER

Any raw names that don't match the above will have an "Unknown" actor type.

Importantly, our **categorizing function is role specific**. Since we know that an actor's type may change depending on if it's the protestor or target, we need our function to take this into consideration. As mentioned previously, if Actor1Name (protestor) is "California" the correct category is "Civilian". However, if it is Actor2Name (target) it should be "Government".

In [None]:
# 4) Define categorization function

def categorize_actor(raw_name, role):
  '''
  raw_name: the Actor1Name or Actor2Name string
  role: 'primary' or 'secondary'
  returns: Category and the NER label (category_str, ner_label_str)
  '''

  if not isinstance(raw_name, str) or not raw_name.strip():
    return 'Unknown', 'Unknown'

  text = raw_name.strip()
  upper = text.upper()

  # 4a) Try to match to top actor names
  label = name_to_label_lookup.get(text)

  if not label:
    choices = list(name_to_label_lookup.keys())
    match = process.extractOne(text, choices, scorer=fuzz.WRatio, score_cutoff=80)
    if match:
      label = name_to_label_lookup[match[0]]

  # We already have the human-readable category here, so we don't need to generate a label like "GPE"
  if label:
    return label, None

  # 4b) Map all countries in the supplemental country map to Government (Russia, Venezuela, Ukraine)
  if upper in supp_country_map:
    return "Government", "GPE"

  # 4c) Map subdivisions
  if is_subdivision(text):
    # protests in a place (subdivision in Primary) map to Civilians
    # protests against a place (subdivision in Secondary) map to Government
    cat = 'Civilians' if role == 'primary' else 'Government'
    return cat, 'GPE'

  # 4d) Phrase Matcher
  # Uppercase names that start with a capital letter and title-case the rest;
  # the matcher itself is case-insensitive (attr="LOWER")
  doc = nlp(text.upper() if text[0].isupper() else text.title())
  matches = matcher(doc)
  if matches:
    cat = nlp.vocab.strings[matches[0][0]]
    ner = doc.ents[0].label_ if doc.ents else "Unknown"
    return cat, ner

  # 4e) NER fallback on properly-cased doc
  if doc.ents:
    ent = doc.ents[0].label_
    if ent in ("GPE", "LOC"):
      return "Government", ent
    if ent == "ORG":
      return "NGO / Advocacy", ent
    if ent == "NORP":
      low = text.lower()
      if any(r in low for r in patterns["Religious"]):
        return "Religious", ent
      return "Civilians", ent

  # Everything else will be unknown
  return "Unknown", (doc.ents[0].label_ if doc.ents else "Unknown")


To reduce duplicate work, we can create a list of unique actor names across protestors and targets. This list will be used to categorize each actor name for both the protestor (primary) and target (secondary) roles.

In [None]:
# 5) Get list of unique actors
all_actors = pd.concat([df['Actor1Name'], df['Actor2Name']]).dropna().unique()

In the process of categorizing the actor name, we will often generate a Spacy NER category (GPE, LOC, NORP). We can keep track of this and the Actor Type category in separate dictionaries.

In [None]:
# 6a) Create two sets of dicts for each role (primary, secondary)
name_to_cat_primary = {}
name_to_ner_primary = {}
name_to_cat_secondary = {}
name_to_ner_secondary = {}

# 6b) Get actors that didn't match in the categorizer
unmatched_actors = {}

# 6c) Categorize all actors for each role
for actor in all_actors:
  cat1, ner1 = categorize_actor(actor, role="primary")
  name_to_cat_primary[actor] = cat1
  name_to_ner_primary[actor] = ner1

  cat2, ner2 = categorize_actor(actor, role="secondary")
  name_to_cat_secondary[actor] = cat2
  name_to_ner_secondary[actor] = ner2

  if cat1 == "Unknown" or cat2 == "Unknown":
    unmatched_actors[actor] = (ner1, ner2)


Once we have categorized each actor name by role, we can map this back to our data set.

In [None]:
# 7) Map back to data frame
df['Actor1_NER'] = df['Actor1Name'].map(name_to_ner_primary)
df['PrimaryActorType'] = df['Actor1Name'].map(name_to_cat_primary)
df['Actor2_NER'] = df['Actor2Name'].map(name_to_ner_secondary)
df['SecondaryActorType'] = df['Actor2Name'].map(name_to_cat_secondary)
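Note that the assignment above writes the categorizer's result into every row, including rows that already had a GDELT actor type. If the intent were only to fill the gaps, one option is to combine `fillna` with `map`, as in the sketch below. The toy frame and the `inferred` dictionary are hypothetical stand-ins for `df` and `name_to_cat_primary`.

```python
import pandas as pd

# Toy frame: one row is missing its GDELT actor type
frame = pd.DataFrame({
    "Actor1Name": ["WORKER", "TOKYO", "PRIEST"],
    "PrimaryActorType": ["Labor", None, "Religion"],  # hypothetical GDELT labels
})

# Hypothetical categorizer output for every name
inferred = {"WORKER": "Civilian", "TOKYO": "Civilians", "PRIEST": "Government"}

# Keep existing labels and use the inferred category only where none exists
frame["PrimaryActorType"] = frame["PrimaryActorType"].fillna(
    frame["Actor1Name"].map(inferred)
)
print(frame)
```

Which behavior is preferable depends on whether the custom categories or GDELT's own labels should take priority in the later analysis.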

In [None]:
# Check changes
df[['Actor2Name','SecondaryActorType','ProtestMotivation']].sample(10)

Unnamed: 0,Actor2Name,SecondaryActorType,ProtestMotivation
36383,PRIEST,Government,Policy Change
157284,IRANIAN,Human Rights,Policy Change
781247,TOKYO,Government,Policy Change
1091020,Unknown Actor 2,Unknown,Group Rights
3695998,UNITED STATES,Government,Policy Change
2516701,Unknown Actor 2,Unknown,Policy Change
689222,CORSICA,Government,Policy Change
1274295,IMMIGRANT,Civilian,Anti-Government
2564837,CHINESE,Government,Policy Change
331729,Unknown Actor 2,Unknown,Anti-Discrimination


We can check whether the subdivisions and countries in our supplemental list have matched properly.

In [None]:
# Check if subdivisions like Delhi matched properly
df.loc[df['Actor2Name'] == 'DELHI'].head()

Unnamed: 0,Date,Actor1Name,NumMentions,Actor2Name,EventCode,GoldsteinScale,AvgTone,ActionGeo_CountryCode,CountryName,ActionGeo_Lat,ActionGeo_Long,COVID_Era,ProtestMotivation,PrimaryActorType,SecondaryActorType,Actor1_NER,Actor2_NER
1428,2019-01-13,EMPLOYEE,1,DELHI,141,-6.5,-5.38,AF,Afghanistan,33.0,66.0,Pre-COVID,Policy Change,Labor,Government,,GPE
8366,2020-05-21,KATHMANDU,2,DELHI,141,-6.5,-2.05,NP,Nepal,27.7167,85.3167,COVID-Era,Policy Change,Civilians,Government,GPE,GPE
11447,2019-12-19,Unknown Actor 1,2,DELHI,140,-6.5,-6.66,CI,Chile,-33.45,-70.6667,Pre-COVID,General Protest,Unknown,Government,CARDINAL,GPE
13614,2019-03-05,AFGHAN,2,DELHI,145,-7.5,-5.31,AF,Afghanistan,33.0,66.0,Pre-COVID,Anti-Discrimination,Refugees,Government,,GPE
16389,2017-01-19,THAILAND,4,DELHI,141,-6.5,4.4,MY,Malaysia,2.5,112.5,Pre-COVID,Policy Change,Civilians,Government,GPE,GPE


In [None]:
#  Check if Russia matched properly
df.loc[df['Actor2Name'] == 'RUSSIA'].head()

Unnamed: 0,Date,Actor1Name,NumMentions,Actor2Name,EventCode,GoldsteinScale,AvgTone,ActionGeo_CountryCode,CountryName,ActionGeo_Lat,ActionGeo_Long,COVID_Era,ProtestMotivation,PrimaryActorType,SecondaryActorType,Actor1_NER,Actor2_NER
493,2021-01-31,Unknown Actor 1,8,RUSSIA,140,-6.5,-7.57,LH,Lithuania,56.0,24.0,COVID-Era,General Protest,Unknown,State Intelligence,CARDINAL,
1278,2017-08-22,RESIDENTS,1,RUSSIA,141,-6.5,-1.42,LH,Lithuania,56.0,24.0,Pre-COVID,Policy Change,Civilian,State Intelligence,,
1547,2019-01-16,Unknown Actor 1,6,RUSSIA,141,-6.5,-3.01,LO,Slovakia,48.666667,19.5,Pre-COVID,Policy Change,Unknown,State Intelligence,CARDINAL,
1594,2019-09-30,LITHUANIA,2,RUSSIA,141,-6.5,0.72,LH,Lithuania,56.0,24.0,Pre-COVID,Policy Change,Civilians,State Intelligence,GPE,
1604,2018-07-17,MOBSTER,4,RUSSIA,145,-7.5,-3.52,PM,Panama,9.0,-80.0,Pre-COVID,Anti-Discrimination,Civilians,State Intelligence,GPE,


Russia is incorrectly matched as "State Intelligence", when "Government" is the more likely category. Its NER tag is also empty, unlike the other countries checked here, which are tagged GPE.
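
One possible way to handle a mismatch like this is a small table of manual overrides applied to the lookup after the automatic categorizer has run. This is only a sketch: `COUNTRY_OVERRIDES` and `apply_overrides` are hypothetical names that do not exist in the notebook, and `toy_secondary` stands in for `name_to_cat_secondary`.

```python
# Hypothetical manual corrections, keyed by actor name
COUNTRY_OVERRIDES = {"RUSSIA": "Government"}


def apply_overrides(name_to_cat, overrides):
    """Return a copy of the lookup with manual corrections applied."""
    corrected = dict(name_to_cat)
    for name, category in overrides.items():
        # Only touch names that the categorizer actually saw
        if name in corrected:
            corrected[name] = category
    return corrected


# Toy stand-in for name_to_cat_secondary
toy_secondary = {"RUSSIA": "State Intelligence", "UKRAINE": "Government"}
fixed = apply_overrides(toy_secondary, COUNTRY_OVERRIDES)

print(fixed)
```

Working on a copy leaves the categorizer's original output intact, so the two versions can still be compared.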

In [None]:
#  Check if Venezuela matched
df.loc[df['Actor2Name'] == 'VENEZUELA'].head()

Unnamed: 0,Date,Actor1Name,NumMentions,Actor2Name,EventCode,GoldsteinScale,AvgTone,ActionGeo_CountryCode,CountryName,ActionGeo_Lat,ActionGeo_Long,COVID_Era,ProtestMotivation,PrimaryActorType,SecondaryActorType,Actor1_NER,Actor2_NER
45,2017-03-07,PERU,2,VENEZUELA,141,-6.5,-4.14,PE,Peru,-10.0,-76.0,Pre-COVID,Policy Change,Civilians,Government,GPE,GPE
378,2017-05-01,PROTESTER,2,VENEZUELA,140,-6.5,-4.32,PA,Paraguay,-23.0,-58.0,Pre-COVID,General Protest,Political Opposition,Government,,GPE
1950,2017-07-28,MILITARY,2,VENEZUELA,141,-6.5,-3.36,BL,Bolivia,-16.05,-68.6833,Pre-COVID,Policy Change,Military,Government,,GPE
2237,2017-05-05,CARRIER,2,VENEZUELA,141,-6.5,-5.68,FI,Finland,60.1756,24.9342,Pre-COVID,Policy Change,Military,Government,,GPE
4206,2017-04-20,Unknown Actor 1,2,VENEZUELA,141,-6.5,-4.25,CI,Chile,-30.0,-71.0,Pre-COVID,Policy Change,Unknown,Government,CARDINAL,GPE


In [None]:
#  Check if Ukraine matched
df.loc[df['Actor2Name'] == 'UKRAINE'].head()

Unnamed: 0,Date,Actor1Name,NumMentions,Actor2Name,EventCode,GoldsteinScale,AvgTone,ActionGeo_CountryCode,CountryName,ActionGeo_Lat,ActionGeo_Long,COVID_Era,ProtestMotivation,PrimaryActorType,SecondaryActorType,Actor1_NER,Actor2_NER
1584,2018-08-02,HUNGARIAN,4,UKRAINE,141,-6.5,-7.93,HU,Hungary,47.0,20.0,Pre-COVID,Policy Change,Unknown,Government,Unknown,GPE
2227,2017-06-05,ROMANIA,10,UKRAINE,141,-6.5,4.88,MD,Moldova,47.0,29.0,Pre-COVID,Policy Change,Civilians,Government,GPE,GPE
3420,2018-05-13,ARMENIA,10,UKRAINE,141,-6.5,-1.91,AM,Armenia,40.0,45.0,Pre-COVID,Policy Change,Civilians,Government,GPE,GPE
3773,2018-07-04,HUNGARY,2,UKRAINE,141,-6.5,-9.0,HU,Hungary,47.0,20.0,Pre-COVID,Policy Change,Civilians,Government,GPE,GPE
4405,2020-02-20,Unknown Actor 1,6,UKRAINE,140,-6.5,-2.06,RB,,44.8186,20.4681,Pre-COVID,General Protest,Unknown,Government,CARDINAL,GPE


Now that we have matched our actors, we can evaluate our performance. We started with roughly 50% of actors in the raw data lacking actor type information.

In [None]:
# Compare all_actors to unmatched actors
len_unmatched = len(unmatched_actors)
total_len = len(all_actors)
print(f"Total actors: {total_len}")
print(f"Unmatched actors: {len_unmatched}")
print(f"Percent unmatched actors: {100*len_unmatched/total_len:.2f}")

Total actors: 6714
Unmatched actors: 551
Percent unmatched actors: 8.21


We have significantly decreased the number of unmatched actors: only about 8% of unique actor names now have an "Unknown" type in at least one role. We can also check our effectiveness against the entire dataset, where this 8% of names may account for a much larger share of events.

In [None]:
# Check the count of unknown actors in comparison to correctly categorized
uncategorized_actor_1 = df[df['PrimaryActorType'] == 'Unknown']['Actor1Name'].count()
uncategorized_actor_2 = df[df['SecondaryActorType'] == 'Unknown']['Actor2Name'].count()

print(f"Unknown actors in Actor1: {uncategorized_actor_1}")
print(f"Unknown actors in Actor2: {uncategorized_actor_2}")

# Each row is one event, so this is the total number of events (one actor slot per role)
total_actors = len(df)
print(f"Total actors: {total_actors}")

percent_uknown_actors_1 = (uncategorized_actor_1) / total_actors * 100
percent_uknown_actors_2 = (uncategorized_actor_2) / total_actors * 100

print(f"Percentage of unknown primary actors: {percent_uknown_actors_1:.2f}%")
print(f"Percentage of unknown secondary actors: {percent_uknown_actors_2:.2f}%")

Unknown actors in Actor1: 430007
Unknown actors in Actor2: 1371650
Total actors: 3743211
Percentage of unknown primary actors: 11.49%
Percentage of unknown secondary actors: 36.64%


At the event level the improvement is smaller: about 11% of events still have an unknown primary actor type, and about 37% have an unknown secondary actor type. Although only about 8% of unique actor names are unmatched, unknown secondary types (protest targets) cover roughly 37% of events. That is a magnification of about 4.5x. Part of this comes from the placeholder name "Unknown Actor 2", which appears in the sample above with an "Unknown" type. A small subset of unique names is therefore attached to a large share of protest events.

Because secondary actors (the targets of the protests) can be individual people or geographic areas smaller than countries, it is harder to increase coverage significantly.
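
The magnification can be checked with simple arithmetic on the figures printed above. The sketch below only restates those reported numbers, and the variable names are introduced here for illustration. Note that the name-level figure counts actors that are unknown in either role, so the comparison is approximate.

```python
# Figures copied from the outputs above
unmatched_names = 551
total_names = 6714
unknown_secondary_events = 1_371_650
total_events = 3_743_211

# Share of unique names that are unmatched vs. share of events with an unknown target type
name_share = unmatched_names / total_names
event_share = unknown_secondary_events / total_events

# How much larger the event-level share is than the name-level share
magnification = event_share / name_share

print(f"Unmatched share of unique names: {name_share:.2%}")
print(f"Unknown secondary share of events: {event_share:.2%}")
print(f"Magnification: {magnification:.1f}x")
```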

## Final Clean-up

We have processed the data as much as we could, and now it is nearly in workable order. We can make our dashboard creation easier by improving the column names for readability.

In [None]:
# Make columns more reader-friendly
df = df.rename(columns={
    'SQLDATE':            'Date',
    'Actor1Name':         'Primary Actor',
    'Actor2Name':         'Secondary Actor',
    'EventRootCode':      'Root Code',
    'EventCode':          'Event Code',
    'GoldsteinScale':     'Goldstein Scale',
    'AvgTone':            'Average Tone',
    'ActionGeo_CountryCode':'Country Code',
    'ActionGeo_Lat':      'Latitude',
    'ActionGeo_Long':     'Longitude',
    'PrimaryActorType':   'Primary Actor Type',
    'SecondaryActorType': 'Secondary Actor Type',
    'ProtestMotivation':  'Motivation',
    'COVID_Era':          'Era'
})


It can be easier to work with parts of dates in Tableau, so we can split the date up into several columns.

In [None]:
# Separate date into parts
df['Year']  = df['Date'].dt.year
df['Month'] = df['Date'].dt.month_name().str.slice(stop=3)
df['MonthNum'] = df['Date'].dt.month
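
The sketch below shows what these three columns look like on a toy frame, and one plausible reason for keeping `MonthNum` alongside the abbreviated name: month abbreviations sort alphabetically, not by calendar order. The `toy` frame is made up, and it assumes `Date` is already a datetime column, as the `.dt` accessor requires.

```python
import pandas as pd

# Toy dates; the .dt accessor only works on datetime columns
toy = pd.DataFrame({"Date": pd.to_datetime(["2021-09-28", "2017-04-20", "2019-01-13"])})

toy["Year"] = toy["Date"].dt.year
toy["Month"] = toy["Date"].dt.month_name().str.slice(stop=3)
toy["MonthNum"] = toy["Date"].dt.month

# Alphabetical order puts Apr before Jan; sorting on MonthNum restores calendar order
alphabetical = sorted(toy["Month"])
calendar = toy.sort_values("MonthNum")["Month"].tolist()

print(toy)
print("Alphabetical:", alphabetical)
print("Calendar:", calendar)
```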

We can do a final latitude/longitude check and remove any coordinates that don't make sense.

In [None]:
# Filter out coordinates that don't make sense (outside of logical bounds)
# Filter out placeholder coordinates (0.0, 0.0) where location is unknown
df = df[
  df['Latitude'].between(-90, 90) &
  df['Longitude'].between(-180, 180) &
  ~((df['Latitude'] == 0) & (df['Longitude'] == 0))
]
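
To make the effect of this filter concrete, here is a sketch on a toy frame with made-up coordinates. It also illustrates a side effect worth knowing: `Series.between` evaluates to `False` for missing values, so rows with a `NaN` coordinate are dropped as well.

```python
import numpy as np
import pandas as pd

# Toy coordinates: valid, (0, 0) placeholder, out of range, missing, valid point on the equator
toy = pd.DataFrame({
    "Latitude":  [33.0, 0.0, 95.0, np.nan, 0.0],
    "Longitude": [66.0, 0.0, 10.0, 20.0, 24.0],
})

# Same conditions as the filter above, split into named masks
in_bounds = toy["Latitude"].between(-90, 90) & toy["Longitude"].between(-180, 180)
placeholder = (toy["Latitude"] == 0) & (toy["Longitude"] == 0)

kept = toy[in_bounds & ~placeholder]
print(kept)
```

Only the first and last rows survive. The last row shows that a latitude of 0 on its own is fine. A row is treated as a placeholder only when both coordinates are 0.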


We can get an initial sense of our data and COVID's impact by counting events by primary actor type, and by year and era.

In [None]:
# Check primary actor type and covid era
print(df.groupby('Primary Actor Type').size().sort_values(ascending=False).head())
print(df.groupby(['Year','Era']).size())


Primary Actor Type
Civilians               1266805
Unknown                  430007
Government               332579
Political Opposition     292791
Police forces            138200
dtype: int64
Year  Era      
2017  Pre-COVID    902006
2018  Pre-COVID    815983
2019  Pre-COVID    784480
2020  COVID-Era    578025
      Pre-COVID    117977
2021  COVID-Era    544740
dtype: int64


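Based on this rough analysis, civilians are the most common primary actor type. The number of recorded protest events fell each year from 2017 to 2021. "Unknown" is the second most common primary actor type, at about 11% of events (the 36% figure applies to secondary actors, not primary ones).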
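
The yearly trend can be made explicit by summing the two 2020 eras and computing year-over-year changes. The sketch below uses only the counts printed above, and the variable names are introduced here for illustration.

```python
# Event counts copied from the groupby output above
counts = {
    (2017, "Pre-COVID"): 902006,
    (2018, "Pre-COVID"): 815983,
    (2019, "Pre-COVID"): 784480,
    (2020, "COVID-Era"): 578025,
    (2020, "Pre-COVID"): 117977,
    (2021, "COVID-Era"): 544740,
}

# Collapse the era split so that each year has a single total
yearly = {}
for (year, _era), n in counts.items():
    yearly[year] = yearly.get(year, 0) + n

# Relative change from each year to the next
years = sorted(yearly)
changes = {
    cur: (yearly[cur] - yearly[prev]) / yearly[prev]
    for prev, cur in zip(years, years[1:])
}

for year in years:
    note = f" ({changes[year]:+.1%} vs. previous year)" if year in changes else ""
    print(f"{year}: {yearly[year]:,}{note}")
```

Every year-over-year change is negative, which is what supports the claim that recorded events declined across the whole period.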

In [None]:
df.head()

Unnamed: 0,Date,Primary Actor,NumMentions,Secondary Actor,Event Code,Goldstein Scale,Average Tone,Country Code,CountryName,Latitude,Longitude,Era,Motivation,Primary Actor Type,Secondary Actor Type,Actor1_NER,Actor2_NER,Year,Month,MonthNum
0,2021-09-28,WORKER,10,NURSE,141,-6.5,-1.63,NZ,New Zealand,-42.0,174.0,COVID-Era,Policy Change,Labor,Health,,,2021,Sep,9
1,2021-10-29,PROTESTER,2,POLICE,140,-6.5,-3.39,TH,Thailand,13.75,100.517,COVID-Era,General Protest,Political Opposition,Police forces,,,2021,Oct,10
2,2021-10-26,POLICE,6,Unknown Actor 2,141,-6.5,-2.37,PO,Portugal,39.5,-8.0,COVID-Era,Policy Change,Police forces,Unknown,,CARDINAL,2021,Oct,10
3,2017-03-28,VILLAGE,2,Unknown Actor 2,141,-6.5,-6.31,RI,Serbia,43.1322,20.5647,Pre-COVID,Policy Change,Civilian,Unknown,,CARDINAL,2017,Mar,3
4,2017-08-17,BOLIVIAN,2,Unknown Actor 2,141,-6.5,-5.41,BL,Bolivia,-17.0,-65.0,Pre-COVID,Policy Change,Insurgents,Unknown,,CARDINAL,2017,Aug,8


We can finish up by dropping columns we won't use in Tableau: the NER categories, the event codes (already captured in the motivation column), and the country codes (we already have the country names).

In [None]:
# Drop Columns we won't use in Tableau
df = df.drop(columns=['Actor1_NER', 'Actor2_NER','Event Code', 'Country Code'])

In [None]:
df.head()

Unnamed: 0,Date,Primary Actor,NumMentions,Secondary Actor,Goldstein Scale,Average Tone,CountryName,Latitude,Longitude,Era,Motivation,Primary Actor Type,Secondary Actor Type,Year,Month,MonthNum
0,2021-09-28,WORKER,10,NURSE,-6.5,-1.63,New Zealand,-42.0,174.0,COVID-Era,Policy Change,Labor,Health,2021,Sep,9
1,2021-10-29,PROTESTER,2,POLICE,-6.5,-3.39,Thailand,13.75,100.517,COVID-Era,General Protest,Political Opposition,Police forces,2021,Oct,10
2,2021-10-26,POLICE,6,Unknown Actor 2,-6.5,-2.37,Portugal,39.5,-8.0,COVID-Era,Policy Change,Police forces,Unknown,2021,Oct,10
3,2017-03-28,VILLAGE,2,Unknown Actor 2,-6.5,-6.31,Serbia,43.1322,20.5647,Pre-COVID,Policy Change,Civilian,Unknown,2017,Mar,3
4,2017-08-17,BOLIVIAN,2,Unknown Actor 2,-6.5,-5.41,Bolivia,-17.0,-65.0,Pre-COVID,Policy Change,Insurgents,Unknown,2017,Aug,8


Let's see how many rows of data we are left with.

In [None]:
print("Final events:", df.shape[0])

Final events: 3743211


We have lost about 600,000 events, or roughly 14%, in our processing pipeline.

## Save Final Dataset

We are now done processing and can save our dataset to BigQuery and to Google Drive. We need the Google Drive copy because it is the only accessible connector in Tableau Public.

### Save Data Set to BigQuery

In [None]:
# Save final data frame as BigQuery dataset
from google.colab import auth
auth.authenticate_user()


In [None]:
from google.cloud import bigquery

# Initialize client
project_id = 'gdelt-protests-2019-2022'
client     = bigquery.Client(project=project_id)
print("Using project:", client.project)

# Create new dataset
dataset_id = f"{project_id}.gdelt_analysis"
dataset = bigquery.Dataset(dataset_id)
dataset.location = "US"
client.create_dataset(dataset, exists_ok=True)
print("Dataset ready:", dataset_id)


Using project: gdelt-protests-2019-2022
Dataset ready: gdelt-protests-2019-2022.gdelt_analysis


In [None]:
# Load dataframe to BigQuery
table_id = f"{dataset_id}.protests_2017_2022"
job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE", autodetect=True)
load_job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
load_job.result()

print("Loaded rows:", client.get_table(table_id).num_rows)


Loaded rows: 3743211


### Save Data Set to Google Drive

In [None]:
# Save final dataset to Google Drive as a CSV
from google.colab import drive

drive.mount('/content/drive')

output_path = '/content/drive/My Drive/gdelt_protests_2017_2021/gdelt_protests_final_2017_2021.csv'
df.to_csv(output_path, index=False)
print(" Saved to", output_path)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
 Saved to /content/drive/My Drive/gdelt_protests_2017_2021/gdelt_protests_final_2017_2021.csv
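
As an optional sanity check, the CSV can be read back to confirm that the row count and columns survived the round trip. The sketch below does this on a toy frame written to a temporary directory, so it does not touch Google Drive. The file name and the `toy` frame are made up for the example. Dates are stored as text in a CSV, so `parse_dates` is needed to restore the datetime type.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a few of the final column names (values are illustrative)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-09-28", "2017-03-28"]),
    "Primary Actor": ["WORKER", "VILLAGE"],
    "Latitude": [-42.0, 43.1322],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_protests.csv")  # hypothetical file name
    toy.to_csv(path, index=False)

    # Without parse_dates the Date column would come back as plain strings
    reloaded = pd.read_csv(path, parse_dates=["Date"])

print("Rows written:", len(toy), "| rows read back:", len(reloaded))
print(reloaded.dtypes)
```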
