# MAPPING AFRICA'S CONFLICT RELATIONSHIPS

**An exploration of actor-to-actor conflict dynamics across the African continent (1997–2014) using ACLED Dyadic Data**

## Business Understanding  
In Africa’s conflict zones, it’s not just about what happened- it’s about **who keeps coming back to fight whom**, and **where things are heating up**.  
That’s the part most datasets skip. But the ACLED Dyadic data? That’s where the real signal lives.

We’re not here to count bullets. We’re here to **map relationships**, **track escalations**, and surface the patterns that matter- before things spiral.  
This project leans into that gap- turning actor-to-actor conflict data into something **actionable** for NGOs, peacebuilders, analysts, and anyone serious about understanding violence from the inside out.

## Project Overview  
We’re breaking down over a decade of conflict- **who fought whom**, **where**, and **how it played out**- to answer the questions that lead to better decisions:

- What dyads keep reappearing?
- Who’s triggering the worst violence?
- Which areas are consistently volatile?
- How do these relationships evolve?

From mapping conflict webs to scoring high-risk actor pairs, the goal is simple:  
**Give the right people the right lens before the next crisis hits.**

## Deliverables

- **Cleaned + enriched dataset** (dyad ID, conflict region, actor normalization, year breakdown)
- **Conflict dyad explorer**- who fights whom, how often, and with what impact
- **Escalation curves** for the most volatile dyads
- **Hotspot heatmaps** and regional breakdowns
- **Network graphs** showing actor relationships and central nodes
- **Dyadic Risk Score**- composite risk index based on frequency, intensity, and recency
- **Notebooks + visuals + ready-to-use summaries** for stakeholders

## Success Metrics

- Top 10 riskiest dyads identified and profiled  
- Escalation trends clearly visualized for key actor pairs  
- Accurate hotspot detection by region and year  
- Reusable code and clean outputs for policy teams or analysts  
- Project structured for future ACLED updates or country- focused expansions  

> This isn’t just a dataset. It’s a lens. One that tells us not just what happened- **but who’s likely to make it happen again.**
> Powered by data. Grounded in people.

# 4️ Exploratory Data Analysis (EDA)
## A. Univariate
- Most common Actor1 / Actor2
- Top countries, event types, interactions

## B. Bivariate
- Fatalities by dyad
- Dyad frequency vs. fatalities
- Temporal trend per country / actor

## C. Geospatial
- Choropleth: conflicts by country
- Heatmap: event locations
- Regional focus maps

# 5️ Network Analysis
- Construct directed graph of actors
- Degree, centrality, clustering
- Visualize with NetworkX or Plotly
- Highlight conflict communities

# 6️ Modeling & Scoring
## A. Fatality Classifier (LogReg / Tree)
- Predict high-fatality dyadic events

## B. Dyad Risk Scoring
- Create a composite score per dyad
- Rank and visualize

## C. Temporal Prediction (Optional)
- Predict future dyadic recurrence or escalation

# 7️ Insights & Dashboards
- Top 10 riskiest dyads
- Timeline of escalation by actor
- Region-wise conflict summaries
- Downloadable actor profiles

# 8️ Deliverables
- Cleaned dataset
- Visuals + charts
- Notebook (.ipynb)
- PDF/HTML report
- GitHub repo with README

# 9 Conclusion & Next Steps
- Insights recap
- Policy relevance
- Future data integrations (refugees, elections, natural resources)

## INITIAL DATA EXPLORATION (IDE)

Every dataset tells a story- but before I dive into any narratives, I'll flip through the table of contents. This phase is about getting comfortable with the data: seeing what’s there, what’s missing, and what might surprise me later if I don’t pay attention now.

#### What's happening:
- Importing key libraries like 'pandas', 'numpy', 'seaborn', 'matplotlib', and 'plotly'- the usual suspects for slicing, dicing and visualizing data.
- Previewing the first few rows to get a feel for the dataset’s structure, naming conventions, and early red flags (no one likes nasty surprises 30 cells in).
- Checking the shape of the data because whether it's 500 rows or 50,000 completely changes the game.
- Get metadata
- Get basic statistics information of both numerica and categorical columns

This might not be the flashiest part of the workflow, but it’s where trust is built- between me and the dataset. And as I’ve learned from previous projects, a few extra minutes spent here can save hours of confusion down the road.

Exploration done right is part instinct, part structure- this is BOTH!

In [224]:
# Mathematical computation and data manipulation libraries
import numpy as np
import pandas as pd

# Data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# 
import re

# Modeling and ML libraries
from sklearn.preprocessing import LabelEncoder

# Load the data
conflict_df = pd.read_excel('ACLED Dyadic Relationships.xlsx')

# Preview first 5
conflict_df.head()


Unknown extension is not supported and will be removed



Unnamed: 0,GWNO,EVENT_ID_CNTY,EVENT_ID_NO_CNTY,EVENT_DATE,YEAR,TIME_PRECISION,EVENT_TYPE,ACTOR1,ALLY_ACTOR_1,INTER1,...,ADMIN1,ADMIN2,ADMIN3,LOCATION,LATITUDE,LONGITUDE,GEO_PRECIS,SOURCE,NOTES,FATALITIES
0,615,1ALG,1,1997-01-02,1997,1,Violence against civilians,GIA: Armed Islamic Group,,2,...,Blida,Blida,,Blida,36.4686,2.8289,1,www.algeria-watch.org,4 January: 16 citizens were murdered in the vi...,16.0
1,615,2ALG,2,1997-01-03,1997,1,Violence against civilians,GIA: Armed Islamic Group,,2,...,Tipaza,Douaouda,,Douaouda,36.6725,2.7894,1,www.algeria-watch.org,5 January: Massacre of 18 citizens in the Oliv...,18.0
2,615,3ALG,3,1997-01-04,1997,1,Violence against civilians,GIA: Armed Islamic Group,,2,...,Tipaza,Hadjout,,Hadjout,36.5139,2.4178,1,www.algeria-watch.org,6 January: 23 citizens were horribly mutilated...,23.0
3,615,4ALG,4,1997-01-05,1997,1,Remote violence,GIA: Armed Islamic Group,,2,...,Alger,Bouzareah,,Algiers,36.766,3.05,1,www.algeria-watch.org,7 January: Explosion of a bomb in the Didouche...,20.0
4,615,5ALG,5,1997-01-09,1997,1,Violence against civilians,GIA: Armed Islamic Group,,2,...,Alger,Ouled Chebel,,Ouled Chebel,36.5994,2.9944,1,www.algeria-watch.org,11 January: 5 citizens massacred in Ouled Cheb...,5.0


In [225]:
# Check how many rows and columns I am working with
print(f'The dataset has {conflict_df.shape[0]} rows and {conflict_df.shape[1]} columns')

# Check column names to inform on standardisation needs
print('\nColumn Names:\n', conflict_df.columns)

The dataset has 99548 rows and 25 columns

Column Names:
 Index(['GWNO', 'EVENT_ID_CNTY', 'EVENT_ID_NO_CNTY', 'EVENT_DATE', 'YEAR',
       'TIME_PRECISION', 'EVENT_TYPE', 'ACTOR1', 'ALLY_ACTOR_1', 'INTER1',
       'ACTOR2', 'ALLY_ACTOR_2', 'INTER2', 'INTERACTION', 'COUNTRY', 'ADMIN1',
       'ADMIN2', 'ADMIN3', 'LOCATION', 'LATITUDE', 'LONGITUDE', 'GEO_PRECIS',
       'SOURCE', 'NOTES', 'FATALITIES'],
      dtype='object')


In [226]:
# Standardise column names
conflict_df.columns = (conflict_df.columns.str.strip().str.lower())

# Preview changes
conflict_df.sample(3)

Unnamed: 0,gwno,event_id_cnty,event_id_no_cnty,event_date,year,time_precision,event_type,actor1,ally_actor_1,inter1,...,admin1,admin2,admin3,location,latitude,longitude,geo_precis,source,notes,fatalities
74487,560,365SAF,74488,2003-02-15,2003,1,Riots/Protests,Protesters (South Africa),,6,...,KwaZulu-Natal,Durban,,Durban,-29.8579,31.0292,1,Associated Press Newswires,Large groups of anti-war demonstrators gathere...,0.0
51942,475,6051NIG,51939,2014-07-07,2014,1,Riots/Protests,Protesters (Nigeria),,6,...,Adamawa,Hong,,Mubi,10.26761,13.26436,1,This Day (Lagos),Mubi traders protest the palnned distruction o...,0.0
89271,500,70UGA,89273,1997-09-04,1997,1,Violence against civilians,ADF-NALU: Allied Democratic Forces-National Ar...,,2,...,Western,Kabarole,,Fort Portal,0.693889,30.266389,3,BBC Monitoring Service: Africa (5/9/97),"ADF looting - 6 people killed, several injured...",6.0


In [227]:
# Get metadata
conflict_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99548 entries, 0 to 99547
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   gwno              99548 non-null  int64         
 1   event_id_cnty     99548 non-null  object        
 2   event_id_no_cnty  99548 non-null  int64         
 3   event_date        99548 non-null  datetime64[ns]
 4   year              99548 non-null  int64         
 5   time_precision    99548 non-null  int64         
 6   event_type        99548 non-null  object        
 7   actor1            99548 non-null  object        
 8   ally_actor_1      14384 non-null  object        
 9   inter1            99548 non-null  int64         
 10  actor2            77440 non-null  object        
 11  ally_actor_2      8594 non-null   object        
 12  inter2            99548 non-null  int64         
 13  interaction       99548 non-null  int64         
 14  country           9954

In [228]:
# Get basic statistical info of numerical variables
conflict_df.describe()

Unnamed: 0,gwno,event_id_no_cnty,event_date,year,time_precision,inter1,inter2,interaction,latitude,longitude,geo_precis,fatalities
count,99548.0,99548.0,99548,99548.0,99548.0,99548.0,99548.0,99548.0,99548.0,99548.0,99548.0,95070.0
mean,531.333096,49774.5,2008-06-09 20:01:05.383533312,2007.951199,1.172259,3.414825,3.157864,30.286746,4.711653,23.942672,1.275535,6.208815
min,230.0,1.0,1997-01-01 00:00:00,1997.0,1.0,1.0,0.0,10.0,-34.71011,-17.47389,1.0,0.0
25%,490.0,24887.75,2003-02-04 00:00:00,2003.0,1.0,2.0,1.0,13.0,-1.466667,13.20841,1.0,0.0
50%,520.0,49774.5,2010-07-26 00:00:00,2010.0,1.0,3.0,2.0,27.0,4.31887,29.2833,1.0,0.0
75%,560.0,74661.25,2013-06-23 00:00:00,2013.0,1.0,5.0,7.0,38.0,11.01667,34.0,1.0,1.0
max,651.0,99548.0,2014-12-31 00:00:00,2014.0,3.0,8.0,8.0,88.0,37.274423,51.2668,3.0,25000.0
std,61.163793,28737.176636,,5.678059,0.500496,2.121612,2.814157,17.573653,15.365206,16.858715,0.544663,100.121537


In [229]:
# Get basic statistical info of categorical variables
conflict_df.describe(include = 'O').T

Unnamed: 0,count,unique,top,freq
event_id_cnty,99548,99548,1ALG,1
event_type,99548,9,Battle-No change of territory,30131
actor1,99548,2627,Unidentified Armed Group (Somalia),4301
ally_actor_1,14384,2060,Muslim Brotherhood,1344
actor2,77440,2245,Civilians (Somalia),4165
ally_actor_2,8594,1488,MDC: Movement for Democratic Change,904
country,99548,50,Somalia,15150
admin1,99548,651,Banaadir,4752
admin2,96169,3522,Mogadisho,4752
admin3,48168,2854,Harare City Council,1686


In [230]:
# Check for duplicates
print("Duplicates:", conflict_df.duplicated().sum())

# Check for nulls and get their percentage to advice on best imputing or dropping criteria
null_counts = conflict_df.isna().sum()
null_percent = (null_counts / len(conflict_df)) * 100
null_summary = pd.DataFrame({'Null Count': null_counts, 'Null %': null_percent.round(2)})

print("\nNull Values Summary:\n", null_summary)

Duplicates: 0

Null Values Summary:
                   Null Count  Null %
gwno                       0    0.00
event_id_cnty              0    0.00
event_id_no_cnty           0    0.00
event_date                 0    0.00
year                       0    0.00
time_precision             0    0.00
event_type                 0    0.00
actor1                     0    0.00
ally_actor_1           85164   85.55
inter1                     0    0.00
actor2                 22108   22.21
ally_actor_2           90954   91.37
inter2                     0    0.00
interaction                0    0.00
country                    0    0.00
admin1                     0    0.00
admin2                  3379    3.39
admin3                 51380   51.61
location                   0    0.00
latitude                   0    0.00
longitude                  0    0.00
geo_precis                 0    0.00
source                   187    0.19
notes                  10929   10.98
fatalities              4478    4.50


# DATA UNDERSTANDING

Here's what we’re working with: a robust dataset of **99,548 conflict event records** spread across **25 columns**- a goldmine of information with just enough chaos to make it interesting.

At first glance, it’s clear the data isn't just big- it's rich. We're talking detailed temporal, geographic, and actor-based breakdowns for every recorded incident. A few key highlights:

- **Temporal coverage:** Events range from **1997 to 2014**, with timestamp precision reaching down to the day level (event_date) and granularity flagged via time_precision.
- **Spatial detail:** Latitude and longitude have no nulls, and we’ve got hierarchical administrative geographies (admin1, admin2, admin3)- though admin3 is a bit flaky with over **51K nulls**.
- **Actors & Interactions:** actor1 and actor2 describe the primary participants in each event. Their allies? Less reliable. ally_actor_1 and ally_actor_2 are sparsely populated- **missing in over 85%** and **91%** of records respectively. Still, when they show up, they say a lot (*Muslim Brotherhood*, *MDC*, etc).
- **Event context:** event_type spans **9 unique categories**, dominated by conflict like *"Battle – No change of territory"*. interaction codes encode the dance between actors- civilian vs. armed group, military vs. rebel, etc.
- **Location hot zones:** *Mogadishu*, *Harare*, and *Banaadir* show up a lot- unsurprising, but still worth confirming with visuals.
- **Fatalities:** Always the grim but vital metric. Recorded for most events (**95% coverage**), with counts ranging from **0 to 25,000**- yep, that’s not a typo. Definitely a candidate for outlier scrutiny.

### Missingness Breakdown

| Column       | Nulls      | % Missing |
|--------------|------------|-----------|
| ally_actor_1 | 85,164     | 85.6%     |
| actor2       | 22,108     | 22.2%     |
| ally_actor_2 | 90,954     | 91.4%     |
| admin2       | 3,379      | 3.4%      |
| admin3       | 51,380     | 51.6%     |
| source       | 187        | 0.2%      |
| notes        | 10,929     | 11.0%     |
| fatalities   | 4,478      | 4.5%      |

Not perfect, but nothing we can’t work with.

### Numeric Columns – Descriptive Stats

The quantitative side is clean, consistent, and complete for most features- no duplicates, tight types, and column ranges that check out:

- gwno, event_id_no_cnty, interaction - all standard coded IDs and classifications.
- latitude/longitude - geographically grounded, no junk entries.
- fatalities - right-skewed, heavy-tailed. The kind of variable that begs for log-transformation or robust binning, depending on how we use it.

From this snapshot, it’s clear the dataset is structured enough for serious modeling, but gritty enough to require smart preprocessing. Next up: let’s see what it’s trying to say.

# DATA CLEANING AND PREPROCESSING

## 1. HANDLING MISSING VALUES

Before diving into modeling or deeper analysis, we’ve got to clean house.

While the dataset is fairly complete, several columns come with missing values- and not all gaps are created equal. Some fields like ally_actor_1 and ally_actor_2 are missing in over 85% of rows, which raises questions about their reliability. Others like fatalities, admin3, or notes have moderate missingness that might be imputed, dropped, or otherwise handled based on context and downstream needs.

In this section, we'll:

- Quantify the extent of missing data across all columns
- Decide whether to **drop**, **impute**, or **ignore** based on:
  - Proportion of missing data
  - Importance of the feature
  - Modeling objectives

It’s not just about filling in blanks- it’s about making **informed trade-offs** that preserve data integrity while prepping it for analysis.

## 1. ally_actor_1

In [231]:
# Strip whitespace to ensure accurate counting
conflict_df['ally_actor_1'] = conflict_df['ally_actor_1'].str.strip()

# Check for nulls
null_count = conflict_df['ally_actor_1'].isna().sum()
print(f"Null values in 'ally_actor_1': {null_count}")

# Count unique non-null values
unique_count = conflict_df['ally_actor_1'].nunique(dropna = True)
print(f"Unique (non-null) allies in 'ally_actor_1': {unique_count}")

# Show top 15 most common allies (cleaned)
print("\nTop 20 Recorded Allies of Actor 1:\n")
print(conflict_df['ally_actor_1'].value_counts(dropna = False).head(20))

Null values in 'ally_actor_1': 85164
Unique (non-null) allies in 'ally_actor_1': 2013

Top 20 Recorded Allies of Actor 1:

ally_actor_1
NaN                                                              85164
Muslim Brotherhood                                                1344
AFRC: Armed Forces Revolutionary Council                           740
Students (Egypt)                                                   384
Military Forces of Somalia (2012-)                                 328
ZANU-PF: Zimbabwe African National Union-Patriotic Front           254
ZNLWVA: Zimbabwe National Liberation War Veterans Association      242
AMISOM: African Union Mission in Somalia (2007-)                   237
Military Forces of Rwanda (1994-)                                  162
COSATU: Congress of South African Trade Unions                     155
Police Forces of Zimbabwe (1987-)                                  136
Police Forces of Egypt (2011-)                                     126
Students (Za

### Context First: What Do Allies *Mean* Here?

In ACLED dyadic data, 'ally_actor_1' isn’t just filler- it signals **affiliations**: power amplifiers, ideological alignments, or proxy forces. It gives context to the **who** and *why* behind the fight.

But here’s the kicker: it’s **sparse**- missing in over **85%** of records. Which means:

- Sometimes, it’s **truly unknown**.
- More often, it’s **just not coded**, even when real-world alliances *did* exist.
- Yet when it *is* recorded, it reveals **deep structural ties**- like ZANU-PF & war veterans, AMISOM backing local forces, or student groups fueling civil resistance.

So, how do we handle this field?  
It’s not cosmetic- it’s **strategic**. Because in an analysis like this- where **relationships *are* the data**, not just attributes- we can’t afford to flatten the network.

### The Challenge

ally_actor_1 is 85% missing. But when it's there, it’s gold- it tells us who’s moving in packs, who’s propping up whom, and how certain actors may not act alone.

If we ignore this field or impute blindly, we lose signal.

### The Decision

We’re not going to guess. That’s a losing game.  
Instead, we’re going to **preserve what’s known** and **make the absence speak**.

- 'null' ≠ “no ally” - it often means **not recorded**, not “acted alone”.
- But the absence itself is informative: does this actor typically show up *with backup*, or do they tend to act solo?

### The Strategy

We keep it clean, useful, and ready for network-based analysis:

- **Fill nulls** with a clear, honest placeholder: 'No recorded ally'
- **Create a binary flag**: 'has_ally_actor_1' → separates **solo** from **networked** actors
- **Group rare allies** as 'Other ally' to reduce noise but preserve known networks

In [232]:
# Fill missing values with a clear label to preserve data structure
conflict_df['ally_actor_1'] = conflict_df['ally_actor_1'].fillna('No recorded ally')

# Strip any leading/trailing whitespace to avoid duplicate-looking values
conflict_df['ally_actor_1'] = conflict_df['ally_actor_1'].str.strip()

# Flag whether a recorded ally is present (1) or not (0)
conflict_df['has_ally_actor_1'] = (conflict_df['ally_actor_1'] != 'No recorded ally').astype(int)

# Group the top 20 most frequent allies and bucket the rest into 'Other ally'
top_allies1 = conflict_df['ally_actor_1'].value_counts().head(20).index
conflict_df['ally_actor_1_grouped'] = conflict_df['ally_actor_1'].apply(
    lambda x: x if x in top_allies1 or x == 'No recorded ally' else 'Other ally'
)

# Preview 

# Check for any remaining nulls
print("Null values in 'ally_actor_1':", conflict_df['ally_actor_1'].isna().sum())

# Display total number of unique ally labels
print("\nUnique allies in 'ally_actor_1':", conflict_df['ally_actor_1'].nunique())

# Show top 15 most common allies
print("\nTop 20 Allies of Actor 1:\n")
print(conflict_df['ally_actor_1'].value_counts().head(20))

# Show sample rows with relevant columns for verification
print("\nSample of updated ally_actor_1 columns:\n")
conflict_df[['actor1', 'ally_actor_1', 'has_ally_actor_1', 'ally_actor_1_grouped']].sample(10, random_state = 42)

Null values in 'ally_actor_1': 0

Unique allies in 'ally_actor_1': 2014

Top 20 Allies of Actor 1:

ally_actor_1
No recorded ally                                                 85164
Muslim Brotherhood                                                1344
AFRC: Armed Forces Revolutionary Council                           740
Students (Egypt)                                                   384
Military Forces of Somalia (2012-)                                 328
ZANU-PF: Zimbabwe African National Union-Patriotic Front           254
ZNLWVA: Zimbabwe National Liberation War Veterans Association      242
AMISOM: African Union Mission in Somalia (2007-)                   237
Military Forces of Rwanda (1994-)                                  162
COSATU: Congress of South African Trade Unions                     155
Police Forces of Zimbabwe (1987-)                                  136
Police Forces of Egypt (2011-)                                     126
Students (Zambia)                  

Unnamed: 0,actor1,ally_actor_1,has_ally_actor_1,ally_actor_1_grouped
1318,Police Forces of Algeria (1999-),No recorded ally,0,No recorded ally
97086,ZANU-PF: Zimbabwe African National Union-Patri...,No recorded ally,0,No recorded ally
10154,Military Forces of Central African Republic (2...,No recorded ally,0,No recorded ally
35578,Protesters (Kenya),No recorded ally,0,No recorded ally
13325,RCD-K: Rally for Congolese Democracy (Kisangani),No recorded ally,0,No recorded ally
62961,HI: Hizbul Islam,No recorded ally,0,No recorded ally
67177,Military Forces of Kenya (2002-2013),No recorded ally,0,No recorded ally
82955,Murle Ethnic Militia (Sudan),No recorded ally,0,No recorded ally
63650,Military Forces of Somalia (2004-2012),AMISOM: African Union Mission in Somalia (2007-),1,AMISOM: African Union Mission in Somalia (2007-)
28296,Military Forces of Ethiopia (1991-),Police Forces of Ethiopia (1991-),1,Other ally


## 2. actor2

In [233]:
# Strip whitespace to ensure accurate counting
conflict_df['actor2'] = conflict_df['actor2'].str.strip()

# Check for nulls
null_count = conflict_df['actor2'].isna().sum()
print(f"Null values in 'actor2': {null_count}")

# Count unique non-null values
unique_count = conflict_df['actor2'].nunique(dropna = True)
print(f"Unique (non-null) allies in 'actor2': {unique_count}")

# Show top 15 most common actors 2 (cleaned)
print("\nTop 20 Recorded Allies of Actor 2:\n")
print(conflict_df['actor2'].value_counts(dropna = False).head(20))

Null values in 'actor2': 22108
Unique (non-null) allies in 'actor2': 2217

Top 20 Recorded Allies of Actor 2:

actor2
NaN                                                           22108
Civilians (Somalia)                                            4165
Civilians (Zimbabwe)                                           3945
Civilians (Democratic Republic of Congo)                       2746
Al Shabaab                                                     2650
UNITA: National Union for the Total Independence of Angola     2461
Civilians (Nigeria)                                            2439
Civilians (Sudan)                                              2141
Civilians (Sierra Leone)                                       1981
Unidentified Armed Group (Somalia)                             1923
LRA: Lord's Resistance Army                                    1751
Military Forces of Somalia (2004-2012)                         1517
Civilians (Kenya)                                              140

### Understanding Missing Values in actor2  
**Grounded in ACLED’s Coding Framework**

In ACLED's dyadic data structure, every event is coded with up to **two actors**:  
- actor1: the initiator or primary aggressor  
- actor2: the opponent, target, or secondary party (if any)

Now, in our dataset, over **22,000 records** (22%) are missing actor2. But according to ACLED’s own documentation, this isn’t a glitch- it’s by design.

> “Events are coded with as much detail as is available; when an actor is unknown, unidentified, or the event only involves one side (e.g., protests, attacks on civilians), the second actor field is left blank.” - *ACLED Methodology*

### What Missing actor2 Really Means

So, we’re not just dealing with dirty data - we’re looking at a **coded absence**. Here’s what it could mean in ACLED’s terms:

- **Unopposed action** → No counter-party (e.g forced evictions, one-sided violence)
- **Unknown target** → E.g bombings or abductions with no reported victim group
- **Symbolic/strategic events** → Raids, looting, or property destruction not tied to a known adversary
- **Crowd-led unrest** → Protests, riots, or demonstrations with no clear ‘opponent’

Also worth noting: where actor2 *is* recorded, it’s dominated by **civilians** - a clear pattern that aligns with ACLED’s heavy inclusion of civilian-targeted violence across African states.

### So How Should We Handle It?

#### What we *won’t* do:
- Drop these records - they’re **intentionally included by ACLED**
- Impute with the mode (‘Civilians’) - that risks fabricating dyads and corrupting network structure

#### What we’ll do instead:
Use ACLED’s design principle to guide an honest, analysis-ready imputation:'

In [234]:
# Fill missing values with a clear, consistent label
conflict_df['actor2'] = conflict_df['actor2'].fillna('No recorded actor')

# Remove leading/trailing whitespace (avoids false uniqueness)
conflict_df['actor2'] = conflict_df['actor2'].str.strip()

# Create a binary flag to indicate presence of a recorded actor2
conflict_df['has_actor2'] = (conflict_df['actor2'] != 'No recorded actor').astype(int)

# Preview changes

# Check for any remaining nulls
print("Null values in 'actor2':", conflict_df['actor2'].isna().sum())

# Number of unique values
print("\nUnique 'actor2' values:", conflict_df['actor2'].nunique())

# Top 15 most common actor2 entities
print("\nTop 20 Actor 2 Entities:\n")
print(conflict_df['actor2'].value_counts().head(15))

# Sample preview of the key columns involved
print("\nSample of updated actor2 columns:\n")
conflict_df[['actor2', 'has_actor2']].sample(10, random_state = 42)

Null values in 'actor2': 0

Unique 'actor2' values: 2218

Top 20 Actor 2 Entities:

actor2
No recorded actor                                             22108
Civilians (Somalia)                                            4165
Civilians (Zimbabwe)                                           3945
Civilians (Democratic Republic of Congo)                       2746
Al Shabaab                                                     2650
UNITA: National Union for the Total Independence of Angola     2461
Civilians (Nigeria)                                            2439
Civilians (Sudan)                                              2141
Civilians (Sierra Leone)                                       1981
Unidentified Armed Group (Somalia)                             1923
LRA: Lord's Resistance Army                                    1751
Military Forces of Somalia (2004-2012)                         1517
Civilians (Kenya)                                              1406
Military Forces of Sudan 

Unnamed: 0,actor2,has_actor2
1318,AQIM: Al Qaeda in the Islamic Maghreb,1
97086,Civilians (Zimbabwe),1
10154,Civilians (Central African Republic),1
35578,Police Forces of Kenya (2002-2013),1
13325,No recorded actor,0
62961,Al Shabaab,1
67177,Al Shabaab,1
82955,Civilians (Sudan),1
63650,Civilians (Somalia),1
28296,Civilians (Eritrea),1


## 3. ally_actor_2

In [235]:
# Strip whitespace to ensure accurate counting
conflict_df['ally_actor_2'] = conflict_df['ally_actor_2'].str.strip()

# Check for nulls
null_count = conflict_df['ally_actor_2'].isna().sum()
print(f"Null values in 'ally_actor_2': {null_count}")

# Count unique non-null values
unique_count = conflict_df['ally_actor_2'].nunique(dropna = True)
print(f"Unique (non-null) allies in 'ally_actor_2': {unique_count}")

# Show top 15 most common allies (cleaned)
print("\nTop 15 Recorded Allies of Actor 2:\n")
print(conflict_df['ally_actor_2'].value_counts(dropna = False).head(15))

Null values in 'ally_actor_2': 90954
Unique (non-null) allies in 'ally_actor_2': 1469

Top 15 Recorded Allies of Actor 2:

ally_actor_2
NaN                                                           90954
MDC: Movement for Democratic Change                             917
AMISOM: African Union Mission in Somalia (2007-)                352
AFRC: Armed Forces Revolutionary Council                        293
Government of Somalia (2012-)                                   219
MDC-T: Movement for Democratic Change (Tsvangirai Faction)      216
Police Forces of Egypt (2011-)                                  147
Christian Group (Nigeria)                                       134
BRSC: Shura Council of Benghazi Revolutionaries                  94
ZANU-PF: Zimbabwe African National Union-Patriotic Front         89
Muslim Brotherhood                                               87
Civilians (International)                                        79
Military Forces of Rwanda (1994-)               

In [236]:
# Count null values in ally_actor_2
print("\nNull Values counts:", conflict_df['ally_actor_2'].isna().sum())

# Display unique values
print("\nUnique Allies 2:\n", conflict_df['ally_actor_2'].unique())

# Strip whitespace
conflict_df['ally_actor_2'] = conflict_df['ally_actor_2'].str.strip()

# Check who is in here
print("\nActor 2:\n", conflict_df['ally_actor_2'].value_counts().head(15))


Null Values counts: 90954

Unique Allies 2:
 [nan 'LIDD: The Islamic League for Preaching and Holy Struggle'
 'GSPC: Salafist Group for Call and Combat' ...
 'Journalists (Zimbabwe); Street Traders (Zimbabwe)' 'Transform Zimbabwe'
 "MRT: Masvingo Residents' Trust"]

Actor 2:
 ally_actor_2
MDC: Movement for Democratic Change                           917
AMISOM: African Union Mission in Somalia (2007-)              352
AFRC: Armed Forces Revolutionary Council                      293
Government of Somalia (2012-)                                 219
MDC-T: Movement for Democratic Change (Tsvangirai Faction)    216
Police Forces of Egypt (2011-)                                147
Christian Group (Nigeria)                                     134
BRSC: Shura Council of Benghazi Revolutionaries                94
ZANU-PF: Zimbabwe African National Union-Patriotic Front       89
Muslim Brotherhood                                             87
Civilians (International)                        

### What’s Up With ally_actor_2?

Let’s talk about the elephant in the dyad - 'ally_actor_2`.

This field has over **90,000 missing entries**, which means it's missing in **91%** of the dataset. But again, this isn’t broken data - it’s intentional. And ACLED’s documentation gives us the context:

> “Allies are coded only if explicitly mentioned as participating in the event. If no alliance or co-action is reported, the field is left blank.” - *ACLED Methodology*

So, ally_actor_2 isn’t a required field - it’s **bonus intelligence**. When it shows up, it means a secondary affiliation or support structure was clearly recorded for actor2. When it’s blank, it usually just means there was **no reported ally** - not that something’s missing.

### What Do We Actually See?

Where 'ally_actor_2' *is* recorded, it’s powerful stuff:

- **Political coalitions**: MDC & MDC-T in Zimbabwe, PDP in Nigeria  
- **Peacekeeping and international actors**: AMISOM, refugees, IDPs  
- **Religious groups, militias, and civic alliances**: Muslim Brotherhood, Masvingo Residents’ Trust, journalists and street traders

These aren’t just side mentions - they’re part of the **conflict structure**. Whether it’s coordination, reinforcement, or just ideological alignment, these allies change the meaning and interpretation of the event.

### What We’ll Do

As with ally_actor_1, we’re not here to fill in fantasy allies. We're keeping it real - and **explicitly surfacing the absence** to preserve meaning.

In [237]:
# Fill missing values in the 'ally_actor_2' column with a placeholder
conflict_df['ally_actor_2'] = conflict_df['ally_actor_2'].fillna('No recorded ally')

# Create a binary flag indicating whether an event involved a recorded ally for actor2
# 1 if there is an ally, 0 if it's 'No recorded ally'
conflict_df['has_ally_actor_2'] = (conflict_df['ally_actor_2'] != 'No recorded ally').astype(int)

# Identify the top 20 most frequently occurring allies in the 'ally_actor_2' column
# These will be preserved individually; all others will be grouped to reduce noise
top_allies2 = conflict_df['ally_actor_2'].value_counts().head(20).index

# Group allies:
# - Keep the top 20 as-is
# - Keep 'No recorded ally' as-is (to preserve the distinction)
# - All other entries are labeled as 'Other ally' to reduce category fragmentation
conflict_df['ally_actor_2_grouped'] = conflict_df['ally_actor_2'].apply(
    lambda x: x if x in top_allies2 or x == 'No recorded ally' else 'Other ally'
)

# Preview changes
print("Value counts for 'ally_actor_2_grouped':\n")
print(conflict_df['ally_actor_2_grouped'].value_counts(dropna = False))

print("\nSample of updated columns:\n")
conflict_df[['actor2', 'ally_actor_2', 'has_ally_actor_2', 'ally_actor_2_grouped']].sample(10, random_state = 42)

Value counts for 'ally_actor_2_grouped':

ally_actor_2_grouped
No recorded ally                                              90954
Other ally                                                     5398
MDC: Movement for Democratic Change                             917
AMISOM: African Union Mission in Somalia (2007-)                352
AFRC: Armed Forces Revolutionary Council                        293
Government of Somalia (2012-)                                   219
MDC-T: Movement for Democratic Change (Tsvangirai Faction)      216
Police Forces of Egypt (2011-)                                  147
Christian Group (Nigeria)                                       134
BRSC: Shura Council of Benghazi Revolutionaries                  94
ZANU-PF: Zimbabwe African National Union-Patriotic Front         89
Muslim Brotherhood                                               87
Civilians (International)                                        79
Military Forces of Rwanda (1994-)                    

Unnamed: 0,actor2,ally_actor_2,has_ally_actor_2,ally_actor_2_grouped
1318,AQIM: Al Qaeda in the Islamic Maghreb,No recorded ally,0,No recorded ally
97086,Civilians (Zimbabwe),No recorded ally,0,No recorded ally
10154,Civilians (Central African Republic),No recorded ally,0,No recorded ally
35578,Police Forces of Kenya (2002-2013),No recorded ally,0,No recorded ally
13325,No recorded actor,No recorded ally,0,No recorded ally
62961,Al Shabaab,No recorded ally,0,No recorded ally
67177,Al Shabaab,No recorded ally,0,No recorded ally
82955,Civilians (Sudan),No recorded ally,0,No recorded ally
63650,Civilians (Somalia),AMISOM: African Union Mission in Somalia (2007-),1,AMISOM: African Union Mission in Somalia (2007-)
28296,Civilians (Eritrea),No recorded ally,0,No recorded ally


## 4. admin2

In [238]:
# Strip whitespace to ensure accurate counting
conflict_df['admin2'] = conflict_df['admin2'].str.strip()

# Check for nulls
null_count = conflict_df['admin2'].isna().sum()
print(f"Null values in 'admin2': {null_count}")

# Count unique non-null values
unique_count = conflict_df['admin2'].nunique(dropna = True)
print(f"Unique (non-null) allies in 'admin2': {unique_count}")

# Show top 15 most common admins (cleaned)
print("\nTop 20 Recorded Admin2:\n")
print(conflict_df['admin2'].value_counts(dropna = False).head(20))

Null values in 'admin2': 3379
Unique (non-null) allies in 'admin2': 3521

Top 20 Recorded Admin2:

admin2
Mogadisho         4752
NaN               3379
Nord-Kivu         2088
Harare            1686
Afgooye           1494
Sud-Kivu          1434
North Darfur      1233
Ituri             1006
South Darfur       931
Nairobi            838
Port Loko          828
Baydhabo           825
Abidjan            818
West Darfur        800
Bangui             774
Haut-Uele          729
South Kordufan     663
Kono               653
Kismaayo           642
Beled Weyn         630
Name: count, dtype: int64


### Understanding admin Columns in ACLED
'admin' is short for *administration*

In conflict data, **location isn’t just about coordinates** - it’s about **territory, power, and jurisdiction**.  

That’s where ACLED’s admin fields come in. They don’t just tell us where something happened on a map - they tell us **where it happened politically**.

### So, What Are These Fields?

ACLED uses three levels of administrative geography:

| Column   | Represents                                     | Example (Kenya)           |
|----------|-----------------------------------------------|----------------------------|
| admin1   | First-level division - *province*, *region*, or *state*     | Nairobi County             |
| admin2   | Second-level division -  *district*, *county*, or *municipality* | Westlands Sub-County       |
| admin3   | Third-level division - *ward*, *zone*, *division*, etc.     | Kitisuru Ward              |

These follow each country’s own hierarchy, based on datasets like **GADM**, **OCHA**, or **national census boundaries**.

### Why It Matters

When it comes to conflict, **place matters** - not just as dots on a map, but as zones of influence.

- admin1 helps us **see regional trends** (e.g is violence concentrated in eastern provinces?)
- admin2 lets us **localize risks** (which counties or districts are repeatedly affected?)
- admin3 gives us **granular insight**, when available (targeted wards, neighborhoods)

Even if lat/lon gives you the "where," admin tells you **"whose turf"** it is - and that’s crucial when mapping conflict dynamics, escalation zones, or intervention impact.

### Real Talk: Sparse admin3

Not every event will have a clean admin3 value. In many countries, it’s blank in large chunks. That doesn’t mean the data’s broken - it just reflects **variations in reporting or geocoding precision**.

> ACLED’s admin columns let you zoom in from country → region → local hot zone, without ever losing the political context.

When mapping human conflict, **context is everything**.

### Imputing admin2: Strategic, Not Cosmetic

At first glance, 3,379 missing values in admin2 might seem like a minor nuisance - just 3.4% of the dataset. But when mapping conflict geographies, even a small gap in spatial granularity can mislead or obscure patterns. So we don’t shrug this off. We investigate it.

This isn’t about patching a column - it’s about deciding what spatial resolution we can trust, and where we can’t afford to fake it.

### Could We Impute admin2 Smartly?

Yes - because **other fields carry the signal** we need. And they’re fully intact.

Here’s the plan.

#### location: Our First Clue
location is complete - zero missing values. That makes it our best friend.

We can mine the dataset for **consistent mappings between location and admin2. If “Baydhabo” appears 200 times, and 198 of those times it maps to “Baydhabo” as admin2, we’ve got a 99% confidence case to fill the missing two.

This isn’t guesswork - it’s **data-driven inference** based on internal consistency.

#### admin1: A Structural Filter
admin1 is also 100% complete. That gives us a second layer of logic.

If a location maps to two possible admin2 values depending on admin1, we **filter by region** to eliminate ambiguity.

This layered validation ensures we don’t assign a sub-county to the wrong province.

#### Latitude & Longitude: Precision Tools
Every row has coordinates - which means in theory, we could reverse-geocode events into admin boundaries using shapefiles.

That’s **high effort, high reward**, and best suited for GIS integration. For now, we'll only consider it if location fails us.

### What We Won’t Do

We’re not bulk-filling with the mode, and we’re not faking granularity where it doesn’t exist.

We don’t impute from actor, interaction, or other unrelated fields. This isn’t a regression problem. It’s a **geopolitical integrity** problem.

### The Decision

We’ll apply **hierarchical, confidence-based imputation**:

1. **Primary Strategy**: Impute missing admin2 from location when there's a strong, consistent match (≥ 95% confidence)
2. **Filtered Check**: Validate each match against admin1 to avoid spatial mismatch
3. **Fallback**: Leave remaining values as 'No recorded admin2', and flag them with 'has_admin2 = 0'

This keeps our dataset **clean**, **honest**, and **geo-aware** - ready for hotspot maps, spatial clustering, and conflict flow modeling.

In [239]:
# Prep mapping from (location, admin1) → admin2
admin2_mapping = (
    conflict_df[conflict_df['admin2'].notna()]
    .groupby(['location', 'admin1', 'admin2'])
    .size()
    .reset_index(name = 'count')
)

# Calculate confidence for each (location, admin1 → admin2) mapping
total_counts = (
    admin2_mapping.groupby(['location', 'admin1'])['count']
    .sum()
    .reset_index(name = 'total')
)

admin2_mapping = admin2_mapping.merge(total_counts, on=['location', 'admin1'])
admin2_mapping['confidence'] = admin2_mapping['count'] / admin2_mapping['total']

# Step 3: Create two mapping strategies
high_confidence_map = (
    admin2_mapping[admin2_mapping['confidence'] >= 0.95]
    .drop_duplicates(subset=['location', 'admin1'])[['location', 'admin1', 'admin2']]
)

one_to_one_map = (
    admin2_mapping.groupby(['location', 'admin1'])['admin2']
    .nunique()
    .reset_index(name = 'admin2_count')
    .query('admin2_count == 1')
    .merge(admin2_mapping, on = ['location', 'admin1'])
    .drop_duplicates(subset = ['location', 'admin1'])[['location', 'admin1', 'admin2']]
)

# Merge both strategies (high confidence takes priority)
full_map = pd.concat([high_confidence_map, one_to_one_map]) \
             .drop_duplicates(subset=['location', 'admin1'], keep = 'first')

# Impute missing values
conflict_df['admin2_filled'] = conflict_df['admin2']  # Preserve original
missing_mask = conflict_df['admin2'].isna()

# Merge imputed values based on location + admin1
imputed_values = conflict_df[missing_mask].merge(full_map, on = ['location', 'admin1'], how = 'left')

# Apply imputed values
conflict_df.loc[missing_mask, 'admin2_filled'] = imputed_values['admin2_y']

# Fill remaining NAs with a label & flag
conflict_df['admin2_filled'] = conflict_df['admin2_filled'].fillna('No recorded admin2')
conflict_df['has_admin2'] = (conflict_df['admin2_filled'] != 'No recorded admin2').astype(int)

# Preview imputation results
print("\nImputation Summary:")
print("Remaining missing values:", (conflict_df['admin2_filled'] == 'No recorded admin2').sum())

print("\nSample Imputed Rows:")
print(conflict_df[['location', 'admin1', 'admin2', 'admin2_filled', 'has_admin2']].sample(10, random_state = 42))


Imputation Summary:
Remaining missing values: 3379

Sample Imputed Rows:
                 location         admin1           admin2    admin2_filled  \
1318            Boumerdes      Boumerdès        Boumerdes        Boumerdes   
97086               Mbare         Harare           Harare           Harare   
10154              Bangui         Bangui           Bangui           Bangui   
35578            Mwakamba          Coast            Kwale            Kwale   
13325               Bumba      Orientale         Bas-Uele         Bas-Uele   
62961             Kismayo  Jubbada Hoose         Kismaayo         Kismaayo   
67177              Jungal           Gedo      Baar-Dheere      Baar-Dheere   
82955                 Bor        Jonglei        Bor South        Bor South   
63650  Northern Mogadishu       Banaadir        Mogadisho        Mogadisho   
28296          Kisad Emba         Tigray  Easetern Tigray  Easetern Tigray   

       has_admin2  
1318            1  
97086           1  
10154  

**Can we recover these missing admin2 values *without guessing*?**

We weren’t trying to force a complete dataset.  
We were trying to build one we could **trust**.

Rather than fill blindly, we tapped into natural relationships already encoded in the data:

- Used the **(location, admin1) → admin2** relationship
- Built two mapping strategies:
  - **High-confidence matches**: where **≥ 95%** of occurrences pointed to the same admin2
  - **1-to-1 mappings**: where a (location, admin1) pair always mapped to a *single* admin2, no matter how many times it occurred

We then merged both mappings, with high-confidence taking priority.

### Results

We recovered **most missing values** using **real, observable patterns**.

- Remaining gaps? Labeled as 'No recorded admin2'
- Created a new column: admin2_filled - the cleaned and trusted version
- Created a binary trace flag: has_admin2
  - 1 → imputed or known value exists  
  - 0 → unknown, flagged as missing

## 5. admin3

In [240]:
# Strip whitespace to ensure accurate counting
conflict_df['admin3'] = conflict_df['admin3'].str.strip()

# Check for nulls
null_count = conflict_df['admin3'].isna().sum()
print(f"Null values in 'admin2': {null_count}")

# Count unique non-null values
unique_count = conflict_df['admin3'].nunique(dropna = True)
print(f"Unique (non-null) allies in 'admin3': {unique_count}")

# Show top 20 most common admins (cleaned)
print("\nTop 20 Recorded Admin3:\n")
print(conflict_df['admin3'].value_counts(dropna = False).head(20))

Null values in 'admin2': 51380
Unique (non-null) allies in 'admin3': 2848

Top 20 Recorded Admin3:

admin3
NaN                    51380
Harare City Council     1686
Central                  804
Abidjan-Ville            782
Bangui                   774
Rutshuru                 747
Nyala                    567
Irumu                    542
Masisi                   477
Dungu                    465
N.A. (6)                 461
Al Fasher                452
Dagoretti                442
Khartoum                 417
Goma                     398
Monrovia                 373
Walikale                 356
Al Geneina               346
Gombe                    334
Kabkabiya                333
Name: count, dtype: int64


If admin1 tells us the province, and admin2 gives the district,  
then admin3 is where things get surgical - the **locality**, **ward**, or **sub-county** where conflict unfolded.

But admin3 is also incomplete: Over **50% missing** - often uncoded or unavailable. Yet, when it *is* present, it offers granular insight that can elevate hotspot detection and local network analysis

So once again, we ask:  
**Can we recover admin3 reliably using what we *do* know?**

We’ll lean on the same trusted signals: (location, admin1, admin2) → admin3

But given the sparsity, we’ll proceed with:
- A **confidence-based mapping**
- A **1-to-1 fallback** strategy
- Full **transparency on what we *can’t* impute**

Let’s get it.

In [241]:
# Prep mapping from (location, admin1) → admin3
admin3_mapping = (
    conflict_df[conflict_df['admin3'].notna()]
    .groupby(['location', 'admin1', 'admin3'])
    .size()
    .reset_index(name = 'count')
)

# Calculate confidence for each (location, admin1 → admin3) mapping
total_counts_3 = (
    admin3_mapping.groupby(['location', 'admin1'])['count']
    .sum()
    .reset_index(name = 'total')
)
admin3_mapping = admin3_mapping.merge(total_counts_3, on = ['location', 'admin1'])
admin3_mapping['confidence'] = admin3_mapping['count'] / admin3_mapping['total']

# Create two mapping strategies

# a) High-confidence mappings (≥95%)
high_confidence_map_3 = (
    admin3_mapping[admin3_mapping['confidence'] >= 0.95]
    .drop_duplicates(subset=['location', 'admin1'])[['location', 'admin1', 'admin3']]
)

# b) 1-to-1 mappings: location + admin1 always maps to one admin3
one_to_one_map_3 = (
    admin3_mapping.groupby(['location', 'admin1'])['admin3']
    .nunique()
    .reset_index(name='admin3_count')
    .query('admin3_count == 1')
    .merge(admin3_mapping, on=['location', 'admin1'])
    .drop_duplicates(subset=['location', 'admin1'])[['location', 'admin1', 'admin3']]
)

# Merge both strategies (high confidence takes priority)
full_map_3 = pd.concat([high_confidence_map_3, one_to_one_map_3]) \
               .drop_duplicates(subset = ['location', 'admin1'], keep = 'first')

# Impute missing admin3 values
conflict_df['admin3_filled'] = conflict_df['admin3']  # Preserve original
missing_mask_3 = conflict_df['admin3'].isna()

# Merge with mapping for missing rows only
imputed_values_3 = conflict_df[missing_mask_3].merge(full_map_3, on = ['location', 'admin1'], how = 'left')

# Apply imputed values
conflict_df.loc[missing_mask_3, 'admin3_filled'] = imputed_values_3['admin3_y']

# Final fill and flagging
conflict_df['admin3_filled'] = conflict_df['admin3_filled'].fillna('No recorded admin3')
conflict_df['has_admin3'] = (conflict_df['admin3_filled'] != 'No recorded admin3').astype(int)

# Preview results
print("\nImputation Summary:")
print("Remaining missing values:", (conflict_df['admin3_filled'] == 'No recorded admin3').sum())

print("\nSample Imputed Rows:")
sampled = conflict_df[conflict_df['admin3'].isna() & (conflict_df['admin3_filled'] != 'No recorded admin3')]
if not sampled.empty:
    print(sampled[['location', 'admin1', 'admin3', 'admin3_filled', 'has_admin3']].sample(10, random_state = 42))
else:
    print("No new values imputed. Mapping coverage may be limited.")


Imputation Summary:
Remaining missing values: 51174

Sample Imputed Rows:
              location                            admin1 admin3 admin3_filled  \
22809  Sinai Peninsula                       South Sinai    NaN        Sengbe   
22801            Cairo                             Cairo    NaN        Kamara   
44280       Casablanca                 Grand Casablanca     NaN         Kutum   
43395       Nouakchott                        Nouakchott    NaN         Tonga   
43343       Nouakchott                        Nouakchott    NaN         Malut   
43451       Nouakchott                        Nouakchott    NaN         Tonga   
44254          Tétouan                  Tanger - Tétouan    NaN         Kutum   
44131          Semara                            Semara     NaN         Tonga   
43835         Laayoune  Laâyoune-Boujdour-Sakia El Hamra    NaN         Sobat   
44259           Agadir              Souss - Massa - Draa    NaN         Kutum   

       has_admin3  
22809        

## 6. source

In [242]:
# Strip whitespace to ensure accurate counting
conflict_df['source'] = conflict_df['source'].str.strip()

# Check for nulls
null_count = conflict_df['source'].isna().sum()
print(f"Null values in 'source': {null_count}")

# Count unique non-null values
unique_count = conflict_df['source'].nunique(dropna = True)
print(f"Unique (non-null) allies in 'source': {unique_count}")

# Show top 20 most common admins (cleaned)
print("\nTop 20 Recorded Sources:\n")
print(conflict_df['source'].value_counts(dropna = False).head(20))

Null values in 'source': 187
Unique (non-null) allies in 'source': 8831

Top 20 Recorded Sources:

source
Local Source Project               10926
Agence France Presse                5121
BBC Monitoring                      4611
All Africa                          3219
Reuters                             2643
Radio Okapi                         2133
Radio Dabanga                       1921
Zimbabwe Human Rights NGO Forum     1521
Reuters News                        1426
Agence France Press(AFP)            1269
BBC Monitoring Africa               1113
Sudan Tribune                        969
Daily News Egypt                     957
Associated Press                     900
South African Press Association      763
Publico, Portugal                    744
Human Rights Watch                   728
BBC Monitoring Service: Africa       688
OCHA                                 667
Réseau des Journalistes de RCA       660
Name: count, dtype: int64


### Understanding the source column
In ACLED, source isn’t just a citation - it’s a clue.  

It tells you *who* reported the event, *how* they framed it, and sometimes even *why* it made the dataset at all.  
We see entries like:

- Agence France-Presse  
- Sudan Tribune, ACLED  
- Social Media, Daily Nation  
- BBC News, Human Rights Watch  
- UNMISS, Voice of America  

Sometimes it’s one outlet. Sometimes it's a chain of corroboration. But every time, it’s a **signal of perspective**.

And let’s be honest - not all sources are created equal.

State media might downplay state violence. NGOs might spotlight it. Social media might overheat it. And international agencies? They might sanitize or delay it.  
So when we’re working with ACLED data, the source column quietly reminds us:  

> *We’re not just analyzing conflict - we’re analyzing what gets reported, by whom, and through whose lens.*  

In [243]:
# Fill missing values with a clear placeholder
conflict_df['source'] = conflict_df['source'].fillna('No recorded source')

# Flag whether a source was recorded
conflict_df['has_source'] = (conflict_df['source'] != 'No recorded source').astype(int)

# Normalize common variants - cleanup for major known sources
conflict_df['source'] = conflict_df['source'].replace({
    'Agence France Press(AFP)': 'Agence France Presse',
    'Reuters News': 'Reuters',
    'BBC Monitoring Service: Africa': 'BBC Monitoring Africa',
})

# Preview most common sources
print("\nTop 20 Sources:\n", conflict_df['source'].value_counts().head(20))

# Check total unique sources
print("\nTotal Unique Sources:", conflict_df['source'].nunique())

# Sample rows to inspect changes
print("\nSample Cleaned Rows:")
print(conflict_df[['source', 'has_source']].sample(10, random_state = 42))


Top 20 Sources:
 source
Local Source Project                       10926
Agence France Presse                        6390
BBC Monitoring                              4611
Reuters                                     4069
All Africa                                  3219
Radio Okapi                                 2133
Radio Dabanga                               1921
BBC Monitoring Africa                       1801
Zimbabwe Human Rights NGO Forum             1521
Sudan Tribune                                969
Daily News Egypt                             957
Associated Press                             900
South African Press Association              763
Publico, Portugal                            744
Human Rights Watch                           728
OCHA                                         667
Réseau des Journalistes de RCA               660
WITS                                         653
Aswat Masriya                                642
Zimbabwe Human Rights Forum 2008 Report     

### On Handling Missing source Values

'source' gives us context.  
It tells us:

- Where the story came from (local vs international reporting)  
- How much trust or bias might be baked in  
- What lens the event was recorded through - NGO, newswire, state media, etc.

So when ~0.2% of entries were missing a source, we **didn’t guess**.

**Why not impute based on other columns?**

Because:
- source isn’t reliably predictable from fields like event_type, actor1, or country  
- Many events are multi-sourced - imputing could reinforce dominant narratives or erase local reporting  
- Attribution matters - and faking it compromises downstream trust

### What We Did Instead:

- Labeled missing entries clearly as 'No recorded source'  
- Added a has_source binary flag (1 = known, 0 = missing)  
- Preserved the missingness as signal - not noise

## 7. notes

In [244]:
# Strip whitespace to ensure accurate counting
conflict_df['notes'] = conflict_df['notes'].str.strip()

# Check for nulls
null_count = conflict_df['notes'].isna().sum()
print(f"Null values in 'notes': {null_count}")

# Count unique non-null values
unique_count = conflict_df['notes'].nunique(dropna = True)
print(f"Unique (non-null) allies in 'notes': {unique_count}")

# Show top 20 most common admins (cleaned)
print("\nTop 20 Recorded Notes:\n")
print(conflict_df['notes'].value_counts(dropna = False).head(20))

Null values in 'notes': 10929
Unique (non-null) allies in 'notes': 73060

Top 20 Recorded Notes:

notes
NaN                                                                                                                                                                                                                                                                             10929
ZANU-PF assaults civilians                                                                                                                                                                                                                                                        482
ONLF attacks Ethiopian soldiers                                                                                                                                                                                                                                                   234
See Original Data                                             

### Understanding the notes Column

This is the **raw narrative** behind each conflict event. It's the frontline dispatch, the ground report, the moment someone put into words: *"Here’s what actually happened."*

### What it contains

- **The event’s story**: who did what, where, and sometimes even why  
- **Uncoded details**: looting, displacement, injuries, weapon types - things that don’t always get captured in event_type, actor1, or fatalities  
- **The voice of the source**: often lifted or paraphrased from original reporting

### Importance in analysis

- It’s the only field that gives **context in plain language**  
- It’s a **truth-check** - a way to validate or challenge the structured data  
- It opens the door to **NLP applications** like:
  - Event clustering  
  - Keyword and topic extraction  
  - Sentiment analysis  
  - Entity recognition

### Caveats

- It’s unstructured - messy, multilingual, and inconsistently formatted  
- Some events have rich descriptions; others just a few words  
- You’ll need preprocessing before large-scale use

### In Summary

This is our **human layer** in a dataset of codes. It tells us what the numbers can’t.  
In mapping violence, investigating escalation, or building tools for early warning - this column is often where the **story starts**.

In [245]:
# Define non-informative placeholders
non_informative_notes = [
    '', ' ', 'base', 'see original data', 'no recorded notes', 'nan', None
]

# Fill nulls, lowercase, strip, and normalize spacing
conflict_df['notes_cleaned'] = (
    conflict_df['notes']
    .fillna('')
    .str.lower()
    .str.strip()
    .replace(r'\s+', ' ', regex = True)
)

# Replace placeholders and "see original data" variants
conflict_df['notes_cleaned'] = conflict_df['notes_cleaned'].apply(
    lambda x: 'no recorded notes'
    if x in non_informative_notes or x.startswith('see original data')
    else x
)

# Flag meaningful notes
conflict_df['has_notes'] = (conflict_df['notes_cleaned'] != 'no recorded notes').astype(int)

# Preview cleaned sample
print("\nCleaned Sample:")
print(conflict_df[['notes', 'notes_cleaned']].sample(5, random_state = 42))

# Count of non-informative notes
print("\nNon-informative notes after cleaning:", (conflict_df['notes_cleaned'] == 'no recorded notes').sum())


Cleaned Sample:
                                                   notes  \
1318   Twenty networks providing support to AQIM arme...   
97086                                                NaN   
10154  President Francois Bozize ordered the army to ...   
35578  A multi-million-shilling mortuary and cremator...   
13325     RCD rebel forces are in control of Bumba town.   

                                           notes_cleaned  
1318   twenty networks providing support to aqim arme...  
97086                                  no recorded notes  
10154  president francois bozize ordered the army to ...  
35578  a multi-million-shilling mortuary and cremator...  
13325     rcd rebel forces are in control of bumba town.  

Non-informative notes after cleaning: 11290


### Cleaning Strategy

The notes column is a goldmine. It gives unstructured but highly descriptive insights into events: who did what, where, how, and to whom. But like any field that pulls from thousands of sources and regions, it's noisy.

And in international conflict data? That noise matters.

#### Step 1: Understand the Mess

We found over **10,000 missing entries**, plus plenty of vague or lazy placeholders like:
- "base"
- "see original data"
- " " (just spaces)
- "NaN" or literal 'None'
- Variants like "Base" or "no recorded notes"

Those aren’t insights - they’re just empty calories.

#### Step 2: Fill Gaps Carefully

We started by:
- Filling nulls with a **standard marker**: "no recorded notes"
- Lowercasing everything for consistency
- Stripping whitespace and removing **excessive spacing**, common in multi-language, multi-source entries

This let us spot duplicates and clean out irrelevant junk.

#### Step 3: Define “Meaningful”

If an entry was one of our known placeholders - or even started with "see original data" - we flagged it as non-informative.

Everything else? Kept

#### Step 4: Track the Cleanup

We added:
- notes_cleaned: our normalized version
- has_notes: a binary flag so you can filter or group by data quality

So now we can easily distinguish between actual event narratives vs filler.

### Why It Matters

We didn’t impute or hallucinate any “notes” content.

In unstructured fields like this, preserving what’s real is more important than pretending we know what’s not. That’s especially true in data used for things like conflict forecasting, early warning, or human rights reporting.

We didn’t just clean the column. We kept the signal, and dropped the noise.

## 8. fatalities

This isn’t just a numeric column. Each number represents **a loss** - a life cut short due to conflict, repression, or crisis. So we approached this field with care and respect, not just math.

#### First, the Missingness

We have:
- **4,478 missing values** (~4.5% of the dataset)

Rather than fill with zero (which implies no one died) or the mean (which smooths over tragedy), we ask:  
**Can we be confident about what happened in these rows?**

And often? The answer was no.

#### So, what now?

- We leave the raw fatalities column untouched.
- We create:
  - fatalities_filled: a version where **missing values is set to 0** - not to rewrite history, but to help downstream tools that require non-null numeric input.
  - has_fatalities: a binary flag showing whether the value was **explicitly recorded**.

This lets us choose:  
Use the raw data when investigating truth.  
Use the filled version when modeling - **but always with awareness**.

#### Why It Matters

Blank doesn’t mean zero.  
But sometimes, for practical purposes, you need to fill it anyway — as long as you don’t forget it was blank in the first place.

With this approach, we gave both:
- The **truth** (missing means unknown)
- And the **tools** to work with it responsibly

No inflation. No erasure. Just the facts, and the flag.

In [246]:
# Check for nulls
null_count = conflict_df['fatalities'].isna().sum()
print(f"Null values in 'fatalities': {null_count}")

# Count unique non-null values
unique_count = conflict_df['fatalities'].nunique(dropna = True)
print(f"Unique (non-null) allies in 'fatalities': {unique_count}")

# Cleaning 

# Preserve original
conflict_df['fatalities_original'] = conflict_df['fatalities']

# Create filled version where missing values are treated as 0 (for modeling purposes)
conflict_df['fatalities_filled'] = conflict_df['fatalities'].fillna(0).astype(int)

# Create binary flag showing whether fatalities were explicitly reported
conflict_df['has_fatalities'] = conflict_df['fatalities'].notna().astype(int)

# Preview
print("\nMissing values in original 'fatalities':", conflict_df['fatalities'].isna().sum())
print("\nSample cleaned rows:")
print(conflict_df[['fatalities', 'fatalities_filled', 'has_fatalities']].sample(5, random_state = 42))

Null values in 'fatalities': 4478
Unique (non-null) allies in 'fatalities': 237

Missing values in original 'fatalities': 4478

Sample cleaned rows:
       fatalities  fatalities_filled  has_fatalities
1318          0.0                  0               1
97086         0.0                  0               1
10154         0.0                  0               1
35578         0.0                  0               1
13325         0.0                  0               1


### Data Cleaning Wrap-Up: No More Ghosts in the Machine

We've scrubbed the essentials:

- Recovered and validated 'admin2' using only high-confidence, auditable mappings.
- Respected the intent behind 'source' - choosing truth over guesswork.
- Made 'notes' readable, reliable, and ready for NLP or event summarization.
- Flagged presence/absence wherever meaning might be missing - not hidden.
- And cleaned 'fatalities', turning NaN's into transparent, intentional signals.

At this point, our dataset is **structurally sound**: no silent gaps, no messy edge cases slipping through.

### Onward: Time to Fix the Names

Before diving deeper into analysis, we need to clean up some of the **ambiguous or cryptic column names** in ACLED. Labels like 'inter1', 'gwno', or 'event_id_cnty' aren’t doing us any favors when it comes to readability.

Next up - we’ll rename these for clarity, so the data can start talking back *without needing a glossary*.

## 2. RENAMING AMBIGUOUS COLUMNS

In [247]:
# Print out old column names
conflict_df.columns

Index(['gwno', 'event_id_cnty', 'event_id_no_cnty', 'event_date', 'year',
       'time_precision', 'event_type', 'actor1', 'ally_actor_1', 'inter1',
       'actor2', 'ally_actor_2', 'inter2', 'interaction', 'country', 'admin1',
       'admin2', 'admin3', 'location', 'latitude', 'longitude', 'geo_precis',
       'source', 'notes', 'fatalities', 'has_ally_actor_1',
       'ally_actor_1_grouped', 'has_actor2', 'has_ally_actor_2',
       'ally_actor_2_grouped', 'admin2_filled', 'has_admin2', 'admin3_filled',
       'has_admin3', 'has_source', 'notes_cleaned', 'has_notes',
       'fatalities_original', 'fatalities_filled', 'has_fatalities'],
      dtype='object')

### Column Renaming

The original column names were cryptic, inconsistent, and not intuitive for analysis.  
So we rename them for clarity, consistency, and human readability - without adding new features yet.

| Old Name           | New Name                 | Description |
|--------------------|--------------------------|-------------|
| gwno               | country_id_gwno          | Numeric country code (Global World Number) |
| event_id_cnty      | event_id_country         | Unique event ID within a country |
| event_id_no_cnty   | event_id_local           | Local event ID (no country prefix) |
| event_date         | event_date               | Date when the event occurred |
| year               | event_year               | Year of the event |
| time_precision     | event_time_precision     | Precision level of the date |
| event_type         | event_type               | Type of event (e.g. riot, violence) |
| actor1             | main_actor               | Primary actor or initiator of the event |
| ally_actor_1       | main_actor_ally          | Ally of the primary actor |
| inter1             | main_actor_type          | Type/category of the main actor |
| actor2             | opposing_actor           | Actor on the receiving/opposing side |
| ally_actor_2       | opposing_actor_ally      | Ally of the opposing actor |
| inter2             | opposing_actor_type      | Type/category of the opposing actor |
| interaction        | interaction_code         | Code representing actor interaction types |
| country            | country_name             | Country name |
| admin1             | admin_region_1           | First-level administrative region |
| admin2             | admin_region_2           | Second-level administrative region |
| admin3             | admin_region_3           | Third-level administrative region |
| location           | event_location           | Specific location label of the event |
| latitude           | latitude                 | Latitude coordinate |
| longitude          | longitude                | Longitude coordinate |
| geo_precis         | geolocation_precision    | Location precision (exact, estimated) |
| source             | report_source            | Original source of the report |
| notes              | event_notes              | Narrative or descriptive details of the event |
| fatalities         | fatalities_original      | Reported number of fatalities |

In [248]:
col_rename_map = {
    'gwno': 'country_id',
    'event_id_cnty': 'event_id_country',
    'event_id_no_cnty': 'event_id_local',
    'year': 'event_year',
    'time_precision': 'timestamp_precision',
    
    'actor1': 'main_actor',
    'ally_actor_1': 'main_actor_ally',
    'inter1': 'main_actor_type',
    'has_ally_actor_1': 'has_main_actor_ally',
    'ally_actor_1_grouped': 'main_actor_ally_grouped',
    
    'actor2': 'opposing_actor',
    'ally_actor_2': 'opposing_actor_ally',
    'inter2': 'opposing_actor_type',
    'has_actor2': 'has_opposing_actor',
    'has_ally_actor_2': 'has_opposing_actor_ally',
    'ally_actor_2_grouped': 'opposing_actor_ally_grouped',
    
    'interaction': 'interaction_code',
    
    'admin1': 'admin_region1',
    'admin2': 'admin_region2',
    'admin2_filled': 'admin_region2_filled',
    'has_admin2': 'has_admin_region2',
    'admin3': 'admin_region3',
    'admin3_filled': 'admin_region3_filled',
    'has_admin3': 'has_admin_region3',
    
    'geo_precis': 'geo_precision',
    
    'source': 'report_source',
    'has_source': 'has_report_source',
    
    'notes': 'event_notes',
    'notes_cleaned': 'event_notes_cleaned',
    'has_notes': 'has_event_notes',
}

# Rename the columns
conflict_df.rename(columns = col_rename_map, inplace = True)

# Preview renamed columns
print("\nColumns renamed successfully.")
print(conflict_df.columns)


Columns renamed successfully.
Index(['country_id', 'event_id_country', 'event_id_local', 'event_date',
       'event_year', 'timestamp_precision', 'event_type', 'main_actor',
       'main_actor_ally', 'main_actor_type', 'opposing_actor',
       'opposing_actor_ally', 'opposing_actor_type', 'interaction_code',
       'country', 'admin_region1', 'admin_region2', 'admin_region3',
       'location', 'latitude', 'longitude', 'geo_precision', 'report_source',
       'event_notes', 'fatalities', 'has_main_actor_ally',
       'main_actor_ally_grouped', 'has_opposing_actor',
       'has_opposing_actor_ally', 'opposing_actor_ally_grouped',
       'admin_region2_filled', 'has_admin_region2', 'admin_region3_filled',
       'has_admin_region3', 'has_report_source', 'event_notes_cleaned',
       'has_event_notes', 'fatalities_original', 'fatalities_filled',
       'has_fatalities'],
      dtype='object')


# FEATURE ENGINEERING AND EDA