# GDELT Conflict Predictor - Feature Engineering

## Notebook Overview

Based on the EDA conclusions from `GDELT_EDA_Traditional.ipynb`, this notebook implements comprehensive feature engineering for conflict prediction modeling.

### Key Findings from EDA:
- **Dataset**: 92,000 events (2015-2024), 17% conflict rate
- **Core Features**: Date, EventRootCode, GoldsteinScale are 100% complete
- **Issues**: High missingness in actor metadata (95-99%), moderate class imbalance
- **Strengths**: Rich temporal structure, good geographic diversity

### Feature Engineering Strategy:
1. Drop highly incomplete features (>90% missing)
2. Create temporal aggregation features (country-day/week level)
3. Build rolling window metrics (conflict rates over 7, 14, 30 days)
4. Engineer escalation indicators and trends
5. Add country-pair interaction features
6. Create geographic clustering features
7. Prepare final dataset for modeling

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import timedelta, datetime
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler, LabelEncoder
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Libraries imported successfully!
Pandas version: 2.3.3
NumPy version: 2.2.6


## 1. Load Data

In [2]:
# Load the GDELT dataset
data_path = '../data/gdelt_sample.csv'  # Adjust path as needed

try:
    df = pd.read_csv(data_path, low_memory=False)
    print(f"Dataset loaded: {df.shape[0]:,} rows, {df.shape[1]} columns")
    print(f"\nMemory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
except FileNotFoundError:
    print(f"Error: File not found at {data_path}")
    print("Please update the data_path variable with the correct path to your GDELT dataset.")

Error: File not found at ../data/gdelt_sample.csv
Please update the data_path variable with the correct path to your GDELT dataset.


In [None]:
# Display basic info
print("Dataset Info:")
print("=" * 80)
df.info()

Dataset Info:


NameError: name 'df' is not defined

: 

In [None]:
# Convert SQLDATE to datetime
df['Date'] = pd.to_datetime(df['SQLDATE'].astype(str), format='%Y%m%d')
print(f"Date range: {df['Date'].min()} to {df['Date'].max()}")
print(f"Total days: {(df['Date'].max() - df['Date'].min()).days}")

## 2. Data Cleaning & Feature Selection

Based on EDA, we'll:
- Drop features with >90% missingness
- Handle Actor1CountryCode missingness (55.4% complete)
- Focus on event-level and temporal features

In [None]:
# Calculate missing percentage for each column
missing_pct = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
print("Features with >90% missingness:")
print("=" * 80)
high_missing = missing_pct[missing_pct > 90]
print(high_missing)

In [None]:
# Drop columns with >90% missing values
cols_to_drop = missing_pct[missing_pct > 90].index.tolist()
print(f"Dropping {len(cols_to_drop)} columns with >90% missingness:")
print(cols_to_drop)

df_clean = df.drop(columns=cols_to_drop)
print(f"\nDataset shape after dropping: {df_clean.shape}")

In [None]:
# Handle Actor1CountryCode missingness - fill with 'UNKNOWN'
if 'Actor1CountryCode' in df_clean.columns:
    before_missing = df_clean['Actor1CountryCode'].isnull().sum()
    df_clean['Actor1CountryCode'] = df_clean['Actor1CountryCode'].fillna('UNKNOWN')
    print(f"Filled {before_missing:,} missing Actor1CountryCode values with 'UNKNOWN'")

if 'Actor2CountryCode' in df_clean.columns:
    before_missing = df_clean['Actor2CountryCode'].isnull().sum()
    df_clean['Actor2CountryCode'] = df_clean['Actor2CountryCode'].fillna('UNKNOWN')
    print(f"Filled {before_missing:,} missing Actor2CountryCode values with 'UNKNOWN'")

In [None]:
# Create target variable: IsConflict (EventRootCode in ['14', '15', '16', '17', '18', '19', '20'])
conflict_codes = ['14', '15', '16', '17', '18', '19', '20']
df_clean['IsConflict'] = df_clean['EventRootCode'].astype(str).isin(conflict_codes).astype(int)

print(f"\nTarget Variable Distribution:")
print("=" * 80)
print(df_clean['IsConflict'].value_counts())
print(f"\nConflict Rate: {df_clean['IsConflict'].mean()*100:.2f}%")

## 3. Temporal Aggregation Features

Create country-day and country-week level aggregations to capture temporal patterns.

In [None]:
# Create temporal features from date
df_clean['Year'] = df_clean['Date'].dt.year
df_clean['Month'] = df_clean['Date'].dt.month
df_clean['DayOfWeek'] = df_clean['Date'].dt.dayofweek  # 0=Monday, 6=Sunday
df_clean['DayOfYear'] = df_clean['Date'].dt.dayofyear
df_clean['WeekOfYear'] = df_clean['Date'].dt.isocalendar().week
df_clean['Quarter'] = df_clean['Date'].dt.quarter
df_clean['IsWeekend'] = (df_clean['DayOfWeek'] >= 5).astype(int)

print("Temporal features created:")
print(df_clean[['Date', 'Year', 'Month', 'DayOfWeek', 'WeekOfYear', 'Quarter', 'IsWeekend']].head())

In [None]:
# Country-Day level aggregations
print("Creating country-day aggregations...")

country_day_agg = df_clean.groupby(['Actor1CountryCode', 'Date']).agg({
    'GLOBALEVENTID': 'count',  # Number of events
    'IsConflict': ['sum', 'mean'],  # Count and rate of conflicts
    'GoldsteinScale': ['mean', 'std', 'min', 'max'],  # Event intensity metrics
    'NumMentions': 'sum',  # Total media attention
    'NumSources': 'sum',  # Source diversity
    'NumArticles': 'sum'  # Article coverage
}).reset_index()

# Flatten column names
country_day_agg.columns = ['_'.join(col).strip('_') for col in country_day_agg.columns.values]
country_day_agg.rename(columns={
    'Actor1CountryCode': 'Country',
    'Date': 'Date',
    'GLOBALEVENTID_count': 'EventCount_Day',
    'IsConflict_sum': 'ConflictCount_Day',
    'IsConflict_mean': 'ConflictRate_Day',
    'GoldsteinScale_mean': 'AvgGoldstein_Day',
    'GoldsteinScale_std': 'StdGoldstein_Day',
    'GoldsteinScale_min': 'MinGoldstein_Day',
    'GoldsteinScale_max': 'MaxGoldstein_Day',
    'NumMentions_sum': 'TotalMentions_Day',
    'NumSources_sum': 'TotalSources_Day',
    'NumArticles_sum': 'TotalArticles_Day'
}, inplace=True)

print(f"Country-day aggregations created: {country_day_agg.shape}")
print(country_day_agg.head())

In [None]:
# Country-Week level aggregations
print("Creating country-week aggregations...")

df_clean['YearWeek'] = df_clean['Date'].dt.to_period('W').apply(lambda r: r.start_time)

country_week_agg = df_clean.groupby(['Actor1CountryCode', 'YearWeek']).agg({
    'GLOBALEVENTID': 'count',
    'IsConflict': ['sum', 'mean'],
    'GoldsteinScale': ['mean', 'std', 'min', 'max'],
    'NumMentions': 'sum',
    'NumSources': 'sum',
    'NumArticles': 'sum'
}).reset_index()

# Flatten column names
country_week_agg.columns = ['_'.join(col).strip('_') for col in country_week_agg.columns.values]
country_week_agg.rename(columns={
    'Actor1CountryCode': 'Country',
    'YearWeek': 'YearWeek',
    'GLOBALEVENTID_count': 'EventCount_Week',
    'IsConflict_sum': 'ConflictCount_Week',
    'IsConflict_mean': 'ConflictRate_Week',
    'GoldsteinScale_mean': 'AvgGoldstein_Week',
    'GoldsteinScale_std': 'StdGoldstein_Week',
    'GoldsteinScale_min': 'MinGoldstein_Week',
    'GoldsteinScale_max': 'MaxGoldstein_Week',
    'NumMentions_sum': 'TotalMentions_Week',
    'NumSources_sum': 'TotalSources_Week',
    'NumArticles_sum': 'TotalArticles_Week'
}, inplace=True)

print(f"Country-week aggregations created: {country_week_agg.shape}")
print(country_week_agg.head())

## 4. Rolling Window Features

Create 7-day, 14-day, and 30-day rolling window metrics for conflict prediction.

In [None]:
# Function to create rolling window features
def create_rolling_features(df, group_col, date_col, windows=[7, 14, 30]):
    """
    Create rolling window features for each country.
    
    Args:
        df: DataFrame with country-day aggregations
        group_col: Column to group by (e.g., 'Country')
        date_col: Date column name
        windows: List of window sizes in days
    
    Returns:
        DataFrame with rolling features added
    """
    df = df.sort_values([group_col, date_col]).reset_index(drop=True)
    
    for window in windows:
        print(f"Creating {window}-day rolling features...")
        
        # Rolling conflict rate
        df[f'ConflictRate_Roll{window}d'] = df.groupby(group_col)['ConflictRate_Day'].transform(
            lambda x: x.rolling(window=window, min_periods=1).mean()
        )
        
        # Rolling conflict count
        df[f'ConflictCount_Roll{window}d'] = df.groupby(group_col)['ConflictCount_Day'].transform(
            lambda x: x.rolling(window=window, min_periods=1).sum()
        )
        
        # Rolling event count
        df[f'EventCount_Roll{window}d'] = df.groupby(group_col)['EventCount_Day'].transform(
            lambda x: x.rolling(window=window, min_periods=1).sum()
        )
        
        # Rolling average Goldstein scale
        df[f'AvgGoldstein_Roll{window}d'] = df.groupby(group_col)['AvgGoldstein_Day'].transform(
            lambda x: x.rolling(window=window, min_periods=1).mean()
        )
        
        # Rolling media attention
        df[f'TotalMentions_Roll{window}d'] = df.groupby(group_col)['TotalMentions_Day'].transform(
            lambda x: x.rolling(window=window, min_periods=1).sum()
        )
    
    return df

print("Rolling window feature function defined.")

In [None]:
# Apply rolling window features to country-day aggregations
country_day_agg = create_rolling_features(
    country_day_agg, 
    group_col='Country', 
    date_col='Date', 
    windows=[7, 14, 30]
)

print(f"\nRolling features added. New shape: {country_day_agg.shape}")
print("\nSample of rolling features:")
print(country_day_agg[['Country', 'Date', 'ConflictRate_Roll7d', 'ConflictRate_Roll14d', 'ConflictRate_Roll30d']].head(10))

## 5. Escalation Indicators & Trend Features

Detect escalating tensions and conflict trends.

In [None]:
# Create lag features (previous day values)
print("Creating lag features...")

for lag in [1, 3, 7]:
    country_day_agg[f'ConflictRate_Lag{lag}d'] = country_day_agg.groupby('Country')['ConflictRate_Day'].shift(lag)
    country_day_agg[f'AvgGoldstein_Lag{lag}d'] = country_day_agg.groupby('Country')['AvgGoldstein_Day'].shift(lag)
    country_day_agg[f'EventCount_Lag{lag}d'] = country_day_agg.groupby('Country')['EventCount_Day'].shift(lag)

print("Lag features created.")

In [None]:
# Create change/delta features (rate of change)
print("Creating change/delta features...")

# Conflict rate change (day-over-day)
country_day_agg['ConflictRate_Change1d'] = country_day_agg.groupby('Country')['ConflictRate_Day'].diff()

# Conflict rate change (week-over-week)
country_day_agg['ConflictRate_Change7d'] = country_day_agg.groupby('Country')['ConflictRate_Day'].diff(7)

# Goldstein scale change
country_day_agg['AvgGoldstein_Change1d'] = country_day_agg.groupby('Country')['AvgGoldstein_Day'].diff()
country_day_agg['AvgGoldstein_Change7d'] = country_day_agg.groupby('Country')['AvgGoldstein_Day'].diff(7)

# Event volume change
country_day_agg['EventCount_Change1d'] = country_day_agg.groupby('Country')['EventCount_Day'].diff()
country_day_agg['EventCount_Change7d'] = country_day_agg.groupby('Country')['EventCount_Day'].diff(7)

print("Change features created.")

In [None]:
# Create escalation indicators
print("Creating escalation indicators...")

# Escalation: conflict rate increasing over 7 days
country_day_agg['IsEscalating_7d'] = (
    (country_day_agg['ConflictRate_Roll7d'] > country_day_agg['ConflictRate_Lag7d']) & 
    (country_day_agg['ConflictRate_Change7d'] > 0)
).astype(int)

# Sudden spike: conflict rate doubled in 1 day
country_day_agg['SuddenSpike'] = (
    country_day_agg['ConflictRate_Day'] >= 2 * country_day_agg['ConflictRate_Lag1d']
).astype(int)

# Tension indicator: negative Goldstein trend (worsening relations)
country_day_agg['NegativeTrend'] = (
    country_day_agg['AvgGoldstein_Change7d'] < -1.0
).astype(int)

# High intensity: average Goldstein below -5 (very negative events)
country_day_agg['HighIntensity'] = (
    country_day_agg['AvgGoldstein_Roll7d'] < -5.0
).astype(int)

print("Escalation indicators created.")
print(f"\nEscalation Summary:")
print(f"  Escalating (7d): {country_day_agg['IsEscalating_7d'].sum():,} instances")
print(f"  Sudden Spikes: {country_day_agg['SuddenSpike'].sum():,} instances")
print(f"  Negative Trends: {country_day_agg['NegativeTrend'].sum():,} instances")
print(f"  High Intensity: {country_day_agg['HighIntensity'].sum():,} instances")

## 6. Country-Pair Interaction Features

Create features based on historical interactions between country pairs.

In [None]:
# Create country pair identifier
print("Creating country-pair interaction features...")

if 'Actor2CountryCode' in df_clean.columns:
    # Create sorted pair (so USA-CHN = CHN-USA)
    df_clean['CountryPair'] = df_clean.apply(
        lambda row: '-'.join(sorted([row['Actor1CountryCode'], row['Actor2CountryCode']])),
        axis=1
    )
    
    # Country-pair aggregations
    pair_agg = df_clean.groupby(['CountryPair', 'Date']).agg({
        'GLOBALEVENTID': 'count',
        'IsConflict': ['sum', 'mean'],
        'GoldsteinScale': 'mean'
    }).reset_index()
    
    pair_agg.columns = ['_'.join(col).strip('_') for col in pair_agg.columns.values]
    pair_agg.rename(columns={
        'CountryPair': 'CountryPair',
        'Date': 'Date',
        'GLOBALEVENTID_count': 'PairEventCount',
        'IsConflict_sum': 'PairConflictCount',
        'IsConflict_mean': 'PairConflictRate',
        'GoldsteinScale_mean': 'PairAvgGoldstein'
    }, inplace=True)
    
    # Historical conflict rate (all-time average for each pair)
    pair_history = df_clean.groupby('CountryPair').agg({
        'IsConflict': 'mean',
        'GoldsteinScale': 'mean',
        'GLOBALEVENTID': 'count'
    }).reset_index()
    
    pair_history.rename(columns={
        'IsConflict': 'PairHistoricalConflictRate',
        'GoldsteinScale': 'PairHistoricalGoldstein',
        'GLOBALEVENTID': 'PairTotalInteractions'
    }, inplace=True)
    
    print(f"Country-pair features created: {pair_agg.shape}")
    print(f"Country-pair history: {pair_history.shape}")
    print("\nTop 10 country pairs by interaction count:")
    print(pair_history.nlargest(10, 'PairTotalInteractions')[['CountryPair', 'PairTotalInteractions', 'PairHistoricalConflictRate']])
else:
    print("Actor2CountryCode not available, skipping country-pair features.")

## 7. Geographic Clustering Features

Create regional conflict spillover features.

In [None]:
# Define regional groupings (you can expand this)
region_mapping = {
    # North America
    'USA': 'North_America', 'CAN': 'North_America', 'MEX': 'North_America',
    
    # Europe
    'GBR': 'Europe', 'FRA': 'Europe', 'DEU': 'Europe', 'ITA': 'Europe', 'ESP': 'Europe',
    'POL': 'Europe', 'UKR': 'Europe', 'RUS': 'Europe',
    
    # Middle East
    'ISR': 'Middle_East', 'SAU': 'Middle_East', 'IRN': 'Middle_East', 'IRQ': 'Middle_East',
    'SYR': 'Middle_East', 'JOR': 'Middle_East', 'TUR': 'Middle_East', 'EGY': 'Middle_East',
    
    # Asia
    'CHN': 'Asia', 'JPN': 'Asia', 'IND': 'Asia', 'PAK': 'Asia', 'KOR': 'Asia',
    'PRK': 'Asia', 'AFG': 'Asia', 'BGD': 'Asia', 'IDN': 'Asia',
    
    # Africa
    'NGA': 'Africa', 'ZAF': 'Africa', 'EGY': 'Africa', 'KEN': 'Africa', 'ETH': 'Africa',
    'GHA': 'Africa', 'UGA': 'Africa', 'SOM': 'Africa',
    
    # South America
    'BRA': 'South_America', 'ARG': 'South_America', 'COL': 'South_America', 
    'VEN': 'South_America', 'CHL': 'South_America',
    
    # Oceania
    'AUS': 'Oceania', 'NZL': 'Oceania'
}

# Add region to country-day aggregations
country_day_agg['Region'] = country_day_agg['Country'].map(region_mapping).fillna('Other')

print("Region mapping applied.")
print(f"\nRegion distribution:")
print(country_day_agg['Region'].value_counts())

In [None]:
# Regional conflict spillover: average conflict rate in region
print("Creating regional spillover features...")

regional_agg = country_day_agg.groupby(['Region', 'Date']).agg({
    'ConflictRate_Day': 'mean',
    'ConflictCount_Day': 'sum',
    'EventCount_Day': 'sum',
    'AvgGoldstein_Day': 'mean'
}).reset_index()

regional_agg.rename(columns={
    'ConflictRate_Day': 'RegionalConflictRate',
    'ConflictCount_Day': 'RegionalConflictCount',
    'EventCount_Day': 'RegionalEventCount',
    'AvgGoldstein_Day': 'RegionalAvgGoldstein'
}, inplace=True)

# Merge regional features back to country-day data
country_day_agg = country_day_agg.merge(regional_agg, on=['Region', 'Date'], how='left')

print(f"Regional spillover features created. Shape: {country_day_agg.shape}")
print("\nSample regional features:")
print(country_day_agg[['Country', 'Date', 'Region', 'RegionalConflictRate', 'ConflictRate_Day']].head(10))

## 8. Merge Features Back to Original Dataset

Join all engineered features back to the event-level dataset.

In [None]:
# Merge country-day features to original dataset
print("Merging engineered features to original dataset...")

df_features = df_clean.merge(
    country_day_agg,
    left_on=['Actor1CountryCode', 'Date'],
    right_on=['Country', 'Date'],
    how='left'
)

print(f"Features merged. New shape: {df_features.shape}")

# Drop duplicate country column
if 'Country' in df_features.columns:
    df_features.drop(columns=['Country'], inplace=True)

In [None]:
# If country-pair features exist, merge them
if 'CountryPair' in df_clean.columns:
    df_features = df_features.merge(
        pair_history,
        on='CountryPair',
        how='left'
    )
    print(f"Country-pair features merged. New shape: {df_features.shape}")

## 9. Feature Summary & Selection

In [None]:
# Display all engineered features
print("Engineered Features Summary:")
print("=" * 80)

# Categorize features
temporal_features = [col for col in df_features.columns if any(x in col for x in ['Year', 'Month', 'Day', 'Week', 'Quarter', 'Weekend'])]
rolling_features = [col for col in df_features.columns if 'Roll' in col]
lag_features = [col for col in df_features.columns if 'Lag' in col]
change_features = [col for col in df_features.columns if 'Change' in col]
escalation_features = [col for col in df_features.columns if any(x in col for x in ['Escalating', 'Spike', 'Trend', 'Intensity'])]
pair_features = [col for col in df_features.columns if 'Pair' in col]
regional_features = [col for col in df_features.columns if 'Regional' in col or col == 'Region']

print(f"\nTemporal features ({len(temporal_features)}): {temporal_features}")
print(f"\nRolling window features ({len(rolling_features)}): {rolling_features}")
print(f"\nLag features ({len(lag_features)}): {lag_features}")
print(f"\nChange/delta features ({len(change_features)}): {change_features}")
print(f"\nEscalation indicators ({len(escalation_features)}): {escalation_features}")
print(f"\nCountry-pair features ({len(pair_features)}): {pair_features}")
print(f"\nRegional features ({len(regional_features)}): {regional_features}")

total_engineered = len(temporal_features) + len(rolling_features) + len(lag_features) + len(change_features) + len(escalation_features) + len(pair_features) + len(regional_features)
print(f"\nTotal engineered features: {total_engineered}")

In [None]:
# Check for remaining missing values
print("\nMissing values in engineered features:")
print("=" * 80)

all_engineered_features = temporal_features + rolling_features + lag_features + change_features + escalation_features + pair_features + regional_features
missing_in_features = df_features[all_engineered_features].isnull().sum().sort_values(ascending=False)
missing_in_features = missing_in_features[missing_in_features > 0]

if len(missing_in_features) > 0:
    print(missing_in_features)
    print(f"\nNote: Lag and change features may have NaN for first few days (expected).")
else:
    print("No missing values in engineered features!")

## 10. Save Processed Dataset

In [None]:
# Save the feature-engineered dataset
output_path = '../data/gdelt_features.csv'

print(f"Saving processed dataset to {output_path}...")
df_features.to_csv(output_path, index=False)
print(f"Dataset saved successfully!")
print(f"Final shape: {df_features.shape}")
print(f"File size: {df_features.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

In [None]:
# Also save aggregated country-day dataset (for time series modeling)
country_day_output = '../data/gdelt_country_day_features.csv'

print(f"\nSaving country-day aggregated dataset to {country_day_output}...")
country_day_agg.to_csv(country_day_output, index=False)
print(f"Country-day dataset saved successfully!")
print(f"Shape: {country_day_agg.shape}")

## 11. Final Summary & Next Steps

In [None]:
print("\n" + "=" * 80)
print("FEATURE ENGINEERING COMPLETE")
print("=" * 80)

print(f"\nOriginal dataset: {df.shape}")
print(f"After cleaning: {df_clean.shape}")
print(f"With engineered features: {df_features.shape}")
print(f"Features added: {df_features.shape[1] - df_clean.shape[1]}")

print(f"\nTarget variable (IsConflict):")
print(f"  Conflict events: {df_features['IsConflict'].sum():,} ({df_features['IsConflict'].mean()*100:.2f}%)")
print(f"  Non-conflict events: {(df_features['IsConflict'] == 0).sum():,} ({(1-df_features['IsConflict'].mean())*100:.2f}%)")

print(f"\nOutputs saved:")
print(f"  1. Event-level features: {output_path}")
print(f"  2. Country-day features: {country_day_output}")

print(f"\n" + "=" * 80)
print("NEXT STEPS:")
print("=" * 80)
print("1. Load gdelt_features.csv in the baseline modeling notebook")
print("2. Implement time series cross-validation (train on past, test on future)")
print("3. Build baseline models: Logistic Regression, Random Forest, XGBoost")
print("4. Handle class imbalance with SMOTE or class weights")
print("5. Evaluate with Precision, Recall, F1, ROC-AUC, PR-AUC")
print("6. Analyze feature importance and iterate on features")
print("=" * 80)