# FIXME
- DRAW ANOTHER SUBSET OF 1000 SAMPLES FOR RAPID PROTOTYPING
- SAVE IT AS WELL
- THEN DEVELOP ON THIS IN THE NOTEBOOKS
- THEN RUN A BACKGROUND JOB ON THE FULL DATASET

# Exploratory Data Analysis

Here, I will just check if I can get some usable data from the GDELT 2.0 Event
Database via Google BigQuery.

So far, I haven't decided which problem to solve or what kind of model to
train, because I want to find a suitable dataset first.

Here's what I'm looking for in the data. It should:
- have time stamps and be updated regularly, so I can train an initial model
and then schedule it to run periodically and monitor it
- be sufficiently large to train a model on
- have some interesting features and a suitable target variable

During a brief search, I found the GDELT 2.0 Event Database, which is a public
and free database that contains event data from all over the world.
It seems to fulfill these requirements and is available via BigQuery.

Here, I will check if I can get some data from it and if it's suitable for my
needs.

## Environment

To use this project's uv environment, make sure you have installed it according
to the instructions in the README.md file.

Then, connect to the `.venv` kernel.
Check the path to the kernel to make sure it's the right one.
It should be `.venv/bin/python`.

Run the next cell to check if you use the correct kernel.
It should output this:

# FIXME: Once I have decided on an actual name for the repo, adapt the path!
```
<path_to_wherever_you_cloned_the_repo_to>/mlopsproject2/.venv/bin/python
```

In [1]:
!which python

/Users/fakrueg/projects/courses/datatalks/mlops-zoomcamp/mlopsproject2/.venv/bin/python


In [2]:
# Dependencies
import os
import pandas as pd
import pandas_gbq
import mlflow
import joblib
from mlflow.models import infer_signature

from google.cloud import bigquery
from dotenv import load_dotenv
from typing import Optional, Tuple
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Load environment variables
load_dotenv()

True

In [3]:
# define paths
PATH_REPO = Path(".").resolve().parent
PATH_DATA = PATH_REPO / "data" / "raw"

In [4]:
# set the MLflow tracking URI, i.e. connect to the MLflow server
mlflow.set_tracking_uri("http://127.0.0.1:5001")

# set experiment
mlflow.set_experiment("testing_setup")

<Experiment: artifact_location='mlflow-artifacts:/1', creation_time=1755362327633, experiment_id='1', last_update_time=1755362327633, lifecycle_stage='active', name='testing_setup', tags={}>

In [None]:
# BigQuery Client Setup
def setup_bigquery_client():
    """
    Set up BigQuery client using credentials file
    """
    # Check if credentials file exists
    cred_path = "../bigquery-credentials.json"
    if not Path(cred_path).exists():
        raise FileNotFoundError(f"Credentials file not found: {cred_path}")
    
    # Set environment variable for this session
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = cred_path
    
    # Get project ID from environment
    project_id = os.getenv('GOOGLE_CLOUD_PROJECT')
    if not project_id:
        raise ValueError("GOOGLE_CLOUD_PROJECT not set in .env file")
    
    # Initialize client
    client = bigquery.Client(project=project_id)
    return client

# Initialize BigQuery client
try:
    client = setup_bigquery_client()
    print("BigQuery client initialized successfully!")
    print(f"Project: {client.project}")
    # report the path that was actually used instead of a hard-coded string
    print(f"Using credentials from: {os.environ['GOOGLE_APPLICATION_CREDENTIALS']}")
except Exception as e:
    print(f"Error setting up BigQuery client: {e}")
    client = None

# FIXME: Explain why I selected these exact features.

In [None]:
def safe_gdelt_query(start_date, end_date, limit=100, dry_run=True):
    """
    Safely query GDELT data with automatic cost estimation
    
    Args:
        start_date (str): Start date in 'YYYY-MM-DD' format
        end_date (str): End date in 'YYYY-MM-DD' format  
        limit (int): Maximum number of rows to return
        dry_run (bool): If True, only estimate query cost
    """

    if client is None:
        raise ValueError("BigQuery client not initialized")
    
    # Convert dates to GDELT format (YYYYMMDD) as integers
    start_gdelt = int(start_date.replace('-', ''))
    end_gdelt = int(end_date.replace('-', ''))

    query = f"""
    SELECT 
        SQLDATE,                -- event date
        MonthYear,              -- month and year
        EventCode,
        EventBaseCode,
        EventRootCode,
        QuadClass,
        GoldsteinScale,
        Actor1Code,
        Actor1Name,
        Actor1CountryCode,
        Actor1Type1Code,
        Actor1Type2Code,
        Actor1Type3Code,
        Actor2Code,
        Actor2Name,
        Actor2CountryCode,
        Actor2Type1Code,
        Actor2Type2Code,
        Actor2Type3Code,
        ActionGeo_CountryCode,
        ActionGeo_ADM1Code,
        ActionGeo_Lat,
        ActionGeo_Long,
        ActionGeo_FeatureID,
        NumArticles             -- target variable
    FROM `gdelt-bq.gdeltv2.events`
    WHERE SQLDATE >= {start_gdelt}  -- start date
      AND SQLDATE <= {end_gdelt}    -- end date
    ORDER BY RAND()                 -- order rows randomly to get random sample
    LIMIT {limit}                   -- limit the number of rows to return
    """

    # Always do a dry run first for cost estimation
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    dry_job = client.query(query, job_config=job_config)

    bytes_processed = dry_job.total_bytes_processed
    estimated_cost = (bytes_processed / 1e12) * 5  # assumes $5 per TB; check current on-demand pricing

    print(
        f"Query will process: {bytes_processed:,} bytes "
        f"({bytes_processed/1e6:.2f} MB, "
        f"{bytes_processed/1e9:.2f} GB)."
    )
    print(f"Estimated cost: ${estimated_cost:.6f}")

    if dry_run:
        print("Dry run complete - no data retrieved")
        return None

    # Execute the actual query
    print("Executing query...")
    df = pandas_gbq.read_gbq(query, project_id=client.project, dialect='standard')

    print(f"Query completed! Retrieved {len(df)} rows")
    return df

Get 10k random rows from the GDELT events table for the year 2024.

I decided to go for a sample size of 10k rows, because that should be an
acceptable balance between speed of model training and showing it enough data.
If I go for an 80:20 train:test split, I will end up with 8k rows for training
and 2k rows for testing.
There are just 24 features and one target variable.
So the ratio of rows to features is 10000:24, or roughly 417 rows per feature.
I intend to use tree-based algorithms such as XGBoost, CatBoost and LightGBM.
They are rather data efficient, and at this ratio, it may already be enough for
acceptable performance.

Honestly, I could go for **much** more than that, but then models would
train much longer, too.
This is a subset chosen for speed of development.
At the same time, I could have gone for much less than that, but then
whatever I train would likely underperform.
So I decided to go with this as a compromise and check how well it performs.
If it performs well enough, I won't need to go for a larger subset.
If it doesn't perform well, I can at least select hyperparameters and then go
for a larger subset.
Then again, this is not a machine learning engineering course, but a machine
learning *operations* course, so I don't need to get the best possible model in
the first place.
A good model is sufficient.

In [None]:
# Start with just a dry run to check costs
test_df = safe_gdelt_query(
    '2024-01-01',
    '2024-12-31',
    limit=14000,
    dry_run=True
)

In [None]:
# Looks fine enough, so go for it
# loads a pandas df into object data_gdelt

# actually I just added this False as another layer of safety, so it doesn't
# automatically run stuff
# BigQuery can generate some costs, but at this rate it won't, because we're
# still well below the free quota of 1TB per month
if False:
    data_gdelt = safe_gdelt_query(
        '2024-01-01',
        '2024-12-31',
        limit=10000,
        dry_run=False
    )

In [None]:
# check whether the download worked
data_gdelt.head()

In [None]:
# write data to parquet, so I can re-use it later without querying again
data_gdelt.to_parquet(
    PATH_DATA / "gdelt_events_2024_subset_10k_full.parquet",
    index=False
)

# FIXME: CHECKPOINT

In [5]:
# load data from parquet again
data_gdelt = pd.read_parquet(
    PATH_DATA / "gdelt_events_2024_subset_10k_full.parquet"
)

## Split data into train and test

Split the data first to prevent data leakage.
Make a truly unseen hold out test set, which will not be used for training or
validation at all.
It will only be used to evaluate one single final model in the very end.

I will use an 80:20 split for training and testing.
This will leave me with 8k rows for training and 2k rows for testing.
For development, I will use 5-fold cross validation.
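
To make the split sizes concrete, here is a minimal sketch of what 5-fold cross validation on the 8k training rows looks like. It only uses row indices as a stand-in for the real data, and the choice of `shuffle=True` with a fixed seed is my assumption, not something decided above.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in for the 8,000-row training split (indices only, no real GDELT data).
n_train_rows = 8_000
row_indices = np.arange(n_train_rows)

# Assumption: shuffle with a fixed seed so the folds are reproducible.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
for fold, (fit_idx, val_idx) in enumerate(kfold.split(row_indices), start=1):
    fold_sizes.append((len(fit_idx), len(val_idx)))
    print(f"Fold {fold}: fit on {len(fit_idx)} rows, validate on {len(val_idx)} rows")
```

Each fold fits on four fifths of the training split and validates on the remaining fifth, while the 2k hold-out rows stay untouched.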

In [6]:
# split data into train and test
train_df, test_df = train_test_split(
    data_gdelt,
    test_size=0.2,
    random_state=42
)

# check the data
train_df.head()

Unnamed: 0,SQLDATE,MonthYear,EventCode,EventBaseCode,EventRootCode,QuadClass,GoldsteinScale,Actor1Code,Actor1Name,Actor1CountryCode,...,Actor2CountryCode,Actor2Type1Code,Actor2Type2Code,Actor2Type3Code,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,NumArticles
9254,20240328,202403,50,50,5,1,3.5,BUS,COMPANIES,,...,,,,,SO,SO,6.0,48.0,SO,8
1561,20240515,202405,40,40,4,1,1.0,LEG,REPRESENTATIVES,,...,FRA,GOV,,,FR,FR00,48.8667,2.33333,-1456928,3
1670,20240405,202404,36,36,3,1,4.0,ESP,SPAIN,ESP,...,,GOV,,,SP,SP,40.0,-4.0,SP,10
6087,20240209,202402,114,114,11,3,-2.0,USA,NORTH CAROLINA,USA,...,,LEG,,,US,USNC,35.6411,-79.8431,NC,4
6669,20240815,202408,42,42,4,1,1.9,TZA,TANZANIA,TZA,...,KEN,,,,TZ,TZ,-6.0,35.0,TZ,1


In [7]:
# save both train and test data
train_df.to_parquet(
    PATH_DATA / "gdelt_events_2024_subset_10k_train.parquet",
    index=False
)

test_df.to_parquet(
    PATH_DATA / "gdelt_events_2024_subset_10k_test.parquet",
    index=False
)

Now I can work on the train split without risking leaking any information.

Check for missing values first.
If any column has a lot of missing values, I will drop it.
Columns with only a low percentage of missing values may be imputed if there is
a meaningful way to do so.

In [8]:
# check for missing values
train_df.isnull().mean()

SQLDATE                  0.000000
MonthYear                0.000000
EventCode                0.000000
EventBaseCode            0.000000
EventRootCode            0.000000
QuadClass                0.000000
GoldsteinScale           0.000000
Actor1Code               0.102375
Actor1Name               0.102375
Actor1CountryCode        0.460875
Actor1Type1Code          0.551750
Actor1Type2Code          0.975000
Actor1Type3Code          0.999250
Actor2Code               0.304250
Actor2Name               0.304250
Actor2CountryCode        0.566375
Actor2Type1Code          0.671500
Actor2Type2Code          0.983375
Actor2Type3Code          0.999125
ActionGeo_CountryCode    0.030625
ActionGeo_ADM1Code       0.030625
ActionGeo_Lat            0.032375
ActionGeo_Long           0.031750
ActionGeo_FeatureID      0.030625
NumArticles              0.000000
dtype: float64

For most machine learning projects, it’s reasonable to drop columns with more 
than 50% missing values, especially if there are plenty of other features.
High missingness usually means the feature will be hard to impute reliably and
won’t add robust predictive power.

Columns to drop:
- Actor1Type2Code (97.5%)
- Actor1Type3Code (99.9%)
- Actor2Type2Code (98.3%)
- Actor2Type3Code (99.9%)
- Actor1Type1Code (55.2%)
- Actor1CountryCode (46.1%, still too high for my taste and difficult to impute)
- Actor2CountryCode (56.6%)
- Actor2Type1Code (67.2%)
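
A selection like the one above could also be derived programmatically. The following is only a sketch on a toy frame with made-up column names; the 0.5 cutoff is the rule of thumb from the text, and borderline cases (such as a column at 46%) would still need a manual decision.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the column names are hypothetical.
toy_train = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, "GOV"],
    "some_missing": ["USA", None, "FRA", "KEN"],
    "complete": [1, 2, 3, 4],
})

# Share of missing values per column, computed on the training split only.
missing_share = toy_train.isnull().mean()

# Rule-of-thumb cutoff; not a fixed requirement.
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
to_impute = missing_share[
    (missing_share > 0) & (missing_share <= threshold)
].index.tolist()

print("drop:", to_drop)
print("impute:", to_impute)
```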

One option for imputation would be to use the mode, i.e. the most frequent
value.
However, this is data from global events.
Imputation always means making up data and hoping it's a good guess.
For numerical data, the mean or median is often a reasonable guess.
However, I am afraid that in this case it may not make much sense for some of
the columns.
For example, if the most frequent value is "USA", it would be filled in for
all rows where the value is missing.
But perhaps there is a good reason why the value is missing.
For example, a missing value may mean that the event is not related to a
country.
Because of this, I will treat the missing values as missing by introducing a
new category for unknown.
I will have to check how GDELT encodes this in general and which value can be
used for it.
Perhaps 0 is a good value, provided it is not already taken by anything else.
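
The difference between the two strategies can be illustrated with a small sketch. The series below is toy data, and the `"UNKNOWN"` placeholder is just one possible choice for the new category.

```python
import pandas as pd

# Toy column standing in for a GDELT country-code column.
country = pd.Series(["USA", "USA", None, "FRA", None], name="country_code")

# Option 1: mode imputation turns every gap into the most frequent value.
mode_filled = country.fillna(country.mode().iloc[0])

# Option 2: an explicit placeholder keeps "missing" distinguishable.
unknown_filled = country.fillna("UNKNOWN")

print(mode_filled.tolist())
print(unknown_filled.tolist())
```

With mode imputation, the two gaps become indistinguishable from genuine "USA" rows, whereas the placeholder keeps them as their own group.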


The column "SQLDATE" is a date in the format YYYYMMDD.
That integer is likely not very useful for modeling, so I will convert it to
more informative features such as year, month, day of year, day of week, and
whether it is a weekend or not.

I will also drop the intermediate date column.

In [9]:
# drop columns with roughly 50% or more missing values

# define columns to drop
columns_to_drop = [
    "Actor1Type2Code",
    "Actor1Type3Code",
    "Actor2Type2Code",
    "Actor2Type3Code",
    "Actor1Type1Code",
    "Actor1CountryCode",
    "Actor2CountryCode",
    "Actor2Type1Code"
]

# drop columns
train_df = train_df.drop(columns=columns_to_drop)

In [10]:
# check data again to get updated overview
print(train_df.info())
train_df.head()

<class 'pandas.core.frame.DataFrame'>
Index: 8000 entries, 9254 to 7270
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   SQLDATE                8000 non-null   Int64  
 1   MonthYear              8000 non-null   Int64  
 2   EventCode              8000 non-null   object 
 3   EventBaseCode          8000 non-null   object 
 4   EventRootCode          8000 non-null   object 
 5   QuadClass              8000 non-null   Int64  
 6   GoldsteinScale         8000 non-null   float64
 7   Actor1Code             7181 non-null   object 
 8   Actor1Name             7181 non-null   object 
 9   Actor2Code             5566 non-null   object 
 10  Actor2Name             5566 non-null   object 
 11  ActionGeo_CountryCode  7755 non-null   object 
 12  ActionGeo_ADM1Code     7755 non-null   object 
 13  ActionGeo_Lat          7741 non-null   float64
 14  ActionGeo_Long         7746 non-null   float64
 15  Action

Unnamed: 0,SQLDATE,MonthYear,EventCode,EventBaseCode,EventRootCode,QuadClass,GoldsteinScale,Actor1Code,Actor1Name,Actor2Code,Actor2Name,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,NumArticles
9254,20240328,202403,50,50,5,1,3.5,BUS,COMPANIES,,,SO,SO,6.0,48.0,SO,8
1561,20240515,202405,40,40,4,1,1.0,LEG,REPRESENTATIVES,FRAGOV,FRENCH,FR,FR00,48.8667,2.33333,-1456928,3
1670,20240405,202404,36,36,3,1,4.0,ESP,SPAIN,GOV,PRIME MINISTER,SP,SP,40.0,-4.0,SP,10
6087,20240209,202402,114,114,11,3,-2.0,USA,NORTH CAROLINA,LEG,CONGRESS,US,USNC,35.6411,-79.8431,NC,4
6669,20240815,202408,42,42,4,1,1.9,TZA,TANZANIA,KEN,NAIROBI,TZ,TZ,-6.0,35.0,TZ,1


Columns in need of imputation and the strategy for each:
- Actor1Code
    - Dtype: object
    - Example: "USA"
    - Strategy: Fill with "UNKNOWN"
- Actor1Name
    - Dtype: object
    - Example: "NORTH CAROLINA"
    - Strategy: Fill with "UNKNOWN"
- Actor2Code
    - Dtype: object
    - Example: "GOV"
    - Strategy: Fill with "UNKNOWN"
- Actor2Name
    - Dtype: object
    - Example: "PRIME MINISTER"
    - Strategy: Fill with "UNKNOWN"
- ActionGeo_CountryCode
    - Dtype: object
    - Example: "FR"
    - Strategy: Fill with "UNKNOWN"
- ActionGeo_ADM1Code
    - Dtype: object
    - Example: "FR00"
    - Strategy: Fill with "UNKNOWN"
- ActionGeo_FeatureID
    - Dtype: object
    - Example: "TZ"
    - Strategy: Fill with "UNKNOWN"

There are two more columns in need of imputation and strategy:
ActionGeo_Lat and ActionGeo_Long, which are latitude and longitude of the event.

Two options:
- Impute with a value far outside the valid range (e.g., 999 for both latitude
and longitude) or with a special flag value (e.g., -999, if the ML library
supports it).
This makes it clear to the model and to downstream analysis that the location
is missing, not just "somewhere ordinary."
- Alternatively, it would be possible to use the mean or median
latitude/longitude, but this risks misleading the model into treating all
missing-location events as if they happened in a single place, which is
generally undesirable for geospatial modeling.

I will go with the first option, so here's the plan:
- ActionGeo_Lat
    - Dtype: float64
    - Example: 48.8667
    - Strategy: Impute with 999
- ActionGeo_Long
    - Dtype: float64
    - Example: 2.33333
    - Strategy: Impute with 999
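
To make the trade-off between the two options concrete, here is a small sketch on toy latitudes (three values taken from the sample rows above plus one gap). It is only an illustration of the idea, not a statement about how any specific library treats the sentinel.

```python
import numpy as np
import pandas as pd

# Toy latitudes; the NaN stands for an event without a location.
lat = pd.Series([48.8667, -6.0, np.nan, 35.6411], name="ActionGeo_Lat")

# Sentinel outside the valid latitude range of [-90, 90].
sentinel_filled = lat.fillna(999)

# Mean imputation lands on a plausible-looking but made-up location.
mean_filled = lat.fillna(lat.mean())

# A single threshold is enough to isolate the sentinel rows again.
is_missing_after_sentinel = sentinel_filled > 90

print(sentinel_filled.tolist())
print(mean_filled.round(2).tolist())
print(is_missing_after_sentinel.tolist())
```

A tree-based model can separate the sentinel rows with one split, while mean-imputed rows blend in with real locations.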

In [11]:
# impute missing values

# define imputation strategy
imputation_strategy = {
    "Actor1Code": "UNKNOWN",
    "Actor1Name": "UNKNOWN",
    "Actor2Code": "UNKNOWN",
    "Actor2Name": "UNKNOWN",
    "ActionGeo_CountryCode": "UNKNOWN",
    "ActionGeo_ADM1Code": "UNKNOWN",
    "ActionGeo_FeatureID": "UNKNOWN",
    "ActionGeo_Lat": 999,
    "ActionGeo_Long": 999,
}

# fill missing values with new
train_df.fillna(imputation_strategy, inplace=True)

# check result
print(train_df.isna().mean())
train_df.head()

SQLDATE                  0.0
MonthYear                0.0
EventCode                0.0
EventBaseCode            0.0
EventRootCode            0.0
QuadClass                0.0
GoldsteinScale           0.0
Actor1Code               0.0
Actor1Name               0.0
Actor2Code               0.0
Actor2Name               0.0
ActionGeo_CountryCode    0.0
ActionGeo_ADM1Code       0.0
ActionGeo_Lat            0.0
ActionGeo_Long           0.0
ActionGeo_FeatureID      0.0
NumArticles              0.0
dtype: float64


Unnamed: 0,SQLDATE,MonthYear,EventCode,EventBaseCode,EventRootCode,QuadClass,GoldsteinScale,Actor1Code,Actor1Name,Actor2Code,Actor2Name,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,NumArticles
9254,20240328,202403,50,50,5,1,3.5,BUS,COMPANIES,UNKNOWN,UNKNOWN,SO,SO,6.0,48.0,SO,8
1561,20240515,202405,40,40,4,1,1.0,LEG,REPRESENTATIVES,FRAGOV,FRENCH,FR,FR00,48.8667,2.33333,-1456928,3
1670,20240405,202404,36,36,3,1,4.0,ESP,SPAIN,GOV,PRIME MINISTER,SP,SP,40.0,-4.0,SP,10
6087,20240209,202402,114,114,11,3,-2.0,USA,NORTH CAROLINA,LEG,CONGRESS,US,USNC,35.6411,-79.8431,NC,4
6669,20240815,202408,42,42,4,1,1.9,TZA,TANZANIA,KEN,NAIROBI,TZ,TZ,-6.0,35.0,TZ,1


Great! Now there are no missing values in the train_df.
I hope this approach makes sense.
The only way to find out is to try it out.

## Take care of the columns for date

There is a column called "SQLDATE" which is a date in the format YYYYMMDD.
This is not very useful for modeling, so I will convert it to more informative
features such as year, month, day of year, day of week, and whether it is a
weekend or not.

I will also drop the intermediate date column.

Beyond this, there is a second column called "MonthYear" which seems to contain
redundant information.
It should probably be dropped.
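
Before dropping "MonthYear", the redundancy could be verified rather than assumed. Here is a minimal sketch using two rows copied from the sample shown earlier; on the real data, the same check would run on `train_df`.

```python
import pandas as pd

# Two rows copied from the training sample shown earlier in this notebook.
sample = pd.DataFrame({
    "SQLDATE": [20240328, 20240515],
    "MonthYear": [202403, 202405],
})

# Integer-dividing YYYYMMDD by 100 strips the day and leaves YYYYMM.
derived_month_year = sample["SQLDATE"] // 100

# True only if every row's MonthYear matches the value derived from SQLDATE.
month_year_is_redundant = bool((derived_month_year == sample["MonthYear"]).all())
print(month_year_is_redundant)
```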

In [12]:
# Convert to datetime first
train_df['date'] = pd.to_datetime(train_df['SQLDATE'], format='%Y%m%d')

# Extract useful components
train_df['year'] = train_df['date'].dt.year
train_df['month'] = train_df['date'].dt.month
train_df['day_of_year'] = train_df['date'].dt.dayofyear  # 1-366 (2024 is a leap year)
train_df['day_of_week'] = train_df['date'].dt.dayofweek  # 0=Monday, 6=Sunday
train_df['is_weekend'] = train_df['day_of_week'].isin([5, 6]).astype(int)

# Drop the intermediate date column and the original date related columns
train_df = train_df.drop(["date", "SQLDATE", "MonthYear"], axis=1)

## Check if I need to One-Hot-Encode the categorical features

Right now, many of the categorical columns use integers to encode the values.
While this will probably work, it also introduces an order to the values.
A model may learn some patterns from this that don't really exist.

However, one-hot-encoding will increase the number of features by a lot.
This may be a problem if the number of features is too high.

Check which columns should not have an order and whether one-hot encoding is
feasible here by looking at the number of unique values in each column, then
decide.

In [13]:
train_df.nunique()

EventCode                 169
EventBaseCode             119
EventRootCode              20
QuadClass                   4
GoldsteinScale             42
Actor1Code                701
Actor1Name               1191
Actor2Code                605
Actor2Name               1052
ActionGeo_CountryCode     196
ActionGeo_ADM1Code       1090
ActionGeo_Lat            2118
ActionGeo_Long           2204
ActionGeo_FeatureID      2301
NumArticles                30
year                        1
month                      12
day_of_year               366
day_of_week                 7
is_weekend                  2
dtype: int64

In [14]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8000 entries, 9254 to 7270
Data columns (total 20 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   EventCode              8000 non-null   object 
 1   EventBaseCode          8000 non-null   object 
 2   EventRootCode          8000 non-null   object 
 3   QuadClass              8000 non-null   Int64  
 4   GoldsteinScale         8000 non-null   float64
 5   Actor1Code             8000 non-null   object 
 6   Actor1Name             8000 non-null   object 
 7   Actor2Code             8000 non-null   object 
 8   Actor2Name             8000 non-null   object 
 9   ActionGeo_CountryCode  8000 non-null   object 
 10  ActionGeo_ADM1Code     8000 non-null   object 
 11  ActionGeo_Lat          8000 non-null   float64
 12  ActionGeo_Long         8000 non-null   float64
 13  ActionGeo_FeatureID    8000 non-null   object 
 14  NumArticles            8000 non-null   Int64  
 15  year  

Dang! Those are a lot of unique values in the categorical columns.
The cardinality of these features is way too high for one-hot encoding, as it
would blow up the feature space and hurt both memory and model generalization.
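
To put a number on this, the following sketch adds up the unique-value counts of the ten categorical columns, copied from the `train_df.nunique()` output above. It assumes plain one-hot encoding with one indicator column per category.

```python
# Unique-value counts copied from the `train_df.nunique()` output above.
unique_counts = {
    "EventCode": 169,
    "EventBaseCode": 119,
    "EventRootCode": 20,
    "Actor1Code": 701,
    "Actor1Name": 1191,
    "Actor2Code": 605,
    "Actor2Name": 1052,
    "ActionGeo_CountryCode": 196,
    "ActionGeo_ADM1Code": 1090,
    "ActionGeo_FeatureID": 2301,
}

# One indicator column per category (no category dropped).
one_hot_columns = sum(unique_counts.values())
n_train_rows = 8000

print(f"One-hot columns: {one_hot_columns}")  # 7444
print(f"Rows per one-hot column: {n_train_rows / one_hot_columns:.2f}")
```

That would be almost as many columns as there are training rows.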

So, instead, I will rely on the categorical handling of the ML algorithms I aim
to use: XGBoost, CatBoost, and LightGBM.
- All three can work with categorical features that are mapped to integer codes
(label encoding), provided the columns are declared as categorical to the
library. Otherwise, the codes are treated as ordinary ordered numbers.
- CatBoost and LightGBM use more advanced category handling under the hood
(CatBoost, for example, relies on target statistics).
- Each unique category gets a unique integer, including `"UNKNOWN"` for missing
values.

Here's the plan:
I will use the `OrdinalEncoder` from `sklearn` to encode the categorical columns.
This is an encoder object that can be fitted on the training set, saved to a
file, and then applied to train, test, and any new data, too.
This is important as the encoding must be the same for train data and any new
data the model is queried on, including the test data.
If, on the other hand, the encoding is different, the model will not be able to
make meaningful predictions.
It also supports handling unknown values.
For example, if there are categories in the test data that were not seen in the
training data, the encoder will assign them a value of choice.
I will use -1 for unknown values to signal the algorithm it's a new category it
wasn't trained on.
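
Here is a minimal sketch of that behavior on toy data. The frames and the variable name `encoder_demo` are made up for illustration; the encoder settings are the ones described above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy train/test frames; "KEN" appears only in the test data.
train_toy = pd.DataFrame({"Actor1Code": ["USA", "FRA", "UNKNOWN", "USA"]})
test_toy = pd.DataFrame({"Actor1Code": ["FRA", "KEN"]})

encoder_demo = OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)

# Fit on the training data only, then reuse the fitted mapping for the test data.
train_codes = encoder_demo.fit_transform(train_toy)
test_codes = encoder_demo.transform(test_toy)

print(encoder_demo.categories_)
print(train_codes.ravel().tolist())
print(test_codes.ravel().tolist())
```

Note that the `"UNKNOWN"` placeholder from the imputation step is a regular category with its own code, while -1 is reserved for categories that never appeared during fitting.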

Usually, it would be important to save the encoder to a file.
Here, however, I only develop the parts needed for a rapid prototype and to
work out the concept.
Once everything works, I will refactor and export this to a script.
There, I will save the encoder.
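
For reference, saving and restoring a fitted encoder could look roughly like this. It is a sketch under assumptions: the file name is hypothetical, and a temporary directory stands in for wherever the script will eventually store artifacts.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.preprocessing import OrdinalEncoder

# Fit a tiny encoder as a stand-in for the real one.
fitted = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
fitted.fit([["USA"], ["FRA"]])

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, for illustration only.
    encoder_file = Path(tmp_dir) / "ordinal_encoder_demo.pkl"
    joblib.dump(fitted, encoder_file)
    restored = joblib.load(encoder_file)

# The restored encoder applies the same mapping, including the -1 fallback.
restored_codes = restored.transform([["USA"], ["KEN"]])
print(restored_codes.ravel().tolist())
```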

In [15]:
# list the categorical columns
categorical_cols = [
    'EventCode',
    'EventBaseCode',
    'EventRootCode',
    'Actor1Code',
    'Actor1Name',
    'Actor2Code',
    'Actor2Name',
    'ActionGeo_CountryCode',
    'ActionGeo_ADM1Code',
    'ActionGeo_FeatureID'
]

# initialize an encoder for categorical data
# encodes categorical data as integers (example "USA" may get 1 or whatever)
# use -1 for unknown values to signal algorithm it's a new category it wasn't trained on
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

# fit encoder on train set and transform it right away
train_df[categorical_cols] = encoder.fit_transform(train_df[categorical_cols])

# check the result
train_df.head()

Unnamed: 0,EventCode,EventBaseCode,EventRootCode,QuadClass,GoldsteinScale,Actor1Code,Actor1Name,Actor2Code,Actor2Name,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,NumArticles,year,month,day_of_year,day_of_week,is_weekend
9254,50.0,34.0,4.0,1,3.5,63.0,248.0,541.0,989.0,160.0,793.0,6.0,48.0,2264.0,8,2024,3,88,3,0
1561,43.0,27.0,3.0,1,1.0,363.0,905.0,153.0,332.0,58.0,253.0,48.8667,2.33333,188.0,3,2024,5,136,2,0
1670,39.0,23.0,2.0,1,4.0,164.0,1006.0,182.0,758.0,161.0,797.0,40.0,-4.0,2265.0,10,2024,4,96,4,0
6087,104.0,71.0,10.0,3,-2.0,621.0,775.0,316.0,201.0,180.0,1017.0,35.6411,-79.8431,2224.0,4,2024,2,40,4,0
6669,45.0,29.0,3.0,1,1.9,601.0,1056.0,294.0,635.0,175.0,884.0,-6.0,35.0,2279.0,1,2024,8,228,3,0


In [16]:
# check the distribution of these new values
train_df.describe()

Unnamed: 0,EventCode,EventBaseCode,EventRootCode,QuadClass,GoldsteinScale,Actor1Code,Actor1Name,Actor2Code,Actor2Name,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,NumArticles,year,month,day_of_year,day_of_week,is_weekend
count,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0
mean,63.1965,43.704875,5.925875,1.813375,0.471487,363.219875,703.47175,369.439625,699.0385,122.17675,626.614875,63.724785,35.278839,1482.92225,5.6105,2024.0,6.388125,179.6215,2.70625,0.19425
std,47.814056,33.158543,5.560777,1.132456,4.801816,206.078193,362.997321,186.551957,321.558717,57.680394,334.660121,172.290927,188.447514,753.310752,4.907347,0.0,3.436599,105.059388,1.865113,0.395647
min,0.0,0.0,0.0,1.0,-10.0,0.0,0.0,0.0,0.0,0.0,0.0,-45.25,-176.533,0.0,1.0,2024.0,1.0,1.0,0.0,0.0
25%,39.0,23.0,2.0,1.0,-2.0,193.0,405.0,182.0,434.0,82.0,362.0,27.8333,-73.9662,852.0,2.0,2024.0,3.0,87.0,1.0,0.0
50%,49.0,33.0,3.0,1.0,1.0,347.5,768.0,456.0,810.0,140.0,652.0,37.768,21.0,1689.5,4.0,2024.0,6.0,177.0,3.0,0.0
75%,96.0,67.0,10.0,3.0,3.4,608.0,1108.0,541.0,989.0,179.0,977.0,47.0,46.7728,2181.0,10.0,2024.0,9.0,271.0,4.0,0.0
max,168.0,118.0,19.0,4.0,10.0,700.0,1190.0,604.0,1051.0,195.0,1089.0,999.0,999.0,2300.0,140.0,2024.0,12.0,366.0,6.0,1.0


## No scaling or normalization is needed

The range of the features differs a lot.
For example, `QuadClass` ranges from 1 to 4, while `ActionGeo_FeatureID` ranges
from 0 to 2300.
Many algorithms would suffer from this.

However, I already decided to use tree-based models only.
This is actually for a personal reason.
I will have to use these algorithms for a different project in the future, so I
already want to gather some experience and get familiar with them.

Tree-based models like XGBoost, CatBoost, and LightGBM do not require feature
scaling or normalization for either categorical features (integer encoded) or
numerical features.

These algorithms split data based on feature values and thresholds, not on
Euclidean distance, so scaling has no effect on their performance or accuracy.

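Label-encoded categorical variables are treated as distinct categories only if
the columns are declared as categorical; otherwise, they are split on like any
other numeric feature.
Either way, their integer value range does not call for scaling.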

The models can be trained directly with the current integer and numeric features.

If I had used different algorithms, I would have to check if scaling or
normalization is needed.
Here, however, this is not the case.
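
The claim that scaling doesn't matter for tree splits can be checked with a quick experiment. The sketch below uses a single scikit-learn decision tree on synthetic data as a stand-in for the boosting libraries, so it illustrates the principle rather than proving it for XGBoost, CatBoost, or LightGBM specifically.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: three integer-valued features and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.integers(0, 100, size=(200, 3)).astype(float)
y = 2.0 * X[:, 0] + rng.normal(size=200)

# Rescale and shift the first feature only.
X_scaled = X.copy()
X_scaled[:, 0] = X_scaled[:, 0] * 1000.0 + 5.0

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_scaled, y)

# Both trees partition the rows identically, so predictions should match.
same_predictions = np.allclose(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print(same_predictions)
```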

## Divide into features and target

What's left now is to extract the features and the target.
The target is the number of articles in the media `NumArticles`.
The features are all the other columns.

In [17]:
# get features and check result
X_train = train_df.drop(columns=["NumArticles"])
X_train.head()

Unnamed: 0,EventCode,EventBaseCode,EventRootCode,QuadClass,GoldsteinScale,Actor1Code,Actor1Name,Actor2Code,Actor2Name,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,year,month,day_of_year,day_of_week,is_weekend
9254,50.0,34.0,4.0,1,3.5,63.0,248.0,541.0,989.0,160.0,793.0,6.0,48.0,2264.0,2024,3,88,3,0
1561,43.0,27.0,3.0,1,1.0,363.0,905.0,153.0,332.0,58.0,253.0,48.8667,2.33333,188.0,2024,5,136,2,0
1670,39.0,23.0,2.0,1,4.0,164.0,1006.0,182.0,758.0,161.0,797.0,40.0,-4.0,2265.0,2024,4,96,4,0
6087,104.0,71.0,10.0,3,-2.0,621.0,775.0,316.0,201.0,180.0,1017.0,35.6411,-79.8431,2224.0,2024,2,40,4,0
6669,45.0,29.0,3.0,1,1.9,601.0,1056.0,294.0,635.0,175.0,884.0,-6.0,35.0,2279.0,2024,8,228,3,0


In [18]:
# get target and check result
y_train = train_df["NumArticles"]
y_train.head()

9254     8
1561     3
1670    10
6087     4
6669     1
Name: NumArticles, dtype: Int64

Amazing! I'd say everything looks great.
This should be sufficient.

## Collect the steps and refactor them into a function

This exact same logic must be applied to the test_df, too, so the model will be
able to make meaningful predictions.

I will collect the steps and refactor them into a function.
This function can not only be used for the test_df, but it will also be useful
for later when I export this notebook as a script.

Make sure to learn from the train_df only and then apply whatever was learned
to the test_df.
Never learn from the full dataset or from the test_df itself.
Examples of this are checking the number of missing values to decide which
columns to drop, and scaling the data in case it is needed.

Make sure to fit encoding (like category codes) on the training set, then apply
to the test set to avoid data leakage.

Make sure train and test columns are in the same order.

Save the data to parquet files.
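
The column-order requirement mentioned above can be enforced with a small check. This is a sketch on toy frames with a few of the feature names from this notebook; the variable names are made up for illustration.

```python
import pandas as pd

# Toy feature frames; the test frame has the same columns in a different order.
X_train_toy = pd.DataFrame({"month": [3, 5], "QuadClass": [1, 3], "is_weekend": [0, 1]})
X_test_toy = pd.DataFrame({"is_weekend": [0], "month": [8], "QuadClass": [1]})

# Fail early if the test data lacks a column the model was trained on.
missing_cols = set(X_train_toy.columns) - set(X_test_toy.columns)
if missing_cols:
    raise ValueError(f"Test data is missing columns: {sorted(missing_cols)}")

# Reorder the test columns to match the training order exactly.
X_test_aligned = X_test_toy[X_train_toy.columns]
print(list(X_test_aligned.columns))
```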

In [56]:
# COLLECTION OF STEPS
# FIXME: TURN THIS INTO A FUNCTION
# IT MUST TAKE A DATA SET
# INFO IF IT IS TRAIN OR TEST
# DEPENDING ON WHICH SPLIT IT IS, DO DIFFERENT THINGS
# TRAIN: FITS AND SAVES AND OPTIMALLY LOGS AN ENCODER
# TEST OR RATHER JUST NOT TRAIN: TAKES / LOADS THE ENCODER AND JUST APPLIES IT TO THE DATA
# RETURN FEATURES AND LABELS

def save_series_as_parquet(series, filepath):
    """Save a pandas Series as parquet file"""
    # Convert Series to DataFrame with proper column name
    df = series.to_frame(name=series.name if series.name else 'target')
    df.to_parquet(filepath)

def prepare_data(
    df: pd.DataFrame,
    is_train: bool,
    encoder: Optional[OrdinalEncoder] = None,
    encoder_path: Optional[str] = None,
    save_data: bool = False,
    save_path_dir: Optional[str] = None,
    path_repo: Optional[str] = None,
) -> Tuple[pd.DataFrame, pd.Series]:
    """
    Prepare data for training or testing.
    """
    
    # Constants

    # define columns to drop
    columns_to_drop = [
        "Actor1Type2Code",
        "Actor1Type3Code",
        "Actor2Type2Code",
        "Actor2Type3Code",
        "Actor1Type1Code",
        "Actor1CountryCode",
        "Actor2CountryCode",
        "Actor2Type1Code"
    ]

    # define imputation strategy for columns with missing values
    imputation_strategy = {
        "Actor1Code": "UNKNOWN",
        "Actor1Name": "UNKNOWN",
        "Actor2Code": "UNKNOWN",
        "Actor2Name": "UNKNOWN",
        "ActionGeo_CountryCode": "UNKNOWN",
        "ActionGeo_ADM1Code": "UNKNOWN",
        "ActionGeo_FeatureID": "UNKNOWN",
        "ActionGeo_Lat": 999,
        "ActionGeo_Long": 999,
    }

    # define the categorical columns for encoding
    categorical_cols = [
        'EventCode',
        'EventBaseCode',
        'EventRootCode',
        'Actor1Code',
        'Actor1Name',
        'Actor2Code',
        'Actor2Name',
        'ActionGeo_CountryCode',
        'ActionGeo_ADM1Code',
        'ActionGeo_FeatureID'
    ]
    
    # target value or rather label
    target_label = "NumArticles"
    
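    # path to intermediate data
    if save_data == True:
        # fail early with a clear message instead of a TypeError from Path(None)
        if path_repo is None:
            raise ValueError("'path_repo' must be provided if save_data is True")
        PATH_DATA = Path(path_repo) / "data/intermediate"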


    # Handle missing values
    # drop columns with roughly 50% or more missing values
    # at some point, imputation doesn't make sense anymore, so drop them
    df = df.drop(columns=columns_to_drop)

    # impute missing values with new category for unknown
    df.fillna(imputation_strategy, inplace=True)


    # Handle time and date columns
    # convert column "SQLDATE" to more meaningful date and time info
    # this will allow models to learn from it better

    # convert to datetime first
    df['date'] = pd.to_datetime(df['SQLDATE'], format='%Y%m%d')

    # extract useful components
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['day_of_year'] = df['date'].dt.dayofyear  # 1-366 (366 in leap years such as 2024)
    df['day_of_week'] = df['date'].dt.dayofweek  # 0=Monday, 6=Sunday
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

    # drop the intermediate date column and the original date related columns
    df = df.drop(["date", "SQLDATE", "MonthYear"], axis=1)


    # Handle categorical data by numerical encoding
    
    # if train data is passed, fit_transform and save encoder
    if is_train == True:
        
        # initialize an encoder for categorical data
        # encodes categorical data as integers
        # example "USA" may get 1 or whatever
        # use -1 for unknown values
        # this signals algorithm it's a new category it wasn't trained on
        encoder = OrdinalEncoder(
            handle_unknown='use_encoded_value',
            unknown_value=-1
        )
        
        # fit encoder on train set and transform the data right away
        df[categorical_cols] = encoder.fit_transform(df[categorical_cols])
        
        # if data is supposed to be saved, save encoder to file
        if save_data == True:
            encoder_path = 'ordinal_encoder_prototype.pkl' 
            joblib.dump(encoder, encoder_path)

            # log as part of a custom model (for later use with ML model)
            with mlflow.start_run():
                # log just the encoder as artifact for now
                mlflow.log_artifact(encoder_path, artifact_path='preprocessing')

                # clean up local file after logging
                # so it's in just one location: artifact store
                os.remove(encoder_path)


    # if test data is passed, load the encoder and transform test data
    elif is_train == False:
        
        # we need the run_id to locate the encoder in the artifact store
        if encoder is None and encoder_path is None:
            raise ValueError("For test data, either 'encoder' object or 'encoder_path' (run_id) must be provided")
        
        # if encoder object is passed directly, use it
        if encoder is not None:
            loaded_encoder = encoder
        
        # if encoder_path (run_id) is provided, load from artifact store
        elif encoder_path is not None:
            # download the encoder from MLflow artifact store
            # encoder_path should be the run_id where the encoder was logged
            artifact_path = mlflow.artifacts.download_artifacts(
                f"runs:/{encoder_path}/preprocessing/ordinal_encoder_prototype.pkl"
            )
            # load the encoder from the downloaded file
            loaded_encoder = joblib.load(artifact_path)
            
            # clean up: delete the temporary downloaded file
            os.remove(artifact_path)
        
        # apply the loaded encoder to transform test data (only transform, no fitting)
        df[categorical_cols] = loaded_encoder.transform(df[categorical_cols])


    # Extract features and labels
    # get features
    X = df.drop(columns=[target_label])

    # get target
    y = df[target_label]
    
    
    # Save data to parquet files if desired
    if save_data == True:
        
        # ensure the intermediate data directory exists
        PATH_DATA.mkdir(parents=True, exist_ok=True)
        
        # if path to directory to save data to was passed, use it
        if save_path_dir is not None:
            save_name = PATH_DATA / save_path_dir
        # if not passed, use a default path
        else:
            # count number of directories in intermediate data dir
            num_dirs = sum(1 for p in PATH_DATA.iterdir() if p.is_dir())
            save_name = PATH_DATA / f"gdelt_events_2024_subset_version_{num_dirs}"
            
        # ensure the save directory exists
        save_name.mkdir(parents=True, exist_ok=True)
            
        # switch between train and test features and labels
        if is_train == True:
            X_name = "X_train"
            y_name = "y_train"
        else:
            X_name = "X_test"
            y_name = "y_test"
            
        # save data to parquet files
        X.to_parquet(save_name / f"{X_name}.parquet")
        save_series_as_parquet(y, save_name / f"{y_name}.parquet")

    # always return features and labels
    # return encoder if train was used
    if is_train == True:
        return X, y, encoder
    else:
        return X, y

In [22]:
# make copy of previously processed train data to compare if output is same
train_df_backup = train_df.copy()
X_train_backup = X_train.copy()
y_train_backup = y_train.copy()

In [25]:
# briefly check if backup worked
print(train_df_backup.equals(train_df))
print(X_train_backup.equals(X_train))
print(y_train_backup.equals(y_train))

True
True
True


In [26]:
# load train data again fresh from file, because it was already processed
train_df = pd.read_parquet(PATH_DATA / "gdelt_events_2024_subset_10k_train.parquet")

In [29]:
# test the function for train data
X_train_new, y_train_new, encoder_new = prepare_data(
    df = train_df,
    is_train = True,
    save_data = False,
)

In [43]:
# compare column names
print("Column names comparison:")
print(X_train_backup.columns == X_train_new.columns)


# compare values element by element (ignoring index/column names)
print("\nValue comparison:")
print(f"X values equal: {X_train_new.values.shape == X_train_backup.values.shape and (X_train_new.values == X_train_backup.values).all()}")
print(f"y values equal: {y_train_new.values.shape == y_train_backup.values.shape and (y_train_new.values == y_train_backup.values).all()}")

# compare index now
print("\nIndex comparison:")
print(X_train_new.index == X_train_backup.index)
print(y_train_new.index == y_train_backup.index)

Column names comparison:
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True]

Value comparison:
X values equal: True
y values equal: True

Index comparison:
[False False False ... False False False]
[False False False ... False False False]


Great! The **actual** data in there is the same.
All column names match, and the values are the same for both the features (X)
and the labels (y).
The index differs, but this is expected, so it's fine.
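
As a side note, the same kind of check could probably be written more compactly.
The sketch below uses two small made-up frames (`X_old` and `X_new` are
hypothetical stand-ins, not my actual data) that hold identical values but carry
different row labels, and compares them while ignoring the index.

```python
import pandas as pd

# Toy stand-ins for the two feature frames (hypothetical data):
# identical columns and values, but different row labels
X_old = pd.DataFrame({"EventCode": [88.0, 12.0], "year": [2024, 2024]}, index=[6252, 17])
X_new = pd.DataFrame({"EventCode": [88.0, 12.0], "year": [2024, 2024]}, index=[0, 1])

# equals() also compares the index, so it reports a difference here
same_with_index = X_old.equals(X_new)
print(f"Equal including index: {same_with_index}")

# after resetting the index on both sides, only columns, dtypes and values count
same_without_index = X_old.reset_index(drop=True).equals(X_new.reset_index(drop=True))
print(f"Equal ignoring index: {same_without_index}")

# assert_frame_equal raises an AssertionError with a readable diff on mismatch
pd.testing.assert_frame_equal(
    X_old.reset_index(drop=True),
    X_new.reset_index(drop=True),
)
```

Unlike the element-wise `==` on `.values`, `equals` treats NaNs in the same
position as equal, which may matter for columns that still contain missing
values.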

In [46]:
# test the function for test data
# pass the new encoder here
X_test_new, y_test_new = prepare_data(
    df = test_df,
    is_train = False,
    save_data = False,
    encoder = encoder_new
)

I'll now also check this function for the test data.
It's bad to look at test data, but I guess verifying the code works by
comparing columns and printing the head is acceptable.
Without it, I wouldn't even know if my function works.

In theory, I could also just run it in `is_train = False` mode for the train
data, but I want to be completely sure.

Honestly, this can hardly be seen as data leakage, because I obtain practically
no information from looking at these numbers.
To make this even clearer, I'll print just the first row.
This is just enough for me to see that the function works as expected.
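
To illustrate what the `is_train = False` path relies on, here is a minimal
sketch on made-up toy data (the frames `train` and `test` and their values are
assumptions for illustration only).
It shows that applying an already fitted `OrdinalEncoder` to the data it was
fitted on reproduces the `fit_transform` output, and that a category never seen
during fitting is mapped to `-1`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical values), standing in for one categorical column
train = pd.DataFrame({"ActionGeo_CountryCode": ["US", "UK", "FR", "US"]})
test = pd.DataFrame({"ActionGeo_CountryCode": ["UK", "XX"]})  # "XX" is not in train

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# "train mode": fit and transform in one step
train_encoded = encoder.fit_transform(train)

# "not-train mode" on the same data: transform only, reusing the fitted encoder
train_again = encoder.transform(train)
print((train_encoded == train_again).all())

# on new data, known categories keep their code and unseen ones become -1
test_encoded = encoder.transform(test)
print(test_encoded.ravel())
```

Because the encoder is only fitted on the train set, nothing about the test
categories flows back into the encoding.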

In [48]:
# compare columns to backup -> columns must match
print("Comparing columns:")
print(X_test_new.columns == X_train_backup.columns)

# print only first row to get some feedback if function works
X_test_new.head(1)

Comparing columns:
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True]


Unnamed: 0,EventCode,EventBaseCode,EventRootCode,QuadClass,GoldsteinScale,Actor1Code,Actor1Name,Actor2Code,Actor2Name,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,year,month,day_of_year,day_of_week,is_weekend
6252,88.0,63.0,9.0,3,-5.0,193.0,1117.0,541.0,989.0,177.0,892.0,54.0,-4.0,2281.0,2024,2,36,0,0


Amazing! This can be used!

But I still need to test the saving path of the function.
This also logs to MLflow, so there are many more points where it can break.

In [51]:
# test the function for train data
X_train_new, y_train_new, encoder_new = prepare_data(
    df = train_df,
    is_train = True,
    save_data = True,
    save_path_dir = "notebook_prototyping",
    path_repo = PATH_REPO
)

🏃 View run rumbling-gull-895 at: http://127.0.0.1:5001/#/experiments/1/runs/c20dd06ff0e74c0aa45621b4fb7dbfa7
🧪 View experiment at: http://127.0.0.1:5001/#/experiments/1


Great! This works!
The data is saved and the encoder is logged to MLflow.

In [57]:
# test the function for test data
# pass run ID as encoder path
X_test_new, y_test_new = prepare_data(
    df = test_df,
    is_train = False,
    save_data = True,
    save_path_dir = "notebook_prototyping",
    path_repo = PATH_REPO,
    encoder_path = "c20dd06ff0e74c0aa45621b4fb7dbfa7"
)

In [58]:
# compare to first row of previous run
X_test_new.head(1)

Unnamed: 0,EventCode,EventBaseCode,EventRootCode,QuadClass,GoldsteinScale,Actor1Code,Actor1Name,Actor2Code,Actor2Name,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,year,month,day_of_year,day_of_week,is_weekend
6252,88.0,63.0,9.0,3,-5.0,193.0,1117.0,541.0,989.0,177.0,892.0,54.0,-4.0,2281.0,2024,2,36,0,0


Amazing! This works, too! So, this is my data prep, and I think I can move on!
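
Comparing only the first row is a rather weak check, so a stricter variant might
compare the complete output of the in-memory encoder with that of the reloaded
one.
The sketch below does this on made-up toy data and, as a simplifying
assumption, uses a temporary directory instead of the MLflow artifact store for
the save/load round trip.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical values), standing in for one categorical column
train = pd.DataFrame({"Actor1Code": ["USA", "GBR", "UNKNOWN"]})
test = pd.DataFrame({"Actor1Code": ["GBR", "DEU"]})  # "DEU" is not in train

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

# save and reload the encoder; the temporary directory is only a stand-in
# for the artifact store and is deleted when the block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "ordinal_encoder_prototype.pkl"
    joblib.dump(encoder, path)
    reloaded = joblib.load(path)

# both encoders should produce exactly the same encoding for all rows
encoded_in_memory = encoder.transform(test)
encoded_reloaded = reloaded.transform(test)
round_trip_ok = np.array_equal(encoded_in_memory, encoded_reloaded)
print(f"Round trip preserved the encoding: {round_trip_ok}")
```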

I leave some code here for later reference.

I haven't verified this works yet, but this is how I can likely do it:

```python
# bundle a model with the encoder
# YourModelWrapper is a placeholder for a custom mlflow.pyfunc.PythonModel subclass
mlflow.pyfunc.log_model("model_with_preprocessing",
                        python_model=YourModelWrapper(),
                        artifacts={'encoder': encoder_path})

# load the encoder in future scripts/notebooks
# (might not be needed, because prepare_data can already do this)
run_id = "<your_run_id>"
# note the "runs:/" URI scheme; the file name is the one logged by prepare_data
encoder_path = mlflow.artifacts.download_artifacts(
    f"runs:/{run_id}/preprocessing/ordinal_encoder_prototype.pkl"
)
loaded_encoder = joblib.load(encoder_path)
```

## Get a 1000 Samples Subset from Processed Train Data