# Exploratory Data Analysis

In this notebook, I take a look at the data I downloaded from BigQuery.
I already split the data into train and test sets in the download script to
prevent data leakage.

Here, I will inspect the train split and see what needs to be done to prepare
it for modeling.
I will work through the steps one by one, then collect them in a function and
apply it to the test split, too.
All decisions will be based on the train split exclusively.
Nothing will be decided based on the test split.
If dataset-wide statistics are needed for a transformation, they will be
computed on the train split only and then applied to the test split.

The workflow from this notebook will be exported to the script
`scripts/prepare_data_for_modeling.py`.

## Environment

To use this project's uv environment, make sure you have installed it according
to the instructions in the README.md file.

Then, connect to the `.venv` kernel.
Check the path to the kernel to make sure it's the right one.
It should be `.venv/bin/python`.

Run the next cell to check if you use the correct kernel.
It should output this:

```
<path_to_wherever_you_cloned_the_repo_to>/gdelt-newsimpact/.venv/bin/python
```

In [1]:
!which python

/Users/fakrueg/projects/courses/datatalks/mlops-zoomcamp/mlopsproject2/.venv/bin/python


## Setup

In [2]:
# Dependencies
import os
import pandas as pd
import mlflow
import joblib
from mlflow.models import infer_signature

from pathlib import Path
from typing import Optional, Tuple
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

In [3]:
# define paths
PATH_REPO = Path(".").resolve().parent
PATH_DATA = PATH_REPO / "data" / "raw"
PATH_TRAIN = PATH_DATA / "gdelt_events_2024_subset_10k_train.parquet"
PATH_TEST = PATH_DATA / "gdelt_events_2024_subset_10k_test.parquet"

In [4]:
# for saving any feature scalers or encoders to the artifact store

# set the MLflow tracking URI, i.e. connect to the MLflow server
mlflow.set_tracking_uri("http://127.0.0.1:5001")

# set experiment
mlflow.set_experiment("testing_setup")

<Experiment: artifact_location='mlflow-artifacts:/1', creation_time=1755362327633, experiment_id='1', last_update_time=1755362327633, lifecycle_stage='active', name='testing_setup', tags={}>

In [5]:
# load data from parquet files

# train data (path defined above)
df_train = pd.read_parquet(PATH_TRAIN)

# test data (path defined above)
df_test = pd.read_parquet(PATH_TEST)

# have a look at the train data
df_train.head()

Unnamed: 0,SQLDATE,MonthYear,EventCode,EventBaseCode,EventRootCode,QuadClass,GoldsteinScale,Actor1Code,Actor1Name,Actor1CountryCode,...,Actor2CountryCode,Actor2Type1Code,Actor2Type2Code,Actor2Type3Code,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,NumArticles
0,20241107,202411,180,180,18,4,-9.0,IRNGOV,IRANIAN,IRN,...,AFG,,,,IR,IR16,35.7131,47.2656,-3071164,2
1,20240922,202409,190,190,19,4,-10.0,ISRMIL,ISRAEL,ISR,...,LBN,,,,IS,IS00,31.4167,34.3333,-797156,2
2,20241107,202411,180,180,18,4,-9.0,UKR,UKRAINIAN,UKR,...,RUS,,,,RS,RS46,54.768,45.837,-2985097,10
3,20240922,202409,190,190,19,4,-10.0,ISRMED,ISRAELI,ISR,...,,MIL,,,IS,IS03,32.9,35.3333,-779978,10
4,20240922,202409,190,190,19,4,-10.0,GOV,PRIME MINISTER,,...,,,,,US,USDE,39.3498,-75.5148,DE,1


## Check for Missing Values

If any column has a large share of missing values, I will drop it.
Columns with only a small share of missing values may be imputed if there is a
meaningful way to do so.

In [6]:
# check for missing values and extract columns with >= 50% missing
missing_rates = df_train.isnull().mean()
columns_high_missing = missing_rates[missing_rates >= 0.5].index.tolist()

print("Missing value rates:")
print(missing_rates)
print("\nColumns with >= 50% missing values:")
print(columns_high_missing)

Missing value rates:
SQLDATE                  0.000000
MonthYear                0.000000
EventCode                0.000000
EventBaseCode            0.000000
EventRootCode            0.000000
QuadClass                0.000000
GoldsteinScale           0.000000
Actor1Code               0.101250
Actor1Name               0.101250
Actor1CountryCode        0.392750
Actor1Type1Code          0.595250
Actor1Type2Code          0.977625
Actor1Type3Code          1.000000
Actor2Code               0.252875
Actor2Name               0.252875
Actor2CountryCode        0.495125
Actor2Type1Code          0.673375
Actor2Type2Code          0.978000
Actor2Type3Code          0.999500
ActionGeo_CountryCode    0.029250
ActionGeo_ADM1Code       0.029250
ActionGeo_Lat            0.030000
ActionGeo_Long           0.030000
ActionGeo_FeatureID      0.029250
NumArticles              0.000000
dtype: float64

Columns with >= 50% missing values:
['Actor1Type1Code', 'Actor1Type2Code', 'Actor1Type3Code', 'Actor2Type1Code', 

For most machine learning projects, it's reasonable to drop columns with 50% or
more missing values, especially if there are plenty of other features.
High missingness usually means the feature will be hard to impute reliably and
won't add robust predictive power.

For the remaining columns, one option would be to impute the mode, i.e. the
most frequent value.
However, this is data about global events, and imputation always means making
up data and hoping it's a good guess.
For numerical data, the mean or median is often a good guess.
In this case, I am afraid it would not make much sense for some of the
columns.
For example, if the most frequent value is "USA", it would be filled in for
all rows where the value is missing.
But there may be a good reason why the value is missing.
For example, a missing value may mean that the event is not related to any
country.
Because of this, I will keep the missing values recognizable as missing by
introducing a new category for unknown.
I still have to check how this is usually encoded and which value can be used
for it.
Perhaps 0 is a good value, provided it is not already taken by anything else.
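
To make the difference concrete, here is a minimal sketch that compares mode
imputation with an explicit `"UNKNOWN"` category.
The values are made up for illustration and are not taken from the GDELT data.

```python
import pandas as pd

# toy column, values invented for illustration only
country = pd.Series(["USA", "USA", None, "DEU", None], name="Actor1CountryCode")

# option 1: mode imputation, every gap silently becomes the most frequent value
mode_value = country.mode().iloc[0]
filled_mode = country.fillna(mode_value)

# option 2: keep missingness visible as its own category
filled_unknown = country.fillna("UNKNOWN")

print(filled_mode.value_counts().to_dict())
print(filled_unknown.value_counts().to_dict())
```

With mode imputation, the two gaps become indistinguishable from real "USA"
entries, while the second option keeps them as a separate group the model can
learn from.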


In [7]:
# print columns that will be dropped
print("Columns that will be dropped:")
for column in columns_high_missing:
    print(column)

# drop columns with 50% or more missing values
df_train = df_train.drop(columns=columns_high_missing)

# check for missing values again to get an updated overview
df_train.isnull().mean()

Columns that will be dropped:
Actor1Type1Code
Actor1Type2Code
Actor1Type3Code
Actor2Type1Code
Actor2Type2Code
Actor2Type3Code


SQLDATE                  0.000000
MonthYear                0.000000
EventCode                0.000000
EventBaseCode            0.000000
EventRootCode            0.000000
QuadClass                0.000000
GoldsteinScale           0.000000
Actor1Code               0.101250
Actor1Name               0.101250
Actor1CountryCode        0.392750
Actor2Code               0.252875
Actor2Name               0.252875
Actor2CountryCode        0.495125
ActionGeo_CountryCode    0.029250
ActionGeo_ADM1Code       0.029250
ActionGeo_Lat            0.030000
ActionGeo_Long           0.030000
ActionGeo_FeatureID      0.029250
NumArticles              0.000000
dtype: float64

In [8]:
# check data types
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8000 entries, 0 to 7999
Data columns (total 19 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   SQLDATE                8000 non-null   Int64  
 1   MonthYear              8000 non-null   Int64  
 2   EventCode              8000 non-null   object 
 3   EventBaseCode          8000 non-null   object 
 4   EventRootCode          8000 non-null   object 
 5   QuadClass              8000 non-null   Int64  
 6   GoldsteinScale         8000 non-null   float64
 7   Actor1Code             7190 non-null   object 
 8   Actor1Name             7190 non-null   object 
 9   Actor1CountryCode      4858 non-null   object 
 10  Actor2Code             5977 non-null   object 
 11  Actor2Name             5977 non-null   object 
 12  Actor2CountryCode      4039 non-null   object 
 13  ActionGeo_CountryCode  7766 non-null   object 
 14  ActionGeo_ADM1Code     7766 non-null   object 
 15  Acti

All the values in these columns have an actual meaning that is hard to
approximate using the mean, median, or mode.
I will fill the missing values with a new value that makes it clear to the
model that the value is missing.

I will impute the numerical columns (such as latitude, longitude, and
GoldsteinScale) with a value far outside the possible range: 999.
In the train split, only latitude and longitude actually contain missing
values.
I will impute the categorical columns (such as Actor1Code, Actor1CountryCode,
Actor2Code, and ActionGeo_CountryCode) with a new category for unknown.
To accomplish this, I will just fill them with "UNKNOWN".

In [9]:
# imputation strategy
# numerical values: use 999 (far outside normal range)
# categorical values: use "UNKNOWN"

# automatically identify column types
numerical_columns = df_train.select_dtypes(
    include=['int64', 'float64']
).columns.tolist()
categorical_columns = df_train.select_dtypes(
    include=['object', 'string']
).columns.tolist()

print("Numerical columns:", numerical_columns)
print("Categorical columns:", categorical_columns)

# create imputation strategy dynamically
imputation_strategy = {}

# add categorical imputation (UNKNOWN for all)
for col in categorical_columns:
    imputation_strategy[col] = "UNKNOWN"

# add numerical imputation (999 for all)
for col in numerical_columns:
    imputation_strategy[col] = 999

# fill missing values with strategy
df_train.fillna(imputation_strategy, inplace=True)

# check result
print("Missing values after imputation:")
print(df_train.isna().mean())
df_train.head()

Numerical columns: ['SQLDATE', 'MonthYear', 'QuadClass', 'GoldsteinScale', 'ActionGeo_Lat', 'ActionGeo_Long', 'NumArticles']
Categorical columns: ['EventCode', 'EventBaseCode', 'EventRootCode', 'Actor1Code', 'Actor1Name', 'Actor1CountryCode', 'Actor2Code', 'Actor2Name', 'Actor2CountryCode', 'ActionGeo_CountryCode', 'ActionGeo_ADM1Code', 'ActionGeo_FeatureID']
Missing values after imputation:
SQLDATE                  0.0
MonthYear                0.0
EventCode                0.0
EventBaseCode            0.0
EventRootCode            0.0
QuadClass                0.0
GoldsteinScale           0.0
Actor1Code               0.0
Actor1Name               0.0
Actor1CountryCode        0.0
Actor2Code               0.0
Actor2Name               0.0
Actor2CountryCode        0.0
ActionGeo_CountryCode    0.0
ActionGeo_ADM1Code       0.0
ActionGeo_Lat            0.0
ActionGeo_Long           0.0
ActionGeo_FeatureID      0.0
NumArticles              0.0
dtype: float64


Unnamed: 0,SQLDATE,MonthYear,EventCode,EventBaseCode,EventRootCode,QuadClass,GoldsteinScale,Actor1Code,Actor1Name,Actor1CountryCode,Actor2Code,Actor2Name,Actor2CountryCode,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,NumArticles
0,20241107,202411,180,180,18,4,-9.0,IRNGOV,IRANIAN,IRN,AFG,AFGHAN,AFG,IR,IR16,35.7131,47.2656,-3071164,2
1,20240922,202409,190,190,19,4,-10.0,ISRMIL,ISRAEL,ISR,LBN,LEBANON,LBN,IS,IS00,31.4167,34.3333,-797156,2
2,20241107,202411,180,180,18,4,-9.0,UKR,UKRAINIAN,UKR,RUS,RUSSIAN,RUS,RS,RS46,54.768,45.837,-2985097,10
3,20240922,202409,190,190,19,4,-10.0,ISRMED,ISRAELI,ISR,MIL,MILITARY,UNKNOWN,IS,IS03,32.9,35.3333,-779978,10
4,20240922,202409,190,190,19,4,-10.0,GOV,PRIME MINISTER,UNKNOWN,UNKNOWN,UNKNOWN,UNKNOWN,US,USDE,39.3498,-75.5148,DE,1


Great! Now there are no missing values left in `df_train`.
I hope that this approach makes sense.
The only way to find out is to try it out.

## Take care of the columns for date

There is a column called "SQLDATE" which holds the date as an integer in the
format YYYYMMDD.
This is not very useful for modeling, so I will convert it to more informative
features such as year, month, day of year, day of week, and whether it is a
weekend or not.

I will also drop the intermediate datetime column that is created along the way.

Beyond this, there is a second column called "MonthYear" which holds the year
and month in the format YYYYMM.
This information is redundant with "SQLDATE", so I will drop it, too.

In [10]:
# Convert to datetime first
df_train['date'] = pd.to_datetime(df_train['SQLDATE'], format='%Y%m%d')

# Extract useful components
df_train['year'] = df_train['date'].dt.year
df_train['month'] = df_train['date'].dt.month
df_train['day_of_year'] = df_train['date'].dt.dayofyear  # 1-366 (2024 is a leap year)
df_train['day_of_week'] = df_train['date'].dt.dayofweek  # 0=Monday, 6=Sunday
df_train['is_weekend'] = df_train['day_of_week'].isin([5, 6]).astype(int)

# Drop the intermediate date column and the original date related columns
df_train = df_train.drop(["date", "SQLDATE", "MonthYear"], axis=1)
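
As a quick sanity check of the date features, here is a minimal sketch on three
hand-picked dates.
Two of them appear in the data above; the third (December 31) was added only to
show the leap-year case.

```python
import pandas as pd

# three hand-picked dates in the SQLDATE integer format (YYYYMMDD)
sqldate = pd.Series([20240922, 20241107, 20241231])

# cast to string first so the format string is applied to the digits
date = pd.to_datetime(sqldate.astype(str), format="%Y%m%d")

features = pd.DataFrame({
    "month": date.dt.month,
    "day_of_year": date.dt.dayofyear,   # runs up to 366 in leap years such as 2024
    "day_of_week": date.dt.dayofweek,   # 0=Monday, 6=Sunday
})
features["is_weekend"] = features["day_of_week"].isin([5, 6]).astype(int)

print(features)
```

September 22, 2024 is a Sunday, so it is the only one of the three dates that
gets `is_weekend = 1`.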

## Check if I need to One-Hot-Encode the categorical features

Right now, several of the categorical columns use numeric codes (for example
`EventCode` and `QuadClass`).
While this will probably work, it also introduces an order among the values.
A model may learn patterns from this order that don't really exist.

One-hot encoding avoids this, but it increases the number of features by a lot.
This may be a problem if the number of features becomes too high.

To decide, I will check which columns should not have an order and whether
one-hot encoding is feasible by looking at the number of unique values in each
column.

In [11]:
df_train.nunique()

EventCode                  17
EventBaseCode              11
EventRootCode               8
QuadClass                   4
GoldsteinScale             10
Actor1Code                478
Actor1Name                848
Actor1CountryCode         147
Actor2Code                443
Actor2Name                751
Actor2CountryCode         149
ActionGeo_CountryCode     166
ActionGeo_ADM1Code        778
ActionGeo_Lat            1310
ActionGeo_Long           1339
ActionGeo_FeatureID      1403
NumArticles                31
year                        1
month                       9
day_of_year                11
day_of_week                 6
is_weekend                  2
dtype: int64

Dang! Those are a lot of unique values in the categorical columns.
The cardinality of these features is way too high for one-hot encoding, as it
would blow up the feature space and hurt both memory and model generalization.

So, instead, I will encode the categories as integers, which fits the ML
algorithms I aim to use: XGBoost, CatBoost, and LightGBM.
- XGBoost, CatBoost, and LightGBM can all work with categorical features that
are mapped to integer codes (label encoding).
- CatBoost and LightGBM additionally offer more advanced native handling of
categorical features (CatBoost, for example, uses target statistics).
- Each unique category gets a unique integer, including `"UNKNOWN"` for missing
values.
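
For reference, here is a minimal sketch of what such an integer mapping looks
like, using pandas' `Categorical` on toy values that are invented for
illustration.
This is only meant to show the idea; the notebook itself uses `OrdinalEncoder`.

```python
import pandas as pd

# toy values, invented for illustration
train_codes = pd.Series(["GOV", "MIL", "UNKNOWN", "GOV"])
test_codes = pd.Series(["MIL", "REB", "UNKNOWN"])   # "REB" never appears in train

# learn the category list from the train split only
categories = sorted(train_codes.unique())

# apply the same category list to both splits
train_cat = pd.Categorical(train_codes, categories=categories)
test_cat = pd.Categorical(test_codes, categories=categories)

print(train_cat.codes.tolist())  # one integer code per row
print(test_cat.codes.tolist())   # values outside the category list get code -1
```

Because the category list comes from the train split, the same string always
maps to the same integer in both splits.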

Here's the plan:
I will use the `OrdinalEncoder` from `sklearn` to encode the categorical columns.
This is an encoder object that can be fitted on the training set, saved to a
file, and then applied to the train data, the test data, and any new data.
This is important because the encoding must be the same for the train data and
for any data the model is later queried on, including the test data.
If the encoding differs, the model will not be able to make meaningful
predictions.
The encoder also supports handling unknown values.
If there are categories in the test data that were not seen in the training
data, the encoder assigns them a value of my choice.
I will use -1 to signal to the algorithm that this is a new category it wasn't
trained on.
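
A minimal sketch of this behavior on toy values (invented for illustration)
could look like this:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# toy frames, values invented for illustration
train = pd.DataFrame({"country": ["DEU", "USA", "UNKNOWN", "USA"]})
test = pd.DataFrame({"country": ["USA", "FRA"]})   # "FRA" is not in train

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_encoded = encoder.fit_transform(train)   # learn the mapping on train only
test_encoded = encoder.transform(test)         # reuse the mapping, never refit

print(encoder.categories_[0])   # categories learned from train, in sorted order
print(test_encoded.ravel())     # "FRA" is mapped to -1
```

Note that `"UNKNOWN"` (missing in the raw data) and -1 (category never seen
during fitting) are two different signals.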

Usually, it would be important to save the encoder to a file.
Here, however, I only develop the individual parts as a rapid prototype.
Once everything works, I will refactor and export this to a script.
There, I will save the encoder.
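
For later reference, saving and restoring a fitted encoder with `joblib` could
look roughly like this.
This is a sketch on toy data; the file name `encoder_example.pkl` is
hypothetical and a temporary directory is used so nothing is left behind.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# fit an encoder on toy train data (values invented for illustration)
train = pd.DataFrame({"country": ["DEU", "USA", "UNKNOWN"]})
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

# save the fitted encoder and load it again
with tempfile.TemporaryDirectory() as tmp_dir:
    encoder_file = Path(tmp_dir) / "encoder_example.pkl"   # hypothetical file name
    joblib.dump(encoder, encoder_file)
    restored = joblib.load(encoder_file)

# the restored encoder must produce the same codes as the original one
new_data = pd.DataFrame({"country": ["USA", "FRA"]})
same_result = bool(
    (encoder.transform(new_data) == restored.transform(new_data)).all()
)
print(same_result)
```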

In [12]:
# initialize an encoder for categorical data
# encodes categorical data as integers (example "USA" may get 1 or whatever)
# use -1 for unknown values to signal algorithm it's a new category it wasn't trained on
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)

# fit encoder on train set and transform it right away
df_train[categorical_columns] = encoder.fit_transform(df_train[categorical_columns])

# check the result
df_train.head()

Unnamed: 0,EventCode,EventBaseCode,EventRootCode,QuadClass,GoldsteinScale,Actor1Code,Actor1Name,Actor1CountryCode,Actor2Code,Actor2Name,...,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,NumArticles,year,month,day_of_year,day_of_week,is_weekend
0,11.0,7.0,6.0,4,-9.0,187.0,330.0,61.0,0.0,14.0,...,261.0,35.7131,47.2656,573.0,2,2024,11,312,3,0
1,16.0,10.0,7.0,4,-10.0,203.0,339.0,64.0,209.0,381.0,...,272.0,31.4167,34.3333,777.0,2,2024,9,266,6,1
2,11.0,7.0,6.0,4,-9.0,416.0,792.0,133.0,328.0,584.0,...,529.0,54.768,45.837,543.0,10,2024,11,312,3,0
3,16.0,10.0,7.0,4,-10.0,202.0,340.0,64.0,235.0,428.0,...,275.0,32.9,35.3333,727.0,10,2024,9,266,6,1
4,16.0,10.0,7.0,4,-10.0,143.0,596.0,134.0,387.0,711.0,...,703.0,39.3498,-75.5148,1252.0,1,2024,9,266,6,1


In [13]:
# check the distribution of these new values
df_train.describe()

Unnamed: 0,EventCode,EventBaseCode,EventRootCode,QuadClass,GoldsteinScale,Actor1Code,Actor1Name,Actor1CountryCode,Actor2Code,Actor2Name,...,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,NumArticles,year,month,day_of_year,day_of_week,is_weekend
count,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,...,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0,8000.0
mean,13.05575,8.007875,5.766625,3.40725,-6.47075,257.344,481.058125,104.4925,258.191125,485.000875,...,431.311625,62.349349,43.667745,878.16125,5.378125,2024.0,9.145375,266.326125,4.61725,0.678625
std,4.326515,3.020464,2.114999,1.183563,6.713044,130.284336,248.84653,38.32725,125.400933,226.239862,...,222.067189,165.684497,179.380515,430.636955,3.992667,0.0,1.27277,39.027979,2.171944,0.467034
min,0.0,0.0,0.0,1.0,-10.0,0.0,0.0,0.0,0.0,0.0,...,0.0,-44.0,-172.178309,0.0,1.0,2024.0,1.0,27.0,0.0,0.0
25%,11.0,7.0,6.0,4.0,-10.0,143.0,299.0,64.0,163.0,329.0,...,271.0,31.4167,-3.91667,530.0,2.0,2024.0,9.0,266.0,3.0,0.0
50%,16.0,10.0,7.0,4.0,-10.0,240.0,455.0,133.0,290.0,528.0,...,405.5,33.8719,34.3333,837.0,4.0,2024.0,9.0,266.0,6.0,1.0
75%,16.0,10.0,7.0,4.0,-9.0,411.0,756.0,134.0,387.0,711.0,...,670.0,46.4639,38.0,1287.0,10.0,2024.0,9.0,266.0,6.0,1.0
max,16.0,10.0,7.0,4.0,8.5,477.0,847.0,146.0,442.0,750.0,...,777.0,999.0,999.0,1402.0,60.0,2024.0,12.0,350.0,6.0,1.0


Very interesting to see: GoldsteinScale is very skewed.
It represents the intensity of interactions between actors.
Negative values mean conflictual interactions.
A value of -10 is actually an act of war.
A positive value means a cooperative action.

In this data, almost all events are bad ones.
I don't know if this is due to my hashing in SQL, random chance, or because
most of the news happens to be about bad things.
From my own personal experience, the latter makes sense.
A lot of news focuses on bad things.

I don't want to go any deeper into interpreting this, but it's noteworthy in
my opinion.

## No scaling or normalization is needed

The range of the features differs a lot.
Many algorithms would suffer from this.

However, I already decided to use tree-based models only.
This is actually for a personal reason.
I will have to use these algorithms for a different project in the future, so I
already want to gather some experience and get familiar with them.

Tree-based models like XGBoost, CatBoost, and LightGBM do not require feature
scaling or normalization for either categorical features (integer encoded) or
numerical features.

These algorithms split data based on feature values and thresholds, not on
Euclidean distance, so scaling has no effect on their performance or accuracy.

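Note that integer-encoded categorical variables are only treated as distinct
categories if the columns are declared as categorical to the library.
Otherwise, the trees split on the integer codes like on any other number, which
still works without scaling.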

The models can be trained directly with the current integer and numeric features.

If I had used different algorithms, I would have to check if scaling or
normalization is needed.
Here, however, this is not the case.
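
The claim about scaling can be checked on toy data.
The following is a minimal sketch that uses scikit-learn's
`DecisionTreeRegressor` as a stand-in for the boosted tree libraries (an
assumption made to keep the example small); the data is random and invented for
illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# toy data: two integer-valued features on very different ranges
X = np.column_stack([
    rng.integers(0, 10, size=200),      # small range, e.g. an encoded category
    rng.integers(0, 1000, size=200),    # large range
]).astype(float)
y = X[:, 0] * 2.0 + (X[:, 1] > 500) * 5.0

# rescale every feature (positive factor plus shift keeps the value order)
X_scaled = X * 4.0 + 8.0

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_scaled, y)

# both trees partition the rows identically, so the predictions agree
same_predictions = bool(
    np.allclose(tree_raw.predict(X), tree_scaled.predict(X_scaled))
)
print(same_predictions)
```

The thresholds of the two trees differ, but they separate the same rows, which
is why rescaling does not change the predictions.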

## Divide into features and target

What's left now is to extract the features and the target.
The target is the number of articles in the media, `NumArticles`.
The features are all the other columns.

In [14]:
# get features and check result
X_train = df_train.drop(columns=["NumArticles"])
X_train.head()

Unnamed: 0,EventCode,EventBaseCode,EventRootCode,QuadClass,GoldsteinScale,Actor1Code,Actor1Name,Actor1CountryCode,Actor2Code,Actor2Name,...,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,year,month,day_of_year,day_of_week,is_weekend
0,11.0,7.0,6.0,4,-9.0,187.0,330.0,61.0,0.0,14.0,...,65.0,261.0,35.7131,47.2656,573.0,2024,11,312,3,0
1,16.0,10.0,7.0,4,-10.0,203.0,339.0,64.0,209.0,381.0,...,66.0,272.0,31.4167,34.3333,777.0,2024,9,266,6,1
2,11.0,7.0,6.0,4,-9.0,416.0,792.0,133.0,328.0,584.0,...,127.0,529.0,54.768,45.837,543.0,2024,11,312,3,0
3,16.0,10.0,7.0,4,-10.0,202.0,340.0,64.0,235.0,428.0,...,66.0,275.0,32.9,35.3333,727.0,2024,9,266,6,1
4,16.0,10.0,7.0,4,-10.0,143.0,596.0,134.0,387.0,711.0,...,151.0,703.0,39.3498,-75.5148,1252.0,2024,9,266,6,1


In [15]:
# get target and check result
y_train = df_train["NumArticles"]
y_train.head()

0     2
1     2
2    10
3    10
4     1
Name: NumArticles, dtype: Int64

Amazing! I'd say everything looks great.
This should be sufficient.

## Collect the steps and refactor them into a function

This exact same logic must be applied to `df_test`, too, so the model will be
able to make meaningful predictions.

I will collect the steps and refactor them into a function.
This function can not only be used for `df_test`, but it will also be useful
later when I export this notebook as a script.

Make sure to learn everything from `df_train` and then apply what was learned to
`df_test`.
Never learn from the full dataset or from `df_test` itself.
Examples are the missing-value rates used to decide which columns to drop, and
the parameters for scaling the data in case it is needed.

Make sure to fit the encoding (like category codes) on the training set, then
only apply it to the test set to avoid data leakage.

Make sure train and test columns are in the same order.

Save the data to parquet files.
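
One detail to watch: `prepare_data` as written below recomputes the missing
rates on whatever frame it receives, so for the test split the drop decision
would be based on the test data.
A minimal sketch of how the decision could be learned on train and then applied
to test is shown here; the toy frames and the name `columns_to_drop` are
hypothetical and not part of the function below.

```python
import pandas as pd

# toy frames, invented for illustration
train = pd.DataFrame({
    "a": [1.0, None, 3.0, 4.0],
    "mostly_missing": [None, None, None, 1.0],
    "target": [1, 2, 3, 4],
})
test = pd.DataFrame({
    "target": [5, 6],
    "mostly_missing": [2.0, 3.0],   # complete in test, still dropped
    "a": [None, None],              # fully missing in test, still kept
})

# the decision is learned on the train split only
missing_rates = train.isna().mean()
columns_to_drop = missing_rates[missing_rates >= 0.5].index.tolist()

train_prepared = train.drop(columns=columns_to_drop)

# apply the train decision and the train column order to the test split
test_prepared = test.drop(columns=columns_to_drop)[train_prepared.columns]

print(columns_to_drop)
print(list(test_prepared.columns))
```

Selecting the test columns via `train_prepared.columns` also takes care of the
column order.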

In [16]:
# helper function for saving the labels
def save_series_as_parquet(series, filepath):
    """Save a pandas Series as parquet file"""
    # Convert Series to DataFrame with proper column name
    df = series.to_frame(name=series.name if series.name else 'target')
    df.to_parquet(filepath)

# function for data preparation
def prepare_data(
    df: pd.DataFrame,
    is_train: bool,
    encoder: Optional[OrdinalEncoder] = None,
    encoder_path: Optional[str] = None,
    save_data: bool = False,
    save_path_dir: Optional[str] = None,
    path_repo: Optional[str] = None,
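) -> Tuple:
    """
    Prepare data for training or testing.

    For train data (is_train=True), a new encoder is fitted and the function
    returns (X, y, encoder). For test data (is_train=False), an already fitted
    encoder must be passed via `encoder`, or the MLflow run ID of the run that
    logged it via `encoder_path`, and the function returns (X, y).
    If save_data is True, `path_repo` must point to the repository root.
    """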
    
    # Constants
    
    # target value or rather label
    target_label = "NumArticles"
    
    # path to intermediate data
    if save_data == True:
        PATH_DATA = Path(path_repo) / "data/intermediate"

    # Handle missing values

    # check for missing values and extract columns with >= 50% missing
    missing_rates = df.isnull().mean()
    columns_high_missing = missing_rates[missing_rates >= 0.5].index.tolist()

    # drop columns with 50% or more missing
    df = df.drop(columns=columns_high_missing)

    # automatically identify column types
    numerical_columns = df.select_dtypes(
        include=['int64', 'float64']
    ).columns.tolist()
    categorical_columns = df.select_dtypes(
        include=['object', 'string']
    ).columns.tolist()
    
    # create imputation strategy dynamically
    # numerical values: use 999 (far outside normal range)
    # categorical values: use "UNKNOWN"
    imputation_strategy = {}

    # add categorical imputation (UNKNOWN for all)
    for col in categorical_columns:
        imputation_strategy[col] = "UNKNOWN"

    # add numerical imputation (999 for all)
    for col in numerical_columns:
        imputation_strategy[col] = 999

    # fill missing values with strategy
    df.fillna(imputation_strategy, inplace=True)

    # Handle time and date columns
    # convert column "SQLDATE" to more meaningful date and time info
    # this will allow models to learn from it better

    # convert to datetime first
    df['date'] = pd.to_datetime(df['SQLDATE'], format='%Y%m%d')

    # extract useful components
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month
    df['day_of_year'] = df['date'].dt.dayofyear  # 1-366 in leap years
    df['day_of_week'] = df['date'].dt.dayofweek  # 0=Monday, 6=Sunday
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

    # drop the intermediate date column and the original date related columns
    df = df.drop(["date", "SQLDATE", "MonthYear"], axis=1)


    # Handle categorical data by numerical encoding
    
    # if train data is passed, fit_transform and save encoder
    if is_train == True:
        
        # initialize an encoder for categorical data
        # encodes categorical data as integers
        # example "USA" may get 1 or whatever
        # use -1 for unknown values
        # this signals algorithm it's a new category it wasn't trained on
        encoder = OrdinalEncoder(
            handle_unknown='use_encoded_value',
            unknown_value=-1
        )
        
        # fit encoder on train set and transform the data right away
        df[categorical_columns] = encoder.fit_transform(df[categorical_columns])
        
        # if data is supposed to be saved, save encoder to file
        if save_data == True:
            encoder_path = 'ordinal_encoder_prototype.pkl' 
            joblib.dump(encoder, encoder_path)

            # log as part of a custom model (for later use with ML model)
            with mlflow.start_run():
                # log just the encoder as artifact for now
                mlflow.log_artifact(encoder_path, artifact_path='preprocessing')

                # clean up local file after logging
                # so it's in just one location: artifact store
                os.remove(encoder_path)


    # if test data is passed, load the encoder and transform test data
    elif is_train == False:
        
        # we need the run_id to locate the encoder in the artifact store
        if encoder is None and encoder_path is None:
            raise ValueError("For test data, either 'encoder' object or 'encoder_path' (run_id) must be provided")
        
        # if encoder object is passed directly, use it
        if encoder is not None:
            loaded_encoder = encoder
        
        # if encoder_path (run_id) is provided, load from artifact store
        elif encoder_path is not None:
            # download the encoder from MLflow artifact store
            # encoder_path should be the run_id where the encoder was logged
            artifact_path = mlflow.artifacts.download_artifacts(
                f"runs:/{encoder_path}/preprocessing/ordinal_encoder_prototype.pkl"
            )
            # load the encoder from the downloaded file
            loaded_encoder = joblib.load(artifact_path)
            
            # clean up: delete the temporary downloaded file
            os.remove(artifact_path)
        
        # apply the loaded encoder to transform test data (only transform, no fitting)
        df[categorical_columns] = loaded_encoder.transform(df[categorical_columns])


    # Extract features and labels
    # get features
    X = df.drop(columns=[target_label])

    # get target
    y = df[target_label]
    
    
    # Save data to parquet files if desired
    if save_data == True:
        
        # ensure the intermediate data directory exists
        PATH_DATA.mkdir(parents=True, exist_ok=True)
        
        # if path to directory to save data to was passed, use it
        if save_path_dir is not None:
            save_name = PATH_DATA / save_path_dir
        # if not passed, use a default path
        else:
            # count number of directories in intermediate data dir
            num_dirs = sum(1 for p in PATH_DATA.iterdir() if p.is_dir())
            save_name = PATH_DATA / f"gdelt_events_2024_subset_version_{num_dirs}"
            
        # ensure the save directory exists
        save_name.mkdir(parents=True, exist_ok=True)
            
        # switch between train and test features and labels
        if is_train == True:
            X_name = "X_train"
            y_name = "y_train"
        else:
            X_name = "X_test"
            y_name = "y_test"
            
        # save data to parquet files
        X.to_parquet(save_name / f"{X_name}.parquet")
        save_series_as_parquet(y, save_name / f"{y_name}.parquet")

    # always return features and labels
    # return encoder if train was used
    if is_train == True:
        return X, y, encoder
    else:
        return X, y

In [17]:
# make copy of previously processed train data to compare if output is same
df_train_backup = df_train.copy()
X_train_backup = X_train.copy()
y_train_backup = y_train.copy()

In [18]:
# briefly check if backup worked
print(df_train_backup.equals(df_train))
print(X_train_backup.equals(X_train))
print(y_train_backup.equals(y_train))

True
True
True


In [19]:
# load train data again fresh from file, because it was already processed
df_train = pd.read_parquet(PATH_DATA / "gdelt_events_2024_subset_10k_train.parquet")

In [20]:
# test the function for train data
X_train_new, y_train_new, encoder_new = prepare_data(
    df = df_train,
    is_train = True,
    save_data = False,
)

In [21]:
# compare column names
print("Column names comparison:")
print(X_train_backup.columns == X_train_new.columns)


# compare values element by element (ignoring index/column names)
print("\nValue comparison:")
print(f"X values equal: {X_train_new.values.shape == X_train_backup.values.shape and (X_train_new.values == X_train_backup.values).all()}")
print(f"y values equal: {y_train_new.values.shape == y_train_backup.values.shape and (y_train_new.values == y_train_backup.values).all()}")

# compare index now
print("\nIndex comparison:")
print(X_train_new.index == X_train_backup.index)
print(y_train_new.index == y_train_backup.index)

Column names comparison:
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True]

Value comparison:
X values equal: True
y values equal: True

Index comparison:
[ True  True  True ...  True  True  True]
[ True  True  True ...  True  True  True]


Great! The data in there is the same.
All column names match, and the values are the same for both the features (X)
and the labels (y).
Even the index is the same. I actually didn't expect that.
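
As an alternative to the element-wise prints, pandas ships testing helpers that
compare values, column names, index, and dtypes in one call.
A minimal sketch with toy stand-ins for the backup and the new data:

```python
import pandas as pd
from pandas.testing import assert_frame_equal, assert_series_equal

# toy stand-ins for the backup and the freshly prepared data
X_old = pd.DataFrame({"a": [1.0, 2.0], "b": [3, 4]})
X_new = X_old.copy()
y_old = pd.Series([2, 10], name="NumArticles")
y_new = y_old.copy()

# both calls raise an AssertionError describing the difference if anything differs
assert_frame_equal(X_new, X_old)
assert_series_equal(y_new, y_old)
print("identical")
```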

In [22]:
# test the function for test data
# pass the new encoder here
X_test_new, y_test_new = prepare_data(
    df = df_test,
    is_train = False,
    save_data = False,
    encoder = encoder_new
)

I'll now also check this function for the test data.
It's bad to look at test data, but I guess verifying the code works by
comparing columns and printing the head is acceptable.
Without it, I wouldn't even know if my function works.

In theory, I could also just run it in `is_train = False` mode for the train
data, but I want to be completely sure.

Honestly, this cannot be seen as data leakage, because I obtain practically zero
information from looking at these numbers.
To make this even more clear, I'll just print the first row.
This is just enough for me to see the function works as expected.

In [23]:
# compare columns to backup -> columns must match
print("Comparing columns:")
print(X_test_new.columns == X_train_backup.columns)

# print only first row to get some feedback if function works
X_test_new.head(1)

Comparing columns:
[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True]


Unnamed: 0,EventCode,EventBaseCode,EventRootCode,QuadClass,GoldsteinScale,Actor1Code,Actor1Name,Actor1CountryCode,Actor2Code,Actor2Name,...,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,year,month,day_of_year,day_of_week,is_weekend
0,6.0,3.0,2.0,1,8.0,422.0,802.0,134.0,86.0,710.0,...,-1.0,-1.0,34.0,9.0,-1.0,2024,7,190,0,0


Amazing! This can be used!

But I still need to test saving the data.
This will also log to MLflow, so there are many more points where it can break.

In [24]:
# test the function for train data
X_train_new, y_train_new, encoder_new = prepare_data(
    df = df_train,
    is_train = True,
    save_data = True,
    save_path_dir = "notebook_prototyping",
    path_repo = PATH_REPO
)

🏃 View run welcoming-wasp-672 at: http://127.0.0.1:5001/#/experiments/1/runs/f55cb43906854b59b8485232f2667bb8
🧪 View experiment at: http://127.0.0.1:5001/#/experiments/1


Great! This works!
The data is saved and the encoder is logged to MLflow.

In [25]:
# test the function for test data
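# pass the MLflow run ID as encoder_path
# note: the ID must belong to a run that logged the encoder; the ID used here
# differs from the run ID printed by the previous cell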
X_test_new, y_test_new = prepare_data(
    df = df_test,
    is_train = False,
    save_data = True,
    save_path_dir = "notebook_prototyping",
    path_repo = PATH_REPO,
    encoder_path = "27d90f27450249d987677a5b7fa18167"
)

In [26]:
# compare to first row of previous run
X_test_new.head(1)

Unnamed: 0,EventCode,EventBaseCode,EventRootCode,QuadClass,GoldsteinScale,Actor1Code,Actor1Name,Actor1CountryCode,Actor2Code,Actor2Name,...,ActionGeo_CountryCode,ActionGeo_ADM1Code,ActionGeo_Lat,ActionGeo_Long,ActionGeo_FeatureID,year,month,day_of_year,day_of_week,is_weekend
0,6.0,3.0,2.0,1,8.0,422.0,802.0,134.0,86.0,710.0,...,-1.0,-1.0,34.0,9.0,-1.0,2024,7,190,0,0


Amazing! This works, too! So, this is my data prep and I can go on I think!

I leave some code here for later reference.

I haven't verified this works yet, but this is how I can likely do it:

```python
# bundle a model with the encoder
mlflow.pyfunc.log_model("model_with_preprocessing",
                        python_model=YourModelWrapper(),
                        artifacts={'encoder': encoder_path})

# load the encoder in future scripts/notebooks
# (only if this is actually needed, because I have the function)
run_id = "<your_run_id>"
# artifact URIs use the "runs:/" scheme; the file name must match the logged one
encoder_path = mlflow.artifacts.download_artifacts(f"runs:/{run_id}/preprocessing/ordinal_encoder.pkl")
loaded_encoder = joblib.load(encoder_path)
```

## Get a 1000 Samples Subset from Processed Train Data

10k rows is not that much data for ML, but training on it can still take some
time.
For rapid prototyping in interactive Jupyter notebooks, I will get an even
smaller subset of just 1k rows.

The idea is to draw it from the processed train data and use it when I develop
in interactive mode.
I can then run a script using the full data in the background later on.
Note that `df_train` was reloaded from the raw parquet file above, so the sample
below still has the 25 raw columns.

In [27]:
# get 1000 random samples
df_train_subset1k = df_train.sample(n=1000, random_state=42)

# check it out
df_train_subset1k.shape


(1000, 25)

I won't save this subset right here, because I will extract the logic to a
script now anyway and do it there.
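
A minimal sketch of how such a subset could be drawn from processed features and
target while keeping the rows aligned is shown here.
`X` and `y` are toy stand-ins, not the real data.

```python
import pandas as pd

# toy stand-ins for the processed features and target
X = pd.DataFrame({"feature": range(10)})
y = pd.Series(range(100, 110), name="NumArticles")

# sample rows from the features, then pick the matching targets by index
X_subset = X.sample(n=4, random_state=42)
y_subset = y.loc[X_subset.index]

print(X_subset.index.equals(y_subset.index))
```

Selecting the target via the sampled index avoids sampling `X` and `y`
separately, which could silently break the row alignment.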