# Exploratory Data Analysis

Here, I will just check if I can get some usable data from the GDELT 2.0 Event
Database via Google BigQuery.

I haven't decided yet which problem to solve or what kind of model to
train, because I want to find a suitable dataset first.

Here's what I'm looking for. The data should:
- have timestamps and be updated regularly, so I can train an initial model
and then schedule it to run periodically and monitor it
- be sufficiently large to train a model on
- have some interesting features and a suitable target variable

During a brief search, I found the GDELT 2.0 Event Database, which is a public
and free database that contains event data from all over the world.
It seems to fulfill these requirements and is available via BigQuery.

Here, I will check if I can get some data from it and if it's suitable for my
needs.

## Environment

To use this project's uv environment, make sure you have installed it according
to the instructions in the README.md file.

Then, connect to the `.venv` kernel.
Check the path to the kernel to make sure it's the right one.
It should be `.venv/bin/python`.

Run the next cell to check that you are using the correct kernel.
It should output this:

# FIXME: Once I have decided on an actual name for the repo, adapt the path!
```
<path_to_wherever_you_cloned_the_repo_to>/mlopsproject2/.venv/bin/python
```

In [2]:
!which python

/Users/fakrueg/projects/courses/datatalks/mlops-zoomcamp/mlopsproject2/.venv/bin/python
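
The same check can also be done from Python instead of the shell. This is a minimal sketch, assuming the environment directory is named `.venv` as described above; the helper name `is_project_venv` is hypothetical and not part of this project.

```python
import sys
from pathlib import Path


def is_project_venv(executable: str, venv_dir_name: str = ".venv") -> bool:
    """Return True if the interpreter path contains a directory named `venv_dir_name`."""
    return venv_dir_name in Path(executable).parts


# Show the interpreter that runs this kernel and whether it lives in .venv
print(sys.executable)
print("Project venv in use:", is_project_venv(sys.executable))
```

This only checks the directory name, so it cannot tell this project's `.venv` apart from another project's.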


In [6]:
# Dependencies
import os
import pandas as pd
from google.cloud import bigquery
import pandas_gbq
from dotenv import load_dotenv
from pathlib import Path

# Load environment variables
load_dotenv()

True

In [84]:
# define paths
PATH_REPO = Path(".").resolve().parent
PATH_DATA = PATH_REPO / "data" / "raw"
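
The parquet export at the end of this notebook writes into `data/raw`, which fails if that directory does not exist yet. A minimal sketch of creating it up front, assuming it is acceptable to create missing parent directories; `ensure_dir` is a hypothetical helper name.

```python
from pathlib import Path


def ensure_dir(path: Path) -> Path:
    """Create `path` (and any missing parents) if needed, then return it."""
    path.mkdir(parents=True, exist_ok=True)
    return path


# Hypothetical usage, mirroring the paths defined in the cell above:
# PATH_DATA = ensure_dir(PATH_REPO / "data" / "raw")
```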

In [None]:
# BigQuery Client Setup
def setup_bigquery_client():
    """
    Set up BigQuery client using credentials file
    """
    # Check if credentials file exists
    cred_path = "../bigquery-credentials.json"
    if not Path(cred_path).exists():
        raise FileNotFoundError(f"Credentials file not found: {cred_path}")
    
    # Set environment variable for this session
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = cred_path
    
    # Get project ID from environment
    project_id = os.getenv('GOOGLE_CLOUD_PROJECT')
    if not project_id:
        raise ValueError("GOOGLE_CLOUD_PROJECT not set in .env file")
    
    # Initialize client
    client = bigquery.Client(project=project_id)
    return client

# Initialize BigQuery client
try:
    client = setup_bigquery_client()
    print(f"BigQuery client initialized successfully!")
    print(f"Project: {client.project}")
    print(f"Using credentials from: ./bigquery-credentials.json")
except Exception as e:
    print(f"Error setting up BigQuery client: {e}")
    client = None

BigQuery client initialized successfully!
Project: mlops-zoomcamp-1337420697
Using credentials from: ./bigquery-credentials.json


# FIXME: Explain why I selected these exact features.

In [70]:
def safe_gdelt_query(start_date, end_date, limit=100, dry_run=True):
    """
    Safely query GDELT data with automatic cost estimation
    
    Args:
        start_date (str): Start date in 'YYYY-MM-DD' format
        end_date (str): End date in 'YYYY-MM-DD' format  
        limit (int): Maximum number of rows to return
        dry_run (bool): If True, only estimate query cost
    """

    if client is None:
        raise ValueError("BigQuery client not initialized")
    
    # Convert dates to GDELT format (YYYYMMDD) as integers
    start_gdelt = int(start_date.replace('-', ''))
    end_gdelt = int(end_date.replace('-', ''))

    query = f"""
    SELECT 
        SQLDATE,                -- event date
        MonthYear,              -- month and year
        EventCode,
        EventBaseCode,
        EventRootCode,
        QuadClass,
        GoldsteinScale,
        Actor1Code,
        Actor1Name,
        Actor1CountryCode,
        Actor1Type1Code,
        Actor1Type2Code,
        Actor1Type3Code,
        Actor2Code,
        Actor2Name,
        Actor2CountryCode,
        Actor2Type1Code,
        Actor2Type2Code,
        Actor2Type3Code,
        ActionGeo_CountryCode,
        ActionGeo_ADM1Code,
        ActionGeo_Lat,
        ActionGeo_Long,
        ActionGeo_FeatureID,
        NumArticles             -- target variable
    FROM `gdelt-bq.gdeltv2.events`
    WHERE SQLDATE >= {start_gdelt}  -- start date
      AND SQLDATE <= {end_gdelt}    -- end date
    ORDER BY RAND()                 -- order rows randomly to get random sample
    LIMIT {limit}                   -- limit the number of rows to return
    """

    # Always do a dry run first for cost estimation
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    dry_job = client.query(query, job_config=job_config)

    bytes_processed = dry_job.total_bytes_processed
    estimated_cost = (bytes_processed / 1e12) * 5  # $5 per TB

    print(
        f"Query will process: {bytes_processed:,} bytes "
        f"({bytes_processed/1e6:.2f} MB) or rather "
        f"({bytes_processed/1e9:.2f} GB)."
    )
    print(f"Estimated cost: ${estimated_cost:.6f}")

    if dry_run:
        print("Dry run complete - no data retrieved")
        return None

    # Execute the actual query
    print("Executing query...")
    # pd.read_gbq is deprecated in recent pandas versions, so call pandas_gbq directly
    df = pandas_gbq.read_gbq(query, project_id=client.project, dialect='standard')

    print(f"Query completed! Retrieved {len(df)} rows")
    return df
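
The function above converts dates with a plain string replace, so a malformed input such as `'2024-1-1'` would silently become `202411`. The following is a sketch of a stricter conversion using only the standard library; `to_gdelt_date` and `validate_date_range` are hypothetical helper names, not part of the notebook.

```python
from datetime import datetime


def to_gdelt_date(date_str: str) -> int:
    """Convert 'YYYY-MM-DD' to the integer YYYYMMDD form used by SQLDATE.

    Raises ValueError if the string is not a valid calendar date.
    """
    parsed = datetime.strptime(date_str, "%Y-%m-%d")
    return int(parsed.strftime("%Y%m%d"))


def validate_date_range(start_date: str, end_date: str) -> tuple[int, int]:
    """Convert both dates and make sure the start is not after the end."""
    start, end = to_gdelt_date(start_date), to_gdelt_date(end_date)
    if start > end:
        raise ValueError(f"start_date {start_date} is after end_date {end_date}")
    return start, end


print(validate_date_range("2024-01-01", "2024-12-31"))
```

Because both values end up as validated integers, interpolating them into the query string stays safe.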

Get 10k random rows from the GDELT events table for the year 2024.

I decided to go for a sample size of 10k rows, because that should be an
acceptable balance between training speed and showing the model enough data.
With an 80:20 train:test split, I end up with 8k rows for training
and 2k rows for testing.
There are just 24 features and one target variable.
So the ratio of rows to features is 10000:24, or about 416.67 rows per feature.
I intend to use tree-based algorithms such as XGBoost, CatBoost and LightGBM.
They are rather data-efficient, and at this ratio, the sample may already be
enough for acceptable performance.

I could go for **much** more data than that, but then models would
take much longer to train.
This subset is meant to speed up development.
I could also have gone for much less, but then whatever I train would likely
underperform.
So I go with 10k rows as a compromise and check how well the model performs.
If it performs well enough, I won't need a larger subset.
If it doesn't, I can at least select hyperparameters on this subset and then
move to a larger one.
Then again, this is not a machine learning engineering course, but a machine
learning *operations* course, so I don't need the best possible model in the
first place.
A good model is sufficient.
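
The arithmetic behind these numbers, as a short sketch. The figures (10,000 rows, 24 features, 80:20 split) come from the text above; the rows-per-feature value for the training split alone is an extra figure derived from them.

```python
n_rows = 10_000
n_features = 24
train_fraction = 0.8

# Split sizes for an 80:20 train:test split
n_train = round(n_rows * train_fraction)
n_test = n_rows - n_train

# Rows available per feature, for the full sample and for the training split
rows_per_feature = n_rows / n_features
train_rows_per_feature = n_train / n_features

print(f"train rows: {n_train}, test rows: {n_test}")
print(f"rows per feature (full sample): {rows_per_feature:.2f}")
print(f"rows per feature (training split): {train_rows_per_feature:.2f}")
```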

In [None]:
# Start with just a dry run to check costs
test_df = safe_gdelt_query(
    '2024-01-01',
    '2024-12-31',
    limit=10000,
    dry_run=True
)

Query will process: 101,663,870,564 bytes (101663.87 MB) or rather (101.66 GB).
Estimated cost: $0.508319
Dry run complete - no data retrieved
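
The estimate above can be reproduced by hand. This sketch reuses the two figures this notebook assumes, a rate of $5 per TB and a free quota of 1 TB per month, both treated as decimal terabytes. They may not match current BigQuery pricing, so treat them as placeholders; the function names are hypothetical.

```python
def estimate_query_cost(bytes_processed: int, usd_per_tb: float = 5.0) -> float:
    """Estimated on-demand cost in USD, using the rate assumed in this notebook."""
    return bytes_processed / 1e12 * usd_per_tb


def queries_within_quota(bytes_processed: int, quota_bytes: float = 1e12) -> int:
    """Number of identical queries that fit into the assumed monthly free quota."""
    return int(quota_bytes // bytes_processed)


bytes_from_dry_run = 101_663_870_564  # value reported by the dry run above

print(f"Estimated cost: ${estimate_query_cost(bytes_from_dry_run):.6f}")
print(f"Queries fitting into the quota: {queries_within_quota(bytes_from_dry_run)}")
```

Each run scans roughly a tenth of the assumed quota, so repeating the query many times in one month would use it up.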


In [None]:
# The estimate looks fine, so run the real query.
# This loads a pandas DataFrame into data_gdelt.

# The `if False:` guard is just another layer of safety, so the query doesn't
# run automatically when the whole notebook is executed.
# BigQuery can generate some costs, but at this rate it won't, because we're
# still well below the free quota of 1TB per month
if False:
    data_gdelt = safe_gdelt_query(
        '2024-01-01',
        '2024-12-31',
        limit=10000,
        dry_run=False
    )

Query will process: 101,664,169,301 bytes (101664.17 MB) or rather (101.66 GB).
Estimated cost: $0.508321
Executing query...




Query completed! Retrieved 10000 rows
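
The hard-coded `if False:` guard has to be edited by hand every time the query should run. One possible alternative, sketched here under the assumption that the parquet file written later in this notebook serves as a cache: run the query only when that file is missing. `should_run_query` is a hypothetical helper.

```python
from pathlib import Path


def should_run_query(cache_path: Path, force: bool = False) -> bool:
    """Return True if no cached file exists yet, or if `force` is set."""
    return force or not cache_path.exists()


# Hypothetical usage with the names defined in this notebook:
# cache_file = PATH_DATA / "gdelt_events_2024_subset_10k.parquet"
# if should_run_query(cache_file):
#     data_gdelt = safe_gdelt_query("2024-01-01", "2024-12-31", limit=10000, dry_run=False)
# else:
#     data_gdelt = pd.read_parquet(cache_file)
```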


In [None]:
# quick check that the download worked
data_gdelt.head()

| | SQLDATE | MonthYear | EventCode | EventBaseCode | EventRootCode | QuadClass | GoldsteinScale | Actor1Code | Actor1Name | Actor1CountryCode | ... | Actor2CountryCode | Actor2Type1Code | Actor2Type2Code | Actor2Type3Code | ActionGeo_CountryCode | ActionGeo_ADM1Code | ActionGeo_Lat | ActionGeo_Long | ActionGeo_FeatureID | NumArticles |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20241127 | 202411 | 111 | 111 | 11 | 3 | -2.0 | | | | ... | RUS | MIL | | | RS | RS48 | 55.7522 | 37.6156 | -2960561 | 10 |
| 1 | 20240520 | 202405 | 30 | 30 | 3 | 1 | 4.0 | CVL | VOTER | | ... | | | | | US | USCA | 34.1819 | -118.36 | 273472 | 2 |
| 2 | 20240417 | 202404 | 61 | 61 | 6 | 2 | 6.4 | GBR | BRITAIN | GBR | ... | | | | | NZ | NZ | -42.0 | 174.0 | NZ | 5 |
| 3 | 20240916 | 202409 | 190 | 190 | 19 | 4 | -10.0 | HTI | HAITI | HTI | ... | | CVL | | | HA | HA11 | 18.5392 | -72.335 | -70311 | 5 |
| 4 | 20240524 | 202405 | 46 | 46 | 4 | 1 | 7.0 | | | | ... | MRT | | | | MR | MR06 | 18.1194 | -16.0406 | -1402901 | 2 |


In [85]:
# write data to parquet, so I can re-use it later without querying again
data_gdelt.to_parquet(
    PATH_DATA / "gdelt_events_2024_subset_10k.parquet",
    index=False
)

In [None]:
# load data from parquet again
data_gdelt = pd.read_parquet(
    PATH_DATA / "gdelt_events_2024_subset_10k.parquet"
)
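
After reloading, it can be worth confirming that all 25 queried columns (24 features plus the `NumArticles` target) survived the round trip. A minimal sketch using only the standard library; the column list is copied from the query above, and `missing_columns` is a hypothetical helper.

```python
EXPECTED_COLUMNS = [
    "SQLDATE", "MonthYear", "EventCode", "EventBaseCode", "EventRootCode",
    "QuadClass", "GoldsteinScale",
    "Actor1Code", "Actor1Name", "Actor1CountryCode",
    "Actor1Type1Code", "Actor1Type2Code", "Actor1Type3Code",
    "Actor2Code", "Actor2Name", "Actor2CountryCode",
    "Actor2Type1Code", "Actor2Type2Code", "Actor2Type3Code",
    "ActionGeo_CountryCode", "ActionGeo_ADM1Code", "ActionGeo_Lat",
    "ActionGeo_Long", "ActionGeo_FeatureID",
    "NumArticles",  # target variable
]


def missing_columns(actual, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from `actual`, in order."""
    actual_set = set(actual)
    return [name for name in expected if name not in actual_set]


# Hypothetical usage with the DataFrame loaded above:
# assert missing_columns(data_gdelt.columns) == []
```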