<h1 style="text-align: center;">Data Preprocessing & Integrity Check</h1>


# 1. Setup & Loading Data

This notebook prepares the raw market data. We remove duplicate bars, check for missing values, and make timestamps consistent across currencies, so that later feature construction and modeling rest on reliable data.

In [1]:
import os
from pathlib import Path

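# Walk up from the current directory until we reach the repository root,
# identified by its .git folder (the loop stops at the filesystem root if none is found).
repo_root = Path.cwd()

while not (repo_root / ".git").exists() and repo_root.parent != repo_root:
    repo_root = repo_root.parent

# Work from the repository root so relative paths such as data/raw/... resolve.
os.chdir(repo_root)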
print(f"Current working directory set to: {repo_root}")

Current working directory set to: c:\Users\Lenovo\Desktop\Git Uploads\cross-currency-extrema-forecasting


In [2]:
import pandas as pd
from src.data.data_integrity import DataIntegrityChecker

df = pd.read_parquet("data/raw/currencies_market_data.parquet")

# open_time is parsed as Unix epoch milliseconds and converted to datetime.
df["open_time"] = pd.to_datetime(df["open_time"], unit="ms")
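
To illustrate the timestamp conversion, the sketch below applies the same `pd.to_datetime(..., unit="ms")` call to a toy frame. The sample values are illustrative only; the one assumption is that `open_time` holds Unix epoch milliseconds, as the `unit="ms"` argument implies.

```python
import pandas as pd

# Toy epoch-millisecond timestamps, one minute apart (illustrative values).
toy = pd.DataFrame({"open_time": [1740952800000, 1740952860000]})
toy["open_time"] = pd.to_datetime(toy["open_time"], unit="ms")

print(toy["open_time"].iloc[0])  # 2025-03-02 22:00:00
```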

# 2. Integrity Check & Cleaning

The first step identifies and removes duplicate time bars within each currency. Duplicates made up about 6.45% of the data (206,187 rows) and were dropped.
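
A duplicate check of this kind could look roughly like the sketch below. It is not the code inside `DataIntegrityChecker`, which is not shown in this notebook; the helper name `drop_duplicate_bars` is hypothetical, and it assumes a duplicate means a repeated `(currency, open_time)` pair.

```python
import pandas as pd

def drop_duplicate_bars(frame: pd.DataFrame):
    """Drop repeated (currency, open_time) bars, keeping the first occurrence.

    Hypothetical helper: returns the cleaned frame, the number of dropped
    rows, and their share of the input in percent.
    """
    dup_mask = frame.duplicated(subset=["currency", "open_time"], keep="first")
    n_dupes = int(dup_mask.sum())
    share = 100.0 * n_dupes / len(frame) if len(frame) else 0.0
    return frame.loc[~dup_mask].reset_index(drop=True), n_dupes, share

# Toy data: the second EURUSD bar appears twice.
toy = pd.DataFrame({
    "open_time": pd.to_datetime([
        "2025-03-03 00:00", "2025-03-03 00:01",
        "2025-03-03 00:01", "2025-03-03 00:00",
    ]),
    "currency": ["EURUSD", "EURUSD", "EURUSD", "USDJPY"],
    "close": [1.0400, 1.0500, 1.0500, 150.10],
})

deduped, n_dupes, share = drop_duplicate_bars(toy)
print(f"Dropped {n_dupes} duplicate rows ({share:.2f}% of all data)")
```

Keying on both `currency` and `open_time` matters here: the same timestamp legitimately appears once per currency, so deduplicating on `open_time` alone would discard valid bars.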

Next, we check for missing values. None were found, so the dataset is complete in terms of recorded prices and volumes. We also remove Sunday entries, because the early forex session on Sundays typically has very low liquidity and more noise, which could hurt model performance. These weekend-boundary entries account for about 0.52% of the data.
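
The missing-value check and the Sunday filter can be sketched as follows. This assumes a simple calendar-day rule (every bar timestamped on a Sunday is dropped); the exact boundary rule applied by `DataIntegrityChecker` is not shown in this notebook and may differ.

```python
import pandas as pd

# Toy bars around a weekend boundary (illustrative values).
toy = pd.DataFrame({
    "open_time": pd.to_datetime([
        "2025-03-07 21:58",  # Friday
        "2025-03-09 22:00",  # Sunday
        "2025-03-09 22:01",  # Sunday
        "2025-03-10 00:00",  # Monday
    ]),
    "currency": "EURUSD",
    "close": [1.0830, 1.0841, 1.0839, 1.0845],
})

# Missing values per column; all zeros means the data is complete.
nan_counts = toy.isna().sum()

# pandas numbers weekdays Monday=0 ... Sunday=6.
is_sunday = toy["open_time"].dt.dayofweek == 6
filtered = toy.loc[~is_sunday].reset_index(drop=True)
removed_share = 100.0 * is_sunday.mean()

print(nan_counts)
print(f"Removed {int(is_sunday.sum())} Sunday rows ({removed_share:.2f}% of the data)")
```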


In [3]:
checker = DataIntegrityChecker(df)
clean_df, summary = checker.run_all_checks()

🚀 Running data integrity pipeline...

✅ Column structure verified.

📊 Data types:
open_time    datetime64[ns]
open                float64
high                float64
low                 float64
close               float64
volume              float64
currency             object
dtype: object

⚠️ Found 206187 duplicate entries (6.45% of all data). Dropping them.

✅ No NaN values detected.

📅 Data date ranges by currency:
   - EURUSD: 2025-03-02 22:00:00 → 2025-09-22 12:19:00
   - USDJPY: 2025-04-01 00:00:00 → 2025-09-22 12:21:00
   - GBPUSD: 2025-03-02 22:00:00 → 2025-09-22 12:22:00
   - AUDUSD: 2025-03-02 22:00:00 → 2025-09-22 12:24:00
   - USDCAD: 2025-03-02 22:00:00 → 2025-09-22 12:26:00
   - USDCHF: 2025-03-02 22:00:00 → 2025-09-22 12:20:00
   - NZDUSD: 2025-03-02 22:00:00 → 2025-09-22 12:21:00
   - EURJPY: 2025-03-02 22:00:00 → 2025-09-22 12:23:00
   - GBPJPY: 2025-03-02 22:00:00 → 2025-09-22 12:25:00
   - AUDJPY: 2025-03-02 22:00:00 → 2025-09-22 12:26:00
   - AUDSGD: 2025-03-02 22:


Time continuity was then verified for every currency, revealing only small gaps (at most 1.54% per currency and 1.18% overall). These gaps were forward-filled within valid trading hours, giving continuous time series for modeling. Finally, entries from the incomplete last day (2025-09-22 00:00:00 onward) were dropped, so the dataset contains only complete trading days.
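
The gap handling described above can be sketched for a single currency as follows. This is an illustration, not the pipeline's own code: the helper `fill_minute_gaps` is hypothetical, the 1-minute bar frequency is an assumption based on the timestamps in the output, and the trading-hours restriction is left out for brevity.

```python
import pandas as pd

def fill_minute_gaps(bars: pd.DataFrame):
    """Reindex one currency's bars onto a 1-minute grid and forward-fill gaps.

    Hypothetical helper: returns the filled frame and the share of grid
    slots that were missing, in percent. Assumes unique open_time values.
    """
    indexed = bars.set_index("open_time").sort_index()
    grid = pd.date_range(indexed.index.min(), indexed.index.max(), freq="1min")
    gap_share = 100.0 * (len(grid) - len(indexed)) / len(grid)
    filled = indexed.reindex(grid).ffill()  # missing slots copy the previous bar
    filled.index.name = "open_time"
    return filled.reset_index(), gap_share

# Toy bars with the 00:00 bar missing (illustrative values).
toy = pd.DataFrame({
    "open_time": pd.to_datetime([
        "2025-09-21 23:58", "2025-09-21 23:59",
        "2025-09-22 00:01", "2025-09-22 00:02",
    ]),
    "currency": "EURUSD",
    "close": [1.1740, 1.1742, 1.1745, 1.1744],
})

filled, gap_share = fill_minute_gaps(toy)

# Keep only bars before the start of the incomplete last day.
cutoff = pd.Timestamp("2025-09-22 00:00:00")
complete = filled.loc[filled["open_time"] < cutoff]

print(f"{gap_share:.2f}% of minute bars were missing")
print(complete)
```

On the full dataset the same steps would be applied to each currency separately, since every currency has its own start time and its own gaps.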


# 3. Post-Cleaning Verification & Exploration

In this section we re-run the integrity pipeline on the cleaned data to confirm that none of the problems described above remain.

In [4]:
verifier = DataIntegrityChecker(clean_df)
verifier.run_all_checks(False)

🚀 Running data integrity pipeline...

✅ Column structure verified.

📊 Data types:
open_time    datetime64[ns]
currency             object
open                float64
high                float64
low                 float64
close               float64
volume              float64
dtype: object

✅ No duplicate entries detected.

✅ No NaN values detected.

📅 Data date ranges by currency:
   - EURUSD: 2025-03-02 22:00:00 → 2025-09-21 23:59:00
   - USDJPY: 2025-04-01 00:00:00 → 2025-09-21 23:59:00
   - GBPUSD: 2025-03-02 22:00:00 → 2025-09-21 23:59:00
   - AUDUSD: 2025-03-02 22:00:00 → 2025-09-21 23:59:00
   - USDCAD: 2025-03-02 22:00:00 → 2025-09-21 23:59:00
   - USDCHF: 2025-03-02 22:00:00 → 2025-09-21 23:59:00
   - NZDUSD: 2025-03-02 22:00:00 → 2025-09-21 23:59:00
   - EURJPY: 2025-03-02 22:00:00 → 2025-09-21 23:59:00
   - GBPJPY: 2025-03-02 22:00:00 → 2025-09-21 23:59:00
   - AUDJPY: 2025-03-02 22:00:00 → 2025-09-21 23:59:00
   - AUDSGD: 2025-03-02 22:00:00 → 2025-09-21 23:59:00
   - SGDJ

In [5]:
summary

| currency | open count | open mean | open std | open min | open 1% | open 10% | open 50% | open 90% | open 99% | open max | ... | volume count | volume mean | volume std | volume min | volume 1% | volume 10% | volume 50% | volume 90% | volume 99% | volume max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AUDJPY | 208920.0 | 94.297089 | 2.18661 | 86.1835 | 87.99538 | 91.101 | 94.399 | 96.897 | 98.1585 | 98.4195 | ... | 208920.0 | 126.530887 | 61.979075 | 1.0 | 12.0 | 51.0 | 120.0 | 211.0 | 295.0 | 424.0 |
| AUDSGD | 208920.0 | 0.836854 | 0.006812 | 0.801185 | 0.810226 | 0.83037 | 0.8368 | 0.84481 | 0.85329 | 0.85475 | ... | 208920.0 | 50.846243 | 23.011019 | 1.0 | 11.0 | 30.0 | 47.0 | 78.0 | 135.0 | 283.0 |
| AUDUSD | 208920.0 | 0.644545 | 0.01228 | 0.59402 | 0.600851 | 0.62883 | 0.6473 | 0.657505 | 0.66691 | 0.670565 | ... | 208920.0 | 95.643107 | 57.485616 | 1.0 | 6.0 | 29.0 | 87.0 | 171.0 | 279.0 | 436.0 |
| EURAUD | 178560.0 | 1.777645 | 0.020143 | 1.711365 | 1.724609 | 1.751025 | 1.781172 | 1.79687 | 1.82971 | 1.85532 | ... | 178560.0 | 124.447463 | 65.541043 | 1.0 | 10.59 | 46.0 | 117.0 | 214.0 | 306.0 | 420.0 |
| EURGBP | 178560.0 | 0.857128 | 0.010132 | 0.832505 | 0.835835 | 0.841755 | 0.859949 | 0.868096 | 0.87312 | 0.87523 | ... | 178560.0 | 182.316482 | 88.651932 | 1.0 | 10.0 | 48.0 | 195.0 | 291.0 | 357.0 | 446.0 |
| EURJPY | 208920.0 | 166.772889 | 4.832381 | 155.648 | 157.29657 | 161.438 | 165.61125 | 172.7235 | 173.818 | 174.502 | ... | 208920.0 | 206.769026 | 97.590638 | 1.0 | 6.0 | 57.0 | 226.0 | 321.0 | 380.0 | 443.0 |
| EURUSD | 208920.0 | 1.140036 | 0.032646 | 1.039045 | 1.050837 | 1.08457 | 1.145285 | 1.17364 | 1.182399 | 1.191135 | ... | 208920.0 | 118.952398 | 72.553409 | 1.0 | 4.0 | 31.0 | 110.0 | 218.0 | 324.0 | 446.0 |
| GBPJPY | 208920.0 | 195.212657 | 3.698519 | 184.41 | 186.943095 | 189.5935 | 195.5345 | 199.468 | 200.404 | 201.2605 | ... | 208920.0 | 138.615886 | 68.157357 | 1.0 | 6.0 | 54.0 | 134.0 | 229.0 | 312.0 | 440.0 |
| GBPUSD | 208920.0 | 1.334451 | 0.025263 | 1.25833 | 1.27126 | 1.29273 | 1.342235 | 1.35935 | 1.373725 | 1.37878 | ... | 208920.0 | 119.118127 | 68.928002 | 1.0 | 4.0 | 32.0 | 113.0 | 212.0 | 305.0 | 440.0 |
| NZDUSD | 208920.0 | 0.590663 | 0.011957 | 0.548985 | 0.55623 | 0.572095 | 0.593485 | 0.60359 | 0.608915 | 0.612025 | ... | 208920.0 | 89.47333 | 52.939344 | 1.0 | 5.0 | 23.0 | 85.0 | 157.0 | 251.0 | 399.0 |


✅ The post-cleaning check passes, and the ranges and quantiles in the descriptive summary table show no notable outliers. We can now save the cleaned dataset for the next steps of the project.
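
A per-currency summary with the same quantile columns can be produced with pandas' `groupby(...).describe(percentiles=...)`, as in the sketch below. Whether `DataIntegrityChecker` builds its summary this way is an assumption; the toy data is randomly generated, and the relative-range figure at the end is a hypothetical screening aid rather than a rule taken from the pipeline.

```python
import numpy as np
import pandas as pd

# Randomly generated toy data for two currencies (not real market data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "currency": np.repeat(["EURUSD", "USDJPY"], 500),
    "open": np.concatenate([
        rng.normal(1.14, 0.03, 500),
        rng.normal(147.0, 3.0, 500),
    ]),
    "volume": rng.integers(1, 450, 1000).astype(float),
})

# One row per currency; columns form a (field, statistic) MultiIndex.
summary = toy.groupby("currency")[["open", "volume"]].describe(
    percentiles=[0.01, 0.10, 0.50, 0.90, 0.99]
)

# Hypothetical screen: price range relative to the median. A value far above
# the other currencies would point to a stray tick worth inspecting.
relative_range = (
    summary[("open", "max")] - summary[("open", "min")]
) / summary[("open", "50%")]

print(summary["open"])
print(relative_range)
```

Comparing `min` and `max` with the 1% and 99% quantiles is the useful part of this table: an isolated bad tick moves the extremes a long way while leaving the quantiles almost unchanged.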

In [6]:
clean_df.to_parquet("data/processed/clean_data.parquet", index=False)

With the dataset preprocessed and standardized, we are ready to define the modeling targets. The next notebook covers target construction, specifying the values that our models will learn to predict. See [→ Notebook 02 – Target Construction](02_target_construction.ipynb).