In [None]:
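import sys, subprocess
if "google.colab" in sys.modules:
    # Use the running interpreter's pip so packages install into this kernel's environment.
    packages = ["pandas", "numpy", "scikit-learn", "requests", "pydantic", "jsonschema"]
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *packages], check=True)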


# Clean Synthetic Experiment Records

**What**: Load, validate, and clean synthetic experimental data.

**Why**: Data quality is the foundation of reliable research. Identifying missing values, outliers, and inconsistencies early prevents errors in downstream analysis.

**How**:
1. **Load the dataset** into a pandas DataFrame.
2. **Parse timestamps** so that time-based sorting and filtering operate on real datetimes rather than strings.
3. **Count missing values** in each column.
4. **Perform range checks** to identify invalid data points.

**Key Concept**: **Data Validation** involves checking data against a set of rules (e.g., "metrics must be between 0 and 100") to ensure its logical consistency.
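
As a rough illustration of rule-based validation, the sketch below stores each rule as a column name mapped to an inclusive range and counts the values that break it. The `RULES` dictionary, the `count_violations` helper, and the `demo` frame are hypothetical names and made-up values introduced only for this example. Only the `metric_value` column and its 0 to 100 range come from this notebook.

```python
import pandas as pd

# Hypothetical rules: column name -> (lower bound, upper bound), both inclusive.
RULES = {"metric_value": (0, 100)}


def count_violations(df: pd.DataFrame, rules: dict) -> dict:
    """Return the number of non-missing values outside each column's range."""
    violations = {}
    for column, (low, high) in rules.items():
        values = df[column]
        # between() is inclusive by default. Missing values are excluded here
        # so they are reported as missing, not as range violations.
        out_of_range = values.notna() & ~values.between(low, high)
        violations[column] = int(out_of_range.sum())
    return violations


# Tiny made-up frame for illustration only; not the notebook's dataset.
demo = pd.DataFrame({"metric_value": [12.5, 101.0, -3.0, None, 100.0]})
print(count_violations(demo, RULES))  # {'metric_value': 2}
```

Keeping rules in a dictionary makes it easy to add a check for another column without touching the loop, though a schema library may be a better fit once the rule set grows.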

By the end of this notebook, you will have loaded the experiment records, parsed their timestamps, and checked `metric_value` against its expected range.

### Success criteria
- You loaded experiment records.
- You parsed timestamps and checked ranges.
- You counted missing values and out-of-range metrics, and either flagged problems or confirmed that the data is clean.

In [None]:
from pathlib import Path


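def find_data_dir() -> Path:
    """Return the first nearby data directory that contains the experiments CSV."""
    # Look in the working directory and up to two parent levels.
    candidates = [Path.cwd() / "data", Path.cwd().parent / "data", Path.cwd().parent.parent / "data"]
    for candidate in candidates:
        # Check for the file this notebook actually loads.
        if (candidate / "sample_tabular" / "experiments_sample.csv").exists():
            return candidate
    raise FileNotFoundError("data directory not found. Run scripts/generate_synthetic_data.py.")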

DATA_DIR = find_data_dir()


In [None]:
import pandas as pd

experiments = pd.read_csv(DATA_DIR / "sample_tabular" / "experiments_sample.csv")
experiments["timestamp"] = pd.to_datetime(experiments["timestamp"])

print("Dataset shape:", experiments.shape)
print("Missing values per column:", experiments.isna().sum(), sep="\n")
experiments.head()


## Basic range checks

In [None]:
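metric = experiments["metric_value"]
# between() returns False for NaN, so exclude missing values here;
# they are already counted in the missing-value summary.
metric_out_of_range = (metric.notna() & ~metric.between(0, 100)).sum()
print(f"Records outside expected metric range: {metric_out_of_range}")
experiments.describe()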


### If you get stuck / What to try next

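**If you get stuck**: rerun the cell that loads the data and parses timestamps to confirm that parsing succeeds, and confirm that the synthetic data exists by running `scripts/generate_synthetic_data.py`.

**What to try next**: create features in `pipelines/tabular/feature_engineering.ipynb` and visualize them in the tabular notebooks.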
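
If timestamp parsing is the sticking point, one way to find the offending values is to parse with `errors="coerce"` and inspect the entries that come back as `NaT`. This is a minimal sketch on made-up strings. The `raw`, `parsed`, and `failed` names are illustrative, and in the notebook you would apply the same idea to the `timestamp` column before it is converted.

```python
import pandas as pd

# Made-up values for illustration; the real column comes from the CSV.
raw = pd.Series(["2024-01-05 10:00:00", "not a date", "2024-01-06 11:30:00"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

# Entries that were present in the input but failed to parse.
failed = raw[parsed.isna() & raw.notna()]
print(failed.tolist())  # ['not a date']
```

An empty `failed` series suggests the timestamps are well formed and the problem lies elsewhere, for example in a missing or renamed column.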