# 01. Data Loading & Schema Validation

**Objective:** Load the raw crime incidents data, validate its schema against expected types/constraints, and perform an initial audit of data quality (specifically missing values).

**Inputs:** `data/crime_incidents_combined.parquet`
**Outputs:** Schema validation report, Missing value analysis

In [1]:
import sys
from pathlib import Path
# Add project root to sys.path to allow importing scripts
PROJECT_ROOT = Path("..").resolve()
sys.path.append(str(PROJECT_ROOT))

import pandas as pd
import pandera as pa
from pandera import errors
import scripts.config as config
from scripts.data_loader import load_raw_data, Schema

# Display settings
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 1000)

Running this cell emits a pandera deprecation warning: importing from the top-level `pandera` module will be **removed in a future version of pandera**. When validating pandas objects, pandera recommends updating the import:

```
# old import
import pandera as pa

# new import
import pandera.pandas as pa
```

To validate objects from other compatible libraries such as pyspark or polars, see the supported libraries section of the documentation:

https://pandera.readthedocs.io/en/stable/supported_libraries.html



## 1. Load Data

Load the data with the centralized data loader, which handles column renaming and standard type conversion.

In [2]:
try:
    df = load_raw_data(validate=True)
    print(f"Successfully loaded {len(df):,} records.")
except errors.SchemaErrors as err:
    print("Schema validation failed with the following errors:")
    print(err.failure_cases)
    # load_raw_data raises on validation failure, so `df` is never assigned above.
    # This audit should still inspect the data, so reload with validation disabled
    # instead of stopping.
    print("Reloading without strict validation to continue audit...")
    df = load_raw_data(validate=False)

Loading data from /Users/dustinober/Projects/Crime Incidents Philadelphia/data/crime_incidents_combined.parquet...


Validating schema...


Schema validation passed.
Successfully loaded 3,496,353 records.
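
The internals of `load_raw_data` are not shown in this notebook. As a rough sketch of what a loader that renames columns and applies standard type conversions could look like, the block below uses a hypothetical `RENAME_MAP`, `DTYPES`, and `load_raw_data_sketch`. All three are assumptions for illustration, not the contents of `scripts.data_loader`, and the input rows are toy values.

```python
import pandas as pd

# Hypothetical mapping from raw column names to the names used in this notebook.
RENAME_MAP = {"point_y": "lat", "point_x": "lng"}

# Target dtypes for a subset of columns (assumed, chosen to match the schema below).
DTYPES = {"dc_dist": "int64", "ucr_general": "int64", "lat": "float64", "lng": "float64"}


def load_raw_data_sketch(raw: pd.DataFrame) -> pd.DataFrame:
    """Rename columns and coerce types on an already-loaded frame."""
    df = raw.rename(columns=RENAME_MAP)
    # Parse timestamps as timezone-aware UTC datetimes.
    df["dispatch_date_time"] = pd.to_datetime(df["dispatch_date_time"], utc=True)
    # Cast only the columns that are actually present.
    present = {col: dtype for col, dtype in DTYPES.items() if col in df.columns}
    return df.astype(present)


# Toy input standing in for the parquet file.
raw = pd.DataFrame(
    {
        "dispatch_date_time": ["2024-01-01 12:30:00", "2024-01-02 08:15:00"],
        "dc_dist": ["12", "3"],
        "ucr_general": ["600", "300"],
        "point_y": [39.95, 39.99],
        "point_x": [-75.16, -75.12],
    }
)
clean = load_raw_data_sketch(raw)
print(clean.dtypes)
```

Doing the renaming and casting in one place means every downstream notebook sees the same column names and dtypes, which is presumably why the project centralizes it.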


## 2. Schema Definition & Data Dictionary

The data is validated against the following schema:

In [3]:
print(Schema)

<Schema DataFrameSchema(
    columns={
        'cartodb_id': <Schema Column(name=cartodb_id, type=DataType(int64))>
        'dispatch_date_time': <Schema Column(name=dispatch_date_time, type=DataType(datetime64[ns, UTC]))>
        'dc_dist': <Schema Column(name=dc_dist, type=DataType(int64))>
        'psa': <Schema Column(name=psa, type=DataType(str))>
        'ucr_general': <Schema Column(name=ucr_general, type=DataType(int64))>
        'text_general_code': <Schema Column(name=text_general_code, type=DataType(str))>
        'location_block': <Schema Column(name=location_block, type=DataType(str))>
        'lat': <Schema Column(name=lat, type=DataType(float64))>
        'lng': <Schema Column(name=lng, type=DataType(float64))>
    },
    checks=[],
    parsers=[],
    coerce=True,
    dtype=None,
    index=None,
    strict=False,
    name=None,
    ordered=False,
    unique_column_names=False,
    metadata=None, 
    add_missing_columns=False
)>


### Data Dictionary

| Column | Type | Description | Required |
|---|---|---|---|
| `cartodb_id` | int | Unique identifier | Yes |
| `dispatch_date_time` | datetime (UTC) | Date and time of the incident dispatch | Yes |
| `dc_dist` | int | Police district (District ID) | Yes |
| `psa` | str | Police Service Area | No |
| `ucr_general` | int | Uniform Crime Reporting General Code | Yes |
| `text_general_code` | str | Description of the crime type | No |
| `location_block` | str | Block-level location address | No |
| `lat` | float | Latitude | No |
| `lng` | float | Longitude | No |
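
The objective also calls for a missing-value audit. One way to approach it is sketched below on a toy frame. The `REQUIRED` list is read off the data dictionary above, and `missing_value_report` is a hypothetical helper, not part of `scripts`.

```python
import pandas as pd

# Columns marked "Yes" under Required in the data dictionary.
REQUIRED = ["cartodb_id", "dispatch_date_time", "dc_dist", "ucr_general"]


def missing_value_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    report = pd.DataFrame(
        {
            "missing": df.isna().sum(),
            "missing_pct": (df.isna().mean() * 100).round(2),
        }
    )
    # Flag required columns so that any gaps in them stand out.
    report["required"] = report.index.isin(REQUIRED)
    return report.sort_values("missing", ascending=False)


# Toy frame; the real audit would pass the loaded `df` instead.
toy = pd.DataFrame(
    {
        "cartodb_id": [1, 2, 3, 4],
        "dc_dist": [12, 3, 3, 25],
        "psa": ["1", None, "2", "3"],
        "lat": [39.95, None, None, 40.01],
    }
)
report = missing_value_report(toy)
print(report)
```

Any nonzero `missing` count on a row where `required` is true would contradict the data dictionary and is worth investigating before further analysis.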
