# Transport Validation Data – Cleaning & Normalization

This notebook prepares a clean and consistent validation dataset for downstream
travel reconstruction and usage analysis in an urban transport system.

## Data Ingestion

Raw validation data is loaded from a public GitHub repository.
The input and output paths are defined first; when the file is read, all columns
are loaded as strings to avoid implicit type coercion.
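
As a rough illustration of why reading everything as text can matter, the sketch below parses a tiny in-memory CSV twice. The two sample rows and their zero-padded identifiers are made up for the example (the real `SupportId` values look like `CARD_…`); only the column names come from the dataset.

```python
import io

import pandas as pd

# Hypothetical two-row sample: column names mirror the real file,
# the values are invented to make the coercion visible.
sample = io.StringIO(
    "SupportId,DateTime\n"
    "00123,2026-01-17 04:53:20.082\n"
    "00456,2026-01-17 05:17:43.951\n"
)

inferred = pd.read_csv(sample)            # pandas guesses a dtype per column
sample.seek(0)                            # rewind the buffer for a second read
as_text = pd.read_csv(sample, dtype=str)  # every column stays a string

print(inferred["SupportId"].tolist())  # identifiers parsed as integers
print(as_text["SupportId"].tolist())   # identifiers kept exactly as written
```

With inferred types the identifiers lose their leading zeros, whereas `dtype=str` keeps the raw text and leaves any conversion as an explicit later step.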

In [1]:
import pandas as pd
from pathlib import Path

DATA_IN = (
    "https://raw.githubusercontent.com/"
    "diferviec/transport-data-analysis/main/data/validations_anon.csv"
)

BASE_DIR = Path.cwd()
OUT_DIR = BASE_DIR / "ValidacionesClean"
DATA_OUT = OUT_DIR / "validations_clean.csv"

OUT_DIR.mkdir(parents=True, exist_ok=True)

## Temporal Consistency

Records with invalid or missing timestamps are removed to ensure that all
subsequent analyses are based on a reliable chronological order.
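
The timestamp rule described above can be sketched on a small made-up frame. The three sample values are hypothetical; the pattern (coerce, count, drop, sort) is the one this section relies on.

```python
import pandas as pd

# Hypothetical raw values: one valid timestamp, one garbage string, one missing.
raw = pd.DataFrame({
    "DateTime": ["2026-01-17 04:53:20.082", "not-a-date", None],
})

# errors="coerce" turns anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(raw["DateTime"], errors="coerce")
n_invalid = int(parsed.isna().sum())

# Keep only rows with a usable timestamp, in chronological order.
valid = (
    raw.assign(DateTime=parsed)
       .dropna(subset=["DateTime"])
       .sort_values("DateTime")
       .reset_index(drop=True)
)

print("Invalid DateTime rows:", n_invalid)
print("Rows kept:", len(valid))
```

Counting the `NaT` values before dropping them keeps the number of discarded records visible, which is useful when the count is not zero.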

In [2]:
df_raw = pd.read_csv(DATA_IN, dtype=str)

print("Rows:", len(df_raw))
print("Columns:", df_raw.columns.tolist())

df = df_raw.copy()
df["DateTime"] = pd.to_datetime(df["DateTime"], errors="coerce")

invalid_dt = df["DateTime"].isna().sum()
print("Invalid DateTime rows:", invalid_dt)

df = df[df["DateTime"].notna()].copy()
df = df.sort_values("DateTime").reset_index(drop=True)

df.head()

Rows: 15986
Columns: ['SupportId', 'DateTime', 'StopPlaceShortName', 'TransactionType', 'ProfileName', 'EquipmentModel', 'ValidationStatus', 'ValidationTicket']
Invalid DateTime rows: 0


| | SupportId | DateTime | StopPlaceShortName | TransactionType | ProfileName | EquipmentModel | ValidationStatus | ValidationTicket |
|---|---|---|---|---|---|---|---|---|
| 0 | CARD_f325e60d92 | 2026-01-17 04:53:20.082 | Parque Centenario | ENTRY | Agente | GATE | OK | RETURN_CODE_OK |
| 1 | CARD_f325e60d92 | 2026-01-17 04:53:41.044 | Parque Centenario | ENTRY | Agente | GATE | OK | RETURN_CODE_OK |
| 2 | CARD_5dfa843d36 | 2026-01-17 05:17:43.951 | Durán | EXIT | Agente | GATE | OK | RETURN_CODE_OK |
| 3 | CARD_5dfa843d36 | 2026-01-17 05:17:47.818 | Durán | EXIT | Agente | GATE | OK | RETURN_CODE_OK |
| 4 | CARD_5dfa843d36 | 2026-01-17 05:17:51.907 | Durán | EXIT | Agente | GATE | OK | RETURN_CODE_OK |


In [3]:
def top_values(s, n=10):
    return s.fillna("NULL").astype(str).value_counts().head(n)

display(pd.DataFrame({
    "ValidationStatus": top_values(df["ValidationStatus"], 10),
    "ValidationTicket": top_values(df["ValidationTicket"], 10),
}).fillna(""))

print("TransactionType (top):")
display(top_values(df["TransactionType"], 15))

print("EquipmentModel (top):")
display(top_values(df["EquipmentModel"], 15))

print("ProfileName (top):")
display(top_values(df["ProfileName"], 20))

| | ValidationStatus | ValidationTicket |
|---|---|---|
| OK | 15986.0 | |
| RETURN_CODE_CTR_OUTDATED | | 7.0 |
| RETURN_CODE_DOUBLE_VALIDATION | | 11.0 |
| RETURN_CODE_ERROR_API | | 1019.0 |
| RETURN_CODE_INSUFFICIENT_BALANCE | | 360.0 |
| RETURN_CODE_MEDIA_ERROR | | 1.0 |
| RETURN_CODE_MEDIA_OUTDATED | | 11.0 |
| RETURN_CODE_OK | | 14050.0 |
| RETURN_CODE_TOKEN_UNAUTHORIZED | | 511.0 |
| RETURN_CODE_TOKEN_VALIDITY_EXCEEDED | | 16.0 |


TransactionType (top):


TransactionType
ENTRY                   9032
EXIT                    5961
CORRESPONDENCE_ENTRY     993
Name: count, dtype: int64

EquipmentModel (top):


EquipmentModel
GATE              14802
Validadora Bus     1113
Acceso PMR           71
Name: count, dtype: int64

ProfileName (top):


ProfileName
Estandar                           13307
NULL                                 760
Tercera edad                         686
Estudiante                           507
Agente                               472
Personas con Movilidad Reducida      254
Name: count, dtype: int64

In [4]:
def log_step(before_df, after_df, label):
    removed = len(before_df) - len(after_df)
    print(f"{label}: {len(before_df):,} -> {len(after_df):,}  (removed {removed:,})")

## Validation Rules

Only successful passenger validations are retained:

- `ValidationStatus == OK`
- `ValidationTicket == RETURN_CODE_OK`

After applying these filters, both fields are removed from the dataset,
as they become constant and no longer provide analytical value.

Operational profiles (agents) are excluded to ensure that the dataset
represents only passenger activity.
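
A minimal sketch of these rules on a made-up three-row frame follows. It combines the three conditions into one boolean mask and checks that the two validation columns really are constant before dropping them; the sample rows are hypothetical, while the column names and codes are the ones listed above.

```python
import pandas as pd

# Hypothetical sample covering one kept row and two rejected rows.
sample = pd.DataFrame({
    "ValidationStatus": ["OK", "OK", "OK"],
    "ValidationTicket": ["RETURN_CODE_OK", "RETURN_CODE_ERROR_API", "RETURN_CODE_OK"],
    "ProfileName": ["Estandar", "Estandar", "Agente"],
})

mask = (
    (sample["ValidationStatus"] == "OK")
    & (sample["ValidationTicket"] == "RETURN_CODE_OK")
    & (sample["ProfileName"] != "Agente")  # rows with a missing profile are kept
)
kept = sample[mask].copy()

# Drop a validation column only if filtering left it with a single value.
constant_cols = [
    c for c in ["ValidationStatus", "ValidationTicket"]
    if kept[c].nunique(dropna=False) == 1
]
kept = kept.drop(columns=constant_cols)

print(constant_cols)
print(kept.columns.tolist())
```

Checking `nunique` first is a small safeguard: if a filter were changed later, a column that still carries information would not be dropped silently.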

## Profile Normalization

After removing operational profiles, all remaining passenger profiles
are normalized into two categories:

- `Estandar`
- `Preferencial`

Every profile other than `Estandar` is mapped to `Preferencial`; this includes
records whose profile is missing. The normalization simplifies fare segmentation
and ensures consistency for downstream analysis.
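
The two-category mapping can be sketched with `Series.where` on a few made-up profile values. The second variant, which leaves missing profiles missing, is not what this notebook does; it is shown only as an option in case unknown profiles should stay distinguishable.

```python
import pandas as pd

# Hypothetical sample: three named profiles and one missing value.
profiles = pd.Series(["Estandar", "Tercera edad", "Estudiante", None], dtype=object)

# Keep 'Estandar' as is; everything else (including missing) becomes 'Preferencial'.
two_way = profiles.where(profiles == "Estandar", "Preferencial")

# Alternative (an assumption, not the notebook's rule): restore missing values
# so that unknown profiles are not counted as 'Preferencial'.
keep_missing = two_way.mask(profiles.isna())

print(two_way.tolist())
print(keep_missing.isna().tolist())
```

The comparison `profiles == "Estandar"` evaluates to `False` for a missing value, which is why missing profiles fall into `Preferencial` in the first variant.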

## Station Normalization

Station names are trimmed and normalized to ensure semantic consistency across
the dataset. In particular, bus feeder references are consolidated into their
corresponding main station name, and a mis-encoded spelling of the same station
is repaired:

- `Bus Durán` → `Durán`
- `DurÃ¡n` → `Durán`

This avoids treating the same physical location as separate stations
in downstream analysis.
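
As a possible generalization of the explicit replacement table, the sketch below repairs the common "UTF-8 read as Latin-1" kind of garbling before applying the alias. The helper names `repair_mojibake` and `normalize_station` and the `ALIASES` dictionary are hypothetical; only the `Bus Durán` → `Durán` mapping comes from the text above.

```python
# Feeder stop -> main station, as stated in the section above.
ALIASES = {"Bus Durán": "Durán"}


def repair_mojibake(name: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Latin-1."""
    try:
        return name.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Already clean, or garbled in some other way: leave it untouched.
        return name


def normalize_station(name: str) -> str:
    """Trim whitespace, repair encoding damage, then apply station aliases."""
    cleaned = repair_mojibake(name.strip())
    return ALIASES.get(cleaned, cleaned)


garbled = "Dur\u00c3\u00a1n"  # the mis-encoded spelling, written with escapes
print(normalize_station(garbled))
print(normalize_station("  Bus Durán "))
print(normalize_station("Parque Centenario"))
```

The round trip only changes strings that decode cleanly as UTF-8 after re-encoding, so correctly spelled names such as `Durán` pass through unchanged. An explicit mapping, as used in this notebook, remains the more conservative choice when only a known handful of variants occurs.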

## Equipment Scope

Bus validators (`Validadora Bus`) are excluded at this stage to focus the dataset
on station access events within the transport system. Records from the other
equipment models, `GATE` and `Acceso PMR`, are retained.
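
A short sketch of this scope rule on made-up values shows what an exclusion filter leaves behind. The sample series is hypothetical; the three model names are the ones observed in the raw data.

```python
import pandas as pd

# Hypothetical sample using the three EquipmentModel values seen in the raw data.
equipment = pd.Series(["GATE", "Validadora Bus", "Acceso PMR", "GATE"])

# Exclusion filter: drop bus validators, keep every other model.
kept = equipment[equipment != "Validadora Bus"]
remaining = kept.value_counts().to_dict()

print(remaining)
```

Because the rule excludes one model rather than allowing a fixed list, any equipment model added to the source later would pass through; an allowlist built with `isin` would be the stricter alternative.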

In [5]:
df0 = df.copy()

df1 = df0[df0["ValidationStatus"] == "OK"].copy()
log_step(df0, df1, "Filter ValidationStatus == OK")

df2 = df1[df1["ValidationTicket"] == "RETURN_CODE_OK"].copy()
log_step(df1, df2, "Filter ValidationTicket == RETURN_CODE_OK")

# 1) First, remove agents
df3 = df2[~df2["ProfileName"].isin(["Agente"])].copy()
log_step(df2, df3, "Exclude ProfileName in ['Agente']")

# 1.1) Normalize station names
df3["StopPlaceShortName"] = (
    df3["StopPlaceShortName"]
        .astype(str)
        .str.strip()
        .replace({
            "Bus Durán": "Durán",
            "DurÃ¡n": "Durán",
        })
)

# 2) Then normalize the profile: anything other than 'Estandar' becomes 'Preferencial'
df3["ProfileName"] = df3["ProfileName"].where(
    df3["ProfileName"] == "Estandar",
    "Preferencial"
)

# 3) Exclude Validadora Bus (if present in this dataset)
df4 = df3[df3["EquipmentModel"] != "Validadora Bus"].copy()
log_step(df3, df4, "Exclude EquipmentModel == Validadora Bus")

df_clean = df4.copy()

# Drop the now-constant columns (optional, recommended for the final dataset)
cols_to_drop = ["ValidationStatus", "ValidationTicket"]
df_clean = df_clean.drop(columns=[c for c in cols_to_drop if c in df_clean.columns])

print("\nFinal rows:", len(df_clean))
print("Final columns:", df_clean.columns.tolist())

Filter ValidationStatus == OK: 15,986 -> 15,986  (removed 0)
Filter ValidationTicket == RETURN_CODE_OK: 15,986 -> 14,050  (removed 1,936)
Exclude ProfileName in ['Agente']: 14,050 -> 13,584  (removed 466)
Exclude EquipmentModel == Validadora Bus: 13,584 -> 12,566  (removed 1,018)

Final rows: 12566
Final columns: ['SupportId', 'DateTime', 'StopPlaceShortName', 'TransactionType', 'ProfileName', 'EquipmentModel']


In [6]:
string_cols = [
    "SupportId",
    "StopPlaceShortName",
    "TransactionType",
    "ProfileName",
    "EquipmentModel",
]

for c in string_cols:
    if c in df_clean.columns:
        df_clean[c] = df_clean[c].astype(str).str.strip()
        df_clean[c] = df_clean[c].replace({"nan": None, "None": None, "": None})

# Sort order required by the following steps
df_clean = df_clean.sort_values(["SupportId", "DateTime"]).reset_index(drop=True)

print("Min DateTime:", df_clean["DateTime"].min())
print("Max DateTime:", df_clean["DateTime"].max())

df_clean.head()

Min DateTime: 2026-01-17 05:30:55.393000
Max DateTime: 2026-01-17 21:21:39.535000


| | SupportId | DateTime | StopPlaceShortName | TransactionType | ProfileName | EquipmentModel |
|---|---|---|---|---|---|---|
| 0 | CARD_0015fb32fd | 2026-01-17 13:20:30.830 | Durán | ENTRY | Estandar | GATE |
| 1 | CARD_0015fb32fd | 2026-01-17 13:20:40.237 | Durán | ENTRY | Estandar | GATE |
| 2 | CARD_0015fb32fd | 2026-01-17 13:20:44.054 | Durán | ENTRY | Estandar | GATE |
| 3 | CARD_0015fb32fd | 2026-01-17 13:20:47.026 | Durán | ENTRY | Estandar | GATE |
| 4 | CARD_0015fb32fd | 2026-01-17 13:20:49.315 | Durán | ENTRY | Estandar | GATE |


In [7]:
example_card = df_clean["SupportId"].dropna().iloc[0]
df_clean[df_clean["SupportId"] == example_card][
    ["SupportId", "DateTime", "StopPlaceShortName", "TransactionType", "ProfileName", "EquipmentModel"]
].head(20)

| | SupportId | DateTime | StopPlaceShortName | TransactionType | ProfileName | EquipmentModel |
|---|---|---|---|---|---|---|
| 0 | CARD_0015fb32fd | 2026-01-17 13:20:30.830 | Durán | ENTRY | Estandar | GATE |
| 1 | CARD_0015fb32fd | 2026-01-17 13:20:40.237 | Durán | ENTRY | Estandar | GATE |
| 2 | CARD_0015fb32fd | 2026-01-17 13:20:44.054 | Durán | ENTRY | Estandar | GATE |
| 3 | CARD_0015fb32fd | 2026-01-17 13:20:47.026 | Durán | ENTRY | Estandar | GATE |
| 4 | CARD_0015fb32fd | 2026-01-17 13:20:49.315 | Durán | ENTRY | Estandar | GATE |


## Final Dataset

The resulting dataset is sorted by card and timestamp and exported as a clean
CSV file, ready for travel reconstruction and time-based analysis.

In [8]:
df_clean.to_csv(DATA_OUT, index=False, encoding="utf-8")
print(f"Saved clean dataset | rows: {len(df_clean)}")

Saved clean dataset | rows: 12566
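
A downstream step would typically read the exported file back. The sketch below shows one way to do that and to confirm the per-card chronological order survives the round trip. It uses a temporary directory and a made-up three-row frame instead of the real output file, so the rows and card identifiers are assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for df_clean, already sorted by card and timestamp.
clean = pd.DataFrame({
    "SupportId": ["CARD_a", "CARD_a", "CARD_b"],
    "DateTime": pd.to_datetime([
        "2026-01-17 13:20:30.830",
        "2026-01-17 13:20:40.237",
        "2026-01-17 05:30:55.393",
    ]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "validations_clean.csv"
    clean.to_csv(path, index=False, encoding="utf-8")

    # CSV stores no types, so timestamps must be parsed again on load.
    reloaded = pd.read_csv(path, dtype={"SupportId": str}, parse_dates=["DateTime"])

# Each card's events should still be in non-decreasing time order.
ordered = bool(
    reloaded.groupby("SupportId")["DateTime"]
            .apply(lambda s: s.is_monotonic_increasing)
            .all()
)

print(reloaded.dtypes.to_dict())
print("Chronological within each card:", ordered)
```

Since CSV does not preserve column types, passing `parse_dates` on load is what restores `DateTime` as a timestamp column for time-based analysis.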
