# **Data Collection & Storage — Final Project (Focused on Collection & Cleaning)**

This notebook is scoped to the first two lifecycle steps: **Data Collection** and **Data Cleaning**. The downstream analysis and communication will be completed later.

**Research questions (context for later analysis):**
- Which lines offer the best service for Parisians?
- What is the main cause of delays?
- How much does it differ from other lines?

We will only use the CSV file: `Regularities_by_liaisons_Trains_France.csv`.

---

## Principles We Follow
- Clarity and reproducibility first: every step is explained and deterministic.
- Decisions are justified using both observed data properties and data best practices.
- No ML/advanced stats here; we prepare a clean, reliable dataset for future analysis.

## 1. Data Collection

### 1.1 Dataset selection & scope
- Source: Kaggle dataset “Public transport traffic data in France” ([link](https://www.kaggle.com/datasets/gatandubuc/public-transport-traffic-data-in-france)).
- We explicitly use only one file from the dataset: `Regularities_by_liaisons_Trains_France.csv`.
- Rationale: this file contains service regularity information per liaison/line, which is directly relevant for understanding service levels and delays.

### 1.2 Provenance & reproducibility
- We retrieve data programmatically using `kagglehub` to guarantee a deterministic path and versioned download.
- We copy the CSV to a local `data/raw` folder to ensure a stable, project-local reference path for subsequent steps.
- Ethical note: data is public/open; we will cite data source in the final report and respect licensing.
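
A minimal sketch of the retrieval described above, assuming the `kagglehub` package is installed and Kaggle access is configured. The dataset handle is read off the URL in section 1.1. `copy_to_raw` and `fetch_raw_csv` are hypothetical helper names, and the recursive search is an assumption about where the CSV sits inside the downloaded folder.

```python
from pathlib import Path
import shutil

# Handle taken from the Kaggle URL; file name as stated in section 1.1
DATASET_HANDLE = "gatandubuc/public-transport-traffic-data-in-france"
CSV_NAME = "Regularities_by_liaisons_Trains_France.csv"


def copy_to_raw(download_dir: Path, filename: str, raw_dir: Path) -> Path:
    """Copy one file from the download folder into the project-local raw folder."""
    # Search recursively because the exact layout of the download is an assumption
    matches = sorted(Path(download_dir).rglob(filename))
    if not matches:
        raise FileNotFoundError(f"{filename} not found under {download_dir}")
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / filename
    shutil.copy2(matches[0], target)
    return target


def fetch_raw_csv(raw_dir: Path = Path("data") / "raw") -> Path:
    """Download the dataset with kagglehub (needs network access), then copy the CSV."""
    import kagglehub  # imported lazily so copy_to_raw works without the package

    download_dir = Path(kagglehub.dataset_download(DATASET_HANDLE))
    return copy_to_raw(download_dir, CSV_NAME, raw_dir)
```

Calling `fetch_raw_csv()` once would then give later cells a stable path such as `data/raw/Regularities_by_liaisons_Trains_France.csv` to read from.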

### 1.3 Practical notes for this notebook
- We aim for clear, well-documented code; each decision is justified by what we observe in the data and by standard data-cleaning best practices.
- No analysis is performed here; the goal is to produce a clean, consistent, and well-typed dataset ready for downstream analysis.

In [13]:
from __future__ import annotations

import os
import re
from pathlib import Path
import pandas as pd
import numpy as np

In [16]:
df_raw = pd.read_csv("regularite-mensuelle-tgv-aqst.csv", sep=";")
df_raw.shape

(10687, 26)

In [17]:
# just to have an idea of the data

df_raw.head(5)

Unnamed: 0,Date,Service,Gare de départ,Gare d'arrivée,Durée moyenne du trajet,Nombre de circulations prévues,Nombre de trains annulés,Commentaire annulations,Nombre de trains en retard au départ,Retard moyen des trains en retard au départ,Retard moyen de tous les trains au départ,Commentaire retards au départ,Nombre de trains en retard à l'arrivée,Retard moyen des trains en retard à l'arrivée,Retard moyen de tous les trains à l'arrivée,Commentaire retards à l'arrivée,Nombre trains en retard > 15min,Retard moyen trains en retard > 15 (si liaison concurrencée par vol),Nombre trains en retard > 30min,Nombre trains en retard > 60min,Prct retard pour causes externes,Prct retard pour cause infrastructure,Prct retard pour cause gestion trafic,Prct retard pour cause matériel roulant,Prct retard pour cause gestion en gare et réutilisation de matériel,"Prct retard pour cause prise en compte voyageurs (affluence, gestions PSH, correspondances)"
0,2018-01,National,GRENOBLE,PARIS LYON,183,245,0,,37,8.027027,1.212245,,23,46.314493,6.123741,Le 9760 heurte un chevreuil vers Le-Creusot-Mo...,25,6.123741,13,6,17.647059,52.941176,0.0,23.529412,5.882353,0.0
1,2018-01,International,PARIS LYON,ITALIE,394,94,0,,27,11.261728,2.997695,,22,55.681818,11.601064,,22,11.601064,15,6,33.333333,19.047619,23.809524,14.285714,9.52381,0.0
2,2018-01,National,MARSEILLE ST CHARLES,LYON PART DIEU,106,557,7,,133,6.978195,1.706333,,60,28.92,5.195333,,40,5.195333,19,5,23.076923,23.076923,19.230769,23.076923,3.846154,7.692308
3,2018-01,National,PARIS NORD,DUNKERQUE,116,271,3,,46,11.236594,1.797637,,29,28.689655,3.738806,,18,3.738806,9,4,35.714286,28.571429,7.142857,25.0,3.571429,0.0
4,2018-01,National,ANNECY,PARIS LYON,224,198,0,,12,8.070833,0.489141,,38,37.246053,8.552525,,38,8.552525,14,5,23.809524,42.857143,9.52381,14.285714,4.761905,4.761905


In [24]:
# 2.1 Inspect raw structure
print("Raw shape (rows, cols):", df_raw.shape)
print("\nColumn names (original):\n", list(df_raw.columns))

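print("\nData types:")
# Use display(): a bare expression is only rendered when it is the last line of a cell
display(df_raw.dtypes.to_frame("dtype").T)

print("\nInfo:")
df_raw.info()

print("\nSample records:")
display(df_raw.head(5))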

# Show the missing fraction for every column rather than only the top 20
print("\nMissingness:")
missing_fraction = df_raw.isna().mean().sort_values(ascending=False)
missing_fraction

Raw shape (rows, cols): (10687, 26)

Column names (original):
 ['Date', 'Service', 'Gare de départ', "Gare d'arrivée", 'Durée moyenne du trajet', 'Nombre de circulations prévues', 'Nombre de trains annulés', 'Commentaire annulations', 'Nombre de trains en retard au départ', 'Retard moyen des trains en retard au départ', 'Retard moyen de tous les trains au départ', 'Commentaire retards au départ', "Nombre de trains en retard à l'arrivée", "Retard moyen des trains en retard à l'arrivée", "Retard moyen de tous les trains à l'arrivée", "Commentaire retards à l'arrivée", 'Nombre trains en retard > 15min', 'Retard moyen trains en retard > 15 (si liaison concurrencée par vol)', 'Nombre trains en retard > 30min', 'Nombre trains en retard > 60min', 'Prct retard pour causes externes', 'Prct retard pour cause infrastructure', 'Prct retard pour cause gestion trafic', 'Prct retard pour cause matériel roulant', 'Prct retard pour cause gestion en gare et réutilisation de matériel', 'Prct retard pour 

Commentaire annulations                                                                        1.0
Commentaire retards au départ                                                                  1.0
Date                                                                                           0.0
Retard moyen de tous les trains à l'arrivée                                                    0.0
Prct retard pour cause gestion en gare et réutilisation de matériel                            0.0
Prct retard pour cause matériel roulant                                                        0.0
Prct retard pour cause gestion trafic                                                          0.0
Prct retard pour cause infrastructure                                                          0.0
Prct retard pour causes externes                                                               0.0
Nombre trains en retard > 60min                                                                0.0
Nombre tra

In [18]:
from ydata_profiling import ProfileReport

profile = ProfileReport(df_raw, title="Profiling Report")

profile.to_file("profiling_report.html")  # Export the report to an HTML file



## 2. Data Cleaning

Goal: transform the raw table into a clean, consistent, analysis-ready dataset.

We will:
- Inspect schema, types, and missingness to understand the data.
- Define and justify a cleaning plan based on observations and best practices.
- Implement a transparent, deterministic cleaning pipeline that can be re-run.
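
One way to read "deterministic and re-runnable" is that each step is a pure function and that running the whole chain twice changes nothing. The sketch below illustrates this on a toy frame. The function names and toy values are illustrative assumptions, not the notebook's actual pipeline.

```python
import pandas as pd


def strip_text_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with whitespace trimmed in text columns (assumed to hold strings)."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_string_dtype(out[col].dtype):
            out[col] = out[col].str.strip()
    return out


def drop_exact_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without exact duplicate rows, with a fresh index."""
    return df.drop_duplicates().reset_index(drop=True)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Chain the steps in a fixed order; the input frame is never modified."""
    return df.pipe(strip_text_columns).pipe(drop_exact_duplicates)


# Toy data: the first two rows only differ by a leading space
toy = pd.DataFrame({
    "gare": [" PARIS LYON", "PARIS LYON", "ANNECY "],
    "trains": [245, 245, 198],
})
once = clean(toy)
twice = clean(once)  # re-running on clean data should be a no-op
```

Because every step returns a new frame, `toy` keeps its original values and `once.equals(twice)` can serve as a quick idempotence check.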


Raw shape (rows, cols): (10687, 26)

Column names (original):
 ['Date', 'Service', 'Gare de départ', "Gare d'arrivée", 'Durée moyenne du trajet', 'Nombre de circulations prévues', 'Nombre de trains annulés', 'Commentaire annulations', 'Nombre de trains en retard au départ', 'Retard moyen des trains en retard au départ', 'Retard moyen de tous les trains au départ', 'Commentaire retards au départ', "Nombre de trains en retard à l'arrivée", "Retard moyen des trains en retard à l'arrivée", "Retard moyen de tous les trains à l'arrivée", "Commentaire retards à l'arrivée", 'Nombre trains en retard > 15min', 'Retard moyen trains en retard > 15 (si liaison concurrencée par vol)', 'Nombre trains en retard > 30min', 'Nombre trains en retard > 60min', 'Prct retard pour causes externes', 'Prct retard pour cause infrastructure', 'Prct retard pour cause gestion trafic', 'Prct retard pour cause matériel roulant', 'Prct retard pour cause gestion en gare et réutilisation de matériel', 'Prct retard pour 

Commentaire annulations                                                                        1.000000
Commentaire retards au départ                                                                  1.000000
Commentaire retards à l'arrivée                                                                0.934687
Date                                                                                           0.000000
Retard moyen de tous les trains à l'arrivée                                                    0.000000
Prct retard pour cause gestion en gare et réutilisation de matériel                            0.000000
Prct retard pour cause matériel roulant                                                        0.000000
Prct retard pour cause gestion trafic                                                          0.000000
Prct retard pour cause infrastructure                                                          0.000000
Prct retard pour causes externes                                

Observations:

- The three comment columns are empty or almost empty: two are 100% missing and `Commentaire retards à l'arrivée` is about 93% missing.
- The other columns have few or no missing values.
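
A possible way to turn these observations into an explicit, repeatable check is to list the columns whose missing fraction exceeds a threshold. This is only a sketch: the helper name, the 0.9 default, and the toy frame are assumptions, and whether such columns should be dropped remains a separate decision.

```python
import numpy as np
import pandas as pd


def columns_above_missing_threshold(df: pd.DataFrame, threshold: float = 0.9) -> list[str]:
    """Return the columns whose fraction of missing values is strictly above threshold."""
    missing_fraction = df.isna().mean()
    return missing_fraction[missing_fraction > threshold].index.tolist()


# Toy frame: one fully empty column, one 75% empty column, one complete column
toy = pd.DataFrame({
    "comment_empty": [np.nan, np.nan, np.nan, np.nan],
    "comment_sparse": [np.nan, np.nan, np.nan, "example comment"],
    "trains": [245, 94, 557, 271],
})

mostly_empty = columns_above_missing_threshold(toy)          # default threshold 0.9
sparse_or_empty = columns_above_missing_threshold(toy, 0.7)  # looser threshold
```

On the real table, the same call could be applied to `df_raw` before any step that converts missing values to strings.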

### 2.2 Cleaning plan & decisions

Based on inspection and data-cleaning best practices, we will:

1) Standardize column names to `snake_case` for consistency and easier downstream use.
2) Remove exact duplicate rows to avoid double-counting.
3) Trim whitespace from textual columns and normalize casing where appropriate.
4) Parse date-like columns (e.g., containing `date`/`jour`) to datetime for reliable time operations.
5) Convert numeric-like columns currently stored as text to numeric with safe coercion.
6) Handle missing values:
   - Drop rows missing essential identifiers (e.g., liaison/line/date) if present.
   - For count-like fields (e.g., number of delays/canceled services), impute 0 only when semantically sound.
   - For categorical reason fields, set missing to `Unknown` and normalize labels.
7) Persist a clean dataset to `data/processed` (CSV and Parquet if available) for reproducibility.

We document each decision inline below and rely on column-name heuristics when the schema varies (robustness without hard-coding unknowns).
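
Step 1 of the plan could look like the sketch below. It assumes that accents should be stripped and that `>` should be spelled out as `gt`. Both are illustrative choices, and `to_snake_case` is a hypothetical helper name.

```python
import re
import unicodedata


def to_snake_case(name: str) -> str:
    """Convert a column label such as "Gare de départ" to "gare_de_depart"."""
    # Decompose accented characters, then drop the combining marks
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Spell out ">" so the threshold columns keep their meaning
    ascii_name = ascii_name.replace(">", " gt ")
    # Collapse every run of non-alphanumeric characters into one underscore
    snake = re.sub(r"[^0-9a-zA-Z]+", "_", ascii_name).strip("_")
    return snake.lower()


examples = [
    "Gare de départ",
    "Gare d'arrivée",
    "Nombre trains en retard > 15min",
]
renamed = [to_snake_case(label) for label in examples]
```

Applied to a frame, this would be `df.columns = [to_snake_case(c) for c in df.columns]`. Lowercased names also suit name-based heuristics that use lowercase patterns.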

In [19]:
# Checking if there are any duplicates

num_duplicates = df_raw.duplicated().sum()
print(f"\nNumber of duplicate rows: {num_duplicates}")


Number of duplicate rows: 0


In [23]:
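# Work on a copy so df_raw stays untouched, then trim whitespace in all object columns.
# The .str accessor keeps NaN as NaN; .astype(str) would turn it into the string "nan"
# and hide missing values from isna().
df_clean = df_raw.copy()
obj_cols = [c for c in df_clean.columns if df_clean[c].dtype == "object"]
for c in obj_cols:
    df_clean[c] = df_clean[c].str.strip()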

In [None]:
# Identify date-like columns (heuristic by name, case-insensitive so that "Date" matches)
date_like_cols = [c for c in df_clean.columns if re.search(r"\b(date|jour|calendar|perio)", c, flags=re.IGNORECASE)]
for c in date_like_cols:
    # Sample values are year-first (e.g. "2018-01"), so no dayfirst hint; unparseable values become NaT
    df_clean[c] = pd.to_datetime(df_clean[c], errors="coerce")
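
The `Date` values in the sample records look like `2018-01`, i.e. year and month only. As a small illustration on toy values, an explicit format makes the parsing rule visible, and a monthly period may be a more faithful representation than a timestamp pinned to the first day of the month. The explicit format and the conversion to periods are suggestions, not decisions taken in this notebook.

```python
import pandas as pd

# Toy values in the same shape as the Date column, plus one invalid entry
dates = pd.Series(["2018-01", "2018-02", "not a date"])

# An explicit format avoids any day-first/month-first ambiguity;
# values that do not match become NaT instead of raising
parsed = pd.to_datetime(dates, format="%Y-%m", errors="coerce")

# Optional: represent each value as a calendar month rather than a timestamp
months = parsed.dt.to_period("M")
```

Counting `parsed.isna().sum()` after the conversion is a cheap way to see how many values failed to parse.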



In [None]:
# Identify numeric-like columns by name patterns
numeric_name_patterns = [
    r"^(nb|nombre|count|total|vol|qty|quant|qte)",
    r"(delai|retard|late|min|sec|heures?)",
    r"(annul|cancel)",
    r"(train|service|trajet|travel)s?",
]

maybe_numeric_cols: list[str] = []
for c in df_clean.columns:
    if any(re.search(pat, c) for pat in numeric_name_patterns):
        maybe_numeric_cols.append(c)

# Additionally include columns that look numeric but are objects
maybe_numeric_cols += [
    c for c in df_clean.columns
    if df_clean[c].dtype == "object" and df_clean[c].str.match(r"^[-+]?\d+[\d.,]*$", na=False).mean() > 0.5
]
maybe_numeric_cols = sorted(set(maybe_numeric_cols))

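# Coerce to numeric safely (commas to dots; non-numeric -> NaN), but only when at least
# half of the non-missing values parse. This protects text columns (e.g. comments)
# whose names happen to match one of the patterns above.
for c in maybe_numeric_cols:
    s = df_clean[c]
    if s.dtype == "object":
        s = s.str.replace(",", ".", regex=False)
    converted = pd.to_numeric(s, errors="coerce")
    if converted.notna().sum() >= 0.5 * s.notna().sum():
        df_clean[c] = converted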

# Identify potential essential identifier columns (liaison/line/date), matched case-insensitively
essential_keywords = ["liaison", "ligne", "line", "route", "relation", "code", "id", "date", "jour"]
essential_cols = [c for c in df_clean.columns if any(k in c.lower() for k in essential_keywords)]

# Decision: only drop rows if we have at least one credible identifier present; else keep all (conservative)
if essential_cols:
    before_drop = len(df_clean)
    df_clean = df_clean.dropna(subset=essential_cols, how="any")
    dropped_for_missing_keys = before_drop - len(df_clean)
else:
    dropped_for_missing_keys = 0

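# Normalize reason/motif columns for delays/cancellations. Text columns only:
# the "Prct retard pour cause ..." columns are numeric shares and must stay numeric.
reason_keywords = ["motif", "cause", "reason"]
reason_cols = [
    c for c in df_clean.columns
    if any(k in c.lower() for k in reason_keywords) and not pd.api.types.is_numeric_dtype(df_clean[c])
]
for c in reason_cols:
    df_clean[c] = df_clean[c].fillna("Unknown").astype(str).str.strip().str.replace(r"\s+", " ", regex=True).str.title()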

# Impute zeros for count columns only (names starting with nb/nombre/count/total).
# Averages, percentages and comment columns keep their NaN: a missing mean is not a zero.
count_like_cols = [c for c in df_clean.columns if re.search(r"^(nb|nombre|count|total)", c, flags=re.IGNORECASE)]
for c in count_like_cols:
    if pd.api.types.is_numeric_dtype(df_clean[c]):
        df_clean[c] = df_clean[c].fillna(0)

# Final light tidy: reorder columns (ids first, then dates, then others)
id_first = [c for c in df_clean.columns if re.search(r"\b(id|code|ligne|line|liaison)\b", c)]
date_first = [c for c in df_clean.columns if c in date_like_cols]
other_cols = [c for c in df_clean.columns if c not in set(id_first + date_first)]
df_clean = df_clean[id_first + date_first + other_cols]

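# Summary of the run. The duplicate check above found no duplicate rows,
# so num_duplicates doubles as the number of rows removed for that reason.
print({
    "initial_rows": int(len(df_raw)),
    "removed_duplicates": int(num_duplicates),
    "dropped_for_missing_keys": int(dropped_for_missing_keys),
    "final_rows": int(len(df_clean)),
    "final_cols": int(df_clean.shape[1]),
})
df_clean.head(5)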


In [4]:
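# 2.4 Persist cleaned dataset for reproducibility
# Assumption: the notebook is run from the repository root; adjust project_root otherwise
project_root = Path.cwd()
processed_dir = project_root / "data" / "processed"
processed_dir.mkdir(parents=True, exist_ok=True)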

clean_csv_path = processed_dir / "regularities_by_liaisons_trains_france_clean.csv"
df_clean.to_csv(clean_csv_path, index=False)

# Parquet is efficient for downstream analytics if available
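try:
    clean_parquet_path = processed_dir / "regularities_by_liaisons_trains_france_clean.parquet"
    df_clean.to_parquet(clean_parquet_path, index=False)
    wrote_parquet = True
except Exception as e:  # e.g. ImportError when no Parquet engine (pyarrow/fastparquet) is installed
    wrote_parquet = False
    print(f"Parquet export skipped: {e}")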

print({
    "clean_csv_path": str(clean_csv_path),
    "wrote_parquet": wrote_parquet,
    "processed_dir": str(processed_dir),
})

{'clean_csv_path': '/Users/anastasiabouevdombre/Documents/AIDAMS/S5_Classes/data_storage_and_collection/project/projet_repo_groupe/Data-Storage-Collection-Project/data/processed/regularities_by_liaisons_trains_france_clean.csv', 'wrote_parquet': False, 'processed_dir': '/Users/anastasiabouevdombre/Documents/AIDAMS/S5_Classes/data_storage_and_collection/project/projet_repo_groupe/Data-Storage-Collection-Project/data/processed'}
