# Notebook Objective

## Data Exploration and Cleaning

Explore and clean reference data from GF to prepare it for use as a seed file.

### Source Data

- **Source**: Google Spreadsheet
- **Link**: [GF Reference Data](https://docs.google.com/spreadsheets/d/1W239EkEV72WzzefB6jbYUQkIIQ-9xa4u9_m8XdCpU4o/edit?gid=1263453670#gid=1263453670)

### Tasks

1. **Data Exploration**

   - Analyze raw data structure
   - Identify data types
   - Check for missing values
   - Review data consistency

2. **Data Cleaning**

   - Handle missing values
   - Correct data types
   - Standardize formats

3. **Data Transformation**
   - Create new CSV file
   - Prepare for use as seed data

### Output

- Clean, formatted CSV file ready for use as a seed file in the project

This notebook will document the process of transforming raw reference data into a clean, structured format suitable for further use.


In [157]:
from pathlib import Path
import pandas as pd

# 13_pollution_eau_d4g/dbt_/seeds/mapping_category_gf.csv
map_cat_file = Path.cwd().parent.parent / "dbt_/seeds/mapping_category_gf.csv"
df = pd.read_csv(map_cat_file)

df.head(2)

Unnamed: 0,cdparametresiseeaux,cdparametre,libmajparametre,libminparametre,casparam,categorie,sous catégorie,Détails sous catégorie,Limite qualité,Unité limite qualité,Commentaire limite qualité,Valeur sanitaire 1,Unité valeur sanitaire 1,Commentaire valeur sanitaire 1,Valeur sanitaire 2,Unité valeur sanitaire 2,Commentaire valeur sanitaire 2
0,PFOA,5347.0,ACIDE PERFLUORO-OCTANOÏQUE,Acide perfluoro-octanoïque,335-67-1,PFAS,,,,,,0.075,µg/L,Valeur sanitaire indicative établie par l'Anse...,,,
1,PFHPA,5977.0,ACIDE PERFLUOROHEPTANOÏQUE,Acide perfluoroheptanoïque,375-85-9,PFAS,,,,,,0.075,µg/L,Valeur sanitaire indicative établie par l'Anse...,,,


In [158]:
# Basic information and null-value coverage per column.
# cdparametre has null values here, although the SQL above pre-fills empty cdparametre values with 0.
# Conclusion: libmajparametre, libminparametre, and categorie have no null values.
info = pd.DataFrame(
    {
        "Data Type": df.dtypes,
        "Total Rows": len(df),
        "Null Rows": df.isnull().sum(),
        "Null Percentage": (df.isnull().sum() / len(df) * 100)
        .round(6)
        .apply(lambda x: f"{x}%"),
    }
)

info

Unnamed: 0,Data Type,Total Rows,Null Rows,Null Percentage
cdparametresiseeaux,object,828,1,0.120773%
cdparametre,float64,828,8,0.966184%
libmajparametre,object,828,0,0.0%
libminparametre,object,828,0,0.0%
casparam,object,828,23,2.777778%
categorie,object,828,0,0.0%
sous catégorie,object,828,28,3.381643%
Détails sous catégorie,object,828,699,84.42029%
Limite qualité,float64,828,23,2.777778%
Unité limite qualité,object,828,23,2.777778%
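
The coverage table above can also be built with `isna().mean()`, which gives the null share directly. The following is a minimal sketch on invented toy data; the `null_report` helper and the `toy` frame are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy data standing in for the reference sheet (illustrative values only).
toy = pd.DataFrame(
    {
        "cdparametre": [5347.0, None, 6381.0, 1341.0],
        "categorie": ["PFAS", "PFAS", "pesticides", "pesticides"],
        "casparam": ["335-67-1", None, None, "375-85-9"],
    }
)


def null_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, null count and null share (in %) for each column."""
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "null_rows": frame.isna().sum(),
            # mean() of a boolean mask is the fraction of True values
            "null_pct": (frame.isna().mean() * 100).round(2),
        }
    ).sort_values("null_pct", ascending=False)


report = null_report(toy)
print(report)
```

Keeping `null_pct` numeric (rather than formatting it as a string with `%`) makes it possible to sort or filter on it, for example to list only the columns above a given threshold.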


In [159]:
# distinct categories
unique_cat = df["categorie"].unique().tolist()
unique_cat

['PFAS', 'CVM', 'Substances industrielles', 'pesticides']

In [160]:
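# The raw column is still named "sous catégorie" at this point;
# it is only renamed to "souscat" in the cleaning cell further down.
unique_sous_cat = df["sous catégorie"].dropna().unique().tolist()
unique_sous_cat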

In [None]:
# Check uniqueness of cdparametresiseeaux and cdparametre.
# Finding: cdparametre is more reliable than cdparametresiseeaux,
# because the same parameter can appear with its cdparametresiseeaux code reordered,
# e.g. cdparametre 6381 => cdparametresiseeaux DIM2ESA or ESADIM2.
# Conclusion: use cdparametre as the key.

# Convert cdparametre from float to int. Missing codes become 0,
# so the rows with cdparametre == 0 below are missing codes, not true duplicates.
df["cdparametre"] = df["cdparametre"].fillna(0).astype(int)
c = df[df["cdparametre"].duplicated(keep=False)]
c.sort_values("cdparametre")[["cdparametresiseeaux", "cdparametre"]]

Unnamed: 0,cdparametresiseeaux,cdparametre
809,MFTC,0
64,CLTHSYN,0
63,CLTHAR6,0
150,TBZLM6,0
798,FPDF,0
794,MDB,0
23,,0
796,ISOF,0
744,CNEB,1341
652,CLRNB,1341


In [None]:
# Check Unité limite qualité, Unité valeur sanitaire 1, Unité valeur sanitaire 2.
# Conclusions:
# 1) 13 rows have no unit at all;
# 2) all parameters use µg/L as their unit; the 13 parameters without a unit were verified manually.
columns_to_check = [
    "Unité limite qualité",
    "Unité valeur sanitaire 1",
    "Unité valeur sanitaire 2",
]
no_unity_value = df[df[columns_to_check].isnull().all(axis=1)]
no_unity_value[
    ["cdparametresiseeaux", "libmajparametre", "cdparametre"] + columns_to_check
]

Unnamed: 0,cdparametresiseeaux,libmajparametre,cdparametre,Unité limite qualité,Unité valeur sanitaire 1,Unité valeur sanitaire 2
6,PFDODA,ACIDE PERFLUORODODÉCANOIQUE,6507,,,
7,PFNA,ACIDE PERFLUORO-NONANOÏQUE,6508,,,
8,PFDA,ACIDE PERFLUORO-DECANOÏQUE,6509,,,
9,PFUNA,ACIDE PERFLUORO UNDECANOÏQUE,6510,,,
10,PFHPS,ACIDE PERFLUOROHEPTANE SULFONIQUE,6542,,,
11,PFTRDA,ACIDE PERFLUORO TRIDECANOIQUE,6549,,,
12,PFDS,ACIDE PERFLUORODECANE SULFONIQUE,6550,,,
13,ASPFOS,ACIDE PERFLUOROOCTANE SULFONIQUE,6560,,,
17,PFPS,ACIDE PERFLUOROPENTANE SULFONIQUE,8738,,,
18,PFNS,ACIDE PERFLUORONONANE SULFONIQUE,8739,,,


In [161]:
"""Clean, transform, and export the data to a new CSV file."""

# Clean the raw data and use the cleaned CSV (mapping_category_v2.csv) as the seed.
# Rename the columns to match those in edc_result.
df.columns = df.columns.str.strip()
df = df.rename(
    columns={
        "sous catégorie": "souscat",
        "Détails sous catégorie": "detailsouscat",
        "Limite qualité": "limitequal",
        "Valeur sanitaire 1": "valsanitaire1",
        "Commentaire valeur sanitaire 1": "commentvalsanitaire1",
        "Valeur sanitaire 2": "valsanitaire2",
        "Commentaire valeur sanitaire 2": "commentvalsanitaire2",
    }
)
# Add a new column, unite; every unit used here is µg/L.
df["unite"] = (
    df["Unité limite qualité"]
    .combine_first(df["Unité valeur sanitaire 1"])
    .combine_first(df["Unité valeur sanitaire 2"])
)
# Drop the row whose cdparametresiseeaux is null
df = df[df["cdparametresiseeaux"].notna()]
# Drop the columns that are no longer needed
columns_to_delete = [
    "Unité limite qualité",
    "Unité valeur sanitaire 1",
    "Unité valeur sanitaire 2",
    "Commentaire limite qualité",
]
df = df.drop(columns=columns_to_delete)
# Path of the new CSV
map_cat_path = Path.cwd().parent.parent / "dbt_/seeds/mapping_category_v2.csv"
# Write the new file
df.to_csv(map_cat_path, index=False)
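
The `unite` column in the cell above relies on `combine_first`, which keeps the caller's value and falls back to the argument only where the caller is missing. A minimal sketch of that behaviour on toy data follows; the `unit_a`, `unit_b` and `unit_c` column names are hypothetical stand-ins for the three unit columns.

```python
import pandas as pd

# Each row has its unit in a different column; the last row has none.
toy = pd.DataFrame(
    {
        "unit_a": ["µg/L", None, None, None],
        "unit_b": [None, "µg/L", None, None],
        "unit_c": [None, None, "µg/L", None],
    }
)

# Take unit_a, fall back to unit_b, then to unit_c.
toy["unite"] = (
    toy["unit_a"].combine_first(toy["unit_b"]).combine_first(toy["unit_c"])
)

# Assumption: rows with no unit anywhere also use µg/L (the notebook states
# that the rows without a unit were verified manually).
filled = toy["unite"].fillna("µg/L")

print(toy["unite"].tolist())
print(filled.tolist())
```

Note that `combine_first` alone leaves the unit empty for rows where all three source columns are empty; the final `fillna` is an optional step that is only valid if the manual verification holds for every such row.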

In [162]:
# Show the new columns and their types
df.dtypes

cdparametresiseeaux      object
cdparametre             float64
libmajparametre          object
libminparametre          object
casparam                 object
categorie                object
souscat                  object
detailsouscat            object
limitequal              float64
valsanitaire1            object
commentvalsanitaire1     object
valsanitaire2           float64
commentvalsanitaire2     object
unite                    object
dtype: object
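
In the listing above, `valsanitaire1` is `object` while `valsanitaire2` is `float64`, which suggests (but does not prove) that some `valsanitaire1` cells hold text that pandas could not parse as a number. A possible way to locate such cells is sketched below with `pd.to_numeric(..., errors="coerce")`. The toy values are invented for illustration and are not taken from the sheet.

```python
import pandas as pd

# Invented examples of values that would force an object dtype.
toy = pd.DataFrame({"valsanitaire1": ["0.075", "0,1", "< 0.5", None]})

# Unparseable strings become NaN instead of raising an error.
as_number = pd.to_numeric(toy["valsanitaire1"], errors="coerce")

# Rows that had a value but could not be parsed as a number.
suspect = toy[toy["valsanitaire1"].notna() & as_number.isna()]

print(suspect)
```

Inspecting `suspect` on the real data would show whether the column needs a cleaning rule (for example a decimal-comma fix) before it can be stored as a numeric seed column.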