# GEMStat Data Preparation

## Objective

Add GEMStat data from the Eastern Cape area to the training set to improve the model's generalization.

## Available data

- 26 stations in the test area (lat -34 to -32, lon 24 to 28.5)
- ~50k samples covering Alkalinity, Conductivity, and Phosphorus
- Period: 1976-2022

In [1]:
import pandas as pd
import numpy as np

print("Imports OK!")

Imports OK!


## Step 1: Load the GEMStat data

In [2]:
# Load the raw files
samples = pd.read_csv('../data/raw/samples.csv', sep=';', low_memory=False)
metadata = pd.read_excel('../data/raw/metadata.xlsx')

print(f"Samples: {samples.shape}")
print(f"Metadata: {metadata.shape}")

print("\nSamples columns:")
print(samples.columns.tolist())

print("\nMetadata columns:")
print(metadata.columns.tolist())

Samples: (526404, 11)
Metadata: (252, 27)

Samples columns:
['GEMS Station Number', 'Sample Date', 'Sample Time', 'Depth', 'Parameter Code', 'Analysis Method Code', 'Value Flags', 'Value', 'Unit', 'Data Quality', 'Status']

Metadata columns:
['GEMS Station Number', 'Historical GEMS Number', 'Local Station Number', 'Country Name', 'Water Type', 'Station Identifier', 'Station Narrative', 'Water Body Name', 'Main Basin', 'Upstream Basin Area', 'Elevation', 'Monitoring Type', 'Date Station Opened', 'Responsible Collection Agency', 'Latitude', 'Longitude', 'River Width', 'Discharge', 'Max. Depth', 'Lake Area', 'Lake Volume', 'Average Retention', 'Area of Aquifer', 'Depth of Impermeable Lining', 'Production Zone', 'Mean Abstraction Rate', 'Mean Abstraction Level']


## Step 2: Filter the stations in the test area

In [3]:
# Test area (Eastern Cape) bounding box, bounds inclusive
LAT_MIN, LAT_MAX = -34, -32
LON_MIN, LON_MAX = 24, 28.5

# Keep only the stations inside the bounding box
stations_zone_test = metadata[
    (metadata['Latitude'] >= LAT_MIN) &
    (metadata['Latitude'] <= LAT_MAX) &
    (metadata['Longitude'] >= LON_MIN) &
    (metadata['Longitude'] <= LON_MAX)
].copy()

print(f"Stations in the test area: {len(stations_zone_test)}")
print(f"\nStations: {stations_zone_test['GEMS Station Number'].tolist()}")

Stations in the test area: 26

Stations: ['ZAF00235', 'ZAF00233', 'ZAF00246', 'ZAF00248', 'ZAF00285', 'ZAF00241', 'ZAF00220', 'ZAF00286', 'ZAF00222', 'ZAF00236', 'ZAF00012', 'ZAF00013', 'ZAF00242', 'ZAF00250', 'ZAF00249', 'ZAF00237', 'ZAF00243', 'ZAF00223', 'ZAF00314', 'ZAF00221', 'ZAF00290', 'ZAF00239', 'ZAF00238', 'ZAF00301', 'ZAF00247', 'ZAF00240']
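
The same bounding-box filter can be written more compactly with `Series.between`, which includes both bounds by default. The sketch below is illustrative only: `filter_bbox` is a hypothetical helper, and the three stations are made-up toy rows, not GEMStat records.

```python
import pandas as pd


def filter_bbox(df, lat_min, lat_max, lon_min, lon_max):
    """Return the rows whose coordinates fall inside the box (bounds inclusive)."""
    in_lat = df["Latitude"].between(lat_min, lat_max)
    in_lon = df["Longitude"].between(lon_min, lon_max)
    return df[in_lat & in_lon].copy()


# Toy stations (hypothetical IDs and coordinates)
toy_stations = pd.DataFrame({
    "GEMS Station Number": ["TOY001", "TOY002", "TOY003"],
    "Latitude": [-33.0, -30.0, -34.0],
    "Longitude": [26.0, 26.0, 28.5],
})

inside = filter_bbox(toy_stations, -34, -32, 24, 28.5)
print(inside["GEMS Station Number"].tolist())
```

`TOY003` sits exactly on the corner of the box and is kept, which matches the `>=` / `<=` comparisons used in the notebook cell.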


In [4]:
# Keep only the samples recorded at these stations
station_ids = stations_zone_test['GEMS Station Number'].tolist()
samples_zone = samples[samples['GEMS Station Number'].isin(station_ids)].copy()

print(f"Samples in the test area: {len(samples_zone)}")
print("\nParameters:")
print(samples_zone['Parameter Code'].value_counts())

Samples in the test area: 50950

Parameters:
Parameter Code
EC         18684
DRP        16373
Alk-Tot    15893
Name: count, dtype: int64
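
Since `samples` carries a `Unit` column, it may be worth checking at this point whether each parameter is reported in a single unit. The following is a minimal sketch on toy rows; the values and unit strings are invented for illustration and are not taken from GEMStat.

```python
import pandas as pd

# Toy long-format samples (hypothetical values and unit strings)
toy_samples = pd.DataFrame({
    "Parameter Code": ["EC", "EC", "DRP", "DRP", "Alk-Tot"],
    "Unit": ["uS/cm", "uS/cm", "mg/l", "ug/l", "mg/l"],
    "Value": [199.4, 252.2, 0.012, 13.0, 74.9],
})

# Distinct (parameter, unit) pairs present in the data
print(toy_samples[["Parameter Code", "Unit"]].drop_duplicates())

# Parameters reported in more than one unit need conversion before pivoting
unit_counts = toy_samples.groupby("Parameter Code")["Unit"].nunique()
mixed = unit_counts[unit_counts > 1].index.tolist()
print(mixed)
```

On the real data the same two statements would run on `samples_zone` instead of `toy_samples`.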


## Step 3: Filter by period (2011-2016)

In [5]:
# Convert the date column to datetime
samples_zone['Sample Date'] = pd.to_datetime(samples_zone['Sample Date'])

# Filter by period (2011-2016, to stay close to the 2013-2015 training period)
samples_filtered = samples_zone[
    (samples_zone['Sample Date'] >= '2011-01-01') &
    (samples_zone['Sample Date'] <= '2016-12-31')
].copy()

print(f"Samples 2011-2016: {len(samples_filtered)}")
print("\nParameters:")
print(samples_filtered['Parameter Code'].value_counts())

Samples 2011-2016: 4456

Parameters:
Parameter Code
DRP        1509
EC         1496
Alk-Tot    1451
Name: count, dtype: int64
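
When no format is given, `pd.to_datetime` infers one, which can silently swap day and month for ambiguous strings. Below is a hedged sketch of a stricter parse. It assumes ISO-style `YYYY-MM-DD` strings; the actual date format in `samples.csv` is not shown in this notebook and should be checked first.

```python
import pandas as pd

# Toy date strings; the last one is deliberately invalid
raw_dates = pd.Series(["2011-01-19", "2016-12-31", "2017-01-01", "not a date"])

# An explicit format avoids guessing; unparseable values become NaT
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")
n_bad = int(parsed.isna().sum())

# Inclusive period filter; NaT never matches
in_period = parsed.between("2011-01-01", "2016-12-31")
print(n_bad, in_period.tolist())
```

Counting the `NaT` values before filtering makes parsing failures visible so that they do not disappear in the period filter unnoticed.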


## Step 4: Pivot the data

Reshape the data from long format (one row per measurement) to wide format (one row per station and date, with the 3 variables as columns).
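
To make the reshaping concrete, here is a small sketch on toy data (the station ID and values are invented). It shows two side effects of `pivot_table`: duplicate measurements for the same station, date and parameter are averaged, and a parameter that was not measured on a given date becomes `NaN`.

```python
import pandas as pd

# Toy long-format data: two EC readings on the same day, none on the second date
toy_long = pd.DataFrame({
    "GEMS Station Number": ["TOY001"] * 4,
    "Sample Date": pd.to_datetime(
        ["2011-01-19", "2011-01-19", "2011-01-19", "2011-02-14"]
    ),
    "Parameter": [
        "Total Alkalinity",
        "Electrical Conductance",
        "Electrical Conductance",
        "Total Alkalinity",
    ],
    "Value": [74.9, 199.0, 201.0, 107.1],
})

toy_wide = toy_long.pivot_table(
    index=["GEMS Station Number", "Sample Date"],
    columns="Parameter",
    values="Value",
    aggfunc="mean",  # averages the two EC readings from 2011-01-19
).reset_index()
print(toy_wide)
```

The four long rows collapse into two wide rows, one per station and date.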

In [6]:
# Keep only the useful columns
samples_clean = samples_filtered[[
    'GEMS Station Number', 'Sample Date', 'Parameter Code', 'Value'
]].copy()

# Map parameter codes to the names used in the training set
param_mapping = {
    'Alk-Tot': 'Total Alkalinity',
    'EC': 'Electrical Conductance',
    'DRP': 'Dissolved Reactive Phosphorus'
}

# Keep only our 3 parameters (.copy() avoids a SettingWithCopyWarning below)
samples_clean = samples_clean[samples_clean['Parameter Code'].isin(list(param_mapping))].copy()
samples_clean['Parameter'] = samples_clean['Parameter Code'].map(param_mapping)

print(f"Samples with our 3 parameters: {len(samples_clean)}")

Samples with our 3 parameters: 4456


In [7]:
# Pivot from long to wide format
samples_wide = samples_clean.pivot_table(
    index=['GEMS Station Number', 'Sample Date'],
    columns='Parameter',
    values='Value',
    aggfunc='mean'  # Average when a station has several measurements on the same day
).reset_index()

print(f"Observations after pivot: {len(samples_wide)}")
print(f"\nColumns: {samples_wide.columns.tolist()}")
print("\nPreview:")
display(samples_wide.head())

Observations after pivot: 1511

Columns: ['GEMS Station Number', 'Sample Date', 'Dissolved Reactive Phosphorus', 'Electrical Conductance', 'Total Alkalinity']

Preview:

| Parameter | GEMS Station Number | Sample Date | Dissolved Reactive Phosphorus | Electrical Conductance | Total Alkalinity |
|---|---|---|---|---|---|
| 0 | ZAF00012 | 2011-01-19 | 0.012 | 199.4 | 74.9 |
| 1 | ZAF00012 | 2011-02-14 | 0.013 | 252.2 | 107.119 |
| 2 | ZAF00012 | 2011-03-17 | 0.01 | 217.0 | 98.69 |
| 3 | ZAF00012 | 2011-04-13 | 0.01 | 288.5 | 128.976 |
| 4 | ZAF00012 | 2011-05-11 | 0.01 | 199.5 | 71.231 |


In [8]:
# Keep only the rows where all 3 variables are present
target_cols = ['Total Alkalinity', 'Electrical Conductance', 'Dissolved Reactive Phosphorus']
samples_complete = samples_wide.dropna(subset=target_cols)

print(f"Observations with all 3 variables: {len(samples_complete)}")

Observations with all 3 variables: 1415
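
To see which variable accounts for the dropped rows, one option is to count the missing values per column before calling `dropna`. A sketch on toy data (values invented for illustration):

```python
import numpy as np
import pandas as pd

target_cols = ["Total Alkalinity", "Electrical Conductance", "Dissolved Reactive Phosphorus"]

# Toy wide-format table with gaps
toy_wide = pd.DataFrame({
    "Total Alkalinity": [74.9, np.nan, 98.7, 129.0],
    "Electrical Conductance": [199.4, 252.2, np.nan, 288.5],
    "Dissolved Reactive Phosphorus": [0.012, 0.013, np.nan, 0.010],
})

# Missing values per target column
missing_per_col = toy_wide[target_cols].isna().sum()
print(missing_per_col)

# Rows removed by requiring all three variables
toy_complete = toy_wide.dropna(subset=target_cols)
n_dropped = len(toy_wide) - len(toy_complete)
print(n_dropped)
```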


## Step 5: Add the coordinates

In [9]:
# Join with the station coordinates
gemstat_data = samples_complete.merge(
    metadata[['GEMS Station Number', 'Latitude', 'Longitude']],
    on='GEMS Station Number',
    how='left'
)

# Format the date as DD-MM-YYYY
gemstat_data['Sample Date'] = gemstat_data['Sample Date'].dt.strftime('%d-%m-%Y')

# Reorder the columns to match the original training set
gemstat_final = gemstat_data[[
    'Latitude', 'Longitude', 'Sample Date',
    'Total Alkalinity', 'Electrical Conductance', 'Dissolved Reactive Phosphorus'
]].copy()

print(f"Final GEMStat dataset: {gemstat_final.shape}")
display(gemstat_final.head())

Final GEMStat dataset: (1415, 6)

|  | Latitude | Longitude | Sample Date | Total Alkalinity | Electrical Conductance | Dissolved Reactive Phosphorus |
|---|---|---|---|---|---|---|
| 0 | -32.515278 | 28.015556 | 19-01-2011 | 74.9 | 199.4 | 0.012 |
| 1 | -32.515278 | 28.015556 | 14-02-2011 | 107.119 | 252.2 | 0.013 |
| 2 | -32.515278 | 28.015556 | 17-03-2011 | 98.69 | 217.0 | 0.01 |
| 3 | -32.515278 | 28.015556 | 13-04-2011 | 128.976 | 288.5 | 0.01 |
| 4 | -32.515278 | 28.015556 | 11-05-2011 | 71.231 | 199.5 | 0.01 |
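
A left join can duplicate rows without warning if a station appears more than once in the metadata, and it leaves `NaN` coordinates for stations that are missing from it. The sketch below uses the `validate` and `indicator` options of `DataFrame.merge` to surface both cases. It runs on toy data with hypothetical station IDs.

```python
import pandas as pd

# Toy observations; TOY999 has no metadata entry
toy_obs = pd.DataFrame({
    "GEMS Station Number": ["TOY001", "TOY001", "TOY999"],
    "Total Alkalinity": [74.9, 107.1, 50.0],
})
toy_meta = pd.DataFrame({
    "GEMS Station Number": ["TOY001", "TOY002"],
    "Latitude": [-32.5, -33.1],
    "Longitude": [28.0, 26.4],
})

merged = toy_obs.merge(
    toy_meta,
    on="GEMS Station Number",
    how="left",
    validate="many_to_one",  # raises MergeError if a station is duplicated in toy_meta
    indicator=True,          # adds a "_merge" column: "both" or "left_only"
)

# Stations without coordinates after the join
unmatched = merged.loc[merged["_merge"] == "left_only", "GEMS Station Number"].tolist()
print(len(merged), unmatched)
```

The row count stays at 3, and `TOY999` is flagged as having no coordinates.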


## Step 6: Compare with the original training set

In [10]:
# Load the original training set
training_original = pd.read_csv('../data/raw/water_quality_training_dataset.csv')

print("COMPARISON")
print("=" * 50)
print(f"\nOriginal training set: {len(training_original)} observations")
print(f"GEMStat test area: {len(gemstat_final)} observations")

print("\nTarget variable statistics:")
print("\nOriginal training set:")
for col in target_cols:
    print(f"  {col}: mean={training_original[col].mean():.1f}, std={training_original[col].std():.1f}")

print("\nGEMStat test area:")
for col in target_cols:
    print(f"  {col}: mean={gemstat_final[col].mean():.1f}, std={gemstat_final[col].std():.1f}")

COMPARISON
==================================================

Original training set: 9319 observations
GEMStat test area: 1415 observations

Target variable statistics:

Original training set:
  Total Alkalinity: mean=119.1, std=74.7
  Electrical Conductance: mean=485.0, std=341.9
  Dissolved Reactive Phosphorus: mean=43.5, std=51.0

GEMStat test area:
  Total Alkalinity: mean=180.1, std=153.9
  Electrical Conductance: mean=1052.9, std=1213.4
  Dissolved Reactive Phosphorus: mean=0.1, std=0.1
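
The Dissolved Reactive Phosphorus means differ by more than two orders of magnitude between the two sources (43.5 vs 0.1), whereas alkalinity and conductance stay within a small factor of each other. One possible explanation, stated here as an assumption that the data shown does not confirm, is a unit difference between the files (for example µg/L in one and mg/L in the other). A quick heuristic is to compare medians column by column. In the sketch below, `scale_ratios` is a hypothetical helper, the thresholds are arbitrary, and the toy values are invented.

```python
import pandas as pd


def scale_ratios(reference, candidate, cols):
    """Ratio of medians (reference / candidate) for each column."""
    return pd.Series({c: reference[c].median() / candidate[c].median() for c in cols})


cols = ["Total Alkalinity", "Dissolved Reactive Phosphorus"]

# Toy data: phosphorus is on a very different scale in the second table
toy_ref = pd.DataFrame({
    "Total Alkalinity": [100.0, 120.0, 140.0],
    "Dissolved Reactive Phosphorus": [10.0, 20.0, 30.0],
})
toy_new = pd.DataFrame({
    "Total Alkalinity": [110.0, 150.0, 190.0],
    "Dissolved Reactive Phosphorus": [0.01, 0.02, 0.03],
})

ratios = scale_ratios(toy_ref, toy_new, cols)

# Flag columns whose medians differ by more than a factor of 100 either way
suspect = ratios[(ratios > 100) | (ratios < 0.01)].index.tolist()
print(ratios)
print(suspect)
```

A flagged column is only a hint. The `Unit` column of `samples.csv` and the documentation of the training set are the places to confirm the actual units before applying any conversion.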


## Step 7: Save

In [11]:
# Save the GEMStat dataset
gemstat_final.to_csv('../data/raw/gemstat_eastern_cape.csv', index=False)
print("✅ Saved: gemstat_eastern_cape.csv")

# Build a combined dataset (original training set + GEMStat)
# Caution: the Step 6 statistics show Dissolved Reactive Phosphorus on very
# different scales in the two sources; verify the units before training on this file.
training_combined = pd.concat([training_original, gemstat_final], ignore_index=True)
training_combined.to_csv('../data/raw/water_quality_training_combined.csv', index=False)
print(f"✅ Saved: water_quality_training_combined.csv ({len(training_combined)} obs)")

✅ Saved: gemstat_eastern_cape.csv
✅ Saved: water_quality_training_combined.csv (10734 obs)
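
Once the two tables are concatenated, the origin of each row is lost. Tagging each table with a provenance column before `pd.concat` keeps it possible to evaluate or reweight the two sources separately later. This is a sketch on toy data; the `source` column name and its labels are hypothetical choices, not part of either dataset.

```python
import pandas as pd

# Toy stand-ins for the two datasets (invented values)
toy_training = pd.DataFrame({
    "Latitude": [-29.0],
    "Longitude": [30.0],
    "Total Alkalinity": [119.0],
})
toy_gemstat = pd.DataFrame({
    "Latitude": [-32.5, -33.1],
    "Longitude": [28.0, 26.4],
    "Total Alkalinity": [74.9, 107.1],
})

# Check that both tables share the same columns before stacking them
same_columns = set(toy_training.columns) == set(toy_gemstat.columns)

combined = pd.concat(
    [
        toy_training.assign(source="original"),
        toy_gemstat.assign(source="gemstat"),
    ],
    ignore_index=True,
)
counts = combined["source"].value_counts().to_dict()
print(same_columns, counts)
```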


## Summary

**Files created:**
- `gemstat_eastern_cape.csv`: GEMStat data for the test area only
- `water_quality_training_combined.csv`: Original training set + GEMStat

**Next steps:**
1. Extract the satellite/climate features for the new GEMStat observations
2. Merge them with the existing features
3. Retrain the model