### Modelling the data of the Défi IA 2021-2022

In this notebook, we describe a few elements of modelling for the Défi IA 2021-2022 data. We suggest some directions for computing descriptive statistics of these data, together with an approach for forecasting cumulative rainfall using linear regression. Python code is also provided. This notebook is also meant to help you start writing a report on your work in this Défi IA.

#### Data available at the measurement stations

For the years 2016 and 2017, the training data cover $N$ weather stations whose spatial coordinates (latitude and longitude) are known.

For each station $1 \leq i \leq N$, the following measurements are available:

**Explanatory variables**: measurements of $p$ variables $X_{ijt} = (X_{ijt}^{(k)})_{1 \leq k \leq p} \in \mathbb{R}^{p}$ for station $i$, day $j$ (an unordered variable, since it is not available in the test set) and hour $t \in \{0,\ldots,23 \}$ (an ordered variable, available in the test set). The measurements are

- 'ff': *include a description*
- 't': *include a description*
- 'td': *include a description*
- 'hu': *humidity*
- 'dd': *include a description*
- 'precip': *cumulative rainfall over one hour, in ml*

A variable for the month of the year can also be added, since this information is available in the test set.
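
As a minimal sketch of how such a month variable could be built, the example below reads the month from a timestamp column and, alternatively, recovers it from a day index. The toy values are made up, and the second part assumes that day index 0 corresponds to 2016-01-01, which only makes sense for the training set.

```python
import pandas as pd

# Month read directly from a timestamp column (toy dates)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-01 00:00", "2016-03-15 12:00", "2017-12-30 22:00"])
})
toy["month"] = toy["date"].dt.month

# Month recovered from a day index, ASSUMING day 0 is 2016-01-01 (training set only)
day_index = pd.Series([0, 100, 729])
dates = pd.Timestamp("2016-01-01") + pd.to_timedelta(day_index, unit="D")
months_from_index = dates.dt.month

print(toy)
print(months_from_index.tolist())
```

Going through real calendar dates avoids having to approximate months by fixed-length blocks of days.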


**Variable to explain/predict**: cumulative rainfall $Y_{ij}$ over one day, at day $j+1$ and station $i$, predicted from the data available at day $j$. In the training set, we actually have the variable $Y_{ijt}$, the cumulative rainfall at day $j+1$, station $i$ and hour $t$. Clearly (with $T=23$)
$$
Y_{ij} = \sum_{t = 0}^{T} Y_{ijt}
$$
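
The sum over hours can be sketched with a pandas `groupby`. The toy table below is invented for illustration; the column names `number_sta`, `day`, `hour` and `precip` follow the station files, while `Y` is simply the name chosen here for the daily total.

```python
import pandas as pd

# Toy hourly rainfall for one station over two days (values are made up)
hourly = pd.DataFrame({
    "number_sta": [14066001] * 48,
    "day": [0] * 24 + [1] * 24,
    "hour": list(range(24)) * 2,
    "precip": [0.0] * 20 + [0.2, 0.4, 0.0, 0.2] + [0.0] * 24,
})

# Y_ij = sum over t = 0..23 of Y_ijt, computed per (station, day)
daily = hourly.groupby(["number_sta", "day"])["precip"].sum().reset_index(name="Y")
print(daily)
```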

**Preliminary work**: propose a descriptive analysis of these data: boxplots, univariate histograms, PCA to study the correlations between explanatory variables, etc.
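
A possible starting point for this descriptive analysis is sketched below on synthetic data: summary statistics, a correlation matrix, and a PCA on standardised variables. The numbers are simulated and the relationship between `t` and `td` is an assumption made only to produce a visible correlation; with the real data, the same calls would be applied to the station table.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
t = 280 + 5 * rng.standard_normal(n)
toy = pd.DataFrame({
    "ff": rng.gamma(2.0, 2.0, n),
    "t": t,
    "td": t - rng.gamma(2.0, 1.0, n),   # built to be strongly correlated with t (made-up rule)
    "hu": rng.uniform(40, 100, n),
    "precip": rng.exponential(0.1, n),
})

# Univariate summaries (the numbers behind boxplots and histograms)
print(toy.describe())

# Pairwise correlations between variables
corr = toy.corr()
print(corr.round(2))

# PCA on standardised variables: share of variance carried by each component
pca = PCA().fit(StandardScaler().fit_transform(toy))
print(pca.explained_variance_ratio_.round(3))
```

Calling `toy.boxplot()` or `toy.hist()` would give the corresponding plots.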

**Possible linear models for forecasting cumulative rainfall**:

*Global hour-by-hour model*

$$
Y_{ijt} = \theta_{0}^{t} + \sum_{k = 1}^{p} \theta_{k}^{t}X_{ijt}^{(k)} + \varepsilon_{ijt}
$$

with the forecast $\hat{Y}_{ij} = \sum_{t = 0}^{T} \hat{Y}_{ijt}$, where $\hat{Y}_{ijt} = \hat{\theta}_{0}^{t} + \sum_{k = 1}^{p} \hat{\theta}_{k}^{t}X_{ijt}^{(k)}$
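
The hour-by-hour model can be sketched as one linear regression per hour, with the hourly forecasts summed to obtain the daily forecast. The data below are simulated and the array layout `X[obs, hour, variable]` is an assumption chosen for readability, not the layout of the challenge files.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_obs, n_hours, p = 200, 24, 3                  # (station, day) pairs, hours, variables
X = rng.standard_normal((n_obs, n_hours, p))    # X[ij, t, k]
theta = rng.standard_normal((n_hours, p))       # true hourly coefficients (simulated)
Y_hourly = (X * theta).sum(axis=2) + 0.1 * rng.standard_normal((n_obs, n_hours))

# One linear model per hour t, fitted on all stations and days together
models = [LinearRegression().fit(X[:, t, :], Y_hourly[:, t]) for t in range(n_hours)]

# Daily forecast = sum of the 24 hourly forecasts
Y_hat_hourly = np.column_stack([models[t].predict(X[:, t, :]) for t in range(n_hours)])
Y_hat = Y_hat_hourly.sum(axis=1)
print(Y_hat[:5].round(2))
```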

*Per-station, hour-by-hour model*

$$
Y_{ijt} = \theta_{0,i}^{t} + \sum_{k = 1}^{p} \theta_{k,i}^{t}X_{ijt}^{(k)} + \varepsilon_{ijt}
$$

where the coefficients of the linear model vary with the measurement station.

*Global model with aggregation over time*

$$
Y_{ij} = \theta_{0} + \sum_{t = 0}^{T}  \sum_{k = 1}^{p} \theta_{k}^{t}X_{ijt}^{(k)} + \varepsilon_{ij}
$$

and many other models are possible!
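
As an illustration of the model with aggregation over time, the sketch below places the $p$ variables of all 24 hours side by side (one column per hour and variable) and fits a single linear regression on the daily total. The data are simulated, and the intercept 1.5 and noise level are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n_obs, n_hours, p = 300, 24, 2
X = rng.standard_normal((n_obs, n_hours, p))    # X[ij, t, k]
theta = rng.standard_normal((n_hours, p))       # true coefficients theta_k^t (simulated)
Y_daily = 1.5 + (X * theta).sum(axis=(1, 2)) + 0.1 * rng.standard_normal(n_obs)

# One column per (hour, variable) pair, as in a wide pivoted table
X_wide = X.reshape(n_obs, n_hours * p)
model = LinearRegression().fit(X_wide, Y_daily)

print(round(model.intercept_, 2), model.coef_.shape)
```

This model has $1 + 24p$ coefficients, so it needs noticeably more (station, day) observations than each hourly model taken separately.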

In [159]:
#from google.colab import drive
#drive.mount('/content/drive',force_remount=True)

In [160]:
#import os
#os.chdir('/content/drive/My Drive/Données Massives/')

In [161]:
import matplotlib.pyplot as plt
from IPython.display import display


In [162]:
import pandas as pd
import datetime
import seaborn as sns
import numpy as np

# Suppress warnings raised by deprecated Python syntax that needs updating
import warnings

def fxn():
    warnings.warn("deprecated", DeprecationWarning)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fxn()

warnings.filterwarnings("ignore")

In [163]:
# Read the training set data

path = 'defi-ia-2022/Train/Train/X_station_train.csv'
first_date = datetime.datetime(2016,1,1)    
last_date = datetime.datetime(2017,12,31)

# Read the ground station data
def read_gs_data(fname):
    gs_data = pd.read_csv(fname,parse_dates=['date'],infer_datetime_format=True)
    gs_data = gs_data.sort_values(by=["number_sta","date"])
    return gs_data

x = read_gs_data(path)
x['number_sta']=x['number_sta'].astype('category')

# Sort by station, then by date
x = x.sort_values(['number_sta','date'])
x

Unnamed: 0,number_sta,date,ff,t,td,hu,dd,precip,Id
0,14066001,2016-01-01 00:00:00,3.05,279.28,277.97,91.4,200.0,0.0,14066001_0_0
1,14066001,2016-01-01 01:00:00,2.57,278.76,277.45,91.4,190.0,0.0,14066001_0_1
2,14066001,2016-01-01 02:00:00,2.26,278.27,277.02,91.7,181.0,0.0,14066001_0_2
3,14066001,2016-01-01 03:00:00,2.62,277.98,276.95,93.0,159.0,0.0,14066001_0_3
4,14066001,2016-01-01 04:00:00,2.99,277.32,276.72,95.9,171.0,0.0,14066001_0_4
...,...,...,...,...,...,...,...,...,...
4409469,95690001,2017-12-30 19:00:00,9.10,286.68,283.44,80.8,239.0,0.0,95690001_729_19
4409470,95690001,2017-12-30 20:00:00,8.58,286.39,283.21,81.1,231.0,0.0,95690001_729_20
4409471,95690001,2017-12-30 21:00:00,8.74,286.28,283.40,82.6,226.0,0.0,95690001_729_21
4409472,95690001,2017-12-30 22:00:00,9.04,286.21,283.29,82.4,224.0,0.0,95690001_729_22


In [164]:
# Add the day and hour variables
Xtrain = x
split_Id = Xtrain['Id'].str.split(pat="_", expand = True)
split_Id = split_Id.rename(columns={0: "number_sta_2", 1: "day", 2: "hour"})
Xtrain['number_sta_2'] = split_Id['number_sta_2']
Xtrain['day'] = split_Id["day"]
Xtrain['hour'] = split_Id["hour"]
Xtrain = Xtrain.drop("number_sta_2",axis=1)
display(Xtrain) 

Unnamed: 0,number_sta,date,ff,t,td,hu,dd,precip,Id,day,hour
0,14066001,2016-01-01 00:00:00,3.05,279.28,277.97,91.4,200.0,0.0,14066001_0_0,0,0
1,14066001,2016-01-01 01:00:00,2.57,278.76,277.45,91.4,190.0,0.0,14066001_0_1,0,1
2,14066001,2016-01-01 02:00:00,2.26,278.27,277.02,91.7,181.0,0.0,14066001_0_2,0,2
3,14066001,2016-01-01 03:00:00,2.62,277.98,276.95,93.0,159.0,0.0,14066001_0_3,0,3
4,14066001,2016-01-01 04:00:00,2.99,277.32,276.72,95.9,171.0,0.0,14066001_0_4,0,4
...,...,...,...,...,...,...,...,...,...,...,...
4409469,95690001,2017-12-30 19:00:00,9.10,286.68,283.44,80.8,239.0,0.0,95690001_729_19,729,19
4409470,95690001,2017-12-30 20:00:00,8.58,286.39,283.21,81.1,231.0,0.0,95690001_729_20,729,20
4409471,95690001,2017-12-30 21:00:00,8.74,286.28,283.40,82.6,226.0,0.0,95690001_729_21,729,21
4409472,95690001,2017-12-30 22:00:00,9.04,286.21,283.29,82.4,224.0,0.0,95690001_729_22,729,22


In [165]:
# Add the month

Xtrain['month'] = pd.DatetimeIndex(Xtrain['date']).month
display(Xtrain) 

Unnamed: 0,number_sta,date,ff,t,td,hu,dd,precip,Id,day,hour,month
0,14066001,2016-01-01 00:00:00,3.05,279.28,277.97,91.4,200.0,0.0,14066001_0_0,0,0,1
1,14066001,2016-01-01 01:00:00,2.57,278.76,277.45,91.4,190.0,0.0,14066001_0_1,0,1,1
2,14066001,2016-01-01 02:00:00,2.26,278.27,277.02,91.7,181.0,0.0,14066001_0_2,0,2,1
3,14066001,2016-01-01 03:00:00,2.62,277.98,276.95,93.0,159.0,0.0,14066001_0_3,0,3,1
4,14066001,2016-01-01 04:00:00,2.99,277.32,276.72,95.9,171.0,0.0,14066001_0_4,0,4,1
...,...,...,...,...,...,...,...,...,...,...,...,...
4409469,95690001,2017-12-30 19:00:00,9.10,286.68,283.44,80.8,239.0,0.0,95690001_729_19,729,19,12
4409470,95690001,2017-12-30 20:00:00,8.58,286.39,283.21,81.1,231.0,0.0,95690001_729_20,729,20,12
4409471,95690001,2017-12-30 21:00:00,8.74,286.28,283.40,82.6,226.0,0.0,95690001_729_21,729,21,12
4409472,95690001,2017-12-30 22:00:00,9.04,286.21,283.29,82.4,224.0,0.0,95690001_729_22,729,22,12


We will try to add an explanatory variable that is well correlated with the precip variable. To do so, we use the KNN algorithm to tell us whether it rains or not.
The new variable knn will take as its value the class predicted by the KNN algorithm (either it rains or it does not).
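
One way such a `knn` variable could be built is sketched below on simulated data. It uses out-of-fold predictions from `cross_val_predict`, which is a choice made here so that no row is predicted by a model that saw its own label; the toy columns and the rule generating the classes are assumptions. The class coding (1 = no rain, 2 = rain) follows the one used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "hu": rng.uniform(40, 100, 400),
    "ff": rng.gamma(2.0, 2.0, 400),
})
# Made-up rule: rain (class 2) is more likely when humidity is high
toy["class"] = np.where(toy["hu"] + 10 * rng.standard_normal(400) > 85, 2, 1)

# Out-of-fold KNN predictions stored as a new explanatory variable
knn = KNeighborsClassifier(n_neighbors=5)
toy["knn"] = cross_val_predict(knn, toy[["hu", "ff"]], toy["class"], cv=5)

print(toy.head())
```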


In [166]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer




In [167]:
Xtrain

Unnamed: 0,number_sta,date,ff,t,td,hu,dd,precip,Id,day,hour,month
0,14066001,2016-01-01 00:00:00,3.05,279.28,277.97,91.4,200.0,0.0,14066001_0_0,0,0,1
1,14066001,2016-01-01 01:00:00,2.57,278.76,277.45,91.4,190.0,0.0,14066001_0_1,0,1,1
2,14066001,2016-01-01 02:00:00,2.26,278.27,277.02,91.7,181.0,0.0,14066001_0_2,0,2,1
3,14066001,2016-01-01 03:00:00,2.62,277.98,276.95,93.0,159.0,0.0,14066001_0_3,0,3,1
4,14066001,2016-01-01 04:00:00,2.99,277.32,276.72,95.9,171.0,0.0,14066001_0_4,0,4,1
...,...,...,...,...,...,...,...,...,...,...,...,...
4409469,95690001,2017-12-30 19:00:00,9.10,286.68,283.44,80.8,239.0,0.0,95690001_729_19,729,19,12
4409470,95690001,2017-12-30 20:00:00,8.58,286.39,283.21,81.1,231.0,0.0,95690001_729_20,729,20,12
4409471,95690001,2017-12-30 21:00:00,8.74,286.28,283.40,82.6,226.0,0.0,95690001_729_21,729,21,12
4409472,95690001,2017-12-30 22:00:00,9.04,286.21,283.29,82.4,224.0,0.0,95690001_729_22,729,22,12


In [168]:
# Forward-fill missing values
Xtrain = Xtrain.ffill()
Xtrain['class'] = np.where(Xtrain['precip'] == 0,1,2)


X = Xtrain.drop(['date','class','Id'],axis=1)
Y = Xtrain['class']
X['day'] = X['day'].astype('category')
X['hour'] = X['hour'].astype('category')

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=.3, random_state=4,stratify=Y)

We standardise our data
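
A minimal sketch of this standardisation step, on simulated features: the scaler is fitted on the training split only and then reused on the test split, which keeps test information out of the preprocessing. The variable names and the simulated distributions are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
# Two simulated features on very different scales
features = rng.normal(loc=[280.0, 80.0], scale=[5.0, 10.0], size=(1000, 2))
labels = rng.integers(1, 3, size=1000)          # classes 1 and 2

F_train, F_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.3, random_state=4, stratify=labels
)

scaler = StandardScaler().fit(F_train)          # statistics come from the training split only
F_train_std = scaler.transform(F_train)
F_test_std = scaler.transform(F_test)           # reuse the training mean and scale

print(F_train_std.mean(axis=0).round(3), F_train_std.std(axis=0).round(3))
```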

We will now try to improve this score using cross-validation with KNN over a grid of values of k, in order to choose the k that maximises the accuracy of our model.
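
The search over k could look like the sketch below, which runs on a synthetic classification problem. The grid of values, the 5-fold setting and the use of a `Pipeline` to scale inside each fold are assumptions made for the example, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class problem standing in for rain / no rain
X_toy, y_toy = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=4, stratify=y_toy
)

# Scaling is part of the pipeline, so it is refitted on each training fold
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
grid = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 11, 21]}, cv=5, scoring="accuracy")
grid.fit(X_tr, y_tr)

print(grid.best_params_, round(grid.score(X_te, y_te), 3))
```

On the full station data, KNN prediction is costly, so a subsample may be needed to keep the search tractable.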


In [169]:
X = Xtrain
# Select columns with a list (indexing with a set is not supported by recent pandas versions)
Y = X[["date", "precip"]].set_index('date')



In [170]:
Xtrain

Unnamed: 0,number_sta,date,ff,t,td,hu,dd,precip,Id,day,hour,month,class
0,14066001,2016-01-01 00:00:00,3.05,279.28,277.97,91.4,200.0,0.0,14066001_0_0,0,0,1,1
1,14066001,2016-01-01 01:00:00,2.57,278.76,277.45,91.4,190.0,0.0,14066001_0_1,0,1,1,1
2,14066001,2016-01-01 02:00:00,2.26,278.27,277.02,91.7,181.0,0.0,14066001_0_2,0,2,1,1
3,14066001,2016-01-01 03:00:00,2.62,277.98,276.95,93.0,159.0,0.0,14066001_0_3,0,3,1,1
4,14066001,2016-01-01 04:00:00,2.99,277.32,276.72,95.9,171.0,0.0,14066001_0_4,0,4,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4409469,95690001,2017-12-30 19:00:00,9.10,286.68,283.44,80.8,239.0,0.0,95690001_729_19,729,19,12,1
4409470,95690001,2017-12-30 20:00:00,8.58,286.39,283.21,81.1,231.0,0.0,95690001_729_20,729,20,12,1
4409471,95690001,2017-12-30 21:00:00,8.74,286.28,283.40,82.6,226.0,0.0,95690001_729_21,729,21,12,1
4409472,95690001,2017-12-30 22:00:00,9.04,286.21,283.29,82.4,224.0,0.0,95690001_729_22,729,22,12,1


In [171]:
# Reshape the data for the model with aggregation over time
X["number_sta_day"] = Xtrain['number_sta'].astype(str) + '_' + Xtrain['day'].astype(str)
X

Unnamed: 0,number_sta,date,ff,t,td,hu,dd,precip,Id,day,hour,month,class,number_sta_day
0,14066001,2016-01-01 00:00:00,3.05,279.28,277.97,91.4,200.0,0.0,14066001_0_0,0,0,1,1,14066001_0
1,14066001,2016-01-01 01:00:00,2.57,278.76,277.45,91.4,190.0,0.0,14066001_0_1,0,1,1,1,14066001_0
2,14066001,2016-01-01 02:00:00,2.26,278.27,277.02,91.7,181.0,0.0,14066001_0_2,0,2,1,1,14066001_0
3,14066001,2016-01-01 03:00:00,2.62,277.98,276.95,93.0,159.0,0.0,14066001_0_3,0,3,1,1,14066001_0
4,14066001,2016-01-01 04:00:00,2.99,277.32,276.72,95.9,171.0,0.0,14066001_0_4,0,4,1,1,14066001_0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4409469,95690001,2017-12-30 19:00:00,9.10,286.68,283.44,80.8,239.0,0.0,95690001_729_19,729,19,12,1,95690001_729
4409470,95690001,2017-12-30 20:00:00,8.58,286.39,283.21,81.1,231.0,0.0,95690001_729_20,729,20,12,1,95690001_729
4409471,95690001,2017-12-30 21:00:00,8.74,286.28,283.40,82.6,226.0,0.0,95690001_729_21,729,21,12,1,95690001_729
4409472,95690001,2017-12-30 22:00:00,9.04,286.21,283.29,82.4,224.0,0.0,95690001_729_22,729,22,12,1,95690001_729


In [172]:
X.drop(['class','Id','number_sta'],axis=1,inplace= True)

In [173]:
X['ff_idx'] = 'ff_' + X["hour"].astype(str)
X['t_idx'] = 't_' + X["hour"].astype(str)
X['td_idx'] = 'td_' + X["hour"].astype(str)
X['hu_idx'] = 'hu_' + X["hour"].astype(str)
X['dd_idx'] = 'dd_' + X["hour"].astype(str)
X['precip_idx'] = 'precip_' + X["hour"].astype(str)

In [174]:
ff = X.pivot(index='number_sta_day',columns='ff_idx',values='ff')
t = X.pivot(index='number_sta_day',columns='t_idx',values='t')
td = X.pivot(index='number_sta_day',columns='td_idx',values='td')
hu = X.pivot(index='number_sta_day',columns='hu_idx',values='hu')
dd = X.pivot(index='number_sta_day',columns='dd_idx',values='dd')
precip = X.pivot(index='number_sta_day',columns='precip_idx',values='precip')

In [175]:
X_reshape = pd.concat([ff ,t,td,hu,dd,precip],axis=1)

In [176]:
X_reshape.reset_index(inplace=True)
X_reshape

Unnamed: 0,number_sta_day,ff_0,ff_1,ff_10,ff_11,ff_12,ff_13,ff_14,ff_15,ff_16,...,precip_21,precip_22,precip_23,precip_3,precip_4,precip_5,precip_6,precip_7,precip_8,precip_9
0,14066001_0,3.05,2.57,3.38,3.20,3.85,5.19,6.04,4.43,5.10,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,14066001_1,4.73,4.22,9.45,11.51,11.52,10.79,10.53,9.66,10.26,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,14066001_10,3.68,4.70,7.78,8.08,6.89,5.54,7.01,7.10,6.41,...,0.2,0.0,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,14066001_100,2.09,2.75,6.80,6.40,6.73,5.62,4.92,5.49,5.41,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,14066001_101,1.23,1.35,2.08,2.33,2.64,2.48,2.83,1.69,1.96,...,0.0,0.0,0.0,0.2,0.4,2.2,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
183742,95690001_725,8.48,9.21,7.40,7.22,6.82,6.88,7.09,5.77,6.43,...,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0
183743,95690001_726,7.76,9.47,7.49,10.45,9.44,12.49,12.29,12.05,11.24,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.4,0.8
183744,95690001_727,6.04,5.15,2.90,2.43,3.93,4.48,4.48,4.06,3.16,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
183745,95690001_728,1.78,1.26,9.64,10.86,9.73,9.10,10.45,8.57,9.81,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [177]:
# Cumulative rainfall data
fname = 'defi-ia-2022/Train/Train/Y_train.csv'
param = 'Ground_truth'  #weather parameter name in the file ('Ground_truth' about Y and 'Prediction' about baseline)
data = pd.read_csv(fname, parse_dates=['date'], infer_datetime_format=True)
data['number_sta'] = data['number_sta'].astype('category')
display(data)

Unnamed: 0,date,number_sta,Ground_truth,Id
0,2016-01-02,14066001,3.4,14066001_0
1,2016-01-02,14126001,0.5,14126001_0
2,2016-01-02,14137001,3.4,14137001_0
3,2016-01-02,14216001,4.0,14216001_0
4,2016-01-02,14296001,13.3,14296001_0
...,...,...,...,...
183742,2017-12-31,86137003,5.0,86137003_729
183743,2017-12-31,86165005,3.2,86165005_729
183744,2017-12-31,86272002,1.8,86272002_729
183745,2017-12-31,91200002,1.6,91200002_729


In [178]:
# Build a new frame instead of modifying a slice of `data` in place
Y = data[["Ground_truth","Id"]].rename(columns = {"Ground_truth":"Y","Id":"number_sta_day"})
Y = Y.dropna()
display(Y)

Unnamed: 0,Y,number_sta_day
0,3.4,14066001_0
1,0.5,14126001_0
2,3.4,14137001_0
3,4.0,14216001_0
4,13.3,14296001_0
...,...,...
183742,5.0,86137003_729
183743,3.2,86165005_729
183744,1.8,86272002_729
183745,1.6,91200002_729


In [179]:
all_data = pd.merge(X_reshape,Y)

display(all_data)
all_data.dropna(inplace=True)

Unnamed: 0,number_sta_day,ff_0,ff_1,ff_10,ff_11,ff_12,ff_13,ff_14,ff_15,ff_16,...,precip_22,precip_23,precip_3,precip_4,precip_5,precip_6,precip_7,precip_8,precip_9,Y
0,14066001_0,3.05,2.57,3.38,3.20,3.85,5.19,6.04,4.43,5.10,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.4
1,14066001_1,4.73,4.22,9.45,11.51,11.52,10.79,10.53,9.66,10.26,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.7
2,14066001_10,3.68,4.70,7.78,8.08,6.89,5.54,7.01,7.10,6.41,...,0.0,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,14066001_100,2.09,2.75,6.80,6.40,6.73,5.62,4.92,5.49,5.41,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.6
4,14066001_101,1.23,1.35,2.08,2.33,2.64,2.48,2.83,1.69,1.96,...,0.0,0.0,0.2,0.4,2.2,0.0,0.0,0.0,0.0,3.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162102,95690001_725,8.48,9.21,7.40,7.22,6.82,6.88,7.09,5.77,6.43,...,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,3.2
162103,95690001_726,7.76,9.47,7.49,10.45,9.44,12.49,12.29,12.05,11.24,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.4,0.8,0.0
162104,95690001_727,6.04,5.15,2.90,2.43,3.93,4.48,4.48,4.06,3.16,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.4
162105,95690001_728,1.78,1.26,9.64,10.86,9.73,9.10,10.45,8.57,9.81,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.4


In [180]:

Y = all_data['Y']
X = all_data.drop("Y",axis=1)
X

Unnamed: 0,number_sta_day,ff_0,ff_1,ff_10,ff_11,ff_12,ff_13,ff_14,ff_15,ff_16,...,precip_21,precip_22,precip_23,precip_3,precip_4,precip_5,precip_6,precip_7,precip_8,precip_9
0,14066001_0,3.05,2.57,3.38,3.20,3.85,5.19,6.04,4.43,5.10,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,14066001_1,4.73,4.22,9.45,11.51,11.52,10.79,10.53,9.66,10.26,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,14066001_10,3.68,4.70,7.78,8.08,6.89,5.54,7.01,7.10,6.41,...,0.2,0.0,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,14066001_100,2.09,2.75,6.80,6.40,6.73,5.62,4.92,5.49,5.41,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,14066001_101,1.23,1.35,2.08,2.33,2.64,2.48,2.83,1.69,1.96,...,0.0,0.0,0.0,0.2,0.4,2.2,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162102,95690001_725,8.48,9.21,7.40,7.22,6.82,6.88,7.09,5.77,6.43,...,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0
162103,95690001_726,7.76,9.47,7.49,10.45,9.44,12.49,12.29,12.05,11.24,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.4,0.8
162104,95690001_727,6.04,5.15,2.90,2.43,3.93,4.48,4.48,4.06,3.16,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
162105,95690001_728,1.78,1.26,9.64,10.86,9.73,9.10,10.45,8.57,9.81,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [181]:
X['day'] = X['number_sta_day'].str.rpartition("_")[2]
X['day'] = X['day'].astype('int')
X['day'] = X['day'] % 365

In [182]:
X['number_sta'] = X['number_sta_day'].str.rpartition('_')[0]
X['number_sta'] = X['number_sta'].astype('int64')

In [183]:
X['season'] =pd.cut(X['day'],[0,60,151,243,334], labels=['Hiver','Printemps','Ete', 'Automne'],ordered=False)
X['season'] = X['season'].fillna('Hiver')
X['season']

0             Hiver
1             Hiver
2             Hiver
3         Printemps
4         Printemps
            ...    
162102        Hiver
162103        Hiver
162104        Hiver
162105        Hiver
162106        Hiver
Name: season, Length: 162090, dtype: category
Categories (4, object): ['Hiver', 'Printemps', 'Ete', 'Automne']

In [184]:
# Approximate month index from the day of year using 30-day blocks (values run from 0 to 12)
X['month'] = X['day'] / 30
X['month'] = X['month'].astype('int')


In [185]:
X.reset_index(inplace=True)
X.drop('index',axis=1,inplace=True)

In [186]:
X['month'] = X['month'].astype('category')
X['season'] = X['season'].astype('category')
X['day'] = X['day'].astype('category')

Let us add the station coordinates

In [187]:
#open the file with station coordinates (latitude/longitude)

coords_fname  = 'defi-ia-2022/Other/Other/stations_coordinates.csv'
coords = pd.read_csv(coords_fname)
coords['number_sta'] = coords['number_sta'].astype('category')
display(coords)



Unnamed: 0,number_sta,lat,lon,height_sta
0,86118001,46.477,0.985,120.0
1,86149001,46.917,0.025,60.0
2,56081003,48.050,-3.660,165.0
3,53215001,47.790,-0.710,63.0
4,22135001,48.550,-3.380,148.0
...,...,...,...,...
320,86137003,47.035,0.098,96.0
321,86165005,46.412,0.841,153.0
322,86273001,46.464,1.042,121.0
323,91200002,48.526,1.993,116.0


In [188]:
len(np.unique(X['number_sta']))

254

In [189]:
coords.dtypes

number_sta    category
lat            float64
lon            float64
height_sta     float64
dtype: object

In [190]:
X =pd.merge(X,coords,how='left',on='number_sta')
X['number_sta'] = X['number_sta'].astype('category')
X

Unnamed: 0,number_sta_day,ff_0,ff_1,ff_10,ff_11,ff_12,ff_13,ff_14,ff_15,ff_16,...,precip_7,precip_8,precip_9,day,number_sta,season,month,lat,lon,height_sta
0,14066001_0,3.05,2.57,3.38,3.20,3.85,5.19,6.04,4.43,5.10,...,0.0,0.0,0.0,0,14066001,Hiver,0,49.334,-0.431,2.0
1,14066001_1,4.73,4.22,9.45,11.51,11.52,10.79,10.53,9.66,10.26,...,0.0,0.0,0.0,1,14066001,Hiver,0,49.334,-0.431,2.0
2,14066001_10,3.68,4.70,7.78,8.08,6.89,5.54,7.01,7.10,6.41,...,0.0,0.0,0.0,10,14066001,Hiver,0,49.334,-0.431,2.0
3,14066001_100,2.09,2.75,6.80,6.40,6.73,5.62,4.92,5.49,5.41,...,0.0,0.0,0.0,100,14066001,Printemps,3,49.334,-0.431,2.0
4,14066001_101,1.23,1.35,2.08,2.33,2.64,2.48,2.83,1.69,1.96,...,0.0,0.0,0.0,101,14066001,Printemps,3,49.334,-0.431,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162085,95690001_725,8.48,9.21,7.40,7.22,6.82,6.88,7.09,5.77,6.43,...,0.0,0.0,0.0,360,95690001,Hiver,12,49.108,1.831,126.0
162086,95690001_726,7.76,9.47,7.49,10.45,9.44,12.49,12.29,12.05,11.24,...,0.0,1.4,0.8,361,95690001,Hiver,12,49.108,1.831,126.0
162087,95690001_727,6.04,5.15,2.90,2.43,3.93,4.48,4.48,4.06,3.16,...,0.0,0.0,0.0,362,95690001,Hiver,12,49.108,1.831,126.0
162088,95690001_728,1.78,1.26,9.64,10.86,9.73,9.10,10.45,8.57,9.81,...,0.0,0.0,0.0,363,95690001,Hiver,12,49.108,1.831,126.0


In [191]:
Y =pd.DataFrame(Y)
Y.reset_index(inplace=True)
Y.drop('index',inplace = True,axis=1)

In [192]:
Y

Unnamed: 0,Y
0,3.4
1,11.7
2,1.0
3,5.6
4,3.2
...,...
162085,3.2
162086,0.0
162087,4.4
162088,5.4


In [193]:
#X['class'] = np.where(Y['Y'] == 0,1,2)
#Y = X.loc[:,X.columns.str.startswith('precip')].sum(axis=1)
Class =np.where(Y['Y'] == 0,1,2)
Class = pd.DataFrame(Class)
Class.reset_index(inplace=True)
Class.drop('index',inplace=True,axis=1)

In [194]:
Class

Unnamed: 0,0
0,2
1,2
2,2
3,2
4,2
...,...
162085,2
162086,1
162087,2
162088,2


In [195]:
from sklearn.model_selection import ShuffleSplit,StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder,OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

Let us use KNN

In [38]:

split = StratifiedShuffleSplit(n_splits=1, test_size=0.3,random_state = 4)
split.get_n_splits(X, Class)
train_index, test_index = next(split.split(X, Class)) 

X_train,X_test = X.loc[train_index], X.loc[test_index]
Y_train, Y_test = Class.loc[train_index], Class.loc[test_index]


In [39]:
transformer = ColumnTransformer(
                     [
                         ('transform_name_categories', OrdinalEncoder(), make_column_selector(dtype_include=object)),
                         ('transformer_name_for_numerical', StandardScaler(), make_column_selector(dtype_include=np.number))
                     ]
                 )
X_train = transformer.fit_transform(X_train)
X_test = transformer.fit_transform(X_test)




In [40]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train,Y_train)



predictions = knn.predict(X_test)
accuracy_score(Y_test,predictions)


0.5545890143336006

In [41]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.3,random_state = 4)
split.get_n_splits(X, Class)
train_index, test_index = next(split.split(X, Class)) 

X_train,X_test = X.loc[train_index], X.loc[test_index]
Y_train, Y_test = Class.loc[train_index], Class.loc[test_index]


Random Forest

In [42]:
from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(oob_score=True)

transformer = ColumnTransformer(
                     [
                         ('transform_name_categories', OrdinalEncoder(), make_column_selector(dtype_include=object)),
                         ('transformer_name_for_numerical', StandardScaler(), make_column_selector(dtype_include=np.number))
                     ]
                 )
X_train = transformer.fit_transform(X_train)
X_test = transformer.fit_transform(X_test)

X_train
random_forest.fit(X_train,Y_train)


RandomForestClassifier(oob_score=True)

In [43]:
predictions = random_forest.predict(X_test)
accuracy_score(Y_test,predictions)

0.7434552820449545

We retrieve the important features

In [44]:
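# Get numerical feature importances
importances = list(random_forest.feature_importances_)
# Column names in the order produced by the ColumnTransformer:
# object columns first, then numeric columns (other dtypes, e.g. category, are dropped)
feature_names = (make_column_selector(dtype_include=object)(X)
                 + make_column_selector(dtype_include=np.number)(X))
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_names, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda pair: pair[1], reverse = True)
# Print out the feature and importances
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))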

Variable: t_16                 Importance: 0.02
Variable: t_17                 Importance: 0.02
Variable: t_18                 Importance: 0.02
Variable: t_19                 Importance: 0.02
Variable: number_sta_day       Importance: 0.01
Variable: ff_20                Importance: 0.01
Variable: ff_21                Importance: 0.01
Variable: ff_22                Importance: 0.01
Variable: ff_23                Importance: 0.01
Variable: t_0                  Importance: 0.01
Variable: t_1                  Importance: 0.01
Variable: t_10                 Importance: 0.01
Variable: t_11                 Importance: 0.01
Variable: t_12                 Importance: 0.01
Variable: t_13                 Importance: 0.01
Variable: t_14                 Importance: 0.01
Variable: t_15                 Importance: 0.01
Variable: t_2                  Importance: 0.01
Variable: t_20                 Importance: 0.01
Variable: t_21                 Importance: 0.01
Variable: t_22                 Importanc

Let us use a Support Vector Machine (SVM) to classify rainy days

In [45]:
# from sklearn import svm
# svm_model = svm.SVC(kernel = 'linear')
# svm_model.fit(X_train,Y_train)
# predictions = svm_model.predict(X_test)
# accuracy_score(Y_test,predictions)

In [46]:
# predictions

Logistic Regression

In [158]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.3,random_state = 4)
split.get_n_splits(X, Class)
train_index, test_index = next(split.split(X, Class)) 

X_train,X_test = X.loc[train_index], X.loc[test_index]
Y_train, Y_test = Class.loc[train_index], Class.loc[test_index]

transformer = ColumnTransformer(
                     [
                         ('transform_name_categories', OrdinalEncoder(), make_column_selector(dtype_include=object)),
                         ('transformer_name_for_numerical', StandardScaler(), make_column_selector(dtype_include=np.number))
                     ]
                 )
X_train = transformer.fit_transform(X_train)
X_test = transformer.fit_transform(X_test)



ValueError: Found input variables with inconsistent numbers of samples: [96055, 162090]

In [48]:
from sklearn.linear_model import LogisticRegression

lg = LogisticRegression(random_state=4,class_weight = {1:0.7,2:1})
lg.fit(X_train,Y_train)


LogisticRegression(class_weight={1: 0.7, 2: 1}, random_state=4)

In [49]:
predictions = lg.predict(X_test)
accuracy_score(Y_test,predictions)

0.6405494889670348

XGBoost

In [196]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.3,random_state = 4)
split.get_n_splits(X, Class)
train_index, test_index = next(split.split(X, Class)) 

X_train,X_test = X.loc[train_index], X.loc[test_index]
Y_train, Y_test = Class.loc[train_index], Class.loc[test_index]

transformer = ColumnTransformer(
                     [
                         ('transform_name_categories', OrdinalEncoder(), make_column_selector(dtype_include=object)),
                         ('transformer_name_for_numerical', StandardScaler(), make_column_selector(dtype_include=np.number))
                     ]
                 )
X_train = transformer.fit_transform(X_train)
X_test = transformer.fit_transform(X_test)

In [197]:
from xgboost import XGBClassifier
xg = XGBClassifier(n_estimators=500,
    max_depth=11,
    learning_rate=0.05,
    subsample=0.9,
    colsample_bytree=0.7,
    missing=-999,
    random_state=4,
    tree_method='gpu_hist')
xg.fit(X_train,Y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.7,
              enable_categorical=False, gamma=0, gpu_id=0, importance_type=None,
              interaction_constraints='', learning_rate=0.05, max_delta_step=0,
              max_depth=11, min_child_weight=1, missing=-999,
              monotone_constraints='()', n_estimators=500, n_jobs=8,
              num_parallel_tree=1, predictor='auto', random_state=4,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=0.9,
              tree_method='gpu_hist', validate_parameters=1, verbosity=None)

In [198]:
predictions = xg.predict(X_test)
accuracy_score(Y_test,predictions)

0.7552388590700639

In [53]:
from xgboost import XGBClassifier
xg = XGBClassifier(n_estimators=2000,
    max_depth=6,
    learning_rate=0.02,
    subsample=0.9,
    colsample_bytree=0.7,
    random_state=4,
    objective = 'binary:logistic',
    nthread= -1,
    min_child_weight= 2,   
    tree_method='gpu_hist')
xg.fit(X_train,Y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.7,
              enable_categorical=False, gamma=0, gpu_id=0, importance_type=None,
              interaction_constraints='', learning_rate=0.02, max_delta_step=0,
              max_depth=6, min_child_weight=2, missing=nan,
              monotone_constraints='()', n_estimators=2000, n_jobs=8,
              nthread=-1, num_parallel_tree=1, predictor='auto', random_state=4,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=0.9,
              tree_method='gpu_hist', validate_parameters=1, verbosity=None)

In [54]:
predictions = xg.predict(X_test)
accuracy_score(Y_test,predictions)

0.7406790466201905

Extra Trees Classifier (lower variance than Random Forest)

In [55]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.3,random_state = 4)
split.get_n_splits(X, Class)
train_index, test_index = next(split.split(X, Class)) 

X_train,X_test = X.loc[train_index], X.loc[test_index]
Y_train, Y_test = Class.loc[train_index], Class.loc[test_index]

transformer = ColumnTransformer(
                     [
                         ('transform_name_categories', OrdinalEncoder(), make_column_selector(dtype_include=object)),
                         ('transformer_name_for_numerical', StandardScaler(), make_column_selector(dtype_include=np.number))
                     ]
                 )
X_train = transformer.fit_transform(X_train)
X_test = transformer.fit_transform(X_test)

In [56]:
from sklearn.ensemble import ExtraTreesClassifier
extree = ExtraTreesClassifier(
    random_state = 4,
    n_estimators =500 )
extree.fit(X_train,Y_train)

ExtraTreesClassifier(n_estimators=500, random_state=4)

In [57]:
predictions = extree.predict(X_test)
accuracy_score(Y_test,predictions)

0.7413782466530939

We will now use another model to predict cumulative rainfall on the days for which rain was predicted
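
The two-stage idea can be sketched as follows on simulated data: a classifier decides between dry (class 1) and rainy (class 2) days, and a regressor trained on rainy days only supplies the amount where rain is predicted. The choice of `RandomForestClassifier` and `LinearRegression`, the data-generating rules and the 700/300 split are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 1000
X_toy = rng.standard_normal((n, 5))
rain = X_toy[:, 0] + 0.5 * rng.standard_normal(n) > 0
amount = np.where(rain, np.exp(0.5 * X_toy[:, 1]) + 0.1, 0.0)   # daily total, zero on dry days
cls = np.where(amount == 0, 1, 2)                               # 1 = dry, 2 = rain

train, test = np.arange(0, 700), np.arange(700, n)

# Stage 1: rain / no rain
clf = RandomForestClassifier(n_estimators=100, random_state=4).fit(X_toy[train], cls[train])

# Stage 2: amount of rain, fitted on rainy training days only
wet_train = train[cls[train] == 2]
reg = LinearRegression().fit(X_toy[wet_train], amount[wet_train])

# Combine: zero where dry is predicted, regression output (clipped at 0) elsewhere
pred_cls = clf.predict(X_toy[test])
pred_amount = np.zeros(len(test))
wet_pred = pred_cls == 2
pred_amount[wet_pred] = np.clip(reg.predict(X_toy[test][wet_pred]), 0, None)

print(pred_amount[:10].round(2))
```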

In [201]:
X_test = pd.DataFrame(X_test)
X_test['index_original'] = test_index
X_test

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,139,140,141,142,143,144,145,146,147,index_original
0,9963.0,-0.782191,-0.369685,-0.253163,-0.225904,-0.386027,-0.464068,-0.459726,-0.431489,-0.330751,...,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,0.246583,0.770286,1.779455,33008
1,23933.0,1.506121,1.504938,1.465567,1.453535,1.442577,1.444650,1.450591,1.464252,1.480821,...,-0.208987,0.247758,-0.208473,-0.188604,-0.196418,-0.204863,-0.861916,-0.497492,-1.101222,79626
2,21010.0,0.861483,0.861775,0.747444,0.722892,0.704842,0.701588,0.706907,0.727601,0.759611,...,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,-0.847017,0.598202,0.007839,69919
3,6317.0,-1.451270,-1.515796,-1.665585,-1.569456,-1.833526,-1.738895,-1.855456,-1.845440,-1.656959,...,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,1.008429,0.946319,0.382327,21003
4,16984.0,-1.252685,-1.399966,-0.753467,-0.894217,-0.903491,-1.094438,-0.780462,-0.874877,-1.509982,...,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,-1.390341,1.370604,1.332950,56399
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48622,29183.0,0.662898,0.663644,0.526221,0.497813,0.477578,0.472683,0.477810,0.500671,0.537437,...,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,0.439280,0.153605,2.384397,97449
48623,36819.0,-0.809687,-1.006752,-0.841956,-0.859590,-0.837060,-0.904271,-0.984887,-0.902806,-0.976765,...,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,0.128383,0.129344,1.361757,122775
48624,40096.0,-0.803577,-0.342252,-0.484596,0.058042,-0.064361,-0.228120,-0.213006,-0.082366,-0.528999,...,-0.208987,-0.201147,0.278680,-0.188604,-0.196418,-0.204863,0.947839,1.201341,0.166276,133658
48625,33696.0,1.023407,1.023328,0.927825,0.906418,0.890150,0.888234,0.893709,0.912637,0.940768,...,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,0.707465,0.107905,1.693035,112123


In [202]:
index_pluie = np.where(predictions == 2)[0]  # row positions predicted as class 2 (rain)

In [203]:
X_pluie = X_test.loc[index_pluie]
X_pluie

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,139,140,141,142,143,144,145,146,147,index_original
1,23933.0,1.506121,1.504938,1.465567,1.453535,1.442577,1.444650,1.450591,1.464252,1.480821,...,-0.208987,0.247758,-0.208473,-0.188604,-0.196418,-0.204863,-0.861916,-0.497492,-1.101222,79626
2,21010.0,0.861483,0.861775,0.747444,0.722892,0.704842,0.701588,0.706907,0.727601,0.759611,...,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,-0.847017,0.598202,0.007839,69919
4,16984.0,-1.252685,-1.399966,-0.753467,-0.894217,-0.903491,-1.094438,-0.780462,-0.874877,-1.509982,...,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,-1.390341,1.370604,1.332950,56399
5,15876.0,0.696505,0.697174,0.563659,0.535903,0.516038,0.511421,0.516580,0.539074,0.575035,...,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,-0.930452,1.248171,0.339117,52660
6,38138.0,0.494865,0.495995,0.339033,0.307361,0.285277,0.278994,0.283958,0.308653,0.349443,...,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,1.517981,1.069317,1.232126,127128
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48622,29183.0,0.662898,0.663644,0.526221,0.497813,0.477578,0.472683,0.477810,0.500671,0.537437,...,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,0.439280,0.153605,2.384397,97449
48623,36819.0,-0.809687,-1.006752,-0.841956,-0.859590,-0.837060,-0.904271,-0.984887,-0.902806,-0.976765,...,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,0.128383,0.129344,1.361757,122775
48624,40096.0,-0.803577,-0.342252,-0.484596,0.058042,-0.064361,-0.228120,-0.213006,-0.082366,-0.528999,...,-0.208987,-0.201147,0.278680,-0.188604,-0.196418,-0.204863,0.947839,1.201341,0.166276,133658
48625,33696.0,1.023407,1.023328,0.927825,0.906418,0.890150,0.888234,0.893709,0.912637,0.940768,...,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,0.707465,0.107905,1.693035,112123


In [204]:
X_pluie.shape

(31109, 149)

In [205]:
Y_test = pd.DataFrame(Y_test)
Y_test

Unnamed: 0,0
33008,1
79626,2
69919,2
21003,1
56399,2
...,...
97449,2
122775,2
133658,2
112123,2


In [206]:
Y = pd.DataFrame(Y)
Y.reset_index(inplace=True)
Y.drop('index',axis=1,inplace=True)

In [207]:
Y

Unnamed: 0,Y
0,3.4
1,11.7
2,1.0
3,5.6
4,3.2
...,...
162085,3.2
162086,0.0
162087,4.4
162088,5.4


In [208]:
Y_pluie = Y.loc[X_pluie['index_original']]



In [209]:
Y_pluie = pd.DataFrame(Y_pluie)
Y_pluie = Y_pluie.reset_index()


In [210]:
Y_pluie.drop('index',axis=1,inplace=True)

In [211]:
Y_pluie

Unnamed: 0,Y
0,1.2
1,17.8
2,4.2
3,0.2
4,0.0
...,...
31104,0.2
31105,16.3
31106,1.0
31107,0.2


In [212]:
sum(Y_pluie['Y']==0)

7252

In [213]:
from sklearn.metrics import confusion_matrix
confusion_matrix(Y_test,predictions)

array([[12868,  7252],
       [ 4650, 23857]], dtype=int64)
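
The confusion matrix above can be read directly: rows are the true classes (1, then 2) and columns are the predicted classes. The short sketch below recomputes the usual summary figures from the four counts; the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = np.array([[12868, 7252],
               [4650, 23857]])

tn, fp = cm[0]  # true class 1: correctly predicted dry, wrongly predicted rain
fn, tp = cm[1]  # true class 2: wrongly predicted dry, correctly predicted rain

n_predicted_rain = fp + tp     # 31109, the number of rows in X_pluie
precision = tp / (tp + fp)     # share of predicted-rain days that are truly rainy
recall = tp / (tp + fn)        # share of truly rainy days that are detected
accuracy = (tn + tp) / cm.sum()
```

The 7252 false positives match the number of rows with `Y == 0` found in `Y_pluie`: these dry days are passed on to the regression stage along with the truly rainy ones.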

After predicting the days on which it rains, we fit a regression model to estimate the daily rainfall total on the days predicted as rainy.

In [214]:
index_original = X_pluie['index_original']
X_pluie.drop('index_original',axis=1,inplace=True)

In [215]:
X_pluie.reset_index(inplace=True)

In [216]:
X_pluie.drop('index',axis=1,inplace=True)

In [217]:
Y_pluie

Unnamed: 0,Y
0,1.2
1,17.8
2,4.2
3,0.2
4,0.0
...,...
31104,0.2
31105,16.3
31106,1.0
31107,0.2


In [218]:
Y_pluie.shape

(31109, 1)

In [219]:
X_pluie

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,138,139,140,141,142,143,144,145,146,147
0,23933.0,1.506121,1.504938,1.465567,1.453535,1.442577,1.444650,1.450591,1.464252,1.480821,...,-0.200889,-0.208987,0.247758,-0.208473,-0.188604,-0.196418,-0.204863,-0.861916,-0.497492,-1.101222
1,21010.0,0.861483,0.861775,0.747444,0.722892,0.704842,0.701588,0.706907,0.727601,0.759611,...,-0.200889,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,-0.847017,0.598202,0.007839
2,16984.0,-1.252685,-1.399966,-0.753467,-0.894217,-0.903491,-1.094438,-0.780462,-0.874877,-1.509982,...,-0.200889,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,-1.390341,1.370604,1.332950
3,15876.0,0.696505,0.697174,0.563659,0.535903,0.516038,0.511421,0.516580,0.539074,0.575035,...,-0.200889,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,-0.930452,1.248171,0.339117
4,38138.0,0.494865,0.495995,0.339033,0.307361,0.285277,0.278994,0.283958,0.308653,0.349443,...,0.264361,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,1.517981,1.069317,1.232126
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31104,29183.0,0.662898,0.663644,0.526221,0.497813,0.477578,0.472683,0.477810,0.500671,0.537437,...,-0.200889,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,0.439280,0.153605,2.384397
31105,36819.0,-0.809687,-1.006752,-0.841956,-0.859590,-0.837060,-0.904271,-0.984887,-0.902806,-0.976765,...,-0.200889,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,0.128383,0.129344,1.361757
31106,40096.0,-0.803577,-0.342252,-0.484596,0.058042,-0.064361,-0.228120,-0.213006,-0.082366,-0.528999,...,-0.200889,-0.208987,-0.201147,0.278680,-0.188604,-0.196418,-0.204863,0.947839,1.201341,0.166276
31107,33696.0,1.023407,1.023328,0.927825,0.906418,0.890150,0.888234,0.893709,0.912637,0.940768,...,-0.200889,-0.208987,-0.201147,-0.208473,-0.188604,-0.196418,-0.204863,0.707465,0.107905,1.693035


In [220]:
X_pluie.shape

(31109, 148)

In [222]:
split = ShuffleSplit(n_splits=1, test_size=0.3,random_state = 4)
split.get_n_splits(X_pluie, Y_pluie)
train_index, test_index = next(split.split(X_pluie, Y_pluie)) 

X_train,X_test = X_pluie.loc[train_index], X_pluie.loc[test_index]
Y_train, Y_test = Y_pluie.loc[train_index], Y_pluie.loc[test_index]


Let us start with a linear regression.


In [223]:
from sklearn import linear_model, metrics

reg = linear_model.LinearRegression()

reg.fit(X_train,Y_train)



LinearRegression()

In [224]:
predictions = reg.predict(X_test)
predictions

array([[0.45103199],
       [1.68803053],
       [4.2037174 ],
       ...,
       [2.29403717],
       [5.40609177],
       [1.86199501]])

In [225]:
np.mean(np.abs(predictions-Y_test)/(Y_test+1))*100

Y    126.140274
dtype: float64

In [226]:
metrics.mean_absolute_error(predictions,Y_test)

2.8788738788348676
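
The first score computed above is a mean absolute percentage error in which 1 is added to the denominator, so that dry days (`Y = 0`) do not cause a division by zero. A small helper makes this explicit and works on flat arrays; the name `mape_plus_one` is hypothetical and only used for this sketch.

```python
import numpy as np

def mape_plus_one(y_true, y_pred):
    """Mean of |y_pred - y_true| / (y_true + 1), expressed in percent."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return 100.0 * np.mean(np.abs(y_pred - y_true) / (y_true + 1.0))

# Tiny worked example: relative errors are 1/1, 0/2 and 2/4, so the mean is 0.5
example_score = mape_plus_one([0.0, 1.0, 3.0], [1.0, 1.0, 1.0])  # 50.0
```

Because the error is divided by `y_true + 1`, an over-prediction on a dry day weighs much more than an error of the same size on a very rainy day.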

Random Forest Regression

In [227]:
split = ShuffleSplit(n_splits=1, test_size=0.3,random_state = 4)
split.get_n_splits(X_pluie, Y_pluie)
train_index, test_index = next(split.split(X_pluie, Y_pluie)) 

X_train,X_test = X_pluie.loc[train_index], X_pluie.loc[test_index]
Y_train, Y_test = Y_pluie.loc[train_index], Y_pluie.loc[test_index]

In [228]:
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(random_state=4)
regressor.fit(X_train, Y_train)
predictions = regressor.predict(X_test)

In [229]:
new_predictions = predictions.reshape(-1, 1)  # column vector matching the one-column Y_test

In [230]:
np.mean(np.abs(new_predictions-Y_test)/(Y_test+1))*100

Y    124.201792
dtype: float64

In [231]:
metrics.mean_absolute_error(predictions,Y_test)

2.7625826636665596

Quantile regression

In [121]:
split = ShuffleSplit(n_splits=1, test_size=0.3,random_state = 4)
split.get_n_splits(X_pluie, Y_pluie)
train_index, test_index = next(split.split(X_pluie, Y_pluie)) 

X_train,X_test = X_pluie.loc[train_index], X_pluie.loc[test_index]
Y_train, Y_test = Y_pluie.loc[train_index], Y_pluie.loc[test_index]

In [139]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_pinball_loss, mean_squared_error


all_models = {}
common_params = dict(
    learning_rate=0.05,
    n_estimators=1000,
    max_depth=3,
    min_samples_leaf=9,
    min_samples_split=9,
    max_features = 'auto'
)
for alpha in [0.05, 0.5, 0.95]:
    gbr = GradientBoostingRegressor(loss="quantile", alpha=alpha, **common_params)
    all_models["q %1.2f" % alpha] = gbr.fit(X_train, Y_train)

In [140]:
all_models

{'q 0.05': GradientBoostingRegressor(alpha=0.05, learning_rate=0.05, loss='quantile',
                           max_features='auto', min_samples_leaf=9,
                           min_samples_split=9, n_estimators=1000),
 'q 0.50': GradientBoostingRegressor(alpha=0.5, learning_rate=0.05, loss='quantile',
                           max_features='auto', min_samples_leaf=9,
                           min_samples_split=9, n_estimators=1000),
 'q 0.95': GradientBoostingRegressor(alpha=0.95, learning_rate=0.05, loss='quantile',
                           max_features='auto', min_samples_leaf=9,
                           min_samples_split=9, n_estimators=1000)}

In [232]:
predictions = all_models['q 0.05'].predict(X_test)
quantile_regression_pred = predictions.reshape(-1, 1)  # column vector matching the one-column Y_test
np.mean(np.abs(quantile_regression_pred-Y_test)/(Y_test+1))*100


Y    41.070987
dtype: float64

In [233]:
predictions = all_models['q 0.50'].predict(X_test)
quantile_regression_pred = predictions.reshape(-1, 1)  # column vector matching the one-column Y_test
np.mean(np.abs(quantile_regression_pred-Y_test)/(Y_test+1))*100

Y    47.719771
dtype: float64

In [234]:
predictions = all_models['q 0.95'].predict(X_test)
quantile_regression_pred = [[i] for i in predictions]
quantile_regression_pred = np.asarray(quantile_regression_pred)
np.mean(np.abs(quantile_regression_pred-Y_test)/(Y_test+1))*100

Y    469.222313
dtype: float64
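
With `loss="quantile"`, `GradientBoostingRegressor` is trained against the pinball loss at level `alpha`; `mean_pinball_loss` is imported above but not used. The NumPy sketch below writes that loss out by hand on made-up values, to show why a low quantile pulls predictions towards small amounts. The function name and the toy data are assumptions for illustration.

```python
import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    """Average pinball (quantile) loss at level alpha."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction costs alpha per unit, over-prediction costs (1 - alpha) per unit
    return np.mean(np.maximum(alpha * diff, (alpha - 1.0) * diff))

# Toy target: mostly dry days and one wet day
y_toy = np.array([0.0, 0.0, 0.0, 10.0])

loss_low = pinball_loss(y_toy, np.zeros(4), alpha=0.05)       # always predict 0
loss_high = pinball_loss(y_toy, np.full(4, 10.0), alpha=0.05)  # always predict 10
```

At `alpha = 0.05` the constant prediction 0 is far cheaper than the constant prediction 10, which is consistent with the near-zero values later returned by the `'q 0.05'` model.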

Predictions on the test data

In [235]:
# Read the test-set data

path = 'defi-ia-2022/Test/Test/X_station_test.csv'

Xtest = pd.read_csv(path)
display(Xtest)
#Xtest['number_sta']=Xtest['number_sta'].astype('category')

# Sort by station, then by date
#Xtest = Xtest.sort_values(['number_sta','date'])
#Xtest

Unnamed: 0,dd,hu,td,t,ff,precip,month,Id
0,,,,278.35,,,12,14047002_277_4
1,,,,278.40,,0.0,12,14047002_277_5
2,,,,279.01,,0.0,12,14047002_277_6
3,,,,279.66,,0.0,12,14047002_277_7
4,,,,279.99,,0.0,12,14047002_277_8
...,...,...,...,...,...,...,...,...
2304797,190.0,82.8,277.00,279.74,10.62,0.0,12,95690001_176_19
2304798,195.0,84.2,277.44,279.93,11.86,0.0,12,95690001_176_20
2304799,199.0,85.7,277.95,280.21,11.77,0.0,12,95690001_176_21
2304800,198.0,85.3,278.25,280.58,10.16,0.0,12,95690001_176_22


In [236]:
split_Id = Xtest['Id'].str.split(pat="_", expand = True)
split_Id = split_Id.rename(columns={0: "number_sta_2", 1: "day", 2: "hour"})
Xtest['number_sta_2'] = split_Id['number_sta_2']
Xtest['day'] = split_Id["day"]
Xtest['hour'] = split_Id["hour"]
Xtest['Id'] = split_Id['number_sta_2'] + "_" + split_Id['day']
Xtest = Xtest.drop("number_sta_2",axis=1)
Xtest.bfill(inplace=True)  # back-fill NaNs; fillna(method=...) is deprecated in recent pandas
display(Xtest)

Unnamed: 0,dd,hu,td,t,ff,precip,month,Id,day,hour
0,221.0,89.3,275.19,278.35,4.90,0.0,12,14047002_277,277,4
1,221.0,89.3,275.19,278.40,4.90,0.0,12,14047002_277,277,5
2,221.0,89.3,275.19,279.01,4.90,0.0,12,14047002_277,277,6
3,221.0,89.3,275.19,279.66,4.90,0.0,12,14047002_277,277,7
4,221.0,89.3,275.19,279.99,4.90,0.0,12,14047002_277,277,8
...,...,...,...,...,...,...,...,...,...,...
2304797,190.0,82.8,277.00,279.74,10.62,0.0,12,95690001_176,176,19
2304798,195.0,84.2,277.44,279.93,11.86,0.0,12,95690001_176,176,20
2304799,199.0,85.7,277.95,280.21,11.77,0.0,12,95690001_176,176,21
2304800,198.0,85.3,278.25,280.58,10.16,0.0,12,95690001_176,176,22


In [237]:
Xtest['ff_idx'] = 'ff_' + Xtest["hour"].astype(str)
Xtest['t_idx'] = 't_' + Xtest["hour"].astype(str)
Xtest['td_idx'] = 'td_' + Xtest["hour"].astype(str)
Xtest['hu_idx'] = 'hu_' + Xtest["hour"].astype(str)
Xtest['dd_idx'] = 'dd_' + Xtest["hour"].astype(str)
Xtest['precip_idx'] = 'precip_' + Xtest["hour"].astype(str)

In [238]:
ff = Xtest.pivot(index='Id',columns='ff_idx',values='ff')
t = Xtest.pivot(index='Id',columns='t_idx',values='t')
td = Xtest.pivot(index='Id',columns='td_idx',values='td')
hu = Xtest.pivot(index='Id',columns='hu_idx',values='hu')
dd = Xtest.pivot(index='Id',columns='dd_idx',values='dd')
precip = Xtest.pivot(index='Id',columns='precip_idx',values='precip')

In [239]:
X_reshape = pd.concat([ff ,t,td,hu,dd,precip],axis=1)
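
The long-to-wide reshaping done in the last three cells (one column per variable and hour) can also be written as a single `pivot` over several value columns. The sketch below uses a toy frame with hypothetical `Id` values and only two variables.

```python
import pandas as pd

# Toy long-format data: one row per (Id, hour)
long_df = pd.DataFrame({
    'Id': ['a_1', 'a_1', 'a_2', 'a_2'],
    'hour': [0, 1, 0, 1],
    'ff': [5.0, 5.5, 3.0, 3.2],
    't': [278.0, 278.4, 280.0, 280.1],
})

# Pivot several variables at once; columns become (variable, hour) pairs
wide = long_df.pivot(index='Id', columns='hour', values=['ff', 't'])

# Flatten the two-level columns into names such as 'ff_0', 't_1'
wide.columns = [f'{var}_{hour}' for var, hour in wide.columns]
wide = wide.reset_index()
```

Note that integer hours sort numerically here (`ff_0`, `ff_1`, ...), whereas the string-based columns built above sort lexicographically (`ff_0`, `ff_1`, `ff_10`, ...). The column order has to match the one used when the models were trained.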

In [240]:
X_reshape.reset_index(inplace=True)
X_reshape

Unnamed: 0,Id,ff_0,ff_1,ff_10,ff_11,ff_12,ff_13,ff_14,ff_15,ff_16,...,precip_21,precip_22,precip_23,precip_3,precip_4,precip_5,precip_6,precip_7,precip_8,precip_9
0,14047002_100,5.38,5.38,5.38,5.38,5.38,5.38,5.38,5.380000,5.38,...,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,14047002_223,4.54,4.54,4.54,4.54,4.54,4.54,4.54,4.540000,4.54,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,14047002_236,3.57,3.57,3.57,3.57,3.57,3.57,3.57,3.570000,3.57,...,0.0,0.0,0.6,0.2,0.6,0.0,0.0,0.8,0.0,0.0
3,14047002_256,2.00,2.00,2.00,2.00,2.00,2.00,2.00,2.000000,2.00,...,0.4,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,14047002_277,,,4.90,4.90,4.90,4.90,4.90,4.900000,4.90,...,2.4,2.0,0.6,,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96050,95690001_95,0.13,0.30,2.37,1.82,2.20,2.44,2.32,2.140000,2.21,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96051,95690001_96,3.15,3.92,6.27,5.49,6.19,7.46,6.06,4.710000,5.00,...,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96052,95690001_97,1.41,1.76,6.33,8.23,7.77,6.50,4.28,3.520000,2.42,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
96053,95690001_98,6.90,7.09,8.85,8.58,7.42,7.62,9.56,8.250000,7.16,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0


In [241]:
X = X_reshape

X['day'] = X['Id'].str.rpartition("_")[2]
X['day'] = X['day'].astype('int')
X['day'] = X['day'] % 365

In [242]:
X['number_sta'] = X['Id'].str.rpartition('_')[0]
X['number_sta'] = X['number_sta'].astype('int64')

In [243]:
X['season'] =pd.cut(X['day'],[0,60,151,243,334], labels=['Hiver','Printemps','Ete', 'Automne'],ordered=False)
X['season'] = X['season'].fillna('Hiver')
X['season']

0        Printemps
1              Ete
2              Ete
3          Automne
4          Automne
           ...    
96050    Printemps
96051    Printemps
96052    Printemps
96053    Printemps
96054    Printemps
Name: season, Length: 96055, dtype: category
Categories (4, object): ['Hiver', 'Printemps', 'Ete', 'Automne']
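
The `pd.cut` call above uses right-closed bins, so day 0 and days after 334 fall outside every bin and become `NaN`; the `fillna('Hiver')` step then assigns them to winter. A small check on a few hand-picked days (chosen for illustration) shows the behaviour:

```python
import pandas as pd

days = pd.Series([0, 30, 100, 223, 256, 340])

# Bins are (0, 60], (60, 151], (151, 243], (243, 334]
season = pd.cut(days, [0, 60, 151, 243, 334],
                labels=['Hiver', 'Printemps', 'Ete', 'Automne'], ordered=False)
n_outside = int(season.isna().sum())  # day 0 and day 340 fall outside the bins

# Days outside the bins are treated as winter
season = season.fillna('Hiver')
```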

In [244]:
X['month'] = X['day'] // 30  # rough month index (0 to 12), not an exact calendar month

In [245]:
X.reset_index(inplace=True)
X.drop('index',axis=1,inplace=True)

In [246]:
X['month'] = X['month'].astype('category')
X['season'] = X['season'].astype('category')
X['day'] = X['day'].astype('category')

In [247]:
X =pd.merge(X,coords,how='left',on='number_sta')
X['number_sta'] = X['number_sta'].astype('category')
X.bfill(inplace=True)  # back-fill NaNs; fillna(method=...) is deprecated in recent pandas

In [248]:

X_test = transformer.fit_transform(X)
#predictions_test = random_forest.predict(X_test)
predictions_test = xg.predict(X_test)

In [249]:
predictions_test

array([2, 2, 2, ..., 1, 2, 2])

In [250]:
index_pluie = np.where(predictions_test == 2)[0]  # row positions predicted as class 2 (rain)

In [251]:
X_pluie = X.loc[index_pluie]
X_pluie

Unnamed: 0,Id,ff_0,ff_1,ff_10,ff_11,ff_12,ff_13,ff_14,ff_15,ff_16,...,precip_7,precip_8,precip_9,day,number_sta,season,month,lat,lon,height_sta
0,14047002_100,5.38,5.38,5.38,5.38,5.38,5.38,5.38,5.380000,5.38,...,0.0,0.0,1.0,100,14047002,Printemps,3,49.275,-0.712,60.0
1,14047002_223,4.54,4.54,4.54,4.54,4.54,4.54,4.54,4.540000,4.54,...,0.0,0.0,0.0,223,14047002,Ete,7,49.275,-0.712,60.0
2,14047002_236,3.57,3.57,3.57,3.57,3.57,3.57,3.57,3.570000,3.57,...,0.8,0.0,0.0,236,14047002,Ete,7,49.275,-0.712,60.0
3,14047002_256,2.00,2.00,2.00,2.00,2.00,2.00,2.00,2.000000,2.00,...,0.0,0.0,0.0,256,14047002,Automne,8,49.275,-0.712,60.0
4,14047002_277,3.39,3.39,4.90,4.90,4.90,4.90,4.90,4.900000,4.90,...,0.0,0.0,0.0,277,14047002,Automne,9,49.275,-0.712,60.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96047,95690001_92,1.57,2.33,0.93,1.41,1.46,1.13,2.32,1.090000,2.19,...,0.0,0.0,0.0,92,95690001,Printemps,3,49.108,1.831,126.0
96048,95690001_93,6.55,6.17,7.73,9.19,8.24,7.07,6.76,6.180000,7.52,...,0.0,0.0,0.0,93,95690001,Printemps,3,49.108,1.831,126.0
96051,95690001_96,3.15,3.92,6.27,5.49,6.19,7.46,6.06,4.710000,5.00,...,0.0,0.0,0.0,96,95690001,Printemps,3,49.108,1.831,126.0
96053,95690001_98,6.90,7.09,8.85,8.58,7.42,7.62,9.56,8.250000,7.16,...,0.2,0.0,0.0,98,95690001,Printemps,3,49.108,1.831,126.0


In [252]:
X_pluie_test = transformer.fit_transform(X_pluie)

In [261]:
predictions_test_regressor = regressor.predict(X_pluie_test)

In [253]:
predictions_test_regressor_quant = all_models['q 0.05'].predict(X_pluie_test)

In [254]:
array_test = pd.DataFrame(predictions_test)
array_test[0] = np.where(array_test[0]==1,0,2)
array_test['Id'] = X['Id']
display(array_test)

Unnamed: 0,0,Id
0,2,14047002_100
1,2,14047002_223
2,2,14047002_236
3,2,14047002_256
4,2,14047002_277
...,...,...
96050,0,95690001_95
96051,2,95690001_96
96052,0,95690001_97
96053,2,95690001_98


In [255]:
array_test.columns = ['pred', 'Id']  # set_axis(..., inplace=True) was removed in pandas 2.0

In [256]:
array_test

Unnamed: 0,pred,Id
0,2,14047002_100
1,2,14047002_223
2,2,14047002_236
3,2,14047002_256
4,2,14047002_277
...,...,...
96050,0,95690001_95
96051,2,95690001_96
96052,0,95690001_97
96053,2,95690001_98


In [257]:
array_test.pred.value_counts()

2    61710
0    34345
Name: pred, dtype: int64

In [262]:
tab_reg = {'pred' : predictions_test_regressor ,'Id' : X_pluie['Id']}
tab_reg = pd.DataFrame(tab_reg)
tab_reg

Unnamed: 0,pred,Id
0,5.997,14047002_100
1,2.576,14047002_223
2,4.155,14047002_236
3,3.909,14047002_256
4,6.353,14047002_277
...,...,...
96047,4.183,95690001_92
96048,3.074,95690001_93
96051,4.181,95690001_96
96053,3.247,95690001_98


In [258]:
tab_reg_quant = {'pred_quant' : predictions_test_regressor_quant ,'Id' : X_pluie['Id']}
tab_reg_quant = pd.DataFrame(tab_reg_quant)
tab_reg_quant

Unnamed: 0,pred_quant,Id
0,0.0,14047002_100
1,0.0,14047002_223
2,0.2,14047002_236
3,0.0,14047002_256
4,0.2,14047002_277
...,...,...
96047,0.0,95690001_92
96048,0.0,95690001_93
96051,0.0,95690001_96
96053,0.0,95690001_98


In [263]:
pred = pd.merge(array_test,tab_reg,on ='Id',how='left')
pred

Unnamed: 0,pred_x,Id,pred_y
0,2,14047002_100,5.997
1,2,14047002_223,2.576
2,2,14047002_236,4.155
3,2,14047002_256,3.909
4,2,14047002_277,6.353
...,...,...,...
96050,0,95690001_95,
96051,2,95690001_96,4.181
96052,0,95690001_97,
96053,2,95690001_98,3.247


In [264]:
pred['pred_y'] = pred.pred_y.fillna(0)
pred.drop('pred_x',axis=1,inplace=True)
pred.rename(columns = {'pred_y' : 'pred'},inplace=True)
pred

Unnamed: 0,Id,pred
0,14047002_100,5.997
1,14047002_223,2.576
2,14047002_236,4.155
3,14047002_256,3.909
4,14047002_277,6.353
...,...,...
96050,95690001_95,0.000
96051,95690001_96,4.181
96052,95690001_97,0.000
96053,95690001_98,3.247


In [265]:
pred = pd.merge(pred,tab_reg_quant,on ='Id',how='left')
pred

Unnamed: 0,Id,pred,pred_quant
0,14047002_100,5.997,0.0
1,14047002_223,2.576,0.0
2,14047002_236,4.155,0.2
3,14047002_256,3.909,0.0
4,14047002_277,6.353,0.2
...,...,...,...
96050,95690001_95,0.000,
96051,95690001_96,4.181,0.0
96052,95690001_97,0.000,
96053,95690001_98,3.247,0.0


In [266]:
pred['pred_quant'] = pred.pred_quant.fillna(0)
pred

Unnamed: 0,Id,pred,pred_quant
0,14047002_100,5.997,0.0
1,14047002_223,2.576,0.0
2,14047002_236,4.155,0.2
3,14047002_256,3.909,0.0
4,14047002_277,6.353,0.2
...,...,...,...
96050,95690001_95,0.000,0.0
96051,95690001_96,4.181,0.0
96052,95690001_97,0.000,0.0
96053,95690001_98,3.247,0.0
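
The merge and `fillna(0)` steps above amount to: keep the regressor output on days predicted as rainy and set 0 elsewhere. When the class predictions and the regressor output are still aligned by position, the same result can be obtained with a boolean mask. The values and identifiers below are made up for illustration.

```python
import numpy as np
import pandas as pd

ids = pd.Series(['s1_10', 's1_11', 's1_12'])  # hypothetical Ids
class_pred = np.array([2, 1, 2])              # 2 = rain predicted, 1 = dry
amount_pred = np.array([4.0, 1.5])            # regressor output, predicted-rain rows only

# Start from 0 everywhere, then fill in the rows predicted as rainy
final = pd.DataFrame({'Id': ids, 'Prediction': 0.0})
final.loc[class_pred == 2, 'Prediction'] = amount_pred
```

The merge on `Id` used in the cells above stays the safer choice whenever the row order of the two tables might differ.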


In [267]:
fname = './defi-ia-2022/Test/Test/Baselines/Baseline_observation_test.csv'
Baseline_observation_test = pd.read_csv(fname)
Baseline_observation_test = Baseline_observation_test.sort_values(by=["Id"])
display(Baseline_observation_test)

Unnamed: 0,Id,Prediction
82989,14047002_100,9.9
84063,14047002_223,0.8
83793,14047002_236,7.0
84334,14047002_256,4.0
83254,14047002_281,0.0
...,...,...
21605,95690001_95,0.0
43472,95690001_96,1.8
45972,95690001_97,0.0
3055,95690001_98,8.0


In [268]:
pred = pd.merge(Baseline_observation_test,pred,on ='Id',how='left')
pred

Unnamed: 0,Id,Prediction,pred,pred_quant
0,14047002_100,9.9,5.997,0.0
1,14047002_223,0.8,2.576,0.0
2,14047002_236,7.0,4.155,0.2
3,14047002_256,4.0,3.909,0.0
4,14047002_281,0.0,5.555,0.0
...,...,...,...,...
85135,95690001_95,0.0,0.000,0.0
85136,95690001_96,1.8,4.181,0.0
85137,95690001_97,0.0,0.000,0.0
85138,95690001_98,8.0,3.247,0.0


In [269]:
pred = pred.drop(["Prediction","pred"],axis=1)
pred

Unnamed: 0,Id,pred_quant
0,14047002_100,0.0
1,14047002_223,0.0
2,14047002_236,0.2
3,14047002_256,0.0
4,14047002_281,0.0
...,...,...
85135,95690001_95,0.0
85136,95690001_96,0.0
85137,95690001_97,0.0
85138,95690001_98,0.0


In [270]:
pred.rename(columns = {'pred_quant' : 'Prediction'},inplace=True)
pred['Prediction']=pred['Prediction']+1
pred

Unnamed: 0,Id,Prediction
0,14047002_100,1.0
1,14047002_223,1.0
2,14047002_236,1.2
3,14047002_256,1.0
4,14047002_281,1.0
...,...,...
85135,95690001_95,1.0
85136,95690001_96,1.0
85137,95690001_97,1.0
85138,95690001_98,1.0


In [271]:
# Save the predictions

output_file = "prediction_test.csv"
pred.to_csv('./predictions/' + output_file,index=False)