### Modelling the Défi IA 2021-2022 data

This notebook describes a few modelling elements for the Défi IA 2021-2022 data. It suggests some directions for computing descriptive statistics on these data, along with an approach for predicting accumulated rainfall using linear regression. Python code is also provided. The notebook is also meant to help you start writing a report on your work for this Défi IA.

#### Data available at the measurement stations

For the years 2016 and 2017, training data are available for $N$ weather stations whose spatial coordinates (latitude and longitude) are known.

For each station $1 \leq i \leq N$, the following measurements are available:

**Explanatory variables**: measurements of $p$ variables $X_{ijt} = (X_{ijt}^{(k)})_{1 \leq k \leq p} \in \mathbb{R}^{p}$ for station $i$, day $j$ (an unordered variable, since it is not available in the test set) and hour $t \in \{0,\ldots,23 \}$ (an ordered variable, available in the test set). The measurements are

- 'ff' : *add a description*
- 't' : *add a description*
- 'td' : *add a description*
- 'hu' : *humidity*
- 'dd' : *add a description*
- 'precip' : *rainfall accumulated over one hour, in ml*

A month-of-year variable can also be added, since this information is available in the test set.


**Variable to explain/predict**: the accumulated rainfall $Y_{ij}$ over one day, on day $j+1$ at station $i$, predicted from the data available on day $j$. The training set actually provides the variable $Y_{ijt}$, the rainfall accumulated over one hour on day $j+1$ at station $i$ and hour $t$. It follows directly that (with $T=23$)
$$
Y_{ij} = \sum_{t = 0}^{T} Y_{ijt}
$$
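As a minimal sketch of this identity, the daily total can be obtained by summing the hourly values within each (station, day) group. The table below is invented for illustration; only the column names `number_sta`, `day`, `hour` and `precip` are borrowed from the code later in this notebook.

```python
import pandas as pd

# Toy hourly data: two stations, one day, three hours each (invented values)
hourly = pd.DataFrame({
    "number_sta": [1, 1, 1, 2, 2, 2],
    "day":        [0, 0, 0, 0, 0, 0],
    "hour":       [0, 1, 2, 0, 1, 2],
    "precip":     [0.0, 0.2, 0.4, 1.0, 0.0, 0.5],
})

# Y_ij = sum over t of Y_ijt: sum the hourly values within each (station, day) pair
daily = (
    hourly.groupby(["number_sta", "day"])["precip"]
    .sum()
    .rename("Y")
    .reset_index()
)
print(daily)
```

Note that `sum` skips missing values by default, so a day with missing hours yields a partial total; the `min_count` argument of the grouped `sum` can be used to return a missing value instead.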

**Preliminary work**: carry out a descriptive analysis of these data: boxplots, univariate histograms, PCA to study the correlations between explanatory variables, etc.
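A possible starting point for this descriptive analysis is sketched below on synthetic data. The frame `toy` and its values are invented; only the column names mirror the real ones, and the relationships between columns are built in by hand so that the correlation matrix has something to show.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for the station data (invented values)
t = 280 + 5 * rng.standard_normal(n)
toy = pd.DataFrame({
    "t": t,
    "td": t - np.abs(rng.standard_normal(n)),           # built to track t closely
    "hu": 80 - 2 * (t - 280) + rng.standard_normal(n),  # built to decrease with t
    "precip": rng.exponential(0.2, n) * (rng.random(n) < 0.1),
})

summary = toy.describe()            # count, mean, std and quartiles per column
missing = toy.isna().mean()         # share of missing values per column
corr = toy.corr(method="pearson")   # pairwise linear correlations

print(summary.round(2))
print(corr.round(2))
```

On the real data, the share of missing values per column is worth checking first, since it conditions the imputation choices made later.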

**Possible linear models for predicting accumulated rainfall**:

*Global model, hour by hour*

$$
Y_{ijt} = \theta_{0}^{t} + \sum_{k = 1}^{p} \theta_{k}^{t}X_{ijt}^{(k)} + \varepsilon_{ijt}
$$

with prediction $\hat{Y}_{ij} = \sum_{t = 0}^{T} \hat{Y}_{ijt}$, where $\hat{Y}_{ijt} = \hat{\theta}_{0}^{t} + \sum_{k = 1}^{p} \hat{\theta}_{k}^{t}X_{ijt}^{(k)}$
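The hour-by-hour model can be sketched with ordinary least squares as follows. Everything here is synthetic: the array shapes, the noise level and the names `theta_true` and `theta_hat` are assumptions made for the illustration, not part of the challenge data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, H, p = 500, 24, 3   # n (station, day) pairs, H = T + 1 hours, p variables

X = rng.standard_normal((n, H, p))             # X[i, t, k] plays the role of X_{ijt}^{(k)}
theta_true = rng.standard_normal((H, p + 1))   # one coefficient vector per hour
Y_hourly = theta_true[:, 0] + np.einsum("ntk,tk->nt", X, theta_true[:, 1:])
Y_hourly += 0.01 * rng.standard_normal((n, H)) # small noise term

theta_hat = np.empty((H, p + 1))
for t in range(H):
    # Design matrix for hour t: a column of ones (intercept) followed by the p variables
    A = np.column_stack([np.ones(n), X[:, t, :]])
    theta_hat[t], *_ = np.linalg.lstsq(A, Y_hourly[:, t], rcond=None)

# Hourly predictions, then the daily prediction as their sum over t
Y_hat_hourly = theta_hat[:, 0] + np.einsum("ntk,tk->nt", X, theta_hat[:, 1:])
Y_hat_daily = Y_hat_hourly.sum(axis=1)
```

This formulation requires the hourly target $Y_{ijt}$ and fits $(T+1)(p+1)$ coefficients in total.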

*Per-station model, hour by hour*

$$
Y_{ijt} = \theta_{0,i}^{t} + \sum_{k = 1}^{p} \theta_{k,i}^{t}X_{ijt}^{(k)} + \varepsilon_{ijt}
$$

where the coefficients of the linear model vary with the measurement station.

*Global model with aggregation over time*

$$
Y_{ij} = \theta_{0} + \sum_{t = 0}^{T}  \sum_{k = 1}^{p} \theta_{k}^{t}X_{ijt}^{(k)} + \varepsilon_{ij}
$$

and many other models are possible!
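For the model with aggregation over time, the hourly measurements of one day are laid out side by side as columns and a single regression is fitted on the daily target. The sketch below uses synthetic data; the names `X_flat`, `theta0_hat` and `theta_hat` are chosen for the illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, H, p = 2000, 24, 2   # n (station, day) pairs, H = T + 1 hours, p variables

X = rng.standard_normal((n, H, p))
theta0_true = 1.5
theta_true = rng.standard_normal((H, p))
Y_daily = theta0_true + np.einsum("ntk,tk->n", X, theta_true)
Y_daily += 0.01 * rng.standard_normal(n)

# One row per (station, day): the H * p hourly measurements become H * p columns
X_flat = X.reshape(n, H * p)
A = np.column_stack([np.ones(n), X_flat])
coef, *_ = np.linalg.lstsq(A, Y_daily, rcond=None)

theta0_hat = coef[0]
theta_hat = coef[1:].reshape(H, p)   # back to one coefficient per (hour, variable)
```

Compared with the hour-by-hour model, this one has a single intercept, $1 + (T+1)p$ coefficients in total, and only needs the daily target $Y_{ij}$.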

In [6]:
#from google.colab import drive
#drive.mount('/content/drive')

In [7]:
#import os
#os.chdir('/content/drive/My Drive/Données Massives/')

In [8]:
import matplotlib.pyplot as plt
from IPython.display import display


In [9]:
import pandas as pd
import datetime
import seaborn as sns
import numpy as np

# Suppress warnings caused by deprecated Python syntax
import warnings

def fxn():
    warnings.warn("deprecated", DeprecationWarning)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    fxn()

warnings.filterwarnings("ignore")

In [10]:
# Read the training set

path = 'defi-ia-2022/Train/Train/X_station_train.csv'
first_date = datetime.datetime(2016,1,1)    
last_date = datetime.datetime(2017,12,31)

# Read the ground station data
def read_gs_data(fname):
    gs_data = pd.read_csv(fname,parse_dates=['date'],infer_datetime_format=True)
    gs_data = gs_data.sort_values(by=["number_sta","date"])
    return gs_data

x = read_gs_data(path)
x['number_sta']=x['number_sta'].astype('category')

# Sort by station, then by date
x = x.sort_values(['number_sta','date'])
x

Unnamed: 0,number_sta,date,ff,t,td,hu,dd,precip,Id
0,14066001,2016-01-01 00:00:00,3.05,279.28,277.97,91.4,200.0,0.0,14066001_0_0
1,14066001,2016-01-01 01:00:00,2.57,278.76,277.45,91.4,190.0,0.0,14066001_0_1
2,14066001,2016-01-01 02:00:00,2.26,278.27,277.02,91.7,181.0,0.0,14066001_0_2
3,14066001,2016-01-01 03:00:00,2.62,277.98,276.95,93.0,159.0,0.0,14066001_0_3
4,14066001,2016-01-01 04:00:00,2.99,277.32,276.72,95.9,171.0,0.0,14066001_0_4
...,...,...,...,...,...,...,...,...,...
4409469,95690001,2017-12-30 19:00:00,9.10,286.68,283.44,80.8,239.0,0.0,95690001_729_19
4409470,95690001,2017-12-30 20:00:00,8.58,286.39,283.21,81.1,231.0,0.0,95690001_729_20
4409471,95690001,2017-12-30 21:00:00,8.74,286.28,283.40,82.6,226.0,0.0,95690001_729_21
4409472,95690001,2017-12-30 22:00:00,9.04,286.21,283.29,82.4,224.0,0.0,95690001_729_22


In [11]:
# Add the day and hour variables
Xtrain = x
split_Id = Xtrain['Id'].str.split(pat="_", expand = True)
split_Id = split_Id.rename(columns={0: "number_sta_2", 1: "day", 2: "hour"})
Xtrain['number_sta_2'] = split_Id['number_sta_2']
Xtrain['day'] = split_Id["day"]
Xtrain['hour'] = split_Id["hour"]
Xtrain = Xtrain.drop("number_sta_2",axis=1)
display(Xtrain) 

Unnamed: 0,number_sta,date,ff,t,td,hu,dd,precip,Id,day,hour
0,14066001,2016-01-01 00:00:00,3.05,279.28,277.97,91.4,200.0,0.0,14066001_0_0,0,0
1,14066001,2016-01-01 01:00:00,2.57,278.76,277.45,91.4,190.0,0.0,14066001_0_1,0,1
2,14066001,2016-01-01 02:00:00,2.26,278.27,277.02,91.7,181.0,0.0,14066001_0_2,0,2
3,14066001,2016-01-01 03:00:00,2.62,277.98,276.95,93.0,159.0,0.0,14066001_0_3,0,3
4,14066001,2016-01-01 04:00:00,2.99,277.32,276.72,95.9,171.0,0.0,14066001_0_4,0,4
...,...,...,...,...,...,...,...,...,...,...,...
4409469,95690001,2017-12-30 19:00:00,9.10,286.68,283.44,80.8,239.0,0.0,95690001_729_19,729,19
4409470,95690001,2017-12-30 20:00:00,8.58,286.39,283.21,81.1,231.0,0.0,95690001_729_20,729,20
4409471,95690001,2017-12-30 21:00:00,8.74,286.28,283.40,82.6,226.0,0.0,95690001_729_21,729,21
4409472,95690001,2017-12-30 22:00:00,9.04,286.21,283.29,82.4,224.0,0.0,95690001_729_22,729,22


In [12]:
# Add the month

Xtrain['month'] = pd.DatetimeIndex(Xtrain['date']).month
display(Xtrain) 

Unnamed: 0,number_sta,date,ff,t,td,hu,dd,precip,Id,day,hour,month
0,14066001,2016-01-01 00:00:00,3.05,279.28,277.97,91.4,200.0,0.0,14066001_0_0,0,0,1
1,14066001,2016-01-01 01:00:00,2.57,278.76,277.45,91.4,190.0,0.0,14066001_0_1,0,1,1
2,14066001,2016-01-01 02:00:00,2.26,278.27,277.02,91.7,181.0,0.0,14066001_0_2,0,2,1
3,14066001,2016-01-01 03:00:00,2.62,277.98,276.95,93.0,159.0,0.0,14066001_0_3,0,3,1
4,14066001,2016-01-01 04:00:00,2.99,277.32,276.72,95.9,171.0,0.0,14066001_0_4,0,4,1
...,...,...,...,...,...,...,...,...,...,...,...,...
4409469,95690001,2017-12-30 19:00:00,9.10,286.68,283.44,80.8,239.0,0.0,95690001_729_19,729,19,12
4409470,95690001,2017-12-30 20:00:00,8.58,286.39,283.21,81.1,231.0,0.0,95690001_729_20,729,20,12
4409471,95690001,2017-12-30 21:00:00,8.74,286.28,283.40,82.6,226.0,0.0,95690001_729_21,729,21,12
4409472,95690001,2017-12-30 22:00:00,9.04,286.21,283.29,82.4,224.0,0.0,95690001_729_22,729,22,12


In [13]:
################################################
# Include a descriptive analysis of the data!  #
################################################

Bar plot of precipitation

In [14]:
# bins = [0,1,2,4,5,6,7,8,9,10,20,30]
# data_bar_precip = Xtrain.groupby(pd.cut(Xtrain['precip'], bins=bins)).count()
# data_bar_precip['precip'].plot.bar()

In [15]:
# Xtrain.boxplot()

In [16]:
# corr_Xtrain = Xtrain.corr(method='pearson')
# corr_Xtrain
# plt.matshow(corr_Xtrain)
# plt.show()

Correlation matrix of Xtrain.
We find a negative correlation between temperature and humidity, as one would expect.
We also observe a strong positive correlation between temperature and td.

In [17]:
# sns.heatmap(corr_Xtrain,annot=True)
# plt.show()

In [18]:
# Add two variables
# precip_bool: 1: it rains, 0: it does not rain
# precip_categories: 5 categories 0, 0.2, 0.4, 0.6, 0.8+


#Xtrain_2017['col2'] = pd.cut(df['col1'], bins=[0, 10, 50, float('Inf')], labels=['xxx', 'yyy', 'zzz'])

#Xtrain_2017



In [19]:
# Example implementation of the per-station, hour-by-hour linear model, but with
# coefficients that do not depend on the hour
# Missing values are imputed with the median of each variable

###########################################################################
# Warning: a modelling error has slipped into this code. Can you find it? #
###########################################################################

from sklearn.linear_model import LinearRegression

def regression_bystation(x, num_station):

    X = x[x['number_sta'] == num_station]
    # Select columns with lists: recent pandas versions reject a set as an indexer
    Y = X[["date", "precip"]].set_index('date')
    X = X[["date", "ff", "t", "td", "hu", "dd"]].set_index('date')

    # Impute missing values with the median of each column
    X = X.fillna(X.median())
    Y = Y.fillna(Y.median())

    # The `normalize` argument has been removed from recent scikit-learn releases
    lr = LinearRegression()
    lr.fit(X, Y)

    return lr

Standardized PCA with two principal components


In [20]:
from sklearn.preprocessing import StandardScaler

# Xtrain = Xtrain.fillna(method='ffill')
# features = ['ff','t','td','hu','dd']

# X_acp = Xtrain.loc[:,features].values
# X_acp = StandardScaler().fit_transform(X_acp)


# Y_acp = Xtrain.loc[:,'precip'].values



In [21]:
from sklearn.decomposition import PCA

# pca = PCA(n_components = 5)

# principalComponents = pca.fit_transform(X_acp)

# principalDf = pd.DataFrame(data = principalComponents
#              , columns = ['principal component 1', 'principal component 2','principal component 3','principal component 4','principal component 5'])

# finalDf = pd.concat([principalDf, Xtrain[['precip']]], axis = 1)



We retrieve the eigenvalues of the principal components

In [22]:
# pca.explained_variance_

The first principal component explains 37% of the inertia, while the second explains 25%.

In [23]:
# pca.explained_variance_ratio_

Visualization of the eigenvalues of the principal components.

In [24]:

# sns.set_theme(style='darkgrid')
# graph_variance = sns.lineplot(x=['1','2','3','4','5'],y = pca.explained_variance_)
# graph_variance.axhline(1,color= 'red')

Principal components 1 and 2 have eigenvalues greater than 1.
The third principal component has an eigenvalue very close to 1, so we also keep it for the rest of the analysis, applying the Kaiser rule with some tolerance.
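The link between eigenvalues, share of inertia and the Kaiser rule can be sketched with NumPy alone. The data below are synthetic (three variables sharing a common factor plus two independent ones), so the numbers will not match those of the notebook; the eigenvalues of the correlation matrix play the role of `pca.explained_variance_` for standardized data, up to the variance normalization convention.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Five synthetic variables: the first three share a common factor,
# the last two are independent noise
f = rng.standard_normal(n)
data = np.column_stack([
    f + 0.3 * rng.standard_normal(n),
    f + 0.3 * rng.standard_normal(n),
    f + 0.3 * rng.standard_normal(n),
    rng.standard_normal(n),
    rng.standard_normal(n),
])

# Standardized PCA amounts to diagonalizing the correlation matrix
corr = np.corrcoef(data, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]        # sorted in decreasing order
explained_ratio = eigenvalues / eigenvalues.sum()   # share of inertia per component

kept = int((eigenvalues > 1).sum())                 # Kaiser rule: eigenvalues above 1
print(eigenvalues.round(3), explained_ratio.round(3), kept)
```

Because the eigenvalues of a correlation matrix sum to the number of variables, an eigenvalue of 1 corresponds to a component that carries as much inertia as a single standardized variable. Components sitting right at that threshold, like the independent ones here, are exactly the borderline cases.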


In [25]:
# display(finalDf)

The variable `class` takes the value 1 when it does not rain (`precip == 0`) and the value 2 when it rains.


In [26]:
# finalDf['class'] = np.where(finalDf['precip'] == 0,1,2)
# display(finalDf)



Scatter plot of the first two principal components

In [27]:
# sns.scatterplot(data = finalDf,x= 'principal component 1', y = 'principal component 2', hue='class')

In 3 dimensions the plot becomes unreadable.

In [28]:
# fig = plt.figure()
# fig.set_size_inches(15, 10.5)
# ax = fig.add_subplot(111, projection='3d')
# x = np.array(finalDf['principal component 1'])
# y = np.array(finalDf['principal component 2'])
# z = np.array(finalDf['principal component 3'])
# ax.scatter(x,y,z, marker="s", c=finalDf["class"], s=20, cmap="RdBu")
# ax.set_xlabel('Principal Component 1')
# ax.set_ylabel('Principal Component 2')
# ax.set_zlabel('Principal Component 3')
# ax.legend()
# plt.show()

We will try to add an explanatory variable that correlates well with the variable precip. To do so, we use the k-NN algorithm, which tells us whether it rains or not.
The new variable knn will take as its value the class predicted by the k-NN algorithm (either it rains or it does not).


In [29]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
# knn = KNeighborsClassifier(n_neighbors=10)



In [30]:
Xtrain

Unnamed: 0,number_sta,date,ff,t,td,hu,dd,precip,Id,day,hour,month
0,14066001,2016-01-01 00:00:00,3.05,279.28,277.97,91.4,200.0,0.0,14066001_0_0,0,0,1
1,14066001,2016-01-01 01:00:00,2.57,278.76,277.45,91.4,190.0,0.0,14066001_0_1,0,1,1
2,14066001,2016-01-01 02:00:00,2.26,278.27,277.02,91.7,181.0,0.0,14066001_0_2,0,2,1
3,14066001,2016-01-01 03:00:00,2.62,277.98,276.95,93.0,159.0,0.0,14066001_0_3,0,3,1
4,14066001,2016-01-01 04:00:00,2.99,277.32,276.72,95.9,171.0,0.0,14066001_0_4,0,4,1
...,...,...,...,...,...,...,...,...,...,...,...,...
4409469,95690001,2017-12-30 19:00:00,9.10,286.68,283.44,80.8,239.0,0.0,95690001_729_19,729,19,12
4409470,95690001,2017-12-30 20:00:00,8.58,286.39,283.21,81.1,231.0,0.0,95690001_729_20,729,20,12
4409471,95690001,2017-12-30 21:00:00,8.74,286.28,283.40,82.6,226.0,0.0,95690001_729_21,729,21,12
4409472,95690001,2017-12-30 22:00:00,9.04,286.21,283.29,82.4,224.0,0.0,95690001_729_22,729,22,12


In [31]:
Xtrain = Xtrain.ffill()
Xtrain['class'] = np.where(Xtrain['precip'] == 0,1,2)


X = Xtrain.drop(['date','class','Id'],axis=1)
Y = Xtrain['class']
X['day'] = X['day'].astype('category')
X['hour'] = X['hour'].astype('category')
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=.3, random_state=4,stratify=Y)

In [32]:
# #Xtrain['year'] = Xtrain[]
# Xtrain = Xtrain.fillna(method='ffill')
# Xtrain_gp = Xtrain.groupby(by=['number_sta','day'])



We standardize the data

In [33]:
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()

X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)




In [34]:
#knn.fit(X_train,y_train)
#predictions = knn.predict(X_test)
#accuracy_score(y_test,predictions)

In [35]:
# knn.fit(X_train,y_train)

In [36]:

accuracy_score(y_test,np.repeat(1,y_test.shape[0]))


0.9045026507302831

By predicting only class 1 (i.e. hours without rain, since class 1 corresponds to `precip == 0`), we obtain an accuracy of 90%
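This baseline follows directly from the class proportions. The short sketch below recomputes it from the class counts that `np.unique(y_test, return_counts=True)` reports for this split in the notebook; the variable names are chosen for the illustration.

```python
# Class counts of the test split as reported in this notebook: 1 = no rain, 2 = rain
counts = {1: 1196515, 2: 126328}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a constant classifier that always predicts the majority class
baseline_accuracy = counts[majority_class] / total
print(majority_class, round(baseline_accuracy, 4))
```

Any classifier therefore has to be judged against roughly 90%, not against 50%. With such an imbalance, accuracy alone says little about how well rainy hours are detected.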

In [37]:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier()
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=.3, random_state=4,stratify=Y)
ss = StandardScaler()

X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

random_forest.fit(X_train,y_train)


RandomForestClassifier()

In [38]:
predictions = random_forest.predict(X_test)
accuracy_score(y_test,predictions)

1.0

In [39]:
np.unique(predictions,return_counts=True)


(array([1, 2]), array([1196515,  126328], dtype=int64))

In [40]:
np.unique(y_test,return_counts=True)



(array([1, 2]), array([1196515,  126328], dtype=int64))
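A perfect score deserves scrutiny. Here the feature matrix `X` still contains the column `precip`, and `class` is defined directly from `precip == 0`, so the forest only has to learn a threshold on that column. The sketch below reproduces the situation on invented data and shows one possible remedy, dropping the column; the frame `toy` and the names `leak` and `safe_features` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "hu": rng.uniform(40, 100, n),
    "precip": np.where(rng.random(n) < 0.1, 0.2, 0.0),
})
toy["class"] = np.where(toy["precip"] == 0, 1, 2)   # same labelling rule as in the notebook

features = toy.drop(columns=["class"])

# The label can be rebuilt exactly from the `precip` column alone ...
rebuilt = np.where(features["precip"] == 0, 1, 2)
leak = bool((rebuilt == toy["class"]).all())

# ... so one option is to remove `precip` before fitting a classifier
safe_features = features.drop(columns=["precip"])
print(leak, safe_features.columns.tolist())
```

Once the leaking column is removed, the accuracy is expected to fall back toward the majority-class baseline unless the remaining variables carry real signal.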

In [41]:
# predictions = knn.predict(X_test)
# accuracy_score(y_test,predictions)

We obtain an accuracy of 91% with k-NN (k=10)


We will now try to improve this score by using cross-validation with k-NN over a grid of values of k, in order to select the k that maximizes the accuracy of the model.
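One way to set up this search is sketched below on a small synthetic data set. Wrapping the scaler and the classifier in a scikit-learn `Pipeline` is a choice made for this illustration, not something taken from the notebook; the data and the grid of odd values of k are likewise assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the station measurements
X_toy, y_toy = make_classification(
    n_samples=600, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)

# The scaler is refit on each training fold, so no information from the
# validation fold leaks into the standardization
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {"knn__n_neighbors": np.arange(1, 25, 2)}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```

On the full training set (several million rows), k-NN prediction is expensive, so running such a search on a subsample first may be more practical.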


In [42]:
# from sklearn.model_selection import GridSearchCV 
# knn2 = KNeighborsClassifier() 
# param_grid = {"n_neighbors" : np.arange(1, 25)}
# knn_cv = GridSearchCV(knn2, param_grid, cv=3)
# knn_cv.fit(X, Y)

In [43]:
# from sklearn.model_selection import GridSearchCV 
# knn2 = KNeighborsClassifier() 
# param_grid = {"n_neighbors" : np.arange(1, 25)}

In [44]:
X = Xtrain
# Select columns with a list: recent pandas versions reject a set as an indexer
Y = X[["date", "precip"]].set_index('date')

# Imputation of missing values
#X = X[["date","ff","t","td","hu","dd"]]
#X.set_index('date',inplace = True)

In [45]:
Xtrain

Unnamed: 0,number_sta,date,ff,t,td,hu,dd,precip,Id,day,hour,month,class
0,14066001,2016-01-01 00:00:00,3.05,279.28,277.97,91.4,200.0,0.0,14066001_0_0,0,0,1,1
1,14066001,2016-01-01 01:00:00,2.57,278.76,277.45,91.4,190.0,0.0,14066001_0_1,0,1,1,1
2,14066001,2016-01-01 02:00:00,2.26,278.27,277.02,91.7,181.0,0.0,14066001_0_2,0,2,1,1
3,14066001,2016-01-01 03:00:00,2.62,277.98,276.95,93.0,159.0,0.0,14066001_0_3,0,3,1,1
4,14066001,2016-01-01 04:00:00,2.99,277.32,276.72,95.9,171.0,0.0,14066001_0_4,0,4,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4409469,95690001,2017-12-30 19:00:00,9.10,286.68,283.44,80.8,239.0,0.0,95690001_729_19,729,19,12,1
4409470,95690001,2017-12-30 20:00:00,8.58,286.39,283.21,81.1,231.0,0.0,95690001_729_20,729,20,12,1
4409471,95690001,2017-12-30 21:00:00,8.74,286.28,283.40,82.6,226.0,0.0,95690001_729_21,729,21,12,1
4409472,95690001,2017-12-30 22:00:00,9.04,286.21,283.29,82.4,224.0,0.0,95690001_729_22,729,22,12,1


In [46]:
# Reshape the data for the model with aggregation over time
X["number_sta_day"] = Xtrain['number_sta'].astype(str) + '_' + Xtrain['day'].astype(str)
X

Unnamed: 0,number_sta,date,ff,t,td,hu,dd,precip,Id,day,hour,month,class,number_sta_day
0,14066001,2016-01-01 00:00:00,3.05,279.28,277.97,91.4,200.0,0.0,14066001_0_0,0,0,1,1,14066001_0
1,14066001,2016-01-01 01:00:00,2.57,278.76,277.45,91.4,190.0,0.0,14066001_0_1,0,1,1,1,14066001_0
2,14066001,2016-01-01 02:00:00,2.26,278.27,277.02,91.7,181.0,0.0,14066001_0_2,0,2,1,1,14066001_0
3,14066001,2016-01-01 03:00:00,2.62,277.98,276.95,93.0,159.0,0.0,14066001_0_3,0,3,1,1,14066001_0
4,14066001,2016-01-01 04:00:00,2.99,277.32,276.72,95.9,171.0,0.0,14066001_0_4,0,4,1,1,14066001_0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4409469,95690001,2017-12-30 19:00:00,9.10,286.68,283.44,80.8,239.0,0.0,95690001_729_19,729,19,12,1,95690001_729
4409470,95690001,2017-12-30 20:00:00,8.58,286.39,283.21,81.1,231.0,0.0,95690001_729_20,729,20,12,1,95690001_729
4409471,95690001,2017-12-30 21:00:00,8.74,286.28,283.40,82.6,226.0,0.0,95690001_729_21,729,21,12,1,95690001_729
4409472,95690001,2017-12-30 22:00:00,9.04,286.21,283.29,82.4,224.0,0.0,95690001_729_22,729,22,12,1,95690001_729


In [47]:
X.drop(['class','Id','number_sta'],axis=1,inplace= True)

In [48]:
X['ff_idx'] = 'ff_' + X["hour"].astype(str)
X['t_idx'] = 't_' + X["hour"].astype(str)
X['td_idx'] = 'td_' + X["hour"].astype(str)
X['hu_idx'] = 'hu_' + X["hour"].astype(str)
X['dd_idx'] = 'dd_' + X["hour"].astype(str)
X['precip_idx'] = 'precip_' + X["hour"].astype(str)

In [49]:
ff = X.pivot(index='number_sta_day',columns='ff_idx',values='ff')
t = X.pivot(index='number_sta_day',columns='t_idx',values='t')
td = X.pivot(index='number_sta_day',columns='td_idx',values='td')
hu = X.pivot(index='number_sta_day',columns='hu_idx',values='hu')
dd = X.pivot(index='number_sta_day',columns='dd_idx',values='dd')
precip = X.pivot(index='number_sta_day',columns='precip_idx',values='precip')

In [50]:
X_reshape = pd.concat([ff ,t,td,hu,dd,precip],axis=1)

In [51]:
X_reshape.reset_index(inplace=True)
X_reshape

Unnamed: 0,number_sta_day,ff_0,ff_1,ff_10,ff_11,ff_12,ff_13,ff_14,ff_15,ff_16,...,precip_21,precip_22,precip_23,precip_3,precip_4,precip_5,precip_6,precip_7,precip_8,precip_9
0,14066001_0,3.05,2.57,3.38,3.20,3.85,5.19,6.04,4.43,5.10,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,14066001_1,4.73,4.22,9.45,11.51,11.52,10.79,10.53,9.66,10.26,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,14066001_10,3.68,4.70,7.78,8.08,6.89,5.54,7.01,7.10,6.41,...,0.2,0.0,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,14066001_100,2.09,2.75,6.80,6.40,6.73,5.62,4.92,5.49,5.41,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,14066001_101,1.23,1.35,2.08,2.33,2.64,2.48,2.83,1.69,1.96,...,0.0,0.0,0.0,0.2,0.4,2.2,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
183742,95690001_725,8.48,9.21,7.40,7.22,6.82,6.88,7.09,5.77,6.43,...,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0
183743,95690001_726,7.76,9.47,7.49,10.45,9.44,12.49,12.29,12.05,11.24,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.4,0.8
183744,95690001_727,6.04,5.15,2.90,2.43,3.93,4.48,4.48,4.06,3.16,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
183745,95690001_728,1.78,1.26,9.64,10.86,9.73,9.10,10.45,8.57,9.81,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
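In the reshaped table, the columns are ordered as strings (`ff_0, ff_1, ff_10, ...`) because the hour was kept as text. This has no effect on a linear regression, but it makes the table harder to read. A possible variant, sketched on invented data, converts the hour to an integer and pivots several variables in one call; the frame `long` and its values are hypothetical.

```python
import pandas as pd

# Toy long-format data: one (station, day) pair observed at hours 0, 10 and 2
long = pd.DataFrame({
    "number_sta_day": ["A_0"] * 3,
    "hour": ["0", "10", "2"],      # hours stored as strings, as after splitting `Id`
    "ff": [3.0, 5.0, 4.0],
    "t": [279.0, 281.0, 280.0],
})
long["hour"] = long["hour"].astype(int)   # numeric hours sort as 0, 2, 10

# Pivot all variables at once: columns become a (variable, hour) MultiIndex
wide = long.pivot(index="number_sta_day", columns="hour", values=["ff", "t"])

# Flatten ("ff", 10) into "ff_10"
wide.columns = [f"{var}_{hour}" for var, hour in wide.columns]
wide = wide.reset_index()
print(wide.columns.tolist())
```

The same idea applies to the six measured variables of the notebook, replacing the six separate `pivot` calls and the `concat`.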


In [52]:
# Accumulated rainfall data
fname = 'defi-ia-2022/Train/Train/Y_train.csv'
param = 'Ground_truth'  #weather parameter name in the file ('Ground_truth' about Y and 'Prediction' about baseline)
data = pd.read_csv(fname, parse_dates=['date'], infer_datetime_format=True)
data['number_sta'] = data['number_sta'].astype('category')
display(data)

Unnamed: 0,date,number_sta,Ground_truth,Id
0,2016-01-02,14066001,3.4,14066001_0
1,2016-01-02,14126001,0.5,14126001_0
2,2016-01-02,14137001,3.4,14137001_0
3,2016-01-02,14216001,4.0,14216001_0
4,2016-01-02,14296001,13.3,14296001_0
...,...,...,...,...
183742,2017-12-31,86137003,5.0,86137003_729
183743,2017-12-31,86165005,3.2,86165005_729
183744,2017-12-31,86272002,1.8,86272002_729
183745,2017-12-31,91200002,1.6,91200002_729


In [89]:
Y = data[["Ground_truth","Id"]].rename(columns = {"Ground_truth":"Y","Id":"number_sta_day"})
Y = Y.dropna()
display(Y)

Unnamed: 0,Y,number_sta_day
0,3.4,14066001_0
1,0.5,14126001_0
2,3.4,14137001_0
3,4.0,14216001_0
4,13.3,14296001_0
...,...,...
183742,5.0,86137003_729
183743,3.2,86165005_729
183744,1.8,86272002_729
183745,1.6,91200002_729


In [97]:
all_data = pd.merge(X_reshape,Y)

display(all_data)
all_data.dropna(inplace=True)

Unnamed: 0,number_sta_day,ff_0,ff_1,ff_10,ff_11,ff_12,ff_13,ff_14,ff_15,ff_16,...,precip_22,precip_23,precip_3,precip_4,precip_5,precip_6,precip_7,precip_8,precip_9,Y
0,14066001_0,3.05,2.57,3.38,3.20,3.85,5.19,6.04,4.43,5.10,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.4
1,14066001_1,4.73,4.22,9.45,11.51,11.52,10.79,10.53,9.66,10.26,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.7
2,14066001_10,3.68,4.70,7.78,8.08,6.89,5.54,7.01,7.10,6.41,...,0.0,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,14066001_100,2.09,2.75,6.80,6.40,6.73,5.62,4.92,5.49,5.41,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.6
4,14066001_101,1.23,1.35,2.08,2.33,2.64,2.48,2.83,1.69,1.96,...,0.0,0.0,0.2,0.4,2.2,0.0,0.0,0.0,0.0,3.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162102,95690001_725,8.48,9.21,7.40,7.22,6.82,6.88,7.09,5.77,6.43,...,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,3.2
162103,95690001_726,7.76,9.47,7.49,10.45,9.44,12.49,12.29,12.05,11.24,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.4,0.8,0.0
162104,95690001_727,6.04,5.15,2.90,2.43,3.93,4.48,4.48,4.06,3.16,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.4
162105,95690001_728,1.78,1.26,9.64,10.86,9.73,9.10,10.45,8.57,9.81,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.4
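Called without further arguments, `pd.merge` performs an inner join on the columns the two frames have in common, here `number_sta_day`, so (station, day) pairs missing on either side disappear silently. The sketch below, on invented frames, names the key explicitly and uses the `indicator` option to count what is lost; `features`, `target` and `audit` are hypothetical names.

```python
import pandas as pd

features = pd.DataFrame({"number_sta_day": ["A_0", "A_1", "B_0"], "ff_0": [3.0, 4.0, 5.0]})
target = pd.DataFrame({"number_sta_day": ["A_0", "B_0", "C_0"], "Y": [3.4, 0.5, 1.2]})

# An outer join with indicator=True labels each row 'both', 'left_only' or 'right_only'
audit = pd.merge(features, target, on="number_sta_day", how="outer", indicator=True)
print(audit["_merge"].value_counts())

# The inner join keeps only the keys present on both sides;
# validate raises an error if a key is duplicated in either frame
all_toy = pd.merge(features, target, on="number_sta_day", how="inner", validate="one_to_one")
print(all_toy)
```

Checking these counts before calling `dropna` helps separate rows lost in the join from rows dropped because of missing measurements.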


In [98]:

Y = all_data['Y']
X = all_data.drop("Y",axis=1)
X

Unnamed: 0,number_sta_day,ff_0,ff_1,ff_10,ff_11,ff_12,ff_13,ff_14,ff_15,ff_16,...,precip_21,precip_22,precip_23,precip_3,precip_4,precip_5,precip_6,precip_7,precip_8,precip_9
0,14066001_0,3.05,2.57,3.38,3.20,3.85,5.19,6.04,4.43,5.10,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,14066001_1,4.73,4.22,9.45,11.51,11.52,10.79,10.53,9.66,10.26,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,14066001_10,3.68,4.70,7.78,8.08,6.89,5.54,7.01,7.10,6.41,...,0.2,0.0,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,14066001_100,2.09,2.75,6.80,6.40,6.73,5.62,4.92,5.49,5.41,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,14066001_101,1.23,1.35,2.08,2.33,2.64,2.48,2.83,1.69,1.96,...,0.0,0.0,0.0,0.2,0.4,2.2,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162102,95690001_725,8.48,9.21,7.40,7.22,6.82,6.88,7.09,5.77,6.43,...,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0
162103,95690001_726,7.76,9.47,7.49,10.45,9.44,12.49,12.29,12.05,11.24,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.4,0.8
162104,95690001_727,6.04,5.15,2.90,2.43,3.93,4.48,4.48,4.06,3.16,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
162105,95690001_728,1.78,1.26,9.64,10.86,9.73,9.10,10.45,8.57,9.81,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [130]:
X = X.reset_index(drop=True)

In [134]:
# drop=True discards the old index, so the cell can be re-run without leaving extra columns
Y = pd.DataFrame(Y).reset_index(drop=True)

In [135]:
Y

Unnamed: 0,Y
0,3.4
1,11.7
2,1.0
3,5.6
4,3.2
...,...
162085,3.2
162086,0.0
162087,4.4
162088,5.4


In [126]:
#X['class'] = np.where(Y['Y'] == 0,1,2)
#Y = X.loc[:,X.columns.str.startswith('precip')].sum(axis=1)
Class =np.where(Y['Y'] == 0,1,2)
Class = pd.DataFrame(Class)
Class.reset_index(inplace=True)
Class.drop('index',inplace=True,axis=1)

In [136]:
Class

Unnamed: 0,0
0,2
1,2
2,2
3,2
4,2
...,...
162085,2
162086,1
162087,2
162088,2


In [141]:
from sklearn.model_selection import ShuffleSplit

split = ShuffleSplit(n_splits=1, test_size=0.3,random_state = 4)
split.get_n_splits(X, Class)
train_index, test_index = next(split.split(X, Class)) 

X_train,X_test = X.loc[train_index], X.loc[test_index]
Y_train, Y_test = Class.loc[train_index], Class.loc[test_index]


In [142]:
random_forest = RandomForestClassifier()
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

random_forest.fit(X_train,Y_train)

RandomForestClassifier()

In [143]:
predictions = random_forest.predict(X_test)
accuracy_score(Y_test,predictions)

0.7416044584284451

In [None]:
X.columns

Let us use a Support Vector Machine (SVM) to classify the days on which it rains

In [144]:
from sklearn import svm
svm_model = svm.SVC(kernel = 'linear')
svm_model.fit(X_train,Y_train)
predictions = svm_model.predict(X_test)
accuracy_score(Y_test,predictions)

In [63]:
# predictions

In [64]:
X_test = pd.DataFrame(X_test)
X_test['index_original'] = test_index
X_test

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,136,137,138,139,140,141,142,143,144,index_original
0,-1.377722,0.996327,0.867585,0.840192,0.478546,0.528861,0.487825,0.847564,0.884470,0.733857,...,0.318967,-0.175510,-0.195362,-0.194716,0.738791,0.300360,1.369900,0.33301,-0.192761,145078
1,-0.295639,-1.397036,-1.458617,-1.359627,-1.329646,-1.097133,-1.352016,-1.343079,-1.471393,-1.612993,...,0.318967,0.773628,-0.195362,-0.194716,-0.198377,-0.199476,-0.196434,-0.18847,-0.192761,56068
2,-0.254950,0.690727,0.692365,0.553114,0.522566,0.501533,0.494703,0.499679,0.522030,0.559270,...,-0.172263,-0.175510,0.294996,-0.194716,-0.198377,-0.199476,-0.196434,-0.18847,-0.192761,64733
3,-0.566146,0.345792,0.347966,0.172569,0.136547,0.112115,0.102662,0.107017,0.132235,0.176522,...,-0.172263,-0.175510,-0.195362,-0.194716,-0.198377,-0.199476,-0.196434,-0.18847,-0.192761,29813
4,1.167803,-0.728347,-0.715440,-0.194624,-0.669351,-0.581324,-0.399426,-0.313201,0.043335,0.226884,...,-0.172263,-0.175510,-0.195362,-0.194716,-0.198377,-0.199476,-0.196434,-0.18847,-0.192761,140740
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55114,1.681968,-1.266929,-1.059839,-1.029153,-1.079073,-1.226939,-0.984048,-0.953861,-0.879863,-1.196670,...,-0.172263,-0.175510,-0.195362,-0.194716,-0.198377,-0.199476,-0.196434,-0.18847,-0.192761,175467
55115,-0.054505,-1.072273,-1.067895,-1.496490,-1.431230,-1.247434,-1.008121,-0.754086,-0.770447,-0.760203,...,-0.172263,-0.175510,-0.195362,-0.194716,-0.198377,-0.199476,0.325677,-0.18847,-0.192761,87049
55116,0.530431,-0.183713,0.070030,-0.952377,-0.781094,-1.124460,-1.255726,-1.157081,-0.664450,-0.582259,...,-0.172263,-0.175510,-0.195362,-0.194716,-0.198377,-0.199476,-0.196434,-0.18847,-0.192761,116535
55117,0.064209,3.749754,3.574438,0.409575,0.038349,-0.014276,0.958962,2.373437,2.064111,1.754518,...,-0.172263,-0.175510,-0.195362,-0.194716,-0.198377,-0.199476,-0.196434,-0.18847,-0.192761,93661


In [65]:
index_pluie =np.where(predictions==2)

In [66]:
X_pluie = X_test.loc[index_pluie]
X_pluie

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,136,137,138,139,140,141,142,143,144,index_original
0,-1.377722,0.996327,0.867585,0.840192,0.478546,0.528861,0.487825,0.847564,0.884470,0.733857,...,0.318967,-0.175510,-0.195362,-0.194716,0.738791,0.300360,1.369900,0.33301,-0.192761,145078
1,-0.295639,-1.397036,-1.458617,-1.359627,-1.329646,-1.097133,-1.352016,-1.343079,-1.471393,-1.612993,...,0.318967,0.773628,-0.195362,-0.194716,-0.198377,-0.199476,-0.196434,-0.18847,-0.192761,56068
2,-0.254950,0.690727,0.692365,0.553114,0.522566,0.501533,0.494703,0.499679,0.522030,0.559270,...,-0.172263,-0.175510,0.294996,-0.194716,-0.198377,-0.199476,-0.196434,-0.18847,-0.192761,64733
6,-1.504073,0.012961,0.338903,0.419589,0.661397,0.969519,0.852354,0.620233,0.145912,0.136233,...,-0.172263,-0.175510,-0.195362,-0.194716,-0.198377,-0.199476,-0.196434,-0.18847,-0.192761,89738
7,1.714416,0.615083,0.616839,0.469661,0.437913,0.416134,0.408729,0.413569,0.436549,0.475334,...,-0.172263,-0.175510,-0.195362,-0.194716,-0.198377,-0.199476,-0.196434,0.33301,-0.192761,178688
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55107,1.680982,1.023559,1.127395,0.726696,0.366804,-0.075763,-0.258428,-0.137536,-0.175497,-0.045069,...,0.810197,2.197335,4.463039,8.575080,0.738791,0.300360,-0.196434,-0.18847,-0.192761,174126
55108,-1.435793,1.099202,1.100205,1.003760,0.979693,0.962687,0.958962,0.964674,0.983628,1.012524,...,-0.172263,-0.175510,0.294996,-0.194716,-0.198377,-0.199476,-0.196434,-0.18847,-0.192761,131976
55109,0.733873,-1.230620,-1.241102,-0.768780,-0.791252,-0.622315,0.002932,-0.499199,-1.074760,-1.623065,...,0.318967,0.299059,-0.195362,-0.194716,-0.198377,-0.199476,-0.196434,-0.18847,-0.192761,129026
55111,-0.558891,-1.505963,-1.500911,-1.870359,-1.935764,-1.978448,-2.001979,-2.000961,-1.960346,-1.878230,...,0.318967,-0.175510,-0.195362,-0.194716,0.270207,-0.199476,-0.196434,-0.18847,-0.192761,36175


In [67]:
X_pluie.shape

(30112, 146)

In [68]:
Y_test = pd.DataFrame(Y_test)
Y_test

Unnamed: 0,0
0,2
1,2
2,2
3,1
4,1
...,...
55114,1
55115,2
55116,1
55117,1


In [69]:
Y_pluie = Y.loc[X_pluie['index_original']]
pd.DataFrame(Y_pluie)



Unnamed: 0,0
145078,2.6
56068,1.2
64733,0.2
89738,0.2
178688,0.2
...,...
174126,10.5
131976,1.2
129026,10.4
36175,1.0


In [70]:
pd.DataFrame(Y_pluie)
Y_pluie = Y_pluie.reset_index()


In [73]:
Y_pluie.drop('index',axis=1,inplace=True)

In [75]:
Y_pluie

Unnamed: 0,0
0,2.6
1,1.2
2,0.2
3,0.2
4,0.2
...,...
30107,10.5
30108,1.2
30109,10.4
30110,1.0


In [76]:
sum(Y_pluie[0]==0)

0

After predicting the days on which it rains, we use a predictive model to estimate the accumulated rainfall per day
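The two-stage idea (first decide whether it rains, then estimate how much) can be sketched as follows. The data are synthetic and noise-free, and the first stage is replaced by the true rain indicator to keep the example short; in practice `predicted_rain` would come from a classifier such as the ones fitted in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
X_toy = rng.standard_normal((n, 3))
rain = X_toy[:, 0] > 0.5                          # toy rule deciding whether it rains
amount = np.where(rain, 5.0 + X_toy[:, 1], 0.0)   # rainfall is zero on dry days

# Stage 1 (stand-in): output of a rain / no-rain classifier
predicted_rain = rain.copy()

# Stage 2: least-squares regression fitted on the rows predicted as rainy only
A = np.column_stack([np.ones(n), X_toy])
coef, *_ = np.linalg.lstsq(A[predicted_rain], amount[predicted_rain], rcond=None)

# Combine: zero where no rain is predicted, regression output (clipped at 0) elsewhere
y_hat = np.where(predicted_rain, np.clip(A @ coef, 0.0, None), 0.0)
```

Clipping at zero keeps the second stage from returning negative rainfall, which a plain linear regression does not prevent.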

In [77]:
index_original = X_pluie['index_original']
X_pluie.drop('index_original',axis=1,inplace=True)

In [78]:
X_pluie.reset_index(inplace=True)

In [79]:
X_pluie.drop('index',axis=1,inplace=True)

In [80]:
Y_pluie

Unnamed: 0,0
0,2.6
1,1.2
2,0.2
3,0.2
4,0.2
...,...
30107,10.5
30108,1.2
30109,10.4
30110,1.0


In [81]:
split = ShuffleSplit(n_splits=1, test_size=0.3,random_state = 4)
split.get_n_splits(X_pluie, Y_pluie)
train_index, test_index = next(split.split(X_pluie, Y_pluie)) 

X_train,X_test = X_pluie.loc[train_index], X_pluie.loc[test_index]
Y_train, Y_test = Y_pluie.loc[train_index], Y_pluie.loc[test_index]


Let us use a linear regression


In [82]:
from sklearn import linear_model, metrics

reg = linear_model.LinearRegression()

reg.fit(X_train,Y_train)



LinearRegression()

In [83]:
predictions = reg.predict(X_test)
predictions

array([[0.2],
       [5.4],
       [0.2],
       ...,
       [5.2],
       [0.2],
       [0.2]])

In [84]:
np.mean(np.abs(predictions-Y_test)/(Y_test+1))*100

0    1.860569e-13
dtype: float64
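The expression evaluated in the cell above is a mean absolute percentage error with 1 added to the denominator, which keeps it defined on dry days. A small helper, sketched here under the hypothetical name `mape_plus_one`, makes the computation reusable and flattens its inputs first, since subtracting an array of shape `(n,)` from one of shape `(n, 1)` would otherwise broadcast to an `n × n` matrix.

```python
import numpy as np

def mape_plus_one(y_true, y_pred):
    """Mean absolute percentage error with 1 added to the denominator, in percent."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must contain the same number of values")
    return float(np.mean(np.abs(y_pred - y_true) / (y_true + 1.0)) * 100.0)

# Column-shaped predictions, as returned by a regression fitted on a one-column target
score = mape_plus_one([0.0, 1.0, 3.0], [[0.5], [1.0], [1.0]])
print(score)
```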

In [85]:
metrics.mean_absolute_error(predictions,Y_test)

5.111849496817511e-15