<div style="display: flex; background-color: RGB(255,114,0);" >

# CLIMATE PROJECT: Regional Climate Forecast 2022 by CNRS
</div>

<div style="display: flex; background-color: Blue; padding: 15px;" >

## 1. Mission
</div>

https://challengedata.ens.fr/challenges/80

<div style="display: flex; background-color: Green; padding: 7px;" >

### 1.1. Challenge goals
</div>

How accurately can we predict regional temperature anomalies based on past and neighbouring climate observations?

<div style="display: flex; background-color: Green; padding: 7px;" >

### 1.2. Challenge context
</div>

The prediction of temperature anomalies on interannual timescales (1 or 5 years) is one of the most challenging topics in climate science. Recently, a study has shown that local anomalies in the Earth's temperature are predictable to some extent 1 to 5 years into the future, even while temperatures are average at the global scale. The aim of this study is to demonstrate this by predicting temperature anomalies at a regional scale.

In climate science, because of the inherent chaotic nature of the studied system, predictions have to be made in a probabilistic framework. Indeed, for risk assessment and early mitigation, the likelihood of extreme events is potentially as important as the knowledge of the expected event. Hence, a skillful prediction has to be accurate (minimal prediction error) and reliable (good sampling of prediction spread).

Another challenge of climate science is the scarcity of data available to train a prediction system. This challenge therefore aims to find an algorithm that predicts the temperature anomalies of the following years from the past 10 years of data, and that provides a probabilistic prediction, or at least the expected prediction and its uncertainty (under a Gaussian assumption).
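Under the Gaussian assumption mentioned above, a predicted mean and variance are enough to derive the probability of any event, for example that an anomaly exceeds a threshold. The sketch below is only an illustration: the function name `exceedance_probability` and the forecast values are hypothetical and are not part of the challenge material.

```python
import math


def exceedance_probability(mean: float, variance: float, threshold: float) -> float:
    """Return P(X > threshold) for X ~ N(mean, variance)."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    z = (threshold - mean) / math.sqrt(variance)
    # Survival function of the standard normal, written with erfc.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical forecast: mean anomaly 0.5, variance 0.25 (standard deviation 0.5).
p_warm = exceedance_probability(mean=0.5, variance=0.25, threshold=1.0)
print(f"P(anomaly > 1.0) = {p_warm:.3f}")
```

This is why both the mean and the variance matter for risk assessment: two forecasts with the same mean but different spreads give different probabilities for the same extreme event.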

Ref: Sévellec, F., & Drijfhout, S. S. (2018). A novel probabilistic forecast system predicting anomalously warm 2018-2022 reinforcing the long-term global warming trend. Nature Communications, 9(1), 3024. DOI: 10.1038/s41467-018-05442-8


<div style="display: flex; background-color: Green; padding: 7px;" >

### 1.3. Benchmark description
</div>

To assess the validity of the predictions, we propose two measures: the coefficient of determination (R²), which shows the skill of the mean prediction, and the reliability, which measures the accuracy of the predicted spread. The metric thus characterizes both the accuracy of the mean and the consistency of the predicted error with the actual one. See the `climate_metric.py` file for details.
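The exact scoring code lives in `climate_metric.py`, which is not reproduced here. As a rough illustration of the two ideas, the sketch below computes the textbook R² and a simple spread-to-error ratio. The second function is an assumption chosen for illustration, not necessarily the reliability definition used by the challenge, and all names and values are hypothetical.

```python
import numpy as np


def r2_score(y_true, y_mean) -> float:
    """Textbook coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_mean = np.asarray(y_mean, dtype=float)
    ss_res = np.sum((y_true - y_mean) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def spread_error_ratio(y_true, y_mean, y_var) -> float:
    """Mean squared error divided by mean predicted variance.

    Illustrative diagnostic only: a value near 1 means the predicted spread
    matches the observed error; above 1 means the spread is too narrow.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_mean = np.asarray(y_mean, dtype=float)
    y_var = np.asarray(y_var, dtype=float)
    return float(np.mean((y_true - y_mean) ** 2) / np.mean(y_var))


# Hypothetical observations and predictions.
y_true = [0.0, 1.0, 2.0, 3.0]
y_mean = [0.5, 0.5, 2.5, 2.5]
y_var = [0.25, 0.25, 0.25, 0.25]

r2 = r2_score(y_true, y_mean)
ratio = spread_error_ratio(y_true, y_mean, y_var)
print(f"R2 = {r2:.2f}, spread/error ratio = {ratio:.2f}")
```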

The benchmark for this challenge consists of the mean and variance of the 22 model predictions (rows where TIME = 10). This simple benchmark achieves a score of -0.23 on the test data.
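A minimal sketch of such a benchmark with pandas is shown below. It assumes the column names of the input files (`DATASET`, `MODEL`, `TIME`, `POSITION`, `VALUE`); the function name `ensemble_benchmark` and the toy data are hypothetical. Note that pandas computes the sample variance (`ddof=1`) by default, and the source does not say which convention the official benchmark uses.

```python
import pandas as pd


def ensemble_benchmark(x: pd.DataFrame) -> pd.DataFrame:
    """Mean and variance across models of the TIME == 10 predictions."""
    forecasts = x[(x["TIME"] == 10) & (x["MODEL"] >= 1)]
    return (
        forecasts.groupby(["DATASET", "POSITION"])["VALUE"]
        .agg(MEAN="mean", VARIANCE="var")  # "var" uses ddof=1 in pandas
        .reset_index()
    )


# Toy input: 3 models, 2 positions, plus history rows that must be ignored.
toy_x = pd.DataFrame(
    {
        "DATASET": [0, 0, 0, 0, 0, 0, 0, 0],
        "MODEL": [1, 2, 3, 1, 2, 3, 0, 1],
        "TIME": [10, 10, 10, 10, 10, 10, 9, 9],
        "POSITION": [0, 0, 0, 1, 1, 1, 0, 0],
        "VALUE": [0.1, 0.2, 0.3, 1.0, 1.0, 1.0, 5.0, 7.0],
    }
)

bench = ensemble_benchmark(toy_x)
print(bench)
```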

In [1]:
from os import getcwd
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
from tqdm import tqdm

<div style="display: flex; background-color: Blue; padding: 15px;" >

## 2. Loading the data
</div>

In [2]:
# ---------------------------------------------------------------------------------------------
#                               MAIN
# ---------------------------------------------------------------------------------------------
verbose = False
force_reloading = True

# Get the program's working directory
file_path = getcwd() + "\\"
data_set_path = file_path + "dataset\\"

print(f"Current execution path : {file_path}")
print(f"Dataset path : {data_set_path}")

Current execution path : c:\Users\User\WORK\workspace-ia\PROJETS\projet_climat\
Dataset path : c:\Users\User\WORK\workspace-ia\PROJETS\projet_climat\dataset\
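The cell above joins paths with hard-coded backslashes, which only works on Windows. A portable variant could look like the sketch below. It assumes the same `dataset` folder and the `train_X.csv` file name used in this notebook; the variable names `project_dir`, `dataset_dir` and `x_train_path` are illustrative.

```python
from pathlib import Path

# Working directory of the notebook (same role as getcwd() above).
project_dir = Path.cwd()
dataset_dir = project_dir / "dataset"

# Paths are joined with "/" instead of string concatenation.
x_train_path = dataset_dir / "train_X.csv"

print(f"Current execution path: {project_dir}")
print(f"Dataset path: {dataset_dir}")
```

`pd.read_csv` accepts a `Path` object directly, so `pd.read_csv(x_train_path)` would work on any operating system.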


In [14]:
X_train_file_name = "train_X.csv"
Y_train_file_name = "train_Y.csv"
X_test_file_name = "test_X.csv"
Y_test_randomized_file_name = "test_Y_randomized.csv"

- `ID`: a unique ID for each value.

- `DATASET`: the dataset ID. Each file contains several independent datasets. Each dataset is composed of:
  - `3072` temperature anomalies all over the world from 22 models (model ID from 1 to 22) over 10 years (time from 0 to 9)
  - `3072` temperature anomalies all over the world actually observed (model ID = 0) over 10 years (time from 0 to 9)
  - the `192` temperature anomalies predicted all over the world by the 22 models (time equals 10)

- `MODEL`: model ID, 1 to 22 for the models and 0 for the actual observation.
- `TIME`: timestamp as an integer, 0 to 9 for the 10-year history and 10 for the predicted date.
- `POSITION`: Earth coordinates in HEALPix (nside=4 for prediction, nside=16 for history); the ordering is nested.
- `VALUE`: the corresponding temperature anomaly.
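The pixel counts in the list above follow from the HEALPix rule that a map with resolution parameter `nside` has `12 * nside**2` pixels. In the nested ordering, each pixel splits into four children when `nside` doubles, so a fine pixel maps to its coarse parent by integer division. The sketch below checks these numbers; the helper names are hypothetical, and it assumes both resolutions follow the standard nested scheme.

```python
def healpix_npix(nside: int) -> int:
    """Number of pixels of a HEALPix map with the given nside."""
    return 12 * nside * nside


def nested_parent(pixel: int, nside_from: int, nside_to: int) -> int:
    """Coarse nested pixel (nside_to) that contains a fine nested pixel (nside_from)."""
    if nside_from % nside_to != 0:
        raise ValueError("nside_from must be a multiple of nside_to")
    ratio = nside_from // nside_to
    # Each doubling of nside multiplies the pixel index range by 4.
    return pixel // (ratio * ratio)


n_history = healpix_npix(16)     # history resolution
n_prediction = healpix_npix(4)   # prediction resolution

# Rows per dataset: (22 models + 1 observation) x 10 years x history pixels,
# plus 22 model predictions x prediction pixels.
rows_per_dataset = 23 * 10 * n_history + 22 * n_prediction

print(n_history, n_prediction, rows_per_dataset)
print(nested_parent(3071, nside_from=16, nside_to=4))
```

With 710784 rows per dataset, the 3553920 rows of `train_X.csv` and the 1421568 rows of `test_X.csv` reported when the files are loaded are consistent with 5 training datasets and 2 test datasets.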

In [10]:
X_train_origin = pd.read_csv(data_set_path+X_train_file_name, sep=',')

print(f"{X_train_origin.shape} data loaded ------> {list(X_train_origin.columns)}")
X_train_origin.head()

(3553920, 6) data loaded ------> ['ID', 'DATASET', 'MODEL', 'TIME', 'POSITION', 'VALUE']


|   | ID | DATASET | MODEL | TIME | POSITION | VALUE |
|---|----|---------|-------|------|----------|-------|
| 0 | DATA_0000_MODEL_0000_TIME_0000_POS_0000 | 0 | 0 | 0 | 0 | 0.2621 |
| 1 | DATA_0000_MODEL_0000_TIME_0000_POS_0001 | 0 | 0 | 0 | 1 | 0.2917 |
| 2 | DATA_0000_MODEL_0000_TIME_0000_POS_0002 | 0 | 0 | 0 | 2 | 0.3111 |
| 3 | DATA_0000_MODEL_0000_TIME_0000_POS_0003 | 0 | 0 | 0 | 3 | 0.3727 |
| 4 | DATA_0000_MODEL_0000_TIME_0000_POS_0004 | 0 | 0 | 0 | 4 | 0.3222 |
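The file is in long format, with one row per (dataset, model, time, position). For modelling, it is often convenient to reshape the 10-year history of one dataset into a dense array. The sketch below shows one possible way; the function name `history_to_array` and the toy data are hypothetical, and it assumes every (model, time, position) combination is present exactly once.

```python
import itertools

import numpy as np
import pandas as pd


def history_to_array(x: pd.DataFrame, dataset: int) -> np.ndarray:
    """Return the TIME < 10 values as an array (n_models, n_times, n_positions)."""
    hist = x[(x["DATASET"] == dataset) & (x["TIME"] < 10)]
    hist = hist.sort_values(["MODEL", "TIME", "POSITION"])
    shape = (
        hist["MODEL"].nunique(),
        hist["TIME"].nunique(),
        hist["POSITION"].nunique(),
    )
    if len(hist) != int(np.prod(shape)):
        raise ValueError("history is not a complete (model, time, position) grid")
    return hist["VALUE"].to_numpy().reshape(shape)


# Toy data: 2 models, 2 time steps, 3 positions, rows shuffled on purpose.
rows = [
    {"DATASET": 0, "MODEL": m, "TIME": t, "POSITION": p, "VALUE": 100.0 * m + 10.0 * t + p}
    for m, t, p in itertools.product(range(2), range(2), range(3))
]
# One prediction row (TIME == 10) that must be excluded from the history.
rows.append({"DATASET": 0, "MODEL": 1, "TIME": 10, "POSITION": 0, "VALUE": -1.0})
toy_x = pd.DataFrame(rows).sample(frac=1.0, random_state=0)

history = history_to_array(toy_x, dataset=0)
print(history.shape)
```

On the real data, the same call on one dataset would be expected to give an array of shape (23, 10, 3072), given the structure described earlier.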


In [None]:
# Not executed: show_model, df_x, model and data are not defined in this part of the notebook.
show_model(df_x,model,data)

In [11]:
Y_train_origin = pd.read_csv(data_set_path+Y_train_file_name, sep=',')

print(f"{Y_train_origin.shape} data loaded ------> {list(Y_train_origin.columns)}")
Y_train_origin.head()

(960, 5) data loaded ------> ['ID', 'DATASET', 'POSITION', 'MEAN', 'VARIANCE']


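|   | ID | DATASET | POSITION | MEAN | VARIANCE |
|---|----|---------|----------|------|----------|
| 0 | DATA_0000_POS_0000 | 0 | 0 | 0.499 |  |
| 1 | DATA_0000_POS_0001 | 0 | 1 | 0.4542 |  |
| 2 | DATA_0000_POS_0002 | 0 | 2 | 0.7851 |  |
| 3 | DATA_0000_POS_0003 | 0 | 3 | 0.3708 |  |
| 4 | DATA_0000_POS_0004 | 0 | 4 | 0.465 |  |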
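The `VARIANCE` column is empty in the five rows previewed above. The preview alone does not show whether this holds for the whole file, so a quick count of missing values per column is a sensible check. The sketch below uses a small hypothetical frame that mimics the preview; on the real data the same call would be `Y_train_origin.isna().sum()`.

```python
import pandas as pd

# Toy frame mimicking the preview: VARIANCE is missing (NaN) on every row.
toy_y = pd.DataFrame(
    {
        "ID": ["DATA_0000_POS_0000", "DATA_0000_POS_0001"],
        "DATASET": [0, 0],
        "POSITION": [0, 1],
        "MEAN": [0.499, 0.4542],
        "VARIANCE": [float("nan"), float("nan")],
    }
)

# Number of missing values in each column.
missing_per_column = toy_y.isna().sum()
print(missing_per_column)
```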


In [12]:
X_test_origin = pd.read_csv(data_set_path+X_test_file_name, sep=',')

print(f"{X_test_origin.shape} data loaded ------> {list(X_test_origin.columns)}")
X_test_origin.head()

(1421568, 6) data loaded ------> ['ID', 'DATASET', 'MODEL', 'TIME', 'POSITION', 'VALUE']


|   | ID | DATASET | MODEL | TIME | POSITION | VALUE |
|---|----|---------|-------|------|----------|-------|
| 0 | DATA_0000_MODEL_0000_TIME_0000_POS_0000 | 0 | 0 | 0 | 0 | -0.1261 |
| 1 | DATA_0000_MODEL_0000_TIME_0000_POS_0001 | 0 | 0 | 0 | 1 | -0.1504 |
| 2 | DATA_0000_MODEL_0000_TIME_0000_POS_0002 | 0 | 0 | 0 | 2 | -0.1753 |
| 3 | DATA_0000_MODEL_0000_TIME_0000_POS_0003 | 0 | 0 | 0 | 3 | -0.2156 |
| 4 | DATA_0000_MODEL_0000_TIME_0000_POS_0004 | 0 | 0 | 0 | 4 | -0.1625 |


In [15]:
Y_test_randomized_origin = pd.read_csv(data_set_path+Y_test_randomized_file_name, sep=',')

print(f"{Y_test_randomized_origin.shape} data loaded ------> {list(Y_test_randomized_origin.columns)}")
Y_test_randomized_origin.head()

(384, 5) data loaded ------> ['ID', 'DATASET', 'POSITION', 'MEAN', 'VARIANCE']


|   | ID | DATASET | POSITION | MEAN | VARIANCE |
|---|----|---------|----------|------|----------|
| 0 | DATA_0000_POS_0000 | 0 | 0 | -0.668516 | 0.868183 |
| 1 | DATA_0000_POS_0001 | 0 | 1 | 3.786169 | 0.579119 |
| 2 | DATA_0000_POS_0002 | 0 | 2 | -1.786576 | 1.231979 |
| 3 | DATA_0000_POS_0003 | 0 | 3 | -1.577711 | 0.559777 |
| 4 | DATA_0000_POS_0004 | 0 | 4 | -2.420425 | 2.157342 |
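The randomized file has one row per (dataset, position) with a `MEAN` and a `VARIANCE`, and IDs of the form `DATA_0000_POS_0000`. Assuming a prediction file is expected to follow the same layout, which the source does not state explicitly, predictions held in two arrays could be assembled as in the sketch below. The function name `build_prediction_frame` and the placeholder values are hypothetical.

```python
import numpy as np
import pandas as pd


def build_prediction_frame(mean: np.ndarray, variance: np.ndarray) -> pd.DataFrame:
    """Assemble arrays of shape (n_datasets, n_positions) into the 5-column layout."""
    if mean.shape != variance.shape or mean.ndim != 2:
        raise ValueError("mean and variance must be 2-D arrays of the same shape")
    n_datasets, n_positions = mean.shape
    datasets, positions = np.meshgrid(
        np.arange(n_datasets), np.arange(n_positions), indexing="ij"
    )
    # Same ID pattern as in the preview: zero-padded dataset and position.
    ids = [
        f"DATA_{d:04d}_POS_{p:04d}"
        for d, p in zip(datasets.ravel().tolist(), positions.ravel().tolist())
    ]
    return pd.DataFrame(
        {
            "ID": ids,
            "DATASET": datasets.ravel(),
            "POSITION": positions.ravel(),
            "MEAN": mean.ravel(),
            "VARIANCE": variance.ravel(),
        }
    )


# Placeholder predictions for 2 datasets and 192 positions.
prediction = build_prediction_frame(np.zeros((2, 192)), np.ones((2, 192)))
print(prediction.shape)
```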
