<a href="https://colab.research.google.com/github/VTNay/MEC557-Project/blob/Nay/MEC557_Weather.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Projects

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/git/https%3A%2F%2Fgitlab.in2p3.fr%2Fenergy4climate%2Fpublic%2Feducation%2Fmachine_learning_for_climate_and_energy/master?filepath=book%2Fnotebooks%2Fprojects.ipynb)

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


<div class="alert alert-block alert-warning">
    <b>Schedule</b>
    
- Ask your supervisors for the data if not already provided (it is not included in this repository).
- Quick presentation.
- Final project presentation.
    
</div>

<div class="alert alert-block alert-info">
    <b>One problematic, One dataset, One (or more) method(s)</b>
    
- Quality of the dataset is key.
- Results on a clean notebook.
- Explain which method(s) you used and why.
- If a method fails, explain why.

</div>

## Project: Weather station

<img alt="weather" src="https://github.com/VTNay/MEC557-Project/blob/main/images/map.png?raw=1" width=400>

- Suppose there are 5 weather stations that monitor the weather: Paris, Brest, London, Marseille and Berlin.
- The weather station in Paris breaks down
- Can we use the other stations to infer the weather in Paris

### Data set

<img alt="weather" src="https://github.com/VTNay/MEC557-Project/blob/main/images/annual_temperature.png?raw=1" width=400>

- Surface variables: skt, u10, v10, t2m, d2m, tcc, sp, tp, ssrd, blh
- Temporal resolution: hourly
- Spatial resolution: N/A

### First steps

- Look at the correlations between variables.
- What variable do I want to predict
- What time scale am interested in?
- Start with the easy predictions and move on to harder ones
- Are there events that are more predictable than others?

In [2]:
from pathlib import Path
import numpy as np
import pandas as pd
import xarray as xr
from functools import reduce
from matplotlib import pyplot as plt
from sklearn import preprocessing

paris_path = Path('/content/drive/My Drive/PHY557_Project/weather/paris')
brest_path = Path('/content/drive/My Drive/PHY557_Project/weather/brest')
london_path = Path('/content/drive/My Drive/PHY557_Project/weather/london')
marseille_path = Path('/content/drive/My Drive/PHY557_Project/weather/marseille')
berlin_path = Path('/content/drive/My Drive/PHY557_Project/weather/berlin')

file_path = {'t2m': 't2m.nc', 'blh': 'blh.nc', 'd2m': 'd2m.nc', 'skt': 'skt.nc', 'sp': 'sp.nc', 'ssrd': 'ssrd.nc', 'tcc': 'tcc.nc', 'tp': 'tp.nc', 'u10': 'u10.nc', 'v10': 'v10.nc'}
City_path = {'Paris': paris_path, 'Brest': brest_path, 'London': london_path, 'Marseille': marseille_path, 'Berlin': berlin_path}

Weather_stations = {'Paris': [], 'Brest': [], 'London': [], 'Marseille': [], 'Berlin': []}
for i in Weather_stations:
  Weather_stations[i] = {'t2m': [], 'blh': [], 'd2m': [], 'skt': [], 'sp': [], 'ssrd': [], 'tcc': [], 'tp': [], 'u10': [], 'v10': []}

for city in Weather_stations:
  for i in Weather_stations[city]:
    temp = xr.open_dataset(Path(City_path[city], file_path[i]))
    temp = temp.to_dataframe()
    if i == 'd2m' or i == 'blh':
      temp = temp.droplevel([1,2])
    else:
      temp = temp.droplevel([0,1])
    Weather_stations[city][i] = temp
  #merge them into 1 dataframe
  Weather_stations[city] = reduce(lambda left, right: pd.merge(left, right, left_index=True, right_index=True, how='outer'), Weather_stations[city].values())

Berlin = Weather_stations['Berlin']
Brest =  Weather_stations['Brest']
London = Weather_stations['London']
Paris = Weather_stations['Paris']
Marseille = Weather_stations['Marseille']
Paris = Paris[Paris.index < '2020-01-01 07:00:00'] #All dataframe has the same number of rows

# Function to rename columns
def rename_columns(df, prefix):
    return df.rename(columns={col: f"{prefix}_{col}" for col in df.columns})
# Rename columns of each DataFrame so that the feature in X will be 'Berlin_t2m', 'Berlin_u10', 'London_t2m',...
Berlin = rename_columns(Berlin, 'Berlin')
Brest = rename_columns(Brest, 'Brest')
London = rename_columns(London, 'London')
Marseille = rename_columns(Marseille, 'Marseille')

#Data cleaning
# Concatenate X = Berlin, Brest, London, Marseille and y = Paris
combined = pd.concat([Berlin, Brest, London, Marseille, Paris], axis=1)
# Drop NA values
combined = combined.dropna()
# Split them back into X and y
X_raw = combined.iloc[:, :-10]  # X has 40 features
y = combined.loc[:,'t2m']  # y is the temperature in Paris
# Normalize X and y
X_raw = (X_raw - X_raw.mean())/ X_raw.std()
y = (y - y.mean())/y.std()
# Number of years
n_years = y.index.year.max() - y.index.year.min() + 1
n_years

40

In [7]:
y

time
1980-01-01 07:00:00    272.039154
1980-01-01 08:00:00    272.022308
1980-01-01 09:00:00    271.751892
1980-01-01 10:00:00    274.506470
1980-01-01 11:00:00    275.079346
                          ...    
2019-12-31 19:00:00    272.958130
2019-12-31 20:00:00    272.240845
2019-12-31 21:00:00    271.729919
2019-12-31 22:00:00    273.190796
2019-12-31 23:00:00    272.771423
Freq: H, Name: Paris_t2m, Length: 350633, dtype: float32

In [3]:
from pathlib import Path
import numpy as np
import pandas as pd
import xarray as xr
from functools import reduce
from matplotlib import pyplot as plt
from sklearn import preprocessing

# Define file paths for weather data of different cities
weather_paths = {
    'Paris': Path('/content/drive/My Drive/PHY557_Project/weather/paris'),
    'Brest': Path('/content/drive/My Drive/PHY557_Project/weather/brest'),
    'London': Path('/content/drive/My Drive/PHY557_Project/weather/london'),
    'Marseille': Path('/content/drive/My Drive/PHY557_Project/weather/marseille'),
    'Berlin': Path('/content/drive/My Drive/PHY557_Project/weather/berlin')
}

# Define the names of the files for each weather parameter
file_names = {
    't2m': 't2m.nc', 'blh': 'blh.nc', 'd2m': 'd2m.nc', 'skt': 'skt.nc',
    'sp': 'sp.nc', 'ssrd': 'ssrd.nc', 'tcc': 'tcc.nc', 'tp': 'tp.nc',
    'u10': 'u10.nc', 'v10': 'v10.nc'
}

# Initialize a dictionary to store weather data for each city
weather_data = {city: {} for city in weather_paths}

# Load and preprocess the data for each city
for city, path in weather_paths.items():
    for param, file_name in file_names.items():
        # Load the dataset
        dataset = xr.open_dataset(Path(path, file_name)).to_dataframe()
        # Drop unnecessary levels based on the parameter
        if param in ['d2m', 'blh']:
            dataset = dataset.droplevel([1, 2])
        else:
            dataset = dataset.droplevel([0, 1])
        # Store the processed data
        weather_data[city][param] = dataset
    # Merge data from different parameters into a single dataframe
    weather_data[city] = reduce(lambda left, right: pd.merge(
        left, right, left_index=True, right_index=True, how='outer'),
        weather_data[city].values())

# Function to rename columns with city prefix
def rename_columns(df, prefix):
    """Rename columns with a city prefix."""
    return df.rename(columns={col: f"{prefix}_{col}" for col in df.columns})

# Rename columns and concatenate data from all cities
combined_data = pd.concat([rename_columns(weather_data[city], city) for city in weather_data], axis=1)

# Data cleaning
# Drop rows with missing values
combined_data = combined_data.dropna()

# Split the combined data into features (X) and target (y)
X_raw = combined_data.iloc[:, :-10]  # Features from all cities except Paris
y = combined_data['Paris_t2m']  # Target: temperature in Paris

# Normalize features and target
X_normalized = (X_raw - X_raw.mean()) / X_raw.std()
y_normalized = (y - y.mean()) / y.std()

# Calculate the number of years in the dataset
n_years = y_normalized.index.year.max() - y_normalized.index.year.min() + 1

In [5]:
y_normalized

time
1980-01-01 07:00:00   -1.730207
1980-01-01 08:00:00   -1.732572
1980-01-01 09:00:00   -1.770533
1980-01-01 10:00:00   -1.383840
1980-01-01 11:00:00   -1.303419
                         ...   
2019-12-31 19:00:00   -1.601199
2019-12-31 20:00:00   -1.701893
2019-12-31 21:00:00   -1.773618
2019-12-31 22:00:00   -1.568537
2019-12-31 23:00:00   -1.627409
Freq: H, Name: Paris_t2m, Length: 350633, dtype: float32

In [None]:
# Function to rename columns
def rename_columns(df, prefix):
    return df.rename(columns={col: f"{prefix}_{col}" for col in df.columns})
# Rename columns of each DataFrame so that the feature in X will be 'Berlin_t2m', 'Berlin_u10', 'London_t2m',...
Berlin = rename_columns(Berlin, 'Berlin')
Brest = rename_columns(Brest, 'Brest')
London = rename_columns(London, 'London')
Marseille = rename_columns(Marseille, 'Marseille')

In [None]:
#Data cleaning
# Concatenate X = Berlin, Brest, London, Marseille and y = Paris
combined = pd.concat([Berlin, Brest, London, Marseille, Paris], axis=1)
# Drop NA values
combined = combined.dropna()
# Split them back into X and y
X_raw = combined.iloc[:, :-10]  # X has 40 features
y = combined.loc[:,'t2m']  # y is the temperature in Paris
# Normalize X and y
X_raw = (X_raw - X_raw.mean())/ X_raw.std()
y = (y - y.mean())/y.std()
# Number of years
n_years = y.index.year.max() - y.index.year.min() + 1
n_years

40

**PCA**

Lasso is a linear model that uses this cost function:

$$
\frac{1}{2N_{\text{training}}} \sum_{i=1}^{N_{\text{training}}} \left( y^{(i)}_{\text{real}} - y^{(i)}_{\text{pred}} \right)^2 + \alpha \sum_{j=1}^{n} |a_j|
$$

$a_j$ is the coefficient of the j-th feature. The final term is called $l_1$ penalty and $\alpha$ is a hyperparameter that tunes the intensity of this penalty term. The higher the coefficient of a feature, the higher the value of the cost function. So, the idea of Lasso regression is to optimize the cost function reducing the absolute values of the coefficients.


In [None]:
from sklearn.linear_model import LassoCV
from sklearn.datasets import make_regression
# Set number of splits for cross-validation - two years for each fold
n_splits =  5 # We have 40 years in total

# Initialize LassoCV, which will perform cross-validation
lasso = LassoCV(cv=n_splits, random_state=0, max_iter=10000)

# Fit the Lasso model to the data
lasso.fit(X_raw, y)
# Number of features remained
n_features = 20
# n_features features having the highest absolute value of coefficient can be considered as more important or keeped
indices_top = np.argsort(np.abs(lasso.coef_))[-n_features:]
selected_features = [X_raw.columns[i] for i in indices_top]

# To reduce the feature set
X = X_raw[selected_features]

**Linear Regression**

In [None]:
#Linear Regression for Paris_t2m with 40 features - the most naive approach
# Import scikit-learn cross-validation function
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

# Call the Linear regressor
lin = LinearRegression(fit_intercept= True)

# Set number of splits for cross-validation - two years for each fold
n_splits =  5 # We have 40 years in total

# Initialize KFold
kf = KFold(n_splits=n_splits)

# Arrays to store scores
train_scores = []
test_scores = []

for train_index, test_index in kf.split(X):
    # Split data
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # Fit model
    lin.fit(X_train, y_train)

    # Calculate R2 scores
    train_score = lin.score(X_train,y_train)
    test_score = lin.score(X_test, y_test)

    # Append scores
    train_scores.append(train_score)
    test_scores.append(test_score)

# Average R2 scores
avg_train_score = np.mean(train_scores)
avg_test_score = np.mean(test_scores)

print(f"Average R2 Score on Training Data: {avg_train_score}")
print(f"Average R2 Score on Test Data: {avg_test_score}")

Average R2 Score on Training Data: 0.9191834597452836
Average R2 Score on Test Data: 0.9181068618513505


In [None]:
print(train_scores)
print(test_scores)

[0.9178100925659008, 0.9200567342719677, 0.9189771593618314, 0.9187203396985614, 0.9203529728281573]
[0.9221752601135499, 0.9149210355507839, 0.9197156572378424, 0.9203415732840835, 0.9133807830704933]


**2nd Degree Polynomial Regression**

In [None]:
from inspect import modulesbyfile
#Linear Regression for Paris_t2m with 40 features - the most naive approach
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Transform data to include polynomial features
degree = 2
polynomial_features = PolynomialFeatures(degree=degree, include_bias=True)
linear_regression = LinearRegression()

# Create a pipeline that includes both polynomial expansion and linear regression
model = make_pipeline(polynomial_features, linear_regression)

In [None]:
# Set number of splits for cross-validation - two years for each fold
n_splits = 5 # We have 40 years in total

# Initialize KFold
kf = KFold(n_splits=n_splits)

# Arrays to store scores
train_scores = []
test_scores = []

for train_index, test_index in kf.split(X):
    # Split data
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # Fit model
    model.fit(X_train, y_train)

    # Calculate R2 scores
    train_score = model.score(X_train,y_train)
    test_score = model.score(X_test, y_test)

    # Append scores
    train_scores.append(train_score)
    test_scores.append(test_score)

# Average R2 scores
avg_train_score = np.mean(train_scores)
avg_test_score = np.mean(test_scores)

print(f"Average R2 Score on Training Data: {avg_train_score}")
print(f"Average R2 Score on Test Data: {avg_test_score}")

Average R2 Score on Training Data: 0.9362718111277444
Average R2 Score on Test Data: 0.9340951422262893


**3nd Degree Polynomial Regression**

In [None]:
from inspect import modulesbyfile
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

#Linear Regression for Paris_t2m with 40 features - the most naive approach
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Transform data to include polynomial features
degree = 3
polynomial_features = PolynomialFeatures(degree=degree, include_bias=True)
linear_regression = LinearRegression()

# Create a pipeline that includes both polynomial expansion and linear regression
model = make_pipeline(polynomial_features, linear_regression)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
model.fit(X_train, y_train)
train_score = model.score(X_train,y_train)
test_score = model.score(X_test, y_test)
print(f"Average R2 Score on Training Data: {train_score}")
print(f"Average R2 Score on Test Data: {test_score}")

Average R2 Score on Training Data: 0.9451980423935095
Average R2 Score on Test Data: 0.9444146196582686


In [None]:
# Set number of splits for cross-validation - two years for each fold
from sklearn.model_selection import train_test_split

n_splits = 5 # We have n_features years in total

# Initialize KFold
kf = KFold(n_splits=n_splits)

# Arrays to store scores
train_scores = []
test_scores = []

for train_index, test_index in kf.split(X):
    # Split data
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # Fit model
    model.fit(X_train, y_train)

    # Calculate R2 scores
    train_score = model.score(X_train,y_train)
    test_score = model.score(X_test, y_test)

    # Append scores
    train_scores.append(train_score)
    test_scores.append(test_score)

# Average R2 scores
avg_train_score = np.mean(train_scores)
avg_test_score = np.mean(test_scores)

print(f"Average R2 Score on Training Data: {avg_train_score}")
print(f"Average R2 Score on Test Data: {avg_test_score}")

Average R2 Score on Training Data: 0.9456672090526222
Average R2 Score on Test Data: 0.9385114223385068


## Fucntion ##

In [None]:
from inspect import modulesbyfile
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

#Linear Regression for Paris_t2m with 40 features - the most naive approach
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
#Ridge
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def poly_ridge(degree, X_cv, y_cv, alpha):
  # # Transform data to include polynomial features
  # polynomial_features = PolynomialFeatures(degree=degree, include_bias=True)

  # # Create a pipeline that includes both polynomial expansion and linear regression
  # model = make_pipeline(polynomial_features, linear_regression)


  #model.fit(X_train, y_train)
  #train_score = model.score(X_train,y_train)
  #test_score = model.score(X_test, y_test)

  #print(f"Average R2 Score on Training Data: {train_score}")
  #print(f"Average R2 Score on Test Data: {test_score}")

  # n_splits = 5 # We have 0.75 x n_years = 30 years in total

  # # Initialize KFold
  # kf = KFold(n_splits=n_splits)

  # # Arrays to store scores
  # train_scores = []
  # test_scores = []
  '''
  for train_index, test_index in kf.split(X_cv):
      # Split data
      X_train, X_test = X.iloc[train_index], X.iloc[test_index]
      y_train, y_test = y.iloc[train_index], y.iloc[test_index]
      # Fit model
      model.fit(X_train, y_train)

      # Calculate R2 scores
      train_score = model.score(X_train,y_train)
      test_score = model.score(X_test, y_test)

      # Append scores
      train_scores.append(train_score)
      test_scores.append(test_score)

  # Average R2 scores
  avg_train_score = np.mean(train_scores)
  avg_test_score = np.mean(test_scores)

  print(f"Average R2 Score on Training Data: {avg_train_score}")
  print(f"Average R2 Score on Test Data: {avg_test_score}")
  '''
  #Ridge regression
  # Call the Ridge regressor
  #reg_class = Ridge

  # Number of test years
  #N_TEST_YEARS = 8
  # Number of test days = number of test columns
  #n_test = 365 * N_TEST_YEARS # 365 days per year

  # Define array of regularization-parameter values
  #alpha = np.linspace(0, 80, 80)

  # Select cross validation data
  # X_cv = X[:-n_test]
  # y_cv = y[:-n_test]

  # # Select test set for later
  # X_test = X[-n_test:]
  # y_test = y[-n_test:]

  # Set number of splits for cross-validation - two years for each fold
  #n_splits_cv = (n_years - N_TEST_YEARS)//2
  n_splits_cv = 5

  # Declare empty arrays in which to store r2 scores and coefficients
  r2_validation = np.empty(alpha.shape)
  coefs = np.empty((len(alpha), X.shape[1]))
  #r2_test = np.empty(alpha.shape)


  # Loop over regularization-parameter values
  for k, complexity in enumerate(alpha):
      # Transform data to include polynomial features
      polynomial_features = PolynomialFeatures(degree=degree, include_bias=True)
      # Define the Ridge estimator for particular regularization-parameter value
      reg = Ridge(alpha=complexity)
      # Create a pipeline that includes both polynomial expansion and ridge regression
      model = make_pipeline(polynomial_features, reg)
      # Get r2 test scores from k-fold cross-validation
      r2_validation_arr = cross_val_score(model, X_cv, y_cv, cv=n_splits_cv)

      # Get r2 expected prediction score by averaging over test scores
      r2_validation[k] = r2_validation_arr.mean()

      # Save coefficients
      # reg.fit(X_cv, y_cv)
      # coefs[k] = reg.coef_

      # Get r2 test error
      #r2_test[k] = reg.score(X_test, y_test)


    # Get the best values of the regularization parameter, prediction R2 and coefficients
  i_best = np.argmax(r2_validation)
  alpha_best = alpha[i_best]
  r2_validation_best = r2_validation[i_best]
  return r2_validation, i_best


In [None]:
# Degree 1
degree = 1
X_cv, X_test, y_cv, y_test = train_test_split(X, y, test_size=.25, random_state=0)
alpha = np.linspace(0, 80, 80)
r2_validation, i_best = poly_ridge(degree, X_cv, y_cv, alpha)
best_model = Ridge(alpha = alpha[i_best])
best_model.fit(X_cv, y_cv)
R2 = best_model.score(X_test, y_test)
print(r2_validation)
print(R2)

[0.91897914 0.9189792  0.9189792  0.91897919 0.91897919 0.91897919
 0.91897919 0.91897918 0.91897918 0.91897917 0.91897917 0.91897916
 0.91897915 0.91897914 0.91897913 0.91897912 0.91897911 0.9189791
 0.91897909 0.91897908 0.91897907 0.91897905 0.91897904 0.91897902
 0.918979   0.91897899 0.91897897 0.91897895 0.91897893 0.91897891
 0.91897889 0.91897887 0.91897885 0.91897883 0.9189788  0.91897878
 0.91897876 0.91897873 0.91897871 0.91897868 0.91897865 0.91897863
 0.91897859 0.91897857 0.91897854 0.9189785  0.91897848 0.91897844
 0.91897841 0.91897838 0.91897834 0.91897831 0.91897827 0.91897823
 0.9189782  0.91897816 0.91897813 0.91897809 0.91897805 0.91897801
 0.91897797 0.91897793 0.91897789 0.91897784 0.9189778  0.91897775
 0.91897771 0.91897767 0.91897762 0.91897757 0.91897752 0.91897748
 0.91897743 0.91897738 0.91897733 0.91897728 0.91897723 0.91897718
 0.91897713 0.91897707]
0.9196585732899049


In [None]:
# Degree 2
degree = 2
X_cv, X_test, y_cv, y_test = train_test_split(X, y, test_size=.25, random_state=0)
alpha = np.linspace(0, 80, 80)
r2_validation, i_best = poly_ridge(degree, X_cv, y_cv, alpha)
best_model = Ridge(alpha = alpha[i_best])
best_model.fit(X_cv, y_cv)
R2 = best_model.score(X_test, y_test)
print(r2_validation)
print(R2)

[0.93595851 0.93595815 0.93595816 0.93595812 0.93595805 0.93595794
 0.93595783 0.93595768 0.93595754 0.93595735 0.93595716 0.93595694
 0.93595668 0.93595646 0.93595621 0.93595592 0.93595565 0.93595536
 0.93595505 0.93595474 0.93595443 0.93595413 0.93595381 0.93595347
 0.93595313 0.9359528  0.93595244 0.93595209 0.93595172 0.93595136
 0.93595103 0.93595067 0.93595029 0.93594991 0.93594954 0.93594918
 0.93594879 0.93594843 0.93594805 0.93594766 0.93594728 0.93594691
 0.93594651 0.93594614 0.93594575 0.93594536 0.93594496 0.9359446
 0.93594419 0.93594379 0.93594343 0.93594301 0.93594262 0.93594224
 0.93594186 0.93594144 0.93594103 0.93594066 0.93594025 0.93593986
 0.93593946 0.93593904 0.93593867 0.93593828 0.93593785 0.93593748
 0.93593704 0.93593666 0.93593624 0.93593585 0.93593544 0.93593506
 0.93593462 0.93593425 0.93593383 0.93593343 0.93593302 0.93593263
 0.93593222 0.93593181]
0.9196585873142077


In [None]:
# Degree 3
degree = 3
X_cv, X_test, y_cv, y_test = train_test_split(X, y, test_size=.25, random_state=0)
alpha = np.linspace(0, 80, 80)
r2_validation, i_best = poly_ridge(degree, X_cv, y_cv, alpha)
best_model = Ridge(alpha = alpha[i_best])
best_model.fit(X_cv, y_cv)
R2 = best_model.score(X_test, y_test)
print(r2_validation)
print(R2)

**Ridge Regression**

In [None]:
#Ridge regression
# Import Ridge
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
# Call the Ridge regressor
reg_class = Ridge

# Number of test years
N_TEST_YEARS = 8
# Number of test days = number of test columns
n_test = 365 * N_TEST_YEARS # 365 days per year

# Define array of regularization-parameter values
alpha = np.linspace(0, 80, 80)

# Select cross validation data
X_cv = X[:-n_test]
y_cv = y[:-n_test]

# Select test set for later
X_test = X[-n_test:]
y_test = y[-n_test:]

# Set number of splits for cross-validation - two years for each fold
n_splits_cv = (n_years - N_TEST_YEARS)//2

# Declare empty arrays in which to store r2 scores and coefficients
r2_validation = np.empty(alpha.shape)
coefs = np.empty((len(alpha), X.shape[1]))
r2_test = np.empty(alpha.shape)


# Loop over regularization-parameter values
for k, complexity in enumerate(alpha):
    # Define the Ridge estimator for particular regularization-parameter value
    reg = reg_class(alpha=complexity)

    # Get r2 test scores from k-fold cross-validation
    r2_validation_arr = cross_val_score(reg, X_cv, y_cv, cv=n_splits_cv)

    # Get r2 expected prediction score by averaging over test scores
    r2_validation[k] = r2_validation_arr.mean()

    # Save coefficients
    reg.fit(X_cv, y_cv)
    coefs[k] = reg.coef_

    # Get r2 test error
    r2_test[k] = reg.score(X_test, y_test)


# Get the best values of the regularization parameter, prediction R2 and coefficients
i_best = np.argmax(r2_validation)
alpha_best = alpha[i_best]
r2_validation_best = r2_validation[i_best]
coefs_best = coefs[i_best]
r2_test_best = r2_test[i_best]

In [None]:
print(r2_validation)

In [None]:
# Plot validation curve
complexity_label = r'$\alpha$'
plt.figure()
plt.plot(alpha, r2_validation, label = 'Train R2')
plt.legend()
plt.xlabel(complexity_label)
plt.ylabel(r'$R^2$')

plt.figure()
plt.plot(alpha, r2_test, label = 'Test R2')
plt.xlabel(complexity_label)
plt.ylabel(r'$R^2$')
plt.legend()
_ = plt.title(r'Best $R^2 train$: {:.3} for $\alpha$ = {:.1e} and $R^2 test$ : {:.3}'.format(
    r2_validation_best, alpha_best, r2_test_best))
_ = plt.xlim(alpha[[0, -1]])


In [None]:
# Define the Ridge estimator for best regularization parameter value
reg = reg_class(alpha=alpha_best)

# Fit on train data
reg.fit(X_cv, y_cv)

# Test on test data
r2_test = reg.score(X_test, y_test)

print('Test R2: {:.3f}'.format(r2_test))

***
## Credit

[//]: # "This notebook is part of [E4C Interdisciplinary Center - Education](https://gitlab.in2p3.fr/energy4climate/public/education)."
Contributors include Bruno Deremble and Alexis Tantet.
Several slides and images are taken from the very good [Scikit-learn course](https://inria.github.io/scikit-learn-mooc/).

<br>

<div style="display: flex; height: 70px">
    
<img alt="Logo LMD" src="https://github.com/VTNay/MEC557-Project/blob/main/images/logos/logo_lmd.jpg?raw=1" style="display: inline-block"/>

<img alt="Logo IPSL" src="https://github.com/VTNay/MEC557-Project/blob/main/images/logos/logo_ipsl.png?raw=1" style="display: inline-block"/>

<img alt="Logo E4C" src="https://github.com/VTNay/MEC557-Project/blob/main/images/logos/logo_e4c_final.png?raw=1" style="display: inline-block"/>

<img alt="Logo EP" src="https://github.com/VTNay/MEC557-Project/blob/main/images/logos/logo_ep.png?raw=1" style="display: inline-block"/>

<img alt="Logo SU" src="https://github.com/VTNay/MEC557-Project/blob/main/images/logos/logo_su.png?raw=1" style="display: inline-block"/>

<img alt="Logo ENS" src="https://github.com/VTNay/MEC557-Project/blob/main/images/logos/logo_ens.jpg?raw=1" style="display: inline-block"/>

<img alt="Logo CNRS" src="https://github.com/VTNay/MEC557-Project/blob/main/images/logos/logo_cnrs.png?raw=1" style="display: inline-block"/>
    
</div>

<hr>

<div style="display: flex">
    <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0; margin-right: 10px" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a>
    <br>This work is licensed under a &nbsp; <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
</div>