<div style="text-align: left">
<img src="http://project.inria.fr/saclaycds/files/2017/02/logoUPSayPlusCDS_990.png" width="800px">
</div>

# [RAMP](https://www.ramp.studio/problems/storm_forecast_hackathon) on Tropical Storm Intensity Forecast (from reanalysis data)

_Sophie Giffard-Roisin (CU/CNRS), Mo Yang (CNRS), Balazs Kegl (CNRS/CDS), Claire Monteleoni (CU/CNRS), Alexandre Boucaud (CNRS/CDS)_

1. [Introduction](#Introduction)
2. [The prediction task](#The-prediction-task)
3. [Installation of libraries](#Installation-of-libraries): to do before coming!
4. [The data](#The-data)
5. [The pipeline](#The-pipeline)
6. [Evaluation](#Evaluation)
7. [Local testing/exploration](#Testing-the-submission)
8. [Submission](#Submitting-to-the-online-challenge:-ramp.studio)

## Introduction

The goal of the RAMP is to predict the intensity of tropical and extra-tropical storms (24h forecast) using information from past storms since 1979. The intensity can be measured as the maximum sustained wind over a period of one minute at 10 meters height. This speed, calculated every 6 hours, is usually expressed in knots (1 kt = 0.514 m/s) and is used to define the hurricane category on the [Saffir-Simpson scale](https://en.wikipedia.org/wiki/Saffir–Simpson_scale). Estimating how the intensity of a storm will evolve is crucial for protecting the population in its path.
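
As a small illustration of the units and categories mentioned above, the sketch below converts knots to m/s with the factor quoted in the text and maps a wind speed to a Saffir-Simpson category. The helper names are hypothetical and not part of the starting kit; the category thresholds are the standard lower bounds in knots.

```python
KT_TO_MS = 0.514  # conversion factor quoted above (1 kt = 0.514 m/s)

# Lower bounds (in knots) of the Saffir-Simpson hurricane categories 5 down to 1.
SAFFIR_SIMPSON_KT = [(137, 5), (113, 4), (96, 3), (83, 2), (64, 1)]

def knots_to_ms(speed_kt):
    """Convert a wind speed from knots to metres per second."""
    return speed_kt * KT_TO_MS

def saffir_simpson_category(speed_kt):
    """Return the hurricane category (1-5), or 0 below hurricane strength."""
    for lower_bound, category in SAFFIR_SIMPSON_KT:
        if speed_kt >= lower_bound:
            return category
    return 0

for wind in (30, 70, 100, 140):
    print(wind, "kt =", round(knots_to_ms(wind), 1), "m/s -> category",
          saffir_simpson_category(wind))
```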

<img src="https://github.com/sophiegif/ramp_kit_storm_forecast_new/blob/master/figures_pynb/all_storms_since1979_IBTrRACKS_newcats.png?raw=true" width="70%">
<div style="text-align: center">Database: tropical/extra-tropical storm tracks since 1979. Dots = initial position, color = maximal storm strength according to the Saffir-Simpson scale.</div>

Today, the forecasts (track and intensity) are provided by a large number of guidance models (1). Dynamical models solve the physical equations governing motions in the atmosphere. Statistical models, in contrast, are based on historical relationships between storm behavior and various other parameters. Intensity forecasting has nevertheless improved little, which is attributed to the complexity of tropical systems and an incomplete understanding of the factors that affect their development. The rapid intensification of hurricanes remains especially hard to predict: in 1992, Andrew went from tropical depression to a category 5 hurricane in 24h.

Machine learning (and deep learning) methods have only scarcely been tested, and there is hope that they can improve storm forecasts.

## The prediction task

<ul class="list-unstyled list-inline text-center">
  <li>
    <img src="https://github.com/sophiegif/ramp_kit_storm_forecast_new/blob/master/figures_pynb/storm_shema3.png?raw=true" alt= "image1" width="350" height="350">
    <figcaption>Goal: estimate the 24h-forecast intensity of all storms.</figcaption>
  </li>
  
  <li>
    <img src="https://github.com/sophiegif/ramp_kit_storm_forecast_new/blob/master/figures_pynb/hurricane_pb.png?raw=true" alt= "image2" width="350" height="350">
    <figcaption>Feature data: centered maps of wind, altitude, sst, slp, humidity...</figcaption>
  </li>
</ul>

This challenge proposes to design the best algorithm to predict, for a large number of storms, the 24h-forecast intensity every 6 hours. The (real) database is composed of more than 3000 extra-tropical and tropical storm tracks, and it also provides the intensity and some local physical information at each time step (2). We also provide some 700-hPa and 1000-hPa feature maps of the neighborhood of the storm (from the ERA-Interim reanalysis database (3)), which can be viewed as images centered on the current storm location (see right image).
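
To make the "feature maps as images" idea concrete, here is a minimal sketch that reshapes flattened map columns into 11x11 arrays (the grid size used by the helper functions later in this notebook). The column naming `z_<row>_<col>`, the row-major ordering, and the `to_images` helper are assumptions for illustration; check the real column names in your data.

```python
import numpy as np
import pandas as pd

GRID = 11  # the maps are 11x11 grids centred on the storm

# Hypothetical column naming ("z_<row>_<col>") on made-up values.
cols = ["z_{}_{}".format(i, j) for i in range(GRID) for j in range(GRID)]
toy = pd.DataFrame(np.arange(2 * GRID * GRID, dtype=float).reshape(2, -1),
                   columns=cols)

def to_images(df, prefix, grid=GRID):
    """Stack the columns starting with `prefix` into (n_samples, grid, grid)."""
    flat = df[[c for c in df.columns if c.startswith(prefix)]].values
    return flat.reshape(len(df), grid, grid)

images = to_images(toy, "z_")
print(images.shape)  # (2, 11, 11)
```

Arrays of this shape can be plotted with an image viewer or fed to a convolutional model.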

The goal is to provide, for each time step of each storm (total number of instants = 90 000), the predicted 24h-forecast intensity, that is, the intensity 4 time steps in the future.
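
The relation between the 6-hour sampling and the 24h target can be sketched on toy data as follows. The starting kit already provides the target, so this is only an illustration; the toy values and the `windspeed_24h` column name are made up.

```python
import pandas as pd

# Toy track data: two storms sampled every 6 hours (values are made up).
toy = pd.DataFrame({
    "stormid": ["A"] * 6 + ["B"] * 5,
    "instant_t": list(range(6)) + list(range(5)),
    "windspeed": [25, 30, 35, 45, 60, 70, 20, 20, 25, 30, 30],
})

HORIZON_STEPS = 4  # 24 h / 6 h per time step

# Shift the wind speed within each storm so that row t holds the value at t + 4.
toy["windspeed_24h"] = toy.groupby("stormid")["windspeed"].shift(-HORIZON_STEPS)

# The last 4 time steps of each storm have no 24 h target and are dropped.
labelled = toy.dropna(subset=["windspeed_24h"])
print(labelled)
```

Grouping by storm before shifting keeps the target of one storm from leaking into the rows of the next.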

References

1. National Hurricane Center Forecast Verification website, https://www.nhc.noaa.gov/verification/, updated 04 April 2017.

2. Knapp, K. R., M. C. Kruk, D. H. Levinson, H. J. Diamond, and C. J. Neumann, 2010: The International Best Track Archive for Climate Stewardship (IBTrACS): Unifying tropical cyclone best track data. Bulletin of the American Meteorological Society, 91, 363-376 https://www.ncdc.noaa.gov/ibtracs/index.php?name=wmo-data

3. Dee, D. P. et al.(2011), The ERA-Interim reanalysis: configuration and performance of the data assimilation system. Q.J.R. Meteorol. Soc., 137: 553–597. https://rmets.onlinelibrary.wiley.com/doi/abs/10.1002/qj.828

## Installation of libraries

To get this notebook running and test your models locally using the `ramp_test_submission`, we recommend that you use the Python distribution from [Anaconda](https://www.anaconda.com/download/) or [Miniconda](https://docs.anaconda.com/docs_oss/conda/install/quick#miniconda-quick-install-requirements). Uncomment the lines in the cells below before running them.

In [None]:
# !conda install -y -c conda conda-env     # First install conda-env to ease the creation of virtual envs in conda
# !conda env create                        # Uses the local environment.yml to create the 'storm_forecast_2' env

**OR** if you have Python already installed but are **not using Anaconda**, you'll want to use `pip` 

In [None]:
# !pip install -r requirements.txt

#### Installation of ramp-workflow

To test submissions, you also need the `ramp-workflow` package installed locally. You can install the latest version with pip from GitHub:

In [None]:
!pip install git+https://github.com/paris-saclay-cds/ramp-workflow

#### Download data (optional)

If the data has not yet been downloaded locally, run the following cell.
The starting kit data is 260 MB.

In [None]:
!python download_data.py

In [1]:
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The scikit-learn version is 0.19.2.


In [5]:
## Utilities ##

import numpy as np
import pandas as pd
from typing import Dict
from typing import Tuple
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.random_projection import SparseRandomProjection
from sklearn.preprocessing import Imputer, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import kneighbors_graph
from sklearn.utils.graph_shortest_path import graph_shortest_path
from scipy.stats import pearsonr, spearmanr
import multiprocessing as mp
import itertools
from sklearn.metrics import mean_squared_error

from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor

def group_dict(iterable, keyfn, mapfn):
    """
    Groups the iterable using the given key function and returns a dictionary
    of keys to groups.
    """
    groups = itertools.groupby(iterable, key=keyfn)
    gdict = dict()
    for k, g in groups:
        gdict[k] = list(map(mapfn, g))
    return gdict

def stormid_dict(X_df: pd.DataFrame) -> Dict[str, pd.DataFrame]:
    """
    Partitions the storm forecast dataset into separate groups for each storm and
    returns the result as a dictionary.
    """
    groups = X_df.groupby(['stormid'])
    storm_dict = dict()
    for stormid, df in groups:
        storm_dict[stormid] = df
    return storm_dict

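def feature_groups(X_df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Builds four feature sets from the columns of X_df and returns them in this order:
    1) X_0D: 0-D features (with one-hot encoded nature and basin)
    2) X_0D_zuv: 0-D features + 11x11 z, u, v wind reanalysis data
    3) X_0D_sshv: 0-D features + 11x11 sst, slp, humidity, and vorticity reanalysis data
    4) X_all: all features
    """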
    feat_cols = X_df.get(['stormid','instant_t', 'windspeed', 'latitude', 'longitude','hemisphere','Jday_predictor','initial_max_wind','max_wind_change_12h','dist2land'])
    nature_cols = pd.get_dummies(X_df.nature, prefix='nature', drop_first=False)
    basin_cols = pd.get_dummies(X_df.basin, prefix='basin', drop_first=False)
    X_0D = pd.concat([feat_cols, nature_cols, basin_cols], axis=1, sort=False)
    X_zuv = X_df.get([col for col in X_df.columns if col.startswith('z_') or col.startswith('u_') or col.startswith('v_')])
    X_sshv = X_df.get([col for col in X_df.columns if col.startswith('sst') or col.startswith('slp')
                   or col.startswith('hum') or col.startswith('vo700')])
    X_all = pd.concat([X_0D, X_zuv, X_sshv], axis = 1)
    X_0D_zuv = pd.concat([X_0D, X_zuv], axis = 1)
    X_0D_sshv = pd.concat([X_0D, X_sshv], axis = 1)
    
    return X_0D, X_0D_zuv, X_0D_sshv, X_all

def trust_cont_score(X, X_map, k=10, alpha=0.5, impute_strategy='median'):
    """
    Computes the "trustworthiness" and "continuity" [1] of X_map with respect to X.
    This is a port and extension of the implementation provided by Van der Maaten [2].
    
    Parameters:
    X     : the data in its original representation
    X_map : the lower dimensional representation of the data to be evaluated
    k     : parameter that determines the size of the neighborhood for the T&C measure
    alpha : mixing parameter in [0,1] that determines the weight given to trustworthiness vs. continuity; higher values will give more
            weight to trustworthiness, lower values to continuity.
    
    [1] Kaski S, Nikkilä J, Oja M, Venna J, Törönen P, Castrén E. Trustworthiness and metrics in visualizing similarity of gene expression. BMC bioinformatics. 2003 Dec;4(1):48.
    [2] Maaten L. Learning a parametric embedding by preserving local structure. In Artificial Intelligence and Statistics 2009 Apr 15 (pp. 384-391).
    """
    # Impute X values
    X = Imputer(strategy=impute_strategy).fit_transform(X)
    # Compute pairwise distance matrices
    D_h = pairwise_distances(X, X, metric='euclidean')
    D_l = pairwise_distances(X_map, X_map, metric='euclidean')
    # Compute neighborhood indices
    ind_h = np.argsort(D_h, axis=1)
    ind_l = np.argsort(D_l, axis=1)
    # Compute trustworthiness
    N = X.shape[0]
    T = 0
    C = 0
    t_ranks = np.zeros((k, 1))
    c_ranks = np.zeros((k, 1))
    for i in range(N):
        for j in range(k):
            t_ranks[j] = np.where(ind_h[i,:] == ind_l[i, j+1])
            c_ranks[j] = np.where(ind_l[i,:] == ind_h[i, j+1])
        t_ranks -= k
        c_ranks -= k
        T += np.sum(t_ranks[np.where(t_ranks > 0)])
        C += np.sum(c_ranks[np.where(c_ranks > 0)])
    S = (2 / (N * k * (2 * N - 3 * k - 1)))
    T = 1.0 - S*T
    C = 1.0 - S*C
    return alpha*T + (1.0-alpha)*C

def sammon_stress(X, X_m, impute_strategy='median'):
    X = Imputer(strategy=impute_strategy).fit_transform(X)
    Dx = pairwise_distances(X, X, metric='euclidean')
    Dy = pairwise_distances(X_m, X_m, metric='euclidean')
    # Sammon Stress computes sums over indices where i < j
    # We can interpret this as being the upper triangle of each matrix, from the k=1 diagonal
    Dx_ut = np.triu(Dx, k=1)
    Dy_ut = np.triu(Dy, k=1)
    # Compute Sammon Stress, S
    S = (1 / np.sum(Dx_ut))*np.sum(np.square(Dx_ut - Dy_ut) / (Dx_ut + np.ones(Dx.shape)))
    return S
    
    
def residual_variance(X, X_m, n_neighbors=20):
    kng_h = kneighbors_graph(X, n_neighbors=n_neighbors, mode='distance', n_jobs=mp.cpu_count()).toarray()
    D_h = graph_shortest_path(kng_h, method='D', directed=False)
    #D_h = pairwise_distances(X, X, metric='euclidean')
    #D_l = kneighbors_graph(X_m, n_neighbors=50, mode='distance').toarray()
    D_l = pairwise_distances(X_m, X_m, metric='euclidean')
    r,_ = spearmanr(D_h.flatten(), D_l.flatten())
    return 1 - r**2.0

In [None]:
def parallel(fn, params, n_jobs=mp.cpu_count()):
    pool = mp.Pool(n_jobs)
    print('started new process pool with {} processes'.format(n_jobs))
    try:
        res = pool.map(fn, params)
        pool.close()
        pool.join()
    except:
        print('process pool interrupted, shutting down')
        pool.terminate()
        pool.join()
        raise
    return res

mplog = mp.get_logger()

In [None]:
import matplotlib.pyplot as plt

def plot_mapping_vs_intensity_2d(X_map, ys, title="", xlabel="", ylabel="", instant_labels: pd.DataFrame = None):
    plt.scatter(X_map[:,0], X_map[:,1], c=ys, cmap='copper')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.colorbar().set_label("Storm intensity")
    if instant_labels is not None:
        for (i, xy) in enumerate(X_map):
            if i % 2 == 0:
                plt.annotate(instant_labels.values[i], xy)

In [None]:
def pca(X_df, n_components=2):
    X_drop = X_df.drop(columns = ['stormid'])
    imputer = Imputer(strategy='median')
    scaler = StandardScaler()
    pca = PCA(n_components=n_components)
    pca_pipeline = Pipeline([('med_imputer', imputer),('scaler', scaler),('pca',pca)])
    X_pc = pca_pipeline.fit_transform(X_drop)    
    pca_comp_feat_ratios = np.square(pca.components_)
    feat_names = X_drop.columns
    pc_feat_contrib = pd.DataFrame(pca_comp_feat_ratios, columns=feat_names)
    return pd.concat([X_df['stormid'],pd.DataFrame(X_pc)], axis = 1) #, pc_feat_contrib, pca.components_, pca.explained_variance_ratio_


In [2]:
def rand_projection(X_df, n_components='auto', eps=0.1):
    X_drop = X_df.drop(columns = ['stormid'])
    imputer = Imputer(strategy='median')
    scaler = StandardScaler()
    proj = SparseRandomProjection(n_components=n_components, eps=eps)
    proj_pipeline = Pipeline([('med_imputer', imputer),('scaler', scaler),('proj', proj)])
    X_rp = proj_pipeline.fit_transform(X_drop)
    return pd.concat([X_df['stormid'],pd.DataFrame(X_rp)], axis = 1) #, proj.n_components_

In [None]:
def tsne(X_df, n_components=2, n_iter=5000, perplexity=30, learning_rate=100, init='pca'):
    imputer = Imputer(strategy='median')
    scaler = StandardScaler()
    tsne = TSNE(n_components=n_components, perplexity=perplexity, learning_rate=learning_rate, init=init, n_iter=n_iter)
    tsne_pipeline = Pipeline([('med_imputer', imputer),('scaler', scaler),('tsne', tsne)])
    X_tsne = tsne_pipeline.fit_transform(X_df)
    print("t-SNE completed after {} iterations with final KLD: {}".format(tsne.n_iter_, tsne.kl_divergence_))
    return X_tsne

In [None]:
def umap(X_df, n_components=2, y=None, n_neighbors=5, min_dist=0.1, metric='correlation'):
    import warnings
    from umap import UMAP  # provided by the umap-learn package
    warnings.filterwarnings('ignore')
    imputer = Imputer(strategy='median')
    scaler = StandardScaler()
    umap = UMAP(n_components=n_components, n_neighbors=n_neighbors, min_dist=min_dist, metric=metric)
    umap_pipeline = Pipeline([('med_imputer', imputer),('scaler', scaler),('umap', umap)])
    X_umap = umap_pipeline.fit_transform(X_df, y)
    warnings.resetwarnings()
    return X_umap

## The data

The 3000 storms have been separated into a train set, a test set and a local starting kit (itself split into train and test sets). The data from `download_data.py` (local starting kit) includes only 1/4 of the storms in the total database; the train set on which your code will run on the platform contains another half. These sets are disjoint.
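
Keeping storms disjoint is also worth doing in local validation, so that time steps of one storm never appear on both sides of a split. A minimal sketch with scikit-learn's `GroupKFold` on made-up storm ids (the toy arrays are assumptions, not kit data):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 12 time steps belonging to 4 storms (made-up ids).
storm_ids = np.array(["s1"] * 3 + ["s2"] * 4 + ["s3"] * 2 + ["s4"] * 3)
X_toy = np.arange(len(storm_ids)).reshape(-1, 1)
y_toy = np.zeros(len(storm_ids))

folds = list(GroupKFold(n_splits=2).split(X_toy, y_toy, groups=storm_ids))
for train_idx, test_idx in folds:
    train_storms = set(storm_ids[train_idx])
    test_storms = set(storm_ids[test_idx])
    # No storm appears on both sides of the split.
    print(sorted(train_storms), sorted(test_storms), train_storms & test_storms)
```

The indices returned by `split` are positions in the arrays passed to it, so they should be applied to data in that same row order.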

Let's have a look at the local train data (only the first rows are plotted).

In [6]:
from problem import get_train_data
import numpy as np
import pandas as pd
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

import warnings
warnings.filterwarnings('ignore')

zero_d_dict = {}
# RMSE will be in the same units as the output variable(windspeed in knots)
# Data Exploration: the training set has 15777 training examples with 859 features each.
data_train, y_true = get_train_data()
X_0D, X_0D_zuv, X_0D_sshv, X_all = feature_groups(data_train)

In [None]:
# Adapted from the kit's cross-validation function, without the row shuffle:
# GroupKFold returns positional indices, so they must refer to X in its original
# row order for the folds to keep storms separated when used with .iloc.
from sklearn.model_selection import GroupKFold

def get_cv(X, y, num_splits=5):
    group = np.array(X['stormid'])
    gkf = GroupKFold(n_splits=num_splits).split(X, y, group)
    return gkf

In [22]:
X_rp = rand_projection(X_all, eps = .5)
np.shape(X_rp)

(15777, 464)

In [None]:
# Generate Results for Everything: (must change X_df in each function)
total_rp_dict = {}

for eps in [0.1, 0.25, 0.5, 0.75, 0.99]:
    rp_dict = {}
    
    # Pass eps by keyword: the second positional argument is n_components
    X_rp = rand_projection(X_all, eps=eps)

    max_depth = [5,10,20,25]
    max_depth_tree = [5,10,15,20,25,50,100]
    max_features = ['sqrt', 'log2', None]
    n_estimators = [10,25,50,100,200]
    alpha_lasso = [.00001, .0001, .001, .01, .1, 1]
    alpha_ridge = [1, 2, 5, 10, 25, 50, 100, 200, 500, 1000]
    C = np.logspace(-2, 2, num=5) 
    gamma = np.logspace(-4, 2, num=7)

    params_forest_boost = list(itertools.product(max_depth, max_features, n_estimators))
    params_tree = list(itertools.product(max_depth, max_features))
    params_svr = list(itertools.product(C, gamma))

    results_forest = parallel(run_forest, params_forest_boost)
    results_tree = parallel(run_tree, params_tree)
    results_boost = parallel(run_boost, params_forest_boost)
    results_svr = parallel(run_svr, params_svr)

    rp_dict['forest_regressor'] = dict(results_forest)
    total_rp_dict[eps] = rp_dict
    np.save('storm_forecast_rp_tuning.npy', total_rp_dict)
    print("Saved!")
    
    rp_dict['tree_regressor'] = dict(results_tree)
    total_rp_dict[eps] = rp_dict
    np.save('storm_forecast_rp_tuning.npy', total_rp_dict)
    print("Saved!")
    
    rp_dict['boost_regressor'] = dict(results_boost)
    total_rp_dict[eps] = rp_dict
    np.save('storm_forecast_rp_tuning.npy',total_rp_dict)
    print("Saved!")
    
    rp_dict['svr_regressor'] = dict(results_svr)
    total_rp_dict[eps] = rp_dict
    np.save('storm_forecast_rp_tuning.npy', total_rp_dict)
    print("Saved!")
    
    rp_dict['ridge_regressor'] = ridge(X_rp, y_true, alpha_ridge)
    total_rp_dict[eps] = rp_dict
    np.save('storm_forecast_rp_tuning.npy', total_rp_dict)
    print("Saved!")
    
    rp_dict['lasso_regressor'] = lasso(X_rp, y_true, alpha_lasso)
    total_rp_dict[eps] = rp_dict
    np.save('storm_forecast_rp_tuning.npy', total_rp_dict)
    print("Saved!")

In [None]:
'''
# Generate Results for Everything: (must change X_df in each function)
zero_0d_zuv_dict = {}
max_depth = [5,10,20,25]
max_depth_tree = [5,10,15,20,25,50,100]
max_features = ['sqrt', 'log2', None]
n_estimators = [10,25,50,100,200]
alpha_lasso = [.00001, .0001, .001, .01, .1, 1]
alpha_ridge = [1, 2, 5, 10, 25, 50, 100, 200, 500, 1000]

params_forest_boost = list(itertools.product(max_depth, max_features, n_estimators))
params_tree = list(itertools.product(max_depth, max_features))

results_forest2 = parallel(run_forest, params_forest_boost)
results_tree2 = parallel(run_tree, params_tree)
results_boost2 = parallel(run_boost, params_forest_boost)

zero_0d_zuv_dict['forest_regressor'] = dict(results_forest2)

zero_0d_zuv_dict['tree_regressor'] = dict(results_tree2)

zero_0d_zuv_dict['boost_regressor'] = dict(results_boost2)

zero_0d_zuv_dict['ridge_regressor'] = ridge(X_0D_zuv, y_true, alpha_ridge)

zero_0d_zuv_dict['lasso_regressor'] = lasso(X_0D_zuv, y_true, alpha_lasso)

print(zero_0d_zuv_dict)
#np.save('storm_forecast_zero_d_zuv_tuning.npy', zero_0d_zuv_dict)
'''


In [None]:
# Homemade Forest Parameter Tuning 
def forest(X_df, y_df, max_depth, max_features, n_estimators, n_splits = 5):
    mse_sum = 0
    for train_index, test_index in get_cv(X_df,y_df,n_splits):
        forest_regressor = Pipeline([       
            ('regressor', RandomForestRegressor(max_depth = max_depth, max_features = max_features, n_estimators = n_estimators))])
        drop = X_df.drop(columns = ['stormid'])

        X_train, X_test = drop.iloc[train_index,:], drop.iloc[test_index,:]
        y_train, y_test = y_df[train_index], y_df[test_index]

        forest_regressor.fit(X_train,y_train)
        y_pred = forest_regressor.predict(X_test)
        mse_sum += mean_squared_error(y_test, y_pred)
        
    return ((max_depth, max_features, n_estimators),np.sqrt(mse_sum/n_splits))

def run_forest(p):
    d, f, n = p
    return forest(X_rp, y_true, max_depth=d, max_features=f, n_estimators=n)


In [None]:
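# Generate Results
max_depth = [5,10,15,20,25]
max_features = ['sqrt', 'log2', None]
n_estimators = [10,25,50, 100,250,500]

params = list(itertools.product(max_depth, max_features, n_estimators))

results_forest = parallel(run_forest, params)
zero_d_dict['forest_regressor'] = dict(results_forest)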

In [None]:
# Homemade Tree Parameter Tuning 
def tree(X_df, y_df, max_depth, max_features, n_splits =5):
    mse_sum = 0
    for train_index, test_index in get_cv(X_df,y_df,n_splits):
        tree_regressor = Pipeline([
            ('regressor', DecisionTreeRegressor(max_depth = max_depth, max_features = max_features))])
        drop = X_df.drop(columns = ['stormid'])
        X_train, X_test = drop.iloc[train_index,:], drop.iloc[test_index,:]
        y_train, y_test = y_df[train_index], y_df[test_index]
        tree_regressor.fit(X_train,y_train)
        y_pred = tree_regressor.predict(X_test)
        mse_sum += mean_squared_error(y_test, y_pred)
        
    return ((max_depth, max_features),np.sqrt(mse_sum/n_splits))

def run_tree(p):
    d, f = p
    return tree(X_rp, y_true, max_depth=d, max_features=f)

In [None]:
# Generate Results
max_depth = [5,10,15,20,25,50,100]
max_features = [None, 'sqrt', 'log2']

params = list(itertools.product(max_depth, max_features))

results_tree = parallel(run_tree, params)
zero_d_dict['tree_regressor'] = dict(results_tree)

In [None]:
# Homemade SVR Parameter Tuning 
def svr(X_df, y_df, C, gamma, n_splits = 5):
    mse_sum = 0
    for train_index, test_index in get_cv(X_df,y_df,n_splits):
        rbf_regressor = Pipeline([
            ('regressor', SVR(kernel = 'rbf', C = C, gamma = gamma))])
        drop = X_df.drop(columns = ['stormid'])
        
        X_train, X_test = drop.iloc[train_index,:], drop.iloc[test_index,:]
        y_train, y_test = y_df[train_index], y_df[test_index]
        rbf_regressor.fit(X_train,y_train)
        y_pred = rbf_regressor.predict(X_test)
        mse_sum += mean_squared_error(y_test, y_pred)
        
    return ((C, gamma),np.sqrt(mse_sum/n_splits))

def run_svr(p):
    c, g = p
    return svr(X_rp, y_true, C = c, gamma = g)

In [None]:
# Generate Results
C = np.logspace(-3, 3, num=7)
gamma = np.logspace(-5, 2, num=8)

params = list(itertools.product(C, gamma))

results_rbf = parallel(run_svr, params)
zero_d_dict['rbf_regressor'] = dict(results_rbf)

In [None]:
# Homemade Boosted Forest Parameter Tuning 
def boost(X_df, y_df, max_depth, max_features, n_estimators, n_splits = 5):
    mse_sum = 0
    for train_index, test_index in get_cv(X_df,y_df,n_splits):
        forest_regressor = Pipeline([      
            ('regressor', GradientBoostingRegressor(max_depth = max_depth, max_features = max_features, n_estimators = n_estimators))])
        drop = X_df.drop(columns = ['stormid'])

        X_train, X_test = drop.iloc[train_index,:], drop.iloc[test_index,:]
        y_train, y_test = y_df[train_index], y_df[test_index]

        forest_regressor.fit(X_train,y_train)
        y_pred = forest_regressor.predict(X_test)
        mse_sum += mean_squared_error(y_test, y_pred)
        
    return ((max_depth, max_features, n_estimators),np.sqrt(mse_sum/n_splits))

def run_boost(p):
    d, f, n = p
    return boost(X_rp, y_true, max_depth=d, max_features=f, n_estimators=n)

In [None]:
# Generate Results
max_depth = [5,10,15,20,25]
max_features = ['sqrt', 'log2', None]
n_estimators = [10,25,50,100,250,500]

params = list(itertools.product(max_depth, max_features, n_estimators))

results_boost = parallel(run_boost, params)
zero_d_dict['boost_regressor'] = dict(results_boost)

In [None]:
# Homemade Lasso Parameter Tuning 
def lasso(X_df, y_df, alpha, n_splits = 5):
    lasso_dict = {}
    for a in alpha:
        mse_sum = 0
        for train_index, test_index in get_cv(X_df,y_df,n_splits):
            lasso_regressor = Pipeline([
                ('regressor', Lasso(alpha = a))])
            drop = X_df.drop(columns = ['stormid'])

            X_train, X_test = drop.iloc[train_index,:], drop.iloc[test_index,:]
            y_train, y_test = y_df[train_index], y_df[test_index]

            lasso_regressor.fit(X_train,y_train)
            y_pred = lasso_regressor.predict(X_test)
            mse_sum += mean_squared_error(y_test, y_pred)

        lasso_dict[a] = np.sqrt(mse_sum/n_splits)
    return lasso_dict


In [None]:
# Generate Results
alpha = [.00001, .0001, .001, .01, .1, 1]
zero_d_dict['lasso_regressor'] = lasso(X_0D, y_true, alpha)

In [None]:
# Homemade Ridge Parameter Tuning 
def ridge(X_df, y_df, alpha, n_splits = 5):
    ridge_dict = {}
    for a in alpha:
        mse_sum = 0
        for train_index, test_index in get_cv(X_df,y_df,n_splits):
            ridge_regressor = Pipeline([
                ('regressor', Ridge(alpha = a))])
            drop = X_df.drop(columns = ['stormid'])

            X_train, X_test = drop.iloc[train_index,:], drop.iloc[test_index,:]
            y_train, y_test = y_df[train_index], y_df[test_index]

            ridge_regressor.fit(X_train,y_train)
            y_pred = ridge_regressor.predict(X_test)
            mse_sum += mean_squared_error(y_test, y_pred)

        ridge_dict[a] = np.sqrt(mse_sum/n_splits)
    return ridge_dict
    

In [None]:
alpha = [1, 2, 5, 10, 25, 50, 100, 200, 500, 1000]
zero_d_dict['ridge_regressor'] = ridge(X_0D, y_true, alpha)

In [None]:
zero_d_dict['tree_regressor'] = dict(results_tree)
zero_d_dict['boost_regressor'] = dict(results_boost)
zero_d_dict['forest_regressor'] = dict(results_forest)
print(zero_d_dict)
np.save('storm_forecast_zero_d_tuning.npy', zero_d_dict)

You can see that the data is a list of time instants (one every 6h). The first storm will result in x lines beginning with its stormid and the corresponding time step, with all the associated features on the same row. Then the time steps from the second storm will be below, and so on. 
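
The layout described above (one row per storm and time step, storms stacked one after another) can be explored with a `groupby` on the storm id. The sketch below uses a made-up toy frame rather than the kit data; only the `stormid`, `instant_t` and `windspeed` column names come from this notebook.

```python
import pandas as pd

# Toy frame mimicking the layout: one row per (storm, time step); values are made up.
toy = pd.DataFrame({
    "stormid": ["A", "A", "A", "B", "B"],
    "instant_t": [0, 1, 2, 0, 1],
    "windspeed": [25, 30, 40, 20, 25],
})

# Number of time steps recorded for each storm.
steps_per_storm = toy.groupby("stormid").size()
print(steps_per_storm)

# Rows of a single storm, ordered in time.
storm_a = toy[toy["stormid"] == "A"].sort_values("instant_t")
print(storm_a)
```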

In [None]:
print('Number of storms in the local training set: {}'.format( len(set(data_train['stormid'])) ) )

In [None]:
print('Total number of time steps in the local training set: {}'.format(y_true.size))

### 1. 0D features from track data

A set of simple features has been extracted for each storm at each time point: 

- latitude, longitude: in degrees
- windspeed: current (max) windspeed (knots) 
- hemisphere:  South=0, North=1
- Jday predictor:  Gaussian function of (Julian day of storm init - peak day of the hurricane season), see (1)
- initial_max_wind: initial (max) windspeed of the storm 
- max_wind_change_12h: last 12h (max) windspeed change
- basin = based on the present location: 
       0 = NA - North Atlantic / 1 = SA - South Atlantic    / 2 = WP - West Pacific       / 3 = EP - East Pacific /
       4 = SP - South Pacific  / 5 = NI - North Indian      / 6 = SI - South Indian       / 7 = AS - Arabian Sea /
       8 = BB - Bay of Bengal  / 9 = EA - Eastern Australia / 10 = WA - Western Australia / 11 = CP - Central Pacific
       12 = CS - Caribbean Sea / 13 = GM - Gulf of Mexico   / 14 = MM - Missing
- nature = nature of the storm  
       0 = TS - Tropical / 1 = SS - Subtropical / 2 = ET - Extratropical / 3 = DS - Disturbance /
       4 = MX - Mix of conflicting reports / 5 = NR - Not Reported / 6 = MM - Missing / 7 =  - Missing
- dist2land = current distance to the land (km)


(1) DeMaria, Mark, et al. "Further improvements to the statistical hurricane intensity prediction scheme (SHIPS)." Weather and Forecasting 20.4 (2005): 531-543. https://journals.ametsoc.org/doi/full/10.1175/WAF862.1
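
Two of the features above lend themselves to a short sketch. The integer codes for `basin` and `nature` are labels without a meaningful order, so they are usually one-hot encoded before fitting a linear model; and the Jday predictor is a Gaussian of the distance to the season peak. The peak day and width used below are assumed values for illustration only (see reference (1) for the actual definition), and `jday_predictor` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# One-hot encoding of the categorical codes (toy values).
toy = pd.DataFrame({"basin": [0, 2, 2, 5], "nature": [0, 0, 2, 1]})
basin_cols = pd.get_dummies(toy["basin"], prefix="basin")
nature_cols = pd.get_dummies(toy["nature"], prefix="nature")
encoded = pd.concat([basin_cols, nature_cols], axis=1)
print(encoded.columns.tolist())

def jday_predictor(julian_day, peak_day=253.0, width=25.0):
    """Gaussian of the distance (in days) between storm start and season peak.

    peak_day and width are assumed values, not taken from the kit.
    """
    return np.exp(-((julian_day - peak_day) / width) ** 2)

# The predictor is 1 at the peak and decays symmetrically around it.
print(jday_predictor(np.array([203.0, 253.0, 303.0])))
```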

In [None]:
# Simple Regression With Only 0D Features:
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

print(test.shape)
print(y_true.shape)
# Data Exploration: There are now 11832 training examples and 3945 test examples
X_train, X_test, y_train, y_test = train_test_split(test, y_true)

#X_train = X_train.loc[:,'latitude':'dist2land']
#X_test = X_test.loc[:,'latitude':'dist2land']
#X_train = preprocessing.scale(X_train)


In [None]:
# Lasso baseline (the SVR/RBF search grid is kept commented out below):
from sklearn.svm import SVR
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV

rbf_regressor = Pipeline([('scale', StandardScaler()),
            ('imputer', Imputer(strategy='median')),
            ('regressor', Lasso())])

'''
rbf_regressor.fit(X_train,y_train)
y_pred = rbf_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(mse)
'''


#grid_list = {"regressor__C": np.logspace(-3, 3, num=7), "regressor__gamma": np.logspace(-5, 2, num=8)}
grid_list = { 
    'regressor__alpha': [1, 10]
}
folds = get_cv(data_train, y_true, 5)
for f in folds:
    print(f)
grid = RandomizedSearchCV(estimator=rbf_regressor, param_distributions=grid_list, n_jobs = None, random_state = 1, n_iter=2, scoring = 'neg_mean_squared_error', cv = 5)
print(grid)
grid.fit(tmp,y_true)
#print(grid.best_score_)
#grid.best_params_

'''
param_tuning['rbf_regressor'] = (grid.best_score_,grid.best_params_)
np.save('storm_forecast_parameter_tuning.npy', param_tuning)
'''


In [None]:
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

poly_regressor = Pipeline([('scale', StandardScaler()),
            ('imputer', SimpleImputer(strategy='median')),
            ('regressor', SVR('poly'))])

poly_regressor.fit(X_train,y_train)
y_pred = poly_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(mse)


grid_list = {"regressor__C": np.logspace(-3, 3, num=7), "regressor__gamma": np.logspace(-5, 2, num=8)}
grid = RandomizedSearchCV(estimator=poly_regressor, param_distributions=grid_list, n_jobs = -1, random_state = 1, n_iter=50, scoring = 'neg_mean_squared_error', cv = 5)
grid.fit(X_train,y_train)
print(grid.best_score_)
grid.best_params_

param_tuning['poly_regressor'] = (grid.best_score_,grid.best_params_)
np.save('storm_forecast_parameter_tuning.npy', param_tuning)


In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler

tree_regressor = Pipeline([('scale', StandardScaler()),
            ('imputer', Imputer(strategy='median')),
            ('regressor', DecisionTreeRegressor())])

tree_regressor.fit(X_train,y_train)
y_pred = tree_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(mse)


grid_list = {"regressor__max_depth": [None,5,10,15,20,25],'regressor__max_features': [None, 'sqrt', 'log2']}
grid = RandomizedSearchCV(estimator=tree_regressor, param_distributions=grid_list, n_jobs = -1, random_state = 1, n_iter=10, scoring = 'neg_mean_squared_error', cv = get_cv(test,y_true,5))

# Fit against the true targets, not the predictions from the hold-out split
grid.fit(tmp,y_true)

#param_tuning['tree_regressor'] = (grid.best_score_,grid.best_params_)
#np.save('storm_forecast_parameter_tuning.npy', param_tuning)

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

forest_regressor = Pipeline([('scale', StandardScaler()),
            ('imputer', SimpleImputer(strategy='median')),
            ('regressor', RandomForestRegressor())])

forest_regressor.fit(X_train,y_train)
y_pred = forest_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(mse)


grid_list = { 
    'regressor__n_estimators': [10,25,50,100,250,500,1000],
    'regressor__max_features': ['auto', 'sqrt', 'log2'],
    'regressor__max_depth' : [None,5,10,15,20,25,50,75,100],
}
grid = RandomizedSearchCV(estimator=forest_regressor, param_distributions=grid_list, n_jobs = -1, random_state = 1, n_iter=50, scoring = 'neg_mean_squared_error', cv = 5)
grid.fit(X_train,y_train)

param_tuning['forest_regressor'] = (grid.best_score_,grid.best_params_)
np.save('storm_forecast_parameter_tuning_basin+nature.npy', param_tuning)


In [None]:
print(param_tuning)

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

boost_regressor = Pipeline([('scale', StandardScaler()),
            ('imputer', SimpleImputer(strategy='median')),
            ('regressor', GradientBoostingRegressor())])

boost_regressor.fit(X_train,y_train)
y_pred = boost_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(mse)

grid_list = { 
    'regressor__n_estimators': [10,25,50,100,250,500,1000],
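    # 'auto' (all features for a regressor) was removed in scikit-learn 1.3; None is equivalent
    'regressor__max_features': [None, 'sqrt', 'log2'],
    'regressor__max_depth' : [None,3,5,10,15,20,25,50,75,100],
}
# Tune the boosting pipeline defined above (not the random forest one)
grid = RandomizedSearchCV(estimator=boost_regressor, param_distributions=grid_list, n_jobs = -1, random_state = 1, n_iter=50, scoring = 'neg_mean_squared_error', cv = 5)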
grid.fit(X_train,y_train)

param_tuning['boost_regressor'] = (grid.best_score_,grid.best_params_)
np.save('storm_forecast_parameter_tuning_basin+nature.npy', param_tuning)

In [None]:
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

elastic_regressor = Pipeline([('scale', StandardScaler()),
            ('imputer', SimpleImputer(strategy='median')),
            ('regressor', ElasticNet())])

elastic_regressor.fit(X_train,y_train)
y_pred = elastic_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(mse)

In [None]:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

ridge_regressor = Pipeline([('scale', StandardScaler()),
            ('imputer', SimpleImputer(strategy='median')),
            ('regressor', Ridge())])

ridge_regressor.fit(X_train,y_train)
y_pred = ridge_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(mse)

grid_list = { 
    'regressor__alpha': [1, 2, 5, 10, 25, 50, 100, 200, 500, 1000]
}
grid = RandomizedSearchCV(estimator=ridge_regressor, param_distributions=grid_list, n_jobs = -1, random_state = 1, n_iter=50, scoring = 'neg_mean_squared_error', cv = 5)
grid.fit(X_train,y_train)

param_tuning['ridge_regressor'] = (grid.best_score_,grid.best_params_)
np.save('storm_forecast_parameter_tuning.npy', param_tuning)


In [None]:
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

lasso_regressor = Pipeline([('scale', StandardScaler()),
            ('imputer', SimpleImputer(strategy='median')),
            ('regressor', Lasso())])

lasso_regressor.fit(X_train,y_train)
y_pred = lasso_regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(mse)

grid_list = { 
    'regressor__alpha': [.00001, .0001, .001, .01, .1, 1]
}
grid = RandomizedSearchCV(estimator=lasso_regressor, param_distributions=grid_list, n_jobs = -1, random_state = 1, n_iter=50, scoring = 'neg_mean_squared_error', cv = 5)
grid.fit(X_train,y_train)

param_tuning['lasso_regressor'] = (grid.best_score_,grid.best_params_)
np.save('storm_forecast_parameter_tuning.npy', param_tuning)


In [None]:
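from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler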

regressor = Pipeline([('scale', StandardScaler()),
            ('imputer', SimpleImputer(strategy='median')),
            ('regressor', RandomForestRegressor())])

regressor.fit(X_train,y_train)
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(mse)

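# Parameters of a pipeline step must be prefixed with the step name ('regressor__')
grid_list = { 
    'regressor__n_estimators': [50, 100, 200, 300, 400, 500],
    'regressor__max_features': [None, 'sqrt', 'log2'],
    'regressor__max_depth' : [5,10,20],
    # 'gini' and 'entropy' are classification criteria, so no criterion is searched for this regressor
}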

grid = RandomizedSearchCV(estimator=regressor, param_distributions=grid_list, n_jobs = -1, random_state = 1, n_iter=100, scoring = 'neg_mean_squared_error', cv = 5)
print(grid)
grid.fit(X_train,y_train)
print(grid.best_score_)

In [None]:
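import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
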
regressor = Pipeline([('scale', StandardScaler()),
            ('imputer', SimpleImputer(strategy='median')),
            ('regressor', SVR(kernel = 'rbf', gamma = .1, C = 1))])
regressor.fit(X_train,y_train)
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(mse)

grid_list = {"regressor__C": np.logspace(-2, 2, num=5), "regressor__gamma": np.logspace(-4, 0, num=5)}

grid = RandomizedSearchCV(estimator=regressor, param_distributions=grid_list, n_jobs = -1, random_state = 1, n_iter=100, scoring = 'neg_mean_squared_error', cv = 5)
print(grid)
grid.fit(X_train,y_train)
print(grid.best_score_)

#print(grid.best_estimator_)
#print(grid.best_estimator_.regressor__gamma)

### 2. The reanalysis data

At each time step, we extracted 7 grids (11x11 pixels) of meteorological parameters centered on the current storm location. These parameters were chosen based on the forecasting literature, on personal experience, and on known hypotheses about storm strengthening.

#### a) 25x25 degree z, u and v at 700hPa-level
First, we provide 3 maps of 25 x 25 degrees (lat/long) at 700hPa-level pressure: the altitude `z`, the u-wind `u` (positive if wind from the West) and the v-wind `v` (positive if wind from the South). These grids are subsampled to 11x11 pixels (1 pixel ~=2 degrees).


In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
sample_id=20 # sample number plotted - you can change it to see other storms and other instants
grid_l=11 # size of all 2D-grids (in pixels)

In [None]:
params_25x25=['z','u','v']
plt.figure(figsize=(10,4))
for p,param in enumerate(params_25x25):
    image=np.zeros([grid_l,grid_l])
    for i in range(grid_l):
         for j in range(grid_l):
            image[i,j]=data_train[param+'_'+str(i)+'_'+str(j)][sample_id]
    plt.subplot(1,3,p+1)
    plt.imshow(np.array(image),extent=[-12,12,-12,12],
               interpolation='nearest', origin='lower', cmap='seismic')
    plt.xlabel('param '+param)
t=plt.suptitle('Example of 700-hPa level maps 25x25 degrees, centered on the storm location.'
         +'\n (altitude, u-wind and v-wind)')

#### b) 11x11 degree sst, slp, humidity at 1000hPa,  and vorticity at 700hPa
We also provide more localized maps of 11 x 11 degrees (lat/long): the sea surface temperature `sst`, the surface level pressure `slp`, the relative humidity `hum` at 1000hPa (near the surface), and the vorticity at 700hPa `vo700`. These grids are sampled to 11x11 pixels (1 pixel = 1 degree).

NB: `sst` is only defined over the sea, so land pixels contain NaN values.
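
Since each map is stored as 121 flat columns named `<param>_<i>_<j>` (the naming used by the plotting loops in this notebook), it can be convenient to pull out a whole grid at once. The helper below is only a minimal sketch: `grid_to_array` is a hypothetical name, not part of the kit, and the toy DataFrame stands in for `data_train`. It also uses `np.nanmean` as one possible way to summarize `sst` despite the NaNs over land.

```python
import numpy as np
import pandas as pd

GRID_L = 11  # size of all 2D grids (in pixels)


def grid_to_array(df, param, row, grid_l=GRID_L):
    """Return the grid_l x grid_l map of `param` for the positional row `row`.

    Hypothetical helper: assumes columns are named '<param>_<i>_<j>'.
    """
    cols = [f"{param}_{i}_{j}" for i in range(grid_l) for j in range(grid_l)]
    # Columns are listed with i as the outer index, so a row-major reshape gives [i, j]
    return df[cols].iloc[row].to_numpy(dtype=float).reshape(grid_l, grid_l)


# Toy stand-in for data_train: two samples of a fake 'sst' map
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(300.0, 1.0, size=(2, GRID_L * GRID_L)),
    columns=[f"sst_{i}_{j}" for i in range(GRID_L) for j in range(GRID_L)],
)
toy.loc[0, "sst_0_0"] = np.nan  # pretend this pixel is over land

sst_map = grid_to_array(toy, "sst", row=0)
sst_mean = np.nanmean(sst_map)  # mean over the non-NaN (sea) pixels only
print(sst_map.shape, int(np.isnan(sst_map).sum()))
```

Scalar summaries such as this mean are one simple option for feeding the maps to a non-image model; whether they help is something to check by cross-validation.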

In [None]:
params_11x11=['sst','slp','hum','vo700']
plt.figure(figsize=(10,3))
for p,param in enumerate(params_11x11):
    image=np.zeros([grid_l,grid_l])
    for i in range(grid_l):
         for j in range(grid_l):
            image[i,j]=data_train[param+'_'+str(i)+'_'+str(j)][sample_id]
    plt.subplot(1,4,p+1)
    plt.imshow(np.array(image),extent=[-5,5,-5,5],
               interpolation='nearest', origin='lower', cmap='seismic')
    plt.xlabel('param '+param)
t=plt.suptitle('Example of 11x11 degrees maps, centered on the storm location.'
         +'\n (surf. temp., surf. pressure, 1000hPa humidity and 700hPa vorticity)')

## The pipeline

<img src="https://github.com/sophiegif/ramp_kit_storm_forecast_new/blob/master/figures_pynb/pipeline.png?raw=true" width="70%">

To submit to the [RAMP site](http://ramp.studio), you will have to write two classes, each saved in its own file:

* a class `FeatureExtractor` in a `feature_extractor.py` file.
* a class `Regressor` in a `regressor.py` file.

You can look at the simple examples provided in /submissions:
- starting_kit : using only the track data
- starting_kit_pressure_map : using both track data and image data
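
As a rough illustration of what a `feature_extractor.py` file can look like, here is a minimal sketch. The `fit(X_df, y)` / `transform(X_df)` interface and the selected columns are assumptions made for this example; the files in `submissions/starting_kit` remain the reference.

```python
import pandas as pd


class FeatureExtractor:
    """Minimal sketch: keep a few scalar track features as a NumPy array."""

    def __init__(self, columns=("latitude", "windspeed", "dist2land")):
        # The column selection is an assumption made for this example
        self.columns = list(columns)

    def fit(self, X_df, y):
        # Nothing is learned in this sketch
        return self

    def transform(self, X_df):
        # Drop identifiers such as 'stormid' and return a float array
        return X_df[self.columns].to_numpy(dtype=float)


# Toy stand-in for the training DataFrame
toy = pd.DataFrame({
    "stormid": ["A", "A", "B"],
    "latitude": [12.0, 12.5, 20.0],
    "windspeed": [30.0, 35.0, 50.0],
    "dist2land": [400.0, 380.0, 900.0],
})
X = FeatureExtractor().fit(toy, None).transform(toy)
print(X.shape)
```

The array returned by `transform` is what the `Regressor` class then receives in its `fit` and `predict` methods.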

### Using data from previous time steps
You can use the data from previous time steps: for example, to predict the intensity of storm S at t=3 you can use data from S at t=\[0:2\]. However, using future data, such as S at t=4 or later, is strictly forbidden (and we check it!). This is illustrated in the figure below, where the 24h forecast made at time instant 2 (red line) can use the blue features but not the red ones.

- `illegal_lookahead`: this simple submission illustrates the error you will get if you illegally look ahead in time within the same storm.
- `legal_lookbefore`: this simple submission illustrates how to use information from previous time steps of the same storm.

<img src="https://github.com/sophiegif/ramp_kit_storm_forecast_new/blob/master/figures_pynb/illegal_lookahead.png?raw=true" width="70%">
<div style="text-align: center">Data from previous steps are allowed, but data from future steps are forbidden.</div>
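
The rule above can be sketched with pandas as follows. This is only an illustration on a toy DataFrame, not the code of the `legal_lookbefore` submission; the `previous_windspeed` column name is an assumption, while `stormid`, `instant_t` and `windspeed` are columns of the dataset.

```python
import pandas as pd

# Toy stand-in for the training DataFrame: two storms, a few time steps each
toy = pd.DataFrame({
    "stormid": ["A", "A", "A", "B", "B"],
    "instant_t": [0, 1, 2, 0, 1],
    "windspeed": [30.0, 35.0, 45.0, 20.0, 25.0],
})

toy = toy.sort_values(["stormid", "instant_t"])

# shift(1) within each storm: the row at time t only sees the value at t-1 of
# the same storm. A negative shift would read the future and is forbidden.
toy["previous_windspeed"] = toy.groupby("stormid")["windspeed"].shift(1)

# The first time step of each storm has no past: fall back to the current value
toy["previous_windspeed"] = toy["previous_windspeed"].fillna(toy["windspeed"])

print(toy)
```

Grouping by `stormid` before shifting is what prevents the last instant of one storm from leaking into the first instant of the next one.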

## Evaluation
Submissions are evaluated by cross-validation. The metric used is the RMSE (root mean square error), in knots, over all storm time instants. Three other metrics are also displayed: `mae` is the mean absolute error, in knots; `mae_hurr` is the MAE computed only on time instants corresponding to hurricanes (windspeed > 64 knots); and `rel_mae_hurr` is the relative MAE on hurricanes. These metrics are of interest because the current forecasting practice is to exclude all other stages of development (e.g., extratropical, tropical wave...); see [this page](https://www.nhc.noaa.gov/verification/verify5.shtml?).
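
To make these metrics concrete, here is a small NumPy sketch. It is not the scoring code used by the RAMP backend: `storm_metrics` is a hypothetical helper, and the exact definition of the relative error on hurricanes (absolute error divided by the true windspeed) is an assumption.

```python
import numpy as np

HURRICANE_THRESHOLD = 64  # knots, threshold quoted in the text above


def storm_metrics(y_true, y_pred):
    """Return RMSE, MAE and hurricane-only errors for windspeeds in knots."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    hurr = y_true > HURRICANE_THRESHOLD  # time instants that are hurricanes
    return {
        "rmse": np.sqrt(np.mean(err ** 2)),
        "mae": np.mean(np.abs(err)),
        "mae_hurr": np.mean(np.abs(err[hurr])),
        # Assumption: "relative" means absolute error divided by true windspeed
        "rel_mae_hurr": np.mean(np.abs(err[hurr]) / y_true[hurr]),
    }


scores = storm_metrics([30, 50, 70, 100], [35, 45, 80, 90])
print(scores)
```

Note that the two hurricane-only scores are undefined (NaN) when no sample exceeds the threshold.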

## Testing the submission
You can test the pipeline locally with the `ramp_test_submission` command (`-h` lists all options). To do so, open a terminal in your `storm_forecast/` folder and type `ramp_test_submission --submission starting_kit`. You can then copy a submission example to `submissions/<YOUR_SUBMISSION_NAME>/` and modify its code as you want. Finally, test it on your computer with `ramp_test_submission --submission <YOUR_SUBMISSION_NAME>`.

If you see the train and test scores and no errors, you can submit your model to ramp.studio.

## Some warnings when building the model

<div class="alert alert-danger">

 <ul>
  <li>If you want to use the features from previous time steps in your learning (for example using LSTMs), you will have to use the 'stormid' and the 'instant_t' columns. Moreover, you will have to handle the first time steps separately, since they come with no past data.</li>
  <li>The intensity value to predict is the max windspeed. However, this value was measured empirically with a precision of ~5 knots.</li>
</ul> 

</div>

## Submitting to the online challenge: ramp.studio

Once you have found a good model, you can submit it to [ramp.studio](http://www.ramp.studio) to enter the online challenge. If this is your first time using the RAMP platform, [sign up](http://www.ramp.studio/sign_up); otherwise [log in](http://www.ramp.studio/login). Then sign up for the event [storm_forecast_CI2018](http://www.ramp.studio/events/storm_forecast_CI2018). Both signups are controlled by RAMP administrators, so there **can be a delay between asking for signup and being able to submit**.

Once your signup request is accepted, you can go to your [sandbox](http://www.ramp.studio/events/storm_forecast_CI2018/sandbox) and copy-paste (or upload) [`feature_extractor.py`](/edit/submissions/starting_kit/feature_extractor.py) and [`regressor.py`](/edit/submissions/starting_kit/regressor.py). Save your submission, rename it, then submit it. The submission is trained and tested on our backend in the same way as `ramp_test_submission` does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in [my submissions](http://www.ramp.studio/events/storm_forecast_CI2018/my_submissions). Once it is trained, your submission shows up on the [public leaderboard](http://www.ramp.studio/events/storm_forecast_CI2018/leaderboard).
If there is an error (despite having tested your submission locally with `ramp_test_submission`), it will show up in the "Failed submissions" table in [my submissions](http://www.ramp.studio/events/storm_forecast_CI2018/my_submissions). You can click on the error to see part of the trace.

After submission, do not forget to give credit to the previous submissions you reused or integrated into your submission.

The data set we use on the backend is usually different from what you find in the starting kit, so the score may be different.

The official score in this RAMP (the first score column after "historical contributivity" on the [leaderboard](http://www.ramp.studio/events/storm_forecast_CI2018/leaderboard)) is the RMSE.

## More information

You can find more information in the [README](https://github.com/paris-saclay-cds/ramp-workflow/blob/master/README.md) of the [ramp-workflow library](https://github.com/paris-saclay-cds/ramp-workflow).

## Contact

Don't hesitate to [contact us](mailto:admin@ramp.studio?subject=Storm forecast CI2018 ramp).

In [None]:
# Homemade forest parameter tuning with multiprocessing
import itertools
from multiprocessing import Pool

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

N_SPLITS = 5


def forest_search(max_depth, max_features, n_estimators):
    """Cross-validate one parameter combination and return (combo, RMSE)."""
    # Drop 'stormid' before training; get_cv still receives the full frame
    features = all_features.drop(columns = ['stormid'])
    mse_sum = 0
    for train_index, test_index in get_cv(all_features,y_true,N_SPLITS):
        forest_regressor = Pipeline([('imputer', SimpleImputer(strategy='median')),
            ('scale', StandardScaler()),
            ('regressor', RandomForestRegressor(max_depth = max_depth, max_features = max_features, n_estimators = n_estimators))])

        X_train, X_test = features.iloc[train_index,:], features.iloc[test_index,:]
        y_train, y_test = y_true[train_index], y_true[test_index]

        forest_regressor.fit(X_train,y_train)
        y_pred = forest_regressor.predict(X_test)
        mse_sum += mean_squared_error(y_test, y_pred)

    return (max_depth, max_features, n_estimators), np.sqrt(mse_sum/N_SPLITS)

if __name__ == '__main__':
    max_depth = [5,10,15,20,25]
    max_features = ['sqrt', 'log2', None]
    n_estimators = [10,25,50,100,250,500]

    # One task per combination of the grid (zip would only pair the lists element-wise)
    cart_prod = list(itertools.product(max_depth,max_features,n_estimators))
    with Pool(6) as p:
        results = p.starmap(forest_search, cart_prod)

    # Map each parameter combination to its cross-validated RMSE
    forest_dict = dict(results)
    print('Elements in Dict:', len(forest_dict))

In [None]:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator
from sklearn.svm import SVR

class Regressor(BaseEstimator):
    def __init__(self, kernel = 'rbf', gamma = .1, C = 1):
        # Store every constructor argument so that BaseEstimator.get_params() works
        self.kernel = kernel
        self.gamma = gamma
        self.C = C
        self.reg = Pipeline([('scale', StandardScaler()),
            ('imputer', SimpleImputer(strategy='median')),
            ('regressor', SVR(kernel = kernel, gamma = gamma, C = C))                 
        ])

    def fit(self, X, y):
        self.reg.fit(X, y)
        return self

    def predict(self, X):
        print("Kernel:", self.kernel)
        return self.reg.predict(X)


In [None]:
from sklearn import preprocessing
import itertools

# This actually decreases performance:
'''
data_train['previous_windspeed'] = data_train['windspeed'].shift(1)
data_train['previous_windspeed'].loc[data_train['stormid'] != data_train['stormid'].shift(1)] = np.nan
data_train['previous_windspeed'].loc[data_train['previous_windspeed'] == np.nan] = data_train['windspeed']
data_train['previous_windspeed'].fillna(method='bfill', inplace = True)
'''

# This is an illegal lookahead???:
'''
data_train['previous_y'] = y_true
data_train['previous_y'] = data_train['previous_y'].shift(1)
data_train['previous_y'].loc[data_train['stormid'] != data_train['stormid'].shift(1)] = np.nan
data_train['previous_y'].fillna(method='bfill', inplace = True)
'''

# Manually add dummy variables for the basins:
extra_dummy = pd.DataFrame(data = np.zeros((len(data_train),8)), columns = ['basin_7', 'basin_8', 'basin_9', 'basin_10', 'basin_11', 'basin_12', 'basin_13', 'basin_14'])

# For "illegal lookahead" features:
#zero_d = pd.concat([data_train.loc[:,'stormid'], data_train.loc[:,'latitude':'max_wind_change_12h'], data_train.loc[:,'dist2land'], pd.get_dummies(data_train.nature, prefix = 'nature'), pd.get_dummies(data_train.basin, prefix = 'basin'), extra_dummy, data_train.loc[:,'previous_y']], axis = 1)

# For basin and nature features:
zero_d = pd.concat([data_train.loc[:,'stormid'], data_train.loc[:,'latitude':'max_wind_change_12h'], data_train.loc[:,'dist2land'], pd.get_dummies(data_train.nature, prefix = 'nature'), pd.get_dummies(data_train.basin, prefix = 'basin'), extra_dummy], axis = 1)

# For basin, nature, and previous windspeed features:
#zero_d = pd.concat([data_train.loc[:,'stormid'], data_train.loc[:,'latitude':'max_wind_change_12h'], data_train.loc[:,'dist2land'], pd.get_dummies(data_train.nature, prefix = 'nature'), pd.get_dummies(data_train.basin, prefix = 'basin'), extra_dummy, data_train.loc[:,'previous_windspeed']], axis = 1)

# For nature feature:
#zero_d = pd.concat([data_train.loc[:,'latitude':'max_wind_change_12h'], data_train.loc[:,'dist2land'], pd.get_dummies(data_train.nature, prefix = 'nature')], axis = 1)

all_features = pd.concat([data_train.loc[:,'stormid'], data_train.loc[:,'latitude':'max_wind_change_12h'],pd.get_dummies(data_train.nature, prefix = 'nature'), pd.get_dummies(data_train.basin, prefix = 'basin'), extra_dummy, data_train.loc[:,'dist2land':] ], axis = 1)

zero_d_uvz = pd.concat([zero_d, data_train[[col for col in data_train.columns if col.startswith(('z_','u_','v_'))]]], axis = 1)

zero_d_sst= pd.concat([zero_d, data_train[[col for col in data_train.columns if col.startswith(('sst_','slp_','hum_'))]]], axis = 1)
#print(all_features.loc[:,'sst_0_0':'sst_10_10'])

