# Hurricane meteorological data

### Credit: Sophie Giffard-Roisin

We will use a real hurricane meteorological dataset, whose typical goal is to estimate the current strength of a hurricane or to predict its evolution.
<img src="https://github.com/sophiegif/ramp_kit_storm_forecast_new/blob/master/figures_pynb/all_storms_since1979_IBTrRACKS_newcats.png?raw=true" width="70%">
<div style="text-align: center">Database: tropical/extra-tropical storm tracks since 1979. Dots = initial position, color = maximal storm strength according to the Saffir-Simpson scale.</div>


### Requirements

* numpy  
* matplotlib
* pandas 
* scikit-learn   

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pathlib import Path

## Loading the data

Load the training data, keep `windspeed` as the target, and drop the columns `windspeed`, `stormid` and `instant_t` from the features.

In [None]:
data_p = Path("./data")
columns=['windspeed', 'stormid', 'instant_t']
target = 'windspeed'
df = ?
y_tr = ?
X_tr = ?
X_tr.head(5)
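
As a hedged illustration of the target/feature split described above, here is a minimal sketch on a toy table. It does not load the real files: the `latitude` and `longitude` columns and all the `toy_*` names are hypothetical and only stand in for the actual storm features.

```python
import pandas as pd

# Toy stand-in for the storm table; 'latitude' and 'longitude' are
# hypothetical feature names used only for this illustration.
toy_df = pd.DataFrame({
    "stormid": ["a", "a", "b"],
    "instant_t": [0, 1, 0],
    "windspeed": [30.0, 45.0, 25.0],
    "latitude": [12.5, 13.1, 20.0],
    "longitude": [-40.2, -41.0, -60.3],
})

toy_target = "windspeed"
toy_dropped = ["windspeed", "stormid", "instant_t"]

toy_y = toy_df[toy_target]                # target vector
toy_X = toy_df.drop(columns=toy_dropped)  # remaining feature columns

print(toy_X.columns.tolist())
```

The same two lines (select the target, then `drop(columns=...)`) should carry over to the real DataFrame once it is loaded.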

For the meaning of the columns, refer to this notebook: https://github.com/ramp-kits/storm_forecast/blob/master/storm_forecast_starting_kit.ipynb

Load also the test data:

In [None]:
?

## Standardize your data

In [None]:
from sklearn.preprocessing import StandardScaler

For every feature, transform the data so that mean=0 and std=1 on the training data, and reuse the same parameters to transform the test data.

In [None]:
scaler = StandardScaler().fit(X_tr)
X_tr = scaler.transform(X_tr)
X_ts = scaler.transform(X_ts)
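
To see why the scaler is fitted on the training data only, here is a small sketch on synthetic data (the `toy_*` arrays are made up for illustration). The training set ends up with exactly zero mean and unit standard deviation, while the test set is only approximately standardized because it reuses the training parameters.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic train/test sets drawn from the same distribution
toy_train = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
toy_test = rng.normal(loc=5.0, scale=3.0, size=(50, 4))

# Mean and std are estimated on the training set only
toy_scaler = StandardScaler().fit(toy_train)
toy_train_std = toy_scaler.transform(toy_train)
toy_test_std = toy_scaler.transform(toy_test)

print(toy_train_std.mean(axis=0).round(3), toy_train_std.std(axis=0).round(3))
print(toy_test_std.mean(axis=0).round(3), toy_test_std.std(axis=0).round(3))
```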

## Principal Component analysis

Use [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) to transform the training data so as to keep 95% of the explained variance. Then transform the test data with the same fitted PCA.

In [None]:
from sklearn.decomposition import PCA

1) Fit the PCA on the training features:

In [None]:
pca = PCA(svd_solver='full')
pca.fit(X_tr)

2) Calculate the cumulative explained variance (you can use the np.cumsum function) and determine how many modes are necessary in order to keep 95% of the explained variance:

In [None]:
cumsum_var = np.cumsum(pca.explained_variance_)
thresh = 0.95 * cumsum_var[-1]  # 95% of the total variance
for i, c in enumerate(cumsum_var):
    if c > thresh:
        num_modes = i + 1  # i is a 0-based index, so i + 1 modes are kept
        break

Now we can plot it:

In [None]:
plt.figure()
plt.plot(cumsum_var)
plt.axhline(thresh)  # horizontal line at the 95% threshold, across the whole axis
plt.show()
print('Number of modes:' + str(num_modes))

3) Create a reduced feature matrix X_tr_pca (and then X_ts_pca for the test data) using the number of modes found. You may need to create a second 'pca' instance.

In [None]:
pca = PCA(svd_solver='full', n_components=num_modes)
pca.fit(X_tr)
X_tr_pca=pca.transform(X_tr)
X_ts_pca=pca.transform(X_ts)
print('Number of feature dimensions:'+ str(len(X_tr_pca[0])) )
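
As a possible shortcut, scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` (with `svd_solver='full'`) and then picks the number of components itself. The sketch below, on synthetic data with hypothetical `toy_*` names, compares that option with a manual count based on `explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 10 features with decreasing variance
toy_X = rng.normal(size=(300, 10)) * np.linspace(5.0, 0.5, 10)

# Manual count: smallest k whose cumulative variance ratio exceeds 95%
full_pca = PCA(svd_solver="full").fit(toy_X)
ratio_cumsum = np.cumsum(full_pca.explained_variance_ratio_)
k_manual = int(np.searchsorted(ratio_cumsum, 0.95, side="right") + 1)

# Automatic count: let PCA choose the number of components
auto_pca = PCA(n_components=0.95, svd_solver="full").fit(toy_X)

print(k_manual, auto_pca.n_components_)
```

Both counts should agree, so the float form can replace the explicit loop once the manual computation is understood.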

We can now plot the first two modes of X_tr_pca, with y as the color label.

In [None]:
plt.figure(figsize=(16,8))
ax = plt.subplot(1,2,1)
plt.scatter(np.transpose(X_tr_pca)[0], np.transpose(X_tr_pca)[1], c=y_tr, s=1, cmap='jet')
plt.colorbar()

ax2 = plt.subplot(1,2,2)
plt.scatter(np.transpose(X_tr_pca)[0], np.transpose(X_tr_pca)[1], c=y_tr, s=1, cmap='jet')
ax2.set_title('Without outliers')
plt.xlim([-4,4])
col = plt.colorbar()
t = plt.suptitle('PCA modes 0 and 1, color = y (hurricane windspeed, knots):')

## Non-linear methods: Multidimensional scaling (MDS) and Isomap

MDS performs a non-linear dimensionality reduction by preserving the (Euclidean) distances between points. [Isomap](https://scikit-learn.org/stable/modules/manifold.html#isomap) is an extension of MDS in which the geodesic distances are preserved. What is the geodesic distance?
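
One way to picture the difference between the two distances is the toy sketch below. It places points on a half circle and compares the straight-line (Euclidean) distance between the two ends with the shortest path through a nearest-neighbour graph, which approximates the distance along the curve. This is only an illustration of the idea, assuming a 2-nearest-neighbour graph; it is not a description of Isomap's exact internals.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.neighbors import kneighbors_graph

# 50 points on a half circle of radius 1
theta = np.linspace(0.0, np.pi, 50)
arc = np.column_stack([np.cos(theta), np.sin(theta)])

# Straight-line distance between the two ends of the arc
euclidean = np.linalg.norm(arc[0] - arc[-1])

# Distance along the curve, approximated by the shortest path
# through a graph linking each point to its 2 nearest neighbours
knn = kneighbors_graph(arc, n_neighbors=2, mode="distance")
graph_dist = shortest_path(knn, method="D", directed=False)
geodesic = graph_dist[0, -1]

print(euclidean, geodesic)
```

The Euclidean distance is the diameter of the circle, while the graph distance is close to the length of the half circle.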

In [None]:
from sklearn.manifold import MDS, Isomap
import warnings
from scipy.sparse import SparseEfficiencyWarning
warnings.simplefilter('ignore',SparseEfficiencyWarning)


Nsamples = 1000
X_tr_small = X_tr[:Nsamples]

First, apply MDS with 2 components to the reduced training data X_tr_small and save the result as X_tr_MDS. Do the same with Isomap to create X_tr_isomap. Verify that their shapes are Nsamples x 2. We use fewer samples (1000) in order to reduce the computing time.

In [None]:
embedding_MDS = MDS(n_components=2)
X_tr_MDS = embedding_MDS.fit_transform(X_tr_small)

embedding_isomap= Isomap(n_components=2)
X_tr_isomap = embedding_isomap.fit_transform(X_tr_small)

print(X_tr_MDS.shape, X_tr_isomap.shape)

And now we can plot them.

In [None]:
plt.figure(figsize=(10,5))
ax = plt.subplot(1,2,1)
ax.set_title('MDS with 2 components')
plt.scatter(np.transpose(X_tr_MDS)[0], np.transpose(X_tr_MDS)[1],c=y_tr[:Nsamples], s=1, cmap='jet')
c=plt.colorbar()

ax2 = plt.subplot(1,2,2)
ax2.set_title('Isomap with 2 components')
plt.scatter(np.transpose(X_tr_isomap)[0], np.transpose(X_tr_isomap)[1], c=y_tr[:Nsamples], s=1, cmap='jet')
c2=plt.colorbar()


t = plt.suptitle('MDS and Isomap, color = y (hurricane windspeed, knots):')

On the Isomap plot, we can perhaps distinguish the trajectories of individual hurricanes (one hurricane has between 5 and 100 time steps, and every time step is a different point here).

## ARD: Regression with automatic dimensionality reduction

Choose a smaller number of samples in order to reduce the computational time...

In [None]:
from sklearn.linear_model import ARDRegression
Nsamples = 500

Fit an ARD instance with part of the training samples (e.g. 500; you can shuffle them first to get a better result):

In [None]:
clf = ARDRegression(compute_score=True)

?

We can plot the values of the feature weights and their histogram.

In [None]:
plt.figure(figsize=(12, 5))
# Left panel: one weight per feature
ax = plt.subplot(1,2,1)
plt.title("Weights of the model")
plt.plot(clf.coef_, color='darkblue', linestyle='-', linewidth=2,
         label="ARD estimate")
plt.xlabel("Features")
plt.ylabel("Values of the weights")
plt.legend(loc=1)

# Right panel: distribution of the weight values (log-scaled counts)
ax2 = plt.subplot(1,2,2)
plt.title("Histogram of the weights")
plt.hist(clf.coef_, bins=len(X_tr[0]), color='navy', log=True)
plt.ylabel("Number of features")
plt.xlabel("Values of the weights")

The first figure shows which features are important for this task. The second shows that many of the weights are 0.
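
The same behaviour can be reproduced on synthetic data where the relevant features are known in advance. The sketch below is an illustration only (the `toy_*` data and `true_w` weights are made up): 10 features are generated, but only two of them influence the target, and ARD is expected to shrink the other weights towards 0.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(200, 10))

# Only features 0 and 3 carry signal; the other 8 are irrelevant
true_w = np.zeros(10)
true_w[0], true_w[3] = 3.0, -2.0
toy_y = toy_X @ true_w + 0.1 * rng.normal(size=200)

toy_ard = ARDRegression().fit(toy_X, toy_y)
print(np.round(toy_ard.coef_, 2))
```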


In [None]:
from sklearn.metrics import mean_absolute_error

Now, estimate the mean absolute error on the test set:

In [None]:
y_est = clf.predict(X_tr_test)
mae_ARD = mean_absolute_error(y_tr_test, y_est)
print(mae_ARD)
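
For reference, the mean absolute error is simply the average of the absolute differences between the true and predicted values. A tiny sketch with made-up wind speeds (the `toy_*` values are hypothetical) checks this against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true and predicted wind speeds (knots)
toy_true = np.array([30.0, 45.0, 60.0, 25.0])
toy_pred = np.array([35.0, 40.0, 70.0, 25.0])

mae_manual = np.mean(np.abs(toy_true - toy_pred))
mae_sklearn = mean_absolute_error(toy_true, toy_pred)

print(mae_manual, mae_sklearn)
```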

## Other methods

You can experiment with other dimensionality reduction methods on the same dataset (or on others), starting from:

In [None]:
from sklearn.random_projection import johnson_lindenstrauss_min_dim # for deciding whether to use it or not
from sklearn.random_projection import GaussianRandomProjection
from sklearn.manifold import SpectralEmbedding
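
As a starting point, here is a hedged sketch of how the first two imports could be combined on synthetic data (the `toy_X` matrix and the tolerance `eps=0.5` are arbitrary choices for illustration). The Johnson-Lindenstrauss bound gives a number of dimensions that is safe for a given distortion; a random projection only makes sense when that number is smaller than the original number of features.

```python
import numpy as np
from sklearn.random_projection import (
    GaussianRandomProjection,
    johnson_lindenstrauss_min_dim,
)

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(1000, 500))  # synthetic high-dimensional data

# Minimum dimension that preserves pairwise distances within eps
min_dim = johnson_lindenstrauss_min_dim(n_samples=toy_X.shape[0], eps=0.5)
print(min_dim, toy_X.shape[1])

# The bound is below the number of features, so we can project
projector = GaussianRandomProjection(n_components=int(min_dim), random_state=0)
toy_X_rp = projector.fit_transform(toy_X)
print(toy_X_rp.shape)
```

With a smaller `eps` the bound grows quickly and may exceed the number of features, in which case a random projection brings no reduction.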