# 3.5 Hyperparameter Tuning

The hyperparameters that work best for a specific problem can be found automatically by searching the hyperparameter space. The type of algorithm is fixed during this search; only its hyperparameters vary.

There are built-in toolkits to perform the hyperparameter search.
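Before turning to the built-in tools, it can help to see what they automate. The following is a minimal sketch, assuming a hand-written loop over a few illustrative values of a single hyperparameter (``gamma`` of an ``SVC``) scored with 5-fold cross-validation on the scikit-learn digits dataset. The candidate values and the names ``gammas``, ``mean_scores``, and ``best_gamma`` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Small built-in dataset, used here only for illustration.
X, y = load_digits(return_X_y=True)

# Candidate values for one hyperparameter (illustrative choices).
gammas = [1e-4, 1e-3, 1e-2]

# Mean 5-fold cross-validated accuracy for each candidate.
mean_scores = []
for gamma in gammas:
    scores = cross_val_score(SVC(gamma=gamma), X, y, cv=5)
    mean_scores.append(scores.mean())

# Keep the candidate with the highest mean score.
best_gamma = gammas[int(np.argmax(mean_scores))]
print(dict(zip(gammas, mean_scores)))
print("best gamma:", best_gamma)
```

With more than one hyperparameter the loop becomes nested and the bookkeeping grows quickly, which is the part the built-in search classes take over.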

In [None]:
# basic tools
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import wget
import h5py
import pandas as pd
import os

In this example, we will use the classification example of event discrimination using seismic data.

In [None]:
wget.download("https://www.dropbox.com/s/qwo2rh5zqwho1l5/miniPNW_metadata.csv?dl=1")
wget.download("https://www.dropbox.com/s/ie34kfu33d2jv9m/miniPNW_waveforms.hdf5?dl=1")
os.replace("miniPNW_metadata.csv","../../miniPNW_metadata.csv")
os.replace("miniPNW_waveforms.hdf5","../../miniPNW_waveforms.hdf5")

Read the metadata. We will use ``source_type`` as the categorical class into which we aim to classify the data.

In [None]:

# read the metadata
df = pd.read_csv("../../miniPNW_metadata.csv")
df

In [None]:
list(df.keys())

Create the array of labels

In [None]:
labels = df['source_type']

How much data is there?

In [None]:
df.describe()

How many classes are there?

In [None]:
print(np.unique(labels))
print(len(np.unique(labels)))
plt.hist(labels)

There are 6 classes. The data are quite imbalanced because seismic networks preferentially pick earthquakes.
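The imbalance can be quantified rather than read off the histogram. The sketch below uses a toy label series in place of ``df['source_type']``; the class names and counts in ``toy_labels`` are made up for illustration and are not the actual contents of the metadata file.

```python
import pandas as pd

# Toy labels standing in for df['source_type']; names and counts are made up.
toy_labels = pd.Series(["earthquake"] * 8 + ["explosion"] * 2)

# Number of samples per class, and the same numbers as fractions of the total.
counts = toy_labels.value_counts()
fractions = toy_labels.value_counts(normalize=True)
print(counts)
print(fractions)
```

Applied to the real labels, the same two calls give the per-class counts and proportions.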

Now we read the data. It is stored in an HDF5 file under a finite number of groups. Each group has an array of datasets that correspond to the waveforms. To link the metadata to the waveform file, the key ``trace_name`` holds the dataset ID. The address is formatted as follows:

bucketX$i,:3,:n

where X is the HDF5 group number, i is the index of the waveform within that group, 3 is the number of components, and n is the number of time samples.
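As a worked example of this address format, the sketch below parses a made-up trace name into its parts. The helper ``parse_trace_name`` and the string ``"bucket4$12,:3,:15001"`` are hypothetical and not taken from the dataset. The last two fields are interpreted as the component and sample counts, which matches how ``read_data`` below uses them as slice bounds.

```python
def parse_trace_name(trace_name):
    """Split 'bucketX$i,:3,:n' into (bucket, index, n_components, n_samples)."""
    # The '$' separates the HDF5 group name from the array address.
    bucket, address = trace_name.split("$")
    # The address fields are separated by ',:'.
    index, n_components, n_samples = (int(s) for s in address.split(",:"))
    return bucket, index, n_components, n_samples


# Made-up trace name that follows the pattern above.
example = "bucket4$12,:3,:15001"
print(parse_trace_name(example))
```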


In [None]:
f = h5py.File("../../miniPNW_waveforms.hdf5", "r")

In [None]:
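def read_data(tn,f):
    # tn has the form "bucketX$i,:3,:n": split it into the group name and the indices
    bucket, narray = tn.split('$')
    x, y, z = [int(i) for i in narray.split(',:')]
    # read trace x from the group: the first y components and the first z samples
    data = f['/data/%s' % bucket][x, :y, :z]
    return data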

In [None]:
ldata=list(df['trace_name'])

In [None]:
waveform=read_data(ldata[40],f)
print(waveform.shape)

In [None]:
plt.plot(waveform[2,:])

We will extract only the Z component of each waveform and stack them into a single array.

In [None]:
nt=waveform.shape[-1]
ndata=len(labels)
print(ndata,nt)
Z=np.zeros(shape=(ndata,nt))
for i in range(ndata):  # loop over every trace, including the last one
    # print(df.iloc[i]["trace_name"])
    print("Done at %f"%(i/ndata))
    Z[i,:]=read_data(df.iloc[i]["trace_name"],f)[2,:nt]


As a backup, we use the scikit-learn digits dataset, a small MNIST-like set of handwritten digits.

In [None]:
import numpy as np
from sklearn.datasets import load_digits,fetch_openml
digits = load_digits()
digits.keys()

In [None]:
# explore data type
data,y = digits["data"].copy(),digits["target"].copy()
print(type(data[0][:]),type(y[0]))
# note that we do not modify the raw data that is stored on the digits dictionary.

In [None]:
print(min(data[0]),max(data[0]))
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
scaler = MinMaxScaler()
scaler.fit(data) # fit the scaler: learn the min and max of each feature
newdata = scaler.transform(data) # rescale each feature to the range [0, 1]

# Split data into 80% train and 20% test subsets
# note: the split uses the unscaled `data`; pass `newdata` instead to train on the normalized features
print(f"There are {data.shape[0]} data samples")
X_train, X_test, y_train, y_test = train_test_split(
    data, y, test_size=0.2, shuffle=False)


In [None]:
import sklearn
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Support Vector Machine classifier
clf = SVC(gamma=0.001) # model design
clf.fit(X_train, y_train) # learn
svc_prediction = clf.predict(X_test) # predict on test
print("SVC Accuracy:", metrics.accuracy_score(y_true=y_test ,y_pred=svc_prediction))


What are the parameters we are trying to optimize?

In [None]:
clf.get_params()

A search consists of:

* an estimator (regressor or classifier such as ``SVC()``);

* a parameter space;

* a method for searching or sampling candidates;

* a cross-validation scheme; and

* a score function.
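The ingredients listed above map onto the arguments of a scikit-learn search object. The following is a minimal sketch, assuming the digits dataset, a two-value grid for ``gamma``, and 3-fold stratified cross-validation; all of these choices, and the name ``search_sketch``, are illustrative and not the settings used later in this section.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

search_sketch = GridSearchCV(
    estimator=SVC(),                     # the estimator
    param_grid={"gamma": [1e-3, 1e-2]},  # the parameter space (illustrative values)
    cv=StratifiedKFold(n_splits=3),      # the cross-validation scheme
    scoring="accuracy",                  # the function used to score candidates
)
# The search method is implied by the class: GridSearchCV tries every combination.
search_sketch.fit(X, y)
print(search_sketch.best_params_, search_sketch.best_score_)
```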

There are two main approaches:
- Grid search cross-validation. It performs the search by brute force: every combination of values in a user-defined parameter grid is evaluated using cross-validation. The scikit-learn function is ``GridSearchCV``. More details [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).
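Because grid search is exhaustive, its cost can be estimated before running it. The sketch below counts the candidates in the grid used in the next cell (copied here under the hypothetical name ``param_grid_sketch``) with ``ParameterGrid``, then multiplies by the number of folds. It assumes 5-fold cross-validation and ignores the final refit on the full training set.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the next cell, copied so that this sketch is self-contained.
param_grid_sketch = [
    {"C": [1, 5, 10, 50, 100, 500, 1000], "kernel": ["linear"]},
    {"C": [1, 5, 10, 50, 100, 500, 1000], "gamma": [0.01, 0.001, 0.0001], "kernel": ["rbf"]},
]

# 7 linear candidates plus 7 * 3 rbf candidates.
n_candidates = len(ParameterGrid(param_grid_sketch))
n_folds = 5
print(n_candidates, "candidates,", n_candidates * n_folds, "fits")
```

Adding one more hyperparameter with k values multiplies the count for that sub-grid by k, which is why grid search becomes expensive quickly.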



In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = [
  {'C': [1, 5,10,50, 100,500, 1000], 'kernel': ['linear']},
  {'C': [1,5, 10,50, 100,500, 1000], 'gamma': [0.01,0.001, 0.0001], 'kernel': ['rbf']},
 ]

In [None]:
search = GridSearchCV(clf, param_grid, cv=5,verbose=3)

In [None]:
search.fit(X_train, y_train) # learn

In [None]:
search.get_params()

In [None]:
search.cv_results_

In [None]:
search.best_params_

The second approach is random search cross-validation.
Instead of trying every combination, it samples a fixed number of candidates from the parameter space and evaluates each one using cross-validation. One has to define the parameter space, either as lists of values or as distributions to sample from. The scikit-learn function is ``RandomizedSearchCV``. More details [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html).
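To see what sampling candidates from distributions looks like, the sketch below draws a few parameter settings with ``ParameterSampler``, the scikit-learn utility for sampling from a parameter space. The name ``space`` and the choice of 5 samples are assumptions made for illustration.

```python
from scipy.stats import loguniform, uniform
from sklearn.model_selection import ParameterSampler

# C is uniform on [1, 1001]; gamma is log-uniform on [1e-4, 1e-2].
space = {"C": uniform(loc=1, scale=1000), "gamma": loguniform(1e-4, 1e-2)}

# Draw 5 candidate settings; random_state makes the draw reproducible.
samples = list(ParameterSampler(space, n_iter=5, random_state=0))
for s in samples:
    print(s)
```

The log-uniform distribution spreads the samples evenly across orders of magnitude, which suits scale-like parameters such as ``gamma``.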


In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, loguniform

# distributions to sample from: C uniform on [1, 1001], gamma log-uniform on [1e-4, 1e-2]
distributions = [
  {'C': uniform(loc=1, scale=1000), 'kernel': ['linear']},
  {'C': uniform(loc=1, scale=1000), 'gamma': loguniform(1e-4, 1e-2), 'kernel': ['rbf']},
 ]
# n_iter is left at its default of 10 sampled candidates
clf2 = RandomizedSearchCV(clf, distributions, random_state=0, cv=5)
clf2.fit(X_train, y_train)

Compare the best parameters found by the two searches.

In [None]:
print(search.best_params_)
print(clf2.best_params_)
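The best parameters alone do not say which search did better. A more informative comparison looks at the cross-validated score of each winner and at its accuracy on the held-out test set. The following is a self-contained sketch, assuming the digits dataset and a deliberately small search over ``gamma`` only so that it runs quickly; the names ``grid`` and ``rand`` and the search spaces are illustrative and differ from the searches above.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)

# Small illustrative searches over gamma only.
grid = GridSearchCV(SVC(), {"gamma": [1e-4, 1e-3, 1e-2]}, cv=3).fit(X_tr, y_tr)
rand = RandomizedSearchCV(
    SVC(), {"gamma": loguniform(1e-4, 1e-2)}, n_iter=5, cv=3, random_state=0
).fit(X_tr, y_tr)

# best_score_ is the mean cross-validated score of the best candidate;
# score() evaluates the refitted best estimator on the held-out test set.
for name, s in [("grid", grid), ("random", rand)]:
    print(name, s.best_params_,
          "cv score: %.3f" % s.best_score_,
          "test accuracy: %.3f" % s.score(X_te, y_te))
```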