# MAST-ML Hyperparameter Optimization Activity


---
This activity is a way to learn more about hyperparameter optimization. This notebook builds off of the notebook used in the "MASTML Workflows" activity, so roughly the first half is the same while we set up the dataset.

During the activity we'll be working with the MLPRegressor model from scikit-learn. It may be useful to have the documentation for that model open as a reference, as it will inform our decision making along the way.

The overall goal is to explore how we might go from the default MLPRegressor provided by scikit-learn to a model that we think performs best for our data.

**Note: all sections before Section 5 are the same as in the "workflow activity", so if you've worked through that recently, feel free to execute those cells and skip over the details. Notice how, with this way of using MAST-ML, we can copy and paste previous notebooks to build off of previous ideas and explore new concepts.**

## Section 1: Setting up our Google Colab Environment
---
Before running any code we first need to install MAST-ML as well as its dependencies into the Colab environment.


Clone the MAST-ML code into the content directory to the left. You should be able to see a new "MAST-ML" directory after running this cell.

In [1]:
!git clone --single-branch --branch skunkworks_s21 https://github.com/uw-cmg/MAST-ML

Cloning into 'MAST-ML'...
remote: Enumerating objects: 344, done.
remote: Counting objects: 100% (344/344), done.
remote: Compressing objects: 100% (80/80), done.
remote: Total 18544 (delta 279), reused 300 (delta 258), pack-reused 18200
Receiving objects: 100% (18544/18544), 131.39 MiB | 22.63 MiB/s, done.
Resolving deltas: 100% (12340/12340), done.


Next, we install the required dependencies of MAST-ML into our Colab session.

In [2]:
!pip install -r MAST-ML/requirements.txt
!pip install pymatgen==2020.12.31
#!pip install scikit-learn=='0.23.2'

Collecting citrination-client
  Downloading https://files.pythonhosted.org/packages/61/49/c0af91084172f6a6aa7d625651ec366c85a4fd717c5b4fa0e014d1953d6e/citrination-client-6.5.1.tar.gz (54kB)
Collecting dlhub_sdk
  Downloading https://files.pythonhosted.org/packages/1e/cd/02ad247cf7df4467b63dec57e2c8d8f5fe64330bf6da61e6ac6cddcde149/dlhub_sdk-0.9.4-py2.py3-none-any.whl (41kB)
Collecting globus_nexus_client
  Downloading https://files.pythonhosted.org/packages/52/ca/a0e2c03aeea3e4b3b3256ab309e24fb5227ebaf92aabca56b6dfc3cc758a/globus_nexus_client-0.3.0-py2.py3-none-any.whl
Collecting globus_sdk
  Downloading https://files.pythonhosted.org/packages/92/a4/57b628cc5509eeb8361eb87506a3aea2078ca9c4e4ffaebc88280cdf7f40/globus_sdk-2.0.1-py2.py3-none-any.whl (85kB)
Collecting matminer
  Do

Collecting pymatgen==2020.12.31
  Downloading https://files.pythonhosted.org/packages/e1/18/274b40cff34257a728071199d21105ced3116b42dd60793113eee7b1b5ca/pymatgen-2020.12.31.tar.gz (2.8MB)
Collecting scipy>=1.5.0
  Downloading https://files.pythonhosted.org/packages/75/91/ee427c42957f8c4cbe477bf4f8b7f608e003a17941e509d1777e58648cb3/scipy-1.6.2-cp37-cp37m-manylinux1_x86_64.whl (27.4MB)
Building wheels for collected packages: pymatgen
  Building wheel for pymatgen (setup.py) ... done
  Created wheel for pymatgen: filename=pymatgen-2020.12.31-cp37-cp37m-linux_x86_64.whl size=3590894 sha256=3b301a056c466c7a808b9861f72ec6277e8c22c80dbffb3963a75424bab7bbb2
  Stored in directory: /root/.cache/pip/wheels/bd/fd/4c/bbea735ca0989c51e67a45d1384b1ce3481bc2aa1337b4a6e9
Successfully built pymatgen
ERROR: matminer 0.6.5 has requirement scikit-learn>=0.23.1, 

Now we'll sync Colab with our Google Drive so that we can save our outputs directly to Google Drive. If you haven't already, I recommend making a folder in Google Drive titled "MASTML_colab" or something similar to direct all your results towards. Going forward I'll assume this folder exists and I'll base the runs out of that folder. You can use a different name, as long as you update every place where that location is referenced.

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


We need to add the MAST-ML folder to our sys.path so that Python can find the modules.


In [2]:
import sys
sys.path.append('MAST-ML')

Here we import the MAST-ML modules used. Note that if you're making edits you may have to come back to update these imports to grab new functionality that isn't included here.

In [3]:

from mastml.mastml import Mastml
from mastml.datasets import LocalDatasets, DataCleaning
from mastml.preprocessing import SklearnPreprocessor
from mastml.models import SklearnModel
from mastml.data_splitters import SklearnDataSplitter, NoSplit
from mastml.feature_selectors import EnsembleModelFeatureSelector, NoSelect
from mastml.feature_generators import ElementalFeatureGenerator
from mastml.hyper_opt import GridSearch

  defaults = yaml.load(f)


To install latest forestci compatabilty with scikit-learn>=0.24, run pip install git+git://github.com/scikit-learn-contrib/forest-confidence-interval.git
To import data from figshare, manually install figshare via git clone of git clone https://github.com/cognoma/figshare.git


And finally we'll import pandas to help with handling dataframes throughout the notebook.

In [4]:
import pandas as pd

## Section 2: Data Cleaning


---
This section is largely the same as the previous notebook in functionality.
We'll read in the same initial bandgap data we used in the previous notebook then perform the same cleaning steps:  
1) Filtering for "Reliability"  
2) Averaging bandgap values where we have duplicates  


Read in the band gap data from our dataset. If you haven't already, upload the bandgap_data_v2.csv file to the MASTML_colab folder.

In [6]:
mastml_df = pd.read_csv("./drive/MyDrive/MASTML_colab/bandgap_data_v2.csv")

Filter for only Reliability 1

In [7]:
mastml_df_filtered = mastml_df[mastml_df["Reliability"]==1]

Define the averaging function used previously in the nanoHUB notebook. Note that this wasn't explicitly in the previous notebook, as it was imported from a separate script file with some of these helper functions. Here we'll just define it locally in the notebook.

In [8]:
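def average_bandgaps(master_df, input_col_header, output_col_header):
    # Work on a copy so the caller's DataFrame (here a filtered slice) is not modified
    master_df = master_df.copy()
    for chem_formula in master_df[input_col_header].unique():
        temp_df = master_df[master_df[input_col_header]==chem_formula]
        if len(temp_df) > 1:
            avg_bandgap = temp_df[output_col_header].mean()
            indexes = temp_df.index
            # .loc accepts several row labels; .at is documented for single-cell access
            master_df.loc[indexes,output_col_header] = avg_bandgap
    # Every duplicate now holds the group mean, so keeping the first row per formula is enough
    master_df_clean = master_df.drop_duplicates(subset=input_col_header)
    return master_df_clean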

We then call the function to do the same bandgap averaging when we have duplicates in the dataset.

In [9]:
mastml_df_clean = average_bandgaps(mastml_df_filtered, 'chemicalFormula Clean', 'Band gap values Clean')
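To make the averaging step easier to check by hand, here is a minimal sketch of the same idea on toy data using a pandas groupby. The column names, values, and the helper name `average_duplicates` are hypothetical and are not part of the original notebook or of MAST-ML.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the band gap dataset)
toy = pd.DataFrame({
    "formula": ["A1B1", "A1B1", "C1D2"],
    "gap": [1.0, 3.0, 2.0],
})

def average_duplicates(df, key_col, value_col):
    """Replace each value with the mean over rows sharing key_col, then keep one row per key."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out[value_col] = out.groupby(key_col)[value_col].transform("mean")
    return out.drop_duplicates(subset=key_col)

toy_clean = average_duplicates(toy, "formula", "gap")
print(toy_clean)
```

The two "A1B1" rows collapse into a single row whose value is the mean of 1.0 and 3.0, which is the behavior the loop-based function above aims for.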

This section is new. We reset the index to match the previous notebook so that we can explicitly define the same train/test split that we used before. The test_indices object is just a hard-coded list of the index values from the previous notebook. If you want to check them, find the X_test object in that notebook and call X_test.index.

In [10]:
mastml_df_clean.reset_index(inplace=True)
mastml_df_clean.drop(columns='level_0',inplace=True)

In [11]:
test_indices = [279, 168, 192,  33, 223,  22, 341, 453, 460, 455, 120, 430, 436,
            366, 292, 278, 163, 216, 420, 210, 214, 422, 340,  41, 416, 146,
            280, 229, 300, 111, 407, 250, 379,  20, 356,   4, 141, 139, 121,
            324, 147, 415,  57, 301, 393, 454,  30]

Finally we define a new column "testdata" which is going to be a binary column that is either 0 for "not testing data" or 1 for "is testing data". This is what we can feed into MAST-ML to explicitly define a set of Test data that is held out from all training.

In [12]:
mastml_df_clean["testdata"]=0

In [13]:
for idx in test_indices:
  mastml_df_clean.at[idx,'testdata']=1
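As an aside, the same 0/1 flag can be built without a loop. The sketch below uses a toy DataFrame and a hypothetical list `toy_test_indices`; it is an alternative illustration, not the code used to produce the results in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]})
toy_test_indices = [1, 3]  # hypothetical held-out row labels

# 1 where the row label appears in the list, 0 everywhere else
toy["testdata"] = toy.index.isin(toy_test_indices).astype(int)
print(toy)
```

Both approaches rely on the row labels matching the listed indices, which is why the index was reset beforehand.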

In [14]:
output_path = "./drive/MyDrive/MASTML_colab/bandgap_data_v3.csv"
mastml_df_clean.to_csv(output_path,index=False)

Notice how in the initial data cleaning and configuration there is still a bit that we do outside of MAST-ML. While MAST-ML gives a good deal of flexibility and useful tools for performing these machine learning workflows, there will often still be custom steps like this, added to the overall workflow, that vary from dataset to dataset.

## Section 3: Initializing MAST-ML
---
Now we'll dive into interacting more directly with the MAST-ML software. The first thing we need to do is set up some of the baseline information that MAST-ML will use as we call different sections of the code. This is similar to the [general] section from the previous configuration-file-oriented code base.


Set the name of the savepath to save MAST-ML results to. It's recommended to make this a unique name each time you come back to this notebook. That way all the outputs you get from each session will be in a unique location that's easier to come back to later.

By default I've set the output to the "hyperopt" folder under our colab folder.

In [15]:
SAVEPATH = 'drive/MyDrive/MASTML_colab/hyperopt'

mastml = Mastml(savepath=SAVEPATH)
savepath = mastml.get_savepath

With MAST-ML initialized you should see your output directory created. You can check this using the file tree on the left of the screen or directly through google drive.

Next up we need to define the configuration of the data file that we set up earlier. We'll define the names for all of the key components:  
target: the target variable that we want to predict  
extra_columns: the metadata columns that aren't features but that we still want to keep track of  
testdata_columns: the column with binary values defining what is and isn't test data  
group_column: column names specifying unique groups in the data. We don't use this during this workflow  
as_frame: determines the structure of outputs. True gives us dataframe outputs that are easier to read in the notebook

In [16]:
target = 'Band gap values Clean'
extra_columns = ['index', 'Band gap units', 'Band gap method', 'Reliability','chemicalFormula Clean']
testdata_columns = ['testdata']

# calling the LocalDatasets section of the code initializes this section which we then execute with the method below
d = LocalDatasets(file_path='./drive/MyDrive/MASTML_colab/bandgap_data_v3.csv', 
                  target=target, 
                  extra_columns=extra_columns, 
                  group_column=None,
                  testdata_columns=testdata_columns,
                  as_frame=True)

# Load the data with the load_data() method
data_dict = d.load_data()



Let's take a second to look through what just happened. In the previous cell the "data_dict" object was defined. It is a dictionary of various things that were loaded in from the dataset. We'll pull those out of the dictionary and assign each one to its own object.

We see there are 5 keys:  
  X: the X feature matrix (used to fit the ML model). Notice this is empty because we haven't done any feature generation yet  
  y: the y target data vector (true values)  
  X_extra: matrix of metadata not used in fitting (i.e. not part of X or y)  
  groups: vector of group labels. Empty because we didn't set it  
  X_testdata: matrix or vector of left-out data indices

In [17]:
data_dict.keys()

dict_keys(['X', 'y', 'groups', 'X_extra', 'X_testdata'])

In [18]:
X = data_dict['X']
y = data_dict['y']
X_extra = data_dict['X_extra']
groups = data_dict['groups']
X_testdata = data_dict['X_testdata']
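To build intuition for how one table ends up as these separate objects, here is a pandas-only sketch of the same kind of split on a toy table. It is not MAST-ML's implementation; every name and value below is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "formula": ["A", "B", "C"],   # metadata column
    "feat_1": [0.1, 0.2, 0.3],    # a feature column
    "gap": [1.0, 2.0, 3.0],       # target column
    "testdata": [0, 1, 0],        # 1 marks held-out rows
})

toy_target = "gap"
toy_extra = ["formula"]
toy_testcol = "testdata"

y_toy = toy[toy_target]                                      # target vector
X_extra_toy = toy[toy_extra]                                 # metadata, kept but not fit on
test_idx_toy = toy.index[toy[toy_testcol] == 1].to_numpy()   # labels of held-out rows
# Whatever remains after removing target, metadata and the test flag is the feature matrix
X_toy = toy.drop(columns=[toy_target, toy_testcol] + toy_extra)
print(X_toy.columns.tolist(), test_idx_toy)
```

In the notebook's dataset no feature columns exist yet, so the analogous X has an index but no columns, which is what the next cell shows.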

In [19]:
X

0
1
2
3
4
...
462
463
464
465
466


In [20]:
groups

In [21]:
X_extra

Unnamed: 0,index,Band gap units,Band gap method,Reliability,chemicalFormula Clean
0,0,eV,Reflection,1,Li1F1
1,6,eV,Reflection,1,Li1Cl1
2,7,eV,Absorption,1,Li1Br1
3,9,eV,Thermal activation,1,Li3Sb1
4,10,eV,Reflection,1,Li1I1
...,...,...,...,...,...
462,1437,eV,Absorption,1,Bi1I3
463,1445,eV,Magnetoreflection,1,Bi
464,1448,eV,,1,Th1O2
465,1455,eV,Thermal activation,1,UO


In [22]:
X_testdata

[array([  4,  20,  22,  30,  33,  41,  57, 111, 120, 121, 139, 141, 146,
        147, 163, 168, 192, 210, 214, 216, 223, 229, 250, 278, 279, 280,
        292, 300, 301, 324, 340, 341, 356, 366, 379, 393, 407, 415, 416,
        420, 422, 430, 436, 453, 454, 455, 460])]

## Section 4: Feature Generation/Engineering
---
Now we'll set up the input features for the model with a few MAST-ML runs.


If the data contains missing values (this one doesn't), we can clean the data with the built-in tools in MAST-ML, which correct missing values and provide some basic analysis of the input data. Even though there are no missing values here, the data cleaner will still output some useful plots and statistics of our input data.

In [23]:
cleaner = DataCleaning()
X, y = cleaner.evaluate(X=X, 
                        y=y, 
                        method='imputation', 
                        strategy='mean', 
                        savepath=savepath)
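For readers unfamiliar with mean imputation, the following is a minimal pandas sketch of the idea on toy data: each missing entry is replaced by the mean of its column. It illustrates the concept only and makes no claim about how MAST-ML implements it internally; the values are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0],  # one missing value
    "f2": [4.0, 5.0, 6.0],     # complete column
})

# DataFrame.mean() skips NaN by default, so the f1 mean is (1 + 3) / 2
toy_imputed = toy.fillna(toy.mean())
print(toy_imputed)
```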

Looking at the format of the DataCleaning section also highlights the key way we will interact with MAST-ML in this format. For each section of the code we want to use we'll initialize it using what's called a class name, in this case "DataCleaning", and then call the "evaluate" method to essentially run the code for that Class.

Let's look through the outputs and compare them to the initial dataset analysis from the previous nanoHUB workflow. Open the "histogram_target_values.png" file in the newly created DataCleaning folder under our output directory and compare it to the histogram we made in the previous notebook. Are they the same?

This is the type of check we would do to make sure we aren't missing any data switching between the two platforms.

Next is generating the elemental features used in the model. Just like the previous step we define the class of feature generation we want to use, and then call the evaluate method. Again results are output to a new folder with the name of the Class that was evaluated. The features are also added to the X object so we can continue to use them directly without having to read in from the generated files.

You can see from the output that MAST-ML is also performing some basic feature engineering by dropping features that are missing values. This is the most basic way of handling missing values, and if we wanted to do something more complex later we could come back and use imputation to fill in those missing values instead.

In [24]:
generator = ElementalFeatureGenerator(composition_df = X_extra["chemicalFormula Clean"],
                      feature_types='composition_avg',
                      remove_constant_columns=True)
X, y = generator.evaluate(X = X,
                          y = y,
                          savepath = savepath)

Dropping 1/88 generated columns due to missing values
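To see what a "composition average" feature means, here is a small hand-rolled sketch that computes the amount-weighted mean of one elemental property, atomic number. The helper names `parse_formula` and `composition_average` are hypothetical, the parser only handles simple formulas without parentheses, and this is not the code MAST-ML runs.

```python
import re

# Atomic numbers for the few elements used in this example
ATOMIC_NUMBER = {"Li": 3, "F": 9, "Cl": 17, "Br": 35, "Sb": 51}

def parse_formula(formula):
    """Return {element: amount} for a simple formula such as 'Li3Sb1'."""
    amounts = {}
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        # A missing count (e.g. 'Bi') is treated as 1
        amounts[element] = amounts.get(element, 0.0) + (float(count) if count else 1.0)
    return amounts

def composition_average(formula, prop):
    """Amount-weighted mean of an elemental property over the formula."""
    amounts = parse_formula(formula)
    total = sum(amounts.values())
    return sum(n * prop[el] for el, n in amounts.items()) / total

for f in ["Li1F1", "Li1Cl1", "Li1Br1", "Li3Sb1"]:
    print(f, composition_average(f, ATOMIC_NUMBER))
```

The results (6, 10, 19 and 15) agree with the AtomicNumber_composition_average values shown for the first four rows of the feature table in this section.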


Using the cell below, which outputs the feature object directly, compare the features generated to those in the previous workflow. Do we have the same total number?

If they're different can you think of any reasons why?  
hint: mastml does some initial cleaning automatically on the features.

In [25]:
X

Unnamed: 0,AtomicNumber_composition_average,AtomicRadii_composition_average,AtomicVolume_composition_average,AtomicWeight_composition_average,BCCefflatcnt_composition_average,BCCenergy_pa_composition_average,BCCfermi_composition_average,BCCmagmom_composition_average,BCCvolume_pa_composition_average,BCCvolume_padiff_composition_average,BoilingT_composition_average,BulkModulus_composition_average,Column_composition_average,CovalentRadii_composition_average,CovalentRadius_composition_average,Density_composition_average,ElasticModulus_composition_average,ElectricalConductivity_composition_average,ElectronAffinity_composition_average,Electronegativity_composition_average,FirstIonizationEnergy_composition_average,GSbandgap_composition_average,GSenergy_pa_composition_average,GSestBCClatcnt_composition_average,GSestFCClatcnt_composition_average,GSmagmom_composition_average,GSvolume_pa_composition_average,Group_composition_average,HHIp_composition_average,HHIr_composition_average,HeatCapacityMass_composition_average,HeatCapacityMolar_composition_average,HeatFusion_composition_average,HeatVaporization_composition_average,ICSDVolume_composition_average,IonicRadii_composition_average,IonizationEnergy_composition_average,IsAlkali_composition_average,IsAlkalineEarth_composition_average,IsBCC_composition_average,...,IsHalogen_composition_average,IsHexagonal_composition_average,IsMetal_composition_average,IsMetalloid_composition_average,IsMonoclinic_composition_average,IsNonmetal_composition_average,IsOrthorhombic_composition_average,IsPnictide_composition_average,IsRareEarth_composition_average,IsRhombohedral_composition_average,IsSimpleCubic_composition_average,IsTetragonal_composition_average,IsTransitionMetal_composition_average,MeltingT_composition_average,MendeleevNumber_composition_average,MiracleRadius_composition_average,NUnfilled_composition_average,NValance_composition_average,NdUnfilled_composition_average,NdValence_composition_average,NfUnfilled_composition_average,NfValence_composition_average,NpUnfilled_composition_average,NpValence_composition_average,NsUnfilled_composition_average,NsValence_composition_average,Number_composition_average,Period_composition_average,Polarizability_composition_average,Row_composition_average,SecondIonizationEnergy_composition_average,ShearModulus_composition_average,SpaceGroupNumber_composition_average,SpecificHeatCapacity_composition_average,ThermalConductivity_composition_average,ThermalExpansionCoefficient_composition_average,ThirdIonizationEnergy_composition_average,n_ws^third_composition_average,phi_composition_average,valence_composition_average
0,6.000000,1.135000,9311.576313,12.969702,5.772386,-1.346741,-0.679877,0.0,12.4700,-0.680417,849.940000,5.500000,9.000000,0.975000,92.500000,268.348000,5.000000,5.850000,193.900000,2.480000,11.407259,0.9850,-1.783974,2.950630,3.717561,0.0,13.150417,9.000000,2200.000000,2850.000000,2.203000,28.0820,1.627500,75.184900,19.000000,1.045000,1100.500,0.50,0.0,0.50,...,0.50,0.0,0.500000,0.00,0.0,0.500000,0.000000,0.00,0.000000,0.00,0.500000,0.0,0.0,253.595000,47.000000,76.000000,1.0,4.000000,0.000000,0.000000,0.000000,0.0,0.500000,2.500000,0.50,1.50,6.000000,2.000000,12.446000,2.000000,55.804000,2.100000,122.00,2.203000,42.363950,923.000000,92.579000,0.490000,1.4250,1.000000
1,10.000000,1.270000,9169.525548,21.197000,6.658641,-1.410040,1.219961,0.0,18.5250,-2.020417,926.980000,6.050000,9.000000,1.110000,115.000000,269.107000,5.000000,5.850000,204.300000,2.070000,9.179675,1.2465,-1.827160,3.436376,4.329563,0.0,20.545417,9.000000,2200.000000,2600.000000,2.030500,29.4045,3.100000,78.650000,25.300000,1.285000,885.550,0.50,0.0,0.50,...,0.50,0.0,0.500000,0.00,0.0,0.500000,0.500000,0.00,0.000000,0.00,0.000000,0.0,0.0,312.645000,47.500000,76.000000,1.0,4.000000,0.000000,0.000000,0.000000,0.0,0.500000,2.500000,0.50,1.50,10.000000,2.500000,13.257500,2.500000,50.224000,2.100000,146.50,2.031000,42.354450,23.000000,81.031000,0.490000,1.4250,4.000000
2,19.000000,1.345000,32.035942,43.422500,6.919518,-1.432083,1.117212,0.0,21.0350,-2.001667,973.500000,6.450000,9.000000,1.185000,124.000000,1827.500000,5.000000,5.850000,192.200000,1.970000,8.602760,0.7285,-1.726456,3.552844,4.476302,0.0,23.036667,9.000000,3100.000000,5550.000000,2.028000,50.2750,4.142500,80.912500,43.350000,1.360000,829.950,0.50,0.0,0.50,...,0.50,0.0,0.500000,0.00,0.0,0.500000,0.500000,0.00,0.000000,0.00,0.000000,0.0,0.0,359.745000,48.000000,76.000000,1.0,9.000000,0.000000,5.000000,0.000000,0.0,0.500000,2.500000,0.50,1.50,19.000000,3.000000,13.692500,3.000000,49.219000,2.100000,146.50,1.904000,42.411000,23.000000,79.225500,0.490000,1.4250,4.000000
3,15.000000,1.560000,23.705899,35.645750,6.704252,-2.371630,2.267697,0.0,19.1550,-1.180000,1676.250000,18.750000,4.500000,1.272500,130.750000,2075.500000,24.250000,9.425000,70.100000,1.247500,6.195887,0.0000,-2.431756,3.405574,4.290754,0.0,20.335000,4.500000,4150.000000,4000.000000,2.738250,24.9525,7.197500,127.317500,23.750000,0.760000,598.425,0.75,0.0,0.75,...,0.00,0.0,1.000000,0.25,0.0,0.000000,0.000000,0.25,0.000000,0.25,0.000000,0.0,0.0,566.212500,22.000000,152.750000,1.5,4.500000,0.000000,2.500000,0.000000,0.0,0.750000,0.750000,0.75,1.25,15.000000,2.750000,19.901250,2.750000,61.611000,8.150000,213.25,2.738250,69.600000,37.250000,98.163250,1.050000,3.2375,2.000000
4,28.000000,1.440000,32.101458,66.922735,7.343549,-1.459519,2.360221,0.0,25.9350,-3.869167,1036.150000,9.350000,9.000000,1.280000,133.500000,2737.500000,5.000000,5.850000,177.550000,1.820000,7.921489,0.5310,-1.687645,3.814044,4.805395,0.0,29.804167,9.000000,3900.000000,4500.000000,1.898000,39.6450,5.380000,84.000000,32.050000,1.480000,764.200,0.50,0.0,0.50,...,0.50,0.0,0.500000,0.00,0.0,0.500000,0.500000,0.00,0.000000,0.00,0.000000,0.0,0.0,420.270000,48.500000,76.000000,1.0,9.000000,0.000000,5.000000,0.000000,0.0,0.500000,2.500000,0.50,1.50,28.000000,3.500000,14.680000,3.500000,47.884500,2.100000,146.50,1.863500,42.574500,66.500000,77.725500,0.490000,1.4250,4.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
462,60.500000,1.422500,40.865008,147.423452,8.158525,-1.763974,5.485714,0.0,33.9975,-6.501250,802.225000,13.525000,16.500000,1.362500,141.250000,6150.000000,8.500000,0.225000,248.975000,2.500000,9.659820,0.7965,-2.122475,4.320688,5.443726,0.0,40.498750,16.500000,5000.000000,5100.000000,0.191000,47.2025,8.596500,60.425000,40.700000,1.907500,932.125,0.00,0.0,0.00,...,0.75,0.0,0.250000,0.00,0.0,0.750000,0.750000,0.25,0.000000,0.25,0.000000,0.0,0.0,426.237500,93.500000,40.500000,1.5,20.000000,0.000000,10.000000,0.000000,3.5,1.500000,4.500000,0.00,2.00,60.500000,5.250000,3.868750,5.250000,18.520000,3.000000,51.00,0.139250,2.304250,68.600000,31.139750,0.290000,1.0375,6.500000
463,83.000000,1.700000,35.483459,208.980400,7.821898,-3.895384,8.463157,0.0,29.9100,-3.040000,1837.000000,31.000000,15.000000,1.460000,148.000000,9780.000000,34.000000,0.900000,110.000000,2.020000,7.285500,0.0000,-3.973694,4.039198,5.089071,0.0,32.950000,15.000000,5300.000000,6000.000000,0.122000,25.5200,11.106000,179.000000,35.300000,1.030000,703.300,0.00,0.0,0.00,...,0.00,0.0,1.000000,0.00,0.0,0.000000,0.000000,1.00,0.000000,1.00,0.000000,0.0,0.0,544.400000,86.000000,162.000000,3.0,29.000000,0.000000,10.000000,0.000000,14.0,3.000000,3.000000,0.00,2.00,83.000000,6.000000,0.400000,6.000000,16.687000,12.000000,12.00,0.122000,7.870000,13.400000,25.559000,1.160000,4.1500,5.000000
464,35.333333,1.086000,12405.753339,88.012287,5.956046,-4.062017,4.524067,0.0,15.8700,-0.990000,1757.733333,18.000000,11.666667,1.036667,112.666667,3908.952667,24.333333,2.366667,94.066667,2.726667,11.180933,0.0000,-5.642042,3.092382,3.896157,0.0,16.860000,11.666667,333.333333,333.333333,0.651333,28.6920,4.750000,183.580600,22.466667,1.283333,1071.600,0.00,0.0,0.00,...,0.00,0.0,0.333333,0.00,0.0,0.666667,0.000000,0.00,0.333333,0.00,0.666667,0.0,0.0,710.866667,63.333333,102.000000,4.0,5.333333,2.666667,0.666667,0.000000,0.0,1.333333,2.666667,0.00,2.00,35.333333,3.666667,11.234667,3.666667,27.246000,10.333333,83.00,0.651000,18.178267,523.666667,43.290333,0.426667,1.1000,2.666667
465,50.000000,1.057500,9306.473007,127.014155,5.880448,-6.738583,8.971338,0.0,13.7800,-0.785000,2145.050000,50.000000,9.500000,1.075000,131.000000,9525.714500,93.000000,1.800000,70.550000,2.410000,9.906075,0.0000,-8.024566,3.026135,3.812691,0.0,14.565000,9.500000,250.000000,250.000000,0.517000,28.5215,4.680000,212.995450,19.200000,1.105000,948.950,0.00,0.0,0.00,...,0.00,0.0,0.500000,0.00,0.0,0.500000,0.500000,0.00,0.500000,0.00,0.500000,0.0,0.0,731.400000,53.500000,111.000000,11.0,6.000000,4.500000,0.500000,5.500000,1.5,1.000000,2.000000,0.00,2.00,50.000000,4.500000,12.851000,4.500000,17.558500,55.500000,37.50,0.520000,13.933700,396.950000,27.467000,0.755000,1.9500,4.000000


Next we'll see one of the benefits of using MAST-ML in this new way. Currently MAST-ML doesn't implement the method we used to remove highly correlated features. Previously, adding this in would have been a good deal of work. But because we're using MAST-ML in this interactive notebook environment, we can add our own feature engineering steps that aren't included in the MAST-ML software. Below I just copied over the code from the previous notebook to filter highly correlated features.

In [26]:
import numpy as np

In [27]:
features_corr_df = X.corr(method="pearson").abs()
# Filter the features with correlation coefficients above 0.95
upper = features_corr_df.where(np.triu(np.ones(features_corr_df.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
X = X.drop(columns=to_drop)
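The upper-triangle trick is easier to follow on a tiny example. The sketch below applies the same steps to a toy table in which column "b" is an exact multiple of column "a"; the data and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # perfectly correlated with "a"
    "c": [5, 1, 4, 2, 3],    # only weakly correlated with "a" and "b"
})

corr = toy.corr(method="pearson").abs()
# Keep only entries strictly above the diagonal so each pair is inspected once
mask = np.triu(np.ones(corr.shape), k=1).astype(bool)
upper = corr.where(mask)
# A column is dropped if it correlates too strongly with any earlier column
toy_to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
toy_reduced = toy.drop(columns=toy_to_drop)
print(toy_to_drop, toy_reduced.columns.tolist())
```

Because only the upper triangle is examined, the later column of each highly correlated pair is removed and the earlier one is kept.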

In [28]:
X

Unnamed: 0,AtomicNumber_composition_average,AtomicRadii_composition_average,AtomicVolume_composition_average,BCCefflatcnt_composition_average,BCCenergy_pa_composition_average,BCCfermi_composition_average,BCCmagmom_composition_average,BCCvolume_pa_composition_average,BCCvolume_padiff_composition_average,BoilingT_composition_average,BulkModulus_composition_average,Column_composition_average,Density_composition_average,ElasticModulus_composition_average,ElectricalConductivity_composition_average,ElectronAffinity_composition_average,Electronegativity_composition_average,FirstIonizationEnergy_composition_average,GSbandgap_composition_average,GSenergy_pa_composition_average,GSmagmom_composition_average,HHIp_composition_average,HHIr_composition_average,HeatCapacityMass_composition_average,HeatCapacityMolar_composition_average,HeatFusion_composition_average,HeatVaporization_composition_average,ICSDVolume_composition_average,IonicRadii_composition_average,IsAlkali_composition_average,IsAlkalineEarth_composition_average,IsBCC_composition_average,IsBoron_composition_average,IsCarbon_composition_average,IsChalcogen_composition_average,IsDBlock_composition_average,IsFBlock_composition_average,IsFCC_composition_average,IsHalogen_composition_average,IsHexagonal_composition_average,IsMetal_composition_average,IsMetalloid_composition_average,IsMonoclinic_composition_average,IsNonmetal_composition_average,IsOrthorhombic_composition_average,IsPnictide_composition_average,IsRhombohedral_composition_average,IsSimpleCubic_composition_average,IsTetragonal_composition_average,MeltingT_composition_average,MendeleevNumber_composition_average,MiracleRadius_composition_average,NUnfilled_composition_average,NValance_composition_average,NdUnfilled_composition_average,NdValence_composition_average,NfUnfilled_composition_average,NfValence_composition_average,NpUnfilled_composition_average,NpValence_composition_average,NsUnfilled_composition_average,NsValence_composition_average,Polarizability_composition_average,SecondIonizationEnergy_composition_average,ShearModulus_composition_average,SpaceGroupNumber_composition_average,ThermalConductivity_composition_average,ThermalExpansionCoefficient_composition_average,ThirdIonizationEnergy_composition_average,n_ws^third_composition_average,valence_composition_average
0,6.000000,1.135000,9311.576313,5.772386,-1.346741,-0.679877,0.0,12.4700,-0.680417,849.940000,5.500000,9.000000,268.348000,5.000000,5.850000,193.900000,2.480000,11.407259,0.9850,-1.783974,0.0,2200.000000,2850.000000,2.203000,28.0820,1.627500,75.184900,19.000000,1.045000,0.50,0.0,0.50,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.50,0.0,0.500000,0.00,0.0,0.500000,0.000000,0.00,0.00,0.500000,0.0,253.595000,47.000000,76.000000,1.0,4.000000,0.000000,0.000000,0.000000,0.0,0.500000,2.500000,0.50,1.50,12.446000,55.804000,2.100000,122.00,42.363950,923.000000,92.579000,0.490000,1.000000
1,10.000000,1.270000,9169.525548,6.658641,-1.410040,1.219961,0.0,18.5250,-2.020417,926.980000,6.050000,9.000000,269.107000,5.000000,5.850000,204.300000,2.070000,9.179675,1.2465,-1.827160,0.0,2200.000000,2600.000000,2.030500,29.4045,3.100000,78.650000,25.300000,1.285000,0.50,0.0,0.50,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.50,0.0,0.500000,0.00,0.0,0.500000,0.500000,0.00,0.00,0.000000,0.0,312.645000,47.500000,76.000000,1.0,4.000000,0.000000,0.000000,0.000000,0.0,0.500000,2.500000,0.50,1.50,13.257500,50.224000,2.100000,146.50,42.354450,23.000000,81.031000,0.490000,4.000000
2,19.000000,1.345000,32.035942,6.919518,-1.432083,1.117212,0.0,21.0350,-2.001667,973.500000,6.450000,9.000000,1827.500000,5.000000,5.850000,192.200000,1.970000,8.602760,0.7285,-1.726456,0.0,3100.000000,5550.000000,2.028000,50.2750,4.142500,80.912500,43.350000,1.360000,0.50,0.0,0.50,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.50,0.0,0.500000,0.00,0.0,0.500000,0.500000,0.00,0.00,0.000000,0.0,359.745000,48.000000,76.000000,1.0,9.000000,0.000000,5.000000,0.000000,0.0,0.500000,2.500000,0.50,1.50,13.692500,49.219000,2.100000,146.50,42.411000,23.000000,79.225500,0.490000,4.000000
3,15.000000,1.560000,23.705899,6.704252,-2.371630,2.267697,0.0,19.1550,-1.180000,1676.250000,18.750000,4.500000,2075.500000,24.250000,9.425000,70.100000,1.247500,6.195887,0.0000,-2.431756,0.0,4150.000000,4000.000000,2.738250,24.9525,7.197500,127.317500,23.750000,0.760000,0.75,0.0,0.75,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.00,0.0,1.000000,0.25,0.0,0.000000,0.000000,0.25,0.25,0.000000,0.0,566.212500,22.000000,152.750000,1.5,4.500000,0.000000,2.500000,0.000000,0.0,0.750000,0.750000,0.75,1.25,19.901250,61.611000,8.150000,213.25,69.600000,37.250000,98.163250,1.050000,2.000000
4,28.000000,1.440000,32.101458,7.343549,-1.459519,2.360221,0.0,25.9350,-3.869167,1036.150000,9.350000,9.000000,2737.500000,5.000000,5.850000,177.550000,1.820000,7.921489,0.5310,-1.687645,0.0,3900.000000,4500.000000,1.898000,39.6450,5.380000,84.000000,32.050000,1.480000,0.50,0.0,0.50,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.50,0.0,0.500000,0.00,0.0,0.500000,0.500000,0.00,0.00,0.000000,0.0,420.270000,48.500000,76.000000,1.0,9.000000,0.000000,5.000000,0.000000,0.0,0.500000,2.500000,0.50,1.50,14.680000,47.884500,2.100000,146.50,42.574500,66.500000,77.725500,0.490000,4.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
462,60.500000,1.422500,40.865008,8.158525,-1.763974,5.485714,0.0,33.9975,-6.501250,802.225000,13.525000,16.500000,6150.000000,8.500000,0.225000,248.975000,2.500000,9.659820,0.7965,-2.122475,0.0,5000.000000,5100.000000,0.191000,47.2025,8.596500,60.425000,40.700000,1.907500,0.00,0.0,0.00,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.75,0.0,0.250000,0.00,0.0,0.750000,0.750000,0.25,0.25,0.000000,0.0,426.237500,93.500000,40.500000,1.5,20.000000,0.000000,10.000000,0.000000,3.5,1.500000,4.500000,0.00,2.00,3.868750,18.520000,3.000000,51.00,2.304250,68.600000,31.139750,0.290000,6.500000
463,83.000000,1.700000,35.483459,7.821898,-3.895384,8.463157,0.0,29.9100,-3.040000,1837.000000,31.000000,15.000000,9780.000000,34.000000,0.900000,110.000000,2.020000,7.285500,0.0000,-3.973694,0.0,5300.000000,6000.000000,0.122000,25.5200,11.106000,179.000000,35.300000,1.030000,0.00,0.0,0.00,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.00,0.0,1.000000,0.00,0.0,0.000000,0.000000,1.00,1.00,0.000000,0.0,544.400000,86.000000,162.000000,3.0,29.000000,0.000000,10.000000,0.000000,14.0,3.000000,3.000000,0.00,2.00,0.400000,16.687000,12.000000,12.00,7.870000,13.400000,25.559000,1.160000,5.000000
464,35.333333,1.086000,12405.753339,5.956046,-4.062017,4.524067,0.0,15.8700,-0.990000,1757.733333,18.000000,11.666667,3908.952667,24.333333,2.366667,94.066667,2.726667,11.180933,0.0000,-5.642042,0.0,333.333333,333.333333,0.651333,28.6920,4.750000,183.580600,22.466667,1.283333,0.00,0.0,0.00,0.0,0.0,0.666667,0.0,0.333333,0.333333,0.00,0.0,0.333333,0.00,0.0,0.666667,0.000000,0.00,0.00,0.666667,0.0,710.866667,63.333333,102.000000,4.0,5.333333,2.666667,0.666667,0.000000,0.0,1.333333,2.666667,0.00,2.00,11.234667,27.246000,10.333333,83.00,18.178267,523.666667,43.290333,0.426667,2.666667
465,50.000000,1.057500,9306.473007,5.880448,-6.738583,8.971338,0.0,13.7800,-0.785000,2145.050000,50.000000,9.500000,9525.714500,93.000000,1.800000,70.550000,2.410000,9.906075,0.0000,-8.024566,0.0,250.000000,250.000000,0.517000,28.5215,4.680000,212.995450,19.200000,1.105000,0.00,0.0,0.00,0.0,0.0,0.500000,0.0,0.500000,0.000000,0.00,0.0,0.500000,0.00,0.0,0.500000,0.500000,0.00,0.00,0.500000,0.0,731.400000,53.500000,111.000000,11.0,6.000000,4.500000,0.500000,5.500000,1.5,1.000000,2.000000,0.00,2.00,12.851000,17.558500,55.500000,37.50,13.933700,396.950000,27.467000,0.755000,4.000000


Next, we perform the last feature engineering step: normalizing the features using scikit-learn's MinMaxScaler.

In [29]:
preprocessor = SklearnPreprocessor(preprocessor='MinMaxScaler', as_frame=True)
X = preprocessor.evaluate(X=X,
                          y=y, 
                          savepath=savepath)

## Section 5: Neural Network Optimization
---
In this section we'll start to analyze the neural network (NN) model from scikit-learn, the MLPRegressor, which stands for Multi-layer Perceptron Regressor. We can find the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html 

We'll use the same assessment techniques as before, combining 5-fold cross-validation with the previously established test set to measure the model's predictive ability.

During this activity we'll go through a number of different steps to optimize our neural network. For our first step let's use the grid search method outlined previously to grid over a number of different hyperparameters.

Because the neural network model tends to take a bit longer to train, we'll do a very rough grid of each hyperparameter so that we can run through quickly and identify what sorts of values are working best for each hyperparameter. After this initial run we can either narrow down the range of values to get a finer grid, or start to exclude hyperparameters if we think we've found the best value for them overall.

We'll initially set up a rough grid over 3 hyperparameters:

1) Alpha (regularization)  
  Let's set a minimum value of 10^-8 and a maximum of 10^2, with 5 grid points. Since we're varying over orders of magnitude it's easier to do this in log space, and our alpha values should be floats. So the string to set up this grid should look like 'x y z log float', with x and y being the exponents of the minimum and maximum values and z being the number of grid points.  
2) Initial learning rate  
  For the learning rate we'll do the same thing, but set the minimum to 10^-5 and the maximum to 10^1, again with 5 grid points.  
3) Activation  
  Activation is a categorical hyperparameter so it gets handled a bit differently. Looking at the sklearn documentation, the available activation functions are: identity, logistic, tanh, and relu. Currently we have to use a bit of a hack to get it to work, but we can set these categories similarly in a string (inside quotes) with spaces in between. We also have to add an extra "fake" value because mastml expects a certain format and we're breaking it. So for the value input, set it to 'identity logistic tanh relu fake' to have it try all the activation function options.  

  To include multiple hyperparameters in the grid search we separate them with semicolons.  
  So an example would look like this:  
  hyperparams = 'param1 ; param2 ; param3'  
  param_vals = '1 5 3 log float ; 2 10 5 log float ; activation1 activation2 activation3 activation4 activation5'
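As a rough illustration of what the 'x y z log float' strings describe, the sketch below builds the two numerical grid strings from the values given above and expands them into the numbers they stand for. The helper names `log_grid_string` and `expand_log_grid` are hypothetical and are not part of MAST-ML; the parameter names `alpha`, `learning_rate_init`, and `activation` are taken from the scikit-learn MLPRegressor documentation. It assumes the grid points are evenly spaced in the exponent, which has not been checked against the MAST-ML source.

```python
def log_grid_string(min_exp, max_exp, n_points):
    """Format a log-spaced float grid the way the text describes: 'x y z log float'."""
    return f"{min_exp} {max_exp} {n_points} log float"


def expand_log_grid(min_exp, max_exp, n_points):
    """Return the values such a grid stands for, assuming even spacing in the exponent."""
    step = (max_exp - min_exp) / (n_points - 1)
    return [10 ** (min_exp + i * step) for i in range(n_points)]


# Parameter names as listed in the scikit-learn MLPRegressor documentation.
hyperparams = " ; ".join(["alpha", "learning_rate_init", "activation"])

# Values described in the text; the trailing "fake" entry is the workaround mentioned above.
param_vals = " ; ".join([
    log_grid_string(-8, 2, 5),
    log_grid_string(-5, 1, 5),
    "identity logistic tanh relu fake",
])

print(hyperparams)
print(param_vals)
print(expand_log_grid(-8, 2, 5))  # exponents -8, -5.5, -3, -0.5, 2
print(expand_log_grid(-5, 1, 5))  # exponents -5, -3.5, -2, -0.5, 1
```

Printing the expanded values is a quick way to confirm that a grid covers the range you intended before starting a long run.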


In [32]:
default_model = SklearnModel(model='MLPRegressor')
models = [default_model]
selector = [NoSelect()]
metrics = ['r2_score', 'mean_absolute_error', 'root_mean_squared_error', 'rmse_over_stdev']

### here's the key grid search settings we're going to edit
hyperparams = 'param1; param2; param3'
param_vals = '1 5 3 log float; 2 10 5 log float; activation1 activation2 activation3 activation4 activation5'
###

grid1 = GridSearch(param_names=hyperparams,param_values=param_vals,scoring='neg_mean_squared_error')
grids = [grid1]
splitter = NoSplit()
splitter.evaluate(X=X,
                  y=y, 
                  models=models,
                  preprocessor=None,
                  selectors=selector,
                  metrics=metrics,
                  savepath=savepath,
                  X_extra=X_extra,
                  leaveout_inds=X_testdata,
                  hyperopts = grids,
                  recalibrate_errors = True,
                  verbosity=3)

You must specify either lin or log scaling for GridSearch
Hyperparameter optimization failed, likely due to inappropriate domain of values to optimize one or more parameters over. Please check your input file and the sklearn docs for the mode you are optimizing for the domain of correct values


NameError: ignored

With this initial rough grid search complete let's go look through our results. Locate the "gridsearch_mlpregressor..." file in your run output and open it up to view how the model performed under each combination of hyperparameters. Because we varied so many at a time we can't make a simple one-dimensional learning curve.

For this initial rough grid search, what we're trying to identify is roughly which hyperparameter values perform better so we can refine our search. For now let's just sort the values by the mean test score column so we can see what performed best. You can do this by opening the output in Excel, or optionally you could read that output file back into the notebook here to do your analysis.



For the top 10 or so performing combinations of hyperparameters let's ask ourselves a few questions:  
1) For each hyperparameter, is there a common range (for the numerical ones) or a common category (for the categorical ones) that is clearly dominating? For example, we might notice that a certain activation function is consistently performing well, or that all of the learning rate values are around 10^-3. Take note of this for each hyperparameter, or if there isn't a clear trend note that as well.

2) How quickly is performance dropping off during this rough grid search? For example, find the performance of the top performing combination (for me it was {'activation': 'relu', 'alpha': 0.31622776601683794, 'learning_rate_init': 0.01}), and for each numerical hyperparameter find the decrease in performance as that hyperparameter moves up and down one grid step. 

3) Finally, for the numerical hyperparameters, do any of them keep improving all the way to the boundary that we gave them? This might indicate we needed to do a wider grid search initially. Otherwise, if the best values fall inside the previous range, we can start to narrow the search.
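One way to work through these questions inside the notebook is sketched below. It uses a handful of made-up rows purely for illustration (they are not results from this dataset) and assumes each row carries the hyperparameter values plus a `mean_test_score` entry; the actual column names in the MAST-ML grid search output may differ. Because the grid search is scored with 'neg_mean_squared_error', scores closer to zero are better, so the rows are sorted in descending order.

```python
from collections import Counter

# Made-up example rows, NOT real results; the key names are assumptions.
rows = [
    {"activation": "logistic", "alpha": 1e-3, "learning_rate_init": 1e-2, "mean_test_score": -0.8},
    {"activation": "relu", "alpha": 1e-8, "learning_rate_init": 1e-2, "mean_test_score": -0.7},
    {"activation": "logistic", "alpha": 1e-3, "learning_rate_init": 1e-5, "mean_test_score": -2.5},
    {"activation": "tanh", "alpha": 1e2, "learning_rate_init": 1e-2, "mean_test_score": -3.0},
    {"activation": "identity", "alpha": 1e-3, "learning_rate_init": 1e1, "mean_test_score": -9.0},
    {"activation": "logistic", "alpha": 1e-8, "learning_rate_init": 1e-2, "mean_test_score": -0.9},
]


def top_n(rows, n, score_key="mean_test_score"):
    """Return the n best rows; for negated errors, higher (closer to zero) is better."""
    return sorted(rows, key=lambda r: r[score_key], reverse=True)[:n]


best = top_n(rows, 3)

# Question 1: which category dominates, and what range do the numerical values span?
activation_counts = Counter(r["activation"] for r in best)
lr_range = (
    min(r["learning_rate_init"] for r in best),
    max(r["learning_rate_init"] for r in best),
)

# Question 3: does any top-performing value sit on the edge of the searched range?
alpha_bounds = (1e-8, 1e2)
alpha_on_boundary = any(r["alpha"] in alpha_bounds for r in best)

print(activation_counts.most_common())
print("learning rate range in top rows:", lr_range)
print("alpha touches a grid boundary:", alpha_on_boundary)
```

With real results you would replace `rows` with the contents of the grid search output file and raise `n` to 10 or so.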

Using the information from this previous analysis we'll establish our next search. Based on a quick look through the results I found the following:
- For the learning rate, it looks like 10^-1 to 10^-5 are performing best, so let's move the outer bounds of the grid in to match those new values.  
- For regularization, values from 10^0 to 10^-8 were giving decent results, so let's move those bounds in a bit on the high side.  
- For the activation function, all values are represented in the well performing models except the identity function (which makes sense, since an identity activation reduces the network to a linear model). So we could make a few choices here. In the interest of time let's just pick the one that is most populated in the top 10 and use that going forward. Make sure to keep track of which one that is! For me it was the logistic function. 

Using these notes, copy your grid search from above into the cell below and make your adjustments. To exclude the activation function from the grid search, remove its entries from both strings and add the extra input:  
activation='logistic'  
to the initial definition of the default_model object.

Also, because we've removed one of the hyperparameters from the grid, let's increase the grid density to 10.

**Note: this will take a bit to run**

In [None]:
# make sure to change the "editthis"
default_model = SklearnModel(model='MLPRegressor',activation=editthis)
models = [default_model]
selector = [NoSelect()]
metrics = ['r2_score', 'mean_absolute_error', 'root_mean_squared_error', 'rmse_over_stdev']

### here's the key grid search settings we're going to edit
hyperparams = 
param_vals = 
###

grid1 = GridSearch(param_names=hyperparams,param_values=param_vals,scoring='neg_mean_squared_error')
grids = [grid1]
splitter = NoSplit()
splitter.evaluate(X=X,
                  y=y, 
                  models=models,
                  preprocessor=None,
                  selectors=selector,
                  metrics=metrics,
                  savepath=savepath,
                  X_extra=X_extra,
                  leaveout_inds=X_testdata,
                  hyperopts = grids,
                  recalibrate_errors = True,
                  verbosity=3)



Conducting a similar analysis of the results as we did previously we'll refine the grid search one more time. Because we now have just two hyperparameters and both are numerical we could also make a heatmap of performance across the grid to identify which areas are performing well. For now, in the interest of time we'll skip making the full heatmap and again just do a quick scan of the top performing combinations of hyperparameters.

For the learning rate hyperparameter it looks like performance drops consistently when the value goes below 10^-3, while the regularization doesn't have a clear trend. Instead, the alpha values among the top combinations vary over many orders of magnitude. This suggests that the model isn't very sensitive to this parameter within the range we've given it, so similar to before let's just pick a value to fix and only vary the learning rate. So let's set the regularization to:  
alpha=0.001

Again, because we've narrowed down the hyperparameters we can increase the grid density of the remaining parameter. Let's set the number of grid points to 50 to do a much finer grid. Again copy the above settings to the cell below and make your adjustments.
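The trade-off behind these choices can be checked with a little arithmetic: the number of combinations in a grid is the product of the number of points per hyperparameter. The sketch below counts combinations for the three searches described in this section. It counts only the four real activation functions and ignores the extra "fake" entry, since how MAST-ML treats that entry is not covered here, and any cross-validation inside the search multiplies these counts further.

```python
from math import prod

# Grid points per hyperparameter for each search described in this section.
searches = {
    "rough grid (alpha, learning rate, activation)": [5, 5, 4],
    "refined grid (alpha, learning rate)": [10, 10],
    "fine grid (learning rate only)": [50],
}

# Total combinations is the product of the points along each axis.
n_combinations = {name: prod(points) for name, points in searches.items()}

for name, count in n_combinations.items():
    print(f"{name}: {count} combinations")
```

Dropping a hyperparameter from the grid is what lets us raise the density of the remaining ones without the number of combinations growing.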

In [None]:
# make sure to change the "editthis" tags
default_model = SklearnModel(model='MLPRegressor',activation='logistic',alpha=editthis)
models = [default_model]
selector = [NoSelect()]
metrics = ['r2_score', 'mean_absolute_error', 'root_mean_squared_error', 'rmse_over_stdev']

### here's the key grid search settings we're going to edit
hyperparams = 
param_vals = 
###

grid1 = GridSearch(param_names=hyperparams,param_values=param_vals,scoring='neg_mean_squared_error')
grids = [grid1]
splitter = NoSplit()
splitter.evaluate(X=X,
                  y=y, 
                  models=models,
                  preprocessor=None,
                  selectors=selector,
                  metrics=metrics,
                  savepath=savepath,
                  X_extra=X_extra,
                  leaveout_inds=X_testdata,
                  hyperopts = grids,
                  recalibrate_errors = True,
                  verbosity=3)



With this grid search what was the best learning rate?

With this grid search complete we can say we've fairly thoroughly investigated various combinations of hyperparameters and found the best combination. However, one thing we haven't done yet is vary the neuron structure of the network. There is a hyperparameter which sets this, but because of how it's structured it doesn't mesh well with the grid search settings above. Instead we'll have to vary the structure manually.

We do this with the "hidden_layer_sizes" hyperparameter, which looks like this:  
hidden_layer_sizes = (100,)   
hidden_layer_sizes = (100,100)

The numbers inside the parentheses specify the number of neurons in each layer, and we can add more layers by adding more commas with additional numbers. Note the trailing comma in the single-layer case: it makes the value a one-element tuple, whereas Python reads (100) as the plain integer 100.

By default scikit-learn sets the network to have one layer with 100 neurons. As a last optimization step let's try to vary this structure and see how it affects the results.

Just as we fixed the other hyperparameters outside of the grid search, we can set the structure by adding the hidden_layer_sizes hyperparameter to the model definition. We'll try a few different configurations:

1) Reduce the number of neurons in the single layer to 50. How does this affect the results? Does the simpler model cause a drop in performance?

2) If it doesn't, keep decreasing by 10 until you see a significant change in performance. Note: for me this occurred around 20 neurons.

3) Let's try multiple layers. Using the previous result of 20 neurons, let's increase the number of layers to 2 and then 3 and see how the performance is affected. Did increasing the number of layers affect performance? (Also note: this isn't the correct way to optimize this structure overall; we're just trying a few different combinations.)

4) Based on these results, try to find the best structure with the previously set hyperparameters. Report in your slides the best model you find!
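A small Python detail matters when writing these structures: parentheses around a single number do not create a tuple, so a one-layer structure is written with a trailing comma. The sketch below shows the difference and builds the candidate structures described in the steps above. The variable names are illustrative only, and the 20-neuron starting point for the multi-layer trials is the value reported above, which may differ for your run.

```python
# (100) is just the integer 100; (100,) is a one-element tuple.
assert isinstance((100), int)
assert isinstance((100,), tuple)

# Steps 1-2: shrink a single hidden layer from 50 neurons in steps of 10.
single_layer = [(n,) for n in range(50, 0, -10)]

# Step 3: stack 1, 2, and 3 layers of 20 neurons each.
multi_layer = [(20,) * n_layers for n_layers in (1, 2, 3)]

for structure in single_layer + multi_layer:
    print(structure, "->", len(structure), "hidden layer(s)")
```

Each tuple can then be tried as the hidden_layer_sizes value in the cell below, one run per structure.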


In [None]:
# make sure to edit the "editthis" sections below.
default_model = SklearnModel(model='MLPRegressor',activation='tempeditthis',alpha=editthis,learning_rate_init=editthis,hidden_layer_sizes=(editthis))
models = [default_model]
selector = [NoSelect()]
metrics = ['r2_score', 'mean_absolute_error', 'root_mean_squared_error', 'rmse_over_stdev']

splitter = SklearnDataSplitter(splitter='RepeatedKFold', n_repeats=3, n_splits=5)
splitter.evaluate(X=X,
                  y=y,
                  models=models,
                  preprocessor=None,
                  selectors=selector,
                  metrics=metrics,
                  savepath=savepath,
                  X_extra=X_extra,
                  leaveout_inds=X_testdata,
                  verbosity=3)