# MAST-ML Workflow Activity


---
This activity serves as an introduction to the more pythonic way of working with the MAST-ML software. During the activity we'll see how to set up some basic MAST-ML workflows which mirror workflows previously explored via:
- Citrination
- Nanohub Introduction to Machine Learning for Materials Science

Throughout the instructions we'll refer directly back to the Nanohub notebook. I'd advise having it open in a second tab so you can refer back to it. You can find that notebook here:  
www.nanohub.org/tools/intromllab 

The overall goal is to reproduce those workflows using the MAST-ML software, learn how to execute calls to MAST-ML, and learn how to find and analyze the results.

This notebook is set up in a linear fashion: working from top to bottom will execute the full workflow.


## Section 1: Setting up our Google Colab Environment
---
Before running any code we first need to install MAST-ML as well as its dependencies into the Colab environment. 


Clone the MAST-ML code into the content directory to the left. You should be able to see a new "MAST-ML" directory after running this cell.

In [1]:
!git clone --single-branch --branch skunkworks_s21 https://github.com/uw-cmg/MAST-ML

fatal: destination path 'MAST-ML' already exists and is not an empty directory.


Next, we install the required dependencies of MAST-ML into our Colab session.

In [None]:
!pip install -r MAST-ML/requirements.txt
!pip install pymatgen==2020.12.31
#!pip install scikit-learn=='0.23.2'



Now we'll sync Colab with Google Drive so that we can save our outputs directly to Drive. If you haven't already, I recommend making a folder in Google Drive titled "MASTML_colab" or something similar to hold all your results. Going forward I'll assume this folder exists and base the runs out of it. You can use a different name, as long as you update every place where that location is referenced.

In [4]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


We need to add the MAST-ML folder to our sys path so that Python can find the modules.


In [5]:
import sys
sys.path.append('MAST-ML')

Here we import the MAST-ML modules used. Note that if you're making edits you may have to come back to update these imports to grab new functionality that isn't included here.

In [18]:

from mastml.mastml import Mastml
from mastml.datasets import LocalDatasets, DataCleaning
from mastml.preprocessing import SklearnPreprocessor
from mastml.models import SklearnModel
from mastml.data_splitters import SklearnDataSplitter, NoSplit
from mastml.feature_selectors import EnsembleModelFeatureSelector, NoSelect
from mastml.feature_generators import ElementalFeatureGenerator
from mastml.hyper_opt import GridSearch

ImportError: ignored

And finally we'll import pandas to help with handling dataframes throughout the notebook.

In [None]:
import pandas as pd

## Section 2: Data Cleaning


---
This section is largely the same in functionality as the previous notebook.
We'll read in the same initial bandgap data we used in the previous notebook, then perform the same cleaning steps:  
1) Filtering for "Reliability"  
2) Averaging bandgap values where we have duplicates  


Read in the band gap data from our dataset. If you haven't already, upload the bandgap_data_v2.csv file to the MASTML_colab folder.

In [None]:
mastml_df = pd.read_csv("./drive/MyDrive/MASTML_colab/bandgap_data_v2.csv")

Filter for only Reliability 1

In [None]:
mastml_df_filtered = mastml_df[mastml_df["Reliability"]==1]

Define the averaging function used previously in the Nanohub notebook. Note that this wasn't defined explicitly in the previous notebook, as it was imported from a separate script file containing some of these helper functions. Here we'll just define it locally in the notebook.

In [None]:
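def average_bandgaps(master_df, input_col_header, output_col_header):
    # Work on a copy so the caller's (possibly filtered) DataFrame is left untouched
    master_df = master_df.copy()
    for chem_formula in master_df[input_col_header].unique():
        temp_df = master_df[master_df[input_col_header]==chem_formula]
        if len(temp_df) > 1:
            # Overwrite every duplicate row with the mean band gap for this formula
            avg_bandgap = temp_df[output_col_header].mean()
            # .loc accepts a set of row labels; .at only accepts a single label
            master_df.loc[temp_df.index, output_col_header] = avg_bandgap
    # Keep the first row for each formula, which now holds the averaged value
    master_df_clean = master_df.drop_duplicates(subset=input_col_header)
    return master_df_clean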

We then call the function to do the same bandgap averaging when we have duplicates in the dataset.

In [7]:
mastml_df_clean = average_bandgaps(mastml_df_filtered, 'chemicalFormula Clean', 'Band gap values Clean')

NameError: ignored
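
As a sanity check on what the averaging step does, here is a toy sketch that reproduces it with pandas' `groupby`. The formulas and band gap values are made up for illustration, and `groupby(...).transform("mean")` is a swapped-in alternative to the loop above, not the code used in the Nanohub notebook.

```python
import pandas as pd

# Hypothetical toy data: GaAs appears twice with different band gap values
toy_df = pd.DataFrame({
    "chemicalFormula Clean": ["GaAs", "Si", "GaAs"],
    "Band gap values Clean": [1.4, 1.1, 1.6],
})

# Replace every band gap with the mean of its formula group
toy_df["Band gap values Clean"] = (
    toy_df.groupby("chemicalFormula Clean")["Band gap values Clean"].transform("mean")
)
# Keep one row per formula (the first occurrence, now holding the averaged value)
toy_clean = toy_df.drop_duplicates(subset="chemicalFormula Clean")

print(toy_clean)
```

The two GaAs rows collapse into a single row holding their mean, while the Si row passes through unchanged.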

This section is new. We reset the index to match the previous notebook so that we can explicitly define the same Train / Test split that we used before. The test_indices object is just a hard-coded copy of the index values from the previous notebook. If you want to check them, find the X_test object in that notebook and call X_test.index.

In [8]:
mastml_df_clean.reset_index(inplace=True)
mastml_df_clean.drop(columns='level_0',inplace=True)

NameError: ignored

In [9]:
test_indices = [279, 168, 192,  33, 223,  22, 341, 453, 460, 455, 120, 430, 436,
            366, 292, 278, 163, 216, 420, 210, 214, 422, 340,  41, 416, 146,
            280, 229, 300, 111, 407, 250, 379,  20, 356,   4, 141, 139, 121,
            324, 147, 415,  57, 301, 393, 454,  30]

Finally we define a new column "testdata" which is going to be a binary column that is either 0 for "not testing data" or 1 for "is testing data". This is what we can feed into MAST-ML to explicitly define a set of Test data that is held out from all training.

In [10]:
mastml_df_clean["testdata"]=0

NameError: ignored

In [11]:
for idx in test_indices:
  mastml_df_clean.at[idx,'testdata']=1

NameError: ignored
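
The two cells above build the "testdata" column with a loop. A vectorized way to get the same kind of 0/1 flag is sketched below on a made-up DataFrame; `toy_df` and `toy_test_indices` are hypothetical stand-ins for `mastml_df_clean` and `test_indices`.

```python
import pandas as pd

# Hypothetical toy frame with a default 0..5 integer index
toy_df = pd.DataFrame({"Band gap values Clean": [1.1, 1.5, 0.7, 3.4, 2.2, 5.5]})
toy_test_indices = [1, 4]  # made-up index labels to hold out

# 1 where the row label is in the hold-out list, 0 otherwise
toy_df["testdata"] = toy_df.index.isin(toy_test_indices).astype(int)

print(toy_df["testdata"].tolist())
```

Either approach relies on the index labels matching the ones used in the previous notebook, which is why the index reset above matters.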

In [12]:
output_path = "./drive/MyDrive/MASTML_colab/bandgap_data_v3.csv"
mastml_df_clean.to_csv(output_path,index=False)

NameError: ignored

Notice that in the initial data cleaning and configuration there is still a bit that we do outside of MAST-ML. While MAST-ML gives a good deal of flexibility and useful tools for performing these machine learning workflows, there will often still be custom steps like this added to the overall workflow, and they vary dataset by dataset.

## Section 3: Initializing MAST-ML
---
Now we'll dive into interacting more directly with the MAST-ML software. The first thing we need to do is set up some of the baseline information that MAST-ML will use as we call different sections of the code. This is similar to the [general] section from the previous configuration-file-oriented code base.


Set the savepath that MAST-ML results are saved to. It's recommended to make this a unique name each time you come back to this notebook. That way all the outputs you get from each session will be in a unique location that's easier to come back to later.

By default I've set the output to the "Nanohub_workflow" folder under our Colab folder.

In [13]:
SAVEPATH = 'drive/MyDrive/MASTML_colab/Nanohub_workflow'

mastml = Mastml(savepath=SAVEPATH)
savepath = mastml.get_savepath

With MAST-ML initialized you should see your output directory created. You can check this using the file tree on the left of the screen or directly through google drive.

Next up we need to define the configuration of the data file that we set up earlier. We'll define the names for all of the key components:  
target: the target variable that we want to predict  
extra_columns: the metadata columns that aren't features but that we still want to keep track of  
testdata_columns: the column with binary values defining what is and isn't test data  
group_column: the column name specifying unique groups in the data. We don't use this during this workflow  
as_frame: determines the structure of outputs. True gives us dataframe outputs that are easier to read in the notebook

In [14]:
target = 'Band gap values Clean'
extra_columns = ['index', 'Band gap units', 'Band gap method', 'Reliability','chemicalFormula Clean']
testdata_columns = ['testdata']

# calling the LocalDatasets section of the code initializes this section which we then execute with the method below
d = LocalDatasets(file_path='./drive/MyDrive/MASTML_colab/bandgap_data_v3.csv', 
                  target=target, 
                  extra_columns=extra_columns, 
                  group_column=None,
                  testdata_columns=testdata_columns,
                  as_frame=True)

# Load the data with the load_data() method
data_dict = d.load_data()

FileNotFoundError: ignored

Let's take a second to look through what just happened. The previous cell defined the "data_dict" object, a dictionary of the various things that were loaded in from the dataset. We'll pull those out of the dictionary and assign each to its own object.

We see there are 5 keys:  
  X: the X feature matrix (used to fit the ML model). Notice this is empty because we haven't done any feature generation  
  y: the y target data vector (true values)  
  X_extra: matrix of meta data not used in fitting (i.e. not part of X or y)  
  groups: vector of group labels. empty because we didn't set it  
  X_testdata: matrix or vector of left out data indices

In [15]:
data_dict.keys()

NameError: ignored

In [16]:
X = data_dict['X']
y = data_dict['y']
X_extra = data_dict['X_extra']
groups = data_dict['groups']
X_testdata = data_dict['X_testdata']

NameError: ignored

In [None]:
X

In [None]:
groups

In [None]:
X_extra

In [None]:
X_testdata
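
To make the roles of these objects concrete, here is a pandas-only sketch of how a data file could be carved up into the same pieces. It is an illustration of the idea on made-up data, not MAST-ML's implementation of `load_data()`; in particular, representing `X_testdata` as one list of row labels per testdata column is my assumption.

```python
import pandas as pd

# Hypothetical miniature version of the cleaned band gap file
toy_df = pd.DataFrame({
    "chemicalFormula Clean": ["GaAs", "Si", "ZnO", "GaN"],
    "Reliability": [1, 1, 1, 1],
    "Band gap values Clean": [1.5, 1.1, 3.4, 3.2],
    "testdata": [0, 1, 0, 1],
})

target = "Band gap values Clean"
extra_columns = ["chemicalFormula Clean", "Reliability"]
testdata_columns = ["testdata"]

y = toy_df[target]               # target vector
X_extra = toy_df[extra_columns]  # metadata carried along, never fit on
# Row labels flagged as held-out test data, one list per testdata column
X_testdata = [toy_df.index[toy_df[col] == 1].tolist() for col in testdata_columns]
# Whatever is left over is the feature matrix; here nothing is left yet
X = toy_df.drop(columns=[target] + extra_columns + testdata_columns)

print(X.shape, X_testdata)
```

The leftover feature matrix has four rows and zero columns, which mirrors why X is empty at this point in the notebook: every column in the file has been claimed as target, metadata, or test flag.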

## Section 4: Reproducing Key Workflow Steps
---
Now we'll start to dive into reproducing the key workflow steps from the previous notebook. These are:  
1) Feature Generation  
2) Feature Engineering  
3) Model Assessment and Training  
4) Model Optimization  
5) Model Predictions

If the data contains missing values (this one doesn't), we can clean the data with the built-in tools in MAST-ML, which correct missing values and provide some basic analysis of the input data. Even though there are no missing values here, the data cleaner will still output some useful plots and statistics of our input data.

In [None]:
cleaner = DataCleaning()
X, y = cleaner.evaluate(X=X, 
                        y=y, 
                        method='imputation', 
                        strategy='mean', 
                        savepath=savepath)
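
Our dataset has no gaps, so the imputation settings above have nothing to fill. For a dataset that does have missing values, mean imputation amounts to something like the sketch below. It uses plain pandas on a made-up table as a stand-in; it is not the code that DataCleaning runs internally.

```python
import numpy as np
import pandas as pd

# Hypothetical feature table with one missing entry
toy_X = pd.DataFrame({"feat_a": [1.0, np.nan, 3.0], "feat_b": [10.0, 20.0, 30.0]})

# Fill each missing value with the mean of its own column
toy_X_imputed = toy_X.fillna(toy_X.mean())

print(toy_X_imputed)
```

The missing entry in `feat_a` is replaced by the mean of the two observed values in that column, and `feat_b` is left unchanged.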

Looking at the format of the DataCleaning section also highlights the key way we will interact with MAST-ML in this format. For each section of the code we want to use we'll initialize it using what's called a class name, in this case "DataCleaning", and then call the "evaluate" method to essentially run the code for that Class.

Let's look through the outputs and compare them to the initial dataset analysis from the previous Nanohub workflow. Open the "histogram_target_values.png" file in the newly created DataCleaning folder under our output directory. Compare it to the histogram we made in the previous notebook. Are they the same?

This is the type of check we would do to make sure we aren't missing any data switching between the two platforms.

Next is generating the elemental features used in the model. Just like the previous step we define the class of feature generation we want to use, and then call the evaluate method. Again results are output to a new folder with the name of the Class that was evaluated. The features are also added to the X object so we can continue to use them directly without having to read in from the generated files.

You can see from the output that MAST-ML is also performing some basic feature engineering by dropping features that are missing values. This is the most basic way of handling missing values, and if we wanted to do something more complex later we could come back and use imputation to fill in those missing values instead.

In [None]:
generator = ElementalFeatureGenerator(composition_df = X_extra["chemicalFormula Clean"],
                      feature_types='composition_avg',
                      remove_constant_columns=True)
X, y = generator.evaluate(X = X,
                          y = y,
                          savepath = savepath)
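
To give a feel for what a composition-averaged feature is, here is a small pure-Python sketch. It assumes that "composition_avg" means an average of an elemental property weighted by atomic fraction, which is my reading rather than something stated in this notebook. The helper `composition_average` and the tiny property table are hypothetical and only handle simple formulas without parentheses.

```python
import re

# Atomic numbers for a few elements; a real generator would draw on a
# much larger table of elemental properties
ATOMIC_NUMBER = {"Ga": 31, "As": 33, "Si": 14, "O": 8}

def composition_average(formula, property_table):
    """Atomic-fraction-weighted average of one elemental property."""
    # Each match is (element symbol, optional count), e.g. ('O', '2')
    tokens = re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula)
    counts = {}
    for symbol, count in tokens:
        counts[symbol] = counts.get(symbol, 0.0) + (float(count) if count else 1.0)
    total = sum(counts.values())
    return sum(property_table[s] * n / total for s, n in counts.items())

print(composition_average("GaAs", ATOMIC_NUMBER))
print(composition_average("SiO2", ATOMIC_NUMBER))
```

For GaAs the average atomic number works out to 32, and for SiO2 to 10 (up to floating-point rounding). Each elemental property in the real generator would contribute one such column to X.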

The cell below outputs the feature object directly. Use it to compare the features generated here to those in the previous workflow. Do we have the same total number?

If they're different can you think of any reasons why?  
hint: mastml does some initial cleaning automatically on the features.

In [17]:
X

NameError: ignored

Next we'll see one of the benefits of using MAST-ML in this new way. Currently MAST-ML doesn't have the same method implemented to remove highly correlated features. Previously, adding this in would have been a good deal of work. But because we're using MAST-ML in this interactive notebook environment, we can add our own feature engineering steps that aren't included in the MAST-ML software. Below I've just copied over the code from the previous notebook to filter highly correlated features.

In [None]:
import numpy as np

In [None]:
features_corr_df = X.corr(method="pearson").abs()
# Filter the features with correlation coefficients above 0.95
upper = features_corr_df.where(np.triu(np.ones(features_corr_df.shape), k=1).astype(bool))
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
X = X.drop(columns=to_drop)

In [None]:
X
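
If the upper-triangle trick in the filter above is hard to picture, the toy example below runs the same steps on three made-up features. The column names and values are hypothetical; "b" is an exact multiple of "a", so it should be the one that gets dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical features: "b" duplicates "a" up to a scale factor, "c" is only weakly related
toy_X = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})

corr = toy_X.corr(method="pearson").abs()
# Keep only the strict upper triangle so each pair of features is inspected once
mask = np.triu(np.ones(corr.shape), k=1).astype(bool)
upper = corr.where(mask)
# A column is dropped if it correlates above 0.95 with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

print(to_drop)
```

Because only the upper triangle is checked, the first feature of each highly correlated pair is kept and the later one is dropped, so column order affects which feature survives.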

Next up we perform the last feature engineering step, which is to normalize the features using scikit-learn's MinMaxScaler. 

In [None]:
preprocessor = SklearnPreprocessor(preprocessor='MinMaxScaler', as_frame=True)
X = preprocessor.evaluate(X=X,
                          y=y, 
                          savepath=savepath)
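
For reference, min-max scaling with the default range maps each feature column onto [0, 1] using (x - min) / (max - min). The NumPy sketch below applies that formula by hand to a made-up matrix; it illustrates the arithmetic and is not a replacement for the preprocessor call above.

```python
import numpy as np

# Hypothetical feature matrix: 3 samples (rows) x 2 features (columns)
toy_X = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 800.0]])

col_min = toy_X.min(axis=0)
col_max = toy_X.max(axis=0)
# Rescale every column independently onto the interval [0, 1]
toy_X_scaled = (toy_X - col_min) / (col_max - col_min)

print(toy_X_scaled)
```

After scaling, both columns run from 0 to 1 even though their original magnitudes differed by a factor of a few hundred.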

With our features set up, we now jump into training and evaluating models. This section is a bit more complex as we're defining multiple things at once. Things we define at the top of the cell are:  
1) The model or models to use. These need to be in list format which is why you see them in square brackets.  
2) A potential feature selector. Here we don't use one to mirror the previous workflow.  
3) Assessment metrics. We specify a range of them.  
4) A Splitter to use. The splitter is the class that we'll call in the bottom half and defines what kind of splits in the dataset we want to make. Recall that we previously established our Test set of data. So for this first test of the "default model" we don't need to make any additional splits which is why we use the NoSplit() class

In [None]:
default_decisiontree = SklearnModel(model='DecisionTreeRegressor')
models = [default_decisiontree]
selector = [NoSelect()]
metrics = ['r2_score', 'mean_absolute_error', 'root_mean_squared_error', 'rmse_over_stdev']

splitter = NoSplit()
splitter.evaluate(X=X,
                  y=y, 
                  models=models,
                  preprocessor=None,
                  selectors=selector,
                  metrics=metrics,
                  savepath=savepath,
                  X_extra=X_extra,
                  leaveout_inds=X_testdata,
                  verbosity=3)
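
The metric names in the list above correspond to standard formulas, sketched below with NumPy on made-up numbers. One assumption to flag: I take "rmse_over_stdev" to mean the RMSE divided by the standard deviation of the true values, so check the MAST-ML source if the exact definition matters for your analysis.

```python
import numpy as np

# Hypothetical true and predicted band gaps
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])

residuals = y_pred - y_true
mae = np.mean(np.abs(residuals))
rmse = np.sqrt(np.mean(residuals ** 2))
# R^2 = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)
# Assumption: RMSE normalized by the standard deviation of the true values
rmse_over_stdev = rmse / np.std(y_true)

print(mae, rmse, r2, rmse_over_stdev)
```

A single large miss pulls the RMSE up more than the MAE, which is one reason it helps to track several metrics side by side.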

After this run completes we want to look at how the model is performing. Navigate to the newly created "DecisionTreeRegressor..." folder and find both the "parity_plot_leaveout.png" file and the "parity_plot_train.png" file. Compare them to each other as well as to the parity plots made during the Nanohub notebook for the default model. Are they the same? Similar?  

Note that the model type used is technically different. Previously we used a RandomForest with 1 tree, which very closely mimics a single decision tree, and this time we explicitly used the DecisionTreeRegressor from scikit-learn.

Next we'll reproduce the model hyperparameter optimization we previously performed. Most things stay very similar, but we switch to the Random Forest model so we can increase the number of trees again, and we add a new option to the splitter evaluate call: the "GridSearch" class in MAST-ML. This mirrors the GridSearchCV call made in the previous Nanohub notebook, although the format is slightly different.  

For the GridSearch class we need to specify:  
1) param_names: the hyperparameters to grid over  
2) param_values: a string which specifies the grid. It follows the format of a linspace or logspace command in programs like MATLAB or Python packages like NumPy. The numbers in the string specify the starting value, the ending value, and the number of grid points. Two options follow. The first is lin or log, which specifies whether the numbers are in linear space or log space; log space means we are specifying the exponent (10^x). The second is the type of number, with int and float being the two most common; some hyperparameters need to be integers.  
3) scoring: a string specifying the score function to use.
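
To see what grid a param_values string describes, the sketch below expands one by hand with NumPy. The helper `expand_grid_string` is hypothetical and written only for illustration; it is not MAST-ML's parser, and details such as endpoint handling and rounding to integers are my assumptions.

```python
import numpy as np

def expand_grid_string(param_values):
    """Expand 'start end num lin|log int|float' into an array of grid values."""
    start, end, num, spacing, dtype = param_values.split()
    start, end, num = float(start), float(end), int(num)
    if spacing == "lin":
        values = np.linspace(start, end, num)
    elif spacing == "log":
        # In log space, start and end are exponents of 10
        values = np.logspace(start, end, num)
    else:
        raise ValueError(f"unknown spacing: {spacing}")
    if dtype == "int":
        values = np.round(values).astype(int)
    return values

print(expand_grid_string("2 50 5 lin int"))
print(expand_grid_string("-3 1 5 log float"))
```

Under these assumptions, the string '2 50 5 lin int' describes five evenly spaced integer values from 2 to 50, and '-3 1 5 log float' describes five values from 10^-3 to 10^1, one per decade.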

In [None]:
default_RF = SklearnModel(model='RandomForestRegressor')
models = [default_RF]
selector = [NoSelect()]
metrics = ['r2_score', 'mean_absolute_error', 'root_mean_squared_error', 'rmse_over_stdev']
grid1 = GridSearch(param_names='n_estimators',param_values='2 50 5 lin int',scoring='neg_mean_squared_error')
grids = [grid1]
splitter = NoSplit()
splitter.evaluate(X=X,
                  y=y, 
                  models=models,
                  preprocessor=None,
                  selectors=selector,
                  metrics=metrics,
                  savepath=savepath,
                  X_extra=X_extra,
                  leaveout_inds=X_testdata,
                  hyperopts = grids,
                  recalibrate_errors = True,
                  verbosity=3)


With the optimization run complete, we'll again look through our outputs to find the results. Go into the new RandomForestRegressor folder, then into split_outer_0/split_0, and find the "grid search" files. One of them has the best identified hyperparameters, and the other gives the full results for all options tried. 

Do the results match the previous grid search from the Nanohub workflow? That is, do we get the same best number of trees?

The next step is to generate the 5-fold Cross Validation results. Unfortunately it looks like there's currently a bug with the decision tree, so while the first cell below shows how that run should look, it isn't working at the moment.

In the second cell below, with the Random Forest, the CV works correctly, so we'll run 5-fold CV (repeated twice, per n_repeats=2) with the optimized number of trees from the above hyperparameter grid search.

In [None]:
default_decisiontree = SklearnModel(model='DecisionTreeRegressor')
models = [default_decisiontree]
selector = [NoSelect()]
metrics = ['r2_score', 'mean_absolute_error', 'root_mean_squared_error', 'rmse_over_stdev']

splitter = SklearnDataSplitter(splitter='RepeatedKFold', n_repeats=2, n_splits=5)
splitter.evaluate(X=X,
                  y=y, 
                  models=models,
                  preprocessor=None,
                  selectors=selector,
                  metrics=metrics,
                  savepath=savepath,
                  X_extra=X_extra,
                  leaveout_inds=X_testdata,
                  verbosity=3)

In [None]:
opt_RF = SklearnModel(model='RandomForestRegressor',n_estimators=50)
models = [opt_RF]
selector = [NoSelect()]
metrics = ['r2_score', 'mean_absolute_error', 'root_mean_squared_error', 'rmse_over_stdev']

splitter = SklearnDataSplitter(splitter='RepeatedKFold', n_repeats=2, n_splits=5)
splitter.evaluate(X=X,
                  y=y, 
                  models=models,
                  preprocessor=None,
                  selectors=selector,
                  metrics=metrics,
                  savepath=savepath,
                  X_extra=X_extra,
                  leaveout_inds=X_testdata,
                  recalibrate_errors = True,
                  verbosity=3)
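
The splitter settings above name scikit-learn's RepeatedKFold. The sketch below calls it directly on made-up data to show how many splits those settings produce; the array shape and the `random_state` value are my own choices for illustration and are not set anywhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Hypothetical data: 20 samples, 3 features
toy_X = np.arange(60).reshape(20, 3)

rkf = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)
splits = list(rkf.split(toy_X))

print(len(splits))                   # number of train/validation splits
train_idx, val_idx = splits[0]
print(len(train_idx), len(val_idx))  # sizes within one split
```

With 5 folds repeated twice there are 10 splits in total, and each split holds out one fifth of the rows for validation. In the MAST-ML run, these splits are made on the data remaining after the leaveout test rows are set aside.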

For the predictions section in the Nanohub workflow we used the test data as data to predict. In the MASTML framework all of the predictions are made and named with the convention "leaveout..." on the files. To compare how the test data is predicted between MASTML and the previous Nanohub notebook we can compare the predictions in these files to the ones made previously after optimizing the model.

## Section 5: Modifying the Workflow
---

And with that we've completed the same steps as previously, using the MASTML code. In doing so we've been able to automatically generate our parity plots, along with a lot of other statistics and plots that we haven't learned about yet. 

And with this setup we can now do the last step of the activity, in which we can take advantage of the steps we've already established to start to make changes.

In the current code we've used the DecisionTreeRegressor and RandomForestRegressor models from scikit-learn. Now choose another model type and repeat the workflow by modifying each step:

1) Pick another model type from scikit-learn. You can see a reference for available models here: https://scikit-learn.org/stable/supervised_learning.html 
If you're not sure what kind of model to try, I'd suggest one of the linear models such as Ridge Regression or LASSO. To see the list of available hyperparameters for each model, click its link on that page.

2) Build a default model where you don't change any hyperparameters from the scikit-learn defaults, and analyze its performance both on the Test data and with 5-fold CV.

3) Perform a grid search on one of the hyperparameters. I'd suggest the alpha hyperparameter if using one of the linear models suggested above.

4) Compare the performance with the optimized hyperparameters. Were you able to improve the performance? How much did the RMSE value decrease for the Test set? How about the 5-fold CV test?