<a href="https://colab.research.google.com/github/aghosh92/Cation-Ordering-ML/blob/main/Example_SissoRegression_Matminer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook has been prepared by Dennis P. Trujillo and Ayana Ghosh.

Email: dptru10@gmail.com 

Email: research.aghosh@gmail.com

It shows how the SISSO (Sure Independence Screening and Sparsifying Operator) approach can be implemented in a regression setting, using Matminer and Automatminer, to find the combination of nonlinearly transformed features that best describes the target.

Note: Installing the packages can leave the environment with inconsistent package versions, which may cause errors. If this happens, restart the runtime and run the cells again.

Install packages

In [1]:
!pip install matminer 
!pip install automatminer 
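
If version inconsistencies are suspected, it can help to check what is actually installed before restarting the runtime. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the list of packages to check are assumptions for illustration, not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages this notebook relies on (an assumed, non-exhaustive list)
report = installed_versions(
    ["matminer", "automatminer", "scikit-learn", "numpy", "pandas"]
)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```
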



Import essential libraries

In [2]:
import os
import pandas as pd
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso
from matminer.featurizers.function import FunctionFeaturizer
#from automatminer import DataCleaner


In [3]:
#@title Utility Functions
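def get_data(selected_feature_list, depth_value):
    """Expand the selected features with FunctionFeaturizer.

    Relies on the globals df, df_x and target, which are defined in later cells.
    Returns the target values P and a DataFrame of the generated features.
    """
    function_featurizer = FunctionFeaturizer(multi_feature_depth=depth_value,
                                             combo_function=np.sum)
    function_featurizer.set_n_jobs(4)
    function_featurizer = function_featurizer.fit(df_x[selected_feature_list])
    df_combined = function_featurizer.featurize_dataframe(df_x[selected_feature_list],
                                                          selected_feature_list)

    df_combined[target] = df[target]
    # Drop every column that contains an infinite or missing value
    df_combined = df_combined.replace([np.inf, -np.inf], np.nan)
    df_combined = df_combined.dropna(axis=1)
    # Remove the original input columns
    df_combined = df_combined.drop(columns=selected_feature_list)
    # Cache the result (target column included) so later runs can reload it
    df_combined.to_csv('/content/functionalized_data.csv')

    P = df_combined[target].values
    df_combined = df_combined.loc[:, df_combined.columns != target]

    return P, df_combined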
    
def lasso_fit(lam, P, D, feature_list):
    """Fit LASSO with penalty lam; return the coefficients and the selected feature names."""
    #D_standardized = ss.zscore(D)
    lasso = Lasso(alpha=lam)
    lasso.fit(D, P)
    coef = lasso.coef_

    # get strings of selected features (those with nonzero coefficients)
    selected_indices = coef.nonzero()[0]
    selected_features = [feature_list[i] for i in selected_indices]

    # in-sample predictions of the LASSO model; no RMSE is computed or returned here
    P_predict = lasso.predict(D)

    return coef, selected_features
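
`lasso_fit` computes the in-sample predictions `P_predict` but does not turn them into an error metric. As a minimal sketch, assuming NumPy only, an RMSE could be computed as follows. The helper name `rmse` and the toy numbers are hypothetical and are not part of the original notebook.

```python
import numpy as np


def rmse(P, P_predict):
    """Root-mean-square error between targets P and predictions P_predict."""
    P = np.asarray(P, dtype=float)
    P_predict = np.asarray(P_predict, dtype=float)
    return float(np.sqrt(np.mean((P - P_predict) ** 2)))


# Toy check with made-up numbers (not data from this notebook)
example_error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(example_error)
```

Because the predictions are made on the same rows used for fitting, such an RMSE is an in-sample value and would usually be lower than the error on unseen data.
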

Mount Google Drive

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Read in the Data (.csv files)

In [5]:
!gdown https://drive.google.com/uc?id=19tVrJblX9SXGcp1VyfdBT5cxZor4Sk3Y

Downloading...
From: https://drive.google.com/uc?id=19tVrJblX9SXGcp1VyfdBT5cxZor4Sk3Y
To: /content/final_layer_predict_energy_diff_mod.csv
100% 55.4k/55.4k [00:00<00:00, 3.71MB/s]


In [6]:
filename = '/content/final_layer_predict_energy_diff_mod.csv'
df = pd.read_csv(filename).set_index('Index').sample(n=10,random_state=2)

In [7]:
target   = 'Target'
#These features were selected based on the RF model importances
selected_feature_list = ['C_B','r_B_prime_site','B_prime__p','B_prime__d','_bar_y_A_dis_bar_',
                         'dis_y_A_prime_2','dis_y_A_prime_1','_cell_volume','r_Asite','_cellength_a']
df_x = df[selected_feature_list]

In [8]:
#%cd /content

In [9]:
#%rm final_layer_predict_energy_diff_mod.csv

Generate functionalized features.

As a representative case, we only show an example with a feature depth of 2. Additional feature depths were used to generate the results reported in the referenced manuscript.
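
To make "feature depth of 2" concrete, here is a minimal NumPy sketch of the idea: each primary feature is passed through a set of unary functions (depth 1), and transformed features from two different primary features are then summed (depth 2). This is an illustration only, not Matminer's implementation. The transform set, the helper name `expand_features`, the toy columns `r` and `v`, and the rule that only different primary features are paired are all assumptions.

```python
from itertools import combinations

import numpy as np

# Hypothetical unary transforms; the featurizer's actual expression set may differ
TRANSFORMS = {
    "{}": lambda x: x,
    "1/{}": lambda x: 1.0 / x,
    "sqrt({})": np.sqrt,
    "log({})": np.log,
    "{}**2": lambda x: x ** 2,
}


def expand_features(columns):
    """columns: dict of name -> 1-D array. Returns dict of expression -> array."""
    # Depth 1: apply every transform to every primary feature
    single = {}
    for name, values in columns.items():
        x = np.asarray(values, dtype=float)
        for template, func in TRANSFORMS.items():
            single[(name, template.format(name))] = func(x)

    expanded = {label: vals for (_, label), vals in single.items()}

    # Depth 2: sum pairs of transformed features built from different primaries
    for (key_a, vals_a), (key_b, vals_b) in combinations(single.items(), 2):
        if key_a[0] != key_b[0]:
            expanded[f"{key_a[1]} + {key_b[1]}"] = vals_a + vals_b
    return expanded


# Toy positive-valued columns (made-up numbers, not the notebook's data)
toy = {"r": np.array([1.0, 2.0, 4.0]), "v": np.array([4.0, 9.0, 16.0])}
expanded = expand_features(toy)
print(len(expanded), "candidate features")
```

With 2 primary features and 5 transforms this already yields 10 depth-1 and 25 depth-2 candidates, which suggests why the candidate space grows quickly with more features or greater depth.
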

In [10]:
#may take a few minutes to run
functionalized_csv ='/content/functionalized_data.csv' 
if os.path.exists(functionalized_csv):
    print("loading functionalized data...")
    df_D = pd.read_csv(functionalized_csv).set_index('Index')
    P = df_D[target].values
    df_D = df_D.drop(columns=target)
    D = df_D.values
    features_list = df_D.columns.to_list()
else: 
    print('generating functionalized data...')
    P, df_D = get_data(selected_feature_list,2)
    features_list = df_D.columns.to_list()
    D = df_D.values

generating functionalized data...


FunctionFeaturizer: 100%|██████████| 10/10 [01:24<00:00,  8.49s/it]


Perform LASSO regression.

This is a representative case showing how LASSO can be used to select features. Converged results are used and reported in the related manuscript.
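
For reference, scikit-learn's `Lasso` estimator, which `lasso_fit` wraps, minimizes the objective below. The symbols are chosen here to match the notebook's variable names and are otherwise an assumption: $P$ is the target vector, $D$ the matrix of functionalized features, $c$ the coefficient vector, $n$ the number of samples, and $\alpha$ the penalty strength (`alpha` in the code). The intercept, which is fitted by default, is omitted for brevity.

```latex
\hat{c} = \arg\min_{c} \; \frac{1}{2n} \lVert P - D c \rVert_2^2 + \alpha \lVert c \rVert_1
```

A larger $\alpha$ drives more entries of $\hat{c}$ exactly to zero, so the number of nonzero coefficients, reported as the dimension of the descriptor, shrinks as $\alpha$ grows.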

In [11]:
alpha = 0.2
coef, selected_features = lasso_fit(alpha, P, D, features_list)


In [13]:
print("alpha: %.3f\t dimension of descriptor: %s" 
      %(alpha, len(selected_features)))
lasso_features = pd.DataFrame({
    'features': np.array(selected_features),
    'abs(nonzero_coefs_LASSO)': np.abs(coef[coef.nonzero()]),
}).sort_values(by='abs(nonzero_coefs_LASSO)', ascending=False)
print(lasso_features.head(n=10))
lasso_features.to_csv('lasso_equations.csv')

alpha: 0.200	 dimension of descriptor: 1704
                                       features  abs(nonzero_coefs_LASSO)
2                          r_B_prime_site**(-3)                111.269623
1432  sqrt(_cell_volume) + log(dis_y_A_prime_2)                 85.272079
1254    log(_bar_y_A_dis_bar_) + 1/_cell_volume                 81.445326
745           1/_cell_volume + B_prime__p**(-2)                 53.839470
1431      log(dis_y_A_prime_2) + 1/_cell_volume                 52.533048
1063               B_prime__d + _cellength_a**2                 50.742101
12                       log(_bar_y_A_dis_bar_)                 48.611888
0                                        C_B**2                 45.633063
1290           r_Asite + log(_bar_y_A_dis_bar_)                 36.683446
25                   C_B + r_B_prime_site**(-3)                 36.571798
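
The ranking above compares absolute LASSO coefficients, whose magnitudes depend on the scale of each feature column. `lasso_fit` contains a commented-out z-score step. A minimal sketch of such a standardization, assuming plain NumPy rather than the `ss.zscore` call hinted at in that comment, might look like this. The helper name `zscore_columns` and the toy matrix are hypothetical and are not part of the original notebook.

```python
import numpy as np


def zscore_columns(D):
    """Standardize each column to zero mean and unit variance.

    Constant columns (zero standard deviation) are dropped.
    Returns the standardized matrix and a boolean mask of the kept columns.
    """
    D = np.asarray(D, dtype=float)
    mean = D.mean(axis=0)
    std = D.std(axis=0)  # population standard deviation (ddof=0)
    keep = std > 0
    return (D[:, keep] - mean[keep]) / std[keep], keep


# Toy matrix with made-up numbers; the last column is constant
D_toy = np.array([[1.0, 10.0, 5.0],
                  [2.0, 20.0, 5.0],
                  [3.0, 30.0, 5.0]])
D_std, keep = zscore_columns(D_toy)
print(D_std.shape, keep)
```

If columns are dropped this way, the same mask would also need to be applied to `features_list` so that coefficient indices still map to the right feature names.
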
