In [None]:
from IPython.display import Image

# Machine learning lattice parameters of M$_2$AX phases #

### MAX phases are a family of laminated materials (https://doi.org/10.1016/j.mattod.2023.11.010) with the generic formula M$_{n+1}$AX$_n$, where M is an early transition metal (e.g. Sc, Ti, V, Cr, Mn, Zr, Nb, Mo, Hf, Ta and others), A is an A-group element (e.g. Al, Si, S, Ga, Ge, Se, Cd, In, Sn and others) and X is B, C or N. These materials have attracted growing attention because they combine ceramic and metallic properties in a homogeneous bulk material. Of interest for both theory and practical applications, their physical and chemical properties can be tuned by alloying the elements on their different sublattices. Moreover, MAX phases are important as the three-dimensional precursors from which two-dimensional MXenes are obtained by exfoliating the A elements (https://doi.org/10.1021/acs.chemrev.3c00241). ###
### ![](Figures/Fig1.png) ###
### Figure 1 *(a) Two SEM images of MXene flakes and (b) a model of a unit cell of an MXene with $n=1$, together with the supercell obtained by replicating the unit cell twice in each direction.* ###
### Alloying has greatly extended the family of MAX phases, giving rise to both ordered and disordered structures. Because the combinatorial space that can be explored in this way is, at least in principle, very large, many new materials are still waiting to be synthesized and characterized. Computational design and modeling play a crucial role here. A fundamental step that precedes experimental attempts at synthesizing complex MAX phases is the computational screening of unstable phases, usually done with density functional theory (DFT) calculations of relative free energies of formation. For multi-site solid-solution MAX phases, one first needs to perform a full variable-cell relaxation of the atomic positions inside a supercell that models the randomness of the solution, typically generated via special quasirandom structures (SQS, https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.65.353). However, estimating the initial positions can be quite challenging in these cases, where Vegard's law also starts to fail. ###

### In this notebook, you will learn how to build a machine learning model that predicts the lattice parameters of complex M$_2$AX phases with up to five M elements, two A elements, and two X elements, with experimental-level accuracy. ###

In [None]:
import os
import numpy as np
import pickle
import pandas as pd
import math
import optuna
from optuna.samplers import TPESampler
import warnings
#optuna.logging.set_verbosity(optuna.logging.WARNING)
#warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import root_mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb
from xgboost import XGBRegressor

from functions import  permutations, read_data, rescale_output, reduce_features, generate_features

### After importing everything we need, we first collect the data. The file 'input_data_211.csv' contains the experimentally measured lattice parameters $a$ and $c$ (in Angstrom) of every M$_2$AX phase synthesized so far. The first three columns contain strings listing the species in the M, A and X sublattices, respectively. The next three columns contain the relative concentrations of the elements in the three sublattices. Let's see what it looks like:

In [None]:
data = pd.read_csv('input_data_211.csv')
data.head(5)

### These are quite simple 'ternary' (since they only have three elements, one per type) M$_2$AX phases. If you scroll down the file, however, you will find more complex phases with different chemical species in them, for example: 

In [None]:
data.iloc[[200]]

### This entry refers to an M$_2$AX phase that was synthesized by alloying five different transition metals, two A elements and carbon. More details can be found in https://doi.org/10.1080/21663831.2021.2017043.

### Now we can load the whole data set with the function read_data, which takes the csv file as input and returns two arrays. The X array contains the input features of each structure, and the Y array contains the corresponding lattice parameters $a$ and $c$. The input features are built from the physical and chemical properties of each element entering the MAX phase, together with the relative concentrations. The complete list of the fifteen variables used is given in the following table. Data were taken from https://pubchem.ncbi.nlm.nih.gov/periodic-table/.

| Category | Feature |
| :-:  | :-:  |
| Structural |  Relative concentration <br> Atomic number <br>  Neutron number <br> Atomic mass|
| Physical and chemical | Pauling electronegativity  <br>  Ionization energy  <br> Van der Waals radius |
| Electronic | Highest $n$ quantum number for $s$ valence shell orbital, Number of electrons in highest $s$ valence shell   <br> Highest $n$ quantum number for $p$ valence shell orbital, Number of electrons in highest $p$ valence shell  <br> Highest $n$ quantum number for $d$ valence shell orbital, Number of electrons in highest $d$ valence shell  <br> Highest $n$ quantum number for $f$ valence shell orbital, Number of electrons in highest $f$ valence shell |

### Since, up to now, the majority of M$_2$AX phases have been synthesized with at most five M elements, two A elements and two X elements, we concentrate on the generic $(M_{c_{M_1}}^1, M_{c_{M_2}}^2, M_{c_{M_3}}^3, M_{c_{M_4}}^4, M_{c_{M_5}}^5)_2(A_a^1, A_{1-a}^2)(X_x^1, X_{1-x}^2)$ phase. A total of nine elements can then appear simultaneously in a single phase, so the total number of features is $9 \times 15 = 135$. ###

### However, since the elements present in the dataset do not span the whole periodic table but are limited to eleven transition metals, twenty-four A elements and three X elements, some features can be removed from the start. For example, the M transition metals considered have no $p$ orbitals in their valence shell, and the X elements have no $d$ or $f$ orbitals in theirs. Removing these unnecessary variables ($5 \times 2$ for the M sites and $2 \times 4$ for the X sites) leaves a total of 117 features. As an example, the electronic configuration of niobium reads [Kr]5s$^1$4d$^4$, so we take 5 as the highest $s$ valence shell quantum number, 1 as the number of electrons in that shell, 4 as the highest $d$ valence shell quantum number, 4 as the number of electrons in it, and 0 for the $f$ shell and its electrons. The generic input vector $x$ for a phase can be written as $x=$(c$_{M_1}$, c$_{M_2}$, ..., c$_{M_5}$, c$_{A_1}$, c$_{A_2}$, c$_{X_1}$, c$_{X_2}$, z$_{M_1}$, z$_{M_2}$, ..., z$_{M_5}$, z$_{A_1}$, z$_{A_2}$, z$_{X_1}$, z$_{X_2}$, ...), where $c_s$ and $z_s$ are the concentration and atomic number of element $s$, respectively, and so on for the remaining thirteen sets of variables. Concentrations sum to one only within the same sublattice, i.e. $\sum_i c_{M_i} =1$, $\sum_i c_{A_i} =1$ and $\sum_i c_{X_i} =1$. Each input vector can be thought of as a horizontal stacking of vectors, each holding the nine values of one of the fifteen variables listed above: $x=c+z+...$, where the sum is to be understood as stacking. By convention, we use 0 for all variables of elements that are not present in the generic $(M_{c_{M_1}}^1, M_{c_{M_2}}^2, M_{c_{M_3}}^3, M_{c_{M_4}}^4, M_{c_{M_5}}^5)_2(A_a^1, A_{1-a}^2)(X_x^1, X_{1-x}^2)$ complex M$_2$AX phase.
For example, in the system (Ti$_{0.5}$, Zr$_{0.5}$)$_2$AlC there are only two of the five possible M elements and one each of the two possible A and X elements, so the first nine entries of its input vector, the concentrations, are $c=(0.5, 0.5, 0, 0, 0, 1, 0, 1, 0)$. The same applies to the other variables. Finally, all the features were normalized by column.
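
### The zero-padding convention described above can be sketched in a few lines. This is only an illustration: the helpers `pad` and `stack_property` are hypothetical names, not the functions used by read_data, and the values are shown before any normalization.

```python
import numpy as np

# Maximum number of elements per sublattice in the generic phase (from the text above)
N_M, N_A, N_X = 5, 2, 2

def pad(values, size):
    """Right-pad a list with zeros up to `size` entries (hypothetical helper)."""
    if len(values) > size:
        raise ValueError("too many elements for this sublattice")
    return list(values) + [0.0] * (size - len(values))

def stack_property(m_vals, a_vals, x_vals):
    """Nine-entry block for one variable: 5 M sites, 2 A sites, 2 X sites."""
    return np.array(pad(m_vals, N_M) + pad(a_vals, N_A) + pad(x_vals, N_X))

# (Ti0.5, Zr0.5)2AlC: concentrations and atomic numbers (Ti=22, Zr=40, Al=13, C=6)
c_block = stack_property([0.5, 0.5], [1.0], [1.0])
z_block = stack_property([22, 40], [13], [6])

# Stacking the blocks gives the beginning of the (un-normalized) input vector
x_start = np.concatenate([c_block, z_block])

# Feature count: 9 sites x 15 variables, minus the p variables of the M sites
# and the d and f variables of the X sites
n_features = 9 * 15 - N_M * 2 - N_X * 4

print(c_block)
print(z_block)
print(n_features)
```

### The same padding is applied to each of the fifteen variables before the blocks are stacked.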

In [None]:
# read experimental data

X,Y = read_data('input_data_211.csv')

### Let's inspect one of the input vectors together with its output:

In [None]:
np.set_printoptions(linewidth=300)

index_of_structure = 42 # change to some number in [0,201] to explore

print('Structure #'+str(index_of_structure))
data.iloc[[index_of_structure]]

In [None]:
print('\033[1mINPUT\033[0m for structure #'+str(index_of_structure)+': \n'+str(X[index_of_structure]))
print('\033[1mOUTPUT\033[0m for structure #'+str(index_of_structure)+': \n'+str(Y[index_of_structure]))

### The goal is to build a machine learning model that is able to approximate the relationship $f$ between $X$ and $Y$, that is a relationship $f: [0,1]^{117} \rightarrow \mathbb{R}^2$. This is called a regression problem.
### In general, in the context of machine learning, one defines a training set $ \tau =\{(x_i,y_i), x_i \in X, y_i \in Y, i=1, ..., n\} $ by randomly sampling from the original data set and a function $f$ which has a set of parameters, $\theta$. The aim is then to find the best set of parameters $\theta^*$ that minimize an "empirical risk" function
### $\epsilon_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(y_i,f_{\theta}(x_i)) + \lambda \Omega(\theta)$
### where $\ell$ is a loss function that measures the error made by predicting $f_{\theta}(x_i)$ instead of $y_i$, and $\Omega$ is a regularization function for the parameters $\theta$. The parameter $\lambda$ is called a hyperparameter: it is not a parameter of the model itself, but an external one that is tuned to minimize the generalization error of the model, estimated as the error on validation data that should mimic the behaviour on unseen data such as the left-over test set. For example, linear regression $f_{\theta}(x)=\theta^{\textrm{T}}x$ with square loss function $\ell = (y_i-\theta^{\textrm{T}}x_i)^2$ and Tikhonov (or Ridge) regularization $\Omega(\theta) = ||\theta||^2$ has the well-known explicit solution $\theta^* = (X^{\textrm{T}}X+\lambda n I)^{-1}X^{\textrm{T}}y$, where $X$ is the $n\times d$ matrix containing the $n$ $d$-dimensional training input vectors, $y$ is the $n\times 1$ vector containing the corresponding outputs and $I$ is the $d\times d$ identity.
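
### As a quick numerical check of the closed-form Ridge solution, the sketch below builds a small synthetic problem (the data, `lam` and all variable names are made up for illustration and are unrelated to the MAX phase data set) and verifies that the gradient of the empirical risk vanishes at $\theta^*$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem: n samples, d features (illustrative only)
n, d = 50, 4
X_toy = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5, 0.0])
y_toy = X_toy @ theta_true + 0.1 * rng.normal(size=n)

lam = 0.1  # regularization hyperparameter lambda

# Closed-form solution: theta* = (X^T X + lambda * n * I)^(-1) X^T y
theta_star = np.linalg.solve(X_toy.T @ X_toy + lam * n * np.eye(d), X_toy.T @ y_toy)

def empirical_risk(theta):
    """Mean squared loss plus lambda * ||theta||^2."""
    residual = y_toy - X_toy @ theta
    return np.mean(residual ** 2) + lam * np.sum(theta ** 2)

# Gradient of the empirical risk evaluated at theta*; it should be numerically zero
grad = -2.0 / n * X_toy.T @ (y_toy - X_toy @ theta_star) + 2.0 * lam * theta_star

print(theta_star)
print(np.abs(grad).max())
```

### Increasing `lam` shrinks $\theta^*$ towards zero, which is why $\lambda$ has to be chosen on validation data rather than on the training set.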

### Several models have been proposed in the past, with increasing complexity, for solving the most difficult regression tasks, such as support vector machines, neural networks and gradient boosted trees. More information can be found, among a lot of other good references, in the excellent (and free..!) book by Hastie, Tibshirani and Friedman https://hastie.su.domains/ElemStatLearn/ or in the one by Bishop, also free, available at https://www.bishopbook.com/.

### In this notebook, we will train an ensemble model, that is a collection of models that share only the same functional form, but that are trained on different train-test splits of the original sets and therefore have different optimal parameters and hyperparameters.

In [None]:
# choose how many models to include (i.e. how many splits of the dataset into training and test set), e.g. N=5 for a quick test, N=30 for a more robust model

N = 5

### Once the $N$ models have been trained, the quality of the ensemble model is assessed with scores averaged over the $N$ splits. In particular, we will use three different scores:
### 1) the $r^2$ score, known as coefficient of determination, defined as $r^2 = 1-\frac{\sum_i (y_i-f(x_i))^2}{\sum_i (y_i-\bar{y})^2}$, where $\bar{y}$ is the average of observations $y_i$.
### 2) the root mean squared error, defined as $ \textrm{RMSE} = \sqrt{\frac{1}{n}\sum_i(y_i-f(x_i))^2}$
### 3) the relative root mean squared error, which is a normalized version of the $\textrm{RMSE}$, defined as $\textrm{RRMSE} = \textrm{RMSE} / \bar{f}$, where $\bar{f}$ is the average of the predictions $f(x_i)$.
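
### The three scores can be written directly from their definitions. The function below is a minimal sketch with a hypothetical name (`regression_scores`) and made-up numbers; the notebook itself relies on the `score` method of the regressors and on `root_mean_squared_error` from scikit-learn.

```python
import numpy as np

def regression_scores(y_true, y_pred):
    """Return (r2, rmse, rrmse) following the three definitions above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    rrmse = rmse / y_pred.mean()                     # normalized by the mean prediction
    return r2, rmse, rrmse

# Illustrative values only (not taken from the data set)
y_obs = [3.0, 3.1, 3.2, 3.3]
y_hat = [3.1, 3.1, 3.2, 3.2]

r2, rmse, rrmse = regression_scores(y_obs, y_hat)
print(r2, rmse, rrmse)
```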

In [None]:
# prepare vectors to store scores, for both a and c lattice parameters, on both training and test set.

a_r2_train = np.empty(0)
a_r2_test = np.empty(0)

a_rmse_train = np.empty(0)
a_rmse_test = np.empty(0)

a_rrmse_train = np.empty(0)
a_rrmse_test = np.empty(0)

c_r2_train = np.empty(0)
c_r2_test = np.empty(0)

c_rmse_train = np.empty(0)
c_rmse_test = np.empty(0)

c_rrmse_train = np.empty(0)
c_rrmse_test = np.empty(0)

## Define which model to use as a regressor $f$ ##
### choose from:

### LR -> for linear model
### TR -> for Tikhonov regression
### SVR -> for a Support Vector regressor
### RF -> for a Random Forest regressor
### LightGBM -> for a Light Gradient Boosting Machine regressor
### XGB -> for an Extreme Gradient Boosting regressor

### Note that different models perform differently (in terms of speed and accuracy) on different data sets. It is therefore advisable to get a feel for the training times by first running some tests with a small number of models.

In [None]:
model = 'LightGBM' 

save_folder = './MY_LGBM_MODELS_test/'       # create a new directory every time you run another training!

if not os.path.isdir(save_folder):
    os.makedirs(save_folder)

# prepare the list and files to store the best models 

best_models_a = []
best_models_a_file = 'best_models_a.sav'  

best_models_c = []
best_models_c_file = 'best_models_c.sav' 


   
# in the following file we will write the min and max values for both a and c of every split, since we will need them to do the inverse scaling 
# when using the final ensemble model for predictions

min_max_file = open(save_folder+'min_max.txt','w')

min_max_file.write('min_a'+' '+'max_a'+' '+'min_c'+' '+'max_c'+'\n')

### Main loop ###

In [None]:
for i in range(N):

    print('training models on split '+str(i+1)+'/'+str(N))
    
    # splitting in train and test set

    X_tr,X_te,y_tr,y_te = train_test_split(X,Y,test_size=0.2)

    # data augmentation

    X_tr, y_tr = permutations(X_tr,y_tr)
    X_te, y_te = permutations(X_te,y_te)

    # feature reduction

    X_tr = reduce_features(X_tr)
    X_te = reduce_features(X_te)

    # take validation set for hyperparameter optimization

    X_trainval, X_val, y_trainval, y_val = train_test_split(X_tr,y_tr, test_size = 0.3)

    a_tr = y_tr[:,0]
    a_te = y_te[:,0]
    a_trainval = y_trainval[:,0]
    a_val = y_val[:,0]
    
    c_tr = y_tr[:,1]
    c_te = y_te[:,1]
    c_trainval = y_trainval[:,1]
    c_val = y_val[:,1]
    
    # rescale output 

    max_a = max(a_tr)
    min_a = min(a_tr)

    max_c = max(c_tr)
    min_c = min(c_tr)

    # save min and max 

    min_max_file.write(str(min_a)+' '+str(max_a)+' '+str(min_c)+' '+str(max_c)+'\n')

    a_tr = rescale_output(a_tr, min_a, max_a)
    a_te = rescale_output(a_te, min_a, max_a)
    a_trainval = rescale_output(a_trainval, min_a, max_a)
    a_val = rescale_output(a_val, min_a, max_a)

    c_tr = rescale_output(c_tr, min_c, max_c)
    c_te = rescale_output(c_te, min_c, max_c)
    c_trainval = rescale_output(c_trainval, min_c, max_c)
    c_val = rescale_output(c_val, min_c, max_c)
    
    
    
    # hyperparameter optimization with optuna
    
    if model=='LR':
    
        # in this case there are no hyperparameters, so it is just
        
        final_reg_a = LinearRegression(n_jobs=-1)
        final_reg_c = LinearRegression(n_jobs=-1)
        
    if model=='TR':
        
        def objective_a(trial):

            alpha = trial.suggest_float('tr_alpha',0.01,1)
            tr_model =  KernelRidge(alpha=alpha, kernel='linear')

            tr_model.fit(X_trainval, a_trainval)
            score = tr_model.score(X_val, a_val)

            return score

        def objective_c(trial):

            alpha = trial.suggest_float('tr_alpha',0.01,1)
            tr_model =  KernelRidge(alpha=alpha, kernel='linear')
            
            tr_model.fit(X_trainval, c_trainval)
            score = tr_model.score(X_val, c_val)

            return score
        
        study_a = optuna.create_study(direction='maximize', sampler=TPESampler(), pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource='auto', reduction_factor=3))
        study_a.optimize(objective_a, n_trials = 50)

        study_c = optuna.create_study(direction='maximize', sampler=TPESampler(), pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource='auto', reduction_factor=3))
        study_c.optimize(objective_c, n_trials = 50)

        # save best hyperparameter found by maximization of objectives

        best_parameters_a = study_a.best_trial.params

        best_alpha_a = best_parameters_a.get("tr_alpha")

        final_reg_a = KernelRidge(alpha=best_alpha_a, kernel='linear')

        best_parameters_c = study_c.best_trial.params

        best_alpha_c = best_parameters_c.get("tr_alpha")

        final_reg_c = KernelRidge(alpha=best_alpha_c, kernel='linear')
    
    if model=='SVR':
        
        def objective_a(trial):

            degree = trial.suggest_int('svr_degree',2,10)
            epsilon = trial.suggest_float('svr_epsilon',0.001,1)
            C = trial.suggest_float('svr_C',1,100)
            kernel = trial.suggest_categorical('svr_kernel',["linear", "poly", "rbf"])

            svr_model =  SVR(degree=degree, epsilon=epsilon, C=C, kernel=kernel)

            svr_model.fit(X_trainval, a_trainval)
            score = svr_model.score(X_val, a_val)

            return score

        def objective_c(trial):

            degree = trial.suggest_int('svr_degree',2,10)
            epsilon = trial.suggest_float('svr_epsilon',0.001,1)
            C = trial.suggest_float('svr_C',1,100)
            kernel = trial.suggest_categorical('svr_kernel',["linear", "poly", "rbf"])

            svr_model =  SVR(degree=degree, epsilon=epsilon, C=C, kernel=kernel)

            svr_model.fit(X_trainval, c_trainval)
            score = svr_model.score(X_val, c_val)

            return score

        study_a = optuna.create_study(direction='maximize', sampler=TPESampler(), pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource='auto', reduction_factor=3))
        study_a.optimize(objective_a, n_trials = 50, n_jobs=-1)

        study_c = optuna.create_study(direction='maximize', sampler=TPESampler(), pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource='auto', reduction_factor=3))
        study_c.optimize(objective_c, n_trials = 50, n_jobs=-1)

        # save best hyperparameter found by maximization of objectives

        best_parameters_a = study_a.best_trial.params

        best_degree_a = best_parameters_a.get("svr_degree")
        best_kernel_a = best_parameters_a.get("svr_kernel")
        best_epsilon_a = best_parameters_a.get("svr_epsilon")
        best_C_a = best_parameters_a.get("svr_C")

        final_reg_a = SVR(degree=best_degree_a, epsilon=best_epsilon_a, C=best_C_a, kernel=best_kernel_a)

        best_parameters_c = study_c.best_trial.params

        best_degree_c = best_parameters_c.get("svr_degree")
        best_kernel_c = best_parameters_c.get("svr_kernel")
        best_epsilon_c = best_parameters_c.get("svr_epsilon")
        best_C_c = best_parameters_c.get("svr_C")

        final_reg_c = SVR(degree=best_degree_c, epsilon=best_epsilon_c, C=best_C_c, kernel=best_kernel_c)

    if model=='RF':

        def objective_a(trial):

            n_estimators = trial.suggest_int('rf_n_est',30,500)
            max_depth = trial.suggest_int('rf_max_depth',2,25)
            min_samples_split = trial.suggest_int('rf_min_samples_split',2,10)
            min_samples_leaf = trial.suggest_int('rf_min_samples_leaf',1,5)
            max_samples = trial.suggest_float('rf_sub',0,1)
            max_leaf_nodes = trial.suggest_int('rf_n_leaves',3,30)

            rf_model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf, max_samples=max_samples, max_leaf_nodes=max_leaf_nodes, bootstrap=True, n_jobs=-1)

            rf_model.fit(X_trainval, a_trainval)
            score = rf_model.score(X_val, a_val)

            return score
        
        def objective_c(trial):

            n_estimators = trial.suggest_int('rf_n_est',30,500)
            max_depth = trial.suggest_int('rf_max_depth',2,25)
            min_samples_split = trial.suggest_int('rf_min_samples_split',2,10)
            min_samples_leaf = trial.suggest_int('rf_min_samples_leaf',1,5)
            max_samples = trial.suggest_float('rf_sub',0,1)
            max_leaf_nodes = trial.suggest_int('rf_n_leaves',3,30)

            rf_model = RandomForestRegressor(n_estimators=n_estimators, max_depth=max_depth, min_samples_split=min_samples_split, min_samples_leaf=min_samples_leaf, max_samples=max_samples, max_leaf_nodes=max_leaf_nodes, bootstrap=True, n_jobs=-1)

            rf_model.fit(X_trainval, c_trainval)
            score = rf_model.score(X_val, c_val)

            return score

        study_a = optuna.create_study(direction='maximize', sampler=TPESampler(), pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource='auto', reduction_factor=3))
        study_a.optimize(objective_a, n_trials = 50)

        study_c = optuna.create_study(direction='maximize', sampler=TPESampler(), pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource='auto', reduction_factor=3))
        study_c.optimize(objective_c, n_trials = 50)

        # save best hyperparameter found by maximization of objectives

        best_parameters_a = study_a.best_trial.params

        best_n_est_a = best_parameters_a.get("rf_n_est")
        best_max_depth_a = best_parameters_a.get("rf_max_depth")
        best_min_samples_split_a = best_parameters_a.get("rf_min_samples_split")
        best_min_samples_leaf_a = best_parameters_a.get("rf_min_samples_leaf")
        best_max_samples_a = best_parameters_a.get("rf_sub")
        best_max_leaf_nodes_a = best_parameters_a.get("rf_n_leaves")

        final_reg_a = RandomForestRegressor(n_estimators=best_n_est_a, max_depth=best_max_depth_a, min_samples_split=best_min_samples_split_a, min_samples_leaf=best_min_samples_leaf_a, max_samples=best_max_samples_a, max_leaf_nodes=best_max_leaf_nodes_a, bootstrap=True, n_jobs=-1)

        best_parameters_c = study_c.best_trial.params

        best_n_est_c = best_parameters_c.get("rf_n_est")
        best_max_depth_c = best_parameters_c.get("rf_max_depth")
        best_min_samples_split_c = best_parameters_c.get("rf_min_samples_split")
        best_min_samples_leaf_c = best_parameters_c.get("rf_min_samples_leaf")
        best_max_samples_c = best_parameters_c.get("rf_sub")
        best_max_leaf_nodes_c = best_parameters_c.get("rf_n_leaves")

        final_reg_c = RandomForestRegressor(n_estimators=best_n_est_c, max_depth=best_max_depth_c, min_samples_split=best_min_samples_split_c, min_samples_leaf=best_min_samples_leaf_c, max_samples=best_max_samples_c, max_leaf_nodes=best_max_leaf_nodes_c, bootstrap=True, n_jobs=-1)
    
        
    if model=='LightGBM':

        def objective_a(trial):

            max_depth = trial.suggest_int('lgb_max_depth',2,25)
            n_estimators = trial.suggest_int('lgb_n_est',30,500)
            num_leaves = trial.suggest_int('lgb_n_leaves',3,30)
            learning_rate = trial.suggest_float('lgb_eta',0.01,0.4)
            min_split_gain = trial.suggest_float('lgb_min_split_gain',0,1)
            subsample = trial.suggest_float('lgb_sub',0,1)
            lambd = trial.suggest_float('lgb_lambda',0,10)

            lgb_model =  lgb.LGBMRegressor(max_depth=max_depth, n_estimators=n_estimators, num_leaves=num_leaves, learning_rate=learning_rate, min_split_gain=min_split_gain, subsample=subsample, reg_lambda=lambd, verbosity=-1, n_jobs=-1)

            lgb_model.fit(X_trainval, a_trainval)
            score = lgb_model.score(X_val, a_val)

            return score

        def objective_c(trial):

            max_depth = trial.suggest_int('lgb_max_depth',2,25)
            n_estimators = trial.suggest_int('lgb_n_est',30,500)
            num_leaves = trial.suggest_int('lgb_n_leaves',3,30)
            learning_rate = trial.suggest_float('lgb_eta',0.01,0.4)
            min_split_gain = trial.suggest_float('lgb_min_split_gain',0,1)
            subsample = trial.suggest_float('lgb_sub',0,1)
            lambd = trial.suggest_float('lgb_lambda',0,10)

            lgb_model =  lgb.LGBMRegressor(max_depth=max_depth, n_estimators=n_estimators, num_leaves=num_leaves, learning_rate=learning_rate, min_split_gain=min_split_gain, subsample=subsample, reg_lambda=lambd, verbosity=-1, n_jobs=-1)

            lgb_model.fit(X_trainval, c_trainval)
            score = lgb_model.score(X_val, c_val)

            return score

        study_a = optuna.create_study(direction='maximize', sampler=TPESampler(), pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource='auto', reduction_factor=3))
        study_a.optimize(objective_a, n_trials = 50)

        study_c = optuna.create_study(direction='maximize', sampler=TPESampler(), pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource='auto', reduction_factor=3))
        study_c.optimize(objective_c, n_trials = 50)
        # save best hyperparameter found by maximization of objectives

        best_parameters_a = study_a.best_trial.params

        best_max_depth_a = best_parameters_a.get("lgb_max_depth")
        best_n_est_a = best_parameters_a.get("lgb_n_est")
        best_n_leaves_a = best_parameters_a.get("lgb_n_leaves")
        best_learning_rate_a = best_parameters_a.get("lgb_eta")
        best_min_split_gain_a = best_parameters_a.get("lgb_min_split_gain")
        best_sub_a = best_parameters_a.get("lgb_sub")
        best_lambda_a = best_parameters_a.get("lgb_lambda")

        final_reg_a = lgb.LGBMRegressor(max_depth = best_max_depth_a, n_estimators = best_n_est_a, num_leaves = best_n_leaves_a, learning_rate=best_learning_rate_a, min_split_gain=best_min_split_gain_a, subsample = best_sub_a, reg_lambda = best_lambda_a, verbosity = -1, n_jobs=-1)

        best_parameters_c = study_c.best_trial.params

        best_max_depth_c = best_parameters_c.get("lgb_max_depth")
        best_n_est_c = best_parameters_c.get("lgb_n_est")
        best_n_leaves_c = best_parameters_c.get("lgb_n_leaves")
        best_learning_rate_c = best_parameters_c.get("lgb_eta")
        best_min_split_gain_c = best_parameters_c.get("lgb_min_split_gain")
        best_sub_c = best_parameters_c.get("lgb_sub")
        best_lambda_c = best_parameters_c.get("lgb_lambda")

        final_reg_c = lgb.LGBMRegressor(max_depth = best_max_depth_c, n_estimators = best_n_est_c, num_leaves = best_n_leaves_c, learning_rate=best_learning_rate_c, min_split_gain=best_min_split_gain_c, subsample = best_sub_c, reg_lambda = best_lambda_c, verbosity = -1, n_jobs=-1)

    elif model=='XGB':

        def objective_a(trial):

            max_depth = trial.suggest_int('xgb_max_depth',2,25)
            n_estimators = trial.suggest_int('xgb_n_est',30,500)
            eta = trial.suggest_float('xgb_eta',0.01,0.4)
            gamma = trial.suggest_float('xgb_gamma',0,1)
            subsample = trial.suggest_float('xgb_sub',0,1)
            lambd = trial.suggest_float('xgb_lambda',0,10)

            xgb_model = XGBRegressor(max_depth=max_depth, n_estimators=n_estimators, eta=eta, gamma=gamma, subsample=subsample, reg_lambda=lambd, n_jobs=-1)

            xgb_model.fit(X_trainval, a_trainval)
            score = xgb_model.score(X_val, a_val)

            return score

        def objective_c(trial):

            max_depth = trial.suggest_int('xgb_max_depth',2,25)
            n_estimators = trial.suggest_int('xgb_n_est',30,500)
            eta = trial.suggest_float('xgb_eta',0.01,0.4)
            gamma = trial.suggest_float('xgb_gamma',0,1)
            subsample = trial.suggest_float('xgb_sub',0,1)
            lambd = trial.suggest_float('xgb_lambda',0,10)

            xgb_model = XGBRegressor(max_depth=max_depth, n_estimators=n_estimators, eta=eta, gamma=gamma, subsample=subsample, reg_lambda=lambd, n_jobs=-1)

            # objective_c must be fitted and scored on the c lattice parameter
            xgb_model.fit(X_trainval, c_trainval)
            score = xgb_model.score(X_val, c_val)

            return score

        study_a = optuna.create_study(direction='maximize', sampler=TPESampler(), pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource='auto', reduction_factor=3))
        study_a.optimize(objective_a, n_trials = 50)

        study_c = optuna.create_study(direction='maximize', sampler=TPESampler(), pruner=optuna.pruners.HyperbandPruner(min_resource=1, max_resource='auto', reduction_factor=3))
        study_c.optimize(objective_c, n_trials = 50)

        # save best hyperparameter found by maximization of objectives

        best_parameters_a = study_a.best_trial.params

        best_max_depth_a = best_parameters_a.get("xgb_max_depth")
        best_n_est_a = best_parameters_a.get("xgb_n_est")
        best_eta_a = best_parameters_a.get("xgb_eta")
        best_gamma_a = best_parameters_a.get("xgb_gamma")
        best_sub_a = best_parameters_a.get("xgb_sub")
        best_lambda_a = best_parameters_a.get("xgb_lambda")

        final_reg_a = XGBRegressor(max_depth = best_max_depth_a, n_estimators = best_n_est_a, eta=best_eta_a, gamma=best_gamma_a, subsample = best_sub_a, reg_lambda = best_lambda_a, n_jobs=-1)

        best_parameters_c = study_c.best_trial.params

        best_max_depth_c = best_parameters_c.get("xgb_max_depth")
        best_n_est_c = best_parameters_c.get("xgb_n_est")
        best_eta_c = best_parameters_c.get("xgb_eta")
        best_gamma_c = best_parameters_c.get("xgb_gamma")
        best_sub_c = best_parameters_c.get("xgb_sub")
        best_lambda_c = best_parameters_c.get("xgb_lambda")

        final_reg_c = XGBRegressor(max_depth = best_max_depth_c, n_estimators = best_n_est_c, eta=best_eta_c, gamma=best_gamma_c, subsample = best_sub_c, reg_lambda = best_lambda_c, n_jobs=-1)


    # now let's fit two models with the best hyperparameters on the whole training set

    final_reg_a.fit(X_tr,a_tr)
    final_reg_c.fit(X_tr,c_tr)

    # save the scores on training and test set for current split

    a_r2_train = np.append(a_r2_train, final_reg_a.score(X_tr,a_tr))
    a_r2_test = np.append(a_r2_test, final_reg_a.score(X_te,a_te))

    a_tr_pred = final_reg_a.predict(X_tr)
    a_te_pred = final_reg_a.predict(X_te)

    a_rmse_train = np.append(a_rmse_train, root_mean_squared_error(a_tr, a_tr_pred))
    a_rmse_test = np.append(a_rmse_test, root_mean_squared_error(a_te, a_te_pred))

    # use only the RMSE of the current split (last entry), not the whole array
    a_rrmse_train = np.append(a_rrmse_train, a_rmse_train[-1]/a_tr_pred.mean())
    a_rrmse_test = np.append(a_rrmse_test, a_rmse_test[-1]/a_te_pred.mean())


    c_r2_train = np.append(c_r2_train, final_reg_c.score(X_tr,c_tr))
    c_r2_test = np.append(c_r2_test, final_reg_c.score(X_te,c_te))

    c_tr_pred = final_reg_c.predict(X_tr)
    c_te_pred = final_reg_c.predict(X_te)

    c_rmse_train = np.append(c_rmse_train, root_mean_squared_error(c_tr, c_tr_pred))
    c_rmse_test = np.append(c_rmse_test, root_mean_squared_error(c_te, c_te_pred))

    # use only the RMSE of the current split (last entry), not the whole array
    c_rrmse_train = np.append(c_rrmse_train, c_rmse_train[-1]/c_tr_pred.mean())
    c_rrmse_test = np.append(c_rrmse_test, c_rmse_test[-1]/c_te_pred.mean())

    best_models_a.append(final_reg_a)
    best_models_c.append(final_reg_c)
    
min_max_file.close()

    
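# use context managers so that the files are closed after writing
with open(save_folder+best_models_a_file, 'wb') as f_a:
    pickle.dump(best_models_a, f_a)
with open(save_folder+best_models_c_file, 'wb') as f_c:
    pickle.dump(best_models_c, f_c)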

print('training completed!')

### To inspect the best models found on each split we have to load them

In [None]:
with open(save_folder+'best_models_a.sav','rb') as f_a:
    loaded_models_a = pickle.load(f_a)
with open(save_folder+'best_models_c.sav','rb') as f_c:
    loaded_models_c = pickle.load(f_c)

### so that we can print the hyperparameters

In [None]:
# choose which model to inspect
n_model = 0

print('Hyperparameters of model '+str(n_model))
print(loaded_models_a[n_model].get_params())


### Compute average scores and write them in a file ###

In [None]:

file_scores = open(save_folder+'best_results_scores.txt','w')

file_scores.write('a r2 train average = '+str(a_r2_train.mean())+' +- '+str(np.std(a_r2_train))+'\n')
file_scores.write('a r2 test average = '+str(a_r2_test.mean())+' +- '+str(np.std(a_r2_test))+'\n')
file_scores.write('\n')
file_scores.write('a rmse train average = '+str(a_rmse_train.mean())+' +- '+str(np.std(a_rmse_train))+'\n')
file_scores.write('a rmse test average = '+str(a_rmse_test.mean())+' +- '+str(np.std(a_rmse_test))+'\n')
file_scores.write('\n')
file_scores.write('a rrmse train average = '+str(a_rrmse_train.mean())+' +- '+str(np.std(a_rrmse_train))+'\n')
file_scores.write('a rrmse test average = '+str(a_rrmse_test.mean())+' +- '+str(np.std(a_rrmse_test))+'\n')
file_scores.write('\n')
file_scores.write('--------------------------------'+'\n')
file_scores.write('\n')
file_scores.write('c r2 train average = '+str(c_r2_train.mean())+' +- '+str(np.std(c_r2_train))+'\n')
file_scores.write('c r2 test average = '+str(c_r2_test.mean())+' +- '+str(np.std(c_r2_test))+'\n')
file_scores.write('\n')
file_scores.write('c rmse train average = '+str(c_rmse_train.mean())+' +- '+str(np.std(c_rmse_train))+'\n')
file_scores.write('c rmse test average = '+str(c_rmse_test.mean())+' +- '+str(np.std(c_rmse_test))+'\n')
file_scores.write('\n')
file_scores.write('c rrmse train average = '+str(c_rrmse_train.mean())+' +- '+str(np.std(c_rrmse_train))+'\n')
file_scores.write('c rrmse test average = '+str(c_rrmse_test.mean())+' +- '+str(np.std(c_rrmse_test))+'\n')
file_scores.write('\n')

file_scores.close()

### Now that the models are saved, let's try to compute the lattice parameters of a hypothetical MAX phase ###

In [None]:
# Specify the chemical species of the three sublattices and the relative concentrations 

X_new = generate_features(m_names='Ti,Zr,Nb,Hf,Ta',a_names='Al,Sn',x_names='C',c_m='0.21,0.19,0.18,0.21,0.22',c_a='0.62,0.38',c_x='1')

X_new = np.asarray(X_new).reshape(1,-1)

# write a dummy Y, needed only to call the permutations function

Y_new = np.asarray([3.0, 13.0]).reshape(1,-1) 

X_new_perm, y_new_perm = permutations(X_new,Y_new)

X_new_perm = reduce_features(X_new_perm)

In [None]:
# prepare the vectors for the predictions of the different models in the ensemble model, on the new generated input vector

pred_p_a = np.empty(0)
pred_p_c = np.empty(0)
pred_i_a = np.empty(0)
pred_i_c = np.empty(0)
pred_i_err_a = np.empty(0)
pred_i_err_c = np.empty(0)

# here is where we use the file with min and max values saved above

min_max = np.loadtxt(save_folder+'min_max.txt', skiprows=1)

In [None]:
# loop through the N models

for i in range(N):


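    # reset the per-permutation predictions, so that each model is averaged on its own predictions only
    pred_p_a = np.empty(0)
    pred_p_c = np.empty(0)

    model_a = loaded_models_a[i]
    model_c = loaded_models_c[i]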

    min_a = min_max[i][0]
    max_a = min_max[i][1]
    min_c = min_max[i][2]
    max_c = min_max[i][3]

    for x in X_new_perm:

        a_new_pred = model_a.predict(x.reshape(1,-1)) # prediction of a and c for each permutation x of the new input 
        c_new_pred = model_c.predict(x.reshape(1,-1))
        
        a_pred = (max_a-min_a)*a_new_pred+min_a   # rescale in the original range
        c_pred = (max_c-min_c)*c_new_pred+min_c

        pred_p_a = np.append(pred_p_a, a_pred)
        pred_p_c = np.append(pred_p_c, c_pred)

    # save the average prediction of the i-th model over the permutations
    pred_i_a = np.append(pred_i_a, pred_p_a.mean())
    pred_i_c = np.append(pred_i_c, pred_p_c.mean())

    # and use the standard deviation over the permutations as its error
    pred_i_err_a = np.append(pred_i_err_a, np.std(pred_p_a))
    pred_i_err_c = np.append(pred_i_err_c, np.std(pred_p_c))
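
### The loop above calls `predict` once per permutation. Regressors that follow the scikit-learn convention accept a 2D array, so all permutations can be predicted in a single call. The sketch below illustrates this under that assumption: `LinearStandIn` is a hypothetical stand-in for a fitted model, and `predict_over_permutations` and all numbers are made up for illustration. ###

```python
import numpy as np

class LinearStandIn:
    """Hypothetical stand-in for a fitted regressor with a scikit-learn-like predict()."""
    def __init__(self, weights):
        self.weights = np.asarray(weights, dtype=float)

    def predict(self, X):
        # one scaled prediction per row of X
        return np.asarray(X, dtype=float) @ self.weights

def predict_over_permutations(model, X_perm, y_min, y_max):
    """Mean and standard deviation of the rescaled predictions of one model."""
    scaled = model.predict(X_perm)            # single call for all permutations
    preds = (y_max - y_min)*scaled + y_min    # undo the min-max scaling of the target
    return preds.mean(), preds.std()

# illustrative input: two permutations of a two-feature vector
X_perm_demo = np.array([[0.0, 1.0],
                        [1.0, 0.0]])
model_demo = LinearStandIn([0.2, 0.4])

mean_demo, err_demo = predict_over_permutations(model_demo, X_perm_demo, y_min=3.0, y_max=3.5)
print('prediction = '+str(mean_demo)+' +- '+str(err_demo))
```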

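### Now we have N predictions for $a$ and N predictions for $c$, one per model, each with its own error. We combine them with an inverse-variance weighted average. ###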

In [None]:

weights_a = 1/pred_i_err_a**2
weights_c = 1/pred_i_err_c**2

a_w_av =  np.dot(weights_a, pred_i_a)/np.sum(weights_a)
sigma_a_w_av = 1/math.sqrt(np.sum(weights_a))

c_w_av =  np.dot(weights_c, pred_i_c)/np.sum(weights_c)
sigma_c_w_av = 1/math.sqrt(np.sum(weights_c))

print('average a = '+str(a_w_av)+' +- '+str(sigma_a_w_av))
print('average c = '+str(c_w_av)+' +- '+str(sigma_c_w_av))
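
### The weighted average above can be wrapped in a small helper. This is a minimal sketch rather than part of the original notebook: the name `inverse_variance_mean` and the `eps` floor are assumptions. The floor guards against a model whose predictions do not change across permutations, which would otherwise get a zero error and an infinite weight. ###

```python
import numpy as np

def inverse_variance_mean(values, errors, eps=1e-12):
    """Inverse-variance weighted mean of `values` and its standard error."""
    values = np.asarray(values, dtype=float)
    errors = np.asarray(errors, dtype=float)
    # floor the errors so that a zero standard deviation does not give an infinite weight
    weights = 1.0 / np.maximum(errors, eps)**2
    mean = np.dot(weights, values) / np.sum(weights)
    sigma = 1.0 / np.sqrt(np.sum(weights))
    return mean, sigma

# illustrative numbers only, not results of the notebook
demo_mean, demo_sigma = inverse_variance_mean([3.0, 3.2], [0.1, 0.1])
print('average = '+str(demo_mean)+' +- '+str(demo_sigma))
```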


### Let's visualize the structure ###

In [None]:
#
# VIEW A STRUCTURE, OPTIONALLY HIGHLIGHTING A SET OF ATOM INDICES
#
from ase.visualize import view
from ase.neighborlist import NeighborList
from ase.data import covalent_radii
import matplotlib.pyplot as plt
import nglview as nv
def view_structure(structure, myvec=None):
    """
    Use nglview to display an ASE Atoms object.
    Parameters
    ----------
    structure: Atoms object
    myvec: list of int, optional
        indices of the atoms to highlight in blue
    Returns
    -------
    NGLWidget with GUI: object to be viewed
    
    """
    t = nv.ASEStructure(structure)
    w = nv.NGLWidget(t, gui=True)
    
    # Use the same cutoff radius for every atom, based on the covalent radius of Carbon
    
    cutoff = covalent_radii[6] * 1.2  # Carbon covalent radius * scaling factor
    nl = NeighborList([cutoff] * len(structure), skin=0.3, self_interaction=False, bothways=True)
    nl.update(structure)
    # Collect the indices of all atoms that have at least one neighbor within the cutoff
    bonded_atoms = set()
    for i in range(len(structure)):
        neighbors = nl.get_neighbors(i)[0]
        if len(neighbors) > 0:
            bonded_atoms.add(i)
            bonded_atoms.update(int(n) for n in neighbors)
    # NGL selects atoms by index with the '@i,j,k' syntax (bare numbers are residue numbers)
    if bonded_atoms:
        highlight_selection = '@' + ','.join(map(str, sorted(bonded_atoms)))
        w.add_representation('spacefill', selection=highlight_selection, color="red", radius=0.4)
            
    w.add_unitcell()

    w.add_representation('label',label_type='atomindex',color='black')
    # highlight the requested atoms only if some indices were given
    if myvec is not None and len(myvec) > 0:
        w.add_representation('spacefill',selection=list(myvec),color="blue",radius=0.5)
        w.add_representation('licorice', selection=list(myvec),radius=0.2)  # Adds bonds explicitly
    return w

In [None]:
from ase.io import read, write

# we can set the cell lengths to the values predicted by the ensemble model
# (the factor 4 assumes that the supercell spans four unit cells along a and b and one along c):

new_a = 4*a_w_av
new_b = new_a
new_c = c_w_av

print(new_a)
print(new_b)
print(new_c)

# use these values for the _cell_length_a, _cell_length_b and _cell_length_c fields of the cif file.
# Once the file is modified, read it and view the structure:

structure = read('HE-MAX.cif')

view_structure(structure)

### The MAX phase for which we predicted the lattice parameters was actually synthesized recently. The experimental lattice parameters are reported to be $a=3.180$ Å and $c=14.150$ Å. You can try different base models, as well as a different number of models in the ensemble, to see how the accuracy of the prediction changes. ###
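
### To quantify the agreement with experiment, the signed percent deviation of each predicted parameter can be computed. The snippet below is only a sketch: `percent_deviation` is a hypothetical helper, and `a_demo` and `c_demo` are placeholders to be replaced with `a_w_av` and `c_w_av`, not results of the notebook. ###

```python
def percent_deviation(predicted, reference):
    """Signed deviation of `predicted` from `reference`, in percent."""
    return 100.0*(predicted - reference)/reference

# experimental values quoted above
a_exp = 3.180
c_exp = 14.150

# placeholders: replace them with a_w_av and c_w_av from the ensemble prediction
a_demo = 3.200
c_demo = 14.000

dev_a = percent_deviation(a_demo, a_exp)
dev_c = percent_deviation(c_demo, c_exp)

print('deviation on a = '+str(round(dev_a, 2))+' %')
print('deviation on c = '+str(round(dev_c, 2))+' %')
```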