# Experimental band gap prediction with the use of DFT data
In this notebook we predict the experimental band gaps from [Zhuo et al.](https://pubs.acs.org/doi/10.1021/acs.jpclett.8b00124) using DFT band gap data from the OQMD database.

In [83]:
%matplotlib inline
from matminer.featurizers.base import MultipleFeaturizer
from matminer.featurizers import composition as cf
from matminer.featurizers.conversions import StrToComposition
from matplotlib import pyplot as plt
from matplotlib.colors import LogNorm
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import cross_val_score, cross_val_predict, GridSearchCV, ShuffleSplit, KFold

import sklearn.linear_model as linear_model
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler


import warnings
warnings.filterwarnings('ignore')

## Adding band gap data from the OQMD database.

The aim of this part of the notebook is to add DFT band gap values to the dataset. The result is already saved in the file band_gaps_OQMD.csv, so readers who are not interested in how to query the OQMD database may skip this part.

In [239]:
data = pd.read_csv("/home/dima/Desktop/ML/Data/reworked_data_bandgap.csv")

In [240]:
data.head()

Unnamed: 0,composition,E
0,Hg0.7Cd0.3Te,0.31
1,CuBr,2.998325
2,LuP,1.3
3,Cu3SbSe4,0.355
4,ZnO,3.402809


In [241]:
#add a new column 

data['band_gap_OQMD'] = np.nan
data

Unnamed: 0,composition,E,band_gap_OQMD
0,Hg0.7Cd0.3Te,0.310000,
1,CuBr,2.998325,
2,LuP,1.300000,
3,Cu3SbSe4,0.355000,
4,ZnO,3.402809,
...,...,...,...
4925,Tm2MgTl,0.000000,
4926,Nb5Ga4,0.000000,
4927,Tb2Sb5,0.000000,
4928,Lu2AlTc,0.000000,


Now we fill the column 'band_gap_OQMD' with band gap data from the OQMD database. Let's define a function that does this:

In [251]:
import requests

def add_band_gap(df):
    """Fill df['band_gap_OQMD'] with the OQMD DFT band gap closest to the experimental gap df['E']."""
    for i in df.index:
        print(i)  # progress indicator
        # Query OQMD: returns the compound name and band gap of every entry with this formula
        url = "http://oqmd.org/oqmdapi/formationenergy?fields=name,band_gap&filter=chemical_formula="+str(df['composition'][i])
        response = requests.get(url).json()
        data_pd = pd.DataFrame(response['data'])
        # No OQMD entry for this formula: leave NaN
        if len(data_pd) == 0:
            continue
        # Ignore entries without a band gap value
        gaps = data_pd['band_gap'].dropna().to_numpy(dtype=float)
        if len(gaps) == 0:
            continue
        # Among the entries found, keep the DFT band gap closest to the experimental one
        closest = np.abs(gaps - df.loc[i, 'E']).argmin()
        df.loc[i, 'band_gap_OQMD'] = gaps[closest]
    return df

In [None]:
# Use the function defined above in order to add DFT band gap data to the dataset
add_band_gap(data)
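
The selection rule used inside `add_band_gap` can be checked without network access. The sketch below is a minimal illustration, assuming the OQMD response has already been reduced to a plain list of band gap values; `closest_gap` is a hypothetical helper and the candidate values are made up.

```python
import numpy as np

def closest_gap(experimental_gap, dft_gaps):
    """Return the DFT gap closest to the experimental one, or NaN if none is usable."""
    gaps = np.asarray(dft_gaps, dtype=float)
    gaps = gaps[~np.isnan(gaps)]          # ignore missing values
    if gaps.size == 0:
        return np.nan
    return float(gaps[np.abs(gaps - experimental_gap).argmin()])

# Toy case: experimental gap of 3.4 eV and three hypothetical OQMD entries
print(closest_gap(3.4, [0.0, 1.087, np.nan]))
```

When no usable value is available the helper returns NaN, which mirrors the rows that stay empty in 'band_gap_OQMD'.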

Only 2791 entries end up with a non-NaN value:

In [273]:
# Count the rows with a non-NaN OQMD band gap
print(data['band_gap_OQMD'].notnull().sum())

2791


So, of the 4930 compounds in the original dataset, only 2791 have a DFT band gap value in the OQMD database.

In [274]:
# save data to the csv file
data.to_csv('band_gaps_OQMD.csv', header=True, index=False)

# The main part: feature generation, feature engineering and machine learning

In [85]:
# Load the data with experimental and DFT-calculated band gaps
data = pd.read_csv("/home/dima/Desktop/ML/Data/A general purpose - data/band_gaps_OQMD.csv")

Remove all rows with NaNs

In [86]:
original_count = len(data)
data.dropna(subset=['band_gap_OQMD'], inplace=True)
print('Removed %d/%d entries'%(original_count - len(data), original_count))
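# Display the cleaned data with a fresh index (the result is not assigned, so `data` keeps its original index)
data.reset_index(drop=True)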

Removed 2139/4930 entries


Unnamed: 0,composition,E,band_gap_OQMD
0,CuBr,2.998325,1.077
1,LuP,1.300000,1.896
2,Cu3SbSe4,0.355000,0.000
3,ZnO,3.402809,1.087
4,PtSb2,0.117444,0.000
...,...,...,...
2786,ScCoO3,0.000000,0.849
2787,Tm2MgTl,0.000000,0.000
2788,Nb5Ga4,0.000000,0.000
2789,Tb2Sb5,0.000000,0.000
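
Note that `reset_index` returns a new DataFrame; unless the result is assigned, `data` keeps its original row labels. This is consistent with the feature table further down starting at row label 1. A small illustration on made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"composition": ["A", "B", "C"],
                    "band_gap_OQMD": [np.nan, 1.0, 0.0]})
toy.dropna(subset=["band_gap_OQMD"], inplace=True)

print(list(toy.index))                          # original labels survive dropna
print(list(toy.reset_index(drop=True).index))   # renumbered copy
print(list(toy.index))                          # toy itself is unchanged
```

Writing `toy = toy.reset_index(drop=True)` would make the renumbering permanent.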


Create a new column 'formula' that holds the parsed composition (elements and their amounts) of each compound. It is needed later for feature generation.

In [5]:
data = StrToComposition(target_col_id='formula').featurize_dataframe(data, 'composition',  ignore_errors=True)
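
To make clear what kind of information the 'formula' column carries, here is a rough standard-library sketch that splits a flat formula string into element amounts. It is only an illustration: `parse_formula` is a hypothetical helper, it ignores parentheses and charges, and it is not how matminer parses compositions.

```python
import re

def parse_formula(formula):
    """Map each element symbol to its amount, e.g. 'Cu3SbSe4' -> Cu: 3, Sb: 1, Se: 4."""
    amounts = {}
    # An element symbol is one capital letter plus an optional lowercase letter,
    # followed by an optional (possibly fractional) amount.
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        amounts[symbol] = amounts.get(symbol, 0.0) + (float(number) if number else 1.0)
    return amounts

print(parse_formula("Hg0.7Cd0.3Te"))
print(parse_formula("Cu3SbSe4"))
```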

Generate features with matminer featurizer with 'magpie' preset

In [8]:
from matminer.featurizers.composition import ElementProperty
feature_calculators_magpie = MultipleFeaturizer([cf.Stoichiometry(), cf.ElementProperty.from_preset("magpie"),
                                          cf.ValenceOrbital(props=['avg']), cf.IonProperty(fast=True)])
feature_labels_magpie = feature_calculators_magpie.feature_labels()
df_magpie = feature_calculators_magpie.featurize_dataframe(data, col_id='formula', ignore_errors=True)

In [9]:
df_magpie.head()

Unnamed: 0,composition,E,band_gap_OQMD,formula,0-norm,2-norm,3-norm,5-norm,7-norm,10-norm,...,MagpieData mean SpaceGroupNumber,MagpieData avg_dev SpaceGroupNumber,MagpieData mode SpaceGroupNumber,avg s valence electrons,avg p valence electrons,avg d valence electrons,avg f valence electrons,compound possible,max ionic char,avg ionic char
1,CuBr,2.998325,1.077,"(Cu, Br)",2,0.707107,0.629961,0.574349,0.552045,0.535887,...,144.5,80.5,64.0,1.5,2.5,10.0,0.0,False,0.244896,0.061224
2,LuP,1.3,1.896,"(Lu, P)",2,0.707107,0.629961,0.574349,0.552045,0.535887,...,98.0,96.0,2.0,2.0,1.5,0.5,7.0,True,0.190712,0.047678
3,Cu3SbSe4,0.355,0.0,"(Cu, Sb, Se)",3,0.637377,0.564295,0.521836,0.509034,0.502747,...,112.125,98.125,14.0,1.625,2.375,10.0,0.0,False,0.100238,0.022844
4,ZnO,3.402809,1.087,"(Zn, O)",2,0.707107,0.629961,0.574349,0.552045,0.535887,...,103.0,91.0,12.0,2.0,2.0,5.0,0.0,True,0.551131,0.137783
5,PtSb2,0.117444,0.0,"(Pt, Sb)",2,0.745356,0.693361,0.670782,0.667408,0.666732,...,185.666667,26.222222,166.0,1.666667,2.0,9.666667,4.666667,False,0.013138,0.00292
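
The '0-norm', '2-norm', ... columns are consistent with p-norms of the atomic fractions, and the values for CuBr and PtSb2 in the table above can be reproduced by hand. A minimal check, assuming the 0-norm is simply the number of elements present; `p_norm` is a helper written for this example:

```python
import numpy as np

def p_norm(fractions, p):
    """p-norm of a vector of atomic fractions; p=0 counts the elements present."""
    x = np.asarray(fractions, dtype=float)
    if p == 0:
        return int(np.count_nonzero(x))
    return float((x ** p).sum() ** (1.0 / p))

cubr = [0.5, 0.5]        # Cu : Br = 1 : 1
ptsb2 = [1 / 3, 2 / 3]   # Pt : Sb = 1 : 2
print(p_norm(cubr, 0), round(p_norm(cubr, 2), 6), round(p_norm(cubr, 3), 6))
print(p_norm(ptsb2, 0), round(p_norm(ptsb2, 2), 6), round(p_norm(ptsb2, 3), 6))
```

All binary 1:1 compounds share the same norms, which is why CuBr, LuP and ZnO have identical values in these columns.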


The data now include statistics of the space group number (minimum, maximum, range, mean, average deviation, mode), which are not useful here. Drop the space group columns.

In [10]:
dropcol_magpie = ['MagpieData minimum SpaceGroupNumber','MagpieData maximum SpaceGroupNumber','MagpieData range SpaceGroupNumber', 'MagpieData mean SpaceGroupNumber', 'MagpieData avg_dev SpaceGroupNumber', 'MagpieData mode SpaceGroupNumber']
df_magpie=df_magpie.drop(columns = dropcol_magpie)
for i in dropcol_magpie:
    feature_labels_magpie.remove(i)  
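
The same columns can also be selected by substring instead of being typed out. A sketch on a toy frame with made-up values, assuming every unwanted column name contains 'SpaceGroupNumber':

```python
import pandas as pd

toy = pd.DataFrame({"MagpieData mean Number": [29.0],
                    "MagpieData mean SpaceGroupNumber": [144.5],
                    "MagpieData mode SpaceGroupNumber": [64.0]})
labels = list(toy.columns)

# Collect the columns to drop by pattern, then keep the label list in sync
to_drop = [c for c in toy.columns if "SpaceGroupNumber" in c]
toy = toy.drop(columns=to_drop)
labels = [c for c in labels if c not in to_drop]
print(list(toy.columns))
```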

Generate features with 'deml' preset

In [11]:
from matminer.featurizers.composition import ElementProperty

ep_feat = ElementProperty.from_preset(preset_name="deml")
df_deml = ep_feat.featurize_dataframe(data, col_id="formula", ignore_errors=True)  # featurize the "formula" column
feature_calculators_deml = ep_feat.feature_labels()
dropcol_deml = ['DemlData minimum atom_num','DemlData maximum atom_num',
 'DemlData range atom_num',
 'DemlData mean atom_num',
 'DemlData std_dev atom_num',
 'DemlData minimum atom_mass',
 'DemlData maximum atom_mass',
 'DemlData range atom_mass',
 'DemlData mean atom_mass',
 'DemlData std_dev atom_mass',
 'DemlData minimum row_num',
 'DemlData maximum row_num','DemlData range row_num', 'DemlData mean row_num', 'DemlData std_dev row_num', 'DemlData minimum col_num', 'DemlData maximum col_num','DemlData range col_num','DemlData mean col_num', 'DemlData std_dev col_num',
'DemlData minimum melting_point',
'DemlData maximum melting_point',
'DemlData range melting_point',
'DemlData mean melting_point',
'DemlData std_dev melting_point',
'DemlData minimum electronegativity',
'DemlData maximum electronegativity',
'DemlData range electronegativity',
'DemlData mean electronegativity',
'DemlData std_dev electronegativity']
df_deml = df_deml.drop(columns=dropcol_deml)

Merge 'magpie' and 'deml' features 

In [12]:
df_merge = pd.merge(df_magpie, df_deml)
df_merge

Unnamed: 0,composition,E,band_gap_OQMD,formula,0-norm,2-norm,3-norm,5-norm,7-norm,10-norm,...,DemlData minimum mus_fere,DemlData maximum mus_fere,DemlData range mus_fere,DemlData mean mus_fere,DemlData std_dev mus_fere,DemlData minimum FERE correction,DemlData maximum FERE correction,DemlData range FERE correction,DemlData mean FERE correction,DemlData std_dev FERE correction
0,CuBr,2.998325,1.077,"(Cu, Br)",2,0.707107,0.629961,0.574349,0.552045,0.535887,...,,,,,,,,,,
1,LuP,1.300000,1.896,"(Lu, P)",2,0.707107,0.629961,0.574349,0.552045,0.535887,...,,,,,,,,,,
2,Cu3SbSe4,0.355000,0.000,"(Cu, Sb, Se)",3,0.637377,0.564295,0.521836,0.509034,0.502747,...,-4.286226,-1.972581,2.313645,-3.050496,1.124909,-0.166226,0.057419,0.223645,-0.034246,0.100252
3,ZnO,3.402809,1.087,"(Zn, O)",2,0.707107,0.629961,0.574349,0.552045,0.535887,...,-4.760000,-0.840000,3.920000,-2.800000,2.771859,0.230000,0.430000,0.200000,0.330000,0.141421
4,PtSb2,0.117444,0.000,"(Pt, Sb)",2,0.745356,0.693361,0.670782,0.667408,0.666732,...,-4.286226,-3.952760,0.333466,-4.175071,0.235796,-0.432760,-0.166226,0.266534,-0.255071,0.188468
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2786,ScCoO3,0.000000,0.849,"(Sc, Co, O)",3,0.663325,0.614463,0.600984,0.600078,0.600002,...,-4.760000,-4.630242,0.129758,-4.732918,0.068666,-0.104349,0.489758,0.594106,0.215082,0.252240
2787,Tm2MgTl,0.000000,0.000,"(Tm, Mg, Tl)",3,0.612372,0.538609,0.506099,0.501109,0.500098,...,,,,,,,,,,
2788,Nb5Ga4,0.000000,0.000,"(Nb, Ga)",2,0.711458,0.637644,0.587958,0.570873,0.561251,...,-6.686752,-2.370000,4.316752,-4.768195,3.052404,0.353248,0.660000,0.306752,0.489582,0.216906
2789,Tb2Sb5,0.000000,0.000,"(Tb, Sb)",2,0.769309,0.729210,0.715743,0.714453,0.714293,...,,,,,,,,,,
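
Called without `on`, `pd.merge` performs an inner join on every column name the two frames share, which here are the identifying columns carried by both feature tables. A toy illustration with made-up numbers:

```python
import pandas as pd

left = pd.DataFrame({"composition": ["CuBr", "ZnO"], "E": [3.0, 3.4], "feat_a": [1, 2]})
right = pd.DataFrame({"composition": ["CuBr", "ZnO"], "E": [3.0, 3.4], "feat_b": [10, 20]})

merged = pd.merge(left, right)   # joins on the shared columns: composition and E
print(list(merged.columns))
print(len(merged))
```

If the shared columns do not identify rows uniquely, this kind of join can duplicate rows, so comparing `len(merged)` with the input length is a cheap sanity check.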


Define one more feature: metal/nonmetal

In [13]:
df_merge['(non)metal'] = np.nan

In [14]:
# If metal (band gap = 0), then (non)metal = 0, else (non)metal = 1
df_merge['(non)metal'] = np.where(df_merge['E'] < 0.0001, 0.0, 1.0)

We also add a metal/nonmetal column based on the DFT calculations from OQMD

In [15]:
df_merge['(non)metal_DFT'] = np.nan

In [16]:
# Same labeling, now based on the DFT band gap
df_merge['(non)metal_DFT'] = np.where(df_merge['band_gap_OQMD'] < 0.0001, 0.0, 1.0)

# Feature engineering

We now turn to feature engineering. Define some useful functions

In [18]:
# This function takes a DataFrame as input and returns two columns: the number of missing values and the percentage of missing values
def missing_percentage(df):
    missing = df.isnull().sum().sort_values(ascending=False)
    total = missing[missing != 0]
    percent = (missing / len(df) * 100).round(2)
    percent = percent[percent != 0]
    return pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

In [19]:
# Drop all the columns whose missing percentage is greater than "threshold"
def drop_missing(df, threshold=3):
    percent = missing_percentage(df).Percent  # computed once instead of on every iteration
    dropped = list(percent[percent > threshold].index)
    df = df.drop(columns=dropped)
    print(dropped)
    return df

In [20]:
# List all the columns
def col(df):
    for i in df.columns:
        print(i)

Drop all the columns in which more than 1% of the values are missing

In [21]:
df_merge = drop_missing(df_merge, threshold=1)

['DemlData range GGAU_Etot', 'DemlData mean mus_fere', 'DemlData minimum GGAU_Etot', 'DemlData maximum GGAU_Etot', 'DemlData mean GGAU_Etot', 'DemlData std_dev GGAU_Etot', 'DemlData minimum mus_fere', 'DemlData maximum mus_fere', 'DemlData range mus_fere', 'DemlData std_dev mus_fere', 'DemlData minimum FERE correction', 'DemlData maximum FERE correction', 'DemlData range FERE correction', 'DemlData mean FERE correction', 'DemlData std_dev FERE correction', 'DemlData mean electric_pol', 'DemlData minimum electric_pol', 'DemlData range electric_pol', 'DemlData maximum electric_pol', 'DemlData std_dev electric_pol', 'DemlData mean heat_cap', 'DemlData minimum heat_cap', 'DemlData maximum heat_cap', 'DemlData range heat_cap', 'DemlData std_dev heat_cap', 'DemlData mean heat_fusion', 'DemlData std_dev heat_fusion', 'DemlData minimum heat_fusion', 'DemlData range heat_fusion', 'DemlData maximum heat_fusion', 'DemlData std_dev atom_radius', 'DemlData mean atom_radius', 'DemlData range atom_ra
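
The per-column missing percentage behind `drop_missing` is equivalent to `df.isnull().mean() * 100`. A compact sketch on a toy frame with made-up values; the 30% threshold is chosen only for this example:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "b": [np.nan, np.nan, 1.0, 2.0],
                    "c": [np.nan, 1.0, 2.0, 3.0]})

percent_missing = toy.isnull().mean() * 100   # a: 0, b: 50, c: 25
too_sparse = list(percent_missing[percent_missing > 30].index)
print(too_sparse)
print(list(toy.drop(columns=too_sparse).columns))
```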

## Genetic algorithm for feature selection 

In [70]:
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from deap import creator, base, tools, algorithms
from scoop import futures
from scipy import interpolate
import random

In [None]:
# Define the features used to predict the target. The target is whether a given compound is a metal or an insulator (the value of '(non)metal')
features = []
for col in df_merge.columns:
    features+=[str(col)]
to_remove = ['composition', 'E', 'formula', '(non)metal']
for i in to_remove:
    features.remove(i)

In [71]:
# Define train data and target    
X = df_merge[features]
y = df_merge['(non)metal']

In [72]:
le = LabelEncoder()
le.fit(df_merge['(non)metal'])
allClasses = le.transform(df_merge['(non)metal'])
allFeatures = df_merge.drop(to_remove, axis=1)

In [73]:
# define cross validation strategy
def accuracy_score_cv(model,X,y):
    accur = cross_val_score(model, X, y, scoring="accuracy", cv=10)
    return accur

We will use the DEAP module to run the genetic algorithm for feature selection. The code below is a modification of the code taken from [here](https://github.com/scoliann/GeneticAlgorithmFeatureSelection/blob/master/gaFeatureSelectionExample.py). Define some useful functions

In [74]:
# This function takes an 'individual' (a 0/1 mask selecting a subset of the columns of X), the data X and the target y
def getFitness(individual, X, y):
    # Drop the feature columns that are switched off in the individual
    # Apply one-hot encoding to the remaining features
    cols = [index for index in range(len(individual)) if individual[index] == 0]
    X_Parsed = X.drop(X.columns[cols], axis=1)
    X_OhFeatures = pd.get_dummies(X_Parsed)

    
    # Apply the XGBoost classifier to the data and calculate the cross-validated accuracy
    clf = XGBClassifier()
    accuracy = accuracy_score_cv(clf, X_OhFeatures, y).mean()

    # Return calculated accuracy as fitness
    return (accuracy,)
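
An 'individual' is just a 0/1 mask over the columns of X. The sketch below shows, on a toy frame with made-up column names, how the mask is turned into a reduced feature table in the same way as in `getFitness`, and how the same mask maps back to feature names.

```python
import pandas as pd

X_toy = pd.DataFrame({"f0": [1, 2], "f1": [3, 4], "f2": [5, 6], "f3": [7, 8]})
individual = [1, 0, 1, 0]   # keep f0 and f2, switch off f1 and f3

# Same logic as in getFitness: collect the positions of the zeros and drop those columns
off = [i for i in range(len(individual)) if individual[i] == 0]
X_selected = X_toy.drop(X_toy.columns[off], axis=1)
print(list(X_selected.columns))

# Equivalent mapping from the mask to the selected feature names
selected_names = [name for name, bit in zip(X_toy.columns, individual) if bit == 1]
print(selected_names)
```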

In [75]:
"""
This defines the strategy of the genetic algorithm. The fitness has a single positive weight, so the evolution maximizes the accuracy
"""
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

# Create Toolbox
toolbox = base.Toolbox()
toolbox.register("attr_bool", random.randint, 0, 1)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_bool, len(df_merge.columns) - len(to_remove))
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

# Continue filling toolbox...
toolbox.register("evaluate", getFitness, X=X, y=y)
toolbox.register("mate", tools.cxOnePoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=0.05)
toolbox.register("select", tools.selTournament, tournsize=3)
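
For readers unfamiliar with the three registered operators, here is a plain-Python sketch of what one-point crossover, bit-flip mutation and tournament selection do conceptually. These are simplified re-implementations written for illustration, not DEAP's code, and the toy fitness (the number of ones) is an assumption.

```python
import random

def one_point_crossover(a, b, rng):
    """Swap the tails of two equal-length lists after a random cut point."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def flip_bit_mutation(ind, indpb, rng):
    """Flip each bit independently with probability indpb."""
    return [1 - bit if rng.random() < indpb else bit for bit in ind]

def tournament(population, fitness, tournsize, rng):
    """Pick the fittest of tournsize randomly drawn individuals (drawn without replacement here)."""
    return max(rng.sample(population, tournsize), key=fitness)

rng = random.Random(0)
pop = [[rng.randint(0, 1) for _ in range(8)] for _ in range(6)]
child1, child2 = one_point_crossover(pop[0], pop[1], rng)
mutant = flip_bit_mutation(child1, indpb=0.05, rng=rng)
winner = tournament(pop, fitness=sum, tournsize=3, rng=rng)
print(child1, child2, mutant, winner)
```

In the notebook the fitness is the cross-validated accuracy returned by `getFitness`, so every evaluation trains the classifier, which is what makes the search expensive.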

In [76]:
# This function gives hall of fame (the set of individuals with the best performance)
def getHof():

    # Initialize variables to use eaSimple
    numPop = 100
    numGen = 8
    pop = toolbox.population(n=numPop)
    hof = tools.HallOfFame(numPop * numGen)
    stats = tools.Statistics(lambda ind: ind.fitness.values)
    stats.register("avg", np.mean)
    stats.register("std", np.std)
    stats.register("min", np.min)
    stats.register("max", np.max)

    # Launch genetic algorithm
    pop, log = algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2, ngen=numGen, stats=stats, halloffame=hof, verbose=True)

    # Return the hall of fame
    return hof

In [77]:
def getMetrics(hof):

    # Get list of percentiles in the hall of fame
    percentileList = [i / (len(hof) - 1) for i in range(len(hof))]
    
    # Gather fitness data from each percentile
    AccuracyList = []
    individualList = []
    for individual in hof:
        Accuracy = getFitness(individual, X, y)
        AccuracyList.append(Accuracy[0])
        individualList.append(individual)
        
    return AccuracyList, individualList, percentileList

In [78]:
'''
First, we will apply the XGBoost classifier using all the features to obtain a baseline accuracy.
'''
individual = [1 for i in range(len(X.columns))]
Accuracy = getFitness(individual, X, y)
print('\nAccuracy with all features: \t' + str(Accuracy[0]))

'''
Now, we will apply a genetic algorithm to choose a subset of features that gives a better accuracy than the baseline.
'''
hof = getHof()
AccuracyList, individualList, percentileList = getMetrics(hof)

# Get a list of subsets that performed best on validation data 
maxValAccSubsetIndicies = [index for index in range(len(AccuracyList)) if AccuracyList[index] == max(AccuracyList)]
maxValIndividuals = [individualList[index] for index in maxValAccSubsetIndicies]
maxValSubsets = [[list(features)[index] for index in range(len(individual)) if individual[index] == 1] for individual in maxValIndividuals]

print('\n---Optimal Feature Subset(s)---\n')
for index in range(len(maxValAccSubsetIndicies)):
    print('Percentile: \t\t\t' + str(percentileList[maxValAccSubsetIndicies[index]]))
    print('Accuracy: \t\t' + str(AccuracyList[maxValAccSubsetIndicies[index]]))
    print('Individual: \t' + str(maxValIndividuals[index]))
    print('Number Features In Subset: \t' + str(len(maxValSubsets[index])))
    print('Feature Subset: ' + str(maxValSubsets[index]))


Accuracy with all features: 	0.946251920122888
gen	nevals	avg     	std      	min     	max     
0  	100   	0.939133	0.0054188	0.925826	0.949835
1  	60    	0.94362 	0.00363724	0.922965	0.95055 
2  	53    	0.945149	0.00235421	0.938011	0.950196
3  	66    	0.945873	0.00243043	0.935854	0.950196
4  	59    	0.946425	0.00261436	0.934782	0.951269
5  	63    	0.947109	0.00234374	0.938729	0.95163 
6  	54    	0.948309	0.00220796	0.938006	0.954135
7  	59    	0.948761	0.00227003	0.94303 	0.954135
8  	55    	0.949184	0.00215982	0.943386	0.954135

---Optimal Feature Subset(s)---

Percentile: 			0.0
Accuracy: 		0.9541346646185357
Individual: 	[1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 

We have thus found the best-performing set of features among those explored by the genetic algorithm

In [79]:
features_optimal = [
    'band_gap_OQMD', '0-norm', '2-norm', '3-norm', '5-norm', '7-norm',
    'MagpieData minimum Number', 'MagpieData mode Number',
    'MagpieData minimum MendeleevNumber', 'MagpieData maximum MendeleevNumber', 'MagpieData mode MendeleevNumber',
    'MagpieData mean AtomicWeight', 'MagpieData mode AtomicWeight',
    'MagpieData minimum MeltingT', 'MagpieData maximum MeltingT', 'MagpieData range MeltingT', 'MagpieData avg_dev MeltingT', 'MagpieData mode MeltingT',
    'MagpieData maximum Column', 'MagpieData avg_dev Column', 'MagpieData mode Column',
    'MagpieData range Row',
    'MagpieData maximum CovalentRadius', 'MagpieData range CovalentRadius', 'MagpieData avg_dev CovalentRadius',
    'MagpieData minimum Electronegativity', 'MagpieData range Electronegativity', 'MagpieData mode Electronegativity',
    'MagpieData mean NsValence',
    'MagpieData minimum NpValence', 'MagpieData range NpValence',
    'MagpieData minimum NdValence', 'MagpieData maximum NdValence', 'MagpieData range NdValence', 'MagpieData mean NdValence',
    'MagpieData minimum NfValence', 'MagpieData avg_dev NfValence',
    'MagpieData maximum NValence', 'MagpieData range NValence', 'MagpieData avg_dev NValence', 'MagpieData mode NValence',
    'MagpieData range NsUnfilled', 'MagpieData mean NsUnfilled', 'MagpieData avg_dev NsUnfilled',
    'MagpieData maximum NpUnfilled', 'MagpieData mean NpUnfilled', 'MagpieData avg_dev NpUnfilled',
    'MagpieData minimum NdUnfilled', 'MagpieData maximum NdUnfilled', 'MagpieData mean NdUnfilled', 'MagpieData mode NdUnfilled',
    'MagpieData minimum NfUnfilled', 'MagpieData maximum NfUnfilled', 'MagpieData range NfUnfilled', 'MagpieData mean NfUnfilled', 'MagpieData avg_dev NfUnfilled', 'MagpieData mode NfUnfilled',
    'MagpieData range NUnfilled', 'MagpieData mean NUnfilled', 'MagpieData avg_dev NUnfilled',
    'MagpieData maximum GSvolume_pa', 'MagpieData mode GSvolume_pa',
    'MagpieData range GSbandgap', 'MagpieData mean GSbandgap', 'MagpieData mode GSbandgap',
    'MagpieData minimum GSmagmom', 'MagpieData maximum GSmagmom', 'MagpieData range GSmagmom', 'MagpieData mean GSmagmom', 'MagpieData avg_dev GSmagmom',
    'avg s valence electrons', 'avg f valence electrons', 'compound possible', 'max ionic char', '(non)metal_DFT']

Let's test it once again

In [91]:
X = df_merge[features_optimal]
y = df_merge['(non)metal']
model = XGBClassifier()
accuracy_score_cv(model, X, y).mean()

0.9541346646185357

So the accuracy is higher than the baseline of 0.946. Let us now compare it with the accuracy of the DFT calculations

In [89]:
accuracy_score(df_merge['(non)metal'], df_merge['(non)metal_DFT'])

0.8993192404156216
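
`accuracy_score` here is simply the fraction of compounds whose DFT-based label agrees with the experimental one. A minimal NumPy check on made-up labels, which also separates the two kinds of disagreement:

```python
import numpy as np

experimental = np.array([1, 1, 0, 0, 1, 0])   # 1 = nonmetal, 0 = metal (toy labels)
dft = np.array([1, 0, 0, 0, 1, 1])

accuracy = float(np.mean(experimental == dft))
# DFT predicts a metal although experiment finds a gap
missed_gap = int(np.sum((experimental == 1) & (dft == 0)))
# DFT predicts a gap although experiment finds a metal
false_gap = int(np.sum((experimental == 0) & (dft == 1)))
print(accuracy, missed_gap, false_gap)
```

Applying the same split to `df_merge` would show which of the two error types dominates the DFT labels.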