matminer: https://hackingmaterials.lbl.gov/matminer/

scikit-learn: https://scikit-learn.org/

The Materials Project Workshop: https://workshop.materialsproject.org/lessons/08_ml_matminer/matminer-notes/

In [None]:
!pip install numpy
!pip install pandas
#!pip install -U pandas-profiling
!pip install pymatgen==2021.3.9
!pip install matminer==0.6.5
#!pip install figrecipes

In [None]:
# matminer ships a collection of ready-made training datasets.
# Import the function that lists the available datasets
from matminer.datasets import get_available_datasets

In [None]:
# View available data names and details 
get_available_datasets()

In [None]:
# Import a function to read a dataset 
from matminer.datasets import load_dataset

In [None]:
# Load the dataset "dielectric_constant" into a pandas data frame
df = load_dataset("dielectric_constant")

In [None]:
# Display the beginning of the data frame 
df.head()

In [None]:
df.tail()

In [None]:
# Compute basic summary statistics for each column of the data frame
# (only numeric columns, for which statistics can be computed, are shown)
df.describe()

In [None]:
# Fetch the 3rd row data (Note: python index starts from 0) 
df.iloc[2]

In [None]:
# Extract only the rows whose "volume" is 580 or more
# (Boolean indexing: keep only the rows where the condition is True)
mask = (df["volume"] >= 580)
df[mask]
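
To see how a Boolean mask selects rows, here is a minimal sketch on a small made-up data frame. The column values are illustrative only and are not taken from the "dielectric_constant" dataset.

```python
import pandas as pd

# Toy data; values are illustrative only
toy = pd.DataFrame({
    "formula": ["A", "B", "C", "D"],
    "volume": [120.0, 580.0, 900.5, 45.2],
})

# Comparing a column with a scalar gives a Boolean Series (one True/False per row)
mask = toy["volume"] >= 580
print(mask.tolist())                 # [False, True, True, False]

# Indexing with the mask keeps only the rows where it is True
selected = toy[mask]
print(selected["formula"].tolist())  # ['B', 'C']
```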

In [None]:
# Extract only row data whose "band_gap" is greater than 0 and generate a new data frame 
mask = (df["band_gap"] > 0)
semiconductor_df = df[mask]
semiconductor_df

In [None]:
# Drop the four columns (["nsites", "space_group", "e_electronic", "e_total"])
# (axis=0 deletes rows, axis=1 deletes columns)
# Note: drop returns a new data frame and leaves the original df unchanged, so assign the result to a variable
cleaned_df = df.drop(["nsites", "space_group", "e_electronic", "e_total"], axis=1)

In [None]:
cleaned_df.head()
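
The fact that drop returns a new data frame rather than modifying the original can be checked on a toy example. The frame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# drop returns a new DataFrame; toy itself is left untouched
trimmed = toy.drop(["b", "c"], axis=1)
print(list(trimmed.columns))     # ['a']
print(list(toy.columns))         # ['a', 'b', 'c']

# Equivalent keyword form that avoids remembering the axis number
trimmed2 = toy.drop(columns=["b", "c"])
print(trimmed2.equals(trimmed))  # True
```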

Calculate the ionic contribution from the total (static) dielectric constant $\epsilon_{\text{total}}$ and the electronic dielectric constant $\epsilon_{\text{electronic}}$:
$$
\epsilon_{\text{ionic}} = \epsilon_{\text{total}} - \epsilon_{\text{electronic}}
$$
Arithmetic between two columns is applied element by element, row by row.
If the data frame has no column with the name on the left-hand side, a new column is created.

In [None]:
df["poly_ionic"] = df["poly_total"] - df["poly_electronic"]
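
The same element-wise subtraction can be tried on a small made-up frame. The numbers are illustrative and chosen so the result is easy to verify by hand.

```python
import pandas as pd

toy = pd.DataFrame({
    "poly_total": [10.0, 7.5, 30.5],
    "poly_electronic": [4.0, 5.5, 10.5],
})

# Subtraction is applied row by row; assigning to a new name adds a column
toy["poly_ionic"] = toy["poly_total"] - toy["poly_electronic"]
print(toy["poly_ionic"].tolist())  # [6.0, 2.0, 20.0]
print(list(toy.columns))           # ['poly_total', 'poly_electronic', 'poly_ionic']
```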

In [None]:
# Confirm that the column has been added 
df.head()

Descriptor generation for machine learning 

This section shows how to generate descriptors for machine learning models.
matminer's featurizers provide methods to generate descriptors from composition and structure (atomic arrangement).
See the official documentation for a list of descriptors that can be generated:
https://hackingmaterials.lbl.gov/matminer/featurizer_summary.html

Learn how to generate a descriptor from the pymatgen class Composition. First, create a Composition object with composition Fe2O3. 

In [None]:
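# Import from the full module path; root-level imports were removed in later pymatgen releases
from pymatgen.core.composition import Composition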

fe2o3 = Composition("Fe2O3")

In [None]:
# Import the class ElementProperty, which generates statistics of elemental properties as descriptors
from matminer.featurizers.composition import ElementProperty

In [None]:
# List of names of elemental properties
# Attributes of pymatgen's Element class can be used: https://pymatgen.org/pymatgen.core.periodic_table.html
prop_name = ["atomic_radius_calculated", "molar_volume", "boiling_point", "melting_point", "liquid_range"]

In [None]:
# Create an object of class ElementProperty
# Arguments: data source name, list of elemental properties, list of statistics (here the composition-weighted mean)
ep = ElementProperty("pymatgen", prop_name, ["mean"])

In [None]:
# The name of the descriptor to be generated 
element_prop_labels = ep.feature_labels()
print(element_prop_labels)

In [None]:
element_weight_averaged_prop = ep.featurize(fe2o3)
print(element_weight_averaged_prop)
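
As a rough hand check of what a composition-weighted mean is, the sketch below averages one property over the atoms of Fe2O3 without using matminer. The melting points are approximate values assumed for illustration, so the result need not match the featurizer output to the last digit.

```python
# Atomic fractions in Fe2O3: 2 Fe and 3 O out of 5 atoms
amounts = {"Fe": 2, "O": 3}
total = sum(amounts.values())
fractions = {el: n / total for el, n in amounts.items()}

# Approximate melting points in K (assumed values for this sketch)
melting_point = {"Fe": 1811.0, "O": 54.8}

# Composition-weighted mean: sum of fraction * property over the elements
weighted_mean = sum(fractions[el] * melting_point[el] for el in amounts)
print(round(weighted_mean, 2))  # 757.28
```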

In [None]:
# Import of class ElementFraction to generate element ratio as descriptor 
from matminer.featurizers.composition import ElementFraction

In [None]:
# Create an object of class ElementFraction 
ef = ElementFraction()

In [None]:
# The name of the descriptor to be generated 
element_fraction_labels = ef.feature_labels()
print(element_fraction_labels)

In [None]:
# Generate descriptor 
element_fractions = ef.featurize(fe2o3)
print(element_fractions)

In [None]:
# Check the descriptor name and descriptor value together 
print(element_fraction_labels[7], element_fractions[7])
print(element_fraction_labels[25], element_fractions[25])
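
The two indices used above line up with atomic numbers counted from zero: index 7 is O (Z = 8) and index 25 is Fe (Z = 26). The fractions themselves are simple to reproduce; the following is a minimal sketch with a hypothetical helper `element_fractions`, which is not part of matminer and handles only plain formulas without parentheses or charges.

```python
import re

def element_fractions(formula):
    """Return {element: atomic fraction} for a simple formula such as 'Fe2O3'.

    Hypothetical helper for illustration; no parentheses or charges supported.
    """
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}

print(element_fractions("Fe2O3"))  # {'Fe': 0.4, 'O': 0.6}
```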

In [None]:
# The steps above generated descriptors for a single Composition object.
# When there are many compositions, generating descriptors one by one is tedious.
# The following cells show how to generate descriptors in bulk for the compositions and structures stored in a data frame.

# Read the dataset contained in matminer 
from matminer.datasets.dataset_retrieval import load_dataset

df = load_dataset("brgoch_superhard_training")
df.head()

In [None]:
# The column "composition" contains pymatgen Composition objects.
# The table view shows only the element names, so print one entry to confirm that the amounts are stored as well.
print(df["composition"][2])
print(type(df["composition"][2]))

In [None]:
# The method featurize_dataframe of an ElementFraction object
# generates descriptors for a whole data frame. (1st argument: data frame, 2nd argument: name of the column holding the compositions)
# (The package multiprocessing is used to compute the descriptors for each composition in parallel)

# Note: the result must be assigned back to a data frame variable
df = ef.featurize_dataframe(df, "composition")

In [None]:
# Make sure the descriptor has been added 
df.head()

In [None]:
# Next, learn how to generate descriptors from pymatgen's Structure object (crystal structure data).

# Load a new dataset containing Structure objects 
df = load_dataset("phonon_dielectric_mp")

df.head()

In [None]:
# Make sure the column "structure" contains a Structure object 
print(df["structure"][0])

In [None]:
# Import the class "DensityFeatures" that generates a descriptor about density 
from matminer.featurizers.structure import DensityFeatures

In [None]:
# Creating an object of class DensityFeatures 
densityf = DensityFeatures()

In [None]:
# Descriptor name 
densityf.feature_labels()

In [None]:
# Generate and add a density descriptor from the crystal structure information of the data frame 
df = densityf.featurize_dataframe(df, "structure")

In [None]:
# Confirm that 3 descriptors have been added 
df.head()
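
Two of these density descriptors, the mass density and the volume per atom, can be checked by hand from the cell contents. The sketch below does this for a conventional bcc Fe cell; the lattice parameter and atomic mass are assumed textbook-style values, and the variable names are chosen for this example rather than taken from matminer.

```python
# Conventional bcc Fe cell: 2 atoms in a cube of edge a (assumed values)
a = 2.866              # lattice parameter in angstrom
n_atoms = 2
atomic_mass = 55.845   # amu per Fe atom

volume = a ** 3                    # cell volume in angstrom^3
vpa = volume / n_atoms             # volume per atom in angstrom^3

AMU_PER_A3_TO_G_PER_CM3 = 1.66054  # 1 amu/angstrom^3 expressed in g/cm^3
density = n_atoms * atomic_mass / volume * AMU_PER_A3_TO_G_PER_CM3

print(round(vpa, 2))      # 11.77
print(round(density, 2))  # 7.88
```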

In [None]:
# matminer also provides classes that convert data into pymatgen objects.
# The following example uses the class StrToComposition, which converts a formula string into a Composition.

# Import class StrToComposition 
from matminer.featurizers.conversions import StrToComposition

In [None]:
# Create an object of class StrToComposition 
stc = StrToComposition()

In [None]:
# Display column "formula", check data class 
print(df["formula"][0])
print(type(df["formula"][0]))

In [None]:
# Create a Composition object using the character string of the composition formula of the data frame 
df = stc.featurize_dataframe(df, "formula")

In [None]:
# Confirm that Composition has been added 
df.head()

In [None]:
# Make sure the column "composition" is an object of class Composition in pymatgen 
print(df["composition"][0])
print(type(df["composition"][0]))

Machine learning model training and prediction 

In [None]:
# Import function to load json format data 
from matminer.utils.io import load_dataframe_from_json

In [None]:
# Load the dataset for the elastic tensor of the material (data for the following papers)

# de Jong M, Chen W, Angsten T, Jain A, Notestine R, Gamst A, Sluiter M, Ande CK, van der Zwaag S,
# Plata JJ, Toher C, Curtarolo S, Ceder G, Persson KA, Asta M (2015)
# Charting the complete elastic properties of inorganic crystalline compounds. Scientific Data 2: 150009.
# https://doi.org/10.1038/sdata.2015.9
from google.colab import files
uploaded = files.upload() # elastic_tensor_2015_featurized.json
df = load_dataframe_from_json("elastic_tensor_2015_featurized.json")
df.head()

In [None]:
# The target variable to predict is "K_VRH", the bulk modulus.
# Select that column and take .values to get the data as a numpy array, then assign it to y.
y = df['K_VRH'].values

print(y)

In [None]:
# Generate the matrix X of descriptors (features) that is fed to the machine learning model.
# X has shape (number of samples) × (number of descriptors).

# Build it by dropping every column that is not a descriptor: the non-numeric columns and the target variable.
X = df.drop(["structure", "formula", "composition", "K_VRH"], axis=1)
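
An alternative to listing the columns to drop is to keep only the numeric columns and then remove the target. This is a sketch on a made-up frame and assumes every non-descriptor column other than the target is non-numeric; the column names `feat_1` and `feat_2` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "formula": ["A", "B", "C"],
    "K_VRH": [100.0, 150.0, 60.0],   # target variable
    "feat_1": [1.0, 2.0, 3.0],
    "feat_2": [0.1, 0.2, 0.3],
})

y = toy["K_VRH"].values
# Keep numeric columns only, then remove the target so it cannot leak into X
X = toy.select_dtypes(include="number").drop(columns=["K_VRH"])

print(X.shape)          # (3, 2)
print(list(X.columns))  # ['feat_1', 'feat_2']
```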

In [None]:
# Display the number of descriptors and their names
print("There are {} possible descriptors:".format(X.shape[1]))
print(X.columns)

In [None]:
# Import the machine learning model class RandomForestRegressor for predicting real numbers 
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Create an object of class RandomForestRegressor 
rf = RandomForestRegressor(n_estimators=100, random_state=1)

In [None]:
# Train the model using the training data (pair of feature matrix X and objective variable y) 
rf.fit(X, y)

In [None]:
# Predict the value of the objective variable from the features.
# Here, the objective variable of the training data is predicted 
y_pred = rf.predict(X)

In [None]:
# Import of numerical calculation package numpy 
import numpy as np

In [None]:
# Importing a function to calculate the mean squared error (MSE) 
from sklearn.metrics import mean_squared_error

In [None]:
# Calculate the MSE between the predicted and true values of the objective variable for the training data 
mse = mean_squared_error(y, y_pred)
print('training RMSE = {:.3f} GPa'.format(np.sqrt(mse)))
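
For reference, the RMSE is the square root of the mean squared residual, so it carries the same units as the target. The sketch below computes it directly with numpy on made-up values. Because the figure above is measured on the data the model was trained on, it tends to understate the error on unseen data.

```python
import numpy as np

y_true = np.array([100.0, 150.0, 60.0, 200.0])  # illustrative values in GPa
y_hat = np.array([110.0, 140.0, 65.0, 190.0])

# MSE is the mean of the squared residuals; RMSE is its square root
mse = np.mean((y_true - y_hat) ** 2)
rmse = np.sqrt(mse)
print(mse)             # 81.25
print(round(rmse, 3))  # 9.014
```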

In [None]:
# Import a class to perform K-fold Cross Validation
# Cross-validation: One of the methods for estimating the test error of a prediction model 
from sklearn.model_selection import KFold

In [None]:
# Creating an object of class KFold
# k = 10, set to shuffle the data order 
kfold = KFold(n_splits=10, shuffle=True)

In [None]:
# Import functions to calculate cross-validation scores 
from sklearn.model_selection import cross_val_score

In [None]:
# Perform cross-validation
# Arguments: Machine learning model rf, training data input (feature) X, objective variable y, score to be calculated, cross-validation object
# Negative value of MSE is set as the score 
scores = cross_val_score(rf, X, y, scoring='neg_mean_squared_error', cv=kfold)


In [None]:
# Calculates the square root (RMSE: Root MSE) of the absolute value of the calculated score value and displays its average value
# (Since it is executed k times, there are k scores)
rmse_scores = [np.sqrt(abs(s)) for s in scores]
print('Mean RMSE: {:.3f}'.format(np.mean(rmse_scores)))
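
The scores are negative because scikit-learn scorers follow a "higher is better" convention, so error metrics are reported with the sign flipped. A minimal sketch of the conversion, using hypothetical score values rather than real output:

```python
import numpy as np

# Hypothetical output of cross_val_score(..., scoring='neg_mean_squared_error')
scores = np.array([-81.0, -100.0, -64.0])

# Flip the sign to recover the MSE, then take the square root per fold
rmse_scores = np.sqrt(-scores)
print(rmse_scores.tolist())  # [9.0, 10.0, 8.0]
print(np.mean(rmse_scores))  # 9.0
```

Note that the mean of the per-fold RMSE values is generally not identical to the square root of the mean MSE, although the two are usually close.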

In [None]:
# Import a function that returns, for every sample, the prediction made while that sample was in the held-out fold
from sklearn.model_selection import cross_val_predict

In [None]:
# Predicted value of test data in cross-validation 
y_pred = cross_val_predict(rf, X, y, cv=kfold)
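
Conceptually, each sample is predicted exactly once, by a model fitted on the folds that do not contain it. The sketch below reproduces that idea by hand with numpy on synthetic data; ordinary least squares stands in for the random forest to keep the example short, so it illustrates the procedure, not scikit-learn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data, made up for this sketch
X = rng.normal(size=(60, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=60)

k = 5
folds = np.array_split(rng.permutation(len(y)), k)  # shuffled, disjoint folds

y_oof = np.full(len(y), np.nan)  # one out-of-fold prediction per sample
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # Fit on the other k-1 folds (least squares stands in for the random forest)
    coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    y_oof[val_idx] = X[val_idx] @ coef

# Every sample has been predicted once, by a model that never saw it
print(np.isnan(y_oof).any())                     # False
print(np.sqrt(np.mean((y - y_oof) ** 2)) < 0.5)  # True
```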

In [None]:
# Import a class to visualize a scatter plot of the prediction results
from matminer.figrecipes.plot import PlotlyFig

In [None]:
# Visualize the prediction results
# The horizontal axis is the true value and the vertical axis is the predicted value, so the closer the points lie to the diagonal, the better the prediction.
pf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)',
               y_title='Predicted bulk modulus (GPa)',
               mode='notebook')

pf.xy(xy_pairs=[(y, y_pred), ([0, 400], [0, 400])], 
      labels=df['formula'], 
      modes=['markers', 'lines'],
      lines=[{}, {'color': 'black', 'dash': 'dash'}], 
      showlegends=False)

In [None]:
# Importance of features calculated in Random Forest 
rf.feature_importances_

In [None]:
# Visualize the features sorted in descending order of importance
# importances: importance of each feature
# included: names of the features
# indices: feature indices sorted by descending importance
# The bar chart shows the top 5 features
importances = rf.feature_importances_
included = X.columns.values
indices = np.argsort(importances)[::-1]

# feature_importances_ are fractions that sum to 1, not percentages
pf = PlotlyFig(y_title='Importance (fraction)',
               title='Features by importance',
               mode='notebook')

pf.bar(x=included[indices][0:5], y=importances[indices][0:5])
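
The ranking step relies on np.argsort, which sorts in ascending order, followed by a reversal. A minimal sketch with hypothetical feature names and importance values:

```python
import numpy as np

# Hypothetical feature names and importances, for illustration only
names = np.array(["density", "vpa", "mean_radius", "mean_mass"])
importances = np.array([0.10, 0.55, 0.05, 0.30])

# argsort sorts ascending, so reverse it to put the most important feature first
order = np.argsort(importances)[::-1]
top = 3
print(names[order][:top].tolist())        # ['vpa', 'mean_mass', 'density']
print(importances[order][:top].tolist())  # [0.55, 0.3, 0.1]
```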