# Introduction to Features in Machine Learning

In Machine Learning, a `feature` is an independent variable used to make predictions about a dependent variable. Features are the foundation of any Machine Learning model, so choosing and preparing them correctly is crucial for obtaining an accurate and reliable model.

Features in Machine Learning are usually numeric, because most Machine Learning algorithms are designed to work with numeric data. Numeric features also make it easier to apply transformations and preprocessing techniques, which can improve the quality of the resulting models.
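
When a quantity is categorical rather than numeric, it is usually encoded as numbers before being passed to a model. The sketch below shows one common option, one-hot encoding, in plain Python. The `one_hot` helper and the crystal-system labels are hypothetical and serve only as an illustration; they are not part of `matminer`.

```python
def one_hot(values):
    """Encode a list of category labels as one-hot rows (illustrative helper)."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical categorical feature: the crystal system of four materials.
crystal_systems = ["cubic", "hexagonal", "cubic", "tetragonal"]
categories, encoded = one_hot(crystal_systems)

for label, row in zip(crystal_systems, encoded):
    print(label, row)
```

Each category becomes its own numeric column, so the encoded rows can be used by algorithms that only accept numbers.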

In this notebook, we explore the tools that the `matminer` package provides for selecting and preparing features in a dataset. The package contains useful tools for extracting features from materials and chemical compounds, which are frequently used in materials science and computational chemistry applications. The package also includes a materials database with more than 65,000 compounds, which makes it a valuable source of data for Machine Learning modeling in materials science.

The `matminer` library already provides several datasets that can be used in Machine Learning modeling work. These datasets were compiled from a variety of sources, and the origin of each one can be checked in its respective article.

In [52]:
from matminer.datasets import get_available_datasets

get_available_datasets()  # show the available datasets

boltztrap_mp: Effective mass and thermoelectric properties of 8924 compounds in The  Materials Project database that are calculated by the BoltzTraP software package run on the GGA-PBE or GGA+U density functional theory calculation results. The properties are reported at the temperature of 300 Kelvin and the carrier concentration of 1e18 1/cm3.

brgoch_superhard_training: 2574 materials used for training regressors that predict shear and bulk modulus.

castelli_perovskites: 18,928 perovskites generated with ABX combinatorics, calculating gllbsc band gap and pbe structure, and also reporting absolute band edge positions and heat of formation.

citrine_thermal_conductivity: Thermal conductivity of 872 compounds measured experimentally and retrieved from Citrine database from various references. The reported values are measured at various temperatures of which 295 are at room temperature.

dielectric_constant: 1,056 structures with dielectric properties, calculated with DFPT-PBE.

double_

['boltztrap_mp',
 'brgoch_superhard_training',
 'castelli_perovskites',
 'citrine_thermal_conductivity',
 'dielectric_constant',
 'double_perovskites_gap',
 'double_perovskites_gap_lumo',
 'elastic_tensor_2015',
 'expt_formation_enthalpy',
 'expt_formation_enthalpy_kingsbury',
 'expt_gap',
 'expt_gap_kingsbury',
 'flla',
 'glass_binary',
 'glass_binary_v2',
 'glass_ternary_hipt',
 'glass_ternary_landolt',
 'heusler_magnetic',
 'jarvis_dft_2d',
 'jarvis_dft_3d',
 'jarvis_ml_dft_training',
 'm2ax',
 'matbench_dielectric',
 'matbench_expt_gap',
 'matbench_expt_is_metal',
 'matbench_glass',
 'matbench_jdft2d',
 'matbench_log_gvrh',
 'matbench_log_kvrh',
 'matbench_mp_e_form',
 'matbench_mp_gap',
 'matbench_mp_is_metal',
 'matbench_perovskites',
 'matbench_phonons',
 'matbench_steels',
 'mp_all_20181018',
 'mp_nostruct_20181018',
 'phonon_dielectric_mp',
 'piezoelectric_tensor',
 'ricci_boltztrap_mp_tabular',
 'steel_strength',
 'superconductivity2018',
 'tholander_nitrides',
 'ucsb_thermoe

To obtain a `dataset`, use the code below. The first time it runs, the `dataset` is downloaded locally into the `matminer` folder inside the respective venv.

In [53]:
from matminer.datasets import load_dataset

df = load_dataset("dielectric_constant")

To better understand what each column represents, we can use the `get_all_dataset_info` function, passing the name of the `dataset` as its argument.

In [54]:
from matminer.datasets import get_all_dataset_info

print(get_all_dataset_info("dielectric_constant"))

Dataset: dielectric_constant
Description: 1,056 structures with dielectric properties, calculated with DFPT-PBE.
Columns:
	band_gap: Measure of the conductivity of a material
	cif: optional: Description string for structure
	e_electronic: electronic contribution to dielectric tensor
	e_total: Total dielectric tensor incorporating both electronic and ionic contributions
	formula: Chemical formula of the material
	material_id: Materials Project ID of the material
	meta: optional, metadata descriptor of the datapoint
	n: Refractive Index
	nsites: The \# of atoms in the unit cell of the calculation.
	poly_electronic: the average of the eigenvalues of the electronic contribution to the dielectric tensor
	poly_total: the average of the eigenvalues of the total (electronic and ionic) contributions to the dielectric tensor
	poscar: optional: Poscar metadata
	pot_ferroelectric: Whether the material is potentially ferroelectric
	space_group: Integer specifying the crystallographic structure of t

Because `matminer` returns a pandas DataFrame, all of the pandas tools for formatting and understanding the data can be used without any problem.

In [55]:
df.describe()

Unnamed: 0,nsites,space_group,volume,band_gap,n,poly_electronic,poly_total
count,1056.0,1056.0,1056.0,1056.0,1056.0,1056.0,1056.0
mean,7.530303,142.970644,166.420376,2.119432,2.434886,7.248049,14.777898
std,3.388443,67.264591,97.425084,1.604924,1.148849,13.054947,19.435303
min,2.0,1.0,13.980548,0.11,1.28,1.63,2.08
25%,5.0,82.0,96.262337,0.89,1.77,3.13,7.5575
50%,8.0,163.0,145.944691,1.73,2.19,4.79,10.54
75%,9.0,194.0,212.106405,2.885,2.73,7.44,15.4825
max,20.0,229.0,597.341134,8.32,16.03,256.84,277.78


In [56]:
mask = df["volume"] >= 580
df[mask]

Unnamed: 0,material_id,formula,nsites,space_group,volume,structure,band_gap,e_electronic,e_total,n,poly_electronic,poly_total,pot_ferroelectric,cif,meta,poscar
206,mp-23280,AsCl3,16,19,582.085309,"[[0.13113333 7.14863883 9.63476955] As, [2.457...",3.99,"[[2.2839161900000002, 0.00014519, -2.238000000...","[[2.49739759, 0.00069379, 0.00075864], [0.0004...",1.57,2.47,3.3,False,#\#CIF1.1\n###################################...,{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...,As4 Cl12\n1.0\n4.652758 0.000000 0.000000\n0.0...
216,mp-9064,RbTe,12,189,590.136085,"[[6.61780282 0. 0. ] Rb, [1.750...",0.43,"[[3.25648277, 5.9650000000000007e-05, 1.57e-06...","[[5.34517928, 0.00022474000000000002, -0.00018...",2.05,4.2,6.77,False,#\#CIF1.1\n###################################...,{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...,Rb6 Te6\n1.0\n10.118717 0.000000 0.000000\n-5....
219,mp-23230,PCl3,16,62,590.637274,"[[6.02561815 8.74038483 7.55586375] P, [2.7640...",4.03,"[[2.39067769, 0.00017593, 8.931000000000001e-0...","[[2.80467218, 0.00034093000000000003, 0.000692...",1.52,2.31,2.76,False,#\#CIF1.1\n###################################...,{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...,P4 Cl12\n1.0\n6.523152 0.000000 0.000000\n0.00...
251,mp-2160,Sb2Se3,20,62,597.341134,"[[3.02245275 0.42059268 1.7670481 ] Sb, [ 1.00...",0.76,"[[19.1521058, 5.5e-06, 0.00025268], [-1.078000...","[[81.93819038000001, 0.0006755800000000001, 0....",3.97,15.76,63.53,True,#\#CIF1.1\n###################################...,{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...,Sb8 Se12\n1.0\n4.029937 0.000000 0.000000\n0.0...


In [57]:
mask = df["band_gap"] > 0
nonmetal_df = df[mask]
nonmetal_df

Unnamed: 0,material_id,formula,nsites,space_group,volume,structure,band_gap,e_electronic,e_total,n,poly_electronic,poly_total,pot_ferroelectric,cif,meta,poscar
0,mp-441,Rb2Te,3,225,159.501208,"[[1.75725875 1.2425695 3.04366125] Rb, [5.271...",1.88,"[[3.44115795, -3.097e-05, -6.276e-05], [-2.837...","[[6.23414745, -0.00035252, -9.796e-05], [-0.00...",1.86,3.44,6.23,False,#\#CIF1.1\n###################################...,{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...,Rb2 Te1\n1.0\n5.271776 0.000000 3.043661\n1.75...
1,mp-22881,CdCl2,3,166,84.298097,"[[0. 0. 0.] Cd, [ 4.27210959 2.64061969 13.13...",3.52,"[[3.34688382, -0.04498543, -0.22379197], [-0.0...","[[7.97018673, -0.29423886, -1.463590159999999]...",1.78,3.16,6.73,False,#\#CIF1.1\n###################################...,{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...,Cd1 Cl2\n1.0\n3.850977 0.072671 5.494462\n1.78...
2,mp-28013,MnI2,3,164,108.335875,"[[0. 0. 0.] Mn, [-2.07904300e-06 2.40067320e+...",1.17,"[[5.5430849, -5.28e-06, -2.5030000000000003e-0...","[[13.80606079, 0.0006911900000000001, 9.655e-0...",2.23,4.97,10.64,False,#\#CIF1.1\n###################################...,{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...,Mn1 I2\n1.0\n4.158086 0.000000 0.000000\n-2.07...
3,mp-567290,LaN,4,186,88.162562,[[-1.73309900e-06 2.38611186e+00 5.95256328e...,1.12,"[[7.09316738, 7.99e-06, -0.0003864700000000000...","[[16.79535386, 8.199999999999997e-07, -0.00948...",2.65,7.04,17.99,False,#\#CIF1.1\n###################################...,{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...,La2 N2\n1.0\n4.132865 0.000000 0.000000\n-2.06...
4,mp-560902,MnF2,6,136,82.826401,"[[1.677294 2.484476 2.484476] Mn, [0. 0. 0.] M...",2.87,"[[2.4239622, 7.452000000000001e-05, 6.06100000...","[[6.44055613, 0.0020446600000000002, 0.0013203...",1.53,2.35,7.12,False,#\#CIF1.1\n###################################...,{u'incar': u'NELM = 100\nIBRION = 8\nLDAUTYPE ...,Mn2 F4\n1.0\n3.354588 0.000000 0.000000\n0.000...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1051,mp-568032,Cd(InSe2)2,7,111,212.493121,"[[0. 0. 0.] Cd, [2.9560375 0. 3.03973 ...",0.87,"[[7.74896783, 0.0, 0.0], [0.0, 7.74896783, 0.0...","[[11.85159471, 1e-08, 0.0], [1e-08, 11.8515962...",2.77,7.67,11.76,True,#\#CIF1.1\n###################################...,{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...,Cd1 In2 Se4\n1.0\n5.912075 0.000000 0.000000\n...
1052,mp-696944,LaHBr2,8,194,220.041363,"[[2.068917 3.58317965 3.70992025] La, [4.400...",3.60,"[[4.40504391, 6.1e-07, 0.0], [6.1e-07, 4.40501...","[[8.77136355, 1.649999999999999e-06, 0.0], [1....",2.00,3.99,7.08,True,#\#CIF1.1\n###################################...,{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...,La2 H2 Br4\n1.0\n4.137833 0.000000 0.000000\n-...
1053,mp-16238,Li2AgSb,4,216,73.882306,"[[1.35965225 0.96141925 2.354987 ] Li, [2.719...",0.14,"[[212.60750153, -1.843e-05, 0.0], [-1.843e-05,...","[[232.59707383, -0.0005407400000000001, 0.0025...",14.58,212.61,232.60,True,#\#CIF1.1\n###################################...,{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...,Li2 Ag1 Sb1\n1.0\n4.078957 0.000000 2.354987\n...
1054,mp-4405,Rb3AuO,5,221,177.269065,"[[0. 2.808758 2.808758] Rb, [2.808758 2....",0.21,"[[6.40511712, 0.0, 0.0], [0.0, 6.40511712, 0.0...","[[22.43799785, 0.0, 0.0], [0.0, 22.4380185, 0....",2.53,6.41,22.44,True,#\#CIF1.1\n###################################...,{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...,Rb3 Au1 O1\n1.0\n5.617516 0.000000 0.000000\n0...


In [58]:
df["poly_ionic"] = df["poly_total"] - df["poly_electronic"]

In [59]:
df.head()

Unnamed: 0,material_id,formula,nsites,space_group,volume,structure,band_gap,e_electronic,e_total,n,poly_electronic,poly_total,pot_ferroelectric,cif,meta,poscar,poly_ionic
0,mp-441,Rb2Te,3,225,159.501208,"[[1.75725875 1.2425695 3.04366125] Rb, [5.271...",1.88,"[[3.44115795, -3.097e-05, -6.276e-05], [-2.837...","[[6.23414745, -0.00035252, -9.796e-05], [-0.00...",1.86,3.44,6.23,False,#\#CIF1.1\n###################################...,{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...,Rb2 Te1\n1.0\n5.271776 0.000000 3.043661\n1.75...,2.79
1,mp-22881,CdCl2,3,166,84.298097,"[[0. 0. 0.] Cd, [ 4.27210959 2.64061969 13.13...",3.52,"[[3.34688382, -0.04498543, -0.22379197], [-0.0...","[[7.97018673, -0.29423886, -1.463590159999999]...",1.78,3.16,6.73,False,#\#CIF1.1\n###################################...,{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...,Cd1 Cl2\n1.0\n3.850977 0.072671 5.494462\n1.78...,3.57
2,mp-28013,MnI2,3,164,108.335875,"[[0. 0. 0.] Mn, [-2.07904300e-06 2.40067320e+...",1.17,"[[5.5430849, -5.28e-06, -2.5030000000000003e-0...","[[13.80606079, 0.0006911900000000001, 9.655e-0...",2.23,4.97,10.64,False,#\#CIF1.1\n###################################...,{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...,Mn1 I2\n1.0\n4.158086 0.000000 0.000000\n-2.07...,5.67
3,mp-567290,LaN,4,186,88.162562,[[-1.73309900e-06 2.38611186e+00 5.95256328e...,1.12,"[[7.09316738, 7.99e-06, -0.0003864700000000000...","[[16.79535386, 8.199999999999997e-07, -0.00948...",2.65,7.04,17.99,False,#\#CIF1.1\n###################################...,{u'incar': u'NELM = 100\nIBRION = 8\nLWAVE = F...,La2 N2\n1.0\n4.132865 0.000000 0.000000\n-2.06...,10.95
4,mp-560902,MnF2,6,136,82.826401,"[[1.677294 2.484476 2.484476] Mn, [0. 0. 0.] M...",2.87,"[[2.4239622, 7.452000000000001e-05, 6.06100000...","[[6.44055613, 0.0020446600000000002, 0.0013203...",1.53,2.35,7.12,False,#\#CIF1.1\n###################################...,{u'incar': u'NELM = 100\nIBRION = 8\nLDAUTYPE ...,Mn2 F4\n1.0\n3.354588 0.000000 0.000000\n0.000...,4.77
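
The new `poly_ionic` column is simply the difference between `poly_total` and `poly_electronic`. As a quick check, the sketch below repeats that subtraction in plain Python for the first three rows, with the values copied from the table above. The rounding to two decimals is an assumption made only to hide floating-point noise.

```python
# (poly_total, poly_electronic) for the first three rows of the table above.
rows = {
    "Rb2Te": (6.23, 3.44),
    "CdCl2": (6.73, 3.16),
    "MnI2": (10.64, 4.97),
}

# Ionic contribution = total - electronic, rounded to hide float noise.
poly_ionic = {
    formula: round(total - electronic, 2)
    for formula, (total, electronic) in rows.items()
}

print(poly_ionic)
```

The results should agree with the `poly_ionic` values in the last column of the table.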


As mentioned earlier, it is recommended to use numeric, well-formatted data regardless of which Machine Learning model will be used. Fortunately, the `matminer` library offers tools for selecting and preparing features that are extremely powerful and easy to use. These tools can also be applied to an entire dataframe and run in parallel, which makes the process much faster.

For example, the elemental composition of a compound can be rewritten as a list containing the fraction of each element in the formula, with 0 for the elements that are absent.

In [60]:
from pymatgen.core import Composition

elemento = Composition("Mn2O")

In [61]:
from matminer.featurizers.composition.element import ElementFraction

ef = ElementFraction()

In [62]:
element_fractions = ef.featurize(elemento)

print(element_fractions)

[0, 0, 0, 0, 0, 0, 0, 0.3333333333333333, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.6666666666666666, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


In [63]:
element_fraction_labels = ef.feature_labels()
print(element_fraction_labels)

['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu', 'Am', 'Cm', 'Bk', 'Cf', 'Es', 'Fm', 'Md', 'No', 'Lr']


In [64]:
print(element_fraction_labels[7], element_fractions[7])
print(element_fraction_labels[24], element_fractions[24])

O 0.3333333333333333
Mn 0.6666666666666666
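
To make the element-fraction vector more concrete, the sketch below rebuilds it by hand in plain Python. It is not how `matminer` or `pymatgen` implement it: the `element_fractions` helper is hypothetical, the element list is truncated at Mn to keep the example short, and the regular expression only handles simple formulas without parentheses.

```python
import re

# First 25 element symbols, in the same order as the feature labels above.
ELEMENTS = ["H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne",
            "Na", "Mg", "Al", "Si", "P", "S", "Cl", "Ar", "K", "Ca",
            "Sc", "Ti", "V", "Cr", "Mn"]


def element_fractions(formula, elements=ELEMENTS):
    """Return the atomic fraction of each element, 0 for absent ones."""
    counts = {}
    # Each match is (symbol, amount); a missing amount means one atom.
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    total = sum(counts.values())
    return [counts.get(element, 0.0) / total for element in elements]


fractions = element_fractions("Mn2O")
print(ELEMENTS[7], fractions[7])
print(ELEMENTS[24], fractions[24])
```

For `Mn2O` there are three atoms in total, so oxygen gets 1/3 and manganese 2/3, as in the output above.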


In [65]:
df = load_dataset("brgoch_superhard_training")
df.head()

Unnamed: 0,formula,bulk_modulus,shear_modulus,composition,material_id,structure,brgoch_feats,suspect_value
0,AlPt3,225.230461,91.197748,"(Al, Pt)",mp-188,"[[0. 0. 0.] Al, [0. 1.96140395 1.96140...","{'atomic_number_feat_1': 123.5, 'atomic_number...",False
1,Mn2Nb,232.69634,74.590157,"(Mn, Nb)",mp-12659,[[-2.23765223e-08 1.42974191e+00 5.92614104e...,"{'atomic_number_feat_1': 45.5, 'atomic_number_...",False
2,HfO2,204.573433,98.564374,"(Hf, O)",mp-352,"[[2.24450185 3.85793022 4.83390736] O, [2.7788...","{'atomic_number_feat_1': 44.0, 'atomic_number_...",False
3,Cu3Pt,159.31264,51.778816,"(Cu, Pt)",mp-12086,"[[0. 1.86144248 1.86144248] Cu, [1.861...","{'atomic_number_feat_1': 82.5, 'atomic_number_...",False
4,Mg3Pt,69.637565,27.588765,"(Mg, Pt)",mp-18707,"[[0. 0. 2.73626461] Mg, [0. ...","{'atomic_number_feat_1': 57.0, 'atomic_number_...",False


As we can see, a new dataframe has been loaded. Now, using `ElementFraction()` and `featurize_dataframe`, we can create features based on the composition.

In [66]:
ef = ElementFraction()
df = ef.featurize_dataframe(df, "composition")

ElementFraction: 100%|██████████| 2574/2574 [00:01<00:00, 1743.76it/s]


In [67]:
df.head()

Unnamed: 0,formula,bulk_modulus,shear_modulus,composition,material_id,structure,brgoch_feats,suspect_value,H,He,...,Pu,Am,Cm,Bk,Cf,Es,Fm,Md,No,Lr
0,AlPt3,225.230461,91.197748,"(Al, Pt)",mp-188,"[[0. 0. 0.] Al, [0. 1.96140395 1.96140...","{'atomic_number_feat_1': 123.5, 'atomic_number...",False,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Mn2Nb,232.69634,74.590157,"(Mn, Nb)",mp-12659,[[-2.23765223e-08 1.42974191e+00 5.92614104e...,"{'atomic_number_feat_1': 45.5, 'atomic_number_...",False,0,0,...,0,0,0,0,0,0,0,0,0,0
2,HfO2,204.573433,98.564374,"(Hf, O)",mp-352,"[[2.24450185 3.85793022 4.83390736] O, [2.7788...","{'atomic_number_feat_1': 44.0, 'atomic_number_...",False,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Cu3Pt,159.31264,51.778816,"(Cu, Pt)",mp-12086,"[[0. 1.86144248 1.86144248] Cu, [1.861...","{'atomic_number_feat_1': 82.5, 'atomic_number_...",False,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Mg3Pt,69.637565,27.588765,"(Mg, Pt)",mp-18707,"[[0. 0. 2.73626461] Mg, [0. ...","{'atomic_number_feat_1': 57.0, 'atomic_number_...",False,0,0,...,0,0,0,0,0,0,0,0,0,0


Another featurizer is `DensityFeatures`, which uses the structure data to compute density-related features. For example:

In [68]:
df = load_dataset("phonon_dielectric_mp")

df.head()

Unnamed: 0,mpid,eps_electronic,eps_total,last phdos peak,structure,formula
0,mp-1000,6.311555,12.773454,98.585771,"[[2.8943817 2.04663693 5.01321616] Te, [0. 0....",BaTe
1,mp-1002124,24.137743,32.965593,677.585725,"[[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78...",HfC
2,mp-1002164,8.111021,11.169464,761.585719,"[[0. 0. 0.] Ge, [ 3.45311592 3.45311592 -3.45...",GeC
3,mp-10044,10.032168,10.128936,701.585723,"[[0.98372595 0.69559929 1.70386332] B, [0. 0. ...",BAs
4,mp-1008223,3.979201,6.394043,204.585763,"[[0. 0. 0.] Ca, [ 4.95 4.95 -4.95] Se]",CaSe


In [69]:
from matminer.featurizers.structure import DensityFeatures

densityf = DensityFeatures()
densityf.feature_labels()  # show the features produced by this featurizer

['density', 'vpa', 'packing fraction']

In [70]:
df = densityf.featurize_dataframe(df, "structure")

df.head()

DensityFeatures: 100%|██████████| 1296/1296 [00:01<00:00, 1022.93it/s]


Unnamed: 0,mpid,eps_electronic,eps_total,last phdos peak,structure,formula,density,vpa,packing fraction
0,mp-1000,6.311555,12.773454,98.585771,"[[2.8943817 2.04663693 5.01321616] Te, [0. 0....",BaTe,4.937886,44.545547,0.596286
1,mp-1002124,24.137743,32.965593,677.585725,"[[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78...",HfC,9.868234,16.027886,0.531426
2,mp-1002164,8.111021,11.169464,761.585719,"[[0. 0. 0.] Ge, [ 3.45311592 3.45311592 -3.45...",GeC,5.760895,12.199996,0.39418
3,mp-10044,10.032168,10.128936,701.585723,"[[0.98372595 0.69559929 1.70386332] B, [0. 0. ...",BAs,5.087634,13.991016,0.3196
4,mp-1008223,3.979201,6.394043,204.585763,"[[0. 0. 0.] Ca, [ 4.95 4.95 -4.95] Se]",CaSe,2.750191,35.937,0.428523
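
To give some intuition for the first two columns, the sketch below recomputes `vpa` (volume per atom) and `density` by hand for CaSe. It assumes a two-atom cell whose volume is twice the `vpa` value reported for CaSe in the table above and uses rounded standard atomic masses. It is only a consistency check, not `matminer`'s own implementation.

```python
# Conversion from amu/Å^3 to g/cm^3 (1 amu = 1.66053907e-24 g, 1 Å^3 = 1e-24 cm^3).
AMU_PER_A3_TO_G_PER_CM3 = 1.66053907

# Assumed inputs for CaSe: rounded atomic masses (amu) and a two-atom cell
# whose volume (Å^3) is taken as 2 x the vpa value shown in the table.
atomic_masses = {"Ca": 40.078, "Se": 78.971}
n_atoms = 2
cell_volume = 2 * 35.937

# Volume per atom and mass density of the cell.
vpa = cell_volume / n_atoms
density = sum(atomic_masses.values()) * AMU_PER_A3_TO_G_PER_CM3 / cell_volume

print(f"vpa = {vpa:.3f} Å^3/atom")
print(f"density = {density:.3f} g/cm^3")
```

The density comes out at about 2.75 g/cm³, in line with the value in the table.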


As we can see, this new `dataset` does not have a `composition` column like the previous one. It does have a `formula` column, however, and from it we can use `StrToComposition` with `featurize_dataframe` to convert the formula into a `composition` column.

In [71]:
from matminer.featurizers.conversions import StrToComposition

stc = StrToComposition()
df = stc.featurize_dataframe(df, "formula", pbar=False)

It is now possible to apply the featurizer that creates one column per element, each holding the atomic fraction of that element.

In [72]:
df = ef.featurize_dataframe(df, "composition")

ElementFraction: 100%|██████████| 1296/1296 [00:00<00:00, 1860.45it/s]


In [73]:
df

Unnamed: 0,mpid,eps_electronic,eps_total,last phdos peak,structure,formula,density,vpa,packing fraction,composition,...,Pu,Am,Cm,Bk,Cf,Es,Fm,Md,No,Lr
0,mp-1000,6.311555,12.773454,98.585771,"[[2.8943817 2.04663693 5.01321616] Te, [0. 0....",BaTe,4.937886,44.545547,0.596286,"(Ba, Te)",...,0,0,0,0,0,0,0,0,0,0
1,mp-1002124,24.137743,32.965593,677.585725,"[[0. 0. 0.] Hf, [-3.78195772 -3.78195772 -3.78...",HfC,9.868234,16.027886,0.531426,"(Hf, C)",...,0,0,0,0,0,0,0,0,0,0
2,mp-1002164,8.111021,11.169464,761.585719,"[[0. 0. 0.] Ge, [ 3.45311592 3.45311592 -3.45...",GeC,5.760895,12.199996,0.394180,"(Ge, C)",...,0,0,0,0,0,0,0,0,0,0
3,mp-10044,10.032168,10.128936,701.585723,"[[0.98372595 0.69559929 1.70386332] B, [0. 0. ...",BAs,5.087634,13.991016,0.319600,"(B, As)",...,0,0,0,0,0,0,0,0,0,0
4,mp-1008223,3.979201,6.394043,204.585763,"[[0. 0. 0.] Ca, [ 4.95 4.95 -4.95] Se]",CaSe,2.750191,35.937000,0.428523,"(Ca, Se)",...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1291,mp-998603,4.178159,14.155681,142.585768,[[1.69099645e-03 3.81913207e+00 1.07685858e-01...,RbPbBr3,4.581318,38.593148,0.507357,"(Rb, Pb, Br)",...,0,0,0,0,0,0,0,0,0,0
1292,mp-998604,3.548202,13.938313,223.585761,[[1.64439731e-03 3.64832409e+00 1.03287739e-01...,RbPbCl3,3.933733,33.688084,0.542370,"(Rb, Pb, Cl)",...,0,0,0,0,0,0,0,0,0,0
1293,mp-998612,3.960980,9.617663,219.718383,"[[-3.66731982 -1.91142875 2.96640499] K, [ 3....",KGeBr3,3.151275,37.038786,0.388217,"(K, Ge, Br)",...,0,0,0,0,0,0,0,0,0,0
1294,mp-999498,4.613954,4.972619,1090.585692,"[[ 1.57631457 -0.32583322 -1.57631457] N, [ 1....",N2,3.379498,6.882287,0.167146,(N),...,0,0,0,0,0,0,0,0,0,0
