In [2]:
import numpy as np
import pandas as pd
import json
import os
import seaborn as sns
from pymatgen.core import Structure
from pymatgen.ext.matproj import MPRester
from pymatgen.core.composition import *
from pymatgen.analysis.chemenv.coordination_environments.coordination_geometry_finder import LocalGeometryFinder
from pymatgen.analysis.chemenv.coordination_environments.structure_environments import LightStructureEnvironments
from pymatgen.analysis.chemenv.coordination_environments.chemenv_strategies import SimplestChemenvStrategy
from scipy.spatial import ConvexHull
from pymatgen.core.periodic_table import *
%matplotlib inline

### Initial database: Materials Project ternary Li oxides (1,830 structures to start with)
<br>
A portion of these will be removed if they (1) do not contain any occupied tetrahedral site, (2) have an energy above hull (a stability measure) greater than 50 meV/atom, or (3) have oxidation states that cannot be labeled automatically (meaning that the compound is not easy to identify as an ionic crystal).

### After removing structures with an energy above hull greater than 50 meV/atom, 735 entries are left.

### Next, I excluded entries where Li can sit in multiple sites, OR X can sit in multiple sites, OR Li and X sit in the same site.
<br>
Also, entries that DO NOT have any species that occupy a tetrahedral site are excluded.

<br>
This leaves 113 final entries. All of these compositions can be written as Li-X-O ternary lithium oxides. This is much smaller than the initial database (1,830 structures), but it is still larger than the minimum required by the guideline. If we want a larger database, there are a few ways to increase the size of our data.
<br>
(1) Relax the E above hull constraint so that less stable structures are allowed (perhaps up to 80 meV/atom).
<br>
(2) Exclude fewer entries in the process: this is doable, but I am not sure how it would affect the fitting.
<br> 
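
As a rough illustration of option (1), the sketch below applies a stability cutoff at 50 and then at 80 meV/atom. The table `entries`, its column name `e_above_hull` (assumed to be in eV/atom), the helper `filter_by_hull`, and all values are hypothetical placeholders, not the actual Materials Project query results used above.

```python
import pandas as pd

# Hypothetical toy table: IDs and energies are made up for illustration.
entries = pd.DataFrame(
    {
        "mpid": ["a", "b", "c", "d"],
        "e_above_hull": [0.000, 0.030, 0.065, 0.120],  # assumed unit: eV/atom
    }
)


def filter_by_hull(frame, max_mev_per_atom):
    """Keep rows whose energy above hull is at most the cutoff (in meV/atom)."""
    threshold_ev = max_mev_per_atom / 1000.0  # convert meV/atom to eV/atom
    kept = frame[frame["e_above_hull"] <= threshold_ev]
    return kept.reset_index(drop=True)


strict = filter_by_hull(entries, 50)   # cutoff used in this notebook
relaxed = filter_by_hull(entries, 80)  # relaxed cutoff suggested in option (1)
print(len(strict), len(relaxed))
```

Keeping the cutoff as a function argument makes it easy to check how many entries survive at each threshold before rerunning the later site filters.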

In [5]:
df = pd.read_pickle("Tetrahedral-Dataset_V2.pickle")

In [6]:
df.head()

Unnamed: 0,mpid,struct,formula,X_species,tet_li,tet_X,Tetrahedral Occupancy,Tetrahedral Volume,Competing Volume,Competing Environment,X_charge,X Ionic Radius,X Atomic Radius,X Electronegativity
0,mp-1019724,"{'@module': 'pymatgen.core.structure', '@class...",LiEuO2,Eu,"{'oct': [], 'tet': [{'csm': 1.575, 'env': 'T:4...","{'oct': [{'csm': 0.576, 'env': 'O:6', 'volume'...",Li,4.250468,17.373056,O:6,3.0,1.087,2.31,1.2
1,mp-1020012,"{'@module': 'pymatgen.core.structure', '@class...",Li2Ge4O9,Ge,"{'oct': [], 'tet': [], 'other': [{'csm': 1.345...","{'oct': [{'csm': 0.162, 'env': 'O:6', 'volume'...",Ge,2.911377,6.031204,S:5,4.0,0.67,1.25,2.01
2,mp-10620,"{'@module': 'pymatgen.core.structure', '@class...",LiPrO2,Pr,"{'oct': [], 'tet': [{'csm': 2.757, 'env': 'T:4...","{'oct': [], 'tet': [], 'other': [{'csm': 1.622...",Li,3.913044,22.408003,ST:7,3.0,1.13,2.47,1.13
3,mp-1172980,"{'@module': 'pymatgen.core.structure', '@class...",Li17(CoO4)3,Co,"{'oct': [], 'tet': [], 'other': [{'csm': 16.90...","{'oct': [], 'tet': [{'csm': 1.42, 'env': 'T:4'...",Co,3.995332,13.048954,BO_2:8,2.333333,0.885,1.52,1.88
4,mp-1176738,"{'@module': 'pymatgen.core.structure', '@class...",LiMn7O14,Mn,"{'oct': [], 'tet': [{'csm': 0.051, 'env': 'T:4...","{'oct': [{'csm': 0.749, 'env': 'O:6', 'volume'...",Li,4.025109,9.711415,O:6,3.857143,0.67,1.61,1.55


In [8]:
print("Number of data: ", len(df))

Number of data:  113


### The dataframe is loaded into "df" from the file Tetrahedral-Dataset_V2.pickle, which is also included in this folder.
* The y vector should be made from the column "Tetrahedral Occupancy". We assign "1" if this column value is "Li" (meaning that the tetrahedral site is occupied by lithium) and "0" if it is not "Li" (meaning that another cation occupies the tetrahedral site in this structure).
* (1) Feature 1 : Tetrahedral volume (unit in $\unicode{x212B}^3$)
* (2) Feature 2 : Competing volume (unit in $\unicode{x212B}^3$) - A smaller coefficient is expected for this feature since it is a general volume for all kinds of coordination environments (cubic, 12-fold coordination, etc).
* (3) Feature 3 : Electronegativity of X
* (4) Feature 4 : Ionic radius of X - Since we are focused on the competition between Li and the other cation for a given tetrahedral site, we may even convert this value to a radius ratio ($r_X/r_{Li}$). We can decide later, or we can quickly add an additional column.
* (5) Feature 5 : Atomic radius of X - This will be less relevant than the ionic radius, but is still included. This can also be converted to the ratio of radius.
* (6) Feature 6 : One-hot encoding for different parts of the periodic table, such as: "Is it a transition metal?", "Is it an alkali metal?", "Is it an alkaline earth metal?", "Is it a rare earth?"
* (7) We can add the row and column of the X element in the periodic table.
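
The label and feature construction described above could be sketched as follows. This is a minimal example under stated assumptions, not the final pipeline: `toy` copies only the five rows shown in `df.head()`, while `R_LI` (a placeholder Li ionic radius), the `ELEMENT_CLASS` lookup, and the helper `build_xy` are hypothetical names introduced for illustration. In practice the Li radius should come from the same source as the "X Ionic Radius" column, and the element classes from a full periodic-table lookup.

```python
import pandas as pd

# First five rows of the table shown above (only the columns used here).
toy = pd.DataFrame(
    {
        "X_species": ["Eu", "Ge", "Pr", "Co", "Mn"],
        "Tetrahedral Occupancy": ["Li", "Ge", "Li", "Co", "Li"],
        "Tetrahedral Volume": [4.250468, 2.911377, 3.913044, 3.995332, 4.025109],
        "Competing Volume": [17.373056, 6.031204, 22.408003, 13.048954, 9.711415],
        "X Electronegativity": [1.2, 2.01, 1.13, 1.88, 1.55],
        "X Ionic Radius": [1.087, 0.67, 1.13, 0.885, 0.67],
        "X Atomic Radius": [2.31, 1.25, 2.47, 1.52, 1.61],
    }
)

NUMERIC_COLUMNS = [
    "Tetrahedral Volume",
    "Competing Volume",
    "X Electronegativity",
    "X Ionic Radius",
    "X Atomic Radius",
]

# ASSUMPTION: placeholder Li ionic radius; replace with a value taken from
# the same radius table as the "X Ionic Radius" column.
R_LI = 0.90

# ASSUMPTION: hand-written class lookup covering only the five elements above.
ELEMENT_CLASS = {
    "Eu": "rare_earth",
    "Pr": "rare_earth",
    "Co": "transition_metal",
    "Mn": "transition_metal",
    "Ge": "other",
}


def build_xy(frame, r_li=R_LI):
    """Return (features, y): y is 1 where Li occupies the tetrahedral site."""
    y = (frame["Tetrahedral Occupancy"] == "Li").astype(int).to_numpy()
    features = frame[NUMERIC_COLUMNS].copy()
    # Optional extra column: ionic radius ratio r_X / r_Li.
    features["radius_ratio"] = frame["X Ionic Radius"] / r_li
    # One-hot encode the coarse periodic-table class of X.
    classes = frame["X_species"].map(ELEMENT_CLASS).fillna("other")
    one_hot = pd.get_dummies(classes, prefix="is", dtype=int)
    return pd.concat([features, one_hot], axis=1), y


X, y = build_xy(toy)
print(X.columns.tolist())
print(y)
```

Calling `build_xy(df)` on the full dataframe should work the same way once `ELEMENT_CLASS` covers every X species; elements missing from the lookup fall into the "other" class here.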