#Kaggle competition: Predicting Transparent Conductors#

##Background##
The Novel Materials Discovery (NOMAD) Laboratory hosted a [materials informatics challenge](https://www.kaggle.com/c/nomad2018-predict-transparent-conductors) on Kaggle, the most popular data science arena on the internet. The overall goal was to predict band gap (more on this later) and stability (expressed as formation energy) from a collection of structural features gathered via X-ray crystallography. Since I have previously published work on organic semiconductors, I figured I would be able to apply at least a little domain knowledge.

Just like any area of computer science, this project is all about trade-offs. Currently, the gold standard in quantum chemistry modeling is the coupled-cluster method, CCSD(T). These calculations are very expensive, often requiring weeks to run even on distributed systems. Density functional theory (DFT) models are less computationally demanding and work well for coarser calculations such as band gap, but they are still expensive compared to classical ML models. Therein lies the need for this project: take previously collected DFT calculations, relate them to the crystal structure of the materials, and develop a quantitative structure-property relationship (QSPR). QSPR is quite mature in the drug discovery field, where *in silico* screening is very common. However, materials informatics is still [very much in its infancy](https://aip.scitation.org/doi/10.1063/1.4946894), mainly due to limited throughput in characterization.

The data for this project is confined to materials composed of aluminum (Al), gallium (Ga), and indium (In), with a fixed stoichiometric ratio of oxygen. These materials are cheap to produce, chemically stable, and have a large [band gap](https://en.wikipedia.org/wiki/Band_gap#Photovoltaic_cells). A large band gap implies low absorbance of visible light, which, together with high conductivity, is what makes a good transparent conductor.
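To make the link between band gap and transparency concrete, here is a rough sketch that converts a band gap into the longest wavelength a material can absorb, using E = hc/λ. The function names, the 380 nm visible-range cutoff, and the example gap values are my own illustrative assumptions, not values from the competition data. It also ignores real-world effects such as defects and indirect gaps.

```python
# Convert a band gap (eV) into the longest wavelength the material can absorb.
# Simplified picture: photons with less energy than the gap are not absorbed.

HC_EV_NM = 1239.84      # Planck constant times speed of light, in eV*nm
VISIBLE_MIN_NM = 380.0  # assumed violet edge of the visible range


def absorption_edge_nm(band_gap_ev: float) -> float:
    """Longest wavelength (nm) whose photons can bridge the band gap."""
    if band_gap_ev <= 0:
        raise ValueError("band gap must be positive")
    return HC_EV_NM / band_gap_ev


def is_transparent_to_visible(band_gap_ev: float) -> bool:
    """True when the absorption edge lies below the visible range."""
    return absorption_edge_nm(band_gap_ev) < VISIBLE_MIN_NM


if __name__ == "__main__":
    # Illustrative gap values only, not taken from the data set.
    for gap in (1.1, 2.0, 3.5, 4.5):
        edge = absorption_edge_nm(gap)
        print(f"{gap:.1f} eV -> edge at {edge:.0f} nm, "
              f"transparent: {is_transparent_to_visible(gap)}")
```

Under this simplified picture, only materials whose absorption edge falls in the ultraviolet pass visible light, which is why a large band gap is the property to look for.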

##Digging in##
One of my favorite ways of digging into an unfamiliar data set of low-to-medium dimensionality is a pairplot. It gives me a high-level view of what sort of feature engineering I will need to do going forward, and how much collinearity I have to worry about when fitting models.

In [2]:
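import pandas as pd
import seaborn as sb

train_df = pd.read_csv('train.csv')
# The last two columns are assumed to be the targets; plot only the features.
features = list(train_df.columns[:-2])
# hue expects a categorical column, so bin the continuous band gap into quartiles.
train_df['bandgap_quartile'] = pd.qcut(train_df['bandgap_energy_ev'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
sb.pairplot(train_df, vars=features, hue='bandgap_quartile')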

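A pairplot shows collinearity visually, but it can help to put numbers on it as well. The sketch below lists feature pairs whose absolute Pearson correlation exceeds a threshold. The helper name `collinear_pairs`, the 0.8 threshold, and the toy columns `x`, `y`, and `z` are all made up for illustration; in practice I would pass the feature columns of `train_df` instead.

```python
import pandas as pd


def collinear_pairs(df: pd.DataFrame, threshold: float = 0.8):
    """Return (col_a, col_b, |r|) for column pairs with |Pearson r| > threshold."""
    # Only numeric columns have a meaningful Pearson correlation.
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            r = float(corr.loc[a, b])
            if r > threshold:
                pairs.append((a, b, r))
    # Strongest correlations first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy data with made-up values: y is roughly 2 * x, z is unrelated.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.0, 9.9],
    "z": [5.0, 1.0, 4.0, 2.0, 3.0],
})

for a, b, r in collinear_pairs(toy):
    print(f"{a} vs {b}: |r| = {r:.3f}")
```

Pairs flagged this way are candidates for dropping or combining before fitting linear models, where collinearity tends to hurt most.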