## Data cleaning

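This notebook cleans the input data and computes, for each zeolite framework code, the fraction of literature synthesis entries that use each common synthesis condition. Three input datasets are loaded: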

 - `dm` is the matrix of pairwise distances between zeolites, computed using PDD and AMD.
 - `synth` is the set of synthesis conditions collected from the literature.
 - `feat` is the set of zeolite features.

In [5]:
import itertools
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

In [6]:
dm = pd.read_csv("../data/iza_dm.csv", index_col=0)
synth = pd.read_excel("../data/synthesis-complete.xlsx")
feat = pd.read_csv("../data/zeo-features.csv", index_col=0)

In [10]:
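def clean_name(x) -> str:
    # Non-string cells (such as NaN) map to an empty string;
    # whitespace-only strings become empty after stripping.
    if not isinstance(x, str):
        return ""

    return x.strip()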

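def clean_iza(x: str) -> str:
    # Drop the "*" and leading "-" markers from a framework code
    # so that only the bare code remains.
    x = x.replace("*", "")
    if x.startswith("-"):
        x = x.replace("-", "")

    return x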
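
As a quick illustration of what the two helpers are meant to do, the sketch below applies the same rules with pandas string methods to a few sample framework codes. It is a minimal sketch rather than the notebook's own code: the `codes` series is hypothetical, and it assumes a leading hyphen is the only hyphen in a code.

```python
import pandas as pd

# Hypothetical sample of framework codes carrying "*" and "-" prefixes.
codes = pd.Series(["*BEA", "-CLO", "MFI ", "*-EWT"])

cleaned = (
    codes.str.strip()                    # drop surrounding whitespace, as clean_name does
    .str.replace("*", "", regex=False)   # remove asterisk markers
    .str.lstrip("-")                     # remove a leading hyphen
)

print(cleaned.tolist())  # ['BEA', 'CLO', 'MFI', 'EWT']
```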

In [11]:
SYNTH_COLS = [f"syn{n}" for n in range(1, 9)]
# Collect every distinct, non-empty condition label across the syn1..syn8 columns.
conditions = synth[SYNTH_COLS].values.reshape(-1).tolist()
conditions = {x.strip() for x in conditions if isinstance(x, str) and x.strip()}

In [13]:
conditions_count = {x: 0 for x in conditions}

for _, row in synth[SYNTH_COLS].iterrows():
    v = row.dropna().values.tolist()
    for element in v:
        name = clean_name(element)
        if name in conditions_count:
            conditions_count[name] += 1

In [24]:
ccount = pd.Series(conditions_count)

In [25]:
# Keep only the conditions that occur at least 8 times across all synthesis entries.
popular = ccount.loc[ccount >= 8].sort_values()

In [26]:
new_columns = {}

# For each synthesis entry, flag which of the popular conditions it lists.
for i, row in synth[SYNTH_COLS].iterrows():
    elements = {clean_name(x) for x in row.dropna().values.tolist()}
    new_columns[i] = {
        x: x in elements
        for x in popular.index
    }

In [27]:
new = pd.DataFrame(new_columns).T
new = new[sorted(new.columns)]
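
The counting and indicator steps above can also be written without explicit row loops. The block below is a hedged sketch on a toy frame, not a drop-in replacement: the `toy` data, its labels, and the threshold of 2 (standing in for the notebook's 8) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for synth[SYNTH_COLS]; the labels are made up.
toy = pd.DataFrame({
    "syn1": ["OSDA ", "F", "OSDA"],
    "syn2": ["Ge", np.nan, " "],
})

# Long format: one value per non-missing cell, indexed by the original row.
long = toy.melt(ignore_index=False, value_name="condition")["condition"]
long = long.dropna().str.strip()
long = long[long != ""]

# Occurrence count per condition, then the "popular" subset.
counts = long.value_counts()
popular = counts[counts >= 2].index  # hypothetical threshold for the toy data

# One indicator column per popular condition, one row per original entry.
indicators = (
    pd.get_dummies(long).astype(int)
    .groupby(level=0).max()
    .reindex(index=toy.index, columns=sorted(popular), fill_value=0)
    .astype(bool)
)

print(indicators)
```

Here `melt` with `ignore_index=False` keeps the original row labels, so grouping on the index recovers one row per synthesis entry.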

In [35]:
df = synth.drop([
    "sda", "formula", "FD", "max_ring_size", "channel_dim",
    "inc_diameter", "inc_vol", "accvol", "comp", "Num N", "Num P"
] + SYNTH_COLS, axis=1)

df = pd.concat([df, new], axis=1)
df["Code"] = df["Code"].apply(clean_iza)

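# Select the indicator columns before averaging so that non-numeric columns are never aggregated.
# The mean of a boolean column is the fraction of entries per framework that use that condition.
synfracs = df.groupby("Code")[list(new.columns)].mean()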

In [36]:
synfracs.to_csv("../data/synthesis_fraction.csv")
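
The saved table relies on the fact that averaging a boolean column within a group gives the fraction of `True` values in that group. A minimal sketch with made-up data (the `toy` frame and its `F` and `Ge` columns are hypothetical, not values from the dataset):

```python
import pandas as pd

# Hypothetical indicator table: one row per synthesis entry.
toy = pd.DataFrame({
    "Code": ["MFI", "MFI", "MFI", "MFI", "BEA", "BEA"],
    "F":    [True, False, True, True, False, False],
    "Ge":   [False, False, False, True, True, True],
})

# Mean of booleans per group = share of entries using each condition.
fractions = toy.groupby("Code")[["F", "Ge"]].mean()

print(fractions.loc["MFI", "F"])   # 0.75
print(fractions.loc["BEA", "Ge"])  # 1.0
```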