#### TASK: Parse XML dump and write info to a dataframe
_In this notebook we extract chemical data from an XML dump (scraped from Wikipedia) and store it in a more usable format (e.g., a DataFrame or a dictionary)._ Structurally, we want to go from:

> **&lt;title>** Sodium chloride **&lt;/title>**
>
> **&lt;text>** _Section1={{Chembox Identifiers [...] Section8={{Chembox Related_ **&lt;/text>**

_to:_
> chemical formula, property, value, units 
>
> [ (NaCl, Molar mass, 58.44, g mol^-1) , (NaCl, Density, 2.17, g cm^-3) , etc... ]

#### STEPS

    For each Chemical in XML File:
    * Store the chemical name
    * Identify the sections in the chemical text
    * Extract each section's text 
    * Extract and store the info from each section's text
    * For Section2, apply a special-case (hacky) step to extract and store the chemical formula
    * Clean! Clean! Clean!
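
The steps above can be sketched with two regular expressions: one that splits the Chembox text into numbered sections, and one that pulls `| key = value` pairs out of each section. This is a minimal sketch, not the `ChemBox` class used in this notebook. `SAMPLE`, `split_sections`, and `extract_fields` are hypothetical names, and the sketch assumes each section closes with `}}` on its own line.

```python
import re

# Hypothetical sample that mirrors the Chembox wikitext layout.
SAMPLE = """{{chembox
| Name = Sodium chloride
|Section1={{Chembox Identifiers
| CASNo = 7647-14-5
  }}
|Section2={{Chembox Properties
| Formula = NaCl
| MolarMass = 58.44 g/mol
| MeltingPt =
  }}
}}"""

# Assumption: a section starts with "|SectionN={{Chembox <Name>" and ends
# at the first "}}" that sits on its own line.
SECTION_RE = re.compile(
    r"\|\s*Section(\d+)\s*=\s*\{\{Chembox (\w+)(.*?)\n\s*\}\}", re.DOTALL
)
# One "| key = value" pair per line; the value may be empty.
FIELD_RE = re.compile(r"^\|\s*([\w-]+)\s*=[ \t]*(.*)$", re.MULTILINE)


def split_sections(text):
    """Map section number -> (section name, raw section body)."""
    return {int(n): (name, body) for n, name, body in SECTION_RE.findall(text)}


def extract_fields(body):
    """Return {property: raw value}, skipping properties with no value."""
    return {key: value.strip() for key, value in FIELD_RE.findall(body) if value.strip()}


sections = split_sections(SAMPLE)
for number, (name, body) in sorted(sections.items()):
    print(number, name, extract_fields(body))
```

Properties listed without a value (such as `MeltingPt` in the sample) are dropped here. That is a design choice of this sketch, not necessarily what `ChemBox` does.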

In [1]:
import re
import csv
import json
import pandas as pd

from process_chembox_xml import chemical_dict_to_csv, parse_xml_get_elements
from chembox import ChemBox

### Read in the XML File

> _Store (Chemical, DOM Element) in dictionary_
>
> _The Document Object Model (DOM): The DOM is a standard tree representation for XML data._
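
As a rough illustration of the (chemical, DOM element) mapping, the sketch below builds such a dictionary with the standard library's `xml.dom.minidom`. It is not the implementation of `parse_xml_get_elements`. The `<page>`/`<revision>` layout of `SAMPLE_XML` and the helper names `titles_to_text_elements` and `element_text` are assumptions made for the example.

```python
from xml.dom import minidom

# Hypothetical two-page dump; the real file's layout may differ.
SAMPLE_XML = """<mediawiki>
  <page>
    <title>Sodium chloride</title>
    <revision><text>{{chembox | Name = Sodium chloride }}</text></revision>
  </page>
  <page>
    <title>Perchloric acid</title>
    <revision><text>{{chembox | Name = Perchloric acid }}</text></revision>
  </page>
</mediawiki>"""


def titles_to_text_elements(xml_string):
    """Return {page title: <text> DOM element} for every <page>."""
    dom = minidom.parseString(xml_string)
    result = {}
    for page in dom.getElementsByTagName("page"):
        title_node = page.getElementsByTagName("title")[0]
        text_node = page.getElementsByTagName("text")[0]
        result[title_node.firstChild.data] = text_node
    return result


def element_text(element):
    """Concatenate the text nodes directly under a DOM element."""
    return "".join(
        child.data for child in element.childNodes
        if child.nodeType == child.TEXT_NODE
    )


elements = titles_to_text_elements(SAMPLE_XML)
print(list(elements))
print(element_text(elements["Sodium chloride"]))
```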

In [2]:
FILE_NAME_IN = 'data/wiki_chembox.xml'
FILE_NAME_OUT_CSV = 'data/chembox_data.csv'
FILE_NAME_OUT_JSON = 'data/chembox_data.json'

In [3]:
# Parse the file
chembox_dict = parse_xml_get_elements(FILE_NAME_IN)

print('Sample')
dict(list(chembox_dict.items())[:3])

Sample


{'Aluminium antimonide': <DOM Element: text at 0x11a576b90>,
 'Aluminium arsenate': <DOM Element: text at 0x11a57c690>,
 'Aluminium arsenide': <DOM Element: text at 0x11a580190>}

### Extract Chemical Properties (Iteratively)

In [4]:
# Iterate through the chemicals and replace each
# DOM element with its extracted data
for chemical, dom_element in chembox_dict.items():

    # Instantiate class
    chembox = ChemBox(chemical_dom_element=dom_element)

    # Store chemical data in dict
    chembox_dict[chemical] = chembox.get_chemical_data()

In [5]:
# Summarize data
null_data = []
for k, v in chembox_dict.items():
    if not v:
        null_data.append(k)

summary_str = '{:,} chemicals in this dataset\n{:,} have no available data'
print(summary_str.format(len(chembox_dict), len(null_data)))

674 chemicals in this dataset
117 have no available data


_**Sample from XML (Some properties are listed without values)**_

In [6]:
print(chembox.get_text_data_from_element_TagName()[:1_250])

{{chembox
| Verifiedfields = changed
| Watchedfields = changed
| verifiedrevid = 470637657
| Name = Zirconium(IV) tungstate
| ImageFile = ZrW2O8 opaque polyhedra.svg
| ImageName = Zirconium(IV) tungstate
| OtherNames = zirconium tungsten oxide
|Section1={{Chembox Identifiers
| CASNo_Ref = {{cascite|changed|??}}
| CASNo = 16853-74-0
  }}
|Section2={{Chembox Properties
| Formula = Zr(WO<sub>4</sub>)<sub>2</sub>
| MolarMass = 586.92 g/mol
| Appearance = white powder
| Density = 5.09 g/cm<sup>3</sup>, solid
| Solubility = negligible
| MeltingPt = 
| BoilingPt = 
  }}
|Section7={{Chembox Hazards
| ExternalSDS = [https://web.archive.org/web/20100201192127/http://www.espi-metals.com/msds's/zirconiumtungstate.pdf MSDS]
| EUClass = not listed
| NFPA-H = 2
| NFPA-R = 0
| NFPA-F = 0
  }}
}}

'''Zirconium tungstate''' ({{Zirconium}}({{Tungsten}}{{Oxygen|4}})<sub>2</sub>) is a metal [[oxide]] with unusual properties. The phase formed at ambient pressure by reaction of [[zirconia|ZrO<sub>2</sub>]] a

_**Taking the data and storing it in the dictionary (see below) makes it much easier to use**_

In [7]:
print(chemical, '\n', 15*'-')
chembox_dict[chemical]

Zirconium tungstate 
 ---------------


{'CASNo_Ref': {'Value': '{{cascite|changed|??}}'},
 'CASNo': {'Value': '16853-74-0'},
 'Formula': 'Zr(WO4)2',
 'MolarMass': {'Value': '586.92', 'Unit': 'g/mol'},
 'Appearance': {'Value': 'white powder'},
 'Density': {'Value': '5.09', 'Unit': 'g/cm3, solid'},
 'Solubility': {'Value': 'negligible'},
 'ExternalSDS': {'Value': "[https://web.archive.org/web/20100201192127/http://www.espi-metals.com/msds's/zirconiumtungstate.pdf MSDS]"},
 'EUClass': {'Value': 'not listed'},
 'NFPA-H': {'Value': '2'},
 'NFPA-R': {'Value': '0'},
 'NFPA-F': {'Value': '0'}}
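
To make the step from raw strings such as `5.09 g/cm<sup>3</sup>, solid` to `{'Value': ..., 'Unit': ...}` entries concrete, here is one way it could be done. This is a sketch under assumptions, not the logic inside `ChemBox`. The names `strip_markup` and `split_value_unit` are hypothetical, and the rule "a leading number followed by whitespace starts a value/unit pair" is a guess that happens to reproduce the entries shown above.

```python
import re

# Remove <sub>/<sup> tags but keep their contents.
TAG_RE = re.compile(r"</?(?:sub|sup)>")
# A leading number, then whitespace, then everything else as the unit.
# "\u2212" is the Unicode minus sign.
VALUE_UNIT_RE = re.compile(r"^([-+\u2212]?\d+(?:\.\d+)?)\s+(.+)$")


def strip_markup(raw):
    """Drop <sub>/<sup> tags, e.g. 'Zr(WO<sub>4</sub>)<sub>2</sub>' -> 'Zr(WO4)2'."""
    return TAG_RE.sub("", raw).strip()


def split_value_unit(raw):
    """Split '586.92 g/mol' into Value and Unit; otherwise keep the whole string as Value."""
    cleaned = strip_markup(raw)
    match = VALUE_UNIT_RE.match(cleaned)
    if match:
        return {"Value": match.group(1), "Unit": match.group(2)}
    return {"Value": cleaned}


for raw in ["586.92 g/mol", "5.09 g/cm<sup>3</sup>, solid", "white powder", "16853-74-0"]:
    print(raw, "->", split_value_unit(raw))
```

Under this rule a string such as `135.04&nbsp;g/mol` stays unsplit, because the literal `&nbsp;` entity is not whitespace.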

_**We can also store the data in a CSV file and load it into a DataFrame (see below)**_

In [8]:
# Save data to csv
chemical_dict_to_csv(
    chemDict = chembox_dict,
    filename = FILE_NAME_OUT_CSV,
)

# Save data as json
with open(FILE_NAME_OUT_JSON, 'w') as f:
    json.dump(chembox_dict, f, indent=4, sort_keys=True)
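
For readers without access to `chemical_dict_to_csv`, the sketch below shows one plausible way to flatten the nested dictionary into long-format rows with the columns `Name, Formula, Property, Value, Unit`. It is an assumption about the shape of the output, not the helper's actual code. `chemical_dict_to_rows`, `rows_to_csv_text`, and `SAMPLE_DICT` are hypothetical names.

```python
import csv
import io

COLUMNS = ["Name", "Formula", "Property", "Value", "Unit"]

# Hypothetical input in the same shape as chembox_dict.
SAMPLE_DICT = {
    "Zirconium tungstate": {
        "Formula": "Zr(WO4)2",
        "MolarMass": {"Value": "586.92", "Unit": "g/mol"},
        "Appearance": {"Value": "white powder"},
    },
    "Chemical without data": {},
}


def chemical_dict_to_rows(chem_dict):
    """Flatten {name: {property: {'Value', 'Unit'}}} into one row per property."""
    rows = []
    for name, props in chem_dict.items():
        if not props:
            continue  # chemicals with no extracted data produce no rows
        formula = props.get("Formula")
        for prop, entry in props.items():
            if prop == "Formula":
                continue  # the formula is stored as a column, not as a row
            rows.append({
                "Name": name,
                "Formula": formula,
                "Property": prop,
                "Value": entry.get("Value"),
                "Unit": entry.get("Unit"),
            })
    return rows


def rows_to_csv_text(rows):
    """Serialize the rows as CSV text; missing units become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = chemical_dict_to_rows(SAMPLE_DICT)
csv_text = rows_to_csv_text(rows)
print(csv_text)
```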

In [9]:
df = pd.read_csv('sample_chembox_data.csv')

# Sample Data
sample_props = [
    'BoilingPtC',
    'DeltaHf',
    'Density',
    'Entropy',
    'MeltingPtC',
    'MolarMass',
    'SMILES',
    'VaporPressure',
]

df[df['Property'].isin(sample_props)].sample(15)

| Unnamed: 0 | Name | Formula | Property | Value | Unit |
|---|---|---|---|---|---|
| 110 | Disulfur dichloride | S2Cl2 | MeltingPtC | −80 | |
| 25 | Sodium thiocyanate | NaSCN | MolarMass | 81.072 | g/mol |
| 70 | Perchloric acid | HClO4 | MeltingPtC | -17 | |
| 14 | Sodium thiocyanate | NaSCN | SMILES | `[Na+].[S-]C#N` | |
| 53 | Perchloric acid | HClO4 | SMILES | `OCl(=O)(=O)=O` | |
| 68 | Perchloric acid | HClO4 | Density | 1.768 | g/cm3 |
| 108 | Disulfur dichloride | S2Cl2 | MolarMass | `135.04&nbsp;g/mol` | |
| 28 | Sodium thiocyanate | NaSCN | MeltingPtC | 287 | |
| 111 | Disulfur dichloride | S2Cl2 | BoilingPtC | 137.1 | |
| 72 | Perchloric acid | HClO4 | BoilingPtC | 203 | |
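
Some values in the sample still need cleaning: `135.04&nbsp;g/mol` carries a literal HTML entity, and `−80` uses the Unicode minus sign (U+2212), which `float()` rejects. A possible clean-up pass, offered as a sketch rather than as part of the notebook's pipeline (`normalize_value` is a hypothetical helper), could look like this:

```python
import html


def normalize_value(raw):
    """Decode HTML entities and normalize spaces and minus signs."""
    text = html.unescape(raw)          # '&nbsp;' -> '\xa0'
    text = text.replace("\xa0", " ")   # non-breaking space -> plain space
    text = text.replace("\u2212", "-") # Unicode minus -> ASCII hyphen-minus
    return text.strip()


print(normalize_value("135.04&nbsp;g/mol"))
print(float(normalize_value("\u221280")))
```

After this pass, the molar mass string contains an ordinary space and can be split into a value and a unit, and the melting point parses as a number.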
