## Christian Hansen
## Scraping data from an unstructured XML dataset

In [1]:
import pandas as pd
import re
from bs4 import BeautifulSoup
wiki = "./Wikipedia-20161003174511.xml"
bs = BeautifulSoup(open(wiki,'r').read(),"xml")

I started by looking through the XML file and breaking it up into its tables, or chemboxes. This became easier with Beautiful Soup, plus simple regular expressions and string splitting on recurring patterns. I struggled with which patterns to split on, but it didn't take long to get a feel for how the tables are structured.
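As a rough illustration of the splitting idea, here is a minimal sketch that pulls `| key = value` lines out of a chembox-like string. The `SAMPLE` text and the `parse_fields` helper are hypothetical and much cleaner than the real dump, so this only shows the pattern, not the functions used further down.

```python
import re

# Hypothetical, simplified chembox text (real entries are far messier).
SAMPLE = """{{Chembox
| Name = Boron trifluoride
| Section2 = {{Chembox Properties
| Formula = BF<sub>3</sub>
| MolarMass = 67.82 g/mol
| Dipole = 0 D
}}
}}"""

def parse_fields(text):
    """Return {field: value} for every '| key = value' line in text."""
    fields = {}
    for line in text.split("\n"):
        # A field line starts with '|', then a word, then '=' and the value.
        match = re.match(r"^\|\s*(\w+)\s*=\s*(.*)$", line)
        if match:
            key, value = match.groups()
            fields[key] = value.strip()
    return fields

fields = parse_fields(SAMPLE)
print(fields["Name"], "|", fields["MolarMass"])
```

Lines that do not start with `|` (the opening and closing braces) are simply skipped, which is the same "keep only what looks like a field" idea the scraping functions rely on.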

The original prompt indicated that the output should follow "chemical formula, property, value, units". That is sensible, but it eventually seemed clearer to me to structure the data as one row per chemical, with one column per property. Many tables in the XML file had non-matching columns, which I handled when I concatenated the rows together as DataFrame objects. The main downside is that the value and the units are not separated, although this is easily fixed when a chemical property is pulled from the dataset. The column name is the feature name, or property, and each cell holds that property for one chemical, for example the molar mass. If the value exists in a row, the cell still contains both the value and the unit; splitting on spaces and slicing the resulting array gives the value and the unit of that property.
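To make the relationship between the two layouts concrete, the sketch below reshapes a tiny wide table (one row per chemical) into the "chemical formula, property, value, units" layout from the prompt. The toy values and the names `wide` and `tidy` are assumptions for illustration only; real cells are rarely this clean.

```python
import pandas as pd

# Toy wide table: one row per compound, one column per property.
wide = pd.DataFrame(
    {"MolarMass": ["67.82 g/mol", None], "Dipole": ["0 D", "1.47 D"]},
    index=["BF3", "NH3"],
)
wide.index.name = "formula"

# Stack the property columns into (formula, property, raw) rows and drop empty cells.
tidy = (
    wide.reset_index()
        .melt(id_vars="formula", var_name="property", value_name="raw")
        .dropna(subset=["raw"])
)

# Split each raw cell at the first space: number on the left, units on the right.
parts = tidy["raw"].str.strip().str.split(" ", n=1, expand=True)
tidy["value"] = parts[0]
tidy["units"] = parts[1]
tidy = tidy.drop(columns="raw").reset_index(drop=True)
print(tidy)
```

Going from wide to long is mechanical once the cells are clean; the hard part in this dataset is the cell contents, not the reshape.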

I made an attempt to separate the unit and value with a function. Regardless, this is not an easy scrape, and I'd say this approach could be implemented better from the get-go. It is a tradeoff between hyper-specifying how to process the data and being a little greedy by grabbing the number and units together.
As for the bonus, I ended up encapsulating the properties of each chemical in the data structure. I may not have kept it true to the intended use of the data structure, but I did what I thought was reasonable given how my data was structured. This way, given a single compound, you can have all its properties in that exact data type.

In [2]:
## Trying again, getting all found section splits
sectioned = bs.findAll(text=re.compile("Chembox\n"))

In [3]:
# Specific functions to scrape sections 2 and 4, both the properties and thermochemistry tables for a single chemical
def section_2(section):
    array_ret= []
    for s,v in enumerate(section):
        if 'Formula' in v:
            v = (v + re.sub('[=|1 ]','',section[0][s+1]))
        if v.count('=') ==1:
            row = re.sub('[\n|}}]','',v).split('=')
            array_ret.append(row)
    return(array_ret)

def section_four(section):
    array_four = []
    for v in section:
        if ' = ' in v:
            array_four.append(re.sub('[|}\n]','',v).split(' = '))
    return array_four

In [4]:
## this splits each chembox into its sections

def get_sections(sectioned,chem_numb):
    text = sectioned[chem_numb]
    # Prefer the 'Name = ' field and otherwise fall back to 'OtherNames = '.
    # If neither marker is present, split() leaves the text whole.
    marker = 'Name = ' if 'Name = ' in text else 'OtherNames = '
    return text.split(marker)[-1].split('Section')
    
# Separating the sectioned tables and scraping the name only
    
def section_name(sections_name):
    # The name is the first line of the first chunk; drop any trailing <ref> markup.
    first_line = sections_name[0].split('\n')[0]
    return first_line.split('<ref>')[0]


## Extracting both section two and four
def extract_section_two_four(sections_name):
    ## functional section scraping
    name = section_name(sections_name)
    name = ['name',name]
    ## separately get the name of the compound
    print(name[1])
    section_two_array = []
    section_four_array =[]
    for section in sections_name:
        ## take care of each section individually
        if 'Chembox Properties' in section:
#             print('Properties')
            section_two_array = section_2(section.split('\n| '))
        elif 'Chembox Thermochemistry' in section:
#             print('Thermo')
            
            section_four_array = section_four(section.split('\n| '))

    return(name,section_two_array,section_four_array)

## get the sections and put it into a dataframe row
def get_data_frame(sections_name):
    name, prop,thermo = extract_section_two_four(sections_name)
    column_name = [x[0] for x in (prop[1:]+thermo)] 
    values = [x[1] for x in (prop[1:]+thermo)]
    data_frame = pd.DataFrame(values,index = column_name).T
    data_frame.index = [name[1]]
    return(data_frame)

In [5]:
# concatenate the individual tables into one giant dataframe.
data_frame = pd.DataFrame()
for chem_numb in range(len(sectioned)):
    print(chem_numb)
    try:
        # some tables don't transform well and raise errors, so skip them with try/except
        sections_name = get_sections(sectioned,chem_numb)
        data_frame = pd.concat([data_frame,get_data_frame(sections_name)],axis = 0)
    except Exception:
        pass

## Write dataframe to csv and fix the issues with spaces in the column names.

data_frame.columns = [x.strip(' ') for x in data_frame.columns]
data_frame.to_csv('Chembox_Table.csv',encoding='utf-8')

0
Ball and stick model of dimeric aluminium bromide
1
Ball and stick model of aluminium iodide dimer
2
Aluminium Nitride powder
3
Aluminium(3+) trioxidanide
4
Aluminium nitrate
5
Ammonia
6
Ammonium azide
7
Ammonium chromate(IV)
8
{{Unreferenced|date=December 2009}}
9
Ammonium hydroxide
10
Ammonium perchlorate
11

12
<!-- Barium dioxoironbis(olate) (substitutive) OR Barium tetraoxidoferrate(2-) (additive) -->
13
Beryllium borohydride
14
Beryllium+hydroxide
15
Beryllium nitrate
16

17
beryllium+oxide
18
Beryllium sulfite
19

20
borane ''(substitutive)''<br />
21
Boric acid<br />Trihydrooxidoboron<!-- This second IUPAC name has not been validated -->
22
Elbor
23
Boron trifluoride
24
cadmium+selenide
25
Cadmium telluride
26
Caesium bicarbonate
27
Caesium hydride
28
Calcium cyanamide
29
Calcium perchlorate 
30
Carbon+dioxide
31
Carbon+monoxide
32
Tetraiodomethane
33
Chlorine+dioxide
34
Chlorosyl
35
chlorine+trifluoride
36
Dihydroxidodioxidochromium
37
Chromium trioxide
38
Chromium(2+) sulfa
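The loop above silently drops any chembox that fails to parse. A possible variation, sketched here with made-up records, collects the one-row frames in a list, concatenates once at the end, and keeps a record of which entries failed. The `build_row` helper and the `records` list are hypothetical stand-ins for `get_data_frame` and the scraped sections.

```python
import pandas as pd

def build_row(name, pairs):
    """Build a one-row DataFrame from (property, value) pairs."""
    return pd.DataFrame([dict(pairs)], index=[name])

# Made-up inputs: two usable entries and one that cannot be parsed.
records = [
    ("Boron trifluoride", [("Dipole", "0 D"), ("BoilingPtC", "-100.3")]),
    ("Broken entry", None),
    ("Aluminium nitride", [("ThermalConductivity", "285 W/(m·K)")]),
]

rows, failures = [], []
for name, pairs in records:
    try:
        rows.append(build_row(name, pairs))
    except TypeError as err:
        # Remember what failed instead of passing silently.
        failures.append((name, str(err)))

# Rows with different columns are aligned on the union of columns; gaps become NaN.
table = pd.concat(rows, axis=0, sort=False)
print(table)
print("skipped:", [name for name, _ in failures])
```

Concatenating once also avoids rebuilding the growing dataframe on every iteration, and the `failures` list makes it easier to see how many chemboxes were lost.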

In [6]:
data_frame.head()

Unnamed: 0,Appearance,Absorbance,Appearance.1,AtmosphericOHRateConstant,BandGap,BoilingPt,BoilingPtC,BoilingPtK,BoilingPt_notes,BoilingPt_ref,...,Solvent4,Solvent5,SpecRotation,SublimationConditions,ThermalConductivity,VaporPressure,Viscosity,pKa,pKb,"tetragonal<br/>tP36, P4<sub>1</sub>2<sub>1</sub>2, No. 92<ref>{{cite journal doi=10.1524/zkri.1959.112.1-6.409 title=The crystal structure of keatite, a new form of silica year=1959 last1=Shropshire first1=Joseph last2=Keat first2=Paul P. last3=Vaughan first3=Philip A. journal=Zeitschrift für Kristallographie volume=112 pages=409–13bibcode"
Ball and stick model of dimeric aluminium bromide,,,white to pale yellow<br /> crystalline solid,,,,265.0,,,,...,,,,,,,,,,
Ball and stick model of aluminium iodide dimer,,,white powder<br />but impure samples<br />are...,,,,360.0,,", sublimes",,...,,,,,,,,,,
Aluminium Nitride powder,,,white to pale-yellow solid,,,,2517.0,,decomposes,,...,,,,,285 W/(m·K),,,,,
Aluminium(3+) trioxidanide,,,White [[amorphous]] powder,,,,,,,,...,,,,,,,,>7,,
Aluminium nitrate,,,"White crystals, solid <br /> [[hygroscopic]]",,,,150.0,,(nonahydrate) decomposes,,...,,,,,,,,,,


In [7]:
#checking to see if it can be read in alright. Looking good
read_in_df = pd.read_csv('Chembox_Table.csv')
read_in_df.index = read_in_df['Unnamed: 0']
read_in_df.index.rename('Name',inplace = True)
read_in_df.drop('Unnamed: 0',inplace = True,axis =1)
read_in_df.drop('Appearance',inplace = True,axis =1)
read_in_df.head()

Unnamed: 0_level_0,Absorbance,Appearance.1,AtmosphericOHRateConstant,BandGap,BoilingPt,BoilingPtC,BoilingPtK,BoilingPt_notes,BoilingPt_ref,"Closely related to α-quartz (with an Si-O-Si angle of 155°) and optically active; β-quartz converts to β-tridymite at 1140 K[[File:b-quartz.png100px]]-[[tridymiteα-tridymite]][[orthorhombic]]<br/>oS24, C222<sub>1</sub>, No.20<ref name=trid>{{cite journal doi=10.1524/zkri.1986.177.1-2.27 title=Structural change of orthorhombic-Itridymite with temperature: A study based on second-order thermal-vibrational parameters year=1986 last1=Kihara first1=Kuniaki last2=Matsumoto first2=Takeo last3=Imamura first3=Moritaka journal=Zeitschrift für Kristallographie volume=177 pages=27–38bibcode",...,Solvent4,Solvent5,SpecRotation,SublimationConditions,ThermalConductivity,VaporPressure,Viscosity,pKa,pKb,"tetragonal<br/>tP36, P4<sub>1</sub>2<sub>1</sub>2, No. 92<ref>{{cite journal doi=10.1524/zkri.1959.112.1-6.409 title=The crystal structure of keatite, a new form of silica year=1959 last1=Shropshire first1=Joseph last2=Keat first2=Paul P. last3=Vaughan first3=Philip A. journal=Zeitschrift für Kristallographie volume=112 pages=409–13bibcode"
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Ball and stick model of dimeric aluminium bromide,,white to pale yellow<br /> crystalline solid,,,,265.0,,,,,...,,,,,,,,,,
Ball and stick model of aluminium iodide dimer,,white powder<br />but impure samples<br />are...,,,,360.0,,", sublimes",,,...,,,,,,,,,,
Aluminium Nitride powder,,white to pale-yellow solid,,,,2517.0,,decomposes,,,...,,,,,285 W/(m·K),,,,,
Aluminium(3+) trioxidanide,,White [[amorphous]] powder,,,,,,,,,...,,,,,,,,>7,,
Aluminium nitrate,,"White crystals, solid <br /> [[hygroscopic]]",,,,150.0,,(nonahydrate) decomposes,,,...,,,,,,,,,,


#### Checking to see if my function matches up with what I've created

In [8]:
section_2(sectioned[23].split('Name = ')[-1].split('Section')[2].split('\n|'))
## comparing my dataframe to my list 

[[u'2', u'{{Chembox Properties'],
 [u' Formula ', u' BF<sub>3</sub>{'],
 [u' MolarMass ',
  u' 67.82 g/mol (anhydrous) <br /> 103.837 g/mol (dihydrate)'],
 [u' Appearance ',
  u' colorless gas (anhydrous) <br /> colorless liquid (dihydrate)'],
 [u' Density ',
  u' 0.00276 g/cm<sup>3</sup> (anhydrous gas) <br /> 1.64 g/cm<sup>3</sup> (dihydrate)'],
 [u' SolubleOther ',
  u' soluble in [[benzene]], [[toluene]], [[hexane]], [[chloroform]] and [[methylene chloride]]'],
 [u' MeltingPtC ', u' \u2212126.8'],
 [u' BoilingPtC ', u' \u2212100.3'],
 [u' Dipole ', u' 0 D']]

In [9]:
sections_name = get_sections(sectioned,23)
get_data_frame(sections_name)

Boron trifluoride


Unnamed: 0,Formula,MolarMass,Appearance,Density,SolubleOther,MeltingPtC,BoilingPtC,Dipole,DeltaHf,DeltaGf,Entropy,HeatCapacity
Boron trifluoride,BF<sub>3</sub>{,67.82 g/mol (anhydrous) <br /> 103.837 g/mol ...,colorless gas (anhydrous) <br /> colorless li...,0.00276 g/cm<sup>3</sup> (anhydrous gas) <br ...,"soluble in [[benzene]], [[toluene]], [[hexane...",−126.8,−100.3,0 D,-1137 kJ/mol,-1120 kJ/mol,254.3 J/mol K,50.46 J/mol K


This is a quality/sanity check of the scrape and the reformatting of the data.

In [10]:
data_frame.head(5)

Unnamed: 0,Appearance,Absorbance,Appearance.1,AtmosphericOHRateConstant,BandGap,BoilingPt,BoilingPtC,BoilingPtK,BoilingPt_notes,BoilingPt_ref,...,Solvent4,Solvent5,SpecRotation,SublimationConditions,ThermalConductivity,VaporPressure,Viscosity,pKa,pKb,"tetragonal<br/>tP36, P4<sub>1</sub>2<sub>1</sub>2, No. 92<ref>{{cite journal doi=10.1524/zkri.1959.112.1-6.409 title=The crystal structure of keatite, a new form of silica year=1959 last1=Shropshire first1=Joseph last2=Keat first2=Paul P. last3=Vaughan first3=Philip A. journal=Zeitschrift für Kristallographie volume=112 pages=409–13bibcode"
Ball and stick model of dimeric aluminium bromide,,,white to pale yellow<br /> crystalline solid,,,,265.0,,,,...,,,,,,,,,,
Ball and stick model of aluminium iodide dimer,,,white powder<br />but impure samples<br />are...,,,,360.0,,", sublimes",,...,,,,,,,,,,
Aluminium Nitride powder,,,white to pale-yellow solid,,,,2517.0,,decomposes,,...,,,,,285 W/(m·K),,,,,
Aluminium(3+) trioxidanide,,,White [[amorphous]] powder,,,,,,,,...,,,,,,,,>7,,
Aluminium nitrate,,,"White crystals, solid <br /> [[hygroscopic]]",,,,150.0,,(nonahydrate) decomposes,,...,,,,,,,,,,


In [11]:
read_in_df.head(5)

Unnamed: 0_level_0,Absorbance,Appearance.1,AtmosphericOHRateConstant,BandGap,BoilingPt,BoilingPtC,BoilingPtK,BoilingPt_notes,BoilingPt_ref,"Closely related to α-quartz (with an Si-O-Si angle of 155°) and optically active; β-quartz converts to β-tridymite at 1140 K[[File:b-quartz.png100px]]-[[tridymiteα-tridymite]][[orthorhombic]]<br/>oS24, C222<sub>1</sub>, No.20<ref name=trid>{{cite journal doi=10.1524/zkri.1986.177.1-2.27 title=Structural change of orthorhombic-Itridymite with temperature: A study based on second-order thermal-vibrational parameters year=1986 last1=Kihara first1=Kuniaki last2=Matsumoto first2=Takeo last3=Imamura first3=Moritaka journal=Zeitschrift für Kristallographie volume=177 pages=27–38bibcode",...,Solvent4,Solvent5,SpecRotation,SublimationConditions,ThermalConductivity,VaporPressure,Viscosity,pKa,pKb,"tetragonal<br/>tP36, P4<sub>1</sub>2<sub>1</sub>2, No. 92<ref>{{cite journal doi=10.1524/zkri.1959.112.1-6.409 title=The crystal structure of keatite, a new form of silica year=1959 last1=Shropshire first1=Joseph last2=Keat first2=Paul P. last3=Vaughan first3=Philip A. journal=Zeitschrift für Kristallographie volume=112 pages=409–13bibcode"
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Ball and stick model of dimeric aluminium bromide,,white to pale yellow<br /> crystalline solid,,,,265.0,,,,,...,,,,,,,,,,
Ball and stick model of aluminium iodide dimer,,white powder<br />but impure samples<br />are...,,,,360.0,,", sublimes",,,...,,,,,,,,,,
Aluminium Nitride powder,,white to pale-yellow solid,,,,2517.0,,decomposes,,,...,,,,,285 W/(m·K),,,,,
Aluminium(3+) trioxidanide,,White [[amorphous]] powder,,,,,,,,,...,,,,,,,,>7,,
Aluminium nitrate,,"White crystals, solid <br /> [[hygroscopic]]",,,,150.0,,(nonahydrate) decomposes,,,...,,,,,,,,,,


While there are a lot of NaN values, a fair number of columns are actually populated, some more than others. It's hard to know in advance what will be grabbed, so this is a pretty greedy scrape. Eventually it would be best to clean up the columns and specify dtypes so that the dataframe can be used effectively for modeling or lookups.

## Based on which column you'd like to recall you can split the value and units

In [76]:
# columns =data_frame.columns[data_frame.columns!='Appearance']
columns = [ u'Entropy',u'ThermalConductivity', u'VaporPressure', u'Viscosity',
        u'pKa', u'pKb','DeltaGf', u'DeltaHc', u'DeltaHf', u'Density', u'Dipole',u'HenryConstant', u'IsoelectricPt', u'LambdaMax', u'LogP', u'MagSus',
       u'MeltingPt', u'MeltingPt', u'MeltingPtC', u'MeltingPtF', u'MeltingPtK',
       u'MeltingPt_notes', u'MeltingPt_ref', u'MolarMass', u'O', u'Odor',
       u'Odour', u'Properties_ref', u'RefractIndex', u'RefractIndex',
       u'Solubility', u'Solubility1', u'Solubility2', u'Solubility2',
       u'Solubility3', u'SolubilityProduct', u'SolubilityProductAs',
       u'SolubleOther', u'Solvent', u'Solvent1', u'Solvent2', u'Solvent3',
       u'Solvent4', u'Solvent5', u'SpecRotation', u'SublimationConditions',
       u'ThermalConductivity', u'VaporPressure', u'Viscosity', u'pKa', u'pKb',]

#getting column names

In [77]:
## helper function to split the value of each row
def split_val(x):
    if len(x)>1:
        return((x[0],x[1:]))
    else:
        return(x)

In [78]:
# splitting data to get value and units separately

def split_units(col,data_frame):
    return(data_frame[data_frame[col].notnull()][col].str.strip(' ').str.split(' ').apply(lambda x: split_val(x)))

In [79]:
col = 'BandGap'
split_units(col,read_in_df)

Name
beryllium+oxide                         (10.6, [eV])
cadmium+selenide                        (1.74, [eV])
Cadmium telluride    (1.5, [eV, (@300, K,, direct)])
gallium+arsenide         (1.424, [eV, (at, 300, K)])
Gallium nitride       (3.4, [eV, (300, K,, direct)])
NaN                                     (0.17, [eV])
Lead(II) iodide                          (2.3, [eV])
Name: BandGap, dtype: object

In [80]:
## looping through all the columns
for col in columns:
    print('Column: ' + col)
    print(split_units(col,read_in_df))
    print('-------------------------------------------------------------------')

Column: Entropy
Name
Aluminium Nitride powder                                          (20.2, [J/mol, K])
Ammonia                            (193&nbsp;J·mol<sup>−1</sup>·K<sup>−1</sup><re...
Ammonium chromate(IV)                                               (657, [J/K·mol])
Ammonium hydroxide                 (111&nbsp;J·mol<sup>−1</sup>·K<sup>−1</sup><re...
Beryllium+hydroxide                (47, [J·mol<sup>−1</sup>·K<sup>−1</sup><ref>{{...
beryllium+oxide                    (13.73–13.81, [J K<sup>−1</sup> mol<sup>−1</su...
borane ''(substitutive)''<br />     (187.88, [kJ, mol<sup>−1</sup>, K<sup>−1</sup>])
Elbor                                                            (14.77, [J/K, mol])
Boron trifluoride                                                (254.3, [J/mol, K])
Carbon+dioxide                          [214&nbsp;J·mol<sup>−1</sup>·K<sup>−1</sup>]
Carbon+monoxide                         (197.7, [J·mol<sup>−1</sup>·K<sup>−1</sup>])
Chlorine+dioxide                     (257.22

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

It looks like I'm catching a lot, but not consistently. There is high variability in how the value and units are organized in the original tables, and the loop above stops with an AttributeError as soon as it reaches a column that does not hold strings. This is where I'd like to spend more time, either generalizing the splitting or selecting only the columns that fit a criterion to split on.
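One way the splitting could be made more tolerant is sketched below: convert every cell to a string first (so numeric columns no longer break the `.str` accessor) and use a regular expression to peel a leading number off the rest. The names `split_value_units` and `split_units_safe` and the `demo` data are assumptions for illustration; ranges such as `13.73–13.81` and multi-value cells are not handled.

```python
import re
import pandas as pd

# Leading number, with optional sign (ASCII '-' or Unicode minus), then the rest.
NUMBER = re.compile(r"^\s*([-\u2212+]?\d+(?:\.\d+)?)\s*(.*)$")

def split_value_units(cell):
    """Return (value, units); value is None when the cell has no leading number."""
    text = str(cell).replace("&nbsp;", " ")
    match = NUMBER.match(text)
    if match is None:
        return (None, text.strip())
    number, rest = match.groups()
    return (float(number.replace("\u2212", "-")), rest.strip())

def split_units_safe(col, frame):
    """Apply split_value_units to every non-null cell of one column."""
    return frame[col].dropna().apply(split_value_units)

demo = pd.DataFrame({
    "Entropy": ["254.3 J/mol K", "193&nbsp;J/(mol K)", None],
    "BoilingPtC": [-100.3, 265.0, None],   # numeric column, no strings at all
})
print(split_units_safe("Entropy", demo).tolist())
print(split_units_safe("BoilingPtC", demo).tolist())
```

Because each cell is passed through `str()` before matching, the same function works on object and float64 columns alike.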

## Looking into the memory footprint of the dataframe

In [81]:
foot_print = data_frame.memory_usage(deep = True).sum()*1.0/1048576
print('Number of megabytes of the dataframe: %0.04f mb' % foot_print )

Number of megabytes of the dataframe: 0.3866 mb


In [82]:
foot_print = read_in_df.memory_usage(deep = True).sum()*1.0/1048576
print('Number of megabytes of the dataframe: %0.04f mb' % foot_print )

Number of megabytes of the dataframe: 0.3195 mb


It's interesting to note how much smaller the footprint of the read-in dataframe is compared to the one created in memory.

In [83]:
data_frame.dtypes.value_counts()

object    62
dtype: int64

In [84]:
read_in_df.dtypes.value_counts()

object     59
float64     2
dtype: int64

So there is the difference: I dropped a column after reading the CSV back in, and pandas automatically inferred float64 for two of the remaining columns.
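The dtype change can be reproduced on a toy table. The sketch below, using an in-memory buffer instead of a file and made-up values, writes an all-object dataframe to CSV and reads it back; `read_csv` infers float64 for the purely numeric column, which also takes less memory than the same numbers stored as Python strings.

```python
import io
import pandas as pd

# Everything is stored as Python objects (strings and None), as in the scraped frame.
original = pd.DataFrame(
    {"BoilingPtC": ["265.0", "360.0", None],
     "Appearance": ["white powder", None, "colorless gas"]},
    dtype=object,
)

# Round-trip through CSV in memory.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(original.dtypes)
print(reloaded.dtypes)

# Per-column memory, counting the contents of object cells.
before = original["BoilingPtC"].memory_usage(index=False, deep=True)
after = reloaded["BoilingPtC"].memory_usage(index=False, deep=True)
print(before, after)
```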

In [85]:
for i,x in enumerate(read_in_df['Formula']):
    x = str(x)
    read_in_df['Formula'][i]=(re.sub('[<sub/>}{} ]','', x))

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
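Two things about the cell above are worth a second look. The pattern `'[<sub/>}{} ]'` is a character class, so it removes every individual `s`, `u` and `b` rather than the `<sub>` tags as a whole, which is probably why formula keys such as `'GaA'` and `'CHCO3'` show up in the bonus section. The element-by-element assignment is also what triggers the warning. A minimal sketch of an alternative, with a hypothetical `clean_formula` helper and made-up raw values:

```python
import re
import pandas as pd

def clean_formula(raw):
    """Strip <sub> markup, braces and whitespace without touching letters."""
    text = str(raw)
    text = re.sub(r"</?sub>", "", text)   # remove the tags as whole tokens
    return re.sub(r"[{}\s]", "", text)     # then drop braces and whitespace

frame = pd.DataFrame({"Formula": [" BF<sub>3</sub>{", " GaAs", " CsHCO<sub>3</sub>"]})

# Assign the whole column at once instead of writing into frame['Formula'][i].
frame["Formula"] = frame["Formula"].apply(clean_formula)
print(frame["Formula"].tolist())

# For comparison, the character-class pattern also eats the letter 's'.
print(re.sub('[<sub/>}{} ]', '', ' GaAs'))
```

Assigning a full column in one statement sidesteps the chained-indexing pattern that the pandas warning is about.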


## Bonus 

Use the pypif module to create a PIF from your CSV output. 

In [86]:
# For each row, add its column
from pypif import pif
from pypif.obj import *
def get_dict_pif(df):
   
    # get column names except for formula
    columns = df.columns[df.columns != 'Formula']                 
    chem_array = {}

    ## making an array for each ChemicalSystem and an array of properties for each system.
    ## Essentially a matrix of dictionaries.

    for index, row in df.iterrows():
        chemical_system = ChemicalSystem()
        chemical_system.chemical_formula = str(row['Formula'])
        properties  = []
        for col in columns:
            value = Property()
            value.property_ = str(col)
            value.value = row[col]
            chemical_system.properties = value
            properties.append(chemical_system.properties)

        chem_array[str(row['Formula'])]= pif.dumps(properties)
    return(chem_array)

In [73]:
chem_dict_out = get_dict_pif(read_in_df)
chem_dict_out.keys()

['',
 'ChemN4S4',
 'HIO3',
 'ChemTlFp2\xe2\x80\xa2p',
 'SF6',
 'MgO',
 'GaN',
 'BeSO3',
 'LiClO3',
 'GaA',
 'ABr3',
 'AlBr3rAl2Br6',
 'BF3',
 'ChemSrTiO3',
 '(NH4)2CrO4',
 'Mn3(PO4)2',
 'RNO2',
 '(CN)2',
 'Al(OH)3',
 'HNO2',
 'NiF2',
 'NH4N3,NH3.HN3',
 'Al(NO3)3',
 'Ge2H6',
 'ChemSO2',
 'AlI3',
 'LiNO3',
 'chemPI2',
 'CrO2Cl2',
 'ChemSiH2Cl2',
 'Be(NO3)2',
 'Si3N4',
 'chemN2O',
 'CSO4(anhydro)rCSO4\xc2\xb75H2O(pentahydrate)',
 'Na2S2O3',
 'Be3N2',
 'S2F10',
 'AF3',
 'H3PO3',
 'H3PO2',
 'SiO2',
 'Ca(ClO4)2',
 'P3(PO4)2',
 'HCN',
 'La2(CO3)3',
 'IF5',
 'Li2S',
 'CoF3',
 'CrO2F2',
 'CrSO4\xc2\xb75H2O',
 'NH4ClO4',
 'NH4ClO3',
 'LiAlH4',
 'GaSp\xe2\x80\xa2p',
 'CaCN2',
 'NaCl',
 'BrCN',
 'ChemA4Cl8',
 'H3AO3',
 'NH4OH',
 'FeO',
 'chemNaHCO3',
 'H5IO6(orthoperiodic)BRHIO4(metaperiodic)',
 'ChemUH2O4',
 'Li2SO3',
 'KAlF4',
 'CHCO3',
 'ACl3r(exitaA2Cl6)',
 'KMnO4',
 'CNCl',
 'Ga2S3',
 'HN3',
 'Be(BH4)2',
 'SeO2F2',
 'Cl2O3',
 'Xep+p[PtF6]p\xe2\x88\x92p',
 'NaClO3',
 'OO4',
 'CO',
 'MnCl2',
 '

In [74]:
# Testing out the returned array for HBrO
chem_dict_out['HBrO']

'[{"property": "Absorbance", "value": NaN}, {"property": "Appearance.1", "value": " "}, {"property": "AtmosphericOHRateConstant", "value": NaN}, {"property": "BandGap", "value": NaN}, {"property": "BoilingPt", "value": NaN}, {"property": "BoilingPtC", "value": " 20-25"}, {"property": "BoilingPtK", "value": NaN}, {"property": "BoilingPt_notes", "value": NaN}, {"property": "BoilingPt_ref", "value": NaN}, {"property": "Closely related to \\u03b1-quartz (with an Si-O-Si angle of 155\\u00b0) and optically active; \\u03b2-quartz converts to \\u03b2-tridymite at 1140 K[[File:b-quartz.png100px]]-[[tridymite\\u03b1-tridymite]][[orthorhombic]]<br/>oS24, C222<sub>1</sub>, No.20<ref name=trid>{{cite journal doi=10.1524/zkri.1986.177.1-2.27 title=Structural change of orthorhombic-Itridymite with temperature: A study based on second-order thermal-vibrational parameters year=1986 last1=Kihara first1=Kuniaki last2=Matsumoto first2=Takeo last3=Imamura first3=Moritaka journal=Zeitschrift f\\u00fcr Krist

Each compound has multiple properties, and the key for looking up all of them is the chemical formula. This approach was pretty heavy-handed. I could keep refining it, but I believe that as a first pass this is a solid approach to scraping and reformatting the data.

A few next possible steps:

-- In the future, refine the code to be more readable, with a higher comment density.

-- Break up units and values more effectively from the start. That said, I created a function to recall that information from the dataframe on a column-by-column basis. It works with most columns, but some columns have values that are difficult to predict.

-- Master the pypif structures in depth. While I encapsulated each chemical compound within a dictionary, that is not the best way to structure it. It is effectively JSON formatting, but I wasn't comfortable with the final results above.
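As a starting point for the last item, here is a rough sketch of the shape a cleaner record might take: one object per compound with a nested list of its non-empty properties. It deliberately uses only the standard library `json` module rather than pypif, and the key names (`chemicalFormula`, `properties`, `name`, `value`) and the `row_to_record` helper are my own assumptions, not the pypif schema.

```python
import json
import math

def row_to_record(formula, row):
    """Nest the non-empty properties of one compound under its formula."""
    props = []
    for name, value in row.items():
        # Skip missing cells instead of serializing NaN.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        props.append({"name": name, "value": str(value).strip()})
    return {"chemicalFormula": formula, "properties": props}

# Made-up row with one empty property.
record = row_to_record(
    "BF3", {"Dipole": " 0 D", "BandGap": float("nan"), "MeltingPtC": -126.8}
)
print(json.dumps(record, indent=2))
```

Dropping the NaN entries keeps each record short, in contrast to the serialized output above where every compound carries every column.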