### Properties of inorganic chemical precursors

This Jupyter notebook cleans solubility data for inorganic compounds imported from the CRC Handbook of Chemistry and Physics, 85th edition. It then uses information about the periodic table of the elements to attempt to learn general rules about chemical solubility.

In [57]:
import pandas as pd

In [58]:
colNames = ['no', 'name', 'chemical_formula', 'CAS_no', 'mol_weight', 'physical_form', 'melting_point', 'boiling point', 
            'density', 'solubility_per_100gH2O', 'qualitative_solubility', 'blanks']
df = pd.read_csv('tabula-Inorganic_solubility_with_lines.tsv', sep='\t', names=colNames)
df = df.drop(labels='blanks', axis=1)
df.shape

(2740, 11)

In [59]:
# Drop rows where every column is empty (thresh=1 keeps rows with at least one value)
df = df.dropna(axis=0, thresh=1)
df.shape

(2711, 11)

In [60]:
# Get rid of the header line that sometimes accompanies a new page
df = df[df.no != 'No.'] 
print(df.shape)
print(df.iloc[:,:2].tail(5))

(2681, 11)
        no                           name
2735  2677              Zirconium nitride
2736  2678            Zirconium phosphide
2737  2679             Zirconium silicide
2738  2680              Zirconyl chloride
2739  2681  Zirconyl chloride octahydrate


More cleaning is required: some of the reference superscripts in the solubility column have been read in as part of the number itself. Let's write a function that strips the last two characters from the value and checks whether what remains is zero. If it is zero, we'll assume the value had no superscript and keep the original; if it is nonzero, we'll use the stripped number.

In [61]:
def clean_solubility_references(sol_str):
    '''Strip a two-character reference superscript from a solubility value.

    Drops the last two characters of sol_str and converts the rest to a float.
    A nonzero result is taken to be the true value; a zero result is taken to
    mean there was no superscript, so the original value is returned instead.

    sol_str - value to be cleaned (string or number)
    returns float
    '''
    text = str(sol_str)
    # Too short to carry a two-character superscript: convert as-is
    if len(text) < 3:
        return float(text)
    stripped_number = float(text[:-2])
    if stripped_number != 0:
        return stripped_number
    return float(text)

# Leave missing values untouched; clean everything else
df.solubility_per_100gH2O = df.solubility_per_100gH2O.apply(
    lambda x: clean_solubility_references(x) if pd.notnull(x) else x)
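
To see what this heuristic does in practice, here is a small standalone sketch. The helper name `strip_trailing_pair` is hypothetical and the sample strings are made up for illustration; they are not rows from the CRC table.

```python
def strip_trailing_pair(value):
    """Hypothetical stand-in for the cleaning heuristic described above."""
    text = str(value)
    if len(text) < 3:
        # Too short to hold a two-character superscript
        return float(text)
    stripped = float(text[:-2])
    # A zero remainder is read as "no superscript was attached"
    return stripped if stripped != 0 else float(text)

# Made-up inputs, not values taken from the handbook
samples = ["45.125", "0.5025", "0.5", "7", "45.1"]
results = {s: strip_trailing_pair(s) for s in samples}
for raw, cleaned in results.items():
    print(f"{raw!r} -> {cleaned}")
```

The last sample is the case to watch: a value such as `"45.1"` that carries no superscript but has a nonzero integer part is truncated to `45.0`. The heuristic therefore seems safe only if nearly every value in the column really does end in a two-character superscript.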

Now we need to clean up the compound names to make sure they contain only ASCII characters.
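
One possible way to do this, sketched with the standard library only. The helper `to_ascii` is a hypothetical name and this is not necessarily the approach the notebook ends up using. It decomposes each character with Unicode NFKD normalization and then drops whatever still has no ASCII equivalent.

```python
import unicodedata

def to_ascii(text):
    """Hypothetical helper: return text with non-ASCII characters folded or dropped."""
    # NFKD splits accented letters into base letter + combining mark and
    # maps compatibility characters (subscripts, no-break spaces) to plain forms
    decomposed = unicodedata.normalize("NFKD", text)
    # Anything that still has no ASCII form is silently discarded
    return decomposed.encode("ascii", errors="ignore").decode("ascii")

# Made-up examples, not names taken from the table
print(to_ascii("s\u00e9l\u00e9nide"))          # accented letters
print(to_ascii("H\u2082O"))                    # subscript two
print(to_ascii("Zirconium\u00a0nitride"))      # no-break space
```

Assuming the names are all strings, this could be applied with something like `df['name'].apply(to_ascii)`. Note that characters with no decomposition, such as Greek letters or en dashes, are removed outright, so it is worth checking which names change before overwriting the column.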

In [71]:
df.loc[df.solubility_per_100gH2O.notnull(), ['name', 'solubility_per_100gH2O']].head(10)

                                name  solubility_per_100gH2O
20                 Aluminum chloride                    45.1
21     Aluminum chloride hexahydrate                    45.1
24                 Aluminum fluoride                     0.5
25     Aluminum fluoride monohydrate                     0.5
26      Aluminum fluoride trihydrate                     0.5
35                  Aluminum nitrate                    68.9
36      Aluminum nitrate nonahydrate                    68.9
44  Aluminum perchlorate nonahydrate                   182.0
51                  Aluminum sulfate                    38.5
52  Aluminum sulfate octadecahydrate                    38.5
