# Generating a list of chemical names and CAS Registry Numbers

Function:  This notebook generates a database table of Chemical Abstracts Service (CAS) Registry Numbers from substance names. 
Rationale:  The database table allows the CAS Registry Number to be used as an index for subsequent manipulations with 
solubility parameter data, enabling rapid and unambiguous operations on solvent sets.  Using a large data set now will make the system extensible for widespread use later. Having a custom notebook do this job ensures compatibility and
isolates the solubility parameter code from dependency on external sources.

### Dependencies / Requirements
The standard-library sqlite3 and xml.etree.ElementTree modules are used; pandas and NumPy need to be installed.  The notebook expects a source file named
'chemid_latest.xml' to be located in a folder called 'chemidplus' within a folder named 'Data_Sources' co-located with this notebook.  Special note:  due to the large size of the unzipped .xml file, it is not provided in the GitHub source.  Instead,
a zipped version is provided on GitHub.  When downloading source code from GitHub, unzip the file and place it in a folder
named chemidplus in Data_Sources.

* Note:  **Input file size is very large -- ~ 1 GB**
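
The unzip step described above can also be scripted. This is a minimal sketch, assuming the archive holds a single member named chemid_latest.xml; the helper name `extract_chemid` and the archive path passed to it are hypothetical and not part of this notebook.

```python
import zipfile
from pathlib import Path

def extract_chemid(zip_path, dest_dir='Data_Sources/chemidplus',
                   member='chemid_latest.xml'):
    """Extract `member` from `zip_path` into `dest_dir` and return its path.

    The member name is an assumption; adjust it if the archive differs.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)   # create Data_Sources/chemidplus if missing
    with zipfile.ZipFile(zip_path) as zf:
        if member not in zf.namelist():
            raise FileNotFoundError(f'{member} not found in {zip_path}')
        zf.extract(member, path=dest)
    return dest / member
```

Calling `extract_chemid('path/to/archive.zip')` with the downloaded archive should leave the XML file where the `infile` variable expects it.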

In [1]:
import datetime
import pandas as pd
import numpy as np
import xml.etree.ElementTree as ET
import sqlite3 
# Input file
infile = 'Data_Sources/chemidplus/chemid_latest.xml'

### Import and Set-Up

In [2]:
# Fetch the input file and convert it to an XML tree -- root.tag should return
# 'file' if this executes correctly -- this step takes some time to execute
# ET.parse accepts a file path directly, so no separate open() call is needed
tree = ET.parse(infile)
root = tree.getroot()
root.tag

'file'
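
Because the input is about 1 GB, building the full tree takes time and memory. As an alternative, the records can be streamed with `ET.iterparse`. The sketch below is not what this notebook does; it assumes each record is a direct child of the root, and the tag names `Chemical`, `NameList` and `NumberList` in the sample are hypothetical stand-ins for the real file structure.

```python
import io
import xml.etree.ElementTree as ET

def iter_records(source):
    """Yield each direct child of the root element, then release it."""
    depth = 0
    root = None
    for event, elem in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            if root is None:
                root = elem          # first element seen is the root
            depth += 1
        else:
            depth -= 1
            if depth == 1:           # a record (child of root) just closed
                yield elem
                root.clear()         # drop processed records from memory

# Tiny stand-in for the real file (tag names other than NameOfSubstance
# and CASRegistryNumber are hypothetical)
sample = io.BytesIO(
    b"<file>"
    b"<Chemical><NameList><NameOfSubstance>Butanenitrile</NameOfSubstance></NameList>"
    b"<NumberList><CASRegistryNumber>109-74-0</CASRegistryNumber></NumberList></Chemical>"
    b"<Chemical><NameList><NameOfSubstance>No number</NameOfSubstance></NameList>"
    b"<NumberList/></Chemical>"
    b"</file>"
)

pairs = []
for record in iter_records(sample):
    name = record.find('.//NameOfSubstance')
    cas = record.find('.//CASRegistryNumber')
    # Keep only records that have both a name and a CASRN
    if name is not None and cas is not None:
        pairs.append((name.text, cas.text))
print(pairs)
```

Only one record is held in memory at a time, at the cost of losing random access such as `root[i]`.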

In [3]:
# FYI, print the number of chemicals in the root -- this should be several hundred k
len(root)

420658

In [4]:
# To read in the data, we will first make lists
# Later, we'll put these in DataFrames to process quickly
# For now, just make lists of name, CASRN and CASRN, synonym
prim_name = []
casrn = []
casrn_syn = []
syn = []

for substance in range(len(root)):
    subst_name = root[substance][0].find('NameOfSubstance')
    subst_cas = root[substance][1].find('CASRegistryNumber')
    
    # Skip all list additions if there is not both a main name and CASRN
    if subst_name is not None and subst_cas is not None:
        prim_name.append(subst_name.text)
        casrn.append(subst_cas.text)
        
        # Check if there is a systematic name
        subst_syst = root[substance][0].find('SystematicName')
        
        # Add it to synonym list if there is
        if subst_syst is not None:
            casrn_syn.append(subst_cas.text)
            syn.append(subst_syst.text)
            
        # Check if there are synonyms
        subst_syn = root[substance][0].findall('Synonyms')
        
        # Add if there are any ...
        if len(subst_syn) != 0:
            for synonym in subst_syn:
                casrn_syn.append(subst_cas.text)
                syn.append(synonym.text)
        

In [5]:
# Now, convert lists to DataFrames for further processing
name_CAS_df = pd.DataFrame({'Subst_Name':prim_name,'CASRN':casrn})
synonyms_df = pd.DataFrame({'CASRN':casrn_syn,'Synonym':syn})

In [6]:
# Get some DataFrame info on length and unique values
print (f'name_CAS has {len(name_CAS_df)} rows with {name_CAS_df.Subst_Name.nunique()} unique names.')
print (f'name_CAS has {name_CAS_df.CASRN.nunique()} unique CAS Registry Numbers.')
print (f'Synonyms table has {len(synonyms_df)} rows with {synonyms_df.Synonym.nunique()} synonyms.')

name_CAS has 120692 rows with 119838 unique names.
name_CAS has 120692 unique CAS Registry Numbers.
Synonyms table has 706245 rows with 690191 synonyms.


### Data Processing

The output above should show no duplicate CAS Registry Numbers, some duplicated names, and quite a few duplicated synonyms.
The following cells remove the straightforward duplicates.

In [7]:
# Drop cases where name, CAS pairs are duplicates -- there should not be any to start with
# but just in case there are, this is a quick way to fix it
name_CAS_df.drop_duplicates(inplace = True)
# For the synonyms table, drop duplicated CASRN, synonym pairs -- this usually accounts for most of the duplicates
synonyms_df.drop_duplicates(inplace = True)

In [8]:
# Drop any rows of the Synonyms where the Synonym, CASRN pair is the same as a name, CASRN pair
synonyms_df = synonyms_df.merge(name_CAS_df, left_on = ['Synonym','CASRN'], right_on = ['Subst_Name','CASRN'], 
                                how='left', indicator=True).query('_merge == "left_only"').drop(columns='_merge')
# This operation produces an extra column in synonyms_df that needs deletion
synonyms_df = synonyms_df.drop(columns = 'Subst_Name')
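
The merge with `indicator=True` used in this cell is an anti-join: it keeps only the left-hand rows that have no match on the right. A small sketch on toy data (the rows below are illustrative, not taken from the source file) shows the effect.

```python
import pandas as pd

names = pd.DataFrame({'Subst_Name': ['Butanenitrile', 'Benzaldehyde'],
                      'CASRN': ['109-74-0', '100-52-7']})
synonyms = pd.DataFrame({'CASRN': ['109-74-0', '109-74-0', '100-52-7'],
                         'Synonym': ['Butanenitrile', 'Propyl cyanide',
                                     'Benzoic aldehyde']})

# Left merge; the _merge column records whether each row found a match
merged = synonyms.merge(names, left_on=['Synonym', 'CASRN'],
                        right_on=['Subst_Name', 'CASRN'],
                        how='left', indicator=True)

# Keep rows found only on the left, then drop the helper columns
kept = (merged[merged['_merge'] == 'left_only']
        .drop(columns=['_merge', 'Subst_Name'])
        .reset_index(drop=True))
print(kept)
```

The synonym that merely repeats the primary name for the same CASRN is removed; the other two rows survive.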

In [9]:
# Now check the results ...
print (f'name_CAS has {len(name_CAS_df)} rows with {name_CAS_df.Subst_Name.nunique()} unique names.')
print (f'name_CAS has {name_CAS_df.CASRN.nunique()} unique CAS Registry Numbers.')
print (f'Synonyms table has {len(synonyms_df)} rows with {synonyms_df.Synonym.nunique()} synonyms.')

name_CAS has 120692 rows with 119838 unique names.
name_CAS has 120692 unique CAS Registry Numbers.
Synonyms table has 643629 rows with 638199 synonyms.


To tackle ambiguous entries (one name attached to several CAS Registry Numbers), we will make use of the fact that
CAS Registry Numbers are assigned in order of entry, so the more "common" compound generally has the lower number.
We will compute a CAS rank from each number, sort on it, and keep only the duplicate with the lowest rank in the
primary name table.  The alternates go to a separate table for later use.
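
The ranking and de-duplication can be illustrated on a few rows. This is a minimal sketch; the helper `cas_rank` is hypothetical, and the name/number pairings are made-up toy data chosen only to create a duplicated name.

```python
import pandas as pd

def cas_rank(casrn):
    """Integer rank from a CASRN string, ignoring the trailing check digit."""
    first, second, _check = casrn.split('-')
    # The second segment always has two digits, hence the factor of 100
    return int(first) * 100 + int(second)

# Toy rows: 'Example A' appears twice with different numbers
df = pd.DataFrame({'Subst_Name': ['Example A', 'Example A', 'Example B'],
                   'CASRN': ['1330-20-7', '95-47-6', '71-43-2']})

df['CASRank'] = df['CASRN'].map(cas_rank)
df = df.sort_values('CASRank')

# After sorting, the first occurrence of each name has the lowest rank
alternates = df[df.duplicated('Subst_Name')]
primary = df.drop_duplicates('Subst_Name')
print(primary[['Subst_Name', 'CASRN']])
print(alternates[['Subst_Name', 'CASRN']])
```

Sorting before calling `duplicated` is what makes "first occurrence" mean "lowest CAS rank".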

In [10]:
# Parse the CAS Registry Number -- the last digit is a checksum and can be dropped
# The second segment always has two digits, so rank = first segment * 100 + second segment
cas_parts = name_CAS_df.CASRN.str.split('-')
name_CAS_df['CASRank'] = cas_parts.str[0].astype('int') * 100 + cas_parts.str[1].astype('int')
# Sort by this rank -- the source data is typically sorted already
name_CAS_df = name_CAS_df.sort_values(by='CASRank')
# Now create duplicate flags
name_CAS_df['Dup_Name_Flag'] = name_CAS_df.duplicated('Subst_Name',keep = False)
# Drop the rank column as it is no longer needed
name_CAS_df = name_CAS_df.drop(columns='CASRank')
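
The checksum digit mentioned in this cell can also be used to screen out malformed numbers. The notebook itself does not do this; the sketch below assumes the standard CAS check-digit rule (digits before the check digit are weighted 1, 2, 3, ... from right to left, and the sum modulo 10 must equal the check digit). The function name `is_valid_casrn` is hypothetical.

```python
import re

def is_valid_casrn(casrn):
    """Return True if `casrn` has the CAS format and a correct check digit."""
    match = re.fullmatch(r'(\d{2,7})-(\d{2})-(\d)', casrn)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one
    total = sum(i * int(d) for i, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))

print(is_valid_casrn('109-74-0'))   # Butanenitrile, as used later in this notebook
print(is_valid_casrn('109-74-1'))   # same number with a wrong check digit
```

For 109-74-0 the weighted sum is 4*1 + 7*2 + 9*3 + 0*4 + 1*5 = 50, and 50 mod 10 = 0 matches the final digit.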

In [11]:
# First grab the duplicates 
alt_primary_name_df = name_CAS_df[name_CAS_df.duplicated('Subst_Name')]
# Now they can be safely dropped from the main table
name_CAS_df = name_CAS_df.drop_duplicates('Subst_Name')

In [12]:
# Now check the results ...
print (f'name_CAS has {len(name_CAS_df)} rows with {name_CAS_df.Subst_Name.nunique()} unique names.')
print (f'name_CAS has {name_CAS_df.CASRN.nunique()} unique CAS Registry Numbers.')
print (f'Synonyms table has {len(synonyms_df)} rows with {synonyms_df.Synonym.nunique()} synonyms.')

name_CAS has 119838 rows with 119838 unique names.
name_CAS has 119838 unique CAS Registry Numbers.
Synonyms table has 643629 rows with 638199 synonyms.


In [13]:
# Do for the synonyms table what was done for the name_CASRN table
# Parse the CAS Registry Number -- rank = first segment * 100 + second segment
cas_parts = synonyms_df.CASRN.str.split('-')
synonyms_df['CASRank'] = cas_parts.str[0].astype('int') * 100 + cas_parts.str[1].astype('int')
# Sort by this rank -- the source data is typically sorted already
synonyms_df = synonyms_df.sort_values(by='CASRank')
# Now create duplicate flags
synonyms_df['Dup_Name_Flag'] = synonyms_df.duplicated('Synonym',keep = False)
# Drop the rank column as it is no longer needed
synonyms_df = synonyms_df.drop(columns='CASRank')

# First grab the duplicates 
alt_synonyms_df = synonyms_df[synonyms_df.duplicated('Synonym')]
# Now they can be safely dropped from the main table
synonyms_df = synonyms_df.drop_duplicates('Synonym')

# Now check the results ...
print (f'name_CAS has {len(name_CAS_df)} rows with {name_CAS_df.Subst_Name.nunique()} unique names.')
print (f'name_CAS has {name_CAS_df.CASRN.nunique()} unique CAS Registry Numbers.')
print (f'Synonyms table has {len(synonyms_df)} rows with {synonyms_df.Synonym.nunique()} synonyms.')

name_CAS has 119838 rows with 119838 unique names.
name_CAS has 119838 unique CAS Registry Numbers.
Synonyms table has 638199 rows with 638199 synonyms.
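
The cleaned DataFrames can then be stored as database tables with sqlite3. The following is a sketch under stated assumptions: the table name `name_cas`, the index name, and the in-memory database are hypothetical choices, and a one-row DataFrame stands in for `name_CAS_df`.

```python
import sqlite3
import pandas as pd

# Stand-in for name_CAS_df after processing
names = pd.DataFrame({'Subst_Name': ['Butanenitrile'],
                      'CASRN': ['109-74-0'],
                      'Dup_Name_Flag': [False]})

conn = sqlite3.connect(':memory:')   # pass a file path instead to persist the data
names.to_sql('name_cas', conn, index=False, if_exists='replace')

# CASRN is unique at this point, so it can serve as an indexed key
conn.execute('CREATE UNIQUE INDEX idx_name_cas_casrn ON name_cas (CASRN)')

row = conn.execute('SELECT Subst_Name FROM name_cas WHERE CASRN = ?',
                   ('109-74-0',)).fetchone()
print(row[0])
conn.close()
```

The unique index enforces the one-row-per-CASRN property and keeps lookups by CASRN fast.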


### Data Validation
For validation, the CAS Registry Numbers are checked against names and synonyms from a known solvent set.  The default set
is a list of 100 of the most commonly used solvents for Hansen Solubility Parameter determination, based on Hansen's key experimental work.  Custom validations can be added by placing the name and CAS Registry Number into the 'validation.csv' file.
These hand-curated validations come from Sigma-Aldrich catalogs, Wikipedia, PubMed, and other on-line sources.  In all cases,
at least two sources were checked to validate the CAS Registry Number in the table.
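
A validation pass of this kind can be sketched as follows. The column names `Name` and `CASRN` for 'validation.csv' are assumptions, as is the helper `find_mismatches`; an in-memory CSV and toy tables stand in for the real data.

```python
import io
import pandas as pd

def find_mismatches(validation_df, name_cas_df, synonyms_df):
    """Return validation rows whose (name, CASRN) pair is in neither table."""
    known = set(zip(name_cas_df['Subst_Name'], name_cas_df['CASRN']))
    known |= set(zip(synonyms_df['Synonym'], synonyms_df['CASRN']))
    pairs = zip(validation_df['Name'], validation_df['CASRN'])
    mask = [pair not in known for pair in pairs]
    return validation_df[mask]

# Toy stand-ins for the processed tables
name_cas = pd.DataFrame({'Subst_Name': ['Butanenitrile'], 'CASRN': ['109-74-0']})
synonyms = pd.DataFrame({'CASRN': ['109-74-0'], 'Synonym': ['Propyl cyanide']})

# In the notebook this would be pd.read_csv('validation.csv')
validation = pd.read_csv(io.StringIO(
    'Name,CASRN\n'
    'Butanenitrile,109-74-0\n'
    'Propyl cyanide,109-74-0\n'
    'Butanenitrile,109-74-1\n'
))

mismatches = find_mismatches(validation, name_cas, synonyms)
print(mismatches)
```

The comparison is an exact, case-sensitive string match, so names in the validation file would need the same spelling as the source data.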

In [20]:
name_CAS_df.loc[name_CAS_df.CASRN == '109-74-0']

| | Subst_Name | CASRN | Dup_Name_Flag |
|---|---|---|---|
| 3155 | Butanenitrile | 109-74-0 | False |


In [21]:
synonyms_df.loc[synonyms_df.CASRN == '109-74-0']

| | CASRN | Synonym | Dup_Name_Flag |
|---|---|---|---|
| 70262 | 109-74-0 | HSDB 5013 | False |
| 70269 | 109-74-0 | EC 203-700-6 | False |
| 70268 | 109-74-0 | UNII-O3V36V0W0M | False |
| 70267 | 109-74-0 | Propylkyanid | False |
| 70266 | 109-74-0 | Propylkyanid [Czech] | False |
| 70265 | 109-74-0 | n-Propyl cyanide | False |
| 70264 | 109-74-0 | Propyl cyanide | False |
| 70263 | 109-74-0 | NSC 8412 | False |
| 70261 | 109-74-0 | EINECS 203-700-6 | False |
| 70254 | 109-74-0 | Butane nitrile | False |


In [16]:
alt_primary_name_df[alt_primary_name_df.CASRN == '100-52-7'].Subst_Name

Series([], Name: Subst_Name, dtype: object)

###  Notes

Source:  The National Library of Medicine provides the data source ChemIDPlus, which contains basic chemical ID
information and is updated monthly.  It currently lists over 400,000 compounds.  Although its main use 
is for regulatory purposes, it is likely to provide an exhaustive list of solvents.  If making use of this data set, please
acknowledge the source and provide the link below to facilitate using the latest version of the data.  

The current version of the data is from 03-28-2019.  This script was last executed on:

In [17]:
print(datetime.datetime.now())

2019-04-22 16:55:46.971405


The URL for the latest ChemIDPlus download is:  ftp://ftp.nlm.nih.gov/nlmdata/.chemidlease/

CASRN and CAS Registry Number are registered trademarks of the Chemical Abstracts Service.  Organizations that use this data
will need a license from the Chemical Abstracts Service.  The current notebook is for educational purposes only.