# Crystallization solvent extraction with NLP


Installing required packages.

In [None]:
!pip install chemdataextractor # for NLP
!pip install fuzzywuzzy        # for SI download
!pip install selenium          # for SI download
!cde data download            # download the data for chemdataextractor

Collecting chemdataextractor
  Downloading ChemDataExtractor-1.3.0-py3-none-any.whl (182 kB)

INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/cem_crf-1.0.pickle to /root/.local/share/ChemDataExtractor/models/cem_crf-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/cem_crf_chemdner_cemp-1.0.pickle to /root/.local/share/ChemDataExtractor/models/cem_crf_chemdner_cemp-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/cem_dict_cs-1.0.pickle to /root/.local/share/ChemDataExtractor/models/cem_dict_cs-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/cem_dict-1.0.pickle to /root/.local/share/ChemDataExtractor/models/cem_dict-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/clusters_chem1500-1.0.pickle to /root/.local/share/ChemDataExtractor/models/clusters_chem1500-1.0.pickle
INFO:chemdataextractor.data:Downloading http://data.chemdataextractor.org/models/pos_ap_genia_nocluster-1.0.pick

Collecting selenium
  Downloading selenium-4.1.3-py3-none-any.whl (968 kB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.9.2-py3-none-any.whl (16 kB)
Collecting trio~=0.17
  Downloading trio-0.20.0-py3-none-any.whl (359 kB)
Collecting urllib3[secure,socks]~=1.26
  Downloading urllib3-1.26.9-py2.py3-none-any.whl (138 kB)
Collecting outcome
  Downloading outcome-1.1.0-py2.py3-none-any.whl (9.7 kB)
Collecting sniffio
  Downloading sniffio-1.2.0-py3-none-any.whl (10 kB)
Collecting async-generator>=1.9
  Downloading async_generator-1.10-py3-none-any.whl (18 kB)
Collecting wsproto>=0.14
  Downloading wsproto-1.1.0-py3-none-any.whl (24 kB)
Collecting pyOpenSSL>=0.14
  Downloading pyOpenSSL-22.0.0-py2.py3-none-any.whl (55 kB)

Getting the crystallization data extraction code from our GitHub repository

In [None]:
!git clone https://github.com/caer200/solvent_nlp

Cloning into 'solvent_nlp'...
remote: Enumerating objects: 25, done.
remote: Counting objects: 100% (25/25), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 25 (delta 8), reused 15 (delta 4), pack-reused 0
Unpacking objects: 100% (25/25), done.


We first import the required code

In [None]:
from solvent_nlp.webscraping import *        # this code is for webscraping the PDF SI
from solvent_nlp.solvent_parser import *     # this code parses the solvent from text

The function below returns, from the parsed records, the compounds for which a
crystallization solvent was found, together with their names and solvents.

In [None]:
def get_compounds_with_solvent(records):
    data = []
    for compound in records:
        if compound.get("crystallization"):
            data.append(compound)
    return data
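
As a quick check, the filter can be tried on a few hand-written records. The records below are made up for illustration and only mimic the shape of the serialized output (`names`, `labels`, and `crystallization` keys); the filter is repeated in compact form so that the snippet runs on its own.

```python
# Hypothetical records that mimic the shape of the serialized NLP output.
toy_records = [
    {"names": ["compound A"], "labels": ["1a"],
     "crystallization": [{"solvent": "DCM and methanol"}]},
    {"names": ["compound B"], "labels": ["1b"]},       # no crystallization key
    {"names": ["compound C"], "crystallization": []},  # empty list is falsy
]


def get_compounds_with_solvent(records):
    """Keep only the records with a non-empty 'crystallization' entry."""
    return [compound for compound in records if compound.get("crystallization")]


selected = get_compounds_with_solvent(toy_records)
print([compound["names"][0] for compound in selected])
```

Because `dict.get` returns `None` for a missing key and an empty list is falsy, records without a solvent and records with an empty `crystallization` list are both dropped.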

Let us fetch an SI using the webscraping code

In [None]:
saveloc = "."  # location to save the PDF
doi = "10.1039/C4OB01574F"   

sc = SIScraper(doi=doi, saveloc=saveloc)
sc.retrieveSI()





We use chemdataextractor (https://chemdataextractor.org) to do the processing, followed
by tokenization. The rule-based parsing of the crystallization solvent is described in solvent_parser.py.
We use the CompoundParser that is built into chemdataextractor and the SolventParser that we developed to extract
the crystallization solvent.
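
To give a feel for what a rule-based extraction of this kind does, here is a deliberately simplified sketch that uses a plain regular expression. It is not the code in solvent_parser.py, which builds on chemdataextractor's parsing machinery; the pattern, the function name `find_crystallization_solvents`, and the example sentence are all assumptions made for illustration.

```python
import re

# Illustrative rule: "(re)crystallized/crystallization from <solvent>",
# where the solvent phrase runs up to the next '.', ';' or ','.
PATTERN = re.compile(
    r"(?:re)?crystalli[sz](?:ed|ation)\s+from\s+(?P<solvent>[^.;,]+)",
    re.IGNORECASE,
)


def find_crystallization_solvents(text):
    """Return the solvent phrases matched by the illustrative rule."""
    return [match.group("solvent").strip() for match in PATTERN.finditer(text)]


# Made-up sentence in the style of an experimental section.
sentence = "The product was recrystallized from DCM and methanol. Yield: 85%."
print(find_crystallization_solvents(sentence))
```

A single pattern like this misses most real phrasings, which is one reason to work on tokenized text with a dedicated parser instead.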

In [None]:
doc = Document.from_file(sc.filename)  # loading the PDF
records = doc.records.serialize()  # fetching the results from NLP as a python dict
get_compounds_with_solvent(records)

[{'crystallization': [{'solvent': 'DCM / PE'}], 'names': ['Li']},
 {'crystallization': [{'solvent': 'DCM / PE'}], 'names': ['S1-5']},
 {'crystallization': [{'solvent': 'DCM and methanol'}],
  'labels': ['3i'],
  'names': ['(S)-(L)-menthyl [(R)-1-(4-methoxylphenyl)-1-hydroxyethyl] phenylphosphinate']},
 {'crystallization': [{'solvent': 'DCM and methanol'}],
  'labels': ['3j'],
  'names': ['(S)-(L)-menthyl [(R)-1-biphenyl-1-hydroxyethyl] phenylphosphinate']},
 {'crystallization': [{'solvent': 'DCM and methanol'}],
  'labels': ['3u'],
  'names': ['(S)-(L)-menthyl [ (R)-(2-hydroxy-4-methyl pentan-2-yl ) ] phenylphosphinate']},
 {'crystallization': [{'solvent': 'DCM'}],
  'labels': ['3db'],
  'names': ['phenylphosphinate',
   'Cyclohexyl [1-hydroxy-1-(4-bromophenyl)ethyl]phenylphosphinate']},
 {'crystallization': [{'solvent': 'DCM and methanol'}],
  'labels': ['3eb'],
  'names': ['tert-Butyl [ 1-hydroxy-1-(4-bromophenyl) ethyl]phenylphosphinate',
   'tert-Butyl [1-hydroxy-1-(4-bromophenyl)e