# Accessing PubChem Compound synonyms via FTP

In [1]:
import os
import re
import gzip
import mmap

DATA_PATH = os.path.join(os.getenv('CAMELID_HOME'), 'synonyms', 'data')

## Download the filtered CID-synonym file

Warning: the file is about 1 GB. You can also download `CID-Synonym-filtered.gz` from [the PubChem FTP directory](ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/) yourself, or **run from the command line:** `wget -b ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-Synonym-filtered.gz`

In [None]:
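from ftplib import FTP

def download_pubchem_synonyms(dest_path):
    """Download CID-Synonym-filtered.gz from the PubChem FTP site to dest_path."""
    # Context managers close the FTP connection and the output file on exit,
    # even if the transfer fails partway through.
    with FTP('ftp.ncbi.nlm.nih.gov') as ftp, open(dest_path, 'wb') as out:
        ftp.login()  # anonymous login
        ftp.cwd('pubchem/Compound/Extras')
        ftp.retrbinary('RETR CID-Synonym-filtered.gz', out.write)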

In [2]:
synonyms_gz = os.path.join(DATA_PATH, 'CID-Synonym-filtered.gz')
# Uncomment to download:
# download_pubchem_synonyms(synonyms_gz)

## Parse needed information out of the synonyms file

From the README:

    These are listings of all names associated with a CID. The
    unfiltered list are names aggregated from all SIDs whose 
    standardized form is that CID, sorted by weight with the "best"
    names first. The filtered list has some names removed that are
    considered inconsistent with the structure. Both are gzipped text
    files with CID, tab, and name on each line. Note that the
    names may be composed of more than one word, separated by spaces.


The uncompressed file exceeds my disk quota on the OCF server, so I would like to keep it compressed. Python can read compressed data with `gzip.open()`, but in its default binary mode the data come back as `bytes` rather than the `str` you get from a regular text-mode `open()`.
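As a minimal sketch of that difference, the snippet below writes a tiny gzipped sample (made-up data in the same CID, tab, name layout, not the real PubChem file) and reads it back in both modes. Passing `'rt'` with an `encoding` asks `gzip.open()` to decode on the fly; UTF-8 is an assumption about the real file's encoding.

```python
import gzip
import os
import tempfile

# Hypothetical two-line sample in the CID<tab>name layout from the README.
sample = "8028\tTHF\n8028\ttetrahydrofuran\n"

with tempfile.TemporaryDirectory() as tmp:
    sample_gz = os.path.join(tmp, 'sample.gz')
    with gzip.open(sample_gz, 'wt', encoding='utf-8') as f:
        f.write(sample)

    # Binary mode (the default): each line is a bytes object.
    with gzip.open(sample_gz, 'rb') as f:
        first_bytes = f.readline()

    # Text mode: lines are decoded to str while the file stays compressed on disk.
    with gzip.open(sample_gz, 'rt', encoding='utf-8') as f:
        first_str = f.readline()

print(type(first_bytes), type(first_str))
```

Text mode keeps the file compressed on disk, so it does not run into the disk quota.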

Additionally, we don't want to load the whole 1 GB file into RAM, so we want to either iterate through it or use some other method of accessing it. There are some interesting options in [this StackOverflow post](http://stackoverflow.com/questions/6219141/searching-for-a-string-in-a-large-text-file-profiling-various-methods-in-pytho).
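One way to iterate without loading everything is a generator over the decompressed lines. The sketch below is only an illustration: `iter_synonyms` is a hypothetical helper name, the sample data are made up, and UTF-8 is assumed as the encoding.

```python
import gzip
import os
import tempfile

def iter_synonyms(path, cid):
    """Yield each synonym listed for `cid`, reading one line at a time."""
    prefix = str(cid) + '\t'
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:  # only the current line is held in memory
            if line.startswith(prefix):
                yield line[len(prefix):].rstrip('\n')

# Demonstrate on a small made-up sample rather than the 1 GB file.
with tempfile.TemporaryDirectory() as tmp:
    sample_gz = os.path.join(tmp, 'sample.gz')
    with gzip.open(sample_gz, 'wt', encoding='utf-8') as f:
        f.write("8008\tHi\n8028\tTHF\n8028\ttetrahydrofuran\n8080\tdrums\n")
    found = list(iter_synonyms(sample_gz, 8028))

print(found)
```

Memory use stays constant, but every lookup is a sequential scan of the whole file, so this trades speed for a small footprint.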

### Strategy 1: Regex on compressed file

Can we make `re.findall()` work on a `bytes` object?

In [14]:
cid = '8028'  # THF

# Yes, this is silly with the regex notation AND the format notation...
search_rx = r'(?:{})\t(.*)'.format(cid).encode('utf-8')
p = re.compile(search_rx)

text = b"""
8008\tHi
8028\tTHF
8028\ttetrahydrofuran
8080\tdrums
"""

re.findall(p, text)

[b'THF', b'tetrahydrofuran']
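One caveat with the pattern above: it is not anchored to the start of a line, so a short CID can match the tail of a longer one. The sketch below reuses the same sample text and compares the unanchored pattern with an anchored variant; `synonyms_pattern` is a hypothetical helper name introduced here for illustration.

```python
import re

text = b"""
8008\tHi
8028\tTHF
8028\ttetrahydrofuran
8080\tdrums
"""

def synonyms_pattern(cid):
    """Compile a bytes regex matching only lines whose CID is exactly `cid`."""
    # re.escape guards against regex metacharacters; with re.MULTILINE,
    # ^ and $ match at the start and end of every line.
    return re.compile(b'^' + re.escape(cid.encode('utf-8')) + b'\t(.*)$',
                      re.MULTILINE)

# Unanchored: '80' matches the last two digits of CID 8080.
unanchored = re.findall(r'(?:{})\t(.*)'.format('80').encode('utf-8'), text)
# Anchored: there is no line whose CID is exactly 80.
anchored = synonyms_pattern('80').findall(text)
exact = synonyms_pattern('8028').findall(text)

print(unanchored, anchored, exact)
```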

In [24]:
cid = '80'

# Yes, this is silly with the regex notation AND the format notation...
search_rx = r'(?:{})\t(.*)'.format(cid).encode('utf-8')
p = re.compile(search_rx)

with gzip.open(synonyms_gz, 'rb') as file, \
    mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as s:
    for item in re.findall(p, s):
        print(item)

# Two problems:
# 1. The search does find matches, but the results are raw binary
#    that fails to .decode('utf-8').
# 2. If you set cid = '8028', you don't find anything at all.
# Both have the same cause: file.fileno() returns the descriptor of the
# underlying .gz file, so mmap maps the compressed bytes and the regex
# never sees the decompressed text.

b"\xf1T\xc8\xeb\xabm\x1e\xc9\x9c\xc0\x03\x96\x08,\xd2\x92d,c\xc4\x06K\xb0\xa2)\xb26\xdb\xc6,#\x13K\x08\xed\xe1L\x82CE\xbc4\x8a;\x07\xf1\xdd\xe6 \xe2\x1d\xe9$\xf8\x93\xc2\x1e}\xff\xf8\xf1q\xfdp\x9fL\x8d\xf0\xd7\xca*Y\x80L\xa1\xf7\xa4'\r[\xad/\x08I\xfe\x96'\x89\xf7\xc2\x97R\xb46km\xfe\\\xbc\xfc\xf3\x8e\x16\xf4\xeb*C$\xf9Fi\x9chp9\xb9\xb8\xec_]\x9c\xcf\xae\xafg;4U\x95NO\xf8\x82A\x07\x81\xb7\xba0\x02\xd6"
b'=\x9d\xa0H\x14"\xf5\x82Bd\xf8\xe9n\x97\xd9\x03\x8c\xe0S\xb6\x06D\xe5\xc9\x1b.\xf07Q\xd4\x05e\x19[\x8d\xba\xd0\x9b\xa4?c\x85\xd6\x17)4\xeb\x07\x9eS\x19\xa6\xcaZ\x1f4\x1e\x1e\x04\xf5\x03\x18\xc7eK\xa8f\xd5z\xe0j\xbf\xc2\x05q]\x1eR\x8bwL\xd3\xfa\xc6!\x9fR\x19\xb5}:\xa4\x98\x08F\xda\xbe\x98eZ-\x91\r0\xa1c\xc7\x99\x92\x10\x08J}\x8e\xa4\x1f0A\xde\xc4\x9e\x14\x89\xe5\xd3\xcf\xb2\x94\x89\x8f\xffe<(y\xb8>\xf4\x0cD\xc9\xa3\xf7\x10 \'\xa0\xd6F\x8b\xf0g\xce\xb6\xc9\xc4\xe0\xa9\xe0\xbe=\xf7Q]\xcf\x9f\x00*\xc2L\xc0\xb4\xcf\xbb\x83?\xf6\xc6\x05Hw\xa8\xdaR<I.\xe0\xd5\xba-\xcbu7Y\xae\xc1\x8a\x88\xe5\xdd
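The binary output above is what a regex sees when it searches compressed data. A small check, sketched here on a made-up sample file, is to look at the first two bytes of the mapping: if they are the gzip magic number `\x1f\x8b`, the mmap is exposing the compressed stream and not the text.

```python
import gzip
import mmap
import os
import tempfile

payload = b"8028\tTHF\n8028\ttetrahydrofuran\n"

with tempfile.TemporaryDirectory() as tmp:
    sample_gz = os.path.join(tmp, 'sample.gz')
    with gzip.open(sample_gz, 'wb') as f:
        f.write(payload)

    # Same construction as the cell above: mmap over the gzip file's descriptor.
    with gzip.open(sample_gz, 'rb') as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        magic = mm[:2]                         # first bytes the mapping exposes
        decompressed = gzip.decompress(mm[:])  # explicit decompression recovers the text

print(magic, decompressed == payload)
```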

### Strategy 2: TBD

In [4]:
# cids = ['8028', '88888888', '71609']
cid = '8028'

with gzip.open(synonyms_gz, 'rb') as file, \
    mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as data:
    result = data.find(cid)
    print(result)

TypeError: a bytes-like object is required, not 'str'
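The `TypeError` arises because `mmap.find()` expects a `bytes` argument and `cid` is a `str`. Encoding the CID fixes the call itself, but it would not rescue the cell above, because an mmap over a gzip file's descriptor still exposes compressed bytes. The sketch below therefore runs on a small uncompressed sample (made-up data) to show how `find()` behaves once it is given bytes.

```python
import mmap
import os
import tempfile

# Made-up, uncompressed sample in the CID<tab>name layout.
sample = b"8008\tHi\n8028\tTHF\n8028\ttetrahydrofuran\n8080\tdrums\n"
cid = '8028'

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.txt')
    with open(sample_path, 'wb') as f:
        f.write(sample)

    with open(sample_path, 'rb') as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
        # Search for bytes, not str. Note that this needle would also match
        # the tail of a longer CID such as 18028.
        needle = cid.encode('utf-8') + b'\t'
        offset = data.find(needle)          # byte offset of the first match
        missing = data.find(b'88888888\t')  # find() returns -1 when absent
        data.seek(offset)
        first_line = data.readline()

print(offset, missing, first_line)
```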