# Accessing PubChem Compound synonyms via FTP

In [1]:
import os
import re
import gzip
import mmap

## Download and uncompress the filtered CID-synonym file

**Warning: Compressed file is ~1 GB. Uncompressed file > 4 GB.**

- Download `CID-Synonym-filtered.gz` from [here](ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/) yourself, or run from the command line: `wget -b ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-Synonym-filtered.gz`

- To unzip: `gzip -d CID-Synonym-filtered.gz` (use `-dk` to keep the `.gz` file).

In [31]:
# This is where I am keeping the file on my machine.
DATA_PATH = os.path.join(os.getenv('CAMELID_HOME'), 'synonyms', 'data')
synonyms_gz = os.path.join(DATA_PATH, 'CID-Synonym-filtered.gz')
synonyms = os.path.join(DATA_PATH, 'CID-Synonym-filtered')

## Parse needed information out of the synonyms file

From the README:

    These are listings of all names associated with a CID. The
    unfiltered list are names aggregated from all SIDs whose 
    standardized form is that CID, sorted by weight with the "best"
    names first. The filtered list has some names removed that are
    considered inconsistent with the structure. Both are gzipped text
    files with CID, tab, and name on each line. Note that the
    names may be composed of more than one word, separated by spaces.


The uncompressed file exceeds my disk quota on the OCF server, so I would like to keep it compressed. Python can read compressed data with `gzip.open()`, but in its default binary mode the data come back as `bytes` rather than `str` (passing mode `'rt'` gives decoded text, as with the regular `open()`).

Additionally, we don't want to load the whole 1 GB file into RAM, so we want to either read it incrementally, or use some other method of accessing it, such as a memory map (`mmap` from the standard library). We'll also probably want to transform the data that we extract into another format on disk (JSONL, CSV, database) rather than storing it as a Python object in memory.
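
As a rough illustration of the incremental approach, the sketch below streams the gzipped file in text mode and writes only the CIDs of interest to a JSONL file. The helper names `iter_synonyms` and `export_jsonl` and the output field names are hypothetical, and the code assumes every line has the `CID<TAB>name` layout described in the README.

```python
import gzip
import json


def iter_synonyms(gz_path):
    """Yield (cid, name) pairs from a gzipped CID<TAB>name file, one line at a time."""
    # Mode 'rt' makes gzip.open() decode to str, so no manual .decode() is needed.
    with gzip.open(gz_path, 'rt', encoding='utf-8') as fh:
        for line in fh:
            cid, _, name = line.rstrip('\n').partition('\t')
            if name:  # skip blank lines and lines without a name field
                yield int(cid), name


def export_jsonl(gz_path, out_path, wanted):
    """Write one JSON object per matching line; return the number written."""
    wanted = set(wanted)
    n = 0
    with open(out_path, 'w', encoding='utf-8') as out:
        for cid, name in iter_synonyms(gz_path):
            if cid in wanted:
                out.write(json.dumps({'cid': cid, 'name': name}) + '\n')
                n += 1
    return n
```

Only one line is held in memory at a time, so the memory footprint stays small regardless of the file size. The cost is a full sequential decompression on every pass.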

There are some interesting potential solutions in [this Stack Overflow post](http://stackoverflow.com/questions/6219141/searching-for-a-string-in-a-large-text-file-profiling-various-methods-in-pytho) and [this one](http://codereview.stackexchange.com/questions/78224/optimize-huge-text-file-search).

### Strategy 1: Regex on compressed file

Can we make `re.findall()` work on a `bytes`-type object via `mmap`?

Test the regex part in a small example dataset...

In [3]:
cid = '8028'  # THF, should be in the file

# Yes, this is silly with the regex notation AND the format notation, but it works...
# Anchor at the start of a line so that '8028' does not also match '18028'.
search_rx = r'^{}\t(.*)'.format(cid).encode('utf-8')
p = re.compile(search_rx, re.MULTILINE)

text = """
8008\tHi
8028\tTHF
8028\ttetrahydrofuran
8080\tdrums
""".encode('utf-8')

re.findall(p, text)

[b'THF', b'tetrahydrofuran']

Try it on the actual file, using `mmap` to access the file randomly without loading it all into memory (how does this work anyway? Roughly: the OS maps the file into the process's address space and pages its contents in on demand).

In [30]:
cid = '1'

search_rx = r'(?:{})\t(.*)'.format(cid).encode('utf-8')
p = re.compile(search_rx)

with gzip.open(synonyms_gz, 'rb') as file:
    s = file.readline()
    print(s.decode())
    print(re.findall(p, s))
    with mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for item in re.findall(p, mm):
            print(item)
            print(item.decode())

1	Acetyl-DL-carnitine

[b'Acetyl-DL-carnitine']
b"s\r\xf5\x83\tf=nd\x9d\x13\xdfJ\t\x1a?c\x1a\xa6\x16\xcf\x98\x03y\x1d\x86\xb8\xa5[\xcf\xd1\xdcS\x89~_\xee\x11\xcf\x8f\x16\x18\xb3G\xf4\xa2\x1b\x88\xcb\xf6\xe2JH')\xe6\x82Gv#\x91\xc2D\xa7\x92\xf8\xb2@7\xb6\xdd\xdd\xa5\x8c\xe6\x06h\x90\x92\xb2\xd3\r\x10F\xb3t\x14a\xb8\xb6p\xa33n(<\xe8\x88\xd2$J\x83\xee8\t\xee\xec\xe2\x05\x9b\x08"


UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf5 in position 2: invalid start byte

Problems:
-  You can decode the lines of the compressed file one at a time and search each one using `re`. However, you can't search the whole file at once unless you read it all into memory (undesirable) or use `mmap`.
-  **But** `re.findall()` on the `mmap` object does not work as expected. `file.fileno()` returns the descriptor of the underlying `.gz` file on disk, so `mmap` maps the compressed bytes rather than the decompressed text.
    - It finds bytes that can't be decoded as text, and which do not correspond to the search string text.
    - Actually, it doesn't find `8028` at all.
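
The line-at-a-time alternative mentioned in the first problem can be sketched as follows. This is a minimal illustration, not a tuned solution: `find_synonyms_gz` is a hypothetical helper, and it assumes each line is `CID<TAB>name` with UTF-8 encoded names.

```python
import gzip
import re


def find_synonyms_gz(gz_path, cid):
    """Return the names for `cid`, scanning the gzip stream line by line."""
    # Anchor at the start of the line and require the tab, so that
    # CID 8028 does not also match 18028 or 80280.
    pattern = re.compile(rb'^' + str(cid).encode('ascii') + rb'\t(.*)')
    names = []
    with gzip.open(gz_path, 'rb') as fh:
        for line in fh:  # the stream is decompressed incrementally as we iterate
            m = pattern.match(line)
            if m:
                names.append(m.group(1).decode('utf-8'))
    return names
```

This avoids `mmap` entirely, so it works on the compressed file, but every lookup has to decompress the stream from the beginning.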


### Strategy 2: Regex on uncompressed file

Works, but is slow (about 16 s per lookup).

In [34]:
def find_synonyms(cid):
    # Anchor at line start so that e.g. 8028 does not also match 18028.
    search_rx = r'^{}\t(.*)'.format(cid).encode('utf-8')
    p = re.compile(search_rx, re.MULTILINE)

    # mmap works on the raw file descriptor, so open in binary mode.
    with open(synonyms, 'rb') as file, \
        mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        return re.findall(p, mm)

In [35]:
%timeit thf = find_synonyms(8028)

1 loop, best of 3: 16.1 s per loop


In [39]:
thf = find_synonyms(8028)
thf[:10]

[b'TETRAHYDROFURAN',
 b'Oxolane',
 b'109-99-9',
 b'Butylene oxide',
 b'Furanidine',
 b'Hydrofuran',
 b'Furan, tetrahydro-',
 b'Oxacyclopentane',
 b'Tetramethylene oxide',
 b'1,4-Epoxybutane']

Assuming we do this, the next task is to aggregate, store, and further process the results efficiently. A SQLite database?
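
To make the SQLite idea concrete, here is a minimal sketch using the standard-library `sqlite3` module. The table name `synonyms`, the index name `idx_cid`, and the helpers `build_db` and `lookup` are all hypothetical choices for illustration. `rows` can be any iterable of `(cid, name)` pairs, including a generator that streams the file.

```python
import sqlite3


def build_db(db_path, rows):
    """Load (cid, name) pairs into SQLite and index them by CID."""
    con = sqlite3.connect(db_path)
    con.execute('CREATE TABLE IF NOT EXISTS synonyms (cid INTEGER, name TEXT)')
    con.executemany('INSERT INTO synonyms VALUES (?, ?)', rows)
    # The index is what makes later per-CID lookups fast.
    con.execute('CREATE INDEX IF NOT EXISTS idx_cid ON synonyms (cid)')
    con.commit()
    return con


def lookup(con, cid):
    """Return the names for one CID in insertion order."""
    cur = con.execute(
        'SELECT name FROM synonyms WHERE cid = ? ORDER BY rowid', (cid,))
    return [name for (name,) in cur]


# Tiny in-memory example using the toy rows from earlier in this notebook.
con = build_db(':memory:', [(8008, 'Hi'), (8028, 'THF'),
                            (8028, 'tetrahydrofuran'), (8080, 'drums')])
print(lookup(con, 8028))
```

The file would be scanned once to build the database, after which lookups go through the index instead of a regex over the whole file. Whether the resulting database fits within a disk quota is a separate question that this sketch does not address.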

### Strategy 3: Read line by line...

Brute force. To make this less stupid, we should take advantage of the fact that the CIDs in the synonyms file are sorted (apparently numerically: the first CID is `1`, and the run below finds `8028` before `71609`, which a lexicographic sort would not allow). Not sure how exactly to do this.

In [7]:
# Exploratory...

cids = ['8028', '71609', '88888888']  # actual CIDs that should be in the file

with gzip.open(synonyms_gz, 'rb') as file:
    for cid in cids:
        for line in file:
            s = line.decode('utf-8')
            # Match the whole CID field, not just a prefix (e.g. '80280').
            if s.startswith(cid + '\t'):
                print(s)
                break  # The wrong thing to do, but just testing

8028	TETRAHYDROFURAN

71609	Mescaline sulfate



KeyboardInterrupt: 
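
One possible way to exploit the ordering is a single pass that collects every requested CID and stops as soon as the file has moved past the largest one. The sketch below assumes the lines are in ascending numeric CID order, which has not been verified against the README. `find_many` is a hypothetical helper name.

```python
import gzip


def find_many(gz_path, cids):
    """Collect names for several CIDs in a single pass over the file.

    Assumes lines are in ascending numeric CID order (an unverified assumption).
    """
    wanted = set(int(c) for c in cids)
    found = {c: [] for c in wanted}
    if not wanted:
        return found
    last = max(wanted)
    with gzip.open(gz_path, 'rt', encoding='utf-8') as fh:
        for line in fh:
            cid_text, _, name = line.rstrip('\n').partition('\t')
            if not cid_text.isdigit():
                continue  # skip blank or malformed lines
            cid = int(cid_text)
            if cid > last:
                break  # sorted input: nothing further can match
            if cid in wanted:
                found[cid].append(name)
    return found
```

A CID that is absent simply maps to an empty list instead of stalling the loop. A requested CID larger than any CID in the file still costs a full pass, since the early exit never triggers.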