# Compound and food dictionaries

This notebook builds several lookup dictionaries from the FooDB CSV files and saves them as pickle files that can be loaded directly.

Dictionaries saved in pickle files:
 - food_common.pickle: common English food names as keys, lists of compound names as values
 - food_sci.pickle: scientific food names as keys, lists of compound names as values
 - compound_common.pickle: compound names as keys, lists of common English food names as values
 - compound_sci.pickle: compound names as keys, lists of scientific food names as values
 - food_convert_common.pickle: common English food names as keys, scientific food names as values
 - food_convert_sci.pickle: scientific food names as keys, common English food names as values
 - food_sci_abbrev.pickle: scientific food names as keys, abbreviated scientific food names as values
 
Example usage:

In [2]:
import pickle
compound_common = pickle.load( open( "compound_common.pickle", "rb" ) )
for i, item in enumerate(compound_common.iteritems()):
    if i == 0:
        print 'id:name'
        print '-'*8
    print '{0}:{1}'.format(item[0], item[1])
    if i == 10: break

id:name
--------
8-Deoxylactucin:['Chicory', 'Chicory', 'Chicory']
Isorhamnetin 3-(2G-apiosylrutinoside):['Cereals and cereal products']
Menthofurolactone:['Cornmint']
1,3,5,11-Bisabolatetraen-10-one:['Herbs and Spices']
Methyl [8]-Shogaol:['Ginger', 'Ginger']
8-Hydroxy-7-methoxy-2H-1-benzopyran-2-one:['Green vegetables']
1,4-Dimethylpyrrolo[1,2-a]pyrazine:['Animal foods']
PC(14:0/22:0):['Cow milk, pasteurized, vitamin A + D added, 0% fat', 'Cow milk, pasteurized, vitamin A + D added, 1% fat', 'Cow milk, pasteurized, vitamin A + D added, 2% fat', 'Cow milk, pasteurized, vitamin D added, 3.25% fat']
2-Ethylpyrazine:['Tea', 'Fenugreek', 'Cocoa and cocoa products', 'Potato', 'Coffee and coffee products', 'Tea', 'Mollusks', 'Cereals and cereal products', 'Green vegetables', 'Pulses', 'Crustaceans', 'Nuts']
N-Isobutyleicosa-trans-2-trans-4-cis-8-trienamide:['Pepper (Spice)']
2(3)-Benzoxazolinone:['Common wheat', 'Corn']
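
The loading step above can also be wrapped in a small helper that closes the file handle once the dictionary is read. This is a minimal sketch and not part of the original notebook: `load_pickle` and `preview` are hypothetical names, and the `encoding="latin1"` argument assumes the helper runs under Python 3 while the pickle files were written under Python 2 (the notebook itself uses Python 2 `print` statements and `iteritems`).

```python
import pickle


def load_pickle(path):
    """Load one pickled dictionary and close the file afterwards.

    Hypothetical helper. encoding="latin1" is an assumption: it is the usual
    setting for reading pickles written by Python 2 from Python 3.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle, encoding="latin1")


def preview(mapping, n=10):
    """Return the first n (key, value) pairs of a dictionary as a list."""
    pairs = []
    for i, item in enumerate(mapping.items()):
        if i >= n:
            break
        pairs.append(item)
    return pairs
```

With these in place, `preview(load_pickle("compound_common.pickle"))` would return the first ten entries instead of printing them.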


## Create dictionaries

First, extract only the second and fourth columns of contents.csv. In the header printed below, the second column (source_id) holds the compound id and the fourth (food_id) holds the food id. To join these ids with compound and food names, also extract the id and name columns from compounds.csv and foods.csv.

In [150]:
from collections import defaultdict

# csvcut and csvformat are part of csvkit: pip install csvkit

FOODB_PATH = '/home/adam/Documents/MIDS/W266/Project/data/foodb_2016-11-18/'
COMPOUND_ID_PATH = FOODB_PATH + 'compounds.csv'
FOOD_ID_PATH = FOODB_PATH + 'foods.csv'
COMPOUND_FOOD_ID_PATH = FOODB_PATH + 'contents.csv'
!head $COMPOUND_FOOD_ID_PATH
!csvcut -c 2,4 $COMPOUND_FOOD_ID_PATH  > compound_food_id.csv 
!csvcut -c 1,5 $COMPOUND_ID_PATH > compound_id.csv
!csvformat -D "|" compound_id.csv > compound_id_pipe_delimited.csv
!csvcut -c 1,2,3 $FOOD_ID_PATH > food_id.csv
!csvformat -D "|" food_id.csv > food_id_pipe_delimited.csv

id,source_id,"source_type",food_id,"orig_food_id","orig_food_common_name","orig_food_scientific_name","orig_food_part","orig_source_id","orig_source_name",orig_content,orig_min,orig_max,"orig_unit","orig_citation","citation","citation_type",creator_id,updater_id,"created_at","updated_at","orig_method","orig_unit_expression",standard_content
1,1,"Nutrient",4,"29","Kiwi","Actinidia chinensis PLANCHON [Actinidiaceae]","Fruit","FAT","FAT",NULL,700.000000000,38400.000000000,"ppm",NULL,"DUKE","DATABASE",NULL,NULL,"2014-11-05 13:42:11","2015-10-21 23:02:07",NULL,NULL,1955.000000000
2,1,"Nutrient",6,"53","Onion","Allium cepa L. [Liliaceae]","Bulb","FAT","FAT",NULL,1000.000000000,36079.000000000,"ppm",NULL,"DUKE","DATABASE",NULL,NULL,"2014-11-05 13:42:11","2015-10-21 23:02:07",NULL,NULL,1853.950000000
3,1,"Nutrient",6,"53","Onion","Allium cepa L. [Liliaceae]","Leaf","FAT","FAT",NULL,6000.000000000,77000.000000000,"ppm",NULL,"DUKE","DATABASE",NULL,NULL,"2014-11-05 13:42:11","2015-10-21 23:02:
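
Where csvkit is not available, the column extraction can be approximated with the standard `csv` module. The following is a sketch only, written for Python 3: `cut_columns` is a hypothetical name, and the sample text is a shortened stand-in for contents.csv, not the real file.

```python
import csv
import io


def cut_columns(lines, columns):
    """Yield only the requested 1-based columns from an iterable of CSV lines."""
    for row in csv.reader(lines):
        yield [row[c - 1] for c in columns]


# Shortened stand-in for contents.csv; the real file has many more columns.
sample = io.StringIO(
    'id,source_id,"source_type",food_id\n'
    '1,1,"Nutrient",4\n'
    '2,1,"Nutrient",6\n'
)
pairs = list(cut_columns(sample, columns=(2, 4)))
print(pairs)
```

Because `csv.reader` understands quoted fields, a comma inside a quoted value does not shift the column positions.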

### Extracting compound ids and names

In [3]:
compound = {}
with open('compound_id_pipe_delimited.csv','r') as infile:
    for line in infile:
        compound_id, name = line.strip('\n').split('|')
        # The longest valid compound id has 5 characters. A longer value means that a note was
        # entered in the id column, and an empty value means the line has no id; skip both.
        if len(compound_id) > 5 or len(compound_id) == 0:
            continue
        compound[compound_id] = name

Example:

In [4]:
for i, item in enumerate(compound.iteritems()):
    if i == 0:
        print 'id:compound'
        print '-'*8
    print '{0}:{1}'.format(item[0], item[1])
    if i == 10: break

id:compound
--------
11542:Pectenotoxin 2
11543:Pectenotoxin 3
11540:Antibiotic A 41030
11541:Pectenotoxin 1
11546:Desglucodesrhamnoparillin
11547:Testosterone
11544:Pectenotoxin 4
11545:Pectenotoxin 5
11548:Tetronasin
11549:24-Methylenelophenol
5989:Chamaviolin
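
The length check in the parsing cell relies on valid ids being short. An alternative is to test that the id is numeric and to let the `csv` module handle the delimiter. This is a sketch under the assumption that compound ids are purely numeric, which holds for the ids listed above but has not been verified for the whole file; `read_id_name_pairs` is a hypothetical name.

```python
import csv
import io


def read_id_name_pairs(lines, delimiter="|"):
    """Build {id: name} from delimited lines, skipping rows without a numeric id."""
    mapping = {}
    for row in csv.reader(lines, delimiter=delimiter):
        # Skip blank rows, a header row, and rows where a note leaked into the id column.
        if len(row) < 2 or not row[0].isdigit():
            continue
        mapping[row[0]] = row[1]
    return mapping


# Made-up sample in the same pipe-delimited layout as compound_id_pipe_delimited.csv.
sample = io.StringIO(
    "id|name\n"
    "11547|Testosterone\n"
    "see note in source|broken row\n"
    "5989|Chamaviolin\n"
)
compounds = read_id_name_pairs(sample)
print(compounds)
```

Unlike a pure length check, this also drops a header row whose first field is short, such as `id`.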


### Extracting food ids and names

In [5]:
food_english = {}
food_latin = {}
# translate common to scientific names and vice versa
food_convert_common = {}
food_convert_sci = {}
with open('food_id_pipe_delimited.csv','r') as infile:
    for line in infile:
        try:
            food_id, english_name, latin_name = line.strip('\n').split('|')
        except ValueError:
            # The line did not split into exactly three fields, e.g. because a note containing
            # a ',' is recorded in the same column as the Latin name. Fall back to splitting on
            # ',' and keep only the id, English name and Latin name.
            strings = line.strip('\n').split(',')
            food_id, english_name, latin_name = strings[0], strings[1], strings[2]
        # The longest valid food id has three characters. A longer value means that a note was
        # entered in the id column, and an empty value means the line has no id; skip both.
        if len(food_id) > 3 or len(food_id) == 0:
            continue
        food_english[food_id] = english_name
        food_latin[food_id] = latin_name
        food_convert_common[english_name] = latin_name
        food_convert_sci[latin_name] = english_name
        

Example:

In [6]:
for i, item in enumerate(food_english.iteritems()):
    if i == 0:
        print 'id:name'
        print '-'*8
    print '{0}:{1}'.format(item[0], item[1])
    if i == 10: break

id:name
--------
344:Common octopus
345:Corn salad
346:Cottonseed
347:Catjang pea
340:Nuttall cockle
341:Coconut
342:Pacific cod
343:Atlantic cod
810:Ice cream cone
811:Molasses
812:Cracker


Example:

In [8]:
for i, item in enumerate(food_latin.iteritems()):
    if i == 0:
        print 'id:name'
        print '-'*8
    print '{0}:{1}'.format(item[0], item[1])
    if i == 10: break

id:name
--------
344:Octopus vulgaris
345:Valerianella locusta
346:Gossypium
347:Vigna unguiculata ssp. cylindrica
340:Clinocardium nuttallii
341:Cocos nucifera
342:Gadus macrocephalus
343:Gadus morhua
810:NULL
811:NULL
812:NULL


In [9]:
for i, item in enumerate(food_convert_common.iteritems()):
    if i == 0:
        print 'id:name'
        print '-'*8
    print '{0}:{1}'.format(item[0], item[1])
    if i == 10: break

id:name
--------
Oregon yampah:Perideridia oregana
Okra:Abelmoschus esculentus
Snail:Gastropoda
Black mulberry:Morus nigra
Avocado:Persea americana
Parsley:Petroselinum crispum
Elderberry:Sambucus
Sugar:NULL
Sweet bay:Laurus nobilis
Common bean:Phaseolus vulgaris
Fig:Ficus carica
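
The listings above show several foods whose scientific name is `NULL`. Because `food_convert_sci` is keyed by scientific name, every such food writes to the same `'NULL'` key and only the last one written survives. One way to avoid this is to leave placeholder names out of the reverse map. This is a sketch, not the notebook's method: `MISSING` and `build_conversion_maps` are hypothetical names, and treating the empty string as a missing name is an assumption.

```python
# Placeholder values treated as "no scientific name" (assumption).
MISSING = {"NULL", ""}


def build_conversion_maps(records):
    """Return (common -> scientific, scientific -> common) from (english, latin) pairs."""
    common_to_sci = {}
    sci_to_common = {}
    for english_name, latin_name in records:
        common_to_sci[english_name] = latin_name
        # Only real scientific names go into the reverse map, so foods
        # without one do not overwrite each other under a shared key.
        if latin_name not in MISSING:
            sci_to_common[latin_name] = english_name
    return common_to_sci, sci_to_common


records = [
    ("Coconut", "Cocos nucifera"),
    ("Molasses", "NULL"),
    ("Cracker", "NULL"),
]
common_to_sci, sci_to_common = build_conversion_maps(records)
print(sci_to_common)
```

The forward map still records `NULL` for foods without a scientific name, so a lookup by common name behaves as before.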


### Create compound-food id dictionaries

In [11]:
# food_dict and compound_dict hold the same content as food_common and compound_common,
# which the next cell builds alongside their scientific-name counterparts.
food_dict = defaultdict(list)
compound_dict = defaultdict(list)
with open('compound_food_id.csv','r') as compound_food_ids:
    next(compound_food_ids)  # skip the header row
    for line in compound_food_ids:
        compound_id, food_id = line.strip('\n').split(',')
        food_name = food_english[food_id]
        try:
            compound_name = compound[compound_id]
            food_dict[food_name].append(compound_name)
            compound_dict[compound_name].append(food_name)
        # KeyError occurs when a compound is not in the dictionary, e.g., if it is a nutrient
        # such as fat and not a small molecule, so the row can safely be skipped
        except KeyError:
            pass

In [12]:
# keyed by or containing common names of foods
food_common = defaultdict(list)
compound_common = defaultdict(list)
# keyed by or containing scientific names of foods
food_sci = defaultdict(list)
compound_sci = defaultdict(list)


with open('compound_food_id.csv','r') as compound_food_ids:
    next(compound_food_ids)  # skip the header row
    for line in compound_food_ids:
        compound_id, food_id = line.strip('\n').split(',')
        food_name_common = food_english[food_id]
        food_name_sci = food_latin[food_id]
        try:
            compound_name = compound[compound_id]
            food_common[food_name_common].append(compound_name)
            food_sci[food_name_sci].append(compound_name)
            compound_common[compound_name].append(food_name_common)
            compound_sci[compound_name].append(food_name_sci)
        # KeyError occurs when a compound is not in the dictionary, e.g., if it is a nutrient
        # such as fat and not a small molecule, so the row can safely be skipped
        except KeyError:
            pass

Example:

In [156]:
for i, item in enumerate(compound_common.iteritems()):
    if i == 0:
        print 'id:name'
        print '-'*8
    print '{0}:{1}'.format(item[0], item[1])
    if i == 10: break

id:name
--------
8-Deoxylactucin:['Chicory', 'Chicory', 'Chicory']
Isorhamnetin 3-(2G-apiosylrutinoside):['Cereals and cereal products']
Menthofurolactone:['Cornmint']
1,3,5,11-Bisabolatetraen-10-one:['Herbs and Spices']
Methyl [8]-Shogaol:['Ginger', 'Ginger']
8-Hydroxy-7-methoxy-2H-1-benzopyran-2-one:['Green vegetables']
1,4-Dimethylpyrrolo[1,2-a]pyrazine:['Animal foods']
PC(14:0/22:0):['Cow milk, pasteurized, vitamin A + D added, 0% fat', 'Cow milk, pasteurized, vitamin A + D added, 1% fat', 'Cow milk, pasteurized, vitamin A + D added, 2% fat', 'Cow milk, pasteurized, vitamin D added, 3.25% fat']
PC(20:2(11Z,14Z)/18:4(6Z,9Z,12Z,15Z)):['Cow milk, pasteurized, vitamin A + D added, 0% fat', 'Cow milk, pasteurized, vitamin A + D added, 1% fat', 'Cow milk, pasteurized, vitamin A + D added, 2% fat', 'Cow milk, pasteurized, vitamin D added, 3.25% fat']
2-Ethylpyrazine:['Tea', 'Fenugreek', 'Cocoa and cocoa products', 'Potato', 'Coffee and coffee products', 'Tea', 'Mollusks', 'Cereals and cer

In [147]:
for i, item in enumerate(compound_sci.iteritems()):
    if i == 0:
        print 'id:name'
        print '-'*8
    print '{0}:{1}'.format(item[0], item[1])
    if i == 10: break

id:name
--------
8-Deoxylactucin:['Cichorium intybus', 'Cichorium intybus', 'Cichorium intybus']
Isorhamnetin 3-(2G-apiosylrutinoside):['NULL']
Menthofurolactone:['Mentha arvensis']
1,3,5,11-Bisabolatetraen-10-one:['NULL']
Methyl [8]-Shogaol:['Zingiber officinale', 'Zingiber officinale']
8-Hydroxy-7-methoxy-2H-1-benzopyran-2-one:['NULL']
1,4-Dimethylpyrrolo[1,2-a]pyrazine:['NULL']
PC(14:0/22:0):['', '', '', '']
PC(20:2(11Z,14Z)/18:4(6Z,9Z,12Z,15Z)):['', '', '', '']
2-Ethylpyrazine:['Camellia sinensis', 'Trigonella foenum-graecum', 'NULL', 'Solanum tuberosum', 'NULL', 'Camellia sinensis', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL', 'NULL']
N-Isobutyleicosa-trans-2-trans-4-cis-8-trienamide:['Piper nigrum']
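
Several values above repeat the same food (for example `'Chicory'` three times), presumably because contents.csv can hold more than one row for the same compound and food. If the repeats are not wanted, the lists can be built so that each value appears once, in first-seen order. This is a sketch of that idea rather than the notebook's own code; `build_unique_index` is a hypothetical name.

```python
from collections import defaultdict


def build_unique_index(pairs):
    """Map each key to its values in first-seen order, without repeats."""
    index = defaultdict(list)
    seen = defaultdict(set)  # per-key set for fast membership tests
    for key, value in pairs:
        if value not in seen[key]:
            seen[key].add(value)
            index[key].append(value)
    return dict(index)


pairs = [
    ("8-Deoxylactucin", "Chicory"),
    ("8-Deoxylactucin", "Chicory"),
    ("Methyl [8]-Shogaol", "Ginger"),
    ("8-Deoxylactucin", "Chicory"),
]
unique_index = build_unique_index(pairs)
print(unique_index)
```

Keeping lists rather than sets as values preserves the order in which foods were first encountered and keeps the result the same shape as the dictionaries built above.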


In [135]:
for i, item in enumerate(food_common.iteritems()):
    if i == 15:
        print 'id:name'
        print '-'*8
        print '{0}:{1}'.format(item[0], item[1])

id:name
--------
Norway pout:["Cyanidin 3-(6''-acetyl-galactoside)", 'Pelargonidin 3-arabinoside', 'Saturated fatty acids', 'Unsaturated fatty acids', 'Unsaturated fatty acids', 'Sucrose', 'Ethanol', 'Ash', 'Water', 'Retinol', 'Retinol', 'beta-Carotene', 'Vitamin D', 'Vitamin D3', 'Ergocalciferol', 'alpha-Tocopherol', 'alpha-Tocopherol', 'Phytomenadione', 'Thiamine hydrochloride', 'Riboflavine', 'Nicotinic acid', 'Nicotinic acid', 'Nicotinic acid', 'Pyridoxine', 'Pantothenic acid', 'Biotin', 'Folic acid', 'Cyanocobalamin', 'L-Ascorbic acid', 'L-Ascorbic acid', 'L-Dehydroascorbic acid', 'Sodium', 'Potassium', 'Calcium', 'Magnesium', 'Phosphorus', 'Iron', 'Copper', 'Zinc', 'Iodine', 'Manganese', 'Chromium', 'Selenium', 'Nickel', 'D-Fructose', 'D-Glucose', 'Lactose', 'Maltose', 'Sucrose', 'Sugars', 'Starch', 'Butanoic acid', 'Hexanoic acid', 'Octanoic acid', 'Decanoic acid', 'Dodecanoic acid', 'Tetradecanoic acid', 'Pentadecanoic acid', 'Hexadecanoic acid', 'Heptanoic acid', 'Octadecanoic

### Add abbreviated scientific names

In [34]:
# a plain dict is enough here: each value is a single string, not a list
food_sci_abbrev = {}
for key in food_sci:
    words = key.split(" ")
    # most scientific names are two words long (genus and species);
    # if a name is longer, keep only the first two words
    if len(words) > 1:
        genus, species = words[:2]
        food_sci_abbrev[key] = genus[0] + '. ' + species
    # single-word names (a genus or higher-rank taxon, or a placeholder such as 'NULL')
    # are kept unchanged
    else:
        food_sci_abbrev[key] = key
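
The abbreviation rule can also be isolated in a function, which makes the edge cases easy to check. This is a minimal sketch with a hypothetical name, `abbreviate`. It assumes that splitting on any whitespace is acceptable, which differs slightly from `split(" ")` when a name contains doubled spaces.

```python
def abbreviate(name):
    """Abbreviate 'Genus species ...' to 'G. species'; return other names unchanged."""
    words = name.split()
    # Single-word names and empty strings have nothing to abbreviate.
    if len(words) < 2:
        return name
    genus, species = words[:2]
    return genus[0] + ". " + species


for name in ["Pistacia vera", "Vigna unguiculata ssp. cylindrica", "Scyphozoa", "NULL"]:
    print("{0} : {1}".format(name, abbreviate(name)))
```

Names with more than two words lose everything after the species, so subspecies such as `ssp. cylindrica` are not kept in the abbreviation.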
    

## Save Python objects

In [35]:
pickle.dump(food_common, open( "food_common.pickle", "wb" ) )
pickle.dump(food_sci, open( "food_sci.pickle", "wb" ) )
pickle.dump(compound_common, open( "compound_common.pickle", "wb" ) )
pickle.dump(compound_sci, open( "compound_sci.pickle", "wb" ) )
pickle.dump(food_convert_common, open( "food_convert_common.pickle", "wb" ) )
pickle.dump(food_convert_sci, open( "food_convert_sci.pickle", "wb" ) )
pickle.dump(food_sci_abbrev, open( "food_sci_abbrev.pickle", "wb"))
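
The save step can also be written as a loop that closes each file as soon as it is written. This is a sketch with a hypothetical helper name, `save_pickles`. It assumes each dictionary should be stored as `<name>.pickle` in a single directory, as in the cell above.

```python
import os
import pickle


def save_pickles(objects, directory="."):
    """Write each object to <directory>/<name>.pickle and return the paths written."""
    paths = []
    for name, obj in sorted(objects.items()):
        path = os.path.join(directory, name + ".pickle")
        # The with-block closes the file even if pickling fails part-way.
        with open(path, "wb") as handle:
            pickle.dump(obj, handle)
        paths.append(path)
    return paths


# Example call (names as used in this notebook):
# save_pickles({"food_common": food_common, "food_sci": food_sci})
```

Passing the dictionaries by name keeps the file names and the variable names in one place, so adding another dictionary is a one-line change.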

## Load Python objects

In [37]:
test = pickle.load( open( "food_sci_abbrev.pickle", "rb" ) )
for i, item in enumerate(test.iteritems()):
    if i == 0:
        print 'id : name'
        print '-'*8
    print '{0} : {1}'.format(item[0], item[1])
    if i == 10: break

id : name
--------
Pouteria sapota : P. sapota
Scyphozoa : Scyphozoa
Dromaius novaehollandiae : D. novaehollandiae
Pistacia vera : P. vera
Zoarces americanus : Z. americanus
Phoenix dactylifera : P. dactylifera
Syzygium jambos : S. jambos
Clupeinae : Clupeinae
Alces alces : A. alces
Hyssopus officinalis : H. officinalis
Vitis vinifera : V. vinifera
