<a href="https://colab.research.google.com/github/bbchen33/Web-scraping/blob/master/KEGGpathways_categories.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Obtain KEGG categories and KEGG pathway IDs from a json file

Go to https://www.kegg.jp/kegg-bin/get_htext?br08901 and click "Download json" to get br08901.json, then upload it in the cell below.

In [1]:
from google.colab import files
file = files.upload()

Saving br08901.json to br08901.json


In [0]:
import json
with open('br08901.json') as json_file:
  data = json.load(json_file)

Every layer is reached through the "children" key, and each "children" value is a list of the nodes in the next layer down.

In [4]:
data['children'][0]['children'][0]

{'children': [{'name': '01100  Metabolic pathways'},
  {'name': '01110  Biosynthesis of secondary metabolites'},
  {'name': '01120  Microbial metabolism in diverse environments'},
  {'name': '01130  Biosynthesis of antibiotics'},
  {'name': '01200  Carbon metabolism'},
  {'name': '01210  2-Oxocarboxylic acid metabolism'},
  {'name': '01212  Fatty acid metabolism'},
  {'name': '01230  Biosynthesis of amino acids'},
  {'name': '01220  Degradation of aromatic compounds'}],
 'name': 'Global and overview maps'}

"Global and overview maps" is a category that has children pathways such as "01100 Metabolic pathways". 

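The nesting can also be explored with a small recursive helper instead of chaining indexes by hand. This is only a sketch and not part of the original notebook: `sample` is a hand-made stand-in that mimics the shape of br08901.json (its root name is a placeholder), and `iter_names` is a hypothetical helper name. With the real file, `data` would be passed instead of `sample`.

```python
# Hand-made stand-in mimicking the shape of br08901.json (assumption);
# with the real file, pass `data` instead of `sample`.
sample = {
    'name': 'root',  # placeholder root name
    'children': [
        {'name': 'Metabolism',
         'children': [
             {'name': 'Global and overview maps',
              'children': [{'name': '01100  Metabolic pathways'},
                           {'name': '01110  Biosynthesis of secondary metabolites'}]},
             {'name': 'Carbohydrate metabolism',
              'children': [{'name': '00010  Glycolysis / Gluconeogenesis'}]},
         ]},
    ],
}

def iter_names(node, depth=0):
    """Yield (depth, name) for a node and all of its descendants."""
    yield depth, node['name']
    # Leaf nodes (pathways) have no "children" key, so fall back to an empty list.
    for child in node.get('children', []):
        yield from iter_names(child, depth + 1)

# Print the tree with two spaces of indentation per layer.
for depth, name in iter_names(sample):
    print('  ' * depth + name)
```

Depth 1 corresponds to the layer directly under the root, depth 2 to categories such as "Global and overview maps", and depth 3 to the pathways.
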
In [7]:
data['children'][0]

{'children': [{'children': [{'name': '01100  Metabolic pathways'},
    {'name': '01110  Biosynthesis of secondary metabolites'},
    {'name': '01120  Microbial metabolism in diverse environments'},
    {'name': '01130  Biosynthesis of antibiotics'},
    {'name': '01200  Carbon metabolism'},
    {'name': '01210  2-Oxocarboxylic acid metabolism'},
    {'name': '01212  Fatty acid metabolism'},
    {'name': '01230  Biosynthesis of amino acids'},
    {'name': '01220  Degradation of aromatic compounds'}],
   'name': 'Global and overview maps'},
  {'children': [{'name': '00010  Glycolysis / Gluconeogenesis'},
    {'name': '00020  Citrate cycle (TCA cycle)'},
    {'name': '00030  Pentose phosphate pathway'},
    {'name': '00040  Pentose and glucuronate interconversions'},
    {'name': '00051  Fructose and mannose metabolism'},
    {'name': '00052  Galactose metabolism'},
    {'name': '00053  Ascorbate and aldarate metabolism'},
    {'name': '00500  Starch and sucrose metabolism'},
    {'name':

"Global and overview maps" is under a bigger category called "Metabolism", whose name appears at the bottom of the full output. I'll call the "Metabolism" layer the "major category" since it is the highest layer possible here.

In [0]:
len_major_category = len(data['children'])

In [9]:
len_major_category

7

Now I'll make a dictionary where each key is a major category name and each value is the list of its middle category names, such as "Global and overview maps" or "Carbohydrate metabolism".

In [0]:
major_category_dict = {}
for i in range(len_major_category):
  major_category_dict[data['children'][i]['name']] = []

In [13]:
major_category_dict

{'Cellular Processes': [],
 'Drug Development': [],
 'Environmental Information Processing': [],
 'Genetic Information Processing': [],
 'Human Diseases': [],
 'Metabolism': [],
 'Organismal Systems': []}

I could first work out the number of middle categories under each major category, but a quicker way is to loop up to a number that is definitely larger than any of those counts (50 here) and use "try + except" to skip the indices that don't exist.

In [0]:
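for i in range(len_major_category):
  major = data['children'][i]
  for j in range(50):  # 50 is assumed to exceed the number of middle categories
    try:
      major_category_dict[major['name']].append(major['children'][j]['name'])
    except (IndexError, KeyError):
      # Ran past the last middle category (or there are none): stop this inner loop
      break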

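For comparison, here is a sketch that builds the same kind of dictionary by looping over the lists directly, so no upper bound has to be guessed. It is an alternative to the approach above, not what the notebook uses. `sample` is a small made-up stand-in for `data`, and `major_to_middle` is a hypothetical name.

```python
# Small made-up stand-in for `data` (assumption); only the layers needed here.
sample = {
    'children': [
        {'name': 'Metabolism',
         'children': [{'name': 'Global and overview maps'},
                      {'name': 'Carbohydrate metabolism'}]},
        {'name': 'Cellular Processes',
         'children': [{'name': 'Cell growth and death'}]},
    ],
}

# Map each major category name to the names of its middle categories.
# .get('children', []) tolerates a major category without any children.
major_to_middle = {
    major['name']: [middle['name'] for middle in major.get('children', [])]
    for major in sample['children']
}

print(major_to_middle)
```
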
In [17]:
major_category_dict['Metabolism']

['Global and overview maps',
 'Carbohydrate metabolism',
 'Energy metabolism',
 'Lipid metabolism',
 'Nucleotide metabolism',
 'Amino acid metabolism',
 'Metabolism of other amino acids',
 'Glycan biosynthesis and metabolism',
 'Metabolism of cofactors and vitamins',
 'Metabolism of terpenoids and polyketides',
 'Biosynthesis of other secondary metabolites',
 'Xenobiotics biodegradation and metabolism',
 'Chemical structure transformation maps']

By comparing this with the KEGG pathways on the website, I confirmed the dictionary is correct. To list the middle categories under a specific major category, simply use the major category name as the key for the dictionary.

Conversely, the get_major_category function below takes a middle category name and returns the name of its parent (the major category).

In [0]:
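def get_major_category(middle_category_name):
  # Check every major category before giving up
  for key, value in major_category_dict.items():
    if middle_category_name in value:
      return key
  # Only reached when no major category contains the name
  return 'The category cannot be found.'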

In [21]:
get_major_category('Carbohydrate metabolism')

'Metabolism'
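
If many lookups are needed, another option is to invert the dictionary once and then look names up directly. This is a sketch under assumptions: `major_category_dict_sample` is a shortened, hand-written stand-in for `major_category_dict`, and `middle_to_major` is a hypothetical name.

```python
# Shortened, hand-written stand-in for major_category_dict (assumption).
major_category_dict_sample = {
    'Metabolism': ['Global and overview maps', 'Carbohydrate metabolism'],
    'Cellular Processes': ['Transport and catabolism', 'Cell growth and death'],
}

# Invert the mapping once: middle category name -> major category name.
middle_to_major = {
    middle: major
    for major, middles in major_category_dict_sample.items()
    for middle in middles
}

# dict.get returns the fallback message when the name is unknown.
print(middle_to_major.get('Cell growth and death', 'The category cannot be found.'))
print(middle_to_major.get('Not a category', 'The category cannot be found.'))
```

The inverted dictionary assumes each middle category name appears under only one major category; a repeated name would keep only the last parent seen.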

What if you want to know what major category a specific pathway ('01100  Metabolic pathways') is under but you don't care about the middle category ('Global and overview maps')? Basically, is there a way to make a dictionary by skipping the middle layer? Yes.

In [0]:
major_category_to_pathway_dict = {}
for i in range(len_major_category):
  major_category_to_pathway_dict[data['children'][i]['name']] = []

In [33]:
major_category_to_pathway_dict

{'Cellular Processes': [],
 'Drug Development': [],
 'Environmental Information Processing': [],
 'Genetic Information Processing': [],
 'Human Diseases': [],
 'Metabolism': [],
 'Organismal Systems': []}

In [0]:
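for i in range(len_major_category):
  major = data['children'][i]
  for j in range(50):
    for k in range(50):  # 50 is assumed to exceed the size of every layer
      try:
        major_category_to_pathway_dict[major['name']].append(major['children'][j]['children'][k]['name'])
      except (IndexError, KeyError):
        # Past the end of this layer (or no such layer): stop the innermost loop
        break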

In [43]:
major_category_to_pathway_dict['Metabolism'][:10]

['01100  Metabolic pathways',
 '01110  Biosynthesis of secondary metabolites',
 '01120  Microbial metabolism in diverse environments',
 '01130  Biosynthesis of antibiotics',
 '01200  Carbon metabolism',
 '01210  2-Oxocarboxylic acid metabolism',
 '01212  Fatty acid metabolism',
 '01230  Biosynthesis of amino acids',
 '01220  Degradation of aromatic compounds',
 '00010  Glycolysis / Gluconeogenesis']

In [0]:
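def get_major_category_from_pathway(pathway_name):
  # Check every major category before giving up
  for key, value in major_category_to_pathway_dict.items():
    if pathway_name in value:
      return key
  # Only reached when no major category contains the pathway
  return 'The pathway cannot be found.'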

In [42]:
get_major_category_from_pathway('01100  Metabolic pathways')

'Metabolism'
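
Each pathway entry combines the pathway ID and the pathway name in one string, so the lookup above needs the exact string including the double space. As a sketch, the ID can be split off and used as the key instead. `sample_major_to_pathways` is a shortened, hand-written stand-in for `major_category_to_pathway_dict`, and `split_pathway` and `pathway_id_to_major` are hypothetical names.

```python
# Shortened, hand-written stand-in for major_category_to_pathway_dict (assumption).
sample_major_to_pathways = {
    'Metabolism': ['01100  Metabolic pathways', '00010  Glycolysis / Gluconeogenesis'],
    'Cellular Processes': ['04110  Cell cycle'],
}

def split_pathway(entry):
    """Split '01100  Metabolic pathways' into ('01100', 'Metabolic pathways')."""
    # split() without a separator treats the run of spaces as one delimiter;
    # maxsplit=1 keeps spaces inside the pathway name intact.
    pathway_id, pathway_name = entry.split(maxsplit=1)
    return pathway_id, pathway_name

# Map each pathway ID to its major category.
pathway_id_to_major = {}
for major, entries in sample_major_to_pathways.items():
    for entry in entries:
        pathway_id, _ = split_pathway(entry)
        pathway_id_to_major[pathway_id] = major

print(split_pathway('01100  Metabolic pathways'))
print(pathway_id_to_major['04110'])
```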