# SI Scraper

This notebook scrapes data from the [Siamensis Species Index (SI)](http://www.siamensis.org/species_index).

## Scraping

- Get the tree data object from Siamensis SI.
- Walk through all child nodes recursively.
- Get `id`, `num_children`, and `children_ids` from each node and store them in a list.
- Download the `html` page of each `id` into the `../node` folder.

## Parsing

- Loop through each item in the scraped list.
- Parse each `html` page and store the parsed data in a dict.
- Save the parsed data as `json`.
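
The parsing steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard-library `html.parser` as a stand-in for BeautifulSoup (which the notebook imports), and it extracts only the page `<title>` as a placeholder field, because the fields actually parsed are not listed in this section. The names `TitleParser`, `parse_nodes`, `html_dir`, and `out_path` are hypothetical.

```python
import json
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element (placeholder field)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_nodes(nodes, html_dir, out_path):
    """Parse the saved page of each scraped node and dump the results as JSON."""
    html_dir = Path(html_dir)
    parsed = []
    for node in nodes:
        page = html_dir / f"{node['id']}.html"
        if not page.exists():
            # page was never downloaded; skip it instead of failing
            continue
        parser = TitleParser()
        parser.feed(page.read_text(encoding="utf-8"))
        # keep the scraped fields and add the parsed placeholder field
        parsed.append({**node, "title": parser.title.strip()})
    Path(out_path).write_text(
        json.dumps(parsed, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return parsed
```

With the scraped list and the `../node` folder in place, a call might look like `parse_nodes(extracted_node, '../node', '../parsed.json')`, where the output path is again an assumption.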


### Get Data Object

In [108]:
import requests
import json
import re
import pandas as pd
from os import path, pardir, mkdir
from bs4 import BeautifulSoup
from tqdm import tqdm

In [2]:
# get whole tree from endpoint
r = requests.get('http://www.siamensis.org/json?type=tree')
# get json from request
si_json = r.json()[0][0]
# show keys of json obj
si_json.keys()

dict_keys(['data', 'attr', 'mlid', 'num_children', 'children'])

### Scraping

In [101]:
# function to get ids of all children in a list
def idGetter(children_ls):
    ids = []
    for child in children_ls:
        ids.append(child['attr']['link'].split('/')[-1])
    return ids

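# function to scrape data in the object recursively
def scraper(obj, keeper=None):
    # create a fresh list per top-level call; a mutable default ([]) would
    # be shared across calls and accumulate duplicates when the cell is re-run
    if keeper is None:
        keeper = []
    # each item is stored in dict
    item_dict = dict()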
    # loop through keys in the object
    for key in obj.keys():
        # take attr link as an id and put in dict
        if key == 'attr':
            link_id = obj[key]['link'].split('/')[-1]
            # print(f"getting data of node id: {link_id}..")
            item_dict['id'] = link_id
        # get ids of children and count and put in dict
        elif key == 'children':
            all_ids = idGetter(obj[key])
            item_dict['num_children'] = len(all_ids)
            item_dict['children_ids'] = all_ids
            for item in obj[key]:
                # then scrape each child object with scraper
                # this recurses until a node has no more children
                scraper(item, keeper)
    # store each item dict in a keeper
    keeper.append(item_dict)
    # and return when all is done
    return keeper

In [97]:
extracted_node = scraper(si_json)
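
To make the shape of the resulting list concrete, here is a minimal sketch on a made-up tree. The tree values and the helper name `collect_nodes` are hypothetical; the function only mirrors the traversal described above and assumes each node has an `attr.link` and a `children` list.

```python
# Toy tree mimicking the assumed shape of the Siamensis JSON (values are made up).
toy_tree = {
    "attr": {"link": "/species_index/node/1"},
    "children": [
        {"attr": {"link": "/species_index/node/2"}, "children": []},
        {
            "attr": {"link": "/species_index/node/3"},
            "children": [
                {"attr": {"link": "/species_index/node/4"}, "children": []},
            ],
        },
    ],
}


def collect_nodes(obj, keeper=None):
    """Hypothetical stand-in for the scraper: flatten a tree into a list of dicts."""
    if keeper is None:
        keeper = []
    children = obj.get("children", [])
    child_ids = [c["attr"]["link"].split("/")[-1] for c in children]
    # visit the children first, so each parent is appended after its subtree
    for child in children:
        collect_nodes(child, keeper)
    keeper.append({
        "id": obj["attr"]["link"].split("/")[-1],
        "num_children": len(child_ids),
        "children_ids": child_ids,
    })
    return keeper


nodes = collect_nodes(toy_tree)
print([n["id"] for n in nodes])  # → ['2', '4', '3', '1']
```

Each parent is appended only after all of its descendants, so the root ends up last in the list.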

In [109]:
!mkdir -p ../node

In [110]:
for node in tqdm(extracted_node):
    url = f'http://www.siamensis.org/species_index/node/{node["id"]}'
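    r = requests.get(url)
    save_path = f'../node/{node["id"]}.html'

    # write as UTF-8 explicitly so the result does not depend on the platform's default encoding
    with open(save_path, 'w', encoding='utf-8') as f:
        f.write(r.text)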

100%|██████████| 6510/6510 [1:29:47<00:00,  1.12s/it]
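
Since the full download takes about an hour and a half, it may be worth making the loop resumable. The following is a sketch only, not the notebook's method: `download_nodes` and its `fetch` parameter are hypothetical names, and `fetch` is passed in as a callable so that the HTTP call stays swappable.

```python
import time
from pathlib import Path


def download_nodes(node_ids, out_dir, fetch, delay=0.0):
    """Save one HTML file per node id, skipping ids that are already on disk.

    `fetch` is any callable that takes a node id and returns the page HTML
    as a string. Returns a list of (node_id, error) pairs for failed ids.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for node_id in node_ids:
        target = out_dir / f"{node_id}.html"
        if target.exists():
            # already downloaded in an earlier run
            continue
        try:
            html = fetch(node_id)
        except Exception as exc:
            # remember the failure and keep going with the remaining ids
            failed.append((node_id, repr(exc)))
            continue
        target.write_text(html, encoding="utf-8")
        time.sleep(delay)  # optional pause between requests
    return failed


# Possible `fetch` for this notebook (assumes `requests` is imported):
# def fetch(node_id):
#     r = requests.get(f"http://www.siamensis.org/species_index/node/{node_id}", timeout=30)
#     r.raise_for_status()
#     return r.text
```

Re-running `download_nodes([n["id"] for n in extracted_node], '../node', fetch)` after an interruption would then fetch only the pages that are still missing.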
