# SI Scraper

This notebook scrapes data from the [Siamensis species index (SI)](http://www.siamensis.org/species_index).

## Scraping

- Get the tree data object from Siamensis SI.
- Walk through all child nodes recursively.
- Get `id`, `num_children`, and `children_ids` from each node and store them in a list.
- Download the `html` page of each `id` into the `../node` folder.

## Parsing

- Loop through each item in the scraped list.
- Parse each `html` page and store the parsed data in a dict.
- Save the parsed data as `json`.
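
The parsing steps above might look like the following minimal sketch, which uses only the standard library. It assumes node titles follow a `Rank: Name` pattern such as `Phylum: Crenarchaeota`; `parse_title`, `save_parsed`, and any output file name are hypothetical and chosen for illustration, not taken from the notebook.

```python
import json
import re

# Assumption: node titles look like "Rank: Name", e.g. "Phylum: Crenarchaeota".
TITLE_PATTERN = re.compile(r'^\s*([A-Za-z]+)\s*:\s*(.+?)\s*$')


def parse_title(title):
    """Split a node title into rank and rank name (hypothetical helper)."""
    match = TITLE_PATTERN.search(title)
    if match is None:
        return {'rank': None, 'rank_name': None}
    return {'rank': match.group(1), 'rank_name': match.group(2)}


def save_parsed(records, file_path):
    """Write parsed records as JSON, keeping Thai text human-readable."""
    with open(file_path, 'w', encoding='utf-8') as f:
        json.dump(records, f, ensure_ascii=False, indent=2)


# Build one record from a node id and its title text.
record = {'id': '6183', **parse_title('  Phylum: Crenarchaeota  ')}
```

Passing `ensure_ascii=False` keeps Thai author names and dates as readable text in the output file instead of `\uXXXX` escapes.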


### Get Data Object

In [177]:
import requests
import json
import re
import pandas as pd
from os import path, pardir, mkdir
from bs4 import BeautifulSoup
from tqdm import tqdm
from glob import glob
from copy import deepcopy
from dateutil import parser

In [2]:
# get whole tree from endpoint
r = requests.get('http://www.siamensis.org/json?type=tree')
# get json from request
si_json = r.json()[0][0]
# show keys of json obj
si_json.keys()

dict_keys(['data', 'attr', 'mlid', 'num_children', 'children'])

### Scraping

In [101]:
# function to get ids of all children in a list
def idGetter(children_ls):
    ids = []
    for child in children_ls:
        ids.append(child['attr']['link'].split('/')[-1])
    return ids

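# function to scrape data in the object recursively
def scraper(obj, keeper=None):
    # use a fresh list per top-level call; a mutable default (keeper=[])
    # is shared across calls and would accumulate duplicates on re-runs
    if keeper is None:
        keeper = []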
    # each item is stored in dict
    item_dict = dict()
    # loop through keys in the object
    for key in obj.keys():
        # take attr link as an id and put in dict
        if key == 'attr':
            link_id = obj[key]['link'].split('/')[-1]
            # print(f"getting data of node id: {link_id}..")
            item_dict['id'] = link_id
        # get ids of children and count and put in dict
        elif key == 'children':
            all_ids = idGetter(obj[key])
            item_dict['num_children'] = len(all_ids)
            item_dict['children_ids'] = all_ids
            for item in obj[key]:
                # then scrape each child object with scraper
                # this recurses until a node has no more children
                scraper(item, keeper)
    # store each item dict in a keeper
    keeper.append(item_dict)
    # and return when all is done
    return keeper

In [131]:
extracted_node = scraper(si_json)
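
To see why leaf nodes end up before their parents in the result, here is a minimal, self-contained sketch of the same post-order traversal on a mock tree. The mock data only imitates the `attr`/`children` shape of the real object; the links and the name `collect_nodes` are made up for illustration.

```python
def collect_nodes(obj, keeper=None):
    """Post-order walk: record every child before its parent."""
    if keeper is None:
        keeper = []  # fresh list per top-level call
    item = {'id': obj['attr']['link'].split('/')[-1]}
    children = obj.get('children', [])
    if children:
        item['num_children'] = len(children)
        item['children_ids'] = [c['attr']['link'].split('/')[-1] for c in children]
        for child in children:
            collect_nodes(child, keeper)
    keeper.append(item)
    return keeper


# Mock tree (hypothetical links): node 1 has two leaf children, 2 and 3.
mock_tree = {
    'attr': {'link': 'species_index/node/1'},
    'children': [
        {'attr': {'link': 'species_index/node/2'}},
        {'attr': {'link': 'species_index/node/3'}},
    ],
}

nodes = collect_nodes(mock_tree)
print([n['id'] for n in nodes])  # ['2', '3', '1']
```

Because `keeper` defaults to `None`, calling `collect_nodes` twice returns two independent lists.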

In [109]:
!mkdir -p ../node

In [110]:
# save all extracted_node as html per node
# this cell can take about 1.5 hrs; skip it if the files are already in ../node
for node in tqdm(extracted_node):
    url = f'http://www.siamensis.org/species_index/node/{node["id"]}'
    r = requests.get(url)
    save_path = f'../node/{node["id"]}.html'
    
    # write UTF-8 explicitly because the pages contain Thai text
    with open(save_path, 'w', encoding='utf-8') as f:
        f.write(r.text)

100%|██████████| 6510/6510 [1:29:47<00:00,  1.12s/it]
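
Since the download takes about 1.5 hours, it may be worth making the loop resumable. The sketch below is one possible approach, not what the notebook does: it skips files that already exist and takes the fetch function as a parameter so the logic can be tried without network access. `download_nodes` and `fetch` are hypothetical names; in the notebook, `fetch` could be something like `lambda url: requests.get(url).text`.

```python
from pathlib import Path

BASE_URL = 'http://www.siamensis.org/species_index/node/'


def download_nodes(node_ids, out_dir, fetch):
    """Save one HTML file per node id, skipping files already on disk.

    `fetch` is any callable that maps a URL to the page text.
    Returns the ids that were actually downloaded in this run.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for node_id in node_ids:
        target = out_path / f'{node_id}.html'
        if target.exists():
            continue  # already saved by an earlier run
        html = fetch(f'{BASE_URL}{node_id}')
        # write UTF-8 explicitly because the pages contain Thai text
        target.write_text(html, encoding='utf-8')
        downloaded.append(node_id)
    return downloaded
```

Re-running the function after an interruption only requests the pages that are still missing from the output folder.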


### Parsing

In [123]:
nodes = glob('../node/*.html')

In [139]:
copied_extracted_node = deepcopy(extracted_node)

In [182]:
counter = 0

for each in copied_extracted_node:
    file_name = f"../node/{each['id']}.html"
    
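    # close the file after parsing and read it as UTF-8 (pages contain Thai text)
    with open(file_name, "r", encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")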

    # rank and rank name
    tmp = soup.select('.node-title')[0].text
    
    rank_pt = re.compile(r'^\s*([a-zA-Z]+)\s*\:')
    rank = rank_pt.search(tmp).group(1)
    
    rank_name_pt = re.compile(r':\s*([\(\)a-zA-Z]+)\s*$')
    rank_name = rank_name_pt.search(tmp).group(1)
    
    # author, timestamp and modified
    tmp = soup.select('.node-submitted')[0].text
    
    author_pt = re.compile(r'.*เขียนโดย (.*) เมื่อ.*')
    author = author_pt.search(tmp).group(1)
    
    timestamp_pt = re.compile(r'.*เมื่อ (.*)$')
    timestamp = timestamp_pt.search(tmp).group(1)
    
    print(author)
    print(timestamp)
    print(parser.parse(timestamp))
    
    print(soup.prettify())
    print(each)
    print('\n---------\n')
    
    
    if counter >= 5:
        break
    counter += 1

lepton
11 March 11 12:15
2011-03-11 12:15:00
<div class="node-body">
 <h2 class="node-title">
  Phylum: Crenarchaeota
 </h2>
 <div class="node-header">
 </div>
 <div class="node-submitted">
  เขียนโดย lepton เมื่อ 11 March 11 12:15
 </div>
</div>
<ul class="revise">
</ul>
<div class="node-count-wrapper">
 <div class="node-count">
  อ่าน 1,144 ครั้ง
 </div>
 <div id="fb-share-footer" style="display:none">
 </div>
 <div class="fb-share-button" data-href="species_index?id=6183" data-layout="button_count">
 </div>
</div>
<script type="text/javascript">
 jstree_nav = jQ.parseJSON('["673","708","710","6183"]');
</script>
{'id': '6183'}

---------

lepton
11 March 11 12:16
2011-03-11 12:16:00
<div class="node-body">
 <h2 class="node-title">
  Phylum: Euryarchaeota
 </h2>
 <div class="node-header">
 </div>
 <div class="node-submitted">
  เขียนโดย lepton เมื่อ 11 March 11 12:16
 </div>
</div>
<ul class="revise">
</ul>
<div class="node-count-wrapper">
 <div class="node-count">
  อ่าน 842 ครั้ง
 

ValueError: ('Unknown string format:', '17 กันยายน 2553 ')
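
The error above comes from a timestamp written with a Thai month name, which `parser.parse` could not handle. One possible workaround is sketched below. `parse_thai_date` is a hypothetical helper, and it assumes such dates follow a `day month year` layout with the year in the Buddhist Era (543 years ahead of the Gregorian year); this has not been checked against every page.

```python
from datetime import datetime

# Thai month names mapped to month numbers.
THAI_MONTHS = {
    'มกราคม': 1, 'กุมภาพันธ์': 2, 'มีนาคม': 3, 'เมษายน': 4,
    'พฤษภาคม': 5, 'มิถุนายน': 6, 'กรกฎาคม': 7, 'สิงหาคม': 8,
    'กันยายน': 9, 'ตุลาคม': 10, 'พฤศจิกายน': 11, 'ธันวาคม': 12,
}


def parse_thai_date(text):
    """Parse 'DD <Thai month> YYYY' with a Buddhist Era year.

    Returns a datetime, or None if the text does not fit this layout.
    """
    parts = text.split()
    if len(parts) != 3 or parts[1] not in THAI_MONTHS:
        return None
    day, month_name, year_be = parts
    if not (day.isdigit() and year_be.isdigit()):
        return None
    try:
        # Assumption: the year is Buddhist Era, so subtract 543.
        return datetime(int(year_be) - 543, THAI_MONTHS[month_name], int(day))
    except ValueError:
        return None  # e.g. day 32 or a year out of range


print(parse_thai_date('17 กันยายน 2553 '))  # 2010-09-17 00:00:00
```

In the parsing loop, this helper could be tried only after `parser.parse` raises `ValueError`, so English-style timestamps such as `11 March 11 12:15` keep going through `dateutil`.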

In [138]:
extracted_node

[{'id': '6183'},
 {'id': '6184'},
 {'id': '710', 'num_children': 2, 'children_ids': ['6183', '6184']},
 {'id': '708', 'num_children': 1, 'children_ids': ['710']},
 {'id': '6210'},
 {'id': '6209'},
 {'id': '6208'},
 {'id': '6207'},
 {'id': '6206'},
 {'id': '6205'},
 {'id': '6204'},
 {'id': '41118'},
 {'id': '41119'},
 {'id': '41117', 'num_children': 2, 'children_ids': ['41118', '41119']},
 {'id': '41116', 'num_children': 1, 'children_ids': ['41117']},
 {'id': '41115', 'num_children': 1, 'children_ids': ['41116']},
 {'id': '41114', 'num_children': 1, 'children_ids': ['41115']},
 {'id': '6203', 'num_children': 1, 'children_ids': ['41114']},
 {'id': '6202'},
 {'id': '6201'},
 {'id': '6200'},
 {'id': '6199'},
 {'id': '6198'},
 {'id': '6197'},
 {'id': '6196'},
 {'id': '6195'},
 {'id': '6192'},
 {'id': '6191'},
 {'id': '6190'},
 {'id': '6189'},
 {'id': '6188'},
 {'id': '6187'},
 {'id': '6186'},
 {'id': '6185'},
 {'id': '711',
  'num_children': 24,
  'children_ids': ['6210',
   '6209',
   '620