# 台灣教育部國語辭典簡編本 網絡爬蟲
# Taiwan Ministry of Education Concise Chinese Dictionary Web Crawler/Scraper
## (for Kindle dictionary)

This notebook scrapes the Taiwan Ministry of Education (MOE) website for the concise edition of its Chinese dictionary: [http://dict.concised.moe.edu.tw/](http://dict.concised.moe.edu.tw/). The scraper gathers the data into a .json file formatted similarly to the popular [g0v/moedict-data](https://github.com/g0v/moedict-data) repo. The motivation, however, is to gather data for a proper Traditional Chinese Kindle dictionary, which does not need as much cleanup after scraping, so the format is not exactly the same as that repo's.

The approach is to use the listings by stroke count to gather links to each entry, then extract at least the title, bopomofo, pinyin, and definition from each entry page. Some entries also have synonyms and antonyms, which are extracted as well.

This notebook does require some babysitting as explained in the next markdown cell.

In [2]:
import json
import random  # used for the randomized delay between retries
import re
import time

from requests_html import HTMLSession

It seems that the MOE server generates a unique 6-character 'ccd' identifier (apparently a session id) that only works for a limited number of requests. See stroke_main_url below, where ccd=KD6FAc. I'm not sure how to get a new ccd programmatically after the current one expires, but once it does (the program will let you know), manually load [http://dict.concised.moe.edu.tw/](http://dict.concised.moe.edu.tw/), go to an entry, copy the new ccd from the URL into the ccd variable below, and rerun the notebook.
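
As a small convenience for the copy-paste step, the ccd can be pulled out of a pasted URL with the standard library. This is only a sketch: `extract_ccd` is a hypothetical helper that is not part of the notebook, and the example URL is a shortened version of the one used below.

```python
from urllib.parse import urlparse, parse_qs

def extract_ccd(url):
    """Return the value of the 'ccd' query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get('ccd')
    return values[0] if values else None

# shortened example URL, assumed to follow the same pattern as stroke_main_url
example_url = 'http://dict.concised.moe.edu.tw/cgi-bin/jbdic/gsweb.cgi?ccd=KD6FAc&o=e0&sec=sec1'
print(extract_ccd(example_url))
```

The returned string could then be assigned to the ccd variable instead of typing it by hand.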

The cell below doesn't always succeed, even with the retries. After an unsuccessful run with a new ccd, let it cool down for a few seconds and try again.
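
The retry-with-random-delay pattern used in the next cell can also be written as a reusable function. The sketch below is an assumption about how one might factor it out, not what the notebook does: `fetch_with_retries`, `fetch`, and `is_valid` are hypothetical names, and the demo uses a fake fetch function instead of a real request.

```python
import random
import time

def fetch_with_retries(fetch, is_valid, max_tries=10, max_delay=2.0):
    """Call fetch() until is_valid(result) is true or max_tries is reached.

    Sleeps a random time in [0, max_delay) seconds between attempts.
    Returns the first valid result, or None if every attempt failed.
    """
    for attempt in range(1, max_tries + 1):
        result = fetch()
        if is_valid(result):
            return result
        if attempt < max_tries:
            time.sleep(max_delay * random.random())
    return None

# demo with a fake fetch that only "succeeds" on the third call
state = {'calls': 0}

def fake_fetch():
    state['calls'] += 1
    return state['calls']

print(fetch_with_retries(fake_fetch, lambda r: r >= 3, max_delay=0.01))
```

In the notebook's setting, `fetch` would wrap `session.get(stroke_main_url)` and `is_valid` would check that the page contains links.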

In [10]:
session = HTMLSession()

def process_url(url, ccd):
    return re.sub(r'(?<=ccd=).{6}', ccd, url)

ccd = 'Pj1GJJ'
# this is the main webpage that has all the buttons for number of strokes
# from clicking '筆畫索引'
stroke_main_url = 'http://dict.concised.moe.edu.tw/cgi-bin/jbdic/gsweb.cgi?ccd=KD6FAc&o=e0&&sec=sec1&brwtyp=sco&brwsimpfmt=11&field_1=sco1&field_1_value=XXX&field_2=sco2&brwsortby=wqyx&active=bhbrw'
stroke_main_url = process_url(stroke_main_url, ccd)

tries = 0
MAX_TRIES = 10
while tries < MAX_TRIES:
    print('Attempt {}/{}'.format(tries+1, MAX_TRIES))
    stroke_main = session.get(stroke_main_url)
    if len(stroke_main.html.absolute_links) > 0:
        # get all 1畫, 2畫, ... links
        stroke_urls_elems = stroke_main.html.find('[class=sco_bst_lv1]')
        stroke_urls = [url.absolute_links.pop() for url in stroke_urls_elems]
        print('Success. {} stroke urls found. (Should be exactly 33)'.format(len(stroke_urls)))
        break
    tries += 1
    time.sleep(2*random.random())
    if tries == MAX_TRIES:
        print('Attempts unsuccessful. Try using new ccd session id.')
        

Attempt 1/10
Success. 33 stroke urls found. (Should be exactly 33)


## Classes to represent a dictionary entry

The overall hierarchy is:
DictEntry:
    title: str
    radical: str
    non_radical_stroke_count: int
    stroke_count: int
    heteronyms: [
        Heteronym:
            bopomofo: str
            pinyin: str
            synonyms: str
            antonyms: str
            definitions: [
                Definition:
                    defn: str
                    quote: str (planned; not currently stored)
            ]
        ]
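
To make the nesting concrete, here is a rough sketch of how such a hierarchy turns into JSON. It uses `types.SimpleNamespace` objects as stand-ins for the classes defined below, and the definition text is a placeholder rather than real dictionary data.

```python
import json
from types import SimpleNamespace

# stand-ins for Definition, Heteronym and DictEntry (placeholder data)
definition = SimpleNamespace(defn='1.placeholder definition text')
heteronym = SimpleNamespace(
    bopomofo='ㄅㄨˋ　ㄏㄠˇ',
    pinyin=None,
    synonyms=None,
    antonyms=None,
    definitions=[definition],
)
entry = SimpleNamespace(
    title='不好',
    radical=None,
    non_radical_stroke_count=None,
    stroke_count=None,
    heteronyms=[heteronym],
)

# default= is called for every object json cannot serialize on its own,
# so nested objects are converted through their __dict__ as well
as_json = json.dumps(entry, default=lambda o: o.__dict__, ensure_ascii=False)
restored = json.loads(as_json)
print(restored['heteronyms'][0]['definitions'][0]['defn'])
```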

In [4]:
class Definition:
    """Lowest level class for a dictionary entry. This contains the actual
    text definition for an entry. There are supposed to be more fields in this
    class (quote, synonyms, antonyms) but were taken out for ease of
    implementation for the Kindle. The full breakdown of the definition is TBD.
    For now, this class holds all definitions and examples for an entry 
    (everything in the 釋義 table cell). Synonyms and antonyms belong to the
    Heteronym class since the extracted text has numbers labelling which
    definition it corresponds to.
    """
    defn: str

    def __init__(self, defn: str = None):
        """Initializes definition.
        """
        self.defn = defn

    def __repr__(self) -> str:
        return json.dumps(self.__dict__, default=lambda o: o.__dict__, ensure_ascii=False)

class Heteronym:
    """Represents all data for a pronunciation of a specific dictionary entry.
    For example for a DictEntry title of 不好, there would be two heteronyms 
    ㄅㄨˊ　ㄏㄠˋ and ㄅㄨˋ　ㄏㄠˇ, each with its own set of definitions/quotes/
    synonyms/antonyms.
    """
    bopomofo: str
    pinyin: str
    definitions: list
    synonyms: str
    antonyms: str

    def __init__(self):
        """Initializes empty heteronym.
        """
        self.bopomofo = None
        self.pinyin = None
        self.definitions = []
        self.synonyms = None
        self.antonyms = None

    def __repr__(self) -> str:
        return json.dumps(self.__dict__, default=lambda o: o.__dict__, ensure_ascii=False)

class DictEntry:
    """A class to represent a Chinese entry in a dictionary. Each entry may
    contain several heteronyms. This class was designed so that its __dict__
    converts easily to json in a post-processing step.
    
    title (str): the dictionary entry title (dictionary key)
    heteronyms (list): list of different Heteronym objects
    radical (str): the radical of the character (only for single characters)
    non_radical_stroke_count (int): (only for single characters)
    stroke_count (int): (only for single characters)
    """
    title: str
    heteronyms: list
    radical: str
    non_radical_stroke_count: int
    stroke_count: int
    
    def __init__(self):
        """Initializes an empty dictionary entry.
        """
        self.title = None
        self.heteronyms = []
        self.radical = None
        self.non_radical_stroke_count = None
        self.stroke_count = None

    def __eq__(self, other) -> bool:
        """Primarily used to lookup and append heteronyms to existing entries.
        """
        if isinstance(other, DictEntry):
            return self.title == other.title
        return False
        
    def __repr__(self) -> str:
        # return self as __dict__, and return sub-objects as __dict__'s too
        return json.dumps(self.__dict__, default=lambda o: o.__dict__, ensure_ascii=False)


## Helper Functions

In [5]:
# html snippet of the 'example' image used to delineate definitions from examples/quotes
example_snippet = '<img src="/jbdic/cfont/9b51._104_0.gif" border="0" alt="&#x4F8B;" class="fontimg" id="@extrafont" />'
# html snippet used to express ㄧ (the bopomofo 'i') in 注音
i_snippet = '<font class="zinspc1">ㄧ</font>'

def parse_def_page(def_page, existing_entry: DictEntry = None) -> DictEntry:
    """Given a requests_html page of a moe dictionary entry, parse the data and
    grab information like title, radical, stroke counts, synonyms, antonyms,
    and definitions. An existing DictEntry may be given to append a heteronym
    in place.
    Inputs:
        def_page (requests_html page): page of a moe dictionary entry
        existing_entry (DictEntry): An existing entry for a heteronym to be 
            appended to
    """
    # grab headers on the left column
    row_headers = def_page.html.find('[class=std1]')
    # grab data on the right column
    row_data = def_page.html.find('[class=std2]')

    entry = DictEntry()
    if existing_entry:
        entry = existing_entry

    # this page represents a heteronym, so always create a new one
    heter = Heteronym()

    # process each row to build up heteronym/definition
    for i in range(len(row_headers)):
        header = row_headers[i]
        data = row_data[i]

        # process each row by looking for keywords in the header 
        
        # look for first row by searching for a pronunciation sound file element
        if len(header.find('[id=layoutwav]')) > 0:
            # get rid of enclosing brackets
            entry.title = data.text[1:data.text.find('】')]

            # if this is a single character, then also get radical and stroke counts
            # e.g. 【不】一-3-4
            if '-' in data.text:
                entry.radical = data.text[data.text.find('】')+1]
                dash_1 = data.text.find('-') + 1
                dash_2 = data.text.find('-', dash_1)
                entry.non_radical_stroke_count = int(data.text[dash_1:dash_2])
                entry.stroke_count = int(data.text[dash_2+1:])
        elif '注音' in header.text:
            # replace the odd-looking vertical ㄧ markup with the better-looking 一
            data.html = data.html.replace(i_snippet, '一')
            # note: use .full_text here to avoid weird \n placed everywhere with .text
            heter.bopomofo = data.full_text
            # print(data.text)
        elif '漢語拼音' in header.text:
            # note: use .full_text here to avoid weird \n placed everywhere with .text
            heter.pinyin = data.full_text
            # print(data.text)
        elif '相似詞' in header.text:
            heter.synonyms = data.text
            # print(data.text)
        elif '相反詞' in header.text:
            heter.antonyms = data.text
            # print(data.text)
        elif '釋義' in header.text:
            # replace the example '例' image html with regular text instead
            EX_TAG = '例:'
            data.html = data.html.replace(example_snippet, EX_TAG)            

            # there is some inconsistency in the formatting of pages
            # for ordered lists, sometimes they are in a <ol><li></li></ol> list,
            # other times it is just a big <p> element with hardcoded <br> and 
            # numbers

            # default is second case (big <p> with hardcoded <br> and numbers)
            defn = data.text

            # take care of first case (<ol></ol>)
            lis = data.find('li')
            if len(lis) > 0:
                defns = [li.text for li in lis]
                # reconstruct text in same style as second case
                defn = ''
                for j in range(len(defns)):
                    # e.g. 2.第二解釋\n
                    defn += '{}.{}\n'.format(str(j+1), defns[j])            

            # add this newly constructed definition to heteronym
            heter.definitions.append(Definition(defn))

            # print(data.text)
    
    # add this newly constructed heteronym to existing list (likely empty)
    entry.heteronyms.append(heter)
    
    return entry

def del_none(d):
    """
    Delete keys with the value ``None`` in a dictionary, recursively.

    This alters the input so you may wish to ``copy`` the dict first.
    """
    # For Python 3, write `list(d.items())`; `d.items()` won’t work
    # For Python 2, write `d.items()`; `d.iteritems()` won’t work
    for key, value in list(d.items()):
        if value is None:
            del d[key]
        elif isinstance(value, dict):
            del_none(value)
    return d  # For convenience

def remove_nulls(d):
    return {k: v for k, v in d.items() if v is not None}
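
The `object_hook` argument of `json.loads` is called on every decoded JSON object, innermost first, which is why a one-line helper like `remove_nulls` is enough to strip None fields at every nesting level. A minimal sketch with placeholder data:

```python
import json

def remove_nulls(d):
    # same helper as above: drop keys whose value is None
    return {k: v for k, v in d.items() if v is not None}

# placeholder entry with None fields at two nesting levels
raw = json.dumps({
    'title': '不好',
    'radical': None,
    'heteronyms': [{'pinyin': None, 'definitions': [{'defn': 'placeholder'}]}],
})

# object_hook is applied to each decoded dict, including nested ones
cleaned = json.loads(raw, object_hook=remove_nulls)
print(cleaned)
```

Both `radical` and the nested `pinyin` are dropped, while the lists and the remaining keys are left as they were.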

## Main

The crawl is depth first, so that a session timeout does not complicate resuming a run. It goes in increasing stroke count order and saves all entries for a root character before continuing to the next one. All entries for a given stroke count are saved in a separate file.

This is the cell to keep an eye on for babysitting. Watch for an AssertionError saying it needs a new ccd. To fix:
1. Go visit a dictionary entry on the MOE [website](http://dict.concised.moe.edu.tw/) and copy over the ccd from the URL into the ccd variable in the first cell. 
2. Set stroke_start_ind at the top of the code below to restart at whatever stroke count it was last working on. To resume partway through that stroke count, also set pickup_previous to True and root_char_start_ind to the index of the root character to resume from.
3. Rerun whole notebook with new ccd.
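
Because a run can be interrupted at any point, it may help to make the per-page save harder to corrupt. The sketch below is an optional hardening idea, not what the notebook does: `save_checkpoint` and `load_checkpoint` are hypothetical helpers that write to a temporary file first and then swap it into place with `os.replace`.

```python
import json
import os
import tempfile

def save_checkpoint(entries, path):
    """Write entries as JSON to a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    with os.fdopen(fd, 'w', encoding='utf-8') as f:
        json.dump(entries, f, indent=4, ensure_ascii=False)
    # os.replace overwrites the destination in a single step
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Return previously saved entries, or an empty list if no file exists."""
    if not os.path.exists(path):
        return []
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)

# demo in a throwaway directory so no real entries file is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'entries_demo.txt')
    save_checkpoint([{'title': '串'}], demo_path)
    print(load_checkpoint(demo_path))
```

With this approach, an interruption in the middle of a save leaves the previous file intact instead of a half-written one.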

In [9]:
stroke_start_ind = 6
pickup_previous = True
root_char_start_ind = 292

# scrape and process in increasing total stroke count
# save entries for each stroke count into separate file
# this processing is stroke-centric to segment and continue runs since the 
# session seems to expire after a certain number of requests
rng_stroke = range(stroke_start_ind, len(stroke_urls))
for stroke_ind in rng_stroke:
    # holds all the DictEntry objects representing each entry in the dictionary
    all_entries = []

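    if pickup_previous:
        # resume from the file saved by a previous run; note that the loaded
        # entries are plain dicts, not DictEntry objects
        with open("entries_{}.txt".format(stroke_ind+1), "r", encoding='utf-8') as f:
            all_entries = json.load(f)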

    print('processing {} / {} 筆畫 urls'.format(stroke_ind+1, len(stroke_urls)))
    # get urls for all root characters of this # of strokes 
    # e.g. 1 stroke : 一，乙
    #      2 strokes: 丁，七 ...
    single_stroke_url = stroke_urls[stroke_ind]
    root_char_page = session.get(single_stroke_url)
    root_char_elems = root_char_page.html.find('[class=sco_bst_lv2]')

    assert len(root_char_elems) > 0, \
        'NO ROOT CHARS FOUND -- session probably expired, please update ccd in first cell'
   
    # grab the actual urls
    root_char_urls = [e.absolute_links.pop() for e in root_char_elems]
    # grab the actual root character in string form
    root_char_strs = [e.text for e in root_char_elems]
    # for i in range(len(root_char_strs)):
    #     print(i, root_char_strs[i])
    print(len(root_char_urls), 'root characters')
    

    rng_root_char = range(root_char_start_ind, len(root_char_urls)) if pickup_previous else range(len(root_char_urls))
    pickup_previous = False

    for root_char_ind in rng_root_char:
        # get urls for all entries starting with a certain root character
        # e.g. for 一: 一，一把罩，一把抓，一波 ...

        # start with page one (of possibly many pages of entries)
        root_char_page_url = root_char_urls[root_char_ind]
        is_pages_left = True
        current_page = 1 # used to find url of next page

        while is_pages_left:
            print('{} ({}/{} root chars for {} 畫) page {}'.format(root_char_strs[root_char_ind], root_char_ind+1, len(root_char_urls), stroke_ind+1, current_page))
            # print(root_char_page_url)
            
            entry_page = session.get(root_char_page_url)
            # get all entry links, which are 'slink' classes but without titles 
            # (which are pagination links)
            entry_elems = entry_page.html.find('[class=slink]:not([title])')

            assert len(entry_elems) > 0, \
                'NO ENTRIES FOUND -- session probably expired, please update ccd in first cell'

            # for each entry, go into url and extract definition
            for e in entry_elems:
                entry_url = e.absolute_links.pop()
                print('found entry {}'.format(e.text))
                def_page = session.get(entry_url)

                assert def_page is not None, \
                    'NO ENTRIES FOUND -- session probably expired, please update ccd in first cell'
                
                to_search = DictEntry()
                to_search.title = e.text
                # see if title exists already, if so, need to add as extra heteronym
                if to_search in all_entries:
                    print('adding heteronym {}'.format(e.text))
                    # is a heteronym (different pronunciation e.g. 樂:ㄌㄜˋ / ㄩㄝˋ)
                    # find existing entry
                    existing_entry = all_entries[all_entries.index(to_search)]
                    # method should update entry in place
                    parse_def_page(def_page, existing_entry)
                else:
                    new_entry = parse_def_page(def_page)
                    all_entries.append(new_entry)
                    # print(new_entry)
                    
            # get url for next page
            next_page_elem = entry_page.html.find('[title="{}"]'.format(current_page + 1), first=True)
            if next_page_elem is None:
                is_pages_left = False
                print('No more pages, moving on to next root character')
            else:
                root_char_page_url = next_page_elem.absolute_links.pop()
                current_page += 1

            # save entry results after each page (~10 entries)
            print('saving results...', end='')
            # convert to json string then load back into dict while cleaning out
            # None fields (like most synonyms/antonyms/radical/stroke counts)
            json_string = json.dumps(all_entries, default=lambda o: o.__dict__, ensure_ascii=False)
            cleaned_dic = json.loads(json_string, object_hook=remove_nulls)
            # the with block closes the file even if writing fails
            with open("entries_{}.txt".format(stroke_ind+1), "w", encoding='utf-8') as f:
                f.write(json.dumps(cleaned_dic, indent=4, ensure_ascii=False))
            print('done')
    #         break
    #     break
    # break
        
    

processing 7 / 33 筆畫 urls
293 root characters
串 (1/293 root chars for 7 畫) page 1
found entry 串
found entry 串門子
found entry 串通
found entry 串聯
found entry 串供
found entry 串燒
found entry 串演
No more pages, moving on to next root character
saving results...done
亨 (2/293 root chars for 7 畫) page 1
found entry 亨
found entry 亨
adding heteronym 亨
found entry 亨通
No more pages, moving on to next root character
saving results...done
伯 (3/293 root chars for 7 畫) page 1
found entry 伯
found entry 伯
adding heteronym 伯
found entry 伯伯
found entry 伯母
found entry 伯父
found entry 伯樂
found entry 伯勞
found entry 伯仲
found entry 伯仲之間
No more pages, moving on to next root character
saving results...done
估 (4/293 root chars for 7 畫) page 1
found entry 估
found entry 估量
found entry 估計
found entry 估價
found entry 估測
found entry 估算
found entry 估
adding heteronym 估
found entry 估衣
No more pages, moving on to next root character
saving results...done
伴 (5/293 root chars for 7 畫) page 1
found entry 伴
found entry 伴娘
found e

AssertionError: NO ENTRIES FOUND -- session probably expired, please update ccd in first cell