# Crawling Khmer Times

## Overview

The dataset for Khmer Times is prepared as follows:

### 1. Web Scraping

The scraper is written in Python and uses the `requests` and `BeautifulSoup` libraries. A function crawls the search results pages of the Khmer Times website and extracts the URLs of the news articles. It takes a search keyword as input and requests successive results pages until one returns a 404 error, which indicates that there are no more results.

The structure of the search results pages is as follows:

- Each article is contained within an `<article>` tag with the class `item item-media`.
- The title of the article is contained within an `<h2>` tag with the class `item-title`.
- The URL of the article is contained within an `<a>` tag within the `<h2>` tag.

The function extracts the title and URL of each article and stores them in a list of dictionaries.
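
The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual fetcher: the function names, the `max_pages` safety cap, and in particular the search URL pattern in `fetch_with_requests` are assumptions made for the example. Only the tag and class names come from the page structure described above.

```python
from typing import Callable, Dict, List, Tuple

from bs4 import BeautifulSoup


def parse_search_results(html: str) -> List[Dict[str, str]]:
    """Extract the title and URL of every article on one results page."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # Each article sits in <article class="item item-media">.
    for article in soup.select("article.item.item-media"):
        # The link lives inside <h2 class="item-title">.
        link = article.select_one("h2.item-title a")
        if link is None or not link.get("href"):
            continue  # skip entries without a usable link
        results.append({"title": link.get_text(strip=True), "url": link["href"]})
    return results


def crawl_search_results(
    keyword: str,
    fetch: Callable[[str, int], Tuple[int, str]],
    max_pages: int = 1000,  # hypothetical safety cap, not from the source
) -> List[Dict[str, str]]:
    """Request pages 1, 2, ... and stop at the first page that returns 404."""
    links: List[Dict[str, str]] = []
    for page in range(1, max_pages + 1):
        status, html = fetch(keyword, page)
        if status == 404:
            break  # no more results
        links.extend(parse_search_results(html))
    return links


def fetch_with_requests(keyword: str, page: int) -> Tuple[int, str]:
    """Hypothetical fetcher; the URL pattern below is an assumption."""
    import requests  # imported here so the parsing helpers work without it

    url = f"https://www.khmertimeskh.com/page/{page}/"
    response = requests.get(url, params={"s": keyword}, timeout=30)
    return response.status_code, response.text
```

Passing the fetcher in as an argument keeps the parsing and the stop-on-404 loop testable without network access; a call such as `crawl_search_results("NBC", fetch_with_requests)` would perform the real requests.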

### 2. Article Text Extraction

A second Python function follows the URLs extracted in the previous step and scrapes the text of the articles. It iterates over the list of URLs, sends a GET request to each one, and parses the HTML of the article page using BeautifulSoup.

The structure of the article pages is as follows:

- The text of the article is contained within multiple `<p>` tags, which are themselves contained within a `<div>` tag with the class `entry-content`.
- The categories of the article are contained within `<a>` tags within a `<div>` tag with the class `entry-meta`.
- The publication time of the article is contained within a `<time>` tag within the same `entry-meta` div.

The function extracts the text, categories, and publication time of each article and stores them in a list of dictionaries.
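
A possible shape for the per-page extraction is sketched below. The function name `parse_article` is hypothetical, and the sketch assumes that the `<time>` tag carries an ISO 8601 timestamp in its `datetime` attribute, which the description above does not confirm. If the timestamp cannot be parsed, `time` is left as `None`.

```python
from datetime import datetime
from typing import Any, Dict

from bs4 import BeautifulSoup


def parse_article(html: str) -> Dict[str, Any]:
    """Extract text, categories, and publication time from one article page."""
    soup = BeautifulSoup(html, "html.parser")

    # Article body: all <p> tags inside <div class="entry-content">.
    content = soup.select_one("div.entry-content")
    paragraphs = content.find_all("p") if content else []
    parts = [p.get_text().strip() for p in paragraphs]
    text = "\n".join(part for part in parts if part)

    # Categories and time both live in <div class="entry-meta">.
    meta = soup.select_one("div.entry-meta")
    categories = [a.get_text(strip=True) for a in meta.find_all("a")] if meta else []

    published = None
    time_tag = meta.find("time") if meta else None
    if time_tag is not None:
        # Assumption: the timestamp is in the `datetime` attribute.
        raw = time_tag.get("datetime") or time_tag.get_text(strip=True)
        try:
            published = datetime.fromisoformat(raw)
        except ValueError:
            published = None  # leave unparsed timestamps empty

    return {"text": text, "categories": categories, "time": published}
```

Calling `parse_article(requests.get(url).text)` for each collected URL and appending the results to a list would give the list of dictionaries described above.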

### 3. Data Serialization

The list of dictionaries containing the article data is then serialized to a JSON file using Python's `json` library. The `datetime` objects representing the publication times of the articles are converted to ISO 8601 formatted strings before serialization, as `datetime` objects are not JSON serializable.

The resulting JSON file contains an array of objects, one per article, each with three fields: `text` (the text of the article), `categories` (a list of the article's categories), and `time` (the publication time as an ISO 8601 formatted string).

This JSON file serves as the dataset for this project.
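
One way to handle the `datetime` conversion is a `default` hook passed to `json.dumps`, as in the sketch below. The helper names are hypothetical. The example record reuses the category and timestamp from the sample article shown later on this page, with placeholder text.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def encode_datetime(value: Any) -> str:
    """`default` hook for json.dumps: convert datetime objects to ISO 8601."""
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


def dump_articles(articles: List[Dict[str, Any]]) -> str:
    """Serialize the list of article dictionaries to a JSON array."""
    return json.dumps(articles, default=encode_datetime, ensure_ascii=False, indent=2)


def save_articles(articles: List[Dict[str, Any]], path: str) -> None:
    """Write the serialized articles to `path`."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(dump_articles(articles))


# Illustrative record; the text is a placeholder.
example = [
    {
        "text": "Placeholder article text.",
        "categories": ["Business"],
        "time": datetime(2023, 8, 2, 7, 18, 54, tzinfo=timezone(timedelta(hours=7))),
    }
]
print(dump_articles(example))
```

Because the hook only runs for values that `json` cannot serialize itself, strings and lists pass through unchanged, and any other unsupported type still raises `TypeError`.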


## Crawling Workflow

The crawling configuration is located in the `src/nbcpu/conf/fetcher` directory. You can print the configuration by running the following command:


In [16]:
!nbcpu +fetcher=khmer_all dryrun=true

## Command Line Interface for HyFI ##
{'about': {'authors': 'Young Joon Lee <entelecheia@hotmail.com>',
           'description': 'Quantifying Central Bank Policy Uncertainty in a '
                          'Highly Dollarized Economy: A Topic Modeling '
                          'Approach',
           'homepage': 'https://nbcpu.entelecheia.ai',
           'license': 'MIT',
           'name': 'Measuring Central Bank Policy Uncertainty'},
 'debug_mode': False,
 'dryrun': True,
 'fetcher': {'_config_group_': '/fetcher',
             '_config_name_': 'khmer_all',
             '_target_': 'nbcpu.fetcher.khmer.KhmerFetcher',
             'article_filename': 'articles.jsonl',
             'delay_between_requests': 0.0,
             'key_field': 'url',
             'link_filename': 'links.jsonl',
             'max_num_articles': None,
             'max_num_pages': None,
             'num_workers': 2,
             'output_dir': 'workspace/datasets/fetcher/khmer',
             'overwrite_existi

To crawl the news articles from the Khmer Times, run the following workflow:


In [21]:
!nbcpu +workflow=nbcpu tasks='[khmer_all]' \
    khmer_all.max_num_pages=1 khmer_all.max_num_articles=5 \
        khmer_all.search_keywords='[NBC]' \
            mode=__info__

[2023-08-15 14:50:35,465][hyfi.main.config][INFO] - HyFi project [nbcpu] initialized
[2023-08-15 14:50:36,516][nbcpu.fetcher.base][INFO] - Fetching links for keyword: NBC
[2023-08-15 14:50:36,517][nbcpu.fetcher.base][INFO] - [Keyword: NBC] Page: 1
[2023-08-15 14:50:36,903][nbcpu.fetcher.khmer][INFO] - Title: NBC to increase reserve requirements in foreign currency to 12.5%
[2023-08-15 14:50:36,903][nbcpu.fetcher.khmer][INFO] - URL: https://www.khmertimeskh.com/501335210/nbc-to-increase-reserve-requirements-in-foreign-currency-to-12-5/
[2023-08-15 14:50:36,904][nbcpu.fetcher.khmer][INFO] - Title: NBC inks deal with UnionPay to expand cross-border payment to China
[2023-08-15 14:50:36,904][nbcpu.fetcher.khmer][INFO] - URL: https://www.khmertimeskh.com/501322478/nbc-inks-deal-with-unionpay-to-expand-cross-border-payment-to-china/
[2023-08-15 14:50:36,904][nbcpu.fetcher.khmer][INFO] - Title: Rural credit institutions help people improve livelihoods, NBC says
[2023-08-15 14:50:36,904][nbcpu

Crawled articles are stored in a JSONL file. Each line is a JSON object with the following fields:

- `title`: the title of the article
- `url`: the URL of the article
- `keyword`: the keyword for which the article was found
- `categories`: the categories of the article
- `time`: the timestamp of the article
- `text`: the text of the article
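
A file in this layout can also be read with the standard library alone. The sketch below uses hypothetical helper names (`parse_jsonl`, `read_jsonl`) and additionally converts the `time` string back into a `datetime`, assuming it is always an ISO 8601 string as in the field list above.

```python
import json
from datetime import datetime
from typing import Any, Dict, Iterable, Iterator, List


def parse_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one record per non-empty line, with `time` parsed to a datetime."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        if record.get("time"):
            record["time"] = datetime.fromisoformat(record["time"])
        yield record


def read_jsonl(path: str) -> List[Dict[str, Any]]:
    """Read every record from a JSONL file at `path`."""
    with open(path, encoding="utf-8") as f:
        return list(parse_jsonl(f))
```

Parsing `time` up front makes it straightforward to sort or filter the articles by publication date.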

Example data look like this:


In [25]:
data = HyFI.load_jsonl(
    "/home/yjlee/workspace/projects/nbcpu/workspace/datasets/fetcher/khmer/articles.jsonl"
)
print(f"Number of articles: {len(data)}")
data[0]


Number of articles: 10


{'title': 'NBC to increase reserve requirements in foreign currency to 12.5%',
 'url': 'https://www.khmertimeskh.com/501335210/nbc-to-increase-reserve-requirements-in-foreign-currency-to-12-5/',
 'keyword': 'NBC',
 'categories': ['Business'],
 'time': '2023-08-02T07:18:54+07:00',
 'text': 'The National Bank of Cambodia (NBC) will increase the reserve requirements in foreign currency, especially US dollars of banks and financial institutions in the country to 12.5 percent in 2024 after this monetary policy instrument has been raised to nine percent since January 1, 2023 from seven percent during the pre-pandemic period, said an NBC report.\nHowever, the Semi-Annual Report 2023 released on Monday by NBC—Cambodia’s central bank and monetary authority—pointed out that the reserve requirements in riel would be kept unchanged at seven percent to encourage consumers to use the national currency more in the economy through higher possibility in releasing loans in riel to businesses and individ