# Dataset Preparation for Khmer Times

## Overview

The dataset for Khmer Times is prepared as follows:

### 1. Web Scraping

Web scraping is done in Python with the `requests` and `BeautifulSoup` libraries. A function crawls the search results pages of the Khmer Times website and extracts the URLs of the news articles. It takes a search keyword as input and requests successive results pages until one returns a 404 error, which indicates that there are no more results.

The structure of the search results pages is as follows:

- Each article is contained within an `<article>` tag with the classes `item` and `item-media` (`class="item item-media"`).
- The title of the article is contained within an `<h2>` tag with the class `item-title`.
- The URL of the article is contained within an `<a>` tag within the `<h2>` tag.

The function extracts the title and URL of each article and stores them in a list of dictionaries.
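The crawl described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function names and the `search_url_template` parameter are hypothetical, and the real URL pattern of the Khmer Times search pages is not given here, so the caller has to supply it. The selectors follow the page structure listed above.

```python
from urllib.parse import quote_plus

from bs4 import BeautifulSoup


def parse_search_results(html):
    """Return a list of {'title', 'url'} dicts for one search results page."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # The CSS selector matches <article> tags carrying both classes,
    # regardless of the order in which they appear in the attribute.
    for article in soup.select("article.item.item-media"):
        heading = article.find("h2", class_="item-title")
        link = heading.find("a", href=True) if heading else None
        if link is None:
            continue  # skip entries that lack a title link
        results.append({"title": link.get_text(strip=True), "url": link["href"]})
    return results


def crawl_search_results(keyword, search_url_template, max_pages=1000):
    """Request successive results pages until one returns HTTP 404.

    `search_url_template` is a hypothetical parameter: a string with
    `{keyword}` and `{page}` placeholders supplied by the caller.
    """
    import requests  # imported here so the parser above also works offline

    results = []
    for page in range(1, max_pages + 1):
        url = search_url_template.format(keyword=quote_plus(keyword), page=page)
        response = requests.get(url, timeout=30)
        if response.status_code == 404:
            break  # no more results pages
        response.raise_for_status()
        results.extend(parse_search_results(response.text))
    return results
```

Keeping the HTML parsing separate from the network loop makes it possible to check the parser against a saved page without sending any requests. The `max_pages` cap is an added safety limit in case the site never returns a 404.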

### 2. Article Text Extraction

A second function follows the URLs extracted in the previous step and scrapes the text of each article. It iterates over the list of URLs, sends a GET request to each one, and parses the HTML of the article page with BeautifulSoup.

The structure of the article pages is as follows:

- The text of the article is contained within multiple `<p>` tags, which are themselves contained within a `<div>` tag with the class `entry-content`.
- The categories of the article are contained within `<a>` tags within a `<div>` tag with the class `entry-meta`.
- The publication time of the article is contained within a `<time>` tag within the same `entry-meta` div.

The function extracts the text, categories, and publication time of each article and stores them in a list of dictionaries.
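The extraction step could look something like the sketch below. The function name `parse_article` is hypothetical, and the code assumes that the `<time>` tag carries an ISO 8601 `datetime` attribute. That is common on news sites but is not confirmed here for Khmer Times, so the timestamp falls back to `None` when the attribute is missing or cannot be parsed.

```python
from datetime import datetime

from bs4 import BeautifulSoup


def parse_article(html):
    """Extract text, categories, and publication time from one article page."""
    soup = BeautifulSoup(html, "html.parser")

    # Article body: every <p> inside <div class="entry-content">.
    content = soup.find("div", class_="entry-content")
    paragraphs = content.find_all("p") if content else []
    text = "\n".join(p.get_text(" ", strip=True) for p in paragraphs)

    # Categories and time both live in <div class="entry-meta">.
    meta = soup.find("div", class_="entry-meta")
    categories = [a.get_text(strip=True) for a in meta.find_all("a")] if meta else []

    published = None
    time_tag = meta.find("time") if meta else None
    if time_tag is not None and time_tag.get("datetime"):
        try:
            # Assumption: the attribute holds an ISO 8601 timestamp.
            published = datetime.fromisoformat(time_tag["datetime"])
        except ValueError:
            published = None

    return {"text": text, "categories": categories, "time": published}
```

In a full pipeline, each URL from the previous step would be fetched with `requests.get` and the response body passed to this function.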

### 3. Data Serialization

The list of dictionaries containing the article data is then serialized to a JSON file using Python's `json` library. The `datetime` objects representing the publication times of the articles are converted to ISO 8601 formatted strings before serialization, as `datetime` objects are not JSON serializable.

The resulting JSON file contains an array of objects, one per article, each with three fields: `text` (the article text), `categories` (a list of the article's categories), and `time` (the publication time as an ISO 8601 formatted string).

This JSON file serves as the dataset for this project.
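As a rough sketch of the serialization step, using only the standard library: the helper names `to_json_ready` and `save_articles` are hypothetical, and the actual project code may handle the conversion differently (for example, with a `default=` hook passed to `json.dump`).

```python
import json
from datetime import datetime


def to_json_ready(article):
    """Return a copy of the article dict with `time` as an ISO 8601 string."""
    record = dict(article)  # shallow copy; the input dict is left unchanged
    if isinstance(record.get("time"), datetime):
        record["time"] = record["time"].isoformat()
    return record


def save_articles(articles, path):
    """Write the list of article dicts to `path` as a JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII characters readable in the file.
        json.dump([to_json_ready(a) for a in articles], f, ensure_ascii=False, indent=2)
```

When the file is read back, `datetime.fromisoformat` can turn the `time` strings produced by `isoformat()` into `datetime` objects again.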


## Results

We have saved 40,002 articles from Khmer Times. The next task is exploratory data analysis.


## Example snippet


```python
# install nbcpu
%pip install nbcpu
```

```python
from nbcpu.fetcher.khmer import KhmerFetcher

khmer = KhmerFetcher(
    search_keywords=["NBC"],
    max_num_pages=1,
    max_num_articles=5,
    num_workers=1,
    output_dir="./tmp/khmer",
    overwrite_existing=True,
    verbose=True,
)

khmer()
```

INFO:nbcpu.fetcher.khmer:Fetching links for keyword: NBC
INFO:nbcpu.fetcher.khmer:[Keyword: NBC] Page: 1
INFO:nbcpu.fetcher.khmer:Title: NBC expanding Bakong operations to more Asian countries
INFO:nbcpu.fetcher.khmer:URL: https://www.khmertimeskh.com/501317628/nbc-expanding-bakong-operations-to-more-asian-countries/
INFO:nbcpu.fetcher.khmer:[Keyword: NBC] Page: 1
INFO:nbcpu.fetcher.khmer:Title: All member banks of the Association of Banks in Cambodia have followed 10% loan portfolio in Riel as mandated by the NBC
INFO:nbcpu.fetcher.khmer:URL: https://www.khmertimeskh.com/501265406/all-member-banks-of-the-association-of-banks-in-cambodia-have-followed-10-loan-portfolio-in-riel-as-mandated-by-the-nbc/
INFO:nbcpu.fetcher.khmer:[Keyword: NBC] Page: 1
INFO:nbcpu.fetcher.khmer:Title: Chea Serey is new NBC Dy Governor
INFO:nbcpu.fetcher.khmer:URL: https://www.khmertimeskh.com/501252333/chea-serey-is-new-nbc-dy-governor/
INFO:nbcpu.fetcher.khm