<a href="https://colab.research.google.com/github/exglade/pharma-gov-scrape/blob/main/pharma_gov_scrape.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Scrape Pharmaceutical Product Data from Government Websites

I want to scrape the product database from the Quest system:

https://npra.gov.my/index.php/en/consumers/information/products-search.html

The last known figures, from 2017, were about 25k pharmaceutical products and about 13k cosmetic products.

It would also be useful to extract all drugs from https://www.pharmacy.gov.my/v2/en/apps/fukkm

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
data_dir = '/content/drive/MyDrive/Colab Notebooks/pharma-gov-scrape/Data/'

## Scrape NPRA's Quest 3+

**Target:** https://npra.gov.my/index.php/en/consumers/information/products-search.html

The target page has a search form with the following fields:

- **Product category**: whether to search pharmaceutical or cosmetic products.
- **Search by**: the product attribute to search by.
- **Search**: the text to query, minimum 5 characters.

The search form calls the API at `https://quest3plus.bpfk.gov.my/pmo2/content.php` with the following payload schema:

```python
{
    'func': 'search', # API's function name?
    'searchBy': '1', # The 'Search by' field.
    'searchTxt': '', # The 'Search' field
    'cat': '2' # The 'Product category' field, where 1 = Pharmaceutical, 2 = Cosmetic
}
```

The API returns the search results as partial HTML in `<table>` form. The page then paginates the results using client-side JavaScript.
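
To see what extracting rows from such a partial `<table>` involves, here is a minimal offline sketch. It uses the standard library's `html.parser` rather than BeautifulSoup so it runs without any dependency, and the sample HTML (including the `th` header row) is made up for illustration; the real response may be laid out differently.

```python
from html.parser import HTMLParser

class TableRowParser(HTMLParser):
  """Collect the stripped text of every <td>, grouped by <tr>."""

  def __init__(self):
    super().__init__()
    self.rows = []
    self._row = None   # cells of the row being read, or None outside <tr>
    self._cell = None  # text chunks of the cell being read, or None outside <td>

  def handle_starttag(self, tag, attrs):
    if tag == 'tr':
      self._row = []
    elif tag == 'td' and self._row is not None:
      self._cell = []

  def handle_data(self, data):
    if self._cell is not None:
      self._cell.append(data)

  def handle_endtag(self, tag):
    if tag == 'td' and self._cell is not None:
      self._row.append(''.join(self._cell).strip())
      self._cell = None
    elif tag == 'tr' and self._row is not None:
      if self._row:  # rows without <td> cells (e.g. a <th> header) are skipped
        self.rows.append(self._row)
      self._row = None

# Made-up sample, NOT real data from the API.
sample_html = """
<table>
<tr><th>#</th><th>Registration Number</th><th>Product Name</th><th>Holder</th></tr>
<tr><td>1</td><td> REG-0001 </td><td>Example Product A</td><td>Example Holder</td></tr>
<tr><td>2</td><td>REG-0002</td><td>Example Product B</td><td>Example Holder</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sample_html)
print(parser.rows)
```

Since the whole result set comes back in one table, no pagination handling is needed on the scraping side.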

**Note:**

- The API does not enforce the minimum character limit on `searchTxt`, so we can send blank text.
- The API does not cap the number of results, so a single query can return everything.
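
Putting the schema and the notes together, the payload for a "fetch everything" query could be built as below. This is only a sketch: `build_quest_payload` and the `CATEGORIES` mapping are hypothetical names of mine, and I only know the meaning of `searchBy = '1'` from the form, so other values are left untested.

```python
# Hypothetical mapping from a readable name to the 'cat' value.
CATEGORIES = {'pharmaceutical': '1', 'cosmetic': '2'}

def build_quest_payload(category_name, search_text='', search_by='1'):
  """Build the form payload for a Quest 3+ search (hypothetical helper)."""
  if category_name not in CATEGORIES:
    raise ValueError(f"Unknown category: {category_name!r}")
  return {
      'func': 'search',
      'searchBy': search_by,
      'searchTxt': search_text,  # blank text matches every product
      'cat': CATEGORIES[category_name],
  }

payload = build_quest_payload('cosmetic')
print(payload)
```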

In [None]:
import requests
from bs4 import BeautifulSoup

def scrape_quest(category):
  data = {
      'func': 'search',
      'searchBy': '1',
      'searchTxt': '',
      'cat': category
  }

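  url = 'https://quest3plus.bpfk.gov.my/pmo2/content.php'
  # verify=False skips TLS certificate verification, so requests emits an InsecureRequestWarning.
  res = requests.post(url, data=data, verify=False)
  res.raise_for_status()  # Stop on HTTP errors instead of parsing an error page
  soup = BeautifulSoup(res.content, 'html.parser')

  # Skip the first row, assumed to be the table header.
  rows = soup.select('tr')[1:]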

  output = []
  for row in rows:
    cols = row.select('td')
    output_cols = []
    for col in cols:
      output_cols.append(col.text.strip())
    output.append(output_cols)

  print(f"Query '{url}' with category: {category}, found {len(output)} results")

  return output

In [None]:
import csv

def save_npra_result(filepath, results):
  # newline='' stops the csv module from writing blank lines between rows on Windows.
  with open(filepath, mode='w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['#', 'Registration Number', 'Product Name', 'Holder'])
    writer.writerows(results)

In [None]:
# Pharmaceutical
pharma_prods = scrape_quest('1')
save_npra_result(data_dir + 'pharma-prods.csv', pharma_prods)

# Cosmetics
cosmet_prods = scrape_quest('2')
save_npra_result(data_dir + 'cosmet-prods.csv', cosmet_prods)

## Scrape Formulari Ubat KKM

**Target:** https://www.pharmacy.gov.my/v2/en/apps/fukkm

The target page shows 30 items per page in a standard HTML table. As of 26/4/2021 there are 55 pages and 1639 items in total.

The page accepts a query parameter for the page number (zero-based): `https://www.pharmacy.gov.my/v2/en/apps/fukkm?page={page_number}`
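
As a sanity check on those numbers, the page count and the page URLs can be derived from the item totals. A minimal sketch; the constant names are mine, and the totals are the ones noted above, so they will drift as the formulary changes.

```python
import math

ITEMS_PER_PAGE = 30
TOTAL_ITEMS = 1639  # as of 26/4/2021
BASE_URL = 'https://www.pharmacy.gov.my/v2/en/apps/fukkm'

# Round up, because the last page is only partly filled.
page_count = math.ceil(TOTAL_ITEMS / ITEMS_PER_PAGE)

# Page numbers are zero-based, so they run from 0 to page_count - 1.
page_urls = [f'{BASE_URL}?page={n}' for n in range(page_count)]

print(page_count)
print(page_urls[0])
print(page_urls[-1])
```

This gives 55 pages, with the last one at `?page=54`.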

The table has the following columns:

- #
- Generic Name
- MDC
- Category
- Indications
- Pres. Restrictions
- Dosage
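
If named fields are handier than positional ones, a scraped row can be zipped with these column names. A small sketch: `row_to_record` is a hypothetical helper, and the sample row holds placeholder values, not real formulary data.

```python
FUKKM_COLUMNS = [
    '#', 'Generic Name', 'MDC', 'Category',
    'Indications', 'Pres. Restrictions', 'Dosage',
]

def row_to_record(row):
  """Map one scraped row (a list of cell texts) to a dict keyed by column name."""
  if len(row) != len(FUKKM_COLUMNS):
    raise ValueError(f"Expected {len(FUKKM_COLUMNS)} cells, got {len(row)}")
  return dict(zip(FUKKM_COLUMNS, row))

# Placeholder values for illustration only.
sample_row = ['1', 'Example Drug', 'X00XX00', 'A', 'Example indication', 'None', 'Example dosage']
record = row_to_record(sample_row)
print(record['Generic Name'])
```

The length check makes a changed table layout fail loudly instead of silently shifting columns.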

In [None]:
# Scrape FUKKM function

import requests
from bs4 import BeautifulSoup

def scrape_fukkm(page):
  output = []
  url = f"https://www.pharmacy.gov.my/v2/en/apps/fukkm?page={page}"
  res = requests.get(url)

  soup = BeautifulSoup(res.content, 'html.parser')
  rows = soup.select('tr')[1:]

  for row in rows:
    cols = row.select('td')
    output_cols = []
    for col in cols:
      output_cols.append(col.text.strip())
    output.append(output_cols)

  print(f"Query '{url}', found {len(output)} results")

  return output

In [None]:
# Run FUKKM scraping

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=20) as executor:
  # map() returns a lazy iterator; list() keeps the per-page results so they can be reused.
  results = list(executor.map(scrape_fukkm, range(55))) # FUKKM has 55 pages (0 to 54)
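
Note that `executor.map` yields one list of rows per page, in page order, so the pages have to be flattened before they can be treated as individual CSV rows. A minimal offline sketch, where the hypothetical `fake_scrape` stands in for `scrape_fukkm` and returns made-up rows:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

def fake_scrape(page):
  # Hypothetical stand-in for scrape_fukkm: two made-up rows per page.
  first = page * 2 + 1
  return [[str(first), f'item-{page}-a'], [str(first + 1), f'item-{page}-b']]

with ThreadPoolExecutor(max_workers=4) as executor:
  # Results come back in the order of the inputs, not in completion order.
  pages = list(executor.map(fake_scrape, range(3)))

# Flatten the list of pages into a single list of rows.
rows = list(chain.from_iterable(pages))
print(len(pages), len(rows))
```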

In [None]:
# Save FUKKM results to file

import csv

with open(data_dir + 'fukkm-output.csv', mode='w', newline='', encoding='utf-8') as csv_file:
  writer = csv.writer(csv_file)
  writer.writerow(["#", "Generic Name", "MDC", "Category", "Indications", "Pres. Restrictions", "Dosage"])
  # Each result holds all the rows of one page, so write them with writerows.
  for page_rows in results:
    writer.writerows(page_rows)