# South African COVID-19 Daily Hospital data API
### Data engineering project
#### Extraction phase 

------------------

The API and documentation are available at <a href='https://covidza-data.deta.dev/docs'>COVIDZA DATA API</a>.

----------------------------------

Looking for a data engineer? Contact me:
 - [LinkedIn](https://linkedin.com/in/mpho-jan-kubeka)
 - [GitHub](https://github.com/dataprojectswithMJ)
 - [YouTube](https://www.youtube.com/channel/UClOP0fAisga04q3OB1iC4nQ)

------------------------

## About the project 

I was looking for COVID-19 statistics for South Africa to build a dashboard and possibly run a time series forecasting model.

## Problem with Extraction

There were a few problems with the extraction stage. The biggest issues were finding:
   - A reliable data source
   - A usable format
   - An open source API
   
But then I came across the <a href="https://www.nicd.ac.za/">NICD website</a>. The site contains the data I need, but it is published as PDFs, which is not ideal for my use case.

Screenshot of the data in the PDF from the NICD website.

<img src='PDF_sample.png' width='500'>


## Solution

I then decided to build the API myself and make it open source. 

The main focus right now is automating the PDF downloads using __requests__ while storing the download links in a database (just for reference and backup).

This is what the end result looks like:

<img src='db_links_record.png'>

-----------------------------------------

## Tech Stack

My tech stack for this stage includes:
   - Python
        - BeautifulSoup
        - Requests
        - PyMongo (MongoDB Python driver)
   - MongoDB

----------------------

# Code

### Importing the needed packages

In [1]:
import re
from datetime import datetime

import requests
from bs4 import BeautifulSoup
from pymongo import MongoClient

### Configure MongoDB instance

In [3]:
uri = ' .... '  # Connection URI for the database: localhost for testing, MongoDB Atlas for a cloud instance

client = MongoClient(uri)

db = client['COVIDAPI']  # Name of the database where all the data will go

links = db['links']  # Name of the collection (not a table, because MongoDB is a NoSQL database)

### Configure the scraper

In [5]:
# URL of the NICD website where all the documents can be downloaded.

url = 'https://www.nicd.ac.za/diseases-a-z-index/disease-index-covid-19/surveillance-reports/daily-hospital-surveillance-datcov-report/'

page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')

The 'page' response should have a 200 status code (check __page.status_code__). If the code is different, something went wrong. Examples:
   - 400 range is a client (user) error
   - 500 range is a server error, so the problem is on the website's side rather than in this code
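
As a minimal sketch of that check, the helper below sorts a status code into the ranges described above. The name `describe_status` is hypothetical and not part of the original notebook; in the notebook it could be called with `page.status_code`.

```python
def describe_status(status_code: int) -> str:
    """Classify an HTTP status code into a broad range (hypothetical helper)."""
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500:
        return "client error"
    if 500 <= status_code < 600:
        return "server error"
    return "other"


# Assumed usage in the notebook: describe_status(page.status_code)
for code in (200, 404, 503):
    print(code, describe_status(code))
```

__requests__ also provides `page.raise_for_status()`, which raises `requests.exceptions.HTTPError` for 4xx and 5xx responses, if stopping the notebook on a bad response is preferred.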

### Now the scraping begins

In [None]:
# Return all PDF download links found

result = soup.find_all(name='a', attrs={'class': 'elementor-button-link elementor-button elementor-size-xs'})

In [None]:
file_path = " .... "  # folder where the PDFs are saved

for x in result:
    # The link text carries the report date in brackets; keep only that part.
    bracketed = re.findall(r'\(.*?\)', x.text)
    if not bracketed:
        print(f'NO DATE FOUND IN: {x.text}')
        continue
    file_name = bracketed[0].replace("\xa0", "").replace("(", "").replace(")", "")

    if not links.find_one({'link': x['href']}):
        links.insert_one({'title': x.text, 'link': x['href'], 'scrape_date': datetime.now()})
        print('NEW FILE ADDED TO DB!!!')

        r = requests.get(x['href'], stream=True)

        with open(f"{file_path}{file_name}.pdf", "wb") as pdf:
            try:
                for chunk in r.iter_content(chunk_size=1024):

                    # writing one chunk at a time to pdf file
                    if chunk:
                        pdf.write(chunk)
                print(f'Done with {file_name}.pdf')

            # iter_content raises requests exceptions (not EOFError) when the stream fails
            except requests.exceptions.RequestException as e:
                print(e)
    else:
        print('FILE ALREADY EXISTS!')


There is quite a bit happening in the above cell, so we'll go through it step by step. First, the two variables it defines:
  - __file_path__:
       We are going to save/download the PDF documents to storage. This variable tells the code where to save the files.
  - __file_name__:
       The link text returned by the scraper contains more than we need, such as brackets and extra text. This variable strips the extra information and leaves only the date.
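
The __file_name__ clean-up can be tried on its own. This is a minimal sketch: `extract_file_name` is a hypothetical helper name, and the sample link text is an assumption, since the real wording on the NICD page may differ.

```python
import re


def extract_file_name(link_text: str) -> str:
    """Return the text inside the first pair of brackets, without non-breaking spaces."""
    matches = re.findall(r'\(.*?\)', link_text)
    if not matches:
        raise ValueError(f"no bracketed text in {link_text!r}")
    return matches[0].replace("\xa0", "").replace("(", "").replace(")", "")


# Hypothetical link text, for illustration only.
sample = "Daily hospital surveillance report (\xa02021-06-01)"
print(extract_file_name(sample))
```

Raising an error when no brackets are found is one option; skipping the link instead would also work.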
       



Next, the code searches the database to see whether the link has already been stored:
 - If it does __NOT__ exist:
     1) The code stores the following data in the database:
        - the title (link text) of the file
        - the link of the PDF
        - the date the PDF was discovered
     2) It requests the PDF from the link with __requests__, opens a local file with the __'open'__ function and writes the PDF to it in chunks of 1024 bytes.
     3) The PDF is saved locally (at the moment), with the __file_name__ variable as the name of the document.


 - If it does exist:
     1) It prints 'FILE ALREADY EXISTS!' and goes to the next link
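
The same check-then-download flow can be sketched without a database or network connection. Everything here is an assumption made for illustration: `save_if_new` is a hypothetical helper, a plain dict stands in for the MongoDB collection, and `fake_fetch` stands in for the chunks that `requests.get(link, stream=True).iter_content(...)` would yield.

```python
import os
import tempfile
from datetime import datetime


def save_if_new(records, title, link, file_path, fetch_chunks):
    """Record and download `link` unless it is already in `records`.

    Returns True when a new file was written, False when the link was known.
    """
    if link in records:
        return False
    # Same order as the notebook cell: record the link first, then download.
    records[link] = {"title": title, "link": link, "scrape_date": datetime.now()}
    with open(file_path, "wb") as pdf:
        for chunk in fetch_chunks(link):
            if chunk:  # skip empty chunks
                pdf.write(chunk)
    return True


def fake_fetch(link):
    # Hypothetical stand-in for a streamed download.
    return [b"%PDF-", b"", b"sample"]


records = {}
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "2021-06-01.pdf")
    first = save_if_new(records, "Report (2021-06-01)", "https://example.com/a.pdf", target, fake_fetch)
    second = save_if_new(records, "Report (2021-06-01)", "https://example.com/a.pdf", target, fake_fetch)
    print(first, second)
```

The first call writes the file and records the link; the second call finds the link and skips the download.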
     
### NB! This is part of the project's *LOADING* step of our ETL process

------------------------

#### This is the end of scraping and downloading the PDF documents to storage. The *transformation* code is in another notebook in this repo.