# ***WEB SCRAPING FOR CORONAVIRUS DATA***
>### *Felipe Solares*
>### 30/03/2020

## ***About***

This is a web scraping project written in Python 3.7 by Felipe Solares da Silva. It is part of my `professional portfolio`; if you want to see more projects like this, check it out at https://github.com/fsolares/professional-portfolio.

Contact: solares.fs@gmail.com

---

## ***Project Purpose***
>Scrape `COVID-19 data` from the Brazilian Ministry of Health website to build my own database for further analysis.

## Step 1 - Installing and Importing Essential Packages and Modules
>For this project, we're going to use `BeautifulSoup`, `Selenium` and `Regex`. 
>- Beautiful Soup is a Python library for pulling data out of HTML and XML files. 
>- Selenium is a tool designed to automate web browsers. It can click specific form buttons, enter information in text fields, and extract DOM elements from the browser's HTML code.
>- Regex, or regular expressions, are sequences of characters that define a search pattern.
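
As a small illustration of a search pattern, the sketch below pulls a `dd/mm/yyyy` date out of a line of text with Python's built-in `re` module. The sample string and the `extract_date` helper are hypothetical, made up for this example; the real text on the website may look different.

```python
import re

# Two digits, a slash, two digits, a slash, four digits (dd/mm/yyyy).
DATE_PATTERN = re.compile(r'\d{2}/\d{2}/\d{4}')


def extract_date(text):
    """Return the first dd/mm/yyyy date found in text, or None."""
    match = DATE_PATTERN.search(text)
    return match.group(0) if match else None


# Hypothetical sample text, not copied from the real site.
sample = 'Last update: 30/03/2020 17:00'
print(extract_date(sample))
print(extract_date('no date here'))
```

Returning `None` when nothing matches lets the caller decide what to do, instead of failing on an empty result list.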

**Just run the cell below to install all packages.**

In [1]:
# Installing BeautifulSoup
!pip install beautifulsoup4
# Installing Selenium
!pip install selenium

Collecting selenium
  Downloading selenium-3.141.0-py2.py3-none-any.whl (904 kB)
Installing collected packages: selenium
Successfully installed selenium-3.141.0


**WARNING!**
>Selenium requires a driver to interface with the chosen browser. `Firefox`, for example, requires geckodriver, which needs to be installed before the examples below can be run. In this project, the browser we chose was `Google Chrome`.

**For more information, see this wonderful tutorial: https://selenium-python.readthedocs.io/installation.html**

### Step 1.1 - Installing and Configuring the Webdriver for Google Chrome
>First, we have to find out the version of the installed `Google Chrome web browser`. Then we download the matching web driver for Google Chrome from the link below.
https://sites.google.com/a/chromium.org/chromedriver/downloads
The site lists several versions for download. Just choose the one that corresponds to your version of Google Chrome.

>Once downloaded, unzip the file into a folder in the `root directory` C:/. Name the folder as you want, but it is important to keep it in the root directory. 
On my PC it looked like this: `C:\webdrivers`, and inside this folder was the single file contained in the zip, `chromedriver.exe`.

>To complete the configuration, open the Control Panel and add the web driver folder to your `PATH` (system variables). Seriously! Make sure it’s in your `PATH`.
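
A quick way to confirm the configuration is to ask Python where it finds the driver. This is a minimal sketch using `shutil.which` from the standard library; the helper name `find_driver` is hypothetical, and it assumes the executable is named `chromedriver` (on Windows, `shutil.which` also tries the extensions listed in `PATHEXT`, such as `.exe`).

```python
import shutil


def find_driver(name='chromedriver'):
    """Return the full path of the driver found on PATH, or None."""
    return shutil.which(name)


driver_path = find_driver()
if driver_path is None:
    print('chromedriver was not found in your PATH.')
else:
    print('chromedriver found at ' + driver_path)
```

If the first message appears, recheck the `PATH` entry and open a new shell so the change takes effect.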

**Finally! Just run the cell below and let's scrape.**

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup as bs
from os import path
import pandas as pd
import time
import datetime
import requests
import re
import shutil

## Step 2 - Building a Web Scraper

>The main goal here was to create a `Webscraper` class that groups all the methods required to scrape the target web page. In the first version of the Brazilian COVID-19 website, the only data provided was a table with daily information about occurrences, deaths, and lethality by state. Later versions added, besides the table, a CSV file with all data since day one. 


>Every day at 7:00 PM the website was updated, bringing new data to be scraped. So we opened the shell, searched for the `file.py`, and let the magic happen! But to guarantee that our code worked as expected, we had to take care of some issues:
    1. Check if the website is online.
    2. Check if the website was updated.
    3. Check if we already have the file that we'll download.
    4. Move the recently downloaded `CSV` file to another directory.
    5. Organize, clean and export the scraped data.
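
Items 3 and 4 above boil down to one rule: move the downloaded file only if the destination does not exist yet. The sketch below shows that rule with the standard library alone. The `move_if_new` helper, the file names, and the temporary folders are hypothetical stand-ins for the real download and storage directories.

```python
import shutil
import tempfile
from pathlib import Path


def move_if_new(source, destination):
    """Move source to destination unless destination already exists.

    Returns True when the file was moved, False when it was skipped.
    """
    source, destination = Path(source), Path(destination)
    if destination.exists():
        return False
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(destination))
    return True


# Demonstration with throwaway folders instead of real directories.
with tempfile.TemporaryDirectory() as tmp:
    downloads = Path(tmp) / 'downloads'
    storage = Path(tmp) / 'storage'
    downloads.mkdir()

    # First download: the destination is free, so the file is moved.
    first = downloads / 'data.csv'
    first.write_text('state,cases\n')
    moved_first = move_if_new(first, storage / 'BRnCov19_30032020.csv')

    # Second download on the same day: the destination exists, so it is skipped.
    second = downloads / 'data.csv'
    second.write_text('state,cases\n')
    moved_second = move_if_new(second, storage / 'BRnCov19_30032020.csv')

    print(moved_first, moved_second)
```

Returning a boolean instead of printing inside the helper keeps the decision about what to tell the user in the calling code.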

**For a better understanding of how each item in the list above was resolved, see the description and code of each method in the next cell.**



In [1]:
class Webscraper:
    
    # The Webscraper class groups all the methods
    # required to scrape the target web page.
    
    
    def __init__(self):
        
        self.url = 'https://covid.saude.gov.br/'
        self.datenow = datetime.datetime.now()
        self.option = Options()
        self.option.add_experimental_option("prefs", {"download.default_directory": "/path/to/download/dir",
                                                      "download.prompt_for_download": False})
        self.option.headless = True # (Comment this line if you want to see all process running)
        
        # This line will work only if you are using Google Chrome.
        self.driver = webdriver.Chrome("C:/webdrivers/chromedriver.exe", options=self.option) 
                                                                                               
                                                                                            
        
    
    def get_File(self):
        
        # The get_File() method clicks the download button, waits for
        # the CSV file, then renames it and moves it to another directory.
        
        import os
        
        self.driver.command_executor._commands["send_command"] = ("POST",
                                                                  '/session/$sessionId/chromium/send_command')
        
        params = {'cmd':'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 
                                                               'downloadPath': 'C:\\Users\\solar\\Downloads'}}
        
        command_result = self.driver.execute("send_command", params)
        
        
        print('Downloading CSV file to C:/Users/solar/Downloads...', end='\n\n')
        
        self.driver.find_element_by_xpath('//html/body/app-root/ion-app/ion-router-outlet'
                                          '/app-home/ion-content/div[1]/div[2]/ion-button').click()
        
        time.sleep(5)
        for file in os.listdir(r'C:\Users\solar\Downloads'):
            if file.endswith('.csv'):
                filename = self.datenow.strftime('%d%m%Y') + '.csv'
                dirname = os.path.join(r'C:\Users\solar\Downloads', file)
                newdirname = 'F:\\FCD\\Covid19\\SUS_csv\\BRnCov19_' + filename
                if os.path.exists(newdirname):
                    print('  File already exists...', end='\n\n')
                else:
                    print(r'  Moving the file to F:\FCD\Covid19\SUS_csv...', end='\n\n')
                    shutil.move(dirname, newdirname)

        # Close the browser once, whether or not a file was moved.
        self.driver.quit()


    def check_Url(self):
        
        # The check_Url() method tests a URL using the requests package,
        # returning True or False depending on the response
        # retrieved from the web page.
        
        print()
        print('Establishing Connection...', end='\n\n')
        time.sleep(2)
        
        try:
            response = requests.get(self.url)
            # Treat HTTP error codes (4xx/5xx) as "not online".
            response.raise_for_status()
            
        except requests.exceptions.RequestException:
            print('An error has occurred. Please check your internet connection ' +
                  'or check if the url still exists.')
            return False
        
        print('  We are online!', end='\n\n')
        return True
    
     
    def check_File(self, ptrn):
        
        # The check_File() method checks whether the specified file exists or not
        # returning True or False.
        
        print()
        print('Checking current release...', end='\n\n')
        time.sleep(2)
        
        cut = re.sub('/', '', ptrn)
        filename = 'BRcov19_' + cut + '.csv'
        
        if path.exists(filename):
            print(f'  You already have this version! The last update was on {ptrn}.')
            print('  Try again later!')
            return True
        return False
            
    def check_Rls(self):
        
        # The check_Rls() method searches the page for the date of the
        # last release and returns True if a file for that release
        # already exists, or False otherwise.
        
        self.driver.get(self.url)
        time.sleep(5)
        
        release = self.driver.find_element_by_xpath('//html/body/app-root/ion-app/ion-router-outlet'
                                                    '/app-home/ion-content/div[1]/div[1]/div[3]/span').text
        
        pattern = re.findall(r'(\d{2}/\d{2}/\d{4})', release)
        
        return self.check_File(pattern[0])
        
        
        
    def get_Html(self):
        
        # The get_Html() method searches for a specific 
        # element using xpath and returns its HTML 
        # parsed by BeautifulSoup.
        
        print('Scraping Data...', end='\n\n')
        time.sleep(2)

        element = self.driver.find_element_by_xpath('//html/body/app-root/ion-app/ion-router-outlet/app-home'
                                                    '/ion-content/painel-geral-component/div/div[2]/div[2]'
                                                    '/div[1]/lista-itens-component/div[2]')
        
        html_content = element.get_attribute('outerHTML')

        #self.driver.quit()
        
        soup = bs(html_content, 'html.parser') 

        return soup
    
    def get_Content(self, pars_elem):
        
        # The get_Content(pars_elem) method receives a parsed html element 
        # as input, searches it for the target content using
        # html tags, extracts that content and returns a data frame
        # with the organized data.
        
        print('Cleaning Content...',end='\n\n')
        time.sleep(2)
  
        state = []
        confirmed = []
        death = []
        incidence = []
        lethality = []
        
        for idx, c in enumerate(pars_elem.find_all(class_ = 'lb-nome')):
            if idx in range(0, 131, 5):      # In this loop, we take all matches returned by the 
                state.append(c.text)         # search method, extract the text inside the
            elif idx in range(1, 132, 5):    # tags, and append each value to
                confirmed.append(c.text)     # its specific list.
            elif idx in range(2, 133, 5):
                death.append(c.text)
            elif idx in range(3, 134, 5):
                incidence.append(c.text)
            else:
                lethality.append(c.text)
        
        
        print('Creating Data Frame...',end='\n\n')
        
        # Zipping through the lists creating an organized data frame.
        br_coronavirus = pd.DataFrame(list(zip(state, confirmed, death, incidence, lethality)), 
                             columns = ['State', 'Confirmed_Cases', 'Deaths', 'Incidence', 'Lethality']) 
                                                                                                                      
                                                                                           
        br_coronavirus['time_stamp'] = self.datenow.strftime('%d/%m/%Y')
        
        return br_coronavirus
    
    def export(self, inframe):
        
        # The export() method receives a data frame as input and
        # exports the data to a CSV file, storing it in your personal database.
        
        print('Exporting File...',end='\n\n')
        time.sleep(2)        
        
        filename = 'BRcov19_' + self.datenow.strftime('%d%m%Y') + '.csv'
        inframe.to_csv(filename, encoding='utf-8', index=False)
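
The index ranges in `get_Content()` rely on the page listing values in repeating groups of five (state, confirmed cases, deaths, incidence, lethality). The same grouping can be sketched with plain lists by slicing the flat sequence into chunks of five. The `group_rows` helper and the sample values below are hypothetical and only illustrate the idea.

```python
def group_rows(values, size=5):
    """Split a flat list into consecutive rows of `size` items.

    A trailing incomplete group is dropped, so every row has the same length.
    """
    full = len(values) - len(values) % size
    return [values[i:i + size] for i in range(0, full, size)]


# Made-up values in the order the scraper expects them.
flat = ['SP', '1451', '98', '3.2', '6.8%',
        'RJ', '600', '17', '3.5', '2.8%']

for row in group_rows(flat):
    print(row)
```

Each resulting row could be passed to `pd.DataFrame` with the same column names used in `get_Content()`.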

## Step 3 - The Simple script
>Let's understand the logic that starts our web scraper and automates our tasks. The first line starts an infinite loop; then we instantiate `Webscraper()` and assign it to an object. On line 3, we check whether the web page is online: if `True`, we move on to line 4 and check whether we already have the latest release; if `False`, we stop the operation. If we already have the latest release, we stop as well.

>If everything goes as expected, the scraper gets the job done, and voilà! New data to store and analyze.

**Check out the geospatial analysis performed with the data scraped by this project at https://github.com/fsolares/Python-COVID-19_Geospatial_Analysis**



In [None]:
while True:
    ws = Webscraper()
    if ws.check_Url():
        if ws.check_Rls():
            break
        else:
            html_element = ws.get_Html()
            br_coronavirus = ws.get_Content(html_element)
            ws.export(br_coronavirus)
            ws.get_File()
            print('Done!')
            break        
    else:
        break

**DISCLAIMER**
>By the time you look at this project, the site has probably changed its layout... Don't panic!
Just look through the new layout, figure out how to solve the problem, and get the job done!
Good Luck!

### That’s all for today! If you’d like to take a look at another project, feel free to check out my GitHub portfolio at https://github.com/fsolares/professional-portfolio
