# Data Sources
We will use the [Canada Open Government Website](https://open.canada.ca/) to fetch data on car CO2 emissions and gas prices.  
The two datasets are:
1. [Fuel Consumption Ratings](https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64#wb-auto-6)
2. [Fuels Price Survey](https://open.canada.ca/data/en/dataset/c6ec6da3-2a8c-4b67-b59e-1d567efdaeac)

# Gathering Data
We will use the `requests` package to fetch the web pages and files, and the `BeautifulSoup` library (imported from `bs4`) for web scraping.

In [1]:
import requests
from bs4 import BeautifulSoup

## Get The Web Page

In [2]:
r_fc = requests.get("https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64#wb-auto-6")

In [3]:
r_fp = requests.get("https://open.canada.ca/data/en/dataset/c6ec6da3-2a8c-4b67-b59e-1d567efdaeac")

## Scrape The Web Page
### Utility functions

In [4]:
import re

In [5]:
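def is_downloadable(url):
    # Issue a HEAD request so only the headers are fetched, not the body.
    h = requests.head(url, allow_redirects=True)
    # Default to an empty string: the Content-Type header may be missing.
    content_type = h.headers.get('content-type', '')

    # An HTML response is a web page, not a file to download.
    return 'html' not in content_type.lower()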

In [6]:
def get_filename_from_cd(cd):
    # cd is the value of the Content-Disposition header (may be None).
    if not cd:
        return None
    fname = re.findall(r'filename=(.+)', cd)
    if len(fname) == 0:
        return None
    return fname[0]

In [7]:
def get_filename_from_url(url):
    return url.rsplit('/', 1)[1].replace("%20", "-")
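
The helpers above can be tried without any network access. The sketch below is a hypothetical variant, not the notebook's own code: the function names, the sample header, and the `example.org` URL are assumptions made for illustration. It strips optional quotes around the `Content-Disposition` filename and decodes every percent-escape in the URL (not only `%20`). The extended `filename*=` form is not handled.

```python
import re
from urllib.parse import unquote, urlsplit


def filename_from_content_disposition(cd):
    """Return the filename from a Content-Disposition value, or None."""
    if not cd:
        return None
    # Accept both filename=data.csv and filename="data.csv".
    match = re.search(r'filename="?([^";]+)"?', cd)
    return match.group(1).strip() if match else None


def filename_from_url(url):
    """Return the last path segment of a URL, with spaces turned into dashes."""
    path = urlsplit(url).path  # drops the query string and fragment
    return unquote(path.rsplit("/", 1)[-1]).replace(" ", "-")


name_from_header = filename_from_content_disposition(
    'attachment; filename="fueltypesall.csv"'
)
name_from_url = filename_from_url("https://example.org/files/my%20file.csv?x=1")
print(name_from_header, name_from_url)
```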

### Fuel Consumption Metadata
To get the file links, we should:
1. Find every list item (`li`) whose `class` attribute is `resource-item`.
2. For each list item, take the first link; its text describes the data contained in the file.
3. For each list item, take the link whose `class` attribute is `btn btn-primary btn-sm resource-url-analytics`; its `href` attribute contains the URL of the file.
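
The three steps can be sketched on a small inline page. The HTML below is hypothetical markup that only mimics the structure described above; the real page may differ. The sketch uses the built-in `html.parser` backend so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written to match the structure described in the steps.
SAMPLE_HTML = """
<ul>
  <li class="resource-item">
    <a href="/resource/1">2023 Fuel Consumption Ratings</a>
    <a class="btn btn-primary btn-sm resource-url-analytics"
       href="https://example.org/files/ratings%202023.csv">Download</a>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

links = []
# Step 1: every list item with the resource-item class.
for li in soup.find_all("li", class_="resource-item"):
    # Step 2: the first link's text describes the file.
    description = li.a.get_text(strip=True)
    # Step 3: the download button carries the file URL in its href.
    button = li.find("a", class_="resource-url-analytics")
    if button is None or not button.has_attr("href"):
        continue  # skip items without a download button
    links.append({"description": description, "link": button["href"]})

print(links)
```

Checking for a missing button before indexing `["href"]` keeps one malformed list item from aborting the whole loop.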

In [8]:
# fc - fuel consumption
scrapper_fc = BeautifulSoup(r_fc.text, "lxml")
body_fc = scrapper_fc.body

In [9]:
fc_files_links = []

In [10]:
def scrap_data(x):
    # The first link's text describes the data contained in the file.
    item_description = x.a.text.strip()
    # The download button's href points at the file itself.
    item_link = x.find("a", attrs={"class": "btn btn-primary btn-sm resource-url-analytics"})["href"]
    item_file_name = get_filename_from_url(item_link)

    link = {"description": item_description, "link": item_link, "file_name": item_file_name}

    return link

In [11]:
for li in body_fc.find_all("li", attrs={"class": "resource-item"}):
    fc_files_links.append(scrap_data(li))

### Fuel Prices Metadata

In [12]:
# fp - fuel prices
scrapper_fp = BeautifulSoup(r_fp.text, "lxml")
body_fp = scrapper_fp.body

In [13]:
fp_files_links = []

for li in body_fp.find_all("li", attrs={"class": "resource-item"}):
    # Reuse scrap_data, then keep only the links that point at real files.
    link = scrap_data(li)
    if is_downloadable(link["link"]):
        fp_files_links.append(link)

fp_files_links

[{'description': 'Fuels price survey informationCSV',
  'link': 'https://ontario.ca/v1/files/fuel-prices/fueltypesall.csv',
  'file_name': 'fueltypesall.csv'},
 {'description': 'Data dictionaryXLSX',
  'link': 'https://files.ontario.ca/fuelsdatadictionary_0.xlsx',
  'file_name': 'fuelsdatadictionary_0.xlsx'},
 {'description': 'Dictionnaire des donnéesXLSX',
  'link': 'https://files.ontario.ca/fuelsdatadictionary_fre.xlsx',
  'file_name': 'fuelsdatadictionary_fre.xlsx'}]

## Files Downloads
The next step is to download the files using the links we have already scraped.

In [14]:
import pandas as pd

In [15]:
def download_files(files_links, base_dir):
    for item in files_links:
        if is_downloadable(item['link']):
            r = requests.get(item['link'], allow_redirects=True)
            file_path = f"{base_dir}/{item['file_name']}"

            if r.encoding is not None:
                # Text response: decode with the declared encoding, save as UTF-8.
                with open(file_path, "w", encoding="utf-8") as f:
                    f.write(r.content.decode(r.encoding))
            else:
                # No declared encoding: treat as binary and write the raw bytes.
                with open(file_path, "wb") as f:
                    f.write(r.content)

In [16]:
download_files(fc_files_links, "../raw_data/fuel_consumption")

In [17]:
download_files(fp_files_links, "../raw_data/fuel_prices")
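
`download_files` opens files inside `base_dir` without creating it, so the calls above assume the target directories already exist. A minimal sketch of creating them first is shown below; the helper name `ensure_dirs` is hypothetical, and a temporary directory stands in for `../raw_data` so the example leaves no files behind.

```python
import tempfile
from pathlib import Path


def ensure_dirs(base_dir, subdirs):
    """Create each subdirectory under base_dir if it is missing."""
    created = []
    for name in subdirs:
        target = Path(base_dir) / name
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created


# A temporary directory stands in for ../raw_data in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    dirs = ensure_dirs(tmp, ["fuel_consumption/meta_data", "fuel_prices/meta_data"])
    all_exist = all(d.is_dir() for d in dirs)

print(all_exist)
```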

## Store The Metadata In JSON File For Offline Access
To make the metadata easy to inspect offline, store it in a `JSON` file (this step is optional):

In [18]:
import json

In [20]:
with open("../raw_data/fuel_consumption/meta_data/fuel_consumption_ratings_sources.json", "w") as f:
    json.dump(fc_files_links, f, indent=4)

In [21]:
with open("../raw_data/fuel_prices/meta_data/fuel_price_survey_sources.json", "w") as f:
    json.dump(fp_files_links, f, indent=4)
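
To use the stored metadata offline, the file can be read back with `json.load`. The sketch below round-trips one entry taken from the output shown earlier. The temporary path is an assumption made so the example runs anywhere, and `ensure_ascii=False` is an optional choice that keeps accented characters readable in the file.

```python
import json
import tempfile
from pathlib import Path

# One entry copied from the fuel prices output shown earlier.
sample_links = [
    {
        "description": "Dictionnaire des donnéesXLSX",
        "link": "https://files.ontario.ca/fuelsdatadictionary_fre.xlsx",
        "file_name": "fuelsdatadictionary_fre.xlsx",
    }
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sources.json"  # stand-in for the real metadata path

    # Write the list of dictionaries as indented JSON.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_links, f, indent=4, ensure_ascii=False)

    # Read it back into plain Python lists and dictionaries.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded[0]["file_name"])
```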