# Exercise #2 - Web Scraping and File Downloading with Python

### Problem Statement
You need to download a file of weather data from a government website. The files are located at the following URL:

`https://www.ncei.noaa.gov/data/local-climatological-data/access/2021/`

You are looking for the file that was Last Modified on `2023-12-31 22:00`. You can't cheat and look up the file name yourself: you must use `Python` to scrape this webpage and find the file name that corresponds to this timestamp.

Once you have obtained the correct file, and downloaded it, you must load the file into `Pandas` and find the record(s) with the highest `HourlyDryBulbTemperature`. Print these record(s) to the command line.

Generally, your script should do the following ...
1. Attempt to scrape/pull down the contents of `https://www.ncei.noaa.gov/data/local-climatological-data/access/2021/`
1. Analyze its structure and determine how to find the file corresponding to `2023-12-31 22:00` using `Python`.
1. Build the URL required to download this file, and write the file locally.
1. Open the file with `Pandas` and find the records with the highest `HourlyDryBulbTemperature`
1. Print this to stdout/command line/terminal.

In [1]:
# Import necessary packages

import requests as req
from bs4 import BeautifulSoup
import pandas as pd
import os

In [2]:
# Create a folder for the files downloaded from the URL (no error if it already exists)

os.makedirs('downloads', exist_ok=True)

In [3]:
# Set URL & path variables

url = "https://www.ncei.noaa.gov/data/local-climatological-data/access/2021/"
path = "downloads/"

In [4]:
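# Download the HTML of the URL and save it for the following steps.

def fetchDataInFile(url, path):
    r = req.get(url, timeout=30)
    r.raise_for_status()  # fail on an HTTP error instead of saving an error page
    with open(path, "w") as f:
        f.write(r.text)

fetchDataInFile(url, path+"data.html")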

In [5]:
# Read the saved HTML file and pretty-print it with BeautifulSoup

with open(path+"data.html", "r") as f:
    html_doc = f.read()

soup = BeautifulSoup(html_doc, "html.parser")
print(soup.prettify())

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>
   Index of /data/local-climatological-data/access/2021
  </title>
 </head>
 <body>
  <h1>
   Index of /data/local-climatological-data/access/2021
  </h1>
  <table>
   <tr>
    <th>
     <a href="?C=N;O=D">
      Name
     </a>
    </th>
    <th>
     <a href="?C=M;O=A">
      Last modified
     </a>
    </th>
    <th>
     <a href="?C=S;O=A">
      Size
     </a>
    </th>
    <th>
     <a href="?C=D;O=A">
      Description
     </a>
    </th>
   </tr>
   <tr>
    <th colspan="4">
     <hr/>
    </th>
   </tr>
   <tr>
    <td>
     <a href="/data/local-climatological-data/access/">
      Parent Directory
     </a>
    </td>
    <td>
    </td>
    <td align="right">
     -
    </td>
    <td>
    </td>
   </tr>
   <tr>
    <td>
     <a href="01001099999.csv">
      01001099999.csv
     </a>
    </td>
    <td align="right">
     2023-12-31 21:37
    </td>
    <td align="right">
     4.0M
    </td>
    <td>
 

In [6]:
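# Find the CSV file for the specified Last Modified timestamp.
# `string=` replaces the deprecated `text=` argument; the cell text ends with two spaces.

csvName = soup.find(string='2023-12-31 22:00  ').parent.previous_sibling.text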
print(csvName)

01009099999.csv
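
Matching on the exact cell text, trailing spaces included, breaks as soon as the page's whitespace changes, and it returns only the first hit even if several files share a timestamp. Below is a minimal sketch of a more tolerant lookup, assuming the index page keeps the layout shown in the prettified output (file link in the first cell, timestamp in the second). `files_modified_at` and `SAMPLE_HTML` are hypothetical names introduced here, and the sample markup is hand-written for illustration, not fetched from NOAA.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written sample in the same shape as the index page.
# The size in the second row is a placeholder.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Last modified</th><th>Size</th><th>Description</th></tr>
  <tr><td><a href="01001099999.csv">01001099999.csv</a></td>
      <td align="right">2023-12-31 21:37  </td>
      <td align="right">4.0M</td><td>&nbsp;</td></tr>
  <tr><td><a href="01009099999.csv">01009099999.csv</a></td>
      <td align="right">2023-12-31 22:00  </td>
      <td align="right">1.0M</td><td>&nbsp;</td></tr>
</table>
"""


def files_modified_at(html: str, timestamp: str) -> list[str]:
    """Return the href of every row whose 'Last modified' cell equals `timestamp`."""
    soup = BeautifulSoup(html, "html.parser")
    matches = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # header and separator rows use <th>, not <td>
        link = cells[0].find("a")
        # get_text(strip=True) drops the padding around the timestamp
        if link is not None and cells[1].get_text(strip=True) == timestamp:
            matches.append(link["href"])
    return matches


print(files_modified_at(SAMPLE_HTML, "2023-12-31 22:00"))
```

In the notebook this could be called as `files_modified_at(html_doc, "2023-12-31 22:00")`; returning a list makes it visible when zero or several files match.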



In [7]:
# Load the file into Pandas (url already ends with "/", so no extra separator is needed)

df = pd.read_csv(url + csvName)
df.head()

Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,ELEVATION,NAME,REPORT_TYPE,SOURCE,HourlyAltimeterSetting,HourlyDewPointTemperature,...,BackupDirection,BackupDistance,BackupDistanceUnit,BackupElements,BackupElevation,BackupEquipment,BackupLatitude,BackupLongitude,BackupName,WindEquipmentChangeDate
0,1009099999,2021-01-01T04:00:00,80.65,25.0,5.0,"KARL XII OYA, SV",FM-12,4,,,...,,,,,,,,,,
1,1009099999,2021-01-01T19:00:00,80.65,25.0,5.0,"KARL XII OYA, SV",FM-12,4,,,...,,,,,,,,,,
2,1009099999,2021-01-01T21:00:00,80.65,25.0,5.0,"KARL XII OYA, SV",FM-12,4,,,...,,,,,,,,,,
3,1009099999,2021-01-02T07:00:00,80.65,25.0,5.0,"KARL XII OYA, SV",FM-12,4,,,...,,,,,,,,,,
4,1009099999,2021-01-03T01:00:00,80.65,25.0,5.0,"KARL XII OYA, SV",FM-12,4,,,...,,,,,,,,,,
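
The cell above lets Pandas read the CSV over HTTP, which skips the "write the file locally" step from the problem statement. The sketch below shows one way to save the file first. It deliberately uses the standard library's `urllib.request` instead of `requests`, so it has no third-party dependency; `download_to` is a hypothetical helper name, and the 30-second timeout is an arbitrary choice.

```python
import shutil
import urllib.request
from pathlib import Path


def download_to(url: str, dest: Path, timeout: float = 30.0) -> Path:
    """Stream the resource at `url` into `dest` and return `dest`."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # urlopen raises URLError/HTTPError on failure, and it is evaluated
    # before the output file is opened, so a failed request leaves no file.
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(dest, "wb") as out:
        shutil.copyfileobj(response, out)  # copy in chunks, not all at once
    return dest
```

With the notebook's variables, usage would look like `local = download_to(url + csvName, Path("downloads") / csvName)` followed by `df = pd.read_csv(local)`.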


In [8]:
# find the records with the highest HourlyDryBulbTemperature

df.loc[df['HourlyDryBulbTemperature'] == df['HourlyDryBulbTemperature'].max()]

Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,ELEVATION,NAME,REPORT_TYPE,SOURCE,HourlyAltimeterSetting,HourlyDewPointTemperature,...,BackupDirection,BackupDistance,BackupDistanceUnit,BackupElements,BackupElevation,BackupEquipment,BackupLatitude,BackupLongitude,BackupName,WindEquipmentChangeDate
276,1009099999,2021-07-12T01:00:00,80.65,25.0,5.0,"KARL XII OYA, SV",FM-12,4,,,...,,,,,,,,,,
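
The comparison above relies on `HourlyDryBulbTemperature` having been parsed as a numeric column. If Pandas reads it as text instead (for example because some values carry a non-numeric suffix, which is an assumption here, not something the output above shows), `.max()` would compare strings rather than numbers. A minimal defensive sketch, using made-up sample rows and a hypothetical helper `hottest_records`:

```python
import pandas as pd

# Hypothetical sample rows; the values are made up for illustration.
sample = pd.DataFrame(
    {
        "DATE": ["2021-07-11T22:00:00", "2021-07-12T01:00:00",
                 "2021-07-12T04:00:00", "2021-07-12T07:00:00"],
        "HourlyDryBulbTemperature": ["48", "51", "52s", "51"],
    }
)


def hottest_records(df, column="HourlyDryBulbTemperature"):
    """Return every row whose `column` equals the numeric maximum."""
    # Values that cannot be parsed as numbers become NaN and are ignored.
    numeric = pd.to_numeric(df[column], errors="coerce")
    return df.loc[numeric == numeric.max()]


print(hottest_records(sample))
```

Coercing unparseable values to `NaN` excludes them from the maximum; whether such readings should instead be cleaned and kept is a judgment call that depends on the dataset's documentation.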


In [9]:
print('Exercise-2 Completed Successfully')

Exercise-2 Completed Successfully
