## Process historical weather pages from Weather Underground

### Objective

* Produce a CSV file of tabulated weather observations extracted from each saved Weather Underground web page

### Rationale
* Why This?  Weather data in CSV form is needed to build the predictive database, but it is not available in that form.  This tool puts the data into the appropriate form using freely available services.

* Why Me (andrewguenthner)?  Since I will be building the models, I am best suited to put this data into the right form.

* Why Now?  Openly available weather data has a finite life span, so it needs to be saved in a form suitable for long-term storage before it goes away.

### Requirements

* Python 3.7 or later (the code uses `datetime.fromisoformat`)
* Pandas 0.24.2
* Numpy 1.16.4
* Beautifulsoup4 (imported as `bs4`)

### Input and Output

* The notebook processes every `.html` file it finds in a given folder. Files should be named `{station_id}_{mmddyy}.html`, where `{station_id}` is the Personal Weather Station ID and `{mmddyy}` is the observation date.

* All input files should be in the folder `models/wx_record/wu_raw_html`, assuming the notebook is located in `models/notebooks`.  Adjust the paths if your layout differs.

* Output files are named `{station_id}_{mmddyy}_p01.csv` and stored in the folder `models/wx_record/wx_station_by_date`. This folder must already exist, because `DataFrame.to_csv` does not create missing folders.
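
As an illustration of the naming convention above, the following sketch splits an input file name into its station ID and observation date. The helper name `split_station_and_date` and the station ID `KXXEXAMPLE1` are made up for this example and are not part of the notebook.

```python
import datetime as dt
from pathlib import PurePosixPath


def split_station_and_date(path):
    """Hypothetical helper: split '{station_id}_{mmddyy}.html' into its parts."""
    stem = PurePosixPath(path).stem  # file name without the '.html' suffix
    # Split on the last underscore so the date is always the final field
    station_id, date_text = stem.rsplit('_', 1)
    # '%m%d%y' matches the mmddyy convention, e.g. '060119' is June 1, 2019
    obs_date = dt.datetime.strptime(date_text, '%m%d%y').date()
    return station_id, obs_date


# 'KXXEXAMPLE1' is an invented station ID used only for illustration
station, day = split_station_and_date('../wx_record/wu_raw_html/KXXEXAMPLE1_060119.html')
print(station, day.isoformat())
```

A file name that does not follow the convention would raise a `ValueError` here, which could serve as an early check before any parsing is attempted.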

### Import and set-up

In [60]:
import datetime as dt 
import re
import glob
import bs4
import pandas as pd

In [78]:
def create_output_filename(corresponding_input_name : str) -> str:
    """Generates output filename for processing parsed Weather Underground pages, given the input file name"""
    try:
        path_parts_list = re.split(r'[/\\]',corresponding_input_name)
        filename = path_parts_list.pop()
        name_and_suffix = filename.split('.')
        parts_of_name = name_and_suffix[0].split('_')
        parts_of_name.append('p01.csv')
        new_filename = '_'.join(parts_of_name)
        # Drop the input folder name (e.g. wu_raw_html); it is replaced below
        path_parts_list.pop()
        new_parent_dir = 'wx_station_by_date'
        path_parts_list.append(new_parent_dir)
        path_parts_list.append(new_filename)
        return '/'.join(path_parts_list)
    except IndexError:
        return corresponding_input_name + '_filename_parse_error'
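
For reference, the same input-to-output mapping can be sketched with `pathlib`. This is an alternative illustration under stated assumptions, not the notebook's implementation: the name `output_path_for` and the sample station ID are invented, the sketch only handles forward-slash paths, and it treats everything before the last dot as the base name.

```python
from pathlib import PurePosixPath


def output_path_for(input_path):
    """Hypothetical pathlib version of the input-to-output name mapping."""
    source = PurePosixPath(input_path)
    # Swap the input's parent folder for 'wx_station_by_date'
    target_dir = source.parent.parent / 'wx_station_by_date'
    # Append the '_p01' tag and switch the extension to .csv
    return str(target_dir / f'{source.stem}_p01.csv')


# 'KXXEXAMPLE1' is an invented station ID used only for illustration
result = output_path_for('../wx_record/wu_raw_html/KXXEXAMPLE1_060119.html')
print(result)
```

For a path shaped like the one shown, this gives `../wx_record/wx_station_by_date/KXXEXAMPLE1_060119_p01.csv`, the same result as the regex-based function above. The two versions differ for bare file names with no folder component, where the notebook's function returns its `_filename_parse_error` fallback.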

In [79]:
def parse_observations(sections):
    """Takes a list of strings and returns a tuple of seven lists containing matched observations.  For use with
    Weather Underground html file parsers.  A section whose observation time cannot be parsed is skipped; any other
    value that cannot be parsed is stored as an empty string.  Values returned include observation time,
    temperature, wind direction, wind speed, wind gust, humidity, and precipitation rate."""
    obstimes, temp_avgs, winddir_avgs, windspeed_avgs, windgust_avgs, humidity_avgs, precip_rates = ([] for _ in range(7))
    for section in sections:
        obs = section.split('&q;')
        try:
            obstime = dt.datetime.fromisoformat(obs[0])
            obstimes.append(obstime)
        except ValueError:
            continue
        try:
            temp_avg_ix = obs.index('tempAvg')
            temp_avg = float(obs[temp_avg_ix + 1].lstrip(':').rstrip(','))
            temp_avgs.append(temp_avg)
        except ValueError:
            temp_avgs.append('')
        try:
            winddir_avg_ix = obs.index('winddirAvg')
            winddir_avg = float(obs[winddir_avg_ix + 1].lstrip(':').rstrip(','))
            winddir_avgs.append(winddir_avg)
        except ValueError:
            winddir_avgs.append('')
        try:
            windspeed_avg_ix = obs.index('windspeedAvg')
            windspeed_avg = float(obs[windspeed_avg_ix + 1].lstrip(':').rstrip(','))
            windspeed_avgs.append(windspeed_avg)
        except ValueError:
            windspeed_avgs.append('')
        try:
            windgust_avg_ix = obs.index('windgustAvg')
            windgust_avg = float(obs[windgust_avg_ix + 1].lstrip(':').rstrip(','))
            windgust_avgs.append(windgust_avg)
        except ValueError:
            windgust_avgs.append('')
        try:
            humidity_avg_ix = obs.index('humidityAvg')
            humidity_avg = float(obs[humidity_avg_ix + 1].lstrip(':').rstrip(','))
            humidity_avgs.append(humidity_avg)
        except ValueError:
            humidity_avgs.append('')
        try:
            precip_rate_ix = obs.index('precipRate')
            precip_rate = float(obs[precip_rate_ix + 1].lstrip(':').rstrip(','))
            precip_rates.append(precip_rate)
        except ValueError:
            precip_rates.append('')
    return obstimes, temp_avgs, winddir_avgs, windspeed_avgs, windgust_avgs, humidity_avgs, precip_rates
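
To make the token handling in `parse_observations` concrete, here is a minimal worked example on a single section. The fragment below is invented to match what the code expects (a timestamp followed by `&q;`-delimited keys and values); the real page text may be laid out differently. The helper `value_after` is hypothetical and only condenses the repeated try/except pattern.

```python
import datetime as dt

# Invented fragment shaped like the text that follows an 'obsTimeLocal' marker
section = '2019-06-01 00:04:00&q;,&q;tempAvg&q;:61.5,&q;humidityAvg&q;:null,&q;'

tokens = section.split('&q;')
# tokens[0] is the timestamp; each key is followed by a token such as ':61.5,'
obs_time = dt.datetime.fromisoformat(tokens[0])


def value_after(tokens, key):
    """Hypothetical helper: numeric value that follows key, or '' on failure."""
    try:
        raw = tokens[tokens.index(key) + 1]
        return float(raw.lstrip(':').rstrip(','))
    except ValueError:
        # Raised when the key is absent or the text is not a number
        return ''


temp = value_after(tokens, 'tempAvg')
humidity = value_after(tokens, 'humidityAvg')  # 'null' cannot be converted
wind = value_after(tokens, 'windspeedAvg')     # key not present in the fragment
print(obs_time, repr(temp), repr(humidity), repr(wind))
```

Both failure modes (a missing key and a non-numeric value) raise `ValueError`, which is why a single `except ValueError` clause is enough to keep every list the same length.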

In [80]:
def process_file(input_file):
    """Takes a stored html file with weather observations from Weather Underground and converts it to a .csv file with
    a scraped observation table.  Returns 1 if the file was processed and 0 if it was skipped."""
    try:
        # Use a context manager so the file handle is closed after parsing
        with open(input_file) as html_file:
            soup = bs4.BeautifulSoup(html_file, 'html.parser')
        page_sections = soup.text.split('&q;obsTimeLocal&q;:&q;')
        # Dump the giant first bit
        page_sections.pop(0)
        # And dump the last bit
        page_sections.pop()
        # parse_observations will populate lists that go to the DataFrame 
        otime, t, w_dir, w_spd, w_gust, rh, prcp = parse_observations(page_sections)
        obs_df = pd.DataFrame({'time':otime,'T':t,'w_dir':w_dir,'w_spd':w_spd,
                              'w_gust':w_gust,'rh':rh,'precip':prcp})
        output_file = create_output_filename(input_file)
        obs_df.to_csv(output_file)
        return 1
    except IndexError:
        print(f'File {input_file} skipped due to error in processing.')
    return 0
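
The split-and-trim step in `process_file` can be tried on a small stand-in string. This is a sketch under assumptions: the page text below is invented, and `observation_sections` is a hypothetical helper that only mirrors the three lines that split the text and drop the first and last pieces.

```python
MARKER = '&q;obsTimeLocal&q;:&q;'


def observation_sections(page_text):
    """Hypothetical helper mirroring the split-and-trim steps in process_file."""
    sections = page_text.split(MARKER)
    sections.pop(0)  # text before the first marker
    sections.pop()   # text after the last marker
    return sections


# Invented stand-in for page text containing three markers
page = 'header' + MARKER + 'first' + MARKER + 'second' + MARKER + 'tail'
print(observation_sections(page))

# A page without the marker splits into one piece, so the second pop() fails
try:
    observation_sections('no observations here')
except IndexError:
    print('IndexError: file would be skipped')
```

This shows where the `IndexError` handled by `process_file` can come from. Note that a page with exactly one marker leaves an empty list instead of raising, so under this logic an empty table would be written and the file counted as processed.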

In [81]:
processed_count = 0
for filename in glob.glob('../wx_record/wu_raw_html/*.html'):
    # process_file will save the output file as a side effect !!!
    processed_count += process_file(filename)
print(f'{processed_count} files processed.')

5 files processed.
