## Scrape precipitation forecast values from _wunderground.com_

In this notebook I developed and tested the code to scrape weather data from *wunderground.com*. Once the code worked reliably, I wrapped it in a function that can be called from the ``streamlit_app.py`` script.

I need the precipitation forecast to make the predictions.

www.wunderground.com seems to have a security feature that blocks known spider/bot user agents (such as the one ```urllib``` sends by default from Python). I tried it myself and couldn't get the page source. This makes sense, since they want you to pay for their API.
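
To see why a plain ```urllib``` request is easy to flag, it helps to look at the User-Agent header it sends by default. The sketch below only inspects that header locally and makes no network request; the helper name `default_urllib_user_agent` is made up for this illustration.

```python
import urllib.request


def default_urllib_user_agent():
    """Return the User-Agent string urllib attaches to requests by default."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs added to every request
    headers = dict(opener.addheaders)
    return headers["User-agent"]


print(default_urllib_user_agent())  # something like "Python-urllib/3.x"
```

A value that starts with `Python-urllib/` identifies the client as a script rather than a browser, which is presumably what the site filters on.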

If you don't want to pay (like me), you have to make the request look like it comes from a known browser user agent (e.g. Chrome).

This is why I use **Selenium WebDriver**. WebDriver drives a browser natively, as a user would.

REQUIREMENTS
- Install Selenium: ```!pip install selenium```
- Make sure that the location of ```chromedriver.exe``` matches the one specified here:<br>
```driver = webdriver.Chrome(executable_path='./chromedriver.exe', options=options)```.  
If you have cloned the repository and started the Jupyter Notebook server from the ``notebook`` folder, all relative paths should work.
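
Before launching the browser, it can be worth checking that the driver file is where the code expects it. This is a minimal sketch using only the standard library; `chromedriver_exists` is a hypothetical helper, not part of Selenium, and its default path simply mirrors the one used in this notebook.

```python
from pathlib import Path


def chromedriver_exists(path='./chromedriver.exe'):
    """Return True if a file exists at the given (possibly relative) path."""
    # Relative paths are resolved against the current working directory
    return Path(path).resolve().is_file()


print(chromedriver_exists())
```

If this prints `False`, either move `chromedriver.exe` next to the notebook or adjust the path passed to `webdriver.Chrome`.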

In [1]:
import numpy as np
import pandas as pd
pd.options.display.max_columns = None
pd.options.display.max_rows = None
from datetime import date, timedelta
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Use .format(YYYY, M, D)
lookup_URL = 'https://www.wunderground.com/hourly/us/ny/new-york-city/date/{}-{}-{}.html'

options = webdriver.ChromeOptions()
options.add_argument('headless')  # run Chrome in the background

# Selenium 3 style call; Selenium 4 takes the driver path through a Service object instead
driver = webdriver.Chrome(executable_path='./chromedriver.exe', options=options)

start_date = date.today() + timedelta(days=1)
end_date = date.today() + timedelta(days=4)

records = []  # one dict per hourly forecast row

while start_date < end_date:
    print('gathering data from: ', start_date)
    formatted_lookup_URL = lookup_URL.format(start_date.year,
                                             start_date.month,
                                             start_date.day)

    driver.get(formatted_lookup_URL)
    # wait up to 60 s until the hourly precipitation cells are rendered
    rows = WebDriverWait(driver, 60).until(EC.visibility_of_all_elements_located((By.XPATH, '//td[@class="mat-cell cdk-cell cdk-column-liquidPrecipitation mat-column-liquidPrecipitation ng-star-inserted"]')))
    for row in rows:
        # find_element(By.XPATH, ...) replaces find_element_by_xpath, which newer Selenium releases no longer provide
        prep = row.find_element(By.XPATH, './/span[@class="wu-value wu-value-to"]').text
        records.append({'Day': str(start_date.day), 'Precipitation': prep})

    start_date += timedelta(days=1)

driver.quit()  # close the headless browser

# build the table once; DataFrame.append is not available in recent pandas versions
df_prep = pd.DataFrame(records)
df_prep

gathering data from:  2020-07-29
gathering data from:  2020-07-30
gathering data from:  2020-07-31


| | Day | Precipitation |
|---|---|---|
| 0 | 29 | 0 |
| 1 | 29 | 0 |
| 2 | 29 | 0 |
| 3 | 29 | 0 |
| 4 | 29 | 0 |
| 5 | 29 | 0 |
| 6 | 29 | 0 |
| 7 | 29 | 0 |
| 8 | 29 | 0 |
| 9 | 29 | 0 |


In [30]:
# Convert script into function

def scrape_data(today, days_in):
    # import libraries
    import numpy as np
    import pandas as pd
    from datetime import date, timedelta
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Use .format(YYYY, M, D)
    lookup_URL = 'https://www.wunderground.com/hourly/us/ny/new-york-city/date/{}-{}-{}.html'

    options = webdriver.ChromeOptions()
    options.add_argument('headless')  # run Chrome in the background

    # Selenium 3 style call; Selenium 4 takes the driver path through a Service object instead
    driver = webdriver.Chrome(executable_path='./chromedriver.exe', options=options)

    start_date = today + timedelta(days=1)
    end_date = today + timedelta(days=days_in + 1)

    records = []  # one dict per hourly forecast row

    try:
        while start_date < end_date:
            # hour counter restarts at 00 for each day; this assumes the
            # table lists the hours in order, starting at midnight
            timestamp = pd.Timestamp(start_date)

            print('gathering data from: ', start_date)

            formatted_lookup_URL = lookup_URL.format(start_date.year,
                                                     start_date.month,
                                                     start_date.day)

            driver.get(formatted_lookup_URL)
            # wait up to 60 s until the hourly precipitation cells are rendered
            rows = WebDriverWait(driver, 60).until(EC.visibility_of_all_elements_located((By.XPATH, '//td[@class="mat-cell cdk-cell cdk-column-liquidPrecipitation mat-column-liquidPrecipitation ng-star-inserted"]')))
            for row in rows:
                hour = timestamp.strftime('%H')
                prep = row.find_element(By.XPATH, './/span[@class="wu-value wu-value-to"]').text
                records.append({'dayhour': str(start_date.day) + hour, 'Precipitation': prep})

                timestamp += pd.Timedelta('1 hour')

            start_date += timedelta(days=1)
    finally:
        driver.quit()  # always close the headless browser

    # build the table once; DataFrame.append is not available in recent pandas versions
    return pd.DataFrame(records)
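
The date window and URL formatting inside `scrape_data` can be checked without starting a browser. The sketch below reproduces only that part with the standard library; `forecast_urls` and `LOOKUP_URL` are names introduced here for illustration, and the fixed date is just an example input.

```python
from datetime import date, timedelta

LOOKUP_URL = 'https://www.wunderground.com/hourly/us/ny/new-york-city/date/{}-{}-{}.html'


def forecast_urls(today, days_in):
    """Build the hourly-forecast URLs for the days_in days after `today`."""
    urls = []
    for offset in range(1, days_in + 1):
        day = today + timedelta(days=offset)
        # month and day are not zero-padded, matching .format(YYYY, M, D)
        urls.append(LOOKUP_URL.format(day.year, day.month, day.day))
    return urls


for url in forecast_urls(date(2020, 7, 28), 2):
    print(url)
```

With `today = date(2020, 7, 28)` and `days_in = 2`, this yields the pages for 2020-7-29 and 2020-7-30, the same two days the test run below reports.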

In [33]:
# test the function
from datetime import date, timedelta
import pandas as pd
pd.options.display.max_columns = None
pd.options.display.max_rows = None

d = scrape_data(date.today(), 2)
d

gathering data from:  2020-07-29
gathering data from:  2020-07-30


| | dayhour | Precipitation |
|---|---|---|
| 0 | 2900 | 0 |
| 1 | 2901 | 0 |
| 2 | 2902 | 0 |
| 3 | 2903 | 0 |
| 4 | 2904 | 0 |
| 5 | 2905 | 0 |
| 6 | 2906 | 0 |
| 7 | 2907 | 0 |
| 8 | 2908 | 0 |
| 9 | 2909 | 0 |
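
The `dayhour` labels in the table are the day of the month followed by a two-digit hour. As a rough sketch of that rule on its own (the helper `dayhour_keys` is hypothetical and not part of the scraper):

```python
def dayhour_keys(day, n_rows):
    """Return the dayhour labels for n_rows hourly rows of one day."""
    # The hour is always two digits and wraps after 23, like strftime('%H')
    return [f'{day}{i % 24:02d}' for i in range(n_rows)]


print(dayhour_keys(29, 3))  # ['2900', '2901', '2902']
print(dayhour_keys(1, 2))   # ['100', '101']
```

Because the day is not zero-padded, labels for days 1 to 9 are three characters long while the rest are four, which is worth keeping in mind when joining this table with other data.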
