## Welcome to your first case study
- In this case study, you will scrape weather data from the website **"http://www.estesparkweather.net/archive_reports.php?date=200901"**
- Scrape all the available attributes of weather data for each day from **2009-01-01 to 2018-10-28**
- Ignore records for missing days
- Represent the scraped data as a **pandas dataframe** object.

### Dataframe specific details
- Expected column names (order does not matter):   
       ['Average temperature (°F)', 'Average humidity (%)',
       'Average dewpoint (°F)', 'Average barometer (in)',
       'Average windspeed (mph)', 'Average gustspeed (mph)',
       'Average direction (°deg)', 'Rainfall for month (in)',
       'Rainfall for year (in)', 'Maximum rain per minute',
       'Maximum temperature (°F)', 'Minimum temperature (°F)',
       'Maximum humidity (%)', 'Minimum humidity (%)', 'Maximum pressure',
       'Minimum pressure', 'Maximum windspeed (mph)',
       'Maximum gust speed (mph)', 'Maximum heat index (°F)']
- Each record in the dataframe corresponds to the weather details of a given day
- Make sure the index column is in **date-time format (yyyy-mm-dd)**
- Perform the necessary data cleaning and cast each attribute to the relevant data type
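
The requirements above can be checked mechanically before saving. The following is a minimal sketch, not part of the original assignment: `check_weather_frame` is a hypothetical helper, and the demo frame holds made-up values for two of the expected columns. In the notebook you would pass the full list of expected column names instead.

```python
import pandas as pd


def check_weather_frame(df, expected_columns):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    # Column order does not matter, so compare as sets
    missing = sorted(set(expected_columns) - set(df.columns))
    if missing:
        problems.append("missing columns: {}".format(missing))
    # The index must hold real timestamps, not date strings
    if not isinstance(df.index, pd.DatetimeIndex):
        problems.append("index is not a DatetimeIndex")
    # Every attribute should have been cast to a numeric dtype
    non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    if non_numeric:
        problems.append("non-numeric columns: {}".format(non_numeric))
    return problems


# Tiny made-up frame with two of the expected columns, for illustration only
demo = pd.DataFrame(
    {"Average humidity (%)": [35.0, 32.0], "Maximum pressure": [30.1, 30.2]},
    index=pd.to_datetime(["2009-01-01", "2009-01-02"]),
)
good_report = check_weather_frame(demo, ["Average humidity (%)", "Maximum pressure"])

# Same data with a plain integer index and one expected column missing
bad_report = check_weather_frame(
    demo.reset_index(drop=True), ["Average humidity (%)", "Minimum pressure"]
)
print(good_report)
print(bad_report)
```

Returning a list of messages rather than raising on the first failure makes it possible to see every unmet requirement in one pass.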

### Saving the dataframe
- Once you are done scraping, save your dataframe as a pickle file named 'dataframe.pk'

#### Sample code to save a pickle file
```python
import pickle
with open("dataframe.pk", "wb") as file:
    pickle.dump(<your_dataframe>, file)
```
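
To confirm that the saved file can be read back, a round-trip check along these lines may help. This is a sketch under stated assumptions: the frame is a made-up stand-in for the scraped data, and the file is written to a temporary directory so that a real `dataframe.pk` is not overwritten.

```python
import os
import pickle
import tempfile

import pandas as pd

# Made-up stand-in for the scraped dataframe
frame = pd.DataFrame(
    {"Average humidity (%)": [35.0, 32.0]},
    index=pd.to_datetime(["2009-01-01", "2009-01-02"]),
)

# Write to a temporary location rather than the working directory
path = os.path.join(tempfile.mkdtemp(), "dataframe.pk")
with open(path, "wb") as file:
    pickle.dump(frame, file)

# Read the file back and compare it with the original
with open(path, "rb") as file:
    restored = pickle.load(file)

round_trip_ok = restored.equals(frame)
print(round_trip_ok)
```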
 
 

### Run the cell below to import the necessary packages
- These packages should be sufficient to perform your task
- If you need any other packages, run **!pip3 install <package_name> --user** within a cell

In [1]:
import pandas as pd
import pickle

In [2]:
columns =  ['Average temperature (°F)', 'Average humidity (%)',
 'Average dewpoint (°F)', 'Average barometer (in)',
 'Average windspeed (mph)', 'Average gustspeed (mph)',
 'Average direction (°deg)', 'Rainfall for month (in)',
 'Rainfall for year (in)', 'Maximum rain per minute',
 'Maximum temperature (°F)', 'Minimum temperature (°F)',
 'Maximum humidity (%)', 'Minimum humidity (%)', 'Maximum pressure',
 'Minimum pressure', 'Maximum windspeed (mph)',
 'Maximum gust speed (mph)', 'Maximum heat index (°F)']

columns_mapper = {i: column for i, column in enumerate(columns, 1)}

In [3]:
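# One timestamp per month; "MS" (month start) avoids the "M" alias deprecated in recent pandas
dates_index = pd.date_range(start="2009-01-01", end="2018-10-28", freq="MS")
# (year, zero-padded month) pairs such as ("2009", "01"), used to build the archive URLs
dates_parsed = [(str(date.year), "{:0>2}".format(date.month)) for date in dates_index]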

In [4]:
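def process(table, year, month):
    # Characters to strip from readings such as "37.8°F" or "35%"
    translator = str.maketrans("", "", "°F%")
    # Keep only the first token of each reading and remove the unit characters
    table.loc[1:, 1] = table.loc[1:, 1].str.split().apply(lambda x: x[0].translate(translator))
    # The second word of the header cell is the day of the month; build a yyyy-mm-dd string
    table.loc[0, 1] = "{}-{}-{:0>2}".format(year, month, table.loc[0, 0].split()[1])
    # Transpose so the attributes become columns and the date becomes the index
    table = table.T.drop(0).set_index(0)
    return table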
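
To see what `process` does to a single daily table, the sketch below runs it on a small hand-made input. The block repeats the function so that it runs on its own. The sample rows are invented and only assume the layout the code implies (a header row whose second word is the day, followed by label and reading pairs); they are not copied from the website.

```python
import pandas as pd


def process(table, year, month):
    # Strip unit characters from readings such as "37.8°F" or "35%"
    translator = str.maketrans("", "", "°F%")
    table.loc[1:, 1] = table.loc[1:, 1].str.split().apply(lambda x: x[0].translate(translator))
    # Replace the header cell with a yyyy-mm-dd date string
    table.loc[0, 1] = "{}-{}-{:0>2}".format(year, month, table.loc[0, 0].split()[1])
    # One row per day: attributes as columns, date as index
    return table.T.drop(0).set_index(0)


# Hypothetical two-column table in the shape pd.read_html is assumed to return
raw = pd.DataFrame({
    0: ["Jan 1 Average and Extremes", "Average temperature", "Average humidity", "Average direction"],
    1: ["Jan 1 Average and Extremes", "37.8°F", "35%", "264° (W)"],
})

row = process(raw, "2009", "01")
print(row)
```

The result has a single row indexed by the date string, with positional column labels 1, 2, 3 that are later renamed to the attribute names. The values are still strings at this point; the cast to float happens in a later cell.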

In [5]:
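def filter_criterion(table):
    # A daily table has one header row plus one row per attribute, in two columns
    if table.shape == (len(columns)+1, 2):
        try:
            # The second word of the header cell must be the day number
            int(table.loc[0, 0].split()[1])
            return True
        except ValueError:
            return False
    else:
        return False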

In [6]:
def data_for_month(year, month):
    data = pd.DataFrame()
    url = "http://www.estesparkweather.net/archive_reports.php?date={}"
    for table in pd.read_html(url.format(year+month)):
        if filter_criterion(table):
            table = process(table, year, month)
            if not table.empty:
                data = pd.concat([data, table])
    return data

In [7]:
global_data = pd.DataFrame()
for (year, month) in dates_parsed:
    data = data_for_month(year, month)
    print("{}: {} days".format(year+month, data.shape[0]))
    global_data = pd.concat([global_data, data])

200901: 31 days
200902: 28 days
200903: 28 days
200904: 30 days
200905: 31 days
200906: 30 days
200907: 31 days
200908: 31 days
200909: 30 days
200910: 31 days
200911: 30 days
200912: 0 days
201001: 31 days
201002: 28 days
201003: 31 days
201004: 30 days
201005: 30 days
201006: 30 days
201007: 30 days
201008: 31 days
201009: 29 days
201010: 31 days
201011: 30 days
201012: 30 days
201101: 31 days
201102: 28 days
201103: 3 days
201104: 4 days
201105: 31 days
201106: 30 days
201107: 18 days
201108: 31 days
201109: 4 days
201110: 31 days
201111: 29 days
201112: 10 days
201201: 31 days
201202: 0 days
201203: 31 days
201204: 30 days
201205: 31 days
201206: 30 days
201207: 31 days
201208: 31 days
201209: 30 days
201210: 31 days
201211: 30 days
201212: 31 days
201301: 31 days
201302: 28 days
201303: 31 days
201304: 30 days
201305: 26 days
201306: 30 days
201307: 7 days
201308: 31 days
201309: 14 days
201310: 31 days
201311: 30 days
201312: 31 days
201401: 31 days
201402: 28 days
201403: 31 days

In [8]:
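# Replace the positional column labels 1..19 with the attribute names
global_data.rename(columns=columns_mapper, inplace=True)
# Parse the yyyy-mm-dd strings into a DatetimeIndex named "Day"
global_data.index = pd.to_datetime(global_data.index.rename("Day"))
# Drop the last three rows, presumably the days after the requested end date of 2018-10-28
global_data = global_data.drop(global_data.tail(3).index)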
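
Dropping rows with `tail(3)` only works if exactly three unwanted days were scraped. A possible alternative, sketched here on made-up data, is to slice by date so that the cutoff does not depend on how many rows follow it. The frame and its values are hypothetical; only the date bounds come from the task description.

```python
import pandas as pd

# Made-up frame covering the last days of October 2018
frame = pd.DataFrame(
    {"Average humidity (%)": [40.0, 41.0, 42.0, 43.0, 44.0]},
    index=pd.to_datetime(
        ["2018-10-27", "2018-10-28", "2018-10-29", "2018-10-30", "2018-10-31"]
    ),
)
frame.index.name = "Day"

# Label slicing on a sorted DatetimeIndex includes both endpoints
trimmed = frame.sort_index().loc["2009-01-01":"2018-10-28"]
print(trimmed)
```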

In [9]:
global_data.shape

(3280, 19)

In [10]:
global_data = global_data.astype('float')
global_data.dtypes

Average temperature (°F)    float64
Average humidity (%)        float64
Average dewpoint (°F)       float64
Average barometer (in)      float64
Average windspeed (mph)     float64
Average gustspeed (mph)     float64
Average direction (°deg)    float64
Rainfall for month (in)     float64
Rainfall for year (in)      float64
Maximum rain per minute     float64
Maximum temperature (°F)    float64
Minimum temperature (°F)    float64
Maximum humidity (%)        float64
Minimum humidity (%)        float64
Maximum pressure            float64
Minimum pressure            float64
Maximum windspeed (mph)     float64
Maximum gust speed (mph)    float64
Maximum heat index (°F)     float64
dtype: object

In [11]:
with open("dataframe.pk", "wb") as file:
    pickle.dump(global_data, file)