## Data scraping from Bureau of Meteorology website

The Bureau of Meteorology publishes a list of all meteorological stations, including those decommissioned in the past.

In [182]:
import requests
import pandas as pd
import numpy as np
from datetime import datetime
import re

In [183]:
request = requests.get("http://www.bom.gov.au/climate/data/lists_by_element/alphaVIC_123.txt", timeout=30)
request.raise_for_status()  # fail early if the download did not succeed

The data requires some processing and formatting. The function below does that.

In [184]:
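def clean_meteo_record(row):
    """
    Format a single meteorological station record.
    The column order changes; the returned list starts with:
        'Site', 'Name', 'Start', 'End', ...

    followed by the remaining fields in their original order.
    """
    # Matches dates such as 'Sep 1969' (raw string avoids an invalid-escape warning)
    date_pttn = re.compile(r"[A-Z][a-z]{2}\s[0-9]{4}")

    # Each row is expected to contain exactly two dates: start and end
    start, end = date_pttn.findall(row)
    rest = date_pttn.sub("", row)

    site = rest.strip().split()[0]
    rest = " ".join(rest.strip().split()[1:])

    # One record has no AWS flag; default it to 'N'
    if rest[-1] not in ['N', 'Y']:
        rest += " N"

    # The last five fields are fixed; everything before them is the name
    name = " ".join(rest.split()[:-5])
    rest = rest.split()[-5:]

    return [site, name, start, end, *rest]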
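
To make the parsing steps concrete, here is a minimal sketch that walks through them on a single line. The raw line is hypothetical: its layout (site, name, latitude, longitude, start, end, years, %, AWS) is an assumption about the text file, and the values are modelled on the Alexandra record in the table further down.

```python
import re

# Hypothetical raw line; the column layout is assumed, not copied from the file.
row = "  88001 ALEXANDRA (POST OFFICE)   -37.1916  145.7116 Jan 1965 Feb 1970    5.2   83    N"

date_pattern = re.compile(r"[A-Z][a-z]{2}\s[0-9]{4}")

# Step 1: pull out the two 'Mon YYYY' dates, then remove them from the line.
start, end = date_pattern.findall(row)
tokens = date_pattern.sub("", row).split()

# Step 2: the first token is the site number, the last five are fixed fields,
# and whatever lies in between is the station name (which may contain spaces).
site = tokens[0]
name = " ".join(tokens[1:-5])
tail = tokens[-5:]  # lat, lon, years, %, AWS

record = [site, name, start, end, *tail]
print(record)
```

Extracting the dates first is what makes the rest simple: once they are gone, every remaining field is a single whitespace-separated token except the name.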

In [185]:
columns = ['Site', 'Name', 'Start', 'End', 'Lat', 'Lon', 'Years', '%', 'AWS']

# Drop the first 4 and last 7 elements of the split text (file header and footer).
rows = request.text.split("\r\n")[4:-7]
df = pd.DataFrame(data=[clean_meteo_record(row) for row in rows],
                  columns=columns)

# Cast the numeric columns; 'Name', 'Start', 'End' and 'AWS' stay as strings.
df = df.astype({'Site': np.int32, 'Lat': np.float32, 'Lon': np.float32,
                'Years': np.float32, '%': np.float32})
df

| | Site | Name | Start | End | Lat | Lon | Years | % | AWS |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 85000 | ABERFELDY | Sep 1969 | Oct 1974 | -37.700001 | 146.366699 | 5.100000 | 95.0 | N |
| 1 | 90180 | AIREYS INLET | Jul 1990 | Oct 2020 | -38.458302 | 144.088303 | 30.299999 | 97.0 | Y |
| 2 | 88001 | ALEXANDRA (POST OFFICE) | Jan 1965 | Feb 1970 | -37.191601 | 145.711594 | 5.200000 | 83.0 | N |
| 3 | 89000 | ARARAT POST OFFICE | Jan 1962 | Apr 1969 | -37.283298 | 142.949997 | 7.300000 | 98.0 | N |
| 4 | 89085 | ARARAT PRISON | Jun 1969 | Oct 2020 | -37.276901 | 142.978607 | 51.400002 | 99.0 | N |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 253 | 85098 | YALLOURN | Apr 1932 | Oct 1949 | -38.200001 | 146.399994 | 17.600000 | 100.0 | N |
| 254 | 85103 | YALLOURN SEC | Jan 1957 | Sep 1986 | -38.185799 | 146.331696 | 29.700001 | 99.0 | N |
| 255 | 85151 | YARRAM AIRPORT | Oct 2007 | Oct 2020 | -38.564701 | 146.747894 | 13.100000 | 97.0 | Y |
| 256 | 81124 | YARRAWONGA | May 1993 | Oct 2020 | -36.029400 | 146.030502 | 27.299999 | 97.0 | Y |


In [186]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 258 entries, 0 to 257
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Site    258 non-null    int32  
 1   Name    258 non-null    object 
 2   Start   258 non-null    object 
 3   End     258 non-null    object 
 4   Lat     258 non-null    float32
 5   Lon     258 non-null    float32
 6   Years   258 non-null    float32
 7   %       258 non-null    float32
 8   AWS     258 non-null    object 
dtypes: float32(4), int32(1), object(4)
memory usage: 13.2+ KB


Keep only the stations that are currently active, i.e. those whose `End` month is the current month.

In [187]:
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
today = datetime.today()
this_month = months[today.month-1] + " " + str(today.year)
this_month

'Oct 2020'

In [188]:
station_ids = df[df['End'] == this_month]['Site'].values
len(station_ids)

90
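
Comparing `End` against today's month only finds stations if the notebook is run in the same month as the most recent entry in the list. One possible alternative, sketched here with the standard library only, is to parse the `Mon YYYY` labels and keep the stations whose `End` equals the latest month present in the data. The helper names `parse_month` and `active_sites` are hypothetical, and the sample pairs are taken from the rows shown in the table above.

```python
from datetime import datetime


def parse_month(label):
    """Parse a 'Mon YYYY' label such as 'Oct 2020' into a datetime.

    Note: '%b' is locale-dependent; this assumes English month abbreviations.
    """
    return datetime.strptime(label, "%b %Y")


def active_sites(records):
    """Return the sites whose End month is the latest one in `records`.

    `records` is an iterable of (site, end_label) pairs.
    """
    parsed = [(site, parse_month(end)) for site, end in records]
    latest = max(end for _, end in parsed)
    return [site for site, end in parsed if end == latest]


# (Site, End) pairs copied from the table shown earlier.
sample = [(85000, "Oct 1974"), (90180, "Oct 2020"),
          (89085, "Oct 2020"), (85098, "Oct 1949")]
active = active_sites(sample)
print(active)
```

With the DataFrame from this notebook, the same helper could be called as `active_sites(zip(df.Site, df.End))`. This treats "active" as "reported in the latest month of the list", which is an assumption about how the Bureau updates the file.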

As of October 2020, 90 stations are operational in Victoria. Now we can pull the data from 2015 for these stations. The data is provided as ZIP files.