## Assignment:

Fill in the following information:

- Name: Zhang Huiyi
- Student ID: 24474940

Write a web scraper in Python that harvests the historical data from the Hong Kong Observatory.

The Observatory makes available the daily mean values of the weather elements calculated from data in the 30 years from 1981 to 2010 for the 5-day period centred on the day specified. The address for the month of October is:

https://www.hko.gov.hk/en/cis/normal/1981_2010/dnormal10.htm

Starting from that page, perform the following operations:

1.	Scrape the historical weather data for the month of October from the Hong Kong Observatory. [60 marks]

2.	Find the appropriate links to scrape the other months' data. Scrape the links with Python code and store them in a Python list. [10 marks]

3.	Harvest all the monthly data from the different months and store it in a pandas DataFrame. [30 marks]


In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

# 1. Scrape the historical weather data for the month of October from the Hong Kong Observatory.

# function to get the cell values of one table row, keyed by the last id in each cell's `headers` attribute
def getTdList(tr):
  return {td.get('headers')[-1]: td.get_text(strip=True) for td in tr.find_all('td')}

# function to get headers of the table
def getThList(tr):
  return {th.get('id'): th.get_text(strip=True) for th in tr.find_all('th')}

# function to get every row of a table and drop the last one
def getTrList(table):
  rows = table.find_all('tr')  # named `rows` to avoid shadowing the built-in `list`
  rows.pop()

  # The first two rows are header rows: merge them into one id-to-title dictionary.
  thDataArr = [getThList(tr) for tr in [rows.pop(0), rows.pop(0)]]
  thDataArr[0].update(thDataArr[1])
  thDataMap = thDataArr[0]
  # The remaining rows hold the data.
  tdData = [getTdList(tr) for tr in rows]

  return {
      'thDataMap': thDataMap,
      'tdData': tdData
  }

# Define a data scraping function for a specific web page.
def scrapingData(url):
  page = requests.get(url)
  soup = BeautifulSoup(page.content, 'html.parser')

  # Get the data in two tables.
  tables = soup.find_all('tbody')
  data = [getTrList(table) for table in tables]

  # Combine the headers of the two tables
  headerTitleMap = data[0].get('thDataMap')
  headerTitleMap.update(data[1].get('thDataMap'))
  headerTitleMap.update({'seatemp': headerTitleMap.get('am')})
  headerTitleMap.update({'seatemp2': headerTitleMap.get('pm')})

  # Combine the data of the two tables row by row
  targetData = data[0]['tdData']
  for i, item in enumerate(targetData):
    obj = data[1]['tdData'][i]
    item.update(obj)

  return pd.DataFrame({headerTitleMap.get(key): [d[key] for d in targetData] for key in targetData[0]})

# Call the function to extract weather data for October.
scrapingData('https://www.hko.gov.hk/en/cis/normal/1981_2010/dnormal10.htm')

Unnamed: 0,Date,MeanPressure(hPa)Figure,MeanMaximum(deg. C),Mean(deg. C),MeanMinimum(deg. C),Wet Bulb(deg. C)Figure,Dew Point(deg. C)Figure,RelativeHumidity(%)Figure,MeanDailyRainfall(mm)Figure,Amountof Cloud(%)Figure,BrightSunshineDuration(hours)Figure,PrevailingDirection(degrees),MeanSpeed(km/h),AM(deg. C),PM(deg. C)
0,1 Oct,1012.0,29.1,26.8,25.1,23.7,22.3,77,6.0,64,5.6,80,27,27.1,27.5
1,2 Oct,1012.3,29.1,26.8,25.1,23.7,22.1,76,4.3,63,5.7,80,27,27.1,27.4
2,3 Oct,1012.4,29.1,26.8,25.0,23.6,21.9,76,5.8,62,5.9,80,27,27.1,27.3
3,4 Oct,1012.5,29.0,26.7,24.9,23.4,21.7,75,6.5,61,6.0,80,26,27.0,27.3
4,5 Oct,1012.5,28.9,26.6,24.8,23.2,21.4,74,5.8,60,6.2,80,24,27.0,27.3
5,6 Oct,1012.6,28.8,26.5,24.7,23.0,21.2,74,4.6,59,6.3,90,23,27.0,27.3
6,7 Oct,1012.8,28.7,26.4,24.5,22.8,20.9,73,4.4,59,6.3,90,23,27.0,27.3
7,8 Oct,1013.0,28.6,26.3,24.4,22.7,20.7,73,2.7,57,6.4,90,24,26.9,27.3
8,9 Oct,1013.3,28.5,26.3,24.5,22.7,20.8,73,2.3,57,6.4,80,25,26.9,27.3
9,10 Oct,1013.5,28.5,26.2,24.5,22.8,20.9,74,2.6,57,6.4,80,27,26.8,27.2
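
The last step of `scrapingData` can be hard to follow, so here is a minimal sketch of it using plain dictionaries instead of the live page. It shows how rows keyed by header id are merged across the two tables and then turned into columns keyed by the visible titles. The id `date` and the variable names `table1Rows`, `table2Rows` and `columns` are assumptions made for illustration; only `am`, `pm`, `seatemp` and `seatemp2` appear in the code above, and the values are taken from the first two rows of the October output.

```python
# Header map: <th id> -> visible title. The id 'date' is a hypothetical stand-in.
headerTitleMap = {'date': 'Date', 'am': 'AM(deg. C)', 'pm': 'PM(deg. C)'}

# The sea temperature cells are keyed by 'seatemp' and 'seatemp2',
# so alias those keys to the AM and PM titles, as scrapingData does.
headerTitleMap.update({'seatemp': headerTitleMap.get('am')})
headerTitleMap.update({'seatemp2': headerTitleMap.get('pm')})

# One dictionary per row, keyed by header id (one list per table).
table1Rows = [{'date': '1 Oct'}, {'date': '2 Oct'}]
table2Rows = [{'seatemp': '27.1', 'seatemp2': '27.5'},
              {'seatemp': '27.1', 'seatemp2': '27.4'}]

# Merge the second table's cells into the matching row of the first table.
for row, extra in zip(table1Rows, table2Rows):
  row.update(extra)

# Pivot the rows into columns and rename the keys to their titles.
# This dictionary is what would be passed to pd.DataFrame.
columns = {headerTitleMap.get(key): [row[key] for row in table1Rows]
           for key in table1Rows[0]}
print(columns)
```

Because the merge relies on row position, it assumes both tables list the same dates in the same order.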


In [2]:
# 2. Find the appropriate links to scrape the other months' data. Scrape the links with Python code and store them in a Python list.
def getMonthUrl():
  # The month pages share one naming pattern, dnormal01.htm to dnormal12.htm, so {i:02} zero-pads the month number.
  return [f"https://www.hko.gov.hk/en/cis/normal/1981_2010/dnormal{i:02}.htm" for i in range(1, 13)]
monthUrl = getMonthUrl()
print(monthUrl)

['https://www.hko.gov.hk/en/cis/normal/1981_2010/dnormal01.htm', 'https://www.hko.gov.hk/en/cis/normal/1981_2010/dnormal02.htm', 'https://www.hko.gov.hk/en/cis/normal/1981_2010/dnormal03.htm', 'https://www.hko.gov.hk/en/cis/normal/1981_2010/dnormal04.htm', 'https://www.hko.gov.hk/en/cis/normal/1981_2010/dnormal05.htm', 'https://www.hko.gov.hk/en/cis/normal/1981_2010/dnormal06.htm', 'https://www.hko.gov.hk/en/cis/normal/1981_2010/dnormal07.htm', 'https://www.hko.gov.hk/en/cis/normal/1981_2010/dnormal08.htm', 'https://www.hko.gov.hk/en/cis/normal/1981_2010/dnormal09.htm', 'https://www.hko.gov.hk/en/cis/normal/1981_2010/dnormal10.htm', 'https://www.hko.gov.hk/en/cis/normal/1981_2010/dnormal11.htm', 'https://www.hko.gov.hk/en/cis/normal/1981_2010/dnormal12.htm']
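
The cell above builds the links from the naming pattern. Since the task asks for the links to be scraped, the following is a minimal sketch of collecting them from anchor tags instead. It uses only the standard library and a made-up HTML snippet so that it runs without network access; `SAMPLE_HTML`, `LinkCollector` and `getMonthLinks` are hypothetical names, and the real page's markup may differ. With BeautifulSoup, the same filter could be applied to the results of `soup.find_all('a', href=True)`.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

START_URL = 'https://www.hko.gov.hk/en/cis/normal/1981_2010/dnormal10.htm'

# Hypothetical stand-in for the month navigation on the page (an assumption).
SAMPLE_HTML = """
<ul>
  <li><a href="dnormal01.htm">Jan</a></li>
  <li><a href="dnormal02.htm">Feb</a></li>
  <li><a href="dnormal10.htm">Oct</a></li>
  <li><a href="dnormal01.htm">Jan (repeated link)</a></li>
  <li><a href="../../index.htm">Home</a></li>
</ul>
"""

class LinkCollector(HTMLParser):
  """Collects the href of every <a> tag fed to the parser."""
  def __init__(self):
    super().__init__()
    self.hrefs = []

  def handle_starttag(self, tag, attrs):
    if tag == 'a':
      href = dict(attrs).get('href')
      if href:
        self.hrefs.append(href)

def getMonthLinks(html, baseUrl):
  collector = LinkCollector()
  collector.feed(html)
  links = []
  for href in collector.hrefs:
    # Keep only links that match the monthly page naming pattern.
    if re.search(r'dnormal\d{2}\.htm$', href):
      url = urljoin(baseUrl, href)  # resolve relative links against the page URL
      if url not in links:          # drop duplicates, keep page order
        links.append(url)
  return links

sampleLinks = getMonthLinks(SAMPLE_HTML, START_URL)
print(sampleLinks)
```

On the live page, `html` would be `requests.get(START_URL).text`, and the resulting list could be passed to `scrapingData` in the same way as `monthUrl`.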


In [4]:
# 3. Harvest all the monthly data from the different months and store it in a pandas DataFrame.
pd.concat([scrapingData(url) for url in monthUrl], axis=0)

Unnamed: 0,Date,MeanPressure(hPa)Figure,MeanMaximum(deg. C),Mean(deg. C),MeanMinimum(deg. C),Wet Bulb(deg. C)Figure,Dew Point(deg. C)Figure,RelativeHumidity(%)Figure,MeanDailyRainfall(mm)Figure,Amountof Cloud(%)Figure,BrightSunshineDuration(hours)Figure,PrevailingDirection(degrees),MeanSpeed(km/h),AM(deg. C),PM(deg. C)
0,1 Jan,1020.1,19.3,17.0,15.2,14.3,11.8,73,1.2,52,5.3,060,24,18.3,18.6
1,2 Jan,1020.3,19.3,17.1,15.3,14.3,11.8,72,0.7,51,5.5,070,25,18.2,18.5
2,3 Jan,1020.3,19.3,17.1,15.3,14.3,11.8,72,0.8,51,5.4,070,25,18.1,18.4
3,4 Jan,1020.3,19.2,17.0,15.3,14.3,11.8,73,1.0,53,5.3,070,26,18.0,18.3
4,5 Jan,1020.4,19.2,17.0,15.2,14.3,11.8,73,1.1,54,5.4,070,26,17.9,18.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26,27 Dec,1020.5,19.3,17.0,15.0,14.1,11.3,71,1.0,54,5.1,070,26,18.6,18.9
27,28 Dec,1020.3,19.1,16.9,14.9,14.1,11.4,72,1.4,55,4.8,070,26,18.6,18.9
28,29 Dec,1020.2,19.1,16.8,14.8,14.0,11.4,72,1.5,54,4.9,070,25,18.4,18.8
29,30 Dec,1020.2,19.1,16.8,14.9,14.0,11.5,72,1.4,53,5.1,060,25,18.4,18.7
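
In the combined output, the row index restarts for every month and every value is still a string (note the leading zero in `060`). The following is a minimal sketch of one possible follow-up step: renumbering the rows with `ignore_index=True` and converting the measurement columns with `pd.to_numeric`. The frames `jan` and `dec` are small stand-ins for the output of `scrapingData`, built from a few values shown above; the names `allMonths` and `numericCols` are assumptions for illustration.

```python
import pandas as pd

# Small stand-ins for two monthly frames (values copied from the output above).
jan = pd.DataFrame({
  'Date': ['1 Jan', '2 Jan'],
  'Mean(deg. C)': ['17.0', '17.1'],
  'PrevailingDirection(degrees)': ['060', '070'],
})
dec = pd.DataFrame({
  'Date': ['29 Dec', '30 Dec'],
  'Mean(deg. C)': ['16.8', '16.8'],
  'PrevailingDirection(degrees)': ['070', '060'],
})

# ignore_index=True gives one continuous index instead of one that restarts per month.
allMonths = pd.concat([jan, dec], axis=0, ignore_index=True)

# Convert every column except the date label from text to numbers.
# errors='coerce' turns unparseable cells into NaN instead of raising.
numericCols = [col for col in allMonths.columns if col != 'Date']
allMonths[numericCols] = allMonths[numericCols].apply(pd.to_numeric, errors='coerce')

print(allMonths)
print(allMonths.dtypes)
```

Converting the wind direction to a number drops the leading zero, so `060` becomes `60`; keep that column as text if the three-digit form matters.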


# Acknowledgements

- The code in this notebook is modified from Dr. Xinzhi Zhang's teaching material and various other sources.

- Parts of this code example were compiled by He Can (ITM@HKBU) and draw on the tutorial compiled by Alexander Fred Ojala, https://www.crummy.com/software/BeautifulSoup/bs4/doc/ and https://www.dataquest.io/blog/web-scraping-tutorial-python/

All code is for educational purposes only and is released under the CC1.0.