# Web Scraping

In [1]:
# Uncomment the following line to install required packages
# (re is part of the standard library; pip expects space-separated package names)
# !pip install cloudscraper requests beautifulsoup4 pandas

# Importing the necessary libraries
import re
import requests
import cloudscraper
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
url = "https://www.educationcounts.govt.nz/find-school/school/profile?school=69"
response = requests.get(url)

# Checking if the request was successful
if response.status_code == 200:
    html_content = response.text
    print("Successfully fetched the page content!")
else:
    print(f"Failed to retrieve page. Status code: {response.status_code}")

Failed to retrieve page. Status code: 403


The server returned a 403 (Forbidden) status code, most likely because the site's anti-bot protection rejected the request sent by the plain requests library.

In [3]:
url = "https://www.educationcounts.govt.nz/find-school/school/profile?school=69"

# Initializing the cloudscraper
scraper = cloudscraper.create_scraper()
response = scraper.get(url)

# Checking if the request was successful
if response.status_code == 200:
    html_content = response.text
    print("Successfully fetched the page content!")
else:
    print(f"Failed to retrieve page. Status code: {response.status_code}")

Successfully fetched the page content!


With cloudscraper, the request passes the site's anti-bot check and the HTML content of the URL is fetched successfully.
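
The same status check appears in both cells above. A minimal sketch of a shared helper follows; `fetch_html` is a hypothetical name, not something from the original notebook, and it assumes only that the client has a requests-style `get` method returning an object with `status_code` and `text`.

```python
def fetch_html(client, url, timeout=30):
    """Return the page HTML, or None when the server does not answer with 200.

    `client` is assumed to be anything with a requests-style `get` method,
    for example the `requests` module or a cloudscraper scraper.
    """
    response = client.get(url, timeout=timeout)
    if response.status_code == 200:
        return response.text
    # Same failure message as the cells above
    print(f"Failed to retrieve page. Status code: {response.status_code}")
    return None
```

It could be called as `html_content = fetch_html(scraper, url)`, and the `None` return lets the caller decide whether to skip the page or stop.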

In [4]:
# Initializing the cloudscraper
scraper = cloudscraper.create_scraper()

# Defining the URLs for each school
urls = [
    {
        "profile_url": "https://www.educationcounts.govt.nz/find-school/school/profile?district=&region=&school=69",
        "roll_url": "https://www.educationcounts.govt.nz/find-school/school/population/year?district=&region=&school=69"
    },
    {
        "profile_url": "https://www.educationcounts.govt.nz/find-school/school/profile?school=151&district=24&region=4",
        "roll_url": "https://www.educationcounts.govt.nz/find-school/school/population/year?district=24&region=4&school=151"
    },
    {
        "profile_url": "https://www.educationcounts.govt.nz/find-school/school/profile?district=7619&region=2&school=6929",
        "roll_url": "https://www.educationcounts.govt.nz/find-school/school/population/year?district=7619&region=2&school=6929"
    }
]

In [5]:
# List to store results
data = []

for pair in urls:
    # Profile info
    profile_response = scraper.get(pair["profile_url"])
    soup_profile = BeautifulSoup(profile_response.text, "html.parser")

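    # Look up the heading container first; chaining .find() on a missing div raises AttributeError
    heading_div = soup_profile.find("div", id="page-heading")
    school_name_element = heading_div.find("h1") if heading_div else None
    school_name = school_name_element.text.strip() if school_name_element else "Not Found"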

    years_catered_element = soup_profile.find("span", id="schoolType")
    years_catered = "Not Found"
    if years_catered_element:
        match = re.search(r"\((.*?)\)", years_catered_element.text.strip())
        if match:
            years_catered = match.group(1)

    gender_element = soup_profile.find("span", id="schoolGender")
    gender_type = gender_element.text.strip() if gender_element else "Not Found"

    # Number of students
    roll_response = scraper.get(pair["roll_url"])
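    total_roll = "Not Found"
    if roll_response.status_code == 200:
        soup_roll = BeautifulSoup(roll_response.text, "html.parser")
        # Guard each lookup so a missing "Total" cell does not raise AttributeError
        total_cell = soup_roll.find("td", string="Total")
        total_row = total_cell.find_parent("tr") if total_cell else None
        if total_row:
            total_roll = total_row.find_all("td")[-1].text.strip()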

    data.append({
        "School Name": school_name,
        "Number of Students": total_roll,
        "Years Level Catered": years_catered,
        "Gender": gender_type
    })

# Display the table
df = pd.DataFrame(data)
df

| | School Name | Number of Students | Years Level Catered | Gender |
|---|---|---|---|---|
| 0 | Mt Albert Grammar School | 3498 | Year 9-15 | Co-Educational |
| 1 | Western Heights High School | 1180 | Year 9-15 | Co-Educational |
| 2 | Alfriston College | 1245 | Year 9-15 | Co-Educational |
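
The scraped values are all strings, including the student counts. The sketch below isolates the parenthesis-matching regular expression used in the loop and adds a conversion of the roll to an integer. The sample text "Secondary (Year 9-15)" is an assumption about what the page's schoolType element contains, and `extract_years` and `parse_roll` are hypothetical helper names.

```python
import re

def extract_years(school_type_text):
    """Return the text inside the first pair of parentheses, or "Not Found"."""
    match = re.search(r"\((.*?)\)", school_type_text.strip())
    return match.group(1) if match else "Not Found"

def parse_roll(total_roll):
    """Convert a scraped roll such as "3498" or "3,498" to int; None if not numeric."""
    digits = total_roll.replace(",", "").strip()
    return int(digits) if digits.isdigit() else None

# Assumed sample input, not copied from the live page
years = extract_years("Secondary (Year 9-15)")
students = parse_roll("3498")
```

Converting the roll to a number before building the DataFrame would make the column usable for arithmetic or as a model input.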


---

### Integrating this data to enhance the forecasting model

The scraped dataset can improve the accuracy of the Prophet demand-forecasting time series model by providing context for predictions of product- and size-specific sales. The number of students (for example, 3498 for Mt Albert Grammar School) can be added as an external regressor so that the model reflects demand driven by school population size and better captures variation in sales of specific items. The "Years Level Catered" field can help align the model with academic cycles, such as increased demand at the start of the school year, through the seasonality component. Although the gender data shows that all three schools are co-educational, it may still be worth exploring for subtle demand trends, since the product lines vary by gender.
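
As a rough sketch of the regressor idea, the snippet below attaches a student count to each row of a sales history so that it could later be handed to Prophet as an extra column. The sales figures, the per-year roll values, and the `add_student_regressor` helper are all made up for illustration. A count that never changes within one series gives the model nothing to learn from, so the sketch assumes a roll figure is available for each year.

```python
from datetime import date

def add_student_regressor(sales_rows, roll_by_year):
    """Return copies of the sales rows with a "students" column added.

    Each row is a dict using Prophet's column names: "ds" (date) and "y" (units sold).
    """
    enriched = []
    for row in sales_rows:
        enriched.append({**row, "students": roll_by_year[row["ds"].year]})
    return enriched

# Illustrative values only, not real sales or scraped data
sales_rows = [
    {"ds": date(2023, 2, 1), "y": 120},
    {"ds": date(2024, 2, 1), "y": 131},
]
roll_by_year = {2023: 3400, 2024: 3498}

training_rows = add_student_regressor(sales_rows, roll_by_year)

# With Prophet installed, the column would be registered before fitting, e.g.:
#   model = Prophet()
#   model.add_regressor("students")
#   model.fit(pd.DataFrame(training_rows))
```

Prophet also needs the regressor column in the dataframe passed to `predict`, so a roll estimate for the forecast period would have to be supplied as well.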

---