This procedure requires the following packages: pandas, selenium, beautifulsoup4, webdriver-manager, and openpyxl.
To install them, run the command below in PowerShell before proceeding (paste it without surrounding quotation marks, otherwise PowerShell treats it as a string instead of running it):

pip install pandas selenium beautifulsoup4 webdriver-manager openpyxl


In [17]:
import pandas as pd

# Load URLs from a CSV file
def load_urls(file_path):
    # Read the CSV file into a DataFrame
    df = pd.read_csv(file_path)
    urls = df['URL'].tolist()  # The column containing URLs must be named 'URL'
    return urls
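
Since openpyxl is among the installed packages, the loader could also accept Excel workbooks. The sketch below is one possible variant, not part of the original procedure: `load_urls_any` and its `column` parameter are hypothetical names, and it assumes the file type can be inferred from the extension.

```python
from pathlib import Path

import pandas as pd


def load_urls_any(file_path, column="URL"):
    """Hypothetical variant of load_urls that accepts CSV or Excel files."""
    suffix = Path(file_path).suffix.lower()
    if suffix == ".csv":
        df = pd.read_csv(file_path)
    elif suffix in (".xlsx", ".xlsm"):
        df = pd.read_excel(file_path)  # pandas uses openpyxl for these formats
    else:
        raise ValueError(f"Unsupported file type: {suffix!r}")

    if column not in df.columns:
        raise KeyError(f"Column {column!r} not found in {file_path}")

    # Drop empty cells and strip stray whitespace around each URL
    return df[column].dropna().astype(str).str.strip().tolist()
```

Dropping empty cells keeps blank rows from reaching the browser later on.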


Check how many URLs are listed in the file.

In [18]:
# Example usage
file_path = "websites.csv"  # CSV file containing URLs
url_list = load_urls(file_path)
print(f"Loaded {len(url_list)} URLs.")


Loaded 7 URLs.


Summary of the 'websites' CSV file. The procedure will access the URL in each row.

In [19]:
websiteData = pd.read_csv("websites.csv")
websiteData

Unnamed: 0,Name,API?,"Type (HTML/XML, Javascript, PDF)",Robots.txt permission?,Crawl-Delay?,URL,CATEGORY
0,Describing Hearing Loss,NO,HTML,YES,,https://www.aussiedeafkids.org.au/resources/yo...,Education
1,Describing the severity of a hearing loss,NO,HTML,YES,,https://www.aussiedeafkids.org.au/resources/yo...,Education
2,What do hearing aids do?,NO,HTML,YES,,https://www.aussiedeafkids.org.au/resources/yo...,Hearing Aids
3,Sound Waves,NO,PDF,YES,,https://www.aussiedeafkids.org.au/wp-content/u...,Education
4,Classroom tips,NO,HTML,YES,,https://www.aussiedeafkids.org.au/resources/ed...,Teacher resources
5,How listener friendly is your classroom?,NO,HTML,YES,,https://www.aussiedeafkids.org.au/resources/ed...,Teacher resources
6,A quick guide to communication,NO,HTML,YES,,https://www.aussiedeafkids.org.au/resources/la...,Education


The 'webdriver-manager' package downloads a ChromeDriver that matches the installed version of Google Chrome (Chrome itself must already be installed), and Selenium uses that driver to open a browser window.
The browser is used to visit each URL in the 'websites' CSV file so that the web-scraping function can read the page.
- While the function executes, a Google Chrome window opens on your screen and loads each URL in turn.

In [20]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

# Initialize Selenium WebDriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

In [21]:
# Function to scrape data from a single URL
def scrape_data(url):
    driver.get(url)  # Open the webpage
    html = driver.page_source  # Get HTML content
    
    soup = BeautifulSoup(html, "html.parser")
    
    # Extract relevant data (customize selectors based on target website)
    title = soup.title.text if soup.title else "No title found"
    main_content = soup.find("div", class_="content-type-content")  # Example selector
    
    if main_content:
        content_text = main_content.get_text().strip()
    else:
        content_text = "No content found"
    
    return {"URL": url, "Title": title, "Content": content_text}
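
Because scrape_data parses the page source as HTML, a link that points to a PDF has no title tag or matching div to extract. A minimal sketch of separating such links before scraping is shown here; `looks_like_pdf` is a hypothetical helper, and checking the URL path is only a heuristic (it does not inspect the actual file type).

```python
from urllib.parse import urlparse


def looks_like_pdf(url):
    """Return True when the URL path ends in .pdf (heuristic, not a content check)."""
    return urlparse(url).path.lower().endswith(".pdf")


# Two of the URLs from websites.csv, used here as sample input
urls = [
    "https://www.aussiedeafkids.org.au/wp-content/uploads/2024/12/Sound-Waves-section-5.pdf",
    "https://www.aussiedeafkids.org.au/resources/education/information-for-your-childs-teacher/classroom-tips/",
]

html_urls = [u for u in urls if not looks_like_pdf(u)]  # safe to pass to scrape_data
pdf_urls = [u for u in urls if looks_like_pdf(u)]       # need a different extraction method
```

Only `html_urls` would then go through scrape_data, while `pdf_urls` could be logged or handled separately.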

In [22]:
# Example usage
scraped_data = []
for url in url_list:
    try:
        data = scrape_data(url)
        scraped_data.append(data)
        print(f"Scraped data from {url}")
    except Exception as e:
        print(f"Failed to scrape {url}: {e}")

Scraped data from https://www.aussiedeafkids.org.au/resources/your-childs-hearing/hearing-loss/describing-hearing-loss/
Scraped data from https://www.aussiedeafkids.org.au/resources/your-childs-hearing/hearing-loss/describing-the-severity-of-a-hearing-loss/
Scraped data from https://www.aussiedeafkids.org.au/resources/your-childs-hearing/hearing-aids/what-do-hearing-aids-do/
Scraped data from https://www.aussiedeafkids.org.au/wp-content/uploads/2024/12/Sound-Waves-section-5.pdf
Scraped data from https://www.aussiedeafkids.org.au/resources/education/information-for-your-childs-teacher/classroom-tips/
Scraped data from https://www.aussiedeafkids.org.au/resources/education/information-for-your-childs-teacher/how-listener-friendly-is-your-classroom/
Scraped data from https://www.aussiedeafkids.org.au/resources/language-and-communication/getting-started/a-quick-guide-to-communication/
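
The 'websites' CSV file records robots.txt permission and crawl delay by hand. As a rough sketch, the same checks could be made in code with Python's standard `urllib.robotparser`. The robots.txt text below is a made-up sample for illustration, not the real file of any site, and `build_parser` and `polite_urls` are hypothetical helper names.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Build a parser from robots.txt text (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_urls(urls, parser, user_agent="*", default_delay=1.0, sleep=time.sleep):
    """Yield only the allowed URLs, pausing between consecutive ones."""
    # Use the site's Crawl-delay if it declares one, else a default pause
    delay = parser.crawl_delay(user_agent) or default_delay
    first = True
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            print(f"Skipping (disallowed by robots.txt): {url}")
            continue
        if not first:
            sleep(delay)  # wait before every request after the first
        first = False
        yield url
```

In a real run, the parser would be pointed at the site's own file with `set_url(...)` followed by `read()`, and the scraping loop would iterate over `polite_urls(url_list, parser)` instead of `url_list`.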


The scraped data is then saved to a separate CSV file. If no title or matching content was found for a URL, the file records 'No title found' in the Title column and 'No content found' in the Content column.

In [23]:
# Save scraped data to a CSV file
def save_to_csv(data, output_file):
    df = pd.DataFrame(data)  # Convert list of dictionaries to DataFrame
    df.to_csv(output_file, index=False)
    print(f"Data saved to {output_file}")

In [24]:
# Example usage
output_file = "scraped_data.csv"
save_to_csv(scraped_data, output_file)

datascrapResults = pd.read_csv("scraped_data.csv")
datascrapResults

Data saved to scraped_data.csv


Unnamed: 0,URL,Title,Content
0,https://www.aussiedeafkids.org.au/resources/yo...,Describing hearing loss | Aussie Deaf Kids,The following questions help to identify the d...
1,https://www.aussiedeafkids.org.au/resources/yo...,Describing the severity of a hearing loss | Au...,"When hearing loss is measured, it will be desc..."
2,https://www.aussiedeafkids.org.au/resources/yo...,What do hearing aids do? | Aussie Deaf Kids,It is a common belief that people with hearing...
3,https://www.aussiedeafkids.org.au/wp-content/u...,No title found,No content found
4,https://www.aussiedeafkids.org.au/resources/ed...,Classroom tips | Aussie Deaf Kids,When addressing the pupil say his/her name fir...
5,https://www.aussiedeafkids.org.au/resources/ed...,How listener friendly is your classroom? | Aus...,A good listening environment is crucial to suc...
6,https://www.aussiedeafkids.org.au/resources/la...,A quick guide to communication | Aussie Deaf Kids,"As your child grows, it will be important to u..."
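
Rows that hold the placeholder text can be listed after a run to see which URLs need another approach. The sketch below works on the same list-of-dictionaries shape that scrape_data returns; `failed_rows` is a hypothetical helper and the sample records are invented for illustration.

```python
def failed_rows(records, placeholder="No content found"):
    """Return the URLs whose Content field holds the placeholder text."""
    return [r["URL"] for r in records if r["Content"] == placeholder]


# Invented sample records in the same shape as scraped_data
sample = [
    {"URL": "https://example.com/page", "Title": "Page", "Content": "Some text"},
    {"URL": "https://example.com/file.pdf", "Title": "No title found", "Content": "No content found"},
]

print(failed_rows(sample))
```

Note that URLs which raise an exception in the scraping loop are only printed, not appended to scraped_data, so they would not appear in this list.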
