# Popular Tourist Destinations in New Zealand
*A Web Scraping Project on Dynamic and Static Webpages*
* I Putu Agastya Harta Pratama
* Łukasz Brzoska

Faculty of Economic Sciences <br>
University of Warsaw <br>
Warsaw, Poland <br>
2025

Data obtained from the website of New Zealand's government tourism board (trading as Tourism New Zealand): https://www.newzealand.com/int/


With so much on offer, planning a trip to New Zealand can be overwhelming. The national tourism board of New Zealand has created a website so that prospective tourists can see what the country has to offer. This project aims to collect information from that website about New Zealand's top tourist destinations, along with the activities offered there, by means of web scraping.

Both BeautifulSoup and Selenium are used in this project's scraping process. <br>
Selenium automates specific user actions so that the information becomes accessible. <br>
BeautifulSoup then scrapes the static HTML rendered on each page; it is also used to collect links and text.

## Imports

In [1]:
# FOR DATA PROCESSING:
import pandas as pd
import numpy as np

# FOR MEASURING COMPUTATION TIME, CREATING FIXED DELAYS:
import time

# FOR APPLYING BEAUTIFULSOUP
from bs4 import BeautifulSoup

# FOR APPLYING SELENIUM:
import selenium 
from selenium import webdriver 
from webdriver_manager.firefox import GeckoDriverManager 
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.webdriver.common.by import By 
from selenium.common.exceptions import TimeoutException

# FOR SAVING DATA:
import pickle # pickle format of saved output

# FOR URL PARSING:
from urllib.parse import urljoin

# FOR SHOWING IMAGES IN THE NOTEBOOK
from IPython.display import Image

Once all of the necessary packages have been imported, we install the necessary driver. <br>
For this project, we use the Firefox web browser.

In [2]:
def save_object(obj, filename): 
    with open(filename, 'wb') as output:
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)
        
firefoxpath = GeckoDriverManager().install()
print("Driver Installed at: ", firefoxpath)

Driver Installed at:  /Users/agastyaharta/.wdm/drivers/geckodriver/mac64/v0.36.0/geckodriver


## Accessing Website

Firstly, we need to access this site. It leads us to the main page of New Zealand's Tourism page.

In [3]:
website = "https://www.newzealand.com/int/"

service_firefox = Service(executable_path = firefoxpath) 
options_firefox = webdriver.FirefoxOptions()
driver_firefox = webdriver.Firefox(service = service_firefox, options = options_firefox) 

driver_firefox.maximize_window()
driver_firefox.get(website)

In [None]:
Image(filename='assets/1.png')

## Collecting Popular Places 

Next, we are acquiring four of New Zealand's most visited destinations, the majority of which are at the city or district level. <br> 
We then gather the city names and links by clicking the search button displayed in the top right corner of the website. <br>
BeautifulSoup is used to gather the city names and URLs that are included in the HTML code, and Selenium is used to automate the button clicking. 

In [4]:
# Accessing website using Selenium
driver_firefox.get(website) 

start = time.time()
time.sleep(np.random.chisquare(3)+5) # + wait random time drawn from specific (strongly right-side-skewed) distribution to better imitate human behavior

# Clicking the search button to open the search panel
target_button_xpath = "//i[@class='o-icon js-icon search-icon']//*[@class='icon search']"
target_button = WebDriverWait(driver_firefox, 4).until(
    EC.element_to_be_clickable((By.XPATH, target_button_xpath))
)
target_button.click()
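
The randomized pause used before the click can be factored into a small helper. The following is a minimal sketch rather than the notebook's own code: `human_delay` is a hypothetical name, and it draws the chi-square variate from the standard library instead of NumPy (a chi-square distribution with k degrees of freedom is a gamma distribution with shape k/2 and scale 2), so that it runs without extra dependencies.

```python
import random


def human_delay(base: float = 5.0, df: float = 3.0, rng=None) -> float:
    """Return a right-skewed random wait time in seconds (hypothetical helper)."""
    rng = rng or random.Random()
    # Chi-square(df) equals Gamma(shape=df / 2, scale=2);
    # random.gammavariate takes (shape, scale) in that order.
    return base + rng.gammavariate(df / 2.0, 2.0)


wait = human_delay()
# In the scraper, the value would be passed to time.sleep(wait).
print(f"Would sleep for {wait:.2f} seconds")
```

Because the gamma variate is always positive, the wait never drops below `base`, while the long right tail occasionally produces a noticeably longer pause.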

In [5]:
# Using beautifulsoup to generate city labels and their corresponding links
html = driver_firefox.page_source # Taking the rendered HTML of the current page as a string via page_source
soup = BeautifulSoup(html, "html.parser")

# Finding the right element for city ("Popular places to visit") because they share the same element with ("Popular things to do")
group_labels = soup.find_all("p", class_="popular-searches__group-label")
target_label = None
for label in group_labels:
    if "Popular places to visit" in label.text:
        target_label = label
        break

# Once the target label is found, we extract the link corresponding to each city name
popular_links = [] # List to save those elements
if target_label:
    city_list = target_label.find_next_sibling("ul", class_="popular-searches__group-items")
    if city_list: # Guard against a missing list so find_all is never called on None
        for link in city_list.find_all("a", class_="popular-searches__group-item"):
            city_name = link.get_text(strip=True)
            href = urljoin(website, link["href"])
            popular_links.append((city_name, href))

# Printing the output of collected names of popular places (cities) and their corresponding search links
try: # Error handling
    print("Popular Places to Visit in New Zealand:")
    for city, url in popular_links:
        print(f"{city}: {url}")
except Exception as e: # Error handling
    print("Cannot retrieve data")

Popular Places to Visit in New Zealand:
Auckland: https://www.newzealand.com/int/utilities/search/?q=Auckland&type=popular
Queenstown: https://www.newzealand.com/int/utilities/search/?q=Queenstown&type=popular
Lake Tekapo / Takapō: https://www.newzealand.com/int/utilities/search/?q=Lake+Tekapo+%2F+Takap%C5%8D&type=popular
Wānaka: https://www.newzealand.com/int/utilities/search/?q=W%C4%81naka&type=popular
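
The links printed above are search URLs whose `q` parameter carries the percent-encoded place name. The sketch below shows how `urljoin` builds such a link and how the name can be decoded again. It assumes that the site's raw `href` is root-relative; the notebook only shows the joined result, so `raw_href` is an assumption, and `search_term` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urljoin, urlsplit

BASE = "https://www.newzealand.com/int/"

# Assumed raw attribute value; the notebook only prints the already joined URL.
raw_href = "/int/utilities/search/?q=W%C4%81naka&type=popular"

# urljoin resolves the href against the base URL, as in the scraping cell.
full_url = urljoin(BASE, raw_href)


def search_term(url: str) -> str:
    """Return the decoded `q` parameter of a search URL, or an empty string."""
    return parse_qs(urlsplit(url).query).get("q", [""])[0]


print(full_url)
print(search_term(full_url))
```

Decoding the query this way is a cheap sanity check that each stored link really points at the place it is labelled with.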


The website lists four most-visited cities. Our observations indicate that the number of listed activities varies by city: for example, Auckland has 740, whereas Lake Tekapo / Takapō has 82. To keep the dataset balanced, we gather only 50 activities per city.

In [None]:
Image(filename='assets/2.png')

Afterwards, we open a new browser tab for each city link stored in the popular_links list and store each tab's unique handle in a dictionary called city_tab_handles. This keeps multiple city pages open in separate tabs, so that we can switch between them later using the cleaned-up city names as keys. It gives us immediate access to each city and keeps the scraping process organised across multiple locations without repeatedly closing and reopening browser windows.

In [6]:
city_tab_handles = {}

for city, url in popular_links:
    # Opening new tab
    driver_firefox.execute_script("window.open();")
    driver_firefox.switch_to.window(driver_firefox.window_handles[-1])
    
    # Load city URL
    driver_firefox.get(url)
    time.sleep(5)

    # Store tab handle
    handle_key = city.lower().replace(" ", "_").replace("/", "_")
    city_tab_handles[handle_key] = driver_firefox.current_window_handle

In [7]:
city_tab_handles

{'auckland': '3afcd02e-d844-415b-b8f0-34b85d40cc24',
 'queenstown': 'f25e6575-698b-42b1-85cf-283dd3bef6c6',
 'lake_tekapo___takapō': 'aace41ff-5805-4d36-990f-678cd9ff1ee7',
 'wānaka': 'e36493b3-7f89-4552-adac-abdad2042f0e'}
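
The keys above keep the macrons and produce a triple underscore for "Lake Tekapo / Takapō". A stricter normalisation is sketched here as one possible alternative; `city_key` is a hypothetical helper, and adopting it would mean changing the later dictionary lookups to the new keys.

```python
import re
import unicodedata


def city_key(name: str) -> str:
    """Turn a display name into a lowercase ASCII key (hypothetical helper)."""
    # NFKD splits "ā" into "a" plus a combining macron, which the ASCII step drops.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into one underscore.
    return re.sub(r"[^a-z0-9]+", "_", ascii_only.lower()).strip("_")


for city in ["Auckland", "Queenstown", "Lake Tekapo / Takapō", "Wānaka"]:
    print(city, "->", city_key(city))
```

ASCII-only keys are easier to type in later cells, at the cost of no longer matching the names exactly as the website spells them.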

## Auckland

We now switch to the first city in the list. Here we click the "Activities" filter. Since the website loads only 10 activities at a time, we also have to click the "Show More Result" button. Selenium automates both the filter click and the loading of additional results. <br>

All of the results that we want to scrape need to be loaded first; only then do we obtain the page source for parsing.

In [8]:
# Switching to "Auckland Tab"
driver_firefox.switch_to.window(city_tab_handles["auckland"])

time.sleep(5)
# Click on the "Activities" filter
try:
    filter_xpath = "//span[contains(text(),'Activities')]"
    filter_button = WebDriverWait(driver_firefox, 4).until(
        EC.element_to_be_clickable((By.XPATH, filter_xpath))
    )
    filter_button.click()
    print("Activities' filter clicked on Auckland page.")
except Exception as e:
    print(f"Failed to click 'Activities': {e}")
    
    

Activities' filter clicked on Auckland page.


We decided to scrape 50 activities for each city. Since only 10 activities are displayed at once, we need a loop that presses the "load more" button to display the next 10 activities each time; four clicks bring the total to 50.
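
The number of clicks follows from simple arithmetic, sketched below. `clicks_needed` is a hypothetical helper, and the page size of 10 is taken from the behaviour described above.

```python
import math


def clicks_needed(target: int, page_size: int = 10, available=None) -> int:
    """Clicks on "load more" required to show `target` results (hypothetical helper)."""
    if available is not None:
        # Never aim for more results than the site lists for this city.
        target = min(target, available)
    # The first `page_size` results are visible before any click.
    return max(0, math.ceil((target - page_size) / page_size))


print(clicks_needed(50))                # 4
print(clicks_needed(50, available=82))  # 4
print(clicks_needed(50, available=25))  # 2
```

For a target of 50 this gives the four clicks used in the loop, and it shows how the limit would shrink for a city with fewer listed activities.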

In [9]:
time.sleep(np.random.chisquare(3)+5)
# Click counter
click = 0 
# Maximum clicks (each click loads 10 more results: 10 initial + 4 x 10 = 50)
max_clicks = 4
while click < max_clicks:
    try:
        load_more_xpath = '//*[@id="search-results"]/div[2]/div/div[3]/button'
        load_more_button = WebDriverWait(driver_firefox, 5).until(
            EC.element_to_be_clickable((By.XPATH, load_more_xpath))
        )

        # Click the button
        load_more_button.click()
        click += 1
        print("Loading more pages...")

        # Waiting content to load
        time.sleep(5)

    except TimeoutException:
        print("All pages loaded (no more button).")
        break

Loading more pages...
Loading more pages...
Loading more pages...
Loading more pages...


### Data Scraping for Auckland

Once all of the activities were loaded, it was time to scrape the data for each of them. Using Beautiful Soup, we iterated over every activity in Auckland and gathered its title, link, description and image.

#### Main Page

In [None]:
html = driver_firefox.page_source
soup = BeautifulSoup(html, "html.parser")

results_container = soup.find("div", class_="search-results__results")
activity_blocks = results_container.find_all("div", class_="results__wrapper") if results_container else []

# Saving each column to a list
titles_auckland = []
links_auckland= []
descriptions_auckland = []
images_auckland = []


for activity in activity_blocks:
    try:
        # Title
        title_path = activity.select_one("h4.results__title a")
        title = title_path.get_text(strip=True) if title_path else ""
        
        # Link
        link = title_path["href"] if title_path and "href" in title_path.attrs else ""

        # Description
        desc_path = activity.select_one("p.results__description")
        description = desc_path.get_text(strip=True) if desc_path else ""

        # Image
        img_path = activity.select_one("figure.results__photo img")
        img_url = img_path["src"] if img_path and "src" in img_path.attrs else ""

        # Append All
        titles_auckland.append(title)
        links_auckland.append(link)
        descriptions_auckland.append(description)
        images_auckland.append(img_url)

    except Exception as e:
        print(f"Skipping block due to: {e}")
        continue

In [13]:
titles_auckland

['Auckland Scenic Tour 3 Hour',
 'Odysseum Auckland',
 'Auckland Tours',
 'Auckland Museum',
 'Breakout Auckland',
 'Paintvine - Auckland',
 'Skydive Auckland',
 'Hello Auckland small-group walking tour | Aucky Walky Tours',
 'Auckland City Express - Private Tour (Sedan or Minivan up to 11 passengers)',
 'Auckland City Sightseeing Tour',
 'Suzannah Maree Photography',
 'Auckland Surfboard Rentals',
 'Auckland Adventure Jet',
 'Auckland Harbour Sailing',
 'Auckland Adventure Park',
 'Auckland Sunrise Tours',
 'Auckland Floral Experiences',
 'Zero Latency Auckland',
 'Auckland Sea Kayaks',
 'Auckland Explorer Bus',
 'Adrenalin Forest Auckland',
 'Auckland Theatre Company',
 'Auckland City Tours',
 'Auckland Day Tours',
 'Game Over Auckland',
 'Auckland Botanic Gardens',
 'Sunset Tours Auckland',
 'Auckland Motorbike Hire',
 'Escape HQ Auckland',
 'HireBikes - Auckland Central',
 'Aotea Gifts - Auckland',
 'Auckland Historic Bar Tour',
 "A 'Taste' of the Segway Sensation - A Ride with Mag

In [12]:
links_auckland

['https://www.newzealand.com/int/plan/business/auckland-scenic-tour/',
 'https://www.newzealand.com/int/plan/business/odyssey-sensory-maze/',
 'https://www.newzealand.com/int/plan/business/auckland-tours/',
 'https://www.newzealand.com/int/plan/business/auckland-museum-tamaki-paenga-hira/',
 'https://www.newzealand.com/int/plan/business/breakout-auckland/',
 'https://www.newzealand.com/int/plan/business/paintvine/',
 'https://www.newzealand.com/int/plan/business/skydive-auckland/',
 'https://www.newzealand.com/int/plan/business/aucky-walky-tours-ltd/',
 'https://www.newzealand.com/int/plan/business/auckland-tour-minivan-up-to-11-passengers/',
 'https://www.newzealand.com/int/plan/business/discover-auckland-city-tour-the-city-of-sails/',
 'https://www.newzealand.com/int/plan/business/suzannah-maree-photography/',
 'https://www.newzealand.com/int/plan/business/auckland-surfboard-rentals/',
 'https://www.newzealand.com/int/plan/business/auckland-adventure-jet/',
 'https://www.newzealand.c

#### Secondary Page of Each Activity

To broaden the information about every activity, we accessed the subpage of each activity that we wanted to scrape. Each subpage contains more specific information, such as the address, phone number and email.

In [14]:
street_addresses_auckland = []
localities_auckland = []
emails_auckland = []
phone_numbers_auckland = []

for idx, url in enumerate(links_auckland):
    try:
        # Open 
        driver_firefox.get(url)
        WebDriverWait(driver_firefox, 5).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p[itemtype='http://schema.org/LocalBusiness']"))
        )

        detail_soup = BeautifulSoup(driver_firefox.page_source, "html.parser")
        address_block = detail_soup.select_one("p[itemtype='http://schema.org/LocalBusiness']")

        # Street
        street_path = address_block.select_one("span[itemprop='streetAddress']")
        street_text = street_path.get_text(strip=True) if street_path else ""
        
        # Locality
        locality_path = address_block.select_one("span[itemprop='addressLocality']")
        locality_text = locality_path.get_text(strip=True) if locality_path else ""
        
        # Phone
        phone_path = driver_firefox.find_elements(By.CSS_SELECTOR, "a.js-phone-link")
        phone_number = phone_path[0].get_attribute("href").replace("tel:", "").strip() if phone_path else ""
        
        # Email
        email_tag = driver_firefox.find_elements(By.CSS_SELECTOR, "a[href^='mailto:']")
        email = email_tag[0].get_attribute("href").replace("mailto:", "").strip() if email_tag else ""

    except Exception as e:
        print(f"{idx+1}. Failed to extract data for: {links_auckland[idx]} — {e}")
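        # Reset every field so a failed page never reuses values from the previous activity
        street_text = ""
        locality_text = ""
        email = ""
        phone_number = ""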

    street_addresses_auckland.append(street_text)
    localities_auckland.append(locality_text)
    emails_auckland.append(email)
    phone_numbers_auckland.append(phone_number)
    
    # Wait time to avoid being blocked
    wait_time = np.random.chisquare(3) + 4
    print(f"Sleeping for {wait_time:.2f} seconds...")
    time.sleep(wait_time)

Sleeping for 6.19 seconds...
Sleeping for 4.23 seconds...
Sleeping for 8.86 seconds...
Sleeping for 10.84 seconds...
Sleeping for 7.95 seconds...
Sleeping for 5.56 seconds...
Sleeping for 8.58 seconds...
Sleeping for 6.12 seconds...
Sleeping for 7.14 seconds...
Sleeping for 5.56 seconds...
Sleeping for 8.22 seconds...
Sleeping for 4.30 seconds...
Sleeping for 6.79 seconds...
Sleeping for 5.18 seconds...
Sleeping for 8.46 seconds...
Sleeping for 5.71 seconds...
Sleeping for 6.29 seconds...
Sleeping for 7.14 seconds...
Sleeping for 5.90 seconds...
Sleeping for 6.70 seconds...
Sleeping for 7.82 seconds...
Sleeping for 10.30 seconds...
Sleeping for 9.95 seconds...
Sleeping for 6.82 seconds...
Sleeping for 11.08 seconds...
Sleeping for 5.29 seconds...
Sleeping for 5.61 seconds...
Sleeping for 12.81 seconds...
Sleeping for 4.88 seconds...
Sleeping for 15.94 seconds...
Sleeping for 8.40 seconds...
Sleeping for 13.10 seconds...
Sleeping for 4.28 seconds...
Sleeping for 4.54 seconds...
Sleeping
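
The phone and email values are obtained above by stripping the `tel:` and `mailto:` prefixes with `str.replace`. A slightly more defensive variant is sketched here: it also drops a trailing `?subject=...` part and decodes percent-encoded characters. `contact_from_href` is a hypothetical helper, and the sample hrefs are made up for illustration.

```python
from urllib.parse import unquote


def contact_from_href(href: str, scheme: str) -> str:
    """Extract the address or number from a `mailto:` or `tel:` link (hypothetical helper)."""
    prefix = scheme + ":"
    if not href.lower().startswith(prefix):
        return ""
    # Keep only the part before an optional query such as "?subject=Hello".
    value = href[len(prefix):].split("?", 1)[0]
    return unquote(value).strip()


# Made-up example links, not taken from the scraped pages.
print(contact_from_href("mailto:info@example.com?subject=Hello", "mailto"))
print(contact_from_href("tel:+64%209%20123%204567", "tel"))
```

Returning an empty string for a non-matching link keeps the result compatible with the empty-string defaults used in the scraping loop.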

### Final Check of Auckland Scraped Lists

The results of scraping activities in Auckland are presented below:

In [None]:
auckland_scrapped_lists = [
    titles_auckland,
    links_auckland,
    descriptions_auckland,
    images_auckland,
    street_addresses_auckland,
    localities_auckland,
    emails_auckland,
    phone_numbers_auckland,
]

list_names = [
    "titles_auckland",
    "links_auckland",
    "descriptions_auckland",
    "images_auckland",
    "street_addresses_auckland",
    "localities_auckland",
    "emails_auckland",
    "phone_numbers_auckland",
]

for name, values in zip(list_names, auckland_scrapped_lists):
    print(f"List length of {name}: {len(values)}")
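
The length check matters because building a `pandas.DataFrame` from a dictionary of lists raises a `ValueError` when the lists differ in length. The check can also be made to fail loudly instead of relying on a visual comparison. The following is a minimal sketch with toy data; `column_lengths` is a hypothetical helper.

```python
def column_lengths(columns: dict) -> dict:
    """Map each column name to its length and raise if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Columns are misaligned: {lengths}")
    return lengths


# Toy data standing in for the scraped lists (not real results).
example = {
    "titles": ["A", "B"],
    "links": ["/a", "/b"],
    "emails": ["", "b@example.com"],
}
print(column_lengths(example))
```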

The exact same process was repeated for the remaining cities: Queenstown, Lake Tekapo and Wanaka. Since each city has its own page of available activities, we created a separate dataset for each of them and then combined them into a single dataframe containing full information about all cities.
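
Because the steps are identical for every city, they could also be wrapped in one function that is called once per city. The sketch below only illustrates that structure: `scrape_city`, `list_activities` and `fetch_details` are hypothetical names, and the two callables are replaced by fakes so that the example runs without a browser. In the notebook they would wrap the Selenium and BeautifulSoup steps shown for Auckland.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def scrape_city(
    city: str,
    list_activities: Callable[[str], List[Record]],
    fetch_details: Callable[[str], Record],
    limit: int = 50,
) -> List[Record]:
    """Collect one row per activity; failed detail pages keep empty defaults."""
    rows = []
    for listing in list_activities(city)[:limit]:
        details = {"street": "", "locality": "", "email": "", "phone": ""}
        try:
            details.update(fetch_details(listing["link"]))
        except Exception as exc:
            print(f"Failed to extract details for {listing['link']}: {exc}")
        rows.append({"place": city, **listing, **details})
    return rows


# Fake stand-ins for the real scraping steps (illustration only).
def fake_list(city: str) -> List[Record]:
    return [{"title": "Tour 1", "link": "ok"}, {"title": "Tour 2", "link": "broken"}]


def fake_details(link: str) -> Record:
    if link == "broken":
        raise RuntimeError("page did not load")
    return {"email": "info@example.com"}


rows = scrape_city("Auckland", fake_list, fake_details)
print(rows)
```

Creating the defaults inside the loop means that a failed page can never inherit contact details from the previous activity.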


Below we present the process of scraping the remaining cities:

## Queenstown

In [None]:
driver_firefox.switch_to.window(city_tab_handles["queenstown"])

time.sleep(5)
# Click on the "Activities" filter
try:
    filter_xpath = "//span[contains(text(),'Activities')]"
    filter_button = WebDriverWait(driver_firefox, 4).until(
        EC.element_to_be_clickable((By.XPATH, filter_xpath))
    )
    filter_button.click()
    print("Activities' filter clicked on Queenstown page.")
except Exception as e:
    print(f"Failed to click 'Activities': {e}")

In [None]:
time.sleep(np.random.chisquare(3)+5)
click = 0 
max_clicks = 4
while click < max_clicks:
    try:
        load_more_xpath = '//*[@id="search-results"]/div[2]/div/div[3]/button'
        load_more_button = WebDriverWait(driver_firefox, 5).until(
            EC.element_to_be_clickable((By.XPATH, load_more_xpath))
        )

        # Click the button
        load_more_button.click()
        click += 1
        print("Loading more...")

        # Optional: wait for new content to load
        time.sleep(5)

    except TimeoutException:
        print("All activities loaded (no more button).")
        break

### Data Scraping for Queenstown

#### Main Page

In [None]:
html = driver_firefox.page_source
soup = BeautifulSoup(html, "html.parser")

results_container = soup.find("div", class_="search-results__results")
activity_blocks = results_container.find_all("div", class_="results__wrapper") if results_container else []

titles_queenstown = []
links_queenstown= []
descriptions_queenstown = []
images_queenstown = []

for activity in activity_blocks:
    try:
        # Title
        title_path = activity.select_one("h4.results__title a")
        title = title_path.get_text(strip=True) if title_path else ""
        
        # Link
        link = title_path["href"] if title_path and "href" in title_path.attrs else ""

        # Description
        desc_path = activity.select_one("p.results__description")
        description = desc_path.get_text(strip=True) if desc_path else ""

        # Image
        img_path = activity.select_one("figure.results__photo img")
        img_url = img_path["src"] if img_path and "src" in img_path.attrs else ""

        # Append All
        titles_queenstown.append(title)
        links_queenstown.append(link)
        descriptions_queenstown.append(description)
        images_queenstown.append(img_url)

    except Exception as e:
        print(f"Skipping block due to: {e}")
        continue

#### Secondary Page of Each Activity

In [None]:
street_addresses_queenstown = []
localities_queenstown = []
emails_queenstown = []
phone_numbers_queenstown = []

for idx, url in enumerate(links_queenstown):
    try:
        driver_firefox.get(url)
        WebDriverWait(driver_firefox, 5).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p[itemtype='http://schema.org/LocalBusiness']"))
        )

        detail_soup = BeautifulSoup(driver_firefox.page_source, "html.parser")
        address_block = detail_soup.select_one("p[itemtype='http://schema.org/LocalBusiness']")

        # Street
        street_path = address_block.select_one("span[itemprop='streetAddress']")
        street_text = street_path.get_text(strip=True) if street_path else ""
        
        # Locality
        locality_path = address_block.select_one("span[itemprop='addressLocality']")
        locality_text = locality_path.get_text(strip=True) if locality_path else ""
        
        # Phone
        phone_path = driver_firefox.find_elements(By.CSS_SELECTOR, "a.js-phone-link")
        phone_number = phone_path[0].get_attribute("href").replace("tel:", "").strip() if phone_path else ""
        
        # Email
        email_tag = driver_firefox.find_elements(By.CSS_SELECTOR, "a[href^='mailto:']")
        email = email_tag[0].get_attribute("href").replace("mailto:", "").strip() if email_tag else ""

    except Exception as e:
        print(f"{idx+1}. Failed to extract data for: {links_queenstown[idx]} — {e}")
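        # Reset every field so a failed page never reuses values from the previous activity
        street_text = ""
        locality_text = ""
        email = ""
        phone_number = ""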

    street_addresses_queenstown.append(street_text)
    localities_queenstown.append(locality_text)
    emails_queenstown.append(email)
    phone_numbers_queenstown.append(phone_number)
    
    wait_time = np.random.chisquare(3) + 4
    print(f"Sleeping for {wait_time:.2f} seconds...")
    time.sleep(wait_time)

### Final Check of Queenstown Scraped Lists

In [None]:
queenstown_scrapped_lists = [
    titles_queenstown,
    links_queenstown,
    descriptions_queenstown,
    images_queenstown,
    street_addresses_queenstown,
    localities_queenstown,
    emails_queenstown,
    phone_numbers_queenstown,
]

list_names = [
    "titles_queenstown",
    "links_queenstown",
    "descriptions_queenstown",
    "images_queenstown",
    "street_addresses_queenstown",
    "localities_queenstown",
    "emails_queenstown",
    "phone_numbers_queenstown",
]

for name, values in zip(list_names, queenstown_scrapped_lists):
    print(f"List length of {name}: {len(values)}")

## Lake Tekapo

In [None]:
driver_firefox.switch_to.window(city_tab_handles["lake_tekapo___takapō"])

time.sleep(5)
# Click on the "Activities" filter
try:
    filter_xpath = "//span[contains(text(),'Activities')]"
    filter_button = WebDriverWait(driver_firefox, 4).until(
        EC.element_to_be_clickable((By.XPATH, filter_xpath))
    )
    filter_button.click()
    print("Activities' filter clicked on Lake Tekapo / Takapō page.")
except Exception as e:
    print(f"Failed to click 'Activities': {e}")

In [None]:
time.sleep(np.random.chisquare(3)+5)
click = 0 
max_clicks = 4
while click < max_clicks:
    try:
        load_more_xpath = '//*[@id="search-results"]/div[2]/div/div[3]/button'
        load_more_button = WebDriverWait(driver_firefox, 5).until(
            EC.element_to_be_clickable((By.XPATH, load_more_xpath))
        )

        # Click the button
        load_more_button.click()
        click += 1
        print("Loading more...")

        # Optional: wait for new content to load
        time.sleep(5)

    except TimeoutException:
        print("All activities loaded (no more button).")
        break

### Data Scraping - Lake Tekapo

#### Main Page

In [None]:
html = driver_firefox.page_source
soup = BeautifulSoup(html, "html.parser")

results_container = soup.find("div", class_="search-results__results")
activity_blocks = results_container.find_all("div", class_="results__wrapper") if results_container else []

titles_tekapo = []
links_tekapo= []
descriptions_tekapo = []
images_tekapo = []

for activity in activity_blocks:
    try:
        # Title
        title_path = activity.select_one("h4.results__title a")
        title = title_path.get_text(strip=True) if title_path else ""
        
        # Link
        link = title_path["href"] if title_path and "href" in title_path.attrs else ""

        # Description
        desc_path = activity.select_one("p.results__description")
        description = desc_path.get_text(strip=True) if desc_path else ""

        # Image
        img_path = activity.select_one("figure.results__photo img")
        img_url = img_path["src"] if img_path and "src" in img_path.attrs else ""

        # Append All
        titles_tekapo.append(title)
        links_tekapo.append(link)
        descriptions_tekapo.append(description)
        images_tekapo.append(img_url)

    except Exception as e:
        print(f"Skipping block due to: {e}")
        continue

#### Secondary Page of Each Activity

In [None]:
street_addresses_tekapo = []
localities_tekapo = []
emails_tekapo = []
phone_numbers_tekapo = []

for idx, url in enumerate(links_tekapo):
    try:
        driver_firefox.get(url)
        WebDriverWait(driver_firefox, 5).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p[itemtype='http://schema.org/LocalBusiness']"))
        )

        detail_soup = BeautifulSoup(driver_firefox.page_source, "html.parser")
        address_block = detail_soup.select_one("p[itemtype='http://schema.org/LocalBusiness']")

        # Street
        street_path = address_block.select_one("span[itemprop='streetAddress']")
        street_text = street_path.get_text(strip=True) if street_path else ""
        
        # Locality
        locality_path = address_block.select_one("span[itemprop='addressLocality']")
        locality_text = locality_path.get_text(strip=True) if locality_path else ""
        
        # Phone
        phone_path = driver_firefox.find_elements(By.CSS_SELECTOR, "a.js-phone-link")
        phone_number = phone_path[0].get_attribute("href").replace("tel:", "").strip() if phone_path else ""
        
        # Email
        email_tag = driver_firefox.find_elements(By.CSS_SELECTOR, "a[href^='mailto:']")
        email = email_tag[0].get_attribute("href").replace("mailto:", "").strip() if email_tag else ""

    except Exception as e:
        print(f"{idx+1}. Failed to extract data for: {links_tekapo[idx]} — {e}")
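        # Reset every field so a failed page never reuses values from the previous activity
        street_text = ""
        locality_text = ""
        email = ""
        phone_number = ""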

    street_addresses_tekapo.append(street_text)
    localities_tekapo.append(locality_text)
    emails_tekapo.append(email)
    phone_numbers_tekapo.append(phone_number)
    
    wait_time = np.random.chisquare(3) + 4
    print(f"Sleeping for {wait_time:.2f} seconds...")
    time.sleep(wait_time)

### Final Check of Lake Tekapo Scraped Lists

In [None]:
tekapo_scrapped_lists = [
    titles_tekapo,
    links_tekapo,
    descriptions_tekapo,
    images_tekapo,
    street_addresses_tekapo,
    localities_tekapo,
    emails_tekapo,
    phone_numbers_tekapo,
]

list_names = [
    "titles_tekapo",
    "links_tekapo",
    "descriptions_tekapo",
    "images_tekapo",
    "street_addresses_tekapo",
    "localities_tekapo",
    "emails_tekapo",
    "phone_numbers_tekapo",
]

for name, values in zip(list_names, tekapo_scrapped_lists):
    print(f"List length of {name}: {len(values)}")

## Wanaka

In [None]:
driver_firefox.switch_to.window(city_tab_handles["wānaka"])

time.sleep(5)
# Click on the "Activities" filter
try:
    filter_xpath = "//span[contains(text(),'Activities')]"
    filter_button = WebDriverWait(driver_firefox, 4).until(
        EC.element_to_be_clickable((By.XPATH, filter_xpath))
    )
    filter_button.click()
    print("Activities' filter clicked on Wanaka page.")
except Exception as e:
    print(f"Failed to click 'Activities': {e}")

In [None]:
time.sleep(np.random.chisquare(3)+5)
click = 0 
max_clicks = 4
while click < max_clicks:
    try:
        load_more_xpath = '//*[@id="search-results"]/div[2]/div/div[3]/button'
        load_more_button = WebDriverWait(driver_firefox, 5).until(
            EC.element_to_be_clickable((By.XPATH, load_more_xpath))
        )

        # Click the button
        load_more_button.click()
        click += 1
        print("Loading more...")

        # Optional: wait for new content to load
        time.sleep(5)

    except TimeoutException:
        print("All activities loaded (no more button).")
        break

### Data Scraping - Wanaka

#### Main Page

In [None]:
html = driver_firefox.page_source
soup = BeautifulSoup(html, "html.parser")

results_container = soup.find("div", class_="search-results__results")
activity_blocks = results_container.find_all("div", class_="results__wrapper") if results_container else []

titles_wanaka = []
links_wanaka= []
descriptions_wanaka = []
images_wanaka = []

for activity in activity_blocks:
    try:
        # Title
        title_path = activity.select_one("h4.results__title a")
        title = title_path.get_text(strip=True) if title_path else ""
        
        # Link
        link = title_path["href"] if title_path and "href" in title_path.attrs else ""

        # Description
        desc_path = activity.select_one("p.results__description")
        description = desc_path.get_text(strip=True) if desc_path else ""

        # Image
        img_path = activity.select_one("figure.results__photo img")
        img_url = img_path["src"] if img_path and "src" in img_path.attrs else ""

        # Append All
        titles_wanaka.append(title)
        links_wanaka.append(link)
        descriptions_wanaka.append(description)
        images_wanaka.append(img_url)

    except Exception as e:
        print(f"Skipping block due to: {e}")
        continue

#### Secondary Page of Each Activity

In [None]:
street_addresses_wanaka = []
localities_wanaka = []
emails_wanaka = []
phone_numbers_wanaka = []

for idx, url in enumerate(links_wanaka):
    try:
        driver_firefox.get(url)
        WebDriverWait(driver_firefox, 5).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "p[itemtype='http://schema.org/LocalBusiness']"))
        )

        detail_soup = BeautifulSoup(driver_firefox.page_source, "html.parser")
        address_block = detail_soup.select_one("p[itemtype='http://schema.org/LocalBusiness']")

        # Street
        street_path = address_block.select_one("span[itemprop='streetAddress']")
        street_text = street_path.get_text(strip=True) if street_path else ""
        
        # Locality
        locality_path = address_block.select_one("span[itemprop='addressLocality']")
        locality_text = locality_path.get_text(strip=True) if locality_path else ""
        
        # Phone
        phone_path = driver_firefox.find_elements(By.CSS_SELECTOR, "a.js-phone-link")
        phone_number = phone_path[0].get_attribute("href").replace("tel:", "").strip() if phone_path else ""
        
        # Email
        email_tag = driver_firefox.find_elements(By.CSS_SELECTOR, "a[href^='mailto:']")
        email = email_tag[0].get_attribute("href").replace("mailto:", "").strip() if email_tag else ""

    except Exception as e:
        print(f"{idx+1}. Failed to extract data for: {links_wanaka[idx]} — {e}")
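        # Reset every field so a failed page never reuses values from the previous activity
        street_text = ""
        locality_text = ""
        email = ""
        phone_number = ""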

    street_addresses_wanaka.append(street_text)
    localities_wanaka.append(locality_text)
    emails_wanaka.append(email)
    phone_numbers_wanaka.append(phone_number)
    
    wait_time = np.random.chisquare(3) + 4
    print(f"Sleeping for {wait_time:.2f} seconds...")
    time.sleep(wait_time)

### Final Check of Wanaka Scraped Lists

In [None]:
wanaka_scrapped_lists = [
    titles_wanaka,
    links_wanaka,
    descriptions_wanaka,
    images_wanaka,
    street_addresses_wanaka,
    localities_wanaka,
    emails_wanaka,
    phone_numbers_wanaka,
]

list_names = [
    "titles_wanaka",
    "links_wanaka",
    "descriptions_wanaka",
    "images_wanaka",
    "street_addresses_wanaka",
    "localities_wanaka",
    "emails_wanaka",
    "phone_numbers_wanaka",
]

for name, values in zip(list_names, wanaka_scrapped_lists):
    print(f"List length of {name}: {len(values)}")

## Convert to Dataframe

The last step of the project was to combine all datasets into a single dataframe:


In [None]:

data_auckland = pd.DataFrame({
    "place": ["Auckland"] * len(titles_auckland),
    "activities": titles_auckland,
    "activity_descriptions": descriptions_auckland,
    "activity_address_streets": street_addresses_auckland,
    "activity_localities": localities_auckland,
    "activity_emails": emails_auckland,
    "activity_phone_numbers": phone_numbers_auckland,
    "activity_links": links_auckland,
    "activity_images" : images_auckland
})


data_queenstown = pd.DataFrame({
    "place": ["Queenstown"] * len(titles_queenstown),
    "activities": titles_queenstown,
    "activity_descriptions": descriptions_queenstown,
    "activity_address_streets": street_addresses_queenstown,
    "activity_localities": localities_queenstown,
    "activity_emails": emails_queenstown,
    "activity_phone_numbers": phone_numbers_queenstown,
    "activity_links": links_queenstown,
    "activity_images" : images_queenstown
})

data_tekapo = pd.DataFrame({
    "place": ["Tekapo"] * len(titles_tekapo),
    "activities": titles_tekapo,
    "activity_descriptions": descriptions_tekapo,
    "activity_address_streets": street_addresses_tekapo,
    "activity_localities": localities_tekapo,
    "activity_emails": emails_tekapo,
    "activity_phone_numbers": phone_numbers_tekapo,
    "activity_links": links_tekapo,
    "activity_images" : images_tekapo
})

data_wanaka = pd.DataFrame({
    "place": ["Wanaka"] * len(titles_wanaka),
    "activities": titles_wanaka,
    "activity_descriptions": descriptions_wanaka,
    "activity_address_streets": street_addresses_wanaka,
    "activity_localities": localities_wanaka,
    "activity_emails": emails_wanaka,
    "activity_phone_numbers": phone_numbers_wanaka,
    "activity_links": links_wanaka,
    "activity_images" : images_wanaka
})

In [None]:
data_auckland.tail(3)

In [None]:
data_queenstown.tail(3)

In [None]:
data_tekapo.tail(3)

In [None]:
data_wanaka.tail(3)

In [None]:
data_all_city = []

data_all_city.append(data_auckland)
data_all_city.append(data_queenstown)
data_all_city.append(data_tekapo)
data_all_city.append(data_wanaka)

final_data = pd.concat(data_all_city, ignore_index=True)

In [None]:
final_data

## Export to Pickle

In [None]:
final_data.to_pickle("pickle_dump/all_city_activities.pkl")
data_auckland.to_pickle("pickle_dump/auckland_city_activities.pkl")
data_queenstown.to_pickle("pickle_dump/queenstown_city_activities.pkl")
data_tekapo.to_pickle("pickle_dump/tekapo_city_activities.pkl")
data_wanaka.to_pickle("pickle_dump/wanaka_city_activities.pkl")
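
The export above assumes that the `pickle_dump` folder already exists; writing would likely fail if it is missing. The `save_object` helper defined near the top of the notebook could be extended to create the folder first. This is a minimal sketch with toy data, where `load_object` is a hypothetical counterpart added only to demonstrate the round trip.

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj, filename) -> Path:
    """Pickle `obj` to `filename`, creating missing parent folders first."""
    path = Path(filename)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as output:
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)
    return path


def load_object(filename):
    """Load a pickled object; only use this on files you created yourself."""
    with Path(filename).open("rb") as source:
        return pickle.load(source)


# Round trip in a temporary folder with toy data (not scraped results).
toy = {"place": "Auckland", "activities": ["Example tour"]}
with tempfile.TemporaryDirectory() as tmp:
    target = save_object(toy, Path(tmp) / "pickle_dump" / "example.pkl")
    restored = load_object(target)

print(restored)
```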

In [15]:
driver_firefox.quit()

## Mitigation

### Rotation of User Agents

In [None]:
#service_firefox = Service(executable_path = firefoxpath) 
#options_firefox = webdriver.FirefoxOptions(); options_firefox.add_argument("--headless")
#driver_firefox = webdriver.Firefox(service = service_firefox, options = options_firefox)

#driver_firefox.get("https://us.cnn.com/")

# returns our current User-Agent
#user_agent = driver_firefox.execute_script("return navigator.userAgent;")
#print("Current User-Agent:", user_agent)

#driver_firefox.quit()
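
Rotation itself can be as simple as drawing a different User-Agent string for each browser session. The sketch below is an illustration under stated assumptions: the strings in `USER_AGENTS` are placeholders in the usual Firefox format, and `pick_user_agent` is a hypothetical helper. With Selenium and Firefox, the chosen string could be applied through the `general.useragent.override` preference before the driver is created, as indicated in the comment.

```python
import random

# Illustrative User-Agent strings; replace them with current, real browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 14.5; rv:128.0) Gecko/20100101 Firefox/128.0",
]


def pick_user_agent(previous=None, rng=None) -> str:
    """Pick a random User-Agent that differs from the previous one if possible."""
    rng = rng or random.Random()
    candidates = [ua for ua in USER_AGENTS if ua != previous] or USER_AGENTS
    return rng.choice(candidates)


ua = pick_user_agent()
# Possible use with Selenium (not executed here):
# options_firefox.set_preference("general.useragent.override", ua)
print(ua)
```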

### Rotation of IP Addresses