## Programming for Data Science 
### Project Notebook: "Where should I live?" 
#### Group Members:
- Afonso Fernandes / 20241710
- Lourenço Lima / 20241711
- Pedro Jorge / 20241819
- David Morais / 20241759
## Project Repository
GitHub Repository:  
https://github.com/afonsolince06/-Where-should-I-live-PDS-Project


### Introduction

In this part of the project, the goal is to build an interactive map of Europe that allows users to explore key information about major European cities. The task combines web scraping, data integration, and geospatial visualization to create an informative and interactive tool.

To accomplish this, we will:

- Scrape the geographical coordinates of each city from its Wikipedia article, reached by searching from the Wikipedia Main Page, so that every city in the provided dataset gets its coordinates from a single, consistent source.

- Match the scraped coordinates with the dataset entries so that each city is correctly paired with its country, population, average salary, and cost of living.

- Use the cleaned and enriched dataset to construct an interactive map of Europe, where each city can be clicked or hovered over to display its relevant information.

By the end of this section, we will have a fully functional map that visually represents European cities and provides meaningful insights through an intuitive interface. This builds on the skills developed earlier in the project and introduces new concepts in geospatial data handling and visualization.
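
The matching step described above can be sketched in plain Python. This is only an illustration under stated assumptions: `attach_coordinates` is a hypothetical helper, the `Latitude` and `Longitude` keys are assumed names, and the coordinates are taken to be already converted to decimal degrees. Only the `City` and `Country` keys come from the dataset used in this notebook.

```python
def attach_coordinates(rows, coordinates):
    """Return copies of `rows` with Latitude/Longitude added.

    rows        -- list of dicts, each with at least a "City" key
    coordinates -- dict mapping city name to (lat, lon) or None
    Cities without coordinates get None for both new keys.
    """
    enriched = []
    for row in rows:
        lat_lon = coordinates.get(row["City"])
        latitude, longitude = lat_lon if lat_lon is not None else (None, None)
        # Build a new dict so the input rows are left untouched
        enriched.append({**row, "Latitude": latitude, "Longitude": longitude})
    return enriched


rows = [
    {"City": "Salzburg", "Country": "Austria"},
    {"City": "Vienna", "Country": "Austria"},
]
coordinates = {"Salzburg": (47.8, 13.045), "Vienna": None}

for row in attach_coordinates(rows, coordinates):
    print(row)
```

Keeping unmatched cities in the result with `None` values, rather than dropping them, makes it easy to see afterwards which cities still need coordinates.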

#### Import essential libraries and define aliases for them

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from datetime import datetime
from bs4 import BeautifulSoup       # process html
from selenium import webdriver      # automate web browser interaction
from selenium.webdriver.common.by import By  # specify how to locate elements on a web page
import requests                    # make requests to fetch web pages
import time


In [4]:

from selenium.webdriver.chrome.options import Options as Options_c

urll = 'https://en.wikipedia.org/wiki/Main_Page'

# Pass the same options object to the driver so any flags added to it take effect
options = Options_c()
browser_c = webdriver.Chrome(options=options)
browser_c.get(urll)



In [7]:
search_button = browser_c.find_elements(By.CLASS_NAME, "cdx-text-input__input")
search_button

[<selenium.webdriver.remote.webelement.WebElement (session="585f4ab4e11c91c62fd3cbf5e344c280", element="f.19E1E9B74757B7E0D43F3EE73B7EEB93.d.4CF464180FC2BEF0C15E0D21F21D09CA.e.2")>,
 <selenium.webdriver.remote.webelement.WebElement (session="585f4ab4e11c91c62fd3cbf5e344c280", element="f.19E1E9B74757B7E0D43F3EE73B7EEB93.d.4CF464180FC2BEF0C15E0D21F21D09CA.e.10")>]

In [8]:
from selenium.webdriver.common.keys import Keys

# Use the first matching input as the search box
search_box = search_button[0]
search_box.click()
search_box.send_keys("Vienna")

# Press Enter to submit the search
search_box.send_keys(Keys.RETURN)


In [9]:
vienna_latitude=browser_c.find_element(By.CLASS_NAME, "latitude")
print(vienna_latitude.text)


48°12′30″N
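
Wikipedia returns the coordinate as a degrees/minutes/seconds string, whereas mapping libraries generally expect decimal degrees. The conversion can be sketched as follows. `dms_to_decimal` is a hypothetical helper, not part of the original notebook, and it assumes the string uses the same `°`, `′` and `″` characters as the output above, with minutes and seconds optional.

```python
import re

# degrees, optional minutes, optional seconds, then a hemisphere letter
_DMS = re.compile(
    r"(\d+(?:\.\d+)?)°(?:(\d+(?:\.\d+)?)′)?(?:(\d+(?:\.\d+)?)″)?([NSEW])"
)


def dms_to_decimal(dms: str) -> float:
    """Convert a string such as '48°12′30″N' to signed decimal degrees."""
    match = _DMS.fullmatch(dms.strip())
    if match is None:
        raise ValueError(f"Unrecognised coordinate format: {dms!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = float(degrees) + float(minutes or 0) / 60 + float(seconds or 0) / 3600
    # South and West are negative by convention
    return -value if hemisphere in "SW" else value


print(round(dms_to_decimal("48°12′30″N"), 4))  # 48.2083
```

Raising `ValueError` on an unexpected format, instead of returning a default, keeps a malformed scrape from silently placing a city at the wrong location.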


In [12]:
data=pd.read_csv('city_data_cleaned.csv')
cities=data['City'].tolist()
countries=data['Country'].tolist()
cities



['Salzburg',
 'Vienna',
 'Antwerp',
 'Bruges',
 'Brussels',
 'Gent',
 'Dobrich',
 'Sofia',
 'Split',
 'Zagreb',
 'Lefkosia',
 'Lemesos',
 'Ostrava',
 'Prague',
 'Copenhagen',
 'Odense',
 'Tallinn',
 'Helsinki',
 'Tampere',
 'Lyon',
 'Paris',
 'Toulouse',
 'Berlin',
 'Cologne',
 'Dresden',
 'Dusseldorf',
 'Frankfurt am Main',
 'Hamburg',
 'Hanover',
 'Leipzig',
 'Munich',
 'Stuttgart',
 'Athens',
 'Thessaloniki',
 'Budapest',
 'Debrecen',
 'Miskolc',
 'Cork',
 'Dublin',
 'Florence',
 'Milan',
 'Naples',
 'Rome',
 'Turin',
 'Venice',
 'Riga',
 'Luxembourg',
 'Malta',
 'Amsterdam',
 'Eindhoven',
 'Rotterdam',
 'The Hague',
 'Utrecht',
 'Bergen',
 'Oslo',
 'Stavanger',
 'Cracow',
 'Lodz',
 'Warsaw',
 'Braga',
 'Coimbra',
 'Lisbon',
 'Porto',
 'Giroc',
 'Bratislava',
 'Ljubljana',
 'Barcelona',
 'Madrid',
 'Malaga',
 'Seville',
 'Valencia',
 'Gothenburg',
 'Malmo',
 'Stockholm',
 'Basel',
 'Geneva',
 'Zurich',
 'Adana',
 'Ankara',
 'Edinburgh',
 'Glasgow',
 'Leeds',
 'Liverpool',
 'London']

In [13]:
cities_1=['Salzburg',
 'Vienna',
 'Antwerp',
 'Bruges']
countries_1=['Austria',
 'Austria',
 'Belgium',
 'Belgium']

In [14]:
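# Imported at module level so the later cells can use these names as well
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def get_city_coordinates(cities):
    """Search Wikipedia for each city and return {city: (lat, lon) or None}."""
    options = webdriver.ChromeOptions()
    browser = webdriver.Chrome(options=options)
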
    results = {}

    for city in cities:
        browser.get("https://en.wikipedia.org/wiki/Main_Page")
        try:
            # -----------------------------
            # 1. Find search bar (fresh each loop)
            # -----------------------------
            search_bar = WebDriverWait(browser, 15).until(
                EC.element_to_be_clickable((By.ID, "searchInput"))
            )
            search_bar.clear()
            search_bar.send_keys(city)
            search_bar.send_keys(Keys.RETURN)

            # Wait for page to load (city page)
            time.sleep(10)
            WebDriverWait(browser, 15).until(
                EC.presence_of_element_located((By.ID, "firstHeading"))
            )
        

            # -----------------------------
            # 2. Extract coordinates
            # -----------------------------
            WebDriverWait(browser, 15).until(
                EC.presence_of_element_located((By.CLASS_NAME, "latitude"))
            )
            WebDriverWait(browser, 15).until(
                EC.presence_of_element_located((By.CLASS_NAME, "longitude"))
            )
            latitude_element = browser.find_element(By.CLASS_NAME, "latitude")
            longitude_element = browser.find_element(By.CLASS_NAME, "longitude")

            latitude = latitude_element.text
            longitude = longitude_element.text

            results[city] = (latitude, longitude)

        except Exception as e:
            print(f"Coordinates not found for {city}. Error: {e}")
            results[city] = None

        # Pause between cities; the next iteration reloads the main page itself
        time.sleep(5)

    browser.quit()
    return results


In [15]:


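from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def get_city_coordinates_1(cities):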
    options = webdriver.ChromeOptions()
    # Add an argument to run Chrome headless (without a GUI) for stability
    # options.add_argument("--headless") 
    browser = webdriver.Chrome(options=options)

    results = {}

    for city in cities:
        print(f"Searching for {city}...")
        try:
            # Go to Wikipedia main page
            browser.get("https://en.wikipedia.org/wiki/Main_Page")

            # Search for the city
            search_bar = WebDriverWait(browser, 15).until(
                EC.element_to_be_clickable((By.ID, "searchInput"))
            )
            search_bar.clear()
            search_bar.send_keys(city)
            search_bar.send_keys(Keys.RETURN)

            # Check whether the search landed on a results page rather than an article.
            # We look for the results list (class "mw-search-results"). If it does not
            # appear within the timeout, we assume the search went straight to an
            # article and move on to the coordinate extraction step.
            try:
                WebDriverWait(browser, 5).until(
                    EC.presence_of_element_located((By.CLASS_NAME, "mw-search-results"))
                )

                # The results list was found, so we are on the search results page.
                print(f"Search results page found for {city}. Clicking first link.")

                # The CSS selector targets the first anchor (<a>) in the first
                # list item (<li>) of the search results (<ul>).
                first_result = WebDriverWait(browser, 15).until(
                    EC.element_to_be_clickable((By.CSS_SELECTOR, "ul.mw-search-results > li:first-child a"))
                )
                first_result.click()
                time.sleep(3)  # Wait for the city page to load after clicking the link

            except TimeoutException:
                # No results list: assume the article page is already loaded
                # and continue to the coordinate extraction below.
                print(f"Direct article match for {city}.")

            # Extract coordinates
            latitude_element = WebDriverWait(browser, 15).until(
                EC.presence_of_element_located((By.CLASS_NAME, "latitude"))
            )
            # Use `find_element` as `presence_of_element_located` has confirmed the page structure
            longitude_element = browser.find_element(By.CLASS_NAME, "longitude")

            latitude = latitude_element.text
            longitude = longitude_element.text

            results[city] = (latitude, longitude)
            print(f"Successfully retrieved coordinates for {city}: ({latitude}, {longitude})")

        except TimeoutException:
            print(f"Coordinates not found for {city}: Timed out while waiting for elements.")
            results[city] = None
        except NoSuchElementException:
            print(f"Coordinates not found for {city}: Element not found on page.")
            results[city] = None
        except Exception as e:
            print(f"Coordinates not found for {city}. General error: {e}")
            results[city] = None

        time.sleep(2) # Pause before next city

    browser.quit()
    return results

# Example usage:
# city_list = ["London", "Paris", "Berlin"]
# coordinates = get_city_coordinates_1(city_list)
# print("\nFinal Coordinates:")
# print(coordinates)

In [16]:


def get_city_coordinates_2(cities):
    options = webdriver.ChromeOptions()
    browser = webdriver.Chrome(options=options)
    results = {}

    for city in cities:
        try:
            browser.get("https://en.wikipedia.org/wiki/Main_Page")
            search_bar = WebDriverWait(browser, 15).until(
                EC.element_to_be_clickable((By.ID, "searchInput"))
            )
            search_bar.clear()
            search_bar.send_keys(city)
            search_bar.send_keys(Keys.RETURN)
            time.sleep(5)

            # Function to find coordinates on current page
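            def find_coordinates():
                try:
                    lat = browser.find_element(By.CLASS_NAME, "latitude").text
                    lon = browser.find_element(By.CLASS_NAME, "longitude").text
                    return (lat, lon)
                except Exception:
                    # No coordinate elements on this page (avoid a bare except,
                    # which would also swallow KeyboardInterrupt)
                    return None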

            coords = find_coordinates()
            if coords:
                results[city] = coords
                continue

            # If search results page exists, click first result
            search_results = browser.find_elements(By.CSS_SELECTOR, "ul.mw-search-results li a")
            if search_results:
                search_results[0].click()
                time.sleep(2)
                coords = find_coordinates()
                if coords:
                    results[city] = coords
                    continue

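            # If disambiguation, visit each candidate link until one has coordinates.
            # Collect the URLs first: clicking a link and going back would leave the
            # remaining link elements stale.
            links = browser.find_elements(By.CSS_SELECTOR, "div.mw-parser-output ul li a")
            hrefs = [link.get_attribute("href") for link in links]
            for href in hrefs:
                if not href:
                    continue
                browser.get(href)
                time.sleep(2)
                coords = find_coordinates()
                if coords:
                    results[city] = coords
                    break
            if city not in results:
                results[city] = None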

        except Exception as e:
            print(f"Coordinates not found for {city}. Error: {e}")
            results[city] = None

    browser.quit()
    return results



In [17]:


def get_city_coordinates_3(cities):
    options = webdriver.ChromeOptions()
    browser = webdriver.Chrome(options=options)
    results = {}

    for city in cities:
        try:
            browser.get("https://en.wikipedia.org/wiki/Main_Page")
            time.sleep(3)

            # Search for "City city" to reduce ambiguity
            search_bar = browser.find_element(By.ID, "searchInput")
            search_bar.clear()
            search_bar.send_keys(f"{city} city")
            search_bar.send_keys(Keys.RETURN)
            time.sleep(4)  # wait for redirect or search results

            # Check if search results page exists
            if "search" in browser.current_url:
                try:
                    first_result = browser.find_element(By.CSS_SELECTOR, "ul.mw-search-results li a")
                    first_result.click()
                    time.sleep(4)
                except Exception:
                    # No result link found; fall through to coordinate extraction
                    pass

            # Check if disambiguation page
            if "disambiguation" in browser.current_url:
                try:
                    first_link = browser.find_element(By.CSS_SELECTOR, "div.mw-parser-output ul li a")
                    first_link.click()
                    time.sleep(4)
                except Exception:
                    # No candidate link found; fall through to coordinate extraction
                    pass

            # Extract coordinates
            latitude = browser.find_element(By.CLASS_NAME, "latitude").text
            longitude = browser.find_element(By.CLASS_NAME, "longitude").text
            results[city] = (latitude, longitude)

        except Exception as e:
            print(f"Coordinates not found for {city}. Error: {e}")
            results[city] = None

        time.sleep(2)  # small pause before next city

    browser.quit()
    return results
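
All the variants above type into the search box. One alternative that avoids handling the search input element altogether is to load a search URL directly. The sketch below assumes that Wikipedia's `index.php` accepts a `search` query parameter and forwards to the article when the title matches; `build_search_url` is a hypothetical helper and has not been run against the live site in this notebook.

```python
from urllib.parse import urlencode

WIKIPEDIA_INDEX = "https://en.wikipedia.org/w/index.php"


def build_search_url(city: str) -> str:
    """Return a Wikipedia search URL for `city`, with the name URL-encoded."""
    return f"{WIKIPEDIA_INDEX}?{urlencode({'search': city})}"


print(build_search_url("Frankfurt am Main"))
# https://en.wikipedia.org/w/index.php?search=Frankfurt+am+Main
```

Inside the loop, `browser.get(build_search_url(city))` would then stand in for the `find_element` / `send_keys` sequence, and the later checks for a results page or a disambiguation page would still apply.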



In [18]:
coordinates_city=get_city_coordinates_3(cities_1)
coordinates_city

Coordinates not found for Vienna. Error: Message: stale element reference: stale element not found in the current frame
  (Session info: chrome=142.0.7444.175); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#staleelementreferenceexception
Stacktrace:
Symbols not available. Dumping unresolved backtrace:
	0x7ff7240ba235
	0x7ff723e12630
	0x7ff723ba16dd
	0x7ff723ba94c8
	0x7ff723bac52f
	0x7ff723bac5ef
	0x7ff723bfb23a
	0x7ff723bfb317
	0x7ff723bf0b5c
	0x7ff723c22b0a
	0x7ff723becc36
	0x7ff723c4baba
	0x7ff723beb0ed
	0x7ff723bebf63
	0x7ff7240e5d60
	0x7ff7240dfe8a
	0x7ff724101005
	0x7ff723e2d71e
	0x7ff723e34e1f
	0x7ff723e1b7c4
	0x7ff723e1b97f
	0x7ff723e018e8
	0x7ffa7c85e8d7
	0x7ffa7e04c53c

Coordinates not found for Bruges. Error: Message: stale element reference: stale element not found in the current frame
  (Session info: chrome=142.0.7444.175); For documentation on this error, please visit: https://www.selenium.dev/docume

{'Salzburg': ('47°48′00″N', '13°02′42″E'),
 'Vienna': None,
 'Antwerp': ('51°13′04″N', '04°24′01″E'),
 'Bruges': None}
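
Because failed cities are stored as `None`, the result can be split into the cities that succeeded and those that need another attempt. A minimal sketch follows, using the output above as input. `split_results` and `merge_results` are hypothetical helpers and not part of the original notebook.

```python
def split_results(results):
    """Split {city: coords or None} into (found dict, list of missing cities)."""
    found = {city: coords for city, coords in results.items() if coords is not None}
    missing = [city for city, coords in results.items() if coords is None]
    return found, missing


def merge_results(results, retry_results):
    """Fill gaps in `results` with non-None values from `retry_results`."""
    merged = dict(results)
    for city, coords in retry_results.items():
        if merged.get(city) is None and coords is not None:
            merged[city] = coords
    return merged


coordinates_city = {
    "Salzburg": ("47°48′00″N", "13°02′42″E"),
    "Vienna": None,
    "Antwerp": ("51°13′04″N", "04°24′01″E"),
    "Bruges": None,
}

found, missing = split_results(coordinates_city)
print(missing)  # ['Vienna', 'Bruges']
```

The `missing` list can be passed back to any of the scraping functions, and `merge_results` then combines the two runs without overwriting coordinates that were already found.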