## Programming for Data Science 
### Project Notebook: "Where should I live?" 
#### Group Members:
- Afonso Fernandes / 20241710
- Lourenço Lima / 20241711
- Pedro Jorge / 20241819
- David Morais / 20241759
## Project Repository
GitHub Repository:  
https://github.com/afonsolince06/-Where-should-I-live-PDS-Project


### Introduction

In this part of the project, the goal is to build an interactive map of Europe that allows users to explore key information about major European cities. The task combines web scraping, data integration, and geospatial visualization to create an informative and interactive tool.

To accomplish this, we will:

- Scrape the geographical coordinates of each city from its Wikipedia article, reached by searching for the city from the Wikipedia Main Page, ensuring consistency with the provided dataset.
- Match the scraped coordinates with the dataset entries so that each city is correctly assigned to its corresponding country, population, average salary, and cost of living.
- Use the cleaned and enriched dataset to construct an interactive map of Europe, where each city can be clicked or hovered over to display its relevant information.

By the end of this section, we will have a fully functional map that visually represents European cities and provides meaningful insights through an intuitive interface. This builds on the skills developed earlier in the project and introduces new concepts in geospatial data handling and visualization.
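The matching step described above can be sketched without any scraping at all. The snippet below is a minimal illustration, assuming the scraper returns a dictionary of `city -> (lat, lon)` or `None`; the helper name `attach_coordinates` and the two sample rows are hypothetical and only mirror the `City` and `Country` columns of the dataset.

```python
def attach_coordinates(rows, coordinates):
    """Add 'Latitude'/'Longitude' to each row and list cities without a match.

    rows: list of dicts that each have a 'City' key (hypothetical layout).
    coordinates: dict mapping city name -> (lat, lon) tuple, or None.
    """
    enriched, missing = [], []
    for row in rows:
        coords = coordinates.get(row["City"])
        if coords is None:
            # Keep the row, but remember that it still needs coordinates.
            missing.append(row["City"])
            coords = (None, None)
        enriched.append({**row, "Latitude": coords[0], "Longitude": coords[1]})
    return enriched, missing


rows = [
    {"City": "Vienna", "Country": "Austria"},
    {"City": "Gent", "Country": "Belgium"},
]
scraped = {"Vienna": (48.2083, 16.3725), "Gent": None}

enriched, missing = attach_coordinates(rows, scraped)
print(missing)  # cities that still need a manual lookup
```

Keeping the unmatched cities in a separate list makes it easy to see which rows would be dropped from the map.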

#### Import essential libraries and define aliases for them

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from datetime import datetime
from bs4 import BeautifulSoup       # process html
from selenium import webdriver      # automate web browser interaction
from selenium.webdriver.common.by import By  # specify how to locate elements on a web page
import requests                    # make requests to fetch web pages
import time


In [2]:

from selenium.webdriver.chrome.options import Options as Options_c

urll = 'https://en.wikipedia.org/wiki/Main_Page'

options = Options_c()
browser_c = webdriver.Chrome(options=options)  # reuse the options object created above
browser_c.get(urll)



In [9]:
search_button = browser_c.find_elements(By.CLASS_NAME, "cdx-text-input__input")
search_button

[<selenium.webdriver.remote.webelement.WebElement (session="57e27d5ec3bd01ebbb7a777fcb65e4b2", element="f.56E20482F3663EF017367BF3A555F361.d.C7EBC11A56A410BA270237941877C3AA.e.102")>,
 <selenium.webdriver.remote.webelement.WebElement (session="57e27d5ec3bd01ebbb7a777fcb65e4b2", element="f.56E20482F3663EF017367BF3A555F361.d.C7EBC11A56A410BA270237941877C3AA.e.110")>]

In [10]:
search_button[0].click()
from selenium.webdriver.common.keys import Keys
search_box = search_button[0]
search_box.send_keys("Vienna")

# Optionally, press Enter to submit the search
search_box.send_keys(Keys.RETURN)


In [11]:
vienna_latitude=browser_c.find_element(By.CLASS_NAME, "latitude")
print(vienna_latitude.text)


48°12′30″N
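The value printed above is a degrees/minutes/seconds (DMS) string, while map libraries generally expect signed decimal degrees. The following is a minimal sketch of that conversion; the function name `dms_to_decimal` is an assumption for illustration, and the pattern only covers the formats visible in this notebook's outputs (seconds are optional, as in `43°34′N`).

```python
import re

# Matches strings such as "48°12′30″N", "43°34′N" or "4°50′06″E".
# Minutes and seconds are optional because some pages omit them.
_DMS_PATTERN = re.compile(
    r"""^\s*(?P<deg>\d+(?:\.\d+)?)°
        (?:(?P<min>\d+(?:\.\d+)?)[′'])?
        (?:(?P<sec>\d+(?:\.\d+)?)[″"])?
        \s*(?P<hem>[NSEW])\s*$""",
    re.VERBOSE,
)


def dms_to_decimal(dms_text):
    """Convert a DMS coordinate string to signed decimal degrees."""
    match = _DMS_PATTERN.match(dms_text)
    if match is None:
        raise ValueError(f"Unrecognised coordinate format: {dms_text!r}")
    degrees = float(match["deg"])
    minutes = float(match["min"] or 0)
    seconds = float(match["sec"] or 0)
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative by convention.
    return -value if match["hem"] in "SW" else value


print(round(dms_to_decimal("48°12′30″N"), 4))  # 48.2083
```

Raising on unrecognised input, rather than returning `None`, keeps malformed strings from silently turning into missing points on the map.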


In [12]:
data=pd.read_csv('city_data_cleaned.csv')
cities=data['City'].tolist()
countries=data['Country'].tolist()
cities



['Salzburg',
 'Vienna',
 'Antwerp',
 'Bruges',
 'Brussels',
 'Gent',
 'Dobrich',
 'Sofia',
 'Split',
 'Zagreb',
 'Lefkosia',
 'Lemesos',
 'Ostrava',
 'Prague',
 'Copenhagen',
 'Odense',
 'Tallinn',
 'Helsinki',
 'Tampere',
 'Lyon',
 'Paris',
 'Toulouse',
 'Berlin',
 'Cologne',
 'Dresden',
 'Dusseldorf',
 'Frankfurt am Main',
 'Hamburg',
 'Hanover',
 'Leipzig',
 'Munich',
 'Stuttgart',
 'Athens',
 'Thessaloniki',
 'Budapest',
 'Debrecen',
 'Miskolc',
 'Cork',
 'Dublin',
 'Florence',
 'Milan',
 'Naples',
 'Rome',
 'Turin',
 'Venice',
 'Riga',
 'Luxembourg',
 'Malta',
 'Amsterdam',
 'Eindhoven',
 'Rotterdam',
 'The Hague',
 'Utrecht',
 'Bergen',
 'Oslo',
 'Stavanger',
 'Cracow',
 'Lodz',
 'Warsaw',
 'Braga',
 'Coimbra',
 'Lisbon',
 'Porto',
 'Giroc',
 'Bratislava',
 'Ljubljana',
 'Barcelona',
 'Madrid',
 'Malaga',
 'Seville',
 'Valencia',
 'Gothenburg',
 'Malmo',
 'Stockholm',
 'Basel',
 'Geneva',
 'Zurich',
 'Adana',
 'Ankara',
 'Edinburgh',
 'Glasgow',
 'Leeds',
 'Liverpool',
 'London']

In [13]:
cities_1=['Salzburg',
 'Vienna',
 'Antwerp',
 'Bruges']
countries_1=['Austria',
 'Austria',
 'Belgium',
 'Belgium']

In [14]:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def get_city_coordinates(cities):
    options = webdriver.ChromeOptions()
    browser = webdriver.Chrome(options=options)

    results = {}

    for city in cities:
        browser.get("https://en.wikipedia.org/wiki/Main_Page")
        try:
            # -----------------------------
            # 1. Find search bar (fresh each loop)
            # -----------------------------
            search_bar = WebDriverWait(browser, 15).until(
                EC.element_to_be_clickable((By.ID, "searchInput"))
            )
            search_bar.clear()
            search_bar.send_keys(city)
            search_bar.send_keys(Keys.RETURN)

            # Wait for page to load (city page)
            time.sleep(10)
            WebDriverWait(browser, 15).until(
                EC.presence_of_element_located((By.ID, "firstHeading"))
            )
        

            # -----------------------------
            # 2. Extract coordinates
            # -----------------------------
            WebDriverWait(browser, 15).until(
                EC.presence_of_element_located((By.CLASS_NAME, "latitude"))
            )
            WebDriverWait(browser, 15).until(
                EC.presence_of_element_located((By.CLASS_NAME, "longitude"))
            )
            latitude_element = browser.find_element(By.CLASS_NAME, "latitude")
            longitude_element = browser.find_element(By.CLASS_NAME, "longitude")

            latitude = latitude_element.text
            longitude = longitude_element.text

            results[city] = (latitude, longitude)

        except Exception as e:
            print(f"Coordinates not found for {city}. Error: {e}")
            results[city] = None

        # Pause between cities; the next iteration reloads the Main Page itself,
        # so no extra navigation is needed here.
        time.sleep(5)

    browser.quit()
    return results


In [15]:


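# Imports this cell needs (they are not available at module level otherwise)
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def get_city_coordinates_1(cities):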
    options = webdriver.ChromeOptions()
    # Add an argument to run Chrome headless (without a GUI) for stability
    # options.add_argument("--headless") 
    browser = webdriver.Chrome(options=options)

    results = {}

    for city in cities:
        print(f"Searching for {city}...")
        try:
            # Go to Wikipedia main page
            browser.get("https://en.wikipedia.org/wiki/Main_Page")

            # Search for the city
            search_bar = WebDriverWait(browser, 15).until(
                EC.element_to_be_clickable((By.ID, "searchInput"))
            )
            search_bar.clear()
            search_bar.send_keys(city)
            search_bar.send_keys(Keys.RETURN)

            # --- MODIFIED LOGIC START ---

            # Check whether the search landed on a results page rather than an article.
            # The results list carries the class "mw-search-results". If the search led
            # directly to an article, this wait times out and we move on to the
            # coordinate extraction step.

            try:
                WebDriverWait(browser, 5).until(
                    EC.presence_of_element_located((By.CLASS_NAME, "mw-search-results"))
                )
                
                # If the search results container is found, it means we are on the results page.
                print(f"Search results page found for {city}. Clicking first link.")
                
                # Find the first result link within the results list.
                first_result = WebDriverWait(browser, 15).until(
                    # This CSS selector targets the first anchor (<a>) in the first
                    # list item (<li>) of the search results (<ul>)
                    EC.element_to_be_clickable((By.CSS_SELECTOR, "ul.mw-search-results > li:first-child a"))
                )
                first_result.click()
                time.sleep(3) # Wait for city page to load after clicking the link
                
            except TimeoutException:
                # This means the search led directly to a page (not the search results page). 
                # We assume the correct article page is loaded and continue to extraction.
                print(f"Direct article match for {city}.")
                pass # Continue to the coordinate extraction below

            # --- MODIFIED LOGIC END ---

            # Extract coordinates
            latitude_element = WebDriverWait(browser, 15).until(
                EC.presence_of_element_located((By.CLASS_NAME, "latitude"))
            )
            # Use `find_element` as `presence_of_element_located` has confirmed the page structure
            longitude_element = browser.find_element(By.CLASS_NAME, "longitude")

            latitude = latitude_element.text
            longitude = longitude_element.text

            results[city] = (latitude, longitude)
            print(f"Successfully retrieved coordinates for {city}: ({latitude}, {longitude})")

        except TimeoutException:
            print(f"Coordinates not found for {city}: Timed out while waiting for elements.")
            results[city] = None
        except NoSuchElementException:
            print(f"Coordinates not found for {city}: Element not found on page.")
            results[city] = None
        except Exception as e:
            print(f"Coordinates not found for {city}. General error: {e}")
            results[city] = None

        time.sleep(2) # Pause before next city

    browser.quit()
    return results

# Example Usage:
# city_list = ["London", "Paris", "Berlin"]
# coordinates = get_city_coordinates_1(city_list)
# print("\nFinal Coordinates:")
# print(coordinates)

In [16]:


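# Imports this cell needs (they are not available at module level otherwise)
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def get_city_coordinates_2(cities):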
    options = webdriver.ChromeOptions()
    browser = webdriver.Chrome(options=options)
    results = {}

    for city in cities:
        try:
            browser.get("https://en.wikipedia.org/wiki/Main_Page")
            search_bar = WebDriverWait(browser, 15).until(
                EC.element_to_be_clickable((By.ID, "searchInput"))
            )
            search_bar.clear()
            search_bar.send_keys(city)
            search_bar.send_keys(Keys.RETURN)
            time.sleep(5)

            # Function to find coordinates on current page
            def find_coordinates():
                try:
                    lat = browser.find_element(By.CLASS_NAME, "latitude").text
                    lon = browser.find_element(By.CLASS_NAME, "longitude").text
                    return (lat, lon)
                except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
                    return None

            coords = find_coordinates()
            if coords:
                results[city] = coords
                continue

            # If search results page exists, click first result
            search_results = browser.find_elements(By.CSS_SELECTOR, "ul.mw-search-results li a")
            if search_results:
                search_results[0].click()
                time.sleep(2)
                coords = find_coordinates()
                if coords:
                    results[city] = coords
                    continue

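            # If disambiguation, visit each candidate link until one has coordinates.
            # Collect the URLs first: WebElements go stale once the page changes,
            # so clicking them one by one after navigating away would fail.
            links = browser.find_elements(By.CSS_SELECTOR, "div.mw-parser-output ul li a")
            hrefs = [link.get_attribute("href") for link in links]
            for href in hrefs:
                if not href:
                    continue
                browser.get(href)
                time.sleep(2)
                coords = find_coordinates()
                if coords:
                    results[city] = coords
                    break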
            if city not in results:
                results[city] = None

        except Exception as e:
            print(f"Coordinates not found for {city}. Error: {e}")
            results[city] = None

    browser.quit()
    return results



In [17]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

def get_cities_coordinates(cities):
    """
    Scrapes latitude and longitude from Wikipedia for a list of cities.
    
    Args:
        cities (list): List of city names (strings)
    
    Returns:
        dict: {city_name: (latitude, longitude) or None}
    """
    url = 'https://en.wikipedia.org/wiki/Main_Page'
    
    options = webdriver.ChromeOptions()
    # options.add_argument("--headless")  # Uncomment to run headless
    browser = webdriver.Chrome(options=options)
    
    results = {}
    
    for city in cities:
        try:
            browser.get(url)
            time.sleep(2)  # wait for page to load
            
            # Correct search input
            search_input = browser.find_element(By.ID, "searchInput")
            search_input.clear()
            search_input.send_keys(city)
            search_input.send_keys(Keys.RETURN)
            
            time.sleep(3)  # wait for page to load
            
            # Extract coordinates
            latitude = browser.find_element(By.CLASS_NAME, "latitude").text
            longitude = browser.find_element(By.CLASS_NAME, "longitude").text
            results[city] = (latitude, longitude)
            print(f"{city}: ({latitude}, {longitude})")
            
        except Exception as e:
            print(f"Coordinates not found for {city}. Error: {e}")
            results[city] = None
        
        time.sleep(2)  # pause before next city
    
    browser.quit()
    return results




In [18]:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def get_cities_coordinates_1(cities):
    url = 'https://en.wikipedia.org/wiki/Main_Page'
    options = webdriver.ChromeOptions()
    browser = webdriver.Chrome(options=options)
    results = {}

    for city in cities:
        try:
            browser.get(url)
            time.sleep(3)

            # Search input
            search_input = browser.find_element(By.ID, "searchInput")
            search_input.clear()
            search_input.send_keys(city)
            search_input.send_keys(Keys.RETURN)
            time.sleep(3)

            # Case 1: Search results page
            try:
                first_result = browser.find_element(By.CSS_SELECTOR, "ul.mw-search-results li a")
                first_result.click()
                time.sleep(3)
            except Exception:
                pass  # no search results list

            # Case 2: Disambiguation page
            try:
                disambig_link = browser.find_element(By.CSS_SELECTOR, "table.disambig a")
                disambig_link.click()
                time.sleep(3)
            except Exception:
                pass  # not a disambiguation page

            # Extract coordinates
            latitude = browser.find_element(By.CLASS_NAME, "latitude").text
            longitude = browser.find_element(By.CLASS_NAME, "longitude").text
            results[city] = (latitude, longitude)
            print(f"{city}: ({latitude}, {longitude})")

        except Exception as e:
            print(f"Coordinates not found for {city}. Error: {e}")
            results[city] = None

        time.sleep(2)

    browser.quit()
    return results
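Typing into the search box and then handling results or disambiguation pages adds several places where the scraper can fail. One alternative is to open the article URL directly. The sketch below assumes the article title equals the city name, which does not hold for every entry in the dataset (some names, such as 'Gent' or 'Lefkosia', differ from the English article titles), so Wikipedia may still redirect or show a disambiguation page. The helper name `wikipedia_article_url` is hypothetical.

```python
from urllib.parse import quote

WIKI_ARTICLE_BASE = "https://en.wikipedia.org/wiki/"


def wikipedia_article_url(title):
    """Build the URL of an English Wikipedia article from its title."""
    # Wikipedia writes spaces in titles as underscores in the URL path.
    path = title.strip().replace(" ", "_")
    # Percent-encode anything that is not safe in a URL path (e.g. "ö").
    return WIKI_ARTICLE_BASE + quote(path, safe="_(),")


print(wikipedia_article_url("Frankfurt am Main"))
# https://en.wikipedia.org/wiki/Frankfurt_am_Main
```

With Selenium this would replace the search interaction with a single `browser.get(wikipedia_article_url(city))` call before looking up the `latitude` and `longitude` elements.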


In [19]:
coordinates_city=get_cities_coordinates_1(cities)
coordinates_city

Salzburg: (47°48′00″N, 13°02′42″E)
Vienna: (48°12′30″N, 16°22′21″E)
Antwerp: (51°13′04″N, 04°24′01″E)
Coordinates not found for Bruges. Error: Message: stale element reference: stale element not found in the current frame
  (Session info: chrome=142.0.7444.176); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#staleelementreferenceexception
Stacktrace:
0   chromedriver                        0x00000001010e2ecc cxxbridge1$str$ptr + 2941512
1   chromedriver                        0x00000001010dab88 cxxbridge1$str$ptr + 2907908
2   chromedriver                        0x0000000100bf22b0 _RNvCsgXDX2mvAJAg_7___rustc35___rust_no_alloc_shim_is_unstable_v2 + 74020
3   chromedriver                        0x0000000100bf81e0 _RNvCsgXDX2mvAJAg_7___rustc35___rust_no_alloc_shim_is_unstable_v2 + 98388
4   chromedriver                        0x0000000100bfa314 _RNvCsgXDX2mvAJAg_7___rustc35___rust_no_alloc_shim_is_unstable_v2 + 106888

{'Salzburg': ('47°48′00″N', '13°02′42″E'),
 'Vienna': ('48°12′30″N', '16°22′21″E'),
 'Antwerp': ('51°13′04″N', '04°24′01″E'),
 'Bruges': None,
 'Brussels': ('50°50′48″N', '04°21′09″E'),
 'Gent': None,
 'Dobrich': ('43°34′N', '27°50′E'),
 'Sofia': None,
 'Split': None,
 'Zagreb': ('45°48′47″N', '15°58′39″E'),
 'Lefkosia': ('35°10′21″N', '33°21′54″E'),
 'Lemesos': ('34°40′29″N', '33°02′39″E'),
 'Ostrava': ('49°50′8″N', '18°17′33″E'),
 'Prague': ('50°5′15″N', '14°25′17″E'),
 'Copenhagen': ('55°40′34″N', '12°34′06″E'),
 'Odense': ('55°23′45″N', '10°23′19″E'),
 'Tallinn': ('59°26′14″N', '24°44′43″E'),
 'Helsinki': ('60°10′15″N', '24°56′15″E'),
 'Tampere': None,
 'Lyon': ('45°46′03″N', '4°50′06″E'),
 'Paris': ('48°51′24″N', '2°21′8″E'),
 'Toulouse': ('43°36′16″N', '1°26′38″E'),
 'Berlin': ('52°31′12″N', '13°24′18″E'),
 'Cologne': ('50°56′11″N', '6°57′10″E'),
 'Dresden': ('51°03′00″N', '13°44′24″E'),
 'Dusseldorf': ('51°13′32″N', '6°46′36″E'),
 'Frankfurt am Main': ('50°06′38″N', '08°40′56″E'
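The `stale element reference` error reported for Bruges means that a `WebElement` handle no longer pointed at a live node in the page. A plausible cause, stated here as an assumption, is that the page re-rendered between locating the element and using it. One common mitigation is to locate the element again and retry the whole step. The helper below is a generic sketch with a hypothetical name (`retry`); it does not depend on Selenium and simply re-runs a callable when a chosen exception type is raised.

```python
import time


def retry(action, retry_on, attempts=3, delay=1.0):
    """Call `action()` until it succeeds or `attempts` is exhausted.

    action: zero-argument callable that performs the step from scratch.
    retry_on: exception class (or tuple of classes) that triggers a retry.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay)  # give the page time to settle
    # Every attempt failed: surface the last error to the caller.
    raise last_error


# Possible use with Selenium (not executed here):
#   from selenium.common.exceptions import StaleElementReferenceException
#   retry(lambda: browser.find_element(By.ID, "searchInput").send_keys(city),
#         StaleElementReferenceException)
```

The important detail is that the lookup happens inside the callable, so each attempt works with a freshly located element.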

In [None]:
import pandas as pd 
import numpy as np
import plotly.express as px


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
import time 


try:
    from webdriver_manager.chrome import ChromeDriverManager
    def get_chrome_driver(options):
        
        return webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
except ImportError:
    
    def get_chrome_driver(options):
        print("Warning: webdriver-manager is not available. Make sure ChromeDriver is on your PATH.")
        return webdriver.Chrome(options=options)


city_data = pd.read_csv('city_data_cleaned.csv')
city_data_coords = city_data.copy()
city_data_coords['Latitude'] = np.nan
city_data_coords['Longitude'] = np.nan



def scrape_city_coordinates_robust(df):
    # Copy first so the early return below has something to hand back
    # (previously df_result was referenced before it was defined).
    df_result = df.copy()

    options = webdriver.ChromeOptions()

    try:
        driver = get_chrome_driver(options)
    except Exception as e:
        print(f"CRITICAL ERROR: Could not start Chrome. Details: {e}")
        return df_result

    base_url = "https://en.wikipedia.org/wiki/Main_Page"
    
    print("Starting the scraping process. A forced wait is used to make sure each page is read.")
    
    for index, row in df_result.iterrows():
        city = row['City']
        country = row['Country']
        query = f"{city}, {country}" 
        
        try:
            driver.get(base_url)
            
            
            SEARCH_INPUT_SELECTOR = "#searchInput" 
            search_input = WebDriverWait(driver, 10).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, SEARCH_INPUT_SELECTOR))
            )
            search_input.clear()
            search_input.send_keys(query)
            search_input.send_keys(Keys.RETURN)
            
            
            time.sleep(1)
            
            
            WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CLASS_NAME, "geo"))
            )
            
            
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            
            
            geo_tag = soup.find("span", {"class": "geo"})
            
            if geo_tag:
                coords_text = geo_tag.get_text()
                if ';' in coords_text:
                    lat, lon = coords_text.split(';')
                    df_result.at[index, 'Latitude'] = float(lat.strip())
                    df_result.at[index, 'Longitude'] = float(lon.strip())
                    print(f"✅ Success: {city}")
                else:
                    print(f"Warning: Coordinates found for {city}, but in an invalid format.")
            else:
                print(f"Failure: Coordinates element not found for {city}.")

        except Exception as e:
            # NOTE: the message assumes a timeout; `e` holds the actual exception.
            print(f"❌ Error processing {city}. Details: Coordinate element did not appear in time.")

    driver.quit()
    print("\nScraping finished. Checking results...")
    return df_result


city_data_with_coords = scrape_city_coordinates_robust(city_data_coords)
city_data_with_coords.to_csv('city_data_with_coords.csv', index=False)
print(f"Total cities with coordinates found: {city_data_with_coords['Latitude'].count()}")



map_df = city_data_with_coords.dropna(subset=['Latitude', 'Longitude'])

if not map_df.empty:
    print("\nGenerating interactive map...")
    fig_map = px.scatter_mapbox(
        map_df,
        lat="Latitude",
        lon="Longitude",
        hover_name="City",
        
        hover_data={
            "Latitude": False, "Longitude": False,
            "Country": True,
            "Population": ':,.0f', 
            "Average Monthly Salary": ':,.0f €', 
            "Average Cost of Living": ':,.0f €'  
        },
        color="Average Monthly Salary", 
        size="Population",              
        color_continuous_scale="Viridis",
        zoom=3, 
        height=700,
        title="Interactive Map of European Cities"
    )

    fig_map.update_layout(mapbox_style="open-street-map")
    fig_map.update_layout(margin={"r":0,"t":40,"l":0,"b":0})
    fig_map.show()
else:
    print("Map generation skipped: insufficient coordinate data.")


Warning: webdriver-manager is not available. Make sure ChromeDriver is on your PATH.
Starting the scraping process. A forced wait is used to make sure each page is read.
❌ Error processing Salzburg. Details: Coordinate element did not appear in time.
✅ Success: Vienna
✅ Success: Antwerp
✅ Success: Bruges
✅ Success: Brussels
✅ Success: Gent
✅ Success: Dobrich
✅ Success: Sofia
✅ Success: Split
✅ Success: Zagreb
❌ Error processing Lefkosia. Details: Coordinate element did not appear in time.
❌ Error processing Lemesos. Details: Coordinate element did not appear in time.
❌ Error processing Ostrava. Details: Coordinate element did not appear in time.
✅ Success: Prague
✅ Success: Copenhagen
✅ Success: Odense
✅ Success: Tallinn
✅ Success: Helsinki
✅ Success: Tampere
✅ Success: Lyon
✅ Success: Paris
✅ Success: Toulouse
✅ Success: Berlin
✅ Success: Cologne
✅ Success: Dresden
✅ Success: Dusseldorf
✅ Success: Frankfurt am Main
✅ Success: Hamburg
✅ Success: Ha


*scatter_mapbox* is deprecated! Use *scatter_map* instead. Learn more at: https://plotly.com/python/mapbox-to-maplibre/
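The scraping cell above reads the text of the `geo` span, which holds the coordinates as `latitude; longitude` in decimal degrees, and splits it on the semicolon. The sketch below isolates that parsing step so it can be checked without a browser. The function name `parse_geo_text` is hypothetical, and the range check is an added safeguard rather than something the original cell performs.

```python
def parse_geo_text(text):
    """Parse 'lat; lon' text into a (lat, lon) tuple of floats, or None."""
    parts = text.split(";")
    if len(parts) != 2:
        return None  # not the expected "lat; lon" layout
    try:
        lat = float(parts[0].strip())
        lon = float(parts[1].strip())
    except ValueError:
        return None  # one of the parts is not a plain decimal number
    # Reject values outside the valid range for geographic coordinates.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lat, lon


print(parse_geo_text("48.20833; 16.37250"))  # (48.20833, 16.3725)
```

Returning `None` for anything unexpected lets the caller leave `Latitude` and `Longitude` as `NaN`, so the row is later excluded by `dropna` instead of raising in the middle of the loop.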

