### Goal: Generate a grid of circular coverage areas across the Czech Republic

This cell creates a grid of circles spanning the bounding box of the Czech Republic, each with a 5 km radius. Later cells run one business search per circle, so that together the searches cover the whole country.

On the target webpage, we can specify:
- the **code of the business** (e.g., industry or activity type),
- the **location** (latitude and longitude) around which we want to search,
- and the **radius of the search circle** (in kilometers).

The key techniques used in this cell include:
- **Geospatial grid generation** using latitude and longitude spacing
- **Use of `folium`** to visualize the coverage on an interactive map
- **Dynamic filename creation** for storing data associated with each circle
- **Looping and rounding** to ensure consistent grid points
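
Whether 5 km circles on this grid leave gaps depends on the spacing in kilometers rather than in degrees. The sketch below is a rough check of the 0.06° and 0.1° spacing used in the code cell that follows. It assumes a spherical Earth with about 111.32 km per degree of latitude, and the helper names `grid_cell_km` and `max_gap_km` are illustrative, not part of the notebook. The point farthest from any center is the middle of a grid cell, so the grid has no gaps when half the cell diagonal is at most the radius.

```python
import math

KM_PER_DEG_LAT = 111.32  # assumption: spherical-Earth approximation


def grid_cell_km(lat_deg, lat_spacing, lon_spacing):
    """Return (height_km, width_km) of one grid cell at the given latitude."""
    height = lat_spacing * KM_PER_DEG_LAT
    # A degree of longitude shrinks with the cosine of the latitude
    width = lon_spacing * KM_PER_DEG_LAT * math.cos(math.radians(lat_deg))
    return height, width


def max_gap_km(lat_deg, lat_spacing, lon_spacing):
    """Distance from the middle of a grid cell to the nearest circle center."""
    height, width = grid_cell_km(lat_deg, lat_spacing, lon_spacing)
    return math.hypot(height, width) / 2


radius_km = 5
for lat in (48.55, 51.06):  # southern and northern edge of the bounding box
    gap = max_gap_km(lat, lat_spacing=0.06, lon_spacing=0.1)
    print(f"lat {lat}: farthest point is {gap:.2f} km from a center "
          f"(covered: {gap <= radius_km})")
```

Under this approximation the farthest point stays just below 5 km at both edges, so neighboring circles overlap. The price of that overlap is that the same business can be returned by more than one search.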


In [8]:
import folium
import math
from geopy.distance import geodesic
import os

# Define bounding box for Czech Republic
min_lat, max_lat = 48.55, 51.06
min_lon, max_lon = 12.09, 18.86

# Define radius and spacing for coverage circles
radius_km = 5
lat_spacing = 0.06  # ~6.7 km between rows (1 degree of latitude is ~111 km)
lon_spacing = 0.1   # ~7.0-7.4 km between columns at 48.5-51 degrees north

# Initialize folium map centered on Czech Republic
map_cover_circles = folium.Map(location=[49.8175, 15.4730], zoom_start=7)

# Prepare list to store circle center coordinates and filenames
circle_data = []
data_dir = 'Center files'
os.makedirs(data_dir, exist_ok=True)  # Make sure the output folder exists before files are written

# Generate grid of circle centers
lat = min_lat
while lat <= max_lat:
    lon = min_lon
    while lon <= max_lon:
        lat_rounded = round(lat, 5)
        lon_rounded = round(lon, 5)
        center = (lat_rounded, lon_rounded)
        filename = f"center_{lat_rounded},{lon_rounded}.txt"
        
        # Store center and corresponding filename
        circle_data.append({
            "center": center,
            "filename": os.path.join(data_dir, filename)
        })
        lon += lon_spacing
    lat += lat_spacing

# Add circles to the folium map
for item in circle_data:
    folium.Circle(
        location=item["center"],
        radius=radius_km * 1000,  # Convert km to meters
        color='blue',
        fill=True,
        fill_opacity=0.2,
        popup=f"File: {item['filename']}"
    ).add_to(map_cover_circles)

# Print number of circles generated
print(len(circle_data))

# Display the map
map_cover_circles


2856


### Goal: Scrape business data from a Czech business registry website using circle-based geolocation

This cell defines a function that automates web scraping for Czech beauty businesses. It loops through a list of geographic circles (created earlier), and for each one:

- Constructs a query URL using the **business code**, **center coordinates**, and **search radius**
- Sends the request using **Selenium WebDriver**
- Handles potential **JavaScript alerts** that may interrupt scraping
- Displays progress in the notebook for transparency

The webpage allows us to specify:
- the **business code** (e.g., for cosmetic services),
- the **location** around which we want to search (latitude and longitude),
- and the **radius** of the search circle (in kilometers).
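
The site takes these parameters as semicolon-separated `key=value` segments appended to the path, not as a `?`-style query string. The sketch below isolates that URL construction. The parameter keys mirror those in the scraping function of this notebook, while `build_query_url` is a hypothetical helper name used only for illustration.

```python
def build_query_url(base_url, business_code, lat, lon, radius_km):
    """Append the search parameters to base_url as ';key=value' segments."""
    params = {
        "roleSubjektu": "P",            # role of the business entity
        "predmet": business_code,       # business code
        "kodPredmetu": business_code,   # same code under a second key
        "center": f"{lat},{lon}",       # search center as "lat,lon"
        "distance": str(radius_km),     # radius in kilometers
    }
    query = ";".join(f"{key}={value}" for key, value in params.items())
    return f"{base_url};{query}"


url = build_query_url(
    "https://rzp.gov.cz/verejne-udaje/en/udaje/vyber-zivnosti",
    "R11404", 51.01, 18.79, 5,
)
print(url)
```

Printing one URL and opening it manually is a cheap way to confirm the parameters before starting a long scraping run.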


In [9]:
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium.webdriver.common.alert import Alert
from selenium.common.exceptions import NoAlertPresentException, UnexpectedAlertPresentException
from IPython.display import clear_output

# Define scraping function
def scrape_circle_data(base_url, business_code, radius_km, circle_data, driver):
    extended_data = []  # Store metadata for each query

    # Loop through each circle (location and filename)
    for index, item in enumerate(circle_data):
        lat, lon = item["center"]
        filename = item["filename"]

        # Display progress
        print(f"Processing {index + 1} of {len(circle_data)}: {filename}")
        
        # Prepare query parameters
        params = {
            "roleSubjektu": "P",  # Role of business entity
            "predmet": business_code,  # Business code (e.g., for beauty services)
            "kodPredmetu": business_code,  # Duplicate key used by the site
            "center": f"{lat},{lon}",  # Coordinates of search center
            "distance": str(radius_km)  # Radius in kilometers
        }

        # Construct full query URL
        query_string = ";".join([f"{k}={v}" for k, v in params.items()])
        full_url = f"{base_url};{query_string}"
        print(f"Accessing: {full_url}")

        # Queue a clear: with wait=True the lines above stay visible
        # until the next iteration prints, then they are replaced
        clear_output(wait=True)
        
        # Load the page using Selenium
        driver.get(full_url)
        time.sleep(5)  # Wait for page to fully load

        # Attempt to dismiss any alert that may pop up
        try:
            Alert(driver).dismiss()
        except NoAlertPresentException:
            pass  # No alert present, continue

        # Try to parse and save the page content
        try:
            soup = BeautifulSoup(driver.page_source, "html.parser")
            with open(filename, "w", encoding="utf-8") as f:
                f.write(soup.prettify())  # Save formatted HTML to file
        except UnexpectedAlertPresentException:
            print(f"Alert blocked access for {filename}. Skipping.")
            continue  # Skip this circle if alert blocks access

        # Store metadata for this query
        extended_item = {
            "latitude": lat,
            "longitude": lon,
            "filename": filename,
            "full_url": full_url
        }
        extended_data.append(extended_item)

    # Close the browser session
    driver.quit()

    # Save metadata to Excel file
    df = pd.DataFrame(extended_data)
    df.to_excel("extended_circle_data.xlsx", index=False)

### Goal: Launch scraping session using Selenium and predefined circle grid

This cell initializes the Selenium WebDriver and launches the scraping process using the `scrape_circle_data` function. It targets a Czech government website that allows users to search for businesses by:

- **Business code** (e.g., R11404 for beauty services),
- **Geographic location** (latitude and longitude),
- and **Search radius** (in kilometers).

The scraping function will loop through a grid of predefined circle centers (`circle_data`) and:
- Construct a query URL for each location,
- Load the page using Selenium,
- Save the HTML content locally,
- Record metadata for each query.

Running this cell starts the full scrape: one page load for each circle center in `circle_data`.
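
Because the scraping function pauses five seconds after every page load, the size of the grid sets a lower bound on how long this cell runs. A back-of-the-envelope sketch, where `min_runtime` is an illustrative helper and the time spent loading and saving each page is ignored:

```python
from datetime import timedelta


def min_runtime(n_circles, wait_s=5.0):
    """Lower bound on scraping time: the fixed wait multiplied by the number of circles."""
    return timedelta(seconds=n_circles * wait_s)


# 2856 is the number of circles printed by the grid cell
print(min_runtime(2856))  # 3:58:00
```

The real run takes longer than this bound, so it may be worth testing the function on a small slice of `circle_data` first.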


In [10]:
from selenium import webdriver

# Initialize Selenium WebDriver (Chrome)
driver = webdriver.Chrome()

# Define base URL for Czech business registry search
base_url = "https://rzp.gov.cz/verejne-udaje/en/udaje/vyber-zivnosti"

# Define business code for beauty services (R11404)
business_code = "R11404"

# Define search radius in kilometers
radius_km = 5

# Launch scraping function using predefined circle grid
scrape_circle_data(base_url, business_code, radius_km, circle_data, driver)

Processing 2856 of 2856: Center files\center_51.01,18.79.txt
Accessing: https://rzp.gov.cz/verejne-udaje/en/udaje/vyber-zivnosti;roleSubjektu=P;predmet=R11404;kodPredmetu=R11404;center=51.01,18.79;distance=5


### Goal: Parse saved HTML files to extract business entity records

This cell defines a function that reads previously scraped HTML files from a specified directory (`data_dir`) and extracts structured business data. Each file corresponds to a geographic search circle and contains multiple business cards.

For each card, the function extracts:
- **Name** of the business
- **Registered office address**
- **Distance from the search center**
- **Legal type**, **role**, and **registration number (RN)**

Key techniques used:
- **HTML parsing** with `BeautifulSoup`
- **Dynamic tag manipulation** to ensure consistent label extraction
- **Data normalization** into a list of dictionaries
- **Progress display** using `clear_output` and `print`

The result is a clean `pandas` DataFrame containing all parsed records, ready for analysis or export.
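
The distance is kept as the raw text of the `<span>`, which in the saved pages looks like `(3.7 km)`. For numeric filtering or sorting it can help to convert that text to a float. The helper below is a minimal sketch: `parse_distance_km` is a hypothetical name, and accepting a decimal comma is an assumption rather than something observed in the data.

```python
import re

# Optional parentheses around a number followed by "km"
_DISTANCE_RE = re.compile(r"\(?\s*(\d+(?:[.,]\d+)?)\s*km\s*\)?")


def parse_distance_km(text):
    """Convert a string like '(3.7 km)' to 3.7; return None if it does not match."""
    if not text:
        return None
    match = _DISTANCE_RE.fullmatch(text.strip())
    if match is None:
        return None
    return float(match.group(1).replace(",", "."))


print(parse_distance_km("(3.7 km)"))  # 3.7
print(parse_distance_km(None))        # None
```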


In [11]:
import os
import pandas as pd
from bs4 import BeautifulSoup
from IPython.display import clear_output  # used for the progress display below

# Define function to parse entity records from saved HTML files
def parse_entity_records(data_dir):
    records = []  # List to store parsed records

    # Loop through each file in the directory
    for index, filename in enumerate(os.listdir(data_dir)):
        # Optional filter: uncomment to process only center files
        # if filename.startswith('center') and filename.endswith('.txt'):

        # Open and parse the HTML file
        with open(os.path.join(data_dir, filename), 'r', encoding='utf-8') as file:
            soup = BeautifulSoup(file.read(), 'html.parser')

            # Display progress
            clear_output(wait=True)
            print(index)
            
            # Loop through each business card on the page
            for card in soup.select('div.card.sheet'):
                record = {'source_file': filename}  # Track source file

                # Extract business name
                name_tag = card.select_one('h5')
                record['name'] = name_tag.get_text(strip=True) if name_tag else None

                # Extract registered office and distance
                address_dd = card.select_one("div[title='Corresponding entity address'] dd")
                if address_dd:
                    span = address_dd.find("span", title="Distance from the location")
                    distance = span.get_text(strip=True) if span else None
                    full_address = address_dd.get_text(strip=True)
                    if span:
                        full_address = full_address.replace(distance, '').strip()
                    record['registered_office'] = full_address
                    record['distance_from_the_location'] = distance
                else:
                    record['registered_office'] = None
                    record['distance_from_the_location'] = None

                # Initialize fields for legal type, role, and RN
                record['legal_type'] = None
                record['role'] = None
                record['RN'] = None

                # Build label-value map from <dt>/<dd> pairs
                label_map = {}
                for dl in card.select('dl.compact.tree'):
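                    # Prepend a 'Legal type' <dt> to every list so that zip() below
                    # pairs the first <dd> with a label (this assumes the first
                    # value has no <dt> of its own in the saved HTML)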
                    new_dt = soup.new_tag('dt')
                    new_dt.string = 'Legal type'
                    first_dt = dl.find('dt')
                    if first_dt:
                        first_dt.insert_before(new_dt)

                    # Extract all label-value pairs
                    dt_tags = dl.select('dt')
                    dd_tags = dl.select('dd')
                    for dt, dd in zip(dt_tags, dd_tags):
                        label = dt.get_text(strip=True)
                        value = dd.get_text(strip=True)
                        label_map[label] = value

                # Assign extracted values to record
                record['legal_type'] = label_map.get('Legal type')
                record['role'] = label_map.get('Role')
                record['RN'] = label_map.get('RN')

                # Add record to the list
                records.append(record)

    # Convert list of records to DataFrame
    return pd.DataFrame(records)

### Goal: Load and preview parsed business entity records

This cell runs the `parse_entity_records` function to extract structured data from previously saved HTML files in the `'Center files'` directory. These files contain business listings retrieved from the Czech government website.

The function parses each file to extract:
- **Business name**
- **Registered office address**
- **Distance from search center**
- **Legal type**, **role**, and **registration number (RN)**

The result is a `pandas` DataFrame containing all parsed records. The `.head()` method is used to preview the first few rows and verify the structure of the data.
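
Each record keeps its `source_file`, and the file names follow the `center_{lat},{lon}.txt` pattern created with the grid, so the search center can be recovered from the name alone. A minimal sketch, in which `center_from_filename` is a hypothetical helper:

```python
import re

# Matches names such as "center_48.55,14.29.txt"
_CENTER_RE = re.compile(r"center_(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)\.txt")


def center_from_filename(filename):
    """Return (lat, lon) encoded in 'center_<lat>,<lon>.txt', or None if the name differs."""
    match = _CENTER_RE.fullmatch(filename)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(center_from_filename("center_48.55,14.29.txt"))  # (48.55, 14.29)
```

Applied to the `source_file` column, for example with `df['source_file'].map(center_from_filename)`, this would attach the circle center to every parsed record.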


In [12]:
# Parse HTML files in 'Center files' directory and extract business records
df = parse_entity_records('Center files')

# Display the first few rows of the resulting DataFrame
df.head()

2855


| | source_file | name | registered_office | distance_from_the_location | legal_type | role | RN |
|---|---|---|---|---|---|---|---|
| 0 | center_48.55,14.29.txt | Van Dat Doan | Studánky 90, 382 73, Vyšší Brod | (3.7 km) | Natural person | entrepreneur | 5107121 |
| 1 | center_48.55,14.29.txt | VIPbeauty s.r.o. | Studánky 90, 382 73, Vyšší Brod | (3.7 km) | Legal person | entrepreneur | 11691085 |
| 2 | center_48.55,14.29.txt | Thi Huyen Bui | Studánky 90, 382 73, Vyšší Brod | (3.7 km) | Foreign natural person | entrepreneur | 21216851 |
| 3 | center_48.55,14.29.txt | Van Phuong Nguyen | Studánky 90, 382 73, Vyšší Brod | (3.7 km) | Natural person | entrepreneur | 28690508 |
| 4 | center_48.55,14.29.txt | PPD TRADE s.r.o. | Studánky 89, 382 73, Vyšší Brod | (3.7 km) | Legal person | entrepreneur | 3629805 |


In [13]:
# Save to Excel
df.to_excel('circles_data.xlsx', index=False)

### Goal: Extract unique business entities based on core identifying attributes

This cell filters the full dataset to retain only the columns that uniquely identify each business entity:
- **RN** (Registration Number)
- **Name**
- **Legal type**
- **Role**

It then removes duplicate entries to ensure each entity appears only once. This is useful for:
- Creating a clean list of distinct businesses
- Avoiding redundancy in analysis or reporting
- Preparing for grouping or merging operations

The `.drop_duplicates()` method ensures that only unique combinations of these four fields are retained.
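
Duplicates arise because neighboring search circles overlap, so one business can be returned by several searches. The sketch below shows the effect on made-up toy data (the file names, names, and numbers are placeholders, not records from the registry) and adds a quick check that each RN is left with a single row.

```python
import pandas as pd

# Toy data: "Business A" is found from two overlapping circles
df_demo = pd.DataFrame({
    "source_file": ["center_a.txt", "center_b.txt", "center_b.txt"],
    "RN": ["100", "100", "200"],
    "name": ["Business A", "Business A", "Business B"],
    "legal_type": ["Natural person", "Natural person", "Legal person"],
    "role": ["entrepreneur", "entrepreneur", "entrepreneur"],
})

# Keep only the identifying columns, then drop repeated combinations
unique_entities = df_demo[["RN", "name", "legal_type", "role"]].drop_duplicates()
print(len(df_demo), "rows ->", len(unique_entities), "unique entities")

# False here would mean some RN appears with a differing name, legal type, or role
print(unique_entities["RN"].is_unique)
```

Since uniqueness is judged on all four columns together, the `is_unique` check on `RN` is a useful guard against the same registration number surviving under two spellings of a name.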


In [14]:
df_grouped = df[['RN', 'name', 'legal_type', 'role']].drop_duplicates()
df_grouped.head()

| | RN | name | legal_type | role |
|---|---|---|---|---|
| 0 | 5107121 | Van Dat Doan | Natural person | entrepreneur |
| 1 | 11691085 | VIPbeauty s.r.o. | Legal person | entrepreneur |
| 2 | 21216851 | Thi Huyen Bui | Foreign natural person | entrepreneur |
| 3 | 28690508 | Van Phuong Nguyen | Natural person | entrepreneur |
| 4 | 3629805 | PPD TRADE s.r.o. | Legal person | entrepreneur |


In [15]:
df_grouped['legal_type'].value_counts()

legal_type
Natural person            17128
Legal person               1029
Foreign natural person      212
Foreign legal person          1
Name: count, dtype: int64