In [None]:
# run this command on databricks first, not necessary locally
!pip install scrapy

# Part 1 - Parsing Hikes

In the first part of the assignment, you need to extract the relevant attributes from the web pages scraped from hikr.org. Extend the `parse` function so that it extracts all the attributes you need to create the ranking. You may define your own helper functions and extend the `parse` function as necessary. Just keep in mind that the argument and result types must not change, so that the function can still be used in the second part of the assignment.

## Chosen Features
The following features are extracted from the hikr.org tour pages:

### 1. Region (`region`)
A string representing the tour's geographical region, as a breadcrumb path (e.g., "World » Italy » Lombardy"). <br>
**Cleaning:** Individual parts are whitespace-trimmed and joined by " » ". It will be `None` if not found or if all parts are empty.

### 2. Tour Date (`tour_date`)
The date of the tour, formatted as a "dd.mm.yyyy" string (e.g., "05.06.2010"). <br>
**Cleaning:** The day and month are zero-padded. It will be `None` if the original date string is not found, is in an unexpected format, or contains an unrecognized month name.
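
The date rule above can be tried out in isolation. The following is a minimal sketch, assuming the raw value looks like "5 Juni 2010" (day, German month name, year, separated by whitespace); the helper name `normalize_german_date` is hypothetical and not part of the assignment template.

```python
# German month names, assumed to match the spelling used on the tour pages.
GERMAN_MONTHS = {
    'Januar': 1, 'Februar': 2, 'März': 3, 'April': 4, 'Mai': 5, 'Juni': 6,
    'Juli': 7, 'August': 8, 'September': 9, 'Oktober': 10, 'November': 11, 'Dezember': 12,
}

def normalize_german_date(raw):
    """Return 'dd.mm.yyyy' for input such as '5 Juni 2010', or None if it cannot be parsed."""
    if not raw:
        return None
    parts = raw.split()
    if len(parts) != 3:
        return None
    day_str, month_name, year_str = parts
    month = GERMAN_MONTHS.get(month_name)
    if month is None:
        return None
    try:
        day, year = int(day_str), int(year_str)
    except ValueError:
        # Non-numeric day or year counts as an unexpected format.
        return None
    return f"{day:02d}.{month:02d}.{year}"

print(normalize_german_date("5 Juni 2010"))
print(normalize_german_date("5 Smarch 2010"))
```

Returning `None` for every failure mode keeps the field easy to filter on later, at the cost of not distinguishing a missing date from a malformed one.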

### 3. Descent in Meters (`descent_meters`)
The total descent of the tour in meters. <br>
**Cleaning:** The value is converted to lowercase, the "m" unit is removed, and leading/trailing whitespace is stripped (e.g., "600"). The result is kept as a string. It will be `None` if not found.

### 4. Ascent in Meters (`ascent_meters`)
The total ascent of the tour in meters. <br>
**Cleaning:** The value is converted to lowercase, the "m" unit is removed, and leading/trailing whitespace is stripped (e.g., "600"). The result is kept as a string. It will be `None` if not found.
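
Because the ascent and descent values stay strings after cleaning, any numeric aggregation needs a conversion step first. The sketch below shows one possible approach; the helper name `meters_to_int` is hypothetical, and the handling of apostrophes and spaces as thousands separators is an assumption about how values might appear, not something confirmed by the dataset.

```python
def meters_to_int(value):
    """Convert a cleaned string such as '600' to an int, or return None if it is not a whole number."""
    if value is None:
        return None
    # Assumption: apostrophes and spaces may be used as thousands separators.
    cleaned = value.replace("'", "").replace(" ", "")
    return int(cleaned) if cleaned.isdecimal() else None

print(meters_to_int("600"))
print(meters_to_int("ca. 600"))
```

Values that are not plain whole numbers are mapped to `None` rather than guessed at, so they can be counted separately in the data quality analysis.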

In [1]:
import scrapy
from scrapy.selector import Selector

# Parses a hikr.org tour and extracts all the attributes we are interested in.
# Parameters:
#   tour: HTML Content of the hikr.org tour.
# Result:
#   A dictionary containing the extracted attributes for this tour.
def parse(tour):
    # tour_id is the filename, text is the file content
    # (named tour_id to avoid shadowing the built-in id())
    tour_id, text = tour

    # Parse it using scrapy
    document = Selector(text=text)

    name_raw = document.css('h1.title::text').get()
    # Clean: remove leading/trailing whitespace. If not found, it remains None.
    name = name_raw.strip() if name_raw else None

    # 1. Region
    region_xpath = '//tr[td[@class="fiche_rando_b" and contains(normalize-space(.), "Region:")]]/td[@class="fiche_rando"]//a/text()'
    region_parts_raw = document.xpath(region_xpath).getall()
    # Clean: Strip whitespace from each part, filter out any empty strings, then join with " » ".
    # If no parts are found, region remains None.
    region = None
    if region_parts_raw:
        cleaned_parts = [part.strip() for part in region_parts_raw if part.strip()]
        if cleaned_parts:
            region = ' » '.join(cleaned_parts)

    # 2. Tour Date
    tour_date_xpath = '//tr[td[@class="fiche_rando_b" and contains(normalize-space(.), "Tour Datum:")]]/td[@class="fiche_rando"]/text()'
    tour_date_raw_str = document.xpath(tour_date_xpath).get()

    tour_date = None
    if tour_date_raw_str:
        # Clean: remove leading/trailing whitespace.
        cleaned_date_str = tour_date_raw_str.strip()
        # German month names to month number mapping
        german_months = {
            'Januar': 1, 'Februar': 2, 'März': 3, 'April': 4, 'Mai': 5, 'Juni': 6,
            'Juli': 7, 'August': 8, 'September': 9, 'Oktober': 10, 'November': 11, 'Dezember': 12
        }
        parts = cleaned_date_str.split()
        if len(parts) == 3:
            month_number = german_months.get(parts[1])
            try:
                day = int(parts[0])
                year = int(parts[2])
            except ValueError:
                # Non-numeric day or year: treat as an unexpected format.
                day = year = None

            if month_number and day is not None:
                tour_date = f"{day:02d}.{month_number:02d}.{year}"

    # 3. Descent
    descent_xpath = '//tr[td[@class="fiche_rando_b" and contains(normalize-space(.), "Abstieg:")]]/td[@class="fiche_rando"]/text()'
    descent_raw_str = document.xpath(descent_xpath).get()
    # Clean: convert to lowercase, remove "m", and strip whitespace.
    descent_meters = None
    if descent_raw_str:
        descent_meters = descent_raw_str.lower().replace('m', '').strip()

    # 4. Ascent
    ascent_xpath = '//tr[td[@class="fiche_rando_b" and contains(normalize-space(.), "Aufstieg:")]]/td[@class="fiche_rando"]/text()'
    ascent_raw = document.xpath(ascent_xpath).get()
    # Clean: convert to lowercase, remove "m", and strip whitespace.
    ascent_meters = None
    if ascent_raw:
        ascent_meters = ascent_raw.lower().replace('m', '').strip()

    # Assemble the result dictionary
    result = {
        'name': name,
        'region': region,
        'tour_date': tour_date,
        'descent_meters': descent_meters,
        'ascent_meters': ascent_meters
    }

    return result

In [2]:
# Extract the 200posts.zip file in the same folder where this jupyter notebook is located.
# Then you can run the parse function on an example tour:
with open('200posts/post24013.html', 'r', encoding='utf-8') as f:
    content = f.read()
    r = parse([f.name, content])
    print(r)


FileNotFoundError: [Errno 2] No such file or directory: '200posts/post24013.html'

# Part 2 - Parallelization & Aggregation (Spark)

It is highly recommended to hold off on this part until after the Spark lecture!

This part only works on databricks!

Warning: In the community edition, databricks terminates your cluster after 2 hours of inactivity. If you re-create the cluster, you will lose your data.

Installing a library such as scrapy with the command above might not always work. Should you run into problems, you can alternatively do the following:

- Go to the "Clusters" panel on the left
- Select your cluster
- Go to the "Libraries" tab
- Click "Install New"
- Choose "PyPI" as library source
- Type the name of the library, "scrapy", into the package field
- Click "Install"
- Wait until the installation has finished

You can now use the newly installed library in your code.

In [None]:
# AWS Access configuration
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "AKIAYFVAOB5OOWVMUSCZ")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "BddS/X8w8qXdBkkqbzmO+5RgmfPRQuIT+wbUxrn2")

# Contains the whole hikr dataset.
# The full dataset contains 42330 tours and has a size of around 3 GB. Use this dataset for your final results if possible.
# Execution is likely to take around 20 to 30 minutes.
# tours = sc.wholeTextFiles("s3a://dawr-hikr3/hikr/*.html")

# There are 8176 posts starting with "post10*", which is a nicer size for smaller experiments. (~ 5 minutes to process)
# tours = sc.wholeTextFiles("s3a://dawr-hikr3/hikr/post10*.html")

# If you want to further shrink the dataset size for testing, you can add another zero (or more) to the pattern (post100*.html).
tours = sc.wholeTextFiles("s3a://dawr-hikr3/hikr/post100*.html")

In [None]:
# Apply our parse function and persist the parse results so that all further steps can be repeated more easily
import pyspark
parsedTours = tours.map(parse).persist(pyspark.StorageLevel.MEMORY_AND_DISK)

In [None]:
# Actually force evaluation of the parsedTours RDD. Above it was only defined, but not evaluated. This will take a while.
parsedTours.count()

In [None]:
# TODO
# Add your code here. Note that executing this cell and any below can reuse the results from "parsedTours".

# Example - let's just collect everything
parsedTours.collect()

## Part 2 Final ranking
List your final top 10 mountain peaks that occur the most often within your filtered tours. State how you handle cases where two peaks occur the same number of times.
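
One way to make the tie handling explicit is to sort by count first and by peak name second, so the ranking is deterministic. The following is a minimal sketch in plain Python on already collected results; the `peaks` key is hypothetical (the `parse` function above does not extract it yet), and counting each peak at most once per tour is an assumption.

```python
from collections import Counter

def top_peaks(tours, n=10):
    """Return the n most frequent peaks as (peak, count) pairs, ties broken alphabetically."""
    counts = Counter()
    for tour in tours:
        # 'peaks' is a hypothetical key; set() counts each peak once per tour.
        counts.update(set(tour.get('peaks') or []))
    # Sort by descending count, then by ascending name to resolve ties.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]

sample = [
    {'peaks': ['Säntis', 'Altmann']},
    {'peaks': ['Säntis']},
    {'peaks': ['Pilatus', 'Altmann']},
    {'peaks': None},
]
ranking = top_peaks(sample, n=2)
print(ranking)
```

On the full dataset the same ordering could be computed on the RDD itself, for example with `flatMap`, `reduceByKey`, and `takeOrdered` using the same sort key, which avoids collecting all tours to the driver.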

# Part 3 - Analysis of data quality
Add further code for analysis of data quality here. Don't forget to include at least one aggregation, such as average tour length per season.
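
As an illustration of such an aggregation, the sketch below averages the ascent per season on already collected results. It uses `ascent_meters` in place of tour length, since tour length is not among the attributes extracted above. The function name, the season mapping (meteorological seasons, northern hemisphere), and the decision to skip records with missing or non-numeric values are all assumptions.

```python
from collections import defaultdict

# Meteorological seasons (northern hemisphere), keyed by month number.
SEASON_BY_MONTH = {
    12: 'winter', 1: 'winter', 2: 'winter',
    3: 'spring', 4: 'spring', 5: 'spring',
    6: 'summer', 7: 'summer', 8: 'summer',
    9: 'autumn', 10: 'autumn', 11: 'autumn',
}

def average_ascent_per_season(tours):
    """Return {season: average ascent in meters} for tours with a usable date and ascent."""
    totals = defaultdict(lambda: [0, 0])  # season -> [sum of ascent, number of tours]
    for tour in tours:
        date, ascent = tour.get('tour_date'), tour.get('ascent_meters')
        # Skip records with a missing date or an ascent that is not a whole number.
        if not date or not ascent or not ascent.isdecimal():
            continue
        month = int(date.split('.')[1])  # tour_date is formatted as dd.mm.yyyy
        season = SEASON_BY_MONTH[month]
        totals[season][0] += int(ascent)
        totals[season][1] += 1
    return {season: total / count for season, (total, count) in totals.items()}

sample = [
    {'tour_date': '05.06.2010', 'ascent_meters': '600'},
    {'tour_date': '15.07.2011', 'ascent_meters': '1000'},
    {'tour_date': '01.01.2012', 'ascent_meters': '400'},
    {'tour_date': None, 'ascent_meters': '500'},
    {'tour_date': '02.02.2012', 'ascent_meters': 'ca. 300'},
]
averages = average_ascent_per_season(sample)
print(averages)
```

Counting how many records are skipped for each reason would itself be a useful data quality metric.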