# Wandrer scraper

This code uses my account at [wandrer.earth] to identify regions where I have traveled, in order to optimize game points for distance traveled within specific regions. The code is intended to do the following:


1. Differentiate between foot, bike, or total achievement
2. Identify regions where I have the least distance to a milestone (25%, 50%, 75%, 90%, or 99% completion)
3. Report a map with locations on transit closest to the given distance
    1. Show a map with a 2.5 km and 5 km diameter around my common hubs
        1. Kids' activities
        2. Transit stops
    2. Overlay map with the key areas closest to completion
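
The hub check in step 3 is not implemented yet. One way it could be sketched is with the haversine formula, testing whether a region's midpoint falls within a set distance of a hub. The helper names, the placeholder hub coordinate, and the choice to treat 2.5 km as a radius are all assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within_radius(hub, point, radius_km):
    """True if `point` lies within `radius_km` of `hub`; both are (lon, lat) tuples."""
    return haversine_km(*hub, *point) <= radius_km

# Hypothetical hub (a placeholder coordinate, not one of my real hubs)
hub = (-123.10, 49.25)
print(within_radius(hub, (-123.12, 49.26), 2.5))
```

The (lon, lat) ordering matches the coordinate order used later in this notebook.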

In [1]:
#imports used:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from collections import Counter
import json
import re
import geopandas as gpd
from shapely.geometry import Point
import matplotlib.pyplot as plt

#not yet used, but imported:
from urllib.request import urlopen
import mechanicalsoup
import time
from requests.auth import HTTPBasicAuth

`payload.py` contains my login (secret), password (secret), login_url (https://wandrer.earth/signin), and dashboard_url (secret)

The code below opens and reads my payload file and initiates a session in my account at wandrer.earth
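
As an alternative to keeping secrets in a Python file, the same four values could be read from environment variables. This is only a sketch: `load_credentials` and the `WANDRER_*` variable names are hypothetical, and only the sign-in URL comes from the description above.

```python
import os

def load_credentials(env=os.environ):
    """Collect login settings from a mapping of environment variables.

    The variable names are hypothetical; any names would work.
    """
    required = ["WANDRER_EMAIL", "WANDRER_PASSWORD", "WANDRER_DASHBOARD_URL"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "login": env["WANDRER_EMAIL"],
        "password": env["WANDRER_PASSWORD"],
        "login_url": "https://wandrer.earth/signin",  # from the payload description
        "dashboard_url": env["WANDRER_DASHBOARD_URL"],
    }
```

Passing a plain dict instead of `os.environ` makes the function easy to try out without touching real credentials.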

In [2]:
#my specific credentials for login (importing the names avoids running exec on the file contents)
from payload import login, password, login_url, dashboard_url

In [3]:
#create a session that stays open for the later dashboard and badge requests
s = requests.Session()

#use a GET request to find the authenticity token within the login page
req = s.get(login_url).text
html = BeautifulSoup(req, "html.parser")
token = html.find("input", {"name": "authenticity_token"}).attrs["value"]

#combine the scraped authenticity token with my credentials (from the payload file) to access my account
payload = {
    "authenticity_token": token,
    'athlete[email]': login,
    'athlete[password]': password,
}

#response after posting my data to the login url
res = s.post(login_url, data=payload)

#report outcome of login activity (requests follows redirects, so res.url is the final page reached)
if res.ok:
    redirected_url = res.url
    print("Login successful", res.status_code, redirected_url)
else:
    print("Login failed")

Login successful 200 https://wandrer.earth/dashboard


A `200` code together with the redirect to `/dashboard` indicates a successful login. Next, I scrape the dashboard to find regions, distances, and percent completion.
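
Because `requests` follows redirects by default, the final status code is usually `200` whether or not the login worked, so the final URL is the more telling signal. The sketch below assumes a failed login ends up back on the `/signin` path; that behaviour and the `login_succeeded` helper are assumptions, not something confirmed from the site.

```python
from urllib.parse import urlparse

def login_succeeded(status_code, final_url, signin_path="/signin"):
    """Heuristic: a successful response that did not end up back on the sign-in page."""
    path = urlparse(final_url).path.rstrip("/")
    return 200 <= status_code < 400 and path != signin_path

print(login_succeeded(200, "https://wandrer.earth/dashboard"))  # True
print(login_succeeded(200, "https://wandrer.earth/signin"))     # False
```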

In [None]:
#retrieve the contents of the dashboard site
r = s.get(dashboard_url) 
soup = BeautifulSoup(r.content, "html.parser")
print(dashboard_url, soup)

* All code above is used to locate, open, and read the wandrer.earth data. This must be executed properly before moving forward.
* `soup` is now a variable containing the html content of the site `dashboard_url`

Next, I build lists to hold each region's GB code, which is used to open that region's achievement page and retrieve its foot, bike, and total distances. Additional lists hold the region name, the GPS coordinates of its location, and its progress.

In [None]:
# Initialize lists to store data
region_gb_list = []
region_name_list = []
region_gps_list = []
region_progress_list = []

# Set counters to track missing values
no_name = 0
no_gb = 0
no_gps = 0
no_progress = 0

# Find all elements with the class "achievement_row"
achievement_rows = soup.find_all(class_="achievement_row")

#iterate through each achievement row
for achievement_row in achievement_rows:
    #extract location text
    #if section has name, use that, then check missing name, otherwise 'N/A'
    
    present_name_element = achievement_row.select_one('.achievement_name')
    missing_name_element = achievement_row.select_one('.missing_achievement_name')

    if present_name_element:
        achievement_name = present_name_element.text
        
    elif missing_name_element:
        achievement_name = missing_name_element.text
    
    else:
        achievement_name = 'N/A'
        no_name = no_name + 1
    
    region_name_list.append(achievement_name)
    
    # Extract "gb ID" value (look up the toggle element once and reuse it)
    geom_toggle = achievement_row.select_one('.geom_toggle')
    data_gb_value = geom_toggle['data-gb'] if geom_toggle else "N/A"
    if data_gb_value == "N/A":
        no_gb += 1
    region_gb_list.append(data_gb_value)

    # Extract coordinates (json.loads avoids running eval on scraped text; assumes the attribute holds JSON)
    coordinates = json.loads(geom_toggle['data-diagonal'])['coordinates'] if geom_toggle else "N/A"
    if coordinates == "N/A":
        no_gps += 1
    region_gps_list.append(coordinates)
    
    # Extract progress
    progress_element = achievement_row.select_one('.progress-bar')
    progress = progress_element['style'].split(':')[1].strip('%;') if progress_element else "N/A"
    if progress == "N/A":
        no_progress += 1
    region_progress_list.append(progress)

# Show the number of missing values vs. found values
print(f"GB IDs\n Missing: {no_gb}, Present: {len(region_gb_list) - no_gb}")
print(f"Names\n Missing: {no_name}, Present: {len(region_name_list) - no_name}")
print(f"Coordinates\n Missing: {no_gps}, Present: {len(region_gps_list) - no_gps}")
print(f"Foot Progress\n Missing: {no_progress}, Present: {len(region_progress_list) - no_progress}")


There are 173 valid names, but only 162 valid regions (rows with a GB code). The extra names likely belong to non-location achievements, which will be dropped from the subsequent df.

Next, I'll scrape the sub-page for each GB value to find the total distance available, by appending the GB value to a base url for the geometry badges and retrieving the result. To do this, I use `region_gb_list` to create a new list, `num_region_gb_list`, containing only the numeric GB codes (the regions that have location geometry).
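
A minimal sketch of that filtering and URL-building step, assuming every valid GB code is a string of digits. The sample codes are made up and `badge_urls` is a hypothetical helper.

```python
BASE_URL = "https://wandrer.earth/geometry_badges/"

def badge_urls(gb_codes):
    """Keep only purely numeric GB codes and turn each into a badge URL."""
    return [BASE_URL + code for code in gb_codes if str(code).isdigit()]

# Made-up codes for illustration
print(badge_urls(["1234", "N/A", "5678"]))
```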

In [None]:
#define a list for only numeric gb entries
num_region_gb_list = [x for x in region_gb_list if 'N/A' not in x]

The code in the next cell searches through each item in the `num_region_gb_list`, plugs it into the base url, gets the soup and parses, looking for that region's distance on foot, bike, and total. Because some regions only have total distance, I account for that in the total distance search.

In [None]:
#prepare lists for data on distances:
#(by including the code in this block, it will prevent overloading redundant data in these lists if the code is re-run)
region_foot_distance_list = []
region_bike_distance_list = []
region_total_distance_list = []

#convert a span found in the html to TEXT and strip excess characters ('NA' if the span is missing)
text = lambda f: f.get_text().strip().replace(',', '') if f else 'NA'

#look for the summary pages of each of the regions represented by the gb codes, by scraping their individual URLs.
for i in num_region_gb_list:
    gb_url = 'https://wandrer.earth/geometry_badges/' + i
    r = s.get(gb_url)
    sub_soup = BeautifulSoup(r.content, "html.parser")

    #retrieve the total distance on bike, foot, and total.
    #some gb areas do not have breakdown by bike and foot, assume the total is equal to the distance on bike or foot.

    #foot distance (the `t and` guard skips spans whose string is None)
    foot_span = sub_soup.find('span', string=lambda t: t and 'foot' in t)

    #bike distance
    bike_span = sub_soup.find('span', string=lambda t: t and 'bike' in t)

    #total distance
    #this is an interesting case, because when bike and run distances are not presented, the string is different.
    #look first for the case with all 3 listed, and otherwise find the single total value.
    total_span = sub_soup.find('span', string=lambda t: t and 'total length' in t)
    if total_span is None:
        total_span = sub_soup.find('span', string=lambda t: t and 'km. Worth' in t)

    #Append the distances to new lists
    region_foot_distance_list.append(text(foot_span))
    region_bike_distance_list.append(text(bike_span))
    region_total_distance_list.append(text(total_span))

    print(f'{gb_url} \n    {text(foot_span)},\n    {text(bike_span)},\n    {text(total_span)}')

---

### Checkpoint 1

* I've retrieved key information about achievements from the dashboard!
* I've also retrieved my foot progress percentages!

I want to make a dataframe from this information, but every column needs the same number of rows. I will check the length of each list I want to include to determine whether they can be combined into a df.

In [None]:
region_lists = [region_gb_list, region_name_list, region_gps_list, region_foot_distance_list, region_bike_distance_list, region_total_distance_list, region_progress_list]
list_names = ["region_gb_list", "region_name_list", "region_gps_list", "region_foot_distance_list", "region_bike_distance_list", "region_total_distance_list", "region_progress_list"]

for name, region in zip(list_names, region_lists):
    print(f'{len(region)} :*: {name}')

There are 175 region entries but only 162 distance entries. I will inspect the data to confirm that the additional 13 rows come from time achievements rather than geographic achievements. First, the region information will be combined into a preliminary df.

In [None]:
data = {'GB code': region_gb_list, 'Locations': region_name_list, 'Coordinates': region_gps_list, 'Foot Progress (%)': region_progress_list}
df = pd.DataFrame(data)
df.sample(20)

In [None]:
time_achievements = df[df['GB code'] == 'N/A']
time_achievements

I confirmed that the rows without a GB code are the month achievements, so I will drop these rows and then merge in the GB distances.

In [None]:
#drop the time achievement rows and reset the index so later index-aligned assignments line up
df = df[df['GB code'] != 'N/A'].reset_index(drop=True)
df.shape

Now the shape of the df matches that of the distance lists, which can be merged to a comprehensive df
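
One detail worth keeping in mind when merging after rows have been dropped: pandas assigns a plain list by position, but aligns a Series or DataFrame by index label. The toy frame below (made-up values, hypothetical `df_demo` name) shows the difference.

```python
import pandas as pd

df_demo = pd.DataFrame({"GB code": ["1", "N/A", "2"]})
df_demo = df_demo[df_demo["GB code"] != "N/A"].copy()  # index is now [0, 2]

# A plain list is assigned by position, so both rows get a value
df_demo["from_list"] = [10.0, 20.0]

# A Series with a fresh 0..n-1 index is aligned by label, so the row labelled 2 gets NaN
df_demo["from_series"] = pd.Series([10.0, 20.0])

print(df_demo)
```

Resetting the index after dropping rows, or passing `index=df.index` when building the new object, keeps label-aligned assignments in step with the existing rows.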

In [None]:
#merge the distance data into the preliminary df
df['Foot'] = region_foot_distance_list
df['Bike'] = region_bike_distance_list
df['Total'] = region_total_distance_list
df.shape, df.dtypes

In [None]:
#inspect the combined df
df.head()

The dataframe has been successfully created, the desired columns are present, and the information appears correct based on manual inspection of my account info online.

All of the data is object dtype. The following updates are needed:

Column | Change needed
-|-
<b>GB code</b> | change to `float` dtype
<b>Foot Progress (%)</b> | 1. extract number, 2. change to `float` dtype
<b>Locations</b> | none
<b>Coordinates</b> | 1. split into two coordinate pairs, 2. split each pair into longitude and latitude as `float` dtype, 3. calculate midpoint
<b>Foot</b> | split into new columns: distance, points. Trim text and change to `float` dtype
<b>Bike</b> | split into new columns: distance, points. Trim text and change to `float` dtype
<b>Total</b> | split into new columns: distance, points. Trim text and change to `float` dtype


In [None]:
#update the GB code to float
df['GB code'] = df['GB code'].astype('float64')
df.dtypes

In [None]:
df.shape

Because each Coordinates entry is a list of two [longitude, latitude] pairs, I will split it into c1 and c2, and then split each of these into longitude and latitude, saving the results as 4 new columns and dropping the Coordinates, c1, and c2 columns used for the transformation.
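
The same split can be sketched for a single region without pandas. This assumes the scraped attribute is a JSON string with a `coordinates` key holding two `[lon, lat]` pairs; the sample value is made up and `split_diagonal` is a hypothetical helper.

```python
import json

def split_diagonal(diagonal_json):
    """Return lon1, lat1, lon2, lat2 and the midpoint of a two-point diagonal."""
    (lon1, lat1), (lon2, lat2) = json.loads(diagonal_json)["coordinates"]
    midpoint = ((lon1 + lon2) / 2, (lat1 + lat2) / 2)
    return {"lon1": lon1, "lat1": lat1, "lon2": lon2, "lat2": lat2, "midpoint": midpoint}

# Made-up attribute value for illustration
sample = '{"type": "LineString", "coordinates": [[-123.2, 49.2], [-123.0, 49.3]]}'
print(split_diagonal(sample))
```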

In [None]:
#create the c1 and c2 columns for the two coordinate pairs (index=df.index keeps the rows aligned with df)
df[['c1', 'c2']] = pd.DataFrame(df['Coordinates'].tolist(), columns=['c1', 'c2'], index=df.index)
df[['c1', 'c2']]

In [None]:
#split c1 and c2 into longitude and latitude, then compile into the df, dropping the coordinates, c1, and c2 columns used for their generation
#(index=df.index keeps the new columns aligned with the existing rows)
df[['lon1', 'lat1']] = pd.DataFrame(df['c1'].tolist(), columns=['lon1', 'lat1'], index=df.index)
df[['lon2', 'lat2']] = pd.DataFrame(df['c2'].tolist(), columns=['lon2', 'lat2'], index=df.index)

df = df.drop(['Coordinates', 'c1', 'c2'], axis=1)
df

In [None]:
#check datatypes of df column values
df.info()

In [None]:
#since 'GB code' is float64 dtype, I can count nulls to ensure none were carried over from the time achievements in the original scrape
df['GB code'].isnull().sum()

---

### Checkpoint 2

* `df` contains 10 columns, containing key information on GB code, location, foot progress, and distances possible
* 5 columns are float while 5 are object dtype
* I need to:
    * extract the distance possible from the text, and convert to float for each max distance column
    * scrape [wandrer.earth] for percent completed in each of foot, bike, total (for official, rather than calculated value)
    * scrape [wandrer.earth] for distance completed in each of foot, bike, total.
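
The extraction step can be sketched on a single string first. The sample text below is made up to resemble the `km. Worth` wording searched for earlier, and the patterns also accept whole numbers, which is an assumption about what the site may return. `parse_distance_and_points` is a hypothetical helper.

```python
import re

DIST_RE = re.compile(r"(\d+(?:\.\d+)?)\s*km")
WORTH_RE = re.compile(r"(\d+(?:\.\d+)?)\s*total")

def parse_distance_and_points(text):
    """Return (distance_km, points) as floats, with None where a value is not found."""
    dist = DIST_RE.search(text or "")
    worth = WORTH_RE.search(text or "")
    return (
        float(dist.group(1)) if dist else None,
        float(worth.group(1)) if worth else None,
    )

print(parse_distance_and_points("1234.5 km. Worth 678.9 total points"))  # (1234.5, 678.9)
print(parse_distance_and_points("NA"))                                   # (None, None)
```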

In [None]:
# retrieve distance values, drop extra text
foot_bike_total = df[['Foot', 'Bike', 'Total']]  # Selecting distance columns

for column in foot_bike_total.columns:
    new_distance_values = []  # store distance values
    new_worth_values = []  # store worth 
    for region_text in foot_bike_total[column]:
        if pd.notnull(region_text):  # Check for non-null values
            # Use regular expression to find the distance and worth
            dist_match = re.search(r'(\d+\.\d+)\s*km', str(region_text))
            worth_match = re.search(r'(\d+\.\d+)\s*total', str(region_text))

            if dist_match:
                distance_value = dist_match.group(1)
                # a distance string is not guaranteed to list a worth, so guard against a missing match
                worth_value = worth_match.group(1) if worth_match else "N/A"
                # append found distance to a new list for distance value
                new_distance_values.append(distance_value)
                # append found worth to a new list for worth value
                new_worth_values.append(worth_value)
            else:
                # If no distance found, list as N/A
                new_distance_values.append("N/A")
                new_worth_values.append("N/A")
        else:
            # For null values, list as N/A
            new_distance_values.append("N/A")
            new_worth_values.append("N/A")

    # Update the DataFrame column with modified values
    df[column + " Distance (km)"] = new_distance_values
    df[column + " Points"] = new_worth_values

# Print the updated DataFrame
df.head()

In [None]:
# retrieve progress values, drop extra text 
new_percent_values = []  # store percent values

for i in df['Foot Progress (%)']:   # Selecting progress column
    if pd.notnull(i):  # Check for non-null values
        # Use regular expression to find the percent value (whole numbers and a missing % sign are allowed)
        match = re.search(r'(\d+(?:\.\d+)?)\s*%?', str(i))

        if match:
            # make the value a float
            percent = float(match.group(1))
            # append found percent to a new list for percent value
            new_percent_values.append(percent)
        else:
            new_percent_values.append(np.nan)  # no number found; NaN keeps the column numeric

    else:
        # For null values, list as NaN
        new_percent_values.append(np.nan)

# Update the DataFrame column with modified values
df['Foot Progress (%)'] = new_percent_values

# Print the updated DataFrame
df.head()

In [None]:
df.info()

Drop the Foot, Bike, and Total text columns, then convert the distance and worth columns to float dtype.

In [None]:
#drop the raw text columns now that distance and points have been extracted
df.drop(['Foot', 'Bike', 'Total'], axis=1, inplace=True)
print(df.shape)
df.info()

In [None]:
float_cols = ['Foot Distance (km)', 'Foot Points', 'Bike Distance (km)', 'Bike Points', 'Total Distance (km)', 'Total Points']

for col in float_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Convert the specified columns to float64
df[float_cols] = df[float_cols].astype('float64')

# Check the data types after conversion
df.info()

---

### Checkpoint 3

* `df` contains 13 columns, containing key information on GB code, location, distances, completed percent, and point value
* all columns except Locations are float dtype
* lat and lon values are separated.

Next, 

1. IF:
    * distance >=90%, find <u>the distance required to complete (>=99%) the region</u>
    * distance >=75%, find <u>the distance required to move to >=90%</u>
    * distance >=50%, find <u>the distance required to move to >=75%</u>
    * distance >=25%, find <u>the distance required to move to >=50%</u>
    * distance <25%, find <u>the distance required to move to >=25%</u>
2. find the distance of regions that haven't been started (0% complete and would move to 25%)
3. sort by shortest distance to completion for the region
4. calculate the center longitude and latitude from lon1+lon2 and lat1+lat2
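
The milestone lookup in step 1 can be sketched compactly with `bisect`, using the milestones listed at the top of this notebook (25%, 50%, 75%, 90%, 99%). `percent_to_next_milestone` is a hypothetical helper, shown as an alternative to a chain of `if`/`elif` checks.

```python
import math
from bisect import bisect_right

MILESTONES = [25.0, 50.0, 75.0, 90.0, 99.0]

def percent_to_next_milestone(progress):
    """Percentage points still needed to reach the next milestone (0.0 once at 99% or above)."""
    if math.isnan(progress):
        return math.nan  # unknown progress stays unknown
    i = bisect_right(MILESTONES, progress)  # index of the first milestone above progress
    if i == len(MILESTONES):
        return 0.0
    return MILESTONES[i] - progress

for p in (0.0, 25.0, 60.0, 92.5, 99.5):
    print(p, percent_to_next_milestone(p))
```

A region sitting exactly on a milestone is measured against the next one, so 25.0 maps to 25.0 points remaining (toward 50%).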

In [None]:
#work on a copy so that df itself is left unchanged
inprogress_df = df.copy()
inprogress_df

In [None]:
inprogress_df['Percent to Level'] = 0.0

for index, row in inprogress_df.iterrows():
    progress = row['Foot Progress (%)']
    if progress >= 99.0:
        pass  # already complete; 'Percent to Level' stays at 0.0
    elif progress >= 90.0:
        inprogress_df.at[index, 'Percent to Level'] = 99.0 - progress
    elif progress >= 75.0:
        inprogress_df.at[index, 'Percent to Level'] = 90.0 - progress
    elif progress >= 50.0:
        inprogress_df.at[index, 'Percent to Level'] = 75.0 - progress
    elif progress >= 25.0:
        inprogress_df.at[index, 'Percent to Level'] = 50.0 - progress
    else:
        inprogress_df.at[index, 'Percent to Level'] = 25.0 - progress
        
inprogress_df.sort_values(by='Percent to Level', ascending=True, inplace=True)
inprogress_df.head()

In [None]:
# Replace 'N/A' with NaN for numeric comparison
inprogress_df['Foot Distance (km)'] = pd.to_numeric(inprogress_df['Foot Distance (km)'], errors='coerce')

# Create conditions to identify 'N/A' values
condition_na = inprogress_df['Foot Distance (km)'].isnull()

# Calculate 'Distance to Level' based on conditions
inprogress_df['Distance to Level'] = np.where(
    condition_na,
    (inprogress_df['Percent to Level'] / 100) * inprogress_df['Total Distance (km)'],
    (inprogress_df['Percent to Level'] / 100) * inprogress_df['Foot Distance (km)']
)

# Show the modified DataFrame
inprogress_df

Point maximizer is calculated by dividing the foot points possible by the percent to the next level, which should result in larger values for the point maximizer when there are more points available for less distance.
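
A small worked example of the ratio, with a guard for regions that have no remaining percentage (where the division would otherwise be undefined). The region names and numbers are made up, and returning `None` for the undefined case is an assumption about how such rows should be treated.

```python
def point_maximizer(points, percent_to_level):
    """Points per percentage point still needed; None when the ratio is undefined."""
    if points is None or not percent_to_level or percent_to_level <= 0:
        return None
    return points / percent_to_level

# Made-up regions: (name, foot points, percent to next level)
regions = [("A", 120.0, 4.0), ("B", 300.0, 20.0), ("C", 50.0, 0.0)]
ranked = sorted(
    (r for r in regions if point_maximizer(r[1], r[2]) is not None),
    key=lambda r: point_maximizer(r[1], r[2]),
    reverse=True,
)
print([name for name, _, _ in ranked])  # ['A', 'B']
```

Region A ranks first because 120 points for 4 percentage points (30 per point) beats 300 points for 20 (15 per point).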

In [None]:
inprogress_df['point_maximizer'] =  inprogress_df['Foot Points'] / inprogress_df['Percent to Level']
inprogress_df

In [None]:
#search for initiated regions with leveling opportunities (progress below 99%, so 'Percent to Level' is never 0)
fastest_wins_df = inprogress_df[(inprogress_df['Foot Progress (%)'] < 99) & (inprogress_df['Foot Progress (%)'] > 0)]

# Sorting by 'point_maximizer', highest first

fastest_wins_df = fastest_wins_df.sort_values(by='point_maximizer', ascending=False)

# Rounding 'Distance to Level' and converting it to float with 3 decimal places
fastest_wins_df['Distance to Level'] = round((fastest_wins_df['Distance to Level']).astype(float), 3)
fastest_wins_df = fastest_wins_df[fastest_wins_df['Distance to Level'] < 100]

# Keeping the top 20 rows, then ordering them by 'Distance to Level'
fastest_wins_df = fastest_wins_df.head(20)
fastest_wins_df = fastest_wins_df.sort_values(by='Distance to Level', ascending=True)
fastest_wins_df

---

### Checkpoint 4

* `fastest_wins_df` shows the top 20 regions for the quickest leveling up, selected by maximizing points and limited to regions with less than 100 km to the leveling event

In [None]:
#find the average lat and lon values, which should represent a point at the middle of the region area
fastest_wins_df['Avg Longitude'] = ((fastest_wins_df['lon1'] + fastest_wins_df['lon2'])/2)
fastest_wins_df['Avg Latitude'] = ((fastest_wins_df['lat1'] + fastest_wins_df['lat2'])/2)
fastest_wins_df.info()

In [None]:
# Create a GeoDataFrame from latitude and longitude columns
geometry = [Point(xy) for xy in zip(fastest_wins_df['Avg Longitude'], fastest_wins_df['Avg Latitude'])]
gdf = gpd.GeoDataFrame(fastest_wins_df, geometry=geometry, crs='EPSG:4326')
gdf

In [None]:
%%time

# Rebuild the Point geometries from the average coordinates
gdf['geometry'] = gdf.apply(lambda row: Point(row['Avg Longitude'], row['Avg Latitude']), axis=1)

# Get a basemap of Metro Vancouver
metro_vancouver = gpd.read_file("Vancouver-local-area-boundary.geojson")  # GeoJSON file

# Project both GeoDataFrames to the same CRS (coordinate reference system)
gdf = gdf.to_crs(metro_vancouver.crs)

# Plot the GeoDataFrame with circles representing point_maximizer
fig, ax = plt.subplots(figsize=(10, 10))

# Plot the basemap
metro_vancouver.plot(ax=ax, alpha=0.5, color='gray', edgecolor='blue')

# Plot the points with color gradient based on 'Distance to Level' and size based on 'point_maximizer'
# lighter orange means shorter distance to level, and larger circle means higher point_maximizer
gdf.plot(ax=ax, column='Distance to Level', cmap='Oranges', markersize='point_maximizer', label="Locations", legend=True, edgecolor="blue")

# Annotate each point with its 'Locations' name (use the projected geometry so labels sit on the plotted points)
for idx, row in gdf.iterrows():
    ax.annotate(row['Locations'], (row.geometry.x, row.geometry.y), textcoords="offset points", xytext=(0,5), ha='center')

# Set the axis labels
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')

# Show the plot
plt.show()