INFO 5002 - Fall Semester 2023

Group 2 - Team GeekySquad

Team Members:
Carli Arbon,
Dong Lu,
SangYun Han,
Wesley Tanoto

Final Project Submission

# Short Project Description:
Inspired by our struggle to find a highly rated travel destination for short school breaks, our team built a simple application that recommends the highly rated destination closest to the user's starting point. We use data scraped from the Travel Sentiment Index, which published the 100 best travel destinations based on millions of online reactions and surveys of actual travelers. We apply various class concepts and Python libraries to produce a simple, intuitive application that displays the recommended destination on an interactive map, with pins, a description of each place, and recommended things to do.

INSTRUCTION TO USE APPLICATION: Please note that this program runs somewhat slowly in Jupyter Notebook. For faster performance, please run it in Google Colab using a GPU. To use the application, first run all the code in PART 1 and in PART 2 up to STEP 4 of PART 2. In that step, you will be asked to enter a starting point; this can be a city or country of your choice, typed into the input box (please capitalize the first letter of the city or country name). Then run the rest of the code to complete the process. At the end, you should see a map with three pins: one for the starting point, one for the recommended destination, and one showing the distance between them.

# PART 1 - Web Scraping and Data Cleanup

In [1]:
# Step 1: We start by scraping a website with a credible data source on travel ratings. In this case, we use data obtained from the Sentiment Index website.
'''First, we import the Python packages we need to scrape this website. We opted for the setup we learned in lecture: BeautifulSoup, requests, and re.
We also import Pandas so that we can create a DataFrame to store all the data we use in tabular format.'''

import requests
import re
from bs4 import BeautifulSoup
import pandas as pd

'''This is where we begin scraping the data. We fetch the page at the Sentiment Index URL, then use BeautifulSoup to parse the HTML and extract specific information from it.'''
url = "https://www.sentiment-index.com/most-loved"
response = requests.get(url)
html_content = response.text

soup = BeautifulSoup(html_content, "html.parser")

'''Since all the information is embedded within div tags in the raw HTML, Professor Ram recommended that we convert the data points we want to use into a string. Here is how we did it:'''
# Find all div tags that contain the target structure
target_divs = soup.find_all("div", class_="Zc7IjY")

'''Converting the result into a string lets us clean up the data with simple string operations.'''
# convert the list of matched tags into a single str
target_divs_text = str(target_divs)

'''This is the step we took to further clean up the data.'''
# replace every HTML tag (anything inside <...>) with a space
text = re.sub(r'<.+?>', ' ', target_divs_text)

'''We then print the pre-cleaned data. This displays 99 of the 100 places we want to use. (We address the remaining place in the next step because it has a different HTML structure from the other 99 places cleaned up in this part.)'''
print(text)

[                100           Lexington          Kentucky, United States              23.38           Tourism   
   Sentiment   
   Score   ®            Restaurants &amp; Dining          TOP ASSET:         VISIT    
 
 
  
 
 
         ,                 97           San Diego          California, United States              23.54           Tourism   
   Sentiment   
   Score   ®            Beaches          TOP ASSET:         VISIT    
 
 
  
 
 
         ,                 94           Cook Islands           ​               23.84           Tourism   
   Sentiment   
   Score   ®            Beaches          TOP ASSET:         VISIT    
 
 
  
 
 
         ,                 91           Park City          Utah, United States              24.13           Tourism   
   Sentiment   
   Score   ®            Skiing &amp; Snowboarding          TOP ASSET:         VISIT    
 
 
  
 
 
         ,                 88           Newcastle Upon Tyne          England, United Kingdom              24.3   
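
The tag-stripping step can be tried on a small sample without fetching the page. The sketch below uses a made-up HTML fragment (the class name and values are copied from this notebook, but the markup itself is an assumption about the page's shape). It also uses `html.unescape` and `re.split`, which were not part of our original code, to decode entities such as `&amp;` and to split on the gaps left behind by the removed tags. The helper name `strip_tags` is hypothetical.

```python
import html
import re

# Hypothetical fragment shaped like one scraped entry; the real page markup may differ.
sample = (
    '<div class="Zc7IjY"><span>100</span><h2>Lexington</h2>'
    '<p>Kentucky, United States</p><span>23.38</span>'
    '<p>Restaurants &amp; Dining</p></div>'
)

def strip_tags(raw: str) -> list[str]:
    """Replace each tag with a space, decode entities, and split into fields."""
    no_tags = re.sub(r'<.+?>', ' ', raw)   # same pattern as in the cell above
    decoded = html.unescape(no_tags)       # '&amp;' becomes '&'
    # adjacent tags leave runs of two or more spaces between the fields
    return [part.strip() for part in re.split(r'\s{2,}', decoded) if part.strip()]

fields = strip_tags(sample)
print(fields)
```

Splitting on runs of whitespace avoids the intermediate `|` delimiter, at the cost of relying on every field being wrapped in its own tag.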

In [2]:
# Step 2: Here we clean up the one place whose HTML formatting differs from the rest, standardizing it so that we can add it to the final data list.

'''First, we get the data for the one place with a unique HTML class: Maldives (Rank #1).'''
# get the data of top one place - Maldives
target_divs_one = soup.find_all("div", class_="tI1Qp6")

'''Next, we cleaned up the data to be able to extract only the information we need.'''
# convert into str
target_divs_one_text = str(target_divs_one)

text_one = re.sub('<.+?>',' ', target_divs_one_text)
text_one = text_one.replace('\n', '')
text_one = text_one.replace(' , ', ' / ')
text_one = text_one.replace('  ', '|')

# clean up the data
text_one = re.sub(r'TOP ASSET:', '', text_one)
text_one = re.sub(r'VISIT', '', text_one)
text_one = re.sub(r' Tourism', '', text_one)
text_one = re.sub(r'Sentiment', '', text_one)
text_one = re.sub(r'Score', '', text_one)
text_one = re.sub(r' ®', '', text_one)
text_one = re.sub(r'#', '', text_one)

text_one = text_one[1:50]
result_one = text_one.split('/')

# split on '|' and drop empty fragments; the last entry is the record we keep
for i in result_one:
    m = i.split('|')
    lst_one = [item.strip() for item in m if item.strip() != '']
# no state or country is captured for Maldives, so insert the same placeholder used for the other places
lst_one.insert(2, 'Unknown Country')

'''To confirm that we have the right information in the right format, we print it out.'''
print(lst_one)

['1', 'Maldives', 'Unknown Country', '40.26']


In [3]:
# Step 3: Next, we use an approach similar to Step 2 to clean up the data for the other 99 places.

'''We take the data for the other 99 places and clean it up using the same approach as in Step 2.'''
# get the data for the remaining 99 places

text = text.replace('&amp;', '&')
text = text.replace('\n', '')
text = text.replace(' , ', ' / ')
text = text.replace('  ', '|')

# delete redundant text that is not needed
text = re.sub(r'TOP ASSET:', '', text)
text = re.sub(r'VISIT', '', text)
text = re.sub(r' Tourism', '', text)
text = re.sub(r'Sentiment', '', text)
text = re.sub(r'Score', '', text)
text = re.sub(r' ®', '', text)

'''We print it out to make sure we only get the information that we need to proceed.'''
print(text)

[||||||||100||||| Lexington|||||Kentucky, United States|||||||23.38||||||||||||||||||Restaurants & Dining||||||||| ||||||||| /|||||||| 97||||| San Diego|||||California, United States|||||||23.54||||||||||||||||||Beaches||||||||| ||||||||| /|||||||| 94||||| Cook Islands||||| ​||||||| 23.84||||||||||||||||||Beaches||||||||| ||||||||| /|||||||| 91||||| Park City|||||Utah, United States|||||||24.13||||||||||||||||||Skiing & Snowboarding||||||||| ||||||||| /|||||||| 88||||| Newcastle Upon Tyne|||||England, United Kingdom|||||||24.3||||||||||||||||||Restaurants & Dining||||||||| ||||||||| /|||||||| 85||||| Brisbane|||||Queensland, Australia|||||||24.42||||||||||||||||||Restaurants & Dining||||||||| ||||||||| /|||||||| 82||||| Scenic Rim|||||Queensland, Australia|||||||24.56||||||||||||||||||Nature Photography||||||||| ||||||||| /|||||||| 79||||| Big Bear Lake|||||California, United States|||||||24.72||||||||||||||||||Nature Photography||||||||| ||||||||| /|||||||| 76||||| Grampians|||||Victo

In [4]:
# Step 4: This is the final cleanup phase where we convert data from all 100 places into a list of lists.

'''First, we remove all unnecessary elements.'''
# remove the square brackets at the beginning and end of the data
text = text[1:-1]

# split the string on '/' to get one entry per place
result = text.split('/')

# split each entry on '|' to build a list of lists
# replace \u200b (zero-width space) with 'Unknown Country'
lst = []
for i in result:
    m = i.split('|')
    #print(m)
    cleaned_data = [item.strip() for item in m if item.strip() != '']
    cleaned_data = ['Unknown Country' if '\u200b' in item else item for item in cleaned_data]
    #print(cleaned_data)
    lst.append(cleaned_data)

'''We then append the Maldives record from Step 2 to the other 99 records to get the final cleaned data.'''
lst.append(lst_one)

'''Finally, we print the final list of data that we will use to create a Pandas DataFrame.'''
print(lst)


[['100', 'Lexington', 'Kentucky, United States', '23.38', 'Restaurants & Dining'], ['97', 'San Diego', 'California, United States', '23.54', 'Beaches'], ['94', 'Cook Islands', 'Unknown Country', '23.84', 'Beaches'], ['91', 'Park City', 'Utah, United States', '24.13', 'Skiing & Snowboarding'], ['88', 'Newcastle Upon Tyne', 'England, United Kingdom', '24.3', 'Restaurants & Dining'], ['85', 'Brisbane', 'Queensland, Australia', '24.42', 'Restaurants & Dining'], ['82', 'Scenic Rim', 'Queensland, Australia', '24.56', 'Nature Photography'], ['79', 'Big Bear Lake', 'California, United States', '24.72', 'Nature Photography'], ['76', 'Grampians', 'Victoria, Australia', '24.95', 'Hiking & Rock Climbing'], ['73', 'Fernie', 'British Columbia, Canada', '25.25', 'Skiing & Snowboarding'], ['70', 'Dubrovnik', 'Croatia', '25.59', 'Beaches'], ['67', 'Durango', 'Colorado, United States', '25.77', 'Accommodation'], ['64', 'Port Stephens', 'New South Wales, Australia', '25.95', 'Beaches'], ['61', 'Antigua a
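
Because the Maldives record has only four fields while the others have five, a length check before building the DataFrame can make that gap explicit. The following is a minimal sketch rather than the code we ran: `parse_records` and its `expected_fields` parameter are hypothetical names, padding with `None` is an assumed policy, and the sample string is a shortened imitation of the text printed in Step 3.

```python
def parse_records(text: str, expected_fields: int = 5) -> list[list]:
    """Split '/'-separated entries and '|'-separated fields, padding short records."""
    records = []
    for entry in text.split('/'):
        fields = [item.strip() for item in entry.split('|') if item.strip() != '']
        if not fields:
            continue  # skip empty entries such as a trailing separator
        # a zero-width space stands in for a missing state/country
        fields = ['Unknown Country' if '\u200b' in item else item for item in fields]
        if len(fields) > expected_fields:
            raise ValueError(f"unexpected extra fields: {fields}")
        # pad short records with None so every row has the same length
        fields += [None] * (expected_fields - len(fields))
        records.append(fields)
    return records

# Shortened, hand-written sample in the same delimiter style as the Step 3 output.
sample = ("||100||| Lexington|||Kentucky, United States|||23.38|||Restaurants & Dining||| /"
          "||| 94||| Cook Islands||| \u200b||| 23.84|||Beaches||| /"
          "||1|||Maldives|||Unknown Country|||40.26|||")
rows = parse_records(sample)
for row in rows:
    print(row)
```

Raising on records that are too long, instead of silently truncating them, is a design choice that surfaces changes in the page layout early.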

# PART 2 - DataFrame Creation and Application Logic

In [5]:
# Step 1: Now that we have a clean dataset, we apply another concept we learned in class: creating a Pandas DataFrame.

'''First, we load the cleaned data into a new DataFrame assigned to the variable 'data', with five columns: Rank, City, "State, Country", Tourism Sentiment Score, and Asset. These columns label all the data.'''
# import cleaned data into Pandas
data = pd.DataFrame(data=lst, columns=["Rank","City","State, Country","Tourism Sentiment Score","Asset"])

'''In this step, we also make sure all data types are assigned correctly so that the program can run. We want Rank stored as an integer and Tourism Sentiment Score stored as a float, so that we do not have to convert them again later when we use the data.'''
# convert the string columns to integer and float
data['Rank'] = data['Rank'].astype(int)
data['Tourism Sentiment Score'] = data['Tourism Sentiment Score'].astype(float)
data

Unnamed: 0,Rank,City,"State, Country",Tourism Sentiment Score,Asset
0,100,Lexington,"Kentucky, United States",23.38,Restaurants & Dining
1,97,San Diego,"California, United States",23.54,Beaches
2,94,Cook Islands,Unknown Country,23.84,Beaches
3,91,Park City,"Utah, United States",24.13,Skiing & Snowboarding
4,88,Newcastle Upon Tyne,"England, United Kingdom",24.30,Restaurants & Dining
...,...,...,...,...,...
95,11,Perth and Kinross,"Scotland, United Kingdom",33.70,Accommodation
96,8,Ibiza,"Balearic Islands, Spain",34.16,Beaches
97,5,Yarra Valley,"Victoria, Australia",36.18,Winery & Vineyards
98,2,Whitsundays,"Queensland, Australia",37.64,Beaches


In [6]:
# Step 2: We then order the data so that the highest-scoring place is at the top. Because Rank 1 has the highest Tourism Sentiment Score, sorting by Rank in ascending order achieves this.
# sort table in ascending order of rank
data.sort_values(by=['Rank'], ascending=True, inplace=True)
data.head(10)

Unnamed: 0,Rank,City,"State, Country",Tourism Sentiment Score,Asset
99,1,Maldives,Unknown Country,40.26,
98,2,Whitsundays,"Queensland, Australia",37.64,Beaches
65,3,Sunshine Coast,"Queensland, Australia",36.96,Beaches
32,4,Seychelles,Unknown Country,36.74,Beaches
97,5,Yarra Valley,"Victoria, Australia",36.18,Winery & Vineyards
64,6,Sestriere,"Turin, Italy",34.91,Skiing & Snowboarding
31,7,Cayman Islands,Unknown Country,34.31,Beaches
96,8,Ibiza,"Balearic Islands, Spain",34.16,Beaches
63,9,Cairns,"Queensland, Australia",33.98,Diving & Snorkeling
30,10,Venice,Italy,33.78,Architecture


In [None]:
# UNUSED CODE SETUP EXPLANATION: This is where our model evolved. We initially explored the following code setup, but according to our discussion with Zhengrui, this model (even though it works) is too simple. Hence, we abandoned this approach and moved to a different code setup to make the interface more interactive.
'''
# old version of input

# provide hint for countries
unique_countries = data['State, Country'].unique()
cleaned_countries = [location.split(', ')[-1] if ', ' in location else location for location in unique_countries]
unique_cleaned_countries = list(set(cleaned_countries))
print('You can find country you are interested in from below: ')
print(unique_cleaned_countries)

# provide hint for assets
unique_assets = data['Asset'].str.split().explode().unique()
filtered_assets = [asset for asset in unique_assets if asset is not None and '&' not in asset]

print('You can find assets you are interested in from below: ')
print(filtered_assets)

# User input

yr_choice = input("Enter your favorite country or asset: ")
filtered_places = data[data['State, Country'].str.contains(yr_choice) | data['Asset'].str.contains(yr_choice)]
if not filtered_places.empty:
    print('My recommendation is: ')
    print(filtered_places)
else:
    print('Sorry, I cannot find your dream destination.')
'''

In [7]:
# Step 3: Next, we create the application logic that produces a recommendation for the user. The idea is to take the user's input as the starting location of their trip and use it as a reference point to find the nearest destination in our DataFrame.
'''To start, we use the Geopy Python package. We first install it in our Jupyter Notebook with the following statement:'''

!pip install geopy




In [10]:
# Step 4: In this step, we take the user's input and pass it to a geolocator to obtain the geographic coordinates (latitude and longitude) of their starting point.
'''First, we import the Nominatim geocoding service from the Geopy library. We use it to assign geographic coordinates to all the destinations in the top 100 list as well as to the starting point (the city or country that the user enters as input).
We also import functions from the Python math module in this step because we use them later to calculate the geographic distance between two locations (i.e., the starting point and the recommended destination).'''

from geopy.geocoders import Nominatim
from math import radians, sin, cos, sqrt, atan2

'''Next, we create the geocoder and pair it with an input prompt so that our program can look up the geographic coordinates of the city or place the user enters. Nominatim is the geocoding service built on OpenStreetMap data (not the Google Maps API); its user_agent should be a string that identifies our application.'''
# the user_agent value is an arbitrary application name chosen for this project
geolocator = Nominatim(user_agent="geekysquad_travel_recommender")
# ask user to input a location
yr_location = input("What city are you in? ")

# geocode the input location
location = geolocator.geocode(yr_location)

What city are you in? Seattle
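
The geocoder created here is called once for the starting point and later once per destination, so about 100 requests go to the public Nominatim server, whose usage policy asks for no more than one request per second. A minimal sketch of spacing out calls follows; `throttled` is a hypothetical helper, and a stand-in function replaces the real network call so the example runs offline. Geopy also ships `geopy.extra.rate_limiter.RateLimiter` for the same purpose.

```python
import time

def throttled(func, min_delay_seconds=1.0, clock=time.monotonic, sleep=time.sleep):
    """Wrap func so that consecutive calls are at least min_delay_seconds apart."""
    last_call = [None]  # mutable cell holding the time of the previous call

    def wrapper(*args, **kwargs):
        if last_call[0] is not None:
            wait = min_delay_seconds - (clock() - last_call[0])
            if wait > 0:
                sleep(wait)
        last_call[0] = clock()
        return func(*args, **kwargs)

    return wrapper

# Stand-in for geolocator.geocode so that no network access is needed.
def fake_geocode(name):
    return f"coordinates for {name}"

# A short delay keeps the demonstration fast; a real run would use about 1 second.
slow_geocode = throttled(fake_geocode, min_delay_seconds=0.05)
start = time.monotonic()
results = [slow_geocode(city) for city in ["Seattle", "Kelowna", "San Diego"]]
elapsed = time.monotonic() - start
print(results, round(elapsed, 2))
```

In the notebook, the wrapped function would be `geolocator.geocode`, and the wrapper would be used wherever `geolocator.geocode` is called directly.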


In [11]:
# Step 5: This is the last step for the starting point: we read the coordinates of the city or place that the user entered above.
'''This is important because it establishes the starting-point coordinates that allow the application to calculate distances and look for the closest destination in our DataFrame.'''
# get the coordinates of the input location
latitude = location.latitude
longitude = location.longitude
coordinates_input = [latitude, longitude]
print("Here is the Latitude and Longitude of the starting point: ")
print(coordinates_input)

Here is the Latitude and Longitude of the starting point: 
[47.6038321, -122.330062]
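
Geopy geocoders return `None` when a place name has no match, in which case reading `location.latitude` fails with an `AttributeError`. The sketch below shows one way to turn that into a clearer message. `resolve_coordinates` is a hypothetical helper, and the stand-in geocoder knows only Seattle (with the coordinates printed above) so that the example runs without network access.

```python
from types import SimpleNamespace

def resolve_coordinates(geocode, place_name):
    """Return [latitude, longitude] for place_name, or raise if nothing is found."""
    result = geocode(place_name)
    if result is None:  # no match for this name
        raise ValueError(f"No location found for {place_name!r}")
    return [result.latitude, result.longitude]

# Stand-in geocoder so the sketch runs offline; it knows a single city.
def fake_geocode(name):
    known = {"Seattle": SimpleNamespace(latitude=47.6038321, longitude=-122.330062)}
    return known.get(name)

print(resolve_coordinates(fake_geocode, "Seattle"))
try:
    resolve_coordinates(fake_geocode, "Atlantis")
except ValueError as err:
    print(err)
```

In the notebook, `geolocator.geocode` would be passed as the first argument in place of `fake_geocode`.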


In [12]:
# Step 6: We create a function to calculate the geographic distance between the starting point and a destination.
'''We use the haversine formula to calculate the great-circle distance between two cities.'''
# haversine function to calculate distance between two cities
def haversine(lat1, lon1, lat2, lon2):
    # Convert latitude and longitude from degrees to radians
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])

    # Haversine formula
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    distance = 6371 * c  # Radius of the Earth in kilometers (use 3959 for miles)

    return distance
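
A few quick checks help confirm that the function behaves as expected. The sketch below repeats the function so that it runs on its own, then evaluates three cases: identical points, a quarter of the way around the equator, and the Seattle and Kelowna coordinates printed elsewhere in this notebook. The variable names are chosen for illustration only.

```python
from math import radians, sin, cos, sqrt, atan2, pi

def haversine(lat1, lon1, lat2, lon2):
    # Convert latitude and longitude from degrees to radians
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return 6371 * c  # Earth radius in kilometers

seattle = (47.6038321, -122.330062)
kelowna = (49.8879177, -119.495902)

same_point = haversine(*seattle, *seattle)   # identical points
quarter = haversine(0, 0, 0, 90)             # quarter of the equator
trip = haversine(*seattle, *kelowna)         # start point to recommended city

print(same_point)
print(quarter, 6371 * pi / 2)
print(round(trip, 2))
```

The second line of output should show two nearly identical numbers, since a quarter of a great circle of radius 6371 km has length 6371 × π / 2.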

In [13]:
# Step 7: In this step, we create new columns that store each destination's coordinates and its distance from the starting point.

'''This is important because our logic is to find the destination in the DataFrame with the smallest distance to the starting point and recommend it.'''
# create three new columns showing the latitude and longitude of each place in the DataFrame
# and the distance between each place and the input location

data['Location'] = data['City'].apply(lambda city: geolocator.geocode(city))
data['Latitude'] = data['Location'].apply(lambda loc: loc.latitude)
data['Longitude'] = data['Location'].apply(lambda loc: loc.longitude)
data['Distance'] = data.apply(lambda row: haversine(location.latitude, location.longitude, row['Latitude'], row['Longitude']), axis=1)
# keep two decimal places for distance
data['Distance'] = data['Distance'].round(2)
print(data)

'''This is the final outcome, where we recommend to the user the place closest to the starting point, along with its information such as the city name, country, rank, and asset.'''
recommendation = data.nsmallest(1, 'Distance')
print("=======================================================")
print("We recommend the following destination: ")
print(recommendation)

    Rank            City              State, Country  Tourism Sentiment Score  \
99     1        Maldives             Unknown Country                    40.26   
98     2     Whitsundays       Queensland, Australia                    37.64   
65     3  Sunshine Coast       Queensland, Australia                    36.96   
32     4      Seychelles             Unknown Country                    36.74   
97     5    Yarra Valley         Victoria, Australia                    36.18   
..   ...             ...                         ...                      ...   
34    96           Stowe      Vermont, United States                    23.61   
1     97       San Diego   California, United States                    23.54   
66    98      Wollongong  New South Wales, Australia                    23.52   
33    99      Gold Coast       Queensland, Australia                    23.44   
0    100       Lexington     Kentucky, United States                    23.38   

                    Asset  
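
The selection logic of this step can be sketched in plain Python on a handful of places. The example also skips places whose geocoding failed, which the cell above does not handle. `nearest_destination` is a hypothetical helper, the San Diego coordinates are approximate values supplied for illustration, and "Atlantis" is a made-up entry standing in for a failed lookup.

```python
from math import radians, sin, cos, sqrt, atan2

def haversine(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometers between two points given in degrees
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 6371 * 2 * atan2(sqrt(a), sqrt(1 - a))

def nearest_destination(start, destinations):
    """Return (name, distance_km) of the closest place with known coordinates."""
    best_name, best_distance = None, None
    for name, coords in destinations.items():
        if coords is None:  # geocoding failed for this place; skip it
            continue
        distance = haversine(start[0], start[1], coords[0], coords[1])
        if best_distance is None or distance < best_distance:
            best_name, best_distance = name, distance
    if best_name is None:
        raise ValueError("no destination has coordinates")
    return best_name, round(best_distance, 2)

start = (47.6038321, -122.330062)            # Seattle, from the Step 5 output
destinations = {
    "Kelowna": (49.8879177, -119.495902),    # from the Step 8 output
    "San Diego": (32.7157, -117.1611),       # approximate, for illustration
    "Atlantis": None,                        # made-up failed lookup
}
name, distance_km = nearest_destination(start, destinations)
print(name, distance_km)
```

Skipping missing coordinates, rather than stopping the whole run, keeps one unrecognized place name from blocking a recommendation.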

In [14]:
# Step 8: Next, we collect all the important data for the recommended destination so that we can plot it on a map as part of the output shown to the user.
'''This is where we get the geographic information for the recommended destination.'''
# get the info of the output city and convert to float type
latitude_output = float(recommendation['Latitude'].iloc[0])
longitude_output = float(recommendation['Longitude'].iloc[0])
name_output = recommendation['City'].iloc[0]

'''This is where we get the distance between the two cities so that we can display it in kilometers in the next step.'''
# get the distance between the input city and the output city, and convert to float type
distance_km = float(recommendation['Distance'].iloc[0])
asset_output = recommendation['Asset'].iloc[0]
score_output = recommendation['Tourism Sentiment Score'].iloc[0]

print("Here is the important information about our recommended destination: ")
print("Recommended Destination: ", name_output)
print("Latitude: ", latitude_output)
print("Longitude: ", longitude_output)
print("Distance from your start point: ", distance_km)
print("Recommended attractions: ", asset_output)
print("The Travel Sentiment Score rating: ", score_output)

Here is the important information about our recommended destination: 
Recommended Destination:  Kelowna
Latitude:  49.8879177
Longitude:  -119.495902
Distance from your start point:  328.11
Recommended attractions:  Winery & Vineyards
The Travel Sentiment Score rating:  24.43


# PART 3 - Data Visualization and Output

In [15]:
# Step 1: For the best visualization, we opted for an interactive map that shows the starting point, the end point, and other information about each location.
'''First, we install the Folium package, which generates this interactive map, in our Jupyter Notebook.'''
!pip install folium



In [16]:
# Step 2: This is the last part, where we put together all the components needed to display the final output to the user as a map.
'''First, we import folium, then gather the parameters we need and pass them to the folium map created in this step.'''
# generate the map

import folium

# get the coordinates of the output location
coordinates_output = [latitude_output, longitude_output]

'''This computes the midpoint between the two cities, where we display the distance between them.'''
# create a map centered at the average location of start and end point
average_location = [(coordinates_input[0] + coordinates_output[0]) / 2, (coordinates_input[1] + coordinates_output[1]) / 2]
my_map = folium.Map(location=average_location, zoom_start=4)

'''We add pop-up information to each location pin: a starting-point label and the destination information.'''
popup_info = "<b>" + name_output + "</b><br>Asset: " + str(asset_output) + "<br><br>Tourism Sentiment Score: " + str(score_output)
popup_startinfo = "<b>" + yr_location + "</b><br>This is your start point."

'''This is for adding a location pin in the map.'''
# add markers for input and output city
folium.Marker(location=coordinates_input, popup=popup_startinfo).add_to(my_map)
folium.Marker(location=coordinates_output, popup=popup_info).add_to(my_map)

'''This is the line between two cities.'''
# add a line between input and output city
folium.PolyLine(locations=[coordinates_input, coordinates_output], color='blue', weight=2.5, opacity=1).add_to(my_map)

'''This is the description shown at the midpoint; it uses a red icon to differentiate it from the two location pins.'''
# add a descriptor on the map with the distance
descriptor = f'Distance: {distance_km:.2f} kilometers'
folium.Marker(
    [average_location[0], average_location[1]],
    icon=folium.Icon(color='red'),
    popup=folium.Popup(descriptor, parse_html=True)
).add_to(my_map)

'''This saves the map locally as an HTML file alongside our notebook.'''
# Save the map to an HTML file
my_map.save('map_with_pins_line_and_descriptor.html')

'''This is to finally display the map as an output with all the elements.'''
# Display the map
my_map
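
Averaging the two latitudes and longitudes works well for nearby points such as Seattle and Kelowna, but it can place the middle marker far from the line when the two points lie on opposite sides of the 180° meridian (longitudes 170 and -170 average to 0). As a hedged alternative, the sketch below computes the midpoint along the great circle instead. `great_circle_midpoint` is a hypothetical helper and was not part of our application.

```python
from math import radians, degrees, sin, cos, atan2, sqrt

def great_circle_midpoint(lat1, lon1, lat2, lon2):
    """Return [latitude, longitude] in degrees of the great-circle midpoint."""
    phi1, lam1, phi2, lam2 = map(radians, [lat1, lon1, lat2, lon2])
    dlam = lam2 - lam1
    # components of the second point relative to the first point's meridian
    bx = cos(phi2) * cos(dlam)
    by = cos(phi2) * sin(dlam)
    phi_m = atan2(sin(phi1) + sin(phi2), sqrt((cos(phi1) + bx) ** 2 + by ** 2))
    lam_m = lam1 + atan2(by, cos(phi1) + bx)
    lon_m = (degrees(lam_m) + 180) % 360 - 180  # wrap into [-180, 180)
    return [degrees(phi_m), lon_m]

# Two points on the equator, 90 degrees apart
print(great_circle_midpoint(0, 0, 0, 90))
# Two points on either side of the 180-degree meridian
print(great_circle_midpoint(0, 170, 0, -170))
# Start point and recommended destination from this notebook
print(great_circle_midpoint(47.6038321, -122.330062, 49.8879177, -119.495902))
```

The returned list has the same `[latitude, longitude]` shape as `average_location`, so it could be passed to `folium.Map` and `folium.Marker` in the same way.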