# Recommending National Park Trails
#### DS2500 Final Report
Andrew Fielding, Grace Brueckner, Jeffrey Pan, and Katie Warren
Team Index 105
![mountain_outline](https://i.ibb.co/nsz3d3x/mountain-outline.png)

### Table of Contents 
I.  [Executive Summary](#executive-summary)

II.  [Introduction](#introduction)
- [Motivation](#motivation)
- [Utilizing Data to Solve this Problem](#utilizing-data)
    
III.  [Data Description](#data-description)
- [Data Description](#data)
- [Pipeline](#pipeline)
- [Initial Visualizations](#initial-vis)

IV.  [Method](#method)

V.  [Results](#results)
- [Visualizations](#visualizations)
    
VI.  [Discussion](#discussion)
- [Interpretation of Results](#interpretations)
- [Takeaways](#takeaways)

<a id='executive-summary'></a>
# Executive Summary

The goal of this project was to build a trail recommendation system for National Parks that returns trails compatible with a user's preferences and values. To accomplish this, we used a dataset found on [Kaggle](https://www.kaggle.com/datasets/planejane/national-park-trails) which pulled trail information from [AllTrails](https://www.alltrails.com), an online database of trail maps that lets users search for trails and see their ratings and features (for more details, see [Data Description](#data-description)). After cleaning the data and visualizing some patterns in the original dataset (see [Initial Visualizations](#initial-vis)), we focused on two machine learning algorithms to build our recommendation system: K-Means Clustering and K-Nearest Neighbors. To better match trails to what the user values, we scaled the data and then weighted each feature by how important it was to the user, so that it influenced the clustering accordingly. After clustering the trails and finding the cluster that best matches a trail populated with the user’s preferences, we used K-Nearest Neighbors to find the 5 most similar trails nationwide and up to 5 nearest neighbors that fall within the user's ideal travel distance (see [Method](#method) and [Results](#results) to learn more). Our program also returns an interactive map that plots each trail in the cluster with key information, so the user can use this visualization to find more compatible trails (see [Results: Visualizations](#result-vis)).

<a id='introduction'></a>
# Introduction

<a id='motivation'></a>
### *Motivation*
America has over 60 national parks, each full of beautiful hikes and trails. But how is one to choose a trail from this huge selection? For hikers today, much of the decision relies on arduous individual research. The Internet has a whole host of articles explaining how to get started, such as [7 Tips for Choosing a Hiking Trail](https://gobreck.com/trip-ideas/hiking/9-tips-for-choosing-a-hiking-trail), [How to Choose a Hiking Trail: Tips Before Heading Out](https://bearfoottheory.com/how-to-choose-a-hiking-trail/), and [Hiking for Beginners: The Essential Guide](https://americanhiking.org/blog/hiking-for-beginners-essential-guide/). [AllTrails](https://www.alltrails.com) also serves as a great resource for finding trails with certain features and in certain locations. However, these resources do not let a user customize their experience by finding a true recommendation based on a blend of their unique preferences across all important features of a trail; instead, a user is typically left to select which trail features matter to them absolutely, eliminating large chunks of trails from consideration on a binary basis. There is a gap in the current resource base for prioritizing the elements of a trail that matter the most to a user and deemphasizing (but still accounting for their preferences on) those that matter less. If a person were to undertake this sort of research themselves, it might take hours or more.

Indeed, one of our group members experienced this very phenomenon when visiting Yosemite National Park for just one day in summer 2021. This group member spent over a week trying to find the best trail to suit their family's unique preferences. With over [280](https://www.alltrails.com/parks/us/california/yosemite-national-park#:~:text=Want%20to%20find%20the%20best,from%20nature%20lovers%20like%20you.) trails to choose from in the park, time for just one hike, and a desire to make an unforgettable memory, finding the one that most matched their family's ideal hike proved nearly impossible. This problem is, as we’ve discussed, a common one: most people don't have time to visit every national park (indeed, to do so would take more [days](https://www.washingtonpost.com/news/wonk/wp/2016/08/02/how-to-visit-nearly-every-national-park-in-one-epic-road-trip/) of driving than the average American has off in a [year](https://www.zippia.com/advice/pto-statistics/#:~:text=10%20days%20is%20the%20average,of%20their%20paid%20time%20off.)), and even if they did, they certainly wouldn't have time to leisurely hike every trail in the park. When a person does make the sacrifice of time and money required to visit a park and hike a trail, they should be able to ensure they have the best experience possible, without resorting to hours of research or impersonal articles like [this](https://www.outsideonline.com/adventure-travel/national-parks/the-best-hike-in-every-national-park/) one, which seek to provide a boilerplate “best” trail in every park. Each person’s unique personality and needs contribute to what the best trail is for them. Thus, our project aims to simplify and personalize the trail-choosing process by recommending to the user a list of trails that match their precise preferences on key trail features, according to which of these features matter to them the most.


<a id='utilizing-data'></a>
### *Utilizing Data to Solve this Problem*
In order to recommend trails to the user that best match their preferences, we will utilize a dataset found on [Kaggle](https://www.kaggle.com/datasets/planejane/national-park-trails) which pulled trail information and data from [AllTrails](https://www.alltrails.com). The user will be queried about their trail preferences and how much they care about some of the features included in the dataset. Then, the program will run a K-Means Clustering to group similar trails according to these features at varied weights to make some of the features influence the clustering more heavily. The user's preferences will then be sorted into one of these existing clusters, and K Nearest Neighbors will be used to find the most similar trails within the cluster. This information will help the program recommend the top five best trails nationwide for the user. We will also incorporate Haversine distances with the latitude and longitude data from each trail so that we are also able to return up to five closest neighbors that fall within the distance that the user has specified they are willing to travel.

<a id="data-description"></a>
# Data Description

<a id='data'></a>
### *Data Description*

We will be utilizing a Kaggle dataset that contains information about trails in National Parks. 

From this dataset, we will be able to obtain characteristics about each trail including: 
- features
- activities
- ratings
- popularity
- location (latitude and longitude)
- etc

For the purposes of this project, we will focus mostly on the following characteristics: 
- features (broken out as dummy/Boolean variables)
    - wildlife
    - dogs
    - kids
    - views
- popularity
- length
- visitor usage
- elevation gain
- difficulty
- latitude 
- longitude

Then we can utilize this information to recommend trails to users based on their preferences and values. The location of the trails can be utilized for Haversine distances to recommend trails that are close to our user as well as just general matches.
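As a rough illustration of the Haversine distance mentioned above, the following is a minimal sketch rather than the notebook's actual implementation. The function name `haversine_km` and the Earth-radius constant are assumptions introduced here; the coordinates are the first two trails shown in the dataset preview.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in radians."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Harding Ice Field Trail and Mount Healy Overlook Trail, converted to radians
harding = (math.radians(60.18852), math.radians(-149.63156))
healy = (math.radians(63.73049), math.radians(-148.91968))

distance = haversine_km(*harding, *healy)
print(f"{distance:.0f} km apart")
```

The function expects radians, which is why the coordinates are converted before the call.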

<a id='pipeline'></a>
### *Pipeline*
One of the initial things we will do is delete columns from the data frame that don’t have to do with our analysis. For example, the “units” column does not give us any useful data for our prediction model, so we can delete this column. 

Currently, our data frame has columns for features and activities. In each observation in these columns is a list of the features or activities that trail possesses. While the values in each row appear to be lists, they are in fact lists that look like strings, so we need to clean these up and make them into lists that are iterable so we can work with them more easily. Once we create true lists, we need to make the dataset more usable, so we will have to simplify the features and activities columns so that each individual feature and individual activity has a dummy column in our dataframe. We will do this using one-hot encoding to create dummy columns for each feature or activity, with values 0 and 1 representing whether the trail possesses that feature or activity. 

We will accomplish this task with the following functions:
- 'clean_list'
- 'clean_str'
- 'add_dummy_variables'
- 'get_variable_list'

*note: we also added a dummy variable 'dogs-yes' that works as the inverse of 'dogs-no' to make some of our later implementation work more elegantly*
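The cleaning and one-hot encoding steps described above can be sketched compactly. This is an illustrative alternative under stated assumptions, not the notebook's own functions: `one_hot_rows` is a hypothetical helper, it parses each string with `ast.literal_eval` instead of stripping characters, and it uses plain lists and dicts rather than a dataframe so that it runs on its own. The two input strings are taken from the feature rows printed in the notebook.

```python
import ast


def one_hot_rows(list_like_strings):
    """Parse list-like strings and build one dict of 0/1 flags per row."""
    # turn "['a', 'b']" into a real list ['a', 'b']
    parsed = [ast.literal_eval(s) for s in list_like_strings]
    # every distinct value that appears anywhere in the column
    vocabulary = sorted({item for row in parsed for item in row})
    # 1 if the row's list contains the value, else 0
    rows = [{item: int(item in row) for item in vocabulary} for row in parsed]
    return vocabulary, rows


features = [
    "['dogs-no', 'partially-paved', 'views', 'wildlife']",
    "['dogs-no', 'lake', 'views', 'wild-flowers', 'wildlife']",
]
vocabulary, rows = one_hot_rows(features)
print(vocabulary)
print(rows[0])
```

Because membership is tested against the parsed list, a value only matches when it appears as a whole list element, not as part of a longer name.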

Additionally, we plan to work with the coordinates given by the '_geoloc' column in the original dataset. However, much like the lists in the features and activities columns, while the values in _geoloc look like dictionaries, they are in fact strings that look like dictionaries. Additionally, it would be easier to work with each trail's coordinates if we had separate columns for latitude and longitude. Thus, we will go through the process of converting these strings to true dictionaries, and, once we have them as dictionaries, splitting up their key/value pairs so that latitude and longitude each have their own columns. We will also convert them to radians instead of degrees for use with Haversine Distances. 

We will accomplish this task using the following functions:
- 'convert_dict_like_string'
- 'value_lists'
- some code implementation of the two functions
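The coordinate handling described above might look like the following sketch. The helper name `split_geoloc` is hypothetical, and the sketch folds the radian conversion into the same step for brevity; the example strings are copied from the `_geoloc` column in the dataset preview.

```python
import ast
import math


def split_geoloc(geoloc_strings):
    """Split dict-like strings into latitude and longitude lists, in radians."""
    lats, longs = [], []
    for s in geoloc_strings:
        coords = ast.literal_eval(s)  # "{'lat': ..., 'lng': ...}" -> dict
        lats.append(math.radians(coords['lat']))
        longs.append(math.radians(coords['lng']))
    return lats, longs


geoloc = [
    "{'lat': 60.18852, 'lng': -149.63156}",
    "{'lat': 63.73049, 'lng': -148.91968}",
]
lats, longs = split_geoloc(geoloc)
print(lats, longs)
```

Looking the values up by key (`'lat'`, `'lng'`) avoids relying on the order of the entries in each dictionary.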

Another factor going into data processing is making sure we understand all of the data. Because there was no data dictionary, we had to refer back to the user interface on the AllTrails website to determine the units of measurement that the dataset used. 

We determined that the columns “elevation gain” and “length” are measured in meters. For example, for the Harding Ice Field Trail, the recorded elevation gain is 1,161.898 and the recorded length is 15,610.598. On the AllTrails website, the length is 9.2 miles, which is equivalent to 14,806 meters, and the elevation gain is 3,641 feet, which is equivalent to 1,109.77 meters. These measurements are close to those found in our dataset. The slight differences could be accounted for by recent changes in trail length and elevation due to erosion, since the data was collected three years ago and the AllTrails website is updated frequently. 
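The unit check can also be run in the other direction, converting the dataset values into the units shown on the website. This short sketch uses the standard conversion factors and the Harding Ice Field Trail values from the first row of the dataset preview; the function names are illustrative.

```python
METERS_PER_MILE = 1609.344
METERS_PER_FOOT = 0.3048


def meters_to_miles(meters):
    return meters / METERS_PER_MILE


def meters_to_feet(meters):
    return meters / METERS_PER_FOOT


# Harding Ice Field Trail, as recorded in the dataset
length_miles = meters_to_miles(15610.598)
gain_feet = meters_to_feet(1161.8976)

# roughly 9.7 miles and 3,812 feet
print(round(length_miles, 2), round(gain_feet))
```

Both results land on round imperial figures of the same magnitude as the website's, which is consistent with the dataset storing meters.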

Before beginning our analysis, we removed all rows with missing elements from our data. Before running our clustering, we scale normalized *all* variables that we were clustering on. While there is some debate on whether you should scale normalize dummy variables, when we did not include these variables in the scale normalization process, the weighted aspect of the dataframe made it so that the dummy variables were almost always being "outweighed" by the continuous features and were not being considered heavily in the clustering process. We chose to normalize these in this context just to give these features a chance to influence our clustering when appropriate. 
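To make the scaling step concrete, here is a minimal sketch of dividing a column by its standard deviation, applied to one continuous feature and one dummy feature. The helper `scale_by_std` is hypothetical; the values are the elevation gains and `kids` flags of the five trails in the dataset preview, and the sample standard deviation is used to match the default of pandas `.std()`.

```python
import statistics


def scale_by_std(values):
    """Divide every value by the column's sample standard deviation."""
    sd = statistics.stdev(values)
    return [v / sd for v in values]


elevation_gain = [1161.8976, 507.7968, 81.9912, 119.7864, 1124.712]
kids = [0, 0, 0, 1, 0]

elevation_scaled = scale_by_std(elevation_gain)
kids_scaled = scale_by_std(kids)

# after scaling, both columns have a standard deviation of 1,
# so neither dominates a distance calculation purely because of its units
print(statistics.stdev(elevation_scaled), statistics.stdev(kids_scaled))
```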


In [1]:
# imports
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

# don't want to be bogged down with warnings
warnings.filterwarnings('ignore')

# importing data
df = pd.read_csv('nationalparktrails.csv')
df.head()

Unnamed: 0,trail_id,name,area_name,city_name,state_name,country_name,_geoloc,popularity,length,elevation_gain,difficulty_rating,route_type,visitor_usage,avg_rating,num_reviews,features,activities,units
0,10020048,Harding Ice Field Trail,Kenai Fjords National Park,Seward,Alaska,United States,"{'lat': 60.18852, 'lng': -149.63156}",24.8931,15610.598,1161.8976,5,out and back,3.0,5.0,423,"['dogs-no', 'forest', 'river', 'views', 'water...","['birding', 'camping', 'hiking', 'nature-trips...",i
1,10236086,Mount Healy Overlook Trail,Denali National Park,Denali National Park,Alaska,United States,"{'lat': 63.73049, 'lng': -148.91968}",18.0311,6920.162,507.7968,3,out and back,1.0,4.5,260,"['dogs-no', 'forest', 'views', 'wild-flowers',...","['birding', 'camping', 'hiking', 'nature-trips...",i
2,10267857,Exit Glacier Trail,Kenai Fjords National Park,Seward,Alaska,United States,"{'lat': 60.18879, 'lng': -149.631}",17.7821,2896.812,81.9912,1,out and back,3.0,4.5,224,"['dogs-no', 'partially-paved', 'views', 'wildl...","['hiking', 'walking']",i
3,10236076,Horseshoe Lake Trail,Denali National Park,Denali National Park,Alaska,United States,"{'lat': 63.73661, 'lng': -148.915}",16.2674,3379.614,119.7864,1,loop,2.0,4.5,237,"['dogs-no', 'forest', 'lake', 'kids', 'views',...","['birding', 'hiking', 'nature-trips', 'trail-r...",i
4,10236082,Triple Lakes Trail,Denali National Park,Denali National Park,Alaska,United States,"{'lat': 63.73319, 'lng': -148.89682}",12.5935,29772.79,1124.712,5,out and back,1.0,4.5,110,"['dogs-no', 'lake', 'views', 'wild-flowers', '...","['birding', 'fishing', 'hiking', 'nature-trips...",i


In [2]:
'''
goal: remove unnecessary columns such as units
'''
del df['units']
df.head()

Unnamed: 0,trail_id,name,area_name,city_name,state_name,country_name,_geoloc,popularity,length,elevation_gain,difficulty_rating,route_type,visitor_usage,avg_rating,num_reviews,features,activities
0,10020048,Harding Ice Field Trail,Kenai Fjords National Park,Seward,Alaska,United States,"{'lat': 60.18852, 'lng': -149.63156}",24.8931,15610.598,1161.8976,5,out and back,3.0,5.0,423,"['dogs-no', 'forest', 'river', 'views', 'water...","['birding', 'camping', 'hiking', 'nature-trips..."
1,10236086,Mount Healy Overlook Trail,Denali National Park,Denali National Park,Alaska,United States,"{'lat': 63.73049, 'lng': -148.91968}",18.0311,6920.162,507.7968,3,out and back,1.0,4.5,260,"['dogs-no', 'forest', 'views', 'wild-flowers',...","['birding', 'camping', 'hiking', 'nature-trips..."
2,10267857,Exit Glacier Trail,Kenai Fjords National Park,Seward,Alaska,United States,"{'lat': 60.18879, 'lng': -149.631}",17.7821,2896.812,81.9912,1,out and back,3.0,4.5,224,"['dogs-no', 'partially-paved', 'views', 'wildl...","['hiking', 'walking']"
3,10236076,Horseshoe Lake Trail,Denali National Park,Denali National Park,Alaska,United States,"{'lat': 63.73661, 'lng': -148.915}",16.2674,3379.614,119.7864,1,loop,2.0,4.5,237,"['dogs-no', 'forest', 'lake', 'kids', 'views',...","['birding', 'hiking', 'nature-trips', 'trail-r..."
4,10236082,Triple Lakes Trail,Denali National Park,Denali National Park,Alaska,United States,"{'lat': 63.73319, 'lng': -148.89682}",12.5935,29772.79,1124.712,5,out and back,1.0,4.5,110,"['dogs-no', 'lake', 'views', 'wild-flowers', '...","['birding', 'fishing', 'hiking', 'nature-trips..."


In [3]:
'''
goal: break out features and activities to store them as new columns
- double for loop
- want to see what kind of features there are, maybe for loop into each row and make a list of each unique feature?
'''
# looking at the features 
# only doing 5 rows to get an idea of the data formatting
for row in range(5):
    print(df['features'][row])

# they're all strings that look like lists

['dogs-no', 'forest', 'river', 'views', 'waterfall', 'wild-flowers', 'wildlife']
['dogs-no', 'forest', 'views', 'wild-flowers', 'wildlife']
['dogs-no', 'partially-paved', 'views', 'wildlife']
['dogs-no', 'forest', 'lake', 'kids', 'views', 'wild-flowers', 'wildlife']
['dogs-no', 'lake', 'views', 'wild-flowers', 'wildlife']

In [4]:
def clean_list(lst):
    """
    cleans list by removing brackets, apostrophes, and commas from string and
    appending it to a new list
    
    Args:
    lst(list) : a list of strings to clean
    
    Returns:
    cleaned_feature_list(list) : the cleaned strings, without duplicates
    """
    cleaned_feature_list = []
    for i in lst:
        i = i.strip("]").strip("[").strip("'").strip(",").strip("'")
        if i not in cleaned_feature_list:
            cleaned_feature_list.append(i)
            
    return cleaned_feature_list
        
            
def clean_str(string):
    """
    cleans string by removing brackets, apostrophes, and commas
    
    Args:
    string(str) : a string 
    
    Returns:
    string(str) : a cleaned string
    
    """
    
    string = string.strip("]").strip("[]").strip("'").strip(",").strip("'")
    return string 

In [5]:
def get_variable_list(dataframe, col_name):
    """
    Takes a column of lists in a dataframe, and creates a list of all variables
      mentioned in the column
    
    Args:
        dataframe(dataframe): the dataframe we want to add dummy columns to
        col_name(str): the name of the column that holds the lists of each
            trait or variable

    Returns: 
       clean_variable_list (list): a list of all variables present in the column
    """
    # cleaning variable list to get a list of all vars
    variable_list = []
    
    # want to get all of the features within the dataset
    # (read from the dataframe passed in, not from the global df)
    for cell in dataframe[col_name]:
        for variable in cell.split():
            if variable not in variable_list:
                variable_list.append(variable)
    
    # this is the list of all variables mentioned in the column
    clean_variable_list = clean_list(variable_list)

    return clean_variable_list

In [6]:
def add_dummy_variables(dataframe, col_name):
    """
    Takes a list of variables in a column of a dataset and adds each to
        a dataframe as unique dummy variable columns
    
    Args:
        dataframe(dataframe): the dataframe we want to add dummy columns to
        col_name(str): the name of the column that holds the lists of each
            trait or variable
    """
    # cleaning variable list to get a list of all vars
    clean_variable_list = get_variable_list(dataframe, col_name)
    
    # adding dummy variables for each variable in the list
    for var in clean_variable_list:
        # match the quoted token (e.g. "'dogs-no'") so that a variable name
        # cannot match by accident as a substring of a longer name
        token = f"'{var}'"
        # add a column of 0/1 dummy values to the dataframe
        dataframe[var] = [int(token in cell) for cell in dataframe[col_name]]

In [7]:
# adding dummy variables for features and activities into the dataframe
add_dummy_variables(df, 'features')
add_dummy_variables(df, 'activities')

In [8]:
import ast
from copy import copy
import math

def convert_dict_like_string(df, col):
    """ converts a dict-like string to a dict

    Args:
        df (Pandas DataFrame): some dataframe
        col (pd df column): some column in the df where each row is a dict-like string
    
    Returns:
        dict_list (list): a list of the converted dicts
    """
    # initialize empty list
    dict_list = []
    
    # loop through every row in col and convert each value to a dict
    for item in df.loc[:, col]:
        dict_list.append(ast.literal_eval(item))
    
    return dict_list

def value_lists(lst):
    """ creates lists of values from a list of dicts with 2 values
    
    Args:
        lst (list): some list of dicts with 2 values
    
    Returns:
        val0_list (list): list of the first val in all dicts in lst
        val1_list (list): list of the second val in all dicts in lst
        val2_list (list): list of the avg_rating of each trail, in row order (read from the global df)
    """
    
    # initialize empty lists
    val0_list = []
    val1_list = []
    val2_list = []
    
    # loop through larger list 
    for i in range(len(lst)):
        # add first value of the dict lst[i] to a list
        val0_list.append(list(lst[i].values())[0])
        # add second value of the dict lst[i] to a list
        val1_list.append(list(lst[i].values())[1])
        # add the trail's avg_rating (used as the color value) to a list
        val2_list.append(df.loc[i, 'avg_rating'])
        
    return val0_list, val1_list, val2_list

def continental_us_coords(lat_list, long_list):
    """ IDs coords from within the continental US from lists
    
    Args:
        lat_list (list): some list of latitudes
        long_list (list): some list of longitudes
        
    Returns: 
        continental_lat_list (list): list of lats only including those within the continental US
        continental_long_list (list): list of longs only including those within the continental US
        continental_popularity_list (list): a list of the avg_rating of each trail within the 
                                continental US (read from the global df)
    """
    # initialize continental lat, long, and rating lists
    continental_long_list = []
    continental_lat_list = []
    continental_popularity_list = []
    
    # loop through the longitude list
    for i in range(len(long_list)):
        # if the longitude is greater than -130, append onto new continental lists
        if long_list[i] > -130:
            continental_long_list.append(long_list[i])
            continental_lat_list.append(lat_list[i]) 
            continental_popularity_list.append(df.loc[i, 'avg_rating']) 
    # return
    return continental_lat_list, continental_long_list, continental_popularity_list

# initialize dict_list
dict_list = convert_dict_like_string(df, '_geoloc')
# initialize list_tuple
list_tuple = value_lists(dict_list)
# add lat and long columns
df['lat'] = list_tuple[0]
df['long'] = list_tuple[1]

# drop geoloc column
df.drop('_geoloc', inplace = True, axis = 1)

In [9]:
# removing rows that have empty spots
df = df.dropna()

In [10]:
# make a column for dogs-yes to be the opposite of dogs-no
# (dogs-no is a 0/1 dummy, so its inverse is 1 - dogs-no)
df['dogs-yes'] = 1 - df['dogs-no']

<a id="initial-vis"></a>
### *Initial Visualizations*

In [11]:
# reset index 
df.reset_index(inplace = True,drop = True)

In [12]:
def find_park_averages(df): 
    '''find the average rating, latitude, and longitude of all of the trails in one park 
        Arguments: 
            df(Pandas dataframe): dataframe containing all national park data
        Returns: 
            avg_park (dict): a dictionary with each park as the key and the park name, 
                             average rating, average latitude and average longitude as the values
                            
        '''
   
    # import packages
    from collections import defaultdict

    # make a list of all the parks
    park_list = list(df['area_name'].unique())

    # initialize empty dict we will return
    avg_park = defaultdict(lambda: 0)
    
    # loop through all of the parks
    for park in park_list:
        # create a boolean to pull out only the data from the current park
        park_bool = df.loc[:, 'area_name'] == park
        # make a dataframe of only that park's data
        park_avg_df = df.loc[park_bool, :]
        # find the average rating, latitude, and longitude by averaging the columns
        rating = park_avg_df['avg_rating'].mean()
        lat = park_avg_df['lat'].mean()
        long = park_avg_df['long'].mean()
        # add these values to the dictionary for that particular park
        avg_park[park] = park, rating, lat, long
    
    # return completed dictionary
    return avg_park

In [13]:
# get dictionary of parks and average ratings & locations/convert to dataframe
avg_park = find_park_averages(df)

In [14]:
'''
- We will use Flourish to create a data visualization of national parks on a map of the US
- We need to upload the coordinates, numerical rating of parks, as well as categorical rating of  
the parks in a csv file to create the Flourish visualization 
- we will turn the dictionary into a dataframe and then convert the dataframe to a csv
'''
# make dict into a dataframe
col_list = ['Park Name','Average Rating', 'Average Latitude', 'Average Longitude']
new = pd.DataFrame.from_dict(avg_park, orient = 'index', columns = col_list)
new.head()

Unnamed: 0,Park Name,Average Rating,Average Latitude,Average Longitude
Kenai Fjords National Park,Kenai Fjords National Park,4.75,60.188655,-149.63128
Denali National Park,Denali National Park,4.026316,63.683251,-149.430641
Glacier Bay National Park,Glacier Bay National Park,3.7,58.455836,-135.859576
Katmai National Park,Katmai National Park,4.25,58.55806,-155.77792
Grand Canyon National Park,Grand Canyon National Park,4.416667,36.123143,-112.096563


In [15]:
# create new column 'rating category', initialize default to low
new['Rating Category'] = 'Low'
new.reset_index()

# find the minimum rating
min_rating = new['Average Rating'].min()

# find the maximum rating 
max_rating = new['Average Rating'].max()

# determine the bucket size for rating category
bucket_size = (max_rating - min_rating)/4

# determine bucket 1 bounds
bucket_1_min = min_rating
bucket_1_max = min_rating + bucket_size 

# determine bucket 2 upper bounds 
bucket_2_max = bucket_1_max + bucket_size

# determine bucket 3 upper bounds 
bucket_3_max = bucket_2_max + bucket_size


# put average rating into bins 
# bins are determined from the min and max values of the ratings
counter = 0
for idx, row in new.iterrows(): 
    if bucket_1_min <=  row.loc['Average Rating'] < bucket_1_max: 
        new.iloc[counter, 4] = 'Low'      
    elif bucket_1_max <=  row.loc['Average Rating'] < bucket_2_max: 
        new.iloc[counter, 4] = 'Somewhat Low'
    elif bucket_2_max <=  row.loc['Average Rating'] < bucket_3_max: 
        new.iloc[counter, 4] = 'Somewhat High'
    else: 
        new.iloc[counter, 4] = 'High'
    counter += 1
    
new.head()

Unnamed: 0,Park Name,Average Rating,Average Latitude,Average Longitude,Rating Category
Kenai Fjords National Park,Kenai Fjords National Park,4.75,60.188655,-149.63128,High
Denali National Park,Denali National Park,4.026316,63.683251,-149.430641,Somewhat Low
Glacier Bay National Park,Glacier Bay National Park,3.7,58.455836,-135.859576,Somewhat Low
Katmai National Park,Katmai National Park,4.25,58.55806,-155.77792,Somewhat High
Grand Canyon National Park,Grand Canyon National Park,4.416667,36.123143,-112.096563,High


In [16]:
# find the minimum rating
new['Average Rating'].min()

3.3333333333333335

In [17]:
# find the maximum rating 
new['Average Rating'].max()

4.75

In [18]:
# create new column 'rating category', initialize default to low
new['Rating Category'] = 'Low'
new.reset_index()

# put average rating into bins 
# bins are determined from the min and max values of the ratings
counter = 0
for idx, row in new.iterrows(): 
    if 3.25 <=  row.loc['Average Rating'] < 3.4: 
        new.iloc[counter, 4] = 'Extremely Low'      
    elif 3.4 <=  row.loc['Average Rating'] < 3.55: 
        new.iloc[counter, 4] = 'Somewhat Low'
    elif 3.55 <=  row.loc['Average Rating'] < 3.7: 
        new.iloc[counter, 4] = 'Moderately Low'
    elif 3.7 <=  row.loc['Average Rating'] < 3.85: 
        new.iloc[counter, 4] = 'Slightly Low'
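    # lower bound set to 3.85 (an assumed fix) so ratings in [3.85, 4) do not fall through to the else branch
    elif 3.85 <=  row.loc['Average Rating'] < 4.15: 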
        new.iloc[counter, 4] = 'Neutral'
    elif 4.15 <=  row.loc['Average Rating'] < 4.3: 
        new.iloc[counter, 4] = 'Slightly High'
    elif 4.3 <=  row.loc['Average Rating'] < 4.45: 
        new.iloc[counter, 4] = 'Moderately High'
    elif 4.45 <=  row.loc['Average Rating'] < 4.6: 
        new.iloc[counter, 4] = 'Somewhat High'
    else: 
        new.iloc[counter, 4] = 'Extremely High'
    counter += 1
    
new.head()

Unnamed: 0,Park Name,Average Rating,Average Latitude,Average Longitude,Rating Category
Kenai Fjords National Park,Kenai Fjords National Park,4.75,60.188655,-149.63128,Extremely High
Denali National Park,Denali National Park,4.026316,63.683251,-149.430641,Neutral
Glacier Bay National Park,Glacier Bay National Park,3.7,58.455836,-135.859576,Slightly Low
Katmai National Park,Katmai National Park,4.25,58.55806,-155.77792,Slightly High
Grand Canyon National Park,Grand Canyon National Park,4.416667,36.123143,-112.096563,Moderately High


In [19]:
# export to csv
# this csv is then downloaded and uploaded to Flourish to create the visualization
new.to_csv('avg_park_dataset.csv', encoding='utf-8')


In [38]:
# view our data visualization on Flourish
# https://stackoverflow.com/questions/4302027/how-to-open-a-url-in-python
import webbrowser
webbrowser.open("https://public.flourish.studio/visualisation/11997781/")


True

In [21]:
# import plotly to build a map scatter visualization of these averages
import plotly.express as px

# call the dataframe into the plotting mechanism
fig = px.scatter_mapbox(new,lat="Average Latitude", lon="Average Longitude", hover_name='Park Name', 
                        hover_data=['Average Rating'], title="Average Rating by National Park", 
                        color= 'Average Rating', opacity=.8,zoom=2.6)

# other style things
fig.update_layout(mapbox_style="carto-positron")
# since there are fewer points, make the dots bigger
fig.update_traces(marker={'size': 12})
# show plot
fig.show()

After the visualization with coordinates in degrees is done, longitude and latitude data can be converted to radians. Scaling must be done after the first visualization as well, since it will otherwise skew the average rating of national parks.

In [22]:
# convert the lat and long columns from degrees to radians
df['lat'] = df['lat'].apply(math.radians)
df['long'] = df['long'].apply(math.radians)

In [23]:
# alternate method of scaling data 
# create a list of the current columns you will scale
# including the binary variables to help level the playing field for the weighted dataframe
feat_list = ['popularity', 'length', 'elevation_gain',
       'difficulty_rating', 'visitor_usage', 'wildlife', 'dogs-no', 
            'views', 'kids', 'dogs-yes']

# create a list of titles for columns of the newly scaled features
scaled_feat = ['popularity_scaled', 'length_scaled', 'elevation_gain_scaled',
       'difficulty_rating_scaled', 'visitor_usage_scaled', 'wildlife_scaled', 'dogs-no_scaled', 
            'views_scaled', 'kids_scaled', 'dogs-yes_scaled'] 

# apply scaling: divide each column by its standard deviation
for feat, scaled in zip(feat_list, scaled_feat):
    df[scaled] = df[feat] / df[feat].std()

<a id='method'></a>
# Method

Our model recommends trails to users based on their preferences for certain features as well as how much they prioritize each of these features. Our model returns the names and features of the 5 most similar trails nationwide to the user’s preferences, as well as up to 5 trails most similar to the user’s preferences within the distance the user is willing to travel. 

To achieve this, we used [K-Means Clustering](https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1) and [K-Nearest Neighbors Classification](https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761). The most important aspect of this implementation is building a weighted data frame to cluster on. We start with a scaled data frame, so that all of the data is scale normalized and features like elevation gain, with values in the thousands, will not outweigh dummy variable features like kids, which have values of 0 or 1. Once we have this baseline, we alter the data frame so that the features a user prioritizes are weighted more heavily. To do this, we ask the user how much they value each feature (dog-friendliness, kid-friendliness, wildlife, views, popularity, length, visitor usage, elevation gain, and difficulty rating) on a scale of 1 to 5, where 1 means the feature is not important to them and 5 means it is very important. Then we multiply each column of the data frame by the user's rating of 1, 2, 3, 4, or 5 for that feature. 
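The weighting step can be sketched as follows. This is an illustration under assumptions, not the notebook's code: `apply_weights` is a hypothetical helper, the scaled values are made up, and plain dicts stand in for the data frame so the example runs on its own.

```python
def apply_weights(rows, weights):
    """Multiply each scaled feature by the user's 1-5 importance rating."""
    for feature, weight in weights.items():
        if weight not in (1, 2, 3, 4, 5):
            raise ValueError(f"importance for {feature!r} must be 1-5, got {weight}")
    return [{feature: value * weights[feature] for feature, value in row.items()}
            for row in rows]


# hypothetical scaled values for two trails
scaled_rows = [
    {'length_scaled': 1.5, 'kids_scaled': 0.0},
    {'length_scaled': 0.5, 'kids_scaled': 2.0},
]
# this user cares little about length and a great deal about kid-friendliness
weights = {'length_scaled': 1, 'kids_scaled': 5}

weighted = apply_weights(scaled_rows, weights)
print(weighted)
```

After weighting, the gap between the two trails on `kids_scaled` is five times larger, so that feature contributes more to any distance-based clustering.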

We can then run K-Means clustering on the weighted dataframe. As mentioned, we cluster on nine features: 1) dog-friendliness, 2) kid-friendliness, 3) wildlife, 4) views, 5) popularity, 6) length, 7) visitor usage, 8) elevation gain, and 9) difficulty rating. One pitfall of this method is that we had to assume which features would represent an average user’s interests, so a user who does not care at all about any of the dummy features (dog-friendliness, kid-friendliness, wildlife, views) will not get satisfying recommendations. We chose 20 clusters somewhat arbitrarily: the ideal number of clusters changes every time a user runs the model, because the clustering is run on a differently weighted dataframe each time. Even so, K-Means Clustering is useful here because every trail in the weighted dataframe has numerical data for each feature, so trails with the most similar numerical data are clustered together. 

Next, we gather the specific features a user wants their ideal trail to have. For the dummy variables (dog-friendliness, kid-friendliness, views, and wildlife) they answer yes or no, and for the numerical features (popularity, length, visitor usage, elevation gain, and difficulty rating) they enter their desired value. 
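
Before a numeric answer can be compared with the trail data, it has to be expressed in the dataset's units, divided by the column's standard deviation, and multiplied by the user's importance rating. The sketch below walks through that arithmetic for a single answer; the helper name `to_weighted_scale` and the example standard deviation are hypothetical, while the miles-to-meters constant is the one used in the notebook.

```python
METERS_PER_MILE = 1609.34  # conversion constant used in the notebook

def to_weighted_scale(answer, unit_constant, column_std, importance):
    """Convert a raw answer to the dataset's units, divide by the column's
    standard deviation, then apply the user's 1-5 importance rating."""
    return answer * unit_constant / column_std * importance

# hypothetical: the user wants a 3-mile trail, rates length 4 out of 5,
# and the length column (in meters) has a standard deviation of 8046.7
weighted_length = to_weighted_scale(3, METERS_PER_MILE, 8046.7, 4)
print(round(weighted_length, 2))
```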

Now that we have fit K-Means on the weighted dataframe and have the user’s inputs (scaled and weighted to match the dataframe), we can predict which cluster the user’s inputs fall in. This step is beneficial because it narrows all of the trails in the nation down to a pool of trails that are most similar to the user’s inputs. 
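
Predicting a cluster for the user's row amounts to finding the nearest centroid: once K-Means is fit, a new point is assigned to whichever cluster center is closest in Euclidean distance. The plain-Python sketch below illustrates that step only; the centroids and the `nearest_centroid` helper are hypothetical, and the notebook itself relies on scikit-learn's `KMeans.predict`.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

# hypothetical centroids in a 2-feature weighted space
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
# hypothetical user row, already scaled and weighted like the trail data
user_row = (4.0, 6.0)

predicted_cluster = nearest_centroid(user_row, centroids)
print(predicted_cluster)
```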

Once we have this cluster of trails most similar to the user's inputs, we use K-Nearest Neighbors to find the 5 trails nationwide most similar to the user’s ideal trail, measured by vector distance. We also recognize that some users may not be willing to travel across the country to hike, so we ask the user for the distance they are willing to travel. We then calculate the Haversine distance (the great-circle distance between two latitude/longitude coordinate pairs) between each trail and the user’s coordinates. Finally, we go through the trails in the cluster in order of similarity and check whether each trail's distance is within the user’s travel distance; we print up to the first 5 neighbors that are. 
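
The ranking and distance filter can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the notebook's implementation: the trail records, the `haversine_miles` and `recommend` helpers, and the coordinates are hypothetical, and the sketch uses the haversine form of the great-circle formula with a simple sort, whereas the notebook uses an arc-cosine form and scikit-learn's `NearestNeighbors`.

```python
import math

EARTH_RADIUS_MILES = 3963.0  # same radius constant the notebook uses

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def recommend(trails, user_vec, user_lat, user_lon, max_miles, k=5):
    """Rank trails by feature similarity, then keep the top k overall
    and the top k that lie within max_miles of the user."""
    ranked = sorted(trails, key=lambda t: math.dist(t["features"], user_vec))
    nationwide = [t["name"] for t in ranked[:k]]
    local = [t["name"] for t in ranked
             if haversine_miles(user_lat, user_lon, t["lat"], t["lon"]) <= max_miles][:k]
    return nationwide, local

# hypothetical trails from the user's cluster (weighted features, degrees)
trails = [
    {"name": "A", "features": (1.0, 2.0), "lat": 38.57, "lon": -109.55},
    {"name": "B", "features": (1.1, 2.1), "lat": 44.43, "lon": -110.59},
    {"name": "C", "features": (3.0, 0.5), "lat": 38.74, "lon": -109.50},
]
nationwide, local = recommend(trails, (1.0, 2.0), 38.60, -109.50, max_miles=100, k=2)
print(nationwide)
print(local)
```

Because the local list is filtered after ranking, a less similar but nearby trail can appear locally even when it misses the nationwide top k.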


In [24]:
class UserPreferences():
    """ takes the user's preferences and stores them in a meaningful way to use to predict and cluster on later
    
    Attributes:
        original_df(dataframe) = the original dataframe with all added columns and features with a normalized 
            scale
        weighted_features(dataframe) = an abridged and edited version of the original dataframe with only the 
            features the user will be asked about included, will be scaled up according to coefficient list to
            make the ML clustering attribute more weight to the features that the user cares about more
        user_row(list) = list of datapoints to mimic real data based on the user's indicated preferences
        coefficient_list(list) = a list of coefficients to multiply the normalized scale features in the 
            weighted_features dataframe by to represent the user's preferences, will also be used to scale the user's 
            preference data to mirror the manipulation we did of the original data
        travel_distance (int) = the ideal number of miles the user is willing to travel to reach a trail
        x_columns(list) = list of column names of the features the program will ask the user about


    """

    # make the columns we are looking at a class attribute 
    # x_columns(list) = list of column names of the features the program will ask the user about

    x_columns = ['trail_id', 'wildlife_scaled', 'dogs-yes_scaled', 'kids_scaled', 'views_scaled', 
                     'popularity_scaled', 'length_scaled', 'visitor_usage_scaled', 
                     'elevation_gain_scaled', 'difficulty_rating_scaled'] 
    
    def __init__(self, original_df, travel_distance = 50):
        
        self.original_df = original_df
        
        self.weighted_features = self.build_weighted_df()
        
        self.user_row = ['user_input']
        
        self.coefficient_list = []
        
        self.travel_distance = travel_distance
        
        self.lat = 0
        
        self.long = 0


    def build_weighted_df(self): 
        '''takes in the original dataframe to initialize a dataframe and populate it with scale-normalized features 
        
            Arguments: 
                None
            Returns: 
                weighted_features(df) = a simple dataframe that will change once we get the users input
        '''
        
        # initialize a dataframe and populate with scale-normalized features
        weighted_features = pd.DataFrame()
        
        # build a simple dataframe that we will change when we get the user's input
        for column in UserPreferences.x_columns:
            weighted_features[column] = self.original_df[column]
        
        # return new dataframe
        return weighted_features
        
                
    def get_user_input(self):
        """ takes the user's input to determine which coefficients to scale each column by, updates the 
            weighted_features dataframe as well as the coefficient_list based on user input
        
        Arguments:
            None
            
        Returns:
            None
        """
        # list of all of the features we will eventually cluster based on
        x_feat_list = ['wildlife_scaled', 'dogs-yes_scaled', 'kids_scaled', 'views_scaled', 'popularity_scaled', 
                       'length_scaled', 'visitor_usage_scaled', 
                       'elevation_gain_scaled', 'difficulty_rating_scaled']
        
        # an ordered list of the above features but in a nice format to present to the user when getting inputs
        x_feat_show_user = ['seeing wildlife', 'dog-friendliness', 'kid-friendliness', 'views', 'popularity',
                            'length', 'visitor usage', 'elevation gain', 'difficulty']
        

        

        # iterate through the features and get a number from 1 to 5 to represent how much they care about an
        # aspect of a trail
        # by multiplying the scale-normalized columns by these coefficients, we are able to manipulate the clustering
        # to value the higher scaled features more
        
        for feat_col, feat in zip(x_feat_list, x_feat_show_user):
            # secure valid user input; requery if input is invalid
            while True:
                # prompt user
                coefficient = input(f'On a scale of 1 to 5 how much do you value {feat} in a trail? ')
                # accept only whole numbers from 1 to 5
                if coefficient.isdigit() and 1 <= int(coefficient) <= 5:
                    break
                # otherwise print invalid input and reprompt
                print('invalid input')

            # append to coefficient list and apply weight to the scale-normalized values in the feature's column
            coefficient = int(coefficient)
            self.coefficient_list.append(coefficient)
            self.weighted_features[feat_col] = self.weighted_features[feat_col] * coefficient
                
        # initialize a list of the dummy features we are clustering on
        dummy_feat = ['wildlife_scaled', 'dogs-yes_scaled','kids_scaled','views_scaled']
        # initialize an empty list
        new_list = []
        
        # iterate over all column names in dummy list
        for column in dummy_feat:
            # make an array of all unique values in current column from dummy list
            unique_values = self.weighted_features[column].unique()
            for value in unique_values: 
                # append non-zero unique value (there will only be one, the scale normalized truth value) to a list
                if value != 0:
                    new_list.append(value)
        
        # iterate through the coefficient list changing the dummy coefficients to reflect the scale normalized truth value
        # this allows them to "punch above their weight" and stay competitive with the other scale normalized values like length
        # if these coefficients were left as numbers 1-5, the user's dummy feats would get overpowered and lost 
        for value in range(0, 4):
            self.coefficient_list[value] = new_list[value]            
                
                
           
           
    def get_user_location(self):
        """ takes the user's input to use as a starting location and stores each trail's distance (in miles)
            from that location in a 'distance' column
        
        
        Arguments:
            None
            
        Returns:
            None
        
        """
        # requery user until they give the correct type of input
        while True:
            # prompt user
            location = input('What are the coordinates of your location (latitude, longitude in degrees)? ').split(',')
            
            # if both values parse as numbers, store them and break
            try:
                self.lat = float(location[0])
                self.long = float(location[1])
                break
            
            # a non-numeric value or a missing comma both count as invalid
            except (ValueError, IndexError):
                print('invalid input')
                

        # https://www.geeksforgeeks.org/degrees-and-radians-in-python/
        lat_long = [math.radians(self.lat), math.radians(self.long)]
        distance_list = []
        # https://pythonexamples.org/pandas-dataframe-iterate-rows-iterrows/
        for index, row in self.original_df.iterrows():
            # https://www.w3schools.com/python/ref_math_acos.asp#:~:text=acos()%20method%20returns%20the,lie%20between%20%2D1%20to%201.
            # https://www.geeksforgeeks.org/program-distance-two-points-earth/#:~:text=For%20this%20divide%20the%20values,is%20the%20radius%20of%20Earth.
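            # trail coordinates are stored in radians; clamp the cosine to [-1, 1] so floating-point
            # rounding cannot push acos outside its domain when a trail sits at the user's location
            cos_angle = (math.sin(row['lat']) * math.sin(lat_long[0])) + math.cos(row['lat']) * math.cos(lat_long[0]) * math.cos(lat_long[1] - row['long'])
            d = 3963.0 * math.acos(max(-1.0, min(1.0, cos_angle)))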
            distance_list.append(d)
        self.original_df['distance'] = distance_list
        cols = self.original_df.columns.tolist()
        # https://stackoverflow.com/questions/13148429/how-to-change-the-order-of-dataframe-columns
        cols = cols[:5] + cols[-1:] + cols[6:-1]
        self.original_df = self.original_df[cols]
        
    def user_ideal_trail(self):
        """ takes the user's input to determine what to populate their row of preferences with
            this relies on a bit more hard coding to ensure that our prompts for the users are clear to populate
            a row with values scaled and expressed like the original data
        
        Arguments:
            None
        
        Returns:
            None
        """
        
        # initialize a list of what should look like the scale-normalized data in a given row of the original dataframe
        # this list does not account for the coefficients gathered above, but sets us up with a list that will work nicely
        # with the coefficient_list
        user_trail_list = []
        
        dummy_feat = ['wildlife', 'dog-friendly','kid-friendly','views']
        
        for feat in dummy_feat:
            while True:
                # prompts user for features
                pref = input(f'Do you want the trail to have the following feature: {feat}? (yes or no) ')
                # if the preference is yes or no break loop
                if pref == 'yes' or pref =='no':
                    break
                else:
                    # if not, then reprompt
                    print('invalid input')
                    
            # append to user trail list if yes or no
            if pref == 'yes':
                user_trail_list.append(1)
            else:
                user_trail_list.append(0)
    
    
                
        # make list of things to ask user
        input_list = [f'How popular do you want the trail to be on a scale of 1 to 10?(10 being the most popular) ', 
                     f'How many miles long do you want the trail to be? ',
                     f'How much usage should the trail have on a scale of 1 to 10 (10 being the most used)? ',
                     f'What elevation gain (in meters) do you want in a trail? ',
                     f'How difficult do you want the trail to be on a scale of 1 to 10 (10 being the most difficult)? ']
        
        # make list of constants to multiply input by so it matches the original scale
        constant_list = [85/10, 1609.34, 4/10, 1, 7/10]
        
        # make list of column feature aligns with in original dataframe 
        column_list = ['popularity', 'length', 'visitor_usage', 'elevation_gain', 'difficulty_rating']
        
        
        # find out the preference of the user for each feature 
        for i in range(5): 
            
            while True:
                
                # prompts user for features
                pref = input(input_list[i])
                
                # reject anything that is not a non-negative whole number
                if not pref.isdigit():
                    print('invalid input')
                
                # popularity, usage, and difficulty (indices 0, 2, and 4) must be between 1 and 10
                elif i in (0, 2, 4): 
                    if 1 <= int(pref) <= 10:
                        break
                    else:
                        print('invalid input')
                # miles and elevation GAIN (indices 1 and 3) only need to be non-negative,
                # which isdigit() already guarantees
                else:
                    break
                
            # scale the input to match the original scale of feature in dataframe
            pref = int(pref) * constant_list[i]
            
            # divide by the original dataframe's std. deviation to set it to the same scale on the back end 
            pref = pref/self.original_df[column_list[i]].std()
            
            # add to list 
            user_trail_list.append(pref)
            
            

        # zip coefficients and user preferences and then multiply for a list of user data scaled to match
        # the weighted preferences of the user
        for num1, num2 in zip(user_trail_list, self.coefficient_list):
            self.user_row.append(num1 * num2)
    
            
            
    def add_cluster_value(self, cluster_val):
        """ Adds a value to the user_row attribute to later populate the 'cluster' column in a dataframe
        
        Arguments:
            cluster_val(int): the number of the cluster that the user_row is predicted to be a part of
            
        Returns:
            None
        """
        # add cluster to the dataframe
        # having this function makes it easier for the UserPreferences and ClusterPredict objects to interact
        self.user_row.append(cluster_val)

In [25]:
class ClusterPredict(UserPreferences):
    """ extends UserPreferences to cluster the weighted trail data, predict the user's cluster, and find the
        user's nearest-neighbor trails within that cluster
    
    Attributes:
        predicted_cluster(int): the number of the cluster that the user's data is predicted to be in
        x_feat_list(list): list of the features/columns we are clustering based off of
        cluster_df(dataframe): a dataframe built from a boolean that is populated only with the trails from the 
            same cluster as the user's data
        super_df (dataframe): a merged dataframe of weighted_features dataframe and the original dataframe
        nn_trail_id_list(list): ordered list of the trail_ids of the nearest neighbors trails to the user's data within the cluster
    """
    
    def __init__(self, original_df, travel_distance = 50):
        
        super().__init__(original_df=original_df, travel_distance=travel_distance)

        self.predicted_cluster = None
        
        self.x_feat_list = ['wildlife_scaled', 'dogs-yes_scaled', 'kids_scaled', 'views_scaled', 'popularity_scaled', 
                       'length_scaled', 'visitor_usage_scaled', 
                       'elevation_gain_scaled', 'difficulty_rating_scaled']
        
        self.cluster_df = pd.DataFrame()
        
        self.super_df = pd.DataFrame()
        
        self.nn_trail_id_list = []
        
        
    def make_clusters_and_predict(self):
        """ runs a k means cluster on the weighted data to group similar trails based on "more important" features
        
        Arguments:
            None
            
        Returns:
            None
        """
        # imports
        import numpy as np
        from sklearn.cluster import KMeans
        
    
        # print(self.original_df.loc[:, self.x_feat_list].var())
        
        # call on user_pref class to get the weighted_features dataframe
        weighted_df = self.weighted_features
        
        # a reasonable number of clusters for 3,000 pieces of data
        n_clusters = 20
        
        # x features called in the dataframe
        x = weighted_df.loc[:, self.x_feat_list].values

        # clustering
        kmeans = KMeans(n_clusters=n_clusters)
        kmeans.fit(x)
        y = kmeans.predict(x)
        
        # add a column to the dataframe
        weighted_df['cluster'] = y
        
        # grab the user_row inherited from UserPreferences
        user_row = self.user_row
        
        # have to change it into an array so that the kmeans predict can operate on it
        # (skip the first entry, which is the 'user_input' identifier)
        feature_array = np.array(user_row[1:])
        feature_array = feature_array.reshape(1,-1)
        # predict returns a one-element array, so pull out its single value
        self.predicted_cluster = int(kmeans.predict(feature_array)[0])
        
        # add predicted cluster to the user_row
        self.add_cluster_value(self.predicted_cluster)
        
        # create a boolean to build a dataframe with only trails belonging to the same cluster
        s_bool = weighted_df.loc[:, 'cluster'] == self.predicted_cluster
        # renumber the rows of the slice so the user's row is appended instead of overwriting a trail
        # whose original index label happens to equal the length of the cluster
        self.cluster_df = weighted_df.loc[s_bool, :].reset_index(drop=True)
        # adding the user's data as the last row of this dataframe
        self.cluster_df.loc[len(self.cluster_df)] = self.user_row 

            
        
    def find_nn_in_cluster(self):
        """ works from the cluster dataframe to find the top 5 nearest neighbors to the user's preference data
        
        Arguments:
            None
            
        Returns:
            None
        """
        import numpy as np
        # every column counts except for trail_id
        X = np.array(self.cluster_df.iloc[:,1:])
        
        # run nearest neighbors fit
        from sklearn.neighbors import NearestNeighbors
        knn = NearestNeighbors(n_neighbors=len(self.cluster_df.index))
        knn.fit(X)
        user_data = X[-1].reshape(1,-1)
        nn_idx_array = knn.kneighbors(user_data, return_distance=False)
        
        # now we have an array of row positions within cluster_df, ordered from nearest to farthest
        # look up the trail_id at each position so we can return meaningful information to the user
        for idx_list in nn_idx_array:
            for num in idx_list:
                self.nn_trail_id_list.append(self.cluster_df.iloc[num,0])

        
    def return_user_trails(self):    
        """ returns a printed out summary of the recommended trails we found for the user
        
        Arguments:
            None
        
        Returns:
            None
        """

        # initialize a counter to keep track of the number of local recs printed
        counter1 = 1
        
        print('The following are the most recommended trails within your travel distance based on your preferences and values:')
        print()
        # iterate over all trail IDs of nearest neighbors in the user's cluster
        for idx in range(1,len(self.nn_trail_id_list)):
            # skips the user's input with an arbitrary trail_id identifier
            trail_id = self.nn_trail_id_list[idx]
            trail_bool = self.original_df.loc[:, 'trail_id'] == trail_id
            # pull column values we care about from original df
            df_row = self.original_df.loc[trail_bool, :].iloc[0, 0:18]
            # check that current trail is within distance and that fewer than 6 recommendations have been made so far
            if df_row['distance'] <= self.travel_distance and counter1<6:
                # print trail specs
                print('The number ', counter1,' trail we recommend within', self.travel_distance, 'miles of your location is: ', df_row['name'])
                print('\t It is located in ', df_row['city_name'],',',df_row['state_name'])
                print('\t And it has the following features:', df_row['features'])
                print('\t And activities:', df_row['activities'])
                print('\t It is', round(df_row['distance'],2), ' miles from your location.')
                print('\t It has a popularity score of ', round(((df_row['popularity']/85) * 10),2), ' out of 10.')
                print('\t It is ', round((df_row['length']/1609), 2), 'miles long.')
                print('\t It has a visitor usage score of', round(((df_row['visitor_usage']/4) * 10), 2), 'out of 10.')
                print('\t The trail gains', df_row['elevation_gain'], 'meters of elevation.')
                print('\t And it has a difficulty rating of', round(((df_row['difficulty_rating']/7) * 10),2), 'out of 10.')

                # update counter
                counter1 += 1    
            
                
        print('The following are the most recommended trails nationwide based on your preferences and values:')
        print()
        
        # iterate over all trail IDs of nearest neighbors in the user's cluster
        for idx in range(1,len(self.nn_trail_id_list)):
            # skips the user's input with an arbitrary trail_id identifier
            trail_id = self.nn_trail_id_list[idx]
            trail_bool = self.original_df.loc[:, 'trail_id'] == trail_id
            # pull column values we care about from original df
            df_row = self.original_df.loc[trail_bool, :].iloc[0, :]
            # pull the nearest neighbor if it is one of the first 5 (closest 5 to the user's input) neighbors in the list
            if idx <= 5:
                # print trail specs
                print('The number ', idx,' trail nationwide we recommend for your preferences is: ', df_row['name'])
                print('\t It is located in ', df_row['city_name'],',',df_row['state_name'])
                print('\t And it has the following features:', df_row['features'])
                print('\t And activities:', df_row['activities'])
                print('\t It is', round(df_row['distance'],2), ' miles from your location.')
                print('\t It has a popularity score of ', round(((df_row['popularity']/85) * 10),2), ' out of 10.')
                print('\t It is ', round((df_row['length']/1609), 2), 'miles long.')
                print('\t It has a visitor usage score of', round(((df_row['visitor_usage']/4) * 10), 2), 'out of 10.')
                print('\t The trail gains', df_row['elevation_gain'], 'meters of elevation.')
                print('\t And it has a difficulty rating of', round(((df_row['difficulty_rating']/7) * 10),2), 'out of 10.')
       

    def return_user_plot(self):
        """ plots a plotly map of the user's trail cluster; also builds super_df, which is a merge between the weighted_features
        dataframe and the original one so that we can access the original trail information
        
        Arguments:
            None
        
        Returns:
            None
            
        """
        import pandas as pd
        self.super_df = pd.merge(self.original_df, self.weighted_features, on='trail_id')
        self.super_df['color'] = 'Original Dataset'
        
        # add empty row of correct length for user data
        df_user = pd.DataFrame([[0]*self.super_df.shape[1]],columns=self.super_df.columns)
        self.super_df = pd.concat([self.super_df, df_user], ignore_index=True)

        
        # populate the empty row with user row data                         
        for idx in range(1, len(self.user_row)):
            self.super_df.iloc[len(self.super_df)-1,-12+idx] = self.user_row[idx]
            
        # add some other nice elements to row for plotting purposes
        self.super_df.iloc[len(self.super_df)-1,0] = 'user_id'
        self.super_df.iloc[len(self.super_df)-1,1] = 'User Input'
        self.super_df.iloc[len(self.super_df)-1,-1] = 'User Input'

                                 
        # get lat and longitude back into degrees (math.pi avoids a truncated value of pi)
        self.super_df['long'] = self.super_df['long']*180/math.pi
        self.super_df['lat'] = self.super_df['lat']*180/math.pi
           
        # after scaling the rest of the data add the data from user
        self.super_df.iloc[len(self.super_df)-1,64] = self.lat
        self.super_df.iloc[len(self.super_df)-1,65] = self.long
        
        # create a boolean to build a dataframe with only trails belonging to the same cluster
        cluster_bool = self.super_df.loc[:, 'cluster'] == self.predicted_cluster
        
        plot_df = self.super_df.loc[cluster_bool, :]      
        
        import plotly.express as px
        fig = px.scatter_mapbox(plot_df,
                                lat="lat",
                                lon="long",
                                hover_name='name',
                                hover_data=['trail_id','distance','city_name','state_name','features','activities','cluster'], 
                                title="All Recommended Trails Based Off of Your Preferences",
                                color= 'color',
                                opacity=.8,zoom=2.6)

        fig.update_layout(mapbox_style="carto-positron")
        fig.update_traces(marker={'size': 5})
        fig.show()
    

    def run_program(self):
        """ calls the functions of the involved class objects in a meaningful order for the user
        
        Arguments:
            None
        
        Returns:
            None
        """
        print('Please answer the following questions about what you value in a trail.')
        # grab user's location
        self.get_user_location()
        # grab data on how much the user cares about each feature
        self.get_user_input()
        print()
        print('Please answer the following questions about your trail preferences.')
        # grab user's ideal trail specs
        self.user_ideal_trail()
        # cluster based on weighted data created from inputs
        self.make_clusters_and_predict()
        # find nearest neighbors
        self.find_nn_in_cluster()
        print()
        # return nearest neighbors
        self.return_user_trails()
        # generate an intuitive/interactive results plot
        self.return_user_plot()
        
    def cluster_pca(self):
        """ creates a PCA plot of the user's preferred trail's cluster
        
        Arguments:
            None
        
        Returns:
            None
        
        """
        from sklearn.decomposition import PCA
        import plotly.express as px

        merged_feat_list = ['wildlife_scaled_y', 'dogs-yes_scaled_y', 'kids_scaled_y', 'views_scaled_y', 'popularity_scaled_y', 
                       'length_scaled_y', 'visitor_usage_scaled_y', 
                       'elevation_gain_scaled_y', 'difficulty_rating_scaled_y']
        
        # create a boolean to build a dataframe with only trails belonging to the same cluster
        cluster_bool = self.super_df.loc[:, 'cluster'] == self.predicted_cluster
        
        # filter super_df down to the relevant cluster; copy it so that adding the PCA columns
        # below does not write to a slice of super_df
        new_df = self.super_df.loc[cluster_bool, :].copy()
        
        # pull data for plot
        x = new_df.loc[:, merged_feat_list]
        pca = PCA(n_components=2, whiten=True)
        x_compress = pca.fit_transform(x)
        # add features back into dataframe (for plotting)
        new_df['pca0'] = x_compress[:, 0]
        new_df['pca1'] = x_compress[:, 1]

        # create interactive scatterplot of compressed features using plotly
        fig = px.scatter(new_df, x='pca0', y='pca1', hover_data=['trail_id','distance','name','area_name','length', 'difficulty_rating','features','activities', 'cluster'], color='color',
                        title= 'PCA Graph of Just the Cluster that Includes User Input')
        # plot 
        fig.show()
        
    def big_pca(self):
        """ creates a PCA of the entire super_dataframe to verify if clustering looks correct
        
        Arguments:
            None
        
        Returns:
            None
        """
        # import packages
        from sklearn.decomposition import PCA
        import plotly.express as px
        
        # create big_df
        big_df = self.super_df
        
        # columns from weighted features got new names with the data merge
        merged_feat_list = ['wildlife_scaled_y', 'dogs-yes_scaled_y', 'kids_scaled_y', 'views_scaled_y', 'popularity_scaled_y', 
                       'length_scaled_y', 'visitor_usage_scaled_y', 
                       'elevation_gain_scaled_y', 'difficulty_rating_scaled_y']

        # get all from merged_feat_list for cluster
        x = big_df.loc[:, merged_feat_list]
        
        # initialize model and fit
        pca = PCA(n_components=2)
        x_pca = pca.fit_transform(x)
        
        # create pca0 and pca1 columns
        big_df['pca0'] = x_pca[:, 0]
        big_df['pca1'] = x_pca[:, 1]

        # plot 
        fig = px.scatter(big_df, x='pca0', y='pca1', hover_data=['trail_id','distance','name','area_name','length', 'difficulty_rating','features','activities'], color='cluster', title = 'PCA Graph of All Trail Datapoints in Dataset')
        fig.show()

In [26]:
def run_recommendation():
    '''
    gathers a user's desired travel distance and then runs the program
    
    Args: 
        None
    Returns: 
        predict(obj) = an object of the class ClusterPredict
    '''
    while True:       
        # prompts user for preferred travel distance
        travel_distance = input('Welcome to the National Park Recommendation System!\n'
                                'By entering your preferences, we can recommend the '
                                'National Park trails that best align with your preferences nationwide,\n'
                                'and within a certain travel distance!\n\n'
                                'Please input how many miles you are willing to travel from your location, '
                                'in the form of a whole number: ')
            
        # make sure travel distance is an int, if not, requery the user:
        if not travel_distance.isnumeric():
        
            print('invalid input')
                
        else:
            break
    
    # convert travel_distance to correct input needed for ClusterPredict
    travel_distance = int(travel_distance)
    
    # run program
    predict = ClusterPredict(original_df = df, travel_distance = travel_distance)
    predict.run_program()
    
    return predict

In [27]:
example1 = run_recommendation()

Please answer the following questions about what you value in a trail.

Please answer the following questions about your trail preferences.

The following are the most recommended trails within your travel distance based on your preferences and values:

The number  1  trail we recommend within 100 miles of your location is:  Oak Flat Loop Trail
	 It is located in  Montrose , Colorado
	 And it has the following features: ['forest', 'kids', 'views', 'wild-flowers', 'wildlife']
	 And activities: ['birding', 'hiking', 'nature-trips', 'trail-running', 'walking']
	 It is 98.87  miles from your location.
	 It has a popularity score of  1.64  out of 10.
	 It is  1.3 miles long.
	 It has a visitor usage score of 2.5 out of 10.
	 The trail gains 94.7928 meters of elevation.
	 And it has a difficulty rating of 4.29 out of 10.
The following are the most recommended trails nationwide based on your preferences and values:

The number  1  trail nationwide we recommend for your preferences is:  Jordan

In [35]:
example2 = run_recommendation()

Please answer the following questions about what you value in a trail.

Please answer the following questions about your trail preferences.

The following are the most recommended trails within your travel distance based on your preferences and values:

The number  1  trail we recommend within 100 miles of your location is:  Delicate Arch Trail
	 It is located in  Moab , Utah
	 And it has the following features: ['dogs-no', 'kids', 'partially-paved', 'views', 'wild-flowers', 'wildlife']
	 And activities: ['birding', 'hiking', 'nature-trips', 'rock-climbing', 'walking']
	 It is 1.48  miles from your location.
	 It has a popularity score of  7.49  out of 10.
	 It is  3.1 miles long.
	 It has a visitor usage score of 7.5 out of 10.
	 The trail gains 186.8424 meters of elevation.
	 And it has a difficulty rating of 4.29 out of 10.
The number  2  trail we recommend within 100 miles of your location is:  Devils Garden Loop Trail with 7 Arches
	 It is located in  Thompson , Utah
	 And it has 

<a id='results'></a>
# Results

First graph: This graph shows all the trails that are in the same cluster as the user’s “ideal” trail (a hypothetical trail populated with the features the user desires and scaled by the importance the user assigned to each feature). It lets a user see where each trail that is similar to their preferences (and thus in their cluster) sits in the United States relative to their location. This graph does not help evaluate our model’s performance; it is more a convenience for the user and an interesting visualization of our clustering results. 

Second graph: This graph shows a “big PCA”, which is a principal component analysis on the entire dataset, as scaled and weighted based on the user’s preferences on our 9 features. It projects the 9 features onto two dimensions. The color bar on this graph tells a viewer where each cluster of data lies on the PCA graph. A viewer can zoom in to see a representation of the vector distance between the trails within each cluster (how similar they are to each other). The graph also gives a broad check on how well the clustering performed: trails from the same cluster should appear grouped together on the PCA plot. 
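
As a rough illustration of the pipeline behind this graph, the sketch below scales a feature matrix, multiplies each column by a user weight, clusters the weighted data with K-Means, and projects it onto two dimensions with PCA. The function name `weighted_cluster_pca`, the synthetic data, the number of clusters, and the use of `StandardScaler` are assumptions made for this example; it is not the project's actual `ClusterPredict` code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def weighted_cluster_pca(x, weights, n_clusters=3, seed=0):
    """Scale features, apply per-feature weights, cluster, and project to 2D."""
    # standardize each column to zero mean and unit variance
    x_scaled = StandardScaler().fit_transform(x)
    # multiply each column by the user's importance weight for that feature
    x_weighted = x_scaled * np.asarray(weights, dtype=float)
    # cluster in the weighted space, so heavier features dominate distances
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(x_weighted)
    # project the same weighted data onto its first two principal components
    x_pca = PCA(n_components=2).fit_transform(x_weighted)
    return labels, x_pca


# synthetic stand-in for the trail table: 200 rows, 9 features (not real data)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 9))
weights_demo = [5, 5, 1, 1, 1, 1, 1, 1, 1]  # hypothetical "polarized" user
labels_demo, x_pca_demo = weighted_cluster_pca(x_demo, weights_demo)
print(x_pca_demo.shape, sorted(set(labels_demo.tolist())))
```

Because the weights are applied before both the clustering and the PCA, changing `weights_demo` changes the cluster assignments and the shape of the 2D plot, which is why each user sees a different graph.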

Third graph: This graph shows a “small PCA”, which is a principal component analysis only on the trails that fall within the same cluster as the user’s “ideal” trail, again reducing the 9 features to two dimensions. The graph has the user’s ideal trail plotted in red, and everything else plotted in blue. This allows a viewer to get an idea of the vector distances and degree of similarity between the different trails in the same cluster as the user’s ideal trail, as well as the features of those trails. It also allows a viewer to easily see where the nearest neighbors in the cluster are. With some error, it is a good way to visualize the 5 nearest neighbors, because the points closest to the user input point on the PCA will generally be those neighbors. 
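
The nearest-neighbor step within a cluster can be sketched as a brute-force search: rank every trail in the cluster by its Euclidean distance to the ideal trail and keep the closest few. The function `nearest_trails`, the trail names, and the feature values below are hypothetical, and the project itself may rely on a library implementation rather than this hand-written version.

```python
import math


def nearest_trails(ideal, trails, k=5):
    """Return the names of the k trails closest to the ideal point."""
    # rank every trail in the cluster by Euclidean distance to the ideal trail
    ranked = sorted(trails.items(), key=lambda item: math.dist(ideal, item[1]))
    return [name for name, _ in ranked[:k]]


# hypothetical weighted feature vectors for trails in one cluster (made-up values)
cluster_trails = {
    'trail_a': (0.1, 0.2, 0.0),
    'trail_b': (2.0, 2.0, 2.0),
    'trail_c': (0.0, 0.0, 0.1),
    'trail_d': (1.0, 1.0, 1.0),
}
ideal_trail = (0.0, 0.0, 0.0)

print(nearest_trails(ideal_trail, cluster_trails, k=2))
```

The search runs on the full feature vectors, while the PCA plot keeps only two components. That is the source of the "some error" mentioned above: two points can look close in 2D yet be further apart in the original space.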


In [36]:
example1.big_pca()

In [37]:
example1.cluster_pca()

# Analysis of Results - Example 1

Given this polarized test case based in Millard County, Utah, United States, the results are promising. The user’s preferences gave kid and dog friendliness the highest priority and every other feature a low one. The returned trails permitted dogs and were enjoyable for kids. Oak Flat Loop Trail, the local recommended trail, is 98.87 miles from the user’s location, which is within the desired travel distance of 100 miles. Its difficulty rating of 4.29 is also close to the user’s preferred rating of 4. The nationwide trails are far from the user’s current location, but their attributes and features still resemble the user’s polarized preferences. With polarized preferences, the model performs well: the weighting makes the heavily prioritized features dominate the others. 

We also see the weights coming into play in the PCA that includes all of the data. Since the priorities are so much higher for kids and dogs, this weighting produces a large distance between the clusters. Since these are both binary variables, their axes are quite distinct and clearly divide the clusters on this graph into those that are dog friendly or kid friendly and those that are not.
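
To make the effect of the weights concrete, here is a minimal sketch of how multiplying a feature by its weight changes Euclidean distance. The three-feature layout, the values, and the helper `weighted_distance` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math


def weighted_distance(a, b, weights):
    """Euclidean distance after multiplying each feature by its weight."""
    return math.sqrt(sum((w * (x - y)) ** 2 for x, y, w in zip(a, b, weights)))


# hypothetical scaled features: [dog_friendly, kid_friendly, length]
ideal = [1.0, 1.0, 0.0]
no_dogs_same_length = [0.0, 1.0, 0.0]
dogs_longer_trail = [1.0, 1.0, 1.0]

even = [1, 1, 1]        # every feature equally important
polarized = [5, 5, 1]   # dogs and kids weighted 5, length weighted 1

for name, w in [('even', even), ('polarized', polarized)]:
    d_no_dogs = weighted_distance(ideal, no_dogs_same_length, w)
    d_longer = weighted_distance(ideal, dogs_longer_trail, w)
    print(name, d_no_dogs, d_longer)
```

With even weights the two candidate trails are equally far from the ideal trail. With the polarized weights, the trail that lacks the dog-friendly feature is five times further away than the one that is merely longer, so it is pushed into a different region of the space.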


In [33]:
example2.big_pca()

In [34]:
example2.cluster_pca()

# Analysis of Results - Example 2

When we have a more convoluted case, where the user does not have one standout aspect that they care about significantly more than the rest, the returned trails are more varied in their features. In the second example, the user gives seeing wildlife and a trail length of 3 miles a weight of 5, but also gives two other features a weight of 4 and two more a weight of 3. As a result, if there is no perfect match for the user, some of their preferences may not be accommodated, because multiple features have been designated as relatively important. The first returned trail nationwide, Delicate Arch Trail, does not match the user’s preference for dogs, but it is 3.1 miles long and does have wildlife as a feature. The program still does a good job of prioritizing the main preferences, but with limited trails to choose from, the nearest neighbors search cannot guarantee a perfect match, because such a trail might not exist.

Perhaps a future version of the model that incorporates a feedback loop could help mitigate this issue. If a user could tell us whether or not they liked a trail that our model identified as close to their preferences but that lacked some features they wanted, we could determine whether the model still works in these convoluted cases, or learn what needs to be fixed. 

You can see this in the PCA of the full dataset: the points are more concentrated in one area than in the PCA for example 1, so the clusters are a little less distinct and more varied. While the user will still get the trails that best match their preferences, the clustering is working with so many competing variables that not every preference will necessarily be present in every trail.


<a id='discussion'></a>
# Discussion

<a id='interpretations'></a>
### *Interpretation of Graphic Results*

With every user input, we see a different set of PCA graphs, because each user’s unique preferences create a new weighted dataframe to be clustered, and so the clustering itself changes. For the PCA including all data points, we observed that more “polarized” user inputs produce more defined clumps. For example, if someone cared greatly about a trail being dog friendly, kid friendly, and having a difficulty level of 4, but little about any other feature, we see the PCA divided into distinct subsets of clusters by those binary variables. However, if the features are weighted more consistently, so that the model clusters on more “competitively” weighted data, the data is more concentrated in one region. We still see spatial separation between the clusters, so the clustering appears to work well on our data and to group similar trails that fall close to each other by vector distance. The difference between the two input examples’ big PCA graphs may be explained by particularly “polarized” inputs producing points and clusters that are more self-contained, more tightly grouped, and further apart from each other, because the clustering is most heavily influenced by just a couple of features.

You can also see some of the scaling at work in the PCA when the data appears to stretch in a given direction. This can be attributed to the scaling, where some variables determine the clusters more than others. Also, if you zoom in and examine one cluster, you can see that points that are visually closer to each other have similar features and characteristics.

We see the value of the nearest neighbors in our second PCA graph, which includes only the data points from the same cluster as the user’s input. The points that appear visually closer to the user’s input on this PCA resemble the recommendations given by the model (the nearest neighbors). This leads us to believe that the recommendation system works well at matching trails to each other as well as to the user’s preferences.

One quirk that we have more trouble understanding is the tendency of the cluster PCA for polarized user inputs to show points arranged in diagonal lines, whereas more consistently weighted user inputs generate a more random scatter of points. We think this may be because the trails in a cluster for a polarized input are more uniform on the heavily weighted features, while those in a cluster for a more consistently weighted input differ more from each other as the clustering tries to account for all of the different features and their importances. 

<a id='takeaways'></a>
### *Takeaways*

Our model improves on the AllTrails search functionality by allowing the user to prioritize certain features over others. This is especially useful for users who want only a specific aspect of a trail, or a particular combination of features, in their recommendation. While our model improves on AllTrails in this respect, the AllTrails feature that estimates how long a trail takes to hike is a very important metric that could be considered for future implementations of our model. Many people plan other things to do in their day when they visit National Parks, and may not want to dedicate an entire day to one trail. An estimated duration variable can give users insight into how long and difficult the trail may be.   


While the results of our recommendation model are promising, many factors still need to be considered for a widescale implementation, including the cost of travel, disability accessibility, and health needs. The nationwide recommendations fit the preferences given by the user but can be anywhere in the country; these trails can be too far away, and the cost of travel can be unreasonable at times. We also made some broad assumptions about who would be using our model when choosing the features to cluster on, and it largely caters to a stereotypical family without accommodating a wide audience of users with different interests and needs. For example, if someone is looking for a trail to snowshoe or do another "niche" activity, our program does not cater to them. Similarly, some of the trails recommended by our model can be more difficult for people who have disabilities or health issues, such as asthma: the trails may be too high in elevation or may not be wheelchair accessible. Our dataset includes features about paved paths and ADA accessibility that we did not use due to the limited scope and time frame of this project. While our recommendation system tries to account for as many features as possible, generalizations had to be made to simplify the model and prevent too many variables from confounding our findings.


A good way to improve our model would be to incorporate a feedback mechanism that collects information from users after they have hiked a recommended trail. This information would indicate whether the model truly works the way we want it to and whether users are getting trails that fit their preferences. 


Another coding takeaway involves the scaling of dummy variables. There are different schools of thought on whether to scale normalize dummy variables. In this context, we found that leaving them unscaled in the dataframe we were clustering on meant that dummy variable preferences could not be prioritized in the same way as the continuous variables. When we weren't normalizing the dummy variables, we could put 'dogs-yes' at a priority of 5, for example, and the rest of the elements at a priority of 1, and still not get dog-friendly trails in our output. After we scaled these elements as well, the recommendation system did a better job of returning trails with these features prioritized. We learned that the choice of whether to scale normalize dummy variables is extremely context dependent.
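
One possible explanation for this behavior can be sketched with a single dummy column. The example below assumes z-score standardization (subtract the mean, divide by the standard deviation), which may not be the exact scaler used in the project, and the `dogs_yes` column is made-up data. A 0/1 column has a standard deviation of at most 0.5, so next to continuous features standardized to a standard deviation of 1, an unscaled dummy contributes less to distances than its weight suggests.

```python
import statistics


def standardize(values):
    """Z-score a column: subtract the mean and divide by the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# hypothetical 'dogs-yes' dummy column where 2 of 10 trails allow dogs
dogs_yes = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scaled = standardize(dogs_yes)

# gap between a dog-friendly and a non-dog-friendly trail on this one feature
raw_gap = max(dogs_yes) - min(dogs_yes)
scaled_gap = max(scaled) - min(scaled)
print(raw_gap, round(scaled_gap, 2))
```

In this toy column the gap between a dog-friendly and a non-dog-friendly trail grows from 1 to about 2.5 after standardization, and the user's weight then multiplies that larger gap. Under a different scaler, such as min-max scaling, a 0/1 column would be left unchanged, which is one reason the choice is context dependent.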