# COGS 108 - Final Project 

# Overview

One of San Diego County's attractions is the multitude of hikes available to visitors. There are so many to choose from that people want to know where to find the best ones. Using data generated from Yelp reviews of San Diego park locations, I looked into which parks had the highest ratings and where they were located. These locations were then plotted on a map of San Diego County to see whether there are any patterns in the distribution of top-rated San Diego County hikes.

# Name & GitHub

- Name: Andrea Mansilla
- GitHub Username: amansilla05

# Research Question

In what regions of San Diego County are the best (top rated) hikes located?

## Background and Prior Work

**Background:** I spent some time looking up trails in San Diego County and reading reviews on AllTrails, a website similar to Yelp but focused only on hiking locations. I also went through the data available on GitHub to see which dataset(s) best fit my objective of finding the locations of the most popular hikes.

**Prior Work:** The work I did throughout the course helped prepare me to do my own data cleaning and subsequent analysis. I utilized the tutorials on GitHub as well as all of the lecture slides available to us. The experience of completing earlier projects gave me the skills to know which type of analysis to use and which packages would produce the appropriate results.

References:
- 1) https://www.alltrails.com/us/california/san-diego
- 2) http://hikingsdcounty.com/hiking-trails-in-san-diego-county-map-view/

# Hypothesis


The most popular hikes in San Diego County will be found along the coast or further inland; there will be no clustering towards the middle of the city.

# Dataset(s)

- Dataset Name: **yelp_SD_parks.csv**
- Link to the dataset: https://raw.githubusercontent.com/COGS108/individual_fa20/master/data/yelp_SD_parks.csv
- Number of observations: 834 observations

The dataset contains information from the Yelp pages of San Diego County parks, trails, playgrounds, and other recreation facilities. This information includes the name, address, phone number, and latitude/longitude coordinates of each park, as well as the Yelp id, url, rating, and number of reviews.

- Dataset Name: **parks_sd.geojson**
- Link to the dataset: https://data.sandiego.gov/datasets/park-locations/
- Number of observations: not determined (see the description below)

The data is displayed as blue shaded regions representing parks on a map of San Diego County; because of this format, it is difficult to see exactly how many observations there are. Each park has the following information: objectid, name (short name), alias (long name), gisacres (calculated acreage), park type (Historic, Local, National, Open Space, Other, Regional, or State), location, and owner.

# Setup

In [1]:
## imports
import pandas as pd
import numpy as np
##import geopandas as gpd
import matplotlib.pyplot as plt
import geojson
import folium
%matplotlib inline

## reading the data 
yelp_df = pd.read_csv('https://raw.githubusercontent.com/COGS108/individual_fa20/master/data/yelp_SD_parks.csv')
##parks_map = gpd.read_file('https://raw.githubusercontent.com/COGS108/individual_fa20/master/data/parks_datasd.geojson')

In [2]:
# raise the number of rows displayed so the data is easier to check
# (the full option name avoids an ambiguous-key error in newer pandas versions)
pd.set_option("display.max_rows", 3000)

In [3]:
# check df
yelp_df.head()

Unnamed: 0,name,address,phone,id,url,rating,review_count,longitude,latitude,is_closed
0,Balboa Park,"1549 El Prado San Diego, CA 92101",16192390000.0,9M_FW_-Ipx93I36w-_ykBg,https://www.yelp.com/biz/balboa-park-san-diego...,5.0,2105,-117.15315,32.734502,False
1,Civita Park,"7960 Civita Blvd San Diego, CA 92108",,3AEHjqNrTmggA6G9VdhQfg,https://www.yelp.com/biz/civita-park-san-diego...,4.5,46,-117.147278,32.778315,False
2,Waterfront Park,"1600 Pacific Hwy San Diego, CA 92101",16192330000.0,3unbJeYrn1RmInZGmjp80g,https://www.yelp.com/biz/waterfront-park-san-d...,4.5,242,-117.172479,32.721952,False
3,Trolley Barn Park,"Adams Ave And Florida St San Diego, CA 92116",,PvHxIYrmaiFKdWUDTMDzcg,https://www.yelp.com/biz/trolley-barn-park-san...,4.5,102,-117.143789,32.762463,False
4,Bay View Park,"413 1st St Coronado, CA 92118",,6IF4VB9-fkv_F-LBvG8ppQ,https://www.yelp.com/biz/bay-view-park-coronad...,5.0,42,-117.178967,32.701785,False


# Data Cleaning

1. Use the function `is_hike` to find which entries in the dataframe are hikes, based on keywords in the name; non-hike entries are replaced with None.
2. Drop unnecessary columns from the dataframe.
3. Drop rows with None values.
4. Sort the rows alphabetically by name.
5. Reset the index.
6. Drop rows with a rating below 4.0 and reset the index again.
7. Repeat for the other dataset.
    - **Note:** I did not end up using parks_datasd.geojson, but I left in the work I did to clean that data because I am proud of it.
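
The steps above can be sketched as a single chained pandas pipeline. This is only a minimal sketch on a small made-up frame, not the notebook's actual cells: the helper name `clean_hikes`, the `min_rating` parameter, and the `toy` rows are assumptions introduced for illustration.

```python
import pandas as pd

KEYWORDS = ("trail", "summit", "canyon")

def clean_hikes(df, min_rating=4.0):
    """Keep keyword-matching rows rated at least min_rating, sorted by name."""
    # Lowercase the names, then flag those containing any hike keyword.
    names = df["name"].astype(str).str.lower()
    is_hike = names.str.contains("|".join(KEYWORDS), regex=True)
    return (
        df.assign(name=names)
          .loc[is_hike & (df["rating"] >= min_rating),
               ["name", "rating", "longitude", "latitude"]]
          .dropna()                    # drop rows with missing values
          .sort_values("name")         # alphabetize by name
          .reset_index(drop=True)      # renumber rows from 0
    )

# Hypothetical rows for illustration only (not taken from yelp_SD_parks.csv).
toy = pd.DataFrame({
    "name": ["Sunny Summit", "Oak Playground", "Creek Trail", "Dusty Canyon"],
    "rating": [4.5, 5.0, 3.5, 4.0],
    "longitude": [-117.1, -117.2, -117.0, -116.9],
    "latitude": [32.8, 32.7, 32.9, 33.0],
})
cleaned = clean_hikes(toy)
print(cleaned)
```

Chaining the steps avoids the repeated reassignment and the two separate index resets, at the cost of making each intermediate result harder to inspect.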

In [4]:
## function for finding hikes: returns the lowercased name if it contains
## a hike keyword, otherwise None
def is_hike(park):
    park = str(park).lower()
    if any(keyword in park for keyword in ('trail', 'summit', 'canyon')):
        return park
    return None

In [5]:
##apply function to the name column of yelp_df
yelp_df['name'] = yelp_df['name'].apply(is_hike)

In [6]:
##remove address, phone, id, url, review_count, and is_closed columns from yelp_df
yelp_df = yelp_df.drop(columns = ['address', 'phone', 'id', 'url', 'review_count', 'is_closed'])

In [7]:
## drop rows with None values in name to only include hikes in df
yelp_df = yelp_df.dropna(axis = 0, how = 'any')

In [8]:
## alphabetize rows by 'name' column
yelp_df = yelp_df.sort_values(by = ['name'])

In [9]:
## reset the indexes to ascending numerical order
yelp_df = yelp_df.reset_index(drop = True)

## source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html

In [10]:
## keep only the top-rated hikes by dropping rows with rating < 4.0
# get the index labels of rows where rating < 4.0
indexNames = yelp_df[yelp_df['rating'] < 4.0].index
# delete these rows from the dataframe
yelp_df.drop(indexNames, inplace=True)

## source: https://www.geeksforgeeks.org/drop-rows-from-the-dataframe-based-on-certain-condition-applied-on-a-column/

In [11]:
## reset the indexes to ascending numerical order
yelp_df = yelp_df.reset_index(drop = True)

## source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html

In [12]:
##check the dataframe
yelp_df

Unnamed: 0,name,rating,longitude,latitude
0,annie's canyon trail,4.5,-117.272446,33.003597
1,camp elliot trails,4.0,-117.0559,32.817864
2,canyonside community park,4.5,-117.130233,32.94215
3,coast to crest trail,4.5,-117.06774,33.066486
4,coast walk trail,5.0,-117.26627,32.84864
5,coastal rail trail,4.5,-117.268889,32.980309
6,cypress canyon park,4.5,-117.064023,32.923995
7,del mar trails park,4.5,-117.217885,32.943851
8,el cajon mountain hiking trail,4.5,-116.884575,32.912594
9,florida canyon,4.0,-117.142446,32.726833


In [13]:
## convert park_gpd into pandas dataframe
#parks_df = pd.DataFrame(parks_gpd)

In [14]:
#parks_df.head()

In [15]:
## repeat data cleaning steps for parks_df

##apply function to the name column
#parks_df['alias'] = parks_df['alias'].apply(is_hike)

##remove objectid, name, park_type, location, and owner columns from parks_df
#parks_df = parks_df.drop(columns = ['objectid', 'name', 'park_type', 'location', 'owner'])

## drop rows with None values in name to only include hikes in df
#parks_df = parks_df.dropna(axis = 0, how = 'any') 

## alphabetize rows by 'alias' column
#parks_df = parks_df.sort_values(by = ['alias']) ## replacing name with values from location???

## #reset the indexes to ascending numerical order
#parks_df = parks_df.reset_index()

## replace old indexes with the new ones
#parks_df = parks_df.drop(columns = ['index'])

In [16]:
## drop duplicate values in name
#parks_df = parks_df.drop_duplicates(subset = ['alias'])

In [17]:
#check data
#parks_df

# Data Analysis & Results

In [18]:
## get latitudes, longitudes, and names from yelp_df
longitude = yelp_df['longitude'].tolist() 
latitude = yelp_df['latitude'].tolist() 
name = yelp_df['name'].tolist()

In [19]:
## check lists
print('longitudes:', longitude)
print('latitudes:', latitude)

longitudes: [-117.2724462, -117.0558997, -117.130233, -117.06773999999999, -117.26626999999999, -117.2688892, -117.06402340000001, -117.21788529999999, -116.8845749, -117.1424463, -117.1214457, -117.05601000000001, -117.24227920000001, -117.251678, -117.12766200000002, -117.16221029999998, -117.2599594, -117.11976000000001, -117.05601000000001, -117.0789191, -117.01487040000002, -117.07369209999999, -117.22833999999999, -117.04475159999998, -117.0426976, -117.07276110000001, -117.0650876, -117.023145, -117.19322199999999, -117.21250400000001, -117.2191697, -117.10107109999998, -116.57910179999999, -117.0079621, -117.1334989, -117.02652340000002, -117.8146607, -117.1975108, -117.26300790000002, -117.13429, -117.1105922]
latitudes: [33.003597, 32.817863800000005, 32.94215006, 33.06648563, 32.84864, 32.98030915, 32.92399496, 32.9438511, 32.91259447, 32.72683341, 32.73010346, 32.817954, 33.08883526, 32.885099, 32.9371, 32.73712585, 32.9323146, 32.99367018, 32.817955, 32.82559464, 33.007178

In [20]:
## create dataframe with latitudes, longitudes, and names
park_loc = pd.DataFrame(list(zip(latitude, longitude, name)), 
               columns =['latitude', 'longitude', 'name']) 
park_loc.head()

## source: https://www.geeksforgeeks.org/create-a-pandas-dataframe-from-lists/

Unnamed: 0,latitude,longitude,name
0,33.003597,-117.272446,annie's canyon trail
1,32.817864,-117.0559,camp elliot trails
2,32.94215,-117.130233,canyonside community park
3,33.066486,-117.06774,coast to crest trail
4,32.84864,-117.26627,coast walk trail


In [21]:
## make a map from open street map
sd = folium.Map(location = [32.7157,-117.1611],
                tiles = 'Stamen Terrain')

In [22]:
## set boundary of map to san diego county
    ## coordinates source: https://boundingbox.klokantech.com/ 
sd.fit_bounds([[32.52952353, -117.61108103], [33.50529726, -116.08102998]])

In [23]:
## plot each row of the park_loc dataframe as a marker on the sd map
for row in park_loc.itertuples(index=False):
    folium.Marker([row.latitude, row.longitude],
                  popup=row.name).add_to(sd)

In [24]:
## check map
sd

# Ethics & Privacy

**Privacy:** Yelp data is user-generated. If it is not scraped and cleaned carefully, it could reveal people's identities through their usernames, comments, or even photos. A malicious actor could gain access to this information and use it for unethical purposes. On the topic of photos, people often post pictures on Yelp that contain their faces. If information about these images is included in the dataframe, facial recognition software could be used to identify the Yelp users who posted them. Yelp reviews are also posted from personal computers, so it is possible that information such as IP addresses could be scraped along with the reviews and used to reveal someone's location. Therefore, it is imperative that the data is well cleaned, especially when it is user-generated, in order to protect users' privacy and safety.

**Ethics:** Reviews and ratings from Yelp are opinions. People feel protected on the internet, so they may leave reviews that are exaggerated or fueled by strong emotions. Average ratings are therefore influenced by these emotionally charged reviews and may not accurately reflect the popularity or quality of a hike. This has implications for the owners of these spaces, who may decide how much maintenance or advertising to do based on reviews such as these. They may be allocating resources to the wrong places or not using the correct resources because of biased reviews. Furthermore, websites such as OpenStreetMap are open data sources, but they ask to be credited when someone uses their services. It would be unethical to use information gathered from sources such as OpenStreetMap and not cite them. It could even be considered plagiarism, as one is using data they did not acquire on their own. When collecting data, it is important to always give credit to the sources from which the data came.

# Conclusion & Discussion

The map displays the locations of the most popular hikes in San Diego County, according to Yelp reviews. The pins represent the latitude and longitude coordinates of each park. According to the distribution of the pins, there is no clear regional pattern in where the most popular hikes are located. Most of the popular hikes appear to range from North San Diego County towards the middle of the city of San Diego, with a few outliers in the far east and south of the county. Based on the map, the hypothesis is rejected, since there is no obvious pattern.
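
One rough way to put numbers on the regional distribution is to count hikes in west-to-east longitude bands. The sketch below is only illustrative: the band edges and labels are assumptions, not boundaries used in this project, and longitude is a crude proxy for distance from the coast because the coastline does not follow a line of constant longitude. The sample rows are the first five rows of the cleaned table shown earlier.

```python
import pandas as pd

# Hypothetical longitude cut points (degrees); these band edges are
# assumptions for illustration, not boundaries used in the project.
BAND_EDGES = [-118.0, -117.2, -117.0, -116.0]
BAND_LABELS = ["west", "middle", "east"]

def count_by_band(df):
    """Count hikes per west-to-east longitude band."""
    bands = pd.cut(df["longitude"], bins=BAND_EDGES, labels=BAND_LABELS)
    # Convert labels to plain strings, count them, and list every band
    # (including empty ones) in west-to-east order.
    return bands.astype(str).value_counts().reindex(BAND_LABELS, fill_value=0)

# First five rows of the cleaned table shown earlier in the notebook.
sample = pd.DataFrame({
    "name": ["annie's canyon trail", "camp elliot trails",
             "canyonside community park", "coast to crest trail",
             "coast walk trail"],
    "longitude": [-117.272446, -117.0559, -117.130233, -117.06774, -117.26627],
})
counts = count_by_band(sample)
print(counts)
```

Running the same function on the full `yelp_df` would give a per-band tally that could be compared against the visual impression from the map.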

It does appear that many of the popular hikes fall along the coastline, and there is some clustering in the central parts of San Diego. Further analysis could test statistically whether there are clusters or groupings. The geographic areas could also be analyzed to see whether there is some correlation between location and hikes, such as the presence of mountains.
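
A simple version of such a clustering test compares the mean nearest-neighbor distance of the hikes with that of randomly scattered points. The sketch below is one possible approach under stated assumptions, not an analysis performed in this project: the function names are hypothetical, the random layouts are drawn uniformly in latitude/longitude inside the map's bounding box (which includes ocean, so the null model is crude), and the ten coordinates are the rows of the cleaned table shown earlier.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def mean_nearest_neighbor_km(lats, lons):
    """Average distance from each point to its closest other point."""
    lats = np.asarray(lats, dtype=float)
    lons = np.asarray(lons, dtype=float)
    # Pairwise distance matrix via broadcasting.
    d = haversine_km(lats[:, None], lons[:, None], lats[None, :], lons[None, :])
    np.fill_diagonal(d, np.inf)  # ignore each point's distance to itself
    return d.min(axis=1).mean()

def clustering_p_value(lats, lons, bounds, n_sim=999, seed=0):
    """Share of uniform-random layouts at least as tightly packed as the data."""
    (lat_min, lon_min), (lat_max, lon_max) = bounds
    rng = np.random.default_rng(seed)
    observed = mean_nearest_neighbor_km(lats, lons)
    n = len(lats)
    hits = 0
    for _ in range(n_sim):
        sim_lats = rng.uniform(lat_min, lat_max, n)
        sim_lons = rng.uniform(lon_min, lon_max, n)
        if mean_nearest_neighbor_km(sim_lats, sim_lons) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_sim + 1)

# Coordinates of the ten rows shown in the cleaned table above.
lats = [33.003597, 32.817864, 32.94215, 33.066486, 32.84864,
        32.980309, 32.923995, 32.943851, 32.912594, 32.726833]
lons = [-117.272446, -117.0559, -117.130233, -117.06774, -117.26627,
        -117.268889, -117.064023, -117.217885, -116.884575, -117.142446]
# Bounding box used for the map: (south-west corner, north-east corner).
bounds = ((32.52952353, -117.61108103), (33.50529726, -116.08102998))

observed, p_value = clustering_p_value(lats, lons, bounds)
print(f"mean nearest-neighbor distance: {observed:.2f} km, p = {p_value:.3f}")
```

A small p-value would suggest the hikes sit closer together than random placement in the box would produce. Because the box covers ocean and uninhabited land, a more careful null model would restrict the random points to land within the county.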

The data presented by the map has implications for the owners of these locations. It displays which parks are most popular, and likely most frequented; therefore, these parks need more resources in order to be properly maintained and kept in good condition for frequent visitors. At the same time, owners can see if their location is NOT on this map, and can focus on advertising efforts to attract more visitors to these locations. Further analysis can provide more information on how park owners can maintain and promote their locations. 