# Scraping California Trails

This is the first Jupyter Notebook in the project.

In this notebook I used Selenium to scrape all California trails from www.alltrails.com.

I found some code for scraping Colorado hiking trails here: <br>
https://github.com/oschow/take-a-hike/blob/master/AllTrails/scrape_clean/scrape_ratings.py

This code is from 2016, so some of it no longer works, but it was a good reference for getting started with the Selenium library to scrape a JavaScript-driven website.

I also had to read up on Selenium on this website: https://www.seleniumhq.org/docs/03_webdriver.jsp

I also had to download ChromeDriver from here: http://chromedriver.chromium.org/downloads <br>
Selenium uses it to automate the button clicking that loads more hiking trails and user reviews.

I used the Beautiful Soup library, which we learned in class at General Assembly, to parse the scraped pages.

# **NOTE:** 
- <span style="color:Red"> Running this code takes a very long time, because it keeps pressing the load more button until all 7000+ trails are loaded.</span>
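
To make "a very long time" more concrete, here is a rough back-of-the-envelope sketch. The function name `estimate_min_runtime_seconds` and the figure of 24 trails per click are assumptions for illustration only (the real batch size was not measured); the 5-second pause and 10-second timeout match the values used in `get_all_hikes` below.

```python
import math

def estimate_min_runtime_seconds(total_trails, trails_per_click,
                                 pause_seconds=5, final_timeout_seconds=10):
    """Lower bound on scraping time: one pause per click plus the final timeout."""
    clicks = math.ceil(total_trails / trails_per_click)
    return clicks * pause_seconds + final_timeout_seconds

# Hypothetical batch size of 24 trails per click; 7000 comes from the note above
seconds = estimate_min_runtime_seconds(total_trails=7000, trails_per_click=24)
print(f"at least {seconds / 60:.1f} minutes")
```

The estimate ignores page-load and rendering time, so a real run would be slower than this bound.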

## Libraries

In [1]:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
import pandas as pd

## Functions

The function **get_all_hikes** takes a Selenium Chrome webdriver as `browser` and does the following:
1. Opens https://www.alltrails.com/us/california?ref=search in the browser
2. Clicks the load more button until the button no longer exists
3. Parses the fully loaded page with BeautifulSoup and stores the result in `soup`
4. Returns `soup`

In [1]:
def get_all_hikes(browser):
    """Open the California trails page, load every trail, and return the soup."""
    browser.get('https://www.alltrails.com/us/california?ref=search')

    # To test on a small subset, uncomment the counter lines below;
    # the loop then stops as soon as counter reaches 3.
    # counter = 1
    while True:
        try:
            # Wait up to 10 seconds for the load more button to become visible
            load_more_hikes = WebDriverWait(driver=browser, timeout=10).until(
                EC.visibility_of_element_located((
                    By.XPATH,
                    "//div[@id='load_more'] [@class='feed-item load-more trail-load'][//a]")))
            load_more_hikes.click()
            # Give the next batch of trails time to load
            time.sleep(5)

            # counter += 1
            # if counter == 3:
            #     break
        except Exception:
            # Once the button is gone the wait times out (or a click fails),
            # so stop pressing
            break

    # With every trail loaded and no load more button left,
    # parse the page source and return the soup for extracting the trails
    soup = BeautifulSoup(browser.page_source, "lxml")
    return soup
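
The click loop can also be written so that the optional click limit is a parameter rather than commented-out code. The sketch below is one possible shape, not the code that produced the data: `click_until_gone`, `find_button`, `max_clicks`, and `FakePage` are hypothetical names, and `find_button` stands in for the `WebDriverWait(...).until(...)` call, assumed here to raise an exception once the button never appears.

```python
import time

def click_until_gone(find_button, max_clicks=None, pause_seconds=0):
    """Click the button returned by find_button() until it can no longer be found.

    find_button: hypothetical callable that returns a clickable object, or
                 raises an exception once the button has disappeared.
    max_clicks:  optional cap, handy for testing on a small subset.
    Returns the number of clicks performed.
    """
    clicks = 0
    while max_clicks is None or clicks < max_clicks:
        try:
            button = find_button()
            button.click()
        except Exception:
            # Button gone (or not clickable): stop looping
            break
        clicks += 1
        time.sleep(pause_seconds)
    return clicks


# Stand-in for a page whose button disappears after a fixed number of clicks
class FakePage:
    def __init__(self, available_clicks):
        self.remaining = available_clicks

    def find_button(self):
        if self.remaining == 0:
            raise LookupError("load more button not found")
        return self

    def click(self):
        self.remaining -= 1


print(click_until_gone(FakePage(4).find_button))                # 4
print(click_until_gone(FakePage(4).find_button, max_clicks=2))  # 2
```

With Selenium, `find_button` could be a small function wrapping the `WebDriverWait` call from `get_all_hikes`, and `pause_seconds=5` would reproduce the original delay.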

## Main Code

In [3]:
# Create the Chrome web driver, pointing Selenium at the downloaded chromedriver
# (newer Selenium releases expect this path via a Service object instead)
browser = webdriver.Chrome('../chromedriver')

# Load every trail in the browser and parse the page into soup
soup = get_all_hikes(browser)

# Select every trail result card on the page and save them as hikes
hikes = soup.select('div.trail-result-card')

# Using a list comprehension, extract the URL extension (href) of every
# trail's page into a list called trails_webpage
trails_webpage = [hike.find('a')['href'] for hike in hikes]
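
The values in `trails_webpage` are relative paths such as `/trail/us/california/potato-chip-rock-via-mt-woodson-trail?ref=result-card`, not full URLs. A minimal sketch of turning them into absolute URLs with the standard library follows. `to_full_url` and `BASE_URL` are hypothetical names, and dropping the `?ref=...` query string is an assumption about what a later step might want, not something this notebook does.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE_URL = "https://www.alltrails.com"  # assumed site root

def to_full_url(href, base=BASE_URL):
    """Join a relative trail path to the site root and drop query and fragment."""
    parts = urlsplit(urljoin(base, href))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Sample href taken from the first trail card shown later in this notebook
href = "/trail/us/california/potato-chip-rock-via-mt-woodson-trail?ref=result-card"
print(to_full_url(href))
```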

#### Saving the trail web extensions data frame as a CSV

In [4]:
pd.DataFrame(data={'trails_web_extension': trails_webpage}).to_csv('./trails_df.csv', index=False)
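
Because `to_csv` replaces an existing file without warning, rerunning the cell above overwrites earlier results. One way to guard against that, sketched here with only the standard library, is to pick an unused file name before saving. `next_free_path` is a hypothetical helper, not part of pandas.

```python
from pathlib import Path
import tempfile

def next_free_path(path):
    """Return path if unused, else the first 'name_1.ext', 'name_2.ext', ... that is free."""
    path = Path(path)
    candidate = path
    counter = 1
    while candidate.exists():
        candidate = path.with_name(f"{path.stem}_{counter}{path.suffix}")
        counter += 1
    return candidate

# Demonstration in a temporary directory so no real file is touched
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "trails_df.csv"
    first = next_free_path(target)      # nothing exists yet
    first.write_text("trails_web_extension\n")
    second = next_free_path(target)     # the original name is now taken
    print(first.name, second.name)
```

The returned path can be passed straight to `to_csv` in place of the fixed `'./trails_df.csv'` string.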

In [5]:
# Extract the name of each hike from the h3 heading of its card
trail_name = [hike.find('h3').text for hike in hikes]

#### Example of the raw scraped data for one of the hikes. This is the HTML

In [69]:
hikes[0]

<div class="trail-result-card" data-id="0" data-reactid="6" itemid="/trail/us/california/potato-chip-rock-via-mt-woodson-trail" itemprop="containsPlace" itemscope="" itemtype="http://schema.org/LocalBusiness" style="width:100%;background-color:rgba(108,139,133,1);background-image:linear-gradient(rgba(0, 0, 0, 0.0), rgba(0, 0, 0, 0.7)), url(/api/alltrails/trails/10111800/profile_photo?show_placeholder=no&amp;size=large&amp;key=3p0t5s6b5g4g0e8k3c1j3w7y5c3m4t8i);background-repeat:no-repeat;background-position:center center;height:145px;"><span class="item-rank" data-reactid="7"><!-- react-text: 8 -->#<!-- /react-text --><!-- react-text: 9 -->1<!-- /react-text --></span><link data-reactid="10" href="/api/alltrails/trails/10111800/profile_photo?show_placeholder=no&amp;size=large&amp;key=3p0t5s6b5g4g0e8k3c1j3w7y5c3m4t8i" itemprop="image"/><a class="item-link" data-reactid="11" href="/trail/us/california/potato-chip-rock-via-mt-woodson-trail?ref=result-card" itemprop="url"></a><div class="ite

#### The first five trail names extracted from the alltrails.com website

In [22]:
trail_name[0:5]

['Potato Chip Rock via Mt. Woodson Trail',
 'Vernal and Nevada Falls via the Mist Trail',
 'Eaton Canyon Trail',
 'Bridge to Nowhere via East Fork Trail',
 'Alamere Falls via Coast Trail from Palomarin Trailhead']

## Summary

- All the trails have been extracted from the website, along with the URL extension that leads to each trail's page, and saved from a pandas DataFrame to a CSV file called ***trails_df.csv***. I moved this file into the /data folder to prevent it from being overwritten.

- The next Jupyter Notebook extracts the reviewers and their ratings.

- Link to notebook 2: <br> https://git.generalassemb.ly/boxndragon04/California_Hiking_Recommendation_System/blob/master2/notebooks/02_Scrapping_Reviews.ipynb
