## Extracting Each Trail's Info and the Reviewers and Their Ratings

This Jupyter Notebook visits the page of each trail in California and extracts the following:
- Name of the trail
- Description of each trail
- Distance of the trail
- Elevation of the trail
- Type of route of the trail
- Difficulty of the trail
- Average rating of the trail
- Location of the trail
- Number of reviews
- Reviewers and their rating

## Libraries

In [1]:
import numpy as np
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
import pandas as pd

#### Let's read in the dataframe that holds every trail's web extension from 01_Selenium_Get_All_CA_Hiking_Trails

In [2]:
df = pd.read_csv('../data/trails_df.csv')

#### Let's check to make sure this is the right data

In [3]:
df.head(3)

Unnamed: 0,trails_web_extension
0,/trail/us/california/potato-chip-rock-via-mt-w...
1,/trail/us/california/vernal-and-nevada-falls-v...
2,/trail/us/california/eaton-canyon-trail?ref=re...


#### Let's add columns to this df

In [4]:
df['name'] = ''
df['description1'] = ''
df['description2'] = ''
df['distance'] = ''
df['elevation'] = ''
df['route'] = ''
df['difficulty'] = ''
df['rating'] = ''
df['location'] = ''
df['numreviews'] = ''
df['reviewers_rating'] = ''

#### Let's check the dataframe with the added columns

In [5]:
df.head(2)

Unnamed: 0,trails_web_extension,name,description1,description2,distance,elevation,route,difficulty,rating,location,numreviews,reviewers_rating
0,/trail/us/california/potato-chip-rock-via-mt-w...,,,,,,,,,,,
1,/trail/us/california/vernal-and-nevada-falls-v...,,,,,,,,,,,


In [6]:
df.shape

(7728, 12)

As shown above, the df has 7728 rows, one per trail, and we need to iterate through all of them to scrape the content for each column.

This is the most difficult and time-consuming part of gathering the reviewers of each hike and their ratings.

It took me over a week to do.

## Functions

The function **scrap_CA_trail** loads the URL of a hiking trail and clicks the load more button until the button is no longer available. It then builds the soup from the fully expanded page and returns it.

In [7]:
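from selenium.common.exceptions import WebDriverException

def scrap_CA_trail(browser, hike_url):
    browser.get(hike_url)
    # Keep clicking "load more" until the button stops appearing
    while True:
        try:
            load_more_ratings = WebDriverWait(browser, 2).until(
                EC.visibility_of_element_located((
                By.XPATH,"//div[@id='load_more'] [@class='feed-item load-more'][//a]")))
            load_more_ratings.click()
            time.sleep(2)  # give the next batch of reviews time to render
        # Timeouts and failed clicks are WebDriverException subclasses;
        # a bare except would also swallow KeyboardInterrupt
        except WebDriverException:
            break
    # Parse the page once every review has been loaded
    soup = BeautifulSoup(browser.page_source, "lxml")
    return soup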
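
The load-more loop ends only when the button lookup fails. Below is a minimal sketch of the same click-until-gone pattern with a cap on the number of clicks, written against plain callables so it runs without a browser. `click_until_gone`, `find_button`, and `max_clicks` are hypothetical names for illustration, not part of this notebook or of Selenium.

```python
def click_until_gone(find_button, max_clicks=500):
    """Click whatever find_button() returns until it raises or the cap is hit.

    find_button is a hypothetical stand-in for the WebDriverWait lookup;
    it is assumed to raise LookupError once the button has disappeared.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            button = find_button()
        except LookupError:
            break  # button is gone, so every review has been loaded
        button()   # stand-in for load_more_ratings.click()
        clicks += 1
    return clicks


# Fake page that has three more batches of reviews to load
remaining = [3]

def fake_find_button():
    if remaining[0] == 0:
        raise LookupError("no load-more button")
    def click():
        remaining[0] -= 1
    return click

print(click_until_gone(fake_find_button))
```

A cap like this would keep one misbehaving page from holding up a long scraping session, at the cost of possibly missing reviews on very popular trails.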

The function **trails_to_scrap** takes two arguments: a start index for the trail to start scraping at and an end index to stop scraping at. It opens the Chrome webdriver and loads one hiking page at a time from the list of hiking trails. For each page it finds the load more button and clicks it until the button no longer exists. Once that condition is met, the while loop breaks, and the function extracts the soup (HTML) and the specific features of the trail. It also saves the dataframe every 25 trails it scrapes so the data is backed up.

In [8]:
def trails_to_scrap(start, end=None):
    # end=None runs through the last trail (the old default '' is not a valid slice bound)
    browser = webdriver.Chrome('../chromedriver')
    for i, trail in enumerate(df.trails_web_extension[start:end], start=start):

        california_hiking_trail_url = 'http://www.alltrails.com' + trail

        soup2 = scrap_CA_trail(browser, california_hiking_trail_url)

        # Containers that several fields are read from
        stats = soup2.find('section', id='trail-stats').find('div').find_all('span')
        difficulty_and_rating = soup2.find('div', id='difficulty-and-rating')

        # df.at[i, col] writes to the dataframe itself, unlike chained df.col[i] = ...
        df.at[i, 'name'] = soup2.find('h1').text
        df.at[i, 'description1'] = soup2.find('p').text
        df.at[i, 'description2'] = soup2.find_all('p')[1].text
        df.at[i, 'distance'] = stats[0].text.replace('\nDISTANCE\n', '').replace(' miles\n', '')
        df.at[i, 'elevation'] = stats[1].text.replace('\nELEVATION GAIN\n', '').replace(' feet\n', '')
        df.at[i, 'route'] = stats[2].text.replace('\nROUTE TYPE\n', '').replace('\n', '')
        df.at[i, 'difficulty'] = difficulty_and_rating.find('span').text
        df.at[i, 'rating'] = difficulty_and_rating.find('meta')['content']
        df.at[i, 'location'] = soup2.find('div', id='title-and-menu-box').find_all('span')[5].text
        df.at[i, 'numreviews'] = difficulty_and_rating.find_all('span')[4].text

        # Pair each reviewer with the star rating at the same position
        authors = soup2.find_all('span', itemprop='author')
        stars = soup2.find_all('div', {'class': 'width-for-stars-holder'})
        reviewer_names_and_rating = []
        for author, star in zip(authors, stars):
            reviewer_names_and_rating.append({author.text: star.find('meta')['content']})
        df.at[i, 'reviewers_rating'] = reviewer_names_and_rating

        # Back up the dataframe every 25 trails
        if i % 25 == 0:
            df.to_csv(f'./california_hikes_{i}_df.csv', index=False)

    # Close the browser once the batch is done
    browser.quit()
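
The `.replace(...)` chains in the function strip the label and the unit from each stat but leave the value as a string. As a rough sketch, a regular expression could pull the number out directly. `parse_stat` is a hypothetical helper, the sample strings are inferred from the labels used in the `.replace` calls, and the thousands separator in the elevation sample is an assumption about how the site formats large numbers.

```python
import re

def parse_stat(text):
    """Return the first number found in a trail-stats span as a float, or None."""
    # Digits with optional thousands separators and an optional decimal part
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if match is None:
        return None
    return float(match.group().replace(',', ''))


print(parse_stat('\nDISTANCE\n8.7 miles\n'))         # 8.7
print(parse_stat('\nELEVATION GAIN\n1,489 feet\n'))  # 1489.0
print(parse_stat('\nROUTE TYPE\nLoop\n'))            # None
```

Returning `None` for text without a number keeps non-numeric fields such as the route type from raising an error.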


## Main Code

#### This is a demo of the code for one trail. The dataframe around index 7000 is displayed to show how the information was extracted

In [9]:
trails_to_scrap(7000, 7001)
df.iloc[6999:7002]

Unnamed: 0,trails_web_extension,name,description1,description2,distance,elevation,route,difficulty,rating,location,numreviews,reviewers_rating
6999,/trail/us/california/santiago-peak?ref=result-...,,,,,,,,,,,
7000,/trail/us/california/falls-trail-and-middle-tr...,Falls Trail and Middle Trail,Falls Trail and Middle Trail is a 8.7 mile mod...,"Since I live nearby, I hike this trail often. ...",8.7,1489.0,Loop,MODERATE,4.0,Mount Diablo State Park,3.0,"[{'Oscar d.': '4'}, {'Kay Jung': '5'}, {'Thoma..."
7001,/trail/us/california/devils-slide-trail-willow...,,,,,,,,,,,


## **NOTE:**
This code takes a long time to run; it took me 2 weeks. I split the trails into batches this way because the low-index trails (starting at 0) have the most reviews and the high-index trails (7000+) have the fewest. The code can't run continuously unless you have a server to run it on, so scraping in batches is the most practical method.
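
Because the scrape runs across separate sessions, the backups written every 25 trails are a natural place to resume from. The sketch below looks for the highest-numbered backup file; `latest_checkpoint` is a hypothetical helper, and it assumes the backups sit in one directory under the `california_hikes_{i}_df.csv` naming used by `trails_to_scrap`.

```python
import re
import tempfile
from pathlib import Path

CHECKPOINT_PATTERN = re.compile(r'california_hikes_(\d+)_df\.csv$')

def latest_checkpoint(directory='.'):
    """Return (index, path) of the highest-numbered backup file, or None."""
    found = []
    for path in Path(directory).glob('california_hikes_*_df.csv'):
        match = CHECKPOINT_PATTERN.search(path.name)
        if match:
            found.append((int(match.group(1)), path))
    # Compare on the index only
    return max(found, key=lambda item: item[0], default=None)


# Demo with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for i in (0, 25, 50):
        Path(tmp, f'california_hikes_{i}_df.csv').touch()
    index, path = latest_checkpoint(tmp)
    print(index, path.name)
```

Each backup holds the whole dataframe, so a later session could reload the file with `pd.read_csv` and continue from the trail after `index`.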

In [10]:
trails_to_scrap(0, 5)
trails_to_scrap(5, 10)
trails_to_scrap(10, 20)
trails_to_scrap(20, 40)
trails_to_scrap(40, 80)
trails_to_scrap(80, 160)
trails_to_scrap(160, 320)
trails_to_scrap(320, 1000)
trails_to_scrap(1000, 2000)
trails_to_scrap(2000, 4000)
trails_to_scrap(7727, None)
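
The calls above use small batches for the heavily reviewed trails and larger ones as the review counts drop. The sketch below shows one way such contiguous ranges could be generated so that no index is skipped. `doubling_ranges` is a hypothetical helper, and strict doubling is an assumption; the batches above were picked by hand.

```python
def doubling_ranges(total, first=5):
    """Yield contiguous (start, end) pairs whose end index doubles each step."""
    if first <= 0:
        raise ValueError("first must be positive")
    start, end = 0, first
    while start < total:
        # Clip the final batch to the number of rows
        yield start, min(end, total)
        start, end = end, end * 2


ranges = list(doubling_ranges(7728))
print(ranges[:4])   # [(0, 5), (5, 10), (10, 20), (20, 40)]
print(ranges[-1])   # (5120, 7728)
```

Each pair could then be passed to `trails_to_scrap(start, end)` in its own session.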

This is the code for saving the final dataframe

In [11]:
df.to_csv('./california_hikes_df.csv', index=False)
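
Note that `reviewers_rating` holds a list of dicts, which `to_csv` writes out as its string representation, so it comes back as plain text when the file is read again. A minimal sketch of turning that text back into Python objects with `ast.literal_eval` follows; `parse_reviewers` is a hypothetical helper, and converting ratings with `int` assumes whole-star ratings like the ones shown in the demo row.

```python
import ast

def parse_reviewers(cell):
    """Turn the stored string back into a list of (reviewer, rating) pairs."""
    pairs = []
    for entry in ast.literal_eval(cell):
        # Each entry is a single-key dict: {reviewer_name: rating}
        for name, rating in entry.items():
            pairs.append((name, int(rating)))
    return pairs


cell = "[{'Oscar d.': '4'}, {'Kay Jung': '5'}]"
print(parse_reviewers(cell))  # [('Oscar d.', 4), ('Kay Jung', 5)]
```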

## Summary

- This notebook collected all the features of the trails, along with the reviewers of each trail and their ratings.

- The data is saved to a CSV file to be loaded in the next Jupyter notebook for EDA.

- The link to the next notebook is located here: <br>
https://git.generalassemb.ly/boxndragon04/California_Hiking_Recommendation_System/blob/master2/notebooks/03_EDA_California_Hiking_Trails.ipynb