# Scraping Google Reviews
This notebook details how Google Maps reviews of parks in Montréal are collected.

It is broken down into the following sections: 
<br>1. [Loading necessary libraries](#loading-lib)
<br>2. [Collecting park information](#collecting-park-info)
<br>3. [Collecting Google Maps data](#google-maps-calls)

<a id="loading-lib"></a>
## 1. Loading necessary libraries
Selenium is one of the essential libraries for scraping web data. Before it can be used, it has to be installed together with a Chrome driver, and a few browser options have to be set for the scraping that follows.

In [4]:
# install necessary libraries 
!pip install selenium
!apt-get update # to update ubuntu to correctly run apt install
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

Collecting selenium
  Downloading https://files.pythonhosted.org/packages/80/d6/4294f0b4bce4de0abf13e17190289f9d0613b0a44e5dd6a7f5ca98459853/selenium-3.141.0-py2.py3-none-any.whl (904kB)

In [2]:
# add the chromedriver location to sys.path
import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

In [1]:
# load selenium module for scraping 
from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException

In [6]:
# update options for scraping 
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

In [302]:
# create a webdriver browser instance to make website calls
#browser = webdriver.Chrome('chromedriver', options=chrome_options)

# if developing locally and not in Google Colab
browser = webdriver.Chrome(executable_path="/Users/andreamock/Documents/chromedriver")

<a id="collecting-park-info"></a>
## 2. Collecting park information 

### 2.1 Collecting names of parks 
Montréal has many parks, and there are several ways to obtain a complete list of them. The official website of Montréal contains a [list of parks](https://montreal.ca/lieux?mtl_content.lieux.category.code=PARC) in the city, which is used here as the basis for collecting reviews for the different parks. 

In [None]:
# get the website with all the parks in Montréal
browser.get('https://montreal.ca/lieux?mtl_content.lieux.installation.code=PARC')

In [3]:
def gatherParkNames(browserEl):
    ''' Searches for the park names as well as urls to their corresponding site.
    Returns a list of tuples, where each tuple contains the park name and corresponding url'''

    # find_elements_by_xpath returns an array of selenium objects.
    park_elements = browserEl.find_elements_by_xpath('//div[@class="list-group list-group-teaser hub-list-group "]/a')

    # extract the links and names of the parks 
    all_park_info = []
    for p in park_elements: 
        park_name= p.text.split('\n')[0] # just extract park name, not part of city and address
        park_url = p.get_attribute('href')
        all_park_info.append((park_name, park_url))

    return all_park_info

In [4]:
def getAllParks(browserEl, numPages): 
    '''Given the number of pages to traverse, extracts all of the parks listed on the Montréal website and 
    returns a list of tuples with the park name and the url of the information page for that park'''
    
    allparksInfo = [] # list to collect the information about all parks
    for i in range(numPages): 
        browserEl.get('https://montreal.ca/lieux?mtl_content.lieux.installation.code=PARC&page='+str(i+1))
        parksInfo = gatherParkNames(browserEl) # gather park information from one of the park overview pages
        allparksInfo = allparksInfo + parksInfo

    return allparksInfo
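
The number of pages has to be supplied by the caller of `getAllParks`. As a minimal sketch of an alternative, assuming that the listing returns no park entries once the page index runs past the last page (an assumption that has not been verified here), the loop could stop at the first empty page. `build_page_url`, `collect_until_empty` and `fetch_page` are hypothetical names; `fetch_page` stands in for the pair `browserEl.get(url)` and `gatherParkNames(browserEl)`.

```python
from urllib.parse import urlencode

BASE_URL = 'https://montreal.ca/lieux'

def build_page_url(page):
    '''Builds the url of the park listing for a given 1-based page number.'''
    params = {'mtl_content.lieux.installation.code': 'PARC', 'page': page}
    return BASE_URL + '?' + urlencode(params)

def collect_until_empty(fetch_page, max_pages=50):
    '''Calls fetch_page(url) for page 1, 2, ... and stops at the first empty page.
    fetch_page is a hypothetical callable that returns a list of (name, url) tuples.'''
    all_parks = []
    for page in range(1, max_pages + 1):
        parks = fetch_page(build_page_url(page))
        if not parks:  # assumed signal that the last page has been passed
            break
        all_parks.extend(parks)
    return all_parks
```

With Selenium, `fetch_page` could be a small function that loads the url in the browser and returns the result of `gatherParkNames`. The `max_pages` cap keeps the loop finite if the site never returns an empty page.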

In [5]:
# gather all of the park names from the Montréal website
allParks = getAllParks(browser, 10)

In [6]:
# number of total parks 
len(allParks)

948

In [7]:
# sample entry of a park
allParks[0]

('Aire de repos 8e Avenue',
 'https://montreal.ca/lieux/aire-de-repos-8e-avenue')

### 2.2 Extracting additional park information 
The Montréal website also has a dedicated page for each park, which lists its opening hours, general information and a link to the park on Google Maps. Since both the name of each park and the url of its dedicated page have been collected, that page can be used to gather additional information about the park in case it is needed later. 

In [22]:
def extractAdditionalParkInfo(parkInfo, browserEl):
    ''' Given a list of park names and corresponding urls, extracts each park's Google Maps url as well as its 
    description if present. Returns a list of tuples that contain the park name, the link to the park's page on 
    the Montréal website, the Google Maps url and the description of the park 
    '''
    fullParkInfo = []
    for parkName, parkLink in parkInfo:
        browserEl.get(parkLink)
        try: 
            parkUrl = browserEl.find_elements_by_xpath(
                '//div[@class="list-item-content"]/div[@class="list-item-action mt-1"]/a')
            googleMapsUrl = parkUrl[0].get_attribute('href') #retrieve the url to Google Maps
        except: 
            googleMapsUrl = None
        
        # try extracting a description, if not present just set to None
        try: 
            # get a description of location
            description = browserEl.find_elements_by_xpath('//div[@class="content-module-stacked"]/div/p')[0].text
        
        except: 
            description = None
        
        fullParkInfo.append((parkName, parkLink, googleMapsUrl, description))
    return fullParkInfo

In [23]:
# extract additional park info
allParksData = extractAdditionalParkInfo(allParks, browser)

In [24]:
# sample collected data entry 
allParksData[:1]

[('Aire de repos 8e Avenue',
  'https://montreal.ca/lieux/aire-de-repos-8e-avenue',
  'https://www.google.com/maps/search/?api=1&query=Boulevard%20Saint-Joseph%20Lachine%20H8S%202M2%20Qu%C3%A9bec,%20Canada',
  'L’aire de repos de la 8e Avenue offre un point de vue sur le canal de Lachine.')]
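
The Google Maps link in the sample entry is a search url whose `query` parameter holds the percent-encoded address of the park. A small sketch of how that address could be recovered with the standard library is shown below; `address_from_maps_url` is a hypothetical helper, and it assumes that every collected link follows the same `?api=1&query=...` pattern as the sample.

```python
from urllib.parse import urlparse, parse_qs

def address_from_maps_url(maps_url):
    '''Returns the decoded `query` parameter of a Google Maps search url, or None.'''
    if not maps_url:
        return None
    params = parse_qs(urlparse(maps_url).query)
    values = params.get('query')
    return values[0] if values else None

# url taken from the sample entry above
sample = ('https://www.google.com/maps/search/?api=1&query=Boulevard%20Saint-Joseph'
          '%20Lachine%20H8S%202M2%20Qu%C3%A9bec,%20Canada')
print(address_from_maps_url(sample))
# prints: Boulevard Saint-Joseph Lachine H8S 2M2 Québec, Canada
```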

Having collected data for all the parks in Montréal, it is important to safeguard it before moving on to Google Maps. To do so, the collected data set is saved in a CSV file, which can easily be loaded the next time without having to repeat the data collection step. 

In [25]:
import pandas as pd

In [28]:
# save park information in dataframe
park_df = pd.DataFrame(allParksData, columns=['name', 'link', 'google_maps', 'description'])
park_df.head()

| Unnamed: 0 | name | link | google_maps | description |
|---|---|---|---|---|
| 0 | Aire de repos 8e Avenue | https://montreal.ca/lieux/aire-de-repos-8e-avenue | https://www.google.com/maps/search/?api=1&quer... | L’aire de repos de la 8e Avenue offre un point... |
| 1 | Bassin de la Brunante | https://montreal.ca/lieux/bassin-de-la-brunante | https://www.google.com/maps/search/?api=1&quer... | Le bassin de la Brunante est un lieu privilégi... |
| 2 | Belvédère du Chemin-Qui-Marche | https://montreal.ca/lieux/belvedere-du-chemin-... | https://www.google.com/maps/search/?api=1&quer... | C’est un parc linéaire proche du fleuve Saint-... |
| 3 | Boisé du parc Marcel-Laurin | https://montreal.ca/lieux/boise-du-parc-marcel... | https://www.google.com/maps/search/?api=1&quer... | Consultez la carte des sentiers. |
| 4 | Boisé Saint-Conrad | https://montreal.ca/lieux/boise-saint-conrad | https://www.google.com/maps/search/?api=1&quer... | Venez profiter des attraits de la nature ou pr... |


In [29]:
# save data in csv file
#park_df.to_csv('ParkInformation.csv')
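
By default, `to_csv` also writes the row index as a first, unnamed column, which `read_csv` then loads as a column called `Unnamed: 0` (the same name that appears in the table output above). A minimal sketch of a save and reload round trip that avoids this by passing `index=False` is shown below. It uses an in-memory buffer as a stand-in for `ParkInformation.csv` and a one-row demo frame, both of which are illustration only.

```python
import io
import pandas as pd

cols = ['name', 'link', 'google_maps', 'description']
# single demo row; the last two fields are left empty for brevity
rows = [('Aire de repos 8e Avenue',
         'https://montreal.ca/lieux/aire-de-repos-8e-avenue',
         None, None)]
demo_df = pd.DataFrame(rows, columns=cols)

buffer = io.StringIO()               # in-memory stand-in for 'ParkInformation.csv'
demo_df.to_csv(buffer, index=False)  # index=False: do not write the row index
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(list(reloaded.columns))
# prints: ['name', 'link', 'google_maps', 'description']
```

An alternative with the same effect is to keep the default `to_csv` call and reload with `pd.read_csv(path, index_col=0)`.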

<a id="google-maps-calls"></a>
## 3. Collecting Google Maps data
After collecting the names of all the parks in Montréal, the next step is to look each one up on Google Maps and extract its reviews. 

### 3.1 Functions to collect reviews
Selenium is used to collect the reviews. For a given park, this means searching for the name of the park, clicking through to the reviews, scrolling down until all reviews are loaded (the page loads them dynamically), and finally collecting and storing the reviews. The reviews for each park are stored in a separate CSV file. 

In [46]:
# load necessary libraries
import time
from bs4 import BeautifulSoup
import re
from datetime import datetime

In [35]:
def searchplace(browserEl, search):
    ''' finds the search bar and performs a search for a given search phrase'''
    place = browserEl.find_element_by_class_name("tactile-searchbox-input")
    place.clear()
    place.send_keys(search)
    submitButton = browserEl.find_element_by_xpath("/html/body/jsl/div[3]/div[9]/div[3]/div[1]/div[1]/div[1]/div[2]/div[1]/button")
    submitButton.click()

In [177]:
def goToAllReviews(browserEl):
    '''helper function that clicks on more reviews once a google maps place is loaded'''
    element = browserEl.find_elements_by_xpath('//button[@jsaction=\'pane.reviewlist.goToReviews\']')
    time.sleep(2)
    element[0].click()

In [89]:
def scroll(browserEl): 
    '''scrolls down to dynamically load all of the reviews'''
    
    keepScrolling = True

    while keepScrolling: 
        time.sleep(2) # to allow for everything to load
        try: 
            # scroll down 
            scrollable_div = browserEl.find_element_by_css_selector('div.wo1ice-loading.noprint')
            browserEl.execute_script("arguments[0].scrollIntoView(true);", scrollable_div)
        except: 
            # once scrolled to the bottom print notification
            print('reached the end')
            keepScrolling = False
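
The loop in `scroll` stops when the loading indicator can no longer be found, and its bare `except` also ends the loop on any unrelated error. As a hedged sketch of a different stopping rule, the function below keeps scrolling until the number of loaded reviews stops growing. `scroll_until_stable`, `get_count` and `scroll_once` are hypothetical names: with Selenium, `get_count` might return the length of a `find_elements_by_xpath` result for the review blocks, and `scroll_once` might wrap the `execute_script` call used above.

```python
import time

def scroll_until_stable(get_count, scroll_once, max_rounds=200, pause=2.0, patience=2):
    '''Calls scroll_once() until get_count() has not grown for `patience` rounds in a row
    or until max_rounds is reached. Returns the last count that was observed to grow to.'''
    last_count = get_count()
    stalled = 0
    for _ in range(max_rounds):
        scroll_once()
        time.sleep(pause)  # give the page time to load more items
        count = get_count()
        if count > last_count:
            last_count = count
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return last_count

# usage with stand-ins for the browser: 35 items in total, 10 loaded per scroll
state = {'loaded': 10, 'total': 35}

def fake_scroll():
    state['loaded'] = min(state['loaded'] + 10, state['total'])

print(scroll_until_stable(lambda: state['loaded'], fake_scroll, pause=0))
# prints: 35
```

The `patience` parameter tolerates a slow load that yields no new reviews in a single round, at the cost of a few extra scrolls at the end.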

In [217]:
def expandReview(browserEl):
    '''Helper function that clicks open all reviews that are longer and for which the text is otherwise not 
    fully visible'''
    
    expandReviews = browserEl.find_elements_by_xpath('//button[@jsaction=\'pane.review.expandReview\']')

    for ex in expandReviews: 
        time.sleep(2)
        try: 
            ex.click()
        except:
            print('error expanding review pane')
    #print('All reviews expanded successfully')

In [165]:
def collectReviewInfo(review,reviewFor): 
    ''' 
    Given the html for a review, as well as the name of the park that the review is for (reviewFor), extracts the 
    review id, username, url of the contributor profile, published date, number of previous reviews by the user, 
    number of stars and the text of the review if present. Returns a dictionary with the review information 
    '''
    
    reviewInfo = {} # dictionary to store review information 
    
    review_id = review['data-review-id']
    username = review.find('div', class_='ODSEW-ShBeI-title').find('span').text
    user_url = review.find('a')['href']
    date_published = review.find('span', class_='ODSEW-ShBeI-RgZmSc-date').text
    num_stars = float(review.find('span', class_='ODSEW-ShBeI-H1e3jb')['aria-label'].split(' ')[1])
    
    try: # collect number of previous reviews by user if present
        review_nums = review.find('div', class_='ODSEW-ShBeI-VdSJob').find_all('span')[1].text
        num_reviews = int(re.findall(r'\d+', review_nums.split()[0])[0]) 
    except:
        num_reviews = 0
    
    try: # extract review text if present
        review_text = review.find('span', class_='ODSEW-ShBeI-text').text
    except Exception:
        review_text = None
    
    reviewInfo['review_for'] = reviewFor
    reviewInfo['review_id'] = review_id
    reviewInfo['username'] = username
    reviewInfo['user_url'] = user_url
    reviewInfo['published'] = date_published
    reviewInfo['date_retrieved'] = datetime.now()
    reviewInfo['num_stars'] = num_stars
    reviewInfo['num_reviews'] = num_reviews
    reviewInfo['review_text'] = review_text
    
    return reviewInfo
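
`collectReviewInfo` reads the star rating from a fixed position in the `aria-label` text and the review count from the first word of a span, so both break if the wording shifts. A minimal sketch of two more tolerant parsers is given below. `parse_star_rating` and `parse_review_count` are hypothetical helpers, and the label strings in the example are assumed formats, not captured from Google Maps.

```python
import re

def parse_star_rating(aria_label):
    '''Returns the first number found in the label as a float, or None if there is none.'''
    match = re.search(r'\d+(?:[.,]\d+)?', aria_label or '')
    if match is None:
        return None
    # accept a decimal comma as well as a decimal point
    return float(match.group(0).replace(',', '.'))

def parse_review_count(text):
    '''Returns the first integer found in the text, or 0 if there is none.'''
    match = re.search(r'\d+', text or '')
    return int(match.group(0)) if match else 0

# assumed example strings
print(parse_star_rating(' 5 stars '))    # prints: 5.0
print(parse_review_count('12 reviews'))  # prints: 12
```

Like the original code, `parse_review_count` only reads the first run of digits, so a count written with a thousands separator would be cut short.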

In [218]:
def collectParkReviews(query, browserEl):
    '''Given the name of a park does a google maps search for the park, expands all reviews and scrapes them.
    If no park reviews are found None is returned, otherwise all of the reviews are returned in the form of a list'''
    
    time.sleep(2)
    searchplace(browserEl, query + ' Montréal') # searches for the park in search bar
    
    try: 
        time.sleep(5) # leave time to load page
        goToAllReviews(browserEl) # tries going to the review page
        time.sleep(2) 
        
        scroll(browserEl) # scrolls down
        time.sleep(2) # leave time to load page
        
        expandReview(browserEl) # expands long reviews
        time.sleep(3) # leave time to load page
        
        # use BeautifulSoup to parse and extract the information for each review 
        response = BeautifulSoup(browserEl.page_source, 'html.parser')
        rblock = response.find_all('div', class_='ODSEW-ShBeI NIyLF-haAclf gm2-body-2')
        allReviewData = [collectReviewInfo(ireview, query) for ireview in rblock]
        return allReviewData # return the list of collected review information 
    except:
        print('unable to collect reviews')
        return None

In [219]:
def collectMultipleReviews(parkList, browserEl):
    '''Given a list of park names, searches for the reviews of each park and, for each successful 
    collection, saves the reviews for that park in a csv file named after the park. 
    The names of parks that had no reviews or for which review collection was unsuccessful are returned 
    as a list for further troubleshooting. 
    '''
    
    unsuccessful = []
    for park in parkList: 
        browserEl.get('https://www.google.com/maps')
        time.sleep(5)
        reviewsData = collectParkReviews(park, browserEl)
        if reviewsData is None: 
            unsuccessful.append(park)
        else:
            df = pd.DataFrame(reviewsData)
            df.to_csv(park + '.csv')
            
    return unsuccessful
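
`collectMultipleReviews` uses the park name directly as the file name. Park names contain spaces, apostrophes and accented letters, and a name with a path separator would make `to_csv` fail or write to an unexpected location. One possible convention, sketched below with a hypothetical `safe_filename` helper, is to reduce each name to ASCII letters, digits, `-` and `_` before saving.

```python
import re
import unicodedata

def safe_filename(park_name, extension='.csv'):
    '''Turns a park name into a file name made of ASCII letters, digits, "-" and "_".'''
    # split accented letters into base letter + accent, then drop the non-ASCII parts
    ascii_name = (unicodedata.normalize('NFKD', park_name)
                  .encode('ascii', 'ignore').decode('ascii'))
    # replace every run of other characters (spaces, apostrophes, slashes) with "_"
    cleaned = re.sub(r'[^A-Za-z0-9_-]+', '_', ascii_name).strip('_')
    return (cleaned or 'unnamed') + extension

print(safe_filename("Parc-nature de l'Île-de-la-Visitation"))
# prints: Parc-nature_de_l_Ile-de-la-Visitation.csv
```

Two different park names can map to the same file name under this scheme, so the `review_for` column remains the reliable record of which park a review belongs to.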

In [151]:
browser.get('https://www.google.com/maps')

In [293]:
# list of some of the parks 
list(park_df['name'][640:660])

['Parc Moulin-du-Rapide',
 'Parc Mozart',
 'Parc Mullins-Richmond',
 'Parc Mullins-Wellington',
 'Parc Munro',
 'Parc Murielle-Dumont',
 'Parc Napoléon',
 'Parc Napoléon-Sénécal',
 'Parc-nature de l’Anse-à-l’Orme',
 "Parc-nature de l'Île-de-la-Visitation",
 'Parc-nature de la Pointe-aux-Prairies',
 "Parc-nature du Bois-de-L'Île-Bizard",
 'Parc-nature du Bois-de-Liesse',
 'Parc-nature du Bois-de-Saraguay',
 'Parc-nature du Cap-Saint-Jacques',
 'Parc-nature du Ruisseau-De Montigny',
 'Parc Nelson-Mandela',
 'Parc Nesbitt',
 'Parc Neuville-sur-Vanne',
 'Parc Nicolas-Tillemont']

In [None]:
# scraping of a group of parks 
missing11 = collectMultipleReviews(list(park_df['name'][770:800]), browser)

el found []
unable to collect reviews
el found [<selenium.webdriver.remote.webelement.WebElement (session="73c99ef7c0f0c61fc16a1fa91f433885", element="94275dbd-8f89-4e0c-9a3b-b35e583a8726")>]
reached the end
All reviews expanded successfully
el found [<selenium.webdriver.remote.webelement.WebElement (session="73c99ef7c0f0c61fc16a1fa91f433885", element="d5f4eace-e4f4-486e-8a8e-d81da95d706f")>]
reached the end
All reviews expanded successfully
el found [<selenium.webdriver.remote.webelement.WebElement (session="73c99ef7c0f0c61fc16a1fa91f433885", element="86163cc6-7cd5-468c-8f9a-eee6f72f1cde")>]


### 3.2 Collecting reviews with special format
For some searches, Google Maps returns multiple parks with the same name, or a park that has only one review. An additional processing step is therefore necessary for the parks for which no data could be collected with the code above. 

In [275]:
oldParks = [
    'Parc Guillaume-Couture', 'Parc Gédéon-De Catalogne',
    'Parc-école Saint-Pierre-Apôtre', 'Parc du Pied-du-Courant',
    'Parc du Père-Marquette', 'Parc du Mail', 'Parc du Bocage', 'Parc des Hirondelles',
    'Parc des Écluses', 'Parc de la Fontaine',
    'Mini-parc Querbes', 'Parc Coubertin', 'Parc Chamberland', 'Parc Chabot',
    'Parc Bélanger De Chateaubriand',
    'Parc Baldwin', 'Parc Houde', 'Parc Angrignon', 'Parc Ahuntsic',
    'Parc J.-Albert-Gariépy', 'Parc J.O.R.-Leduc',
    'Parc Jarry', 'Parc Gédéon-De Catalogne', 'Parc Jeanne-Mance',
    'Parc Jessie-Maxwell-Smith', 'Parc Lucie-Bruneau',
    'Parc Mackenzie-King', 'Parc Maisonneuve', 'Parc Mignault', 'Parc Monty',
    'Parc-nature de la Pointe-aux-Prairies',
    'Parc-nature du Bois-de-Liesse', 'Parc-nature du Cap-Saint-Jacques',
    'Parc Nicolas-Viel', 'Parc Ovila-Pelletier',
    'Parc Painter', 'Parc Paul-Séguin',
]
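
`'Parc Gédéon-De Catalogne'` appears twice in the list above, so it would be searched and scraped twice. A short sketch of removing repeats while keeping the original order is shown below; `unique_in_order` is a hypothetical helper and `retry_sample` is a shortened stand-in for `oldParks`.

```python
def unique_in_order(names):
    '''Removes repeated names while keeping the first occurrence of each.'''
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(names))

retry_sample = ['Parc Guillaume-Couture', 'Parc Gédéon-De Catalogne',
                'Parc Jarry', 'Parc Gédéon-De Catalogne']
print(unique_in_order(retry_sample))
# prints: ['Parc Guillaume-Couture', 'Parc Gédéon-De Catalogne', 'Parc Jarry']
```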