# Gathering Subway data
Subway stations are one part of urban infrastructure. People often complain about their cleanliness and other aspects, so reviews of stations are a useful source of data for gauging how citizens feel about various infrastructure objects. Here we gather reviews for the subway stations in Montréal. The notebook has the following three sections:
<br>1. [Gathering names of subways](#subways)
<br>2. [Scraping reviews from Google Maps](#collecting-reviews)
<br>3. [Merging subway reviews](#merging-reviews)

<a id="subways"></a>
## 1. Gathering subway names
Unlike the park names, for which the Montréal website offers a repository, there is no comprehensive list of subway stations there. Instead, a Wikipedia article includes a table of Montréal subway stations. With the help of BeautifulSoup, the HTML can be parsed and the table extracted. The table includes the name of each station, the date it opened, and the origin of its name.
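
As a minimal offline sketch of this step, the snippet below parses a small inline HTML table in the same way. The HTML string, the two sample rows, and the helper name `table_to_records` are made up for illustration; they are not the real Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real Wikipedia markup)
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Name</th><th>Opened</th></tr>
  <tr><td>Station A</td><td>1 Jan 1970</td></tr>
  <tr><td>Station B</td><td>2 Feb 1980</td></tr>
</table>
"""

def table_to_records(html):
    """Return the rows of the first 'wikitable' as a list of dicts keyed by header."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable"})
    rows = table.find_all("tr")
    # the first row holds the column headers
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        records.append(dict(zip(headers, cells)))
    return records

records = table_to_records(SAMPLE_HTML)
print(records)
```

A list of dicts like this can be passed straight to `pd.DataFrame` if a dataframe is preferred.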

In [2]:
import pandas as pd # library for data analysis
import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML documents

In [5]:
# get the response in the form of html
# bus routes: "https://en.wikipedia.org/wiki/List_of_Soci%C3%A9t%C3%A9_de_transport_de_Montr%C3%A9al_bus_routes"
wikiurl="https://en.wikipedia.org/wiki/List_of_Montreal_Metro_stations"
# table_class="wikitable sortable jquery-tablesorter"
response=requests.get(wikiurl)

In [6]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(response.text, 'html.parser')
subwaytable=soup.find('table',{'class':"wikitable"}) # find the table in html 

In [7]:
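# read the table into pandas; read_html returns a list of dataframes
from io import StringIO # newer pandas versions expect a file-like object rather than a raw html string
tables=pd.read_html(StringIO(str(subwaytable)))
# the first entry of the list is the station table
df=tables[0]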

In [8]:
df.head() 

Unnamed: 0,Name,Odonym,Namesake,Line,Opened
0,Angrignon,Boulevard Angrignon; Parc Angrignon,"Jean-Baptiste Angrignon, city councillor",,3 Sep 1978
1,Monk,Boulevard Monk,"James Monk, Quebec Attorney-General",,3 Sep 1978
2,Jolicoeur,Rue Jolicoeur,"Jean-Moïse Jolicoeur, parish priest",,3 Sep 1978
3,Verdun,Rue de Verdun; borough of Verdun,"Notre-Dame-de-Saverdun, France, hometown of Se...",,3 Sep 1978
4,De L'Église,Avenue de l'Église,Église Saint-Paul,,3 Sep 1978


The dataframe still includes columns that are not needed, and some station names need to be cleaned. First the unnecessary column is dropped, and then the station names are cleaned by removing the text that follows the word 'formerly'. In total there are 68 stations. The station names are saved in a csv file for later use.

In [41]:
# drop the unwanted columns
data = df.copy()
data = data.drop(['Line'], axis=1)

In [46]:
# remove the text from 'formerly' onwards from the station names
namesClean = df['Name'].apply(lambda x: x.split('formerly')[0])
data['Name'] = namesClean

In [11]:
data.shape 

(68, 4)

In [12]:
# save subway data in csv file 
data.to_csv('MontrealSubways.csv')
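
The cleaning steps above can be tried on a toy dataframe first. This is a minimal sketch: the rows are made up, and it assumes a former name is appended after the word 'formerly'. It also strips the whitespace left over after the split, which the cell above does not do.

```python
import pandas as pd

# Toy data: made-up rows, assuming a former name follows the word 'formerly'
toy = pd.DataFrame({
    "Name": ["Station A", "Station B formerly Old B"],
    "Line": [None, None],
    "Opened": ["1 Jan 1970", "2 Feb 1980"],
})

# drop the column that is not needed
cleaned = toy.drop(columns=["Line"])

# keep only the text before 'formerly' and trim surrounding whitespace
cleaned["Name"] = cleaned["Name"].str.split("formerly").str[0].str.strip()

print(cleaned["Name"].tolist())
```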

<a id="collecting-reviews"></a>
## 2. Scraping reviews from Google Maps

The station names collected in part one of this notebook make it possible to collect the reviews that relate to a particular subway station. With the list of subways, each station is searched for on Google Maps using Selenium, and its reviews are scraped.

In [2]:
# load selenium module for scraping 
from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException

In [14]:
# load necessary libraries
import time
import re
from datetime import datetime

In [110]:
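# executable_path is the Selenium 3 style; Selenium 4 expects service=Service(path) from selenium.webdriver.chrome.service
browser = webdriver.Chrome(executable_path="/Users/andreamock/Documents/chromedriver")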

In [15]:
def searchplace(browserEl, search):
    '''Finds the search bar and performs a search for a given search phrase in Google Maps'''
    place = browserEl.find_element(By.CLASS_NAME, "tactile-searchbox-input")
    place.clear()
    place.send_keys(search)
    submitButton = browserEl.find_element(By.XPATH, "/html/body/jsl/div[3]/div[9]/div[3]/div[1]/div[1]/div[1]/div[2]/div[1]/button")
    submitButton.click()

In [16]:
def clickOnCorrectPlace(browserEl): 
    '''If a Google Maps search yields multiple results, clicks on the first matching result so that its 
    reviews can be gathered'''
    time.sleep(5)
    try: 
        itemInfo = browserEl.find_element(By.CLASS_NAME, "a4gq8e-aVTXAb-haAclf-jRmmHf-hSRGPd")
        itemInfo.click()
    except Exception: 
        # no result list was found, so the search already opened the place page
        print('Already correct link')
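
The fixed `time.sleep(5)` pauses used in this helper and the ones that follow either wait longer than necessary or not long enough. A polling wait returns as soon as a condition holds. Selenium provides `WebDriverWait` for this purpose (imported above); the dependency-free sketch below only illustrates the idea. The names `wait_until` and `element_present` are hypothetical, and the fake condition stands in for a check such as "the element is on the page".

```python
import time

def wait_until(condition, timeout=5.0, poll=0.1):
    """Call condition() repeatedly until it returns a truthy value or the timeout expires.

    Returns the truthy value, or None if the timeout was reached.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)

# Fake condition that becomes true on the third call
calls = {"n": 0}

def element_present():
    calls["n"] += 1
    return "element" if calls["n"] >= 3 else None

found = wait_until(element_present, timeout=2.0, poll=0.01)
missing = wait_until(lambda: None, timeout=0.05, poll=0.01)
print(found, missing)
```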

In [96]:
def extractReviewNumbers(browserEl): 
    '''Given a Google Maps page, extracts the number of reviews for a particular place and returns it. If 
    the number of reviews is not stated, returns 0'''
    time.sleep(5)
    try:
        numberReviews = browserEl.find_elements(By.XPATH, '//button[@class="widget-pane-link"]')[0].text
        
        if len(numberReviews) == 0: 
            # fall back to a second element that can hold the review count
            numberReviews = browserEl.find_elements(By.XPATH, '//body/jsl[1]/div[3]/div[9]/div[8]/div[1]/div[1]/div[1]/div[1]/div[2]/div[1]/div[1]/div[2]/div[1]/div[1]/span[1]/span[1]/span[1]/span[2]/span[1]/span[1]')[0].text
            
        print('num', numberReviews)
        # take the first run of digits, e.g. '90 reviews' -> 90
        reviewNums = int(re.findall(r'\d+', numberReviews)[0])
    except Exception:
        # number of reviews not given
        reviewNums = 0
    return reviewNums
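
Taking the first run of digits works for the counts shown in this notebook, which all have at most three digits. If a larger count were displayed with a thousands separator (an assumption, for example '1,234 reviews'), the first run of digits would only be `1`. A minimal sketch of a more tolerant parser follows; `parse_review_count` is a hypothetical helper, not part of the notebook.

```python
import re

def parse_review_count(text):
    """Return the first integer in text, ignoring ',' or '.' thousands separators; 0 if none."""
    # digits, optionally followed by groups of a separator and exactly three digits
    match = re.search(r'\d+(?:[,.]\d{3})*', text)
    if match is None:
        return 0
    # drop the separators before converting
    return int(re.sub(r'\D', '', match.group()))

print(parse_review_count('431 reviews'))
print(parse_review_count('1,234 reviews'))
print(parse_review_count(''))
```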

In [80]:
def goToAllReviews(browserEl):
    '''Helper function that clicks on "more reviews" once a Google Maps place is loaded'''
    # candidate elements that lead to the full review list, tried in order
    elements = ['//button[@jsaction=\'pane.reviewChart.moreReviews\']', 
                '//div[@jsaction=\'pane.reviewChart.moreReviews\']', 
                '//button[@jsaction=\'pane.reviewlist.goToReviews\']',
                '//div[@jsaction=\'pane.reviewlist.goToReviews\']'
               ]
    
    for el in elements: 
        browserElement = browserEl.find_elements(By.XPATH, el)
        if len(browserElement) != 0: 
            
            try: 
                browserElement[0].click()
                break
            except Exception:
                print('exception occurred')      

In [81]:
def scroll(browserEl): 
    '''Scrolls down to dynamically load all of the reviews'''
    
    keepScrolling = True

    while keepScrolling: 
        time.sleep(2) # to allow for everything to load
        try: 
            # scroll the loading element into view so that more reviews are loaded
            scrollable_div = browserEl.find_element(By.CSS_SELECTOR, 'div.wo1ice-loading.noprint')
            browserEl.execute_script("arguments[0].scrollIntoView(true);", scrollable_div)
        except Exception: 
            # the loading element can no longer be found, so the bottom has been reached
            print('reached the end')
            keepScrolling = False

In [82]:
def expandReview(browserEl):
    '''Helper function that clicks open all longer reviews whose text is otherwise not 
    fully visible'''
    
    expandReviews = browserEl.find_elements(By.XPATH, '//button[@jsaction=\'pane.review.expandReview\']')

    for ex in expandReviews: 
        time.sleep(2)
        try: 
            ex.click()
        except Exception:
            print('error expanding review pane')
    #print('All reviews expanded successfully')

In [83]:
def collectReviewInfo(review,reviewFor): 
    ''' 
    Given the html for a review and the name of the item the review is for (reviewFor), extracts the 
    review id, username, url of the contributor profile, published date, number of previous reviews by the user, 
    number of stars, and the text of the review if present. Returns a dictionary with the review information. 
    '''
    
    reviewInfo = {} # dictionary to store review information 
    
    review_id = review['data-review-id']
    username = review.find('div', class_='ODSEW-ShBeI-title').find('span').text
    user_url = review.find('a')['href']
    date_published = review.find('span', class_='ODSEW-ShBeI-RgZmSc-date').text
    num_stars = float(review.find('span', class_='ODSEW-ShBeI-H1e3jb')['aria-label'].split(' ')[1])
    
    try: # collect number of previous reviews by user if present
        review_nums = review.find('div', class_='ODSEW-ShBeI-VdSJob').find_all('span')[1].text
        num_reviews = int(re.findall(r'\d+', review_nums.split()[0])[0]) 
    except Exception:
        num_reviews = 0
    
    try: # extract review text if present
        review_text = review.find('span', class_='ODSEW-ShBeI-text').text
    except Exception:
        review_text = None
    
    reviewInfo['review_for'] = reviewFor
    reviewInfo['review_id'] = review_id
    reviewInfo['username'] = username
    reviewInfo['user_url'] = user_url
    reviewInfo['published'] = date_published
    reviewInfo['date_retrieved'] = datetime.now()
    reviewInfo['num_stars'] = num_stars
    reviewInfo['num_reviews'] = num_reviews
    reviewInfo['review_text'] = review_text
    
    return reviewInfo
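
The same extraction pattern can be tried offline on a small piece of markup. This is a minimal sketch: the HTML and its class names (`review`, `title`, `stars`, `text`) are simplified, hypothetical stand-ins for the Google Maps class names used above, and `parse_review` is an illustrative helper. It checks for a missing tag explicitly instead of relying on an exception.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, simplified review markup; the real Google Maps class names differ
SAMPLE_REVIEW = """
<div class="review" data-review-id="abc123">
  <div class="title"><span>Sample User</span></div>
  <span class="stars" aria-label=" 4 stars "></span>
</div>
"""

def parse_review(block):
    """Extract id, username, stars and (optional) text from one review element."""
    stars_label = block.find("span", class_="stars")["aria-label"]
    text_tag = block.find("span", class_="text")  # None when the review has no text
    return {
        "review_id": block["data-review-id"],
        "username": block.find("div", class_="title").find("span").text,
        # first number in the label, e.g. ' 4 stars ' -> 4.0
        "num_stars": float(re.search(r"\d+(?:\.\d+)?", stars_label).group()),
        "review_text": text_tag.text if text_tag is not None else None,
    }

soup = BeautifulSoup(SAMPLE_REVIEW, "html.parser")
review = parse_review(soup.find("div", class_="review"))
print(review)
```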

In [84]:
def collectReviews(query, browserEl):
    '''Given the name of a location/item, does a Google Maps search for the location, expands all reviews and scrapes 
    them. If no reviews are found, None is returned; otherwise all of the reviews are returned as a list'''
    
    
    time.sleep(2)
    searchplace(browserEl, query + ' subway station Montréal') # searches for the subway station in search bar
    
    time.sleep(5)
    clickOnCorrectPlace(browserEl)
    
    try: 
        time.sleep(5) # leave time to load page
        
        numReviews = extractReviewNumbers(browserEl) # gets the number of reviews for a particular location
        print('Num reviews', numReviews)
        if numReviews == 0:
            # add the name of an item with no reviews to a txt file; 'with' closes the file afterwards
            with open("noSubwayReviews.txt", "a+") as my_file:
                my_file.write(query + '\n')
            return None # no reviews found
        
        if numReviews > 3: 
            time.sleep(5)
            goToAllReviews(browserEl) # tries going to the review page
            time.sleep(2) 
            scroll(browserEl) # scrolls down
            time.sleep(2) # leave time to load page
        
        expandReview(browserEl) # expands long reviews
        time.sleep(3) # leave time to load page
        
        # use BeautifulSoup to parse and extract the information for each review 
        response = BeautifulSoup(browserEl.page_source, 'html.parser')
        rblock = response.find_all('div', class_='ODSEW-ShBeI NIyLF-haAclf gm2-body-2')
        print('reached this point')
        allReviewData = [collectReviewInfo(ireview, query) for ireview in rblock]
        return allReviewData # return the list of collected review information 
    
    except Exception:
        print('unable to collect reviews')
        return None

In [85]:
def collectMultipleReviews(itemList, browserEl):
    '''Given a list of multiple locations (e.g. subway station names), searches for reviews for all of the locations and, for 
    each successful collection, saves the reviews for that location in a csv file named after the location. 
    If a location did not have any reviews or the collection was unsuccessful, the names of these locations are 
    returned as a list for further troubleshooting. 
    '''
    
    unsuccessful = []
    for item in itemList: 
        browserEl.get('https://www.google.com/maps')
        time.sleep(5)
        reviewsData = collectReviews(item, browserEl)
        if reviewsData is None: 
            unsuccessful.append(item)
        else:
            df = pd.DataFrame(reviewsData)
            df.to_csv(item + '.csv')
            
    return unsuccessful # returns the list of subways/items for which no reviews could be collected

In [203]:
# collect a subset of reviews
collectMultipleReviews(['Peel', 'Berri-UQAM', 'Pie-IX'], browser)

Already correct link
num 90 reviews
Num reviews 90
im here
reached the end
reached this point
Already correct link
num 431 reviews
Num reviews 431
im here
reached the end
reached this point
num 74 reviews
Num reviews 74
im here
reached the end
reached this point


[]

In some cases, the dynamic loading of a Google Maps review page causes problems for the scraper. Because the reviews are loaded dynamically, the page sometimes gets stuck (e.g. when the total number of reviews is large, or when the page keeps showing the loading sign although all of the reviews have loaded). In such cases a 'manual' scrolling method with a fixed number of scroll steps is used to avoid an infinite loop. Below is the code for the few cases where this approach was needed.

In [68]:
# search for reviews 

query = 'Peel'
search = query + ' subway station Montréal'
browser.get('https://www.google.com/maps')
searchplace(browser, search)
goToAllReviews(browser)

# scroll a fixed number of times instead of waiting for the loading sign to disappear
for i in range(1000): 
    scrollable_div = browser.find_element(By.CSS_SELECTOR, 'div.wo1ice-loading.noprint')
    browser.execute_script("arguments[0].scrollIntoView(true);", scrollable_div)
    
expandReview(browser)
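
An alternative to a fixed number of scroll steps is to stop once scrolling no longer loads new reviews, with an upper bound as a safety net. The sketch below is not the approach used in this notebook. `scroll_until_stable` and `FakePage` are hypothetical names, and the fake page stands in for the browser so that the logic can run without Selenium.

```python
def scroll_until_stable(scroll_once, count_items, max_rounds=1000, patience=3):
    """Scroll until the item count stops growing for `patience` rounds or max_rounds is hit.

    Returns the number of scroll rounds performed.
    """
    last_count = count_items()
    stalled = 0
    rounds = 0
    while rounds < max_rounds and stalled < patience:
        scroll_once()
        rounds += 1
        current = count_items()
        if current > last_count:
            # new items appeared, so reset the stall counter
            last_count = current
            stalled = 0
        else:
            stalled += 1
    return rounds

class FakePage:
    """Stand-in for a review page: each scroll loads 10 more reviews, up to `total`."""
    def __init__(self, total=35):
        self.total = total
        self.loaded = 10

    def scroll(self):
        self.loaded = min(self.total, self.loaded + 10)

    def count(self):
        return self.loaded

page = FakePage()
rounds = scroll_until_stable(page.scroll, page.count, max_rounds=50, patience=3)
print(rounds, page.loaded)
```

In a real run, `scroll_once` would wrap the `execute_script` call plus a short pause, and `count_items` would count the review elements currently on the page.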

In [None]:
# parse and save reviews

resp = BeautifulSoup(browser.page_source, 'html.parser')
rblock1 = resp.find_all('div', class_='ODSEW-ShBeI NIyLF-haAclf gm2-body-2')

allReviewData = [collectReviewInfo(ireview, query) for ireview in rblock1]
len(allReviewData)

df = pd.DataFrame(allReviewData)
df.to_csv(query + '.csv')

<a id="merging-reviews"></a>

## 3. Merging subway reviews  

For each subway station the reviews are stored in a separate csv file. To compile a dataset that contains all of the reviews, the files have to be merged. Using pandas, os, and glob, the files are read from the shared folder they are saved in, concatenated into one large dataset, and saved as a csv for later use.
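
A minimal, self-contained sketch of this merge on two small files written to a temporary folder. The station names and ratings are made up. Unlike the cells in this notebook, the sketch writes the files with `index=False`, so no extra `Unnamed: 0` column appears when they are read back.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # write two small per-station files (made-up station names and ratings)
    pd.DataFrame({"review_for": ["Station A", "Station A"], "num_stars": [5.0, 4.0]}).to_csv(
        os.path.join(tmp_dir, "Station A.csv"), index=False)
    pd.DataFrame({"review_for": ["Station B"], "num_stars": [3.0]}).to_csv(
        os.path.join(tmp_dir, "Station B.csv"), index=False)

    # collect every csv in the folder; sorted() makes the row order reproducible
    csv_paths = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))

    # read each file and stack the rows into one dataframe
    merged = pd.concat([pd.read_csv(path) for path in csv_paths], ignore_index=True)

print(merged)
```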


In [213]:
# importing libraries
import glob
import os

# directory with all of the scraped subway reviews 
dirName = '/Users/andreamock/Documents/SubwayReviews'

# glob pattern that matches every csv file in the directory
joined_files = os.path.join(dirName, "*.csv")
  
# list of all csv files that match the pattern 
joined_list = glob.glob(joined_files)

# join files in a pandas dataframe
df = pd.concat(map(pd.read_csv, joined_list), ignore_index=True)
df

Unnamed: 0.1,Unnamed: 0,review_for,review_id,username,user_url,published,date_retrieved,num_stars,num_reviews,review_text
0,0,Lionel-Groulx,ChZDSUhNMG9nS0VJQ0FnSUNjdjkzOVZREAE,Jisan Samrat,https://www.google.com/maps/contrib/1182926626...,a year ago,2021-06-30 00:37:13.278544,5.0,109,Green Line and Orange Line
1,1,Lionel-Groulx,ChdDSUhNMG9nS0VJQ0FnSUR5enRHYW13RRAB,Sheila Ferrando,https://www.google.com/maps/contrib/1096381594...,3 months ago,2021-06-30 00:37:13.279844,5.0,82,It's w metro station Google!
2,2,Lionel-Groulx,ChZDSUhNMG9nS0VJQ0FnSUNxeHVpR0NnEAE,John Mandrake,https://www.google.com/maps/contrib/1049594479...,3 weeks ago,2021-06-30 00:37:13.283496,4.0,47,Good metro station man
3,3,Lionel-Groulx,ChdDSUhNMG9nS0VJQ0FnSURLdXByYmdBRRAB,Abdoulie Touray,https://www.google.com/maps/contrib/1108310335...,a month ago,2021-06-30 00:37:13.285167,5.0,6,Easily accessible
4,4,Lionel-Groulx,ChRDSUhNMG9nS0VJQ0FnSURLdExZUBAB,George Swanson,https://www.google.com/maps/contrib/1165742509...,a month ago,2021-06-30 00:37:13.286322,5.0,8,Nice metro
...,...,...,...,...,...,...,...,...,...,...
4884,68,Pie-IX,ChZDSUhNMG9nS0VJQ0FnSURBcFpqLUZnEAE,Frédéric Desroches,https://www.google.com/maps/contrib/1045305300...,3 years ago,2021-06-30 11:58:08.908984,4.0,124,
4885,69,Pie-IX,ChZDSUhNMG9nS0VJQ0FnSUM4cWNMYkRREAE,Ariel Gauthier,https://www.google.com/maps/contrib/1111854191...,11 months ago,2021-06-30 11:58:08.909773,4.0,6,
4886,70,Pie-IX,ChZDSUhNMG9nS0VJQ0FnSUN3Z0wtVENnEAE,Viviane Bonneau,https://www.google.com/maps/contrib/1088751603...,4 years ago,2021-06-30 11:58:08.910557,4.0,159,
4887,71,Pie-IX,ChZDSUhNMG9nS0VJQ0FnSUNjbTlPRUFREAE,Jisan Samrat,https://www.google.com/maps/contrib/1182926626...,a year ago,2021-06-30 11:58:08.911391,5.0,109,


In [214]:
merged_df = df.drop('Unnamed: 0', axis=1) # get rid of unnecessary column
merged_df.head()

Unnamed: 0,review_for,review_id,username,user_url,published,date_retrieved,num_stars,num_reviews,review_text
0,Lionel-Groulx,ChZDSUhNMG9nS0VJQ0FnSUNjdjkzOVZREAE,Jisan Samrat,https://www.google.com/maps/contrib/1182926626...,a year ago,2021-06-30 00:37:13.278544,5.0,109,Green Line and Orange Line
1,Lionel-Groulx,ChdDSUhNMG9nS0VJQ0FnSUR5enRHYW13RRAB,Sheila Ferrando,https://www.google.com/maps/contrib/1096381594...,3 months ago,2021-06-30 00:37:13.279844,5.0,82,It's w metro station Google!
2,Lionel-Groulx,ChZDSUhNMG9nS0VJQ0FnSUNxeHVpR0NnEAE,John Mandrake,https://www.google.com/maps/contrib/1049594479...,3 weeks ago,2021-06-30 00:37:13.283496,4.0,47,Good metro station man
3,Lionel-Groulx,ChdDSUhNMG9nS0VJQ0FnSURLdXByYmdBRRAB,Abdoulie Touray,https://www.google.com/maps/contrib/1108310335...,a month ago,2021-06-30 00:37:13.285167,5.0,6,Easily accessible
4,Lionel-Groulx,ChRDSUhNMG9nS0VJQ0FnSURLdExZUBAB,George Swanson,https://www.google.com/maps/contrib/1165742509...,a month ago,2021-06-30 00:37:13.286322,5.0,8,Nice metro


In [215]:
merged_df.to_csv('SubwayReviews.csv') # save merged dataframe to a csv file 