#  Scraping Yelp Data

Yelp is a site that allows users to publish reviews of businesses, public places, activities and other things around the world. Among the places listed on the site are parks and metro stations, two key aspects of city infrastructure. This notebook details how to extract infrastructure information from Yelp and save it to CSV files. 

It is broken down into the following sections: 
<br>1. [Scraping name of places](#places-scraping)
<br>2. [Scraping individual reviews](#scraping-indiv)
<br>3. [Appendix](#appendix)

<a id="places-scraping"></a>
## 1. Scraping name of places
Before scraping individual reviews, we need to collect a list of places along with the URL of each place. For example, we can search for all of the parks located in Montreal. The list of search results can be saved (e.g. in a CSV file or a dataframe) and then used to collect the reviews for each individual place. 
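
A search query URL can also be built programmatically instead of being copied from the browser. The following is a minimal sketch, assuming the `find_desc` and `find_loc` query parameters that appear in the Montreal parks query used in this notebook; the helper name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base address taken from the example query in this notebook (assumed to be stable)
YELP_SEARCH_BASE = "https://www.yelp.com/search"

def build_search_url(description, location):
    """Return a Yelp search URL for `description` near `location` (hypothetical helper)."""
    # urlencode percent-encodes commas and turns spaces into '+'
    query = urlencode({"find_desc": description, "find_loc": location})
    return YELP_SEARCH_BASE + "?" + query

parks_url = build_search_url("park", "Montreal, Quebec, Canada")
print(parks_url)
```

Building the URL this way makes it easy to reuse the same scraping code for other searches, such as metro stations.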

In [335]:
# import necessary libraries 
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
import time

In [73]:
def getSearchResults(queryUrl): 
    """ Given a yelp search query url, retrieves all of the search results saving each in a dictionary detailing 
    the name of the place, the link to the place, the number of reviews and the average rating of the place. All of the 
    dictionaries are stored in a list which is returned. 
    """
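    response = requests.get(queryUrl)

    # name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # the last token of the pagination text is taken as the total number of pages
    numPages = soup.find('div', 
         {'class': 'border-color--default__09f24__1eOdn text-align--center__09f24__1P1jK'})
    number = numPages.find('span').text
    totalPages = int(number.split()[-1])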
    
    counter = 1
    
    resultsList = []
    
    while counter <= totalPages: 
        time.sleep(5)
        
        containers = soup.findAll("div", {'class':
                              'container__09f24__21w3G hoverable__09f24__2nTf3 margin-t3__09f24__5bM2Z margin-b3__09f24__1DQ9x padding-t3__09f24__-R_5x padding-r3__09f24__1pBFG padding-b3__09f24__1vW6j padding-l3__09f24__1yCJf border--top__09f24__8W8ca border--right__09f24__1u7Gt border--bottom__09f24__xdij8 border--left__09f24__rwKIa border-color--default__09f24__1eOdn'})
        
        for item in containers: 
            itemDict = dict()
            itemInfo = item.find('span', {'class': 'css-1pxmz4g'})
            
            # get the url to that particular location information 
            
            itemUrl = itemInfo.find('a')['href']
            itemLink = "https://www.yelp.com" + itemUrl 
            
            # get the name of the place
            placeName = itemInfo.find('a')['name']
            
            itemDict['Name'] = placeName
            itemDict['Link'] = itemLink
            
            ratingsInfo = item.find('span', 
                                            {'class':'display--inline__09f24__EhyFv border-color--default__09f24__1eOdn'})
                
            print('rat', ratingsInfo)
            try:
            
                # get the average number of stars   
                starRatings = float(ratingsInfo.find('div')['aria-label'].replace(" star rating", "")) 
                itemDict['Rating'] = starRatings
                
                # extract the number or reviews 
                reviewNum = item.find('span', {'class': 'reviewCount__09f24__EUXPN css-e81eai'}).text
                itemDict['Num_Reviews'] = int(reviewNum)
            except:
                itemDict['Rating'] = 0
                itemDict['Num_Reviews'] = 0
            
            resultsList.append(itemDict)
            
        # if there are multiple pages of results, go to the next page
        try: 
            newLink = queryUrl + '&start=' + str(counter*10)
            response = requests.get(newLink)

            soup = BeautifulSoup(response.text, 'html.parser')
            
        except: 
            print('no more pages available')
        
        counter += 1     
    
    return resultsList   
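
The paging logic used above (requesting the same URL with a `start` offset that grows by 10 per page) can be sketched on its own as follows. This is only an illustration under stated assumptions: the helper name `page_urls` and the `some-park` link are hypothetical, and the page size of 10 is taken from the `counter*10` offset in the function above.

```python
def page_urls(base_url, total_pages, page_size=10):
    """Return one URL per results page, using `start` offsets (hypothetical helper)."""
    # Use '&' only if the URL already has a query string, otherwise begin one with '?'
    separator = "&" if "?" in base_url else "?"
    urls = [base_url]  # the first page needs no offset
    for page in range(1, total_pages):
        urls.append(f"{base_url}{separator}start={page * page_size}")
    return urls

for url in page_urls("https://www.yelp.com/search?find_desc=park", 3):
    print(url)
```

Generating the list of URLs up front keeps the number of requests equal to the number of pages, and the separator check also covers links that do not yet contain a `?`.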
    

In [18]:
# yelp search query for parks in montreal 
allParksUrl = 'https://www.yelp.com/search?find_desc=park&find_loc=Montreal%2C+Quebec%2C+Canada'

In [None]:
# example of extracting the list of all parks in montreal that are on yelp
allParks = getSearchResults(allParksUrl)

In [83]:
# save the list of parks in a data frame 
parksYelpDf = pd.DataFrame(allParks)
parksYelpDf.head()

Unnamed: 0,Name,Link,Rating,Num_Reviews
0,Parc du Mont-Royal,https://www.yelp.com/biz/parc-du-mont-royal-mo...,4.5,352
1,Parc la Fontaine,https://www.yelp.com/biz/parc-la-fontaine-mont...,4.5,48
2,Parc de La Cite,https://www.yelp.com/biz/parc-de-la-cite-saint...,4.5,4
3,Square Saint-Louis,https://www.yelp.com/biz/square-saint-louis-mo...,5.0,23
4,Parc-nature de l’Île-de-la-Visitation,https://www.yelp.com/biz/parc-nature-de-l-%C3%...,4.5,10


In [84]:
# save parks in a csv file 
#parksYelpDf.to_csv('YelpParks.csv')

<a id="scraping-indiv"></a>
## 2. Scraping individual reviews
Given the link to a place on Yelp, one can scrape the reviews on the page for that particular place. To do so, we define the structure in which the information for a review is saved in the ReviewInfo class. This structure was adopted from <a href="https://github.com/Poomulus/Yelp-Review-Scraper">this GitHub repository</a>.

The scraping of each individual review is done with the help of BeautifulSoup, which extracts the review, review date, number of stars, the reviewer's name, the reviewer's profile link as well as their location. For each individual place the reviews are stored in a list and then saved in a csv file with the name of the place the reviews were collected for. 
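
The text handling behind two of these fields can be sketched without any HTML. The example below assumes rating labels of the form `4.5 star rating` (English) or `5 étoiles` (French), and pagination text whose last number is the page count (for instance `1 of 12`, an assumed format). The helper names are hypothetical, and the decimal-comma handling is a precaution rather than something observed in the data.

```python
import re

def parse_star_rating(label):
    """Extract the numeric rating from a label such as '4.5 star rating' or '5 étoiles'."""
    match = re.search(r"\d+(?:[.,]\d+)?", label)
    if match is None:
        return None
    # Normalise a possible decimal comma before converting (assumption for French pages)
    return float(match.group().replace(",", "."))

def parse_total_pages(text):
    """Extract the last number from pagination text such as '1 of 12' (assumed format)."""
    numbers = re.findall(r"\d+", text)
    return int(numbers[-1]) if numbers else None

print(parse_star_rating("4.5 star rating"))
print(parse_star_rating("5 étoiles"))
print(parse_total_pages("1 of 12"))
```

Returning `None` for unexpected text makes it possible to skip a malformed review instead of stopping the whole scrape.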

In [112]:
class ReviewInfo:
    """framework to save the information for a review"""
    def __init__(self, comment, rating, ratingDate, profileName, profileLink, location):
        self.comment = comment
        self.rating = rating
        self.ratingDate = ratingDate
        self.profileName = profileName
        self.profileLink = profileLink
        self.location = location

    def to_dict(self):
        return {
            'comment': self.comment,
            'rating': self.rating,
            'ratingDate': self.ratingDate,
            'profileName': self.profileName,
            'profileLink': self.profileLink,
            'location': self.location
        }

In [212]:
def scrapeYelpReviews2(yelpLink, french=False): 
    """Debugging variant of scrapeYelpReviews: prints intermediate values and does not catch errors.
    Given the link to a place, extracts all of the reviews for that place. For each review we extract 
    the comment, review date, number of stars, the reviewer's name, the reviewer's profile link as well as 
    their location.
    Additionally, if the reviews are in French, it makes sure that the saved profile links use the French 
    Yelp domain. 
    """
    
    response = requests.get(yelpLink)

    soup = BeautifulSoup(response.text)
    

    # find out how many pages there are to iterate over
    time.sleep(10)
    numPagesDiv = soup.find('div',  
             {'class': 'border-color--default__373c0__2oFDT text-align--center__373c0__1l506'})
    print('pagedov', numPagesDiv)
    numberText = numPagesDiv.find('span').text
    numPages = int(numberText.split()[-1])
    print(numPages)
    counter = 1

    reviewList = [] # list to save review information 
    while counter <=numPages+1:
            time.sleep(5)
            review_containers = soup.findAll("div", {'class':'review__373c0__13kpL border-color--default__373c0__2oFDT'})

            for review in review_containers:

                theReview = ReviewInfo('', '', '', '', '', '')

                # get rating

                divCont = review.find("span", {'class': 'display--inline__373c0__2SfH_ border-color--default__373c0__30oMI'}
                         ).find("div")
                try: 
                    theReview.rating = int(divCont['aria-label'].replace(" star rating", ""))

                except: # if in French
                    theReview.rating = int(divCont['aria-label'].replace(" étoiles", ""))

                # extract date of review 
                theReview.ratingDate = review.find("span",{'class':'css-e81eai'} ).text


                # extract comment text
                theReview.comment = review.find("p",{'class':'comment__373c0__1M-px css-n6i4z7'} ).text

                
                # get the profile details: name and link 
                profileInfo = review.find("div",{'class':'user-passport-info border-color--default__373c0__2oFDT'} )
                print('profile inf', profileInfo)
                profileName = profileInfo.find('span').text # get the name of the user
                print('profile name', profileName)
                try: 
                    userUrl = profileInfo.find('a')['href']
                    if french: 
                        part1Url = 'https://fr.yelp.ca'
                    else: 
                        part1Url = 'https://www.yelp.com'

                    profileLink = part1Url + userUrl # create a full url for user profile
                except:
                    profileLink = None

                theReview.profileLink = profileLink
                theReview.profileName = profileName

                profileLocation = review.find('div', 
                      {'class': 'responsive-hidden-small__373c0__1ozaH border-color--default__373c0__2oFDT'}).text
                theReview.location = profileLocation

                #ADD REVIEW TO LIST
                reviewList.append(theReview)

            #if there are multiple pages with reviews, go to next page
            try: 
                newLink = yelpLink + '&start=' + str(counter*10)
                response = requests.get(newLink)

                soup = BeautifulSoup(response.text)

            except: 
                print('no more pages available')

            counter += 1

    return reviewList 

    

In [230]:
def scrapeYelpReviews(yelpLink, french=False): 
    """Given the link to a place, extracts all of the reviews for that place. For each review we extract 
    the comment, review date, number of stars, the reviewer's name, the reviewer's profile link as well as 
    their location.
    Additionally, if the reviews are in French, it makes sure that the saved profile links use the French 
    Yelp domain. Returns None if the reviews could not be extracted.
    """
    
    response = requests.get(yelpLink)

    soup = BeautifulSoup(response.text)
    
    try: 
        # find out how many pages there are to iterate over
        time.sleep(10)
        numPagesDiv = soup.find('div',  
             {'class': 'border-color--default__373c0__2oFDT text-align--center__373c0__1l506'})
        numberText = numPagesDiv.find('span').text
        numPages = int(numberText.split()[-1])
        counter = 1

        reviewList = [] # list to save review information 
        while counter <=numPages+1:
            time.sleep(5)
            review_containers = soup.findAll("div", {'class':'review__373c0__13kpL border-color--default__373c0__2oFDT'})

            for review in review_containers:

                theReview = ReviewInfo('', '', '', '', '', '')

                # get rating

                divCont = review.find("span", {'class': 'display--inline__373c0__2SfH_ border-color--default__373c0__30oMI'}
                         ).find("div")
                try: 
                    theReview.rating = int(divCont['aria-label'].replace(" star rating", ""))

                except: # if in French
                    theReview.rating = int(divCont['aria-label'].replace(" étoiles", ""))

                # extract date of review 
                theReview.ratingDate = review.find("span",{'class':'css-e81eai'} ).text


                # extract comment text
                theReview.comment = review.find("p",{'class':'comment__373c0__1M-px css-n6i4z7'} ).text


                # get the profile details: name and link 
                profileInfo = review.find("div",{'class':'user-passport-info border-color--default__373c0__2oFDT'} )
                
                profileName = profileInfo.find('span').text # get the name of the user
                
                try:
                    userUrl = profileInfo.find('a')['href']
                    if french: 
                        part1Url = 'https://fr.yelp.ca'
                    else: 
                        part1Url = 'https://www.yelp.com'

                    profileLink = part1Url + userUrl # create a full url for user profile
                except:
                    profileLink = None
                    
                theReview.profileLink = profileLink
                theReview.profileName = profileName

                profileLocation = review.find('div', 
                      {'class': 'responsive-hidden-small__373c0__1ozaH border-color--default__373c0__2oFDT'}).text
                theReview.location = profileLocation

                #ADD REVIEW TO LIST
                reviewList.append(theReview)

            #if there are multiple pages with reviews, go to next page
            try: 
                newLink = yelpLink + '&start=' + str(counter*10)
                response = requests.get(newLink)

                soup = BeautifulSoup(response.text)

            except: 
                print('no more pages available')

            counter += 1

        return reviewList 
    except:
        print('No reviews found')
        return None
    

In [125]:
def createCSV(reviewList, placeName, french=False):
    """Given a list of reviews and the name of a place, saves the reviews in a csv file named after the 
    place. If the reviews are in French, ' fr' is appended to the name of the csv file. """
    if french:
        name = placeName + ' fr'
    else:
        name = placeName
        
    with open(name + '.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['ID', 'Comment', 'Rating', 'Rating Date', 'Profile Name', 'Profile Link', 'Profile Location'])
        x = 1
        for review in reviewList:
            writer.writerow([str(x), review.comment, review.rating, review.ratingDate, review.profileName, 
                             review.profileLink, review.location])
            x+=1
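
Because the place name is used directly as the file name, names containing a path separator (such as `Ile des Soeurs // Nun’s Island`) cannot be written as-is. One possible workaround, shown here only as a sketch with a hypothetical helper `safe_file_name`, is to replace such characters before opening the file.

```python
import re

def safe_file_name(place_name, french=False):
    """Build a csv file name from a place name, replacing path separators (hypothetical helper)."""
    # Replace runs of characters that are not allowed in file names on common systems
    cleaned = re.sub(r'[\\/:*?"<>|]+', "-", place_name).strip()
    suffix = " fr" if french else ""
    return cleaned + suffix + ".csv"

print(safe_file_name("Ile des Soeurs // Nun’s Island", french=True))
print(safe_file_name("Parc Beaudet"))
```

If file names are changed this way, the original place name should be kept alongside the file (for example in a column) so that it can still be matched against the list of parks.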

In [189]:
def extractReviews(placeName, yelpLink,french=False): 
    '''Scrapes the reviews for a place and saves them in a csv file. Additionally, all of the scraped reviews 
    are returned. If the reviews are in French, the french parameter should be True so that the saved
    links point to the French site and the csv file is labeled accordingly.'''
    scrapedReviews = scrapeYelpReviews(yelpLink,french)
    if scrapedReviews is not None:
        createCSV(scrapedReviews, placeName, french)
    return scrapedReviews

In [207]:
def extractReviews2(placeName, yelpLink,french=False): 
    '''Debugging variant of extractReviews that calls scrapeYelpReviews2. Scrapes the reviews for a place 
    and saves them in a csv file. Additionally, all of the scraped reviews are returned. If the reviews are 
    in French, the french parameter should be True so that the saved links point to the French site and the 
    csv file is labeled accordingly.'''
    scrapedReviews = scrapeYelpReviews2(yelpLink,french)
    if scrapedReviews is not None:
        createCSV(scrapedReviews, placeName, french)
    return scrapedReviews

In [135]:
# extract dataframe of parks that actually have reviews
parksWReviews = parksYelpDf[parksYelpDf['Num_Reviews'] > 0 ]

parksWReviews.reset_index(drop=True)

Unnamed: 0,Name,Link,Rating,Num_Reviews
0,Parc du Mont-Royal,https://www.yelp.com/biz/parc-du-mont-royal-mo...,4.5,352
1,Parc la Fontaine,https://www.yelp.com/biz/parc-la-fontaine-mont...,4.5,48
2,Parc de La Cite,https://www.yelp.com/biz/parc-de-la-cite-saint...,4.5,4
3,Square Saint-Louis,https://www.yelp.com/biz/square-saint-louis-mo...,5.0,23
4,Parc-nature de l’Île-de-la-Visitation,https://www.yelp.com/biz/parc-nature-de-l-%C3%...,4.5,10
...,...,...,...,...
114,Ile des Soeurs // Nun’s Island,https://www.yelp.com/biz/ile-des-soeurs-nuns-i...,3.5,3
115,Parc Saint-Paul,https://www.yelp.com/biz/parc-saint-paul-montr...,5.0,1
116,Parc Ballantyne - Gary Cartier Field,https://www.yelp.com/biz/parc-ballantyne-gary-...,4.0,1
117,Parc Beaudet,https://www.yelp.com/biz/parc-beaudet-montr%C3...,4.0,1


In [186]:
caLinks = parksWReviews['Link'].apply(lambda x: 'https://fr.yelp.ca/biz' + x.split('biz')[-1])
caLinks[1]

'https://fr.yelp.ca/biz/parc-la-fontaine-montr%C3%A9al-2?osq=park'
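
Splitting the link on the substring `biz` works for the links shown here, but it would cut in the wrong place if a place's slug itself contained `biz`. A slightly more defensive sketch, assuming only that the French site is served from the host `fr.yelp.ca` as in the cell above, replaces the host and leaves the rest of the URL untouched. The helper name `to_french_link` and the `bizarre-park` slug are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def to_french_link(link, french_host="fr.yelp.ca"):
    """Return `link` with its host swapped for the French-language Yelp host."""
    parts = urlsplit(link)
    # Only the host changes; the path and query string are preserved as-is
    return urlunsplit(parts._replace(netloc=french_host))

print(to_french_link("https://www.yelp.com/biz/parc-la-fontaine-montr%C3%A9al-2?osq=park"))
print(to_french_link("https://www.yelp.com/biz/bizarre-park"))
```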

In [187]:
# assign returns a new dataframe, which avoids setting a column on a slice of parksYelpDf
parksWReviews = parksWReviews.assign(FrLink=caLinks)


In [188]:
parksWReviews

Unnamed: 0,Name,Link,Rating,Num_Reviews,FrLink
0,Parc du Mont-Royal,https://www.yelp.com/biz/parc-du-mont-royal-mo...,4.5,352,https://fr.yelp.ca/biz/parc-du-mont-royal-mont...
1,Parc la Fontaine,https://www.yelp.com/biz/parc-la-fontaine-mont...,4.5,48,https://fr.yelp.ca/biz/parc-la-fontaine-montr%...
2,Parc de La Cite,https://www.yelp.com/biz/parc-de-la-cite-saint...,4.5,4,https://fr.yelp.ca/biz/parc-de-la-cite-saint-h...
3,Square Saint-Louis,https://www.yelp.com/biz/square-saint-louis-mo...,5.0,23,https://fr.yelp.ca/biz/square-saint-louis-mont...
4,Parc-nature de l’Île-de-la-Visitation,https://www.yelp.com/biz/parc-nature-de-l-%C3%...,4.5,10,https://fr.yelp.ca/biz/parc-nature-de-l-%C3%AE...
...,...,...,...,...,...
148,Ile des Soeurs // Nun’s Island,https://www.yelp.com/biz/ile-des-soeurs-nuns-i...,3.5,3,https://fr.yelp.ca/biz/ile-des-soeurs-nuns-isl...
151,Parc Saint-Paul,https://www.yelp.com/biz/parc-saint-paul-montr...,5.0,1,https://fr.yelp.ca/biz/parc-saint-paul-montr%C...
155,Parc Ballantyne - Gary Cartier Field,https://www.yelp.com/biz/parc-ballantyne-gary-...,4.0,1,https://fr.yelp.ca/biz/parc-ballantyne-gary-ca...
158,Parc Beaudet,https://www.yelp.com/biz/parc-beaudet-montr%C3...,4.0,1,https://fr.yelp.ca/biz/parc-beaudet-montr%C3%A...


In [246]:
for i in range(28,58): 
    time.sleep(10)
    print(parkRev2.iloc[i]['Name'], i, parkRev2.iloc[i]['FrLink'])
    extractReviews(parkRev2.iloc[i]['Name'], parkRev2.iloc[i]['FrLink'], french=True)

In [155]:
parksWReviews['Name'][49:]

49                          Parc Willibrord
50                       Parc Lhasa-de Sela
51                       Parc des Carrières
52                    Place Kate-McGarrigle
54             Parc Monseigneur J-A Richard
                       ...                 
148          Ile des Soeurs // Nun’s Island
151                         Parc Saint-Paul
155    Parc Ballantyne - Gary Cartier Field
158                            Parc Beaudet
159    Parc des Compagnons de Saint-Laurent
Name: Name, Length: 70, dtype: object

In [None]:
# to do Parc Père-Marquette, Parc Baile, Parc Linéaire de la Commune, Belvédère Camillien-Houde 48, 
# Parc Lhasa-de Sela 50, Parc des Carrières 51, Parc Monseigneur J-A Richard 53,Parc Girouard 54, Parc de l’Aqueduc 58
# Parc de la Promenade Bellerive 63, Parc Atwater 64, Parc Dante 65, Parc John F Kennedy 66, Woonerf Saint-Pierre 69
# Parc Frédéric-Back 72

In [214]:
import os

In [233]:
# directory with all of the scraped park reviews 
dirName = '/Users/andreamock/Documents/yelpParkReviewsFr'

In [234]:
# gather the list of parks that were successfully scraped
listOfFiles = os.listdir(dirName)
fileNames = [file.split('.csv')[0] for file in listOfFiles]

In [240]:
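# remove the trailing ' fr' marker; str.strip(' fr') would also remove any 'f', 'r' or space at either end of the name
fCleaned = [f[:-3] if f.endswith(' fr') else f for f in fileNames]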
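
Removing the ` fr` marker needs some care, because `str.strip` treats its argument as a set of characters rather than as a suffix. The short sketch below (with a hypothetical helper `remove_fr_suffix`) shows the difference on one of the park names from this notebook.

```python
def remove_fr_suffix(name):
    """Remove a trailing ' fr' marker from a file stem, leaving other names untouched."""
    suffix = " fr"
    return name[: -len(suffix)] if name.endswith(suffix) else name

stem = "Parc Sir-Wilfrid-Laurier fr"
print(stem.strip(" fr"))       # character-set stripping also removes the final 'r' of 'Laurier'
print(remove_fr_suffix(stem))  # suffix removal keeps the name intact
```

A name that has been shortened by `strip` no longer matches the entry in the list of parks, so the park would be reported as still needing to be scraped.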

In [241]:
needScraping = [] # list of parks that were unable to be scraped 
for parkN in list(parksWReviews['Name']):
    if parkN not in fCleaned and '/' not in parkN: 
        needScraping.append(parkN)

In [242]:
len(needScraping)

59

In [243]:
parkRev2 = parksWReviews[parksWReviews['Name'].apply(lambda x: x in needScraping)]
parkRev2.reset_index(drop=True)

Unnamed: 0,Name,Link,Rating,Num_Reviews,FrLink
0,Parc la Fontaine,https://www.yelp.com/biz/parc-la-fontaine-mont...,4.5,48,https://fr.yelp.ca/biz/parc-la-fontaine-montr%...
1,Parc Sir-Wilfrid-Laurier,https://www.yelp.com/biz/parc-sir-wilfrid-laur...,4.5,18,https://fr.yelp.ca/biz/parc-sir-wilfrid-laurie...
2,Notre Dame de Grâce Park,https://www.yelp.com/biz/parc-notre-dame-de-gr...,5.0,2,https://fr.yelp.ca/biz/parc-notre-dame-de-gr%C...
3,Parc Desmarchais,https://www.yelp.com/biz/parc-desmarchais-mont...,4.5,3,https://fr.yelp.ca/biz/parc-desmarchais-montr%...
4,Jardin Botanique de Montréal,https://www.yelp.com/biz/jardin-botanique-de-m...,4.5,229,https://fr.yelp.ca/biz/jardin-botanique-de-mon...
5,Le Vieux-Port de Montréal,https://www.yelp.com/biz/le-vieux-port-de-mont...,4.5,105,https://fr.yelp.ca/biz/le-vieux-port-de-montr%...
6,Parc du Quai-de-la Tortue,https://www.yelp.com/biz/parc-du-quai-de-la-to...,4.0,2,https://fr.yelp.ca/biz/parc-du-quai-de-la-tort...
7,Murray Hill Park,https://www.yelp.com/biz/murray-hill-park-west...,5.0,2,https://fr.yelp.ca/biz/murray-hill-park-westmo...
8,Parc Joe-Beef,https://www.yelp.com/biz/parc-joe-beef-montr%C...,3.5,3,https://fr.yelp.ca/biz/parc-joe-beef-montr%C3%...
9,Place Jacques-Cartier,https://www.yelp.com/biz/place-jacques-cartier...,4.0,12,https://fr.yelp.ca/biz/place-jacques-cartier-m...


In [None]:
yelpScrapedParkReviews

In [264]:
# directory with all of the scraped park reviews 
dirName1 = '/Users/andreamock/Documents/yelpScrapedParkReviewsFr'

# gather the list of parks that were successfully scraped
listOfFiles1 = os.listdir(dirName1)
fileNames1 = [file.split('.csv')[0] for file in listOfFiles1]

In [254]:
listOfFiles1[0]

'Parc du Mont-Royal.csv'

In [260]:
df1 = pd.read_csv(dirName1 + '/' + listOfFiles1[0])
df1 = df1.drop('ID', axis=1)
df1['Place'] = fileNames1[0]

In [261]:
df1

Unnamed: 0,Comment,Rating,Rating Date,Profile Name,Profile Link,Profile Location,Place
0,This is a winter review. For all winter activ...,5,1/16/2021,Aimee H.,https://www.yelp.com/user_details?userid=oW9Po...,"Montreal, Canada",Parc du Mont-Royal
1,Parc du Royal has a character all on its own a...,5,8/23/2020,Mercedes C.,https://www.yelp.com/user_details?userid=ooHF-...,"Bronx, NY",Parc du Mont-Royal
2,This park provides a fun amount of physical ac...,5,8/28/2020,John S.,https://www.yelp.com/user_details?userid=xHMq2...,"Toronto, Canada",Parc du Mont-Royal
3,Parc du Mont-Royal or Mount Royal Park is like...,4,2/8/2020,Daniel B.,https://www.yelp.com/user_details?userid=j14Wg...,"Atlanta, GA",Parc du Mont-Royal
4,Really nice urban park with amazing views of t...,5,9/17/2020,Amy L.,https://www.yelp.com/user_details?userid=TfLxu...,"Toronto, Canada",Parc du Mont-Royal
...,...,...,...,...,...,...,...
308,I did not exactly do any research before headi...,5,6/27/2011,C. W.,https://www.yelp.com/user_details?userid=oHpBP...,"Arlington, VA",Parc du Mont-Royal
309,originally landscaped by Frederick Law Olmsted...,5,4/26/2011,Anthony K.,https://www.yelp.com/user_details?userid=5Ymfs...,"Montreal, Canada",Parc du Mont-Royal
310,I went to Montreal a couple of weeks ago for t...,5,5/16/2011,BLAUGRAN A.,https://www.yelp.com/user_details?userid=8Lsb9...,"Milwaukee, WI",Parc du Mont-Royal
311,This is a great place to go in Montreal. The v...,5,5/27/2009,Jennifer L.,https://www.yelp.com/user_details?userid=iHi_b...,"Belleville, NJ",Parc du Mont-Royal


In [266]:
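for f in listOfFiles1: 
    place = f.split('.csv')[0]
    if place.endswith(' fr'): # remove the trailing ' fr' marker without touching the rest of the name
        place = place[:-3]
    df = pd.read_csv(dirName1 + '/' + f)
    df = df.drop('ID', axis=1)
    df['Place'] = place # record which place the reviews belong to
    df.to_csv(f)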

In [269]:
import glob

In [275]:
dirName = '/Users/andreamock/Documents/yelpScrapedParkReviewsFr'

In [276]:
# merging the files
joined_files = os.path.join(dirName, "*.csv")
  
# A list of all joined files 
joined_list = glob.glob(joined_files)

In [277]:
# join files in a pandas dataframe
df = pd.concat(map(pd.read_csv, joined_list), ignore_index=True)
df

Unnamed: 0.1,Unnamed: 0,Comment,Rating,Rating Date,Profile Name,Profile Link,Profile Location,Place
0,0,Belle découverte pour les petits de 0 à 5 ans ...,4,10/9/2019,Emilie I.,https://fr.yelp.ca/user_details?userid=4Tn7Fjg...,"Montréal, QC",Parc Alphonse Télesphore Lépine
1,0,Le centre de la Nature est un endroit génial a...,5,8/12/2018,Valerie R.,https://fr.yelp.ca/user_details?userid=bBzOlCX...,"Laval, QC",Centre de la Nature
2,1,Je me suis rendue au Centre de la nature à la ...,5,2/3/2018,Judith C.,https://fr.yelp.ca/user_details?userid=4ONcRRi...,"Laval, QC",Centre de la Nature
3,2,"J'adore cette endroit, surtout pour les enfant...",4,16/12/2019,Lisa-Marie P.,https://fr.yelp.ca/user_details?userid=9n8cuO0...,"Le Sud-Ouest, Montréal, QC",Centre de la Nature
4,3,"J'adore ! Espace vert, sentiers, jeux d'eau, p...",5,28/7/2018,Cat R.,https://fr.yelp.ca/user_details?userid=lNDdOlm...,"Montréal, QC",Centre de la Nature
...,...,...,...,...,...,...,...,...
219,0,"Îlot de verdure au centre-ville, idéal pour pi...",5,28/1/2017,Jean T.,https://fr.yelp.ca/user_details?userid=uM43N9F...,"Montréal, QC",Parc Baile
220,0,Nous ne venons pas souvent dans le quartier ma...,5,18/8/2018,Valerie R.,https://fr.yelp.ca/user_details?userid=bBzOlCX...,"Laval, QC",Parc Morgan
221,0,"À partir de la route Côte-des-Neiges, une asce...",4,11/10/2018,Fanny C.,https://fr.yelp.ca/user_details?userid=v4e_WTX...,"Montréal, QC",Belvédère
222,1,Superbe vue sur le coté ouest de Montreal. Bel...,4,18/9/2016,Pat M.,https://fr.yelp.ca/user_details?userid=qKpkRCP...,"Montréal, QC",Belvédère


In [278]:
merged_df = df.drop('Unnamed: 0', axis=1) # get rid of unnecessary column
merged_df.head()

Unnamed: 0,Comment,Rating,Rating Date,Profile Name,Profile Link,Profile Location,Place
0,Belle découverte pour les petits de 0 à 5 ans ...,4,10/9/2019,Emilie I.,https://fr.yelp.ca/user_details?userid=4Tn7Fjg...,"Montréal, QC",Parc Alphonse Télesphore Lépine
1,Le centre de la Nature est un endroit génial a...,5,8/12/2018,Valerie R.,https://fr.yelp.ca/user_details?userid=bBzOlCX...,"Laval, QC",Centre de la Nature
2,Je me suis rendue au Centre de la nature à la ...,5,2/3/2018,Judith C.,https://fr.yelp.ca/user_details?userid=4ONcRRi...,"Laval, QC",Centre de la Nature
3,"J'adore cette endroit, surtout pour les enfant...",4,16/12/2019,Lisa-Marie P.,https://fr.yelp.ca/user_details?userid=9n8cuO0...,"Le Sud-Ouest, Montréal, QC",Centre de la Nature
4,"J'adore ! Espace vert, sentiers, jeux d'eau, p...",5,28/7/2018,Cat R.,https://fr.yelp.ca/user_details?userid=lNDdOlm...,"Montréal, QC",Centre de la Nature


In [279]:
merged_df.to_csv('YelpParkReviewsScrapedFr.csv') # save merged dataframe to a csv file 
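
One detail to keep in mind when working with the merged file is the `Rating Date` column. The dates scraped from the English site shown earlier are month-first (for example `1/16/2021`), while the French-site dates above appear to be day-first (for example `16/12/2019`). The following sketch, with a hypothetical helper `parse_review_date`, converts either form to a proper date under that assumption.

```python
from datetime import datetime

def parse_review_date(text, day_first=False):
    """Parse a review date such as '1/16/2021' (month first) or '16/12/2019' (day first)."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(text.strip(), fmt).date()

print(parse_review_date("1/16/2021"))
print(parse_review_date("16/12/2019", day_first=True))
```

Dates such as `8/12/2018` are ambiguous on their own, so the choice of format has to follow from which site the review was scraped from.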

<a id="appendix"></a>

## 3. Appendix
There are multiple ways to scrape data from websites. This notebook uses BeautifulSoup, but Selenium can be used as well. Below is some initial code that uses Selenium to collect individual reviews as an alternative to the method detailed above. 

In [9]:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [7]:
def getYelpReviews(yelpLink):
    """Given the link to a place, collects the reviews for that place using Selenium."""
    
    driver = webdriver.Chrome(executable_path="/Users/andreamock/Documents/chromedriver")
    driver.get(yelpLink)

    # find out how many pages there are to iterate over
    numPagesDiv = driver.find_element_by_xpath("//div[@class=' border-color--default__373c0__2oFDT text-align--center__373c0__1l506']")
    numberText = numPagesDiv.find_element_by_tag_name('span').text
    numPages = int(numberText.split()[-1])
    counter = 1
    
    reviewList = [] # list to save review information 
    while counter <=numPages:

        WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//div[@class=' review__373c0__13kpL border-color--default__373c0__2oFDT']")))
        reviews = driver.find_elements_by_xpath("//div[@class=' review__373c0__13kpL border-color--default__373c0__2oFDT']")

        for review in reviews:
            
            theReview = ReviewInfo('', '', '', '', '', '') # location is not collected here and stays empty

            #GET RATING
            # use a path relative to this review; an absolute path would return the first review's rating every time
            ratingDiv = review.find_element_by_xpath(".//span[@class=' display--inline__373c0__2SfH_ border-color--default__373c0__30oMI']/div")
            theReview.rating = ratingDiv.get_attribute('aria-label').replace(" star rating", "")

            #GET DATE OF RATING
            dateOfReview = review.find_element_by_xpath(".//span[@class=' css-e81eai']")
            theReview.ratingDate = dateOfReview.text

            #GET COMMENT
            commentParagraph = review.find_element_by_xpath(".//p[@class='comment__373c0__1M-px css-n6i4z7']")
            theReview.comment = commentParagraph.text

            #GET PROFILE NAME AND LINK TO PROFILE
            profileDiv = review.find_element_by_xpath(".//div[@class=' user-passport-info border-color--default__373c0__2oFDT']")
            profileLink = profileDiv.find_element_by_tag_name('a').get_attribute('href')
            profileName = profileDiv.find_element_by_tag_name('span').text
            theReview.profileLink = profileLink
            theReview.profileName = profileName


            #ADD REVIEW TO LIST
            reviewList.append(theReview)

        #GO TO THE NEXT PAGE OF REVIEWS (THE LOOP STOPS ONCE ALL numPages PAGES HAVE BEEN VISITED)
        newLink = yelpLink + '&start=' + str(counter*10)
        driver.get(newLink)
        
        counter += 1
            
    driver.close()
    return reviewList 