# Scraping SeeClickFix Reviews

SeeClickFix is a site that, according to their own description, is a "tool your neighborhood needs to fix that broken sidewalk and the pothole on the bus route, giving your kids a safer trip to school and improving the quality of life where you live". 

Because SeeClickFix is a place where residents report problems with urban infrastructure, it is a good source of infrastructure-related comments. This notebook details how comments were scraped from SeeClickFix and (partly) cleaned. All of the scraping is done with Selenium. 


**Credit:** The code was partly adapted from a fellow research intern, Abhay Mahajan, who is working on the same project with me this summer. The original code can be found on his [GitHub](https://github.com/mahajan-abhay/Concordia_University_MITACS).

1. [Scraping reviews](#link-scrape)
2. [Collecting more SeeClickFix data](#review-scrape)


<a id="link-scrape"></a>
## 1. Scraping reviews

### 1.1 Gathering links and comments 
To scrape each individual review, the overview pages that list all of the reviews have to be scraped first to collect the link to each review. The function `gatherLinks` collects these links. Afterwards, the details of each review (where the issue was submitted, the date of the review, etc.) can be collected and saved in a dictionary. 

In [1]:
# import selenium for scraping
import selenium
from selenium import webdriver

In [2]:
# import libraries for data manipulation
import pandas as pd
import numpy as np
import time

In [218]:
browser = webdriver.Chrome(executable_path="/Users/andreamock/Documents/chromedriver")

In [4]:
def gatherLinks(driver,url, numPages):
    '''Given the url to a page and the number of pages to go through to retrieve complaints, iterates 
    over all of the pages of complaints and saves the link to each individual issue in a list. Finally, the list is 
    returned.'''
    
    hrefs = []
    for i in range(numPages):
        driver.get(url + str(i)) # get a certain page of comments
        # extract all of the issues on one page 
        issues = driver.find_elements_by_class_name('riverTitle') 
        for issue in issues:
            element = issue.find_elements_by_tag_name("a") # anchor tags inside this issue title
            if not element: # skip titles that do not contain a link
                continue
            link = element[0].get_attribute("href")
            hrefs.append(link) # add link to the list
    return hrefs # return all of the collected links
    

In [6]:
allLinks = gatherLinks(browser,"https://seeclickfix.com/watchers/23686?page=", 119)

In [7]:
len(allLinks) # number of total links collected

1426

In [8]:
allLinks[:3] # subset of links 

['https://seeclickfix.com/issues/1089017-roads-and-sidewalks-rues-et-trottoirs',
 'https://seeclickfix.com/issues/3343034-roads-and-sidewalks-rues-et-trottoirs',
 'https://seeclickfix.com/issues/1105853-municipal-buildings-batiments-municipaux']
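The same issue can appear more than once in the collected list (the check in section 1.3 finds fewer unique issue IDs than rows). As a minimal sketch, assuming the original order should be kept, repeated links could be removed before scraping. `dedupeLinks` is a hypothetical helper that is not part of the original notebook.

```python
def dedupeLinks(links):
    """Return the links with repeats removed, keeping first-seen order."""
    # dict keys keep insertion order (Python 3.7+), so the result is order-stable
    return list(dict.fromkeys(links))


# the third entry repeats the first one
sampleLinks = [
    "https://seeclickfix.com/issues/1089017-roads-and-sidewalks-rues-et-trottoirs",
    "https://seeclickfix.com/issues/3343034-roads-and-sidewalks-rues-et-trottoirs",
    "https://seeclickfix.com/issues/1089017-roads-and-sidewalks-rues-et-trottoirs",
]
uniqueLinks = dedupeLinks(sampleLinks)
print(len(sampleLinks), len(uniqueLinks))
```

Removing repeats at this stage saves page loads; the `drop_duplicates` call used later in the cleaning step remains useful as a safety net.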

In [100]:
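def gatherCommentInfo(driver,link):
    '''Using Selenium (driver), navigates to the page of one issue and collects the items of its info box, 
    the date, title, location and description. The data is returned as a dictionary. If an error occurs, 
    a message is printed and the dictionary may be incomplete.'''
    driver.get(link) # pull up page 
    
    commentDict = {}
    try: 
        title = driver.find_element_by_xpath("//span[@itemprop = 'name']").text
        location = driver.find_element_by_xpath("//div[@class = 'tagline']").text
        locationClean = location.split('•')[0].strip() # get rid of unnecessary text in location info
    
        description = driver.find_element_by_xpath("//div[@itemprop = 'articleBody']").text.strip()
        # remove the leading 'DESCRIPTION' label; str.strip('DESCRIPTION') would instead remove any of 
        # those letters from both ends of the text
        if description.startswith('DESCRIPTION'):
            description = description[len('DESCRIPTION'):]
        descriptionClean = description.strip()
        infoBox = driver.find_elements_by_class_name("txt-block")
        infoItems = ['Issue ID', 'Submitted To', 'Category', 'Viewed', 'Neighborhood', 'Reported via']
        for infoLine in infoBox:
            lineInfo = infoLine.text.split(':')

            category = lineInfo[0].strip() # label before the first colon
            infoClean = lineInfo[-1].strip() # text after the last colon
            if category in infoItems: 
                commentDict[category] = infoClean
        
        for item in infoItems: # items missing from the page are stored as None
            if item not in commentDict:
                commentDict[item] = None
        
        dateEl = driver.find_element_by_xpath("//div[@class = 'txt-block']/time")
        commentDict['Date'] = dateEl.get_attribute("datetime")
        commentDict['Title'] = title
        commentDict['Location'] = locationClean
        commentDict['Description'] = descriptionClean
        
    except Exception as err: # report the problem and continue with the next link
        print('Error collecting information for ' + str(link) + ': ' + str(err))
    
    return commentDict    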
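The info box loop splits each line on `:` and keeps the text after the last colon. If a value contains a colon itself, only its tail survives, which may be why some `Reported via` values in the output further down start with `//`. The sketch below contrasts that behaviour with splitting on the first colon only. The function `parseInfoLine` and the sample line with a URL are assumptions made for illustration; the actual page text was not inspected.

```python
def parseInfoLine(line, keepFullValue=False):
    """Split one 'Label: value' line into a (label, value) pair."""
    if keepFullValue:
        # split on the first colon only, so colons inside the value are kept
        label, _, value = line.partition(':')
    else:
        # same logic as the scraping function: first piece and last piece
        parts = line.split(':')
        label, value = parts[0], parts[-1]
    return label.strip(), value.strip()


# hypothetical info box lines
print(parseInfoLine('Issue ID: 3605026'))
print(parseInfoLine('Reported via: http://www.cotesaintluc.org'))
print(parseInfoLine('Reported via: http://www.cotesaintluc.org', keepFullValue=True))
```

Which variant is preferable depends on whether the full value is needed later; the cleaning step in section 1.3 removes the leading slashes either way.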
    

In [102]:
# example of a data collected for one complaint
comment = gatherCommentInfo(browser,allLinks[5])
comment

{'Issue ID': '3605026',
 'Submitted To': 'Côte Saint Luc',
 'Category': 'Roads and sidewalks • Rues et trottoirs',
 'Viewed': '228 times',
 'Neighborhood': 'Côte-Saint-Luc',
 'Reported via': 'mobile application',
 'Date': '2017-08-02 13:26:54',
 'Title': 'Roads and sidewalks • Rues et trottoirs',
 'Location': '5581 Rosedale H4V 2J3',
 'Description': 'Two people have already tripped and injured themselves at night on these broken sidewalks. Thanks in advance!'}

In [103]:
# all of the data items that will be collected
comment.keys()

dict_keys(['Issue ID', 'Submitted To', 'Category', 'Viewed', 'Neighborhood', 'Reported via', 'Date', 'Title', 'Location', 'Description'])

### 1.2 Saving collected info in csv file
The scraping process is iterative. For each dictionary (one review and its information), a new row is appended to a csv file. After going through all of the links and scraping them, one therefore ends up with a csv file (here titled seeClickFixData.csv) that contains all of the reviews. 

In [88]:
import csv

In [104]:
# assign header columns
headerList = ['Issue ID', 'Submitted To', 'Category', 'Viewed', 'Neighborhood', 
              'Reported via', 'Date', 'Title', 'Location', 'Description']
  
# open CSV file and assign header
with open("seeClickFixData.csv", 'w', newline='') as file:
    dw = csv.DictWriter(file, delimiter=',', 
                        fieldnames=headerList)
    dw.writeheader()

In [95]:
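def append_dict_as_row(file_name, elem_dict):
    '''Appends the values of elem_dict as one new row to the csv file file_name. The columns are written 
    in the order of the dictionary keys, which therefore has to match the header of the file.'''
    # Open file in append mode
    with open(file_name, 'a+', newline='') as write_obj:
        # Create a writer object from csv module
        w = csv.DictWriter(write_obj, elem_dict.keys())
        w.writerow(elem_dict) # write the dictionary values as one row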
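Because `append_dict_as_row` takes its column order from the keys of each dictionary, a dictionary that is only partly filled (for example after a scraping error) would produce a shorter row that no longer lines up with the header. A minimal sketch of a variant with a fixed column list follows. `appendRowWithHeader` and the demo file are hypothetical and not part of the original notebook.

```python
import csv
import os
import tempfile


def appendRowWithHeader(fileName, elemDict, fieldNames):
    """Append one dict as a csv row, always using the same column order."""
    with open(fileName, 'a', newline='', encoding='utf-8') as writeObj:
        # restval fills columns that are missing from elemDict, so rows stay aligned
        writer = csv.DictWriter(writeObj, fieldnames=fieldNames, restval='')
        writer.writerow(elemDict)


demoFields = ['Issue ID', 'Title', 'Description']
demoPath = os.path.join(tempfile.mkdtemp(), 'demo.csv')

# write the header once
with open(demoPath, 'w', newline='', encoding='utf-8') as f:
    csv.DictWriter(f, fieldnames=demoFields).writeheader()

appendRowWithHeader(demoPath, {'Issue ID': '1', 'Title': 'Pothole', 'Description': 'Deep'}, demoFields)
appendRowWithHeader(demoPath, {'Title': 'Broken light'}, demoFields)  # partly filled dict

# read the file back to check that both rows have all three columns
with open(demoPath, newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))
print(rows)
```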

In [96]:
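def gatherAllComments(driver,listOfLinks): 
    '''Scrapes the information for each link in listOfLinks and appends it as a new row to seeClickFixData.csv'''
    for commentLink in listOfLinks:
        commentInfo = gatherCommentInfo(driver, commentLink)
        append_dict_as_row('seeClickFixData.csv', commentInfo)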

In [140]:
gatherAllComments(browser,allLinks[1100:])
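Scraping a slice such as `allLinks[1100:]` suggests that the run was picked up again partway through the list. As a sketch of an alternative to tracking the position by hand, the links that still need scraping could be derived from the issue IDs already saved. This assumes that the number after `/issues/` in a link is the issue ID, as the sample links and the `Issue ID` column suggest. `issueIdFromLink` and `remainingLinks` are hypothetical helpers.

```python
import re


def issueIdFromLink(link):
    """Return the digits that follow '/issues/' in a link, or None if absent."""
    match = re.search(r'/issues/(\d+)', link)
    return match.group(1) if match else None


def remainingLinks(links, scrapedIds):
    """Keep only the links whose issue ID has not been scraped yet."""
    done = set(scrapedIds)
    return [link for link in links if issueIdFromLink(link) not in done]


sampleLinks = [
    'https://seeclickfix.com/issues/1089017-roads-and-sidewalks-rues-et-trottoirs',
    'https://seeclickfix.com/issues/3343034-roads-and-sidewalks-rues-et-trottoirs',
    'https://seeclickfix.com/issues/1105853-municipal-buildings-batiments-municipaux',
]
# pretend the first issue is already in the csv file
todo = remainingLinks(sampleLinks, ['1089017'])
print(len(todo))
```

The set of scraped IDs could, for instance, be read from the `Issue ID` column of the csv file written so far.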

In [128]:
# read in scraped data as pandas dataframe 
df = pd.read_csv('seeClickFixData.csv')
df.shape

(1425, 10)

In [129]:
df.head() # sample reviews

Unnamed: 0,Issue ID,Submitted To,Category,Viewed,Neighborhood,Reported via,Date,Title,Location,Description
0,1089017,Côte Saint Luc,Roads and sidewalks • Rues et trottoirs,462 times,Côte-Saint-Luc,//www.cotesaintluc.org,2014-05-22 17:10:20,Roads and sidewalks • Rues et trottoirs,"6826 The Avenue Côte Saint-Luc, Quebec",The sidewalk/curb that extends from bike path ...
1,3343034,Côte Saint Luc,Roads and sidewalks • Rues et trottoirs,1360 times,Côte-Saint-Luc,//www.cotesaintluc.org,2017-04-16 22:02:52,Roads and sidewalks • Rues et trottoirs,"Ch Emerson Côte Saint-Luc, Québec","Today is day 43, I am beginning to fear my com..."
2,1105853,Côte Saint Luc,Municipal buildings • Bâtiments municipaux,700 times,Côte-Saint-Luc,mobile application,2014-06-03 23:42:45,Municipal buildings • Bâtiments municipaux,"7500 Chemin Mackle Côte Saint-Luc, QC H4W 1A6,...",Chairs too close to railing. Kids could climb ...
3,1515685,Côte Saint Luc,Waste • Matières résiduelles,2016 times,Côte-Saint-Luc,,2015-03-06 15:21:10,Waste • Matières résiduelles,"5752 Av :Lockwood Cote Saint Luc , Quebec",THE COLOR OF A BIN SHOULD NOT MATTER. THURSDAY...
4,1118476,Côte Saint Luc,Other • Autre,404 times,Côte-Saint-Luc,//www.cotesaintluc.org,2014-06-11 15:30:06,BBQ ON BALCONIES,"7030 Kildare Rd Côte Saint-Luc, Quebec",Is the by-law forbidding BBQing on balconies s...


### 1.3 Cleaning data
The newly collected data still needs to be cleaned. For example, reviews without a description sometimes contain the placeholder text "No description provided". In addition, the category name is quite long and can be replaced with a more readable format that includes only the English name of each category. A few similar cleaning steps follow, so that the final dataframe contains no duplicate entries and is more readable. 

In [130]:
dfClean = df.copy() # create a copy that will contain the cleaned data

In [131]:
# identify entries where no description is provided (missing values count as False)
noDescription = df['Description'].str.contains('No description provided', na=False, regex=False) 

In [135]:
# substitute all entries that have value no description with None

dfClean.loc[noDescription,'Description'] = None

In [142]:
# check out the different categories 
dfClean['Category'].unique()

array(['Roads and sidewalks • Rues et trottoirs',
       'Municipal buildings • Bâtiments municipaux',
       'Waste • Matières résiduelles', 'Other • Autre',
       'Traffic lights and signs • Feux de circulation & signalisation',
       'Snow Removal • Déneigement', 'Parks • Parcs',
       'Streetlight • Lampadaire', 'Trees and grass • Arbres et gazon',
       'Graffiti', 'Grass • Gazon', 'None', 'Trees • Arbres'],
      dtype=object)

In [144]:
topic = dfClean['Category'].apply(lambda x: x.split('•')[0].strip())
topic[:5] # cleaned topics including only English names 

0    Roads and sidewalks
1    Roads and sidewalks
2    Municipal buildings
3                  Waste
4                  Other
Name: Category, dtype: object

In [145]:
# create a cleaned topic column
dfClean['Topic'] = topic
dfClean = dfClean.drop(columns=['Category'])

In [146]:
dfClean.head()

Unnamed: 0,Issue ID,Submitted To,Viewed,Neighborhood,Reported via,Date,Title,Location,Description,Topic
0,1089017,Côte Saint Luc,462 times,Côte-Saint-Luc,//www.cotesaintluc.org,2014-05-22 17:10:20,Roads and sidewalks • Rues et trottoirs,"6826 The Avenue Côte Saint-Luc, Quebec",The sidewalk/curb that extends from bike path ...,Roads and sidewalks
1,3343034,Côte Saint Luc,1360 times,Côte-Saint-Luc,//www.cotesaintluc.org,2017-04-16 22:02:52,Roads and sidewalks • Rues et trottoirs,"Ch Emerson Côte Saint-Luc, Québec","Today is day 43, I am beginning to fear my com...",Roads and sidewalks
2,1105853,Côte Saint Luc,700 times,Côte-Saint-Luc,mobile application,2014-06-03 23:42:45,Municipal buildings • Bâtiments municipaux,"7500 Chemin Mackle Côte Saint-Luc, QC H4W 1A6,...",Chairs too close to railing. Kids could climb ...,Municipal buildings
3,1515685,Côte Saint Luc,2016 times,Côte-Saint-Luc,,2015-03-06 15:21:10,Waste • Matières résiduelles,"5752 Av :Lockwood Cote Saint Luc , Quebec",THE COLOR OF A BIN SHOULD NOT MATTER. THURSDAY...,Waste
4,1118476,Côte Saint Luc,404 times,Côte-Saint-Luc,//www.cotesaintluc.org,2014-06-11 15:30:06,BBQ ON BALCONIES,"7030 Kildare Rd Côte Saint-Luc, Quebec",Is the by-law forbidding BBQing on balconies s...,Other


In [151]:
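def cleanReportedVia(entry):
    '''Removes leading and trailing slashes from an entry; missing values (NaN) are returned unchanged.'''
    if isinstance(entry, str): # only strip from strings
        return entry.strip('/')
    else:
        return entry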

In [153]:
cleanedReported = dfClean['Reported via'].apply(lambda x: cleanReportedVia(x))
cleanedReported[:5]

0    www.cotesaintluc.org
1    www.cotesaintluc.org
2      mobile application
3                     NaN
4    www.cotesaintluc.org
Name: Reported via, dtype: object

In [154]:
# clean the reported via column
dfClean = dfClean.drop(columns=['Reported via'])
dfClean['Reported via'] = cleanedReported

In [155]:
dfClean.head() # newly cleaned dataframe 

Unnamed: 0,Issue ID,Submitted To,Viewed,Neighborhood,Date,Title,Location,Description,Topic,Reported via
0,1089017,Côte Saint Luc,462 times,Côte-Saint-Luc,2014-05-22 17:10:20,Roads and sidewalks • Rues et trottoirs,"6826 The Avenue Côte Saint-Luc, Quebec",The sidewalk/curb that extends from bike path ...,Roads and sidewalks,www.cotesaintluc.org
1,3343034,Côte Saint Luc,1360 times,Côte-Saint-Luc,2017-04-16 22:02:52,Roads and sidewalks • Rues et trottoirs,"Ch Emerson Côte Saint-Luc, Québec","Today is day 43, I am beginning to fear my com...",Roads and sidewalks,www.cotesaintluc.org
2,1105853,Côte Saint Luc,700 times,Côte-Saint-Luc,2014-06-03 23:42:45,Municipal buildings • Bâtiments municipaux,"7500 Chemin Mackle Côte Saint-Luc, QC H4W 1A6,...",Chairs too close to railing. Kids could climb ...,Municipal buildings,mobile application
3,1515685,Côte Saint Luc,2016 times,Côte-Saint-Luc,2015-03-06 15:21:10,Waste • Matières résiduelles,"5752 Av :Lockwood Cote Saint Luc , Quebec",THE COLOR OF A BIN SHOULD NOT MATTER. THURSDAY...,Waste,
4,1118476,Côte Saint Luc,404 times,Côte-Saint-Luc,2014-06-11 15:30:06,BBQ ON BALCONIES,"7030 Kildare Rd Côte Saint-Luc, Quebec",Is the by-law forbidding BBQing on balconies s...,Other,www.cotesaintluc.org


In [306]:
dfClean

Unnamed: 0,Issue ID,Submitted To,Viewed,Neighborhood,Date,Title,Location,Description,Topic,Reported via
0,1089017,Côte Saint Luc,462 times,Côte-Saint-Luc,2014-05-22 17:10:20,Roads and sidewalks • Rues et trottoirs,"6826 The Avenue Côte Saint-Luc, Quebec",The sidewalk/curb that extends from bike path ...,Roads and sidewalks,www.cotesaintluc.org
1,3343034,Côte Saint Luc,1360 times,Côte-Saint-Luc,2017-04-16 22:02:52,Roads and sidewalks • Rues et trottoirs,"Ch Emerson Côte Saint-Luc, Québec","Today is day 43, I am beginning to fear my com...",Roads and sidewalks,www.cotesaintluc.org
2,1105853,Côte Saint Luc,700 times,Côte-Saint-Luc,2014-06-03 23:42:45,Municipal buildings • Bâtiments municipaux,"7500 Chemin Mackle Côte Saint-Luc, QC H4W 1A6,...",Chairs too close to railing. Kids could climb ...,Municipal buildings,mobile application
3,1515685,Côte Saint Luc,2016 times,Côte-Saint-Luc,2015-03-06 15:21:10,Waste • Matières résiduelles,"5752 Av :Lockwood Cote Saint Luc , Quebec",THE COLOR OF A BIN SHOULD NOT MATTER. THURSDAY...,Waste,
4,1118476,Côte Saint Luc,404 times,Côte-Saint-Luc,2014-06-11 15:30:06,BBQ ON BALCONIES,"7030 Kildare Rd Côte Saint-Luc, Quebec",Is the by-law forbidding BBQing on balconies s...,Other,www.cotesaintluc.org
...,...,...,...,...,...,...,...,...,...,...
1420,2081179,Côte Saint Luc,190 times,Côte-Saint-Luc,2015-11-30 22:25:45,Streetlight • Lampadaire,"5606 Avenue Parkhaven Côte Saint-Luc, Québec",,Streetlight,
1421,2327988,Côte Saint Luc,275 times,Côte-Saint-Luc,2016-03-20 23:35:40,Streetlight • Lampadaire,"5705-5787 Avenue Rembrandt Côte Saint-Luc, Québec",Lampadaire brisé dans le parc.,Streetlight,mobile application
1422,1882099,Côte Saint Luc,289 times,Côte-Saint-Luc,2015-08-30 19:31:02,Roads and sidewalks • Rues et trottoirs,"5716 Avenue Parkhaven Côte Saint-Luc, QC H4W 1...",There's a possible water main break under the ...,Roads and sidewalks,mobile application
1423,2349556,Côte Saint Luc,259 times,Côte-Saint-Luc,2016-03-29 05:05:34,GC01 and PB46 have faulty lights,"7482 Chemin Pineview Côte Saint-Luc, Québec",Non functional lights,Streetlight,


In [304]:
dfClean.nunique() # check how many unique values exist

Issue ID        1010
Submitted To       2
Viewed           625
Neighborhood       1
Date            1010
Title            259
Location         853
Description      953
Topic             13
Reported via       4
dtype: int64

In [309]:
# drop duplicates
dfClean = dfClean.drop_duplicates(subset=['Date','Title', 'Location','Description', 'Topic'])

In [311]:
dfClean

Unnamed: 0,Issue ID,Submitted To,Viewed,Neighborhood,Date,Title,Location,Description,Topic,Reported via
0,1089017,Côte Saint Luc,462 times,Côte-Saint-Luc,2014-05-22 17:10:20,Roads and sidewalks • Rues et trottoirs,"6826 The Avenue Côte Saint-Luc, Quebec",The sidewalk/curb that extends from bike path ...,Roads and sidewalks,www.cotesaintluc.org
1,3343034,Côte Saint Luc,1360 times,Côte-Saint-Luc,2017-04-16 22:02:52,Roads and sidewalks • Rues et trottoirs,"Ch Emerson Côte Saint-Luc, Québec","Today is day 43, I am beginning to fear my com...",Roads and sidewalks,www.cotesaintluc.org
2,1105853,Côte Saint Luc,700 times,Côte-Saint-Luc,2014-06-03 23:42:45,Municipal buildings • Bâtiments municipaux,"7500 Chemin Mackle Côte Saint-Luc, QC H4W 1A6,...",Chairs too close to railing. Kids could climb ...,Municipal buildings,mobile application
3,1515685,Côte Saint Luc,2016 times,Côte-Saint-Luc,2015-03-06 15:21:10,Waste • Matières résiduelles,"5752 Av :Lockwood Cote Saint Luc , Quebec",THE COLOR OF A BIN SHOULD NOT MATTER. THURSDAY...,Waste,
4,1118476,Côte Saint Luc,404 times,Côte-Saint-Luc,2014-06-11 15:30:06,BBQ ON BALCONIES,"7030 Kildare Rd Côte Saint-Luc, Quebec",Is the by-law forbidding BBQing on balconies s...,Other,www.cotesaintluc.org
...,...,...,...,...,...,...,...,...,...,...
1420,2081179,Côte Saint Luc,190 times,Côte-Saint-Luc,2015-11-30 22:25:45,Streetlight • Lampadaire,"5606 Avenue Parkhaven Côte Saint-Luc, Québec",,Streetlight,
1421,2327988,Côte Saint Luc,275 times,Côte-Saint-Luc,2016-03-20 23:35:40,Streetlight • Lampadaire,"5705-5787 Avenue Rembrandt Côte Saint-Luc, Québec",Lampadaire brisé dans le parc.,Streetlight,mobile application
1422,1882099,Côte Saint Luc,289 times,Côte-Saint-Luc,2015-08-30 19:31:02,Roads and sidewalks • Rues et trottoirs,"5716 Avenue Parkhaven Côte Saint-Luc, QC H4W 1...",There's a possible water main break under the ...,Roads and sidewalks,mobile application
1423,2349556,Côte Saint Luc,259 times,Côte-Saint-Luc,2016-03-29 05:05:34,GC01 and PB46 have faulty lights,"7482 Chemin Pineview Côte Saint-Luc, Québec",Non functional lights,Streetlight,


In [310]:
dfClean.to_csv('seeClickFixDataClean.csv') # save cleaned dataframe to csv

<a id="review-scrape"></a>
## 2. Collecting more SeeClickFix data
In addition to the data collected above, more data can be scraped from SeeClickFix. Unlike before, the next set of comments does not come in a static list. Instead, the comments appear on a dynamically loading page with a map and multiple pages to click through. The html also has a somewhat different format, so we need to write functions that work on this other type of page found on SeeClickFix. It is important to note that the following dataset has fewer attributes per comment and that its descriptions are less extensive. 

### 2.1 Gathering reviews 

In [220]:
def gatherMoreLinks(driver): 
    '''Clicks through multiple pages and gathers the link to each comment. Returns all of the links as a list 
    in the end.'''
    links = []
    statusXpath = "//div[@class = 'scf-flex scf-flex-justify-between ember-view']"
    
    while True:
        time.sleep(5) # sleep time to allow for loading of page 
        hrefs = driver.find_elements_by_xpath(
            "//a[@class = 'ember-view scf-flex scf-flex-align-center scf-color-primary-dark scf-no-underline']")
        for href in hrefs: # gather individual links on one page and add them to a list
            links.append(href.get_attribute("href"))
        
        # read the line that states which items are showing and the total number of items
        status = driver.find_element_by_xpath(statusXpath).text
        shown = status.split('\n')[0].split('of')
        currentCommentCount = shown[0].split()[-1] # number of the last comment currently showing
        totalNumComments = shown[-1].strip() # total number of comments
        
        # stop once the last comment is showing; checking before clicking makes sure that the links 
        # of the final page have been collected and does not depend on how fast the next page loads
        if totalNumComments == currentCommentCount:
            break
        
        next_button = driver.find_element_by_xpath("//button[@class = 'scf-pagination--button paginate-next']")
        next_button.click()
    return links 
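The stopping condition relies on parsing the pagination text of the page. The sketch below isolates that parsing step and compares integers rather than strings. The format `"1 - 20 of 2108"` is an assumption inferred from the way the function splits the text, not a format confirmed from the site, and `parsePaginationStatus` is a hypothetical helper.

```python
def parsePaginationStatus(statusText):
    """Return (last item showing, total items) from the first line of the text."""
    firstLine = statusText.split('\n')[0]
    # everything before the last 'of' is the range showing, the rest is the total
    shown, _, total = firstLine.rpartition('of')
    lastShown = shown.split()[-1]
    return int(lastShown), int(total.strip())


# assumed examples of the pagination text on the first and on the last page
firstPage = parsePaginationStatus('1 - 20 of 2108\nNext')
lastPage = parsePaginationStatus('2101 - 2108 of 2108\nNext')

print(firstPage, firstPage[0] >= firstPage[1])
print(lastPage, lastPage[0] >= lastPage[1])
```

Keeping the parsing in its own function makes it possible to check the logic against a few saved strings without opening a browser.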

In [219]:
# get the link where more reviews are listed
browser.get('https://seeclickfix.com/web_portal/2k1ssvSpae6TaFxzH6eMTSUr/issues/map?lat=45.466188626180674&lng=-73.71551555581392&max_lat=45.794339630460705&max_lng=-72.93548583984376&min_lat=45.13555516012536&min_lng=-74.49554443359376&status=open%2Cacknowledged%2Cclosed%2Carchived&zoom=10')

In [237]:
# gather the links for each comment
moreLinks = gatherMoreLinks(browser)

In [239]:
len(moreLinks) # total number of additional links collected

2108

In [222]:
moreLinks[:3] # sample of a few links 

['https://seeclickfix.com/web_portal/2k1ssvSpae6TaFxzH6eMTSUr/issues/map/10112914?lat=45.466188626180674&lng=-73.71551555581392&max_lat=45.70905627558719&max_lng=-73.16619873046876&min_lat=45.222677199620094&min_lng=-74.26483154296876&status=open%2Cacknowledged%2Cclosed%2Carchived&zoom=10',
 'https://seeclickfix.com/web_portal/2k1ssvSpae6TaFxzH6eMTSUr/issues/map/9733691?lat=45.466188626180674&lng=-73.71551555581392&max_lat=45.70905627558719&max_lng=-73.16619873046876&min_lat=45.222677199620094&min_lng=-74.26483154296876&status=open%2Cacknowledged%2Cclosed%2Carchived&zoom=10',
 'https://seeclickfix.com/web_portal/2k1ssvSpae6TaFxzH6eMTSUr/issues/map/8625095?lat=45.466188626180674&lng=-73.71551555581392&max_lat=45.70905627558719&max_lng=-73.16619873046876&min_lat=45.222677199620094&min_lng=-74.26483154296876&status=open%2Cacknowledged%2Cclosed%2Carchived&zoom=10']

An additional, optional step is to save the links to the individual comments. The links can then be reused later, in case one decides to revisit them.

In [223]:
with open("links.txt", 'w') as f: # save the links in a txt file if needed for later
    for l in moreLinks:
        f.write(str(l) + '\n')
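For completeness, here is a minimal sketch of reading the saved links back in a later session. `loadLinks` is a hypothetical helper, and the demo writes a small temporary file in the same one-link-per-line layout instead of using the real `links.txt`.

```python
import os
import tempfile


def loadLinks(path):
    """Read one link per line and skip empty lines."""
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]


# write a small demo file in the same format as links.txt
demoPath = os.path.join(tempfile.mkdtemp(), 'links_demo.txt')
demoLinks = ['https://seeclickfix.com/issues/1089017', 'https://seeclickfix.com/issues/3343034']
with open(demoPath, 'w', encoding='utf-8') as f:
    for link in demoLinks:
        f.write(str(link) + '\n')

restored = loadLinks(demoPath)
print(restored == demoLinks)
```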

In [244]:
def gatherComplaintInfo(driver, link): 
    '''Using Selenium (driver), navigate to a particular link and collect the title of the review, its location,
    description and the date the review was composed. The data is saved in a dictionary and returned. If a particular 
    item (e.g. the location) is not provided, None is stored for that dictionary item instead. 
    '''
    complaintDict = dict()
    
    driver.get(link)
    time.sleep(5) # allow page to load
    itemList = ['Title', 'Location', 'Description', 'Date']
    elementList = [
        "//a[@class = 'ember-view scf-flex scf-flex-align-center scf-color-primary-dark scf-no-underline']",
        "//p[@class = 'scf-c-issue-header__truncated-item scf-flex-item scf-mb-none scf-color-mono-dark']",
        "//p[@class = 'scf-mb-base scf-add-line-breaks']","//time[@class = 'scf-capitalize scf-inline-block']"
    ]
    for item, xpath in zip(itemList, elementList):
        try: 
            if item == 'Location': 
                location = driver.find_element_by_xpath(xpath).text
                info = location.split('\n')[1] # keep the second line of the location text
            elif item != 'Date': 
                info = driver.find_element_by_xpath(xpath).text 
            else:
                info = driver.find_element_by_xpath(xpath).get_attribute("datetime")
            complaintDict[item] = info
        except Exception: # element is missing or not in the expected format
            complaintDict[item] = None
    
    return complaintDict
    

In [235]:
gatherComplaintInfo(browser, moreLinks[0]) # gather the information for one complaint

{'Title': 'Needs full-depth repair.',
 'Location': '4539-4551 4e Rue Laval, Québec, H7W 2K1, CAN',
 'Description': None,
 'Date': '2021-06-11T11:07:40-04:00'}

In [236]:
def gatherMoreComments(driver,listOfLinks): 
    '''Use Selenium (driver object) to extract the information about each complaint in listOfLinks and 
    append it as a new row to moreSeeClickFixData.csv'''
    for commentLink in listOfLinks:
        commentInfo = gatherComplaintInfo(driver, commentLink)
        append_dict_as_row('moreSeeClickFixData.csv', commentInfo)

In [238]:
# assign header columns
headerList2 = ['Title', 'Location', 'Description', 'Date']
  
# open CSV file and assign header
with open("moreSeeClickFixData.csv", 'w', newline='') as file:
    dw = csv.DictWriter(file, delimiter=',', 
                        fieldnames=headerList2)
    dw.writeheader()

In [251]:
gatherMoreComments(browser,moreLinks[2000:])

### 2.2  Clean collected data 
The data collected from SeeClickFix consists of two datasets with different formats. The first dataset has more information, including the topic/category of each review, whereas the second dataset only contains the title, location, description and date.
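One visible format difference is the `Date` column: the first dataset stores values such as `2017-08-02 13:26:54`, while the second stores ISO 8601 strings with a UTC offset such as `2021-06-11T11:07:40-04:00`. As a minimal sketch, both forms can be parsed with the standard library. `parseDate` is a hypothetical helper, and how to treat the dates without an offset is left open here.

```python
from datetime import datetime


def parseDate(text):
    """Parse either date format into a datetime; return None for missing values."""
    if not isinstance(text, str) or not text.strip():
        return None
    # fromisoformat accepts both a space and a 'T' between date and time
    return datetime.fromisoformat(text.strip())


first = parseDate('2017-08-02 13:26:54')         # value from the first dataset
second = parseDate('2021-06-11T11:07:40-04:00')  # value from the second dataset
missing = parseDate(float('nan'))                # missing values show up as NaN in pandas

print(first.tzinfo, second.utcoffset(), missing)
```

Dates without an offset stay timezone-naive, so the two datasets should not be compared directly until a timezone for the first one has been decided.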

In [253]:
moreDataDf = pd.read_csv('moreSeeClickFixData.csv') # read in the collected complaints dataset
moreDataDf.head()

Unnamed: 0,Title,Location,Description,Date
0,Needs full-depth repair.,"4539-4551 4e Rue Laval, Québec, H7W 2K1, CAN",,2021-06-11T11:07:40-04:00
1,Post to Neighbors,"1890 Rue LéAndre-Descotes Laval, QC H7W 5M4, C...",street light is not working,2021-04-17T20:20:20-04:00
2,Debris,"5759 Av LéGer Côte-Saint-Luc QC H4W 2E8, Canada",There is a large pile of broken branches on 58...,2020-09-21T16:12:45-04:00
3,Traffic issue at Kincourt and Guelph,"Chemin Guelph & Avenue Kincourt Côte-St-Luc, Q...",Something has to be done to slow the traffic o...,2020-08-27T10:46:36-04:00
4,Beth Zion Air Conditioners,"5716-5716 Hudson Ave Cote-St-Luc, Quebec, H4W ...",The Beth Zion air conditioners are only suppos...,2020-08-27T06:56:45-04:00


In [274]:
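def cleanDescription(text, title):
    '''Returns True if the description or the title contains the text 'test', i.e. if the entry is most likely 
    only a test submission. Missing values (NaN) are skipped.'''
    if type(text) != float:
        if 'test' in text.lower():
            return True
    if type(title) != float:
        return 'test' in title.lower() 
    return False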
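The substring check above also flags any entry that merely contains the letters "test" inside another word, such as "latest" or "contest". A stricter variant, sketched below, only matches "test" as a whole word. `looksLikeTest` is a hypothetical alternative and was not used to produce the outputs in this notebook. The first three example strings come from the test entries listed in this section; the last one is invented.

```python
import re

# 'test' as a whole word, in any capitalisation
TEST_WORD = re.compile(r'\btest\b', re.IGNORECASE)


def looksLikeTest(text):
    """Return True if text is a string containing the whole word 'test'."""
    return isinstance(text, str) and bool(TEST_WORD.search(text))


examples = ['Ceci est un test', 'Test from IT', 'TEST - Graffiti on storage buildings',
            'Latest pothole repair did not hold']
flags = [looksLikeTest(e) for e in examples]
print(flags)
```

The trade-off is that variants such as "testing" are no longer matched, so the stricter rule may let a few test entries through.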

In [275]:
isTest = moreDataDf.apply(lambda x: cleanDescription(x['Description'], x['Title']), axis=1) # indicator of if a description is only a test

isTest[-5:]

2103    False
2104    False
2105    False
2106    False
2107     True
dtype: bool

In [283]:
moreDataDf[isTest] # comments to be filtered out since they are only test values

Unnamed: 0,Title,Location,Description,Date
15,test app,"150 Rue De Louvain O Montréal H2N 1B4, Canada",Hghhgj gghh,2019-12-29T09:02:52-05:00
19,Test,"4860 Rue De Bullion Montréal, Québec, H2T, CAN",Test,2019-08-18T20:40:19-04:00
76,test it,"1590–1596 Boul Des Laurentides Laval QC H7N, C...",Ceci est un test,2018-04-16T14:29:20-04:00
215,Post to Neighbors,"31 Willibrord Montréal, Quebec",Test,2017-06-20T22:20:11-04:00
621,Other • Autre,"5801 Cavendish Blvd Côte Saint-Luc, Québec",Test from IT,2016-02-10T12:25:57-05:00
622,Other • Autre,"5801 Cavendish Blvd Côte Saint-Luc, Québec",Test from IT,2016-02-10T12:25:26-05:00
635,Test,"155 Rue Notre Dame Est Montréal, Québec",#neigemtl,2016-01-10T12:04:44-05:00
2107,,,TEST - Graffiti on storage buildings behind K1...,


In [284]:
cleanDf = moreDataDf[~isTest] # keep only the entries that are not tests
cleanDf.shape

(2100, 4)

In [289]:
def getCategory(text): 
    if type(text) != float:
        return '•' in text
    return False

In [295]:
# indicator of whether the category of a comment is included in the title
categoryPresent = cleanDf['Title'].apply(lambda x: getCategory(x)) 

In [298]:
withCat = cleanDf[categoryPresent].copy() # save the entries with a category in an additional dataframe

withCat.head()

Unnamed: 0,Title,Location,Description,Date
121,Streetlight • Lampadaire,"5750 Avenue Wentworth Côte Saint-Luc, QC H4W 2...",Burned out,2017-09-02T16:32:48-04:00
122,Traffic lights and signs • Feux de circulation...,"5744 Boul Cavendish Côte Saint-Luc, QC H4W, Ка...",cavendish traffic light are not synchronised. ...,2017-08-31T15:21:19-04:00
123,Other • Autre,"7166 Chemin De La CôTe-Saint-Luc Montréal, Québec",Construction site. The side walk should be fen...,2017-08-29T13:10:48-04:00
126,Other • Autre,Kildare And Shalom,Flowers and plants planted are so high they in...,2017-08-27T12:22:24-04:00
127,Parks • Parcs,"5678 Chemin Merrimac Côte Saint-Luc, QC H4W 1S...",Concrete cans knocked over,2017-08-27T12:07:10-04:00


In [299]:
withoutCat = cleanDf[~categoryPresent].copy() # save the entries without a category in an additional dataframe
withoutCat.head()

Unnamed: 0,Title,Location,Description,Date
0,Needs full-depth repair.,"4539-4551 4e Rue Laval, Québec, H7W 2K1, CAN",,2021-06-11T11:07:40-04:00
1,Post to Neighbors,"1890 Rue LéAndre-Descotes Laval, QC H7W 5M4, C...",street light is not working,2021-04-17T20:20:20-04:00
2,Debris,"5759 Av LéGer Côte-Saint-Luc QC H4W 2E8, Canada",There is a large pile of broken branches on 58...,2020-09-21T16:12:45-04:00
3,Traffic issue at Kincourt and Guelph,"Chemin Guelph & Avenue Kincourt Côte-St-Luc, Q...",Something has to be done to slow the traffic o...,2020-08-27T10:46:36-04:00
4,Beth Zion Air Conditioners,"5716-5716 Hudson Ave Cote-St-Luc, Quebec, H4W ...",The Beth Zion air conditioners are only suppos...,2020-08-27T06:56:45-04:00


In [303]:
withCat['Title'].apply(lambda x: x.split('•')[0].strip()).unique()

array(['Streetlight', 'Traffic lights and signs', 'Other', 'Parks',
       'Roads and sidewalks', 'Trees', 'Waste', 'Roads and s6idewalks',
       'Snow Removal', 'Municipal buildings', 'Grass', 'Trees and grass',
       'Defective Streetlight', 'Pothole', 'Sign damage',
       'Debris in street', 'Trash can full', 'Sidewalk damage'],
      dtype=object)

In [322]:
topics = withCat['Title'].apply(lambda x: x.split('•')[0].strip())
withCat.loc[:,'Topic'] = topics

In [324]:
withCat.head()

Unnamed: 0,Title,Location,Description,Date,Topic
121,Streetlight • Lampadaire,"5750 Avenue Wentworth Côte Saint-Luc, QC H4W 2...",Burned out,2017-09-02T16:32:48-04:00,Streetlight
122,Traffic lights and signs • Feux de circulation...,"5744 Boul Cavendish Côte Saint-Luc, QC H4W, Ка...",cavendish traffic light are not synchronised. ...,2017-08-31T15:21:19-04:00,Traffic lights and signs
123,Other • Autre,"7166 Chemin De La CôTe-Saint-Luc Montréal, Québec",Construction site. The side walk should be fen...,2017-08-29T13:10:48-04:00,Other
126,Other • Autre,Kildare And Shalom,Flowers and plants planted are so high they in...,2017-08-27T12:22:24-04:00,Other
127,Parks • Parcs,"5678 Chemin Merrimac Côte Saint-Luc, QC H4W 1S...",Concrete cans knocked over,2017-08-27T12:07:10-04:00,Parks


Many entries do not have a category. Going through each review manually would be a very arduous task. Matching key words for each topic against the title is an easy alternative that classifies a large share of the reviews without a category; the rest keep an empty topic.

In [400]:
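def assignTopic(text):
    '''Assigns a topic to a title based on key words. The first matching rule wins; None is returned 
    if the title is missing or no key word matches.'''
    if type(text) != str:
        return None
    textLower = text.lower()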
    if 'graffit' in textLower:
        return 'Graffiti'
    elif (('pothole' in textLower) or ('pot hole' in textLower) or ('sidewalk' in textLower) 
        or ('road' in textLower) or ('side walk' in textLower) or ('trottoir' in textLower)):
        return 'Roads and sidewalks'
    elif 'snow' in textLower: 
        return 'Snow Removal'
    elif ('street light' in textLower) or ('streetlight' in textLower):
        return 'Streetlight'
    elif (('traffic light' in textLower) or ('sign' in textLower) or 
    ('red light' in textLower) or ('green light' in textLower)): 
        return 'Traffic lights and signs'
    elif (('waste' in textLower) or ('garbage' in textLower) or ('trash' in textLower) or 
    ('déchets' in textLower) or ('poubelle' in textLower) or ('recyclage' in textLower)):
        return 'Waste'
    elif ('tree' in textLower) and ('grass' in textLower):
        return 'Trees and grass'
    elif 'tree' in textLower:
        return 'Trees'
    elif 'grass' in textLower:
        return 'Grass'
    elif ('park' in textLower) or ('parc' in textLower): 
        return 'Parks'
    elif 'building' in textLower:
        return 'Municipal buildings'
    else:
        return None
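The chain of `elif` branches can also be written as data, which makes it easier to add key words later. The sketch below covers only the first four topics of `assignTopic` and keeps the same "first match wins" order. `TOPIC_RULES` and `assignTopicFromRules` are hypothetical names, and the 'Trees and grass' rule, which requires two key words at once, would need separate handling.

```python
# (topic, key words) pairs in priority order; a subset of the rules in assignTopic
TOPIC_RULES = [
    ('Graffiti', ['graffit']),
    ('Roads and sidewalks', ['pothole', 'pot hole', 'sidewalk', 'road', 'side walk', 'trottoir']),
    ('Snow Removal', ['snow']),
    ('Streetlight', ['street light', 'streetlight']),
]


def assignTopicFromRules(text, rules=TOPIC_RULES):
    """Return the first topic with a key word found in text, else None."""
    if not isinstance(text, str):
        return None
    textLower = text.lower()
    for topic, keywords in rules:
        if any(keyword in textLower for keyword in keywords):
            return topic
    return None


# titles taken from the outputs in this notebook
print(assignTopicFromRules('big pothole here'))
print(assignTopicFromRules('bottle'))
```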

In [401]:
predTopics = withoutCat['Title'].apply(lambda x: assignTopic(x))
predTopics.value_counts()

Graffiti                    314
Roads and sidewalks         159
Streetlight                  60
Traffic lights and signs     47
Waste                        47
Parks                        25
Trees                        25
Snow Removal                 18
Grass                         4
Municipal buildings           1
Trees and grass               1
Name: Title, dtype: int64

In [409]:
noPred = predTopics.isna() # entries for which no topic could be assigned
withoutCat[noPred].head(10)

Unnamed: 0,Title,Location,Description,Date
0,Needs full-depth repair.,"4539-4551 4e Rue Laval, Québec, H7W 2K1, CAN",,2021-06-11T11:07:40-04:00
1,Post to Neighbors,"1890 Rue LéAndre-Descotes Laval, QC H7W 5M4, C...",street light is not working,2021-04-17T20:20:20-04:00
2,Debris,"5759 Av LéGer Côte-Saint-Luc QC H4W 2E8, Canada",There is a large pile of broken branches on 58...,2020-09-21T16:12:45-04:00
3,Traffic issue at Kincourt and Guelph,"Chemin Guelph & Avenue Kincourt Côte-St-Luc, Q...",Something has to be done to slow the traffic o...,2020-08-27T10:46:36-04:00
4,Beth Zion Air Conditioners,"5716-5716 Hudson Ave Cote-St-Luc, Quebec, H4W ...",The Beth Zion air conditioners are only suppos...,2020-08-27T06:56:45-04:00
5,skunk in the area Leger and McAlear,"5761 Av LéGer Côte-Saint-Luc QC H4W 2E8, Canada",There is a skunk in the area Most evenings You...,2020-08-13T20:29:17-04:00
6,Backflow valves for sewer lines.,"5752 Avenue Lockwood Cote-St-Luc, Quebec, H4W ...",Is there a by-law that is in effect to have ba...,2020-08-02T23:39:17-04:00
14,Campus FabCity,"50 Rue De Louvain O Montréal, Québec, H2N 1B4,...",C'est ici que le Campus...,2019-12-29T09:32:16-05:00
16,Damage to front lawn,"6885 Chemin Schweitzer Côte-St-Luc, Québec, H4...",I suspect that a work vehicle parked temporari...,2019-11-25T14:44:45-05:00
18,Sewer cover deep,"Boulevard St-Martin O Laval, Québec, H7S, CAN",This Sewer keeps having issues year after year...,2019-10-06T17:56:54-04:00


In [413]:
withoutCat.loc[:,'Topic'] = predTopics.copy()

In [421]:
df2 = pd.concat([withCat,withoutCat])
df2 = df2.reset_index(drop=True)

In [423]:
df2.tail()

Unnamed: 0,Title,Location,Description,Date,Topic
2095,nid-de-poule,"275 Rue Notre-Dame Est Montréal, QC",,2012-01-05T13:05:07-05:00,
2096,Trou dangereux,"330-332 Rue Chevalier Châteauguay, QC J6J 4P8,...",Le trou est la depuis le mois de mars et perso...,2011-06-26T08:47:17-04:00,
2097,Ugly building,"3001-3035 Rue Saint Antoine Ouest Westmount, Q...",This building is really ugly. Please tear or b...,2011-01-14T13:48:08-05:00,Municipal buildings
2098,bottle,"301-385 Avenue Lansdowne Westmount, QC H3Z 2L5...",Half empty,2010-09-01T23:13:43-04:00,
2099,big pothole here,"Montreal QC H1Z 2T2, Canada",fix it!,2010-03-09T11:08:53-05:00,Roads and sidewalks


In [425]:
df2.to_csv('seeClickFixDataClean2.csv') # save second dataset as csv file