# Web scraping labelled images of building styles

The ACOToronto website contains the TOBuilt database, an open source database of images and information about buildings and structures in Toronto.
http://www.acotoronto.ca/tobuilt_new_detailed.php

Order of actions
* First, use the search tool on the main page to find all the available architectural styles (http://www.acotoronto.ca/tobuilt_new_detailed.php).
* Then request the page for each style. This page contains a thumbnail image of each building classified in that style, a link to more details, and basic info about the building.
  Note: you generally have to follow a redirect to reach the style page (http://www.acotoronto.ca/search_buildingsR-d.php?sid=8065).
* Download each image locally to `../Images/<style>`.
* Request the building details page to get info on building dates, architects, etc. (http://www.acotoronto.ca/show_building.php?BuildingID=3883)
  * The page is structured with alternating building_info and building_info2 divs.
  * building_info contains the name of the field that follows.
  * building_info2 contains the value.
  * The Companies row contains the architects. There can be multiple architects for a building, so these are `<li>` items.
  * Sometimes the architect is a hyperlink, but not always, so both cases need to be handled.
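
The redirect mentioned in the note above appears to arrive as a small JavaScript snippet rather than an HTTP redirect (the commented-out code in `process_style` looks for `document.location`). A minimal sketch of pulling the target out of such a response, assuming the body looks like `document.location = "search_buildingsR-d.php?sid=8065";`. The helper name `extract_redirect` is hypothetical:

```python
import re
from urllib.parse import urljoin

SITE_ROOT = "http://www.acotoronto.ca/"

def extract_redirect(body, base=SITE_ROOT):
    """Return the absolute redirect URL found in a JavaScript body, or None."""
    match = re.search(r'document\.location\s*=\s*"([^"]+)"', body)
    if match is None:
        return None
    # urljoin resolves the relative target against the site root
    return urljoin(base, match.group(1))

# Assumed shape of the redirect body; the real response may differ.
sample = 'document.location = "search_buildingsR-d.php?sid=8065";'
print(extract_redirect(sample))
```

Returning `None` when no redirect is found lets the caller fall back to parsing the page it already has.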

American colonial
Annex Style
Art deco
Arts and Crafts
Beaux arts
Brutalist
Byzantine
Chicago style
Classical revival
Commercial style
Contemporary
Deconstructivism
Dutch colonial
Early modern
Edwardian classical
English Cottage style
Georgian revival
Gothic revival
Greek revival
International style
Italianate
Late modernist
Log construction
Mid century expressionist
Mirrored tower
Modern classical
Modern historicist
Modernist
Neo Palladian
Neo-Chateau
Neo-Georgian
Neo-modernist
Neo-Tudor
Postmodern
Prairie style
Queen Anne
Regency
Renaissance revival
Richardsonian Romanesque
Romanesque revival
Sculptural
Second empire
Shingle style
Spanish colonial
Toronto Bay and Gable
Workers Cottage

In [1]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import string
import re
import urllib.request  # urlretrieve lives in the urllib.request submodule
import os
import time
from datetime import datetime

from selenium import webdriver
from sqlalchemy import sql, Table, MetaData
import ast
from ast import literal_eval

In [2]:
from sqlalchemy import create_engine, Column, Integer, String, Sequence, Float
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from models import connect_db, PointsOfInterest, ArchitecturalStyles, Architects,POICategories

### Set up base variables

In [3]:

main_page = 'http://www.acotoronto.ca/tobuilt_new_detailed.php'
style_url="http://www.acotoronto.ca/search_buildingsDB_d2.php"
site_root = "http://www.acotoronto.ca/"
debug=False
buildings_list=[]
rerun_webscrape=False # rerun all  webscraping
populate_db = False # repopulate database

df_to_db_map={
    'Name':'name',
    'Completed':'build_year'   ,
    'Demolished' :'demolished_year',
    'Address' :'address' ,
    'Bld_link':'external_url',
    'Notes': 'details',
    'Image':'image_url',
    'Heritage':'heritage_status',
    'Current use':'current_use',
    'Type':'poi_type'
}
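
The mapping above is applied later in `save_to_database_ORM`, where each scraped row is renamed to database column names and empty fields are dropped. A minimal sketch of that step on a plain dict, assuming `None` stands in for a missing value (the notebook itself uses `pd.isnull`). The name `to_db_row`, the reduced `sample_map`, and the sample record are hypothetical:

```python
# A subset of df_to_db_map, repeated here so the sketch is self-contained.
sample_map = {
    'Name': 'name',
    'Completed': 'build_year',
    'Demolished': 'demolished_year',
    'Address': 'address',
}

def to_db_row(record, mapping=sample_map):
    """Rename mapped keys and drop unmapped or missing fields."""
    return {mapping[k]: v for k, v in record.items()
            if k in mapping and v is not None}

record = {'Name': '22 Chestnut Park', 'Completed': '1905',
          'Demolished': None, 'Style': 'American colonial'}
print(to_db_row(record))
# {'name': '22 Chestnut Park', 'build_year': '1905'}
```

`Style` is dropped here because it is stored in a separate table rather than on the main record.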

In [None]:
def load_page(url):
    '''Fetch a page and return its HTML text, or None if the request fails.'''
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    print(f"Error connecting: status code {response.status_code}")
    return None

In [None]:
def download_image(style, img_url):
    '''
    * downloads the image to the matching style sub-folder of the Images directory (creating the folder if missing)
    * to test: download_image('test','/tobuilt_bk/php/Buildingimages/106BedfordRd.jpg')
    '''

    dest_dir = f"../Images/{style}"
    os.makedirs(dest_dir, exist_ok=True)

    # site_root ends with '/' and img_url starts with one, so trim both before joining
    full_url = f"{site_root.rstrip('/')}/{img_url.lstrip('/')}"
    image_path = f"{dest_dir}/{img_url.split('/')[-1]}"
    urllib.request.urlretrieve(full_url, image_path)
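
Since `site_root` ends with a slash and the scraped image paths begin with one, joining them needs a little care. One option is `urllib.parse.urljoin`. The sketch below only computes the URL and the destination path, without downloading anything. `image_paths` is a hypothetical helper, and the sample path is the one from the docstring above:

```python
import posixpath
from urllib.parse import urljoin, urlparse

site_root = "http://www.acotoronto.ca/"

def image_paths(style, img_url, dest_root="../Images"):
    """Return (absolute image URL, local destination path) for one image."""
    full_url = urljoin(site_root, img_url)
    # take the file name from the URL path, ignoring any query string
    filename = posixpath.basename(urlparse(full_url).path)
    return full_url, f"{dest_root}/{style}/{filename}"

url, path = image_paths('test', '/tobuilt_bk/php/Buildingimages/106BedfordRd.jpg')
print(url)
print(path)
```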


In [None]:
# borrowed from https://michaeljsanders.com/2017/05/12/scrapin-and-scrollin.html
def get_scrolling_page(page_url):
    browser = webdriver.Chrome("C:\\Users\\blahjays\\Downloads\\chromedriver_win32\\chromedriver.exe")

    # Tell Selenium to get the URL you're interested in.
    # e.g. browser.get("http://www.acotoronto.ca/search_buildingsR-d.php?sid=8225")
    browser.get(page_url)

    # Scroll to the bottom, wait 3 seconds for the next batch of data to load, then scroll again.
    # Repeat until the page height stops changing, i.e. no new data is loading.
    scroll_script = "window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;"
    lenOfPage = browser.execute_script(scroll_script)
    while True:
        lastCount = lenOfPage
        time.sleep(3)
        lenOfPage = browser.execute_script(scroll_script)
        if lastCount == lenOfPage:
            break

    # Now that the page is fully scrolled, grab the source code.
    source_data = browser.page_source

    # Parse the source with BeautifulSoup.
    bs_data = BeautifulSoup(source_data, 'html.parser')
    browser.close()
    return bs_data
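
The stop condition of the scrolling loop can be tried out without a browser. The sketch below separates the loop from Selenium: `get_height` is a hypothetical callable standing in for the `execute_script` call, and `max_rounds` is an added safety cap that the function above does not have.

```python
import time

def scroll_until_stable(get_height, pause=0.0, max_rounds=50):
    """Call get_height repeatedly until two consecutive results match.

    Returns the number of calls made.
    """
    last = get_height()
    calls = 1
    while calls < max_rounds:
        time.sleep(pause)
        current = get_height()
        calls += 1
        if current == last:
            break  # height unchanged, so nothing new was loaded
        last = current
    return calls

# Simulated page heights: the page grows twice, then stops.
heights = iter([1000, 2000, 3000, 3000])
calls_made = scroll_until_stable(lambda: next(heights))
print(calls_made)  # 4
```

A cap like `max_rounds` guards against pages that keep growing indefinitely; whether that matters for this site is not established here.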

#### Building Details pages
* The page is structured with alternating building_info and building_info2 divs
* building_info contains the name of the field that follows
* building_info2 contains the value
* The Companies row contains the architects. There can be multiple architects for a building, so these are `<li>` items.
* Sometimes the architect is a hyperlink, but not always, so both cases need handling
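
The label/value pairing described above can be illustrated with the standard library alone. This is a sketch using `html.parser.HTMLParser`, not the BeautifulSoup approach the notebook takes. The sample markup is hypothetical, modelled only on the description of alternating `building_info` and `building_info2` divs, and `InfoPairParser` is an invented name:

```python
from html.parser import HTMLParser

class InfoPairParser(HTMLParser):
    """Collect label/value pairs from alternating building_info / building_info2 divs."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._kind = None    # 'label' or 'value' while inside a target div
        self._depth = 0      # nested <div> depth inside the target div
        self._buf = []
        self._label = None

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self._kind is not None:
            self._depth += 1
            return
        classes = (dict(attrs).get('class') or '').split()
        if 'building_info' in classes:
            self._kind = 'label'
        elif 'building_info2' in classes:
            self._kind = 'value'
        self._buf = []

    def handle_endtag(self, tag):
        if tag != 'div' or self._kind is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        text = ' '.join(''.join(self._buf).split())
        if self._kind == 'label':
            self._label = text
        elif self._label is not None:
            self.pairs[self._label] = text
            self._label = None
        self._kind = None

    def handle_data(self, data):
        if self._kind is not None:
            self._buf.append(data)

# Hypothetical markup; the real page may differ.
sample_html = '''
<div class="building_info">Completed</div><div class="building_info2">1905</div>
<div class="building_info">Type</div><div class="building_info2">Detached house</div>
'''
parser = InfoPairParser()
parser.feed(sample_html)
print(parser.pairs)
```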

In [None]:
def get_building_details(bld_link):
    
    flags_dict = {
        'Completed':None,
        'Demolished': None,
        'Companies': None,
        'Type':None,
        'Current use': None,
        'Heritage': None,
       'Notes': None
    }
    architects=[]
    
    full_url = f'{site_root}{bld_link}'
    html_bld = load_page(full_url)
    soup_bld = BeautifulSoup(html_bld, 'html.parser')
    # get all building info elements
    info = soup_bld.find_all('div',{'class': 'building_info'})
    for inf in info:
        for flag in flags_dict.keys():
            if flag in inf.text:
                if 'Companies' in flag:
                    company_box = inf.find_next()

                    if 'Architect - ' in company_box.text:
                        li_items = company_box.find_all('li')

                        for arc in li_items:
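                            if (arc.find('span')):
                                # the architect name is the node just before the <span>
                                architect = arc.find('span').previous_sibling
                                if architect is None:
                                    continue
                                if architect.name =='a':
                                    # hyperlinked name: take the link text
                                    architects.append(architect.text.strip())
                                else:
                                    # plain-text name: drop the separating dash
                                    architects.append(architect.replace('-','').strip())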
                                
                        flags_dict[flag] = architects
                else:
                    flags_dict[flag] = inf.find_next().text.strip()
    return flags_dict

### Get Buildings for style

In [None]:
def get_buildings_for_style(soup, style):
    buildings=[]
   
    for building in soup.find_all('div',{'class': 'box'}):
        build_style = building.find('div',{'class': 'box_image'})['style']
        image_url = re.findall(r'\((.*?)\)', build_style)[0]
        build_dict={
            'Style':style,
            'Name':building.findChild('span',{'class':'title'}).text,
            'Bld_link':building.findChild('a').get_attribute_list(key='href')[0],
            'Image':image_url,
            'Address': str(building.findChild('div',{'class':'the_box_text'}).findChild('p')).replace('<br/>',' ').replace('<p>','').replace('</p>','')
        }

        download_image(style,image_url)
        #follow link to get more info on building
        if debug:
            print(build_dict['Bld_link'])
        build_details_dict = get_building_details(build_dict['Bld_link'])
        build_dict = {**build_dict, **build_details_dict}
        buildings.append(build_dict)
   # bld_df = pd.DataFrame(buildings)
    return buildings
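
The image URL is taken from the inline `style` attribute of the `box_image` div by capturing whatever sits between the first pair of parentheses. A small sketch of that regex, assuming the attribute looks like `background-image:url(...)`. The exact attribute text on the site is not shown in this notebook, so the sample string and the helper name `url_from_style` are hypothetical:

```python
import re

def url_from_style(style_attr):
    """Return the text inside the first (...) of a style attribute, or None."""
    match = re.search(r'\((.*?)\)', style_attr)
    return match.group(1) if match else None

# Assumed attribute shape, for illustration only.
sample_style = "background-image:url(/tobuilt_bk/php/Buildingimages/106BedfordRd.jpg)"
print(url_from_style(sample_style))
```

Using `re.search` with a `None` check avoids the `IndexError` that indexing `re.findall(...)[0]` raises when a box has no image.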

In [None]:

def process_style(style):
    '''
    Scrape every building listed for one style and add the results to buildings_list.
    Test: process_style('Arts and Crafts')
    '''
    print(f"loading {style}")
    curr_style_url = f"{style_url}?MainStyle={style}"
    soup=get_scrolling_page(curr_style_url)
    print(curr_style_url)
#     html = load_page(curr_style_url)
#     soup = BeautifulSoup(html)
#     if 'document.location' in soup.text:
#         # need to redirect to another page
#         redirect_url = soup.text.strip().replace('document.location = "','').replace('";',"")
#         curr_style_url=f"{site_root}{redirect_url}"
#        # print(curr_style_url)
#         html2 = load_page(curr_style_url)
#         soup = BeautifulSoup(html2)
        
    style_list = get_buildings_for_style(soup, style)
    print(style_list)
    buildings_list.extend(style_list)
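
Style names such as `Arts and Crafts` contain spaces, so the query string relies on the client to encode it. Encoding it explicitly makes the resulting URL unambiguous. A hedged sketch using `urllib.parse.urlencode`, where `build_style_url` is a hypothetical helper that is not part of the notebook:

```python
from urllib.parse import urlencode

style_url = "http://www.acotoronto.ca/search_buildingsDB_d2.php"

def build_style_url(style):
    """Return the search URL for one style with the query string encoded."""
    return f"{style_url}?{urlencode({'MainStyle': style})}"

encoded = build_style_url('Arts and Crafts')
print(encoded)
# http://www.acotoronto.ca/search_buildingsDB_d2.php?MainStyle=Arts+and+Crafts
```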

In [None]:
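def run_webscrape():
    '''
    Run the web scrape: read the style list from the main page, then process each style in turn.
    '''
    html = load_page(main_page)
    soup = BeautifulSoup(html, 'html.parser')
    style_options = soup.find('select',{'name':'MainStyle'}).find_all('option')
    # skip the first <option> (presumably the placeholder entry)
    for style in style_options[1:]:
        process_style(style.text)
        time.sleep(5)
        # write a timestamped snapshot of everything scraped so far
        bld_df = pd.DataFrame(buildings_list)
        bld_df.to_csv('../data/aco_buildings_'+ str(round(time.time(),0)) + '.csv')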

In [None]:
# def save_to_database(db):
#     # recreate full links for urls
#     bld_df['Bld_link'] =bld_df['Bld_link'].apply(lambda x: f'{site_root}{x}' )
#     bld_df['Image'] =bld_df['Image'].apply(lambda x: f'{site_root[:-1]}{x}' )
#     db=connect_db() #establish connection
#     meta = MetaData(db)
#     table = Table('points_of_interest', meta, autoload=True)
    
#     for ix,row in bld_df.iterrows():
#         row_dict ={df_to_db_map[k]:v for k, v in row.items() if k in df_to_db_map.keys() and not pd.isnull(v)}
#         new_row=db.execute(table.insert(), [ 
#             row_dict
#         ])
#         new_id=new_row.inserted_primary_key[0]
#         if new_id:
#             db.execute('''INSERT INTO architectural_styles(poi_id, style) VALUES ( {},'{}')'''.format(new_id, row['Style']))
#             if row['Companies']:
#                 for architect in row['Companies']:

#                     db.execute('''INSERT INTO architects(poi_id, architect_name) VALUES ( {},'{}')'''.format(new_id, architect.replace("'","''")))


#### Save to database
* Ran into an error at one point because the website contained duplicate entries for Moriyama and Teshima Architects for building 755 (IntegrityError: duplicate key value violates unique constraint "architects_pkey" DETAIL: Key (poi_id, architect_name)=(755, Moriyama and Teshima Architects) already exists.)
* Added a check so that duplicates are not inserted
* Set up a dictionary with entries like: `poi_dict = {}` followed by `poi_dict['name'] = row['Name']`
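
The duplicate check described above can be sketched on its own. This version tracks every name already seen rather than only the previous one, which is an assumption about what is wanted. It also skips the `Also see` cross-reference entries that the ORM function filters out. `unique_architects` is a hypothetical helper:

```python
def unique_architects(companies):
    """Return architect names in order, without duplicates or 'Also see' entries."""
    seen = set()
    result = []
    for name in companies:
        if 'Also see' in name or name in seen:
            continue
        seen.add(name)
        result.append(name)
    return result

print(unique_architects([
    'Moriyama and Teshima Architects',
    'Moriyama and Teshima Architects',
    'Also see: another entry',
]))
# ['Moriyama and Teshima Architects']
```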

In [8]:
def save_to_database_ORM(session):
    '''
    Saves scraped data to the database using the SQLAlchemy ORM.
    Updates three tables: points_of_interest, architectural_styles, architects.
    The relationship between these tables is defined in models.py, so the poi_id column
    in the child tables is populated automatically with the poi_id of the main entry.
    '''
    
    for index, row in bld_df.iterrows():
        
        poi_dict ={df_to_db_map[k]:v for k, v in row.items() if k in df_to_db_map.keys() and not pd.isnull(v)}
        poi_dict['source']= site_root
        poi = PointsOfInterest(**poi_dict )

        # define style
        style=ArchitecturalStyles(style=row['Style'])
        poi.styles.append(style)
        
        # architects (can be multiple)
        if row['Companies']:
            prev_company=""
            for company in row['Companies']:
                # skip consecutive duplicates and "Also see" cross-references
                if company != prev_company and 'Also see' not in company:
                    # the ORM binds parameters itself, so quotes need no manual escaping
                    architect = Architects(architect_name=company)
                    poi.architects.append(architect)
                    prev_company=company
        session.add(poi)
        session.commit()

In [4]:

if rerun_webscrape:
    run_webscrape()
    bld_df = pd.DataFrame(buildings_list)
    bld_df.to_csv('../data/aco_buildings_'+ str(round(time.time(),0)) + '.csv')
else:
    # open file and save results to database
    bld_df=pd.read_csv('../data/aco_buildings_1543192082.0.csv',index_col=0) #, converters={"Companies": literal_eval}) #,  converters={1:ast.literal_eval})
    # clean up list stored in csv -- have to get python to treat as a list
    bld_df['Companies']=bld_df['Companies'].fillna('[]')
    bld_df.Companies = bld_df.Companies.apply(literal_eval)
    # create full urls out of links
    bld_df['Bld_link'] =bld_df['Bld_link'].apply(lambda x: f'{site_root}{x}' )
    bld_df['Image'] =bld_df['Image'].apply(lambda x: f'{site_root[:-1]}{x}' )
    
bld_df.head()


| | Address | Bld_link | Companies | Completed | Current use | Demolished | Heritage | Image | Name | Notes | Style | Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 Chestnut Park Rosedale Toronto | http://www.acotoronto.ca/show_building.php?Bui... | [Alfred E. Boultbee] | 1905 | Residential | | South Rosedale Heritage Conservation District | http://www.acotoronto.ca/tobuilt_bk/php/Buildi... | Robert Grieg House | First Occupant: Greig, Robert\r\r\n\r\r\nFirst... | American colonial | Detached house |
| 1 | 22 Chestnut Park Rosedale Toronto | http://www.acotoronto.ca/show_building.php?Bui... | [Alfred E. Boultbee] | 1905 | Residential | | South Rosedale Heritage Conservation District | http://www.acotoronto.ca/tobuilt_bk/php/Buildi... | 22 Chestnut Park | First Occupant: Falconbridge, John D.\r\r\n\r\... | American colonial | Detached house |
| 2 | 154-156 Amelia Street Cabbagetown Toronto | http://www.acotoronto.ca/show_building.php?Bui... | [] | unknown | Residential | | Cabbagetown North Heritage Conservation District | http://www.acotoronto.ca/tobuilt_bk/php/Buildi... | 154-156 Amelia Street | | Arts and Crafts | Semi-detached house |
| 3 | 450 Blythwood Road Lawrence Park Toronto | http://www.acotoronto.ca/show_building.php?Bui... | [] | 1953 | Educational | | | http://www.acotoronto.ca/tobuilt_bk/php/Buildi... | Sunny View Public School | | Arts and Crafts | School |
| 4 | 450 Broadview Avenue Riverdale Toronto | http://www.acotoronto.ca/show_building.php?Bui... | [Robert McCallum] | 1906 | Clubhouse | | Heritage property | http://www.acotoronto.ca/tobuilt_bk/php/Buildi... | St. Matthew's Lawn Bowling Club | Formerly at 548 Gerrard Street East This build... | Arts and Crafts | Low-rise |


In [5]:
bld_df.shape

(3527, 12)
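
Lists written to CSV come back as strings such as `"['Alfred E. Boultbee']"`, which is why the loading cell fills missing values with `'[]'` and then applies `literal_eval`. A minimal sketch of that round trip without pandas, where `parse_companies` is a hypothetical helper and `None` stands in for a missing cell:

```python
from ast import literal_eval

def parse_companies(cell):
    """Turn the CSV text of a list back into a Python list."""
    if cell is None or cell == '':
        return []
    # literal_eval only evaluates Python literals, so it is safer than eval
    return literal_eval(cell)

print(parse_companies("['Alfred E. Boultbee']"))  # ['Alfred E. Boultbee']
print(parse_companies(None))                      # []
```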

In [10]:
if populate_db:
    db=connect_db() #establish connection
    Session = sessionmaker(bind=db)
    session = Session() 
    save_to_database_ORM(session)
    

In [11]:
# session only exists when populate_db is True
if populate_db:
    session.close()