# Web Scraping Real Estate Data: Case Study Bogota, Colombia (Colab version)
[Author: Elias Buitrago Bolivar](https://github.com/ebuitrago?tab=repositories)

This Jupyter notebook presents a Python-based web scraping algorithm that collects real estate data from the portal fincaraiz.com.co. The code is functional and was tested by scraping sale listings of used apartments in Bogota, Colombia. This upgraded version is compatible with Colab.
_Updated: Dec 4, 2023_


## Install required libraries

In [4]:
!pip install lxml
!pip install scrapy
!pip3 install requests-html
!pip3 install selenium



In [17]:
%%shell
# Install chromedriver
# Credits: https://medium.com/@MinatoNamikaze02/running-selenium-on-google-colab-a118d10ca5f8
sudo apt -y update
sudo apt install -y wget curl unzip
wget http://archive.ubuntu.com/ubuntu/pool/main/libu/libu2f-host/libu2f-udev_1.1.4-1_all.deb
dpkg -i libu2f-udev_1.1.4-1_all.deb
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
dpkg -i google-chrome-stable_current_amd64.deb

wget -N https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/120.0.6099.62/linux64/chromedriver-linux64.zip -P /tmp/
unzip -o /tmp/chromedriver-linux64.zip -d /tmp/
chmod +x /tmp/chromedriver-linux64/chromedriver
mv /tmp/chromedriver-linux64/chromedriver /usr/local/bin/chromedriver

pip install selenium chromedriver_autoinstaller

[33m0% [Working][0m            Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
[33m0% [Connecting to archive.ubuntu.com (185.125.190.36)] [Waiting for headers] [C[0m                                                                               Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
[33m0% [Connecting to archive.ubuntu.com (185.125.190.36)] [Waiting for headers] [C[0m                                                                               Hit:3 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:7 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-d



### Web Scraping real estate data
This section explains the web scraping process applied to the fincaraiz.com.co web page.

In [None]:
# !pip install undetected_chromedriver

## Import required libraries


---

In [18]:
'''
credits:
https://github.com/googlecolab/colabtools/issues/3347
https://stackoverflow.com/questions/51046454/how-can-we-use-selenium-webdriver-in-colab-research-google-com
Sept 19, 2023
'''

# Helper that can download a chromedriver matching the installed Chrome
!pip3 install chromedriver-autoinstaller

import sys
sys.path.insert(0,'/usr/lib/chromium-browser/chromedriver')

import time
import pandas as pd
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import chromedriver_autoinstaller



## Setup chrome and chrome driver


---



In [19]:
# setup chrome options
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless') # ensure GUI is off
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# # set path to chromedriver as per your configuration
# chromedriver_autoinstaller.install()


## Section to declare functions

---



### Function gethref

In [8]:
#Function to get 'href' from each article item
def gethref(soup):

    links = []
    for article in soup.find_all('article'):
        url = article.find('a', href=True)
        if url:
            link = url['href']
            links.append(link)
    print("Href obtained: ", len(links))

    return links

### Function remove_proyectos

In [9]:
#Function to remove "/proyecto-de-vivienda" (housing project) articles
def remove_proyectos(links):

    #Keep every href that does not start with the housing-project prefix
    url_inmuebles = [x for x in links if not x.startswith("/proyecto-de-vivienda")]
    print("Articles '/proyecto-de-vivienda' identified: ", len(links) - len(url_inmuebles))

    return url_inmuebles
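
As a quick illustration of the filter above, the sketch below applies the same prefix test to a few hrefs. The function name `drop_projects` is hypothetical and the project href is invented for the example; only the `/proyecto-de-vivienda` prefix and the two `/inmueble/...` hrefs come from this notebook.

```python
# Sample hrefs shaped like the ones returned by gethref.
# The second entry is an invented housing-project href (assumption).
sample_links = [
    "/inmueble/apartamento-en-venta/cedritos/bogota/10259782",
    "/proyecto-de-vivienda/ejemplo/bogota/123",
    "/inmueble/casa-en-venta/alhambra/bogota/10207684",
]

PROJECT_PREFIX = "/proyecto-de-vivienda"


def drop_projects(links):
    """Return a new list without the hrefs that start with the project prefix."""
    return [link for link in links if not link.startswith(PROJECT_PREFIX)]


kept = drop_projects(sample_links)
print(len(sample_links) - len(kept), "project link(s) removed")
print(kept)
```

Building a new list leaves the input untouched, which makes it easier to compare the links before and after filtering.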

### Function scrapper

In [10]:
#Function to call housing_features routine on each href
def scrapper(url_inmueble):

    url_inm = 'https://www.fincaraiz.com.co'+ url_inmueble + '/'
    print(url_inm)

    # set up the webdriver
    driver = webdriver.Chrome(options=chrome_options)

    try:
        # Scrape
        driver.get(url_inm)
        driver.implicitly_wait(10)

        #Obtaining the html from the web page after applying Selenium
        html = driver.page_source
        soup = bs(html,'lxml')

        #Applying function to obtain variables defined from one particular property
        features = housing_features(soup)
    finally:
        #quit() closes every window and ends the driver session, even if parsing fails
        driver.quit()

    return features
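
The property URL above is built by string concatenation. As an alternative sketch, `urllib.parse.urljoin` from the standard library can join the site root and the href. `property_url` is a hypothetical helper, and the trailing slash mirrors what `scrapper` appends.

```python
from urllib.parse import urljoin

SITE_ROOT = "https://www.fincaraiz.com.co"


def property_url(href):
    """Join the site root with a listing href and end with a single slash."""
    return urljoin(SITE_ROOT, href).rstrip("/") + "/"


# This href is taken from the sample output shown in this notebook.
print(property_url("/inmueble/apartamento-en-venta/cedritos/bogota/10259782"))
```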

### Function housing_features

In [11]:
# Version 1.0
def housing_features(soup):

    #The 'jss...' class names used below look auto-generated, so these selectors
    #may need updating whenever the site changes its markup.

    #Obtaining whole info from the html section that stores main housing variables
    s = soup.find('main',{'class':'jss6'}).find_all('p')
    print(s)

    #Extract the 28 listing features from soup
    plist = varsfromscrap(s)

    #Extracting and adding name to features list
    aux1 = soup.find('header',{'class': 'jss115'}).find_all('p')[0].text.split(' ')
    plist.append(aux1[0])
    print("The housing type is: " + aux1[0])

    #Extracting and adding location to features list
    p = soup.find('div',{'class': 'jss123'}).find_all('p')[1].text.split('-')
    plist.append(p[0].strip())

    #Extracting and adding price to features list
    #The card may include a header <p> with news about the price,
    #for instance: "El precio bajó recientemente" ("The price dropped recently").
    #That kind of label shifts the position of the <p> corresponding to the price,
    #so the second <p> is tried first and the third one is the fallback.
    price_tags = soup.find('div',{'class': 'jss10'}).find_all('p')
    try:
        aux = price_tags[1].text.replace('$\xa0','').replace('.','')
        plist.append(int(aux))
    except (ValueError, IndexError):
        aux = price_tags[2].text.replace('$\xa0','').replace('.','')
        plist.append(int(aux))

    return plist
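
The price lookup above depends on the position of the `<p>` tag. One possible alternative, sketched here under assumptions, is to scan the texts and take the first one that looks like a price. `parse_price` and `first_price` are hypothetical helpers and the sample texts are invented; the only detail taken from the notebook is the `$` followed by a non-breaking space, with dots as thousands separators.

```python
import re

# Assumed price format: "$", optional whitespace (including a non-breaking
# space), then digits grouped in thousands by dots.
PRICE_PATTERN = re.compile(r"^\$\s*\d{1,3}(\.\d{3})*$")


def parse_price(text):
    """Convert a price text to an int, or return None if it is not a price."""
    text = text.strip()
    if not PRICE_PATTERN.match(text):
        return None
    return int(re.sub(r"\D", "", text))


def first_price(texts):
    """Return the first parseable price found in a list of texts, else None."""
    for text in texts:
        price = parse_price(text)
        if price is not None:
            return price
    return None


# Invented card texts: the price sits in a different position in each list.
plain_card = ["Apartamento en venta", "$\xa0450.000.000"]
card_with_notice = ["Apartamento en venta", "El precio bajó recientemente", "$\xa0380.000.000"]

print(first_price(plain_card))        # 450000000
print(first_price(card_with_notice))  # 380000000
```

Matching on the shape of the text rather than on its index means an extra notice in the card does not change the result.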

### Function scrapebyPages

In [12]:
def scrapebyPages(first_page, last_page):
    #Range of pages from the total search to scrape in; last_page itself is excluded.
    #It is recommended to cover a range of one hundred pages in each call to this function.
    #Rows are appended to the global DataFrame `data`, which must exist before the call.
    for pag in range(first_page, last_page):

        url = f'https://www.fincaraiz.com.co/apartamentos-casas/venta/bogota/bogota-dc?pagina={pag}'
        print(url)

        #Open one browser per results page and always release it
        driver = webdriver.Chrome(options=chrome_options)
        try:
            driver.get(url)
            driver.implicitly_wait(10)
            html = driver.page_source
        finally:
            driver.quit()
        soup = bs(html,'lxml')

        #Get href
        links = gethref(soup)
        print(links)

        #Remove "Proyectos de vivienda" (housing projects)
        url_inmuebles = remove_proyectos(links)
        print(url_inmuebles)

        #Scrape the filtered properties
        for k, url_inmueble in enumerate(url_inmuebles):
            print('Scraping', k, '/', len(url_inmuebles), '...')
            row = scrapper(url_inmueble)
            print(row)

            #append list to DataFrame
            data.loc[len(data)] = row

    return data
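
The comment in `scrapebyPages` recommends covering about one hundred pages per call. A minimal sketch of how such batches could be generated follows. `page_batches` and `page_url` are hypothetical helpers, and the total of 250 pages is an invented figure. Each pair is half-open, matching the `range` used inside `scrapebyPages`.

```python
BASE_URL = "https://www.fincaraiz.com.co/apartamentos-casas/venta/bogota/bogota-dc"


def page_url(page):
    """Build the results URL for one page, as done inside scrapebyPages."""
    return f"{BASE_URL}?pagina={page}"


def page_batches(first_page, last_page, size=100):
    """Yield (start, stop) pairs covering [first_page, last_page) in chunks."""
    for start in range(first_page, last_page, size):
        yield start, min(start + size, last_page)


# 250 pages is an assumed total, used only to show the chunking.
batches = list(page_batches(1, 251))
print(batches)
print(page_url(3))

# Each pair could then be passed to the notebook's function, for example:
# for start, stop in batches:
#     data = scrapebyPages(start, stop)
```

Saving the DataFrame to CSV after each batch would limit how much work is lost if a run is interrupted.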

### Function varsfromscrap

In [13]:
#Function to extract the 28 listing features from the <p> tags of a property page
def varsfromscrap(soup):

    #Labels whose value is the text of the next <p>, paired with the
    #fallback used when the label is missing
    valued = [("Habitaciones", 'No definida'), ("Baños", 'No definida'),
              ("Parqueaderos", '0'), ("Área construída", 'No definida'),
              ("Área privada", 'No definida'), ("Estrato", 'No definida'),
              ("Estado", 'No definida'), ("Antigüedad", 'No definida'),
              ("Administración", 'No definida'), ("Precio m²", 'No definida')]

    #Amenities encoded as 1 (present) or 0 (absent)
    amenities = ["Ascensor", "Circuito cerrado de TV", "Parqueadero Visitantes",
                 "Portería / Recepción", "Zonas Verdes", "Salón Comunal", "Balcón",
                 "Barra estilo americano", "Calentador", "Chimenea", "Citófono",
                 "Cocina Integral", "Terraza", "Vigilancia", "Parques cercanos",
                 "Estudio", "Patio", "Depósito / Bodega"]

    #Transform from bs4.element.ResultSet to list (the last three <p> are skipped)
    plist = [soup[j].text for j in range(len(soup)-3)]

    features = []

    #Features 0-9: text that follows each label
    for label, default in valued:
        try:
            features.append(plist[plist.index(label) + 1])
        except (ValueError, IndexError):
            features.append(default)

    #Features 10-27: presence flags
    for label in amenities:
        features.append(1 if label in plist else 0)

    return features
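
To make the lookup concrete, the sketch below runs the same label-then-value idea on a short list of texts. The sample texts and the helper `value_after` are hypothetical; real pages contain many more `<p>` tags and the value formats may differ.

```python
def value_after(texts, label, default="No definida"):
    """Return the text that follows `label`, or `default` if it is missing."""
    try:
        return texts[texts.index(label) + 1]
    except (ValueError, IndexError):
        return default


# Invented texts in the order they could appear on a property page.
texts = ["Habitaciones", "3", "Baños", "2", "Estrato", "4", "Ascensor", "Balcón"]

print(value_after(texts, "Habitaciones"))       # 3
print(value_after(texts, "Parqueaderos", "0"))  # 0
print(int("Ascensor" in texts))                 # 1
print(int("Chimenea" in texts))                 # 0
```

Valued fields read the next element, while amenities only need a membership test, which is why the two groups are handled differently.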


## Start scraping

---

In [20]:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

In [21]:
cols = ['habitaciones','baños','parqueaderos','area_construida','area_privada','estrato','estado','antiguedad',
        'administracion','precio_m2', 'Ascensor', 'Circuito cerrado de TV',
       'Parqueadero Visitantes', 'Portería / Recepción', 'Zonas Verdes', 'Salón Comunal', 'Balcón',
       'Barra estilo americano', 'Calentador', 'Chimenea', 'Citófono', 'Cocina Integral', 'Terraza',
       'Vigilancia', 'Parques cercanos', 'Estudio', 'Patio', 'Depósito / Bodega', 'nombre','ubicacion','precio']
data = pd.DataFrame(columns=cols)
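
`varsfromscrap` returns 28 values and `housing_features` appends three more (name, location and price), so every row needs 31 entries to match `cols`. A small check like the following could catch a mismatch before a row is written to the DataFrame. `row_to_record` is a hypothetical helper, and the column list and rows are abbreviated, invented examples.

```python
def row_to_record(columns, row):
    """Pair each column name with its value, failing loudly on a length mismatch."""
    if len(row) != len(columns):
        raise ValueError(f"expected {len(columns)} values, got {len(row)}")
    return dict(zip(columns, row))


# Abbreviated, hypothetical column list and rows for illustration only.
columns = ["habitaciones", "baños", "precio"]
good_row = ["3", "2", 450000000]
short_row = ["3", "2"]

print(row_to_record(columns, good_row))

try:
    row_to_record(columns, short_row)
except ValueError as error:
    print("Rejected row:", error)
```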

In [None]:
data = scrapebyPages(3,6)

https://www.fincaraiz.com.co/apartamentos-casas/venta/bogota/bogota-dc?pagina=3
Href obtained:  25
['/inmueble/apartamento-en-venta/el-cortijo/bogota/10197211', '/inmueble/apartamento-en-venta/cedritos/bogota/10259782', '/inmueble/apartamento-en-venta/santa-barbara/bogota/10259644', '/inmueble/apartamento-en-venta/los-rosales/bogota/7934853', '/inmueble/apartamento-en-venta/santa-barbara/bogota/5214435', '/inmueble/apartamento-en-venta/cedritos/bogota/10284922', '/inmueble/apartamento-en-venta/bella-suiza/bogota/10141961', '/inmueble/casa-en-venta/bosa-los-naranjos/bogota/10101190', '/inmueble/casa-en-venta/alhambra/bogota/10207684', '/inmueble/apartamento-en-venta/pasadena/bogota/5424931', '/inmueble/casa-en-venta/villa-del-rio/bogota/10253062', '/inmueble/apartamento-en-venta/galan/bogota/10177924', '/inmueble/apartamento-en-venta/mazuren/bogota/10105232', '/inmueble/apartamento-en-venta/la-cabrera/bogota/10293217', '/inmueble/casa-en-venta/villas-del-mediterraneo/bogota/10157963', '

In [None]:
data.head()

In [None]:
data.to_csv('housing_fincaraiz_example_041223.csv', encoding='utf-8', index=False)

### Testing code for scraping only one page

In [None]:
#*****************************
#Code for testing in one page
#*****************************

# # set the target URL
# pag = 1
# url = f'https://www.fincaraiz.com.co/apartamentos-casas/venta/bogota/bogota-dc?pagina={pag}'

# # set up the webdriver
# driver = webdriver.Chrome(options=chrome_options)

# print(url)
# driver.get(url)
# driver.implicitly_wait(10)
# html = driver.page_source
# soup = bs(html,'lxml')

# # quit the driver
# driver.quit()

# links = []
# links = gethref(soup)
# links

# aux = []
# cols = ['habitaciones','baños','parqueaderos','area_construida','area_privada','estrato','estado','antiguedad',
#         'administracion','precio_m2', 'Ascensor', 'Circuito cerrado de TV',
#        'Parqueadero Visitantes', 'Portería / Recepción', 'Zonas Verdes', 'Salón Comunal', 'Balcón',
#        'Barra estilo americano', 'Calentador', 'Chimenea', 'Citófono', 'Cocina Integral', 'Terraza',
#        'Vigilancia', 'Parques cercanos', 'Estudio', 'Patio', 'Depósito / Bodega', 'nombre','ubicacion','precio']
# data = pd.DataFrame(columns=cols)

# #Remove "Proyectos de vivienda"
# url_inmuebles = []
# url_inmuebles = remove_proyectos(links)

# #Scraping
# p = []
# #Scrape the filtered properties
# for i in range(len(url_inmuebles)):
#   print('Scraping', i, '/', len(url_inmuebles), '...')
#   p.append(scrapper(url_inmuebles[i]))
#   print(p[i])

#   #append list to DataFrame
#   data.loc[len(data)] = p[i]

## References
---



https://github.com/kiteco/kite-python-blog-post-code/blob/master/Web%20Scraping%20Tutorial/script.py

https://medium.com/geekculture/scrappy-guide-to-web-scraping-with-python-475385364381

https://stackoverflow.com/questions/47730671/python-3-using-requests-does-not-get-the-full-content-of-a-web-page