<a href="https://colab.research.google.com/github/felipeaguirre66/RealStateScrapper/blob/main/scraper_real_state.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Real State Scraper**

This notebook scrapes listings from the main property services:

[Argenprop](https://www.argenprop.com/), [Mercado Libre](https://www.mercadolibre.com.ar/) and [Zonaprop](https://www.zonaprop.com.ar/)


The scraped variables are:

*   city (`city`)
*   address (`adress`)
*   property type (`property_type`): apartment (`departamento`) or house (`casa`)
*   contract type (`contract_type`): rent (`alquiler`) or sale (`venta`)
*   number of rooms (`rooms`)
*   square meters (`squared_meters`)
*   rent/sale price (`buy_rent_price`)
*   maintenance fee, "expensas" (`expensas_price`)
*   rent/sale currency (`buy_rent_currency`): USD or \$
*   maintenance-fee currency (`expensas_currency`): USD or \$
*   service (`service`): argen_prop, mercado_libre or zona_prop
*   link to the listing (`link`)



## **How to use it:**

0. In the last cell, paste the link to your database (a Google Sheet).

1. In the cell below, enter the number of pages to scrape for each service, property type and contract type (you can run just that cell to estimate how long the run will take).

2. In the top menu, choose "Runtime" and then "Run all".

3. The notebook may ask for access to your Google account; grant it.

4. Once it finishes, open the "query_real_state" file to query the results.

In [None]:
# Argenprop
# 2 seconds per page (20 properties per page)
argenprop_cantidad_paginas_a_scrapear = {
    
                                      'departamentos venta':500,
                                      'departamentos alquiler':500,
                                      'casas venta':500,
                                      'casas alquiler':500,

                                      }

# Mercado Libre
# 3 seconds per page (48 properties per page)
mercado_libre_cantidad_paginas_a_scrapear = {
    
                                          'departamentos venta':100,
                                          'departamentos alquiler':100,
                                          'casas venta':100,
                                          'casas alquiler':100,

                                          }
# Zona Prop
# 72 seconds per page (20 properties per page)
zona_prop_cantidad_paginas_a_scrapear = {
    
                                'departamentos venta':100,
                                'departamentos alquiler':100,
                                'casas venta':100,
                                'casas alquiler':100,

                                }




ap_pages_len = sum(argenprop_cantidad_paginas_a_scrapear.values())
ml_pages_len = sum(mercado_libre_cantidad_paginas_a_scrapear.values())
zp_pages_len = sum(zona_prop_cantidad_paginas_a_scrapear.values())
print(f"""
Argenprop will take {ap_pages_len*2} seconds for {ap_pages_len*20} properties.
Mercado Libre will take {ml_pages_len*3} seconds for {ml_pages_len*48} properties.
Zonaprop will take {zp_pages_len*72} seconds for {zp_pages_len*20} properties.
The total will take {round((ap_pages_len*2+ml_pages_len*3+zp_pages_len*72)/60)} minutes for {ap_pages_len*20+ml_pages_len*48+zp_pages_len*20} properties.
""")


Argenprop will take 4000 seconds for 40000 properties.
Mercado Libre will take 1200 seconds for 19200 properties.
Zonaprop will take 28800 seconds for 8000 properties.
The total will take 567 minutes for 67200 properties.
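
The estimate above can also be wrapped in a small function, which makes it easy to try other page counts before committing to a long run. This is only a sketch: the per-page timings and listings per page are copied from the comments in the configuration cell, while `SERVICE_STATS` and `estimate_run` are hypothetical names that do not exist in the notebook.

```python
# Seconds per page and listings per page, taken from the comments in the cell above.
SERVICE_STATS = {
    "argenprop": {"seconds_per_page": 2, "listings_per_page": 20},
    "mercado_libre": {"seconds_per_page": 3, "listings_per_page": 48},
    "zona_prop": {"seconds_per_page": 72, "listings_per_page": 20},
}


def estimate_run(pages_by_service):
    """Return (total_minutes, total_listings) for a {service: page_count} mapping."""
    total_seconds = 0
    total_listings = 0
    for service, pages in pages_by_service.items():
        stats = SERVICE_STATS[service]
        total_seconds += pages * stats["seconds_per_page"]
        total_listings += pages * stats["listings_per_page"]
    return round(total_seconds / 60), total_listings


# Same page counts as the configuration above (4 categories per service).
print(estimate_run({"argenprop": 2000, "mercado_libre": 400, "zona_prop": 400}))
# (567, 67200)
```

Most of the runtime comes from Zonaprop, so lowering its page count is the quickest way to shorten a run.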



### Imports

In [1]:
# Installs for Google Sheet connection
import gspread
import google.auth
from google.colab import auth
from google.colab import files
from google.auth import default
!pip install --upgrade -q pygsheets
import pygsheets

# Helpers
def update_sheet(df, link, overwrite_last_sheet):

  """
  Updates the database in Drive.
  
  Input:
    df: scraped df
    link: link to the Google Sheets database
    overwrite_last_sheet: whether to overwrite the last sheet ('si') or create a new one
  """

  # Access database
  gc = pygsheets.client.Client(creds)
  sh = gc.open_by_url(link)
  worksheet_names = [worksheet.title for worksheet in sh.worksheets()]

  # Today's date as sheets name
  now = datetime.datetime.now()
  today = now.strftime("%d/%m/%Y")

  # Choose whether to overwrite the db or create a new sheet
  if overwrite_last_sheet != 'si':
    while True: # add space to name if needed
      if today in worksheet_names:
        today = today + ' '
      else:
        break
    wks = sh.add_worksheet(today)
    print('Creating new database')

  else:
    wks = sh.worksheets()[-1]
    print('Overwriting database')

  # Get the current number of rows from both df and db
  num_rows_in_gs = wks.rows
  num_rows_in_df = df.shape[0]

  # Get number of rows to insert (df rows plus one header row)
  num_rows_to_insert = num_rows_in_df - num_rows_in_gs + 1

  # Insert rows only when the sheet is too short
  if num_rows_to_insert > 0:
    wks.insert_rows(row=1, number=num_rows_to_insert)

  # The dataframe itself is written below, in chunks, to the worksheet
  # selected above

  # Divide df into chunks smaller than limit
  rows = df.shape[0]
  limit = 3845
  iterations_needed = math.ceil(rows/limit)
  
  # Iterate and load df
  for i in range(iterations_needed):
    first_row = i*limit
    last_row = (i+1)*limit
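    this_df = df.iloc[first_row:last_row].copy()

    if i != 0: 
      # set_dataframe always writes a header row, so promote the chunk's
      # first row to header instead of repeating the column names
      this_df.columns = this_df.iloc[0]
      this_df = this_df.iloc[1:]

    # Chunk 0 starts at A1; later chunks start one row lower because
    # sheet row 1 holds the real header
    start_row = 1 if i == 0 else first_row + 2
    wks.set_dataframe(this_df, f'A{start_row}')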


# Authenticate in drive
auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)


 # Selenium imports
!apt-get update
!apt install firefox
!wget https://github.com/mozilla/geckodriver/releases/download/v0.30.0/geckodriver-v0.30.0-linux64.tar.gz
!tar -xvzf geckodriver-v0.30.0-linux64.tar.gz
!pip install selenium

import os
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Set the path to the Firefox binary
firefox_binary = '/usr/bin/firefox'

# Set the path to the Firefox web driver
driver_path = os.getcwd() + '/geckodriver'

# Set the options for Firefox (headless, since Colab has no display)
options = webdriver.FirefoxOptions()
options.add_argument('-headless')

# Other imports
import pandas as pd
import regex as re
import requests
from bs4 import BeautifulSoup
import time
from time import sleep
import datetime
import math

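
The chunked upload in `update_sheet` depends on getting the row arithmetic right: with one header row, dataframe row `k` (0-based) belongs on sheet row `k + 2`. The sketch below only computes the ranges and never touches Google Sheets, so it can be checked offline. `chunk_ranges` is a hypothetical helper, not part of the notebook or of pygsheets, and the `limit` of 4 is chosen only to keep the example small.

```python
import math


def chunk_ranges(n_rows, limit):
    """Yield (first_df_row, last_df_row_exclusive, first_sheet_row) per chunk.

    Assumes a single header row on sheet row 1, so 0-based dataframe
    row k is stored on 1-based sheet row k + 2.
    """
    for i in range(math.ceil(n_rows / limit)):
        first = i * limit
        last = min(first + limit, n_rows)
        yield first, last, first + 2


# 10 data rows uploaded in chunks of at most 4 rows each.
for first, last, sheet_row in chunk_ranges(10, 4):
    print(f"df rows {first}..{last - 1} -> sheet rows {sheet_row}..{sheet_row + last - first - 1}")
```

Consecutive chunks should neither overlap nor leave gaps; here they cover sheet rows 2 to 11.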
### General Helpers

In [None]:
# General Helpers

headers = {
    "Accept-Language": "en-US,en;q=0.5",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Referer": "http://thewebsite.com",
    "Connection": "keep-alive"}

def price_multiple_replace(text, replacements=None):
    """
    Apply several string replacements in one regex pass and return a float.

    `replacements` maps each substring to find (key) to its replacement (value).
    The default converts Argentine number formatting: '1.234,56' -> 1234.56.
    """
    if not replacements:
        replacements = {"," : ".", "." : ""}
        
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, replacements.keys())))
    
    # For each match, look up the corresponding value in the dictionary
    return float(regex.sub(lambda mo: replacements[mo.group(0)], text))
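
To see what this helper does to typical price strings, here is a minimal standalone sketch of the same single-pass replacement. It uses the standard-library `re` module rather than the `regex` package imported above, and `parse_ar_price` is a hypothetical name used only for illustration.

```python
import re


def parse_ar_price(text):
    """Convert an Argentine-formatted number such as '1.234,56' to a float."""
    # Thousands separator '.' is dropped, decimal comma ',' becomes '.'.
    replacements = {",": ".", ".": ""}
    pattern = re.compile("|".join(map(re.escape, replacements)))
    # One pass over the text, so a ',' turned into '.' is not removed afterwards.
    return float(pattern.sub(lambda m: replacements[m.group(0)], text))


print(parse_ar_price("1.234,56"))  # 1234.56
print(parse_ar_price("85.000"))    # 85000.0
```

One caveat: a price written in US style, such as `1,200.50`, would be misread as `1.2005`, so this approach assumes every site uses the Argentine convention.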

# Argenprop

## Scraper

In [None]:
def scrape_argenprop(url_ap):
    
    result_ap = requests.get(url_ap, headers=headers)

    soup_ap = BeautifulSoup(result_ap.text, 'lxml')

    all_data_ap = []

    # Links
    base_link = 'https://www.argenprop.com'
    div_container = soup_ap.find_all('div',{'class':'listing__item'})
    links_ap = []
    for div in div_container:
        a_link = div.find_all('a', href=True)
        last_link = a_link[0]['href']
        links_ap.append(base_link+last_link)
    all_data_ap.append(links_ap)


    # Service
    all_data_ap.append(['argen_prop']*len(links_ap))

    # City
    city_ap = soup_ap.find_all('p', {'class':'card__title--primary show-mobile'})
    city_ap = [c.text for c in city_ap]
    all_data_ap.append(city_ap)

    # Address
    address_ap = soup_ap.find_all(['h2','p'], {'class':'card__address'})
    address_ap = [c.text.strip() for c in address_ap]
    address_ap = [re.findall('.+', str(c))[0] for c in address_ap]
    all_data_ap.append(address_ap)

    # Property Type
    # property_type_ap = soup_ap.find_all('p', {'class':'card__title--primary hide-mobile'})
    # property_type_ap = [c.text.split(' ')[0] for c in property_type_ap]
    if 'casa' in url_ap:
        property_type_ap = ['casa']*len(links_ap)
    else:
        property_type_ap = ['departamento']*len(links_ap)
    all_data_ap.append(property_type_ap)

    # Contract type
    # contract_type_ap = soup_ap.find_all('p', {'class':'card__title--primary hide-mobile'})
    # contract_type_ap = [c.text.split(' ')[2] for c in contract_type_ap]
    if 'alquiler' in url_ap:
        contract_type_ap = ['alquiler']*len(links_ap)
    else:
        contract_type_ap = ['venta']*len(links_ap)
    all_data_ap.append(contract_type_ap)


    # Rooms and Squared Meters
    ul_container = soup_ap.find_all('ul', {'class':'card__main-features'})
    rooms_ap = []
    squared_meters_ap = []
    for ul in ul_container:
        
        amb = re.findall(r'(\d+)\s(?:ambiente|dorm\.)', str(ul.find_all('span')))
        if len(amb)>0:
            rooms_ap.append(int(amb[0]))
        else:
            rooms_ap.append(1)
            
        m2 = re.findall(r'(\d+)\s*m²', str(ul.find_all('span')))
        if len(m2)>0:
            squared_meters_ap.append(int(m2[0]))
        else:
            squared_meters_ap.append(0)
    all_data_ap.append(rooms_ap)
    all_data_ap.append(squared_meters_ap)


    # Price and currency type
    buy_rent_currency_ap = []
    buy_rent_price_ap = []

    expensas_currency_ap = []
    expensas_price_ap = []

    all_currency_and_price_ap = soup_ap.find_all('p', {'class':'card__price'})
    all_currency_and_price_ap = [c.text for c in all_currency_and_price_ap]
    all_currency_and_price_ap = [re.findall(r'(USD|\$)\s?(\d+(?:[,.]\d+)*)', str(c)) for c in all_currency_and_price_ap]

    for cp in all_currency_and_price_ap:
        if len(cp)>0:
            buy_rent_currency_ap.append(cp[0][0])
            buy_rent_price_ap.append(price_multiple_replace(cp[0][1]))
            if len(cp)>1:
                expensas_currency_ap.append(cp[1][0])
                expensas_price_ap.append(price_multiple_replace(cp[1][1]))
            else:
                expensas_currency_ap.append('consultar_moneda')
                expensas_price_ap.append(0)
        else:
            buy_rent_currency_ap.append('consultar_moneda')
            buy_rent_price_ap.append(0)
            expensas_currency_ap.append('consultar_moneda')
            expensas_price_ap.append(0)
            
    all_data_ap.append(buy_rent_currency_ap)
    all_data_ap.append(expensas_currency_ap)
    all_data_ap.append(buy_rent_price_ap)
    all_data_ap.append(expensas_price_ap)

    return all_data_ap
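
The price block in `scrape_argenprop` treats the first currency/amount pair on a card as the sale or rent price and the second, if present, as the expensas. The sketch below isolates that logic so it can be tried on plain strings. The sample card texts are made up for illustration and are not real Argenprop output, and `split_card_price` and `to_float` are hypothetical names.

```python
import re

# Same pattern as in scrape_argenprop: a currency followed by an amount.
PRICE_PATTERN = re.compile(r"(USD|\$)\s?(\d+(?:[,.]\d+)*)")


def to_float(number_text):
    """'120.000' -> 120000.0 and '1.234,5' -> 1234.5 (Argentine formatting)."""
    return float(number_text.replace(".", "").replace(",", "."))


def split_card_price(card_text):
    """Return (price_currency, price, expensas_currency, expensas_price)."""
    matches = PRICE_PATTERN.findall(card_text)
    if not matches:
        return ("consultar_moneda", 0, "consultar_moneda", 0)
    currency, price = matches[0][0], to_float(matches[0][1])
    if len(matches) > 1:
        return (currency, price, matches[1][0], to_float(matches[1][1]))
    return (currency, price, "consultar_moneda", 0)


print(split_card_price("USD 120.000 + $ 35.000 expensas"))
print(split_card_price("$ 250.000"))
print(split_card_price("Consultar precio"))
```

Because the pairing is purely positional, a card that lists the expensas before the price would be read the wrong way round.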

## Iteration

In [None]:
all_url_ap = [  'https://www.argenprop.com/departamento-venta-pagina-',
                'https://www.argenprop.com/departamento-alquiler-y-alquiler-temporal-pagina-',
                'https://www.argenprop.com/casa-venta-pagina-',
                'https://www.argenprop.com/casa-alquiler-y-alquiler-temporal-pagina-']


total_pages = sum(argenprop_cantidad_paginas_a_scrapear.values())
page_counter = 0

t1 = time.time()

all_columns_ap = []
for i_url, url_ap in enumerate(all_url_ap): # iterate general URLs
    cant_pag = list(argenprop_cantidad_paginas_a_scrapear.values())[i_url]
    categorie = list(argenprop_cantidad_paginas_a_scrapear.keys())[i_url]
    for pag in range(cant_pag): # iterate paginas
        try:

          page_counter += 1
          print(f'Scraping {categorie}, pag. {pag+1} of {cant_pag} from this category. {page_counter} of {total_pages} total pages.')
          
          this_url = url_ap+str(pag+1)

          results_ap = scrape_argenprop(this_url)
          
          all_columns_ap.append(results_ap)

        except Exception as e:
          print(f'Error on page {page_counter}: {e}')

t2 = time.time()

print(f'This took {(t2-t1)/60} minutes, or {round((t2-t1)/page_counter)} seconds per page.')

Scraping departamentos venta, pag. 1 of 500 from this category. 1 of 2000 total pages.
Scraping departamentos venta, pag. 2 of 500 from this category. 2 of 2000 total pages.
Scraping departamentos venta, pag. 3 of 500 from this category. 3 of 2000 total pages.
Scraping departamentos venta, pag. 4 of 500 from this category. 4 of 2000 total pages.
Scraping departamentos venta, pag. 5 of 500 from this category. 5 of 2000 total pages.
Scraping departamentos venta, pag. 6 of 500 from this category. 6 of 2000 total pages.
Scraping departamentos venta, pag. 7 of 500 from this category. 7 of 2000 total pages.
Scraping departamentos venta, pag. 8 of 500 from this category. 8 of 2000 total pages.
Scraping departamentos venta, pag. 9 of 500 from this category. 9 of 2000 total pages.
Scraping departamentos venta, pag. 10 of 500 from this category. 10 of 2000 total pages.
Scraping departamentos venta, pag. 11 of 500 from this category. 11 of 2000 total pages.
...
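
The iteration cell matches each base URL to its page count by position, so the URL list and the configuration dictionary must be in the same order. A minimal sketch of the same pairing using `zip`, which makes that assumption explicit, is shown below. `iter_page_urls` is a hypothetical helper and the page counts are shortened for illustration.

```python
def iter_page_urls(base_urls, pages_by_category):
    """Yield (category, page_url) pairs, pairing URLs and categories by position."""
    for base_url, (category, n_pages) in zip(base_urls, pages_by_category.items()):
        for page in range(1, n_pages + 1):
            # Argenprop page URLs end in the 1-based page number.
            yield category, f"{base_url}{page}"


base_urls = [
    "https://www.argenprop.com/departamento-venta-pagina-",
    "https://www.argenprop.com/casa-venta-pagina-",
]
pages = {"departamentos venta": 2, "casas venta": 1}

for category, url in iter_page_urls(base_urls, pages):
    print(category, url)
```

If the two collections are ordered differently, each URL is scraped with the wrong category's page count and label, and no error is raised.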

# Mercado Libre

## Scraper

In [None]:
# Helpers
def get_ml_expensas_price(link_expensas):
    
    """
    Visit each Mercado Libre listing and return its expensas (maintenance fee)
    as an int. Listings that show no fee get 0.
    """
    expensas_price = []

    for l_expensas in link_expensas:
        result = requests.get(l_expensas, headers=headers)
        soup = BeautifulSoup(result.text, 'lxml')
        try:
            exp = soup.find('p', {'class':'ui-pdp-color--GRAY ui-pdp-size--XSMALL ui-pdp-family--REGULAR ui-pdp-maintenance-fee-ltr'}).text
        except AttributeError: # find() returned None: the page shows no fee
            exp = '0'
        expensas_price.append(exp)

    expensas_price = [re.findall(r'\d+(?:[,.]\d+)*', ex)[0] for ex in expensas_price]
    expensas_price = [int(price_multiple_replace(c)) for c in expensas_price]
    
    return expensas_price

In [None]:
def scrape_mercadolibre(url_ml, scrap_expensas_ML = True):

    result_ml = requests.get(url_ml, headers=headers)

    soup_ml = BeautifulSoup(result_ml.text, 'lxml')
        
    all_data_ml = []

    # Links
    div_container = soup_ml.find_all('div',{'class':'ui-search-result__wrapper shops__result-wrapper'})
    links_ml = []
    for div in div_container:
        a_link = div.find_all('a', href=True)
        link = a_link[0]['href']
        links_ml.append(link)
    all_data_ml.append(links_ml)

    # Service
    all_data_ml.append(['mercado_libre']*len(links_ml))

    # City
    city_ml = soup_ml.find_all('span', {'class':'ui-search-item__group__element ui-search-item__location shops__items-group-details'})
    city_ml = [c.text for c in city_ml]
    all_data_ml.append(city_ml)


    # Address
    address_ml = soup_ml.find_all('span', {'class':'ui-search-item__group__element ui-search-item__location shops__items-group-details'})
    address_ml = [c.text.split(',')[0] for c in address_ml]
    all_data_ml.append(address_ml)

    # Property type
    # property_type_ml = soup_ml.find_all('span', {'class':'ui-search-item__group__element ui-search-item__subtitle shops__items-group-details'})
    # property_type_ml = [c.text.split(' en ')[0] for c in property_type_ml]
    if 'casa' in url_ml:
        property_type_ml = ['casa']*len(links_ml)
    else:
        property_type_ml = ['departamento']*len(links_ml)
    all_data_ml.append(property_type_ml)

    # Contract type
    # contract_type_ml = soup_ml.find_all('span', {'class':'ui-search-item__group__element ui-search-item__subtitle shops__items-group-details'})
    # contract_type_ml = [c.text.split(' en ')[-1] for c in contract_type_ml]
    if 'alquiler' in url_ml:
        contract_type_ml = ['alquiler']*len(links_ml)
    else:
        contract_type_ml = ['venta']*len(links_ml)
    all_data_ml.append(contract_type_ml)

    # Rooms and Squared Meters
    squared_meters = []
    rooms = []
    squared_meters_and_rooms = soup_ml.find_all('ul', {'class':'ui-search-card-attributes ui-search-item__group__element shops__items-group-details'})
    squared_meters_and_rooms = [sm_and_r.find_all('li', {'class':'ui-search-card-attributes__attribute'}) for sm_and_r in squared_meters_and_rooms]

    for sm_and_r in squared_meters_and_rooms:
        sm_and_r = [smr.text for smr in sm_and_r]
        if len(sm_and_r)==2:
            squared_meters.append(sm_and_r[0])
            rooms.append(sm_and_r[1])
        else:
            if 'm²' in sm_and_r[0]:
                squared_meters.append(sm_and_r[0])
                rooms.append('0')
            else:
                squared_meters.append('0')
                rooms.append(sm_and_r[0])
                
    squared_meters = [int(price_multiple_replace(sm.split(' ')[0])) for sm in squared_meters]
    rooms = [int(price_multiple_replace(r.split(' ')[0])) for r in rooms]
    all_data_ml.append(rooms)
    all_data_ml.append(squared_meters)


    # All Currency type
    buy_rent_currency_ml = soup_ml.find_all('span', {'class':'price-tag-symbol'})
    buy_rent_currency_ml = [c.text.replace('U$S', 'USD') for c in buy_rent_currency_ml]
    buy_rent_currency_ml = buy_rent_currency_ml[len(buy_rent_currency_ml)-len(links_ml):] # Eliminate those with no link
    expensas_currency_ml = ['$']*len(links_ml)
    all_data_ml.append(buy_rent_currency_ml)
    all_data_ml.append(expensas_currency_ml)


    # Buy Rent Price
    buy_rent_price_ml = []
    buy_rent_price_ml = soup_ml.find_all('span', {'class':'price-tag-fraction'})
    buy_rent_price_ml = [c.text for c in buy_rent_price_ml]
    buy_rent_price_ml = [int(price_multiple_replace(c)) for c in buy_rent_price_ml]
    buy_rent_price_ml = buy_rent_price_ml[len(buy_rent_price_ml)-len(links_ml):] # Eliminate those with no link
    all_data_ml.append(buy_rent_price_ml)


    # Expensas Price
    if scrap_expensas_ML:
        expensas_price = get_ml_expensas_price(links_ml)
    else:
        expensas_price = [0]*len(links_ml)
    all_data_ml.append(expensas_price)
    
    return all_data_ml

## Iteration

In [None]:
all_url_ml = ['https://inmuebles.mercadolibre.com.ar/departamentos/venta/_Desde_NRO_PAGINA_NoIndex_True',
              'https://inmuebles.mercadolibre.com.ar/departamentos/alquiler-temporario/_Desde_NRO_PAGINA_NoIndex_True',
              'https://inmuebles.mercadolibre.com.ar/casas/venta/_Desde_NRO_PAGINA_NoIndex_True',
              'https://inmuebles.mercadolibre.com.ar/casas/alquiler-temporario/_Desde_NRO_PAGINA_NoIndex_True']

total_pages = sum(mercado_libre_cantidad_paginas_a_scrapear.values())
page_counter = 0

t1 = time.time()

all_columns_ml = []
for i_url, url_ml in enumerate(all_url_ml): # iterate general URLs
    cant_pag = list(mercado_libre_cantidad_paginas_a_scrapear.values())[i_url]
    categorie = list(mercado_libre_cantidad_paginas_a_scrapear.keys())[i_url]
    for pag in range(cant_pag): # iterate paginas
        try:

          page_counter += 1
          print(f'Scraping {categorie}, pag. {pag+1} of {cant_pag} from this category. {page_counter} of {total_pages} total pages.')
          
          pag_counter = (48*pag)+1
          
          this_url = url_ml.replace('NRO_PAGINA',str(pag_counter)) #define page number
          
          results_ml = scrape_mercadolibre(this_url)

          all_columns_ml.append(results_ml)

        except Exception as e:
          print(f'Error on page {page_counter}: {e}')

t2 = time.time()

print(f'This took {(t2-t1)/60} minutes, or {round((t2-t1)/page_counter)} seconds per page.')

Scraping departamentos venta, pag. 1 of 100 from this category. 1 of 400 total pages.
Scraping departamentos venta, pag. 2 of 100 from this category. 2 of 400 total pages.
Scraping departamentos venta, pag. 3 of 100 from this category. 3 of 400 total pages.
Scraping departamentos venta, pag. 4 of 100 from this category. 4 of 400 total pages.
Scraping departamentos venta, pag. 5 of 100 from this category. 5 of 400 total pages.
Scraping departamentos venta, pag. 6 of 100 from this category. 6 of 400 total pages.
Scraping departamentos venta, pag. 7 of 100 from this category. 7 of 400 total pages.
Scraping departamentos venta, pag. 8 of 100 from this category. 8 of 400 total pages.
Scraping departamentos venta, pag. 9 of 100 from this category. 9 of 400 total pages.
Scraping departamentos venta, pag. 10 of 100 from this category. 10 of 400 total pages.
Scraping departamentos venta, pag. 11 of 100 from this category. 11 of 400 total pages.
...
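
Unlike Argenprop, the Mercado Libre URLs in the loop above are paginated by listing offset rather than page number: with 48 listings per page, page `n` (0-based) starts at listing `48 * n + 1`. The sketch below reproduces only that substitution; `mercadolibre_page_url` is a hypothetical helper name.

```python
LISTINGS_PER_PAGE = 48  # listings per results page, as noted in the configuration cell


def mercadolibre_page_url(template, page_index):
    """Fill the NRO_PAGINA placeholder with the offset of the page's first listing."""
    offset = LISTINGS_PER_PAGE * page_index + 1
    return template.replace("NRO_PAGINA", str(offset))


template = "https://inmuebles.mercadolibre.com.ar/casas/venta/_Desde_NRO_PAGINA_NoIndex_True"
for page_index in range(3):
    print(mercadolibre_page_url(template, page_index))
# Offsets used: 1, 49, 97
```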

# ZonaProp

## Scraper

In [None]:
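from selenium.webdriver.firefox.service import Service

def scrape_zonaprop(url_zp):

    # Create a new Firefox browser instance. Selenium 4 takes the driver path
    # through a Service object and the browser binary through the options.
    options.binary_location = firefox_binary
    driver = webdriver.Firefox(service=Service(executable_path=driver_path), options=options)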

    driver.get(url_zp)
    
    sleep(60)

    all_data_zp = []

    # Links
    start_link_zp = 'https://www.zonaprop.com.ar'
    links_zp = driver.find_elements(By.XPATH, '//*/div[@data-to-posting]')
    links_zp = [start_link_zp+l.get_attribute('data-to-posting') for l in links_zp]
    all_data_zp.append(links_zp)

    # Service
    all_data_zp.append(['zona_prop']*len(links_zp))

    # City
    city_zp = driver.find_elements(By.XPATH, '//*/div[@data-qa="POSTING_CARD_LOCATION"]')
    city_zp = [c.text for c in city_zp]
    all_data_zp.append(city_zp)

    # Address
    adress_zp = driver.find_elements(By.XPATH, '//*/div[@class="sc-ge2uzh-0 bzGYzE"]')
    adress_zp = [c.text for c in adress_zp]
    all_data_zp.append(adress_zp)

    # Property type
    if 'casa' in url_zp:
        property_type_zp = ['casa']*len(links_zp)
    else:
        property_type_zp = ['departamento']*len(links_zp)
    all_data_zp.append(property_type_zp)

    # Contract Type
    if 'alquiler' in url_zp:
        contract_type_zp = ['alquiler']*len(links_zp)
    else:
        contract_type_zp = ['venta']*len(links_zp)
    all_data_zp.append(contract_type_zp)

    # Rooms and Squared Meters
    squared_meters_and_rooms_zp = driver.find_elements(By.XPATH, '//*/div[@data-qa="POSTING_CARD_FEATURES"]')
    squared_meters_and_rooms_zp = [sm.text for sm in squared_meters_and_rooms_zp]
    rooms_zp = []
    squared_meters_zp = []
    for smr in squared_meters_and_rooms_zp:
        amb = re.findall(r'(\d+)\s(?:ambiente|dorm\.)', smr)
        if len(amb)>0:
            rooms_zp.append(int(amb[0]))
        else:
            rooms_zp.append(1)
        
        m2 = re.findall(r'(\d+)\s*m²', smr)
        if len(m2)>0:
            squared_meters_zp.append(int(m2[0]))
        else:
            squared_meters_zp.append(0)
    all_data_zp.append(rooms_zp)
    all_data_zp.append(squared_meters_zp)


    # All price and currency
    all_price_and_currency_zp = driver.find_elements(By.XPATH, '//*/div[@class="sc-12dh9kl-0 cysiyu"]')
    all_price_and_currency_zp = [l.text.lower().strip() for l in all_price_and_currency_zp]

    expensas_currency = []
    expensas_price = []
    buy_rent_currency = []
    buy_rent_price = []

    for pc in all_price_and_currency_zp:
        all_matches = re.findall(r'(?:usd|\$)\s*\d*(?:[,.]\d*)*(?:\s*expensas)*', pc)
        found_expensas = False
        if len(all_matches)>0:
            for am in all_matches:
                if 'expensas' in am:
                    expensas_currency.append('$' if '$' in am else 'USD')
                    expensas_price.append(price_multiple_replace(re.findall(r'\d+(?:[,.]\d*)*', am)[0]))
                    found_expensas = True
                else:
                    buy_rent_currency.append('$' if '$' in am else 'USD')
                    buy_rent_price.append(price_multiple_replace(re.findall(r'\d+(?:[,.]\d*)*', am)[0]))
                
            if not found_expensas:
                expensas_currency.append('consult_currency')
                expensas_price.append('consult_price')
                    
        else:
            expensas_currency.append('consult_currency')
            expensas_price.append('consult_price')
            buy_rent_currency.append('consult_currency')
            buy_rent_price.append('consult_price')
    
    all_data_zp.append(buy_rent_currency)
    all_data_zp.append(expensas_currency)
    all_data_zp.append(buy_rent_price)
    all_data_zp.append(expensas_price)
                    
    driver.quit()
    return all_data_zp
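
Both the ZonaProp and Argenprop scrapers read rooms and surface area from free text with the same two patterns, falling back to 1 room and 0 m² when nothing matches. The sketch below applies those patterns to plain strings. The sample feature texts are invented for illustration, and `parse_features` is a hypothetical name.

```python
import re

# Same patterns as in the scrapers above.
ROOMS_PATTERN = re.compile(r"(\d+)\s(?:ambiente|dorm\.)")
AREA_PATTERN = re.compile(r"(\d+)\s*m²")


def parse_features(text):
    """Return (rooms, square_meters), defaulting to (1, 0) when a value is missing."""
    rooms = ROOMS_PATTERN.findall(text)
    area = AREA_PATTERN.findall(text)
    return (int(rooms[0]) if rooms else 1, int(area[0]) if area else 0)


print(parse_features("45 m² tot. 2 ambientes 1 baño"))  # (2, 45)
print(parse_features("3 dorm. 120 m²"))                 # (3, 120)
print(parse_features("Cochera"))                        # (1, 0)
```

The defaults mean a missing value cannot be told apart from a real one-room listing, and only the first number found is kept when a card shows both total and covered area.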

## Iteration

In [None]:
all_url_zp = [
              'https://www.zonaprop.com.ar/departamentos-venta-pagina-NRO_PAGINA.html',
              'https://www.zonaprop.com.ar/departamentos-alquiler-pagina-NRO_PAGINA.html',
              'https://www.zonaprop.com.ar/casas-venta-pagina-NRO_PAGINA.html',
              'https://www.zonaprop.com.ar/casas-alquiler-pagina-NRO_PAGINA.html']


total_pages = sum(zona_prop_cantidad_paginas_a_scrapear.values())
page_counter = 0

t1 = time.time()

all_columns_zp = []
for i_url, url_zp in enumerate(all_url_zp): # iterate general URLs
    cant_pag = list(zona_prop_cantidad_paginas_a_scrapear.values())[i_url]
    categorie = list(zona_prop_cantidad_paginas_a_scrapear.keys())[i_url]
    for pag in range(cant_pag): # iterate paginas
        try:

          page_counter += 1
          print(f'Scraping {categorie}, pag. {pag+1} of {cant_pag} from this category. {page_counter} of {total_pages} total pages.')
          
          this_url = url_zp.replace('NRO_PAGINA',str(pag+1)) #define page number
          
          results_zp = scrape_zonaprop(this_url)

          all_columns_zp.append(results_zp)

        except Exception as e:
          print(f'Error on page {page_counter}: {e}')

t2 = time.time()

print(f'This took {(t2-t1)/60} minutes, or {round((t2-t1)/page_counter)} seconds per page.')

Scraping departamentos venta, pag. 1 of 100 from this category. 1 of 400 total pages.
Scraping departamentos venta, pag. 2 of 100 from this category. 2 of 400 total pages.
Scraping departamentos venta, pag. 3 of 100 from this category. 3 of 400 total pages.
Scraping departamentos venta, pag. 4 of 100 from this category. 4 of 400 total pages.
Scraping departamentos venta, pag. 5 of 100 from this category. 5 of 400 total pages.
Scraping departamentos venta, pag. 6 of 100 from this category. 6 of 400 total pages.
Scraping departamentos venta, pag. 7 of 100 from this category. 7 of 400 total pages.
Scraping departamentos venta, pag. 8 of 100 from this category. 8 of 400 total pages.
Scraping departamentos venta, pag. 9 of 100 from this category. 9 of 400 total pages.
Scraping departamentos venta, pag. 10 of 100 from this category. 10 of 400 total pages.
Scraping departamentos venta, pag. 11 of 100 from this category. 11 of 400 total pages.
Scraping departamentos venta, pag. 12 of 100 from this category. 12 of 400 total pages.
...

# DataFrame

In [None]:
# Combine each of the 12 (city, price, etc) columns found in each of the N pages scraped
amount_columns = 12

ml_combined = [ [item for sublist in all_columns_ml for item in sublist[i]] for i in range(amount_columns)]
ap_combined = [ [item for sublist in all_columns_ap for item in sublist[i]] for i in range(amount_columns)]
zp_combined = [ [item for sublist in all_columns_zp for item in sublist[i]] for i in range(amount_columns)]

# Combine every scraped site
all_columns = []
for i in range(amount_columns):
    all_columns.append(ml_combined[i]+ap_combined[i]+zp_combined[i])
    
# Construct df
col_names = ['link','service','city','adress','property_type','contract_type',
             'rooms','squared_meters','buy_rent_currency','expensas_currency','buy_rent_price','expensas_price']
df = pd.DataFrame(all_columns).T
df.columns = col_names

# Fill NaNs
for col in ['link', 'adress', 'city']:
    df[col] = df[col].fillna('no_info')
for col in ['buy_rent_currency', 'expensas_currency']:
    df[col] = df[col].fillna('consult_currency')

# Prices: placeholders such as 'consult_price' become NaN, then the sentinel value
for col in ['buy_rent_price', 'expensas_price']:
    df[col] = pd.to_numeric(df[col], errors='coerce').fillna(99999999)

# Create total price per month
df['total_price'] = df['buy_rent_price'] + df['expensas_price']
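
The nested comprehension that builds `ml_combined`, `ap_combined` and `zp_combined` flattens, for each column index, that column's values across all scraped pages. A minimal sketch with plain lists and two columns instead of twelve follows. `combine_pages` is a hypothetical name and the sample values are invented.

```python
def combine_pages(pages, n_columns):
    """Concatenate column i of every page into a single list, for each i."""
    return [[item for page in pages for item in page[i]] for i in range(n_columns)]


# Two scraped pages, each a list of columns: [links, services].
pages = [
    [["link1", "link2"], ["argen_prop", "argen_prop"]],
    [["link3"], ["argen_prop"]],
]

combined = combine_pages(pages, n_columns=2)
print(combined)
# [['link1', 'link2', 'link3'], ['argen_prop', 'argen_prop', 'argen_prop']]

# Transposing the columns gives one tuple per listing.
rows = list(zip(*combined))
print(rows[0])  # ('link1', 'argen_prop')
```

This only lines up when every column on a page has the same length. If a selector misses one card, every later value in that column shifts up by one row, which is worth checking before trusting the combined table.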

# Write results to Google Sheets

In [None]:
# Link to the database
link_db = 'https://docs.google.com/spreadsheets/d/182xWzXJ6HQ6f8DwKkS78dSt_4psjic_NP7FJVy4-h7I/edit#gid=0'

# Update database
update_sheet(df, link_db, overwrite_last_sheet='si')