#A house web scraper in Python!
In a few months I’ll have to leave my rented apartment and look for a new one. As painful as this experience can be, especially as a real estate bubble looms on the horizon, I decided to use it as yet another incentive to improve my Python skills! In the end I want to be able to do two things:
· Scrape all the search results from one of the main real estate websites in Portugal (where I live) and build a database with all the listings found
· Use the collected listings to perform some exploratory data analysis (EDA), and ultimately try to find undervalued properties
The website I will be scraping is the real estate portal from Sapo, one of the oldest and most visited websites in Portugal. They have a very large number of real estate listings for us to scrape. Chances are you are using a different website, but you should be able to adapt the code very easily.
Before we begin with the code snippets, let me just give you a summary of what I will be doing. I will use the results page from a simple search on the Sapo website, where I can specify some parameters beforehand (like zone, price filters, number of rooms, etc.) to reduce the task time, or simply query the whole list of results in Lisbon.
We then need to send a request to the website and get a response back. The result will be some HTML code, which we will then use to get the elements we want for our final table. After deciding what to take from each search result property, we need a for loop to open each of the search pages and perform the scraping.
That sounds pretty easy, where do I start?
Like most projects, we need to import the modules to be used. I will use Beautiful Soup to handle the HTML we will be fetching. Always make sure the site you are trying to access allows scraping. You can easily do that if you add “/robots.txt” to the original domain. Inside this file you can see if there are guidelines regarding what you are allowed to scrape.
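If you prefer to check the robots.txt rules from code rather than by eye, the standard library has a parser for them. Below is a minimal sketch, not part of my original scraper: the robots.txt text and the example.com URLs are made up for illustration, and the helper name is_allowed is my own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Allow: /
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "*", "https://example.com/imoveis/a-venda"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "*", "https://example.com/admin/login"))
```

Against a live site you would probably call set_url() and read() on the parser instead of passing the text in yourself.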


In [1]:
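from bs4 import BeautifulSoup
from requests import get
import pandas as pd
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
# sleep and randint are used later to pause between page requests
from time import sleep
from random import randint
sns.set()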

Some websites automatically block any kind of scraping, and that’s why I’ll define a header to pass along with the get command, which will basically make our queries to the website look like they are coming from an actual browser. When we run the program, I’ll have a sleep command between pages, so we can mimic a “more human” behavior and not overload the site with several requests per second. You will get blocked if you scrape too aggressively, so it’s a nice policy to be polite while scraping.

In [2]:
headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36'})
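The “sleep between pages” idea can be wrapped in a tiny helper. This is only a sketch under my own assumptions: the name polite_pause and the 1 to 2 second range are illustrative choices, and the sleep function is passed in as a parameter so the demo call below does not actually wait.

```python
import random
import time

def polite_pause(min_seconds=1.0, max_seconds=2.0, sleep=time.sleep):
    """Sleep for a random duration between min_seconds and max_seconds.

    Returns the chosen delay so the caller can log it.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay

# Demo call with a no-op sleep, so nothing really pauses here.
waited = polite_pause(sleep=lambda seconds: None)
print(1.0 <= waited <= 2.0)
```

In a real crawl you would call polite_pause() with the default sleep once per page request.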

Then we define the base url to be used when querying the website. For this purpose I will just limit my search to Lisbon and sort by creation date. The address bar quickly updates and gives me the parameters sa=11 for Lisbon, and or=10 for the sorting, which I will use in the sapo variable.

In [3]:
sapo = "https://www.cri-ce.com.br/imoveis/a-venda/casa/itaitinga"
response = get(sapo, headers=headers)

And now we can test if we can communicate with the website. You can get several codes from this command, but if you get “200” it’s usually a sign that you’re good to go. You can see a list of these codes here.
We can print the response and the first 1000 characters of the text.

In [4]:
print(response)

<Response [200]>
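If you want something more readable than the bare number, the standard library knows the official names of the status codes. The sketch below is an optional extra, with helper names of my own choosing; with requests you would feed it response.status_code.

```python
from http import HTTPStatus

def describe_status(code):
    """Return the code followed by its standard reason phrase."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know
        return f"{code} (unknown status code)"
    return f"{code} {status.phrase}"

def is_success(code):
    """Codes in the 2xx range mean the request succeeded."""
    return 200 <= code < 300

for code in (200, 403, 404, 429):
    print(describe_status(code), "| success:", is_success(code))
```

A 403 or 429 is a common sign that the site is refusing or throttling the scraper, which is a good moment to slow down.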


In [5]:
print(response.text[:1000])

<!DOCTYPE html><html lang="pt-BR" style="--primary-color: 246, 144, 49; --primary-color-light: 249, 223, 199; --primary-color-medium: 248, 170, 98; --primary-color-dark: 176, 94, 17; --secondary-color: 84, 84, 84; --secondary-color-light: 161, 161, 161; --secondary-color-medium: 110, 110, 110; --secondary-color-dark: 33, 33, 33; --text-primary: #3d3d3d; --text-secondary: #FFF"><head><!--M^s0-2 s0 2--><link rel="icon" type="image/x-icon" href="https://imgs.kenlo.io/VWRCUkQ2Tnp3d1BJRDBJVe1s0xgxS7daNJUEv7tewTj5teT1Ozqgzm1JNjAvUFRCJadQk2NyQ4sn9UZultlp41E0iI0WVb63pyib08LPuonI8wO937X4npyd++rBfez57sdijPeqSgH9uvU-F9J8OhwjPLd2GBXgVomMYyAP+GOH+gDHS7xMCS4fxktgz09F.png"><meta charset="utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta name="viewport" content="width=device-width, initial-scale=1"><meta name="description" content="Na Ceará Rede Imóveis você encontra casas à venda em Ancuri, Barrocão, Jabuti, Gereraú e Pedras - Itaitinga. Confira mais imóveis à venda ou para alugar em 

Alright, we’re all set to start exploring whatever we get from the website. We need to define the Beautiful Soup object that will help us read this html. That’s what BS does: it picks the text from the response and parses the information in a way that makes it easier for you to navigate in its structure and get its contents.
#Time to make some Soup!

In [6]:
html_soup = BeautifulSoup(response.text, 'html.parser')
#html_soup

Before extracting the price, we want to be able to identify each result in the page. In order to know what tags we need to call, we can follow them from the price tag to the top until we reach something that looks like the main container for each result. We can see it below:

In [7]:
house_containers = html_soup.find_all('div', class_="link-all")


We now have an object that can be iterated over while we scrape the results in each search page. Let’s try and get the price we saw before. I’ll define the variable first, which will hold the structure of our first house (picked up from the house_containers variable).

In [None]:
first = house_containers[0]
first.find_all('span')
print(first.prettify())

So the price is quite easily obtainable, but there are some special characters along with the text. A simple way to take care of it is to simply replace the special character with nothing. I’ll break it down below, as I transform the string into a number.

In [9]:
var_1 = first.find_all('span')[0].text
var_1

'R$ 40.000'

In [10]:
var_1=var_1.replace('.','')
var_1 = float(var_1[3:])
print(var_1,type(var_1))

40000.0 <class 'float'>
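Slicing off the first three characters works for this listing, but it depends on the exact “R$ ” prefix. A slightly more defensive version could look like the sketch below. It assumes Brazilian number formatting (dot for thousands, comma for decimals), and the function name parse_brl_price is my own.

```python
import re

def parse_brl_price(text):
    """Convert a Brazilian-formatted price such as 'R$ 40.000' to a float."""
    # first run of digits and dots, optionally followed by a decimal comma
    match = re.search(r"\d[\d.]*(?:,\d+)?", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    # drop the thousands separators, then turn the decimal comma into a dot
    number = match.group(0).replace(".", "").replace(",", ".")
    return float(number)

print(parse_brl_price("R$ 40.000"))
print(parse_brl_price("R$ 1.250.000,50"))
```

Raising an error when no digits are found makes it obvious when a listing shows text instead of a price.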


In this last step, I removed the thousands separator and sliced off the “R$ ” prefix, so only the digits were left to convert. We just scraped our first price! The other fields we want to get are: title, size, date posted, location, condition status, short description, link for the property and link to a thumbnail.
I’ll give some examples below before we build the amazing for loop that will get us every result from every page.

In [11]:
bairro = first.find_all('h2',class_='card-title')[0].text
bairro

'Ancuri'

In [12]:
titulo = first.find_all('h3',class_='card-text')[0].text
titulo

'Casa em Itaitinga'

In [13]:
descricao = first.find_all('p',class_='description')[0].text
descricao

'Casa REPASSE\r\n 02 Quartos (02 suítes )\r\n Wc social  reversível \r\n Sala Ampla\r\n Cozinha americana\r\n Área de serviço coberta \r\n quintal e varanda \r\n 02 Vagas na Garagem;\r\n 75 m² área construída.\r\n\r\nCorretores de Plantão  \r\n Segunda a sábado: 8:00 as 17:00 \r\n Domingo: 8:00 as 13:00.\r\n\r\n Ciro Chaves Imóveis  18769J\r\n (85) 99227-8053 Whatsapp'
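The description comes back full of line breaks and stray spaces. If you want a tidier string before storing it, something like the sketch below could help. It is an optional extra rather than part of the scraper, and clean_description is a name I made up.

```python
def clean_description(text):
    """Collapse repeated spaces and drop empty lines from a description."""
    # splitlines() understands '\r\n', and split()/join squeezes inner whitespace
    lines = (" ".join(line.split()) for line in text.splitlines())
    return "\n".join(line for line in lines if line)

sample = "Casa REPASSE\r\n 02 Quartos (02 suítes )\r\n Wc social  reversível \r\n\r\nCorretores de Plantão  "
print(clean_description(sample))
```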

In [14]:
# the feature values live in the 'h-money' spans inside the "values" block
items = first.find_all(class_="values")[0]
items = items.find_all('span',class_='h-money')
# labels in the order the values appear in the listing
labels = ['Quartos','Suítes','Banheiros','Vagas','Área']
dados_items = dict.fromkeys(labels, 0)
for label, item in zip(labels, items):
  # the first child of each span holds the number as text
  dados_items[label] = float(item.contents[0])

dados_items

{'Banheiros': 2.0, 'Quartos': 2.0, 'Suítes': 2.0, 'Vagas': 2.0, 'Área': 75.0}

In [15]:
items = pd.DataFrame(dados_items.items(),columns=['Item','Valor'])
items

        Item  Valor
0    Quartos    2.0
1     Suítes    2.0
2  Banheiros    2.0
3      Vagas    2.0
4       Área   75.0


In [16]:
first.find_all('p',class_='searchPropertyDescription')[0]

IndexError: ignored

In [None]:
#gets all the links
for url in first.find_all('a'):
  print(url.get('href'))
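Links scraped this way are often relative, and url.get('href') returns None for anchors without an href. A small sketch of how they could be normalised with the standard library follows. The sample hrefs are hypothetical and absolute_links is my own helper name.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.cri-ce.com.br/imoveis/a-venda/casa/itaitinga"

def absolute_links(base_url, hrefs):
    """Turn relative hrefs into absolute URLs, skipping missing ones."""
    return [urljoin(base_url, href) for href in hrefs if href]

# Hypothetical hrefs: one relative, one missing, one already absolute.
sample_hrefs = ["/imovel/casa-123", None, "https://example.com/outro"]
print(absolute_links(BASE_URL, sample_hrefs))
```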

#Enough with tags, let’s scrape some pages already!
Once you’re comfortable with the fields to extract and you have found a way to extract them all from each result container, it’s time to set up the base of our crawler. The following lists will be created to handle our data and later be used to put together the dataframe.

In [None]:
# setting up the lists that will form our dataframe with all the results
titles = []
created = []
prices = []
areas = []
zone = []
condition = []
descriptions = []
urls = []
thumbnails = []

From a quick check on the original web page, I see there are 871 pages of results. We can give it a little more room and set the loop for 900 iterations. We’ll add something to break the loop if it finds a page without any house container. The page parameter is the &pn=x at the end of the address, where x is the results page number.
The code is made up of two for loops, which navigate through every house on the page, for every page available.
If you follow along, you will notice we’re simply collecting the data we already explored above as we cycle through the results. The price field turned out to be more complicated, as there were cases containing both sale and rent prices separated by a “/”. In some results, index 2 returned “Contacte Anunciante”, so I had to update the code with an if statement to look for the price in the next index position.
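Before the full loop, here is the “stop at the first empty page” idea in isolation. This is a simplified sketch and not the scraper itself: fetch_page stands in for the request and parsing step, and FAKE_SITE is made-up data so the example runs without touching the network.

```python
def scrape_all_pages(fetch_page, max_pages=900, first_page=0):
    """Collect results page by page until a page comes back empty."""
    results = []
    pages_scraped = 0
    for page in range(first_page, first_page + max_pages):
        items = fetch_page(page)
        if not items:
            # an empty page means we ran past the last page of results
            break
        results.extend(items)
        pages_scraped += 1
    return pages_scraped, results

# Hypothetical stand-in for the website: three pages of fake listings.
FAKE_SITE = {0: ["a", "b"], 1: ["c"], 2: ["d", "e"]}
pages, listings = scrape_all_pages(lambda page: FAKE_SITE.get(page, []))
print(pages, listings)
```

Note that this sketch counts only the pages that returned results, so the empty page that triggers the break is not included in the total.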

In [None]:
%%time

n_pages = 0

for page in range(0,2100):
    n_pages += 1
    sapo_url = 'https://casa.sapo.pt/Venda/Apartamentos/?sa=11&lp=10000&or=10'+'&pn='+str(page)
    r = get(sapo_url, headers=headers)
    page_html = BeautifulSoup(r.text, 'html.parser')
    house_containers = page_html.find_all('div', class_="searchResultProperty")
    if house_containers != []:
        for container in house_containers:
            
            # Price: index 2 normally holds it; fall back to index 3
            # when the listing shows 'Contacte Anunciante' instead
            price = container.find_all('span')[2].text
            if price == 'Contacte Anunciante':
                price = container.find_all('span')[3].text
            # keep only the sale price when sale and rent are separated by '/'
            if price.find('/') != -1:
                price = price[0:price.find('/')-1]
            # keep only the digits and convert them to an integer
            price = ''.join(ch for ch in price if ch.isdigit())
            prices.append(int(price))

            # Zone
            location = container.find_all('p', class_="searchPropertyLocation")[0].text
            location = location[7:location.find(',')]
            zone.append(location)

            # Title
            name = container.find_all('span')[0].text
            titles.append(name)

            # Status
            status = container.find_all('p')[5].text
            condition.append(status)

            # Area
            m2 = container.find_all('p')[9].text
            if m2 != '-':
                m2 = m2.replace('\xa0','')
                m2 = float("".join(itertools.takewhile(str.isdigit, m2)))
                areas.append(m2)
                
            else:
                m2 = container.find_all('p')[7].text
                if m2 != '-':
                    m2 = m2.replace('\xa0','')
                    m2 = float("".join(itertools.takewhile(str.isdigit, m2)))
                    areas.append(m2)
                else:
                    areas.append(m2)

            # Creation date
            date = pd.to_datetime(container.find_all('div', class_="searchPropertyDate")[0].text[21:31])
            created.append(date)

            # Description
            desc = container.find_all('p', class_="searchPropertyDescription")[0].text[7:-6]
            descriptions.append(desc)

            # url
            link = 'https://casa.sapo.pt/' + container.find_all('a')[0].get('href')[1:-6]
            urls.append(link)

            # image
            img = str(container.find_all('img')[0])
            img = img[img.find('data-original_2x=')+18:img.find('id=')-2]
            thumbnails.append(img)
    else:
        break
    
    sleep(randint(1,2))
    
print('You scraped {} pages containing {} properties.'.format(n_pages, len(titles)))
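The price and area handling inside the loop is easier to test when pulled out into small functions. The sketch below shows one way to do that. The function names and sample strings are my own, and returning None for missing values is a design choice that differs from the loop above, which appends '-' for a missing area.

```python
import itertools

def sale_price(text):
    """Keep the part before '/', then join its digits into an int."""
    sale_part = text.split("/")[0]
    digits = "".join(ch for ch in sale_part if ch.isdigit())
    # listings without a numeric price give None instead of raising
    return int(digits) if digits else None

def leading_area(text):
    """Read the digits at the start of an area string, ignoring the unit."""
    cleaned = text.replace("\xa0", "").strip()
    digits = "".join(itertools.takewhile(str.isdigit, cleaned))
    return float(digits) if digits else None

print(sale_price("250.000 € / 1.200 €"))
print(sale_price("Contacte Anunciante"))
print(leading_area("75\xa0m²"))
print(leading_area("-"))
```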

In [None]:
cols = ['Title', 'Zone', 'Price', 'Size (m²)', 'Status', 'Description', 'Date', 'URL', 'Image']

lisboa = pd.DataFrame({'Title': titles,
                           'Price': prices,
                           'Size (m²)': areas,
                           'Zone': zone,
                           'Date': created,
                           'Status': condition,
                           'Description': descriptions,
                           'URL': urls,
                           'Image': thumbnails})[cols]

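# recent pandas versions no longer write the legacy .xls format, so save as .xlsx
# (this assumes an Excel writer such as openpyxl is installed)
lisboa.to_excel('lisboa_raw.xlsx')

# lisboa = pd.read_excel('lisboa_raw.xlsx')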