# **Idealista scraper**
*Jose Ramon Estevez Melgarejo 2022-03-04.*

## Introduction
Welcome to the first mini project of my personal portfolio, in which I build a web scraper to obtain valuable information from [Idealista](https://www.idealista.com). When structuring my portfolio, I thought that a good Exploratory Data Analysis (EDA) was the first thing I needed in order to showcase my abilities. After spending some time searching for a messy dataset to work with, I decided to create my own scraper to obtain the data for my analysis. The project is therefore divided into three parts:

1.  Data extraction via web scraping (current notebook) 
2.  Exploratory Data Analysis (EDA)
3.  Machine Learning Regression model to estimate house price

### Why idealista?
Idealista is one of the main real estate platforms that people in Spain use to buy and sell houses. After working as a data scientist for more than three years I have finished plenty of projects, but it is difficult to use them in my portfolio without revealing sensitive information about the companies I have worked for. Therefore, I decided to study the real estate situation in my home city (Cadiz), a study I might also benefit from personally. In addition, the idealista website is relatively easy to scrape and its data is not very well structured (I wanted to start my EDA with messy data, since data cleansing is also a big part of data science).


### Scraping method
As far as I know, there are three main scraping libraries / frameworks: Scrapy, Selenium and Requests + Beautiful Soup. Even though Scrapy and Selenium seem to be more robust alternatives, I decided not to overcomplicate things and to use Requests + Beautiful Soup, given that the idealista website is relatively simple to scrape.

## Index
1.  Importing Libraries & script general variables
2.  Scraping free proxies and defining Request Headers
3.  Houses scraping
    1.  Houses ids scraping
    2.  Houses info scraping


## 1. Importing Libraries 

In [31]:
import requests
from bs4 import BeautifulSoup as bs
import random
import time
import pandas as pd
import proxies_scr # personal script (proxies_scr.py) that scrapes free proxy sites
import datetime as dt

First we define two variables: one limits the number of houses we search for, and the other sets how many working free proxies we want to collect before we start scraping. Why we use free proxies is explained in the next section.

In [32]:
search_limit = 10 # Limit search of houses.
n_prox = 1 # Selecting the number of free proxies to use.

## 2.  Scraping free proxies and defining Request Headers


It is very common for websites to block your IP if you make many consecutive requests in a short period of time, or if they suspect that the visitor is not a human (our case). To avoid being blocked we have to do two things.

-  First, we need to send some request headers. Request headers provide information about the request context. Without them, idealista answers with a 403 response, meaning that our access is denied.

-  Second, we need to rotate proxies. As mentioned before, if we make too many requests in a short period of time, idealista will block our IP. A way around this is to use some free proxies and alternate between IPs.

### 2.1 Request Headers

This point is simple to solve: inspect the website with the browser's developer tools and, under the Network section, you will find the request headers, which you can copy and format as a dictionary:

In [33]:
headers = {
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'es,es-ES;q=0.9,en;q=0.8,fr;q=0.7',
    'cache-control': 'no-cache',
    'pragma': 'no-cache',
    'referer': 'https://www.idealista.com/en/areas/venta-viviendas/?shape=%28%28ez_%7EEn%7Bse%40_bCceIniKcdFpvAfiEa%7EI%7E_J%29%29',
    #'sec-ch-ua': " Not A;Brand";v="99", "Chromium";v="98", "Google Chrome";v="98",
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': 'macOS',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.109 Safari/537.36',
    'x-newrelic-id': 'VQIGUlZbGwIBXFhWBQEDVw==',
    'x-requested-with': 'XMLHttpRequest'
}
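
Formatting the copied headers by hand is error-prone, so a small helper can do it instead. The following is a minimal sketch, assuming the headers were copied from the Network tab as plain `name: value` lines; `parse_raw_headers` is a hypothetical helper name and is not part of the original notebook.

```python
def parse_raw_headers(raw: str) -> dict:
    """Turn 'name: value' lines copied from the browser into a headers dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'.
        if not line or line.startswith(':'):
            continue
        name, sep, value = line.partition(':')
        if not sep:
            continue  # not a 'name: value' line
        headers[name.strip().lower()] = value.strip()
    return headers


# Illustrative input only; the real text comes from the browser.
raw = """
accept: */*
cache-control: no-cache
referer: https://www.idealista.com/
"""
example_headers = parse_raw_headers(raw)
print(example_headers)
```

Only the first colon splits the name from the value, so values that contain colons themselves (URLs, for instance) are kept intact.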

### 2.2 Searching for free proxies

This is, by itself, another easy mini scraping project. 
The function 'get_proxies()', located in 'proxies_scr.py', scrapes two websites that publish lists of free proxies (https://www.proxy-list.download/HTTP & https://free-proxy-list.net/). This is very convenient, but there is a small problem: idealista can see these lists just as we can, and tries to block them, so not all proxies will work. To make the whole process a bit faster, we first test each proxy to see whether it works. A proxy counts as working if idealista answers through it with status 200 or 429 (too many requests), since both mean that the proxy reached the site.

The purpose of the while loop is to keep trying proxies until we have enough of them. The number of working proxies we want is set with the variable 'n_prox' (Section 1). After the while loop we append an empty string ('') to the pool of proxies so that our local IP is used as well.

In [34]:
# Looking for proxies
print('Selecting working proxies.')
working_proxies = []
tested_proxies = []
round_num = 1 # number of times the proxy lists have been scraped

while len(working_proxies) < n_prox:
    #print(f'Round {round_num} of proxy scraping.')
    #print(f'{len(tested_proxies)} proxies tested')
    #print(f'{len(working_proxies)} proxies found')
    proxies = proxies_scr.get_proxies()
    for prox in proxies:
        if len(working_proxies) >= n_prox:
            break
        if prox in working_proxies or prox in tested_proxies:
            continue # already checked in a previous round

        try:
            url = 'https://www.idealista.com/'
            proxy = 'http://' + prox
            r = requests.get(url, headers=headers, proxies={'http': proxy, 'https': proxy}, timeout=10)
            #print(r.status_code)
            # 200 (OK) and 429 (too many requests) both mean the proxy reached idealista.
            if r.status_code in [200, 429]:
                working_proxies.append(prox)
            else:
                tested_proxies.append(prox)
        except requests.RequestException:
            # Connection errors and timeouts: discard this proxy.
            tested_proxies.append(prox)

    if len(working_proxies) < n_prox:
        time.sleep(20) # wait before scraping the proxy lists again

    round_num = round_num + 1


print(f'{len(working_proxies)} proxies found plus local one.')
working_proxies.append('') # to add local IP
print('working proxies are:')
print(working_proxies)

Selecting working proxies.
1 proxies found plus local one.
working proxies are:
['188.138.106.158:5566', '']
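
The selection logic of the loop above can also be separated from the network check, which makes it easier to test. This is only a sketch under assumptions: `select_working_proxies` and the `is_working` callable are hypothetical names, and `fake_check` stands in for the real request to idealista.

```python
def select_working_proxies(candidates, is_working, n_wanted):
    """Return up to n_wanted proxies from candidates that pass is_working."""
    working = []
    tested = set()
    for prox in candidates:
        if len(working) >= n_wanted:
            break
        if prox in tested:
            continue  # do not test the same proxy twice
        tested.add(prox)
        if is_working(prox):
            working.append(prox)
    return working


# Stand-in check for illustration only; a real check would send a request
# through the proxy and inspect the status code, as in the cell above.
def fake_check(prox):
    return prox.endswith(':5566')


candidates = ['10.0.0.1:80', '188.138.106.158:5566', '10.0.0.1:80', '10.0.0.2:5566']
found = select_working_proxies(candidates, fake_check, n_wanted=1)
print(found)
```

Passing the check as a function keeps the helper independent of Requests, so the same code works with any way of testing a proxy.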


## 3. Houses scraping
Now that we have our request headers and our pool of working proxies, we can start scraping house information from idealista.

At idealista.com we have the option to select a search area. For this study I selected the area I am interested in, Cadiz city (the most beautiful city in Spain, and my home city). 
After selecting the area of interest we get a URL ('https://www.idealista.com/en/areas/venta-viviendas/pagina-{x}?shape=%28%28ez_%7EEn%7Bse%40_bCceIniKcdFpvAfiEa%7EI%7E_J%29%29' where 'x' in 'pagina-{x}' is the page number of the results) that we can use to scrape the ids of all houses listed in that area, across all result pages. Once we have all the house ids, we can request the information from each house's URL.
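
To make the URL structure explicit, here is a small sketch that builds the URL for a given results page and decodes the percent-encoded 'shape' parameter. The names `BASE_URL`, `SHAPE` and `build_search_url` are hypothetical and only used for illustration; the decoded value is presumably idealista's own encoding of the selected area.

```python
from urllib.parse import unquote

BASE_URL = 'https://www.idealista.com/en/areas/venta-viviendas/'
# Percent-encoded 'shape' value copied from the browser for the Cadiz search area.
SHAPE = '%28%28ez_%7EEn%7Bse%40_bCceIniKcdFpvAfiEa%7EI%7E_J%29%29'


def build_search_url(page: int) -> str:
    """Return the results URL for the given page number."""
    return f'{BASE_URL}pagina-{page}?shape={SHAPE}'


print(build_search_url(2))
print(unquote(SHAPE))  # the same value with the percent-encoding removed
```

The encoded string is reused exactly as copied from the browser, so the requests match what the site itself generates.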



### 3.1 Houses ids scraping
With the following chunk of code we scrape all the ids for the search area. Note that the variable 'search_limit' defined in Section 1 limits the number of house ids scraped. 

We use the pool of free proxies and the request headers defined above to make our requests.


In [35]:
print('Finding houses ids')
x = 1 # for pagination
ids = []
while len(ids) < search_limit:
    url = f'https://www.idealista.com/en/areas/venta-viviendas/pagina-{x}?shape=%28%28ez_%7EEn%7Bse%40_bCceIniKcdFpvAfiEa%7EI%7E_J%29%29'
    prox = random.choice(working_proxies) # pick a random proxy ('' means local IP)
    proxy = 'http://' + prox if prox != '' else ''

    try:
        r = requests.get(url, headers=headers, proxies={'http': proxy, 'https': proxy}, timeout=10)
        #print('IP: ' + proxy + f' -- status code: {r.status_code}')
        soup = bs(r.text, 'html.parser')
        listing = soup.find('main', {'class':'listing-items'})
        selected_page = listing.find('div', {'class':'pagination'}).find('li', {'class':'selected'}).text

        # If the page shown is not the one requested, we are presumably past the last page.
        if x != int(selected_page):
            break

        articles = listing.find_all('article')
        homes_ids = [article.get('data-adid') for article in articles]
        ids.extend(home_id for home_id in homes_ids if home_id) # skip articles without an ad id

        x = x + 1
        #print('IP seems to work')
        #print(f'{len(ids)} ids scraped')

    except (requests.RequestException, AttributeError, ValueError):
        # The request failed or the page did not have the expected structure
        # (for example, a blocked proxy): retry the same page with another proxy.
        #print('IP: ' + proxy +  ' -- failed, trying a new one.')
        #time.sleep(random.randint(1, 5)*random.random()) # to avoid getting blocked
        pass

ids = ids[:search_limit] # the last page may have pushed us over the limit

print('Houses Ids found:')
print(ids)

Finding houses ids
Houses Ids found:
['94283285', '89113690', '95291210', '96746079', '96622572', '94971680', '84849222', '84635808', '95854107', '93809726']
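
Picking a random proxy and formatting it for Requests is a step that every request in this notebook needs, so it can live in one small function. The following is a sketch, assuming the pool has the same shape as 'working_proxies' (a list of 'ip:port' strings plus an empty string for the local IP); `pick_proxies` is a hypothetical name.

```python
import random


def pick_proxies(pool, rng=random):
    """Pick a random entry from the pool and format it for requests.

    An empty string in the pool stands for the local IP, in which case
    None is returned so that no explicit proxy is used.
    """
    prox = rng.choice(pool)
    if not prox:
        return None
    proxy = 'http://' + prox
    return {'http': proxy, 'https': proxy}


pool = ['188.138.106.158:5566', '']
print(pick_proxies(pool))
```

The return value can be passed directly as the `proxies` argument of `requests.get`, whose default for that argument is also `None`.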


### 3.2 Houses info scraping
Now that we have all our house ids we can scrape the information about each house. This process might take a long time because some of our working proxies might get blocked or removed along the way. It is also possible to pay for more reliable proxies, but since this is a small personal project and I am in no rush, waiting is not a big deal. 
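
Since proxies can stop working at any moment, one option is to cap the number of attempts per house and pause for a random time between them. This is a minimal sketch of that idea, not the approach used in the cell below: `fetch_with_retries`, its parameters and `flaky_fetch` are hypothetical, and `fetch` stands for any function that downloads a page through a given proxy.

```python
import random
import time


def fetch_with_retries(fetch, pool, max_attempts=5, max_pause=3.0, sleep=time.sleep):
    """Call fetch(prox) with random proxies until one attempt succeeds.

    fetch is any callable that returns a result or raises an exception.
    Returns the result, or None once max_attempts is exhausted.
    """
    for _ in range(max_attempts):
        prox = random.choice(pool)
        try:
            return fetch(prox)
        except Exception:
            # Random pause before the next attempt, to avoid hammering the site.
            sleep(random.uniform(0, max_pause))
    return None


# Illustration with a stand-in fetch that fails twice before succeeding.
calls = []

def flaky_fetch(prox):
    calls.append(prox)
    if len(calls) < 3:
        raise ConnectionError('proxy blocked')
    return 'page html'


result = fetch_with_retries(flaky_fetch, ['188.138.106.158:5566', ''], sleep=lambda s: None)
print(result, len(calls))
```

The `sleep` parameter exists only so that the pause can be switched off in the illustration; in a real run the default `time.sleep` would be used.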



In [36]:
# Getting all info from individual houses
print('Finding houses info')
raw_data = pd.DataFrame()

for id in ids:
    house_url = f'https://www.idealista.com/en/inmueble/{id}/'
    keep_searching_proxi = True

    while keep_searching_proxi:
        random_index_prox = random.randint(0, len(working_proxies)-1)
        prox = working_proxies[random_index_prox]
        if prox != '':
            proxy = 'http://' + prox
        else:
            proxy = ''

        try:
            r = requests.get(house_url, headers=headers, proxies={'http': proxy, 'https': proxy},  timeout=10)
            
            if r.status_code == 404:
                print('ID: ' + id +  ' -- Removed from idealista.')
                keep_searching_proxi = False
            
            elif r.status_code == 200:
                tries = 0
                while tries < 10:
                    try:
                        #print('IP:' +  proxy + ' -- seems to work')
                        soup = bs(r.text, 'html.parser')
                        tittle = soup.find('span', {'class':'main-info__title-main'}).text
                        city = soup.find('span', {'class':'main-info__title-minor'}).text
                        price_act = soup.find('span', {'class':'info-data-price'}).find('span', {'class':'txt-bold'}).text

                        try:
                            price_first = soup.find('span', {'class':'pricedown'}).find('span', {'class':'pricedown_price'}).text
                        except AttributeError:
                            # No price-drop element on the page: use the current price.
                            price_first = price_act

                        # Each details block holds one or more feature lists; flatten their items.
                        details_1 = soup.find('div', {'class':'details-property-feature-one'}).find_all('div', {'class':'details-property_features'})
                        details_1_desc = [li.text.strip() for ul in details_1 for li in ul.find_all('li')]

                        details_2 = soup.find('div', {'class':'details-property-feature-two'}).find_all('div', {'class':'details-property_features'})
                        details_2_desc = [li.text.strip() for ul in details_2 for li in ul.find_all('li')]

                        details_3 = soup.find('div', {'class':'details-property-feature-three'}).find_all('div', {'class':'details-property_features'})
                        details_3_desc = [li.text.strip() for ul in details_3 for li in ul.find_all('li')]

                        advertiser = soup.find('div', {'class':'professional-name'}).find('div', {'class':'name'}).text.strip()

                        line = {
                            'house_id': id,
                            'tittle': tittle,
                            'city': city,
                            'price_act': price_act,
                            'price_first': price_first,
                            'details_1_desc': details_1_desc,
                            'details_2_desc': details_2_desc,
                            'details_3_desc': details_3_desc,
                            'advertiser' : advertiser
                        }

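                        # DataFrame.append was removed in pandas 2.0, so the new row is added with pd.concat.
                        raw_data = pd.concat([raw_data, pd.DataFrame([line])], ignore_index=True)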
                        #time.sleep(random.randint(1, 3)*random.random()) # to avoid getting blocked
                        to_go = len(ids) - ids.index(id)
                        #print(f'remaining houses to scrap: {to_go}')
                        keep_searching_proxi = False
                        tries = tries + 10
                    except Exception as e: 
                        #print('Estatus code 200 but not working')
                        #print(e)
                        tries = tries + 1
                        keep_searching_proxi = False
            
            else:
                #print('IP: ' + proxy +  f' -- failed due a status code {r.status_code}, trying a new one.')
                pass


        except requests.RequestException:
            #print('IP: ' + proxy +  ' -- failed, trying a new one. General error')
            #time.sleep(random.randint(1, 5)*random.random()) 
            pass

raw_data['datetime'] = dt.datetime.now() # adding datetime col to know when we scraped.

print(raw_data.head())

Finding houses info
   house_id                                             tittle   city  \
0  94283285     Flat / apartment for sale in Zona Bahía Blanca  Cadiz   
1  89113690  Duplex for sale in Mentidero - Teatro Falla - ...  Cadiz   
2  95291210  Flat / apartment for sale in Mentidero - Teatr...  Cadiz   
3  96746079  Flat / apartment for sale in Urb. alameda apod...  Cadiz   
4  96622572  Flat / apartment for sale in Urb. Playa Santa ...  Cadiz   

  price_act price_first                                     details_1_desc  \
0   750,000     750,000  [262 m² built, 6 bedrooms, 4 bathrooms, Terrac...   
1   575,000     575,000  [135 m² built, 3 bedrooms, 3 bathrooms, Terrac...   
2   320,000     320,000  [190 m² built, 8 bedrooms, 3 bathrooms, Second...   
3   390,000     390,000  [131 m² built, 117 m² floor area, 3 bedrooms, ...   
4   720,000     720,000  [168 m² built, 167 m² floor area, 3 bedrooms, ...   

                   details_2_desc details_3_desc               advertise

Here is the description of the fields in the resulting dataframe:

-   ***house_id***: unique house identifier.

-   ***tittle***: title of the listing.

-   ***city***: city where the house is located.

-   ***price_act***: current price.

-   ***price_first***: first published price (equal to the current price when the listing shows no price drop).

-   ***details_1_desc***: list of characteristics of the house, taken from the first block of property details.

-   ***details_2_desc***: list of characteristics of the house, taken from the second block of property details.

-   ***details_3_desc***: list of characteristics of the house, taken from the third block of property details.

-   ***advertiser***: type of advertiser (private, agency, ...).

-   ***datetime***: timestamp of the scraping date.
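
All of these fields are stored as raw text, and the real cleaning belongs to the EDA part of the project. Just to illustrate what the raw values look like, here is a rough sketch of how a price and a detail item could be turned into numbers. The helper names `parse_price` and `first_number` are hypothetical, and the sketch assumes values formatted like the ones in the output above.

```python
import re


def parse_price(text):
    """Convert a price string such as '750,000' to an integer."""
    digits = re.sub(r'[^\d]', '', text)
    return int(digits) if digits else None


def first_number(details, keyword):
    """Return the first integer found in the first detail containing keyword.

    Assumes the number has no thousands separator, as in '262 m² built'.
    """
    for item in details:
        if keyword in item:
            match = re.search(r'\d+', item)
            if match:
                return int(match.group())
    return None


details = ['262 m² built', '6 bedrooms', '4 bathrooms']
print(parse_price('750,000'), first_number(details, 'bedrooms'))
```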


Finally, we save the resulting dataframe to a CSV file.

In [37]:
# Saving data
now = dt.datetime.now() # current date and time
file_name = 'data/raw_data_' + now.strftime("%Y%m%d") + '.csv'

print('Saving data')
raw_data.to_csv(file_name)

print('End')


Saving data
End
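
One detail worth noting is that `to_csv` does not create missing folders, so the cell above fails if the 'data' folder does not exist yet. A possible way around this, sketched with the standard library only, is to build the dated path and create the folder in the same step. `dated_csv_path` is a hypothetical helper, and the temporary folder is used only to keep the illustration self-contained.

```python
import datetime as dt
import tempfile
from pathlib import Path


def dated_csv_path(directory, when=None):
    """Return <directory>/raw_data_YYYYMMDD.csv, creating the directory if needed."""
    when = when or dt.datetime.now()
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"raw_data_{when:%Y%m%d}.csv"


# Illustration in a temporary folder, with a fixed example date.
with tempfile.TemporaryDirectory() as tmp:
    path = dated_csv_path(Path(tmp) / 'data', dt.datetime(2022, 3, 4))
    file_name = path.name
    folder_exists = path.parent.is_dir()

print(file_name, folder_exists)
```

The returned path can be passed straight to `raw_data.to_csv(...)`.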
