# Scraping

In this section we have developed a web scraping tool to obtain a dataset of houses information. We've described step by step the process to build the dataset.

## Import libraries

In [5]:
# Import libscrap library containing all functions created
import libscrap
from importlib import reload
reload(libscrap)

# Import external libraries to compute the dataset
from bs4 import BeautifulSoup
import requests
import re
import time
import pandas as pd

## Get urls of houses

First of all we are collecting urls of houses from the web using the function developed collect_houses_urls that returns a list of urls from the pages indicated. We have divided in short steps and with time sleeps to avoid web blocking. At this point it has not been necessary to use the threads because it is fast enough. We store the results in different variables to preserve downloaded results in case of blocking.

In [281]:
urls_houses = []
# 25 results per page --> n_pages = 400 to retrive 10k results
urls_houses0 = libscrap.collect_houses_urls(n_pages=(1, 100)) 
time.sleep(20)
urls_houses1 = libscrap.collect_houses_urls(n_pages=(101, 200))
time.sleep(20)
urls_houses2 = libscrap.collect_houses_urls(n_pages=(201, 300))
time.sleep(20)
urls_houses3 = libscrap.collect_houses_urls(n_pages=(301, 400))
time.sleep(20)
urls_houses4 = libscrap.collect_houses_urls(n_pages=(401, 500))
time.sleep(20)
urls_houses5 = libscrap.collect_houses_urls(n_pages=(501, 600))
time.sleep(20)
urls_houses6 = libscrap.collect_houses_urls(n_pages=(601, 700))
time.sleep(20)
urls_houses7 = libscrap.collect_houses_urls(n_pages=(701, 800))
time.sleep(20)
urls_houses8 = libscrap.collect_houses_urls(n_pages=(801, 900))
time.sleep(20)
urls_houses9 = libscrap.collect_houses_urls(n_pages=(901, 1000))

Done:  (1, 100)
Done:  (101, 200)
Done:  (201, 300)
Done:  (301, 400)
Done:  (401, 500)
Done:  (501, 600)
Done:  (601, 700)
Done:  (701, 800)
Done:  (801, 900)
Done:  (901, 1000)


Here we add all the urls in the different list variables to a single one and clean the memory.

In [291]:
urls_houses.extend(urls_houses0)
urls_houses.extend(urls_houses1)
urls_houses.extend(urls_houses2)
urls_houses.extend(urls_houses3)
urls_houses.extend(urls_houses4)
urls_houses.extend(urls_houses5)
urls_houses.extend(urls_houses6)
urls_houses.extend(urls_houses7)
urls_houses.extend(urls_houses8)
urls_houses.extend(urls_houses9)

In [None]:
del urls_houses0, urls_houses1, urls_houses2, urls_houses3, urls_houses4, urls_houses5, urls_houses6, urls_houses7, urls_houses8, urls_houses9

Now, we want to write the urls into a csv file to keep them save from the next step. Also we are reading the data and testing that everything went ok:

In [294]:
# Write the urls into a pandas dataframe
df = pd.DataFrame(urls_houses)
df.to_csv("url_list.txt", sep='\t', encoding='utf-8', index=False, index_label=False)

In [295]:
# Read the urls from file and convert the pandas dataframe into a list
read = pd.read_csv("url_list.txt", sep='\t', encoding='utf-8', )
urls_houses = read.loc[:, "0"].tolist()

In [301]:
# Print the number of urls and the number of non-repeated urls
print("Number of results:", len(urls_houses), "| Non repeated results:", len(set(urls_houses)))
# Display some values of the list
display(urls_houses[0:5])
# Create a list variable with non-repeated urls
set_urls_houses = set(urls_houses) # List with all non-repeated urls

Number of results: 24940 | Non repeated results: 24893


['https://www.immobiliare.it/53131931-Vendita-Bilocale-viale-Italo-Calvino-Roma.html',
 'https://www.immobiliare.it/69489650-Vendita-Quadrilocale-via-Alessandro-Fleming-Roma.html',
 'https://www.immobiliare.it/69192522-Vendita-Quadrilocale-via-Aosta-45-Roma.html',
 'https://www.immobiliare.it/68192795-Vendita-Attico-Mansarda-largo-Arturo-Donaggio-Roma.html',
 'https://www.immobiliare.it/69727066-Vendita-Villa-via-Cristoforo-Sabbadino-88-Roma.html']

## Get information about houses

Once obtained the urls, we are going to obtain the information of each house scraping the web. For this goal we are using collect_information_houses that using threading obtain the information of the houses indicated in the range. In order to avoid web blocking we are making 30 resquests every 5 seconds, when it finish if there is any thread still alive, we wait 5 seconds more to avoid failed urls.

In [302]:
# Variable to manage the threads each time
increment = 30
start = 0
last = increment
n_iter = len(urls_houses)//increment
print(n_iter)

# Loop for generating new threads to obtain house information
for i in range(n_iter):
    list_houses_info, errors, thread_list, failed_urls = libscrap.collect_information_houses(urls_houses[start:last])
    print("Failed urls", len(failed_urls), "| Houses completed", len(list_houses_info), "|| ")
    start = last + 1
    last += increment
    time.sleep(5)
    
    # Loop into threads status and look if there is any still alive
    for t in thread_list:
        if t.isAlive():
            print("-", end = "")
            time.sleep(5) # If it is alive, wait 5 seconds more

831
Failed urls 0  | Houses completed 0 Failed urls 0  | Houses completed 30 Failed urls 0  | Houses completed 59 Failed urls 0  | Houses completed 88 Failed urls 0  | Houses completed 117 Failed urls 0  | Houses completed 146 Failed urls 0  | Houses completed 175 Failed urls 0  | Houses completed 204 Failed urls 0  | Houses completed 233 Failed urls 0  | Houses completed 262 Failed urls 0  | Houses completed 291 Failed urls 0  | Houses completed 320 Failed urls 0  | Houses completed 349 Failed urls 0  | Houses completed 378 Failed urls 0  | Houses completed 407 -
-
-
-
Failed urls 18  | Houses completed 418 -
-
-
-
Failed urls 46  | Houses completed 419 -
-
-
-
Failed urls 73  | Houses completed 421 Failed urls 73  | Houses completed 450 -
Failed urls 73  | Houses completed 478 Failed urls 73  | Houses completed 507 Failed urls 73  | Houses completed 536 Failed urls 74  | Houses completed 565 Failed urls 74  | Houses completed 594 Failed urls 74  | Houses completed 623 Failed urls 74 

Failed urls 222  | Houses completed 11234 -
-
Failed urls 222  | Houses completed 11263 Failed urls 222  | Houses completed 11292 Failed urls 222  | Houses completed 11321 Failed urls 222  | Houses completed 11350 Failed urls 222  | Houses completed 11379 Failed urls 222  | Houses completed 11408 -
Failed urls 222  | Houses completed 11437 -
Failed urls 222  | Houses completed 11466 -
Failed urls 222  | Houses completed 11495 -
Failed urls 222  | Houses completed 11524 Failed urls 222  | Houses completed 11553 Failed urls 222  | Houses completed 11582 -
Failed urls 222  | Houses completed 11611 -
-
-
-
Failed urls 233  | Houses completed 11629 Failed urls 233  | Houses completed 11658 Failed urls 233  | Houses completed 11687 -
Failed urls 233  | Houses completed 11716 Failed urls 233  | Houses completed 11745 Failed urls 233  | Houses completed 11774 Failed urls 233  | Houses completed 11803 -
Failed urls 233  | Houses completed 11832 Failed urls 233  | Houses completed 11861 Failed u

Failed urls 527  | Houses completed 21891 Failed urls 527  | Houses completed 21920 Failed urls 527  | Houses completed 21949 Failed urls 527  | Houses completed 21978 Failed urls 527  | Houses completed 22007 Failed urls 527  | Houses completed 22036 Failed urls 527  | Houses completed 22065 Failed urls 527  | Houses completed 22094 Failed urls 527  | Houses completed 22123 -
Failed urls 527  | Houses completed 22152 Failed urls 527  | Houses completed 22181 Failed urls 527  | Houses completed 22210 Failed urls 527  | Houses completed 22239 Failed urls 527  | Houses completed 22268 Failed urls 527  | Houses completed 22297 Failed urls 527  | Houses completed 22326 Failed urls 527  | Houses completed 22355 Failed urls 527  | Houses completed 22384 Failed urls 527  | Houses completed 22413 Failed urls 527  | Houses completed 22442 Failed urls 527  | Houses completed 22471 Failed urls 527  | Houses completed 22500 Failed urls 527  | Houses completed 22529 Failed urls 527  | Houses comple

At this step we are creating a pandas dataframe with all the data obtained before

In [322]:
df = pd.DataFrame(list_houses_info, columns=['price', 'locali', 'superficie', 'bagni', 'piano', 'description'])

Now, we want to write pandas data frame into a csv file to keep them save from the next step. Also we are reading the data and and writing another file with no duplicates:

In [319]:
# Write pandas dataframe into a csv file
df.to_csv("data_houses_df_larger.csv", sep='\t', encoding='utf-8', index=False)
# Read pandas dataframe from file
read = pd.read_csv("data_houses_df_larger.csv", sep='\t', encoding='utf-8')

In [321]:
# Drop duplicates
read2 = read.drop_duplicates()
# Write pandas dataframe with no duplicates into a csv file
read2.to_csv("data_houses_df_no_duplicates.csv", sep='\t', encoding='utf-8', index=False)
# Read pandas dataframe from csv file with no duplicates
read2 = pd.read_csv("data_houses_df_no_duplicates.csv", sep='\t', encoding='utf-8')

In [12]:
# Show some results of the final pandas dataframe obtained from scraping
display(read2[:5])

Unnamed: 0,price,locali,superficie,bagni,piano,description
0,€ 20.000,,19,,,OSTIA - Castel Fusano– BOX - Via Giuseppe Rena...
1,€ 225.000,2,50,1,1,papillo eur\r\r\n ...
2,€ 189.000,4,168,3+,T,COMPLESSO ISLA BIANCA\r\r\n ...
3,€ 450.000,4,135,1,A,VENDITA QUADRILOCALE ULTIMO PIANO RE DI ROMA\r...
4,€ 1.700.000,5+,460,3+,T,Prestigiosa proprietà su due livelli in Via de...
