# Web scraping for houses dataset
In this notebook, we will create a dataset of houses listed on [Funda](https://www.funda.nl/) (a Dutch real-estate website). To do this, we program a web bot that retrieves all the information for us, using a combination of [Selenium](https://selenium-python.readthedocs.io/) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

This notebook is part of my House Price series in which we create a dataset, train a prediction model, and deploy the model and an accompanying web app.

© Bryan Lusse - 2022

## Imports

In [1]:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from tqdm import tqdm

## Getting data

First, we need to give our bot access to the internet. For this we use Selenium, a Python package that automates web browser interaction.

First, we start our webdriver:

In [2]:
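from selenium.webdriver.chrome.service import Service  # Selenium 4 takes the driver path via a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))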


A browser window should now have opened, showing an empty page. This is the browser that our bot will use.

Next, we do some initialization. We define one empty list per feature, so that the scraped information can be appended to it.

In [3]:
adresses = [] 
cities = []
prices = [] 
living_sizes = []
lot_sizes = []
build_years = []
build_types = []
house_types = []
roofs = []
rooms = []
toilets = []
floors = []
energylabels = []
positions = []
gardens = []
neighbourhood_prices = []

true_features = ['Soort woonhuis', 'Soort bouw', 'Bouwjaar', 'Soort dak', 'Aantal kamers', 'Aantal badkamers', 'Aantal woonlagen', 'Energielabel', 'Ligging', 'Tuin']
arrays = [house_types, build_types, build_years, roofs, rooms, toilets, floors, energylabels, positions, gardens]
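
The two lists above are matched by position: the i-th label in `true_features` belongs to the i-th list in `arrays`. Below is a minimal sketch of how that pairing can be used to store one value per listing, with a placeholder when a label is missing. The values are made up, the helper name `store_features` is hypothetical, and `math.nan` stands in for the notebook's `np.nan` so the snippet runs without NumPy:

```python
import math

def store_features(labels, targets, scraped):
    """Append one value per label to its matching list, NaN when missing.

    `scraped` maps a Dutch feature label to the text found on the page.
    """
    for label, target in zip(labels, targets):
        target.append(scraped.get(label, math.nan))

# Two example targets instead of the notebook's ten
build_years_demo, roofs_demo = [], []
labels_demo = ['Bouwjaar', 'Soort dak']
targets_demo = [build_years_demo, roofs_demo]

# The first listing has both features, the second lacks a roof type
store_features(labels_demo, targets_demo, {'Bouwjaar': '2000', 'Soort dak': 'Plat dak'})
store_features(labels_demo, targets_demo, {'Bouwjaar': '1910'})

print(build_years_demo)
print(roofs_demo)
```

Because every list receives exactly one entry per listing, all lists keep the same length, which is what the DataFrame construction later relies on.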

Now the main part starts: retrieving the data. Our bot can be told to go to a certain URL and retrieve information, but what it retrieves is _all_ the HTML on the webpage. To make sense of this, we use BeautifulSoup, which can quickly search HTML and XML data.

To tell BeautifulSoup which data to find, we need to specify the type of element, the name of the element's class, or other element attributes. Finding the correct class names and getting the data out in the form you want takes a lot of searching and trial and error.

Funda shows search results like the following, with the address, price, size, and number of rooms:

![img](assets/funda.png)

The HTML location of this information can be found by inspecting the page. We can see, for example, that the price is stored inside a `span` element with the class `search-result-price`.

![img](assets/funda_price.png)

In BeautifulSoup, this can be retrieved using:

```python
price = soup.find('span', attrs={'class':'search-result-price'})
```

For more information about class names and search strategies in BeautifulSoup, I suggest reading an in-depth web scraping tutorial like [this one](https://www.dataquest.io/blog/web-scraping-beautifulsoup/).
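
To see what `find` returns, here is a small self-contained sketch that parses a hand-written HTML fragment instead of a live Funda page. The markup is invented for illustration and only mimics the class name mentioned above:

```python
from bs4 import BeautifulSoup

# Hand-written fragment; real Funda markup is larger and may differ
html = """
<div class="search-result-main">
  <span class="search-result-price">€ 375.000 k.k.</span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# find returns the first matching element, or None when nothing matches
price = soup.find('span', attrs={'class': 'search-result-price'})
missing = soup.find('span', attrs={'class': 'no-such-class'})

# Strip the ' k.k.' suffix, as the scraping loop in this notebook does
price_text = price.text.split(' k.k.')[0]
print(price_text)
print(missing)
```

Since a failed search yields `None`, calling `.text` on the result raises an error unless the element is checked first.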

In our case, retrieving the data about the houses is a tad harder, because the search results do not show all the data on each house. More data is given on the page of each individual listing. Our approach is therefore to retrieve the URL of each listing, navigate to it, and repeat the same extraction process there.

See if you can understand this part of the code too :-)

> IMPORTANT: Before running this code (or code like it that you have written yourself), make sure that you navigate to your URL in the pop up browser window first. Sometimes here you need to do a captcha or other 'prove you are not a bot' test. The bot itself cannot move past these screens.

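One way to build that manual step into the script is to open the page and then wait for a key press before scraping starts. This is only a sketch: the helper name `open_and_wait` and its `prompt` parameter are hypothetical, and a stand-in driver is used here so the snippet runs without a browser:

```python
def open_and_wait(driver, url, prompt=input):
    """Load `url`, then block until the user confirms the page is ready."""
    driver.get(url)
    prompt('Solve any captcha in the browser window, then press Enter... ')


class FakeDriver:
    """Stand-in for a Selenium driver that only records visited URLs."""

    def __init__(self):
        self.visited = []

    def get(self, url):
        self.visited.append(url)


# Demo without a browser or keyboard: the prompt is replaced by a no-op
demo_driver = FakeDriver()
open_and_wait(demo_driver, 'https://www.funda.nl/koop/heel-nederland/', prompt=lambda message: None)
print(demo_driver.visited)
```

With the real Selenium `driver` and the default `prompt`, the call would pause until Enter is pressed in the notebook.
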
In [4]:
url = "https://www.funda.nl/koop/heel-nederland/" # The URL with the data you want to extract
for j in tqdm(range(1, 501)): # Pages 1 through 500 of the search results
    if j==1:
        driver.get(url)
    else:
        driver.get(url+'p'+str(j)+'/') # Add string defining the page of the search results

    content = driver.page_source
    soup = BeautifulSoup(content, 'html.parser') # Name the parser explicitly to avoid a bs4 warning
    for a in soup.find_all('div', href=False, attrs={'class':'search-result-main'}): # Loop over all individual house results
        # Find initial data
        address = a.find('h2', attrs={'class':'search-result__header-title fd-m-none'})
        city = a.find('h4', attrs={'class':'search-result__header-subtitle fd-m-none'})
        price = a.find('span', attrs={'class':'search-result-price'})
        living_size = a.find('span', attrs={'title':'Gebruiksoppervlakte wonen'})
        lot_size = a.find('span', attrs={'title':'Perceeloppervlakte'})
        
        # Find URL of listing and extract HTML data
        listing_url = a.find('a', href=True, attrs={'data-object-url-tracking':'resultlist'})
        driver.get(listing_url['href'])
        content_full = driver.page_source
        soup_full = BeautifulSoup(content_full, 'html.parser')
        
        # Find list of house characteristics
        characteristics = soup_full.findAll('dl', attrs={'class':'object-kenmerken-list'})

        feature_names = []
        features = []

        # Skip two of the characteristic lists: index 2, then index 4 of what remains (originally index 5)
        loop_range = np.arange(len(characteristics))
        loop_range = loop_range[np.arange(len(loop_range))!=2]
        loop_range = loop_range[np.arange(len(loop_range))!=4]
        
        # For each characteristic, retrieve the corresponding value
        for i in loop_range:
            for feature in characteristics[i].findAll('dt'):
                feature_names.append(list(filter(None, feature.text.split('\n')))[0])

            for feature in characteristics[i].findAll('dd'):
                try:
                    features.append(list(filter(None, feature.text.split('\n')))[0])
                except IndexError: # dd element without any text
                    pass
        
        # Add data to corresponding predefined list        
        for name, array in zip(true_features, arrays):
            if name in feature_names:
                array.append(features[feature_names.index(name)])
            else:
                array.append(np.nan)
                
        
        neighbourhood_price_per_m2 = soup_full.find('div', attrs={'class':'text-right md:text-left md:w-[64.81481%] md:pl-[3.4037%] text-dark-1'})
        
        # Append initial data to lists
        adresses.append(' '.join(address.text.split('\n')[1].split(' ')[14:]))
        prices.append(price.text.split(' k.k.')[0])
        cities.append(' '.join(city.text.split('\n')[1].split(' ')[14:]))
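        # find returns None when an element is absent (e.g. no lot size), so fall back to NaN
        living_sizes.append(living_size.text if living_size is not None else np.nan)
        lot_sizes.append(lot_size.text if lot_size is not None else np.nan)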
        if neighbourhood_price_per_m2 is None:
            neighbourhood_prices.append(np.nan)
        else:
            neighbourhood_prices.append(' '.join(neighbourhood_price_per_m2.text.split('\n')[1].split(' ')[12:]))
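
The characteristics on a listing page are stored as `dt` (label) and `dd` (value) elements inside `dl` lists. As an alternative to looking values up by list index, the pairs can be collected into a dictionary. The sketch below is an illustration under assumptions: the HTML is hand-written, the helpers `clean` and `parse_characteristics` are hypothetical, and it assumes every `dt` is followed by exactly one `dd`, which may not hold for every block on the real page:

```python
from bs4 import BeautifulSoup

def clean(text):
    """Collapse newlines and repeated spaces into single spaces."""
    return ' '.join(text.split())

def parse_characteristics(dl):
    """Pair each dt label with the dd value at the same position."""
    labels = [clean(dt.text) for dt in dl.find_all('dt')]
    values = [clean(dd.text) for dd in dl.find_all('dd')]
    return dict(zip(labels, values))

# Hand-written fragment that mimics the class name used in the loop above
html = """
<dl class="object-kenmerken-list">
  <dt>Bouwjaar</dt>
  <dd>
      2000
  </dd>
  <dt>Soort dak</dt>
  <dd>Zadeldak   bedekt met pannen</dd>
</dl>
"""

soup = BeautifulSoup(html, 'html.parser')
dl = soup.find('dl', attrs={'class': 'object-kenmerken-list'})
found = parse_characteristics(dl)

print(found.get('Bouwjaar'))
print(found.get('Tuin'))  # label that is not on the page
```

A whitespace-collapsing helper like `clean` could also stand in for the slice-based cleanup in the loop, which depends on a fixed number of leading spaces in the page source.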

After this has run, we can combine the lists into a DataFrame. Shown below is a partially retrieved dataset of about 1,000 entries. I retrieved the data in chunks of 50-100 pages, as extracting 500 pages in one go took a long time.

In [5]:
df = pd.DataFrame({'Address': adresses,
                   'City': cities,
                   'Price': prices,
                   'Lot size (m2)': lot_sizes,
                   'Living space size (m2)': living_sizes,
                   'Build year': build_years,
                   'Build type': build_types,
                   'House type': house_types,
                   'Roof': roofs,
                   'Rooms': rooms,
                   'Toilet': toilets,
                   'Floors': floors,
                   'Energy label': energylabels,
                   'Position': positions,
                   'Garden': gardens,
                   'Estimated neighbourhood price per m2': neighbourhood_prices}) 
df.head()

| | Address | City | Price | Lot size (m2) | Living space size (m2) | Build year | Build type | House type | Roof | Rooms | Toilet | Floors | Energy label | Position | Garden | Estimated neighbourhood price per m2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Uiterwaard 4 | Lopik | € 375.000 | 126 m² | 116 m² | 2000 | Bestaande bouw | Eengezinswoning, tussenwoning | Zadeldak bedekt met pannen | 4 kamers (3 slaapkamers) | 1 badkamer en 1 apart toilet | 3 woonlagen | A | Aan rustige weg en in woonwijk | Achtertuin en voortuin | 5.53 |
| 1 | Gedempte Schalk Burgergracht 73 | Haarlem | € 450.000 | 129 m² | 107 m² | 1910 | Bestaande bouw | Eengezinswoning, tussenwoning | Plat dak bedekt met bitumineuze dakbedekking | 4 kamers (3 slaapkamers) | 1 badkamer en 1 apart toilet | 2 woonlagen | F | Aan rustige weg, in centrum en in woonwijk | Achtertuin en voortuin | 2.615 |
| 2 | Waadse Poldergracht 49 | Muiden | € 1.050.000 | 180 m² | 207 m² | 2019 | Bestaande bouw | Herenhuis, tussenwoning | Plat dak | 6 kamers (5 slaapkamers) | 1 badkamer en 1 apart toilet | 3 woonlagen | A | Aan rustige weg, aan vaarwater, aan water en i... | Achtertuin | 1.455 |
| 3 | Streepvaren 4 | Bergschenhoek | € 525.000 | 202 m² | 127 m² | 1996 | Bestaande bouw | Bungalow, tussenwoning (semi-bungalow) | Plat dak bedekt met bitumineuze dakbedekking | 5 kamers (4 slaapkamers) | 1 badkamer en 1 apart toilet | 2 woonlagen | A | In woonwijk en vrij uitzicht | Achtertuin en voortuin | 2.25 |
| 4 | F.A. Molijnlaan 27 | Nunspeet | € 695.000 | 582 m² | 148 m² | 1930 | Bestaande bouw | Eengezinswoning, vrijstaande woning | Tentdak bedekt met pannen | 8 kamers (4 slaapkamers) | 1 badkamer en 1 apart toilet | 2 woonlagen en een zolder | E | Beschutte ligging en in bosrijke omgeving | Achtertuin, tuin rondom, voortuin en zijtuin | 9.095 |


In [6]:
print('Dataset consists of information on ' + str(len(df)) + ' houses')

Dataset consists of information on 1070 houses


## Saving data
Finally, we save the data. In this case, this was the data from pages 401-500.

In [7]:
df.to_csv('house_data_p401-p500.csv', index=False, encoding='utf-8')
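
Because the scrape was run in chunks, it can help to generate the page ranges and file names up front. A minimal sketch; the helper `plan_chunks` is hypothetical, and the file-name pattern simply follows the example above:

```python
def plan_chunks(first_page, last_page, chunk_size):
    """Yield (start, stop, filename) for consecutive page ranges, both ends inclusive."""
    for start in range(first_page, last_page + 1, chunk_size):
        stop = min(start + chunk_size - 1, last_page)
        yield start, stop, f'house_data_p{start}-p{stop}.csv'

# Plan 500 pages in chunks of 100 pages each
chunks = list(plan_chunks(1, 500, 100))
for start, stop, filename in chunks:
    print(start, stop, filename)
```

Each chunk could then be scraped with `range(start, stop + 1)` and saved under its own file name.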