# Scraping Huizenzoeker.nl to Analyse the Dutch Housing Market

### Introduction
Which places in the Netherlands are hit hardest by the Dutch housing crisis, and which the least?
Currently, the housing crisis is one of the most prominent societal challenges in the Netherlands. This script scrapes information on the Dutch housing market from Huizenzoeker.nl, enabling us to analyse the market and clarify which areas are hit hardest by the crisis. For each area it collects the average asking price (gem. vraagprijs), the number of homes sold (# verkochte woningen), the average price per square metre (gem. vierkante meter prijs), and the percentage overbid (% overboden). The resulting dataframe is useful, for example, to first-time buyers who are having a hard time purchasing their first home in the current strained Dutch housing market.

The script is divided into seven steps:
* **Step 1: Loading all the basics**: this step loads all the relevant packages and sets up the BeautifulSoup basis.
* **Step 2: Collecting the municipality URLs**: this step collects the URLs of the municipalities in the Netherlands. We first create a list of the province URLs (twelve in total, one for each province in the Netherlands). From these twelve province pages we then scrape the municipality URLs, since each province page links to its municipalities.
* **Step 3: Scraping data from URLs (municipality-level)**: this step scrapes the data from the municipality URLs collected in step 2.
* **Step 4: Exporting dataframe as CSV file**: this step exports the final dataframe as a CSV file.
* **Step 5: Providing summary statistics**: this step provides summary statistics for the final dataframe. The CSV file exported in step 4 is first cleaned in R, and the cleaned CSV file is then loaded back into this notebook.
* **Step 6: Scraping data from woningmarkt dashboard (province-level)**: this step scrapes the data for each province, using the same code as for the municipality-level data.
* **Step 7: Exporting dashboard data as CSV file**: this step exports the province data as a CSV file.

## Step 1: Loading all the basics

In [1]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd 
import time 
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import json

In [2]:
url = 'https://www.huizenzoeker.nl/woningmarkt/'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')

## Step 2: Collecting the municipality URLs

We first define a base URL (`base_url`) and a list of province paths (`province_url`). Appending each province path to the base URL gives the URL of that province's woningmarkt page. The `generate_links()` function performs this concatenation.

In [3]:
base_url = 'https://www.huizenzoeker.nl/woningmarkt/'
province_url = ['noord-holland/', 'zuid-holland/', 'zeeland/', 'noord-brabant/', 'utrecht/', 'flevoland/', 
                'friesland/', 'groningen/', 'drenthe/', 'overijssel/', 'gelderland/', 'limburg/']

In [4]:
def generate_links(base_url,province_url): 
    page_links = []
    for i in province_url:
        full_links = base_url + i
        page_links.append(full_links)  
    return page_links

page_links = generate_links(base_url,province_url)

We then use this list of province URLs to extract the municipality URLs from each province page, loading every page with Selenium in the most recently opened browser window.

In [5]:
driver = webdriver.Chrome(ChromeDriverManager().install())



Current google-chrome version is 94.0.4606
Get LATEST driver version for 94.0.4606
There is no [win32] chromedriver for browser 94.0.4606 in cache
Get LATEST driver version for 94.0.4606
Trying to download new driver from https://chromedriver.storage.googleapis.com/94.0.4606.61/chromedriver_win32.zip
Driver has been saved in cache [C:\Users\danie\.wdm\drivers\chromedriver\win32\94.0.4606.61]


In [21]:
page_urls_full = []

for link in page_links:
    driver.switch_to.window(driver.window_handles[-1])
    driver.get(link)
    # time.sleep(2)
    
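    # find_elements("xpath", ...) works in Selenium 3 and 4;
    # find_elements_by_xpath was removed in Selenium 4.3
    for elem in driver.find_elements("xpath", "//li//div//a[@href]"):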
        urls = elem.get_attribute('href')
        page_urls_full.append(urls)
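
The XPath above matches every link nested in a list item, so the collected list may contain links that are not municipality pages, or the same link more than once. The sketch below is one possible way to filter the list afterwards. It assumes that municipality URLs always have the shape seen in the output (`/woningmarkt/<province>/<municipality>/`); `municipality_urls` is a hypothetical helper, not part of the original notebook, and the last sample link is made up for illustration.

```python
from urllib.parse import urlparse

def municipality_urls(hrefs, section="woningmarkt"):
    """Keep unique URLs shaped like /woningmarkt/<province>/<municipality>/."""
    kept, seen = [], set()
    for href in hrefs:
        if not href:
            continue  # skip missing or empty href values
        parts = [p for p in urlparse(href).path.split("/") if p]
        # Expect exactly three path segments: section, province, municipality.
        if len(parts) == 3 and parts[0] == section and href not in seen:
            seen.add(href)
            kept.append(href)
    return kept

sample = [
    "https://www.huizenzoeker.nl/woningmarkt/noord-holland/aalsmeer/",
    "https://www.huizenzoeker.nl/woningmarkt/noord-holland/",           # province page
    "https://www.huizenzoeker.nl/woningmarkt/noord-holland/aalsmeer/",  # duplicate
    "https://www.huizenzoeker.nl/contact/",                             # hypothetical unrelated link
]
print(municipality_urls(sample))
```

Applied to the scraped list, this would be `page_urls_full = municipality_urls(page_urls_full)`.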

In [8]:
subset = page_urls_full[30:33] # small subset for trying out the scraper on a few URLs (to save time)

['https://www.huizenzoeker.nl/woningmarkt/noord-holland/medemblik/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/oostzaan/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/opmeer/']

In [23]:
page_urls_full #now run the code with all links instead of the subset to export it to R 

['https://www.huizenzoeker.nl/woningmarkt/noord-holland/aalsmeer/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/alkmaar/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/amstelveen/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/amsterdam/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/beemster/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/bergen-nh/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/beverwijk/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/blaricum/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/bloemendaal/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/castricum/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/den-helder/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/diemen/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/drechterland/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/edam-volendam/',
 'https://www.huizenzoeker.nl/w

## Step 3: Scraping data from each URL (municipality-level)

For each municipality we extract:
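* *trend data*: average asking price (gem. vraagprijs), homes sold (verkochte woningen), average price per square metre (gem. vierkante meter prijs), and percentage overbid (% overboden), plus how each figure has changed compared with the previous month (t.o.v. vorige maand)
* *other information*: disposable income per household (besteedbaar inkomen), number of inhabitants (aantal inwoners), and population growth or decline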
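
The raw text of these values carries layout residue, such as surrounding newlines and the phrase "t.o.v. vorige maand", which the scraping function below removes with repeated `replace` calls. A minimal sketch of the same cleaning as two small helpers follows. The helper names are hypothetical, and the example input strings are assumptions inferred from the `replace` calls in the scraper, not captured from the site.

```python
def clean_change(raw):
    """Strip the layout text around a month-on-month change, e.g. '57.47%'."""
    text = raw.replace("t.o.v. vorige maand", "")
    return text.strip()

def clean_amount(raw):
    """Swap the '.' thousands separator for ',' as the scraper does."""
    return raw.strip().replace(".", ",")

# Assumed raw strings, for illustration only.
print(clean_change("\n\n57.47% t.o.v. vorige maand\n"))
print(clean_amount("€ 685.000"))
```

Using helpers like these would let each of the four trend blocks share one cleaning step instead of repeating it in every branch.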

#### Warning: Running the next cell for `page_urls_full` takes approx. 30 minutes. You might want to replace `page_urls_full` with `subset`!

In [24]:
fn = 'saved_data.json'

def extract_city_trends(page_urls_full):
    trend_list = []
    for page_url in page_urls_full:
        driver.get(page_url)
        time.sleep(5) 
        soup = BeautifulSoup(driver.page_source, 'html.parser')
            #Province name
        province_name = soup.find_all('a')[6].get_text()
            # City name
        city_name = soup.find_all('h2')[0].get_text()
        city_name = city_name.replace('Woningmarkt','')
        city_name = city_name.replace(' ', '')
            # Gemiddelde vraagprijs
        content = soup.find_all(class_='trend-graph')[0]
        if content.find(class_="trend-graph-icon") == None:
            gem_vraagprijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_vraagprijs = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                gem_vraagprijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
                gem_vraagprijs = gem_vraagprijs.replace("(","")
                gem_vraagprijs = gem_vraagprijs.replace(",)","")
                gem_vraagprijs = gem_vraagprijs.replace(".", ",")
                tov_vorige_maand_vraagprijs = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace("\n\n","")
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace(" t.o.v. vorige maand\n","")
            else:
                gem_vraagprijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
                gem_vraagprijs = gem_vraagprijs.replace("(","")
                gem_vraagprijs = gem_vraagprijs.replace(",)","")    
                gem_vraagprijs = gem_vraagprijs.replace(".", ",")
                tov_vorige_maand_vraagprijs = content.find("div",{"class":"trend-graph-pill"}).get_text()
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace("\n\n","")
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace(" t.o.v. vorige maand\n","")
            # Aantal verkochte woningen
        content = soup.find_all(class_='trend-graph')[1]
        if content.find(class_="trend-graph-icon") == None:
            verk_woningen = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_verkocht = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                verk_woningen = content.find("h3",{"class":"trend-graph-value"}).get_text()               
                tov_vorige_maand_verkocht = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace("\n\n","")
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace(" t.o.v. vorige maand\n","")
            else:
                verk_woningen = content.find("h3",{"class":"trend-graph-value"}).get_text()             
                tov_vorige_maand_verkocht = content.find("div",{"class":"trend-graph-pill"}).get_text()
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace("\n\n","")
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace(" t.o.v. vorige maand\n","")
            # Gemiddelde vierkante meter prijs
        content = soup.find_all(class_='trend-graph')[2]
        if content.find(class_="trend-graph-icon") == None:
            m2_prijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_m2_prijs = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                m2_prijs = content.find("h3",{"class":"trend-graph-value"}).get_text()     
                m2_prijs = m2_prijs.replace(".", ",")
                tov_vorige_maand_m2_prijs = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace("\n\n","")
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace(" t.o.v. vorige maand\n","")
            else:
                m2_prijs = content.find("h3",{"class":"trend-graph-value"}).get_text() 
                m2_prijs = m2_prijs.replace(".", ",")
                tov_vorige_maand_m2_prijs = content.find("div",{"class":"trend-graph-pill"}).get_text() 
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace("\n\n","")
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace(" t.o.v. vorige maand\n","")
            # Percentage overboden
        content = soup.find_all(class_='trend-graph')[3]
        if content.find(class_="trend-graph-icon") == None:
            perc_overboden = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_perc_overboden = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                perc_overboden = content.find("h3",{"class":"trend-graph-value"}).get_text()               
                tov_vorige_maand_perc_overboden = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace("\n\n","")
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace(" t.o.v. vorige maand\n","")
            else:
                perc_overboden = content.find("h3",{"class":"trend-graph-value"}).get_text()             
                tov_vorige_maand_perc_overboden = content.find("div",{"class":"trend-graph-pill"}).get_text()
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace("\n\n","")
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace(" t.o.v. vorige maand\n","")
            # Besteedbaar inkomen
        bes_inkomen = soup.find_all(class_='detail__income huizenzoeker-card single-value-graph-container')[0].get_text()
        bes_inkomen = bes_inkomen.replace('\n','')
        bes_inkomen = bes_inkomen.replace('Besteedbaar Inkomen Per Huishouden','')
        bes_inkomen = bes_inkomen.replace(".", ",")
            # Inwoners
        content = soup.find("div", {"class": "buurt-info"})
        inwoners = content.find_all('p')[3].get_text()  # call get_text() to obtain the paragraph text as a string
        inwoners = re.search('Dat zijn(.+?)inwoners', inwoners)
        if inwoners:
            found_inwoners = inwoners.group(1)
            found_inwoners = found_inwoners.strip()
            found_inwoners = found_inwoners.replace(".", ",")
        else:
            found_inwoners = 'NA'
            # Bevolkingsgroei
        content = soup.find("div", {"class": "buurt-info"})
        populatiegroei = content.find_all('p')[4].get_text()  # call get_text() to obtain the paragraph text as a string
        populatiegroei_increase = re.search('afgelopen jaar met (.+?) gegroeid', populatiegroei)
        if populatiegroei_increase:
            found_populatiegroei = populatiegroei_increase.group(1)
            found_populatiegroei = found_populatiegroei.strip()
        else:
            found_populatiegroei = 'NA'
        populatiegroei_decline = re.search('afgelopen jaar met (.+?) gekrompen', populatiegroei)
        if populatiegroei_decline:
            found_populatiegroei_decline = populatiegroei_decline.group(1)
            found_populatiegroei_decline = found_populatiegroei_decline.strip() 
        else:
            found_populatiegroei_decline = 'NA'
            # Append list
        save_obj = {'Province':province_name, "City":city_name, 
                    "Gem. vraagprijs":gem_vraagprijs, "%Δ Vraagprijs (t.o.v vorige maand)": tov_vorige_maand_vraagprijs,
                    "Verkochte woningen":verk_woningen, "%Δ Verkochte woningen (t.o.v vorige maand)":tov_vorige_maand_verkocht,
                    "Gem. m2 prijs":m2_prijs, "%Δ M2 prijs (t.o.v vorige maand)":tov_vorige_maand_m2_prijs,
                    "% Vraagprijs overboden":perc_overboden, "%Δ Overboden (t.o.v vorige maand)":tov_vorige_maand_perc_overboden,
                    "Besteedbaar inkomen (per huishouden)":bes_inkomen,
                    "Aantal inwoners": found_inwoners,
                    "% Populatie stijging":found_populatiegroei, "% Populatie daling":found_populatiegroei_decline}
        trend_list.append(save_obj)
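        # Append each record to the backup file (one JSON object per line),
        # so that an interrupted run does not lose the data scraped so far
        with open(fn, 'a', encoding='utf-8') as f:
            f.write(json.dumps(save_obj) + '\n')
    return trend_list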
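
Because the function appends every record to `saved_data.json` as one JSON object per line, the data scraped so far can be recovered after an interrupted run. The sketch below shows one way to read such a file back. `load_saved_records` is a hypothetical helper, and the demo writes two made-up rows to a temporary file so that no real scrape is needed.

```python
import json
import os
import tempfile

def load_saved_records(path):
    """Read a JSON Lines file (one JSON object per line) into a list of dicts."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records

# Demo on a temporary file with made-up rows.
demo_path = os.path.join(tempfile.mkdtemp(), "saved_data.json")
with open(demo_path, "a", encoding="utf-8") as f:
    f.write(json.dumps({"City": "Aalsmeer"}) + "\n")
    f.write(json.dumps({"City": "Alkmaar"}) + "\n")

records = load_saved_records(demo_path)
done = {r["City"] for r in records}  # cities that were already scraped
print(len(records), sorted(done))
```

In the notebook, `pd.DataFrame(load_saved_records(fn))` would rebuild the dataframe from the backup file. Note that the file is opened in append mode, so rows from earlier runs remain in it unless the file is deleted first.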

In [25]:
df = extract_city_trends(page_urls_full) 
pd.DataFrame(df)

Unnamed: 0,Province,City,Gem. vraagprijs,%Δ Vraagprijs (t.o.v vorige maand),Verkochte woningen,%Δ Verkochte woningen (t.o.v vorige maand),Gem. m2 prijs,%Δ M2 prijs (t.o.v vorige maand),% Vraagprijs overboden,%Δ Overboden (t.o.v vorige maand),Besteedbaar inkomen (per huishouden),Aantal inwoners,% Populatie stijging,% Populatie daling
0,Noord-Holland,Aalsmeer,"€ 685,000",57.47%,12,-7.69%,"€ 4,476",9.22%,10.67%,3.42%,"€ 45,800",31859,0.41%,
1,Noord-Holland,Alkmaar,"€ 362,500",25.00%,38,-39.68%,"€ 3,926",10.62%,14.05%,1.43%,"€ 36,300",109436,0.81%,
2,Noord-Holland,Amstelveen,"€ 570,000",18.13%,21,-56.25%,"€ 4,724",1.88%,305.01%,296.30%,"€ 37,800",91675,0.92%,
3,Noord-Holland,Amsterdam,"€ 450,000",7.78%,230,-27.44%,"€ 6,961",5.90%,16.10%,0.37%,"€ 30,100",872757,1.13%,
4,Noord-Holland,Beemster,"€ 612,000",-12.26%,4,-33.33%,"€ 4,311",-6.89%,-0.23%,-12.10%,"€ 47,300",10022,2.81%,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
347,Limburg,ValkenburgaandeGeul,"€ 365,000",-5.19%,4,-55.56%,"€ 3,308",9.75%,12.60%,5.41%,"€ 35,600",16367,,0.63%
348,Limburg,Venlo,"€ 319,000",16.00%,32,-50.00%,"€ 2,727",17.80%,7.84%,-0.72%,"€ 33,700",101802,0.20%,
349,Limburg,Venray,"€ 297,000",8.99%,8,-33.33%,"€ 2,729",23.09%,8.98%,1.22%,"€ 39,100",43614,0.66%,
350,Limburg,Voerendaal,"€ 287,500",-11.13%,2,-75.00%,"€ 2,185",-12.53%,7.69%,-2.09%,"€ 40,800",12475,0.18%,


In [27]:
# df = extract_city_trends(page_urls_full) 
# pd.DataFrame(df)

Unnamed: 0,City,Gem. vraagprijs,%Δ Vraagprijs (t.o.v vorige maand),Verkochte woningen,%Δ Verkochte woningen (t.o.v vorige maand),Gem. m2 prijs,%Δ M2 prijs (t.o.v vorige maand),% Vraagprijs overboden,%Δ Overboden (t.o.v vorige maand),Besteedbaar inkomen (per huishouden),Aantal inwoners,% Populatie stijging,% Populatie daling
0,Aalsmeer,"€ 685,000",57.47%,12,-7.69%,"€ 4,476",9.22%,10.67%,3.42%,"€ 45,800",31859,0.41%,
1,Alkmaar,"€ 372,500",28.45%,34,-46.03%,"€ 3,970",11.86%,14.05%,1.43%,"€ 36,300",109436,0.81%,
2,Amstelveen,"€ 570,000",18.13%,21,-56.25%,"€ 4,724",1.88%,305.01%,296.30%,"€ 37,800",91675,0.92%,
3,Amsterdam,"€ 465,000",10.71%,216,-30.99%,"€ 6,942",5.52%,16.10%,0.38%,"€ 30,100",872757,1.13%,
4,Beemster,"€ 612,000",-12.26%,4,-33.33%,"€ 4,311",-6.89%,-0.23%,-12.10%,"€ 47,300",10022,2.81%,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
347,ValkenburgaandeGeul,"€ 365,000",-5.19%,4,-55.56%,"€ 3,308",9.75%,12.60%,5.41%,"€ 35,600",16367,,0.63%
348,Venlo,"€ 334,500",15.74%,31,-49.18%,"€ 2,735",17.94%,7.84%,-0.72%,"€ 33,700",101802,0.20%,
349,Venray,"€ 297,000",8.99%,8,-33.33%,"€ 2,729",23.09%,8.98%,1.22%,"€ 39,100",43614,0.66%,
350,Voerendaal,"€ 287,500",-11.13%,2,-75.00%,"€ 2,185",-12.53%,7.69%,-2.09%,"€ 40,800",12475,0.18%,


In [26]:
final_dataframe=pd.DataFrame(df) #dataframe with all data for all municipalities in the Netherlands

## Step 4: Exporting dataframe as CSV file

For the final output (a CSV file, so tabular data) we gather the output of the scraper in one single dataframe and write it to disk.

In [27]:
final_dataframe.to_csv('huizenzoeker_scraper_data.csv') 

## Step 5: Providing summary statistics

**Exporting our output to RStudio and then importing that CSV here**

The scraped output cannot be summarised directly, because most variables are stored as character strings (for example `€ 685,000` and `57.47%`) rather than as numbers. We therefore exported `final_dataframe` to R, converted these columns to numeric types there, and exported the result as a new CSV file. That file is loaded here to generate summary statistics: count, mean, std, min, max, and the 25%, 50%, and 75% quantiles.
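
The conversion was done in R, but the same step could also be sketched in Python. The example below is an alternative, not the method used for this notebook. It assumes that `,` is a thousands separator and `.` a decimal point, as in the scraped output shown earlier; `to_number` is a hypothetical helper.

```python
def to_number(raw):
    """Convert strings like '€ 685,000', '57.47%' or 'NA' to float (or None)."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text in ("", "NA"):
        return None
    # Drop the currency sign, percent sign, spaces and ',' thousands separators.
    for token in ("€", "%", " ", "\u00a0", ","):
        text = text.replace(token, "")
    try:
        return float(text)
    except ValueError:
        return None  # anything unparseable is treated as missing

for raw in ["€ 685,000", "57.47%", "-7.69%", "12", "NA"]:
    print(repr(raw), "->", to_number(raw))
```

With pandas, a column could be converted with `final_dataframe[column].map(to_number)`.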

In [28]:
huizenzoeker = pd.read_csv('huizenzoeker_data2.csv', encoding= 'latin-1') #Adding the latin encoding solved the UNICODE error 

In [29]:
huizenzoeker = pd.DataFrame(huizenzoeker)
huizenzoeker

Unnamed: 0,Province,City,gem_vraagprijs,perc_ver_vraagprijs,verk_woningen,perc_ver_verkocht,gem_m2prijs,perc_ver_m2prijs,perc_overboden,perc_ver_overboden,best_inkomen,n_inwoners,perc_pop_stijging,perc_pop_daling
0,Noord-Holland,Aalsmeer,685000.0,57.47,12,-7.69,4476.0,9.22,10.67,3.42,45800.0,31859,0.41,
1,Noord-Holland,Alkmaar,362500.0,25.00,38,-39.68,3926.0,10.62,14.05,1.43,36300.0,109436,0.81,
2,Noord-Holland,Amstelveen,570000.0,18.13,21,-56.25,4724.0,1.88,305.01,296.30,37800.0,91675,0.92,
3,Noord-Holland,Amsterdam,450000.0,7.78,230,-27.44,6961.0,5.90,16.10,0.37,30100.0,872757,1.13,
4,Noord-Holland,Beemster,612000.0,-12.26,4,-33.33,4311.0,-6.89,-0.23,-12.10,47300.0,10022,2.81,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
347,Limburg,ValkenburgaandeGeul,365000.0,-5.19,4,-55.56,3308.0,9.75,12.60,5.41,35600.0,16367,,0.63
348,Limburg,Venlo,319000.0,16.00,32,-50.00,2727.0,17.80,7.84,-0.72,33700.0,101802,0.20,
349,Limburg,Venray,297000.0,8.99,8,-33.33,2729.0,23.09,8.98,1.22,39100.0,43614,0.66,
350,Limburg,Voerendaal,287500.0,-11.13,2,-75.00,2185.0,-12.53,7.69,-2.09,40800.0,12475,0.18,


In [30]:
huizenzoeker.describe() # summary statistics for all municipalities in the Netherlands
# Check whether these statistics look realistic, given the thousands and decimal separator problem we faced.

Unnamed: 0,gem_vraagprijs,perc_ver_vraagprijs,verk_woningen,perc_ver_verkocht,gem_m2prijs,perc_ver_m2prijs,perc_overboden,perc_ver_overboden,best_inkomen,n_inwoners,perc_pop_stijging,perc_pop_daling
count,352.0,341.0,352.0,346.0,352.0,341.0,348.0,346.0,349.0,352.0,292.0,59.0
mean,369034.9,9.90956,18.127841,-22.113699,3133.180614,3.972903,13.985144,4.637139,40037.535817,49282.267045,0.800925,0.340678
std,154982.9,28.825046,26.904773,52.621778,1020.587474,13.924234,45.292619,45.529907,4728.094766,73491.321323,0.678329,0.372331
min,0.0,-40.86,0.0,-100.0,0.0,-34.77,-6.47,-13.4,25400.0,0.0,0.0,0.01
25%,295000.0,-8.16,6.0,-52.2225,2603.25,-4.78,7.935,-0.77,36900.0,21680.5,0.34,0.15
50%,350000.0,4.52,11.0,-33.33,3081.5,3.04,10.29,0.925,40400.0,30790.5,0.63,0.26
75%,418000.0,21.18,19.25,-3.965,3601.25,10.5,12.9575,2.8675,43500.0,49697.25,1.1025,0.425
max,1695000.0,149.63,230.0,325.0,9059.0,96.89,788.89,782.81,53800.0,872757.0,5.43,2.55
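
Some of these statistics look implausible, for example a minimum `gem_vraagprijs` of 0.0 and a maximum `perc_overboden` of 788.89, which may point to scraping or parsing problems rather than real market values. A minimal sketch of a plausibility check follows. `flag_outliers` is a hypothetical helper, the three rows are taken from the municipality table above, and the bounds are illustrative guesses rather than validated thresholds.

```python
def flag_outliers(rows, column, low, high):
    """Return the rows whose value in `column` falls outside [low, high]."""
    flagged = []
    for row in rows:
        value = row.get(column)
        if value is not None and not (low <= value <= high):
            flagged.append(row)
    return flagged

# Values from the municipality table; the bounds are illustrative guesses.
rows = [
    {"City": "Aalsmeer", "perc_overboden": 10.67},
    {"City": "Amstelveen", "perc_overboden": 305.01},
    {"City": "Beemster", "perc_overboden": -0.23},
]
suspicious = flag_outliers(rows, "perc_overboden", low=-50.0, high=100.0)
print([r["City"] for r in suspicious])
```

Flagged rows can then be compared by hand with the corresponding page on the website.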


## Step 6: Scraping woningmarkt dashboard (province-level)

### Step 6a: First scraping the trend, and other data

Generating links of all provinces: 

In [31]:
base_url = 'https://www.huizenzoeker.nl/woningmarkt/'
province_url = ['noord-holland/', 'zuid-holland/', 'zeeland/', 'noord-brabant/', 'utrecht/', 'flevoland/', 
                'friesland/', 'groningen/', 'drenthe/', 'overijssel/', 'gelderland/', 'limburg/']

Defining a function to paste together these URL parts: 

In [32]:
def generate_links(base_url,province_url): 
    page_links = []
    for i in province_url:
        full_links = base_url + i
        page_links.append(full_links)  
    return page_links
page_links = generate_links(base_url,province_url)
print(page_links)

['https://www.huizenzoeker.nl/woningmarkt/noord-holland/', 'https://www.huizenzoeker.nl/woningmarkt/zuid-holland/', 'https://www.huizenzoeker.nl/woningmarkt/zeeland/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/', 'https://www.huizenzoeker.nl/woningmarkt/utrecht/', 'https://www.huizenzoeker.nl/woningmarkt/flevoland/', 'https://www.huizenzoeker.nl/woningmarkt/friesland/', 'https://www.huizenzoeker.nl/woningmarkt/groningen/', 'https://www.huizenzoeker.nl/woningmarkt/drenthe/', 'https://www.huizenzoeker.nl/woningmarkt/overijssel/', 'https://www.huizenzoeker.nl/woningmarkt/gelderland/', 'https://www.huizenzoeker.nl/woningmarkt/limburg/']


In [33]:
def extract_province_trends(page_links):
    trend_list = []
    for page_link in page_links:
        driver.get(page_link)
        time.sleep(1) 
        soup = BeautifulSoup(driver.page_source, 'html.parser')
            # Province name
        province_name = soup.find_all('h2')[0].get_text()
        province_name = province_name.replace('Woningmarkt','')
        province_name = province_name.replace(' ', '')
            # Gemiddelde vraagprijs
        content = soup.find_all(class_='trend-graph')[0]
        if content.find(class_="trend-graph-icon") == None:
            gem_vraagprijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_vraagprijs = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                gem_vraagprijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
                gem_vraagprijs = gem_vraagprijs.replace("(","")
                gem_vraagprijs = gem_vraagprijs.replace(",)","")
                gem_vraagprijs = gem_vraagprijs.replace(".", ",")
                tov_vorige_maand_vraagprijs = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace("\n\n","")
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace(" t.o.v. vorige maand\n","")
            else:
                gem_vraagprijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
                gem_vraagprijs = gem_vraagprijs.replace("(","")
                gem_vraagprijs = gem_vraagprijs.replace(",)","")    
                gem_vraagprijs = gem_vraagprijs.replace(".", ",")
                tov_vorige_maand_vraagprijs = content.find("div",{"class":"trend-graph-pill"}).get_text()
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace("\n\n","")
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace(" t.o.v. vorige maand\n","")
            # Aantal verkochte woningen
        content = soup.find_all(class_='trend-graph')[1]
        if content.find(class_="trend-graph-icon") == None:
            verk_woningen = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_verkocht = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                verk_woningen = content.find("h3",{"class":"trend-graph-value"}).get_text()               
                tov_vorige_maand_verkocht = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace("\n\n","")
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace(" t.o.v. vorige maand\n","")
            else:
                verk_woningen = content.find("h3",{"class":"trend-graph-value"}).get_text()             
                tov_vorige_maand_verkocht = content.find("div",{"class":"trend-graph-pill"}).get_text()
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace("\n\n","")
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace(" t.o.v. vorige maand\n","")
            # Gemiddelde vierkante meter prijs
        content = soup.find_all(class_='trend-graph')[2]
        if content.find(class_="trend-graph-icon") == None:
            m2_prijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_m2_prijs = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                m2_prijs = content.find("h3",{"class":"trend-graph-value"}).get_text()     
                m2_prijs = m2_prijs.replace(".", ",")
                tov_vorige_maand_m2_prijs = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace("\n\n","")
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace(" t.o.v. vorige maand\n","")
            else:
                m2_prijs = content.find("h3",{"class":"trend-graph-value"}).get_text() 
                m2_prijs = m2_prijs.replace(".", ",")
                tov_vorige_maand_m2_prijs = content.find("div",{"class":"trend-graph-pill"}).get_text() 
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace("\n\n","")
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace(" t.o.v. vorige maand\n","")
            # Percentage overboden
        content = soup.find_all(class_='trend-graph')[3]
        if content.find(class_="trend-graph-icon") == None:
            perc_overboden = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_perc_overboden = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                perc_overboden = content.find("h3",{"class":"trend-graph-value"}).get_text()               
                tov_vorige_maand_perc_overboden = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace("\n\n","")
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace(" t.o.v. vorige maand\n","")
            else:
                perc_overboden = content.find("h3",{"class":"trend-graph-value"}).get_text()             
                tov_vorige_maand_perc_overboden = content.find("div",{"class":"trend-graph-pill"}).get_text()
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace("\n\n","")
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace(" t.o.v. vorige maand\n","")
            # Besteedbaar inkomen
        bes_inkomen = soup.find_all(class_='detail__income huizenzoeker-card single-value-graph-container')[0].get_text()
        bes_inkomen = bes_inkomen.replace('\n','')
        bes_inkomen = bes_inkomen.replace('Besteedbaar Inkomen Per Huishouden','')
        bes_inkomen = bes_inkomen.replace(".", ",")
            # Inwoners
        content = soup.find("div", {"class": "buurt-info"})
        inwoners = content.find_all('p')[3].get_text()  # call get_text() to obtain the paragraph text as a string
        inwoners = re.search('Dat zijn(.+?)inwoners', inwoners)
        found_inwoners = 'NA'
        if inwoners:
            found_inwoners = inwoners.group(1)
            found_inwoners = found_inwoners.strip()
            found_inwoners = found_inwoners.replace(".", ",")
            # Bevolkingsgroei
        content = soup.find("div", {"class": "buurt-info"})
        populatiegroei = content.find_all('p')[4].get_text()  # call get_text() to obtain the paragraph text as a string
        populatiegroei_increase = re.search('afgelopen jaar met (.+?) gegroeid', populatiegroei)
        if populatiegroei_increase:
            found_populatiegroei = populatiegroei_increase.group(1)
            found_populatiegroei = found_populatiegroei.strip()
        else:
            found_populatiegroei = 'NA'
        populatiegroei_decline = re.search('afgelopen jaar met (.+?) gekrompen', populatiegroei)
        if populatiegroei_decline:
            found_populatiegroei_decline = populatiegroei_decline.group(1)
            found_populatiegroei_decline = found_populatiegroei_decline.strip() 
        else:
            found_populatiegroei_decline = 'NA'
            # Append list
        trend_list.append({"Province":province_name, 
                    "Gem. vraagprijs":gem_vraagprijs, "%Δ Vraagprijs (t.o.v vorige maand)": tov_vorige_maand_vraagprijs,
                    "Verkochte woningen":verk_woningen, "%Δ Verkochte woningen (t.o.v vorige maand)":tov_vorige_maand_verkocht,
                    "Gem. m2 prijs":m2_prijs, "%Δ M2 prijs (t.o.v vorige maand)":tov_vorige_maand_m2_prijs,
                    "% Vraagprijs overboden":perc_overboden, "%Δ Overboden (t.o.v vorige maand)":tov_vorige_maand_perc_overboden,
                    "Besteedbaar inkomen (per huishouden)":bes_inkomen,
                    "Aantal inwoners": found_inwoners,
                    "% Populatie stijging":found_populatiegroei, "% Populatie daling":found_populatiegroei_decline})
    return trend_list

In [34]:
df2 = extract_province_trends(page_links) 
province_dataframe = pd.DataFrame(df2)
province_dataframe

Unnamed: 0,Province,Gem. vraagprijs,%Δ Vraagprijs (t.o.v vorige maand),Verkochte woningen,%Δ Verkochte woningen (t.o.v vorige maand),Gem. m2 prijs,%Δ M2 prijs (t.o.v vorige maand),% Vraagprijs overboden,%Δ Overboden (t.o.v vorige maand),Besteedbaar inkomen (per huishouden),Aantal inwoners,% Populatie stijging,% Populatie daling
0,Noord-Holland,"€ 425,000",13.33%,976,-24.75%,"€ 4,437",9.53%,28.66%,16.00%,"€ 36,200",2879527,0.92%,
1,Zuid-Holland,"€ 365,000",7.67%,1283,-38.14%,"€ 3,602",5.54%,20.98%,10.76%,"€ 35,800",3708696,0.95%,
2,Zeeland,"€ 280,000",1.82%,168,-38.91%,"€ 2,676",5.31%,10.05%,1.92%,"€ 36,900",383488,0.12%,
3,Noord-Brabant,"€ 350,000",3.24%,738,-46.83%,"€ 3,197",5.37%,8.74%,0.87%,"€ 38,100",2548585,0.71%,
4,Utrecht,"€ 425,000",9.25%,608,-8.30%,"€ 4,190",4.05%,13.03%,1.06%,"€ 39,500",1354834,0.94%,
5,Flevoland,"€ 339,000",4.31%,153,-39.53%,"€ 2,982",1.64%,16.40%,1.69%,"€ 39,500",423021,1.55%,
6,Friesland,"€ 285,000",3.64%,322,-12.74%,"€ 2,429",-1.82%,11.71%,1.07%,"€ 34,900",649957,0.35%,
7,Groningen,"€ 250,000",11.11%,316,-9.20%,"€ 2,550",6.65%,20.27%,4.51%,"€ 30,600",540009,0.38%,
8,Drenthe,"€ 309,000",4.75%,236,-25.55%,"€ 2,481",1.31%,11.58%,1.09%,"€ 37,100",493682,0.31%,
9,Overijssel,"€ 300,000",1.69%,457,-16.15%,"€ 2,692",4.42%,10.50%,0.66%,"€ 36,900",1162406,0.52%,


### Step 6b: Scrape some more woningmarkt dashboard data

To this dataframe, we would like to add more province-level data from the woningmarkt dashboard, e.g. the number of interested buyers per home ('aantal geïnteresseerden per woning'), rental supply (huuraanbod), the profile of house hunters (profiel huizenzoekers), and information about the homes themselves (over woningen).

For now, we export this dataframe as a CSV file and convert the character columns to numerics in R, so that it becomes a usable dataset.

## Step 7: Exporting dashboard data as CSV

In [37]:
province_dataframe.to_csv(r'C:\Users\danie\OneDrive\Documents\Repositories\oDCM-project-team-3\src\collection\huizenzoeker_province_data.csv') #at province-level
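
The export above writes to an absolute path on one particular machine. A minimal sketch of a more portable alternative builds the path relative to a base folder with `pathlib`. `output_path` is a hypothetical helper, and the choice of the current working directory as base folder is an assumption.

```python
from pathlib import Path

def output_path(filename, base_dir="."):
    """Build an output path under base_dir, creating the folder if needed."""
    folder = Path(base_dir)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / filename

csv_path = output_path("huizenzoeker_province_data.csv")
print(csv_path.name)
```

The export would then read `province_dataframe.to_csv(csv_path)`, which works unchanged on other machines.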

In [38]:
huizenzoeker_province = pd.read_csv('huizenzoeker_province_data1.csv', encoding= 'latin-1')

In [39]:
huizenzoeker_province = pd.DataFrame(huizenzoeker_province)
huizenzoeker_province

Unnamed: 0,Province,gem_vraagprijs,perc_ver_vraagprijs,verk_woningen,perc_ver_verkocht,gem_m2prijs,perc_ver_m2prijs,perc_overboden,perc_ver_overboden,best_inkomen,n_inwoners,perc_pop_stijging
0,Noord-Holland,425000.0,13.33,976,-24.75,4437,9.53,28.66,16.0,36200,2879527,0.92
1,Zuid-Holland,365000.0,7.67,1283,-38.14,3602,5.54,20.98,10.76,35800,3708696,0.95
2,Zeeland,280000.0,1.82,168,-38.91,2676,5.31,10.05,1.92,36900,383488,0.12
3,Noord-Brabant,350000.0,3.24,738,-46.83,3197,5.37,8.74,0.87,38100,2548585,0.71
4,Utrecht,425000.0,9.25,608,-8.3,4190,4.05,13.03,1.06,39500,1354834,0.94
5,Flevoland,339000.0,4.31,153,-39.53,2982,1.64,16.4,1.69,39500,423021,1.55
6,Friesland,285000.0,3.64,322,-12.74,2429,-1.82,11.71,1.07,34900,649957,0.35
7,Groningen,250000.0,11.11,316,-9.2,2550,6.65,20.27,4.51,30600,540009,0.38
8,Drenthe,309000.0,4.75,236,-25.55,2481,1.31,11.58,1.09,37100,493682,0.31
9,Overijssel,300000.0,1.69,457,-16.15,2692,4.42,10.5,0.66,36900,1162406,0.52


In [40]:
huizenzoeker_province.describe()

Unnamed: 0,gem_vraagprijs,perc_ver_vraagprijs,verk_woningen,perc_ver_verkocht,gem_m2prijs,perc_ver_m2prijs,perc_overboden,perc_ver_overboden,best_inkomen,n_inwoners,perc_pop_stijging
count,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0
mean,330250.0,5.744167,531.75,-27.476667,3069.75,4.163333,17.365,6.389167,36483.333333,1445613.0,0.626667
std,55959.603936,3.740944,349.086372,13.318931,681.205632,2.965414,9.787681,9.803833,2394.627826,1109205.0,0.419227
min,250000.0,1.69,153.0,-46.83,2429.0,-1.82,8.74,0.66,30600.0,383488.0,0.1
25%,285000.0,3.54,296.0,-38.3325,2533.0,2.03,11.31,1.0675,35575.0,528427.2,0.34
50%,324000.0,4.395,416.5,-28.555,2837.0,4.865,13.485,1.805,36900.0,1139804.0,0.595
75%,353750.0,8.065,740.5,-15.2975,3298.25,5.605,20.4475,6.0725,37650.0,2201610.0,0.925
max,425000.0,13.33,1283.0,-8.3,4437.0,9.53,42.52,33.65,39500.0,3708696.0,1.55
