# Scraping www.huizenzoeker.nl/woningmarkt/ for all municipalities in The Netherlands

### ODCM project - Team 3 

Which municipalities in the Netherlands are hit hardest by the Dutch housing crisis, and which the least? 
We use the site www.huizenzoeker.nl/woningmarkt/ to analyse the Dutch housing market, collecting the average asking price (gem. vraagprijs), the number of homes sold (# verkochte woningen), the average price per square metre (gem. vierkante meter prijs), and the percentage overbid (% overboden). 

## Step 1: Loading all the basics

In [3]:
#import the packages (after you have installed them properly)
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd 
import time 
from functools import reduce
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# pip install webdriver-manager
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys

In [14]:
#set the basis for BeautifulSoup 
url = 'https://www.huizenzoeker.nl/woningmarkt/'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')

In [15]:
#set the basis for Selenium
chrome_path = r"C:\Documents\MSc_Marketing_Analytics\oDCM\oDCM-project-team-3\src\collection\chromedriver.exe"
# raw string, so the backslashes are not read as escape sequences; the path depends on your own setup

## Step 2: Collecting the URLs

### Step 2a: Extracting first the provinces, and then all the municipalities 

Here we collect the URLs of all municipalities in each province, using Selenium to make the process more efficient.

We first construct a base URL (base_url) and a list of province names (province_url). Appended together, they give the URL of the woningmarkt page of each province. The generate_links() function does the appending. 

In [7]:
base_url = 'https://www.huizenzoeker.nl/woningmarkt/'

In [8]:
province_url = ['noord-holland/', 'zuid-holland/', 'zeeland/', 'noord-brabant/', 'utrecht/', 'flevoland/', 
                'friesland/', 'groningen/', 'drenthe/', 'overijssel/', 'gelderland/', 'limburg/']

In [9]:
def generate_links(base_url,province_url): 
    page_links = []
    for i in province_url:
        full_links = base_url + i
        page_links.append(full_links)  
    return page_links

In [10]:
page_links = generate_links(base_url,province_url)
print(page_links)

['https://www.huizenzoeker.nl/woningmarkt/noord-holland/', 'https://www.huizenzoeker.nl/woningmarkt/zuid-holland/', 'https://www.huizenzoeker.nl/woningmarkt/zeeland/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/', 'https://www.huizenzoeker.nl/woningmarkt/utrecht/', 'https://www.huizenzoeker.nl/woningmarkt/flevoland/', 'https://www.huizenzoeker.nl/woningmarkt/friesland/', 'https://www.huizenzoeker.nl/woningmarkt/groningen/', 'https://www.huizenzoeker.nl/woningmarkt/drenthe/', 'https://www.huizenzoeker.nl/woningmarkt/overijssel/', 'https://www.huizenzoeker.nl/woningmarkt/gelderland/', 'https://www.huizenzoeker.nl/woningmarkt/limburg/']


We then use this list of provinces to extract all municipalities of each province, opening every province page in its own browser tab (window handling). 

In [16]:
driver = webdriver.Chrome()

In [17]:
for x in range(len(page_links)): # WARNING: running this cell automatically opens one Chrome tab per province (12 in total)!
    driver.get(page_links[x])
    if x < len(page_links) - 1: # no extra tab is needed after the last province
        driver.execute_script("window.open('');")
        driver.switch_to.window(driver.window_handles[x+1])

In [18]:
driver.window_handles

['CDwindow-53B0DD1807F129FB59A399B7966C0339',
 'CDwindow-1B69B0A2ECF1D5D0CDF6F044317D69E7',
 'CDwindow-6022DDC8CE7C486AE1A1EAA0DB7717A6',
 'CDwindow-F0EBEA51D909EBD0720EE674E591FBAD',
 'CDwindow-AA45BD22BD1348012427C54E9392E770',
 'CDwindow-A2F37BDD207BFCF3A9A5B089A3DAFB52',
 'CDwindow-817BD9AA1E8D291C464B86DA6FD680DE',
 'CDwindow-A9A664FB575C06DF2527FEB9754CF289',
 'CDwindow-003F2D32308DE81092B24A30233A2214',
 'CDwindow-1C3520930BA4B2715492368FD11C54C5',
 'CDwindow-A46F88F8408E998BEC1B40BB906E1E49',
 'CDwindow-B9A0EBB12AE94F39524CD1F74076A52D']

In [19]:
page_urls_full = []
for handle in driver.window_handles:        
    driver.switch_to.window(handle)
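    # find_elements_by_xpath was removed in Selenium 4; find_elements("xpath", ...) also works in Selenium 3
    elem1 = driver.find_elements("xpath", "//li//div//a[@href]")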
    
    for elem in elem1:
        urls = elem.get_attribute('href')
        page_urls_full.append(urls)   
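
The XPath above matches every link inside a list item, so the result could in principle contain links that are not municipality pages, or the same link twice (we did not verify this for every province page). A minimal sketch of a filter, assuming municipality URLs always have the form base URL + province + municipality with lower-case, hyphenated names as in the output shown in this notebook. The helper name `keep_municipality_urls` is our own and not part of the original code.

```python
import re

# Hypothetical helper: keep only links shaped like <base>/<province>/<municipality>/
# and drop duplicates while preserving the original order.
def keep_municipality_urls(urls, base_url="https://www.huizenzoeker.nl/woningmarkt/"):
    # Assumption: province and municipality names use only a-z, 0-9 and hyphens.
    pattern = re.compile(re.escape(base_url) + r"[a-z0-9-]+/[a-z0-9-]+/")
    seen = set()
    kept = []
    for url in urls:
        if url and pattern.fullmatch(url) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept

sample_urls = [
    "https://www.huizenzoeker.nl/woningmarkt/noord-holland/aalsmeer/",
    "https://www.huizenzoeker.nl/woningmarkt/noord-holland/",           # province page
    "https://www.huizenzoeker.nl/woningmarkt/noord-holland/aalsmeer/",  # duplicate
    None,                                                               # link without href value
    "https://www.huizenzoeker.nl/woningmarkt/noord-holland/bergen-nh/",
]
filtered_urls = keep_municipality_urls(sample_urls)
print(filtered_urls)
```

In the notebook this could be applied as `page_urls_full = keep_municipality_urls(page_urls_full)` before scraping.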

In [20]:
page_urls_full

['https://www.huizenzoeker.nl/woningmarkt/noord-holland/aalsmeer/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/alkmaar/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/amstelveen/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/amsterdam/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/beemster/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/bergen-nh/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/beverwijk/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/blaricum/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/bloemendaal/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/castricum/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/den-helder/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/diemen/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/drechterland/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/edam-volendam/',
 'https://www.huizenzoeker.nl/w

In [21]:
subset = page_urls_full[30:33] # defined subset to try out on few urls first (for time convenience)
subset

['https://www.huizenzoeker.nl/woningmarkt/noord-holland/medemblik/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/oostzaan/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/opmeer/']

## Step 3: Scraping data from each URL

For each municipality we now extract:
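* *trend data*: gem. vraagprijs (average asking price), verkochte woningen (homes sold), gem. vierkantemeter prijs (average price per square metre), % overboden (percentage overbid), and how each of these numbers has changed compared with the previous month (t.o.v. vorige maand) 
* *other information*: besteedbaar inkomen (disposable income), aantal inwoners (number of inhabitants)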

#### Warning: Scraping all municipalities takes approx. 30 minutes. To test on a few pages first, call extract_city_trends() with 'subset' (or a slice such as page_urls_full[0:2]) instead of 'page_urls_full'.

In [25]:
import json

fn = 'saved_data.json' # every scraped page is appended to this file as one JSON object per line

def extract_city_trends(page_urls_full):
    trend_list = []
    for page_url in page_urls_full:
        driver.get(page_url)
        time.sleep(5) 
        soup = BeautifulSoup(driver.page_source, 'html.parser')
            # City name
        city_name = soup.find_all('h2')[0].get_text()
        city_name = city_name.replace('Woningmarkt','')
        city_name = city_name.replace(' ', '')
            # Gemiddelde vraagprijs
        content = soup.find_all(class_='trend-graph')[0]
        if content.find(class_="trend-graph-icon") == None:
            gem_vraagprijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_vraagprijs = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                gem_vraagprijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
                gem_vraagprijs = gem_vraagprijs.replace("(","")
                gem_vraagprijs = gem_vraagprijs.replace(",)","")
                gem_vraagprijs = gem_vraagprijs.replace(".", ",")
                tov_vorige_maand_vraagprijs = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace("\n\n","")
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace(" t.o.v. vorige maand\n","")
            else:
                gem_vraagprijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
                gem_vraagprijs = gem_vraagprijs.replace("(","")
                gem_vraagprijs = gem_vraagprijs.replace(",)","")    
                gem_vraagprijs = gem_vraagprijs.replace(".", ",")
                tov_vorige_maand_vraagprijs = content.find("div",{"class":"trend-graph-pill"}).get_text()
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace("\n\n","")
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace(" t.o.v. vorige maand\n","")
            # Aantal verkochte woningen
        content = soup.find_all(class_='trend-graph')[1]
        if content.find(class_="trend-graph-icon") == None:
            verk_woningen = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_verkocht = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                verk_woningen = content.find("h3",{"class":"trend-graph-value"}).get_text()               
                tov_vorige_maand_verkocht = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace("\n\n","")
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace(" t.o.v. vorige maand\n","")
            else:
                verk_woningen = content.find("h3",{"class":"trend-graph-value"}).get_text()             
                tov_vorige_maand_verkocht = content.find("div",{"class":"trend-graph-pill"}).get_text()
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace("\n\n","")
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace(" t.o.v. vorige maand\n","")
            # Gemiddelde vierkante meter prijs
        content = soup.find_all(class_='trend-graph')[2]
        if content.find(class_="trend-graph-icon") == None:
            m2_prijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_m2_prijs = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                m2_prijs = content.find("h3",{"class":"trend-graph-value"}).get_text()     
                m2_prijs = m2_prijs.replace(".", ",")
                tov_vorige_maand_m2_prijs = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace("\n\n","")
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace(" t.o.v. vorige maand\n","")
            else:
                m2_prijs = content.find("h3",{"class":"trend-graph-value"}).get_text() 
                m2_prijs = m2_prijs.replace(".", ",")
                tov_vorige_maand_m2_prijs = content.find("div",{"class":"trend-graph-pill"}).get_text() 
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace("\n\n","")
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace(" t.o.v. vorige maand\n","")
            # Percentage overboden
        content = soup.find_all(class_='trend-graph')[3]
        if content.find(class_="trend-graph-icon") == None:
            perc_overboden = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_perc_overboden = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                perc_overboden = content.find("h3",{"class":"trend-graph-value"}).get_text()               
                tov_vorige_maand_perc_overboden = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace("\n\n","")
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace(" t.o.v. vorige maand\n","")
            else:
                perc_overboden = content.find("h3",{"class":"trend-graph-value"}).get_text()             
                tov_vorige_maand_perc_overboden = content.find("div",{"class":"trend-graph-pill"}).get_text()
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace("\n\n","")
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace(" t.o.v. vorige maand\n","")
            # Besteedbaar inkomen
        bes_inkomen = soup.find_all(class_='detail__income huizenzoeker-card single-value-graph-container')[0].get_text()
        bes_inkomen = bes_inkomen.replace('\n','')
        bes_inkomen = bes_inkomen.replace('Besteedbaar Inkomen Per Huishouden','')
        bes_inkomen = bes_inkomen.replace(".", ",")
            # Inwoners and bevolkingsgroei (still to be added)
            # Append list
        save_obj = {"City":city_name, 
                           "Gem. vraagprijs":gem_vraagprijs, "%Δ Vraagprijs (t.o.v vorige maand)": tov_vorige_maand_vraagprijs,
                           "Verkochte woningen":verk_woningen, "%Δ Verkochte woningen (t.o.v vorige maand)":tov_vorige_maand_verkocht,
                           "Gem. m2 prijs":m2_prijs, "%Δ M2 prijs (t.o.v vorige maand)":tov_vorige_maand_m2_prijs,
                           "% Vraagprijs overboden":perc_overboden, "%Δ Overboden (t.o.v vorige maand)":tov_vorige_maand_perc_overboden,
                           "Besteedbaar inkomen (per huishouden)":bes_inkomen}
        trend_list.append(save_obj)
        # append mode, so an interrupted run keeps the results collected so far
        with open(fn, 'a', encoding='utf-8') as f:
            f.write(json.dumps(save_obj)+'\n')
    return(trend_list)
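
The function above repeats the same text cleaning for each of the four trend blocks. A minimal sketch of how that cleaning could be pulled into two small helpers. The names `clean_change` and `clean_amount` are hypothetical, and the sample strings are assumptions about what the page text looks like, inferred from the replace() calls above.

```python
# Hypothetical helpers, not part of the original notebook.
LABEL = " t.o.v. vorige maand"

def clean_change(raw):
    """Remove the 't.o.v. vorige maand' label and surrounding whitespace from a trend pill."""
    return raw.replace(LABEL, "").strip()

def clean_amount(raw):
    """Same amount cleaning as in the notebook: drop '(' and ',)' and swap '.' for ','.
    Unlike the notebook, this also strips surrounding whitespace."""
    return raw.replace("(", "").replace(",)", "").replace(".", ",").strip()

# Assumed raw texts, shaped like the strings the replace() calls above expect.
change_text = clean_change("\n\n55.41% t.o.v. vorige maand\n")
amount_text = clean_amount("€ 725.000")
print(change_text, amount_text)
```

Each of the four blocks could then call `clean_change(...)` once instead of repeating the two replace() lines in both branches.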

In [None]:
# remove # in lines below to use this test line
# df = extract_city_trends(page_urls_full[0:2]) 
# pd.DataFrame(df)

In [26]:
df = extract_city_trends(page_urls_full) 
pd.DataFrame(df)

Unnamed: 0,City,Gem. vraagprijs,%Δ Vraagprijs (t.o.v vorige maand),Verkochte woningen,%Δ Verkochte woningen (t.o.v vorige maand),Gem. m2 prijs,%Δ M2 prijs (t.o.v vorige maand),% Vraagprijs overboden,%Δ Overboden (t.o.v vorige maand),Besteedbaar inkomen (per huishouden)
0,Aalsmeer,"€ 725,000",55.41%,9,-25.00%,"€ 4,297",3.19%,7.25%,-3.13%,"€ 45,800"
1,Alkmaar,"€ 410,000",43.86%,24,-59.32%,"€ 4,013",13.94%,12.67%,1.53%,"€ 36,300"
2,Amstelveen,"€ 700,000",47.37%,14,-68.89%,"€ 5,097",10.09%,8.71%,0.81%,"€ 37,800"
3,Amsterdam,"€ 465,000",9.41%,166,-43.73%,"€ 6,993",5.52%,15.72%,1.81%,"€ 30,100"
4,Beemster,"€ 675,000",-3.23%,3,-50.00%,"€ 4,299",-7.15%,11.87%,1.79%,"€ 47,300"
...,...,...,...,...,...,...,...,...,...,...
347,ValkenburgaandeGeul,"€ 325,000",-15.58%,3,-66.67%,"€ 2,567",-14.83%,7.19%,10.08%,"€ 35,600"
348,Venlo,"€ 379,500",43.21%,26,-50.00%,"€ 2,735",17.84%,8.56%,3.05%,"€ 33,700"
349,Venray,"€ 297,000",14.23%,8,-27.27%,"€ 2,729",27.11%,7.76%,-1.93%,"€ 39,100"
350,Voerendaal,"€ 287,500",3.23%,2,-66.67%,"€ 2,185",-12.53%,9.78%,-0.59%,"€ 40,800"


In [29]:
final_dataframe=pd.DataFrame(df) #dataframe with all data for all municipalities in the Netherlands
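
Because extract_city_trends() appends one JSON object per line to saved_data.json, the results of an interrupted run can be read back from that file. Since the file is opened in append mode, a second run adds the same municipalities again. A minimal sketch of a loader that keeps only the latest record per city (the function name `load_json_lines` is our own; the demonstration writes to a temporary file rather than the real one):

```python
import json
import os
import tempfile

# Hypothetical helper: read a JSON-lines file and keep the last record per city.
def load_json_lines(path):
    by_city = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip empty lines
                record = json.loads(line)
                by_city[record["City"]] = record  # later records overwrite earlier ones
    return list(by_city.values())

# Demonstration on a throwaway file with one duplicated city.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "saved_data.json")
    demo_rows = [
        {"City": "Aalsmeer", "Verkochte woningen": "9"},
        {"City": "Alkmaar", "Verkochte woningen": "24"},
        {"City": "Aalsmeer", "Verkochte woningen": "9"},
    ]
    with open(demo_path, "a", encoding="utf-8") as f:
        for row in demo_rows:
            f.write(json.dumps(row) + "\n")
    records = load_json_lines(demo_path)

print(records)
```

The resulting list can be passed to `pd.DataFrame(...)` in the same way as the return value of extract_city_trends().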

**The 'aantal inwoners' (number of inhabitants) and 'bevolkingsgroei' (population growth) scrapers for all municipalities (separately at first)**

In [68]:
 # Aantal inwoners
def extract_inwoners(page_urls_full):
    inwoners_city = []
    for page_url in page_urls_full:
        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, 'html.parser')
        city_name = soup.find_all('h2')[0].get_text()
        city_name1 = city_name.replace('Woningmarkt','')
        inwoners = soup.find("div", {"class": "buurt-info"})
        new_inwoners = inwoners.find_all('p')[3].get_text() # call the method to get the paragraph text
        new_inwoners2 = str(new_inwoners)
        new_inwoners1 = re.search('Dat zijn(.+?)inwoners', new_inwoners2)
        found = 'NA'
        if new_inwoners1:
            found = new_inwoners1.group(1)
            found = found.strip()
        inwoners_city.append({'City':city_name1, 'Aantal inwoners':found})
    return(inwoners_city)
inwoners_city = extract_inwoners(page_urls_full)

In [70]:
print(inwoners_city)

[{'City': ' Aalsmeer', 'Aantal inwoners': '31.859'}, {'City': ' Alkmaar', 'Aantal inwoners': '109.436'}, {'City': ' Amstelveen', 'Aantal inwoners': '91.675'}, {'City': ' Amsterdam', 'Aantal inwoners': '872.757'}, {'City': ' Beemster', 'Aantal inwoners': '10.022'}, {'City': ' Bergen (NH)', 'Aantal inwoners': '29.839'}, {'City': ' Beverwijk', 'Aantal inwoners': '41.626'}, {'City': ' Blaricum', 'Aantal inwoners': '11.538'}, {'City': ' Bloemendaal', 'Aantal inwoners': '23.571'}, {'City': ' Castricum', 'Aantal inwoners': '35.986'}, {'City': ' Den Helder', 'Aantal inwoners': '56.296'}, {'City': ' Diemen', 'Aantal inwoners': '30.780'}, {'City': ' Drechterland', 'Aantal inwoners': '19.719'}, {'City': ' Edam-Volendam', 'Aantal inwoners': '36.197'}, {'City': ' Enkhuizen', 'Aantal inwoners': '18.591'}, {'City': ' Gooise Meren', 'Aantal inwoners': '58.055'}, {'City': ' Haarlem', 'Aantal inwoners': '162.902'}, {'City': ' Haarlemmermeer', 'Aantal inwoners': '156.002'}, {'City': ' Heemskerk', 'Aantal

In [71]:
pd.DataFrame(inwoners_city)

Unnamed: 0,City,Aantal inwoners
0,Aalsmeer,31.859
1,Alkmaar,109.436
2,Amstelveen,91.675
3,Amsterdam,872.757
4,Beemster,10.022
...,...,...
347,Valkenburg aan de Geul,16.367
348,Venlo,101.802
349,Venray,43.614
350,Voerendaal,12.475
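
The inhabitant counts use a dot as thousands separator, so a value such as 31.859 may be read as a decimal number once the data is exported and re-imported. A minimal sketch of the extraction step plus a conversion to an integer. The sample sentence is an assumption about the page wording, based only on the regular expression used above, and `dutch_int` is a hypothetical helper.

```python
import re

def find_inwoners(text):
    """Same regular expression as in extract_inwoners(), applied to plain text."""
    match = re.search(r"Dat zijn(.+?)inwoners", text)
    return match.group(1).strip() if match else "NA"

# Hypothetical helper: '31.859' -> 31859, 'NA' or other non-numeric text -> None.
def dutch_int(value):
    digits = value.replace(".", "")
    return int(digits) if digits.isdigit() else None

sample_sentence = "Dat zijn 31.859 inwoners."  # assumed wording, for illustration only
found = find_inwoners(sample_sentence)
inwoners_number = dutch_int(found)
print(found, inwoners_number)
```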


In [69]:
 # '% populatie stijging' and 'daling' in two separate columns
def extract_populatiegroei(page_urls_full):
    populatie_groei = []
    for page_url in page_urls_full:
        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, 'html.parser')
        city_name = soup.find_all('h2')[0].get_text()
        city_name1 = city_name.replace('Woningmarkt','')
        populatiegroei = soup.find("div", {"class": "buurt-info"})
        new_populatiegroei = populatiegroei.find_all('p')[4].get_text() # call the method to get the paragraph text
        new_populatiegroei2 = str(new_populatiegroei)
        new_populatiegroei_increase = re.search('afgelopen jaar met (.+?) gegroeid', new_populatiegroei2)
        found_i = 'NA'
        if new_populatiegroei_increase:
            found_i = new_populatiegroei_increase.group(1)
            found_i = found_i.strip()
        new_populatiegroei_decline = re.search('afgelopen jaar met (.+?) gekrompen', new_populatiegroei2)
        found_d = 'NA'
        if new_populatiegroei_decline:
            found_d = new_populatiegroei_decline.group(1)
            found_d = found_d.strip()        
        populatie_groei.append({'City':city_name1, '% populatie stijging':found_i, '% populatie daling':found_d})
    return(populatie_groei)
populatie_groei = extract_populatiegroei(page_urls_full)

In [72]:
print(populatie_groei)

[{'City': ' Aalsmeer', '% populatie stijging/daling': '0.41%'}, {'City': ' Alkmaar', '% populatie stijging/daling': '0.81%'}, {'City': ' Amstelveen', '% populatie stijging/daling': '0.92%'}, {'City': ' Amsterdam', '% populatie stijging/daling': '1.13%'}, {'City': ' Beemster', '% populatie stijging/daling': '2.81%'}, {'City': ' Bergen (NH)', '% populatie stijging/daling': 'NA'}, {'City': ' Beverwijk', '% populatie stijging/daling': '1.09%'}, {'City': ' Blaricum', '% populatie stijging/daling': '3.02%'}, {'City': ' Bloemendaal', '% populatie stijging/daling': '0.69%'}, {'City': ' Castricum', '% populatie stijging/daling': '0.60%'}, {'City': ' Den Helder', '% populatie stijging/daling': '1.24%'}, {'City': ' Diemen', '% populatie stijging/daling': '5.43%'}, {'City': ' Drechterland', '% populatie stijging/daling': '0.62%'}, {'City': ' Edam-Volendam', '% populatie stijging/daling': '0.27%'}, {'City': ' Enkhuizen', '% populatie stijging/daling': '0.45%'}, {'City': ' Gooise Meren', '% populati

In [74]:
pd.DataFrame(populatie_groei)

Unnamed: 0,City,% populatie stijging/daling
0,Aalsmeer,0.41%
1,Alkmaar,0.81%
2,Amstelveen,0.92%
3,Amsterdam,1.13%
4,Beemster,2.81%
...,...,...
347,Valkenburg aan de Geul,
348,Venlo,0.20%
349,Venray,0.66%
350,Voerendaal,0.18%
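
Growth (gegroeid) and decline (gekrompen) can also be stored in one signed column, which is easier to summarise than two separate text columns. A minimal sketch, assuming the page sentence contains "afgelopen jaar met ... gegroeid" or "... gekrompen" as in the regular expressions above. The sample sentences and the helper name `signed_growth` are our own assumptions.

```python
import re

# Hypothetical helper: positive percentage for growth, negative for decline, None if absent.
def signed_growth(text):
    match = re.search(r"afgelopen jaar met (.+?) (gegroeid|gekrompen)", text)
    if not match:
        return None
    try:
        value = float(match.group(1).strip().rstrip("%").replace(",", "."))
    except ValueError:
        return None  # the captured text was not a number
    return value if match.group(2) == "gegroeid" else -value

# Assumed sentences, for illustration only.
growth = signed_growth("De bevolking is het afgelopen jaar met 0.41% gegroeid.")
decline = signed_growth("De bevolking is het afgelopen jaar met 0.30% gekrompen.")
print(growth, decline)
```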


**Next we try to add the inwoners (inhabitants) and bevolkingsgroei (population growth) data to the main scraper. This cell is still under construction and not yet working as intended.**

In [None]:
def extract_city_trends(page_urls_full):
    trend_list = []
    for page_url in page_urls_full:
        driver.get(page_url)
        time.sleep(5) 
        soup = BeautifulSoup(driver.page_source, 'html.parser')
            # City name
        city_name = soup.find_all('h2')[0].get_text()
        city_name = city_name.replace('Woningmarkt','')
        city_name = city_name.replace(' ', '')
            # Gemiddelde vraagprijs
        content = soup.find_all(class_='trend-graph')[0]
        if content.find(class_="trend-graph-icon") == None:
            gem_vraagprijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_vraagprijs = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                gem_vraagprijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
                gem_vraagprijs = gem_vraagprijs.replace("(","")
                gem_vraagprijs = gem_vraagprijs.replace(",)","")
                gem_vraagprijs = gem_vraagprijs.replace(".", ",")
                tov_vorige_maand_vraagprijs = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace("\n\n","")
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace(" t.o.v. vorige maand\n","")
            else:
                gem_vraagprijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
                gem_vraagprijs = gem_vraagprijs.replace("(","")
                gem_vraagprijs = gem_vraagprijs.replace(",)","")    
                gem_vraagprijs = gem_vraagprijs.replace(".", ",")
                tov_vorige_maand_vraagprijs = content.find("div",{"class":"trend-graph-pill"}).get_text()
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace("\n\n","")
                tov_vorige_maand_vraagprijs = tov_vorige_maand_vraagprijs.replace(" t.o.v. vorige maand\n","")
            # Aantal verkochte woningen
        content = soup.find_all(class_='trend-graph')[1]
        if content.find(class_="trend-graph-icon") == None:
            verk_woningen = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_verkocht = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                verk_woningen = content.find("h3",{"class":"trend-graph-value"}).get_text()               
                tov_vorige_maand_verkocht = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace("\n\n","")
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace(" t.o.v. vorige maand\n","")
            else:
                verk_woningen = content.find("h3",{"class":"trend-graph-value"}).get_text()             
                tov_vorige_maand_verkocht = content.find("div",{"class":"trend-graph-pill"}).get_text()
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace("\n\n","")
                tov_vorige_maand_verkocht = tov_vorige_maand_verkocht.replace(" t.o.v. vorige maand\n","")
            # Gemiddelde vierkante meter prijs
        content = soup.find_all(class_='trend-graph')[2]
        if content.find(class_="trend-graph-icon") == None:
            m2_prijs = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_m2_prijs = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                m2_prijs = content.find("h3",{"class":"trend-graph-value"}).get_text()     
                m2_prijs = m2_prijs.replace(".", ",")
                tov_vorige_maand_m2_prijs = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace("\n\n","")
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace(" t.o.v. vorige maand\n","")
            else:
                m2_prijs = content.find("h3",{"class":"trend-graph-value"}).get_text() 
                m2_prijs = m2_prijs.replace(".", ",")
                tov_vorige_maand_m2_prijs = content.find("div",{"class":"trend-graph-pill"}).get_text() 
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace("\n\n","")
                tov_vorige_maand_m2_prijs = tov_vorige_maand_m2_prijs.replace(" t.o.v. vorige maand\n","")
            # Percentage overboden
        content = soup.find_all(class_='trend-graph')[3]
        if content.find(class_="trend-graph-icon") == None:
            perc_overboden = content.find("h3",{"class":"trend-graph-value"}).get_text()
            tov_vorige_maand_perc_overboden = "NA"
        else:
            if content.find(class_="trend-graph-pill trend-down") != None:
                perc_overboden = content.find("h3",{"class":"trend-graph-value"}).get_text()               
                tov_vorige_maand_perc_overboden = content.find("div",{"class":"trend-graph-pill trend-down"}).get_text()
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace("\n\n","")
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace(" t.o.v. vorige maand\n","")
            else:
                perc_overboden = content.find("h3",{"class":"trend-graph-value"}).get_text()             
                tov_vorige_maand_perc_overboden = content.find("div",{"class":"trend-graph-pill"}).get_text()
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace("\n\n","")
                tov_vorige_maand_perc_overboden = tov_vorige_maand_perc_overboden.replace(" t.o.v. vorige maand\n","")
            # Besteedbaar inkomen
        bes_inkomen = soup.find_all(class_='detail__income huizenzoeker-card single-value-graph-container')[0].get_text()
        bes_inkomen = bes_inkomen.replace('\n','')
        bes_inkomen = bes_inkomen.replace('Besteedbaar Inkomen Per Huishouden','')
        bes_inkomen = bes_inkomen.replace(".", ",")
            # Aantal inwoners (NEEDS TO BE ADJUSTED!)
        def extract_inwoners(page_urls):
            inwoners_city = []
            for page_url in page_urls:
                res = requests.get(page_url)
                soup = BeautifulSoup(res.text, 'html.parser')
                city_name = soup.find_all('h2')[0].get_text()
                city_name1 = city_name.replace('Woningmarkt','')
                inwoners = soup.find("div", {"class": "buurt-info"})
                new_inwoners = inwoners.find_all('p')[3].get_text() # call the method to get the paragraph text
                new_inwoners2 = str(new_inwoners)
                new_inwoners1 = re.search('Dat zijn(.+?)inwoners', new_inwoners2)
                found = 'NA'
                if new_inwoners1:
                    found = new_inwoners1.group(1)
                    found = found.strip()
                inwoners_city.append({'City':city_name1, 'Aantal inwoners':found})
            return(inwoners_city)
        inwoners_city = extract_inwoners([page_url]) # page_urls was undefined here; only the current page is needed
            # Bevolkings groei: % populatie stijging/daling (NEEDS TO BE ADJUSTED!)
        def extract_populatiegroei(page_urls):
            populatie_groei = []
            for page_url in page_urls:
                res = requests.get(page_url)
                soup = BeautifulSoup(res.text, 'html.parser')
                city_name = soup.find_all('h2')[0].get_text()
                city_name1 = city_name.replace('Woningmarkt','')
                populatiegroei = soup.find("div", {"class": "buurt-info"})
                new_populatiegroei = populatiegroei.find_all('p')[4].get_text() # call the method to get the paragraph text
                new_populatiegroei2 = str(new_populatiegroei)                 
                new_populatiegroei_increase = re.search('afgelopen jaar met (.+?) gegroeid', new_populatiegroei2)
                found_i = 'NA'
                if new_populatiegroei_increase:
                    found_i = new_populatiegroei_increase.group(1)
                    found_i = found_i.strip()
                new_populatiegroei_decline = re.search('afgelopen jaar met (.+?) gekrompen', new_populatiegroei2)
                found_d = 'NA'
                if new_populatiegroei_decline:
                    found_d = new_populatiegroei_decline.group(1)
                    found_d = found_d.strip()        
                populatie_groei.append({'City':city_name1, '% populatie stijging':found_i, '% populatie daling':found_d})
            return(populatie_groei)
        populatie_groei = extract_populatiegroei([page_url]) # only the current page, instead of re-scraping every URL on each iteration
            # Append list (NEEDS TO BE ADJUSTED!)
        trend_list.append({"City":city_name, 
                           "Gem. vraagprijs":gem_vraagprijs, "%Δ Vraagprijs (t.o.v vorige maand)": tov_vorige_maand_vraagprijs,
                           "Verkochte woningen":verk_woningen, "%Δ Verkochte woningen (t.o.v vorige maand)":tov_vorige_maand_verkocht,
                           "Gem. m2 prijs":m2_prijs, "%Δ M2 prijs (t.o.v vorige maand)":tov_vorige_maand_m2_prijs,
                           "% Vraagprijs overboden":perc_overboden, "%Δ Overboden (t.o.v vorige maand)":tov_vorige_maand_perc_overboden,
                           "Besteedbaar inkomen (per huishouden)":bes_inkomen, 'Inwoners':inwoners_city,'Bevolkingsgroei (t.o.v vorig jaar)':populatie_groei})
    return(trend_list)
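
An alternative to nesting the two scrapers inside extract_city_trends() is to keep the three scrapers separate and combine their results afterwards. A minimal sketch, assuming the three result lists look like the outputs shown earlier in this notebook. The city names differ in spacing between the scrapers (for example 'ValkenburgaandeGeul' versus ' Valkenburg aan de Geul'), so the sketch matches on a normalised name. The helpers `city_key` and `merge_by_city` are hypothetical.

```python
# Hypothetical helpers, not part of the original notebook.
def city_key(name):
    """Normalise a city name: remove all spaces and lower-case it."""
    return name.replace(" ", "").lower()

def merge_by_city(trends, *others):
    """Add the non-City fields of each record in `others` to the matching trend record."""
    lookup = {}
    for rows in others:
        for row in rows:
            extra = {key: value for key, value in row.items() if key != "City"}
            lookup.setdefault(city_key(row["City"]), {}).update(extra)
    return [{**trend, **lookup.get(city_key(trend["City"]), {})} for trend in trends]

# Small example with values taken from the tables in this notebook.
trend_rows = [{"City": "ValkenburgaandeGeul", "Verkochte woningen": "3"}]
inwoner_rows = [{"City": " Valkenburg aan de Geul", "Aantal inwoners": "16.367"}]
groei_rows = [{"City": " Valkenburg aan de Geul",
               "% populatie stijging": "NA", "% populatie daling": "NA"}]

merged = merge_by_city(trend_rows, inwoner_rows, groei_rows)
print(merged)
```

Matching on the page URL instead of the name would be more robust, but that would require each scraper to store the URL, which the current functions do not do.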

## Step 5: Exporting dataframe as CSV file

The final output is a CSV file, i.e. tabular data, so we want all output of the scraper gathered in one single structure: here the dataframe final_dataframe.

In [None]:
final_dataframe.to_csv('huizenzoeker_scraper_data.csv') 

## Step 6: Providing summary statistics

**Exporting our output to RStudio and then importing that CSV here**

We want to generate summary statistics from the scraped data, but we cannot do this directly: most variables are stored as characters (strings), while they should be numeric. We therefore exported final_dataframe to R, converted the data types there, and exported the result as a CSV file. We import that file here to generate the summary statistics: count, mean, std, min, max, 25%, 50%, 75%. 
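
As an alternative to the round trip through R, the conversion could also be done in Python. A minimal sketch with two hypothetical helpers, assuming the scraped values look like those in the dataframe shown earlier ('€ 725,000' for amounts in whole euros, '55.41%' for percentages, 'NA' for missing values):

```python
# Hypothetical helpers, not part of the original notebook.
def parse_euro(value):
    """'€ 725,000' -> 725000.0; None when the text contains no digits.
    Assumes whole-euro amounts, so every non-digit character can be dropped."""
    digits = "".join(ch for ch in str(value) if ch.isdigit())
    return float(digits) if digits else None

def parse_percent(value):
    """'-25.00%' -> -25.0; None for 'NA' or other text that is not a number."""
    try:
        return float(str(value).strip().rstrip("%"))
    except ValueError:
        return None

asking_price = parse_euro("€ 725,000")
price_change = parse_percent("55.41%")
sold_change = parse_percent("-25.00%")
print(asking_price, price_change, sold_change)

# With pandas these could be applied per column, for example:
# final_dataframe["Gem. vraagprijs"] = final_dataframe["Gem. vraagprijs"].map(parse_euro)
```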

In [35]:
huizenzoeker = pd.read_csv('huizenzoeker_data2.csv', encoding= 'latin-1') #Adding the latin encoding solved the UNICODE error 

In [36]:
huizenzoeker = pd.DataFrame(huizenzoeker)
huizenzoeker

Unnamed: 0,City,gem_vraagprijs,perc_ver_vraagprijs,verk_woningen,perc_ver_verkocht,gem_m2prijs,perc_ver_m2prijs,perc_overboden,perc_ver_overboden,best_inkomen
0,Aalsmeer,725000.0,55.41,9,-25.00,4297.0,3.19,7.25,-3.13,45800.0
1,Alkmaar,410000.0,43.86,24,-59.32,4013.0,13.94,12.67,1.53,36300.0
2,Amstelveen,700000.0,47.37,14,-68.89,5097.0,10.09,8.71,0.81,37800.0
3,Amsterdam,465000.0,9.41,166,-43.73,6993.0,5.52,15.72,1.81,30100.0
4,Beemster,675000.0,-3.23,3,-50.00,4299.0,-7.15,11.87,1.79,47300.0
...,...,...,...,...,...,...,...,...,...,...
347,ValkenburgaandeGeul,325000.0,-15.58,3,-66.67,2567.0,-14.83,7.19,10.08,35600.0
348,Venlo,379500.0,43.21,26,-50.00,2735.0,17.84,8.56,3.05,33700.0
349,Venray,297000.0,14.23,8,-27.27,2729.0,27.11,7.76,-1.93,39100.0
350,Voerendaal,287500.0,3.23,2,-66.67,2185.0,-12.53,9.78,-0.59,40800.0


In [37]:
huizenzoeker.describe() # returns the summary statistics for all municipalities in the Netherlands

Unnamed: 0,gem_vraagprijs,perc_ver_vraagprijs,verk_woningen,perc_ver_verkocht,gem_m2prijs,perc_ver_m2prijs,perc_overboden,perc_ver_overboden,best_inkomen
count,352.0,339.0,352.0,345.0,352.0,339.0,348.0,346.0,349.0
mean,375464.8,12.085398,13.428977,-35.358841,3157.700009,5.462271,9.44319,1.003844,40037.535817
std,172211.8,32.366931,20.007992,47.145009,1078.111676,17.085127,3.842578,3.379959,4728.094766
min,0.0,-40.86,0.0,-100.0,0.0,-31.75,-0.94,-9.5,25400.0
25%,295000.0,-8.47,5.0,-61.9,2612.0,-4.065,7.05,-0.785,36900.0
50%,350000.0,7.69,8.0,-46.15,3070.0,4.23,8.98,0.835,40400.0
75%,426200.0,25.035,14.0,-22.22,3653.75,12.695,12.24,2.615,43500.0
max,1847500.0,197.06,166.0,250.0,9262.0,122.21,22.32,19.86,53800.0


## Step 7: Scraping woningmarkt dashboard (province-level)

### Step 7a: First scraping the trend, and other data

Here we build the list of links to the province pages from a base URL and a hand-written list of province names, because we are not sure whether the site has an overview page of all provinces like the ones for the municipalities (collecting the links with Selenium might also be possible). 

Generating links of all provinces: 

In [22]:
base_url = 'https://www.huizenzoeker.nl/woningmarkt/'
province_url = ['noord-holland/', 'zuid-holland/', 'zeeland/', 'noord-brabant/', 'utrecht/', 'flevoland/', 
                'friesland/', 'groningen/', 'drenthe/', 'overijssel/', 'gelderland/', 'limburg/']

Defining a function to paste together these URL parts: 

In [23]:
def generate_links(base_url,province_url): 
    page_links = []
    for i in province_url:
        full_links = base_url + i
        page_links.append(full_links)  
    return page_links
page_links = generate_links(base_url,province_url)
print(page_links)

['https://www.huizenzoeker.nl/woningmarkt/noord-holland/', 'https://www.huizenzoeker.nl/woningmarkt/zuid-holland/', 'https://www.huizenzoeker.nl/woningmarkt/zeeland/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/', 'https://www.huizenzoeker.nl/woningmarkt/utrecht/', 'https://www.huizenzoeker.nl/woningmarkt/flevoland/', 'https://www.huizenzoeker.nl/woningmarkt/friesland/', 'https://www.huizenzoeker.nl/woningmarkt/groningen/', 'https://www.huizenzoeker.nl/woningmarkt/drenthe/', 'https://www.huizenzoeker.nl/woningmarkt/overijssel/', 'https://www.huizenzoeker.nl/woningmarkt/gelderland/', 'https://www.huizenzoeker.nl/woningmarkt/limburg/']
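
The same list can be built more compactly. Below is a minimal sketch using `urljoin` from the standard library and a list comprehension. The names `generate_links_compact`, `example_base` and `example_provinces` are hypothetical and not part of the original notebook, and only three provinces are listed to keep the example short:

```python
from urllib.parse import urljoin

example_base = 'https://www.huizenzoeker.nl/woningmarkt/'
example_provinces = ['noord-holland/', 'zuid-holland/', 'zeeland/']

def generate_links_compact(base_url, province_url):
    # urljoin resolves each province part against the base URL;
    # the base must end in '/' so the part is appended instead of
    # replacing the last path segment
    return [urljoin(base_url, part) for part in province_url]

example_links = generate_links_compact(example_base, example_provinces)
print(example_links)
```

Plain string concatenation, as in `generate_links()`, gives the same result here; `urljoin` mainly helps when the parts may or may not carry leading slashes.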


In [24]:
def clean_change(pill):
    #strip the line breaks and the 't.o.v. vorige maand' label from a trend pill
    text = pill.get_text()
    text = text.replace("\n\n","")
    return text.replace(" t.o.v. vorige maand\n","")

def extract_trend(content, strip_brackets=False, fix_separator=False):
    #return (value, change compared to previous month) for one 'trend-graph' block
    value = content.find("h3",{"class":"trend-graph-value"}).get_text()
    if content.find(class_="trend-graph-icon") is None:
        return value, "NA" #no trend icon, so no month-on-month change is shown
    if strip_brackets:
        value = value.replace("(","").replace(",)","")
    if fix_separator:
        value = value.replace(".", ",")
    if content.find(class_="trend-graph-pill trend-down") is not None:
        pill = content.find("div",{"class":"trend-graph-pill trend-down"})
    else:
        pill = content.find("div",{"class":"trend-graph-pill"})
    return value, clean_change(pill)

def extract_province_trends(page_links):
    #expects the Selenium `driver` to be running already
    trend_list = []
    for page_link in page_links:
        driver.get(page_link)
        time.sleep(5) #wait for the page to load
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # Province name
        province_name = soup.find_all('h2')[0].get_text()
        province_name = province_name.replace('Woningmarkt','')
        province_name = province_name.replace(" ",'') #remove the spaces around the province name
        # The four trend blocks, in the order in which they appear on the page
        graphs = soup.find_all(class_='trend-graph')
        # Gemiddelde vraagprijs
        gem_vraagprijs, tov_vorige_maand_vraagprijs = extract_trend(graphs[0], strip_brackets=True, fix_separator=True)
        # Aantal verkochte woningen
        verk_woningen, tov_vorige_maand_verkocht = extract_trend(graphs[1])
        # Gemiddelde vierkante meter prijs
        m2_prijs, tov_vorige_maand_m2_prijs = extract_trend(graphs[2], fix_separator=True)
        # Percentage overboden
        perc_overboden, tov_vorige_maand_perc_overboden = extract_trend(graphs[3])
        # Besteedbaar inkomen
        bes_inkomen = soup.find_all(class_='detail__income huizenzoeker-card single-value-graph-container')[0].get_text()
        bes_inkomen = bes_inkomen.replace('\n','')
        bes_inkomen = bes_inkomen.replace('Besteedbaar Inkomen Per Huishouden','')
        bes_inkomen = bes_inkomen.replace(".", ",")
        # Inwoners and bevolkingsgroei (still to be added)
        # Append list
        trend_list.append({"Province":province_name, 
                           "Gem. vraagprijs":gem_vraagprijs, "%Δ Vraagprijs (t.o.v vorige maand)": tov_vorige_maand_vraagprijs,
                           "Verkochte woningen":verk_woningen, "%Δ Verkochte woningen (t.o.v vorige maand)":tov_vorige_maand_verkocht,
                           "Gem. m2 prijs":m2_prijs, "%Δ M2 prijs (t.o.v vorige maand)":tov_vorige_maand_m2_prijs,
                           "% Vraagprijs overboden":perc_overboden, "%Δ Overboden (t.o.v vorige maand)":tov_vorige_maand_perc_overboden,
                           "Besteedbaar inkomen (per huishouden)":bes_inkomen})
    return trend_list

In [26]:
df1 = extract_province_trends(page_links) 
province_dataframe = pd.DataFrame(df1)

In [27]:
province_dataframe

Unnamed: 0,Province,Gem. vraagprijs,%Δ Vraagprijs (t.o.v vorige maand),Verkochte woningen,%Δ Verkochte woningen (t.o.v vorige maand),Gem. m2 prijs,%Δ M2 prijs (t.o.v vorige maand),% Vraagprijs overboden,%Δ Overboden (t.o.v vorige maand),Besteedbaar inkomen (per huishouden)
0,Noord-Holland,"€ 437,000",16.53%,812,-33.71%,"€ 4,508",10.76%,12.66%,1.08%,"€ 36,200"
1,Zuid-Holland,"€ 365,000",7.67%,1094,-43.58%,"€ 3,630",6.17%,10.21%,0.77%,"€ 35,800"
2,Zeeland,"€ 275,000",0.09%,145,-44.44%,"€ 2,663",3.62%,8.13%,0.13%,"€ 36,900"
3,Noord-Brabant,"€ 365,000",7.67%,635,-50.20%,"€ 3,218",6.10%,7.87%,0.88%,"€ 38,100"
4,Utrecht,"€ 439,000",14.03%,523,-17.77%,"€ 4,200",4.09%,11.97%,0.78%,"€ 39,500"
5,Flevoland,"€ 340,000",4.62%,127,-46.41%,"€ 2,970",0.34%,14.71%,1.08%,"€ 39,500"
6,Friesland,"€ 289,000",3.58%,286,-19.89%,"€ 2,446",-1.13%,10.64%,1.94%,"€ 34,900"
7,Groningen,"€ 255,000",13.33%,282,-16.07%,"€ 2,592",8.41%,15.76%,1.33%,"€ 30,600"
8,Drenthe,"€ 299,750",1.61%,208,-32.03%,"€ 2,475",0.69%,10.49%,1.41%,"€ 37,100"
9,Overijssel,"€ 300,000",0.67%,392,-24.47%,"€ 2,732",5.65%,9.84%,1.38%,"€ 36,900"


### Step 7b: Scrape some more woningmarkt dashboard data

To this dataframe, we now want to add more data from the woningmarkt dashboard per province, e.g. 'aantal geïnteresseerden per woning', huuraanbod, profiel huizenzoekers (?), over woningen...

For now, we already export this dataframe as a CSV file and convert the character columns to numerics in R, so that it becomes a usable dataset!
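
The conversion to numerics could also be done in Python before exporting. Below is a minimal sketch, assuming the scraped strings look like the ones in the province table above ('€ 437,000', '16.53%', 'NA'). The helper names `parse_euro` and `parse_percent` are hypothetical and not part of the original notebook:

```python
import math

def parse_euro(text):
    """Turn a string such as '€ 437,000' into the float 437000.0; 'NA' becomes NaN."""
    # drop the euro sign and the thousands separator, then trim whitespace
    cleaned = text.replace("€", "").replace(",", "").strip()
    return float(cleaned) if cleaned and cleaned != "NA" else math.nan

def parse_percent(text):
    """Turn a string such as '-33.71%' into the float -33.71; 'NA' becomes NaN."""
    cleaned = text.replace("%", "").strip()
    return float(cleaned) if cleaned and cleaned != "NA" else math.nan

print(parse_euro("€ 437,000"))
print(parse_percent("-33.71%"))
```

Applied to the dataframe, this could look like `province_dataframe["Gem. vraagprijs"].map(parse_euro)`, repeated for each price and percentage column.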

## Step 8: Exporting dashboard data as CSV

In [30]:
province_dataframe.to_csv(r'C:\Users\danie\OneDrive\Documents\Repositories\oDCM-project-team-3\src\collection\huizenzoeker_province_data.csv') #at province-level; the path depends on your own machine

In [31]:
huizenzoeker_province = pd.read_csv('huizenzoeker_province_data1.csv', encoding= 'latin-1') #load the cleaned CSV (numeric columns) back into Python

In [33]:
huizenzoeker_province = pd.DataFrame(huizenzoeker_province) #read_csv already returns a DataFrame, so this conversion is optional
huizenzoeker_province

Unnamed: 0,Province,gem_vraagprijs,perc_ver_vraagprijs,verk_woningen,perc_ver_verkocht,gem_m2prijs,perc_ver_m2prijs,perc_overboden,perc_ver_overboden,best_inkomen
0,Noord-Holland,437000.0,16.53,812,-33.71,4508,10.76,12.66,1.08,36200
1,Zuid-Holland,365000.0,7.67,1094,-43.58,3630,6.17,10.21,0.77,35800
2,Zeeland,275000.0,0.09,145,-44.44,2663,3.62,8.13,0.13,36900
3,Noord-Brabant,365000.0,7.67,635,-50.2,3218,6.1,7.87,0.88,38100
4,Utrecht,439000.0,14.03,523,-17.77,4200,4.09,11.97,0.78,39500
5,Flevoland,340000.0,4.62,127,-46.41,2970,0.34,14.71,1.08,39500
6,Friesland,289000.0,3.58,286,-19.89,2446,-1.13,10.64,1.94,34900
7,Groningen,255000.0,13.33,282,-16.07,2592,8.41,15.76,1.33,30600
8,Drenthe,299750.0,1.61,208,-32.03,2475,0.69,10.49,1.41,37100
9,Overijssel,300000.0,0.67,392,-24.47,2732,5.65,9.84,1.38,36900


In [34]:
huizenzoeker_province.describe()

Unnamed: 0,gem_vraagprijs,perc_ver_vraagprijs,verk_woningen,perc_ver_verkocht,gem_m2prijs,perc_ver_m2prijs,perc_overboden,perc_ver_overboden,best_inkomen
count,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0,12.0
mean,333645.833333,6.559167,454.75,-34.221667,3086.416667,4.365833,10.975,1.0,36483.333333
std,61324.440695,5.593182,293.471735,12.065189,693.881108,3.441874,2.435407,0.478957,2394.627826
min,255000.0,0.09,127.0,-50.2,2446.0,-1.13,7.87,0.13,30600.0
25%,286500.0,1.57,263.5,-43.795,2562.75,1.9875,9.5975,0.76,35575.0
50%,320000.0,6.04,358.0,-36.47,2851.0,4.68,10.52,0.98,36900.0
75%,365000.0,9.085,630.5,-23.325,3321.0,6.1175,12.1425,1.3425,37650.0
max,439000.0,16.53,1094.0,-16.07,4508.0,10.76,15.76,1.94,39500.0
