# Scraping www.huizenzoeker.nl/woningmarkt/ for all municipalities in the Netherlands

### ODCM project - Team 3 

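Which municipalities in the Netherlands are hit hardest by the Dutch housing crisis, and which the least? 
We use www.huizenzoeker.nl/woningmarkt/ to analyse the Dutch housing market through four indicators per municipality: the average asking price (gem. vraagprijs), the number of homes sold (verkochte woningen), the average price per square metre (gem. vierkantemeterprijs), and the percentage overbid on the asking price (% overboden). 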

## Step 1: Loading all the basics

In [1]:
#import the packages (after you have installed them properly)
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd 
import time 
from functools import reduce
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# pip install webdriver-manager
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys

In [2]:
#set the basis for BeautifulSoup 
url = 'https://www.huizenzoeker.nl/woningmarkt/'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')

In [3]:
#set the basis for Selenium
# raw string, so the backslashes in the Windows path are not read as escape sequences
chrome_path = r"C:\Documents\MSc_Marketing_Analytics\oDCM\oDCM-project-team-3\src\collection\chromedriver.exe" 
#depends on your own path

## Step 2: Collecting the URLs

### Step 2a: Extracting first the provinces, and then all the municipalities 

Here we use Selenium to collect the URLs of all municipalities in each province.

We first define a base URL (`base_url`) and a list of province slugs (`province_url`). Appended together, they form the URL of the woningmarkt page of each province. The `generate_links()` function does the appending and returns the full list of province URLs. 

In [4]:
base_url = 'https://www.huizenzoeker.nl/woningmarkt/'

In [5]:
province_url = ['noord-holland/', 'zuid-holland/', 'zeeland/', 'noord-brabant/', 'utrecht/', 'flevoland/', 
                'friesland/', 'groningen/', 'drenthe/', 'overijssel/', 'gelderland/', 'limburg/']

In [6]:
def generate_links(base_url,province_url): 
    page_links = []
    for i in province_url:
        full_links = base_url + i
        page_links.append(full_links)  
    return page_links

In [7]:
page_links = generate_links(base_url,province_url)
print(page_links)

['https://www.huizenzoeker.nl/woningmarkt/noord-holland/', 'https://www.huizenzoeker.nl/woningmarkt/zuid-holland/', 'https://www.huizenzoeker.nl/woningmarkt/zeeland/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/', 'https://www.huizenzoeker.nl/woningmarkt/utrecht/', 'https://www.huizenzoeker.nl/woningmarkt/flevoland/', 'https://www.huizenzoeker.nl/woningmarkt/friesland/', 'https://www.huizenzoeker.nl/woningmarkt/groningen/', 'https://www.huizenzoeker.nl/woningmarkt/drenthe/', 'https://www.huizenzoeker.nl/woningmarkt/overijssel/', 'https://www.huizenzoeker.nl/woningmarkt/gelderland/', 'https://www.huizenzoeker.nl/woningmarkt/limburg/']
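The same list can also be built with the standard library instead of string concatenation. The following is only a sketch: the function name `generate_links_joined` is hypothetical and not part of this notebook, and it assumes the base URL ends with a slash, as it does here.

```python
from urllib.parse import urljoin


def generate_links_joined(base_url, province_slugs):
    """Join each province slug onto the base URL (hypothetical helper)."""
    # urljoin appends the slug only because base_url ends with "/";
    # without the trailing slash it would replace the last path segment.
    return [urljoin(base_url, slug) for slug in province_slugs]


links = generate_links_joined(
    'https://www.huizenzoeker.nl/woningmarkt/',
    ['noord-holland/', 'zeeland/'],
)
print(links)
```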


We then use this list of province pages to extract the municipalities of each province. Every province page is opened in its own browser tab, and we switch between the tabs through their window handles (**STATUS: this code doesn't fully work yet!!**)

In [465]:
driver = webdriver.Chrome()

In [466]:
for x in range(len(page_links)): #WARNING: running this code snippet will open one Chrome tab per province (12 in total)!!
    driver.get(page_links[x])
    if x < len(page_links) - 1: # open a new tab for every province except the last one
        driver.execute_script("window.open('');")
        driver.switch_to.window(driver.window_handles[x+1])

In [467]:
driver.window_handles

['CDwindow-8BD69957BDEA6A266D818319A3A6D587',
 'CDwindow-1B5A517E79478501796E4216E41383EA',
 'CDwindow-BF7E49698EC2348E729D195F26E199AE',
 'CDwindow-22EB0F76918939743E1AD4EFBEC02297',
 'CDwindow-B18C6781F81D22B030EE28D6988D595F',
 'CDwindow-E786440D60A63EE02AA49D60217A3BE5',
 'CDwindow-F7756562E785986090947DFCD91C6602',
 'CDwindow-30807A57591F8474AA53EA49FE21C3B3',
 'CDwindow-2B15C1EBA75087645E14CB83BF296494',
 'CDwindow-72A3421DF43D5284366FEB3EB687779B',
 'CDwindow-081C0571F8903949540841709CE5B375',
 'CDwindow-E33256578476C15942FB7047CFC1D995']

In [468]:
page_urls_full = []
for handle in driver.window_handles:        
    driver.switch_to.window(handle)
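    # find_elements(by, value) also works in Selenium 4, where find_elements_by_xpath is no longer available
    elem1 = driver.find_elements("xpath", "//li//div//a[@href]")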
    
    for elem in elem1:
        urls = elem.get_attribute('href')
        page_urls_full.append(urls)   
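Since the tab-based approach is marked as not fully working, one possible alternative is to visit the province pages one after another in a single tab. This is a sketch under assumptions, not the method used in this notebook: `collect_municipality_links` is a hypothetical name, and `driver` is expected to behave like the Selenium WebDriver created above.

```python
def collect_municipality_links(driver, province_links, xpath="//li//div//a[@href]"):
    """Visit each province page in the same tab and gather all matching hrefs."""
    collected = []
    for link in province_links:
        driver.get(link)
        # The locator strategy is passed as a plain string ("xpath")
        for elem in driver.find_elements("xpath", xpath):
            href = elem.get_attribute("href")
            if href:  # skip anchors without a usable href
                collected.append(href)
    return collected
```

With the objects defined earlier, this could be called as `page_urls_full = collect_municipality_links(driver, page_links)`, which avoids opening 12 tabs.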

In [469]:
page_urls_full

['https://www.huizenzoeker.nl/woningmarkt/noord-holland/aalsmeer/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/alkmaar/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/amstelveen/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/amsterdam/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/beemster/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/bergen-nh/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/beverwijk/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/blaricum/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/bloemendaal/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/castricum/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/den-helder/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/diemen/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/drechterland/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/edam-volendam/',
 'https://www.huizenzoeker.nl/w

In [408]:
subset = page_urls_full[30:33] # defined subset to try out on few urls first
subset

['https://www.huizenzoeker.nl/woningmarkt/noord-holland/medemblik/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/oostzaan/',
 'https://www.huizenzoeker.nl/woningmarkt/noord-holland/opmeer/']
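Because the XPath `//li//div//a[@href]` may also match links that are not municipality pages, the collected list can be filtered before scraping. The sketch below assumes that municipality pages always sit exactly two levels below `/woningmarkt/` (province, then municipality), which is inferred from the URLs shown above. `keep_municipality_urls` is a hypothetical helper.

```python
from urllib.parse import urlparse


def keep_municipality_urls(urls, prefix="/woningmarkt/"):
    """Keep unique URLs of the form /woningmarkt/<province>/<municipality>/."""
    seen = set()
    kept = []
    for url in urls:
        path = urlparse(url).path
        if not path.startswith(prefix):
            continue
        # Path segments after the prefix, e.g. ['noord-holland', 'aalsmeer']
        segments = [s for s in path[len(prefix):].split("/") if s]
        if len(segments) == 2 and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept
```

The order of the input list is preserved, so a slice such as `subset` still refers to neighbouring municipalities.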

## Step 3: Scraping data from each URL

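For each municipality we now extract:
* *trend data*: average asking price (gem. vraagprijs), homes sold (verkochte woningen), average price per square metre (gem. vierkantemeterprijs) and % overbid (% overboden), plus how each of these numbers has changed compared with the previous month (t.o.v. vorige maand) 
* *other information*: disposable income (besteedbaar inkomen), number of inhabitants (aantal inwoners), % population growth or decline (% populatiegroei/daling)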

#### Warning: Scraping all municipalities takes approximately 30 minutes. To try the code out first, use `subset` instead of `page_urls_full`.

In [475]:
# Helper: read one 'trend-graph' block and return (value, change vs. previous month)
def extract_trend(content, clean_price=False):
    value = content.find("h3", {"class": "trend-graph-value"}).get_text()
    if clean_price:
        # Drop stray brackets and use "," as the thousands separator
        value = value.replace("(", "").replace(",)", "").replace(".", ",")
    # Without a trend icon there is no change to report for this indicator
    if content.find(class_="trend-graph-icon") is None:
        return value, "NA"
    # The class "trend-graph-pill" also matches "trend-graph-pill trend-down"
    change = content.find("div", {"class": "trend-graph-pill"}).get_text()
    change = change.replace("\n\n", "").replace(" t.o.v. vorige maand\n", "")
    return value, change

def extract_city_trends(page_urls):
    trend_list = []
    for page_url in page_urls:
        driver.get(page_url)
        time.sleep(5)  # pause before reading the page source
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # City name
        city_name = soup.find_all('h2')[0].get_text()
        city_name = city_name.replace('Woningmarkt', '').replace(' ', '')
        # Trend blocks, in page order: gem. vraagprijs, verkochte woningen, gem. m2 prijs, % overboden
        graphs = soup.find_all(class_='trend-graph')
        gem_vraagprijs, tov_vorige_maand_vraagprijs = extract_trend(graphs[0], clean_price=True)
        verk_woningen, tov_vorige_maand_verkocht = extract_trend(graphs[1])
        m2_prijs, tov_vorige_maand_m2_prijs = extract_trend(graphs[2], clean_price=True)
        perc_overboden, tov_vorige_maand_perc_overboden = extract_trend(graphs[3])
        # Besteedbaar inkomen
        bes_inkomen = soup.find_all(class_='detail__income huizenzoeker-card single-value-graph-container')[0].get_text()
        bes_inkomen = bes_inkomen.replace('\n','')
        bes_inkomen = bes_inkomen.replace('Besteedbaar Inkomen Per Huishouden','')
        bes_inkomen = bes_inkomen.replace(".", ",")
        # Append list
        trend_list.append({"City":city_name, 
                           "Gem. vraagprijs":gem_vraagprijs, "%Δ Vraagprijs (t.o.v vorige maand)": tov_vorige_maand_vraagprijs,
                           "Verkochte woningen":verk_woningen, "%Δ Verkochte woningen (t.o.v vorige maand)":tov_vorige_maand_verkocht,
                           "Gem. m2 prijs":m2_prijs, "%Δ M2 prijs (t.o.v vorige maand)":tov_vorige_maand_m2_prijs,
                           "% Vraagprijs overboden":perc_overboden, "%Δ Overboden (t.o.v vorige maand)":tov_vorige_maand_perc_overboden,
                           "Besteedbaar inkomen (per huishouden)":bes_inkomen})
    return trend_list

# Aantal inwoners (separate function, fetched with requests)
def extract_inwoners(page_urls):
    inwoners_city = []
    for page_url in page_urls:
        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, 'html.parser')
        city_name = soup.find_all('h2')[0].get_text()
        city_name1 = city_name.replace('Woningmarkt','')
        inwoners = soup.find("div", {"class": "buurt-info"})
        new_inwoners = inwoners.find_all('p')[3].get_text()  # call get_text() to get the paragraph text
        new_inwoners1 = re.search('Dat zijn(.+?)inwoners', new_inwoners)
        found = 'NA'
        if new_inwoners1:
            found = new_inwoners1.group(1).strip()
        inwoners_city.append({'City':city_name1, 'Aantal inwoners':found})
    return inwoners_city
inwoners_city = extract_inwoners(page_urls_full)

# % populatie stijging/daling
def extract_populatiegroei(page_urls):
    populatie_groei = []
    for page_url in page_urls:
        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, 'html.parser')
        city_name = soup.find_all('h2')[0].get_text()
        city_name1 = city_name.replace('Woningmarkt','')
        populatiegroei = soup.find("div", {"class": "buurt-info"})
        new_populatiegroei = populatiegroei.find_all('p')[4].get_text()
        # One pattern with an alternation, so both growth (gegroeid) and decline (gekrompen) match
        new_populatiegroei1 = re.search('afgelopen jaar met (.+?) (?:gegroeid|gekrompen)', new_populatiegroei)
        found = 'NA'
        if new_populatiegroei1:
            found = new_populatiegroei1.group(1).strip()
        populatie_groei.append({'City':city_name1, '% populatie stijging/daling':found})
    return populatie_groei
populatie_groei = extract_populatiegroei(page_urls_full)

In [473]:
df = extract_city_trends(page_urls_full) 
pd.DataFrame(df)

| | City | Gem. vraagprijs | %Δ Vraagprijs (t.o.v vorige maand) | Verkochte woningen | %Δ Verkochte woningen (t.o.v vorige maand) | Gem. m2 prijs | %Δ M2 prijs (t.o.v vorige maand) | % Vraagprijs overboden | %Δ Overboden (t.o.v vorige maand) | Besteedbaar inkomen (per huishouden) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aalsmeer | € 725,000 | 55.41% | 9 | -25.00% | € 4,297 | 3.19% | 7.25% | -3.13% | € 45,800 |
| 1 | Alkmaar | € 410,000 | 43.86% | 24 | -59.32% | € 4,013 | 13.94% | 12.67% | 1.53% | € 36,300 |
| 2 | Amstelveen | € 700,000 | 47.37% | 14 | -68.89% | € 5,097 | 10.09% | 8.71% | 0.81% | € 37,800 |
| 3 | Amsterdam | € 465,000 | 9.41% | 166 | -43.73% | € 6,993 | 5.52% | 15.72% | 1.81% | € 30,100 |
| 4 | Beemster | € 675,000 | -3.23% | 3 | -50.00% | € 4,299 | -7.15% | 11.87% | 1.79% | € 47,300 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 347 | ValkenburgaandeGeul | € 325,000 | -15.58% | 3 | -66.67% | € 2,567 | -14.83% | 7.19% | 10.08% | € 35,600 |
| 348 | Venlo | € 379,500 | 43.21% | 26 | -50.00% | € 2,735 | 17.84% | 8.56% | 3.05% | € 33,700 |
| 349 | Venray | € 297,000 | 14.23% | 8 | -27.27% | € 2,729 | 27.11% | 7.76% | -1.93% | € 39,100 |
| 350 | Voerendaal | € 287,500 | 3.23% | 2 | -66.67% | € 2,185 | -12.53% | 9.78% | -0.59% | € 40,800 |
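All values in the output above are still strings. To rank municipalities they would need to be converted to numbers first. The following is a minimal sketch, assuming prices are whole euros with "," as the thousands separator and percentages use "." as the decimal separator, as in the rows shown. `parse_euro` and `parse_percent` are hypothetical helpers.

```python
import re


def parse_euro(text):
    """Convert a price string such as '€ 725,000' to an integer, or None if it has no digits."""
    digits = re.sub(r"[^\d]", "", text)  # keep the digits only
    return int(digits) if digits else None


def parse_percent(text):
    """Convert a percentage string such as '-25.00%' to a float, or None (e.g. for 'NA')."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None
```

Returning `None` for unparseable values keeps the "NA" entries distinguishable from a real zero.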


### Inhabitants

*Insert code for this still*
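Until that code is added, the two regular expressions used elsewhere in this notebook (`Dat zijn(.+?)inwoners` and the `gegroeid`/`gekrompen` pattern) can be tried out on plain text. This is only a sketch: the function names are hypothetical, the sentences in the usage note are made up for illustration, and prefixing a decline with a minus sign is a design choice of this example rather than something the notebook does.

```python
import re


def find_inwoners(text):
    """Return the number of inhabitants mentioned in the text, or 'NA'."""
    match = re.search(r"Dat zijn(.+?)inwoners", text)
    return match.group(1).strip() if match else "NA"


def find_populatiegroei(text):
    """Return the population change; a decline (gekrompen) gets a leading minus sign."""
    match = re.search(r"afgelopen jaar met (.+?) (gegroeid|gekrompen)", text)
    if not match:
        return "NA"
    sign = "-" if match.group(2) == "gekrompen" else ""
    return sign + match.group(1).strip()
```

For example, `find_inwoners("Dat zijn 1.234 inwoners.")` gives `'1.234'`, and `find_populatiegroei("afgelopen jaar met 0,5% gekrompen")` gives `'-0,5%'`.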

## Step 5: Creating the CSV file

For the final output (a CSV file, i.e. tabular data), we want everything the scraper collects for a municipality to be gathered in one single dictionary, so that each municipality becomes one row.

*Insert code for this still*
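One way this step could look is sketched below with the standard-library `csv` module. It assumes each scraper returns a list of dictionaries with a `'City'` key, as the functions in this notebook do. The names `merge_by_city` and `write_csv` are hypothetical. Spaces are removed from the city names because `extract_city_trends` strips them while the other functions do not.

```python
import csv


def merge_by_city(*record_lists):
    """Merge several lists of dicts into one dict per city, keyed on 'City'."""
    merged = {}
    for records in record_lists:
        for record in records:
            # Remove spaces so names from the different scrapers line up
            key = record["City"].replace(" ", "")
            row = merged.setdefault(key, {"City": key})
            row.update({k: v for k, v in record.items() if k != "City"})
    return list(merged.values())


def write_csv(rows, path):
    """Write the merged rows to a CSV file with one column per key."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(path, "w", newline="", encoding="utf-8") as f:
        # restval fills in "NA" when a city is missing one of the columns
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="NA")
        writer.writeheader()
        writer.writerows(rows)
```

A call could then look like `write_csv(merge_by_city(trend_list, inwoners_city, populatie_groei), "woningmarkt.csv")`, where the file name is only an example.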