# Scraping www.huizenzoeker.nl/woningmarkt/ for all municipalities in the Netherlands

### ODCM project - Team 3 

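Which municipalities in the Netherlands are hit hardest by the Dutch housing crisis, and which the least? 
We use www.huizenzoeker.nl/woningmarkt/ to analyse the Dutch housing market. For each municipality we collect the gem. vraagprijs (average asking price), the # verkochte woningen (number of homes sold), the gem. vierkantemeter prijs (average price per square metre) and the % overboden (percentage bid over the asking price). 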

## Step 1: Loading all the basics

Here we import the packages we will use. We then set the URL, request the page and parse the response with BeautifulSoup's html.parser.

In [3]:
#import the packages (after you have installed them properly)
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd 
from functools import reduce

In [4]:
#set the basis for BeautifulSoup 
url = 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/veldhoven/'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')

## Step 2: Collecting the URLs 

Here we define a base_url, which is the same for all municipalities in all provinces of the Netherlands. We then define city_url, a list of paths that each name a province and a municipality in that province. 

In [5]:
base_url = 'https://www.huizenzoeker.nl/woningmarkt/' 

**Noord-Brabant URLS**

The first city_url is a short list used to test the code (less time-consuming). The second city_url covers all municipalities of Noord-Brabant and is used for the end product. Running the second cell overwrites the first. 

In [6]:
city_url = ['noord-brabant/veldhoven/','noord-brabant/best/','noord-brabant/eindhoven/'] #to test out the code (less time-consuming)

In [7]:
city_url = ['noord-brabant/alphen-chaam/','noord-brabant/altena/', 'noord-brabant/asten/','noord-brabant/baarle-nassau/','noord-brabant/bergeijk/','noord-brabant/bergen-op-zoom/','noord-brabant/bernheze/','noord-brabant/best/','noord-brabant/bladel/','noord-brabant/boekel/','noord-brabant/boxmeer/','noord-brabant/boxtel/','noord-brabant/breda/','noord-brabant/cranendonck/','noord-brabant/cuijk/','noord-brabant/deurne/','noord-brabant/dongen/','noord-brabant/drimmelen/','noord-brabant/eersel/','noord-brabant/eindhoven/','noord-brabant/etten-leur/','noord-brabant/geertruidenberg/','noord-brabant/geldrop-mierlo/','noord-brabant/gemert-bakel/','noord-brabant/gilze-en-rijen/','noord-brabant/goirle/','noord-brabant/grave/','noord-brabant/halderberge/','noord-brabant/heeze-leende/','noord-brabant/helmond/','noord-brabant/heusden/','noord-brabant/hilvarenbeek/', 'noord-brabant/laarbeek/','noord-brabant/landerd/','noord-brabant/loon-op-zand/','noord-brabant/meierijstad/','noord-brabant/mill-en-sint-hubert/','noord-brabant/moerdijk/','noord-brabant/nuenen,-gerwen-en-nederwetten/','noord-brabant/oirschot/','noord-brabant/oisterwijk/','noord-brabant/oosterhout/','noord-brabant/oss/','noord-brabant/reusel-de-mierden/','noord-brabant/roosendaal/','noord-brabant/rucphen/','noord-brabant/s-hertogenbosch/','noord-brabant/sint-anthonis/','noord-brabant/sint-michielsgestel/','noord-brabant/someren/','noord-brabant/son-en-breugel/','noord-brabant/steenbergen/','noord-brabant/tilburg/','noord-brabant/uden/','noord-brabant/valkenswaard/','noord-brabant/veldhoven/','noord-brabant/vught/','noord-brabant/waalre/','noord-brabant/waalwijk/','noord-brabant/woensdrecht/','noord-brabant/zundert/']

We define a function 'generate_page_urls' that joins the base_url and each entry of city_url into a complete URL for every municipality in Noord-Brabant. 

In [8]:
def generate_page_urls(base_url,city_url):
    page_urls = []
    for i in city_url:
        full_url = base_url + i
        page_urls.append(full_url)
    return page_urls

In [9]:
page_urls = generate_page_urls(base_url,city_url) #use the function on the base_url and city_url 
print(page_urls) #show that it has worked

['https://www.huizenzoeker.nl/woningmarkt/noord-brabant/alphen-chaam/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/altena/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/asten/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/baarle-nassau/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/bergeijk/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/bergen-op-zoom/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/bernheze/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/best/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/bladel/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/boekel/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/boxmeer/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/boxtel/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/breda/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/cranendonck/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/cuijk/'
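
The same joining step can also be written as a list comprehension. The sketch below is an alternative, not the function used in this notebook: `build_page_urls` and `sample_paths` are hypothetical names, and it relies on `urllib.parse.urljoin` from the standard library. It assumes the base URL ends with a slash, because otherwise `urljoin` replaces the last path segment instead of appending to it.

```python
from urllib.parse import urljoin

def build_page_urls(base_url, paths):
    """Join each relative path onto the base URL (hypothetical helper)."""
    return [urljoin(base_url, path) for path in paths]

base_url = 'https://www.huizenzoeker.nl/woningmarkt/'
sample_paths = ['noord-brabant/veldhoven/', 'noord-brabant/best/']

page_urls_sample = build_page_urls(base_url, sample_paths)
print(page_urls_sample)
```

For the paths used here, the result should match what generate_page_urls produces by plain string concatenation.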

## Step 3a: Gather all data for all municipalities: Trend data

For each municipality we now extract the gem. vraagprijs, verkochte woningen, gem. vierkantemeter prijs and % overboden, and how these numbers have changed compared to the previous month (t.o.v. vorige maand).

### Gemiddelde vraagprijs (Quadrant 1 Trend data)

Extract the gemiddelde vraagprijs, and its % change t.o.v. vorige maand. 

**Noord-Brabant data**

We define a function extract_city_trends that extracts the data of the first quadrant for all municipalities. We adjusted the function to put the % change compared to last month in a separate column, which looks better in the pandas dataframe and makes the data easier to analyse. 

In [10]:
def extract_city_trends(page_urls):
    trend_list = []
    for page_url in page_urls:
        res = requests.get(page_url) #feedback suggested driver.get(page_url) plus time.sleep(5); that seems to apply only to a Selenium-based scraper, so it is not implemented here
        soup = BeautifulSoup(res.text, 'html.parser')
        city_name = soup.find_all('h2')[0].get_text()
        city_name1 = city_name.replace('Woningmarkt','')
        trends = soup.find_all(class_='trend-graph')[0].get_text()
        new_trend = trends.replace('\n','')
        new_trend1 = new_trend.replace('Gem. Vraagprijs','')
        price,change = new_trend1.split('                ',1)
        change1 = change.replace(' t.o.v. vorige maand            ', '')
        trend_list.append({'City':city_name1, 'Gem. vraagprijs':price, '%Δ Vraagprijs (t.o.v vorige maand)':change1})
    return(trend_list)
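
Each extraction function in this notebook requests every page again, one request directly after the other. A possible refinement, sketched below under assumptions, is to download each page once with a short pause between requests and reuse the stored HTML for every quadrant. `fetch_all`, `fake_fetch` and `delay_seconds` are hypothetical names, and the example.org URLs are placeholders. In the notebook the `fetch` argument could be something like `lambda url: requests.get(url, timeout=10).text`.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) once per URL, pausing between consecutive requests."""
    pages = {}
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        pages[url] = fetch(url)
    return pages

# Stand-in for a real download, so the sketch runs without network access.
def fake_fetch(url):
    return '<html>' + url + '</html>'

pages = fetch_all(
    ['https://example.org/a/', 'https://example.org/b/'],
    fake_fetch,
    delay_seconds=0.01,
)
print(len(pages))
```

The returned dictionary maps each URL to its HTML, which can then be passed to BeautifulSoup as often as needed without contacting the site again.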

In [11]:
gem_vraagprijs = extract_city_trends(page_urls) #use the function
print(gem_vraagprijs) #check if it works

[{'City': ' Alphen-Chaam', 'Gem. vraagprijs': '€\xa0304.500', '%Δ Vraagprijs (t.o.v vorige maand)': '-18.69%'}, {'City': ' Altena', 'Gem. vraagprijs': '€\xa0349.000', '%Δ Vraagprijs (t.o.v vorige maand)': '-8.16%'}, {'City': ' Asten', 'Gem. vraagprijs': '€\xa0465.000', '%Δ Vraagprijs (t.o.v vorige maand)': '27.40%'}, {'City': ' Baarle-Nassau', 'Gem. vraagprijs': '€\xa0389.000', '%Δ Vraagprijs (t.o.v vorige maand)': '21.56%'}, {'City': ' Bergeijk', 'Gem. vraagprijs': '€\xa0325.000', '%Δ Vraagprijs (t.o.v vorige maand)': '-17.72%'}, {'City': ' Bergen op Zoom', 'Gem. vraagprijs': '€\xa0286.000', '%Δ Vraagprijs (t.o.v vorige maand)': '-11.04%'}, {'City': ' Bernheze', 'Gem. vraagprijs': '€\xa0325.000', '%Δ Vraagprijs (t.o.v vorige maand)': '-28.88%'}, {'City': ' Best', 'Gem. vraagprijs': '€\xa0345.000', '%Δ Vraagprijs (t.o.v vorige maand)': '-21.50%'}, {'City': ' Bladel', 'Gem. vraagprijs': '€\xa0320.000', '%Δ Vraagprijs (t.o.v vorige maand)': '-20.00%'}, {'City': ' Boekel', 'Gem. vraagprij
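
The split in extract_city_trends depends on a fixed run of spaces between the value and the change. A minimal sketch of a more tolerant variant is shown below. It splits on any run of ordinary whitespace but leaves the non-breaking space inside the euro amount intact. `parse_trend` is a hypothetical helper, and the `sample` string is an assumed shape of the raw block text, reconstructed from the replace and split calls above rather than copied from the site.

```python
import re

def parse_trend(raw_text, label):
    """Split the text of one trend block into (value, change)."""
    text = raw_text.replace(label, '', 1)
    text = text.replace('t.o.v. vorige maand', '')
    # Split on ordinary whitespace only, so the non-breaking space
    # between the euro sign and the amount is kept.
    parts = [part for part in re.split(r'[ \t\r\n]+', text) if part]
    if len(parts) != 2:
        raise ValueError('unexpected trend text: ' + repr(raw_text))
    return parts[0], parts[1]

# Assumed raw text of the first trend block (not copied from the site).
sample = '\nGem. Vraagprijs\n€\xa0304.500                -18.69% t.o.v. vorige maand            \n'
value, change = parse_trend(sample, 'Gem. Vraagprijs')
print(value, change)
```

Raising an error when the text does not split into exactly two parts makes a layout change on the site visible immediately, instead of silently producing shifted columns.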

We generate a Pandas dataframe of the data we just collected for all municipalities.

In [12]:
pd.DataFrame(gem_vraagprijs)

Unnamed: 0,City,Gem. vraagprijs,%Δ Vraagprijs (t.o.v vorige maand)
0,Alphen-Chaam,€ 304.500,-18.69%
1,Altena,€ 349.000,-8.16%
2,Asten,€ 465.000,27.40%
3,Baarle-Nassau,€ 389.000,21.56%
4,Bergeijk,€ 325.000,-17.72%
...,...,...,...
56,Vught,€ 375.000,-18.43%
57,Waalre,€ 447.500,-33.70%
58,Waalwijk,€ 285.000,-8.06%
59,Woensdrecht,€ 309.000,21.41%


### # Verkochte woningen (Quadrant 2 Trend data)

Extract the # of verkochte woningen, and its % change t.o.v. vorige maand. 

**Noord-Brabant data**

Again we use a similar function, but now we use it to select data from a different quadrant. 

In [13]:
def extract_city_trends1(page_urls):
    trend_list1 = []
    for page_url in page_urls:
        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, 'html.parser')
        city_name = soup.find_all('h2')[0].get_text()
        city_name1 = city_name.replace('Woningmarkt','')
        trends = soup.find_all(class_='trend-graph')[1].get_text()
        new_trend = trends.replace('\n','')
        new_trend1 = new_trend.replace('Verkochte woningen','')
        verkocht,change = new_trend1.split('                ',1)
        change1 = change.replace(' t.o.v. vorige maand            ', '')
        trend_list1.append({'City':city_name1, 'Verkochte woningen':verkocht, '%Δ Verkocht (t.o.v vorige maand)':change1})
    return(trend_list1)
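
The four quadrant functions differ only in the index of the trend block, the label that is removed and the column names. The sketch below shows one way to fold them into a single table-driven helper that builds one record per page. `QUADRANTS` and `build_trend_row` are hypothetical names, and the sample texts are assumed shapes of the raw block text. In the notebook, `trend_texts` could come from `[t.get_text() for t in soup.find_all(class_='trend-graph')]`. Unlike the functions in this notebook, the helper also strips the leading space from the municipality name.

```python
import re

# (label on the page, value column, change column), in trend block order
QUADRANTS = [
    ('Gem. Vraagprijs', 'Gem. vraagprijs', '%Δ Vraagprijs (t.o.v vorige maand)'),
    ('Verkochte woningen', 'Verkochte woningen', '%Δ Verkocht (t.o.v vorige maand)'),
    ('Gem. vierkantemeter prijs', 'Gem. m2 prijs', '%Δ M2 prijs (t.o.v vorige maand)'),
    ('Percentage overboden', '% Vraagprijs overboden', '%Δ Overboden (t.o.v vorige maand)'),
]

def build_trend_row(heading, trend_texts):
    """Turn one page's heading and its four trend texts into a single record."""
    row = {'City': heading.replace('Woningmarkt', '').strip()}
    for (label, value_col, change_col), raw in zip(QUADRANTS, trend_texts):
        text = raw.replace('\n', '').replace(label, '').strip(' ')
        # The value and the change are separated by a run of two or more spaces.
        value, change = re.split(r' {2,}', text, maxsplit=1)
        row[value_col] = value
        row[change_col] = change.replace('t.o.v. vorige maand', '').strip(' ')
    return row

# Assumed raw texts of the four trend blocks of one page.
sample_texts = [
    '\nGem. Vraagprijs\n€\xa0304.500                -18.69% t.o.v. vorige maand            \n',
    '\nVerkochte woningen\n2                -75.00% t.o.v. vorige maand            \n',
    '\nGem. vierkantemeter prijs\n€\xa02.430                -23.27% t.o.v. vorige maand            \n',
    '\nPercentage overboden\n1.63%                -2.14% t.o.v. vorige maand            \n',
]
row = build_trend_row('Woningmarkt Alphen-Chaam', sample_texts)
print(row)
```

With this layout every page needs to be parsed only once, and the records already contain all trend columns, so no merge of the four quadrants is needed afterwards.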

In [14]:
verk_woningen = extract_city_trends1(page_urls) #use the function
print(verk_woningen) #check if it works 

[{'City': ' Alphen-Chaam', 'Verkochte woningen': '2', '%Δ Verkocht (t.o.v vorige maand)': '-75.00%'}, {'City': ' Altena', 'Verkochte woningen': '22', '%Δ Verkocht (t.o.v vorige maand)': '-50.00%'}, {'City': ' Asten', 'Verkochte woningen': '7', '%Δ Verkocht (t.o.v vorige maand)': '-46.15%'}, {'City': ' Baarle-Nassau', 'Verkochte woningen': '3', '%Δ Verkocht (t.o.v vorige maand)': '-57.14%'}, {'City': ' Bergeijk', 'Verkochte woningen': '9', '%Δ Verkocht (t.o.v vorige maand)': '-57.14%'}, {'City': ' Bergen op Zoom', 'Verkochte woningen': '43', '%Δ Verkocht (t.o.v vorige maand)': '-36.76%'}, {'City': ' Bernheze', 'Verkochte woningen': '7', '%Δ Verkocht (t.o.v vorige maand)': '-69.57%'}, {'City': ' Best', 'Verkochte woningen': '8', '%Δ Verkocht (t.o.v vorige maand)': '-73.33%'}, {'City': ' Bladel', 'Verkochte woningen': '3', '%Δ Verkocht (t.o.v vorige maand)': '-84.21%'}, {'City': ' Boekel', 'Verkochte woningen': '6', '%Δ Verkocht (t.o.v vorige maand)': '-40.00%'}, {'City': ' Boxmeer', 'Ver

We generate a Pandas dataframe of the data we just collected for all municipalities.

In [15]:
pd.DataFrame(verk_woningen)

Unnamed: 0,City,Verkochte woningen,%Δ Verkocht (t.o.v vorige maand)
0,Alphen-Chaam,2,-75.00%
1,Altena,22,-50.00%
2,Asten,7,-46.15%
3,Baarle-Nassau,3,-57.14%
4,Bergeijk,9,-57.14%
...,...,...,...
56,Vught,17,-39.29%
57,Waalre,4,-55.56%
58,Waalwijk,18,-59.09%
59,Woensdrecht,12,-36.84%


### Gemiddelde vierkantemeter prijs (Quadrant 3 Trend data)

Extract the gemiddelde vierkante meter prijs, and its % change t.o.v. vorige maand. 

**Noord-Brabant data**

In [16]:
def extract_city_trends2(page_urls):
    trend_list2 = []
    for page_url in page_urls:
        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, 'html.parser')
        city_name = soup.find_all('h2')[0].get_text()
        city_name1 = city_name.replace('Woningmarkt','')
        trends = soup.find_all(class_='trend-graph')[2].get_text()
        new_trend = trends.replace('\n','')
        new_trend1 = new_trend.replace('Gem. vierkantemeter prijs','')
        m2prijs,change = new_trend1.split('                ',1)
        change1 = change.replace(' t.o.v. vorige maand            ', '')
        trend_list2.append({'City':city_name1, 'Gem. m2 prijs':m2prijs, '%Δ M2 prijs (t.o.v vorige maand)':change1})
    return(trend_list2)

In [17]:
m2_prijs = extract_city_trends2(page_urls) #use the function 
print(m2_prijs) #check whether it works

[{'City': ' Alphen-Chaam', 'Gem. m2 prijs': '€\xa02.430', '%Δ M2 prijs (t.o.v vorige maand)': '-23.27%'}, {'City': ' Altena', 'Gem. m2 prijs': '€\xa02.928', '%Δ M2 prijs (t.o.v vorige maand)': '2.59%'}, {'City': ' Asten', 'Gem. m2 prijs': '€\xa02.809', '%Δ M2 prijs (t.o.v vorige maand)': '4.08%'}, {'City': ' Baarle-Nassau', 'Gem. m2 prijs': '€\xa02.992', '%Δ M2 prijs (t.o.v vorige maand)': '18.68%'}, {'City': ' Bergeijk', 'Gem. m2 prijs': '€\xa03.427', '%Δ M2 prijs (t.o.v vorige maand)': '4.55%'}, {'City': ' Bergen op Zoom', 'Gem. m2 prijs': '€\xa02.575', '%Δ M2 prijs (t.o.v vorige maand)': '-6.36%'}, {'City': ' Bernheze', 'Gem. m2 prijs': '€\xa02.752', '%Δ M2 prijs (t.o.v vorige maand)': '-6.55%'}, {'City': ' Best', 'Gem. m2 prijs': '€\xa02.893', '%Δ M2 prijs (t.o.v vorige maand)': '-8.13%'}, {'City': ' Bladel', 'Gem. m2 prijs': '€\xa03.299', '%Δ M2 prijs (t.o.v vorige maand)': '10.63%'}, {'City': ' Boekel', 'Gem. m2 prijs': '€\xa02.899', '%Δ M2 prijs (t.o.v vorige maand)': '8.05%'}, 

We generate a Pandas dataframe of the data we just collected for all municipalities.

In [18]:
pd.DataFrame(m2_prijs)

Unnamed: 0,City,Gem. m2 prijs,%Δ M2 prijs (t.o.v vorige maand)
0,Alphen-Chaam,€ 2.430,-23.27%
1,Altena,€ 2.928,2.59%
2,Asten,€ 2.809,4.08%
3,Baarle-Nassau,€ 2.992,18.68%
4,Bergeijk,€ 3.427,4.55%
...,...,...,...
56,Vught,€ 3.700,-1.93%
57,Waalre,€ 3.734,-11.07%
58,Waalwijk,€ 2.736,-5.75%
59,Woensdrecht,€ 2.719,3.46%


### Percentage overboden (Quadrant 4 Trend data)

Extract the % overboden, and its % change t.o.v. vorige maand. 

In [19]:
def extract_city_trends3(page_urls):
    trend_list3 = []
    for page_url in page_urls:
        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, 'html.parser')
        city_name = soup.find_all('h2')[0].get_text()
        city_name1 = city_name.replace('Woningmarkt','')
        trends = soup.find_all(class_='trend-graph')[3].get_text()
        new_trend = trends.replace('\n','')
        new_trend1 = new_trend.replace('Percentage overboden','')
        overboden,change = new_trend1.split('                ',1)
        change1 = change.replace(' t.o.v. vorige maand            ', '')
        trend_list3.append({'City':city_name1, '% Vraagprijs overboden':overboden, '%Δ Overboden (t.o.v vorige maand)': change1})
    return(trend_list3)

In [20]:
perc_overboden = extract_city_trends3(page_urls) #use the function 
print(perc_overboden) #make sure it works

[{'City': ' Alphen-Chaam', '% Vraagprijs overboden': '1.63%', '%Δ Overboden (t.o.v vorige maand)': '-2.14%'}, {'City': ' Altena', '% Vraagprijs overboden': '8.79%', '%Δ Overboden (t.o.v vorige maand)': '-0.22%'}, {'City': ' Asten', '% Vraagprijs overboden': '7.15%', '%Δ Overboden (t.o.v vorige maand)': '-0.35%'}, {'City': ' Baarle-Nassau', '% Vraagprijs overboden': '15.02%', '%Δ Overboden (t.o.v vorige maand)': '17.89%'}, {'City': ' Bergeijk', '% Vraagprijs overboden': '3.36%', '%Δ Overboden (t.o.v vorige maand)': '1.60%'}, {'City': ' Bergen op Zoom', '% Vraagprijs overboden': '7.75%', '%Δ Overboden (t.o.v vorige maand)': '2.45%'}, {'City': ' Bernheze', '% Vraagprijs overboden': '8.43%', '%Δ Overboden (t.o.v vorige maand)': '3.17%'}, {'City': ' Best', '% Vraagprijs overboden': '8.14%', '%Δ Overboden (t.o.v vorige maand)': '-1.24%'}, {'City': ' Bladel', '% Vraagprijs overboden': '3.70%', '%Δ Overboden (t.o.v vorige maand)': '-1.74%'}, {'City': ' Boekel', '% Vraagprijs overboden': '5.04%

We generate a Pandas dataframe of the data we just collected for all municipalities.

In [21]:
pd.DataFrame(perc_overboden)

Unnamed: 0,City,% Vraagprijs overboden,%Δ Overboden (t.o.v vorige maand)
0,Alphen-Chaam,1.63%,-2.14%
1,Altena,8.79%,-0.22%
2,Asten,7.15%,-0.35%
3,Baarle-Nassau,15.02%,17.89%
4,Bergeijk,3.36%,1.60%
...,...,...,...
56,Vught,6.89%,-0.22%
57,Waalre,10.64%,0.37%
58,Waalwijk,5.48%,1.02%
59,Woensdrecht,7.02%,2.56%


## Step 3b: Gather all data for all municipalities: Besteedbaar inkomen and # Inhabitants

For each municipality we now extract the besteedbaar inkomen per huishouden (disposable income per household), the number of inhabitants, and the % change in inhabitants compared to the year before. 

### Besteedbaar inkomen

**Noord-Brabant data**

We generate a function extract_besteedbaar that will extract the data of besteedbaar inkomen per huishouden, for all municipalities. 

In [22]:
def extract_besteedbaar(page_urls):
    besteed_inkomen = []
    for page_url in page_urls:
        res = requests.get(page_url)
        soup = BeautifulSoup(res.text, 'html.parser')
        city_name = soup.find_all('h2')[0].get_text()
        city_name1 = city_name.replace('Woningmarkt','')
        inkomen = soup.find_all(class_='detail__income huizenzoeker-card single-value-graph-container')[0].get_text()
        new_inkomen = inkomen.replace('\n','')
        new_inkomen1 = new_inkomen.replace('Besteedbaar Inkomen Per Huishouden','')
        besteed_inkomen.append({'City':city_name1, 'Besteedbaar inkomen (per huishouden)':new_inkomen1})
    return(besteed_inkomen)

In [23]:
besteed_inkomen = extract_besteedbaar(page_urls) #use the function 
print(besteed_inkomen) #check whether it works

[{'City': ' Alphen-Chaam', 'Besteedbaar inkomen (per huishouden)': '€ 45.700'}, {'City': ' Altena', 'Besteedbaar inkomen (per huishouden)': '€ 43.400'}, {'City': ' Asten', 'Besteedbaar inkomen (per huishouden)': '€ 40.200'}, {'City': ' Baarle-Nassau', 'Besteedbaar inkomen (per huishouden)': '€ 37.700'}, {'City': ' Bergeijk', 'Besteedbaar inkomen (per huishouden)': '€ 43.600'}, {'City': ' Bergen op Zoom', 'Besteedbaar inkomen (per huishouden)': '€ 36.800'}, {'City': ' Bernheze', 'Besteedbaar inkomen (per huishouden)': '€ 44.600'}, {'City': ' Best', 'Besteedbaar inkomen (per huishouden)': '€ 44.500'}, {'City': ' Bladel', 'Besteedbaar inkomen (per huishouden)': '€ 42.700'}, {'City': ' Boekel', 'Besteedbaar inkomen (per huishouden)': '€ 45.100'}, {'City': ' Boxmeer', 'Besteedbaar inkomen (per huishouden)': '€ 41.800'}, {'City': ' Boxtel', 'Besteedbaar inkomen (per huishouden)': '€ 38.400'}, {'City': ' Breda', 'Besteedbaar inkomen (per huishouden)': '€ 35.500'}, {'City': ' Cranendonck', 'Be

We generate a Pandas dataframe of the data we just collected for all municipalities.

In [24]:
pd.DataFrame(besteed_inkomen)

Unnamed: 0,City,Besteedbaar inkomen (per huishouden)
0,Alphen-Chaam,€ 45.700
1,Altena,€ 43.400
2,Asten,€ 40.200
3,Baarle-Nassau,€ 37.700
4,Bergeijk,€ 43.600
...,...,...
56,Vught,€ 43.500
57,Waalre,€ 47.300
58,Waalwijk,€ 37.500
59,Woensdrecht,€ 38.800


### Inhabitants

**Noord-Brabant data**

We generate a function extract_inhabitants that will extract the number of inhabitants for all municipalities, and how this number has changed compared to last year.  

_insert code for this still_

## Step 5: Merging all data gathered (by city name)

For the final output (a CSV file, so tabular data) we want everything the scraper collected in one single dataframe, with one row per municipality. 

**Noord-Brabant data**

Here we name the Pandas Dataframes for each quadrant of the trend data we extracted, and also for the besteedbaar inkomen and inhabitants data. We set the index of the dataframes to the name of the municipality. 

In [25]:
df1 = pd.DataFrame(gem_vraagprijs).set_index('City') #gemiddelde vraagprijs 

In [26]:
df2 = pd.DataFrame(verk_woningen).set_index('City') #aantal verkochte woningen 

In [27]:
df3 = pd.DataFrame(m2_prijs).set_index('City') #vierkantemeter prijs 

In [28]:
df4 = pd.DataFrame(perc_overboden).set_index('City') #percentage overboden 

In [29]:
df5 = pd.DataFrame(besteed_inkomen).set_index('City') #besteedbaar inkomen 

In [30]:
#add a dataframe for the inhabitants data, once finished. 

In [31]:
data_frames = [df1,df2,df3,df4,df5] #list the dataframes that will be merged into one single dataframe 

The final output: 

In [32]:
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['City'],how='outer'),data_frames)
df_merged

City,Gem. vraagprijs,%Δ Vraagprijs (t.o.v vorige maand),Verkochte woningen,%Δ Verkocht (t.o.v vorige maand),Gem. m2 prijs,%Δ M2 prijs (t.o.v vorige maand),% Vraagprijs overboden,%Δ Overboden (t.o.v vorige maand),Besteedbaar inkomen (per huishouden)
Alphen-Chaam,€ 304.500,-18.69%,2,-75.00%,€ 2.430,-23.27%,1.63%,-2.14%,€ 45.700
Altena,€ 349.000,-8.16%,22,-50.00%,€ 2.928,2.59%,8.79%,-0.22%,€ 43.400
Asten,€ 465.000,27.40%,7,-46.15%,€ 2.809,4.08%,7.15%,-0.35%,€ 40.200
Baarle-Nassau,€ 389.000,21.56%,3,-57.14%,€ 2.992,18.68%,15.02%,17.89%,€ 37.700
Bergeijk,€ 325.000,-17.72%,9,-57.14%,€ 3.427,4.55%,3.36%,1.60%,€ 43.600
...,...,...,...,...,...,...,...,...,...
Vught,€ 375.000,-18.43%,17,-39.29%,€ 3.700,-1.93%,6.89%,-0.22%,€ 43.500
Waalre,€ 447.500,-33.70%,4,-55.56%,€ 3.734,-11.07%,10.64%,0.37%,€ 47.300
Waalwijk,€ 285.000,-8.06%,18,-59.09%,€ 2.736,-5.75%,5.48%,1.02%,€ 37.500
Woensdrecht,€ 309.000,21.41%,12,-36.84%,€ 2.719,3.46%,7.02%,2.56%,€ 38.800
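
To make clear what the outer merge on 'City' does, the sketch below performs the same kind of merge on plain lists of dictionaries, without pandas. `merge_records` is a hypothetical helper and the sample records are shortened. The sketch also strips whitespace from the municipality names, since a name with a leading space would not match the same name without one.

```python
def merge_records(*record_lists, key='City'):
    """Outer-merge lists of dicts on `key`, keeping the first-seen order of the keys."""
    merged = {}
    for records in record_lists:
        for record in records:
            name = record[key].strip()
            extra = {k: v for k, v in record.items() if k != key}
            merged.setdefault(name, {key: name}).update(extra)
    return list(merged.values())

prices = [
    {'City': ' Alphen-Chaam', 'Gem. vraagprijs': '€\xa0304.500'},
    {'City': ' Altena', 'Gem. vraagprijs': '€\xa0349.000'},
]
sold = [
    {'City': 'Alphen-Chaam', 'Verkochte woningen': '2'},
]

combined = merge_records(prices, sold)
print(combined)
```

A municipality that is missing from one of the inputs still appears in the result, only without that input's columns. This corresponds to the missing values an outer merge leaves in the pandas dataframe.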


In [33]:
df_merged.describe()

Unnamed: 0,Gem. vraagprijs,%Δ Vraagprijs (t.o.v vorige maand),Verkochte woningen,%Δ Verkocht (t.o.v vorige maand),Gem. m2 prijs,%Δ M2 prijs (t.o.v vorige maand),% Vraagprijs overboden,%Δ Overboden (t.o.v vorige maand),Besteedbaar inkomen (per huishouden)
count,61,61,61,61,61,61,61,61,61
unique,45,61,28,51,59,60,61,59,49
top,€ 325.000,-18.69%,7,-57.14%,€ 2.899,-6.36%,1.63%,-0.22%,€ 43.500
freq,5,1,7,5,2,2,1,2,3


## Step 6: Exporting dataframes to CSV

In [36]:
df_merged.to_csv('Huizenzoeker_NB_data.csv')

## Step 7: Providing summary statistics

**Exporting our output to RStudio and then importing that CSV here**

First we try to generate summary statistics directly from the scraped output. This does not work, because most variables are stored as character strings while they should be numeric. We therefore exported the Noord-Brabant data for all municipalities to R, converted the data types there, and saved the result as a CSV that we load here to generate the summary statistics: count, mean, std, min, max, 25%, 50%, 75%. 
(STATUS: Huizenzoeker.nl uses '.' both as a thousands separator and as a decimal point, so R cannot tell whether a value means 2000 or 2.00. We still need to fix this in the R script before the summary statistics can be explored accurately.)
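
The conversion could also be done in Python before exporting. The sketch below is one possible approach, not the R script referred to above. It assumes, based on the scraped output shown earlier, that euro amounts are whole numbers with '.' as a thousands separator (for example '€ 304.500') and that percentages use '.' as a decimal point (for example '-18.69%'). `parse_euro` and `parse_percent` are hypothetical helpers.

```python
def parse_euro(text):
    """Convert a euro amount such as '€ 304.500' to an int (assumes whole euros)."""
    digits = (
        text.replace('€', '')
        .replace('\xa0', '')  # non-breaking space between the sign and the amount
        .replace(' ', '')
        .replace('.', '')     # '.' is treated as a thousands separator here
    )
    return int(digits)

def parse_percent(text):
    """Convert a percentage such as '-18.69%' to a float ('.' is the decimal point)."""
    return float(text.replace('%', '').strip())

asking_price = parse_euro('€\xa0304.500')
income = parse_euro('€ 45.700')
price_change = parse_percent('-18.69%')
print(asking_price, income, price_change)
```

Because each column is handled by the function that matches its format, the '.' is never ambiguous. In pandas the helpers could be applied per column, for example with `df_merged['Gem. vraagprijs'].map(parse_euro)`.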

In [41]:
noordbrabant = pd.read_csv('Huizenzoeker_NB_data.csv')

In [45]:
noordbrabant.describe() #STATUS: only 'Verkochte woningen' is summarised, because all other variables are read as character strings

Unnamed: 0,Verkochte woningen
count,61.0
mean,18.442623
std,21.448018
min,2.0
25%,7.0
50%,11.0
75%,18.0
max,96.0


In [51]:
#an R script converts the variables from characters to numerics and saves the result as huizen_NB.csv

In [49]:
noordbrabant1 = pd.read_csv('huizen_NB.csv') 

In [50]:
noordbrabant1.describe()

Unnamed: 0,gem_vraagprijs,perc_ver_vraagprijs,verk_woningen,perc_ver_verkocht,gem_m2prijs,perc_ver_m2prijs,perc_overboden,perc_ver_overboden,best_inkomen
count,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0,61.0
mean,346.148557,4.110328,18.442623,-51.58377,3.116016,4.293607,7.22082,1.096557,41.132787
std,77.482364,60.840026,21.448018,16.378496,1.0366,30.467221,3.606782,4.722097,3.589323
min,2.062,-45.38,2.0,-91.3,2.173,-30.53,-0.94,-9.5,31.6
25%,309.0,-15.52,7.0,-62.07,2.775,-5.51,5.09,-0.88,38.8
50%,339.25,-6.29,11.0,-53.76,2.941,0.59,7.03,0.85,41.1
75%,389.0,6.76,18.0,-39.29,3.294,9.19,8.74,2.45,43.6
max,600.0,450.0,96.0,-11.76,10.612,223.44,20.04,19.86,47.3


**Scraping some mean values at province-level**

We can also use the woningmarkt page of each province to collect mean values at province level for all variables. We do this in the same way as for the municipalities of Noord-Brabant, again using BeautifulSoup, but now with the province pages as input. 

Generating links of all provinces: 

In [52]:
base_url = 'https://www.huizenzoeker.nl/woningmarkt/'
province_url = ['noord-holland/', 'zuid-holland/', 'zeeland/', 'noord-brabant/', 'utrecht/', 'flevoland/', 
                'friesland/', 'groningen/', 'drenthe/', 'overijssel/', 'gelderland/', 'limburg/'] 

Defining a function to paste these URL parts together: 

In [54]:
def generate_links(base_url,province_url): 
    page_links = []
    for i in province_url:
        full_links = base_url + i
        page_links.append(full_links)  
    return page_links
page_links = generate_links(base_url,province_url)
print(page_links)

['https://www.huizenzoeker.nl/woningmarkt/noord-holland/', 'https://www.huizenzoeker.nl/woningmarkt/zuid-holland/', 'https://www.huizenzoeker.nl/woningmarkt/zeeland/', 'https://www.huizenzoeker.nl/woningmarkt/noord-brabant/', 'https://www.huizenzoeker.nl/woningmarkt/utrecht/', 'https://www.huizenzoeker.nl/woningmarkt/flevoland/', 'https://www.huizenzoeker.nl/woningmarkt/friesland/', 'https://www.huizenzoeker.nl/woningmarkt/groningen/', 'https://www.huizenzoeker.nl/woningmarkt/drenthe/', 'https://www.huizenzoeker.nl/woningmarkt/overijssel/', 'https://www.huizenzoeker.nl/woningmarkt/gelderland/', 'https://www.huizenzoeker.nl/woningmarkt/limburg/']


*Gemiddelde vraagprijs:* 

In [55]:
def extract_province_trends(page_links): 
    trend_list = []
    for page_link in page_links:
        res = requests.get(page_link)
        soup = BeautifulSoup(res.text, 'html.parser')
        province_name = soup.find_all('h2')[0].get_text()
        province_name1 = province_name.replace('Woningmarkt','')
        trends = soup.find_all(class_='trend-graph')[0].get_text()
        new_trend = trends.replace('\n','')
        new_trend1 = new_trend.replace('Gem. Vraagprijs','')
        price,change = new_trend1.split('                ',1)
        change1 = change.replace(' t.o.v. vorige maand            ', '')
        trend_list.append({'Province':province_name1, 'Gem. vraagprijs':price, '%Δ Vraagprijs (t.o.v vorige maand)':change1})
    return(trend_list)

In [56]:
mean_vraagprijs = extract_province_trends(page_links) #use the function 
print(mean_vraagprijs) #check if it works

[{'Province': ' Noord-Holland', 'Gem. vraagprijs': '€\xa0375.000', '%Δ Vraagprijs (t.o.v vorige maand)': '-6.02%'}, {'Province': ' Zuid-Holland', 'Gem. vraagprijs': '€\xa0335.000', '%Δ Vraagprijs (t.o.v vorige maand)': '-4.29%'}, {'Province': ' Zeeland', 'Gem. vraagprijs': '€\xa0275.000', '%Δ Vraagprijs (t.o.v vorige maand)': '6.18%'}, {'Province': ' Noord-Brabant', 'Gem. vraagprijs': '€\xa0339.000', '%Δ Vraagprijs (t.o.v vorige maand)': '-3.14%'}, {'Province': ' Utrecht', 'Gem. vraagprijs': '€\xa0386.250', '%Δ Vraagprijs (t.o.v vorige maand)': '-3.44%'}, {'Province': ' Flevoland', 'Gem. vraagprijs': '€\xa0325.000', '%Δ Vraagprijs (t.o.v vorige maand)': '0.00%'}, {'Province': ' Friesland', 'Gem. vraagprijs': '€\xa0280.000', '%Δ Vraagprijs (t.o.v vorige maand)': '2.19%'}, {'Province': ' Groningen', 'Gem. vraagprijs': '€\xa0227.000', '%Δ Vraagprijs (t.o.v vorige maand)': '-9.20%'}, {'Province': ' Drenthe', 'Gem. vraagprijs': '€\xa0295.000', '%Δ Vraagprijs (t.o.v vorige maand)': '0.00%'},

In [57]:
pd.DataFrame(mean_vraagprijs)

Unnamed: 0,Province,Gem. vraagprijs,%Δ Vraagprijs (t.o.v vorige maand)
0,Noord-Holland,€ 375.000,-6.02%
1,Zuid-Holland,€ 335.000,-4.29%
2,Zeeland,€ 275.000,6.18%
3,Noord-Brabant,€ 339.000,-3.14%
4,Utrecht,€ 386.250,-3.44%
5,Flevoland,€ 325.000,0.00%
6,Friesland,€ 280.000,2.19%
7,Groningen,€ 227.000,-9.20%
8,Drenthe,€ 295.000,0.00%
9,Overijssel,€ 298.500,-0.50%


*Aantal verkochte woningen:*

In [58]:
def extract_province_trends1(page_links):
    trend_list1 = []
    for page_link in page_links:
        res = requests.get(page_link)
        soup = BeautifulSoup(res.text, 'html.parser')
        province_name = soup.find_all('h2')[0].get_text()
        province_name1 = province_name.replace('Woningmarkt','')
        trends = soup.find_all(class_='trend-graph')[1].get_text()
        new_trend = trends.replace('\n','')
        new_trend1 = new_trend.replace('Verkochte woningen','')
        verkocht,change = new_trend1.split('                ',1)
        change1 = change.replace(' t.o.v. vorige maand            ', '')
        trend_list1.append({'Province':province_name1, 'Verkochte woningen':verkocht, '%Δ Verkocht (t.o.v vorige maand)':change1})
    return(trend_list1)

In [59]:
mean_verkwoningen = extract_province_trends1(page_links) #use the function
print(mean_verkwoningen) #check if it works

[{'Province': ' Noord-Holland', 'Verkochte woningen': '1113', '%Δ Verkocht (t.o.v vorige maand)': '-52.90%'}, {'Province': ' Zuid-Holland', 'Verkochte woningen': '1738', '%Δ Verkocht (t.o.v vorige maand)': '-49.70%'}, {'Province': ' Zeeland', 'Verkochte woningen': '250', '%Δ Verkocht (t.o.v vorige maand)': '-36.39%'}, {'Province': ' Noord-Brabant', 'Verkochte woningen': '1125', '%Δ Verkocht (t.o.v vorige maand)': '-52.21%'}, {'Province': ' Utrecht', 'Verkochte woningen': '583', '%Δ Verkocht (t.o.v vorige maand)': '-52.79%'}, {'Province': ' Flevoland', 'Verkochte woningen': '229', '%Δ Verkocht (t.o.v vorige maand)': '-41.28%'}, {'Province': ' Friesland', 'Verkochte woningen': '332', '%Δ Verkocht (t.o.v vorige maand)': '-38.40%'}, {'Province': ' Groningen', 'Verkochte woningen': '307', '%Δ Verkocht (t.o.v vorige maand)': '-41.52%'}, {'Province': ' Drenthe', 'Verkochte woningen': '287', '%Δ Verkocht (t.o.v vorige maand)': '-40.95%'}, {'Province': ' Overijssel', 'Verkochte woningen': '487'

In [60]:
pd.DataFrame(mean_verkwoningen)

Unnamed: 0,Province,Verkochte woningen,%Δ Verkocht (t.o.v vorige maand)
0,Noord-Holland,1113,-52.90%
1,Zuid-Holland,1738,-49.70%
2,Zeeland,250,-36.39%
3,Noord-Brabant,1125,-52.21%
4,Utrecht,583,-52.79%
5,Flevoland,229,-41.28%
6,Friesland,332,-38.40%
7,Groningen,307,-41.52%
8,Drenthe,287,-40.95%
9,Overijssel,487,-47.52%


*Gemiddelde vierkante meter prijs:*

In [61]:
def extract_province_trends2(page_links):
    trend_list2 = []
    for page_link in page_links:
        res = requests.get(page_link)
        soup = BeautifulSoup(res.text, 'html.parser')
        province_name = soup.find_all('h2')[0].get_text()
        province_name1 = province_name.replace('Woningmarkt','')
        trends = soup.find_all(class_='trend-graph')[2].get_text()
        new_trend = trends.replace('\n','')
        new_trend1 = new_trend.replace('Gem. vierkantemeter prijs','')
        m2prijs,change = new_trend1.split('                ',1)
        change1 = change.replace(' t.o.v. vorige maand            ', '')
        trend_list2.append({'Province':province_name1, 'Gem. m2 prijs':m2prijs, '%Δ M2 prijs (t.o.v vorige maand)':change1})
    return(trend_list2)

In [62]:
mean_m2prijs = extract_province_trends2(page_links) #use the function 
print(mean_m2prijs) #check whether it works

[{'Province': ' Noord-Holland', 'Gem. m2 prijs': '€\xa04.080', '%Δ M2 prijs (t.o.v vorige maand)': '-2.97%'}, {'Province': ' Zuid-Holland', 'Gem. m2 prijs': '€\xa03.425', '%Δ M2 prijs (t.o.v vorige maand)': '-0.61%'}, {'Province': ' Zeeland', 'Gem. m2 prijs': '€\xa02.604', '%Δ M2 prijs (t.o.v vorige maand)': '7.43%'}, {'Province': ' Noord-Brabant', 'Gem. m2 prijs': '€\xa03.035', '%Δ M2 prijs (t.o.v vorige maand)': '-0.52%'}, {'Province': ' Utrecht', 'Gem. m2 prijs': '€\xa04.020', '%Δ M2 prijs (t.o.v vorige maand)': '3.16%'}, {'Province': ' Flevoland', 'Gem. m2 prijs': '€\xa02.948', '%Δ M2 prijs (t.o.v vorige maand)': '-0.41%'}, {'Province': ' Friesland', 'Gem. m2 prijs': '€\xa02.517', '%Δ M2 prijs (t.o.v vorige maand)': '7.47%'}, {'Province': ' Groningen', 'Gem. m2 prijs': '€\xa02.391', '%Δ M2 prijs (t.o.v vorige maand)': '-1.89%'}, {'Province': ' Drenthe', 'Gem. m2 prijs': '€\xa02.455', '%Δ M2 prijs (t.o.v vorige maand)': '0.16%'}, {'Province': ' Overijssel', 'Gem. m2 prijs': '€\xa02.

In [63]:
pd.DataFrame(mean_m2prijs)

Unnamed: 0,Province,Gem. m2 prijs,%Δ M2 prijs (t.o.v vorige maand)
0,Noord-Holland,€ 4.080,-2.97%
1,Zuid-Holland,€ 3.425,-0.61%
2,Zeeland,€ 2.604,7.43%
3,Noord-Brabant,€ 3.035,-0.52%
4,Utrecht,€ 4.020,3.16%
5,Flevoland,€ 2.948,-0.41%
6,Friesland,€ 2.517,7.47%
7,Groningen,€ 2.391,-1.89%
8,Drenthe,€ 2.455,0.16%
9,Overijssel,€ 2.613,-0.34%


*Percentage overboden:*

In [64]:
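def extract_province_trends3(page_links):
    """Scrape the 'Percentage overboden' trend card from every province page."""
    trend_list3 = []
    for page_link in page_links:
        res = requests.get(page_link)
        soup = BeautifulSoup(res.text, 'html.parser')
        # the first <h2> reads 'Woningmarkt <province>'; keep only the province part
        province_name = soup.find_all('h2')[0].get_text()
        province_name1 = province_name.replace('Woningmarkt','')
        # the fourth 'trend-graph' card (index 3) holds the percentage overboden
        trends = soup.find_all(class_='trend-graph')[3].get_text()
        new_trend = trends.replace('\n','')
        new_trend1 = new_trend.replace('Percentage overboden','')
        # the value and its monthly change are separated by a long run of spaces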
        overboden,change = new_trend1.split('                ',1)
        change1 = change.replace(' t.o.v. vorige maand            ', '')
        trend_list3.append({'Province':province_name1, '% Vraagprijs overboden':overboden, '%Δ Overboden (t.o.v vorige maand)': change1})
    return trend_list3
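
The trend functions in this step differ only in the card index and the title they strip, and they rely on exact runs of spaces in the page text. A whitespace-tolerant alternative is sketched below. `parse_trend_text` is a hypothetical helper, and the sample string only imitates the shape of the card text that the `replace` and `split` calls above imply; it is not copied from the site.

```python
import re

def parse_trend_text(raw_text, title):
    """Split the text of one trend card into (value, change)."""
    # Collapse newlines, indentation and non-breaking spaces into single spaces.
    text = re.sub(r'\s+', ' ', raw_text)
    # Remove the card title and the trailing caption.
    text = text.replace(title, '').replace('t.o.v. vorige maand', '').strip()
    # The last token is the month-on-month change; the rest is the value.
    value, change = text.rsplit(' ', 1)
    return value, change

# Assumed shape of the card text (illustrative, not scraped).
sample = '\nPercentage overboden\n12.66%\n                1.08% t.o.v. vorige maand            \n'
print(parse_trend_text(sample, 'Percentage overboden'))  # ('12.66%', '1.08%')
```

Because all whitespace is collapsed first, a non-breaking space inside a value such as `'€\xa04.080'` comes back as a regular space.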

In [65]:
mean_overboden = extract_province_trends3(page_links) #use the function 
print(mean_overboden) #make sure it works

[{'Province': ' Noord-Holland', '% Vraagprijs overboden': '12.66%', '%Δ Overboden (t.o.v vorige maand)': '1.08%'}, {'Province': ' Zuid-Holland', '% Vraagprijs overboden': '10.23%', '%Δ Overboden (t.o.v vorige maand)': '0.79%'}, {'Province': ' Zeeland', '% Vraagprijs overboden': '8.13%', '%Δ Overboden (t.o.v vorige maand)': '0.13%'}, {'Province': ' Noord-Brabant', '% Vraagprijs overboden': '7.87%', '%Δ Overboden (t.o.v vorige maand)': '0.88%'}, {'Province': ' Utrecht', '% Vraagprijs overboden': '11.97%', '%Δ Overboden (t.o.v vorige maand)': '0.77%'}, {'Province': ' Flevoland', '% Vraagprijs overboden': '14.71%', '%Δ Overboden (t.o.v vorige maand)': '1.08%'}, {'Province': ' Friesland', '% Vraagprijs overboden': '10.64%', '%Δ Overboden (t.o.v vorige maand)': '1.94%'}, {'Province': ' Groningen', '% Vraagprijs overboden': '15.76%', '%Δ Overboden (t.o.v vorige maand)': '1.33%'}, {'Province': ' Drenthe', '% Vraagprijs overboden': '10.49%', '%Δ Overboden (t.o.v vorige maand)': '1.41%'}, {'Prov

In [66]:
pd.DataFrame(mean_overboden)

Unnamed: 0,Province,% Vraagprijs overboden,%Δ Overboden (t.o.v vorige maand)
0,Noord-Holland,12.66%,1.08%
1,Zuid-Holland,10.23%,0.79%
2,Zeeland,8.13%,0.13%
3,Noord-Brabant,7.87%,0.88%
4,Utrecht,11.97%,0.77%
5,Flevoland,14.71%,1.08%
6,Friesland,10.64%,1.94%
7,Groningen,15.76%,1.33%
8,Drenthe,10.49%,1.41%
9,Overijssel,9.83%,1.36%
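
To rank provinces, the percentage strings have to become numbers first; sorting the raw strings would compare them character by character and place `'7.87%'` above `'15.76%'`. A minimal sketch using four records copied from the output above; `parse_percent` is a hypothetical helper name:

```python
def parse_percent(text):
    """Convert a string such as '12.66%' or '-0.61%' to a float."""
    return float(text.strip().rstrip('%'))

# Four records copied from the output above (province names without the leading space).
records = [
    {'Province': 'Noord-Holland', '% Vraagprijs overboden': '12.66%'},
    {'Province': 'Noord-Brabant', '% Vraagprijs overboden': '7.87%'},
    {'Province': 'Flevoland', '% Vraagprijs overboden': '14.71%'},
    {'Province': 'Groningen', '% Vraagprijs overboden': '15.76%'},
]

# Sort from the highest to the lowest percentage.
ranked = sorted(
    records,
    key=lambda row: parse_percent(row['% Vraagprijs overboden']),
    reverse=True,
)
for row in ranked:
    print(row['Province'], parse_percent(row['% Vraagprijs overboden']))
```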


*Besteedbaar inkomen (disposable income):*

In [67]:
def extract_besteedbaar(page_links):
    """Scrape the disposable income per household from every province page."""
    besteed_inkomen = []
    for page_link in page_links:
        res = requests.get(page_link)
        soup = BeautifulSoup(res.text, 'html.parser')
        province_name = soup.find_all('h2')[0].get_text()
        province_name1 = province_name.replace('Woningmarkt','')
        # a multi-class string matches the exact value of the class attribute
        inkomen = soup.find_all(class_='detail__income huizenzoeker-card single-value-graph-container')[0].get_text()
        # drop the line breaks and the card title so only the amount remains
        new_inkomen = inkomen.replace('\n','')
        new_inkomen1 = new_inkomen.replace('Besteedbaar Inkomen Per Huishouden','')
        besteed_inkomen.append({'Province':province_name1, 'Besteedbaar inkomen (per huishouden)':new_inkomen1})
    return besteed_inkomen
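
Each extractor in this step downloads every province page again, so the same pages are requested once per statistic. One way to avoid that is to cache the HTML per URL. The sketch below is an assumption about how this could be organised, not part of the original notebook: `make_cached_fetcher` and `fetch_with_requests` are hypothetical names, the URL is a placeholder, and the fake downloader exists only so the caching can be tried offline.

```python
def make_cached_fetcher(fetch):
    """Wrap a fetch(url) -> html callable so each URL is downloaded at most once."""
    cache = {}

    def cached_fetch(url):
        if url not in cache:
            cache[url] = fetch(url)
        return cache[url]

    return cached_fetch

def fetch_with_requests(url):
    """Download one page; raises on HTTP errors instead of returning an error page."""
    # Imported here so the caching logic can be tried without requests installed.
    import requests
    res = requests.get(url, timeout=30)
    res.raise_for_status()
    return res.text

# Offline demonstration with a fake downloader that records its calls.
calls = []

def fake_fetch(url):
    calls.append(url)
    return '<h2>Woningmarkt Groningen</h2>'

get_html = make_cached_fetcher(fake_fetch)
placeholder_url = 'https://example.org/groningen/'  # placeholder, not a real page link
first = get_html(placeholder_url)
second = get_html(placeholder_url)
print(len(calls))  # 1: the second call was served from the cache
```

In the notebook, `get_html = make_cached_fetcher(fetch_with_requests)` could then stand in for the repeated `requests.get(page_link)` calls, with `BeautifulSoup(get_html(page_link), 'html.parser')` in each extractor.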

In [68]:
mean_inkomen = extract_besteedbaar(page_links) #use the function 
print(mean_inkomen) #check whether it works

[{'Province': ' Noord-Holland', 'Besteedbaar inkomen (per huishouden)': '€ 36.200'}, {'Province': ' Zuid-Holland', 'Besteedbaar inkomen (per huishouden)': '€ 35.800'}, {'Province': ' Zeeland', 'Besteedbaar inkomen (per huishouden)': '€ 36.900'}, {'Province': ' Noord-Brabant', 'Besteedbaar inkomen (per huishouden)': '€ 38.100'}, {'Province': ' Utrecht', 'Besteedbaar inkomen (per huishouden)': '€ 39.500'}, {'Province': ' Flevoland', 'Besteedbaar inkomen (per huishouden)': '€ 39.500'}, {'Province': ' Friesland', 'Besteedbaar inkomen (per huishouden)': '€ 34.900'}, {'Province': ' Groningen', 'Besteedbaar inkomen (per huishouden)': '€ 30.600'}, {'Province': ' Drenthe', 'Besteedbaar inkomen (per huishouden)': '€ 37.100'}, {'Province': ' Overijssel', 'Besteedbaar inkomen (per huishouden)': '€ 36.900'}, {'Province': ' Gelderland', 'Besteedbaar inkomen (per huishouden)': '€ 37.500'}, {'Province': ' Limburg', 'Besteedbaar inkomen (per huishouden)': '€ 34.800'}]


In [69]:
pd.DataFrame(mean_inkomen)

Unnamed: 0,Province,Besteedbaar inkomen (per huishouden)
0,Noord-Holland,€ 36.200
1,Zuid-Holland,€ 35.800
2,Zeeland,€ 36.900
3,Noord-Brabant,€ 38.100
4,Utrecht,€ 39.500
5,Flevoland,€ 39.500
6,Friesland,€ 34.900
7,Groningen,€ 30.600
8,Drenthe,€ 37.100
9,Overijssel,€ 36.900


*Merging the dataframes:*

In [70]:
df1 = pd.DataFrame(mean_vraagprijs).set_index('Province') # average asking price
df2 = pd.DataFrame(mean_verkwoningen).set_index('Province') # number of homes sold
df3 = pd.DataFrame(mean_m2prijs).set_index('Province') # average price per m2
df4 = pd.DataFrame(mean_overboden).set_index('Province') # percentage overbid
df5 = pd.DataFrame(mean_inkomen).set_index('Province') # disposable income per household
# TODO: add a dataframe for the inhabitants data once it is finished

In [71]:
data_frames = [df1,df2,df3,df4,df5] #now we combine the dataframes into one single dataframe

In [72]:
df_province_summary = reduce(lambda left,right: pd.merge(left,right,on=['Province'],how='outer'),data_frames)
df_province_summary

Province,Gem. vraagprijs,%Δ Vraagprijs (t.o.v vorige maand),Verkochte woningen,%Δ Verkocht (t.o.v vorige maand),Gem. m2 prijs,%Δ M2 prijs (t.o.v vorige maand),% Vraagprijs overboden,%Δ Overboden (t.o.v vorige maand),Besteedbaar inkomen (per huishouden)
Noord-Holland,€ 375.000,-6.02%,1113,-52.90%,€ 4.080,-2.97%,12.66%,1.08%,€ 36.200
Zuid-Holland,€ 335.000,-4.29%,1738,-49.70%,€ 3.425,-0.61%,10.23%,0.79%,€ 35.800
Zeeland,€ 275.000,6.18%,250,-36.39%,€ 2.604,7.43%,8.13%,0.13%,€ 36.900
Noord-Brabant,€ 339.000,-3.14%,1125,-52.21%,€ 3.035,-0.52%,7.87%,0.88%,€ 38.100
Utrecht,€ 386.250,-3.44%,583,-52.79%,€ 4.020,3.16%,11.97%,0.77%,€ 39.500
Flevoland,€ 325.000,0.00%,229,-41.28%,€ 2.948,-0.41%,14.71%,1.08%,€ 39.500
Friesland,€ 280.000,2.19%,332,-38.40%,€ 2.517,7.47%,10.64%,1.94%,€ 34.900
Groningen,€ 227.000,-9.20%,307,-41.52%,€ 2.391,-1.89%,15.76%,1.33%,€ 30.600
Drenthe,€ 295.000,0.00%,287,-40.95%,€ 2.455,0.16%,10.49%,1.41%,€ 37.100
Overijssel,€ 298.500,-0.50%,487,-47.52%,€ 2.613,-0.34%,9.83%,1.36%,€ 36.900
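
An outer merge keeps every key from every frame, so province names that differ even by a stray space end up as separate rows padded with `NaN`. The sketch below strips the keys before merging and then lists any row that is still incomplete. It is a hedged illustration on two tiny frames built from values shown above: `merge_on_province` is a hypothetical helper, and the leading space in `' Groningen'` is put in one frame on purpose to imitate unclean scraped text.

```python
from functools import reduce

import pandas as pd

prices = pd.DataFrame([
    {'Province': 'Groningen', 'Gem. m2 prijs': '€ 2.391'},
    {'Province': 'Drenthe', 'Gem. m2 prijs': '€ 2.455'},
]).set_index('Province')
overbid = pd.DataFrame([
    {'Province': ' Groningen', '% Vraagprijs overboden': '15.76%'},
    {'Province': 'Drenthe', '% Vraagprijs overboden': '10.49%'},
]).set_index('Province')

def merge_on_province(frames):
    """Outer-merge frames on a whitespace-stripped 'Province' key."""
    cleaned = []
    for frame in frames:
        frame = frame.reset_index()
        frame['Province'] = frame['Province'].str.strip()  # ' Groningen' -> 'Groningen'
        cleaned.append(frame)
    merged = reduce(
        lambda left, right: pd.merge(left, right, on='Province', how='outer'),
        cleaned,
    )
    return merged.set_index('Province')

summary = merge_on_province([prices, overbid])
# Rows that contain NaN after an outer merge point to keys that did not match.
incomplete = summary[summary.isna().any(axis=1)]
print(summary)
print('Provinces with missing values:', list(incomplete.index))
```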
