# Immoscraper
- This is a notebook to scrape immoscout24.de, a german real estate website for flats
- I'll test & learn extracting data needed for first page as this one has some exceptions, not only in link structure, and will then move on to loop over all following approx. 40 pages of flat listings given no search parameters except than Location: Hamburg

Link Structure
- Page 1 : https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Hamburg/Hamburg?enteredFrom=result_list
- Page 2:  https://www.immobilienscout24.de/Suche/S-T/P-2/Wohnung-Miete/Hamburg/Hamburg
- Page 3:  https://www.immobilienscout24.de/Suche/S-T/P-3/Wohnung-Miete/Hamburg/Hamburg

In [273]:
import requests 
from bs4 import BeautifulSoup
import pandas as pd

In [274]:
#Downloading data from immobilienscout24 for Hamburg
data = requests.get('https://www.immobilienscout24.de/Suche/S-T/Wohnung-Miete/Hamburg/Hamburg?enteredFrom=result_list')
#Reading in the text of the html with data.text, then parsing html w/ BS4 
soup = BeautifulSoup(data.text, 'html.parser')

In [275]:
#Printing first 100 chars of soup.prettify to check if everything worked properly
print(soup.prettify()[0:100])

<!DOCTYPE doctype html>
<html lang="de">
 <head>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA


# The first Page 

## Extracting Price, Size and # of Rooms

In [276]:
infos=[]
#Getting Price, Size and Rooms which are all stored into the <dd> tag of class 'font-nowrap font-line-xs'
for i in soup.find_all("dd", {"class":"font-nowrap font-line-xs"}):
    infos.append(i.text)
#First 3 entries are data for ONE apartement | Price/Size/Rooms
infos[0:6]

['1.060,08 €', '89,97 m²', '3 Zi.3', '841 €', '79,64 m²', '2 Zi.2']

In [277]:
#Only prices, gt every 4th element out of the list, starting with 0th element
price = infos[::3]
#Only size, get every 4th element out of list, starting with second element
size = infos[1::3]
#Only number of rooms, get every 4th element out of list, starting with third element
rooms = infos[2::3]

## CAUTION: BUG IN THIS CODE CELL: Replacing size with price 

In [278]:
#Remove the € sign in price 
#price = [s.replace('€','')for s in price]
##Remove spaces 
#price = [s.replace(' ','') for s in price]
#remove m² in size
#size = [s.replace('m²','')for s in size]
#Remove spaces 
#size = [s.replace(' ','') for s in price]

In [279]:
#Store 3 Lists in a Dataframe 
df = pd.DataFrame({'price':price, 'size':size,'rooms':rooms})

In [280]:
df.head()

Unnamed: 0,price,rooms,size
0,"1.060,08 €",3 Zi.3,"89,97 m²"
1,841 €,2 Zi.2,"79,64 m²"
2,1.150 €,2 Zi.2,"75,14 m²"
3,1.470 €,3 Zi.3,"97,65 m²"
4,1.575 €,4 Zi.4,"121,02 m²"


## Adding Flat Titles

In [281]:
titles=[]
#Find all titles of the ad which are all stored inside the <h5> tag of class
for i in soup.find_all('h5', {'class':'result-list-entry__brand-title font-h6 onlyLarge nine-tenths margin-bottom-none maxtwolinerHeadline'}):
    titles.append(i.text)
titles[0:20]

['Sonnige Quadratmeter ! Moderne 3-Zimmer-Wohnung in Langenhorn',
 'Einziehen und Wohlfühlen! 2-Zimmer-Wohnung mit Blick ins Grüne',
 'neue Wohnung - neues Glück',
 'Geniale 3-Zimmerwohnung!  ***Erstbezug***',
 'Familienfreundliches Wohnen',
 'Traumhafte Penthousewohnung! ***Erstbezug***']

In [282]:
len(titles)

6

## Problem: 
- There are different types of titles as on the first page you have sponsored entries (6 to be exact)
- need to append other elements to titles list also

In [283]:
#Other titles are also in h5 tag but of different class
for i in soup.find_all('h5', {'class':'result-list-entry__brand-title font-h6 onlyLarge nine-tenths maxtwolinerHeadline'}):
    titles.append(i.text)

In [284]:
df['titles'] = pd.Series(titles).values

In [285]:
df

Unnamed: 0,price,rooms,size,titles
0,"1.060,08 €",3 Zi.3,"89,97 m²",Sonnige Quadratmeter ! Moderne 3-Zimmer-Wohnun...
1,841 €,2 Zi.2,"79,64 m²",Einziehen und Wohlfühlen! 2-Zimmer-Wohnung mit...
2,1.150 €,2 Zi.2,"75,14 m²",neue Wohnung - neues Glück
3,1.470 €,3 Zi.3,"97,65 m²",Geniale 3-Zimmerwohnung! ***Erstbezug***
4,1.575 €,4 Zi.4,"121,02 m²",Familienfreundliches Wohnen
5,2.500 €,5 Zi.5,"166,02 m²",Traumhafte Penthousewohnung! ***Erstbezug***
6,550 €,1 Zi.1,32 m²,NEUTraumschöne 1 Zi. -Wohnung in einer Jugend...
7,1.199 €,3 Zi.3,"88,73 m²","3 Zimmer, Einbauküche, Balkon"
8,550 €,2 Zi.2,36 m²,2 Zimmer Wohnung nur für Studenten
9,"303,30 €","1,5 Zi.1,5","40,04 m²",Schöne und helle 1-1/2- Zi.-Whg. in Harburg-He...


## Getting the Address
- stored inside a "<div" tag of class="font-ellipsis“

In [286]:
addresses = []
for i in soup.find_all('div', {'class':'font-ellipsis'}):
    addresses.append(i.text)
addresses[0:15]

['Sumpfcallastieg  9, Langenhorn, Hamburg',
 'Kreuzblumenweg  14, Langenhorn, Hamburg',
 'Glückel-von-Hameln-Straße  2, Altona-Nord, Hamburg',
 'Susanne-von-Paczensky-Straße  11, Altona-Nord, Hamburg',
 'Glückel-von-Hameln-Straße  2, Altona-Nord, Hamburg',
 'Susanne-von-Paczensky-Straße  7, Altona-Nord, Hamburg',
 'Wördemannsweg 7, Stellingen, Hamburg',
 'Matthias LandschofGbR Landschof-Schlosser / Hausverwaltung',
 'Bargteheider Straße 134a, Rahlstedt, Hamburg',
 'Herr Max MüllerSTRABAG Property and Facility Services GmbH',
 'Wördemannsweg 7, Stellingen, Hamburg',
 'Matthias LandschofGbR Landschof-Schlosser / Hausverwaltung',
 'Baustraße 10, Heimfeld, Hamburg',
 'Wohnungsunternehmen HeimfeldWohnungsunternehmen Heimfeld GmbH & Co. KG',
 'Magdalenenstraße 50, Rotherbaum, Hamburg']

## Problem
- the div tag of class font-ellipsis contains the addresses
- the span tag of class font-ellipsis contains the name of the one who inserted the ad 
- BS4 mixes these up: Sometimes I also get the names, sometmes not so it's not following any rule like 2nd element or something like before so I'll leave that out

# Getting Data for all consecutive pages
- I'll need a response code check so I know when the end of possible websites is reached 
- I'll use a loops to extract the data needed 
- as there are 20 ads per page I'll get the number of total ads to know how many consecutive pages ther'll be

## Link structure
- Page 2: https://www.immobilienscout24.de/Suche/S-T/P-2/Wohnung-Miete/Hamburg/Hamburg
- Page 3: https://www.immobilienscout24.de/Suche/S-T/P-3/Wohnung-Miete/Hamburg/Hamburg

## Getting number of pages to scrape

In [287]:
#Getting total number of ads to calculate how many pages there are 
totalAds = []
#totalAds.text doesn't work but inexplainable within a loop like seen before so I'll just do this although loop is stupid
#for one element 
for i in soup.find_all('span', {'class':'font-normal'}):
    totalAds.append(i.text)
totalAds = float(totalAds[0])

In [288]:
#As total # of Ads is stated as a float due to the fact that it's written a string with . as 1000 sign I'll do x1000
totalAds= float(totalAds*1000)
totalAds

1235.0

In [289]:
#As there's 20 ads per page we'll divide by 20 and round up and use math.ceil() (ceiling) function as rounding down 
#wouldn't make sense as you'd 'throw away' ads 
import math 
numberPages = math.ceil(totalAds/20)
numberPages

62

## Linklist for scraper to access all consecutive pages

In [290]:
linklist=[]
#Fill in the consecutive number inside the links
for i in range(numberPages):
    linklist.append('https://www.immobilienscout24.de/Suche/S-T/P-'+str(i)+'/Wohnung-Miete/Hamburg/Hamburg')
#Drop first two elements as they're not needed since second page starts with P-2
linklist = linklist[2:len(linklist)]
linklist[0:4]

['https://www.immobilienscout24.de/Suche/S-T/P-2/Wohnung-Miete/Hamburg/Hamburg',
 'https://www.immobilienscout24.de/Suche/S-T/P-3/Wohnung-Miete/Hamburg/Hamburg',
 'https://www.immobilienscout24.de/Suche/S-T/P-4/Wohnung-Miete/Hamburg/Hamburg',
 'https://www.immobilienscout24.de/Suche/S-T/P-5/Wohnung-Miete/Hamburg/Hamburg']

## getting All pages

In [291]:
#Just for testing 
if requests.get('https://www.google.com/maps').status_code==200:
    print ('Success')

Success


## Extracting size, price, number of rooms

In [300]:
#Same code parts as above, but with difference that I'll extract for all pages this time
infos2=[]
titles2=[]
for i in linklist:
    #Only if page is available, keep going:
    if requests.get(i).status_code==200:
        #Getting raw html
        page = requests.get(i)
        #Creating Parse Tree for raw HTML with BS4
        soup = BeautifulSoup(page.text,'html.parser')
        #Getting Price, Size and Rooms which are all stored into the <dd> tag of class 'font-nowrap font-line-xs'
        for x in soup.find_all("dd", {"class":"font-nowrap font-line-xs"}):
            infos2.append(x.text)
    
        #Find all titles of the ad which are all stored inside the <h5> tag of class
        for y in soup.find_all('h5', {'class':'result-list-entry__brand-title font-h6 onlyLarge nine-tenths font-ellipsis'}):
            titles2.append(y.text)

In [301]:
#Only prices, gt every 4th element out of the list, starting with 0th element
price2 = infos2[::3]
#Only size, get every 4th element out of list, starting with second element
size2 = infos2[1::3]
#Only number of rooms, get every 4th element out of list, starting with third element
rooms2 = infos2[2::3]

In [302]:
print(len(price2),len(size2),len(rooms2),len(titles2))

1199 1199 1199 1168


In [303]:
#As there seem to be titles missing I'll fill the missing values with '/'
for i in range(1199):
    if len(titles2)<1199:
        titles2.append('/')

In [304]:
print(len(price2),len(size2),len(rooms2),len(titles2))

1199 1199 1199 1199


## CAUTION: BUG IN THIS CODE CELL: Replacing size with price 

In [305]:
#Remove the € sign in price 
#price2 = [s.replace('€','')for s in price2]
#Remove spaces 
#price2 = [t.replace(' ','') for t in price2]
#remove m² in size
#size2 = [u.replace('m²','')for u in size2]
#Remove spaces 
#size2 = [v.replace(' ','') for v in price2]

In [306]:
#Second DataFrame for every Info > page 1, will merge both later 

#Store 3 Lists in a Dataframe 
df2 = pd.DataFrame({'price':price2, 'size':size2,'rooms':rooms2, 'titles':titles2})
df2.head()

Unnamed: 0,price,rooms,size,titles
0,955 €,2 Zi.2,"69,44 m²",NEUherrliche 2-Zimmerwohnung mit Weitblick Har...
1,1.300 €,3 Zi.3,"86,04 m²",NEU3-Zimmer-Wohnung in Hamburg - Sinstorf
2,1.440 €,4 Zi.4,"99,39 m²",NEULoftartige & sehr großzügige 1-Zimmer Wohnu...
3,"1.296,85 €",3 Zi.3,"100,22 m²",NEULoftartige & sehr großzügige 1-Zimmer Wohnu...
4,1.285 €,3 Zi.3,"86,83 m²",möblierte Wohnung mit Balkon/Endetage/ neue E...


In [307]:
#Merge both dataframes for one big dataset
bigData = df.append(df2)

In [308]:
bigData.tail()

Unnamed: 0,price,rooms,size,titles
1194,815 €,2 Zi.2,"85,29 m²",/
1195,695 €,3 Zi.3,"64,39 m²",/
1196,"1.275,59 €",3 Zi.3,"104,13 m²",/
1197,1.290 €,3 Zi.3,108 m²,/
1198,610 €,2 Zi.2,"63,1 m²",/


In [309]:
#Exporting as CSV
bigData.to_csv('flatsHH.csv', sep=',', encoding='utf-8')

# Data Cleaning
- Remove € sign from 'price' column 
- remove m2 from 'size' column 
- convert both columns to numeric 

In [310]:
bigData.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1219 entries, 0 to 1198
Data columns (total 4 columns):
price     1219 non-null object
rooms     1219 non-null object
size      1219 non-null object
titles    1219 non-null object
dtypes: object(4)
memory usage: 28.6+ KB


In [311]:
#bigData['price'] = bigData['price'].astype(float)