# Immoscout24.de Scraper

A script for dumping (writing to `.csv`) real-estate listings offered on [immoscout24.de](http://immoscout24.de).

In [5]:
from bs4 import BeautifulSoup
import json
import urllib.request as urllib2
import random
from random import choice
import time

In [7]:
# urlquery from Achim Tack. Thank you!
# https://github.com/ATack/GoogleTrafficParser/blob/master/google_traffic_parser.py
def urlquery(url):
    # function cycles randomly through different user agents and time intervals to simulate more natural queries
    try:
        sleeptime = float(random.randint(1,6))/5
        time.sleep(sleeptime)

        agents = ['Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17',
        'Mozilla/5.0 (compatible; MSIE 10.6; Windows NT 6.1; Trident/5.0; InfoPath.2; SLCC1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET CLR 2.0.50727) 3gpp-gba UNTRUSTED/1.0',
        'Opera/12.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.02',
        'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
        'Mozilla/3.0',
        'Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3',
        'Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3',
        'Opera/9.00 (Windows NT 5.1; U; en)']

        agent = choice(agents)
        opener = urllib2.build_opener()
        opener.addheaders = [('User-agent', agent)]

        html = opener.open(url).read()
        time.sleep(sleeptime)
        
        return html

    except Exception as e:
        print('Something went wrong with Crawling:\n%s' % e)
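
The two ideas in `urlquery`, a randomly chosen user agent and a randomized pause, can be tried out without touching the network. The following is a minimal sketch, not part of the original notebook: `build_request`, `random_delay` and the shortened `AGENTS` list are hypothetical names introduced only for illustration.

```python
import random
import urllib.request

# Hypothetical, shortened agent list for illustration only
AGENTS = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17',
    'Opera/9.00 (Windows NT 5.1; U; en)',
]

def build_request(url, agents=AGENTS, rng=random):
    """Return a urllib Request carrying a randomly chosen User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': rng.choice(agents)})

def random_delay(rng=random, low=1, high=6, divisor=5):
    """Return a delay in seconds, drawn like the sleeptime in urlquery (0.2 to 1.2 s)."""
    return rng.randint(low, high) / divisor

# Building the request does not send anything yet
req = build_request('http://www.immobilienscout24.de/')
print(req.get_header('User-agent'))
print(random_delay())
```

Passing such a request to `urllib.request.urlopen` would send it; the sketch stops before that step on purpose.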

In [9]:
def immoscout24parser(url):

    ''' Parser extracts the listings from an Immoscout24.de search result page '''

    try:
        soup = BeautifulSoup(urlquery(url), 'html.parser')
        scripts = soup.find_all('script')
        for script in scripts:
            # script.string is None for empty or external scripts
            if script.string and 'IS24.resultList' in script.string:
                for line in script.string.split('\n'):
                    line = line.strip()
                    if line.startswith('resultListModel'):
                        # Take everything after the first colon; str.strip() with an
                        # argument removes a set of characters, not a prefix
                        resultListModel = line.partition(':')[2].strip()
                        # Drop the trailing comma that follows the JSON object
                        immo_json = json.loads(resultListModel.rstrip(','))

                        searchResponseModel = immo_json[u'searchResponseModel']
                        resultlist_json = searchResponseModel[u'resultlist.resultlist']

                        return resultlist_json

    except Exception as e:
        print("Error in immoscout24 parser: %s" % e)
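
To see how the JSON payload is cut out of the script line, the extraction step can be run on a small, made-up script body. This is only a sketch: `SAMPLE_SCRIPT` imitates the structure the parser looks for and is not real page content, and `extract_result_list` is a hypothetical helper name.

```python
import json

# Made-up script content that imitates the structure the parser expects;
# the real page payload is much larger and may look different.
SAMPLE_SCRIPT = '''
IS24.resultList = {
    resultListModel: {"searchResponseModel": {"resultlist.resultlist": {"paging": {"pageNumber": 1, "numberOfPages": 9}}}},
    otherKey: true
};
'''

def extract_result_list(script_text):
    """Return the 'resultlist.resultlist' object from a script body, or None."""
    for line in script_text.split('\n'):
        line = line.strip()
        if line.startswith('resultListModel'):
            # Everything after the first colon is the JSON payload plus a trailing comma
            payload = line.partition(':')[2].strip().rstrip(',')
            model = json.loads(payload)
            return model['searchResponseModel']['resultlist.resultlist']
    return None

result = extract_result_list(SAMPLE_SCRIPT)
print(result['paging']['numberOfPages'])  # 9
```

Splitting at the first colon avoids relying on `str.strip` with a character set, which removes any of the given characters from both ends rather than a fixed prefix.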

## Main Loop

Walks through the search result pages for the selected combination of property type (`Wohnung` or `Haus`) and offer type (`Miete` or `Kauf`) and collects the data.
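
The paging idea can be sketched independently of the website. The example below is an illustration under assumptions, not the notebook's code: `fetch_page` is a hypothetical callable that returns a dict with `numberOfPages` and `entries`, or `None` when a request failed, and the number of retries is bounded instead of looping until success.

```python
def scrape_all_pages(fetch_page, max_retries=3):
    """Collect entries from page 1 up to the page count reported by the results.

    fetch_page(page) is a hypothetical callable returning a dict with
    'numberOfPages' and 'entries', or None when the request failed.
    """
    entries = []
    page = 1
    number_of_pages = 1
    while page <= number_of_pages:
        result = None
        for _ in range(max_retries):
            result = fetch_page(page)
            if result is not None:
                break
        if result is None:
            raise RuntimeError('page %d failed %d times' % (page, max_retries))
        number_of_pages = result['numberOfPages']
        entries.extend(result['entries'])
        page += 1
    return entries

# Stub that fails on every first attempt, to exercise the retry path
calls = {}
def flaky_fetch(page):
    calls[page] = calls.get(page, 0) + 1
    if calls[page] == 1:
        return None
    return {'numberOfPages': 3, 'entries': ['p%d-a' % page, 'p%d-b' % page]}

listings = scrape_all_pages(flaky_fetch)
print(len(listings))  # 6
```

A bounded retry count trades completeness for the guarantee that the loop terminates when the site stays unreachable.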

In [31]:
immos = {}

b = 'Sachsen' # federal state (Bundesland)
s = 'Dresden' # city
k = 'Haus' # 'Wohnung' (apartment) or 'Haus' (house)
w = 'Kauf' # 'Miete' (rent) or 'Kauf' (purchase)

page = 0
print('Suche %s / %s' % (k, w))

while True:
    page+=1
    url = 'http://www.immobilienscout24.de/Suche/S-T/P-%s/%s-%s/%s/%s?pagerReporting=true' % (page, k, w, b, s)

    # Because of some timeout or immoscout24.de errors,
    # we try until it works \o/
    resultlist_json = None
    while resultlist_json is None:
        try:
            resultlist_json = immoscout24parser(url)
            numberOfPages = int(resultlist_json[u'paging'][u'numberOfPages'])
            pageNumber = int(resultlist_json[u'paging'][u'pageNumber'])
        except (TypeError, KeyError, ValueError):
            # Parser returned None or the paging block is incomplete: retry
            resultlist_json = None

    if page>numberOfPages:
        break

    # Get the data
    for resultlistEntry in resultlist_json['resultlistEntries'][0][u'resultlistEntry']:
        realEstate_json = resultlistEntry[u'resultlist.realEstate']
        
        realEstate = {}

        realEstate[u'Miete/Kauf'] = w
        realEstate[u'Haus/Wohnung'] = k

        realEstate['address'] = realEstate_json['address']['description']['text']
        realEstate['city'] = realEstate_json['address']['city']
        realEstate['postcode'] = realEstate_json['address']['postcode']
        realEstate['quarter'] = realEstate_json['address']['quarter']
        try:
            realEstate['lat'] = realEstate_json['address'][u'wgs84Coordinate']['latitude']
            realEstate['lon'] = realEstate_json['address'][u'wgs84Coordinate']['longitude']
        except (KeyError, TypeError):
            # Not every listing publishes coordinates
            realEstate['lat'] = None
            realEstate['lon'] = None
            
        realEstate['title'] = realEstate_json['title']

        realEstate['numberOfRooms'] = realEstate_json['numberOfRooms']
        realEstate['livingSpace'] = realEstate_json['livingSpace']
        
        if k=='Wohnung':
            realEstate['balcony'] = realEstate_json['balcony']
            realEstate['builtInKitchen'] = realEstate_json['builtInKitchen']
            realEstate['garden'] = realEstate_json['garden']
        elif k=='Haus':
            realEstate['isBarrierFree'] = realEstate_json['isBarrierFree']
            realEstate['cellar'] = realEstate_json['cellar']
            realEstate['plotArea'] = realEstate_json['plotArea']

        # Price and offer type are read the same way for both property types
        realEstate['price'] = realEstate_json['price']['value']
        realEstate['privateOffer'] = realEstate_json['privateOffer']
        
        realEstate['floorplan'] = realEstate_json['floorplan']
        realEstate['from'] = realEstate_json['companyWideCustomerId']
        realEstate['ID'] = realEstate_json[u'@id']
        realEstate['url'] = u'https://www.immobilienscout24.de/expose/%s' % realEstate['ID']

        immos[realEstate['ID']] = realEstate

    print('Scrape Page %i/%i (%i Immobilien %s %s gefunden)' % (page, numberOfPages, len(immos), k, w))

Suche Haus / Kauf
Scrape Page 1/9 (20 Immobilien Haus Kauf gefunden)
Scrape Page 2/9 (40 Immobilien Haus Kauf gefunden)
Scrape Page 3/9 (60 Immobilien Haus Kauf gefunden)
Scrape Page 4/9 (80 Immobilien Haus Kauf gefunden)
Scrape Page 5/9 (100 Immobilien Haus Kauf gefunden)
Scrape Page 6/9 (120 Immobilien Haus Kauf gefunden)
Scrape Page 7/9 (140 Immobilien Haus Kauf gefunden)
Scrape Page 8/9 (160 Immobilien Haus Kauf gefunden)
Scrape Page 9/9 (180 Immobilien Haus Kauf gefunden)
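
The loop sets `lat` and `lon` to `None` when a listing has no coordinates. One way to express such optional lookups without `try`/`except` is a small helper that walks nested dicts. This is a sketch with made-up sample entries; `dig` is a hypothetical name and not part of the notebook.

```python
def dig(data, *keys, default=None):
    """Follow keys through nested dicts; return default if any level is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data

# Made-up entries: one with coordinates, one without
with_coords = {'address': {'city': 'Dresden', 'wgs84Coordinate': {'latitude': 51.035, 'longitude': 13.677}}}
without_coords = {'address': {'city': 'Dresden'}}

print(dig(with_coords, 'address', 'wgs84Coordinate', 'latitude'))     # 51.035
print(dig(without_coords, 'address', 'wgs84Coordinate', 'latitude'))  # None
```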


In [32]:
print("Scraped %i Immos" % len(immos))

Scraped 180 Immos


## Data Preparation & Cleaning

The collected data is converted into a clean format that can also be read with Excel, for example. In addition, the results are pseudonymized, i.e. the providers are given unique numbers instead of their real names.
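
Pseudonymization in this sense means replacing each distinct provider name with a unique number. A minimal sketch of that mapping follows; the provider names are made up and `pseudonymize` is a hypothetical helper (in the notebook itself the `from` column simply holds the `companyWideCustomerId` delivered by the site).

```python
def pseudonymize(names):
    """Map each distinct name to a stable integer, in order of first appearance."""
    ids = {}
    # setdefault returns the existing number or stores the next free one
    return [ids.setdefault(name, len(ids) + 1) for name in names]

# Made-up provider names for illustration
providers = ['Makler A', 'Makler B', 'Makler A', 'Privat']
print(pseudonymize(providers))  # [1, 2, 1, 3]
```

The same name always maps to the same number within one run, so listings from one provider stay linkable without exposing the name.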

In [33]:
from datetime import datetime
timestamp = datetime.now().strftime('%Y-%m-%d-%H-%M')

In [34]:
import pandas as pd

In [35]:
df = pd.DataFrame(immos).T
df.index.name = 'ID'

In [36]:
len(df)

180

In [37]:
df.head()

| ID | Haus/Wohnung | ID | Miete/Kauf | address | cellar | city | floorplan | from | isBarrierFree | lat | livingSpace | lon | numberOfRooms | plotArea | postcode | price | privateOffer | quarter | title | url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 99645205 | Haus | 99645205 | Kauf | Loschwitz/Wachwitz, Dresden | True | Dresden | False | 1.293446 | False | | 315.0 | | 8.0 | 2290 | 1326 | 1800000 | False | Loschwitz/Wachwitz | www.r-o.de+++ Denkmalgeschützte Villa in Bestl... | https://www.immobilienscout24.de/expose/99645205 |
| 105154327 | Haus | 105154327 | Kauf | Weißig, Dresden | False | Dresden | True | 2.01002840387 | False | | 237.0 | | 6.0 | 1532 | 1328 | 752395 | False | Weißig | Architektur, die verbindet! - Wohnen im Schöne... | https://www.immobilienscout24.de/expose/105154327 |
| 106445307 | Haus | 106445307 | Kauf | Cossebaude/Mobschatz/Oberwartha, Dresden | False | Dresden | True | 2.01008311556 | False | | 160.36 | | 5.0 | 750 | 1156 | 272706 | False | Cossebaude/Mobschatz/Oberwartha | Familiengerechtes bauen mit Massa Haus in Brab... | https://www.immobilienscout24.de/expose/106445307 |
| 106587371 | Haus | 106587371 | Kauf | Pohrsdorfer Weg 1, Naußlitz, Dresden | False | Dresden | False | 1.637078 | False | 51.035 | 114.0 | 13.677 | 4.5 | 178 | 1169 | 339900 | False | Naußlitz | 30.08.2018 Vorstellung in DD-Roßthal nach Vere... | https://www.immobilienscout24.de/expose/106587371 |
| 102842625 | Haus | 102842625 | Kauf | Bühlau/Weißer Hirsch, Dresden | False | Dresden | False | 1.118256 | False | | 295.88 | | 10.0 | 780 | 1324 | 1000000 | False | Bühlau/Weißer Hirsch | Loschwitz - interessante Villa in Bestlage! Al... | https://www.immobilienscout24.de/expose/102842625 |


## Dump Everything

In [38]:
# The encoding is set on the file handle; newline='' leaves line endings to the CSV writer
with open('%s-%s-%s.csv' % (timestamp, k, w), 'w', encoding='utf-8', newline='') as f:
    f.write('# %s %s from immoscout24.de on %s\n' % (k,w,timestamp))
    df[(df['Haus/Wohnung']==k) & (df['Miete/Kauf']==w)].to_csv(f)
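
Because the dump starts with a `#` comment line before the CSV header, a reader has to skip that first line. The following sketch shows one way to do this with the standard library; the file content, timestamp and values are made up, and `read_dump` is a hypothetical helper name.

```python
import csv
import io

# Made-up file content in the same layout: one comment line, then the CSV
content = (
    '# Haus Kauf from immoscout24.de on 2018-08-30-12-00\n'
    'ID,city,price\n'
    '1,Dresden,100000\n'
    '2,Dresden,250000\n'
)

def read_dump(handle):
    """Return (comment, rows) from a dump whose first line is a '#' comment."""
    # Drop the leading '#' and surrounding whitespace from the first line
    comment = handle.readline()[1:].strip()
    # DictReader continues after the comment and uses the next line as header
    rows = list(csv.DictReader(handle))
    return comment, rows

comment, rows = read_dump(io.StringIO(content))
print(comment)
print(rows[0]['city'], rows[1]['price'])
```

With pandas, passing `skiprows=1` to `pd.read_csv` should have the same effect of skipping the comment line.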

In [39]:
df.to_excel('%s-%s-%s.xlsx' % (timestamp, k, w))

Questions? [@Balzer82](https://twitter.com/Balzer82)