<img src="Homegate_Logo.png" width="200"/>

# Title: The Flat Hunter

Goal: Create a program that fetches the latest flat rental listings matching your criteria and stores the important data in a database for later analysis (for example, studying price fluctuations).

Involves: Web scraping, data cleaning with Python, visualization, API

Description: The idea is to write a program with the following functionalities:
- Get flat/house rental data from homegate.ch for the criteria you are interested in
- Store the raw data
- Clean up the data and filter it by keywords (for example 'view', 'bright', 'Attika', ...)
- Perform some analytics (for example, the average price)
- Visualize the results (for example, a histogram of rent prices)


Possible extensions:
- Set up an automatic reporting system, where running a single script produces a full PDF report on the new listings that match your criteria
- Get acquainted with NLTK, a natural language processing library, and basic word analysis
- Connect to the Google Maps API and implement a filter on the distance from a specific address
- Extend to other websites

Work Packages:

- Explore homegate.ch, list the data fields to extract, then write the notebook that extracts them. Save the data as a CSV file
- Write the notebook that cleans the data, filters it by keyword, and analyzes and plots the results
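
The keyword filter and the average-price step can be prototyped before any scraping exists. The sketch below is only an illustration, assuming each listing is a dict with a free-text `description` and a numeric `price`; those field names, the helper `matches_keywords`, and the sample listings are made up for this example and are not homegate.ch data.

```python
from statistics import mean


def matches_keywords(description, keywords):
    """Return True if any keyword occurs in the description (case-insensitive)."""
    text = description.lower()
    return any(keyword.strip().lower() in text for keyword in keywords)


# Hypothetical listings, only for illustration
listings = [
    {"description": "Bright flat with lake view", "price": 2500},
    {"description": "Cosy studio near the station", "price": 1500},
    {"description": "Attika apartment with terrace", "price": 4000},
]

keywords = ["view", "bright", "Attika"]
selected = [ad for ad in listings if matches_keywords(ad["description"], keywords)]
average_price = mean(ad["price"] for ad in selected)
print(len(selected), average_price)
```

This is plain substring matching, so 'view' would also match 'review'; a word-based match (for instance with NLTK tokenization, as mentioned in the extensions) may be preferable later.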


In [1]:
import requests
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup

# progress bar widget; the tqdm.notebook variant is meant for Jupyter notebooks
from tqdm.notebook import tqdm

from time import sleep
from datetime import datetime

# Uncomment the two lines below to hide warnings in the notebook. Use only if you are sure the warnings are harmless
#import warnings

#warnings.filterwarnings("ignore")

## Requests

In [2]:
link = "https://www.homegate.ch/rent/real-estate/matching-list?loc=geo-zipcode-8001%2Cgeo-zipcode-8050%2Cgeo-zipcode-8006%2Cgeo-zipcode-8008"

## Getting the webpage content using requests library

In [3]:
response = requests.get(link, timeout=15) #now all the information is stored in the Response object

In [4]:
response.status_code # 200 means success; 4xx and 5xx codes indicate an error

200

In [5]:
#response.content #whole content of the webpage without treatment

#### Parse response using BeautifulSoup

We can use the BeautifulSoup library to parse this document and extract texts from the HTML tags

In [6]:
soup = BeautifulSoup(response.content, "html.parser")

In [7]:
#soup.prettify()

In [8]:
# get whole text from soup

#soup.get_text()

In [9]:
# get all `a` tags

all_a = soup.find_all("a") # findAll is the legacy alias of find_all
# all_a[8].get_text()
# for a in all_a[:4]:
#     print(a.get_text())

#### Go to web page and inspect it to get a flat ad box

In [10]:
one_address = soup.find("p", string=True).text  # string= replaces the older text= argument
one_address

'32 Mainaustrasse, 8008 Zurich'

In [11]:
one_price = soup.find("span", {"class":"ListItemPrice_price_1o0i3"}).text
one_price

'CHF 2,280.–'

In [12]:
one_space = soup.find("span", {"class":"ListItemLivingSpace_value_2zFir"}).text
one_space

'43m2'

In [13]:
one_rooms = soup.find("span", {"class":"ListItemRoomNumber_value_Hpn8O"}).text
one_rooms

'1.5rm'

#### Get flat link

In [14]:
link_flat = "https://homegate.ch"+soup.find("a", {"data-test":"result-list-item"}).get("href")
link_flat

'https://homegate.ch/rent/3000909385'

In [15]:
# for span in room.find_all("span"):
#     print(span.text)

### Putting all together: Extend getting information from one flat to all flats in a page

##### Extend getting information from one page, to all pages

In [16]:
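# Get the number of result pages from the pagination text
pagination_text = soup.find("div", {"class": "ResultListPage_paginationHolder_3XZql"}).text
first_token = pagination_text.split()[0]
# the last page number follows the '...' in the pagination text
max_pages = first_token.split('...')[1]
max_pages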


'16'
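
The split-based extraction above raises an `IndexError` as soon as the pagination text contains no `...`. A slightly more defensive variant is sketched below; the helper name `parse_max_pages` and the sample strings are hypothetical, and the only assumption about the page is the one the cell above already makes, namely that the last page number directly follows the last `...`.

```python
import re


def parse_max_pages(pagination_text):
    """Return the page number that follows the last '...' in the text, or None."""
    _, sep, tail = pagination_text.rpartition("...")
    if not sep:
        return None  # no ellipsis found: the layout differs from the assumption
    match = re.match(r"\s*(\d+)", tail)
    return int(match.group(1)) if match else None


print(parse_max_pages("12...16"))  # hypothetical pagination text
```

Returning `None` instead of raising lets the caller decide whether to fall back to a single page or to stop.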

https://www.homegate.ch/rent/real-estate/matching-list?loc=geo-zipcode-8001%2Cgeo-zipcode-8050%2Cgeo-zipcode-8006%2Cgeo-zipcode-8008

#### Defining the link

In [17]:
link_first_part = "https://www.homegate.ch"
link_mid_1_part = "/rent/real-estate/matching-list?loc=geo-zipcode-"
link_more_part = "%2Cgeo-zipcode-"

In [18]:
# for page in tqdm(range(1, int(max_pages) + 1)):
#     """Make the urls dynamic"""
#     url = (
#         link_first_part
#         + link_mid_1_part
#         + str(page)
#         + link_mid_2_part
#         + "Data%20Engineer"
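
The draft loop above is unfinished. One possible way to build the search URL from a list of zip codes is sketched here. The `loc` part reproduces the link used at the top of the notebook. The page parameter `ep` is an assumption that has to be checked against the real site by clicking through the result pages, and `build_search_url` is a hypothetical helper name.

```python
BASE_URL = "https://www.homegate.ch"
SEARCH_PATH = "/rent/real-estate/matching-list"


def build_search_url(zip_codes, page=1):
    """Build the result-list URL for the given zip codes and page number."""
    # zip codes are joined with an URL-encoded comma (%2C), as in the original link
    loc = "%2C".join(f"geo-zipcode-{zip_code}" for zip_code in zip_codes)
    url = f"{BASE_URL}{SEARCH_PATH}?loc={loc}"
    if page > 1:
        url += f"&ep={page}"  # ASSUMPTION: 'ep' is the page parameter, verify on the site
    return url


print(build_search_url([8001, 8050, 8006, 8008]))
```

Inside the page loop, each URL would be fetched with `requests.get` and followed by a short `sleep` to avoid sending requests too quickly.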

#### Modularizing the process

In [18]:
def get_flatdata_page(soup):
    """Collect address, price, space, rooms and link of every flat ad on one result page."""
    cols = ["Address", "Price", "Space", "Rooms", "flat_link"]

    def text_or_empty(tag):
        # find() returns None when an ad lacks the element
        return tag.text if tag is not None else ""

    rows = []
    all_flats = soup.find("div", {"data-test": "result-list"})
    # direct child tags only: this skips any text nodes between the ad boxes
    for flat_ad in all_flats.find_all(True, recursive=False):
        link_tag = flat_ad.find("a", {"data-test": "result-list-item"})
        if link_tag is None:
            continue  # this child is not a flat ad
        rows.append(
            {
                "Address": text_or_empty(flat_ad.find("p", string=True)),
                "Price": text_or_empty(flat_ad.find("span", {"class": "ListItemPrice_price_1o0i3"})),
                "Space": text_or_empty(flat_ad.find("span", {"class": "ListItemLivingSpace_value_2zFir"})),
                "Rooms": text_or_empty(flat_ad.find("span", {"class": "ListItemRoomNumber_value_Hpn8O"})),
                "flat_link": link_first_part + link_tag.get("href"),
            }
        )
    # DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
    return pd.DataFrame(rows, columns=cols)

In [20]:
# test function
df_page = get_flatdata_page(soup)

In [21]:
df_page

| | Address | Price | Space | Rooms | flat_link |
|---|---|---|---|---|---|
| 0 | Zaehringerstrasse 26, 8001 Zurich | CHF 1,550.– | 100m2 | 4.5rm | https://www.homegate.ch/rent/3000908194 |
| 1 | 32 Mainaustrasse, 8008 Zurich | CHF 2,280.– | 43m2 | 1.5rm | https://www.homegate.ch/rent/3000909385 |
| 2 | Mühlebachstrasse, 8008 Zurich | CHF 3,450.– | 60m2 | 2.5rm | https://www.homegate.ch/rent/3000751316 |
| 3 | Trittligasse, 8001 Zürich | CHF 1,700.– | 140m2 | 4.5rm | https://www.homegate.ch/rent/3000895367 |
| 4 | Storchengasse 14, 8001 Zürich | CHF 1,780.– | 34m2 | 1rm | https://www.homegate.ch/rent/3000899580 |
| 5 | Rindermarkt 12, 8001 Zürich | CHF 2,600.– | 56m2 | 2.5rm | https://www.homegate.ch/rent/3000661119 |
| 6 | Rämistrasse 44, 8001 Zürich | CHF 4,100.– | | 3.5rm | https://www.homegate.ch/rent/3000886313 |
| 7 | Spiegelgasse 13, 8001 Zürich | CHF 5,170.– | 121m2 | 2.5rm | https://www.homegate.ch/rent/3000814335 |
| 8 | In Gassen 14, 8001 Zürich | CHF 7,200.– | 173m2 | 4.5rm | https://www.homegate.ch/rent/3000900478 |
| 9 | Clausiusstrasse 68, 8006 Zürich | CHF 2,670.– | 58m2 | 2.5rm | https://www.homegate.ch/rent/3000867312 |
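
The scraped values are still strings such as 'CHF 2,280.–', '43m2' and '1.5rm', so they need to be converted to numbers before any analytics. The helpers below are a minimal sketch of that cleaning step, assuming the formats stay as in the rows shown here; the function names are hypothetical, and empty strings (as in the missing 'Space' value) are mapped to `None`.

```python
import re


def parse_price(text):
    """'CHF 2,280.–' -> 2280; None when the text holds no number."""
    match = re.search(r"\d[\d,']*", text)
    if match is None:
        return None
    # drop thousands separators before converting
    return int(match.group(0).replace(",", "").replace("'", ""))


def parse_space(text):
    """'43m2' -> 43.0; the trailing '2' of 'm2' is not part of the number."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*m", text)
    return float(match.group(1)) if match else None


def parse_rooms(text):
    """'1.5rm' -> 1.5; None when the text holds no number."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


print(parse_price("CHF 2,280.–"), parse_space("43m2"), parse_rooms("1.5rm"))
```

On the dataframe, these could be applied column by column, for example `df_page["Price"].map(parse_price)`.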


# Selenium

In [28]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

In [29]:
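from selenium.webdriver.chrome.service import Service

# raw string, so the backslashes in the Windows path are not read as escape sequences
path = r"C:\Program Files (x86)\chromedriver.exe"
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(path))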
driver.get("https://www.homegate.ch/rent/real-estate/city-zurich/matching-list")
print(driver.title)

Apartment & house for rent in Zurich (Zürich) | homegate.ch


In [3]:
# try:
#     main = WebDriverWait(driver, 10).until(
#         EC.presence_of_element_located((By.CLASS_NAME, "ResultListPage_stickyParent_2d4Bp"))
#     )
#     print(main.text)
# except:
#     driver.quit()

1133 results
-
Apartment & house for rent in Zurich (Zürich)
Sort by: Top offers
List
Map
12
CHF 1,570.–22m21.5rmTop
Zentralstrasse 50-60, 8003 Zurich
Kleine aber feine Wohnung in Wiedikon mit vielen Komforts.Die Wohnung befindet sich auf der Zentralstrasse, im Herzen Wiedikons und 7 Geh-Minuten entfernt vom Bahnhof Wiedikon. Coop, Migros, Post, Apotheke und Banken ebenfalls in max 7 Minuten zu Fuss erreichbar.Die Küche verfügt über Kühlschrack mittlerer Grösse mit integriertem Gefrierfach, Geschirrspüler, Kochherd und Micro/Grill-Kombigerät alles der Marke V-Zug.Bad mit Duschkabine.Zur Wohnung gehört Abstellraum 4,7 qm im Estrichbereich.Waschmaschinen und Tumbler zur Mitbenützung im Keller. Das Gebäude verfügt über Lift, und wird von der Livit AG verwaltet.Zur Besichtigung werden folgende fixe Termine angeboten: • Freitag, 12. März zwischen 18.30-20.00 Uhr • Samstag, 13. März zwischen 10.30-12.00 Uhr • Montag, 15. März zwischen 18.30-20.00 Uhr Besichtigung nur mit Maske und ohne Voran

In [130]:
driver.quit()

In [133]:
driver.back()

### Printing elements

In [129]:
# Collect the detail-page URLs first: the elements of the result list
# become stale once the driver navigates to another page.
# find_elements_by_xpath was removed in Selenium 4; use find_elements(By.XPATH, ...)
main = driver.find_elements(By.XPATH, '//*[@class="ListItemTopPremium_itemLink_11yOE ResultList_ListItem_3AwDq"]')
urls = [i.get_attribute('href') for i in main]
wait = WebDriverWait(driver, 10)  # wait up to 10 seconds for each element

for url in urls:
    driver.get(url)
    try:
        address = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "address.AddressDetails_address_3Uq1m"))
        )
        print("\nAddress:", address.text)

        Listing_ID = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "dl.ListingTechReferences_techReferencesList_3qCPT"))
        )
        print("\nListing_ID:", Listing_ID.text)

        price = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.SpotlightAttributes_value_2njuM"))
        )
        print("\nPrice:", price.text)

        info = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.CoreAttributes_coreAttributes_2UrTf"))
        )
        print("\ninfo:", info.text)

        # the first <dd> on the page holds the availability date
        availability = wait.until(
            EC.presence_of_element_located((By.TAG_NAME, "dd"))
        )
        print("\navailability:", availability.text)

        features = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "ul.FeaturesFurnishings_list_1HzQj"))
        )
        print("\nfeatures:", features.text)
    except Exception:
        # an element did not appear in time or the page has a different layout
        print('not possible')
    finally:
        driver.back()


Address: Zentralstrasse 50-60, 8003 Zurich

Listing_ID: Listing ID
3000945478
Object ref.
6784e.hdpjo

Price: CHF 1,570.–

info: Main information
Type:
Apartment
No. of rooms:
1.5
Floor:
4
Number of floors:
6
Surface living:
22 m2
Room height:
2.4 m
Last refurbishment:
2016

availability: 01.04.2021

features: Pets allowed
Cable TV
Elevator
Minergie certified

Address: 32 Mainaustrasse, 8008 Zurich

Listing_ID: Listing ID
3000909385
Object ref.
gyh75.05bml

Price: CHF 2,280.–

info: Main information
Type:
Apartment
No. of rooms:
1.5
Floor:
1
Surface living:
43 m2
Land area:
43 m2
Last refurbishment:
2016

availability: 01.04.2021

features: Quiet neighborhood
Old building

Address: Mühlebachstrasse, 8008 Zurich

Listing_ID: Listing ID
3000751316
Object ref.
k3ykr.85r3z

Price: CHF 3,450.–

info: Main information
Type:
Apartment
No. of rooms:
2.5
Floor:
3
Number of floors:
4
Surface living:
60 m2
Floor space:
60 m2
Last refurbishment:
2018

availability: Immediately

features: Pets all
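
The `info` text printed above alternates between a label line ending in `:` and a value line, after a 'Main information' header. A minimal sketch of turning that text into a dict, so that fields such as 'Floor' or 'Last refurbishment' can later become dataframe columns, could look as follows. The helper name `parse_info_block` is hypothetical, and the sketch assumes every value fits on a single line, as in the output shown.

```python
def parse_info_block(text):
    """Turn alternating 'Label:' / value lines into a {label: value} dict."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    info = {}
    label = None
    for line in lines:
        if line.endswith(":"):
            label = line[:-1]  # remember the label, drop the trailing colon
        elif label is not None:
            info[label] = line
            label = None
        # lines without a preceding label (the header) are ignored
    return info


sample = "Main information\nType:\nApartment\nNo. of rooms:\n1.5\nFloor:\n4\nSurface living:\n22 m2"
print(parse_info_block(sample))
```

Labels that are missing for a listing (for example 'Room height') are simply absent from the dict, so `info.get("Room height")` returns `None` for them.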

In [131]:
# for the first ad only
# try:
#     link = WebDriverWait(driver, 10).until(
#         EC.presence_of_element_located((By.XPATH, "/html/body/div[1]/main/div[2]/div/div[3]/div[2]/div[1]/a"))
#     )
#     print("\nlink:",link.get_attribute('href'))
#     link.click()
    
#     address = WebDriverWait(driver, 10).until(
#         EC.presence_of_element_located((By.CSS_SELECTOR, "address.AddressDetails_address_3Uq1m"))
#     )
#     print("\nAddress:",address.text)
    
#     Listing_ID = WebDriverWait(driver, 10).until(
#         EC.presence_of_element_located((By.CSS_SELECTOR, "dl.ListingTechReferences_techReferencesList_3qCPT"))
#     )
#     print("\nListing_ID:",Listing_ID.text)
    
#     Listing_ID = WebDriverWait(driver, 10).until(
#         EC.presence_of_element_located((By.CSS_SELECTOR, "div.SpotlightAttributes_value_2njuM"))
#     )
#     print("\nPrice:",Listing_ID.text)
    
#     info = WebDriverWait(driver, 10).until(
#         EC.presence_of_element_located((By.CSS_SELECTOR, "div.CoreAttributes_coreAttributes_2UrTf"))
#     )
#     print("\ninfo:",info.text)
    
#     availability = WebDriverWait(driver, 10).until(
#         EC.presence_of_element_located((By.TAG_NAME, "dd"))
#     )
# #     By.XPATH, "/html/body/div[1]/main/div[2]/div/div[1]/div[1]/div[1]/div[3]/section[1]/div[2]/div[1]/dl/dd"
#     print("\navailability:",availability.text)
    
#     features = WebDriverWait(driver, 10).until(
#         EC.presence_of_element_located((By.CSS_SELECTOR, "ul.FeaturesFurnishings_list_1HzQj"))
#     )
#     print("\nfeatures:",features.text)
# except:
#     print('not possible')
# #     driver.quit()

## To transform from BeautifulSoup

In [None]:
def get_flatdata_page(soup):
    """Collect the result-list fields of every flat ad; detail-page columns stay empty for now."""
    cols = ["Listing_ID", "Address", "Price", "Space", "Rooms", "No.Rooms", "Availability", "Type", 
            "Floor", "No.Floors", "Room_Height", "Last_refurbishment", "Balcony/Terrace", "Pets", 
            "Elevator", "Minergie", "Wheelchair_access", "New_building", "flat_link", "Features"]

    def text_or_empty(tag):
        # find() returns None when an ad lacks the element
        return tag.text if tag is not None else ""

    rows = []
    all_flats = soup.find("div", {"data-test": "result-list"})
    # direct child tags only: this skips any text nodes between the ad boxes
    for flat_ad in all_flats.find_all(True, recursive=False):
        link_tag = flat_ad.find("a", {"data-test": "result-list-item"})
        if link_tag is None:
            continue  # this child is not a flat ad
        rows.append(
            {
                "Address": text_or_empty(flat_ad.find("p", string=True)),
                "Price": text_or_empty(flat_ad.find("span", {"class": "ListItemPrice_price_1o0i3"})),
                "Space": text_or_empty(flat_ad.find("span", {"class": "ListItemLivingSpace_value_2zFir"})),
                "Rooms": text_or_empty(flat_ad.find("span", {"class": "ListItemRoomNumber_value_Hpn8O"})),
                "flat_link": link_first_part + link_tag.get("href"),
            }
        )
    # columns that are not filled above are created with missing values (NaN);
    # DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
    return pd.DataFrame(rows, columns=cols)
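
The extended column list contains yes/no columns such as 'Pets', 'Elevator' and 'Minergie', which could be derived from the `features` text printed in the Selenium section. The sketch below is one possible mapping, not a finished implementation: the helper name `features_to_flags` is hypothetical, and only the three labels that actually appear in the output above are mapped. The labels for the remaining columns ('Balcony/Terrace', 'Wheelchair_access', 'New_building') are not known from this notebook and would have to be looked up on the site.

```python
# column name -> feature label as printed on the detail page (lower-cased)
FEATURE_COLUMNS = {
    "Pets": "pets allowed",
    "Elevator": "elevator",
    "Minergie": "minergie certified",
}


def features_to_flags(features_text):
    """Map the multi-line features text to one boolean per known column."""
    present = {line.strip().lower() for line in features_text.splitlines()}
    return {column: label in present for column, label in FEATURE_COLUMNS.items()}


print(features_to_flags("Pets allowed\nCable TV\nElevator\nMinergie certified"))
```

The returned dict can be merged into the row dict of a listing before the dataframe is built.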