<img src="Homegate_Logo.png" width="200"/>

# Title: The Flat Hunter

Goal: Create a program that gets the latest flat rental postings that fit your criteria and stores the important data in a database for future analytics (for example, studying price fluctuations).

Involves: Web scraping, data cleaning with Python, visualization, API

Description: The idea is to write a program with the following functionalities:
- Get data from homegate.ch on the flat/house rental for the criteria of your interest
- Store the raw data
- Clean up the data and filter it according to keywords (for example 'view', 'bright', 'Attika', ...)
- Perform some analytics (average price for example)
- Visualize the results (histogram of rent prices for example) 
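
The keyword filter and the average-price step from the list above can be sketched in a few lines. The listings, the `description` field and the helper `matches_keywords` are hypothetical placeholders, not data or names taken from homegate.ch. The sketch uses plain Python so that it runs without the scraped data.

```python
from statistics import mean

# Hypothetical listings; real rows would come from the scraped data.
listings = [
    {"description": "Bright Attika with lake view", "price": 3200.0},
    {"description": "Cosy studio near the station", "price": 1500.0},
    {"description": "Bright 2.5-room flat", "price": 2400.0},
]

keywords = ["view", "bright", "attika"]


def matches_keywords(text, keywords):
    """Return True if any keyword occurs in the text, ignoring case."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


# Keep only the flats whose description mentions at least one keyword.
selected = [flat for flat in listings if matches_keywords(flat["description"], keywords)]
average_price = mean(flat["price"] for flat in selected)
print(len(selected), average_price)
```

This is plain substring matching, so 'view' would also match 'review'. Splitting the description into words first would be a stricter alternative.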


Possible extensions:
- Set up an automatic reporting system, where you just need to run one script to get a full PDF report on the new listings available for your criteria
- Get acquainted with NLTK, a natural language processing library, and basic word analysis
- Connect to the Google Maps API and implement a distance filter from a specific address
- Extend to other websites.
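
As a rough illustration of the distance filter, the sketch below uses the haversine formula for the straight-line distance between two coordinates. It does not call the Google Maps API: the coordinates would first have to come from a geocoding step, and travel distance differs from straight-line distance. The function names and the approximate coordinates are assumptions made for this example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def within_radius(flats, origin, max_km):
    """Keep the flats that lie at most max_km from the origin (lat, lon)."""
    return [
        flat
        for flat in flats
        if haversine_km(origin[0], origin[1], flat["lat"], flat["lon"]) <= max_km
    ]


# Hypothetical, approximate coordinates for illustration only.
origin = (47.378, 8.540)  # roughly Zurich main station
flats = [
    {"name": "near", "lat": 47.370, "lon": 8.545},
    {"name": "far", "lat": 46.948, "lon": 7.447},  # roughly Bern
]
nearby = within_radius(flats, origin, max_km=5)
print([flat["name"] for flat in nearby])
```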

Work Packages:

- Explore homegate.ch, make a list of data to extract and then write the notebook to extract them. Save the data as a csv file
- Write the notebook to clean the data, filter by keyword and analyze and plot the data 


In [1]:
import requests
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup

# tqdm.notebook draws a progress bar inside Jupyter notebooks
from tqdm.notebook import tqdm

from time import sleep
from datetime import datetime

# Uncomment the two lines below to hide warnings in the notebook. Only do this if you are sure they can be ignored.
#import warnings

#warnings.filterwarnings("ignore")

## Requests

In [2]:
link = "https://www.homegate.ch/rent/real-estate/matching-list?loc=geo-zipcode-8001%2Cgeo-zipcode-8050%2Cgeo-zipcode-8006%2Cgeo-zipcode-8008"

## Getting the webpage content using requests library

In [3]:
response = requests.get(link, timeout=15) #now all the information is stored in the Response object

In [4]:
response.status_code # 200 means success. Any 4xx (client error) or 5xx (server error) code means the request failed

200
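
Since anything other than 200 means the page was not delivered, it can help to retry on server errors before giving up. The helper below is a minimal sketch under that assumption. `fetch_with_retry` is a hypothetical name, and `fetch` is any callable that returns an object with a `status_code` attribute, so the logic can be tried without touching the network.

```python
from time import sleep


def fetch_with_retry(fetch, retries=3, delay=1.0):
    """Call fetch() until it returns status 200 or the retries run out."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last = None
    for attempt in range(retries):
        last = fetch()
        if last.status_code == 200:
            return last
        if 400 <= last.status_code < 500:
            break  # most client errors do not go away by retrying
        if attempt < retries - 1:
            sleep(delay)  # wait before the next attempt
    raise RuntimeError(f"request failed with status {last.status_code}")
```

In this notebook it could be called as `fetch_with_retry(lambda: requests.get(link, timeout=15))`.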

In [5]:
#response.content #whole content of the webpage without treatment

#### Parse response using BeautifulSoup

We can use the BeautifulSoup library to parse this document and extract the text inside the HTML tags.

In [6]:
soup = BeautifulSoup(response.content, "html.parser")
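
To see how `find` behaves without requesting the live page, the sketch below parses a small, made-up HTML snippet. The markup only imitates a result list in simplified form, and the class name `price` is invented for the example, so none of it should be read as homegate.ch's real structure.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for illustration only.
html = """
<div data-test="result-list">
  <a data-test="result-list-item" href="/rent/1">
    <p>Example street 1, 8001 Zurich</p>
    <span class="price">CHF 1,000.–</span>
  </a>
</div>
"""

demo_soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
first_ad = demo_soup.find("a", {"data-test": "result-list-item"})
href = first_ad.get("href")
address = first_ad.find("p").text
price = first_ad.find("span", {"class": "price"}).text
missing = first_ad.find("span", {"class": "rooms"})

print(href, address, price, missing)
```

Because `find` returns `None` for a missing element, calling `.text` on the result raises `AttributeError`, which is why the extraction code has to guard against absent fields.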

In [7]:
#soup.prettify()

In [8]:
# get whole text from soup

#soup.get_text()

In [9]:
# get all `a` tags

all_a = soup.find_all("a") # findAll("a") is the older spelling of the same method
# all_a[8].get_text()
# for a in all_a[:4]:
#     print(a.get_text())

#### Go to web page and inspect it to get a flat ad box

In [10]:
one_address = soup.find("p", string=True).text # string= is the current name of the older text= argument
one_address

'32 Mainaustrasse, 8008 Zurich'

In [11]:
one_price = soup.find("span", {"class":"ListItemPrice_price_1o0i3"}).text
one_price

'CHF 2,280.–'

In [12]:
one_space = soup.find("span", {"class":"ListItemLivingSpace_value_2zFir"}).text
one_space

'43m2'

In [13]:
one_rooms = soup.find("span", {"class":"ListItemRoomNumber_value_Hpn8O"}).text
one_rooms

'1.5rm'

#### Get flat link

In [14]:
link_flat = "https://homegate.ch"+soup.find("a", {"data-test":"result-list-item"}).get("href")
link_flat

'https://homegate.ch/rent/3000909385'

In [15]:
# for span in room.find_all("span"):
#     print(span.text)

### Putting it all together: extending from one flat to all flats on a page

##### Extending from one page to all pages

In [16]:
#Getting the number of pages
max_pages = soup.find("div", {"class": "ResultListPage_paginationHolder_3XZql"}).text.split()
a = max_pages[0]
pieces = a.split('...')
max_pages = pieces[1]
max_pages


'16'
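
The chain of `split` calls above depends on the exact layout of the pagination text. A regular expression that picks the number after the ellipsis is one possible alternative. The sketch below assumes the text looks like the hypothetical string `"12345...16"`, and `last_page` is a name introduced only for this example.

```python
import re


def last_page(pagination_text):
    """Return the page number that follows '...' in the text, or None if absent."""
    match = re.search(r"\.\.\.\s*(\d+)", pagination_text)
    return int(match.group(1)) if match else None


print(last_page("12345...16"))  # hypothetical pagination text
```

When there are too few pages for an ellipsis to appear, this returns `None`, whereas indexing `pieces[1]` would raise an `IndexError`.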

https://www.homegate.ch/rent/real-estate/matching-list?loc=geo-zipcode-8001%2Cgeo-zipcode-8050%2Cgeo-zipcode-8006%2Cgeo-zipcode-8008

#### Defining the link

In [17]:
link_first_part = "https://www.homegate.ch"
link_mid_1_part = "/rent/real-estate/matching-list?loc=geo-zipcode-"
link_more_part = "%2Cgeo-zipcode-"
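
One way to use these three parts is to join a list of zip codes into the search URL. The sketch below repeats the definitions so that it runs on its own. `build_search_url` is a hypothetical helper, and the page parameter name `ep` is an assumption that should be checked against the address bar when paging through the results in a browser.

```python
link_first_part = "https://www.homegate.ch"
link_mid_1_part = "/rent/real-estate/matching-list?loc=geo-zipcode-"
link_more_part = "%2Cgeo-zipcode-"


def build_search_url(zip_codes, page=None, page_param="ep"):
    """Assemble the result-list URL for one or more zip codes."""
    codes = [str(code) for code in zip_codes]
    url = link_first_part + link_mid_1_part + link_more_part.join(codes)
    if page is not None:
        # ASSUMPTION: the name of the page parameter ("ep") is not confirmed by this notebook.
        url += f"&{page_param}={page}"
    return url


print(build_search_url([8001, 8050, 8006, 8008]))
```

Called without a page number, it reproduces the `link` used at the start of the notebook.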

In [18]:
# for page in tqdm(range(1, int(max_pages) + 1)):
#     """Make the urls dynamic"""
#     url = (
#         link_first_part
#         + link_mid_1_part
#         + str(page)
#         + link_mid_2_part
#         + "Data%20Engineer"

#### Modularizing the process

In [18]:
def get_flatdata_page(soup):
    """Collect address, price, space, rooms and link for every ad on one result page."""
    cols = ["Address", "Price", "Space", "Rooms", "flat_link"]

    def text_or_empty(tag):
        # find() returns None when an ad lacks the element
        return tag.text if tag is not None else ""

    rows = []
    all_flats = soup.find("div", {"data-test":"result-list"})
    # iterate over child tags only; bare strings between the tags cannot be searched
    for flat_ad in all_flats.find_all(True, recursive=False):
        link_tag = flat_ad.find("a", {"data-test":"result-list-item"})
        if link_tag is None:
            continue  # skip children that hold no ad link

        rows.append(
            {
                "Address": text_or_empty(flat_ad.find("p", string=True)),
                "Price": text_or_empty(flat_ad.find("span", {"class":"ListItemPrice_price_1o0i3"})),
                "Space": text_or_empty(flat_ad.find("span", {"class":"ListItemLivingSpace_value_2zFir"})),
                "Rooms": text_or_empty(flat_ad.find("span", {"class":"ListItemRoomNumber_value_Hpn8O"})),
                "flat_link": link_first_part + link_tag.get("href"),
            }
        )

    # DataFrame.append was removed in pandas 2.0, so build the frame from all rows at once
    return pd.DataFrame(rows, columns=cols)
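
The step from one page to all pages could look like the sketch below. `collect_pages` is a hypothetical helper: it receives the page URLs plus two callables, one that downloads and parses a page and one that turns the parsed page into a DataFrame, and it pauses between requests to avoid hammering the site.

```python
from time import sleep

import pandas as pd


def collect_pages(page_urls, fetch_soup, parse_page, pause=1.0):
    """Parse every result page and stack the per-page DataFrames."""
    frames = []
    for url in page_urls:
        frames.append(parse_page(fetch_soup(url)))
        sleep(pause)  # short pause between requests
    if not frames:
        return pd.DataFrame()
    # ignore_index gives the combined frame one continuous row index
    return pd.concat(frames, ignore_index=True)
```

In this notebook, `parse_page` would be `get_flatdata_page` and `fetch_soup` could be `lambda url: BeautifulSoup(requests.get(url, timeout=15).content, "html.parser")`.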

In [20]:
# test function
df_page = get_flatdata_page(soup)

In [21]:
df_page

Unnamed: 0,Address,Price,Space,Rooms,flat_link
0,"Zaehringerstrasse 26, 8001 Zurich","CHF 1,550.–",100m2,4.5rm,https://www.homegate.ch/rent/3000908194
1,"32 Mainaustrasse, 8008 Zurich","CHF 2,280.–",43m2,1.5rm,https://www.homegate.ch/rent/3000909385
2,"Mühlebachstrasse, 8008 Zurich","CHF 3,450.–",60m2,2.5rm,https://www.homegate.ch/rent/3000751316
3,"Trittligasse, 8001 Zürich","CHF 1,700.–",140m2,4.5rm,https://www.homegate.ch/rent/3000895367
4,"Storchengasse 14, 8001 Zürich","CHF 1,780.–",34m2,1rm,https://www.homegate.ch/rent/3000899580
5,"Rindermarkt 12, 8001 Zürich","CHF 2,600.–",56m2,2.5rm,https://www.homegate.ch/rent/3000661119
6,"Rämistrasse 44, 8001 Zürich","CHF 4,100.–",,3.5rm,https://www.homegate.ch/rent/3000886313
7,"Spiegelgasse 13, 8001 Zürich","CHF 5,170.–",121m2,2.5rm,https://www.homegate.ch/rent/3000814335
8,"In Gassen 14, 8001 Zürich","CHF 7,200.–",173m2,4.5rm,https://www.homegate.ch/rent/3000900478
9,"Clausiusstrasse 68, 8006 Zürich","CHF 2,670.–",58m2,2.5rm,https://www.homegate.ch/rent/3000867312
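
The `Price`, `Space` and `Rooms` columns are still strings such as 'CHF 2,280.–', '43m2' and '1.5rm'. A minimal cleaning sketch, assuming the comma is always a thousands separator, is to pull out the first number in each string. `first_number` is a hypothetical helper name.

```python
import re


def first_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    if not isinstance(text, str):
        return None  # e.g. NaN after reading the data back from a csv file
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


print(first_number("CHF 2,280.–"), first_number("43m2"), first_number("1.5rm"), first_number(""))
```

Applied to the table above, this could be used as `df_page["Price"].map(first_number)`, and likewise for the other two columns.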


# Selenium

In [10]:
import time
import re
import numpy as np
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [2]:
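# As my chosen browser is chrome 89, I must match this to chromedriver 89.0.4389.23
# To download: https://sites.google.com/a/chromium.org/chromedriver/downloads
from selenium.webdriver.chrome.service import Service

# raw string, so the backslashes in the Windows path are not read as escape sequences
path = r"C:\Program Files (x86)\chromedriver.exe"
# Selenium 4 takes the driver path through a Service object; on Selenium 3 use webdriver.Chrome(path)
driver = webdriver.Chrome(service=Service(path))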

In [3]:
# Establishing the connection with homegate.ch
driver.get("https://www.homegate.ch/rent/real-estate/city-zurich/matching-list")
print(driver.title)

Apartment & house for rent in Zurich (Zürich) | homegate.ch


## Defining functions to get the elements from the website

In [15]:
def flat_ID():
    try:
        listing_ID = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "dl.ListingTechReferences_techReferencesList_3qCPT"))
            )
        id_raw= listing_ID.text
        id_flat = re.split('[\n]', id_raw)[1]
        return id_flat
    except:
        return None 
        
        
def flat_address():
    try:
        address_find = WebDriverWait(driver, 10).until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, "address.AddressDetails_address_3Uq1m"))
                    )
        address = address_find.text
        return address
    except:
        return None

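def flat_price():
    try:
        price_find = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "div.SpotlightAttributes_value_2njuM"))
                )
        price_raw = price_find.text
        # match the whole amount, e.g. '950', '2,280' or '12,500', then drop the thousands separators
        price = float(re.findall(r'[0-9]+(?:,[0-9]{3})*', price_raw)[0].replace(',',''))
        return price
    except Exception:
        # element not found in time, or no number in the text
        return None 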

def flat_availability():
    try:
        availability_find = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "dd"))
        )
    #     By.XPATH, "/html/body/div[1]/main/div[2]/div/div[1]/div[1]/div[1]/div[3]/section[1]/div[2]/div[1]/dl/dd"
        availability_raw = availability_find.text
        return availability_raw
    except:
        return None
    
def main_info(key):
    # main_info_dict is rebuilt for every flat in the scraping loop;
    # the result is None when the listing does not show this label
    try:
        return main_info_dict.get(key)
    except NameError:
        return None

def flat_type():
    return main_info('Type:')

def flat_n_rooms():
    return main_info('No. of rooms:')

def flat_floor():
    return main_info('Floor:')

def flat_n_floors():
    return main_info('Number of floors:')

def flat_surface():
    return main_info('Surface living:')

def flat_floor_space():
    return main_info('Floor space:')

def flat_Room_height():
    return main_info('Room height:')

def flat_last_refurbishment():
    return main_info('Last refurbishment:')

def flat_year():
    return main_info('Year built:')
    
    
def flat_features():
    try:
        features_find = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "ul.FeaturesFurnishings_list_1HzQj"))
        )
        features_raw = features_find.text
        features = (re.split('[\n]', features_raw))
        features = [element.lower() for element in features]
        return features
    except:
        return None
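
The functions that read `main_info_dict` all follow the same pattern, so the lookups could also be driven by a single mapping from column name to page label. The sketch below is one possible arrangement. `MAIN_INFO_LABELS`, `main_info_row` and the example dictionary are names and values introduced here for illustration.

```python
# Maps the DataFrame column name to the label shown on the listing page.
MAIN_INFO_LABELS = {
    "type": "Type:",
    "N_of_rooms": "No. of rooms:",
    "floor": "Floor:",
    "N_of_floors": "Number of floors:",
    "Surface_living": "Surface living:",
    "Floor_space": "Floor space:",
    "Room_height": "Room height:",
    "Last_refurbishment": "Last refurbishment:",
    "Year_built": "Year built:",
}


def main_info_row(main_info):
    """Return one value per column, with None where the listing lacks the label."""
    return {column: main_info.get(label) for column, label in MAIN_INFO_LABELS.items()}


# Hypothetical label/value pairs as they might be read from one listing.
example = {"Type:": "Apartment", "No. of rooms:": "1.5", "Floor:": "4"}
row = main_info_row(example)
print(row["type"], row["Year_built"])
```

Adding a new attribute then only needs one more entry in the mapping.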


### Getting the urls

In [5]:
# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which was removed in Selenium 4.3
main = driver.find_elements(By.XPATH, '//*[@class="ListItemTopPremium_itemLink_11yOE ResultList_ListItem_3AwDq"]')
urls = [i.get_attribute('href') for i in main]

### Creating one dictionary per flat and collecting them in a list to build the DataFrame

In [19]:
flats_lst = []

In [20]:
for url in urls:
    try:
        driver.get(url)
        
        # Creating a dictionary with the 'Main Information'
        key_type_find = WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located((By.XPATH, '//div[@class="CoreAttributes_coreAttributes_2UrTf"]/dl/dt'))
            )
        # labels such as 'Type:' or 'Floor:'
        key_type = [element.text for element in key_type_find]

        value_type_find = WebDriverWait(driver, 10).until(
                    EC.presence_of_all_elements_located((By.XPATH, '//div[@class="CoreAttributes_coreAttributes_2UrTf"]/dl/dd'))
                    )
        # the values shown next to the labels
        value_type = [element.text for element in value_type_find]
        main_info_dict = dict(zip(key_type, value_type))
        
        flats_dict = {'flat_ID': flat_ID(),
                      'address': flat_address(),
                      'price': flat_price(),
                      'availability': flat_availability(), 
                      'type': flat_type(), 
                      'N_of_rooms': flat_n_rooms(),
                      'floor': flat_floor(),
                      'N_of_floors': flat_n_floors(),
                      'Surface_living': flat_surface(),
                      'Floor_space': flat_floor_space(),
                      'Room_height': flat_Room_height(),
                      'Last_refurbishment': flat_last_refurbishment(),
                      'Year_built': flat_year(),
                      'Features': flat_features(),
                    }
        flats_lst.append(flats_dict)
    
    except Exception as err:
        # report which listing failed and why, then continue with the next url
        print('not possible:', url, err)
        

In [21]:
df = pd.DataFrame(flats_lst)

In [22]:
df

Unnamed: 0,flat_ID,address,price,availability,type,N_of_rooms,floor,N_of_floors,Surface_living,Floor_space,Room_height,Last_refurbishment,Year_built,Features
0,3000945478,"Zentralstrasse 50-60, 8003 Zurich",1570.0,01.04.2021,Apartment,1.5,4,6.0,22 m2,,2.4 m,2016.0,,"[pets allowed, cable tv, elevator, minergie ce..."
1,3000955690,8006 Zurich,1200.0,By agreement,Single Room,1.0,1,5.0,110 m2,110 m2,,2014.0,,"[balcony / terrace, washing machine, cable tv,..."
2,3000909385,"32 Mainaustrasse, 8008 Zurich",2280.0,01.04.2021,Apartment,1.5,1,,43 m2,,,2016.0,,"[quiet neighborhood, old building]"
3,3000751316,"Mühlebachstrasse, 8008 Zurich",3450.0,Immediately,Apartment,2.5,3,4.0,60 m2,60 m2,,2018.0,,"[pets allowed, balcony / terrace, washing mach..."
4,3000842763,"Mutschellenstrasse 85, 8038 Zurich",2130.0,Immediately,Apartment,2.5,GF,,38 m2,,,2017.0,,"[balcony / terrace, parking space, elevator, n..."
5,3000661119,"Rindermarkt 12, 8001 Zürich",2700.0,Immediately,Apartment,2.5,4,,56 m2,,,,1358.0,"[washing machine, cable tv, view, quiet neighb..."
6,3000951747,"Schoffelgasse, 8001 Zürich",3400.0,Immediately,Roof flat,2.5,5,5.0,100 m2,,,,,"[balcony / terrace, cable tv, view, quiet neig..."
7,3000938095,"Kuttelgasse 15, 8001 Zürich",3499.0,01.04.2021,Duplex,3.0,1,6.0,90 m2,,,,,"[balcony / terrace, cable tv, view, elevator]"
8,3000814335,"Spiegelgasse 13, 8001 Zürich",5170.0,01.04.2021,Apartment,2.5,3,,121 m2,,,,1317.0,"[pets allowed, cable tv, fireplace, view, gara..."
9,3000937617,"Mutschellenstrasse 35, 8002 Zürich",3590.0,Immediately,Apartment,3.0,1,,94 m2,,,2020.0,,[balcony / terrace]
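
The `availability` column mixes dates such as '01.04.2021' with free text such as 'Immediately' and 'By agreement'. A small sketch for turning the dated entries into `date` objects is shown below, assuming the dates are written as day.month.year. `parse_availability` is a hypothetical helper name.

```python
from datetime import date, datetime


def parse_availability(value):
    """Return the availability as a date, or None when the entry holds no date."""
    try:
        return datetime.strptime(value, "%d.%m.%Y").date()
    except (TypeError, ValueError):
        # free text such as 'Immediately' or 'By agreement', or a missing value
        return None


print(parse_availability("01.04.2021"), parse_availability("Immediately"))
```

Applied to the table above, this could be used as `df["availability"].map(parse_availability)`.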


In [None]:
# try:
#     main = WebDriverWait(driver, 10).until(
#         EC.presence_of_element_located((By.CLASS_NAME, "ResultListPage_stickyParent_2d4Bp"))
#     )
#     print(main.text)
# except:
#     driver.quit()

In [None]:
# driver.quit()

In [None]:
# driver.back()

### printing elements

In [None]:
# main = driver.find_elements_by_xpath('//*[@class="ListItemTopPremium_itemLink_11yOE ResultList_ListItem_3AwDq"]')
# urls = [i.get_attribute('href') for i in main]
# for url in urls:
#     driver.get(url)
#     try:
# #         link = WebDriverWait(driver, 10).until(
# #             EC.presence_of_element_located((By.XPATH, "/html/body/div[1]/main/div[2]/div/div[3]/div[2]/div[1]/a"))
# #         )
# #         print("\nlink:",each.get_attribute('href'))
# #         link.click()

#         address = WebDriverWait(driver, 10).until(
#             EC.presence_of_element_located((By.CSS_SELECTOR, "address.AddressDetails_address_3Uq1m"))
#         )
#         print("\nAddress:",address.text)

#         listing_ID = WebDriverWait(driver, 10).until(
#             EC.presence_of_element_located((By.CSS_SELECTOR, "dl.ListingTechReferences_techReferencesList_3qCPT"))
#         )
#         print("\nListing_ID:",listing_ID.text)

#         Price = WebDriverWait(driver, 10).until(
#             EC.presence_of_element_located((By.CSS_SELECTOR, "div.SpotlightAttributes_value_2njuM"))
#         )
#         print("\nPrice:",Price.text)

#         info = WebDriverWait(driver, 10).until(
#             EC.presence_of_element_located((By.CSS_SELECTOR, "div.CoreAttributes_coreAttributes_2UrTf"))
#         )
#         print("\ninfo:",info.text)

#         availability = WebDriverWait(driver, 10).until(
#             EC.presence_of_element_located((By.TAG_NAME, "dd"))
#         )
#     #     By.XPATH, "/html/body/div[1]/main/div[2]/div/div[1]/div[1]/div[1]/div[3]/section[1]/div[2]/div[1]/dl/dd"
#         print("\navailability:",availability.text)

#         features = WebDriverWait(driver, 10).until(
#             EC.presence_of_element_located((By.CSS_SELECTOR, "ul.FeaturesFurnishings_list_1HzQj"))
#         )
#         print("\nfeatures:",features.text)
#         driver.back()
#     except:
#         print('not possible')
#     #     driver.quit()
#         driver.back()