# Yad2 Web Scraping
<img src="images/yad2Logo.png" alt="drawing" width="300"/>

### Goals:
1) Scrape listing data to the local computer to compute price statistics by various parameters (area, number of rooms, floor, city and neighborhood).
2) Create an interactive heat map for convenient comparison of prices by area and house data, as in the following example:

<img src="images/heat_map_example.png" alt="drawing" width="300"/>

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from typing import List
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

## Generate headers

In [2]:
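def get_Cookie() -> str:
    """Request a fresh token and return its cookies as a Cookie header value."""
    s = requests.Session()

    res = s.get('https://gw.yad2.co.il/auth/token/refresh')
    co_dict = res.cookies.get_dict()

    # Join the cookies as "name=value" pairs separated by "; "
    return '; '.join(f'{key}={value}' for key, value in co_dict.items())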

def get_headers() -> dict:
    headers = {
        'Accept': 'application/json, text/plain, */*',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'he-IL,he;q=0.9,en-US;q=0.8,en;q=0.7',
        'Connection': 'keep-alive',
        'Host': 'www.yad2.co.il',
        'mobile-app': 'false',
        'Referer': 'https://www.yad2.co.il/realestate/forsale',
        'sec-ch-ua': '"Google Chrome";v="107", "Chromium";v="107", "Not=A?Brand";v="24"',
        'sec-ch-ua-platform': '"Windows"',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
    }
    
    headers['Cookie'] = get_Cookie()
    
    return headers
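
The cookie string that `get_Cookie` assembles can be tried offline, without contacting the site. The sketch below is a minimal illustration, assuming the cookies arrive as a plain dict; `cookie_header` and the sample cookie names are hypothetical and not part of the notebook.

```python
def cookie_header(cookies: dict) -> str:
    """Format a dict of cookies as the value of an HTTP Cookie header."""
    # Each cookie becomes "name=value"; pairs are separated by "; "
    return '; '.join(f'{name}={value}' for name, value in cookies.items())


# Made-up cookie names and values, for illustration only
sample = {'session_id': 'abc123', 'lang': 'he'}
header_value = cookie_header(sample)
print(header_value)
```

An empty dict yields an empty string, so a failed token request produces an empty `Cookie` header rather than an exception.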

## Get required headers for auth

In [3]:
headers = get_headers()

## Scraping functions

In [4]:
def get_feeditems(page_num=1) -> List[BeautifulSoup]:
    if page_num % 50 == 0:
        print(f'{page_num} pages have been scanned so far')
    uri = f'https://www.yad2.co.il/realestate/forsale?page={page_num}'

    res = requests.get(uri, headers=headers)
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(res.text, 'html.parser')

    return soup.find_all('div', class_="feeditem table")

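def get_item_date(item: BeautifulSoup) -> str:
    # 'עודכן היום' is the site's label for "updated today";
    # any other text in the date span is the update date itself
    if (today := item.find('span', class_='date')) and ((today := today.string.strip()) != 'עודכן היום'):
        return today
    else:
        return datetime.today().strftime('%Y/%m/%d')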

def get_house_data(feeditem: BeautifulSoup) -> dict:
    title = feeditem.find('span', class_='title').text.strip()

    subtitle = feeditem.find('span', class_='subtitle').text
    subtitle = tuple(sub_part.strip() for sub_part in subtitle.split(','))
    type_ = subtitle[0]
    city = subtitle[-1]

    price = feeditem.find('div', class_='price').string.strip()

    attr = tuple(span.string for span in feeditem.find('div', class_="middle_col").find_all('span'))
    rooms, floor, size = str(attr[0]), str(attr[2]), str(attr[4])

    try:
        price = int(price[:-2].replace(',', ''))
        return {
            'price': price,
            'address': title,
            'city': city,
            'type': type_,
            'rooms': int(rooms),
            'floor': int(floor),
            'size': int(size),
            'update': get_item_date(feeditem),
            'scrap_date': datetime.today().strftime('%Y/%m/%d')
        }
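    except (ValueError, AttributeError):
        # Non-numeric price, rooms, floor or size (or a malformed date): skip the post
        return {}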

def feeditems_to_data_pre_normal(feeditems: List[BeautifulSoup]) -> List[dict]:
    return [get_house_data(feeditem) for feeditem in feeditems]
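
The price handling in `get_house_data` drops the last two characters and the thousands separators before calling `int`. A minimal sketch of a more forgiving variant follows, assuming the price text looks like `1,430,000 ₪` (the exact text on the site is an assumption here); `parse_price` is a hypothetical helper, not part of the notebook.

```python
import re
from typing import Optional


def parse_price(text: str) -> Optional[int]:
    """Return the price as an int, or None when the text holds no digits."""
    # Remove everything that is not a digit: commas, spaces, currency sign
    digits = re.sub(r'\D', '', text)
    return int(digits) if digits else None


sample = '1,430,000 ₪'           # assumed format: number, space, currency sign
sliced = sample[:-2].replace(',', '')   # the slicing approach used in the notebook
print(int(sliced), parse_price(sample), parse_price('N/A'))
```

Returning `None` instead of raising lets the caller decide whether to skip the post, which matches the notebook's choice of discarding posts without a numeric price.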


## Get post data from the first `last_page` pages (ordered by date)

In [5]:
# Set "last_page" to the number of pages you want to scrape
last_page = 10

In [6]:
result = []

with ThreadPoolExecutor(max_workers=64) as executor:
    # range() excludes its end value, so add 1 to include page "last_page"
    res_list = executor.map(lambda i: feeditems_to_data_pre_normal(get_feeditems(i)), range(1, last_page + 1))
    for res in res_list:
        result.extend(res)
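
The fan-out pattern used here can be tried offline. The sketch below is a minimal illustration that replaces the network call with a stand-in; `fake_fetch` and `scrape_pages` are hypothetical names, and the two posts per page are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(page_num: int) -> list:
    # Stand-in for the real request: pretend every page holds two posts
    return [{'page': page_num, 'post': i} for i in range(2)]


def scrape_pages(first: int, last: int, workers: int = 8) -> list:
    """Fetch pages first..last (inclusive) in parallel and flatten the rows."""
    rows = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map yields results in input order, whichever thread finishes first
        for page_rows in executor.map(fake_fetch, range(first, last + 1)):
            rows.extend(page_rows)
    return rows


rows = scrape_pages(1, 3)
print(len(rows), [row['page'] for row in rows])
```

Because `executor.map` preserves input order, the flattened rows keep the page order of the site even though the requests run concurrently.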

### Posts without a numeric price return an empty dict and must be removed

In [7]:
# Keep only non-empty dicts; a list (unlike a filter object) can be reused
result = [row for row in result if row]
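
One detail worth knowing about this cleanup step: `filter` returns a lazy iterator that can be consumed only once, while a list comprehension produces a reusable list. A minimal sketch with made-up rows:

```python
# Made-up rows: two valid posts and two empty dicts
rows = [{'price': 1430000}, {}, {'price': 2200000}, {}]

lazy = filter(lambda x: x, rows)   # single-use iterator
first_pass = list(lazy)            # consumes the iterator
second_pass = list(lazy)           # nothing left to yield

kept = [row for row in rows if row]  # a plain list, safe to reuse

print(len(first_pass), len(second_pass), len(kept))
```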

## Convert to pandas DataFrame

In [8]:
df = pd.json_normalize(result)
df

| | price | address | city | type | rooms | floor | size | update | scrap_date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 5440000 | הכרמים 16 | נס ציונה | דירת גן | 5 | 3 | 305 | 2022/11/05 | 2022/11/05 |
| 1 | 1430000 | התאנה 1 | מגדל | דירה | 3 | 1 | 105 | 2022/11/05 | 2022/11/05 |
| 2 | 2200000 | נחל ירקון 5 | קרית גת | דירה | 5 | 13 | 140 | 2022/11/05 | 2022/11/05 |
| 3 | 2260000 | עיר דוד 16 | פדואל | דירת גן | 5 | 1 | 175 | 2022/11/05 | 2022/11/05 |
| 4 | 3100000 | תוצרת הארץ 5 | תל אביב יפו | דירה | 2 | 16 | 66 | 2022/11/05 | 2022/11/05 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 172 | 9990000 | אלון | רמת השרון | גג/פנטהאוז | 4 | 2 | 260 | 2022/11/05 | 2022/11/05 |
| 173 | 3150000 | סוקולוב 6 | הרצליה | דירה | 3 | 15 | 80 | 2022/11/05 | 2022/11/05 |
| 174 | 1399000 | כרמל מערבי | חיפה | דירה | 4 | 2 | 90 | 2022/11/05 | 2022/11/05 |
| 175 | 3800000 | דגניה 10 | חולון | דירה | 5 | 1 | 120 | 2022/11/05 | 2022/11/05 |
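
The price statistics named in the goals can be sketched from rows like these. The example below is a minimal illustration using only the standard library and the first five rows of the output above; the function names are hypothetical, and grouping by number of rooms is just one of the possible parameters.

```python
from collections import defaultdict
from statistics import median

# (price, rooms, size) copied from rows 0-4 of the output above
rows = [
    (5440000, 5, 305),
    (1430000, 3, 105),
    (2200000, 5, 140),
    (2260000, 5, 175),
    (3100000, 2, 66),
]


def median_price_by_rooms(rows):
    """Median asking price for each room count."""
    groups = defaultdict(list)
    for price, rooms, _size in rows:
        groups[rooms].append(price)
    return {rooms: median(prices) for rooms, prices in groups.items()}


def median_price_per_sqm_by_rooms(rows):
    """Median price per unit of size for each room count."""
    groups = defaultdict(list)
    for price, rooms, size in rows:
        groups[rooms].append(price / size)
    return {rooms: median(values) for rooms, values in groups.items()}


by_rooms = median_price_by_rooms(rows)
per_sqm = median_price_per_sqm_by_rooms(rows)
print(by_rooms)
print({rooms: round(value) for rooms, value in per_sqm.items()})
```

The median is used here rather than the mean because a few very expensive listings would otherwise dominate a small group.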


## Save to your local computer

In [None]:
df.to_excel('Houses.xlsx', index=False)

# Result
1) Upload the table to [your own Google Sheets](https://docs.google.com/spreadsheets/u/0/).
2) Build a dashboard in [Google Looker Studio (formerly Data Studio)](https://datastudio.google.com/u/0/).
3) Visit my [dashboard](https://datastudio.google.com/embed/reporting/90738cd2-b077-4b71-9b46-25bb2ce8e4c6/page/V806C)!