## Divar Scraper
In this notebook, I scrape data from the [Tehran Divar](https://divar.ir/s/tehran) website.

Divar was originally designed for selling second-hand goods, but over time it has added other services, such as housing listings.

I focus on [Tehran Divar](https://divar.ir/s/tehran), the section dedicated to *Tehran city*, and scrape its housing advertisements.


In [1]:
import time
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import csv

import numpy as np
import pandas as pd

!pip install Unidecode
from unidecode import unidecode

!pip install arabic-reshaper
from arabic_reshaper import reshape

!pip install progressbar
import progressbar



### save_urls Method
This method finds all housing advertisements in *Tehran* and stores their URLs in a text file called *AdsUrl.txt*. The URL used to find the ads is [housing in Tehran Divar](https://divar.ir/s/tehran/rent-residential?warehouse=true).

This page shows around 24 or 25 housing advertisements on an 18.5" monitor. To get more, the user has to scroll to the bottom of the page, after which new advertisements are loaded in the same places.

To automate this, I use [Chromedriver](https://chromedriver.chromium.org/downloads). You need to change the Chrome driver path in the `webdriver.Chrome(...)` call to your local path. With the Chrome driver, I scroll down the page step by step, and the advertisements are refreshed automatically.

After each scroll, I append the URL of every housing advertisement found on the page to *AdsUrl.txt*.

In [2]:
# save the urls of all advertisements
# Web scraper for an infinite-scrolling page
def save_urls(scroll_times = 100):
    
    # create (or truncate) the url file before collecting links
    with open(url_file, 'w', newline='', encoding='utf-8') as write_obj:
        write_obj.write('')

    # copy the chrome driver into the main folder of the project and paste its address in the line below
    driver = webdriver.Chrome(executable_path=r"D:\Bank\Educational\Python\Online Class\13- Projects\Divar\Scrap\chromedriver.exe")
    driver.get(search_url)
    time.sleep(2)  # Allow 2 seconds for the web page to open
    scroll_pause_time = 1 # You can set your own pause time. My laptop is a bit slow so I use 1 sec
    screen_height = driver.execute_script("return window.screen.height;")   # get the screen height of the web
    
    # progress bar
    bar = progressbar.ProgressBar(maxval=scroll_times, \
        widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
    
    print('Finding links progress:')
    bar.start()
    
    for i in range(scroll_times):
        bar.update(i+1)

        # scroll one screen height each time
        driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))  
        time.sleep(scroll_pause_time)
        
        soup = BeautifulSoup(driver.page_source, "html.parser")

        # scrape the page: find the div tag of each ad
        for each_div in soup.find_all("div", class_="post-card-item kt-col-6 kt-col-xxl-4"):
            if each_div is None:
                continue

            # find the direct child <a> tag
            a_tag = each_div.find("a", recursive=False)

            if a_tag is not None and a_tag.has_attr('href'):
                # build the absolute ad url and append it to the text file
                url = urljoin(home_url, a_tag.attrs['href'])
                with open(url_file, 'a+', newline='', encoding='utf-8') as write_obj:
                    write_obj.write(url + '\n')
    
    bar.finish()
    driver.quit()  # close the browser once all scrolls are done
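
Because the loop re-parses the page source after every scroll, the same link can be appended to *AdsUrl.txt* more than once (duplicates are only removed later, in *scrap_links*). Below is a minimal sketch of an order-preserving de-duplication step, assuming the links are available as a list of lines. `unique_urls` is a hypothetical helper and the example URLs are placeholders; neither is part of the notebook.

```python
def unique_urls(urls):
    """Return the URLs with duplicates removed, keeping first-seen order."""
    # Strip whitespace/newlines so 'x\n' and 'x' count as the same link.
    cleaned = (u.strip() for u in urls)
    # dict keys are unique and keep insertion order (Python 3.7+);
    # empty lines are skipped.
    return list(dict.fromkeys(u for u in cleaned if u))


# Placeholder links, only for illustration.
links = [
    "https://divar.ir/v/a\n",
    "https://divar.ir/v/b\n",
    "https://divar.ir/v/a\n",
    "\n",
]
print(unique_urls(links))
```

Unlike `list(set(links))`, this keeps the links in the order they appeared on the page, which makes repeated runs easier to compare.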


### scrap_links Method
In this method, I open the *AdsUrl.txt* file prepared earlier, request each URL, and extract the housing features from the page.

In total, 12 columns (11 features plus the ad link) are stored in a CSV file called *Data.csv*.

In [3]:
# scrap all links in url_file
def scrap_links():
    with open(url_file, 'r', newline='', encoding='utf-8') as read_obj:
        links = read_obj.readlines()
        
        print('------------------------------')
        print('Total link counts:',len(links))
        # remove duplicates
        links =  list(set(links))
        print('Unique link counts:',len(links))
        print('------------------------------')

    # Write the headers in data csv file
    with open(data_file, mode='w', newline='', encoding='utf-8') as csv_file:
        handle = csv.writer(csv_file)
        handle.writerow(['neighborhood','area','year','room','deposit','rent','floor'
            ,'elavator','parking','warehouse','balcony', 'link'])
    
    
    # progress bar
    bar1 = progressbar.ProgressBar(maxval= len(links), \
        widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
    
    index_counter = 0
    
    print('Scraping links progress:')
    bar1.start()

    for each_link in links:
        
        bar1.update(index_counter+1)
        index_counter += 1 
        
        each_link = each_link.replace('\n','')
        response = requests.get(each_link)
        # Check if page is found
        if response.status_code != 200: continue
        
        soup = BeautifulSoup(response.content, 'html.parser')

        neighborhood = area = year = room = deposit = rent = floor = ''
        elavator = parking = warehouse = balcony = ''

        # find main div containing features
        post_div = soup.select('div.post-info')
        if not post_div: continue
        temp_div = post_div[0].find('div', attrs={'class': None}, recursive = False)
        if temp_div is None: continue
        grouprow_divs = temp_div.find_all('div', attrs={'class': 'kt-group-row'}, recursive = False)
        if grouprow_divs is None or len(grouprow_divs) < 2: continue
        baserow_divs = temp_div.find_all('div', attrs={'class': 'kt-base-row kt-base-row--large kt-unexpandable-row'}, recursive = False)
        if baserow_divs is None: continue
        items_div = grouprow_divs[0].find_all('div', attrs={'class': 'kt-group-row-item kt-group-row-item--info-row'}, recursive = False)
        if items_div is None: continue
        
        # the neighborhood is read from the subtitle under the page title
        temp_div = post_div[0].find('div', attrs={'class': 'kt-page-title'}, recursive = False)
        if temp_div is None: continue
        temp_div = temp_div.find('div', attrs={'class': 'kt-page-title__texts'}, recursive = False)
        if temp_div is None: continue
        temp_div = temp_div.find('div', attrs={'class': 'kt-page-title__subtitle kt-page-title__subtitle--responsive-sized'}, recursive = False)
        if temp_div is None: continue
        neighborhood = temp_div.get_text()
        neighborhood = temp_div.get_text()

        # first group row: area, year and room, identified by position
        index = 0
        for each_div in items_div:
            temp_span = each_div.find('span', attrs={'class': 'kt-group-row-item__value'}, recursive = False)
            if temp_span is None: continue
            if index == 0: area = temp_span.get_text()
            if index == 1: year = temp_span.get_text()
            if index == 2: room = temp_span.get_text()
            index += 1
        
        # base rows: deposit, rent and floor, identified by position
        index = 0
        for each_div in baserow_divs:
            temp_div = each_div.find('div', attrs={'class': 'kt-base-row__end kt-unexpandable-row__value-box'}, recursive = False)
            if temp_div is None: continue
            temp_p = temp_div.find('p', attrs={'class': 'kt-unexpandable-row__value'}, recursive = False)
            if temp_p is None: continue
            if index == 0: deposit = temp_p.get_text()
            if index == 1: rent = temp_p.get_text()
            if index == 5: floor = temp_p.get_text()
            index += 1
        
        # second group row: amenities, identified by their icon class
        items_div = grouprow_divs[1].find_all('div', recursive = False)
        for each_div in items_div:
            temp_span = each_div.find('span', recursive = False)
            temp_i = each_div.find('i', recursive = False)
            # skip items without an icon or a label (avoids AttributeError on None)
            if temp_i is None or temp_span is None: continue
            if not temp_i.has_attr("class"): continue
            if 'kt-icon-balcony' in temp_i['class']: balcony = temp_span.get_text()
            if 'kt-icon-parking' in temp_i['class']: parking = temp_span.get_text()
            if 'kt-icon-elevator' in temp_i['class']: elavator = temp_span.get_text()
            if 'kt-icon-cabinet' in temp_i['class']: warehouse = temp_span.get_text()
        
        # write in file
        new_row = [neighborhood, area, year, room, deposit, rent, floor, elavator, parking
                    , warehouse, balcony, each_link ]

       
        with open(data_file, 'a+', newline='', encoding='utf-8') as write_obj:
            # Create a writer object from csv module
            csv_writer = csv.writer(write_obj)
            # Add contents of list as last row in the csv file
            csv_writer.writerow(new_row)
    
    bar1.finish()
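
*scrap_links* reopens *Data.csv* for every row it writes. One possible variant, sketched here under the assumption that each scraped ad is collected as a dictionary, keeps a single handle open and writes through `csv.DictWriter`, so each value is matched to its column by name rather than by position. `write_rows` and `COLUMNS` are hypothetical names, and the sample record is made up.

```python
import csv
import io

# Same column names (and spelling) as the header written by scrap_links.
COLUMNS = ['neighborhood', 'area', 'year', 'room', 'deposit', 'rent', 'floor',
           'elavator', 'parking', 'warehouse', 'balcony', 'link']


def write_rows(handle, rows):
    """Write the header once, then one CSV line per scraped record."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, restval='')
    writer.writeheader()
    for row in rows:
        # Missing keys are filled with '' (restval), so a partially
        # scraped ad still produces a well-formed line.
        writer.writerow(row)


# An in-memory buffer stands in for
# open(data_file, 'w', newline='', encoding='utf-8').
buffer = io.StringIO(newline='')
write_rows(buffer, [{'area': '75', 'room': '2', 'link': 'https://divar.ir/v/a'}])
print(buffer.getvalue())
```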


### clean_data Method
In this method I do the following:
1.   Remove useless or bad records
2.   Correct Farsi characters
3.   Extract useful data from phrases
4.   Calculate the *total_value* column as the target feature
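
Steps 2 and 3 can be illustrated with the standard library alone. The sketch below is not the notebook's method (which relies on `unidecode` and pandas); it is a minimal alternative that maps Persian and Arabic-Indic digits to ASCII and keeps only the digits of a phrase. `parse_amount` and `DIGIT_TABLE` are hypothetical names.

```python
# Map Persian (U+06F0-U+06F9) and Arabic-Indic (U+0660-U+0669) digits to ASCII.
DIGIT_TABLE = {}
for i in range(10):
    DIGIT_TABLE[0x06F0 + i] = str(i)
    DIGIT_TABLE[0x0660 + i] = str(i)


def parse_amount(text):
    """Return the integer contained in a phrase, or 0 if it has no digits."""
    ascii_text = str(text).translate(DIGIT_TABLE)
    # Keep ASCII digits only: this drops separators, the currency word,
    # and phrases such as 'توافقی' (negotiable) that carry no number.
    digits = ''.join(ch for ch in ascii_text if ch in '0123456789')
    return int(digits) if digits else 0


print(parse_amount('۵۰۰٫۰۰۰ تومان'))
print(parse_amount('توافقی'))
```

Note that a phrase like 'قبل از ۱۳۷۰' (before 1370) would come out as 1370 here, so special cases such as the year column still need their own replacement rule first.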

In [4]:
# change farsi characters and clean data set
def clean_data():
    df = pd.read_csv('Data.csv', encoding="utf-8")  
    df.drop_duplicates(subset =None, keep = 'first', inplace = True)
    
    # filter apartments
    # since we only keep apartments, they have no balcony
    df = df[df['neighborhood'].str.contains('اجاره آپارتمان')]
    df['neighborhood'] = df['neighborhood'].astype(pd.StringDtype())

    # the floor column has many nulls and incorrect values, so we drop it
    df.drop('floor', inplace=True, axis=1)
    # the balcony column is null for all records
    df.drop('balcony', inplace=True, axis=1)

    # int columns
    df['area'] = pd.to_numeric(df.area.apply(unidecode), errors='coerce').replace(np.nan, 0).astype(int)
    df['room'] = pd.to_numeric(df.room.apply(unidecode), errors='coerce').replace(np.nan, 0).astype(int)

    # string columns
    df['warehouse'] = df['warehouse'].replace({'انباری ندارد': '۰'}, regex=True)
    df['warehouse'] = df['warehouse'].replace({'انباری': '۱'}, regex=True)
    df['warehouse'] = pd.to_numeric(df['warehouse'].apply(unidecode), errors='coerce').replace(np.nan, 0).astype(int)

    df['elavator'] = df['elavator'].replace({'آسانسور ندارد': '۰'}, regex=True)
    df['elavator'] = df['elavator'].replace({'آسانسور': '۱'}, regex=True)
    df['elavator'] = pd.to_numeric(df['elavator'].apply(unidecode), errors='coerce').replace(np.nan, 0).astype(int)

    df['parking'] = df['parking'].replace({'پارکینگ ندارد': '۰'}, regex=True)
    df['parking'] = df['parking'].replace({'پارکینگ': '۱'}, regex=True)
    df['parking'] = pd.to_numeric(df['parking'].apply(unidecode), errors='coerce').replace(np.nan, 0).astype(int)

    df[['neighborhood','temp1']] = df['neighborhood'].str.split('،',expand=True)
    df[['temp1','temp2']] = df['temp1'].str.split('|',expand=True)
    df['neighborhood'] = df['temp1'].replace({'‌': ' '}, regex=True)

    df['deposit'] = df['deposit'].replace({'مجانی': '۰'}, regex=True)
    df['deposit'] = df['deposit'].replace({'توافقی': '۰'}, regex=True)
    df['deposit'] = df['deposit'].replace({'٫': ''}, regex=True)
    df['deposit'] = df['deposit'].replace({'تومان': ''}, regex=True)
    df['deposit'] = pd.to_numeric(df['deposit'].apply(unidecode), errors='coerce').replace(np.nan, 0).astype(float)

    df['rent'] = df['rent'].replace({'مجانی': '۰'}, regex=True)
    df['rent'] = df['rent'].replace({'توافقی': '۰'}, regex=True)
    df['rent'] = df['rent'].replace({'٫': ''}, regex=True)
    df['rent'] = df['rent'].replace({'تومان': ''}, regex=True)
    df['rent'] = pd.to_numeric(df['rent'].apply(unidecode), errors='coerce').replace(np.nan, 0).astype(float)

    # fill 'before 1370' with 1363 so that the spacing between years is preserved
    df['year'] = df['year'].replace({'قبل از ۱۳۷۰': '۱۳۶۳'}, regex=True)
    df['year'] = pd.to_numeric(df.year.apply(unidecode), errors='coerce').replace(np.nan, 0).astype(int)

    # convert rent into its deposit equivalent and add the deposit to get a single number as the home's value
    df['total_value'] = ((df['rent'] * 100) / 3) + df['deposit']

    # remove temp columns
    df.drop(columns=['temp1', 'temp2'], inplace=True)

    df = df[['total_value','neighborhood','area','year','deposit','rent','elavator','parking','warehouse']]
    
    df.to_csv(r'Data.csv', index = False)    
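
The *total_value* column folds the monthly rent into the deposit with `rent * 100 / 3`, which amounts to treating the monthly rent as 3% of an equivalent deposit. A small worked example of that arithmetic follows; the amounts are made up and `total_value` is a hypothetical stand-alone version of the expression used in *clean_data*.

```python
def total_value(deposit, rent):
    """Deposit-equivalent value: rent * 100 / 3 + deposit."""
    return (rent * 100) / 3 + deposit


# Hypothetical ad: 100,000,000 toman deposit and 3,000,000 toman monthly rent.
print(total_value(100_000_000, 3_000_000))   # 200000000.0
# An ad with a deposit only (no rent) keeps its deposit as the value.
print(total_value(300_000_000, 0))           # 300000000.0
```

Since ads marked as negotiable or free have their deposit or rent set to 0 during cleaning, their *total_value* is likely to be understated.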

### __main__ Method
In the main block, I call *save_urls*, *scrap_links*, and *clean_data* in that order.

Finally, the *Data.csv* file containing the features of the housing ads is ready. This file can be used to analyze housing prices across all parts of *Tehran city*.

In [None]:
if __name__ == "__main__":

    url_file = 'AdsUrl.txt'
    data_file = 'Data.csv'
    home_url = 'https://divar.ir'
    search_url = "https://divar.ir/s/tehran/rent-residential?warehouse=true"

    # 1- save the urls of advertisements in a file
    save_urls(100)

    # 2- read links from file and scrap all links
    scrap_links()
    
    # 3- change farsi characters and clean data
    clean_data()

### Tips
After running this program many times, I found the following:

1.   About 500 scrolls seems to be a good trade-off.
2.   Running the program in the early evening yields more unique records.
3.   A run with 500 scrolls takes about half an hour.
4.   A run with 500 scrolls returns about 6000 unique records.
5.   You can run the program on different days and combine the results in one CSV file.
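
Tip 5 could be sketched as follows, using only the standard library and assuming every daily file shares the same header. `merge_csv` and the file names in the usage comment are hypothetical.

```python
import csv


def merge_csv(paths, out_path):
    """Concatenate CSV files that share a header, dropping repeated rows.

    Returns the number of unique data rows written.
    """
    header = None
    seen = set()
    merged = []
    for path in paths:
        with open(path, newline='', encoding='utf-8') as f:
            reader = csv.reader(f)
            file_header = next(reader, None)
            if file_header is None:
                continue  # empty file
            if header is None:
                header = file_header
            elif file_header != header:
                raise ValueError(f'{path} has a different header')
            for row in reader:
                key = tuple(row)
                if key not in seen:  # keep the first copy of each row
                    seen.add(key)
                    merged.append(row)
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if header is not None:
            writer.writerow(header)
        writer.writerows(merged)
    return len(merged)


# Usage (hypothetical file names):
# merge_csv(['Data_day1.csv', 'Data_day2.csv'], 'Data_all.csv')
```

Because *clean_data* overwrites *Data.csv* on every run, each day's output would need to be copied to its own file before the next run.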