## Divar Scrap
In this notebook, I scrape data from the [Tehran Divar](https://divar.ir/s/tehran) website.

This website was originally designed for selling second-hand goods, but over time it has added other services, such as housing.

I focus on [Tehran Divar](https://divar.ir/s/tehran), the section dedicated to *Tehran city*, and scrape its housing advertisements.


In [4]:
import os
import time
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import csv

import numpy as np
import pandas as pd

!pip install Unidecode
from unidecode import unidecode

!pip install arabic-reshaper
from arabic_reshaper import reshape

!pip install progressbar
import progressbar

!pip install webdriver-manager
# selenium 4 helpers (webdriver itself is already imported above)
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager



### save_urls Method
This method finds all housing advertisements in *Tehran* and stores their URLs in a text file called *AdsUrl.txt*. The URL used to find all ads is [housing in Tehran Divar](https://divar.ir/s/tehran/rent-residential?warehouse=true).

This page shows around 24 or 25 housing advertisements on an 18.5-inch monitor. To get more advertisements, the user has to scroll to the bottom of the page, after which new advertisements are loaded in the same places.

To automate this, I use [Chromedriver](https://chromedriver.chromium.org/downloads). The code below obtains the driver through `webdriver-manager`, so no local driver path has to be set. With the driver, I scroll down the page step by step and the advertisements are refreshed automatically.

At the end, I store the URL of each housing advertisement in a text file called *AdsUrl.txt*.
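
Each scroll re-parses whatever is currently on the page, so the same link can be written to the file more than once (the duplicates are removed later, when the file is read). As a minimal sketch of that step, links can be made absolute and de-duplicated while keeping their order. The helper name `unique_ad_urls` and the sample paths are hypothetical and are not part of the notebook.

```python
from urllib.parse import urljoin

def unique_ad_urls(hrefs, base_url="https://divar.ir"):
    """Return absolute URLs in first-seen order, without duplicates."""
    seen = set()
    unique = []
    for href in hrefs:
        # strip the trailing newline that lines read from a file carry
        url = urljoin(base_url, href.strip())
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique

# hypothetical relative links, as they might appear in href attributes
sample = ["/v/ad-one", "/v/ad-two", "/v/ad-one\n"]
print(unique_ad_urls(sample))
```

Unlike `list(set(links))`, this keeps the links in the order in which they were found.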

In [5]:
# save the urls of all advertisements
# Web scraper for an infinite-scrolling page
def save_urls(scroll_times = 100):
    
    # create the url file, or empty it if it already exists
    with open(url_file, 'w', newline='', encoding='utf-8') as write_obj:
        write_obj.writelines('')
            
    # webdriver-manager downloads a matching chrome driver, so no local driver path is needed
    driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
    driver.get(search_url)
    time.sleep(2)  # Allow 2 seconds for the web page to open
    scroll_pause_time = 5 # Seconds to wait after each scroll; raise it on a slow machine or connection
    screen_height = driver.execute_script("return window.screen.height;")   # get the screen height of the web
    
    # progress bar
    bar = progressbar.ProgressBar(maxval=scroll_times, \
        widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
    
    print('Finding links progress:')
    bar.start()
    
    for i in range(scroll_times):
        bar.update(i+1)

        # scroll one screen height each time
        driver.execute_script("window.scrollTo(0, {screen_height}*{i});".format(screen_height=screen_height, i=i))  
        time.sleep(scroll_pause_time)

        try:
            more_list = driver.find_element(By.XPATH, "//button[@class='post-list__unsafe-show-more-e2b99']")
            more_list.click()
            print("clicked on more list")
            time.sleep(1)
        except Exception:
            # the "show more" button is not present after every scroll
            pass
        
        soup = BeautifulSoup(driver.page_source, "html.parser")
        
        # scrap the page
        # find tag div for ads 
        for each_div in soup.find_all("div",class_="post-list__widget-col-c1444"):
            if each_div == None : continue
            url = ''

            # find a tag
            a_tag =  each_div.find("a", recursive = False)

            if a_tag != None and a_tag.has_attr('href'): 
                url = urljoin(home_url, a_tag.attrs['href'])
                # find the rent urls and save in the text file
                with open(url_file, 'a+', newline='', encoding='utf-8') as write_obj:
                    write_obj.writelines(url + '\n')
    
    bar.finish()
    driver.quit()  # close the browser once all scrolling is done


### scrap_links Method
In this method, I open the *AdsUrl.txt* file prepared before, request each URL, and extract the features of each housing ad.

A total of 22 features are extracted and stored in a CSV file called *Data.csv*.
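
One of the stored features is the map position, which the code below reads from the query string of the map link. As a sketch of the same idea with the standard library, the hypothetical helper `extract_coordinates` parses the query string instead of splitting it by hand. The sample link follows the format quoted in the code comment.

```python
from urllib.parse import urlparse, parse_qs

def extract_coordinates(href):
    """Return (latitude, longitude) as strings; empty strings when missing."""
    # an escaped '&amp;' can survive in raw HTML, so normalise it first
    query = parse_qs(urlparse(href.replace("&amp;", "&")).query)
    latitude = query.get("latitude", [""])[0]
    longitude = query.get("longitude", [""])[0]
    return latitude, longitude

href = "https://balad.ir/location?latitude=35.739093017052&amp;longitude=51.379365921021&radius=500"
print(extract_coordinates(href))
```

Returning empty strings for a missing parameter matches the empty defaults that *scrap_links* uses for every feature.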

In [6]:
# scrap all links in url_file
# try to scrap all the links in the file and retry if post_div is not found
def scrap_links():
    with open(url_file, 'r', newline='', encoding='utf-8') as read_obj:
        links = read_obj.readlines()
        
        print('------------------------------')
        print('Total link counts:',len(links))
        # remove duplicates
        links =  list(set(links))
        print('Unique link counts:',len(links))
        print('------------------------------')

    # Write the headers in data csv file
    with open(data_file, mode='w', newline='', encoding='utf-8') as csv_file:
        handle = csv.writer(csv_file)
        handle.writerow(['id','name','neighborhood','area','year','room','deposit','rent','changeAble','buildingFloors'
                ,'unitFloor','elavator','parking','warehouse','balcony','wc','cooling','heating','hotWater','unitPerFloor','direction','unitStatus','longitude','latitude','description','link'])
    
    
    # progress bar
    bar1 = progressbar.ProgressBar(maxval= len(links), \
        widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
    
    index_counter = 0
    
    print('Scraping links progress:')
    bar1.start()

    driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

    list_index = 0
    for each_link in links:
        list_index += 1
        bar1.update(index_counter+1)
        index_counter += 1 

        neighborhood = area = year = room = deposit = rent = changeAble = buildingFloors = unitFloor = longitude = latitude = description = ''
        elavator = parking = warehouse = balcony = wc = cooling = heating = hotWater = ''
        unitPerFloor = direction = unitStatus = ''
        
        each_link = each_link.replace('\n','')

        driver.get(each_link)
        time.sleep(1)
        soup = BeautifulSoup(driver.page_source, "html.parser")

        post_div = soup.select('div.kt-col-5')      
        if len(post_div) == 0: continue
        temp_section = post_div[0].find_all('section', recursive = False)
        if len(temp_section) == 0: continue
        temp_div = temp_section[0].find('div', attrs={'class': 'post-page__section--padded'}, recursive = False)
        if temp_div == None: continue
        baserow_divs = temp_div.find_all('div', attrs={'class': 'kt-base-row kt-base-row--large kt-unexpandable-row'}, recursive = False)
        if len(baserow_divs) == 0: continue
        grouprow_divs = temp_div.find_all('table', attrs={'class': 'kt-group-row'}, recursive = False)
        if len(grouprow_divs) == 0: continue
        # items_div = grouprow_divs.find_all('td', attrs={'class': 'kt-group-row-item kt-group-row-item__value kt-body kt-body--stable'}, recursive = False)

        #get Neighborhood
        temp_div = temp_section[0].find('div', attrs={'class': 'kt-page-title'}, recursive = False)
        if temp_div == None: continue
        temp_div = temp_div.find('div', attrs={'class': 'kt-page-title__texts'}, recursive = False)
        if temp_div == None: continue
        title_div = temp_div.find('div', attrs={'class': 'kt-page-title__title kt-page-title__title--responsive-sized'}, recursive = False)
        if title_div == None: continue
        name = title_div.get_text()
        subtitle_div = temp_div.find('div', attrs={'class': 'kt-page-title__subtitle kt-page-title__subtitle--responsive-sized'}, recursive = False)
        if subtitle_div == None: continue
        neighborhood = subtitle_div.get_text()
        # only keep the part after the '،' character, when it is present
        subtitle_parts = neighborhood.split('،')
        if len(subtitle_parts) > 1: neighborhood = subtitle_parts[1]

        # get area, year, room
        index = 0
        items_div = grouprow_divs[0].find('tbody', recursive = False)
        if items_div == None: continue
        items_div = items_div.find('tr', attrs={'class': 'kt-group-row__data-row'}, recursive = False)
        if items_div == None: continue
        items_div = items_div.find_all('td', recursive = False)
        for each_div in items_div:
                if index == 0: area = each_div.get_text()
                if index == 1: year = each_div.get_text()
                if index == 2: room = each_div.get_text()
                index += 1

        # get deposit, rent, floor
        for each_div in baserow_divs:
                temp_div_start = each_div.find('div', attrs={'class': 'kt-base-row__start kt-unexpandable-row__title-box'}, recursive = False)
                temp_div_end = each_div.find('div', attrs={'class': 'kt-base-row__end kt-unexpandable-row__value-box'}, recursive = False)
                # skip rows that lack a title box or a value box, so stale values are never reused
                if temp_div_start == None or temp_div_end == None: continue
                temp_p_start = temp_div_start.find('p', attrs={'class': 'kt-base-row__title'}, recursive = False)
                temp_p_end = temp_div_end.find('p', attrs={'class': 'kt-unexpandable-row__value'}, recursive = False)
                if temp_p_start == None or temp_p_end == None: continue
                if temp_p_start.get_text() == 'ودیعه': deposit = temp_p_end.get_text()
                if temp_p_start.get_text() == 'اجارهٔ ماهانه': rent = temp_p_end.get_text()
                if temp_p_start.get_text() == 'ودیعه و اجاره': changeAble = temp_p_end.get_text()
                if temp_p_start.get_text() == 'طبقه': 
                        # the floor text looks like "<unit floor> از <building floors>"
                        floor = temp_p_end.get_text()
                        temp_floor = floor.split('از')
                        if len(temp_floor) == 2: 
                                buildingFloors = temp_floor[1].strip()
                                unitFloor = temp_floor[0].strip()


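        # the second table holds balcony, parking, elevator and warehouse; skip pages without it
        if len(grouprow_divs) < 2: continue
        thead_div = grouprow_divs[1].find('thead', recursive = False)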
        if thead_div == None: continue
        thead_tr = thead_div.find('tr', recursive = False)
        if thead_tr == None: continue
        thead_th = thead_tr.find_all('th', recursive = False)
        if thead_th == None: continue
        tbody_div = grouprow_divs[1].find('tbody', recursive = False)
        if tbody_div == None: continue
        tbody_tr = tbody_div.find('tr', recursive = False)
        if tbody_tr == None: continue
        tbody_td = tbody_tr.find_all('td', recursive = False)
        # pair each header cell with its value cell, so a skipped header cannot shift the values
        for each_th, td in zip(thead_th, tbody_td):
                temp_i = each_th.find('i', recursive = False)
                if temp_i == None: continue
                if temp_i.has_attr("class") == False: continue
                if 'kt-icon-balcony' in temp_i['class']: balcony = td.get_text()
                if 'kt-icon-parking' in temp_i['class']: parking = td.get_text()
                if 'kt-icon-elevator' in temp_i['class']: elavator = td.get_text()
                if 'kt-icon-cabinet' in temp_i['class']: warehouse = td.get_text()

        map_a = soup.select('a.map-cm__attribution')
        if len(map_a) != 0: 
                # get href attribute
                href = map_a[0]['href']
                # url is like "https://balad.ir/location?latitude=35.739093017052&amp;longitude=51.379365921021&radius=500"
                # get latitude and longitude
                temp = href.split('latitude=')[1]
                latitude = temp.split('&')[0]
                temp = href.split('longitude=')[1]
                longitude = temp.split('&')[0]
                # remove &amp; from latitude and longitude
                latitude = latitude.replace('&amp;','')
                longitude = longitude.replace('&amp;','')
                # remove radius from longitude
                longitude = longitude.split('&radius')[0]


        # click on detail_button
        try:
                more_details = driver.find_element(By.XPATH, "//div[@class='raw-button-cd669']")
                if more_details != None:
                        more_details.click()
                        time.sleep(1)
                        soup = BeautifulSoup(driver.page_source, "html.parser")
                        temp_div_modal = soup.select('div.kt-modal__body')
                        if len(temp_div_modal) != 0: 
                                # find all divs with class 'kt-base-row kt-base-row--large kt-unexpandable-row'
                                temp_features_divs = temp_div_modal[0].find_all('div', attrs={'class': 'kt-base-row kt-base-row--large kt-unexpandable-row'}, recursive = False)
                                if len(temp_features_divs) != 0:
                                        for each_div in temp_features_divs:
                                                temp_div_start = each_div.find('div', attrs={'class': 'kt-base-row__start kt-unexpandable-row__title-box'}, recursive = False)
                                                temp_div_end = each_div.find('div', attrs={'class': 'kt-base-row__end kt-unexpandable-row__value-box'}, recursive = False)
                                                if temp_div_start == None or temp_div_end == None: continue
                                                temp_p_start = temp_div_start.find('p', attrs={'class': 'kt-base-row__title'}, recursive = False)
                                                temp_p_end = temp_div_end.find('p', attrs={'class': 'kt-unexpandable-row__value'}, recursive = False)
                                                if temp_p_start == None or temp_p_end == None: continue
                                                if temp_p_start.get_text() == 'تعداد واحد در طبقه': unitPerFloor = temp_p_end.get_text()
                                                if temp_p_start.get_text() == 'جهت ساختمان': direction = temp_p_end.get_text()
                                                if temp_p_start.get_text() == 'وضعیت واحد': unitStatus = temp_p_end.get_text()

                                temp_ability_divs = temp_div_modal[0].find_all('div', attrs={'class': 'kt-base-row kt-base-row--large kt-base-row--has-icon kt-feature-row'}, recursive = False)
                                if len(temp_ability_divs) != 0:
                                        for each_div in temp_ability_divs:
                                                temp_div = each_div.find('div', attrs={'class': 'kt-base-row__start'}, recursive = False)
                                                if temp_div != None: 
                                                        temp_i = temp_div.find('i', recursive = False)
                                                        temp_p = temp_div.find('p', recursive = False)
                                                        if temp_i != None and temp_p != None: 
                                                                if temp_i.has_attr("class") != False:
                                                                        if 'kt-icon-balcony' in temp_i['class']: balcony = temp_p.get_text()
                                                                        if 'kt-icon-parking' in temp_i['class']: parking = temp_p.get_text()
                                                                        if 'kt-icon-elevator' in temp_i['class']: elavator = temp_p.get_text()
                                                                        if 'kt-icon-cabinet' in temp_i['class']: warehouse = temp_p.get_text()
                                                                        if 'kt-icon-wc' in temp_i['class']: wc = temp_p.get_text()
                                                                        if 'kt-icon-snowflake' in temp_i['class']: cooling = temp_p.get_text()
                                                                        if 'kt-icon-sunny' in temp_i['class']: heating = temp_p.get_text()
                                                                        if 'kt-icon-thermometer' in temp_i['class']: hotWater = temp_p.get_text()
        except Exception:
                # the details button or its modal is missing on some pages
                pass

        
        # write in file
        new_row = [index_counter, name, neighborhood, area, year, room, deposit, rent, changeAble, buildingFloors
                , unitFloor, elavator, parking, warehouse, balcony, wc, cooling, heating, hotWater, unitPerFloor, direction, unitStatus, longitude, latitude, description, each_link ]
        
        with open(data_file, 'a+', newline='', encoding='utf-8') as write_obj:
                # Create a writer object from csv module
                csv_writer = csv.writer(write_obj)
                # Add contents of list as last row in the csv file
                csv_writer.writerow(new_row)

    # finish the progress bar and close the browser after the last link
    bar1.finish()
    driver.quit()


### clean_data Method
In this method I do the following:
1.   Remove useless or bad records
2.   Correct Farsi characters
3.   Extract useful data from phrases
4.   Calculate the *total_value* column as the target feature
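
As an illustration of steps 2 and 3, the sketch below turns a price phrase into a number. It relies on the fact that Python's `float` accepts Persian digits. The helper name `parse_toman` is hypothetical; the cell below does the same work with pandas `replace` calls.

```python
def parse_toman(text):
    """Convert a price phrase such as '۲۰۰٬۰۰۰ تومان' to a float."""
    if text is None:
        return 0.0
    # drop the currency word, the thousands separator and spaces
    cleaned = text.replace('تومان', '').replace('٬', '').replace(' ', '')
    # 'مجانی' (free) and 'توافقی' (negotiable) carry no amount
    if cleaned in ('', 'مجانی', 'توافقی'):
        return 0.0
    return float(cleaned)  # float() understands Persian digits

print(parse_toman('۲۰۰٬۰۰۰ تومان'))
```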

In [36]:
# change farsi characters and clean data set
def clean_data():

    # read every scraped data file of this city from the Data folder
    path = './Data'
    files = os.listdir(path)
    files = [f for f in files if f.startswith('Data_'+ city) and f.endswith('.csv')]
    df = pd.concat([pd.read_csv(os.path.join(path, f)) for f in files], ignore_index=True)
    print("After concat: ", df.shape)
    # drop duplicates
    df.drop_duplicates(subset =None, keep = 'first', inplace = True)

    print("After drop_duplicates: ", df.shape)

    df['neighborhood'] = df['neighborhood'].str.strip()

    # drop rows with deposit and rent empty values
    df = df.dropna(subset=['deposit', 'rent', 'area', 'room', 'year'])
    print("After dropna [deposit, rent, area, room, year]: ", df.shape)

    # drop rows with deposit and rent = 0
    df = df[(df['deposit'] != 0) & (df['rent'] != 0)]
    print("After drop [deposit, rent] = 0: ", df.shape)

    # drop rows with a NaN neighborhood
    df = df.dropna(subset=['neighborhood'])
    print("After dropna [neighborhood]: ", df.shape)

    # set id column as index and reset index
    df = df.set_index('id')
    df = df.reset_index(drop=True)

    # drop changeAble, wc, cooling, heating, hotWater columns
    df = df.drop(['changeAble', 'wc', 'cooling', 'heating', 'hotWater'], axis=1)

    # int columns
    # replace empty area and room with 0
    df['area'] = df['area'].replace(np.nan, 0).astype(int)
    
    # df['area'] = pd.to_numeric(df['area'].apply(unidecode), errors='coerce').replace(np.nan, 0).astype(int)
    df['room'] = df['room'].replace({'بدون اتاق': '0'}, regex=True)
    df['room'] = df['room'].replace({'بیشتراز۸': '8'}, regex=True)
    df['room'] = df['room'].replace(np.nan, 0).astype(int)
    # df['room'] = pd.to_numeric(df['room'].apply(unidecode), errors='coerce').replace(np.nan, 0).astype(int)

    # string columns
    # match only empty cells; a bare '' pattern would match at every position of every string
    df['buildingFloors'] = df['buildingFloors'].replace({r'^\s*$': '1'}, regex=True)
    df['buildingFloors'] = df['buildingFloors'].replace({'بیشتر از ۱۵': '15'}, regex=True)
    df['buildingFloors'] = df['buildingFloors'].replace({'بیشتر از ۱۰': '10'}, regex=True)
    df['buildingFloors'] = df['buildingFloors'].replace({'بیشتر از ۵': '5'}, regex=True)
    df['buildingFloors'] = df['buildingFloors'].replace({'بیشتر از ۳': '3'}, regex=True)
    df['buildingFloors'] = df['buildingFloors'].replace({'انتخاب نشده': '1'}, regex=True)
    df['buildingFloors'] = df['buildingFloors'].replace(np.nan, 0).astype(int)

    df['warehouse'] = df['warehouse'].replace({'انباری ندارد': '0'}, regex=True)
    df['warehouse'] = df['warehouse'].replace({'انباری دارد': '1'}, regex=True)
    df['warehouse'] = df['warehouse'].replace({'انباری': '1'}, regex=True)
    df['warehouse'] = df['warehouse'].replace({r'^\s*$': '0'}, regex=True)
    df['warehouse'] = df['warehouse'].replace(np.nan, 0).astype(int)
    # df['warehouse'] = pd.to_numeric(df['warehouse'].apply(unidecode), errors='coerce')

    df['elavator'] = df['elavator'].replace({'آسانسور ندارد': '0'}, regex=True)
    df['elavator'] = df['elavator'].replace({'آسانسور': '1'}, regex=True)
    df['elavator'] = df['elavator'].replace({r'^\s*$': '0'}, regex=True)
    df['elavator'] = df['elavator'].replace(np.nan, 0).astype(int)
    # df['elavator'] = pd.to_numeric(df['elavator'].apply(unidecode), errors='coerce').replace(np.nan, 0).astype(int)

    df['parking'] = df['parking'].replace({'پارکینگ ندارد': '0'}, regex=True)
    df['parking'] = df['parking'].replace({'پارکینگ': '1'}, regex=True)
    df['parking'] = df['parking'].replace({r'^\s*$': '0'}, regex=True)
    df['parking'] = df['parking'].replace(np.nan, 0).astype(int)     
    # df['parking'] = pd.to_numeric(df['parking'].apply(unidecode), errors='coerce').replace(np.nan, 0).astype(int)

    df['balcony'] = df['balcony'].replace({'بالکن ندارد': '0'}, regex=True)
    df['balcony'] = df['balcony'].replace({'بالکن دارد': '1'}, regex=True)
    df['balcony'] = df['balcony'].replace({'بالکن': '1'}, regex=True)
    df['balcony'] = df['balcony'].replace({'1 انتخاب نشده': '0'}, regex=True)
    df['balcony'] = df['balcony'].replace({r'^\s*$': '0'}, regex=True)
    df['balcony'] = df['balcony'].replace(np.nan, 0).astype(int) 

    df['deposit'] = df['deposit'].replace({'مجانی': '0'}, regex=True)
    df['deposit'] = df['deposit'].replace({'توافقی': '0'}, regex=True)
    df['deposit'] = df['deposit'].replace({'٬': ''}, regex=True)
    df['deposit'] = df['deposit'].replace({'تومان': ''}, regex=True)
    df['deposit'] = df['deposit'].replace({' ': ''}, regex=True)
    df['deposit'] = df['deposit'].replace(np.nan, 0).astype(float) 
    # df['deposit'] = pd.to_numeric(df.deposit.apply(unidecode), errors='coerce').replace(np.nan, 0).astype(float)

    df['rent'] = df['rent'].replace({'مجانی': '0'}, regex=True)
    df['rent'] = df['rent'].replace({'توافقی': '0'}, regex=True)
    df['rent'] = df['rent'].replace({'٬': ''}, regex=True)
    df['rent'] = df['rent'].replace({'تومان': ''}, regex=True)
    df['rent'] = df['rent'].replace({' ': ''}, regex=True)
    df['rent'] = df['rent'].replace(np.nan, 0).astype(float) 
    # df['rent'] = pd.to_numeric(df.rent.apply(unidecode), errors='coerce').replace(np.nan, 0).astype(float)

    # fill "before 1370" with 1363 so that the spacing between years is preserved
    df['year'] = df['year'].replace({'قبل از ۱۳۷۰': '۱۳۶۳'}, regex=True)
    df['year'] = df['year'].replace(np.nan, 0).astype(int)
    # df['year'] = pd.to_numeric(df.year.apply(unidecode), errors='coerce').replace(np.nan, 0).astype(int)

    # convert unitFloor to int
    df['unitFloor'] = df['unitFloor'].replace({' ': ''}, regex=True)
    df['unitFloor'] = df['unitFloor'].replace({r'^\s*$': '0'}, regex=True)
    df['unitFloor'] = df['unitFloor'].replace({'همکف': '0'}, regex=True)
    df['unitFloor'] = df['unitFloor'].replace({'زیر0': '-1'}, regex=True)
    df['unitFloor'] = df['unitFloor'].replace(np.nan, 0).astype(int)
    # df['unitFloor'] = pd.to_numeric(df.unitFloor.apply(unidecode), errors='coerce').replace(np.nan, 0).astype(int)

    df['unitPerFloor'] = df['unitPerFloor'].replace({' ': ''}, regex=True)
    df['unitPerFloor'] = df['unitPerFloor'].replace({r'^\s*$': '1'}, regex=True)
    df['unitPerFloor'] = df['unitPerFloor'].replace({'انتخابنشده': '1'}, regex=True)
    df['unitPerFloor'] = df['unitPerFloor'].replace({'بیشتراز۸': '1'}, regex=True)
    df['unitPerFloor'] = df['unitPerFloor'].replace(np.nan, 0).astype(int)

    df['unitStatus'] = df['unitStatus'].replace({'بازسازی نشده': '0'}, regex=True)
    df['unitStatus'] = df['unitStatus'].replace({'بازسازی شده': '1'}, regex=True)
    df['unitStatus'] = df['unitStatus'].replace({' ': ''}, regex=True)
    df['unitStatus'] = df['unitStatus'].replace(np.nan, 0).astype(int)
    
    df['direction'] = df['direction'].replace({' ': ''}, regex=True)
    df['direction'] = df['direction'].replace({'شمالی': '1'}, regex=True)
    df['direction'] = df['direction'].replace({'جنوبی': '0'}, regex=True)

    # convert longitude and latitude to float
    df['longitude'] = df['longitude'].replace(np.nan, 0).astype(float)
    df['latitude'] = df['latitude'].replace(np.nan, 0).astype(float)

    # convert rent and deposit into each other to get a single number as the value of the home
    df['total_value'] = ((df['rent'] * 3) / 100) + df['deposit']

    # drop rows with total_value = 0
    df = df[df['total_value'] != 0]


    # df = df[['total_value','neighborhood','area','year','deposit','rent','elavator','parking','room','unitFloor','longitude','latitude']]
    
    # save the combined data
    df.to_csv(data_combined_file, index = False) 

### __main__ Method
In the main method, I call *save_urls*, *scrap_links*, and *clean_data* in that order.

Finally, the *Data.csv* file containing the features of the housing ads is prepared. This file can be used to analyze housing prices in all parts of *Tehran city*.

In [37]:
if __name__ == "__main__":
    # Specify the city
    city = 'tehran'
    # Create Directory for Urls and Data if not exist
    if not os.path.exists('Urls'):
        os.makedirs('Urls')
    if not os.path.exists('Data'):
        os.makedirs('Data')
    # get timestamp 
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    # timestamp = "20240227-183741"
    # Add timestamp to Urls file names
    url_file = './Urls/' + 'AdsUrls_' + city + '_'  +  timestamp + '.txt'
    data_file = './Data/' + 'Data_' +  city + '_' + timestamp  + '.csv'
    data_clean_file = './Data/' + 'Data_' +  city + '_' + 'clean_' + timestamp  + '.csv'
    data_combined_file = './Data/' + 'Dataset_' +  city + '_' + timestamp +'.csv'
    # Search Urls
    home_url = 'https://divar.ir'
    search_url = "https://divar.ir/s/" + city  + "/rent-apartment"

    #1- save the urls of advertisements in a file
    save_urls(500)

    # 2- read links from file and scrap all links
    scrap_links()
    
    # 3- change farsi characters and clean data
    clean_data()

After concat:  (43164, 27)
After drop_duplicates:  (39744, 27)
After dropna [deposit, rent, area, room, year]:  (28491, 27)
After drop [deposit, rent] = 0:  (26121, 27)
After dropna [neighborhood]:  (26121, 27)


### Tips
After running this program many times, I found the following:

1.   500 scrolls seems to be an optimal point.
2.   Running the program in the early evening hours yields more unique records.
3.   Running the program takes about half an hour for 500 scrolls.
4.   Running the program for 500 scrolls returns about 6000 unique records.
5.   You can run this program on different days and combine the results in one CSV file.
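
The last tip can be sketched with the standard `csv` module. The function below is a hypothetical helper, not part of the notebook: it merges several result files and keeps the first row seen for each `link`, assuming all files share the header written by *scrap_links*.

```python
import csv

def combine_csv_files(paths, output_path, key="link"):
    """Merge CSV files with the same header, keeping one row per key value."""
    seen = set()
    rows = []
    fieldnames = None
    for path in paths:
        with open(path, newline="", encoding="utf-8") as read_obj:
            reader = csv.DictReader(read_obj)
            if fieldnames is None:
                fieldnames = reader.fieldnames  # header of the first file
            for row in reader:
                if row[key] not in seen:
                    seen.add(row[key])
                    rows.append(row)
    with open(output_path, "w", newline="", encoding="utf-8") as write_obj:
        writer = csv.DictWriter(write_obj, fieldnames=fieldnames or [])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)  # number of unique rows written
```

De-duplicating on `link` rather than on the whole row treats an ad scraped on two different days as one record, even if its `id` differs between runs.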