# [Jabama](https://www.jabama.com/) website data scraping

In this notebook, I scrape data from the [Jabama](https://www.jabama.com/) website.

This site provides accommodation services and is one of the leaders of the accommodation market in Iran. Residences are available in almost all major or tourist towns in Iran. People can rent any kind of residence on the website for a short to medium period. Many kinds of residence are available on the site, such as apartment, villa, suite, complex, inn, hostel, ecotourism, etc.

Therefore, we can find valuable information about accommodation conditions in the country (such as price).

## Import libraries
We use three main libraries for web scraping:

1.   csv (for writing the final data in CSV format)
2.   requests (for HTTP requests to web pages)
3.   BeautifulSoup (the main library for web scraping)

In [1]:
import csv
import requests 
from bs4 import BeautifulSoup

## Software structure
I need information on all accommodations for a good analysis, but unfortunately the site does not offer a feature for retrieving all of it. I even called the website support team, and they told me that I cannot get all the information from the site directly. So I took the following steps to gather a good and complete accommodation data set.

First, I open and read a text file called *Cities.txt*, which contains the provinces and major cities of Iran. For each row in the file, I build a URL for searching the website. The URL looks like *https://www.jabama.com/search?q=مازندران&kind=accommodation&page-number=1*, which means page 1 of the search results for all accommodation types in the province or city *مازندران*. So if I scrape all pages (at most 70) for all cities and provinces, the information for almost all residences on the site will be gathered.
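
As an illustration of the URL pattern described above, here is a minimal sketch of a helper that builds the search URL. The function name `build_search_url` and the constant `BASE_URL` are hypothetical and do not appear in the notebook; the query parameters `q`, `kind`, and `page-number` are taken from the example URL.

```python
from urllib.parse import urlencode

# Assumed base address, same value as Jabama_Url in the notebook
BASE_URL = "https://www.jabama.com/"


def build_search_url(city, page_number):
    """Return the search URL for one city or province and one result page."""
    query = urlencode({
        "q": city,                    # city or province name, e.g. مازندران
        "kind": "accommodation",      # search all accommodation types
        "page-number": page_number,   # 1-based page index
    })
    return f"{BASE_URL}search?{query}"


print(build_search_url("مازندران", 1))
```

`urlencode` percent-encodes the Persian city name explicitly. The notebook itself passes the raw name in an f-string and leaves the encoding to the HTTP client, which should lead to the same request.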

Each page lists at most 12 accommodations (as of the end of the Iranian year 1399). I get the accommodations by web scraping, and for each accommodation on the page, I build a URL to request the accommodation page. The URL looks like *https://www.jabama.com/stay/cottage-102172*.

Finally, I can get all the information about an accommodation, such as price, foundation, and rooms, by scraping its page. Of course, some information, such as the accommodation kind and code, is in the accommodation URL itself.
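
To show how the kind and code can be read from the accommodation URL, here is a small sketch that mirrors the splitting logic used later in `Scrap_and_Save`. The function name `parse_stay_href` is hypothetical, and the sketch assumes paths of the form `/stay/<kind>-<code>`.

```python
def parse_stay_href(href):
    """Split '/stay/<kind>-<code>' into (kind, code).

    Returns empty strings for the parts that do not match the expected shape.
    """
    kind = code = ""
    parts = href.split("-")
    if len(parts) == 2:
        code = parts[1]
        path = parts[0].split("/")   # ['', 'stay', '<kind>']
        if len(path) == 3:
            kind = path[2]
    return kind, code


print(parse_stay_href("/stay/cottage-102172"))
```

Like the notebook code, this leaves both values empty when the path contains more than one hyphen, so such rows can be spotted in the data afterwards.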

By adding new cities to *Cities.txt*, you can gather new records of data.
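
When *Cities.txt* is edited by hand, blank lines or repeated names can slip in. The following sketch shows one possible way to clean the lines before building search URLs. The function name `clean_city_lines` is hypothetical, and removing duplicates is an assumption rather than something the notebook does.

```python
def clean_city_lines(lines):
    """Strip whitespace and drop blank lines and duplicates, keeping order."""
    seen = set()
    cities = []
    for line in lines:
        name = line.strip()
        if name and name not in seen:
            seen.add(name)
            cities.append(name)
    return cities


# Works on any iterable of lines, including an open file object
print(clean_city_lines(["تهران\n", "\n", "اردبیل\n", "تهران\n"]))
```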

#### Scrap_and_Save function
This function takes the result of each search page, scrapes it, and saves the accommodation data to the result file in CSV format. I separated the scraping section of my code into a function to make it easier to understand.

In [2]:
# Scrape the page soup passed as a parameter
# It works on pages like https://www.jabama.com/city-tehran?page-number=3
def Scrap_and_Save(soup, data_file) :
    # Find all stays
    stays = soup.find_all('a', attrs= {'class':'vertical-card'})
    for each_stay in stays:
        temp_Url = Jabama_Url_WithoutSlash + each_stay['href']
        
        # Check if stay is found
        response = requests.get(temp_Url)
        response.encoding = 'utf-8'
        if response.status_code != 200:
            continue
        
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Find url features
        code = kind = ''
        split_url = each_stay['href'].split('-')

        print('        scraping stay :',each_stay['href'])

        if len(split_url) == 2 :
            code = split_url[1]
            split_url = split_url[0].split('/')
            if len(split_url) == 3 :
                kind = split_url[2]

        # Find direct features
        price = comment = score = city = ''
        price = soup.find('span', attrs= {'class':'box-title'})
        comment = soup.find('span', attrs= {'class':'count'})
        score = soup.find('span', attrs= {'class':'score'})
        city = soup.find('span', attrs= {'class':'city-province'})

        if comment is not None: comment = comment.get_text().strip().replace('(','').replace(')','')
        if score is not None: score = score.get_text().strip()
        if city is not None: city = city.get_text().strip().replace('،','-')
        if price is not None: price = price.get_text().strip().split('تومان')[0].replace(',','')
        # if comment == 'جدید' : comment = ''
        # if score == '' : score = '0'

       
        # Find accommodation specification features
        foundation = area = room = capacity = ''
        double_bed = single_bed = iranian_bed = ''
        toile = bath = ''
        specifications = soup.find('div', attrs= {'class':'accommodation-info__specifications'})
        if specifications != None:
            specs = specifications.find_all('div', attrs= {'class':'accommodation-spec'}, recursive=False)
            if specs != None :
                # Count the specifications and caption row items
                index1 = index2 = 0
                for each_spec in specs:
                    content = each_spec.find('div', attrs= {'class':'content'}, recursive=False)
                    if content == None : continue
                    caption_container = content.find('div', attrs= {'class':'caption-container'}, recursive=False)
                    if caption_container == None : continue
                    caption_rows = caption_container.find_all('div', attrs= {'class':'caption-row'}, recursive=False)
                    if caption_rows == None : continue
                    index2 = 0
                    for each_caption_row in caption_rows:
                        capt = each_caption_row.find('span', attrs= {'class':'caption'}, recursive=False)
                        if capt == None : continue
                        temp_str = capt.get_text().strip()
                        
                        # Specify the variable
                        if index1 == 0:
                            if index2 == 0: foundation = temp_str
                            if index2 == 1: area = temp_str
                            if index2 == 2: room = temp_str
                        if index1 == 1:
                            if index2 == 0: capacity = temp_str
                        if index1 == 2:
                            if index2 == 0: double_bed = temp_str
                            if index2 == 1: single_bed = temp_str
                            if index2 == 2: iranian_bed = temp_str
                        if index1 == 3:
                            if index2 == 0: toile = temp_str
                            if index2 == 1: bath = temp_str
                        index2 += 1
                    index1 += 1
        
        # Find accommodation amenities features
        water = water_cooler = refrigerator = closet = 'False'
        cooking = oven = furniture = dining_table = 'False'
        restaurant = green_space = lobby = elavator = 'False'
        amenities_container = soup.find('div', attrs= {'class':'accommodation-amenities__list'})
        if amenities_container != None :
            amenities_list = amenities_container.find_all('div', attrs= {'class':'accommodation-amenities__amenity'}, recursive=False)
            amenities_list_missed = amenities_container.find_all('div', attrs= {'class':'accommodation-amenities__amenity missed'}, recursive=False)

            for each_aminity in amenities_list:
                if each_aminity in amenities_list_missed : continue
                temp_str = each_aminity.find('span', recursive=False)
                if temp_str != None:
                    temp_str = temp_str.get_text().strip()

                    if temp_str == 'آب' : water = 'True'
                    if temp_str == 'یخچال' : refrigerator = 'True'
                    if temp_str == 'کولر آبی' : water_cooler = 'True'
                    if temp_str == 'کمد/دراور' : closet = 'True'
                    if temp_str == 'لوازم آشپزی' : cooking = 'True'
                    if temp_str == 'اجاق گاز' : oven = 'True'
                    if temp_str == 'مبلمان' : furniture = 'True'
                    if temp_str == 'میز نهارخوری' : dining_table = 'True'
                    if temp_str == 'رستوران' : restaurant = 'True'
                    if temp_str == 'فضای سبز' : green_space = 'True'
                    if temp_str == 'لابی' : lobby = 'True'
                    if temp_str == 'آسانسور' : elavator = 'True'
        
        new_row = [code,kind,price,comment,score,city,foundation,area
                    ,room,capacity,double_bed,single_bed,iranian_bed
                    ,toile,bath,water,water_cooler,refrigerator,closet
                    ,cooking,oven,furniture,dining_table,restaurant
                    ,green_space,lobby,elavator]
        
        with open(data_file, 'a+', newline='', encoding='utf-8') as write_obj:
            # Create a writer object from csv module
            csv_writer = csv.writer(write_obj)
            # Add contents of list as last row in the csv file
            csv_writer.writerow(new_row)
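
The specification fields in `Scrap_and_Save` are assigned by position: the index of the specification block and the index of the caption row inside it. As an alternative sketch, the same mapping can be written as a lookup table. The names `SPEC_FIELDS` and `map_specs` are hypothetical, and the function assumes the caption texts have already been extracted into nested lists.

```python
# (block index, row index) -> column name, mirroring the index1/index2 checks
SPEC_FIELDS = {
    (0, 0): "foundation", (0, 1): "area", (0, 2): "room",
    (1, 0): "capacity",
    (2, 0): "double_bed", (2, 1): "single_bed", (2, 2): "iranian_bed",
    (3, 0): "toile", (3, 1): "bath",
}


def map_specs(blocks):
    """Map caption texts to column names.

    blocks is a list of lists of caption strings, one inner list per
    specification block. Positions without a mapping are ignored, and
    columns without a caption stay empty.
    """
    row = {name: "" for name in SPEC_FIELDS.values()}
    for block_index, captions in enumerate(blocks):
        for row_index, text in enumerate(captions):
            field = SPEC_FIELDS.get((block_index, row_index))
            if field is not None:
                row[field] = text.strip()
    return row


example = map_specs([["a", "b", "c"], ["4"], ["1", "2"], ["1", "1"]])
print(example["capacity"], example["bath"])
```

A table like this keeps the position-to-column rules in one place, which may make them easier to update if the page layout changes.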


#### Main function
In the main function, I just open the *Cities.txt* file, create the search URL, and request it. Then I call the Scrap_and_Save function with the parsed response.

Tip 1: The site returns a regular web page instead of a not-found error (404) when the searched city is not found or there are no more results. Therefore, I detect this case by looking for a *div* with a specific *class* and then break out of the page loop for that city.
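
The check described in the tip can be sketched as a small helper. The function name `is_empty_result` is hypothetical; the class name `listing-empty-state` is the one used in the main cell, and the sketch assumes the site still marks empty result pages this way.

```python
from bs4 import BeautifulSoup


def is_empty_result(html):
    """Return True if the page contains the empty-state div."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("div", attrs={"class": "listing-empty-state"}) is not None


print(is_empty_result('<div class="listing-empty-state">No results</div>'))
print(is_empty_result('<a class="vertical-card" href="/stay/inn-85754"></a>'))
```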


In [None]:
if __name__ == "__main__":
    Data_File_Name = 'Data.csv'
    City_File_Name = 'Cities.txt'
    
    # Write the headers in data csv file
    with open(Data_File_Name, mode='w', newline='', encoding='utf-8') as csv_file:
        handle = csv.writer(csv_file)
        handle.writerow(['code','kind','price','comment','score','city'
            ,'foundation','area','room','capacity','double_bed','single_bed','iranian_bed'
            ,'toile','bath','water','water_cooler','refrigerator','closet','cooking','oven'
            ,'furniture','dining_table','restaurant','green_space','lobby','elavator'])

    city_list = list()
    # Read city source file
    with open(City_File_Name,'r',encoding='utf-8') as city_file:
        lines = city_file.readlines()
        for each_line in lines:
            city_list.append(each_line.replace('\n',''))

    # Get HTML of all cities
    Jabama_Url = 'https://www.jabama.com/'
    Jabama_Url_WithoutSlash = 'https://www.jabama.com'

    for each_city in city_list :
        # Visit all pages (at most Max_page_no pages)
        Max_page_no = 70
        for each_page in range(Max_page_no):
            temp_page = str(each_page + 1)

            print('Scraping', each_city, 'page', temp_page, ':')
            
            temp_Url = f'{Jabama_Url}search?q={each_city}&kind=accommodation&page-number={temp_page}'

            # Check if page is found
            response = requests.get(temp_Url)
            if response.status_code != 200:
                break

            soup = BeautifulSoup(response.content, 'html.parser')

            # Jabama returns this page instead of a 404
            check_empty = soup.find_all('div', attrs= {'class':'listing-empty-state'})
            if len(check_empty) > 0:
                break
            
            Scrap_and_Save(soup, Data_File_Name)
        

Scraping اردبیل page 1 :
        scraping stay : /stay/inn-85754
        scraping stay : /stay/inn-85779
        scraping stay : /stay/inn-85775
        scraping stay : /stay/ecotourism-331103
        scraping stay : /stay/ecotourism-331102
        scraping stay : /stay/ecotourism-329207
        scraping stay : /stay/ecotourism-329206
        scraping stay : /stay/complex-93549
        scraping stay : /stay/apartment-108984
        scraping stay : /stay/ecotourism-88391
        scraping stay : /stay/ecotourism-88376
        scraping stay : /stay/apartment-332256
Scraping اردبیل page 2 :
        scraping stay : /stay/complex-93564
        scraping stay : /stay/complex-93545
        scraping stay : /stay/suite-331660
        scraping stay : /stay/complex-93552
        scraping stay : /stay/complex-87617
        scraping stay : /stay/ecotourism-329208
        scraping stay : /stay/apartment-333602
        scraping stay : /stay/complex-93554
        scraping stay : /stay/ecotourism-88395
 

Finally, I will have a csv file called *Data.csv* after running the program. The file contains 27 features of accommodations. This file w