# Web Scraping Project: Scraping the Hotel List from Booking.com

For this project, I scraped the list of hotels available in `New Delhi` from booking.com. `Booking.com` is one of the best-known and most trusted hotel and accommodation booking websites, used all around the world.

Link : https://www.booking.com/ 


###  Our Objective

- Scrape the list of hotels available in `New Delhi`.
- We'll scrape basic details like hotel name, reviews, locality, ratings and more.
- Finally, we'll make a CSV file for further analysis and use.

### Our Tools

For this web scraping project I'm going to use `Jupyter Notebook`, `Python`, and Python libraries such as `bs4`, `pandas`, `lxml`, `requests`, `matplotlib`, and `jovian`.

![](https://i.imgur.com/0eACaKm.png)

### Special Note

Web scraping projects are highly prone to errors and malfunctions. They require constant updates, because the website being scraped keeps changing.

Not all websites and services allow web scraping, so please check beforehand whether scraping is permitted.
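
One way to check part of this is a site's `robots.txt` file. The sketch below uses Python's standard `urllib.robotparser` on a made-up `robots.txt`; the rules, the `my-scraper` user agent and the `example.com` URLs are all hypothetical and are not the real rules of any website. A site's terms of service may add restrictions that `robots.txt` does not show.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# These are NOT the real rules of any website.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/hotels"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/data"))
```

For a real site, `RobotFileParser.set_url()` followed by `read()` loads the live file instead of a string.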

# 0. Imports

**1. `Beautiful Soup`**

The `bs4` module helps in pulling data out of HTML and XML files. We need to import the `BeautifulSoup` class from the `bs4` module.

For more info: https://www.crummy.com/software/BeautifulSoup/bs4/doc/


**2. `Requests`**

`requests` is an elegant and simple HTTP library for Python, used to request and retrieve data from the server.

For more info: https://docs.python-requests.org/en/master/


**3. `lxml`**

`lxml` is a feature-rich and easy-to-use library for processing and parsing XML and HTML. Although Python has a built-in parser available, I prefer `lxml` because of its speed.

For more info: https://lxml.de/

**4. `Pandas`**

`Pandas` is one of the most powerful and easy-to-use data structure and data analysis tools. We'll use it for building, viewing and saving our scraped data.

For more info: https://pandas.pydata.org/docs/index.html

**5. `Matplotlib`**

`matplotlib` is an easy-to-use data visualisation library. We'll use it for making graphs and plots out of our scraped data.

For more info: https://matplotlib.org/stable/contents.html

In [1]:
# Installation if needed

#!pip install pandas bs4 lxml requests matplotlib jovian --q

In [2]:
from bs4 import BeautifulSoup # Import for Beautiful Soup
import requests # Import for requests
import lxml # Import for lxml parser
import pandas as pd # Import for Pandas
import matplotlib.pyplot as plt # Import for Matplotlib Pyplot
%matplotlib inline

# 1. Preliminary Details

### Q. What link are we going to use?

We need to do some setup on the website before we scrape:

1. Select the city we need to scrape:

> I've chosen `New Delhi` for this project.

2. Sort the results by `Top Reviewed`:

![](https://i.imgur.com/BEPhGNz.png)

> As the page shows, `New Delhi` has around 1700 places to book, and the maximum `booking.com` can show is 40 pages with 25 listings per page.

![](https://i.imgur.com/hOWXv1G.png)

> So, `25*40=1000`: we can see and select at most 1000 hotels, so it's better to grab the places with better ratings.

3. Check the first and last page:

Comparing the two links shows how the URL is built. Understanding this is important in order to manipulate it as per our needs.

First Page Link : https://www.booking.com/searchresults.html?aid=304142&label=gen173nr-1FCAEoggI46AdIM1gEaGyIAQGYATG4ARnIAQ_YAQHoAQH4AQKIAgGoAgO4AsrBzIcGwAIB0gIkNDFiZWU4NjMtMDU1My00NjA4LTkxMmYtMGE2NzliNTRiNTNl2AIF4AIB&sid=55fa6423aa5026996217708ea6be131f&tmpl=searchresults&city=-2106102&class_interval=1&dest_id=-2106102&dest_type=city&dr_ps=IDR&from_idr=1&group_adults=2&group_children=0&ilp=1&label_click=undef&no_rooms=1&order=bayesian_review_score&raw_dest_type=city&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&srpvid=518283f1feb10049&ssb=empty&top_ufis=1&sig=v1Q1dbuDSb&rows=25&offset=0 

Last Page Link : https://www.booking.com/searchresults.html?aid=304142&label=gen173nr-1FCAEoggI46AdIM1gEaGyIAQGYATG4ARnIAQ_YAQHoAQH4AQKIAgGoAgO4AsrBzIcGwAIB0gIkNDFiZWU4NjMtMDU1My00NjA4LTkxMmYtMGE2NzliNTRiNTNl2AIF4AIB&sid=55fa6423aa5026996217708ea6be131f&tmpl=searchresults&city=-2106102&class_interval=1&dest_id=-2106102&dest_type=city&dr_ps=IDR&from_idr=1&group_adults=2&group_children=0&ilp=1&label_click=undef&no_rooms=1&order=bayesian_review_score&raw_dest_type=city&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&srpvid=518283f1feb10049&ssb=empty&top_ufis=1&sig=v1Q1dbuDSb&rows=25&offset=975 

**Note:** The link looks long, but its last part shows that every page displays 25 results (`rows=25`) and that the offset goes from 0 (`offset=0`) to 975 (`offset=975`) in increments of 25, which gives 40 pages.
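
The paging arithmetic from the note can be sketched as below. The base URL here is deliberately simplified and only carries the `rows` and `offset` parameters; the real link above has many more, and the helper names `page_offsets` and `page_url` are made up for this illustration.

```python
from urllib.parse import urlencode

ROWS_PER_PAGE = 25
MAX_PAGES = 40

# Simplified base URL for illustration; the real link carries many more parameters.
BASE_URL = "https://www.booking.com/searchresults.html"


def page_offsets(rows=ROWS_PER_PAGE, pages=MAX_PAGES):
    """Return the offset of the first result on each page: 0, 25, ..., 975."""
    return [page * rows for page in range(pages)]


def page_url(offset, rows=ROWS_PER_PAGE):
    """Attach the paging parameters to the simplified base URL."""
    return BASE_URL + "?" + urlencode({"rows": rows, "offset": offset})


offsets = page_offsets()
print(len(offsets), offsets[0], offsets[-1])  # 40 0 975
print(page_url(offsets[-1]))
```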

### Q. What content are we going to scrape?

![img](https://i.imgur.com/eNSw7iC.png)

1. We'll make a sample csv file with the details we need for our reference.

![img](https://i.imgur.com/tYX34aY.png)

In [3]:
# Load the sample csv file
sample_df = pd.read_csv('hotel-sample.csv')

In [4]:
# To see the csv file
sample_df

| Unnamed: 0 | Hotel Name | Place | Review (Points) | Review (Cat.) | Review (Count) | Distance (City Centre) | Metro Facilities | Description |
|---|---|---|---|---|---|---|---|---|
| 0 | The Oberoi New Delhi | New Delhi | 9.6 | Exceptional | 460 reviews | 4.6 km from centre | No Metro Access | Overlooking Delhi's Golf Club and located in C... |
| 1 | Bungalow 99 | South Delhi, New Delhi | 9.7 | Exceptional | 216 reviews | 6.8 km from centre | No Metro Access | Located in New Delhi, 2.6 mi from Humayun's To... |


### Q. Any additional things we should take care of?

Websites sometimes block scraping tools from seeing the page, so we'll pass a `headers` argument when requesting the web page, just to be on the safe side.

# 2. Experiment Stage

In this stage we'll run different experiments on the web page:

- Fix the details we need and load the web page.
- Look for the relevant classes, tags and links related to the information.
- Make functions for later use.

## 2.1 Fix Preliminary Details

In [5]:
main_link = "https://www.booking.com/searchresults.html?aid=304142&label=gen173nr-1FCAEoggI46AdIM1gEaGyIAQGYATG4ARnIAQ_YAQHoAQH4AQKIAgGoAgO4AsrBzIcGwAIB0gIkNDFiZWU4NjMtMDU1My00NjA4LTkxMmYtMGE2NzliNTRiNTNl2AIF4AIB&sid=55fa6423aa5026996217708ea6be131f&tmpl=searchresults&city=-2106102&class_interval=1&dest_id=-2106102&dest_type=city&dr_ps=IDR&from_idr=1&group_adults=2&group_children=0&ilp=1&label_click=undef&no_rooms=1&order=bayesian_review_score&raw_dest_type=city&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&srpvid=518283f1feb10049&ssb=empty&top_ufis=1&sig=v1Q1dbuDSb&rows=25&offset=0"

headers = {'User-Agent':'Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion'}

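Here we set the link we are going to use, the same first-page link shown above. Additionally, we set the headers that we are going to pass with the request. I'm using Mozilla Firefox, so my `User-Agent` follows the Firefox format; the value differs between browsers and operating systems. Note that the string above is the generic Firefox template, with placeholders such as `platform` and `firefoxversion` left unfilled.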
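
As a rough illustration of how the template fields map to a concrete header, the sketch below fills them in. The helper `firefox_user_agent` and the example values (a Windows 10 desktop with Firefox 115) are assumptions chosen for illustration; use whatever your own browser reports.

```python
def firefox_user_agent(platform, gecko_version, firefox_version):
    """Fill in the placeholder fields of the desktop Firefox User-Agent template."""
    return (
        f"Mozilla/5.0 ({platform}; rv:{gecko_version}) "
        f"Gecko/20100101 Firefox/{firefox_version}"
    )


# Example values only, not necessarily the ones used for this project.
example_headers = {
    "User-Agent": firefox_user_agent("Windows NT 10.0; Win64; x64", "109.0", "115.0")
}
print(example_headers["User-Agent"])
```

A dictionary like `example_headers` can be passed to `requests.get` through its `headers` argument, in the same way as `headers` above.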

## 2.2 Test with the Main Page

In [6]:
# Request for the URL
page = requests.get(main_link, headers=headers)

# We can also check the response code
print(page.status_code)

# Make it a soup
soup = BeautifulSoup(page.text,"lxml")

# Display Soup ( Main Page)
soup

200


<!DOCTYPE html>
<!--
You know you could be getting paid to poke around in our code?
We're hiring designers and developers to work in Amsterdam:
https://careers.booking.com/
--><!-- wdot-802 --><html><head><link crossorigin="" href="https://cf.bstatic.com" rel="dns-prefetch"/>
<link crossorigin="" href="https://cf.bstatic.com" rel="dns-prefetch"/>
<link crossorigin="" href="https://shelves.booking.com/" rel="preconnect"/>
<meta content="origin" name="referrer"/>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<script nonce="BB1kQbHWmKDNKfX"> function b_cors_check(supported) { var value = supported ? 1 : 0; if (!/(^|;)\s*cors_js=/.test(document.cookie)) { var d = new Date(); d.setTime(d.getTime() + 60 * 60 * 24 * 365 * 1000); var cookieDomain = '.booking.com' || '.booking.com'; document.cookie = 'cors_js=' + value +'; domain=' + cookieDomain + '; path=/; expires=' + d.toGMTString(); } if (!value) { location.reload(); } } </script>
<script nonce="BB1kQbHWmKDNKfX">(fun

First we requested the web page with the `get` function of the `requests` module, passing the URL and the headers. A status code of `200` means the request succeeded.

Then we parsed the response text with the `lxml` parser to get the soup.

* For more info (`requests.get()`) : https://docs.python-requests.org/en/master/user/quickstart/#make-a-request
* For more info (`lxml`) : https://lxml.de/index.html

### Function for Page and Soup

We made a complete function to get the soup of a particular url

In [7]:
# The URL must contain a `{}` placeholder for the offset

def get_page(url, head, num):
    """
    Request one results page and return its soup.
    `url` : link as a string, with a `{}` placeholder for the offset
    `head` : headers for the request
    `num` : offset of the first result on the page (0, 25, ..., 975)
    """
    page_url = url.format(num)
    page = requests.get(page_url, headers=head)
    if page.status_code != 200:
        # Report the formatted URL so the failing page can be identified
        raise Exception('Failed to load page {}'.format(page_url))
    soup = BeautifulSoup(page.text, "lxml")
    return soup
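
Since a single failed request stops the whole run, one possible extension is to retry a few times before giving up. The sketch below is not part of the original notebook: `fetch_with_retry` is a hypothetical helper, and `fetch` stands for any callable that returns an object with a `status_code` attribute, for example `lambda u: requests.get(u, headers=headers)`. The demo uses a fake fetcher, so it runs without network access.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """
    Call `fetch(url)` until the response has status code 200.
    Raises RuntimeError after `retries` unsuccessful attempts.
    """
    last_status = None
    for attempt in range(1, retries + 1):
        response = fetch(url)
        last_status = response.status_code
        if last_status == 200:
            return response
        if attempt < retries:
            time.sleep(delay * attempt)  # wait a little longer after each failure
    raise RuntimeError(f"Failed to load {url} (last status: {last_status})")


# Demo with a fake fetcher: fails twice, then succeeds.
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code


codes = iter([503, 503, 200])
response = fetch_with_retry(
    lambda url: FakeResponse(next(codes)), "https://example.com", delay=0
)
print(response.status_code)  # 200
```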

### Check if we are getting 25 results for the page

1. Right-click on the page and choose `Inspect (Q)` to look at the HTML of a hotel block.
2. Note the relevant class names of that block.

![img](https://i.imgur.com/71xTbPp.png?1)

In [8]:
blocks = soup.select(".sr_item.sr_item_new.sr_item_default.sr_property_block.sr_item_no_dates")

In [9]:
len(blocks)

25

In [10]:
# For first Block
blocks[0]

<div class="sr_item sr_item_new sr_item_default sr_property_block sr_item_no_dates" data-class="5" data-et-click="customGoal:NAREFBINEIfBccOHT:2" data-hotelid="77814" data-score="9.6">
<table class="sr_item_legacy">
<tbody>
<tr>
<td class="sr_item_legacy_photo" id="hotel_77814" rowspan="2">
<svg aria-hidden="true" class="bk-icon -iconset-heart" focusable="false" height="128" role="presentation" style="display:none;" viewbox="0 0 128 128" width="128"><path d="M64 112a3.6 3.6 0 0 1-2-.5 138.8 138.8 0 0 1-44.2-38c-10-14.4-10.6-26-9.4-33.2a29 29 0 0 1 22.9-23.7c11.9-2.4 24 2.5 32.7 13a33.7 33.7 0 0 1 32.7-13 29 29 0 0 1 22.8 23.7c1.3 7.2.6 18.8-9.3 33.3-9.1 13.1-24 25.9-44.2 37.9a3.6 3.6 0 0 1-2 .5z"></path></svg>
<svg aria-hidden="true" class="bk-icon -iconset-loading" focusable="false" height="128" role="presentation" style="display:none;" viewbox="0 0 128 128" width="128"><path d="m64 8a4.67 4.67 0 0 1 4.67 4.67v18.66a4.67 4.67 0 0 1 -4.67 4.67 4.67 4.67 0 0 1 -4.67-4.67v-18.66a4.67 4.6

Here we picked the `sr_item sr_item_new sr_item_default sr_property_block sr_item_no_dates` class from the soup to get the individual hotel blocks. We expect 25 blocks, one for each of the 25 hotels shown on the first page.

**Note:** Did you notice that we replaced the `' '` spaces between the class names with `'.'` periods? That's the notation a CSS selector needs: each class name is prefixed with a period, and chaining them without spaces matches elements that carry all of those classes.

Finally we checked the HTML code of the first block.
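
The conversion from a `class` attribute to a selector string is mechanical, as this small sketch shows. `classes_to_selector` is a hypothetical helper written for this illustration, not a Beautiful Soup function.

```python
def classes_to_selector(class_attribute):
    """Turn an HTML class attribute value into a compound CSS class selector."""
    return "".join("." + name for name in class_attribute.split())


selector = classes_to_selector(
    "sr_item sr_item_new sr_item_default sr_property_block sr_item_no_dates"
)
print(selector)
# .sr_item.sr_item_new.sr_item_default.sr_property_block.sr_item_no_dates
```

The resulting string is the one passed to `soup.select` above.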

## 2.3 Test for Hotel Name

![](https://i.imgur.com/LsYrw67.png?2)

In [11]:
section = soup.select(".sr_item.sr_item_new.sr_item_default.sr_property_block.sr_item_no_dates")[0] # For First Block

In [12]:
anchor = section.select("a")[1] # Second Anchor Tag in First Block
anchor

<a class="js-sr-hotel-link hotel_name_link url" href="
/hotel/in/the-oberoi-new-delhi.en-gb.html?aid=304142&amp;label=gen173nr-1FCAEoggI46AdIM1gEaGyIAQGYATG4ARnIAQ_YAQHoAQH4AQKIAgGoAgO4AsrBzIcGwAIB0gIkNDFiZWU4NjMtMDU1My00NjA4LTkxMmYtMGE2NzliNTRiNTNl2AIF4AIB&amp;sid=0727d5416c4adf5ee8bf507a782f88a7&amp;dest_id=-2106102&amp;dest_type=city&amp;group_adults=2&amp;group_children=0&amp;hapos=1&amp;hpos=1&amp;no_rooms=1&amp;sr_order=bayesian_review_score&amp;srepoch=1626604410&amp;srpvid=78534a3c262b00f5&amp;ucfs=1&amp;sig=v1TFFrz92c&amp;from=searchresults
;highlight_room=#hotelTmpl" rel="noopener" target="_blank">
<span class="sr-hotel__name" data-et-click="   ">
The Oberoi New Delhi
</span>
<span class="invisible_spoken">Opens in new window</span>
</a>

In [13]:
# Select the First Section
section = soup.select(".sr_item.sr_item_new.sr_item_default.sr_property_block.sr_item_no_dates")[0]

# Select the second anchor tag
anchor = section.select("a")[1]

# Select the first span tag
span = anchor.select("span")[0]

# Get the name of hotel
name = span.getText().split('\n')[1]
name

'The Oberoi New Delhi'

Here we selected the first `block` to get the hotel name.

We needed to dive into the `anchor` tags. The second anchor tag (`[1]`) contains the details we require.

The text itself sits inside a `span` tag, so we selected the first `span` within the anchor.

Finally, after splitting on newlines and choosing the right index, we got the name.
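
To see why index `[1]` is the right one, the sketch below splits a string shaped like the span text above. `clean_text` is a hypothetical alternative that does not depend on the position of the newlines; it is not what the notebook uses.

```python
def clean_text(text):
    """Collapse all runs of whitespace (including newlines) to single spaces."""
    return " ".join(text.split())


# Text shaped like the output of span.getText() for the first hotel.
raw_name = "\nThe Oberoi New Delhi\n"

parts = raw_name.split("\n")
print(parts)                 # ['', 'The Oberoi New Delhi', '']
print(parts[1])              # The Oberoi New Delhi
print(clean_text(raw_name))  # The Oberoi New Delhi
```

The leading newline produces an empty first element, which is why the name sits at index 1 rather than 0.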

### Function for Hotel Name

We made a complete function to get the hotel name from particular block

In [14]:
def get_hotel_name(section):
    """
    Function gives us the hotel name 
    """
    anchor = section.select("a")[1]
    span = anchor.select("span")[0]
    name = span.getText().split('\n')[1]
    return name

## 2.4 Test for Location

In [15]:
# Select the Third Anchor Tag
anchor = section.select("a")[2]
anchor

<a class="bui-link" data-coords="77.2402811050415,28.596231522505" data-et-click="customGoal:HZUaQRSeBcKHSYeGXT:1" data-google-track="Click/Action: sr_map_link_used" data-map-caption="" href="/hotel/in/the-oberoi-new-delhi.en-gb.html?aid=304142&amp;amp;amp;label=gen173nr-1FCAEoggI46AdIM1gEaGyIAQGYATG4ARnIAQ_YAQHoAQH4AQKIAgGoAgO4AsrBzIcGwAIB0gIkNDFiZWU4NjMtMDU1My00NjA4LTkxMmYtMGE2NzliNTRiNTNl2AIF4AIB&amp;amp;amp;sid=0727d5416c4adf5ee8bf507a782f88a7&amp;amp;amp;dest_id=-2106102&amp;amp;amp;dest_type=city&amp;amp;amp;group_adults=2&amp;amp;amp;group_children=0&amp;amp;amp;hapos=1&amp;amp;amp;hpos=1&amp;amp;amp;no_rooms=1&amp;amp;amp;sr_order=bayesian_review_score&amp;amp;amp;srepoch=1626604410&amp;amp;amp;srpvid=78534a3c262b00f5&amp;amp;amp;ucfs=1&amp;amp;amp;sig=v1TFFrz92c&amp;amp;amp;from=searchresults;map=1&amp;amp;amp;msd=1#hotelTmpl" rel="noopener" target="_blank">
New Delhi
<span class="sr_card_address_line__item">
<span class="sr_card_address_line__dot-separator"></span>
Show on ma

In [16]:
place = anchor.getText().split('\n')[1]
place

'New Delhi'

In [17]:
# Third Anchor Tag
anchor = section.select("a")[2]

# Selecting the name of the location
place = anchor.getText().split('\n')[1]
place

'New Delhi'

Interestingly, we found that the third anchor tag (`[2]`) contains the details regarding the location of the hotel.

Again we needed to split the text and select the relevant index to get our details.

### Function for Location

A complete function to get the location of our hotel

In [18]:
def get_location(section):
    """
    Function gives us the location info
    """
    anchor = section.select("a")[2]
    place = anchor.getText().split('\n')[1]
    return place

## 2.5 Testing for the Review

In [19]:
# Check for the Fourth Anchor Tag
section.select("a")[3]

<a class="sr-review-score__link" href="/hotel/in/the-oberoi-new-delhi.en-gb.html?aid=304142&amp;amp;label=gen173nr-1FCAEoggI46AdIM1gEaGyIAQGYATG4ARnIAQ_YAQHoAQH4AQKIAgGoAgO4AsrBzIcGwAIB0gIkNDFiZWU4NjMtMDU1My00NjA4LTkxMmYtMGE2NzliNTRiNTNl2AIF4AIB&amp;amp;sid=0727d5416c4adf5ee8bf507a782f88a7&amp;amp;dest_id=-2106102&amp;amp;dest_type=city&amp;amp;group_adults=2&amp;amp;group_children=0&amp;amp;hapos=1&amp;amp;hpos=1&amp;amp;no_rooms=1&amp;amp;sr_order=bayesian_review_score&amp;amp;srepoch=1626604410&amp;amp;srpvid=78534a3c262b00f5&amp;amp;ucfs=1&amp;amp;sig=v1TFFrz92c&amp;amp;from=searchresults;from_sr_review=1;#hotelTmpl" target="_blank">
<div class="bui-review-score c-score bui-review-score--end"> <div aria-label="Scored  " class="bui-review-score__badge"> 9.6 </div> <div class="bui-review-score__content"> <div class="bui-review-score__title"> 
Exceptional
 </div> <div class="bui-review-score__text"> 460 reviews </div> </div> </div>
<span class="invisible_spoken">Opens in new window</s

In [20]:
review = section.select("a")[3].getText()
review_data = review.split('\n')

# Review Rating
rating = review_data[1].rstrip().lstrip()

# Review Category
category = review_data[2].rstrip().lstrip()

# Review Count
count = review_data[3].rstrip().lstrip()

In [21]:
rating, category, count

('9.6', 'Exceptional', '460 reviews')

Here the fourth anchor tag (`[3]`) contains the details of the review section.

Again we split the text to get all three details we needed.

### Function for Review

We found that some hotels don't have all the components of the review, so we added an error handling block for each component to make the function more robust.

In [22]:
# Not all hotels have every review field, so each one gets its own try/except

def get_review_details(section):
    """
    Function gives us the Review Rating, Review Category and Review Count
    """
    review = section.select("a")[3].getText()
    review_data = review.split('\n')

    # For Rating
    try:
        rating = review_data[1].strip()
    except IndexError:
        rating = 'Nan'

    # For Category
    try:
        category = review_data[2].strip()
    except IndexError:
        category = 'Nan'

    # For Count
    try:
        count = review_data[3].strip()
    except IndexError:
        count = 'Nan'

    return rating, category, count
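
The function returns all three values as strings. For later analysis it may help to convert them to numbers; the sketch below shows one way, with the hypothetical helpers `parse_rating` and `parse_review_count`. Both return `None` for the `'Nan'` placeholder. A plain `float('Nan')` call would instead produce a floating-point NaN, which is why the rating is checked with a regular expression first.

```python
import re


def parse_rating(text):
    """Convert a rating such as '9.6' to a float; return None if it is not a number."""
    match = re.fullmatch(r"\d+(?:\.\d+)?", (text or "").strip())
    return float(match.group()) if match else None


def parse_review_count(text):
    """Convert a count such as '1,240 reviews' to 1240; return None if no number is found."""
    match = re.search(r"\d[\d,]*", text or "")
    if match is None:
        return None
    return int(match.group().replace(",", ""))


print(parse_rating("9.6"))                  # 9.6
print(parse_review_count("1,240 reviews"))  # 1240
print(parse_rating("Nan"), parse_review_count("Nan"))  # None None
```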

## 2.6 Testing for the Distance and Metro Info

![img](https://i.imgur.com/e3jWlVd.jpg)

In [23]:
# New class for the Distance and Metro Info
blocks[3].select(".sr_card_address_line")[0]

<div class="sr_card_address_line">
<a class="bui-link" data-coords="77.2259970009327,28.6743843250181" data-et-click="customGoal:HZUaQRSeBcKHSYeGXT:1" data-google-track="Click/Action: sr_map_link_used" data-map-caption="" href="/hotel/in/maidens.en-gb.html?aid=304142&amp;amp;amp;label=gen173nr-1FCAEoggI46AdIM1gEaGyIAQGYATG4ARnIAQ_YAQHoAQH4AQKIAgGoAgO4AsrBzIcGwAIB0gIkNDFiZWU4NjMtMDU1My00NjA4LTkxMmYtMGE2NzliNTRiNTNl2AIF4AIB&amp;amp;amp;sid=0727d5416c4adf5ee8bf507a782f88a7&amp;amp;amp;dest_id=-2106102&amp;amp;amp;dest_type=city&amp;amp;amp;group_adults=2&amp;amp;amp;group_children=0&amp;amp;amp;hapos=4&amp;amp;amp;hpos=4&amp;amp;amp;no_rooms=1&amp;amp;amp;sr_order=bayesian_review_score&amp;amp;amp;srepoch=1626604410&amp;amp;amp;srpvid=78534a3c262b00f5&amp;amp;amp;ucfs=1&amp;amp;amp;sig=v1_LOOJYxQ&amp;amp;amp;from=searchresults;map=1&amp;amp;amp;msd=1#hotelTmpl" rel="noopener" target="_blank">
North Delhi, New Delhi
<span class="sr_card_address_line__item">
<span class="sr_card_address_lin

In [24]:
# Distance Info
center_distance = blocks[3].select(".sr_card_address_line__user_destination_address")[0].getText()

distance = center_distance.split("\n")
distance[1]

'4.4 km from centre'

In [25]:
# Metro Info
metro_list = blocks[3].select(".sr_card_address_line__item")[1].getText()
metro = metro_list.split("\n")[2]
metro

'Metro access'

For the distance and metro information, we needed to select a different class, `sr_card_address_line`.

Within it, we needed two separate classes for distance and metro:

1. `sr_card_address_line__user_destination_address` : For Distance
2. `sr_card_address_line__item` : For Metro Information

Again we needed some splitting to get the relevant details.

### Function for Distance and Metro Info

A complete function for getting the distance and metro information.

Note: Not all hotels have information regarding metro services, so we added error handling.

![](https://i.imgur.com/y4gOeed.png)

In [26]:
def get_distance_info(section):
    """
    Function returns the distance and metro information
    """
    # For the centre distance
    center_distance = section.select(".sr_card_address_line__user_destination_address")[0].getText()
    distance = center_distance.split("\n")[1]

    # For Metro Access (missing for some hotels)
    try:
        metro_list = section.select(".sr_card_address_line__item")[1].getText()
        metro = metro_list.split("\n")[2]
    except IndexError:
        metro = 'No Metro Access'
    return distance, metro

In [27]:
p,q = get_distance_info(blocks[5])
p,q

('1.3 km from centre', 'Metro access')
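
Both values come back as text. A possible follow-up step, sketched below with the hypothetical helpers `parse_distance_km` and `has_metro_access`, converts them to a number and a boolean. Handling of distances given in metres (such as `'500 m from centre'`) is an assumption; only kilometre values appear in the output shown here.

```python
import re


def parse_distance_km(text):
    """Convert '4.6 km from centre' to 4.6; return None if no distance is found."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*(km|m)\b", text)
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    # Assumed format: values in metres are converted to kilometres.
    return value if unit == "km" else value / 1000


def has_metro_access(text):
    """Return True only for the 'Metro access' label."""
    return text.strip().lower() == "metro access"


print(parse_distance_km("1.3 km from centre"))  # 1.3
print(has_metro_access("Metro access"))         # True
print(has_metro_access("No Metro Access"))      # False
```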

## 2.7 Testing for the Hotel Description

In [28]:
section.select(".hotel_desc")[0]

<div class="hotel_desc">
Overlooking Delhi's Golf Club and situated in Central Delhi, newly renovated The Oberoi, New Delhi is 16 km from the Indira Gandhi International Airport. 
</div>

In [29]:
# Descrption Data 
data = section.select(".hotel_desc")[0].getText()

# Final Description Info
description = data.split("\n")[1].rstrip().lstrip()
description

"Overlooking Delhi's Golf Club and situated in Central Delhi, newly renovated The Oberoi, New Delhi is 16 km from the Indira Gandhi International Airport."

Here we needed to choose a different class, `hotel_desc`, which gave us the description of the hotel.

Again we needed some stripping and splitting to get the information.

### Function for Hotel Description

A complete function with error handling block to get the description details.

In [30]:
def get_description(section):
    """
    Function returns the hotel description, if any
    """
    try:
        data = section.select(".hotel_desc")[0].getText()
        description = data.split("\n")[1].strip()
    except IndexError:
        description = "Unavailable"

    return description

# 3. Final Stage

Now we have all the functions we need to get the relevant details.

We'll make a final script to run them and fetch the details.

## 3.1 For Saving the Data

We made a dictionary to save the details we need.

In [31]:
# Make Dictionary

def new_data_dict():
    """
    Function returns a new dictionary with one empty list per column
    """

    new_dict = {
        "Name": [],
        "Place": [],
        "Rating Point": [],
        "Rating Category": [],
        "Number of Reviews": [],
        "Distance from Centre": [],
        "Metro Access": [],
        "Description": [],
    }

    return new_dict
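
With a dictionary of lists, every list must end up with the same length, otherwise `pd.DataFrame` raises a `ValueError`. The sketch below shows a quick check that could be run before building the dataframe. The helpers `column_lengths` and `all_columns_aligned` and the example data are made up for illustration.

```python
def column_lengths(data):
    """Return the number of collected values for each column."""
    return {column: len(values) for column, values in data.items()}


def all_columns_aligned(data):
    """Return True when every column holds the same number of values."""
    return len(set(column_lengths(data).values())) <= 1


# Hypothetical example in which one append was missed.
example = {"Name": ["Hotel A", "Hotel B"], "Place": ["New Delhi"]}
print(column_lengths(example))       # {'Name': 2, 'Place': 1}
print(all_columns_aligned(example))  # False
```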

## 3.2 Final Setup For Scraping

In [32]:
# Setting up the URL and headers
url = "https://www.booking.com/searchresults.html?aid=304142&label=gen173nr-1FCAEoggI46AdIM1gEaGyIAQGYATG4ARnIAQ_YAQHoAQH4AQKIAgGoAgO4AsrBzIcGwAIB0gIkNDFiZWU4NjMtMDU1My00NjA4LTkxMmYtMGE2NzliNTRiNTNl2AIF4AIB&sid=55fa6423aa5026996217708ea6be131f&tmpl=searchresults&city=-2106102&class_interval=1&dest_id=-2106102&dest_type=city&dr_ps=IDR&from_idr=1&group_adults=2&group_children=0&ilp=1&label_click=undef&no_rooms=1&order=bayesian_review_score&raw_dest_type=city&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&srpvid=518283f1feb10049&ssb=empty&top_ufis=1&sig=v1Q1dbuDSb&rows=25&offset={}"
headers = {'User-Agent':'Mozilla/5.0 (platform; rv:geckoversion) Gecko/geckotrail Firefox/firefoxversion'}

# Creating an empty dictionary for storing the results
scrape_dict = new_data_dict()

# Looping through the offsets 0, 25, ..., 975 (40 pages of 25 results)
for num in range(0,976,25):
    
    # Request for the URL and make soup
    soup = get_page(url,headers,num)
    
    # Get the sections
    sections = soup.select(".sr_item.sr_item_new.sr_item_default.sr_property_block.sr_item_no_dates")
    
    # For all the blocks
    for section in sections:
        
        # Get and add the details to dictionary
        scrape_dict["Name"].append(get_hotel_name(section))
        scrape_dict["Place"].append(get_location(section))
        part1,part2,part3 = get_review_details(section)
        scrape_dict["Rating Point"].append(part1)
        scrape_dict["Rating Category"].append(part2)
        scrape_dict["Number of Reviews"].append(part3)
        info1,info2 = get_distance_info(section)
        scrape_dict["Distance from Centre"].append(info1)
        scrape_dict["Metro Access"].append(info2)
        scrape_dict["Description"].append(get_description(section))

## 3.3 Viewing and Saving the Data

In [33]:
# Saving the data to a dataframe
df = pd.DataFrame(scrape_dict)

In [34]:
# Check the length of the dataframe
len(df)

1000

In [37]:
df[:20]

| Unnamed: 0 | Name | Place | Rating Point | Rating Category | Number of Reviews | Distance from Centre | Metro Access | Description |
|---|---|---|---|---|---|---|---|---|
| 0 | The Oberoi New Delhi | New Delhi | 9.6 | Exceptional | 460 reviews | 4.6 km from centre | No Metro Access | Overlooking Delhi's Golf Club and situated in ... |
| 1 | Bungalow 99 | South Delhi, New Delhi | 9.7 | Exceptional | 216 reviews | 6.8 km from centre | No Metro Access | Situated in New Delhi, 4.2 km from Humayun's T... |
| 2 | The Leela Palace New Delhi | South Delhi, New Delhi | 9.3 | Superb | 1,240 reviews | 7 km from centre | No Metro Access | Located in New Delhi's Diplomatic Enclave, The... |
| 3 | Maidens Hotel New Delhi | North Delhi, New Delhi | 9.4 | Superb | 762 reviews | 4.4 km from centre | Metro access | Built in 1903, Maidens Hotel showcases 19th ce... |
| 4 | Tatvamasi Homestay | South Delhi, New Delhi | 9.8 | Exceptional | 118 reviews | 11.2 km from centre | No Metro Access | Tatvamasi Homestay is located in New Delhi. It... |
| 5 | The Imperial, New Delhi | Connaught Place, New Delhi | 9.3 | Superb | 791 reviews | 1.3 km from centre | Metro access | Located 1 km from New Delhi's City Centre and ... |
| 6 | Cp Villa | Central Delhi, New Delhi | 9.3 | Superb | 329 reviews | 1.5 km from centre | Metro access | Featuring free WiFi, Cp Villa is situated in N... |
| 7 | Tobo Stays | South Delhi, New Delhi | 9.2 | Superb | 376 reviews | 12.1 km from centre | No Metro Access | Situated in New Delhi and with Lotus Temple re... |
| 8 | Taj Palace, New Delhi | Chanakyapuri, New Delhi | 8.9 | Fabulous | 3,379 reviews | 6.7 km from centre | No Metro Access | Spread over six acres of lush gardens, offerin... |
| 9 | Staybook - ShivDev International New Delhi | Paharganj, New Delhi | 8.5 | Very good | 848 reviews | 1.3 km from centre | No Metro Access | Staybook - ShivDev International New Delhi pro... |


In [36]:
# To save the Data Frame

df.to_csv("hotel-list.csv")

# 4. Summary and Limitations

### Summary

* In this project we scraped the data of hotels in `New Delhi` from the `Booking.com` website.

* We got a list of 1000 hotels with details like name, location, reviews, metro access, distance from the city centre and description.


### Limitations

* The maximum number of listings we can see is 1000, so we were only able to scrape the data of 1000 out of around 1700 available hotels.

* Because the results are fetched page by page over many separate requests, the list may contain some duplicate hotels.

* Not all hotels are available all the time, so it's highly probable that we missed some.
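
The duplicate problem can be reduced after scraping. The sketch below removes repeated rows from a list of dictionaries, treating two rows as the same hotel when `Name` and `Place` match; that matching rule, the helper `drop_duplicate_rows` and the example rows are assumptions made for illustration.

```python
def drop_duplicate_rows(rows, key_fields=("Name", "Place")):
    """Keep the first occurrence of each row, compared on `key_fields`."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Hypothetical rows: the first hotel appears twice.
rows = [
    {"Name": "Hotel A", "Place": "New Delhi"},
    {"Name": "Hotel B", "Place": "South Delhi, New Delhi"},
    {"Name": "Hotel A", "Place": "New Delhi"},
]
unique = drop_duplicate_rows(rows)
print(len(rows), len(unique))  # 3 2
```

On the dataframe itself, `df.drop_duplicates(subset=["Name", "Place"])` applies the same idea.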

# 5. References and Future Work

### References to useful links:

* Pandas : https://www.w3schools.com/python/pandas/default.asp
* Beautiful Soup : https://realpython.com/beautiful-soup-web-scraper-python/
* lxml : https://lxml.de/elementsoup.html


### Ideas for future work:

* We can try to scrape further details related to each hotel, like its facilities and images.
* We can list multiple cities, so we can get the details for all of them in a single go.
* We can make our code better by adding error handling for cases where we don't know the number of pages.
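
For the last idea, one possible approach is to keep requesting pages until one comes back empty, instead of hard-coding 40 pages. The sketch below is only an outline: `scrape_until_empty` is a hypothetical helper, and `fetch_sections` stands for any function that takes an offset and returns the list of hotel blocks for that page. The demo uses a fake fetcher with 60 results, so it runs without network access.

```python
def scrape_until_empty(fetch_sections, rows=25, max_pages=40):
    """Collect sections page by page, stopping at the first empty page."""
    results = []
    for page in range(max_pages):
        sections = fetch_sections(page * rows)
        if not sections:
            break  # no more results: stop instead of requesting further pages
        results.extend(sections)
    return results


# Fake fetcher for the demo: pretends the site has 60 results in total.
FAKE_TOTAL = 60


def fake_fetch(offset):
    return list(range(offset, min(offset + 25, FAKE_TOTAL)))


print(len(scrape_until_empty(fake_fetch)))  # 60
```

The `max_pages` cap keeps the loop bounded even if a site never returns an empty page.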