## Hotel info: scraping booking.com for top 35 cities to visit in France  

In this notebook I:  
-  Retrieve information for the top 100 hotels in each of the [top 35 best cities in France](https://one-week-in.com/35-cities-to-visit-in-france/)
-  Scrape the hotel information from [booking.com](https://www.booking.com)

## Import libraries

In [1]:
import pandas as pd
import numpy as np
import os
import logging

import scrapy
from scrapy.crawler import CrawlerProcess

from src.booking_scrap1 import * # import spiders and list of top 35 cities

## Top 35 cities  
Check the list of city names written without French accents


In [2]:
# cities_meta_df is imported in src.booking_scrap1 from src.cities_weather
top_cities_weather_ls = cities_meta_df['cities_clean'].tolist()

print(len(top_cities_weather_ls))
top_cities_weather_ls[:5]

35


['Mont Saint Michel', 'St Malo', 'Bayeux', 'Le Havre', 'Rouen']
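
For reference, accent-free names like these can be produced with the standard library alone. The sketch below is an assumption about one possible approach, not the code used in `src.cities_weather`; `strip_accents` is a hypothetical helper name.

```python
import unicodedata


def strip_accents(name):
    """Return `name` with diacritics removed (e.g. 'ç' becomes 'c')."""
    # NFKD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# City names taken from the scraping output further down in this notebook
accented = ["Besançon", "Nîmes", "Uzès", "Ariège"]
cleaned = [strip_accents(city) for city in accented]
print(cleaned)
```

Names without accents, such as `Rouen`, pass through unchanged.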

## Scrape hotel information  

Carry out two rounds of scraping to obtain the following information for each hotel found in each city:  

*Scrap1*
*   **hotel_name** : hotel name  
*   **suburbs** : city suburb in which the hotel is located, if available  
*   **link** : url to its booking.com page  
*   **rating** : score given by the website users  
*   **room_type** : description of available room  
*   **price** : price found for the indicated room and length of stay  
*   **stay** : length of stay chosen in search  
*   **guests** : number of guests (adults) chosen in search  
*   **description** : text description of the hotel     
*   **location** : distance from the hotel to the city centre  
*   **map_link** : url to its location on a map  


*Scrap2*  
* hotel name  
* hotel latitude and longitude

### Scrap1  
Following an update of booking.com, the pagination link could no longer be found.  
Solution: create 4 spiders to scrape the top 100 hotels (booking.com default ranking)
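
One way to cover the top 100 results without a pagination link is to request each results page directly. The sketch below is an illustration under assumptions, not the code in `src.booking_scrap1`: it assumes 25 results per page (100 hotels split across 4 spiders) and an `offset` query parameter. `BASE_URL`, `build_page_urls` and the parameter names `ss` and `offset` are hypothetical.

```python
from urllib.parse import urlencode

# Assumed values: 4 pages of 25 hotels each (4 spiders, top 100 hotels).
RESULTS_PER_PAGE = 25
N_PAGES = 4
# Hypothetical search endpoint, used for illustration only.
BASE_URL = "https://www.booking.com/searchresults.html"


def build_page_urls(city, n_pages=N_PAGES, per_page=RESULTS_PER_PAGE):
    """Return one search URL per results page for the given city."""
    urls = []
    for page in range(n_pages):
        # 'ss' (search string) and 'offset' are assumed parameter names.
        query = urlencode({"ss": city, "offset": page * per_page})
        urls.append(f"{BASE_URL}?{query}")
    return urls


page_urls = build_page_urls("Rouen")
for url in page_urls:
    print(url)
```

Each spider could then take one of these URLs as its start URL, so that the four spiders together cover ranks 1 to 100.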

In [3]:
# Name of the file where the results will be saved
filename = "scrap1_hotels_topcities_booking.csv"

# If the file already exists, delete it before crawling (otherwise Scrapy appends the new results to the old ones)
if filename in os.listdir('data/interim/'):
    os.remove('data/interim/' + filename)


process = CrawlerProcess(settings = {
    'USER_AGENT': 'Chrome/97.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        'data/interim/' + filename: {"format": "csv"},
    }
})

# Start crawling with the four spiders imported from src.booking_scrap1
process.crawl(HotelBookingSpider)
process.crawl(HotelBookingSpiderp2)
process.crawl(HotelBookingSpiderp3)
process.crawl(HotelBookingSpiderp4)
process.start()

2022-04-17 14:33:28 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-04-17 14:33:28 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.9.7 (default, Sep 16 2021, 08:50:36) - [Clang 10.0.0 ], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
2022-04-17 14:33:28 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20,
 'USER_AGENT': 'Chrome/97.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2022-04-17 14:33:28 [scrapy.extensions.telnet] INFO: Telnet Password: 3cd59203f5bf5a52
2022-04-17 14:33:28 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-04-17 14:33:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['
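
Because all four spiders append to the same CSV file, the feed can end up with repeated header rows. A possible alternative, sketched here and not what this notebook ran, is to give each spider its own feed file through the `%(name)s` placeholder that Scrapy substitutes with the spider name, and to let the `overwrite` feed option (Scrapy 2.4 and later) replace files from earlier runs. The file name pattern is hypothetical.

```python
# Hypothetical file name pattern: Scrapy replaces %(name)s with each spider's name,
# so every spider writes to its own CSV file.
feed_uri = "data/interim/%(name)s_hotels_booking.csv"

settings = {
    "USER_AGENT": "Chrome/97.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "FEEDS": {
        feed_uri: {
            "format": "csv",
            # Replace the file from a previous run instead of appending to it
            "overwrite": True,
        },
    },
}
```

These settings would be passed as `CrawlerProcess(settings=settings)`; the per-spider files then need to be concatenated before analysis.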

### Scraping output

In [4]:
booking_df = pd.read_csv('data/interim/scrap1_hotels_topcities_booking.csv')
print(booking_df.shape)
booking_df.head()

(3054, 13)


Unnamed: 0,city,suburbs,hotel_name,link,rating,room_type,price,stay,guests,room,description,location,map_link
0,Saint Malo,"Sillon, Saint Malo",Antinéa,https://www.booking.com/hotel/fr/antinea.en-gb...,8.3,Family Room (2 Adults + 2 Children),"€ 1,667",6 nights,2 adults,98 reviews,Free cancellation,1.6 km from centre,https://www.booking.com/hotel/fr/antinea.en-gb...
1,Saint Malo,"Parame, Saint Malo",Le RUELLAN charmant duplex proche plage,https://www.booking.com/hotel/fr/le-ruellan-ch...,8.3,Apartment,€ 539,6 nights,2 adults,Managed by a private host,Free cancellation,3.3 km from centre,https://www.booking.com/hotel/fr/le-ruellan-ch...
2,Saint Malo,"Saint-Servan, Saint Malo",Résidence Pierre & Vacances Ty Mat,https://www.booking.com/hotel/fr/maevatymat.en...,8.0,Standard Studio with sleeping Alcove and Balco...,€ 480,6 nights,2 adults,458 reviews,,1.5 km from centre,https://www.booking.com/hotel/fr/maevatymat.en...
3,Saint Malo,Saint Malo,"The Originals City, Hôtel Belem, Saint-Malo",https://www.booking.com/hotel/fr/belem.en-gb.h...,7.5,Comfort Double Room,€ 534,6 nights,2 adults,"1,056 reviews",,4.9 km from centre,https://www.booking.com/hotel/fr/belem.en-gb.h...
4,Saint Malo,"Sillon, Saint Malo",Mercure St Malo Front de Mer,https://www.booking.com/hotel/fr/st-malo-front...,7.9,Classic Double Room with Side Sea View,"€ 1,229",6 nights,2 adults,601 reviews,FREE cancellation • No prepayment needed,0.9 km from centre,https://www.booking.com/hotel/fr/st-malo-front...


In [5]:
# Check for unique city values: 
print(booking_df.city.unique().shape)
print(booking_df.city.unique())


(36,)
['Saint Malo' 'city' 'Le Havre' 'Le Mont Saint Michel' 'Rouen' 'Amiens'
 'Bayeux' 'Lille' 'Paris' 'Besançon' 'Strasbourg'
 '0 properties are available in and around this destination' 'Eguisheim'
 'Dijon' 'Colmar' 'Grenoble' 'Annecy' 'La Rochelle' 'Biarritz' 'Ariège'
 'Toulouse' 'Carcassonne' 'Collioure' 'Bayonne' 'Montauban' 'Nîmes'
 'Bormes-les-Mimosas' 'Aix-en-Provence' 'Cassis' 'Avignon' 'Marseille'
 'Saintes-Maries-de-la-Mer' 'Aigues-Mortes' 'Gorges du Verdon' 'Lyon'
 'Uzès']


In [6]:
# Drop the repeated CSV header rows (city == 'city') left by the concatenated feed output,
# and the placeholder rows for searches that returned no properties
invalid_city_values = ['city', '0 properties are available in and around this destination']

booking_df_clean = booking_df[~booking_df.city.isin(invalid_city_values)].copy()
booking_df_clean.shape

(2951, 13)
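
The same filtering can also be applied while reading the file, before the data reaches pandas. The sketch below uses only the standard library and a small made-up sample shaped like the scraped CSV, reduced to two columns; `read_clean_rows` is a hypothetical helper.

```python
import csv
import io

# Values of the 'city' column that mark rows to discard
INVALID_CITY_VALUES = {
    "city",  # repeated header row
    "0 properties are available in and around this destination",
}

# Made-up sample imitating the concatenated feed output
sample_csv = """city,hotel_name
Saint Malo,Antinéa
city,hotel_name
Saint Malo,Mercure St Malo Front de Mer
0 properties are available in and around this destination,
"""


def read_clean_rows(handle):
    """Read CSV rows as dicts, skipping repeated headers and empty-result rows."""
    reader = csv.DictReader(handle)
    return [row for row in reader if row["city"] not in INVALID_CITY_VALUES]


clean_rows = read_clean_rows(io.StringIO(sample_csv))
print(len(clean_rows))
```

With a real file, `io.StringIO(sample_csv)` would be replaced by `open(path, newline='', encoding='utf-8')`.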

In [7]:
# Check for unique city values: 
print(booking_df_clean.city.unique().shape)
print(booking_df_clean.city.unique())

(34,)
['Saint Malo' 'Le Havre' 'Le Mont Saint Michel' 'Rouen' 'Amiens' 'Bayeux'
 'Lille' 'Paris' 'Besançon' 'Strasbourg' 'Eguisheim' 'Dijon' 'Colmar'
 'Grenoble' 'Annecy' 'La Rochelle' 'Biarritz' 'Ariège' 'Toulouse'
 'Carcassonne' 'Collioure' 'Bayonne' 'Montauban' 'Nîmes'
 'Bormes-les-Mimosas' 'Aix-en-Provence' 'Cassis' 'Avignon' 'Marseille'
 'Saintes-Maries-de-la-Mer' 'Aigues-Mortes' 'Gorges du Verdon' 'Lyon'
 'Uzès']


In [8]:
booking_df_clean.to_csv('data/processed/scrap1_hotels_topcities_booking-clean.csv', index = False)

## Scrap2  

See notebook `02_booking_scrap2.ipynb` for the scraping of hotel latitude and longitude.