## Hotel location: scraping booking.com for location of hotels in top 35 cities to visit in France  

In this notebook I:
- Scrape the latitude and longitude of the top 100 hotels found in the [top 35 best cities in France](https://one-week-in.com/35-cities-to-visit-in-france/)
- Retrieve each hotel's latitude and longitude from [booking.com](https://www.booking.com), using the link to the hotel's map

For the scraping of the other hotel information, see the notebook 02-booking_scrap1.ipynb.

## Import libraries

In [1]:
import pandas as pd
import os
import logging

from bs4 import BeautifulSoup

import scrapy
from scrapy.crawler import CrawlerProcess

from src.booking_scrap2 import *

## Scrap2  

Use the map link retrieved from booking.com for each hotel to obtain the hotel's latitude and longitude.
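The extraction itself lives in `src/booking_scrap2.py` and is not shown in this notebook. As a rough sketch of the idea only, the snippet below pulls a `lat,lon` pair out of an HTML attribute using the standard library. The attribute name `data-atlas-latlng`, the helper `extract_latlon` and the sample HTML are assumptions made for illustration, not the spider's actual code.

```python
from html.parser import HTMLParser


class LatLonParser(HTMLParser):
    """Collect the first `data-atlas-latlng` attribute found in a page.

    The attribute name is an assumption about the page markup.
    """

    def __init__(self):
        super().__init__()
        self.latlon = None

    def handle_starttag(self, tag, attrs):
        if self.latlon is not None:
            return  # keep the first match only
        value = dict(attrs).get("data-atlas-latlng")
        if value:
            lat, lon = value.split(",")
            self.latlon = (float(lat), float(lon))


def extract_latlon(html):
    """Return (lat, lon) as floats, or None when no coordinates are found."""
    parser = LatLonParser()
    parser.feed(html)
    return parser.latlon


# Made-up page fragment, for illustration only
sample_html = '<a id="map_link" data-atlas-latlng="48.650225,-2.024735">Show map</a>'
print(extract_latlon(sample_html))
```

In a Scrapy spider the same attribute would more likely be read with a CSS or XPath selector on the response; the parser above only keeps the sketch free of third-party dependencies.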

In [2]:
from src.booking_scrap1 import hotels_links_ls

In [3]:
# Name of the file where the results will be saved
filename = "scrap2_hotels_topcities_booking.csv"


# If the file already exists, delete it before crawling (otherwise Scrapy appends the new results to the old ones)
if filename in os.listdir('data/processed/'):
    os.remove('data/processed/' + filename)


process = CrawlerProcess(settings = {
    'USER_AGENT': 'Chrome/97.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        'data/processed/' + filename: {"format": "csv"},
    }
})

# Start crawling with the spider imported from src.booking_scrap2
process.crawl(HotelLatLonSpider)
process.start()

2022-04-17 15:59:35 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-04-17 15:59:35 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.9.7 (default, Sep 16 2021, 08:50:36) - [Clang 10.0.0 ], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform macOS-10.16-x86_64-i386-64bit
2022-04-17 15:59:35 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 20,
 'USER_AGENT': 'Chrome/97.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2022-04-17 15:59:35 [scrapy.extensions.telnet] INFO: Telnet Password: 68338651a9de89ae
2022-04-17 15:59:35 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-04-17 15:59:35 [scrapy.middleware] INFO: Enabled downloader middlewares:
['
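As an alternative to deleting the output file by hand before each crawl, Scrapy 2.4 and later accept an `overwrite` key in each feed's options. The settings below are a sketch of that variant, reusing the file name from this notebook. They are not the configuration that produced the log above.

```python
import logging

filename = "scrap2_hotels_topcities_booking.csv"

# Same settings as in the crawl cell, plus "overwrite" so that each run
# replaces the CSV instead of appending to it (requires Scrapy >= 2.4).
settings = {
    "USER_AGENT": "Chrome/97.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "LOG_LEVEL": logging.INFO,
    "FEEDS": {
        "data/processed/" + filename: {
            "format": "csv",
            "overwrite": True,
        },
    },
}
```

This dictionary would be passed as `CrawlerProcess(settings=settings)`, which makes the `os.remove` step unnecessary.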

### Scraping output

In [19]:
booking_df_loc = pd.read_csv('data/processed/scrap2_hotels_topcities_booking.csv')
print(booking_df_loc.shape)
booking_df_loc.head()

(2942, 3)


,hotel_name,hotel_lat,hotel_lon
0,Corne de cerf,48.650225,-2.024735
1,LES AMARRES - Maison familiale 3 chambres - Ja...,48.647683,-1.995204
2,Antinéa,48.655543,-2.005139
3,Le RUELLAN charmant duplex proche plage,48.656487,-1.982275
4,Mercure St Malo Front de Mer,48.652567,-2.014242
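Before merging, it can be worth checking the scraped coordinates for duplicated rows and for values outside the valid latitude and longitude ranges. The sketch below runs on a small made-up frame with the same column names as `booking_df_loc`; the rows are invented for illustration.

```python
import pandas as pd

# Made-up rows for illustration; the last one has an impossible latitude
df = pd.DataFrame({
    "hotel_name": ["A", "B", "B", "C"],
    "hotel_lat": [48.65, 48.64, 48.64, 148.0],
    "hotel_lon": [-2.02, -1.99, -1.99, 2.35],
})

# Rows that are exact copies of an earlier row
n_duplicates = int(df.duplicated().sum())

# Latitude must lie in [-90, 90] and longitude in [-180, 180]
valid = df["hotel_lat"].between(-90, 90) & df["hotel_lon"].between(-180, 180)
n_invalid = int((~valid).sum())

print(n_duplicates, n_invalid)
```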


## Merge Scrap1 and Scrap2

In [6]:
# Import scrap1 result
booking_df = pd.read_csv('data/processed/scrap1_hotels_topcities_booking-clean.csv')
booking_df.columns

Index(['city', 'suburbs', 'hotel_name', 'link', 'rating', 'room_type', 'price',
       'stay', 'guests', 'room', 'description', 'location', 'map_link'],
      dtype='object')

In [16]:
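# Check hotel names match.
# ndarray.sort() sorts in place and returns None, so comparing the return values
# of .sort() is always True; compare the sets of names instead.
# (The output below was recorded with the original .sort() comparison.)
names_hot = set(booking_df['hotel_name'])
names_loc = set(booking_df_loc['hotel_name'])
names_hot == names_loc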

True
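A single boolean does not say which names differ when the two tables disagree. The sketch below lists the names present in only one of the two tables, using set differences. The two series are made-up stand-ins for `booking_df['hotel_name']` and `booking_df_loc['hotel_name']`.

```python
import pandas as pd

# Made-up names standing in for the two hotel_name columns
names_hot = pd.Series(["Antinéa", "Corne de cerf", "Antinéa"])
names_loc = pd.Series(["Corne de cerf", "Antinéa", "Hôtel X"])

# Names present in one table but not in the other
only_in_hot = sorted(set(names_hot) - set(names_loc))
only_in_loc = sorted(set(names_loc) - set(names_hot))

print(only_in_hot, only_in_loc)
```

Two empty lists mean that every hotel name appears in both tables.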

In [12]:
# Merge dfs
booking_df_s3 = booking_df.merge(booking_df_loc, on=['hotel_name'])
print(booking_df_s3.shape)
print(booking_df_s3.columns)
booking_df_s3.head()

(3466, 15)
Index(['city', 'suburbs', 'hotel_name', 'link', 'rating', 'room_type', 'price',
       'stay', 'guests', 'room', 'description', 'location', 'map_link',
       'hotel_lat', 'hotel_lon'],
      dtype='object')


,city,suburbs,hotel_name,link,rating,room_type,price,stay,guests,room,description,location,map_link,hotel_lat,hotel_lon
0,Saint Malo,"Sillon, Saint Malo",Antinéa,https://www.booking.com/hotel/fr/antinea.en-gb...,8.3,Family Room (2 Adults + 2 Children),"€ 1,667",6 nights,2 adults,98 reviews,Free cancellation,1.6 km from centre,https://www.booking.com/hotel/fr/antinea.en-gb...,48.655543,-2.005139
1,Saint Malo,"Parame, Saint Malo",Le RUELLAN charmant duplex proche plage,https://www.booking.com/hotel/fr/le-ruellan-ch...,8.3,Apartment,€ 539,6 nights,2 adults,Managed by a private host,Free cancellation,3.3 km from centre,https://www.booking.com/hotel/fr/le-ruellan-ch...,48.656487,-1.982275
2,Saint Malo,"Parame, Saint Malo",Le RUELLAN charmant duplex proche plage,https://www.booking.com/hotel/fr/le-ruellan-ch...,8.3,Apartment,€ 539,6 nights,2 adults,Managed by a private host,Free cancellation,3.3 km from centre,https://www.booking.com/hotel/fr/le-ruellan-ch...,48.656487,-1.982275
3,Saint Malo,"Parame, Saint Malo",Le RUELLAN charmant duplex proche plage,https://www.booking.com/hotel/fr/le-ruellan-ch...,8.3,Apartment,€ 555,6 nights,2 adults,Managed by a private host,Free cancellation,3.3 km from centre,https://www.booking.com/hotel/fr/le-ruellan-ch...,48.656487,-1.982275
4,Saint Malo,"Parame, Saint Malo",Le RUELLAN charmant duplex proche plage,https://www.booking.com/hotel/fr/le-ruellan-ch...,8.3,Apartment,€ 555,6 nights,2 adults,Managed by a private host,Free cancellation,3.3 km from centre,https://www.booking.com/hotel/fr/le-ruellan-ch...,48.656487,-1.982275
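The preview above shows identical rows repeated, which is what a merge on `hotel_name` produces when a name appears more than once in both tables. A possible way to avoid this is to keep one location row per hotel before merging. The sketch below uses made-up data and assumes that a hotel name identifies a single hotel, which may not hold if two cities have hotels with the same name.

```python
import pandas as pd

# Made-up data: hotel "A" has two offers and was scraped twice for its location
hotels = pd.DataFrame({
    "hotel_name": ["A", "A", "B"],
    "price": [539, 555, 120],
})
locations = pd.DataFrame({
    "hotel_name": ["A", "A", "B"],
    "hotel_lat": [48.656487, 48.656487, 48.655543],
    "hotel_lon": [-1.982275, -1.982275, -2.005139],
})

# Merging as-is pairs every "A" offer with every "A" location: 2 * 2 + 1 = 5 rows
naive = hotels.merge(locations, on="hotel_name")

# Keep one location per hotel name, then let pandas verify the right side is unique
locations_unique = locations.drop_duplicates(subset="hotel_name")
merged = hotels.merge(locations_unique, on="hotel_name", validate="many_to_one")

print(len(naive), len(merged))
```

With `validate="many_to_one"`, pandas raises a `MergeError` if the location table still contains a repeated name, so the row count cannot grow silently.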


In [13]:
# Save the merged dataframe, to be uploaded to the S3 bucket
booking_df_s3.to_csv('results/scrap_booking_s3.csv', index=False)