# Project: planning my next holidays ☀️

Let's write a script that collects information about all the hotels in a given city on <a href="https://www.booking.com" target="_blank">www.booking.com</a> 🧙

**We strongly recommend that you use Scrapy, it will be much easier!**

You can scrape as much information as you like, but we suggest collecting at least:

* The hotel name, 
* The url to its booking.com page, 
* Its coordinates: latitude and longitude,
* The score given by the website users,
* The text description of the hotel.

Then run this script for several cities from yesterday's list. Save the results in a separate file for each city, and include the city name in the filename (for later use 😉).
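As a minimal sketch of the "one file per city" idea, the helper below turns a city name into an output path. The function name `city_to_filename`, the `res` directory and the `hotels_` prefix are assumptions made for illustration; adapt them to your own layout.

```python
import os


def city_to_filename(city, out_dir="res", prefix="hotels_"):
    """Build the output path for one city (hypothetical helper).

    Spaces are replaced with hyphens so the city name stays readable
    and safe to use inside a filename.
    """
    slug = city.strip().replace(" ", "-")
    return os.path.join(out_dir, prefix + slug + ".json")


# A few city names taken from the list used in this notebook
cities = ["Mont-Saint-Michel", "St Malo", "Aix en Provence"]
filenames = {city: city_to_filename(city) for city in cities}

for city, path in filenames.items():
    print(city, "->", path)
```

Keeping the city name in the filename means the city can be recovered later from the file alone, without opening it.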

In [33]:
pip install scrapy

Collecting scrapy
  Using cached Scrapy-2.5.1-py2.py3-none-any.whl (254 kB)
Collecting cssselect>=0.9.1
  Using cached cssselect-1.1.0-py2.py3-none-any.whl (16 kB)
Collecting h2<4.0,>=3.0
  Using cached h2-3.2.0-py2.py3-none-any.whl (65 kB)
Collecting service-identity>=16.0.0
  Using cached service_identity-21.1.0-py2.py3-none-any.whl (12 kB)
Collecting Twisted[http2]>=17.9.0
  Using cached Twisted-21.7.0-py3-none-any.whl (3.1 MB)
Collecting lxml>=3.5.0; platform_python_implementation == "CPython"
  Using cached lxml-4.6.3-cp38-cp38-manylinux2014_x86_64.whl (6.8 MB)
Collecting parsel>=1.5.0
  Using cached parsel-1.6.0-py2.py3-none-any.whl (13 kB)
Processing /home/jovyan/.cache/pip/wheels/d1/d7/61/11b5b370ee487d38b5408ecb7e0257db9107fa622412cbe2ff/PyDispatcher-2.0.5-py3-none-any.whl
Collecting queuelib>=1.4.2
  Using cached queuelib-1.6.2-py2.py3-none-any.whl (13 kB)
Collecting itemadapter>=0.1.0
  Using cached itemadapter-0.4.0-py3-none-any.whl (10 kB)
Collecting zope.interface>=4.1.3


In [34]:
villes = ['Mont-Saint-Michel',
 'St Malo',
 'Bayeux',
 'Le Havre',
 'Rouen',
 'Paris',
 'Amiens',
 'Lille',
 'Strasbourg',
 'Chateau du Haut Koenigsbourg',
 'Colmar',
 'Eguisheim',
 'Besancon',
 'Dijon',
 'Annecy',
 'Grenoble',
 'Lyon',
 'Bormes les Mimosas',
 'Cassis',
 'Marseille',
 'Aix en Provence',
 'Avignon',
 'Uzès',
 'Nîmes',
 'Aigues Mortes',
 'Saintes Maries de la mer',
 'Collioure',
 'Carcassonne',
 'Toulouse',
 'Montauban',
 'Biarritz',
 'Bayonne',
 'La Rochelle']

In [35]:
import os
import logging
import json
import scrapy
from scrapy.crawler import CrawlerProcess

In [36]:

class Hotels(scrapy.Spider):
    # Name of the spider
    name = "hotels"

    # Starting URL: the booking.com home page, which contains the search form
    start_urls = ['https://www.booking.com/index.fr.html']

    # Fill in and submit the search form with the destination
    def parse(self, response):
        # destination_name is a global variable that must be defined before the crawl starts
        return scrapy.FormRequest.from_response(
            response,
            formdata={'ss': destination_name},
            callback=self.after_search
        )

    # Callback run on each page of search results
    def after_search(self, response):

        # One .sr_item block per hotel in the results list
        hotels = response.css('.sr_item')

        for h in hotels:
            yield {
                'name': h.css('.sr-hotel__name::text').get(),
                'url': "https://www.booking.com" + h.css('.hotel_name_link').attrib["href"],
                'coords': h.css('.sr_card_address_line a').attrib["data-coords"],
                'score': h.css('.bui-review-score__badge::text').get(),
                'description': h.css('.hotel_desc::text').get()
            }

        # Look for the NEXT button; .attrib raises KeyError when there is none
        try:
            next_page = response.css('a.paging-next').attrib["href"]
        except KeyError:
            logging.info('No next page. Terminating crawling process.')
        else:
            yield response.follow(next_page, callback=self.after_search)

In [37]:
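# Assumption: the original cell raised
# "NameError: name 'destination_name' is not defined" because the variable was never set.
# Here we pick one city from the list; change the index to scrape another one.
destination_name = villes[0]

out_dir = '/home/jovyan/res/'
filename = "2_hotels_" + destination_name.replace(" ", "-") + ".json"

# Remove any previous export so the new results do not get appended to it
if filename in os.listdir(out_dir):
    os.remove(out_dir + filename)

process = CrawlerProcess(settings={
    'USER_AGENT': 'Chrome/84.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        out_dir + filename: {"format": "json"},
    }
})
process.crawl(Hotels)
process.start()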

<B>Creating the storage space for our scraped data</B>

In [38]:
pip install boto3

Note: you may need to restart the kernel to use updated packages.


In [39]:
import pandas as pd
import json
import requests
import boto3

<b>Importing all the cities into an S3 bucket</b>

In [3]:
import pandas as pd

In [6]:
path_to_file="/home/jovyan/Projet_kayak/res/hotels_"

In [7]:
data_frame=pd.DataFrame()

In [8]:
top_35_cities=["Mont-Saint-Michel", "St-Malo", "Bayeux", "Le-Havre", "Rouen", "Paris", "Amiens", "Lille", "Strasbourg",
"Chateau-du-Haut-Koenigsbourg","Colmar", "Eguisheim", "Besancon", "Dijon","Annecy", "Grenoble", "Lyon", "Gorges-du-Verdon",
"Bormes-les-Mimosas", "Cassis", "Marseille", "Aix-en-Provence", "Avignon", "Uzes", "Nimes", "Aigues-Mortes",
"Saintes-Maries-de-la-mer", "Collioure", "Carcassonne", "Ariege", "Toulouse", "Montauban", "Biarritz", "Bayonne",
"La-Rochelle"]

In [9]:
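# DataFrame.append was removed in pandas 2.0, so collect the per-city frames and concatenate them once
frames = []
for city in top_35_cities:
    data = pd.read_json(path_to_file + city + ".json")
    data["city"] = city
    frames.append(data)
data_frame = pd.concat(frames)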

In [16]:
data_frame.head()

| | name | url | coords | score | description | city |
|---|---|---|---|---|---|---|
| 0 | \nHôtel Vert\n | https://www.booking.com\n/hotel/fr/vert.fr.htm... | -1.50961697101593,48.6147004862904 | 81 | \nSitué à 2 km du Mont-Saint-Michel, sur la cô... | Mont Saint Michel |
| 1 | \nMercure Mont Saint Michel\n | https://www.booking.com\n/hotel/fr/mont-saint-... | -1.51054501533508,48.6142465295929 | 82 | \nInstallé dans des espaces verts à seulement ... | Mont Saint Michel |
| 2 | \nHotel De La Digue\n | https://www.booking.com\n/hotel/fr/de-la-digue... | -1.51091784238815,48.6168815494412 | 71 | \nL'hôtel De La Digue est un établissement tra... | Mont Saint Michel |
| 3 | \nLe Saint Aubert\n | https://www.booking.com\n/hotel/fr/hotel-saint... | -1.51010513305664,48.6129378347065 | 73 | \nNiché dans un écrin de verdure, à seulement ... | Mont Saint Michel |
| 4 | \nLes Terrasses Poulard\n | https://www.booking.com\n/hotel/fr/les-terrass... | -1.51037871837616,48.6353494256412 | 73 | \nOccupant 2 bâtiments différents au cœur du M... | Mont Saint Michel |
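
The preview shows that the scraped values still carry stray `\n` characters and that `coords` is a single string. Judging from the Mont-Saint-Michel rows, the order appears to be longitude first, then latitude. Here is a minimal sketch of how one record could be tidied. The helper name `clean_hotel` is hypothetical, and the URL and description in the sample are made-up placeholders, not scraped values.

```python
def clean_hotel(record):
    """Return a tidied copy of one scraped hotel record (hypothetical helper)."""
    # Assumption: coords is "longitude,latitude", as the preview rows suggest
    lon_text, lat_text = record["coords"].split(",")
    return {
        "name": record["name"].strip(),
        # Remove every whitespace character, including the "\n" inside the URL
        "url": "".join(record["url"].split()),
        "latitude": float(lat_text),
        "longitude": float(lon_text),
        "score": record["score"],
        "description": record["description"].strip(),
    }


# Sample shaped like the rows above; url and description are placeholders
sample = {
    "name": "\nHôtel Vert\n",
    "url": "https://www.booking.com\n/hotel/fr/example.html",
    "coords": "-1.50961697101593,48.6147004862904",
    "score": 81,
    "description": "\nExample description\n",
}

cleaned = clean_hotel(sample)
print(cleaned)
```

Splitting the coordinates into two numeric fields makes the data easier to plot on a map later.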


In [11]:
cities_=["Mont-Saint-Michel", "St-Malo", "Le-Havre","Chateau-du-Haut-Koenigsbourg","Gorges-du-Verdon",
"Bormes-les-Mimosas", "Aix-en-Provence", "Aigues-Mortes","Saintes-Maries-de-la-mer","La-Rochelle"]
cities=["Mont Saint Michel", "St Malo", "Le Havre","Chateau du Haut Koenigsbourg","Gorges du Verdon",
"Bormes les Mimosas", "Aix en Provence", "Aigues Mortes","Saintes Maries de la mer","La Rochelle"]

for (city_, city) in zip(cities_, cities):
    data_frame['city'] = data_frame['city'].replace(city_,city, regex=True)


In [14]:
data_frame.head()

| | name | url | coords | score | description | city |
|---|---|---|---|---|---|---|
| 0 | \nHôtel Vert\n | https://www.booking.com\n/hotel/fr/vert.fr.htm... | -1.50961697101593,48.6147004862904 | 81 | \nSitué à 2 km du Mont-Saint-Michel, sur la cô... | Mont Saint Michel |
| 1 | \nMercure Mont Saint Michel\n | https://www.booking.com\n/hotel/fr/mont-saint-... | -1.51054501533508,48.6142465295929 | 82 | \nInstallé dans des espaces verts à seulement ... | Mont Saint Michel |
| 2 | \nHotel De La Digue\n | https://www.booking.com\n/hotel/fr/de-la-digue... | -1.51091784238815,48.6168815494412 | 71 | \nL'hôtel De La Digue est un établissement tra... | Mont Saint Michel |
| 3 | \nLe Saint Aubert\n | https://www.booking.com\n/hotel/fr/hotel-saint... | -1.51010513305664,48.6129378347065 | 73 | \nNiché dans un écrin de verdure, à seulement ... | Mont Saint Michel |
| 4 | \nLes Terrasses Poulard\n | https://www.booking.com\n/hotel/fr/les-terrass... | -1.51037871837616,48.6353494256412 | 73 | \nOccupant 2 bâtiments différents au cœur du M... | Mont Saint Michel |


In [15]:
data_frame.to_csv("hotel.csv")

In [18]:
csv = data_frame.to_csv()

In [26]:
session = boto3.Session(aws_access_key_id="", aws_secret_access_key="")

In [27]:
s3 = session.resource("s3")

In [28]:
bucket_name = s3.create_bucket(Bucket="dataset-projet-kayak-villes-meteo")

In [29]:
bucket_name.put_object(Key='hotel.csv', Body= csv)

s3.Object(bucket_name='dataset-projet-kayak-villes-meteo', key='hotel.csv')