![Kayak](https://seekvectorlogo.com/wp-content/uploads/2018/01/kayak-vector-logo.png)

# Plan your trip with Kayak 

## Company's description 📇

<a href="https://www.kayak.com" target="_blank">Kayak</a> is a travel search engine that helps users plan their next trip at the best price.

The company was founded in 2004 by Steve Hafner & Paul M. English. After a few rounds of fundraising, Kayak was acquired by <a href="https://www.bookingholdings.com/" target="_blank">Booking Holdings</a> which now holds: 

* <a href="https://booking.com/" target="_blank">Booking.com</a>
* <a href="https://kayak.com/" target="_blank">Kayak</a>
* <a href="https://www.priceline.com/" target="_blank">Priceline</a>
* <a href="https://www.agoda.com/" target="_blank">Agoda</a>
* <a href="https://Rentalcars.com/" target="_blank">RentalCars</a>
* <a href="https://www.opentable.com/" target="_blank">OpenTable</a>

With over \$300 million in annual revenue, Kayak operates in almost every country and language to help its users book travel across the globe. 

## Project 🚧

The marketing team needs help on a new project. After doing some user research, the team discovered that **70% of their users who are planning a trip would like to have more information about the destination they are going to**. 

In addition, user research shows that **people tend to distrust the information they read if they don't know the brand** which produced the content. 

Therefore, the Kayak marketing team would like to create an application that recommends where people should plan their next holiday. The application should be based on real data about:

* Weather 
* Hotels in the area 

The application should then be able to recommend the best destinations and hotels based on the above variables at any given time. 

## Goals 🎯

As the project has just started, your team doesn't have any data that can be used to create this application. Therefore, your job will be to: 

* Scrape data from destinations 
* Get weather data from each destination 
* Get hotels' info about each destination
* Store all the information above in a data lake
* Extract, transform and load cleaned data from your data lake to a data warehouse

## Scope of this project 🖼️

The marketing team wants to focus first on the best cities to travel to in France. According to <a href="https://one-week-in.com/35-cities-to-visit-in-france/" target="_blank">One Week In.com</a>, here are the top 35 cities to visit in France: 

```python 
["Mont Saint Michel",
"St Malo",
"Bayeux",
"Le Havre",
"Rouen",
"Paris",
"Amiens",
"Lille",
"Strasbourg",
"Chateau du Haut Koenigsbourg",
"Colmar",
"Eguisheim",
"Besancon",
"Dijon",
"Annecy",
"Grenoble",
"Lyon",
"Gorges du Verdon",
"Bormes les Mimosas",
"Cassis",
"Marseille",
"Aix en Provence",
"Avignon",
"Uzes",
"Nimes",
"Aigues Mortes",
"Saintes Maries de la mer",
"Collioure",
"Carcassonne",
"Ariege",
"Toulouse",
"Montauban",
"Biarritz",
"Bayonne",
"La Rochelle"]
```

Your team should focus **only on the above cities for your project**. 


## Helpers 🦮

To help you complete this project, here are a few tips:

### Get weather data with an API 

*   Use https://nominatim.org/ to get the GPS coordinates of all the cities (no subscription required). Documentation: https://nominatim.org/release-docs/develop/api/Search/

*   Use https://openweathermap.org/appid (you have to subscribe to get a free API key) and https://openweathermap.org/api/one-call-api to get weather information for the 35 cities, and put it in a DataFrame.

*   Determine the list of cities where the weather will be the nicest within the next 7 days. For example, you can use the values of `daily.pop` and `daily.rain` to compute the expected volume of rain within the next 7 days. This is only an example: you may have a different opinion of what nice weather looks like 😎 Maybe the most important criterion for you is temperature or humidity, so feel free to change the rules!

*   Save all the results in a `.csv` file; you will use it later 😉 You can save any information that seems important to you. Don't forget to save the names of the cities, and to create a column containing a unique identifier (id) for each city (this is important for the next steps of the project).

*   Use plotly to display the best destinations on a map.
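
As one way to turn a forecast into a ranking, the sketch below sums `pop * rain` over the daily entries of a One Call style response. It assumes each daily entry carries `pop` (a probability of precipitation between 0 and 1) and an optional `rain` volume in millimetres. The helper names and the sample payloads are made up for illustration, and this weighting is only one possible definition of nice weather.

```python
def expected_rain_mm(daily, days=7):
    """Sum probability-weighted rain volume over the first `days` forecast entries."""
    total = 0.0
    for day in daily[:days]:
        # `rain` is assumed to be absent on dry days, so default to 0 mm.
        total += day.get("pop", 0.0) * day.get("rain", 0.0)
    return total


def rank_cities(forecasts):
    """Return city names sorted from the driest to the wettest expected week."""
    return sorted(forecasts, key=lambda city: expected_rain_mm(forecasts[city]))


# Hypothetical payloads, shaped like the `daily` array of a forecast response.
sample = {
    "Lille": [{"pop": 0.8, "rain": 5.0}, {"pop": 0.5, "rain": 2.0}],
    "Cassis": [{"pop": 0.1, "rain": 1.0}, {"pop": 0.0}],
}
print(rank_cities(sample))  # ['Cassis', 'Lille']
```

Swapping the key function for one based on temperature or humidity changes the ranking rule without touching the rest.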

### Scrape Booking.com 

Since Booking Holdings doesn't have aggregated databases, it will be much faster to scrape data directly from booking.com.

You can scrape as much information as you want, but we suggest that you get at least:

*   Hotel name
*   URL of its booking.com page
*   Its coordinates: latitude and longitude
*   Score given by the website users
*   Text description of the hotel
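
Coordinates are not listed on the search results page. The notebook further down reads them from a `data-atlas-latlng` attribute on each hotel page. A minimal sketch of that extraction with a regular expression follows; the function name and the HTML snippet are illustrative, and the real markup may differ or change over time.

```python
import re

# Matches the attribute that the notebook's spider looks for on a hotel page.
LATLNG_RE = re.compile(r'data-atlas-latlng="(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)"')


def extract_latlng(html):
    """Return (latitude, longitude) as floats, or None when the attribute is missing."""
    match = LATLNG_RE.search(html)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


# Hypothetical snippet, not copied from a real page.
snippet = '<a id="map" data-atlas-latlng="48.614700,-1.509617">Map</a>'
print(extract_latlng(snippet))          # (48.6147, -1.509617)
print(extract_latlng("<p>no map</p>"))  # None
```

Returning `None` instead of raising lets the caller decide whether to skip the hotel or retry the page.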


### Create your data lake using S3 

Once you have built your dataset, store it in S3 as a CSV file. 
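
A minimal sketch of the upload step, assuming the `boto3` library is installed and AWS credentials are available in the environment. The bucket name and object key are placeholders, not names from this project. The client is passed in as an argument so the function can be exercised without network access.

```python
def upload_csv(s3_client, local_path, bucket, key):
    """Upload one local file to S3 and return its s3:// URI."""
    # `upload_file(Filename, Bucket, Key)` is a method of the boto3 S3 client.
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"


def main():
    import boto3  # third-party; imported here so the helper above has no hard dependency

    client = boto3.client("s3")
    # Placeholder bucket and key: replace them with your own.
    uri = upload_csv(client, "3_Meteo_Hotel_Final.csv",
                     "my-kayak-bucket", "data/3_Meteo_Hotel_Final.csv")
    print(uri)
```

Calling `main()` performs the real upload once the bucket exists and credentials are configured.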

### ETL 

Once you have uploaded your data to S3, it will be easier for the next data analysis team to extract clean data directly from a data warehouse. Therefore, create a SQL database using AWS RDS, extract your data from S3 and store it in your newly created database. 
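
The load step can be sketched locally before pointing it at RDS. The example below uses the standard library `sqlite3` module as a stand-in for the RDS database, which is an assumption made so the sketch runs anywhere. The table layout and function name are illustrative, and the two rows are copied from the notebook output shown later in this document.

```python
import csv
import io
import sqlite3


def load_hotels(conn, csv_text):
    """Create a `hotels` table and insert the cleaned rows found in `csv_text`."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hotels ("
        "ville TEXT, hotel TEXT, note REAL, GPS_lat REAL, GPS_lng REAL)"
    )
    rows = [
        (r["ville"], r["hotel"], float(r["note"]),
         float(r["GPS_lat"]), float(r["GPS_lng"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany("INSERT INTO hotels VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Stand-in for the CSV content downloaded from S3.
sample_csv = (
    "ville,hotel,note,GPS_lat,GPS_lng\n"
    "Le Mont Saint Michel,Hôtel Vert,8.1,48.614700,-1.509617\n"
    "Grenoble,Montagnes russes,8.2,45.185315,5.718978\n"
)
conn = sqlite3.connect(":memory:")  # replace with a connection to the RDS instance
print(load_hotels(conn, sample_csv))  # 2
print(conn.execute("SELECT hotel FROM hotels WHERE note > 8.1").fetchall())
```

With a real RDS engine the connection object and the placeholder style of the driver may differ, so treat the SQL above as a starting point.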

## Deliverable 📬

To complete this project, your team should deliver:

* A `.csv` file in an S3 bucket containing enriched information about weather and hotels for each French city

* A SQL database from which we can get the same cleaned data as in S3 

* Two maps showing the Top-5 destinations and the Top-20 hotels in the area. You can use plotly or any other library to do so. It should look something like this: 

![Map](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/Kayak_best_destination_project.png)
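
Selecting the Top-5 destinations or the Top-20 hotels before plotting comes down to keeping the N best records under some score. A minimal sketch with the standard library follows; the record layout and the `top_n` helper are hypothetical.

```python
import heapq


def top_n(records, key, n):
    """Return the n records with the largest value stored under `key`."""
    return heapq.nlargest(n, records, key=lambda record: record[key])


# Hypothetical records; real ones would come from the final CSV.
hotels = [
    {"hotel": "A", "note": 8.1},
    {"hotel": "B", "note": 9.2},
    {"hotel": "C", "note": 7.3},
]
print([h["hotel"] for h in top_n(hotels, "note", 2)])  # ['B', 'A']
```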

# - Hotels by city.

In [1]:
# Following pagination links 📄📄📄
!pip install Scrapy
# The example below shows how to use links to iterate over several pages:
import json
import requests
# os => library for interacting with the operating system (paths, files).
## More information => https://docs.python.org/3/library/os.html
import os 

# logging => standard library module, used here to set Scrapy's log level.
## More information => https://docs.python.org/3/library/logging.html
import logging

# Import scrapy and scrapy.crawler
import scrapy
from scrapy.crawler import CrawlerProcess

import pandas as pd




## - List of the 35 cities.

In [2]:
villes = ["Le Mont Saint Michel"]
#, "Uzes", "Chateau du Haut Koenigsbourg", "Ariege", "Aigues Mortes"
#     ,"St Malo","Bayeux","Le Havre","Rouen","Paris","Amiens","Lille","Strasbourg",
#               "Colmar","Eguisheim","Besancon","Dijon","Annecy","Grenoble","Lyon",
#               "Gorges du Verdon","Bormes les Mimosas","Cassis","Marseille","Aix en Provence","Avignon","Nimes",
#               "Saintes Maries de la mer","Collioure","Carcassonne","Toulouse","Montauban",
#               "Biarritz","Bayonne","La Rochelle"]



## - Advanced scraping with Scrapy.

In [3]:
class MultipleSpider(scrapy.Spider):

    # Name of the spider
    name = "multiplepages"

    # Base search URL; the city name is appended as the `ss` query parameter.
    # Selector notes: h3.sr-hotel__title, span.sr-hotel__name, div.sr-hotel__title-wrap
    # div.sr-hotel__title-wrap h3.sr-hotel__title a.js-sr-hotel-link.hotel_name_link.url span.sr-hotel__name
    startURL = 'https://www.booking.com/searchresults.fr.html?ss='
  
    # One start URL per city
    start_urls = []
    
    for ville in villes:
        start_urls.append(startURL+ville)
    
    # Callback invoked for every downloaded search results page.
    # It yields one item per hotel: city, name, URL, description, score and GPS coordinates.
    def parse(self, response):
        #print(response.body)
        # Heading of the results page; None means the expected page was not served
        text = response.css('h1._30227359d._0db903e42::text').get()
        if text is None:
            # Scraping failed: retry the same URL.
            # dont_filter=True is needed, otherwise Scrapy's duplicate filter drops the repeated request.
            # The cap of 3 retries is an arbitrary choice that avoids an endless loop.
            print('none : ' + response.url)
            retries = response.meta.get('retries', 0)
            if retries < 3:
                yield response.follow(response.url, callback=self.parse,
                                      dont_filter=True, meta={'retries': retries + 1})
        else:
            # Scraping OK
            #print('ok : ' +response.url)
            #ville = str(response.css('h1.sorth1::text').get()).split(':')[0].strip()
            url = response.url.split("?ss=")
            
            # City name, read back from the `ss` parameter of the URL
            ville = url[1].split("&")[0].replace("%20", " ")
                        
            # Number of hotels found, parsed from the heading text
            nb = int(text.split(':')[1].split()[0])
            
            # On a paginated page, read the offset back from the URL
            offset = 0
            if "offset" in response.url:
                offset = int(response.url.split("offset=")[1])
                if offset>nb:
                    # Exact number of hotels left on the last page
                    nb = nb - offset + 25
            
            i=1 # index of the hotel being scraped
            
            # List of hotel cards found on the page
            results = response.css('div._fe1927d9e._0811a1b54._a8a1be610._022ee35ec.b9c27d6646.fb3c4512b4.fc21746a73')
            
            for title in results: #sr_item_no_dates
                # Fetch the hotel page to read its GPS coordinates
                hotel_url = title.css('div._12369ea61 a::attr(href)').get()
                gps = str(requests.get(hotel_url).content).split("data-atlas-latlng=\"")[1].split("data-atlas-bbox=")[0].split("\"")[0]
                yield {
                    'ville': ville,
                    'Hotel': str(title.css('div.fde444d7ef._c445487e2::text').get()).strip(), #sr-hotel__name
                    'Url': hotel_url,
                    # Commas are replaced by periods and line breaks are flattened to keep the CSV export simple
                    'Description': str(title.css('div._4abc4c3d5::text').get()).replace(',','.').replace("\r\n", " ").replace("\n", " ").replace("\r", " ").strip(), #hotel_desc 
                    'Note' : str(title.css('div._9c5f726ff.bd528f9ea6::text').get()).strip().replace(',','.'), #bui-review-score__badge
                    #'GPS_lat_lng' : gps_page.css('p.showMap2 a::attr(href)').get(), #sr_card_address_line
                    #'GPS_lat' : gps_page.css('p.address.address_clean a::attr(data-atlas-latlng)').get().split(',')[0],
                    #'GPS_lng' : gps_page.css('p.address.address_clean a::attr(data-atlas-latlng)').get().split(',')[1],
                    'GPS_lat' : gps.split(',')[0],
                    'GPS_lng' : gps.split(',')[1],
                    
                }
                i+=1 # next hotel
                if i>nb: # stop at the exact hotel count, because Booking pads the page to 25 entries with hotels from other cities
                    break
            
            # Do not look for further pages when there are 25 results or fewer, because Booking then suggests pages for other cities.
            if nb<=25 :
                return
            
            # Reaching this point means there are more pages to scrape.
            # Pagination is scheduled only from the first page (no offset in the URL).
            next_offset = 0
            while nb >= next_offset and "offset" not in response.url: 
                next_offset += 25
                next_page = response.url + "&offset="+str(next_offset)
                yield response.follow(next_page, callback=self.parse)
            
            # Previous pagination code, kept for reference.
            #try:
            #    # Select the NEXT button and store it in next_page.
            #    #next_page = response.xpath('//div[@id="b_pageNext"]/a').attrib["href"]
            #    next_page = response.css('a.bui-pagination__link.paging-next').attrib["href"]
            #    #print(next_page)
            #except KeyError:
            #    # On the last page there is no "href" and a KeyError is raised.
            #    logging.info('No next page. Terminating crawling process.')
            #else:
            #    # If a next page is found, run the parse method once more.
            #    yield response.follow(next_page, callback=self.parse)
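
The pagination arithmetic in `parse` can be checked in isolation. The sketch below assumes that `offset` is the number of results skipped (so `offset=25` shows results 26 to 50) and that a page holds 25 results. Both are assumptions about the site rather than documented behaviour, and the function names are illustrative. Under those assumptions, a search with 60 results needs offsets 0, 25 and 50, whereas the `while` loop in `parse` would also request offset 75.

```python
def page_offsets(total, page_size=25):
    """Offsets of every results page needed to cover `total` hotels."""
    return list(range(0, total, page_size))


def hotels_on_page(total, offset, page_size=25):
    """Number of genuine results on the page that starts at `offset`."""
    return max(0, min(page_size, total - offset))


print(page_offsets(60))        # [0, 25, 50]
print(hotels_on_page(60, 50))  # 10
print(hotels_on_page(60, 75))  # 0
```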



In [4]:
# Name of the file where the results will be saved
filenameHotel = "Ville_hotel22"

# If the output files already exist, delete them before crawling (otherwise Scrapy appends the new results to the old ones)
for extension in ('.json', '.csv'):
    path = 'src_test_json/' + filenameHotel + extension
    if os.path.exists(path):
        os.remove(path)

# Declare a new CrawlerProcess with a few settings
## USER_AGENT => simulates a browser on an OS
## LOG_LEVEL => minimum logging level
## DEFAULT_REQUEST_HEADERS => headers sent with every request (a bare 'Accept-Language' key is not a Scrapy setting and is ignored)
process = CrawlerProcess(settings = {
    'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    'LOG_LEVEL': logging.INFO,
    'AUTOTHROTTLE_ENABLED': True,
    'AUTOTHROTTLE_TARGET_CONCURRENCY' : 0.1,
    'FEED_EXPORT_ENCODING' : 'UTF-8',
    'DEFAULT_REQUEST_HEADERS': {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'fr-FR, fr;q=0.9',
    },
    "FEEDS": {
        'src_test_json/' + filenameHotel + '.json': {"format": "json"},
    }
})

# Start crawling with the spider defined above.
process.crawl(MultipleSpider)
process.start()



In [5]:
# Load the scraped items and export them as CSV, without the DataFrame index
with open('src_test_json/' + filenameHotel + '.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

df = pd.json_normalize(data)
df.to_csv('src_test_json/' + filenameHotel + '.csv', index=False)

# - Hotels of the cities.

villes = ["Uzes","Mont Saint Michel"
          , "Chateau du Haut Koenigsbourg", "Ariege", "Aigues Mortes","St Malo","Bayeux","Le Havre",
          "Rouen","Paris","Amiens", "Lille","Strasbourg", "Colmar","Eguisheim","Besancon","Dijon","Annecy","Grenoble","Lyon",
          "Gorges du Verdon","Bormes les Mimosas","Cassis","Marseille","Aix en Provence","Avignon","Nimes",
          "Saintes Maries de la mer","Collioure","Carcassonne","Toulouse","Montauban","Biarritz","Bayonne","La Rochelle"]

### Reading the CSV.

In [2]:
df = pd.read_csv('2_Ville_hotel1.csv')

In [3]:
df

| Unnamed: 0 | ville | hotel | url | description | note | GPS_lat | GPS_lng |
|---|---|---|---|---|---|---|---|
| 0 | Le Mont Saint Michel | Hôtel Vert | https://www.booking.com/hotel/fr/vert.fr.html?... | Situé à 2 km du Mont-Saint-Michel. sur la côte... | 8.1 | 48.614700 | -1.509617 |
| 1 | Le Mont Saint Michel | Le Relais Saint Michel | https://www.booking.com/hotel/fr/le-relais-sai... | Le Relais Saint Michel vous accueille face à l... | 7.8 | 48.617587 | -1.510396 |
| 2 | Le Mont Saint Michel | Mercure Mont Saint Michel | https://www.booking.com/hotel/fr/mont-saint-mi... | Installé dans des espaces verts à seulement 2 ... | 8.2 | 48.614247 | -1.510545 |
| 3 | Le Mont Saint Michel | Le Mouton Blanc | https://www.booking.com/hotel/fr/le-mouton-bla... | Situé au pied de l'abbaye. le Mouton Blanc Hot... | 7.2 | 48.636023 | -1.509896 |
| 4 | Le Mont Saint Michel | Les Terrasses Poulard | https://www.booking.com/hotel/fr/les-terrasses... | Occupant 2 bâtiments différents au cœur du Mon... | 7.3 | 48.635349 | -1.510379 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9423 | Grenoble | Grand logement quartier Championnet | https://www.booking.com/hotel/fr/grand-logemen... | Hébergement géré par un particulier | 8.8 | 45.187509 | 5.719823 |
| 9424 | Grenoble | Montagnes russes | https://www.booking.com/hotel/fr/montagnes-rus... | Hébergement géré par un particulier | 8.2 | 45.185315 | 5.718978 |
| 9425 | Grenoble | B&B Hôtel Grenoble Centre Verlaine | https://www.booking.com/hotel/fr/b-amp-b-greno... | Le B&B Hôtel Grenoble Centre Verlaine propose ... | 6.8 | 45.161910 | 5.714983 |
| 9426 | Grenoble | Le Contemporain, Hyper-centre, 8 pers | https://www.booking.com/hotel/fr/le-contempora... | Doté d'une connexion Wi-Fi gratuite et offrant... | 9.2 | 45.189887 | 5.728864 |


In [4]:
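# Missing scores were exported as the string "None": convert the column to float and use 0.0 for them.
# pd.to_numeric also works when read_csv has already parsed "None" as NaN.
df["note"] = pd.to_numeric(df["note"], errors="coerce").fillna(0.0)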

## - Distribution of the best hotels by city (satisfaction score > 9.5)

In [5]:
!pip install plotly

import plotly.express as px
import matplotlib.pyplot as plt
import plotly.io as pio
pio.renderers.default="iframe_connected"

mask = (df['note'] > 9.5) 
data_viz = df.loc[mask,:]

fig = px.scatter_mapbox(data_viz, lat="GPS_lat", lon="GPS_lng", color="note",  mapbox_style="carto-positron", animation_frame="ville",size="note", size_max = 20,
                        color_continuous_scale=px.colors.diverging.Portland,zoom=0 ,hover_name='hotel')
fig.show()



## - Merging the two CSV files 'meteo1' and 'hotels1'.

In [None]:
meteo1 = pd.read_csv( "2_Meteo_ville_temp_max_ciel.csv")
hotels1= pd.read_csv("2_Ville_hotel1.csv")
print(meteo1.shape)
print(hotels1.shape)
#print("df_meteo : " + df_meteo['ville'].unique())
#print("df_hotels : " + df_hotels['ville'].unique())

#Kayak_data_outer = pd.merge(df_meteo, df_hotels, how="outer", on=["ville"])
Kayak_data_Meteo_Hotel = pd.merge(meteo1, hotels1, on=["ville"])
#(Kayak_displaydata_inner.head())
print(Kayak_data_Meteo_Hotel.shape)
#display(Kayak_data.head())
#print("outer : " + Kayak_data_outer['ville'].unique())
#print("inner : " + Kayak_data_inner['ville'].unique())

(35, 20)
(9428, 7)
(9415, 26)
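
The inner merge returns 9415 rows for 9428 hotels, which suggests that some hotel rows carry a `ville` value with no exact match in the weather table. A minimal way to list such values is a set difference. The inputs below are hypothetical; in the notebook they would be `hotels1["ville"]` and `meteo1["ville"]`.

```python
def unmatched_keys(left_keys, right_keys):
    """Keys present on the left side of a join but absent from the right side."""
    return sorted(set(left_keys) - set(right_keys))


# Hypothetical values illustrating a spelling mismatch between the two files.
hotel_cities = ["Le Mont Saint Michel", "Paris", "Paris", "Lyon"]
weather_cities = ["Mont Saint Michel", "Paris", "Lyon"]
print(unmatched_keys(hotel_cities, weather_cities))  # ['Le Mont Saint Michel']
```

With pandas, passing `indicator=True` to `pd.merge` together with `how="outer"` gives the same information per row.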


## - Exporting 'Kayak_data_Meteo_Hotel' to CSV.

In [8]:
# Export the data to CSV (index=False avoids writing an extra unnamed index column)

Kayak_data_Meteo_Hotel.to_csv('3_Meteo_Hotel_Final.csv', index=False)
Kayak_data_Meteo_Hotel = pd.read_csv('3_Meteo_Hotel_Final.csv')

In [None]:
Kayak_data_Meteo_Hotel.head(10)