⚠️ This project is mandatory for certification bloc #1.

![Kayak](https://seekvectorlogo.com/wp-content/uploads/2018/01/kayak-vector-logo.png)

# Plan your trip with Kayak 


## Company's description 📇

<a href='https://www.kayak.com' target='_blank'>Kayak</a> is a travel search engine that helps users plan their next trip at the best price.

The company was founded in 2004 by Steve Hafner & Paul M. English. After a few rounds of fundraising, Kayak was acquired by <a href='https://www.bookingholdings.com/' target='_blank'>Booking Holdings</a> which now holds: 

* <a href='https://booking.com/' target='_blank'>Booking.com</a>
* <a href='https://kayak.com/' target='_blank'>Kayak</a>
* <a href='https://www.priceline.com/' target='_blank'>Priceline</a>
* <a href='https://www.agoda.com/' target='_blank'>Agoda</a>
* <a href='https://Rentalcars.com/' target='_blank'>RentalCars</a>
* <a href='https://www.opentable.com/' target='_blank'>OpenTable</a>

With over \$300 million in revenue a year, Kayak operates in almost all countries and all languages to help its users book trips across the globe. 

## Project 🚧

The marketing team needs help on a new project. After doing some user research, the team discovered that **70% of their users who are planning a trip would like to have more information about the destination they are going to**. 

In addition, user research shows that **people tend to distrust the information they are reading if they don't know the brand** which produced the content. 

Therefore, the Kayak marketing team would like to create an application that recommends where people should plan their next holidays. The application should be based on real data about:

* Weather 
* Hotels in the area 

The application should then be able to recommend the best destinations and hotels based on the above variables at any given time. 

## Goals 🎯

As the project has just started, your team doesn't have any data that can be used to create this application. Therefore, your job will be to: 

* Scrape data from destinations 
* Get weather data from each destination 
* Get hotels' info about each destination
* Store all the information above in a data lake
* Extract, transform and load cleaned data from your data lake to a data warehouse

## Scope of this project 🖼️

The marketing team wants to focus first on the best cities to travel to in France. According to <a href='https://one-week-in.com/35-cities-to-visit-in-france/' target='_blank'>One Week In.com</a>, here are the top 35 cities to visit in France: 

```python 
['Mont Saint Michel',
'St Malo',
'Bayeux',
'Le Havre',
'Rouen',
'Paris',
'Amiens',
'Lille',
'Strasbourg',
'Chateau du Haut Koenigsbourg',
'Colmar',
'Eguisheim',
'Besancon',
'Dijon',
'Annecy',
'Grenoble',
'Lyon',
'Gorges du Verdon',
'Bormes les Mimosas',
'Cassis',
'Marseille',
'Aix en Provence',
'Avignon',
'Uzes',
'Nimes',
'Aigues Mortes',
'Saintes Maries de la mer',
'Collioure',
'Carcassonne',
'Ariege',
'Toulouse',
'Montauban',
'Biarritz',
'Bayonne',
'La Rochelle']
```

Your team should focus **only on the above cities for your project**. 


## Helpers 🦮

To help you complete this project, here are a few tips:

### Get weather data with an API 

*   Use https://nominatim.org/ to get the GPS coordinates of all the cities (no subscription required). Documentation: https://nominatim.org/release-docs/develop/api/Search/

*   Use https://openweathermap.org/appid (you have to subscribe to get a free API key) and https://openweathermap.org/api/one-call-api to get weather information for the 35 cities and put it in a DataFrame.

*   Determine the list of cities where the weather will be the nicest within the next 7 days. For example, you can use the values of `daily.pop` and `daily.rain` to compute the expected volume of rain within the next 7 days. This is only an example: you may have a different opinion on what nice weather looks like 😎 Maybe the most important criterion for you is temperature or humidity, so feel free to change the rules!

*   Save all the results in a `.csv` file; you will use it later 😉 You can save any information that seems important to you! Don't forget to save the names of the cities, and to create a column containing a unique identifier (id) for each city (this is important for the rest of the project).

*   Use plotly to display the best destinations on a map.
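
As a rough illustration of the rain-based criterion mentioned in the tips above, the sketch below weights each day's `rain` volume (in mm) by its `pop` probability and sums the result over the forecast. This is only one possible reading of "expected volume of rain". It assumes each entry of `daily` is a dict with a `pop` key and a `rain` key that may be missing on dry days; the function name `expected_rain_mm` and the sample values are made up for the example.

```python
def expected_rain_mm(daily):
    """Sum of pop * rain over the forecast days (a missing key counts as 0)."""
    return sum(day.get("pop", 0.0) * day.get("rain", 0.0) for day in daily)


# Hypothetical 3-day forecast, shaped like the `daily` list of a One Call response
sample_daily = [
    {"pop": 0.5, "rain": 4.0},
    {"pop": 0.0},  # assumed: no "rain" key when no rain is forecast
    {"pop": 1.0, "rain": 2.0},
]

print(expected_rain_mm(sample_daily))
```

A lower value would then mean a drier week; it could be combined with temperature or humidity if those matter more to you.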

### Scrape Booking.com 

Since Booking Holdings doesn't have aggregated databases, it will be much faster to scrape data directly from booking.com.

You can scrape as much information as you want, but we suggest that you get at least:

*   Hotel name
*   URL of its booking.com page
*   Its coordinates: latitude and longitude
*   Score given by the website users
*   Text description of the hotel


### Create your data lake using S3 

Once you have built your dataset, store it in S3 as a CSV file. 

### ETL 

Once you have uploaded your data to S3, it will be easier for the next data analysis team to extract clean data directly from a data warehouse. Therefore, create a SQL database using AWS RDS, extract your data from S3 and store it in your newly created database. 
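
The extract, transform and load steps can be tried locally before touching AWS. The sketch below is not the RDS setup itself: it uses an in-memory SQLite database from the standard library as a stand-in for the RDS instance, and an inline CSV string as a stand-in for the file downloaded from S3. The table name `weather` and its three columns are assumptions chosen for the example.

```python
import csv
import io
import sqlite3

# Small sample CSV, standing in for a file downloaded from S3
csv_text = "id,city,score\n0,Mont Saint Michel,12\n5,Paris,16\n"

# In-memory SQLite database, used only as a local stand-in for the RDS instance
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (id INTEGER PRIMARY KEY, city TEXT, score INTEGER)")

# Extract: parse the CSV rows. Transform: cast the numeric columns.
rows = [
    (int(r["id"]), r["city"], int(r["score"]))
    for r in csv.DictReader(io.StringIO(csv_text))
]

# Load: insert the cleaned rows with a parameterised query
conn.executemany("INSERT INTO weather VALUES (?, ?, ?)", rows)
conn.commit()

best = conn.execute("SELECT city FROM weather ORDER BY score DESC LIMIT 1").fetchone()
print(best[0])
```

With a real RDS instance, only the connection and the SQL dialect details would change; the three steps stay the same.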

## Deliverable 📬

To complete this project, your team should deliver:

* A `.csv` file in an S3 bucket containing enriched information about weather and hotels for each French city

* A SQL Database where we should be able to get the same cleaned data from S3 

* Two maps: one showing the top 5 destinations and one showing the top 20 hotels in the area. You can use plotly or any other library to do so. It should look something like this: 

![Map](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/Kayak_best_destination_project.png)

# Let's start

In [1]:
# Import libs:

import pandas as pd
import requests
import numpy as np
import plotly.express as px
import scrapy
from scrapy.crawler import CrawlerProcess
import logging
import os
from bs4 import BeautifulSoup
import time


## Get weather data with an API

In [2]:
# Import list of cities

cities = [
    "Mont Saint Michel",
    "St Malo",
    "Bayeux",
    "Le Havre",
    "Rouen",
    "Paris",
    "Amiens",
    "Lille",
    "Strasbourg",
    "Chateau du Haut Koenigsbourg",
    "Colmar",
    "Eguisheim",
    "Besancon",
    "Dijon",
    "Annecy",
    "Grenoble",
    "Lyon",
    "Gorges du Verdon",
    "Bormes les Mimosas",
    "Cassis",
    "Marseille",
    "Aix en Provence",
    "Avignon",
    "Uzes",
    "Nimes",
    "Aigues Mortes",
    "Saintes Maries de la mer",
    "Collioure",
    "Carcassonne",
    "Ariege",
    "Toulouse",
    "Montauban",
    "Biarritz",
    "Bayonne",
    "La Rochelle",
]


In [3]:
# Creating a dataframe from the list of cities
df_cities = pd.DataFrame(data=cities, columns=["city"])
df_cities = df_cities.reset_index()
df_cities = df_cities.rename(columns={"index": "id"})
df_cities


| | id | city |
| --- | --- | --- |
| 0 | 0 | Mont Saint Michel |
| 1 | 1 | St Malo |
| 2 | 2 | Bayeux |
| 3 | 3 | Le Havre |
| 4 | 4 | Rouen |
| 5 | 5 | Paris |
| 6 | 6 | Amiens |
| 7 | 7 | Lille |
| 8 | 8 | Strasbourg |
| 9 | 9 | Chateau du Haut Koenigsbourg |


### Nominatim API: get the GPS coordinates of all the cities (https://nominatim.org/release-docs/latest/)

In [4]:
# Try a city with a space in its name to see whether extra formatting is needed:

params = {"city": "La Rochelle", "country": "France", "format": "json"}

r = requests.get("https://nominatim.openstreetmap.org/search?", params).json()

r


[{'place_id': 281822562,
  'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
  'osm_type': 'relation',
  'osm_id': 117858,
  'boundingbox': ['46.1331804', '46.1908971', '-1.2419231', '-1.111097'],
  'lat': '46.1591126',
  'lon': '-1.1520434',
  'display_name': 'La Rochelle, Charente-Maritime, Nouvelle-Aquitaine, France métropolitaine, 17000, France',
  'class': 'boundary',
  'type': 'administrative',
  'importance': 0.9114837096874572,
  'icon': 'https://nominatim.openstreetmap.org/ui/mapicons/poi_boundary_administrative.p.20.png'},
 {'place_id': 282096286,
  'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
  'osm_type': 'relation',
  'osm_id': 1215878,
  'boundingbox': ['47.7388479', '47.7647325', '5.7060641', '5.7454734'],
  'lat': '47.75173925',
  'lon': '5.7245822865532014',
  'display_name': 'La Rochelle, Vesoul, Haute-Saône, Bourgogne-Franche-Comté, France métropolitaine, 70120, France',
  'class': 'boundary',

In [5]:
# I don't know 'Chateau du Haut Koenigsbourg', check if Nominatim knows it

params = {"city": "Chateau du Haut Koenigsbourg", "country": "France", "format": "json"}

r = requests.get("https://nominatim.openstreetmap.org/search?", params).json()

r


[{'place_id': 49750339,
  'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
  'osm_type': 'node',
  'osm_id': 4245068168,
  'boundingbox': ['48.2494726', '48.2495726', '7.3454423', '7.3455423'],
  'lat': '48.2495226',
  'lon': '7.3454923',
  'display_name': 'Château du Haut-Kœnigsbourg, Orschwiller, Sélestat-Erstein, Bas-Rhin, Grand Est, France métropolitaine, 67600, France',
  'class': 'place',
  'type': 'isolated_dwelling',
  'importance': 0.51}]

In [6]:
# 'Gorges du Verdon' isn't a city but a canyon, check whether Nominatim can resolve it

params = {"city": "Gorges du Verdon", "country": "France", "format": "json"}

r = requests.get("https://nominatim.openstreetmap.org/search?", params).json()

r

# It doesn't work. After a Google search, 'Castellane' will be used for Gorges du Verdon


[]

Ariege is a department, not a city. Its prefecture is 'Foix', so we'll use that city instead.
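
To make the two substitutions above explicit, and to show how a name containing spaces ends up in the query string, here is a minimal sketch using only the standard library. The helper name `nominatim_search_url` and the `SUBSTITUTES` mapping are illustrative choices, not part of Nominatim; the `city`, `country` and `format` parameters are the ones already used in the cells above.

```python
from urllib.parse import urlencode

# Place names that are not cities, mapped to the town used in their place
SUBSTITUTES = {"Gorges du Verdon": "Castellane", "Ariege": "Foix"}


def nominatim_search_url(place, country="France"):
    """Build a Nominatim search URL, swapping in the substitute town if needed."""
    params = {
        "city": SUBSTITUTES.get(place, place),
        "country": country,
        "format": "json",
    }
    # urlencode escapes spaces and other special characters
    return "https://nominatim.openstreetmap.org/search?" + urlencode(params)


print(nominatim_search_url("La Rochelle"))
print(nominatim_search_url("Ariege"))
```

Passing a `params` dict to `requests.get` performs the same kind of encoding, which is why no manual formatting of the city names is needed.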

#### Request API

In [7]:
# Creating a copy of df_cities to store the coordinates from Nominatim API

df_gps = df_cities.copy()
lat_list = []
lon_list = []

# Names that Nominatim cannot resolve as cities, mapped to the town queried instead:
# 'Gorges du Verdon' is a canyon -> 'Castellane'
# 'Ariege' is a department -> its prefecture 'Foix'
substitutes = {"Gorges du Verdon": "Castellane", "Ariege": "Foix"}

for city in cities:
    print(f"Request for city: {city}")
    params = {
        "city": substitutes.get(city, city),
        "country": "France",
        "format": "json",
    }
    r = requests.get("https://nominatim.openstreetmap.org/search", params=params).json()
    # Keep the coordinates of the first (most relevant) result
    lat_list.append(r[0]["lat"])
    lon_list.append(r[0]["lon"])

# Adding the coordinates to the dataframe
df_gps["lat"] = lat_list
df_gps["lon"] = lon_list


Request for city: Mont Saint Michel
Request for city: St Malo
Request for city: Bayeux
Request for city: Le Havre
Request for city: Rouen
Request for city: Paris
Request for city: Amiens
Request for city: Lille
Request for city: Strasbourg
Request for city: Chateau du Haut Koenigsbourg
Request for city: Colmar
Request for city: Eguisheim
Request for city: Besancon
Request for city: Dijon
Request for city: Annecy
Request for city: Grenoble
Request for city: Lyon
Request for city: Gorges du Verdon
Request for city: Bormes les Mimosas
Request for city: Cassis
Request for city: Marseille
Request for city: Aix en Provence
Request for city: Avignon
Request for city: Uzes
Request for city: Nimes
Request for city: Aigues Mortes
Request for city: Saintes Maries de la mer
Request for city: Collioure
Request for city: Carcassonne
Request for city: Ariege
Request for city: Toulouse
Request for city: Montauban
Request for city: Biarritz
Request for city: Bayonne
Request for city: La Rochelle


In [8]:
df_gps.head()


| | id | city | lat | lon |
| --- | --- | --- | --- | --- |
| 0 | 0 | Mont Saint Michel | 48.6359541 | -1.511459954959514 |
| 1 | 1 | St Malo | 48.649518 | -2.0260409 |
| 2 | 2 | Bayeux | 49.2764624 | -0.7024738 |
| 3 | 3 | Le Havre | 49.4938975 | 0.1079732 |
| 4 | 4 | Rouen | 49.4404591 | 1.0939658 |


In [9]:
df_gps.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      35 non-null     int64 
 1   city    35 non-null     object
 2   lat     35 non-null     object
 3   lon     35 non-null     object
dtypes: int64(1), object(3)
memory usage: 1.2+ KB


### OpenWeather: get the weather for the week (https://openweathermap.org)

In [10]:
# try at St Malo
parameters = {
    "lat": 48.649518,
    "lon": -2.0260409,
    "exclude": "current,minutely,hourly",
    "units": "metric",
    "appid": "aa423e6694bf72625fe1fe31544949dc",
    "lang": "fr",
}

r = requests.get(
    "https://api.openweathermap.org/data/2.5/onecall", params=parameters
).json()

r


{'lat': 48.6495,
 'lon': -2.026,
 'timezone': 'Europe/Paris',
 'timezone_offset': 7200,
 'daily': [{'dt': 1655899200,
   'sunrise': 1655870738,
   'sunset': 1655928865,
   'moonrise': 1655858940,
   'moonset': 1655905020,
   'moon_phase': 0.79,
   'temp': {'day': 21.14,
    'min': 14.91,
    'max': 21.22,
    'night': 16.94,
    'eve': 18.37,
    'morn': 16.39},
   'feels_like': {'day': 21.22, 'night': 17.07, 'eve': 18.66, 'morn': 16.46},
   'pressure': 1010,
   'humidity': 73,
   'dew_point': 16.11,
   'wind_speed': 7.22,
   'wind_deg': 57,
   'wind_gust': 10.85,
   'weather': [{'id': 501,
     'main': 'Rain',
     'description': 'pluie modérée',
     'icon': '10d'}],
   'clouds': 0,
   'pop': 0.88,
   'rain': 9.04,
   'uvi': 6.87},
  {'dt': 1655985600,
   'sunrise': 1655957154,
   'sunset': 1656015274,
   'moonrise': 1655946300,
   'moonset': 1655995620,
   'moon_phase': 0.82,
   'temp': {'day': 21.85,
    'min': 15.81,
    'max': 22.01,
    'night': 16.22,
    'eve': 19.74,
    'mor

In [11]:
# Forecast for St Malo from 3 days ahead onwards
r["daily"][3:]


[{'dt': 1656158400,
  'sunrise': 1656129993,
  'sunset': 1656188083,
  'moonrise': 1656121260,
  'moonset': 1656176820,
  'moon_phase': 0.89,
  'temp': {'day': 17.21,
   'min': 13.76,
   'max': 17.89,
   'night': 14.24,
   'eve': 17.89,
   'morn': 14.08},
  'feels_like': {'day': 16.61, 'night': 13.57, 'eve': 17.22, 'morn': 13.74},
  'pressure': 1007,
  'humidity': 62,
  'dew_point': 9.55,
  'wind_speed': 6.39,
  'wind_deg': 268,
  'wind_gust': 9.6,
  'weather': [{'id': 501,
    'main': 'Rain',
    'description': 'pluie modérée',
    'icon': '10d'}],
  'clouds': 100,
  'pop': 0.81,
  'rain': 4.03,
  'uvi': 6.93},
 {'dt': 1656244800,
  'sunrise': 1656216416,
  'sunset': 1656274484,
  'moonrise': 1656209040,
  'moonset': 1656267300,
  'moon_phase': 0.92,
  'temp': {'day': 18.03,
   'min': 11.07,
   'max': 18.03,
   'night': 14.14,
   'eve': 17.13,
   'morn': 12.62},
  'feels_like': {'day': 17.19, 'night': 13.52, 'eve': 16.47, 'morn': 11.92},
  'pressure': 1012,
  'humidity': 50,
  'dew_po

In [12]:
day3 = r["daily"][3]  # Forecast for 3 days from now

# Description of the main weather in 3 days at St Malo
day3["weather"][0]["main"]


'Rain'
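
The request above embeds the API key directly in the notebook. A possible alternative, sketched below under assumptions, is to read the key from an environment variable and build the query parameters in one place. The variable name `OPENWEATHER_API_KEY` and both helper names are hypothetical; the parameter keys are the ones used in the request above.

```python
import os


def get_openweather_key(env_var="OPENWEATHER_API_KEY"):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


def onecall_params(lat, lon, api_key):
    """Query parameters mirroring the ones used in the request above."""
    return {
        "lat": lat,
        "lon": lon,
        "exclude": "current,minutely,hourly",
        "units": "metric",
        "appid": api_key,
        "lang": "fr",
    }
```

The dict returned by `onecall_params(48.649518, -2.0260409, get_openweather_key())` could then be passed as `params` to `requests.get`, keeping the key out of the saved notebook.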

In [13]:
df_gps_weather = df_gps.copy(deep=True)
my_api_key = "aa423e6694bf72625fe1fe31544949dc"

temperatures_list = []
rain_list = []
weather_list = []

days = list(range(1, 8))

for row in df_gps_weather.itertuples():
    parameters = {
        "lat": row.lat,  # plain values, not set literals such as {lat}
        "lon": row.lon,
        "exclude": "current,minutely,hourly",
        "units": "metric",
        "appid": my_api_key,
        "lang": "fr",
    }

    r = requests.get(
        "https://api.openweathermap.org/data/2.5/onecall", params=parameters
    ).json()
    forecast_7days = r["daily"][1:]  # Skip today: getting the weather data for the next 7 days
    temperatures = [int(d["feels_like"]["day"]) for d in forecast_7days]  # Felt temperature (°C)
    rain = [int(d["pop"] * 100) for d in forecast_7days]  # Probability of precipitation (%)
    weather = [str(d["weather"][0]["main"]) for d in forecast_7days]  # Main weather label
    temperatures_list.append(temperatures)
    rain_list.append(rain)
    weather_list.append(weather)

df_gps_weather["jour_+x"] = [days for _ in range(len(df_gps_weather))]
df_gps_weather["temperature_ressentie"] = temperatures_list
df_gps_weather["probabilite_de_pluie"] = rain_list
df_gps_weather["meteo_principale"] = weather_list
# Score = mean felt temperature minus one tenth of the mean rain probability (in %)
df_gps_weather["score"] = df_gps_weather.apply(
    lambda x: np.mean(x["temperature_ressentie"])
    - np.mean(x["probabilite_de_pluie"]) / 10,
    axis=1,
).astype(int)


In [14]:
df_gps_weather


Unnamed: 0,id,city,lat,lon,jour_+x,temperature_ressentie,probabilite_de_pluie,meteo_principale,score
0,0,Mont Saint Michel,48.6359541,-1.511459954959514,"[1, 2, 3, 4, 5, 6, 7]","[24, 18, 16, 19, 18, 20, 17]","[96, 85, 100, 20, 2, 39, 94]","[Rain, Rain, Rain, Rain, Clouds, Rain, Rain]",12
1,1,St Malo,48.649518,-2.0260409,"[1, 2, 3, 4, 5, 6, 7]","[21, 18, 16, 17, 16, 19, 16]","[89, 59, 81, 19, 0, 41, 77]","[Rain, Rain, Rain, Clouds, Clouds, Rain, Rain]",12
2,2,Bayeux,49.2764624,-0.7024738,"[1, 2, 3, 4, 5, 6, 7]","[25, 18, 14, 18, 18, 21, 18]","[67, 67, 100, 32, 24, 25, 100]","[Rain, Rain, Rain, Clouds, Rain, Rain, Rain]",12
3,3,Le Havre,49.4938975,0.1079732,"[1, 2, 3, 4, 5, 6, 7]","[21, 18, 14, 16, 15, 20, 20]","[60, 86, 100, 56, 9, 16, 94]","[Rain, Rain, Rain, Clouds, Clouds, Clouds, Rain]",11
4,4,Rouen,49.4404591,1.0939658,"[1, 2, 3, 4, 5, 6, 7]","[26, 20, 15, 19, 19, 21, 23]","[54, 100, 100, 53, 39, 0, 45]","[Rain, Rain, Rain, Rain, Rain, Clouds, Rain]",14
5,5,Paris,48.8588897,2.3200410217200766,"[1, 2, 3, 4, 5, 6, 7]","[26, 20, 21, 20, 21, 22, 25]","[60, 100, 100, 8, 100, 9, 20]","[Rain, Rain, Rain, Clouds, Rain, Clear, Rain]",16
6,6,Amiens,49.8941708,2.2956951,"[1, 2, 3, 4, 5, 6, 7]","[23, 18, 16, 20, 19, 20, 24]","[92, 100, 100, 4, 95, 4, 35]","[Rain, Rain, Rain, Clouds, Rain, Clouds, Rain]",13
7,7,Lille,50.6365654,3.0635282,"[1, 2, 3, 4, 5, 6, 7]","[23, 18, 17, 19, 17, 20, 24]","[93, 96, 100, 27, 100, 16, 14]","[Rain, Rain, Rain, Clouds, Rain, Clear, Clouds]",13
8,8,Strasbourg,48.584614,7.7507127,"[1, 2, 3, 4, 5, 6, 7]","[28, 22, 26, 24, 16, 20, 24]","[96, 100, 11, 84, 100, 35, 0]","[Rain, Rain, Clear, Rain, Rain, Rain, Clear]",16
9,9,Chateau du Haut Koenigsbourg,48.2495226,7.3454923,"[1, 2, 3, 4, 5, 6, 7]","[27, 20, 22, 17, 13, 17, 21]","[98, 100, 26, 100, 100, 34, 0]","[Rain, Rain, Rain, Rain, Rain, Rain, Clear]",13
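
The `score` column can be checked by hand. The sketch below recomputes it in plain Python for one row of the table above, using the same rule as the notebook: mean felt temperature minus one tenth of the mean rain probability, truncated to an integer. The function name `weather_score` is made up for the example.

```python
from statistics import mean


def weather_score(feels_like, rain_probability):
    """Mean felt temperature minus a tenth of the mean rain probability (%), truncated to int."""
    return int(mean(feels_like) - mean(rain_probability) / 10)


# Values copied from the 'Mont Saint Michel' row above
feels_like = [24, 18, 16, 19, 18, 20, 17]
rain_probability = [96, 85, 100, 20, 2, 39, 94]

print(weather_score(feels_like, rain_probability))  # 12, matching the score column
```

With this rule, a 10-point drop in average rain probability is worth as much as one extra degree of felt temperature; the weight of 10 is a choice and can be changed.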


In [15]:
df_gps_weather.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   id                     35 non-null     int64 
 1   city                   35 non-null     object
 2   lat                    35 non-null     object
 3   lon                    35 non-null     object
 4   jour_+x                35 non-null     object
 5   temperature_ressentie  35 non-null     object
 6   probabilite_de_pluie   35 non-null     object
 7   meteo_principale       35 non-null     object
 8   score                  35 non-null     int64 
dtypes: int64(2), object(7)
memory usage: 2.6+ KB


In [16]:
df_gps_weather[["lat", "lon"]] = df_gps_weather[["lat", "lon"]].astype(float)


### List of cities where the weather will be the nicest within the next 7 days

In [17]:
df_gps_weather.sort_values(by=["score"], ascending=False)


Unnamed: 0,id,city,lat,lon,jour_+x,temperature_ressentie,probabilite_de_pluie,meteo_principale,score
18,18,Bormes les Mimosas,43.157217,6.329254,"[1, 2, 3, 4, 5, 6, 7]","[27, 28, 28, 28, 28, 27, 27]","[16, 7, 0, 0, 10, 0, 0]","[Clouds, Clouds, Clear, Clear, Clouds, Clear, ...",27
21,21,Aix en Provence,43.529842,5.447474,"[1, 2, 3, 4, 5, 6, 7]","[28, 28, 28, 30, 28, 27, 29]","[74, 74, 1, 9, 35, 0, 0]","[Rain, Rain, Clear, Clouds, Rain, Clear, Clear]",25
19,19,Cassis,43.214036,5.539632,"[1, 2, 3, 4, 5, 6, 7]","[26, 25, 27, 26, 26, 26, 27]","[31, 28, 0, 0, 7, 0, 0]","[Rain, Clouds, Clouds, Clear, Clouds, Clear, C...",25
27,27,Collioure,42.52505,3.083155,"[1, 2, 3, 4, 5, 6, 7]","[29, 26, 30, 25, 20, 26, 27]","[8, 23, 28, 21, 31, 0, 5]","[Clear, Clouds, Rain, Clear, Clouds, Clear, Cl...",24
22,22,Avignon,43.949249,4.805901,"[1, 2, 3, 4, 5, 6, 7]","[31, 28, 30, 31, 26, 26, 29]","[87, 98, 27, 56, 56, 0, 0]","[Rain, Rain, Clouds, Rain, Rain, Clear, Clear]",24
20,20,Marseille,43.296174,5.369953,"[1, 2, 3, 4, 5, 6, 7]","[26, 25, 27, 26, 26, 26, 27]","[36, 40, 0, 2, 10, 0, 0]","[Rain, Rain, Clouds, Clear, Clouds, Clear, Clear]",24
17,17,Gorges du Verdon,43.846218,6.513181,"[1, 2, 3, 4, 5, 6, 7]","[26, 25, 27, 28, 28, 27, 28]","[16, 56, 0, 35, 79, 25, 2]","[Clouds, Rain, Clear, Rain, Rain, Clear, Clear]",23
24,24,Nimes,43.837425,4.360069,"[1, 2, 3, 4, 5, 6, 7]","[29, 27, 28, 29, 23, 27, 30]","[89, 85, 54, 64, 78, 0, 0]","[Rain, Rain, Rain, Rain, Rain, Clear, Clear]",22
26,26,Saintes Maries de la mer,43.452277,4.428717,"[1, 2, 3, 4, 5, 6, 7]","[24, 25, 25, 25, 24, 25, 25]","[53, 73, 33, 44, 38, 0, 0]","[Rain, Rain, Rain, Rain, Rain, Clear, Clear]",21
25,25,Aigues Mortes,43.565823,4.191284,"[1, 2, 3, 4, 5, 6, 7]","[26, 26, 26, 26, 25, 26, 26]","[65, 74, 50, 62, 44, 0, 0]","[Rain, Rain, Rain, Rain, Rain, Clear, Clear]",21


### Save all the results in a `.csv` file

In [18]:
df_gps_weather.to_csv("df_gps_weather.csv", index=False)


### Plotly to display the best destinations on a map

In [19]:
df_plotly = df_gps_weather.apply(
    pd.Series.explode
)  # To obtain one row per day and per city

df_plotly[["jour_+x", "temperature_ressentie", "probabilite_de_pluie"]] = df_plotly[
    ["jour_+x", "temperature_ressentie", "probabilite_de_pluie"]
].astype(int)

fig = px.scatter_mapbox(
    df_plotly,
    lat="lat",
    lon="lon",
    hover_name="city",
    zoom=4,
    hover_data=["meteo_principale", "probabilite_de_pluie", "temperature_ressentie"],
    color="temperature_ressentie",
    color_continuous_scale="thermal",
    mapbox_style="carto-positron",
)
fig.show()


## Scrape Booking.com 

### Get hotels and their URLs

In [20]:
class BookingSpider(scrapy.Spider):
    # Name of the spider
    name = "booking_spider"
    # Cities to search for, taken from the weather DataFrame
    cities = list(df_gps_weather.city)
    # URL to start the spider from
    start_urls = [
        "https://www.booking.com/index.fr.html",
    ]

    # Callback called on the start page: submit one search form per city
    def parse(self, response):
        for city in self.cities:
            yield scrapy.FormRequest.from_response(
                response, formdata={"ss": city}, callback=self.after_search
            )

    # Callback called on each page of search results
    def after_search(self, response):
        # Recover the searched city from the "ss" query parameter of the URL
        city = response.url.split("ss=")[-1].split("&")[0]
        booking = response.css(".d4924c9e74")

        for data in booking:

            yield {
                "ville": city,
                "hotels": data.css("a div.fcab3ed991.a23c043802::text").getall(),
                "liens": data.css("h3.a4225678b2 a::attr(href)").getall(),
            }

        # Follow the pagination link when there is one
        try:
            next_page = response.css("a.paging-next").attrib["href"]
        except KeyError:
            logging.info("No next page. Terminating crawling process.")
        else:
            yield response.follow(next_page, callback=self.after_search)


# Name of the file where the results will be saved
filename = "hotels.json"

# If file already exists, delete before crawling (because Scrapy will concatenate the last and new results otherwise)
if filename in os.listdir():
    os.remove(filename)

# Declare a new CrawlerProcess with some settings
process = CrawlerProcess(
    settings={
        "USER_AGENT": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1",
        "LOG_LEVEL": logging.INFO,
        "FEEDS": {
            filename: {"format": "json"},
        },
        "AUTOTHROTTLE_ENABLED": True,
    }
)

# Start the crawling using the spider you defined above
process.crawl(BookingSpider)
process.start()


2022-06-22 13:58:59 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-06-22 13:58:59 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.2.0, Python 3.9.12 (main, Apr  5 2022, 01:53:17) - [Clang 12.0.0 ], pyOpenSSL 21.0.0 (OpenSSL 1.1.1o  3 May 2022), cryptography 3.4.8, Platform macOS-10.16-x86_64-i386-64bit
2022-06-22 13:58:59 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'LOG_LEVEL': 20,
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, '
               'like Gecko) Chrome/13.0.782.112 Safari/535.1'}
2022-06-22 13:58:59 [scrapy.extensions.telnet] INFO: Telnet Password: e00ec7d63b47c6d1
2022-06-22 13:58:59 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions

### Get hotels' coordinates

In [21]:
# Create a DataFrame from the json file

df = pd.read_json("hotels.json")
df.head()


Unnamed: 0,ville,hotels,liens
0,Mont+Saint+Michel,"[Hôtel Vert, Hotel De La Digue, Le Saint Auber...",[https://www.booking.com/hotel/fr/vert.fr.html...
1,St+Malo,"[Le Cargo by Cocoonr, Hotel d'Aleth, Le Hauban...",[https://www.booking.com/hotel/fr/le-cargo-sai...
2,Bayeux,"[Le Mogador, Le Castel Guesthouse, ibis budget...",[https://www.booking.com/hotel/fr/le-mogador-b...
3,Le+Havre,"[Hilton Garden Inn Le Havre Centre, Hôtel Le M...",[https://www.booking.com/hotel/fr/hilton-garde...
4,Rouen,"[Radisson Blu Hotel, Rouen Centre, Cosy'Appart...",[https://www.booking.com/hotel/fr/radisson-blu...


In [22]:
# Keep at most 20 hotels per city, as requested by the task
# (column-wise assignment avoids chained indexing such as df["hotels"][i] = ...)
df["hotels"] = df["hotels"].apply(lambda names: names[:20])
df["liens"] = df["liens"].apply(lambda links: links[:20])


In [24]:
# BeautifulSoup iterates on each URL to get the hotel's score, GPS coordinates and description.
navigator = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1)"
headers = {"User-Agent": navigator}

# One list per city is collected here, then assigned to the DataFrame in one go,
# which avoids chained assignments such as df["lat"][i] = ...
all_lat, all_lon, all_description, all_score = [], [], [], []

for i, hotel_list in enumerate(df["liens"]):
    lat_list = []
    lon_list = []
    description_list = []
    score_list = []

    for url in hotel_list:

        # The request sometimes fails. When it does, try once more.
        try:
            page = requests.get(url, headers=headers)
        except requests.exceptions.RequestException:
            page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.text, "html.parser")

        # The address link carries the coordinates as a "lat,lon" string
        latlng = soup.select("p.address.address_clean a")[0].get("data-atlas-latlng").split(",")
        lat_list.append(latlng[0])
        lon_list.append(latlng[1])
        description_list.append(soup.select("div#property_description_content")[0].get_text())

        try:
            score_list.append(soup.select("div.b5cd09854e.d10a6220b4")[0].get_text())
        except IndexError:
            # 2 of the 700 hotels scraped here don't have a score yet, but one is still needed for the visualization, so it is set to 1.
            score_list.append("1.0")

        time.sleep(1.4)

    all_lat.append(lat_list)
    all_lon.append(lon_list)
    all_description.append(description_list)
    all_score.append(score_list)

    print(f"city {df['ville'].iloc[i]} done")

df["lat"] = all_lat
df["lon"] = all_lon
df["description"] = all_description
df["score"] = all_score


city Mont+Saint+Michel done
city St+Malo done
city Bayeux done
city Le+Havre done
city Rouen done
city Amiens done
city Paris done
city Lille done
city Strasbourg done
city Chateau+du+Haut+Koenigsbourg done
city Colmar done
city Eguisheim done
city Besancon done
city Dijon done
city Annecy done
city La+Rochelle done
city Grenoble done
city Bayonne done
city Montauban done
city Biarritz done
city Toulouse done
city Ariege done
city Carcassonne done
city Collioure done
city Saintes+Maries+de+la+mer done
city Aigues+Mortes done
city Nimes done
city Uzes done
city Avignon done
city Aix+en+Provence done
city Marseille done
city Cassis done
city Bormes+les+Mimosas done
city Gorges+du+Verdon done
city Lyon done
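
The cell above retries a failed request exactly once. A slightly more general pattern, sketched below with only the standard library, retries a callable a fixed number of times with a pause between attempts. The helper name `with_retries` and the `flaky_fetch` stand-in are hypothetical; in the notebook the callable would wrap the `requests.get` call, and the caught exception type would ideally be narrowed.

```python
import time


def with_retries(fetch, attempts=3, delay=1.0):
    """Call fetch() until it succeeds, at most `attempts` times, sleeping between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the last error propagate
            time.sleep(delay)


# Hypothetical flaky call standing in for requests.get(url, ...)
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "page content"


print(with_retries(flaky_fetch, attempts=3, delay=0.0))
```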


In [25]:
df

Unnamed: 0,ville,hotels,liens,lat,lon,description,score
0,Mont+Saint+Michel,"[Hôtel Vert, Hotel De La Digue, Le Saint Auber...",[https://www.booking.com/hotel/fr/vert.fr.html...,"[48.61470049, 48.61688155, 48.61293783, 48.635...","[-1.50961697, -1.51091784, -1.51010513, -1.510...",[\n\nVous pouvez bénéficier d'une réduction Ge...,"[8,0, 7,2, 7,2, 7,3, 7,1, 7,9, 7,2, 8,0, 8,2, ..."
1,St+Malo,"[Le Cargo by Cocoonr, Hotel d'Aleth, Le Hauban...",[https://www.booking.com/hotel/fr/le-cargo-sai...,"[48.64842800, 48.63593081, 48.64767450, 48.651...","[-2.02664200, -2.02171236, -2.02519510, -2.023...",[\nCet établissement est à 2 minutes à pied de...,"[8,3, 7,8, 9,0, 8,0, 8,1, 9,1, 8,3, 9,1, 8,5, ..."
2,Bayeux,"[Le Mogador, Le Castel Guesthouse, ibis budget...",[https://www.booking.com/hotel/fr/le-mogador-b...,"[49.27933593, 49.27368314, 49.25424209, 49.272...","[-0.70705610, -0.70322692, -0.64648747, -0.698...",[\nL'établissement Le Mogador est situé à Baye...,"[8,1, 8,6, 8,2, 9,3, 9,6, 8,5, 9,5, 9,8, 8,1, ..."
3,Le+Havre,"[Hilton Garden Inn Le Havre Centre, Hôtel Le M...",[https://www.booking.com/hotel/fr/hilton-garde...,"[49.49008699, 49.49227908, 49.49641693, 49.498...","[0.09774696, 0.12310803, 0.15151234, 0.1288370...",[\nLe Hilton Garden Inn Le Havre Centre est si...,"[8,7, 8,1, 7,6, 8,2, 8,5, 8,4, 7,0, 8,4, 7,8, ..."
4,Rouen,"[Radisson Blu Hotel, Rouen Centre, Cosy'Appart...",[https://www.booking.com/hotel/fr/radisson-blu...,"[49.44644100, 49.44573273, 49.44117285, 49.437...","[1.09412000, 1.09130477, 1.09517545, 1.0972528...",[\n\nVous pouvez bénéficier d'une réduction Ge...,"[8,9, 8,5, 8,2, 8,4, 7,5, 8,5, 8,2, 8,2, 8,7, ..."
5,Amiens,"[Le Majestic Cathédrale, Holiday Inn Express A...",[https://www.booking.com/hotel/fr/le-duplex-am...,"[49.89538310, 49.89167620, 49.89444519, 49.894...","[2.29777200, 2.30775386, 2.30094239, 2.3057780...",[\n\nVous pouvez bénéficier d'une réduction Ge...,"[9,3, 8,1, 8,3, 8,0, 8,9, 8,1, 8,8, 7,6, 7,3, ..."
6,Paris,"[Westside Arc de Triomphe Hotel, Hotel Korner ...",[https://www.booking.com/hotel/fr/westside-arc...,"[48.88129894, 48.84659045, 48.87693501, 48.859...","[2.29480304, 2.28857003, 2.32698441, 2.3684085...","[\nDoté d’un bar-salon, le Westside Arc de Tri...","[8,0, 8,5, 1.0, 8,1, 8,1, 8,2, 8,4, 8,6, 8,5, ..."
7,Lille,"[CALM Appart' & Hostel, Moxy Lille City, Hotel...",[https://www.booking.com/hotel/fr/calm-apparth...,"[50.63734618, 50.62783100, 50.63788756, 50.636...","[3.06945696, 3.06359200, 3.07268500, 3.0749092...",[\nVous pouvez bénéficier d'une réduction Geni...,"[8,1, 8,6, 8,0, 7,3, 7,7, 9,2, 8,0, 7,8, 8,5, ..."
8,Strasbourg,[Comfort Hotel Strasbourg - Montagne Verte & R...,[https://www.booking.com/hotel/fr/comforthotel...,"[48.57269177, 48.59001400, 48.58906194, 48.604...","[7.72871353, 7.73987000, 7.73879863, 7.7044611...","[\nSitué près de l'Ill, le Comfort Hotel Stras...","[8,3, 7,0, 7,1, 8,6, 7,8, 9,0, 8,7, 8,1, 8,1, ..."
9,Chateau+du+Haut+Koenigsbourg,"[Les Chambres du Haut-Koenigsbourg, Gîte L'Oré...",[https://www.booking.com/hotel/fr/les-chambres...,"[48.24702280, 48.24834892, 48.24840250, 48.180...","[7.34632093, 7.35484448, 7.35482570, 7.3092045...",[\nVous pouvez bénéficier d'une réduction Geni...,"[8,8, 9,5, 9,5, 9,5, 9,4, 9,0, 9,6, 9,3, 8,7, ..."
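
The scraped scores above are strings with a decimal comma (`8,0`), and the coordinates are strings too, so both need converting before they can be plotted or compared. The sketch below shows one way to do it; the helper names `parse_score` and `parse_latlng` are made up, and the fallback of `1.0` mirrors the placeholder used when a hotel has no score.

```python
def parse_score(text, default=1.0):
    """Convert a scraped score such as '8,0' to a float, falling back to `default`."""
    try:
        return float(text.strip().replace(",", "."))
    except (AttributeError, ValueError):
        return default


def parse_latlng(text):
    """Split a 'lat,lon' string into a (lat, lon) tuple of floats."""
    lat, lon = text.split(",")
    return float(lat), float(lon)


print([parse_score(s) for s in ["8,0", "7,2", "1.0"]])
print(parse_latlng("48.61470049,-1.50961697"))
```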


In [28]:
# Replace the '+' URL separators with spaces; regex=False treats '+' as a literal character
df['ville'] = df['ville'].str.replace('+', ' ', regex=False)

# Reorder the rows to follow the order of the `cities` list
df = df.set_index('ville')
df = df.reindex(cities)
df.reset_index(inplace=True)

# Expose the new positional index as an 'id' column
df.reset_index(inplace=True)
df.rename(columns={'index': 'id'}, inplace=True)
df




Unnamed: 0,id,ville,id.1,hotels,liens,lat,lon,description,score
0,0,Mont Saint Michel,0,"[Hôtel Vert, Hotel De La Digue, Le Saint Auber...",[https://www.booking.com/hotel/fr/vert.fr.html...,"[48.61470049, 48.61688155, 48.61293783, 48.635...","[-1.50961697, -1.51091784, -1.51010513, -1.510...",[\n\nVous pouvez bénéficier d'une réduction Ge...,"[8,0, 7,2, 7,2, 7,3, 7,1, 7,9, 7,2, 8,0, 8,2, ..."
1,1,St Malo,1,"[Le Cargo by Cocoonr, Hotel d'Aleth, Le Hauban...",[https://www.booking.com/hotel/fr/le-cargo-sai...,"[48.64842800, 48.63593081, 48.64767450, 48.651...","[-2.02664200, -2.02171236, -2.02519510, -2.023...",[\nCet établissement est à 2 minutes à pied de...,"[8,3, 7,8, 9,0, 8,0, 8,1, 9,1, 8,3, 9,1, 8,5, ..."
2,2,Bayeux,2,"[Le Mogador, Le Castel Guesthouse, ibis budget...",[https://www.booking.com/hotel/fr/le-mogador-b...,"[49.27933593, 49.27368314, 49.25424209, 49.272...","[-0.70705610, -0.70322692, -0.64648747, -0.698...",[\nL'établissement Le Mogador est situé à Baye...,"[8,1, 8,6, 8,2, 9,3, 9,6, 8,5, 9,5, 9,8, 8,1, ..."
3,3,Le Havre,3,"[Hilton Garden Inn Le Havre Centre, Hôtel Le M...",[https://www.booking.com/hotel/fr/hilton-garde...,"[49.49008699, 49.49227908, 49.49641693, 49.498...","[0.09774696, 0.12310803, 0.15151234, 0.1288370...",[\nLe Hilton Garden Inn Le Havre Centre est si...,"[8,7, 8,1, 7,6, 8,2, 8,5, 8,4, 7,0, 8,4, 7,8, ..."
4,4,Rouen,4,"[Radisson Blu Hotel, Rouen Centre, Cosy'Appart...",[https://www.booking.com/hotel/fr/radisson-blu...,"[49.44644100, 49.44573273, 49.44117285, 49.437...","[1.09412000, 1.09130477, 1.09517545, 1.0972528...",[\n\nVous pouvez bénéficier d'une réduction Ge...,"[8,9, 8,5, 8,2, 8,4, 7,5, 8,5, 8,2, 8,2, 8,7, ..."
5,5,Paris,5,"[Westside Arc de Triomphe Hotel, Hotel Korner ...",[https://www.booking.com/hotel/fr/westside-arc...,"[48.88129894, 48.84659045, 48.87693501, 48.859...","[2.29480304, 2.28857003, 2.32698441, 2.3684085...","[\nDoté d’un bar-salon, le Westside Arc de Tri...","[8,0, 8,5, 1.0, 8,1, 8,1, 8,2, 8,4, 8,6, 8,5, ..."
6,6,Amiens,6,"[Le Majestic Cathédrale, Holiday Inn Express A...",[https://www.booking.com/hotel/fr/le-duplex-am...,"[49.89538310, 49.89167620, 49.89444519, 49.894...","[2.29777200, 2.30775386, 2.30094239, 2.3057780...",[\n\nVous pouvez bénéficier d'une réduction Ge...,"[9,3, 8,1, 8,3, 8,0, 8,9, 8,1, 8,8, 7,6, 7,3, ..."
7,7,Lille,7,"[CALM Appart' & Hostel, Moxy Lille City, Hotel...",[https://www.booking.com/hotel/fr/calm-apparth...,"[50.63734618, 50.62783100, 50.63788756, 50.636...","[3.06945696, 3.06359200, 3.07268500, 3.0749092...",[\nVous pouvez bénéficier d'une réduction Geni...,"[8,1, 8,6, 8,0, 7,3, 7,7, 9,2, 8,0, 7,8, 8,5, ..."
8,8,Strasbourg,8,[Comfort Hotel Strasbourg - Montagne Verte & R...,[https://www.booking.com/hotel/fr/comforthotel...,"[48.57269177, 48.59001400, 48.58906194, 48.604...","[7.72871353, 7.73987000, 7.73879863, 7.7044611...","[\nSitué près de l'Ill, le Comfort Hotel Stras...","[8,3, 7,0, 7,1, 8,6, 7,8, 9,0, 8,7, 8,1, 8,1, ..."
9,9,Chateau du Haut Koenigsbourg,9,"[Les Chambres du Haut-Koenigsbourg, Gîte L'Oré...",[https://www.booking.com/hotel/fr/les-chambres...,"[48.24702280, 48.24834892, 48.24840250, 48.180...","[7.34632093, 7.35484448, 7.35482570, 7.3092045...",[\nVous pouvez bénéficier d'une réduction Geni...,"[8,8, 9,5, 9,5, 9,5, 9,4, 9,0, 9,6, 9,3, 8,7, ..."
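
The `score` column above shows ratings written with a decimal comma (for example `8,7`), plus at least one value that already uses a dot (`1.0`). Before ranking hotels, these values probably need to be converted to floats. The sketch below is one possible way to do it, assuming each rating is available as a separate string; `parse_score` and `parse_scores` are hypothetical helper names, not part of the original notebook.

```python
def parse_score(raw):
    """Convert a rating such as '8,7' (decimal comma) to the float 8.7.

    Values that already use a decimal point, such as '1.0', pass through unchanged.
    """
    return float(str(raw).strip().replace(",", "."))


def parse_scores(raw_scores):
    """Convert a list of rating strings to a list of floats."""
    return [parse_score(s) for s in raw_scores]


# Sample values taken from the table above
scores = parse_scores(["8,7", "8,1", "1.0"])
print(scores)  # [8.7, 8.1, 1.0]
```

If the column holds Python lists of strings, something like `df['score'].apply(parse_scores)` would apply the conversion row by row. If the lists were serialized as text in a CSV, they would need to be parsed back into lists first.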


## Create your data lake using S3
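
A minimal sketch of how the hotel DataFrame could be pushed to S3 as a CSV file. It assumes a boto3 S3 client is passed in (for instance one created with `boto3.client("s3")` once AWS credentials are configured). The function name, bucket name and object key are hypothetical placeholders, not values from this project.

```python
import io


def upload_dataframe_as_csv(df, s3_client, bucket, key):
    """Serialize `df` to CSV in memory and store it as a single S3 object.

    `s3_client` is assumed to expose boto3's `put_object(Bucket=..., Key=..., Body=...)`.
    Returns the S3 URI of the uploaded object.
    """
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)  # write the CSV into the in-memory buffer
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=buffer.getvalue().encode("utf-8"),
    )
    return f"s3://{bucket}/{key}"
```

A call might look like `upload_dataframe_as_csv(df, s3_client, "my-kayak-bucket", "hotels.csv")`, where the bucket must already exist and the name shown here is only an example. Building the CSV in memory avoids writing a temporary file, which is reasonable for a dataset of this size.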

## ETL