⚠️ This project is mandatory for certification bloc #1.

![Kayak](https://seekvectorlogo.com/wp-content/uploads/2018/01/kayak-vector-logo.png)

# Plan your trip with Kayak 

## Company's description 📇

<a href="https://www.kayak.com" target="_blank">Kayak</a> is a travel search engine that helps user plan their next trip at the best price.

The company was founded in 2004 by Steve Hafner & Paul M. English. After a few rounds of fundraising, Kayak was acquired by <a href="https://www.bookingholdings.com/" target="_blank">Booking Holdings</a> which now holds: 

* <a href="https://booking.com/" target="_blank">Booking.com</a>
* <a href="https://kayak.com/" target="_blank">Kayak</a>
* <a href="https://www.priceline.com/" target="_blank">Priceline</a>
* <a href="https://www.agoda.com/" target="_blank">Agoda</a>
* <a href="https://Rentalcars.com/" target="_blank">RentalCars</a>
* <a href="https://www.opentable.com/" target="_blank">OpenTable</a>

With over \$300 million revenue a year, Kayak operates in almost all countries and all languages to help their users book travels accros the globe. 

## Project 🚧

The marketing team needs help on a new project. After doing some user research, the team discovered that **70% of their users who are planning a trip would like to have more information about the destination they are going to**. 

In addition, user research shows that **people tend to be defiant about the information they are reading if they don't know the brand** which produced the content. 

Therefore, Kayak Marketing Team would like to create an application that will recommend where people should plan their next holidays. The application should be based on real data about:

* Weather 
* Hotels in the area 

The application should then be able to recommend the best destinations and hotels based on the above variables at any given time. 

## Goals 🎯

As the project has just started, your team doesn't have any data that can be used to create this application. Therefore, your job will be to: 

* Scrape data from destinations 
* Get weather data from each destination 
* Get hotels' info about each destination
* Store all the information above in a data lake
* Extract, transform and load cleaned data from your datalake to a data warehouse

## Scope of this project 🖼️

Marketing team wants to focus first on the best cities to travel to in France. According <a href="https://one-week-in.com/35-cities-to-visit-in-france/" target="_blank">One Week In.com</a> here are the top-35 cities to visit in France: 

```python 
["Mont Saint Michel",
"St Malo",
"Bayeux",
"Le Havre",
"Rouen",
"Paris",
"Amiens",
"Lille",
"Strasbourg",
"Chateau du Haut Koenigsbourg",
"Colmar",
"Eguisheim",
"Besancon",
"Dijon",
"Annecy",
"Grenoble",
"Lyon",
"Gorges du Verdon",
"Bormes les Mimosas",
"Cassis",
"Marseille",
"Aix en Provence",
"Avignon",
"Uzes",
"Nimes",
"Aigues Mortes",
"Saintes Maries de la mer",
"Collioure",
"Carcassonne",
"Ariege",
"Toulouse",
"Montauban",
"Biarritz",
"Bayonne",
"La Rochelle"]
```

Your team should focus **only on the above cities for your project**. 


## Helpers 🦮

To help you achieve this project, here are a few tips that should help you

### Get weather data with an API 

*   Use https://nominatim.org/ to get the gps coordinates of all the cities (no subscription required) Documentation : https://nominatim.org/release-docs/develop/api/Search/

*   Use https://openweathermap.org/appid (you have to subscribe to get a free apikey) and https://openweathermap.org/api/one-call-api to get some information about the weather for the 35 cities and put it in a DataFrame

*   Determine the list of cities where the weather will be the nicest within the next 7 days For example, you can use the values of daily.pop and daily.rain to compute the expected volume of rain within the next 7 days... But it's only an example, actually you can have different opinions on a what a nice weather would be like 😎 Maybe the most important criterion for you is the temperature or humidity, so feel free to change the rules !

*   Save all the results in a `.csv` file, you will use it later 😉 You can save all the informations that seem important to you ! Don't forget to save the name of the cities, and also to create a column containing a unique identifier (id) of each city (this is important for what's next in the project)

*   Use plotly to display the best destinations on a map

### Scrape Booking.com 

Since BookingHoldings doesn't have aggregated databases, it will be much faster to scrape data directly from booking.com 

You can scrap as many information asyou want, but we suggest that you get at least:

*   hotel name,
*   Url to its booking.com page,
*   Its coordinates: latitude and longitude
*   Score given by the website users
*   Text description of the hotel


### Create your data lake using S3 

Once you managed to build your dataset, you should store into S3 as a csv file. 

### ETL 

Once you uploaded your data onto S3, it will be better for the next data analysis team to extract clean data directly from a Data Warehouse. Therefore, create a SQL Database using AWS RDS, extract your data from S3 and store it in your newly created DB. 

## Deliverable 📬

To complete this project, your team should deliver:

* A `.csv` file in an S3 bucket containing enriched information about weather and hotels for each french city

* A SQL Database where we should be able to get the same cleaned data from S3 

* Two maps where you should have a Top-5 destinations and a Top-20 hotels in the area. You can use plotly or any other library to do so. It should look something like this: 

![Map](https://full-stack-assets.s3.eu-west-3.amazonaws.com/images/Kayak_best_destination_project.png)

### Get weather data with an API 

In [3]:
import pandas as pd
import numpy as np
import plotly.express as px
import requests
import time

#from dotenv import load_dotenv
import os

'''load_dotenv()
%load_ext dotenv
%dotenv'''

key = os.getenv('APIKEY')

# Variable to save bucketname
BUCKETNAME = os.getenv('BUCKETNAME')

'''# Variables to connect to my RDS PostgresSQL Database
DBUSERNAME = os.getenv('DBUSERNAME')
DBPASSWORD = os.getenv('DBPASSWORD')
DBHOSTNAME = os.getenv('DBHOSTNAME')
DBNAME = os.getenv('DBNAME')'''

import logging
import scrapy
from scrapy.crawler import CrawlerProcess
from bs4 import BeautifulSoup

import warnings
warnings.filterwarnings("ignore")

In [1]:
cities = ["Mont Saint Michel", "St Malo", "Bayeux", "Le Havre", "Rouen", "Paris", "Amiens", "Lille", "Strasbourg", "Chateau du Haut Koenigsbourg", "Colmar", "Eguisheim", "Besancon", "Dijon", "Annecy", "Grenoble", "Lyon", "Gorges du Verdon", "Bormes les Mimosas", "Cassis", "Marseille", "Aix en Provence", "Avignon", "Uzes", "Nimes", "Aigues Mortes", "Saintes Maries de la mer", "Collioure", "Carcassonne", "Ariege", "Toulouse", "Montauban", "Biarritz", "Bayonne", "La Rochelle"]

In [4]:
# Create a dataframe to store cities name
df_cities = pd.DataFrame(columns=["city"])
df_cities['city'] = cities
df_cities.reset_index(inplace=True)
df_cities.rename(columns={'index': 'id'}, inplace=True)
df_cities.head(2)

Unnamed: 0,id,city
0,0,Mont Saint Michel
1,1,St Malo


### GPS Coordinates

In [5]:
# Function to get the GPS coordinates from openstreetmap API
# The code within this cell has been made with the help of another alumni: Guillaume Arp
def get_gps(df):

    df_new = df.copy()
    lat_list = []
    lon_list = []
    for i in cities:
        if i == "Gorges du Verdon":
            i = "La%20Palud-sur-Verdon"
            r = requests.get(f"https://nominatim.openstreetmap.org/search?city={i}&format=json").json()
            lat_list.append(r[0]['lat'])
            lon_list.append(r[0]['lon'])
        elif i == 'Ariege':
            r = requests.get(f"https://nominatim.openstreetmap.org/search?county={i}&format=json").json()
            lat_list.append(r[0]['lat'])
            lon_list.append(r[0]['lon'])
        else:
            name = i.replace(" ", "%20")
            r = requests.get(f"https://nominatim.openstreetmap.org/search?city={name}&format=json").json()
            lat_list.append(r[0]['lat'])
            lon_list.append(r[0]['lon'])
        
    df_new['lat'] = lat_list
    df_new['lon'] = lon_list 
    return df_new

In [6]:
df_gps = get_gps(df_cities)
df_gps.head(2)

Unnamed: 0,id,city,lat,lon
0,0,Mont Saint Michel,48.6359541,-1.511459954959514
1,1,St Malo,48.649518,-2.0260409


### Weather Data

In [9]:
# Function to get the weather data from openweathermap API
# The code within this cell has been made with the help of another alumni: Guillaume Arp
def get_weather(df):
    
    df_new = df.copy()
    temps_list = []
    rain_pop = []
    humidity_list = []
    days = list(range(1,8))
    
    for i in df.itertuples():
        lat = i.lat
        lon = i.lon
        r = requests.get(f"https://api.openweathermap.org/data/2.5/onecall?lat={lat}&lon={lon}&units=metric&appid={key}").json()
        weather_7_days = r["daily"][1:]
        temps = [i['feels_like']['day'] for i in weather_7_days]
        rain = [i['pop'] * 100 for i in weather_7_days]
        humidity = [i['humidity'] for i in weather_7_days]
        temps_list.append(temps)
        rain_pop.append(rain)
        humidity_list.append(humidity)
        
    df_new['day_plus'] = [days for _ in range(len(df))]
    df_new['felt_temperature'] = temps_list
    df_new['rain_chances'] = rain_pop
    df_new['humidity'] = humidity_list
    df_new['score'] = df_new.apply(lambda x: ((35 - np.mean(x['felt_temperature'])) * 2) + np.mean(x['rain_chances']) + (np.mean(x['humidity']) / 2), axis=1)
    
    return df_new

In [10]:
df_full = get_weather(df_gps)
df_full.head(2)

KeyError: 'daily'

### Scrape Booking.com 

In [11]:
# Store cities name into a DataFrame for next step
df_booking = pd.DataFrame(cities, columns=["city"])
df_booking


Unnamed: 0,city
0,Mont Saint Michel
1,St Malo
2,Bayeux
3,Le Havre
4,Rouen
5,Paris
6,Amiens
7,Lille
8,Strasbourg
9,Chateau du Haut Koenigsbourg


In [12]:
# Configure Scrapy to iterate over each city to gather the name and the URL of all their respective hotels
class BookingSpider(scrapy.Spider):
    name = "Booking_data"
    cities = df_booking["city"]
    start_urls = ['https://www.booking.com/index.fr.html']

    def parse(self, response):
        for x in cities:
            yield scrapy.FormRequest.from_response(
            response,
            formdata={'ss': x },
            callback=self.after_search
        )

    def after_search(self, response):
        cities = response.url.split("ss=")[-1].split("&")[0]
                
        booking = response.css('.d4924c9e74')
        
        for data in booking:
            
            yield {
                'location': cities,
                'name': data.css('a div.fcab3ed991.a23c043802::text').getall(),
                'url': data.css('h3.a4225678b2 a::attr(href)').getall(),
            }

        try:
            next_page = response.css('a.paging-next').attrib["href"]
        except KeyError:
            logging.info('No next page. Terminating crawling process.')
        else:
            yield response.follow(next_page, callback=self.after_search)

In [13]:
# Start the scraping process and save the file to json format
filename = "cities.json"

if filename in os.listdir():
        os.remove(filename)

process = CrawlerProcess(settings = {
    'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0',
    'LOG_LEVEL': logging.INFO,
    "FEEDS": {
        filename: {"format": "json"},
    },
    "AUTOTHROTTLE_ENABLED": True
})

process.crawl(BookingSpider)
process.start()

2022-06-22 13:24:26 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-06-22 13:24:26 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.2.0, Python 3.9.12 (main, Apr  5 2022, 01:53:17) - [Clang 12.0.0 ], pyOpenSSL 21.0.0 (OpenSSL 1.1.1o  3 May 2022), cryptography 3.4.8, Platform macOS-10.16-x86_64-i386-64bit
2022-06-22 13:24:26 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'LOG_LEVEL': 20,
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:92.0) '
               'Gecko/20100101 Firefox/92.0'}
2022-06-22 13:24:26 [scrapy.extensions.telnet] INFO: Telnet Password: 27f7029f1b6b44ad
2022-06-22 13:24:26 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrap

In [None]:
# Create a DataFrame from the json file
df = pd.read_json("cities.json")
df.head()

Unnamed: 0,location,name,url
0,Mont+Saint+Michel,"[Hôtel Vert, Hotel Gabriel, Les Terrasses Poul...",[https://www.booking.com/hotel/fr/vert.fr.html...
1,St+Malo,"[Studio cocooning, Escale Iodée ST-MALO - SOLI...",[https://www.booking.com/hotel/fr/studio-cocoo...
2,Bayeux,"[Le Castel Guesthouse, Hotel Particulier de Sa...",[https://www.booking.com/hotel/fr/le-castel-no...
3,Le+Havre,"[Holiday Inn Express - Le Havre Centre, Hilton...",[https://www.booking.com/hotel/fr/campanile-le...
4,Rouen,"[Radisson Blu Hotel, Rouen Centre, Passo, Merc...",[https://www.booking.com/hotel/fr/radisson-blu...


In [None]:
# Reduce number of hotels per city to 20 as requested by the task
for i in range (len(df["location"])):
    df["name"][i] = df["name"][i][0:20]
    df["url"][i] = df["url"][i][0:20]

In [None]:
# BeautifulSoup iterates on each URL to get the hotel's score, GPS coordinates and description.
df["lat"] = 0
df["lon"] = 0
df["description"] = "test"
df["score"] = 0.0

navigator = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1)'

for i in range(len(df["url"])):
    lat_list = []
    lon_list = []
    description_list = []
    score_list = []

    hotel_list = df["url"][i]

    for i2 in hotel_list:

        # Sometimes BeautifulSoup doesn't manage to gather data. When it fails, it tries again.
        try:
            page = requests.get(i2, headers={'User-Agent': navigator})
            soup = BeautifulSoup(page.text, 'html.parser')
        except:
            page = requests.get(i2, headers={'User-Agent': navigator})
            soup = BeautifulSoup(page.text, 'html.parser')

        lat_list.append(soup.select('p.address.address_clean a')[0].get("data-atlas-latlng").split(",")[0])
        lon_list.append(soup.select('p.address.address_clean a')[0].get("data-atlas-latlng").split(",")[1])
        description_list.append(soup.select('div#property_description_content')[0].get_text())

        try:
            score_list.append(soup.select('div.b5cd09854e.d10a6220b4')[0].get_text())
            
        except:
            # 2 hotels over the 700 that I am going to scrap dont have a score yet but I still need one for the visualization. I set it to 1.
            score_list.append("1.0")

        time.sleep(1.4)
  
    df["lat"][i] = lat_list
    df["lon"][i] = lon_list
    df["description"][i] = description_list
    df["score"][i] = score_list
    
    print (f"city {df['location'].iloc[i]} done")

city Mont+Saint+Michel done
city St+Malo done
city Bayeux done
city Le+Havre done
city Rouen done
city Paris done
city Amiens done


In [None]:
# Add an ID to each city to be able to merge the scrapped data with the API data later
df["location"] = df["location"].str.replace("+", " ")
df = df.set_index("location")
df = df.reindex(cities)
df.reset_index(inplace=True)

df.reset_index(inplace=True)
df.rename(columns={'index': 'id'}, inplace=True)
df.head(2)

Unnamed: 0,id,location,name,url,lat,lon,description,score
0,0,Mont Saint Michel,"[Hôtel Vert, Le Relais Saint Michel, La Mère P...",[https://www.booking.com/hotel/fr/vert.fr.html...,"[48.61470049, 48.61758727, 48.63508532, 48.614...","[-1.50961697, -1.51039615, -1.51053965, -1.510...",[\nVous pouvez bénéficier d'une réduction Geni...,"[8,1, 7,8, 7,2, 8,2, 7,3, 7,2, 8,0, 8,1, 7,2, ..."
1,1,St Malo,"[Hotel Eden, Hotel d'Aleth, Hotel Ajoncs d'Or,...",[https://www.booking.com/hotel/fr/eden-saint-m...,"[48.66190919, 48.63593081, 48.64735692, 48.647...","[-1.98966533, -2.02171236, -2.02519655, -2.028...",[\nVous pouvez bénéficier d'une réduction Geni...,"[7,3, 7,9, 8,5, 7,4, 9,2, 7,9, 9,7, 8,0, 8,2, ..."


# Datalake S3

In [None]:
display(df.head(2))
display(df.shape)
display(df_full.head(2))
display(df_full.shape)

Unnamed: 0,id,location,name,url,lat,lon,description,score
0,0,Mont Saint Michel,"[Hôtel Vert, Le Relais Saint Michel, La Mère P...",[https://www.booking.com/hotel/fr/vert.fr.html...,"[48.61470049, 48.61758727, 48.63508532, 48.614...","[-1.50961697, -1.51039615, -1.51053965, -1.510...",[\nVous pouvez bénéficier d'une réduction Geni...,"[8,1, 7,8, 7,2, 8,2, 7,3, 7,2, 8,0, 8,1, 7,2, ..."
1,1,St Malo,"[Hotel Eden, Hotel d'Aleth, Hotel Ajoncs d'Or,...",[https://www.booking.com/hotel/fr/eden-saint-m...,"[48.66190919, 48.63593081, 48.64735692, 48.647...","[-1.98966533, -2.02171236, -2.02519655, -2.028...",[\nVous pouvez bénéficier d'une réduction Geni...,"[7,3, 7,9, 8,5, 7,4, 9,2, 7,9, 9,7, 8,0, 8,2, ..."


(35, 8)

Unnamed: 0,id,city,lat,lon,day_plus,felt_temperature,rain_chances,humidity,score
0,0,Mont Saint Michel,48.6359541,-1.511459954959514,"[1, 2, 3, 4, 5, 6, 7]","[17.48, 18.47, 15.31, 10.6, 10.91, 7.45, 11.52]","[56.99999999999999, 0, 0, 38.0, 79.0, 82.0, 87.0]","[72, 60, 50, 80, 68, 85, 72]",127.574286
1,1,St Malo,48.649518,-2.0260409,"[1, 2, 3, 4, 5, 6, 7]","[14.66, 15.23, 14.15, 10.06, 9.83, 6.6, 10.4]","[24.0, 0, 0, 38.0, 62.0, 98.0, 83.0]","[75, 75, 58, 72, 70, 84, 75]",126.805714


(35, 9)

In [None]:
# Merge the scrapped data with the API data 
# I would not naturally do that by myself but I did it for the sake of the task as only one csv file is requested within the Datalake
df_full_raw = pd.merge(df_full, df, on='id', how='outer')
display(df_full_raw.head(2))
display(df_full_raw.shape)

Unnamed: 0,id,city,lat_x,lon_x,day_plus,felt_temperature,rain_chances,humidity,score_x,location,name,url,lat_y,lon_y,description,score_y
0,0,Mont Saint Michel,48.6359541,-1.511459954959514,"[1, 2, 3, 4, 5, 6, 7]","[17.48, 18.47, 15.31, 10.6, 10.91, 7.45, 11.52]","[56.99999999999999, 0, 0, 38.0, 79.0, 82.0, 87.0]","[72, 60, 50, 80, 68, 85, 72]",127.574286,Mont Saint Michel,"[Hôtel Vert, Le Relais Saint Michel, La Mère P...",[https://www.booking.com/hotel/fr/vert.fr.html...,"[48.61470049, 48.61758727, 48.63508532, 48.614...","[-1.50961697, -1.51039615, -1.51053965, -1.510...",[\nVous pouvez bénéficier d'une réduction Geni...,"[8,1, 7,8, 7,2, 8,2, 7,3, 7,2, 8,0, 8,1, 7,2, ..."
1,1,St Malo,48.649518,-2.0260409,"[1, 2, 3, 4, 5, 6, 7]","[14.66, 15.23, 14.15, 10.06, 9.83, 6.6, 10.4]","[24.0, 0, 0, 38.0, 62.0, 98.0, 83.0]","[75, 75, 58, 72, 70, 84, 75]",126.805714,St Malo,"[Hotel Eden, Hotel d'Aleth, Hotel Ajoncs d'Or,...",[https://www.booking.com/hotel/fr/eden-saint-m...,"[48.66190919, 48.63593081, 48.64735692, 48.647...","[-1.98966533, -2.02171236, -2.02519655, -2.028...",[\nVous pouvez bénéficier d'une réduction Geni...,"[7,3, 7,9, 8,5, 7,4, 9,2, 7,9, 9,7, 8,0, 8,2, ..."


(35, 16)

In [None]:
# Create a bucket on amazon S3 to store the data
import boto3

session = boto3.Session()

s3 = session.resource("s3")

bucket = s3.create_bucket(Bucket=BUCKETNAME, CreateBucketConfiguration={'LocationConstraint': 'eu-west-3'})

2022-04-15 01:02:39 [botocore.credentials] INFO: Found credentials in shared credentials file: ~/.aws/credentials


In [None]:
# Transform DataFrame to csv
csv = df_full_raw.to_csv(index = False)

# Upload csv file to the bucket
put_object = bucket.put_object(Key="df_full_raw.csv", Body=csv, 
# ACL='public-read'
)

# Open the bucket to get the csv file and retransform it into a DataFrame
data = pd.read_csv('s3://bloc-1-bq/df_full_raw.csv')
data.shape

2022-04-15 01:02:44 [botocore.credentials] INFO: Found credentials in shared credentials file: ~/.aws/credentials


(35, 16)

In [None]:
# Split the raw data obtain from the datalake into 2 specific DataFrames
data_weather = data[["id", "city", "lat_x", "lon_x", "day_plus", "felt_temperature", "rain_chances", "humidity", "score_x"]]
data_hotels = data[["id", "city", "name", "url", "lat_y", "lon_y", "description", "score_y"]]
data_hotels.head(2)

Unnamed: 0,id,city,name,url,lat_y,lon_y,description,score_y
0,0,Mont Saint Michel,"['Hôtel Vert', 'Le Relais Saint Michel', 'La M...",['https://www.booking.com/hotel/fr/vert.fr.htm...,"['48.61470049', '48.61758727', '48.63508532', ...","['-1.50961697', '-1.51039615', '-1.51053965', ...","[""\nVous pouvez bénéficier d'une réduction Gen...","['8,1', '7,8', '7,2', '8,2', '7,3', '7,2', '8,..."
1,1,St Malo,"['Hotel Eden', ""Hotel d'Aleth"", ""Hotel Ajoncs ...",['https://www.booking.com/hotel/fr/eden-saint-...,"['48.66190919', '48.63593081', '48.64735692', ...","['-1.98966533', '-2.02171236', '-2.02519655', ...","[""\nVous pouvez bénéficier d'une réduction Gen...","['7,3', '7,9', '8,5', '7,4', '9,2', '7,9', '9,..."


In [None]:
# When I turned the DataFrames into csv file, it transformed the lists into strings and it's time to revert them back to lists
data_hotels["name"] = data_hotels["name"].apply(eval)
data_hotels["url"] = data_hotels["url"].apply(eval)
data_hotels["lat_y"] = data_hotels["lat_y"].apply(eval)
data_hotels["lon_y"] = data_hotels["lon_y"].apply(eval)
data_hotels["description"] = data_hotels["description"].apply(eval)
data_hotels["score_y"] = data_hotels["score_y"].apply(eval)

display(data_hotels.head(2))
display(data_hotels.shape)

data_weather["day_plus"] = data_weather["day_plus"].apply(eval)
data_weather["felt_temperature"] = data_weather["felt_temperature"].apply(eval)
data_weather["rain_chances"] = data_weather["rain_chances"].apply(eval)
data_weather["humidity"] = data_weather["humidity"].apply(eval)

display(data_weather.head(2))
display(data_weather.shape)

Unnamed: 0,id,city,name,url,lat_y,lon_y,description,score_y
0,0,Mont Saint Michel,"[Hôtel Vert, Le Relais Saint Michel, La Mère P...",[https://www.booking.com/hotel/fr/vert.fr.html...,"[48.61470049, 48.61758727, 48.63508532, 48.614...","[-1.50961697, -1.51039615, -1.51053965, -1.510...",[\nVous pouvez bénéficier d'une réduction Geni...,"[8,1, 7,8, 7,2, 8,2, 7,3, 7,2, 8,0, 8,1, 7,2, ..."
1,1,St Malo,"[Hotel Eden, Hotel d'Aleth, Hotel Ajoncs d'Or,...",[https://www.booking.com/hotel/fr/eden-saint-m...,"[48.66190919, 48.63593081, 48.64735692, 48.647...","[-1.98966533, -2.02171236, -2.02519655, -2.028...",[\nVous pouvez bénéficier d'une réduction Geni...,"[7,3, 7,9, 8,5, 7,4, 9,2, 7,9, 9,7, 8,0, 8,2, ..."


(35, 8)

Unnamed: 0,id,city,lat_x,lon_x,day_plus,felt_temperature,rain_chances,humidity,score_x
0,0,Mont Saint Michel,48.635954,-1.51146,"[1, 2, 3, 4, 5, 6, 7]","[17.48, 18.47, 15.31, 10.6, 10.91, 7.45, 11.52]","[56.99999999999999, 0, 0, 38.0, 79.0, 82.0, 87.0]","[72, 60, 50, 80, 68, 85, 72]",127.574286
1,1,St Malo,48.649518,-2.026041,"[1, 2, 3, 4, 5, 6, 7]","[14.66, 15.23, 14.15, 10.06, 9.83, 6.6, 10.4]","[24.0, 0, 0, 38.0, 62.0, 98.0, 83.0]","[75, 75, 58, 72, 70, 84, 75]",126.805714


(35, 9)

In [None]:
df_hotels_2 = data_hotels.copy()
# Explode the lists into separate series
df_hotels_2 = df_hotels_2.apply(pd.Series.explode)
df_hotels_2["score_y"].replace(",", ".", regex=True, inplace=True)
df_hotels_2[["lat_y", "lon_y", "score_y"]] = df_hotels_2[["lat_y", "lon_y", "score_y"]].apply(pd.to_numeric)

display(df_hotels_2.head(2))
display(df_hotels_2.shape)


df_weather_2 = data_weather.sort_values('score_x').head()
df_weather_2.reset_index(inplace=True, drop=True)
df_weather_2 = df_weather_2.apply(pd.Series.explode)
df_weather_2[['lat_x', 'lon_x', 'day_plus', 'felt_temperature', 'rain_chances', 'humidity']] = df_weather_2[['lat_x', 'lon_x', 'day_plus', 'felt_temperature', 'rain_chances', 'humidity']].apply(pd.to_numeric)

display(df_weather_2.head(2))
display(df_weather_2.shape)

Unnamed: 0,id,city,name,url,lat_y,lon_y,description,score_y
0,0,Mont Saint Michel,Hôtel Vert,https://www.booking.com/hotel/fr/vert.fr.html?...,48.6147,-1.509617,\nVous pouvez bénéficier d'une réduction Geniu...,8.1
0,0,Mont Saint Michel,Le Relais Saint Michel,https://www.booking.com/hotel/fr/le-relais-sai...,48.617587,-1.510396,\nVous pouvez bénéficier d'une réduction Geniu...,7.8


(700, 8)

Unnamed: 0,id,city,lat_x,lon_x,day_plus,felt_temperature,rain_chances,humidity,score_x
0,5,Paris,48.85889,2.320041,1,18.24,10.0,51,76.711429
0,5,Paris,48.85889,2.320041,2,16.73,0.0,45,76.711429


(35, 9)

# Data Warehouse PostgresSQL RDS

In [None]:
from sqlalchemy import create_engine, text

# Connect to my postgresSQL database
engine = create_engine(f"postgresql+psycopg2://{DBUSERNAME}:{DBPASSWORD}@{DBHOSTNAME}/{DBNAME}", echo=True)

In [None]:
# Initialize a sessionmaker 
from sqlalchemy.orm import sessionmaker 
Session = sessionmaker(bind=engine)

# Instanciate Session 
session = Session()

In [None]:
# Transform the data_weather DataFrame into a SQL table inside the RDS Data Warehouse
df_weather_2.to_sql(
    "df_weather",
    engine
)

In [None]:
# Transform the df_final DataFrame into a SQL table inside the RDS Data Warehouse
df_hotels_2.to_sql(
    "df_final",
    engine
)

In [None]:
# SQL Request to get the data from the RDS Data Warehouse
df_hotel_1 = text("SELECT * FROM df_final")
df_final_2 = pd.read_sql(df_hotel_1, engine)

df_weather_1 = text("SELECT * FROM df_weather")
df_weather_2 = pd.read_sql(df_weather_1, engine)

# Visualisation

In [None]:
px.set_mapbox_access_token(open(".mapbox_token").read())
fig = px.scatter_mapbox(
    df_weather_2.sort_values('day_plus'),
    lat='lat_x',
    lon='lon_x',
    color='felt_temperature',
    size='humidity',
    color_continuous_scale=px.colors.diverging.Picnic,
    size_max=30,
    zoom=4.7,
    hover_name='city',
    hover_data={
        'lat_x': False,
        'lon_x': False,
        'day_plus': False,
        'rain_chances': True,
        'humidity': True,
        'felt_temperature': True        
        },
    animation_frame='day_plus'
)

fig.update_layout(width = 1300, height = 800, template='plotly_dark' ,title='Best cities to visit in the next 7 days')
fig.show()

In [None]:
best_city_list = df_weather_2["id"].tolist()
df_final_2 = df_hotels_2[df_hotels_2["id"].isin(best_city_list)]
df_final_2["city"].value_counts()

Paris        20
Lille        20
Uzes         20
Nimes        20
Collioure    20
Name: city, dtype: int64

In [None]:
px.set_mapbox_access_token(open(".mapbox_token").read())
fig = px.scatter_mapbox(
    df_final_2.sort_values('name'),
    lat='lat_y',
    lon='lon_y',
    color='score_y',
    size='score_y',
    color_continuous_scale=px.colors.diverging.BrBG,
    size_max=30,
    zoom=5,
    hover_name='name'
)

fig.update_layout(width = 1300, height = 800, template='plotly_dark' ,title='Best hotels in the area')
fig.show()