# Practical Session 3/4 : Flights Dataset Analysis
In this session, we will conduct a few analyses on a simplified flight fares dataset.  
In particular we will try to build cheapest routes from one point to another.

## Grading and Instructions
You must return your notebook before **Wednesday March 2nd 23:59 Paris time** by email to David : d.diebold@criteo.com.  
The grade will be based on :
1. Timely return
2. Correctness (some questions may still leave you some freedom)
3. Report formatting : While we allow you to return your project in notebook format, you should think of your report as a classic text-and-image PDF report in which the code is in an appendix. That means your notebook should be fully readable with all the code cells hidden.
4. Code readability (factorized code, well-named variables, explanations when the code becomes complicated, etc.)
5. Performance : this is not a race, but we want you to think about performance issues (shuffles, etc.) when designing your solution. Don't hesitate to annotate your notebook with any remarks about your solution.


## Install Spark Environment
Since we are not running on Databricks, we need to install Spark ourselves every time we run the session.  
We need to install Spark, as well as a Java Runtime Environment.  
Then we need to set up a few environment variables.  


In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!curl -O https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
!pip install -q findspark



In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"

In [None]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf

conf = SparkConf().set('spark.ui.port', '4050')
sc = SparkContext(conf=conf)
spark = SparkSession.builder.master('local[*]').getOrCreate()

## Optional step : Enable the Spark UI through a secure tunnel
This step is useful if you want to look at the Spark UI.
First, you need to create a free ngrok account : https://dashboard.ngrok.com/login.  
Then log in on the website and copy your AuthToken.


In [None]:
# this step downloads ngrok, configures your AuthToken, then starts the tunnel
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
!./ngrok authtoken my_ngrok_auth_token_retrieved_from_website # <-------------- change this line !
get_ipython().system_raw('./ngrok http 4050 &')

Authtoken saved to configuration file: /root/.ngrok2/ngrok.yml


Now get the Spark UI URL at https://dashboard.ngrok.com/endpoints/status. We're done !

## Useful imports

In [None]:
import time
import numpy as np
import pyspark.sql.functions as F
%matplotlib inline

## Introduction
The aim of this notebook is to help you get comfortable with the Spark DataFrame API while working on a flights dataset.  
This dataset contains domestic flight prices for the United States.  
We will call a route a tuple identified by an origin airport and a destination airport.  
We will try to find out what the best options are for a traveler going from one place to another.  
Here is a short description of the columns:
- ItinID & MktID: roughly indicate the order in which tickets were ordered (lower IDs were ordered first)
- MktCoupons: the number of coupons in the market for that flight
- Quarter: 1, 2, 3, or 4, all of which are in 2018
- Origin: the city from which the flight departs
- OriginWac: USA State/Territory World Area Code
- Dest: the city in which the flight arrives
- DestWac: USA State/Territory World Area Code
- Miles: the number of miles traveled
- ContiguousUSA: binary column: 2 means the flight is within the 48 contiguous US states, and 1 means it is not (i.e. Hawaii, Alaska, off-shore territories)
- NumTicketsOrdered: number of tickets that were purchased by the user
- Airline Company: the two-letter airline company code that the user used from start to finish (key codes below)
- PricePerTicket: ticket price

In [None]:
# download the dataset described above
from urllib import request
import zipfile

url = "https://www.dropbox.com/s/kda4h5su4z6go05/flights.zip?dl=1"
filehandle, _ = request.urlretrieve(url)
zip_file_object = zipfile.ZipFile(filehandle, 'r')
zip_file_object.extractall()

In [None]:
# This second file contains a mapping of airport Code / Name / Latitude / Longitude
# It can help to get a better understanding of the airports you are dealing with.
# Source : https://www.partow.net/miscellaneous/airportdatabase/index.html#Downloads
url2 = "https://www.dropbox.com/s/xe2a3hgwlugos7a/GlobalAirportDatabase.txt?dl=1"
request.urlretrieve(url2, "airport_latlon.txt")

('airport_latlon.txt', <http.client.HTTPMessage at 0x7fbc6a291150>)

In [None]:
!ls

airport_latlon.txt	  spark-3.2.1-bin-hadoop3.2
Cleaned_2018_Flights.csv  spark-3.2.1-bin-hadoop3.2.tgz
sample_data


## Question 1 (1 point)
Display a few rows of the flight fares dataset, display its schema, and count the number of rows.  
You are likely to read this dataset many times ; rewrite it to the file system in an optimized format to speed up subsequent reads.  

Number of rows : 9534417


## Question 2 (4 points)  
Find how many origin and destination airports are contained in the dataset.  
Show them on a US map to get a better intuition of the dataset. You can use shapely and geopandas to perform this task.  
Do we have all the lat/lon available ?

In the next two questions, we want to understand how ticket prices depend on flight distance.  
## Question 3 (2 points)
To do that, we first need to understand the distribution of flight distances.  
We want to display a histogram of flight distances. To do this :  
- use the numpy `logspace` function to create 10 distance buckets, with `base=1.05`
- then use the numpy `digitize` function inside a Spark UDF to assign each flight to a bucket.
- buckets should be displayed in the correct order, and labeled like this : [min;max]
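
The bucketing logic can be sketched in plain numpy before wrapping it in a UDF. This is only an illustration under stated assumptions: the helper names (`make_bucket_edges`, `bucket_index`, `bucket_label`) are hypothetical, and the 50 and 5000 mile bounds are made-up values that should be replaced by the minimum and maximum of the `Miles` column.

```python
import numpy as np


def make_bucket_edges(min_miles, max_miles, n_buckets=10, base=1.05):
    """Return n_buckets + 1 edges, evenly spaced on a log scale."""
    # np.logspace works on exponents: edges go from base**start to base**stop.
    start = np.log(min_miles) / np.log(base)
    stop = np.log(max_miles) / np.log(base)
    return np.logspace(start, stop, num=n_buckets + 1, base=base)


def bucket_index(miles, edges):
    """Return the 0-based bucket index of a distance."""
    # np.digitize returns i such that edges[i-1] <= miles < edges[i].
    idx = np.digitize(miles, edges) - 1
    # Clip so that values on or beyond the outer edges fall in the end buckets.
    return int(np.clip(idx, 0, len(edges) - 2))


def bucket_label(index, edges):
    """Format a bucket as [min;max]."""
    return f"[{edges[index]:.0f};{edges[index + 1]:.0f}]"


# Hypothetical bounds, for illustration only.
edges = make_bucket_edges(min_miles=50.0, max_miles=5000.0)
print(bucket_index(1200.0, edges), bucket_label(bucket_index(1200.0, edges), edges))
```

Returning the integer index from the UDF (rather than the label) keeps the buckets sortable; the `[min;max]` label can then be attached after sorting.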

## Question 4 (3 points)
Display the average flight fare for each distance bucket.  
The graph should also show the confidence intervals.  
Buckets should be displayed in the correct order, and labeled like this : $[min;max]$  
Interpret the results.  
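
As a reminder of what a confidence interval on a mean looks like, here is a minimal sketch in plain Python. It assumes a normal approximation at the 95% level (`z=1.96`), which is a choice made for this illustration and is not specified by the question; the function name and the sample fares are hypothetical. In Spark, the same three ingredients are available per bucket through `F.avg`, `F.stddev` and `F.count`.

```python
import math
import statistics


def mean_confidence_interval(values, z=1.96):
    """Return (mean, lower, upper) using a normal approximation."""
    n = len(values)
    mean = statistics.fmean(values)
    # Sample standard deviation, the same estimator as F.stddev in Spark.
    std = statistics.stdev(values)
    half_width = z * std / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


# Made-up fares for one distance bucket.
fares = [100.0, 110.0, 90.0, 105.0, 95.0]
mean, low, high = mean_confidence_interval(fares)
print(f"mean={mean:.1f}, interval=[{low:.1f};{high:.1f}]")
```

The half-width shrinks as the number of tickets in a bucket grows, so sparsely populated buckets are expected to show wider intervals.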

## Question 5 (4 points)
For the remainder of the notebook, we will only consider the average price of each route.  
Our goal is to find cheap combinations of flights to travel from one place to another.  
First, we want to build a dataframe named 'cheapest_routes_df' containing the cheapest price to go from one place to another, with zero or one waypoint. The dataframe should look like this (the Waypoints column can be empty) :  

Origin  | Destination | Waypoints | TotalPrice
-------------------|------------------|---|---
ACY       | MOB | ATL | 323.0
Row 2, Col 1       | Row 2, Col 2 | | 89.0
  
Is it worthwhile to consider waypoints when going from one place to another ?  

## Question 6 (6 points)
Now we want to create the dataframe with cheap combinations of flights from one place to another, but there is no longer any limit on the number of waypoints.  
Let $Routes_{k}$ designate the dataframe that contains the cheapest routes with at most $k$ waypoints.  
This dataframe contains a column named 'Waypoints', containing an array of waypoints.  
Then:  
- Define a function that computes $Routes_{k+1}$ from $Routes_{k}$ and $Routes_{0}$.  
- Test it on a simple dataset made of three rows, built with `spark.sparkContext.parallelize`.  
- Use it iteratively to build what we want.  
- At each step, measure the number of routes with k waypoints.  
- What is the stopping criterion ?  
- Measure the execution time of each step.  
- What if we want to execute the iterations up to $k=15$ ?  
- Explain what happens, and find a solution to have approximately the same execution time at each iteration.  
- Analyze the results obtained for $Routes_{maxK}$
- Analyze the amount of money saved, weighing it against the extra miles traveled.
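
The recurrence between $Routes_{k}$ and $Routes_{k+1}$ can be illustrated on a toy example in plain Python before writing it with DataFrames. This is a sketch of the logic only, not the expected Spark solution: the dictionary representation, the function name `extend_routes` and the airport codes `A`, `B`, `C` are all hypothetical. In Spark, the nested loop would typically become a join between the destination of $Routes_{k}$ and the origin of $Routes_{0}$, followed by a minimum per (origin, destination) pair.

```python
def extend_routes(routes_k, routes_0):
    """Compute Routes_{k+1} from Routes_k and Routes_0.

    Both arguments map (origin, destination) -> (total_price, waypoints),
    where waypoints is a tuple of airport codes.
    """
    # A route with at most k waypoints is also valid for k + 1.
    best = dict(routes_k)
    for (origin, mid), (price_a, waypoints_a) in routes_k.items():
        for (leg_origin, dest), (price_b, _) in routes_0.items():
            # Only chain legs that start where the route ends,
            # and skip round trips back to the origin.
            if leg_origin != mid or dest == origin:
                continue
            candidate = (price_a + price_b, waypoints_a + (mid,))
            current = best.get((origin, dest))
            if current is None or candidate[0] < current[0]:
                best[(origin, dest)] = candidate
    return best


# Three direct routes with made-up prices.
routes_0 = {
    ("A", "B"): (100.0, ()),
    ("B", "C"): (50.0, ()),
    ("A", "C"): (400.0, ()),
}
routes_1 = extend_routes(routes_0, routes_0)
print(routes_1[("A", "C")])
```

On this toy input, going from `A` to `C` through `B` replaces the more expensive direct flight, while the other two routes are unchanged.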