Developer: **Mouhite ADEBO**

**Project: FLIGHTSRADARAPI24**


***English version***
#### Topic
Create an ***ETL (Extract, Transform, Load)*** pipeline to process data from ***the [aviationstack API](https://aviationstack.com/)***, which lists ***flights, airports, and airlines worldwide***.

The API is a paid service; I used the trial period to carry out this project. With the plan I subscribed to, I can retrieve only 100 records and make 10,000 requests per month.

#### Tasks
- Data extraction with the Python library "requests"
- Data transformation:
  - Data preparation: 
    1. Create a DataFrame for each extracted dataset
    2. Explode the df_flights DataFrame: [Explanation: some of its columns contained nested data structures]
    3. Display the data size and the first element: [Explanation: to know the size of the extracted data and see what it looks like]
    4. Data cleaning
  
        4.1. Convert the type of each column [Explanation: every column had a default type, but some data, such as dates, needed their real type]

        4.2. Fill in the NaN values [Explanation: to make the data processing easier]

        4.3. Split the departure_timezone and arrival_timezone columns into continent and city. [Explanation: it is simpler to have the two elements separated, because query results differ once they are split compared with when they were combined]

        4.4. Calculate the duration of the flights in seconds and in hours [Explanation: this could be useful for our processing or later, for data analysts for example]
  - Queries: 
    1. The airline with the most flights in progress
    2. For each continent, the airline with the most active regional flights (continent of origin == continent of destination)
    3. The flight in progress with the longest route
    4. For each continent, the average flight length
    5. The aircraft manufacturer with the most active flights
    6. For each airline country, the top 3 aircraft models in use
- Save the extracted data

***NB: I commented all my code so that the reader can understand what I did***

In [0]:
from typing import *
import logging

In [0]:
logging.basicConfig(filename='log_flights_extraction.log', 
                    encoding='utf-8', 
                    level=logging.DEBUG, 
                    format='[%(name)s]::%(levelname)s - %(asctime)s: %(message)s'
                   )
logger = logging.getLogger("ETL_FLIGHTSAPI_EXALTIT")

In [0]:
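# config to get data: the API key is read from an environment variable instead of being hard-coded
# (the variable name AVIATIONSTACK_ACCESS_KEY is an arbitrary choice)
import os

params = {
  'access_key': os.environ.get('AVIATIONSTACK_ACCESS_KEY', '')
}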

In [0]:
# endpoint used to retrieve the flights
url_flights = "http://api.aviationstack.com/v1/flights"

## Data extraction

In [0]:
import requests

In [0]:
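# function to extract data from the API
def get_data(url:str, type_data:str) -> list:    
    try:
        # a timeout prevents the call from hanging indefinitely
        response = requests.get(url, params=params, timeout=30)
        # raise on HTTP errors (4xx/5xx) instead of failing later on a missing key
        response.raise_for_status()
        data = response.json()['data']
        logging.info("function 'get_data' is done")
        return data
    except Exception as e:
        logging.error(f'Error: cannot get data for {type_data}. See error: [{e}]', exc_info=True)
        return []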
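
`get_data` sends a single request. If more records are needed than one response returns, the query parameters for successive pages could be generated as in the sketch below. It assumes that the API accepts `limit` and `offset` query parameters and that 100 is the per-request cap of the plan; both points should be checked against the aviationstack documentation. `iter_page_params` is a hypothetical helper, not part of the notebook.

```python
from typing import Dict, Iterator


def iter_page_params(access_key: str, total_records: int, page_size: int = 100) -> Iterator[Dict[str, object]]:
    """Yield one query-parameter dict per page (hypothetical helper)."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for offset in range(0, total_records, page_size):
        yield {
            "access_key": access_key,
            # never ask for more than what is left to fetch
            "limit": min(page_size, total_records - offset),
            # 'limit' and 'offset' are assumed parameter names
            "offset": offset,
        }


pages = list(iter_page_params("YOUR_ACCESS_KEY", total_records=250))
print([(page["offset"], page["limit"]) for page in pages])
```

Each dict could then be passed as `params` to `requests.get`, keeping in mind that every page counts against the monthly request quota.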

In [0]:
# function to display the data size and the first element
def describe_data_extract(data:list, type_data: str) -> None:
    # show the first record when there is one, otherwise the (empty) list itself
    first_element = data[0] if len(data) > 0 else data
    print(f"[\n'Type data': {type_data};\n'Size': {len(data)};\n'First_element': {first_element}\n]\n")
        
    logging.info("function 'describe_data_extract' is done")

## Data Transformation

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType, MapType
from pyspark.sql.window import Window
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql import DataFrame
from pyspark.sql.types import *
from pyspark.sql import Row

In [0]:
# Create spark session
spark = SparkSession.builder \
    .appName("FlightRadarETL_Test_Technique") \
    .getOrCreate()

spark

**1. Create a DataFrame for each extracted dataset**

In [0]:
# function to define the dataframe schema and create the dataframe
def create_dataframe(data: list) -> DataFrame:
    # Step1: define schema dataframe
    schema = StructType([
        StructField("flight_date", StringType(), True),
        StructField("flight_status", StringType(), True),
        StructField("departure", StructType([
            StructField("airport", StringType(), True),
            StructField("timezone", StringType(), True),
            StructField("iata", StringType(), True),
            StructField("icao", StringType(), True),
            StructField("terminal", StringType(), True),
            StructField("gate", StringType(), True),
            StructField("delay", StringType(), True),
            StructField("scheduled", StringType(), True),
            StructField("estimated", StringType(), True),
            StructField("actual", StringType(), True),
            StructField("estimated_runway", StringType(), True),
            StructField("actual_runway", StringType(), True),
        ]), True),
        StructField("arrival", StructType([
            StructField("airport", StringType(), True),
            StructField("timezone", StringType(), True),
            StructField("iata", StringType(), True),
            StructField("icao", StringType(), True),
            StructField("terminal", StringType(), True),
            StructField("gate", StringType(), True),
            StructField("baggage", StringType(), True),
            StructField("delay", StringType(), True),
            StructField("scheduled", StringType(), True),
            StructField("estimated", StringType(), True),
            StructField("actual", StringType(), True),
            StructField("estimated_runway", StringType(), True),
            StructField("actual_runway", StringType(), True),
        ]), True),
        StructField("airline", StructType([
            StructField("name", StringType(), True),
            StructField("iata", StringType(), True),
            StructField("icao", StringType(), True),
        ]), True),
        StructField("flight", StructType([
            StructField("number", StringType(), True),
            StructField("iata", StringType(), True),
            StructField("icao", StringType(), True),
            StructField("codeshared", StringType(), True),
        ]), True),
        StructField("aircraft", StructType([
            StructField("registration", StringType(), True),
            StructField("iata", StringType(), True),
            StructField("icao", StringType(), True),
            StructField("icao24", StringType(), True),
        ]), True),
        StructField("live", StructType([
            StructField("updated", StringType(), True),
            StructField("latitude", StringType(), True),
            StructField("longitude", StringType(), True),
            StructField("altitude", StringType(), True),
            StructField("direction", StringType(), True),
            StructField("speed_horizontal", StringType(), True),
            StructField("speed_vertical", StringType(), True),
            StructField("is_ground", StringType(), True),        
        ]), True)
    ])
    # Step2: create an RDD from the records and build the dataframe with the schema
    rdd = spark.sparkContext.parallelize(data)
    df = spark.createDataFrame(rdd, schema)
    
    logging.info("function 'create_dataframe' is done")
    return df

**2. Explode my dataframe df_flights**

In [0]:
# function to flatten the dataframe by removing the nested structures
def explode_dataframe(df:DataFrame) -> DataFrame:
    # select every nested field under a flat alias
    df = df.select(
        col("flight_date"),
        col("flight_status"),

        col("departure.airport").alias("departure_airport"),
        col("departure.timezone").alias("departure_timezone"),
        col("departure.iata").alias("departure_iata"),
        col("departure.icao").alias("departure_icao"),
        col("departure.terminal").alias("departure_terminal"),
        col("departure.gate").alias("departure_gate"),
        col("departure.delay").alias("departure_delay"),
        col("departure.scheduled").alias("departure_scheduled"),
        col("departure.estimated").alias("departure_estimated"),
        col("departure.actual").alias("departure_actual"),
        col("departure.estimated_runway").alias("departure_estimated_runway"),
        col("departure.actual_runway").alias("departure_actual_runway"),

        col("arrival.airport").alias("arrival_airport"),
        col("arrival.timezone").alias("arrival_timezone"),
        col("arrival.iata").alias("arrival_iata"),
        col("arrival.icao").alias("arrival_icao"),
        col("arrival.terminal").alias("arrival_terminal"),
        col("arrival.gate").alias("arrival_gate"),
        col("arrival.baggage").alias("arrival_baggage"),
        col("arrival.delay").alias("arrival_delay"),
        col("arrival.scheduled").alias("arrival_scheduled"),
        col("arrival.estimated").alias("arrival_estimated"),
        col("arrival.actual").alias("arrival_actual"),
        col("arrival.estimated_runway").alias("arrival_estimated_runway"),
        col("arrival.actual_runway").alias("arrival_actual_runway"),

        col("airline.name").alias("airline_name"),
        col("airline.iata").alias("airline_iata"),
        col("airline.icao").alias("airline_icao"),

        col("flight.number").alias("flight_number"),
        col("flight.iata").alias("flight_iata"),
        col("flight.icao").alias("flight_icao"),
        col("flight.codeshared").alias("flight_codeshared"),

        col("aircraft.registration").alias("aircraft_registration"),
        col("aircraft.iata").alias("aircraft_iata"),
        col("aircraft.icao").alias("aircraft_icao"),
        col("aircraft.icao24").alias("aircraft_icao24"),

        col("live.updated").alias("live_updated"),
        col("live.latitude").alias("live_latitude"),
        col("live.longitude").alias("live_longitude"),
        col("live.altitude").alias("live_altitude"),
        col("live.direction").alias("live_direction"),
        col("live.speed_horizontal").alias("live_speed_horizontal"),
        col("live.speed_vertical").alias("live_speed_vertical"),
        col("live.is_ground").alias("live_is_ground"),
    )
    logging.info("function 'explode_dataframe' is done")
    
    return df
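
As an illustration of what this flattening does, here is a minimal plain-Python sketch that applies the same `<parent>_<field>` naming to one nested record. The sample record and the `flatten_record` helper are invented for the example; like `explode_dataframe`, it only flattens one level of nesting.

```python
from typing import Any, Dict


def flatten_record(record: Dict[str, Any], separator: str = "_") -> Dict[str, Any]:
    """Flatten one level of nested dicts into '<parent>_<field>' keys (hypothetical helper)."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # nested structure: prefix each field with the parent key
            for sub_key, sub_value in value.items():
                flat[f"{key}{separator}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


# invented sample record, shaped like a small part of one flight
sample = {
    "flight_date": "2023-06-01",
    "flight_status": "active",
    "departure": {"airport": "Example Departure Airport", "timezone": "Europe/Paris"},
    "airline": {"name": "Example Airline"},
}
flat_sample = flatten_record(sample)
print(flat_sample)
```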

**3. Describe my dataframe df_flights**

In [0]:
# function to describe my dataframe
def describe_dataframe(df:DataFrame) -> None :
    print("Display my dataframe : ")
    df.show()
    print()
    
    # use the df parameter rather than the global df_flights
    print(f"List of columns of the dataframe: {df.columns}\n")
    
    print('Description of my dataframe: ')
    df.describe().show()
    print()
    
    print('Number of missing values per column: ')
    df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()
    
    logging.info("function 'describe_dataframe' is done")

**4. Cleaning dataframe df_flights**

4.1. Convert the type of each column

In [0]:
# function to set the type of each dataframe column
def define_type_column_dataframe(df:DataFrame) -> DataFrame:
    # pattern of the schedule timestamps, e.g. 2023-06-01T10:30:00+00:00
    timestamp_format = "yyyy-MM-dd'T'HH:mm:ssXXX"
    timestamp_columns = [
        f"{prefix}_{name}"
        for prefix in ("departure", "arrival")
        for name in ("scheduled", "estimated", "actual", "estimated_runway", "actual_runway")
    ]
    # columns kept as strings, with null/NaN values normalised to None
    string_columns = [
        "flight_status",
        "departure_airport", "departure_timezone", "departure_iata", "departure_icao",
        "departure_terminal", "departure_gate", "departure_delay",
        "arrival_airport", "arrival_timezone", "arrival_iata", "arrival_icao",
        "arrival_terminal", "arrival_gate", "arrival_baggage", "arrival_delay",
        "airline_name", "airline_iata", "airline_icao",
        "flight_number", "flight_iata", "flight_icao", "flight_codeshared",
        "aircraft_registration", "aircraft_iata", "aircraft_icao",
        "live_latitude", "live_longitude", "live_altitude", "live_direction",
        "live_speed_horizontal", "live_speed_vertical", "live_is_ground",
    ]

    df = df.withColumn("flight_date", col("flight_date").cast("date"))
    for column in timestamp_columns:
        df = df.withColumn(column, to_timestamp(col(column), timestamp_format))
    df = df.withColumn("live_updated", to_timestamp(col("live_updated")))

    for column in string_columns:
        df = df.withColumn(column,
                           when(col(column).isNull() | isnan(col(column)), None) \
                           .otherwise(col(column).cast("string")))

    # Print the schema to verify the changes
    df.printSchema()
    logging.info("function 'define_type_column_dataframe' is done")
    return df
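
The pattern `yyyy-MM-dd'T'HH:mm:ssXXX` describes ISO 8601 timestamps that carry a UTC offset. As a rough illustration outside Spark, the Python standard library parses strings of the same shape; the sample value below is invented.

```python
from datetime import datetime, timezone

# invented sample in the same shape as the schedule fields
raw = "2023-06-01T10:30:00+02:00"

parsed = datetime.fromisoformat(raw)        # timezone-aware datetime
as_utc = parsed.astimezone(timezone.utc)    # same instant expressed in UTC

print(parsed.utcoffset())   # offset carried by the string
print(as_utc.isoformat())   # the instant rendered in UTC
```

Keeping the offset matters when two timestamps are subtracted, since departure and arrival are often in different time zones.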


4.2. Fill NaN values

In [0]:
# Replace every null/NaN value with the string "No data <column_name>" (or "" when the check fails)
def fill_nan_value(df:DataFrame, fill_value="No data") -> DataFrame:
    for column in df.columns:
        try:                
            if df.filter(col(column).isNull() | isnan(col(column))).count() > 0:
                df = df.fillna(fill_value + " " + column, subset=[column])
        except Exception as e:
            print(f'Cannot apply isnan to column {column}')
            logging.warning(f'Cannot apply isnan to column {column}', exc_info=True)
            df = df.fillna("", subset=[column])
            
    logging.info("function fill_nan_value done")
    return df
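
The placeholder convention can be illustrated on a single record in plain Python. This is only a sketch of the naming rule ("No data" followed by the column name); `fill_missing` and the sample values are hypothetical and do not reproduce Spark's `fillna` semantics.

```python
import math
from typing import Any, Dict


def fill_missing(record: Dict[str, Any], fill_value: str = "No data") -> Dict[str, Any]:
    """Replace None/NaN values with '<fill_value> <column_name>' (hypothetical helper)."""
    def is_missing(value: Any) -> bool:
        return value is None or (isinstance(value, float) and math.isnan(value))

    return {
        column: f"{fill_value} {column}" if is_missing(value) else value
        for column, value in record.items()
    }


# invented sample record
filled = fill_missing({
    "departure_gate": None,
    "airline_name": "Example Airline",
    "live_altitude": float("nan"),
})
print(filled)
```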

4.3. Split the departure_timezone and arrival_timezone columns into continent and city

In [0]:
# function to split the departure_timezone and arrival_timezone columns into new (departure, arrival)_continent and (departure, arrival)_city columns
def split_timezone(df:DataFrame) -> DataFrame:
#     split each column into two columns: (departure, arrival)_continent and (departure, arrival)_city
    df = df.withColumn("departure_continent", split(df["departure_timezone"], "/").getItem(0)) \
           .withColumn("departure_city", split(df["departure_timezone"], "/").getItem(1)) \
           .withColumn("arrival_continent", split(df["arrival_timezone"], "/").getItem(0)) \
           .withColumn("arrival_city", split(df["arrival_timezone"], "/").getItem(1))
    
#     handle missing values: replace each null with "No data <column_name>", e.g. "No data departure_continent" or "No data departure_city"
    df = df.withColumn("departure_continent", 
                       when(df["departure_continent"].isNull(), "No data departure_continent") \
                       .otherwise(df["departure_continent"])) \
          .withColumn("departure_city", 
                      when(df["departure_city"].isNull(), "No data departure_city") \
                      .otherwise(df["departure_city"])) \
          .withColumn("arrival_continent", when(df["arrival_continent"].isNull(), "No data arrival_continent") \
                      .otherwise(df["arrival_continent"])) \
          .withColumn("arrival_city", when(df["arrival_city"].isNull(), "No data arrival_city") \
                      .otherwise(df["arrival_city"]))
    logging.info("function 'split_timezone' is done")
    return df   
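
A small plain-Python sketch of the same split shows how the first two `/`-separated parts behave on different time zone names. `split_timezone_name` is a hypothetical helper that mirrors `getItem(0)` and `getItem(1)`; the names used are standard IANA identifiers.

```python
from typing import Optional, Tuple


def split_timezone_name(tz: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (first part, second part) of a '/'-separated time zone name (hypothetical helper)."""
    if not tz:
        return None, None
    parts = tz.split("/")
    continent = parts[0]
    # a name without '/' has no second part
    city = parts[1] if len(parts) > 1 else None
    return continent, city


for name in ("Europe/Paris", "America/Argentina/Buenos_Aires", "UTC"):
    print(name, "->", split_timezone_name(name))
```

Names with three parts yield a region rather than a city in second position, and names without a slash yield no city at all, so both cases may deserve a check in the data.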

4.4. Calculate the duration of the flights in seconds and in hours

In [0]:
# create new columns holding the flight duration in seconds and formatted as HH:mm:ss
def calculate_duration_in_sec_and_hour(df: DataFrame) -> DataFrame:
    # Flight duration in seconds (null when the scheduled arrival is missing)
    df = df.withColumn('duration', 
                       when(col('arrival_scheduled').isNull(), None) \
                       .otherwise(col('arrival_scheduled').cast('long') - col('departure_scheduled').cast('long')))
    # Format the duration as HH:mm:ss. from_unixtime renders a point in time in the session time zone,
    # so the result is a true duration only with a UTC session and for flights shorter than 24 hours
    df = df.withColumn("duration_in_hour", from_unixtime(col("duration"), "HH:mm:ss"))
    
    logging.info("function 'calculate_duration_in_sec_and_hour' is done")
    df.show()
    return df
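
Because `from_unixtime` formats a moment in time rather than an elapsed time, a duration can also be formatted with plain arithmetic. The sketch below shows the idea in plain Python; `format_duration` is a hypothetical helper, not the notebook's method.

```python
def format_duration(seconds: int) -> str:
    """Format a non-negative number of seconds as HH:mm:ss without wrapping at 24 hours."""
    if seconds < 0:
        raise ValueError("duration must not be negative")
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


print(format_duration(5400))    # one hour and a half
print(format_duration(90000))   # 25 hours, which a clock-style format would wrap around
```

This formatting does not depend on any time zone setting, which is the main design reason to prefer arithmetic for elapsed times.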

## Queries on my dataframe

**Q1: The airline with the most flights in progress**

In [0]:
def get_number_flights_by_airline(df:DataFrame) -> DataFrame:
    # keep only the active flights (use the df parameter, not the global df_flights)
    flight_status = df.filter(col('flight_status')=="active")
    
    # default: count over all flights, used when there is no active flight
    number_flights_by_airline = df.groupBy('airline_name').count().orderBy(desc("count"))
    if flight_status.count() > 0:
        number_flights_by_airline = flight_status.groupBy('airline_name').count().orderBy(desc("count"))
    else:
        print("No active flight found: falling back to all flights")
        logging.warning("No active flight found: falling back to all flights")
        
    print(f"The airline with the most flights is {number_flights_by_airline.first()}.")
    
    logging.info("function 'get_number_flights_by_airline' is done")
    return number_flights_by_airline

**Q2: For each continent, the airline with the most active regional flights (continent of origin == continent of destination)**

In [0]:
def get_max_regional_flights_by_continent_v1(df:DataFrame) -> DataFrame:
    # version 1
    # Keep the regional flights (same departure and arrival continent); any status other than "scheduled" is treated as active here
    regional_flights = df.filter((col("departure_continent") == col("arrival_continent")) & (col("flight_status") != "scheduled"))
    if regional_flights.count() <= 0:
        print("No active regional flight found: falling back to all regional flights")
        logging.warning("No active regional flight found: falling back to all regional flights")
        regional_flights = df.filter((col("departure_continent") == col("arrival_continent")))
        
    # Group the data by "departure_continent" and "airline_name" to count the regional flights of each airline
    regional_flights_count = regional_flights.groupBy("departure_continent", "airline_name") \
        .agg(count("*").alias("flight_count"))

    # Group the data by "departure_continent" and find the maximum flight count per continent
    max_regional_flights_count = regional_flights_count.groupBy("departure_continent") \
        .agg(max("flight_count").alias("max_flight"))

    # Join "max_regional_flights_count" with "regional_flights_count" and keep the airlines that reach the maximum
    max_regional_flights_by_continent = max_regional_flights_count.join(regional_flights_count, ["departure_continent"], "inner") \
        .filter(col("flight_count") == col("max_flight"))

    # Group by "departure_continent" and "max_flight" and collect the airlines tied for the maximum into a list
    max_regional_flights_by_continent = max_regional_flights_by_continent.groupBy("departure_continent", "max_flight") \
        .agg(collect_list("airline_name").alias("airline_names"))

    logging.info("function 'get_max_regional_flights_by_continent_v1' is done")
    return max_regional_flights_by_continent

In [0]:
# version 2: keep a single airline per continent, the first one in alphabetical order, so that the column no longer holds a list
def get_max_regional_flights_by_continent_v2(df:DataFrame) -> DataFrame:
    
    max_regional_flights_by_continent = get_max_regional_flights_by_continent_v1(df)
    
    max_regional_flights_by_continent = max_regional_flights_by_continent.withColumn("max_airline_name", array_min(col("airline_names")))
    max_regional_flights_by_continent = max_regional_flights_by_continent.drop("airline_names")
    
    logging.info("function 'get_max_regional_flights_by_continent_v2' is done")
    return max_regional_flights_by_continent
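
The logic of both versions (count per continent and airline, keep the maximum, handle ties) can be sketched in plain Python on a handful of invented rows. `top_regional_airlines` is a hypothetical helper used only to illustrate the steps; it ignores the flight status filter.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (departure_continent, arrival_continent, airline_name): invented sample rows
flights: List[Tuple[str, str, str]] = [
    ("Europe", "Europe", "Airline A"),
    ("Europe", "Europe", "Airline A"),
    ("Europe", "Europe", "Airline B"),
    ("Europe", "Europe", "Airline B"),
    ("Europe", "Asia", "Airline A"),   # not regional, ignored
    ("Asia", "Asia", "Airline C"),
]


def top_regional_airlines(rows: List[Tuple[str, str, str]]) -> Dict[str, Tuple[int, List[str]]]:
    """Return {continent: (max_flight, airlines tied for that maximum)}."""
    counts = Counter(
        (departure, airline)
        for departure, arrival, airline in rows
        if departure == arrival
    )
    best: Dict[str, Tuple[int, List[str]]] = {}
    for (continent, airline), flight_count in counts.items():
        max_flight, airlines = best.get(continent, (0, []))
        if flight_count > max_flight:
            best[continent] = (flight_count, [airline])
        elif flight_count == max_flight:
            best[continent] = (max_flight, sorted(airlines + [airline]))
    return best


v1 = top_regional_airlines(flights)
# version 2 keeps one name per continent: the alphabetically first one
v2 = {continent: (n, min(names)) for continent, (n, names) in v1.items()}
print(v1)
print(v2)
```

Version 1 keeps every airline tied for the maximum, while version 2 resolves ties arbitrarily by name, which is worth stating when the result is presented.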

**Q3: The flight in progress with the longest route**

In [0]:
def get_flights_with_max_duration(df:DataFrame) -> DataFrame:
    # Find the maximum duration
    max_duration = df.agg(max("duration").alias("max_duration")).select("max_duration").first()[0]

    # Keep the flights whose duration equals the maximum (use the df parameter, not the global df_flights)
    flights_with_max_duration = df.filter(col('duration') == max_duration).select("duration_in_hour", "flight_number", "arrival_airport", 'departure_airport', "departure_scheduled", "arrival_scheduled")

    logging.info("function 'get_flights_with_max_duration' is done")
    return flights_with_max_duration  

**Q4: For each continent, the average flight length**

In [0]:
def get_average_flight_duration(df:DataFrame) -> DataFrame:
    # Calculate the average duration for each departure continent
    average_flight_duration = df.groupBy("departure_continent") \
                                      .avg("duration") \
                                      .withColumnRenamed("avg(duration)", "average_duration_seconds")

    # Format the average duration of each continent as HH:mm:ss
    average_flight_duration = average_flight_duration.withColumn("average_duration", from_unixtime(col("average_duration_seconds"), "HH:mm:ss"))

    logging.info("function 'get_average_flight_duration' is done")
    return average_flight_duration

**Q5: The airport with the most active flights**

In [0]:
def get_number_flights_by_departure_arrival_airport(df:DataFrame) -> Tuple[DataFrame, DataFrame]:
    # keep only the active flights (use the df parameter, not the global df_flights)
    flight_status = df.filter(col('flight_status')=="active")
    
    # default: number of flights per departure and arrival airport over all flights, in descending order
    number_flights_by_departure_airport = df.groupBy('departure_airport').count().orderBy(desc("count"))
    number_flights_by_arrival_airport = df.groupBy('arrival_airport').count().orderBy(desc("count"))
    
    if flight_status.count() > 0:
        number_flights_by_departure_airport = flight_status.groupBy('departure_airport').count().orderBy(desc("count"))
        number_flights_by_arrival_airport = flight_status.groupBy('arrival_airport').count().orderBy(desc("count"))
    else:
        print("No active flight found: falling back to all flights")
        logging.info("No active flight found: falling back to all flights")
        
    print(f"The departure airport with the most flights is {number_flights_by_departure_airport.first()}.")
    print(f"The arrival airport with the most flights is {number_flights_by_arrival_airport.first()}.")

    number_flights_by_departure_airport.show(100)
    number_flights_by_arrival_airport.show(100)
    
    logging.info("function 'get_number_flights_by_departure_arrival_airport' is done")
    return number_flights_by_departure_airport, number_flights_by_arrival_airport

**Q6: For each airline country, the top 3 aircraft models in use**

In [0]:
def get_top3_airplanes_used_by_country_df(df:DataFrame) -> DataFrame:
    # Count flights per (departure_city, airline_name, aircraft_registration)
    grouped_df = df.groupBy('departure_city', "airline_name", 'aircraft_registration').count().orderBy(desc("count"))

    # Rank the rows within each departure_city by number of flights
    window_spec = Window.partitionBy('departure_city').orderBy(col('count').desc())
    ranked_df = grouped_df.withColumn('rank', row_number().over(window_spec))

    # Keep only the top 3 rows of each departure_city
    top3_airplanes_used_df = ranked_df.filter(col('rank') <= 3)

    # Sort by "departure_city" and 'rank'
    top3_airplanes_used_by_country_df = top3_airplanes_used_df.orderBy("departure_city", 'rank')

    top3_airplanes_used_by_country_df.toPandas()
    
    logging.info("function 'get_top3_airplanes_used_by_country_df' is done")
    return top3_airplanes_used_by_country_df
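
The ranking logic above (count per group, order by count, keep the first three) can be sketched without Spark. The helper `top_n_per_group` and the sample rows below are hypothetical and only illustrate the technique, they are not taken from the notebook's data.

```python
from collections import Counter, defaultdict


def top_n_per_group(rows, n=3):
    """rows: iterable of (group, item) pairs.

    Returns {group: [(item, count), ...]} with at most n items per group,
    ordered by count in descending order.
    """
    counts = defaultdict(Counter)
    for group, item in rows:
        counts[group][item] += 1
    # most_common(n) sorts by count descending; equal counts keep first-seen order
    return {group: counter.most_common(n) for group, counter in counts.items()}


# Hypothetical (city, airline) pairs, for illustration only
rows = (
    [("city_x", "airline_a")] * 3
    + [("city_x", "airline_b")] * 2
    + [("city_x", "airline_c"), ("city_x", "airline_d")]
    + [("city_y", "airline_a")]
)
result = top_n_per_group(rows)
print(result["city_x"])  # [('airline_a', 3), ('airline_b', 2), ('airline_c', 1)]
print(result["city_y"])  # [('airline_a', 1)]
```

Note that with tied counts, which item lands in third place depends on the tie-breaking rule: first-seen order here, whereas a window ordered only by count gives no such guarantee.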

**Q7: Which airport has the largest difference between the number of outgoing flights and the number of incoming flights?**

In [0]:
def get_airport_with_max_diff(df:DataFrame) -> DataFrame:
    # Count outgoing flights per departure airport
    outgoing_flights = df.groupBy('departure_airport').count().withColumnRenamed('count', 'outgoing_count')

    # Count incoming flights per arrival airport
    incoming_flights = df.groupBy('arrival_airport').count().withColumnRenamed('count', 'incoming_count')

    # Full outer join of the two counts on the airport name
    airport_flights_diff = outgoing_flights.join(incoming_flights, outgoing_flights['departure_airport'] == incoming_flights['arrival_airport'], 'outer')

    # Absolute difference between outgoing and incoming counts
    # Note: an airport present on only one side of the join has a null count, so its flight_diff is null
    airport_flights_diff = airport_flights_diff.withColumn('flight_diff', abs(col('outgoing_count') - col('incoming_count')))

    # Sort by difference, largest first (null differences sort last)
    airport_with_max_diff = airport_flights_diff.orderBy(col('flight_diff').desc())

    print(f"The airport with the largest difference between the number of outgoing and incoming flights is: {airport_with_max_diff.first()}")
    airport_with_max_diff.limit(1).show()
    airport_with_max_diff.toPandas()
    
    logging.info("function 'get_airport_with_max_diff' is done")
    return airport_with_max_diff
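
With a full outer join, an airport that appears only as a departure or only as an arrival has a missing count on the other side, so its difference is null and it cannot win the ranking. One possible alternative is to treat a missing count as 0. The sketch below shows that idea in plain Python so that it runs without a Spark session; `airport_flight_diffs` and the sample pairs are hypothetical names and data, not part of the notebook.

```python
from collections import Counter


def airport_flight_diffs(flights):
    """flights: iterable of (departure_airport, arrival_airport) pairs.

    Returns {airport: |outgoing - incoming|}, counting a missing side as 0.
    """
    flights = list(flights)
    outgoing = Counter(dep for dep, _ in flights)
    incoming = Counter(arr for _, arr in flights)
    airports = set(outgoing) | set(incoming)
    # A Counter returns 0 for unknown keys, so one-sided airports get a real value
    return {airport: abs(outgoing[airport] - incoming[airport]) for airport in airports}


# Hypothetical (departure, arrival) pairs, for illustration only
sample = [("A", "B"), ("A", "B"), ("A", "C"), ("B", "A")]
diffs = airport_flight_diffs(sample)
print(diffs["A"], diffs["B"], diffs["C"])  # 2 1 1
```

In PySpark, the equivalent would presumably be to replace null counts with 0 before subtracting; whether that matches the intent of the question is a judgment call left to the author.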

## Save data

In [0]:
from datetime import datetime
def save_data(df: DataFrame, category_data:str, type_data:str, type_file='csv') -> None:
    # Take a single timestamp so that the columns and the path are consistent with each other
    now = datetime.today()
    # Add columns recording the date and the time at which the data is saved
    df = df.withColumn('date_save', lit(now.strftime('%Y-%m-%d')))
    df = df.withColumn('hour_save', lit(now.strftime('%H:%M:%S')))
    df.write.format(type_file) \
      .mode("overwrite") \
      .save(f"Flights/results_requests/{category_data}/tech_year={now.strftime('%Y')}/tech_month={now.strftime('%Y-%m')}/tech_day={now.strftime('%Y-%m-%d')}/{type_data}_{now.strftime('%Y%m%d%H%M%S')}.{type_file}")
    logging.info(f"Data of {category_data} saved successfully")
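
To make the partitioned path layout easier to read, here is a minimal sketch that builds the same path string from one timestamp. The helper `build_output_path` is hypothetical (it does not exist in the notebook), and the timestamp is a fixed value chosen for illustration.

```python
from datetime import datetime


def build_output_path(category_data: str, type_data: str, type_file: str, now: datetime) -> str:
    """Build the partitioned output path from a single timestamp."""
    return (
        f"Flights/results_requests/{category_data}"
        f"/tech_year={now:%Y}/tech_month={now:%Y-%m}/tech_day={now:%Y-%m-%d}"
        f"/{type_data}_{now:%Y%m%d%H%M%S}.{type_file}"
    )


# Fixed timestamp, for illustration only
path = build_output_path("rawzone", "flights", "csv", datetime(2023, 5, 22, 10, 15, 0))
print(path)
# Flights/results_requests/rawzone/tech_year=2023/tech_month=2023-05/tech_day=2023-05-22/flights_20230522101500.csv
```

Passing the timestamp as a parameter keeps the function deterministic, which makes the path layout easy to check in isolation.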

## Run ETL

In [0]:
# Extraction
# Fetch the raw (not yet cleaned) data
data_flights = get_data(url_flights, "flights")

In [0]:
# Describe the extracted data
describe_data_extract(data_flights, "flights")


        [

            'Type data': flights;

            'Size':100;

            'First_element': {'flight_date': '2023-05-22', 'flight_status': 'scheduled', 'departure': {'airport': 'Singapore Changi', 'timezone': 'Asia/Singapore', 'iata': 'SIN', 'icao': 'WSSS', 'terminal': '4', 'gate': '4', 'delay': None, 'scheduled': '2023-05-22T10:15:00+00:00', 'estimated': '2023-05-22T10:15:00+00:00', 'actual': None, 'estimated_runway': None, 'actual_runway': None}, 'arrival': {'airport': 'Don Muang', 'timezone': 'Asia/Bangkok', 'iata': 'DMK', 'icao': 'VTBD', 'terminal': '1', 'gate': None, 'baggage': None, 'delay': None, 'scheduled': '2023-05-22T11:40:00+00:00', 'estimated': '2023-05-22T11:40:00+00:00', 'actual': None, 'estimated_runway': None, 'actual_runway': None}, 'airline': {'name': 'AirAsia', 'iata': 'AK', 'icao': 'AXM'}, 'flight': {'number': '358', 'iata': 'AK358', 'icao': 'AXM358', 'codeshared': None}, 'aircraft': None, 'live': None}
        
]
        


In [0]:
# Transformation
# Step 1: Create dataframe

df_flights = create_dataframe(data_flights)

df_flights = explode_dataframe(df_flights)

describe_dataframe(df_flights)

df_copy = df_flights

Display my dataframe : 
+-----------+-------------+--------------------+------------------+--------------+--------------+------------------+--------------+---------------+--------------------+--------------------+--------------------+--------------------------+-----------------------+--------------------+-------------------+------------+------------+----------------+------------+---------------+-------------+--------------------+--------------------+--------------+------------------------+---------------------+--------------------+------------+------------+-------------+-----------+-----------+--------------------+---------------------+-------------+-------------+---------------+------------+-------------+--------------+-------------+--------------+---------------------+-------------------+--------------+
|flight_date|flight_status|   departure_airport|departure_timezone|departure_iata|departure_icao|departure_terminal|departure_gate|departure_delay| departure_scheduled| departure_esti

In [0]:
# Save the extracted data before transformation
# As CSV
save_data(df_flights, "rawzone", "flights_before_transformation")
df_flights.toPandas()

Unnamed: 0,flight_date,flight_status,departure_airport,departure_timezone,departure_iata,departure_icao,departure_terminal,departure_gate,departure_delay,departure_scheduled,...,aircraft_icao,aircraft_icao24,live_updated,live_latitude,live_longitude,live_altitude,live_direction,live_speed_horizontal,live_speed_vertical,live_is_ground
0,2023-05-22,scheduled,Singapore Changi,Asia/Singapore,SIN,WSSS,4,4,,2023-05-22T10:15:00+00:00,...,,,,,,,,,,
1,2023-05-22,scheduled,Singapore Changi,Asia/Singapore,SIN,WSSS,3,B3,,2023-05-22T08:20:00+00:00,...,,,,,,,,,,
2,2023-05-22,scheduled,Singapore Changi,Asia/Singapore,SIN,WSSS,,,11,2023-05-22T04:30:00+00:00,...,,,,,,,,,,
3,2023-05-22,scheduled,Jinan,Asia/Shanghai,TNA,ZSJN,,,,2023-05-22T09:20:00+00:00,...,,,,,,,,,,
4,2023-05-22,scheduled,Kubin Island,Australia/Brisbane,KUG,YKUB,,,,2023-05-22T12:35:00+00:00,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2023-05-22,scheduled,Melbourne - Tullamarine Airport,Australia/Melbourne,MEL,YMML,3,5,,2023-05-22T11:55:00+00:00,...,,,,,,,,,,
96,2023-05-22,scheduled,Melbourne - Tullamarine Airport,Australia/Melbourne,MEL,YMML,3,5,,2023-05-22T11:55:00+00:00,...,,,,,,,,,,
97,2023-05-22,scheduled,Melbourne - Tullamarine Airport,Australia/Melbourne,MEL,YMML,3,5,,2023-05-22T11:55:00+00:00,...,,,,,,,,,,
98,2023-05-22,scheduled,Brisbane International,Australia/Brisbane,BNE,YBBN,D,,,2023-05-22T11:00:00+00:00,...,,,,,,,,,,


In [0]:
# Step2: Data cleaning
df_flights = define_type_column_dataframe(df_flights)

df_flights = fill_nan_value(df_flights)

df_flights = split_timezone(df_flights)

df_flights = calculate_duration_in_sec_and_hour(df_flights)

describe_dataframe(df_flights)

df_flights.toPandas()

root
 |-- flight_date: date (nullable = true)
 |-- flight_status: string (nullable = true)
 |-- departure_airport: string (nullable = true)
 |-- departure_timezone: string (nullable = true)
 |-- departure_iata: string (nullable = true)
 |-- departure_icao: string (nullable = true)
 |-- departure_terminal: string (nullable = true)
 |-- departure_gate: string (nullable = true)
 |-- departure_delay: string (nullable = true)
 |-- departure_scheduled: timestamp (nullable = true)
 |-- departure_estimated: timestamp (nullable = true)
 |-- departure_actual: timestamp (nullable = true)
 |-- departure_estimated_runway: timestamp (nullable = true)
 |-- departure_actual_runway: timestamp (nullable = true)
 |-- arrival_airport: string (nullable = true)
 |-- arrival_timezone: string (nullable = true)
 |-- arrival_iata: string (nullable = true)
 |-- arrival_icao: string (nullable = true)
 |-- arrival_terminal: string (nullable = true)
 |-- arrival_gate: string (nullable = true)
 |-- arrival_baggage: 

Unnamed: 0,flight_date,flight_status,departure_airport,departure_timezone,departure_iata,departure_icao,departure_terminal,departure_gate,departure_delay,departure_scheduled,...,live_direction,live_speed_horizontal,live_speed_vertical,live_is_ground,departure_continent,departure_city,arrival_continent,arrival_city,duration,duration_in_hour
0,2023-05-22,scheduled,Singapore Changi,Asia/Singapore,SIN,WSSS,4,4,No data departure_delay,2023-05-22 10:15:00,...,No data live_direction,No data live_speed_horizontal,No data live_speed_vertical,No data live_is_ground,Asia,Singapore,Asia,Bangkok,5100,01:25:00
1,2023-05-22,scheduled,Singapore Changi,Asia/Singapore,SIN,WSSS,3,B3,No data departure_delay,2023-05-22 08:20:00,...,No data live_direction,No data live_speed_horizontal,No data live_speed_vertical,No data live_is_ground,Asia,Singapore,Asia,Shanghai,14400,04:00:00
2,2023-05-22,scheduled,Singapore Changi,Asia/Singapore,SIN,WSSS,No data departure_terminal,No data departure_gate,11,2023-05-22 04:30:00,...,No data live_direction,No data live_speed_horizontal,No data live_speed_vertical,No data live_is_ground,Asia,Singapore,Asia,Hong_Kong,14100,03:55:00
3,2023-05-22,scheduled,Jinan,Asia/Shanghai,TNA,ZSJN,No data departure_terminal,No data departure_gate,No data departure_delay,2023-05-22 09:20:00,...,No data live_direction,No data live_speed_horizontal,No data live_speed_vertical,No data live_is_ground,Asia,Shanghai,Asia,Samarkand,15000,04:10:00
4,2023-05-22,scheduled,Kubin Island,Australia/Brisbane,KUG,YKUB,No data departure_terminal,No data departure_gate,No data departure_delay,2023-05-22 12:35:00,...,No data live_direction,No data live_speed_horizontal,No data live_speed_vertical,No data live_is_ground,Australia,Brisbane,Australia,Brisbane,1200,00:20:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2023-05-22,scheduled,Melbourne - Tullamarine Airport,Australia/Melbourne,MEL,YMML,3,5,No data departure_delay,2023-05-22 11:55:00,...,No data live_direction,No data live_speed_horizontal,No data live_speed_vertical,No data live_is_ground,Australia,Melbourne,Australia,Sydney,3900,01:05:00
96,2023-05-22,scheduled,Melbourne - Tullamarine Airport,Australia/Melbourne,MEL,YMML,3,5,No data departure_delay,2023-05-22 11:55:00,...,No data live_direction,No data live_speed_horizontal,No data live_speed_vertical,No data live_is_ground,Australia,Melbourne,Australia,Sydney,3900,01:05:00
97,2023-05-22,scheduled,Melbourne - Tullamarine Airport,Australia/Melbourne,MEL,YMML,3,5,No data departure_delay,2023-05-22 11:55:00,...,No data live_direction,No data live_speed_horizontal,No data live_speed_vertical,No data live_is_ground,Australia,Melbourne,Australia,Sydney,3900,01:05:00
98,2023-05-22,scheduled,Brisbane International,Australia/Brisbane,BNE,YBBN,D,No data departure_gate,No data departure_delay,2023-05-22 11:00:00,...,No data live_direction,No data live_speed_horizontal,No data live_speed_vertical,No data live_is_ground,Australia,Brisbane,Australia,Sydney,6600,01:50:00


In [0]:
# Step 3: Query the dataframe
df_number_flights_by_airline = get_number_flights_by_airline(df_flights)
df_number_flights_by_airline.toPandas().head(100)

I can not get airline with more flights loading
Airline have more flights is Row(airline_name='Cebu Pacific Air', count=10).


| | airline_name | count |
|---|---|---|
| 0 | Cebu Pacific Air | 10 |
| 1 | IndiGo | 8 |
| 2 | FlexFlight | 6 |
| 3 | Philippine Airlines | 6 |
| 4 | Skytrans | 5 |
| 5 | AirAsia | 4 |
| 6 | AirSWIFT | 4 |
| 7 | ANA | 4 |
| 8 | Malaysia Airlines | 3 |
| 9 | Ethiopian Airlines | 3 |


In [0]:
df_max_regional_flights_by_continent_v1 = get_max_regional_flights_by_continent_v1(df_flights)
df_max_regional_flights_by_continent_v1.toPandas().head(100)

I can not get airline with the more region flights actif with departure_continent = arrival_continent


| | departure_continent | max_flight | airline_names |
|---|---|---|---|
| 0 | Pacific | 1 | [Air China LTD] |
| 1 | Asia | 10 | [Cebu Pacific Air] |
| 2 | Africa | 2 | [Ethiopian Airlines] |
| 3 | Europe | 1 | [Finnair, Juneyao Airlines, empty] |
| 4 | Australia | 5 | [FlexFlight, Skytrans] |


In [0]:
df_max_regional_flights_by_continent_v2 = get_max_regional_flights_by_continent_v2(df_flights)
df_max_regional_flights_by_continent_v2.toPandas().head(100)

I can not get airline with the more region flights actif with departure_continent = arrival_continent


| | departure_continent | max_flight | max_airline_name |
|---|---|---|---|
| 0 | Pacific | 1 | Air China LTD |
| 1 | Asia | 10 | Cebu Pacific Air |
| 2 | Africa | 2 | Ethiopian Airlines |
| 3 | Europe | 1 | Finnair |
| 4 | Australia | 5 | FlexFlight |


In [0]:
df_flights_with_max_duration = get_flights_with_max_duration(df_flights)
df_flights_with_max_duration.toPandas().head(100)

| | duration_in_hour | flight_number | arrival_airport | departure_airport | departure_scheduled | arrival_scheduled |
|---|---|---|---|---|---|---|
| 0 | 08:10:00 | 3610 | Chhatrapati Shivaji International (Sahar Inter... | Bole International | 2023-05-22 05:40:00 | 2023-05-22 13:50:00 |


In [0]:
df_average_flight_duration = get_average_flight_duration(df_flights)
df_average_flight_duration.toPandas().head(100)

| | departure_continent | average_duration_seconds | average_duration |
|---|---|---|---|
| 0 | Asia | 7047.887324 | 01:57:27 |
| 1 | Australia | 2720.0 | 00:45:20 |
| 2 | Africa | 12990.0 | 03:36:30 |
| 3 | No data departure_timezone | 6825.0 | 01:53:45 |
| 4 | Pacific | 2400.0 | 00:40:00 |
| 5 | Europe | 3660.0 | 01:01:00 |


In [0]:
df_number_flights_by_departure_airport, df_number_flights_by_arrival_airport = get_number_flights_by_departure_arrival_airport(df_flights)
df_number_flights_by_departure_airport.toPandas().head(100)
df_number_flights_by_arrival_airport.toPandas().head(100)

I can not get airport with the more flights actif
Departure Airport have more flights is Row(departure_airport='Ninoy Aquino International', count=26).
Arrival Airport have more flights is Row(arrival_airport='Horn Island', count=9).
+--------------------+-----+
|   departure_airport|count|
+--------------------+-----+
|Ninoy Aquino Inte...|   26|
|Netaji Subhas Cha...|   11|
|              Sendai|    7|
|  Bole International|    5|
|   Kozhikode Airport|    5|
|    Singapore Changi|    4|
|             Phu Bai|    4|
|        Yorke Island|    3|
|          Yam Island|    3|
|Soekarno-Hatta In...|    3|
|              Kuopio|    3|
|Melbourne - Tulla...|    3|
|        Kubin Island|    2|
|        Port Hedland|    2|
|            Sandakan|    2|
|           Yakushima|    2|
|No data departure...|    2|
|               Jinan|    1|
|      Bamaga Injinoo|    1|
|Guangzhou Baiyun ...|    1|
|            Hurghada|    1|
|Barimunya Airport...|    1|
|Auckland Internat...|    1|
|          L

| | arrival_airport | count |
|---|---|---|
| 0 | Horn Island | 9 |
| 1 | Shanghai Pudong International | 4 |
| 2 | Malay | 4 |
| 3 | No data arrival_airport | 4 |
| 4 | Chhatrapati Shivaji International (Sahar Inter... | 4 |
| 5 | Kuala Lumpur International Airport (klia) | 4 |
| 6 | Tan Son Nhat International | 4 |
| 7 | Canberra | 4 |
| 8 | Hong Kong International | 3 |
| 9 | Cairo International Airport | 3 |


In [0]:
df_top3_airplanes_used_by_country_df = get_top3_airplanes_used_by_country_df(df_flights)
df_top3_airplanes_used_by_country_df.toPandas().head(100)

| | departure_city | airline_name | aircraft_registration | count | rank |
|---|---|---|---|---|---|
| 0 | Addis_Ababa | Ethiopian Airlines | No data aircraft_registration | 3 | 1 |
| 1 | Addis_Ababa | EgyptAir | No data aircraft_registration | 1 | 2 |
| 2 | Addis_Ababa | Aegean Airlines | No data aircraft_registration | 1 | 3 |
| 3 | Auckland | Air China LTD | No data aircraft_registration | 1 | 1 |
| 4 | Brisbane | Skytrans | No data aircraft_registration | 5 | 1 |
| 5 | Brisbane | FlexFlight | No data aircraft_registration | 4 | 2 |
| 6 | Brisbane | Air New Zealand | No data aircraft_registration | 1 | 3 |
| 7 | Cairo | AMC Airlines | No data aircraft_registration | 1 | 1 |
| 8 | Helsinki | empty | No data aircraft_registration | 1 | 1 |
| 9 | Helsinki | Finnair | No data aircraft_registration | 1 | 2 |


In [0]:
df_airport_with_max_diff = get_airport_with_max_diff(df_flights)
df_airport_with_max_diff.toPandas().head(100)

The airport with the largest difference between the number of outgoing and incoming flights is: Row(departure_airport='Singapore Changi', outgoing_count=4, arrival_airport='Singapore Changi', incoming_count=1, flight_diff=3)
+-----------------+--------------+----------------+--------------+-----------+
|departure_airport|outgoing_count| arrival_airport|incoming_count|flight_diff|
+-----------------+--------------+----------------+--------------+-----------+
| Singapore Changi|             4|Singapore Changi|             1|          3|
+-----------------+--------------+----------------+--------------+-----------+



| | departure_airport | outgoing_count | arrival_airport | incoming_count | flight_diff |
|---|---|---|---|---|---|
| 0 | Singapore Changi | 4.0 | Singapore Changi | 1.0 | 3.0 |
| 1 | Sandakan | 2.0 | Sandakan | 1.0 | 1.0 |
| 2 | Guangzhou Baiyun International | 1.0 | Guangzhou Baiyun International | 1.0 | 0.0 |
| 3 | | | Alexander The Great Airport | 1.0 | |
| 4 | | | Amausi | 1.0 | |
| ... | ... | ... | ... | ... | ... |
| 77 | | | Vienna International | 1.0 | |
| 78 | Yakushima | 2.0 | | | |
| 79 | Yam Island | 3.0 | | | |
| 80 | Yorke Island | 3.0 | | | |


In [0]:
# Save data

# Flights data
# As CSV
save_data(df_flights, "rawzone", "flights")

# As Parquet
save_data(df_flights, "rawzone", "flights", "parquet")

In [0]:
# Results of the queries on the dataframe
# Q1
save_data(df_number_flights_by_airline, "results_requests", "number_flights_by_airline")

# Q2
# save_data(df_max_regional_flights_by_continent_v1, "results_requests", "max_regional_flights_by_continent_v1")
save_data(df_max_regional_flights_by_continent_v2, "results_requests", "max_regional_flights_by_continent_v2")

# Q3
save_data(df_flights_with_max_duration, "results_requests", "flights_with_max_duration")

# Q4
save_data(df_average_flight_duration, "results_requests", "average_flight_duration")

# Q5
save_data(df_number_flights_by_departure_airport, "results_requests", "number_flights_by_departure_airport")
save_data(df_number_flights_by_arrival_airport, "results_requests", "number_flights_by_arrival_airport")

# Q6
save_data(df_top3_airplanes_used_by_country_df, "results_requests", "top3_airplanes_used_by_country_df")

# Q7
save_data(df_airport_with_max_diff, "results_requests", "airport_with_max_diff")

In [0]:
try:
    spark.stop()
    logging.info("Spark stopped successfully")
except Exception as e:
    logging.warning(f"Spark did not stop cleanly. See error: [{e}]", exc_info=True)

------------------------------------------------------------------------------ END ------------------------------------------------------------------------------