
# Brazilian Airline Historical Series Analysis

#### Frederico Horst

### Data Sources:
- Historical air fares by origin, destination and airline: available at [ANAC website](https://sistemas.anac.gov.br/sas/downloads/view/frmDownload.aspx)
- Inflation data, using IPCA index: available at [IBGE website](https://www.ibge.gov.br/estatisticas/economicas/precos-e-custos/9256-indice-nacional-de-precos-ao-consumidor-amplo.html?=&t=series-historicas)
- More information on air fares: available at [ANAC website](https://www.anac.gov.br/assuntos/dados-e-estatisticas/mercado-do-transporte-aereo)

### Goals:
- Build a database of historical deflated (inflation-adjusted) prices.
- Calculate a confidence interval for the average price on each route, at a 95% confidence level.
- Confidence intervals are calculated per route and ignore differences between airlines, because we want to look at prices from the consumer's point of view.
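
As an illustration of the second goal, here is a minimal sketch of a 95% confidence interval for a mean price. It uses the normal approximation (z = 1.96) and only the Python standard library. The function name `mean_confidence_interval` and the sample fares are hypothetical and are not taken from the ANAC data. For small samples a Student's t quantile would be more appropriate than a fixed z.

```python
import math
import statistics


def mean_confidence_interval(prices, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation interval.

    z=1.96 corresponds to a two-sided 95% confidence level.
    """
    n = len(prices)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.mean(prices)
    # Standard error of the mean, based on the sample standard deviation
    std_err = statistics.stdev(prices) / math.sqrt(n)
    margin = z * std_err
    return mean, mean - margin, mean + margin


# Hypothetical fares for a single route (made-up values, for illustration only)
sample_prices = [310.0, 295.5, 402.0, 350.25, 289.9, 331.4]
mean, lower, upper = mean_confidence_interval(sample_prices)
print(f"mean={mean:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

The interval is symmetric around the sample mean and narrows as the number of observations on a route grows.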


In [4]:
# external libs
import numpy as np
from pyspark.sql import SparkSession, SQLContext

# internal lib:
import files_processor

# spark configs:
spark = SparkSession.builder \
        .master("local") \
        .appName("anac-prices") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

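# SQLContext takes a SparkContext as its first argument; the session is passed second
sqlContext = SQLContext(spark.sparkContext, spark)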

In [5]:
# importing and cleaning files
anac_table = files_processor.files_cleaning(
    path_source='csv_files_from_anac',
    inflation_file='ipca_historico.csv')

###########################################################################
BEGGINING DATA CLEANING PROCESS
importing IPCA file
Inflation series imported successfully
#########################
beginning data cleaning
FIM DO PROCESSO
###########################################################################
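
The internal `files_processor` module is not shown here, so the exact deflation formula it applies is unknown. As a rough sketch of the general idea only, a nominal price can be expressed in the money of a chosen base period by scaling it with the ratio of two price-index levels. The function `deflate` and the index values below are hypothetical and are not real IPCA figures.

```python
def deflate(nominal_price, index_at_sale, index_at_base):
    """Express a nominal price in base-period money.

    index_at_sale: price index level in the period the fare was sold
    index_at_base: price index level in the chosen base period
    """
    if index_at_sale <= 0:
        raise ValueError("index_at_sale must be positive")
    return nominal_price * index_at_base / index_at_sale


# Hypothetical index levels (not real IPCA values)
fare = 500.0
real_fare = deflate(fare, index_at_sale=4000.0, index_at_base=5000.0)
print(real_fare)
```

With a base period later than the sale period and a higher index level, the fare in base-period money comes out larger than the nominal fare.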


In [8]:
# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it
anac_table.createOrReplaceTempView('anac_table')
anac_table.show()

+----+-----+----------+-------+------+-----------+------+-----+------------------+
|year|month|year_month|company|origin|destination|tariff|seats|   deflated_tariff|
+----+-----+----------+-------+------+-----------+------+-----+------------------+
|2015|   11|    201511|    AZU|  SBPA|       SBSL| 985.9|    5|22.152816007370042|
|2015|   11|    201511|    AZU|  SBGR|       SBCA| 305.9|    1| 6.873462234156097|
|2015|   11|    201511|    AZU|  SBPJ|       SBPA|187.39|    2| 4.210585446415531|
|2015|   11|    201511|    AZU|  SBCT|       SBSP|100.01|   13| 2.247188486557539|
|2015|   11|    201511|    AZU|  SBPA|       SBRP| 833.9|    1| 18.73743104629869|
|2015|   11|    201511|    AZU|  SBFZ|       SBMO| 603.9|    1|13.569415206608321|
|2015|   11|    201511|    AZU|  SBGL|       SBMO| 425.9|    5| 9.569818782370323|
|2015|   11|    201511|    AZU|  SBGR|       SBRF| 500.0|  103|11.234818950892606|
|2015|   11|    201511|    AZU|  SBIL|       SBGL| 316.9|    1| 7.120628251075734|
|201

In [11]:
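# import the cleaned airport names (maps ICAO codes to IATA codes)
airports = sqlContext.read.csv('aeroportos.csv', sep=";", inferSchema=True, header=True)
# registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it
airports.createOrReplaceTempView('airports')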

grouping_query = """
    SELECT
        year_month,
        year,
        month,
        origin AS origin_code,
        origin_airports.IATA AS origin,
        destination AS destination_code,
        destination_airports.IATA AS destination, 
        SUM(seats) AS solded_seats,
        SUM(deflated_tariff) AS total_deflated_tariff,
        SUM(tariff) AS total_tariff,
        SUM(deflated_tariff)/SUM(seats) AS deflated_tariff_mean,
        SUM(tariff)/SUM(seats) AS tariff_mean

    FROM anac_table

    LEFT JOIN airports AS origin_airports ON origin_airports.ICAO = anac_table.origin

    LEFT JOIN airports AS destination_airports ON destination_airports.ICAO = anac_table.destination

    GROUP BY 1,2,3,4,5,6,7
    """

anac_sts = sqlContext.sql(grouping_query)
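
The query above divides the sum of the fares by the sum of the seats. If each source row holds a per-seat fare together with the number of seats sold at that fare (an assumption about the data layout, not something confirmed here), a seat-weighted mean would instead multiply each fare by its seat count before summing, which in SQL terms would be `SUM(tariff * seats) / SUM(seats)`. The sketch below shows that calculation in plain Python. The function `seat_weighted_mean` and the sample rows are hypothetical.

```python
def seat_weighted_mean(rows):
    """Average fare per seat, weighting each fare by the seats sold at it.

    rows: iterable of (fare, seats) pairs
    """
    rows = list(rows)
    total_seats = sum(seats for _, seats in rows)
    if total_seats == 0:
        raise ValueError("no seats sold")
    total_revenue = sum(fare * seats for fare, seats in rows)
    return total_revenue / total_seats


# Hypothetical (fare, seats) rows for one route and month
rows = [(100.0, 1), (200.0, 3)]
print(seat_weighted_mean(rows))  # (100*1 + 200*3) / 4 = 175.0
```

Under that assumption the weighted result stays within the range of the observed fares, while a plain sum of fares divided by total seats can fall well below the lowest fare.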



In [12]:
anac_sts.show()

+----------+----+-----+-----------+------+----------------+-----------+------------+---------------------+------------------+--------------------+------------------+
|year_month|year|month|origin_code|origin|destination_code|destination|solded_seats|total_deflated_tariff|      total_tariff|deflated_tariff_mean|       tariff_mean|
+----------+----+-----+-----------+------+----------------+-----------+------------+---------------------+------------------+--------------------+------------------+
|    201511|2015|   11|       SBGR|   GRU|            SBRF|        REC|       31434|    31615.35940317488|1407025.7650375366|   1.005769529909489| 44.76127012271861|
|    201511|2015|   11|       SBBE|   BEL|            SBPA|        POA|        1235|   5656.2554686358935|251728.82341003418|   4.579963942215298|203.82900680974427|
|    201511|2015|   11|       SBDN|   PPB|            SBAR|        AJU|          24|   221.33986714124416|  9850.62026977539|   9.222494464218506| 410.4425112406413|
|   