### Descrição dos Dados
O dataset trabalhado contêm informações de voôs de avião do ano de 2008 nos EUA, onde o Departamento de Transportes do Estados Unidos faz avaliação mensal dos serviços aéreos. Não se tem informação se são todos os voôs do ano,ou se são apenas uma amostra da população.

### Problemática a ser resolvida
Prever o tempo de atraso dos voôs (coluna: ArrDelay), com base nas informações do dataset.

Download do dataset: https://data.world/data-society/airlines-delay

Formato: CSV

Dicionário de dados:
Link: http://stat-computing.org/dataexpo/2009/the-data.html


 Id | Name 				| Description
----|-------------------|--------------------------------------------------------------------------------
 1  |Year              	|1987-2008
 2  |Month 				|1-12
 3  |DayofMonth 	    |1-31
 4  |DayOfWeek 			|1 (Monday) - 7 (Sunday)
 5  |DepTime 		    |actual departure time (local, hhmm)
 6  |CRSDepTime 		|scheduled departure time (local, hhmm)
 7  |ArrTime 		    |actual arrival time (local, hhmm)
 8  |CRSArrTime 		|scheduled arrival time (local, hhmm)
 9  |UniqueCarrier 		|unique carrier code
 10 |FlightNum 			|flight number
 11 |TailNum 		    |plane tail number
 12 |ActualElapsedTime 	|in minutes
 13 |CRSElapsedTime 	|in minutes
 14 |AirTime 		    |in minutes
 15 |**ArrDelay**	    |**arrival delay, in minutes**
 16 |DepDelay 		    |departure delay, in minutes
 17 |Origin 			|origin IATA airport code
 18 |Dest 			    |destination IATA airport code
 19 |Distance 		    |in miles
 20 |TaxiIn 			|taxi in time, in minutes
 21 |TaxiOut 		    |taxi out time in minutes
 22 |Cancelled 			|was the flight cancelled?
 23 |CancellationCode  	|reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
 24 |Diverted 		    |1 = yes, 0 = no
 25 |CarrierDelay 	    |in minutes
 26 |WeatherDelay 	    |in minutes
 27 |NASDelay 		    |in minutes
 28 |SecurityDelay 		|in minutes
 29 |LateAircraftDelay 	|in minutes
 
 ### Variável ALVO: ArrDelay
 ### Variáveis PREDITORAS: demais que tiverem boa correlação.

In [1]:
# Spark Session - quando usar DatFrame, é necessário instanciar a sessão do Spark
spSession = SparkSession.builder.master("local").appName("MLLib-LinaerRegression").config("spark.some.config.option", "session").getOrCreate()

In [2]:
# Importando as bibliotecas que serão usadas durante o processo
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

In [4]:
# Descompactando o dataset, que está alocado no arquivo .ZIP

import os.path
import zipfile

PROJECT_PATH = os.getcwd()
DATA_PROJECT = PROJECT_PATH + '\\data\\'
EXT = '.zip'

items = os.listdir(DATA_PROJECT)

newlist = []
for names in items:
    if names.endswith(".zip"):
        newlist.append(names)

for file_to_extract in newlist:
    with zipfile.ZipFile(DATA_PROJECT + file_to_extract, "r") as zip_ref:
        zip_ref.extractall(DATA_PROJECT)
        
newitems = os.listdir(DATA_PROJECT)
#print('Diretório:', DATA_PROJECT, '\n')
print('Items:', newitems)

Items: ['airlinedelaycauses.zip', 'DelayedFlights.csv']


In [5]:
# Carregando os dados e gerando um RDD
FILE = DATA_PROJECT + "DelayedFlights.csv"
DelayedFlightsRDD = sc.textFile(FILE)

# RDD em cache
DelayedFlightsRDD.cache()

In [6]:
# Quantidade de registros do RDD
print('Quantidade de registros no Dataset:', DelayedFlightsRDD.count(), '\n')

print('Exibindo as 5 primeiras linhas:', '\n')
DelayedFlightsRDD.take(5)

Quantidade de registros no Dataset: 1936759 

Exibindo as 5 primeiras linhas: 



[',Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay',
 '0,2008,1,3,4,2003.0,1955,2211.0,2225,WN,335,N712SW,128.0,150.0,116.0,-14.0,8.0,IAD,TPA,810,4.0,8.0,0,N,0,,,,,',
 '1,2008,1,3,4,754.0,735,1002.0,1000,WN,3231,N772SW,128.0,145.0,113.0,2.0,19.0,IAD,TPA,810,5.0,10.0,0,N,0,,,,,',
 '2,2008,1,3,4,628.0,620,804.0,750,WN,448,N428WN,96.0,90.0,76.0,14.0,8.0,IND,BWI,515,3.0,17.0,0,N,0,,,,,',
 '4,2008,1,3,4,1829.0,1755,1959.0,1925,WN,3920,N464WN,90.0,90.0,77.0,34.0,34.0,IND,BWI,515,3.0,10.0,0,N,0,2.0,0.0,0.0,0.0,32.0']

In [7]:
# Lembrando que é possível carregar esse set de dados diretamente como um DATAFRAME, pois se trata de um arquivo .CSV
# Método: com o SparkSQL.
'''
Cria-se o contexto com o SparkSQL

spark.read.csv("some_input_file.csv", header=True, mode="DROPMALFORMED", schema=schema)
'''

# Removendo a primeira linha do arquivo (cabeçalho)
DelayedFlightsRDD_without_HEADER = DelayedFlightsRDD.filter(lambda x: "ArrDelay" not in x)

print('Quantidade de registros no Dataset:', DelayedFlightsRDD_without_HEADER.count(), '\n')

print('Exibindo as 5 primeiras linhas:', '\n')
DelayedFlightsRDD_without_HEADER.take(5)

Quantidade de registros no Dataset: 1936758 

Exibindo as 5 primeiras linhas: 



['0,2008,1,3,4,2003.0,1955,2211.0,2225,WN,335,N712SW,128.0,150.0,116.0,-14.0,8.0,IAD,TPA,810,4.0,8.0,0,N,0,,,,,',
 '1,2008,1,3,4,754.0,735,1002.0,1000,WN,3231,N772SW,128.0,145.0,113.0,2.0,19.0,IAD,TPA,810,5.0,10.0,0,N,0,,,,,',
 '2,2008,1,3,4,628.0,620,804.0,750,WN,448,N428WN,96.0,90.0,76.0,14.0,8.0,IND,BWI,515,3.0,17.0,0,N,0,,,,,',
 '4,2008,1,3,4,1829.0,1755,1959.0,1925,WN,3920,N464WN,90.0,90.0,77.0,34.0,34.0,IND,BWI,515,3.0,10.0,0,N,0,2.0,0.0,0.0,0.0,32.0',
 '5,2008,1,3,4,1940.0,1915,2121.0,2110,WN,378,N726SW,101.0,115.0,87.0,11.0,25.0,IND,JAX,688,4.0,10.0,0,N,0,,,,,']

### Limpeza dos Dados

In [8]:
# Função para quebrar os dados a cada vírgula encontrada
def limpaDados(inputStr) :
    attList = inputStr.split(",") 
    return (attList)

In [9]:
# Executa a função no RDD
DelayedFlightsRDD_01 = DelayedFlightsRDD_without_HEADER.map(limpaDados)
DelayedFlightsRDD_01.cache()

print('Exibindo as 2 primeiras linhas do Datset após a limpeza:', '\n')
DelayedFlightsRDD_01.take(2)

Exibindo as 2 primeiras linhas do Datset após a limpeza: 



[['0',
  '2008',
  '1',
  '3',
  '4',
  '2003.0',
  '1955',
  '2211.0',
  '2225',
  'WN',
  '335',
  'N712SW',
  '128.0',
  '150.0',
  '116.0',
  '-14.0',
  '8.0',
  'IAD',
  'TPA',
  '810',
  '4.0',
  '8.0',
  '0',
  'N',
  '0',
  '',
  '',
  '',
  '',
  ''],
 ['1',
  '2008',
  '1',
  '3',
  '4',
  '754.0',
  '735',
  '1002.0',
  '1000',
  'WN',
  '3231',
  'N772SW',
  '128.0',
  '145.0',
  '113.0',
  '2.0',
  '19.0',
  'IAD',
  'TPA',
  '810',
  '5.0',
  '10.0',
  '0',
  'N',
  '0',
  '',
  '',
  '',
  '',
  '']]

In [96]:
# Avaliando o dataset, percebe-se que há valores MISSING. Logo, será aplicada uma fórmula para popular esses "buracos"
default_value_mussing = sc.broadcast(0.0)

def ajustaValMissing(inputStr):
    global default
    attList = inputStr

    # Substitui o caracter "" (vazio) por um valor padrão pré-definido
    clean_list = []
    #for vetor in inputStr:
        #for posicao in vetor:
    for posicao in inputStr:
        if posicao == "":
            posicao = default_value_mussing.value
        clean_list.append(posicao)

    # Transformando a linha de elementos de entrada em um DataFrame (linha por linha)
    linhas = Row (
                 Index = float(clean_list[0])
                ,Year = float(clean_list[1])
                ,Month = float(clean_list[2])
                ,DayofMonth = float(clean_list[3])
                ,DayOfWeek = float(clean_list[4])
                ,DepTime = float(clean_list[5])
                ,CRSDepTime = float(clean_list[6])
                ,ArrTime = float(clean_list[7])
                ,CRSArrTime = float(clean_list[8])
                ,UniqueCarrier = str(clean_list[9])
                ,FlightNum = float(clean_list[10])
                ,TailNum = str(clean_list[11])
                ,ActualElapsedTime = float(clean_list[12])
                ,CRSElapsedTime = float(clean_list[13])
                ,AirTime = float(clean_list[14])
                ,ArrDelay = float(clean_list[15])
                ,DepDelay = float(clean_list[16])
                ,Origin = str(clean_list[17])
                ,Dest = str(clean_list[18])
                ,Distance = float(clean_list[19])
                ,TaxiIn = float(clean_list[20])
                ,TaxiOut = float(clean_list[21])
                ,Cancelled = float(clean_list[22])
                ,CancellationCode = str(clean_list[23])
                ,Diverted = float(clean_list[24])
                ,CarrierDelay = (clean_list[25])
                ,WeatherDelay = float(clean_list[26])
                ,NASDelay = float(clean_list[27])
                ,SecurityDelay = float(clean_list[28])
                ,LateAircraftDelay = float(clean_list[29])
                )
    return(linhas)

In [97]:
# Executa a função no RDD
DelayedFlightsRDD_02 = DelayedFlightsRDD_01.map(ajustaValMissing)
DelayedFlightsRDD_02.cache()

#print('Exibindo as 2 primeiras linhas do Datset após a limpeza:', '\n')
DelayedFlightsRDD_02.take(2)

[Row(ActualElapsedTime=128.0, AirTime=116.0, ArrDelay=-14.0, ArrTime=2211.0, CRSArrTime=2225.0, CRSDepTime=1955.0, CRSElapsedTime=150.0, CancellationCode='N', Cancelled=0.0, CarrierDelay=0.0, DayOfWeek=4.0, DayofMonth=3.0, DepDelay=8.0, DepTime=2003.0, Dest='TPA', Distance=810.0, Diverted=0.0, FlightNum=335.0, Index=0.0, LateAircraftDelay=0.0, Month=1.0, NASDelay=0.0, Origin='IAD', SecurityDelay=0.0, TailNum='N712SW', TaxiIn=4.0, TaxiOut=8.0, UniqueCarrier='WN', WeatherDelay=0.0, Year=2008.0),
 Row(ActualElapsedTime=128.0, AirTime=113.0, ArrDelay=2.0, ArrTime=1002.0, CRSArrTime=1000.0, CRSDepTime=735.0, CRSElapsedTime=145.0, CancellationCode='N', Cancelled=0.0, CarrierDelay=0.0, DayOfWeek=4.0, DayofMonth=3.0, DepDelay=19.0, DepTime=754.0, Dest='TPA', Distance=810.0, Diverted=0.0, FlightNum=3231.0, Index=1.0, LateAircraftDelay=0.0, Month=1.0, NASDelay=0.0, Origin='IAD', SecurityDelay=0.0, TailNum='N772SW', TaxiIn=5.0, TaxiOut=10.0, UniqueCarrier='WN', WeatherDelay=0.0, Year=2008.0)]

### Análise Exploratória

In [98]:
# Cria um Dataframe
DelayedFlightsRDD_DF = spSession.createDataFrame(DelayedFlightsRDD_02)

In [100]:
# Estatísticas descritivas
DelayedFlightsRDD_DF.select("ActualElapsedTime", "ArrTime", "AirTime", "ArrDelay", "Distance").describe().show()

+-------+------------------+------------------+-----------------+------------------+-----------------+
|summary| ActualElapsedTime|           ArrTime|          AirTime|          ArrDelay|         Distance|
+-------+------------------+------------------+-----------------+------------------+-----------------+
|  count|           1936758|           1936758|          1936758|           1936758|          1936758|
|   mean|132.72859128502373|1604.2296683426634|107.8082605054426|42.017141016069125|765.6861590348407|
| stddev|  72.4347129450919| 555.7685313737034|68.86184470008828| 56.72934640442311|574.4796530720831|
|    min|               0.0|               0.0|              0.0|            -109.0|             11.0|
|    max|            1114.0|            2400.0|           1091.0|            2461.0|           4962.0|
+-------+------------------+------------------+-----------------+------------------+-----------------+



In [104]:
# Encontrando a correlação entre a variável ALVO com as variáveis PREDITORAS
# ALVO: ArrDelay
# PREDITORAS: demais variáveis
for i in DelayedFlightsRDD_DF.columns:
    if not(isinstance(DelayedFlightsRDD_DF.select(i).take(1)[0][0], str)): #verifica se a coluna é NUMÉRICA para correlacionar com o alvo!
        print( "Correlação da variável ArrDelay (ALVO) com a PREDITORA", i, DelayedFlightsRDD_DF.stat.corr('ArrDelay', i))

Correlação da variável ArrDelay (ALVO) com a PREDITORA ActualElapsedTime 0.07345240925097012
Correlação da variável ArrDelay (ALVO) com a PREDITORA AirTime 0.004855533289515905
Correlação da variável ArrDelay (ALVO) com a PREDITORA ArrDelay 1.0
Correlação da variável ArrDelay (ALVO) com a PREDITORA ArrTime -0.04205414123000174
Correlação da variável ArrDelay (ALVO) com a PREDITORA CRSArrTime 0.04288754483576667
Correlação da variável ArrDelay (ALVO) com a PREDITORA CRSDepTime 0.044720833795896996
Correlação da variável ArrDelay (ALVO) com a PREDITORA CRSElapsedTime -0.016546310393610483
Correlação da variável ArrDelay (ALVO) com a PREDITORA Cancelled -0.013392263602579044
Correlação da variável ArrDelay (ALVO) com a PREDITORA CarrierDelay nan
Correlação da variável ArrDelay (ALVO) com a PREDITORA DayOfWeek 0.006166381886928189
Correlação da variável ArrDelay (ALVO) com a PREDITORA DayofMonth 0.003996287904564747
Correlação da variável ArrDelay (ALVO) com a PREDITORA DepDelay 0.94600200

### Pré Processamento
Para efetuar essa etapa, deve-se avaliar as correlações anteriores para identificar quais variáveis preditoras irão fazer parte da construção do modelo.

- Converter variáveis ALVO + PREDITORAS selecionadas para um objeto **LabeledPoint (target, Vector[features])**

In [105]:
# O parâmetro de entrada "row" é uma linha que a RDD inical (DelayedFlightsRDD_02) irá passar.
def transformaVetorDenso(row):
    obj = (row["ArrDelay"], Vectors.dense([row["DepDelay"], row["LateAircraftDelay"], row["NASDelay"]]))
    return(obj)

In [111]:
# Utiliza o RDD, aplica a função
DelayedFlightsRDD_03 = DelayedFlightsRDD_02.map(transformaVetorDenso)
print("Tipo do objeto criado (Vectors.dense):", type(DelayedFlightsRDD_03), '\n')

# Converte a RDD para Dataframe e aplica a função select()
# spSession para trabalhar com DataFrame, onde o "input data" é o próprio vetor denso criado anteriormente
DelayedFlightsRDD_DF = spSession.createDataFrame(DelayedFlightsRDD_03,["label", "features"])
DelayedFlightsRDD_DF.select("label","features").show(15)

Tipo do objeto criado (Vectors.dense): <class 'pyspark.rdd.PipelinedRDD'> 

+-----+---------------+
|label|       features|
+-----+---------------+
|-14.0|  [8.0,0.0,0.0]|
|  2.0| [19.0,0.0,0.0]|
| 14.0|  [8.0,0.0,0.0]|
| 34.0|[34.0,32.0,0.0]|
| 11.0| [25.0,0.0,0.0]|
| 57.0|[67.0,47.0,0.0]|
|  1.0|  [6.0,0.0,0.0]|
| 80.0|[94.0,72.0,0.0]|
| 11.0|  [9.0,0.0,0.0]|
| 15.0|[27.0,12.0,0.0]|
|-15.0|  [9.0,0.0,0.0]|
| 16.0|[28.0,16.0,0.0]|
| 37.0|[51.0,25.0,0.0]|
| 19.0|[32.0,12.0,0.0]|
|  6.0| [20.0,0.0,0.0]|
+-----+---------------+
only showing top 15 rows



In [112]:
DelayedFlightsRDD_03.take(10)

[(-14.0, DenseVector([8.0, 0.0, 0.0])),
 (2.0, DenseVector([19.0, 0.0, 0.0])),
 (14.0, DenseVector([8.0, 0.0, 0.0])),
 (34.0, DenseVector([34.0, 32.0, 0.0])),
 (11.0, DenseVector([25.0, 0.0, 0.0])),
 (57.0, DenseVector([67.0, 47.0, 0.0])),
 (1.0, DenseVector([6.0, 0.0, 0.0])),
 (80.0, DenseVector([94.0, 72.0, 0.0])),
 (11.0, DenseVector([9.0, 0.0, 0.0])),
 (15.0, DenseVector([27.0, 12.0, 0.0]))]

### Machine Learning

In [113]:
# Dados de Treino (70 %) e de Teste (30 %)
(dados_treino, dados_teste) = DelayedFlightsRDD_DF.randomSplit([0.7, 0.3])

In [115]:
print('Quantidade de dados para TREINO: ', dados_treino.count())
print('Quantidade de dados para TESTE: ', dados_teste.count())

Quantidade de dados para TREINO:  1355546
Quantidade de dados para TESTE:  581212


In [122]:
print('Amostra de dados de TREINO:')
print(dados_treino.take(5))
print('')
print('Amostra de dados de TESTE:')
print(dados_teste.take(5))

Amostra de dados de TREINO:
[Row(label=-67.0, features=DenseVector([15.0, 0.0, 0.0])), Row(label=-66.0, features=DenseVector([8.0, 0.0, 0.0])), Row(label=-61.0, features=DenseVector([13.0, 0.0, 0.0])), Row(label=-55.0, features=DenseVector([9.0, 0.0, 0.0])), Row(label=-55.0, features=DenseVector([11.0, 0.0, 0.0]))]

Amostra de dados de TESTE:
[Row(label=-66.0, features=DenseVector([7.0, 0.0, 0.0])), Row(label=-55.0, features=DenseVector([10.0, 0.0, 0.0])), Row(label=-54.0, features=DenseVector([7.0, 0.0, 0.0])), Row(label=-49.0, features=DenseVector([6.0, 0.0, 0.0])), Row(label=-48.0, features=DenseVector([6.0, 0.0, 0.0]))]


In [121]:
# Construindo o modelo com os dados de treino
linearReg = LinearRegression(maxIter = 20)

# Treinando o modelo
modelo = linearReg.fit(dados_treino)

print('Modelo criado:', modelo, '\n')
print('Colunas do modelo:', predictions.columns)

Modelo criado: LinearRegression_4a5ab4c20a683aa7490a 

Colunas do modelo: ['label', 'features', 'prediction']


In [117]:
# Imprimindo as métricas
print("Coeficientes: " + str(modelo.coefficients))
print("Intercept: " + str(modelo.intercept))

Coeficientes: [0.8943593515899491,0.12254708234451549,0.38800425824897244]
Intercept: -2.34826887815541


In [124]:
# Previsões com dados de teste
predictions = modelo.transform(dados_teste)
predictions.select("prediction", "features").show()

+------------------+--------------+
|        prediction|      features|
+------------------+--------------+
|3.9122465829742334| [7.0,0.0,0.0]|
|6.5953246377440795|[10.0,0.0,0.0]|
|3.9122465829742334| [7.0,0.0,0.0]|
|3.0178872313842846| [6.0,0.0,0.0]|
|3.0178872313842846| [6.0,0.0,0.0]|
| 5.700965286154132| [9.0,0.0,0.0]|
|6.5953246377440795|[10.0,0.0,0.0]|
|3.0178872313842846| [6.0,0.0,0.0]|
|3.0178872313842846| [6.0,0.0,0.0]|
|6.5953246377440795|[10.0,0.0,0.0]|
|6.5953246377440795|[10.0,0.0,0.0]|
|3.9122465829742334| [7.0,0.0,0.0]|
|3.0178872313842846| [6.0,0.0,0.0]|
|3.0178872313842846| [6.0,0.0,0.0]|
| 7.489683989334029|[11.0,0.0,0.0]|
| 15.53891815364357|[20.0,0.0,0.0]|
| 5.700965286154132| [9.0,0.0,0.0]|
| 9.278402692513927|[13.0,0.0,0.0]|
|3.0178872313842846| [6.0,0.0,0.0]|
|10.172762044103877|[14.0,0.0,0.0]|
+------------------+--------------+
only showing top 20 rows



In [119]:
# Coeficiente de determinação R2
avaliador = RegressionEvaluator(predictionCol = "prediction", labelCol = "label", metricName = "r2")
avaliador.evaluate(predictions)

0.9263899807304109

### CONCLUSÃO
As variáveis do dataset não tem uma correlação desejada para que o modelo de regressão torne-se consistente. Logo, não é possível explicar o motivo da variação dos valores da variável DEPENTE (ArrDelay) em relação aos valores das variáveis INDEPENDETES selecionadas para o modelo (DepDelay, LateAircraftDelay e NASDelay).

Sugere-se obtenção de novas variáveis, exploração das mesmas e re-criação do modelo de regressão linear. Após, nova avaliação.