Flight Data Processing and Simple Analysis

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as ticker

from datetime import datetime

from IPython.display import display

%matplotlib inline

sns.set(font="simhei")

## Reading the data

Read the raw data. The original file name was in Chinese and was renamed to English by hand.

In [42]:
train_f = pd.read_csv('input/train/raw/fight-201505-201705.csv', encoding='gb2312', low_memory=False)

display(train_f.head())
train_f.info()

Unnamed: 0,出发机场,到达机场,航班编号,计划起飞时间,计划到达时间,实际起飞时间,实际到达时间,飞机编号,航班是否取消
0,HGH,DLC,CZ6328,1453809600,1453817100,1453813000.0,1453819000.0,1.0,正常
1,SHA,XMN,FM9261,1452760800,1452767100,1452763000.0,1452768000.0,2.0,正常
2,CAN,WNZ,ZH9597,1453800900,1453807500,1453802000.0,1453807000.0,3.0,正常
3,SHA,ZUH,9C8819,1452120600,1452131100,1452121000.0,1452130000.0,4.0,正常
4,SHE,TAO,TZ185,1452399000,1452406800,1452400000.0,1452404000.0,5.0,正常


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7518638 entries, 0 to 7518637
Data columns (total 9 columns):
出发机场      object
到达机场      object
航班编号      object
计划起飞时间    int64
计划到达时间    int64
实际起飞时间    float64
实际到达时间    float64
飞机编号      float64
航班是否取消    object
dtypes: float64(3), int64(2), object(4)
memory usage: 516.3+ MB


In [43]:
train_f.columns = ['Departure', 'Destination', 'FLTNo', 'PDepartureTime', 'PArrivalTime', 'ADepartureTime', 'AArrivalTime', 'Id', 'Cancel']

train_f.head()

Unnamed: 0,Departure,Destination,FLTNo,PDepartureTime,PArrivalTime,ADepartureTime,AArrivalTime,Id,Cancel
0,HGH,DLC,CZ6328,1453809600,1453817100,1453813000.0,1453819000.0,1.0,正常
1,SHA,XMN,FM9261,1452760800,1452767100,1452763000.0,1452768000.0,2.0,正常
2,CAN,WNZ,ZH9597,1453800900,1453807500,1453802000.0,1453807000.0,3.0,正常
3,SHA,ZUH,9C8819,1452120600,1452131100,1452121000.0,1452130000.0,4.0,正常
4,SHE,TAO,TZ185,1452399000,1452406800,1452400000.0,1452404000.0,5.0,正常


## Null values

Departure, Destination, FLTNo, PDepartureTime, and PArrivalTime contain no nulls.

In [44]:
train_f[train_f['Departure'].isnull()]

Unnamed: 0,Departure,Destination,FLTNo,PDepartureTime,PArrivalTime,ADepartureTime,AArrivalTime,Id,Cancel


In [45]:
train_f[train_f['Destination'].isnull()]

Unnamed: 0,Departure,Destination,FLTNo,PDepartureTime,PArrivalTime,ADepartureTime,AArrivalTime,Id,Cancel


In [46]:
train_f[train_f['FLTNo'].isnull()]

Unnamed: 0,Departure,Destination,FLTNo,PDepartureTime,PArrivalTime,ADepartureTime,AArrivalTime,Id,Cancel


In [47]:
train_f[train_f['PDepartureTime'].isnull()]

Unnamed: 0,Departure,Destination,FLTNo,PDepartureTime,PArrivalTime,ADepartureTime,AArrivalTime,Id,Cancel


In [48]:
train_f[train_f['PArrivalTime'].isnull()]

Unnamed: 0,Departure,Destination,FLTNo,PDepartureTime,PArrivalTime,ADepartureTime,AArrivalTime,Id,Cancel
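
The five per-column checks above can also be collapsed into a single call. The following is a minimal sketch on a toy frame, not the notebook's actual data: `toy`, `null_counts`, and `complete_cols` are illustrative names, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_f; the values are made up for illustration.
toy = pd.DataFrame({
    'Departure': ['HGH', 'SHA', 'CAN'],
    'Destination': ['DLC', 'XMN', 'WNZ'],
    'PDepartureTime': [1453809600, 1452760800, 1453800900],
    'ADepartureTime': [1453813000.0, np.nan, 1453802000.0],
})

# Number of null values in each column, computed in one pass.
null_counts = toy.isnull().sum()
print(null_counts)

# Names of the columns that contain no nulls at all.
complete_cols = null_counts[null_counts == 0].index.tolist()
print(complete_cols)
```

On the real frame, the same call would be `train_f.isnull().sum()`.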


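ADepartureTime has many nulls. The vast majority belong to cancelled flights, which makes sense: a cancelled flight has no departure time. These can stay as NaN, because this column will later be converted into other features and dropped.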

In [49]:
# Note: DataFrame.size counts cells (rows x columns), not rows.
print('ADepartureTime NaN: ', train_f[train_f['ADepartureTime'].isnull()].size)

train_f[train_f['ADepartureTime'].isnull()].head()

ADepartureTime NaN:  2912418


Unnamed: 0,Departure,Destination,FLTNo,PDepartureTime,PArrivalTime,ADepartureTime,AArrivalTime,Id,Cancel
5,DLC,NNG,ZH953Z,1452385800,1452401700,,,6.0,取消
6,HGH,SZX,CZ6327,1452591900,1452600900,,,,取消
55,HAK,SZX,HU7013,1452032700,1452037500,,,55.0,取消
78,WUX,SZX,MF1094,1452855300,1452864900,,,,取消
98,CAN,URC,CZ6884,1453976100,1453996200,,,,取消
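
`DataFrame.size` returns rows × columns, so the figure printed above is a cell count: with the 9 columns the frame has at this point, it is 9 times the number of matching rows. A minimal sketch of the difference on a made-up toy frame (`toy`, `mask`, `n_cells`, and `n_rows` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy frame with 3 columns; two rows have no actual departure time.
toy = pd.DataFrame({
    'ADepartureTime': [1453813000.0, np.nan, np.nan],
    'Cancel': ['正常', '取消', '取消'],
    'Id': [1.0, 6.0, np.nan],
})

mask = toy['ADepartureTime'].isnull()

n_cells = toy[mask].size   # rows * columns of the filtered frame
n_rows = int(mask.sum())   # number of rows where the mask is True

print('cells:', n_cells, 'rows:', n_rows)
```

Summing the boolean mask also avoids materialising the filtered frame, which matters on several million rows.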


The flights below were not cancelled yet have no actual departure time, which is harder to deal with.

In [50]:
# Note: .size is a cell count (rows x columns), not a row count.
print('ADepartureTime NaN: ', train_f[train_f['ADepartureTime'].isnull() & (train_f['Cancel'] != '取消')].size)

train_f[train_f['ADepartureTime'].isnull() & (train_f['Cancel'] != '取消')].head()

ADepartureTime NaN:  15444


Unnamed: 0,Departure,Destination,FLTNo,PDepartureTime,PArrivalTime,ADepartureTime,AArrivalTime,Id,Cancel
183,HGH,CKG,OQ2380,1453606500,1453616400,,1453607000.0,176.0,正常
2767,FOC,PVG,MU5506,1452815700,1452820500,,1452822000.0,1239.0,正常
13891,LZH,CTU,EU2202,1453044600,1453050900,,1453052000.0,1559.0,正常
15737,KWE,DLC,HU762Z,1451865600,1451883000,,1451881000.0,498.0,正常
17352,TCZ,KMG,KY8288,1453020000,1453023600,,1453024000.0,168.0,正常


As a workaround, look at how late each of these flights arrived, and assume for now that the departure was delayed by the same amount.

In [51]:
# Fill in the actual departure time from the arrival delay
train_f['ADelay'] = train_f['AArrivalTime'] - train_f['PArrivalTime']

cond = train_f['ADepartureTime'].isnull() & (train_f['Cancel'] != '取消')
train_f.loc[cond, 'ADepartureTime'] = train_f.loc[cond, 'PDepartureTime'] + train_f.loc[cond, 'ADelay']

In [52]:
train_f[train_f['ADepartureTime'].isnull() & (train_f['Cancel'] != '取消')].head()

Unnamed: 0,Departure,Destination,FLTNo,PDepartureTime,PArrivalTime,ADepartureTime,AArrivalTime,Id,Cancel,ADelay
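
To make the imputation rule concrete, here is a small sketch on a made-up toy frame. It uses the same column names and the same logic as the cell above, but the timestamps are invented for readability.

```python
import numpy as np
import pandas as pd

# Toy data: row 0 is complete, row 1 is a normal flight with no
# actual departure time, row 2 is a cancelled flight.
toy = pd.DataFrame({
    'PDepartureTime': [1000, 2000, 3000],
    'PArrivalTime': [1600, 2600, 3600],
    'ADepartureTime': [1010.0, np.nan, np.nan],
    'AArrivalTime': [1620.0, 2900.0, np.nan],
    'Cancel': ['正常', '正常', '取消'],
})

# Arrival delay in seconds (negative when the flight arrived early).
toy['ADelay'] = toy['AArrivalTime'] - toy['PArrivalTime']

# Only impute for flights that were not cancelled.
cond = toy['ADepartureTime'].isnull() & (toy['Cancel'] != '取消')
toy.loc[cond, 'ADepartureTime'] = (
    toy.loc[cond, 'PDepartureTime'] + toy.loc[cond, 'ADelay']
)

print(toy[['PDepartureTime', 'ADepartureTime', 'ADelay', 'Cancel']])
```

Row 1 arrives 300 seconds late, so its departure is imputed as 2000 + 300 = 2300, while the cancelled row keeps its NaN. When a flight arrives early, ADelay is negative and the imputed departure falls before the scheduled one, which is a limitation of this assumption.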


AArrivalTime also has many nulls. Fortunately, every flight with a null AArrivalTime was cancelled, so these nulls need no handling.

In [54]:
# Note: .size is a cell count (rows x columns), not a row count.
print('AArrivalTime NaN: ', train_f[train_f['AArrivalTime'].isnull()].size)

print('AArrivalTime NaN and not Cancel: ', train_f[train_f['AArrivalTime'].isnull() & (train_f['Cancel'] != '取消')].size)


AArrivalTime NaN:  3231240
AArrivalTime NaN and not Cancel:  0


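About 27,000 flights have a null Id without being cancelled, which is a real headache. (The 273,610 printed below is a cell count from `DataFrame.size`; across the frame's 10 columns it corresponds to 27,361 rows.) The idea is to use the flight number and the departure and arrival airports to find each flight's preceding and following flights, then fill the gap with their Id. The data set is very large, though, so the actual implementation still needs some thought.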

In [55]:
# Note: .size is a cell count (rows x columns), not a row count.
print('Id NaN: ', train_f[train_f['Id'].isnull() & (train_f['Cancel'] != '取消')].size)

train_f[train_f['Id'].isnull() & (train_f['Cancel'] != '取消')].head()

Id NaN:  273610


Unnamed: 0,Departure,Destination,FLTNo,PDepartureTime,PArrivalTime,ADepartureTime,AArrivalTime,Id,Cancel,ADelay
2720,DNH,LHW,MU7582,1452936900,1452942600,1452936000.0,1452940000.0,,正常,-2160.0
9012,JHG,CKG,PN6216A,1454255400,1454265900,1454260000.0,1454269000.0,,正常,3420.0
15058,DNH,XIY,GS7680A,1454002200,1454013300,1454003000.0,1454013000.0,,正常,120.0
15792,CKG,KWE,CZ3641A,1452362400,1452365100,1452369000.0,1452372000.0,,正常,6540.0
16320,CAN,HAK,CZ6778A,1453479300,1453483500,1453480000.0,1453484000.0,,正常,540.0


In [None]:
# TODO: fill null Id values (not yet implemented)
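
One possible way to approach the unfinished Id fill is sketched below on a made-up toy frame. It is only one reading of the idea described above, not the notebook's implementation. It assumes that flights sharing a flight number and route are flown by the same aircraft, which may not hold in the real data. `toy`, `keys`, and `IdFilled` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy data: CZ6328 has one known Id, MU5506 has none.
toy = pd.DataFrame({
    'FLTNo': ['CZ6328', 'CZ6328', 'CZ6328', 'MU5506', 'MU5506'],
    'Departure': ['HGH', 'HGH', 'HGH', 'FOC', 'FOC'],
    'Destination': ['DLC', 'DLC', 'DLC', 'PVG', 'PVG'],
    'PDepartureTime': [300, 100, 200, 100, 200],
    'Id': [np.nan, 1.0, np.nan, np.nan, np.nan],
})

# ASSUMPTION: same flight number + same route => same aircraft Id.
keys = ['FLTNo', 'Departure', 'Destination']

# Order each group by scheduled departure so that "previous" and
# "next" refer to the neighbouring flights in time.
toy = toy.sort_values(keys + ['PDepartureTime'])

# Within each group, take the Id of the previous flight, and fall
# back to the next flight when there is no earlier one.
toy['IdFilled'] = toy.groupby(keys)['Id'].transform(
    lambda s: s.ffill().bfill()
)

print(toy)
```

Groups with no known Id at all (MU5506 here) stay null and would need another rule. On the full data set, the built-in `GroupBy.ffill` and `GroupBy.bfill` methods would likely be faster than a Python lambda.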

Cancel has no nulls and takes only two values, "正常" (normal) and "取消" (cancelled). Here normal is replaced with 0 and cancelled with 1.

In [56]:
train_f['Cancel'].value_counts()

正常    7195285
取消     323353
Name: Cancel, dtype: int64

In [57]:
train_f['Cancel'] = train_f['Cancel'].map({'正常': 0, '取消': 1})
train_f.head()

Unnamed: 0,Departure,Destination,FLTNo,PDepartureTime,PArrivalTime,ADepartureTime,AArrivalTime,Id,Cancel,ADelay
0,HGH,DLC,CZ6328,1453809600,1453817100,1453813000.0,1453819000.0,1.0,0,2280.0
1,SHA,XMN,FM9261,1452760800,1452767100,1452763000.0,1452768000.0,2.0,0,840.0
2,CAN,WNZ,ZH9597,1453800900,1453807500,1453802000.0,1453807000.0,3.0,0,-660.0
3,SHA,ZUH,9C8819,1452120600,1452131100,1452121000.0,1452130000.0,4.0,0,-1260.0
4,SHE,TAO,TZ185,1452399000,1452406800,1452400000.0,1452404000.0,5.0,0,-2460.0
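
`Series.map` with a dict turns any value missing from the dict into NaN, so it is worth checking that nothing slipped through. A minimal sketch on a toy series, where '延误' is a made-up third value that does not occur in the real data:

```python
import pandas as pd

# Toy series; '延误' is a hypothetical label, used only to show
# what happens to a value that the mapping does not cover.
cancel = pd.Series(['正常', '取消', '正常', '延误'])

mapped = cancel.map({'正常': 0, '取消': 1})

# Original labels that the mapping turned into NaN.
unmapped = cancel[mapped.isnull()].unique().tolist()
print(unmapped)
```

An empty `unmapped` list means that every label was covered by the mapping.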


## Preceding and following flights