## Top 10 arrival airports in the world in 2013 (using the bookings file)

The arrival airport is in the column arr_port, which holds the airport's IATA code.

To get the total number of passengers for an airport, you can sum the "pax" column, grouping by arr_port.

Note that some pax values are negative; these correspond to cancellations. To get the number of passengers who actually booked, sum the column with the negatives included, so that canceled bookings are subtracted from the total.
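As a minimal sketch of this netting effect, the toy data below (made up for illustration, not taken from the bookings file) contains one cancellation for LHR. Summing with the negative row included gives the net passenger count.

```python
import pandas as pd

# Toy data, invented for illustration: LHR has two bookings and one cancellation.
bookings = pd.DataFrame({
    "arr_port": ["LHR", "LHR", "LHR", "CDG"],
    "pax": [2, 1, -1, 3],
})

# Summing with the negative rows included nets out the cancellations.
net_pax = bookings.groupby("arr_port")["pax"].sum()
print(net_pax)
```

Filtering out the negative rows before summing would instead count the gross bookings, which is not what the exercise asks for.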

Print the top 10 arrival airports to standard output, including the number of passengers.

Bonus point: Get the name of the city or airport corresponding to each airport code programmatically (we suggest having a look at [neobase on GitHub](https://github.com/alexprengere/neobase)).

Bonus point: Solve this problem using pandas (instead of any other approach)


Suggestion: follow this plan of action:

* Get familiar with the data
* Select columns of interest
* Decide what to do with NaNs

* Make processing plan
* Develop code that works with a sample

* Adjust the code to work with Big data
* Test big data approach on a sample

* Run program with big data


## 1) Get familiar with data

In [1]:
import pandas as pd

In [2]:
def columns_csv(file, compression, sep):
    """
    Read only the header row to map each column position to its column name.
    
    :param str file: Input file.
    :param str compression: Type of compression.
    :param str sep: Field separator.
    
    :return: a Series of stripped column names, indexed by column position
    """
    return pd.Series(pd.read_csv(file, compression=compression, header=None, nrows=1, sep=sep).T[0]).str.strip()

In [3]:
columns = columns_csv('bookings.csv.bz2', 'bz2', '^')

In [4]:
list_cols_to_use = []
list_cols_to_use.append(columns[columns == 'arr_port'].index[0])
list_cols_to_use.append(columns[columns == 'pax'].index[0])
list_cols_to_use.append(columns[columns == 'cre_date'].index[0])
list_cols_to_use.append(columns[columns == 'act_date'].index[0])
list_cols_to_use.append(columns[columns == 'year'].index[0])

In [5]:
list_cols_to_use

[12, 34, 6, 0, 35]
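A possible shortcut, assuming a pandas version in which `usecols` accepts a callable, is to select columns by their stripped name directly instead of looking up positions first. The sketch below uses a small in-memory file with padded header names (invented for illustration) in place of the real bookings file.

```python
import io
import pandas as pd

# Tiny stand-in for the bookings file; the header names carry trailing spaces.
raw = io.StringIO(
    "act_date  ^arr_port  ^pax^year\n"
    "2013-03-05^LHR^-1^2013\n"
    "2013-03-26^CLT^1^2013\n"
)

wanted = {"arr_port", "pax", "year"}

# The callable receives each raw header name and returns True to keep it.
sample = pd.read_csv(raw, sep="^", usecols=lambda name: name.strip() in wanted)
sample.columns = sample.columns.str.strip()
print(list(sample.columns))
```

This avoids the separate pass over the header row, at the cost of hiding the column positions, which can still be useful when exploring the file.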

### What if we don't want to read the whole file?

Options:

* prepare a sample file beforehand

* use read_csv with the nrows option
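The second option can be sketched as follows. The in-memory file is a made-up stand-in for `bookings.csv.bz2`; with the real file, the same `nrows` argument would be passed together with `compression='bz2'`.

```python
import io
import pandas as pd

# Build a stand-in file with 1000 data rows (invented values).
lines = ["arr_port^pax"] + [f"LHR^{i}" for i in range(1000)]
raw = io.StringIO("\n".join(lines))

# nrows limits how many data rows are parsed; the header does not count.
preview = pd.read_csv(raw, sep="^", nrows=5)
print(preview.shape)
```

A preview like this is usually enough to inspect column names and dtypes, although it says nothing about NaNs or odd values further down the file.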

In [6]:
%%time
df = pd.read_csv('bookings.csv.bz2', compression='bz2', header=0, usecols=list_cols_to_use, sep='^')

CPU times: user 5min 26s, sys: 4.87 s, total: 5min 31s
Wall time: 5min 41s


In [7]:
# Alternatively, read only a sample of the file:
# df = pd.read_csv('bookings.csv.bz2', compression='bz2', header=0, nrows=10000, sep='^')
# and then choose the columns of interest

In [8]:
df.shape

(10000010, 5)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000010 entries, 0 to 10000009
Data columns (total 5 columns):
act_date               object
cre_date               object
arr_port               object
pax                    float64
year                   float64
dtypes: float64(2), object(3)
memory usage: 381.5+ MB


In [10]:
df.head()

Unnamed: 0,act_date,cre_date,arr_port,pax,year
0,2013-03-05 00:00:00,2013-02-22 00:00:00,LHR,-1.0,2013.0
1,2013-03-26 00:00:00,2013-03-26 00:00:00,CLT,1.0,2013.0
2,2013-03-26 00:00:00,2013-03-26 00:00:00,CLT,1.0,2013.0
3,2013-03-26 00:00:00,2013-03-26 00:00:00,SVO,1.0,2013.0
4,2013-03-26 00:00:00,2013-03-26 00:00:00,SVO,1.0,2013.0


In [11]:
df.sample(7)

Unnamed: 0,act_date,cre_date,arr_port,pax,year
648361,2013-05-17 00:00:00,2013-05-17 00:00:00,IST,3.0,2013.0
8599053,2013-09-06 00:00:00,2013-09-06 00:00:00,SCL,1.0,2013.0
5912239,2013-01-29 00:00:00,2013-01-29 00:00:00,LUX,1.0,2013.0
2757129,2013-10-07 00:00:00,2013-10-06 00:00:00,ICN,-1.0,2013.0
7118718,2013-04-24 00:00:00,2013-04-24 00:00:00,TLV,1.0,2013.0
1726465,2013-12-09 00:00:00,2013-04-30 00:00:00,FSZ,-2.0,2013.0
9370283,2013-04-29 00:00:00,2013-04-19 00:00:00,CFU,-3.0,2013.0


In [12]:
df.describe()

Unnamed: 0,pax,year
count,10000010.0,10000009.0
mean,0.4908805,2013.0
std,2.199173,0.0
min,-90.0,2013.0
25%,-1.0,2013.0
50%,1.0,2013.0
75%,1.0,2013.0
max,99.0,2013.0


In [13]:
df.describe(include='all')

Unnamed: 0,act_date,cre_date,arr_port,pax,year
count,10000010,10000010,10000010,10000010.0,10000009.0
unique,365,719,2275,,
top,2013-01-16 00:00:00,2013-01-14 00:00:00,LHR,,
freq,133330,132250,215551,,
mean,,,,0.4908805,2013.0
std,,,,2.199173,0.0
min,,,,-90.0,2013.0
25%,,,,-1.0,2013.0
50%,,,,1.0,2013.0
75%,,,,1.0,2013.0


In [14]:
df.dtypes

act_date                object
cre_date                object
arr_port                object
pax                    float64
year                   float64
dtype: object

In [15]:
df.isnull().sum()

act_date               0
cre_date               0
arr_port               0
pax                    1
year                   1
dtype: int64

In [16]:
df.count()

act_date               10000010
cre_date               10000010
arr_port               10000010
pax                    10000009
year                   10000009
dtype: int64

In [17]:
pd.set_option('display.max_columns', None)

In [18]:
non_null_counts = df.count()

Clean the column names

In [19]:
df.columns = df.columns.str.strip()
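The same kind of padding can affect the values, which is why later cells call `strip()` on the airport codes. The sketch below, on invented toy data, shows one way to strip the codes before grouping so that a padded and an unpadded spelling of the same airport end up in one group.

```python
import pandas as pd

# Toy data, invented for illustration: the same airport with and without padding.
df_toy = pd.DataFrame({
    "arr_port": ["LHR     ", "LHR", "CDG     "],
    "pax": [1, 2, 3],
})

# Strip the codes first; otherwise "LHR     " and "LHR" would be separate groups.
df_toy["arr_port"] = df_toy["arr_port"].str.strip()
totals_toy = df_toy.groupby("arr_port")["pax"].sum()
print(totals_toy)
```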

## 2) Select the columns of interest 

In [20]:
df = df[['pax', 'arr_port']]

## 3) What to do with NaN?



Everything might be fine in a sample, but we should be prepared for NaNs in the full data.

In [21]:
df = df.dropna()
df.isnull().sum()

pax         0
arr_port    0
dtype: int64
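When more columns are kept than the two used here, a plain `dropna()` would also discard rows that are only missing an irrelevant field. A minimal sketch of a narrower alternative, on invented toy data, restricts the check to the columns the computation needs via the `subset` argument.

```python
import pandas as pd

nan = float("nan")

# Toy data, invented for illustration; cre_date is not needed for the totals.
toy = pd.DataFrame({
    "arr_port": ["LHR", "CDG", None, "JFK"],
    "pax": [1.0, nan, 2.0, 3.0],
    "cre_date": [None, "2013-01-01", "2013-01-02", "2013-01-03"],
})

# Only rows missing arr_port or pax are dropped; a missing cre_date is tolerated.
clean = toy.dropna(subset=["arr_port", "pax"])
print(clean)
```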

## 4) Make processing plan
1) get only the bookings from 2013

2) group by arr_port, sum

3) sort 

4) get top 10
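The four steps above can be sketched as one small function. The name `top_arrivals` and the toy data are hypothetical, chosen only for illustration.

```python
import pandas as pd

def top_arrivals(bookings, year, n=10):
    """Return the n arrival airports with the highest net pax for a given year."""
    in_year = bookings[bookings["year"] == year]            # 1) filter by year
    totals = in_year.groupby("arr_port")["pax"].sum()       # 2) group and sum
    return totals.sort_values(ascending=False).head(n)      # 3) sort, 4) top n

# Toy data, invented for illustration; the last row belongs to another year.
toy = pd.DataFrame({
    "arr_port": ["LHR", "CDG", "LHR", "JFK", "CDG"],
    "pax": [2, 5, 1, 1, -1],
    "year": [2013, 2013, 2013, 2013, 2012],
})

top2 = top_arrivals(toy, 2013, n=2)
print(top2)
```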

#### 4.1) Get only the bookings from 2013

In [22]:
# All rows are from 2013, so no filtering is needed here.
# If that were not the case, we would filter with:
# bookings_2013 = df[df['year'] == 2013]

#### 4.2) Group by arr_port, sum, sort and take the top 10

In [23]:
max_ten = df.groupby(by='arr_port').sum().sort_values(by='pax', ascending=False).head(10)

In [24]:
max_ten

arr_port,pax
LHR,88809.0
MCO,70930.0
LAX,70530.0
LAS,69630.0
JFK,66270.0
CDG,64490.0
BKK,59460.0
MIA,58150.0
SFO,58000.0
DXB,55590.0


## 5) Adjust the code to work with Big data


Hint: check out https://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking

In [25]:
reader = pd.read_csv('bookings.csv.bz2', compression='bz2', nrows=100000, sep='^', chunksize=20000)
reader

<pandas.io.parsers.TextFileReader at 0x7f78f1f21198>

In [26]:
partial_results = []
for chunk in reader:
    bookings_2013 = chunk[chunk['year'] == 2013] 
    totals = bookings_2013.groupby('arr_port')['pax'].sum()
    partial_results.append(totals)

In [27]:
[partial.head() for partial in partial_results] 

[arr_port
 AAL          3
 ABJ          3
 ABQ         18
 ABV          9
 ABY          1
 Name: pax, dtype: int64, arr_port
 AAE         0
 AAL         1
 AAQ         1
 ABE         4
 ABJ         3
 Name: pax, dtype: int64, arr_port
 AAE          2
 AAL          8
 ABE         12
 ABJ          9
 ABQ          7
 Name: pax, dtype: int64, arr_port
 AAR         2
 ABJ         0
 ABQ         2
 ABS         6
 ABV         6
 Name: pax, dtype: int64, arr_port
 AAL         4
 AAQ         4
 AAR         2
 ABJ         6
 ABQ         6
 Name: pax, dtype: int64]

In [28]:
all_results = pd.concat(partial_results)
all_results.index = all_results.index.str.strip()
all_results['ABQ']

ABQ    18
ABQ    23
ABQ     7
ABQ     2
ABQ     6
Name: pax, dtype: int64

In [29]:
totals = all_results.groupby('arr_port').sum()
totals.sort_values(ascending=False).head(10)

arr_port
LHR    1006
MCO     861
JFK     795
LAX     761
BKK     747
LAS     732
SFO     705
ORD     686
CDG     676
DXB     587
Name: pax, dtype: int64
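An alternative to collecting every partial result and concatenating them at the end is to keep a single running total that is updated after each chunk. The sketch below shows the idea on a small in-memory file (invented values) with a deliberately tiny `chunksize`; it is not the approach used in the cells above.

```python
import io
import pandas as pd

# Stand-in for the bookings file, invented for illustration.
raw = io.StringIO(
    "arr_port^pax^year\n"
    "LHR^2^2013\n"
    "CDG^1^2013\n"
    "LHR^-1^2013\n"
    "JFK^5^2013\n"
    "CDG^3^2013\n"
)

running = None
for chunk in pd.read_csv(raw, sep="^", chunksize=2):
    part = chunk[chunk["year"] == 2013].groupby("arr_port")["pax"].sum()
    # fill_value=0 keeps airports that appear in only one of the two operands.
    running = part if running is None else running.add(part, fill_value=0)

top = running.sort_values(ascending=False).head(2)
print(top)
```

This way only one Series, with at most one entry per airport, stays in memory, instead of one Series per chunk.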

## Final solution, processing the whole file in chunks

In [30]:
%%time
from datetime import datetime
reader = pd.read_csv('bookings.csv.bz2', compression='bz2', sep='^', chunksize=200000)

partial_results = []
for nchunk, chunk in enumerate(reader):
    print('Starting with chunk %.2d at %s' % (nchunk, datetime.now()))
    bookings_2013 = chunk[chunk['year'] == 2013][['arr_port', 'pax']].dropna()
    totals = bookings_2013.groupby('arr_port')['pax'].sum()
    partial_results.append(totals)
    
all_results = pd.concat(partial_results)
all_results.index = all_results.index.str.strip()

totals = all_results.groupby('arr_port').sum()
totals.sort_values(ascending=False).head(10)

Starting with chunk 00 at 2018-12-07 17:00:14.666041
Starting with chunk 01 at 2018-12-07 17:00:23.104304
Starting with chunk 02 at 2018-12-07 17:00:31.899589
Starting with chunk 03 at 2018-12-07 17:00:40.797776
Starting with chunk 04 at 2018-12-07 17:00:49.953816
Starting with chunk 05 at 2018-12-07 17:00:59.160410
Starting with chunk 06 at 2018-12-07 17:01:09.163172
Starting with chunk 07 at 2018-12-07 17:01:18.005263
Starting with chunk 08 at 2018-12-07 17:01:26.978722
Starting with chunk 09 at 2018-12-07 17:01:35.884132
Starting with chunk 10 at 2018-12-07 17:01:44.767137
Starting with chunk 11 at 2018-12-07 17:01:53.368129
Starting with chunk 12 at 2018-12-07 17:02:02.723806
Starting with chunk 13 at 2018-12-07 17:02:11.527079
Starting with chunk 14 at 2018-12-07 17:02:20.139411
Starting with chunk 15 at 2018-12-07 17:02:29.086175
Starting with chunk 16 at 2018-12-07 17:02:38.011768
Starting with chunk 17 at 2018-12-07 17:02:46.700628
Starting with chunk 18 at 2018-12-07 17:02:56.



Starting with chunk 25 at 2018-12-07 17:04:23.471203
Starting with chunk 26 at 2018-12-07 17:04:33.033892
Starting with chunk 27 at 2018-12-07 17:04:41.781405
Starting with chunk 28 at 2018-12-07 17:04:50.450917
Starting with chunk 29 at 2018-12-07 17:04:59.077423
Starting with chunk 30 at 2018-12-07 17:05:08.353215
Starting with chunk 31 at 2018-12-07 17:05:17.651714
Starting with chunk 32 at 2018-12-07 17:05:27.283362
Starting with chunk 33 at 2018-12-07 17:05:36.092714
Starting with chunk 34 at 2018-12-07 17:05:45.022410
Starting with chunk 35 at 2018-12-07 17:05:55.380035
Starting with chunk 36 at 2018-12-07 17:06:06.542088
Starting with chunk 37 at 2018-12-07 17:06:17.770574
Starting with chunk 38 at 2018-12-07 17:06:30.247528
Starting with chunk 39 at 2018-12-07 17:06:44.746047
Starting with chunk 40 at 2018-12-07 17:07:03.870320
Starting with chunk 41 at 2018-12-07 17:07:18.149183
Starting with chunk 42 at 2018-12-07 17:07:30.457501
Starting with chunk 43 at 2018-12-07 17:07:39.

In [31]:
#!pip install neobase

In [32]:
from neobase import NeoBase

### Airport Names

We use [neobase](https://github.com/alexprengere/neobase), the successor to GeoBases. It is a reference-data library with a particular focus on airports.

In [33]:
nb = NeoBase()

In [34]:
max_ten

arr_port,pax
LHR,88809.0
MCO,70930.0
LAX,70530.0
LAS,69630.0
JFK,66270.0
CDG,64490.0
BKK,59460.0
MIA,58150.0
SFO,58000.0
DXB,55590.0


In [35]:
mt = max_ten.reset_index()

In [36]:
# Iterate over the column itself so the list always has one entry per row
mt['arr_port_city'] = [nb.get(code.strip(), 'city_name_list')[0] for code in mt['arr_port']]

In [37]:
mt['arr_port'].apply(lambda x: nb.get(x.strip(), 'city_name_list')[0])

0           London
1          Orlando
2      Los Angeles
3        Las Vegas
4    New York City
5            Paris
6          Bangkok
7            Miami
8    San Francisco
9            Dubai
Name: arr_port, dtype: object

In [39]:
[ nb.get(code.strip())['name'] for code in max_ten.index ]

['London Heathrow Airport',
 'Orlando International Airport',
 'Los Angeles International Airport',
 'McCarran International Airport',
 'John F. Kennedy International Airport',
 'Paris Charles de Gaulle Airport',
 'Suvarnabhumi Airport',
 'Miami International Airport',
 'San Francisco International Airport',
 'Dubai International Airport']
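The lookups above assume every code is known to the reference data. A minimal sketch of a more defensive variant follows; `city_names` is a hypothetical helper, and a plain dict stands in for NeoBase so the example runs without the library. It assumes the lookup raises `KeyError` for an unknown code, as a dict does.

```python
def city_names(codes, lookup, missing="unknown"):
    """Map airport codes to names, tolerating padding and unknown codes."""
    names = []
    for code in codes:
        try:
            names.append(lookup(code.strip()))
        except KeyError:
            # Assumption: the lookup signals an unknown code with KeyError.
            names.append(missing)
    return names

# Stand-in reference table, invented for illustration.
stand_in = {"LHR": "London", "CDG": "Paris"}
result = city_names(["LHR   ", "CDG", "XXX"], stand_in.__getitem__)
print(result)
```

With neobase, the `lookup` argument could be something like `lambda code: nb.get(code, 'city_name_list')[0]`, reusing the call shown in the cells above.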

In [43]:
totals.sort_values(ascending=False).head(10)

arr_port
LHR    88809.0
MCO    70930.0
LAX    70530.0
LAS    69630.0
JFK    66270.0
CDG    64490.0
BKK    59460.0
MIA    58150.0
SFO    58000.0
DXB    55590.0
Name: pax, dtype: float64