# PoC: Airlines Data Cleaning and Ingestion

In [1]:
import pandas as pd
import numpy as np

PROJ_PATH = "/data/"

DATA_IN_PREFIX = "orig/"
DATA_OUT_PREFIX = "processed/"

# Load data from Excel sheets.
airlines_raw = pd.read_excel(PROJ_PATH + DATA_IN_PREFIX + "20180102 Airlines.xlsx")
airports_raw = pd.read_excel(PROJ_PATH + DATA_IN_PREFIX + "20180102 Airports.xlsx")
routes_raw = pd.read_excel(PROJ_PATH + DATA_IN_PREFIX + "20180102 Routes.xlsx")

## Data Exploration

Let's do some basic exploration. What does the head of the airlines data look like?

In [4]:
airlines_raw.head()


| | Airline ID | Airline Name | Airline Alias | IATA Code | ICAO Code | Airline Callsign | Airline Country | Airline Operational? |
|---|---|---|---|---|---|---|---|---|
| 0 | -1 | Unknown | \N | - | | \N | \N | Y |
| 1 | 1 | Private flight | \N | - | | | | Y |
| 2 | 2 | 135 Airways | \N | | GNL | GENERAL | United States | N |
| 3 | 3 | 1Time Airline | \N | 1T | RNX | NEXTIME | South Africa | Y |
| 4 | 4 | 2 Sqn No 1 Elementary Flying Training School | \N | | WYT | | UK | N |


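Looks like the first row is dodgy: its Airline ID is -1 and its name is "Unknown", so it is probably a placeholder rather than a real airline.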
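
The `\N` and `-` entries in the head look like stand-ins for missing values rather than real data. Below is a minimal sketch of one way to turn them into NaN so that a null check can see them. It assumes (the source does not confirm this) that both strings always mean "no value". The toy frame and the names `PLACEHOLDERS` and `normalise_missing` are illustrative only.

```python
import pandas as pd

# Toy frame with a few values copied from the head() output above.
sample = pd.DataFrame({
    "Airline ID": [-1, 1, 3],
    "Airline Name": ["Unknown", "Private flight", "1Time Airline"],
    "Airline Alias": ["\\N", "\\N", "\\N"],
    "IATA Code": ["-", "-", "1T"],
})

# Assumption: these strings are placeholders for "no value".
PLACEHOLDERS = ["\\N", "-"]

def normalise_missing(df, placeholders=PLACEHOLDERS):
    """Return a copy of df where placeholder strings are replaced by NaN."""
    # isin() flags the placeholder cells; mask() sets the flagged cells to NaN.
    return df.mask(df.isin(placeholders))

cleaned = normalise_missing(sample)

# Percentage of missing values per column after normalisation.
print((cleaned.isnull().mean() * 100).round())
```

If the placeholder list turns out to be right for the whole dataset, the same strings could probably be passed to `pd.read_excel` via its `na_values` parameter instead, so the conversion happens at load time.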

Next, let's look at the data types and the percentage of missing (i.e. **NaN** / **null**) values per column:

In [6]:
## Look at the airlines data types.
print("Data types:\n")
print(airlines_raw.dtypes)

print("\n\n")

# What percentages of actual missing data do we have?
# Note: Missing means NaN; there are other types of flaws in the data which we'll get to later.
def print_perc_missing(df):
    """
    Prints what percentage of data is missing for each column.
    
    CAVEAT: Only checks for nan, not other data defects, e.g. "\\N" entries etc.
    """
    print(round(df.isnull().sum() / df.shape[0] * 100))
    
print("\nAirlines data\n")
print_perc_missing(airlines_raw)

## RES:
## Airline aliases are flawed: 8% NaN, and the head also shows \N placeholders.
## Most IATA codes are unusable: 75% NaN, plus other issues (e.g. "-" entries).
## ICAO codes have NaNs (1%) and other problems.
## Further, 13% of airline callsigns are NaN.


#%%
print("\nAirports data\n\n")
print_perc_missing(airports_raw)
## Airport cities seem to be an issue (1% NaN).
## Timezone offset too (4% NaN).
## NOTE: Some IATA codes are \N --> not NaN! Need to fix this too.
## Need to fix cities if we want to use them as a key.
## There are two unnamed columns. Could probably just delete those as they only contain NaNs.

#%%
print("\nRoutes data\n\n")
print_perc_missing(routes_raw)

## RES: Codeshare is totally useless. Rest seems OK for now.

Data types:

Airline ID               int64
Airline Name            object
Airline Alias           object
IATA Code               object
ICAO Code               object
Airline Callsign        object
Airline Country         object
Airline Operational?    object
dtype: object




Airlines data

Airline ID               0.0
Airline Name             0.0
Airline Alias            8.0
IATA Code               75.0
ICAO Code                1.0
Airline Callsign        13.0
Airline Country          0.0
Airline Operational?     0.0
dtype: float64

Airports data


Airport ID           0.0
Airport Name         0.0
Airport City         1.0
Airport Country      0.0
IATA Code            0.0
ICAO Code            0.0
Latitude             0.0
Longitude            0.0
Altitude             0.0
Timezone Offset      4.0
DST                  0.0
Timezone             0.0
Unnamed: 12        100.0
Unnamed: 13        100.0
dtype: float64

Routes data


Airline                    0.0
Airline ID                 0.0
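

The airports output shows two `Unnamed` columns that are 100% NaN, so dropping them should lose nothing. Here is a minimal sketch on a toy frame. `airports_demo` and its values are made up for illustration, and only the column names come from the output above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for airports_raw: two real columns plus two empty ones.
airports_demo = pd.DataFrame({
    "Airport ID": [1, 2, 3],
    "Airport City": ["A", np.nan, "C"],
    "Unnamed: 12": [np.nan] * 3,
    "Unnamed: 13": [np.nan] * 3,
})

# Drop only columns where every value is NaN; partially filled columns are kept.
airports_trimmed = airports_demo.dropna(axis="columns", how="all")
print(list(airports_trimmed.columns))
```

Using `how="all"` rather than naming the columns keeps the step robust if the spreadsheet export produces a different number of empty trailing columns.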


Now let's check the dimensions (rows, columns) of each dataset:

In [9]:
all_dta = [airlines_raw, airports_raw, routes_raw]

## Print dimensions (rows, columns) of each dataset.
print("\nDimensions...")
for df in all_dta:
    print(df.shape)


Dimensions...
(6162, 8)
(7184, 14)
(67663, 9)

### Results

* Routes has the largest number of entries, ~68k rows.
* It is probably also the part that has to scale the most, since new routes will be created and existing ones updated.
* Need to understand the unit of scale for new routes (e.g. how many are added or changed per period).
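
The shape check can also be collected into one small table, which makes the datasets easier to compare side by side. This is a hedged sketch: `summarise_frames` and the dataset labels are hypothetical names, not part of the notebook, and the demo frames stand in for `airlines_raw`, `airports_raw` and `routes_raw`.

```python
import pandas as pd

def summarise_frames(frames):
    """Build a one-row-per-dataset summary of row and column counts.

    `frames` maps a dataset label to its DataFrame.
    """
    records = [
        {"dataset": name, "rows": df.shape[0], "columns": df.shape[1]}
        for name, df in frames.items()
    ]
    # Largest dataset first, so the biggest table is easy to spot.
    return (
        pd.DataFrame(records)
        .set_index("dataset")
        .sort_values("rows", ascending=False)
    )

# Stand-in frames for demonstration only.
demo = {
    "small": pd.DataFrame({"a": [1, 2]}),
    "large": pd.DataFrame({"a": range(5), "b": range(5)}),
}
summary = summarise_frames(demo)
print(summary)
```

In the notebook this would be called as `summarise_frames({"airlines": airlines_raw, "airports": airports_raw, "routes": routes_raw})`.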