## Sampling Eurocontrol Data

This notebook reports the main steps to sample the CSV files about flights and FIRs (Flight Information Regions). Because ingesting all flights takes a long time, we use a sample of 100K flights from the file containing the March flights.

To measure execution time in Jupyter notebooks: <code>pip install ipython-autotime</code>, then load the extension with <code>%load_ext autotime</code>.

In [1]:
# required libraries
import pandas as pd
import os
import gc
from pathlib import Path
from datetime import datetime

In [2]:
# parameters and URLs
path = str(Path(os.path.abspath(os.getcwd())).parent.absolute())

flightsPath = path + '/data/flights/'
marchFlightsURL = flightsPath + 'Flights_20190301_20190331.csv'
flightsFIRPath = path + '/data/flights_FIR/'
marchFIRsURL = flightsFIRPath + 'Flight_FIRs_Actual_20190301_20190331.csv'


## Sampling the flights

In this section we take the first 100K flights of the March file as our sample.

In [3]:
# Load the CSV file in memory
flights = pd.read_csv(marchFlightsURL, sep=',', index_col='ECTRL ID')
flights.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 789432 entries, 227743250 to 228593317
Data columns (total 17 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   ADEP                        789432 non-null  object 
 1   ADEP Latitude               788029 non-null  float64
 2   ADEP Longitude              788029 non-null  float64
 3   ADES                        789432 non-null  object 
 4   ADES Latitude               787954 non-null  float64
 5   ADES Longitude              787954 non-null  float64
 6   FILED OFF BLOCK TIME        789432 non-null  object 
 7   FILED ARRIVAL TIME          789432 non-null  object 
 8   ACTUAL OFF BLOCK TIME       789432 non-null  object 
 9   ACTUAL ARRIVAL TIME         789432 non-null  object 
 10  AC Type                     789432 non-null  object 
 11  AC Operator                 789432 non-null  object 
 12  AC Registration             786209 non-null  object 
 13  ICA

In [4]:
# Create the file with 100K flights
sample = flights[:100000]
sample.to_csv(flightsPath + 'Flights_marchSample.csv', sep=',')
print('*** Sample File created ***')

*** Sample File created ***
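
When only the first rows are needed, one possible alternative is to let pandas stop reading early through the `nrows` parameter of `read_csv`, so that the full March file does not have to be loaded first. The sketch below is an illustration, not the code used in this notebook: `read_head` is a hypothetical helper, and the tiny in-memory CSV (with made-up values) stands in for the real file. Only the `ECTRL ID` column name comes from the notebook.

```python
import io

import pandas as pd


def read_head(source, n_rows, index_col='ECTRL ID'):
    """Read only the first n_rows data rows of a CSV (hypothetical helper)."""
    return pd.read_csv(source, sep=',', index_col=index_col, nrows=n_rows)


# Toy stand-in for the March flights file; all values are made up
toy_csv = io.StringIO(
    "ECTRL ID,ADEP,ADES\n"
    "1,AAAA,BBBB\n"
    "2,CCCC,DDDD\n"
    "3,EEEE,FFFF\n"
    "4,GGGG,HHHH\n"
)

head = read_head(toy_csv, n_rows=3)
print(head)
```

With the real file, the same call with `n_rows=100000` should select the same rows as `flights[:100000]`, assuming the file is read in its stored order.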


## Sampling the FIRs

In this section we handle the FIRs. We sample the CSV file about FIRs up to the row containing the last checkpoint of the last flight stored in the sample. We found the number of that row by inspecting the CSV file in a text editor.

Then we check whether we have data for all flights stored in the flights sample and whether we have extra data, i.e. checkpoints of flights not stored in the flights sample.
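
The manual lookup of the cutoff row could also be done programmatically. The following is a minimal sketch, assuming that the FIR rows follow the same flight order as the flights file and that the last sampled flight has at least one FIR row. `fir_cutoff` is a hypothetical helper, and the toy frames only mimic the `ECTRL ID` and `Sequence Number` columns of the real data.

```python
import numpy as np
import pandas as pd


def fir_cutoff(firs, sample):
    """Number of leading FIR rows to keep so that the slice ends at the
    last checkpoint of the last flight in the sample (hypothetical helper)."""
    last_flight = sample.index[-1]
    positions = np.flatnonzero(firs['ECTRL ID'].to_numpy() == last_flight)
    if positions.size == 0:
        raise ValueError(f'No FIR rows found for flight {last_flight}')
    return int(positions[-1]) + 1


# Toy data: the sample holds flights 10 and 11, the FIR file also has flight 12
toy_sample = pd.DataFrame(
    {'ADEP': ['AAAA', 'BBBB']},
    index=pd.Index([10, 11], name='ECTRL ID'),
)
toy_firs = pd.DataFrame({
    'ECTRL ID': [10, 10, 11, 11, 11, 12],
    'Sequence Number': [1, 2, 1, 2, 3, 1],
})

cutoff = fir_cutoff(toy_firs, toy_sample)
toy_sampled_firs = toy_firs[:cutoff]
print(cutoff, len(toy_sampled_firs))
```

If the ordering assumption does not hold for the real files, the slice may contain extra flights or miss some checkpoints, which is what the checks later in this section are meant to detect.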

In [5]:
# Load the CSV file in memory
firs = pd.read_csv(marchFIRsURL, sep=',')
firs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6022352 entries, 0 to 6022351
Data columns (total 5 columns):
 #   Column           Dtype 
---  ------           ----- 
 0   ECTRL ID         int64 
 1   Sequence Number  int64 
 2   FIR ID           object
 3   Entry Time       object
 4   Exit Time        object
dtypes: int64(2), object(3)
memory usage: 229.7+ MB


If this section is not run right after the previous one, the flights sample file must first be loaded into memory.

In [6]:
# Load the CSV file about flights in memory 
sample = pd.read_csv(flightsPath + 'Flights_marchSample.csv', sep=',', index_col='ECTRL ID')
sample.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 227743250 to 227849442
Data columns (total 17 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   ADEP                        100000 non-null  object 
 1   ADEP Latitude               99846 non-null   float64
 2   ADEP Longitude              99846 non-null   float64
 3   ADES                        100000 non-null  object 
 4   ADES Latitude               99844 non-null   float64
 5   ADES Longitude              99844 non-null   float64
 6   FILED OFF BLOCK TIME        100000 non-null  object 
 7   FILED ARRIVAL TIME          100000 non-null  object 
 8   ACTUAL OFF BLOCK TIME       100000 non-null  object 
 9   ACTUAL ARRIVAL TIME         100000 non-null  object 
 10  AC Type                     100000 non-null  object 
 11  AC Operator                 100000 non-null  object 
 12  AC Registration             99638 non-null   object 
 13  ICA

In [8]:
%%time
sampledFIRs = firs[:773544]  # cutoff row found by inspecting the CSV file
missing_data = []  # List of flights in the sample with no FIR data
rows_to_drop = []  # FIR rows that belong to flights not in the sample
prev = 0
stored = False

# Check if we have data for flights not stored in the sample and drop them
for index, row in sampledFIRs.iterrows():

    flight = row['ECTRL ID']

    if prev == flight:
        # Same flight as the previous row: reuse the cached lookup result
        if not stored:
            rows_to_drop.append(index)
    else:
        prev = flight
        stored = (sample.index == flight).any()
        if not stored:
            rows_to_drop.append(index)

# drop() returns a new DataFrame, so the result must be reassigned
sampledFIRs = sampledFIRs.drop(index=rows_to_drop)
print('Dropped ' + str(len(rows_to_drop)) + ' rows')

prev = 0
found = False
fir_flights = sampledFIRs['ECTRL ID'].values  # flight IDs that have FIR data

# Check if we have data for all flights in the sample
for index, row in sample.iterrows():

    if prev == index:
        if not found:
            missing_data.append(index)
    else:
        prev = index
        found = (index == fir_flights).any()
        if not found:
            missing_data.append(index)

if len(missing_data) == 0:
    print('We have DATA FOR ALL FLIGHTS in the sample')
else:
    print('MISSING DATA for some flights in the sample')

Dropped 0 rows
We have DATA FOR ALL FLIGHTS in the sample
CPU times: user 1min 10s, sys: 7.89 s, total: 1min 18s
Wall time: 1min 9s


No further action is needed: no rows were dropped, so the FIR sample contains no checkpoints of flights outside the flights sample, and every flight stored in the sample has FIR data.
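
The two row-by-row loops above take about a minute. The same two checks could likely be written with vectorized membership tests, as in the sketch below. This is an alternative formulation, not the code that produced the output above; `check_consistency` and the toy frames are hypothetical, and only the `ECTRL ID` column name is taken from the notebook.

```python
import pandas as pd


def check_consistency(sampled_firs, sample):
    """Return (FIR rows of sampled flights, sampled flight IDs with no FIR rows).

    Hypothetical helper: `sample` is indexed by ECTRL ID, while
    `sampled_firs` carries ECTRL ID as a regular column.
    """
    in_sample = sampled_firs['ECTRL ID'].isin(sample.index)
    cleaned = sampled_firs[in_sample]
    has_firs = sample.index.isin(sampled_firs['ECTRL ID'])
    missing = sample.index[~has_firs].tolist()
    return cleaned, missing


# Toy data: flight 12 is extra in the FIRs, flight 13 has no FIR rows
toy_sample = pd.DataFrame(
    {'ADEP': ['AAAA', 'BBBB', 'CCCC']},
    index=pd.Index([10, 11, 13], name='ECTRL ID'),
)
toy_firs = pd.DataFrame({
    'ECTRL ID': [10, 10, 11, 12],
    'Sequence Number': [1, 2, 1, 1],
})

cleaned, missing = check_consistency(toy_firs, toy_sample)
print('Dropped ' + str(len(toy_firs) - len(cleaned)) + ' rows')
print('Flights without FIR data:', missing)
```

Unlike the loops, this version does not rely on rows of the same flight being adjacent, since each row is tested for membership independently.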

In [9]:
sampledFIRs.to_csv(flightsFIRPath + 'Flight_FIRs_Actual_marchSample.csv', sep=',', index=False)
print('*** Sample File created ***')

*** Sample File created ***
