# Produce flight streaming data

Kafka is used to simulate a data stream: records read from the CSV files are published to a Kafka topic.

## Table of contents<a class="anchor" id="table"></a>

* [1. Read all flight data into list of lists of dictionaries](#1)
* [2. Set up Kafka producer auxiliary functions](#2)
* [3. Publish streaming data to Kafka topic](#3)

In [1]:
# Import libraries
from random import randint
import datetime as dt
from time import sleep
from json import dumps
from kafka import KafkaProducer
import csv

## 1. Read all flight data into list of lists of dictionaries<a class="anchor" id="1"></a>
[Back to top](#table)

In [2]:
# Auxiliary function to read data from csv files (from tutorial)
def read_csv(fileName):
    rows = []  # named `rows` to avoid shadowing the built-in `list`
    with open(fileName) as f:
        reader = csv.DictReader(f)
        for row in reader:
            rows.append(row)
    return rows

The data is cleaned on the producer side: the function `getFlightRecords` drops every record that has an empty value in any of the columns needed by the ML pipeline.
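
As a minimal sketch of what this cleaning step does, the filter can be tried on a few toy records. The records and the `required` list below are illustrative assumptions, not values from the flight CSV files. Because `csv.DictReader` yields every field as a string, an empty field is `""` (falsy), while a value such as `"0"` is truthy and is kept.

```python
# Toy records standing in for rows from csv.DictReader (illustrative values only)
records = [
    {"MONTH": "1", "AIRLINE": "AA", "DEPARTURE_DELAY": "-3"},
    {"MONTH": "1", "AIRLINE": "", "DEPARTURE_DELAY": "12"},  # empty AIRLINE
    {"MONTH": "2", "DEPARTURE_DELAY": "0"},                   # AIRLINE column missing
]

# Hypothetical subset of the required columns
required = ["MONTH", "AIRLINE", "DEPARTURE_DELAY"]

# Keep a record only if every required column is present and non-empty
clean = [r for r in records if all(r.get(k) for k in required)]

print(len(clean), "of", len(records), "records kept")
```

Only the first record survives: the second has an empty `AIRLINE`, and the third lacks the column, so `dict.get` returns `None` for it.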

In [3]:
# Columns that must not be null (empty) for the ML pipeline later
mlCols = ["MONTH", "DAY_OF_WEEK", "AIRLINE", "FLIGHT_NUMBER", "ORIGIN_AIRPORT", "DESTINATION_AIRPORT",
          "SCHEDULED_DEPARTURE", "DEPARTURE_DELAY", "TAXI_OUT", "WHEELS_OFF", "SCHEDULED_TIME",
          "ELAPSED_TIME", "AIR_TIME", "DISTANCE", "WHEELS_ON", "TAXI_IN", "SCHEDULED_ARRIVAL",
          "ARRIVAL_DELAY"]

In [4]:
# Create keys (days of the week, 1-7) that will be used for indexing
# A list is used instead of a set so the iteration order is explicit
keyFlights = [1, 2, 3, 4, 5, 6, 7]

In [5]:
# Sort dictionaries into lists based on their DAY_OF_WEEK
def getFlightRecords(checkNullCols):
    
    # Read all data
    data = []
    for f in range(1,21):
        data += read_csv('data/flight{}.csv'.format(f))
    
    flightRecords = []
    
    for key in keyFlights:
        
        # Separate data into lists by key
        sublist = [d for d in data if d['DAY_OF_WEEK']==str(key)]
        
        # Drop records in which any of the columns in checkNullCols is empty
        sublist = [d for d in sublist if all(d.get(k) for k in checkNullCols)]
        
        # Add list of records for a particular day of the week to the outer list
        flightRecords.append(sublist)
    
    return flightRecords

In [6]:
# Get flight records into a list of lists of dictionaries
flightRecords = getFlightRecords(checkNullCols = mlCols)

## 2. Set up Kafka producer auxiliary functions<a class="anchor" id="2"></a>
[Back to top](#table)

In [7]:
# Set up functions for kafka producer
def publish_message(producer_instance, topic_name, data):
    try:
        # send() is asynchronous: the whole batch is queued as a single message
        producer_instance.send(topic_name, data)
        print('Message published successfully. Data: ' + str(len(data)) + ' records sent')
    except Exception as ex:
        print('Exception in publishing message.')
        print(str(ex))
        
def connect_kafka_producer():
    _producer = None
    try:
        _producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                                  value_serializer=lambda x: dumps(x).encode('ascii'),
                                  api_version=(0, 10))
    except Exception as ex:
        print('Exception while connecting Kafka.')
        print(str(ex))
    finally:
        return _producer
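
The `value_serializer` passed to `KafkaProducer` above turns each Python object into JSON bytes before it is sent. The following sketch shows that round trip without a broker. The `deserialize` function is a hypothetical consumer-side counterpart and is not part of this notebook, and the sample record uses made-up values.

```python
from json import dumps, loads

# Same serializer as the one passed to KafkaProducer above
serialize = lambda x: dumps(x).encode('ascii')

# Hypothetical consumer-side counterpart (an assumption, not in the notebook)
deserialize = lambda b: loads(b.decode('ascii'))

# One batch is a list of records, sent as a single message value
batch = [{'ts': 1700000000, 'AIRLINE': 'AA', 'DAY_OF_WEEK': '1'}]

payload = serialize(batch)
restored = deserialize(payload)

print(type(payload).__name__, restored == batch)
```

Encoding as ASCII is safe here because `json.dumps` escapes non-ASCII characters by default (`ensure_ascii=True`).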

## 3. Publish streaming data to Kafka topic<a class="anchor" id="3"></a>
[Back to top](#table)

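Each iteration $n$ of the loop below generates two sets of records: $X_n$, which is sent in the $n^{th}$ batch, and $Y_n$, which is held back and sent in the $(n+1)^{th}$ batch. The $n^{th}$ message published to the topic therefore contains $X_n$ together with $Y_{n-1}$, where $Y_0$ is empty.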
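
The batching scheme can be sketched without Kafka. In this simplified simulation, the function `simulate_batches` and the tuple placeholders are assumptions made for illustration; the per-day counts (70-100 for $X$, 5-10 for $Y$, over 7 days) are taken from the cell below.

```python
from random import randint

def simulate_batches(n_iterations, n_days=7):
    """Return the list of batches that would be published, one per iteration."""
    batches = []
    y_prev = []  # Y_0 is empty: nothing is carried into the first batch
    for n in range(1, n_iterations + 1):
        x, y = [], []
        for day in range(1, n_days + 1):
            # Placeholder tuples stand in for real flight records
            x += [('X', n, day)] * randint(70, 100)
            y += [('Y', n, day)] * randint(5, 10)
        batches.append(x + y_prev)  # batch n = X_n + Y_(n-1)
        y_prev = y                  # Y_n is held back for batch n+1
    return batches

for n, batch in enumerate(simulate_batches(3), start=1):
    delayed = sum(1 for tag, _, _ in batch if tag == 'Y')
    print(f'batch {n}: {len(batch)} records, {delayed} of them delayed Y records')
```

Under these counts the first batch holds 490 to 700 records and every later batch 525 to 770, which is consistent with the record counts logged at the end of this notebook.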

In [8]:
if __name__ == '__main__':
   
    topic = 'flightTopic'

    print('Publishing records..')
    flightProducer = connect_kafka_producer()
    
    # Ensure Y list is reset to empty each time this code cell is run
    Y = []
    
    # Initialise indexer to help retrieve records
    indexer = [0]*len(keyFlights)
    
    while True:

        # Set Y batch that will be sent this iteration
        YOld = Y

        # Start with empty lists
        X = [] # Send this at current time unit
        Y = [] # Send this at next time unit

        # For each sub batch, add records to batches X and Y
        for sublist, key in zip(flightRecords, keyFlights):

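            # Create timestamp in epoch seconds (for both sub batches)
            # Use a timezone-aware datetime: .timestamp() treats naive values as local time
            ts = int(dt.datetime.now(dt.timezone.utc).timestamp())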
            
            # X sub batch: do 70-100 times
            for i in range(randint(70,100)):
                
                # Create record
                record = {'ts': ts}
                
                # Update record
                try:
                    record.update(sublist[indexer[key-1]])
                except IndexError:
                    indexer[key-1] = 0  # reset index if data exhausted
                    record.update(sublist[indexer[key-1]])
                finally:
                    indexer[key-1] += 1
        
                # Add record to the batch
                X.append(record)

            # Y sub batch: do 5-10 times
            for i in range(randint(5,10)):

                # Create record
                record = {'ts': ts}
            
                # Update the record with flight data
                try:
                    record.update(sublist[indexer[key-1]])
                except IndexError:
                    indexer[key-1] = 0  # reset index if data exhausted
                    record.update(sublist[indexer[key-1]])
                finally:
                    indexer[key-1] += 1

                # Append instance
                Y.append(record)

            sleep(1)
        
        # Sleep 3 more seconds so that batches are sent 10 seconds apart (7 * 1 s above + 3 s)
        sleep(3)        
        
        # Send batches
        publish_message(flightProducer, topic, X + YOld)

Publishing records..
Message published successfully. Data: 567 records sent
Message published successfully. Data: 606 records sent
Message published successfully. Data: 658 records sent
Message published successfully. Data: 654 records sent
Message published successfully. Data: 670 records sent
Message published successfully. Data: 647 records sent
Message published successfully. Data: 620 records sent
Message published successfully. Data: 614 records sent
Message published successfully. Data: 680 records sent
Message published successfully. Data: 659 records sent
Message published successfully. Data: 624 records sent
Message published successfully. Data: 626 records sent
Message published successfully. Data: 709 records sent
Message published successfully. Data: 651 records sent
Message published successfully. Data: 614 records sent
Message published successfully. Data: 629 records sent
Message published successfully. Data: 639 records sent
Message published successfully. Data: 668 re

KeyboardInterrupt: 