# Exploratory Data Analysis 4
This notebook exports data suitable for input to pyHON. **Don't forget to kill the session at the end with `spark.stop()`!**

In [1]:
import findspark
findspark.init('/usr/hdp/current/spark2-client')

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = (SparkConf().setMaster("yarn").setAppName("AirlineDataAnalysis")
        .set("spark.submit.deployMode", "client")  # replaces "yarn-client", deprecated since Spark 2.0
        .set("spark.yarn.queue", "eecs598w19")
        .set("spark.executor.memory", "4g")
        .set("spark.executor.instances", "10")
        .set("spark.driver.memory", "4g")
        .set("spark.shuffle.service.enabled", "true")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "4")
        )

spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark.sparkContext.setLogLevel("ERROR")  # Hide irrelevant warnings caused by workers defaulting to Python 2

In [2]:
import signac

from pyspark.sql.functions import collect_list, first, last
from util import hdfs_fn

project = signac.get_project()
# Take the first matching job; next(iter(...)) works on both Python 2 and Python 3
job = next(iter(project.find_jobs({"year": 2011, "quarter": 1})))

In [3]:
df = spark.read.csv(hdfs_fn(job, 'Coupon.csv'), header=True, inferSchema=True)

In [4]:
col_names = ['ItinID', 'SeqNum', 'OriginAirportID', 'Origin', 'DestAirportID', 'Dest']
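# NOTE: Spark does not guarantee that a prior sort survives groupby/agg, so
# first/last/collect_list may not follow SeqNum order. Spot-check the output.
df_network = df[col_names].repartition('ItinID').sort(['ItinID', 'SeqNum'])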
itins = df_network.groupby(['ItinID']).agg(first('OriginAirportID').alias('FirstAirportID'),
                                           collect_list('OriginAirportID').alias('OriginAirportIDs'),
                                           last('DestAirportID').alias('LastAirportID'))

In [5]:
def make_line(row):
    # Fields: ItinID, first airport, origin of each leg, last airport (twice)
    origins = ' '.join(map(str, row.OriginAirportIDs))
    return '{} {} {} {} {}'.format(row.ItinID, row.FirstAirportID, origins,
                                   row.LastAirportID, row.LastAirportID)

itins.rdd.map(make_line).saveAsTextFile(hdfs_fn(job, 'hon_itineraries.txt'))

This writes the data to a folder called `hon_itineraries.txt` containing files named `part-00000`, `part-00001`, ..., which can be combined into a single file with `cat` (or with `hdfs dfs -getmerge` while the folder is still on HDFS).
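
As a rough sketch of the combining step in Python, assuming the output folder has already been copied to the local filesystem: the helper name `combine_parts` and both paths are hypothetical, not part of this project.

```python
import glob
import os
import shutil


def combine_parts(part_dir, out_path):
    """Concatenate the part-* files in part_dir into out_path, in name order.

    Hypothetical helper: assumes part_dir is a local copy of the Spark
    output folder. Marker files such as _SUCCESS do not match 'part-*'.
    """
    part_files = sorted(glob.glob(os.path.join(part_dir, 'part-*')))
    with open(out_path, 'wb') as out:
        for part in part_files:
            with open(part, 'rb') as src:
                # Stream the bytes across so large parts are not held in memory
                shutil.copyfileobj(src, out)
    return part_files
```

For example, `combine_parts('hon_itineraries.txt', 'hon_itineraries_combined.txt')` would write one file containing every part in order. The output path should live outside the folder's `part-*` pattern so it is not picked up on a later run.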

Each line is space-separated, with the format `ItinID i i j k l l` for an itinerary that flies from `i` to `j` to `k` to `l`.

The start and end points are repeated to match the conventions described in the Supplementary Information of [Rosvall et al.](https://www.nature.com/articles/ncomms5630)
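
To make the line format concrete, here is a small Spark-free sketch. The `Itinerary` namedtuple is a hypothetical stand-in for the aggregated Spark row, the airport IDs are made up, and `parse_line` is an illustrative helper that is not part of this notebook or of pyHON.

```python
from collections import namedtuple

# Hypothetical stand-in for one row of the aggregated `itins` DataFrame
Itinerary = namedtuple(
    'Itinerary',
    ['ItinID', 'FirstAirportID', 'OriginAirportIDs', 'LastAirportID'])


def make_line(row):
    # Fields: ItinID, first airport, origin of each leg, last airport (twice)
    origins = ' '.join(map(str, row.OriginAirportIDs))
    return '{} {} {} {} {}'.format(row.ItinID, row.FirstAirportID, origins,
                                   row.LastAirportID, row.LastAirportID)


def parse_line(line):
    """Recover (ItinID, airport path) by dropping the repeated endpoints."""
    parts = line.split()
    # parts[1] duplicates the first airport and parts[-1] the last one
    return parts[0], [int(a) for a in parts[2:-1]]


# Made-up itinerary with three flights: 10 -> 20 -> 30 -> 40
row = Itinerary(ItinID=1, FirstAirportID=10,
                OriginAirportIDs=[10, 20, 30], LastAirportID=40)
line = make_line(row)
print(line)              # 1 10 10 20 30 40 40
print(parse_line(line))  # ('1', [10, 20, 30, 40])
```

The three flights contribute their origins `10 20 30`, and the destination of the final flight supplies the trailing `40 40`, which matches the `ItinID i i j k l l` pattern.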

In [6]:
spark.stop()