# Intro to Spark
## What is Spark?
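Apache Spark is an open-source engine for large-scale, distributed data processing. It offers APIs in Python (PySpark), Scala, Java, and R.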
## SparkContext: how you interact with your data
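A SparkContext is the most basic tool for connecting to your data: it is the entry point that links your program to a Spark cluster (here, a local one) and lets you load data into it.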

In [1]:
import pyspark
# 'local[*]' runs Spark on this machine, using as many worker threads as there are cores
sc = pyspark.SparkContext('local[*]')

## Resilient Distributed Datasets (RDD)
Now that we can interact with our data through a SparkContext, where does that data go? We put it in RDDs! A lot goes on under the hood with RDDs to take advantage of Spark's distributed nature.

In [2]:
list_of_arrivals = [
    ("PDX", 1),
    ("LAX", 5),
    ("DEN", 3),
    ("PDX", 2),
    ("JFK", 9),
    ("DEN", 5),
    ("PDX", 7),
    ("JFK", 10),
]
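# Distribute the local list as an RDD
arrivals_rdd = sc.parallelize(list_of_arrivals)
print(arrivals_rdd)            # prints the RDD object, not its contents
print(arrivals_rdd.collect())  # collect() returns all elements to the driver
print(arrivals_rdd.count())
# Keep only the PDX records
pdx_arrivals = arrivals_rdd.filter(lambda x: x[0] == "PDX")
print(pdx_arrivals.collect())
# Group values by airport code; each value is a ResultIterable, not a plain list
grouped_arrivals = arrivals_rdd.groupByKey()
print(grouped_arrivals.collect())
# Collecting again re-runs the job and returns new ResultIterable objects
print(grouped_arrivals.collect())
# Sum the grouped values for each airport
grouped_arrivals = grouped_arrivals.mapValues(sum)
print(grouped_arrivals.collect())
# Sort the original pairs by key (airport code)
sorted_arrivals = arrivals_rdd.sortByKey()
print(sorted_arrivals.first())
print(sorted_arrivals.collect())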

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:262
[('PDX', 1), ('LAX', 5), ('DEN', 3), ('PDX', 2), ('JFK', 9), ('DEN', 5), ('PDX', 7), ('JFK', 10)]
8
[('PDX', 1), ('PDX', 2), ('PDX', 7)]
[('JFK', <pyspark.resultiterable.ResultIterable object at 0x7f6cad91de20>), ('LAX', <pyspark.resultiterable.ResultIterable object at 0x7f6cad91de80>), ('DEN', <pyspark.resultiterable.ResultIterable object at 0x7f6cad91dfa0>), ('PDX', <pyspark.resultiterable.ResultIterable object at 0x7f6cad92b040>)]
[('JFK', <pyspark.resultiterable.ResultIterable object at 0x7f6cad91dfa0>), ('LAX', <pyspark.resultiterable.ResultIterable object at 0x7f6c9615ae80>), ('DEN', <pyspark.resultiterable.ResultIterable object at 0x7f6c9615a3d0>), ('PDX', <pyspark.resultiterable.ResultIterable object at 0x7f6cad92b0a0>)]
[('JFK', 19), ('LAX', 5), ('DEN', 8), ('PDX', 10)]
('DEN', 3)
[('DEN', 3), ('DEN', 5), ('JFK', 9), ('JFK', 10), ('LAX', 5), ('PDX', 1), ('PDX', 2), ('PDX', 7)]
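
The `ResultIterable` objects printed above hide what `groupByKey` actually collected. As a rough illustration, here is a plain-Python sketch (no Spark involved) of what `groupByKey` followed by `mapValues(sum)` computes on the same data. The helper `group_by_key` is a hypothetical name introduced for this sketch and is not part of PySpark.

```python
from collections import defaultdict

list_of_arrivals = [
    ("PDX", 1),
    ("LAX", 5),
    ("DEN", 3),
    ("PDX", 2),
    ("JFK", 9),
    ("DEN", 5),
    ("PDX", 7),
    ("JFK", 10),
]


def group_by_key(pairs):
    """Hypothetical helper: collect every value that shares a key into one list."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


# Local stand-in for arrivals_rdd.groupByKey()
grouped = group_by_key(list_of_arrivals)

# Local stand-in for grouped_arrivals.mapValues(sum)
totals = {key: sum(values) for key, values in grouped.items()}

print(grouped)
print(totals)
```

The totals agree with the `mapValues(sum)` output above. In Spark itself, the grouped values can be made readable by converting them first, for example with `grouped_arrivals.mapValues(list).collect()`.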


In [3]:
flight_file = '../data/flights.csv'
txt = sc.textFile(flight_file)
# Note: count() counts every line in the file, including the CSV header row
print("We have {} flights!".format(txt.count()))

pdx_lines = txt.filter(lambda line: 'pdx' in line.lower())
print("Of these flights, {} involved PDX".format(pdx_lines.count()))

We have 12716 flights!
Of these flights, 270 involved PDX


In [4]:
csv = txt.map(lambda x: x.split(','))
csv.take(2)

[['flight_date',
  'airline',
  'tailnumber',
  'flight_number',
  'src',
  'dest',
  'departure_time',
  'arrival_time',
  'flight_time',
  'distance'],
 ['2019-11-28',
  '9E',
  'N8974C',
  '3280',
  'CHA',
  'DTW',
  '1300',
  '1455',
  '115.0',
  '505.0']]

However, as you can probably guess, this is not the most efficient way to deal with CSV data. Luckily, the PySpark SQL module (`pyspark.sql`) gives us a much better toolset for tabular data, which we will explore in the next notebook.
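
Efficiency aside, splitting on commas can also give wrong results when a field itself contains a comma. The sketch below shows the difference using Python's standard `csv` module. The sample line is made up for illustration (it is not taken from `flights.csv`), and `parse_csv_line` is a hypothetical helper name.

```python
import csv


def parse_csv_line(line):
    """Hypothetical helper: parse one CSV line, respecting quoted fields."""
    return next(csv.reader([line]))


# Made-up line: the second field contains a comma inside quotes.
line = '2019-11-28,"Endeavor Air, Inc.",N8974C,3280,CHA,DTW'

naive_fields = line.split(',')        # splits inside the quoted field too
parsed_fields = parse_csv_line(line)  # keeps the quoted field intact

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

The naive split produces seven fields, while the `csv` module produces the intended six. Assuming the same setup as the cells above, a function like this could be passed to `txt.map` in place of the `split(',')` lambda.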