### [Get Started with PySpark and Jupyter Notebook in 3 Minutes](https://blog.sicara.com/get-started-pyspark-jupyter-guide-tutorial-ae2fe84f594f)

### Importing Libraries
  * #### Part 1: Essential Libraries
    * Locate the Spark installation: **import findspark**
    * Make pyspark importable by adding it to sys.path: **findspark.init()**

In [1]:
import findspark; findspark.init()

  * #### Part 2: Importing Essential pyspark Modules

In [2]:
from pyspark import SparkContext
from pyspark.sql import functions as F  # avoid a wildcard import, which shadows built-ins such as sum and max
from pyspark.sql.types import StringType
from pyspark import StorageLevel
from pyspark import SQLContext
import time

### Loading Spark and SQL Context

In [3]:
sc = SparkContext("local[*]", "db2-q2-spark-rdd-b")
sqlContext = SQLContext(sc)

### Create Dataset from csv file (RDD Approach)

In [4]:
lines = sc.textFile("db2_project_data.csv")

In [5]:
def getColumns(row, indexes):
    row = row.split(',')
    tmp = [row[i] for i in indexes]
    return tmp
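As a quick check of what `getColumns` produces, the sketch below applies the same logic to made-up rows. The sample values and the three-column layout are assumptions for illustration only; the real layout of `db2_project_data.csv` is not shown in this notebook. It also shows a hypothetical variant, `get_columns_csv`, built on the standard `csv` module, which may be safer if any field contains a quoted comma.

```python
import csv

def get_columns(row, indexes):
    """Split a raw CSV line on commas and keep only the requested fields."""
    fields = row.split(',')
    return [fields[i] for i in indexes]

def get_columns_csv(row, indexes):
    """Same idea, but csv.reader respects quoted fields that contain commas."""
    fields = next(csv.reader([row]))
    return [fields[i] for i in indexes]

# Hypothetical rows: key, date, distance (not taken from the real dataset).
sample = 'cust-001,2017-01-01,12.5'
quoted = '"Doe, Jane",2017-01-01,12.5'

pair = get_columns(sample, [0, 2])             # key and value as strings
naive_quoted = get_columns(quoted, [0, 2])     # naive split breaks the quoted field
safe_quoted = get_columns_csv(quoted, [0, 2])  # csv.reader keeps it intact
```

Note that every selected field is still a string at this point, so numeric columns need an explicit `float(...)` conversion before any arithmetic.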

In [6]:
consumption = lines.map(lambda s: getColumns(s, [0, 2]))
header = consumption.first()
consumption = consumption.filter(lambda line : line != header)

In [7]:
# Persist the RDD in memory, spilling to disk if it does not fit,
# to speed up repeated computations at the cost of memory
consumption.persist(StorageLevel.MEMORY_AND_DISK)

PythonRDD[3] at RDD at PythonRDD.scala:48

### Execute a Query from Question (a)

#### a.iv: Average Distance (Kilometers) per Customer
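The query in the next cell computes a per-key mean with the sum-and-count pattern: each value becomes a `(value, 1)` pair, the pairs are added component-wise per key, and the sum is finally divided by the count. The following is a minimal pure-Python sketch of that same pattern, not Spark code; the function name `average_per_key` and the sample pairs are hypothetical.

```python
from collections import defaultdict

def average_per_key(pairs):
    """Mean of the values for each key, using running (sum, count) pairs."""
    totals = defaultdict(lambda: (0.0, 0))
    for key, value in pairs:
        running_sum, count = totals[key]
        # Convert on the way in, so keys with a single record are numeric too.
        totals[key] = (running_sum + float(value), count + 1)
    return {key: s / n for key, (s, n) in totals.items()}

# Hypothetical (key, distance) pairs, with distances still stored as strings.
sample_pairs = [('a', '10'), ('a', '20'), ('b', '5')]
averages = average_per_key(sample_pairs)
```

Carrying the count alongside the sum matters because averages cannot be merged directly: two partial averages do not combine into a correct overall average, while two `(sum, count)` pairs do.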

In [10]:
st = time.time()
# Pair each distance with a count of 1, converting to float up front so that
# customers with a single record also end up with a numeric value.
result = consumption.mapValues(lambda v: (float(v), 1))\
.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))\
.mapValues(lambda v: v[0] / v[1])
# Transformations are lazy: this only times building the lineage, not running the query.
fn = time.time() - st

result.persist(StorageLevel.MEMORY_AND_DISK)
print ('Query Computation Time:', fn, ' seconds.')
print ('Average Distance (Kilometers) per Customer [Without Hash Partitioning]:')
print (result.take(10))
print ('Cumulative Computation Time:', time.time() - st, ' seconds.')

Query Computation Time: 0.022536039352416992  seconds.
Average Distance (Kilometers) per Customer [Without Hash Partitioning]:
[('9759632-CALEND-304-305-Σάββατο-10', 665.4940102244079), ('8135848-CALEND-049-Σάββατο-06', 499.5832917739269), ('8119276-CALEND-040-Κυριακή-07', 573.5036786643687), ('9759643-CALEND-304-305-Σάββατο-10', 662.1925830229202), ('9759705-CALEND-304-305-Σάββατο-10', 673.5770425023304), ('9230565-CALEND-891-Κυριακή-06', 311.757060266112), ('9262267-CALEND-049-Καθημερινή-13', 510.126052710065), ('9257025-CALEND-040-Καθημερινή-11', 605.6769329510254), ('9311573-CALEND-703-Καθημερινή-09', 496.91912571390145), ('9741563-CALEND-304-305-Καθημερινή-11', 658.390396959772)]
Cumulative Computation Time: 14.460395574569702  seconds.
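The gap between the two timings above (about 0.02 s versus about 14.5 s) is consistent with Spark's lazy evaluation: `mapValues` and `reduceByKey` only describe the work, and nothing runs until an action such as `take(10)` asks for results. As a rough analogy only, plain Python's `map` behaves in a similar way; the sketch below uses a hypothetical `expensive` function and counts calls instead of measuring time.

```python
calls = []

def expensive(x):
    """Stand-in for real per-record work; records every invocation."""
    calls.append(x)
    return x * 2

# Building the pipeline is instant: map() returns a lazy iterator.
pipeline = map(expensive, range(5))
calls_before = len(calls)

# Work happens only when results are requested, and only as much as needed.
first_three = [next(pipeline) for _ in range(3)]
calls_after = len(calls)
```

To time the query itself in the notebook, the timer would need to stop after an action (for example `take` or `count`) rather than after the transformations.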


### Stopping Spark Context... 

In [12]:
sc.stop()