# ***Exercise 63 - Spark Streaming***

Full station identification in real-time

Input:
- A textual file containing the list of stations of a bike sharing system
    - Each line of the file contains the information about one station
    - Each line has the tab-separated format `id\tlongitude\tlatitude\tname`

- A stream of readings about the status of the stations
    - Each reading has the format `stationId,#free slots,#used slots,timestamp`

Output:

- For each reading with a number of free slots equal to 0 (i.e., the station is full)
    - print the timestamp and the name of the station on the standard output


- Emit new results every 2 seconds by considering only the data received in the last 2 seconds

In [1]:
from pyspark.streaming import StreamingContext

In [4]:
# Create a Spark Streaming Context object
ssc = StreamingContext(sc, 2)
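
With a batch interval of 2 seconds and no window operation, each batch contains only the readings received during one 2-second slice, so consecutive batches do not overlap. The grouping can be sketched in plain Python (no Spark involved), under the simplifying assumption that a reading falls into the batch covering its arrival time. The function `assign_batches` and the arrival times are hypothetical, and Spark's real receiver timing is only approximated here.

```python
from collections import defaultdict

def assign_batches(arrivals, batch_interval=2.0):
    """Group (arrival_seconds, reading) pairs into non-overlapping batches."""
    batches = defaultdict(list)
    for arrival, reading in arrivals:
        # Integer index of the 2-second slice that contains this arrival time
        batches[int(arrival // batch_interval)].append(reading)
    return dict(batches)

# Hypothetical arrival times, in seconds since the streaming context started
arrivals = [(0.5, "r1"), (1.9, "r2"), (2.0, "r3"), (3.5, "r4"), (4.1, "r5")]
batches = assign_batches(arrivals)
```

Each reading appears in exactly one batch, which is why no `window` call is needed for this exercise.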

In [5]:
inputFileStations = "data/Ex63/data/stations.csv"

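**Cache the static RDD: it is joined with every batch, so caching avoids re-reading and re-parsing the stations file every 2 seconds**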

In [6]:
# "Standard" RDD associated with the characteristics of the stations
# Extract (stationId, name)
stationNameRDD = sc.textFile(inputFileStations)\
.map(lambda line: (line.split("\t")[0], line.split("\t")[3]) ).cache()
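
The cell above splits each line twice and assumes every line has at least four fields. A possible alternative, sketched here as an assumption rather than as part of the original solution, is a small helper that splits once and returns `None` for malformed lines. The name `parse_station` and the sample line are hypothetical.

```python
def parse_station(line):
    """Return (stationId, name) from a tab-separated station line, or None if malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        return None
    # Field 0 is the station id, field 3 is the station name
    return (fields[0], fields[3])

# Invented sample line in the format id, longitude, latitude, name
sample = "1\t2.18\t41.39\tStation A"
parsed = parse_station(sample)
```

With Spark it could be used as `sc.textFile(inputFileStations).map(parse_station).filter(lambda p: p is not None).cache()`.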

In [3]:
# Create a (Receiver) DStream that will connect to localhost:9999
readingsDStream = ssc.socketTextStream("localhost", 9999)
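
`socketTextStream` connects as a client, so a server must already be listening on localhost:9999 and sending newline-terminated UTF-8 lines. For local testing, such a server might be sketched as follows. The functions `open_reading_server` and `send_readings` are hypothetical helpers, and the sample readings (including their timestamp format) are invented for illustration.

```python
import socket

def open_reading_server(host="localhost", port=9999):
    """Bind and listen on (host, port); return the listening socket."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server

def send_readings(server, lines):
    """Accept one client and send each reading as a newline-terminated line."""
    conn, _ = server.accept()
    with conn:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))

# Invented readings: stationId,#free slots,#used slots,timestamp
SAMPLE_READINGS = [
    "1,0,20,2024-01-01 10:00:00",
    "2,5,15,2024-01-01 10:00:01",
]
```

It would run in a separate process before `ssc.start()`, for example `server = open_reading_server()` followed by `send_readings(server, SAMPLE_READINGS)` and `server.close()`.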

In [7]:
# Each reading has the format:
# stationId,#free slots,#used slots,timestamp
# Select readings with num. free slots = 0
fullReadingsDStream = readingsDStream.filter(lambda line: int(line.split(",")[1])==0)
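
The lambda above raises an exception if a line has fewer than two fields or a non-numeric free-slot count. If the stream may contain malformed lines, a more defensive predicate could be used instead. This is a sketch under that assumption, and `is_full_reading` is a hypothetical name.

```python
def is_full_reading(line):
    """True only for well-formed readings whose number of free slots is 0."""
    fields = line.split(",")
    if len(fields) != 4:
        return False
    try:
        return int(fields[1]) == 0
    except ValueError:
        # Non-numeric free-slot count: treat the line as not matching
        return False
```

It can be passed directly to the filter, as in `readingsDStream.filter(is_full_reading)`.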

In [8]:
# Extract pairs (stationId, timestamp)
stationIdTimestampDStream = fullReadingsDStream.map(lambda line: (line.split(",")[0],line.split(",")[3]))

In [9]:
# Join the content of the DStream with the "standard" RDD to retrieve
# the name of each station.
# A DStream cannot be joined directly with a non-streaming RDD:
# transform applies the given RDD-to-RDD function to the RDD of each batch
joinDStream = stationIdTimestampDStream.transform(lambda batchRDD: batchRDD.join(stationNameRDD))
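
`join` is an inner join on the key: for each batch it yields pairs of the form (stationId, (timestamp, name)), and readings whose stationId does not appear in the stations file are dropped. The shape of the result can be illustrated with a plain-Python emulation (no Spark involved). The function `inner_join` and the sample data are hypothetical.

```python
def inner_join(left_pairs, right_pairs):
    """Emulate an inner join by key on two lists of (key, value) pairs."""
    right_by_key = {}
    for key, value in right_pairs:
        right_by_key.setdefault(key, []).append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left_pairs
        for right_value in right_by_key.get(key, [])
    ]

# Invented batch of (stationId, timestamp) and static (stationId, name) pairs
batch = [("1", "10:00:00"), ("9", "10:00:01")]
stations = [("1", "Station A"), ("2", "Station B")]

joined = inner_join(batch, stations)
# The value part of each joined pair is (timestamp, name)
values = [pair[1] for pair in joined]
```

Station "9" has no entry in `stations`, so its reading does not appear in `joined`.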

In [10]:
# Extract (timestamp, name of the station)
# It is the value part of the pairs returned by the join
stationNameTimestampDStream = joinDStream.map(lambda pair: pair[1])

In [11]:
stationNameTimestampDStream.pprint()

In [14]:
# Start the computation
ssc.start()

In [None]:
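# Run this application for at most 90 seconds, then stop the streaming
# context while keeping the SparkContext alive for the rest of the notebook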
ssc.awaitTerminationOrTimeout(90)
ssc.stop(stopSparkContext=False)