# Streaming Movie Reviews

In this hands-on exercise we will look at data that is not bounded: in many applications, data is continuously updated. The data that we will be working with comes from the Internet Movie Database (IMDb) app. Users who have set up their app to connect with Twitter automatically produce a tweet every time they rate a movie in the app. It is possible to subscribe to tweets as they are produced, but for simplicity we will simulate this process by streaming historic data.

To work with streaming data in Spark we need to create a StreamingContext, which plays a similar role to the SparkContext of a batch application. We also need to set the batch interval, i.e. how often we want to process data. Here we will set it to 10 seconds.

In [None]:
from pyspark.streaming import StreamingContext
batch_interval = 10  # seconds between two processing runs
stream_context = StreamingContext(sc, batch_interval)

We can now manipulate the streaming data much as we would batch data; the difference is that the processing is repeated every 10 seconds on the data that has arrived since the last run.
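To make this processing model concrete, here is a minimal pure-Python sketch (no Spark involved) that groups timestamped events into 10-second batches, which is roughly how the stream is sliced. The function name `group_into_batches` and the sample events are made up for illustration.

```python
from collections import defaultdict

def group_into_batches(events, batch_interval=10):
    """Group (arrival_time_in_seconds, payload) pairs by batch index."""
    batches = defaultdict(list)
    for arrival_time, payload in events:
        # Integer division assigns every event to the interval it arrived in.
        batches[int(arrival_time // batch_interval)].append(payload)
    return dict(batches)

# Hypothetical arrival times (seconds since the stream started) and payloads.
events = [(1, "review a"), (4, "review b"), (12, "review c"), (27, "review d")]
batches = group_into_batches(events)
for index in sorted(batches):
    print(index, batches[index])
```

Each list corresponds to the data that one processing run would see.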

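We are faking the stream of reviews by hooking up to a storage bucket: `textFileStream` monitors the given location and treats each new file that appears there as newly arrived data.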

In [None]:
stream_of_reviews=stream_context.textFileStream("gs://big-data-streaming-examples")

When we process data we would normally update a database with results as they come in. For simplicity we will just keep a local dictionary that can store the data.

In the dictionary **local_data** we store three variables:
- The total number of reviews we have processed (**total_count**)
- One example of a recent review (**one_line**)
- The time of the latest data batch (**latest_processing_time**)

In [None]:
local_data={}
local_data["total_count"]=0
local_data["one_line"]=""
local_data["latest_processing_time"]=""


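def count_and_keep_one(time, rdd):
    # Bring the lines of this batch to the driver; fine for small batches only.
    data = rdd.collect()
    local_data["latest_processing_time"] = time
    local_data["total_count"] += len(data)
    # A batch can be empty, so only replace the example line when there is data.
    if len(data) > 0:
        local_data["one_line"] = data[0]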

stream_of_reviews.foreachRDD(count_and_keep_one)
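`collect()` copies the whole batch to the driver, which is fine for this small exercise but can be costly for large batches. A possible variation, sketched below, uses `rdd.count()` and `rdd.take(1)` instead. To keep the sketch runnable without a cluster it is exercised with `FakeRDD`, a hypothetical stand-in that only mimics those two RDD methods; the names `summary` and `count_and_keep_one_light` are likewise illustrative.

```python
class FakeRDD:
    """Hypothetical stand-in mimicking RDD.count() and RDD.take() for a dry run."""

    def __init__(self, lines):
        self._lines = list(lines)

    def count(self):
        return len(self._lines)

    def take(self, n):
        return self._lines[:n]


summary = {"total_count": 0, "one_line": "", "latest_processing_time": ""}

def count_and_keep_one_light(time, rdd):
    # Same bookkeeping as count_and_keep_one, but only one line reaches the driver.
    summary["latest_processing_time"] = time
    summary["total_count"] += rdd.count()
    first = rdd.take(1)
    if first:
        summary["one_line"] = first[0]

# Dry run with two made-up batches; the second one is empty.
count_and_keep_one_light("batch 1", FakeRDD(["line a", "line b"]))
count_and_keep_one_light("batch 2", FakeRDD([]))
print(summary)
```

With the real stream, the function would be registered with `stream_of_reviews.foreachRDD(count_and_keep_one_light)`. The trade-off is that `count()` and `take(1)` are two separate actions on each batch rather than one.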

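We can print out what we have in the *database*. Until the stream has been started and the first batch has been processed, this shows the initial values; rerun the cell later to see the numbers change.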

In [None]:
print("Number of lines processed: "+str(local_data["total_count"]))
print("Latest processing time: "+str(local_data["latest_processing_time"]))
print("Example of a line from latest batch: "+local_data["one_line"])

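Before any processing can happen we need to start the streaming process. All processing steps, such as the `foreachRDD` call above, must be registered before the context is started.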

In [None]:
stream_context.start()

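We can stop the process again when we are done. The boolean argument indicates whether the SparkContext should be stopped as well; passing `False` keeps it alive so that the notebook can continue to use it. A stopped StreamingContext cannot be restarted, so a new one has to be created to stream again.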

In [None]:
stream_context.stop(stopSparkContext=False)

Try to store and update the distribution of ratings as they arrive. This is challenging for a number of reasons. One important reason is that errors are not shown in the notebook. They can, however, be found via the Spark UI.

In [1]:
from IPython.display import Javascript
Javascript("""
           var el=document.createElement("h3");
           var ela=document.createElement("a");
           ela.innerHTML="SparkUI";
           ela.href=window.location.protocol + '//' + window.location.hostname + ':8088/proxy/""" \
           + sc.applicationId \
           + """/';
           ela.target="_blank";
           el.append(ela);
           
           element.append(el);
           """)

<IPython.core.display.Javascript object>
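
As a possible starting point for the exercise on the distribution of ratings, here is a minimal sketch of the bookkeeping. It assumes, without having inspected the data, that each line contains the rating in a form like `8/10`; the regular expression, the sample lines and the names `extract_rating`, `update_distribution` and `rating_distribution` are all hypothetical and will need adjusting to the actual line format.

```python
import re
from collections import Counter

# Assumed pattern: a rating from 1 to 10 followed by "/10". Adjust to the real data.
RATING_PATTERN = re.compile(r"\b(10|[1-9])/10\b")

def extract_rating(line):
    """Return the rating as an int, or None if the line has no recognisable rating."""
    match = RATING_PATTERN.search(line)
    return int(match.group(1)) if match else None

def update_distribution(distribution, lines):
    """Add the ratings found in one batch of lines to a running Counter."""
    for line in lines:
        rating = extract_rating(line)
        if rating is not None:
            distribution[rating] += 1
    return distribution

rating_distribution = Counter()
# Made-up sample lines standing in for two batches.
update_distribution(rating_distribution, ["I rated Movie A 8/10", "no rating here"])
update_distribution(rating_distribution, ["I rated Movie B 8/10", "I rated Movie C 10/10"])
print(sorted(rating_distribution.items()))
```

One way to connect this to the stream would be `stream_of_reviews.foreachRDD(lambda time, rdd: update_distribution(rating_distribution, rdd.collect()))`, registered before `stream_context.start()` is called. Keeping the parsing in a plain function makes it easy to test on a few lines in the notebook first, where errors are visible.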