# ***Spark Streaming - Word Count Problem***

- Input: a stream of sentences retrieved from localhost:9999
- Split the input stream into batches of 5 seconds each and, for each batch, print on the standard output the number of occurrences of each word appearing in that batch
    - i.e., solve the word count problem independently for each 5-second batch
- Also store the results in an HDFS folder
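
As a rough illustration of what is computed for each batch, the following plain-Python sketch counts words in a list of sentences without using Spark. The function name `word_count` and the sample sentences are hypothetical and only serve the example; the streaming application applies the same logic to every 5-second batch.

```python
from collections import Counter


def word_count(batch):
    """Count the occurrences of each word in one batch of sentences.

    `batch` is a list of strings, standing in for the lines received
    during one 5-second interval (an assumption made for illustration).
    """
    counts = Counter()
    for line in batch:
        # Same splitting rule as the streaming code: split on single spaces
        counts.update(line.split(" "))
    return dict(counts)


# Two independent batches produce two independent results
first_batch = ["spark streaming example", "spark word count"]
second_batch = ["word count again"]
print(word_count(first_batch))
print(word_count(second_batch))
```

Counts are not carried over from one batch to the next: each batch is processed on its own.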

In [None]:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Set prefix of the output folders
outputPathPrefix="resSparkStreamingExamples"

# Create a configuration object and
# set the name of the application
conf = SparkConf().setAppName("Streaming word count")

# Create a Spark Context object
sc = SparkContext(conf=conf)

# Create a Spark Streaming Context object
ssc = StreamingContext(sc, 5)

# Create a (Receiver) DStream that will connect to localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

In [None]:
# Apply a chain of transformations to perform the word count task
# Each transformation returns a new DStream (one RDD per batch)
words = lines.flatMap(lambda line: line.split(" "))
wordsOnes = words.map(lambda word: (word, 1))
wordsCounts = wordsOnes.reduceByKey(lambda v1, v2: v1+v2)

# Print the result on the standard output
wordsCounts.pprint()

# Store the result in HDFS
wordsCounts.saveAsTextFiles(outputPathPrefix, "")

In [None]:
#Start the computation
ssc.start()

# Run this application for 90 seconds
ssc.awaitTerminationOrTimeout(90)
# Stop the streaming context but keep the SparkContext alive
ssc.stop(stopSparkContext=False)
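
The application expects something to be listening on localhost:9999 before the computation starts. One possible way to try it locally is a small test server such as the sketch below, meant to run in a separate Python process. The helper name `start_sentence_server`, the sample sentences, and the one-second delay are assumptions made for illustration and are not part of the original exercise.

```python
import socket
import threading
import time


def start_sentence_server(sentences, host="localhost", port=9999, delay=0.0):
    """Send newline-terminated sentences to the first client that connects.

    Hypothetical test helper: returns the bound port and the serving thread.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)

    def serve():
        # Block until a client (e.g. the socket text stream) connects
        conn, _ = server.accept()
        with conn:
            for sentence in sentences:
                conn.sendall((sentence + "\n").encode("utf-8"))
                time.sleep(delay)
        server.close()

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    return server.getsockname()[1], thread


if __name__ == "__main__":
    # Sample input: one sentence per second on localhost:9999
    sample = ["spark streaming example", "spark word count"] * 30
    _, worker = start_sentence_server(sample, delay=1.0)
    worker.join()
```

With the server running, each sentence sent within the same 5-second interval ends up in the same batch.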