# ***Spark Transform Transformation***

If you want to combine a static RDD with a DStream, you can use **transform()**. It is also useful because some types of transformations are not available for DStreams:
- E.g., sortBy(), sortByKey(), distinct()

Moreover, sometimes you need to combine DStreams and RDDs. For example, the functionality of joining every batch in a data stream with another dataset (a “standard” RDD) is not directly exposed in the DStream API.


The **transform()** transformation can be used in these situations:

- It is a transformation specific to DStreams
- It returns a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream
    - This can be used to apply arbitrary RDD operations on the DStream
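
As a minimal sketch of the two situations above, the functions below wrap `join` and `distinct()` as RDD-to-RDD functions and pass them to `transform()`. The names `join_with_static`, `distinct_per_batch`, `build_streams` and `static_rdd` are hypothetical and not part of the original example; `static_rdd` is assumed to be a pair RDD that is created once, outside the stream.

```python
def join_with_static(batch_rdd, static_rdd):
    # RDD-to-RDD function: join one batch of (key, value) pairs with a
    # static pair RDD; the result holds (key, (batch_value, static_value))
    return batch_rdd.join(static_rdd)


def distinct_per_batch(batch_rdd):
    # RDD-to-RDD function: remove duplicates inside a single batch
    return batch_rdd.distinct()


def build_streams(pairs_dstream, words_dstream, static_rdd):
    # transform() calls the given function once for every batch RDD
    # (static_rdd is a hypothetical, pre-built pair RDD)
    joined = pairs_dstream.transform(
        lambda batch_rdd: join_with_static(batch_rdd, static_rdd))
    unique_words = words_dstream.transform(distinct_per_batch)
    return joined, unique_words
```

With the word count example that follows, this could be called as `build_streams(wordsCounts, words, sc.parallelize([("spark", "keyword")]))`, where the static data is made up for illustration.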

In [None]:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Set prefix of the output folders
outputPathPrefix="resSparkStreamingExamples"

# Create a configuration object and
# set the name of the application
conf = SparkConf().setAppName("Streaming word count")

# Create a Spark Context object
sc = SparkContext(conf=conf)

# Create a Spark Streaming Context object with a batch interval of 5 seconds
ssc = StreamingContext(sc, 5)

# Create a (Receiver) DStream that will connect to localhost:9999
lines = ssc.socketTextStream("localhost", 9999)

In [None]:
# Apply a chain of transformations to perform the word count task
# Each transformation returns a new DStream
words = lines.flatMap(lambda line: line.split(" "))

wordsOnes = words.map(lambda word: (word, 1))

wordsCounts = wordsOnes.reduceByKey(lambda v1, v2: v1+v2)

# Sort the pairs of each batch by decreasing value (# of occurrences)
wordsCountsSortByKey = wordsCounts\
.transform(lambda batchRDD: batchRDD.sortBy(lambda pair: -1*pair[1]))
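
The same sort can also be sketched with a named function and the `ascending` parameter of `sortBy`, instead of negating the count. The names `sort_by_count_desc` and `sorted_counts` are hypothetical; the argument is assumed to be a DStream of `(word, count)` pairs such as `wordsCounts`.

```python
def sort_by_count_desc(batch_rdd):
    # RDD-to-RDD function: order (word, count) pairs by count, largest first
    return batch_rdd.sortBy(lambda pair: pair[1], ascending=False)


def sorted_counts(counts_dstream):
    # The sort runs separately on the RDD of each batch
    return counts_dstream.transform(sort_by_count_desc)
```

Because `transform()` works batch by batch, the ordering holds within each batch only, not across the whole stream.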

In [None]:
# Print the result on the standard output
wordsCountsSortByKey.pprint()

# Store the result in HDFS (one output folder per batch, named from the prefix)
wordsCountsSortByKey.saveAsTextFiles(outputPathPrefix, "")

# Start the computation
ssc.start()

# Run this application for 90 seconds
ssc.awaitTerminationOrTimeout(90)
ssc.stop(stopSparkContext=False)