# WordCount Program

Follow the steps below:

- Create a SparkContext object.
- Use the textFile function to read the input file from local file storage or HDFS.
- Filter out empty lines (line length = 0).
- Use the flatMap transformation to split each line of the RDD into words.
- On the resulting RDD, use a map function that pairs each word with a value of 1.
- Chain this transformation with reduceByKey(), passing a lambda function that adds the accumulated value and the current value.
- Finally, chain sortByKey() to get a new RDD sorted by key.
- Collect the resulting RDD.
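
To see what each step computes without starting Spark, the same pipeline can be traced in plain Python. This is only a rough sketch of the semantics, not how Spark executes it: the `lines` list is a made-up input, and a dictionary stands in for reduceByKey.

```python
# Hypothetical input: three lines, one of them empty
lines = ["Hello from pySpark", "", "Hello again"]

# filter: drop empty lines
non_empty = [line for line in lines if len(line) > 0]

# flatMap: split every line and flatten the pieces into one list of words
words = [word for line in non_empty for word in line.split(" ")]

# map: pair every word with the value 1
pairs = [(word, 1) for word in words]

# reduceByKey: add up the values that share the same key
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

# sortByKey: order the (word, count) pairs by word
word_count = sorted(counts.items())
print(word_count)
# [('Hello', 2), ('again', 1), ('from', 1), ('pySpark', 1)]
```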

In [1]:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('RDD_WordCount').setMaster('local')

sc = SparkContext.getOrCreate(conf=conf)

In [14]:
textFile_local = sc.textFile('file:///D:/Code/big-data-stack/pyspark-rdd-operations/data/sample.txt')
textFile_local.collect()

['Hello from pySpark']

In [26]:
textFile = sc.textFile('hdfs://0.0.0.0:19000/data/sample.txt')
textFile.collect()

['Hello from pySpark']

In [43]:
nonEmptyLines = textFile.filter(lambda line: len(line) > 0)
words = nonEmptyLines.flatMap(lambda line: line.split(' '))
words.collect()

['Hello', 'from', 'pySpark']
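
One thing to watch with `split(' ')`: if a line contains consecutive or trailing spaces, it produces empty strings, which would then be counted as a word. The plain-Python sketch below uses a made-up line to show the difference from `split()` with no argument, which splits on any run of whitespace.

```python
# Hypothetical line with a double space and a trailing space
line = "Hello  from pySpark "

by_single_space = line.split(" ")
by_whitespace = line.split()

print(by_single_space)  # ['Hello', '', 'from', 'pySpark', '']
print(by_whitespace)    # ['Hello', 'from', 'pySpark']
```

For the sample file here both variants give the same result, since its single line has no extra spaces.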

In [46]:
wordCount = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x+y).sortByKey()
wordCount.collect()

[('Hello', 1), ('from', 1), ('pySpark', 1)]
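
'Hello' comes first in the output because string keys are compared by character code, so uppercase letters sort before lowercase ones. The sketch below reproduces that ordering in plain Python and shows a case-insensitive alternative. In PySpark, `sortByKey` accepts a `keyfunc` argument that should allow the same thing (for example `keyfunc=lambda k: k.lower()`); treat that as a suggestion to try rather than something this notebook ran.

```python
# The (word, count) pairs from above, in arbitrary order
counts = [("from", 1), ("pySpark", 1), ("Hello", 1)]

# Default ordering: uppercase 'H' sorts before lowercase 'f' and 'p'
default_order = sorted(counts, key=lambda kv: kv[0])

# Case-insensitive ordering: compare lowercased keys instead
case_insensitive = sorted(counts, key=lambda kv: kv[0].lower())

print(default_order)     # [('Hello', 1), ('from', 1), ('pySpark', 1)]
print(case_insensitive)  # [('from', 1), ('Hello', 1), ('pySpark', 1)]
```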