# Count the Occurrences of the Words in a Text

In this notebook, we use PySpark to count the occurrences of each word in a text. The text used for this exercise can be found at ../datasets/text.txt. The text was randomly generated and carries no meaning; the goal of this exercise is to show a use case of the .flatMap() transformation, not to interpret the results.

First, we import the required classes and configure Spark to run on our local machine.

In [1]:
from pyspark import SparkConf, SparkContext

# Run Spark locally (single worker thread) under the application name 'words_counter'
conf = SparkConf().setMaster('local').setAppName('words_counter')
sc = SparkContext(conf=conf)

Let's load our data and print the first five lines of the text.

In [2]:
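# Each element of the resulting RDD is one line of the file
raw_text = sc.textFile('../datasets/text.txt')

# take(5) returns only the first five lines to the driver
for row in raw_text.take(5):
    print(row)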

coconut apple orange hazelnut
blueberry pumpkin pepper carrot
watercress tomato radish almond
peas pickle pumpkin rice
spinach potato turnip wheat apricot


Note that our text consists only of the names of fruits and vegetables.

In the next cell, we use the .flatMap() transformation to split each line into words. We have used the .map() transformation in similar tasks. The main difference is that .flatMap() stores each word in the new RDD as an independent element, whereas .map() returns one list per line whose elements are that line's words. We print some values of the new RDD to make this clearer.

In [3]:
# Split every line on whitespace and flatten the results into a single RDD of words
words = raw_text.flatMap(lambda line: line.split())

for word in words.take(10):
    print(word)

coconut
apple
orange
hazelnut
blueberry
pumpkin
pepper
carrot
watercress
tomato
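
The contrast between .map() and .flatMap() described above can be sketched without Spark. The snippet below is only a plain-Python analogy, assuming the first two lines of the sample text as input; the names lines, mapped and flat_mapped are illustrative and do not appear in the notebook.

```python
from itertools import chain

# The first two lines of the sample text printed earlier
lines = [
    'coconut apple orange hazelnut',
    'blueberry pumpkin pepper carrot',
]

# Analogue of .map(): one output element (a list of words) per input line
mapped = [line.split() for line in lines]

# Analogue of .flatMap(): the per-line lists are flattened into one sequence
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)
print(flat_mapped)
```

Here mapped holds two elements (one list per line), while flat_mapped holds eight (one per word), which is the shape we need before counting individual words.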


Now we can count the occurrences of each word with .countByValue() and print the results.
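
As a rough mental model, .countByValue() is an action that returns to the driver a dictionary-like object mapping each distinct value in the RDD to the number of times it appears. The plain-Python sketch below mimics that result with collections.Counter on a small made-up list; sample_words and sample_counts are hypothetical names, the data is not the notebook's dataset, and the sketch says nothing about how Spark computes the counts across partitions.

```python
from collections import Counter

# Hypothetical sample: a handful of words, not the notebook's dataset
sample_words = ['apple', 'pumpkin', 'apple', 'rice', 'pumpkin', 'apple']

# Counter tallies how many times each distinct value appears, which
# mirrors the {value: count} mapping that .countByValue() returns
sample_counts = Counter(sample_words)

for word, occurrences in sample_counts.items():
    print(f'Word: {word}, Occurrences: {occurrences}')
```

Because the whole result is collected on the driver, this approach suits cases like ours, where the number of distinct words is small.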

In [8]:
word_occurrences = words.countByValue()

for word, occurrences in word_occurrences.items():
    print('Word: ' + str(word) + ', Occurrences: ' + str(occurrences))

Word: coconut, Occurrences: 7
Word: apple, Occurrences: 1
Word: orange, Occurrences: 3
Word: hazelnut, Occurrences: 3
Word: blueberry, Occurrences: 7
Word: pumpkin, Occurrences: 4
Word: pepper, Occurrences: 2
Word: carrot, Occurrences: 2
Word: watercress, Occurrences: 2
Word: tomato, Occurrences: 4
Word: radish, Occurrences: 4
Word: almond, Occurrences: 1
Word: peas, Occurrences: 5
Word: pickle, Occurrences: 3
Word: rice, Occurrences: 5
Word: spinach, Occurrences: 4
Word: potato, Occurrences: 8
Word: turnip, Occurrences: 5
Word: wheat, Occurrences: 7
Word: apricot, Occurrences: 7
Word: grape, Occurrences: 4
Word: lemon, Occurrences: 6
Word: melon, Occurrences: 4
Word: peach, Occurrences: 6
Word: pineapple, Occurrences: 4
Word: raspberry, Occurrences: 4
Word: watermelon, Occurrences: 6
Word: artichoke, Occurrences: 9
Word: beans, Occurrences: 5
Word: brussels, Occurrences: 5
Word: cauliflower, Occurrences: 5
Word: courgette, Occurrences: 11
Word: garlic, Occurrences: 7
Word: lettuce, Occurrences: 7
Word: mango, Occurrenc