# Basic WordCount

In this section we look at some of the ways you can explore text data with Spark. In the next session we'll use a larger dataset (though still fairly small, to save time).

*Initiate the Spark context (only once per notebook). If you run into problems with the cell below, see [Help](/notebooks/spark_course/1-Course-Information-and-Links/If-you-get-problems-initiating-spark-context.ipynb).*

In [1]:
import os
from pyspark import SparkContext
sc = SparkContext(appName="search", master=os.environ['MASTER'])

# Basic WordCount example
'Basic' in the sense that we're only playing with a single small file, which we copy from the local disk into HDFS. In the next notebook we'll use some more data from HDFS.

## Step 0: Load a text file into HDFS
Let's find a text file to play with using '%%sh' (which runs the cell in a shell):

In [2]:
%%sh 
ls ../

1-Course-Information-and-Links
2-Introduction-to-Spark
3-Spark-SQL-and-Dataframes
4-Spark-Internals
5-Spark-Streaming
6-MLlib-Example
7-Solutions-to-exercises
loadData.sh
personalRatings.txt
README.md


Let's put the README.md into HDFS (and check that it's there):

*Note: README.md might already be in HDFS. That is not a problem, because the `-f` flag overwrites the existing copy.*

*Note: feel free to add some other file. In the next session we'll use a larger file*

In [3]:
%%sh
hdfs dfs -put -f ../README.md /uuData/

In [4]:
%%sh
hdfs dfs -ls /uuData

Found 9 items
-rw-r--r--   1 ubuntu supergroup        534 2015-04-18 12:08 /uuData/README.md
-rw-r--r--   1 ubuntu supergroup     174449 2015-04-18 06:26 /uuData/access_log
-rw-r--r--   1 ubuntu supergroup      14989 2015-04-18 06:26 /uuData/error_log
-rw-r--r--   1 ubuntu supergroup     197105 2015-04-18 06:26 /uuData/lr_data.txt
drwxr-xr-x   - ubuntu supergroup          0 2015-04-18 06:26 /uuData/movies
-rw-r--r--   1 ubuntu supergroup    3004200 2015-04-18 06:26 /uuData/names
drwxr-xr-x   - ubuntu supergroup          0 2015-04-18 06:26 /uuData/pagecounts
-rw-r--r--   1 ubuntu supergroup         73 2015-04-18 06:26 /uuData/people.json
-rw-r--r--   1 ubuntu supergroup         32 2015-04-18 06:26 /uuData/people.txt


## Step 1: Load the text file

Now read the README file into the cluster:

In [5]:
filePath = "/uuData/README.md"
lines = sc.textFile(filePath)

In [6]:
lines

/uuData/README.md MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2

Display the first 10 lines of the file.

In [None]:
lines.take(10)

Unfortunately this is not very readable, because take() returns a list and Python simply prints the list with each element separated by a comma. We can make it prettier by looping over the list and printing each record on its own line.

In [8]:
for x in lines.take(10):
    print(x)

# spark_course

Spark Course for Uppsala University
April 21, 2015
Ake Edlund and Izhar ul Hassan

Material based on AMPCamp and Databricks training material provided online under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license.





## Step 2: Inspect the number of partitions used to store the dataset

In [9]:
numPartitions = lines.getNumPartitions() # get the number of partitions
print("Number of partitions storing the dataset = %d" % numPartitions)

Number of partitions storing the dataset = 2
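
A partition is a slice of the dataset, not a worker: one worker can hold several partitions. The sketch below is plain Python (no Spark required) and only illustrates the idea of slicing records into a fixed number of partitions. `split_into_partitions` is a hypothetical helper, and Spark's actual splitting of an HDFS file follows its input splits rather than this even division.

```python
def split_into_partitions(records, num_partitions):
    """Cut records into num_partitions contiguous slices (illustrative only)."""
    size, extra = divmod(len(records), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        # the first `extra` partitions take one additional record
        end = start + size + (1 if i < extra else 0)
        partitions.append(records[start:end])
        start = end
    return partitions

records = ["line %d" % i for i in range(5)]  # hypothetical stand-in for the file's lines
partitions = split_into_partitions(records, 2)

print(len(partitions))
for part in partitions:
    print(part)
```

Each slice can then be processed independently, which is what lets Spark spread the work across the cluster.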


## Step 3: Split each line of the dataset into a list of words, separated by spaces

In [10]:
words = lines.flatMap(lambda x: x.split(' ')) # split each line into a list of words
words.take(10) # display the first 10 words

[u'#',
 u'spark_course',
 u'',
 u'Spark',
 u'Course',
 u'for',
 u'Uppsala',
 u'University',
 u'April',
 u'21,']
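
To see why `flatMap` is used here rather than `map`, the following plain-Python sketch (no Spark required) mimics both on a few hypothetical sample lines. With `map` we would get one list per line; `flatMap` concatenates those lists into a single flat sequence of words. Note that an empty line splits into a single empty string, which is why `u''` shows up in the output above.

```python
from itertools import chain

# hypothetical sample resembling the first lines of README.md
sample_lines = ["# spark_course", "", "Spark Course for Uppsala University"]

# map-like: one output element (a list of words) per input line
mapped = [line.split(' ') for line in sample_lines]

# flatMap-like: the per-line lists are concatenated into one flat list
flat_mapped = list(chain.from_iterable(line.split(' ') for line in sample_lines))

print(mapped)
print(flat_mapped)
```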

## Step 4: Filter the list of words to exclude common stop words

In [11]:
stopWords = ['', 'a', '*', 'and', 'is', 'of', 'the'] # define the list of stop words (lower case)
filteredWords = words.filter(lambda x: x.lower() not in stopWords) # filter the words
filteredWords.take(10) # display the first 10 filtered words

[u'#',
 u'spark_course',
 u'Spark',
 u'Course',
 u'for',
 u'Uppsala',
 u'University',
 u'April',
 u'21,',
 u'2015']
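
The filter is case-insensitive because each word is lower-cased before the lookup. A minimal plain-Python sketch of the same predicate, assuming a hypothetical word list and using a set for the stop words (a set gives constant-time membership tests, which may matter for longer stop-word lists):

```python
stop_words = {'', 'a', '*', 'and', 'is', 'of', 'the'}

# hypothetical input; 'The' and 'and' are there to show the case-insensitive match
sample_words = ['#', 'spark_course', '', 'Spark', 'Course', 'for', 'The', 'and', 'Uppsala']

# keep only the words whose lower-cased form is not a stop word
filtered = [w for w in sample_words if w.lower() not in stop_words]

print(filtered)
```

Punctuation is left untouched by this filter, so a token such as `21,` keeps its trailing comma.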

## Step 5: Cache the filtered dataset in memory to speed up future actions

In [12]:
filteredWords.cache() # mark the filtered dataset for in-memory caching on the worker nodes (filled on the first action)

PythonRDD[8] at RDD at PythonRDD.scala:42

## Step 6: Transform the filtered words into a list of (word,1) tuples for WordCount

In [13]:
word1Tuples = filteredWords.map(lambda x: (x, 1)) # map the words into (word,1) tuples
word1Tuples.take(10) # display the (word,1) tuples

[(u'#', 1),
 (u'spark_course', 1),
 (u'Spark', 1),
 (u'Course', 1),
 (u'for', 1),
 (u'Uppsala', 1),
 (u'University', 1),
 (u'April', 1),
 (u'21,', 1),
 (u'2015', 1)]

## Step 7: Aggregate the (word,1) tuples into (word,count) tuples

In [14]:
wordCountTuples = word1Tuples.reduceByKey(lambda x, y: x + y) # aggregate counts for each word
wordCountTuples.take(10) # display the first 10 (word,count) tuples

[(u'From', 1),
 (u'scratch,', 1),
 (u'21,', 1),
 (u'Attribution-NonCommercial-NoDerivatives', 1),
 (u'#', 1),
 (u'based', 1),
 (u'for', 1),
 (u'AMPCamp', 1),
 (u'-', 1),
 (u'dfs', 1)]
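
Conceptually, `reduceByKey` groups the values that share a key and folds each group with the given function. The sketch below reproduces that idea in plain Python (no Spark required). `reduce_by_key` is a hypothetical helper and the input words are made up. Spark performs the same combination in a distributed way, so the order of the resulting tuples is not guaranteed, which is why the output above looks shuffled.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Group values by key, then fold each group with func (illustrative only)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

# hypothetical (word, 1) tuples
pairs = [(w, 1) for w in ['spark_course', 'cd', 'on', 'spark_course', 'cd', 'From']]
counts = reduce_by_key(pairs, lambda x, y: x + y)

print(sorted(counts.items()))
```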

## Step 8: Display the top 10 (word,count) tuples by count

In [15]:
sortedWordCountTuples = wordCountTuples.top(10, key=lambda pair: pair[1]) # top 10 (word,count) tuples by count
for pair in sortedWordCountTuples: # display each (word,count) tuple on its own line
    print(str(pair))

(u'spark_course', 2)
(u'cd', 2)
(u'on', 2)
(u'time', 2)
(u'***', 2)
(u'From', 1)
(u'scratch,', 1)
(u'21,', 1)
(u'Attribution-NonCommercial-NoDerivatives', 1)
(u'#', 1)
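
Selecting the top N elements by a key does not require sorting the whole dataset. As a rough local analogue of `top(10, key=...)`, the plain-Python sketch below uses `heapq.nlargest` on a hypothetical list of (word, count) tuples. It is meant to illustrate the idea, not to describe how Spark implements `top`.

```python
import heapq

# hypothetical (word, count) tuples
word_counts = [('From', 1), ('spark_course', 2), ('cd', 2), ('#', 1), ('on', 2)]

# pick the 3 tuples with the largest count (the second element of each tuple)
top3 = heapq.nlargest(3, word_counts, key=lambda pair: pair[1])

for pair in top3:
    print(pair)
```

Words with equal counts are all valid candidates for the cut-off, so which of the tied words appear in the result can vary between runs on a cluster.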


#### See SQL section for more...