# **WordCount Example**

### Goal: Determine the most popular words in a given text file using Python and SQL

### ![](http://training.databricks.com/databricks_guide/downarrow.png) **Step 1**: Load a text file from our [Hosted Datasets](https://docs.databricks.com/user-guide/faq/databricks-datasets.html). **Press Shift-Enter to run the code below.**

In [3]:
filePath = "dbfs:/databricks-datasets/SPARK_README.md" # path in Databricks File System
lines = sc.textFile(filePath) # read the file into the cluster
lines.take(10) # display first 10 lines in the file

### ![](http://training.databricks.com/databricks_guide/downarrow.png) **Step 2**: Inspect the number of partitions the dataset is split into across the workers

In [5]:
numPartitions = lines.getNumPartitions() # get the number of partitions of the RDD
print("Number of partitions storing the dataset: {}".format(numPartitions))

### ![](http://training.databricks.com/databricks_guide/downarrow.png) **Step 3**: Split each line of the dataset into words, using a single space as the separator

In [7]:
words = lines.flatMap(lambda x: x.split(' ')) # split each line on spaces and flatten the results into one RDD of words
words.take(10) # display the first 10 words
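
To see why Step 3 uses `flatMap` rather than `map`, here is a minimal plain-Python sketch of the two behaviours. It runs without Spark; `sample_lines` is made-up data for illustration, not the contents of the README file.

```python
from itertools import chain

# Two sample lines standing in for the RDD's contents (illustrative data only)
sample_lines = ["# Apache Spark", "Spark is a  fast engine"]

# map-like behaviour: one output element (a list of words) per input line
mapped = [line.split(' ') for line in sample_lines]

# flatMap-like behaviour: the per-line lists are concatenated into one flat sequence
flat_mapped = list(chain.from_iterable(line.split(' ') for line in sample_lines))

print(mapped)       # a list of lists, one per line
print(flat_mapped)  # a single flat list of words
```

Note that splitting on a single space turns the double space in the second sample line into an empty string, which is why an empty string is a useful entry in a stop-word list.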

### ![](http://training.databricks.com/databricks_guide/downarrow.png) **Step 4**: Filter the list of words to exclude common stop words

In [9]:
stopWords = ['','a','*','and','is','of','the'] # define the list of stop words ('' drops empty strings left by consecutive spaces)
filteredWords = words.filter(lambda x: x.lower() not in stopWords) # filter the words
filteredWords.take(10) # display the first 10 filtered words
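
The filter in Step 4 can be tried locally on a small list. This is a sketch under stated assumptions: `sample_words` and the helper `keep` are hypothetical names for illustration, and a set is swapped in for the list because membership tests on a set do not slow down as the stop-word collection grows.

```python
# Same entries as stopWords in Step 4, stored as a set for constant-time lookups
stop_words = {'', 'a', '*', 'and', 'is', 'of', 'the'}

def keep(word):
    """Return True when the lower-cased word is not a stop word."""
    return word.lower() not in stop_words

# Illustrative input, including mixed case and an empty string
sample_words = ['The', 'Spark', '', 'is', 'A', 'fast', 'engine', '*']
filtered = [w for w in sample_words if keep(w)]

print(filtered)
```

Because the comparison lower-cases each word first, `'The'` and `'A'` are removed even though the stop words themselves are lower case.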

### ![](http://training.databricks.com/databricks_guide/downarrow.png) **Step 5**: Cache the filtered dataset in memory to speed up future actions

In [11]:
filteredWords.cache() # mark the filtered dataset for in-memory caching on the worker nodes; it is materialized by the next action

### ![](http://training.databricks.com/databricks_guide/downarrow.png) **Step 6**: Transform the filtered words into a list of (word,1) tuples for WordCount

In [13]:
word1Tuples = filteredWords.map(lambda x: (x, 1)) # map the words into (word,1) tuples
word1Tuples.take(10) # display the first 10 (word,1) tuples

### ![](http://training.databricks.com/databricks_guide/downarrow.png) **Step 7**: Aggregate the (word,1) tuples into (word,count) tuples

In [15]:
wordCountTuples = word1Tuples.reduceByKey(lambda x, y: x + y) # aggregate counts for each word
wordCountTuples.take(10) # display the first 10 (word,count) tuples
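
What `reduceByKey` does in Step 7 can be approximated locally: values that share a key are combined pairwise with the supplied function. The sketch below is a plain-Python stand-in, not Spark's implementation; `reduce_by_key` and `sample_pairs` are hypothetical names and data used only for illustration.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Local stand-in for reduceByKey: group values by key, then fold each group with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

# Illustrative (word,1) tuples like those produced in Step 6
sample_pairs = [('Spark', 1), ('fast', 1), ('Spark', 1), ('engine', 1), ('Spark', 1)]
counts = reduce_by_key(sample_pairs, lambda x, y: x + y)

# Rank words by descending count to find the most popular ones
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

print(counts)
print(ranked[0])
```

On a cluster the combining function is applied within each partition first and then across partitions, so it should be associative and commutative, as addition is here.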