# Word count with EMR Serverless on EMR Studio

#### Topics covered in this example
* Write a file to S3, read the file and perform word count on the data.

***

## Prerequisites
<div class="alert alert-block alert-info">
<b>NOTE :</b> In order to execute this notebook successfully as is, please ensure the following prerequisites are completed.</div>

* Create an S3 bucket to save your results, or use an existing S3 bucket. For example: `s3://EXAMPLE-BUCKET/word-count/`
* The interactive runtime role selected when attaching to an application must have S3 read and write permissions on that bucket.
* The EMR Serverless application attached to this notebook should be of type `SPARK`.
* This notebook uses the `PySpark` kernel.
***

## Introduction
In this example we write a file to S3, use PySpark to count the occurrences of each word in the file, and store the results in S3.
***

## Example

Create a test DataFrame with some sample records.
We use the `createDataFrame()` method to build it, `show()` to display its rows, and `printSchema()` to print its schema.

<div class="alert alert-block alert-info">
    <b>NOTE :</b> You will need to update <b>EXAMPLE-BUCKET</b> in the statement below to your own bucket. Please make sure you have read and write permissions for this bucket.</div>

In [None]:
BUCKET = "s3://EXAMPLE-BUCKET/word-count/" # Change this to the S3 location that you created in prerequisites.

wordsDF = sqlContext.createDataFrame([("emr",), ("spark",), ("example",), ("spark",), ("pyspark",), ("python",),
             ("example",), ("emr",), ("example",), ("spark",), ("pyspark",), ("python",)], ["words"])
wordsDF.show()
wordsDF.printSchema()
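
The later cells build S3 paths by concatenating `BUCKET` with a file name, which only works when `BUCKET` ends with a slash. As a minimal sketch, a small helper can make the join tolerant of a missing or doubled slash. The function name `s3_path` is hypothetical and not part of Spark or any AWS library.

```python
# Hypothetical helper (not part of the notebook or any AWS library):
# join an S3 prefix and an object name with exactly one slash between them.
def s3_path(prefix: str, name: str) -> str:
    return prefix.rstrip("/") + "/" + name.lstrip("/")

# Both forms of the prefix give the same result.
with_slash = s3_path("s3://EXAMPLE-BUCKET/word-count/", "test-data.csv")
without_slash = s3_path("s3://EXAMPLE-BUCKET/word-count", "test-data.csv")
print(with_slash)
print(without_slash)
```

If you adopt a helper like this, `s3_path(BUCKET, "test-data.csv")` could replace `BUCKET + "test-data.csv"` in the cells that follow.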

Print the number of unique words so that we can compare it with the count of the final result.

In [None]:
uniqueWordsCount = wordsDF.distinct().count()  # number of distinct rows, i.e. unique words
print(uniqueWordsCount)
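
As a cross-check that needs no Spark session, the expected value can be derived from the same twelve sample words in plain Python. This is only a sketch: the `sample_words` list is copied from the sample records, and the variable names are illustrative.

```python
# Plain-Python cross-check (no Spark required); names here are illustrative.
sample_words = ["emr", "spark", "example", "spark", "pyspark", "python",
                "example", "emr", "example", "spark", "pyspark", "python"]

# A set keeps one copy of each word, so its size is the number of distinct words.
expected_unique = len(set(sample_words))
print(expected_unique)
```

For this sample data the value is 5, which is what the Spark cell should report as well.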

This step only shows how to write to S3. Spark writes the output as a directory named `test-data.csv` that contains one or more part files.
You can instead use an existing file stored in S3 and read it as shown in the next steps.

In [None]:
wordsDF.write.csv(BUCKET + "test-data.csv")

Read the CSV output from S3 into an RDD.

In [None]:
wordsData = sc.textFile(BUCKET + "test-data.csv")
wordsData.count()

Display the contents of the file.

In [None]:
wordsData.collect()

Count the occurrences of each word and print the number of entries in the result. This should equal the number of unique words we found earlier.

In [None]:
wordsCounts = wordsData.flatMap(lambda line: line.split(" ")) \
                       .map(lambda word: (word, 1)) \
                       .reduceByKey(lambda a, b: a+b)
wordsCounts.count()
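
To make the three RDD steps concrete, here is a plain-Python sketch of the same pipeline that runs without Spark. It is an illustration of the logic, not how Spark executes it: the `lines` list stands in for the text file contents (one word per line), and `reduce_by_key` is a hypothetical helper that mimics what `reduceByKey` computes on a single machine.

```python
from itertools import chain

# Stand-in for the lines of the text file (one word per line).
lines = ["emr", "spark", "example", "spark", "pyspark", "python",
         "example", "emr", "example", "spark", "pyspark", "python"]

# flatMap: split every line on spaces and flatten the pieces into one sequence.
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map: pair each word with an initial count of 1.
pairs = [(word, 1) for word in words]

# reduceByKey: combine the values of equal keys with the given function.
def reduce_by_key(pairs, func):
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals

word_counts = reduce_by_key(pairs, lambda a, b: a + b)
print(len(word_counts))   # number of distinct words
print(word_counts)
```

The number of keys in `word_counts` plays the role of `wordsCounts.count()`: one entry per distinct word.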

Display the count for each word.

In [None]:
wordsCounts.collect()

Save the results to your S3 bucket. The results are stored under the `word-count` key prefix and split into one file per partition.

In [None]:
wordsCounts.saveAsTextFile(BUCKET + "word-count")
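
To read the saved results back outside Spark, it helps to know the line format. PySpark's `saveAsTextFile` writes the string representation of each element, so a `(word, count)` tuple is expected to appear as a line such as `('spark', 3)`. The sketch below parses lines of that shape with `ast.literal_eval`. The `saved_lines` list is hypothetical sample content standing in for a downloaded part file; check it against your actual output before relying on it.

```python
import ast

# Hypothetical contents of one part file, assuming each line is the
# string form of a (word, count) tuple.
saved_lines = ["('emr', 2)", "('spark', 3)", "('example', 3)"]

# literal_eval safely turns each line back into a Python tuple.
parsed = dict(ast.literal_eval(line) for line in saved_lines)
print(parsed["spark"])
```

If you would rather store a format that other tools can read directly, one option is to map each tuple to a delimited string before saving.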