# First PySpark Program Continued

## Overview

We want to find the most used words in *Pride and Prejudice*. Our program takes the following steps:
1. `Read` input data (assuming a plain text file)
2. `Token`ize each word
3. `Clean` up: 
   1. Remove punctuation and non-word tokens
   2. Lowercase each word
4. `Count` the frequency of each word
5. `Answer`: return the top 10 (or 20, 50, 100) words

In Chapter 2 we completed steps 1 to 3. Now we want to complete steps 4 and 5, submit our first PySpark program, and organize our program into multiple Python files.

## Group, Order and Aggregate the Records

### Group

To group a data frame's records, we use the data frame's `groupBy()` method (`groupby()` is an alias) and pass the columns we want to group by as parameters.

This method returns a `GroupedData` object on which we can apply aggregation functions such as `count()`.

In [19]:
# The usual initialization
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode, lower, regexp_extract, length, col

spark = SparkSession.builder.appName("Analyzing the vocabulary of Pride and Prejudice").getOrCreate()

# load and preprocess data
book = spark.read.text("../data/gutenberg_books/1342-0.txt")
words_nonull = book.select(
        split(book.value, " ").alias("line")
    ).select(
        explode("line").alias("word")
    ).select(
        lower("word").alias("word_lower")
    ).select(
        regexp_extract("word_lower", "[a-z]+", 0).
            alias("word")
    ).where(
        length(col("word")) > 0
    )

In [20]:
# group
groups = words_nonull.groupBy(col("word"))
print(groups)

<pyspark.sql.group.GroupedData object at 0x10a0fcd00>


In [21]:
results = words_nonull.groupBy(
        col("word")
    ).count()
print(results)
results.show(5)

DataFrame[word: string, count: bigint]
+------+-----+
|  word|count|
+------+-----+
|online|    4|
|  some|  209|
| still|   72|
|   few|   72|
|  hope|  122|
+------+-----+
only showing top 5 rows
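
To build intuition for what `groupBy("word").count()` computes, here is a minimal plain-Python sketch of the same idea. It is only an analogue, not how Spark executes the query: the `words` list is a hypothetical stand-in for the cleaned `word` column, and `collections.Counter` plays the role of the grouping and counting steps.

```python
from collections import Counter

# Hypothetical sample standing in for the cleaned `word` column.
words = ["the", "pride", "the", "and", "pride", "the"]

# Counter buckets equal values together and counts each bucket,
# which mirrors groupBy("word").count() on a single column.
counts = Counter(words)

# Each (word, count) pair plays the role of one row in `results`.
rows = list(counts.items())
print(rows)
```

As in the Spark output, each distinct word appears exactly once, paired with the number of records in its group.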



### Order

Use `orderBy()` to sort the results by one or more columns. The sort is ascending by default, so we ask for descending order to put the most frequent words first.

In [22]:
results.orderBy("count", ascending=False).show(5)
# alternatively
results.orderBy(col("count").desc()).show(5)

+----+-----+
|word|count|
+----+-----+
| the| 4496|
|  to| 4235|
|  of| 3719|
| and| 3602|
| her| 2223|
+----+-----+
only showing top 5 rows

+----+-----+
|word|count|
+----+-----+
| the| 4496|
|  to| 4235|
|  of| 3719|
| and| 3602|
| her| 2223|
+----+-----+
only showing top 5 rows
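
Ordering by more than one column matters when counts tie (for example, `still` and `few` both appear 72 times in the earlier output). The following plain-Python sketch illustrates the idea of "sort by count descending, then by word ascending, and keep the top N". The `rows` data and the `top_n` helper are hypothetical and exist only for illustration. In PySpark, a comparable ordering could be written as `results.orderBy(col("count").desc(), col("word"))`.

```python
# Hypothetical (word, count) rows; "still" and "few" tie on count.
rows = [("some", 209), ("still", 72), ("few", 72), ("hope", 122)]


def top_n(rows, n):
    """Return the n most frequent rows, breaking ties alphabetically."""
    # Negating the count sorts it in descending order, while the word
    # itself sorts in ascending order to break ties deterministically.
    ordered = sorted(rows, key=lambda row: (-row[1], row[0]))
    return ordered[:n]


top3 = top_n(rows, 3)
print(top3)
```

Without a second sort key, the relative order of tied rows is not something to rely on, so adding one makes the "top N" answer reproducible.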

