# Book project

1- Read—Read the input data (we’re assuming a plain text file).

2- Token—Tokenize each word.

3- Clean—Remove any punctuation and/or tokens that aren’t words. Lowercase each word.

4- Count—Count the frequency of each word present in the text.

5- Answer—Return the top 10 (or 20, 50, 100).

![ A simplified flow of our program, illustrating the five steps](./data/i/02-01.png)
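
Before turning to PySpark, the five steps can be sketched in plain Python on a small in-memory string. This is only an illustration of the flow, not the PySpark program built in the rest of these notes; the function name `top_words` and the sample text are hypothetical.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    # 1. Read: here the "file" is already a string; split it into lines.
    lines = text.splitlines()
    # 2. Token: split each line on single spaces.
    tokens = [token for line in lines for token in line.split(" ")]
    # 3. Clean: lowercase, then keep the first run of letters a-z (if any).
    cleaned = []
    for token in tokens:
        match = re.search(r"[a-z]+", token.lower())
        if match:
            cleaned.append(match.group(0))
    # 4. Count: tally how often each word appears.
    counts = Counter(cleaned)
    # 5. Answer: return the n most frequent words.
    return counts.most_common(n)


sample = "It is a truth universally acknowledged,\nthat a single man is a man."
print(top_words(sample, n=3))
```

The PySpark version below follows the same five steps, but each step becomes a transformation on a data frame instead of a loop over a list.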

## Read—Read the input data (we’re assuming a plain text file).
### Read file and EDA

In [2]:
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("Analyzing the vocabulary of Pride and Prejudice.")
         .getOrCreate())


sc = spark.sparkContext
sqlContext = spark
book = spark.read.text("./data/gutenberg_books/1342-0.txt")
 
book

DataFrame[value: string]

![ A simplified flow of our program, illustrating the five steps](./data/i/02-02.png)

In [4]:
book.printSchema()
print(book.dtypes)

root
 |-- value: string (nullable = true)

[('value', 'string')]


The show() method takes three optional parameters:

- n can be set to any positive integer and displays that number of rows (20 by default).

- truncate, when set to True (the default), truncates each column to 20 characters. Set to False, it displays the full content; set to a positive integer, it truncates to that number of characters.

- vertical takes a Boolean value and, when set to True, displays each record as a small table. This is a very useful option when you need to check records in detail.
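
To make the truncate rule concrete, here is a plain-Python sketch of how a single cell might be shortened for display. The helper `truncate_cell` is hypothetical, and the rule it applies (keep `truncate - 3` characters and append `...`) is inferred from the outputs shown in this section rather than taken from PySpark itself; the handling of widths below 4 is an assumption.

```python
def truncate_cell(text, truncate=True):
    """Mimic how a cell is shortened for display (hypothetical helper)."""
    # True means the default width of 20; False means no truncation.
    if truncate is True:
        width = 20
    elif truncate is False:
        return text
    else:
        width = int(truncate)
    if width <= 0 or len(text) <= width:
        return text
    # Assumed rule: very small widths are cut without an ellipsis.
    if width < 4:
        return text[:width]
    return text[: width - 3] + "..."


line = "The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen"
print(truncate_cell(line))               # default width of 20
print(truncate_cell(line, truncate=50))  # custom width
print(truncate_cell(line, truncate=False))
```

The first two printed values should line up with the first row of the two show() outputs that follow.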

In [11]:
book.show(5)


+--------------------+
|               value|
+--------------------+
|The Project Guten...|
|                    |
|This eBook is for...|
|almost no restric...|
|re-use it under t...|
+--------------------+
only showing top 5 rows



In [8]:
book.show(10, truncate=50)


+--------------------------------------------------+
|                                             value|
+--------------------------------------------------+
|The Project Gutenberg EBook of Pride and Prejud...|
|                                                  |
|This eBook is for the use of anyone anywhere at...|
|almost no restrictions whatsoever.  You may cop...|
|re-use it under the terms of the Project Gutenb...|
|    with this eBook or online at www.gutenberg.org|
|                                                  |
|                                                  |
|                        Title: Pride and Prejudice|
|                                                  |
+--------------------------------------------------+
only showing top 10 rows



## Token—Tokenize each word.
### Simple column transformations: Moving from a sentence to a list of words
When we ingested the text into a data frame, PySpark created one record per line of text and stored it in a value column of type string. To tokenize the text, we need to split each line into a list of separate words so that we can count them. This section covers simple transformations using select(), in particular:


- The select() method and its canonical usage, which is selecting data

- The alias() method to rename transformed columns

In [10]:
from pyspark.sql.functions import split
 
lines = book.select(split(book.value, " ").alias("line"))
 
lines.show(5)
 

+--------------------+
|                line|
+--------------------+
|[The, Project, Gu...|
|                  []|
|[This, eBook, is,...|
|[almost, no, rest...|
|[re-use, it, unde...|
+--------------------+
only showing top 5 rows



### Selecting specific columns using select()

In [12]:
book.select(book.value)


DataFrame[value: string]

In [15]:
from pyspark.sql.functions import col
 
book.select(book.value)
book.select(book["value"])
book.select(col("value"))
book.select("value")

DataFrame[value: string]

### Transforming columns: Splitting a string into a list of words


In [16]:
from pyspark.sql.functions import col, split
 
lines = book.select(split(col("value"), " "))
 
lines
 
 
lines.printSchema()
 

lines.show(5)

root
 |-- split(value,  , -1): array (nullable = true)
 |    |-- element: string (containsNull = true)

+--------------------+
| split(value,  , -1)|
+--------------------+
|[The, Project, Gu...|
|                  []|
|[This, eBook, is,...|
|[almost, no, rest...|
|[re-use, it, unde...|
+--------------------+
only showing top 5 rows
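
The `[]` shown for the blank line is easy to misread as an empty array. A plain-Python sketch of the same splitting rule suggests what is really inside: splitting an empty string on a space yields a list holding one empty string, and two consecutive spaces yield an empty string between them. Python's `str.split(" ")` is used here as a stand-in for PySpark's split(); the two should agree for a single-space delimiter, but treat that as an assumption for other patterns, since PySpark's split() interprets its pattern as a regular expression. The third sample string is a fragment made up from text visible in the outputs above.

```python
lines = [
    "The Project Gutenberg EBook of Pride and Prejudice, by Jane Austen",
    "",                              # a blank line in the book
    "restrictions whatsoever.  You",  # fragment with a double space
]

for line in lines:
    # Split on a single space, keeping empty strings between delimiters.
    print(line.split(" "))
```

These empty strings are the blank "words" that have to be filtered out later.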



### Renaming columns: alias and withColumnRenamed

- ❶ Without an alias, our new column is called split(value,  , -1), which isn’t really pretty.

- ❷ With alias("line"), the column is named line, as the second schema below shows. Much better!

In [17]:
book.select(split(col("value"), " ")).printSchema()
# root
#  |-- split(value,  , -1): array (nullable = true)    ❶
#  |    |-- element: string (containsNull = true)
 
book.select(split(col("value"), " ").alias("line")).printSchema()


root
 |-- split(value,  , -1): array (nullable = true)
 |    |-- element: string (containsNull = true)

root
 |-- line: array (nullable = true)
 |    |-- element: string (containsNull = true)



When writing your code, choosing between alias() and withColumnRenamed() is pretty easy:

- When you’re using a method where you specify which columns should appear, like select(), use alias().

- If you just want to rename a column without changing the rest of the data frame, use withColumnRenamed(). Note that if the column does not exist, PySpark treats this method as a no-op and does nothing.

In [18]:
# This looks a lot cleaner
lines = book.select(split(book.value, " ").alias("line"))
# This is messier, and you have to remember the name PySpark assigns automatically
lines = book.select(split(book.value, " "))
lines = lines.withColumnRenamed("split(value,  , -1)", "line")

## Clean—Remove any punctuation and/or tokens that aren’t words. Lowercase each word.

### Reshaping your data: Exploding a list into rows

![ Exploding a data frame of array[String] into a data frame of String. Each element of each array becomes its own record.](./data/i/02-04.png)

In [19]:
from pyspark.sql.functions import explode, col
 
words = lines.select(explode(col("line")).alias("word"))
 
words.show(15)


+----------+
|      word|
+----------+
|       The|
|   Project|
| Gutenberg|
|     EBook|
|        of|
|     Pride|
|       and|
|Prejudice,|
|        by|
|      Jane|
|    Austen|
|          |
|      This|
|     eBook|
|        is|
+----------+
only showing top 15 rows
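
Conceptually, explode() turns every element of every array into its own record. A rough plain-Python analogue is flattening a list of lists; the variable names are illustrative, and this says nothing about how PySpark implements the operation.

```python
lines = [
    ["The", "Project", "Gutenberg"],
    [""],                     # a blank line splits into one empty string
    ["This", "eBook", "is"],
]

# One output record per array element, in the original order.
words = [word for line in lines for word in line]
print(words)
```

The single empty string in the middle list becomes its own record, which matches the blank row between Austen and This in the output above.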



### Working with words: Changing case and removing punctuation

In [20]:
from pyspark.sql.functions import lower
words_lower = words.select(lower(col("word")).alias("word_lower"))
 
words_lower.show()

+----------+
|word_lower|
+----------+
|       the|
|   project|
| gutenberg|
|     ebook|
|        of|
|     pride|
|       and|
|prejudice,|
|        by|
|      jane|
|    austen|
|          |
|      this|
|     ebook|
|        is|
|       for|
|       the|
|       use|
|        of|
|    anyone|
+----------+
only showing top 20 rows



In [22]:
from pyspark.sql.functions import regexp_extract
words_clean = words_lower.select(
    regexp_extract(col("word_lower"), "[a-z]+", 0).alias("word") 
)
 
words_clean.show()

+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|    pride|
|      and|
|prejudice|
|       by|
|     jane|
|   austen|
|         |
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
+---------+
only showing top 20 rows
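
regexp_extract() with group 0 returns the first substring that matches the pattern, or an empty string when nothing matches. The sketch below imitates that behavior with Python's `re` module; `first_match` is a hypothetical helper, not a PySpark function, and the token `42` is made up to show the no-match case. It also shows a side effect of the `[a-z]+` pattern worth knowing: tokens containing a hyphen or a period are cut at the first non-letter.

```python
import re


def first_match(text, pattern=r"[a-z]+"):
    """Return the first match of pattern in text, or "" if there is none."""
    match = re.search(pattern, text)
    return match.group(0) if match else ""


# Tokens taken from the outputs above, plus a hypothetical "42".
for token in ["prejudice,", "austen", "", "re-use", "www.gutenberg.org", "42"]:
    print(repr(token), "->", repr(first_match(token)))
```

For a word count this loss is usually acceptable, but it means re-use is counted as re.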



### Filtering rows
An important data manipulation operation is filtering records according to a predicate. In our case, blank cells shouldn’t be counted, since they’re not words! This section covers how to filter records from a data frame. After select(), filtering is probably the most frequent and easiest operation to perform on your data, and PySpark provides a simple way to do it.

Conceptually, we provide a test to perform on each record. If it returns true, we keep the record. False? You’re out! PySpark provides not one, but two identical methods for this task: filter() and its alias where(). The duplication eases the transition for users coming from other data-processing engines or libraries; some use one name, some the other. PySpark provides both, so no arguments are possible!

I prefer filter(), because plenty of other data frame methods already start with the letter w (withColumn() in chapter 4 or withColumnRenamed() in chapter 3). In the next listing, the column is compared to a value with one of the usual Python comparison operators, in this case “not equal,” or !=.

In [23]:
words_nonull = words_clean.filter(col("word") != "")
 
words_nonull.show()

+---------+
|     word|
+---------+
|      the|
|  project|
|gutenberg|
|    ebook|
|       of|
|    pride|
|      and|
|prejudice|
|       by|
|     jane|
|   austen|
|     this|
|    ebook|
|       is|
|      for|
|      the|
|      use|
|       of|
|   anyone|
| anywhere|
+---------+
only showing top 20 rows
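
The predicate idea can be sketched in plain Python: apply a test to each record and keep the ones for which it returns True. The names below are illustrative only; in PySpark the test is a column expression such as col("word") != "" rather than a Python function.

```python
words_clean = ["austen", "", "this", "ebook", "", "is"]


def is_word(record):
    """Predicate: keep a record only if it is not the empty string."""
    return record != ""


# Keep the records that pass the test, preserving their order.
words_nonull = [record for record in words_clean if is_word(record)]
print(words_nonull)
```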

