# First PySpark data program

## Prerequisite

Make sure that PySpark is installed.

!pyspark --version

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/
                        
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 20
Branch HEAD
Compiled by user liangchi on 2023-02-10T19:57:40Z
Revision 5103e00c4ce5fcc4264ca9c4df12295d42557af6
Url https://github.com/apache/spark
Type --help for more information.


## REPL

### PySpark REPL

The `pyspark` program provides quick and easy access to a Python REPL with PySpark preconfigured. 

The `Spark context` is then available as `sc` and the `Spark session` as `spark`. The Spark context is your entry point to Spark: a liaison between your Python REPL and the Spark cluster. The Spark session wraps the Spark context and gives you access to the Spark SQL API, which includes the data frame structure.

```bash
$ pyspark
Python 3.9.12 (main, Apr  5 2022, 01:52:34)
[Clang 12.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/04/15 12:03:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.2
      /_/

Using Python version 3.9.12 (main, Apr  5 2022 01:52:34)
Spark context Web UI available at http://xiaolis-air:4040
Spark context available as 'sc' (master = local[*], app id = local-1681527792485).
SparkSession available as 'spark'.
>>> spark
<pyspark.sql.session.SparkSession object at 0x11607f0d0>
>>> sc
<SparkContext master=local[*] appName=PySparkShell>
>>> spark.sparkContext
<SparkContext master=local[*] appName=PySparkShell>
```

### Normal REPL with Your Configured Spark

You can configure the Spark session using the builder pattern. To do so in a REPL, do not start it with `pyspark`. Instead, open a normal `python3` REPL, import `SparkSession`, then build and configure a Spark session as you want.

```bash
$ python3
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName("Analyzing the vocabulary of Pride and Prejudice").getOrCreate()
```

Check out the Spark session you just built and the Spark context it wraps:

```REPL
>>> spark
<pyspark.sql.session.SparkSession object at 0x11772d4f0>
>>> spark.sparkContext
<SparkContext master=local[*] appName=Analyzing the vocabulary of Pride and Prejudice>
```

## Log levels

Set log level as follows:

```
spark.sparkContext.setLogLevel("<KEYWORD>")
```
The possible `KEYWORD` values are listed below in ascending order of chattiness; each level also includes the logs of the levels above it.

- `OFF`: no logging
- `FATAL`: fatal errors that will crash your Spark cluster
- `ERROR`: fatal and recoverable errors
- `WARN`: warnings and errors. This is the default for the `pyspark` shell.
- `INFO`: runtime information such as repartitioning and data recovery, and everything above. This is the default for non-interactive PySpark programs.
- `DEBUG`: debug information of your jobs and everything above
- `TRACE`: trace your jobs (more verbose debug logs) and everything above
- `ALL`: everything PySpark can spit out.

The `pyspark` shell defaults to `WARN` and non-interactive PySpark programs default to `INFO`.
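To make the "each level includes the levels above it" rule concrete, here is a toy model in plain Python. It is not Spark's logging implementation; `LEVELS` and `is_logged` are hypothetical names used only to illustrate the ordering described in the list.

```python
# Toy model of the ordering above, NOT Spark's actual logging code.
# Levels are listed from quietest to chattiest, as in the list.
LEVELS = ["OFF", "FATAL", "ERROR", "WARN", "INFO", "DEBUG", "TRACE", "ALL"]


def is_logged(threshold, message_level):
    """Return True if a message at `message_level` is shown when the
    log level is set to `threshold` (hypothetical helper)."""
    return LEVELS.index(message_level) <= LEVELS.index(threshold)


# With the shell default (WARN), errors are shown but INFO messages are not.
print(is_logged("WARN", "ERROR"))
print(is_logged("WARN", "INFO"))
```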

## Our First Data Preparation Program

### Overview

We want to find the most used words in Pride and Prejudice. Here are the steps we want to take:
1. `Read` input data (assuming a plain text file)
2. `Token`ize the text into words
3. `Clean` up:
   1. Remove punctuation and non-word tokens
   2. Lowercase each word
4. `Count` the frequency of each word
5. `Answer`: return the top 10 (or 20, 50, 100) words
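Before turning to PySpark, the steps above can be sketched in plain Python on a tiny in-memory string. This is only a single-machine illustration of the pipeline's shape, not the PySpark program itself; `top_words` and the sample sentence are made up for the example.

```python
import re
from collections import Counter


def top_words(text, n=10):
    """Plain-Python sketch of the tokenize/clean/count/answer steps."""
    # Tokenize: split the text on whitespace.
    tokens = text.split()
    # Clean: lowercase each token, then strip everything that is not a letter.
    cleaned = (re.sub(r"[^a-z]", "", token.lower()) for token in tokens)
    # Drop tokens that became empty (pure punctuation or numbers).
    words = [word for word in cleaned if word]
    # Count and answer: return the n most frequent words with their counts.
    return Counter(words).most_common(n)


# "Read" is replaced here by a hard-coded sample string.
sample = "A truth, a TRUTH; a lie."
print(top_words(sample, n=2))  # [('a', 3), ('truth', 2)]
```

The PySpark version follows the same steps, but expresses them as column transformations on a data frame so they can run on a cluster.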

### Ingest

#### Read Data into a Data Frame

Data structures:
- RDD (Resilient distributed dataset): a distributed collection of objects (or rows). Use regular Python functions to manipulate them.
- Data frame (DF): a stricter version of the RDD that can be seen conceptually as a table. This is the dominant data structure. You operate on columns instead of records.

We use a `DataFrameReader` object to read data into a data frame. You can access the `DataFrameReader` through `spark.read`. Let's print its content to see what's there:

```
>>> spark.read
<pyspark.sql.readwriter.DataFrameReader object at 0x102eb0f70>
>>> dir(spark.read)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_df', '_jreader', '_set_opts', '_spark', 'csv', 'format', 'jdbc', 'json', 'load', 'option', 'options', 'orc', 'parquet', 'schema', 'table', 'text']
```

### Explore

### Transform

### Filtering
