# Lecture 7a PySpark
__Math 3280: Data Mining__

__Outline__
1. Example of using PySpark

__Reading__ 
* Rioux, Chapter 2

Most data-driven applications are built in three steps that together form a simple ETL (extract, transform, load) pipeline:
1. Loading (or *__e__xtracting*)
2. __T__ransforming
3. Exporting (or *__l__oading* into the bigger system)
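The three steps above can be sketched in plain Python before any Spark is involved. This is only an illustration: the function names `extract`, `transform`, and `load`, and the sample text, are assumptions made for this sketch and are not part of PySpark.

```python
from collections import Counter


def extract(text):
    """Extract: split raw text into lines (a stand-in for reading a file)."""
    return text.splitlines()


def transform(lines):
    """Transform: count lowercase words across all lines."""
    return Counter(word.lower() for line in lines for word in line.split())


def load(counts, top_n):
    """Load: hand the result to the next system; here, just a list of pairs."""
    return counts.most_common(top_n)


# Illustrative input only
sample_text = "It is a truth\nuniversally acknowledged that it is"
top_words = load(transform(extract(sample_text)), top_n=2)
print(top_words)
```

Keeping each stage as its own function mirrors how the Spark program later in this lecture is organized: read, then a chain of transformations, then a write.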

First, set up Spark in Google Colab.

In [1]:
!sudo apt update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
#Check this site for the latest download link https://www.apache.org/dyn/closer.lua/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
!pip install -q findspark
!pip install pyspark
!pip install py4j

import os
import sys
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"


import findspark
findspark.init()
findspark.find()

[33m0% [Working][0m            Hit:1 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Ign:3 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 https://r2u.stat.illinois.edu/ubuntu jammy Release
Hit:6 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:8 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:10 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:11 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:12 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
68 packages can be upgraded. Run 'apt list --upgradable' to see them.
[1;33mW: [0mSkipping acqu

'/usr/local/lib/python3.10/dist-packages/pyspark'

Now, we create a session in Spark. For this we use an entry point called `SparkSession`, a liaison between our Python program and the Spark cluster. (`SparkSession` wraps the older `SparkContext`, which remains available as `spark.sparkContext`.)
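A syntax note on multi-line builder chains: parentheses only group an expression so it can span several lines, whereas curly braces build a Python `set` containing the result. The sketch below shows the difference with hypothetical stand-in classes (`FakeBuilder` and `FakeSession` are invented for this illustration and are not Spark classes).

```python
class FakeSession:
    """Hypothetical stand-in for a session object (not a Spark class)."""

    def __init__(self, app_name):
        self.app_name = app_name


class FakeBuilder:
    """Hypothetical stand-in for a builder-style API (not a Spark class)."""

    def __init__(self):
        self._name = "unnamed"

    def app_name(self, name):
        self._name = name
        return self  # returning self is what allows method chaining

    def get_or_create(self):
        return FakeSession(self._name)


# Parentheses: the chain evaluates to the session itself.
grouped = (
    FakeBuilder()
    .app_name("demo")
    .get_or_create()
)

# Curly braces: the chain is evaluated, then wrapped in a one-element set.
wrapped_in_set = {
    FakeBuilder()
    .app_name("demo")
    .get_or_create()
}

print(type(grouped).__name__)               # FakeSession
print(type(wrapped_in_set).__name__)        # set
print(hasattr(wrapped_in_set, "app_name"))  # False
```

Accessing an attribute of the session on the set version raises `AttributeError: 'set' object has no attribute ...`, which is the symptom to look for if braces slip in.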

In [2]:
# Create a `SparkSession` entry point from scratch
from pyspark.sql import SparkSession

spark = (SparkSession
         .builder
         .appName("Analyzing the vocabulary of books.")
         .getOrCreate()
         )

In [3]:
spark.sparkContext

In [None]:
spark.sparkContext.setLogLevel("FATAL")

With our session started, let's plan out how we're going to tackle the problem:
* __Goal__: Read through thousands of books to find the most commonly used words
  * Gather lots of books from the public domain
  * We will test the program on just one public-domain book: *Pride and Prejudice*

### __E__TL: Extract the data
Where will the data go once loaded?
* The RDD (Resilient Distributed Dataset)
* A DataFrame

![RDD vs DF](https://learning.oreilly.com/api/v2/epubs/urn:orm:book:9781617297205/files/OEBPS/Images/02-02.png)

A DataFrame is essentially a stricter version of the RDD. Where an RDD is a distributed collection of arbitrary objects, a DataFrame is organized like a relation (a table): its tuples are the rows and its attributes are the named columns.
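As a rough analogy in plain Python (no Spark, and nothing here is distributed), the contrast can be sketched as follows. The names `rdd_like`, `df_like`, and `Row`, and the sample values, are assumptions made up for this illustration.

```python
from collections import namedtuple

# RDD-like: a collection of arbitrary objects with no shared structure.
rdd_like = [1, "two", 3.0, ("four", 4)]

# DataFrame-like: every row has the same named columns (a schema).
Row = namedtuple("Row", ["word", "count"])
df_like = [Row("the", 3), Row("of", 2)]  # illustrative values only

# Selecting a column by name only makes sense when a schema exists.
count_column = [row.count for row in df_like]

# The RDD-like collection mixes several element types.
element_types = sorted({type(item).__name__ for item in rdd_like})

print(count_column)
print(element_types)
```

In PySpark itself, the corresponding objects would typically come from `spark.sparkContext.parallelize(...)` for an RDD and `spark.createDataFrame(...)` for a DataFrame.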



In [4]:
# List the attributes of the DataFrameReader; look for format-specific
# methods such as `csv`, `json`, `parquet`, and `text`.
dir(spark.read)

In [None]:
spark.read

In [None]:
book = spark.read.text('/content/1342-0.txt')

In [5]:
###  Install in Google CoLab  ###
!sudo apt update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
#Check this site for the latest download link https://www.apache.org/dyn/closer.lua/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!wget -q https://dlcdn.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
!tar xf spark-3.2.1-bin-hadoop3.2.tgz
!pip install -q findspark
!pip install pyspark
!pip install py4j

import os
import sys
# os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
# os.environ["SPARK_HOME"] = "/content/spark-3.2.1-bin-hadoop3.2"


import findspark
findspark.init()
findspark.find()

###  Create SparkSession  ###
import pyspark

from pyspark.sql import DataFrame, SparkSession
from typing import List
import pyspark.sql.types as T
import pyspark.sql.functions as F

spark= SparkSession \
       .builder \
       .appName("Our First Spark Example") \
       .getOrCreate()

spark

###  Transforming  ###

from pyspark.sql import SparkSession
import pyspark.sql.functions as F


# getOrCreate() reuses the session created above if one is already running.
spark = SparkSession.builder.appName(
    "Counting word occurrences from a book."
).getOrCreate()

spark.sparkContext.setLogLevel("WARN")

##########################################################
#####   TOKENIZATION - LEARN MORE ABOUT THIS IN NLP  #####

# If you need to read multiple text files, replace `1342-0` by `*`.
results = (
    spark.read.text("/content/1342-0.txt")
    .select(F.split(F.col("value"), " ").alias("line"))
    .select(F.explode(F.col("line")).alias("word"))
    .select(F.lower(F.col("word")).alias("word"))
    .select(F.regexp_extract(F.col("word"), "[a-z']*", 0).alias("word"))
    .where(F.col("word") != "")
    .groupby(F.col("word"))
    .count()
)

results.orderBy("count", ascending=False).show(20)
#results.orderBy("word", ascending=True).show(20)
#results.coalesce(1).write.csv("./results_single_partition.csv")

[33m0% [Working][0m            Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Ign:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:8 https://r2u.stat.illinois.edu/ubuntu jammy Release
Get:10 http://security.ubuntu.com/ubuntu jammy-security/main amd64 Packages [2,308 kB]
Get:11 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1,150 kB]
Get:12 http://security.ubuntu.com/ubuntu jammy-security/restricted amd64 Packages [3,097 kB]
Hit:13 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:14 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy I
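To see what each stage of the word-count chain does, here is a plain-Python walk-through on two sample lines. It is a sketch rather than Spark code: list comprehensions stand in for `split`, `explode`, `lower`, `regexp_extract`, and `where`, and `collections.Counter` stands in for `groupby(...).count()`. The sample lines and variable names are assumptions for this illustration.

```python
import re
from collections import Counter

# Each element plays the role of one row in the `value` column.
lines = ["It is a truth universally acknowledged,", "that a single man"]

# split: each line becomes a list of tokens.
split_lines = [line.split(" ") for line in lines]

# explode: one token per row instead of one list per row.
words = [word for tokens in split_lines for word in tokens]

# lower: normalize case so "It" and "it" are counted together.
words_lower = [word.lower() for word in words]

# regexp_extract: keep the leading run of letters and apostrophes.
words_clean = [re.search(r"[a-z']*", word).group(0) for word in words_lower]

# where: drop tokens that became empty strings.
words_nonempty = [word for word in words_clean if word != ""]

# groupby + count: tally occurrences of each word.
counts = Counter(words_nonempty)

print(counts["a"])             # 2
print("acknowledged" in counts)  # True (the trailing comma was stripped)
```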

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col,
    explode,
    lower,
    regexp_extract,
    split,
)

spark = SparkSession.builder.appName(
    "Analyzing the vocabulary of Pride and Prejudice."
).getOrCreate()

book = spark.read.text("/content/1342-0.txt")
#book = spark.read.text("https://raw.githubusercontent.com/jonesberg/DataAnalysisWithPythonAndPySpark-Data/trunk/gutenberg_books/1342-0.txt")

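# Split each line on spaces, giving one array of tokens per row.
lines = book.select(split(book.value, " ").alias("line"))

# Explode the arrays so that each token gets its own row.
words = lines.select(explode(col("line")).alias("word"))

# Lowercase every token so that "The" and "the" are counted together.
words_lower = words.select(lower(col("word")).alias("word"))

# Keep only the leading run of letters and apostrophes in each token.
words_clean = words_lower.select(
    regexp_extract(col("word"), "[a-z']*", 0).alias("word")
)

# Drop tokens that were reduced to an empty string.
words_nonull = words_clean.where(col("word") != "")

# Count the occurrences of each remaining word.
results = words_nonull.groupby(col("word")).count()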

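# Show the ten most frequent words.
results.orderBy("count", ascending=False).show(10)

# coalesce(1) merges the result into a single partition so that only one CSV
# part file is written. Spark creates a directory with this name and places
# the part file inside it.
results.coalesce(1).write.csv("./simple_count_single_partition.csv")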
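One caveat about the pattern `[a-z']*` is worth checking on a few tokens. Because `*` also matches zero characters, a token that starts with punctuation yields an empty match at position 0 and is later filtered out. The sketch below uses Python's `re` module, on the assumption that it behaves like Spark's Java regex engine for a pattern this simple. The helper `first_match` and the sample tokens are made up for this illustration, and the `+` variant is shown only for comparison, not as the lecture's method.

```python
import re


def first_match(pattern, token):
    """Return the first regex match in `token`, or "" when nothing matches."""
    match = re.search(pattern, token)
    return match.group(0) if match else ""


# Illustrative tokens: two start with punctuation, two do not.
tokens = ['"truth', "(and", "it's", "mr."]

# `*` allows an empty match at the very start of the token.
with_star = [first_match(r"[a-z']*", token) for token in tokens]

# `+` requires at least one character, so the search moves past punctuation.
with_plus = [first_match(r"[a-z']+", token) for token in tokens]

print(with_star)  # ['', '', "it's", 'mr']
print(with_plus)  # ['truth', 'and', "it's", 'mr']
```

Whether dropping such tokens matters depends on the text: in a book with many quoted passages, words that directly follow an opening quotation mark would be undercounted with the `*` form.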