<a href="https://colab.research.google.com/github/aaalexlit/big-data-hadoop-spark-edx-course/blob/main/Getting_Started_With_Spark_using_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install required libs
!pip install pyspark
!pip install findspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Use `findspark` so that `pyspark` can be imported like a regular library

In [2]:
import findspark
findspark.init()

In [3]:
# PySpark is the Spark API for Python. 
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession


# Spark Context and Spark Session
**SparkContext** is the entry point for Spark applications and provides functions for creating RDDs, such as `parallelize()`.  
**SparkSession** is needed for SparkSQL and DataFrame operations.

## Creating the spark session and context

In [4]:
# Creating a spark context 
sc = SparkContext()

# Creating a spark session
spark = SparkSession \
        .builder \
        .appName("Python Spark DataFrames basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

## Verify the Spark Session
Verify that the Spark session instance has been created

In [5]:
spark

# Resilient Distributed Datasets (RDDs)
RDDs are Spark's primitive data abstraction; they are created and manipulated with functional-programming concepts such as `map` and `filter`.

## Create an RDD
Create an RDD from a Python `range` of the integers 1 to 29 (`range(1, 30)` excludes its upper bound), split into 4 partitions

In [6]:
data = range(1, 30)
print(data[0])
len(data)
xrangeRDD = sc.parallelize(data, 4)
xrangeRDD

1


PythonRDD[1] at RDD at PythonRDD.scala:53

## Transformations
- A transformation is an operation on an RDD that results in a new RDD
- A transformation returns almost immediately because it is lazily evaluated: Spark only records the operation and computes nothing until an action is called

Transformations to the original RDD:
1. subtract 1 from each element
2. keep only the elements that are less than 10

In [7]:
subRDD = xrangeRDD.map(lambda x: x - 1)
filteredRDD = subRDD.filter(lambda x: x < 10)

## Actions
To get the output of the transformations, an action such as `collect()` needs to be applied

In [8]:
print(filteredRDD.collect())
filteredRDD.count()

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


10

## Caching Data
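Create an RDD and cache it. `cache()` marks the RDD to be kept in memory once an action has computed it, so later actions can reuse it instead of recomputing it.  
In the run below the second `count()` is roughly **4x faster** than the first (0.35 s vs 1.49 s); the exact speedup varies by run and workload.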


In [9]:
import time

test = sc.parallelize(range(1, 50000), 4)
test.cache()

t1 = time.time()
# first time it caches and counts
count1 = test.count()
dt1 = time.time() - t1
print("dt1: ", dt1)

t2 = time.time()
# second time it only counts
count2 = test.count()
dt2 = time.time() - t2
print("dt2: ", dt2)

dt1:  1.4878273010253906
dt2:  0.3539144992828369


# DataFrames and SparkSQL
To work with the SQL engine we need a Spark session.  
One was already created above, so verify that it is still active

In [10]:
spark

## Create a DataFrame
A DataFrame is a structured dataset, much like a database table.  
DataFrames can be queried and joined with SQL.

In [11]:
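# Download the data into a local `people.json` file.
# `>>` appends, so re-running this cell duplicates the records (hence each row
# appears twice in the output below); `>` or `curl -o` would overwrite instead.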
!curl https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/labs/data/people.json >> people.json

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    73  100    73    0     0    172      0 --:--:-- --:--:-- --:--:--   172


In [12]:
# Read and cache the dataset into a spark dataframe using the `read.json()` function
df = spark.read.json("people.json").cache()

In [13]:
# Print the dataframe as well as the data schema
df.show()
df.printSchema()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



In [14]:
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

## Explore the data using DataFrame functions and SparkSQL
Each cell below performs the same task in more than one way: through the DataFrame API and through SQL

In [15]:
# Select and show basic data columns

df.select("name").show()
df.select(df["name"]).show()
spark.sql("SELECT name FROM people").show()

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
|Michael|
|   Andy|
| Justin|
+-------+

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
|Michael|
|   Andy|
| Justin|
+-------+

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
|Michael|
|   Andy|
| Justin|
+-------+



In [16]:
# Basic filtering
df.filter(df["age"] > 21).show()
spark.sql("SELECT age, name FROM people WHERE age > 21").show()

+---+----+
|age|name|
+---+----+
| 30|Andy|
| 30|Andy|
+---+----+

+---+----+
|age|name|
+---+----+
| 30|Andy|
| 30|Andy|
+---+----+



In [17]:
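# Basic aggregation
# Note: count(age) ignores NULLs, so the NULL group reports 0 in the SQL result,
# while groupBy().count() counts rows and reports 2; count(*) in SQL would match.
df.groupBy("age").count().show()
spark.sql("SELECT age, count(age) as num FROM people GROUP BY age").show()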

+----+-----+
| age|count|
+----+-----+
|  19|    2|
|null|    2|
|  30|    2|
+----+-----+

+----+---+
| age|num|
+----+---+
|  19|  2|
|null|  0|
|  30|  2|
+----+---+



# Exercises
### Create an RDD with integers from 1-50. Apply a transformation to multiply every number by 2, resulting in an RDD that contains the first 50 even numbers.


In [18]:
numbers = range(1, 51)  # 1 through 50; range() excludes its upper bound
numbers_RDD = sc.parallelize(numbers)
even_numbers_RDD = numbers_RDD.map(lambda x: x * 2)

### As with the `people.json` file, read the `people2.json` file into a DataFrame and apply SQL operations to determine the average age in the people2 file.


In [19]:
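# Note: in the recorded run below, curl could not resolve the host, so nothing was
# downloaded; the average in the next cell presumably comes from a `people2.json` already on disk.
!curl https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/labs/people2.json >> people2.json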
df = spark.read.json("people2.json").cache()
df.createTempView("people2")


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud


In [20]:
spark.sql("SELECT avg(age) from people2").show()

+-----------------+
|         avg(age)|
+-----------------+
|24.77777777777778|
+-----------------+



### Close SparkSession

In [21]:
spark.stop()