# Spark High-Level API

![img](https://static.packt-cdn.com/products/9781785888359/graphics/8bffbd94-04f7-46e3-a1e9-0d6046d2dcab.png)

Source: https://static.packt-cdn.com/products/9781785888359/graphics/8bffbd94-04f7-46e3-a1e9-0d6046d2dcab.png

## Overview of Spark SQL

Creating DataFrames
- [ ] RDD
    - [ ] createDataFrame()
- [ ] Text file
    - [ ] read.text()
- [ ] JSON file
    - [ ] read.json()
    - [ ] read.json(RDD)
- [ ] Parquet file
    - [ ] read.parquet()
- [ ] Table in a relational database
- [ ] Temporary table in Spark

DataFrame to RDD
- [ ] rdd()

Schemas
- [ ] Inferring schemas
    - [ ] Why it is not optimal practice
- [ ] Specifying schemas
    - [ ] Using StructType and StructField
    - [ ] Using DDL string (schema = "author STRING, title STRING, pages INT")
- [ ] Metadata
    - [ ] printSchema()
    - [ ] columns
    - [ ] dtypes
- [ ] Actions
    - [ ] show()
- [ ] Transforms
    - [ ] select() and alias()
    - [ ] drop()
    - [ ] filter() / where()
    - [ ] distinct()
    - [ ] dropDuplicates()
    - [ ] sample()
    - [ ] sampleBy()
    - [ ] limit()
    - [ ] orderBy() / sort()
    - [ ] groupBy()

Operations that return an RDD
- [ ] rdd.map()
- [ ] rdd.flatMap()

pyspark.sql.functions module
- [ ] String functions
- [ ] Math functions
- [ ] Statistics functions
- [ ] Date functions
- [ ] Hashing functions
- [ ] Algorithms (soundex, levenshtein)
- [ ] Windowing functions

User defined functions
- [ ] udf()
- [ ] pandas_udf()

Multiple DataFrames
- [ ] join(other, on, how)
- [ ] union(), unionAll()
- [ ] intersect()
- [ ] subtract()

Persistence
- [ ] cache()
- [ ] persist()
- [ ] unpersist()
- [ ] cacheTable()
- [ ] clearCache()
- [ ] repartition()
- [ ] coalesce()

Output
- [ ] write.csv()
- [ ] write.parquet()
- [ ] write.json()

Spark SQL
- [ ] df.createOrReplaceTempView()
- [ ] sql()
- [ ] table()

## Implementation notes

Spark 2.4 is designed to run on Java 8 (also known as version 1.8). If you have a version of Java newer than Java 8, install Java 8 as well and include this export (the `/usr/libexec/java_home` helper is macOS-specific):

```bash
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)
```

If you use Spark 2.4 with `pyarrow` 0.15 or later (for example after `pip install -U pyarrow`), you will need to set this variable

```bash
ARROW_PRE_0_15_IPC_FORMAT=1
```

in the file

```bash
$SPARK_HOME/conf/spark-env.sh
```


In [None]:
from pyspark.sql import SparkSession

In [None]:
import pyspark.sql.functions as F
import pyspark.sql.types as T

In [None]:
spark = (
    SparkSession.builder 
    .master("local") 
    .appName("BIOS-823") 
    .config("spark.executor.cores", 4) 
    .getOrCreate()    
)

In [None]:
spark.version

In [None]:
spark.conf.get('spark.executor.cores')

## Create a Spark DataFrame

In [None]:
df = spark.range(3)

In [None]:
df.show(3)

In [None]:
%%file data/test.csv
number,letter
0,a
1,c
2,b
3,a
4,b
5,c
6,a
7,a
8,a
9,b
10,b
11,c
12,c
13,b
14,b

#### Implicit schema

In [None]:
df = (
    spark.read.
    format('csv').
    option('header', 'true').
    option('inferSchema', 'true').
    load('data/test.csv')
)

In [None]:
df.show(3)

In [None]:
df.printSchema()

#### Explicit schema

For production use, provide an explicit schema: it avoids the extra pass over the data that `inferSchema` needs and reduces the risk of a column being read with the wrong type.

In [None]:
schema = T.StructType([
    T.StructField("number", T.DoubleType()),
    T.StructField("letter", T.StringType()),
])

In [None]:
df = (
    spark.read.
    format('csv').
    option('header', 'true').
    schema(schema).
    load('data/test.csv')
)

In [None]:
df.show(3)

In [None]:
df.printSchema()

#### Alternative way to specify schema

You can use SQL DDL syntax to specify a schema as well.

In [None]:
schema = 'number DOUBLE, letter STRING'

In [None]:
df_altschema = (
    spark.read.
    format('csv').
    option('header', 'true').
    schema(schema=schema).
    load('data/test.csv')
)

In [None]:
df_altschema.take(3)

In [None]:
df_altschema.printSchema()

### Persist

In [None]:
df.cache()

## Data manipulation

### Select

In [None]:
df.select('number').show(3)

In [None]:
from pyspark.sql.functions import col, expr

In [None]:
df.select(col('number').alias('index')).show(3)

In [None]:
df.select(expr('number as x')).show(3)

In [None]:
df.withColumnRenamed('number', 'x').show(3)

## Filter

In [None]:
df.filter('number % 2 == 0').show(3)

## Sort

In [None]:
df.sort(df.number.desc()).show(3)

In [None]:
df.orderBy(df.letter.desc()).show(3)

## Transform

In [None]:
df.selectExpr('number*2 as x').show(3)

In [None]:
df.withColumn('x', expr('number*2')).show(3)

## Summarize

In [None]:
import pyspark.sql.functions as F

In [None]:
df.agg(F.min('number'),
       F.max('number'), 
       F.min('letter'), 
       F.max('letter')).show(3)

## Group by

In [None]:
(
    df.groupby('letter').
    agg(F.mean('number'), F.stddev_samp('number')).show()
)

## Window functions

In [None]:
from pyspark.sql.window import Window

In [None]:
ws = (
    Window.partitionBy('letter').
    orderBy(F.desc('number')).
    rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

In [None]:
df.groupby('letter').agg(F.sum('number')).show()

In [None]:
df.show()

In [None]:
(
    df.select('letter', F.sum('number').
              over(ws).
              alias('running_sum')).show()
)

## SQL

In [None]:
df.createOrReplaceTempView('df_table')

In [None]:
spark.sql('''SELECT * FROM df_table''').show(3)

In [None]:
spark.sql('''
SELECT letter, mean(number) AS mean, 
stddev_samp(number) AS sd from df_table
WHERE number % 2 = 0
GROUP BY letter
ORDER BY letter DESC
''').show()

## String operations

In [None]:
from pyspark.sql.functions import split, lower, explode

In [None]:
import pandas as pd

In [None]:
s = spark.createDataFrame(
    pd.DataFrame(
        dict(sents=('Thing 1 and Thing 2',
                    'The Quick Brown Fox'))))

In [None]:
s.show()

In [None]:
from pyspark.sql.functions import regexp_replace

In [None]:
s1 = (
    s.select(explode(split(lower(expr('sents')), ' '))).
    sort('col')
)

In [None]:
s1.show()

In [None]:
s1.groupBy('col').count().show()

In [None]:
s.createOrReplaceTempView('s_table')

In [None]:
spark.sql('''
SELECT regexp_replace(sents, 'T.*?g', 'FOO')
FROM s_table
''').show()

### Numeric operations

In [None]:
from pyspark.sql.functions import log1p, randn

In [None]:
df.selectExpr('number', 'log1p(number)', 'letter').show(3)

In [None]:
(
    df.selectExpr('number', 'randn() as random').
    stat.corr('number', 'random')
)

### Date and time

In [None]:
dt = (
    spark.range(3).
    withColumn('today', F.current_date()).
    withColumn('tomorrow', F.date_add('today', 1)).
    withColumn('time', F.current_timestamp())
)

In [None]:
dt.show()

In [None]:
dt.printSchema()

### Nulls

In [None]:
%%file data/test_null.csv
number,letter
0,a
1,
2,b
3,a
4,b
5,
6,a
7,a
8,
9,b
10,b
11,c
12,
13,b
14,b

In [None]:
dn = (
    spark.read.
    format('csv').
    option('header', 'true').
    option('inferSchema', 'true').
    load('data/test_null.csv')
)

In [None]:
dn.printSchema()

In [None]:
dn.show()

In [None]:
dn.na.drop().show()

In [None]:
dn.na.fill('Missing').show()

## UDF

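Python UDFs move every row between the JVM and a Python worker, which degrades performance, so prefer the built-in functions in `pyspark.sql.functions` whenever they can express the computation. If you must use a UDF, prefer `pandas_udf`, which exchanges data in Arrow batches and operates on whole pandas Series, to the row-at-a-time `udf` where possible.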

In [None]:
from pyspark.sql.functions import udf, pandas_udf

### Standard Python UDF

In [None]:
@udf('double')
def square(x):
    return x**2

In [None]:
df.select('number', square('number')).show(3)

### Pandas UDF

In [None]:
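@pandas_udf('double')
def scale(x):
    # x is a pandas Series holding one Arrow batch, so the mean and
    # standard deviation are computed per batch, not over the whole column
    return (x - x.mean())/x.std()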

In [None]:
df.select(scale('number')).show(3)

#### Grouped agg

In [None]:
@pandas_udf('double', F.PandasUDFType.GROUPED_AGG)
def gmean(x):
    return x.mean()

In [None]:
df.groupby('letter').agg(gmean('number')).show()

#### Grouped map

In [None]:
@pandas_udf(df.schema, F.PandasUDFType.GROUPED_MAP)
def gscale(pdf):
    return pdf.assign(
        number = (pdf.number - pdf.number.mean()) /
        pdf.number.std())

In [None]:
df.groupby('letter').apply(gscale).show()

## Joins

In [None]:
names = 'ann ann bob bob chuck'.split()
courses = '821 823 821 824 823'.split()
pdf1 = pd.DataFrame(dict(name=names, course=courses))

In [None]:
pdf1

In [None]:
course_id = '821 822 823 824 825'.split()
course_names = 'Unix Python R Spark GLM'.split()
pdf2 = pd.DataFrame(dict(course_id=course_id, name=course_names))

In [None]:
pdf2

In [None]:
df1 = spark.createDataFrame(pdf1)
df2 = spark.createDataFrame(pdf2)

In [None]:
df1.join(df2, df1.course == df2.course_id, how='inner').show()

In [None]:
df1.join(df2, df1.course == df2.course_id, how='right').show()

## DataFrame conversions

In [None]:
sc = spark.sparkContext

In [None]:
rdd = sc.parallelize([('ann', 23), ('bob', 34)])

In [None]:
df = spark.createDataFrame(rdd, schema='name STRING, age INT')

In [None]:
df.show()

In [None]:
df.rdd.map(lambda x: (x[0], x[1]**2)).collect()

In [None]:
df.rdd.mapValues(lambda x: x**2).collect()

In [None]:
df.toPandas()