## PySpark & PyDeequ testing

This notebook is a simple test of getting PySpark and PyDeequ (the Python API for AWS Deequ) working locally.

Using Miniconda on WSL (Ubuntu distro), I created an environment called `dmc_1`; the full package list is in the `requirements.txt` file.

In particular, this environment uses:
- PySpark == 2.4.0
- Pydeequ == 1.0.1
- Python == 3.7.13 

To use PyDeequ, you need to set the environment variable `SPARK_VERSION`. We set it to `2.4.0` to match our PySpark version.

In [1]:
# set environment variable SPARK_VERSION 
import os 

os.environ["SPARK_VERSION"] = "2.4.0"

spark_version = os.environ["SPARK_VERSION"] 
print(spark_version)

2.4.0
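The cell above overwrites whatever value `SPARK_VERSION` already had. If the variable may already be exported by the shell, a slightly more defensive variant could look like the sketch below. It assumes that an existing value should take precedence; the helper name `ensure_spark_version` is hypothetical and not part of PyDeequ.

```python
import os


def ensure_spark_version(default="2.4.0"):
    """Return SPARK_VERSION, setting it to `default` only if it is unset.

    Hypothetical helper for illustration; not part of PyDeequ.
    """
    # setdefault leaves an existing value untouched and returns the value in effect
    return os.environ.setdefault("SPARK_VERSION", default)


spark_version = ensure_spark_version()
print(spark_version)
```

Whether an existing value should win is a judgment call: for a notebook pinned to one environment, the unconditional assignment is just as reasonable.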


In [2]:
# import pyspark & pydeequ 
from pyspark.sql import SparkSession, Row
import pydeequ 

In [3]:
# build a Spark session with the Deequ jar on the classpath
try:
    spark = (
        SparkSession.builder
        .config("spark.jars.packages", pydeequ.deequ_maven_coord)
        .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
        .getOrCreate()
    )
    print("Spark session created - accessible via `spark`")

except Exception as e:
    print("ERROR - Spark session failed")
    print(e)

Spark session created - accessible via `spark`


OK, now let's run a PyDeequ example with PySpark.

In [4]:
# build a small test DataFrame; column c contains one null value
df = spark.sparkContext.parallelize([
    Row(a="foo", b=1, c=5),
    Row(a="bar", b=2, c=6),
    Row(a="baz", b=3, c=None),
]).toDF()

df.show()

+---+---+----+
|  a|  b|   c|
+---+---+----+
|foo|  1|   5|
|bar|  2|   6|
|baz|  3|null|
+---+---+----+



In [5]:
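# import the analyzers used below from pydeequ
from pydeequ.analyzers import AnalysisRunner, AnalyzerContext, Completeness, Size

# compute the row count and the fraction of non-null values in column "b"
analysisResult = (
    AnalysisRunner(spark)
    .onData(df)
    .addAnalyzer(Size())
    .addAnalyzer(Completeness("b"))
    .run()
)

# convert the computed metrics to a DataFrame for display
analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()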

+-------+--------+------------+-----+
| entity|instance|        name|value|
+-------+--------+------------+-----+
|Dataset|       *|        Size|  3.0|
| Column|       b|Completeness|  1.0|
+-------+--------+------------+-----+
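To make the two metrics above concrete, here is a plain-Python sketch of what they measure on the same three rows: `Size` is the row count and `Completeness` is the fraction of non-null values in a column. This is only an illustration of the definitions, not how Deequ computes them; the helper names `size` and `completeness` are hypothetical.

```python
# the same three rows as the DataFrame above, as plain dictionaries
rows = [
    {"a": "foo", "b": 1, "c": 5},
    {"a": "bar", "b": 2, "c": 6},
    {"a": "baz", "b": 3, "c": None},
]


def size(rows):
    """Number of rows, as a float to match the metric table."""
    return float(len(rows))


def completeness(rows, column):
    """Fraction of rows whose value in `column` is not null."""
    if not rows:
        # returning None for empty input is an assumption made for this sketch
        return None
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)


print(size(rows))               # 3.0, matches the Size row above
print(completeness(rows, "b"))  # 1.0, matches the Completeness row above
print(completeness(rows, "c"))  # 2 of 3 values are non-null
```

Column `c` is not analysed in the notebook, but it shows how a null value lowers the metric below 1.0.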



In [6]:
# stop the spark session 
spark.stop() 

#### End