<a href="https://colab.research.google.com/github/ctshiz/WORKSPACE_SPARK/blob/main/Spark_Stream_with_PysparK_Saving_to_Parquet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [93]:
#https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks

In [94]:
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://downloads.apache.org/spark/spark-3.2.2/spark-3.2.2-bin-hadoop2.7.tgz
!tar xf spark-3.2.2-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install pyspark
!pip install pyspark[sql]

Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:2 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Hit:4 http://archive.ubuntu.com/ubuntu bionic InRelease
Ign:5 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:6 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:7 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Hit:8 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease
Get:9 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Hit:10 http://ppa.launchpad.net/cran/libgit2/ubuntu bionic InRelease
Hit:11 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu bionic InRelease
Hit:12 http://ppa.launchpad.net/graphics-drivers/ppa/ubuntu bionic InRelease
Fetched 256 kB

In [95]:
import os
# Set the environment before starting Spark so the JVM and Spark home are found
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.2.2-bin-hadoop2.7"

import findspark
findspark.init()  # makes the Spark installation under SPARK_HOME importable

import pyspark
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark import SparkContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = SparkContext.getOrCreate()


In [97]:
#read the files
from pyspark.sql.functions import to_timestamp, col, lit

In [98]:
#import data
df =  spark.read.csv('stream1.csv', header=True)
df = df.drop("_c0","isFraud", "isFlaggedFraud")
df = df.filter(col("step") <= 2)
df = df.limit(5)

In [99]:
df.show()

+----+--------+--------+-----------+-------------+--------------+-----------+--------------+--------------+
|step|    type|  amount|   nameOrig|oldbalanceOrg|newbalanceOrig|   nameDest|oldbalanceDest|newbalanceDest|
+----+--------+--------+-----------+-------------+--------------+-----------+--------------+--------------+
|   1| PAYMENT| 9839.64|C1231006815|     170136.0|     160296.36|M1979787155|           0.0|           0.0|
|   1| PAYMENT| 1864.28|C1666544295|      21249.0|      19384.72|M2044282225|           0.0|           0.0|
|   1|TRANSFER|   181.0|C1305486145|        181.0|           0.0| C553264065|           0.0|           0.0|
|   1|CASH_OUT|   181.0| C840083671|        181.0|           0.0|  C38997010|       21182.0|           0.0|
|   1| PAYMENT|11668.14|C2048537720|      41554.0|      29885.86|M1230701703|           0.0|           0.0|
+----+--------+--------+-----------+-------------+--------------+-----------+--------------+--------------+



The **step** column maps to a unit of time in the real world. Let's assume that 1 step is 1 hour.
For this example, we can then assume that another job runs every hour and collects all the transactions from that time frame.
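As a minimal sketch of that assumption, the helper below turns a step number into the hour it would cover. The start date `SIMULATION_START` and the function name `step_to_window` are hypothetical, and treating steps as 1-based is also an assumption; the dataset itself does not state a start time.

```python
from datetime import datetime, timedelta

# Hypothetical start of the simulation; the dataset does not provide one.
SIMULATION_START = datetime(2022, 1, 1, 0, 0)

def step_to_window(step, start=SIMULATION_START):
    """Return the (begin, end) datetimes of the hour covered by `step`.

    Assumes steps are 1-based, so step 1 covers the first hour after `start`.
    `step` may be an int or a numeric string (CSV columns are read as strings
    unless a schema is inferred or supplied).
    """
    begin = start + timedelta(hours=int(step) - 1)
    return begin, begin + timedelta(hours=1)

begin, end = step_to_window(3)
print(begin, end)
```

Under this reading, the hourly job for a given step would collect the transactions that fall between `begin` and `end`.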

In [100]:
%%time
# Write one CSV file per step into the same folder, mimicking an hourly job
steps = df.select("step").distinct().collect()
for step in steps:
  _df = df.where(f"step = {step[0]}")
  _df.coalesce(1).write.mode("append").option("header", "true").csv("transaction_csv/paysim")

CPU times: user 73.3 ms, sys: 3.04 ms, total: 76.4 ms
Wall time: 8.99 s


--------------------------------------------------------------------------------

**STREAMING**

Let's create a streaming version of this input. We'll read the files one by one, as if they were arriving as a stream.

In [103]:
part = spark.read.csv("transaction_csv/paysim/part-00000-499baf87-6e59-41c7-91e2-698dc84df381-c000.csv",
                      header=True,
                      inferSchema=True)
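The part file name above contains a generated ID, so it will differ on every run. A small sketch of one way to look the file up instead of hard-coding it, using only the standard library; the helper name `find_part_files` is hypothetical.

```python
import glob
import os

def find_part_files(directory):
    """Return the sorted paths of Spark part-*.csv files in `directory`."""
    return sorted(glob.glob(os.path.join(directory, "part-*.csv")))

part_files = find_part_files("transaction_csv/paysim")
# If any files were found, the first one could be passed to spark.read.csv
# in place of the hard-coded file name.
print(part_files[:1])
```

The pattern skips Spark's `_SUCCESS` marker and `.crc` checksum files, which sit in the same folder.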

In [104]:
dataSchema = part.schema

In [106]:
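# The CSV files were written with a header row, so tell the reader to skip it
streaming = spark.readStream.schema(dataSchema)\
                  .option("header", True)\
                  .option("maxFilesPerTrigger",1)\
                  .csv("transaction_csv/paysim/")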
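Streaming file sources need a schema up front, which is why one is taken from a static read first. As an alternative sketch, the schema could be spelled out as a DDL string, which `readStream.schema(...)` also accepts. The column names come from the `df.show()` output above; the types are my assumption of what `inferSchema` would choose for this data, and the `header` option in the commented usage is likewise an assumption based on the files having been written with a header row.

```python
# Column names are taken from the df.show() output; the types are assumed.
COLUMNS = [
    ("step", "INT"),
    ("type", "STRING"),
    ("amount", "DOUBLE"),
    ("nameOrig", "STRING"),
    ("oldbalanceOrg", "DOUBLE"),
    ("newbalanceOrig", "DOUBLE"),
    ("nameDest", "STRING"),
    ("oldbalanceDest", "DOUBLE"),
    ("newbalanceDest", "DOUBLE"),
]

# Backticks keep column names such as `type` from being read as keywords.
schema_ddl = ", ".join(f"`{name}` {dtype}" for name, dtype in COLUMNS)
print(schema_ddl)

# Usage in the notebook (requires the running `spark` session):
# streaming = (spark.readStream.schema(schema_ddl)
#                   .option("header", True)
#                   .option("maxFilesPerTrigger", 1)
#                   .csv("transaction_csv/paysim/"))
```

An explicit schema removes the dependency on a specific part file existing before the stream is defined.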

In [107]:
# Write new data to Parquet files
streaming.writeStream\
  .format("parquet")\
  .option("checkpointLocation", "transaction_checkpoint")\
  .option("path", "transaction_parquet")\
  .start()

<pyspark.sql.streaming.StreamingQuery at 0x7fe1e4e8dbd0>

Note: to see the stream pick up new data:

1) filter the source data on step (step == 3); `df` above only keeps step <= 2, so start from the unfiltered CSV,
2) save it as a CSV in another folder,
3) manually copy the CSV file into the **transaction_csv/paysim** folder, which is the path the stream reads,
4) after 1 min or less, a new Parquet file will be added to the **transaction_parquet** folder
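Rather than refreshing the folder by hand, the last step can be checked with a small polling loop. This is a sketch using only the standard library; the helper name `wait_for_new_parquet` and its default timeout and poll interval are hypothetical choices.

```python
import glob
import os
import time

def wait_for_new_parquet(directory, already_seen, timeout_s=60.0, poll_s=1.0):
    """Poll `directory` until a Parquet file not in `already_seen` appears.

    Returns the sorted list of new file paths, or an empty list on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        current = set(glob.glob(os.path.join(directory, "*.parquet")))
        new_files = sorted(current - set(already_seen))
        if new_files or time.monotonic() >= deadline:
            return new_files
        time.sleep(poll_s)

# Usage: snapshot the output folder, copy the new CSV into the input folder,
# then wait for the stream to write the matching Parquet file.
# seen = glob.glob("transaction_parquet/*.parquet")
# new = wait_for_new_parquet("transaction_parquet", seen)
```

An empty result after the timeout suggests the stream is not running or the CSV was copied to a folder the stream does not watch.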

In [110]:
#activityQuery = (streaming.writeStream\
#  .format("parquet")\
#  .option("checkpointLocation", "transaction_checkpoint")\
#  .option("path", "transaction_parquet")\
#  .start())

In [111]:
# To turn off the stream, run activityQuery.stop(); this resets the query for testing purposes
#activityQuery.stop()
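The query started earlier was not assigned to a variable, so `activityQuery.stop()` cannot reach it. One possible workaround is to go through `spark.streams.active`, which lists the session's active streaming queries. The helper name `stop_all` is hypothetical; it relies only on the `isActive` attribute and `stop()` method that streaming query objects expose.

```python
def stop_all(queries):
    """Stop every active query in `queries`; return how many were stopped."""
    stopped = 0
    for query in list(queries):
        if query.isActive:
            query.stop()
            stopped += 1
    return stopped

# Usage in the notebook (requires the running `spark` session):
# stop_all(spark.streams.active)
```

Assigning the result of `.start()` to a name, as in the commented-out cell above, avoids the need for this lookup.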