<a href="https://colab.research.google.com/github/carsofferrei/04_data_processing/blob/main/04_data_processing%20/spark_streaming/2_checkpoint.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Checkpoint

# Setting up PySpark

In [None]:
%pip install pyspark



In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('Test streaming').getOrCreate()

In [18]:
!rm -rf content/input/*
!rm -rf content/output/*
!rm -rf content/checkpoint/*

In [30]:
from datetime import datetime
import csv

def generate_file():
    """Write a small ';'-separated CSV with a timestamped name into content/input."""
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    filename = f"content/input/file_{timestamp}.csv"
    with open(filename, 'w', newline='') as csvfile:
        fieldnames = ['col', 'value', 'file']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, delimiter=";")
        writer.writeheader()
        # Each row records its source file so it can be traced in the output
        writer.writerow({'col': 'c1', 'value': 'v1', 'file': filename})
        writer.writerow({'col': 'c2', 'value': 'v2', 'file': filename})
        writer.writerow({'col': 'c3', 'value': 'v3', 'file': filename})

!mkdir -p content/input
!mkdir -p content/output

In [39]:
generate_file()

In [40]:
spark.read.format("csv").option("sep", ";").option("header", True).load("content/input").show(100, False)

+---+-----+-------------------------------------+
|col|value|file                                 |
+---+-----+-------------------------------------+
|c1 |v1   |content/input/file_20241123144751.csv|
|c2 |v2   |content/input/file_20241123144751.csv|
|c3 |v3   |content/input/file_20241123144751.csv|
|c1 |v1   |content/input/file_20241123144625.csv|
|c2 |v2   |content/input/file_20241123144625.csv|
|c3 |v3   |content/input/file_20241123144625.csv|
|c1 |v1   |content/input/file_20241123144753.csv|
|c2 |v2   |content/input/file_20241123144753.csv|
|c3 |v3   |content/input/file_20241123144753.csv|
|c1 |v1   |content/input/file_20241123144749.csv|
|c2 |v2   |content/input/file_20241123144749.csv|
|c3 |v3   |content/input/file_20241123144749.csv|
+---+-----+-------------------------------------+



In [41]:
from pyspark.sql.types import StructType, StructField, StringType

# Streaming file sources need an explicit schema
schema = StructType([
    StructField('Col', StringType(), True),
    StructField('Value', StringType(), True),
    StructField('File', StringType(), True)
])


# This only declares where the data will be read from; nothing runs yet
stream = spark.readStream.format('csv').schema(schema).option("sep", ";").option('header', True).load('content/input/')

In [42]:
query = (stream.writeStream
    .format('csv')
    .option("header", True)
    #.queryName("stream") # optional; we have to judge when a query name is actually needed
    .option('checkpointLocation', 'content/checkpoint') # where progress is recorded
    .option('path', 'content/output') # where the output files are written
    .trigger(processingTime='5 seconds') # the query runs every 5 seconds
    .outputMode('append') # only new rows are written to the sink
    .start()
)

In [43]:
# The sink wrote with the default "," separator, so read it back the same way
print(spark.read.csv('content/output', header=True, sep=",").count())
spark.read.csv('content/output', header=True, sep=",").show(100, False)

12
+---+-----+-------------------------------------+
|Col|Value|File                                 |
+---+-----+-------------------------------------+
|c1 |v1   |content/input/file_20241123144751.csv|
|c2 |v2   |content/input/file_20241123144751.csv|
|c3 |v3   |content/input/file_20241123144751.csv|
|c1 |v1   |content/input/file_20241123144753.csv|
|c2 |v2   |content/input/file_20241123144753.csv|
|c3 |v3   |content/input/file_20241123144753.csv|
|c1 |v1   |content/input/file_20241123144625.csv|
|c2 |v2   |content/input/file_20241123144625.csv|
|c3 |v3   |content/input/file_20241123144625.csv|
|c1 |v1   |content/input/file_20241123144749.csv|
|c2 |v2   |content/input/file_20241123144749.csv|
|c3 |v3   |content/input/file_20241123144749.csv|
+---+-----+-------------------------------------+



In [44]:
query.stop()
# stop() only stops this query; the SparkSession keeps running

In [45]:
query.isActive

False
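
With the query stopped, files can still land in `content/input`, but nothing copies them to `content/output`. One way to see that gap is a small check in plain Python (not Spark). This is only a sketch: the helper name `pending_input_files` is hypothetical, and it assumes the sink's `part-*.csv` files carry a header with a `File` column and use the default `,` separator, as in the cells above.

```python
import csv
from pathlib import Path


def pending_input_files(input_dir="content/input", output_dir="content/output"):
    """Return names of input CSV files that no output row references yet."""
    processed = set()
    # Only the part files hold data; _spark_metadata and .crc files are skipped by the pattern
    for part in Path(output_dir).glob("part-*.csv"):
        with open(part, newline="") as f:
            for row in csv.DictReader(f):
                if row.get("File"):
                    # Compare by file name so relative vs. absolute paths do not matter
                    processed.add(Path(row["File"]).name)
    inputs = sorted(p.name for p in Path(input_dir).glob("*.csv"))
    return [name for name in inputs if name not in processed]
```

Calling `generate_file()` while the query is stopped and then `pending_input_files()` should list the new file; after the query is started again and a trigger has fired, the list would be expected to come back empty.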

**If someone stops the process, input files keep arriving, but they are not reflected in the output because the query is no longer running. Since a checkpoint is defined, the `offsets` folder inside the checkpoint directory records the point where the query stopped.**

**So once someone has stopped the process, all we have to do is call "start" again, with the same checkpoint location:**



```
query = (stream.writeStream
    .format('csv')
    .option("header", True)
    .option('checkpointLocation', 'content/checkpoint') # same checkpoint, so it resumes where it stopped
    .option('path', 'content/output') # where the output files are written
    .trigger(processingTime='5 seconds') # the query runs every 5 seconds
    .outputMode('append') # only new rows are written to the sink
    .start()
)
```
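
To see where a restarted query will resume, it can help to look inside the checkpoint directory. The sketch below assumes the usual Structured Streaming layout, in which `offsets/` and `commits/` each hold one file per micro-batch, named by batch id. The function names `latest_batch_id` and `checkpoint_progress` are illustrative and not part of the Spark API.

```python
from pathlib import Path


def latest_batch_id(folder):
    """Highest numeric file name in a checkpoint subfolder, or None if there is none."""
    path = Path(folder)
    if not path.is_dir():
        return None
    # Batch files are named 0, 1, 2, ...; checksum files such as .0.crc are ignored
    ids = [int(p.name) for p in path.iterdir() if p.is_file() and p.name.isdigit()]
    return max(ids, default=None)


def checkpoint_progress(checkpoint_dir="content/checkpoint"):
    """Summarise the last batch that was planned and the last one that was committed."""
    root = Path(checkpoint_dir)
    return {
        "last_planned_batch": latest_batch_id(root / "offsets"),
        "last_committed_batch": latest_batch_id(root / "commits"),
    }
```

If `last_planned_batch` is ahead of `last_committed_batch`, the query was most likely stopped in the middle of a batch, and that batch should be retried on the next start.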

