# Handling Event Time and Window Operations

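We want to apply processing logic from the perspective of the timeline on which events were generated -> each device stamps its data with a timestamp before sending it

Each stream record forms a row, so the timestamp is simply one of that row's columns.

The event timeline has to be interpreted from the producing system's point of view, based on the recorded timestamps, not from the internal clock of the processing system.

Suppose each piece of data arrives together with the time it was generated (its timestamp).

Now suppose we want to keep observing only the data that falls within a certain time span, for example a real-time trending-search ranking.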

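### Sliding window (figure needed)

Keyword counts from 12:00 to 12:10 -> trending searches at 12:10 => a 10-minute window

Keyword counts from 12:01 to 12:11 -> trending searches at 12:11 => a 10-minute window

...

Keyword counts from 12:09 to 12:19 -> trending searches at 12:19 => a 10-minute window

==> reporting interval (slide) of 1 minute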
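The window/slide relationship above can be sketched in plain Python, without Spark. The helper `sliding_windows` is a hypothetical name, not a Spark API, and it assumes half-open `[start, end)` windows whose start times are aligned to multiples of the slide counted from the Unix epoch. With a 10-minute window and a 1-minute slide, every event belongs to 10 overlapping windows.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def sliding_windows(event_time, window, slide):
    """Return every [start, end) window that contains event_time.

    Assumption: window starts are aligned to multiples of `slide`
    counted from the Unix epoch (naive datetimes, no time zones).
    """
    # Latest window start that is not after the event.
    start = event_time - ((event_time - EPOCH) % slide)
    windows = []
    # Walk backwards one slide at a time while the window still covers the event.
    while start + window > event_time:
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)


event = datetime(2021, 1, 1, 0, 0, 1)
wins = sliding_windows(event, timedelta(minutes=10), timedelta(minutes=1))
print(len(wins))    # 10
print(wins[0][0])   # 2020-12-31 23:51:00
print(wins[-1][0])  # 2021-01-01 00:00:00
```

This is also why a single event stamped `2021-01-01 00:00:01` shows up in several windows that start late on `2020-12-31`.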

---

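### Watermark

Normally the field declared as the timestamp increases monotonically and the timeline advances with it -> but events can still arrive late

A watermark discards events that lag behind the current timeline by more than a set threshold

Say we are busy aggregating keywords for 12:09 to 12:19 when a record created at 12:02 finally arrives, delayed by some network problem

Structured Streaming can handle this internally: it keeps the intermediate aggregates around for a while, so late arrivals are still counted and the affected results are updated

But it cannot do this indefinitely, so we have to put a bound on how long that "while" is

That bound decides how late data may arrive before it is left out of the aggregation. Setting it is called watermarking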

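### Watermark (figure needed)

The Spark engine tracks the current event time and clears out anything that is too old

Data keeps flowing into the engine with timestamps that mostly keep increasing, and the engine keeps track of the maximum timestamp seen so far (it checks this only at each window reporting point)

Only data whose timestamp lies within (watermark size) of that tracked value is reflected in the corresponding window aggregate

In other words, windows that have fallen behind the watermark are no longer updated, and late data from outside the watermark is simply dropped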
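The bookkeeping described above can be illustrated with a toy model in plain Python. This is a simplified sketch, not Spark's implementation: it uses tumbling (non-overlapping) windows for brevity, and it updates the watermark after every event, whereas the notes above say the engine only checks at reporting points. The function name `aggregate_with_watermark` and the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def aggregate_with_watermark(event_times, window, delay):
    """Count events per tumbling window, dropping data that is too late.

    watermark = (max event time seen so far) - delay.
    An event is dropped when its window already ended at or before the watermark.
    """
    counts = Counter()
    dropped = []
    max_seen = None
    for t in event_times:
        start = t - ((t - EPOCH) % window)
        end = start + window
        watermark = None if max_seen is None else max_seen - delay
        if watermark is not None and end <= watermark:
            dropped.append(t)       # window is closed: ignore the late event
        else:
            counts[start] += 1      # window is still open: update its count
        if max_seen is None or t > max_seen:
            max_seen = t            # the tracked maximum never moves backwards
    return counts, dropped


def at(hour, minute):
    return datetime(2021, 1, 1, hour, minute)


# 12:08 arrives late but within the bound; 12:09 arrives after its window closed.
events = [at(12, 3), at(12, 12), at(12, 8), at(12, 19), at(12, 9)]
counts, dropped = aggregate_with_watermark(
    events, window=timedelta(minutes=10), delay=timedelta(minutes=5)
)
print({k.strftime("%H:%M"): v for k, v in sorted(counts.items())})  # {'12:00': 2, '12:10': 2}
print([t.strftime("%H:%M") for t in dropped])  # ['12:09']
```

The late 12:08 event is still counted because the watermark at that moment is 12:07, before its window ends at 12:10. By the time 12:09 arrives the watermark has moved to 12:14, so that window can no longer be updated.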

---

## Writing the table processing logic

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark=SparkSession.builder.appName("sparkdf").getOrCreate()
data = [("2021-01-01 00:00:01, A B % ^ & *"),
        ("2021-01-01 00:00:02, E 1@ a#$% B*()_+"),
        ("2021-01-01 00:00:03, a b c d"),
        ("2021-01-01 00:00:03, a d e d e"),
        ("2021-01-01 00:00:04, f f a"),
        ("2021-01-01 00:00:06, b c d %"),
        ("2021-01-01 00:00:09, a "),
        ("2021-01-01 00:00:09,  a a a a! !a !a! !@#A@#$%^&*")]

lines = spark.createDataFrame(data, StringType())
lines.show()

+--------------------+
|               value|
+--------------------+
|2021-01-01 00:00:...|
|2021-01-01 00:00:...|
|2021-01-01 00:00:...|
|2021-01-01 00:00:...|
|2021-01-01 00:00:...|
|2021-01-01 00:00:...|
|2021-01-01 00:00:...|
|2021-01-01 00:00:...|
+--------------------+

In [2]:
from pyspark.sql.functions import to_timestamp, split

tmp = split(lines.value, ",")
lines = lines.withColumn("timestamp", to_timestamp(tmp[0])).withColumn("sentence", tmp[1]).select("timestamp", "sentence")
lines.show()

+-------------------+--------------------+
|          timestamp|            sentence|
+-------------------+--------------------+
|2021-01-01 00:00:01|         A B % ^ & *|
|2021-01-01 00:00:02|    E 1@ a#$% B*()_+|
|2021-01-01 00:00:03|             a b c d|
|2021-01-01 00:00:03|           a d e d e|
|2021-01-01 00:00:04|               f f a|
|2021-01-01 00:00:06|             b c d %|
|2021-01-01 00:00:09|                  a |
|2021-01-01 00:00:09|  a a a a! !a !a!...|
+-------------------+--------------------+

In [3]:
from pyspark.sql.functions import explode, split, lower, regexp_replace, trim

words = lines.withColumn("word", explode(split(trim(regexp_replace(lower(lines.sentence), r"[^a-z0-9 ]", "")), " "))).select("timestamp", "word")
words.show()

+-------------------+----+
|          timestamp|word|
+-------------------+----+
|2021-01-01 00:00:01|   a|
|2021-01-01 00:00:01|   b|
|2021-01-01 00:00:02|   e|
|2021-01-01 00:00:02|   1|
|2021-01-01 00:00:02|   a|
|2021-01-01 00:00:02|   b|
|2021-01-01 00:00:03|   a|
|2021-01-01 00:00:03|   b|
|2021-01-01 00:00:03|   c|
|2021-01-01 00:00:03|   d|
|2021-01-01 00:00:03|   a|
|2021-01-01 00:00:03|   d|
|2021-01-01 00:00:03|   e|
|2021-01-01 00:00:03|   d|
|2021-01-01 00:00:03|   e|
|2021-01-01 00:00:04|   f|
|2021-01-01 00:00:04|   f|
|2021-01-01 00:00:04|   a|
|2021-01-01 00:00:06|   b|
|2021-01-01 00:00:06|   c|
+-------------------+----+
only showing top 20 rows
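To check the cleaning step without starting Spark, the same lowercase / strip-symbols / trim / split chain can be mirrored in plain Python. The function `tokenize` is a hypothetical stand-in for the column expression above, assuming `trim` removes only leading and trailing spaces. It also shows an edge case: when a removed symbol sits between two spaces, the split yields an empty-string word.

```python
import re


def tokenize(sentence):
    """Mirror of lower -> regexp_replace -> trim -> split(" ") for one string."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", sentence.lower()).strip(" ")
    return cleaned.split(" ")


print(tokenize(" E 1@ a#$% B*()_+"))  # ['e', '1', 'a', 'b']
print(tokenize(" A B % ^ & *"))       # ['a', 'b']
print(tokenize(" a % b"))             # ['a', '', 'b']
```

If blank words are not wanted in the counts, one option is to filter out empty strings after the split.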



---

## Applying the same batch query to stream processing

In [4]:
spark = SparkSession.builder.appName("StructuredStreamingTest").getOrCreate()
lines = spark.readStream.format("socket").option("host", "localhost").option("port", "5000").load()

21/09/29 00:31:55 WARN TextSocketSourceProvider: The socket source should not be used for production applications! It does not support recovery.
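The socket source connects as a client to `localhost:5000`, so something has to be listening on that port before the query starts. The notebook does not show how the input was fed; a tool such as `nc -lk 5000` is a common choice. As an alternative, here is a minimal sketch of a one-shot feeder in Python. The name `serve_lines` and its behavior (send every line to the first client, then close) are assumptions made for illustration.

```python
import socket
import threading


def serve_lines(lines, host="127.0.0.1", port=0):
    """Send `lines` (newline-terminated) to the first client that connects.

    port=0 lets the OS choose a free port; pass port=5000 to match the notebook.
    Returns (thread, bound_port).
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    bound_port = server.getsockname()[1]

    def _run():
        with server:
            conn, _ = server.accept()  # blocks until a client connects
            with conn:
                for line in lines:
                    conn.sendall((line + "\n").encode("utf-8"))

    thread = threading.Thread(target=_run, daemon=True)
    thread.start()
    return thread, bound_port
```

Calling `serve_lines(["2021-01-01 00:00:01, a b c"], port=5000)` before starting the streaming query would deliver that line in the same `timestamp, sentence` format used above. Because this sketch sends everything at once and then closes the connection, it is only suitable for quick local experiments.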


In [5]:
tmp = split(lines.value, ",")
lines = lines.withColumn("timestamp", to_timestamp(tmp[0])).withColumn("sentence", tmp[1]).select("timestamp", "sentence")
words = lines.withColumn("word", explode(split(trim(regexp_replace(lower(lines.sentence), r"[^a-z0-9 ]", "")), " "))).select("timestamp", "word")

In [6]:
from pyspark.sql.functions import window

# 10-minute windows that slide every minute; the watermark tolerates up to 15 minutes of lateness.
wwc = words.withWatermark("timestamp", "15 minutes")\
           .groupBy(window(words.timestamp, "10 minutes", "1 minute"), words.word)\
           .count()

In [None]:
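# Note: "complete" mode keeps every aggregate and re-emits the whole table on each trigger,
# so the watermark does not drop old window state here; "update" or "append" mode allows that.
query = wwc.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()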

21/09/29 00:32:04 WARN StreamingQueryManager: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-1be68aea-eba5-4044-8864-24056bfc969c. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.

-------------------------------------------
Batch: 0
-------------------------------------------
+------+----+-----+
|window|word|count|
+------+----+-----+
+------+----+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+--------------------+----+-----+
|              window|word|count|
+--------------------+----+-----+
|[2020-12-31 23:55...|   a|    1|
|[2020-12-31 23:52...|   c|    1|
|[2020-12-31 23:52...|   b|    1|
|[2020-12-31 23:58...|   a|    1|
|[2020-12-31 23:57...|   a|    1|
|[2020-12-31 23:53...|   b|    1|
|[2020-12-31 23:58...|   1|    1|
|[2020-12-31 23:58...|    |    1|
|[2020-12-31 23:56...|   1|    1|
|[2020-12-31 23:59...|    |    1|
|[2020-12-31 23:55...|   c|    1|
|[2020-12-31 23:51...|   1|    1|
|[2020-12-31 23:55...|   b|    1|
|[2020-12-31 23:52...|   a|    1|
|[2020-12-31 23:55...|    |    1|
|[2020-12-31 23:52...|    |    1|
|[2020-12-31 23:54...|   1|    1|
|[2020-12-31 23:59...|   b|    1|
|[2020-12-31 23:57...|   b|    1|
|[2020-12-31 23:53...|   c|    1|
+--------------------+----+-----+
only showing top 20 rows

-------------------------------------------
Batch: 2
-------------------------------------------
+--------------------+----+-----+
|              window|word|count|
+--------------------+----+-----+
|[2020-12-31 23:55...|   a|    1|
|[2021-01-01 00:01...|   a|    1|
|[2020-12-31 23:52...|   c|    1|
|[2020-12-31 23:52...|   b|    1|
|[2020-12-31 23:58...|   a|    1|
|[2020-12-31 23:57...|   a|    1|
|[2020-12-31 23:53...|   b|    1|
|[2020-12-31 23:58...|   1|    1|
|[2021-01-01 00:10...|   a|    1|
|[2020-12-31 23:58...|    |    1|
|[2020-12-31 23:56...|   1|    1|
|[2020-12-31 23:59...|    |    1|
|[2020-12-31 23:55...|   c|    1|
|[2020-12-31 23:51...|   1|    1|
|[2020-12-31 23:55...|   b|    1|
|[2021-01-01 00:08...|   a|    1|
|[2020-12-31 23:52...|   a|    1|
|[2021-01-01 00:03...|   a|    1|
|[2020-12-31 23:55...|    |    1|
|[2020-12-31 23:52...|    |    1|
+--------------------+----+-----+
only showing top 20 rows

-------------------------------------------
Batch: 3
-------------------------------------------
+--------------------+----+-----+
|              window|word|count|
+--------------------+----+-----+
|[2020-12-31 23:55...|   a|    1|
|[2021-01-01 00:01...|   a|    1|
|[2020-12-31 23:52...|   c|    1|
|[2020-12-31 23:52...|   b|    1|
|[2020-12-31 23:58...|   a|    1|
|[2020-12-31 23:57...|   a|    1|
|[2020-12-31 23:53...|   b|    1|
|[2020-12-31 23:58...|   1|    1|
|[2021-01-01 00:13...|   b|    2|
|[2021-01-01 00:10...|   a|    1|
|[2021-01-01 00:08...|   b|    2|
|[2020-12-31 23:58...|    |    1|
|[2020-12-31 23:56...|   1|    1|
|[2020-12-31 23:59...|    |    1|
|[2021-01-01 00:09...|   b|    2|
|[2021-01-01 00:11...|   b|    2|
|[2020-12-31 23:55...|   c|    1|
|[2020-12-31 23:51...|   1|    1|
|[2020-12-31 23:55...|   b|    1|
|[2021-01-01 00:08...|   a|    1|
+--------------------+----+-----+
only showing top 20 rows

-------------------------------------------
Batch: 4
-------------------------------------------
+--------------------+----+-----+
|              window|word|count|
+--------------------+----+-----+
|[2021-01-01 00:52...|   f|    1|
|[2020-12-31 23:55...|   a|    1|
|[2021-01-01 00:01...|   a|    1|
|[2020-12-31 23:52...|   c|    1|
|[2020-12-31 23:52...|   b|    1|
|[2021-01-01 00:51...|   f|    1|
|[2021-01-01 00:55...|   d|    1|
|[2021-01-01 00:55...|   e|    1|
|[2021-01-01 00:54...|   f|    1|
|[2020-12-31 23:58...|   a|    1|
|[2020-12-31 23:57...|   a|    1|
|[2020-12-31 23:53...|   b|    1|
|[2020-12-31 23:58...|   1|    1|
|[2021-01-01 00:13...|   b|    2|
|[2021-01-01 00:10...|   a|    1|
|[2021-01-01 00:08...|   b|    2|
|[2021-01-01 00:51...|   g|    1|
|[2020-12-31 23:58...|    |    1|
|[2021-01-01 00:55...|   g|    1|
|[2021-01-01 00:53...|   f|    1|
+--------------------+----+-----+
only showing top 20 rows



21/09/29 00:34:46 WARN TextSocketMicroBatchStream: Stream closed by localhost:5000


# Window Operations Demo

### Demo

In [2]:
import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
from pyspark.sql.functions import window
from pyspark.sql.types import StructType
from pyspark.sql.streaming import DataStreamReader

In [3]:
windowSize = "60"
slideSize = "10"

windowDuration = '{} seconds'.format(windowSize)
slideDuration = '{} seconds'.format(slideSize)
monitoring_dir = 'monitoring_data'

In [4]:
spark = SparkSession\
    .builder\
    .appName("InteractionCount")\
    .config("spark.eventLog.enabled","true")\
    .config("spark.eventLog.dir","applicationHistory")\
    .master("local[*]")\
    .getOrCreate()

In [5]:
userSchema = StructType().add("userA","string")\
                         .add("userB","string")\
                         .add("timestamp","timestamp")\
                         .add("interaction","string")

In [6]:
twitterIDSchema = StructType().add("userA","string")
twitterIDs = spark.read.schema(twitterIDSchema).csv('twitterIDs.csv')
csvDF = spark\
    .readStream\
    .schema(userSchema)\
    .csv(monitoring_dir)

joinedDF = csvDF.join(twitterIDs,"userA")

In [7]:
interactions = joinedDF.select(joinedDF['userA'],joinedDF['interaction'],joinedDF['timestamp'])

In [8]:
windowedCounts = interactions.groupBy(
                       window(interactions.timestamp, windowDuration, slideDuration),interactions.userA)\
                       .count()

In [None]:
query = windowedCounts\
    .writeStream\
    .outputMode('complete')\
    .format('console')\
    .option('truncate','false')\
    .option('numRows','10000')\
    .trigger(processingTime='12 seconds')\
    .start()

query.awaitTermination()

# Window Operations Exercise Solution

### Exercise

In [1]:
import findspark
# TODO: your path will likely not have 'matthew' in it. Change it to reflect your path.
findspark.init('/home/matthew/spark-2.3.0-bin-hadoop2.7')

In [2]:
import os
import sys

from pyspark.sql import Row, SparkSession
from pyspark.sql.streaming import DataStreamWriter, DataStreamReader
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
from pyspark.sql.functions import window

In [3]:
spark = SparkSession.builder.appName("WindowedCount").master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")

In [4]:
# TODO: Define a windowDuration and slideDuration
windowDuration = '50 seconds'
slideDuration = '30 seconds'

In [5]:
# TODO: Set up the `lines` readStream to take in data from a socket stream, AND include the timestamp
lines = spark.readStream.format("socket").option("host", "localhost")\
         .option("port", 6669).option('includeTimestamp', 'true').load()

In [6]:
# Splitting the lines into words
words = lines.select(explode(split(lines.value, " ")).alias("word"),lines.timestamp)

In [7]:
# TODO: Add a watermark to the data, using the timestamp
words = words.withWatermark("timestamp", "5 seconds")

In [8]:
# TODO: Write out the windowed wordcounts using groupBy(), window(), and count().
windowedCounts = words.groupBy(window(words.timestamp, windowDuration, slideDuration),words.word).count()

In [None]:
query = windowedCounts.writeStream.outputMode("complete")\
                   .option("numRows", "100000")\
                   .option("truncate", "false")\
                   .format("console")\
                   .start()

query.awaitTermination()