# Stream DataFrame Join

Joining a streaming DataFrame works much like joining static tables.

- Since Spark 2.0: stream-static joins are supported

- Since Spark 2.3: stream-stream joins are supported

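Stream-stream joins are harder: any row arriving on one stream may have to be joined with a row that arrives on the other stream at any later time.

Spark therefore buffers past input as streaming state, so that matches that show up later can still be reflected in the join result table.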
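
The buffering idea can be pictured with a small plain-Python model. This is only a sketch, not Spark's implementation: the class `BufferedStreamJoin` and its method names are hypothetical, and it models an inner equi-join on a single key. Each side remembers every row it has seen, so a row arriving on one side can still be matched against rows that arrived earlier on the other side.

```python
from collections import defaultdict


class BufferedStreamJoin:
    """Toy inner join of two streams on a key (hypothetical, not Spark code)."""

    def __init__(self):
        # Every row seen so far on each side, grouped by join key.
        self.left_state = defaultdict(list)
        self.right_state = defaultdict(list)

    def on_left(self, key, value):
        """Buffer a left row, then emit matches against buffered right rows."""
        self.left_state[key].append(value)
        return [(key, value, right) for right in self.right_state[key]]

    def on_right(self, key, value):
        """Buffer a right row, then emit matches against buffered left rows."""
        self.right_state[key].append(value)
        return [(key, left, value) for left in self.left_state[key]]

    def state_size(self):
        """Number of rows currently held in the buffers."""
        left = sum(len(rows) for rows in self.left_state.values())
        right = sum(len(rows) for rows in self.right_state.values())
        return left + right


join = BufferedStreamJoin()
print(join.on_left("ad-1", "impression@10:00"))  # no click yet, so no output
print(join.on_right("ad-1", "click@10:20"))      # matched thanks to the buffer
print(join.state_size())                         # nothing is ever dropped
```

Nothing in this model ever removes a buffered row, so `state_size()` can only grow as more rows arrive.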

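However, a stream is an unbounded table: data keeps arriving, and the intermediate state needed to keep joining old rows against the other stream cannot be retained indefinitely. Additional join conditions are therefore used to bound that state: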

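1. Add a watermark to each stream so that data arriving too late is ignored automatically.

2. Define an additional constraint in the join condition so that the Spark engine can filter out rows that can no longer match.

    - Time range: ```JOIN ON leftTime BETWEEN rightTime AND rightTime + INTERVAL 1 HOUR```

    - Window: ```JOIN ON leftWindow = rightWindow```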
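
To see why a watermark together with a time-range constraint lets the engine discard state, here is a simplified plain-Python model. It is a sketch under assumptions, not Spark's actual cleanup logic: `matches`, `evict_right_rows`, and `MAX_LAG` are hypothetical names, and the watermark is treated simply as "no left row older than this will be processed any more".

```python
from datetime import datetime, timedelta

# From the time-range condition:
# leftTime BETWEEN rightTime AND rightTime + INTERVAL 1 HOUR
MAX_LAG = timedelta(hours=1)


def matches(left_time, right_time):
    """Time-range part of the join condition."""
    return right_time <= left_time <= right_time + MAX_LAG


def evict_right_rows(buffered_right_times, left_watermark):
    """Keep only right rows that a future left row could still match.

    Future left rows have leftTime >= left_watermark, and a right row can
    only match left rows up to rightTime + MAX_LAG, so anything whose
    upper bound lies before the watermark is safe to drop.
    """
    return [t for t in buffered_right_times if t + MAX_LAG >= left_watermark]


buffered = [
    datetime(2021, 10, 5, 0, 30),
    datetime(2021, 10, 5, 1, 30),
    datetime(2021, 10, 5, 2, 0),
]
left_watermark = datetime(2021, 10, 5, 2, 0)
kept = evict_right_rows(buffered, left_watermark)
print(kept)  # the 00:30 row is gone; it could match left rows only up to 01:30
```

Without the time-range constraint there is no upper bound like `rightTime + MAX_LAG`, so in this model no buffered row could ever be proven useless.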

For details on which join types are supported, see [supported joins](https://spark.apache.org/docs/3.0.3/structured-streaming-programming-guide.html#stream-stream-joins) in the Spark documentation.

## Example

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, explode, from_unixtime, regexp_replace, split,
    substring, to_timestamp, trim, udf,
)
from pyspark.sql.types import StringType, StructType

# Create (or reuse) the Spark session used by every cell below.
spark = SparkSession.builder.appName("StructuredStreamingTest").getOrCreate()

### Static DataFrame

In [2]:
data = [("android",10),
        ("app",1),
        ("iphone",12),
        ("web",24),
        ("the",3),
        ("socially",9),
        ("retweet",17),
        ("codebot",5),
        ("testtest",999),
        ("testtest2",888)]

df = spark.createDataFrame(data, ["devname", "ID"])
df.show()

+---------+---+
|  devname| ID|
+---------+---+
|  android| 10|
|      app|  1|
|   iphone| 12|
|      web| 24|
|      the|  3|
| socially|  9|
|  retweet| 17|
|  codebot|  5|
| testtest|999|
|testtest2|888|
+---------+---+



                                                                                

### Stream DataFrame

In [3]:
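# All columns are read as strings; "time" is converted to a timestamp below.
schema = StructType().add("time", "string").add("id", "string").add("text", "string").add("source", "string")

# Watch the directory and treat every new CSV file as newly arrived rows.
lines = spark.readStream.option("sep",",").csv("/data/Structured_Streaming/", schema=schema)

# Extract the lower-cased text between the first ">" and the following "<" of the source field.
# Returns None for empty values or values without ">", instead of raising IndexError.
func = udf(lambda x: x.lower().split(">")[1].split("<")[0] if x and ">" in x else None, StringType())

# timestamp: the first 10 characters of "time" are interpreted as epoch seconds.
# device: non-letters become spaces, then each space-separated token becomes its own row.
devices = lines.withColumn("timestamp", to_timestamp(from_unixtime(substring("time", 1, 10), format="yyyy-MM-dd HH:mm:ss"), 'yyyy-MM-dd HH:mm:ss')).\
              withColumn("device", explode(split(trim(regexp_replace(func("source"), r"[^a-z]", " ")), " "))).\
              select("timestamp", "device")

# Running count per device token, ignoring a few uninformative tokens.
results = devices.where("device not in ('twitter', 'for', 'bot')").groupBy("device").count()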
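
The column expressions in the cell above are dense, so here is a plain-Python sketch of what they compute for a single row. It runs without Spark and makes several assumptions: the sample `source` values are made up, `time` is assumed to be a 13-digit epoch-milliseconds string, and UTC is used where Spark would use the session time zone. The helper names `extract_devices` and `to_event_time` are hypothetical.

```python
import re
from datetime import datetime, timezone


def extract_devices(source):
    """Plain-Python mirror of the UDF + regexp_replace + trim + split chain."""
    if not source or ">" not in source:
        return []  # a null device produces no rows after explode()
    # Text between the first '>' and the following '<', lower-cased.
    label = source.lower().split(">")[1].split("<")[0]
    # regexp_replace(..., "[^a-z]", " ") followed by trim()
    cleaned = re.sub(r"[^a-z]", " ", label).strip()
    # split(..., " "): consecutive spaces yield empty-string tokens
    return cleaned.split(" ")


def to_event_time(time_field):
    """First 10 digits are taken as epoch seconds (assumed millisecond input)."""
    return datetime.fromtimestamp(int(time_field[:10]), tz=timezone.utc)


# Made-up sample values in the style of an HTML anchor.
print(extract_devices('<a href="x" rel="nofollow">Twitter for Android</a>'))
print(extract_devices('<a href="x">Node.js - App</a>'))
print(to_event_time("1633399753123"))
```

The second sample shows how punctuation surrounded by spaces turns into several consecutive spaces, which `split` converts into empty-string tokens. That is one plausible source of the blank `device` rows in the console output.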

### Stream-Static Joins

In [4]:
from pyspark.sql.functions import expr
joiner = results.join(
    df,
    expr("""
        device = devname
        """),
    "left"
).select("device","count","ID").orderBy(col("count").desc()).limit(10)
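
As a rough mental model of the cell above, a left join against the static table behaves like a dictionary lookup for every aggregated row, followed by sorting and truncation. The sketch below is plain Python, not Spark; `left_join_top_n` is a hypothetical helper, and the sample values are a subset of the static DataFrame and of the first console batch.

```python
# Subset of the static DataFrame: devname -> ID
static_ids = {"android": 10, "app": 1, "iphone": 12, "web": 24}

# Subset of the streaming aggregate from the first batch: device -> count
counts = {"android": 12, "incorrect": 11, "app": 11, "iphone": 9}


def left_join_top_n(counts, static_ids, n):
    """Left join on device = devname, then ORDER BY count DESC LIMIT n."""
    # dict.get returns None when there is no match, like a null ID in a left join.
    rows = [(device, count, static_ids.get(device)) for device, count in counts.items()]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:n]


for row in left_join_top_n(counts, static_ids, 3):
    print(row)
```

Devices missing from the static table keep their row and get `None` for `ID`, which corresponds to the `null` entries in the console output.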

In [None]:
query = joiner.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()

21/10/05 02:09:13 WARN StreamingQueryManager: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-9822e3bc-5e93-4e95-acac-f59b54522c69. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
                                                                                

-------------------------------------------
Batch: 0
-------------------------------------------
+---------+-----+----+
|   device|count|  ID|
+---------+-----+----+
|  android|   12|  10|
|incorrect|   11|null|
|      app|   11|   1|
|         |   11|null|
| socially|   11|   9|
|   iphone|    9|  12|
|      web|    9|  24|
| azuerbot|    5|null|
|djangoapp|    5|null|
|  goaidev|    4|null|
+---------+-----+----+



                                                                                

-------------------------------------------
Batch: 1
-------------------------------------------
+------------+-----+----+
|      device|count|  ID|
+------------+-----+----+
|     android|   29|  10|
|         app|   21|   1|
|   incorrect|   18|null|
|    socially|   18|   9|
|         web|   16|  24|
|            |   15|null|
|      iphone|   14|  12|
|      nodejs|   11|null|
|codedailybot|   10|null|
|   djangoapp|    6|null|
+------------+-----+----+



                                                                                

-------------------------------------------
Batch: 2
-------------------------------------------
+---------+-----+----+
|   device|count|  ID|
+---------+-----+----+
|  android|   44|  10|
|      app|   34|   1|
|      web|   25|  24|
|   iphone|   23|  12|
|   nodejs|   21|null|
|incorrect|   19|null|
| socially|   19|   9|
|         |   16|null|
|djangoapp|   10|null|
|   funbot|   10|null|
+---------+-----+----+



                                                                                

-------------------------------------------
Batch: 3
-------------------------------------------
+---------------+-----+----+
|         device|count|  ID|
+---------------+-----+----+
|        android|   68|  10|
|            app|   51|   1|
|            web|   40|  24|
|         iphone|   33|  12|
|         nodejs|   24|null|
|      incorrect|   19|null|
|       socially|   19|   9|
|               |   16|null|
|thedeveloperbot|   11|null|
|      djangoapp|   11|null|
+---------------+-----+----+



                                                                                

-------------------------------------------
Batch: 4
-------------------------------------------
+---------+-----+----+
|   device|count|  ID|
+---------+-----+----+
|  android|   80|  10|
|      app|   65|   1|
|      web|   51|  24|
|   iphone|   39|  12|
|   nodejs|   36|null|
|incorrect|   19|null|
|         |   19|null|
| socially|   19|   9|
| azuerbot|   16|null|
|  goaidev|   15|null|
+---------+-----+----+



                                                                                

-------------------------------------------
Batch: 5
-------------------------------------------
+---------+-----+----+
|   device|count|  ID|
+---------+-----+----+
|  android|   88|  10|
|      app|   79|   1|
|      web|   64|  24|
|   nodejs|   50|null|
|   iphone|   45|  12|
|         |   21|null|
| azuerbot|   19|null|
|incorrect|   19|null|
|   funbot|   19|null|
| socially|   19|   9|
+---------+-----+----+



                                                                                

-------------------------------------------
Batch: 6
-------------------------------------------
+---------+-----+----+
|   device|count|  ID|
+---------+-----+----+
|  android|  100|  10|
|      app|   95|   1|
|      web|   79|  24|
|   nodejs|   58|null|
|   iphone|   52|  12|
|         |   29|null|
| azuerbot|   20|null|
|   funbot|   20|null|
|djangoapp|   19|null|
| socially|   19|   9|
+---------+-----+----+



                                                                                

-------------------------------------------
Batch: 7
-------------------------------------------
+---------------+-----+----+
|         device|count|  ID|
+---------------+-----+----+
|            app|  107|   1|
|        android|  106|  10|
|            web|   89|  24|
|         nodejs|   63|null|
|         iphone|   61|  12|
|               |   29|null|
|         funbot|   25|null|
|       azuerbot|   23|null|
|thedeveloperbot|   23|null|
|        goaidev|   21|null|
+---------------+-----+----+



                                                                                

-------------------------------------------
Batch: 8
-------------------------------------------
+---------------+-----+----+
|         device|count|  ID|
+---------------+-----+----+
|            app|  125|   1|
|        android|  113|  10|
|            web|  103|  24|
|         nodejs|   75|null|
|         iphone|   66|  12|
|               |   29|null|
|         funbot|   28|null|
|       azuerbot|   27|null|
|thedeveloperbot|   27|null|
|        goaidev|   27|null|
+---------------+-----+----+
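
Because the query uses `outputMode("complete")`, every batch above prints the whole top-10 table again with cumulative counts. A minimal plain-Python sketch of that behaviour follows; `run_complete_mode` and the sample micro-batches are hypothetical and not taken from the actual data.

```python
from collections import Counter


def run_complete_mode(micro_batches, top_n):
    """Re-emit the full top-N table after each micro-batch."""
    running = Counter()  # aggregation state carried across batches
    emitted = []
    for batch in micro_batches:
        running.update(batch)
        emitted.append(running.most_common(top_n))
    return emitted


micro_batches = [
    ["android", "android", "app"],
    ["app", "app", "web"],
]
for batch_id, table in enumerate(run_complete_mode(micro_batches, 2)):
    print("Batch:", batch_id, table)
```

Counts never reset between batches, which is why the ranking can change over time, as when `app` overtakes `android` in Batch 7.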



## Generating the CSV files

Run the following command in the terminal where `bash start-cluster.sh` was executed:

```bash
python3 generator.py 4
```

Every 5 seconds, the Python script moves Twitter data from another path into "/data/Structured_Streaming", so that the data appears to arrive as a stream.
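
The contents of `generator.py` are not shown here, so the following is only a guess at the general technique, not the actual script. The function `drip_files` and its parameters are hypothetical: it moves CSV files one by one into the watched directory and pauses between moves.

```python
import shutil
import time
from pathlib import Path


def drip_files(src_dir, dst_dir, interval_seconds=5.0):
    """Move each CSV file from src_dir to dst_dir, pausing between moves."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(Path(src_dir).glob("*.csv")):
        # The streaming reader picks up each file once it appears in dst_dir.
        shutil.move(str(path), str(dst / path.name))
        moved.append(path.name)
        time.sleep(interval_seconds)
    return moved
```

Moving a finished file, rather than writing it in place, helps the file source see a complete file when it lists the directory, at least when source and destination are on the same filesystem.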