To run this recipe, we first need a source of streaming data. Open a terminal window in the JupyterLab UI and run the following command, which uses the nc (netcat) utility to listen on port 9999 and send whatever you type to the client that connects (here, the Spark socket source):

`nc -lk 9999`

Once the command is running, any text you type in that terminal is sent over the socket, one line at a time.

For example, you can enter the following text: 

Fundamentals of Data Engineering: Plan and Build Robust Data Systems by Joe Reis and Matt Housley. This book provides a concise overview of the data engineering landscape and a framework of best practices to assess and solve data engineering problems. It also helps you choose the best technologies and architectures for your data needs. 
 
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann. This book explains the fundamental principles and trade-offs behind the design of distributed data systems. It covers topics such as replication, partitioning, consistency, fault tolerance, batch and stream processing, and data model
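
If nc is not installed in your environment, a short Python script can stand in as the data server. The following is only a sketch under stated assumptions, not part of the recipe: the helper names `open_listener` and `send_lines` are hypothetical, and unlike `nc -lk` it serves a single client once and then closes.

```python
import socket


def open_listener(host="localhost", port=9999):
    """Bind a TCP socket and start listening (hypothetical helper)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow quick restarts on the same port.
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen(1)
    return server


def send_lines(server, lines):
    """Wait for one client, send each line newline-terminated, then close."""
    conn, _addr = server.accept()
    with conn:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))
    server.close()


# Example usage (blocks until a client, such as the Spark socket source, connects):
# send_lines(open_listener(), ["Fundamentals of Data Engineering", "by Joe Reis"])
```

Because this sketch closes the connection after sending its lines, it is suited to a quick smoke test at most; for interactive input, `nc -lk 9999` remains the simpler choice.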

In [1]:
from pyspark.sql import SparkSession  # entry point for Spark SQL and Structured Streaming
from pyspark.sql.functions import explode, split  # column functions used to tokenize lines

In [2]:
spark = (SparkSession.builder  # start the SparkSession builder chain
           .appName("config-streaming")  # application name
           .master("spark://spark-master:7077")  # Spark master URL
           .config("spark.executor.memory", "512m")  # memory per executor
           .getOrCreate())  # create the session or reuse an existing one
spark.sparkContext.setLogLevel("ERROR")  # set the log level to ERROR

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/04 17:37:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
# Create a DataFrame representing the stream of input lines from the connection to localhost:9999
lines = (spark.readStream  # read streaming data
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())  # each received line arrives in the string column "value"

In [4]:
# Split the lines into words
words = lines.select(  # select a single derived column
   explode(  # turn each array element into its own row
       split(lines.value, " ")).alias("word"))  # split each line on single spaces
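
To make the effect of `explode(split(...))` concrete, here is a plain-Python sketch of the same transformation on an in-memory list. It does not use Spark, and the function name `split_and_explode` is a hypothetical helper chosen for illustration.

```python
def split_and_explode(values):
    """Mimic explode(split(value, " ")): one output row per space-separated token."""
    return [word for value in values for word in value.split(" ")]


sample = ["Fundamentals of Data Engineering", "by Joe Reis"]
for word in split_and_explode(sample):
    print(word)
```

Splitting on a single space leaves punctuation attached to tokens, which is why entries such as `systems.` and `replication,` appear as distinct words in the query output.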

In [5]:
# Generate running word count
wordCounts = words.groupBy("word").count()  # group by word and count occurrences
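
The grouping step can be pictured with Python's `collections.Counter`. This is an illustrative sketch on a static list rather than a stream, and `count_words` is a hypothetical name, not a Spark API.

```python
from collections import Counter


def count_words(words):
    """Group identical words and count them, similar in spirit to groupBy("word").count()."""
    return dict(Counter(words))


print(count_words(["by", "Data", "by", "Joe"]))
```

As in the Spark query, the comparison is case-sensitive, so `Data` and `data` would be counted as separate words.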

In [6]:
# Start running the query that prints the running counts to the console
query = (wordCounts.writeStream  # write streaming data
         .outputMode("complete")  # emit the full result table on every trigger
         .format("console")
         .start())  # start the streaming query


-------------------------------------------
Batch: 0
-------------------------------------------
+----+-----+
|word|count|
+----+-----+
+----+-----+




-------------------------------------------
Batch: 1
-------------------------------------------
+------------+-----+
|        word|count|
+------------+-----+
|        Data|    2|
|    overview|    1|
|Fundamentals|    1|
|      stream|    1|
|          by|    2|
|       solve|    1|
|         you|    1|
|   landscape|    1|
|    systems.|    1|
|replication,|    1|
|         for|    1|
|         Joe|    1|
|  tolerance,|    1|
|    provides|    1|
|        Reis|    1|
|      topics|    1|
|   practices|    1|
|       model|    1|
|     concise|    1|
| distributed|    1|
+------------+-----+
only showing top 20 rows




-------------------------------------------
Batch: 2
-------------------------------------------
+------------+-----+
|        word|count|
+------------+-----+
|   Dynamical|    1|
|        Data|    2|
|     complex|    1|
|    overview|    1|
|     Science|    1|
|Fundamentals|    1|
|      stream|    1|
|      Nathan|    1|
|          by|    3|
|       solve|    2|
|         you|    2|
|   landscape|    1|
|          L.|    1|
|    systems.|    1|
|       apply|    1|
|replication,|    1|
|         for|    1|
|         Joe|    1|
|         how|    1|
|  reduction,|    1|
+------------+-----+
only showing top 20 rows



Return to the terminal and add more data to the netcat listener, for example:

Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control by Steven L. Brunton and J. Nathan Kutz. This book teaches you how to apply machine learning and data analytics techniques to solve complex engineering and scientific problems. It covers topics such as dimensionality reduction, sparse sensing, system identification, and control design.

The new input triggers another batch of the streaming query, and the console prints the updated counts (Batch 2 in the output above).
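
The way counts carry over between batches (for example, `by` going from 2 to 3) follows from the `complete` output mode, which re-emits the whole result table on every trigger. The plain-Python sketch below contrasts that with what an `update`-style output would contain, namely only the rows touched by the latest batch. It is a simulation for intuition only: `simulate` is a hypothetical helper, and no Spark code is involved.

```python
from collections import Counter


def simulate(batches):
    """Return a (complete, update) pair of outputs for each batch of input lines."""
    totals = Counter()  # running state kept across batches
    outputs = []
    for lines in batches:
        changed = set()
        for line in lines:
            for word in line.split(" "):
                totals[word] += 1
                changed.add(word)
        complete = dict(totals)                    # the whole result table
        update = {w: totals[w] for w in changed}   # only rows touched by this batch
        outputs.append((complete, update))
    return outputs


batches = [["by Joe", "by Martin"], ["by Steven"]]
for i, (complete, update) in enumerate(simulate(batches)):
    print(f"Batch {i}: complete={complete} update={update}")
```

In the second simulated batch, the complete output still lists `Joe` and `Martin` even though no new input mentioned them, mirroring how Batch 2 above repeats rows already seen in Batch 1.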

In [7]:
query.stop()

In [8]:
spark.stop()  # stop the Spark session and release its resources