# How to Run the Spark Container

## Checklist
* volume path (docker-compose.yml)
    * mariadb
    * jupyter lab
* exposed ports (Dockerfile, docker-compose.yml)
    * Make sure the ports are not already in use
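
One way to check for port conflicts before starting the containers is a small Python probe. This is only a sketch: the helper name `port_in_use` is hypothetical, and the port list is an assumption (9083 appears in the Spark session config later in this notebook; 8888 and 3306 are the usual Jupyter and MariaDB defaults and should be replaced with whatever docker-compose.yml actually exposes).

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is taken
        return sock.connect_ex((host, port)) == 0


# Assumed ports; adjust to the values in Dockerfile / docker-compose.yml
for port in (8888, 9083, 3306):
    print(port, "in use" if port_in_use(port) else "free")
```
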

## docker-compose execution order
* docker-compose → Dockerfile → scripts/entrypoint.sh
* Command
    * docker-compose up --build
    
## Caution
* Run the schema init in entrypoint.sh only once (comment it out after the first run)

# Spark Session

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql import types as T

spark = (
    SparkSession
    .builder
    .appName("Last Chapter")
    .master("local[*]")
    .config("hive.metastore.uris", "thrift://0.0.0.0:9083")
    .enableHiveSupport()
    .getOrCreate()
)

In [2]:
spark.catalog.listDatabases()

[Database(name='default', description='Default Hive database', locationUri='file:/home/jovyan/work/spark-warehouse')]

# Structured DataFrame API

* ch04_data_transactions.txt
    * purchase date, time, customer ID, product ID, quantity, purchase amount

In [3]:
!head ../book-samples/ch04/ch04_data_transactions.txt 

2015-03-30#6:55 AM#51#68#1#9506.21
2015-03-30#7:39 PM#99#86#5#4107.59
2015-03-30#11:57 AM#79#58#7#2987.22
2015-03-30#12:46 AM#51#50#6#7501.89
2015-03-30#11:39 AM#86#24#5#8370.2
2015-03-30#10:35 AM#63#19#5#1023.57
2015-03-30#2:30 AM#23#77#7#5892.41
2015-03-30#7:41 PM#49#58#4#9298.18
2015-03-30#9:18 AM#97#86#8#9462.89
2015-03-30#10:06 PM#94#26#4#4199.15


In [4]:
transactions_schema = T.StructType([
    T.StructField("DATE", T.StringType(), True),
    T.StructField("TIME", T.StringType(), True),
    T.StructField("CUSTOMER_ID", T.StringType(), True),
    T.StructField("PRODUCT_ID", T.StringType(), True),
    T.StructField("QUANTITY", T.StringType(), True),
    T.StructField("AMOUNT", T.StringType(), True),
])

In [5]:
trans_df = spark.read.csv(
    "../book-samples/ch04/ch04_data_transactions.txt",
    sep="#",
    schema=transactions_schema
)

trans_df = trans_df.withColumn("DATE", F.to_date(F.col("DATE"), "yyyy-MM-dd"))
trans_df = trans_df.withColumn("DATETIME", F.concat(F.col("DATE"), F.lit(" "), F.col("TIME")))
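# NOTE: `H` is hour-of-day (0-23) and clashes with the AM/PM marker `a`, so PM rows and
# 12:xx AM rows parse to null in the output below; the 12-hour pattern letter is `h`.
trans_df = trans_df.withColumn("DATETIME", F.to_timestamp(F.col("DATETIME"), "yyyy-MM-dd H:mm a"))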
print("the num of rows:", trans_df.count())

the num of rows: 1000


In [6]:
trans_df.show()

+----------+--------+-----------+----------+--------+-------+-------------------+
|      DATE|    TIME|CUSTOMER_ID|PRODUCT_ID|QUANTITY| AMOUNT|           DATETIME|
+----------+--------+-----------+----------+--------+-------+-------------------+
|2015-03-30| 6:55 AM|         51|        68|       1|9506.21|2015-03-30 06:55:00|
|2015-03-30| 7:39 PM|         99|        86|       5|4107.59|               null|
|2015-03-30|11:57 AM|         79|        58|       7|2987.22|2015-03-30 11:57:00|
|2015-03-30|12:46 AM|         51|        50|       6|7501.89|               null|
|2015-03-30|11:39 AM|         86|        24|       5| 8370.2|2015-03-30 11:39:00|
|2015-03-30|10:35 AM|         63|        19|       5|1023.57|2015-03-30 10:35:00|
|2015-03-30| 2:30 AM|         23|        77|       7|5892.41|2015-03-30 02:30:00|
|2015-03-30| 7:41 PM|         49|        58|       4|9298.18|               null|
|2015-03-30| 9:18 AM|         97|        86|       8|9462.89|2015-03-30 09:18:00|
|2015-03-30|10:0
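
The null values in DATETIME coincide with the PM rows and the 12:46 AM row, which is what one would expect when a 24-hour hour field is combined with an AM/PM marker. As a sanity check outside Spark, the sketch below parses three of the sample strings with Python's 12-hour directive `%I`, showing the timestamps those rows should produce. The helper name `parse_12h` is hypothetical; the Spark-side fix would presumably be the pattern `yyyy-MM-dd h:mm a`.

```python
from datetime import datetime

# DATE and TIME values taken from the sample rows above, joined with a space
samples = [
    "2015-03-30 6:55 AM",
    "2015-03-30 7:39 PM",
    "2015-03-30 12:46 AM",
]


def parse_12h(text):
    """Parse 'yyyy-MM-dd h:mm a' style text using the 12-hour clock (%I) plus AM/PM (%p)."""
    return datetime.strptime(text, "%Y-%m-%d %I:%M %p")


parsed = [parse_12h(s) for s in samples]
for raw, ts in zip(samples, parsed):
    print(raw, "->", ts)
```
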

* ch04_data_products.txt
    * product ID, product name, price, index

In [7]:
!head ../book-samples/ch04/ch04_data_products.txt

1#ROBITUSSIN PEAK COLD NIGHTTIME COLD PLUS FLU#9721.89#10
2#Mattel Little Mommy Doctor Doll#6060.78#6
3#Cute baby doll, battery#1808.79#2
4#Bear doll#51.06#6
5#LEGO Legends of Chima#849.36#6
6#LEGO Castle#4777.51#10
7#LEGO Mixels#8720.91#1
8#LEGO Star Wars#7592.44#4
9#LEGO Lord of the Rings#851.67#2
10#LEGO The Hobbit#7314.55#9


In [55]:
prod_schema = T.StructType([
    T.StructField("PRODUCT_ID", T.StringType(), True),
    T.StructField("PRODUCT_NAME", T.StringType(), True),
    T.StructField("PRICE", T.StringType(), True),
    T.StructField("INDEX", T.StringType(), True),
])

In [57]:
prod_df = spark.read.csv(
    "../book-samples/ch04/ch04_data_products.txt",
    sep="#",
    schema=prod_schema
)

prod_df.show(25)

+----------+--------------------+-------+-----+
|PRODUCT_ID|        PRODUCT_NAME|  PRICE|INDEX|
+----------+--------------------+-------+-----+
|         1|ROBITUSSIN PEAK C...|9721.89|   10|
|         2|Mattel Little Mom...|6060.78|    6|
|         3|Cute baby doll, b...|1808.79|    2|
|         4|           Bear doll|  51.06|    6|
|         5|LEGO Legends of C...| 849.36|    6|
|         6|         LEGO Castle|4777.51|   10|
|         7|         LEGO Mixels|8720.91|    1|
|         8|      LEGO Star Wars|7592.44|    4|
|         9|LEGO Lord of the ...| 851.67|    2|
|        10|     LEGO The Hobbit|7314.55|    9|
|        11|      LEGO Minecraft|5646.81|    3|
|        12|   LEGO Hero Factory| 6911.2|    1|
|        13|   LEGO Architecture| 604.58|    5|
|        14|        LEGO Technic|7423.48|    3|
|        15|LEGO Storage & Ac...|3125.96|    2|
|        16|        LEGO Classic| 9933.3|   10|
|        17|   LEGO Galaxy Squad|5593.16|    4|
|        18|     LEGO Mindstorms|6022.88

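# Data Analysis
* Customer with the most purchases
* 5% discount when two or more Barbie playsets (ID 25) are purchased
* Customers who bought five or more dictionaries
* Customer who spent the most
* Names of products sold yesterday and total sales per product
* List of products not sold yesterday
* Previous-day sales statistics: per-customer average, minimum price, maximum price, and total purchase amount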