# APACHE SPARK FREQUENT PREDICTION

## Assignment 3: Big Data

### Requirements

1. Apache Spark 2.4.0
2. Python 3.7.1
3. JDK 1.8.0

In [1]:
# Initialize Spark

In [2]:
import findspark
findspark.init()

In [3]:
# Import required library
from pyspark.sql import SparkSession

# Create Spark Session
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

In [4]:
# Print Spark object ID
print(spark)

<pyspark.sql.session.SparkSession object at 0x114b74908>


In [127]:
# Data Import
df = spark.read.csv("/Users/gunstringer/Downloads/OnlineRetail.csv", header=True, inferSchema=True)

In [6]:
df.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)



In [128]:
# Display the data
df.show()

+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|   InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|01/12/10 08.26|     2.55|     17850|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|01/12/10 08.26|     3.39|     17850|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|01/12/10 08.26|     2.75|     17850|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|01/12/10 08.26|     3.39|     17850|United Kingdom|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|01/12/10 08.26|     3.39|     17850|United Kingdom|
|   536365|    22752|SET 7 BABUSHKA NE...|       2|01/12/10 08.26|     7.65|     17850|United Kingdom|
|   536365|    21730|GLASS STAR FROSTE...|       6|01/12/10 08.26|     4.

# Preprocess Dataset

In [129]:
# Preprocess the data (extract the items belonging to each InvoiceNo)
## using groupby with the collect_set aggregate function (collects the unique StockCode values per InvoiceNo)

# collect_set must be imported by name; importing only the module does not bring it into scope
from pyspark.sql.functions import collect_set
df_preprocess = df.groupby("InvoiceNo").agg(collect_set('StockCode'),collect_set('Description')).sort('InvoiceNo')
df_preprocess.show()

+---------+----------------------+------------------------+
|InvoiceNo|collect_set(StockCode)|collect_set(Description)|
+---------+----------------------+------------------------+
|   536365|  [84029E, 21730, 8...|    [RED WOOLLY HOTTI...|
|   536366|        [22632, 22633]|    [HAND WARMER UNIO...|
|   536367|  [22310, 22622, 21...|    [POPPY'S PLAYHOUS...|
|   536368|  [22913, 22960, 22...|    [BLUE COAT RACK P...|
|   536369|               [21756]|    [BATH BUILDING BL...|
|   536370|  [22659, 21035, 21...|    [MINI JIGSAW SPAC...|
|   536371|               [22086]|    [PAPER CHAIN KIT ...|
|   536372|        [22632, 22633]|    [HAND WARMER UNIO...|
|   536373|  [84029E, 21730, 8...|    [EDWARDIAN PARASO...|
|   536374|               [21258]|    [VICTORIAN SEWING...|
|   536375|  [84029E, 21730, 8...|    [EDWARDIAN PARASO...|
|   536376|        [21733, 22114]|    [RED HANGING HEAR...|
|   536377|        [22632, 22633]|    [HAND WARMER UNIO...|
|   536378|  [21212, 21931, 20...|    [R
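
To make the grouping step concrete, here is a minimal pure-Python sketch of what `groupby("InvoiceNo")` combined with `collect_set` computes: one set of unique values per invoice. The helper name `collect_sets` and the sample rows are illustrative assumptions, not part of the notebook. Spark performs the same aggregation in a distributed way and does not guarantee the order of elements inside each collected set.

```python
from collections import defaultdict

def collect_sets(rows):
    """Group (invoice_no, stock_code) pairs into one set of codes per invoice."""
    baskets = defaultdict(set)
    for invoice_no, stock_code in rows:
        baskets[invoice_no].add(stock_code)  # duplicates collapse automatically
    return dict(baskets)

# Hypothetical sample rows: (invoice number, stock code)
rows = [
    ("536365", "85123A"),
    ("536365", "71053"),
    ("536365", "85123A"),  # same item listed twice on one invoice
    ("536366", "22633"),
    ("536366", "22632"),
]

baskets = collect_sets(rows)
for invoice_no in sorted(baskets):
    print(invoice_no, sorted(baskets[invoice_no]))
```

Each invoice ends up as a single row holding its basket, which is the input shape `FPGrowth` expects in `itemsCol`.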

# Calculations


In [50]:
## Number of transactions (invoices) in the dataset
df_preprocess.count()

25900

In [53]:
## Total number of distinct item types in the dataset
df.select("description").distinct().count()

4224

## Experiment 1
* **minSupport = 0.1** (minimum itemset sales frequency **2590**)
* **minConfidence = 0.5**

## Experiment 2
* **minSupport = 0.01** (minimum itemset sales frequency **259**)
* **minConfidence = 0.6**

## Experiment 3
* **minSupport = 0.005** (minimum itemset sales frequency **130**, since 0.005 × 25900 = 129.5)
* **minConfidence = 0.8**
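
The minimum frequencies quoted for each experiment follow from multiplying minSupport by the 25900 invoices counted earlier. The sketch below reproduces that arithmetic. It assumes the threshold is rounded up to a whole invoice, and the helper name `min_count` is hypothetical. The support values are passed as strings so that `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction
from math import ceil

N_TRANSACTIONS = 25900  # invoice count reported by df_preprocess.count()

def min_count(min_support, n_transactions=N_TRANSACTIONS):
    """Smallest number of invoices an itemset needs to reach min_support.

    Assumption: a fractional threshold is rounded up to a whole invoice.
    min_support is given as a string so Fraction parses it exactly.
    """
    return ceil(Fraction(min_support) * n_transactions)

for experiment, support in enumerate(["0.1", "0.01", "0.005"], start=1):
    print(f"Experiment {experiment}: minSupport={support} -> {min_count(support)} invoices")
```

For the third experiment the product is 129.5, so under this assumption an itemset needs to appear in at least 130 invoices.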

# Experiment 1

In [130]:
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="collect_set(Description)", minSupport=0.1, minConfidence=0.5)
model = fpGrowth.fit(df_preprocess)

In [131]:
model.freqItemsets.show()

+-----+----+
|items|freq|
+-----+----+
+-----+----+



In [132]:
model.associationRules.show()

+----------+----------+----------+----+
|antecedent|consequent|confidence|lift|
+----------+----------+----------+----+
+----------+----------+----------+----+



In [133]:
model.transform(df_preprocess).show()

+---------+----------------------+------------------------+----------+
|InvoiceNo|collect_set(StockCode)|collect_set(Description)|prediction|
+---------+----------------------+------------------------+----------+
|   536365|  [84029E, 21730, 8...|    [RED WOOLLY HOTTI...|        []|
|   536366|        [22632, 22633]|    [HAND WARMER UNIO...|        []|
|   536367|  [22310, 22622, 21...|    [POPPY'S PLAYHOUS...|        []|
|   536368|  [22913, 22960, 22...|    [BLUE COAT RACK P...|        []|
|   536369|               [21756]|    [BATH BUILDING BL...|        []|
|   536370|  [22659, 21035, 21...|    [MINI JIGSAW SPAC...|        []|
|   536371|               [22086]|    [PAPER CHAIN KIT ...|        []|
|   536372|        [22632, 22633]|    [HAND WARMER UNIO...|        []|
|   536373|  [84029E, 21730, 8...|    [EDWARDIAN PARASO...|        []|
|   536374|               [21258]|    [VICTORIAN SEWING...|        []|
|   536375|  [84029E, 21730, 8...|    [EDWARDIAN PARASO...|        []|
|   53

# Experiment 2

In [134]:
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="collect_set(Description)", minSupport=0.01, minConfidence=0.6)
model2 = fpGrowth.fit(df_preprocess)

In [135]:
model2.freqItemsets.show()

+--------------------+----+
|               items|freq|
+--------------------+----+
|[SET/3 RED GINGHA...| 489|
|[BALLOONS  WRITIN...| 346|
|[WHITE HANGING HE...|2302|
|[REGENCY CAKESTAN...|2169|
|[REGENCY CAKESTAN...| 360|
|[HAND WARMER UNIO...| 486|
|[AGED GLASS SILVE...| 345|
|[JUMBO BAG RED RE...|2135|
|[JUMBO BAG RED RE...| 288|
|[JUMBO BAG RED RE...| 452|
|[4 TRADITIONAL SP...| 486|
|[CLASSIC CAFE SUG...| 343|
|[DELUXE SEWING KIT ]| 342|
|     [PARTY BUNTING]|1706|
|[PARTY BUNTING, J...| 332|
|[PARTY BUNTING, R...| 398|
|[PARTY BUNTING, W...| 390|
|[RECIPE BOX RETRO...| 485|
|[LUNCH BAG RED RE...|1607|
|[LUNCH BAG RED RE...| 588|
+--------------------+----+
only showing top 20 rows



In [136]:
model2.associationRules.show()

+--------------------+--------------------+------------------+------------------+
|          antecedent|          consequent|        confidence|              lift|
+--------------------+--------------------+------------------+------------------+
|[PAPER CHAIN KIT ...|[PAPER CHAIN KIT ...|0.6670673076923077|14.766703648915188|
|[JUMBO BAG SPACEB...|[JUMBO BAG PINK P...|0.6134259259259259|12.906361885850107|
|[JUMBO BAG SCANDI...|[JUMBO SHOPPER VI...|0.6128318584070797|13.371815613094663|
|[HAND WARMER SCOT...|[HAND WARMER OWL ...|0.6057866184448463|23.593794613115065|
|[SET/6 RED SPOTTY...|[SET/6 RED SPOTTY...|0.6641366223908919| 40.18957598113108|
|[SET/6 RED SPOTTY...|[SET/20 RED RETRO...|0.7020872865275142|18.404919758160542|
|[ROSES REGENCY TE...|[GREEN REGENCY TE...|0.7672253258845437| 18.79956096538286|
|[ROSES REGENCY TE...|[PINK REGENCY TEA...|  0.62756052141527|20.291906997073024|
|[JUMBO SHOPPER VI...|[JUMBO BAG RED RE...|0.7447619047619047| 9.034816549570648|
|[SET/6 RED SPOT
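
The `confidence` and `lift` columns can be read as ratios of supports: confidence is the support of antecedent and consequent together divided by the support of the antecedent, and lift is that confidence divided by the support of the consequent. The following pure-Python sketch computes both on a tiny made-up dataset. The transactions, item names and function names are illustrative assumptions and do not come from the retail data.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """confidence(antecedent -> consequent) / support(consequent)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

# Hypothetical baskets, one set of items per invoice
transactions = [
    {"tea", "cup"},
    {"tea", "cup", "plate"},
    {"tea"},
    {"cup"},
    {"plate"},
]

print("confidence(tea -> cup):", confidence(["tea"], ["cup"], transactions))
print("lift(tea -> cup):", lift(["tea"], ["cup"], transactions))
```

A lift above 1 means the consequent shows up more often alongside the antecedent than it does across all invoices, which is why the rules listed above, with lifts well above 1, are candidates for recommendations.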

In [137]:
model2.transform(df_preprocess).show()

+---------+----------------------+------------------------+--------------------+
|InvoiceNo|collect_set(StockCode)|collect_set(Description)|          prediction|
+---------+----------------------+------------------------+--------------------+
|   536365|  [84029E, 21730, 8...|    [RED WOOLLY HOTTI...|                  []|
|   536366|        [22632, 22633]|    [HAND WARMER UNIO...|                  []|
|   536367|  [22310, 22622, 21...|    [POPPY'S PLAYHOUS...|[POPPY'S PLAYHOUS...|
|   536368|  [22913, 22960, 22...|    [BLUE COAT RACK P...|                  []|
|   536369|               [21756]|    [BATH BUILDING BL...|                  []|
|   536370|  [22659, 21035, 21...|    [MINI JIGSAW SPAC...|                  []|
|   536371|               [22086]|    [PAPER CHAIN KIT ...|                  []|
|   536372|        [22632, 22633]|    [HAND WARMER UNIO...|                  []|
|   536373|  [84029E, 21730, 8...|    [EDWARDIAN PARASO...|                  []|
|   536374|               [2

# Experiment 3

In [138]:
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="collect_set(Description)", minSupport=0.005, minConfidence=0.8)
model3 = fpGrowth.fit(df_preprocess)

In [139]:
model3.freqItemsets.show()

+--------------------+----+
|               items|freq|
+--------------------+----+
|[SET/3 RED GINGHA...| 489|
|[SET/3 RED GINGHA...| 168|
|[SET/3 RED GINGHA...| 135|
|[SET/3 RED GINGHA...| 186|
|[SET/3 RED GINGHA...| 150|
|[SET/3 RED GINGHA...| 158|
|[SET/3 RED GINGHA...| 134|
|[SET/3 RED GINGHA...| 152|
|[SET/3 RED GINGHA...| 141|
|[BALLOONS  WRITIN...| 346|
|[BALLOONS  WRITIN...| 138|
|[BALLOONS  WRITIN...| 154|
|[VINTAGE DONKEY T...| 257|
|[GOLD MINI TAPE M...| 206|
|[GOLD MINI TAPE M...| 166|
|[ANGEL DECORATION...| 165|
|  [SKULLS  STICKERS]| 138|
|[WHITE HANGING HE...|2302|
|[GREEN POLKADOT B...| 206|
|[GREEN POLKADOT B...| 146|
+--------------------+----+
only showing top 20 rows



In [140]:
model3.associationRules.show()

+--------------------+--------------------+------------------+------------------+
|          antecedent|          consequent|        confidence|              lift|
+--------------------+--------------------+------------------+------------------+
|[HERB MARKER MINT...|[HERB MARKER PARS...|0.9226804123711341| 99.98921623603503|
|[HERB MARKER MINT...| [HERB MARKER THYME]|0.9432989690721649|102.65312310491207|
|[HERB MARKER MINT...|[HERB MARKER CHIV...|0.8350515463917526|102.98969072164948|
|[CHARLOTTE BAG PI...|[RED RETROSPOT CH...|0.8292682926829268|20.455284552845526|
|[REGENCY TEA PLAT...|[GREEN REGENCY TE...|0.8715083798882681|21.354841096599948|
|[REGENCY TEA PLAT...|[REGENCY TEA PLAT...|0.9553072625698324| 54.14104617190079|
|[REGENCY TEA PLAT...|[PINK REGENCY TEA...|0.8379888268156425| 27.09601824534974|
|[ROSES REGENCY TE...|[GREEN REGENCY TE...|           0.83125| 20.36837748344371|
|[ROTATING LEAVES ...|    [DOTCOM POSTAGE]|0.9825581395348837|35.893167579624105|
|[DOTCOM POSTAGE

In [141]:
model3.transform(df_preprocess).show()

+---------+----------------------+------------------------+--------------------+
|InvoiceNo|collect_set(StockCode)|collect_set(Description)|          prediction|
+---------+----------------------+------------------------+--------------------+
|   536365|  [84029E, 21730, 8...|    [RED WOOLLY HOTTI...|                  []|
|   536366|        [22632, 22633]|    [HAND WARMER UNIO...|                  []|
|   536367|  [22310, 22622, 21...|    [POPPY'S PLAYHOUS...|                  []|
|   536368|  [22913, 22960, 22...|    [BLUE COAT RACK P...|                  []|
|   536369|               [21756]|    [BATH BUILDING BL...|                  []|
|   536370|  [22659, 21035, 21...|    [MINI JIGSAW SPAC...|                  []|
|   536371|               [22086]|    [PAPER CHAIN KIT ...|                  []|
|   536372|        [22632, 22633]|    [HAND WARMER UNIO...|                  []|
|   536373|  [84029E, 21730, 8...|    [EDWARDIAN PARASO...|                  []|
|   536374|               [2

# Example itemset recommendation output

In [150]:
test = spark.createDataFrame([
    ('0', ['23173'],['REGENCY TEAPOT ROSES']),
    ('1', ['21731', '22666'],['RED TOADSTOOL LED NIGHT LIGHT', 'RECIPE BOX PANTRY YELLOW DESIGN']),
    ('2', ['22699'],['ROSES REGENCY TEACUP AND SAUCER ']),
    ('3', ['22356', '20719'],['CHARLOTTE BAG PINK POLKADOT', 'WOODLAND CHARLOTTE BAG']),
    ('4', ['20723', '22355'],['STRAWBERRY CHARLOTTE BAG', 'CHARLOTTE BAG SUKI DESIGN']),
    ('5', ['23352'],["ROLL WRAP 50'S RED CHRISTMAS "])
], ["InvoiceNo", "collect_set(StockCode)","collect_set(Description)"])
test.show()

+---------+----------------------+------------------------+
|InvoiceNo|collect_set(StockCode)|collect_set(Description)|
+---------+----------------------+------------------------+
|        0|               [23173]|    [REGENCY TEAPOT R...|
|        1|        [21731, 22666]|    [RED TOADSTOOL LE...|
|        2|               [22699]|    [ROSES REGENCY TE...|
|        3|        [22356, 20719]|    [CHARLOTTE BAG PI...|
|        4|        [20723, 22355]|    [STRAWBERRY CHARL...|
|        5|               [23352]|    [ROLL WRAP 50'S R...|
+---------+----------------------+------------------------+



In [152]:
model.transform(test).show()

+---------+----------------------+------------------------+----------+
|InvoiceNo|collect_set(StockCode)|collect_set(Description)|prediction|
+---------+----------------------+------------------------+----------+
|        0|               [23173]|    [REGENCY TEAPOT R...|        []|
|        1|        [21731, 22666]|    [RED TOADSTOOL LE...|        []|
|        2|               [22699]|    [ROSES REGENCY TE...|        []|
|        3|        [22356, 20719]|    [CHARLOTTE BAG PI...|        []|
|        4|        [20723, 22355]|    [STRAWBERRY CHARL...|        []|
|        5|               [23352]|    [ROLL WRAP 50'S R...|        []|
+---------+----------------------+------------------------+----------+



In [153]:
model2.transform(test).show()

+---------+----------------------+------------------------+--------------------+
|InvoiceNo|collect_set(StockCode)|collect_set(Description)|          prediction|
+---------+----------------------+------------------------+--------------------+
|        0|               [23173]|    [REGENCY TEAPOT R...|                  []|
|        1|        [21731, 22666]|    [RED TOADSTOOL LE...|                  []|
|        2|               [22699]|    [ROSES REGENCY TE...|[GREEN REGENCY TE...|
|        3|        [22356, 20719]|    [CHARLOTTE BAG PI...|[RED RETROSPOT CH...|
|        4|        [20723, 22355]|    [STRAWBERRY CHARL...|[CHARLOTTE BAG PI...|
|        5|               [23352]|    [ROLL WRAP 50'S R...|                  []|
+---------+----------------------+------------------------+--------------------+



In [154]:
model3.transform(test).show()

+---------+----------------------+------------------------+--------------------+
|InvoiceNo|collect_set(StockCode)|collect_set(Description)|          prediction|
+---------+----------------------+------------------------+--------------------+
|        0|               [23173]|    [REGENCY TEAPOT R...|                  []|
|        1|        [21731, 22666]|    [RED TOADSTOOL LE...|    [DOTCOM POSTAGE]|
|        2|               [22699]|    [ROSES REGENCY TE...|                  []|
|        3|        [22356, 20719]|    [CHARLOTTE BAG PI...|[RED RETROSPOT CH...|
|        4|        [20723, 22355]|    [STRAWBERRY CHARL...|[RED RETROSPOT CH...|
|        5|               [23352]|    [ROLL WRAP 50'S R...|[ROLL WRAP 50'S C...|
+---------+----------------------+------------------------+--------------------+
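
The `prediction` column is built from the association rules: a rule applies to a basket when the basket contains every item of the rule's antecedent, and the consequents of all applicable rules are collected, leaving out items already in the basket. The sketch below mimics that logic in plain Python. The function name `predict` and the single-letter items are hypothetical, and it is a simplified illustration rather than Spark's actual implementation.

```python
def predict(items, rules):
    """Return consequents of every rule whose antecedent is contained in items.

    Items already present in the basket are not recommended again.
    """
    basket = set(items)
    prediction = []
    for antecedent, consequent in rules:
        if set(antecedent) <= basket:  # rule applies to this basket
            for item in consequent:
                if item not in basket and item not in prediction:
                    prediction.append(item)
    return prediction

# Hypothetical rules as (antecedent, consequent) pairs
rules = [
    (["A"], ["B"]),
    (["A", "C"], ["D"]),
    (["B"], ["A"]),
]

print(predict(["A"], rules))
print(predict(["A", "C"], rules))
print(predict(["Z"], rules))
```

This also explains the empty predictions above: a basket gets `[]` whenever no rule in the model has an antecedent fully contained in it, which is always the case for the first model because it has no rules at all.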



# Conclusion

In [155]:
model.freqItemsets.count()

0

In [156]:
model2.freqItemsets.count()

1038

In [157]:
model3.freqItemsets.count()

7003

## Analysis of the output for each model

1. Model 1 **(minSup = 0.1, minConf = 0.5)** produces no predictions because its **minSupport is too high**: no item reaches the required sales frequency of 2590.
2. Model 2 **(minSup = 0.01, minConf = 0.6)** yields fewer frequent itemsets than model 3 (1038 versus 7003) because the number of itemsets is governed by minSup: the lower the threshold, the more itemsets qualify.
3. Model 3 **(minSup = 0.005, minConf = 0.8)** yields more frequent itemsets, and every rule it keeps has a confidence of at least 0.8, compared with 0.6 for model 2.
4. The differences between the predictions of models 2 and 3 can be summarized as follows:
    * predictions that appear in model 2 but not in model 3 come from rules with **a confidence between 0.6 and 0.8**
    * predictions that appear in model 3 but not in model 2 come from itemsets with **a sales frequency below 259**
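
The last point can be expressed as a small check on a single rule, given the frequency of its full itemset and its confidence. This is a sketch under stated assumptions: the helper names are hypothetical, the minimum count of 259 for model 2 comes from the experiment setup, and the minimum count of 130 for model 3 assumes that 0.005 × 25900 = 129.5 is rounded up.

```python
def kept_by(freq, conf, min_count, min_conf):
    """True if a rule with this itemset frequency and confidence passes both thresholds."""
    return freq >= min_count and conf >= min_conf

def compare(freq, conf):
    """Return (kept by model 2, kept by model 3) for one rule."""
    in_model2 = kept_by(freq, conf, min_count=259, min_conf=0.6)
    in_model3 = kept_by(freq, conf, min_count=130, min_conf=0.8)  # 130 is an assumed rounding of 129.5
    return in_model2, in_model3

# Hypothetical rules described by (itemset frequency, confidence)
examples = [(300, 0.7), (150, 0.9), (300, 0.9), (100, 0.9)]
for freq, conf in examples:
    print(f"freq={freq}, confidence={conf} -> (model 2, model 3) = {compare(freq, conf)}")
```

A frequent rule with moderate confidence is kept only by model 2, while a rarer rule with high confidence is kept only by model 3, matching the two bullet points above.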