# APACHE SPARK FREQUENT PREDICTION

## Assignment 3: Big Data

### Requirements

1. Apache Spark 2.4.0
2. Python 3.7.1
3. JDK 1.8.0

In [1]:
# Spark initialization

In [2]:
import findspark
findspark.init()

In [3]:
# Import required library
from pyspark.sql import SparkSession

# Create Spark Session
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

In [4]:
# Print Spark object ID
print(spark)

<pyspark.sql.session.SparkSession object at 0x114b74908>


In [5]:
# Data Import
df = spark.read.csv("/Users/gunstringer/Downloads/OnlineRetail.csv", header=True, inferSchema=True)

In [6]:
df.printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)



In [7]:
# Show the data
df.show()

+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|   InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+--------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|01/12/10 08.26|     2.55|     17850|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|01/12/10 08.26|     3.39|     17850|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|01/12/10 08.26|     2.75|     17850|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|01/12/10 08.26|     3.39|     17850|United Kingdom|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|01/12/10 08.26|     3.39|     17850|United Kingdom|
|   536365|    22752|SET 7 BABUSHKA NE...|       2|01/12/10 08.26|     7.65|     17850|United Kingdom|
|   536365|    21730|GLASS STAR FROSTE...|       6|01/12/10 08.26|     4.

# Preprocess Dataset

In [36]:
# Preprocess data (extract the items belonging to each InvoiceNo)
## Uses groupby with the collect_set aggregate function (collects the unique StockCode values per InvoiceNo)

from pyspark.sql.functions import collect_set
df_preprocess = df.groupby("InvoiceNo").agg(collect_set('StockCode'))
df_preprocess.show()

+---------+----------------------+
|InvoiceNo|collect_set(StockCode)|
+---------+----------------------+
|   536596|  [22900, 22114, 84...|
|   536938|  [22112, 21931, 84...|
|   537252|               [22197]|
|   537691|  [22505, 46000R, 2...|
|   538041|               [22145]|
|   538184|  [22561, 22147, 21...|
|   538517|  [22749, 21212, 22...|
|   538879|  [21212, 22759, 22...|
|   539275|  [22083, 22150, 22...|
|   539630|  [22111, 22971, 22...|
|   540499|  [22697, 22796, 21...|
|   540540|  [22111, 22834, 22...|
|   540976|  [22413, 21212, 22...|
|   541432|  [22113, 22457, 21...|
|   541518|  [21212, 22432, 22...|
|   541783|  [22561, 22697, 22...|
|   542026|  [22398, 22194, 22...|
|   542375|  [22629, 21731, 22...|
|   543641|  [22645, 75131, 22...|
|   544303|  [84596L, 22931, 8...|
+---------+----------------------+
only showing top 20 rows
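
To make the grouping step concrete, here is a minimal pure-Python sketch of what `groupby("InvoiceNo")` combined with `collect_set` computes: one set of unique stock codes per invoice. The rows are a small hypothetical sample (invoice `536366` and its item are made up for illustration), and `group_items_by_invoice` is a helper name introduced here, not part of Spark.

```python
from collections import defaultdict

# Hypothetical (InvoiceNo, StockCode) rows; the duplicate is deliberate.
rows = [
    ("536365", "85123A"),
    ("536365", "71053"),
    ("536365", "85123A"),  # same item listed twice on one invoice
    ("536366", "22633"),   # made-up second invoice
]

def group_items_by_invoice(rows):
    """Collect the unique stock codes of each invoice, similar to collect_set."""
    baskets = defaultdict(set)
    for invoice_no, stock_code in rows:
        baskets[invoice_no].add(stock_code)
    return dict(baskets)

baskets = group_items_by_invoice(rows)
for invoice_no, items in sorted(baskets.items()):
    print(invoice_no, sorted(items))
```

Using a set rather than a list matters here because FP-Growth expects each transaction to contain unique items.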



# Calculations


In [50]:
## Number of transactions (invoices) in the dataset
df_preprocess.count()

25900

In [53]:
## Total number of distinct item types in the dataset
df.select("description").distinct().count()

4224

## Experiment 1
* **minSupport = 0.1** (minimum itemset sales frequency **2590**)
* **minConfidence = 0.5**

## Experiment 2
* **minSupport = 0.01** (minimum itemset sales frequency **259**)
* **minConfidence = 0.6**

## Experiment 3
* **minSupport = 0.005** (minimum itemset sales frequency **129.5**, so at least 130 invoices)
* **minConfidence = 0.8**
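
The minimum frequencies listed for the three experiments follow from multiplying each `minSupport` by the 25,900 invoices counted earlier. The sketch below reproduces that arithmetic; it assumes the threshold is rounded up to the next whole invoice, and `min_count` is a hypothetical helper name. For the third experiment, 0.005 × 25,900 = 129.5, which rounds up to 130 under that assumption.

```python
import math
from fractions import Fraction

N_TRANSACTIONS = 25900  # result of df_preprocess.count()

def min_count(min_support, n_transactions=N_TRANSACTIONS):
    """Smallest number of invoices an itemset needs in order to be frequent.

    Assumption: the threshold is ceil(minSupport * N).
    Fraction(str(...)) keeps the multiplication exact (no float rounding).
    """
    return math.ceil(Fraction(str(min_support)) * n_transactions)

for experiment, min_support in [(1, 0.1), (2, 0.01), (3, 0.005)]:
    print(f"Experiment {experiment}: minSupport={min_support} "
          f"-> at least {min_count(min_support)} invoices")
```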

# Experiment 1

In [75]:
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="collect_set(StockCode)", minSupport=0.1, minConfidence=0.5)
model = fpGrowth.fit(df_preprocess)

In [76]:
model.freqItemsets.show()

+-----+----+
|items|freq|
+-----+----+
+-----+----+



In [77]:
model.associationRules.show()

+----------+----------+----------+----+
|antecedent|consequent|confidence|lift|
+----------+----------+----------+----+
+----------+----------+----------+----+



In [78]:
model.transform(df_preprocess).show()

+---------+----------------------+----------+
|InvoiceNo|collect_set(StockCode)|prediction|
+---------+----------------------+----------+
|   536596|  [22900, 22114, 84...|        []|
|   536938|  [22112, 21931, 84...|        []|
|   537252|               [22197]|        []|
|   537691|  [22505, 46000R, 2...|        []|
|   538041|               [22145]|        []|
|   538184|  [22561, 22147, 21...|        []|
|   538517|  [22749, 21212, 22...|        []|
|   538879|  [21212, 22759, 22...|        []|
|   539275|  [22083, 22150, 22...|        []|
|   539630|  [22111, 22971, 22...|        []|
|   540499|  [22697, 22796, 21...|        []|
|   540540|  [22111, 22834, 22...|        []|
|   540976|  [22413, 21212, 22...|        []|
|   541432|  [22113, 22457, 21...|        []|
|   541518|  [21212, 22432, 22...|        []|
|   541783|  [22561, 22697, 22...|        []|
|   542026|  [22398, 22194, 22...|        []|
|   542375|  [22629, 21731, 22...|        []|
|   543641|  [22645, 75131, 22...|

# Experiment 2

In [60]:
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="collect_set(StockCode)", minSupport=0.01, minConfidence=0.6)
model2 = fpGrowth.fit(df_preprocess)

In [61]:
model2.freqItemsets.show()

+----------------+----+
|           items|freq|
+----------------+----+
|         [22633]| 487|
|         [23236]| 344|
|        [85123A]|2246|
|         [22423]|2172|
| [22423, 85123A]| 355|
|         [22667]| 486|
|         [22579]| 343|
|  [22579, 22578]| 282|
|        [85099B]|2135|
| [85099B, 22423]| 288|
|[85099B, 85123A]| 404|
|         [22620]| 486|
|        [84536A]| 342|
|         [71053]| 342|
|         [47566]|1706|
| [47566, 85099B]| 332|
|  [47566, 22423]| 398|
| [47566, 85123A]| 391|
|         [85150]| 483|
|         [20725]|1608|
+----------------+----+
only showing top 20 rows



In [64]:
model2.associationRules.show()

+--------------------+----------+------------------+------------------+
|          antecedent|consequent|        confidence|              lift|
+--------------------+----------+------------------+------------------+
|      [20726, 22382]|   [20725]|0.6356107660455487|10.237760472997332|
|             [22699]|   [22697]|               0.7|  17.1523178807947|
|      [20723, 22355]|   [20724]|0.8038277511961722|19.827751196172247|
|      [20723, 22355]|   [20719]|0.7272727272727273| 22.34444084977893|
|      [20723, 22355]|   [22356]|  0.65311004784689| 22.25730294636112|
|             [22866]|   [22865]| 0.600358422939068| 23.31226859688435|
|             [20723]|   [20724]| 0.667574931880109|16.466848319709353|
|      [22356, 20719]|   [22355]|0.7405541561712846|21.430561614342203|
|      [22356, 20719]|   [20724]|0.8211586901763224| 20.25524769101595|
|      [22356, 20719]|   [20723]|0.6649874055415617|23.464814446221318|
|        [DOT, 22411]|  [85099B]|0.7713498622589532| 9.357358984
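
The `confidence` and `lift` columns can be checked by hand from itemset frequencies. The sketch below uses the standard definitions, confidence(A → B) = freq(A ∪ B) / freq(A) and lift(A → B) = confidence / support(B). The counts 282 and 343 come from the `model2.freqItemsets` output shown earlier; the count 1057 for item `22697` is an assumption, back-computed from the reported lift rather than read from any output.

```python
N_TRANSACTIONS = 25900  # number of invoices, from df_preprocess.count()

def confidence(freq_union, freq_antecedent):
    """confidence(A -> B) = freq(A and B together) / freq(A)."""
    return freq_union / freq_antecedent

def lift(rule_confidence, freq_consequent, n_transactions=N_TRANSACTIONS):
    """lift(A -> B) = confidence(A -> B) / support(B)."""
    return rule_confidence / (freq_consequent / n_transactions)

# freq([22579, 22578]) = 282 and freq([22579]) = 343 in model2.freqItemsets.
conf_example = confidence(282, 343)
print(round(conf_example, 4))

# Rule [22699] -> [22697] has confidence 0.7 in the table above.
# Assumption: item 22697 appears in 1057 invoices (back-computed value).
lift_example = lift(0.7, 1057)
print(lift_example)
```

Under that assumption the computed lift agrees with the 17.15 reported for the rule `[22699] → [22697]`.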

In [65]:
model2.transform(df_preprocess).show()

+---------+----------------------+--------------------+
|InvoiceNo|collect_set(StockCode)|          prediction|
+---------+----------------------+--------------------+
|   536596|  [22900, 22114, 84...|                  []|
|   536938|  [22112, 21931, 84...|[85099B, 22355, 2...|
|   537252|               [22197]|                  []|
|   537691|  [22505, 46000R, 2...|                  []|
|   538041|               [22145]|                  []|
|   538184|  [22561, 22147, 21...|                  []|
|   538517|  [22749, 21212, 22...|                  []|
|   538879|  [21212, 22759, 22...|                  []|
|   539275|  [22083, 22150, 22...|                  []|
|   539630|  [22111, 22971, 22...|                  []|
|   540499|  [22697, 22796, 21...|      [22698, 20724]|
|   540540|  [22111, 22834, 22...|                  []|
|   540976|  [22413, 21212, 22...|[22355, 22356, 20...|
|   541432|  [22113, 22457, 21...|                  []|
|   541518|  [21212, 22432, 22...|[21931, 22386,

# Experiment 3

In [71]:
from pyspark.ml.fpm import FPGrowth
fpGrowth = FPGrowth(itemsCol="collect_set(StockCode)", minSupport=0.005, minConfidence=0.8)
model3 = fpGrowth.fit(df_preprocess)

In [72]:
model3.freqItemsets.show()

+--------------------+----+
|               items|freq|
+--------------------+----+
|             [22633]| 487|
|      [22633, 22866]| 215|
|[22633, 22866, 22...| 140|
|      [22633, 23355]| 134|
|      [22633, 22865]| 238|
|      [22633, 22867]| 186|
|             [23236]| 344|
|      [23236, 23240]| 160|
|      [23236, 23245]| 147|
|             [23158]| 258|
|             [21922]| 209|
|             [21471]| 166|
|             [21394]| 138|
|            [85123A]|2246|
|             [23504]| 208|
|             [23157]| 166|
|             [21679]| 138|
|             [22423]|2172|
|     [22423, 85123A]| 355|
|             [22667]| 486|
+--------------------+----+
only showing top 20 rows



In [73]:
model3.associationRules.show()

+--------------------+----------+------------------+------------------+
|          antecedent|consequent|        confidence|              lift|
+--------------------+----------+------------------+------------------+
|      [21992, 21935]|     [DOT]|0.9631901840490797| 35.13609262939601|
|      [20712, 22457]|     [DOT]| 0.861271676300578|31.418220304485875|
|[22356, 20719, 20...|   [22355]|0.8285714285714286| 23.97765363128492|
|[22356, 20719, 20...|   [20724]|0.8857142857142857|21.847619047619045|
|[22356, 20719, 20...|   [20723]|0.8285714285714286|29.237057220708447|
|      [23173, 22697]|   [22699]|0.8826530612244898|20.411352040816325|
|      [21731, 22666]|     [DOT]|0.8118279569892473|29.614569135241556|
|[23245, 22697, 22...|   [22698]|0.8103448275862069|26.169490067933612|
|[23264, 23266, 23...|   [23265]|             0.875| 76.82203389830508|
|[23199, 22411, 21...|  [85099B]|0.9027777777777778|10.951730418943534|
|[20723, 20719, 22...|   [22355]|0.8588235294117647| 24.85310548

In [94]:
model3.transform(df_preprocess).show()

+---------+----------------------+--------------------+
|InvoiceNo|collect_set(StockCode)|          prediction|
+---------+----------------------+--------------------+
|   536596|  [22900, 22114, 84...|                  []|
|   536938|  [22112, 21931, 84...|            [85099B]|
|   537252|               [22197]|                  []|
|   537691|  [22505, 46000R, 2...|                  []|
|   538041|               [22145]|                  []|
|   538184|  [22561, 22147, 21...|                  []|
|   538517|  [22749, 21212, 22...|                  []|
|   538879|  [21212, 22759, 22...|                  []|
|   539275|  [22083, 22150, 22...|                  []|
|   539630|  [22111, 22971, 22...|                  []|
|   540499|  [22697, 22796, 21...|        [DOT, 20724]|
|   540540|  [22111, 22834, 22...|                  []|
|   540976|  [22413, 21212, 22...|      [22355, 22356]|
|   541432|  [22113, 22457, 21...|                  []|
|   541518|  [21212, 22432, 22...|[DOT, 22355, 2

# Example itemset recommendation output

In [122]:
df2 = spark.createDataFrame([
    ('0', ['23173']),
    ('1', ['21731', '22666']),
    ('2', ['22699']),
    ('3', ['22356', '20719']),
    ('4', ['20723', '22355']),
    ('5', ['23352'])


], ["InvoiceNo", "collect_set(StockCode)"])
df2.show()

+---------+----------------------+
|InvoiceNo|collect_set(StockCode)|
+---------+----------------------+
|        0|               [23173]|
|        1|        [21731, 22666]|
|        2|               [22699]|
|        3|        [22356, 20719]|
|        4|        [20723, 22355]|
|        5|               [23352]|
+---------+----------------------+



In [123]:
model.transform(df2).show()


+---------+----------------------+----------+
|InvoiceNo|collect_set(StockCode)|prediction|
+---------+----------------------+----------+
|        0|               [23173]|        []|
|        1|        [21731, 22666]|        []|
|        2|               [22699]|        []|
|        3|        [22356, 20719]|        []|
|        4|        [20723, 22355]|        []|
|        5|               [23352]|        []|
+---------+----------------------+----------+



In [124]:
model2.transform(df2).show()


+---------+----------------------+--------------------+
|InvoiceNo|collect_set(StockCode)|          prediction|
+---------+----------------------+--------------------+
|        0|               [23173]|                  []|
|        1|        [21731, 22666]|                  []|
|        2|               [22699]|             [22697]|
|        3|        [22356, 20719]|[22355, 20724, 20...|
|        4|        [20723, 22355]|[20724, 20719, 22...|
|        5|               [23352]|                  []|
+---------+----------------------+--------------------+
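
The predictions above can be understood as rule matching: a rule applies when the basket contains every item of its antecedent, and its consequent is then recommended unless the basket already contains it. The following is a minimal pure-Python sketch of that idea, not Spark's implementation. The rules are the subset of `model2.associationRules` visible in the truncated output above, and `predict` is a hypothetical helper name.

```python
# (antecedent, consequent) pairs copied from the visible part of
# model2.associationRules; the full rule set is larger.
rules = [
    (["22699"], ["22697"]),
    (["20723", "22355"], ["20724"]),
    (["20723", "22355"], ["20719"]),
    (["20723", "22355"], ["22356"]),
    (["20723"], ["20724"]),
    (["22356", "20719"], ["22355"]),
    (["22356", "20719"], ["20724"]),
    (["22356", "20719"], ["20723"]),
]

def predict(basket, rules):
    """Recommend consequents of every rule whose antecedent is in the basket.

    Items already in the basket are skipped and duplicates are dropped.
    """
    basket_items = set(basket)
    prediction = []
    for antecedent, consequent in rules:
        if set(antecedent) <= basket_items:  # rule applies to this basket
            for item in consequent:
                if item not in basket_items and item not in prediction:
                    prediction.append(item)
    return prediction

for basket in (["23173"], ["22699"], ["22356", "20719"], ["20723", "22355"]):
    print(basket, "->", predict(basket, rules))
```

With only this subset of rules the sketch gives the same leading items as the `model2.transform(df2)` output; the order of items in Spark's own output may differ.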



In [125]:
model3.transform(df2).show()


+---------+----------------------+----------+
|InvoiceNo|collect_set(StockCode)|prediction|
+---------+----------------------+----------+
|        0|               [23173]|        []|
|        1|        [21731, 22666]|     [DOT]|
|        2|               [22699]|        []|
|        3|        [22356, 20719]|   [20724]|
|        4|        [20723, 22355]|   [20724]|
|        5|               [23352]|   [23351]|
+---------+----------------------+----------+



# Conclusion

In [119]:
model.freqItemsets.count()

0

In [120]:
model2.freqItemsets.count()

1087

In [121]:
model3.freqItemsets.count()

7538

## Analysis of the output of each model

1. Model 1 **(minSup = 0.1, minConf = 0.5)** produces no predictions at all because the **minSupport value is too high**: no item reaches a sales frequency of 2590 (for example, item 85123A appears in only 2246 invoices).
2. Model 2 **(minSup = 0.01, minConf = 0.6)** produces fewer frequent itemsets than model 3 (1087 versus 7538) because the set of frequent itemsets is determined by minSup: the lower the threshold, the more itemsets qualify.
3. Model 3 **(minSup = 0.005, minConf = 0.8)** produces more frequent itemsets, and its rules have higher confidence (at least 0.8) than those required by model 2.
4. The differences between the predictions of models 2 and 3 can be summarized as follows:
    * predictions that appear in model 2 but not in model 3 come from rules with a **confidence between 0.6 and 0.8**
    * predictions that appear in model 3 but not in model 2 come from itemsets with a **sales frequency below 259**
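
The last point can be expressed as a simple threshold check. The sketch below decides which of the two models would keep a rule, given the frequency of the rule's full itemset and the rule's confidence. The frequencies passed in are hypothetical, `models_keeping_rule` is a helper name introduced here, and the minimum count of 130 for model 3 assumes that 0.005 × 25,900 = 129.5 is rounded up.

```python
# Thresholds of the two models, expressed as invoice counts and confidence.
MODELS = {
    "model2": {"min_count": 259, "min_confidence": 0.6},
    "model3": {"min_count": 130, "min_confidence": 0.8},  # 129.5 rounded up
}

def models_keeping_rule(freq_union, rule_confidence):
    """Return the names of the models whose thresholds the rule satisfies."""
    return [
        name
        for name, limits in MODELS.items()
        if freq_union >= limits["min_count"]
        and rule_confidence >= limits["min_confidence"]
    ]

# Hypothetical rules: (frequency of the full itemset, confidence).
print(models_keeping_rule(300, 0.7))  # frequent enough, confidence below 0.8
print(models_keeping_rule(150, 0.9))  # high confidence, frequency below 259
print(models_keeping_rule(300, 0.9))  # passes both sets of thresholds
```

This matches the pattern in the example predictions: the rule `[22699] → [22697]` has confidence 0.7, so model 2 uses it while model 3 does not.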