# Clustering Assignment

##### Ilham Muhammad Misbahuddin
##### 05111540000088

## Requirements
1. Operating system: Kali Linux 2019.1
2. Apache Spark 2.3.3
3. Scala 2.12.8
4. Python 3.7.3rc1
5. PySpark 2.4.0
6. Findspark 1.3.0
7. Jupyter 4.4.0

## Dataset Description
* Dataset name: [Online Retail](https://www.kaggle.com/puneetbhaya/online-retail)

<table>
    <thead>
        <tr>
            <th>Data Source</th>
            <th>Number of Rows</th>
            <th>Number of Columns</th>
            <th>Size</th>
            <th>File Format</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Online Retail.xlsx</td>
            <td>541909</td>
            <td>8</td>
            <td>23.7 MB</td>
            <td>XLSX</td>
        </tr>
    </tbody>
</table>
    


## Initializing Apache Spark

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

# Create Spark Session
spark = SparkSession.builder.appName("Big Data Frequent Itemsets").getOrCreate()

print(spark)

<pyspark.sql.session.SparkSession object at 0x7fee55adea10>


## Load Dataset

In [2]:
# Load the Excel file with pandas, then convert it to a Spark DataFrame
import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

df_excel = pd.read_excel("/root/Lecture/BIGDATA/datasets/Online Retail.xlsx")

# Explicit schema for the eight columns of the dataset
header = StructType([
    StructField("InvoiceNo", StringType(), True),
    StructField("StockCode", StringType(), True),
    StructField("Description", StringType(), True),
    StructField("Quantity", IntegerType(), True),
    StructField("InvoiceDate", StringType(), True),
    StructField("UnitPrice", DoubleType(), True),
    StructField("CustomerID", StringType(), True),
    StructField("Country", StringType(), True),
])

df = spark.createDataFrame(df_excel, schema=header)
type(df)

pyspark.sql.dataframe.DataFrame

In [3]:
# Count the rows in the Online Retail dataset
df.count()

541909

In [4]:
# Show the first rows of the Online Retail dataset
df.show()

+---------+---------+--------------------+--------+--------------------+---------+----------+--------------+
|InvoiceNo|StockCode|         Description|Quantity|         InvoiceDate|UnitPrice|CustomerID|       Country|
+---------+---------+--------------------+--------+--------------------+---------+----------+--------------+
|   536365|   85123A|WHITE HANGING HEA...|       6|java.util.Gregori...|     2.55|   17850.0|United Kingdom|
|   536365|    71053| WHITE METAL LANTERN|       6|java.util.Gregori...|     3.39|   17850.0|United Kingdom|
|   536365|   84406B|CREAM CUPID HEART...|       8|java.util.Gregori...|     2.75|   17850.0|United Kingdom|
|   536365|   84029G|KNITTED UNION FLA...|       6|java.util.Gregori...|     3.39|   17850.0|United Kingdom|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|java.util.Gregori...|     3.39|   17850.0|United Kingdom|
|   536365|    22752|SET 7 BABUSHKA NE...|       2|java.util.Gregori...|     7.65|   17850.0|United Kingdom|
|   536365|    2173

In [5]:
# Drop the columns that are not needed for frequent-itemset mining
dropped_column = ['Quantity', 'InvoiceDate', 'UnitPrice', 'Country', 'CustomerID']
df = df.drop(*dropped_column)

# Show the dataset after dropping the columns
df.show()

+---------+---------+--------------------+
|InvoiceNo|StockCode|         Description|
+---------+---------+--------------------+
|   536365|   85123A|WHITE HANGING HEA...|
|   536365|    71053| WHITE METAL LANTERN|
|   536365|   84406B|CREAM CUPID HEART...|
|   536365|   84029G|KNITTED UNION FLA...|
|   536365|   84029E|RED WOOLLY HOTTIE...|
|   536365|    22752|SET 7 BABUSHKA NE...|
|   536365|    21730|GLASS STAR FROSTE...|
|   536366|    22633|HAND WARMER UNION...|
|   536366|    22632|HAND WARMER RED P...|
|   536367|    84879|ASSORTED COLOUR B...|
|   536367|    22745|POPPY'S PLAYHOUSE...|
|   536367|    22748|POPPY'S PLAYHOUSE...|
|   536367|    22749|FELTCRAFT PRINCES...|
|   536367|    22310|IVORY KNITTED MUG...|
|   536367|    84969|BOX OF 6 ASSORTED...|
|   536367|    22623|BOX OF VINTAGE JI...|
|   536367|    22622|BOX OF VINTAGE AL...|
|   536367|    21754|HOME BUILDING BLO...|
|   536367|    21755|LOVE BUILDING BLO...|
|   536367|    21777|RECIPE BOX WITH M...|
+---------+---------+--------------------+

In [6]:
# Group the item descriptions by invoice so that each row is one transaction
from pyspark.sql import functions as F

gdf = df.groupby("InvoiceNo").agg(F.collect_set("Description"))

# Show the grouped dataset
gdf.show()

+---------+------------------------+
|InvoiceNo|collect_set(Description)|
+---------+------------------------+
|   536596|    [WAKE UP COCKEREL...|
|   536938|    [RED 3 PIECE RETR...|
|   537252|    [SMALL POPCORN HO...|
|   537691|    [3 HOOK PHOTO SHE...|
|   538041|                   [NaN]|
|   538184|    [MINI JIGSAW SPAC...|
|   538517|    [LARGE POPCORN HO...|
|   538879|    [PARTY CONE CHRIS...|
|   539275|    [RED  HARMONICA I...|
|   539630|    [CHICK GREY HOT W...|
|   540499|    [IVORY KITCHEN SC...|
|   540540|    [HOME SWEET HOME ...|
|   540976|    [60 CAKE CASES DO...|
|   541432|    [RETROSPOT HEART ...|
|   541518|    [60 CAKE CASES DO...|
|   541783|    [PHOTO FRAME 3 CL...|
|   542026|    [SMALL POPCORN HO...|
|   542375|    [CHILDRENS APRON ...|
|   543641|    [DOORSTOP FOOTBAL...|
|   544303|    [TEA TIME KITCHEN...|
+---------+------------------------+
only showing top 20 rows
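
The grouped output above contains a basket whose only item is `NaN` (invoice 538041), which suggests that rows with a missing description are collected as if `NaN` were a product. The plain-Python sketch below mirrors what `groupby("InvoiceNo").agg(F.collect_set("Description"))` does on a few hand-made rows and shows one possible way to skip missing descriptions first. The rows and the helper names `is_missing` and `build_baskets` are illustrative assumptions and are not part of the notebook.

```python
import math
from collections import defaultdict


def is_missing(value):
    """Return True for None, a float NaN, or the literal string 'NaN'."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip().lower() == "nan"


def build_baskets(rows):
    """Group (invoice, description) pairs into one set of items per invoice."""
    baskets = defaultdict(set)
    for invoice, description in rows:
        if is_missing(description):
            continue  # skip rows without a usable description
        baskets[invoice].add(description)
    return dict(baskets)


# Toy rows (hypothetical, not taken from the dataset)
rows = [
    ("A1", "MUG"),
    ("A1", "PLATE"),
    ("A1", "MUG"),          # duplicate, collapsed by the set
    ("A2", float("nan")),   # missing description
    ("A3", "PLATE"),
    ("A3", "NaN"),          # missing value that was turned into a string
]

baskets = build_baskets(rows)
print(baskets)
```

In this sketch invoice `A2` disappears entirely because its only row has no description. In the notebook, the analogous step would be to filter such rows out of `df` before grouping.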



## Frequent Itemset Prediction

In [7]:
# Frequent itemsets with Minimum Support = 0.05 and Minimum Confidence = 0.8
from pyspark.ml.fpm import FPGrowth

fpGrowth = FPGrowth(itemsCol="collect_set(Description)", minSupport=0.05, minConfidence=0.8)
model = fpGrowth.fit(gdf)
model.freqItemsets.show()
model.freqItemsets.count()

+--------------------+----+
|               items|freq|
+--------------------+----+
|[WHITE HANGING HE...|2302|
|[REGENCY CAKESTAN...|2169|
|[JUMBO BAG RED RE...|2135|
|     [PARTY BUNTING]|1706|
|[LUNCH BAG RED RE...|1607|
|[ASSORTED COLOUR ...|1467|
|[SET OF 3 CAKE TI...|1458|
|               [NaN]|1454|
|[PACK OF 72 RETRO...|1334|
|[LUNCH BAG  BLACK...|1295|
+--------------------+----+



10

In [8]:
# Association rules with Minimum Support = 0.05 and Minimum Confidence = 0.8
model.associationRules.show()

+----------+----------+----------+
|antecedent|consequent|confidence|
+----------+----------+----------+
+----------+----------+----------+
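
The rule table above is empty. A rule `A -> B` is built from a frequent itemset that contains both `A` and `B`, and it is only reported when its confidence, support(A and B together) divided by support(A), reaches `minConfidence`. An empty result is what one would expect when no itemset with two or more items reaches the support threshold. The sketch below computes both measures in plain Python; the baskets and the helper names `support` and `confidence` are hypothetical and only serve as an illustration.

```python
def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def confidence(antecedent, consequent, baskets):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, baskets) / support(antecedent, baskets)


# Toy baskets (hypothetical, not taken from the dataset)
baskets = [
    frozenset({"TEACUP GREEN", "TEACUP ROSES"}),
    frozenset({"TEACUP GREEN", "TEACUP ROSES", "BUNTING"}),
    frozenset({"TEACUP GREEN"}),
    frozenset({"BUNTING"}),
]

pair_support = support({"TEACUP GREEN", "TEACUP ROSES"}, baskets)
green_to_roses = confidence({"TEACUP GREEN"}, {"TEACUP ROSES"}, baskets)
roses_to_green = confidence({"TEACUP ROSES"}, {"TEACUP GREEN"}, baskets)

print(pair_support)     # 2 of 4 baskets
print(green_to_roses)   # 2 of the 3 baskets with TEACUP GREEN
print(roses_to_green)   # 2 of the 2 baskets with TEACUP ROSES
```

With a confidence threshold of 0.8, only the rule from `TEACUP ROSES` to `TEACUP GREEN` would pass on these toy baskets, which shows that confidence is not symmetric.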



In [9]:
# Item prediction with Minimum Support = 0.05 and Minimum Confidence = 0.8
model.transform(gdf).show()

+---------+------------------------+----------+
|InvoiceNo|collect_set(Description)|prediction|
+---------+------------------------+----------+
|   536596|    [WAKE UP COCKEREL...|        []|
|   536938|    [RED 3 PIECE RETR...|        []|
|   537252|    [SMALL POPCORN HO...|        []|
|   537691|    [3 HOOK PHOTO SHE...|        []|
|   538041|                   [NaN]|        []|
|   538184|    [MINI JIGSAW SPAC...|        []|
|   538517|    [LARGE POPCORN HO...|        []|
|   538879|    [PARTY CONE CHRIS...|        []|
|   539275|    [RED  HARMONICA I...|        []|
|   539630|    [CHICK GREY HOT W...|        []|
|   540499|    [IVORY KITCHEN SC...|        []|
|   540540|    [HOME SWEET HOME ...|        []|
|   540976|    [60 CAKE CASES DO...|        []|
|   541432|    [RETROSPOT HEART ...|        []|
|   541518|    [60 CAKE CASES DO...|        []|
|   541783|    [PHOTO FRAME 3 CL...|        []|
|   542026|    [SMALL POPCORN HO...|        []|
|   542375|    [CHILDRENS APRON ...|    

In [10]:
# Frequent itemsets with Minimum Support = 0.03 and Minimum Confidence = 0.7
fpGrowth = FPGrowth(itemsCol="collect_set(Description)", minSupport=0.03, minConfidence=0.7)
model2 = fpGrowth.fit(gdf)
model2.freqItemsets.show()
model2.freqItemsets.count()

+--------------------+----+
|               items|freq|
+--------------------+----+
|[WHITE HANGING HE...|2302|
|[REGENCY CAKESTAN...|2169|
|[JUMBO BAG RED RE...|2135|
|     [PARTY BUNTING]|1706|
|[LUNCH BAG RED RE...|1607|
|[ASSORTED COLOUR ...|1467|
|[SET OF 3 CAKE TI...|1458|
|               [NaN]|1454|
|[PACK OF 72 RETRO...|1334|
|[LUNCH BAG  BLACK...|1295|
|[NATURAL SLATE HE...|1266|
|           [POSTAGE]|1250|
|[JUMBO BAG PINK P...|1231|
|[JUMBO BAG PINK P...| 833|
|[JAM MAKING SET W...|1220|
|[HEART OF WICKER ...|1212|
|[JUMBO STORAGE BA...|1201|
|[JUMBO SHOPPER VI...|1187|
|[JAM MAKING SET P...|1174|
|[LUNCH BAG CARS B...|1173|
+--------------------+----+
only showing top 20 rows



75
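
Lowering `minSupport` from 0.05 to 0.03 raises the number of frequent itemsets from 10 to 75, because an itemset now only has to appear in a smaller fraction of the invoices. The sketch below makes that effect visible on toy baskets. It is a brute-force enumeration, not the FP-growth algorithm that Spark uses, and the baskets and the function name `frequent_itemsets` are assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(baskets, min_support):
    """Enumerate every itemset whose support is at least `min_support`.

    Brute force over all item combinations: fine for a toy example,
    far too slow for real data, where FP-growth is used instead.
    """
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            count = sum(1 for basket in baskets if set(candidate) <= basket)
            if count / len(baskets) >= min_support:
                result[candidate] = count
    return result


# Toy baskets (hypothetical, not taken from the dataset)
baskets = [
    {"BAG", "MUG"},
    {"BAG", "MUG", "PLATE"},
    {"BAG"},
    {"MUG", "PLATE"},
    {"BAG", "MUG"},
]

high = frequent_itemsets(baskets, min_support=0.6)
low = frequent_itemsets(baskets, min_support=0.4)
print(len(high), len(low))
```

Every itemset that is frequent at the higher threshold is also frequent at the lower one, so the result can only grow as `min_support` decreases.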

In [11]:
# Association rules with Minimum Support = 0.03 and Minimum Confidence = 0.7
model2.associationRules.show()

+--------------------+--------------------+------------------+
|          antecedent|          consequent|        confidence|
+--------------------+--------------------+------------------+
|[GREEN REGENCY TE...|[ROSES REGENCY TE...|0.7417218543046358|
|[ROSES REGENCY TE...|[GREEN REGENCY TE...|               0.7|
+--------------------+--------------------+------------------+



In [12]:
# Item prediction with Minimum Support = 0.03 and Minimum Confidence = 0.7
model2.transform(gdf).show()

+---------+------------------------+----------+
|InvoiceNo|collect_set(Description)|prediction|
+---------+------------------------+----------+
|   536596|    [WAKE UP COCKEREL...|        []|
|   536938|    [RED 3 PIECE RETR...|        []|
|   537252|    [SMALL POPCORN HO...|        []|
|   537691|    [3 HOOK PHOTO SHE...|        []|
|   538041|                   [NaN]|        []|
|   538184|    [MINI JIGSAW SPAC...|        []|
|   538517|    [LARGE POPCORN HO...|        []|
|   538879|    [PARTY CONE CHRIS...|        []|
|   539275|    [RED  HARMONICA I...|        []|
|   539630|    [CHICK GREY HOT W...|        []|
|   540499|    [IVORY KITCHEN SC...|        []|
|   540540|    [HOME SWEET HOME ...|        []|
|   540976|    [60 CAKE CASES DO...|        []|
|   541432|    [RETROSPOT HEART ...|        []|
|   541518|    [60 CAKE CASES DO...|        []|
|   541783|    [PHOTO FRAME 3 CL...|        []|
|   542026|    [SMALL POPCORN HO...|        []|
|   542375|    [CHILDRENS APRON ...|    

In [13]:
# Frequent itemsets with Minimum Support = 0.01 and Minimum Confidence = 0.6
fpGrowth = FPGrowth(itemsCol="collect_set(Description)", minSupport=0.01, minConfidence=0.6)
model3 = fpGrowth.fit(gdf)
model3.freqItemsets.show()
model3.freqItemsets.count()

+--------------------+----+
|               items|freq|
+--------------------+----+
|[PANTRY MAGNETIC ...| 489|
|[ENGLISH ROSE NOT...| 346|
|[WHITE HANGING HE...|2302|
|[REGENCY CAKESTAN...|2169|
|[REGENCY CAKESTAN...| 360|
|[SET/3 RED GINGHA...| 489|
|[BALLOONS  WRITIN...| 346|
|[JUMBO BAG RED RE...|2135|
|[JUMBO BAG RED RE...| 288|
|[JUMBO BAG RED RE...| 452|
|[HAND WARMER UNIO...| 486|
|[AGED GLASS SILVE...| 345|
|[CLASSIC CAFE SUG...| 343|
|     [PARTY BUNTING]|1706|
|[PARTY BUNTING, J...| 332|
|[PARTY BUNTING, R...| 398|
|[PARTY BUNTING, W...| 390|
|[4 TRADITIONAL SP...| 486|
|[LUNCH BAG RED RE...|1607|
|[LUNCH BAG RED RE...| 588|
+--------------------+----+
only showing top 20 rows



1039

In [14]:
# Association rules with Minimum Support = 0.01 and Minimum Confidence = 0.6
model3.associationRules.show()

+--------------------+--------------------+------------------+
|          antecedent|          consequent|        confidence|
+--------------------+--------------------+------------------+
|[PAPER CHAIN KIT ...|[PAPER CHAIN KIT ...|0.6670673076923077|
|[JUMBO BAG SPACEB...|[JUMBO BAG PINK P...|0.6134259259259259|
|[JUMBO BAG SCANDI...|[JUMBO SHOPPER VI...|0.6128318584070797|
|[HAND WARMER SCOT...|[HAND WARMER OWL ...|0.6057866184448463|
|[SET/6 RED SPOTTY...|[SET/6 RED SPOTTY...|0.6641366223908919|
|[SET/6 RED SPOTTY...|[SET/20 RED RETRO...|0.7020872865275142|
|[ROSES REGENCY TE...|[GREEN REGENCY TE...|0.7672253258845437|
|[ROSES REGENCY TE...|[PINK REGENCY TEA...|  0.62756052141527|
|[JUMBO SHOPPER VI...|[JUMBO BAG RED RE...|0.7447619047619047|
|[SET/6 RED SPOTTY...|[SET/6 RED SPOTTY...|0.8952702702702703|
|[PINK REGENCY TEA...|[GREEN REGENCY TE...|0.8941368078175895|
|[STRAWBERRY CHARL...|[RED RETROSPOT CH...|0.8038277511961722|
|[STRAWBERRY CHARL...|[WOODLAND CHARLOT...|0.7272727272

In [15]:
# Item prediction with Minimum Support = 0.01 and Minimum Confidence = 0.6
model3.transform(gdf).show()

+---------+------------------------+--------------------+
|InvoiceNo|collect_set(Description)|          prediction|
+---------+------------------------+--------------------+
|   536596|    [WAKE UP COCKEREL...|                  []|
|   536938|    [RED 3 PIECE RETR...|[JUMBO BAG RED RE...|
|   537252|    [SMALL POPCORN HO...|                  []|
|   537691|    [3 HOOK PHOTO SHE...|                  []|
|   538041|                   [NaN]|                  []|
|   538184|    [MINI JIGSAW SPAC...|                  []|
|   538517|    [LARGE POPCORN HO...|                  []|
|   538879|    [PARTY CONE CHRIS...|                  []|
|   539275|    [RED  HARMONICA I...|                  []|
|   539630|    [CHICK GREY HOT W...|                  []|
|   540499|    [IVORY KITCHEN SC...|[PINK REGENCY TEA...|
|   540540|    [HOME SWEET HOME ...|                  []|
|   540976|    [60 CAKE CASES DO...|[CHARLOTTE BAG SU...|
|   541432|    [RETROSPOT HEART ...|                  []|
|   541518|   

In [16]:
model3.transform(gdf).printSchema()

root
 |-- InvoiceNo: string (nullable = true)
 |-- collect_set(Description): array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- prediction: array (nullable = true)
 |    |-- element: string (containsNull = true)
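
The `prediction` column is derived from the association rules. According to the Spark ML documentation, `transform` compares each basket with the antecedent of every rule; when the basket contains the whole antecedent, the consequent is added to the prediction, and items already in the basket are left out. This would explain why the first two models, which have no rules or only two, return `[]` for every row shown. The plain-Python sketch below reproduces that logic; the rules, baskets and the function name `predict` are hypothetical.

```python
def predict(basket, rules):
    """Collect consequents of all rules whose antecedent is in the basket."""
    basket = set(basket)
    predicted = []
    for antecedent, consequent in rules:
        if set(antecedent) <= basket:
            for item in consequent:
                # Skip items the basket already has and avoid duplicates
                if item not in basket and item not in predicted:
                    predicted.append(item)
    return predicted


# Toy rules as (antecedent, consequent) pairs (hypothetical)
rules = [
    (["TEACUP GREEN"], ["TEACUP ROSES"]),
    (["TEACUP PINK", "TEACUP ROSES"], ["TEACUP GREEN"]),
]

print(predict(["TEACUP GREEN", "MUG"], rules))          # ['TEACUP ROSES']
print(predict(["TEACUP PINK", "TEACUP ROSES"], rules))  # ['TEACUP GREEN']
print(predict(["MUG"], rules))                          # []
```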

