# **Itemset and Association Rule Mining**
Spark MLlib provides:
1. An itemset mining algorithm based on FP-growth
    - It extracts all the itemsets (of any length) whose support is at least a minimum threshold
2. A rule mining algorithm
    - It extracts the association rules that satisfy a minimum support and a minimum confidence
    - Only rules with a single item in the consequent are extracted
    
FP-growth is one of the most popular and efficient itemset mining algorithms. It is characterized by a single parameter:
- The minimum support threshold **(minsup)**

The input dataset is a transactional dataset: each record is one transaction, that is, a set of items.

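The input of the MLlib itemset and rule mining algorithm is a DataFrame containing a column of type array (one array of items per transaction). By default the column is called items; a different name can be set with the itemsCol parameter.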
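
To make the meaning of minsup concrete, here is a minimal sketch in plain Python (no Spark) that counts the support of every candidate itemset by brute force. It is not FP-growth, which avoids enumerating all candidates; the function name `frequent_itemsets` is hypothetical, and the four toy transactions are assumed to match the rows of the transactions.csv file loaded later in this notebook.

```python
from itertools import combinations

# Toy transactions, assumed for illustration (one set of items per transaction).
transactions = [
    {"A", "B", "C", "D"},
    {"A", "B"},
    {"B", "C"},
    {"A", "D", "E"},
]

def frequent_itemsets(transactions, minsup):
    """Return {itemset: count} for every itemset with support >= minsup.

    Brute-force enumeration for illustration only; this is not FP-growth.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            # Count the transactions that contain every item of the candidate
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= minsup:
                result[candidate] = count
    return result

freq = frequent_itemsets(transactions, minsup=0.5)
for itemset, count in sorted(freq.items()):
    print(itemset, count)
```

With minsup=0.5 and four transactions, an itemset must appear in at least two transactions to be kept, so item E (one occurrence) is discarded.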

In [7]:
from pyspark.ml.fpm import FPGrowth
from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel
from pyspark.sql.functions import col, split

# Input file and output folders
transactionsData = "./databases/transactions.csv"
outputPathItemsets = "/Itemsets"
outputPathRules = "/Rules"

# Create a DataFrame from transactions.csv
transactionsDataDF = spark.read.load(transactionsData,
                                     format="csv", header=True,
                                     inferSchema=True)

In [12]:
transactionsDataDF.printSchema()
transactionsDataDF.show()

root
 |-- transactions: string (nullable = true)

+------------+
|transactions|
+------------+
|     A B C D|
|         A B|
|         B C|
|       A D E|
+------------+



In [16]:
# Split the transactions column on spaces to obtain an array column named items
# (the alias avoids depending on the auto-generated column name)
trsDataDF = transactionsDataDF\
.selectExpr('split(transactions, " ") AS items')

In [17]:
trsDataDF.printSchema()
trsDataDF.show()

root
 |-- items: array (nullable = true)
 |    |-- element: string (containsNull = true)

+------------+
|       items|
+------------+
|[A, B, C, D]|
|      [A, B]|
|      [B, C]|
|   [A, D, E]|
+------------+



In [18]:
# Create an FP-growth Estimator
fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)

# Extract itemsets and rules
model = fpGrowth.fit(trsDataDF)

# Retrieve the DataFrame associated with the frequent itemsets
dfItemsets = model.freqItemsets

# Retrieve the DataFrame associated with the association rules
dfRules = model.associationRules

In [20]:
dfItemsets.show()
dfRules.show()

+------+----+
| items|freq|
+------+----+
|   [C]|   2|
|[C, B]|   2|
|   [A]|   3|
|[A, B]|   2|
|   [D]|   2|
|[D, A]|   2|
|   [B]|   3|
+------+----+

+----------+----------+------------------+------------------+
|antecedent|consequent|        confidence|              lift|
+----------+----------+------------------+------------------+
|       [A]|       [B]|0.6666666666666666|0.8888888888888888|
|       [A]|       [D]|0.6666666666666666|1.3333333333333333|
|       [D]|       [A]|               1.0|1.3333333333333333|
|       [B]|       [C]|0.6666666666666666|1.3333333333333333|
|       [B]|       [A]|0.6666666666666666|0.8888888888888888|
|       [C]|       [B]|               1.0|1.3333333333333333|
+----------+----------+------------------+------------------+
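
The confidence and lift values in the table above can be checked by hand. The following is a minimal sketch in plain Python (no Spark), assuming the usual definitions confidence(X -> Y) = support(X ∪ Y) / support(X) and lift(X -> Y) = confidence(X -> Y) / support(Y). The helper names `support` and `rule_metrics` are hypothetical, and the transactions are assumed to mirror the four rows of the input DataFrame.

```python
# Toy transactions, assumed to mirror the four rows of the input DataFrame.
transactions = [
    {"A", "B", "C", "D"},
    {"A", "B"},
    {"B", "C"},
    {"A", "D", "E"},
]

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

def rule_metrics(antecedent, consequent):
    """Return (confidence, lift) for the rule antecedent -> consequent."""
    confidence = support(set(antecedent) | set(consequent)) / support(antecedent)
    lift = confidence / support(consequent)
    return confidence, lift

for antecedent, consequent in [("A", "B"), ("D", "A"), ("C", "B")]:
    conf, lift = rule_metrics({antecedent}, {consequent})
    print(f"{antecedent} -> {consequent}: confidence={conf:.3f}, lift={lift:.3f}")
```

For example, A appears in 3 of 4 transactions and A together with B in 2 of 4, so the confidence of A -> B is 0.5 / 0.75 ≈ 0.667. Its lift is below 1 because B also appears in 3 of 4 transactions, which means that A and B occur together slightly less often than independence would predict.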



In [None]:
# Save the frequent itemsets in an HDFS output folder
dfItemsets.write.json(outputPathItemsets)

# Save the association rules in an HDFS output folder
dfRules.write.json(outputPathRules)

The result is stored in JSON format because itemsets and rules are stored in columns of type array, which the CSV data source does not support. Hence, CSV files cannot be used to store the result as is.