# Recommendation Systems – Frequent Pattern Mining

* Name: Benedictus Bimo Cahyo Wicaksono<br>
* Student ID: 5025201097<br>
* Class: Big Data<br>
* Lecturer: Abdul Munif, S.Kom., M.Sc.

In [2]:
# Mount Google Drive so that files stored there can be loaded and saved.
from google.colab import drive
drive.mount('/content/gdrive')
# Google generates an authorization token once you click the link; enter it when prompted.

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


I mount Google Drive so that I can load the dataset from it.

In [3]:
import pandas as pd
import csv

csv_file = '/content/gdrive/MyDrive/Institut Teknologi Sepuluh Nopember/Mata Kuliah/Semester 6/Big Data/Mid Test/market-basket.csv'

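with open(csv_file, 'r') as f:
    # Sniff the delimiter (';' or ',') from the first two lines, then rewind the file.
    temp_lines = f.readline() + '\n' + f.readline()
    dialect = csv.Sniffer().sniff(temp_lines, delimiters=';,')
    f.seek(0)
    # on_bad_lines='skip' replaces the deprecated error_bad_lines=False (pandas >= 1.3).
    data = pd.read_csv(f, dialect=dialect, on_bad_lines='skip')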

data.head()





| Unnamed: 0 | BillNo | Itemname | Quantity | Date | Price | CustomerID | Country |
|---|---|---|---|---|---|---|---|
| 0 | 536365 | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 01.12.2010 08:26 | 255 | 17850.0 | United Kingdom |
| 1 | 536365 | WHITE METAL LANTERN | 6 | 01.12.2010 08:26 | 339 | 17850.0 | United Kingdom |
| 2 | 536365 | CREAM CUPID HEARTS COAT HANGER | 8 | 01.12.2010 08:26 | 275 | 17850.0 | United Kingdom |
| 3 | 536365 | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 01.12.2010 08:26 | 339 | 17850.0 | United Kingdom |
| 4 | 536365 | RED WOOLLY HOTTIE WHITE HEART. | 6 | 01.12.2010 08:26 | 339 | 17850.0 | United Kingdom |


After loading the dataset, I have to split the columns on the delimiter because it is a *csv* file.
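As a minimal sketch of how the delimiter detection works, the snippet below runs `csv.Sniffer` on a small in-memory sample. The sample text is hypothetical (it only mimics the shape of the dataset and assumes `;` as the separator); the real file may use a different delimiter.

```python
import csv
import io

# Hypothetical two-line sample shaped like the dataset; ';' is an assumed separator.
sample = "BillNo;Itemname;Quantity\n536365;WHITE METAL LANTERN;6\n"

# Restrict the candidate delimiters to ';' and ',' as in the loading cell.
dialect = csv.Sniffer().sniff(sample, delimiters=";,")

# Parse the sample with the detected dialect to confirm the columns are split.
rows = list(csv.reader(io.StringIO(sample), dialect=dialect))
print(dialect.delimiter)
print(rows)
```

Passing the detected dialect on to the parser is what lets the same loading code handle either separator.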

## Market Basket Analysis

Market basket analysis is a data mining technique used by retailers to increase sales by better understanding customer purchasing patterns. It involves analyzing large data sets, such as purchase history, to reveal product groupings, as well as products that are likely to be purchased together.
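To make "likely to be purchased together" concrete, here is a small sketch of the three standard measures used in market basket analysis: support, confidence, and lift. The baskets and the shortened item names are hypothetical and serve only as an illustration; they are not taken from the dataset.

```python
# Hypothetical baskets; item names are shortened for illustration only.
transactions = [
    {"lantern", "holder", "hanger"},
    {"lantern", "holder"},
    {"holder", "bottle"},
    {"lantern", "bottle"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence relative to how common the consequent is overall."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

print(support({"lantern", "holder"}, transactions))
print(confidence({"lantern"}, {"holder"}, transactions))
print(lift({"lantern"}, {"holder"}, transactions))
```

A lift above 1 suggests the two sides of a rule appear together more often than they would if purchased independently; a lift below 1 suggests the opposite.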

In [4]:
%%capture
!sudo apt-get update --fix-missing

!apt-get install openjdk-8-jdk-headless -qq > /dev/null

!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
#!wget -q https://downloads.apache.org/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz

!mv spark-3.0.0-bin-hadoop3.2.tgz sparkkk
!tar xf sparkkk
!pip install -q findspark

In [5]:
!pip install spark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [6]:
!ps aux | grep py4j

root        7065  0.0  0.0   6904  3328 ?        S    02:35   0:00 /bin/bash -c ps aux | grep py4j
root        7067  0.0  0.0   6312   720 ?        R    02:35   0:00 grep py4j


In [7]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"

import findspark
findspark.init()

# Create the session only once: a second getOrCreate() returns the existing session,
# so settings such as spark.driver.memory would not take effect.
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName('fpgrowth') \
    .master('local[*]') \
    .config('spark.sql.execution.arrow.pyspark.enabled', True) \
    .config('spark.sql.session.timeZone', 'UTC') \
    .config('spark.driver.memory','32G') \
    .config('spark.ui.showConsoleProgress', True) \
    .config('spark.sql.repl.eagerEval.enabled', True) \
    .getOrCreate()

spark   

In [8]:
from google.colab import files
from pyspark.sql import functions as F
from pyspark.ml.fpm import FPGrowth

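# Cast both columns to strings so that each has a single, consistent type.
data['BillNo'] = data['BillNo'].astype(str)
data['Itemname'] = data['Itemname'].astype(str)

sparkdata = spark.createDataFrame(data)
# FPGrowth requires unique items per transaction, so drop repeated (BillNo, Itemname) pairs.
basketdata = sparkdata.dropDuplicates(['BillNo', 'Itemname']).sort('BillNo')
# Collect the items of each bill into one list per transaction.
basketdata = basketdata.groupBy("BillNo").agg(F.collect_list("Itemname")).sort('BillNo')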



Before starting the market basket analysis, I decided to use the *BillNo* and *Itemname* columns. I converted both to *str* because of type differences in these columns.
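The basket-building step (drop duplicates, then group items by bill) can be sketched in plain Python as follows. The rows are hypothetical: the item names come from the preview above, but the second bill number is made up for illustration. The deduplication matters because Spark's FPGrowth expects the items within a transaction to be unique.

```python
from collections import defaultdict

# Hypothetical (BillNo, Itemname) rows; bill "536366" is invented for illustration.
rows = [
    ("536365", "WHITE METAL LANTERN"),
    ("536365", "WHITE METAL LANTERN"),  # same item listed twice on one bill
    ("536365", "CREAM CUPID HEARTS COAT HANGER"),
    ("536366", "RED WOOLLY HOTTIE WHITE HEART."),
]

seen = set()
grouped = defaultdict(list)
for bill_no, item in rows:
    # Skip repeated (bill, item) pairs, mirroring dropDuplicates(['BillNo', 'Itemname']).
    if (bill_no, item) in seen:
        continue
    seen.add((bill_no, item))
    # Append the item to its bill, mirroring groupBy('BillNo') with collect_list('Itemname').
    grouped[bill_no].append(item)

baskets = dict(grouped)
print(baskets)
```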

### minSupport = 0.006 and minConfidence = 0.006

In this section, I run FPGrowth with minSupport=0.006 and minConfidence=0.006.

In [None]:
# FP-Growth (Frequent Pattern Growth) is a method for mining frequent itemsets.
fpGrowth = FPGrowth(itemsCol="collect_list(Itemname)", minSupport=0.006, minConfidence=0.006)
model = fpGrowth.fit(basketdata)

# Display frequent itemsets.
items = model.freqItemsets
items.show()
# Display generated association rules.
rules = model.associationRules
rules.show()
# transform checks each basket against the antecedents of all association rules
# and summarizes the consequents of the matching rules as the prediction.
transformed = model.transform(basketdata)
transformed.show()

To find frequent itemsets and association rules in the transaction data stored in the basketdata DataFrame with PySpark, I set the minimum support and minimum confidence thresholds. I then print the generated frequent itemsets and association rules, and use the fitted model to generate predictions for the same transactions in basketdata.
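The prediction step can be illustrated with a small sketch. It assumes the behavior described in the Spark documentation: a rule applies to a basket when the basket contains all of the rule's antecedent items, and the consequents of the applicable rules are collected as the prediction. Skipping items already in the basket is an assumption about the implementation. The rules and item names below are hypothetical.

```python
# Hypothetical rules as (antecedent set, consequent item) pairs.
rules = [
    ({"lantern"}, "holder"),
    ({"lantern", "holder"}, "hanger"),
    ({"bottle"}, "lantern"),
]

def predict(basket, rules):
    """Collect consequents of every rule whose antecedent is contained in the basket."""
    basket = set(basket)
    predicted = []
    for antecedent, consequent in rules:
        applicable = antecedent <= basket
        # Assumed behavior: do not suggest items already bought, and do not repeat suggestions.
        if applicable and consequent not in basket and consequent not in predicted:
            predicted.append(consequent)
    return predicted

print(predict({"lantern"}, rules))
print(predict({"lantern", "holder"}, rules))
print(predict({"hanger"}, rules))
```

A basket that matches no antecedent gets an empty prediction, which is why many rows in the transformed output can have no suggested items.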

In [None]:
# Convert the Spark DataFrame back to a Pandas DataFrame using Arrow
result_pdf = items.select("*").toPandas()
result_pdf.head()

In [None]:
result_pdf.to_excel('result_pdfItemsFreq.xlsx')

Export the frequent itemsets to an *xlsx* file.

In [None]:
rules_pdf = rules.select("*").toPandas()
rules_pdf.head()

In [None]:
rules_pdf.to_excel('rules_pdfAnteConseConfLift.xlsx')

In [None]:
transformed_pdf = transformed.select("*").toPandas()
transformed_pdf.head()

In [None]:
transformed_pdf.to_excel('transformed_pdfSalesTransactionIDCollectListPred.xlsx')

### minSupport = 0.003 and minConfidence = 0.003

In this section, I run FPGrowth with minSupport=0.003 and minConfidence=0.003.

In [None]:
# FP-Growth (Frequent Pattern Growth) is a method for mining frequent itemsets.
fpGrowth = FPGrowth(itemsCol="collect_list(Itemname)", minSupport=0.003, minConfidence=0.003)
model = fpGrowth.fit(basketdata)

# Display frequent itemsets.
items = model.freqItemsets
items.show()
# Display generated association rules.
rules = model.associationRules
rules.show()
# transform checks each basket against the antecedents of all association rules
# and summarizes the consequents of the matching rules as the prediction.
transformed = model.transform(basketdata)
transformed.show()

As in the previous section, I fit FPGrowth on the basketdata DataFrame, this time with lower minimum support and minimum confidence thresholds. I then print the generated frequent itemsets and association rules, and use the fitted model to generate predictions for the same transactions in basketdata.

In [None]:
# Convert the Spark DataFrame back to a Pandas DataFrame using Arrow
result_pdf = items.select("*").toPandas()
result_pdf.head()

In [None]:
result_pdf.to_excel('result_pdfItemsFreq1.xlsx')

Export the frequent itemsets to an *xlsx* file.

In [None]:
rules_pdf = rules.select("*").toPandas()
rules_pdf.head()

In [None]:
rules_pdf.to_excel('rules_pdfAnteConseConfLift1.xlsx')

In [None]:
transformed_pdf = transformed.select("*").toPandas()
transformed_pdf.head()

In [None]:
transformed_pdf.to_excel('transformed_pdfSalesTransactionIDCollectListPred1.xlsx')

### Conclusion

Normally, raising the minimum support and confidence thresholds yields fewer but more meaningful rules, because only the strongest associations meet the higher thresholds. Comparing the run with minSupport = 0.006 and minConfidence = 0.006 against the run with minSupport = 0.003 and minConfidence = 0.003, I conclude that thresholds set too high may leave too few rules to be of practical use, while thresholds set too low may produce a large number of weak or spurious rules.
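The effect of the support threshold can be illustrated with a brute-force sketch on toy data. This is not FP-Growth itself (it simply enumerates every candidate itemset, which only works for tiny inputs), and the baskets and thresholds are hypothetical, chosen to show the trend rather than to reproduce the experiments above.

```python
from itertools import combinations

# Hypothetical baskets; item names are shortened for illustration only.
toy_baskets = [
    {"lantern", "holder", "hanger"},
    {"lantern", "holder"},
    {"lantern", "hanger"},
    {"holder", "bottle"},
    {"lantern"},
]

def frequent_itemsets(baskets, min_support):
    """Return {itemset: support} for every itemset meeting min_support (brute force)."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            share = sum(set(candidate) <= basket for basket in baskets) / len(baskets)
            if share >= min_support:
                result[candidate] = share
    return result

low = frequent_itemsets(toy_baskets, min_support=0.15)
high = frequent_itemsets(toy_baskets, min_support=0.5)
print(len(low), len(high))
```

On this toy data the lower threshold keeps nine itemsets, including ones that occur in a single basket, while the higher threshold keeps only the two most common single items.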

### Problem

During this mid test, I faced several connection-related problems. My code ran successfully the first time, but when I re-ran it, errors appeared everywhere. I re-ran the code because I wanted to verify everything before exporting the *xlsx* files.


*Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:34797)*