# A11: Generating Frequent Item Lists and Association Rule Mining
##### by Kevin Nguyen (3 April 2017)

## Introduction

Association rules describe how likely it is that one item or action leads to, or occurs together with, another item or action. This assignment is based on a market basket analysis, which rests on the idea that if you buy a certain group of items, you are more or less likely to buy another group of items.

We will analyze the contents of 'bookdata.tsv.gz' using R and the arules package.

This assignment answers each of the questions below.

1. Discuss how you formulate this problem in terms of transactions and items for an association rule mining analysis. Specifically, what are transactions and what are items in this data set? (Hint: A rule of interest is X -> Y where X and Y are book titles with enough support and confidence.)

2. Run Apriori algorithm with the support of 0.002 and confidence .75. Sort all the rules that meet this criterion according to its support and then confidence. Identify the top five rules. Interpret the results.

3. Among the top five rules listed in the previous question, examine which rules are for real and which ones are by chance by using a measure called lift. Define your lift criterion and discuss the decision you made.

4. Run Eclat on the same problem. Discuss your findings. What are main differences between Eclat and Apriori?

5. Outline your big data strategy for this data if you are asked to run association rule mining on an Amazon data set when the main memory of a computer cannot fit all records in this data set.

6. (5 points bonus) List the top five most read books in this data set. Describe how you come to this list.


## Analysis

### 1. 
Discuss how you formulate this problem in terms of transactions and items for an association rule mining analysis. Specifically, what are transactions and what are items in this data set? (Hint: A rule of interest is X -> Y where X and Y are book titles with enough support and confidence.)

The file 'bookdata.tsv.gz' contains the books that users have rated. In this problem, we are trying to identify links between books, e.g. if a user has book 1, then that user also has book 2. In these terms, each 'transaction' is the set of books belonging to one user ID, and the 'items' are the individual book titles. Because the file is large, we use the Apriori algorithm to narrow the results down to the rules that meet the support and confidence criteria we specify.
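
As a minimal sketch of this formulation, the base R snippet below groups a small made-up table of (userid, title) pairs into one basket per user and computes the support of an itemset. The data frame contents and the helper name `support` are hypothetical; only the two column names mirror what the appendix code reads from the real file.

```r
# Hypothetical long-format data: one row per (user, book) pair.
ratings <- data.frame(
  userid = c("u1", "u1", "u2", "u2", "u2", "u3"),
  title  = c("Book A", "Book B", "Book A", "Book B", "Book C", "Book C"),
  stringsAsFactors = FALSE
)

# One transaction per user: the set of distinct titles for that user.
baskets <- lapply(split(ratings$title, ratings$userid), unique)

# Support of an itemset = share of transactions containing every item in it.
support <- function(itemset, baskets) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

length(baskets)                          # number of transactions (users)
support(c("Book A", "Book B"), baskets)  # share of users with both books
```

Here two of the three users have both "Book A" and "Book B", so that pair has a support of 2/3.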


### 2. 
Run Apriori algorithm with the support of 0.002 and confidence .75. Sort all the rules that meet this criterion according to its support and then confidence. Identify the top five rules. Interpret the results.

The top five results of the Apriori algorithm are listed below.  

Top 5 Results sorted by Support

| # | Rule | Support | Confidence | Lift |
|---|------|---------|------------|------|
| 1 | {Harry Potter and the Prisoner of Azkaban,Harry Potter and the Sorcerer's Stone} => {Harry Potter and the Chamber of Secrets} | 0.00258392322056716 | 0.894736842105263 | 146.641318598989 |
| 2 | {Harry Potter and the Chamber of Secrets,Harry Potter and the Prisoner of Azkaban} => {Harry Potter and the Sorcerer's Stone} | 0.00258392322056716 | 0.835087719298246 | 92.4498313090419 |
| 3 | {Harry Potter and the Chamber of Secrets,Harry Potter and the Goblet of Fire} => {Harry Potter and the Prisoner of Azkaban} | 0.00238850045598645 | 0.897959183673469 | 183.390741662519 |
| 4 | {Harry Potter and the Goblet of Fire,Harry Potter and the Prisoner of Azkaban} => {Harry Potter and the Chamber of Secrets} | 0.00238850045598645 | 0.883534136546185 | 144.805270905687 |
| 5 | {Harry Potter and the Chamber of Secrets,Harry Potter and the Prisoner of Azkaban} => {Harry Potter and the Goblet of Fire} | 0.00238850045598645 | 0.771929824561403 | 182.310031488979 |

All of the top 5 rules involve books from the Harry Potter series. The first two rules have the same support because both are built from the same three-book itemset: the support of a rule is the support of its antecedent and consequent combined, so 'if book 1 & 2, then book 3' has the same support as 'if book 3 & 1, then book 2'. The two rules are still useful, but they convey the same general idea.  
The same pattern appears in rules 3, 4, and 5, which share one support value because they are rearrangements of a single three-book itemset.  
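
The shared support can be checked on a toy example. The sketch below uses made-up baskets and hypothetical helper names (`support`, `rule_metrics`). It shows that two rearrangements of one itemset share the same support, while their confidence differs because each is divided by the support of a different antecedent.

```r
# Hypothetical baskets (one character vector of items per user).
baskets <- list(
  c("A", "B", "C"),
  c("A", "B", "C"),
  c("A", "B"),
  c("A", "B"),
  c("B", "C")
)

# Share of baskets that contain every item in the itemset.
support <- function(itemset) {
  mean(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Rule support uses the whole itemset (lhs and rhs together);
# confidence divides that by the support of the antecedent only.
rule_metrics <- function(lhs, rhs) {
  both <- support(c(lhs, rhs))
  c(support = both, confidence = both / support(lhs))
}

r1 <- rule_metrics(lhs = c("A", "B"), rhs = "C")  # {A,B} => {C}
r2 <- rule_metrics(lhs = c("B", "C"), rhs = "A")  # {B,C} => {A}
r1
r2
```

Both rules come from the itemset {A, B, C}, which appears in 2 of 5 baskets, so both have support 0.4. Their confidence differs (0.5 versus about 0.67) because {A, B} and {B, C} have different supports.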


Top 5 Results sorted by Confidence

| # | Rule | Support | Confidence | Lift |
|---|------|---------|------------|------|
| 1 | {Harry Potter and the Goblet of Fire,Harry Potter and the Prisoner of Azkaban,Harry Potter and the Sorcerer's Stone} => {Harry Potter and the Chamber of Secrets} | 0.00209536630911539 | 0.965 | 158.156975088968 |
| 2 | {Harry Potter and the Chamber of Secrets,Harry Potter and the Goblet of Fire,Harry Potter and the Sorcerer's Stone} => {Harry Potter and the Prisoner of Azkaban} | 0.00209536630911539 | 0.932367149758454 | 190.417901175059 |
| 3 | {Harry Potter and the Goblet of Fire,Harry Potter and the Sorcerer's Stone} => {Harry Potter and the Chamber of Secrets} | 0.00224736179267816 | 0.924107142857143 | 151.454912303 |
| 4 | {Harry Potter and the Chamber of Secrets,Harry Potter and the Goblet of Fire} => {Harry Potter and the Prisoner of Azkaban} | 0.00238850045598645 | 0.897959183673469 | 183.390741662519 |
| 5 | {Harry Potter and the Prisoner of Azkaban,Harry Potter and the Sorcerer's Stone} => {Harry Potter and the Chamber of Secrets} | 0.00258392322056716 | 0.894736842105263 | 146.641318598989 |

Again, the top 5 rules all involve books from the Harry Potter series.  
The redundancy seen in the support-sorted list is also present here: the rules tell the same story even though their confidence values differ.

The R code used to run the Apriori algorithm and sort the rules by support and confidence is in Appendix A at the end of the document.

### 3. 
Among the top five rules listed in the previous question, examine which rules are for real and which ones are by chance by using a measure called lift. Define your lift criterion and discuss the decision you made.

Lift is the ratio of a rule's confidence to the confidence expected if the antecedent and consequent were independent. It indicates whether a rule reflects a real association or could have arisen by chance.  

Lift Criterion  
*Lift > 1 : Indicates a rule is **USEFUL** in finding consequent item sets as opposed to selecting transactions randomly*   
*Lift = 1 : Indicates a rule is **NO BETTER** at finding consequent item sets than selecting transactions randomly*  
*Lift < 1 : Indicates a rule is **LESS USEFUL** in finding consequent item sets as opposed to selecting transactions randomly*  
Lift = confidence of rule / support of consequent

In both top-five lists (sorted by support and sorted by confidence), every rule has a lift far greater than 1, ranging from roughly 92 to 190, so all of them can be treated as real rather than chance associations. A rule produced by chance would be expected to have a lift close to 1.  
All five rules are real, but in essence they are variations of the same thing. Had the lift values been close to 1, the rules could have been attributed to chance; a lift below 1 would mean the antecedent makes the consequent less likely.
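
For reference, the lift calculation can be written out and checked against rule 1 of the support-sorted list. The symbols supp and conf are shorthand introduced here for support and confidence, and the denominator is the support of the consequent, which is the confidence expected under independence. The arithmetic only rearranges values reported above, rounded to four significant figures.

```latex
\[
\mathrm{lift}(X \Rightarrow Y)
  = \frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)}
  = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)\,\mathrm{supp}(Y)}
\]
\[
\mathrm{supp}(Y)
  = \frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{lift}(X \Rightarrow Y)}
  = \frac{0.8947}{146.64} \approx 0.0061
\]
```

In other words, Harry Potter and the Chamber of Secrets appears in about 0.6% of all baskets, but in about 89% of the baskets that contain the other two books in the rule, which is roughly 147 times the chance rate.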

### 4.
Run Eclat on the same problem. Discuss your findings. What are main differences between Eclat and Apriori?

When running Eclat, the output is the set of frequent itemsets that meet the minimum support you set. Eclat by itself generates no rules. To generate rules, the function 'ruleInduction' must be applied to the Eclat output.


Eclat (support = 0.002)  
Output = All itemsets with support above 0.002 are listed.  
Eclat (support = 0.002, minlen = 2)  
Output = All itemsets with at least 2 items and a support above 0.002 are listed.  
Note: The R code used to run Eclat can be viewed in Appendix B.

One of the itemsets generated by Eclat contained 3 items: Harry Potter and the Chamber of Secrets, Harry Potter and the Prisoner of Azkaban, and Harry Potter and the Sorcerer's Stone. This itemset appeared only once in the Eclat output. In the rules generated by Apriori, by contrast, the same 3 items appeared 3 separate times with the same support, each time arranged as a different rule. In this instance, Eclat's itemset view helped reduce redundant information.

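If the 'ruleInduction' function were applied to the Eclat itemsets with the same confidence threshold, it would be expected to derive rules from the same frequent itemsets that Apriori uses, so the rearranged variants of each rule would likely reappear. The reduction in redundancy therefore applies to the itemset listing rather than to the induced rules.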

The main differences between Apriori and Eclat are as follows.

1. Apriori outputs rules based on your parameters, whereas Eclat outputs itemsets
2. Rules from Eclat require the function 'ruleInduction', while Apriori outputs rules by default
3. Apriori scans the dataset multiple times (once per itemset size), whereas Eclat builds its transaction ID lists in a single scan
4. Eclat is typically faster than Apriori
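
The difference in scanning can be illustrated with a small base R sketch. It is not how arules implements either algorithm; it only contrasts counting by scanning every transaction (the horizontal layout Apriori works on) with counting by intersecting per-item transaction ID lists (the vertical layout Eclat works on). The baskets and function names are hypothetical.

```r
# Hypothetical baskets keyed by transaction id.
baskets <- list(t1 = c("A", "B", "C"), t2 = c("A", "B"),
                t3 = c("B", "C"),      t4 = c("A", "B", "C"))

# Horizontal (Apriori-style) counting: scan every transaction for the itemset.
count_by_scan <- function(itemset) {
  sum(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}

# Vertical (Eclat-style) layout: one pass builds a tid-list per item.
long <- stack(baskets)  # columns: values (item), ind (transaction id)
tidlists <- split(as.character(long$ind), long$values)

# Support counts then come from intersecting tid-lists, with no further scans.
count_by_tidlist <- function(itemset) {
  length(Reduce(intersect, tidlists[itemset]))
}

count_by_scan(c("A", "C"))
count_by_tidlist(c("A", "C"))
```

Both functions return the same count for any itemset. The vertical version touches the raw transactions only once, when `tidlists` is built.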

### 5.
Outline your big data strategy for this data if you are asked to run association rule mining on an Amazon data set when the main memory of a computer cannot fit all records in this data set.

First, we must consider the limitation and its consequences. If the entire data set cannot fit in main memory, it has to be processed in pieces, which introduces complications. For instance, transactions that jointly support a pattern may be split across different chunks, so the pattern may fall below the thresholds within each chunk and potentially be lost when the results are recombined.

Below are the steps I would use in terms of my big data strategy.  

1. Separate the entire dataset into chunks. 
2. Run either Eclat or Apriori with less strict parameters than the ones you are ultimately looking for
3. Filter out any transactions that do not satisfy those parameters
4. Recombine results 
5. Repeat steps 2-4 until you can fit all the data into main memory
6. Run Eclat or Apriori with expected parameters 
7. Analyze results  

The main issues with this approach are time and efficiency, which depend on the size of the data. Some data will naturally be lost with this method, but the less strict parameters should preserve most of it.
Running the process several times with different chunkings could also help recover data lost in other runs.
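
A related chunking scheme can be sketched in base R. Note that this swaps in a two-pass partitioning approach (often called the SON algorithm) rather than reproducing the steps above exactly. Pass 1 finds itemsets that are frequent within at least one chunk at the same relative threshold, and pass 2 recounts only those candidates over all chunks. Any itemset that is frequent overall must be frequent in at least one chunk, so no frequent itemset is lost. The baskets, threshold, and function names are hypothetical, and the sketch is limited to single items and pairs for brevity.

```r
# Hypothetical baskets; in practice each chunk would be read from disk in turn.
baskets <- list(c("A", "B"), c("A", "B", "C"), c("A", "C"),
                c("B", "C"), c("A", "B"),      c("C"))
chunks  <- split(baskets, rep(1:2, each = 3))  # two chunks of three baskets
min_support <- 0.5                             # assumed relative threshold

# All 1- and 2-item sets in one basket, pairs encoded as "A|B" strings.
itemsets_in <- function(b) {
  b <- sort(unique(b))
  pairs <- if (length(b) >= 2) combn(b, 2, paste, collapse = "|") else character(0)
  c(b, pairs)
}

# Named vector: number of baskets in `bs` that contain each itemset.
count_itemsets <- function(bs) {
  tab <- table(unlist(lapply(bs, itemsets_in)))
  setNames(as.vector(tab), names(tab))
}

# Pass 1: collect itemsets that are frequent inside at least one chunk.
local_frequent <- function(chunk) {
  counts <- count_itemsets(chunk)
  names(counts)[counts >= min_support * length(chunk)]
}
candidates <- unique(unlist(lapply(chunks, local_frequent)))

# Pass 2: total the candidates' counts over all chunks. For simplicity the
# sketch recounts each chunk fully and keeps only the candidates' counts.
global_counts <- Reduce(`+`, lapply(chunks, function(chunk) {
  hits <- count_itemsets(chunk)[candidates]
  hits[is.na(hits)] <- 0
  setNames(as.numeric(hits), candidates)
}))
frequent <- names(global_counts)[global_counts >= min_support * length(baskets)]
frequent
```

In this toy run the pair A|C is a candidate because it is frequent in the first chunk, but it is dropped in pass 2 because it appears in only 2 of the 6 baskets. Only one chunk needs to be in memory at a time, at the cost of reading the data twice.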


### 6.
List the top five most read books in this data set. Describe how you come to this list.

![](files/top5readbooks.PNG)

The top 5 books are as follows.

1. "Wild Animus"
2. "The Lovely Bones: A Novel"
3. "She's Come Undone"
4. "The Da Vinci Code"
5. "Harry Potter and the Sorcerer's Stone"

This list was generated with the arules function 'itemFrequencyPlot', which plots item frequencies and, with topN = 5, shows only the five most frequent books. We originally tried the function 'itemFrequency', but it has no parameter for limiting the results to the top items.

The top 5 were generated with the R code below, which can also be viewed in Appendix C.

In [None]:
library(arules)
#Read the (userid, title) pairs as one transaction per user
bookbaskets<-read.transactions("/home/cloudera/bookdata.tsv.gz", format="single", sep="\t", cols=c("userid", "title"), rm.duplicates=TRUE)
#Plot the absolute frequency of the five most frequent titles
itemFrequencyPlot(bookbaskets, topN=5, type="absolute")
#For the plot to show up, the code must be run in the command terminal in cloudera.

## Conclusion

In this assignment we used two common association rule mining algorithms, Apriori and Eclat. Apriori generates rules, while Eclat generates itemsets. This assignment was our first use of the lift measure to help judge whether rules were real or occurred by chance. We also saw the issues that arise when applying Apriori and Eclat to datasets that cannot fit into main memory in their entirety. Lastly, the top five most read books were identified. All of this was accomplished in RHadoop in the cloudera environment using the 'arules' package.

## Appendix A

In [None]:
library(arules)
#Read the (userid, title) pairs as one transaction per user
bookbaskets<-read.transactions("/home/cloudera/bookdata.tsv.gz", format="single", sep="\t", cols=c("userid", "title"), rm.duplicates=TRUE)
#Mine rules with minimum support 0.002 and minimum confidence 0.75
apr<-apriori(bookbaskets, parameter = list(supp= 0.002, conf= 0.75, target = "rules"))
summary(apr)
#Sort the rules by support and by confidence
aprS<-sort(apr, by= "support", decreasing = TRUE)
aprC<-sort(apr, by = "confidence", decreasing = TRUE)
#Convert the sorted rules to data frames and write them to tab-separated files
aprSF<-as(aprS, "data.frame")
aprCF<-as(aprC, "data.frame")
write.table(aprSF, file = "aprS.txt", sep = "\t", row.names = FALSE)
write.table(aprCF, file = "aprC.txt", sep = "\t", row.names = FALSE)

## Appendix B

In [None]:
library(arules)
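#Read the (userid, title) pairs as one transaction per user
bookbaskets<-read.transactions("/home/cloudera/bookdata.tsv.gz", format="single", sep="\t", cols=c("userid", "title"), rm.duplicates=TRUE)
#Mine itemsets with at least 2 items; support 0.002 matches the value reported in section 4
eclItemSet<-eclat(bookbaskets, parameter = list(supp=0.002, minlen=2))
#Sort the itemsets by support and write them to a tab-separated file
eclSort<-sort(eclItemSet, by= "support", decreasing = TRUE)
eclSortF<-as(eclSort, "data.frame")
write.table(eclSortF, file = "eclOutput.txt", sep = "\t", row.names = FALSE)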

## Appendix C

In [None]:
library(arules)
#Read the (userid, title) pairs as one transaction per user
bookbaskets<-read.transactions("/home/cloudera/bookdata.tsv.gz", format="single", sep="\t", cols=c("userid", "title"), rm.duplicates=TRUE)
#Plot the absolute frequency of the five most frequent titles
itemFrequencyPlot(bookbaskets, topN=5, type="absolute")
#For the plot to show up, the code must be run in the command terminal in cloudera.