**Association Rule Mining**

The primary aim of association rule mining is to uncover hidden patterns within the data. Here we look for associations between Promo, Promo2, whether the day is a school holiday (SchoolHoliday), and the level of sales on that day.

In [352]:
library(arules)
library(arulesViz)
library(data.table)
library(zoo)
library(forecast)
library(ggplot2)
test <- fread("../input/test.csv")
train <- fread("../input/train.csv")
store <- fread("../input/store.csv")

In [353]:
str(train)

Classes ‘data.table’ and 'data.frame':	1017209 obs. of  9 variables:
 $ Store        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ DayOfWeek    : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Date         : chr  "2015-07-31" "2015-07-31" "2015-07-31" "2015-07-31" ...
 $ Sales        : int  5263 6064 8314 13995 4822 5651 15344 8492 8565 7185 ...
 $ Customers    : int  555 625 821 1498 559 589 1414 833 687 681 ...
 $ Open         : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Promo        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ StateHoliday : chr  "0" "0" "0" "0" ...
 $ SchoolHoliday: chr  "1" "1" "1" "1" ...
 - attr(*, ".internal.selfref")=<externalptr> 


In [354]:
str(store)

Classes ‘data.table’ and 'data.frame':	1115 obs. of  10 variables:
 $ Store                    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ StoreType                : chr  "c" "a" "a" "c" ...
 $ Assortment               : chr  "a" "a" "a" "c" ...
 $ CompetitionDistance      : int  1270 570 14130 620 29910 310 24000 7520 2030 3160 ...
 $ CompetitionOpenSinceMonth: int  9 11 12 9 4 12 4 10 8 9 ...
 $ CompetitionOpenSinceYear : int  2008 2007 2006 2009 2015 2013 2013 2014 2000 2009 ...
 $ Promo2                   : int  0 1 1 0 0 0 0 0 0 0 ...
 $ Promo2SinceWeek          : int  NA 13 14 NA NA NA NA NA NA NA ...
 $ Promo2SinceYear          : int  NA 2010 2011 NA NA NA NA NA NA NA ...
 $ PromoInterval            : chr  "" "Jan,Apr,Jul,Oct" "Jan,Apr,Jul,Oct" "" ...
 - attr(*, ".internal.selfref")=<externalptr> 


In [355]:
summary(train)

     Store          DayOfWeek         Date               Sales      
 Min.   :   1.0   Min.   :1.000   Length:1017209     Min.   :    0  
 1st Qu.: 280.0   1st Qu.:2.000   Class :character   1st Qu.: 3727  
 Median : 558.0   Median :4.000   Mode  :character   Median : 5744  
 Mean   : 558.4   Mean   :3.998                      Mean   : 5774  
 3rd Qu.: 838.0   3rd Qu.:6.000                      3rd Qu.: 7856  
 Max.   :1115.0   Max.   :7.000                      Max.   :41551  
   Customers           Open            Promo        StateHoliday      
 Min.   :   0.0   Min.   :0.0000   Min.   :0.0000   Length:1017209    
 1st Qu.: 405.0   1st Qu.:1.0000   1st Qu.:0.0000   Class :character  
 Median : 609.0   Median :1.0000   Median :0.0000   Mode  :character  
 Mean   : 633.1   Mean   :0.8301   Mean   :0.3815                     
 3rd Qu.: 837.0   3rd Qu.:1.0000   3rd Qu.:1.0000                     
 Max.   :7388.0   Max.   :1.0000   Max.   :1.0000                     
 SchoolHoliday     


Merge the training table with the store table on the Store column.

In [356]:
total <- merge(train, store, by = "Store")
head(total)

Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
1,5,2015-07-31,5263,555,1,1,0,1,c,a,1270,9,2008,0,,,
1,4,2015-07-30,5020,546,1,1,0,1,c,a,1270,9,2008,0,,,
1,3,2015-07-29,4782,523,1,1,0,1,c,a,1270,9,2008,0,,,
1,2,2015-07-28,5011,560,1,1,0,1,c,a,1270,9,2008,0,,,
1,1,2015-07-27,6102,612,1,1,0,1,c,a,1270,9,2008,0,,,
1,7,2015-07-26,0,0,0,0,0,0,c,a,1270,9,2008,0,,,
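
As a side note on the merge step, `merge()` performs an inner join by default, so any training row whose Store had no match in the store table would be dropped silently. In this notebook the row count is unchanged (1017209 before and after), so nothing was lost. The sketch below shows a simple row-count check on hypothetical toy tables (`train_toy` and `store_toy` are made up for illustration, and plain data frames are used instead of data.table to keep it dependency-free).

```r
# Hypothetical toy tables standing in for train and store.
train_toy <- data.frame(Store = c(1, 1, 2, 3),
                        Sales = c(5263, 5020, 6064, 8314))
store_toy <- data.frame(Store  = c(1, 2, 3),
                        Promo2 = c(0, 1, 1))

# Inner join on Store: every train row picks up its store attributes.
total_toy <- merge(train_toy, store_toy, by = "Store")

# If a Store were missing from store_toy, this check would fail.
stopifnot(nrow(total_toy) == nrow(train_toy))
print(total_toy)
```

If unmatched rows should be kept instead, `all.x = TRUE` turns the call into a left join.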


Select the attributes on which association rule mining will be carried out.

In [357]:
df <- subset(total, select = c(Promo, SchoolHoliday, Promo2, Sales))

In [358]:
str(df)

Classes ‘data.table’ and 'data.frame':	1017209 obs. of  4 variables:
 $ Promo        : int  1 1 1 1 1 0 0 0 0 0 ...
 $ SchoolHoliday: chr  "1" "1" "1" "1" ...
 $ Promo2       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Sales        : int  5263 5020 4782 5011 6102 0 4364 3706 3769 3464 ...
 - attr(*, ".internal.selfref")=<externalptr> 


In [359]:
head(df)

Promo,SchoolHoliday,Promo2,Sales
1,1,0,5263
1,1,0,5020
1,1,0,4782
1,1,0,5011
1,1,0,6102
0,0,0,0


Categorize the sales variable into 3 categories (low, Average, High) using fixed thresholds of 2000 and 7000. The thresholds are arbitrary and were chosen with reference to the mean sales value (about 5774).

In [360]:

df$Sale[df$Sales < 2000] <- "low"
df$Sale[df$Sales >= 2000 & df$Sales < 7000] <- "Average"
df$Sale[df$Sales >= 7000] <- "High"

In [361]:
head(df)

Promo,SchoolHoliday,Promo2,Sales,Sale
1,1,0,5263,Average
1,1,0,5020,Average
1,1,0,4782,Average
1,1,0,5011,Average
1,1,0,6102,Average
0,0,0,0,low
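
The three assignments above can also be expressed in a single call to `cut()`. This is only a sketch of an equivalent binning on hypothetical sales values (the vector `sales` is made up for illustration), not the code the notebook actually ran.

```r
# Hypothetical sales values; the breaks mirror the 2000 / 7000 thresholds.
sales <- c(0, 1999, 2000, 5263, 6999, 7000, 13995)

# right = FALSE closes each bin on the left:
# [-Inf, 2000), [2000, 7000), [7000, Inf)
sale_cat <- cut(sales,
                breaks = c(-Inf, 2000, 7000, Inf),
                labels = c("low", "Average", "High"),
                right  = FALSE)

print(table(sale_cat))
```

One difference worth noting: `cut()` returns a factor whose levels follow the order of `labels`, whereas `as.factor()` on a character column orders the levels alphabetically.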


In [362]:
df <- within(df, rm(Sales))
head(df)

Promo,SchoolHoliday,Promo2,Sale
1,1,0,Average
1,1,0,Average
1,1,0,Average
1,1,0,Average
1,1,0,Average
0,0,0,low


The data frame is converted into transaction (itemset) form. The Apriori algorithm is then applied to mine the rules, and we keep only those rules that have a sales category on the right-hand side (rhs).

In [363]:
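# apriori() works on transactions, so convert the data.table to a plain
# data.frame, drop incomplete rows, and make every column a factor.
df <- as.data.frame(df)
df <- df[complete.cases(df), ]
df$Promo <- as.factor(df$Promo)
df$SchoolHoliday <- as.factor(df$SchoolHoliday)
df$Promo2 <- as.factor(df$Promo2)
df$Sale <- as.factor(df$Sale)
str(df)
trans <- as(df, "transactions")
# Keep only rules with a Sale category on the right-hand side.
rules <- apriori(trans,
                 parameter = list(supp = 0.10, conf = 0.50, minlen = 2),
                 appearance = list(rhs = c("Sale=Average", "Sale=High", "Sale=low"),
                                   default = "lhs"))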

'data.frame':	1017209 obs. of  4 variables:
 $ Promo        : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 1 1 1 ...
 $ SchoolHoliday: Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 1 1 1 ...
 $ Promo2       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ Sale         : Factor w/ 3 levels "Average","High",..: 1 1 1 1 1 3 1 1 1 1 ...
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.5    0.1    1 none FALSE            TRUE       5     0.1      2
 maxlen target   ext
     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 101720 

set item appearances ...[3 item(s)] done [0.00s].
set transactions ...[9 item(s), 1017209 transaction(s)] done [0.13s].
sorting and recoding items ... [9 item(s)] done [0.02s].
creating transaction tree ... done [0.55s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [11 rule(s

The rules are listed in decreasing order of lift.

In [364]:

rules <- sort(rules, by = "lift", decreasing = TRUE)
inspect(rules)

     lhs                                   rhs            support   confidence
[1]  {Promo=1,Promo2=0}                 => {Sale=High}    0.1201081 0.6295604 
[2]  {Promo=1}                          => {Sale=High}    0.2232884 0.5852685 
[3]  {Promo=1,SchoolHoliday=0}          => {Sale=High}    0.1758744 0.5846858 
[4]  {Promo=1,Promo2=1}                 => {Sale=High}    0.1031804 0.5409657 
[5]  {Promo=0,Promo2=1}                 => {Sale=Average} 0.1809618 0.5840679 
[6]  {Promo=0,SchoolHoliday=0,Promo2=1} => {Sale=Average} 0.1482989 0.5674888 
[7]  {Promo=0}                          => {Sale=Average} 0.3389618 0.5480514 
[8]  {Promo=0,SchoolHoliday=0}          => {Sale=Average} 0.2779016 0.5338596 
[9]  {Promo2=1}                         => {Sale=Average} 0.2623325 0.5240741 
[10] {SchoolHoliday=0,Promo2=1}         => {Sale=Average} 0.2143453 0.5196718 
[11] {Promo=0,Promo2=0}                 => {Sale=Average} 0.1580000 0.5118977 
     lift     count 
[1]  1.860244 122175
[2]  1.729

The set of rules found depends on the minimum support and confidence thresholds. The number of rules can be changed by adjusting these two values by trial and error.
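
One way to make the trial and error less manual is to sweep a small grid of thresholds and count the rules each pair produces. The sketch below assumes the arules package is installed and uses hypothetical random data (`toy`) in place of the real transactions; the grid values are arbitrary.

```r
library(arules)

# Hypothetical toy data standing in for the Promo / Sale columns.
set.seed(1)
toy <- data.frame(
  Promo = factor(sample(c(0, 1), 500, replace = TRUE)),
  Sale  = factor(sample(c("Average", "High", "low"), 500, replace = TRUE))
)
toy_trans <- as(toy, "transactions")

# Every combination of candidate support and confidence thresholds.
grid <- expand.grid(supp = c(0.05, 0.10, 0.20), conf = c(0.3, 0.5))

# Mine rules for each pair and record how many survive.
grid$n_rules <- mapply(function(s, cf) {
  r <- apriori(toy_trans,
               parameter = list(supp = s, conf = cf, minlen = 2),
               control   = list(verbose = FALSE))
  length(r)
}, grid$supp, grid$conf)

print(grid)
```

Raising either threshold can only remove rules, so the counts in the grid never increase as `supp` or `conf` grows.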

**Support:** The support of an itemset is the proportion of transactions that contain it. For example, if the event is buying product A, Support(A) is the number of transactions that include A divided by the total number of transactions.

**Confidence:** The confidence of a rule B => A is the conditional probability that A occurs given that B has occurred: the number of transactions containing both A and B divided by the number of transactions containing B.
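
To make these definitions concrete, the sketch below computes support and confidence by hand for a rule {Promo=1} => {Sale=High} on ten hypothetical transactions (the vectors `promo` and `sale` are made up for illustration). It also computes lift, the measure used to sort the rules above, which is the confidence divided by the support of the right-hand side.

```r
# Ten hypothetical transactions.
promo <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
sale  <- c("High", "High", "High", "Average", "Average",
           "Average", "Average", "High", "low", "low")

n    <- length(promo)
lhs  <- promo == 1        # transactions containing Promo=1
rhs  <- sale == "High"    # transactions containing Sale=High
both <- lhs & rhs         # transactions containing both items

support    <- sum(both) / n            # P(lhs and rhs)
confidence <- sum(both) / sum(lhs)     # P(rhs | lhs)
lift       <- confidence / (sum(rhs) / n)

print(c(support = support, confidence = confidence, lift = lift))
```

The same arithmetic can be checked against rule [2] in the output above: dividing its support (0.2232884) by its confidence (0.5852685) gives about 0.3815, which matches the mean of Promo reported by `summary(train)`.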
