# Market Basket Analysis

Welcome!!

I hope to help you understand what Market Basket Analysis is all about, and to write some code that analyses the [Groceries dataset](https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset) from Kaggle.

credits: Natassha Selvaraj, 365 DataScience

## What is Market Basket Analysis (MBA)?
____

MBA is an analysis that estimates how likely it is that two products are purchased together.

In other words: what is the probability that two items will end up in the same basket (cart)?

### Applications of MBA
- In the retail industry, it is used to create promotions, product combinations and cross-selling strategies.
- In e-commerce, it is used to recommend complementary products.
- In hospitality, it is used to create meal packages and menu recommendations that improve the customer experience.
- In healthcare, it is used to monitor patients' treatment outcomes, complementary prescriptions and behavioral patterns.
- In the telecommunications, banking and finance sectors, it is used for targeted marketing campaigns and personalised bundle plans.


### Main Components

These are the main metrics for evaluating the likelihood of "collective purchase" (I made this term up, but it kind of sums up the aim of the analysis):

- **Support**
The fraction of all transactions that contain an item (or a set of items). It can also be expressed as a percentage.
support(A) = number of transactions containing A / total number of transactions (multiply by 100 for a percentage)

- **Confidence**
The likelihood that item B is purchased given that item A is purchased.
e.g. confidence(A → B) = P(A and B) / P(A) = support(A and B) / support(A)

- **Lift**
How much more (or less) often A and B are bought together than would be expected if they were purchased independently. A lift above 1 suggests a positive association, a lift of 1 suggests independence, and a lift below 1 suggests the items are bought together less often than expected.
lift(A → B) = confidence(A → B) / support(B)
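
To make the three metrics concrete, here is a minimal sketch in plain Python. It assumes the usual definitions (support as the fraction of transactions containing an itemset, confidence as support of both divided by support of the antecedent, lift as confidence divided by support of the consequent). The toy baskets and the helper names `support`, `confidence` and `lift` are made up for illustration and are not part of the Kaggle dataset or of any library.

```python
# Toy transactions (hypothetical, for illustration only)
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
    {"eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's own support."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


pair_support = support({"milk", "bread"}, transactions)
rule_confidence = confidence({"milk"}, {"bread"}, transactions)
rule_lift = lift({"milk"}, {"bread"}, transactions)
print(f"support={pair_support:.3f} confidence={rule_confidence:.3f} lift={rule_lift:.3f}")
```

In this toy data, milk and bread appear together in 2 of 5 baskets, and the lift comes out slightly above 1, which hints at a mild positive association.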

### MBA Algorithms

- Apriori Algorithm
- AIS
- SETM Algorithm
- FP Growth

We will use the Apriori algorithm for this tutorial.
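
For intuition, the core idea behind Apriori can be sketched in plain Python: find the frequent single items first, then build larger candidate itemsets only from smaller ones that were already frequent. This is a simplified illustration under that assumption, not how mlxtend implements it; the function name `frequent_itemsets` and the toy baskets are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {frozenset: support}."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: keep single items that meet the support threshold
    items = {item for basket in baskets for item in basket}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    result = dict(current)
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count support only for the surviving candidates
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        result.update(current)
        k += 1
    return result


# Toy baskets (made up for illustration)
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk"},
    {"bread"},
    {"eggs"},
]
found = frequent_itemsets(baskets, min_support=0.4)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The pruning step is what keeps the search manageable: an itemset can only be frequent if all of its subsets are frequent, so most large combinations never need to be counted.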

In [19]:
# Import dependencies
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

## Data Preparation

In [3]:
# read your data into df

df = pd.read_csv("Groceries_dataset.csv")

df.head()

Unnamed: 0,Member_number,Date,itemDescription
0,1808,21-07-2015,tropical fruit
1,2552,05-01-2015,whole milk
2,2300,19-09-2015,pip fruit
3,1187,12-12-2015,other vegetables
4,3037,01-02-2015,whole milk


Create a new column to identify each transaction. This helps us trace the items bought in the same transaction, i.e. items with the same `Member_number` and `Date`.

In [4]:
# Create a new column transaction which takes the input of Member_number and Date

df['Transactions'] = df['Member_number'].astype(str) + '_' + df['Date'].astype(str)

df.head()

Unnamed: 0,Member_number,Date,itemDescription,Transactions
0,1808,21-07-2015,tropical fruit,1808_21-07-2015
1,2552,05-01-2015,whole milk,2552_05-01-2015
2,2300,19-09-2015,pip fruit,2300_19-09-2015
3,1187,12-12-2015,other vegetables,1187_12-12-2015
4,3037,01-02-2015,whole milk,3037_01-02-2015


The newly added column "Transactions" uniquely identifies each transaction, i.e. all items purchased by the same member number on the same date (one receipt).

You can run `df.Transactions.value_counts()` to check how many items each transaction contains.
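
As a small illustration of that check, the sketch below builds the same kind of ID on a few made-up rows and counts the rows per transaction. The values in `toy` are hypothetical; only the column names follow the dataset.

```python
import pandas as pd

# Toy rows (hypothetical values) with the same column names as the dataset
toy = pd.DataFrame({
    "Member_number": [1000, 1000, 1000, 2000],
    "Date": ["15-03-2015", "15-03-2015", "24-06-2014", "15-03-2015"],
    "itemDescription": ["whole milk", "yogurt", "whole milk", "soda"],
})

# Same construction as above: member number and date joined by an underscore
toy["Transactions"] = toy["Member_number"].astype(str) + "_" + toy["Date"].astype(str)

# Number of item rows recorded for each transaction ID
items_per_transaction = toy["Transactions"].value_counts()
print(items_per_transaction)
```

Rows that share a member number and a date end up with the same ID, so the first two rows here count as one transaction with two items.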

### Pivot the Table

Pivoting the items into columns and the transactions into rows shows, for each transaction, which items were bought and how many times.

In [18]:
# pivot the table using pandas crosstab() function
df2 = pd.crosstab(df["Transactions"], df["itemDescription"])

df2.head()

Transactions,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
1000_15-03-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
1000_24-06-2014,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1000_24-07-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000_25-11-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000_27-05-2015,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
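
To see what `pd.crosstab()` does on a scale that fits on screen, here is a sketch on a tiny hypothetical frame. The transaction IDs `A`, `B`, `C` and the item rows are made up; note that an item listed twice in one transaction produces a count of 2.

```python
import pandas as pd

# Toy long-format data: one row per purchased item (hypothetical values)
toy = pd.DataFrame({
    "Transactions": ["A", "A", "A", "B", "C"],
    "itemDescription": ["whole milk", "whole milk", "yogurt", "soda", "yogurt"],
})

# Rows become transactions, columns become items, cells hold purchase counts
counts = pd.crosstab(toy["Transactions"], toy["itemDescription"])
print(counts)
```

Counts above 1, like whole milk in transaction `A`, are the reason the values still need to be reduced to a 0/1 encoding before the analysis.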


#### Encode the Values

This final preprocessing step recodes every value greater than 0 as 1, giving a 0/1 encoding.
(Note: the algorithm only considers whether an item was bought in a transaction, not how many times.)

In [22]:
# define a function 'encode' that takes 'item_freq' and returns 1 for any frequency above 0, otherwise 0

def encode(item_freq):
    res = 0
    if item_freq > 0:
        res = 1
    return res

# apply the encode function to every cell with DataFrame.map() (called applymap() in pandas versions before 2.1)

basket_input = df2.map(encode)

The new dataframe "basket_input" will be used for the analysis.
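
As an aside, the same encoding can be sketched without a helper function by comparing the whole frame against 0. This is an alternative to the cell above, not the method used in this notebook; the `counts` frame is a made-up stand-in for `df2`. It yields `True`/`False` instead of 1/0, which, as far as I know, recent mlxtend versions also accept and may even prefer.

```python
import pandas as pd

# Hypothetical purchase counts standing in for the pivoted table
counts = pd.DataFrame(
    {"soda": [0, 1, 0], "whole milk": [2, 0, 0], "yogurt": [1, 0, 1]},
    index=["A", "B", "C"],
)

# True where the item was bought at least once, False otherwise
encoded = counts > 0
print(encoded)

# Convert to 0/1 integers if that form is easier to read
print(encoded.astype(int))
```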

### Build the Apriori Algorithm

* Using the apriori function from the mlxtend library, we define frequent itemsets by passing the basket input dataframe to it.

* Next, the frequent itemset variable is passed to the association_rules function to give us a table we shall store as rules.

In [24]:
# define a variable freq_itemsets that holds the frequent itemsets found by apriori

freq_itemsets = apriori(basket_input, min_support=0.001, use_colnames=True)

# define a variable rules that holds the table returned by association_rules
# (with metric="lift", the default min_threshold of 0.8 keeps rules whose lift is at least 0.8)

rules = association_rules(freq_itemsets, metric="lift")

# view the head of the table
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(bottled water),(UHT-milk),0.060683,0.021386,0.001069,0.017621,0.823954,-0.000228,0.996168,-0.185312
1,(UHT-milk),(bottled water),0.021386,0.060683,0.001069,0.05,0.823954,-0.000228,0.988755,-0.179204
2,(other vegetables),(UHT-milk),0.122101,0.021386,0.002139,0.017515,0.818993,-0.000473,0.99606,-0.201119
3,(UHT-milk),(other vegetables),0.021386,0.122101,0.002139,0.1,0.818993,-0.000473,0.975443,-0.184234
4,(UHT-milk),(sausage),0.021386,0.060349,0.001136,0.053125,0.880298,-0.000154,0.992371,-0.121998


In [27]:
# sort the rules by support, then confidence, then lift, all in descending order
rules.sort_values(["support", "confidence", "lift"], axis=0, ascending=False).head(20)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
623,(rolls/buns),(whole milk),0.110005,0.157923,0.013968,0.126974,0.804028,-0.003404,0.96455,-0.214986
622,(whole milk),(rolls/buns),0.157923,0.110005,0.013968,0.088447,0.804028,-0.003404,0.97635,-0.224474
695,(yogurt),(whole milk),0.085879,0.157923,0.011161,0.129961,0.82294,-0.002401,0.967861,-0.190525
694,(whole milk),(yogurt),0.157923,0.085879,0.011161,0.070673,0.82294,-0.002401,0.983638,-0.203508
551,(soda),(other vegetables),0.097106,0.122101,0.009691,0.099794,0.817302,-0.002166,0.975219,-0.198448
550,(other vegetables),(soda),0.122101,0.097106,0.009691,0.079365,0.817302,-0.002166,0.980729,-0.202951
649,(sausage),(whole milk),0.060349,0.157923,0.008955,0.148394,0.939663,-0.000575,0.988811,-0.063965
648,(whole milk),(sausage),0.157923,0.060349,0.008955,0.056708,0.939663,-0.000575,0.99614,-0.070851
624,(yogurt),(rolls/buns),0.085879,0.110005,0.007819,0.091051,0.827697,-0.001628,0.979147,-0.185487
625,(rolls/buns),(yogurt),0.110005,0.085879,0.007819,0.071081,0.827697,-0.001628,0.984071,-0.189562
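
As a sanity check on how the columns relate, the confidence and lift of the (rolls/buns) → (whole milk) row can be recomputed by hand from its three support values. The sketch below assumes the standard definitions (confidence = pair support / antecedent support, lift = confidence / consequent support); the inputs are copied from the table and are rounded, so the results should only match the table approximately (0.126974 and 0.804028).

```python
# Values copied from the (rolls/buns) -> (whole milk) row above
antecedent_support = 0.110005  # share of transactions with rolls/buns
consequent_support = 0.157923  # share of transactions with whole milk
pair_support = 0.013968        # share of transactions with both

# confidence: how often whole milk appears when rolls/buns is present
confidence = pair_support / antecedent_support

# lift: confidence relative to how common whole milk is overall
lift = confidence / consequent_support

print(f"confidence={confidence:.4f} lift={lift:.4f}")
```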


### The Result

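From the table above, the pairs with the highest support are rolls/buns with whole milk, yogurt with whole milk, soda with other vegetables, and sausage with whole milk.

Note that every rule shown has a lift below 1 (roughly 0.80 to 0.94), so these items appear together slightly less often than would be expected if they were bought independently. Their high support mostly reflects how popular each item is on its own rather than a strong association between them.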

This notebook was prepared by ***Augustine Emmanuel***