# Market Basket Analysis - by Abhi Sharma

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import warnings
warnings.filterwarnings("ignore")

##### Part 1 : Data Preparation and Cleaning

In [None]:
df = pd.read_csv("/kaggle/input/market-basket-analysis/Assignment-1_Data.csv",delimiter=';')
df.head()

In [None]:
df.shape

In [None]:
df.info()

We see there are some null values in Itemname and CustomerID. Let's look at them.

In [None]:
df[df.Itemname.isna()]

In [None]:
# Remove rows with a null Itemname
df.dropna(subset=['Itemname'], inplace=True)

In [None]:
df.shape

In [None]:
df.info()

Here we see 388023 rows where CustomerID is null. Let's inspect them before moving forward.

In [None]:
df[df['CustomerID'].isna()]

The data above looks fine. There may be many reasons for CustomerID being null, but let's treat these records as valid for our analysis and move ahead.

In [None]:
# Check each country's share: number of unique bills per country, largest first
df.groupby(['Country'])['BillNo'].nunique().sort_values(ascending=False)

### Now, let's prepare the data for association rules

In [None]:
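# One row per bill, one column per item; cells hold the summed quantity (0 if absent)
dfprep = df.groupby(['BillNo','Itemname']).agg({'Quantity':'sum'}).reset_index().pivot(index='BillNo',columns='Itemname').fillna(0)
# Drop the outer 'Quantity' level so the columns are plain item names
dfprep.columns = dfprep.columns.droplevel(0)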


In [None]:
dfprep.head()

In [None]:
dfprep.reset_index(inplace=True)

In [None]:
# Drop the BillNo column so that only item columns remain
dfprep.drop(dfprep.columns[0],inplace=True,axis=1)

In [None]:
# Convert quantities to booleans: True if the item appears in the bill
dfprep = dfprep > 0

In [None]:
dfprep
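
To make the reshaping steps above concrete, here is a minimal sketch on a made-up dataset of three bills. The toy rows and the names `toy` and `toy_basket` are hypothetical, and the sketch uses `unstack` instead of the `pivot` call above purely for brevity. The resulting boolean table has the same shape as the one passed to `apriori`.

```python
import pandas as pd

# Hypothetical toy transactions using the same column names as the dataset.
toy = pd.DataFrame({
    "BillNo":   [1, 1, 2, 2, 3],
    "Itemname": ["TEA", "MUG", "TEA", "TEA", "MUG"],
    "Quantity": [2, 1, 1, 3, 4],
})

# Sum quantities per (bill, item), then spread the items into columns.
toy_basket = (
    toy.groupby(["BillNo", "Itemname"])["Quantity"]
       .sum()
       .unstack(fill_value=0)
)

# Any positive quantity means the item is present in that bill.
toy_basket = toy_basket > 0
print(toy_basket)
```

Each row is one bill and each column one item, so bill 2 (which only contains TEA) is True for TEA and False for MUG.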

In [None]:
frequent_itemsets = apriori(dfprep, min_support= 0.01, use_colnames=True, max_len = 2)

In [None]:
frequent_itemsets

In [None]:
rules = association_rules(frequent_itemsets, metric="lift",  min_threshold = 1.5)
rules.shape

In [None]:
rules

Here, we see 1550 association rules created for the minimum lift threshold of 1.5. These rules can be used to give recommendations to users based on the association between antecedent and consequent. The table above also contains several calculated metrics. Let's see which of these metrics are helpful for our recommendation engine.
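
As a rough illustration of how such rules could drive recommendations, the sketch below looks up rules whose antecedent contains a given item and ranks the consequents by lift. The `recommend` helper and the `toy_rules` table are hypothetical and not part of the notebook. The table only mimics the `antecedents`, `consequents`, `confidence` and `lift` columns that `association_rules` returns, with made-up values.

```python
import pandas as pd

# Hypothetical rules table; antecedents/consequents are frozensets, as in mlxtend.
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({"TEA"}), frozenset({"TEA"}), frozenset({"MUG"})],
    "consequents": [frozenset({"MUG"}), frozenset({"SPOON"}), frozenset({"TEA"})],
    "confidence":  [0.60, 0.40, 0.75],
    "lift":        [2.0, 3.5, 2.0],
})

def recommend(rules, item, top_n=5):
    """Return consequents of rules whose antecedent contains `item`, highest lift first."""
    mask = rules["antecedents"].apply(lambda s: item in s)
    ranked = rules[mask].sort_values("lift", ascending=False).head(top_n)
    # Join multi-item consequents into one label; here every consequent has one item.
    return [", ".join(sorted(c)) for c in ranked["consequents"]]

print(recommend(toy_rules, "TEA"))
```

Ranking by lift is one possible choice. Sorting by confidence, or filtering on both, would be equally reasonable depending on the use case.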

<b>Support</b> - Support is the probability that A and B appear together in a transaction. In mathematical form, it is N(A->B)/N(trans), where N(A->B) is the number of transactions containing both A and B<br><br>
<b>Confidence</b> - Confidence is the probability that B occurs in a transaction given that A is in it, i.e. N(A->B)/N(A). It is the conditional probability P(B|A)<br><br>

<b>Lift</b> - Lift tells us the strength of the relationship between A and B. Mathematically, it is Confidence(A->B)/Support(B). A value greater than 1 means A and B occur together more often than we would expect if they were independent; the larger the value, the stronger the association between A and B<br><br>

In some scenarios, leverage and conviction are used too. Whether to consider them depends on the individual case, but support, confidence and lift are what most people look at.
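
The metrics above can be checked by hand on a small example. The sketch below uses five made-up transactions and plain Python sets. The helper names `support` and `rule_metrics` are hypothetical. Leverage and conviction follow their usual definitions: support(A and B) - support(A)*support(B), and (1 - support(B))/(1 - confidence).

```python
# Hypothetical toy transactions; each set is one bill.
transactions = [
    {"TEA", "MUG"},
    {"TEA", "MUG"},
    {"TEA"},
    {"MUG"},
    {"SPOON"},
]

def support(items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def rule_metrics(a, b):
    """Compute the usual metrics for the rule A -> B."""
    s_a, s_b, s_ab = support(a), support(b), support(set(a) | set(b))
    confidence = s_ab / s_a                      # P(B | A)
    lift = confidence / s_b                      # > 1 means positive association
    leverage = s_ab - s_a * s_b                  # 0 means independence
    # Conviction is undefined (infinite) when the rule never fails.
    conviction = float("inf") if confidence == 1 else (1 - s_b) / (1 - confidence)
    return {
        "support": s_ab,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

metrics = rule_metrics({"TEA"}, {"MUG"})
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here TEA and MUG each appear in 3 of 5 bills and together in 2, so support is 0.4, confidence is 0.4/0.6 (about 0.667) and lift is about 1.11, a weak positive association.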