## Market Basket Analysis in Python

Source: http://pbpython.com/market-basket-analysis.html  
Using MLxtend and the [Online Retail data set](http://archive.ics.uci.edu/ml/datasets/Online+Retail) from the UCI Machine Learning Repository

**What is Market Basket Analysis?**  
* If you buy milk, you are likely to buy bread as well
* Association analysis finds such connections in transaction data

A rule has the form: Antecedent --> Consequent  

**Support**: The fraction of all transactions in which the items of a rule appear together.  
High support usually means the rule applies often enough to be useful.  
A rule with low support can still be worth a look, since it may point to a 'hidden' relationship.

**Confidence**: Reliability of the rule, i.e. how often the consequent appears in transactions that contain the antecedent.  
A value of 0.5 means the consequent shows up in 50% of the transactions that contain the antecedent.  

**Lift**: Ratio of the observed support to the support expected if the antecedent and the consequent were independent.  
The basic rule of thumb is that a lift value close to 1 means the two sides of the rule are close to independent.  
Lift values > 1 are generally more 'interesting' and could be indicative of a useful rule pattern.
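
As a rough illustration of the three metrics, the sketch below computes them by hand on a tiny made-up set of transactions. The data and the helper names (`support`, `confidence`, `lift`) are assumptions for this example only; they are not part of MLxtend.

```python
# Toy transactions (hypothetical data, not taken from the retail data set)
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread"},
    {"butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by the support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


# Rule: milk --> bread
print(support({"milk", "bread"}, transactions))       # 2 of 5 transactions
print(confidence({"milk"}, {"bread"}, transactions))  # 2 of the 3 milk transactions
print(lift({"milk"}, {"bread"}, transactions))        # slightly above 1
```

Here milk and bread each appear in 3 of 5 transactions and together in 2, so the lift is (2/5) / (3/5 * 3/5), roughly 1.11: a weak positive association.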

-----------------------------

In [1]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

In [2]:
# A data set of about a year's worth of transactions for a UK-based online retailer
# Source: http://archive.ics.uci.edu/ml/datasets/Online+Retail
df = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')

In [3]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [4]:
df.shape

(541909, 8)

Data Cleanup:
* Strip surrounding whitespace from the descriptions
* Remove rows without an invoice number
* Remove rows that contain credit transactions (InvoiceNo contains 'C')

In [5]:
df['Description'] = df['Description'].str.strip()
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
df = df[~df['InvoiceNo'].str.contains('C')]

In [6]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [7]:
df.shape

(532621, 8)

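MLxtend's apriori requires one-hot encoded data: one row per transaction and one column per item.  
We therefore pivot the data so that each 'InvoiceNo' becomes a row and each 'Description' becomes a column.  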
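
To make the target layout concrete, here is a minimal sketch of the same reshaping on a made-up frame. The invoice numbers, item names and quantities are hypothetical; only the column names follow the retail data.

```python
import pandas as pd

# Hypothetical toy invoices in the same long format as the retail data
toy = pd.DataFrame({
    "InvoiceNo":   ["A1", "A1", "A2", "A2", "A2", "A3"],
    "Description": ["CUPS", "PLATES", "CUPS", "NAPKINS", "CUPS", "PLATES"],
    "Quantity":    [6, 6, 12, 1, 3, 2],
})

# One row per invoice, one column per item, summed quantities as values
toy_basket = (toy.groupby(["InvoiceNo", "Description"])["Quantity"]
                 .sum()
                 .unstack(fill_value=0))

# One-hot view: True if the item appears on the invoice at all
toy_onehot = toy_basket > 0

print(toy_basket)
print(toy_onehot)
```

Repeated lines for the same item on one invoice (CUPS on A2) are summed before the pivot, so each invoice/item pair ends up in exactly one cell.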

In [8]:
# Restricting the analysis to invoices from France for now
basket = (df[df['Country'] == 'France']
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

In [9]:
basket.head()

Description,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE WOODLAND,...,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
536370,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536852,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536974,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537463,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
basket.shape

(392, 1563)

In [11]:
basket.describe()

Description,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE WOODLAND,...,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
count,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0,...,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0
mean,0.857143,0.306122,0.005102,0.061224,0.502551,0.380102,0.395408,0.061224,0.278061,0.183673,...,0.191327,0.022959,0.010204,0.002551,0.081633,0.27551,0.012755,0.183673,0.030612,0.061224
std,5.077406,2.458487,0.101015,0.856046,5.410513,2.959386,3.432846,1.212183,1.994039,1.475092,...,2.181444,0.338469,0.202031,0.050508,1.616244,2.334137,0.252538,1.910247,0.606092,0.856046
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,48.0,20.0,2.0,12.0,96.0,24.0,48.0,24.0,24.0,12.0,...,25.0,6.0,4.0,1.0,32.0,36.0,5.0,24.0,12.0,12.0


In [12]:
# Converting quantities to 0 or 1 for one-hot encoding (1 if the item was bought at all).
# The vectorized comparison gives the same result as an element-wise applymap.
basket_sets = (basket > 0).astype(int)

In [13]:
# Dropping the 'POSTAGE' column, as it represents a charge and not a product
basket_sets.drop('POSTAGE', inplace=True, axis=1)

In [14]:
basket_sets.describe()

Description,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE WOODLAND,...,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
count,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0,...,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0
mean,0.030612,0.015306,0.002551,0.005102,0.015306,0.017857,0.017857,0.002551,0.022959,0.015306,...,0.007653,0.005102,0.002551,0.002551,0.002551,0.017857,0.002551,0.010204,0.002551,0.005102
std,0.172485,0.122924,0.050508,0.071337,0.122924,0.132601,0.132601,0.050508,0.149965,0.122924,...,0.087258,0.071337,0.050508,0.050508,0.050508,0.132601,0.050508,0.100627,0.050508,0.071337
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [15]:
# Applying the apriori algorithm to find frequent itemsets with a minimum support of 7%
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)

In [16]:
frequent_itemsets.shape

(51, 2)
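
As a rough check on what a 7% minimum support means here, the sketch below converts the threshold into a number of invoices. It assumes that apriori keeps itemsets whose support is greater than or equal to `min_support`; the row count comes from `basket.shape` above.

```python
import math

n_invoices = 392     # rows of `basket`, taken from basket.shape above
min_support = 0.07   # threshold passed to apriori

# Assuming itemsets are kept when support >= min_support,
# an itemset must appear in at least this many invoices:
min_invoices = math.ceil(min_support * n_invoices)
print(min_invoices)
```

Under that assumption, each of the 51 frequent itemsets appears in at least 28 of the 392 French invoices.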

In [17]:
# Generating association rules from the frequent itemsets, using a lift threshold of 1
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()

Unnamed: 0,antecedants,consequents,support,confidence,lift
0,(SET/6 RED SPOTTY PAPER CUPS),(SET/20 RED RETROSPOT PAPER NAPKINS),0.137755,0.740741,5.584046
1,(SET/20 RED RETROSPOT PAPER NAPKINS),(SET/6 RED SPOTTY PAPER CUPS),0.132653,0.769231,5.584046
2,(ALARM CLOCK BAKELIKE PINK),(ALARM CLOCK BAKELIKE GREEN),0.102041,0.725,7.478947
3,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE PINK),0.096939,0.763158,7.478947
4,(ALARM CLOCK BAKELIKE PINK),(ALARM CLOCK BAKELIKE RED),0.102041,0.725,7.681081


In [18]:
# Rules with both high lift (>= 6) and high confidence (>= 0.8)
rules[(rules['lift'] >= 6) & (rules['confidence'] >= 0.8)]

Unnamed: 0,antecedants,consequents,support,confidence,lift
10,"(SET/6 RED SPOTTY PAPER PLATES, SET/6 RED SPOT...",(SET/20 RED RETROSPOT PAPER NAPKINS),0.122449,0.8125,6.125
11,"(SET/6 RED SPOTTY PAPER PLATES, SET/20 RED RET...",(SET/6 RED SPOTTY PAPER CUPS),0.102041,0.975,7.077778
12,"(SET/6 RED SPOTTY PAPER CUPS, SET/20 RED RETRO...",(SET/6 RED SPOTTY PAPER PLATES),0.102041,0.975,7.644
18,(SET/6 RED SPOTTY PAPER PLATES),(SET/6 RED SPOTTY PAPER CUPS),0.127551,0.96,6.968889
19,(SET/6 RED SPOTTY PAPER CUPS),(SET/6 RED SPOTTY PAPER PLATES),0.137755,0.888889,6.968889
22,(SET/6 RED SPOTTY PAPER PLATES),(SET/20 RED RETROSPOT PAPER NAPKINS),0.127551,0.8,6.030769
24,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE GREEN),0.094388,0.837838,8.642959
25,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE RED),0.096939,0.815789,8.642959


-----------------------------

Let's see if we can use the same logic on a different data set.

### Which actors are most likely to work together
Exploring [this](https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset) IMDb data set from Kaggle, we try to find out which actors are most likely to work together  
Data Source: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset

In [19]:
# Download the data from the site above, rename it 'movie_metadata.csv', and store it in the same directory as this notebook
df = pd.read_csv('movie_metadata.csv')

In [20]:
df.shape

(5043, 28)

In [21]:
df.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [22]:
# Extracting required columns
dir_act = df[['director_name', 'actor_1_name', 'actor_2_name', 'actor_3_name']]
dir_act = dir_act.dropna()

In [23]:
# Merging all actor columns together under single 'actor_name' column
dir_act = pd.melt(dir_act, id_vars=['director_name'], value_vars=['actor_1_name', 'actor_2_name', 'actor_3_name']).drop(['variable'], axis=1)
dir_act.rename(columns={'value': 'actor_name'}, inplace=True)

In [24]:
dir_act.head()

Unnamed: 0,director_name,actor_name
0,James Cameron,CCH Pounder
1,Gore Verbinski,Johnny Depp
2,Sam Mendes,Christoph Waltz
3,Christopher Nolan,Tom Hardy
4,Andrew Stanton,Daryl Sabara


In [25]:
# Appending a column of ones so that actor occurrences can be counted
dir_act_onehot = dir_act.assign(n_times=1)

# One-hot encoding the actors against directors (each director is one 'basket')
dir_act_onehot = (dir_act_onehot
                  .groupby(['director_name', 'actor_name'])['n_times']
                  .sum().unstack().reset_index().fillna(0)
                  .set_index('director_name'))

In [26]:
dir_act_onehot.head()

actor_name,50 Cent,A. Michael Baldwin,A.J. Buckley,A.J. DeLucia,A.J. Langer,AJ Michalka,Aaliyah,Aaron Ashmore,Aaron Hughes,Aaron Kwok,...,Zooey Deschanel,Zoë Bell,Zoë Kravitz,Zoë Poledouris,Zubaida Sahar,Zuhair Haddad,Álex Angulo,Ángela Molina,Émilie Dequenne,Óscar Jaenada
A. Raven Cruz,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Aaron Hann,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Aaron Schneider,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Aaron Seltzer,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Abel Ferrara,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
# Converting float counts to 0 or 1 for one-hot encoding (1 if the actor appears at all)
basket_dir_act = (dir_act_onehot > 0).astype(int)

In [28]:
# Applying MLxtend's apriori to group actors
actors_together = apriori(basket_dir_act, min_support=0.001, use_colnames=True)

In [29]:
actors_together.shape

(1558, 2)

In [30]:
# Finding association rules using lift metric
paired_most = association_rules(actors_together, metric="lift", min_threshold=1)

In [31]:
paired_most.head(10)

Unnamed: 0,antecedants,consequents,support,confidence,lift
0,(Paul Hogan),(Linda Kozlowski),0.001677,0.75,596.25
1,(Linda Kozlowski),(Paul Hogan),0.001258,1.0,596.25
2,(Charlize Theron),(Liam Neeson),0.009644,0.130435,11.110248
3,(Liam Neeson),(Charlize Theron),0.01174,0.107143,11.110248
4,(Leonardo DiCaprio),(Michael Fassbender),0.006709,0.1875,31.941964
5,(Michael Fassbender),(Leonardo DiCaprio),0.00587,0.214286,31.941964
6,(Dwayne Johnson),(Charlize Theron),0.007966,0.157895,16.372998
7,(Charlize Theron),(Dwayne Johnson),0.009644,0.130435,16.372998
8,(Kate Winslet),(Tom Wilkinson),0.009224,0.136364,14.783058
9,(Tom Wilkinson),(Kate Winslet),0.009224,0.136364,14.783058


We see that the algorithm works fairly well.  
Jason Statham and Jet Li, who have worked together 6 times so far (as of 2017), are paired together with a lift value of 43.  
Higher lift values indicate a more 'interesting' pairing.
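
To look up one specific pair in a rules table, a small filter along these lines could be used. This is a sketch on a made-up table: the actor names, the numbers and the helper `rules_between` are hypothetical. The notebook output above spells the column `antecedants`, while more recent MLxtend releases appear to use `antecedents`, so the column name is left as a parameter.

```python
import pandas as pd

# Hypothetical miniature rules table; names and numbers are made up
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({"Actor A"}), frozenset({"Actor B"}), frozenset({"Actor A"})],
    "consequents": [frozenset({"Actor B"}), frozenset({"Actor A"}), frozenset({"Actor C"})],
    "lift": [12.0, 12.0, 3.5],
})


def rules_between(rules, left, right, antecedent_col="antecedents"):
    """Rows whose antecedent is exactly `left` and whose consequent is exactly `right`."""
    left, right = frozenset(left), frozenset(right)
    mask = (rules[antecedent_col].apply(lambda s: s == left)
            & rules["consequents"].apply(lambda s: s == right))
    return rules[mask]


print(rules_between(toy_rules, {"Actor A"}, {"Actor B"}))
```

With the table from this notebook, the call would look like `rules_between(paired_most, {'Jason Statham'}, {'Jet Li'}, antecedent_col='antecedants')`.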

In [32]:
paired_most.sort_values(by=['support', 'lift'], ascending=False).head(10)

Unnamed: 0,antecedants,consequents,support,confidence,lift
73,(Robert De Niro),"(Nicolas Cage, Tom Cruise)",0.01761,0.071429,34.071429
257,(Robert De Niro),"(Brad Pitt, Tom Cruise)",0.01761,0.071429,34.071429
451,(Robert De Niro),"(Nicolas Cage, Anthony Hopkins)",0.01761,0.071429,34.071429
556,(Robert De Niro),(Bradley Whitford),0.01761,0.071429,34.071429
123,(Robert De Niro),(Ellen Barkin),0.01761,0.071429,28.392857
126,(Robert De Niro),(Marlon Brando),0.01761,0.071429,24.336735
486,(Robert De Niro),(Vivica A. Fox),0.01761,0.071429,17.035714
38,(Robert De Niro),(Tom Cruise),0.01761,0.166667,16.5625
95,(Robert De Niro),(Leonardo DiCaprio),0.01761,0.071429,10.647321
116,(Robert De Niro),(Brad Pitt),0.01761,0.119048,10.515873


Robert De Niro leads the chart sorted by support and lift values, as he has been in the industry for well over 50 years now.  
He has acted in many movies, and with many different actors.  
The sheer number of pairings that include him puts him at the top of our list of actors you are most likely to see working with others.