
# Association rules
Exercise proposed by Prof. Hugo de Paula (PUC Minas)

To compute frequent itemsets with the Apriori algorithm, use the `mlxtend` package.



In [None]:
#!pip install mlxtend -q
!pip install mlxtend==0.17.3



## Association rules generated from frequent itemsets
Source: https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

In the following example, a transactional dataset is built as a "list of lists", where each row corresponds to one shopping basket from a hypothetical supermarket.

In this dataset, an itemset is considered frequent when its support is at least 0.6.
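
As a minimal sketch of what support means here, the fraction of baskets that contain an itemset can be computed by hand. The helper name `support` is an illustrative assumption and not part of `mlxtend`; the baskets are the ones used in this example.

```python
# Baskets from this example (item names kept exactly as in the dataset).
dataset = [['Leite', 'Cebola', 'Batata', 'Feijão', 'Ovos', 'Iogurte'],
           ['Arroz', 'Cebola', 'Batata', 'Feijão', 'Ovos', 'Iogurte'],
           ['Leite', 'Maçã', 'Feijão', 'Ovos'],
           ['Leite', 'Milho', 'Feijão', 'Iogurte'],
           ['Milho', 'Cebola', 'Feijão', 'Sorvete', 'Ovos']]


def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset` (hypothetical helper)."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


s_onion_eggs = support({'Cebola', 'Ovos'}, dataset)  # found in 3 of the 5 baskets
s_milk_corn = support({'Leite', 'Milho'}, dataset)   # found in 1 of the 5 baskets

# Only itemsets whose support reaches the threshold count as frequent.
print(s_onion_eggs >= 0.6, s_milk_corn >= 0.6)
```

With the 0.6 threshold, `{Cebola, Ovos}` qualifies as frequent while `{Leite, Milho}` does not.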

In [None]:
!pip freeze | grep mlxtend

mlxtend==0.17.3


In [None]:
# Import the libraries

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori  # , fpgrowth
from mlxtend.frequent_patterns import association_rules

# Transactional dataset
dataset = [['Leite', 'Cebola', 'Batata', 'Feijão', 'Ovos', 'Iogurte'],
           ['Arroz', 'Cebola', 'Batata', 'Feijão', 'Ovos', 'Iogurte'],
           ['Leite', 'Maçã', 'Feijão', 'Ovos'],
           ['Leite', 'Milho', 'Feijão', 'Iogurte'],
           ['Milho', 'Cebola', 'Feijão', 'Sorvete', 'Ovos']]


In [None]:
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df

Unnamed: 0,Arroz,Batata,Cebola,Feijão,Iogurte,Leite,Maçã,Milho,Ovos,Sorvete
0,False,True,True,True,True,True,False,False,True,False
1,True,True,True,True,True,False,False,False,True,False
2,False,False,False,True,False,True,True,False,True,False
3,False,False,False,True,True,True,False,True,False,False
4,False,False,True,True,False,False,False,True,True,True


In [None]:
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)
print(frequent_itemsets)

    support                itemsets
0       0.6                (Cebola)
1       1.0                (Feijão)
2       0.6               (Iogurte)
3       0.6                 (Leite)
4       0.8                  (Ovos)
5       0.6        (Feijão, Cebola)
6       0.6          (Ovos, Cebola)
7       0.6       (Feijão, Iogurte)
8       0.6         (Feijão, Leite)
9       0.8          (Feijão, Ovos)
10      0.6  (Feijão, Ovos, Cebola)
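
The table above can be reproduced with a level-wise search in plain Python. This is only a sketch of the Apriori idea (join surviving itemsets, prune candidates that have an infrequent subset, then count); it is not how `mlxtend` implements `apriori`, and the function name `apriori_sketch` is an assumption made for illustration.

```python
from itertools import combinations

dataset = [['Leite', 'Cebola', 'Batata', 'Feijão', 'Ovos', 'Iogurte'],
           ['Arroz', 'Cebola', 'Batata', 'Feijão', 'Ovos', 'Iogurte'],
           ['Leite', 'Maçã', 'Feijão', 'Ovos'],
           ['Leite', 'Milho', 'Feijão', 'Iogurte'],
           ['Milho', 'Cebola', 'Feijão', 'Sorvete', 'Ovos']]


def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1 candidates: every single item that appears in the data.
    level = [frozenset([i]) for i in {i for b in baskets for i in b}]
    frequent = {}
    k = 1
    while level:
        # Count step: keep the candidates that reach the threshold.
        kept = {c: s for c in level if (s := support(c)) >= min_support}
        frequent.update(kept)
        k += 1
        # Join step: unions of two surviving itemsets that have exactly k items.
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in kept for s in combinations(c, k - 1))]
    return frequent


found = apriori_sketch(dataset, min_support=0.6)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(s, sorted(itemset))
```

The prune step is what keeps the search small: `{Feijão, Ovos, Iogurte}` is never counted because `{Ovos, Iogurte}` already failed the threshold.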


## Confidence
Generate association rules with a minimum confidence of 0.7.

In [None]:
conf = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.7)
display(conf)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Cebola),(Feijão),0.6,1.0,0.6,1.0,1.0,0.0,inf
1,(Ovos),(Cebola),0.8,0.6,0.6,0.75,1.25,0.12,1.6
2,(Cebola),(Ovos),0.6,0.8,0.6,1.0,1.25,0.12,inf
3,(Iogurte),(Feijão),0.6,1.0,0.6,1.0,1.0,0.0,inf
4,(Leite),(Feijão),0.6,1.0,0.6,1.0,1.0,0.0,inf
5,(Feijão),(Ovos),1.0,0.8,0.8,0.8,1.0,0.0,1.0
6,(Ovos),(Feijão),0.8,1.0,0.8,1.0,1.0,0.0,inf
7,"(Feijão, Ovos)",(Cebola),0.8,0.6,0.6,0.75,1.25,0.12,1.6
8,"(Feijão, Cebola)",(Ovos),0.6,0.8,0.6,1.0,1.25,0.12,inf
9,"(Ovos, Cebola)",(Feijão),0.6,1.0,0.6,1.0,1.0,0.0,inf


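## Lift
Generate association rules with a minimum lift of 1.2.

A lift of 1 means the antecedent and the consequent occur together exactly as often as expected if they were independent, so the rule adds no predictive power. A lift below 1 means they occur together less often than expected. Lift measures association, not causality.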
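
To make the metric columns concrete, the sketch below recomputes confidence, lift, leverage and conviction for a single rule from the toy baskets, using the usual definitions of these measures. The helper names `support` and `rule_metrics` are assumptions made for illustration, and exact fractions are used only to avoid floating-point noise.

```python
from fractions import Fraction

dataset = [['Leite', 'Cebola', 'Batata', 'Feijão', 'Ovos', 'Iogurte'],
           ['Arroz', 'Cebola', 'Batata', 'Feijão', 'Ovos', 'Iogurte'],
           ['Leite', 'Maçã', 'Feijão', 'Ovos'],
           ['Leite', 'Milho', 'Feijão', 'Iogurte'],
           ['Milho', 'Cebola', 'Feijão', 'Sorvete', 'Ovos']]
baskets = [frozenset(t) for t in dataset]


def support(items):
    """Exact fraction of baskets containing all `items` (hypothetical helper)."""
    items = frozenset(items)
    return Fraction(sum(items <= b for b in baskets), len(baskets))


def rule_metrics(antecedent, consequent):
    """Metrics for the rule antecedent -> consequent (antecedent must occur in the data)."""
    s_a = support(antecedent)
    s_c = support(consequent)
    s_ac = support(set(antecedent) | set(consequent))
    confidence = s_ac / s_a              # P(consequent | antecedent)
    lift = confidence / s_c              # 1 means independence
    leverage = s_ac - s_a * s_c          # 0 means independence
    # Conviction is infinite when the rule never fails (confidence == 1).
    conviction = float('inf') if confidence == 1 else (1 - s_c) / (1 - confidence)
    return {'confidence': confidence, 'lift': lift,
            'leverage': leverage, 'conviction': conviction}


m = rule_metrics({'Ovos'}, {'Cebola'})
print({name: float(value) for name, value in m.items()})
```

For `(Ovos) -> (Cebola)` the values agree with the corresponding row of the rule tables in this section.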

In [None]:
lift = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
display(lift)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Ovos),(Cebola),0.8,0.6,0.6,0.75,1.25,0.12,1.6
1,(Cebola),(Ovos),0.6,0.8,0.6,1.0,1.25,0.12,inf
2,"(Feijão, Ovos)",(Cebola),0.8,0.6,0.6,0.75,1.25,0.12,1.6
3,"(Feijão, Cebola)",(Ovos),0.6,0.8,0.6,1.0,1.25,0.12,inf
4,(Ovos),"(Feijão, Cebola)",0.8,0.6,0.6,0.75,1.25,0.12,1.6
5,(Cebola),"(Feijão, Ovos)",0.6,0.8,0.6,1.0,1.25,0.12,inf


In [None]:
# create a column with the number of items in each antecedent
lift['antecedent_len'] = lift['antecedents'].apply(len)
display(lift)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len
0,(Ovos),(Cebola),0.8,0.6,0.6,0.75,1.25,0.12,1.6,1
1,(Cebola),(Ovos),0.6,0.8,0.6,1.0,1.25,0.12,inf,1
2,"(Feijão, Ovos)",(Cebola),0.8,0.6,0.6,0.75,1.25,0.12,1.6,2
3,"(Feijão, Cebola)",(Ovos),0.6,0.8,0.6,1.0,1.25,0.12,inf,2
4,(Ovos),"(Feijão, Cebola)",0.8,0.6,0.6,0.75,1.25,0.12,1.6,1
5,(Cebola),"(Feijão, Ovos)",0.6,0.8,0.6,1.0,1.25,0.12,inf,1


Select only the rules with at least 2 antecedents, confidence greater than 0.75, and lift greater than 1.2.

In [None]:
rules = lift.copy()
rules[(rules["antecedent_len"] >= 2) &
      (rules["confidence"] > 0.75) &
      (rules["lift"] > 1.2)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len
3,"(Feijão, Cebola)",(Ovos),0.6,0.8,0.6,1.0,1.25,0.12,inf,2


In [None]:
# show only the rules whose antecedents are Feijão and Ovos
rules[rules["antecedents"] == {"Ovos", "Feijão"}]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len
2,"(Feijão, Ovos)",(Cebola),0.8,0.6,0.6,0.75,1.25,0.12,1.6,2


# Market basket analysis in Python
Source: Chris Moffitt (2017), Introduction to Market Basket Analysis in Python, http://pbpython.com/market-basket-analysis.html

This example uses the Online Retail dataset from UCI, available at [archive.ics.uci.edu/ml/machine-learning-databases/00352/Online Retail.xlsx](http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx)


In [None]:
# import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from mlxtend.frequent_patterns import apriori, fpgrowth
from mlxtend.frequent_patterns import association_rules

In [None]:
# load the dataset and show the first 5 rows
df = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')

df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   InvoiceNo    541909 non-null  object        
 1   StockCode    541909 non-null  object        
 2   Description  540455 non-null  object        
 3   Quantity     541909 non-null  int64         
 4   InvoiceDate  541909 non-null  datetime64[ns]
 5   UnitPrice    541909 non-null  float64       
 6   CustomerID   406829 non-null  float64       
 7   Country      541909 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


In [None]:
# show the dimensions of the dataframe
display(df.shape)
# statistical summary of the numeric variables
df.describe()


(541909, 8)

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


## Data preparation
`strip()` removes whitespace characters at the beginning and end of a string.

`df[~df['InvoiceNo'].str.contains("C")]` selects every row of `df` whose `InvoiceNo` does not contain the letter 'C', because an invoice number containing 'C' marks a cancelled invoice.
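
A small sketch of these two operations on a toy frame, assuming only `pandas`. The three rows and the name `toy` are made up for illustration; the real dataset is loaded in the cells of this notebook.

```python
import pandas as pd

# Hypothetical mini version of the invoice table: one cancelled invoice,
# and descriptions with stray leading/trailing spaces.
toy = pd.DataFrame({
    "InvoiceNo": [536365, "C536379", 536370],
    "Description": ["  WHITE METAL LANTERN ", "Discount", "POSTAGE  "],
})

# Invoice numbers mix integers and strings, so cast before using .str methods.
toy["InvoiceNo"] = toy["InvoiceNo"].astype(str)
toy["Description"] = toy["Description"].str.strip()

# ~ negates the boolean mask: keep rows whose invoice number has no 'C'.
kept = toy[~toy["InvoiceNo"].str.contains("C")]
print(kept)
```

The cast to string matters: without it, `.str.contains` would return missing values for the integer invoice numbers instead of `False`.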

In [None]:
!pip freeze | grep pandas

pandas==1.0.5
pandas-datareader==0.8.1
pandas-gbq==0.11.0
pandas-profiling==1.4.1
sklearn-pandas==1.8.0


In [None]:
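# Cast InvoiceNo to string so that .str methods can be used on it
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
# strip(): remove leading and trailing whitespace from the descriptions
df.Description = df.Description.str.strip()
# Drop cancelled invoices (their number contains the letter 'C')
df = df[~df['InvoiceNo'].str.contains("C")]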
df.shape

(532621, 8)

Check which countries have the most records.
For this exercise, the suggested country to analyse is France.

In [None]:
df.Country.value_counts().head()

United Kingdom    487622
Germany             9042
France              8408
EIRE                7894
Spain               2485
Name: Country, dtype: int64

In [None]:
basket = (df[df.Country == 'France']
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum()
          .unstack()            
          .reset_index()            
          .fillna(0)            
          .set_index('InvoiceNo'))
basket.head()

InvoiceNo,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE WOODLAND,15CM CHRISTMAS GLASS BALL 20 LIGHTS,16 PIECE CUTLERY SET PANTRY DESIGN,18PC WOODEN CUTLERY SET DISPOSABLE,20 DOLLY PEGS RETROSPOT,200 RED + WHITE BENDY STRAWS,3 HOOK HANGER MAGIC GARDEN,3 PIECE SPACEBOY COOKIE CUTTER SET,3 RAFFIA RIBBONS 50'S CHRISTMAS,3 STRIPEY MICE FELTCRAFT,3 TIER CAKE TIN RED AND CREAM,3 TRADITIONAl BISCUIT CUTTERS SET,36 DOILIES DOLLY GIRL,36 DOILIES VINTAGE CHRISTMAS,36 FOIL HEART CAKE CASES,36 FOIL STAR CAKE CASES,36 PENCILS TUBE RED RETROSPOT,36 PENCILS TUBE SKULLS,36 PENCILS TUBE WOODLAND,3D DOG PICTURE PLAYING CARDS,3D HEARTS HONEYCOMB PAPER GARLAND,3D SHEET OF DOG STICKERS,3D TRADITIONAL CHRISTMAS STICKERS,3D VINTAGE CHRISTMAS STICKERS,4 IVORY DINNER CANDLES SILVER FLOCK,4 PINK DINNER CANDLE SILVER FLOCK,4 TRADITIONAL SPINNING TOPS,5 HOOK HANGER MAGIC TOADSTOOL,5 HOOK HANGER RED MAGIC TOADSTOOL,50'S CHRISTMAS GIFT BAG LARGE,6 GIFT TAGS 50'S CHRISTMAS,...,WOODLAND DESIGN COTTON TOTE BAG,WOODLAND LARGE BLUE FELT HEART,WOODLAND LARGE PINK FELT HEART,WOODLAND LARGE RED FELT HEART,WOODLAND MINI BACKPACK,WOODLAND PARTY BAG + STICKER SET,WOODLAND SMALL BLUE FELT HEART,WOODLAND SMALL PINK FELT HEART,WOODLAND SMALL RED FELT HEART,WOODLAND STORAGE BOX LARGE,WOODLAND STORAGE BOX SMALL,WORLD WAR 2 GLIDERS ASSTD DESIGNS,WRAP VINTAGE DOILY,WRAP 50'S CHRISTMAS,WRAP ALPHABET DESIGN,WRAP CAROUSEL,WRAP CHRISTMAS VILLAGE,WRAP CIRCUS PARADE,WRAP DOILEY DESIGN,WRAP DOLLY GIRL,WRAP ENGLISH ROSE,WRAP GINGHAM ROSE,WRAP GREEN PEARS,WRAP I LOVE LONDON,WRAP PAISLEY PARK,WRAP PINK FAIRY CAKES,WRAP POPPIES DESIGN,WRAP RED APPLES,WRAP RED VINTAGE DOILY,WRAP SUKI AND FRIENDS,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
536370,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536852,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536974,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537065,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537463,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Create the function `encode_units`, which returns 1 when the quantity is greater than 0 and returns 0 otherwise.

In [None]:
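def encode_units(x):
  # 1 if the item was bought in this invoice (quantity above 0), else 0.
  # A single expression covers every value, so the function never returns None.
  return 1 if x > 0 else 0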
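
As a hedged alternative, the same 0/1 encoding can be written as a vectorized comparison instead of an element-wise function call. The frame `toy_basket` and its item names are hypothetical; the sketch assumes only `pandas` and that every cell is numeric.

```python
import pandas as pd

# Hypothetical invoice-by-item quantity table (a negative value stands for a return).
toy_basket = pd.DataFrame(
    {"ITEM A": [0.0, 12.0, -1.0], "ITEM B": [3.0, 0.0, 0.0]},
    index=pd.Index(["536370", "536852", "536974"], name="InvoiceNo"),
)

# True where at least one unit was bought, then cast to 0/1.
encoded = (toy_basket > 0).astype(int)
print(encoded)
```

This gives the same result as applying `encode_units` cell by cell, while avoiding a Python-level call for every cell.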

In [None]:
basket_set = basket.applymap(encode_units)
basket_set.drop('POSTAGE', axis=1, inplace=True)
basket_set.head()

InvoiceNo,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE WOODLAND,15CM CHRISTMAS GLASS BALL 20 LIGHTS,16 PIECE CUTLERY SET PANTRY DESIGN,18PC WOODEN CUTLERY SET DISPOSABLE,20 DOLLY PEGS RETROSPOT,200 RED + WHITE BENDY STRAWS,3 HOOK HANGER MAGIC GARDEN,3 PIECE SPACEBOY COOKIE CUTTER SET,3 RAFFIA RIBBONS 50'S CHRISTMAS,3 STRIPEY MICE FELTCRAFT,3 TIER CAKE TIN RED AND CREAM,3 TRADITIONAl BISCUIT CUTTERS SET,36 DOILIES DOLLY GIRL,36 DOILIES VINTAGE CHRISTMAS,36 FOIL HEART CAKE CASES,36 FOIL STAR CAKE CASES,36 PENCILS TUBE RED RETROSPOT,36 PENCILS TUBE SKULLS,36 PENCILS TUBE WOODLAND,3D DOG PICTURE PLAYING CARDS,3D HEARTS HONEYCOMB PAPER GARLAND,3D SHEET OF DOG STICKERS,3D TRADITIONAL CHRISTMAS STICKERS,3D VINTAGE CHRISTMAS STICKERS,4 IVORY DINNER CANDLES SILVER FLOCK,4 PINK DINNER CANDLE SILVER FLOCK,4 TRADITIONAL SPINNING TOPS,5 HOOK HANGER MAGIC TOADSTOOL,5 HOOK HANGER RED MAGIC TOADSTOOL,50'S CHRISTMAS GIFT BAG LARGE,6 GIFT TAGS 50'S CHRISTMAS,...,WOODLAND DESIGN COTTON TOTE BAG,WOODLAND LARGE BLUE FELT HEART,WOODLAND LARGE PINK FELT HEART,WOODLAND LARGE RED FELT HEART,WOODLAND MINI BACKPACK,WOODLAND PARTY BAG + STICKER SET,WOODLAND SMALL BLUE FELT HEART,WOODLAND SMALL PINK FELT HEART,WOODLAND SMALL RED FELT HEART,WOODLAND STORAGE BOX LARGE,WOODLAND STORAGE BOX SMALL,WORLD WAR 2 GLIDERS ASSTD DESIGNS,WRAP VINTAGE DOILY,WRAP 50'S CHRISTMAS,WRAP ALPHABET DESIGN,WRAP CAROUSEL,WRAP CHRISTMAS VILLAGE,WRAP CIRCUS PARADE,WRAP DOILEY DESIGN,WRAP DOLLY GIRL,WRAP ENGLISH ROSE,WRAP GINGHAM ROSE,WRAP GREEN PEARS,WRAP I LOVE LONDON,WRAP PAISLEY PARK,WRAP PINK FAIRY CAKES,WRAP POPPIES DESIGN,WRAP RED APPLES,WRAP RED VINTAGE DOILY,WRAP SUKI AND FRIENDS,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
536370,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
536852,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
536974,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
537065,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
537463,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Generating frequent `itemsets` and association rules

In [None]:
frequent_itemsets = apriori(basket_set, min_support=0.07, use_colnames=True)
print(frequent_itemsets.sort_values(by='support', ascending=False))
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules

     support                                           itemsets
22  0.188776                               (RABBIT NIGHT LIGHT)
26  0.181122                    (RED TOADSTOOL LED NIGHT LIGHT)
21  0.170918                 (PLASTERS IN TIN WOODLAND ANIMALS)
18  0.168367                    (PLASTERS IN TIN CIRCUS PARADE)
30  0.158163               (ROUND SNACK BOXES SET OF4 WOODLAND)
11  0.153061                          (LUNCH BAG RED RETROSPOT)
14  0.142857                 (LUNCH BOX WITH CUTLERY RETROSPOT)
33  0.137755                      (SET/6 RED SPOTTY PAPER CUPS)
24  0.137755                         (RED RETROSPOT MINI CASES)
19  0.137755                         (PLASTERS IN TIN SPACEBOY)
32  0.132653               (SET/20 RED RETROSPOT PAPER NAPKINS)
34  0.127551                    (SET/6 RED SPOTTY PAPER PLATES)
27  0.125000                         (REGENCY CAKESTAND 3 TIER)
36  0.125000                               (SPACEBOY LUNCH BOX)
9   0.125000                           (

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE PINK),0.096939,0.102041,0.07398,0.763158,7.478947,0.064088,3.791383
1,(ALARM CLOCK BAKELIKE PINK),(ALARM CLOCK BAKELIKE GREEN),0.102041,0.096939,0.07398,0.725,7.478947,0.064088,3.283859
2,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE GREEN),0.094388,0.096939,0.079082,0.837838,8.642959,0.069932,5.568878
3,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE RED),0.096939,0.094388,0.079082,0.815789,8.642959,0.069932,4.916181
4,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE PINK),0.094388,0.102041,0.07398,0.783784,7.681081,0.064348,4.153061
5,(ALARM CLOCK BAKELIKE PINK),(ALARM CLOCK BAKELIKE RED),0.102041,0.094388,0.07398,0.725,7.681081,0.064348,3.293135
6,(DOLLY GIRL LUNCH BOX),(SPACEBOY LUNCH BOX),0.09949,0.125,0.071429,0.717949,5.74359,0.058992,3.102273
7,(SPACEBOY LUNCH BOX),(DOLLY GIRL LUNCH BOX),0.125,0.09949,0.071429,0.571429,5.74359,0.058992,2.10119
8,(PLASTERS IN TIN CIRCUS PARADE),(PLASTERS IN TIN SPACEBOY),0.168367,0.137755,0.089286,0.530303,3.849607,0.066092,1.835747
9,(PLASTERS IN TIN SPACEBOY),(PLASTERS IN TIN CIRCUS PARADE),0.137755,0.168367,0.089286,0.648148,3.849607,0.066092,2.363588


In [None]:
rules[(rules['lift'] >= 1) &
      (rules['confidence'] >= 0.2)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE PINK),0.096939,0.102041,0.07398,0.763158,7.478947,0.064088,3.791383
1,(ALARM CLOCK BAKELIKE PINK),(ALARM CLOCK BAKELIKE GREEN),0.102041,0.096939,0.07398,0.725,7.478947,0.064088,3.283859
2,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE GREEN),0.094388,0.096939,0.079082,0.837838,8.642959,0.069932,5.568878
3,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE RED),0.096939,0.094388,0.079082,0.815789,8.642959,0.069932,4.916181
4,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE PINK),0.094388,0.102041,0.07398,0.783784,7.681081,0.064348,4.153061
5,(ALARM CLOCK BAKELIKE PINK),(ALARM CLOCK BAKELIKE RED),0.102041,0.094388,0.07398,0.725,7.681081,0.064348,3.293135
6,(DOLLY GIRL LUNCH BOX),(SPACEBOY LUNCH BOX),0.09949,0.125,0.071429,0.717949,5.74359,0.058992,3.102273
7,(SPACEBOY LUNCH BOX),(DOLLY GIRL LUNCH BOX),0.125,0.09949,0.071429,0.571429,5.74359,0.058992,2.10119
8,(PLASTERS IN TIN CIRCUS PARADE),(PLASTERS IN TIN SPACEBOY),0.168367,0.137755,0.089286,0.530303,3.849607,0.066092,1.835747
9,(PLASTERS IN TIN SPACEBOY),(PLASTERS IN TIN CIRCUS PARADE),0.137755,0.168367,0.089286,0.648148,3.849607,0.066092,2.363588


`ALARM CLOCK BAKELIKE RED` and `ALARM CLOCK BAKELIKE GREEN` are the two products with the highest lift (8.64), and the rule between them has a confidence above 80% in both directions.

Check how many units of these products were sold.


In [None]:
print('Red alarm clocks:  {}'.format(basket["ALARM CLOCK BAKELIKE RED"].sum()))
print('Green alarm clocks:  {}'.format(basket["ALARM CLOCK BAKELIKE GREEN"].sum()))

Red alarm clocks:  316.0
Green alarm clocks:  340.0


340 green alarm clocks were sold, but only 316 red ones. Could sales of the red alarm clocks be boosted through recommendations?

## Analysis of Germany

This analysis is similar to the one for France. The goal is to show that the minimum support and confidence can vary from one dataset to another: some countries may have a more homogeneous purchasing profile and yield rules with higher support, while others yield rules with lower support.

In [None]:
basket_de = (df[df.Country == 'Germany']
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum()
          .unstack()            
          .reset_index()            
          .fillna(0)            
          .set_index('InvoiceNo'))
basket_set_de = basket_de.applymap(encode_units)
basket_set_de.drop('POSTAGE', inplace=True, axis=1)
frequent_itemsets_de = apriori(basket_set_de, min_support=0.05, use_colnames=True)
rules_de = association_rules(frequent_itemsets_de, metric='lift', min_threshold=1)

rules_de[(rules_de.lift >= 4) &
         (rules_de.confidence >= 0.5)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(PLASTERS IN TIN CIRCUS PARADE),(PLASTERS IN TIN WOODLAND ANIMALS),0.115974,0.137856,0.067834,0.584906,4.242887,0.051846,2.076984
7,(PLASTERS IN TIN SPACEBOY),(PLASTERS IN TIN WOODLAND ANIMALS),0.107221,0.137856,0.061269,0.571429,4.145125,0.046488,2.01167
10,(RED RETROSPOT CHARLOTTE BAG),(WOODLAND CHARLOTTE BAG),0.070022,0.126915,0.059081,0.84375,6.648168,0.050194,5.587746


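# Comparison with the `FPGrowth` algorithm

Compare the results obtained with the `fpgrowth` algorithm instead of `apriori`.

`fpgrowth` would be expected to be faster, because it scans the dataset only twice (once to count item frequencies and once to build the FP-tree), whereas `apriori` rescans it for every candidate itemset length.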
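
The two scans can be sketched as follows, using the toy baskets from the first part of this notebook. This covers only the construction of the FP-tree, not the mining step, and it is not the `mlxtend` implementation; the names `Node` and `build_fp_tree` and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import Counter

dataset = [['Leite', 'Cebola', 'Batata', 'Feijão', 'Ovos', 'Iogurte'],
           ['Arroz', 'Cebola', 'Batata', 'Feijão', 'Ovos', 'Iogurte'],
           ['Leite', 'Maçã', 'Feijão', 'Ovos'],
           ['Leite', 'Milho', 'Feijão', 'Iogurte'],
           ['Milho', 'Cebola', 'Feijão', 'Sorvete', 'Ovos']]


class Node:
    """One FP-tree node: an item, how many baskets pass through it, and its children."""

    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    # Scan 1: count in how many baskets each item appears, keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    # Scan 2: insert each basket as a path, most frequent items first,
    # so that baskets sharing frequent items share a prefix of the tree.
    root = Node(None)
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root


# A support of 0.6 over 5 baskets corresponds to a count of at least 3.
tree = build_fp_tree(dataset, min_count=3)
top = tree.children['Feijão']
print(top.count, {item: child.count for item, child in top.children.items()})
```

All five baskets share the `Feijão` prefix, so the tree stores it once with a count of 5. The frequent itemsets are then mined from this compact structure rather than from repeated passes over the data.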

In [None]:
frequent_itemsets_de =  fpgrowth(basket_set_de, min_support=0.05, use_colnames=True)
rules_de = association_rules(frequent_itemsets_de, metric='lift', min_threshold=1)
rules_de[(rules_de.lift >= 4) &
         (rules_de.confidence >= 0.5)]
       

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
8,(PLASTERS IN TIN CIRCUS PARADE),(PLASTERS IN TIN WOODLAND ANIMALS),0.115974,0.137856,0.067834,0.584906,4.242887,0.051846,2.076984
12,(RED RETROSPOT CHARLOTTE BAG),(WOODLAND CHARLOTTE BAG),0.070022,0.126915,0.059081,0.84375,6.648168,0.050194,5.587746
17,(PLASTERS IN TIN SPACEBOY),(PLASTERS IN TIN WOODLAND ANIMALS),0.107221,0.137856,0.061269,0.571429,4.145125,0.046488,2.01167
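
Because the two filtered tables for Germany list their rows in a different order, one way to confirm that they hold the same rules is to compare them as sets. The sketch below copies the antecedent and consequent pairs from the two outputs by hand; the helper name `as_rule_set` is an assumption made for illustration.

```python
# (antecedent, consequent) pairs, in the order shown by each output table.
apriori_rules = [
    (("PLASTERS IN TIN CIRCUS PARADE",), ("PLASTERS IN TIN WOODLAND ANIMALS",)),
    (("PLASTERS IN TIN SPACEBOY",), ("PLASTERS IN TIN WOODLAND ANIMALS",)),
    (("RED RETROSPOT CHARLOTTE BAG",), ("WOODLAND CHARLOTTE BAG",)),
]
fpgrowth_rules = [
    (("PLASTERS IN TIN CIRCUS PARADE",), ("PLASTERS IN TIN WOODLAND ANIMALS",)),
    (("RED RETROSPOT CHARLOTTE BAG",), ("WOODLAND CHARLOTTE BAG",)),
    (("PLASTERS IN TIN SPACEBOY",), ("PLASTERS IN TIN WOODLAND ANIMALS",)),
]


def as_rule_set(rules):
    """Order-independent view: neither row order nor item order inside a side matters."""
    return {(frozenset(a), frozenset(c)) for a, c in rules}


same_rules = as_rule_set(apriori_rules) == as_rule_set(fpgrowth_rules)
print(same_rules)
```

The same idea applies directly to the `antecedents` and `consequents` columns of the rule DataFrames, since `mlxtend` stores them as frozensets.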


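As we can see, the results are the same, although the rows are presented in a different order.

Using the Jupyter notebook `%timeit` command to compare the performance of the two algorithms, we can see that in this case `apriori` was faster than `fpgrowth`. The cause was not investigated; one plausible explanation is that, on a small dataset with few frequent itemsets, the cost of building the FP-tree outweighs the candidate generation it avoids.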

In [None]:
%timeit -n 100  fpgrowth(basket_set_de, min_support=0.05, use_colnames=True)

100 loops, best of 3: 34.9 ms per loop


In [None]:
%timeit -n 100   apriori(basket_set_de, min_support=0.05, use_colnames=True)

100 loops, best of 3: 20.1 ms per loop


# Conclusion
Both `apriori` and `fpgrowth` are effective algorithms for mining association rules: on this dataset they produced the same rules, and `apriori` ran faster at the support threshold used.