<a href="https://colab.research.google.com/github/blancavazquez/CursoDatosMasivosII/blob/master/notebooks/2b_apriori_paso_a_paso.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Algoritmo apriori**

En esta libreta exploraremos el modelo mercado canasta basado en el algoritmo apriori usando una base de datos de transacciones de una tienda de productos comestibles.


In [1]:
import functools 
import itertools
import operator 
import numpy as np
import pandas as pd
from math import sqrt
from collections import Counter
import matplotlib.pyplot as plt

## Carga de datos
Primero leemos la base de datos [Online Retail II](https://archive.ics.uci.edu/ml/datasets/Online+Retail+II) de [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) y mostramos algunos registros.

In [2]:
tdb_nan = pd.read_excel('https://archive.ics.uci.edu/ml/machine-learning-databases/00502/online_retail_II.xlsx')
tdb_nan.head(5)

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


Eliminamos los registros inválidos y revisamos el número de registros.



In [3]:
print(tdb_nan.shape)
tdb = tdb_nan.dropna(axis=0)
print(tdb.shape)

(525461, 8)
(417534, 8)


La base de datos cuenta con un registro por cada producto comprado (campo `Description`) y una factura asociada (`Invoice`), ademas de otros campos. Para nuestro ejercicio, vamos a utilizar únicamente `Invoice` y `Description`.

In [4]:
tdb = tdb[['Invoice', 'Description']]
tdb.head(5)

Unnamed: 0,Invoice,Description
0,489434,15CM CHRISTMAS GLASS BALL 20 LIGHTS
1,489434,PINK CHERRY LIGHTS
2,489434,WHITE CHERRY LIGHTS
3,489434,"RECORD FRAME 7"" SINGLE SIZE"
4,489434,STRAWBERRY CERAMIC TRINKET BOX


Para encontrar elementos frecuentes, es necesario agrupar los productos comprados en una misma transacción usando el número de factura `Invoice`.

In [5]:
tdb = tdb.groupby(['Invoice'])['Description'].apply(list).reset_index(name='Description')
tdb.head(5)

Unnamed: 0,Invoice,Description
0,489434,"[15CM CHRISTMAS GLASS BALL 20 LIGHTS, PINK CHE..."
1,489435,"[CAT BOWL , DOG BOWL , CHASING BALL DESIGN, HE..."
2,489436,"[DOOR MAT BLACK FLOCK , LOVE BUILDING BLOCK WO..."
3,489437,"[CHRISTMAS CRAFT HEART DECORATIONS, CHRISTMAS ..."
4,489438,"[DINOSAURS WRITING SET , SET OF MEADOW FLOWE..."


Revisamos el número de transacciones totales en la base de datos.

In [6]:
print(tdb.shape)

(23587, 2)


## Búsqueda de elementos frecuentes
Definimos el soporte mínimo y el número de transacciones mínimas de los conjuntos frecuentes que vamos a buscar

In [7]:
minsup = 0.01
minoc = minsup * len(tdb)
d = len(set(e for t in tdb['Description'] for e in t))
n = len(tdb)

print('Número de elementos únicos: {0}\nNúmero de transacciones: {1}'.format(d, n))
print('Soporte mínimo: {0}\nNúmero mínimo de transacciones: {1}'.format(minsup, minoc))

Número de elementos únicos: 4459
Número de transacciones: 23587
Soporte mínimo: 0.01
Número mínimo de transacciones: 235.87


Empezamos buscando elementos únicos que cumplan la condición de soporte mínimo

In [8]:
cont_ef = Counter([e for t in tdb['Description'] for e in t])
ef = [c[0] for c in cont_ef.items() if c[1] >= minoc]
sop_ef = [c[1] for c in cont_ef.items() if c[1] >= minoc]

Generamos una función para imprimir los conjuntos de elementos frecuentes y el número de transacciones en las que ocurren

In [9]:
def imprime_cf(cf, sop_cf):
  orden_cf = np.argsort(sop_cf)[::-1]
  texto_cf = [str(i+1)+ ' -- [' + str(sop_cf[j])+'] ' + str(cf[j]) 
              for i,j in enumerate(orden_cf)]
  print('\n'.join(texto_cf))

Imprimimos los elementos frecuentes encontrados

In [10]:
imprime_cf(ef, sop_ef)

1 -- [3245] WHITE HANGING HEART T-LIGHT HOLDER
2 -- [1872] REGENCY CAKESTAND 3 TIER
3 -- [1536] STRAWBERRY CERAMIC TRINKET BOX
4 -- [1376] ASSORTED COLOUR BIRD ORNAMENT
5 -- [1229] HOME BUILDING BLOCK WORD
6 -- [1214] PACK OF 72 RETRO SPOT CAKE CASES
7 -- [1195] 60 TEATIME FAIRY CAKE CASES
8 -- [1195] REX CASH+CARRY JUMBO SHOPPER
9 -- [1114] JUMBO BAG RED RETROSPOT
10 -- [1112] LUNCH BAG RED SPOTTY
11 -- [1082] BAKING SET 9 PIECE RETROSPOT 
12 -- [1054] RED HANGING HEART T-LIGHT HOLDER
13 -- [1051] HEART OF WICKER LARGE
14 -- [1050] WOODEN FRAME ANTIQUE WHITE 
15 -- [1035] LUNCH BAG  BLACK SKULL.
16 -- [1000] LOVE BUILDING BLOCK WORD
17 -- [999] JUMBO STORAGE BAG SUKI
18 -- [996] LUNCH BAG SUKI  DESIGN 
19 -- [966] PACK OF 60 PINK PAISLEY CAKE CASES
20 -- [954] JUMBO SHOPPER VINTAGE RED PAISLEY
21 -- [940] LUNCH BAG SPACEBOY DESIGN 
22 -- [923] HEART OF WICKER SMALL
23 -- [903] JUMBO BAG STRAWBERRY
24 -- [902] SWEETHEART CERAMIC TRINKET BOX
25 -- [902] SCOTTIE DOG HOT WATER BOTTLE
26 -

Seguimos contando todos los pares de elementos frecuentes en las transacciones

In [11]:
cont_pf = Counter()
for t in tdb['Description']:
  tef = list(set(t) & set(ef))
  for i in range(len(tef)-1):
    for j in range(i+1,len(tef)):
      cont_pf[(tef[i], tef[j])] += 1

Nos quedamos con los que cumplan el soporte mínimo

In [12]:
pf = [set(c[0]) for c in cont_pf.items() if c[1] >= minoc] 
sop_pf = [c[1] for c in cont_pf.items() if c[1] >= minoc]

Imprimimos las reglas de pares frecuentes

In [13]:
imprime_cf(pf, sop_pf)

1 -- [701] {'RED HANGING HEART T-LIGHT HOLDER', 'WHITE HANGING HEART T-LIGHT HOLDER'}
2 -- [475] {'HEART OF WICKER LARGE', 'HEART OF WICKER SMALL'}
3 -- [472] {'LOVE BUILDING BLOCK WORD', 'HOME BUILDING BLOCK WORD'}
4 -- [463] {'60 TEATIME FAIRY CAKE CASES', 'PACK OF 60 PINK PAISLEY CAKE CASES'}
5 -- [440] {'WOODEN FRAME ANTIQUE WHITE ', 'WHITE HANGING HEART T-LIGHT HOLDER'}
6 -- [402] {'PACK OF 72 RETRO SPOT CAKE CASES', 'PACK OF 60 PINK PAISLEY CAKE CASES'}
7 -- [390] {'WHITE HANGING HEART T-LIGHT HOLDER', 'HOME BUILDING BLOCK WORD'}
8 -- [387] {'LUNCH BAG  BLACK SKULL.', 'LUNCH BAG RED SPOTTY'}
9 -- [384] {'WOODEN PICTURE FRAME WHITE FINISH', 'WHITE HANGING HEART T-LIGHT HOLDER'}
10 -- [384] {'HOT WATER BOTTLE TEA AND SYMPATHY', 'CHOCOLATE HOT WATER BOTTLE'}
11 -- [379] {'ASSORTED COLOUR BIRD ORNAMENT', 'WHITE HANGING HEART T-LIGHT HOLDER'}
12 -- [364] {'LOVE BUILDING BLOCK WORD', 'WHITE HANGING HEART T-LIGHT HOLDER'}
13 -- [363] {'LUNCH BAG  BLACK SKULL.', 'LUNCH BAG SPACEBOY DESIG

Contamos todos los subconjuntos de tres elementos frecuentes, cuyos posibles pares también sean frecuentes

In [14]:
cont_t3 = Counter()
for t in tdb['Description']:
  tef = list(set(t) & set(ef))
  tpf = [{tef[i],tef[j]} 
         for i in range(len(tef)-1) 
         for j in range(i+1,len(tef))]
  tpf = [p for p in tpf if p in pf]
  tef = [e for p in tpf for e in p]
  for e in tef:
    for p in tpf:
      if e not in p:
        pares = [{i,e} for i in p if {i,e} in pf]
        if len(pares)==2:
          cont_t3[tuple(p|{e})] += 1

Mantenemos aquellos que tengan soporte mínimo

In [15]:
st3f = [set(c[0]) for c in cont_t3.items() if c[1] >= minoc] 
sop_st3 = [c[1] for c in cont_t3.items() if c[1] >= minoc]

Visualizamos los conjuntos de 3 elementos frecuentes

In [16]:
imprime_cf(st3f, sop_st3)

1 -- [2730] {'WOODEN FRAME ANTIQUE WHITE ', 'WOODEN PICTURE FRAME WHITE FINISH', 'WHITE HANGING HEART T-LIGHT HOLDER'}
2 -- [2457] {'PACK OF 72 RETRO SPOT CAKE CASES', '60 TEATIME FAIRY CAKE CASES', 'PACK OF 60 PINK PAISLEY CAKE CASES'}
3 -- [2176] {'LUNCH BAG WOODLAND', 'LUNCH BAG SPACEBOY DESIGN ', 'LUNCH BAG SUKI  DESIGN '}
4 -- [2151] {'LUNCH BAG WOODLAND', 'LUNCH BAG PINK RETROSPOT', 'LUNCH BAG RED SPOTTY'}
5 -- [2125] {'PACK OF 60 DINOSAUR CAKE CASES', 'PACK OF 72 RETRO SPOT CAKE CASES', '60 TEATIME FAIRY CAKE CASES'}
6 -- [2112] {'LUNCH BAG WOODLAND', 'LUNCH BAG CARS BLUE', 'LUNCH BAG RED SPOTTY'}
7 -- [2100] {'LUNCH BAG WOODLAND', 'LUNCH BAG SPACEBOY DESIGN ', 'LUNCH BAG RED SPOTTY'}
8 -- [2096] {'LUNCH BAG WOODLAND', 'LUNCH BAG  BLACK SKULL.', 'LUNCH BAG RED SPOTTY'}
9 -- [2051] {'PACK OF 60 DINOSAUR CAKE CASES', '60 TEATIME FAIRY CAKE CASES', 'PACK OF 60 PINK PAISLEY CAKE CASES'}
10 -- [1936] {'PACK OF 72 RETRO SPOT CAKE CASES', '60 TEATIME FAIRY CAKE CASES', '72 SWEETHEART F

Contamos los subconjuntos de 4 elementos frecuentes, cuyos posibles subconjuntos de 3 elementos también son frecuentes

In [17]:
cont_t4 = Counter()
for t in tdb['Description']:
  tef = list(set(t) & set(ef))
  s3 = [{tef[i],tef[j],tef[k]} 
        for i in range(len(tef)-2) 
        for j in range(i+1,len(tef)-1) 
        for k in range(j+1,len(tef))]
  s3 = [e for e in s3 if e in st3f]
  tef = [e for s in s3 for e in s]
  for e in tef:
    for s in s3:
      if e not in s:
        frecuentes = True
        for i in range(len(s)-1):
          for j in range(i+1,len(s)): 
            if {i,j,e} not in st3f:
              frecuentes = False
              
        if frecuentes:
          cont_t4[tuple(s + (e,))] += 1

Descatamos los que no tengan mínimo soporte

In [18]:
print(cont_t4.items())
st4f = [c[0] for c in cont_t4.items() if c[1] >= minoc] 
sop_t4 = [c[1] for c in cont_t4.items() if c[1] >= minoc]
print(st4f)
print(sop_t4)

dict_items([])
[]
[]
