# **Apriori algorithm**

In this notebook we explore the market-basket model based on the Apriori algorithm, using a database of transactions from an online retail store.


In [1]:
import pandas as pd

## Loading the data
First we read the [Online Retail II](https://archive.ics.uci.edu/ml/datasets/Online+Retail+II) dataset from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) and display a few records.

In [2]:
tdb_nan = pd.read_excel('https://archive.ics.uci.edu/ml/machine-learning-databases/00502/online_retail_II.xlsx')
tdb_nan.head(5)

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom


We drop the records that contain missing values and check the number of records before and after.



In [3]:
print(tdb_nan.shape)
tdb = tdb_nan.dropna(axis=0)
print(tdb.shape)

(525461, 8)
(417534, 8)


The database has one record per purchased product (`Description` field) together with its associated invoice (`Invoice`), along with other fields. For this exercise we will use only `Invoice` and `Description`.

In [4]:
tdb = tdb[['Invoice', 'Description']]
tdb.head(5)

Unnamed: 0,Invoice,Description
0,489434,15CM CHRISTMAS GLASS BALL 20 LIGHTS
1,489434,PINK CHERRY LIGHTS
2,489434,WHITE CHERRY LIGHTS
3,489434,"RECORD FRAME 7"" SINGLE SIZE"
4,489434,STRAWBERRY CERAMIC TRINKET BOX


To find frequent itemsets, we need to group the products purchased in the same transaction using the invoice number `Invoice`.

In [5]:
tdb = tdb.groupby(['Invoice'])['Description'].apply(list).reset_index(name='Description')
tdb.head(5)

Unnamed: 0,Invoice,Description
0,489434,"[15CM CHRISTMAS GLASS BALL 20 LIGHTS, PINK CHE..."
1,489435,"[CAT BOWL , DOG BOWL , CHASING BALL DESIGN, HE..."
2,489436,"[DOOR MAT BLACK FLOCK , LOVE BUILDING BLOCK WO..."
3,489437,"[CHRISTMAS CRAFT HEART DECORATIONS, CHRISTMAS ..."
4,489438,"[DINOSAURS WRITING SET , SET OF MEADOW FLOWE..."
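
To make the grouping step concrete, here is a minimal pandas-free sketch of the same idea on a few made-up rows. The `rows` data and the `group_by_invoice` helper are illustrative assumptions and are not part of the notebook.

```python
from collections import defaultdict

# Hypothetical (invoice, description) rows; not taken from the real dataset
rows = [
    ("A1", "MUG"),
    ("A1", "PLATE"),
    ("A2", "MUG"),
    ("A1", "SPOON"),
]

def group_by_invoice(rows):
    """Collect the descriptions of each invoice into one list (one basket per invoice)."""
    baskets = defaultdict(list)
    for invoice, description in rows:
        baskets[invoice].append(description)
    return dict(baskets)

baskets = group_by_invoice(rows)
print(baskets)
```

Each key plays the role of `Invoice` and each list plays the role of the grouped `Description` column.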


We check the total number of transactions in the database.

## Exercise
Implement the Apriori algorithm and search for frequent itemsets of cardinality 1, 2 and 3, using different minimum supports, over the transactions of the Online Retail II database.
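
As a reminder of what "support" means here, the following sketch computes it on a toy set of baskets. The `support` helper and the `transactions` list are hypothetical names used only for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

# Hypothetical baskets, not taken from the real dataset
transactions = [
    ["MUG", "PLATE", "SPOON"],
    ["MUG", "PLATE"],
    ["MUG"],
    ["SPOON"],
]

sup_mug = support(["MUG"], transactions)
sup_pair = support(["MUG", "PLATE"], transactions)
sup_triple = support(["MUG", "PLATE", "SPOON"], transactions)
print(sup_mug, sup_pair, sup_triple)
```

The support can only stay equal or decrease as items are added to the set. Apriori relies on this property: an itemset can be frequent only if all of its subsets are frequent.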

In [6]:
# Because the Apriori algorithm has a high time complexity, we reduce the data to analyze to a random sample of 100 transactions
X=tdb.sample(n=100)
X

Unnamed: 0,Invoice,Description
9726,515114,"[CHERRY BLOSSOM DECORATIVE FLASK, RED PAPER P..."
14482,526621,[BLACK RECORD COVER FRAME]
16120,530800,"[LARGE HEART MEASURING SPOONS, GREY HEART HOT ..."
1867,494512,"[PLEASE ONE PERSON METAL SIGN, AREA PATROLLED..."
14097,525723,"[SET 20 NAPKINS FAIRY CAKES DESIGN , PACK OF 2..."
...,...,...
22685,C527032,"[WHITE WOOD GARDEN PLANT LADDER, ORGANISER WOO..."
1214,492318,"[ASSORTED FLORAL SECATEURS, SWALLOWS GREETING ..."
15468,528945,[ALARM CLOCK BAKELIKE CHOCOLATE]
17415,533904,"[APPLE BATH SPONGE, STRAWBERRY BATH SPONGE , F..."


In [7]:
'''
First we implement the algorithm for cardinality 1 only
'''
def algoritmo_apriori1(X, sop):  # takes the transactions and the minimum relative support
    total = X.shape[0]  # total number of transactions in this dataset
    soporte = int(sop*total//1 + 1)  # absolute support threshold: floor(sop * total) + 1

    # Create C1 (occurrence counts of each item) and F1 (frequent items)
    C1 = {}
    F1 = []

    # C1 stores the occurrence counts; an item is added to F1 once its count exceeds the threshold
    for j in range(total):
        for j1 in X.iloc[j,1]:
            key = str(j1)  # use the same string key for the lookup and the update
            if key in C1:
                C1[key] += 1
                if C1[key] > soporte and not (key in F1):
                    F1.append(key)
            else:
                C1[key] = 1

    print('El conjunto F1 es ',F1)
    print('El tamaño del conjunto F1 es ',len(F1))
    return F1,len(F1)

a,b=algoritmo_apriori1(X,0.004)


El conjunto F1 es  ['PINK HAPPY BIRTHDAY BUNTING', 'RED RETROSPOT CUP', 'REGENCY CAKESTAND 3 TIER', 'DOORMAT UNION JACK GUNS AND ROSES', 'BLACK BAROQUE CARRIAGE CLOCK', 'LARGE HEART MEASURING SPOONS', 'PARTY BUNTING', 'ANTIQUE SILVER TEA GLASS ETCHED', 'RETRO SPOT CAKE STAND', 'SCOTTIE DOG HOT WATER BOTTLE', 'HOT WATER BOTTLE BABUSHKA ', 'ANTIQUE SILVER TEA GLASS ENGRAVED', 'GLASS JAR MARMALADE ', 'WHITE HANGING HEART T-LIGHT HOLDER', 'BLACK HEART CARD HOLDER', 'PAPER BUNTING RETRO SPOTS', 'PAPER BUNTING VINTAGE PAISLEY', 'LAVENDER SCENTED FABRIC HEART', 'KINGS CHOICE MUG', 'POTTING SHED TEA MUG', 'HOME SWEET HOME MUG', 'ENGLISH ROSE DESIGN PEG BAG', 'VINTAGE UNION JACK BUNTING', '60 TEATIME FAIRY CAKE CASES', 'HOME SWEET HOME METAL SIGN ', 'WASHROOM METAL SIGN', 'CONGRATULATIONS BUNTING', '12 RED ROSE PEG PLACE SETTINGS', 'BLUE NEW BAROQUE FLOCK CANDLESTICK', 'EDWARDIAN PARASOL NATURAL', 'JUMBO STORAGE BAG SUKI', 'CHARLOTTE BAG SUKI DESIGN', 'REX CASH+CARRY JUMBO SHOPPER', 'LOVE BUILD
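
For comparison, here is a compact sketch of the cardinality-1 step built on `collections.Counter`. It is not the function above: it counts an item at most once per transaction and keeps items whose relative support is at least `min_support`, whereas `algoritmo_apriori1` counts every occurrence and uses a strict comparison against `floor(sop * total) + 1`. The name `frequent_items` and the toy data are assumptions.

```python
from collections import Counter

def frequent_items(transactions, min_support):
    """Return {item: count} for items present in at least min_support of the transactions."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        counts.update(set(t))  # set(): count each item once per transaction
    return {item: c for item, c in counts.items() if c / n >= min_support}

# Hypothetical baskets; "MUG" is repeated on purpose in the first one
transactions = [
    ["MUG", "MUG", "PLATE"],
    ["MUG", "SPOON"],
    ["PLATE"],
    ["MUG"],
]

f1 = frequent_items(transactions, min_support=0.5)
print(f1)
```

Deduplicating with `set` matters when an invoice lists the same product on several lines, since support is usually defined over transactions rather than over line items.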

In [12]:
def algoritmo_apriori2(X,sop):
  a,b=algoritmo_apriori1(X,sop)  # a: frequent items F1, b: size of F1
  total = X.shape[0]
  soporte = int(sop*total//1 + 1)
  # Essentially the same as algoritmo_apriori1, but now building the candidates from F1
  C2 = {}   # candidate pairs
  F2_ = {}  # occurrence count of each candidate pair
  F2 = []   # frequent pairs

  # Build every unordered pair of distinct frequent items
  for k in range(b-1):
      for k1 in range(k,b):
          j = a[k]
          j1 =a[k1]
          if j != j1:
              if not ((j,j1) in C2) and not ((j1,j) in C2):
                  C2[(j,j1)] = 0

  # Count the transactions that contain both items of each candidate pair
  for j,v in C2.keys():
      for j1 in range(total):
          a_ = X.iloc[j1,1]
          if j in a_ and v in a_:
              if (j,v) in F2_:
                  F2_[(j,v)] += 1
              else:
                  F2_[(j,v)] = 1

  # Keep the pairs whose count exceeds the support threshold
  for j,v in F2_.items():
      if v > soporte:
          F2.append(j)

  print('El conjunto F2 es ',F2)
  print('El tamaño del conjunto F2 es ',len(F2))
  return F2, len(F2),F2_



In [9]:
algoritmo_apriori2(X,0.004)

El conjunto F1 es  ['PINK HAPPY BIRTHDAY BUNTING', 'RED RETROSPOT CUP', 'REGENCY CAKESTAND 3 TIER', 'DOORMAT UNION JACK GUNS AND ROSES', 'BLACK BAROQUE CARRIAGE CLOCK', 'LARGE HEART MEASURING SPOONS', 'PARTY BUNTING', 'ANTIQUE SILVER TEA GLASS ETCHED', 'RETRO SPOT CAKE STAND', 'SCOTTIE DOG HOT WATER BOTTLE', 'HOT WATER BOTTLE BABUSHKA ', 'ANTIQUE SILVER TEA GLASS ENGRAVED', 'GLASS JAR MARMALADE ', 'WHITE HANGING HEART T-LIGHT HOLDER', 'BLACK HEART CARD HOLDER', 'PAPER BUNTING RETRO SPOTS', 'PAPER BUNTING VINTAGE PAISLEY', 'LAVENDER SCENTED FABRIC HEART', 'KINGS CHOICE MUG', 'POTTING SHED TEA MUG', 'HOME SWEET HOME MUG', 'ENGLISH ROSE DESIGN PEG BAG', 'VINTAGE UNION JACK BUNTING', '60 TEATIME FAIRY CAKE CASES', 'HOME SWEET HOME METAL SIGN ', 'WASHROOM METAL SIGN', 'CONGRATULATIONS BUNTING', '12 RED ROSE PEG PLACE SETTINGS', 'BLUE NEW BAROQUE FLOCK CANDLESTICK', 'EDWARDIAN PARASOL NATURAL', 'JUMBO STORAGE BAG SUKI', 'CHARLOTTE BAG SUKI DESIGN', 'REX CASH+CARRY JUMBO SHOPPER', 'LOVE BUILD

([('PINK HAPPY BIRTHDAY BUNTING', 'WHITE HANGING HEART T-LIGHT HOLDER'),
  ('PINK HAPPY BIRTHDAY BUNTING', 'PAPER BUNTING RETRO SPOTS'),
  ('PINK HAPPY BIRTHDAY BUNTING', 'PAPER BUNTING VINTAGE PAISLEY'),
  ('PINK HAPPY BIRTHDAY BUNTING', 'VINTAGE UNION JACK BUNTING'),
  ('PINK HAPPY BIRTHDAY BUNTING', 'EDWARDIAN PARASOL NATURAL'),
  ('REGENCY CAKESTAND 3 TIER', 'DOORMAT UNION JACK GUNS AND ROSES'),
  ('REGENCY CAKESTAND 3 TIER', 'SCOTTIE DOG HOT WATER BOTTLE'),
  ('REGENCY CAKESTAND 3 TIER', 'HOT WATER BOTTLE BABUSHKA '),
  ('REGENCY CAKESTAND 3 TIER', 'WHITE HANGING HEART T-LIGHT HOLDER'),
  ('REGENCY CAKESTAND 3 TIER', 'PAPER BUNTING RETRO SPOTS'),
  ('REGENCY CAKESTAND 3 TIER', 'PAPER BUNTING VINTAGE PAISLEY'),
  ('REGENCY CAKESTAND 3 TIER', 'HOME SWEET HOME MUG'),
  ('REGENCY CAKESTAND 3 TIER', 'VINTAGE UNION JACK BUNTING'),
  ('REGENCY CAKESTAND 3 TIER', '60 TEATIME FAIRY CAKE CASES'),
  ('REGENCY CAKESTAND 3 TIER', '12 RED ROSE PEG PLACE SETTINGS'),
  ('REGENCY CAKESTAND 3 TIER'
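
The pair-counting step scans every transaction once per candidate pair. A possible alternative, sketched below under the assumption that an absolute count threshold `min_count` is given, visits each transaction once and counts the pairs it contains with `itertools.combinations`. The name `frequent_pairs` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, frequent_items, min_count):
    """Count pairs of frequent items in a single pass over the transactions."""
    frequent_items = set(frequent_items)
    counts = Counter()
    for t in transactions:
        # Keep only frequent items, deduplicated and sorted so each pair has one canonical order
        kept = sorted(set(t) & frequent_items)
        counts.update(combinations(kept, 2))
    return {pair: c for pair, c in counts.items() if c >= min_count}

# Hypothetical baskets; "FORK" is assumed not to be a frequent item
transactions = [
    ["MUG", "PLATE", "SPOON"],
    ["MUG", "PLATE"],
    ["PLATE", "MUG", "FORK"],
    ["SPOON"],
]

f2 = frequent_pairs(transactions, {"MUG", "PLATE", "SPOON"}, min_count=2)
print(f2)
```

Dropping the infrequent items before forming pairs is the Apriori pruning idea applied at the level of each transaction.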

In [15]:
def algoritmo_apriori3(data,sop):
  # Same idea as before, but with more cases to check
  F2,lenF2, F2_ =algoritmo_apriori2(data,sop)  # use the argument, not the global X
  total = data.shape[0]
  soporte = int(sop*total//1 + 1)

  C3 = {}   # candidate triples
  F3_ = {}  # occurrence count of each candidate triple
  F3 = []   # frequent triples

  # Join two frequent pairs that share an item; keep the triple only if
  # the remaining pair is also frequent (Apriori pruning)
  for k1 in range(lenF2-1):
      for k2  in range(k1+1,lenF2):
          j1 = F2[k1]
          j2 = F2[k2]

          if (j1[0] in j2 or j1[1] in j2) and j1 != j2:

              # Possible positions of the shared item in the two pairs
              if j1[0] == j2[0]:
                  if (j1[1],j2[1]) in F2 or (j2[1],j1[1]) in F2:
                      A = [j1[0],j1[1],j2[1]]
                      A = tuple(sorted(A))
                      C3[A] = 0

              elif j1[0] == j2[1]:
                  if (j1[1],j2[0]) in F2 or (j2[0],j1[1]) in F2:
                      A = [j1[0],j1[1],j2[0]]
                      A = tuple(sorted(A))
                      C3[A] = 0

              elif j1[1] == j2[0]:
                  if (j1[0],j2[1]) in F2 or (j2[1],j1[0]) in F2:
                      A = [j1[0],j1[1],j2[1]]
                      A = tuple(sorted(A))
                      C3[A] = 0

              elif j1[1] == j2[1]:
                  if (j1[0],j2[0]) in F2 or (j2[0],j1[0]) in F2:
                      A = [j1[0],j1[1],j2[0]]
                      A = tuple(sorted(A))
                      C3[A] = 0

  # Count the transactions that contain all three items of each candidate
  for j,v,k in C3.keys():
      for j1 in range(total):
          a_ = data.iloc[j1,1]
          if j in a_ and v in a_ and k in a_:
              b_ = tuple(sorted([j,v,k]))
              if b_ in F3_:  # was F2_, which left every count stuck at 1
                  F3_[b_] += 1
              else:
                  F3_[b_] = 1

  # Keep the triples whose count exceeds the support threshold
  for j,v in F3_.items():
      if v > soporte:
          F3.append(j)

  print('El conjunto F3 es ',F3)
  print('El tamaño del conjunto F3 es ',len(F3))
  return F3, len(F3), F3_



In [16]:
algoritmo_apriori3(X,0.004)

El conjunto F1 es  ['PINK HAPPY BIRTHDAY BUNTING', 'RED RETROSPOT CUP', 'REGENCY CAKESTAND 3 TIER', 'DOORMAT UNION JACK GUNS AND ROSES', 'BLACK BAROQUE CARRIAGE CLOCK', 'LARGE HEART MEASURING SPOONS', 'PARTY BUNTING', 'ANTIQUE SILVER TEA GLASS ETCHED', 'RETRO SPOT CAKE STAND', 'SCOTTIE DOG HOT WATER BOTTLE', 'HOT WATER BOTTLE BABUSHKA ', 'ANTIQUE SILVER TEA GLASS ENGRAVED', 'GLASS JAR MARMALADE ', 'WHITE HANGING HEART T-LIGHT HOLDER', 'BLACK HEART CARD HOLDER', 'PAPER BUNTING RETRO SPOTS', 'PAPER BUNTING VINTAGE PAISLEY', 'LAVENDER SCENTED FABRIC HEART', 'KINGS CHOICE MUG', 'POTTING SHED TEA MUG', 'HOME SWEET HOME MUG', 'ENGLISH ROSE DESIGN PEG BAG', 'VINTAGE UNION JACK BUNTING', '60 TEATIME FAIRY CAKE CASES', 'HOME SWEET HOME METAL SIGN ', 'WASHROOM METAL SIGN', 'CONGRATULATIONS BUNTING', '12 RED ROSE PEG PLACE SETTINGS', 'BLUE NEW BAROQUE FLOCK CANDLESTICK', 'EDWARDIAN PARASOL NATURAL', 'JUMBO STORAGE BAG SUKI', 'CHARLOTTE BAG SUKI DESIGN', 'REX CASH+CARRY JUMBO SHOPPER', 'LOVE BUILD
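
The case analysis used to build the candidate triples can also be expressed through the pruning rule itself: a triple is a candidate only if its three 2-item subsets are all frequent. The sketch below is a simplified illustration on made-up pairs, not the notebook's implementation, and the name `candidate_triples` is an assumption. It enumerates every triple of items, which is fine for small inputs but not tuned for large ones.

```python
from itertools import combinations

def candidate_triples(frequent_pairs):
    """Return sorted triples whose three 2-item subsets are all frequent pairs."""
    pairs = {tuple(sorted(p)) for p in frequent_pairs}
    items = sorted({item for p in pairs for item in p})
    candidates = []
    for triple in combinations(items, 3):
        # combinations() of a sorted triple yields sorted pairs, matching `pairs`
        if all(sub in pairs for sub in combinations(triple, 2)):
            candidates.append(triple)
    return candidates

# Hypothetical frequent pairs
f2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
c3 = candidate_triples(f2)
print(c3)
```

Only `("A", "B", "C")` survives, because every triple containing `"D"` would need a frequent pair such as `("A", "D")` or `("C", "D")`, and neither is in `f2`.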