## Association Rules

Association rules identify common patterns among items in a large dataset. In this exercise, we analyze viewing behavior on a streaming platform (such as Netflix) where people watch films and series. Some patterns are easy to spot, such as people who like superhero films or people who watch cartoons.

Association rules are usually written in the form **{A} -> {B}**, which means that there is a strong relationship between items A and B. For example, a plausible rule for the streaming platform is **{Senhor dos Anéis} -> {O Hobbit}**.

If people who watch one film frequently also watch another, that is, if the two films are often watched together, the platform can use this pattern to increase the views of certain films through recommendations.

In the example above, **{Senhor dos Anéis} -> {O Hobbit}**, **{Senhor dos Anéis}** is the **antecedent** and **{O Hobbit}** is the **consequent**. Antecedents and consequents can contain multiple items; for example, a valid rule is **{Thor: Ragnarok, Vingadores: Guerra Infinita} -> {Vingadores: Ultimato}**.

Why use association rules?

* Easy to explain to non-technical people
* No need for heavy data preparation or feature engineering
* A good starting point for exploring data


## Identifying frequent patterns among video streaming users
In this example we use association rules to analyze a dataset of transactions, where each transaction consists of the films that a single user of a streaming platform watched within a given time window.

Example based on the tutorial available at: https://medium.com/@fabio.italiano/the-apriori-algorithm-in-python-expanding-thors-fan-base-501950d55be9

<img src="fig_apriori/Streaming-Movie.jpg">

### Step 1) Reading the dataset

In [1]:
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [2]:
df = pd.read_csv('dataset_movies/movie_dataset.txt',header=None)

In [3]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,The Revenant,13 Hours,Allied,Zootopia,Jigsaw,Achorman,Grinch,Fast and Furious,Ghostbusters,Wolverine,Mad Max,John Wick,La La Land,The Good Dunosaur,Ninja Turtles,The Good Dunosaur Bad Moms,2 Guns,Inside Out,Valerian,Spiderman 3
1,Beirut,Martian,Get Out,,,,,,,,,,,,,,,,,
2,Deadpool,,,,,,,,,,,,,,,,,,,
3,X-Men,Allied,,,,,,,,,,,,,,,,,,
4,Ninja Turtles,Moana,Ghost in the Shell,Ralph Breaks the Internet,John Wick,,,,,,,,,,,,,,,


Each row of the file is the set of films that a given user watched. We treat this set of films as the items of one transaction.

However, we need to transform the data into a dataframe in which each column corresponds to a film and each row to a user. Each cell contains 1 if the user watched the film and 0 otherwise.

In [4]:
import numpy as np

In [11]:
rows = df.shape[0]
rows

7501

In [6]:
filmes = set()
for i in range(rows):
    filmes = filmes.union(set(df.iloc[i].unique()))


In [7]:
np.nan in filmes

True

In [8]:
filmes.difference_update({np.nan})

In [9]:
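# Pass the titles as a list: pandas expects a list-like for `columns` (newer versions reject a set)
df_ = pd.DataFrame(columns=list(filmes), data=np.zeros((rows, len(filmes))))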

In [10]:
df_.head()

Unnamed: 0,Ant Man,Ghostbusters,Blade,Game Night,Justice League,Hotel Transylvania,Hulk,Aloha,Logan,Terminator,...,Batman,How to be Spiderman 3le,Angry Birds,Grinch,Atomic Blonde,13 Hours,The Hobbit,Pop Star,London Has Fallen,Suicide Squad
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
def set_units(x):
    return 1

In [13]:
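# Mark every film watched by user i; .loc accepts a list of column labels (.at is scalar-only)
for i in range(rows):
    df_.loc[i, df.iloc[i].dropna().tolist()] = 1.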

In [14]:
df_.head()

Unnamed: 0,Ant Man,Ghostbusters,Blade,Game Night,Justice League,Hotel Transylvania,Hulk,Aloha,Logan,Terminator,...,Batman,How to be Spiderman 3le,Angry Birds,Grinch,Atomic Blonde,13 Hours,The Hobbit,Pop Star,London Has Fallen,Suicide Squad
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
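The same transformation can be illustrated without pandas. The sketch below is only an illustration on hypothetical toy transactions (not rows from `movie_dataset.txt`); the names `transactions`, `titles`, and `one_hot` are assumptions introduced here.

```python
# Toy transactions (hypothetical data, not taken from movie_dataset.txt)
transactions = [
    ["Thor", "Avengers", "Hulk"],
    ["Moana", "Coco"],
    ["Thor", "Moana"],
]

# Sorted list of every distinct title, so the column order is reproducible
titles = sorted({title for watched in transactions for title in watched})

# One row per user: 1 if the title is in that user's transaction, else 0
one_hot = []
for watched in transactions:
    seen = set(watched)
    one_hot.append([int(title in seen) for title in titles])

print(titles)
for row in one_hot:
    print(row)
```

Sorting the titles is a design choice made here so that the columns come out in the same order on every run; iterating over a plain `set` does not guarantee that.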


### The Apriori Algorithm
A few concepts are essential for understanding the Apriori algorithm.


**Support**: the number of transactions in which the itemset appears, divided by the total number of transactions.

$$supp(X) = \frac{|\{t \in T : X \subseteq t\}|}{|T|}$$

For example, we can compute the support of the film "Jumanji" as follows.

In [21]:
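def supp(df_,X):
    # Row-wise product of the 0/1 columns in X: nonzero only when the user watched every film in X
    union = np.prod(df_[X].values,axis=1)
    # Fraction of transactions (rows) that contain all items in X
    return len(np.nonzero(union)[0])/df_.shape[0]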

In [22]:
supp(df_,["Jumanji"])

0.09825356619117451

In [23]:
supp(df_,['Jumanji','Wonder Woman'])

0.005332622317024397
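The definition of support can also be checked by hand on a tiny example. The sketch below uses hypothetical toy transactions and a helper named `support` (an assumption introduced here, distinct from the notebook's `supp`) that counts the transactions containing the whole itemset.

```python
# Toy transactions (hypothetical data, four users)
transactions = [
    {"Jumanji", "Wonder Woman"},
    {"Jumanji", "Moana"},
    {"Moana"},
    {"Jumanji"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"Jumanji"}, transactions))                  # 3 of 4 transactions
print(support({"Jumanji", "Wonder Woman"}, transactions))  # 1 of 4 transactions
```

As in the notebook output, adding an item to the itemset can only keep the support equal or lower it, because every transaction that contains the larger set also contains the smaller one.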

**Frequent itemset**: A set of items $\{i_1, i_2, ..., i_n\}$ is frequent when its support is at least a minimum support threshold, $min\_supp$.

**Confidence**: indicates how often a rule holds. The higher the confidence of $X \rightarrow Y$, the more likely it is that a transaction containing X also contains Y. It is given by:

$$conf(X \rightarrow Y) = supp(X \cup Y)/supp(X)$$


For example, the confidence of the rule **{Avengers} -> {Thor}** is given by:

In [24]:
def confidence(df_, X, Y):
    return supp(df_,X+Y)/supp(df_,X)

In [25]:
confidence(df_, ['Avengers'], ['Thor'])

0.16279069767441862

In [26]:
confidence(df_, ['Thor'],  ['Avengers'])

0.036939313984168866

In [27]:
supp(df_, ['Avengers'])

0.011465137981602452

In [28]:
supp(df_, ['Thor'])

0.05052659645380616
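The two confidences above differ because each one divides the same joint support by a different antecedent support. As a quick consistency check, assuming only the four values printed above, multiplying each confidence by the support of its antecedent should recover the same $supp(X \cup Y)$. The variable names below are introduced for this sketch.

```python
import math

# Values copied from the notebook outputs above
conf_avengers_thor = 0.16279069767441862   # conf({Avengers} -> {Thor})
conf_thor_avengers = 0.036939313984168866  # conf({Thor} -> {Avengers})
supp_avengers = 0.011465137981602452
supp_thor = 0.05052659645380616

# conf(X -> Y) * supp(X) = supp(X ∪ Y), so both products should agree
joint_from_avengers = conf_avengers_thor * supp_avengers
joint_from_thor = conf_thor_avengers * supp_thor

print(joint_from_avengers, joint_from_thor)
print(math.isclose(joint_from_avengers, joint_from_thor, rel_tol=1e-9))
```

Because "Thor" is watched far more often than "Avengers", the same joint support yields a much lower confidence when "Thor" is the antecedent.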

**When a rule satisfies both a minimum support and a minimum confidence, we call it a strong association rule.**

In general, association rule mining can be described as two steps:

1 - Find all frequent itemsets;

2 - Generate strong association rules from those itemsets.
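The two steps above can be sketched on a toy example. This is a brute-force illustration, not the Apriori algorithm itself: it enumerates every candidate itemset, which is only feasible for a handful of items. The data and the thresholds `MIN_SUPP` and `MIN_CONF` are hypothetical.

```python
from itertools import combinations

# Toy transactions (hypothetical data)
transactions = [
    {"Thor", "Avengers"},
    {"Thor", "Avengers"},
    {"Thor", "Avengers", "Hulk"},
    {"Thor"},
    {"Thor", "Hulk"},
]
MIN_SUPP = 0.6  # hypothetical minimum support
MIN_CONF = 0.7  # hypothetical minimum confidence

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Step 1: find all frequent itemsets (brute force over every candidate)
items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        s = support(frozenset(combo))
        if s >= MIN_SUPP:
            frequent[frozenset(combo)] = s

# Step 2: split each frequent itemset into antecedent -> consequent
rules = []
for itemset, supp_xy in frequent.items():
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            x = frozenset(antecedent)
            conf = supp_xy / frequent[x]  # subsets of a frequent itemset are frequent
            if conf >= MIN_CONF:
                rules.append((set(x), set(itemset - x), conf))

for x, y, conf in rules:
    print(sorted(x), "->", sorted(y), conf)
```

In this toy data only {Avengers} -> {Thor} passes the confidence threshold; the reverse rule has the same support but a lower confidence, because "Thor" appears in every transaction.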

### How does the algorithm work?

* It is called **Apriori** because it uses prior knowledge of the properties of frequent itemsets;
* It is an iterative method in which frequent $k$-itemsets are used to explore $(k+1)$-itemsets;
* **General idea**: first find the frequent itemsets of size 1 that satisfy the minimum support, denoted $L_1$. Then use $L_1$ to find $L_2$, the frequent itemsets of size 2. $L_2$ is used to find $L_3$, and so on.
* **Apriori property**: every non-empty subset of a frequent itemset is also frequent.
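The level-wise search described in the list above can be sketched in plain Python. This is a minimal, unoptimized illustration on hypothetical toy data, not the implementation used by `mlxtend`; the function name `apriori_sketch` is an assumption introduced here.

```python
from itertools import combinations

def apriori_sketch(transactions, min_supp):
    """Return {frozenset: support} for all frequent itemsets (level-wise search)."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # L1: frequent itemsets of size 1
    items = set().union(*transactions)
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_supp}
    frequent = {s: support(s) for s in current}

    k = 1
    while current:
        # Join step: merge pairs of frequent k-itemsets into (k+1)-candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step (Apriori property): every k-subset must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k))
        }
        # Keep only the candidates that reach the minimum support: this is L(k+1)
        current = {c for c in candidates if support(c) >= min_supp}
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent

# Toy transactions (hypothetical data)
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]
result = apriori_sketch(transactions, min_supp=0.6)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

With this data every single item and every pair is frequent, but {A, B, C} appears in only two of five transactions, so the search stops after the third level.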




<img src="fig_apriori/Apriori.jpg">

Source: http://www.lessons2all.com/Apriori.php

### Using the Apriori algorithm

In [29]:
frequent_itemsets = apriori(df_, min_support=0.01, use_colnames=True)

#### Viewing frequent itemsets

In [30]:
frequent_itemsets.sort_values(by='support', ascending=False)

| | support | itemsets |
|---|---|---|
| 8 | 0.238368 | (Ninja Turtles) |
| 24 | 0.179709 | (Get Out) |
| 56 | 0.174110 | (Tomb Rider) |
| 4 | 0.170911 | (Hotel Transylvania) |
| 59 | 0.163845 | (Coco) |
| 42 | 0.132116 | (John Wick) |
| 25 | 0.129583 | (Moana) |
| 61 | 0.098254 | (Jumanji) |
| 66 | 0.095321 | (Intern) |
| 44 | 0.095054 | (Spotlight) |


#### Computing association rules

In [31]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.2)
rules.sort_values(by='confidence', ascending=False)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 126 | (Get Out, Jumanji) | (Ninja Turtles) | 0.019997 | 0.238368 | 0.010132 | 0.506667 | 2.125563 | 0.005365 | 1.543848 |
| 135 | (Moana, Jumanji) | (Ninja Turtles) | 0.021997 | 0.238368 | 0.011065 | 0.503030 | 2.110308 | 0.005822 | 1.532552 |
| 154 | (Coco, Jumanji) | (Ninja Turtles) | 0.023064 | 0.238368 | 0.010932 | 0.473988 | 1.988472 | 0.005434 | 1.447937 |
| 137 | (Moana, Intern) | (Ninja Turtles) | 0.023597 | 0.238368 | 0.011065 | 0.468927 | 1.967236 | 0.005440 | 1.434136 |
| 14 | (Thor) | (Ninja Turtles) | 0.050527 | 0.238368 | 0.023064 | 0.456464 | 1.914955 | 0.011020 | 1.401255 |
| 140 | (Tomb Rider, Spotlight) | (Ninja Turtles) | 0.025197 | 0.238368 | 0.011465 | 0.455026 | 1.908923 | 0.005459 | 1.397557 |
| 151 | (Spiderman 3, Tomb Rider) | (Ninja Turtles) | 0.022930 | 0.238368 | 0.010265 | 0.447674 | 1.878079 | 0.004799 | 1.378954 |
| 128 | (Moana, Tomb Rider) | (Ninja Turtles) | 0.035462 | 0.238368 | 0.015731 | 0.443609 | 1.861024 | 0.007278 | 1.368879 |
| 131 | (Moana, Coco) | (Ninja Turtles) | 0.032129 | 0.238368 | 0.013998 | 0.435685 | 1.827780 | 0.006340 | 1.349656 |
| 146 | (Tomb Rider, Jumanji) | (Ninja Turtles) | 0.039195 | 0.238368 | 0.017064 | 0.435374 | 1.826477 | 0.007722 | 1.348914 |


Support and confidence alone are not enough to filter interesting rules. A correlation measure can also be used. **Lift** is a simple correlation measure that indicates whether the occurrence of an event A is independent of the occurrence of an event B.

**Lift**: The lift of a rule is defined as:

$$lift(X \rightarrow Y) = \frac{supp(X \cup Y)}{supp(X) \times supp(Y)}$$

* lift = 1: the occurrence of X is independent of the occurrence of Y

* lift > 1: X and Y may be positively dependent, which makes the rule useful for predicting future items

* lift < 1: the presence of X has a negative effect on the presence of Y, and vice versa.
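The three cases can be reproduced on small hypothetical datasets. In the sketch below, the helpers `support` and `toy_lift` and the three transaction lists are assumptions made for illustration; each list was built so that the lift falls into one of the cases above.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def toy_lift(x, y, transactions):
    """lift(X -> Y) = supp(X ∪ Y) / (supp(X) * supp(Y))."""
    joint = support(set(x) | set(y), transactions)
    return joint / (support(x, transactions) * support(y, transactions))

# Hypothetical transactions over two items, "X" and "Y"
independent = [{"X", "Y"}, {"X"}, {"Y"}, set()]    # joint support equals the product
positive = [{"X", "Y"}, {"X", "Y"}, set(), set()]  # X and Y always appear together
negative = [{"X"}, {"X"}, {"Y"}, {"X", "Y"}]       # X and Y rarely appear together

for name, data in [("independent", independent), ("positive", positive), ("negative", negative)]:
    print(name, toy_lift({"X"}, {"Y"}, data))
```

Lift is symmetric in X and Y, so unlike confidence it gives the same value for {X} -> {Y} and {Y} -> {X}.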


For example, the lift of the rule **{Avengers} -> {Thor}** can be computed with the following function:

In [32]:
def lift(df_, X, Y):
    return supp(df_,X+Y)/(supp(df_,X)*supp(df_,Y))

#### Viewing rules with a given confidence and lift

In [33]:
rules[ (rules['lift'] > 1.) &
       (rules['confidence'] >= 0.4) ]

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 14 | (Thor) | (Ninja Turtles) | 0.050527 | 0.238368 | 0.023064 | 0.456464 | 1.914955 | 0.01102 | 1.401255 |
| 30 | (The Good Dunosaur Bad Moms) | (Ninja Turtles) | 0.042528 | 0.238368 | 0.017064 | 0.401254 | 1.683336 | 0.006927 | 1.272045 |
| 39 | (Jumanji) | (Ninja Turtles) | 0.098254 | 0.238368 | 0.040928 | 0.416554 | 1.747522 | 0.017507 | 1.305401 |
| 42 | (Spiderman 3) | (Ninja Turtles) | 0.065858 | 0.238368 | 0.027596 | 0.419028 | 1.757904 | 0.011898 | 1.310962 |
| 117 | (Moana, Get Out) | (Ninja Turtles) | 0.030796 | 0.238368 | 0.013065 | 0.424242 | 1.779778 | 0.005724 | 1.322834 |
| 123 | (Get Out, Coco) | (Ninja Turtles) | 0.033196 | 0.238368 | 0.013465 | 0.405622 | 1.701663 | 0.005552 | 1.281394 |
| 126 | (Get Out, Jumanji) | (Ninja Turtles) | 0.019997 | 0.238368 | 0.010132 | 0.506667 | 2.125563 | 0.005365 | 1.543848 |
| 128 | (Moana, Tomb Rider) | (Ninja Turtles) | 0.035462 | 0.238368 | 0.015731 | 0.443609 | 1.861024 | 0.007278 | 1.368879 |
| 131 | (Moana, Coco) | (Ninja Turtles) | 0.032129 | 0.238368 | 0.013998 | 0.435685 | 1.82778 | 0.00634 | 1.349656 |
| 135 | (Moana, Jumanji) | (Ninja Turtles) | 0.021997 | 0.238368 | 0.011065 | 0.50303 | 2.110308 | 0.005822 | 1.532552 |


In [34]:
lift(df_,  ['Avengers'], ['Thor'])

3.221881327851752
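This value can be cross-checked against the earlier outputs. Since $lift(X \rightarrow Y) = conf(X \rightarrow Y) / supp(Y)$, dividing the confidence of {Avengers} -> {Thor} by the support of {Thor} should reproduce the lift, up to floating-point rounding. The sketch below assumes only the numbers printed in this notebook; the variable names are introduced here.

```python
import math

# Values copied from the notebook outputs
conf_avengers_thor = 0.16279069767441862  # confidence(df_, ['Avengers'], ['Thor'])
supp_thor = 0.05052659645380616           # supp(df_, ['Thor'])
reported_lift = 3.221881327851752         # lift(df_, ['Avengers'], ['Thor'])

# lift(X -> Y) = conf(X -> Y) / supp(Y)
derived_lift = conf_avengers_thor / supp_thor

print(derived_lift)
print(math.isclose(derived_lift, reported_lift, rel_tol=1e-9))
```

A lift above 3 suggests that users who watched "Avengers" are roughly three times more likely to have watched "Thor" than a user picked at random, even though the confidence of the rule is only about 0.16.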