<a href="https://colab.research.google.com/github/iamemc/PD_01/blob/main/PD_202021_P1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Mining / Prospeção de Dados

## Diogo Soares and Sara C. Madeira, 2020/21

# Project 1 - Pattern Mining

## Logistics 
**_Read Carefully_**

**Students should work in teams of 2 or 3 people**. 

**TASK 3 - Spring vs Summer Purchases** must be done only by groups of 3 people.

Individual projects might be allowed (with valid justification), but will not have better grades for this reason. 

The quality of the project will dictate its grade, not the number of people working.

**The project's solution should be uploaded in Moodle before the end of `March, 28th (23:59)`.** 

Students should **upload a `.zip` file** containing all the files necessary for project evaluation. 
Groups should be registered in [Moodle](https://moodle.ciencias.ulisboa.pt/mod/groupselect/view.php?id=139096) and the zip file should be identified as `PDnn.zip` where `nn` is the number of your group.

**It is mandatory to produce a Jupyter notebook containing code and text/images/tables/etc. describing the solution and the results. Projects not delivered in this format will not be graded. You can use `PD_202021_P1.ipynb` as a template. In your `.zip` folder you should also include an HTML version of your notebook with all the outputs** (File > Download as > HTML).

**Decisions should be justified and results should be critically discussed.** 

_Project solutions containing only code and outputs without discussion will achieve a maximum grade of 10 out of 20._

## Dataset and Tools



In this project you will analyse data from an online store collected over 4 months (April - July 2014). The folder `data` contains three files that you should use to obtain the dataset to be used in pattern mining. 

The file `store-buys.dat` comprises the buy events of the users over the items. It contains **318.444 sessions**. Each record/line in the file has the following fields (in this order): 

* **Session ID** - the id of the session. In one session there are one or many buying events. Could be represented as an integer number.
* **Timestamp** - the time when the buy occurred. Format of YYYY-MM-DDThh:mm:ss.SSSZ
* **Item ID** – the unique identifier of item that has been bought. Could be represented as an integer number. 
* **Price** – the price of the item. Could be represented as an integer number.
* **Quantity** – the quantity in this buying.  Could be represented as an integer number.

The file `store-clicks.dat` comprises the clicks of the users over the items. It contains **5.613.499 sessions**. Each record/line in the file has the following fields (in this order):

* **Session ID** – the id of the session. In one session there are one or many clicks. Could be represented as an integer number.
* **Timestamp** – the time when the click occurred. Format of YYYY-MM-DDThh:mm:ss.SSSZ
* **Item ID** – the unique identifier of the item that has been clicked. Could be represented as an integer number.
* **Context** – the context of the click. The value "S" indicates a special offer, "0" indicates a missing value, a number between 1 and 12 indicates a real category identifier, and any other number indicates a brand. E.g. if an item was clicked in the context of a promotion or special offer, the value is "S"; if the context was a brand, e.g. BOSCH, the value is an 8-10 digit number; if the item was clicked under a regular category, e.g. sport, the value is a number between 1 and 12.
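
The four kinds of `Context` value described above can be told apart with a small helper. This is a minimal sketch: the function name `classify_context` and the label strings are illustrative assumptions, and values that fit none of the described cases are reported as `"unknown"` rather than guessed.

```python
def classify_context(context: str) -> str:
    """Map a raw Context value to one of the kinds described in the field list."""
    value = context.strip()
    if value == "S":
        return "special_offer"
    if value == "0":
        return "missing"
    if value.isdigit() and 1 <= int(value) <= 12:
        return "category"
    if value.isdigit():
        return "brand"  # any other number is described as a brand identifier
    return "unknown"    # not covered by the description; kept separate on purpose


# Illustrative values only; the brand number below is made up.
examples = ["S", "0", "7", "12", "2089046151"]
labels = [classify_context(v) for v in examples]
print(dict(zip(examples, labels)))
```

Reading the column as `str` (as the loading cell does) keeps `"S"` and the numeric codes in one column without type conflicts.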
 
The file `products.csv` comprises the list of products sold by the online store. It contains **46.294 different products** associated with **123 different subcategories**. Each record/line in the file has the following fields:

* **Item ID** - the unique identifier of the item. Could be represented as an integer number. 
* **Product Categories** - a string containing the category and subcategories of the item, e.g. `appliances.kitchen.juice`
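
A category string can be split on `.` to recover its levels, and the last level can be turned into a readable product name. The sketch below mirrors the transformation used later in this notebook (`split('.')[-1].replace('_', ' ').title()`); the helper name `split_category` is an assumption made for illustration.

```python
def split_category(category: str):
    """Return (levels, product_name) for a dotted category string."""
    levels = category.split(".")
    # The most specific level is used as a display name for the product.
    product_name = levels[-1].replace("_", " ").title()
    return levels, product_name


levels, name = split_category("appliances.kitchen.meat_grinder")
print(levels, name)
```

Note that using only the last level merges items whose category paths differ but end in the same word, which may or may not be desirable for the analysis.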


In this project you should use [Python 3](https://www.python.org), [Jupyter Notebook](http://jupyter.org) and **[MLxtend](http://rasbt.github.io/mlxtend/)**. When using MLxtend, frequent patterns can be discovered using either `Apriori` or `FP-Growth`. **Choose the pattern mining algorithm to be used.** 


## Team Identification

**GROUP PD03**

Students:

* Eduardo Carvalho - nº55881
* Filipe Santos - nº55142
* Ivo Oliveira - nº50301

## 1. Mining Frequent Itemsets and Association Rules


In this first part of the project you should load and preprocess the dataset  in order to compute frequent itemsets and generate association rules considering all the sessions.

**In what follows keep the following questions in mind and be creative!**

1. What are the most interesting products?
2. What are the most bought products?
3. Which products are bought together?
4. Can you find associations between the clicked products? 
5. Can you find associations highlighting that when people buy a product/set of products they also buy other product(s)?
6. Can you find associations highlighting that when people click on a product/set of products they also buy this product(s)?
7. Can you find relevant associated categories? 

### 1.1. Load and Preprocess Data

 **Product quantities should not be considered.**
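
One way to ignore quantities is to reduce each session to the set of distinct items it contains, so that buying the same item twice (or with `Qty` 4) counts once. The following is a minimal pure-Python sketch; the sample rows reuse session and item IDs that appear in this notebook's outputs, and the variable names are illustrative.

```python
from collections import defaultdict

# (SessionID, ItemID, Qty) rows; Qty is deliberately ignored below.
rows = [
    (11, 214821371, 1),
    (11, 214821371, 1),
    (12, 214717867, 4),
    (21, 214548744, 1),
    (21, 214838503, 1),
]

baskets = defaultdict(set)
for session_id, item_id, _qty in rows:
    baskets[session_id].add(item_id)  # a set keeps each item once per session

# One transaction (sorted list of items) per session, ordered by session ID.
transactions = [sorted(items) for _, items in sorted(baskets.items())]
print(transactions)
```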

In [1]:
#!pip install mlxtend
!pip install mlxtend --upgrade

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.preprocessing import  TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.frequent_patterns import fpgrowth

# this package provides fpgrowth for relevant associations
#!pip install fpgrowth_py
#from fpgrowth_py import fpgrowth
#itemSetList = [['eggs', 'bacon', 'soup'],
#                ['eggs', 'bacon', 'apple'],
#                ['soup', 'bacon', 'banana']]
#freqItemSet, rules = fpgrowth(itemSetList, minSup=0.5, minConf=0.5)
#print(rules)  
# [[{'beer'}, {'rice'}, 0.6666666666666666], [{'rice'}, {'beer'}, 1.0]]
# rules[0] --> rules[1], confidence = rules[2]

Requirement already up-to-date: mlxtend in /usr/local/lib/python3.7/dist-packages (0.18.0)


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
products=pd.read_csv('/content/drive/MyDrive/Datasets/projetoPD/products.csv',
                     header=None,
                     names=['ItemID','Category'],
                     dtype={'ItemID': int, 'Category':str})
buys=pd.read_csv('/content/drive/MyDrive/Datasets/projetoPD/store-buys.dat', 
                 header=None,
                 names=['SessionID','TimeStamp','ItemID','Price','Qty'],
                 dtype={'SessionID':int, 'TimeStamp':str, 'ItemID': int, 
                        'Price':int, 'Qty':int})
clicks=pd.read_csv('/content/drive/MyDrive/Datasets/projetoPD/store-clicks.dat', 
                   header=None,
                   names=['SessionID','TimeStamp','ItemID','Context'],
                   dtype={'SessionID':int, 'TimeStamp':str, 'ItemID': int, 
                        'Context':str})


In [11]:
clicks.head()

Unnamed: 0,SessionID,TimeStamp,ItemID,Context
0,1,2014-04-07T10:51:09.277Z,214536502,0
1,1,2014-04-07T10:54:09.868Z,214536500,0
2,1,2014-04-07T10:54:46.998Z,214536506,0
3,1,2014-04-07T10:57:00.306Z,214577561,0
4,2,2014-04-07T13:56:37.614Z,214662742,0


In [12]:
"""
DATABASE CORRECTIONS
products_no_duplicates - removed duplicates
buys_upd - added product names & season

rows with Price or Qty == 0 have been dropped
"""

products_no_duplicates = products.drop_duplicates().reset_index(drop=True)

buys_upd =pd.merge(buys,products_no_duplicates)
clicks_upd= pd.merge(clicks,products_no_duplicates)

buys_upd = buys_upd[buys_upd.Price>0]
buys_upd = buys_upd[buys_upd.Qty>0]

# in clicks_upd, check Context == 0 (missing value)

product_name_buys=[]
for cat in buys_upd.Category:
  product_name_buys.append(cat.split('.')[-1].replace('_',' ').title())
product_name_clicks=[]
for cat in clicks_upd.Category:
  product_name_clicks.append(cat.split('.')[-1].replace('_',' ').title())
#product_name[:5]

buys_upd['ProductName']=product_name_buys
buys_upd=buys_upd.drop('Category', axis=1)
clicks_upd['ProductName']=product_name_clicks
clicks_upd=clicks_upd.drop('Category', axis=1)

dates_buys =[]
season_buys =[]
dates_clicks =[]
season_clicks =[]

for i in buys_upd["TimeStamp"]:
  dates_buys.append(i[:10])
  if i[5:7] == "04" or i[5:7] == "05":
     season_buys.append("Spring")
  elif i[5:7] == "06" or i[5:7] == "07":
    season_buys.append("Summer")
  else:
    season_buys.append("Other")

for i in clicks_upd["TimeStamp"]:
  dates_clicks.append(i[:10])
  if i[5:7] == "04" or i[5:7] == "05":
     season_clicks.append("Spring")
  elif i[5:7] == "06" or i[5:7] == "07":
    season_clicks.append("Summer")
  else:
    season_clicks.append("Other")


weekday_buys=[]
buys_upd["TimeStamp"] = pd.to_datetime(buys_upd["TimeStamp"])
buys_upd["Weekday_Num"]=buys_upd["TimeStamp"].dt.dayofweek 

for i in buys_upd["Weekday_Num"]:
  if i < 5: 
    weekday_buys.append("Weekday")
  else :
    weekday_buys.append("Weekend")

weekday_clicks=[]
clicks_upd["TimeStamp"] = pd.to_datetime(clicks_upd["TimeStamp"])
clicks_upd["Weekday_Num"]=clicks_upd["TimeStamp"].dt.dayofweek 

for i in clicks_upd["Weekday_Num"]:
  if i < 5: 
    weekday_clicks.append("Weekday")
  else :
    weekday_clicks.append("Weekend")

  
buys_upd = buys_upd.drop(columns=['TimeStamp'])
buys_upd.insert(1, "TimeStamp", dates_buys)
buys_upd["Season"] = season_buys
buys_upd["Weekday"] = weekday_buys
buys_upd.sort_values(by='TimeStamp')
buys_upd.tail(10)

clicks_upd = clicks_upd.drop(columns=['TimeStamp'])
clicks_upd.insert(1, "TimeStamp", dates_clicks)
clicks_upd["Season"] = season_clicks
clicks_upd["Weekday"] = weekday_clicks
clicks_upd.sort_values(by='TimeStamp')
clicks_upd.tail(10)

#unsure about what this does

#products=products.sort_values(by='ItemID').reset_index(drop=True)
#products.head()

Unnamed: 0,SessionID,TimeStamp,ItemID,Context,ProductName,Weekday_Num,Season,Weekday
20704503,6920814,2014-07-27,214854470,11,Calculator,6,Summer,Weekend
20704504,6924814,2014-07-27,214854941,11,Calculator,6,Summer,Weekend
20704505,6923888,2014-07-23,214807087,11,Glove,2,Summer,Weekday
20704506,6923122,2014-07-28,214609644,3,Chair,0,Summer,Weekday
20704507,6929467,2014-07-23,214822490,11,Bath,2,Summer,Weekday
20704508,6928129,2014-07-28,214535055,7,Dolls,0,Summer,Weekday
20704509,6927372,2014-07-23,214818485,11,Chair,2,Summer,Weekday
20704510,6927372,2014-07-23,214818485,11,Chair,2,Summer,Weekday
20704511,6926506,2014-07-27,214646096,7,Glove,6,Summer,Weekend
20704512,6926506,2014-07-27,214646096,7,Glove,6,Summer,Weekend
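
The date, weekday/weekend and season columns built with explicit loops above can also be derived with vectorized pandas operations. This is a sketch under assumptions: the toy frame and the times of day in the last two timestamps are made up for illustration (the dates themselves appear in the output above), and the month-to-season mapping follows the April/May = Spring, June/July = Summer convention used in this project.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "TimeStamp": [
        "2014-04-07T10:51:09.277Z",
        "2014-07-27T08:00:00.000Z",  # time of day is illustrative
        "2014-07-23T12:30:00.000Z",  # time of day is illustrative
    ]
})

ts = pd.to_datetime(df["TimeStamp"])
df["Date"] = ts.dt.strftime("%Y-%m-%d")
df["Weekday_Num"] = ts.dt.dayofweek  # Monday = 0 ... Sunday = 6
df["Weekday"] = np.where(df["Weekday_Num"] < 5, "Weekday", "Weekend")

season_by_month = {4: "Spring", 5: "Spring", 6: "Summer", 7: "Summer"}
df["Season"] = ts.dt.month.map(season_by_month).fillna("Other")
print(df)
```

On frames with millions of rows, as with the clicks file, this avoids the per-row Python loop and the intermediate lists.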


In [5]:
buys_upd=buys_upd.sort_values(by='SessionID').reset_index(drop=True)
buys_upd.head()

Unnamed: 0,SessionID,TimeStamp,ItemID,Price,Qty,ProductName,Weekday_Num,Season,Weekday
0,11,2014-04-03,214821371,1046,1,Blender,3,Spring,Weekday
1,11,2014-04-03,214821371,1046,1,Blender,3,Spring,Weekday
2,12,2014-04-02,214717867,1778,4,Bag,2,Spring,Weekday
3,21,2014-04-07,214548744,3141,1,Skates,0,Spring,Weekday
4,21,2014-04-07,214838503,18745,1,Clocks,0,Spring,Weekday


There are 113 duplicate records. We need to remove them before proceeding with the analysis.

In [6]:
buys_upd=buys_upd.drop_duplicates().reset_index(drop=True)
len(buys_upd)

157184

In [7]:
clicks_upd=clicks.sort_values(by='SessionID').reset_index(drop=True)
clicks_upd.head()

Unnamed: 0,SessionID,TimeStamp,ItemID,Context
0,1,2014-04-07T10:51:09.277Z,214536502,0
1,1,2014-04-07T10:54:09.868Z,214536500,0
2,1,2014-04-07T10:54:46.998Z,214536506,0
3,1,2014-04-07T10:57:00.306Z,214577561,0
4,2,2014-04-07T13:56:37.614Z,214662742,0


In [None]:
len(clicks)-len(clicks.drop_duplicates())

23

There are 46 duplicate records. We need to remove them before proceeding with the analysis.
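
Duplicate counts depend on how they are measured: `len(df) - len(df.drop_duplicates())` counts only the surplus copies, while `duplicated(keep=False)` counts every row that has at least one identical twin. The sketch below shows both on a toy frame (made-up data); whether this difference accounts for the two figures reported above has not been verified here.

```python
import pandas as pd

df = pd.DataFrame({
    "SessionID": [1, 1, 2, 3, 3, 3],
    "ItemID":    [10, 10, 20, 30, 30, 30],
})

# Rows that repeat an earlier row (the first occurrence is not counted).
extra_copies = int(df.duplicated().sum())
# Every row that belongs to a group of identical rows.
all_copies = int(df.duplicated(keep=False).sum())

deduped = df.drop_duplicates().reset_index(drop=True)
print(extra_copies, all_copies, len(deduped))
```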

In [7]:
clicks_upd=clicks_upd.drop_duplicates().reset_index(drop=True)

In [19]:
#1. What are the most interesting products?
click_most_interesting = clicks.groupby("ItemID", as_index=False).size().sort_values(by="size", ascending=False)
click_most_interesting = click_most_interesting.head(10)

list_click_most_interesting = list(click_most_interesting["ItemID"])
most_int = pd.merge(click_most_interesting, products_no_duplicates)
most_int.head(10)

Unnamed: 0,ItemID,size,Category
0,643078800,147419,computers.gaming
1,214829878,102563,sport.tennis
2,214826610,67473,computers.components.memory
3,214834880,61253,appliances.environment.fan
4,214839973,59083,medicine.tools.tonometer
5,214748336,55844,computers.components.videocards
6,214834877,53308,appliances.environment.fan
7,214835017,51347,appliances.kitchen.toster
8,214836932,48943,computers.peripherals.monitor
9,214821309,48791,appliances.kitchen.blender


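The most interesting products, measured by number of clicks, are led by item 643078800 (`computers.gaming`) with 147,419 clicks, followed by item 214829878 (`sport.tennis`) with 102,563 clicks and item 214826610 (`computers.components.memory`) with 67,473 clicks.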



In [20]:
#2. What are the most bought products?

df4 = pd.merge(buys, products_no_duplicates)

# Sum only the Qty column per category, then sort from most to least bought
quantities = df4.groupby("Category", as_index=False)["Qty"].sum().sort_values(by="Qty", ascending=False)

quantities = quantities[["Category","Qty"]]
quantities = quantities.rename(columns={"Qty": "Quantity"})
quantities = quantities.reset_index(drop=True)

quantities.head(10)


Unnamed: 0,Category,Quantity
0,appliances.kitchen.blender,40163
1,computers.components.memory,23810
2,sport.tennis,16864
3,appliances.kitchen.meat_grinder,11101
4,electronics.video.tv,9055
5,appliances.environment.fan,7387
6,medicine.tools.tonometer,6156
7,computers.notebook,5667
8,appliances.iron,5198
9,accessories.bag,5066


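The most bought product category is blenders (`appliances.kitchen.blender`), with a total quantity of 40,163, followed by `computers.components.memory` (23,810) and `sport.tennis` (16,864).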

### 1.2. Compute Frequent Itemsets

* Compute frequent itemsets considering a minimum support of X%. 
* Present frequent itemsets organized by length (number of items). 
* List frequent 1-itemsets, 2-itemsets, 3-itemsets, etc with support of at least Y%.
* Change X and Y when it makes sense and discuss the results.
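
To make the notions of support and "itemsets organized by length" concrete, the sketch below enumerates frequent itemsets by brute force on five made-up transactions. It is meant for intuition only, not as a replacement for `apriori` or `fpgrowth`; the function name and the data are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Toy transactions (made up); product names borrowed from this notebook.
transactions = [
    {"Blender", "Memory"},
    {"Blender", "Memory", "Fan"},
    {"Memory", "Fan"},
    {"Blender"},
    {"Tennis", "Fan"},
]


def frequent_itemsets(transactions, min_support):
    """Return {length: {itemset_tuple: support}} by exhaustive counting."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    by_length = defaultdict(dict)
    for k in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, k):
            count = sum(1 for t in transactions if set(candidate) <= t)
            support = count / n
            if support >= min_support:
                by_length[k][candidate] = support
                found = True
        if not found:
            break  # no frequent k-itemset, so no larger itemset can be frequent
    return dict(by_length)


result = frequent_itemsets(transactions, min_support=0.4)
for length, itemsets in result.items():
    print(length, itemsets)
```

With a minimum support of 40%, three single items and two pairs survive; raising the threshold to 50% would leave only the single items.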

In [9]:
#3. Which products are bought together?
all_sessions_buy={}
for i in range(len(buys_upd)):
  all_sessions_buy[buys_upd.SessionID[i]]=[]

for i in range(len(buys_upd)):
  if buys_upd.ProductName[i] not in all_sessions_buy.get(buys_upd.SessionID[i]):
    all_sessions_buy[buys_upd.SessionID[i]].append(buys_upd.ProductName[i])

transactions_buy=list(all_sessions_buy.values())
#3.1 Which products are viewed together?
all_sessions_click={}
for i in range(len(clicks_upd)):
  all_sessions_click[clicks_upd.SessionID[i]]=[]

for i in range(len(clicks_upd)):
  if clicks_upd.ProductName[i] not in all_sessions_click.get(clicks_upd.SessionID[i]):
    all_sessions_click[clicks_upd.SessionID[i]].append(clicks_upd.ProductName[i])

transactions_click=list(all_sessions_click.values())

AttributeError: ignored
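
The error recorded above is consistent with `clicks_upd` lacking a `ProductName` column when the cell ran, for example because an earlier cell reassigned it from the raw `clicks` frame; this is an inference from the code, not something verified here. The sketch below builds the same per-session lists of distinct products with `groupby`, and fails with an explicit message if a required column is absent. The helper name `build_transactions` and the toy frame are assumptions.

```python
import pandas as pd


def build_transactions(df, session_col="SessionID", item_col="ProductName"):
    """Return one list of distinct items per session, ordered by session ID."""
    missing = {session_col, item_col} - set(df.columns)
    if missing:
        raise KeyError(f"missing columns: {sorted(missing)}")
    # unique() keeps each item once per session, in order of first appearance.
    grouped = df.groupby(session_col)[item_col].unique()
    return [list(items) for items in grouped]


toy = pd.DataFrame({
    "SessionID": [11, 11, 12, 21, 21],
    "ProductName": ["Blender", "Blender", "Bag", "Skates", "Clocks"],
})
transactions = build_transactions(toy)
print(transactions)
```

Compared with indexing row by row, the grouped version makes a single pass over the data, which matters for the clicks file.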

In [13]:
len(transactions_buy)
len(transactions_click)

87318

In [14]:
#Compute binary databases
tr_enc = TransactionEncoder()
#buys
trans_array_buy = tr_enc.fit(transactions_buy).transform(transactions_buy)
binary_database_buy = pd.DataFrame(trans_array_buy, columns=tr_enc.columns_)
binary_database_buy.head(3)
#clicks
trans_array_click = tr_enc.fit(transactions_click).transform(transactions_click)
binary_database_click = pd.DataFrame(trans_array_click, columns=tr_enc.columns_)
binary_database_click.head(3)

Unnamed: 0,Acoustic,Air Conditioner,Air Heater,Alarm,Anti Freeze,Bag,Bath,Battery,Bed,Bicycle,Blanket,Blender,Bottles,Cabinet,Calculator,Camera,Carriage,Cartrige,Cdrw,Chair,Climate,Clocks,Coffee Grinder,Coffee Machine,Compressor,Cooler,Costume,Cpu,Cultivator,Desktop,Diapers,Dictaphone,Dishwasher,Diving,Dolls,Drill,Ebooks,Fan,Faucet,Fryer,...,Scanner,Screw,Sewing Machine,Shelving,Shirt,Shoes,Skates,Ski,Smartphone,Snowboard,Sock,Sofa,Sound Card,Stapler,Steam Cleaner,Steam Cooker,Subwoofer,Swing,Table,Tablet,Telephone,Tennis,Toilet,Tonometer,Toster,Toys,Trainer,Trousers,Tshirt,Tv,Umbrella,Vacuum,Video,Videocards,Videoregister,Washer,Water Heater,Watering,Weather Station,Welding
0,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [15]:
#Compute itemsets min_support = 1% apriori and association rules
#buys
frequent_itemsets_buy = apriori(binary_database_buy, min_support=0.01, use_colnames=True)
frequent_itemsets_buy
# Level-1 rules (minimum confidence 1%)
rules1_buy = association_rules(frequent_itemsets_buy, metric="confidence", min_threshold=0.01)
#rules1_buy
# add new column length
frequent_itemsets_buy['length'] = frequent_itemsets_buy['itemsets'].apply(lambda x: len(x))
# keep itemsets with length >= 2
frequent_2_itemsets_buy = frequent_itemsets_buy[frequent_itemsets_buy['length'] >= 2].reset_index(drop=True)
#frequent_2_itemsets_buy

#clicks
frequent_itemsets_click = apriori(binary_database_click, min_support=0.01, use_colnames=True)
frequent_itemsets_click
# Level-1 rules (minimum confidence 1%)
rules1_click = association_rules(frequent_itemsets_click, metric="confidence", min_threshold=0.01)
#rules1_click
# add new column length
frequent_itemsets_click['length'] = frequent_itemsets_click['itemsets'].apply(lambda x: len(x))
# keep itemsets with length >= 2
frequent_2_itemsets_click = frequent_itemsets_click[frequent_itemsets_click['length'] >= 2].reset_index(drop=True)
#frequent_2_itemsets_click


In [16]:
# FP-Growth is better than Apriori
#buys
frequent_itemsets_fpg_buy=fpgrowth(binary_database_buy, min_support=0.01,use_colnames=True)
frequent_itemsets_fpg_buy
# Generate association rules with confidence >= 1% (min_threshold=0.01)

rules_buy = association_rules(frequent_itemsets_fpg_buy, metric = "confidence", min_threshold=0.01)
rules_buy
#clicks
frequent_itemsets_fpg_click=fpgrowth(binary_database_click, min_support=0.01,use_colnames=True)
frequent_itemsets_fpg_click
# Generate association rules with confidence >= 1% (min_threshold=0.01)

rules_click = association_rules(frequent_itemsets_fpg_click, metric = "confidence", min_threshold=0.01)
rules_click

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Fan),(Memory),0.0548,0.133695,0.029169,0.532288,3.981357,0.021843,1.85222
1,(Memory),(Fan),0.133695,0.0548,0.029169,0.218177,3.981357,0.021843,1.20897
2,(Fan),(Blender),0.0548,0.145136,0.010834,0.197701,1.362177,0.002881,1.065518
3,(Blender),(Fan),0.145136,0.0548,0.010834,0.074647,1.362177,0.002881,1.021448
4,(Fan),(Tennis),0.0548,0.102213,0.015083,0.275235,2.692771,0.009482,1.238729
5,(Tennis),(Fan),0.102213,0.0548,0.015083,0.147563,2.692771,0.009482,1.108821
6,(Blender),(Memory),0.145136,0.133695,0.02823,0.194508,1.454861,0.008826,1.075498
7,(Memory),(Blender),0.133695,0.145136,0.02823,0.211153,1.454861,0.008826,1.083688
8,(Memory),(Tennis),0.133695,0.102213,0.01703,0.127377,1.246197,0.003364,1.028838
9,(Tennis),(Memory),0.102213,0.133695,0.01703,0.166611,1.246197,0.003364,1.039496


### 1.3. Generate Association Rules from Frequent Itemsets

* Generate association rules with a chosen value (C) for minimum confidence. 
* Generate association rules with a chosen value (L) for minimum lift. 
* Generate association rules with both confidence >= C% and lift >= L.
* Change C and L when it makes sense and discuss the results.
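
`association_rules` accepts a single metric and threshold per call, so a combined condition on confidence and lift can be applied afterwards as a boolean filter on the resulting DataFrame. The sketch below uses four rules copied from the output shown earlier in this notebook; the thresholds `C = 0.25` and `L = 2.0` are illustrative assumptions, not recommended values.

```python
import pandas as pd

# Four rules taken from the rules table printed earlier in this notebook.
rules = pd.DataFrame({
    "antecedents": [frozenset({"Fan"}), frozenset({"Memory"}),
                    frozenset({"Fan"}), frozenset({"Memory"})],
    "consequents": [frozenset({"Memory"}), frozenset({"Fan"}),
                    frozenset({"Tennis"}), frozenset({"Tennis"})],
    "confidence": [0.532288, 0.218177, 0.275235, 0.127377],
    "lift": [3.981357, 3.981357, 2.692771, 1.246197],
})

C, L = 0.25, 2.0  # illustrative thresholds

by_confidence = rules[rules["confidence"] >= C]
by_lift = rules[rules["lift"] >= L]
by_both = rules[(rules["confidence"] >= C) & (rules["lift"] >= L)]
print(by_both[["antecedents", "consequents", "confidence", "lift"]])
```

Because lift is symmetric, (Fan) → (Memory) and (Memory) → (Fan) share the same lift; adding the confidence condition is what separates the two directions.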

### 1.4. Take a Look at Maximal Patterns: Compute Maximal Frequent Itemsets
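
A frequent itemset is maximal when none of its proper supersets is frequent. The sketch below filters a list of frequent itemsets down to the maximal ones in plain Python; the input list is made up for illustration. MLxtend also ships an `fpmax` function in `mlxtend.frequent_patterns` that mines maximal itemsets directly, which is likely the more practical choice on the full dataset.

```python
def maximal_itemsets(frequent):
    """Keep only itemsets that have no proper superset in the same collection."""
    sets = [frozenset(s) for s in frequent]
    return [s for s in sets if not any(s < other for other in sets)]


# Illustrative frequent itemsets (made up).
frequent = [
    {"Blender"}, {"Fan"}, {"Memory"}, {"Tennis"},
    {"Fan", "Memory"}, {"Blender", "Memory"},
]
maximal = maximal_itemsets(frequent)
print([sorted(s) for s in maximal])
```

Here `{"Tennis"}` stays because no frequent pair contains it, while the other single items are absorbed by the pairs that include them.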

### 1.5. Conclusions 

## 2. Week vs Weekend Purchases

In this part of the project you should analyse the consumption patterns during the week vs during the weekend.

**In what follows keep the following questions in mind and be creative!**

1. Are the most interesting products the same during the week and the weekend? 
2. What are the most bought products during the week? And during the weekend?
3. Are there differences between the sets of products bought during the week and the weekend?
4. Can you find different associations highlighting that when people click on a product/set of products they also buy this product(s) during the week vs the weekend?
5. Discuss the results obtained for the week sessions vs weekend sessions.

### 2.1. Load and Preprocess Data

 **Product quantities should not be considered.**
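
Assuming the buys table already carries a `Weekday` label with the values `"Weekday"` and `"Weekend"` (as built in Section 1.1), the transactions for each period can be obtained by grouping first on that label and then on the session. The toy frame below is illustrative; session 30 and its product are made up.

```python
import pandas as pd

buys = pd.DataFrame({
    "SessionID": [11, 12, 21, 21, 30],
    "ProductName": ["Blender", "Bag", "Skates", "Clocks", "Glove"],
    "Weekday": ["Weekday", "Weekday", "Weekday", "Weekday", "Weekend"],
})

# One list of per-session transactions for each period label.
transactions_by_period = {
    period: [list(items)
             for items in group.groupby("SessionID")["ProductName"].unique()]
    for period, group in buys.groupby("Weekday")
}
print(transactions_by_period)
```

A session that starts on a Friday and ends on a Saturday would contribute to both groups under this split; whether that is acceptable is a decision to justify in the discussion.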

### 2.2. Compute Frequent Itemsets

* Compute frequent itemsets considering a minimum support of X%. 
* Present frequent itemsets organized by length (number of items). 
* List frequent 1-itemsets, 2-itemsets, 3-itemsets, etc with support of at least Y%.
* Change X and Y when it makes sense and discuss the results.

### 2.3. Generate Association Rules from Frequent Itemsets

* Generate association rules with a chosen value (C) for minimum confidence. 
* Generate association rules with a chosen value (L) for minimum lift. 
* Generate association rules with both confidence >= C% and lift >= L.
* Change C and L when it makes sense and discuss the results.
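
Once rules have been generated separately for two groups of sessions, they can be compared by treating each rule as an (antecedent, consequent) key. The sketch below reports rules found in only one group and the change in lift for shared rules. All names and lift values are made up for illustration and are not results from this dataset.

```python
def compare_rules(rules_a, rules_b):
    """Compare two {(antecedent, consequent): lift} dictionaries."""
    keys_a, keys_b = set(rules_a), set(rules_b)
    return {
        "only_a": sorted(keys_a - keys_b),
        "only_b": sorted(keys_b - keys_a),
        "lift_change": {k: rules_b[k] - rules_a[k] for k in sorted(keys_a & keys_b)},
    }


# Hypothetical lifts, NOT computed from the store data.
week = {("Fan", "Memory"): 3.9, ("Fan", "Tennis"): 2.7}
weekend = {("Fan", "Memory"): 3.1, ("Blender", "Memory"): 1.5}

diff = compare_rules(week, weekend)
print(diff)
```

Since the two groups have different numbers of sessions, a rule missing from one side may simply have fallen below the minimum support there, so it is worth checking its support before concluding that the association is absent.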

### 2.4. Conclusions 

## 3. [Only Groups of 3] Spring vs Summer Purchases

In this part of the project you should analyse the consumption patterns during the Spring months (April and May) vs Summer months (June and July).

**In what follows keep the following questions in mind and be creative!**

1. Are the most interesting products the same during the Spring and the Summer? 
2. What are the most bought products during the Spring? And during the Summer?
3. Are there differences between the sets of products bought during the Spring and the Summer?
4. Can you find different associations highlighting that when people click on a product/set of products they also buy this product(s) during the Spring vs the Summer?
5. Discuss the results obtained for the Spring sessions vs Summer sessions.

### 3.1. Load and Preprocess Data

 **Product quantities should not be considered.**

### 3.2. Compute Frequent Itemsets

* Compute frequent itemsets considering a minimum support of X%. 
* Present frequent itemsets organized by length (number of items). 
* List frequent 1-itemsets, 2-itemsets, 3-itemsets, etc with support of at least Y%.
* Change X and Y when it makes sense and discuss the results.

### 3.3. Generate Association Rules from Frequent Itemsets

* Generate association rules with a chosen value (C) for minimum confidence. 
* Generate association rules with a chosen value (L) for minimum lift. 
* Generate association rules with both confidence >= C% and lift >= L.
* Change C and L when it makes sense and discuss the results.

### 3.4. Conclusions 

## 4. Conclusions
Draw some conclusions about this project work.