
# Data Mining / Prospeção de Dados

## Diogo Soares and Sara C. Madeira, 2020/21

# Project 1 - Pattern Mining

## Logistics 
**_Read Carefully_**

**Students should work in teams of 2 or 3 people**. 

**TASK 3 - Spring vs Summer Purchases** must be done only by groups of 3 people.

Individual projects might be allowed (with valid justification), but working alone will not earn a better grade. 

The quality of the project will dictate its grade, not the number of people working.

**The project's solution should be uploaded in Moodle before the end of `March, 28th (23:59)`.** 

Students should **upload a `.zip` file** containing all the files necessary for project evaluation. 
Groups should be registered in [Moodle](https://moodle.ciencias.ulisboa.pt/mod/groupselect/view.php?id=139096) and the zip file should be identified as `PDnn.zip` where `nn` is the number of your group.

**It is mandatory to produce a Jupyter notebook containing code and text/images/tables/etc. describing the solution and the results. Projects not delivered in this format will not be graded. You can use `PD_202021_P1.ipynb` as a template. In your `.zip` folder you should also include an HTML version of your notebook with all the outputs** (File > Download as > HTML).

**Decisions should be justified and results should be critically discussed.** 

_Project solutions containing only code and outputs, without discussion, will achieve a maximum grade of 10 out of 20._

## Dataset and Tools



In this project you will analyse data from an online Store collected over 4 months (April - July 2014). The folder `data` contains three files that you should use to obtain the dataset to be used in pattern mining. 

The file `store-buys.dat` comprises the buy events of the users over the items. It contains **318,444 sessions**. Each record/line in the file has the following fields (in this order): 

* **Session ID** - the id of the session. One session contains one or more buy events. Could be represented as an integer number.
* **Timestamp** - the time when the buy occurred. Format: YYYY-MM-DDThh:mm:ss.SSSZ
* **Item ID** – the unique identifier of the item that has been bought. Could be represented as an integer number. 
* **Price** – the price of the item. Could be represented as an integer number.
* **Quantity** – the quantity bought in this event. Could be represented as an integer number.

The file `store-clicks.dat` comprises the clicks of the users over the items. It contains **5,613,499 sessions**. Each record/line in the file has the following fields (in this order):

* **Session ID** – the id of the session. One session contains one or more clicks. Could be represented as an integer number.
* **Timestamp** – the time when the click occurred. Format: YYYY-MM-DDThh:mm:ss.SSSZ
* **Item ID** – the unique identifier of the item that has been clicked. Could be represented as an integer number.
* **Context** – the context of the click. The value "S" indicates a special offer, "0" indicates a missing value, a number between 1 and 12 indicates a real category identifier, and any other number indicates a brand. E.g. if an item was clicked in the context of a promotion or special offer, the value is "S"; if the context was a brand (e.g. BOSCH), the value is an 8-10 digit number; if the item was clicked under a regular category (e.g. sport), the value is a number between 1 and 12. 
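
The rules for the **Context** field can be expressed as a small helper. This is only a minimal sketch: the function name `classify_context`, the returned labels and the sample brand number are assumptions made for illustration and are not part of the dataset.

```python
def classify_context(value: str) -> str:
    """Map a raw Context value to special_offer, missing, category or brand."""
    value = value.strip()
    if value == "S":
        return "special_offer"
    if value == "0":
        return "missing"
    if value.isdigit() and 1 <= int(value) <= 12:
        return "category"
    if value.isdigit():
        # Any other number is described as a brand identifier.
        return "brand"
    # Values not covered by the description of the field.
    return "unknown"


# "2089046743" is a made-up brand-like number used only for this demo.
for raw in ["S", "0", "7", "2089046743"]:
    print(raw, "->", classify_context(raw))
```

Reading the column as a string (as done with `dtype` when loading the file) keeps "S" and the numeric codes in a single field, which is what this helper expects.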
 
The file `products.csv` comprises the list of products sold by the online store. It contains **46,294 different products** associated with **123 different subcategories**. Each record/line in the file has the following fields:

* **Item ID** - the unique identifier of the item. Could be represented as an integer number. 
* **Product Categories** - a string containing the category and subcategories of the item, separated by dots. E.g. `appliances.kitchen.juice`
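
Since the category string is dot-separated, the top-level category can be split from the rest. The sketch below assumes a toy DataFrame (`products_demo`, with made-up Item IDs) and hypothetical column names `MainCategory` and `SubCategory`.

```python
import pandas as pd

# Hypothetical rows in the shape of products.csv (Item IDs are made up).
products_demo = pd.DataFrame({
    "ItemID": [1, 2, 3],
    "Category": ["appliances.kitchen.juice", "electronics.tablet", "sport.tennis"],
})

# Split once on the first dot: the left part is the top-level category,
# the right part is the remaining subcategory path.
parts = products_demo["Category"].str.split(".", n=1, expand=True)
products_demo["MainCategory"] = parts[0]
products_demo["SubCategory"] = parts[1]

print(products_demo)
```

Mining at the level of `MainCategory` rather than individual items is one possible way to look for associated categories with higher support.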


In this project you should use [Python 3](https://www.python.org), [Jupyter Notebook](http://jupyter.org) and **[MLxtend](http://rasbt.github.io/mlxtend/)**. When using MLxtend, frequent patterns can be discovered using either `Apriori` or `FP-Growth`. **Choose the pattern mining algorithm to be used.** 


## Team Identification

**GROUP PD03**

Students:

* Eduardo Carvalho - nº55881
* Filipe Santos - nº55142
* Ivo Oliveira - nº50301

## 1. Mining Frequent Itemsets and Association Rules


In this first part of the project you should load and preprocess the dataset in order to compute frequent itemsets and generate association rules considering all the sessions.

**In what follows keep the following questions in mind and be creative!**

1. What are the most interesting products?
2. What are the most bought products?
3. Which products are bought together?
4. Can you find associations between the clicked products? 
5. Can you find associations highlighting that when people buy a product/set of products they also buy other product(s)?
6. Can you find associations highlighting that when people click on a product/set of products they also buy this product(s)?
7. Can you find relevant associated categories? 
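
For question 6, one possible approach is to put clicks and buys of a session into the same transaction, prefixing each item with the event type so that a rule such as `click_10 -> buy_10` can be mined. The sketch below uses tiny hypothetical tables (`clicks_demo`, `buys_demo`) and a hypothetical `Token` column; it is not the only way to model this.

```python
import pandas as pd

# Hypothetical mini versions of the clicks and buys tables.
clicks_demo = pd.DataFrame({"SessionID": [1, 1, 2, 3], "ItemID": [10, 20, 10, 30]})
buys_demo = pd.DataFrame({"SessionID": [1, 2], "ItemID": [10, 10]})

# Prefix each item with the event type so clicks and buys stay distinguishable.
click_tokens = clicks_demo.assign(Token="click_" + clicks_demo["ItemID"].astype(str))
buy_tokens = buys_demo.assign(Token="buy_" + buys_demo["ItemID"].astype(str))

events = pd.concat([click_tokens, buy_tokens], ignore_index=True)

# One transaction per session: the sorted set of click/buy tokens seen in it.
transactions = events.groupby("SessionID")["Token"].apply(
    lambda tokens: sorted(set(tokens))
)
print(transactions.to_dict())
```

The resulting lists can then be one-hot encoded and mined like any other transaction set.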

### 1.1. Load and Preprocess Data

 **Product quantities should not be considered.**
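
Ignoring quantities means that a session either contains an item or it does not. A minimal sketch of that transformation follows, assuming a toy `buys_demo` table with made-up values in the same column layout as the buys file.

```python
import pandas as pd

# Hypothetical purchases; session 1 buys item 10 twice with different quantities.
buys_demo = pd.DataFrame({
    "SessionID": [1, 1, 1, 2],
    "ItemID": [10, 10, 20, 10],
    "Qty": [1, 3, 2, 5],
})

# Keep one row per (session, item) pair and ignore Qty entirely.
pairs = buys_demo[["SessionID", "ItemID"]].drop_duplicates()

# Boolean session x item matrix: True when the item appears in the session.
basket = pd.crosstab(pairs["SessionID"], pairs["ItemID"]).astype(bool)
print(basket)
```

A dense matrix like this is fine for a small example; with tens of thousands of items and hundreds of thousands of sessions, a sparse encoding or a prior filter on rare items is likely to be needed.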

In [7]:
import pandas as pd
import numpy as np
import mlxtend as mlx
import matplotlib.pyplot as plt
import seaborn as sns

print("hello world")

hello world


In [15]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [8]:
products=pd.read_csv('/content/drive/MyDrive/Datasets/projetoPD/products.csv',
                     header=None,
                     names=['ItemID','Category'],
                     dtype={'ItemID': int, 'Category':str})
buys=pd.read_csv('/content/drive/MyDrive/Datasets/projetoPD/store-buys.dat', 
                 header=None,
                 names=['SessionID','TimeStamp','ItemID','Price','Qty'],
                 dtype={'SessionID':int, 'TimeStamp':str, 'ItemID': int, 
                        'Price':int, 'Qty':int})
clicks=pd.read_csv('/content/drive/MyDrive/Datasets/projetoPD/store-clicks.dat', 
                   header=None,
                   names=['SessionID','TimeStamp','ItemID','Context'],
                   dtype={'SessionID':int, 'TimeStamp':str, 'ItemID': int, 
                        'Context':str})

In [18]:
products=products.sort_values(by='ItemID').reset_index(drop=True)
products.head()

Unnamed: 0,ItemID,Category
0,214507224,computers.peripherals.scanner
1,214507224,computers.peripherals.scanner
2,214507224,computers.peripherals.scanner
3,214507224,computers.peripherals.scanner
4,214507224,computers.peripherals.scanner


In [19]:
len(products)

20704559

In [20]:
len(products.ItemID.unique())

46294

In [23]:
#products=products.drop_duplicates().reset_index(drop=True)
products2=products.drop_duplicates().reset_index(drop=True)
products2.head()

Unnamed: 0,ItemID,Category
0,214536502,electronics.tablet
1,214536500,electronics.tablet
2,214536506,electronics.tablet
3,214577561,electronics.audio.headphone
4,214662742,furniture.kitchen.table


In [None]:
len(products)

20704559

In [None]:
buys=buys.sort_values(by='SessionID').reset_index(drop=True)
buys.head()

Unnamed: 0,SessionID,TimeStamp,ItemID,Price,Qty
0,11,2014-04-03T11:04:18.097Z,214821371,1046,1
1,11,2014-04-03T11:04:11.417Z,214821371,1046,1
2,12,2014-04-02T10:42:17.227Z,214717867,1778,4
3,21,2014-04-07T09:24:18.307Z,214548744,3141,1
4,21,2014-04-07T09:24:18.360Z,214838503,18745,1


In [None]:
len(buys)-len(buys.drop_duplicates())

113

There are 113 duplicate records. We need to remove them before proceeding with the analysis.

In [None]:
buys=buys.drop_duplicates().reset_index(drop=True)
len(buys)

679376

In [None]:
clicks=clicks.sort_values(by='SessionID').reset_index(drop=True)
clicks.head()

Unnamed: 0,SessionID,TimeStamp,ItemID,Context
0,1,2014-04-07T10:51:09.277Z,214536502,0
1,1,2014-04-07T10:54:09.868Z,214536500,0
2,1,2014-04-07T10:54:46.998Z,214536506,0
3,1,2014-04-07T10:57:00.306Z,214577561,0
4,2,2014-04-07T13:56:37.614Z,214662742,0


In [None]:
len(clicks)-len(clicks.drop_duplicates())

46

There are 46 duplicate records. We need to remove them before proceeding with the analysis.

In [None]:
clicks=clicks.drop_duplicates().reset_index(drop=True)
len(clicks)

20704513

In [57]:
# 1. What are the most interesting products?
# Use the number of clicks per item as a proxy for interest and keep the top 10.
click_most_interesting = (
    clicks.groupby("ItemID", as_index=False)
    .size()
    .sort_values(by="size", ascending=False)
    .head(10)
)

list_click_most_interesting = list(click_most_interesting["ItemID"])
print(list_click_most_interesting)

# Attach the category of each of the top items.
df3 = pd.merge(click_most_interesting, products2, on="ItemID")
df3.head(10)


[643078800, 214829878, 214826610, 214834880, 214839973, 214748336, 214834877, 214835017, 214836932, 214821309]


Unnamed: 0,ItemID,size,Category
0,643078800,147419,computers.gaming
1,214829878,102563,sport.tennis
2,214826610,67473,computers.components.memory
3,214834880,61253,appliances.environment.fan
4,214839973,59083,medicine.tools.tonometer
5,214748336,55844,computers.components.videocards
6,214834877,53308,appliances.environment.fan
7,214835017,51347,appliances.kitchen.toster
8,214836932,48943,computers.peripherals.monitor
9,214821309,48791,appliances.kitchen.blender


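Taking the number of clicks as a proxy for interest, the most interesting product is item 643078800 (`computers.gaming`, 147419 clicks), followed by item 214829878 (`sport.tennis`, 102563 clicks) and item 214826610 (`computers.components.memory`, 67473 clicks).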



In [45]:
# 2. What are the most bought products?
# Total quantity bought per product category.
df4 = pd.merge(buys, products2, on="ItemID")

# Select Qty explicitly before summing so only that column is aggregated.
quantities = (
    df4.groupby("Category", as_index=False)["Qty"]
    .sum()
    .sort_values(by="Qty", ascending=False)
    .rename(columns={"Qty": "Quantity"})
    .reset_index(drop=True)
)

quantities.head(10)


Unnamed: 0,Category,Quantity
0,appliances.kitchen.blender,40163
1,computers.components.memory,23810
2,sport.tennis,16864
3,appliances.kitchen.meat_grinder,11101
4,electronics.video.tv,9055
5,appliances.environment.fan,7387
6,medicine.tools.tonometer,6156
7,computers.notebook,5667
8,appliances.iron,5198
9,accessories.bag,5066


The most bought product category is blenders (`appliances.kitchen.blender`), with a total quantity of 40163.

In [None]:
#3. Which products are bought together?
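
A simple first look at this question, before running a full pattern mining algorithm, is to count how often each pair of items appears in the same session. The sketch below is a minimal illustration on hypothetical baskets (the `baskets` dictionary and its item IDs are made up).

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: session -> set of purchased item IDs.
baskets = {
    1: {10, 20},
    2: {10, 20, 30},
    3: {10},
    4: {20, 30},
}

pair_counts = Counter()
for items in baskets.values():
    # Sort so that (10, 20) and (20, 10) count as the same pair.
    for pair in combinations(sorted(items), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(3))
```

Dividing each count by the number of sessions gives the support of the corresponding 2-itemset.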





### 1.2. Compute Frequent Itemsets

* Compute frequent itemsets considering a minimum support of X%. 
* Present frequent itemsets organized by length (number of items). 
* List frequent 1-itemsets, 2-itemsets, 3-itemsets, etc with support of at least Y%.
* Change X and Y when it makes sense and discuss the results.
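
The steps above can be sketched on a table shaped like the one MLxtend returns (a `support` column and an `itemsets` column of frozensets). The values below are hypothetical, and `Y` is expressed as a fraction rather than a percentage.

```python
import pandas as pd

# Hypothetical result in the shape MLxtend returns: one row per itemset.
frequent = pd.DataFrame({
    "support": [0.75, 0.75, 0.50, 0.50, 0.50],
    "itemsets": [
        frozenset({"blender"}),
        frozenset({"toaster"}),
        frozenset({"fan"}),
        frozenset({"blender", "toaster"}),
        frozenset({"blender", "fan"}),
    ],
})

# Length of each itemset, then sort by length and by descending support.
frequent["length"] = frequent["itemsets"].apply(len)
by_length = frequent.sort_values(["length", "support"], ascending=[True, False])

# Y is the stricter reporting threshold (0.6 means 60%).
Y = 0.6
for k, group in by_length.groupby("length"):
    kept = group[group["support"] >= Y]
    print(f"{k}-itemsets with support >= {Y}: {len(kept)}")
```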

### 1.3. Generate Association Rules from Frequent Itemsets

* Generate association rules with a chosen value (C) for minimum confidence. 
* Generate association rules with a chosen value (L) for minimum lift. 
* Generate association rules with both confidence >= C% and lift >= L.
* Change C and L when it makes sense and discuss the results.
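
To make the two thresholds concrete, the sketch below computes confidence and lift for one rule directly from itemset supports, using the standard definitions confidence(A -> B) = support(A ∪ B) / support(A) and lift(A -> B) = confidence(A -> B) / support(B). The supports, the helper name `rule_metrics` and the threshold values are assumptions for illustration only.

```python
# Hypothetical supports (fractions of sessions) for a toy example.
support = {
    frozenset({"blender"}): 0.75,
    frozenset({"toaster"}): 0.75,
    frozenset({"blender", "toaster"}): 0.50,
}


def rule_metrics(antecedent, consequent, support):
    """Return (confidence, lift) for the rule antecedent -> consequent."""
    a, c = frozenset(antecedent), frozenset(consequent)
    confidence = support[a | c] / support[a]
    lift = confidence / support[c]
    return confidence, lift


min_confidence, min_lift = 0.6, 0.8  # example values for C and L
confidence, lift = rule_metrics({"blender"}, {"toaster"}, support)

# Keep the rule only if it passes both thresholds.
keep = confidence >= min_confidence and lift >= min_lift
print(round(confidence, 3), round(lift, 3), keep)
```

In this toy case the lift is below 1, which suggests the two items appear together slightly less often than they would if they were independent, even though the confidence looks reasonable. This is why filtering on both measures can be more informative than using confidence alone.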

### 1.4. Take a Look at Maximal Patterns: Compute Maximal Frequent Itemsets
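
A frequent itemset is maximal when none of its proper supersets is frequent. The sketch below applies that definition directly to a hypothetical list of frequent itemsets; it compares every pair, so it only suits small result sets. MLxtend also offers an `fpmax` function in `mlxtend.frequent_patterns` for mining maximal itemsets directly.

```python
# Hypothetical frequent itemsets (e.g. taken from the `itemsets` column).
frequent_itemsets = [
    frozenset({"blender"}),
    frozenset({"toaster"}),
    frozenset({"fan"}),
    frozenset({"blender", "toaster"}),
    frozenset({"blender", "fan"}),
]

# An itemset is maximal if no other frequent itemset strictly contains it.
maximal = [
    itemset
    for itemset in frequent_itemsets
    if not any(itemset < other for other in frequent_itemsets)
]
print(maximal)
```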

### 1.5. Conclusions 

## 2. Week vs Weekend Purchases

In this part of the project you should analyse the consumption patterns during the week vs during the weekend.

**In what follows keep the following questions in mind and be creative!**

1. Are the most interesting products the same during the week and the weekend? 
2. What are the most bought products during the week? And during the weekend?
3. Are there differences between the sets of products bought during the week and the weekend?
4. Can you find different associations highlighting that when people click on a product/set of products they also buy this product(s) during the week vs the weekend?
5. Discuss the results obtained for the week sessions vs weekend sessions.

### 2.1. Load and Preprocess Data

 **Product quantities should not be considered.**
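
Splitting the data requires deciding, for each event, whether it happened on a weekday or at the weekend. The following sketch parses timestamps in the dataset's format and flags Saturdays and Sundays. The `buys_demo` rows and the `IsWeekend` column are hypothetical, and the timestamps are taken as given (UTC), which is an assumption about how the store recorded them.

```python
import pandas as pd

# Hypothetical purchases using the dataset's timestamp format.
buys_demo = pd.DataFrame({
    "SessionID": [11, 12, 13, 14],
    "TimeStamp": [
        "2014-04-03T11:04:18.097Z",  # Thursday
        "2014-04-05T10:42:17.227Z",  # Saturday
        "2014-04-06T09:24:18.307Z",  # Sunday
        "2014-04-07T09:24:18.360Z",  # Monday
    ],
})

timestamps = pd.to_datetime(buys_demo["TimeStamp"])

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend.
buys_demo["IsWeekend"] = timestamps.dt.dayofweek >= 5

week_buys = buys_demo[~buys_demo["IsWeekend"]]
weekend_buys = buys_demo[buys_demo["IsWeekend"]]
print(len(week_buys), len(weekend_buys))
```

A session that spans midnight between Friday and Saturday would be split by this rule; assigning the whole session to the day of its first event is one way to avoid that.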

### 2.2. Compute Frequent Itemsets

* Compute frequent itemsets considering a minimum support of X%. 
* Present frequent itemsets organized by length (number of items). 
* List frequent 1-itemsets, 2-itemsets, 3-itemsets, etc with support of at least Y%.
* Change X and Y when it makes sense and discuss the results.
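
Once frequent itemsets have been mined separately for each group, plain set operations show which patterns are shared and which are specific to one group. The itemsets below are hypothetical placeholders for the two mining results.

```python
# Hypothetical frequent itemsets mined separately from each group.
week_itemsets = {
    frozenset({"blender"}),
    frozenset({"toaster"}),
    frozenset({"blender", "toaster"}),
}
weekend_itemsets = {
    frozenset({"blender"}),
    frozenset({"fan"}),
    frozenset({"blender", "fan"}),
}

shared = week_itemsets & weekend_itemsets        # frequent in both groups
only_week = week_itemsets - weekend_itemsets     # frequent only during the week
only_weekend = weekend_itemsets - week_itemsets  # frequent only at the weekend

print(len(shared), len(only_week), len(only_weekend))
```

Because support is a fraction of the sessions in each group, using the same relative threshold for both groups keeps the comparison fair even when the groups differ in size.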

### 2.3. Generate Association Rules from Frequent Itemsets

* Generate association rules with a chosen value (C) for minimum confidence. 
* Generate association rules with a chosen value (L) for minimum lift. 
* Generate association rules with both confidence >= C% and lift >= L.
* Change C and L when it makes sense and discuss the results.

### 2.4. Conclusions 

## 3. [Only Groups of 3] Spring vs Summer Purchases

In this part of the project you should analyse the consumption patterns during the Spring months (April and May) vs Summer months (June and July).

**In what follows keep the following questions in mind and be creative!**

1. Are the most interesting products the same during the Spring and the Summer? 
2. What are the most bought products during the Spring? And during the Summer?
3. Are there differences between the sets of products bought during the Spring and the Summer?
4. Can you find different associations highlighting that when people click on a product/set of products they also buy this product(s) during the Spring vs the Summer?
5. Discuss the results obtained for the Spring sessions vs Summer sessions.

### 3.1. Load and Preprocess Data

 **Product quantities should not be considered.**
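
The season of each event follows from its month, using the definition given for this task (April and May are Spring, June and July are Summer). The sketch below uses a hypothetical `buys_demo` table with one made-up purchase per month and a hypothetical `Season` column.

```python
import pandas as pd

# Hypothetical purchases, one per month covered by the dataset.
buys_demo = pd.DataFrame({
    "SessionID": [1, 2, 3, 4],
    "TimeStamp": [
        "2014-04-03T11:04:18.097Z",
        "2014-05-10T08:00:00.000Z",
        "2014-06-21T16:30:00.000Z",
        "2014-07-15T12:00:00.000Z",
    ],
})

# Season definition from the task: April-May = Spring, June-July = Summer.
month_to_season = {4: "Spring", 5: "Spring", 6: "Summer", 7: "Summer"}

months = pd.to_datetime(buys_demo["TimeStamp"]).dt.month
buys_demo["Season"] = months.map(month_to_season)
print(buys_demo["Season"].tolist())
```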

### 3.2. Compute Frequent Itemsets

* Compute frequent itemsets considering a minimum support of X%. 
* Present frequent itemsets organized by length (number of items). 
* List frequent 1-itemsets, 2-itemsets, 3-itemsets, etc with support of at least Y%.
* Change X and Y when it makes sense and discuss the results.

### 3.3. Generate Association Rules from Frequent Itemsets

* Generate association rules with a chosen value (C) for minimum confidence. 
* Generate association rules with a chosen value (L) for minimum lift. 
* Generate association rules with both confidence >= C% and lift >= L.
* Change C and L when it makes sense and discuss the results.

### 3.4. Conclusions 

## 4. Conclusions
Draw some conclusions about this project work.