# Data Mining / Prospecção de Dados

## Sara C. Madeira, 2024/2025

# Project 1 - Pattern Mining

## Logistics 
**_Read Carefully_**

**Students should work in teams of 3 people**. 

Groups with fewer than 3 people might be allowed (with valid justification), but will not receive better grades for this reason. 

The quality of the project will dictate its grade, not the number of people working.

**The project's solution should be uploaded in Moodle before the end of `May, 4th (23:59)`.** 

Students should **upload a `.zip` file** containing a folder with all the files necessary for project evaluation. 
Groups should be registered in [Moodle](https://moodle.ciencias.ulisboa.pt/mod/groupselect/view.php?id=139096) and the `zip` file should be identified as `PDnn.zip` where `nn` is the number of your group.

**It is mandatory to produce a Jupyter notebook containing code and text/images/tables/etc. describing the solution and the results. Projects not delivered in this format will not be graded. You can use `PD_202425_P1.ipynb` as a template. In your `.zip` folder you should also include an HTML version of your notebook with all the outputs.**

**Decisions should be justified and results should be critically discussed.** 

Remember that **your notebook should be as clear and organized as possible**, that is, **only the relevant code and experiments should be presented, not everything you tried and did not work, or is not relevant** (that can be discussed in the text, if relevant)! Tables and figures can be used together with text to summarize results and conclusions, improving understanding, readability and concision. **More does not mean better! The target is quality not quantity!**

_**Project solutions containing only code and outputs without discussions will achieve a maximum grade of 10 out of 20.**_

## Dataset and Tools

The dataset to be analysed is **`Foodmart_2025_DM.csv`**, which is a modified and integrated version of the **Foodmart database**, used in several [Kaggle](https://www.kaggle.com) Pattern Mining competitions, with the goal of finding **actionable patterns** by analysing data from the `FOODmart Ltd` company, a leading supermarket chain. 

`FOODmart Ltd` has different types of stores: Deluxe Supermarkets, Gourmet Supermarkets, Mid-Size Groceries, Small Groceries and Supermarkets.

Your **goals** are to find: 
1. **global patterns** (common to all stores) and
2. **local/specific patterns** (related to the type of store).

**`Foodmart_2025_DM.csv`** stores **69549 transactions** from **24 stores**, where **103 different products** can be bought. 

Each transaction (row) has a `STORE_ID` (integer from 1 to 24) and a list of products (items), together with the quantities bought. 

In the transaction highlighted below, a given customer bought 1 unit of soup, 2 of cheese and 1 of wine at store 2.

<img src="Foodmart_2025_DM_Example.png" alt="Foodmart_2025_DM_Example" style="width: 1000px;"/>
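As a minimal sketch, one raw line could be split into a store identifier and a product-to-quantity mapping as shown here. The comma-separated `KEY=VALUE` layout is an assumption inferred from the loading code in section 1.1, and both the sample line and the helper name `parse_line` are hypothetical.

```python
def parse_line(line):
    """Split one raw transaction line into (store_id, {product: quantity}).

    Assumes a comma-separated line of KEY=VALUE fields that includes a
    STORE_ID field, e.g. "STORE_ID=2,Soup=1,Cheese=2,Wine=1".
    """
    store_id = None
    quantities = {}
    for field in line.strip().split(','):
        if '=' not in field:
            continue  # ignore empty or malformed fields
        key, value = field.split('=', 1)
        key = key.strip()
        if key == 'STORE_ID':
            store_id = int(value)
        else:
            quantities[key] = int(value)
    return store_id, quantities


# Hypothetical line mirroring the highlighted example transaction
store_id, quantities = parse_line("STORE_ID=2,Soup=1,Cheese=2,Wine=1")
items = sorted(quantities)  # product names only, quantities dropped
print(store_id, quantities, items)
```

Keeping the store identifier and the quantities at this stage leaves both tasks open: Task 1 only needs `items`, while Task 2 also needs `store_id`.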

In this context, the project has **2 main tasks**:
1. Mining Frequent Itemsets and Association Rules: Ignoring Product Quantities and Stores **(global patterns)**
2. Mining Frequent Itemsets and Association Rules: Looking for Differences between Stores **(local/specific patterns)**

**While doing PATTERN and ASSOCIATION MINING keep in mind the following basic/key questions and BE CREATIVE!**

1. What are the most popular products?
2. Which products are bought together?
3. What are the frequent patterns?
4. Can we find associations highlighting that when people buy a product/set of products they also buy other product(s)?
5. Are these associations strong? Can we trust them? Are they misleading?
6. Can we analyse these patterns and evaluate these associations to find, not only frequent and strong associations, but also interesting patterns and associations?

**In this project you should use [Python 3](https://www.python.org), [Jupyter Notebook](http://jupyter.org) and [`MLxtend`](http://rasbt.github.io/mlxtend/).**

When using `MLxtend`, frequent patterns can be discovered using either `Apriori` or `FP-Growth`. **Choose the pattern mining algorithm to be used.** 
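For a given minimum support, both algorithms return the same frequent itemsets and differ only in how they search for them. The sketch below is a simplified, pure-Python illustration of the level-wise Apriori idea (join candidates, then prune those with an infrequent subset). It is not MLxtend's implementation, and the function name `apriori_sketch` and the toy transactions are hypothetical.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items
    items = {item for b in baskets for item in b}
    current = {frozenset([i]): support(frozenset([i])) for i in items}
    current = {c: s for c, s in current.items() if s >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        current = {c: support(c) for c in candidates}
        current = {c: s for c, s in current.items() if s >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Toy transactions for illustration only
toy = [
    ['Soup', 'Cheese', 'Wine'],
    ['Soup', 'Cheese'],
    ['Soup', 'Wine'],
    ['Cheese'],
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

In the toy data the candidate `{Soup, Cheese, Wine}` is pruned without counting, because its subset `{Cheese, Wine}` is not frequent. FP-Growth avoids candidate generation altogether, which is one reason it is often preferred at low support thresholds.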

## Team Identification

**GROUP NN**

Students:

* Student 1 - n_student1
* Student 2 - n_student2
* Student 3 - n_student3

## 1. Mining Frequent Itemsets and Association Rules: Ignoring Product Quantities and Stores

In this first task you should load and preprocess the dataset **`Foodmart_2025_DM.csv`** in order to compute frequent itemsets and generate association rules considering all the transactions, regardless of the store, and ignoring product quantities.

### 1.1. Load and Preprocess Dataset

 **Product quantities and stores should not be considered.**

In [1]:
import mlxtend
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth

In [34]:
def load_and_prepare_transactions(file_path):
    transactions_list = []
    with open(file_path, 'r', encoding='utf-8-sig') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue

            # Split by comma, skip the first element (STORE_ID=X)
            parts = line.split(',')[1:]
            current_transaction_products = []
            for part in parts:
                if '=' in part:
                    # Extract product name (part before '=')
                    product_name = part.split('=', 1)[0]
                    current_transaction_products.append(product_name)
                elif part: # Handle potential empty strings if trailing commas exist
                    current_transaction_products.append(part)

            if current_transaction_products: # Avoid adding empty transactions
                transactions_list.append(current_transaction_products)

    return transactions_list

file_path = 'Foodmart_2025_DM.csv'
transactions = load_and_prepare_transactions(file_path)

print(f"Loaded {len(transactions)} transactions.")
print("First 5 transactions (list of lists format):")
print(transactions[:5])

# One-hot encoding using TransactionEncoder
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)

# Create the final DataFrame
foodmart_df = pd.DataFrame(te_ary, columns=te.columns_)

print("\nOne-hot encoded DataFrame head:")
display(foodmart_df.head()) # Use display() in Jupyter for better rendering

Loaded 69549 transactions.
First 5 transactions (list of lists format):
[['Pasta', 'Soup'], ['Soup', 'Fresh Vegetables', 'Milk', 'Plastic Utensils'], ['Cheese', 'Deodorizers', 'Hard Candy', 'Jam'], ['Fresh Vegetables'], ['Cleaners', 'Cookies', 'Eggs', 'Preserves']]

One-hot encoded DataFrame head:


| | Acetominifen | Anchovies | Aspirin | Auto Magazines | Bagels | Batteries | Beer | Bologna | Candles | Canned Fruit | ... | Sunglasses | TV Dinner | Tofu | Toilet Brushes | Tools | Toothbrushes | Tuna | Waffles | Wine | Yogurt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |


Write text in cells like this ...


### 1.2. Compute Frequent Itemsets

* Compute frequent itemsets considering a minimum support S_min. 
* Present frequent itemsets organized by length (number of items). 
* List frequent 1-itemsets, 2-itemsets, 3-itemsets, etc. with support of at least S, where S < S_min.
* Change the minimum support values and discuss the results.
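The effect of the threshold can be sketched with a few of the supports reported in this section's output. The six itemsets below are a hand-picked subset of those results, and the helper name `count_by_length` is hypothetical.

```python
from collections import Counter

# A hand-picked subset of the supports reported in this section's output
supports = {
    ('Fresh Vegetables',): 0.284174,
    ('Fresh Fruit',): 0.175229,
    ('Soup',): 0.120059,
    ('Fresh Vegetables', 'Fresh Fruit'): 0.050914,
    ('Fresh Vegetables', 'Soup'): 0.035443,
    ('Fresh Vegetables', 'Fresh Fruit', 'Soup'): 0.007045,
}


def count_by_length(supports, min_support):
    """Count itemsets per length whose support is >= min_support."""
    counts = Counter(
        len(itemset) for itemset, s in supports.items() if s >= min_support
    )
    return dict(sorted(counts.items()))


for s_min in [0.1, 0.05, 0.01, 0.001]:
    print(s_min, count_by_length(supports, s_min))
```

Because an itemset can never be more frequent than any of its subsets, longer itemsets only appear once the threshold drops below the support of their most frequent pairs.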

In [47]:
S_min = [0.1, 0.05, 0.01, 0.001]

# Mine frequent itemsets for each minimum support value
for s in S_min:
    print(f"\nFrequent itemsets with minimum support {s}:")
    frequent_itemsets = fpgrowth(foodmart_df, min_support=s, use_colnames=True)
    # Compute each itemset's length once instead of re-deriving it for every filter
    frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
    # Show the top 5 frequent itemsets of each length (sorted by support)
    for length in range(1, frequent_itemsets['length'].max() + 1):
        filtered_itemsets = frequent_itemsets[frequent_itemsets['length'] == length]
        print(f"Length {length}: {len(filtered_itemsets)} itemsets found")
        print(f"\nFrequent itemsets of length {length}:")
        top_itemsets = filtered_itemsets.sort_values(by='support', ascending=False).head(5)
        display(top_itemsets[['itemsets', 'support', 'length']])



Frequent itemsets with minimum support 0.1:
Length 1: 6 itemsets found

Frequent itemsets of length 1:


| | itemsets | support | length |
|---|---|---|---|
| 1 | (Fresh Vegetables) | 0.284174 | 1 |
| 4 | (Fresh Fruit) | 0.175229 | 1 |
| 0 | (Soup) | 0.120059 | 1 |
| 2 | (Cheese) | 0.117802 | 1 |
| 5 | (Dried Fruit) | 0.117212 | 1 |



Frequent itemsets with minimum support 0.05:
Length 1: 31 itemsets found

Frequent itemsets of length 1:


| | itemsets | support | length |
|---|---|---|---|
| 1 | (Fresh Vegetables) | 0.284174 | 1 |
| 12 | (Fresh Fruit) | 0.175229 | 1 |
| 0 | (Soup) | 0.120059 | 1 |
| 3 | (Cheese) | 0.117802 | 1 |
| 14 | (Dried Fruit) | 0.117212 | 1 |


Length 2: 1 itemsets found

Frequent itemsets of length 2:


| | itemsets | support | length |
|---|---|---|---|
| 31 | (Fresh Vegetables, Fresh Fruit) | 0.050914 | 2 |



Frequent itemsets with minimum support 0.01:
Length 1: 102 itemsets found

Frequent itemsets of length 1:


| | itemsets | support | length |
|---|---|---|---|
| 2 | (Fresh Vegetables) | 0.284174 | 1 |
| 29 | (Fresh Fruit) | 0.175229 | 1 |
| 0 | (Soup) | 0.120059 | 1 |
| 5 | (Cheese) | 0.117802 | 1 |
| 32 | (Dried Fruit) | 0.117212 | 1 |


Length 2: 76 itemsets found

Frequent itemsets of length 2:


| | itemsets | support | length |
|---|---|---|---|
| 134 | (Fresh Vegetables, Fresh Fruit) | 0.050914 | 2 |
| 102 | (Fresh Vegetables, Soup) | 0.035443 | 2 |
| 138 | (Dried Fruit, Fresh Vegetables) | 0.035227 | 2 |
| 110 | (Fresh Vegetables, Cheese) | 0.031144 | 2 |
| 112 | (Cookies, Fresh Vegetables) | 0.027721 | 2 |



Frequent itemsets with minimum support 0.001:
Length 1: 102 itemsets found

Frequent itemsets of length 1:


| | itemsets | support | length |
|---|---|---|---|
| 2 | (Fresh Vegetables) | 0.284174 | 1 |
| 29 | (Fresh Fruit) | 0.175229 | 1 |
| 0 | (Soup) | 0.120059 | 1 |
| 5 | (Cheese) | 0.117802 | 1 |
| 32 | (Dried Fruit) | 0.117212 | 1 |


Length 2: 2376 itemsets found

Frequent itemsets of length 2:


| | itemsets | support | length |
|---|---|---|---|
| 1120 | (Fresh Vegetables, Fresh Fruit) | 0.050914 | 2 |
| 102 | (Fresh Vegetables, Soup) | 0.035443 | 2 |
| 1192 | (Dried Fruit, Fresh Vegetables) | 0.035227 | 2 |
| 222 | (Fresh Vegetables, Cheese) | 0.031144 | 2 |
| 326 | (Cookies, Fresh Vegetables) | 0.027721 | 2 |


Length 3: 529 itemsets found

Frequent itemsets of length 3:


| | itemsets | support | length |
|---|---|---|---|
| 104 | (Fresh Vegetables, Fresh Fruit, Soup) | 0.007045 | 3 |
| 331 | (Cookies, Fresh Vegetables, Fresh Fruit) | 0.00555 | 3 |
| 225 | (Fresh Vegetables, Fresh Fruit, Cheese) | 0.005334 | 3 |
| 1199 | (Dried Fruit, Fresh Vegetables, Fresh Fruit) | 0.004946 | 3 |
| 1212 | (Paper Wipes, Fresh Vegetables, Fresh Fruit) | 0.004673 | 3 |


Write text in cells like this ...


### 1.3. Generate Association Rules from Frequent Itemsets

Using a minimum support S_min justified by the previous results: 
* Generate association rules with a chosen value (C) for minimum confidence. 
* Generate association rules with a chosen value (L) for minimum lift. 
* Generate association rules with both confidence >= C and lift >= L.
* Change C and L when it makes sense and discuss the results.
* Use other metrics besides confidence and lift.
* Evaluate how good the rules are given the metrics and how interesting they are from your point of view.
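As a worked example of how these metrics relate, the sketch below computes confidence, lift, leverage and conviction for the rule Fresh Vegetables → Fresh Fruit directly from the supports reported in section 1.2. The helper name `rule_metrics` is hypothetical; the formulas are the standard definitions of these metrics.

```python
def rule_metrics(supp_a, supp_c, supp_ac):
    """Metrics for the rule A -> C from the three supports involved."""
    confidence = supp_ac / supp_a
    lift = confidence / supp_c
    leverage = supp_ac - supp_a * supp_c
    # Conviction is infinite when the rule never fails (confidence == 1)
    if confidence == 1:
        conviction = float('inf')
    else:
        conviction = (1 - supp_c) / (1 - confidence)
    return {
        'confidence': confidence,
        'lift': lift,
        'leverage': leverage,
        'conviction': conviction,
    }


# Supports reported in section 1.2 for the full dataset
metrics = rule_metrics(
    supp_a=0.284174,   # Fresh Vegetables
    supp_c=0.175229,   # Fresh Fruit
    supp_ac=0.050914,  # both together
)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For this rule the lift comes out only slightly above 1: although this is the most frequent pair, the two products co-occur about as often as independence would predict. This is the kind of frequent but potentially misleading association that the key questions ask about.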

In [256]:
# Write code in cells like this
# ....

Write text in cells like this ...


### 1.4. Take a Look at Maximal Patterns: Compute Maximal Frequent Itemsets
- discuss their utility compared to frequent patterns
- analyse the association rules they can unravel
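A frequent itemset is maximal when none of its proper supersets is frequent. A minimal sketch of that filter follows, assuming the frequent itemsets are available as a collection of sets (for example the values of the `itemsets` column returned by MLxtend). The helper name `maximal_itemsets` and the toy itemsets are hypothetical.

```python
def maximal_itemsets(frequent):
    """Keep only itemsets that have no frequent proper superset."""
    itemsets = [frozenset(i) for i in frequent]
    return [
        candidate for candidate in itemsets
        # `<` on frozensets tests for a proper subset
        if not any(candidate < other for other in itemsets)
    ]


# Hypothetical frequent itemsets for illustration only
frequent = [
    {'Soup'}, {'Cheese'}, {'Wine'}, {'Eggs'},
    {'Soup', 'Cheese'}, {'Soup', 'Wine'},
]
maximal = maximal_itemsets(frequent)
print(sorted(sorted(m) for m in maximal))
```

The pairwise comparison is quadratic in the number of itemsets, which is acceptable for a sketch but may be slow at very low support thresholds. Note that maximal itemsets summarise which itemsets are frequent but not their supports, so rule metrics for their subsets still require the full list.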

In [260]:
# Write code in cells like this
# ....

Write text in cells like this ...


### 1.5. Conclusions from Mining Frequent Patterns in All Stores (Global Patterns and Rules)

Write text in cells like this ...


## 2. Mining Frequent Itemsets and Association Rules: Looking for Differences between Stores

The transactions from the 24 stores analysed in Task 1 correspond to purchases carried out in **different types of stores**:
* Deluxe Supermarkets: STORE_ID = 8, 12, 13, 17, 19, 21
* Gourmet Supermarkets: STORE_ID = 4, 6
* Mid-Size Groceries: STORE_ID = 9, 18, 20, 23
* Small Groceries: STORE_ID = 2, 5, 14, 22
* Supermarkets: STORE_ID = 1, 3, 7, 10, 11, 15, 16

In this context, in this second task you should compute frequent itemsets and association rules for specific groups of stores (specific/local patterns), and then compare the store specific results with those obtained when all transactions were analysed independently of the type of store (global patterns). 

**The goal is to find similarities and differences in buying patterns according to the types of store. Do popular products change? Are there buying patterns specific to the type of store?**
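One way to select the transactions of a given store type is to turn the listing above into a lookup table. This is a minimal sketch under the assumption that transactions are available as `(store_id, items)` pairs; the names `STORE_TYPES`, `STORE_TO_TYPE` and `select_transactions` are hypothetical.

```python
# Store types and their STORE_ID values, copied from the listing above
STORE_TYPES = {
    'Deluxe Supermarkets': [8, 12, 13, 17, 19, 21],
    'Gourmet Supermarkets': [4, 6],
    'Mid-Size Groceries': [9, 18, 20, 23],
    'Small Groceries': [2, 5, 14, 22],
    'Supermarkets': [1, 3, 7, 10, 11, 15, 16],
}

# Invert the listing into a STORE_ID -> store type lookup
STORE_TO_TYPE = {
    store_id: store_type
    for store_type, ids in STORE_TYPES.items()
    for store_id in ids
}


def select_transactions(transactions, store_types):
    """Keep the item lists of transactions whose store is of a wanted type.

    `transactions` is a list of (store_id, items) pairs.
    """
    wanted = set(store_types)
    return [
        items for store_id, items in transactions
        if STORE_TO_TYPE.get(store_id) in wanted  # unlisted IDs map to None
    ]


# Hypothetical transactions; STORE_ID 24 does not appear in the listing
sample = [(8, ['Wine']), (2, ['Soup']), (4, ['Cheese', 'Wine']), (24, ['Eggs'])]
selected = select_transactions(sample, ['Deluxe Supermarkets', 'Gourmet Supermarkets'])
print(selected)
```

Store IDs that are absent from the listing are excluded from every group, so it is worth checking how many transactions fall outside the five types before comparing results.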

### 2.1. Analyse Deluxe Supermarkets and Gourmet Supermarkets

Here you should analyse **both** the transactions from **Deluxe Supermarkets (STORE_ID = 8, 12, 13, 17, 19, 21)** and **Gourmet Supermarkets (STORE_ID = 4, 6)**.

#### 2.1.1. Load/Preprocess the Dataset

**You might need to change the preprocessing a bit, although most of it should be reused.**

In [None]:
def parse_transactions_from_file(file_path):
    """Parse the raw file into a DataFrame with STORE_ID and one quantity column per product."""
    transactions_data = []
    # 'utf-8-sig' strips a leading byte-order mark (as in the loader of section 1.1),
    # so the first key is read as 'STORE_ID' rather than '\ufeffSTORE_ID'
    with open(file_path, 'r', encoding='utf-8-sig') as file:
        for line in file:
            line = line.strip()
            if not line:
                continue # Skip empty lines

            current_transaction = {}
            parts = line.split(',')
            for part in parts:
                if '=' in part:
                    key, value_str = part.split('=', 1)
                    try:
                        # Store quantities/IDs as integers
                        value = int(value_str)
                    except ValueError:
                        # Handle cases where value is not an integer (e.g., corrupted data)
                        value = 0 # Default to 0 if conversion fails
                    current_transaction[key] = value
            if current_transaction: # Ensure we don't add empty dictionaries if a line was invalid
                transactions_data.append(current_transaction)

    # Create DataFrame from the list of dictionaries.
    # Pandas automatically creates columns for all unique keys found.
    df = pd.DataFrame(transactions_data)

    # Identify product columns (all columns except STORE_ID)
    product_columns = [col for col in df.columns if col != 'STORE_ID']

    # Fill NaN values (products not present in a specific transaction) with 0
    df[product_columns] = df[product_columns].fillna(0).astype(int)

    # Ensure STORE_ID column is integer type, fill potential NaNs with a default (e.g., 0 or -1)
    if 'STORE_ID' in df.columns:
        df['STORE_ID'] = df['STORE_ID'].fillna(0).astype(int) # Using 0 as default for missing STORE_ID

    return df


Write text in cells like this ...


#### 2.1.2. Compute Frequent Itemsets

**This should be trivial now.**

In [273]:
# Write code in cells like this
# ....

Write text in cells like this ...


#### 2.1.3. Generate Association Rules from Frequent Itemsets

**This should be trivial now.**

In [277]:
# Write code in cells like this
# ....

Write text in cells like this 

#### 2.1.4. Take a Look at Maximal Patterns

In [281]:
# Write code in cells like this
# ....

Write text in cells like this 

#### 2.1.5. Deluxe/Gourmet Supermarkets versus All Stores (Global versus Deluxe/Gourmet Supermarkets Specific Patterns and Rules)

Discuss the similarities and differences between the results obtained in Task 1 (frequent itemsets and association rules found in transactions from all stores) and those obtained above (frequent itemsets and association rules found in transactions only from Deluxe/Gourmet Supermarkets).
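Such a comparison can be organised by splitting the itemsets into those found in both analyses and those found in only one. The sketch below assumes each result is available as a mapping from itemset to support; the helper name `compare_patterns` and all support values are hypothetical.

```python
def compare_patterns(global_supports, local_supports):
    """Split itemsets into shared, global-only and local-only groups.

    Both arguments map frozenset itemsets to their support.
    """
    shared = global_supports.keys() & local_supports.keys()
    return {
        # Ratio > 1 means the itemset is relatively more frequent locally
        'shared': {
            itemset: local_supports[itemset] / global_supports[itemset]
            for itemset in shared
        },
        'global_only': set(global_supports) - shared,
        'local_only': set(local_supports) - shared,
    }


# Hypothetical supports for illustration only
global_supports = {frozenset({'Soup'}): 0.12, frozenset({'Cheese'}): 0.10}
local_supports = {frozenset({'Soup'}): 0.18, frozenset({'Wine'}): 0.11}
result = compare_patterns(global_supports, local_supports)
print(result)
```

An itemset missing from one side only shows that it fell below the minimum support used in that run, so the comparison is most meaningful when both analyses use the same threshold.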


In [164]:
# Write code in cells like this
# ....

Write text in cells like this 

### 2.2. Analyse Small Groceries

Here you should analyse **Small Groceries (STORE_ID = 2, 5, 14, 22)**.

#### 2.2.1.  Load/Preprocess the Dataset

**This should be trivial now.**

In [174]:
# Write code in cells like this
# ....

Write text in cells like this 


#### 2.2.2. Compute Frequent Itemsets

Write text in cells like this 


In [168]:
# Write code in cells like this
# ....

#### 2.2.3. Generate Association Rules from Frequent Itemsets

In [168]:
# Write code in cells like this
# ....

Write text in cells like this


#### 2.2.4. Take a Look at Maximal Patterns

In [172]:
# Write code in cells like this
# ....

Write text in cells like this


#### 2.2.5. Small Groceries versus All Stores (Global versus Small Groceries Specific Patterns and Rules)

Discuss the similarities and differences between the results obtained in Task 1 (frequent itemsets and association rules found in transactions from all stores) and those obtained above (frequent itemsets and association rules found in transactions only from Small Groceries).

Write text in cells like this


### 2.3. Deluxe/Gourmet Supermarkets versus Small Groceries

Discuss the similarities and differences between the results obtained in Task 2.1 (frequent itemsets and association rules found in transactions only from Deluxe/Gourmet Supermarkets) and those obtained in Task 2.2 (frequent itemsets and association rules found in transactions only from Small Groceries).

Write text in cells like this