<a href="https://colab.research.google.com/github/brendanpshea/programming_problem_solving/blob/main/DataMining_APriori_Algorithm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is the Apriori Algorithm?
The **Apriori Algorithm** is a fundamental data mining algorithm for finding frequent itemsets in a dataset and for learning association rules. It was proposed by Rakesh Agrawal and Ramakrishnan Srikant in 1994. The algorithm operates on databases of transactions (like purchases in a supermarket) and identifies sets of items that are frequently bought together. It uses a breadth-first search strategy and works in two phases: (1) finding all frequent itemsets, and (2) generating strong association rules from these itemsets.

To understand the Apriori Algorithm, consider a dataset as a list of transactions, where each transaction is a set of items. For example, in a supermarket setting, each transaction would represent the items a customer bought together. The Apriori Algorithm seeks to find combinations of items frequently purchased together - these are called 'frequent itemsets'. The algorithm uses a minimum support threshold to determine which itemsets are considered frequent. If an itemset appears in the dataset at least as often as this threshold, it is deemed frequent.
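The support check described above can be sketched in a few lines of Python. The transactions, the helper name `support_count`, and the threshold value are all made up for illustration; the threshold here is an absolute count, though it is also commonly expressed as a fraction of all transactions.

```python
# Toy transactions (hypothetical data, for illustration only)
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "milk", "jam"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

min_support = 2  # assumed absolute-count threshold

count = support_count({"bread", "milk"}, transactions)
is_frequent = count >= min_support
print(count, is_frequent)  # 3 True
```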

The process starts by finding all frequent itemsets of size one (single items). It then iteratively builds larger candidate sets by combining the frequent itemsets found in the previous step and checks their frequency. This works because of the Apriori principle: if an itemset is frequent, all of its subsets must also be frequent. Equivalently, if an itemset is infrequent, no superset of it can be frequent, so such supersets never need to be counted. This pruning greatly reduces the search space compared with checking every possible combination of items.

After identifying the frequent itemsets, the Apriori Algorithm proceeds to the next phase: generating association rules. These rules are of the form "If X, then Y", indicating that when items in X are bought, items in Y are likely to be bought too. The strength of these rules is measured using metrics like confidence and lift.
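As a rough illustration of these metrics, the sketch below uses the standard definitions: confidence(X → Y) = support(X ∪ Y) / support(X), and lift(X → Y) = confidence(X → Y) / support(Y), with support measured as a fraction of transactions. The data and function names are hypothetical.

```python
# Toy transactions (hypothetical data, for illustration only)
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "jam"},
    {"bread"},
    {"jam"},
    {"milk", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """Estimated probability of seeing Y in a transaction that contains X."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    """How much more often X and Y co-occur than if they were independent."""
    return confidence(x, y, transactions) / support(y, transactions)

conf = confidence({"bread"}, {"milk"}, transactions)
lft = lift({"bread"}, {"milk"}, transactions)
print(round(conf, 3), round(lft, 3))  # 0.667 1.111
```

A lift above 1 suggests the two sides of the rule appear together more often than chance would predict; a lift near 1 suggests they are roughly independent.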

For example, imagine a multiplayer video game where players can buy virtual items like skins, weapons, or accessories. Each purchase a player makes is a transaction in our dataset. The Apriori Algorithm can analyze these transactions to find patterns in the purchases. For instance, it might discover that players who buy a particular weapon skin often also buy matching character skins. These insights can help game developers understand player preferences and tailor the in-game store to encourage more purchases. The algorithm can also suggest items to players based on what others have bought, enhancing the user experience and potentially increasing sales.







## How Does the Algorithm Work?
The Apriori Algorithm operates in a series of steps, using a level-wise search where k-itemsets are used to explore (k+1)-itemsets. This method is known as a candidate generation and test approach. Below is an expanded step-by-step explanation of how the Apriori Algorithm works:

### Step 1: Initial Database Scan for Frequent 1-Itemsets
The algorithm starts by scanning the database to find the frequency of each item in the dataset. This initial scan counts how many times each item appears. Based on a predefined minimum support threshold, it then determines which items are frequent, thus creating the frequent 1-itemset. For example, in a supermarket database, this step would identify which individual products are purchased frequently.

### Step 2: Candidate Generation
In this step, the algorithm generates new candidate itemsets of length (k+1) from the previously identified frequent itemsets of length k. This is done by joining the frequent k-itemsets with themselves. The crucial aspect here is the use of the Apriori property: any subset of a frequent itemset must also be frequent. This property significantly reduces the number of candidates, since any (k+1)-itemset that has at least one infrequent k-sized subset can be immediately discarded.
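A minimal sketch of the join-and-prune idea follows. It uses a simple pairwise union of frequent k-itemsets rather than any particular optimized join, and the function name `generate_candidates` and the sample itemsets are assumptions made for this example.

```python
from itertools import combinations

def generate_candidates(frequent_k, k):
    """Join frequent k-itemsets into (k+1)-candidates, then prune them."""
    frequent_k = set(frequent_k)
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) != k + 1:
                continue  # the two itemsets do not differ by exactly one item
            # Prune: keep the candidate only if every k-sized subset is frequent
            if all(frozenset(s) in frequent_k for s in combinations(union, k)):
                candidates.add(union)
    return candidates

# Hypothetical frequent 2-itemsets
frequent_2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}
candidates_3 = generate_candidates(frequent_2, 2)
print(candidates_3)
```

Here only {A, B, C} survives: {A, B, D} is dropped because {A, D} is not frequent, and {B, C, D} is dropped because {C, D} is not frequent.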

### Step 3: Candidate Testing
The generated candidates are then tested against the database. This involves another scan of the database to count the frequency of each candidate itemset and determine if they meet the minimum support threshold. The itemsets that do meet the threshold are deemed frequent (k+1)-itemsets.
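The testing step can be sketched as a single pass over the transactions that increments a counter for each candidate contained in the transaction. The function name `count_candidates` and the data below are illustrative assumptions, not part of the original notebook.

```python
def count_candidates(candidates, transactions, min_support):
    """Count each candidate in one pass and keep those meeting min_support."""
    counts = {c: 0 for c in candidates}
    for t in transactions:          # one scan of the database
        for c in candidates:
            if c <= t:              # candidate is a subset of the transaction
                counts[c] += 1
    return {c: n for c, n in counts.items() if n >= min_support}

# Hypothetical data
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C", "D"}]
candidates = [frozenset({"A", "B"}), frozenset({"A", "D"}), frozenset({"B", "C"})]

frequent = count_candidates(candidates, transactions, min_support=2)
print(frequent)
```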

### Step 4: Iteration and Convergence
The algorithm iterates the process of generating and testing, each time increasing the size of the itemsets being considered by 1 (k := k + 1). With each iteration, the algorithm refines the list of potentially frequent itemsets. The process continues until it reaches a point where no new frequent itemsets can be found or no new candidates can be generated.

### Step 5: Result Compilation
Finally, the algorithm returns all the frequent itemsets discovered during its iterations. These itemsets represent combinations of items that are frequently bought together.

## Pseudocode for the Apriori Algorithm
The pseudocode for the algorithm is as follows:

```sql
-- SQL-flavored pseudocode; it is not meant to run as written

-- Initialization: find the frequent 1-itemsets
SET currentItemsetSize = 1;
SELECT item INTO FrequentItemsets FROM Transactions GROUP BY item HAVING COUNT(item) >= minSupport;
SET AllFrequentItemsets = FrequentItemsets;

-- Main loop of the algorithm
WHILE (FrequentItemsets IS NOT EMPTY) DO
    -- Generate candidate itemsets of size (currentItemsetSize + 1)
    SELECT item1, item2, ..., itemK INTO CandidateItemsets
    FROM FrequentItemsets
    GROUP BY item1, item2, ..., itemK
    HAVING COUNT(DISTINCT items) = currentItemsetSize + 1;

    -- Count each candidate's frequency and check against minimum support
    SELECT item1, item2, ..., itemK INTO NextFrequentItemsets
    FROM CandidateItemsets, Transactions
    WHERE CandidateItemsets.item1 = Transactions.item1
      AND CandidateItemsets.item2 = Transactions.item2
      ...
      AND CandidateItemsets.itemK = Transactions.itemK
    GROUP BY item1, item2, ..., itemK
    HAVING COUNT(*) >= minSupport;

    -- Collect this level's results and prepare for the next iteration
    SET AllFrequentItemsets = AllFrequentItemsets UNION NextFrequentItemsets;
    SET currentItemsetSize = currentItemsetSize + 1;
    SET FrequentItemsets = NextFrequentItemsets;
END WHILE;

-- Return all collected frequent itemsets (every size, not just the last level)
SELECT * FROM AllFrequentItemsets;
```

Here's what is happening:
1. The algorithm starts by setting `currentItemsetSize` to 1. `FrequentItemsets` is initialized with the items from the `Transactions` table that meet the minimum support threshold.

2. The `WHILE` loop continues as long as `FrequentItemsets` is not empty.

3. Inside the loop, new candidate itemsets are generated. These candidates are itemsets of size `currentItemsetSize + 1`, formed from the current `FrequentItemsets`.

4. For each candidate itemset, the algorithm then counts its occurrences in the `Transactions` table. If the count meets or exceeds the minimum support threshold, the itemset is stored in `NextFrequentItemsets`.

5. After processing the current itemset size, `currentItemsetSize` is incremented, and `FrequentItemsets` is replaced by `NextFrequentItemsets` for the next iteration.

6. Once the loop ends (no more frequent itemsets can be found), the algorithm returns every frequent itemset collected across all iterations.
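The same loop can be sketched as runnable Python. This is a compact, unoptimized illustration under assumed names (`apriori`, `all_frequent`) and toy data; it is not the notebook's implementation, and it uses a simple pairwise join for candidate generation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support_count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: n for s, n in counts.items() if n >= min_support}
    all_frequent = dict(frequent)

    k = 1
    while frequent:
        # Candidate generation: join, then prune with the Apriori property
        prev = set(frequent)
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k + 1 and all(
                    frozenset(s) in prev for s in combinations(union, k)
                ):
                    candidates.add(union)

        # Candidate testing: one pass over the transactions
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: n for s, n in counts.items() if n >= min_support}

        all_frequent.update(frequent)  # accumulate results from every level
        k += 1
    return all_frequent

# Hypothetical data
baskets = [["bread", "milk"], ["bread", "butter", "milk"],
           ["butter", "jam"], ["bread", "milk", "jam"]]
result = apriori(baskets, min_support=2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

Note that results are accumulated in a separate dictionary on each pass, so the itemsets found at earlier levels are still available when the loop terminates.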

## Example
To see how this works in a real-world case, let's begin by loading some sample transaction data into a SQLite database:

### Step 1: Create Transaction Database

In [14]:
import pandas as pd
import sqlite3

data = "/content/drive/MyDrive/Colab Data/categories.txt"

conn = sqlite3.connect('transactions.db')

# Create a cursor object using the cursor() method
cursor = conn.cursor()

# Create table as per requirement
cursor.execute('CREATE TABLE IF NOT EXISTS Transactions (TransactionID INTEGER PRIMARY KEY, Items TEXT)')

# Open the file and read lines
with open(data, 'r') as file:
    for line in file:
        # Insert line into the Transactions table
        cursor.execute('INSERT INTO Transactions (Items) VALUES (?)', (line.strip(),))

# Commit the transaction
conn.commit()

# Query all data from the Transactions table and load it into a DataFrame
df = pd.read_sql_query("SELECT * FROM Transactions", conn)

In [15]:
df.head()

|   | TransactionID | Items |
|---|---|---|
| 0 | 1 | Breakfast & Brunch;American (Traditional);Rest... |
| 1 | 2 | Sandwiches;Restaurants |
| 2 | 3 | Local Services;IT Services & Computer Repair |
| 3 | 4 | Restaurants;Italian |
| 4 | 5 | Food;Coffee & Tea |
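The pseudocode earlier assumes a table with one item per row, while this table stores each transaction as a single semicolon-separated string. One possible way to bridge the two, sketched here on an in-memory database with a few sample rows, is to split the strings into a normalized table. The `TransactionItems` table, the `conn_demo` connection, and the threshold are hypothetical names and values chosen for this example.

```python
import sqlite3

# In-memory database so the sketch does not touch transactions.db
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE Transactions (TransactionID INTEGER PRIMARY KEY, Items TEXT)"
)
conn_demo.executemany(
    "INSERT INTO Transactions (Items) VALUES (?)",
    [("Sandwiches;Restaurants",), ("Restaurants;Italian",), ("Food;Coffee & Tea",)],
)

# Hypothetical normalized table: one row per (transaction, item) pair
conn_demo.execute("CREATE TABLE TransactionItems (TransactionID INTEGER, Item TEXT)")
rows = conn_demo.execute("SELECT TransactionID, Items FROM Transactions").fetchall()
conn_demo.executemany(
    "INSERT INTO TransactionItems VALUES (?, ?)",
    [(tid, item) for tid, items in rows for item in items.split(";")],
)

# Frequent 1-itemsets via GROUP BY / HAVING, mirroring the pseudocode's first query
min_support = 2  # assumed threshold for the three sample rows
frequent_items = conn_demo.execute(
    "SELECT Item, COUNT(*) AS Frequency FROM TransactionItems "
    "GROUP BY Item HAVING COUNT(*) >= ? ORDER BY Item",
    (min_support,),
).fetchall()
print(frequent_items)  # [('Restaurants', 2)]
```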


### Step 2: Find the Frequent 1-Itemsets

In [16]:
from collections import Counter

import pandas as pd


def count_item_frequencies(item_lists, min_support=1):
    """
    Count how many transactions contain each item.

    Args:
        item_lists (list of list of str): A list where each element is a list of items.
        min_support (int): The minimum support threshold for an item to be considered frequent.

    Returns:
        pd.DataFrame: A DataFrame with two columns, 'Item' and 'Frequency', listing frequent items.
    """
    # Count each item at most once per transaction; dict.fromkeys() removes
    # duplicates within a transaction while preserving first-seen order
    item_counts = Counter(
        item for sublist in item_lists for item in dict.fromkeys(sublist)
    )

    # Keep only the items that meet the minimum support threshold
    frequent_items = {item: count for item, count in item_counts.items() if count >= min_support}

    # Convert to a DataFrame for easy handling
    return pd.DataFrame(list(frequent_items.items()), columns=['Item', 'Frequency'])


                       Item  Frequency
0        Breakfast & Brunch       2738
1    American (Traditional)       4832
2               Restaurants      50142
3                Sandwiches       4728
4            Local Services       6936
..                      ...        ...
887       Farming Equipment          2
888         Nursing Schools          2
889      Editorial Services          2
890    Translation Services          2
891                 Donairs          2

[892 rows x 2 columns]


In [17]:
# Example usage with the existing DataFrame
df = pd.read_sql_query("SELECT * FROM Transactions", conn)
split_items = df['Items'].str.split(';').tolist()
frequent_1_itemsets = count_item_frequencies(split_items, min_support=1)
frequent_1_itemsets.head(10)

|   | Item | Frequency |
|---|---|---|
| 0 | Breakfast & Brunch | 2738 |
| 1 | American (Traditional) | 4832 |
| 2 | Restaurants | 50142 |
| 3 | Sandwiches | 4728 |
| 4 | Local Services | 6936 |
| 5 | IT Services & Computer Repair | 606 |
| 6 | Italian | 3696 |
| 7 | Food | 18500 |
| 8 | Coffee & Tea | 4398 |
| 9 | Fast Food | 5702 |
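Following the same pattern, the next level of the algorithm would count pairs built only from frequent single items. The helper below is a hypothetical sketch (`count_pair_frequencies` is not defined in the notebook), shown on made-up lists rather than the full dataset, and it returns a plain dictionary instead of a DataFrame to stay dependency-free.

```python
from collections import Counter
from itertools import combinations

def count_pair_frequencies(item_lists, frequent_items, min_support=1):
    """Count 2-itemsets whose members are both frequent single items."""
    frequent_items = set(frequent_items)
    pair_counts = Counter()
    for items in item_lists:
        # Drop infrequent items first (Apriori pruning), then sort so that
        # each pair is always stored in the same order
        kept = sorted(set(items) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

# Hypothetical sample in the same shape as `split_items`
sample_lists = [
    ["Restaurants", "Italian"],
    ["Restaurants", "Sandwiches"],
    ["Restaurants", "Italian", "Pizza"],
]
frequent_pairs = count_pair_frequencies(
    sample_lists, frequent_items={"Restaurants", "Italian"}, min_support=2
)
print(frequent_pairs)  # {('Italian', 'Restaurants'): 2}
```

On the real data, the `frequent_items` argument could be taken from the `Item` column of `frequent_1_itemsets`, with a `min_support` well above 1 so that rare categories are pruned before pairs are formed.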
