# **Fundamentals of Artificial Intelligence**

## MSc in Applied Artificial Intelligence 2025/2026 <br>
## Group 02 - Project 2 - Apriori
| Name                              | Student Number |
|-----------------------------------|---------------:|
| Adelino Daniel da Rocha Vilaça    | a16939          |
| António Jorge Magalhães da Rocha  | a26052          |

---
# 0. - **INTRODUCTION**

## 0.1 - Goal
> The primary goal of this project is to apply the Apriori algorithm to a cybersecurity intrusion detection dataset to discover meaningful association rules. These rules aim to identify common patterns, relationships, and sequences of events that lead to or are characteristic of cyber-attacks. By uncovering these associations, we seek to provide actionable insights that can enhance intrusion detection systems, improve threat intelligence, and inform security incident response strategies.

## 0.2 - Environment
> This project is developed within a Google Colaboratory environment, leveraging its cloud-based Jupyter Notebook infrastructure. Key Python libraries utilized include `pandas` for data manipulation, and `mlxtend` for the implementation of the Apriori algorithm and association rule generation. The environment provides access to necessary computational resources and facilitates collaborative development.

## 0.3 - Definitions
> - **Association Rule Mining**: A data mining technique used to find strong relationships between items in large datasets. It identifies frequently occurring itemsets and derives implication rules.
> - **Apriori Algorithm**: An influential algorithm for mining frequent itemsets and relevant association rules. It operates on the principle that if an itemset is frequent, then all of its subsets must also be frequent.
> - **Support**: A measure of how frequently an itemset appears in the dataset, calculated as the proportion of transactions containing the itemset.
> - **Confidence**: A measure of the reliability of an association rule, indicating how often items in the consequent appear in transactions that already contain the antecedent.
> - **Lift**: A metric that compares the frequency of occurrence of the antecedent and consequent together with the frequency with which they would occur if they were independent. A lift greater than 1 indicates a positive correlation between the items.
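
As a worked illustration of the three metrics above, the following minimal sketch computes them by hand for a single rule over five made-up transactions. The item names and the rule `{tcp} -> {attack}` are hypothetical and are not taken from the project dataset.

```python
# Toy transactions (hypothetical items, not from the project dataset).
transactions = [
    {"tcp", "chrome", "attack"},
    {"tcp", "attack"},
    {"udp", "chrome"},
    {"tcp", "chrome"},
    {"udp", "attack"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


antecedent, consequent = {"tcp"}, {"attack"}

supp_rule = support(antecedent | consequent, transactions)  # P(A and C)
confidence = supp_rule / support(antecedent, transactions)  # P(C | A)
lift = confidence / support(consequent, transactions)       # P(C | A) / P(C)

print(f"support={supp_rule:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# prints: support=0.40 confidence=0.67 lift=1.11
```

Here the lift is only slightly above 1, so in this toy data `tcp` sessions are just marginally more likely to be attacks than sessions in general.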

---
# 1. - **AGENT DESIGN**

## 1.1 - Platforms

### 1.1.1 - Jupyter Notebook <br>
 > A Jupyter Notebook is an open-source web application that allows creating and sharing documents containing live code, equations, visualizations, and narrative text. It's widely used in data science, machine learning, and scientific computing for interactive development, exploration, and documentation.

### 1.1.2 - Google Colaboratory <br>
 > A free cloud-based service that provides a hosted Jupyter Notebook environment. It allows writing and executing code in a browser without any setup.

## 1.2 Packages and Libraries


### Apriori Python libraries

**mlxtend**: Implements many machine learning algorithms and tools, including association rule mining.

**apyori**: Provides functions for manipulating transactional data and for generating association rules and evaluating their quality.

**PyCaret**: Low-code ML library for automating machine learning workflows. Earlier releases wrapped mlxtend to simplify running the Apriori algorithm, but the current version (3.2.0) does not support association rules. See https://pycaret.org/ for details.


# Unsupervised Learning - APRIORI

This notebook presents examples of the use of the well-known Apriori algorithm.
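
For orientation, the level-wise search behind Apriori can be sketched in plain Python. This is a simplified illustration under our own assumptions, not the `mlxtend` implementation used later in the notebook; the function name `apriori_sketch` and the toy transactions are hypothetical.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Level-wise frequent-itemset search with downward-closure pruning."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = sorted({item for t in transactions for item in t})
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: merge pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori_sketch(toy, min_support=0.5)
for itemset, supp in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

With this toy input the six itemsets of size one and two are reported, while `{a, b, c}` (support 0.4) falls below the threshold and ends the search.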


### Load the data

### Usage of `kaggle.json`

The `kaggle.json` file serves as the **authentication token** for the Kaggle API. It stores the user's Kaggle credentials: the username and the API key.
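
Before uploading, it can be useful to confirm that the file has the expected shape. The sketch below assumes the token is a JSON object with `username` and `key` string fields; `check_kaggle_token` is a hypothetical helper of ours, not part of the Kaggle tooling.

```python
import json
from pathlib import Path


def check_kaggle_token(path):
    """Return True if `path` is a JSON object with non-empty 'username' and 'key' strings."""
    try:
        data = json.loads(Path(path).read_text())
    except (OSError, ValueError):
        # Missing or unreadable file, or content that is not valid JSON.
        return False
    return isinstance(data, dict) and all(
        isinstance(data.get(field), str) and data[field]
        for field in ("username", "key")
    )
```

Calling `check_kaggle_token("kaggle.json")` returns `False` rather than raising when the file is missing or malformed, which keeps the check safe to run in a notebook cell.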

In [None]:
from google.colab import files
files.upload()  # kaggle.json

### Token Permissions

In [None]:
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets list | head

In [None]:
!kaggle datasets download -d dnkumars/cybersecurity-intrusion-detection-dataset -p /content/ --unzip

In [None]:
# Imports
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

In [None]:
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)

## 1.3 DATASET


### 1.3.1 Dataset Used

> This dataset, named `cybersecurity_intrusion_data.csv`, focuses on **cybersecurity intrusion detection**.

* It contains information about network sessions, such as packet size, protocol type, session duration, and login attempts.
* It includes details about user behavior, such as browser type and failed login attempts.
* The main objective is to classify whether a given session is an attack; `attack_detected` is the target variable.

---
# 2. - **AGENT RUNNING**

In [None]:
# Read the file
dataset = pd.read_csv('cybersecurity_intrusion_data.csv')
df = dataset

# We will use 'df' as the base for our Apriori analysis.
apriori_df = df.copy()

# Preview the last rows of the working copy
apriori_df.tail()

In [None]:
# Display the number of nulls in each column
print(apriori_df.isnull().sum())

In [None]:
# Remove the rows with null values
apriori_df = apriori_df.dropna()

# No Date filtering is needed for this dataset

In [None]:
# Number of rows
print("Number of rows:", len(apriori_df))

# Number of distinct values
print("\nNumber of distinct values:")
print("session_id:         ", apriori_df['session_id'].nunique())
print("protocol_type:      ", apriori_df['protocol_type'].nunique())
print("encryption_used:    ", apriori_df['encryption_used'].nunique())
print("browser_type:       ", apriori_df['browser_type'].nunique())
print("unusual_time_access:", apriori_df['unusual_time_access'].nunique())
print("attack_detected:    ", apriori_df['attack_detected'].nunique())

### Pivot table
To use the Apriori algorithm, we need to pivot the data into a table with one row per session and one column per item.

If the item is present in the session, the intersection cell will be `True`. If it is not, it will be `False`.
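
To make the target layout concrete, here is a minimal sketch on hypothetical toy data. It uses `pd.crosstab` as one possible way to build such a boolean table, as an alternative to the `pivot_table` call used in this notebook; the session IDs and items are made up.

```python
import pandas as pd

# Long-format toy data: one row per (session, item) pair (hypothetical values).
long_df = pd.DataFrame({
    "session_id": ["S1", "S1", "S2", "S2", "S3"],
    "item": [
        "protocol_type_TCP", "attack_detected_1",
        "protocol_type_UDP", "attack_detected_0",
        "protocol_type_TCP",
    ],
})

# Count occurrences per session and item, then turn the counts into booleans.
basket = pd.crosstab(long_df["session_id"], long_df["item"]).astype(bool)
print(basket)
```

Each row of `basket` is one session and each column is one item, which is the input shape expected by `mlxtend`'s `apriori` function.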

In [None]:
# The 'ds_grouped' DataFrame is built in the next cell from the
# transactional view of our cybersecurity dataset.
# It is then used to create the pivot table in the step after that.

In [None]:
# Prepare data for Apriori: Discretization and creating transaction lists
apriori_df_processed = apriori_df.copy()

# Discretize 'failed_logins' into the suggested groups: "0", "1", "2", "3+"
# With right=False the bins are left-closed: [0, 1), [1, 2), [2, 3), [3, inf),
# so a count of 0 gets the label 'failed_logins_0' and so on.
bins_failed_logins = [0, 1, 2, 3, float('inf')]
labels_failed_logins = ['failed_logins_0', 'failed_logins_1', 'failed_logins_2', 'failed_logins_3+']
apriori_df_processed['failed_logins_bin'] = pd.cut(
    apriori_df_processed['failed_logins'],
    bins=bins_failed_logins,
    labels=labels_failed_logins,
    right=False
)

# Discretize other numerical attributes using quantiles (e.g., 4 bins for now)
numerical_cols_to_bin = [
    'network_packet_size',
    'login_attempts',
    'session_duration',
    'ip_reputation_score'
]

for col in numerical_cols_to_bin:
    try:
        # Create 4 quantile-based bins
        apriori_df_processed[f'{col}_bin'] = pd.qcut(
            apriori_df_processed[col],
            q=4,
            duplicates='drop', # Handle cases with identical values across quantiles
            labels=[f'{col}_Q1', f'{col}_Q2', f'{col}_Q3', f'{col}_Q4']
        )
    except ValueError as e:
        print(f"Warning: Could not qcut column '{col}' due to: {e}. Falling back to equal-width binning.")
        apriori_df_processed[f'{col}_bin'] = pd.cut(
            apriori_df_processed[col],
            bins=4,
            labels=[f'{col}_B1', f'{col}_B2', f'{col}_B3', f'{col}_B4'],
            include_lowest=True
        )

# Identify all columns that will serve as 'items' for association rule mining
# Exclude 'session_id' and original numerical columns
item_columns = [
    'protocol_type',
    'encryption_used',
    'browser_type',
    'unusual_time_access',
    'attack_detected',
    'failed_logins_bin',
    'network_packet_size_bin',
    'login_attempts_bin',
    'session_duration_bin',
    'ip_reputation_score_bin'
]

# Ensure all identified item_columns exist in the DataFrame
item_columns = [col for col in item_columns if col in apriori_df_processed.columns]

# Convert all item columns to string to prepare for item list creation
for col in item_columns:
    apriori_df_processed[col] = apriori_df_processed[col].astype(str)


# Structure grouped data
transaction_items = []
for index, row in apriori_df_processed.iterrows():
    session_id = row['session_id']
    for col in item_columns:
        item_value = row[col]
        # Create a unique item string, e.g., 'protocol_type_TCP', 'browser_type_Chrome'
        transaction_items.append({'session_id': session_id, 'item': f'{col}_{item_value}'})

ds_grouped = pd.DataFrame(transaction_items)

# Display the head of this 'grouped' data
ds_grouped.head()
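
The row-by-row loop above can become slow on larger datasets. A possible vectorized alternative, sketched here on a hypothetical three-row frame, uses `DataFrame.melt` to produce the same long `session_id`/`item` layout. The names `toy` and `toy_item_columns` are ours and stand in for `apriori_df_processed` and `item_columns`.

```python
import pandas as pd

# Hypothetical stand-in for the processed DataFrame (values already strings).
toy = pd.DataFrame({
    "session_id": ["S1", "S2", "S3"],
    "protocol_type": ["TCP", "UDP", "TCP"],
    "attack_detected": ["1", "0", "0"],
})
toy_item_columns = ["protocol_type", "attack_detected"]

# Unpivot: one row per (session, column) pair with the cell value alongside.
long_df = toy.melt(
    id_vars="session_id",
    value_vars=toy_item_columns,
    var_name="column",
    value_name="value",
)

# Build the item string, e.g. 'protocol_type_TCP'.
long_df["item"] = long_df["column"] + "_" + long_df["value"].astype(str)
long_df = long_df[["session_id", "item"]]
print(long_df)
```

The rows come out grouped by column rather than by session, which does not matter for the subsequent pivot step.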

In [None]:
# Create apriori data structure
ds_pivot = ds_grouped.pivot_table(index='session_id', columns='item', aggfunc=lambda x: True, fill_value=False)
ds_pivot.tail(5)

---
## Learning Rules (Association Rule Learning)

In [None]:
# Find the frequent itemsets
min_support=0.01
freq_itemsets = apriori(ds_pivot, min_support=min_support, use_colnames=True)

# Get the number of itemsets in freq_itemsets
num_itemsets = len(freq_itemsets)
print(f'Number of itemsets: {num_itemsets}')
display(freq_itemsets.head())

# Get the rules
rules = association_rules(freq_itemsets, metric="support", min_threshold=min_support)
display(rules.sort_values('support', ascending=False).head(10))

In [None]:
# List the 10 rules with the highest confidence
rules.sort_values('confidence', ascending=False).head(10)

In [None]:
def get_rules_for_item(item_name):
    """Return the top 10 rules (by confidence) whose antecedents include item_name."""

    # Keep rules where one of the antecedent items is exactly equal to item_name
    filtered_rules = rules[rules['antecedents'].apply(lambda x: any(item_name == str(i) for i in x))]

    # Prepare the output
    results = []
    for index, row in filtered_rules.iterrows():
        rule_info = {}
        # Items are already descriptive, so we convert frozenset to a comma separated string
        rule_info['antecedents'] = ", ".join(map(str, row['antecedents']))
        rule_info['consequents'] = ", ".join(map(str, row['consequents']))
        rule_info['support'] = row['support']
        rule_info['confidence'] = row['confidence']
        results.append(rule_info)

    # Convert the results to a DataFrame and sort by confidence
    result_df = pd.DataFrame(results)
    if not result_df.empty:
        result_df = result_df.sort_values('confidence', ascending=False).head(10)

    return result_df

# Example usage with a relevant item from our dataset (e.g., 'protocol_type_TCP')
example_item = 'protocol_type_TCP'
item_rules = get_rules_for_item(example_item)
item_rules

### Inspecting Rules Related to Attack Detection

Since our project focuses on cybersecurity intrusion detection, let's look specifically at association rules whose consequent is `attack_detected_1` (meaning an attack was detected). These rules can be highly valuable for identifying patterns that are strongly associated with security incidents.

In [None]:
# Filter rules where 'attack_detected_1' is in the consequents
attack_detection_rules = rules[rules['consequents'].apply(lambda x: 'attack_detected_1' in x)]

# Sort these rules by lift and then by confidence to find the most interesting ones
attack_detection_rules_sorted = attack_detection_rules.sort_values(
    by=['lift', 'confidence'], ascending=False
)

print(f"Number of rules where 'attack_detected_1' is a consequent: {len(attack_detection_rules)}")

# Display the top 10 rules related to attack detection
display(attack_detection_rules_sorted.head(10))

#### Interpretation of `attack_detection_rules_sorted`:

*   **High Lift**: Pay close attention to rules with a `lift` significantly greater than 1. These are the patterns where the `antecedents` occur *together with* `attack_detected_1` much more often than would be expected if they were independent. This indicates a strong, potentially interesting relationship.
*   **High Confidence**: Alongside `lift`, `confidence` tells us how reliable the rule is. A high confidence means that when the `antecedents` are present, `attack_detected_1` is very likely to be present as well.
*   **Support**: While less critical for filtering than lift and confidence in this context, support still indicates how frequently the pattern occurs in the dataset. Some highly predictive rules (high lift, high confidence) may have low support, meaning they are rare but still important indicators when they do occur.

These rules can help us understand which combinations of factors (e.g., protocol type, browser, login attempts, IP reputation) are most strongly associated with attack detection.
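
To illustrate why confidence should be read against the base rate of attacks, the sketch below computes the three metrics from raw counts for two made-up rules. All counts are hypothetical and `rule_metrics` is a helper of ours, not an `mlxtend` function.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Support, confidence and lift of a rule A -> C from raw transaction counts."""
    support = n_both / n_total
    confidence = n_both / n_antecedent
    lift = confidence / (n_consequent / n_total)
    return support, confidence, lift


# Hypothetical data: 1,000 sessions, 450 of them flagged as attacks (base rate 0.45).

# Rule X: a frequent antecedent whose attack rate barely exceeds the base rate.
print(rule_metrics(1000, 600, 450, 282))  # confidence 0.47, lift about 1.04

# Rule Y: a rare antecedent that is almost always accompanied by an attack.
print(rule_metrics(1000, 40, 450, 36))    # confidence 0.90, lift about 2.0
```

Rule X has far higher support but says little beyond the base rate, whereas Rule Y is rare yet doubles the likelihood of an attack, which is the kind of rule the sorting by lift is meant to surface.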

### Visualization of Association Rules

A scatter plot is a powerful way to visualize association rules, allowing us to simultaneously observe the relationships between `support`, `confidence`, and `lift`. This helps in identifying rules that are frequent, reliable, and have a strong, non-random association.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Filter rules to make the plot more readable: keep positive associations
# (lift > 1.2) that are also reliable (confidence > 0.7).
# A looser alternative: filtered_rules = rules[rules['lift'] > 1]
filtered_rules = rules[(rules['lift'] > 1.2) & (rules['confidence'] > 0.7)]

plt.figure(figsize=(14, 9))
scatter = sns.scatterplot(
    x='support',
    y='confidence',
    size='lift',
    hue='lift',
    data=filtered_rules,
    palette='viridis',
    sizes=(50, 600), # Adjust size range as needed
    legend='brief'
)

plt.title('Association Rules: Support vs. Confidence (Colored by Lift)')
plt.xlabel('Support')
plt.ylabel('Confidence')
plt.grid(True, linestyle='--', alpha=0.6)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)

# Rule with the highest lift:
# max_lift_rule = filtered_rules.loc[filtered_rules['lift'].idxmax()]
# plt.annotate(f'Max Lift: {max_lift_rule.name}',
#              (max_lift_rule['support'], max_lift_rule['confidence']),
#              textcoords="offset points", xytext=(0,10), ha='center')

plt.show()

### Example Approach



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Ensure filtered_rules is defined (from the previous cell: lift > 1.2 and confidence > 0.7)
if 'filtered_rules' not in locals():
    print("Warning: 'filtered_rules' DataFrame not found. Please ensure previous cells are run.")
    # Fallback if filtered_rules is not available (though it should be if prior cells run)
    filtered_rules = rules[rules['lift'] > 1]

# Sort by lift and then confidence to get the most interesting rules
top_rules_for_plot = filtered_rules.sort_values(by=['lift', 'confidence'], ascending=False).head(1000)

plt.figure(figsize=(14, 9))
scatter = sns.scatterplot(
    x='support',
    y='confidence',
    size='lift',
    hue='lift',
    data=top_rules_for_plot,
    palette='viridis',
    sizes=(50, 600),
    legend='brief'
)

plt.title('Top 1000 Association Rules: Support vs. Confidence (Colored by Lift)')
plt.xlabel('Support')
plt.ylabel('Confidence')
plt.grid(True, linestyle='--', alpha=0.6)

# Move the legend outside the plot area to the right
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)

# Optionally, add annotations for the very top rules if desired
# for i, row in top_rules_for_plot.head(5).iterrows():
#     plt.annotate(f"Lift: {row['lift']:.2f}", (row['support'], row['confidence']),
#                  textcoords="offset points", xytext=(5,5), ha='left', fontsize=8)

plt.show()
print(f"Plotted {len(top_rules_for_plot)} rules.")

---
# 3. - **CONCLUSION**

## 3.1 Overall
> This project successfully applied the Apriori algorithm to identify significant association rules within a cybersecurity intrusion detection dataset. We processed raw data, handled missing values, and discretized numerical features to transform the dataset into a transactional format suitable for Apriori. The analysis revealed various patterns, particularly highlighting conditions strongly associated with attack detection, providing valuable insights for security monitoring and incident response.

## 3.2 Challenges and solutions
> One key challenge was the transformation of numerical and categorical data into a format compatible with association rule mining. This was addressed by discretizing numerical columns into meaningful bins (e.g., quantiles for 'network_packet_size' and specific ranges for 'failed_logins') and encoding categorical variables. Another challenge was the interpretation of a large number of generated rules; this was mitigated by filtering and sorting rules based on metrics like lift and confidence, especially focusing on rules where 'attack_detected_1' was the consequent.
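
The two discretization styles mentioned above can be sketched on small made-up series. The values below are hypothetical, and the bin edges assume left-closed integer ranges for the login counts.

```python
import pandas as pd

# Explicit ranges for a count variable: "0", "1", "2", "3+".
failed = pd.Series([0, 0, 1, 2, 3, 5], name="failed_logins")
explicit = pd.cut(
    failed,
    bins=[0, 1, 2, 3, float("inf")],  # left-closed: [0,1), [1,2), [2,3), [3,inf)
    labels=["0", "1", "2", "3+"],
    right=False,
)

# Quantile-based bins for a continuous variable: four equally populated groups.
durations = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0])
quartiles = pd.qcut(durations, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(explicit.tolist())
# prints: ['0', '0', '1', '2', '3+', '3+']
print(quartiles.value_counts().sort_index().tolist())
# prints: [2, 2, 2, 2]
```

Explicit edges keep the labels meaningful for small counts, while quantile bins guarantee that each item has comparable support, which matters when a minimum support threshold is applied.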

## 3.3 Looking forward
> Future work could involve exploring different binning strategies for numerical data to see how it impacts rule generation. Additionally, integrating other advanced association rule mining algorithms (e.g., FP-Growth for potentially larger datasets) could be beneficial. Further investigation into the specific antecedents leading to attack detection, perhaps by domain experts, could lead to more actionable security intelligence.

## 3.4 In hindsight
> Reflecting on the project, the initial data preparation phase, particularly the discretization of continuous variables, was more critical than anticipated for generating coherent and interpretable rules. Understanding the domain context thoroughly before defining bins for numerical data would have streamlined the process. The visualization of association rules proved invaluable for quickly grasping the most significant patterns rather than sifting through a large tabular output.

<br>

Thank you, Professor Joaquim :)