# Notebook 3 – Association Rules on Super Bowl Data

## 1. Introduction

This notebook focuses on the application of **association rule mining techniques** to the historical Super Bowl dataset (1967–2020). While previous notebooks explored the data through exploratory analysis and clustering, this notebook aims to identify **frequent patterns and relationships between variables** using the **Apriori algorithm**.

Association rules allow the discovery of items that tend to occur together, providing insights into hidden relationships within the data. In the context of Super Bowl games, these patterns can reveal associations between game characteristics such as teams, venues, scoring outcomes, and other categorical attributes.

The workflow followed in this notebook includes:

- Definition of the business goal
- Data selection and preparation for association rule mining
- Application and evaluation of the Apriori algorithm
- Adjustment of algorithm parameters
- Analysis of intermediate and final results
- Discussion of performance metrics and lessons learned

The insights obtained from this analysis can support sports analysts and enthusiasts in understanding recurring patterns and co-occurrences in Super Bowl history.

## 2. Dataset Loading and Initial Inspection

In [21]:
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Load the dataset
superbowl_df = pd.read_csv("superbowl.csv")

# Display the first rows of the dataset
superbowl_df.head()

# Inspect data types and check for missing values
superbowl_df.info()

# Summary statistics for numerical columns
superbowl_df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54 entries, 0 to 53
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Date        54 non-null     object
 1   SB          54 non-null     object
 2   Winner      54 non-null     object
 3   Winner Pts  54 non-null     int64 
 4   Loser       54 non-null     object
 5   Loser Pts   54 non-null     int64 
 6   MVP         54 non-null     object
 7   Stadium     54 non-null     object
 8   City        54 non-null     object
 9   State       54 non-null     object
dtypes: int64(2), object(8)
memory usage: 4.3+ KB


| | Winner Pts | Loser Pts |
|---|---|---|
| count | 54.0 | 54.0 |
| mean | 30.111111 | 16.203704 |
| std | 9.766455 | 7.413348 |
| min | 13.0 | 3.0 |
| 25% | 23.25 | 10.0 |
| 50% | 30.5 | 17.0 |
| 75% | 35.0 | 21.0 |
| max | 55.0 | 33.0 |


## 3. Business Goal

The primary objective of this notebook is to identify and analyze association patterns among categorical attributes of Super Bowl games. 
These attributes include winners, losers, stadiums, cities, and states. By applying association rule mining, we aim to uncover hidden 
relationships and recurring patterns that may provide insights into historical trends and outcomes.

Specifically, this analysis seeks to:

- Discover frequent associations between game attributes using the Apriori algorithm.
- Evaluate the discovered associations through key metrics: support, confidence, and lift.
- Interpret the most relevant and interesting rules to extract actionable insights.
- Discuss the significance of the performance metrics and highlight lessons learned from the analysis.

Overall, the goal is to provide a comprehensive understanding of how different game characteristics are related and to identify 
patterns that could support sports analysts, historians, or enthusiasts in interpreting Super Bowl data.


## 4. Data Preparation

The `apriori` function in mlxtend expects a one-hot encoded DataFrame whose values are boolean (or 0/1). 
Therefore, we need to transform the relevant categorical columns into binary variables, where each category becomes a separate column 
indicating presence (`True`/1) or absence (`False`/0) for each game.
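
As a rough illustration of what this encoding produces, the sketch below builds the same kind of binary table in plain Python. The three records and their values are hypothetical and are not rows from `superbowl.csv`; item names follow the `Column_value` pattern that `pd.get_dummies` uses, although the column order here is simply alphabetical.

```python
# Toy records with hypothetical values (not taken from superbowl.csv).
games = [
    {"Winner": "Team A", "State": "Florida"},
    {"Winner": "Team B", "State": "Florida"},
    {"Winner": "Team A", "State": "Texas"},
]


def to_items(game):
    """Turn one record into a set of 'Column_value' items."""
    return {f"{col}_{val}" for col, val in game.items()}


# Every distinct item becomes one column of the binary table.
items = sorted(set().union(*(to_items(game) for game in games)))

# One row per game: True where the game contains the item, False otherwise.
encoded = [{item: item in to_items(game) for item in items} for game in games]

print(items)
for row in encoded:
    print([int(row[item]) for item in items])
```

Each game ends up with exactly one `True` per original column, which is why the real encoded table is wide (111 columns) but sparse.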

For this analysis, we select the following categorical columns from the Super Bowl dataset:

- `Winner`: the team that won the game.
- `Loser`: the team that lost the game.
- `Stadium`: the venue where the game was played.
- `City`: the city hosting the game.
- `State`: the state hosting the game.

In [22]:
# Select relevant categorical columns
df_sb = superbowl_df[['Winner', 'Loser', 'Stadium', 'City', 'State']]

# Convert categorical variables into one-hot encoded format
basket = pd.get_dummies(df_sb)

# Inspect the prepared dataset
basket.info()

# Display the top 10 most frequent categories
basket.sum().sort_values(ascending=False).head(10)



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54 entries, 0 to 53
Columns: 111 entries, Winner_Baltimore Colts to State_Texas
dtypes: bool(111)
memory usage: 6.0 KB


State_Florida                  16
State_California               12
State_Louisiana                10
City_New Orleans               10
Winner_New England Patriots     6
Winner_Pittsburgh Steelers      6
City_Miami Gardens              6
Stadium_Orange Bowl             5
Stadium_Rose Bowl               5
Stadium_Louisiana Superdome     5
dtype: int64

## 5. Application of the Apriori Algorithm

To discover frequent patterns among the categorical attributes of Super Bowl games, we apply the Apriori algorithm 
to the one-hot encoded dataset. 

Given the small size of the dataset (54 games), we set a relatively low minimum support of 0.05. Because 0.05 × 54 = 2.7, 
an itemset must appear in at least 3 games to qualify. This keeps rare but potentially interesting patterns, at the cost of 
admitting itemsets that are backed by very few games.
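
The relationship between a support threshold and the number of games it implies can be checked with a few lines of arithmetic. In this sketch, `min_count` is a hypothetical helper and the thresholds 0.03 and 0.10 are illustrative alternatives that the notebook does not use.

```python
import math

n_games = 54  # number of rows reported by superbowl_df.info()


def min_count(min_support, n_rows):
    """Smallest row count such that count / n_rows >= min_support."""
    # Rounding first guards against float noise such as 0.1 * 30 == 3.0000000000000004.
    return math.ceil(round(min_support * n_rows, 9))


for threshold in (0.03, 0.05, 0.10):
    print(f"min_support={threshold:.2f} -> at least {min_count(threshold, n_games)} games")
```

With only 54 rows, small changes in the threshold move the cut-off by whole games, so the choice of `min_support` has a large effect on which itemsets survive.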

In [23]:
# Generate frequent itemsets with minimum support of 0.05
frequent_itemsets = apriori(basket, min_support=0.05, use_colnames=True)

# Sort and display the top 10 frequent itemsets by support
frequent_itemsets.sort_values('support', ascending=False).head(10)

| | support | itemsets |
|---|---|---|
| 28 | 0.296296 | (State_Florida) |
| 27 | 0.222222 | (State_California) |
| 30 | 0.185185 | (State_Louisiana) |
| 22 | 0.185185 | (City_New Orleans) |
| 48 | 0.185185 | (State_Louisiana, City_New Orleans) |
| 21 | 0.111111 | (City_Miami Gardens) |
| 3 | 0.111111 | (Winner_New England Patriots) |
| 5 | 0.111111 | (Winner_Pittsburgh Steelers) |
| 47 | 0.111111 | (State_Florida, City_Miami Gardens) |
| 10 | 0.092593 | (Loser_Denver Broncos) |
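
To make the frequent-itemset step more concrete, here is a minimal pure-Python sketch of the level-wise search that Apriori performs. It is a simplified illustration and not mlxtend's implementation; the function name `apriori_sketch` and the toy transactions are assumptions made for this example.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    all_items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in all_items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical transactions, one set of items per game.
toy = [
    {"City_X", "State_P"},
    {"City_X", "State_P"},
    {"City_Y", "State_P"},
    {"City_Z", "State_Q"},
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, sup in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), sup)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are, so infrequent items such as `City_Y` never appear in larger candidates.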


## 6. Generation and Evaluation of Association Rules

Once frequent itemsets have been identified using the Apriori algorithm, we can generate association rules to explore 
relationships between attributes in the dataset.

Key metrics used to evaluate these rules:

- **Support**: The proportion of games in which the itemset (antecedent and consequent together) appears.  
- **Confidence**: The probability that the consequent occurs given the antecedent.  
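- **Lift**: The ratio of the rule's confidence to the overall support of the consequent. A value greater than 1 means the antecedent and consequent occur together more often than would be expected if they were independent; a value of 1 indicates independence.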
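
The three metrics can be computed directly from counts. The sketch below uses a hypothetical helper, `rule_metrics`, with illustrative counts: an antecedent seen in 4 of 54 games, a consequent seen in 3, and both together in 3. These counts were chosen to be consistent with the supports reported for `State_Texas → City_Houston` in the rule output.

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Compute support, confidence and lift of a rule from raw counts."""
    support = n_both / n_total                    # P(antecedent and consequent)
    confidence = n_both / n_antecedent            # P(consequent | antecedent)
    lift = confidence / (n_consequent / n_total)  # confidence / P(consequent)
    return support, confidence, lift


support, confidence, lift = rule_metrics(
    n_total=54, n_antecedent=4, n_consequent=3, n_both=3
)
print(f"support={support:.6f} confidence={confidence:.2f} lift={lift:.1f}")
```

Because lift divides by the consequent's support, rare consequents can produce very large lift values even when the rule rests on only a few games.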


In [24]:
# Generate association rules with a minimum lift of 1
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)

# Sort rules by lift in descending order
rules.sort_values('lift', ascending=False).head(10)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | (State_Georgia) | (City_Atlanta) | 0.055556 | 0.055556 | 0.055556 | 1.0 | 18.0 | 1.0 | 0.052469 | inf | 1.0 | 1.0 | 1.0 | 1.0 |
| 25 | (City_Atlanta) | (State_Georgia) | 0.055556 | 0.055556 | 0.055556 | 1.0 | 18.0 | 1.0 | 0.052469 | inf | 1.0 | 1.0 | 1.0 | 1.0 |
| 26 | (State_Texas) | (City_Houston) | 0.074074 | 0.055556 | 0.055556 | 0.75 | 13.5 | 1.0 | 0.05144 | 3.777778 | 1.0 | 0.75 | 0.735294 | 0.875 |
| 27 | (City_Houston) | (State_Texas) | 0.055556 | 0.074074 | 0.055556 | 1.0 | 13.5 | 1.0 | 0.05144 | inf | 0.980392 | 0.75 | 1.0 | 0.875 |
| 3 | (Loser_Dallas Cowboys) | (Stadium_Orange Bowl) | 0.055556 | 0.092593 | 0.055556 | 1.0 | 10.8 | 1.0 | 0.050412 | inf | 0.960784 | 0.6 | 1.0 | 0.8 |
| 2 | (Stadium_Orange Bowl) | (Loser_Dallas Cowboys) | 0.092593 | 0.055556 | 0.055556 | 0.6 | 10.8 | 1.0 | 0.050412 | 2.361111 | 1.0 | 0.6 | 0.576471 | 0.8 |
| 5 | (Loser_Dallas Cowboys) | (City_Miami) | 0.055556 | 0.092593 | 0.055556 | 1.0 | 10.8 | 1.0 | 0.050412 | inf | 0.960784 | 0.6 | 1.0 | 0.8 |
| 4 | (City_Miami) | (Loser_Dallas Cowboys) | 0.092593 | 0.055556 | 0.055556 | 0.6 | 10.8 | 1.0 | 0.050412 | 2.361111 | 1.0 | 0.6 | 0.576471 | 0.8 |
| 12 | (City_Miami) | (Stadium_Orange Bowl) | 0.092593 | 0.092593 | 0.092593 | 1.0 | 10.8 | 1.0 | 0.084019 | inf | 1.0 | 1.0 | 1.0 | 1.0 |
| 13 | (Stadium_Orange Bowl) | (City_Miami) | 0.092593 | 0.092593 | 0.092593 | 1.0 | 10.8 | 1.0 | 0.084019 | inf | 1.0 | 1.0 | 1.0 | 1.0 |


## 8. Conclusion

In this notebook, we applied **association rule mining** using the Apriori algorithm to the historical Super Bowl dataset, 
analyzing patterns in categorical attributes such as winning teams, losing teams, stadiums, cities, and states.

### 8.1 Analysis of Results

* Frequent itemsets and association rules revealed meaningful patterns, particularly in **geographic relationships**, such as consistent associations between cities and states, and between stadiums and cities.  
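* The rules with the highest **lift values** mostly restate geographic containment. Their lift is large mainly because the items involved are rare: a consequent seen in only 3 of 54 games allows a maximum lift of 18. For example:
  * (State_Georgia → City_Atlanta), with confidence 1.0 and lift 18, indicates that all games in Georgia occurred in Atlanta.
  * (City_Miami → Stadium_Orange Bowl), with confidence 1.0 and lift 10.8, shows that all games listed under Miami were played in the Orange Bowl.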
* Support values were generally low because of the small dataset size (54 games). Confidence and lift help rank the associations, but a rule backed by only 3 games is better read as a historical observation than as a statistically reliable pattern.
* The analysis shows how categorical patterns can be discovered systematically with association rules. For example, (Loser_Dallas Cowboys → Stadium_Orange Bowl) has confidence 1.0: every Dallas Cowboys loss in the dataset took place at the Orange Bowl.
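
Since many top-ranked rules only relate one location attribute to another, one possible follow-up (not part of the original workflow) is to set those rules aside before interpretation. The sketch below assumes the `Stadium_`, `City_` and `State_` prefixes produced by `pd.get_dummies`; the helper `is_location_only` is hypothetical, and the three sample rules are copied from the rule table.

```python
# Prefixes of the venue/location columns, as named by pd.get_dummies (assumed).
LOCATION_PREFIXES = ("Stadium_", "City_", "State_")


def is_location_only(antecedent, consequent):
    """True when every item on both sides of the rule is a location attribute."""
    return all(item.startswith(LOCATION_PREFIXES) for item in antecedent | consequent)


# Three rules from the rule table, as (antecedent, consequent) pairs.
rules_sample = [
    (frozenset({"State_Georgia"}), frozenset({"City_Atlanta"})),
    (frozenset({"City_Miami"}), frozenset({"Stadium_Orange Bowl"})),
    (frozenset({"Loser_Dallas Cowboys"}), frozenset({"Stadium_Orange Bowl"})),
]

non_trivial = [rule for rule in rules_sample if not is_location_only(*rule)]
for antecedent, consequent in non_trivial:
    print(sorted(antecedent), "->", sorted(consequent))
```

Because the `antecedents` and `consequents` columns returned by `association_rules` hold frozensets, the same predicate could likely be applied row by row to the `rules` DataFrame, for instance with `rules.apply(lambda r: is_location_only(r["antecedents"], r["consequents"]), axis=1)`.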

### 8.2 Lessons Learned

1. **Impact of Dataset Size:** A small number of observations limits the frequency of itemsets and affects support, requiring careful adjustment of Apriori parameters to generate meaningful rules.  

2. **Importance of Data Encoding:** Transforming categorical attributes into one-hot encoding was essential for the Apriori algorithm to process the dataset effectively.  

3. **Parameter Tuning:** Setting an appropriate minimum support and lift threshold was critical to uncover patterns without overwhelming the analysis with trivial or spurious rules.  

4. **Interpreting Metrics:** Reading support, confidence, and lift together helped us separate rules that merely restate geography or rest on a handful of games from associations that deserve a closer look.  

5. **Limitations:** Some rules reflect historical coincidences rather than generalizable trends due to the limited number of games. Future work could include a larger dataset (e.g., regular season games) or additional attributes such as MVPs or points to explore more complex associations.