# Notebook 1 — Automatic Classification
## 1. Introduction

> This first notebook of Project 02 focuses on the application of **automatic classification** techniques to sports data, using a historical dataset of NFL Super Bowl games.
>
> The main goal of this notebook is to develop machine learning models capable of **classifying a Super Bowl game** based on defined criteria, such as:
>
> - predicting whether the winning team scored above a certain threshold;
>
> - classifying games as competitive or one-sided;
>
> - predicting categories related to game performance based on available variables.
>
> This notebook demonstrates the ability to apply supervised machine learning methods, prepare data for classification, compare multiple algorithms, and interpret model results in a real sports context.

**Group members:**
- Pedro Ribeiro — student number 27960  
- Ricardo Fernandes — student number 27961  
- Carolina Branco — student number 27983  
- João Barbosa — student number 27964  
- Diogo Abreu — student number 27975  

## 2. Datasets
### 2.1 Dataset Source

The dataset used in this notebook comes from Kaggle, where it is referenced in the **“Superbowl History Analysis”** notebook.

Link to the source: https://www.kaggle.com/code/ahmadjaved097/superbowl-history-analysis/notebook

### 2.2 Dataset Description

The dataset includes detailed information for every Super Bowl played between 1967 and 2020.
It contains attributes such as:

- Game date

- Super Bowl identifier

- Winning and losing teams

- Points scored by each team

- MVP of the game

- Stadium, city, and state where the game was hosted

The dataset contains approximately 54 records, one for each Super Bowl.

### 2.3 Metadata

| Attribute | Type | Description | 
|----------|--------------|--------|
| Date | Date | Date of the Super Bowl | 
| SB | Categorical | Official Super Bowl identifier |
| Winner | Categorical | Winning team |
| Winner Pts | Numeric | Points scored by the winner |
| Loser | Categorical | Losing team |
| Loser Pts | Numeric | Points scored by the loser |
| MVP | Categorical | Most Valuable Player of the game |
| Stadium | Categorical | Stadium hosting the game |
| City | Categorical | City of the stadium |
| State | Categorical | State of the stadium |
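
The two numeric columns above, `Winner Pts` and `Loser Pts`, are the natural starting point for the classification targets listed in the introduction. The following is a minimal sketch of how such labels could be derived. The sample rows, the 30-point threshold, the 7-point margin, and the new column names are all assumptions chosen for illustration, not values taken from the project.

```python
import pandas as pd

# Small hypothetical sample that mimics the dataset's two score columns
sample_df = pd.DataFrame({
    "Winner Pts": [31, 20, 55, 13],
    "Loser Pts": [20, 17, 10, 3],
})

# Assumed cut-offs; tune them after inspecting the real score distributions
HIGH_SCORE_THRESHOLD = 30   # winner scored "above a certain threshold"
CLOSE_GAME_MARGIN = 7       # margin at or below this counts as "competitive"


def add_targets(df, score_threshold=HIGH_SCORE_THRESHOLD, close_margin=CLOSE_GAME_MARGIN):
    """Return a copy of df with a point difference and two binary target columns."""
    out = df.copy()
    out["Point Diff"] = out["Winner Pts"] - out["Loser Pts"]
    # 1 if the winner scored more than the threshold, else 0
    out["High Scoring Winner"] = (out["Winner Pts"] > score_threshold).astype(int)
    # 1 if the game was decided by a small margin, else 0
    out["Competitive"] = (out["Point Diff"] <= close_margin).astype(int)
    return out


labeled_df = add_targets(sample_df)
print(labeled_df)
```

Keeping the thresholds as function parameters makes it easy to test how sensitive the class balance is to each cut-off, which matters with only about 54 games.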

## 3. Exploratory Data Analysis (EDA)

The exploratory analysis in this notebook aims to understand the behavior of the variables that will feed the classification models.

### 3.1 Import Required Libraries


In [None]:
# Linear algebra and data processing
import numpy as np
import pandas as pd

# Graphics / Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display

# Utils
from collections import Counter
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Quick preview of the raw CSV (ensure the file is in the same folder as the notebook).
# The working DataFrame, superbowl_df, is loaded in section 3.2.
df = pd.read_csv("superbowl.csv")
df.head()

### 3.2 Load and Check Data

Below is a description of the main variables in the Super Bowl dataset:

- **Date:** The date on which the Super Bowl was played

- **SB:** Identifier for the Super Bowl edition (e.g., “XL”, “LIV”)

- **Winner:** Name of the team that won the game

- **Winner Pts:** Points scored by the winning team

- **Loser:** Name of the team that lost the game

- **Loser Pts:** Points scored by the losing team

- **MVP:** Player awarded “Most Valuable Player”

- **Stadium:** Name of the stadium where the game was held

- **City:** Host city

- **State:** Host state

In [None]:
# Load the Super Bowl dataset

superbowl_df = pd.read_csv('superbowl.csv')  

print("Number of rows in the dataset:", len(superbowl_df))
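
Beyond the row count, a few quick checks can confirm that the file loaded as expected. The sketch below is one possible version: the helper `check_dataset` is hypothetical, and the two-row inline CSV only stands in for `superbowl.csv` so that the example runs on its own.

```python
import io
import pandas as pd

# Two-row stand-in for superbowl.csv, using the column names from the metadata table
csv_text = """Date,SB,Winner,Winner Pts,Loser,Loser Pts,MVP,Stadium,City,State
Feb 2 2020,LIV,Kansas City Chiefs,31,San Francisco 49ers,20,Patrick Mahomes,Hard Rock Stadium,Miami Gardens,Florida
Feb 3 2019,LIII,New England Patriots,13,Los Angeles Rams,3,Julian Edelman,Mercedes-Benz Stadium,Atlanta,Georgia
"""

EXPECTED_COLUMNS = ["Date", "SB", "Winner", "Winner Pts", "Loser",
                    "Loser Pts", "MVP", "Stadium", "City", "State"]


def check_dataset(df, expected_columns):
    """Summarize row count, absent columns, missing values, and repeated SB identifiers."""
    absent = [col for col in expected_columns if col not in df.columns]
    duplicate_sb = int(df["SB"].duplicated().sum()) if "SB" in df.columns else None
    return {
        "n_rows": len(df),
        "missing_columns": absent,
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_sb": duplicate_sb,
    }


demo_df = pd.read_csv(io.StringIO(csv_text))
report = check_dataset(demo_df, EXPECTED_COLUMNS)
print(report)
```

In the notebook itself, the same helper could be called as `check_dataset(superbowl_df, EXPECTED_COLUMNS)`.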

### 3.3 Variable Description

In [None]:
# Print top examples of the dataset
superbowl_df.head()

In [None]:
# Display basic information about the DataFrame
# info() prints its summary directly and returns None, so it is not wrapped in print()
superbowl_df.info()

# Display descriptive statistics for numerical features
print(superbowl_df.describe())

# Display unique values for categorical features
for column in superbowl_df.columns:
    if superbowl_df[column].dtype == object:
        print(f"\nUnique values for {column}:")
        print(superbowl_df[column].unique())
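
Because `read_csv` is called without date parsing, the `Date` column is loaded as text and is treated as a categorical feature. The sketch below shows one way to convert it. It assumes the dates look like `Feb 2 2020`; that format and the helper name `parse_game_dates` are assumptions and should be checked against the actual file.

```python
import pandas as pd

# Hypothetical sample values; the last one is deliberately malformed
dates = pd.Series(["Feb 2 2020", "Jan 15 1967", "not a date"], name="Date")


def parse_game_dates(date_strings, fmt="%b %d %Y"):
    """Convert date strings to datetimes; unparseable entries become NaT."""
    return pd.to_datetime(date_strings, format=fmt, errors="coerce")


parsed = parse_game_dates(dates)
years = parsed.dt.year  # NaN wherever parsing failed

print(parsed)
print("Unparseable entries:", int(parsed.isna().sum()))
```

Using `errors="coerce"` keeps the conversion from failing on a single bad row and makes the problem rows easy to count afterwards.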

### 3.4 Univariate Variable Analysis - Categorical Variables

In [None]:
# Get the categorical variables from the Super Bowl dataset
categorical_features = [feature for feature in superbowl_df.columns if superbowl_df[feature].dtype == object]
print("Categorical features:", categorical_features)

In [None]:
def bar_plot(variable):
    """Plot the frequency of each category in a column and print the counts.

    input: variable name, e.g., "Winner"
    output: bar plot & value counts
    """
    # Count how often each category occurs
    value_counts = superbowl_df[variable].value_counts()

    # Visualize
    plt.figure(figsize=(9, 3))
    plt.bar(value_counts.index, value_counts.values)
    plt.xticks(rotation=45)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print(f"{variable}:\n{value_counts}")

In [None]:
# Plot categorical features with fewer than 10 distinct values
for cf in categorical_features:
    if superbowl_df[cf].nunique() < 10:
        bar_plot(cf)
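
The filter above skips columns with 10 or more distinct values. One possible way to still summarize such columns is to keep the most frequent categories and group the rest into a single bucket. The helper `top_n_counts`, the `"Other"` label, and the placeholder team names below are all hypothetical.

```python
import pandas as pd


def top_n_counts(series, n=5, other_label="Other"):
    """Return counts for the n most frequent values, with the rest summed under other_label."""
    counts = series.value_counts()
    top = counts.head(n)
    remainder = counts.iloc[n:].sum()
    if remainder > 0:
        top = pd.concat([top, pd.Series({other_label: remainder})])
    return top


# Placeholder values standing in for a high-cardinality column such as "Winner"
winners = pd.Series(["A", "A", "A", "B", "B", "C", "D", "E"], name="Winner")
summary = top_n_counts(winners, n=2)
print(summary)
```

The resulting Series can be passed to `plt.bar(summary.index, summary.values)` in the same way as the counts inside `bar_plot`.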