# Notebook 1 – Automatic Classification of Olympic Medal Outcomes

## 1. Introduction

This notebook focuses on the application of **automatic classification techniques** to the historical Olympic athletes dataset covering the period from **1896 to 2016**. Building upon the exploratory data analysis (EDA) conducted previously, this notebook aims to transform the available data into a predictive framework using supervised machine learning algorithms.

The dataset contains detailed information about athletes, events, and Olympic results, allowing the identification of patterns related to athlete characteristics, sports, and competition contexts. By leveraging these features, classification models are trained to predict **medal outcomes** in Olympic events.

This notebook follows a structured machine learning workflow, including:

- Definition of business goals

- Data selection and preparation

- Selection and application of classification algorithms

- Model evaluation and comparison

- Hyperparameter optimization of the selected model

The results obtained in this notebook provide a solid baseline for understanding the predictive potential of the dataset.

---

## 2. Dataset Loading and Initial Inspection

In this section, the Olympic athletes dataset is loaded and an initial inspection is performed to verify its structure, dimensions, and data types.

```python
import numpy as np
import pandas as pd

# Load the Olympic athletes dataset (file expected in the working directory).
df = pd.read_csv("athlete_events.csv")

# A notebook cell only displays its last expression automatically,
# so the earlier results are printed explicitly.
print(df.head())
print(df.shape)
df.info()      # prints its report directly
df.describe()  # summary statistics for the numeric columns
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271116 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB
```


| Statistic | ID | Age | Height | Weight | Year |
|---|---|---|---|---|---|
| count | 271116.0 | 261642.0 | 210945.0 | 208241.0 | 271116.0 |
| mean | 68248.954396 | 25.556898 | 175.33897 | 70.702393 | 1978.37848 |
| std | 39022.286345 | 6.393561 | 10.518462 | 14.34802 | 29.877632 |
| min | 1.0 | 10.0 | 127.0 | 25.0 | 1896.0 |
| 25% | 34643.0 | 21.0 | 168.0 | 60.0 | 1960.0 |
| 50% | 68205.0 | 24.0 | 175.0 | 70.0 | 1988.0 |
| 75% | 102097.25 | 28.0 | 183.0 | 79.0 | 2002.0 |
| max | 135571.0 | 97.0 | 226.0 | 214.0 | 2016.0 |
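
The non-null counts reported by `df.info()` can be converted into missing-value shares, which makes the data quality of each column easier to judge. The sketch below is only illustrative: it hard-codes the counts from the output above so that it runs without the CSV file, and the variable names are assumptions.

```python
import pandas as pd

# Non-null counts copied from the df.info() output above (271116 rows in total).
n_rows = 271116
non_null = pd.Series(
    {"Age": 261642, "Height": 210945, "Weight": 208241, "Medal": 39783}
)

# Share of missing values per column, in percent.
missing_pct = ((n_rows - non_null) / n_rows * 100).round(1)
print(missing_pct)
```

With the DataFrame loaded, `df.isna().mean() * 100` should give the same shares for every column.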


---

## 3. Business Goal

The main business goal of this notebook is to develop and evaluate **automatic classification models** capable of predicting **Olympic medal outcomes** based on athlete and event characteristics.

Using historical data from the Olympic Games (1896–2016), this analysis aims to:

- Build supervised machine learning models to classify **whether an athlete wins a medal or not**, based on features such as age, sex, physical attributes, sport, season, and country.

- Compare the performance of different classification algorithms and identify the most suitable model for this task.

- Optimize the selected model through hyperparameter tuning in order to improve predictive performance.

- Provide insights into which athlete and competition features are most relevant for medal prediction.

The results of this notebook can support **sports analysts, researchers, and data scientists** in understanding patterns of success in Olympic competitions and serve as a foundation for more advanced predictive and analytical tasks in subsequent notebooks.


---



## 4. Selected Algorithms

In this notebook, several **supervised machine learning classification algorithms** are applied and compared in order to predict Olympic medal outcomes. The selected algorithms were chosen based on their popularity, interpretability, and suitability for classification tasks involving both numerical and categorical features.

### 4.1 Logistic Regression

Logistic Regression is used as a **baseline classification model**. Despite its simplicity, it is a widely adopted algorithm for binary classification problems and provides easily interpretable results.

This algorithm models the probability of an athlete winning a medal as a function of the input features. It is particularly useful for understanding the influence of individual variables on the target outcome and serves as a reference point for comparing more complex models.

**Key characteristics:**

- Simple and computationally efficient

- Suitable for binary classification

- Provides interpretable coefficients

- Sensitive to feature scaling and multicollinearity
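
As a rough illustration of the points above, the following sketch fits a Logistic Regression on standardized inputs and reads back its coefficients and predicted probabilities. It assumes scikit-learn and uses synthetic data, not the Olympic dataset; the two columns merely stand in for features on different scales.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (not Olympic data): two numeric features on different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(25, 6, 200), rng.normal(175, 10, 200)])
# The label depends only on the second feature, plus some noise.
y = (X[:, 1] + rng.normal(0, 5, 200) > 175).astype(int)

# Standardize first, so that coefficients are comparable across features.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

coefs = model.named_steps["logisticregression"].coef_[0]
proba = model.predict_proba(X[:3])  # columns: P(class 0), P(class 1)
print(coefs)
print(proba)
```

Because the inputs are standardized, the coefficient magnitudes can be compared directly; here the second feature should dominate, since it is the only one driving the label.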


### 4.2 Decision Tree Classifier

The Decision Tree classifier is a non-linear model that splits the dataset into subsets based on feature values, creating a tree-like structure of decisions. It is capable of capturing complex relationships between features without requiring extensive data preprocessing.

Decision Trees are intuitive and easy to visualize, making them useful for explaining classification decisions. However, they are prone to overfitting if not properly constrained.

**Key characteristics:**

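- Handles both numerical and categorical data in principle, although some implementations (scikit-learn, for example) require categorical features to be encoded numerically first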

- Easy to interpret and visualize

- Captures non-linear relationships

- Prone to overfitting without pruning or depth control
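
The overfitting tendency mentioned above can be seen by comparing an unconstrained tree with a depth-limited one. This is a minimal sketch assuming scikit-learn and synthetic noisy data (not the Olympic dataset); `max_depth=3` is an arbitrary illustrative value, not a tuned setting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (not Olympic data): only the first feature carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + rng.normal(0, 1.0, 600) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree keeps splitting until the training set is fit perfectly.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Limiting the depth is one simple way to constrain the tree.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, tree in [("unconstrained", deep), ("max_depth=3", shallow)]:
    print(name, tree.get_depth(),
          tree.score(X_train, y_train), tree.score(X_test, y_test))
```

A large gap between training and test accuracy for the unconstrained tree is the usual sign of overfitting.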


### 4.3 Random Forest Classifier

The Random Forest classifier is an ensemble learning method that combines multiple Decision Trees to improve classification performance and robustness. Each tree is trained on a bootstrap sample of the data and considers a random subset of features at each split; aggregating the predictions of these decorrelated trees reduces overfitting and improves generalization.

This algorithm is particularly well-suited for complex datasets such as Olympic results, where interactions between multiple features may influence medal outcomes.

**Key characteristics:**

- High predictive performance

- Reduces overfitting through ensemble learning

- Handles large datasets and feature interactions well

- Provides feature importance measures
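
The feature importance measures listed above can be read from a fitted forest. The sketch below assumes scikit-learn and uses synthetic data (not the Olympic dataset) in which only the first feature determines the label.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (not Olympic data): only the first of three features is informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalized to sum to 1 across features.
importances = forest.feature_importances_
print(importances)
```

These impurity-based scores tend to favor high-cardinality features, so they are best treated as a first indication rather than a definitive ranking.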

---

## 5. Data Selection Criteria

The data selection process was guided by the objective of building a reliable and interpretable **classification model** for predicting Olympic medal outcomes. Only features with potential predictive value and acceptable data quality were selected.

### 5.1 Target Variable

The target variable used in this classification task is `Medal`, which represents the outcome of an athlete in a specific Olympic event.

For the purposes of this notebook, the target variable was transformed into a **binary classification** problem:

- `1` – The athlete won a medal (Gold, Silver, or Bronze)

- `0` – The athlete did not win a medal

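This transformation simplifies the classification task and avoids splitting the already rare medal cases into three even smaller classes. The binary target remains imbalanced, however: only 39,783 of the 271,116 rows (about 14.7%) have a medal, which has to be taken into account during model training and evaluation.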
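
A minimal sketch of this transformation, assuming that a missing `Medal` value means no medal was won. The rows are made up for illustration and the column name `won_medal` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows (made up for illustration) mimicking the Medal column,
# where NaN means the athlete did not win a medal.
toy = pd.DataFrame({"Medal": ["Gold", np.nan, "Bronze", np.nan, np.nan, "Silver"]})

# 1 if any medal was won, 0 otherwise.
toy["won_medal"] = toy["Medal"].notna().astype(int)

print(toy["won_medal"].tolist())
print(toy["won_medal"].mean())  # share of the positive class in the toy data
```

On the full DataFrame the same expression would be `df["Medal"].notna().astype(int)`, and its mean gives the share of medal rows.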

### 5.2 Selected Features

The following features were selected as input variables based on their relevance to athlete performance and medal outcomes:

- **Age** – Represents the athlete’s age at the time of the competition

- **Sex** – Biological sex of the athlete (Male or Female)

- **Height** – Athlete’s height in centimeters

- **Weight** – Athlete’s weight in kilograms

- **Sport** – Type of sport in which the athlete competed

- **Season** – Olympic season (Summer or Winter)

- **Year** – Year of the Olympic Games

- **NOC** – National Olympic Committee code, representing the athlete’s country

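These variables capture **demographic, physical, temporal, and contextual** aspects that are likely to influence athletic performance. Note that Age, Height, and Weight contain missing values (about 3.5%, 22%, and 23% of rows, respectively, according to the inspection in Section 2), so they require imputation or row filtering before modeling.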
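
One possible way to turn the selected columns into a model-ready feature matrix is one-hot encoding of the categorical variables. The sketch below uses `pd.get_dummies` on a few made-up rows; the encoding choice and the `numeric_cols` / `categorical_cols` names are assumptions, not necessarily what the later steps of this notebook use.

```python
import pandas as pd

# Toy rows (made up for illustration) with the eight selected columns.
toy = pd.DataFrame({
    "Age": [24.0, 31.0, 19.0],
    "Sex": ["M", "F", "F"],
    "Height": [180.0, 165.0, 172.0],
    "Weight": [80.0, 55.0, 63.0],
    "Sport": ["Judo", "Athletics", "Swimming"],
    "Season": ["Summer", "Summer", "Summer"],
    "Year": [1992, 2008, 2016],
    "NOC": ["FRA", "KEN", "USA"],
})

numeric_cols = ["Age", "Height", "Weight", "Year"]
categorical_cols = ["Sex", "Sport", "Season", "NOC"]

# One-hot encode the categorical columns; numeric columns pass through unchanged.
X = pd.get_dummies(toy[numeric_cols + categorical_cols], columns=categorical_cols)

print(X.shape)
print(sorted(X.columns))
```

On the real data, `Sport` and `NOC` have many distinct values, so this encoding produces a wide and sparse matrix.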

### 5.3 Excluded Features

Several attributes were excluded from the modeling process due to limited predictive value or potential issues:

- **ID** – Unique identifier with no relevance for prediction

- **Name** – High-cardinality textual feature, not suitable for generalization

- **Team** – Redundant with the NOC attribute

- **Games** – Combination of year and season, redundant with existing features

- **City** – High cardinality and low relevance to individual performance

- **Event** – Very high cardinality, which could introduce noise and sparsity

Excluding these features helps reduce model complexity, minimize noise, and improve generalization performance.
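
The cardinality and redundancy arguments above can be checked directly before dropping anything. The following sketch uses made-up rows that only contain some of the columns; on the real data, the same calls would be applied to `df`.

```python
import pandas as pd

# Toy rows (made up for illustration); the real file has far more distinct values.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Name": ["A One", "B Two", "C Three", "D Four"],
    "Games": ["2012 Summer", "2012 Summer", "2014 Winter", "2014 Winter"],
    "Year": [2012, 2012, 2014, 2014],
    "Season": ["Summer", "Summer", "Winter", "Winter"],
})

# Distinct values per column: identifier-like columns are (nearly) unique per row.
print(toy.nunique())

# Games adds no information if it can be rebuilt from Year and Season.
rebuilt = toy["Year"].astype(str) + " " + toy["Season"]
games_is_redundant = bool((rebuilt == toy["Games"]).all())
print(games_is_redundant)

# Drop whichever of the excluded columns are present in this frame.
excluded = ["ID", "Name", "Team", "Games", "City", "Event"]
features = toy.drop(columns=[c for c in excluded if c in toy.columns])
print(features.columns.tolist())
```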

---