# 🏋️ Olympic Data Exploration
Initial data exploration and understanding for the Olympics history dataset.

In [1]:
import pandas as pd

## Load the CSV Datasets
I'll start by reading the datasets using pandas:
- `athlete_events.csv`
- `noc_regions.csv`

In [7]:
athlete_events_df = pd.read_csv("../Olympics-Data-Analysis/data/athlete_events.csv")
noc_regions_df = pd.read_csv("../Olympics-Data-Analysis/data/noc_regions.csv")

---

## Explore `athlete_events.csv`
Let's explore the structure, types, and sample values in the main dataset.

In [8]:
athlete_events_df.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,


In [9]:
athlete_events_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   ID      271116 non-null  int64  
 1   Name    271116 non-null  object 
 2   Sex     271116 non-null  object 
 3   Age     261642 non-null  float64
 4   Height  210945 non-null  float64
 5   Weight  208241 non-null  float64
 6   Team    271116 non-null  object 
 7   NOC     271116 non-null  object 
 8   Games   271116 non-null  object 
 9   Year    271116 non-null  int64  
 10  Season  271116 non-null  object 
 11  City    271116 non-null  object 
 12  Sport   271116 non-null  object 
 13  Event   271116 non-null  object 
 14  Medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB


In [10]:
athlete_events_df.describe(include='all')

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
count,271116.0,271116,271116,261642.0,210945.0,208241.0,271116,271116,271116,271116.0,271116,271116,271116,271116,39783
unique,,134732,2,,,,1184,230,51,,2,42,66,765,3
top,,Robert Tait McKenzie,M,,,,United States,USA,2000 Summer,,Summer,London,Athletics,Football Men's Football,Gold
freq,,58,196594,,,,17847,18853,13821,,222552,22426,38624,5733,13372
mean,68248.954396,,,25.556898,175.33897,70.702393,,,,1978.37848,,,,,
std,39022.286345,,,6.393561,10.518462,14.34802,,,,29.877632,,,,,
min,1.0,,,10.0,127.0,25.0,,,,1896.0,,,,,
25%,34643.0,,,21.0,168.0,60.0,,,,1960.0,,,,,
50%,68205.0,,,24.0,175.0,70.0,,,,1988.0,,,,,
75%,102097.25,,,28.0,183.0,79.0,,,,2002.0,,,,,
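
The `info()` output above implies missing values in `Age`, `Height`, `Weight`, and `Medal` (for example, only 39,783 of 271,116 rows have a `Medal` value, so roughly 85% are empty). A minimal sketch of how those gaps could be tabulated per column follows. The `missing_summary` helper and the `sample_df` frame are hypothetical names introduced for illustration; the sample reuses the five rows shown by `head()` so the block runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Illustrative sample built from the five rows shown by head() above
# (only a subset of columns is kept to stay short).
sample_df = pd.DataFrame({
    "Name": ["A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby",
             "Edgar Lindenau Aabye", "Christine Jacoba Aaftink"],
    "Age": [24.0, 23.0, 24.0, 34.0, 21.0],
    "Height": [180.0, 170.0, np.nan, np.nan, 185.0],
    "Weight": [80.0, 60.0, np.nan, np.nan, 82.0],
    "Medal": [None, None, None, "Gold", None],
})


def missing_summary(df):
    """Return missing-value counts and percentages per column, worst first."""
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(df) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


summary = missing_summary(sample_df)
print(summary)
```

On the full dataset the same helper could be called as `missing_summary(athlete_events_df)`. Whether an empty `Medal` simply means "no medal won" is an assumption I would confirm before filling or dropping those values.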


---

## Explore `noc_regions.csv`
Now let's look at the country/region mapping file.

In [11]:
noc_regions_df.head()

Unnamed: 0,NOC,region,notes
0,AFG,Afghanistan,
1,AHO,Curacao,Netherlands Antilles
2,ALB,Albania,
3,ALG,Algeria,
4,AND,Andorra,


In [12]:
noc_regions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230 entries, 0 to 229
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   NOC     230 non-null    object
 1   region  227 non-null    object
 2   notes   21 non-null     object
dtypes: object(3)
memory usage: 5.5+ KB


In [13]:
noc_regions_df.describe(include='all')

Unnamed: 0,NOC,region,notes
count,230,227,21
unique,230,206,21
top,AFG,Germany,Netherlands Antilles
freq,1,4,1
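
The `describe()` output shows 206 unique regions across 227 non-null rows, with `Germany` appearing 4 times, so several NOC codes must share a region. A rough sketch of how those shared regions could be listed is below. The `shared_regions` helper and the `sample_regions` frame are hypothetical; the German codes in the sample are my assumption about which NOCs map to `Germany` and are not taken from the output above.

```python
import pandas as pd

# Illustrative rows only: AFG and AHO come from head() above; the German
# codes are an assumption about which NOCs share the "Germany" region.
sample_regions = pd.DataFrame({
    "NOC": ["AFG", "AHO", "GER", "FRG", "GDR"],
    "region": ["Afghanistan", "Curacao", "Germany", "Germany", "Germany"],
})


def shared_regions(df):
    """Map each region used by more than one NOC to its sorted list of codes."""
    codes_per_region = (
        df.dropna(subset=["region"])   # ignore NOCs without a region
          .groupby("region")["NOC"]
          .apply(sorted)               # one sorted list of codes per region
    )
    return codes_per_region[codes_per_region.map(len) > 1]


result = shared_regions(sample_regions)
print(result)
```

Running `shared_regions(noc_regions_df)` on the real file should show which regions are covered by more than one code, which matters later if medals are aggregated by region rather than by NOC.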


---

## Compare NOC Codes Between Datasets
Next, I'll check whether every NOC code in `athlete_events` also appears in `noc_regions`, as a basic consistency check between the two files.


In [14]:
# Number of unique NOCs
print("Unique NOCs in athlete_events:", athlete_events_df['NOC'].nunique())
print("Unique NOCs in noc_regions:", noc_regions_df['NOC'].nunique())


Unique NOCs in athlete_events: 230
Unique NOCs in noc_regions: 230


In [17]:
# Find NOC codes used in athlete_events that are missing from noc_regions
mismatched_nocs = athlete_events_df[~athlete_events_df['NOC'].isin(noc_regions_df['NOC'])]['NOC'].unique()
print("Mismatched NOCs:", mismatched_nocs)
print("Number of mismatched NOCs:", len(mismatched_nocs))


Mismatched NOCs: ['SGP']
Number of mismatched NOCs: 1
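
Equal unique counts (230 and 230) do not guarantee that the two sets of codes are identical, and the check above only looks in one direction. A small sketch of a two-way comparison using set differences follows. The `compare_codes` helper and both sample frames are hypothetical; in particular, `SIN` on the regions side is an assumption about how Singapore might be coded there and is not shown in any output above.

```python
import pandas as pd

# Tiny illustrative frames; "SIN" on the regions side is an assumption,
# not something shown in the notebook outputs.
events_sample = pd.DataFrame({"NOC": ["CHN", "DEN", "SGP", "CHN"]})
regions_sample = pd.DataFrame({"NOC": ["CHN", "DEN", "SIN"]})


def compare_codes(left, right):
    """Return (codes only in left, codes only in right), each sorted."""
    left_codes = set(left.dropna())
    right_codes = set(right.dropna())
    return sorted(left_codes - right_codes), sorted(right_codes - left_codes)


only_in_events, only_in_regions = compare_codes(
    events_sample["NOC"], regions_sample["NOC"]
)
print("Only in athlete_events:", only_in_events)
print("Only in noc_regions:", only_in_regions)
```

On the real data this would be `compare_codes(athlete_events_df['NOC'], noc_regions_df['NOC'])`. Seeing both directions side by side makes it easier to tell a genuinely missing region from the same country listed under two different codes.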


---

## Summary
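- Both datasets loaded successfully: `athlete_events` has 271,116 rows and 15 columns, and `noc_regions` has 230 rows and 3 columns.
- Columns, data types, and summary statistics were reviewed for both files; `Age`, `Height`, `Weight`, and `Medal` contain missing values.
- One NOC code in `athlete_events` (`SGP`) does not appear in `noc_regions`. Because both files contain 230 unique codes, `noc_regions` must in turn hold one code that never occurs in `athlete_events`. I'll revisit this in the data cleaning stage.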
