# FIFA21 EDA Dashboard 🏟️📊

**Interactive Exploratory Data Analysis of FIFA 21 player statistics**  
No machine learning — purely data cleaning, descriptive stats, interactive Plotly charts, and a Folium world map of nationalities.


## 📋 Introduction & Objectives

In this notebook we will:
- **Load** the FIFA 21 player dataset
- **Clean** and preprocess key features (age, height, weight, salary, etc.)
- Compute **descriptive statistics** and summary tables
- Build **static** and **interactive** visualizations with Plotly
- Create a **geospatial map** of player counts & ratings by country using Folium
- Draw **insights** and conclusions about player distributions and top performers


## 🗄️ 1. Data Loading

- Import necessary libraries (`pandas`, `plotly.express`, `folium`, `ipywidgets`, etc.)
- Read the CSV file from `data/players_fifa21.csv`
- Display the first few rows and check dimensions
- Briefly inspect data types and missing values


In [14]:
# 🗄️ 1. Data Loading

# 1. Import necessary libraries
import pandas as pd
import plotly.express as px
import folium
import ipywidgets as widgets

# 2. Read the CSV file (subindo um nível, pois o notebook está em notebooks/)
data_path = "../data/players_21.csv"
df = pd.read_csv(data_path)

# 3. Display the first few rows and check dimensions
print("Dataset shape:", df.shape)
display(df.head())

# 4. Briefly inspect data types and missing values
print("\nData types:")
display(df.dtypes)

print("\nMissing values per column:")
display(df.isna().sum().sort_values(ascending=False).head(10))


Dataset shape: (18944, 106)


Unnamed: 0,sofifa_id,player_url,short_name,long_name,age,dob,height_cm,weight_kg,nationality,club_name,...,lwb,ldm,cdm,rdm,rwb,lb,lcb,cb,rcb,rb
0,158023,https://sofifa.com/player/158023/lionel-messi/...,L. Messi,Lionel Andrés Messi Cuccittini,33,1987-06-24,170,72,Argentina,FC Barcelona,...,66+3,65+3,65+3,65+3,66+3,62+3,52+3,52+3,52+3,62+3
1,20801,https://sofifa.com/player/20801/c-ronaldo-dos-...,Cristiano Ronaldo,Cristiano Ronaldo dos Santos Aveiro,35,1985-02-05,187,83,Portugal,Juventus,...,65+3,61+3,61+3,61+3,65+3,61+3,54+3,54+3,54+3,61+3
2,200389,https://sofifa.com/player/200389/jan-oblak/210002,J. Oblak,Jan Oblak,27,1993-01-07,188,87,Slovenia,Atlético Madrid,...,32+3,36+3,36+3,36+3,32+3,32+3,33+3,33+3,33+3,32+3
3,188545,https://sofifa.com/player/188545/robert-lewand...,R. Lewandowski,Robert Lewandowski,31,1988-08-21,184,80,Poland,FC Bayern München,...,64+3,65+3,65+3,65+3,64+3,61+3,60+3,60+3,60+3,61+3
4,190871,https://sofifa.com/player/190871/neymar-da-sil...,Neymar Jr,Neymar da Silva Santos Júnior,28,1992-02-05,175,68,Brazil,Paris Saint-Germain,...,67+3,62+3,62+3,62+3,67+3,62+3,49+3,49+3,49+3,62+3



Data types:


sofifa_id      int64
player_url    object
short_name    object
long_name     object
age            int64
               ...  
lb            object
lcb           object
cb            object
rcb           object
rb            object
Length: 106, dtype: object


Missing values per column:


defending_marking       18944
loaned_from             18186
nation_jersey_number    17817
nation_position         17817
player_tags             17536
gk_kicking              16861
gk_diving               16861
gk_positioning          16861
gk_reflexes             16861
gk_handling             16861
dtype: int64

## 🧹 2. Data Cleaning & Transformation

- Handle missing values (drop or impute)
- Convert units:
  - Height (feet/inches → cm)
  - Weight (lbs → kg)
- Derive new metrics:
  - Body Mass Index (BMI)
  - Age (if birthdate given)
- Drop irrelevant or duplicate columns (e.g. photo URLs)


In [15]:
# 🧹 2. Data Cleaning & Preprocessing

# Make a copy of the original DataFrame
df_clean = df.copy()

# 1. Drop irrelevant columns using a for-loop
cols_to_drop = [
    'photo', 'flag', 'club_logo', 'real_face', 'player_url',
    'loaned_from', 'joined', 'contract_valid_until', 'nation_position',
    'nation_jersey_number'
]
for col in cols_to_drop:
    if col in df_clean.columns:
        df_clean.drop(columns=col, inplace=True)

# 2. Rename columns for readability
rename_map = {
    'sofifa_id': 'player_id',
    'long_name':  'name',
    'overall':    'rating',
    'potential':  'potential_rating',
}
for old_name, new_name in rename_map.items():
    if old_name in df_clean.columns:
        df_clean.rename(columns={old_name: new_name}, inplace=True)

# 3. Handle missing values
#   a) Drop rows where key metrics are missing
key_metrics = ['rating', 'potential_rating', 'age']
for metric in key_metrics:
    if metric in df_clean.columns:
        df_clean = df_clean[df_clean[metric].notna()]

#   b) Fill other numeric NaNs with the column median
numeric_columns = df_clean.select_dtypes(include='number').columns
for col in numeric_columns:
    median_value = df_clean[col].median()
    df_clean[col].fillna(median_value, inplace=True)

# 4. Convert height and weight to metric units
import re

def parse_height(height_str):
    """Convert height string like 5'11\" to centimeters."""
    match = re.match(r"(\d+)'(\d+)", str(height_str))
    if match:
        feet = int(match.group(1))
        inches = int(match.group(2))
        return round(feet * 30.48 + inches * 2.54, 1)
    return None

def parse_weight(weight_str):
    """Convert weight string like 165lbs to kilograms."""
    match = re.match(r"(\d+)", str(weight_str))
    if match:
        pounds = int(match.group(1))
        return round(pounds * 0.453592, 1)
    return None

if 'height' in df_clean.columns:
    df_clean['height_cm'] = df_clean['height'].apply(parse_height)

if 'weight' in df_clean.columns:
    df_clean['weight_kg'] = df_clean['weight'].apply(parse_weight)

# 5. Derive new metrics
#   a) Calculate Body Mass Index (BMI)
if 'weight_kg' in df_clean.columns and 'height_cm' in df_clean.columns:
    df_clean['bmi'] = df_clean['weight_kg'] / (df_clean['height_cm'] / 100) ** 2

#   b) Simplify position into primary role (e.g., 'ST' from 'ST C')
if 'position' in df_clean.columns:
    primary_positions = []
    for pos in df_clean['position']:
        if isinstance(pos, str):
            primary_positions.append(pos.split()[0])
        else:
            primary_positions.append(None)
    df_clean['primary_position'] = primary_positions

# Final check
print("Cleaned dataset shape:", df_clean.shape)
df_clean.head()


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_clean[col].fillna(median_value, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_clean[col].fillna(median_value, inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting va

Cleaned dataset shape: (18944, 100)


Unnamed: 0,player_id,short_name,name,age,dob,height_cm,weight_kg,nationality,club_name,league_name,...,ldm,cdm,rdm,rwb,lb,lcb,cb,rcb,rb,bmi
0,158023,L. Messi,Lionel Andrés Messi Cuccittini,33,1987-06-24,170,72,Argentina,FC Barcelona,Spain Primera Division,...,65+3,65+3,65+3,66+3,62+3,52+3,52+3,52+3,62+3,24.913495
1,20801,Cristiano Ronaldo,Cristiano Ronaldo dos Santos Aveiro,35,1985-02-05,187,83,Portugal,Juventus,Italian Serie A,...,61+3,61+3,61+3,65+3,61+3,54+3,54+3,54+3,61+3,23.735308
2,200389,J. Oblak,Jan Oblak,27,1993-01-07,188,87,Slovenia,Atlético Madrid,Spain Primera Division,...,36+3,36+3,36+3,32+3,32+3,33+3,33+3,33+3,32+3,24.615211
3,188545,R. Lewandowski,Robert Lewandowski,31,1988-08-21,184,80,Poland,FC Bayern München,German 1. Bundesliga,...,65+3,65+3,65+3,64+3,61+3,60+3,60+3,60+3,61+3,23.62949
4,190871,Neymar Jr,Neymar da Silva Santos Júnior,28,1992-02-05,175,68,Brazil,Paris Saint-Germain,French Ligue 1,...,62+3,62+3,62+3,67+3,62+3,49+3,49+3,49+3,62+3,22.204082


## 📊 3. Descriptive Statistics

- Use `df.describe()` for a global overview
- Identify **top 10 players** by Overall and by Potential
- Group by **Position** and **Club**:
  - Mean, median, count of Overall and Potential
- Present summary tables in well-formatted DataFrames


In [18]:
## 📊 3. Descriptive Statistics

# 3.0 — Standardize column names to avoid unexpected spaces/case
df.columns = df.columns.str.strip().str.replace(" ", "_").str.lower()

# 3.1 — Identify the actual column names without list comprehensions
overall_col = None
potential_col = None
name_col = None
position_col = None
club_col = None

for col in df.columns:
    if "overall" in col and overall_col is None:
        overall_col = col
    if "potential" in col and potential_col is None:
        potential_col = col
    if "name" in col and name_col is None:
        name_col = col
    if "position" in col and position_col is None:
        position_col = col
    if "club" in col and club_col is None:
        club_col = col

print(f"Using columns: name={name_col}, overall={overall_col}, potential={potential_col}, position={position_col}, club={club_col}")

# 3.2 — Global overview
desc = df.describe()
print("\nGlobal summary statistics:")
display(desc)

# 3.3 — Top 10 players by Overall and by Potential
top_overall = df.nlargest(10, overall_col)[[name_col, overall_col, potential_col]]
print("\nTop 10 players by Overall:")
display(top_overall)

top_potential = df.nlargest(10, potential_col)[[name_col, overall_col, potential_col]]
print("\nTop 10 players by Potential:")
display(top_potential)

# 3.4 — Group by Position
pos_stats = (
    df
    .groupby(position_col)
    .agg(
        count=('overall', 'count'),
        mean_overall=(overall_col, 'mean'),
        median_overall=(overall_col, 'median'),
        mean_potential=(potential_col, 'mean'),
        median_potential=(potential_col, 'median'),
    )
    .reset_index()
)
print("\nStatistics by Position:")
display(pos_stats)

# 3.5 — Group by Club
club_stats = (
    df
    .groupby(club_col)
    .agg(
        count=('overall', 'count'),
        mean_overall=(overall_col, 'mean'),
        median_overall=(overall_col, 'median'),
        mean_potential=(potential_col, 'mean'),
        median_potential=(potential_col, 'median'),
    )
    .sort_values('count', ascending=False)
    .reset_index()
)
print("\nStatistics by Club (sorted by player count):")
display(club_stats)



Using columns: name=short_name, overall=overall, potential=potential, position=player_positions, club=club_name

Global summary statistics:


Unnamed: 0,sofifa_id,age,height_cm,weight_kg,league_rank,overall,potential,value_eur,wage_eur,international_reputation,...,mentality_penalties,mentality_composure,defending_marking,defending_standing_tackle,defending_sliding_tackle,goalkeeping_diving,goalkeeping_handling,goalkeeping_kicking,goalkeeping_positioning,goalkeeping_reflexes
count,18944.0,18944.0,18944.0,18944.0,18719.0,18944.0,18944.0,18944.0,18944.0,18944.0,...,18944.0,18944.0,0.0,18944.0,18944.0,18944.0,18944.0,18944.0,18944.0,18944.0
mean,226242.402872,25.225823,181.190773,75.016892,1.35707,65.677787,71.086729,2224813.0,8675.852513,1.09185,...,48.050412,57.978674,,47.581767,45.546505,16.446052,16.236486,16.103357,16.225982,16.551309
std,27171.091056,4.697354,6.825672,7.05714,0.739327,7.002278,6.109985,5102486.0,19654.774894,0.361841,...,15.671721,12.11839,,21.402461,20.953997,17.577332,16.84548,16.519399,17.017341,17.878121
min,41.0,16.0,155.0,50.0,1.0,47.0,47.0,0.0,0.0,1.0,...,6.0,12.0,,5.0,4.0,1.0,1.0,1.0,1.0,1.0
25%,210030.5,21.0,176.0,70.0,1.0,61.0,67.0,300000.0,1000.0,1.0,...,38.75,50.0,,27.0,24.0,8.0,8.0,8.0,8.0,8.0
50%,232314.5,25.0,181.0,75.0,1.0,66.0,71.0,650000.0,3000.0,1.0,...,49.0,59.0,,55.0,52.0,11.0,11.0,11.0,11.0,11.0
75%,246760.25,29.0,186.0,80.0,1.0,70.0,75.0,1800000.0,7000.0,1.0,...,60.0,66.0,,65.0,63.0,14.0,14.0,14.0,14.0,14.0
max,258970.0,53.0,206.0,110.0,4.0,93.0,95.0,105500000.0,560000.0,5.0,...,92.0,96.0,,93.0,90.0,90.0,92.0,93.0,91.0,90.0



Top 10 players by Overall:


Unnamed: 0,short_name,overall,potential
0,L. Messi,93,93
1,Cristiano Ronaldo,92,92
2,J. Oblak,91,93
3,R. Lewandowski,91,91
4,Neymar Jr,91,91
5,K. De Bruyne,91,91
6,K. Mbappé,90,95
7,M. ter Stegen,90,93
8,V. van Dijk,90,91
9,Alisson,90,91



Top 10 players by Potential:


Unnamed: 0,short_name,overall,potential
6,K. Mbappé,90,95
0,L. Messi,93,93
2,J. Oblak,91,93
7,M. ter Stegen,90,93
28,J. Sancho,87,93
62,K. Havertz,85,93
272,João Félix,81,93
366,Vinícius Jr.,80,93
1,Cristiano Ronaldo,92,92
29,T. Alexander-Arnold,87,92



Statistics by Position:


Unnamed: 0,player_positions,count,mean_overall,median_overall,mean_potential,median_potential
0,CAM,268,62.100746,61.0,70.574627,70.0
1,"CAM, CDM",13,65.615385,65.0,70.000000,72.0
2,"CAM, CDM, CM",4,67.250000,68.0,69.750000,70.0
3,"CAM, CDM, LM",1,62.000000,62.0,69.000000,69.0
4,"CAM, CF",23,66.347826,64.0,73.391304,74.0
...,...,...,...,...,...,...
606,"ST, RW, CF",1,63.000000,63.0,63.000000,63.0
607,"ST, RW, LM",2,65.000000,65.0,72.000000,72.0
608,"ST, RW, LW",43,65.581395,66.0,72.372093,73.0
609,"ST, RW, RM",5,70.600000,71.0,72.800000,74.0



Statistics by Club (sorted by player count):


Unnamed: 0,club_name,count,mean_overall,median_overall,mean_potential,median_potential
0,VfB Stuttgart,33,68.333333,70.0,75.696970,76.0
1,Udinese,33,70.151515,72.0,75.030303,75.0
2,AS Saint-Étienne,33,67.969697,68.0,76.303030,77.0
3,AS Monaco,33,73.333333,73.0,78.181818,78.0
4,Valencia CF,33,72.666667,74.0,79.636364,79.0
...,...,...,...,...,...,...
676,Internacional,20,72.850000,72.5,72.900000,72.5
677,São Paulo,20,73.200000,73.5,73.200000,73.5
678,Brisbane Roar,19,59.789474,61.0,65.736842,65.0
679,Central Coast Mariners,18,59.000000,60.0,65.722222,65.5


## 📈 4. Static Visualizations

- **Histograms** of Age, Overall, Potential
- **Boxplots** of Overall by Position
- **Bar chart** of number of Top-100 players per club
- Save key plots as PNG for inclusion in README


In [19]:
# 🖼️ 4. Static Visualizations

import plotly.express as px

# 1. Histogram of Age
fig_age = px.histogram(
    df_clean,
    x='age',
    nbins=20,
    title='Age Distribution of FIFA 21 Players',
    labels={'age': 'Age'}
)
fig_age.show()

# 2. Histogram of Overall Rating
fig_rating = px.histogram(
    df_clean,
    x='rating',
    nbins=20,
    title='Overall Rating Distribution',
    labels={'rating': 'Overall Rating'}
)
fig_rating.show()

# 3. Histogram of Potential Rating
fig_potential = px.histogram(
    df_clean,
    x='potential_rating',
    nbins=20,
    title='Potential Rating Distribution',
    labels={'potential_rating': 'Potential Rating'}
)
fig_potential.show()

# 4. Boxplot of Overall Rating by Position
fig_box = px.box(
    df_clean,
    x='primary_position',  # or 'position' if you prefer full labels
    y='rating',
    title='Overall Rating by Position',
    labels={'primary_position': 'Position', 'rating': 'Overall Rating'}
)
fig_box.update_layout(xaxis={'categoryorder':'total descending'})
fig_box.show()

# 5. Bar chart: number of Top-100 players per club
#   a) Select Top-100 by rating
top100 = df_clean.nlargest(100, 'rating')

#   b) Count how many of these belong to each club
club_counts = top100['club'].value_counts().reset_index()
club_counts.columns = ['club', 'count']

#   c) Plot the top 10 clubs
fig_bar = px.bar(
    club_counts.head(10),
    x='club',
    y='count',
    title='Top 10 Clubs by Number of Top-100 Players',
    labels={'club': 'Club', 'count': 'Number of Players'}
)
fig_bar.update_layout(xaxis_tickangle=-45)
fig_bar.show()


ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed