<a href="https://colab.research.google.com/github/drusho/blog/blob/master/_notebooks/2021-07-26-eda-olympic-history.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# "120 Years of Olympic History"
> "Data Exploration of a Kaggle Dataset"

- toc: true
- badges: true
- comments: true
- categories: [Pandas,Kaggle, Plotly, Olympics]
- image: "images/thumbnails/olympic_hist.jpeg"


> Note: __Notebook Created by David Rusho__

* [Github Blog](https://drusho.github.io/blog) | [Github](https://github.com/drusho) | [Tableau](https://public.tableau.com/app/profile/drusho/) | [Linkedin](https://linkedin.com/in/davidrusho)

## About the Data

Dataset Source: [120-years-of-olympic-history-athletes-and-results](https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results)

## Import and Merge Data

* Import libraries
* Load the Kaggle CSV files into pandas DataFrames
* Merge the DataFrames on the 'NOC' column

In [4]:
#collapse
# Import Libraries
import numpy as np
import pandas as pd
import plotly.express as px
import warnings

warnings.filterwarnings("ignore") # ignore warnings

In [5]:
#collapse
# Import csv from kaggle
# df_a == 'dataframe athletics'
fn_a = "athlete_events.csv"
df_a = pd.read_csv(fn_a)

# df_l == 'dataframe locations'
fn_l = "noc_regions.csv"
df_l = pd.read_csv(fn_l)

# Combine dataframes on 'NOC' (default inner join: rows whose NOC code
# has no match in noc_regions.csv are dropped)
df = df_a.merge(df_l, on="NOC")
df.head(3)

df_a.head(3)

|  | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball |  |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight |  |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 |  |  | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football |  |
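
Because `merge` defaults to an inner join, any athlete row whose `NOC` code is absent from `noc_regions.csv` is silently dropped from `df`. The following is a minimal sketch of one way to check for such rows, using a left join with `indicator=True`. The two small frames and the `XXX` code are made-up stand-ins for the real files.

```python
import pandas as pd

# Hypothetical stand-ins for athlete_events.csv and noc_regions.csv
athletes = pd.DataFrame({"Name": ["A", "B", "C"], "NOC": ["CHN", "DEN", "XXX"]})
regions = pd.DataFrame({"NOC": ["CHN", "DEN"], "region": ["China", "Denmark"]})

# A left join keeps every athlete row; the _merge column records
# whether each row found a match in the lookup table
checked = athletes.merge(regions, on="NOC", how="left", indicator=True)

# NOC codes that have no match in the lookup table
unmatched = checked.loc[checked["_merge"] == "left_only", "NOC"].unique().tolist()
print(unmatched)
```

Running the same check on `df_a` and `df_l` would list the codes that the inner join discards, if any.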


## Clean Dataframe 

Replace missing values (NaN) with empty strings, then tidy the `Age` and `Year` columns.

In [6]:
#collapse
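# replace missing values with empty strings
# (numeric columns that contained NaN become object dtype)
df = df.fillna("")

# keep only the first run of digits in Age; rows without an age become NaN
df["Age"] = df["Age"].astype(str).str.extract("([0-9]+)", expand=False)

# round-trip Year through datetime to validate it as a four-digit year
df["Year"] = pd.to_datetime(df["Year"], format="%Y").dt.year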

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 270767 entries, 0 to 270766
Data columns (total 17 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   ID      270767 non-null  int64 
 1   Name    270767 non-null  object
 2   Sex     270767 non-null  object
 3   Age     261305 non-null  object
 4   Height  270767 non-null  object
 5   Weight  270767 non-null  object
 6   Team    270767 non-null  object
 7   NOC     270767 non-null  object
 8   Games   270767 non-null  object
 9   Year    270767 non-null  int64 
 10  Season  270767 non-null  object
 11  City    270767 non-null  object
 12  Sport   270767 non-null  object
 13  Event   270767 non-null  object
 14  Medal   270767 non-null  object
 15  region  270767 non-null  object
 16  notes   270767 non-null  object
dtypes: int64(2), object(15)
memory usage: 37.2+ MB
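
The `info()` output shows a side effect of filling every column with `""`: `Height` and `Weight` are now `object` columns rather than numeric ones. One possible alternative, sketched here on a made-up frame and assuming the numeric columns should stay numeric, is to fill only the text columns.

```python
import numpy as np
import pandas as pd

# Made-up sample with the same kinds of gaps as the Olympic data
sample = pd.DataFrame(
    {
        "Name": ["A", "B", "C"],
        "Age": [24.0, np.nan, 31.0],
        "Medal": ["Gold", np.nan, np.nan],
    }
)

cleaned = sample.copy()

# Fill only the text columns; Age keeps its float dtype and its NaN
text_cols = ["Medal"]
cleaned[text_cols] = cleaned[text_cols].fillna("")

print(cleaned.dtypes)
```

With this approach, histograms and numeric summaries can use `Age` directly, without converting it back from strings.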


## Data Exploration

### Bar Chart Function for Plotly

Helper functions that keep the repeated Plotly layout code in one place.

In [7]:
#collapse
# bar chart function setup for plotly


def fig_layout(fig, title):
    # update plot details
    fig.update_layout(
        {"plot_bgcolor": "rgba(255,255,255, 0.9)"},  # white background
        yaxis={'categoryorder':'total ascending'},
        title={
            "text": title,
            "y": 0.98,
            "x": 0.5,
            "xanchor": "center",
            "yanchor": "top",
        },
        xaxis_title="",  # remove axis titles
        yaxis_title="",  # remove axis titles
    )
    return fig

def bar_chart(df, x, y, location):
    # create bar chart
    fig = px.bar(df, x=x, y=y, text=y)

    # update bar markers; m_color is the global marker color (blue),
    # which must be defined before bar_chart is called
    fig.update_traces(textposition="outside", marker_color=m_color)

    # update plot details
    fig.update_layout(
        {"plot_bgcolor": "rgba(255,255,255, 0.9)"},  # white background
        title={
            "text": f"Count of Games Hosted by {location}",
            "y": 0.98,
            "x": 0.5,
            "xanchor": "center",
            "yanchor": "top",
        },
        xaxis_title="",  # remove axis titles
        yaxis_title="",  # remove axis titles
    )

    fig.update_yaxes(showticklabels=False)

    return fig.show()
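
`fig_layout` and `bar_chart` still repeat the same layout settings. One possible way to share them is to build the keyword arguments once and unpack them into `fig.update_layout`. The sketch below uses a hypothetical `base_layout` helper that is not part of the original notebook.

```python
def base_layout(title):
    """Return the layout settings shared by the charts in this notebook."""
    return {
        "plot_bgcolor": "rgba(255,255,255, 0.9)",  # white background
        "title": {
            "text": title,
            "y": 0.98,
            "x": 0.5,
            "xanchor": "center",
            "yanchor": "top",
        },
        "xaxis_title": "",  # remove axis titles
        "yaxis_title": "",
    }


layout = base_layout("Count of Games Hosted by City")
print(layout["title"]["text"])

# With a Plotly figure `fig`, the settings would be applied as:
# fig.update_layout(**layout)
```

Keeping the settings in a plain dictionary also makes them easy to inspect without rendering a figure.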

### Age Counts by Gender in Games

In [8]:
#collapse
# global marker color for plotly plots
m_color = "rgb(47,138,196)"

# Age Counts in Games
df_age = df.copy()
# convert Age from digit strings to numbers; missing ages stay NaN
df_age["Age"] = pd.to_numeric(df_age["Age"], errors="coerce")

# df_age.sort_values(by="Age",inplace=True)

fig = px.histogram(df_age, x="Age",color="Sex",barmode="overlay", 
                   color_discrete_map={"F": "rgb(237,100,90)", "M": m_color})
# update bar markers
# fig.update_traces(marker_color=m_color)  # blue color

fig.update_layout(
    {"plot_bgcolor": "rgba(255,255,255, 0.9)"},  # white background
    title={
        "text": "Age Counts by Gender in Games",
        "y": 0.98,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    }
)


fig.show()

### Count of Games by Season

In [9]:
#collapse
# Number of Olympic Games by Season
df_season_city = (
    df.groupby(["Season"])["Year"]
    .nunique()
    .to_frame()
    .reset_index()
    .rename(columns={"Year": "Count"})
)

# pie chart of the number of Games held in each season
fig = px.pie(
    df_season_city,
    values="Count",
    names="Season",
    color="Season",
    title="Count of Games by Season",
    color_discrete_map={"Summer": "rgb(237,100,90)", "Winter": m_color}
)

# update plot details
fig.update_layout(
    title={
        #     "y": 0.100,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    },
)


fig.show()
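
The `groupby(...).nunique()` step counts distinct years per season, so the many athlete rows belonging to a single Games are counted once. A small sketch on made-up rows illustrates the idea:

```python
import pandas as pd

# Made-up rows: two athletes at the 1992 Summer Games, one at 2012 Summer,
# one at 1994 Winter
rows = pd.DataFrame(
    {
        "Season": ["Summer", "Summer", "Summer", "Winter"],
        "Year": [1992, 1992, 2012, 1994],
    }
)

# Count distinct years within each season
games_per_season = (
    rows.groupby("Season")["Year"]
    .nunique()
    .reset_index(name="Count")
)
print(games_per_season)
```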

## Counting Medals

### Total Medal Counts by Individuals

In [10]:
#collapse
# keep only rows where a medal was won (Gold, Silver, or Bronze)
df_cntry_mds = df[df["Medal"].isin(["Gold", "Silver", "Bronze"])]

# count medals per athlete name
df_cntry_mds2 = (
    df_cntry_mds.groupby(by="Name")["Medal"].size().reset_index(name="counts")
)

# sort by medal count, descending, and keep the top 14 athletes
df_cntry_mds3 = (
    df_cntry_mds2.sort_values(by="counts", ascending=False)
    .reset_index(drop=True)
    .head(14)
)

df_cntry_mds3.head(10)

|  | Name | counts |
|---|---|---|
| 0 | Michael Fred Phelps, II | 28 |
| 1 | Larysa Semenivna Latynina (Diriy-) | 18 |
| 2 | Nikolay Yefimovich Andrianov | 15 |
| 3 | Ole Einar Bjrndalen | 13 |
| 4 | Takashi Ono | 13 |
| 5 | Edoardo Mangiarotti | 13 |
| 6 | Borys Anfiyanovych Shakhlin | 13 |
| 7 | Jennifer Elisabeth "Jenny" Thompson (-Cumpelik) | 12 |
| 8 | Ryan Steven Lochte | 12 |
| 9 | Dara Grace Torres (-Hoffman, -Minas) | 12 |
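
The same ranking can be sketched more compactly with `isin` and `value_counts`. The rows below are made up. On the real data, counting by `ID` instead of `Name` may be safer if two athletes share a name, which is an assumption not checked here.

```python
import pandas as pd

# Made-up rows: an empty Medal string means no medal was won
rows = pd.DataFrame(
    {
        "Name": ["A", "A", "B", "B", "B", "C"],
        "Medal": ["Gold", "", "Silver", "Bronze", "Gold", ""],
    }
)

# Keep medal-winning rows, then count rows per athlete (sorted descending)
medal_rows = rows[rows["Medal"].isin(["Gold", "Silver", "Bronze"])]
top = medal_rows["Name"].value_counts().head(10)
print(top)
```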


### All Medal Counts by Individuals

In [31]:
#collapse
# names of the 14 athletes with the most medals
top_medal_indiv = df_cntry_mds3.Name.to_list()

# medal-winning rows for those athletes; isin() matches names literally,
# whereas str.contains() would treat the parentheses in some names as regex groups
df_top_medal_indiv = df[
    df["Name"].isin(top_medal_indiv)
    & df["Medal"].isin(["Gold", "Silver", "Bronze"])
]

# groupby Name and Medal counts
ml_ind_gp = (
    df_top_medal_indiv.groupby(by=["Name", "Medal"]).size().reset_index(name="counts")
)


# rank the medals: Bronze first, then Silver, then Gold
medal_order = {"Bronze": 0, "Silver": 1, "Gold": 2}

# add each row's medal rank
ml_ind_gp["medal_rank"] = ml_ind_gp["Medal"].map(medal_order)

# sort df by medal_rank
ml_ind_gp.sort_values(by="medal_rank", inplace=True)

# create bar chart
fig = px.bar(
    data_frame=ml_ind_gp,
    y="Name",
    x="counts",
    barmode="stack",
    color="Medal",
    text="counts",
    color_discrete_map={
        "Bronze": "rgb(175, 100, 88)",
        "Silver": "rgb(179,179,179)",
        "Gold": "gold",
    },
    orientation="h",
)

# update layout, white background, remove axis titles, order y-axis
fig.update_layout(
    {"plot_bgcolor": "rgba(255,255,255, 0.9)"},  # white background
    yaxis={"categoryorder": "total ascending"},
    title={
        "text": "Medal Counts by Individuals",
        "y": 0.98,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    },
    xaxis_title="",  # remove axis titles
    yaxis_title="",  # remove axis titles
)

fig.update_xaxes(showticklabels=False)

fig.show()

## Import Olympic data (city, country, continent) from Wikipedia
The Olympic dataset records the host city of each Games but not the host country or continent. The Wikipedia list of host cities adds that context. Results are shown in descending order by number of Games hosted.

In [12]:
#collapse
# Import olympic data (city, country, continent) from wikipedia
olympic_cc = pd.read_html(
    "https://en.wikipedia.org/wiki/List_of_Olympic_Games_host_cities"
)
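
`pd.read_html` returns a list with one DataFrame per HTML table, and the cells below pick tables by position (`olympic_cc[2]`, `olympic_cc[5]`). Those positions can shift whenever the Wikipedia page is edited. A sketch of a more defensive lookup follows, using a hypothetical `find_table` helper that selects the first table containing the expected columns. The sample tables are made up.

```python
import pandas as pd


def find_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(set(map(str, table.columns))):
            return table
    raise ValueError(f"No table with columns {sorted(required)}")


# Made-up stand-in for the list returned by pd.read_html
tables = [
    pd.DataFrame({"Rank": [1]}),
    pd.DataFrame({"Country": ["Country A"], "Total": [3]}),
]

hosts = find_table(tables, ["Country", "Total"])
print(hosts)
```

On the real data, a call such as `find_table(olympic_cc, ["Country", "Total"])` would fail loudly if the page layout changed, rather than silently charting the wrong table.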

### Count of Games Hosted by Cities

In [13]:
#collapse
# Count of Olympics Hosted by Cities
oc_city = olympic_cc[2].groupby("City.1")["Country"].size().reset_index(name="Counts")

oc_city = (
    (oc_city.sort_values(by="Counts", ascending=False))
    .reset_index()
    .rename(columns={"City.1": "City"})
    .head(10)
)

# create and show figure (bar chart)
bar_chart(oc_city, "City", "Counts", "City")

### Count of Games Hosted by Country

In [14]:
#collapse
# Count of Olympics Hosted by Country
oc_cntry = olympic_cc[5][["Country", "Total"]].fillna("").head(15)

# create and show figure (bar chart)
bar_chart(oc_cntry, "Country", "Total", "Country")

### Count of Games Hosted by Continent

In [15]:
#collapse
# Count of Olympic Games Hosted by Continent

oc_cont = olympic_cc[2].groupby("Continent")["City.1"].size().reset_index(name="Counts")

# reorder count values
oc_cont = oc_cont.sort_values(by="Counts", ascending=False).reset_index()

# create and show figure (bar chart)
bar_chart(oc_cont, "Continent", "Counts", "Continent")