# How have the Olympic games athletes changed over time?

## Introduction

**Business Context.** You work for a company that specializes in analyzing data for a variety of clients in the sports industry. Some questions that you frequently encounter include determining if a new player is promising enough to invest money in their development, which teams are the most likely to win certain matches, what events will be the most attractive to advertisers, etc.

**Business Problem.** As part of one of your projects, you have been asked to perform an exploratory data analysis of historical data to **detect patterns in the provenance, physical profile, and other characteristics of the athletes who compete in the Olympic games**. The conclusions of your analysis will help the rest of the team prepare a report for a new client who helps sports gear manufacturers find advertising opportunities.

**Analytical Context.** You have scraped a dataset from the Internet that contains data for all the Olympic games from Nagano 1998 to Rio 2016. It covers 46,533 individual athletes and has 13 columns. The `olympics_data` worksheet has 68,848 rows rather than 46,533 because some athletes appear more than once, for example when they have won multiple medals. The columns are:

* **ID**: A unique number assigned to each athlete
* **Name**: The athlete's name
* **Sex**: The athlete's sex
* **Age**: The athlete's age at the moment of the games
* **Height**: The athlete's height in centimeters
* **Weight**: The athlete's weight in kilograms
* **Team**: The athlete's team (country)
* **Year**: The year
* **Season**: The season
* **City**: The host city
* **Sport**: The sport the athlete competed in
* **Medal**: The medal that the athlete won, if any (can be Gold, Silver, Bronze, or NA)
* **Won medal?**: 1 if the athlete won a medal, 0 otherwise

The dataset can be downloaded from [this link](data/olympics_fellow.xlsx).

**Note:** Please write all your formulas in the `calculations` worksheet unless explicitly asked to do it in another sheet, clearly indicating the exercise they belong to. You will need to submit the Excel file along with this notebook.

In [1]:
import numpy as np
import pandas as pd

## Height, weight, and age

### Exercise 1

#### 1.1

Calculate the average height, weight, and age of athletes in Rio 2016 across all sports.

**Answer.**

In [2]:
file_source = '../data/olympics_fellow.xlsx'
df = pd.read_excel(file_source)

df.head()

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,Year,Season,City,Sport,Medal,Won medal?
0,22,Andreea Aanei,F,22.0,170.0,125.0,Romania,2016,Summer,Rio de Janeiro,Weightlifting,,0
1,51,Nstor Abad Sanjun,M,23.0,167.0,64.0,Spain,2016,Summer,Rio de Janeiro,Gymnastics,,0
2,55,Antonio Abadia Beci,M,26.0,170.0,65.0,Spain,2016,Summer,Rio de Janeiro,Athletics,,0
3,62,Giovanni Abagnale,M,21.0,198.0,90.0,Italy,2016,Summer,Rio de Janeiro,Rowing,Bronze,1
4,65,Patimat Abakarova,F,21.0,165.0,49.0,Azerbaijan,2016,Summer,Rio de Janeiro,Taekwondo,Bronze,1


In [3]:
df_rio_2016 = df[(df['Year'] == 2016) & (df['City'] == 'Rio de Janeiro')]

In [4]:
df_rio_2016

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,Year,Season,City,Sport,Medal,Won medal?
0,22,Andreea Aanei,F,22.0,170.0,125.0,Romania,2016,Summer,Rio de Janeiro,Weightlifting,,0
1,51,Nstor Abad Sanjun,M,23.0,167.0,64.0,Spain,2016,Summer,Rio de Janeiro,Gymnastics,,0
2,55,Antonio Abadia Beci,M,26.0,170.0,65.0,Spain,2016,Summer,Rio de Janeiro,Athletics,,0
3,62,Giovanni Abagnale,M,21.0,198.0,90.0,Italy,2016,Summer,Rio de Janeiro,Rowing,Bronze,1
4,65,Patimat Abakarova,F,21.0,165.0,49.0,Azerbaijan,2016,Summer,Rio de Janeiro,Taekwondo,Bronze,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
11627,135489,Anastasiya Valeryevna Zuyeva-Fesikova,F,26.0,182.0,71.0,Russia,2016,Summer,Rio de Janeiro,Swimming,,0
11628,135525,Martin Zwicker,M,29.0,175.0,64.0,Germany,2016,Summer,Rio de Janeiro,Hockey,Bronze,1
11629,135528,Marc Zwiebler,M,32.0,181.0,75.0,Germany,2016,Summer,Rio de Janeiro,Badminton,,0
11630,135547,Viktoriya Viktorovna Zyabkina,F,23.0,174.0,62.0,Kazakhstan,2016,Summer,Rio de Janeiro,Athletics,,0


In [5]:
# Keep one row per athlete (this assumes names are unique within these Games)
df_rio_2016_athletes = df_rio_2016.drop_duplicates('Name')

In [6]:
df_rio_2016_athletes

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,Year,Season,City,Sport,Medal,Won medal?
0,22,Andreea Aanei,F,22.0,170.0,125.0,Romania,2016,Summer,Rio de Janeiro,Weightlifting,,0
1,51,Nstor Abad Sanjun,M,23.0,167.0,64.0,Spain,2016,Summer,Rio de Janeiro,Gymnastics,,0
2,55,Antonio Abadia Beci,M,26.0,170.0,65.0,Spain,2016,Summer,Rio de Janeiro,Athletics,,0
3,62,Giovanni Abagnale,M,21.0,198.0,90.0,Italy,2016,Summer,Rio de Janeiro,Rowing,Bronze,1
4,65,Patimat Abakarova,F,21.0,165.0,49.0,Azerbaijan,2016,Summer,Rio de Janeiro,Taekwondo,Bronze,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
11627,135489,Anastasiya Valeryevna Zuyeva-Fesikova,F,26.0,182.0,71.0,Russia,2016,Summer,Rio de Janeiro,Swimming,,0
11628,135525,Martin Zwicker,M,29.0,175.0,64.0,Germany,2016,Summer,Rio de Janeiro,Hockey,Bronze,1
11629,135528,Marc Zwiebler,M,32.0,181.0,75.0,Germany,2016,Summer,Rio de Janeiro,Badminton,,0
11630,135547,Viktoriya Viktorovna Zyabkina,F,23.0,174.0,62.0,Kazakhstan,2016,Summer,Rio de Janeiro,Athletics,,0


In [7]:
average_height, average_weight, average_age = df_rio_2016_athletes[['Height', 'Weight', 'Age']].mean()

In [8]:
print(('The average height, weight, and age of athletes in Rio 2016 across all sports are: '
       '{:.2f} m, {:.2f} kg, and {:.2f} years').format(average_height/100, average_weight, average_age))

The average height, weight, and age of athletes in Rio 2016 across all sports are: 1.77 m, 71.92 kg, and 26.38 years


-------

#### 1.2

Repeat Exercise 1.1 but for Sydney 2000. Have the averages changed noticeably?

**Answer.**

In [9]:
def extract_average(df, city, year):
    """Print and return the athlete count and mean height (cm), weight (kg), and age for one Games."""
    # Keep only the rows for the requested Games
    df = df[(df['Year'] == year) & (df['City'] == city)]
    # One row per athlete, so repeat appearances do not skew the means
    df_athletes = df.drop_duplicates('Name')
    n_athletes = len(df_athletes)
    average_height, average_weight, average_age = df_athletes[['Height', 'Weight', 'Age']].mean()

    print(('In {} {}, the number of athletes was {}, and the average height, weight, and '
           'age of athletes across all sports are: '
           '{:.2f} m, {:.2f} kg, and {:.2f} years').format(
        city, year, n_athletes, average_height/100, average_weight, average_age))
    return (n_athletes, average_height, average_weight, average_age)

In [10]:
n_s2000, height_s2000, weight_s2000, age_s2000 = extract_average(df, 'Sydney', 2000)

In Sydney 2000, the number of athletes was 10639, and the average height, weight, and age of athletes across all sports are: 1.77 m, 72.58 kg, and 25.83 years
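
To judge whether the averages changed noticeably, the two sets of printed values can be compared as relative changes. The sketch below is only a rough check: it uses the rounded figures from the outputs above, not the full-precision means, and the helper name `relative_change` is introduced here for illustration.

```python
# Averages as printed above (rounded): height in m, weight in kg, age in years.
sydney_2000 = {"height_m": 1.77, "weight_kg": 72.58, "age_years": 25.83}
rio_2016 = {"height_m": 1.77, "weight_kg": 71.92, "age_years": 26.38}


def relative_change(old, new):
    """Percentage change from old to new (hypothetical helper)."""
    return (new - old) / old * 100


# Change from Sydney 2000 to Rio 2016 for each measure
changes = {key: relative_change(sydney_2000[key], rio_2016[key]) for key in rio_2016}

for key, pct in changes.items():
    print(f"{key}: {pct:+.2f}%")
```

With these rounded inputs, height is unchanged, weight moves by less than 1%, and age by roughly 2%, which suggests the averages have barely changed.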


-------

## Geographic representation

### Exercise 2

This is a chart of the number of countries that participated in the games from 1998 to 2016. What can you conclude from it?

![Teams per year](../data/images/teams_per_year.png)

**Hint:** Keep in mind that Summer and Winter games are not held in the same year. In the Winter games, the number of teams is typically lower than in the Summer games.

**Answer.**

In [11]:
melted_df = df[["Year","Season","Team"]].melt(id_vars=["Year","Season"], value_name="Country").drop_duplicates()

In [12]:
summary_q2 = pd.pivot_table(melted_df,
                            values = "Country",
                            index=["Season", "Year"],
                            aggfunc="count").rename(columns={"Country":"Number of countries"})

In [13]:
summary_q2

Season,Year,Number of countries
Summer,2000,243
Summer,2004,260
Summer,2008,292
Summer,2012,245
Summer,2016,249
Winter,1998,106
Winter,2002,114
Winter,2006,113
Winter,2010,116
Winter,2014,119
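
The melt-and-pivot steps above amount to counting distinct teams per games. As an alternative sketch, the same count can be obtained with `groupby` and `nunique`. The `toy` DataFrame below is made-up data that only mirrors the `Year`, `Season`, and `Team` columns; it is not the real dataset.

```python
import pandas as pd

# Made-up rows for illustration; Spain appears twice in 2000 on purpose.
toy = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2002, 2002],
    "Season": ["Summer", "Summer", "Summer", "Winter", "Winter"],
    "Team": ["Spain", "Spain", "Italy", "Norway", "Norway"],
})

# Count each team once per (Season, Year) pair
teams_per_games = (
    toy.groupby(["Season", "Year"])["Team"]
    .nunique()
    .rename("Number of countries")
    .reset_index()
)
print(teams_per_games)
```

Using `nunique` avoids the intermediate `melt` and `drop_duplicates` steps, at the cost of being a slightly different code path from the one used above.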


In [14]:
summary_q2.groupby("Season").median()

Season,Number of countries
Summer,249.0
Winter,114.0


In [15]:
in_summer, in_winter = summary_q2.groupby("Season").median()['Number of countries']

In [16]:
print(("From the data, we can conclude that in Summer games the participation is {:.0f}% higher"
       " than in Winter games".format((in_summer/in_winter-1)*100)))

From the data, we can conclude that in Summer games the participation is 118% higher than in Winter games


-------

### Exercise 3

These are the top 10 countries by number of athletes sent for all the games between 1998 and 2016. What patterns can you spot?

![1998](../data/images/top_1998.png)
![2000](../data/images/top_2000.png)
![2002](../data/images/top_2002.png)
![2004](../data/images/top_2004.png)
![2006](../data/images/top_2006.png)
![2008](../data/images/top_2008.png)
![2010](../data/images/top_2010.png)
![2012](../data/images/top_2012.png)
![2014](../data/images/top_2014.png)
![2016](../data/images/top_2016.png)
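
Rankings like the ones in these charts could be produced along the following lines. This is a minimal sketch on made-up data that assumes the `ID`, `Year`, and `Team` columns described in the introduction; it is not necessarily how the charts were actually built.

```python
import pandas as pd

# Made-up rows; athlete 6 appears twice and should be counted once.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 6],
    "Year": [2016] * 7,
    "Team": ["USA", "USA", "USA", "Spain", "Spain", "Italy", "Italy"],
})

# Number of distinct athletes per team and year
athletes = (
    toy.groupby(["Year", "Team"])["ID"]
    .nunique()
    .rename("athletes")
    .reset_index()
)

# Largest delegations first within each year, then keep the top N (here N=2)
top = (
    athletes.sort_values(["Year", "athletes"], ascending=[True, False])
    .groupby("Year")
    .head(2)
)
print(top)
```

For the real data, `head(2)` would become `head(10)`.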

**Answer.**

-------

## Athletes by gender

These pie charts show the number of athletes by gender in Sydney 2000 and Rio 2016:

![Male and female Sydney](../data/images/male_female_sydney.png)
![Male and female Rio](../data/images/male_female_rio.png)

### Exercise 4

#### 4.1

We need to put labels on these pie charts. How many male and female athletes were there in Rio 2016 and Sydney 2000?

**Hint:** You can use the **`COUNTIF()`** function to solve this exercise. This function works very similarly to the `COUNTA()` function, with the difference that it only counts those cells that meet a certain condition. Feel free to look this function up on the Internet!

**Answer.**

-------

In [44]:
def extract_female_male(df, city, year):
    df = df[(df['Year'] == year) & (df['City'] == city)]
    df_athletes = df.drop_duplicates('Name')
    n_males = df_athletes[df_athletes["Sex"]=="M"].shape[0]
    n_females = df_athletes[df_athletes["Sex"]=="F"].shape[0]
    ratio = n_males/n_females
    
    print(("In {}, {} male and {} female athletes participated, "
           "so the male/female ratio was {:.2f}".format(city, n_males,
                                                     n_females,
                                                     ratio)))
    return (n_males, n_females, ratio)

In [45]:
cities = ['Rio de Janeiro', 'Sydney']
years = [2016, 2000]

for city, year in zip(cities, years):
    extract_female_male(df, city, year)

In Rio de Janeiro, 6143 male and 5031 female athletes participated, so the male/female ratio was 1.22
In Sydney, 6575 male and 4064 female athletes participated, so the male/female ratio was 1.62


#### 4.2

Use Excel to calculate the ratio of $\frac{\text{male}}{\text{female}}$ athletes in both Rio and Sydney. Has it changed?

**Answer.**

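Yes. The male/female ratio fell from 1.62 in Sydney 2000 to 1.22 in Rio 2016, a drop of about 25%. Over the same period the number of female athletes rose from 4,064 to 5,031, while the number of male athletes fell from 6,575 to 6,143.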
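
The size of the change can be checked directly from the counts printed in Exercise 4.1. The short sketch below only redoes that arithmetic; the variable names are chosen here for illustration.

```python
# Athlete counts taken from the output of Exercise 4.1
sydney_2000 = {"male": 6575, "female": 4064}
rio_2016 = {"male": 6143, "female": 5031}

ratio_sydney = sydney_2000["male"] / sydney_2000["female"]
ratio_rio = rio_2016["male"] / rio_2016["female"]

# Relative change of the male/female ratio from Sydney 2000 to Rio 2016
ratio_change_pct = (ratio_rio / ratio_sydney - 1) * 100

# Relative change in the number of female athletes over the same period
female_change_pct = (rio_2016["female"] / sydney_2000["female"] - 1) * 100

print(f"Ratio: {ratio_sydney:.2f} -> {ratio_rio:.2f} ({ratio_change_pct:+.1f}%)")
print(f"Female athletes: {female_change_pct:+.1f}%")
```

The ratio falls by roughly a quarter, while the number of female athletes grows by a little under a quarter.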

-------

#### 4.3

Complete the table in the `gender_sport` worksheet. 

**Hint:** Use the **`COUNTIFS()`** function. It works like `COUNTIF()` but allows you to have more than one condition over more than one column.

#### 4.4

Interpret the results of your completed table from Exercise 4.3. What can you say?

**Answer.**

-------

## Medals

### Exercise 5

#### 5.1

How many medals were awarded between 1998 and 2016?

**Hint:** Use the `Won medal?` column.
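
Because the column holds 1 for a medal and 0 otherwise, summing it counts the medal-winning rows. The sketch below shows the idea in pandas on made-up data, using the column name `Won medal?` as it appears in the DataFrame header. Note the assumption that each row is one athlete's result, so a team medal would presumably be counted once per team member.

```python
import pandas as pd

# Made-up rows for illustration; not the real dataset.
toy = pd.DataFrame({
    "Year": [1998, 2000, 2016, 2016],
    "Won medal?": [1, 0, 1, 1],
})

# Restrict to the requested years (inclusive on both ends), then sum the flags
in_range = toy["Year"].between(1998, 2016)
total_medals = int(toy.loc[in_range, "Won medal?"].sum())
print(total_medals)
```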

**Answer.**

-------

#### 5.2

How many medals per athlete were awarded in Rio 2016?

**Answer.**

-------

#### 5.3

How about in Sydney? Is it lower or higher?

**Answer.**

-------

### Exercise 6

#### 6.1

Complete the table in the `medals_sport` worksheet (you can use `COUNTIF()` and `SUMIF()`). What are the top 10 sports with the most medals per athlete in Rio 2016? How about in Sydney 2000?

**Hint:** To find the top tens, you will need to sort the table by multiple columns, namely year and medals per athlete. Again, use the Internet to help you with how to do this!
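
The same logic can be sketched in pandas as a cross-check: count athletes and sum medals per sport and year (the roles of `COUNTIF()` and `SUMIF()`), divide one by the other, and sort by year and medals per athlete. The data and the column names `athletes`, `medals`, and `medals_per_athlete` are made up for illustration; the actual `medals_sport` worksheet may be laid out differently.

```python
import pandas as pd

# Made-up rows for illustration; not the real dataset.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Year": [2016, 2016, 2016, 2016, 2000, 2000],
    "Sport": ["Rowing", "Rowing", "Judo", "Judo", "Rowing", "Judo"],
    "Won medal?": [1, 1, 0, 1, 0, 1],
})

# Athletes (distinct IDs) and medals (sum of the 0/1 flag) per year and sport
per_sport = (
    toy.groupby(["Year", "Sport"])
    .agg(athletes=("ID", "nunique"), medals=("Won medal?", "sum"))
    .reset_index()
)
per_sport["medals_per_athlete"] = per_sport["medals"] / per_sport["athletes"]

# Sort by several columns: most recent year first, then best ratio first
ranked = per_sport.sort_values(
    ["Year", "medals_per_athlete"], ascending=[False, False]
)

# Top 10 for one Games (the toy data has only two sports per year)
top_2016 = ranked[ranked["Year"] == 2016].head(10)
print(top_2016)
```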

**Answer.**

-------

#### 6.2

Which sports are included in both rankings? What could be the reason that these sports show up in both tables?

**Answer.**

-------

## Attribution

"120 years of Olympic history: athletes and results", June 15, 2018, Kaggle (user rgriffin, with data from www.sports-reference.com), Sports Reference [Terms of Use](https://www.sports-reference.com/termsofuse.html), https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results