# Countries Olympics performance evaluation
## Data analysis project for **Pre-Olympics TV Broadcasting Quick Stats**
### October 2021
#### Andrea Biasioli

In [241]:
import os
import pandas as pd
import plotly.graph_objects as go
from IPython.display import display
import maps_plotting_functions as mp
import plotly.express as px
import importlib
pd.options.display.max_rows = 10


# Index
- [Goal](#goals)
- [Questions](#questions)
- [Executive summary](#summary)
- [Assumptions](#assumptions)
- [Approach](#approach)
- [Preliminary analysis](#preliminary)
- [Framework: new metrics and features](#framework)
- [Performance maps and tables](#maps)
- [Recap on initial hypothesis](#recap)
- [Results and correlations](#results)
- [Conclusions](#conclusions)
- [Challenges and model artifacts](#challenges)
- [What to do next](#next)

<a id='goals'></a>

## Goal of the analysis
The analysis relies on the SportsStats dataset to study country-wide patterns in terms of medal earning.

The goal is to understand which countries are currently performing better and which countries are on the rise.

Additional information provided includes an analysis of whether different NOCs rely on single high-talent individuals competing in multiple editions/events, and of which parameters correlate best with winning medals.

### Client
The results of the analysis will be used in a **sport TV broadcasting event** occurring before the next Olympics, narrating a story of what the different performance records are and what to expect in the next editions.

[Coursera SQL Capstone](https://www.coursera.org/learn/sql-data-science-capstone)

SportsStats is a sports analysis firm partnering with local news and elite personal trainers to provide “interesting” insights to help their partners.  Insights could be patterns/trends highlighting certain groups/events/countries, etc. for the purpose of developing a news story or discovering key health insights.




<a id='questions'></a>

# Question 1
## Which countries are currently the most successful (in terms of number of medals) and which ones are on the rise?

- It can be helpful to know if similarly-sized countries are faring better or worse
- It can also be helpful to see which countries are consistently performing well (maybe they established a good funding/coaching system)

# Question 2
## Which countries rely mostly on single-athlete performance to obtain their medals?

Some countries' performance relies on single individuals. This does not bode well for their future: when the athlete retires, the country's performance will likely worsen.


# Question 3
## Is there any connection between the height and the expected performance?

- Sometimes young athletes are selected based on their technical skills, and not their physical potential. If a relationship is found between height and performance, the recruiting process could give more weight to physical attributes during selection.
- Height may be relevant in more recent editions but not in earlier ones. Is the importance changing in time?
- The analysis should be differentiated based on sex

<a id='summary'></a>

# Executive summary
- The United States and France are currently among the most successful countries and are quickly increasing their dominance. Russia and China still hold a large number of medals, but their prospects do not look good. Possible reasons are the doping scandal for Russia and, for China, the recent hosting of the Games, which temporarily boosted its record (more medals earned in 2008, fewer in the following editions)
- Comparisons with similar-sized countries can only be carried out for average or small countries. In fact, the highest-ranked countries like USA, Russia, and China are comparable in terms of performance but not really in terms of size.
- The largest countries are the ones that rely most on having one athlete compete in multiple events
- The athlete per event metric is likely not indicative of the future prospects of a country's Olympic performance
- Height is strongly positively correlated with performance (in terms of number of medals earned)
- Data before WWII is unreliable (many missing features)
- Both M and F athletes are taller in recent editions, with M being on average 11cm taller than F
- Summer athletes are on average consistently taller and heavier than Winter athletes

<a id='assumptions'></a>

# Assumptions on data
- Data before WWII is likely unreliable, with many missing features.
- I expect larger population countries to be the most successful, but not necessarily the ones improving the most. I assume that smaller countries, where 1 or 2 medals can significantly shift their performance record, will be the ones with the best improvement record.
- I expect height and performance to be related. Aside from gymnastics, in many other sports height can offer significant advantages, even within a weight class
- I expect height for men and women to be centered at different mean values

<a id='approach'></a>

# Approach
- I will use the total number of medals (gold + silver + bronze) as the main performance index
- Define position/velocity/acceleration-like features to describe how countries are faring now, what’s their performance trend, and how quickly they are improving their record respectively
- Analyze the medals-to-athletes performance for each country during each year, to determine which countries rely on single-athletes performance
- It would be interesting to add the country population feature. This could be useful to investigate the number of medals per capita

<a id='preliminary'></a>

# Preliminary analysis of the dataset
## Missing feature analysis


In [232]:
gold_df = pd.read_parquet(os.path.join('dataset', 'gold_missing_features.parquet'))
data = [
    go.Scatter(x=gold_df.Year, y=gold_df.missing_weight_perc, name='weight'),
    go.Scatter(x=gold_df.Year, y=gold_df.missing_height_perc, name='height')
    ]
fig = go.Figure(data=data)
fig.update_layout(title='Missing height and weight data based on the Olympic year',
                   xaxis_title='Year',
                   yaxis_title='Missing data (%)')
fig.show()
del gold_df

## Missing features analysis results
 - Assumption: Data before WWII is likely unreliable, with many missing features.
 - Test strategy: evaluated missing weight/height features (as a percentage of the yearly entries) as a function of the Olympic year
 - Result: after 1960 the missing features are on average <10% (acceptable)
 - Conclusion: I will consider only the data after 1960
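
The test strategy above can be sketched as follows. This is a minimal example on a hypothetical athlete-level sample (the values are made up); only the `missing_height_perc` and `missing_weight_perc` column names come from the plotting cell, and the actual preprocessing behind `gold_missing_features.parquet` may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical athlete-level sample: one row per athlete entry.
athletes = pd.DataFrame({
    "Year":   [1936, 1936, 1936, 1936, 1964, 1964, 1964, 1964],
    "Height": [np.nan, 180.0, np.nan, np.nan, 175.0, 168.0, np.nan, 190.0],
    "Weight": [np.nan, np.nan, 70.0, np.nan, 70.0, 60.0, 82.0, 95.0],
})

# isna() yields booleans; their mean per year is the missing fraction.
missing = (
    athletes[["Height", "Weight"]].isna()
    .groupby(athletes["Year"])
    .mean()
    .mul(100)
    .rename(columns={"Height": "missing_height_perc",
                     "Weight": "missing_weight_perc"})
    .reset_index()
)
print(missing)
```

A threshold such as the 10% used above can then be applied directly to the two percentage columns to pick the first acceptable year.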


## Investigation on height based on sex


In [204]:
# Characterize men and women heights


silver_df = pd.read_parquet(os.path.join('dataset', 'silver_df.parquet'))
silver_df = silver_df[silver_df.Year>1960]
gold_height_df = silver_df.dropna(subset=['Height'])
fig = px.histogram(x=gold_height_df.Height, color=gold_height_df.Sex, nbins=30,
                   barmode='overlay', histnorm='probability density')
fig.update_layout(title='Height probability density based on sex (after 1960)',
                   yaxis_title='PDF (-)',
                   xaxis_title='Height (cm)',
                  legend={'title': 'Sex:'})
fig.show()

In [205]:
print('Female:')
display(gold_height_df['Height'].loc[gold_height_df.Sex=='F'].describe())

print('Male:')
display(gold_height_df['Height'].loc[gold_height_df.Sex=='M'].describe())

Female:


count    64402.000000
mean       168.012422
std          8.816888
min        127.000000
25%        163.000000
50%        168.000000
75%        174.000000
max        213.000000
Name: Height, dtype: float64

Male:


count    125277.000000
mean        179.276084
std           9.451385
min         127.000000
25%         173.000000
50%         180.000000
75%         185.000000
max         226.000000
Name: Height, dtype: float64

## Investigation on height based on sex (results)
- Assumption: I expect height for men and women to be centered at different mean values
- Test strategy: group data based on sex, and run descriptive stats and visualization
- Result: M/F heights follow approximately Gaussian distributions with different mean values (men are on average 11cm taller). Additionally, the women's data is slightly less dispersed around the mean (less variability)
- Conclusion: M/F height analysis needs to be carried out independently


## Investigation on height based on sex (additional considerations)

In [206]:
# Height based on olympic year and sex
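# numeric_only=True avoids errors on non-numeric columns in recent pandas versions
gold_year_stats_df = silver_df.drop(columns='ID').groupby(by=["Year", "Sex"]).mean(numeric_only=True)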
gold_year_stats_df.reset_index(inplace=True)

fig = px.line(gold_year_stats_df, x='Year', y='Height', color='Sex')
fig.update_layout(title='Average height based on sex and year',
                   xaxis_title='Year',
                   yaxis_title='Height (cm)')
fig.show()

## Investigation on height based on sex (additional considerations)
- Assumption: I expect height to be increasing over time, as for the general population
- Test strategy: group data by year, sex and visualize it
- Result: M/F seem to follow the same trend of increasing height. The jagged right side of the curves (where Summer and Winter editions alternate every two years) shows that Summer athletes are on average taller than Winter athletes
- Conclusion: I will filter out Winter athletes and focus my analysis on Summer athletes (where most of the medals are awarded)

(Similar considerations are valid for weight measurements, which are not the focus of this report)

<a id='framework'></a>

# Framework: new metrics and features
The framework section describes the work that was done to add the features needed to carry out the desired analysis.

The framework section includes:
- Create a population-year lookup table
- Add alpha_3 and iso_names for each National Olympic Committee
- Height binning to evaluate how success correlates with height
- New metrics definition (with sanity check)
- Add velocity-like and acceleration-like features
- Count of medals awarded for each NOC in each Summer Olympics year

## Create a population-year lookup table
- In order to consider populations in the analysis, I need to create a population-year lookup table
- Loaded population data from https://data.worldbank.org
- Complemented missing data with https://population.un.org/wpp/
- Note: some NOCs, like the Refugee Olympic Committee and Individual Olympic Athletes, do not have a well known population, and have been excluded from the analysis

The results for the table are shown in the cell below (first 5 rows).


In [207]:
pop = pd.read_parquet(os.path.join('dataset', 'populations.parquet'))
display(pop.head(5))
del pop

Unnamed: 0,Country Name,Country Code,Year,Population
2,Afghanistan,AFG,1960,8996967
271,Afghanistan,AFG,1961,9169406
540,Afghanistan,AFG,1962,9351442
809,Afghanistan,AFG,1963,9543200
1078,Afghanistan,AFG,1964,9744772
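
A minimal sketch of how such a lookup table could be built, assuming the downloaded file is in a wide layout (one row per country, one column per year). The wide frame below is a hand-made stand-in that reuses the Afghanistan values shown above, and `population_of` is a hypothetical helper, not a function from this project.

```python
import pandas as pd

# Assumed wide layout: one row per country, one column per year.
wide = pd.DataFrame({
    "Country Name": ["Afghanistan"],
    "Country Code": ["AFG"],
    "1960": [8996967],
    "1961": [9169406],
    "1962": [9351442],
})

# Reshape to long format: one row per (country, year).
population_lookup = wide.melt(
    id_vars=["Country Name", "Country Code"],
    var_name="Year",
    value_name="Population",
)
population_lookup["Year"] = population_lookup["Year"].astype(int)


def population_of(code: str, year: int) -> int:
    """Hypothetical helper: population for a country code and year."""
    match = population_lookup[
        (population_lookup["Country Code"] == code)
        & (population_lookup["Year"] == year)
    ]
    return int(match["Population"].iloc[0])


print(population_of("AFG", 1961))
```

The long format makes it straightforward to merge the population onto a country-year table on the `Country Code` and `Year` columns.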


## Add alpha_3 and iso_names for each National Olympic Committee
- In order to visualize data with maps, I need to associate each NOC to the respective alpha_3 country identification code (https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3)
- Iso names are loaded as well, for a proper visualization on the map (because region attributes of the NOC table were occasionally unreliable)
- The process was automated with the pycountry project (https://pypi.org/project/pycountry/)

The results for the table are shown in the cell below (first 5 rows).


In [208]:
iso_countries = pd.read_parquet(os.path.join('dataset', 'iso_countries.parquet'))
display(iso_countries.head(5))
del iso_countries

Unnamed: 0,NOC,region,alpha_3,iso_names
0,AFG,Afghanistan,AFG,Afghanistan
1,AHO,Curacao,CUW,Curaçao
2,ALB,Albania,ALB,Albania
3,ALG,Algeria,DZA,Algeria
4,AND,Andorra,AND,Andorra


## Height "binning"
As preliminary work, the data is grouped by height and sex and ordered by ascending height.
While grouping, the medals (and the no-medal outcomes) are counted (the first rows are shown in the cell below).

Let's take a look at how populated the different height classes are.

In [209]:
noc_height_df = pd.read_parquet(os.path.join('dataset', 'noc_height.parquet'))
display(noc_height_df.head(5))

Unnamed: 0,Height,Sex,Gold,Silver,Bronze,NoMedal,TotalPartecipants
0,127,M,0,0,0,1,1
1,127,F,0,0,0,6,6
2,128,M,0,0,0,1,1
3,130,M,0,0,0,2,2
4,131,F,0,0,0,2,2


In [210]:
noc_height_df[['Height', 'TotalPartecipants']].describe()


Unnamed: 0,Height,TotalPartecipants
count,169.0,169.0
mean,173.538462,1122.360947
std,25.634333,1684.239232
min,127.0,1.0
25%,152.0,16.0
50%,173.0,231.0
75%,194.0,1783.0
max,226.0,9455.0


On average, each class (distinct value of height, for example 175cm or 137cm) contains 1122 athletes.
However, some classes contain only one athlete!
As I want to run the correlation analysis between height and success frequency, I do not want sparsely populated classes (outliers) in my analysis.
For example, the male 127cm class contains only one athlete: whether he succeeded or not is likely irrelevant for the analysis.
As a filtering strategy for outliers, I excluded height classes with fewer than 5.2 athletes (the 10% quantile of the class populations).

I use the MedalsPerPartecipant metric to evaluate the success in the different height classes. For example, if there are 20 athletes at 210cm, and 15 of them were awarded a medal, the MedalsPerPartecipant in the class would be 0.75.


NOTE: The metric could likely be further refined to account for highly (and lowly) populated classes, but I will leave it for future investigation.
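
The filtering and the MedalsPerPartecipant metric described above can be sketched as follows. The height classes and counts are hypothetical (chosen so that the 210cm class reproduces the 15-out-of-20 example), the column names follow the table shown earlier, and the exact implementation used in this project may differ.

```python
import pandas as pd

# Hypothetical height classes, one row per height (single sex for brevity).
classes = pd.DataFrame({
    "Height": [127, 160, 170, 180, 190, 210],
    "Sex": ["M"] * 6,
    "Gold": [0, 10, 40, 60, 30, 5],
    "Silver": [0, 12, 45, 55, 28, 5],
    "Bronze": [0, 8, 50, 65, 32, 5],
    "TotalPartecipants": [1, 300, 1500, 2000, 900, 20],
})

# Classes below the 10% quantile of the class populations are treated as outliers.
threshold = classes["TotalPartecipants"].quantile(0.10)
populated = classes[classes["TotalPartecipants"] >= threshold].copy()

# Share of participants in each class that were awarded a medal.
populated["MedalsPerPartecipant"] = (
    populated[["Gold", "Silver", "Bronze"]].sum(axis=1)
    / populated["TotalPartecipants"]
)
print(populated[["Height", "TotalPartecipants", "MedalsPerPartecipant"]])
```

With real data the same filter would be applied separately for each sex, since the two height distributions are analyzed independently.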

## New metrics definition
### PopulationPerMedal_thousands
#### Lowest is better
For each country, how many people were needed to win a medal (based on each Olympic year).
For example, a country that in 2012 had 1,000,000 population and 1 Olympic medal (in 2012), would have a PopulationPerMedal_thousands metric of 1,000.
### EventPartecipationPerMedal
#### Lowest is better
For each country, how many event participations were needed to win a medal (per Olympic year).
For example, a country that in 2016 participated in 300 events and won 10 medals would have an EventPartecipationPerMedal metric of 30.
### AthletePerMedal
#### Lowest is better
Similar to EventPartecipationPerMedal metric, but based on the country distinct athletes competing.
### AthletePerEventPartecipation
#### Higher is better?
This metric evaluates which NOCs are more reliant on a single athlete's performance (one athlete competing in many events). A ratio of 1 means there is a different athlete competing in every event. A ratio that approaches 0 means the same athlete is competing in all the events.
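
As a compact restatement of the four definitions, here is a minimal sketch. The function `country_year_metrics` and its argument names are hypothetical; only the metric names come from this notebook. The two calls reproduce the worked examples given in the definitions, with placeholder values for the inputs that those examples do not specify.

```python
def country_year_metrics(population, event_participations,
                         distinct_athletes, total_medals):
    """Hypothetical helper: the four metrics for one NOC in one Olympic year."""
    if total_medals <= 0:
        # Per-medal ratios are undefined for countries without medals.
        raise ValueError("per-medal metrics need at least one medal")
    return {
        "PopulationPerMedal_thousands": population / total_medals / 1000,
        "EventPartecipationPerMedal": event_participations / total_medals,
        "AthletePerMedal": distinct_athletes / total_medals,
        "AthletePerEventPartecipation": distinct_athletes / event_participations,
    }


# 1,000,000 people and 1 medal (the other inputs are placeholders).
small = country_year_metrics(1_000_000, 10, 5, 1)
# 300 event participations and 10 medals (the other inputs are placeholders).
large = country_year_metrics(50_000_000, 300, 200, 10)

print(small["PopulationPerMedal_thousands"])
print(large["EventPartecipationPerMedal"])
```

Raising an error for zero medals mirrors the choice of excluding countries without medals from the per-medal statistics.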



The descriptive stats are shown in the cell below. Some considerations follow.

In [211]:
gold_metrics_partecipants_df = pd.read_parquet(os.path.join('dataset', 'gold_countries_year_metrics.parquet'))
display(gold_metrics_partecipants_df.describe())


Unnamed: 0,Year,events_partecipations_no,distinct_athletes_no,Bronze,Gold,NoMedal,Silver,TotalPartecipants,TotalMedals,Population,PopulationPerMedal_thousands,EventPartecipationPerMedal,AthletePerMedal,AthletePerEventPartecipation
count,861.0,861.0,861.0,861.0,861.0,861.0,861.0,861.0,861.0,861.0,861.0,861.0,861.0,861.0
mean,1994.466899,162.288037,119.869919,4.441347,3.996516,66.40302,3.982578,78.823461,12.420441,57597380.0,16619.76,25.607903,19.74804,0.785246
std,15.737095,164.983912,119.404391,6.834275,8.565275,50.906791,7.022451,67.26393,21.66322,165493000.0,79553.59,24.84535,18.49284,0.119667
min,1964.0,3.0,3.0,0.0,0.0,2.0,0.0,3.0,1.0,53200.0,50.24067,2.8,2.384615,0.355263
25%,1984.0,43.0,36.0,1.0,0.0,27.0,0.0,30.0,2.0,5591572.0,1012.278,11.56,8.684211,0.713287
50%,1996.0,95.0,74.0,2.0,1.0,52.0,1.0,57.0,4.0,14760090.0,2463.843,18.0,14.142857,0.785571
75%,2008.0,232.0,163.0,5.0,4.0,94.0,4.0,107.0,13.0,49230580.0,9222.692,31.0,24.0,0.87037
max,2016.0,839.0,648.0,46.0,82.0,234.0,69.0,322.0,195.0,1378665000.0,1129623.0,281.0,174.0,1.0


### Sanity check
As these metrics are on a per medal basis, countries with no medals awarded are excluded from this section.
- Year: starts at 1964 (as we filtered out earliest years due to missing height features)
- Event participations: Maximum is 839 events (USA 1996), minimum is 3 (Mozambique 1996)
- Distinct athletes participating: Max 648 (USA 1996), Min 3 (Curaçao 1988, Mozambique 1996)
- Total Medals: Max 195 (USSR 1980)
- Population: Max >1.3B (China 2016)
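- PopulationPerMedal: In 2004 India participated in 81 events at the Olympics, winning one medal, for a record-high ratio of 1 medal per 1.13B people. At the lowest end, Chile participated in the 2004 Olympics and won 3 medals, with a recorded population of only 150,722 people. This figure is far below Chile's actual population (roughly 16 million at the time), so it likely reflects a mismatch in the population lookup rather than a real record.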
- AthletePerMedal: On average, it takes about 20 athletes to get a medal. However, some teams are so good that they need far fewer, getting a medal about every 2.5 athletes (Romania in 1984 won 52 medals with 124 athletes). On the other side, some teams need 174 athletes to get a medal! (Mexico in 1972 won 1 silver medal, participating in the Games with 174 athletes).
- AthletePerEventPartecipation: Max is 1, as expected. Minimum is 0.35 (Cuba in 1964 participated in 76 events with 27 distinct athletes)

## Add velocity-like and acceleration-like features
I want to define position/velocity/acceleration-like features to describe how countries are faring now, what’s their performance trend, and how quickly they are improving their record respectively.

The position feature is given by the number of total medals earned in a given year.
For example, in 2016 Italy was awarded 28 total medals.

The velocity feature is given by the derivative of the number of total medals, evaluated as a discrete difference between two consecutive editions.

In principle:

$$v = \frac{{\text{Medals}}_{\,2016}-{\text{Medals}}_{\,2012}}{4\;\text{years}}$$

However, that is a first order approximation. I used the gradient function provided by the numpy library (https://numpy.org/doc/stable/reference/generated/numpy.gradient.html) for a second-order approximation of the (centered) derivative, with first-order approximation at the endpoints.

Similarly, I evaluated the acceleration as the derivative of velocity.

I am not particularly interested in the absolute values of velocity (and acceleration), but only in how they compare with the velocities (and accelerations) of other countries.
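
A minimal sketch of the velocity and acceleration features with `numpy.gradient`. The 2012 and 2016 totals are the USA values reported in this notebook; the 2004 and 2008 values are illustrative placeholders, so the interior results are not meant to match the real tables.

```python
import numpy as np

years = np.array([2004, 2008, 2012, 2016], dtype=float)
# 2012 and 2016: USA totals from this notebook; 2004 and 2008: illustrative only.
medals = np.array([100, 110, 103, 121], dtype=float)

# Centered differences in the interior, one-sided differences at the endpoints
# (numpy's default edge_order=1).
velocity = np.gradient(medals, years)
acceleration = np.gradient(velocity, years)

for y, v, a in zip(years, velocity, acceleration):
    print(int(y), v, a)
```

At the 2016 endpoint the velocity reduces to the one-sided difference (121 - 103) / 4 = 4.5 medals per year, while an interior year such as 2012 uses the centered difference over the two neighbouring editions.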




## Count of medals awarded for each NOC in each Summer Olympics year

I want to create the framework to analyze the success of different countries based on the total number of medals awarded

Test strategy: group the data frame by NOC, Sport, Event, Year, Season, Games, Team, Medal. The reason is that I could have athletes from the same NOC on the podium for the same competition. As such, the instances need to be kept separated for a reliable count. Example:
- Athlete A, USA, Swimming, 100m Freestyle, 2012, Summer, 2012 London, USA, Gold
- Athlete B, USA, Swimming, 100m Freestyle, 2012, Summer, 2012 London, USA, Silver

Once the data above is collected, I will group by NOC and count the total number of medals awarded. I will join this table with the original data frame, where I can count the total number of athletes representing every NOC
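
The counting strategy above can be sketched as follows. This is a minimal example on hypothetical rows (athlete names and results are made up); the column names follow the grouping keys listed above, while `Name`, `distinct_athletes_no` and `TotalMedals` are assumed here for the athlete identifier and the output columns.

```python
import pandas as pd

# Hypothetical athlete-level rows: one row per athlete per event.
rows = pd.DataFrame({
    "Name": ["Athlete A", "Athlete B", "Athlete C", "Athlete D", "Athlete E"],
    "NOC": ["USA", "USA", "USA", "USA", "FRA"],
    "Sport": ["Swimming", "Swimming", "Basketball", "Basketball", "Swimming"],
    "Event": ["100m Freestyle", "100m Freestyle",
              "Men's Basketball", "Men's Basketball", "100m Freestyle"],
    "Year": [2012] * 5,
    "Season": ["Summer"] * 5,
    "Games": ["2012 London"] * 5,
    "Team": ["USA", "USA", "USA", "USA", "France"],
    "Medal": ["Gold", "Silver", "Gold", "Gold", None],
})

keys = ["NOC", "Sport", "Event", "Year", "Season", "Games", "Team", "Medal"]

# One row per awarded medal: team members collapse into a single medal,
# while Gold and Silver for the same NOC in the same event stay separate.
medal_events = rows.dropna(subset=["Medal"]).drop_duplicates(subset=keys)
total_medals = (medal_events.groupby(["NOC", "Year"]).size()
                .rename("TotalMedals").reset_index())

# Distinct athletes per NOC and year, from the original rows.
athletes = (rows.groupby(["NOC", "Year"])["Name"].nunique()
            .rename("distinct_athletes_no").reset_index())

summary = athletes.merge(total_medals, on=["NOC", "Year"], how="left")
summary["TotalMedals"] = summary["TotalMedals"].fillna(0).astype(int)
print(summary)
```

In this sample the two basketball golds count as one medal, so the USA total is 3 medals with 4 distinct athletes, and France is kept with 0 medals thanks to the left join.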

I will need to validate my results. I will compare the results from my query with the Medal records from Wikipedia for 2 Summer editions

### Validation: investigation of medals awarded (2016)

- Compare the results from the query with the 2016 medals table from Wikipedia (https://en.wikipedia.org/wiki/2016_Summer_Olympics#Medal_table)
- The first test validates the success of the query.
- By convention, the Olympics ranking is ordered by the number of gold medals, not by the total number of medals. This differs from my analysis. As such, the two rankings might present slight differences.

In [212]:
gold_countries_derivatives_df = pd.read_parquet(os.path.join('dataset', 'gold_countries_derivatives.parquet'))
cols = list(gold_countries_derivatives_df)

# move the column to head of list using index, pop and insert
cols.insert(0, cols.pop(cols.index('iso_names')))
gold_countries_derivatives_df = gold_countries_derivatives_df.loc[:, cols]

partecipants_df = pd.read_parquet(os.path.join('dataset', 'gold_partecipants_zero_pop.parquet'))
display(partecipants_df[['iso_names', 'TotalMedals']].loc[partecipants_df.Year == 2016].sort_values(by='TotalMedals', ascending=False).head(10).reset_index(drop=True))
# display(gold_metrics_partecipants_df.loc[gold_metrics_partecipants_df.Year == 2016].sort_values(by='TotalMedals', ascending=False).head(10).reset_index(drop=True))
print(f'Total no. of medals awarded: {partecipants_df["TotalMedals"].loc[partecipants_df.Year == 2016].sum()}')

from IPython.display import Image
Image(url= "images/2016ranking.png", width=600)

Unnamed: 0,iso_names,TotalMedals
0,United States,121
1,China,70
2,United Kingdom,67
3,Russian Federation,56
4,Germany,42
5,France,42
6,Japan,41
7,Australia,29
8,Italy,28
9,Canada,22


Total no. of medals awarded: 973


### Validation: investigation of medals awarded (2012)

- Compare the results from the query with the 2012 medals table from Wikipedia (https://en.wikipedia.org/wiki/2012_Summer_Olympics#Medal_table)
- There is a two-medal difference in the totals. This is caused by the revoking/reassignment of medals (the Wikipedia table is updated to reflect that). Two medals were left unassigned in this edition


In [213]:
display(partecipants_df[['iso_names', 'TotalMedals']].loc[partecipants_df.Year == 2012].sort_values(by='TotalMedals', ascending=False).head(10).reset_index(drop=True))
# display(gold_metrics_partecipants_df.loc[gold_metrics_partecipants_df.Year == 2016].sort_values(by='TotalMedals', ascending=False).head(10).reset_index(drop=True))
print(f'Total no. of medals awarded: {partecipants_df["TotalMedals"].loc[partecipants_df.Year == 2012].sum()}')

# from IPython.display import Image
Image(url= "images/2012ranking.png", width=600)


Unnamed: 0,iso_names,TotalMedals
0,United States,103
1,China,88
2,Russian Federation,82
3,United Kingdom,65
4,Germany,44
5,Japan,38
6,Australia,35
7,France,35
8,Italy,28
9,"Korea, Republic of",28


Total no. of medals awarded: 962


## Important observation on the SportsStats dataset
The dataset in use does not account for medal revoking/reassignments. As such, it might not accurately represent the final ranking.


For instance, in the 2012 edition Russia had 14 medals revoked, bringing the total from 82 to 68. This is a large discrepancy. I will highlight this consideration in my results: the provided rankings are the “day after” rankings, and do not account for later reassignments due to doping or other scandals.


<a id='maps'></a>

# Maps and performance

**Feel free to explore the data by hovering and using the dropdown on the left side of the figures.**




In [217]:
gold_countries_derivatives_df.drop(columns='display').loc[gold_countries_derivatives_df.Year == 2016].head(5)
recent_years = [2016, 2012, 2008]

## Worldwide view of the medals awarded per year
From the top 5 countries in 2016 we can see that it is not always the largest countries that win more medals, but the ones with more participating athletes.
This is expected, as more event entries (and more participating athletes) are granted to top teams that pass the qualification tournaments.

From the map, we can see that the host country tends to perform much better (counting on home support and a larger number of qualified athletes by virtue of its host status).
For example, China in 2008 and the United Kingdom in 2012 performed much better than usual.

Please note that the terrific performance of Russia in 2012 has largely been invalidated due to doping violations (14 medals were removed).

Top 5 and bottom 5 countries in 2016 (based on total medals):

In [239]:
# display(gold_countries_derivatives_df.drop(columns=['TotalPartecipants', 'region', 'alpha_3', 'Population', 'display', 'velocity', 'acceleration']).loc[gold_countries_derivatives_df.Year == 2016].sort_values(by='TotalMedals', ascending=False).head(10).reset_index(drop=True))
disp = gold_countries_derivatives_df[['NOC', 'iso_names','Year', 'TotalMedals' ,'TotalPartecipants', 'population_display' ]].loc[gold_countries_derivatives_df.Year == 2016].sort_values(by='TotalMedals', ascending=False).reset_index(drop=True)


display(disp)

Unnamed: 0,NOC,iso_names,Year,TotalMedals,TotalPartecipants,population_display
0,USA,United States,2016,121,321,323071755
1,CHN,China,2016,70,239,1378665000
2,GBR,United Kingdom,2016,67,224,65611593
3,RUS,Russian Federation,2016,56,202,144342397
4,FRA,France,2016,42,212,66724104
...,...,...,...,...,...,...
199,IVB,"Virgin Islands, British",2016,0,4,29355
200,KGZ,Kyrgyzstan,2016,0,17,6079500
201,KIR,Kiribati,2016,0,3,112529
202,ALB,Albania,2016,0,6,2876101


In [219]:
fig = mp.get_fig_choropleth_world(gold_countries_derivatives_df, recent_years)
fig.show()

## Countries medal trend (velocity)

A possible interpretation of the velocity field is:
> Velocity is defined as the number of added medals per year from one Olympics edition to the next

For example, the USA velocity in 2016 is 4.5, as they added 18 medals from the 2012 edition to the 2016 edition (18 medals over 4 years).
We can see that in the top 5 of this ranking there are 2 teams that were present in the top medals ranking shown above, and quite remarkably Team USA is number one in both.
Furthermore, the velocity for Team USA is almost double that of second-ranked Uzbekistan.

Smaller countries are more visible in this ranking.

Countries that performed badly in 2012 and well in 2016 will have a more positive velocity record. This effect is stronger in 2016 than in earlier years: 2016 is the endpoint of our range, so its derivative is only a first-order (one-sided) approximation, while the other years use a second-order centered derivative, which is often a more accurate metric.

On the contrary, countries with a good performance in earlier years (maybe due to hosting) and a worse recent performance are at the bottom of this ranking. Countries that earned a large number of medals (such as China and Russia) are also the ones that risk "losing" more.


In [220]:
# display(gold_countries_derivatives_df[['iso_names', 'NOC', 'Year', 'velocity']].loc[gold_countries_derivatives_df.Year == 2016].sort_values(by='velocity', ascending=False).head(10).reset_index(drop=True))
disp = gold_countries_derivatives_df[['NOC', 'iso_names','Year', 'velocity']].loc[gold_countries_derivatives_df.Year == 2016].sort_values(by='velocity', ascending=False).reset_index(drop=True)

display(disp)

Top 5:


Unnamed: 0,NOC,iso_names,Year,velocity
0,USA,United States,2016,4.5
1,UZB,Uzbekistan,2016,2.5
2,AZE,Azerbaijan,2016,2.0
3,FRA,France,2016,1.75
4,DEN,Denmark,2016,1.5


Bottom 5:


Unnamed: 0,NOC,iso_names,Year,velocity
199,AUS,Australia,2016,-1.5
200,KOR,"Korea, Republic of",2016,-1.75
201,UKR,Ukraine,2016,-2.25
202,CHN,China,2016,-4.5
203,RUS,Russian Federation,2016,-6.5


In [221]:
fig = mp.get_fig_choropleth_world(gold_countries_derivatives_df, recent_years, 'velocity')
fig.show()

## Countries quickly improving (acceleration)

Acceleration cannot be explained as easily, but it tells how quickly a country's current trend is changing (how fast its velocity changes).

We can notice once again that 2 of the teams in this top 5 ranking were already present in the top 5 medal ranking, and 3 of the teams were present in the top 5 velocity ranking.
We only have 2 new entries in this ranking.

Once again Team USA is in first place, indicating quite a remarkable moment for sport in the US.
The UK (host country in 2012) has a declining record, as expected in the years after hosting. Russia has the worst record: it follows a great performance in 2012 and is now paying for the aftermath of the doping scandals (which also partially altered its 2012 record).

In [222]:
# display(gold_countries_derivatives_df[['iso_names', 'NOC', 'Year', 'acceleration']].loc[gold_countries_derivatives_df.Year == 2016].sort_values(by='acceleration', ascending=False).head(10).reset_index(drop=True))

disp = gold_countries_derivatives_df[['NOC', 'iso_names','Year', 'acceleration']].loc[gold_countries_derivatives_df.Year == 2016].sort_values(by='acceleration', ascending=False).reset_index(drop=True)

display(disp)

Top 5:


Unnamed: 0,NOC,iso_names,Year,acceleration
0,USA,United States,2016,0.78125
1,FRA,France,2016,0.40625
2,UZB,Uzbekistan,2016,0.40625
3,CUB,Cuba,2016,0.21875
4,SUI,Switzerland,2016,0.1875


Bottom 5:


Unnamed: 0,NOC,iso_names,Year,acceleration
199,JPN,Japan,2016,-0.3125
200,HUN,Hungary,2016,-0.34375
201,IRI,"Iran, Islamic Republic of",2016,-0.4375
202,GBR,United Kingdom,2016,-0.46875
203,RUS,Russian Federation,2016,-1.125


In [223]:
fig = mp.get_fig_choropleth_world(gold_countries_derivatives_df, recent_years, 'acceleration')
fig.show()

In [224]:
gold_metrics_df = pd.read_parquet(os.path.join('dataset', 'gold_countries_year_metrics.parquet'))
gold_metrics_df['population_display'] = gold_metrics_df['Population'].map('{:,}'.format)
gold_metrics_df = gold_metrics_df.assign(display= mp.get_display_customdata(gold_metrics_df))




## Athlete per event partecipation metric

Smaller countries generally have a higher athlete/event ratio, going against my initial belief that smaller countries would bring fewer athletes to the Olympics (and have them compete in multiple events) for budget reasons.
The opposite is often true, with larger delegations (like Russia, China, USA) competing in many more events with the same athletes.

Many of the large countries consistently rely on single athletes competing in multiple events. This was initially considered a grim indicator; however, teams like USA are performing well, with a positive trend and rapidly increasing dominance.
It is therefore unlikely that this metric is a useful indicator.

In [225]:
disp = gold_metrics_df[['NOC', 'iso_names', 'AthletePerEventPartecipation']].loc[gold_metrics_df.Year == 2016].sort_values(by='AthletePerEventPartecipation', ascending=False).reset_index(drop=True)

display(disp)


Top 5:


Unnamed: 0,NOC,iso_names,AthletePerEventPartecipation
0,JOR,Jordan,1.0
1,GRN,Grenada,1.0
2,UAE,United Arab Emirates,1.0
3,KOS,,1.0
4,TJK,Tajikistan,1.0


Bottom 5:


Unnamed: 0,NOC,iso_names,AthletePerEventPartecipation
80,JAM,Jamaica,0.727273
81,NED,Netherlands,0.720365
82,RUS,Russian Federation,0.699507
83,SUI,Switzerland,0.671053
84,TTO,Trinidad and Tobago,0.651163


In [226]:
fig = mp.get_fig_choropleth_world(gold_metrics_df, recent_years, 'AthletePerEventPartecipation')
fig.show()


## Population per medal metric

The entire map is washed out by the negative record of India: despite its large population, the country is still largely unsuccessful at the Summer Olympics.
Excluding India from this visualization could help to better highlight the differences between other countries, as India could be considered an outlier with one medal per 660M people.
Smaller countries dominate the bottom part of the ranking, with Grenada having one medal per 110k people. Although this was expected, it is no less impressive or respectable.

In [227]:
disp = gold_metrics_df[['NOC', 'iso_names', 'PopulationPerMedal_thousands']].loc[gold_metrics_df.Year == 2016].sort_values(by='PopulationPerMedal_thousands', ascending=False).reset_index(drop=True)

display(disp)



Top 5:


Unnamed: 0,NOC,iso_names,PopulationPerMedal_thousands
0,IND,India,662258.625
1,NGR,Nigeria,185960.244
2,PHI,Philippines,103663.812
3,INA,Indonesia,87185.462
4,VIE,Viet Nam,46820.2175


Bottom 5:


Unnamed: 0,NOC,iso_names,PopulationPerMedal_thousands
80,DEN,Denmark,381.867333
81,JAM,Jamaica,264.203818
82,NZL,New Zealand,261.894444
83,BAH,Bahamas,188.9615
84,GRN,Grenada,110.263


In [228]:
fig = mp.get_fig_choropleth_world(gold_metrics_df, recent_years, 'PopulationPerMedal_thousands')
fig.show()




## Athlete per medal metric

Portugal was the worst performer in this category in year 2016, with only one medal earned with 90 athletes.

It is once again remarkable to see larger delegations, like USA and Russia, presenting an outstanding record of one medal for every 5 athletes competing. This means that roughly 20% of their athletes come back from the Games with a medal.

In [229]:
disp = gold_metrics_df[['NOC', 'iso_names', 'AthletePerMedal']].loc[gold_metrics_df.Year == 2016].sort_values(by='AthletePerMedal', ascending=False).reset_index(drop=True)

display(disp)


Top 5:


Unnamed: 0,NOC,iso_names,AthletePerMedal
0,POR,Portugal,90.0
1,NGR,Nigeria,71.0
2,AUT,Austria,71.0
3,IND,India,56.0
4,FIN,Finland,54.0


Bottom 5:


Unnamed: 0,NOC,iso_names,AthletePerMedal
80,RUS,Russian Federation,5.071429
81,ETH,Ethiopia,4.625
82,USA,United States,4.586777
83,PRK,"Korea, Democratic People's Republic of",4.428571
84,AZE,Azerbaijan,3.111111


In [230]:
fig = mp.get_fig_choropleth_world(gold_metrics_df, recent_years, 'AthletePerMedal')
fig.show()



## Comparison of Team Italy with similar sized countries
Comparisons with similar-sized countries are only meaningful for small and medium-sized countries: the highest-ranked countries, like USA, Russia, and China, are comparable in performance but not in size.
The analysis considers countries whose size is within $\pm$ 20% of the selected country.

The performance of Team Italy has been consistently subpar in recent years compared with the UK, France, and South Korea.
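
The $\pm$ 20% rule can be sketched as a simple filter. This is not the implementation of `mp.get_fig_choropleth_similar_size`: it assumes that "size" means population, and the country codes and population values below are placeholders rather than dataset values.

```python
def similar_size(populations, selected, tolerance=0.20):
    """Return the codes whose population is within +/- tolerance of the selected one."""
    reference = populations[selected]
    low = reference * (1 - tolerance)
    high = reference * (1 + tolerance)
    return sorted(
        code
        for code, population in populations.items()
        if code != selected and low <= population <= high
    )


# Hypothetical populations in millions (placeholders, not dataset values).
populations = {"AAA": 60.0, "BBB": 66.0, "CCC": 51.0, "DDD": 45.0, "EEE": 80.0}

# With a reference of 60.0 the accepted range is 48.0 to 72.0.
peers = similar_size(populations, "AAA")
print(peers)
```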

In [236]:
import pycountry

selected_country = "ITA"

if selected_country in gold_metrics_df.alpha_3.unique():
    fig = mp.get_fig_choropleth_similar_size(gold_metrics_df, selected_country, recent_years)
    fig.show()


## Comparison of Team USA with similar sized countries
The United States is comparable in size only with Indonesia, which has a much worse record that is hardly comparable to that of the USA.

In [238]:
import pycountry

selected_country = "USA"

if selected_country in gold_metrics_df.alpha_3.unique():
    fig = mp.get_fig_choropleth_similar_size(gold_metrics_df, selected_country, recent_years)
    fig.show()

<a id='recap'></a>

# Recap on initial hypotheses
- The hypothesis of unreliable data with missing features before WWII was validated
- The hypothesis of a M/F height difference was validated
- A difference in height (and weight) between Summer and Winter athletes was observed, with Summer athletes being on average consistently taller and heavier than Winter athletes
- Both male and female athletes are taller in recent editions
- The current data frame is based on the Olympics results only, and does not account for medal reallocations due to doping or other reasons


<a id='results'></a>

# Results and correlations
## 1. Number of athletes competing and number of medals are strongly correlated
Number of athletes competing and medals earned have a Pearson correlation coefficient r of 0.81.

In other words, larger delegations tend to earn more medals.

The p-value is 9.2e-16; as the p-value is < 0.001, there is strong confidence in the result.

The result is expected, as big delegations like USA and China generally take home many medals. However, it offers an interesting insight about the host country. Competing at home always brings the advantage of performing in front of many home supporters in familiar venues, but it also guarantees a spot in every event. As the number of athletes and medals are correlated, is hosting a "sure" way to get more medals?

NOTE: The correlation coefficient is very similar when the number of events attended is used instead of the number of athletes.
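
For reference, the Pearson coefficient can be computed directly from its definition, $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$. The sketch below is not the code used to obtain the figures above, and the delegation sizes and medal counts are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical delegation sizes and medal counts (not dataset values).
athletes = [10, 50, 120, 300, 550]
medals = [0, 3, 10, 40, 120]

r = pearson_r(athletes, medals)
print(f"r = {r:.2f}")
```

In practice, `scipy.stats.pearsonr` returns both the coefficient and the p-value in one call.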

In [None]:
fig = px.scatter(x=gold_metrics_partecipants_df.distinct_athletes_no, y=gold_metrics_partecipants_df['TotalMedals'])
fig.update_layout(title='Number of medals earned and number of distinct athletes participating',
                   yaxis_title='Number of medals',
                   xaxis_title='Number of athletes in the delegation')
fig.show()


## 2. Height and number of medals are strongly correlated, both for M and F
Women's height and number of medals awarded (on a per-event-participation basis) have a Pearson correlation coefficient r of 0.78.
This means that height is positively correlated with the number of medals earned. I was expecting a correlation, but not one this strong, especially considering that in sports like gymnastics height can be a disadvantage.
The p-value is 9.2e-16, that is, p-value < 0.001, indicating strong confidence in the result.

The result also holds for men (r=0.76, p-value=2.1e-15).


In [None]:
import numpy as np

# Drop height classes with too few data points (bottom 15% by participant count)
threshold = noc_height_df.TotalPartecipants.quantile(0.15)
print(f"Filter out height classes with {threshold:.1f} or fewer data points")
noc_height_df = noc_height_df[noc_height_df.TotalPartecipants > threshold]
# Medals earned per participant in each height class
noc_height_df = noc_height_df.assign(TotalMedals=noc_height_df.Gold + noc_height_df.Silver + noc_height_df.Bronze)
noc_height_df = noc_height_df.assign(MedalPerPartecipant=np.divide(noc_height_df.TotalMedals, noc_height_df.TotalPartecipants))


In [None]:
fig = px.scatter(x=noc_height_df.Height, y=noc_height_df['MedalPerPartecipant'], color=noc_height_df.Sex)
fig.update_layout(title='Medals per participant in each height class',
                  yaxis_title='Medals per participant',
                  xaxis_title='Height (cm)',
                  legend={'title': 'Sex:'})
fig.show()


## 3. Many countries that are performing well are also improving at an increasing pace

United States and France are not only among the most awarded teams in recent editions (position-like feature), but they are also on a positive streak over the last editions (velocity-like feature), and they are quickly increasing their dominance (acceleration-like feature).
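
One way to read the three features is as a series and its first and second differences across editions. The sketch below assumes this reading; the exact definitions used in the framework section may differ, and the medal counts are hypothetical.

```python
import pandas as pd

# Hypothetical medal counts for one NOC over four consecutive editions.
medals = pd.Series([20, 24, 30, 40], index=[2004, 2008, 2012, 2016], name="TotalMedals")

features = pd.DataFrame({
    "position": medals,                    # medals earned in each edition
    "velocity": medals.diff(),             # change with respect to the previous edition
    "acceleration": medals.diff().diff(),  # change of that change
})

print(features)
```

A country with a high position, positive velocity, and positive acceleration is both successful and improving faster from one edition to the next.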


<a id='conclusions'></a>

# Conclusions
- United States and France are currently among the most successful countries and are quickly increasing their dominance. Russia and China still hold a large number of medals, but their prospects do not look as good. Possible reasons are the doping scandal for Russia and, for China, the recent hosting of the Games, which temporarily boosted its record (more medals earned in 2008, and a declining record in the following editions)
- Comparisons with similar-sized countries are only meaningful for small and medium-sized countries: the highest-ranked countries, like USA, Russia, and China, are comparable in performance but not in size.
- The largest countries are the ones that rely on athletes competing in multiple events
- The athlete per event metric is likely not indicative of the future prospects of a country's Olympic performance
- Height is strongly positively correlated to the performance (in terms of number of medals earned)

<a id='challenges'></a>

# Artifacts from modeling and challenges
- Studies on a per-medal basis necessarily exclude countries that won no medals.
- Some countries in the dataset no longer exist (they were merged or split)
- Some teams do not have a known official population (like the Refugee Olympic Committee and Individual Olympic Athletes)
- The political situation of some regions created challenges in how to analyze, group, and report data in a neutral way
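
The first point follows from the way per-medal metrics are built: with zero medals the ratio is undefined. A minimal sketch, with a hypothetical function name and hypothetical values:

```python
def population_per_medal(population_thousands, medals):
    """Thousands of people per medal, or None when no medal was won."""
    if medals == 0:
        # The ratio is undefined, so the country drops out of per-medal rankings.
        return None
    return population_thousands / medals


# Hypothetical values: 5000 thousand people, with 4 medals and with none.
print(population_per_medal(5000.0, 4))
print(population_per_medal(5000.0, 0))
```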

<a id='next'></a>

# What to do next
- Predictive models can be developed with machine learning techniques to predict next Olympics outcomes
