# Principal Components Analysis of Olympic Athletes Performance

The goal of this notebook is to use PCA to analyze the top performances in the men's decathlon at the [2004 Summer Olympics in Athens](https://en.wikipedia.org/wiki/Athletics_at_the_2004_Summer_Olympics_%E2%80%93_Men%27s_decathlon) and at the [2004 Décastar in Talence](https://fr.wikipedia.org/wiki/D%C3%A9castar). Both events were won by Roman Šebrle.

This notebook was created by [Chloé-Agathe Azencott](http://cazencott.info), inspired by [FactoMineR](http://factominer.free.fr/factomethods/principal-components-analysis.html).

This notebook was created using
* python 3.4.3
* numpy 1.15.0
* matplotlib 2.2.2
* scikit-learn 0.19.2

You can check your version of Python by running
```python
import sys
print(sys.version)
```

and the version of any module by running
```python
import <module name>
print(<module name>.__version__)
```
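
If you would rather check several packages at once, the standard library can look up installed versions by distribution name. This is only a sketch: it assumes Python 3.8 or later (where `importlib.metadata` is available), and `installed_version` is a hypothetical helper name, not part of this notebook.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as used by pip (note "scikit-learn", not "sklearn")
for name in ["numpy", "pandas", "matplotlib", "scikit-learn"]:
    print(name, installed_version(name))
```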

## Loading the data science libraries

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## 1. Data

### Loading the data
The data are available in `data/decathlon.txt`.

* The data set consists of 41 rows and 13 columns.
* The first row is a header describing the content of the columns and the remaining rows refer to the 40 observations (athletes) considered in this dataset.
* Columns 1 to 12 are continuous variables: the first ten columns correspond to the performance of the athletes for each event of the decathlon and columns 11 and 12 correspond respectively to the rank and the points obtained.
* The last column is a categorical variable corresponding to the athletic meeting (2004 Olympic Games or 2004 Decastar).

Our goal is to use only the variables describing the athletes' performance, and to analyze their relationship to the athletes' ranks.

In [None]:
my_data = pd.read_csv('data/decathlon.txt', sep="\t")  # load data

In [None]:
print(type(my_data))  # display my_data data type

In [None]:
my_data.head(n=5)  # adjust n to view more data

In [None]:
# Let us eliminate the columns we will not use to represent athletes
reduced_data = my_data.drop(['Points', 'Rank', 'Competition'], axis=1)

# and transform the data into a numpy array
X = reduced_data.values
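
The load-and-drop steps above can be tried on a small in-memory table before touching the real file. This is a sketch under assumptions: the three rows below are made-up toy values, not actual results, and the `Athlete` column name is hypothetical (the real file may store athlete names differently).

```python
import io

import pandas as pd

# Toy tab-separated text in a similar layout (made-up values, NOT real results)
toy_text = (
    "Athlete\t100m\tLong.jump\tRank\tPoints\tCompetition\n"
    "athlete_a\t11.0\t7.5\t1\t8200\tDecastar\n"
    "athlete_b\t11.2\t7.3\t2\t8100\tDecastar\n"
    "athlete_c\t10.9\t7.6\t1\t8800\tOlympicG\n"
)

toy_data = pd.read_csv(io.StringIO(toy_text), sep="\t", index_col="Athlete")

# Keep only the event columns, then convert to a numpy array
toy_reduced = toy_data.drop(columns=["Points", "Rank", "Competition"])
toy_X = toy_reduced.to_numpy()

print(toy_reduced.columns.tolist())
print(toy_X.shape, toy_X.dtype)
```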

### Data standardization

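Remember: PCA should be applied to standardized data. The events are measured in different units (seconds, meters), so without standardization the variables with the largest variance would dominate the components. We will use scikit-learn's [preprocessing.StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).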

In [None]:
from sklearn import preprocessing

std_scale = preprocessing.StandardScaler().fit(X)
X_scaled = std_scale.transform(X)
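
To see what standardization does, here is a minimal numpy sketch that subtracts each column's mean and divides by its standard deviation. It runs on synthetic data (`X_demo` is a made-up stand-in, not the decathlon table) and uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in: 40 rows, 3 columns on very different scales
X_demo = rng.normal(loc=[11.0, 7.3, 280.0], scale=[0.3, 0.3, 12.0], size=(40, 3))

col_mean = X_demo.mean(axis=0)
col_std = X_demo.std(axis=0)  # population standard deviation (ddof=0)
X_demo_scaled = (X_demo - col_mean) / col_std

# After scaling, every column has mean 0 and standard deviation 1
print(X_demo_scaled.mean(axis=0).round(6))
print(X_demo_scaled.std(axis=0).round(6))
```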

## 2. Computing principal components

We will use scikit-learn's [decomposition.PCA](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).

In [None]:
from sklearn import decomposition

pca = decomposition.PCA(n_components=2)
pca.fit(X_scaled)
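
For intuition about what the fit computes, the principal directions can be obtained from a singular value decomposition of the standardized data. The sketch below uses plain numpy on synthetic data (`X_demo` and `Z` are illustrative names); the rows of `components` play the role of `pca.components_`, although the sign of each row may differ from scikit-learn's output.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with one deliberately correlated pair of columns
X_demo = rng.normal(size=(40, 5))
X_demo[:, 1] += 2.0 * X_demo[:, 0]

# Standardize, as in the previous section
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing singular value
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
components = Vt[:2]  # keep the first two, shape (2, 5)

# The directions are orthonormal: this product is the 2x2 identity
print((components @ components.T).round(6))
```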

## 3. Percentage of variance explained 

The `explained_variance_ratio_` attribute gives the fraction of the total variance (a number between 0 and 1) explained by each of the principal components.

In [None]:
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
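
These ratios are the eigenvalues of the covariance matrix of the standardized data, each divided by their sum. The following numpy sketch, again on synthetic data rather than the decathlon table, also computes the cumulative sum, which is a common way to judge how many components are worth keeping.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 5))
X_demo[:, 1] += 2.0 * X_demo[:, 0]
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Eigenvalues of the covariance matrix, sorted in decreasing order
cov = np.cov(Z, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]

# Fraction of the total variance carried by each component
ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(ratio)

print(ratio.round(3))
print(cumulative.round(3))
```

With `n_components=2`, scikit-learn reports only the first two of these ratios, so their sum is generally below 1.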

__Question 1:__ How much variance is explained by each of the two principal components? By both of these components? How many variables described the data initially, and how good do you think this dimensionality reduction is?

__Answer:__

## 4. Visualization 

Let us now plot each of the athletes along these two components only.

In [None]:
# project X_scaled on the two components
X_projected = pca.transform(X_scaled)

# Plot each sample, colored by 'Rank'
plt.scatter(X_projected[:, 0], X_projected[:, 1],
            c=my_data['Rank'])

plt.xlim([-5.5, 5.5])
plt.ylim([-4, 4])

plt.xlabel('PC 1', fontsize=14)
plt.ylabel('PC 2', fontsize=14)

plt.colorbar()
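
The projection used for this plot amounts to a matrix product between the centered data and the transposed components. Here is a numpy sketch on synthetic data (names such as `projected` are illustrative); it also checks that the resulting coordinates are uncorrelated, with the first one carrying the most variance.

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 4))
X_demo[:, 1] += X_demo[:, 0]
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Principal directions from the SVD of the standardized data
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
components = Vt[:2]

# Coordinates of each sample along the two components, shape (40, 2)
projected = Z @ components.T

# Covariance of the projected coordinates: off-diagonal entries are (numerically) zero
pc_cov = np.cov(projected, rowvar=False)
print(pc_cov.round(6))
```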

__Question 2:__ How does performance vary as a function of PC 1?

__Answer:__

## 5. Understanding the principal components

To better understand these principal components, let us look at their coordinates in the initial 10-dimensional space. For each of those 10 dimensions (athletic events), let us plot a point whose x-coordinate is its weight in the first PC and whose y-coordinate is its weight in the second PC.

In [None]:
plt.figure(figsize=(8, 8))

pcs = pca.components_

plt.scatter(pcs[0, :], pcs[1, :])
for i, (x, y) in enumerate(zip(pcs[0, :], pcs[1, :])):
    # label each point with the name of the corresponding event
    plt.text(x + 0.01, y + 0.01, reduced_data.columns[i], fontsize=14)

# Plot a horizontal line y=0
plt.plot([-0.7, 0.7], [0, 0], color='grey', ls='--')

# Plot a vertical line x=0
plt.plot([0, 0], [-0.7, 0.7], color='grey', ls='--')

plt.xlim([-0.7, 0.7])
plt.ylim([-0.7, 0.7])

plt.xlabel('Weight of the contribution to PC 1', fontsize=14)
plt.ylabel('Weight of the contribution to PC 2', fontsize=14)
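
The plot above uses the raw weights stored in `pca.components_`. A closely related quantity, used for correlation-circle plots, is the correlation between each original variable and each component. For standardized data it equals the weight multiplied by the square root of the component's variance. The sketch below checks this identity with numpy on synthetic data; the name `loadings` is one common convention, assumed here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 4))
X_demo[:, 1] += 1.5 * X_demo[:, 0]
X_demo[:, 3] -= 1.0 * X_demo[:, 2]
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
n_samples = Z.shape[0]

_, S, Vt = np.linalg.svd(Z, full_matrices=False)
components = Vt[:2]
pc_variance = S[:2] ** 2 / n_samples  # variance of each PC (ddof=0, matching Z)

# Correlation between variable j and component k: weight * sqrt(variance of PC k)
loadings = components * np.sqrt(pc_variance)[:, np.newaxis]

# Direct check with np.corrcoef on the projected coordinates
scores = Z @ components.T
direct = np.array([[np.corrcoef(Z[:, j], scores[:, k])[0, 1]
                    for j in range(Z.shape[1])]
                   for k in range(2)])
print(np.allclose(loadings, direct))
```

Because correlations lie between -1 and 1, every variable plotted this way falls inside the unit circle.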

__Question 3:__ Some events have a positive contribution to PC 1, and others have a negative contribution to that component. Can you figure out why, and what PC 1 represents?

__Answer:__

__Question 4:__ Can you figure out what PC 2 represents?

__Answer:__

__Question 5:__ What can you say about the performances for discus throw (`Discus`) and shot put (`Shot.put`)?

__Answer:__