# Principal Directions

This notebook provides:

* an orientation to the `sklearn` implementation of PCA;
* examples of **principal directions** in 2D.

In [None]:
# standard imports
import numpy as np
import matplotlib.pyplot as plt 

In [None]:
# get data for notebook / plotting function from file
import sys
sys.path.insert(0, "./data")
import pcadirections

data = pcadirections.data
plot_line = pcadirections.plot_line

### Synthetic data

The imports above load a list of synthetic datasets into the variable `data`. It's always a good idea to investigate your data before performing any analysis. We'll start with the first dataset (`data[0]`).

* Extract the first dataset `data[0]`, assigning it to a variable `X`.
* Create a scatterplot of this first dataset (`X`) using `matplotlib`'s scatter function `plt.scatter(<x-values>, <y-values>)`.
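As a rough sketch of these two steps: the snippet below uses a synthetic correlated point cloud as a stand-in for `data[0]` (an assumption, since the real datasets live in `pcadirections.py`) and assumes each dataset is an array of shape `(n_samples, 2)`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for data[0] (assumption): a correlated 2D Gaussian cloud.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 2], [2, 2]], size=200)

# Inspect the shape first: rows are samples, columns are the two features.
print(X.shape)

# Scatter the first column against the second.
points = plt.scatter(X[:, 0], X[:, 1], alpha=0.6)
plt.gca().set_aspect("equal")  # equal axis scaling keeps directions undistorted
```

With the real data, replace the stand-in with `X = data[0]`.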

### Scikit-learn and PCA

Scikit-learn categorises PCA as a `decomposition` algorithm (for reasons yet to be discussed). 

* Import the PCA class via the command: `from sklearn.decomposition import PCA`.

* Instantiate a PCA object via `PCA()`.
* Fit this PCA object to the data (`X`) by using the `.fit` method.
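A minimal sketch of the import, instantiate and fit steps, again on a synthetic stand-in for `X` (an assumption, not the notebook's own data):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for data[0] (assumption): a correlated 2D Gaussian cloud.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 2], [2, 2]], size=200)

pca = PCA()            # default: keep all components
fitted = pca.fit(X)    # estimates the principal directions from X

# .fit returns the estimator itself, so PCA().fit(X) can also be chained.
print(fitted is pca)
```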

The fitted `PCA` object will have a variety of attributes (take a look). By far the most important is the `components_` attribute (note the underscore `_`)<sup>1</sup>. This is a matrix whose rows correspond to the principal directions. 

* Extract the first principal direction from the object.
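One way to do this, sketched on a synthetic stand-in for `X` (an assumption), is to index the first row of `components_`. The rows are unit-length and mutually orthogonal, which the snippet checks numerically.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for data[0] (assumption): a correlated 2D Gaussian cloud.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 2], [2, 2]], size=200)

pca = PCA().fit(X)

# Rows of components_ are the principal directions, ordered by explained variance.
first_direction = pca.components_[0]
second_direction = pca.components_[1]

print(first_direction)
print(np.linalg.norm(first_direction))      # length of the first direction
print(first_direction @ second_direction)   # inner product of the two directions
```

Note that the sign of a principal direction is arbitrary: `v` and `-v` describe the same line.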

### Plot the first principal direction

* Make a scatterplot of the **original data** (`X`) as above.
* Now overlay this with the principal direction calculated from PCA. 

Note that a `plot_line` utility has been defined in the `pcadirections.py` file. Syntax:

```python
plot_line(<direction vector>)
```

See also `?plot_line`. You may wish to supply a color (e.g. `color="red"`).
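If `pcadirections.py` is not to hand, the overlay can be approximated with plain `matplotlib`. In the sketch below, `draw_direction` is a hypothetical helper written for illustration (it is not the notebook's `plot_line`, whose exact behaviour may differ), and `X` is a synthetic stand-in.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def draw_direction(direction, center, length=3.0, **kwargs):
    """Hypothetical helper: draw a segment through `center` along `direction`."""
    direction = np.asarray(direction) / np.linalg.norm(direction)
    ends = np.vstack([center - length * direction, center + length * direction])
    return plt.plot(ends[:, 0], ends[:, 1], **kwargs)[0]

# Stand-in for data[0] (assumption): a correlated 2D Gaussian cloud.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 2], [2, 2]], size=200)
pca = PCA().fit(X)

plt.scatter(X[:, 0], X[:, 1], alpha=0.4)
# Draw the first principal direction through the data mean estimated by PCA.
line = draw_direction(pca.components_[0], center=pca.mean_, color="red")
plt.gca().set_aspect("equal")
```

The segment is centred on `pca.mean_` because PCA centres the data before estimating directions.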

Note that calling `pca.fit` does **not** transform the dataset `X`, nor does it calculate the scores of the data along the principal directions. It only estimates the principal directions themselves. We will look below at how the principal directions vary as the data changes.
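This can be checked directly. The sketch below (on a synthetic stand-in for `X`, which is an assumption) confirms that `X` is unchanged after fitting, and shows that the scores come from the separate `.transform` method, which for the default settings amounts to centring the data and projecting it onto the rows of `components_`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for data[0] (assumption): a correlated 2D Gaussian cloud.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 2], [2, 2]], size=200)
X_before = X.copy()

pca = PCA().fit(X)
unchanged = np.array_equal(X, X_before)   # fitting leaves X untouched
print(unchanged)

# Scores are only computed when transform is called explicitly.
scores = pca.transform(X)

# Equivalent by hand (default whiten=False): centre, then project.
manual_scores = (X - pca.mean_) @ pca.components_.T
print(np.allclose(scores, manual_scores))
```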

### Investigate the principal direction of other datasets

Seven other datasets are included in the `data` variable:

1. `data[1]`, `data[2]` are variations of the above dataset,
2. `data[3]`, `data[4]`, `data[5]`, `data[6]`, `data[7]` are datasets which do not quite fit the picture (or 'cartoon') of fitting principal directions as discussed in the slides so far.

For all these datasets:

* Repeat the previous exercise for each dataset in turn. 
* How does the situation differ from the examples seen so far in the slides? How does the principal direction vary in each case?
* You can choose to perform this either by copy-pasting your code, or defining a function that wraps these operations.
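For the function route, one possible shape is sketched below. The name `first_principal_direction` and the two stand-in datasets are illustrative assumptions, and plotting is left out to keep the sketch short.

```python
import numpy as np
from sklearn.decomposition import PCA

def first_principal_direction(X):
    """Fit PCA to X and return the first principal direction (a unit vector)."""
    return PCA().fit(X).components_[0]

# Stand-in datasets (assumption): a correlated cloud, and the same cloud
# with a single far-away point appended.
rng = np.random.default_rng(0)
base = rng.multivariate_normal(mean=[0, 0], cov=[[3, 2], [2, 2]], size=200)
with_outlier = np.vstack([base, [[-15.0, 15.0]]])

directions = {
    name: first_principal_direction(X)
    for name, X in {"base": base, "with outlier": with_outlier}.items()
}
for name, direction in directions.items():
    print(name, direction)
```

With the real data, the same function can be applied in a loop, e.g. `for X in data[1:]`.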

In [None]:
# Dataset 1
# Outlier
X = data[1]



In [None]:
# Dataset 2
# Low correlation

X = data[2]



In [None]:
# Dataset 3
# X-shape: PCA does not capture either stalk
X = data[3]



In [None]:
# Dataset 4
# Two clusters

X = data[4]



In [None]:
# Dataset 5
# Four clusters

X = data[5]



In [None]:
# Dataset 6
# Nonlinear manifold (quadratic)

X = data[6]



In [None]:
# Dataset 7
# Approximately zero correlation

X = data[7]



------------

### Footnotes

<sup>1</sup> The trailing underscore of attributes of `sklearn` objects indicates attributes which have been *estimated*. These are likely the attributes that are of most interest, and the underscore is a form of emphasis. This is documented in the [Developer guide](https://scikit-learn.org/stable/developers/develop.html#estimated-attributes).
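A small illustration of this convention, using a synthetic stand-in dataset (an assumption): the estimated attribute `components_` is absent on a freshly constructed `PCA` object and only appears once `.fit` has been called.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in dataset (assumption): a correlated 2D Gaussian cloud.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 2], [2, 2]], size=200)

pca = PCA()
before_fit = hasattr(pca, "components_")   # not yet estimated

pca.fit(X)
after_fit = hasattr(pca, "components_")    # estimated from the data

print(before_fit, after_fit)
```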