# *Wstęp do uczenia maszynowego* (Introduction to Machine Learning) - Notebook 13, version for students
**Authors: Michał Ciach, Dorota Celińska-Kopczyńska**  



## Description

In this notebook, we will learn to analyze the properties of Principal Component Analysis in more detail. We will also learn another common dimensionality reduction technique: t-SNE.  



In [None]:
!pip install gdown
!gdown https://drive.google.com/uc?id=11we5UonQD42tM8uJ8bY8IaBshRy0iMY8
!gdown https://drive.google.com/uc?id=1I_cqB4z3Cap2S5kVuFJr5p3CclpTpzTd



## Data & library imports

In [None]:
import pandas as pd
import numpy as np
import numpy.random as rd
import plotly.express as px
from sklearn.decomposition import PCA
from scipy.stats import norm
import plotly.graph_objects as go
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

In [None]:
pca_geometry = pd.read_csv('13. pca_geometry.tsv', sep='\t')
pca_geometry

Unnamed: 0,X,Y,Z,Dataset
0,-0.635932,1.376577,1.948057,1
1,-1.782854,-3.883091,-2.302636,1
2,0.800412,0.285949,1.674842,1
3,-0.952556,-3.159498,-1.785776,1
4,0.306875,-1.378390,-0.095597,1
...,...,...,...,...
1195,-0.952292,0.261559,-1.125041,4
1196,0.981582,-0.027452,1.442443,4
1197,0.082051,0.401192,-1.460115,4
1198,-0.254978,1.002278,2.658149,4


In [None]:
from sklearn import datasets
iris_raw = datasets.load_iris()
iris = pd.DataFrame(iris_raw.data, columns=iris_raw.feature_names)
iris['Species'] = [iris_raw.target_names[x] for x in iris_raw.target]
iris

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [None]:
authors = pd.read_csv('13. authors.csv', sep=';')
# Check the descriptive statistics; print explicitly, since a notebook cell
# only displays the value of its last expression
print(authors.describe())

authors_ids = authors['id']
authors.drop('id', axis=1, inplace=True)
sc = StandardScaler(with_mean=True)
authors[:] = sc.fit_transform(authors)
# Note: authors[:] = ... maintains the dataframe (retains the column names),
# authors = ... would result in numpy ndarray
authors.index = authors_ids

## Warmup: A basic application of Principal Component Analysis

We will start with recalling the basics of PCA. We will use it to analyze the differences between researchers in economics.

The `authors` dataset, loaded in the *Data & library imports* section, describes the academic performance of economists (identified by `id`):
- wp, wp_pub -- the number of working papers and working papers published,
- art -- the number of journal papers published,
- soft -- amount of software released,
- chap, book, eds -- the number of chapters in books, the number of books published, and the number of books for which the given economist served on the editorial board,
- X_citations -- the sum of citations for X, e.g., wp_citations means the number of citations one got for their working papers,
- X_nocit -- the number of X with no citations, e.g., wp_nocit means the number of working papers that were not cited by anyone,
- hindex -- an author-level metric that measures both the productivity and the citation impact of publications, initially used for an individual scientist or scholar. An h-index of x means that a given economist has x papers that were each cited at least x times, and that x is the largest such number.
- affiliations -- the number of companies the economist is related to,
- neps -- NEP is an announcement service which filters information on new additions to RePEc (a database on economic papers) into edited reports. Variable contains the number of services that contained information on the papers of the given economist.

The data frame has been pre-processed - rows with missing values were removed, and the numerical variables were scaled with `StandardScaler` from `sklearn`.
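As a quick illustration of what this pre-processing does, the sketch below standardizes a small made-up table and checks the result. The column names and values are hypothetical and are not taken from `authors`. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy table standing in for the raw (unscaled) data
toy = pd.DataFrame({"wp": [3.0, 10.0, 0.0, 7.0], "art": [1.0, 4.0, 2.0, 9.0]})

scaled = toy.copy()
scaled[:] = StandardScaler().fit_transform(toy)  # keeps column names and index

col_means = scaled.mean(axis=0).to_numpy()
# Population standard deviation (ddof=0), which is what the scaler uses
col_stds = scaled.std(axis=0, ddof=0).to_numpy()
print(col_means, col_stds)
```

After scaling, every column should have a mean of (numerically) zero and a standard deviation of one.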


**Exercise 0.** Quick questions:
  1. Does scaling the variables influence PCA? Does centering the variables influence it?
  2. Does the `authors` data frame contain categorical variables? If it does, do we need to remove them before doing a PCA?  
  3. Let's suppose we use the `PCA` object from `sklearn` with `n_components = 20` to get the first 20 principal components. For example, we run the following code:   
    ```
    PCA1 = PCA(n_components=20)  
    PCA1scores = PCA1.fit_transform(X)
    ```  
  Now, we use a `PCA` object on the same data, but with `n_components = 3`. For example, we run the following code:  
    ```
    PCA2 = PCA(n_components=3)  
    PCA2scores = PCA2.fit_transform(X)
    ```
  Are the following equalities true? Why/Why not?   
    ```
    PCA1scores[:, :3] == PCA2scores  
    PCA1.components_[:3, :] == PCA2.components_
    ```
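One way to explore the last question empirically is sketched below on made-up data (`X_demo` is a hypothetical stand-in for `X`). Two choices here are assumptions of the sketch: `svd_solver="full"` is set explicitly so that both fits use the same exact solver, and `np.allclose` replaces `==`, because exact equality of floating-point arrays is fragile. With a randomized solver the outcome may differ slightly.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Made-up correlated data: 200 observations, 25 variables
X_demo = rng.normal(size=(200, 25)) @ rng.normal(size=(25, 25))

pca_20 = PCA(n_components=20, svd_solver="full")
scores_20 = pca_20.fit_transform(X_demo)

pca_3 = PCA(n_components=3, svd_solver="full")
scores_3 = pca_3.fit_transform(X_demo)

# Compare the first three components of the larger model with the smaller model
same_scores = np.allclose(scores_20[:, :3], scores_3)
same_weights = np.allclose(pca_20.components_[:3, :], pca_3.components_)
print(same_scores, same_weights)
```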

**Exercise 1.** In this task, we will use PCA to reduce the dimensionality of the `authors` data set. An example application is a lossy compression of the data set.

1. Perform a principal component analysis of the authors. Use the `.fit()` method of the `PCA` object from `sklearn` (`n_components = 20`).

Now, we will select the optimal number of principal components to retain - that is, the smallest number of components that give a sufficient approximation of the data. To this end, we will compare three heuristic approaches:

2. Get the proportion of variance explained by each principal component from the `explained_variance_ratio_` attribute of the `PCA` object. Create a bar plot showing the percentage of variance explained by each principal component. Calculate the cumulative sum of these variances. Decide how many PCs should be retained to explain at least 70%, 80% or 90% of the total variance.
3. Get the eigenvalues of the covariance matrix associated with each of the principal components, i.e. the explained variances, from the `explained_variance_` attribute of the `PCA` object. How many PCs should be retained according to the "eigenvalue over 1" heuristic?
4. Draw the *scree plot*: a line plot in which the X axis represents the index of a given PC, i.e. from 1 to n, and the Y axis represents the eigenvalue associated with that PC. This shows the same information as the bar plot in Point 2 (the eigenvalues are proportional to the explained variance ratios), just with a line instead of bars. How many components should be retained based on the "scree plot heuristic" (the point at which adding new PCs stops giving much improvement in the amount of explained variance)?
5. Decide how many PCs you would retain in the dataset. Justify your choice.

Now, we will use the selected PCs to compress the data. Let $n$ be the number of PCs you have selected.   

6. Create a new PCA object with the desired `n_components`. Fit it and transform the `authors` data set in one step using the `.fit_transform()` method of the `PCA` object (a new object has to be fitted before `.transform()` can be used). The transformed data set is often called the *score matrix*. This is your compressed data.

Now, we will decompress our data and compare it to the original.   

7. Use the `inverse_transform()` method of the `PCA` object to transform the score matrix back to the original space. Compare the obtained data frame with the original `authors` data frame (any way you like, it doesn't need to be a very formal comparison - you can just look at the data frames).

Note that this kind of compression loses information. The amount of information lost depends on the number of PCs you take. The cumulative explained variance ratio can be interpreted as the amount of information retained in the compression.  
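The workflow described in this exercise can be sketched on synthetic data as follows. Everything in this block is an illustrative assumption: `X_demo` is a made-up stand-in for the scaled `authors` table, and the 90% threshold is just one of the values mentioned above. The last lines check the interpretation of the cumulative explained variance ratio as the share of information retained.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Made-up data: 6 observed variables driven by 2 latent factors plus noise
latent = rng.normal(size=(200, 2))
X_demo = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)  # center and scale

pca_full = PCA().fit(X_demo)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of PCs whose cumulative variance ratio reaches 90%
n_keep = int(np.searchsorted(cum_var, 0.90) + 1)
# "Eigenvalue over 1" heuristic
n_kaiser = int(np.sum(pca_full.explained_variance_ > 1))

# Compress and decompress with the selected number of PCs
pca_small = PCA(n_components=n_keep)
scores = pca_small.fit_transform(X_demo)      # compressed data (score matrix)
X_back = pca_small.inverse_transform(scores)  # decompressed data

# Share of the total sum of squares lost in the round trip
rel_error = np.sum((X_demo - X_back) ** 2) / np.sum(X_demo ** 2)
print(n_keep, n_kaiser, rel_error, 1 - cum_var[n_keep - 1])
```

For centered data, the relative reconstruction error printed above should match one minus the cumulative explained variance ratio of the retained components.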



In [None]:
# Put your code here


Variance ratios:
[0.34 0.09 0.06 0.06 0.05 0.05 0.05 0.05 0.04 0.04 0.04 0.03 0.03 0.02
 0.02 0.01 0.01 0.01 0.   0.  ]


Cumulative variance:
[0.34 0.43 0.5  0.56 0.61 0.66 0.71 0.76 0.8  0.84 0.88 0.91 0.94 0.96
 0.98 0.98 0.99 1.   1.   1.  ]
Eigenvalues:
[6.85 1.84 1.27 1.2  1.05 1.01 0.99 0.94 0.89 0.8  0.78 0.61 0.54 0.42
 0.32 0.17 0.13 0.11 0.08 0.01]


Decompressed:
      wp  wp_pub    art   soft   chap   book    eds  wp_citations  wp_nocit  \
0  0.116   0.170  0.328 -0.312 -0.361 -0.339 -0.187         0.190    -0.162   
1 -0.243  -0.195  0.139 -0.161 -0.244 -0.094  2.696        -0.093    -0.140   
2 -0.551  -0.607 -0.701 -0.241 -0.365 -0.352 -0.234        -0.316    -0.154   
3  1.021   1.058  1.372  0.084  4.710  3.105 -0.147         1.995    -0.042   
4  0.574   0.661  0.528 -0.230 -0.013 -0.180 -0.246         0.462    -0.143   

   art_citations  art_nocit  soft_citations  soft_nocit  chap_citations  \
0          0.490      0.221          -0.024      -0.037          -0.020   
1          0.233     -0.158          -0.022      -0.041           0.906   
2         -0.365     -0.578          -0.029      -0.040          -0.120   
3          1.771      0.728           0.033      -0.157           3.861   
4          0.153      0.681          -0.017      -0.052          -0.186   

   chap_nocit  book_citations  book_nocit  hindex  affiliati

### Using PCA for finding outliers
**Exercise 2.** The `authors` dataset is a real-life dataset obtained by web-scraping RePEc, so some outliers may occur (in fact, performance-related and citation-related distributions are typically skewed). Such observations are often removed from the data set if they are not representative of the data.  
To understand why, we will analyze their influence on the principal components.

1. Create a scatter plot showing the first two principal components.  
Label the points with ids of authors (in hover boxes).  
2. Does the scatter plot suggest that there are some outlying observations?  
3. Try to find the variables for which the weights in the first and second principal components are the largest, i.e. figure out what you can say about authors in different parts of the scatter plot.
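A minimal sketch of how weights can be ranked, and of how a single outlier can affect them, is given below. The data, the variable names `a` to `d`, and the artificial outlier are all hypothetical; the block is not an analysis of `authors`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
cols = ["a", "b", "c", "d"]  # hypothetical variable names
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=cols)
X["b"] = X["a"] + 0.3 * X["b"]    # make "a" and "b" strongly correlated
X.loc[0] = [0.0, 0.0, 25.0, 0.0]  # one artificial outlier along "c"

# Weights with the outlier included: rows = variables, columns = PCs
pca = PCA(n_components=2).fit(X)
weights = pd.DataFrame(pca.components_.T, index=cols, columns=["PC1", "PC2"])
top_pc1 = weights["PC1"].abs().sort_values(ascending=False)

# Weights after removing the outlier
pca_clean = PCA(n_components=2).fit(X.iloc[1:])
weights_clean = pd.DataFrame(pca_clean.components_.T, index=cols, columns=["PC1", "PC2"])
top_pc1_clean = weights_clean["PC1"].abs().sort_values(ascending=False)

print(top_pc1.index[0], top_pc1_clean.index[0])
```

In this toy setting, the single extreme observation is enough to make the first principal component point along `c`, while without it the first component follows the correlated pair `a` and `b`.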

In [None]:
# Put your code here


[[ 0.33574462  0.34529543  0.32299144  0.04536044  0.2399207   0.21010573
   0.1436199   0.31183639  0.05627501  0.19973833  0.33581014  0.01605841
   0.01177471  0.1260641   0.22668656  0.12922353  0.11229232  0.34391131
   0.08618456  0.25213759]
 [-0.21063341 -0.22310277 -0.00943486 -0.16601728  0.40659893  0.38782002
   0.0431966  -0.08753911 -0.05710954  0.24443115 -0.26373576  0.01414236
  -0.06642281  0.36584156  0.28186931  0.26119581  0.06775093 -0.03671201
  -0.15327219 -0.32377663]]


**Exercise 3.** In the data frame `pca_geometry`, loaded in the *Data & library imports* section, you have four three-dimensional data sets of 300 points each. In this exercise, we will use PCA to analyze their geometry.  

1. Select the first data set from `pca_geometry`. Perform PCA and calculate the principal components, i.e. the transformed data.   
  1.1. *Quick question.* What is the maximum number of principal components that you can calculate for this data set?     
  1.2. Do you need to remove the `Dataset` column from the data frame before doing the PCA? Can this column influence the results?   
2. Visualize the principal components on a scatter plot. You may use either `px.scatter` to visualize a given pair of the principal components, or `px.scatter_matrix` to visualize all pairs.   
  2.1. Based on the scatter plot, try to guess the topology of the data set. For example, how many clusters does it have?   
  2.2. Try to guess the geometry of the clusters. For example, are they spherical, ellipsoidal, or other?   
  2.3. Can you answer the above questions using only a scatter plot of the first two principal components?  
  2.4. Do you notice any outliers? If so, what happens to the principal components if you remove them?
3. Check the vectors of weights corresponding to the principal components.  
  3.1. Based on the scatter plot and the weights, try to guess the three-dimensional distribution of each cluster. For example, what is the approximate location of each cluster? Is it a standard Gaussian distribution? Is it a Gaussian distribution with a non-trivial correlation matrix? Is it a combination of Gaussian distributions?  
4. Verify your conclusions by visualizing the data set on a three-dimensional scatter plot. You can use the `px.scatter_3d` function.  
5. Repeat the above points for the remaining data sets.
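The effect of a constant label column on the weights can be sketched on made-up data. The frame `demo` below is a hypothetical stand-in for one subset of `pca_geometry`, with an axis-aligned Gaussian cloud in place of the real points.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Made-up 3-D cloud with standard deviations 3, 1 and 0.3 along X, Y and Z
demo = pd.DataFrame(rng.normal(size=(300, 3)) * [3.0, 1.0, 0.3],
                    columns=["X", "Y", "Z"])
demo["Dataset"] = 1  # constant label column, as within a single subset

pca = PCA(n_components=3).fit(demo)

# Weights assigned to the constant column by each of the three PCs
label_weights = pca.components_[:, 3]
# Upper bound on the number of PCs for the three coordinate columns
max_components = min(demo[["X", "Y", "Z"]].shape)
print(np.round(pca.components_, 3), max_components)
```

A column with zero variance contributes nothing after centering, so its weights come out as zero, which agrees with the last column of the "Direction of the components" output shown for this exercise.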

In [None]:
# Put your code here


Direction of the components:
[[ 0.35873976  0.35104928  0.86491051  0.        ]
 [ 0.60692891  0.6162562  -0.50186212  0.        ]
 [-0.7091848   0.70497709  0.00801365 -0.        ]]


**Exercise 4.** If we do a PCA on a randomly selected sample, then our weights $w_{ij}$ are random variables. These random variables estimate the true weights of the principal components of the original, full data set. In this exercise, we will analyze the bias and variance of the weights estimated from a random sample of points from a 2-dimensional Gaussian distribution. First, however, we need to recall a bit of probability theory.   

A column vector of random variables $Z = (Z_1, \dots, Z_d)^T$ is said to have a standard $d$-dimensional Gaussian distribution if $Z_i \sim_{\text{iid}} \mathcal{N}(0, 1)$ ("iid" means "independent, identically distributed", in this case for each $i$). Let $\mu \in \mathbb{R}^d$ and let $A \in \mathbb{R}^{d \times d}$. Then, a vector $X = \mu + A Z$ has a $d$-dimensional Gaussian distribution with a location vector $\mu$ and a covariance matrix $\Sigma = AA^T$ (here, $\Sigma$ simply denotes the covariance matrix and is unrelated to the summation symbol). We denote this as $X \sim \mathcal{N}(\mu, \Sigma)$.

If we want to simulate centered and scaled 2-dimensional Gaussian random variables $X = (X_1, X_2)^T$ with $\text{cor}(X_1, X_2) = \rho$ (centered and scaled meaning that $\mathbb{E}X_1 = \mathbb{E}X_2 = 0$ and $\mathbb{V}X_1 = \mathbb{V}X_2 = 1$), we can first simulate a vector of two independent standard Gaussian variables $(Z_1, Z_2)^T$ (e.g. using `scipy.stats.norm.rvs`), and then perform a matrix multiplication $X = AZ$ with a matrix $A$ given by

$$ A = \left( \begin{array}{cc} 1 & 0 \\ \rho & \sqrt{1-\rho^2} \end{array} \right) $$
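The construction above can be checked numerically. The sketch below uses an arbitrary example value of $\rho$ and a large sample size, both chosen only for illustration. Since the simulated vectors are stored as rows, the product $X = AZ$ becomes `Z @ A.T`.

```python
import numpy as np
from scipy.stats import norm

rho = 0.6  # example value, chosen for illustration
A = np.array([[1.0, 0.0],
              [rho, np.sqrt(1.0 - rho ** 2)]])

# Each row is one draw from the 2-dimensional standard Gaussian distribution
Z = norm.rvs(size=(20000, 2), random_state=0)
X = Z @ A.T  # row-vector form of X = A Z

sigma = A @ A.T  # theoretical covariance matrix
sample_corr = np.corrcoef(X, rowvar=False)[0, 1]
print(sigma)
print(sample_corr)
```

The matrix `sigma` should have ones on the diagonal and $\rho$ off the diagonal, and the sample correlation should be close to $\rho$.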

1. Try to guess the weights of the principal components of the random vector $X$ based on the shape of its distribution.  
2. Derive the true weights of the principal components of $X$ from the eigendecomposition of its covariance matrix $\Sigma$.
3. Repeat the following $R=1000$ times:  
  3.1. Simulate $N=100$ points from a 2-dimensional centered and scaled Gaussian distribution with a selected value of $\rho$. *Hint:* You can use `norm.rvs(size=(100, 2))` (with `norm` imported from `scipy.stats`) to simulate 100 row vectors from a 2-dimensional standard Gaussian distribution, and `np.dot` for a matrix multiplication. With row vectors, the multiplication takes the form $ZA^T$.   
  3.2. Do a PCA on the sample. Save the weights of the first principal component (e.g. in a list).  
4. Calculate the coordinate-wise average of the estimated weights (i.e. the average of $w_1$ and the average of $w_2$). Are they similar to the theoretical ones? Why/why not?    
5. Calculate the coordinate-wise average of the absolute values of the estimated weights. Are they similar to the theoretical ones? Why/why not?     
6. Plot the estimated vectors of weights on a scatter plot. Do they fall on any particular shape? Do they appear to have any particular distribution?   
  6.1. Is the coordinate-wise average of the absolute values of the estimated weights a good measure of their "average location" in this case?  
7. Is the coordinate-wise standard deviation of the weights a good measure of the uncertainty of estimation in this case? Why/why not?  
8. For each vector of estimated weights, calculate the sine of the angle between it and the vector $(1, 1)$. For this point, use a positive value of $\rho$. *Hint:* for a 2-dimensional unit vector $w$, the sine of the angle between $w$ and the vector $(1, 1)$ is equal, up to sign, to the scalar product $w \cdot (1/\sqrt{2},\ -1/\sqrt{2})$.  
  8.1. Plot the sines on a histogram and analyze it. What can you conclude about the properties of the estimator?  
9. Suppose the first principal component is estimated accurately. What can you say about the uncertainty of the estimation of the second principal component?  
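The simulation loop from points 3 and 8 might be sketched as follows. The value $\rho = 0.5$ and the seed are arbitrary. The sign alignment step is an extra step introduced in this sketch and is not part of the exercise text: the sign of a principal axis is arbitrary, and the convention used for it may differ between library versions, so the raw averages from point 4 are deliberately not interpreted here.

```python
import numpy as np
from scipy.stats import norm
from sklearn.decomposition import PCA

rho, N, R = 0.5, 100, 1000  # rho is an arbitrary positive example value
A = np.array([[1.0, 0.0], [rho, np.sqrt(1.0 - rho ** 2)]])
rng = np.random.default_rng(0)

weights = np.empty((R, 2))
for r in range(R):
    Z = norm.rvs(size=(N, 2), random_state=rng)
    X = Z @ A.T  # correlated sample, one observation per row
    weights[r] = PCA(n_components=1).fit(X).components_[0]

# Flip each estimated vector so that it points towards (1, 1)
signs = np.where(weights @ np.array([1.0, 1.0]) >= 0, 1.0, -1.0)
aligned = weights * signs[:, None]
aligned_mean = aligned.mean(axis=0)

# Sine of the angle to (1, 1): projection on the orthogonal unit vector
sines = aligned @ np.array([1.0, -1.0]) / np.sqrt(2.0)
print(aligned_mean, sines.mean(), sines.std())
```

The array `sines` can then be passed to a histogram function for point 8.1.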

In [None]:
# Put your code here


True direction of maximum variance: [0.7071067811865475, 0.7071067811865475]
Mean value of the first component weights: [0.02501946 0.02303954]
Mean absolute value of the first component weights: [0.70802596 0.70197575]







**Exercise 5.\*** Write a function that takes an array of numerical values, e.g. a data frame or a numpy array, performs the principal component analysis based on the correlation matrix, and returns a matrix of **loadings**. Use this matrix to calculate the explained variance ratio and the principal components on the `iris` data set. Compare your results to the `sklearn` implementation.     
**Hint:** You can perform the eigenvalue decomposition using `numpy.linalg.eig`. Keep in mind the ordering of the eigenvectors: `numpy.linalg.eig` does not necessarily return them sorted by eigenvalue.  
Remember that PCA assumes that the data is centered.  
If you find that you need to center the data yourself, you may use `np.mean(X, axis=0)` to get a vector with mean values of the columns of the matrix `X`.  
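One possible shape of such a function is sketched below. It rests on several assumptions: loadings are taken to be the eigenvectors scaled by the square roots of their eigenvalues (one common convention), `numpy.linalg.eigh` is swapped in for `numpy.linalg.eig` because the correlation matrix is symmetric, and the name `pca_loadings` is made up for this example. The comparison with `sklearn` is done on standardized data and ignores the signs of the components.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_loadings(X):
    """Return (loadings, eigenvalues) of a PCA based on the correlation matrix."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    eigvals = np.clip(eigvals, 0.0, None)    # guard against tiny negative values
    return eigvecs * np.sqrt(eigvals), eigvals

X_iris = datasets.load_iris().data
loadings, eigvals = pca_loadings(X_iris)

# Explained variance ratio: squared loadings summed per PC, divided by the number of variables
ratio = (loadings ** 2).sum(axis=0) / X_iris.shape[1]

# Principal components (scores): standardized data projected on the eigenvectors
Z_iris = (X_iris - np.mean(X_iris, axis=0)) / np.std(X_iris, axis=0)
scores = Z_iris @ (loadings / np.sqrt(eigvals))

# Reference: sklearn PCA on standardized data
sk = PCA().fit(StandardScaler().fit_transform(X_iris))
sk_scores = sk.transform(StandardScaler().fit_transform(X_iris))
print(ratio, sk.explained_variance_ratio_)
```

Because this sketch works with the correlation matrix, its explained variance ratios should be compared with `sklearn` results on scaled data, not with a PCA of the raw `iris` measurements.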


In [None]:
# Put your code here


[0.92461872 0.05306648 0.01710261 0.00521218]
[[ 0.36138659 -0.08452251  0.85667061  0.3582892 ]
 [-0.65658877 -0.73016143  0.17337266  0.07548102]
 [-0.58202985  0.59791083  0.07623608  0.54583143]
 [ 0.31548719 -0.3197231  -0.47983899  0.75365743]]


## t-SNE

**Exercise 6.** So far, in order to visualize multi-dimensional data on two-dimensional plots, we were using PCA. In this exercise, we will learn how to do such visualizations using another technique, called t-SNE. We will also analyze the impact of one of the most important parameters of t-SNE: *perplexity*, which can be roughly interpreted as the effective number of neighbors that each point takes into account, and is therefore related to the density of clusters (in the untransformed, high-dimensional space).

1. Create a list of the following perplexity values: `[5, 6, 10, 11, 30, 31, 60, 61]`.  
2. Create an empty data frame `iris_tsne_results` to store the embedded observations (the results of t-SNE).  
3. For each perplexity value, perform a t-SNE transformation (by setting `TSNE(perplexity=p)`). Cast the results to a data frame, add a column called `perplexity` with the perplexity value and a column with the species name. Append the data frame to `iris_tsne_results` (e.g. with `pd.concat`, since `DataFrame.append` is not available in recent versions of `pandas`).  
4. Create a table of scatter plots with the embedded data, one scatter plot for each perplexity value (using `fig = px.scatter(..., facet_col='perplexity', facet_col_wrap=2)`). Color the points with the corresponding species.  
5. Compare the results for different perplexities.  
  5.1. Are the plots for similar perplexities also similar, or different?  
  5.2. What is the impact of perplexity on cluster sizes in the 2-dimensional space?  
  5.3. What is the impact of perplexity on the distance between clusters in the 2-dimensional space?    
  5.4. In your opinion, what value of perplexity gave the best results? Do you suspect that this value works for all data sets, or only for `iris`?   
  5.5. Do you suspect that the optimal perplexity value is stable, or may be different if you slightly modify the data and/or run the transformation again?    
  5.6. Verify your conclusions by allowing each scatter plot to have its own scale on the X and Y axis. You can do that by setting `fig.update_yaxes(matches=None)` and `fig.update_xaxes(matches=None)`.
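The loop from points 2 and 3 could look roughly like the sketch below. To keep it short, only two perplexity values are used instead of the full list, `random_state=0` is fixed for repeatability, and the column names `tsne_1` and `tsne_2` are made up. The per-perplexity frames are collected in a list and combined once with `pd.concat`.

```python
import pandas as pd
from sklearn import datasets
from sklearn.manifold import TSNE

iris_raw = datasets.load_iris()
X_iris = iris_raw.data
species = [iris_raw.target_names[t] for t in iris_raw.target]

frames = []
for p in [5, 30]:  # shortened list of perplexities, for illustration only
    embedding = TSNE(n_components=2, perplexity=p, random_state=0).fit_transform(X_iris)
    frame = pd.DataFrame(embedding, columns=["tsne_1", "tsne_2"])
    frame["perplexity"] = p
    frame["Species"] = species
    frames.append(frame)

# One long data frame with all embeddings stacked on top of each other
iris_tsne_results = pd.concat(frames, ignore_index=True)
print(iris_tsne_results.groupby("perplexity").size())
```

The resulting frame has one row per observation and perplexity value, which is the layout expected by `px.scatter(..., facet_col='perplexity', facet_col_wrap=2)`.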



In [None]:
# Put your code here


**Exercise 7.** See if you can use t-SNE to guess the shapes of the four data sets from the `pca_geometry` table. Check different perplexity values and answer the following questions:  

1. Is a single value optimal for all data sets?  
2. Can you always find a good perplexity value that allows you to guess the shape of the data set? Can you at least always find a perplexity value that correctly shows all the clusters in the data?   
3. Does t-SNE accurately reflect the shapes of the data sets? How does the perplexity value influence the accuracy of shape?  
4. What happens with outlying observations? Can you use t-SNE to detect outlying observations, or is PCA better suited for this task?   
5. How many potential drawbacks of wrong perplexity values can you come up with?  


In [None]:
# Put your code here


<center><img src='https://drive.google.com/uc?export=view&id=12CrUdXDAiltLBT26sG7HZ_HciIhvGyT8'></center>