### **Variance as Information:**
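- **Variance** measures how far data points spread around their mean: it is the average squared deviation from the mean. A feature with high variance takes widely differing values across samples, so it carries a lot of variability, which PCA treats as information. A feature with near-zero variance is almost constant and tells you little.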
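
As a rough illustration of the definition, the sketch below computes the sample variance by hand for two made-up features. The arrays `narrow` and `wide` and the helper `sample_variance` are hypothetical names chosen for this example only.

```python
import numpy as np

# Hypothetical toy features: one tightly clustered, one spread out
narrow = np.array([4.9, 5.0, 5.1, 5.0, 5.0])
wide = np.array([1.0, 9.0, 3.0, 7.0, 5.0])

def sample_variance(x):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x = np.asarray(x, dtype=float)
    return ((x - x.mean()) ** 2).sum() / (len(x) - 1)

var_narrow = sample_variance(narrow)
var_wide = sample_variance(wide)

# The spread-out feature has a far larger variance than the clustered one
print(var_narrow, var_wide)
```

The `n - 1` denominator matches what `np.var(x, ddof=1)` and `np.cov` use by default for samples.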

### **PCA and Variance:**
- **PCA** works on the assumption that **higher variance = more important information**. It finds the directions (new axes) in your data along which the variance is highest. These directions are the **principal components**.

### **First Principal Component:**
- The **first principal component** is the direction where the data has the maximum variance. Think of it as the line through the data that captures the most "spread." Since this direction has the most variance, it contains the most information about how your data varies.
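
One way to see the "maximum variance" claim is to project the data onto many candidate directions and compare the variance along each with the variance along the top eigenvector of the covariance matrix. This is a minimal sketch on hypothetical random 2-D data, using `np.linalg.eigh` for the eigenvectors; none of the variable names come from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D data, stretched so that one direction dominates
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])
X = X - X.mean(axis=0)

cov = np.cov(X, rowvar=False)
evals, evecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
first_pc = evecs[:, -1]                 # eigenvector of the largest eigenvalue
var_along_pc = np.var(X @ first_pc, ddof=1)

# Unit vectors covering half a circle (a direction and its negative are equivalent)
angles = np.linspace(0, np.pi, 180, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
var_along_dirs = np.var(X @ directions.T, axis=0, ddof=1)

# No scanned direction should beat the first principal component
print(var_along_pc, var_along_dirs.max())
```

The variance along the first principal component also equals the largest eigenvalue, which is why eigenvalues are used to rank components.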

### **Subsequent Principal Components:**
- The **second principal component** is the direction with the highest remaining variance, subject to being perpendicular (orthogonal) to the first. This ensures it captures different information: another way the data varies, not just a repeat of the first direction.
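
The perpendicularity can be checked numerically. The sketch below assumes a made-up symmetric covariance matrix and uses `np.linalg.eigh`; for a symmetric matrix the eigenvectors it returns are unit length and mutually orthogonal.

```python
import numpy as np

# Hypothetical symmetric covariance matrix for three features
cov = np.array([
    [4.0, 1.2, 0.5],
    [1.2, 3.0, 0.3],
    [0.5, 0.3, 2.0],
])

# eigh returns eigenvalues in ascending order, so reorder to descending
evals, evecs = np.linalg.eigh(cov)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

pc1, pc2 = evecs[:, 0], evecs[:, 1]
dot_12 = pc1 @ pc2          # close to 0: the two directions are perpendicular
gram = evecs.T @ evecs      # close to the identity matrix

print(dot_12)
print(np.round(gram, 6))
```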

### **Why Variance Matters in PCA:**
- By focusing on directions with the highest variance, **PCA** is effectively identifying the most "informative" ways the data varies. These directions help to summarize the data with fewer dimensions while retaining as much of the original variance (and therefore information) as possible.

### **Dimensionality Reduction:**
- After finding these principal components, PCA allows you to drop the ones with the least variance (the least important information). This reduces the complexity of your data but keeps most of its informative content.
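
To make "keeps most of its informative content" concrete, the sketch below drops the lowest-variance components, maps the reduced data back to the original space, and measures what was lost. The data is hypothetical and randomly generated, and `k`, `W`, and `X_hat` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 100 samples, 4 correlated features
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)

evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

k = 2
W = evecs[:, :k]          # retained directions
scores = Xc @ W           # reduced representation, shape (100, k)
X_hat = scores @ W.T      # mapped back into the original 4-D space

# Total squared reconstruction error, with the same n - 1 scaling as np.cov
lost = ((Xc - X_hat) ** 2).sum() / (len(Xc) - 1)
print(lost, evals[k:].sum())
```

The two printed numbers agree: the variance lost by the reduction is the sum of the eigenvalues of the dropped components, so dropping the smallest ones loses the least.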

### **Summary:**
- **PCA** is all about finding and keeping the directions (principal components) where your data has the most variance, as these directions are where the most important information lies. The more variance in a principal component, the more it explains about your data's structure.
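
The steps summarized above can be collected into one small function. This is a sketch under stated assumptions rather than a reference implementation: the name `pca` and its return values are hypothetical, and it uses `np.linalg.eigh`, which differs from the `scipy.linalg.eig` call in the cells below.

```python
import numpy as np

def pca(X, n_components):
    """Return (scores, components, variances) for the top n_components directions."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                          # center without mutating the input
    cov = np.cov(Xc, rowvar=False)                   # columns are the variables
    evals, evecs = np.linalg.eigh(cov)               # ascending eigenvalues
    order = np.argsort(evals)[::-1][:n_components]   # indices of the largest ones
    components = evecs[:, order]
    return Xc @ components, components, evals[order]

# Hypothetical usage on random data with very different feature scales
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3)) * np.array([5.0, 2.0, 0.5])
scores, components, variances = pca(X, n_components=2)
print(scores.shape, variances)
```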


In [4]:
import numpy as np

data = np.array([
    [6., 3., 2.],
    [3., 2., 7.],
    [5., 4., 2.],
    [1., 4., 3.],
    [7., 3., 1.],
    [5., 1., 8.],
    [4., 2., 0.],
    [8., 6., 6.],
    [6., 3., 2.],
    [7., 1., 1.]
])



In [7]:
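# Center each feature (column) at zero mean
data -= data.mean(axis=0)
# Sample covariance matrix; rowvar=False treats columns as variables
cov = np.cov(data, rowvar=False)
cov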


array([[ 4.4       ,  0.46666667, -0.71111111],
       [ 0.46666667,  2.32222222,  0.13333333],
       [-0.71111111,  0.13333333,  7.73333333]])
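
As a cross-check on what `np.cov` computes, the sketch below rebuilds the same matrix by hand from the centered data, dividing by `n - 1`. The data array is copied from the cell above; `centered` and `cov_manual` are names introduced here.

```python
import numpy as np

data = np.array([
    [6., 3., 2.], [3., 2., 7.], [5., 4., 2.], [1., 4., 3.], [7., 3., 1.],
    [5., 1., 8.], [4., 2., 0.], [8., 6., 6.], [6., 3., 2.], [7., 1., 1.],
])

centered = data - data.mean(axis=0)
n = centered.shape[0]

# Covariance as (X^T X) / (n - 1) on centered data
cov_manual = centered.T @ centered / (n - 1)

print(np.round(cov_manual, 8))
print(np.allclose(cov_manual, np.cov(data, rowvar=False)))
```

The diagonal holds each feature's variance and the off-diagonal entries hold the pairwise covariances.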

In [10]:
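from scipy import linalg as la

# Eigen-decomposition of the covariance matrix: evals[i] pairs with column evecs[:, i].
# la.eig returns eigenvalues as complex numbers; the imaginary parts are zero for a symmetric matrix.
evals, evecs = la.eig(cov)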

In [11]:
evals

array([7.87894934+0.j, 4.3690233 +0.j, 2.20758292+0.j])

In [12]:
evecs

array([[ 0.19938713,  0.95394887, -0.22411231],
       [-0.00676759,  0.23003958,  0.97315774],
       [-0.97989743,  0.19251843, -0.05232287]])
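
Because a covariance matrix is symmetric, `np.linalg.eigh` is a possible alternative that returns real eigenvalues directly. The sketch below applies it to the covariance matrix printed earlier (values copied to 8 decimals) and checks the defining relation `cov @ v = lambda * v`. Eigenvector signs may differ from the output above; both signs describe the same direction.

```python
import numpy as np

cov = np.array([
    [ 4.4       ,  0.46666667, -0.71111111],
    [ 0.46666667,  2.32222222,  0.13333333],
    [-0.71111111,  0.13333333,  7.73333333],
])

# eigh assumes a symmetric matrix and returns real eigenvalues in ascending order
evals_h, evecs_h = np.linalg.eigh(cov)
evals_h, evecs_h = evals_h[::-1], evecs_h[:, ::-1]   # flip to descending order

# Each column v of evecs_h should satisfy cov @ v = lambda * v
residual = cov @ evecs_h - evecs_h * evals_h

print(evals_h)
print(np.abs(residual).max())
```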

In [13]:
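num_component = 2
# Indices of the num_component largest eigenvalues, in descending order
sorted_key = np.argsort(evals)[::-1][:num_component]
# Keep the matching eigenvalues and eigenvector columns
evals, evecs = evals[sorted_key], evecs[:, sorted_key]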


In [14]:
evals

array([7.87894934+0.j, 4.3690233 +0.j])
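
A common way to judge a choice such as `num_component = 2` is the share of total variance each component explains. The sketch below uses the real parts of the three eigenvalues printed earlier; `all_evals`, `ratio`, and `cumulative` are names introduced here.

```python
import numpy as np

# Real parts of the three eigenvalues printed by the earlier cell
all_evals = np.array([7.87894934, 4.3690233, 2.20758292])

# Fraction of total variance carried by each component, and the running total
ratio = all_evals / all_evals.sum()
cumulative = np.cumsum(ratio)

print(np.round(ratio, 3))
print(np.round(cumulative, 3))
```

For this data the first two components together account for roughly 85% of the total variance.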

In [15]:
evecs

array([[ 0.19938713,  0.95394887],
       [-0.00676759,  0.23003958],
       [-0.97989743,  0.19251843]])

In [16]:
# Project the centered data onto the two retained components
pc = np.dot(data, evecs)
pc

array([[ 1.33470986,  0.55514093],
       [-4.15617109, -1.57415309],
       [ 1.12855514, -0.16876835],
       [-0.64889083, -3.79204539],
       [ 2.51399443,  1.31657137],
       [-4.73052666,  0.2962235 ],
       [ 2.90249804, -1.96783325],
       [-2.20640836,  3.92323115],
       [ 1.33470986,  0.55514093],
       [ 2.52752961,  0.85649221]])
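
A quick sanity check on the projection is that the covariance matrix of the projected data should be diagonal, with the retained eigenvalues on the diagonal. The sketch below repeats the pipeline with `np.linalg.eigh` (an assumption; the cells above use `scipy.linalg.eig`), so individual columns may come out with flipped signs, which does not affect their variances.

```python
import numpy as np

data = np.array([
    [6., 3., 2.], [3., 2., 7.], [5., 4., 2.], [1., 4., 3.], [7., 3., 1.],
    [5., 1., 8.], [4., 2., 0.], [8., 6., 6.], [6., 3., 2.], [7., 1., 1.],
])

centered = data - data.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(evals)[::-1][:2]       # two largest eigenvalues
scores = centered @ evecs[:, order]       # projected data, shape (10, 2)

# Off-diagonal entries near 0 mean the two score columns are uncorrelated
score_cov = np.cov(scores, rowvar=False)
print(np.round(score_cov, 6))
```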