## Principal component analysis (PCA)

* Used in data analysis, including feature compression;
* Given data of any shape whatsoever, PCA finds a new coordinate system obtained from the old one by translation and rotation only;
* It moves the center of the coordinate system to the center of the data, rotates the x-axis onto the principal axis of variation (the direction along which the data points vary the most), and places each further axis along an orthogonal, less important direction of variation;
* It also tells you how important each of these axes is;
* The data don't need to be perfectly 1D for PCA to find the new center and axes;
* Major axis ($\delta{x}$) and minor axis ($\delta{y}$)
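
The translation-and-rotation picture above can be sketched with NumPy alone. This is a minimal illustration on synthetic 2D data; the data, the seed, and names such as `centered` and `rotated` are assumptions made for the example, not part of the original notes. It shifts the origin to the data mean, then re-expresses the points in the axes found via SVD.

```python
import numpy as np

rng = np.random.default_rng(0)  # synthetic data; seed chosen arbitrarily

# Elongated 2D cloud: wide along one direction, narrow along the other,
# then rotated by 30 degrees and shifted away from the origin.
raw = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])
theta = np.deg2rad(30)
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
data = raw @ rot.T + np.array([5.0, -2.0])

# Translation: move the origin to the center of the data.
center = data.mean(axis=0)
centered = data - center

# Rotation: rows of vt are the new orthonormal axes, ordered by decreasing spread.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
rotated = centered @ vt.T

# Spread (standard deviation) along each new axis: major first, then minor.
spread = singular_values / np.sqrt(len(data) - 1)
print("center:", center)
print("major axis direction:", vt[0])
print("spread along major/minor axes:", spread)
```

The sign of each axis returned by the SVD is arbitrary, so the major axis may point either way along the same line. Distances between points are unchanged by the transformation, which is what "translation and rotation only" implies.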

## Measurable vs. Latent features

Given the features of a house, what is its price?
* This is a regression exercise because the expected output (price) is continuous;
* **Measurable**: square footage, number of rooms, school ranking, neighborhood safety;
* **Latent** (variables that you can't measure directly, only indirectly): size (inferred from square footage and number of rooms), neighborhood (inferred from school ranking and neighborhood safety).

## Preserving information

How best to condense our 4 features to 2, so that we really get to the heart of the information?

* `SelectKBest`: specify the number of features (k) to keep;
* `SelectPercentile`: specify the percentage of top-scoring features to keep;
* There are many features, but I hypothesize that a smaller number of features actually drive the patterns;
* Try making a **composite feature** (principal component) that more directly probes the underlying phenomenon;
* Example:
> square footage + number of rooms → size

* A principal component can look like a regression, but it's not: it doesn't try to predict anything, it just comes up with a direction in the data that can be projected onto while losing a minimal amount of information;
* With the principal component found (the direction of the line), use **projection** to create a one-dimensional distribution (the data lying flat along that line).
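
As a rough sketch of the two univariate selectors mentioned above, the following applies scikit-learn's `SelectKBest` and `SelectPercentile` to synthetic house-like data. The feature values, the price formula, and the choice of `f_regression` as the scoring function are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

rng = np.random.default_rng(42)
n = 300

# Hypothetical features: two relate to "size", two to "neighborhood".
sqft = rng.normal(1500, 400, n)
rooms = sqft / 300 + rng.normal(0, 0.5, n)
school = rng.normal(5, 2, n)
safety = school + rng.normal(0, 1, n)
X = np.column_stack([sqft, rooms, school, safety])

# Hypothetical price driven by size and neighborhood, plus noise.
price = 100 * sqft + 20000 * school + rng.normal(0, 20000, n)

# Keep the 2 features with the highest univariate F-scores.
k_best = SelectKBest(score_func=f_regression, k=2).fit(X, price)
X_k = k_best.transform(X)

# Keep the top 50% of features (2 of 4 here).
pct = SelectPercentile(score_func=f_regression, percentile=50).fit(X, price)
X_pct = pct.transform(X)

print("SelectKBest kept columns:", k_best.get_support(indices=True))
print("SelectPercentile kept columns:", pct.get_support(indices=True))
```

Both selectors discard whole features, whereas a composite feature combines several measured features into one.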

## Maximal variance

How to determine the principal component:
* **Variance** (technical term in statistics): roughly the "spread" of a data distribution (it is the square of the standard deviation);
* The longer line (the major axis) is the direction of maximum variance;
* The principal component of a dataset is the direction that has the largest variance, because it retains the maximum amount of information in the original data.
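
To make "direction of largest variance" concrete, here is a small sketch that scans candidate directions in 2D, measures the variance of the data projected onto each one, and compares the best direction with the first component reported by scikit-learn's `PCA`. The synthetic data and the one-degree angular grid are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Correlated 2D data (synthetic): y roughly follows 0.5 * x.
x = rng.normal(0, 2, 500)
y = 0.5 * x + rng.normal(0, 0.3, 500)
data = np.column_stack([x, y])
centered = data - data.mean(axis=0)

# Variance of the projection onto each unit direction between 0 and 179 degrees.
angles = np.deg2rad(np.arange(0, 180, 1.0))
directions = np.column_stack([np.cos(angles), np.sin(angles)])
variances = (centered @ directions.T).var(axis=0, ddof=1)

best = directions[np.argmax(variances)]
pca = PCA(n_components=1).fit(data)
first_pc = pca.components_[0]

print("best scanned direction:", best)
print("first principal component:", first_pc)
```

The two directions should agree up to the grid resolution and an overall sign, since a direction and its opposite describe the same line.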

## Maximal variance and information loss

* Projecting onto the direction of maximal variance minimizes the distance from each old (higher-dimensional) data point to its new transformed value, which minimizes information loss.
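
The link between maximal variance and minimal information loss follows from the Pythagorean theorem. As a sketch, assume $n$ centered data points $x_i$ and a unit vector $u$ (these symbols are introduced here for illustration). Each point splits into its projection onto $u$ and a perpendicular residual, and averaging over all points gives:

$$
\underbrace{\frac{1}{n}\sum_{i=1}^{n} \|x_i\|^2}_{\text{fixed by the data}}
= \underbrace{\frac{1}{n}\sum_{i=1}^{n} (x_i \cdot u)^2}_{\text{variance along } u}
+ \underbrace{\frac{1}{n}\sum_{i=1}^{n} \|x_i - (x_i \cdot u)\,u\|^2}_{\text{mean squared projection distance}}
$$

The left-hand side does not depend on $u$, so choosing $u$ to maximize the variance term is the same as choosing it to minimize the projection distance.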

## PCA as a general algorithm for feature transformation

* Put all features into PCA, so it can automatically combine them into new features and rank the relative power of those new features;
* The max number of PCs (principal components) is `min(n_features, n_data_points)` in sklearn.
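
The `min(n_features, n_data_points)` limit can be checked directly. This is a minimal sketch on random data (the array shapes are arbitrary choices for the example), leaving `n_components` unset so that scikit-learn keeps as many components as it can.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# More data points than features: limited by the 3 features.
tall = rng.normal(size=(50, 3))
# Fewer data points than features: limited by the 5 data points.
wide = rng.normal(size=(5, 10))

# With n_components left unset, PCA keeps min(n_samples, n_features) components.
n_tall = PCA().fit(tall).n_components_
n_wide = PCA().fit(wide).n_components_

print("components kept (50 x 3 data):", n_tall)
print("components kept (5 x 10 data):", n_wide)
```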

## Review/definition of PCA

* Systematized way to transform input features into principal components;
* Use the principal components as new features;
* PCs are directions in the data that maximize variance (minimize information loss) when you project/compress down onto them;
* The more variance the data has along a PC, the higher that PC is ranked;
* First PC = most variance/most information; second PC = second-most variance (without overlapping with the first PC, so the new features are uncorrelated);
* The max number of PCs is the number of input features (and, in sklearn, no more than the number of data points).
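
The ranking and non-overlap properties listed above can be inspected on a fitted model. The sketch below uses synthetic data with four features (the data and the mixing matrix are assumptions for the example) and looks at `explained_variance_ratio_` and `components_` from scikit-learn's `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)

# Synthetic data: 4 hidden factors with very different spreads, mixed together.
latent = rng.normal(size=(400, 4)) * np.array([5.0, 2.0, 1.0, 0.2])
mixing = rng.normal(size=(4, 4))
X = latent @ mixing

pca = PCA().fit(X)  # keep all components (at most 4 here)
ratios = pca.explained_variance_ratio_
components = pca.components_

# PCs are ranked: each explains no more variance than the one before it.
print("explained variance ratios:", np.round(ratios, 3))

# PCs do not overlap: the component vectors are mutually orthogonal unit vectors.
print("components @ components.T:\n", np.round(components @ components.T, 3))

# Using the PCs as new features: the transformed columns are uncorrelated.
X_new = pca.transform(X)
cov_new = np.cov(X_new, rowvar=False)
print("covariance of new features:\n", np.round(cov_new, 3))
```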


In [1]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def doPCA(data):
    """Fit a 2-component PCA to `data` and return the fitted model."""
    pca = PCA(n_components=2)
    pca.fit(data)
    return pca

# `data` is assumed to be an (n_samples, 2) array of (bonus, long-term incentive)
# pairs loaded earlier; it is not defined in this cell.
pca = doPCA(data)
print(pca.explained_variance_ratio_)
first_pc = pca.components_[0]
second_pc = pca.components_[1]

transformed_data = pca.transform(data)
for ii, jj in zip(transformed_data, data):
    # Each point's coordinate along the first PC (red) and second PC (cyan)
    plt.scatter(first_pc[0]*ii[0], first_pc[1]*ii[0], color='r')
    plt.scatter(second_pc[0]*ii[1], second_pc[1]*ii[1], color='c')
    # Original data point (blue)
    plt.scatter(jj[0], jj[1], color='b')

plt.xlabel('bonus')
plt.ylabel('long-term incentive')
plt.show()

## PCA for facial recognition

What makes facial recognition in pictures good for PCA?
* Pictures of faces generally have high input dimensionality (many pixels);
* Faces have general patterns that could be captured in a smaller number of dimensions (two eyes on top, mouth/chin on the bottom, etc.).

## Eigenfaces code

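In [None]:
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Assumes X_train, X_test, y_train, n_components, h and w were defined when the data was imported.
# Older scikit-learn versions exposed this as RandomizedPCA; current versions use
# PCA with svd_solver='randomized'.
pca = PCA(n_components=n_components, svd_solver='randomized', whiten=True).fit(X_train)
eigenfaces = pca.components_.reshape((n_components, h, w))
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# Train an SVM, searching over C and gamma
param_grid = {
    'C': [1e3, 5e3, 1e4, 5e4, 1e5],
    'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1]
    }
# class_weight='balanced' replaces the older 'auto' option
clf = GridSearchCV(SVC(kernel='rbf', class_weight='balanced'), param_grid)
clf = clf.fit(X_train_pca, y_train)
print(clf.best_estimator_)
y_pred = clf.predict(X_test_pca)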

If you add more PCs as features for training your classifier, do you expect performance to get better or worse?
* Ideally, adding more components gives the classifier more signal and improves its performance.

Do you see any evidence of overfitting when using a large number of PCs?
* Yes, performance (F1 score) starts to drop with many PCs.

Selecting a number of principal components:
* Train on different numbers of PCs and see how accuracy responds; cut off when it becomes apparent that adding more PCs doesn't buy you much more discrimination;
* The same can be done with plain old feature selection: take the features in order of importance, add them one at a time, see how the accuracy responds, and cut off when it seems to be plateauing;
* Be careful about throwing out information before you perform PCA.
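
The "train on different numbers of PCs" procedure might look like the sketch below. It uses scikit-learn's bundled digits dataset as a stand-in for the faces data (an assumption made so the example is self-contained), and the candidate component counts, the 3-fold cross-validation, and the macro-averaged F1 score are likewise illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # 64 pixel features per image

scores = {}
for n_components in [2, 5, 10, 20, 40]:
    # PCA sits inside the pipeline so it is refit on each training fold.
    model = make_pipeline(
        PCA(n_components=n_components, random_state=0),
        SVC(kernel="rbf"),
    )
    f1 = cross_val_score(model, X, y, cv=3, scoring="f1_macro")
    scores[n_components] = f1.mean()
    print(f"{n_components:>2} PCs: mean F1 = {scores[n_components]:.3f}")
```

Reading the printed scores from top to bottom, a reasonable cutoff is the point where adding more PCs stops improving the score noticeably. All original features are passed to PCA here, in line with the caution about discarding information beforehand.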