## 18.5.1
### Dimensionality Reduction
Martha has noticed that so far we have been working with fairly manageable datasets. Even after some data cleanup, there haven't been too many features to work with. However, she is beginning to worry that her cryptocurrency data has too many features, and she is not sure how this will affect our model. The way to handle this is with dimensionality reduction.

Think back to our example with the store owner who is trying to sell school supplies. His customer data could contain endless features, or columns. The data could include name, age, address, items bought, amount spent, time spent shopping, zip code, and so forth. Some features just aren't necessary and could throw off our algorithm. For instance, would converting names to an integer value be worth the time or even inform our analysis?

Also, feeding all of these features into the model might cause it to overfit the data.

Because an overfit model performs poorly on new data, it is best to find a way to limit features. The process of reducing features is called dimensionality reduction. There are two options for coping with too many features: elimination and extraction.

#### Feature Elimination

Your first idea is to remove a good number of features so the model won't be run using every column. This is called feature elimination.

Feature elimination means what you think: You remove, or eliminate, a feature from the dataset. In our school supply example, you remove features that aren't relevant to what we're looking for, such as name, address, and zip code. This method is simple, and the features that remain are unchanged, so the model stays easy to interpret.

The downside is, once you remove that feature, you can no longer glean information from it. If we want to know the likelihood of people buying school supplies, but we removed the zip code feature, then we'd miss a detail that could help us understand when certain residents tend to purchase school supplies.
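As a minimal sketch of feature elimination, the following drops columns from a small customer table with Pandas. The DataFrame, its column names, and its values are all hypothetical and stand in for the store owner's data.

```python
import pandas as pd

# Hypothetical customer data for the school supply store (values are made up).
df_customers = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cleo"],
        "zip_code": ["73301", "73344", "73301"],
        "age": [34, 41, 29],
        "amount_spent": [25.50, 40.00, 12.75],
    }
)

# Feature elimination: drop the columns we judge irrelevant to the analysis.
features_to_drop = ["name", "zip_code"]
df_reduced = df_customers.drop(columns=features_to_drop)

print(df_reduced.columns.tolist())  # ['age', 'amount_spent']
```

Because `drop` returns a new DataFrame, `df_customers` still holds the zip codes if we later decide that feature matters after all.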

#### Feature Extraction

Feature extraction combines all features into a new set that is ordered by how well they predict our original variable.

In other words, feature extraction reduces the number of dimensions by transforming a large set of variables into a smaller one. This smaller set of variables contains most of the important information from the original large set.
**note**

Sometimes, you need to use both feature elimination and extraction. For instance, the customer name feature doesn't inform us about whether or not customers will purchase school supplies. So, we would eliminate that feature during the preprocessing stage, then apply extraction on the remaining features.

## 18.5.2
### Principal Component Analysis
Your client assured you that all the data they have collected is important and needs to be used. Because you are worried about overfitting, you decide to use Principal Component Analysis (PCA).

PCA is a statistical technique to speed up machine learning algorithms when the number of input features (or dimensions) is too high. PCA reduces the number of dimensions by transforming a large set of variables into a smaller one that contains most of the information in the original large set.

PCA is a complicated process to understand, but it is easy to code. Let's start out by coding some PCA into our K-means, so that you can see it in action, then revisit the code's underlying theory.

We'll use the new_iris_data.csv file. First, import the libraries we'll use, then load the data into a Pandas DataFrame:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    import hvplot.pandas

Load iris data into Pandas DataFrame.

There are four features in this dataset with values on different scales. The first step in PCA is to standardize these features by using scikit-learn's StandardScaler class:

StandardScaler to standardize values.
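To make the standardization step concrete, here is a small sketch that compares a by-hand calculation with `StandardScaler`. The three-row array is hypothetical and is not the iris data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements: two features on very different scales.
data = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

# Standardize by hand: subtract each column's mean, divide by its standard deviation.
manual = (data - data.mean(axis=0)) / data.std(axis=0)

# StandardScaler applies the same calculation to each column.
scaled = StandardScaler().fit_transform(data)

print(np.allclose(manual, scaled))  # True
print(scaled.mean(axis=0))          # each column mean is (approximately) 0
print(scaled.std(axis=0))           # each column standard deviation is 1
```

After scaling, both columns have the same spread, so neither feature dominates the PCA step just because of its units.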

Now that the data has been standardized, we can use PCA to reduce the number of features. The PCA class takes an n_components argument. We pass in a value of 2, which reduces the features from 4 to 2:

    # Initialize PCA model
    pca = PCA(n_components=2)

After creating the PCA model, we apply dimensionality reduction on the scaled dataset:

    # Get two principal components for the iris data.
    iris_pca = pca.fit_transform(iris_scaled)

After this dimensionality reduction, we get a smaller set of dimensions called principal components. These new components are just the two main dimensions of variation that contain most of the information in the original dataset.

The resulting principal components are transformed into a DataFrame to fit K-means:

Transform principal components into a DataFrame.

Use the `explained_variance_ratio_` attribute to learn how much information can be attributed to each principal component:

This tells us that the first principal component contains 72.77% of the variance and the second contains 23.03%. Together, they contain 95.80% of the information.
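The sketch below shows how such percentages can be read from a fitted model. It uses a synthetic four-feature array rather than the iris data, so the numbers it prints will differ from those quoted above. The variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a four-feature dataset (not the iris data).
rng = np.random.default_rng(0)
base = rng.normal(size=(150, 2))
noise = rng.normal(scale=0.1, size=(150, 2))
# Columns 3 and 4 are noisy copies of columns 1 and 2,
# so two components should capture almost all of the variance.
features = np.hstack([base, base + noise])

scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

ratios = pca.explained_variance_ratio_
print(components.shape)  # (150, 2)
print(ratios)            # share of variance per component, largest first
print(ratios.sum())      # total share kept by the two components
```

If the summed ratio is too low for your purposes, one option is to raise `n_components` and check the sum again.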

Next, we'll use the elbow curve with the generated principal components and see that the best K value is 3:

    # Find the best value for K
    inertia = []
    k = list(range(1, 11))

    # Calculate the inertia for the range of K values
    for i in k:
        km = KMeans(n_clusters=i, random_state=0)
        km.fit(df_iris_pca)
        inertia.append(km.inertia_)

    # Create the elbow curve
    elbow_data = {"k": k, "inertia": inertia}
    df_elbow = pd.DataFrame(elbow_data)
    df_elbow.hvplot.line(x="k", y="inertia", xticks=k, title="Elbow Curve")

A graph shows the elbow curve at point 3.

Use the principal components data with the K-means algorithm and a K value of 3. We could consider 2, but the slope of the curve changes more noticeably at 3:

    # Initialize the K-means model
    model = KMeans(n_clusters=3, random_state=0)

    # Fit the model
    model.fit(df_iris_pca)

    # Predict clusters
    predictions = model.predict(df_iris_pca)

    # Add the predicted class columns
    df_iris_pca["class"] = model.labels_
    df_iris_pca.head()

Finally, we can plot the clusters. Instead of a 3D plot, the data is easier to analyze with only two features:

    df_iris_pca.hvplot.scatter(
        x="principal component 1",
        y="principal component 2",
        hover_cols=["class"],
        by="class",
    )

A 2D graph shows three clusters.
**note**

The next few sections will go over exactly how PCA works and can be a bit daunting. Remember, you can already code it!

The complete notebook cell for this section:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    import hvplot.pandas

    # Load the preprocessed iris dataset
    file_path = "../Exported_Data/new_iris_data.csv"
    df_iris = pd.read_csv(file_path)
    df_iris.head()

    # Standardize data with StandardScaler
    iris_scaled = StandardScaler().fit_transform(df_iris)
    print(iris_scaled[0:5])

    # Initialize PCA model
    pca = PCA(n_components=2)

    # Get two principal components for the iris data
    iris_pca = pca.fit_transform(iris_scaled)

    # Transform PCA data to a DataFrame to fit K-means
    df_iris_pca = pd.DataFrame(
        data=iris_pca, columns=["principal component 1", "principal component 2"]
    )
    df_iris_pca.head()

    # Fetch the explained variance
    pca.explained_variance_ratio_

    # Find the best value for K
    inertia = []
    k = list(range(1, 11))

    # Calculate the inertia for the range of K values
    for i in k:
        km = KMeans(n_clusters=i, random_state=0)
        km.fit(df_iris_pca)
        inertia.append(km.inertia_)

    # Create the elbow curve
    elbow_data = {"k": k, "inertia": inertia}
    df_elbow = pd.DataFrame(elbow_data)
    df_elbow.hvplot.line(x="k", y="inertia", xticks=k, title="Elbow Curve")

    # Initialize the K-means model
    model = KMeans(n_clusters=3, random_state=0)

    # Fit the model
    model.fit(df_iris_pca)

    # Predict clusters
    predictions = model.predict(df_iris_pca)

    # Add the predicted class columns
    df_iris_pca["class"] = model.labels_
    df_iris_pca.head()

    # Plot the clusters
    df_iris_pca.hvplot.scatter(
        x="principal component 1",
        y="principal component 2",
        hover_cols=["class"],
        by="class",
    )

Output of `print(iris_scaled[0:5])`:

    [[-0.90068117  1.03205722 -1.3412724  -1.31297673]
     [-1.14301691 -0.1249576  -1.3412724  -1.31297673]
     [-1.38535265  0.33784833 -1.39813811 -1.31297673]
     [-1.50652052  0.10644536 -1.2844067  -1.31297673]
     [-1.02184904  1.26346019 -1.3412724  -1.31297673]]


## 18.5.3
### Mean, Variance, and Covariance
Now that you have convinced Martha that feature extraction is the way to go, she needs some background on why it works, in case questions come up during her presentation about how she can "magically" combine these features in a meaningful way. To start, you dust off your stats knowledge and refresh your memory on mean, variance, and covariance. These are the building blocks of PCA.

There is a mathematical way to use feature extraction, but first let's review some stats concepts.

#### Mean

Recall that the mean is the sum of a group of numbers divided by how many numbers there are. For example, we start with points 2, 3, and 7. First, we add up all the numbers: 2 + 3 + 7 = 12. Then we divide the result by the number of points, which is 3. Since 12 / 3 = 4, the mean of those three points is 4.

#### Variance

Variance is the squared distance of each point from the center, added together and divided by the total number of points. It measures the spread of a set of numbers. The center of the points may look familiar, and it should, because it is the mean of all the points. Variance, in other words, is a measure of how far the data points are from the mean.

Look at the following points on a line:

The image shows three points -4, 0, and 4 along a line.

Using 0 as the center point, the distances are -4 from the center, 0 from the center (the center point is still a point), and 4 from the center.

The sum of squared distances would be (-4)^2 + (0)^2 + (4)^2 = 16 + 0 + 16 = 32. We use squared distances so they are all positive.

Divide by the total number of points, which is 3. The variance of this dataset would be 32/3, or 10⅔.

Normally, there won't be an even distribution of points around the center. The points 2, 3, and 7 from the previous example don't have a clear center.

This is where the mean comes into play. The center of the line is set to the mean, which we found to be 4. Here is what the points look like on a line:

The image shows four points at 2, 3, 4, and 7 along a line.

The distance from 4 to 2 is -2, the distance from 4 to 3 is -1, and the distance from 4 to 7 is 3. The mean itself, 4, is marked on the line but is not one of the data points. Its distance from the center is 0, so it would not change the sum.

Add up the squares of each distance: (-2)^2 + (-1)^2 + (3)^2 = 4 + 1 + 9 = 14.

Finally, divide the sum of squared distances by the number of data points, which is 3: 14 / 3. The variance equals 14/3, or 4⅔.
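The two worked examples can be checked with a few lines of Python. This is a minimal sketch: the helper names `mean` and `variance` are our own, and `Fraction` is used only so the results print as exact fractions.

```python
from fractions import Fraction


def mean(points):
    """Sum of the points divided by how many points there are."""
    return Fraction(sum(points), len(points))


def variance(points):
    """Average squared distance of each point from the mean."""
    center = mean(points)
    return sum((p - center) ** 2 for p in points) / len(points)


print(mean([2, 3, 7]))       # 4
print(variance([2, 3, 7]))   # 14/3
print(variance([-4, 0, 4]))  # 32/3
```

Note that this divides by the number of points, as the text does. Some libraries divide by one less than the number of points by default, which gives a slightly larger value.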

**note**
These examples showed points on the x-axis, and thus, form the x variance. The same process applies to elements on the y-axis, forming the y variance.

#### Covariance

Before defining what covariance is, look at the following two plots:

Graph A shows three coordinates: (1, 3), (2, 2), and (3, 1).

Graph B shows three coordinates: (1, 1), (2, 2), and (3, 3).

These two plots clearly are very different. Each has the same center, with different points on the left and the right, one sloping negatively and the other sloping positively.

Let's find the x and y variance for each line.

For graph A:

- The center point is (2, 2).
- The distances for the points are measured from (2, 2).
- Point (1, 3) is a distance of -1 away on the x-axis and 1 on the y-axis.
- Point (3, 1) is a distance of 1 away on the x-axis and -1 on the y-axis.
- x variance = ((-1)^2 + 0^2 + (1)^2) / 3 = 2/3
- y variance = ((1)^2 + 0^2 + (-1)^2) / 3 = 2/3

For graph B:

- The center point is (2, 2).
- The distances for the points are measured from (2, 2).
- Point (1, 1) is a distance of -1 away on the x-axis and -1 on the y-axis.
- Point (3, 3) is a distance of 1 away on the x-axis and 1 on the y-axis.
- x variance = ((-1)^2 + 0^2 + (1)^2) / 3 = 2/3
- y variance = ((-1)^2 + 0^2 + (1)^2) / 3 = 2/3

Wait. Both of these variances are exactly the same; however, it is very obvious that these two graphs are totally different! How can we tell the difference?

This is where covariance comes into play. Covariance is a metric that allows us to tell these two different sets of points apart.

Let's look at the following examples:

This graph shows five coordinates: (-3, 1), (-3, -1), (0, 0), (3, -1), and (3, 1).

The same graph now shows coordinates (-3, -1), (0, 0), and (3, 1) along Line A, and coordinates (-3, 1), (0, 0), and (3, -1) along Line B.

How can we tell the difference between the points that lie along Line A versus the points that lie along Line B?

We can do this with the product of coordinates, which is the x value of a point multiplied by its y value:

The graph shows five coordinates with their products: (-3, 1) has a product of -3, (-3, -1) has a product of 3, (0, 0) has a product of 0, (3, -1) has a product of -3, and (3, 1) has a product of 3.

Covariance is the sum of the product of coordinates divided by the number of points.

Covariance is used to determine the relationship between points.

The formula for covariance is as follows:

$$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}$$

In words: for each point, take the difference between its x value and the mean of x, and multiply it by the difference between its y value and the mean of y. Add up those products and divide by the total number of points, N. This may sound complicated but will make more sense once we look at an example.

Let's solve for the covariance of line A first which contains the points (-3, -1), (0, 0) and (3, 1).

First, take the mean of the x coordinates in line A: -3 + 0 + 3 = 0, and 0 divided by 3 is 0. Then repeat for the y coordinates: -1 + 0 + 1 = 0, and 0 divided by 3 is also 0.

Then, for each pair of coordinates, find the difference between each value and its respective mean.

| X | Y | X - mean of X | Y - mean of Y |
| --- | --- | --- | --- |
| -3 | -1 | -3 - 0 = -3 | -1 - 0 = -1 |
| 0 | 0 | 0 - 0 = 0 | 0 - 0 = 0 |
| 3 | 1 | 3 - 0 = 3 | 1 - 0 = 1 |

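Now multiply the two differences for each coordinate pair.

| X | Y | X - mean of X | Y - mean of Y | Product |
| --- | --- | --- | --- | --- |
| -3 | -1 | -3 - 0 = -3 | -1 - 0 = -1 | 3 |
| 0 | 0 | 0 - 0 = 0 | 0 - 0 = 0 | 0 |
| 3 | 1 | 3 - 0 = 3 | 1 - 0 = 1 | 3 |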

Finally, add the products of all the coordinate pairs and divide by the number of points to find the covariance.

3 + 0 + 3 = 6

Plug the result into the top part of the equation. Since we know there are 3 points, we plug that in for N to get:

Cov (x,y) = 6/3

Reduce the equation.

The covariance for line A is 2.

Repeating the same process for line B produces the following:

| X | Y | X - mean of X | Y - mean of Y | Product |
| --- | --- | --- | --- | --- |
| -3 | 1 | -3 - 0 = -3 | 1 - 0 = 1 | -3 |
| 0 | 0 | 0 - 0 = 0 | 0 - 0 = 0 | 0 |
| 3 | -1 | 3 - 0 = 3 | -1 - 0 = -1 | -3 |

Add the products of all the coordinate pairs.

-3 + 0 + -3 = -6

Plug the result into the top part of the equation. Again, we know there are 3 points, so we plug that in for N to get:

Cov (x,y) = -6/3

Reduce the equation.

The covariance for line B is -2.

The covariance for Line A is 2 while the covariance for Line B is -2.

We can then say that Line A has a positive covariance (at 2) while Line B has a negative covariance (at -2). There is also a third type of covariance called **covariance zero**. This occurs when there is no upward or downward trend between x and y, for example when the points tend to form a horizontal line.
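The three cases can be reproduced with a short function. This is a sketch: the `covariance` helper and the `flat` points are our own additions for illustration, while `line_a` and `line_b` come from the worked example.

```python
def covariance(points):
    """Average product of each point's x and y differences from their means."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in points) / n


line_a = [(-3, -1), (0, 0), (3, 1)]
line_b = [(-3, 1), (0, 0), (3, -1)]
flat = [(-3, 1), (0, 1), (3, 1)]  # hypothetical horizontal line

print(covariance(line_a))  # 2.0
print(covariance(line_b))  # -2.0
print(covariance(flat))    # 0.0
```

NumPy's `np.cov` computes the same quantity when called with `bias=True`. By default it divides by N - 1 instead of N.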

**note**

Covariance is used here only to describe the direction of the relationship between points, such as positive or negative, as we just saw. You may recall that correlation is another method for describing relationships. Correlation, however, is used to determine the strength of the relationship.

## 18.5.4
### Linear Transformations
Martha appreciates the refreshers on stats but is wondering where this is going. Patience is a virtue: all of this forms the building blocks you need to understand how PCA works. Next up is linear transformations.

Say we have a set of points on a graph. We want to center these points by taking the average of the coordinates, both X and Y. Find the balance point and move that to zero:

The image shows a set of data points centered over 0 of the x- and y-axis.

Once the points are centered, we're going to create a 2x2 matrix that consists of the variance and covariances that we found in the previous step:

A 2-by-2 matrix showing the variances and covariances of x and y.

So, let's say the matrix above contains the following:

A 2-by-2 matrix that contains the numbers 6, 2 in the first row and 2, 3 in the second row.

This matrix will be used to transform the points from one graph to another by using the numbers to create a formula for our transformation. The top row of the matrix gives the coefficients for the new x-coordinate, and the bottom row gives the coefficients for the new y-coordinate.

In our example, the formula for the points becomes (6x + 2y, 2x + 3y). Let's plug some coordinates into the formula:
	
| (x, y) | (6x + 2y, 2x + 3y) |
| --- | --- |
| (0, 0) | (0, 0) |
| (1, 0) | (6, 2) |
| (0, 1) | (2, 3) |
| (-1, 0) | (-6, -2) |
| (0, -1) | (-2, -3) |

Now, let's plot the new points from the right column of the table to see the linear transformation:

The graph displays a linear transformation with five points plotted.
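The same transformation can be written as a matrix multiplication. The sketch below uses NumPy, and the variable names are illustrative.

```python
import numpy as np

matrix = np.array([[6, 2],
                   [2, 3]])

# The five points from the example, one (x, y) pair per row.
points = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1]])

# Each row (x, y) becomes (6x + 2y, 2x + 3y).
transformed = points @ matrix.T

print(transformed.tolist())
# [[0, 0], [6, 2], [2, 3], [-6, -2], [-2, -3]]
```

Multiplying by the transposed matrix lets us keep one point per row, which matches how a DataFrame or a scikit-learn array is usually laid out.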

#### Eigenvectors and Eigenvalues

**note**
Eigenvectors and eigenvalues can be complicated subjects rooted in linear algebra. We cover these at a very high level, but if you wish to explore more on your own, you can read more about Eigenvalues and eigenvectors (https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors) and watch this video (https://www.youtube.com/watch?v=PFDu9oVAE-g).

As you can see, the points stretch out in our graph in two directions. One direction runs from southwest to northeast, and the other runs from southeast to northwest. These directions are called eigenvectors, as indicated by the arrows in the graph below:

Graph with point plotted and two eigenvectors: a shorter one pointing diagonally to the left and a larger one pointing diagonally to the right.

There is a way to figure out the vectors and values with algebra, but we use the calculator on WolframAlpha (https://www.wolframalpha.com/input/?i=eigenvalues) to simplify the process. Plug in our matrix of {{6,2},{2,3}}, then click calculate.

On the results page, you can see that the shape is stretched by a factor of 7 in one direction and by a factor of 2 in the other. The amount of stretch in each direction is called the eigenvalue:

WolframAlpha calculates the eigenvalues as 2 and 7.

We also see the directions of the stretch, given by the eigenvectors (2, 1) and (-1, 2). The eigenvector (2, 1) belongs to the eigenvalue 7, and (-1, 2) belongs to the eigenvalue 2:

WolframAlpha shows the corresponding eigenvectors as (2, 1) and (-1, 2).

The big takeaway from eigenvectors and eigenvalues is that they show us the directions in which the dataset spreads and by how much it spreads in each direction.
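As an alternative to the online calculator, the same values can be computed locally. This is a sketch using NumPy's `np.linalg.eigh`, which is intended for symmetric matrices such as this one. The course itself uses WolframAlpha, so this is a swapped-in tool.

```python
import numpy as np

matrix = np.array([[6.0, 2.0],
                   [2.0, 3.0]])

# eigh returns eigenvalues in ascending order and unit-length eigenvectors
# as the columns of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(matrix)

print(np.round(eigenvalues, 6))

for value, vector in zip(eigenvalues, eigenvectors.T):
    # An eigenvector is only stretched by the matrix: matrix @ v equals value * v.
    print(value, vector, np.allclose(matrix @ vector, value * vector))
```

The eigenvalues should come out as 2 and 7, up to rounding. The eigenvectors are scaled to length 1 and their sign may be flipped, so they will not print as (2, 1) and (-1, 2), but they point along the same lines.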
© 2020 - 2022 Trilogy Education Services, a 2U, Inc. brand. All Rights Reserved.
