<a href="https://colab.research.google.com/github/angelatackett/Mathematics-for-Data-Science---DATA230/blob/main/Essential_Math_for_Data_Science_%5BNield%5D_Chapter_4_Linear_Algebra_Angela_Tackett.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome to your assignment about concepts covered in Chapter 4 of *Essential Math for Data Science* by Thomas Nield. You will be using linear algebra.

In this assignment, you will apply the concepts of linear algebra to solve three real-world problems in data science. The problems are of low to moderate difficulty, and you will need to use Python and the NumPy library to implement the necessary computations.

Please read each question carefully and provide detailed explanations for your answers, including any relevant calculations or work. You are also required to provide Python solutions for the technical problems in each question.

# Problem 1

Principal Component Analysis (PCA) is a widely used technique for dimensionality reduction in data science, and it is built directly on linear algebra: the covariance matrix, its eigenvalues, and its eigenvectors supply the tools for projecting data onto fewer dimensions while preserving as much of its variance as possible. Understanding these concepts is essential for grasping both the theory and the implementation of PCA. In this question, you will use PCA to reduce the dimensionality of a dataset of your choice. For example, the Iris dataset looks like this:

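| sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... |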

The dataset contains 150 rows, each representing a flower. The columns are:

sepal_length: the length of the sepal

sepal_width: the width of the sepal

petal_length: the length of the petal

petal_width: the width of the petal

species: the species of iris (setosa, versicolor, or virginica)

We will be following these steps:

1. Load the dataset into a NumPy array.
2. Center the data by subtracting the mean from each feature.
3. Compute the covariance matrix of the centered data.
4. Compute the eigendecomposition of the covariance matrix.
5. Choose the top k eigenvectors that correspond to the k largest eigenvalues,
   where k is the desired number of principal components.
6. Transform the centered data using the selected eigenvectors.
7. Reconstruct the original data using the transformed data and the
   eigenvectors.
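
In matrix notation, the seven steps above can be summarized roughly as follows. This is a sketch under assumed notation that the assignment itself does not define: X is the n-by-d data matrix, 1 is a column of n ones, x-bar is the vector of feature means, and V_k is the d-by-k matrix whose columns are the top k eigenvectors. The 1/(n-1) factor matches the default used by `np.cov`.

```latex
\begin{aligned}
\bar{x} &= \tfrac{1}{n} X^\top \mathbf{1} \\
X_c &= X - \mathbf{1}\bar{x}^\top \\
C &= \tfrac{1}{n-1} X_c^\top X_c \\
C &= V \Lambda V^\top, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \\
Z &= X_c V_k \\
\hat{X} &= Z V_k^\top + \mathbf{1}\bar{x}^\top
\end{aligned}
```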

Write a Python program that implements the above steps and reduces the dimensionality of the dataset. Print the variance explained by the selected principal components for different values of k (e.g., k=2, k=3, k=4). Explain your results in 100 words.

Step 1. Import the necessary libraries.

In [None]:
import numpy as np
from numpy.linalg import eig  # eigendecomposition of a square matrix
from sklearn.datasets import load_iris


Step 2. Load the dataset into a NumPy array.



In [None]:
iris = load_iris()
X = iris.data
print(X)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.

Step 3. Center the data by subtracting the mean from each feature.

In [None]:
X_centered = X - np.mean(X,axis=0)
print(X_centered)

[[-7.43333333e-01  4.42666667e-01 -2.35800000e+00 -9.99333333e-01]
 [-9.43333333e-01 -5.73333333e-02 -2.35800000e+00 -9.99333333e-01]
 [-1.14333333e+00  1.42666667e-01 -2.45800000e+00 -9.99333333e-01]
 [-1.24333333e+00  4.26666667e-02 -2.25800000e+00 -9.99333333e-01]
 [-8.43333333e-01  5.42666667e-01 -2.35800000e+00 -9.99333333e-01]
 [-4.43333333e-01  8.42666667e-01 -2.05800000e+00 -7.99333333e-01]
 [-1.24333333e+00  3.42666667e-01 -2.35800000e+00 -8.99333333e-01]
 [-8.43333333e-01  3.42666667e-01 -2.25800000e+00 -9.99333333e-01]
 [-1.44333333e+00 -1.57333333e-01 -2.35800000e+00 -9.99333333e-01]
 [-9.43333333e-01  4.26666667e-02 -2.25800000e+00 -1.09933333e+00]
 [-4.43333333e-01  6.42666667e-01 -2.25800000e+00 -9.99333333e-01]
 [-1.04333333e+00  3.42666667e-01 -2.15800000e+00 -9.99333333e-01]
 [-1.04333333e+00 -5.73333333e-02 -2.35800000e+00 -1.09933333e+00]
 [-1.54333333e+00 -5.73333333e-02 -2.65800000e+00 -1.09933333e+00]
 [-4.33333333e-02  9.42666667e-01 -2.55800000e+00 -9.99333333e
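
A quick sanity check for the centering step is that every column of the centered array averages to zero. The sketch below uses only the five sample rows shown in the table earlier, not the full dataset, and the names `X_sample`, `col_means`, and `centered_ok` are illustrative.

```python
import numpy as np

# First five iris rows from the sample table (feature columns only).
X_sample = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
])

# axis=0 averages down the rows, giving one mean per feature (column).
col_means = X_sample.mean(axis=0)

# Broadcasting subtracts each column's mean from every row.
X_sample_centered = X_sample - col_means

# After centering, each column mean should be zero up to rounding error.
centered_ok = np.allclose(X_sample_centered.mean(axis=0), 0.0)
print(col_means)
print(centered_ok)
```

The same check applies to `X_centered` in the notebook: `np.allclose(X_centered.mean(axis=0), 0)` should be true.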


Step 4. Compute the covariance matrix of the centered data.


In [None]:
covariance_matrix = np.cov(X_centered, rowvar=False)
print(covariance_matrix)

[[ 0.68569351 -0.042434    1.27431544  0.51627069]
 [-0.042434    0.18997942 -0.32965638 -0.12163937]
 [ 1.27431544 -0.32965638  3.11627785  1.2956094 ]
 [ 0.51627069 -0.12163937  1.2956094   0.58100626]]
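
To make the covariance step less of a black box, the sketch below computes the sample covariance by hand and compares it with `np.cov`. It runs on synthetic random data rather than the iris data, and the names `data`, `manual_cov`, and `matches` are illustrative.

```python
import numpy as np

# Synthetic data: 20 samples (rows), 3 features (columns).
rng = np.random.default_rng(0)
data = rng.normal(size=(20, 3))

# Center each feature, as in Step 3.
centered = data - data.mean(axis=0)
n = centered.shape[0]

# Sample covariance: X_c^T X_c / (n - 1). The n - 1 divisor is np.cov's default.
manual_cov = centered.T @ centered / (n - 1)

# rowvar=False tells np.cov that columns, not rows, are the variables.
matches = np.allclose(manual_cov, np.cov(data, rowvar=False))
print(matches)
```

Because `np.cov` subtracts the means itself, passing the centered or the uncentered data gives the same matrix.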


Step 5. Compute the eigendecomposition of the covariance matrix to obtain the eigenvalues and eigenvectors. In the code below, enter the function from linalg and the variable name of the covariance matrix.

In [None]:
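eigenvalues, eigenvectors = eig(covariance_matrix)
order = np.argsort(eigenvalues)[::-1]  # eig guarantees no ordering, so sort largest first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
print(eigenvalues, eigenvectors)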

[4.22824171 0.24267075 0.0782095  0.02383509] [[ 0.36138659 -0.65658877 -0.58202985  0.31548719]
 [-0.08452251 -0.73016143  0.59791083 -0.3197231 ]
 [ 0.85667061  0.17337266  0.07623608 -0.47983899]
 [ 0.3582892   0.07548102  0.54583143  0.75365743]]
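
Each eigenpair should satisfy Cv = λv, and the eigenvectors of a symmetric matrix should be orthonormal. The sketch below checks both on the covariance matrix printed in Step 4. Note that it uses `np.linalg.eigh`, which is designed for symmetric matrices and returns eigenvalues in ascending order, rather than the `eig` call used in the notebook; the names `C`, `vals`, `vecs`, and `residual` are illustrative.

```python
import numpy as np

# Covariance matrix copied from the Step 4 output (rounded to 8 decimals).
C = np.array([
    [ 0.68569351, -0.042434,    1.27431544,  0.51627069],
    [-0.042434,    0.18997942, -0.32965638, -0.12163937],
    [ 1.27431544, -0.32965638,  3.11627785,  1.2956094 ],
    [ 0.51627069, -0.12163937,  1.2956094,   0.58100626],
])

# eigh returns ascending eigenvalues; reverse both outputs for largest-first.
vals, vecs = np.linalg.eigh(C)
vals, vecs = vals[::-1], vecs[:, ::-1]

# Column j of vecs pairs with vals[j], so C @ vecs should equal vecs * vals.
residual = C @ vecs - vecs * vals
print(np.allclose(residual, 0.0))

# Eigenvectors of a symmetric matrix are orthonormal: V^T V = I.
print(np.allclose(vecs.T @ vecs, np.eye(4)))
```

The eigenvectors may differ in sign from the ones printed above; an eigenvector multiplied by -1 is still a valid eigenvector.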


Step 6. Choose the top k eigenvectors that correspond to the k largest eigenvalues.

In [None]:
k = 2
top_k_eigenvectors = eigenvectors[:, :k]
print(top_k_eigenvectors)

[[ 0.36138659 -0.65658877]
 [-0.08452251 -0.73016143]
 [ 0.85667061  0.17337266]
 [ 0.3582892   0.07548102]]


Step 7. Transform the centered data using the selected eigenvectors.

In [None]:
transformed_data = X_centered @ top_k_eigenvectors

Step 8. Reconstruct the original data using the transformed data and the eigenvectors.

In [None]:
reconstructed_data = transformed_data @ top_k_eigenvectors.T + np.mean(X, axis=0)
print(reconstructed_data)

[[5.08303897 3.51741393 1.40321372 0.21353169]
 [4.7462619  3.15749994 1.46356177 0.24024592]
 [4.70411871 3.1956816  1.30821697 0.17518015]
 [4.6422117  3.05696697 1.46132981 0.23973218]
 [5.07175511 3.52655486 1.36373845 0.19699991]
 [5.50581049 3.79140823 1.67552816 0.32616959]
 [4.76528947 3.23041102 1.35723837 0.19551776]
 [5.00155648 3.39859911 1.47993231 0.2460815 ]
 [4.42052031 2.87903672 1.3855842  0.20882514]
 [4.80273233 3.20016781 1.48805402 0.2503016 ]
 [5.36090126 3.74023124 1.4985348  0.25243081]
 [4.90879014 3.28892521 1.51717562 0.26209953]
 [4.6820989  3.12115258 1.41198408 0.21884697]
 [4.34251794 2.95641673 1.08492393 0.08287986]
 [5.66151963 4.14156276 1.28795452 0.16277348]
 [5.85960752 4.23600886 1.48196707 0.24344301]
 [5.4275086  3.87100742 1.36995112 0.19816072]
 [5.09103106 3.50887425 1.43521594 0.22693854]
 [5.62144408 3.88058108 1.72215216 0.34527869]
 [5.24526768 3.65105838 1.4519108  0.23332171]
 [5.26539106 3.53834771 1.71102272 0.3420543 ]
 [5.20837272 
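
Reconstruction from k components is lossy, and the loss is tied to the eigenvalues that were dropped: the total squared reconstruction error equals (n - 1) times the sum of the discarded eigenvalues. The sketch below illustrates this on synthetic random data (not the iris data), again using `np.linalg.eigh` as a stand-in for `eig`; names such as `sq_errors` and `expected_errors` are illustrative.

```python
import numpy as np

# Synthetic correlated data: 50 samples, 4 features.
rng = np.random.default_rng(1)
data = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))

mean = data.mean(axis=0)
centered = data - mean
n = data.shape[0]

# Eigendecomposition of the covariance matrix, sorted largest-first.
vals, vecs = np.linalg.eigh(np.cov(centered, rowvar=False))
vals, vecs = vals[::-1], vecs[:, ::-1]

sq_errors, expected_errors = [], []
for k in range(1, 5):
    V_k = vecs[:, :k]
    # Project onto k components, then map back and restore the mean.
    reconstructed = centered @ V_k @ V_k.T + mean
    sq_errors.append(np.sum((data - reconstructed) ** 2))
    # Variance left out of the projection, scaled by n - 1.
    expected_errors.append((n - 1) * vals[k:].sum())

for k, (err, exp) in enumerate(zip(sq_errors, expected_errors), start=1):
    print(f"k={k}: error={err:.4f}, (n-1)*discarded eigenvalues={exp:.4f}")
```

With all four components kept, the error drops to zero up to rounding, which is why the reconstruction with k=2 in the notebook is close to, but not exactly, the original data.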

Step 9. Print the variance explained by the selected principal components for different values of k.

In [None]:
sorted_eigenvalues = np.sort(eigenvalues)[::-1]  # largest first
for k in range(2, 5):
    # Fraction of the total variance captured by the k largest eigenvalues
    explained_variance = np.sum(sorted_eigenvalues[:k]) / np.sum(sorted_eigenvalues)
    print(f"Variance explained by {k} principal components: {explained_variance:.2f}")

Variance explained by 2 principal components: 0.98
Variance explained by 3 principal components: 0.99
Variance explained by 4 principal components: 1.00
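
The same figures can be obtained in one pass with a cumulative sum, which also shows the share for k=1 that the loop above skips. This sketch hard-codes the eigenvalues printed in Step 5; `ratios` and `cumulative` are illustrative names.

```python
import numpy as np

# Eigenvalues copied from the Step 5 output, already in descending order.
eigenvalues = np.array([4.22824171, 0.24267075, 0.0782095, 0.02383509])

# Each eigenvalue's share of the total variance.
ratios = eigenvalues / eigenvalues.sum()

# Running total: entry k-1 is the variance explained by the first k components.
cumulative = np.cumsum(ratios)

for k, share in enumerate(cumulative, start=1):
    print(f"k={k}: {share:.4f}")
```

The first component alone accounts for roughly 92% of the variance, which explains why adding a third or fourth component changes the total so little.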


Step 10. In this text box, explain what the results mean. What did we do in this problem and why did we do it?

Your response: We centered the data at the origin and projected it onto a lower-dimensional space. PCA does not drop original variables; each principal component is a new axis, a linear combination of the four features, chosen to capture as much of the remaining variance as possible. The results show that the first 2 principal components capture about 98% of the dataset's variance, the first 3 capture about 99%, and all 4 capture 100%. The last figure is expected: the dataset has only 4 features, so 4 components span the whole space and nothing is lost. Two components therefore summarize the data with very little loss of information.

# Problem 2

Solving a System of Linear Equations Using the Matrix Inversion Method

Suppose you are given a system of linear equations of the form Ax=b, where A is a square matrix and x and b are vectors.

A = [[2, 1, -1],[4, -6, 0],[-2, 7, 2]]

b = [5, -2, 9]

(You are solving these three equations:

2x + y - z = 5

4x - 6y = -2

-2x + 7y + 2z = 9)

Your task is to solve the system of equations using the matrix inversion approach with Python.

Print the solution vector x and verify your solution by computing Ax.

What does your solution vector x represent?
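
The matrix inversion approach rests on the short derivation below, which holds only when A is invertible (its determinant is nonzero). The symbol I for the identity matrix is notation assumed here, not taken from the problem statement.

```latex
\begin{aligned}
A x &= b \\
A^{-1} A x &= A^{-1} b \\
I x &= A^{-1} b \\
x &= A^{-1} b
\end{aligned}
```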




Step 1. Import numpy as np. Then array from numpy. Next, get inv and solve from linalg.

In [None]:
import numpy as np
from numpy import array
from numpy.linalg import inv, solve

Step 2. Create the A array and the b array. Follow Example 4-18 in the book to set up your arrays.

In [None]:
A = array([
[2, 1, -1],
[4, -6, 0],
[-2, 7, 2]
])
b = array([
    5,
    -2,
    9
])

Step 3. Use the solve function (from linalg) to solve Ax=b. This is not in the book, but if you need help you can read about it in the classroom.

In [None]:
x = np.linalg.solve(A, b)
print("The solution using solve in Python for this system of equations is         ", x)

The solution using solve in Python for this system of equations is          [2.         1.66666667 0.66666667]


Step 4. Use the matrix inversion approach to solve Ax=b.

In [None]:
x = inv(A).dot(b)
print("The solution using matrix inversion approach for this system of equations is", x)

The solution using matrix inversion approach for this system of equations is [2.         1.66666667 0.66666667]
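
The two approaches should agree to within floating-point rounding, and `np.allclose` is a safer way to confirm that than comparing printed digits. The sketch below repeats both computations side by side; `x_solve`, `x_inv`, and `same` are illustrative names.

```python
import numpy as np

A = np.array([
    [2, 1, -1],
    [4, -6, 0],
    [-2, 7, 2],
], dtype=float)
b = np.array([5, -2, 9], dtype=float)

# Direct solver: factorizes A rather than forming its inverse.
x_solve = np.linalg.solve(A, b)

# Matrix inversion approach: x = A^{-1} b.
x_inv = np.linalg.inv(A) @ b

# Compare with a tolerance instead of exact equality.
same = np.allclose(x_solve, x_inv)
print(x_solve, x_inv, same)
```

For small, well-conditioned systems like this one the results are interchangeable. In general, `solve` is usually preferred because it avoids computing the inverse explicitly, which is slower and can be less accurate.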


Step 5. Print Ax to verify if Ax=b.

In [None]:
Ax = np.dot(A,x)
print("This is to verify if Ax=b, so we print Ax=", Ax)

This is to verify if Ax=b, so we print Ax= [ 5. -2.  9.]


Step 6. What does the solution above represent? What are these values?

Your response: The solution vector x holds the values of the unknowns that satisfy all three equations at once: x = 2, y ≈ 1.667 (5/3), and z ≈ 0.667 (2/3). Multiplying A by this vector reproduces b = [5, -2, 9], which verifies Ax = b. The solve function and the matrix inversion method give the same result.

# Problem 3

Solving a System of Linear Equations Using the Matrix Inversion Method

Suppose you are given a system of linear equations of the form Ax=b, where A is a square matrix and x and b are vectors.

A = [[3, -1, 2, 0],[2, 4, 0, 1],[-1, 3, 5, -2],[0, 2, -1, 4]]

b = [4, 3, 8, -1]

Your task is to solve the system of equations using the matrix inversion approach with Python.

Print the solution vector x and verify your solution by computing Ax.

Step 1. Import numpy as np. Then array from numpy. Next, get inv from linalg.

In [None]:
import numpy as np
from numpy import array
from numpy.linalg import inv

Step 2. Define matrix A and vector b as arrays.

In [None]:
A = np.array([
    [3, -1, 2, 0],
    [2, 4, 0, 1],
    [-1, 3, 5, -2],
    [0, 2, -1, 4]
    ])

b = np.array([4, 3, 8, -1])

Step 3. Use the matrix inversion approach to solve Ax=b.

In [None]:
x = inv(A).dot(b)

print("The solution using matrix inversion approach for this system of equations is", x)

The solution using matrix inversion approach for this system of equations is [ 0.59150327  0.49346405  1.35947712 -0.15686275]


Step 4. Verify the solution is correct. Look at problem 2 to perform the dot product.

In [None]:
Ax = np.dot(A,x)
print("This is to verify if Ax=b, so we print Ax=", Ax)

This is to verify if Ax=b, so we print Ax= [ 4.  3.  8. -1.]
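
The inversion approach only works when A is invertible, so it can be worth checking the determinant before inverting and measuring the residual afterwards. The sketch below does both for this system; `det_A`, `residual_norm`, and `verified` are illustrative names.

```python
import numpy as np

A = np.array([
    [3, -1, 2, 0],
    [2, 4, 0, 1],
    [-1, 3, 5, -2],
    [0, 2, -1, 4],
], dtype=float)
b = np.array([4, 3, 8, -1], dtype=float)

# A nonzero determinant means A has an inverse.
det_A = np.linalg.det(A)

# Matrix inversion approach: x = A^{-1} b.
x = np.linalg.inv(A) @ b

# The residual Ax - b should be zero up to floating-point rounding.
residual_norm = np.linalg.norm(A @ x - b)
verified = np.allclose(A @ x, b)
print(det_A, residual_norm, verified)
```

Printing `Ax` can hide small rounding differences, so a tolerance-based check such as `np.allclose` is the more reliable form of verification.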


Step 5. What does this solution represent? Enter your response in the text box below.

Your response: The solution vector x holds the values of the four unknowns that satisfy all four equations simultaneously: approximately 0.592, 0.493, 1.359, and -0.157. Multiplying A by x in Step 4 returns [4, 3, 8, -1], which equals b and confirms that Ax = b.