## Exercise 1:

#### The pca.csv file was loaded into the df DataFrame. Using the StandardScaler class, the variables in X were standardized and assigned to X_std.

#### Implement the PCA algorithm using the X_std array. Reduce the result to two principal components and assign it to the X_pca variable.

#### In response, print the first ten rows of the X_pca array.

#### Steps:

- compute covariance matrix of X_std array

- find the eigenvectors and their corresponding eigenvalues for this covariance matrix

- sort eigenvectors by decreasing eigenvalues

- determine the number of components (in this case 2).

- create matrix W from selected vectors (columns as eigenvectors).

- multiply X_std by W and assign to the X_pca variable
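
In matrix form, the steps above can be sketched as follows. The symbols $n$ (number of rows), $C$, $\lambda_i$, $w_i$ and $W$ are notation introduced here for illustration; $X_{std}$ is assumed to have zero-mean columns, as produced by StandardScaler, and the $1/(n-1)$ factor matches the default of np.cov.

```latex
C = \frac{1}{n-1} X_{std}^{\top} X_{std}, \qquad
C\, w_i = \lambda_i w_i, \qquad
\lambda_1 \ge \lambda_2 \ge \lambda_3,

W = \begin{bmatrix} w_1 & w_2 \end{bmatrix}, \qquad
X_{pca} = X_{std}\, W .
```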

In [3]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


np.set_printoptions(precision=8, suppress=True, edgeitems=5, linewidth=200)
np.random.seed(42)
df = pd.read_csv('pca.csv')

X = df.copy()
y = X.pop('class')

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

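# covariance matrix (columns are variables, hence rowvar=False)

covmat = np.cov(X_std, rowvar=False)

# eigenvalues and eigenvectors (eigenvectors are the columns of the returned array)

eigenvalues, eigenvectors = np.linalg.eig(covmat)

# sort eigenvalues, and the matching eigenvector columns, in decreasing order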

sorted_indices = np.argsort(np.abs(eigenvalues))[::-1]

sorted_eigenvalues = eigenvalues[sorted_indices]
sorted_eigenvectors = eigenvectors[:, sorted_indices]

# number of components (2 in this case)
n_components = 2
W = sorted_eigenvectors[:, :n_components]

# multiply X_std by W to get the new transformed data

X_pca = X_std.dot(W)

print(X_pca[:10])

[[-2.06036006  0.2986744 ]
 [-2.1959812   0.10172707]
 [-2.36522102 -0.08074913]
 [-2.36579421 -0.20816508]
 [-2.12817063  0.20020073]
 [-1.60325585  0.4127035 ]
 [-2.32300467 -0.26268319]
 [-2.09455194  0.1857296 ]
 [-2.53503403 -0.39064128]
 [-2.23877073  0.15624518]]


#### Notes:

- np.linalg.eig:
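    - Returns Eigenvalues and Eigenvectors: This function computes both the eigenvalues and the corresponding eigenvectors of a square matrix. The eigenvectors are returned as columns, so eigenvectors[:, i] belongs to eigenvalues[i], and the eigenvalues are not necessarily sorted.
    - Non-uniqueness of Eigenvectors: An eigenvector is determined only up to a nonzero scalar factor. np.linalg.eig returns unit-length vectors, so in practice the sign of each vector remains arbitrary, and a principal component may come out flipped compared with another implementation.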
- np.cov: 
    - np.cov computes the covariance matrix of the given data, where each column represents a variable and each row represents an observation. The covariance matrix captures the pairwise covariance between each pair of variables.
    - Important to set rowvar=False: By default, np.cov assumes that each row represents a variable, not an observation. To handle datasets where columns represent variables, set rowvar=False.
- np.dot:
    - np.dot performs the dot product between two arrays.
    - Dimensionality: For 1D arrays (vectors), np.dot computes the scalar dot product, while for 2D arrays, it performs matrix multiplication. For inputs of shapes (n, k) and (k, m), the result has shape (n, m); here X_std with shape (n, 3) times W with shape (3, 2) gives X_pca with shape (n, 2).
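
The points above can be checked on a small synthetic array. The following is a minimal sketch, not part of the exercise: `rng` and `data` are made-up illustrative names, and it uses `np.linalg.eigh`, NumPy's routine for symmetric matrices, in place of `np.linalg.eig`.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 3))  # synthetic: 50 observations, 3 variables

# rowvar=False treats columns as variables -> 3x3 covariance matrix
cov_cols = np.cov(data, rowvar=False)
# the default (rowvar=True) treats rows as variables -> 50x50 matrix
cov_rows = np.cov(data)
print(cov_cols.shape, cov_rows.shape)

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov_cols)
v = eigenvectors[:, -1]   # eigenvector of the largest eigenvalue
lam = eigenvalues[-1]

# both v and -v satisfy C v = lambda v, so the sign is arbitrary
plus_ok = np.allclose(cov_cols @ v, lam * v)
minus_ok = np.allclose(cov_cols @ (-v), lam * (-v))
print(plus_ok, minus_ok)

# 2D x 2D dot product is matrix multiplication: (50, 3) . (3, 2) -> (50, 2)
projected = data.dot(eigenvectors[:, -2:])
print(projected.shape)
```

Because a covariance matrix is symmetric, `eigh` may be a reasonable alternative in the exercise as well; it always returns real values, but note the ascending order of its eigenvalues.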

## Exercise 2:
#### The PCA algorithm was implemented using the X_std array and the result was assigned to the X_pca variable.

#### Create a DataFrame called df_pca using the X_pca array and the y variable and print the first ten rows of this object to the console.

In [5]:
np.set_printoptions(precision=8, suppress=True, edgeitems=5, linewidth=200)
np.random.seed(42)
df = pd.read_csv('pca.csv')

X = df.copy()
y = X.pop('class')

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

eig_vals, eig_vecs = np.linalg.eig(np.cov(X_std, rowvar=False))
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
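# sort by eigenvalue only; comparing whole tuples would compare arrays on ties and raise an error
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)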

# reshape(-1, 1) turns each eigenvector into a column without hard-coding the number of features
W = np.hstack((eig_pairs[0][1].reshape(-1, 1), eig_pairs[1][1].reshape(-1, 1)))
X_pca = X_std.dot(W)


df_pca = pd.DataFrame(data=X_pca, columns=['pca_1', 'pca_2'])
df_pca['class'] = y
# the sign of an eigenvector is arbitrary; flip pca_2 so it matches the scikit-learn result in Exercise 3
df_pca['pca_2'] = -df_pca['pca_2']

print(df_pca.head(10))

      pca_1     pca_2  class
0 -2.060360 -0.298674    0.0
1 -2.195981 -0.101727    0.0
2 -2.365221  0.080749    0.0
3 -2.365794  0.208165    0.0
4 -2.128171 -0.200201    0.0
5 -1.603256 -0.412703    0.0
6 -2.323005  0.262683    0.0
7 -2.094552 -0.185730    0.0
8 -2.535034  0.390641    0.0
9 -2.238771 -0.156245    0.0


## Exercise 3
#### The pca.csv file was loaded into the df DataFrame. Using the StandardScaler class, the variables in the X object were standardized and assigned to the X_std variable.
#### Using the PCA class from scikit-learn, perform PCA with two components on the X_std array, build a DataFrame from the result and assign it to the df_pca variable. In response, print the first ten rows of this object (also add the 'class' column) as shown below.

In [6]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


np.set_printoptions(
    precision=8, suppress=True, edgeitems=5, linewidth=200
)
np.random.seed(42)
df = pd.read_csv('pca.csv')

X = df.copy()
y = X.pop('class')

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

pca = PCA(n_components=2)

X_pca = pca.fit_transform(X_std)

df_pca = pd.DataFrame(data=X_pca, columns=['pca_1', 'pca_2'])

df_pca['class'] = df['class']

print(df_pca.head(10))

      pca_1     pca_2  class
0 -2.060360 -0.298674    0.0
1 -2.195981 -0.101727    0.0
2 -2.365221  0.080749    0.0
3 -2.365794  0.208165    0.0
4 -2.128171 -0.200201    0.0
5 -1.603256 -0.412703    0.0
6 -2.323005  0.262683    0.0
7 -2.094552 -0.185730    0.0
8 -2.535034  0.390641    0.0
9 -2.238771 -0.156245    0.0
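
The output above matches Exercise 2 only because pca_2 was negated there. As a minimal sketch of why that can happen, the following compares a hand-rolled projection with scikit-learn's PCA on synthetic data (`raw`, `X_demo`, `manual`, `sk` and `signs` are illustrative names, and the data is made up, not taken from pca.csv). It assumes the eigenvalues are distinct, so the two results can differ only by a sign per column.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# synthetic correlated data: 100 observations, 3 variables
mixing = np.array([[2.0, 0.5, 0.1],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
raw = rng.normal(size=(100, 3)) @ mixing
X_demo = StandardScaler().fit_transform(raw)

# manual PCA: eigen-decomposition of the covariance matrix
eig_vals, eig_vecs = np.linalg.eigh(np.cov(X_demo, rowvar=False))
order = np.argsort(eig_vals)[::-1]      # indices sorted by decreasing eigenvalue
W = eig_vecs[:, order[:2]]              # top two eigenvectors as columns
manual = X_demo @ W

# scikit-learn PCA on the same standardized data
sk = PCA(n_components=2).fit_transform(X_demo)

# flip each manual column whose direction is opposite to scikit-learn's
signs = np.sign(np.sum(manual * sk, axis=0))
aligned = manual * signs
print(np.allclose(aligned, sk, atol=1e-6))
```

Flipping a component does not change the variance it explains, so either sign is a valid answer; aligning signs only matters when comparing against a reference output.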


## Exercise 4
#### Load the pca.csv file into the df DataFrame. Perform dimensionality reduction with PCA and three principal components using the scikit-learn package and the PCA class.

#### In response, print the percentage of the variance explained by these components as shown below (as DataFrame object).

In [7]:
# reuses the PCA import and the X_std array from the Exercise 3 cell
pca = PCA(n_components=3)

X_pca = pca.fit_transform(X_std)


results = pd.DataFrame(
    data={
        'explained_variance_ratio': pca.explained_variance_ratio_
    }
)


results['cumulative'] = results[
    'explained_variance_ratio'
].cumsum()


results['component'] = results.index + 1
print(results)

   explained_variance_ratio  cumulative  component
0                  0.923247    0.923247          1
1                  0.066471    0.989718          2
2                  0.010282    1.000000          3


#### Notes:

- Explained Variance Ratio is the proportion of the dataset's total variance that is captured by each principal component.
- You can use explained_variance_ratio_ to decide how many components to keep. If the first two components explain, for example, 95% of the variance, you might choose to reduce the dimensionality to just those two components, since they capture most of the important information in the data.
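
One possible way to turn that rule of thumb into code is sketched below. The ratios are copied from the Exercise 4 output; the helper name `components_needed` and the thresholds are assumptions made for illustration, and the helper assumes the threshold does not exceed the final cumulative value.

```python
import numpy as np

# ratios copied from the Exercise 4 output
explained_variance_ratio = np.array([0.923247, 0.066471, 0.010282])
cumulative = np.cumsum(explained_variance_ratio)


def components_needed(cumulative, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold.

    Assumes threshold <= cumulative[-1]; otherwise the mask is all False
    and the result would be meaningless.
    """
    # argmax on a boolean mask returns the index of the first True
    return int(np.argmax(cumulative >= threshold)) + 1


n_keep_95 = components_needed(cumulative, 0.95)
n_keep_99 = components_needed(cumulative, 0.99)
print(n_keep_95, n_keep_99)
```

With these ratios, a 95% target is met by two components, while a 99% target needs all three.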