# Metadata

**Topic** - algorithm problems using python

**Target Number of Turns (User + Assistant) - 1**

**Suggested Conversation :** - How can one implement Principal Component Analysis (PCA) from scratch in Python to reduce the dimensionality of a dataset, ensuring that the resulting components retain the maximum possible variance from the original features? Provide a code example that includes data standardization, computation of the covariance matrix, eigenvalue decomposition, and projection of the original dataset onto the new feature space.

# Conversation

**User**

Implement PCA (Principal Component Analysis) from scratch in Python to reduce the dimensionality of a given dataset retaining the maximum possible variance from the original features.


**Assistant**

Certainly! PCA (Principal Component Analysis) is a statistical technique for dimensionality reduction: it projects the data onto the orthogonal directions along which its variance is largest. Implementing PCA from scratch in Python involves four steps: standardizing the data, computing the covariance matrix, performing an eigendecomposition of that matrix, and projecting the original dataset onto the new feature space. Here is a step-by-step guide with a code example.
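
Before the code, the same steps can be written compactly in matrix notation. The symbols are assumptions introduced here for illustration: $X$ is the $n \times d$ data matrix, $\mu$ and $\sigma$ are the per-feature sample mean and standard deviation, and $k$ is the number of retained components. When standardization is turned off, $Z$ is simply the centered data.

$$
\begin{aligned}
Z &= (X - \mathbf{1}\mu^{\top})\,\operatorname{diag}(\sigma)^{-1} && \text{(standardize)} \\
C &= \frac{1}{n-1}\, Z^{\top} Z && \text{(covariance matrix)} \\
C &= V \Lambda V^{\top}, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d && \text{(eigendecomposition)} \\
Y &= Z V_k && \text{(projection onto the first } k \text{ eigenvectors)} \\
r_i &= \frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j} && \text{(explained-variance ratio)}
\end{aligned}
$$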

In [None]:
# Assistant

import numpy as np

def principal_component_analysis(X, num_components, standardize=True):
    try:
        # Input validation
        if not isinstance(X, np.ndarray):
            raise TypeError("Input data must be a numpy.ndarray!")
        if len(X.shape) != 2:
            raise ValueError("Input data must be a 2D array!")
        if not isinstance(num_components, int):
            raise TypeError("Number of components must be an integer!")
        if num_components <= 0:
            raise ValueError("Number of components must be greater than 0!")
        if num_components > X.shape[1]:
            raise ValueError("Number of components must be less than or equal to number of features!")

        # Standardizing the dataset (optional) and always centering
        mean = np.mean(X, axis=0)
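        if standardize:
            std_dev = np.std(X, axis=0, ddof=1)
            # A constant feature has zero standard deviation and cannot be scaled
            if np.any(std_dev == 0):
                raise ValueError("Cannot standardize a feature with zero variance!")
            X = (X - mean) / std_dev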
        else:
            X = X - mean

        # Calculate the covariance matrix
        covariance_matrix = np.cov(X, rowvar=False) # rowvar=False means that columns represent variables


        # Compute the eigenvalues and eigenvectors of the covariance matrix.
        # eigh is designed for symmetric matrices such as a covariance matrix.
        eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)
        if np.iscomplexobj(eigenvalues):
            raise ValueError("Complex eigenvalues encountered. PCA might not be suitable for this dataset!")

        # Sort eigenvectors by their associated eigenvalues in descending order
        idx = np.argsort(eigenvalues)[::-1]
        eigenvalues = eigenvalues[idx]
        # Align eigenvectors with sorted eigenvalues
        eigenvectors = eigenvectors[:, idx]

        # Select the first num_components eigenvectors
        eigenvectors = eigenvectors[:, :num_components]

        # Transform the original matrix by projecting data onto principal components
        X_pca = np.dot(X, eigenvectors)

        total_variance = np.sum(eigenvalues)
        explained_variance = eigenvalues[:num_components] / total_variance

        return X_pca, explained_variance

    except TypeError as te:
        print(f"TypeError: {te}")
        # Return a pair so that callers can always unpack two values
        return None, None
    except ValueError as ve:
        print(f"ValueError: {ve}")
        return None, None

# Example Usage
if __name__ == "__main__":
    np.random.seed(42)
    X = np.random.randn(500, 6)  # Simulated data

    # Specify the number of components
    num_components = 2

    # Perform PCA with standardization and get the transformed data and variance explained
    transformed_data, explained_variance = principal_component_analysis(X, num_components, standardize=True)

    if transformed_data is not None and explained_variance is not None:
        print("Original data shape: ", X.shape)
        print("Transformed data shape: ", transformed_data.shape)
        print("Variance explained by each principal component: ", explained_variance)
    else:
        print("PCA computation failed.")

**Assistant**

In this code, we follow these steps to perform PCA:


* **Input Checks:** Validate that the input is a **`2D numpy array`** and that the desired number of components is a positive integer less than or equal to the number of features in the dataset.

* **Center and (Optionally) Standardize Data:** The function always centers the data by subtracting each feature's mean. If the **`standardize`** parameter is set to **`True`**, it also divides each feature by its sample standard deviation. If **`False`**, the data is only centered.

* **Compute Covariance Matrix:** Calculate the **`covariance matrix`** using the (potentially standardized) data.

* **Eigen Decomposition and Error Handling:** Perform an eigendecomposition of the covariance matrix to obtain `eigenvalues` and `eigenvectors`. A covariance matrix is real and symmetric, so its eigenvalues should be real; if complex eigenvalues are nonetheless returned, a `ValueError` is raised.

* **Sort Eigen Pairs:** The function sorts the eigenvalues in descending order and reorders the eigenvectors to match.

* **Select Principal Components:** It selects the top **`num_components`** eigenvectors (principal components) based on their associated sorted eigenvalues.

* **Project Data:** The function projects the centered (and optionally standardized) data onto the subspace spanned by the chosen principal components.

* **Variance Explained:** Calculate the proportion of total variance explained by each selected principal component.

* **Error Handling:** If **`TypeError`** or **`ValueError`** exceptions occur during any of these steps, they will be caught and an error message will be printed, indicating the problem. The function will return **`None, None`** in this case.
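
The projection and variance steps above can be sanity-checked numerically. The following is a minimal sketch, not part of the original function: `pca_sketch` is a hypothetical helper that repeats the same steps in compact form, and the random data is made up for illustration. It checks two properties that should hold for any PCA projection: the sample variance of each projected column equals the corresponding eigenvalue, and the projected columns are uncorrelated.

```python
import numpy as np

def pca_sketch(X, num_components):
    """Compact PCA on standardized data (hypothetical helper for illustration)."""
    # Standardize: zero mean and unit sample variance per feature
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    cov = np.cov(Z, rowvar=False)
    # eigh returns eigenvalues in ascending order, so reverse them
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    scores = Z @ eigenvectors[:, :num_components]
    return scores, eigenvalues

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]  # introduce correlation between two features

scores, eigenvalues = pca_sketch(X, num_components=2)

# Sample variance of each projected column, to compare with the eigenvalues
projected_variance = scores.var(axis=0, ddof=1)
# Covariance between the two projected columns, expected to be close to zero
score_cov = np.cov(scores, rowvar=False)

print("Variances match eigenvalues:", np.allclose(projected_variance, eigenvalues[:2]))
print("Components uncorrelated:", abs(score_cov[0, 1]) < 1e-10)
```

Because the data is standardized with the same `ddof=1` convention that `np.cov` uses, the eigenvalues here sum to the number of features, which is why dividing by their sum gives a proportion of total variance.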

The **`num_components`** parameter can be adjusted according to the desired number of principal components to be retained, as long as it is not greater than the number of features in the dataset. Similarly, the **`standardize`** parameter can be toggled depending on whether standardization is desired or not.
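
One common way to pick **`num_components`** is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below illustrates that idea under stated assumptions: `components_for_variance` is a hypothetical helper, the 0.95 threshold is an arbitrary default, and the synthetic data is invented for the example.

```python
import numpy as np

def components_for_variance(X, threshold=0.95):
    """Smallest component count whose cumulative explained variance
    reaches `threshold` (hypothetical helper; 0.95 is an arbitrary default)."""
    # Standardize, then take the eigenvalues of the covariance matrix
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # eigvalsh returns ascending eigenvalues; reverse to descending
    eigenvalues = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # Index of the first cumulative ratio >= threshold, converted to a count
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against rounding pushing the count past the number of features
    return min(k, X.shape[1]), cumulative

# Synthetic data: six observed features driven by two latent factors plus noise
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 6))

k, cumulative = components_for_variance(X, threshold=0.95)
print("Components needed:", k)
print("Cumulative explained variance:", np.round(cumulative, 3))
```

The returned count can then be passed as **`num_components`** to `principal_component_analysis`. A threshold-based choice is only one heuristic; inspecting the cumulative curve directly is a reasonable alternative.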