# **How to select important eigenvalues (or components) in Principal Component Analysis (PCA)**

The scree plot, the Kaiser criterion, and the cumulative variance plot are three common tools for deciding how many principal components to retain in PCA.

# **Scree Plot**

  **Purpose**: Visual method to assess the importance of each principal component.

  **Construction**: Plot the eigenvalues of the components in descending order against their corresponding component number.

  **Interpretation**: Look for an "elbow" or a point where the slope of the plot changes drastically from steep to shallow. The components before this elbow are considered significant because they explain a large portion of the variance, while those after contribute little additional explanatory power.

  **Role**: Provides a straightforward and intuitive way to visualize the diminishing returns of adding more components.
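A minimal sketch of how a scree plot might be built, assuming NumPy and (for the drawing step only) matplotlib are available. The synthetic data, the seed, and the helper names `correlation_eigenvalues` and `plot_scree` are illustrative assumptions and do not come from the examples in this notebook.

```python
import numpy as np


def correlation_eigenvalues(X):
    """Eigenvalues of the correlation matrix of X, sorted in descending order."""
    corr = np.corrcoef(X, rowvar=False)      # columns are variables
    eigenvalues = np.linalg.eigvalsh(corr)   # symmetric input -> real values, ascending
    return eigenvalues[::-1]


def plot_scree(eigenvalues, ax=None):
    """Draw eigenvalue against component number (matplotlib is imported lazily)."""
    import matplotlib.pyplot as plt

    if ax is None:
        _, ax = plt.subplots()
    component_numbers = np.arange(1, len(eigenvalues) + 1)
    ax.plot(component_numbers, eigenvalues, marker="o")
    ax.set_xlabel("Component number")
    ax.set_ylabel("Eigenvalue")
    ax.set_title("Scree plot")
    return ax


# Hypothetical data: 6 observed variables driven by 2 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.5 * rng.normal(size=(500, 6))

eigenvalues = correlation_eigenvalues(X)
print(np.round(eigenvalues, 3))
```

Calling `plot_scree(eigenvalues)` and then `matplotlib.pyplot.show()` displays the curve. Where exactly the elbow falls remains a visual judgment, which is why the scree plot is usually read together with the other two criteria.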

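# **Kaiser Criterion**

  **Purpose**: Quantitative rule to retain significant components.

  **Rule**: Retain components with eigenvalues greater than 1.

  **Rationale**: When PCA is run on standardized variables (equivalently, on the correlation matrix), each original variable has variance 1. An eigenvalue greater than 1 therefore indicates that the component explains more variance than a single original variable, justifying its retention. The rule does not carry over to unstandardized data, where the eigenvalues depend on the units of the variables.

  **Role**: Offers a simple and objective criterion for selecting components, helping to avoid retaining components that do not add substantial explanatory power.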
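The rationale can be checked numerically. The sketch below assumes NumPy and uses made-up data; it shows that the eigenvalues of a correlation matrix average 1 (their sum equals the number of variables), which is what makes 1 a natural cutoff for standardized data. The function name `kaiser_retained` is hypothetical.

```python
import numpy as np


def kaiser_retained(eigenvalues, cutoff=1.0):
    """Number of components whose eigenvalue exceeds the cutoff."""
    return int(np.sum(np.asarray(eigenvalues) > cutoff))


# Hypothetical data: 5 variables, the first three sharing a common signal.
rng = np.random.default_rng(1)
signal = rng.normal(size=(300, 1))
X = rng.normal(size=(300, 5))
X[:, :3] += 2.0 * signal

# Eigenvalues of the correlation matrix, i.e. PCA on standardized variables.
eigenvalues = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]

print("Eigenvalues:", np.round(eigenvalues, 3))
print("Mean eigenvalue:", round(float(eigenvalues.mean()), 6))  # trace / number of variables
print("Components retained by the Kaiser rule:", kaiser_retained(eigenvalues))
```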

# **Variance Plot (Cumulative Variance Explained)**

  **Purpose**: Quantifies the total variance explained by the selected components.

  **Construction**: Plot the cumulative proportion of total variance explained by the components against the number of components.

  **Interpretation**: Determine the number of components needed to reach a predetermined threshold of explained variance (e.g., 90% or 95%). This helps in ensuring that enough variance is captured by the retained components while reducing dimensionality.

  **Role**: Ensures that the retained components capture a sufficient proportion of the data’s variance, aiding in maintaining the integrity of the data representation.
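As a sketch of how the threshold rule might be applied with scikit-learn, the block below fits a full PCA on the standardized Iris features, accumulates `explained_variance_ratio_`, and picks the smallest number of components that reaches 95%. Fitting on the whole dataset instead of a training split, and the 95% figure itself, are simplifying assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

threshold = 0.95  # assumed target share of total variance

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Fit all components, then accumulate the per-component variance ratios.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold.
n_components = int(np.argmax(cumulative >= threshold) + 1)

for k, share in enumerate(cumulative, start=1):
    print(f"{k} component(s): {share:.3f} of total variance")
print(f"Components needed for {threshold:.0%}: {n_components}")

# scikit-learn applies the same rule when n_components is a float in (0, 1).
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X_scaled)
print("Chosen by PCA(n_components=0.95):", pca_auto.n_components_)
```

Plotting `cumulative` against `range(1, len(cumulative) + 1)` gives the variance plot described above.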

# **How They Complement Each Other**

  **Scree Plot and Kaiser Criterion**: The scree plot gives a visual indication that the Kaiser criterion can support. For example, you may see the curve flatten after the 4th component in the scree plot and find that exactly the first four components have eigenvalues greater than 1.

  **Variance Plot and Kaiser Criterion**: While the Kaiser criterion ensures that only components with substantial variance are kept, the variance plot ensures that, collectively, these components explain an adequate amount of the total variance.

  **Scree Plot and Variance Plot**: The elbow point in the scree plot can be cross-checked against the variance plot to confirm that the components before the elbow capture most of the variance.
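The criteria do not always agree, which is the reason for checking one against another. The sketch below applies the Kaiser rule and a 90% cumulative-variance threshold to the same eigenvalues. The eigenvalues are invented for illustration, and `compare_criteria` is a hypothetical helper.

```python
import numpy as np


def compare_criteria(eigenvalues, variance_threshold=0.90):
    """Apply the Kaiser rule and a cumulative-variance threshold to the same eigenvalues."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()

    n_kaiser = int(np.sum(eigenvalues > 1.0))
    # Assumes the threshold is below 1.0 so that it is always reached;
    # argmax returns the position of the first True.
    n_variance = int(np.argmax(cumulative >= variance_threshold) + 1)
    # Share of variance captured if only the Kaiser components are kept.
    variance_at_kaiser = float(cumulative[n_kaiser - 1]) if n_kaiser > 0 else 0.0

    return {
        "n_kaiser": n_kaiser,
        "n_variance": n_variance,
        "variance_at_kaiser": variance_at_kaiser,
    }


# Hypothetical eigenvalues of an 8-variable correlation matrix (they sum to 8).
eigenvalues = [3.7, 2.0, 1.2, 0.4, 0.3, 0.2, 0.1, 0.1]
result = compare_criteria(eigenvalues)
print(result)
```

With these made-up numbers the Kaiser rule keeps 3 components, which together explain about 86% of the variance, while the 90% threshold asks for 4. A scree plot of the same values would help decide whether the 4th component is worth keeping.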

In [1]:
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset from scikit-learn's datasets module
iris = datasets.load_iris()

# Extract the feature matrix (X) and the target labels (y)
X = iris.data
y = iris.target

# Convert to a DataFrame so the columns carry their feature names
data = pd.DataFrame(data=X, columns=iris.feature_names)
data['target'] = y

# Display the first few rows of the dataset to understand its structure
print(data.head())

# Separate features and target variable
X = data.drop('target', axis=1)  # All columns except 'target'
y = data['target']  # The 'target' column

# Split the data into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features to zero mean and unit variance,
# fitting the scaler on the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Apply PCA, keeping 2 principal components for simplicity
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

# Train a RandomForestClassifier on the PCA-transformed training data
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train_pca, y_train)

# Predict the target variable on the PCA-transformed test data
y_pred = clf.predict(X_test_pca)

# Evaluate the performance
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')


   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  
Accuracy: 0.90


In [2]:
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Convert to DataFrame for consistency with previous examples
data = pd.DataFrame(data=X, columns=iris.feature_names)
data['target'] = y

# Separate features and target variable
X = data.drop('target', axis=1)  # All columns except 'target'
y = data['target']  # The 'target' column

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a classifier without PCA
clf_no_pca = RandomForestClassifier(random_state=42)
clf_no_pca.fit(X_train_scaled, y_train)

# Make predictions without PCA
y_pred_no_pca = clf_no_pca.predict(X_test_scaled)

# Evaluate the performance without PCA
accuracy_no_pca = accuracy_score(y_test, y_pred_no_pca)
print(f'Accuracy without PCA: {accuracy_no_pca:.2f}')

# Train a classifier with PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)  # Reduce to 2 principal components for simplicity
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

clf_with_pca = RandomForestClassifier(random_state=42)
clf_with_pca.fit(X_train_pca, y_train)

# Make predictions with PCA
y_pred_with_pca = clf_with_pca.predict(X_test_pca)

# Evaluate the performance with PCA
accuracy_with_pca = accuracy_score(y_test, y_pred_with_pca)
print(f'Accuracy with PCA: {accuracy_with_pca:.2f}')



Accuracy without PCA: 1.00
Accuracy with PCA: 0.90


In [14]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA

# Generate synthetic data with correlated features: 20 informative features
# plus 30 redundant ones (linear combinations of the informative features)
X, y = make_classification(n_samples=1000, n_features=50, n_informative=20, n_redundant=30, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a RandomForestClassifier without PCA
clf_no_pca = RandomForestClassifier(random_state=42)
clf_no_pca.fit(X_train, y_train)
y_pred_no_pca = clf_no_pca.predict(X_test)
accuracy_no_pca = accuracy_score(y_test, y_pred_no_pca)
print(f'Accuracy without PCA: {accuracy_no_pca:.2f}')

# Train a RandomForestClassifier with PCA
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

pca = PCA(n_components=5)  # Reduce to 5 principal components
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

clf_with_pca = RandomForestClassifier()  # no random_state here, so this accuracy can vary between runs
clf_with_pca.fit(X_train_pca, y_train)
y_pred_with_pca = clf_with_pca.predict(X_test_pca)
accuracy_with_pca = accuracy_score(y_test, y_pred_with_pca)
print(f'Accuracy with PCA: {accuracy_with_pca:.2f}')



Accuracy without PCA: 0.91
Accuracy with PCA: 0.74


The next example adjusts the scenario toward a case where PCA performs at least comparably to the baseline. The synthetic dataset now has 100 features, of which only 10 are informative and 10 are redundant, so most of the feature space is noise and the classifier has a harder time learning from the original features. PCA can help by discarding part of that noise and concentrating on the most informative directions. In the run shown below, the gap narrows (0.88 without PCA versus 0.84 with PCA), although PCA still does not overtake the baseline.

In [17]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA

# Generate synthetic data: 100 features, of which 10 are informative,
# 10 are redundant, and the remaining 80 are random noise
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10, n_redundant=10, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a RandomForestClassifier without PCA
clf_no_pca = RandomForestClassifier(random_state=42)
clf_no_pca.fit(X_train, y_train)
y_pred_no_pca = clf_no_pca.predict(X_test)
accuracy_no_pca = accuracy_score(y_test, y_pred_no_pca)
print(f'Accuracy without PCA: {accuracy_no_pca:.2f}')

# Train a RandomForestClassifier with PCA
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

pca = PCA(n_components=10)  # Reduce to 10 principal components
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

clf_with_pca = RandomForestClassifier(random_state=42)
clf_with_pca.fit(X_train_pca, y_train)
y_pred_with_pca = clf_with_pca.predict(X_test_pca)
accuracy_with_pca = accuracy_score(y_test, y_pred_with_pca)
print(f'Accuracy with PCA: {accuracy_with_pca:.2f}')


Accuracy without PCA: 0.88
Accuracy with PCA: 0.84
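One possible diagnostic for the result above is to apply the selection criteria from the first part of this notebook to the same synthetic dataset. The sketch below regenerates the data with the same parameters and seed as the cell above and reports how much variance the first 10 components hold, how many components a 90% threshold would require, and how many pass the Kaiser rule. The 90% threshold is an assumed value. Because standardization gives each of the 80 noise features unit variance, noise accounts for most of the total variance, so variance-based criteria are not guaranteed to single out the directions that matter for classification.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same data-generating parameters and split as the cell above.
X, y = make_classification(n_samples=1000, n_features=100, n_informative=10,
                           n_redundant=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_scaled = StandardScaler().fit_transform(X_train)

# Fit every component so the full eigenvalue spectrum is available.
pca_full = PCA().fit(X_train_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

variance_in_first_10 = float(cumulative[9])                    # share held by 10 components
n_for_90 = int(np.argmax(cumulative >= 0.90) + 1)              # assumed 90% threshold
n_kaiser = int(np.sum(pca_full.explained_variance_ > 1.0))     # eigenvalues above 1

print(f"Variance explained by the first 10 components: {variance_in_first_10:.2f}")
print(f"Components needed for 90% of the variance: {n_for_90}")
print(f"Components with eigenvalue > 1: {n_kaiser}")
```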


In [19]:
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

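# Truncated SVD (the factorization that underlies PCA) applied to a small
# binary transaction matrix. Given transactions: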
transactions = [["Milk", "Bread", "Eggs", "Cheese", "Yogurt"],
                ["Milk", "Bread", "Cheese", "Yogurt"],
                ["Milk", "Bread", "Butter", "Juice"],
                ["Eggs", "Bread", "Cheese", "Juice"],
                ["Milk", "Bread", "Butter", "Cheese", "Jam", "Juice"],
                ["Milk", "Eggs", "Jam", "Yogurt"]]

# Create a sorted list of unique items so the column order is the same on every run
items = sorted(set(item for sublist in transactions for item in sublist))

# Create a dictionary to map items to indices
item_to_index = {item: index for index, item in enumerate(items)}
index_to_item = {index: item for item, index in item_to_index.items()}

# Create a transaction matrix
num_transactions = len(transactions)
num_items = len(items)
transaction_matrix = np.zeros((num_transactions, num_items), dtype=float)  # svds needs a floating-point matrix

for i, transaction in enumerate(transactions):
    for item in transaction:
        j = item_to_index[item]
        transaction_matrix[i, j] = 1

# Convert the transaction matrix to a sparse matrix
sparse_transaction_matrix = csr_matrix(transaction_matrix)

# Perform a truncated Singular Value Decomposition (SVD).
# svds requires k < min(matrix.shape), so use the largest rank it allows.
k = min(num_transactions, num_items) - 1
U, Sigma, VT = svds(sparse_transaction_matrix, k=k)

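# Generate recommendations; each transaction (row) is treated as one user
def recommend_items(user_id, U, Sigma, VT, top_n=3):
    # Rebuild the user's row of item scores from the rank-k factors
    user_vector = U[user_id, :]
    scores = np.dot(np.dot(user_vector, np.diag(Sigma)), VT)
    # Sort scores from highest to lowest. Items already in the transaction are
    # not masked out, so they can appear among the recommendations.
    top_indices = np.argsort(scores)[::-1][:top_n]
    top_items = [index_to_item[index] for index in top_indices]
    return top_items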

# Example: show the top-scoring items for the first transaction (user 0).
# This is a demonstration, not a quantitative evaluation of the model.
user_id = 0
recommended_items = recommend_items(user_id, U, Sigma, VT)
print("Recommended items for user", user_id, ":", recommended_items)


Recommended items for user 0 : ['Yogurt', 'Bread', 'Milk']
