# Assignment 1

## General information

Please add a short (2-3 sentence) interpretation or explanation to each answer; do not submit only program code and outputs. Be concise and focused, especially in the age of AI. We take this seriously and will mark down excessive verbosity.

Grading rule: of the points you earn, you receive 100% if you submit by the due date (**2026-02-25 12:00 CET**), 50% if you submit within 24 hours after the due date, and 0% after that.

## Task 1: Gene expression data (3 points)

From the ISLR website, we can download a gene expression data set (`Ch10Ex11.csv`) that consists of 40 tissue samples with measurements on 1,000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group.

With only 40 observations, we would have no chance of estimating a model on all 1,000 features. However, we can reduce the dimensionality with PCA and then look at the relationship between the first few principal components (which capture part of the variance of all 1,000 features) and the outcome.

Note that, because we have only 40 observations, the centered data span a subspace of at most 39 dimensions. Think of it like trying to define a volume in 3D space using only 2 points: you can only define a line, not a full 3D space. Technically, the sklearn PCA implementation will return 40 components, but the last one will have a variance of (numerically) 0.
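The rank argument above can be checked numerically. The following is a minimal sketch on random stand-in data with the same shape as the gene data (40 samples, 1,000 features); the names `X_demo` and `pca_demo` are illustrative and not part of the assignment.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in with the same shape as the gene data (not the real data)
rng_demo = np.random.default_rng(0)
X_demo = rng_demo.normal(size=(40, 1000))

# With n_components left unset, PCA keeps min(n_samples, n_features) components
pca_demo = PCA().fit(X_demo)

n_components_demo = pca_demo.n_components_
last_variance = pca_demo.explained_variance_[-1]
second_to_last_variance = pca_demo.explained_variance_[-2]

print("Number of components:", n_components_demo)
print("Variance of the last component:", last_variance)
print("Variance of the second-to-last component:", second_to_last_variance)
```

Centering removes one degree of freedom, which is why the final component carries essentially no variance while the one before it does.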

In [8]:
# Load the data
import pandas as pd

# Read the CSV file
url = 'https://www.statlearning.com/s/Ch10Ex11.csv'
genes = pd.read_csv(url, header=None)

# Transpose so that rows are samples and columns are genes
# (.T already returns a pandas DataFrame)
genes = genes.T
print('Dimensions of genes dataframe:', genes.shape)

# Define health_status
health_status = ['healthy'] * 20 + ['diseased'] * 20

<div style="page-break-after: always"></div>

### Questions

1. Compute the variance of each feature and plot a histogram of those variances. Based on the histogram, are the features similarly scaled? Would you standardize the features before applying PCA? Explain your reasoning. (*Hint*: treat features as similarly scaled if their variances differ by at most an order of magnitude.)

2. Apply PCA to the full dataset. How many principal components are required to explain at least 90% of the total variance? How much variance is captured by the first two principal components?

3. Create a scatter plot of the first two component scores (the data projected onto the first two principal components). Color the points by patient health status. Do these components help distinguish healthy from diseased patients? Which component separates the groups more clearly?

4. For the principal component that best separates healthy and diseased patients, identify the genes with the largest absolute loadings (the strongest contributors). Briefly describe how you selected them. (*Hint*: examine the component loadings.)
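As a reminder of the sklearn attributes involved in these questions, here is a minimal sketch on hypothetical toy data (30 samples, 6 features, where features 0 and 1 share a strong common signal). It assumes that "loadings" refers to the rows of `components_`; the variable names are illustrative, and the same pattern would need to be adapted to the gene data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: features 0 and 1 share a strong common signal
rng_toy = np.random.default_rng(1)
signal = rng_toy.normal(scale=5.0, size=30)
X_toy = rng_toy.normal(size=(30, 6))
X_toy[:, 0] += signal
X_toy[:, 1] += signal

pca_toy = PCA().fit(X_toy)

# Smallest number of components whose cumulative explained variance reaches 90%
cum_var = np.cumsum(pca_toy.explained_variance_ratio_)
n_for_90 = int(np.argmax(cum_var >= 0.90)) + 1

# Features ranked by absolute loading on the first component (row 0 of components_)
loadings_pc1 = pca_toy.components_[0]
top_features = np.argsort(np.abs(loadings_pc1))[::-1][:2]

print("Components needed for 90% of the variance:", n_for_90)
print("Features with the largest absolute loadings on PC1:", top_features)
```

Sorting by absolute value matters because the sign of a principal component is arbitrary.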

## Task 2: PCA on simulated data (3 points)

In this exercise, you will work with simulated features. The simulated data have a **mixed correlation structure**: features within a group are correlated with each other, while features in different groups are independent.

In [None]:
# Data generation with mixed correlation structure
import numpy as np

rng = np.random.default_rng(seed=20260218)  # for reproducibility
n = 100
p = 50

# Create 5 groups of 10 correlated features each
# Within each group, features are correlated with each other
# Between groups, features are independent

X = np.zeros((n, p))

# Generate 5 independent latent factors (one per group)
n_groups = 5
features_per_group = 10
latent_factors = rng.normal(loc=0.0, scale=1.0, size=(n, n_groups))

for group in range(n_groups):
    start_idx = group * features_per_group
    end_idx = start_idx + features_per_group
    
    # Each feature in the group is the latent factor plus independent noise
    # This creates correlation within the group
    for i in range(features_per_group):
        noise = rng.normal(loc=0.0, scale=0.5, size=n)  # smaller noise = higher correlation
        X[:, start_idx + i] = latent_factors[:, group] + noise

print(f'Generated data shape: {X.shape}')
print(f'Data structure: {n_groups} groups of {features_per_group} correlated features each')

Generated data shape: (100, 50)
Data structure: 5 groups of 10 correlated features each


<div style="page-break-after: always"></div>

### Questions

1. Compute the correlation matrix of the generated features and visualize it as a heatmap. Describe the correlation structure you observe. Does it match the data-generating process?

2. Run PCA on the full dataset (100 samples, 50 features). How many components are needed to explain at least 90% of the variance? 

3. Plot the explained variance ratio for the first 20 components. What does the shape of this curve tell you about the data structure? (*Hint*: look for an "elbow" or plateaus in the curve.)

4. Split the data into 80–20% train/test sets. For a range of component counts, compute reconstruction errors on both sets like we did in class. Plot these errors. Based on the plot, how many components would you choose to best summarize this dataset?
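One possible way to set up the train/test reconstruction comparison is sketched below. It assumes that reconstruction error means the mean squared difference between the data and its projection back from the PCA subspace, which may differ from the definition used in class. The helper `reconstruction_mse` and the toy data are hypothetical and independent of the assignment's `X`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

def reconstruction_mse(pca, data):
    """Mean squared error between data and its reconstruction from the PCA subspace."""
    reconstructed = pca.inverse_transform(pca.transform(data))
    return float(np.mean((data - reconstructed) ** 2))

# Hypothetical toy data, not the assignment's X
rng_rec = np.random.default_rng(2)
X_rec = rng_rec.normal(size=(100, 10))
X_train, X_test = train_test_split(X_rec, test_size=0.2, random_state=0)

train_errors, test_errors = [], []
for k in range(1, X_rec.shape[1] + 1):
    pca_k = PCA(n_components=k).fit(X_train)  # fit on the training set only
    train_errors.append(reconstruction_mse(pca_k, X_train))
    test_errors.append(reconstruction_mse(pca_k, X_test))

print("Train errors:", np.round(train_errors, 3))
print("Test errors: ", np.round(test_errors, 3))
```

Fitting on the training set only means the test error measures how well the learned subspace generalizes to unseen samples.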


## Task 3: PCA on uncorrelated features (4 points)

In this task you will simulate data where the data-generating process has no correlations between features (in math terms: the population covariance is the identity matrix).
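The distinction between population and sample covariance can be illustrated with a small check. This is a sketch on hypothetical toy dimensions (5 features), not the assignment's data: it measures how far the sample covariance of independent standard normal draws is from the identity matrix for a small and a large sample size.

```python
import numpy as np

rng_cov = np.random.default_rng(3)
p_small = 5  # hypothetical toy dimension

def max_deviation_from_identity(n_samples):
    """Largest absolute entry of (sample covariance - identity) for i.i.d. N(0, 1) features."""
    data = rng_cov.normal(size=(n_samples, p_small))
    sample_cov = np.cov(data, rowvar=False)
    return float(np.max(np.abs(sample_cov - np.eye(p_small))))

dev_small_n = max_deviation_from_identity(100)
dev_large_n = max_deviation_from_identity(100_000)

print("Max deviation with n = 100:    ", dev_small_n)
print("Max deviation with n = 100000: ", dev_large_n)
```

Even when the population covariance is exactly the identity, a finite sample shows some spurious covariance, and PCA operates on the sample, not the population.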


In [None]:
# Data generation
import numpy as np

rng = np.random.default_rng(seed=20260218)  # for reproducibility
n = 100
p = 50
X = rng.normal(loc=0.0, scale=1.0, size=(n, p))  # population covariance = I_p

<div style="page-break-after: always"></div>

### Questions

1. Run PCA on the generated data (100 samples, 50 features). How many components are needed to explain at least 90% of the variance? Does this match your theoretical expectation given the data-generating process? (*Hint*: looking at the correlation heatmap like you did in the previous task could help.)

2. Split the data into an 80–20% train/test split. Fit PCA using the *train* set only. How many components are required to explain at least 90% of the variance on the training set? Compute the reconstruction error on the training set like we did in class and explain the result. Then, compute the reconstruction error on the test set and compare it to the training error.

3. For a range of component counts, compute reconstruction error on both train and test sets. Are the first few principal components equally effective on the test set as on the train set? Explain any differences.

4. Compare your findings across all three tasks: 
   - Task 1 (gene expression): high-dimensional real data
   - Task 2 (partially correlated): mixed structure
   - Task 3 (uncorrelated): no correlation structure  
   
   In which scenario is PCA most effective for dimensionality reduction? Explain why correlation structure matters for PCA's effectiveness.