# Machine Learning Homework 1
**Student Name:** Aidan Borne  
**LSU Email Address:** aborn13@lsu.edu
## Instructions
- This code frame is AI-generated (Claude AI Sonnet 4). If there is any problem, let me know ASAP. 
- You are NOT required to use this framework.
- Fill in your code in the designated sections marked with `# YOUR CODE HERE`
- Do not modify the structure of this notebook
- Make sure all cells run without errors
- Submit this completed notebook file

---
## Question 1: House Price Dataset (5 pts)

In this question, you will work with a real estate dataset to predict house prices using linear regression.

### Dataset Description
The HOUSES dataset contains recent real estate listings in San Luis Obispo county with the following fields:
- **MLS**: Multiple listing service number (unique ID)
- **Location**: City/town where the house is located
- **Price**: Most recent listing price (in dollars)
- **Bedrooms**: Number of bedrooms
- **Bathrooms**: Number of bathrooms  
- **Size**: House size in square feet
- **Price/SQ.ft**: Price per square foot
- **Status**: Type of sale (Short Sale, Foreclosure, Regular)

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression as SkLinearRegression

# Set random seed for reproducibility
np.random.seed(42)

### 1.1 Data Loading and Exploration

In [None]:
# Load the dataset
# TODO: YOUR CODE HERE: Read the RealEstate.csv file using pandas
data_path = './RealEstate.csv'  # Your path to RealEstate.csv
data = pd.read_csv(data_path)

# Display basic information about the dataset
print("Dataset shape:", data.shape)
print("\nFirst few rows:")
print(data.head())
print(f'\nTotal number of samples: {len(data)}')

### 1.2 Data Preprocessing

You need to:
1. Filter data for each status type (Short Sale, Foreclosure, Regular)
2. Handle categorical features (Location)
3. Normalize/standardize continuous features
4. Split into training and testing sets

In [None]:
# 1.2.1 Filter data by status
print("Available status types:", data['Status'].unique())
print("Count of each status type:")
print(data['Status'].value_counts())

# Filter the data for each status type (Short Sale, Foreclosure, Regular)
short_sale_data = data[data['Status'] == 'Short Sale']
foreclosure_data = data[data['Status'] == 'Foreclosure']
regular_data = data[data['Status'] == 'Regular']

print(f"\nShort Sale samples: {len(short_sale_data)}")
print(f"\nForeclosure samples: {len(foreclosure_data)}")
print(f"\nRegular samples: {len(regular_data)}")

In [None]:
# 1.2.2 Implement MinMaxScaler from scratch
class MinMaxScaler:
    """Min-Max scaler to normalize features to range [0, 1]"""
    
    def __init__(self, feature_range=(0, 1)):
        self.min = feature_range[0]
        self.max = feature_range[1]
        self.data_min = None
        self.data_max = None

    def fit(self, data):
        """Fit the scaler to the data"""
        # TODO: YOUR CODE HERE: Calculate min and max values for each feature
        self.data_min = data.min()
        self.data_max = data.max()

    def transform(self, data):
        """Transform the data using fitted parameters"""
        # TODO: YOUR CODE HERE: Apply min-max scaling transformation
        # Formula: (data - data_min) / (data_max - data_min) * (max - min) + min
        scaled_data = (data - self.data_min) / (self.data_max - self.data_min)
        scaled_data = scaled_data * (self.max - self.min) + self.min
        return scaled_data

    def fit_transform(self, data):
        """Fit and transform in one step"""
        self.fit(data)
        return self.transform(data)
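
The scaling step can be checked on a tiny example. The following is a minimal sketch, not the class above: `min_max_scale` and the toy `Size`/`Bedrooms` values are made up for illustration, and mapping a constant column to 0 is an assumed convention for avoiding division by zero.

```python
import pandas as pd

def min_max_scale(train: pd.DataFrame, test: pd.DataFrame):
    """Scale both frames to [0, 1] using only the training min/max."""
    data_min = train.min()
    data_range = train.max() - data_min
    # Assumption: a constant training column is mapped to 0 rather than NaN.
    data_range = data_range.replace(0, 1)
    return (train - data_min) / data_range, (test - data_min) / data_range

# Made-up toy data; Bedrooms is constant in the training frame on purpose.
train = pd.DataFrame({"Size": [1000, 2000, 3000], "Bedrooms": [2, 2, 2]})
test = pd.DataFrame({"Size": [2500, 4000], "Bedrooms": [2, 3]})

train_scaled, test_scaled = min_max_scale(train, test)
print(train_scaled)
print(test_scaled)  # test values can fall outside [0, 1]
```

Because the statistics come from the training frame only, a test value larger than the training maximum scales to more than 1. That is expected and avoids leaking test information into the scaler.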

In [None]:
# 1.2.3 Handle categorical features (Location)
# There are multiple ways to handle categorical features. Here, we'll replace each location with the average price of houses in that location.
# This approach captures the effect of location on price while keeping the feature numerical. However, pay attention to locations in the test set that may not exist in the training set.
# Alternatively, you could use one-hot encoding or label encoding.
def preprocess_location_feature(X_train, X_test, y_train):
    """
    Replace location with average price for that location
    """
    # TODO: YOUR CODE HERE: 
    # 1. Calculate average price for each location using training data
    # 2. Replace location names with these average prices
    # 3. Handle locations in test set that don't exist in training set
    
    location_price_map = y_train.groupby(X_train['Location']).mean().to_dict()['Price']

    # I'll just use the average price for all locations
    # if a location isn't in the training set
    global_average = y_train.mean()['Price']
    
    for loc in X_test['Location'].unique():
        if loc not in location_price_map:
            # handle the unseen location
            location_price_map[loc] = global_average
    
    # Apply the mapping
    X_train_processed = X_train.copy()
    X_test_processed = X_test.copy()
    
    # YOUR CODE HERE: Implement the location replacement logic
    X_train_processed['Location'] = X_train_processed['Location'].map(location_price_map)
    X_test_processed['Location'] = X_test_processed['Location'].map(location_price_map)
    
    return X_train_processed, X_test_processed
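
For comparison, the one-hot alternative mentioned in the comments could look roughly like this. It is a sketch under assumptions: `one_hot_location` is a hypothetical helper, the town names are placeholders, and locations unseen in training are encoded as all zeros.

```python
import pandas as pd

def one_hot_location(X_train, X_test, column="Location"):
    """One-hot encode `column`, keeping only categories seen in training."""
    train_encoded = pd.get_dummies(X_train, columns=[column], dtype=float)
    test_encoded = pd.get_dummies(X_test, columns=[column], dtype=float)
    # Align the test columns to the training layout: unseen categories are
    # dropped, and categories missing from the test set are filled with 0.
    test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0.0)
    return train_encoded, test_encoded

# Placeholder data, not taken from RealEstate.csv.
X_train_demo = pd.DataFrame({"Size": [1200, 1800, 950],
                             "Location": ["TownA", "TownB", "TownA"]})
X_test_demo = pd.DataFrame({"Size": [1500], "Location": ["TownC"]})

train_encoded, test_encoded = one_hot_location(X_train_demo, X_test_demo)
print(train_encoded)
print(test_encoded)
```

One caveat: a full set of one-hot columns sums to 1 in every row, so together with a bias column it makes `X^T X` singular. With the closed-form solution, passing `drop_first=True` to `pd.get_dummies` or using a pseudo-inverse would likely be needed.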

In [None]:
# 1.2.4 Split data and preprocess features
# TODO: YOUR CODE HERE: Split the short_sale_data into features (X) and target (y)
X = short_sale_data.drop(columns=['Price', 'Status'])  # Features (exclude 'Price' and 'Status')
y = short_sale_data[["Price"]]  # Target ('Price')

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Process location feature
X_train, X_test = preprocess_location_feature(X_train, X_test, y_train)

# Apply min-max scaling
scaler = MinMaxScaler()

# YOUR CODE HERE: Fit scaler on training data and transform both train and test sets
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Optional: Apply log transformation to target variable to reduce skewness
# YOUR CODE HERE: Apply log1p transformation to y_train and y_test
y_train_log = None
y_test_log = None

print("Preprocessing completed!")
print(f"Training set shape: {X_train_scaled.shape}")
print(f"Test set shape: {X_test_scaled.shape}")
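
The optional log transformation was left unused above. As a rough illustration of what it would involve, the sketch below applies `np.log1p` to made-up prices and inverts it with `np.expm1`; the values are not from the dataset.

```python
import numpy as np

# Made-up prices spanning more than an order of magnitude.
prices = np.array([150_000.0, 320_000.0, 2_500_000.0])

log_prices = np.log1p(prices)     # log(1 + price) compresses the long right tail
recovered = np.expm1(log_prices)  # exp(x) - 1 undoes log1p

print(log_prices)
print(np.allclose(recovered, prices))
```

If a model were trained on the log-transformed target, its predictions would need `np.expm1` applied before computing an MSE in dollars. Otherwise the error would be on the log scale and not comparable with the results reported here.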

### 1.3 Linear Regression Implementation

Implement linear regression from scratch using the closed-form solution (Normal Equation).

In [None]:
class LinearRegression:
    """Linear Regression using closed-form solution"""
    
    def __init__(self):
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        """
        Fit the linear regression model using Closed-form Equation
        """
        # TODO: YOUR CODE HERE: Implement the closed-form solution
        # Steps:
        # 1. Add bias column to X (column of ones)
        # 2. Use Closed-form Equation: theta = (X^T * X)^(-1) * X^T * y
        # 3. Extract bias and weights from theta

        # add the bias column
        X_copy = X.copy()
        X_copy['Bias'] = 1

        X_np = X_copy.to_numpy()
        X_transposed = X_np.transpose()

        # closed-form solution
        inverted = np.linalg.inv(X_transposed @ X_np)
        theta = inverted @ X_transposed @ y.to_numpy()

        # extract weights and bias (last element)
        self.weights, self.bias = theta[:-1], theta[-1]

    def predict(self, X):
        """Make predictions on new data"""
        # TODO: YOUR CODE HERE: Implement prediction
        # Don't forget to add bias term
        return (X @ self.weights + self.bias).to_numpy()
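
As a sanity check on the closed-form solution, the normal equation can be compared with NumPy's least-squares solver on synthetic data. This is only a sketch: the data, `true_w`, and the intercept of 4.0 are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem with known weights and intercept.
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 4.0 + rng.normal(scale=0.01, size=50)

# Append the bias column last, matching the layout used in fit() above.
Xb = np.hstack([X, np.ones((50, 1))])

# Normal equation: theta = (X^T X)^(-1) X^T y
theta_normal = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y

# Least-squares solver, which avoids forming an explicit inverse.
theta_lstsq, *_ = np.linalg.lstsq(Xb, y, rcond=None)

print(theta_normal)
print(np.allclose(theta_normal, theta_lstsq))
```

The two agree on well-conditioned data. When columns are nearly collinear, `np.linalg.inv` can fail or lose precision, so `np.linalg.lstsq` or `np.linalg.pinv` is generally the safer choice.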

In [None]:
# Train your custom linear regression model
lr_model = LinearRegression()
# TODO: YOUR CODE HERE: Fit the model and make predictions
# Use the log-transformed target if you applied it above

lr_model.fit(X_train_scaled, y_train)

# Make predictions on test set
predictions = lr_model.predict(X_test_scaled)  # YOUR CODE HERE

### 1.4 Model Evaluation

In [None]:
def compute_mse(y_true, y_pred):
    """Compute Mean Squared Error"""
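    # YOUR CODE HERE: Implement MSE calculation from scratch
    # scikit-learn provides mean_squared_error, but it is implemented manually here
    # Flatten both inputs so (n, 1) and (n,) arrays cannot broadcast to (n, n)
    errors = np.asarray(y_true, dtype=float).ravel() - np.asarray(y_pred, dtype=float).ravel()
    total = 0.0

    for val in errors:
        total += val**2

    mse = total / len(errors)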

    return mse

# Evaluate your custom model
# TODO: YOUR CODE HERE: Calculate MSE for your predictions
custom_mse = compute_mse(y_test.to_numpy(), predictions)
print(f'Mean Squared Error (Custom Implementation): {custom_mse}')

# Compare with sklearn's implementation
# I had to rename the import because it conflicted with the custom LinearRegression class
sklearn_model = SkLinearRegression()
# TODO: YOUR CODE HERE: Fit sklearn model and get predictions
sklearn_model.fit(X_train_scaled, y_train)

sklearn_predictions = sklearn_model.predict(X_test_scaled)
sklearn_mse = compute_mse(y_test.to_numpy(), sklearn_predictions)

print(f'Mean Squared Error (Sklearn Implementation): {sklearn_mse}')

# Compare coefficients
print("\n" + "="*50)
print("COEFFICIENT COMPARISON")
print("="*50)
print("Custom Implementation:")
print(f"Bias: {lr_model.bias}")
print(f"Weights: {lr_model.weights}")
print("\nSklearn Implementation:")
print(f"Bias: {sklearn_model.intercept_}")
print(f"Weights: {sklearn_model.coef_}")

### 1.5 Feature Importance Analysis

In [None]:
# Determine the most important features
# TODO: YOUR CODE HERE: Calculate feature importance based on absolute values of weights
importance = np.abs(lr_model.weights.flatten())  # Get absolute values of weights
feature_names = X_train_scaled.columns  # Get feature names (excluding 'Price' and 'Status')

# Sort features by importance
feature_importance = sorted(zip(feature_names, importance), key=lambda x: x[1], reverse=True)  # YOUR CODE HERE: Create list of (feature_name, importance) tuples and sort

print("Feature Importance (from most to least important):")
print("="*50)
for i, (feature, imp) in enumerate(feature_importance):
    print(f"{i+1}. {feature}: {imp:.4f}")

print(f"\nThe two most significant features for predicting house price are:")
print(f"1. {feature_importance[0][0]}")
print(f"2. {feature_importance[1][0]}")


**TODO for students**: Think about whether this feature importance makes sense in the context of real estate pricing.
Is the absolute value of a weight a good measure of feature importance? Why or why not?
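
One way to probe the question is to check how weight magnitudes react to a change of units. The sketch below uses synthetic data with invented coefficients (200 per square foot, 10,000 per bedroom). `fit_weights` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic houses: price depends linearly on size and bedroom count.
size_sqft = rng.uniform(800, 4000, size=100)
bedrooms = rng.integers(1, 6, size=100).astype(float)
price = 200.0 * size_sqft + 10_000.0 * bedrooms

def fit_weights(features, target):
    """Least-squares fit with a bias column; returns the weights only."""
    design = np.column_stack([features, np.ones(len(target))])
    theta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return theta[:-1]

# Same data, but size is expressed in square feet vs. thousands of square feet.
w_raw = fit_weights(np.column_stack([size_sqft, bedrooms]), price)
w_rescaled = fit_weights(np.column_stack([size_sqft / 1000.0, bedrooms]), price)

print("size in sq ft:      ", w_raw)
print("size in 1000s sq ft:", w_rescaled)
```

The predictions are identical in both cases, yet the ranking by absolute weight flips. This suggests absolute weights are only comparable when features share a scale, and even then correlated features can split or trade weight between them.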

### 1.6 Repeat for Other Status Types

**TODO for students:** Repeat the above analysis for "Foreclosure" and "Regular" status types.
Create similar code blocks below for each status type.

In [None]:
# TODO: YOUR CODE HERE: Implement analysis for "Foreclosure" status
print("FORECLOSURE ANALYSIS")
print("="*30)
# Copy and modify the code from above sections

# I just copied everything and put it into this function
def analyze_foreclosure():
    X = foreclosure_data.drop(columns=['Price', 'Status'])  # Features (exclude 'Price' and 'Status')
    y = foreclosure_data[["Price"]]  # Target ('Price')

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    X_train, X_test = preprocess_location_feature(X_train, X_test, y_train)

    scaler = MinMaxScaler()

    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Train your custom linear regression model
    lr_model = LinearRegression()

    lr_model.fit(X_train_scaled, y_train)

    # Make predictions on test set
    predictions = lr_model.predict(X_test_scaled)

    custom_mse = compute_mse(y_test.to_numpy(), predictions)
    print(f'Mean Squared Error (Custom Implementation): {custom_mse}')

    # Compare with sklearn's implementation
    sklearn_model = SkLinearRegression()
    # TODO: YOUR CODE HERE: Fit sklearn model and get predictions
    sklearn_model.fit(X_train_scaled, y_train)

    sklearn_predictions = sklearn_model.predict(X_test_scaled)
    sklearn_mse = compute_mse(y_test.to_numpy(), sklearn_predictions)

    print(f'Mean Squared Error (Sklearn Implementation): {sklearn_mse}')

    # Compare coefficients
    print("\n" + "="*50)
    print("COEFFICIENT COMPARISON")
    print("="*50)
    print("Custom Implementation:")
    print(f"Bias: {lr_model.bias}")
    print(f"Weights: {lr_model.weights}")
    print("\nSklearn Implementation:")
    print(f"Bias: {sklearn_model.intercept_}")
    print(f"Weights: {sklearn_model.coef_}")

    # Determine the most important features
    importance = np.abs(lr_model.weights.flatten())  # Get absolute values of weights
    feature_names = X_train_scaled.columns  # Get feature names (excluding 'Price' and 'Status')

    # Sort features by importance
    feature_importance = sorted(zip(feature_names, importance), key=lambda x: x[1], reverse=True)  # YOUR CODE HERE: Create list of (feature_name, importance) tuples and sort

    print("\nFeature Importance (from most to least important):")
    print("="*50)
    for i, (feature, imp) in enumerate(feature_importance):
        print(f"{i+1}. {feature}: {imp:.4f}")

    print(f"\nThe two most significant features for predicting house price are:")
    print(f"1. {feature_importance[0][0]}")
    print(f"2. {feature_importance[1][0]}")

analyze_foreclosure()

In [None]:
# TODO: YOUR CODE HERE: Implement analysis for "Regular" status  
print("REGULAR ANALYSIS")
print("="*30)
# Copy and modify the code from above sections

# same as above, but with the regular data
def analyze_regular():
    X = regular_data.drop(columns=['Price', 'Status'])  # Features (exclude 'Price' and 'Status')
    y = regular_data[["Price"]]  # Target ('Price')

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    X_train, X_test = preprocess_location_feature(X_train, X_test, y_train)

    scaler = MinMaxScaler()

    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Train your custom linear regression model
    lr_model = LinearRegression()

    lr_model.fit(X_train_scaled, y_train)

    # Make predictions on test set
    predictions = lr_model.predict(X_test_scaled)

    custom_mse = compute_mse(y_test.to_numpy(), predictions)
    print(f'Mean Squared Error (Custom Implementation): {custom_mse}')

    # Compare with sklearn's implementation
    sklearn_model = SkLinearRegression()
    # TODO: YOUR CODE HERE: Fit sklearn model and get predictions
    sklearn_model.fit(X_train_scaled, y_train)

    sklearn_predictions = sklearn_model.predict(X_test_scaled)
    sklearn_mse = compute_mse(y_test.to_numpy(), sklearn_predictions)

    print(f'Mean Squared Error (Sklearn Implementation): {sklearn_mse}')

    # Compare coefficients
    print("\n" + "="*50)
    print("COEFFICIENT COMPARISON")
    print("="*50)
    print("Custom Implementation:")
    print(f"Bias: {lr_model.bias}")
    print(f"Weights: {lr_model.weights}")
    print("\nSklearn Implementation:")
    print(f"Bias: {sklearn_model.intercept_}")
    print(f"Weights: {sklearn_model.coef_}")

    # Determine the most important features
    importance = np.abs(lr_model.weights.flatten())  # Get absolute values of weights
    feature_names = X_train_scaled.columns  # Get feature names (excluding 'Price' and 'Status')

    # Sort features by importance
    feature_importance = sorted(zip(feature_names, importance), key=lambda x: x[1], reverse=True)  # YOUR CODE HERE: Create list of (feature_name, importance) tuples and sort

    print("\nFeature Importance (from most to least important):")
    print("="*50)
    for i, (feature, imp) in enumerate(feature_importance):
        print(f"{i+1}. {feature}: {imp:.4f}")

    print(f"\nThe two most significant features for predicting house price are:")
    print(f"1. {feature_importance[0][0]}")
    print(f"2. {feature_importance[1][0]}")

analyze_regular()

### 1.7 Summary and Comparison

**TODO for students:** Write a brief summary comparing the results across all three status types.

In [None]:
# TODO: YOUR CODE HERE: Create a summary comparison
print("SUMMARY COMPARISON ACROSS STATUS TYPES")
print("="*50)
# Compare MSE values, important features, etc.
print("The MSE for Regular was higher than the MSE for Short Sale, which was higher than the MSE for Foreclosure.")
print("Size and Price/SQ.ft were the most important features for every status type. Size was more important for Foreclosure, while Price/SQ.ft was")
print("more important for Regular and Short Sale. These results make sense because larger houses generally cost more, and Price/SQ.ft directly relates to the overall price.")
print("Location was always the 3rd most important feature, likely because it was replaced with the average price for all houses at that location,")
print("so it captures more information about the original price compared to features like Bedrooms or Bathrooms.")
print("MLS was always the 5th or 6th most important feature, which makes sense because it has nothing to do with the price at all,")
print("but it's interesting that MLS was more important than Bathrooms for Regular.")
print("Bedrooms and Bathrooms probably had low weights because a house with more Size usually has more Bedrooms and Bathrooms, so Size already captures most of that information.")
print("For all statuses, the MSE for the custom model was very close to the MSE for the sklearn model (apart from minor floating point differences), which indicates")
print("that the custom implementation is correct.")

---
## Question 2: Principal Component Analysis (5 pts)

In this question, you will implement PCA from scratch and use it for face recognition on the Yale Face dataset.

In [None]:
import os
import matplotlib.pyplot as plt

# Set up plotting parameters
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10

### 2.1 Data Loading and Preprocessing

In [None]:
def load_yale_faces(data_dir="hw1/yalefaces"):
    """
    Load Yale face images
    """
    # Steps:
    # 1. Find all .gif files in the directory
    # 2. Separate training images from test images (test images contain 'test' in filename)
    # 3. Load images using plt.imread()
    # 4. Downsample images by factor of 4
    
    train_imgs = []
    test_imgs = []
    train_files = []
    test_files = []
    
    if os.path.exists(data_dir):
        for root, dirs, files in os.walk(data_dir):
            for file in files:
                if file.endswith(".gif"):
                    filepath = os.path.join(root, file)
                    img = plt.imread(filepath)
                    
                    # TODO: Downsample by factor of 4 
                    img_downsampled = img[::4, ::4] # YOUR CODE HERE
                    
                    if 'test' in file:
                        test_imgs.append(img_downsampled)
                        test_files.append(file)
                    else:
                        train_imgs.append(img_downsampled)
                        train_files.append(file)
    
    return train_imgs, test_imgs, train_files, test_files

# Load the images
data_path = './yalefaces' # Your path to yalefaces
train_images, test_images, train_filenames, test_filenames = load_yale_faces(data_dir=data_path)

print(f"Number of training images: {len(train_images)}")
print(f"Number of test images: {len(test_images)}")
print("Training files:", train_filenames)
print("Test files:", test_filenames)

# Display sample images
if len(train_images) > 0:
    height, width = train_images[0].shape
    print(f"Image dimensions: {height} x {width}")
    
    # Show a few sample images
    fig, axes = plt.subplots(1, min(4, len(train_images)), figsize=(12, 3))
    if len(train_images) == 1:
        axes = [axes]
    for i in range(min(4, len(train_images))):
        axes[i].imshow(train_images[i], cmap='gray')
        axes[i].set_title(f"Training Image {i+1}")
        axes[i].axis('off')
    plt.tight_layout()
    plt.show()
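
The loader downsamples by keeping every fourth pixel. An alternative, sketched below on a synthetic 8x8 array, is to average each 4x4 block, which tends to reduce aliasing. The block-averaging variant is an assumption for comparison, not what the notebook uses.

```python
import numpy as np

# Synthetic 8x8 "image" with values 0..63 so the results are easy to verify.
image = np.arange(64, dtype=float).reshape(8, 8)

# Strided downsampling: keep every 4th pixel in each direction.
strided = image[::4, ::4]

# Block averaging: crop to a multiple of 4, then average each 4x4 block.
h, w = image.shape
cropped = image[: h - h % 4, : w - w % 4]
block_mean = cropped.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))

print(strided)
print(block_mean)
```

Both approaches give the same output shape when the image dimensions are multiples of 4, so either could feed the PCA steps that follow.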

### 2.2 PCA Implementation

In [None]:
class PCA:
    """Principal Component Analysis implementation from scratch"""
    
    def __init__(self, n_components):
        self.n_components = n_components
        self.mean = None
        self.components = None
        self.eigenvalues = None

    def fit(self, X):
        """
        Fit PCA on the data
        
        Args:
            X: Data matrix of shape (n_samples, n_features)
        """
        # TODO: YOUR CODE HERE: Implement PCA fitting
        # Steps:
        # 1. Center the data by subtracting the mean
        # 2. Compute covariance matrix or use SVD
        # 3. Find eigenvalues and eigenvectors
        # 4. Sort by eigenvalues in descending order
        # 5. Select top n_components eigenvectors

        # using SVD because the covariance matrix was taking several minutes to compute
        self.mean = np.mean(X, axis=0)
        X_centered = X - self.mean

        U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

        eigenvalues = (S**2) / (X.shape[0] - 1)
        idx = np.argsort(eigenvalues)[::-1]

        # select the top components based on the sorted eigenvalue indices
        self.components = Vt[idx][:self.n_components, :]
        self.eigenvalues = eigenvalues[idx][:self.n_components]

    def transform(self, X):
        """
        Transform data to PCA space
        
        Args:
            X: Data to transform
            
        Returns:
            Transformed data in PCA space
        """
        # TODO: YOUR CODE HERE: Implement transformation
        # Steps:
        # 1. Center the data using the fitted mean
        # 2. Project onto principal components

        return (X - self.mean) @ self.components.transpose()       

    def fit_transform(self, X):
        """Fit and transform in one step"""
        self.fit(X)
        return self.transform(X)
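
The SVD shortcut used in `fit` can be checked against the covariance-matrix route on a small random matrix. This is a minimal sketch with synthetic data, and the column scales are arbitrary choices that keep the eigenvalues well separated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic data set: 20 samples, 5 features with different variances.
X = rng.normal(size=(20, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])
X_centered = X - X.mean(axis=0)

# Route 1: SVD of the centered data (singular values come sorted descending).
_, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
svd_eigenvalues = S**2 / (X.shape[0] - 1)

# Route 2: eigendecomposition of the sample covariance matrix.
cov = np.cov(X_centered, rowvar=False)
eig_values, eig_vectors = np.linalg.eigh(cov)  # eigh returns ascending order
order = np.argsort(eig_values)[::-1]
eig_values, eig_vectors = eig_values[order], eig_vectors[:, order]

print(np.allclose(svd_eigenvalues, eig_values))
# Principal directions agree up to a sign flip per component.
print(np.allclose(np.abs(Vt), np.abs(eig_vectors.T), atol=1e-6))
```

For images, the covariance matrix has `n_features x n_features` entries, which is why the SVD of the much smaller centered data matrix is far cheaper when there are only a handful of samples.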

### 2.3 Eigenface Generation

In [None]:
# Convert images to feature vectors
# TODO: YOUR CODE HERE: Convert images to vectors and create data matrix
if len(train_images) > 0:
    height, width = train_images[0].shape
    
    # Shape: (n_samples, n_features) where n_features = height * width
    X_train = np.array([img.flatten() for img in train_images])
    
    print(f"Data matrix shape: {X_train.shape}")

### 2.4 Subject-wise PCA Analysis

In [None]:
def analyze_subject(subject_id, train_images, train_filenames, n_components=6):
    """
    Analyze a specific subject using PCA
    """
    # TODO: YOUR CODE HERE: 
    # 1. Filter images for the specific subject
    # 2. Convert to feature vectors
    # 3. Apply PCA
    # 4. Extract and visualize eigenfaces
    
    print(f"Analyzing Subject {subject_id}")
    
    # Filter images for this subject
    subject_images = [img for img, fname in zip(train_images, train_filenames) if f"subject{subject_id}" in fname]
    # YOUR CODE HERE: Filter train_images for the specific subject
    
    if len(subject_images) == 0:
        print(f"No images found for subject {subject_id}")
        return None, None
    
    print(f"Found {len(subject_images)} images for subject {subject_id}")
    
    # Convert to feature vectors
    X_subject = np.array([img.flatten() for img in subject_images])  # YOUR CODE HERE
    
    # Apply PCA
    pca = PCA(n_components=n_components)
    # YOUR CODE HERE: Fit PCA
    pca.fit(X_subject)

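    # Visualize eigenfaces
    # YOUR CODE HERE: Reshape components back to image dimensions and plot
    # Use this subject's image shape instead of relying on notebook globals
    height, width = subject_images[0].shape
    # SVD yields at most len(subject_images) components, possibly fewer than requested
    n_shown = pca.components.shape[0]

    fig, axes = plt.subplots(1, n_shown, figsize=(12, 8))
    axes = np.atleast_1d(axes)  # a single subplot is returned as a bare Axes
    for i in range(n_shown):
        eigenface = pca.components[i].reshape(height, width)
        axes[i].imshow(eigenface, cmap='gray')
        axes[i].set_title(f"Eigenface {i+1}")
        axes[i].axis('off')
    plt.tight_layout()
    plt.show()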
    
    return pca, subject_images

# Analyze Subject 01
print("="*50)
pca_subject01, images_subject01 = analyze_subject("01", train_images, train_filenames)

# Analyze Subject 14  
print("="*50)
pca_subject14, images_subject14 = analyze_subject("14", train_images, train_filenames)

### 2.5 Face Recognition

In [None]:
def recognize_face(test_image, test_filename, pca_models, subject_ids):
    """
    Perform face recognition using PCA
    
    Args:
        test_image: Test image to recognize
        test_filename: Filename of test image
        pca_models: Dictionary of PCA models for each subject
        subject_ids: List of subject IDs
        
    Returns:
        Recognition scores for each subject
    """
    print(f"\nRecognizing {test_filename}")
    print("-" * 30)
    
    # Convert test image to vector
    test_vector = test_image.flatten().reshape(1, -1) # YOUR CODE HERE
    scores = {}
    
    for subject_id in subject_ids:
        if pca_models[subject_id] is not None:
            # TODO: YOUR CODE HERE: 
            # 1. Project test image using this subject's PCA
            # 2. Calculate reconstruction error or similarity score
            # 3. Store the score
            pca = pca_models[subject_id]
            projected = pca.transform(test_vector)
            
            # reconstruct the image to calculate reconstruction error/similarity
            reconstructed = projected @ pca.components + pca.mean

            # reconstruction_error = np.linalg.norm(test_vector - reconstructed, 2)
            cosine_similarity = np.dot(test_vector.flatten(), reconstructed.flatten()) / (np.linalg.norm(test_vector) * np.linalg.norm(reconstructed))

            # reconstruction error is not as intuitive as the cosine similarity
            score = cosine_similarity  # YOUR CODE HERE: Calculate appropriate score
            scores[subject_id] = score
            print(f"Score for Subject {subject_id}: {score:.4f}")
    
    return scores

# Prepare PCA models dictionary
pca_models = {
    "01": pca_subject01,
    "14": pca_subject14
}

# Recognize test images
subject_ids = ["01", "14"]
recognition_results = {}

for i, (test_img, test_file) in enumerate(zip(test_images, test_filenames)):
    scores = recognize_face(test_img, test_file, pca_models, subject_ids)
    recognition_results[test_file] = scores
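
The commented-out reconstruction error is the other scoring option considered above. The sketch below shows how it could drive a decision on synthetic data: the two "subjects" are random 3-dimensional subspaces, and `fit_subspace` and `reconstruction_error` are hypothetical helpers rather than the notebook's `PCA` class.

```python
import numpy as np

def fit_subspace(X, n_components):
    """Return the mean and top principal directions of X via SVD."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def reconstruction_error(x, mean, components):
    """L2 distance between x and its projection onto the subject subspace."""
    centered = x - mean
    reconstructed = (centered @ components.T) @ components
    return float(np.linalg.norm(centered - reconstructed))

rng = np.random.default_rng(0)

# Two synthetic "subjects", each living near its own 3-dimensional subspace.
basis_a = rng.normal(size=(3, 50))
basis_b = rng.normal(size=(3, 50))
samples_a = rng.normal(size=(10, 3)) @ basis_a + 10.0
samples_b = rng.normal(size=(10, 3)) @ basis_b - 10.0

model_a = fit_subspace(samples_a, n_components=3)
model_b = fit_subspace(samples_b, n_components=3)

# A new sample drawn from subject A's subspace.
test_a = rng.normal(size=3) @ basis_a + 10.0
err_a = reconstruction_error(test_a, *model_a)
err_b = reconstruction_error(test_a, *model_b)

predicted = "A" if err_a < err_b else "B"
print(f"error under A: {err_a:.4f}, error under B: {err_b:.4f}, predicted: {predicted}")
```

Choosing the subject with the smallest error (or the largest similarity) removes the need for a hand-picked threshold, although a threshold is still useful for rejecting faces that match no known subject.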

### 2.6 Analysis and Results

In [None]:
print("\nFACE RECOGNITION RESULTS")
print("="*50)

# TODO: YOUR CODE HERE: Analyze and report the four scores as requested:
# (1) Subject 1 test image using Subject 1 eigenfaces
# (2) Subject 1 test image using Subject 14 eigenfaces  
# (3) Subject 14 test image using Subject 1 eigenfaces
# (4) Subject 14 test image using Subject 14 eigenfaces

# there's an extra test face for subject 14

for test_file, scores in recognition_results.items():
    print(f"\nTest Image: {test_file}")
    for subject_id, score in scores.items():
        print(f"  Using Subject {subject_id} eigenfaces: {score:.4f}")

### 2.7 Conclusion and Discussion

**TODO for students:** Based on your results, answer the following questions:

1. Can you recognize faces using the computed scores? How?
2. What do the scores tell you about the similarity between test images and different subjects?
3. How might you improve the face recognition accuracy?

In [None]:
# TODO: YOUR CODE HERE: Write your analysis and conclusions
print("ANALYSIS AND CONCLUSIONS")
print("="*30)
print("""
1. Face Recognition Analysis:
   The faces can be recognized using the computed scores. Because the scores represent the cosine similarity between
   the test image and its reconstruction using the PCA model of a particular subject, a higher score indicates a better 
   match for that subject. In the results, the PCA model for a subject always yielded the highest score for test images 
   matching that subject (~0.98 compared to ~0.91), so the faces could be recognized based on whether their score exceeds 
   a certain threshold (e.g., 0.95).

2. Score Interpretation:
   The scores represent the cosine similarity between the test image and its reconstruction using the PCA model of each subject.
   A higher score indicates a better match for that subject (with 1 being a perfect match), while a lower score indicates a poorer match.
   Very low scores would indicate that the test image doesn't resemble the subject at all.

3. Potential Improvements:
   We could increase the number of principal components to capture more variance in the training data, or we could use more training images
   for each subject to make the PCA model more robust. Additionally, different scoring metrics could be used to differentiate between subjects more effectively.
   Cosine similarity still yielded a score of ~0.91 for the mismatched subject, which is somewhat high since the two subjects don't look very similar.
   I experimented with using reconstruction error (L2 norm) instead of cosine similarity, which yielded a difference of ~3000 for
   the two subjects (with the right subject being scored around ~1500), but that scoring metric wasn't as intuitive.
""")

---
## Submission Checklist

Before submitting, make sure you have completed:

**Question 1:**
- [ ] Loaded and explored the dataset
- [ ] Implemented MinMaxScaler from scratch
- [ ] Handled categorical features (Location)
- [ ] Implemented LinearRegression from scratch
- [ ] Evaluated model performance using MSE
- [ ] Identified the most important features
- [ ] Repeated analysis for all three status types
- [ ] Provided comparison summary

**Question 2:**
- [ ] Loaded and preprocessed Yale face images
- [ ] Implemented PCA from scratch
- [ ] Generated eigenfaces for both subjects
- [ ] Performed face recognition on test images
- [ ] Reported all four required scores
- [ ] Provided analysis and conclusions

**General:**
- [ ] All code cells run without errors
- [ ] Results are clearly displayed and interpreted
- [ ] Code is well-commented and readable