<p style="font-family: Arial; font-size:3em;color:black;"> Lab Exercise 9</p>

In [2]:
# For this example, we will use the K-Means Clustering Project dataset from Kaggle (https://www.kaggle.com/faressayah/k-means-clustering-private-vs-public-universities)
# We actually have the labels for this data set, but we will NOT use them for fitting, since K-Means is an unsupervised learning algorithm.
# As we will shortly see, we have a data frame with 777 observations on 18 variables.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
df = pd.read_csv('College_Data',index_col=0)
df.columns

Index(['apps', 'accept', 'enroll', 'top10perc', 'top25perc', 'f_undergrad',
       'p_undergrad', 'outstate', 'room_board', 'books', 'personal', 'phd',
       'terminal', 's_f_ratio', 'perc_alumni', 'expend', 'grad_rate'],
      dtype='object')

In [4]:
# Set Cazenovia College's graduation rate to 100, the maximum a percentage rate can take
df.loc['Cazenovia College', 'grad_rate'] = 100

# Try removing various columns (features) from the dataset and examine whether doing so improves or degrades K-Means model performance, or has little impact.
# Report 10 cases where one or more features were removed and indicate how each change affected model performance.

# Standardize the features
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Import SimpleImputer for handling missing values
from sklearn.impute import SimpleImputer

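# Function to perform clustering and evaluate
def evaluate_kmeans(data, features):
    """Fit a 2-cluster K-Means model on the given features and return its silhouette score."""
    X = data[features]

    # Handle missing values by replacing them with the column mean
    imputer = SimpleImputer(strategy='mean')
    X_imputed = imputer.fit_transform(X)

    # Scale the features to zero mean and unit variance so no column dominates the distances
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_imputed)

    # Cluster into two groups; random_state fixes the centroid initialization
    kmeans = KMeans(n_clusters=2, random_state=42)
    clusters = kmeans.fit_predict(X_scaled)

    # Silhouette ranges from -1 to 1; higher means better-separated clusters
    score = silhouette_score(X_scaled, clusters)
    return score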

# Base features
features = ['apps', 'accept', 'enroll', 'top10perc', 'top25perc', 'f_undergrad', 
           'p_undergrad', 'outstate', 'room_board', 'books', 'personal', 'phd', 
           'terminal', 's_f_ratio', 'perc_alumni', 'expend', 'grad_rate']
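
The impute, scale, and cluster steps inside `evaluate_kmeans` could also be chained with scikit-learn's `make_pipeline`. The sketch below is one possible arrangement, not the method used in this lab. It runs on a small synthetic frame whose column names (`small_scale`, `large_scale`) are hypothetical stand-ins, and it sets `n_init=10` explicitly, which is an assumption rather than something the lab code specifies.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the college data (hypothetical columns, not real values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "small_scale": rng.normal(10, 2, size=100),
    "large_scale": rng.normal(5000, 1500, size=100),
})
toy.loc[3, "small_scale"] = np.nan  # one missing value for the imputer to fill

# Same three steps as evaluate_kmeans, expressed as a single estimator.
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=42),  # n_init=10 is an assumption
)
labels = pipeline.fit_predict(toy)

# Score in the same scaled space the clusters were fit in:
# pipeline[:-1] is the fitted imputer + scaler without the K-Means step.
X_scaled = pipeline[:-1].transform(toy)
score = silhouette_score(X_scaled, labels)
print(f"silhouette: {score:.3f}")
```

Keeping the preprocessing and the model in one object makes it harder to accidentally score the clusters on unscaled data.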

In [5]:
# Test different feature combinations
print("Baseline (all features):", evaluate_kmeans(df, features))

Baseline (all features): 0.23186349124110617


The baseline model using all features gives us a silhouette score of 0.232, which will be our reference point for comparing feature selection impacts.
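
For context on how to read this number: for each point, the silhouette coefficient is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. `silhouette_score` averages this over all points, so values near 1 indicate compact, well-separated clusters and values near 0 indicate overlap. The sketch below illustrates the contrast on synthetic blobs, not on the college data; the helper name `blob_silhouette` and the explicit `n_init=10` are assumptions made for the illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def blob_silhouette(cluster_std):
    """Cluster two synthetic blobs with K-Means and return the silhouette score."""
    # Two blob centers 10 units apart on each axis; cluster_std controls the overlap.
    X, _ = make_blobs(
        n_samples=300,
        centers=[[-5, -5], [5, 5]],
        cluster_std=cluster_std,
        random_state=0,
    )
    labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
    return silhouette_score(X, labels)


tight_score = blob_silhouette(cluster_std=0.5)  # compact, well-separated blobs
loose_score = blob_silhouette(cluster_std=6.0)  # heavily overlapping blobs

print(f"well separated: {tight_score:.3f}")
print(f"overlapping:    {loose_score:.3f}")
```

Against that scale, a baseline of 0.232 suggests the two clusters found with all features overlap considerably.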

In [6]:
# Case 1: Remove enrollment related features
features1 = [f for f in features if f not in ['enroll', 'f_undergrad', 'p_undergrad']]
print("\nCase 1 - Without enrollment features:", evaluate_kmeans(df, features1))


Case 1 - Without enrollment features: 0.2457449371724793


Removing enrollment features slightly improved the model performance to 0.246, suggesting these features may have been adding noise.

In [7]:
# Case 2: Only academic features
features2 = ['top10perc', 'top25perc', 'phd', 'terminal', 's_f_ratio', 'grad_rate']
print("\nCase 2 - Only academic features:", evaluate_kmeans(df, features2))


Case 2 - Only academic features: 0.3027706747413162


Using only academic features improved performance to 0.303, indicating these features are strong discriminators for clustering.

In [8]:
# Case 3: Only financial features
features3 = ['outstate', 'room_board', 'books', 'personal', 'expend']
print("\nCase 3 - Only financial features:", evaluate_kmeans(df, features3))


Case 3 - Only financial features: 0.3209956280286182


Financial features alone achieved a score of 0.321, showing they are also good discriminators for clustering universities.

In [9]:
# Case 4: Without cost features
features4 = [f for f in features if f not in ['outstate', 'room_board', 'books', 'personal']]
print("\nCase 4 - Without cost features:", evaluate_kmeans(df, features4))


Case 4 - Without cost features: 0.40426435383204645


Removing cost features raised the score substantially, from the baseline of 0.232 to 0.404, suggesting cost-related features may obscure underlying patterns.

In [10]:
# Case 5: Only admission features
features5 = ['apps', 'accept', 'top10perc', 'top25perc']
print("\nCase 5 - Only admission features:", evaluate_kmeans(df, features5))


Case 5 - Only admission features: 0.4859216670070813


Admission features alone produced the best score of all ten cases, 0.486, indicating that they yield the most cleanly separated clusters.

In [11]:
# Case 6: Without faculty features
features6 = [f for f in features if f not in ['phd', 'terminal', 's_f_ratio']]
print("\nCase 6 - Without faculty features:", evaluate_kmeans(df, features6))


Case 6 - Without faculty features: 0.3809913507437334


Removing faculty features improved performance to 0.381, suggesting these features may not be essential for clustering.

In [12]:
# Case 7: Only performance indicators
features7 = ['top10perc', 'top25perc', 'grad_rate', 'perc_alumni']
print("\nCase 7 - Only performance indicators:", evaluate_kmeans(df, features7))


Case 7 - Only performance indicators: 0.3634869943741708


Performance indicators achieved a score of 0.363, showing they are effective clustering features but not as strong as admission features alone.

In [13]:
# Case 8: Without application features
features8 = [f for f in features if f not in ['apps', 'accept']]
print("\nCase 8 - Without application features:", evaluate_kmeans(df, features8))


Case 8 - Without application features: 0.21911848983425133


Removing application features lowered the score slightly, from the baseline of 0.232 to 0.219, which is consistent with these features contributing to the clustering.

In [14]:
# Case 9: Core features only
features9 = ['top10perc', 'outstate', 'phd', 's_f_ratio', 'grad_rate']
print("\nCase 9 - Core features only:", evaluate_kmeans(df, features9))


Case 9 - Core features only: 0.30595321756048016


Using just core features achieved a score of 0.306, showing that a smaller set of well-chosen features can perform better than using all features.

In [15]:
# Case 10: Minimal feature set
features10 = ['top25perc', 'outstate', 'grad_rate']
print("\nCase 10 - Minimal feature set:", evaluate_kmeans(df, features10))


Case 10 - Minimal feature set: 0.3756254730397272


Even with just three features, the model achieved a score of 0.376, demonstrating that a very minimal feature set can still provide good clustering results.

### Inferences

From these experiments, we can draw several important inferences:

1. Feature selection strongly affects clustering performance: scores ranged from 0.219 (Case 8) to 0.486 (Case 5)
2. Admission-related features (Case 5) provided the best clustering results with a score of 0.486
3. Using fewer, well-chosen features often outperformed using all features
4. Cost and enrollment features tended to reduce model performance when included
5. A minimal feature set (Case 10) still performed well, suggesting some features may be redundant

Among the combinations tested, the best approach appears to be focusing on admission-related features while excluding cost and enrollment data.
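
The case-by-case comparison above can also be run as one loop over named feature subsets, which makes the ranking easier to read. The following is a minimal sketch under stated assumptions: it uses synthetic data with hypothetical columns (`signal`, `noise_a`, `noise_b`), a simplified helper `score_subset` that skips imputation because the toy data has no missing values, and an explicit `n_init=10`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def score_subset(data, columns):
    """Scale the chosen columns, fit 2-cluster K-Means, return the silhouette score."""
    X = StandardScaler().fit_transform(data[columns])
    labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
    return silhouette_score(X, labels)


# Synthetic data: "signal" separates two groups, the "noise_*" columns do not.
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 100)
toy = pd.DataFrame({
    "signal": group * 6.0 + rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})

# Hypothetical named subsets, analogous to the ten cases above.
subsets = {
    "all columns": ["signal", "noise_a", "noise_b"],
    "signal only": ["signal"],
    "noise only": ["noise_a", "noise_b"],
}

# One score per subset, sorted from best to worst.
results = pd.Series(
    {name: score_subset(toy, cols) for name, cols in subsets.items()}
).sort_values(ascending=False)
print(results.round(3))
```

With the real data, the same pattern would take the `features1` through `features10` lists as dictionary values and call `evaluate_kmeans(df, cols)` in place of the toy helper.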

In [16]:
# Optimal feature set based on analysis
optimal_features = ['apps', 'accept', 'top10perc', 'top25perc']

# Evaluate the optimal model
optimal_score = evaluate_kmeans(df, optimal_features)
print("\nOptimal feature set score:", optimal_score)

# Display the selected features
print("\nOptimal features:")
for feature in optimal_features:
    print(f"- {feature}")


Optimal feature set score: 0.4859216670070813

Optimal features:
- apps
- accept
- top10perc
- top25perc


### Final Conclusions

The optimal feature set analysis reveals several key insights:

1. The highest-performing model used just 4 admission-related features and achieved a silhouette score of 0.486
2. These features (`apps`, `accept`, `top10perc`, `top25perc`) all relate to the selectivity and admission standards of universities
3. This suggests that, among the subsets tested, admission metrics are the most distinctive characteristics for clustering universities
4. Notably absent from the optimal set are financial features (costs, expenditures) and operational features (student-faculty ratio, enrollment)
5. The strong performance with just these 4 features indicates they capture the essential differences between university clusters

This analysis demonstrates that while universities differ across many dimensions, their admission standards and selectivity produced the best-separated two-cluster grouping of the feature sets tried here.
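
The silhouette score only measures how well separated the clusters are. It does not say whether they match any real-world grouping. The opening notes mention that labels exist for this dataset but were deliberately left out of the clustering. One optional follow-up, sketched here on synthetic data rather than on the college data, is to compare the cluster assignments with such held-out labels after fitting, for example with `adjusted_rand_score`. How the labels would be loaded is not shown in this lab, so `true_labels` below is a synthetic stand-in, and `n_init=10` is again an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with known group membership (stand-in for held-out labels).
X, true_labels = make_blobs(
    n_samples=200,
    centers=[[-4, -4], [4, 4]],
    cluster_std=1.0,
    random_state=0,
)

# Fit K-Means without ever showing it the labels.
X_scaled = StandardScaler().fit_transform(X)
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_scaled)

# Adjusted Rand index: 1.0 means identical groupings, values near 0 mean chance-level agreement.
# It ignores which cluster happens to be numbered 0 or 1.
ari = adjusted_rand_score(true_labels, cluster_labels)
print(f"adjusted Rand index: {ari:.3f}")
```

A feature subset with a high silhouette score but a low adjusted Rand index would be separating universities along some dimension other than the labeled one.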