<p style="font-family: Arial; font-size:3em;color:black;"> Lab Exercise 9</p>

In [14]:
# For this example, we will use the K-Means Clustering Project dataset from Kaggle (https://www.kaggle.com/faressayah/k-means-clustering-private-vs-public-universities)
# The dataset comes with labels, but we will NOT pass them to the KMeans algorithm, because K-Means is an unsupervised learning method.
# As we will shortly see, we have a data frame with 777 observations on 18 variables.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [15]:
df = pd.read_csv('College_Data',index_col=0)
df.columns

Index(['apps', 'accept', 'enroll', 'top10perc', 'top25perc', 'f_undergrad',
       'p_undergrad', 'outstate', 'room_board', 'books', 'personal', 'phd',
       'terminal', 's_f_ratio', 'perc_alumni', 'expend', 'grad_rate'],
      dtype='object')

In [16]:
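# Set Cazenovia College's graduation rate to 100 (presumably correcting an out-of-range value; a rate cannot exceed 100%)
df.loc['Cazenovia College', 'grad_rate'] = 100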

# Try removing various columns (features) from the dataset and examine whether doing so improves or degrades your K-Means model performance, or has little impact.
# Report 10 cases where you removed one or more features and indicate how each removal affected the model performance.

# Imports for feature scaling, clustering, and cluster-quality evaluation
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# SimpleImputer handles missing values
from sklearn.impute import SimpleImputer

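# Cluster on a feature subset and score the result
def evaluate_kmeans(data, features):
    """Return the silhouette score of a 2-cluster K-Means fit on `features`."""
    X = data[features]

    # Replace missing values with the column mean
    imputer = SimpleImputer(strategy='mean')
    X_imputed = imputer.fit_transform(X)

    # Standardize so every feature has zero mean and unit variance
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_imputed)

    # Fixed seed keeps the cluster assignment reproducible across runs
    kmeans = KMeans(n_clusters=2, random_state=42)
    clusters = kmeans.fit_predict(X_scaled)

    # Silhouette ranges from -1 to 1; higher means better-separated clusters
    return silhouette_score(X_scaled, clusters)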

# Base features
features = ['apps', 'accept', 'enroll', 'top10perc', 'top25perc', 'f_undergrad', 
           'p_undergrad', 'outstate', 'room_board', 'books', 'personal', 'phd', 
           'terminal', 's_f_ratio', 'perc_alumni', 'expend', 'grad_rate']
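
The scaling step inside `evaluate_kmeans` matters because K-Means relies on Euclidean distances, so a feature measured in thousands (such as `apps`) would otherwise dominate one measured in single or double digits (such as `s_f_ratio`). The sketch below reproduces the z-score transformation in plain Python using the population standard deviation, which is what `StandardScaler` applies. The helper `standardize` and the toy columns are hypothetical and are not taken from `College_Data`.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score a column: subtract the mean, divide by the population std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical toy columns on very different scales (not real dataset rows)
apps = [500, 1500, 3000, 12000]
s_f_ratio = [8.0, 12.0, 15.0, 21.0]

apps_z = standardize(apps)
ratio_z = standardize(s_f_ratio)

# After scaling, both columns have mean 0 and standard deviation 1,
# so neither one dominates the distance computation.
print([round(v, 2) for v in apps_z])
print([round(v, 2) for v in ratio_z])
```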

In [17]:
# Test different feature combinations
print("Baseline (all features):", evaluate_kmeans(df, features))

Baseline (all features): 0.23186349124110617


The baseline model, which uses all 17 features, yields a silhouette score of 0.232. This value is the reference point for judging how each feature subset affects clustering quality.
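
To make the silhouette score easier to interpret, here is a minimal pure-Python sketch of how the metric is computed. It is an illustration only, not the scikit-learn implementation. For each point, `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the nearest other cluster, and the point's silhouette is `(b - a) / max(a, b)`. The function name `mean_silhouette` and the toy points are assumptions made for this example.

```python
from math import dist

def mean_silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two distinct labels."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Distances from p to every other point, grouped by cluster label
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(dist(p, q))
        own = by_cluster.pop(lab, [])
        if not own:
            # A point alone in its cluster gets a silhouette of 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())  # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data: two tight pairs of points that lie far apart
points = [(0, 0), (0, 1), (10, 0), (10, 1)]

good = mean_silhouette(points, [0, 0, 1, 1])  # labels match the two groups
bad = mean_silhouette(points, [0, 1, 0, 1])   # labels cut across the groups

print(f"well-separated labelling: {good:.3f}")
print(f"mixed-up labelling:       {bad:.3f}")
```

A score close to 1 indicates compact, well-separated clusters, while a score near 0 or below indicates overlapping or misassigned points. By that yardstick, the baseline of 0.232 suggests only weakly separated clusters.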

In [18]:
# Case 1: Remove enrollment related features
features1 = [f for f in features if f not in ['enroll', 'f_undergrad', 'p_undergrad']]
print("\nCase 1 - Without enrollment features:", evaluate_kmeans(df, features1))


Case 1 - Without enrollment features: 0.2457449371724793


In [19]:
# Case 2: Only academic features
features2 = ['top10perc', 'top25perc', 'phd', 'terminal', 's_f_ratio', 'grad_rate']
print("\nCase 2 - Only academic features:", evaluate_kmeans(df, features2))


Case 2 - Only academic features: 0.3027706747413162


In [20]:
# Case 3: Only financial features
features3 = ['outstate', 'room_board', 'books', 'personal', 'expend']
print("\nCase 3 - Only financial features:", evaluate_kmeans(df, features3))


Case 3 - Only financial features: 0.3209956280286182


In [21]:
# Case 4: Without cost features
features4 = [f for f in features if f not in ['outstate', 'room_board', 'books', 'personal']]
print("\nCase 4 - Without cost features:", evaluate_kmeans(df, features4))


Case 4 - Without cost features: 0.40426435383204645


In [22]:
# Case 5: Only admission features
features5 = ['apps', 'accept', 'top10perc', 'top25perc']
print("\nCase 5 - Only admission features:", evaluate_kmeans(df, features5))


Case 5 - Only admission features: 0.4859216670070813


In [23]:
# Case 6: Without faculty features
features6 = [f for f in features if f not in ['phd', 'terminal', 's_f_ratio']]
print("\nCase 6 - Without faculty features:", evaluate_kmeans(df, features6))


Case 6 - Without faculty features: 0.3809913507437334


In [24]:
# Case 7: Only performance indicators
features7 = ['top10perc', 'top25perc', 'grad_rate', 'perc_alumni']
print("\nCase 7 - Only performance indicators:", evaluate_kmeans(df, features7))


Case 7 - Only performance indicators: 0.3634869943741708


In [25]:
# Case 8: Without application features
features8 = [f for f in features if f not in ['apps', 'accept']]
print("\nCase 8 - Without application features:", evaluate_kmeans(df, features8))


Case 8 - Without application features: 0.21911848983425133


In [26]:
# Case 9: Core features only
features9 = ['top10perc', 'outstate', 'phd', 's_f_ratio', 'grad_rate']
print("\nCase 9 - Core features only:", evaluate_kmeans(df, features9))


Case 9 - Core features only: 0.30595321756048016


In [27]:
# Case 10: Minimal feature set
features10 = ['top25perc', 'outstate', 'grad_rate']
print("\nCase 10 - Minimal feature set:", evaluate_kmeans(df, features10))


Case 10 - Minimal feature set: 0.3756254730397272
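
To report the ten cases side by side, the scores printed above can be collected and compared against the baseline. The sketch below copies the values from the cell outputs; the `threshold` of 0.02 used to flag "little impact" is an arbitrary assumption for illustration, not part of the exercise.

```python
# Silhouette scores copied from the cell outputs above
baseline = 0.23186349124110617
cases = {
    "Case 1 - Without enrollment features": 0.2457449371724793,
    "Case 2 - Only academic features": 0.3027706747413162,
    "Case 3 - Only financial features": 0.3209956280286182,
    "Case 4 - Without cost features": 0.40426435383204645,
    "Case 5 - Only admission features": 0.4859216670070813,
    "Case 6 - Without faculty features": 0.3809913507437334,
    "Case 7 - Only performance indicators": 0.3634869943741708,
    "Case 8 - Without application features": 0.21911848983425133,
    "Case 9 - Core features only": 0.30595321756048016,
    "Case 10 - Minimal feature set": 0.3756254730397272,
}

threshold = 0.02  # assumed cut-off for "little impact" (hypothetical)

def verdict(delta):
    """Classify a change in silhouette score relative to the baseline."""
    if abs(delta) < threshold:
        return "little impact"
    return "improved" if delta > 0 else "degraded"

# Rank the cases from highest to lowest silhouette score
ranked = sorted(cases.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    delta = score - baseline
    print(f"{name:<40} {score:.3f}  ({delta:+.3f}, {verdict(delta)})")
```

Under these numbers, Case 5 scores highest and Case 8 is the only case that falls below the baseline. Keep in mind that each score is computed in a different feature space, so a higher silhouette means the two clusters are better separated on those features, not necessarily that they match the private/public labels more closely.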
