<p style="font-family: Arial; font-size:3em;color:black;"> Lab Exercise 9</p>

In [1]:
# For this example, we use the K-Means Clustering Project dataset from Kaggle (https://www.kaggle.com/faressayah/k-means-clustering-private-vs-public-universities)
# The data set includes labels (the 'Private' column), but we do NOT use them to fit KMeans, since it is an unsupervised learning algorithm. We use them only to evaluate the clusters afterwards.
# As we will shortly see, the data frame has 777 observations on 18 variables.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix, classification_report
%matplotlib inline

In [2]:
df = pd.read_csv('College_Data',index_col=0)
df.columns

Index(['Private', 'Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc',
       'F.Undergrad', 'P.Undergrad', 'Outstate', 'Room.Board', 'Books',
       'Personal', 'PhD', 'Terminal', 'S.F.Ratio', 'perc.alumni', 'Expend',
       'Grad.Rate'],
      dtype='object')

In [3]:
# Set Cazenovia College's graduation rate to 100 (a graduation rate cannot exceed 100%)
df.loc['Cazenovia College', 'Grad.Rate'] = 100

# Convert 'Private' column to numerical values
df['Private'] = df['Private'].apply(lambda x: 1 if x == 'Yes' else 0)

# Define a function to perform scaling, KMeans clustering, and evaluation
def perform_kmeans(data, drop_columns):
    scaler = StandardScaler()
    reduced_df = data.drop(drop_columns + ['Private'], axis=1)  # Drop selected columns and 'Private'
    scaled_data = scaler.fit_transform(reduced_df)

    # Apply KMeans with 2 clusters (no random_state is set, so results can differ between runs)
    kmeans = KMeans(n_clusters=2)
    kmeans.fit(scaled_data)

    # Add cluster predictions to the DataFrame
    data['Cluster'] = kmeans.labels_

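    # Evaluate against the true labels. Cluster IDs (0/1) are arbitrary,
    # so they may be swapped relative to the 'Private' encoding.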
    print(f"Removed Features: {drop_columns}")
    print(confusion_matrix(data['Private'], data['Cluster']))
    print(classification_report(data['Private'], data['Cluster']))
    print('-' * 50)

# Test cases: 10 feature removal scenarios
feature_removals = [
    ['Top10perc'],
    ['Top25perc'],
    ['Room.Board'],
    ['F.Undergrad'],
    ['P.Undergrad'],
    ['Books'],
    ['S.F.Ratio'],
    ['PhD'],
    ['Expend'],
    ['Personal', 'perc.alumni']
]

# Loop through each feature removal case and evaluate
for features in feature_removals:
    perform_kmeans(df.copy(), features)

# Try removing various columns (features) from the dataset and examine whether it improves or degrades your K-Means model performance, or has little impact.



Removed Features: ['Top10perc']
[[114  98]
 [550  15]]
              precision    recall  f1-score   support

           0       0.17      0.54      0.26       212
           1       0.13      0.03      0.04       565

    accuracy                           0.17       777
   macro avg       0.15      0.28      0.15       777
weighted avg       0.14      0.17      0.10       777

--------------------------------------------------
Removed Features: ['Top25perc']
[[ 82 130]
 [ 39 526]]
              precision    recall  f1-score   support

           0       0.68      0.39      0.49       212
           1       0.80      0.93      0.86       565

    accuracy                           0.78       777
   macro avg       0.74      0.66      0.68       777
weighted avg       0.77      0.78      0.76       777

--------------------------------------------------




Removed Features: ['Room.Board']
[[138  74]
 [360 205]]
              precision    recall  f1-score   support

           0       0.28      0.65      0.39       212
           1       0.73      0.36      0.49       565

    accuracy                           0.44       777
   macro avg       0.51      0.51      0.44       777
weighted avg       0.61      0.44      0.46       777

--------------------------------------------------
Removed Features: ['F.Undergrad']
[[166  46]
 [339 226]]
              precision    recall  f1-score   support

           0       0.33      0.78      0.46       212
           1       0.83      0.40      0.54       565

    accuracy                           0.50       777
   macro avg       0.58      0.59      0.50       777
weighted avg       0.69      0.50      0.52       777

--------------------------------------------------




Removed Features: ['P.Undergrad']
[[146  66]
 [339 226]]
              precision    recall  f1-score   support

           0       0.30      0.69      0.42       212
           1       0.77      0.40      0.53       565

    accuracy                           0.48       777
   macro avg       0.54      0.54      0.47       777
weighted avg       0.64      0.48      0.50       777

--------------------------------------------------
Removed Features: ['Books']
[[152  60]
 [351 214]]
              precision    recall  f1-score   support

           0       0.30      0.72      0.43       212
           1       0.78      0.38      0.51       565

    accuracy                           0.47       777
   macro avg       0.54      0.55      0.47       777
weighted avg       0.65      0.47      0.49       777

--------------------------------------------------




Removed Features: ['S.F.Ratio']
[[142  70]
 [348 217]]
              precision    recall  f1-score   support

           0       0.29      0.67      0.40       212
           1       0.76      0.38      0.51       565

    accuracy                           0.46       777
   macro avg       0.52      0.53      0.46       777
weighted avg       0.63      0.46      0.48       777

--------------------------------------------------
Removed Features: ['PhD']
[[ 56 156]
 [210 355]]
              precision    recall  f1-score   support

           0       0.21      0.26      0.23       212
           1       0.69      0.63      0.66       565

    accuracy                           0.53       777
   macro avg       0.45      0.45      0.45       777
weighted avg       0.56      0.53      0.54       777

--------------------------------------------------




Removed Features: ['Expend']
[[ 75 137]
 [222 343]]
              precision    recall  f1-score   support

           0       0.25      0.35      0.29       212
           1       0.71      0.61      0.66       565

    accuracy                           0.54       777
   macro avg       0.48      0.48      0.48       777
weighted avg       0.59      0.54      0.56       777

--------------------------------------------------
Removed Features: ['Personal', 'perc.alumni']
[[132  80]
 [354 211]]
              precision    recall  f1-score   support

           0       0.27      0.62      0.38       212
           1       0.73      0.37      0.49       565

    accuracy                           0.44       777
   macro avg       0.50      0.50      0.44       777
weighted avg       0.60      0.44      0.46       777

--------------------------------------------------




# Report 10 cases where you removed one or more features and indicate how it impacted the model performance.
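Above are the 10 cases: nine remove a single feature and the last removes two ('Personal' and 'perc.alumni'). The reported accuracy is lowest (0.17) when 'Top10perc' is removed and highest (0.78) when 'Top25perc' is removed. However, K-Means assigns cluster IDs arbitrarily, so an accuracy below 0.5 only means that clusters 0 and 1 are swapped relative to the 'Private' labels. With the labels swapped, the 'Top10perc' case reaches (98 + 550) / 777 ≈ 0.83, which makes it the best-separated run rather than the worst. By the same reasoning, the remaining eight cases all fall between about 0.50 and 0.56, close to chance. Because no random_state was set, part of the difference between runs may come from random initialization rather than from the removed feature.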
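
Since cluster IDs carry no meaning of their own, one way to compare the runs fairly is to score each confusion matrix under both possible label assignments and keep the better one. The following is a minimal sketch of that idea, not part of the original exercise: the confusion matrices are copied from the outputs above, while the names `confusion`, `raw_and_aligned_accuracy`, and `results` are hypothetical helpers introduced for illustration.

```python
# Confusion matrices copied from the printed outputs above.
# Rows: true 'Private' label (0, 1); columns: assigned cluster (0, 1).
confusion = {
    "Top10perc": [[114, 98], [550, 15]],
    "Top25perc": [[82, 130], [39, 526]],
    "Room.Board": [[138, 74], [360, 205]],
    "F.Undergrad": [[166, 46], [339, 226]],
    "P.Undergrad": [[146, 66], [339, 226]],
    "Books": [[152, 60], [351, 214]],
    "S.F.Ratio": [[142, 70], [348, 217]],
    "PhD": [[56, 156], [210, 355]],
    "Expend": [[75, 137], [222, 343]],
    "Personal, perc.alumni": [[132, 80], [354, 211]],
}


def raw_and_aligned_accuracy(cm):
    """Return (raw accuracy, accuracy under the better of the two label mappings)."""
    total = sum(sum(row) for row in cm)
    raw = (cm[0][0] + cm[1][1]) / total      # cluster IDs taken at face value
    swapped = (cm[0][1] + cm[1][0]) / total  # cluster 0 <-> cluster 1
    return raw, max(raw, swapped)


results = {name: raw_and_aligned_accuracy(cm) for name, cm in confusion.items()}
for name, (raw, aligned) in results.items():
    print(f"{name:<22} raw={raw:.2f}  aligned={aligned:.2f}")
```

Taking the maximum over the two mappings is only reasonable here because there are exactly two clusters; with more clusters, a permutation-invariant score or an explicit matching step would be needed instead.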