# Exercise 6

For this exercise you can use either Python with sklearn or Weka.

- Using the UCI mushroom dataset from the last exercise, perform feature selection using a classifier evaluator. Which features are most discriminative?
- Use principal component analysis (PCA) to construct a reduced space. Which combination of features explains the most variance in the dataset?
- Do you see any overlap between the PCA features and those obtained from feature selection?

In [5]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import numpy as np
import pandas as pd

# Load dataset
mushroom_dataset = pd.read_csv('agaricus-lepiota.data')

# One-hot encode the 'edibility' column (one indicator column per class)
# This is the target variable for classification
y = pd.get_dummies(mushroom_dataset['edibility'])

# Drop the 'edibility' column to prepare the feature matrix X
X = mushroom_dataset.drop(['edibility'], axis=1)

# Convert categorical variables into dummy variables
# (chi2 requires non-negative feature values, which 0/1 indicators satisfy)
X = pd.get_dummies(X)

print(X.shape)
print(y.shape)

# Perform feature selection using chi-squared test and SelectKBest
skb = SelectKBest(chi2, k=5)
skb.fit(X, y)
X_new = skb.transform(X)

print("skb shape:", X_new.shape)

# Extract and print names of selected features
feature_mask = skb.get_support()
selected_features = X.columns[feature_mask]
print()
print("List of Top 5 Most Discriminative Features:")
for feature in selected_features:
    print(f"- {feature}")


(8124, 117)
(8124, 2)
skb shape: (8124, 5)

List of Top 5 Most Discriminative Features:
- odor_f
- odor_n
- gill-color_b
- stalk-surface-above-ring_k
- stalk-surface-below-ring_k
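
The top-5 list above hides how far apart the scores are. A minimal sketch of ranking every dummy column by its chi-squared score follows. It uses a small made-up table (`toy`, a hypothetical stand-in for the mushroom data) and a 1-D target, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data standing in for the mushroom table (not real records).
toy = pd.DataFrame({
    "edibility": ["e", "e", "e", "p", "p", "p", "e", "p"],
    "odor":      ["n", "n", "n", "f", "f", "f", "n", "f"],
    "cap-color": ["w", "y", "w", "y", "w", "y", "y", "w"],
})

# 1-D target: 1 for poisonous, 0 for edible.
y_toy = (toy["edibility"] == "p").astype(int)

# One-hot encode the features; cast to int so the values are plain 0/1.
X_toy = pd.get_dummies(toy.drop(columns="edibility")).astype(int)

# k="all" keeps every feature, so scores_ can be inspected for all of them.
selector = SelectKBest(chi2, k="all").fit(X_toy, y_toy)

# Pair each score with its column name and sort from most to least discriminative.
chi2_scores = pd.Series(selector.scores_, index=X_toy.columns).sort_values(ascending=False)
print(chi2_scores)
```

In this toy table `odor` separates the two classes perfectly while `cap-color` is balanced across them, so the two `odor_*` columns rank first. Applying the same pattern to `X` and `y` from the cell above should give the full ranking behind the top-5 list.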


In [6]:
# Code source: Gaël Varoquaux
# License: BSD 3 clause

import numpy as np

from sklearn import decomposition

print("Original space:",X.shape)
pca = decomposition.PCA(n_components=5)
pca.fit(X)
Xpca = pca.transform(X)

print("PCA space:",Xpca.shape)

most_contributing_features = []

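# For each component, record the feature with the largest loading.
# Note: np.argmax uses the signed value; a component's sign is arbitrary,
# so np.argmax(np.abs(component)) would rank by magnitude instead.
for component in pca.components_:
    feature_index = np.argmax(component)
    most_contributing_features.append(X.columns[feature_index])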
    
for feature in most_contributing_features:
    print(f"- {feature}")

Original space: (8124, 117)
PCA space: (8124, 5)
- bruises?_f
- spore-print-color_h
- habitat_g
- stalk-shape_t
- odor_n
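
The question about explained variance can be answered more directly from `explained_variance_ratio_`, and the loadings can be ranked by absolute value so that strongly negative contributions are not missed. The sketch below uses synthetic numeric data (`demo`, hypothetical) rather than the mushroom features, so it illustrates only the pattern.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 'a' and 'b' vary together (with opposite sign), 'c' is small noise.
base = rng.normal(size=200)
demo = pd.DataFrame({
    "a": 3.0 * base,
    "b": -3.0 * base + 0.1 * rng.normal(size=200),
    "c": 0.1 * rng.normal(size=200),
})

pca_demo = PCA(n_components=2).fit(demo)

# Fraction of the total variance captured by each component.
print(pca_demo.explained_variance_ratio_)

# One row per component, one column per original feature.
loadings = pd.DataFrame(pca_demo.components_, columns=demo.columns)

# Feature with the largest loading magnitude in each component.
top_by_magnitude = loadings.abs().idxmax(axis=1)
print(top_by_magnitude)
```

Here the first component is dominated by `a` and `b`, one of which has a negative loading; a plain `np.argmax` over the signed values would only ever report the positive one.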


In [7]:
overlap_features = selected_features.intersection(most_contributing_features)

print("Overlapping features:")
for feature in overlap_features:
    print(f"- {feature}")


Overlapping features:
- odor_n
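
Because both lists contain one-hot column names, an exact-name intersection can miss cases where the two methods pick different values of the same attribute. A minimal sketch that compares at the attribute level is shown below. It copies the feature names printed in the outputs above and assumes the dummy columns follow the `<attribute>_<value>` naming that `pd.get_dummies` produces; `base_attribute` is a hypothetical helper.

```python
# Feature names copied from the printed outputs above.
chi2_features = ["odor_f", "odor_n", "gill-color_b",
                 "stalk-surface-above-ring_k", "stalk-surface-below-ring_k"]
pca_features = ["bruises?_f", "spore-print-color_h", "habitat_g",
                "stalk-shape_t", "odor_n"]

def base_attribute(dummy_name):
    """Strip the trailing '_<value>' suffix added by one-hot encoding."""
    return dummy_name.rsplit("_", 1)[0]

chi2_attributes = {base_attribute(name) for name in chi2_features}
pca_attributes = {base_attribute(name) for name in pca_features}

# Attributes that appear in both selections, regardless of which value was picked.
shared_attributes = sorted(chi2_attributes & pca_attributes)
print(shared_attributes)
```

For these two lists the attribute-level overlap is still only `odor`, which agrees with the exact-name result.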
