# Exercise 6

For this exercise you can use either Python with sklearn or Weka.

* Using the UCI mushroom dataset from the last exercise, perform feature selection using a classifier evaluator. Which features are most discriminative?
* Use principal components analysis to construct a reduced space. Which combination of features explains the most variance in the dataset?
* Do you see any overlap between the PCA features and those obtained from feature selection?

# Weka
Visually, odor and gill color seem to be very discriminative.\
Selecting attributes with InfoGainAttributeEval, we get:
```
Ranked attributes:
 0.90607    5 odor
 0.4807    20 spore-print-color
 0.41698    9 gill-color
 0.31802   19 ring-type
 0.28473   12 stalk-surface-above-ring
 0.27189   13 stalk-surface-below-ring
 0.25385   14 stalk-color-above-ring
 0.24142   15 stalk-color-below-ring
 0.23015    8 gill-size
 0.20196   21 population
 0.19238    4 bruises?
 0.15683   22 habitat
 0.10088    7 gill-spacing
 0.0488     1 cap-shape
 0.03845   18 ring-number
 0.03834   11 stalk-root
 0.03605    3 cap-color
 0.02859    2 cap-surface
 0.02382   17 veil-color
 0.01417    6 gill-attachment
 0.00752   10 stalk-shape
 0         16 veil-type
 ```
Whereas PCA gives us:
```
Ranked attributes:
 1   110 habitat=l
 1    35 gill-color=n
 1    37 gill-color=p
 1    38 gill-color=w
 1    39 gill-color=h
 1    36 gill-color=g
 1    34 gill-color=k
 1    41 gill-color=e
 1    33 gill-size=b
 1    30 odor=m
 1    31 gill-attachment=a
 1    32 gill-spacing=w
 1    40 gill-color=u
 1    42 gill-color=b
 1    28 odor=y
 1    49 stalk-root=b
 1    51 stalk-surface-above-ring=s
 1    52 stalk-surface-above-ring=f
 1    53 stalk-surface-above-ring=k
 1    50 stalk-root=r
 1    48 stalk-root=c
 1    43 gill-color=r
 1    47 stalk-root=e
 1    44 gill-color=y
 1    45 gill-color=o
 1    46 stalk-shape=t
 1    29 odor=s
 1    27 odor=c
 1   109 habitat=w
 1     7 cap-surface=s
 1     9 cap-surface=f
 1    10 cap-surface=g
 1    11 cap-color=n
 1     8 cap-surface=y
 1     6 cap-shape=c
 1    13 cap-color=w
 1     5 cap-shape=k
 1     2 cap-shape=b
 1     3 cap-shape=s
 1     4 cap-shape=f
 1    12 cap-color=y
 1    14 cap-color=g
 1    26 odor=f
 1    21 bruises?=f
 1    23 odor=a
 1    24 odor=l
 1    25 odor=n
 1    22 odor=p
 1    20 cap-color=r
 1    15 cap-color=e
 1    19 cap-color=c
 1    16 cap-color=p
 1    17 cap-color=b
 1    18 cap-color=u
 1    54 stalk-surface-above-ring=y
 1    55 stalk-surface-below-ring=s
 1    56 stalk-surface-below-ring=f
 1    90 spore-print-color=n
 1    92 spore-print-color=h
 1    93 spore-print-color=w
 1    94 spore-print-color=r
 1    91 spore-print-color=u
 1    89 spore-print-color=k
 1    96 spore-print-color=y
 1    88 ring-type=n
 1    85 ring-type=e
 1    86 ring-type=l
 1    87 ring-type=f
 1    95 spore-print-color=o
 1    97 spore-print-color=b
 1    57 stalk-surface-below-ring=y
 1   104 habitat=u
 1   106 habitat=m
 1   107 habitat=d
 1   108 habitat=p
 1   105 habitat=g
 1   103 population=c
 1    98 population=s
 1   102 population=y
 1    99 population=n
 1   100 population=a
 1   101 population=v
 1    84 ring-type=p
 1    83 ring-number=n
 1    82 ring-number=t
 1    63 stalk-color-above-ring=b
 1    65 stalk-color-above-ring=o
 1    66 stalk-color-above-ring=c
 1    67 stalk-color-above-ring=y
 1    64 stalk-color-above-ring=e
 1    62 stalk-color-above-ring=n
 1    81 ring-number=o
 1    61 stalk-color-above-ring=p
 1    58 stalk-surface-below-ring=k
 1    59 stalk-color-above-ring=w
 1    60 stalk-color-above-ring=g
 1    68 stalk-color-below-ring=w
 1    69 stalk-color-below-ring=p
 1    70 stalk-color-below-ring=g
 1    71 stalk-color-below-ring=b
 1    78 veil-color=n
 1    79 veil-color=o
 1    80 veil-color=y
 1    77 veil-color=w
 1    76 stalk-color-below-ring=c
 1    75 stalk-color-below-ring=o
 1    72 stalk-color-below-ring=n
 1    73 stalk-color-below-ring=e
 1    74 stalk-color-below-ring=y
 1     1 cap-shape=x
```

There is indeed some overlap. Gill color is prominent in both evaluations, and odor, the most discriminative attribute under feature selection, also has several of its values high up in the PCA list.

On the other hand, habitat=l is the top entry in the PCA list, although habitat ranks only 12th by information gain. Similarly, gill-attachment=a appears high in the PCA list, while gill-attachment is near the bottom of the information-gain ranking. Note that every entry in the PCA list carries the same score of 1, so its ordering should be read with some caution.
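
The InfoGainAttributeEval scores listed earlier in this section are information gains: the entropy of the class minus the entropy of the class given the attribute. As a rough cross-check outside Weka, the quantity can be computed by hand. The sketch below uses a tiny made-up table rather than the mushroom data, and `entropy` and `info_gain` are illustrative helper names, not part of Weka or sklearn.

```python
import numpy as np
import pandas as pd


def entropy(labels: pd.Series) -> float:
    """Shannon entropy (in bits) of a categorical series."""
    probs = labels.value_counts(normalize=True).to_numpy()
    return float(-(probs * np.log2(probs)).sum())


def info_gain(feature: pd.Series, target: pd.Series) -> float:
    """H(target) - H(target | feature) for two categorical series."""
    conditional = 0.0
    for _, group in target.groupby(feature):
        # Weight each group's entropy by the share of rows it covers
        conditional += len(group) / len(target) * entropy(group)
    return entropy(target) - conditional


# Made-up toy rows: 'odor' separates the classes perfectly, 'veil' is constant
toy = pd.DataFrame({
    "edibility": ["e", "e", "p", "p"],
    "odor":      ["n", "n", "f", "f"],
    "veil":      ["p", "p", "p", "p"],
})

gains = {col: info_gain(toy[col], toy["edibility"]) for col in ["odor", "veil"]}
print(gains)
```

A constant attribute cannot reduce the class entropy, which is consistent with veil-type (a single value in this dataset) scoring 0 in the Weka ranking.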


# sklearn

```python
# Import the dataset from the last exercise
import pandas as pd

md = pd.read_csv('agaricus-lepiota.csv')
md.describe()
```

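|  | edibility | cap-shape | cap-surface | cap-color | bruises? | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | ... | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 |
| unique | 2 | 6 | 4 | 10 | 2 | 9 | 2 | 2 | 2 | 12 | ... | 4 | 9 | 9 | 1 | 4 | 3 | 5 | 9 | 6 | 7 |
| top | e | x | y | n | f | n | f | c | b | b | ... | s | w | w | p | w | o | p | w | v | d |
| freq | 4208 | 3656 | 3244 | 2284 | 4748 | 3528 | 7914 | 6812 | 5612 | 1728 | ... | 4936 | 4464 | 4384 | 8124 | 7924 | 7488 | 3968 | 2388 | 4040 | 3148 |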


```python
# Perform a feature selection using a classifier evaluator. Which features are most discriminative?
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import numpy as np

# Split the features from the target column
X, y = md.loc[:, md.columns != 'edibility'], md.loc[:, md.columns == 'edibility']

# One-hot encode the categorical features and the target
X_dum = pd.get_dummies(X)
y_dum = pd.get_dummies(y)

# Keep the two dummy columns with the highest chi-squared score
skb = SelectKBest(chi2, k=2)
skb.fit(X_dum, y_dum)
X_new = skb.transform(X_dum)

print(X_new.shape)

# Names of the selected columns
np.array(X_dum.columns)[skb.get_support(indices=True)]
```

```
(8124, 2)

array(['odor_f', 'odor_n'], dtype=object)
```
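
With `k=2`, SelectKBest only reports the two winning dummy columns. To see how the other columns compare, one option is to inspect the fitted `scores_` attribute directly. The sketch below does this on a small made-up frame (the `toy` data and variable names are assumptions for illustration); on the real data, `X_dum` and the edibility column would take their place.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Made-up toy rows: 'odor' tracks the class, 'cap' mostly does not
toy = pd.DataFrame({
    "edibility": ["e", "e", "e", "p", "p", "p"],
    "odor":      ["n", "n", "n", "f", "f", "f"],
    "cap":       ["x", "b", "x", "b", "x", "b"],
})

X_toy = pd.get_dummies(toy[["odor", "cap"]]).astype(int)
y_toy = toy["edibility"]

# k="all" keeps every column, so scores_ covers the full feature set
selector = SelectKBest(chi2, k="all").fit(X_toy, y_toy)

# Rank the dummy columns by chi-squared score, highest first
ranking = pd.Series(selector.scores_, index=X_toy.columns).sort_values(ascending=False)
print(ranking)
```

Sorting the full score vector makes it easier to judge whether the top features are far ahead of the rest or only marginally so.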

```python
# Principal components, reduced space. Which combination of features explains the most variance in the dataset?
from sklearn.decomposition import PCA
from sklearn import preprocessing

# Standardize the one-hot encoded features before running PCA
data_scaled = pd.DataFrame(preprocessing.scale(X_dum), columns=X_dum.columns)

pca = PCA(n_components=2)
pca.fit_transform(data_scaled)

# Loadings of each dummy column on the first two principal components
print(pd.DataFrame(pca.components_, columns=data_scaled.columns, index=['PC-1', 'PC-2']))
```

```
      cap-shape_b  cap-shape_c  cap-shape_f  cap-shape_k  cap-shape_s  \
PC-1    -0.079834    -0.001892     0.013436     0.085837    -0.012201   
PC-2     0.016743     0.008192    -0.041886     0.131896     0.001770   

      cap-shape_x  cap-surface_f  cap-surface_g  cap-surface_s  cap-surface_y  \
PC-1    -0.026957      -0.032678      -0.005794      -0.017088       0.046603   
PC-2    -0.047485      -0.118132       0.005171       0.144959      -0.028728   

      ...  population_s  population_v  population_y  habitat_d  habitat_g  \
PC-1  ...     -0.114010      0.150387     -0.003754  -0.020478  -0.075409   
PC-2  ...     -0.009443      0.083799     -0.149492  -0.049327  -0.064072   

      habitat_l  habitat_m  habitat_p  habitat_u  habitat_w  
PC-1   0.080517  -0.080901   0.135525  -0.044890  -0.025837  
PC-2   0.176585  -0.015292  -0.013921  -0.006367   0.051097  

[2 rows x 117 columns]
```
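
The printed loading matrix is truncated and 117 columns wide, so it is hard to read off which features dominate each component. One way to make this easier is to sort the absolute loadings per component and to print `explained_variance_ratio_` alongside them. The sketch below uses made-up numeric data (`x1` and `x2` are deliberately correlated) and `StandardScaler` in place of `preprocessing.scale`; both standardize the columns.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up data: x1 and x2 share a common signal, x3 is independent noise
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "x1": base + 0.05 * rng.normal(size=200),
    "x2": base + 0.05 * rng.normal(size=200),
    "x3": rng.normal(size=200),
})

scaled = StandardScaler().fit_transform(toy)
pca = PCA(n_components=2).fit(scaled)

loadings = pd.DataFrame(pca.components_, columns=toy.columns, index=["PC-1", "PC-2"])

# Share of total variance captured by each component
print("explained variance ratio:", pca.explained_variance_ratio_)

# The features with the largest absolute loading drive each component
for name, row in loadings.iterrows():
    top = row.abs().sort_values(ascending=False).head(2)
    print(name, top.to_dict())
```

The sign of a loading only indicates direction, so the absolute value is used for ranking.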


Do you see any overlap between the PCA features and those obtained from feature selection?

No.
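
The overlap question can also be checked programmatically instead of by eye. A minimal sketch, assuming "overlap" means the intersection of the chi-squared top-k columns with the k columns that have the largest absolute loading on the first principal component. The helper `top_k_overlap` and the toy columns are hypothetical, and other definitions of overlap are equally defensible.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import StandardScaler


def top_k_overlap(X: pd.DataFrame, y: pd.Series, k: int) -> set:
    """Columns in both the chi2 top-k and the PC-1 top-k (by |loading|)."""
    selector = SelectKBest(chi2, k=k).fit(X, y)
    chi2_top = set(X.columns[selector.get_support()])

    pca = PCA(n_components=1).fit(StandardScaler().fit_transform(X))
    pc1 = pd.Series(np.abs(pca.components_[0]), index=X.columns)
    pca_top = set(pc1.nlargest(k).index)

    return chi2_top & pca_top


# Made-up binary data: 'signal' predicts the class, while the two
# 'noise' columns are correlated with each other but barely with the class
X_toy = pd.DataFrame({
    "signal":  [0, 0, 0, 0, 1, 1, 1, 1],
    "noise_a": [0, 1, 0, 1, 0, 1, 0, 1],
    "noise_b": [0, 1, 0, 1, 0, 1, 0, 0],
})
y_toy = pd.Series(["e", "e", "e", "e", "p", "p", "p", "p"])

print(top_k_overlap(X_toy, y_toy, k=1))
print(top_k_overlap(X_toy, y_toy, k=2))
```

The toy data illustrates why the two methods can disagree: PCA favors directions of high shared variance regardless of the class label, whereas the chi-squared test scores each column by its association with the class.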