# Homework 3
In this homework assignment, you will implement a univariate feature selection method. 

You will be given a toy dataset called 'Car Evaluation Data Set' (see: http://archive.ics.uci.edu/ml/datasets/Car+Evaluation for details).
You are not required, but are advised, to test your code with the toy dataset or with any other dataset that contains categorical variables.

The given dataset contains six descriptive features and a target variable. Each of these is an ordinal-scale categorical variable. The target variable is named 'evaluation'.

Note that you are expected to write your own code, so DO NOT COPY AND PASTE CODE OR USE LIBRARY FUNCTIONS. The goal of the homework is not to see whether you can call library functions but to have you practice with impurity measures and feature selection techniques.


In [0]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

### Read the dataset

In [4]:
edf = pd.read_csv('careval.csv')
# display(edf.head())
# edf.info()
edf

Unnamed: 0,buying,maint,doors,persons,lug_boot,safety,evaluation
0,vhigh,vhigh,2,2,small,low,unacc
1,vhigh,vhigh,2,2,small,med,unacc
2,vhigh,vhigh,2,2,small,high,unacc
3,vhigh,vhigh,2,2,med,low,unacc
4,vhigh,vhigh,2,2,med,med,unacc
...,...,...,...,...,...,...,...
1723,low,low,5more,more,med,med,good
1724,low,low,5more,more,med,high,vgood
1725,low,low,5more,more,big,low,unacc
1726,low,low,5more,more,big,med,good


In [23]:
x = edf.evaluation.unique()
print(x)
for i in x :
    f = edf[edf.evaluation == i].shape
    print(f)

['unacc' 'acc' 'vgood' 'good']
(1210, 7)
(384, 7)
(65, 7)
(69, 7)


You will create a method called IUFS (impurity-based univariate feature selection), which selects the most informative features using a univariate feature selection scheme. It takes the dataset, the name of the target variable, the number of features to select (k), and the impurity measure as input, and it outputs the names of the k best features ranked by information gain. You are expected to implement the information gain, entropy, and Gini index functions. Note that this is a univariate selection, which means that each feature is scored individually against the target.
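For reference, the three quantities can be written as below. The notation is an assumption introduced here, not part of the assignment text: $p_c$ is the fraction of rows whose target takes class $c$, $D$ is the dataset, $D_v$ is the subset of rows where feature $X$ takes value $v$, and $I$ stands for whichever impurity measure is selected ($H$ or $\mathrm{Gini}$).

```latex
\begin{aligned}
H(Y) &= -\sum_{c} p_c \log_2 p_c \\
\mathrm{Gini}(Y) &= 1 - \sum_{c} p_c^2 \\
IG(Y, X) &= I(Y) - \sum_{v} \frac{|D_v|}{|D|}\, I(Y \mid X = v)
\end{aligned}
```

Both impurity measures are zero for a pure subset and largest when the classes are evenly distributed, so a feature with a high information gain is one whose values split the rows into purer groups.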

In [6]:
# entropy (H)

def Enntropy(feature, dataset):
    """Calculates the entropy of a feature in a given dataset.
    
    Parameters
    ----------
    feature: str
        name of the feature
    dataset: pd.DataFrame
        dataframe for the dataset
    Returns
    -------
    float
        entropy for the feature in the dataset
    """
    ##your implementation goes here
    entropy1 = 0
    Unique_Values = dataset[feature].unique()
    for i in Unique_Values :
        # proportion of rows in which the feature takes the value i
        x = dataset[dataset[feature] == i].shape[0]
        y = x/dataset.shape[0]
        # H = -sum(p * log2(p)); subtracting keeps the entropy non-negative
        entropy1 = entropy1 - y*np.log2(y)
    return entropy1


entropy = Enntropy('buying', edf) 
print(entropy)

2.0


In [7]:
# gini index (Gini)

def gini(feature, dataset):
    """Calculates the gini index of a feature in a given dataset.
    
    Parameters
    ----------
    feature: str
        name of the feature
    dataset: pd.DataFrame
        dataframe for the dataset
    Returns
    -------
    float
        gini index for the feature in the dataset
    """
    ##your implementation goes here
    k=0
    Unique_Values = dataset[feature].unique()
    for i in Unique_Values :
        # proportion of rows in which the feature takes the value i
        x = dataset[dataset[feature] == i].shape[0]
        y = x/dataset.shape[0]
        # accumulate the sum of squared proportions
        k = k + y*y
    return 1-k

ginindex = gini('buying', edf) 
print(ginindex)

0.75


In [8]:
s = Enntropy('evaluation', edf)
print(s)


1.2057409700121753
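As a quick sanity check on the two impurity measures, the sketch below recomputes them from plain probability lists. The helper names `entropy_of` and `gini_of` are hypothetical and exist only for this illustration; the class counts are the ones printed earlier for 'evaluation'. For a uniform distribution over n values, entropy equals log2(n) and the Gini index equals 1 - 1/n, and entropy is never negative.

```python
import math

def entropy_of(probabilities):
    """Shannon entropy in bits of a discrete distribution (hypothetical helper)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def gini_of(probabilities):
    """Gini index of a discrete distribution (hypothetical helper)."""
    return 1 - sum(p * p for p in probabilities)

# Four equally likely values, as with the 'buying' feature.
uniform4 = [0.25, 0.25, 0.25, 0.25]
print(entropy_of(uniform4))  # 2.0, i.e. log2(4)
print(gini_of(uniform4))     # 0.75, i.e. 1 - 1/4

# Class counts of 'evaluation' as printed earlier in the notebook.
counts = [1210, 384, 65, 69]
total = sum(counts)
class_probs = [c / total for c in counts]
print(round(entropy_of(class_probs), 4))  # 1.2057
```

A negative entropy from your own implementation is therefore a sign that the leading minus sign in the formula has been dropped.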


In [9]:
# information gain (IG)

def IG(feature, target, dataset, measure):
    """Calculates the information gain of a feature for a given target variable and a dataset.
    
    Parameters
    ----------
    feature: str
        name of the feature
    target: str
        name of the target variable
    dataset: pd.DataFrame
        dataframe for the dataset
    measure: str ('entropy' or 'gini')
        measure of impurity to be used
    Returns
    -------
    float
        information gain for the feature in the dataset for a given target variable
    """
    ##your implementation goes here
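    # pick the impurity function; both take (column name, dataframe)
    if measure == 'entropy':
        impurity = Enntropy
    elif measure == 'gini':
        impurity = gini
    else:
        raise ValueError("measure must be 'entropy' or 'gini'")
    # impurity of the target before splitting on the feature
    Impurity_before = impurity(target, dataset)
    Weighted_Impurity = 0
    Unique_Values = dataset[feature].unique()
    for i in Unique_Values :
        subset = dataset[dataset[feature] == i]
        # fraction of rows in which the feature takes the value i
        y = subset.shape[0]/dataset.shape[0]
        # impurity of the target within that subset, weighted by its size
        Weighted_Impurity = Weighted_Impurity + y*impurity(target, subset)
    Ig = Impurity_before - Weighted_Impurity
    return Ig


ig = IG('buying','evaluation', edf, 'gini')
print(ig) 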


In [20]:
def IUFS(target, dataset, k, measure='entropy'):
    """Finds k most informative features in the given dataset based on the target variable
        using information gain with the selected measure.
        
    Parameters
    ----------
    target: str
        name of the target variable
    dataset: pd.DataFrame
        dataframe for the dataset
    k: int
        number of features to return, must be less than or equal to number of descriptive features in dataset.
        in other words, 0 < k < len(dataset.columns).
    measure: str, 'entropy' or 'gini'
        measure of impurity
    Returns
    -------
    list
        returns a list of k feature names, selected based on univariate selection schema
    """
    ##your implementation goes here
    # score every descriptive feature (all columns except the target)
    scores = []
    for feature in dataset.columns :
        if feature == target :
            continue
        Information_gain = IG(feature, target, dataset, measure)
        scores.append((feature, Information_gain))
    # sort by information gain, highest first, and keep the k best names
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [feature for feature, _ in scores[:k]]

IUFS('evaluation', edf, 2, measure='entropy')


### Bonus
Improve IUFS by including an option for gain ratio. Gain ratio is an alternative to information gain and can be used with either the Gini index or the entropy measure.
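To illustrate why gain ratio can be preferable, here is a minimal sketch on a four-row toy example. It assumes the usual entropy-based definition (information gain divided by the entropy of the feature's own value distribution, often called split information) and covers only the entropy case. The helpers `entropy_of` and `gain_ratio`, and the toy columns, are hypothetical and are not part of the assignment.

```python
import math
from collections import Counter

def entropy_of(values):
    """Shannon entropy in bits of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(feature_values, target_values):
    """Entropy-based gain ratio: information gain / split information."""
    n = len(target_values)
    # Group the target values by the feature value of the same row.
    groups = {}
    for f, t in zip(feature_values, target_values):
        groups.setdefault(f, []).append(t)
    # Weighted entropy of the target after splitting on the feature.
    conditional = sum(len(g) / n * entropy_of(g) for g in groups.values())
    info_gain = entropy_of(target_values) - conditional
    # Split information is the entropy of the feature itself.
    split_info = entropy_of(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0

label = ['unacc', 'unacc', 'acc', 'acc']
safety = ['low', 'low', 'high', 'high']  # two values, separates the classes
row_id = ['a', 'b', 'c', 'd']            # unique per row, also separates them

print(gain_ratio(safety, label))  # 1.0
print(gain_ratio(row_id, label))  # 0.5
```

Both toy features have the same information gain (1 bit), but the many-valued `row_id` is penalised by its larger split information, which is the bias that gain ratio is meant to correct.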

In [0]:
def GR(feature, target, dataset, measure):
    """Calculates the gain ratio of a feature for a given target variable and a dataset.
    
    Parameters
    ----------
    feature: str
        name of the feature
    target: str
        name of the target variable
    dataset: pd.DataFrame
        dataframe for the dataset
    measure: str ('entropy' or 'gini')
        measure of impurity to be used
    Returns
    -------
    float
        gain ratio for the feature in the dataset for a given target variable
    """
    ##your implementation goes here
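    Information_Gain = IG(feature, target, dataset, measure)
    # Split information: entropy of the feature's own value distribution.
    # Assumption: the same entropy-based normaliser is used for both measures.
    Split_Info = 0
    for i in dataset[feature].unique() :
        y = dataset[dataset[feature] == i].shape[0]/dataset.shape[0]
        Split_Info = Split_Info - y*np.log2(y)
    # a feature with a single value carries no split information
    if Split_Info == 0 :
        return 0.0
    return Information_Gain/Split_Info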


# GR('buying','evaluation', edf, 'gini') 

In [0]:
def IUFS2(target, dataset, k, measure='entropy', gain='IG'):
    """Finds k most informative features in the given dataset based on the target variable
        using information gain or gain ratio with the selected measure.
        
    Parameters
    ----------
    target: str
        name of the target variable
    dataset: pd.DataFrame
        dataframe for the dataset
    k: int
        number of features to return, must be less than or equal to number of descriptive features in dataset.
        in other words, 0 < k < len(dataset.columns).
    measure: str, 'entropy' or 'gini'
        measure of impurity
    gain: str, 'IG' or 'GR'
        feature selection metric ('IG' for information gain, 'GR' for gain ratio)
    Returns
    -------
    list
        returns a list of k feature names, selected based on univariate selection schema
    """
    ##your implementation goes here
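    # choose the scoring function: information gain or gain ratio
    if gain == 'IG':
        score = IG
    elif gain == 'GR':
        score = GR
    else:
        raise ValueError("gain must be 'IG' or 'GR'")
    # score every descriptive feature (all columns except the target)
    scores = [(feature, score(feature, target, dataset, measure))
              for feature in dataset.columns if feature != target]
    # sort by score, highest first, and keep the k best names
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [feature for feature, _ in scores[:k]]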

# IUFS2('evaluation', edf, 2, measure='gini', gain='GR')