# Homework 3
In this homework assignment, you will implement a univariate feature selection method. 

You will be given a toy dataset called 'Car Evaluation Data Set' (see: http://archive.ics.uci.edu/ml/datasets/Car+Evaluation for details).
Testing is not required, but you are advised to test your code with the toy dataset or with any other dataset that contains categorical variables.

The given dataset contains six descriptive features and a target variable, all of which are ordinal-scale categorical variables. The target feature is named 'evaluation'.

Note that you are expected to write your own code, so DO NOT COPY AND PASTE CODE OR USE LIBRARY FUNCTIONS. The goal of the homework is not to see whether you can call library functions but to have you practice with impurity measures and feature selection techniques.


In [57]:
%matplotlib inline
import pandas as pd
import numpy as np
import math
import matplotlib
import matplotlib.pyplot as plt

### Read the dataset

In [58]:
edf = pd.read_csv('careval.csv')
# edf.head()
edf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1728 entries, 0 to 1727
Data columns (total 7 columns):
buying        1728 non-null object
maint         1728 non-null object
doors         1728 non-null object
persons       1728 non-null object
lug_boot      1728 non-null object
safety        1728 non-null object
evaluation    1728 non-null object
dtypes: object(7)
memory usage: 94.6+ KB


You will create a method called IUFS (impurity-based univariate feature selection), which selects the most informative features using a univariate feature selection scheme. The method takes the dataset, the name of the target variable, the number of features to select (k), and the impurity measure as input, and outputs the names of the k best features ranked by information gain. You are expected to implement the information gain, entropy, and Gini index functions. Because the selection is univariate, each feature is scored individually against the target.
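As a minimal sketch of the two impurity measures, assuming the class labels are passed as a plain Python list: entropy is $H = -\sum_i p_i \log_2 p_i$ and the Gini index is $1 - \sum_i p_i^2$, where $p_i$ is the relative frequency of class $i$. The helper names `entropy_of` and `gini_of` are illustrative and are not part of the assignment skeleton.

```python
import math
from collections import Counter


def entropy_of(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    # Only observed classes are counted, so every p is > 0 and log2 is defined.
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini_of(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


uniform = ["a", "b", "c", "d"]  # four equally frequent classes
pure = ["a", "a", "a", "a"]     # a single class

print(entropy_of(uniform), gini_of(uniform))  # 2.0 0.75
print(entropy_of(pure), gini_of(pure))        # both are zero for a pure set
```

Both measures are zero for a pure set and largest when all classes are equally frequent, which is why either one can serve as the impurity measure.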

In [59]:
# entropy (H)

def entropy(feature, dataset):
    """Calculates the entropy of a feature in a given dataset.
    
    Parameters
    ----------
    feature: str
        name of the feature
    dataset: pd.DataFrame
        dataframe for the dataset
    Returns
    -------
    float
        entropy for the feature in the dataset
    """
    ##your implementation goes here
    counts = dataset[feature].value_counts()
    numrows = float(len(dataset[feature]))
    if counts.sum() != numrows:
        # value_counts() skips missing values, so a mismatch means NaNs are present
        raise ValueError("feature '%s' contains missing values" % feature)
    result = 0.0
    for count in counts:
        p_i = float(count) / numrows
        # the running sum is rounded to 4 decimals after every term
        result = round(result - p_i * math.log(p_i, 2), 4)
    return result
    

entropy('buying', edf) 


2.0

In [60]:
# gini index (Gini)

def gini(feature, dataset):
    """Calculates the gini index of a feature in a given dataset.
    
    Parameters
    ----------
    feature: str
        name of the feature
    dataset: pd.DataFrame
        dataframe for the dataset
    Returns
    -------
    float
        gini index for the feature in the dataset
    """
    ##your implementation goes here
    counts = dataset[feature].value_counts()
    numrows = float(len(dataset[feature]))
    if counts.sum() != numrows:
        # value_counts() skips missing values, so a mismatch means NaNs are present
        raise ValueError("feature '%s' contains missing values" % feature)
    ginival = 1.0
    for count in counts:
        p_i = float(count) / numrows
        # the running value is rounded to 4 decimals after every term
        ginival = round(ginival - p_i * p_i, 4)
    return ginival
  

gini('buying', edf) 


0.75

In [61]:
# information gain (IG)

def IG(feature, target, dataset, measure):
    """Calculates the information gain of a feature for a given target variable and a dataset.
    
    Parameters
    ----------
    feature: str
        name of the feature
    target: str
        name of the target variable
    dataset: pd.DataFrame
        dataframe for the dataset
    measure: str ('entropy' or 'gini')
        measure of impurity to be used
    Returns
    -------
    float
        information gain for the feature in the dataset for a given target variable
    """
    ##your implementation goes here
    if feature == target or feature not in dataset.columns:
        raise ValueError('the feature to be split on is not in the feature set')
    if measure == 'entropy':
        impurity = entropy
    elif measure == 'gini':
        impurity = gini
    else:
        raise ValueError("measure must be 'entropy' or 'gini'")
    numrows = float(len(dataset[feature]))
    # start from the impurity of the target over the whole dataset
    value1 = impurity(target, dataset)
    for name in dataset[feature].value_counts().index:
        # subtract the size-weighted impurity of the partition feature == name
        df = dataset.loc[dataset[feature] == name]
        part = impurity(target, df) * (float(df.shape[0]) / numrows)
        value1 = round(value1 - part, 4)
    return value1


IG('buying','evaluation', edf, 'gini') 
 

0.0142
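Information gain is the impurity of the target over the whole dataset minus the size-weighted impurity of the target within each partition induced by the feature. The following is a small sketch of that computation on plain Python lists. The toy values and the helper names `entropy_of` and `information_gain` are made up for illustration and do not come from the car dataset.

```python
import math
from collections import Counter, defaultdict


def entropy_of(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, target_values, impurity=entropy_of):
    """Impurity of the target minus the weighted impurity of each partition."""
    n = len(target_values)
    # Group the target labels by the feature value of the same row.
    partitions = defaultdict(list)
    for f, t in zip(feature_values, target_values):
        partitions[f].append(t)
    remainder = sum(len(part) / n * impurity(part) for part in partitions.values())
    return impurity(target_values) - remainder


# Hypothetical toy rows: 'safety' separates the labels perfectly, 'doors' not at all.
safety = ["low", "low", "high", "high"]
doors = ["2", "4", "2", "4"]
label = ["unacc", "unacc", "acc", "acc"]

print(information_gain(safety, label))  # 1.0
print(information_gain(doors, label))   # 0.0
```

A feature whose partitions are all pure recovers the full impurity of the target, while a feature whose partitions look like the whole dataset gains nothing.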

In [62]:
def IUFS(target, dataset, k, measure='entropy'):
    """Finds k most informative features in the given dataset based on the target variable
        using information gain with the selected measure.
        
    Parameters
    ----------
    target: str
        name of the target variable
    dataset: pd.DataFrame
        dataframe for the dataset
    k: int
        number of features to return, must be less than or equal to number of descriptive features in dataset.
        in other words, 0 < k < len(dataset.columns).
    measure: str, 'entropy' or 'gini'
        measure of impurity
    Returns
    -------
    list
        returns a list of k feature names, selected based on univariate selection schema
    """
    ##your implementation goes here
    if not 0 < k < len(dataset.columns):
        raise ValueError('k must satisfy 0 < k < len(dataset.columns)')
    # score every descriptive feature on its own (univariate selection)
    scores = [(col, IG(col, target, dataset, measure))
              for col in dataset.columns if col != target]
    # sort once, after all features are scored, by decreasing information gain
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [col for col, _ in scores[:k]]
    
IUFS('evaluation', edf, 2, measure='entropy')


['safety', 'persons']
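The selection step itself does not depend on how the scores are computed. As a minimal sketch, assuming the per-feature scores are already available in a dict, the top k names can be picked with a single sort. The helper `top_k` and the score values below are made up for illustration.

```python
def top_k(scores, k):
    """Return the k feature names with the highest scores.

    scores: dict mapping feature name -> score (higher is better).
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and %d" % len(scores))
    # sorted() is stable, so features with equal scores keep their original order.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical scores, not computed from the dataset.
scores = {"buying": 0.10, "doors": 0.01, "safety": 0.26, "persons": 0.22}
print(top_k(scores, 2))  # ['safety', 'persons']
```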

### Bonus
Improve the IUFS by including an option for gain ratio. Gain ratio is an alternative to information gain and can be used with either the Gini index or the entropy measure.
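One common definition, used in C4.5-style decision trees, divides the information gain by the entropy of the feature's own value distribution (often called split information), which penalizes features with many distinct values. The following sketch assumes that definition with entropy as the impurity measure and works on plain Python lists. The helper names and toy values are illustrative only.

```python
import math
from collections import Counter, defaultdict


def entropy_of(values):
    """Shannon entropy (base 2) of a list of values."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature_values, target_values):
    """Information gain divided by the entropy of the feature itself."""
    n = len(target_values)
    partitions = defaultdict(list)
    for f, t in zip(feature_values, target_values):
        partitions[f].append(t)
    remainder = sum(len(p) / n * entropy_of(p) for p in partitions.values())
    gain = entropy_of(target_values) - remainder
    split_info = entropy_of(feature_values)
    # A single-valued feature has zero split information and cannot split the data.
    return gain / split_info if split_info > 0 else 0.0


label = ["unacc", "unacc", "acc", "acc"]
binary = ["low", "low", "high", "high"]  # two values, separates the labels perfectly
row_id = ["r1", "r2", "r3", "r4"]        # unique per row, also separates perfectly

print(gain_ratio(binary, label))  # 1.0
print(gain_ratio(row_id, label))  # 0.5
```

Both toy features have the same information gain, but the identifier-like feature is scored lower because its four distinct values give it a larger split information.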

In [63]:
def GR(feature, target, dataset, measure):
    """Calculates the gain ratio of a feature for a given target variable and a dataset.
    
    Parameters
    ----------
    feature: str
        name of the feature
    target: str
        name of the target variable
    dataset: pd.DataFrame
        dataframe for the dataset
    measure: str ('entropy' or 'gini')
        measure of impurity to be used
    Returns
    -------
    float
        gain ratio for the feature in the dataset for a given target variable
    """
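    ##your implementation goes here
    # normalise the gain by the impurity of the feature's own value distribution
    if measure == 'entropy':
        denominator = entropy(feature, dataset)
    elif measure == 'gini':
        denominator = gini(feature, dataset)
    else:
        raise ValueError("measure must be 'entropy' or 'gini'")
    IGval = IG(feature, target, dataset, measure)
    # a single-valued feature has zero impurity and cannot split the data
    if denominator == 0:
        return 0.0
    return IGval / denominator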
  
GR('buying','evaluation', edf, 'gini') 


0.018933333333333333

In [64]:
def IUFS2(target, dataset, k, measure='entropy', gain='IG'):
    """Finds k most informative features in the given dataset based on the target variable
        using information gain or gain ratio with the selected measure.
        
    Parameters
    ----------
    target: str
        name of the target variable
    dataset: pd.DataFrame
        dataframe for the dataset
    k: int
        number of features to return, must be less than or equal to number of descriptive features in dataset.
        in other words, 0 < k < len(dataset.columns).
    measure: str, 'entropy' or 'gini'
        measure of impurity
    gain: str, 'IG' or 'GR'
        feature selection metric ('IG' for information gain, 'GR' for gain ratio)
    Returns
    -------
    list
        returns a list of k feature names, selected based on univariate selection schema
    """
    ##your implementation goes here
    if gain == 'IG':
        score = IG
    elif gain == 'GR':
        score = GR
    else:
        raise ValueError("gain must be 'IG' or 'GR'")
    if not 0 < k < len(dataset.columns):
        raise ValueError('k must satisfy 0 < k < len(dataset.columns)')
    # score every descriptive feature on its own with the chosen metric
    scores = [(col, score(col, target, dataset, measure))
              for col in dataset.columns if col != target]
    # sort once, after all features are scored, by decreasing score
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [col for col, _ in scores[:k]]

IUFS2('evaluation', edf, 2, measure='gini', gain='GR')


['safety', 'persons']