3.1) Add the SelectPercentile object to the feature_selection sub-package. You should create a module called "select_percentile.py" to implement this object. The SelectPercentile class has a similar architecture to the SelectKBest class. Consider the structure presented in the next slide.

In [10]:
import sys
import os
import numpy as np
import pandas as pd
sys.path.append("/Users/utilizador/Documents/GitHub/si/src")

from si.base.transformer import Transformer
from si.data.dataset import Dataset
from si.statistics.f_classification import f_classification
from si.io.csv_file import read_csv



class SelectPercentile(Transformer):
    
    """
    Select a given percentage of the features based on their F-scores.
    The F-score of each feature is computed first. A threshold is then taken at
    the percentile that leaves the requested percentage of features above it,
    and only the features whose F-score exceeds that threshold are kept.
    
    Parameters
    ----------
    score_func: callable, default f_classification
        Function that takes a dataset and returns a pair of arrays (F-scores and p-values),
        i.e. an analysis of variance per feature.
    percentile: int, default 50
        Percentage of features to select (between 0 and 100).

    Estimated parameters (given by score_func)
    ------------------------------------------
    F: array, shape (n_features,)
        F scores of features.
    p: array, shape (n_features,)
        p-values of F-scores.
    """
    
    def __init__(self, score_func: callable = f_classification, percentile: int = 50):
        # reject values outside the valid percentage range before storing anything
        if percentile > 100 or percentile < 0:
            raise ValueError("the value of percentile must be between 0 and 100")

        self.score_func = score_func
        self.percentile = percentile
        self.F = None
        self.p = None
    
    def _fit(self, dataset: Dataset):
        """
        It fits SelectPercentile to compute the F scores and p-values.
        
        Parameters
        ----------
        dataset: Dataset
            A labeled dataset

        Returns
        -------
        self: object
            Returns self.
        """
        
        self.F, self.p = self.score_func(dataset)
        
        return self
    
    def _transform(self, dataset: Dataset) -> Dataset:
        
        """
        It selects the features according to the percentile.
        
        Parameters
        ----------
        dataset: Dataset
            A labeled dataset
        
        Returns
        ----------
        dataset: Dataset
            A labeled dataset with the selected features.
            
        """
        
        # F-score threshold that leaves the top `self.percentile` percent of features above it
        threshold = np.percentile(self.F, 100 - self.percentile)

        # boolean mask of the features whose F-score exceeds the threshold
        mask = self.F > threshold

        # names of the selected features
        features = np.array(dataset.features)[mask]

        return Dataset(X=dataset.X[:, mask], y=dataset.y, features=features, label=dataset.label)
        
        
    def fit_transform(self, dataset: Dataset) -> Dataset:
        """
        It fits SelectPercentile to compute the F scores and p-values and then selects the features according to the percentile.
        
        Parameters
        ----------
        dataset: Dataset
            A labeled dataset
        
        Returns
        ----------
        dataset: Dataset
            A labeled dataset with the selected features.
            
        """
        
        self.fit(dataset)
        return self.transform(dataset)
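
The notebook does not show how f_classification is implemented. As a rough illustration of what a score function of this kind computes, here is a minimal NumPy-only sketch of a one-way ANOVA F statistic per feature. The name anova_f_scores and the toy arrays are hypothetical, and the sketch returns only F-scores (no p-values), so it is not a drop-in replacement for score_func.

```python
import numpy as np


def anova_f_scores(X, y):
    """One-way ANOVA F statistic for each feature column (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    n_samples, n_classes = X.shape[0], classes.size

    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        group = X[y == c]
        group_mean = group.mean(axis=0)
        # spread of the class means around the overall mean
        ss_between += group.shape[0] * (group_mean - grand_mean) ** 2
        # spread of the samples around their own class mean
        ss_within += ((group - group_mean) ** 2).sum(axis=0)

    ms_between = ss_between / (n_classes - 1)
    ms_within = ss_within / (n_samples - n_classes)
    return ms_between / ms_within


# made-up data: feature 0 separates the two classes, feature 1 does not
X_toy = np.array([[1.0, 5.0],
                  [2.0, 4.0],
                  [3.0, 6.0],
                  [11.0, 5.0],
                  [12.0, 6.0],
                  [13.0, 4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

scores = anova_f_scores(X_toy, y_toy)
print(scores)
```

A feature whose class means are far apart relative to the within-class spread gets a large F-score, which is why SelectPercentile ranks features by this value.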
    

 

3.2) Test the SelectPercentile class in a Jupyter notebook using the "iris.csv" dataset (classification)

In [44]:
Path= "/Users/utilizador/Documents/GitHub/si/datasets/iris/"
data = read_csv(Path + "iris.csv", sep=",", label=True)
data.summary()

          feat_0    feat_1    feat_2    feat_3
mean    5.843333     3.054  3.758667  1.198667
median       5.8       3.0      4.35       1.3
min          4.3       2.0       1.0       0.1
max          7.9       4.4       6.9       2.5
var     0.681122  0.186751  3.092425  0.578532


In [47]:
selector = SelectPercentile(score_func=f_classification, percentile=30)
data_fit = selector.fit_transform(data)


print("Selected features:", data_fit.features)


Selected features: ['feat_2']
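
Only one of the four features is returned for percentile=30, which is what you would expect if the selector keeps the top 30% of features by F-score. The sketch below reproduces that thresholding step with NumPy alone, assuming the threshold is taken at the (100 - percentile)-th percentile of the scores. The function name select_top_percentile and the score values are made up for illustration and are not the real iris F-scores.

```python
import numpy as np


def select_top_percentile(scores, percentile):
    """Indices of the features in the top `percentile` percent (hypothetical helper)."""
    scores = np.asarray(scores, dtype=float)
    # threshold below which (100 - percentile) percent of the scores fall
    threshold = np.percentile(scores, 100 - percentile)
    return np.where(scores > threshold)[0]


# made-up F-scores for four features
toy_scores = np.array([120.0, 50.0, 1200.0, 950.0])

for pct in (30, 50, 75):
    kept = select_top_percentile(toy_scores, pct)
    print(f"percentile={pct}: kept feature indices {kept.tolist()}")
```

With four features, NumPy interpolates the threshold between neighbouring scores, so percentile=30 keeps one feature, 50 keeps two and 75 keeps three. Because the comparison is strict, percentile=100 would still drop the lowest-scoring feature, which may or may not be the behaviour you want.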
