# **Boruta**



Feature selection is one of the most crucial and time-consuming phases of the machine learning process, second only to data cleaning. What if we could automate it? That is exactly what Boruta does. Boruta is an algorithm that takes the “all-relevant” approach to feature selection, i.e., it tries to find all features in the dataset that carry information relevant to a given task. Its counterpart is the “minimal-optimal” approach, which seeks the minimal subset of features that are important in a model.

To read more about it, please refer to [this](https://analyticsindiamag.com/hands-on-guide-to-automated-feature-selection-using-boruta/) article.
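As a rough illustration of the “all-relevant” idea, the sketch below runs a single round of the shadow-feature comparison that Boruta is built on: every column is shuffled to create a "shadow" copy with no relationship to the target, and a real feature counts as a hit only if its importance beats the best shadow. The synthetic data and all variable names here are assumptions made for illustration. Boruta repeats this over many rounds and applies a statistical test to the hit counts, so this is a simplified sketch and not the BorutaPy implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 3 informative columns plus 3 pure-noise columns.
n_samples = 500
informative = rng.normal(size=(n_samples, 3))
noise = rng.normal(size=(n_samples, 3))
X_demo = np.hstack([informative, noise])
y_demo = (informative.sum(axis=1) > 0).astype(int)

# Shadow features: each column shuffled independently, which destroys
# any relationship between that column and the target.
X_shadow = np.column_stack([rng.permutation(col) for col in X_demo.T])

# Fit one forest on the real and shadow features together.
forest = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
forest.fit(np.hstack([X_demo, X_shadow]), y_demo)

n_real = X_demo.shape[1]
real_importance = forest.feature_importances_[:n_real]
shadow_max = forest.feature_importances_[n_real:].max()

# A real feature scores a "hit" in this round if it beats the best shadow.
hits = real_importance > shadow_max
print("Hits in this round:", hits)
```

A single round is noisy, which is why the full algorithm accumulates hits over many iterations before confirming or rejecting a feature.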

# **Code Implementation**

## Installing the module


In [None]:
!python -m pip install pip --upgrade --user -q
!python -m pip install numpy pandas seaborn matplotlib scipy scikit-learn statsmodels --user -q

In [None]:
!python -m pip install Boruta --user -q

In [None]:
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

  Importing Boruta and other required libraries. 

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Loading the dataset, separating the features from the target variable, and splitting the data into a train and a test set.

In [None]:
URL = "https://raw.githubusercontent.com/Aditya1001001/English-Premier-League/master/pos_modelling_data.csv"

In [None]:
data = pd.read_csv(URL)

In [None]:
data.head()

In [None]:
data.isnull().sum().sum()

In [None]:
data.info()

In [None]:
X = data.drop('Position', axis = 1)
y = data['Position']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 1)

Creating a baseline RandomForestClassifier model with all the features.

In [None]:
rf_all_features = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
rf_all_features.fit(X_train, y_train)

In [None]:
accuracy_score(y_test, rf_all_features.predict(X_test))

## Using Boruta for feature selection

Creating a BorutaPy object with RandomForestClassifier as the estimator and ranking the features.

One important thing to note here is that Boruta works on NumPy arrays only, so the DataFrames are converted with `np.array` before fitting.
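The conversion itself is a one-liner. The sketch below uses a small stand-in DataFrame and Series (the column names and label values are hypothetical, not taken from the dataset) to show that `to_numpy()` yields plain arrays of the same shape. Note that column names are lost in the process, so they have to be read from the original DataFrame later.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_train and y_train.
X_demo = pd.DataFrame({"feat_a": [0.1, 0.4, 0.7], "feat_b": [3, 1, 2]})
y_demo = pd.Series(["DEF", "MID", "FWD"], name="Position")

# to_numpy() returns the underlying values, like np.array(X_demo) does.
X_array = X_demo.to_numpy()
y_array = y_demo.to_numpy()

print(type(X_array), X_array.shape)
print(type(y_array), y_array.shape)
```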

In [None]:
rfc = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
boruta_selector = BorutaPy(rfc, n_estimators='auto', verbose=2, random_state=1)
boruta_selector.fit(np.array(X_train), np.array(y_train)) 

In [None]:
type(boruta_selector)

In [None]:
print("Selected Features: ", boruta_selector.support_)    # check selected features

In [None]:
print("Ranking: ",boruta_selector.ranking_)               # check ranking of features

print("No. of significant features: ", boruta_selector.n_features_)

So Boruta has selected 31 relevant features. (The features with a ranking of 1 are selected.)

Let's visualise it in the form of a table.

In [None]:
selected_rf_features = pd.DataFrame({'Feature':list(X_train.columns),
                                      'Ranking':boruta_selector.ranking_})
selected_rf_features.sort_values(by='Ranking')
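To get just the names of the selected columns, the ranking can be turned into a boolean mask and applied to the column index. The sketch below uses hypothetical stand-ins for `X_train.columns` and `boruta_selector.ranking_` so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_train.columns and boruta_selector.ranking_.
feature_names = pd.Index(["feat_a", "feat_b", "feat_c", "feat_d"])
ranking = np.array([1, 3, 1, 2])

# Rank 1 marks a selected feature, so this mask plays the role of support_.
support = ranking == 1
selected_names = feature_names[support].tolist()
print(selected_names)
```

In the notebook itself, `X_train.columns[boruta_selector.support_]` should give the corresponding list for the real data.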

Using the BorutaPy object to transform the features in the dataset.

In [None]:
X_important_train = boruta_selector.transform(np.array(X_train))
X_important_test = boruta_selector.transform(np.array(X_test))

In [None]:
X_important_train.shape
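Conceptually, the transform step keeps only the columns flagged in `support_` and leaves the rows untouched, which is why the number of columns drops while the number of rows stays the same. A minimal sketch with stand-in arrays (hypothetical values, not BorutaPy's own code):

```python
import numpy as np

# Hypothetical stand-ins: a 3x4 feature matrix and a support_-style boolean mask.
X_demo = np.arange(12).reshape(3, 4)
support = np.array([True, False, True, False])

# Boolean indexing on the column axis keeps only the columns flagged True.
X_selected = X_demo[:, support]
print(X_selected.shape)
```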

Creating another RandomForestClassifier model with the same parameters as the baseline classifier and training it with the selected features.

In [None]:
rf_boruta = RandomForestClassifier(random_state=1, n_estimators=1000, max_depth=5)
rf_boruta.fit(X_important_train, y_train)

In [None]:
accuracy_score(y_test, rf_boruta.predict(X_important_test))

In [None]:
from sklearn.model_selection import GridSearchCV
# Parameter grid to search over with GridSearchCV
param_grid = {
    'bootstrap': [True, False],
    'max_depth': [5, 10, 15],
    'n_estimators': [500, 1000]}

In [None]:
rf_hyper = RandomForestClassifier(random_state = 1)

# Grid search cv
grid_search = GridSearchCV(estimator = rf_hyper, param_grid = param_grid, 
                          cv = 2, n_jobs = -1, verbose = 2)

In [None]:
grid_search.fit(X_important_train, y_train)

In [None]:

grid_search.best_params_

In [None]:
accuracy_score(y_test, grid_search.predict(X_important_test))