# RAVEN: Reducing Attributes Via Evaluating Nearness [🐦‍⬛](https://en.wikipedia.org/wiki/Backronym)

A convenient tool for reducing the attributes (features) of a very large dataset without degrading dataset quality. It identifies clusters of linearly related (and therefore redundant) features and keeps only the feature that is 'nearest' to all the others.

This doc shows you how and when to use Raven.


In this example, we use a sample of [ClimSim](https://leap-stc.github.io/ClimSim/README.html), a huge (~377 GB) dataset containing 925 attributes. We will use the first 556 columns to predict the 557th.

In [16]:
import pandas as pd

data = pd.read_csv('train1000.csv').drop(columns=['sample_id', 'Unnamed: 0'])
x = data.iloc[:, :556] 
y = data.iloc[:, 556]

x.head(), y.head()

(    state_t_0   state_t_1   state_t_2   state_t_3   state_t_4   state_t_5  \
 0  215.117418  235.579037  250.793334  263.115017  262.570816  256.696740   
 1  216.660442  236.356702  250.701147  260.185608  257.965210  254.948239   
 2  217.766221  233.437275  242.320351  253.499835  260.342576  261.025882   
 3  218.452505  232.471070  240.774734  253.624602  257.088679  258.302960   
 4  217.883840  235.344818  248.610196  256.193888  255.609799  255.376257   
 
     state_t_6   state_t_7   state_t_8   state_t_9  ...   pbuf_N2O_50  \
 0  248.951714  245.328685  238.121496  233.741971  ...  4.908584e-07   
 1  247.290346  245.702725  239.636065  234.963264  ...  4.908584e-07   
 2  254.783378  249.582744  240.681978  233.681856  ...  4.908584e-07   
 3  251.741567  248.882723  242.147880  235.834188  ...  4.908584e-07   
 4  247.574184  244.685004  238.785738  234.712877  ...  4.908584e-07   
 
     pbuf_N2O_51   pbuf_N2O_52   pbuf_N2O_53   pbuf_N2O_54   pbuf_N2O_55  \
 0  4.908584e-

Let's first train a random forest regressor on the data without using RAVEN.

In [17]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

def train(x, y, reduction_status=""):
    # Hold out the last 20% of rows as the test set (no shuffling).
    train_size = int(len(x) * 0.8)

    x_train = x.iloc[:train_size, :]
    y_train = y.iloc[:train_size]
    x_test = x.iloc[train_size:, :]
    y_test = y.iloc[train_size:]

    # Fixed random_state so runs before and after reduction are comparable.
    model = RandomForestRegressor(n_estimators=100, max_depth=None, random_state=42)

    model.fit(x_train, y_train)

    y_pred = model.predict(x_test)

    # Report test-set metrics, labelled with the reduction status.
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"Mean Squared Error{reduction_status}: ", mse)
    print(f"R2 Score{reduction_status}: ", r2)

In [18]:
%%time
train(x,y, " without reduction")

Mean Squared Error without reduction:  1.712628771132408e-11
R2 Score without reduction:  0.9795469803323228
CPU times: total: 8.56 s
Wall time: 24.5 s


The R^2 score looks great, but training is slow (24.5 s wall time on this sample). In large datasets, many attributes tend to be strongly correlated with others. RAVEN identifies these by calculating the correlation between pairs of attributes. This calculation can be sped up by randomly sampling from the dataset.

Additionally, to keep RAVEN from being too lenient, we can adjust 'tau', the threshold on the correlation coefficient that determines when RAVEN flags attributes as redundant.
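The pairwise-correlation idea described above can be sketched as follows. This is a minimal illustration, not RAVEN's actual implementation or API: the function name `find_redundant_features`, its parameters (`tau`, `sample_size`, `seed`), the default threshold of 0.95, the use of connected components to form clusters, and summed absolute correlation as the measure of 'nearness' are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def find_redundant_features(df, tau=0.95, sample_size=None, seed=0):
    """Hypothetical sketch: list columns that look redundant under threshold tau."""
    # Optionally estimate correlations from a random sample of rows.
    if sample_size is not None and sample_size < len(df):
        df = df.sample(n=sample_size, random_state=seed)

    # Absolute Pearson correlation; constant columns give NaN, treated as 0.
    corr = df.corr().abs().fillna(0.0).to_numpy(copy=True)
    np.fill_diagonal(corr, 0.0)
    n = corr.shape[0]
    linked = corr >= tau

    # Clusters are the connected components of the "linked" graph.
    cluster_of = [-1] * n
    clusters = []
    for start in range(n):
        if cluster_of[start] != -1:
            continue
        members, stack = [], [start]
        cluster_of[start] = len(clusters)
        while stack:
            i = stack.pop()
            members.append(i)
            for j in np.flatnonzero(linked[i]):
                if cluster_of[j] == -1:
                    cluster_of[j] = len(clusters)
                    stack.append(j)
        clusters.append(members)

    redundant = []
    for members in clusters:
        if len(members) < 2:
            continue
        # Keep the member with the largest summed |correlation| to its cluster.
        sub = corr[np.ix_(members, members)]
        keep = members[int(np.argmax(sub.sum(axis=1)))]
        redundant.extend(df.columns[m] for m in members if m != keep)
    return redundant


# Toy data: "a_scaled" is an exact linear function of "a"; "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({"a": a, "a_scaled": 2.0 * a + 1.0, "b": rng.normal(size=500)})
result = find_redundant_features(demo, tau=0.95)
print(result)
```

On the toy data, exactly one of the two linearly related columns is flagged and the independent column is left alone. Raising `tau` makes the sketch stricter about what counts as redundant, and passing `sample_size` trades some accuracy in the correlation estimates for speed.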

Let's use RAVEN with this dataset.

Within a (relatively) short amount of time, RAVEN identifies 355 redundant features!

In [19]:
from raven import raven

redundant = raven(x)
print(f"{len(redundant)} redundant columns identified, out of {len(x.columns)} columns")
print(f"{len(redundant)/len(x.columns) * 100}% Reduction")

x1 = x.drop(columns=redundant)

355 redundant columns identified, out of 556 columns
63.84892086330935% Reduction


In [20]:
%%time
train(x1, y, " after reduction")

Mean Squared Error after reduction:  1.646347956793695e-11
R2 Score after reduction:  0.980338537044501
CPU times: total: 3.38 s
Wall time: 7.75 s


That's a pretty big reduction in training time (24.5 s down to 7.75 s wall time) with essentially no change in the R^2 score (0.9795 before, 0.9803 after)!
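To make the comparison concrete, the figures printed in the cells above can be turned into a few summary numbers. The values below are copied from those outputs. The variable names are only illustrative, and the timings come from a single run, so treat the speedup as indicative rather than a benchmark.

```python
# Figures copied from the cell outputs above.
n_columns, n_redundant = 556, 355
wall_before, wall_after = 24.5, 7.75  # seconds, single run each
r2_before, r2_after = 0.9795469803323228, 0.980338537044501

reduction_pct = n_redundant / n_columns * 100
speedup = wall_before / wall_after
r2_delta = r2_after - r2_before

print(f"Columns kept: {n_columns - n_redundant}")  # 201
print(f"Reduction: {reduction_pct:.1f}%")          # 63.8%
print(f"Wall-time speedup: {speedup:.2f}x")        # 3.16x
print(f"R2 change: {r2_delta:+.4f}")               # +0.0008
```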