# Motivation
In this competition locally I generated 200+ features, and it is too much for my RAM (I think, you have the same problem) <br>
I wanted to reduce usage of memory and it can be done in following ways: <br>
- `reduce_mem_usage` function
- split dataset by type and fit different models separately <br>
- feature reduction <br><br>
In this kernel I will cover method `Permutation importance` that the case of feature selection.

# The Idea
We train one model on all generated features. After that you can permutate column one by one and look at the change of score on hold-out validation.
If it is getting much worse, then this column contains useful information about target. If it is doesn't change or getting better, most probably this feature is useless. After that you make column return the column to its normal state and go to next column. <br><br>
Benefits of this method:
 - It is fast. Model is trained once. it is better than greedy algorithm, because you don't need to re-fit your model each time
 - When column is permutated, it has the same distribution. It means - no bias towards some classes or modes in target distribution.
 - Experiment can be ran several times so mean and std of error change can be measured.

# Imports and utils

In [1]:
import numpy as np
import pandas as pd
import os
import time
import datetime
import gc

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GroupKFold
from sklearn.metrics import mean_absolute_error

import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm_notebook as tqdm

from catboost import CatBoostRegressor, Pool

import warnings
warnings.filterwarnings("ignore")

In [2]:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                c_prec = df[col].apply(lambda x: np.finfo(x).precision).max()
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max and c_prec == np.finfo(np.float16).precision:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max and c_prec == np.finfo(np.float32).precision:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

def group_mean_log_mae(y_true, y_pred, types, floor=1e-9):
    """
    Fast metric computation for this competition: https://www.kaggle.com/c/champs-scalar-coupling
    Code is from this kernel: https://www.kaggle.com/uberkinder/efficient-metric
    """
    maes = (y_true-y_pred).abs().groupby(types).mean()
    return np.log(maes.map(lambda x: max(x, floor))).mean()

def encode_categoric(df):
    lbl = LabelEncoder()
    cat_cols=[]
    try:
        cat_cols = df.describe(include=['O']).columns.tolist()
        for cat in cat_cols:
            df[cat] = lbl.fit_transform(list(df[cat].values))
    except Exception as e:
        print('error: ', str(e) )

    return df

## Load train data

In [3]:
train = pd.read_csv('../input/train.csv')
structures = pd.read_csv('../input/structures.csv')

print('Train dataset shape is -> rows: {} cols:{}'.format(train.shape[0],train.shape[1]))
print('Structures dataset shape is  -> rows: {} cols:{}'.format(structures.shape[0],structures.shape[1]))

Train dataset shape is -> rows: 4658147 cols:6
Structures dataset shape is  -> rows: 2358657 cols:6


In [4]:
unique_molecules = train['molecule_name'].unique()

print("Few examples of molecule's names: ", '  '.join(unique_molecules[:3]), end='\n\n')
print('Amount of unique molecules in train: ', len(unique_molecules))

Few examples of molecule's names:  dsgdb9nsd_000001  dsgdb9nsd_000002  dsgdb9nsd_000003

Amount of unique molecules in train:  85003


### Let's use subset of training data due kaggle kernel power constraints

In [5]:
molecules_fraction = 0.1
molecules_amount = int(molecules_fraction * len(unique_molecules))

np.random.shuffle(unique_molecules)
train_molecules = unique_molecules[:molecules_amount]

train = train[train['molecule_name'].isin(train_molecules)]

print(f'Amount of molecules in the subset of train: {molecules_amount}, samples: {train.shape[0]}')

Amount of molecules in the subset of train: 8500, samples: 463724


All these feture generation functions based on public kernels:
 - https://www.kaggle.com/artgor/using-meta-features-to-improve-model
 - https://www.kaggle.com/aekoch95/bonds-from-structure-data
 - https://www.kaggle.com/adrianoavelar/bond-calculation-lb-0-82
 - https://www.kaggle.com/kmat2019/effective-feature
 
**Please upvote them!**

# Create features based on structures.csv

In [6]:
def atomic_radius_electonegativety(structures):
    atomic_radius = {'H':0.38, 'C':0.77, 'N':0.75, 'O':0.73, 'F':0.71} # Without fudge factor
    fudge_factor = 0.05
    atomic_radius = {k:v + fudge_factor for k,v in atomic_radius.items()}

    electronegativity = {'H':2.2, 'C':2.55, 'N':3.04, 'O':3.44, 'F':3.98}

    atoms = structures['atom'].values
    atoms_en = [electronegativity[x] for x in atoms]
    atoms_rad = [atomic_radius[x] for x in atoms]

    structures['EN'] = atoms_en
    structures['rad'] = atoms_rad
    
    return structures


def create_bonds(structures):
    i_atom = structures['atom_index'].values
    p = structures[['x', 'y', 'z']].values
    p_compare = p
    m = structures['molecule_name'].values
    m_compare = m
    r = structures['rad'].values
    r_compare = r

    source_row = np.arange(len(structures))
    max_atoms = 28

    bonds = np.zeros((len(structures)+1, max_atoms+1), dtype=np.int8)
    bond_dists = np.zeros((len(structures)+1, max_atoms+1), dtype=np.float32)

#     print('Calculating bonds')

    for i in range(max_atoms-1):
        p_compare = np.roll(p_compare, -1, axis=0)
        m_compare = np.roll(m_compare, -1, axis=0)
        r_compare = np.roll(r_compare, -1, axis=0)

        mask = np.where(m == m_compare, 1, 0) #Are we still comparing atoms in the same molecule?
        dists = np.linalg.norm(p - p_compare, axis=1) * mask
        r_bond = r + r_compare

        bond = np.where(np.logical_and(dists > 0.0001, dists < r_bond), 1, 0)

        source_row = source_row
        target_row = source_row + i + 1 #Note: Will be out of bounds of bonds array for some values of i
        target_row = np.where(np.logical_or(target_row > len(structures), mask==0), len(structures), target_row) #If invalid target, write to dummy row

        source_atom = i_atom
        target_atom = i_atom + i + 1 #Note: Will be out of bounds of bonds array for some values of i
        target_atom = np.where(np.logical_or(target_atom > max_atoms, mask==0), max_atoms, target_atom) #If invalid target, write to dummy col

        bonds[(source_row, target_atom)] = bond
        bonds[(target_row, source_atom)] = bond
        bond_dists[(source_row, target_atom)] = dists
        bond_dists[(target_row, source_atom)] = dists

    bonds = np.delete(bonds, axis=0, obj=-1) #Delete dummy row
    bonds = np.delete(bonds, axis=1, obj=-1) #Delete dummy col
    bond_dists = np.delete(bond_dists, axis=0, obj=-1) #Delete dummy row
    bond_dists = np.delete(bond_dists, axis=1, obj=-1) #Delete dummy col

#     print('Counting and condensing bonds')

    bonds_numeric = [[i for i,x in enumerate(row) if x] for row in bonds]
    bond_lengths = [[dist for i,dist in enumerate(row) if i in bonds_numeric[j]] for j,row in enumerate(bond_dists)]
    bond_lengths_mean = [ np.mean(x) for x in bond_lengths]
    n_bonds = [len(x) for x in bonds_numeric]


    bond_data = {'n_bonds':n_bonds, 'bond_lengths_mean': bond_lengths_mean }
    bond_df = pd.DataFrame(bond_data)
    structures = structures.join(bond_df)
    
    return structures

def map_atom_info(df, atom_idx):
    df = pd.merge(df, structures, how = 'left',
                  left_on  = ['molecule_name', f'atom_index_{atom_idx}'],
                  right_on = ['molecule_name',  'atom_index'])
    
    df = df.drop('atom_index', axis=1)
    df = df.rename(columns={'atom': f'atom_{atom_idx}',
                            'x': f'x_{atom_idx}',
                            'y': f'y_{atom_idx}',
                            'z': f'z_{atom_idx}',
                            'EN': f'EN_{atom_idx}',
                            'rad': f'rad_{atom_idx}',
                            'n_bonds': f'n_bonds_{atom_idx}',
                            'bond_lengths_mean': f'bond_lengths_mean_{atom_idx}',
                           })
    return df

In [7]:
structures = atomic_radius_electonegativety(structures)
structures = create_bonds(structures)

train = map_atom_info(train, 0)
train = map_atom_info(train, 1)

train.head()

Unnamed: 0,id,molecule_name,atom_index_0,atom_index_1,type,scalar_coupling_constant,atom_0,x_0,y_0,z_0,EN_0,rad_0,n_bonds_0,bond_lengths_mean_0,atom_1,x_1,y_1,z_1,EN_1,rad_1,n_bonds_1,bond_lengths_mean_1
0,332,dsgdb9nsd_000030,4,0,1JHC,85.8525,H,0.978993,1.966216,0.033738,2.2,0.43,1,1.093215,C,-0.030958,1.54775,0.031679,2.55,0.82,4,1.205017
1,333,dsgdb9nsd_000030,4,1,2JHC,-2.13349,H,0.978993,1.966216,0.033738,2.2,0.43,1,1.093215,C,0.014854,0.009625,-0.020822,2.55,0.82,4,1.298219
2,334,dsgdb9nsd_000030,4,2,3JHC,2.37669,H,0.978993,1.966216,0.033738,2.2,0.43,1,1.093215,C,0.690991,-0.499546,-1.208576,2.55,0.82,2,1.330296
3,335,dsgdb9nsd_000030,4,5,2JHH,-11.6048,H,0.978993,1.966216,0.033738,2.2,0.43,1,1.093215,H,-0.560824,1.950029,-0.835807,2.2,0.43,1,1.093215
4,336,dsgdb9nsd_000030,4,6,2JHH,-11.0265,H,0.978993,1.966216,0.033738,2.2,0.43,1,1.093215,H,-0.545767,1.882255,0.937091,2.2,0.43,1,1.093934


## Feature generation

In [8]:
def distances(df):
    df_p_0 = df[['x_0', 'y_0', 'z_0']].values
    df_p_1 = df[['x_1', 'y_1', 'z_1']].values
    
    df['dist'] = np.linalg.norm(df_p_0 - df_p_1, axis=1)
    df['dist_x'] = (df['x_0'] - df['x_1']) ** 2
    df['dist_y'] = (df['y_0'] - df['y_1']) ** 2
    df['dist_z'] = (df['z_0'] - df['z_1']) ** 2
    
    df['type_0'] = df['type'].apply(lambda x: x[0])
    
    return df

def map_atom_info(df_1,df_2, atom_idx):
    df = pd.merge(df_1, df_2, how = 'left',
                  left_on  = ['molecule_name', f'atom_index_{atom_idx}'],
                  right_on = ['molecule_name',  'atom_index'])
    df = df.drop('atom_index', axis=1)

    return df

def create_closest(df):
    df_temp=df.loc[:,["molecule_name","atom_index_0","atom_index_1","dist","x_0","y_0","z_0","x_1","y_1","z_1"]].copy()
    df_temp_=df_temp.copy()
    df_temp_= df_temp_.rename(columns={'atom_index_0': 'atom_index_1',
                                       'atom_index_1': 'atom_index_0',
                                       'x_0': 'x_1',
                                       'y_0': 'y_1',
                                       'z_0': 'z_1',
                                       'x_1': 'x_0',
                                       'y_1': 'y_0',
                                       'z_1': 'z_0'})
    df_temp=pd.concat(objs=[df_temp,df_temp_],axis=0)

    df_temp["min_distance"]=df_temp.groupby(['molecule_name', 'atom_index_0'])['dist'].transform('min')
    df_temp= df_temp[df_temp["min_distance"]==df_temp["dist"]]

    df_temp=df_temp.drop(['x_0','y_0','z_0','min_distance', 'dist'], axis=1)
    df_temp= df_temp.rename(columns={'atom_index_0': 'atom_index',
                                     'atom_index_1': 'atom_index_closest',
                                     'distance': 'distance_closest',
                                     'x_1': 'x_closest',
                                     'y_1': 'y_closest',
                                     'z_1': 'z_closest'})

    for atom_idx in [0,1]:
        df = map_atom_info(df,df_temp, atom_idx)
        df = df.rename(columns={'atom_index_closest': f'atom_index_closest_{atom_idx}',
                                        'distance_closest': f'distance_closest_{atom_idx}',
                                        'x_closest': f'x_closest_{atom_idx}',
                                        'y_closest': f'y_closest_{atom_idx}',
                                        'z_closest': f'z_closest_{atom_idx}'})
    return df

def add_cos_features(df):
    df["distance_0"]=((df['x_0']-df['x_closest_0'])**2+(df['y_0']-df['y_closest_0'])**2+(df['z_0']-df['z_closest_0'])**2)**(1/2)
    df["distance_1"]=((df['x_1']-df['x_closest_1'])**2+(df['y_1']-df['y_closest_1'])**2+(df['z_1']-df['z_closest_1'])**2)**(1/2)
    df["vec_0_x"]=(df['x_0']-df['x_closest_0'])/df["distance_0"]
    df["vec_0_y"]=(df['y_0']-df['y_closest_0'])/df["distance_0"]
    df["vec_0_z"]=(df['z_0']-df['z_closest_0'])/df["distance_0"]
    df["vec_1_x"]=(df['x_1']-df['x_closest_1'])/df["distance_1"]
    df["vec_1_y"]=(df['y_1']-df['y_closest_1'])/df["distance_1"]
    df["vec_1_z"]=(df['z_1']-df['z_closest_1'])/df["distance_1"]
    df["vec_x"]=(df['x_1']-df['x_0'])/df["dist"]
    df["vec_y"]=(df['y_1']-df['y_0'])/df["dist"]
    df["vec_z"]=(df['z_1']-df['z_0'])/df["dist"]
    df["cos_0_1"]=df["vec_0_x"]*df["vec_1_x"]+df["vec_0_y"]*df["vec_1_y"]+df["vec_0_z"]*df["vec_1_z"]
    df["cos_0"]=df["vec_0_x"]*df["vec_x"]+df["vec_0_y"]*df["vec_y"]+df["vec_0_z"]*df["vec_z"]
    df["cos_1"]=df["vec_1_x"]*df["vec_x"]+df["vec_1_y"]*df["vec_y"]+df["vec_1_z"]*df["vec_z"]
    df=df.drop(['vec_0_x','vec_0_y','vec_0_z','vec_1_x','vec_1_y','vec_1_z','vec_x','vec_y','vec_z'], axis=1)
    return df

def create_features(df):
    df['molecule_couples'] = df.groupby('molecule_name')['id'].transform('count')
    df['molecule_dist_mean'] = df.groupby('molecule_name')['dist'].transform('mean')
    df['molecule_dist_min'] = df.groupby('molecule_name')['dist'].transform('min')
    df['molecule_dist_max'] = df.groupby('molecule_name')['dist'].transform('max')
    df['atom_0_couples_count'] = df.groupby(['molecule_name', 'atom_index_0'])['id'].transform('count')
    df['atom_1_couples_count'] = df.groupby(['molecule_name', 'atom_index_1'])['id'].transform('count')
    df[f'molecule_atom_index_0_x_1_std'] = df.groupby(['molecule_name', 'atom_index_0'])['x_1'].transform('std')
    df[f'molecule_atom_index_0_y_1_mean'] = df.groupby(['molecule_name', 'atom_index_0'])['y_1'].transform('mean')
    df[f'molecule_atom_index_0_y_1_mean_diff'] = df[f'molecule_atom_index_0_y_1_mean'] - df['y_1']
    df[f'molecule_atom_index_0_y_1_mean_div'] = df[f'molecule_atom_index_0_y_1_mean'] / df['y_1']
    df[f'molecule_atom_index_0_y_1_max'] = df.groupby(['molecule_name', 'atom_index_0'])['y_1'].transform('max')
    df[f'molecule_atom_index_0_y_1_max_diff'] = df[f'molecule_atom_index_0_y_1_max'] - df['y_1']
    df[f'molecule_atom_index_0_y_1_std'] = df.groupby(['molecule_name', 'atom_index_0'])['y_1'].transform('std')
    df[f'molecule_atom_index_0_z_1_std'] = df.groupby(['molecule_name', 'atom_index_0'])['z_1'].transform('std')
    df[f'molecule_atom_index_0_dist_mean'] = df.groupby(['molecule_name', 'atom_index_0'])['dist'].transform('mean')
    df[f'molecule_atom_index_0_dist_mean_diff'] = df[f'molecule_atom_index_0_dist_mean'] - df['dist']
    df[f'molecule_atom_index_0_dist_mean_div'] = df[f'molecule_atom_index_0_dist_mean'] / df['dist']
    df[f'molecule_atom_index_0_dist_max'] = df.groupby(['molecule_name', 'atom_index_0'])['dist'].transform('max')
    df[f'molecule_atom_index_0_dist_max_diff'] = df[f'molecule_atom_index_0_dist_max'] - df['dist']
    df[f'molecule_atom_index_0_dist_max_div'] = df[f'molecule_atom_index_0_dist_max'] / df['dist']
    df[f'molecule_atom_index_0_dist_min'] = df.groupby(['molecule_name', 'atom_index_0'])['dist'].transform('min')
    df[f'molecule_atom_index_0_dist_min_diff'] = df[f'molecule_atom_index_0_dist_min'] - df['dist']
    df[f'molecule_atom_index_0_dist_min_div'] = df[f'molecule_atom_index_0_dist_min'] / df['dist']
    df[f'molecule_atom_index_0_dist_std'] = df.groupby(['molecule_name', 'atom_index_0'])['dist'].transform('std')
    df[f'molecule_atom_index_0_dist_std_diff'] = df[f'molecule_atom_index_0_dist_std'] - df['dist']
    df[f'molecule_atom_index_0_dist_std_div'] = df[f'molecule_atom_index_0_dist_std'] / df['dist']
    df[f'molecule_atom_index_1_dist_mean'] = df.groupby(['molecule_name', 'atom_index_1'])['dist'].transform('mean')
    df[f'molecule_atom_index_1_dist_mean_diff'] = df[f'molecule_atom_index_1_dist_mean'] - df['dist']
    df[f'molecule_atom_index_1_dist_mean_div'] = df[f'molecule_atom_index_1_dist_mean'] / df['dist']
    df[f'molecule_atom_index_1_dist_max'] = df.groupby(['molecule_name', 'atom_index_1'])['dist'].transform('max')
    df[f'molecule_atom_index_1_dist_max_diff'] = df[f'molecule_atom_index_1_dist_max'] - df['dist']
    df[f'molecule_atom_index_1_dist_max_div'] = df[f'molecule_atom_index_1_dist_max'] / df['dist']
    df[f'molecule_atom_index_1_dist_min'] = df.groupby(['molecule_name', 'atom_index_1'])['dist'].transform('min')
    df[f'molecule_atom_index_1_dist_min_diff'] = df[f'molecule_atom_index_1_dist_min'] - df['dist']
    df[f'molecule_atom_index_1_dist_min_div'] = df[f'molecule_atom_index_1_dist_min'] / df['dist']
    df[f'molecule_atom_index_1_dist_std'] = df.groupby(['molecule_name', 'atom_index_1'])['dist'].transform('std')
    df[f'molecule_atom_index_1_dist_std_diff'] = df[f'molecule_atom_index_1_dist_std'] - df['dist']
    df[f'molecule_atom_index_1_dist_std_div'] = df[f'molecule_atom_index_1_dist_std'] / df['dist']
    df[f'molecule_atom_1_dist_mean'] = df.groupby(['molecule_name', 'atom_1'])['dist'].transform('mean')
    df[f'molecule_atom_1_dist_min'] = df.groupby(['molecule_name', 'atom_1'])['dist'].transform('min')
    df[f'molecule_atom_1_dist_min_diff'] = df[f'molecule_atom_1_dist_min'] - df['dist']
    df[f'molecule_atom_1_dist_min_div'] = df[f'molecule_atom_1_dist_min'] / df['dist']
    df[f'molecule_atom_1_dist_std'] = df.groupby(['molecule_name', 'atom_1'])['dist'].transform('std')
    df[f'molecule_atom_1_dist_std_diff'] = df[f'molecule_atom_1_dist_std'] - df['dist']
    df[f'molecule_type_0_dist_std'] = df.groupby(['molecule_name', 'type_0'])['dist'].transform('std')
    df[f'molecule_type_0_dist_std_diff'] = df[f'molecule_type_0_dist_std'] - df['dist']
    df[f'molecule_type_dist_mean'] = df.groupby(['molecule_name', 'type'])['dist'].transform('mean')
    df[f'molecule_type_dist_mean_diff'] = df[f'molecule_type_dist_mean'] - df['dist']
    df[f'molecule_type_dist_mean_div'] = df[f'molecule_type_dist_mean'] / df['dist']
    df[f'molecule_type_dist_max'] = df.groupby(['molecule_name', 'type'])['dist'].transform('max')
    df[f'molecule_type_dist_min'] = df.groupby(['molecule_name', 'type'])['dist'].transform('min')
    df[f'molecule_type_dist_std'] = df.groupby(['molecule_name', 'type'])['dist'].transform('std')
    df[f'molecule_type_dist_std_diff'] = df[f'molecule_type_dist_std'] - df['dist']
    return df

In [9]:
start_time = time.time()

train = distances(train)

print('Create closest features')

train = create_closest(train)

print('Create cos features')

train = add_cos_features(train)

print('Create groupby features', end='\n\n')

train = create_features(train)

train = reduce_mem_usage(train, verbose=False)

print('Train dataset shape is -> rows: {} cols:{}'.format(train.shape[0],train.shape[1]))
print('Structures dataset shape is  -> rows: {} cols:{}'.format(structures.shape[0],structures.shape[1]), end='\n\n')
print(f'Exe time: {(time.time() - start_time)/60:.2} min')

Create closest features
Create cos features
Create groupby features

Train dataset shape is -> rows: 463724 cols:93
Structures dataset shape is  -> rows: 2358657 cols:10

Exe time: 1.6 min


In [10]:
molecules_id = train['molecule_name']
X = train.drop(['id', 'scalar_coupling_constant', 'molecule_name'], axis=1)
y = train['scalar_coupling_constant']

X = encode_categoric(X)

In [11]:
print('X size', X.shape)

del train
gc.collect()
X.head()

X size (463724, 90)


Unnamed: 0,atom_index_0,atom_index_1,type,atom_0,x_0,y_0,z_0,EN_0,rad_0,n_bonds_0,bond_lengths_mean_0,atom_1,x_1,y_1,z_1,EN_1,rad_1,n_bonds_1,bond_lengths_mean_1,dist,dist_x,dist_y,dist_z,type_0,atom_index_closest_0,x_closest_0,y_closest_0,z_closest_0,atom_index_closest_1,x_closest_1,y_closest_1,z_closest_1,distance_0,distance_1,cos_0_1,cos_0,cos_1,molecule_couples,molecule_dist_mean,molecule_dist_min,...,molecule_atom_index_0_z_1_std,molecule_atom_index_0_dist_mean,molecule_atom_index_0_dist_mean_diff,molecule_atom_index_0_dist_mean_div,molecule_atom_index_0_dist_max,molecule_atom_index_0_dist_max_diff,molecule_atom_index_0_dist_max_div,molecule_atom_index_0_dist_min,molecule_atom_index_0_dist_min_diff,molecule_atom_index_0_dist_min_div,molecule_atom_index_0_dist_std,molecule_atom_index_0_dist_std_diff,molecule_atom_index_0_dist_std_div,molecule_atom_index_1_dist_mean,molecule_atom_index_1_dist_mean_diff,molecule_atom_index_1_dist_mean_div,molecule_atom_index_1_dist_max,molecule_atom_index_1_dist_max_diff,molecule_atom_index_1_dist_max_div,molecule_atom_index_1_dist_min,molecule_atom_index_1_dist_min_diff,molecule_atom_index_1_dist_min_div,molecule_atom_index_1_dist_std,molecule_atom_index_1_dist_std_diff,molecule_atom_index_1_dist_std_div,molecule_atom_1_dist_mean,molecule_atom_1_dist_min,molecule_atom_1_dist_min_diff,molecule_atom_1_dist_min_div,molecule_atom_1_dist_std,molecule_atom_1_dist_std_diff,molecule_type_0_dist_std,molecule_type_0_dist_std_diff,molecule_type_dist_mean,molecule_type_dist_mean_diff,molecule_type_dist_mean_div,molecule_type_dist_max,molecule_type_dist_min,molecule_type_dist_std,molecule_type_dist_std_diff
0,4,0,0,0,0.978993,1.966216,0.033738,2.2,0.43,1,1.093215,0,-0.030958,1.54775,0.031679,2.55,0.82,4,1.205017,1.093215,1.020002,0.175114,4e-06,0,0,-0.030958,1.54775,0.031679,5,-0.560824,1.950029,-0.835807,1.093215,1.093215,0.308408,-1.0,-0.308408,30,2.208676,1.061574,...,0.795651,2.172079,1.078864,1.986873,3.081493,1.988277,2.818743,1.093215,0.0,1.0,0.683683,-0.409532,0.625388,1.523985,0.4307693,1.394039,2.169789,1.076574,1.984778,1.093215,-4.300595e-07,1.0,0.5895275,-0.503688,0.5392602,2.14889,1.061574,-0.031642,0.971056,0.854799,-0.238416,0.013786,-1.07943,1.089438,-0.003777,0.996545,1.097349,1.061574,0.013786,-1.07943
1,4,1,2,0,0.978993,1.966216,0.033738,2.2,0.43,1,1.093215,0,0.014854,0.009625,-0.020822,2.55,0.82,4,1.298219,2.181924,0.929565,3.82825,0.002977,1,0,-0.030958,1.54775,0.031679,7,0.519201,-0.374254,0.874962,1.093215,1.097341,-0.292232,-0.751521,-0.090196,30,2.208676,1.061574,...,0.795651,2.172079,-0.009844,0.995488,3.081493,0.899569,1.412282,1.093215,-1.088708,0.501033,0.683683,-1.498241,0.31334,2.076232,-0.1056918,0.95156,3.722016,1.540092,1.705842,1.097341,-1.084582,0.502924,0.9651432,-1.216781,0.4423359,2.14889,1.061574,-1.12035,0.486531,0.854799,-1.327125,0.201344,-1.98058,2.16658,-0.015344,0.992968,2.263643,2.09436,0.054043,-2.12788
2,4,2,5,0,0.978993,1.966216,0.033738,2.2,0.43,1,1.093215,0,0.690991,-0.499546,-1.208576,2.55,0.82,2,1.330296,2.776017,0.082945,6.079981,1.543346,2,0,-0.030958,1.54775,0.031679,7,0.519201,-0.374254,0.874962,1.093215,2.09436,0.051004,-0.436691,0.489832,30,2.208676,1.061574,...,0.795651,2.172079,-0.603938,0.782444,3.081493,0.305475,1.110041,1.093215,-1.682802,0.393807,0.683683,-2.092334,0.246282,2.573417,-0.2026003,0.927018,3.436049,0.660032,1.237762,2.09436,-0.6816577,0.754448,0.5255777,-2.25044,0.189328,2.14889,1.061574,-1.714444,0.382409,0.854799,-1.921218,0.406477,-2.369541,3.184754,0.408737,1.147239,3.722016,2.776017,0.370494,-2.405524
3,4,5,3,0,0.978993,1.966216,0.033738,2.2,0.43,1,1.093215,1,-0.560824,1.950029,-0.835807,2.2,0.43,1,1.093215,1.768448,2.371035,0.000262,0.75611,1,0,-0.030958,1.54775,0.031679,0,-0.030958,1.54775,0.031679,1.093215,1.093215,-0.308408,-0.808829,0.808829,30,2.208676,1.061574,...,0.795651,2.172079,0.403632,1.228241,3.081493,1.313045,1.742485,1.093215,-0.675232,0.618178,0.683683,-1.084764,0.386601,1.768448,0.0,1.0,1.768448,0.0,1.0,1.768448,0.0,1.0,,,,2.328248,1.752182,-0.016265,0.990802,0.530283,-1.238164,0.201344,-1.567104,1.767286,-0.001162,0.999343,1.774257,1.752182,0.010435,-1.758013
4,4,6,3,0,0.978993,1.966216,0.033738,2.2,0.43,1,1.093215,1,-0.545767,1.882255,0.937091,2.2,0.43,1,1.093934,1.774257,2.324893,0.007049,0.816045,1,0,-0.030958,1.54775,0.031679,0,-0.030958,1.54775,0.031679,1.093215,1.093934,-0.316153,-0.811081,0.811357,30,2.208676,1.061574,...,0.795651,2.172079,0.397822,1.224219,3.081493,1.307236,1.736779,1.093215,-0.681042,0.616154,0.683683,-1.090574,0.385335,1.774257,-7.130347e-08,1.0,1.774257,0.0,1.0,1.774257,-1.426069e-07,1.0,1.008383e-07,-1.774257,5.683412e-08,2.328248,1.752182,-0.022075,0.987558,0.530283,-1.243974,0.201344,-1.572913,1.767286,-0.006971,0.996071,1.774257,1.752182,0.010435,-1.763822


### Let's make group hold-out validation with 25% validation size
I used GroupKFold because train/test split has no intersection in molecules, so this validation will be more veridical.

In [12]:
kf = GroupKFold(4)
for tr_idx, val_idx in kf.split(X, groups=molecules_id):
    tr_X = X.iloc[tr_idx]; val_X = X.iloc[val_idx]
    tr_y = y.iloc[tr_idx]; val_y = y.iloc[val_idx]
    
    break

# Permutation importance implementation

 - `metric` - score function that have arguments: true_y and preds
 - `theshold` - threshold of score changing, confirming that score is useful
 - `minimize` - metric should be minimized or maximazed. In this competition `minimize=True`

In [13]:
def permutation_importance(model, X_val, y_val, metric, threshold=0.005,
                           minimize=True, verbose=True):
    results = {}
    
    y_pred = model.predict(X_val)
    
    results['base_score'] = metric(y_val, y_pred)
    if verbose:
        print(f'Base score {results["base_score"]:.5}')

    
    for col in tqdm(X_val.columns):
        freezed_col = X_val[col].copy()

        X_val[col] = np.random.permutation(X_val[col])
        preds = model.predict(X_val)
        results[col] = metric(y_val, preds)

        X_val[col] = freezed_col
        
        if verbose:
            print(f'column: {col} - {results[col]:.5}')
    
    if minimize:
        bad_features = [k for k in results if results[k] < results['base_score'] + threshold]
    else:
        bad_features = [k for k in results if results[k] > results['base_score'] + threshold]
    bad_features.remove('base_score')
    
    return results, bad_features

## Usage

Fit model on all generated features

In [14]:
def catboost_fit(model, X_train, y_train, X_val, y_val):
    train_pool = Pool(X_train, y_train)
    val_pool = Pool(X_val, y_val)
    model.fit(train_pool, eval_set=val_pool)
    
    return model

model = CatBoostRegressor(iterations=20000, 
                          max_depth=9,
                          objective='MAE',
                          task_type='GPU',
                          verbose=False)
model = catboost_fit(model, tr_X, tr_y, val_X, val_y)

This wrapper for function is used for passing `types=val_X['type']` like it was defined by default: <br>
`def group_mean_log_mae(y_true, y_pred, types=val_X['type']): ...`

In [15]:
from functools import partial
metric = partial(group_mean_log_mae, types=val_X['type'])

In [16]:
results, bad_features = permutation_importance(model=model,
                                               X_val=val_X,
                                               y_val=val_y,
                                               metric=metric,
                                               verbose=False)

HBox(children=(IntProgress(value=0, max=90), HTML(value='')))




`result` values contains score after permutatation of key column

In [17]:
results

{'base_score': -0.14327074537637194,
 'atom_index_0': -0.14066494665165488,
 'atom_index_1': -0.13329037208420935,
 'type': 2.5953407171680727,
 'atom_0': -0.14327074537637194,
 'x_0': -0.1419964634696653,
 'y_0': -0.14080725030983426,
 'z_0': -0.14242846520233154,
 'EN_0': -0.14327074537637194,
 'rad_0': -0.14327074537637194,
 'n_bonds_0': -0.14327074537637194,
 'bond_lengths_mean_0': 0.20624768187940296,
 'atom_1': 0.29600325849170317,
 'x_1': -0.1420629834115149,
 'y_1': -0.141971044589903,
 'z_1': -0.14271138242847736,
 'EN_1': 0.07769982389057897,
 'rad_1': -0.10666663454390833,
 'n_bonds_1': 0.4687891551202394,
 'bond_lengths_mean_1': 0.4765001359851019,
 'dist': 0.995829198181776,
 'dist_x': -0.14150346383018944,
 'dist_y': -0.13825602903436457,
 'dist_z': -0.1400977667252704,
 'type_0': 0.5618337209120853,
 'atom_index_closest_0': -0.08589956690103202,
 'x_closest_0': -0.14227641893602228,
 'y_closest_0': -0.14169259226264452,
 'z_closest_0': -0.14250289587364814,
 'atom_index_

In [18]:
bad_features

['atom_index_0',
 'atom_0',
 'x_0',
 'y_0',
 'z_0',
 'EN_0',
 'rad_0',
 'n_bonds_0',
 'x_1',
 'y_1',
 'z_1',
 'dist_x',
 'dist_z',
 'x_closest_0',
 'y_closest_0',
 'z_closest_0',
 'atom_index_closest_1',
 'x_closest_1',
 'y_closest_1',
 'z_closest_1',
 'molecule_atom_index_0_y_1_mean',
 'molecule_atom_index_0_y_1_mean_diff',
 'molecule_atom_index_0_y_1_mean_div',
 'molecule_atom_index_0_y_1_max',
 'molecule_atom_index_0_y_1_max_diff']

Let's check what the score without `bad_features`

In [19]:
tr_X_reduced = tr_X.drop(bad_features, axis=1).copy()
val_X_reduced = val_X.drop(bad_features, axis=1).copy()

In [20]:
model_reduced = CatBoostRegressor(iterations=20000, 
                          max_depth=9,
                          objective='MAE',
                          task_type='GPU',
                          verbose=False)
model_reduced = catboost_fit(model, tr_X_reduced, tr_y, val_X_reduced, val_y)

y_pred = model_reduced.predict(val_X_reduced)
new_score = metric(val_y, y_pred)

print(f'Original score: {results["base_score"]:.3}, amount of features: {len(results)-1}')
print(f'Score after removing bad_features: {new_score:.3}, amount of features: {tr_X_reduced.shape[1]}')

Original score: -0.143, amount of features: 90
Score after removing bad_features: -0.159, amount of features: 65


## Also let's check implementation from eli5
`get_score_importances` has some parameters:
 - `score_func`: your function with model inference and scoring.
 - `X`: features
 - `y`: target
 - `n_iter=5`: how many times columns will be permuted.
 - `columns_to_shuffle=None`: subset of columns to shuffle. If None, then all columns will be checked.
 - `random_state=None`

Function returns:
 - `base_score`: score on original features
 - `score_decreases`: list of length `n_iter` with feature importance arrays

In [21]:
from eli5.permutation_importance import get_score_importances

def score(X, y):
    y_pred = model.predict(X)
    return metric(y, y_pred)

base_score, score_decreases = get_score_importances(score, np.array(val_X), val_y, n_iter=1)

threshold = 0.001
bad_features = val_X.columns[score_decreases[0] > -threshold]

In [22]:
tr_X_reduced = tr_X.drop(bad_features, axis=1).copy()
val_X_reduced = val_X.drop(bad_features, axis=1).copy()

model_reduced = CatBoostRegressor(iterations=20000, 
                          max_depth=9,
                          objective='MAE',
                          task_type='GPU',
                          verbose=False)
model_reduced = catboost_fit(model_reduced, tr_X_reduced, tr_y, val_X_reduced, val_y)

y_pred = model_reduced.predict(val_X_reduced)
new_score = metric(val_y, y_pred)

print(f'Original score: {base_score:.3}, amount of features: {len(results)-1}')
print(f'Score after removing bad_features: {new_score:.3}, amount of features: {val_X_reduced.shape[1]}')

Original score: 2.09, amount of features: 90
Score after removing bad_features: 0.57, amount of features: 14


# Conclusion
This method improves CV and removes redundant features from your data. You can try it on your data and models.