In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
import seaborn as sns
sns.set_style("whitegrid")
# Bigger font
# sns.set_context("poster")
sns.set_context("talk")
# Figure size
plt.rcParams['figure.figsize'] = (10, 4)
# np.random.seed(123)

Example:

Predicting the best banner to display.

In [2]:
data = {'category_ad':['auto_part','music_tickets','mobile_phones'],
        'category_site':['game_news','music_news','auto_blog'],
        'is_clicked':[0,1,0]}
df = pd.DataFrame(data)

The interaction here is a combination of two categorical features, category_ad and category_site.

A simple string concatenation is enough to build it. Here it is on the example data:

In [3]:
df['ad_site'] = df['category_ad'] + '|' + df['category_site']
df

Unnamed: 0,category_ad,category_site,is_clicked,ad_site
0,auto_part,game_news,0,auto_part|game_news
1,music_tickets,music_news,1,music_tickets|music_news
2,mobile_phones,auto_blog,0,mobile_phones|auto_blog


## When would a combination be bad?

A combination is bad when different input pairs collapse into the same value. For example, when combining two real-valued features with a sum:

- f1+f2=100: it can be an interaction of 0 and 100, of 100 and 0, or of 90 and 10, and the sum cannot tell them apart.

For strings, it is also better to concatenate with a delimiter like "|":

- "abc": without a delimiter it can be an interaction of "a" and "bc", or of "ab" and "c".
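To make the collision concrete, here is a minimal sketch using the values from the bullets above. The column names (`s1`, `s2`, `no_delim`, `with_delim`) are made up for illustration.

```python
import pandas as pd

# Three different (f1, f2) pairs that all produce the same sum
pairs = pd.DataFrame({'f1': [0, 100, 90], 'f2': [100, 0, 10]})
pairs['sum'] = pairs['f1'] + pairs['f2']
print(pairs['sum'].nunique())  # number of distinct values left after summing

# Two different string pairs: ("a", "bc") and ("ab", "c")
strings = pd.DataFrame({'s1': ['a', 'ab'], 's2': ['bc', 'c']})
strings['no_delim'] = strings['s1'] + strings['s2']          # both become "abc"
strings['with_delim'] = strings['s1'] + '|' + strings['s2']  # "a|bc" vs "ab|c"
print(strings)
```

With the delimiter the two rows stay distinguishable, which is why the banner example joins with "|".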


## Interaction order

- 2 or more features can interact
- The examples below will be of 2nd order
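For reference, a higher-order interaction can be built the same way as a 2nd order one. This is only a sketch: the third feature `f3` and its values are hypothetical and do not appear elsewhere in these notes.

```python
import pandas as pd

# f3 is a made-up third categorical feature, used only for illustration
df_order = pd.DataFrame({'f1': ['A', 'B'], 'f2': ['X', 'Y'], 'f3': ['p', 'q']})

# 2nd order: two features joined with a delimiter
df_order['f1_f2'] = df_order['f1'] + '|' + df_order['f2']

# 3rd order: three features joined row by row with the same delimiter
df_order['f1_f2_f3'] = df_order[['f1', 'f2', 'f3']].agg('|'.join, axis=1)
print(df_order)
```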

There are several ways to construct an interaction between features, depending on whether they are categorical or real-valued.

## For Categorical Features

### A. Concatenating text values and doing one-hot encoding

In [4]:
data = {'f1':['A','B','B','A'],'f2':['X','Y','Z','Z']}
df = pd.DataFrame(data)
df['f_join'] = df['f1'] + '|' + df['f2']
df

Unnamed: 0,f1,f2,f_join
0,A,X,A|X
1,B,Y,B|Y
2,B,Z,B|Z
3,A,Z,A|Z


In [5]:
df_features = pd.get_dummies(df['f_join'])
df_features

Unnamed: 0,A|X,A|Z,B|Y,B|Z
0,1,0,0,0
1,0,0,1,0
2,0,0,0,1
3,0,1,0,0


### B. One-hot encoding each feature and then multiplying the matrices column by column

In [6]:
data = {'f1':['A','B','B','A'],'f2':['X','Y','Z','Z']}
df = pd.DataFrame(data)
feat_1 = pd.get_dummies(df['f1'])
feat_2 = pd.get_dummies(df['f2'])
print(feat_1)
print(feat_2)

   A  B
0  1  0
1  0  1
2  0  1
3  1  0
   X  Y  Z
0  1  0  0
1  0  1  0
2  0  0  1
3  0  0  1


In [7]:
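# Build every possible f1/f2 pair as a column name (AX, AY, ..., BZ)
col = pd.MultiIndex.from_product([df.f1.unique(),df.f2.unique()]).map(''.join)
# Join each row's values, one-hot encode the result, and add all-zero columns for unseen pairs
df.apply(''.join, axis=1).str.get_dummies().reindex(columns=col,fill_value=0)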

Unnamed: 0,AX,AY,AZ,BX,BY,BZ
0,1,0,0,0,0,0
1,0,0,0,0,1,0
2,0,0,0,0,0,1
3,0,0,1,0,0,0
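The cell above reaches the result through string joins rather than an actual matrix product. A minimal sketch of the explicit column-by-column multiplication, assuming the same `f1`/`f2` data, could look like this (the name `interactions` is mine):

```python
import pandas as pd

df = pd.DataFrame({'f1': ['A', 'B', 'B', 'A'], 'f2': ['X', 'Y', 'Z', 'Z']})

# One-hot encode each feature separately; cast to int so the product is 0/1
feat_1 = pd.get_dummies(df['f1']).astype(int)
feat_2 = pd.get_dummies(df['f2']).astype(int)

# Multiply every column of feat_1 with every column of feat_2, element-wise.
# A product column is 1 only on rows where both original values occur together.
interactions = pd.DataFrame(
    {f"{a}{b}": feat_1[a] * feat_2[b]
     for a in feat_1.columns
     for b in feat_2.columns}
)
print(interactions)
```

Pairs that never occur together (here AY and BX) come out as all-zero columns, the same as with the `reindex` approach.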


## For Real Valued Features

### By applying an operation between two features
- Multiplication
- Sum
- Diff
- Division


These kinds of combinations make overfitting easier.  
But they are very effective for tree-based (TB) methods.

In [8]:
data = {'f1':[1.2,3.4,5.6,7.8], 'f2':[0.0,0.1,1.0,-1.0]}
df = pd.DataFrame(data)
df['f_join'] = df['f1'] * df['f2']
df

Unnamed: 0,f1,f2,f_join
0,1.2,0.0,0.0
1,3.4,0.1,0.34
2,5.6,1.0,5.6
3,7.8,-1.0,-7.8
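One way to see why a product can help a tree-based model: a tree splits on one feature at a time, so a target that depends on the sign of f1*f2 cannot be separated by a single split on the raw features. The sketch below uses made-up toy data and a depth-1 `DecisionTreeClassifier` as a stand-in for one tree split. It is an illustration under those assumptions, not a general result.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (assumed for illustration): target is 1 when f1 and f2 share a sign
X_raw = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
y = np.array([1, 0, 0, 1])

# Interaction feature: element-wise product of the two columns
X_mult = (X_raw[:, 0] * X_raw[:, 1]).reshape(-1, 1)

# A depth-1 tree can make only one split
stump_raw = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_raw, y)
stump_mult = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_mult, y)

acc_raw = stump_raw.score(X_raw, y)     # training accuracy on raw features
acc_mult = stump_mult.score(X_mult, y)  # training accuracy on the product
print(acc_raw, acc_mult)
```

A deeper tree could still learn this from the raw features, but it would need more splits to do so.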


## How to moderate feature interactions

Because there are too many possible interactions, they have to be moderated with

- Dimensionality reduction
- Feature selection

Feature selection seems to be the more common approach.
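To get a feel for how quickly the number of candidates grows, here is a small sketch that builds every pairwise product with `itertools.combinations`. The third column `f3` and the figure of 100 features are hypothetical, chosen only to show the count.

```python
from itertools import combinations
from math import comb

import pandas as pd

# f3 is a made-up extra feature, only here to have more than one pair
df_many = pd.DataFrame({'f1': [1.2, 3.4], 'f2': [0.0, 0.1], 'f3': [2.0, 3.0]})

# One product column per unordered pair of features
pairwise_products = pd.DataFrame(
    {f"{a}*{b}": df_many[a] * df_many[b]
     for a, b in combinations(df_many.columns, 2)}
)
print(pairwise_products.columns.tolist())

# Number of unordered pairs for a hypothetical 100-feature dataset,
# before even considering sum, diff and division
n_pairs_100 = comb(100, 2)
print(n_pairs_100)
```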


In [9]:
data = {'f1':[1.2,3.4,5.6,7.8], 'f2':[0.0,0.1,1.0,-1.0], 'target':[1,0,0,1]}
df_new = pd.DataFrame(data)
df_new['mult'] = df_new['f1'] * df_new['f2']
df_new['sum'] = df_new['f1'] + df_new['f2']
df_new['diff'] = df_new['f1'] - df_new['f2']
df_new['div'] = df_new['f1'] / df_new['f2']
# Division by zero gives +inf or -inf; replace both with 0
df_new = df_new.replace([np.inf, -np.inf], 0)

df_new

Unnamed: 0,f1,f2,target,mult,sum,diff,div
0,1.2,0.0,1,0.0,1.2,1.2,0.0
1,3.4,0.1,0,0.34,3.5,3.3,34.0
2,5.6,1.0,0,5.6,6.6,4.6,5.6
3,7.8,-1.0,1,-7.8,6.8,8.8,-7.8


In [10]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=123)
X = df_new[['mult','sum','diff','div']]
model.fit(X, df_new['target'])

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=123, verbose=0, warm_start=False)

In [11]:
# compute feature importances
sorted_features = pd.DataFrame({'feature':X.columns, 'importance':model.feature_importances_}).sort_values('importance', ascending=False)
sorted_features

Unnamed: 0,feature,importance
0,mult,0.4
2,diff,0.4
3,div,0.1
1,sum,0.0


Only the most important interactions would then be kept, so the process is:

1. Fit a RF
2. Get the feature importances
3. Select a few of the most important features and join them together
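The three steps can be sketched end to end as follows. This is a rough outline under my own assumptions: `top_k = 2` and `n_estimators=50` are arbitrary choices, and "join them together" is read here as appending the selected interactions to the original features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df_sel = pd.DataFrame({'f1': [1.2, 3.4, 5.6, 7.8],
                       'f2': [0.0, 0.1, 1.0, -1.0],
                       'target': [1, 0, 0, 1]})

# Candidate interactions between f1 and f2
candidates = pd.DataFrame({
    'mult': df_sel['f1'] * df_sel['f2'],
    'sum': df_sel['f1'] + df_sel['f2'],
    'diff': df_sel['f1'] - df_sel['f2'],
    'div': (df_sel['f1'] / df_sel['f2']).replace([np.inf, -np.inf], 0),
})

# 1. Fit a RF on the candidate interactions
model = RandomForestClassifier(n_estimators=50, random_state=123)
model.fit(candidates, df_sel['target'])

# 2. Get the feature importances, most important first
importances = pd.Series(model.feature_importances_,
                        index=candidates.columns).sort_values(ascending=False)

# 3. Keep the top_k interactions (top_k is an arbitrary choice here)
#    and join them to the original features
top_k = 2
selected = importances.head(top_k).index.tolist()
X_selected = pd.concat([df_sel[['f1', 'f2']], candidates[selected]], axis=1)
print(X_selected.columns.tolist())
```

With only four rows the importances are not reliable, so which interactions get selected here should not be read as meaningful.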

## Another feature interaction: Features from tree leaves

Leaves can be used as features. Here are examples of how to get the leaf indices. I still need to try them out to know whether I am using them correctly.

In [12]:
data = {'f1':[1.2,3.4,5.6,7.8], 'f2':[0.0,0.1,1.0,-1.0], 'target':[1,0,0,1]}
df_2 = pd.DataFrame(data)

### A. Scikit-learn's RF

In [13]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=10,random_state=1)
model.fit(df_2[['f1','f2']], df_2['target'])
leave_indices = model.apply(df_2[['f1','f2']])
leave_indices

array([[1, 1, 2, 1, 1, 2, 1, 1, 2, 1],
       [1, 4, 2, 1, 1, 2, 3, 2, 3, 2],
       [2, 4, 2, 1, 1, 2, 3, 2, 3, 2],
       [1, 3, 1, 2, 2, 1, 4, 2, 4, 2]])

Each row holds, for one sample, the index of the leaf it falls into in each of the 10 trees. These are leaf indices, not predictions.
So to read the **INDICES OF THE LEAVES** for a single tree, read downwards by column.

In [14]:
model.predict(df_2[['f1','f2']])

array([1, 0, 0, 1])

In [15]:
leave_indices.mean(axis=1)

array([1.3, 2.1, 2.2, 2.2])

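So what can be used as features is the array of leaf indices. Since leaf indices are node identifiers rather than ordered quantities, it is probably safer to treat each tree's column as a categorical feature than to rely on the row means.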
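A common way to turn leaf indices into model inputs is to one-hot encode them, one block of columns per tree. The sketch below assumes scikit-learn's `OneHotEncoder` with default settings. The names `leaf_indices` and `leaf_features` are mine, and I have not checked how well this works on real data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

df_leaf = pd.DataFrame({'f1': [1.2, 3.4, 5.6, 7.8],
                        'f2': [0.0, 0.1, 1.0, -1.0],
                        'target': [1, 0, 0, 1]})
X_leaf = df_leaf[['f1', 'f2']]

model = RandomForestClassifier(n_estimators=10, random_state=1)
model.fit(X_leaf, df_leaf['target'])

# Shape (n_samples, n_estimators): leaf index of each sample in each tree
leaf_indices = model.apply(X_leaf)

# Treat each tree's leaf index as a categorical value and one-hot encode it.
# The result is a sparse matrix with exactly one active column per tree per row.
encoder = OneHotEncoder(handle_unknown='ignore')
leaf_features = encoder.fit_transform(leaf_indices)
print(leaf_indices.shape, leaf_features.shape)
```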

### B. In XGBoost

    model.predict(dm_train, pred_leaf=True)

This does not train anything: on an already trained model, it outputs the index of the leaf each sample falls into, for every tree.

    xgboost.train will ignore parameter n_estimators (number of trees), while xgboost.XGBRegressor accepts. In xgboost.train, boosting iterations (i.e. n_estimators) is controlled by num_boost_round(default: 10)

https://datascience.stackexchange.com/questions/17282/xgbregressor-vs-xgboost-train-huge-speed-difference

So with xgboost.train, num_boost_round is the number of trees, and the parameters below are defined accordingly.

In [16]:
import xgboost as xgb

dm_train = xgb.DMatrix(df_2[['f1','f2']].values,
                       label=df_2['target'].values,
                       feature_names=['f1','f2'])
# target = xgb.DMatrix(df_2['target'].values)
# X = 
param = {'max_depth':10, 
         'subsample':1,
         'min_child_weight':0.5,
         'eta':0.3, 
         'seed':1,
         'eval_metric':'auc',
        }

model = xgb.train(param, dm_train, num_boost_round=10)
leave_indices = model.predict(dm_train, pred_leaf=True)
leave_indices
## OUTPUT : array_like, shape = [n_samples, n_estimators]
# For each datapoint x in X and for each tree in the forest,
# return the index of the leaf that x ends up in.

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
       [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int32)

In [17]:
model.predict(dm_train)

array([0.94631296, 0.05368708, 0.05368708, 0.94631296], dtype=float32)

So the array that can be used as a feature could be this one:

In [18]:
leave_indices.mean(axis=1)

array([1., 2., 2., 1.])