## feature importance xgboost vs ppscore
In this work, I tried to build two different model with feature importance via xgboost and ppscore.

This work inspired by a discussion on [ppscore github page](https://github.com/8080labs/ppscore/issues/22).

For understanding and the proccess, I set up my experiment as follow:
 - train all dataset with a xgboost regressor
 - find 15 most important feature
 - train a new model on those feature and print the scores
 - use ppscore to find out the best 15 feature to close value 
 - train another model based on ppscore selected features
 - compare both model's ouptut

In [30]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
import xgboost as xb
import requests
import json
import pandas as pd
from ppscore import score,predictors
import numpy as np
import warnings

from dateutil.relativedelta import  *
from datetime import date, timedelta
import dateutil
from ta import add_all_ta_features
from ta.utils import dropna
import urllib


pd.set_option ('display.float_format', lambda x: '%.3f' % x)
warnings.filterwarnings("ignore")

I will use a stock data since it was hard to find a data that ppscore can work on.

In [31]:
def get_stock(stock_name="GARAN", periods=5000):
    end = date.today()
    start = end - timedelta(days=periods)
    startt = str(start).replace("-","")
    endd = str(end).replace("-","")
    temp_val = "https://web-paragaranti-pubsub.foreks.com/web-services/historical-data?userName=undefined&exchange=BIST&name="+stock_name+"&market=E&group=E&last=500&period=1440&from="+startt+"000000&to="+endd+"000000"
    with urllib.request.urlopen(temp_val) as url:
        data_stock = json.loads(url.read().decode())
    df_stock=pd.DataFrame(data_stock['dataSet'])#[['date', 'close']]
    df_stock['date']=pd.to_datetime(df_stock.date, unit='ms').dt.strftime('%Y-%m-%d')
    return df_stock

In [32]:
df = get_stock(stock_name="GARAN", periods=2900)
df = df.set_index('date')
df.head()

Unnamed: 0_level_0,close,high,low,open,value,volume
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2014-09-16,7.25,7.345,7.198,7.319,0.0,736434000.0
2014-09-17,7.043,7.198,7.043,7.173,0.0,1048400000.0
2014-09-18,7.0,7.173,7.0,7.077,0.0,694376000.0
2014-09-21,7.129,7.19,7.0,7.0,0.0,815603000.0
2014-09-22,7.095,7.232,7.086,7.173,0.0,719069000.0


I am adding some indicators to stock data to enrich number of features 

In [33]:
df_i = add_all_ta_features (df, 'open', 'high', 'low', 'close', 'volume', fillna=False)
#dropping some columns that contains too much null values
df_i=df_i.drop(['trend_psar_up', 'trend_psar_down','trend_stc','trend_mass_index','trend_trix', 'high', 'low', 'open','value'],axis=1).dropna()
df_i.sample(5)

Unnamed: 0_level_0,close,volume,volume_adi,volume_obv,volume_cmf,volume_fi,volume_em,volume_sma_em,volume_vpt,volume_vwap,...,momentum_ppo,momentum_ppo_signal,momentum_ppo_hist,momentum_pvo,momentum_pvo_signal,momentum_pvo_hist,momentum_kama,others_dr,others_dlr,others_cr
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2014-12-22,8.044,1089680000.0,-4779493161.359,2226193000.0,-0.044,-13571624.026,0.001,-0.0,-1935643.964,7.898,...,0.921,1.067,-0.146,7.18,7.835,-0.655,7.931,-1.698,-1.712,10.945
2017-01-02,6.688,730724765.8,-33514141424.721,-9834536037.33,0.033,-5921697.656,-0.001,-0.0,-9149309.578,6.828,...,-0.106,-0.061,-0.045,-18.836,-18.752,-0.085,6.765,-1.185,-1.192,-7.752
2018-05-24,8.897,877050243.07,-44333037320.264,-4238119863.56,-0.079,19172468.454,0.001,0.001,12030587.104,8.729,...,-1.715,-2.322,0.607,9.949,5.903,4.046,8.964,3.112,3.064,22.716
2017-04-23,8.768,770459243.92,-29555141328.512,-1458568853.7,0.045,32982638.434,0.002,0.002,21790789.297,8.485,...,2.366,2.074,0.293,11.47,7.184,4.286,8.402,2.555,2.522,20.934
2015-01-04,8.243,784093043.65,-4907695703.3,3443147113.07,-0.117,16372441.897,0.002,0.001,11296469.158,7.864,...,1.191,1.058,0.132,-12.447,-9.101,-3.346,7.988,1.487,1.476,13.689


Now let's build an xgboost regressor and find most important features for prediction on close column. 
I will use regressor with default paramters and there will be no hyperparameter tuning

In [34]:
X = df_i.drop ('close', axis=1)
y = df_i[['close']]

X_train, X_test, y_train, y_test = train_test_split (X, y, test_size=0.3, random_state=42)

In [35]:
clf=xb.XGBRegressor()
    
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print('xgboost mean square error score: ', str (mean_absolute_error (pred, y_test)))
print('xgboost mean absolotue percentage error score:', str (mean_absolute_percentage_error (pred, y_test)))

xgboost mean square error score:  0.034358967674663864
xgboost mean absolotue percentage error score: 0.0037773913939845344


ppscore predictors function return correlation like score with algortihm type used, I am going to choose best 15 feature here.

In [36]:
ppscore_preds = predictors(df_i, 'close')[:15]
ppscore_preds

Unnamed: 0,x,y,ppscore,case,is_valid_score,metric,baseline_score,model_score,model
0,others_cr,close,0.998,regression,True,mean absolute error,1.431,0.003,DecisionTreeRegressor()
1,trend_ichimoku_conv,close,0.833,regression,True,mean absolute error,1.431,0.239,DecisionTreeRegressor()
2,volatility_dch,close,0.824,regression,True,mean absolute error,1.431,0.252,DecisionTreeRegressor()
3,volatility_dcl,close,0.817,regression,True,mean absolute error,1.431,0.261,DecisionTreeRegressor()
4,volatility_dcm,close,0.817,regression,True,mean absolute error,1.431,0.261,DecisionTreeRegressor()
5,trend_ichimoku_base,close,0.802,regression,True,mean absolute error,1.431,0.283,DecisionTreeRegressor()
6,trend_ichimoku_a,close,0.789,regression,True,mean absolute error,1.431,0.303,DecisionTreeRegressor()
7,trend_ema_fast,close,0.784,regression,True,mean absolute error,1.431,0.309,DecisionTreeRegressor()
8,trend_ichimoku_b,close,0.782,regression,True,mean absolute error,1.431,0.312,DecisionTreeRegressor()
9,momentum_kama,close,0.763,regression,True,mean absolute error,1.431,0.339,DecisionTreeRegressor()


for best 15 feature based on model we trained, I used get_booster method. 
Xgboost calculate this with features action inside the tree, based on a stackoverflow discussion here is the method:

How the importance is calculated: either “weight”, “gain”, or “cover”
- ”weight” is the number of times a feature appears in a tree
- ”gain” is the average gain of splits which use the feature
- ”cover” is the average coverage of splits which use the feature where coverage is defined as the number of samples affected by the split

I choose weight method to calculate feature importances.

In [37]:
feature_important_xgboost = pd.DataFrame(clf.get_booster().get_score(importance_type='weight'), index=[0]).T[:15]
feature_important_xgboost

Unnamed: 0,0
volume,263.0
volume_adi,143.0
volume_obv,88.0
volume_cmf,89.0
volume_fi,79.0
volume_em,98.0
volume_sma_em,59.0
volume_vpt,60.0
volume_vwap,26.0
volume_mfi,72.0


so let's make 2 list from both method and train new models

In [38]:
ppscore_list = list(ppscore_preds['x'].values)
xgboost_list = list(feature_important_xgboost.index)

In [39]:
X_pp = df_i[ppscore_list]
y_pp = df_i[['close']]

X_train_pp, X_test_pp, y_train_pp, y_test_pp = train_test_split (X_pp, y_pp, test_size=0.3, random_state=42)

clf_pp=xb.XGBRegressor()
    
clf_pp.fit(X_train_pp, y_train_pp)
pred_pp = clf_pp.predict(X_test_pp)
print('xgboost mean square error score - ppscore features: ', str (mean_absolute_error (pred_pp, y_test_pp)))
print('xgboost mean absolotue percentage error score - ppscore features:', str (mean_absolute_percentage_error (pred_pp, y_test_pp)))

xgboost mean square error score - ppscore features:  0.01706578003087808
xgboost mean absolotue percentage error score - ppscore features: 0.001776958134873217


In [40]:
X_xb = df_i[xgboost_list]
y_xb = df_i[['close']]

X_train_xb, X_test_xb, y_train_xb, y_test_xb = train_test_split (X_xb, y_xb, test_size=0.3, random_state=42)

clf_xb=xb.XGBRegressor()
    
clf_xb.fit(X_train_pp, y_train_xb)
pred_xb = clf_pp.predict(X_test_xb)
print('xgboost mean square error score - xgboost features: ', str (mean_absolute_error (pred_xb, y_test_xb)))
print('xgboost mean absolotue percentage error score - xgboost features:', str (mean_absolute_percentage_error (pred_xb, y_test_xb)))

xgboost mean square error score - xgboost features:  7.579390743928952
xgboost mean absolotue percentage error score - xgboost features: 0.4702220262694828


## Conclusions
- It seems like ppscore feature selection worked better in this case.
- ppscore is picky about calculating scores on a given data set, it was hard to find one but feel free to try another and share result.
- ppscore feature selection even better than our first model's scores.
- You can try this sklearn regression models.
- This is only educational purpose work.