- (A) 31.01.2020 20.20. Fix axis labels and legends on multiclass probability correlation plots
- (A) 31.01.2020 20.20. Make multiclass probability correlation plots for test samples.
- (A) 30.01.2020 00.45. Optimize xgboost multiclass classification model
- (A) 30.01.2020 02.45. Incorporate multimodality of Revenue* distributions in the models
- (A) 30.01.2020 00.45. Optimize xgboost regression models
- (A) 30.01.2020 00.45. 'Revenue_MF'
- (A) 30.01.2020 00.45. 'Revenue_CC'
- (A) 30.01.2020 00.45. 'Revenue_CL'
- (A) 30.01.2020 23.15. Calculate what fraction of Revenue is generated by outliers
- (B) 28.01.2020 21.20. Ask Michael if missing entries in Products_ActBalance can be defaulted to 0 (see log note)
- (B) 29.01.2020 00.55. Make correlation plots with and without suppression of default imputed values.
- (C) 27.01.2020 23:08. Add automatic provenance generation for the output file. In particular, I wanted to do this for the train_test_datasets.py script. It would be nice to have a more general class/function for this purpose. See the "Provenance" section of the gdoc https://docs.google.com/document/d/1YvFl1Dnwc3PGkx154Kpo2dYWy-abask3pxOmDC4Mqo8/edit# for more information
- (C1) 27.01.2020 23:08. The provenance metadata file should contain at least: entity, Activity, generatedBy, StartTime, EndTime, md5, git commit.
- (C) 28.01.2020 23:55. Add a legend to the correlation plots. At the moment it doesn't look straightforward.
- (C) 28.01.2020 02:26. Check if imputation for gender makes a difference for modelling (see log note)
- (A) 31.01.2020 20.20. Make multiclass train/test control plots.
- (A) 27.01.2020 18:35. Exploratory data analysis
- (A) 29.01.2020 21.10. Check for outliers in data.
- (A1) 27.01.2020 18:35. Convert multiple files into a single .csv file that can be read into a pandas dataframe
- (A2) 27.01.2020 18:35. Plot distributions of all features in all tables
- (A3) 28.01.2020 18.00. Plot Sex, Age, Tenure
- (A4) 28.01.2020 18.00. Plot 'VolumeCred','VolumeCred_CA','TransactionsCred','TransactionsCred_CA','VolumeDeb','VolumeDeb_CA','VolumeDebCash_Card'
- (A4) 28.01.2020 18.00. Plot 'VolumeDebCashless_Card','VolumeDeb_PaymentOrder','TransactionsDeb','TransactionsDeb_CA','TransactionsDebCash_Card','TransactionsDebCashless_Card','TransactionsDeb_PaymentOrder'
- (A5) 28.01.2020 18.00. Plot 'Sale_MF','Sale_CC','Sale_CL','Revenue_MF','Revenue_CC','Revenue_CL'
- (A6) 28.01.2020 18.00. Plot 'Count_CA','Count_SA','Count_MF','Count_OVD','Count_CC','Count_CL'
- (A7) 28.01.2020 18.00. Plot 'ActBal_CA','ActBal_SA','ActBal_MF','ActBal_OVD','ActBal_CC','ActBal_CL'
- (B) 27.01.2020 21:43. Ask Michael what "Inflow/outflow on C/A" is. Does C/A stand for current account?
- (C) 28.01.2020 00:07. Refactor all occurrences of the titanic symbol in the code
- (C) 28.01.2020 00:41. Refactor all model.build_test_train_*_feature_pad() methods by moving them to the view class. This is more logical than having plotting capabilities inside the model class
- (A) 28.01.2020 12:50. Ask Michael about the target variables. Are the Sale* variables categorical? What do they mean? What do the Revenue* variables mean? Should I build a model for functions Sale(x1,x2,...) and Revenue(x1,x2,...)?
- (B) 27.01.2020 18:30. Ask Michael if Client is consistent across all tables and can be used as the index for joining
- (B) 27.01.2020 20:45. Ask Michael if the tables are randomly reshuffled in order to avoid data leakage between training/testing datasets
- (A) 28.01.2020 18.15. Build baseline linear and xgboost models.
- (A) 29.01.2020 21.50. Build xgboost baseline for regression 'Revenue_MF','Revenue_CC','Revenue_CL'.
- (A) 29.01.2020 21.50. Build xgboost baseline for classification 'Sale_MF', 'Sale_CC', 'Sale_CL'.
- (A) 29.01.2020 21.50. Build linear baseline for regression 'Revenue_MF','Revenue_CC','Revenue_CL'.
- (A) 29.01.2020 21.50. Build svm baseline for classification 'Sale_MF', 'Sale_CC', 'Sale_CL'.
- (B) 29.01.2020 21.50. Build linear baseline for classification.
The data is provided in several tables in an .xlsx file. It looks like Client is a unique ID shared by all tables. I guess it has to be used to correlate data across the different tables. Have to ask Michael.
It would be better to write a script that can be reused if needed to convert the individual tables into a single .csv file.
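Such a script could be sketched as follows; the table names and the join logic (outer join on Client) are assumptions, shown here on toy in-memory stand-ins rather than the actual .xlsx sheets:

```python
import pandas as pd

def merge_tables(frames):
    """Outer-join a list of per-table dataframes indexed by Client."""
    merged = frames[0]
    for f in frames[1:]:
        merged = merged.join(f, how='outer')
    return merged

# Toy stand-ins for two of the sheets (names and contents are hypothetical).
soc_dem = pd.DataFrame({'Client': [1, 2, 3], 'Age': [30, 40, 50]}).set_index('Client')
products = pd.DataFrame({'Client': [1, 3], 'Count_CA': [1, 2]}).set_index('Client')

df = merge_tables([soc_dem, products])
# df.to_csv('data/27_01_2020/data.csv')  # would write the combined file
```

The outer join keeps every Client that appears in at least one table, leaving NaN where a table has no row for that client.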
I found a few missing values in the Sex column of the Soc_Dem table.
import pandas as pd
df = pd.read_csv('data/27_01_2020/data.csv', index_col='Client')
df['Sex'].unique()
> array([ 0., 1., nan])
df[pd.isnull(df['Sex'])].index
> Int64Index([1363, 953, 843], dtype='int64', name=u'Client')
I also checked the original files. Indeed, gender is missing for these three entries. Perhaps I can safely drop these three rows, because I have > 1500 entries for model building, but it is better to compare the training results against imputing the sex values with the most frequent gender in the dataset (I've made a todo entry for this).
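The imputation variant could start from a simple mode fill; a minimal sketch on a toy column (not the project script):

```python
import numpy as np
import pandas as pd

# Toy column mimicking Sex with three missing entries.
df = pd.DataFrame({'Sex': [0., 1., 1., np.nan, 1., np.nan, 0., np.nan]})

# Fill the gaps with the most frequent value in the column.
most_frequent = df['Sex'].mode()[0]
df['Sex'] = df['Sex'].fillna(most_frequent)
```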
According to the description table of the .xlsx file, "CA" stands for "current account".
Half of the columns are very well filled; there are only a few missing entries. Perhaps I can drop that handful of rows and continue working with 1587 entries.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1615 entries, 909 to 1466
Data columns (total 35 columns):
Sex 1612 non-null float64
Age 1615 non-null int64
Tenure 1615 non-null int64
VolumeCred 1587 non-null float64
VolumeCred_CA 1587 non-null float64
TransactionsCred 1587 non-null float64
TransactionsCred_CA 1587 non-null float64
VolumeDeb 1587 non-null float64
VolumeDeb_CA 1587 non-null float64
VolumeDebCash_Card 1587 non-null float64
VolumeDebCashless_Card 1587 non-null float64
VolumeDeb_PaymentOrder 1587 non-null float64
TransactionsDeb 1587 non-null float64
TransactionsDeb_CA 1587 non-null float64
TransactionsDebCash_Card 1587 non-null float64
TransactionsDebCashless_Card 1587 non-null float64
TransactionsDeb_PaymentOrder 1587 non-null float64
Count_CA 1615 non-null int64
Count_SA 426 non-null float64
Count_MF 306 non-null float64
Count_OVD 419 non-null float64
Count_CC 170 non-null float64
Count_CL 135 non-null float64
ActBal_CA 1615 non-null float64
ActBal_SA 426 non-null float64
ActBal_MF 306 non-null float64
ActBal_OVD 419 non-null float64
ActBal_CC 170 non-null float64
ActBal_CL 135 non-null float64
Sale_MF 969 non-null float64
Sale_CC 969 non-null float64
Sale_CL 969 non-null float64
Revenue_MF 969 non-null float64
Revenue_CC 969 non-null float64
Revenue_CL 969 non-null float64
dtypes: float64(32), int64(3)
memory usage: 454.2 KB
The rows with missing values seem to be the same across the group of columns with 1587 filled entries. I created a new file data/28_01_2020_1584entries/data.csv by removing those rows:
>>> df[pd.isnull(df['VolumeDeb'])].index.sort_values()
Int64Index([ 50, 58, 84, 305, 315, 334, 355, 377, 514, 522, 663,
723, 741, 799, 833, 941, 962, 1026, 1032, 1139, 1141, 1199,
1203, 1244, 1269, 1446, 1528, 1544],
dtype='int64', name=u'Client')
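The removal step itself can be sketched like this (toy frame; the real script uses the full column list and the paths from the log):

```python
import numpy as np
import pandas as pd

# Toy frame: the middle row mimics one of the 28 incomplete rows.
df = pd.DataFrame({'VolumeDeb': [10.0, np.nan, 5.0], 'Age': [30, 40, 50]},
                  index=pd.Index([50, 58, 84], name='Client'))

# Drop every row where the sparse transaction columns are missing.
cleaned = df.dropna(subset=['VolumeDeb'])
# cleaned.to_csv('data/28_01_2020_1584entries/data.csv')  # as in the log
```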
Missing entries in the Products_ActBalance table seem to correspond to 0 by default, i.e. when the product was not used by the client. I created a new file imputing 0 whenever the values in these columns are not available.
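The zero imputation can be sketched as follows (toy frame; the real column list covers all Count_* and ActBal_* columns):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking two of the sparse Products_ActBalance columns.
df = pd.DataFrame({'Count_SA': [2.0, np.nan], 'ActBal_SA': [100.0, np.nan]})

# NaN is assumed to mean "product not used", so impute 0.
product_cols = ['Count_SA', 'ActBal_SA']
df[product_cols] = df[product_cols].fillna(0)
```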
import pandas as pd
df = pd.read_csv('data/28_01_2020_1584entries/data_Products_ActBalance_default0.csv', index_col='Client')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1584 entries, 909 to 1466
Data columns (total 35 columns):
Sex 1584 non-null float64
Age 1584 non-null int64
Tenure 1584 non-null int64
VolumeCred 1584 non-null float64
VolumeCred_CA 1584 non-null float64
TransactionsCred 1584 non-null float64
TransactionsCred_CA 1584 non-null float64
VolumeDeb 1584 non-null float64
VolumeDeb_CA 1584 non-null float64
VolumeDebCash_Card 1584 non-null float64
VolumeDebCashless_Card 1584 non-null float64
VolumeDeb_PaymentOrder 1584 non-null float64
TransactionsDeb 1584 non-null float64
TransactionsDeb_CA 1584 non-null float64
TransactionsDebCash_Card 1584 non-null float64
TransactionsDebCashless_Card 1584 non-null float64
TransactionsDeb_PaymentOrder 1584 non-null float64
Count_CA 1584 non-null int64
Count_SA 1584 non-null float64
Count_MF 1584 non-null float64
Count_OVD 1584 non-null float64
Count_CC 1584 non-null float64
Count_CL 1584 non-null float64
ActBal_CA 1584 non-null float64
ActBal_SA 1584 non-null float64
ActBal_MF 1584 non-null float64
ActBal_OVD 1584 non-null float64
ActBal_CC 1584 non-null float64
ActBal_CL 1584 non-null float64
Sale_MF 1584 non-null float64
Sale_CC 1584 non-null float64
Sale_CL 1584 non-null float64
Revenue_MF 1584 non-null float64
Revenue_CC 1584 non-null float64
Revenue_CL 1584 non-null float64
dtypes: float64(32), int64(3)
I guess baseline modelling is so bad out of the box because the revenue distributions are multimodal, with the first mode at 0. This mode corresponds to clients who do not accept the marketing offer. Somehow I have to explicitly incorporate this information into the model. Perhaps I can use a Gaussian mixture to deal with this problem, or find a simpler alternative.
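One way to probe the multimodality would be a two-component Gaussian mixture fitted to a revenue column; a minimal sketch on synthetic bimodal data (not the project data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic bimodal sample: a spike near 0 and a broad positive mode,
# mimicking the shape of the Revenue_* distributions.
x = np.concatenate([rng.normal(0.5, 0.2, 500),
                    rng.normal(10.0, 3.0, 500)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
means = sorted(gmm.means_.ravel())  # one mean per recovered mode
```

The component responsibilities (`gmm.predict_proba`) could then separate "near-zero" clients from genuinely revenue-generating ones.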
I try to identify outliers in the target distributions using the LOF (Local Outlier Factor) method:
from sklearn.neighbors import LocalOutlierFactor
import numpy as np
import pandas as pd
df = pd.read_csv('data/28_01_2020_1584entries/data_Products_ActBalance_default0.csv', index_col='Client')
#Revenue_MF
X = df[df.Revenue_MF>0]['Revenue_MF']
clf = LocalOutlierFactor(n_neighbors=150)
outliers = clf.fit_predict(np.array(X).reshape(-1,1))
outliers_score = clf.negative_outlier_factor_
print(np.sort(outliers_score)[0:20])
#[-21.98281451 -7.82617696 -6.83332992 -6.11866049 -5.20367369
# -5.03259564 -4.87207292 -4.81137682 -3.96651775 -3.78231777
# -3.58274375 -3.4398148 -3.32426665 -3.19256949 -3.01614364
# -2.89754806 -2.68524876 -2.58831833 -2.56493638 -2.43462459]
X[outliers_score < -6]
#Client
#43 73.041786
#27 220.488214
#614 83.038036
#349 65.704107
#Revenue_CC
X = df[df.Revenue_CC>0]['Revenue_CC']
clf = LocalOutlierFactor(n_neighbors=150)
outliers = clf.fit_predict(np.array(X).reshape(-1,1))
outliers_score = clf.negative_outlier_factor_
print(np.sort(outliers_score)[0:20])
#[-84.77600865 -43.76497299 -41.85003769 -22.29990926 -20.61009274
# -17.95787614 -15.21391245 -9.42846233 -8.08166001 -7.35834394
# -6.92048323 -6.82460076 -5.12042591 -4.77377491 -4.38435721
# -3.63171742 -3.10624988 -3.05134221 -2.85949796 -2.78882013]
X[outliers_score < -15]
#Client
#1293 89.761071
#479 102.146429
#617 110.470000
#966 203.790714
#448 213.292857
#1335 407.962500
#374 76.859286
#Revenue_CL
X = df[df.Revenue_CL>0]['Revenue_CL']
clf = LocalOutlierFactor(n_neighbors=150)
outliers = clf.fit_predict(np.array(X).reshape(-1,1))
outliers_score = clf.negative_outlier_factor_
print(np.sort(outliers_score)[0:20])
#[-18.28867726 -4.79219541 -3.29659602 -2.76674437 -2.7231154
# -2.56226161 -2.52235758 -2.46328553 -2.32705827 -2.25750651
# -2.18331312 -2.15530412 -2.1472402 -1.91073819 -1.86150714
# -1.8219801 -1.80256073 -1.75130793 -1.74766922 -1.67061297]
X[outliers_score < -18]
#Client
#1549 133.275357
import pandas as pd
import numpy as np
df = pd.read_csv('data/28_01_2020_1584entries/data_Products_ActBalance_default0.csv', index_col='Client')
def my_f(el):
result = -1
if el['Sale_MF']==1: result = 1
elif el['Sale_CC']==1: result = 2
elif el['Sale_CL']==1: result = 3
elif el['Sale_MF'] == 0 and el['Sale_CC']==0 and el['Sale_CL']==0: result = 0
elif el['Sale_MF'] == -1 or el['Sale_CC']==-1 or el['Sale_CL']==-1: result = -1
else: result = np.nan
return result
df['Sale_multiclass'] = df[['Sale_MF','Sale_CC','Sale_CL']].apply(lambda el:my_f(el),axis=1)
df['Sale_multiclass'].head()
df['Sale_multiclass'].value_counts()
The relative occurrence of the 'Reject' class is approximately twice as high as that of the other classes. Perhaps this is the reason for the overtraining.
df['Sale_multiclass'].value_counts()
-1 635
0 387
1 193
2 192
3 177
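One standard counter-measure would be inverse-frequency sample weights; a sketch using the class counts above (the hand-tuned weight function later in this log serves the same purpose):

```python
import pandas as pd

# Class counts taken from the value_counts() output above.
labels = pd.Series([-1]*635 + [0]*387 + [1]*193 + [2]*192 + [3]*177)

# Weight each sample by the inverse relative frequency of its class,
# so rare classes contribute more to the training loss.
freq = labels.value_counts(normalize=True)
weights = labels.map(lambda c: 1.0 / freq.loc[c])
```

Such a weights column could be passed as `sample_weight` to the xgboost fit.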
Initially I got the encoding of all possibilities for the Sale_MF, Sale_CC and Sale_CL variables wrong. I have corrected this with a new encoding function.
def binary_enconding(el):
'''
Improved encoding of all possible Sale_MF,Sale_CC,Sale_CL options
'''
import numpy as np
result = -1
if el['Sale_MF'] == -1 or el['Sale_CC']==-1 or el['Sale_CL']==-1: result = -1
else: result = el['Sale_MF']*4 + el['Sale_CC']*2 + el['Sale_CL']
return result
import pandas as pd
import numpy as np
df = pd.read_csv('data/28_01_2020_1584entries/data_Products_ActBalance_default0.csv', index_col='Client')
df['Sale_multiclass'] = df[['Sale_MF','Sale_CC','Sale_CL']].apply(lambda el:binary_enconding(el),axis=1)
df['Sale_multiclass'].value_counts()
-1.0 635
0.0 387
1.0 177
2.0 137
4.0 106
3.0 55
5.0 42
6.0 31
7.0 14
Name: Sale_multiclass, dtype: int64
def reduced_binary_coding(el):
'''
Improved coding of all possible Sale_MF,Sale_CC,Sale_CL options.
Lumps all cases with two or more Sale_* variables equal to 1 into a single label.
Possible label values are [-1,0,1,2,3,4], or [0,1,2,3,4] when the prediction class (-1) is excluded.
0 -> Reject
1 -> Sale_MF=1
2 -> Sale_CC=1
3 -> Sale_CL=1
4 -> Two or more Sale_* variables =1
'''
import numpy as np
result = -1
if el['Sale_MF'] == -1 or el['Sale_CC']== -1 or el['Sale_CL']== -1: result = -1
elif el['Sale_MF'] == 0 and el['Sale_CC'] == 0 and el['Sale_CL'] == 0: result = 0
elif el['Sale_MF'] == 1 and el['Sale_CC'] == 0 and el['Sale_CL'] == 0: result = 1
elif el['Sale_MF'] == 0 and el['Sale_CC'] == 1 and el['Sale_CL'] == 0: result = 2
elif el['Sale_MF'] == 0 and el['Sale_CC'] == 0 and el['Sale_CL'] == 1: result = 3
else: result = 4
return result
def reduced_binary_encoding_weight(el):
my_weights = {-1:1.0,0:1./0.40, 1:1./0.11, 2:1./0.14, 3:1./0.19, 4:1./0.149}
return my_weights[el]
import pandas as pd
import numpy as np
df = pd.read_csv('data/28_01_2020_1584entries/data_Products_ActBalance_default0.csv', index_col='Client')
df['Sale_multiclass'] = df[['Sale_MF','Sale_CC','Sale_CL']].apply(lambda el:reduced_binary_coding(el),axis=1)
df['Sale_multiclass'].value_counts()
df['weights'] = df['Sale_multiclass'].apply(reduced_binary_encoding_weight)
def ordinal_enconding(el):
'''
This is a very bad encoding resulting in overlapping classes. DO NOT USE!
'''
import numpy as np
result = -1
if el['Sale_MF']==1: result = 1
elif el['Sale_CC']==1: result = 2
elif el['Sale_CL']==1: result = 3
elif el['Sale_MF'] == 0 and el['Sale_CC']==0 and el['Sale_CL']==0: result = 0
elif el['Sale_MF'] == -1 or el['Sale_CC']==-1 or el['Sale_CL']==-1: result = -1
else: result = np.nan
return result
df['Sale_multiclass'] = df[['Sale_MF','Sale_CC','Sale_CL']].apply(lambda el:reduced_binary_coding(el),axis=1)
import numpy as np
prob = np.array([0.2,0.5,0.1,0.2])
rev = np.array([100,1000,10000,np.max([100,1000,10000])])
exp_rev = prob*rev
best_offer = np.argmax(exp_rev)
print("Best offer (class, revenue): {} {}".format(best_offer,exp_rev[best_offer]))
# Modify train.csv, test.csv
# python
import pandas as pd
import numpy as np
df = pd.read_csv('data/original/test.csv', index_col='PassengerId')
df['Sex'] = df['Sex'].replace({'male':0,'female':1})
df['Name']=df['Name'].apply(lambda en: en.split(',')[0])
#object_cols = ['Embarked']
df['Embarked'] = df['Embarked'].replace({np.nan:0,'S':1,'C':2,'Q':3})
df.to_csv('data/2019-12-05/test.csv')