LOG.md

TODO

  • (A) 31.01.2020 20.20. Fix axis labels and legends on multiclass probability correlation plots

  • (A) 31.01.2020 20.20. Make multiclass probability correlation plots for test samples.

  • (A) 30.01.2020 00.45. Optimize xgboost multiclass classification model

  • (A) 30.01.2020 02.45. Incorporate multimodality of Revenue* distributions in the models

  • (A) 30.01.2020 00.45. Optimize xgboost regression models

    • (A) 30.01.2020 00.45. 'Revenue_MF'
    • (A) 30.01.2020 00.45. 'Revenue_CC'
    • (A) 30.01.2020 00.45. 'Revenue_CL'
  • (A) 30.01.2020 23.15. Calculate what fraction of Revenue is generated by outliers

  • (B) 28.01.2020 21.20. Ask Michael if missing entries in Products_ActBalance can be defaulted to 0 (see log note)

  • (B) 29.01.2020 00.55. Make correlation plots with and without suppression of default imputed values.

  • (C) 27.01.2020 23:08. Add automatic provenance generation for the output file. In particular, I wanted to do this for the train_test_datasets.py script. It would be nice to have a more general class/function for this purpose. See gdoc "Provenance" section https://docs.google.com/document/d/1YvFl1Dnwc3PGkx154Kpo2dYWy-abask3pxOmDC4Mqo8/edit# for more information

  • (C1) 27.01.2020 23:08. The provenance metadata file should contain at least: entity, Activity, generatedBy, StartTime, EndTime, md5, git commit.

  • (C) 28.01.2020 23:55. Add legend to the correlation plots. At the moment it doesn't look straightforward.

  • (C) 28.01.2020 02:26. Check if imputation for gender makes a difference for modelling (see log note)


Doing

  • (A) 31.01.2020 20.20. Make multiclass train/test control plots.
  • (A) 27.01.2020 18:35. Exploratory data analysis
    • (A) 29.01.2020 21.10. Check for outliers in data.

Follow-up


Done

  • (A1) 27.01.2020 18:35. Convert multiple files into a single .csv file that can be loaded into a pandas dataframe

  • (A2) 27.01.2020 18:35. Plot distributions of all features in all tables

  • (A3) 28.01.2020 18.00. Plot Sex, Age, Tenure

  • (A4) 28.01.2020 18.00. Plot 'VolumeCred','VolumeCred_CA','TransactionsCred','TransactionsCred_CA','VolumeDeb','VolumeDeb_CA','VolumeDebCash_Card'

  • (A4) 28.01.2020 18.00. Plot 'VolumeDebCashless_Card','VolumeDeb_PaymentOrder','TransactionsDeb','TransactionsDeb_CA','TransactionsDebCash_Card','TransactionsDebCashless_Card','TransactionsDeb_PaymentOrder'

  • (A5) 28.01.2020 18.00. Plot 'Sale_MF','Sale_CC','Sale_CL','Revenue_MF','Revenue_CC','Revenue_CL'

  • (A6) 28.01.2020 18.00. Plot 'Count_CA','Count_SA','Count_MF','Count_OVD','Count_CC','Count_CL'

  • (A7) 28.01.2020 18.00. Plot 'ActBal_CA','ActBal_SA','ActBal_MF','ActBal_OVD','ActBal_CC','ActBal_CL'

  • (B) 27.01.2020 21:43. Ask Michael what "Inflow/outflow on C/A" is. Does C/A stand for current account?

  • (C) 28.01.2020 00:07. Refactor all occurrences of titanic symbol in codes

  • (C) 28.01.2020 00:41. Refactor all model.build_test_train_*_feature_pad() methods by moving them to view class. This is more logical than having plotting capabilities inside the model class

  • (A) 28.01.2020 12:50. Ask Michael about the target variables. Are Sale* variables categorical, and what do they mean? What do the Revenue* variables mean? Should I build a model for functions Sale(x1,x2,...) and Revenue(x1,x2,...)?

  • (B) 27.01.2020 18:30. Ask Michael if Client is consistent across all tables and can be used as the index for joining

  • (B) 27.01.2020 20:45. Ask Michael if the tables are randomly reshuffled in order to avoid data leakage in training/testing datasets


Will not be done

  • (A) 28.01.2020 18.15. Build baseline linear and xgboost models.
    • (A) 29.01.2020 21.50. Build xgboost baseline for regression 'Revenue_MF','Revenue_CC','Revenue_CL'.
    • (A) 29.01.2020 21.50. Build xgboost baseline for classification 'Sale_MF', 'Sale_CC', 'Sale_CL'.
    • (A) 29.01.2020 21.50. Build linear baseline for regression 'Revenue_MF','Revenue_CC','Revenue_CL'.
    • (A) 29.01.2020 21.50. Build svm baseline for classification 'Sale_MF', 'Sale_CC', 'Sale_CL'.
    • (B) 29.01.2020 21.50. Build linear baseline for classification.


27 Jan 2020 18:00

Setting up the analysis framework

The data is provided in several tables in an .xlsx file. It looks like Client is a unique ID label shared by all tables. I guess it has to be used to correlate data across the different tables. Have to ask Michael.

It would be better to write a script that can be reused if needed to convert the individual tables into a single .csv file.
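A minimal sketch of such a script (each sheet would first be loaded with pd.read_excel(xlsx_path, sheet_name=name, index_col='Client'); the exact sheet names are not recorded here):

```python
import pandas as pd

def merge_tables(tables):
    """Outer-join per-table DataFrames on their shared Client index.

    Each table would first be loaded with e.g.
    pd.read_excel(xlsx_path, sheet_name=name, index_col='Client').
    """
    return tables[0].join(tables[1:], how='outer')
```

The merged frame can then be written out once with `merged.to_csv(...)`.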

28 Jan 2020 02:15

Gender missing values

I found that there are a few missing values in the Sex column of the Soc_Dem table.

import pandas as pd
df = pd.read_csv('data/27_01_2020/data.csv', index_col='Client')
df['Sex'].unique()
> array([ 0.,  1., nan])
df[pd.isnull(df['Sex'])].index
> Int64Index([1363, 953, 843], dtype='int64', name=u'Client')

I also checked the original files. Indeed, gender is missing for these three entries. Perhaps I can safely drop these three rows, because I have > 1500 entries for the model building, but it is better to compare the training results with imputing sex values based on the most frequent occurrence of gender in the dataset (I've made a todo entry for this).
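The imputation variant could look like this (a sketch; column name as in the snippet above):

```python
import pandas as pd

def impute_sex_with_mode(df):
    # Replace missing Sex entries with the most frequent value in the column
    out = df.copy()
    out['Sex'] = out['Sex'].fillna(out['Sex'].mode()[0])
    return out
```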

28 Jan 2020 10:20

Inflow CA

According to the description table of the .xlsx file, "CA" stands for "current account".

28 Jan 2020 10:40

Missing values

Half of the columns are very well filled; there are only a few missing entries. Perhaps I can drop that handful of rows and continue working with 1587 entries.

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1615 entries, 909 to 1466
Data columns (total 35 columns):
Sex                             1612 non-null float64
Age                             1615 non-null int64
Tenure                          1615 non-null int64
VolumeCred                      1587 non-null float64
VolumeCred_CA                   1587 non-null float64
TransactionsCred                1587 non-null float64
TransactionsCred_CA             1587 non-null float64
VolumeDeb                       1587 non-null float64
VolumeDeb_CA                    1587 non-null float64
VolumeDebCash_Card              1587 non-null float64
VolumeDebCashless_Card          1587 non-null float64
VolumeDeb_PaymentOrder          1587 non-null float64
TransactionsDeb                 1587 non-null float64
TransactionsDeb_CA              1587 non-null float64
TransactionsDebCash_Card        1587 non-null float64
TransactionsDebCashless_Card    1587 non-null float64
TransactionsDeb_PaymentOrder    1587 non-null float64
Count_CA                        1615 non-null int64
Count_SA                        426 non-null float64
Count_MF                        306 non-null float64
Count_OVD                       419 non-null float64
Count_CC                        170 non-null float64
Count_CL                        135 non-null float64
ActBal_CA                       1615 non-null float64
ActBal_SA                       426 non-null float64
ActBal_MF                       306 non-null float64
ActBal_OVD                      419 non-null float64
ActBal_CC                       170 non-null float64
ActBal_CL                       135 non-null float64
Sale_MF                         969 non-null float64
Sale_CC                         969 non-null float64
Sale_CL                         969 non-null float64
Revenue_MF                      969 non-null float64
Revenue_CC                      969 non-null float64
Revenue_CL                      969 non-null float64
dtypes: float64(32), int64(3)
memory usage: 454.2 KB

The rows missing in the group of columns with 1587 filled entries seem to be the same. I created a new file data/28_01_2020_1584entries/data.csv by removing those rows.

>>> df[pd.isnull(df['VolumeDeb'])].index.sort_values()
Int64Index([  50,   58,   84,  305,  315,  334,  355,  377,  514,  522,  663,
             723,  741,  799,  833,  941,  962, 1026, 1032, 1139, 1141, 1199,
            1203, 1244, 1269, 1446, 1528, 1544],
           dtype='int64', name=u'Client')
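The removal step itself can be sketched as follows (using VolumeDeb as a representative of the 1587-entry column group):

```python
import pandas as pd

def drop_rows_missing_in(df, col='VolumeDeb'):
    # Keep only rows where the representative column is filled
    return df[df[col].notnull()]
```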

28 Jan 2020 21:15

Missing values in Products_ActBalance

Missing entries in the Products_ActBalance table seem to correspond to 0 by default, when the product was not used by the client. I created a new file imputing 0 whenever the values in these columns are not available.

import pandas as pd
df = pd.read_csv('data/28_01_2020_1584entries/data_Products_ActBalance_default0.csv', index_col='Client')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1584 entries, 909 to 1466
Data columns (total 35 columns):
Sex                             1584 non-null float64
Age                             1584 non-null int64
Tenure                          1584 non-null int64
VolumeCred                      1584 non-null float64
VolumeCred_CA                   1584 non-null float64
TransactionsCred                1584 non-null float64
TransactionsCred_CA             1584 non-null float64
VolumeDeb                       1584 non-null float64
VolumeDeb_CA                    1584 non-null float64
VolumeDebCash_Card              1584 non-null float64
VolumeDebCashless_Card          1584 non-null float64
VolumeDeb_PaymentOrder          1584 non-null float64
TransactionsDeb                 1584 non-null float64
TransactionsDeb_CA              1584 non-null float64
TransactionsDebCash_Card        1584 non-null float64
TransactionsDebCashless_Card    1584 non-null float64
TransactionsDeb_PaymentOrder    1584 non-null float64
Count_CA                        1584 non-null int64
Count_SA                        1584 non-null float64
Count_MF                        1584 non-null float64
Count_OVD                       1584 non-null float64
Count_CC                        1584 non-null float64
Count_CL                        1584 non-null float64
ActBal_CA                       1584 non-null float64
ActBal_SA                       1584 non-null float64
ActBal_MF                       1584 non-null float64
ActBal_OVD                      1584 non-null float64
ActBal_CC                       1584 non-null float64
ActBal_CL                       1584 non-null float64
Sale_MF                         1584 non-null float64
Sale_CC                         1584 non-null float64
Sale_CL                         1584 non-null float64
Revenue_MF                      1584 non-null float64
Revenue_CC                      1584 non-null float64
Revenue_CL                      1584 non-null float64
dtypes: float64(32), int64(3)
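The imputation step itself can be sketched as follows (the column list mirrors the Products_ActBalance columns shown above):

```python
import pandas as pd

# Columns where a missing entry means "product not used" (per the log note)
PRODUCT_COLS = ['Count_SA', 'Count_MF', 'Count_OVD', 'Count_CC', 'Count_CL',
                'ActBal_SA', 'ActBal_MF', 'ActBal_OVD', 'ActBal_CC', 'ActBal_CL']

def impute_product_defaults(df, cols=PRODUCT_COLS):
    # Default missing product counts/balances to 0
    out = df.copy()
    out[cols] = out[cols].fillna(0)
    return out
```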

30.01.2020 02:40

Multimodal revenue distributions

I guess baseline modelling is so bad out of the box because the revenue distributions are multimodal, with the first mode at 0. This mode corresponds to clients that do not accept the marketing offer. Somehow I have to explicitly incorporate this information in the model. Perhaps I can use a gaussian mixture to deal with this problem, or find a simpler alternative.
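One simpler alternative to a full gaussian mixture would be a two-stage ("hurdle") model: a classifier for Revenue > 0 combined with a regressor fitted on the positive part only. A sketch (the sklearn estimators below are placeholders, not the tuned models):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

class HurdleModel:
    """Two-stage model: P(revenue > 0) times E[revenue | revenue > 0]."""

    def __init__(self):
        self.clf = LogisticRegression()  # stage 1: does the client generate revenue?
        self.reg = LinearRegression()    # stage 2: how much, given that they do?

    def fit(self, X, y):
        nonzero = y > 0
        self.clf.fit(X, nonzero)
        self.reg.fit(X[nonzero], y[nonzero])
        return self

    def predict(self, X):
        # Expected revenue = P(nonzero) * predicted revenue of the positive part
        p = self.clf.predict_proba(X)[:, 1]
        return p * self.reg.predict(X)
```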

30.01.2020 18.55

Outliers Revenue_MF

I try to identify outliers in the target distributions using the LOF (Local Outlier Factor) method

from sklearn.neighbors import LocalOutlierFactor
import numpy as np
import pandas as pd
df = pd.read_csv('data/28_01_2020_1584entries/data_Products_ActBalance_default0.csv', index_col='Client')
#Revenue_MF
X = df[df.Revenue_MF>0]['Revenue_MF']
clf = LocalOutlierFactor(n_neighbors=150)
outliers = clf.fit_predict(np.array(X).reshape(-1,1))
outliers_score = clf.negative_outlier_factor_
print(np.sort(outliers_score)[0:20])
#[-21.98281451  -7.82617696  -6.83332992  -6.11866049  -5.20367369
#  -5.03259564  -4.87207292  -4.81137682  -3.96651775  -3.78231777
#  -3.58274375  -3.4398148   -3.32426665  -3.19256949  -3.01614364
#  -2.89754806  -2.68524876  -2.58831833  -2.56493638  -2.43462459]
X.iloc[outliers_score < -6]
#Client
#43      73.041786
#27     220.488214
#614     83.038036
#349     65.704107

#Revenue_CC
X = df[df.Revenue_CC>0]['Revenue_CC']
clf = LocalOutlierFactor(n_neighbors=150)
outliers = clf.fit_predict(np.array(X).reshape(-1,1))
outliers_score = clf.negative_outlier_factor_
print(np.sort(outliers_score)[0:20])
#[-84.77600865 -43.76497299 -41.85003769 -22.29990926 -20.61009274
# -17.95787614 -15.21391245  -9.42846233  -8.08166001  -7.35834394
#  -6.92048323  -6.82460076  -5.12042591  -4.77377491  -4.38435721
#  -3.63171742  -3.10624988  -3.05134221  -2.85949796  -2.78882013]

X.iloc[outliers_score < -15]
#Client
#1293     89.761071
#479     102.146429
#617     110.470000
#966     203.790714
#448     213.292857
#1335    407.962500
#374      76.859286

#Revenue_CL
X = df[df.Revenue_CL>0]['Revenue_CL']
clf = LocalOutlierFactor(n_neighbors=150)
outliers = clf.fit_predict(np.array(X).reshape(-1,1))
outliers_score = clf.negative_outlier_factor_
print(np.sort(outliers_score)[0:20])
#[-18.28867726  -4.79219541  -3.29659602  -2.76674437  -2.7231154
#  -2.56226161  -2.52235758  -2.46328553  -2.32705827  -2.25750651
#  -2.18331312  -2.15530412  -2.1472402   -1.91073819  -1.86150714
#  -1.8219801   -1.80256073  -1.75130793  -1.74766922  -1.67061297]

X.iloc[outliers_score < -18]
#Client
#1549    133.275357

31.01.2020 11.30

Pandas dataframe for Multiclass classification

import pandas as pd
import numpy as np
df = pd.read_csv('data/28_01_2020_1584entries/data_Products_ActBalance_default0.csv', index_col='Client')
def my_f(el):
     result = -1
     if el['Sale_MF']==1: result = 1
     elif el['Sale_CC']==1: result = 2
     elif el['Sale_CL']==1: result = 3
     elif el['Sale_MF'] == 0 and el['Sale_CC']==0 and el['Sale_CL']==0: result = 0
     elif el['Sale_MF'] == -1 or el['Sale_CC']==-1 or el['Sale_CL']==-1: result = -1
     else: result = np.nan
     return result

df['Sale_multiclass'] = df[['Sale_MF','Sale_CC','Sale_CL']].apply(my_f, axis=1)
df['Sale_multiclass'].head()
df['Sale_multiclass'].value_counts()

01.02.2020 12.50

Dataset class imbalance

The relative occurrence of the 'Reject' class is approximately twice as high as for the other classes. Perhaps this is the reason for the overtraining.

df['Sale_multiclass'].value_counts()
-1    635
 0    387
 1    193
 2    192
 3    177
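Per-class weights inversely proportional to these relative frequencies could compensate for the imbalance, e.g.:

```python
import pandas as pd

def inverse_frequency_weights(labels):
    # Weight each sample by the inverse relative frequency of its class
    freq = labels.value_counts(normalize=True)
    return labels.map(lambda c: 1.0 / freq[c])
```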

02.02.2020 00.40

Wrong multiclass encoding

Initially I got a wrong encoding of all possibilities for the Sale_MF, Sale_CC and Sale_CL variables. I have corrected this with a new encoding function.

def binary_encoding(el):
     '''
     Improved encoding of all possible Sale_MF,Sale_CC,Sale_CL options
     '''
     result = -1
     if el['Sale_MF'] == -1 or el['Sale_CC'] == -1 or el['Sale_CL'] == -1: result = -1
     else: result = el['Sale_MF']*4 + el['Sale_CC']*2 + el['Sale_CL']
     return result

import pandas as pd
import numpy as np
df = pd.read_csv('data/28_01_2020_1584entries/data_Products_ActBalance_default0.csv', index_col='Client')
df['Sale_multiclass'] = df[['Sale_MF','Sale_CC','Sale_CL']].apply(binary_encoding, axis=1)
df['Sale_multiclass'].value_counts()
-1.0    635
 0.0    387
 1.0    177
 2.0    137
 4.0    106
 3.0     55
 5.0     42
 6.0     31
 7.0     14
Name: Sale_multiclass, dtype: int64
def reduced_binary_coding(el):
     '''
     Improved coding of all possible Sale_MF,Sale_CC,Sale_CL options.
     Lumps all cases with two or more Sale_* variables equal to 1 into a single label.
     Possible label values are [-1,0,1,2,3,4], or [0,1,2,3,4] when the -1 class is excluded.
     0 -> Reject
     1 -> Sale_MF=1
     2 -> Sale_CC=1
     3 -> Sale_CL=1
     4 -> Two or more Sale_* variables =1
     '''
     result = -1
     if el['Sale_MF'] == -1 or el['Sale_CC'] == -1 or el['Sale_CL'] == -1: result = -1
     elif el['Sale_MF'] == 0 and el['Sale_CC'] == 0 and el['Sale_CL'] == 0: result = 0
     elif el['Sale_MF'] == 1 and el['Sale_CC'] == 0 and el['Sale_CL'] == 0: result = 1
     elif el['Sale_MF'] == 0 and el['Sale_CC'] == 1 and el['Sale_CL'] == 0: result = 2
     elif el['Sale_MF'] == 0 and el['Sale_CC'] == 0 and el['Sale_CL'] == 1: result = 3
     else: result = 4
     return result

def reduced_binary_encoding_weight(el):
     my_weights = {-1: 1.0, 0: 1./0.40, 1: 1./0.11, 2: 1./0.14, 3: 1./0.19, 4: 1./0.149}
     return my_weights[el]

import pandas as pd
import numpy as np
df = pd.read_csv('data/28_01_2020_1584entries/data_Products_ActBalance_default0.csv', index_col='Client')
df['Sale_multiclass'] = df[['Sale_MF','Sale_CC','Sale_CL']].apply(reduced_binary_coding, axis=1)
df['Sale_multiclass'].value_counts()

df['weights'] = df['Sale_multiclass'].apply(reduced_binary_encoding_weight)

def ordinal_encoding(el):
     '''
     This is a very bad encoding resulting in overlapping classes. DO NOT USE!
     '''
     import numpy as np
     result = -1
     if el['Sale_MF'] == 1: result = 1
     elif el['Sale_CC'] == 1: result = 2
     elif el['Sale_CL'] == 1: result = 3
     elif el['Sale_MF'] == 0 and el['Sale_CC'] == 0 and el['Sale_CL'] == 0: result = 0
     elif el['Sale_MF'] == -1 or el['Sale_CC'] == -1 or el['Sale_CL'] == -1: result = -1
     else: result = np.nan
     return result



Miscellaneous

Expected revenue calculation

import numpy as np
prob = np.array([0.2,0.5,0.1,0.2])
rev  = np.array([100,1000,10000,np.max([100,1000,10000])])
exp_rev = prob*rev
best_offer = np.argmax(exp_rev)
print("Best offer (class, revenue): {} {}".format(best_offer,exp_rev[best_offer]))

Transformations to the original datasets

# Modify train.csv, test.csv

# python

import pandas as pd
import numpy as np
df = pd.read_csv('data/original/test.csv', index_col='PassengerId')
df['Sex'] = df['Sex'].replace({'male':0,'female':1})
df['Name'] = df['Name'].apply(lambda en: en.split(',')[0])
#object_cols = ['Embarked']
df['Embarked'] = df['Embarked'].replace({np.nan:0,'S':1,'C':2,'Q':3})
df.to_csv('data/2019-12-05/test.csv')