In [1]:
import sys
sys.path.insert(0, "../")

from collections import Counter
import pickle

import pandas as pd
import numpy as np

from sklearn.feature_selection import f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, GroupKFold, GroupShuffleSplit
from sklearn.pipeline import Pipeline

from xgboost import XGBClassifier

from src.custom_selector import ColumnSelector
from src.evaluate import EvaluateModel

# 0. Report
## 0.1. Approach
1. An in-depth analysis of the data was conducted to identify any data quality issues. <u>(Please refer to sections 1/2/3)</u>
2. Next, an automated algorithm based on Pearson's correlation and the ANOVA F-score is used to filter out irrelevant or duplicated features. <u>(Please refer to section 4)</u>
3. The data is split into train/test while ensuring that all vectors for a specific receipt are in the same split. This is necessary because all features are based on information from both the receipt image and the transaction. <u>(Please refer to section 5.)</u>
4. A tree-based boosting algorithm (XGBoost) is used as the classifier. It is well suited to the current task, as it works well with ordinal data and performs well on imbalanced datasets. <u>(Please refer to section 6.)</u>
5. The hyperparameters of the model are then optimized using a randomized search across a pre-defined search space. Due to time constraints, this approach was selected instead of hyperparameter optimizers like hyperopt, which rely on Bayesian optimization to find the optimal set of hyperparameters. The tuning is done on the training data using cross-validation, because of the limited number of "positive" events. <u>(Please refer to section 6.)</u>
6. The final model is trained on the full training data and is evaluated on the test data. <u>(Please refer to section 6.)</u>
7. Finally, all steps are incorporated in a train.py script available in the repository root.
8. Please note that the model doesn't return a list, but an array in the original order of the input data. It is not advisable for a model to return output in an order different from its input, so for this project I would ask the data architect to revise the specification and implement the ordering outside of the model.
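
Point 8 can be illustrated with a minimal sketch, assuming the model returns one score per input row in input order. The helper `rank_transactions` and the sample IDs and scores are hypothetical and are not part of train.py; the sketch only shows how the ordering could live outside of the model.

```python
from collections import defaultdict


def rank_transactions(receipt_ids, transaction_ids, scores):
    """Group scored rows by receipt and sort each group by descending score.

    The three sequences are assumed to be aligned row by row, exactly as the
    model returns its predictions in input order.
    """
    grouped = defaultdict(list)
    for receipt, transaction, score in zip(receipt_ids, transaction_ids, scores):
        grouped[receipt].append((transaction, score))

    # sorted() is stable, so transactions with tied scores keep their input order.
    return {
        receipt: [tx for tx, _ in sorted(rows, key=lambda row: row[1], reverse=True)]
        for receipt, rows in grouped.items()
    }


# Hypothetical IDs and scores, for illustration only.
ranking = rank_transactions(
    receipt_ids=["r1", "r1", "r1", "r2", "r2"],
    transaction_ids=["t1", "t2", "t3", "t4", "t5"],
    scores=[0.10, 0.85, 0.40, 0.70, 0.05],
)
print(ranking)
```

Keeping this step outside of the model means the model's output can always be joined back to its input by position.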

## 0.2. Results
1. The discriminatory power of the model is very high, with <u>Gini on both Train and Test higher than 93%</u>. This means that the model performs very well in ordering the matching vectors. <u>(Please refer to section 7.)</u>
2. The distribution of the final score looks good, with observed values across the full range between 0 and 1. This means that we don't have huge clumps of values at a single score, which could complicate the discrimination. In addition, it suggests that even without calibration the score can serve as a rough indication of the true probability.
3. The final success rate is reported using 4 metrics:
- <b>"User Success Rate" = 61%</b> - This is the success rate of showing the correct transaction in the final app, as experienced by the user
- <b>"Maximum Success Rate" = 75%</b> - This is the maximum achievable success rate, given that for a large proportion of the receipts the correct transaction is not supplied, and assuming that when 2 transactions have the same explanatory features and one of them is the correct one, we select the correct one by chance
- <b>"Normalized Success Rate" = 81%</b> - This is the success rate of the model after accounting for the receipts with missing matched transactions
- <b>"Normalized For Missing and Duplicates Success Rate" = 91%</b> - This is the success rate after accounting for the receipts with missing matched transactions and after excluding all receipts for which the matched transaction and an incorrect transaction have the same features.

All 4 above metrics are useful and they highlight the 2 main sources leading to lower model performance:
- The quality of the data. Its impact can be seen by comparing the "User Success Rate" and the "Normalized For Missing and Duplicates Success Rate"
- Model specification and noise

## 0.3. Next Steps to Improve the Model
1. Contact the data supplier and discuss how to improve the current data extracts, as 26% of the supplied receipts don't have a matched transaction. <u>(Please refer to section 2.9)</u>
2. Request higher granularity for some of the input features, as for 23% of all receipts the matched transaction has exactly the same features as at least one other transaction. <u>(Please refer to section 2.9)</u>
3. In total, about 48% of the receipts (560 of 1,155) can't be mapped consistently or at all, so improving the quality of the data should be the highest priority. <u>(Please refer to section 2.9)</u>
4. The hyperparameter tuning can be improved by using the "hyperopt" package (available on PyPI).
5. We can introduce a threshold to exclude transactions with very low estimated model confidence from the returned results. This is expected to improve the user experience.
6. To do so, we should obtain a new independent dataset, apply the model to it, order it by predicted probability and check whether below a certain threshold no matched transaction ids are observed. This would speed up the validation process for the app user.
7. In addition, we can increase the interpretability of the final score estimates by calibrating the results on the new independent dataset from the above point. For that we can use sklearn.calibration.CalibratedClassifierCV.
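
Steps 5 and 6 can be sketched as follows, assuming scores and true labels are available for an independent sample. The helpers `safe_cutoff` and `filter_candidates` and all values are hypothetical. In practice one would probably leave a safety margin below the cutoff rather than use the observed minimum directly.

```python
def safe_cutoff(scores, labels):
    """Return the lowest score among the positive rows.

    Every row scoring strictly below this value is a negative in the sample,
    so hiding those rows would not have hidden any correct transaction.
    Returns None when the sample contains no positives.
    """
    positive_scores = [s for s, y in zip(scores, labels) if y == 1]
    return min(positive_scores) if positive_scores else None


def filter_candidates(candidates, cutoff):
    """Keep the (transaction_id, score) pairs scoring at or above the cutoff."""
    return [(tx, score) for tx, score in candidates if score >= cutoff]


# Hypothetical scores and labels from an independent sample.
cutoff = safe_cutoff(scores=[0.05, 0.20, 0.60, 0.90, 0.15], labels=[0, 1, 0, 1, 0])
visible = filter_candidates([("t1", 0.05), ("t2", 0.20), ("t3", 0.90)], cutoff)
print(cutoff, visible)
```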

# 1. Load Dataset

In [3]:
data = pd.read_csv("../data/data_interview_test_updated_(1).csv", delimiter=":")

In [4]:
print("Data dimensions:", data.shape)

Data dimensions: (12034, 14)


In [5]:
data.sample(5, random_state=21).T

Unnamed: 0,3740,4179,7797,10603,11195
receipt_id,20132.0,20173,30244,40273,50094
company_id,20000.0,20000,30000,40000,50000
matched_transaction_id,20411.0,20180,30639,40115,50334
feature_transaction_id,20409.0,20175,30096,40030,50026
DateMappingMatch,0.75,0,0,0,0
AmountMappingMatch,0.0,0,0,0,0
DescriptionMatch,0.6,0,0,0,0
DifferentPredictedTime,1.0,1,1,1,1
TimeMappingMatch,0.0,0,0,0,0
PredictedNameMatch,0.0,0,0,0,0


### NOTES:

1. The supplied data includes 12,034 data rows and 14 columns
2. The first 4 columns identify each data row, while the remaining 10 features describe the relation of feature_transaction_id to receipt_id

# 2. Analyze Data Features

## 2.1. Check Data Types

In [6]:
data.dtypes

receipt_id                  object
company_id                   int64
matched_transaction_id      object
feature_transaction_id      object
DateMappingMatch           float64
AmountMappingMatch         float64
DescriptionMatch           float64
DifferentPredictedTime     float64
TimeMappingMatch           float64
PredictedNameMatch         float64
ShortNameMatch             float64
DifferentPredictedDate     float64
PredictedAmountMatch       float64
PredictedTimeCloseMatch    float64
dtype: object

## 2.2. Check For Missing Values

In [7]:
data.isnull().sum()

receipt_id                 0
company_id                 0
matched_transaction_id     0
feature_transaction_id     0
DateMappingMatch           0
AmountMappingMatch         0
DescriptionMatch           0
DifferentPredictedTime     0
TimeMappingMatch           0
PredictedNameMatch         0
ShortNameMatch             0
DifferentPredictedDate     0
PredictedAmountMatch       0
PredictedTimeCloseMatch    0
dtype: int64

## 2.3. Check For Leading/Trailing WhiteSpaces in Strings

In [8]:
string_columns = data.dtypes[data.dtypes == "object"].index.values

for column in string_columns:
    if not data[column].equals(data[column].str.strip()):
        raise Exception(f"For column '{column}' we see entries with leading/trailing white space!")

## 2.4. Check For Uniqueness of receipt_id per company

In [9]:
if data.drop_duplicates(["receipt_id", "company_id"]).shape[0] != data.drop_duplicates(["receipt_id"]).shape[0]:
    raise Exception("Ensure that receipt_id is unique across all companies")

## 2.5. Check For duplicates across receipt_id/feature_transaction_id

In [10]:
if data.duplicated(["receipt_id", "feature_transaction_id"]).sum() > 0:
    raise Exception("Duplicate matching vectors are observed in the data!")

## 2.6. Check For mismatch between receipt_id and matched_transaction_id

In [11]:
if not (data["receipt_id"].nunique() == data["matched_transaction_id"].nunique() == data[["receipt_id", "matched_transaction_id"]].drop_duplicates().shape[0]):
    raise Exception("Please ensure that a single receipt_id is matched to a single matched_transaction_id and the opposite.")

## 2.7. Check Feature Distribution

In [12]:
float_columns = data.dtypes[data.dtypes == "float"].index.values

for column in float_columns:
    print("==================================================")
    print(f"=========={column}=============")
    distribution = data[column].value_counts()
    
    if len(distribution) < 20:
        print(distribution)
    else:
        print(data[column].describe())
    print("==================================================")

0.000    9068
0.950    1636
0.850     571
0.900     217
0.650     194
0.825     179
0.550      98
1.000      36
0.750      21
0.525      11
0.725       3
Name: DateMappingMatch, dtype: int64
0.0    11225
0.4      615
0.7      159
0.6       26
0.9        9
Name: AmountMappingMatch, dtype: int64
0.0    11581
0.8      193
0.4      143
0.6       60
0.2       57
Name: DescriptionMatch, dtype: int64
1.0    11871
0.0      163
Name: DifferentPredictedTime, dtype: int64
0.0    11867
1.0      167
Name: TimeMappingMatch, dtype: int64
0.0    11589
0.8      251
0.4       91
0.6       84
0.2       19
Name: PredictedNameMatch, dtype: int64
0.0    11578
1.0      456
Name: ShortNameMatch, dtype: int64
1.0    9068
0.0    2966
Name: DifferentPredictedDate, dtype: int64
0.0    11989
0.1       24
0.5        9
0.4        8
0.6        3
0.2        1
Name: PredictedAmountMatch, dtype: int64
0.0    11113
1.0      921
Name: PredictedTimeCloseMatch, dtype: int64


## 2.8. Check Feature Correlation

In [13]:
corrs = data[float_columns].corr().abs().stack().reset_index()
corrs.columns = ['feature_1', 'feature_2', 'abs_corr']
corrs = corrs[(corrs.feature_1 != corrs.feature_2)].sort_values("abs_corr", ascending=False)

In [14]:
corrs

Unnamed: 0,feature_1,feature_2,abs_corr
70,DifferentPredictedDate,DateMappingMatch,0.990860
7,DateMappingMatch,DifferentPredictedDate,0.990860
34,DifferentPredictedTime,TimeMappingMatch,0.987785
43,TimeMappingMatch,DifferentPredictedTime,0.987785
39,DifferentPredictedTime,PredictedTimeCloseMatch,0.407039
...,...,...,...
48,TimeMappingMatch,PredictedAmountMatch,0.005925
38,DifferentPredictedTime,PredictedAmountMatch,0.005852
83,PredictedAmountMatch,DifferentPredictedTime,0.005852
82,PredictedAmountMatch,DescriptionMatch,0.002102


## 2.9. Check for duplicate explanatory features for same receipt after discretization

In [15]:
result = []
for rec_id, df in data.groupby("receipt_id"):
    mask = df["matched_transaction_id"] == df["feature_transaction_id"]
    
    correct_tx = df[mask][float_columns]
    incorrect_tx = df[~mask][float_columns]
    
    duplicate_flag = int(correct_tx.merge(incorrect_tx).shape[0] > 0)
    
    result.append({
        "receipt_id": rec_id,
        "missing_target": 1-int(mask.max()),
        "duplicate": duplicate_flag
    })

In [16]:
pd.DataFrame(result).mean()

missing_target    0.258009
duplicate         0.226840
dtype: float64

In [17]:
pd.DataFrame(result).sum(axis=1).value_counts()

0    595
1    560
dtype: int64

In [18]:
pd.DataFrame(result).shape[0]

1155

### NOTES:

1. All 10 features describing the matching vector are loaded as floats.
2. The 10 input features will be <b><u>treated as an output from a black box algorithm</u></b> for which we only can <b><u>assume that the relationship between the values of a feature is ordinal</u></b>
3. <b><u>For 23% of the receipts, at least one incorrect transaction has exactly the same feature values as the correctly matched transaction. For such cases the performance of the model depends on the arbitrary order in which the transactions are provided. Non-discretized features should be requested.</u></b>
4. For 26% of the receipts we don't have a matched transaction
5. No missing values are observed
6. No corrupted values are observed
7. The columns are:
- DateMappingMatch - A discretized matching similarity of the date returned by the upstream service
- AmountMappingMatch - A discretized matching similarity of the amount returned by the upstream service
- DescriptionMatch - A discretized matching similarity of the description returned by the upstream service
- DifferentPredictedTime - A binary flag indicating whether the receipt image time differs from the transaction time (1 = different)
- TimeMappingMatch - A discretized matching similarity of the time returned by the upstream service (0/1 observed)
- PredictedNameMatch - A discretized matching similarity of the predicted name returned by the upstream service
- ShortNameMatch - A discretized matching similarity of the short name returned by the upstream service (0/1 observed)
- DifferentPredictedDate - A binary flag indicating whether the predicted date is different (1 = different)
- PredictedAmountMatch - A discretized matching similarity of the predicted amount returned by the upstream service (values between 0 and 0.6 observed)
- PredictedTimeCloseMatch - A binary flag indicating whether the receipt image time is close to the transaction time

# 3. Get Target Flag

## 3.1. Derive Target Flag

In [19]:
data["target"] = (data["matched_transaction_id"] == data["feature_transaction_id"]).astype(int)

In [20]:
data.target.value_counts()

0    11177
1      857
Name: target, dtype: int64

In [21]:
print(f"Positive Rate In data is {round(data.target.mean(), 3)}")

Positive Rate In data is 0.071


### NOTES:
1. The data is imbalanced, with 7.1% of observations being positive

# 4. Univariate Analysis

## 4.1. Check Gini

In [22]:
float_columns = data.dtypes[data.dtypes == "float"].index.values

for column in float_columns:
    gini = round(roc_auc_score(data["target"], data[column])*2 - 1, 3)
    print(f"{column}; Gini: {gini}")

DateMappingMatch; Gini: 0.83
AmountMappingMatch; Gini: 0.003
DescriptionMatch; Gini: 0.229
DifferentPredictedTime; Gini: -0.166
TimeMappingMatch; Gini: 0.17
PredictedNameMatch; Gini: 0.197
ShortNameMatch; Gini: 0.241
DifferentPredictedDate; Gini: -0.796
PredictedAmountMatch; Gini: 0.012
PredictedTimeCloseMatch; Gini: 0.224
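
The Gini values above are derived from ROC AUC as Gini = 2 * AUC - 1. As a minimal sketch of what that number measures, the block below computes AUC by comparing every positive with every negative, without scikit-learn. The functions `auc` and `gini` and the toy data are illustrative assumptions; the notebook itself uses `roc_auc_score`.

```python
def auc(labels, scores):
    """Probability that a random positive scores higher than a random negative.

    Ties count as half a win. Quadratic in the number of rows, which is fine
    for a small illustration.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def gini(labels, scores):
    """Gini coefficient: +1 is a perfect ordering, 0 is random, -1 is inverted."""
    return 2 * auc(labels, scores) - 1


# Toy data: a feature that ranks positives first, and a "Different..." style
# flag that is 1 for non-matches and therefore gets a negative Gini.
toy_labels = [1, 1, 0, 0]
print(gini(toy_labels, [0.9, 0.8, 0.2, 0.1]))
print(gini(toy_labels, [0, 0, 1, 1]))
```

A negative Gini therefore signals an inverted relationship with the target, not a lack of predictive power.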


## 4.2. Check Feature Correlation

In [23]:
high_correlation = 0.90

In [24]:
corrs = data[float_columns].corr().abs().stack().reset_index()
corrs.columns = ['feature_1', 'feature_2', 'abs_corr']
corrs = corrs[(corrs.feature_1 != corrs.feature_2) & (corrs["abs_corr"] >= high_correlation)].sort_values("abs_corr", ascending=False)

In [25]:
corrs

Unnamed: 0,feature_1,feature_2,abs_corr
7,DateMappingMatch,DifferentPredictedDate,0.99086
70,DifferentPredictedDate,DateMappingMatch,0.99086
34,DifferentPredictedTime,TimeMappingMatch,0.987785
43,TimeMappingMatch,DifferentPredictedTime,0.987785


### 4.2.1. Analyze DifferentPredictedDate

In [26]:
pd.crosstab(data["DifferentPredictedDate"], data["DateMappingMatch"])

DateMappingMatch,0.000,0.525,0.550,0.650,0.725,0.750,0.825,0.850,0.900,0.950,1.000
DifferentPredictedDate,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0.0,0,11,98,194,3,21,179,571,217,1636,36
1.0,9068,0,0,0,0,0,0,0,0,0,0


### 4.2.2. Analyze DifferentPredictedTime

In [27]:
pd.crosstab(data["DifferentPredictedTime"], data["TimeMappingMatch"])

TimeMappingMatch,0.0,1.0
DifferentPredictedTime,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,0,163
1.0,11867,4


## 4.3. Remove Highly Correlated Features

In [28]:
table = pd.DataFrame([dict(zip(float_columns, f_classif(data[float_columns], data["target"])[0]))], index=["F-Value"]).T.sort_values("F-Value", ascending=False)

In [29]:
table

Unnamed: 0,F-Value
DateMappingMatch,3911.125458
DifferentPredictedDate,3508.992917
TimeMappingMatch,1946.599561
DifferentPredictedTime,1909.256393
ShortNameMatch,1412.235723
DescriptionMatch,1242.015363
PredictedNameMatch,1155.423875
PredictedTimeCloseMatch,593.635571
PredictedAmountMatch,72.983841
AmountMappingMatch,1.004581


In [30]:
explanatory_features = []
for feature in table.index:
    if feature not in corrs["feature_1"].values or all([x not in explanatory_features for x in corrs[corrs["feature_1"] == feature]["feature_2"].values]):
        explanatory_features.append(feature)

In [31]:
explanatory_features

['DateMappingMatch',
 'TimeMappingMatch',
 'ShortNameMatch',
 'DescriptionMatch',
 'PredictedNameMatch',
 'PredictedTimeCloseMatch',
 'PredictedAmountMatch',
 'AmountMappingMatch']

### NOTES:
1. The relationship between most features and the target flag is positive, with 2 features having a negative relationship: DifferentPredictedTime and DifferentPredictedDate. All relationships follow the expected direction
2. The supplied amount features are not expected to be very predictive in comparison to the other features. We can either remove them or, depending on the classifier, leave it to the classifier to decide whether they should be used
3. There are 2 pairs of highly correlated features. An analysis of the features shows that "DateMappingMatch" can be used to exactly reproduce "DifferentPredictedDate", and "TimeMappingMatch" can be used to reproduce "DifferentPredictedTime" almost exactly, with just 4 observations that can't be reproduced. As a result, it was decided to remove the features with the prefix "Different" from the training process
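
The filtering rule in section 4.3 can be restated as a small dependency-free sketch. The feature ranking and the correlated pairs are copied from the F-value table and the correlation output above; the function name `select_features` is an assumption made for this illustration.

```python
def select_features(ranked_features, correlated_pairs):
    """Greedily keep features in ranked order, skipping any feature that is
    highly correlated with a feature that has already been kept."""
    kept = []
    for feature in ranked_features:
        clashes = any(
            (feature == a and b in kept) or (feature == b and a in kept)
            for a, b in correlated_pairs
        )
        if not clashes:
            kept.append(feature)
    return kept


# Features ordered by descending F-value, as in the table above.
ranked = [
    "DateMappingMatch", "DifferentPredictedDate", "TimeMappingMatch",
    "DifferentPredictedTime", "ShortNameMatch", "DescriptionMatch",
    "PredictedNameMatch", "PredictedTimeCloseMatch", "PredictedAmountMatch",
    "AmountMappingMatch",
]
# Pairs with absolute correlation of at least 0.90.
pairs = [
    ("DateMappingMatch", "DifferentPredictedDate"),
    ("DifferentPredictedTime", "TimeMappingMatch"),
]
selected = select_features(ranked, pairs)
print(selected)
```

Because the features are visited in descending F-value order, the member of each correlated pair with the higher F-value is the one that survives.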

# 5. Split In Train And Test

In [32]:
float_columns = data.dtypes[data.dtypes == "float"].index.values

In [33]:
# GroupShuffleSplit keeps all rows of a receipt on the same side of the split
group_split = GroupShuffleSplit(test_size=0.25, random_state=21)
groups = data["receipt_id"]
train_index, test_index = next(group_split.split(data.values, None, groups))

X_train = data.iloc[train_index, :][float_columns]
y_train = data.iloc[train_index, :]["target"]
group_col_train = data.iloc[train_index, :]["receipt_id"]
X_test = data.iloc[test_index, :][float_columns]
y_test = data.iloc[test_index, :]["target"]

In [34]:
y_train.value_counts()

0    8570
1     639
Name: target, dtype: int64

In [35]:
y_test.value_counts()

0    2607
1     218
Name: target, dtype: int64

### NOTE:
1. We split the data into 75% for training and hyperparameter tuning and a 25% hold-out sample for evaluation
2. We use the receipt_id for grouping to ensure that the same receipt is not used in both train and test, as all features are based on both the receipt_id and the feature_transaction_id
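
The idea behind the grouped split can be shown with a minimal standard-library sketch. It is not a reimplementation of `GroupShuffleSplit` and will not reproduce its exact indices; the function `group_train_test_split` and the toy receipt IDs are assumptions made for illustration.

```python
import random


def group_train_test_split(groups, test_size=0.25, seed=21):
    """Split row indices so that every group lands entirely in train or in test."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)

    # The test fraction is applied to the number of groups, not of rows.
    n_test = max(1, round(len(unique_groups) * test_size))
    test_groups = set(unique_groups[:n_test])

    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Toy receipt IDs: several candidate transactions per receipt.
receipts = ["r1", "r1", "r2", "r2", "r2", "r3", "r4", "r4"]
train_idx, test_idx = group_train_test_split(receipts)

train_receipts = {receipts[i] for i in train_idx}
test_receipts = {receipts[i] for i in test_idx}
print(train_receipts, test_receipts)
```

Since the split is made on receipts, the row-level share of the test set only approximates 25% (2,825 of 12,034 rows in this notebook).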

# 6. Model Training

## 6.1. Initialize Model

In [36]:
counter = Counter(y_train)
scale_pos_weight = counter[0] / counter[1]
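
As a quick arithmetic check, the ratio can be reproduced from the class counts printed in section 5. The counts below are copied from the `y_train.value_counts()` output; the variable names are chosen for this illustration only.

```python
from collections import Counter

# Class counts copied from the y_train.value_counts() output in section 5.
y_train_counts = Counter({0: 8570, 1: 639})

# Ratio of negatives to positives, passed to XGBoost as scale_pos_weight.
weight = y_train_counts[0] / y_train_counts[1]
positive_rate = y_train_counts[1] / sum(y_train_counts.values())

print(round(weight, 4))
print(round(positive_rate, 4))
```

The resulting ratio of roughly 13.41 matches the `scale_pos_weight` shown in the fitted pipeline in section 6.2.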

In [37]:
model = Pipeline(steps=[("sel", ColumnSelector(corr_threshold=0.90)),
                        ("cls", XGBClassifier(scale_pos_weight=scale_pos_weight, random_state=21))])

## 6.2. Specify Hyperparameters and Initialize Optimizer

In [38]:
not_fixed_params = {
    "cls__gamma": [0, 0.5, 1, 3, 5],
    "cls__max_depth": [3, 5, 10, 20],
    "cls__learning_rate": [0.01, 0.1, 0.2],
    "cls__reg_alpha": [0, 0.5, 1, 3, 5],
    "cls__min_child_weight": [10, 20, 30, 40],
    "cls__reg_lambda": [0, 0.5, 1, 3, 5],
    "cls__n_estimators": [10, 15, 20, 30, 40]
}

In [39]:
cv = list(GroupKFold(n_splits=4).split(X_train, y_train, group_col_train))

In [40]:
optimizer = RandomizedSearchCV(model,
                               param_distributions=not_fixed_params,
                               n_iter=300,
                               scoring="roc_auc",
                               n_jobs=-1,
                               cv=cv,
                               random_state=21,
                               refit=True,
                               verbose=10)

In [41]:
optimizer = optimizer.fit(X_train, y_train)

Fitting 4 folds for each of 300 candidates, totalling 1200 fits



In [42]:
model = model.set_params(**optimizer.best_params_)

In [43]:
model.fit(X_train, y_train)

Pipeline(steps=[('sel', ColumnSelector()),
                ('cls',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_bytree=1, gamma=0, gpu_id=-1,
                               importance_type='gain',
                               interaction_constraints='', learning_rate=0.2,
                               max_delta_step=0, max_depth=3,
                               min_child_weight=10, missing=nan,
                               monotone_constraints='()', n_estimators=40,
                               n_jobs=0, num_parallel_tree=1, random_state=21,
                               reg_alpha=0, reg_lambda=0.5,
                               scale_pos_weight=13.411580594679187, subsample=1,
                               tree_method='exact', validate_parameters=1,
                               verbosity=None))])

## 6.3. Store Estimator

In [44]:
with open('../models/model.pkl', 'wb') as fp:
    pickle.dump(model, fp)

### NOTES:
1. As we don't know much about the input features, we are using a tree-based model.
2. We are working with an imbalanced dataset, for which a tree boosting algorithm combined with class weighting (scale_pos_weight) is expected to perform well
3. For tuning the parameters we use a simple randomized search. In the future we can improve this by using the hyperopt package and its Bayesian optimization approach.
4. The final model is trained on the full training dataset
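
To put the randomized search in perspective, the sketch below counts the combinations in the search space from section 6.2 and draws 300 of them at random. It illustrates the idea only and is not how scikit-learn implements `RandomizedSearchCV`; the helpers `grid_size` and `sample_candidates` are hypothetical.

```python
import itertools
import random

# Search space copied from section 6.2.
search_space = {
    "cls__gamma": [0, 0.5, 1, 3, 5],
    "cls__max_depth": [3, 5, 10, 20],
    "cls__learning_rate": [0.01, 0.1, 0.2],
    "cls__reg_alpha": [0, 0.5, 1, 3, 5],
    "cls__min_child_weight": [10, 20, 30, 40],
    "cls__reg_lambda": [0, 0.5, 1, 3, 5],
    "cls__n_estimators": [10, 15, 20, 30, 40],
}


def grid_size(space):
    """Number of distinct parameter combinations in the full grid."""
    size = 1
    for values in space.values():
        size *= len(values)
    return size


def sample_candidates(space, n_iter, seed=21):
    """Draw n_iter distinct parameter combinations uniformly at random."""
    rng = random.Random(seed)
    keys = list(space)
    all_combinations = list(itertools.product(*(space[k] for k in keys)))
    chosen = rng.sample(all_combinations, n_iter)
    return [dict(zip(keys, combo)) for combo in chosen]


candidates = sample_candidates(search_space, n_iter=300)
print(grid_size(search_space), len(candidates))
```

With 300 iterations the search evaluates 1% of the 30,000 possible combinations, which is why a guided optimizer such as hyperopt is listed as a possible improvement.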

# 7. Evaluate Results

## 7.1. Load Model

In [45]:
with open('../models/model.pkl', 'rb') as fp:
    model = pickle.load(fp)

In [47]:
pred_test = model.predict_proba(X_test)[:,1]

In [48]:
metrics = EvaluateModel("target", "receipt_id")
metrics.fit(X_test, data.iloc[test_index, :][["receipt_id", "target"]], pred_test)

## 7.2. Evaluate Gini

In [49]:
roc_auc_score(y_train, model.predict_proba(X_train)[:,1])*2-1

0.9362272950551749

In [50]:
roc_auc_score(y_test, model.predict_proba(X_test)[:,1])*2-1

0.9328132093200028

## 7.3. Check Distribution

In [51]:
metrics._metrics["summary"]

Unnamed: 0_level_0,target,target,target
Unnamed: 0_level_1,sum,count,mean
prob_bin,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
"[0.0, 0.1)",2,2116,0.000945
"[0.1, 0.2)",2,14,0.142857
"[0.2, 0.3)",0,35,0.0
"[0.3, 0.4)",0,2,0.0
"[0.4, 0.5)",0,2,0.0
"[0.5, 0.6)",0,7,0.0
"[0.6, 0.7)",8,53,0.150943
"[0.7, 0.8)",54,377,0.143236
"[0.8, 0.9)",10,49,0.204082
"[0.9, 1.01)",142,170,0.835294


## 7.4. Check Success Rate

In [52]:
metrics._metrics["user_success_rate"]

0.6089965397923875

In [53]:
metrics._metrics["max_success_rate"]

0.754325259515571

In [54]:
metrics._metrics["norm_success_rate"]

0.8073394495412844

In [55]:
metrics._metrics["norm_no_dup_success_rate"]

0.9058823529411765
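
The four rates above come from `EvaluateModel` in `src.evaluate`, whose source is not shown in this notebook. The sketch below is one reading of the first three metrics that is consistent with the reported numbers (0.609 / 0.754 ≈ 0.807). The function `success_rates`, the tuple layout and the toy rows are assumptions made for illustration, and the duplicate-adjusted rate is left out.

```python
def success_rates(rows):
    """Compute per-receipt success rates.

    rows: iterable of (receipt_id, target, score) tuples, one per candidate
    transaction, where target is 1 for the correct transaction.
    """
    by_receipt = {}
    for receipt, target, score in rows:
        by_receipt.setdefault(receipt, []).append((target, score))

    n_receipts = len(by_receipt)
    n_with_match = 0  # receipts whose correct transaction is among the candidates
    n_top_hit = 0     # receipts whose top-scored candidate is the correct one
    for candidates in by_receipt.values():
        if any(target == 1 for target, _ in candidates):
            n_with_match += 1
        # max() returns the first maximal element, so ties go to the earlier row.
        top_target, _ = max(candidates, key=lambda c: c[1])
        if top_target == 1:
            n_top_hit += 1

    return {
        "user_success_rate": n_top_hit / n_receipts,
        "max_success_rate": n_with_match / n_receipts,
        "norm_success_rate": n_top_hit / n_with_match if n_with_match else None,
    }


# Toy rows: r1 and r4 are ranked correctly, r2 is missed, r3 has no match at all.
toy_rows = [
    ("r1", 1, 0.9), ("r1", 0, 0.2),
    ("r2", 0, 0.8), ("r2", 1, 0.3),
    ("r3", 0, 0.4), ("r3", 0, 0.1),
    ("r4", 1, 0.7),
]
rates = success_rates(toy_rows)
print(rates)
```

Under this reading, the normalized rate is the user rate divided by the maximum rate, which separates model errors from receipts that could never be matched.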

### NOTES
1. The discriminatory power of the model is very high with Gini on both Train and Test higher than 93%
2. Distribution of the final score looks good, with observed values across the full range between 0 and 1
3. The final success rate is reported using 4 metrics:
* "User Success Rate" - This is the success rate of showing the correct transaction in the final app, as experienced by the user
* "Maximum Success Rate" - This is the maximum achievable success rate, given that for a large proportion of the receipts the correct transaction is not supplied, and assuming that when 2 transactions have the same explanatory features and one of them is the correct one, we select the correct one by chance
* "Normalized Success Rate" - This is the success rate of the model after accounting for the receipts with missing matched transactions
* "Normalized For Missing and Duplicates Success Rate" - This is the success rate after accounting for the receipts with missing matched transactions and after excluding all receipts for which the matched transaction and an incorrect transaction have the same features.
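
Note 2 refers to the binned summary in section 7.3. A minimal, dependency-free sketch of how such a summary can be built is shown below, assuming equal-width bins on [0, 1]. The function `bin_summary` and the toy scores are illustrative and are not taken from `EvaluateModel`.

```python
def bin_summary(scores, labels, n_bins=10):
    """Count rows and positives per equal-width score bin on [0, 1]."""
    summary = [
        {"bin": (i / n_bins, (i + 1) / n_bins), "count": 0, "positives": 0}
        for i in range(n_bins)
    ]
    for score, label in zip(scores, labels):
        # A score of exactly 1.0 is placed in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        summary[index]["count"] += 1
        summary[index]["positives"] += label

    for row in summary:
        # Observed positive rate per bin; None for empty bins.
        row["rate"] = row["positives"] / row["count"] if row["count"] else None
    return summary


# Toy scores and labels, for illustration only.
toy_summary = bin_summary(
    scores=[0.05, 0.02, 0.95, 0.92, 1.0, 0.55],
    labels=[0, 0, 1, 1, 1, 0],
)
for row in toy_summary:
    print(row)
```

Comparing the observed rate in each bin with the bin's score range gives a first impression of how well the scores are calibrated.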