# <a name="0">Machine Learning Accelerator - Tabular Data Analysis - Predict if two products are substitutes of each other</a>

__Problem Definition__:
Given a pair of products, (A, B), we say that B is a "substitute" for A if a customer would buy B in place of A -- say, if A were out of stock. 

Submission Link: https://leaderboard.corp.amazon.com/tasks/542__

1. <a href="#1">Read the datasets</a>
2. <a href="#2">Feature Engineering</a>
    * <a href="#21">New Feature 1: Name Similarity Score</a>
    * <a href="#22">New Feature 2: Same Package Weight</a>
    * <a href="#23">New Feature 3: Same Group Code</a>
3. <a href="#3">Training the Models using AutoGluon</a>
4. <a href="#4">Imputing Missing Value using TreeRegressors</a>
5. <a href="#5">Predicting on Test Data for final Submission</a>

In [3]:
# Have to install dependency on the first run of each day because of the instance shutdown
# !pip install --upgrade pip
# !pip install --upgrade mxnet autogluon

# import nltk
# nltk.download('stopwords')

import sys
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")

import pandas as pd
import numpy as np

Collecting pip
  Using cached pip-21.0.1-py3-none-any.whl (1.5 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 20.3.3
    Uninstalling pip-20.3.3:
      Successfully uninstalled pip-20.3.3
Successfully installed pip-21.0.1
Collecting mxnet
  Using cached mxnet-1.7.0.post2-py2.py3-none-manylinux2014_x86_64.whl (54.7 MB)
Collecting autogluon
  Using cached autogluon-0.0.15-py3-none-any.whl (622 kB)
Collecting pyarrow<=1.0.0
  Using cached pyarrow-1.0.0-cp36-cp36m-manylinux2014_x86_64.whl (17.2 MB)
Collecting lightgbm<4.0,>=3.0
  Using cached lightgbm-3.1.1-py2.py3-none-manylinux1_x86_64.whl (1.8 MB)
Collecting scikit-optimize
  Using cached scikit_optimize-0.8.1-py2.py3-none-any.whl (101 kB)
Collecting Pillow<=6.2.1
  Using cached Pillow-6.2.1-cp36-cp36m-manylinux1_x86_64.whl (2.1 MB)
Collecting openml
  Using cached openml-0.11.0-py3-none-any.whl
Collecting fastparquet==0.4.1
  Using cached fastparquet-0.4.1-cp36-cp36m-linux_x86_64

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### <a name="1">Read the datasets</a>
(<a href="#0">Go to top</a>)

In [4]:
training_data = pd.read_csv('../../data/final_project/training.csv')
test_data = pd.read_csv('../../data/final_project/public_test_features.csv')

print('The shape of the training dataset is:', training_data.shape)
print('The shape of the test dataset is:', test_data.shape)

The shape of the training dataset is: (36803, 228)
The shape of the test dataset is: (15774, 227)


In [5]:
numerical_features = ["key_pkg_height","key_pkg_length","key_pkg_width","key_pkg_weight",
                              "key_fma_qualified_price_max",
                              "cand_pkg_height","cand_pkg_length","cand_pkg_width","cand_pkg_weight",
                              "cand_fma_qualified_price_max"]

categorical_features = ["key_Product Group Description","key_is_conveyable","key_Is Sortable", 
                        "key_item_package_quantity",
                        "cand_has_ean","cand_is_conveyable","cand_Is Sortable"]

text_features = ["key_item_name", "cand_item_name"]

model_features = numerical_features + text_features + categorical_features
labels = ["label"]

df_train = training_data[labels + model_features]
df_test = test_data[model_features]

print(df_train.shape, df_test.shape)

(36803, 20) (15774, 19)


In [6]:
df_train[categorical_features + text_features] = df_train[categorical_features + text_features].astype('str')
df_train[categorical_features + text_features] = df_train[categorical_features + text_features].astype('str')

## <a name="2">Feature Engineering</a>

### <a name='21'>New Feature 1: Name Similarity Score (Cosine Similarity using TFIDF)</a>
(<a href="#0">Go to top</a>)

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
df = df_train[text_features]
vectorizer = TfidfVectorizer(lowercase=True, stop_words=stop_words, min_df=1)
similarity = []
for i in range(df.shape[0]):
    train_corpus = df.iloc[i,:].values
    X = vectorizer.fit_transform(train_corpus)
    X = X.toarray()
    similarity += [np.dot(X,X.T)[0,1]]
df_train["name_similarity_score"] = similarity

In [8]:
df = df_test[text_features]
similarity = []
for i in range(df.shape[0]):
    test_corpus = df.iloc[i,:].values
    X = vectorizer.fit_transform(test_corpus)
    X = X.toarray()
    similarity += [np.dot(X,X.T)[0,1]]
df_test["name_similarity_score"] = similarity

### <a name="22">New Feature 2: Same Package Weight (comparing weight of cand and key taking 20% threshold)</a>
(<a href="#0">Go to top</a>)

In [9]:
df_train["same_pkg_weight"] = pd.Series(np.where((df_train["cand_pkg_weight"] >= (0.8 * df_train["key_pkg_weight"])) &
                                                  (df_train["cand_pkg_weight"] <= (1.2 * df_train["key_pkg_weight"]))
                                                 ,1,0))
df_test["same_pkg_weight"] = pd.Series(np.where((df_test["cand_pkg_weight"] >= (0.8 * df_test["key_pkg_weight"])) &
                                                  (df_test["cand_pkg_weight"] <= (1.2 * df_test["key_pkg_weight"]))
                                                 ,1,0))

### <a name="23">New Feature 3: Same Group Code (comparing group code of cand)</a>

In [10]:
df_train["same_group_code"] = pd.Series(np.where((training_data["key_Product Group Code"] ==
                                                        training_data["cand_Product Group Code"]), 1, 0))

df_test["same_group_code"] = pd.Series(np.where((test_data["key_Product Group Code"] ==
                                                        test_data["cand_Product Group Code"]), 1, 0))

In [11]:
df_train.corr()

Unnamed: 0,label,key_pkg_height,key_pkg_length,key_pkg_width,key_pkg_weight,key_fma_qualified_price_max,cand_pkg_height,cand_pkg_length,cand_pkg_width,cand_pkg_weight,cand_fma_qualified_price_max,name_similarity_score,same_pkg_weight,same_group_code
label,1.0,0.032031,0.041,0.048462,0.022392,0.013313,0.07362,0.064573,0.082386,0.043717,0.060295,0.320742,0.129493,0.166013
key_pkg_height,0.032031,1.0,0.539176,0.683234,0.590024,0.301298,0.635583,0.428025,0.509291,0.328133,0.250824,-0.038531,0.048967,0.058961
key_pkg_length,0.041,0.539176,1.0,0.732192,0.667089,0.374143,0.42835,0.720966,0.556671,0.361648,0.276712,-0.027401,0.043466,0.036564
key_pkg_width,0.048462,0.683234,0.732192,1.0,0.626726,0.439664,0.505264,0.557947,0.67599,0.344272,0.314171,-0.015953,0.061412,0.058608
key_pkg_weight,0.022392,0.590024,0.667089,0.626726,1.0,0.479824,0.461447,0.536128,0.479621,0.478348,0.319421,-0.022666,0.03818,0.040544
key_fma_qualified_price_max,0.013313,0.301298,0.374143,0.439664,0.479824,1.0,0.233689,0.243943,0.278267,0.194485,0.504088,-0.000105,0.02811,0.036246
cand_pkg_height,0.07362,0.635583,0.42835,0.505264,0.461447,0.233689,1.0,0.553863,0.696156,0.456604,0.362481,-0.013263,0.082362,0.069951
cand_pkg_length,0.064573,0.428025,0.720966,0.557947,0.536128,0.243943,0.553863,1.0,0.734212,0.486075,0.353359,-0.014916,0.065796,0.055634
cand_pkg_width,0.082386,0.509291,0.556671,0.67599,0.479621,0.278267,0.696156,0.734212,1.0,0.473026,0.397965,0.007672,0.091604,0.079998
cand_pkg_weight,0.043717,0.328133,0.361648,0.344272,0.478348,0.194485,0.456604,0.486075,0.473026,1.0,0.443138,-0.001214,0.035086,0.027513


In [12]:
df_test.corr()

Unnamed: 0,key_pkg_height,key_pkg_length,key_pkg_width,key_pkg_weight,key_fma_qualified_price_max,cand_pkg_height,cand_pkg_length,cand_pkg_width,cand_pkg_weight,cand_fma_qualified_price_max,key_item_package_quantity,name_similarity_score,same_pkg_weight,same_group_code
key_pkg_height,1.0,0.537994,0.678501,0.592348,0.306478,0.639413,0.410272,0.499368,0.397986,0.236314,-0.009018,-0.031247,0.044567,0.050775
key_pkg_length,0.537994,1.0,0.734862,0.671435,0.382063,0.42391,0.714133,0.551457,0.448621,0.270331,-0.022491,-0.02952,0.031344,0.03891
key_pkg_width,0.678501,0.734862,1.0,0.64216,0.449089,0.498872,0.543991,0.664343,0.434688,0.312112,-0.029954,2.8e-05,0.046829,0.063262
key_pkg_weight,0.592348,0.671435,0.64216,1.0,0.484632,0.47218,0.530911,0.492571,0.644989,0.352184,-0.01436,-0.015506,0.027134,0.044241
key_fma_qualified_price_max,0.306478,0.382063,0.449089,0.484632,1.0,0.226755,0.23155,0.271737,0.264712,0.53453,-0.020204,0.013032,0.017263,0.050544
cand_pkg_height,0.639413,0.42391,0.498872,0.47218,0.226755,1.0,0.561432,0.70627,0.496509,0.331277,-0.011863,-0.00785,0.070534,0.068123
cand_pkg_length,0.410272,0.714133,0.543991,0.530911,0.23155,0.561432,1.0,0.742694,0.545467,0.313616,-0.018478,-0.020273,0.04699,0.034434
cand_pkg_width,0.499368,0.551457,0.664343,0.492571,0.271737,0.70627,0.742694,1.0,0.550791,0.371589,-0.014623,0.010151,0.074774,0.064618
cand_pkg_weight,0.397986,0.448621,0.434688,0.644989,0.264712,0.496509,0.545467,0.550791,1.0,0.568189,-0.010273,-0.011609,0.022912,0.017134
cand_fma_qualified_price_max,0.236314,0.270331,0.312112,0.352184,0.53453,0.331277,0.313616,0.371589,0.568189,1.0,-0.017034,0.041167,0.047063,0.066301


In [13]:
df_train[["same_pkg_weight","same_group_code"]] = df_train[["same_pkg_weight","same_group_code"]].astype('str')
df_test[["same_pkg_weight","same_group_code"]] = df_test[["same_pkg_weight","same_group_code"]].astype('str')

In [27]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36803 entries, 0 to 36802
Data columns (total 23 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   label                          36803 non-null  int64  
 1   key_pkg_height                 33358 non-null  float64
 2   key_pkg_length                 36803 non-null  float64
 3   key_pkg_width                  33358 non-null  float64
 4   key_pkg_weight                 33190 non-null  float64
 5   key_fma_qualified_price_max    34374 non-null  float64
 6   cand_pkg_height                29405 non-null  float64
 7   cand_pkg_length                29405 non-null  float64
 8   cand_pkg_width                 29405 non-null  float64
 9   cand_pkg_weight                29135 non-null  float64
 10  cand_fma_qualified_price_max   28715 non-null  float64
 11  key_item_name                  36803 non-null  object 
 12  cand_item_name                 36803 non-null 

## <a name="4">Imputing the missing value for key_pkg_length because of its high importance in the model</a>

In [15]:
numerical_features_custom = ["key_fma_qualified_price_max"]

categorical_features_custom = ["key_Product Group Description","key_is_conveyable","key_Is Sortable",
                        "key_binding_description","key_classification_description","key_item_package_quantity"]

model_features_custom = numerical_features_custom + categorical_features_custom
label_custom = ["key_pkg_length"]

df_train_custom = training_data[model_features_custom + label_custom + ["ID"]]
df_test_custom = test_data[model_features_custom + label_custom + ["ID"]]

df_train_custom[categorical_features_custom] = df_train_custom[categorical_features_custom].astype('str')
df_test_custom[categorical_features_custom] = df_test_custom[categorical_features_custom].astype('str')

print(df_train_custom.shape,df_test_custom.shape)

(36803, 9) (15774, 9)


In [16]:
df = pd.concat([df_train_custom,df_test_custom])
df_train_new = df[df["key_pkg_length"].notna()][model_features_custom + label_custom + ["ID"]]
df_test_new = df[df["key_pkg_length"].isna()][model_features_custom + ["ID"]]
print(df_train_new.shape, df_test_new.shape)

(47682, 9) (4895, 8)


In [17]:
df_test_new.isna().sum()

key_fma_qualified_price_max       1743
key_Product Group Description        0
key_is_conveyable                    0
key_Is Sortable                      0
key_binding_description              0
key_classification_description       0
key_item_package_quantity            0
ID                                   0
dtype: int64

In [18]:
from autogluon import TabularPrediction as task

metric = 'root_mean_squared_error'

predictor = task.fit(train_data=df_train_new, 
                     label='key_pkg_length',
                     eval_metric=metric,
                    excluded_model_types=["NN"],
                    id_columns=["ID"])

No output_directory specified. Models will be saved in: AutogluonModels/ag-20210214_210452/
Beginning AutoGluon training ...
AutoGluon will save models to AutogluonModels/ag-20210214_210452/
AutoGluon Version:  0.0.15
Train Data Rows:    47682
Train Data Columns: 8
Preprocessing data ...
AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and label-values can't be converted to int).
	Label info (max, min, mean, stddev): (94.0, 0.0, 12.82366, 10.92852)
	If 'regression' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Dropping ID columns: ['ID']
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    6039.5 MB
	Train Data (Original)  Memory Usage: 18.86 MB (0.3% of available memory)
	Inferring data type of each feature based on column values. Set

[1000]	train_set's rmse: 0.871332	valid_set's rmse: 0.883522


	-0.8822	 = Validation root_mean_squared_error score
	4.16s	 = Training runtime
	0.4s	 = Validation runtime
Fitting model: LightGBMRegressorXT ...


[1000]	train_set's rmse: 1.79617	valid_set's rmse: 1.83393
[2000]	train_set's rmse: 1.41329	valid_set's rmse: 1.45641
[3000]	train_set's rmse: 1.20775	valid_set's rmse: 1.25153
[4000]	train_set's rmse: 1.09015	valid_set's rmse: 1.13068
[5000]	train_set's rmse: 1.01491	valid_set's rmse: 1.04712
[6000]	train_set's rmse: 0.968179	valid_set's rmse: 0.99369
[7000]	train_set's rmse: 0.938052	valid_set's rmse: 0.960404
[8000]	train_set's rmse: 0.918755	valid_set's rmse: 0.938838
[9000]	train_set's rmse: 0.904506	valid_set's rmse: 0.922571
[10000]	train_set's rmse: 0.89535	valid_set's rmse: 0.911684


	-0.9117	 = Validation root_mean_squared_error score
	39.05s	 = Training runtime
	8.22s	 = Validation runtime
Fitting model: CatboostRegressor ...
	-0.9868	 = Validation root_mean_squared_error score
	102.69s	 = Training runtime
	0.03s	 = Validation runtime
Fitting model: LightGBMRegressorCustom ...


[1000]	train_set's rmse: 0.870335	valid_set's rmse: 0.882567


	-0.8822	 = Validation root_mean_squared_error score
	8.51s	 = Training runtime
	0.8s	 = Validation runtime
Fitting model: weighted_ensemble_k0_l1 ...
	-0.3579	 = Validation root_mean_squared_error score
	0.48s	 = Training runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 195.29s ...


In [19]:
predictor.feature_importance(df_train_new)

Computing raw permutation importance for 8 features on weighted_ensemble_k0_l1 ...
	1.49s	= Expected runtime
	1.52s	= Actual runtime


key_is_conveyable                 9.934145
key_Is Sortable                   5.859989
key_fma_qualified_price_max       5.182504
key_Product Group Description     4.099280
key_binding_description           3.780384
key_item_package_quantity         1.166982
ID                                0.000000
key_classification_description    0.000000
dtype: float64

In [20]:
test_predictions = predictor.predict(df_test_new)
df_test_new["key_pkg_length"] = test_predictions
print(df_train_new.shape, df_test_new.shape)

(47682, 9) (4895, 9)


In [21]:
df = pd.concat([df_train_new,df_test_new])[["ID","key_pkg_length"]]
df.shape

(52577, 2)

In [23]:
training_data = training_data.drop(columns=["key_pkg_length"])
training_data = training_data.merge(df,
         on=["ID"],
         how='inner'
         )
df_train["key_pkg_length"] = training_data["key_pkg_length"]
df_train["key_pkg_length"].isna().sum()

0

In [25]:
test_data = test_data.drop(columns=["key_pkg_length"])
test_data = test_data.merge(df,
         on=["ID"],
         how='inner'
         )
df_test["key_pkg_length"] = test_data["key_pkg_length"]
df_test["key_pkg_length"].isna().sum()

0

In [26]:
!rm -r AutogluonModels #to save space

## <a name="3">Training the model using AutoGluon</a>
(<a href="#0">Go to top</a>)

In [28]:
#Notes: Removed key_has_ean and added same_weight boolean
from autogluon import TabularPrediction as task

metric = 'accuracy'
stopping_metric = 'balanced_accuracy'

predictor = task.fit(train_data=df_train, 
                     label='label',
                     eval_metric=metric,
                     stopping_metric=stopping_metric)

No output_directory specified. Models will be saved in: AutogluonModels/ag-20210214_211510/
Beginning AutoGluon training ...
AutoGluon will save models to AutogluonModels/ag-20210214_211510/
AutoGluon Version:  0.0.15
Train Data Rows:    36803
Train Data Columns: 22
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    5795.34 MB
	Train Data (Original)  Memory Usage: 33.68 MB (0.6% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify specia

In [29]:
predictor.leaderboard(extra_info=True, silent=True)

Unnamed: 0,model,score_val,pred_time_val,fit_time,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order,num_features,...,child_model_type,hyperparameters,hyperparameters_fit,AG_args_fit,features,child_hyperparameters,child_hyperparameters_fit,child_AG_args_fit,ancestors,descendants
0,weighted_ensemble_k0_l1,0.7232,3.605748,1057.4712,0.005597,1.191094,1,True,12,10,...,GreedyWeightedEnsembleModel,"{'max_models': 25, 'max_models_per_type': 5}",{},"{'max_memory_usage_ratio': 1.0, 'max_time_limi...","[LightGBMClassifierXT, RandomForestClassifierG...",{'ensemble_size': 100},{'ensemble_size': 94},"{'max_memory_usage_ratio': 1.0, 'max_time_limi...","[LightGBMClassifierXT, RandomForestClassifierG...",[]
1,CatboostClassifier,0.7124,0.245564,35.329732,0.245564,35.329732,0,True,9,3625,...,,"{'iterations': 10000, 'learning_rate': 0.1, 'r...",{'iterations': 231},"{'max_memory_usage_ratio': 1.0, 'max_time_limi...","[key_pkg_height, key_pkg_length, key_pkg_width...",,,,[],[weighted_ensemble_k0_l1]
2,LightGBMClassifierXT,0.7044,0.197916,10.224137,0.197916,10.224137,0,True,8,3625,...,,"{'num_boost_round': 10000, 'num_threads': -1, ...",{'num_boost_round': 56},"{'max_memory_usage_ratio': 1.0, 'max_time_limi...","[key_pkg_height, key_pkg_length, key_pkg_width...",,,,[],[weighted_ensemble_k0_l1]
3,LightGBMClassifier,0.7028,0.189816,9.885932,0.189816,9.885932,0,True,7,3625,...,,"{'num_boost_round': 10000, 'num_threads': -1, ...",{'num_boost_round': 23},"{'max_memory_usage_ratio': 1.0, 'max_time_limi...","[key_pkg_height, key_pkg_length, key_pkg_width...",,,,[],[weighted_ensemble_k0_l1]
4,LightGBMClassifierCustom,0.7012,0.208096,13.434147,0.208096,13.434147,0,True,11,3625,...,,"{'num_boost_round': 10000, 'num_threads': -1, ...",{'num_boost_round': 27},"{'max_memory_usage_ratio': 1.0, 'max_time_limi...","[key_pkg_height, key_pkg_length, key_pkg_width...",,,,[],[weighted_ensemble_k0_l1]
5,RandomForestClassifierGini,0.698,0.566915,189.710039,0.566915,189.710039,0,True,1,3625,...,,"{'n_estimators': 300, 'n_jobs': -1, 'random_st...",{'n_estimators': 300},"{'max_memory_usage_ratio': 1.0, 'max_time_limi...","[key_pkg_height, key_pkg_length, key_pkg_width...",,,,[],[weighted_ensemble_k0_l1]
6,RandomForestClassifierEntr,0.6968,0.668412,204.073175,0.668412,204.073175,0,True,2,3625,...,,"{'n_estimators': 300, 'n_jobs': -1, 'random_st...",{'n_estimators': 300},"{'max_memory_usage_ratio': 1.0, 'max_time_limi...","[key_pkg_height, key_pkg_length, key_pkg_width...",,,,[],[weighted_ensemble_k0_l1]
7,NeuralNetClassifier,0.6928,0.229975,118.473338,0.229975,118.473338,0,True,10,88,...,,"{'num_epochs': 500, 'epochs_wo_improve': 20, '...",{'num_epochs': 12},"{'ignored_type_group_special': ['text_ngram', ...","[key_pkg_height, key_pkg_length, key_pkg_width...",,,,[],[weighted_ensemble_k0_l1]
8,ExtraTreesClassifierEntr,0.6884,0.971923,474.789669,0.971923,474.789669,0,True,4,3625,...,,"{'n_estimators': 300, 'n_jobs': -1, 'random_st...",{'n_estimators': 300},"{'max_memory_usage_ratio': 1.0, 'max_time_limi...","[key_pkg_height, key_pkg_length, key_pkg_width...",,,,[],[weighted_ensemble_k0_l1]
9,ExtraTreesClassifierGini,0.684,0.870191,433.693546,0.870191,433.693546,0,True,3,3625,...,,"{'n_estimators': 300, 'n_jobs': -1, 'random_st...",{'n_estimators': 300},"{'max_memory_usage_ratio': 1.0, 'max_time_limi...","[key_pkg_height, key_pkg_length, key_pkg_width...",,,,[],[]


In [30]:
predictor.feature_importance(df_train)

Computing raw permutation importance for 22 features on weighted_ensemble_k0_l1 ...
	104.98s	= Expected runtime
	99.5s	= Actual runtime


cand_item_name                   0.141
key_item_name                    0.134
name_similarity_score            0.128
cand_pkg_width                   0.030
key_fma_qualified_price_max      0.024
cand_pkg_height                  0.023
cand_fma_qualified_price_max     0.023
cand_pkg_length                  0.022
cand_pkg_weight                  0.022
key_pkg_length                   0.021
same_group_code                  0.019
key_pkg_height                   0.019
key_pkg_width                    0.017
key_pkg_weight                   0.015
same_pkg_weight                  0.014
key_Product Group Description    0.010
key_item_package_quantity        0.004
cand_has_ean                     0.003
cand_Is Sortable                 0.002
key_is_conveyable                0.001
cand_is_conveyable               0.000
key_Is Sortable                 -0.001
dtype: float64

In [31]:
predictor.get_model_names()

['RandomForestClassifierGini',
 'RandomForestClassifierEntr',
 'ExtraTreesClassifierGini',
 'ExtraTreesClassifierEntr',
 'KNeighborsClassifierUnif',
 'KNeighborsClassifierDist',
 'LightGBMClassifier',
 'LightGBMClassifierXT',
 'CatboostClassifier',
 'NeuralNetClassifier',
 'LightGBMClassifierCustom',
 'weighted_ensemble_k0_l1']

In [32]:
predictor.feature_importance(df_train, model='RandomForestClassifierEntr')

Computing raw permutation importance for 22 features on RandomForestClassifierEntr ...
	24.66s	= Expected runtime
	25.1s	= Actual runtime


cand_item_name                   0.249
key_item_name                    0.196
name_similarity_score            0.056
same_group_code                  0.006
same_pkg_weight                  0.005
cand_fma_qualified_price_max     0.003
cand_pkg_width                   0.001
key_pkg_length                   0.001
key_Product Group Description    0.001
cand_is_conveyable               0.000
key_pkg_width                    0.000
key_pkg_weight                   0.000
key_fma_qualified_price_max      0.000
cand_pkg_weight                  0.000
cand_has_ean                     0.000
cand_Is Sortable                 0.000
key_is_conveyable                0.000
key_Is Sortable                  0.000
key_item_package_quantity        0.000
key_pkg_height                   0.000
cand_pkg_length                 -0.001
cand_pkg_height                 -0.001
dtype: float64

In [33]:
predictor.feature_importance(df_train, model='CatboostClassifier')

Computing raw permutation importance for 22 features on CatboostClassifier ...
	22.12s	= Expected runtime
	24.32s	= Actual runtime


name_similarity_score            0.116
cand_item_name                   0.041
key_item_name                    0.036
cand_fma_qualified_price_max     0.010
key_fma_qualified_price_max      0.007
same_group_code                  0.007
cand_pkg_weight                  0.006
key_pkg_length                   0.006
key_pkg_width                    0.005
cand_pkg_height                  0.004
cand_pkg_width                   0.002
key_item_package_quantity        0.001
key_Product Group Description    0.001
key_Is Sortable                  0.000
key_is_conveyable                0.000
cand_has_ean                     0.000
cand_is_conveyable               0.000
cand_Is Sortable                 0.000
cand_pkg_length                 -0.003
key_pkg_weight                  -0.004
key_pkg_height                  -0.004
same_pkg_weight                 -0.006
dtype: float64

In [34]:
predictor.feature_importance(df_train, model='LightGBMClassifierXT')

Computing raw permutation importance for 22 features on LightGBMClassifierXT ...
	18.6s	= Expected runtime
	19.07s	= Actual runtime


name_similarity_score            0.147
key_item_name                    0.034
cand_item_name                   0.020
key_pkg_weight                   0.019
key_Product Group Description    0.018
cand_pkg_weight                  0.014
key_pkg_height                   0.013
cand_fma_qualified_price_max     0.012
key_fma_qualified_price_max      0.010
key_pkg_width                    0.005
key_pkg_length                   0.004
cand_pkg_height                  0.003
cand_Is Sortable                 0.003
key_Is Sortable                  0.002
same_pkg_weight                  0.002
key_is_conveyable                0.001
key_item_package_quantity        0.001
cand_pkg_width                   0.001
cand_pkg_length                  0.001
cand_has_ean                     0.000
cand_is_conveyable              -0.001
same_group_code                 -0.003
dtype: float64

In [35]:
predictor.feature_importance(df_train, model='NeuralNetClassifier')

Computing raw permutation importance for 22 features on NeuralNetClassifier ...
	19.78s	= Expected runtime
	20.12s	= Actual runtime


name_similarity_score            0.084
cand_item_name                   0.052
key_item_name                    0.049
key_fma_qualified_price_max      0.020
key_Product Group Description    0.017
key_item_package_quantity        0.013
cand_pkg_width                   0.010
cand_fma_qualified_price_max     0.009
key_pkg_width                    0.009
key_Is Sortable                  0.007
cand_pkg_weight                  0.007
same_group_code                  0.006
key_pkg_height                   0.005
cand_pkg_length                  0.004
cand_Is Sortable                 0.003
key_pkg_length                   0.003
cand_pkg_height                  0.002
cand_has_ean                     0.001
cand_is_conveyable               0.001
key_pkg_weight                   0.001
key_is_conveyable                0.000
same_pkg_weight                 -0.007
dtype: float64

## <a name="4">Predicting on Test Data for final submission</a>
(<a href="#0">Go to top</a>)

In [36]:
test_predictions = predictor.predict(df_test)
test_data["label"] = test_predictions

In [37]:
submission = test_data[["ID","label"]]
submission.head(5)

Unnamed: 0,ID,label
0,35057,1
1,41573,0
2,44029,0
3,6462,1
4,17533,0


In [38]:
submission["label"].value_counts()

1    8759
0    7015
Name: label, dtype: int64

In [39]:
submission.to_csv("../../data/final_project/Autogluon_V_10.0.csv", index=False)

In [None]:
#!rm -r AutogluonModels

# Best prediction feature list: Leaderboard Score 0.7176236

In [None]:
# #Winning set of feature : 
# numerical_features = ["key_pkg_height","key_pkg_length","key_pkg_width","key_pkg_weight",
#                               "key_fma_qualified_price_max",
#                               "cand_pkg_height","cand_pkg_length","cand_pkg_width","cand_pkg_weight",
#                               "cand_fma_qualified_price_max"]

# categorical_features = ["key_Product Group Description","key_is_conveyable","key_Is Sortable", 
#                         "key_item_package_quantity",
#                         "cand_has_ean","cand_is_conveyable","cand_Is Sortable"]

# text_features = ["key_item_name", "cand_item_name"]

# model_features = numerical_features + text_features + categorical_features
# labels = ["label"]