<a href="https://colab.research.google.com/github/fabriziobasso/kaggle/blob/main/Models_dnn_v0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **S4E1 BANK CHURN**

##### About Dataset
The bank customer churn dataset is a commonly used dataset for predicting customer churn in the banking industry. It contains information on bank customers who either left the bank or continue to be a customer. The dataset includes the following attributes:

* Customer ID: A unique identifier for each customer
* Surname: The customer's surname or last name
* Credit Score: A numerical value representing the customer's credit score
* Geography: The country where the customer resides (France, Spain or Germany)
* Gender: The customer's gender (Male or Female)
* Age: The customer's age.
* Tenure: The number of years the customer has been with the bank
* Balance: The customer's account balance
* NumOfProducts: The number of bank products the customer uses (e.g., savings account, credit card)
* HasCrCard: Whether the customer has a credit card (1 = yes, 0 = no)
* IsActiveMember: Whether the customer is an active member (1 = yes, 0 = no)
* EstimatedSalary: The estimated salary of the customer
* Exited: Whether the customer has churned (1 = yes, 0 = no)

##### Evaluation
Submissions are evaluated on the area under the ROC curve (AUC) between the predicted probability and the observed target.
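As a rough illustration of what the metric rewards, the sketch below computes AUC as the fraction of (churned, retained) pairs in which the churned customer receives the higher score. The helper name `roc_auc` and the toy labels and scores are assumptions made for this example; the notebook itself imports `roc_auc_score` from scikit-learn, which should agree on this input.

```python
def roc_auc(y_true, y_score):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Count correctly ordered positive/negative pairs; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not taken from the competition files)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc_value = roc_auc(y_true, y_score)
print(auc_value)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

Only the ordering of the scores matters here, so any strictly increasing rescaling of the predictions leaves the value unchanged.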

#### **Files:**
* train.csv - the training dataset; Exited is the binary target
* test.csv - the test dataset; your objective is to predict the probability of Exited
* sample_submission.csv - a sample submission file in the correct format
* Churn_Modelling.csv - Original Dataset

## 1.0 Workbook Set-up and Libraries:

#### 1.0 Libraries

In [1]:
%%capture
!pip install tensorflow-addons
#!pip install shap
#!pip install eli5
#!pip install tf-nightly
#!pip install -U scikit-learn==1.2.0
!pip install catboost
#!pip install haversine
#!pip install pytorch-forecasting
!pip install umap-learn
#!pip install reverse_geocoder
#!pip install --upgrade protobuf
!pip install colorama
!pip install imbalanced-learn
!pip install optuna
!pip install optuna-integration
#!pip install pygam
!pip install keras-tuner --upgrade
#!pip install pycaret
#!pip install lightning==2.0.1
!pip install keras-nlp
#!pip install MiniSom
!pip install BorutaShap

In [2]:
#importing modules

import warnings
warnings.filterwarnings('ignore')
import time
t = time.time()

print('Importing started...')

# basic modules
import os
import numpy as np
import pandas as pd
import re
#from scipy import stats
from random import randint
import random
import math
import gc
import pickle
from glob import glob
from IPython import display as ipd
from tqdm import tqdm
from datetime import datetime
from joblib import dump, load
import sklearn as sk
from imblearn.over_sampling import SMOTE, RandomOverSampler
from functools import partial
import itertools
import joblib
from itertools import combinations
import IPython
import statsmodels.api as sm
import IPython.display
from IPython.display import clear_output
from prettytable import PrettyTable

# visualization modules
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.gridspec as gridspec
import matplotlib.patches as mpatches
from matplotlib_venn import venn2_unweighted
import seaborn as sns
import missingno as msno
import imblearn
import scipy.stats as stats
from scipy.special import boxcox, boxcox1p


# Palette Setup
colors = ['#FB5B68','#FFEB48','#2676A1','#FFBDB0',]
colormap_0 = mpl.colors.LinearSegmentedColormap.from_list("",colors)
palette_1 = sns.color_palette("coolwarm", as_cmap=True)
palette_2 = sns.color_palette("YlOrBr", as_cmap=True)
palette_3 = sns.light_palette("red", as_cmap=True)
palette_4 = sns.color_palette("viridis", as_cmap=True)
palette_5 = sns.color_palette("rocket", as_cmap=True)
palette_6 = sns.color_palette("GnBu", as_cmap=True)
palette_7 = sns.color_palette("tab20c", as_cmap=False)
palette_8 = sns.color_palette("Set2", as_cmap=False)

palette_custom = ['#fbb4ae','#b3cde3','#ccebc5','#decbe4','#fed9a6','#ffffcc','#e5d8bd','#fddaec','#f2f2f2']
palette_9 = sns.color_palette(palette_custom, as_cmap=False)

sns.set_style("whitegrid",{"grid.linestyle":"--", 'grid.linewidth':0.2, 'grid.alpha':0.5})
#sns.set_theme(style="ticks", context="notebook")
sns.despine(left=True, bottom=True, top=False, right=False)

mpl.rcParams['axes.spines.left'] = True
mpl.rcParams['axes.spines.right'] = False
mpl.rcParams['axes.spines.top'] = False
mpl.rcParams['axes.spines.bottom'] = True

# Style Import
from colorama import Style, Fore
red = Style.BRIGHT + Fore.RED
blu = Style.BRIGHT + Fore.BLUE
mgt = Style.BRIGHT + Fore.MAGENTA
gld = Style.BRIGHT + Fore.YELLOW
res = Style.RESET_ALL

# preprocessing modules
from sklearn.model_selection import (train_test_split,
                                     KFold,
                                     StratifiedKFold,
                                     cross_val_score,
                                     GroupKFold,
                                     GridSearchCV,
                                     RepeatedStratifiedKFold)

from sklearn.preprocessing import (LabelEncoder,
                                   StandardScaler,
                                   MinMaxScaler,
                                   OrdinalEncoder,
                                   RobustScaler,
                                   PowerTransformer,
                                   OneHotEncoder,
                                   QuantileTransformer,
                                   PolynomialFeatures)

from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.compose import ColumnTransformer

from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import FunctionTransformer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer


# metrics
from sklearn.metrics import (mean_squared_error,
                             r2_score,
                             mean_absolute_error,
                             mean_absolute_percentage_error,
                             classification_report,
                             confusion_matrix,
                             ConfusionMatrixDisplay,
                             multilabel_confusion_matrix,
                             accuracy_score,
                             roc_auc_score,
                             auc,
                             roc_curve,
                             log_loss)


# modeling algos
from sklearn.linear_model import (LogisticRegression,
                                  Lasso,
                                  ridge_regression,
                                  LinearRegression,
                                  Ridge,
                                  RidgeCV,
                                  ElasticNet,
                                  BayesianRidge,
                                  TweedieRegressor,
                                  ARDRegression,
                                  PoissonRegressor,
                                  GammaRegressor)

from sklearn.neighbors import KNeighborsRegressor

from sklearn.tree import DecisionTreeRegressor
from sklearn.isotonic import IsotonicRegression

from sklearn.ensemble import (AdaBoostRegressor,
                              RandomForestRegressor,
                              RandomForestClassifier,
                              VotingRegressor,
                              GradientBoostingRegressor,
                              GradientBoostingClassifier,
                              StackingRegressor,
                              HistGradientBoostingClassifier,
                              ExtraTreesClassifier)

from sklearn.base import BaseEstimator, TransformerMixin

# Other Models
#from pygam import LogisticGAM, s, te
import xgboost as xgb
from xgboost import XGBRegressor, XGBClassifier
import lightgbm as lgb
from lightgbm import (LGBMRegressor,
                      LGBMClassifier,
                      early_stopping,
                      record_evaluation,
                      log_evaluation)

#import catboost as cat
from catboost import CatBoost, CatBoostRegressor
from catboost import CatBoostClassifier

#from catboost.utils import get_roc_curve
# check installed version
#import pycaret
warnings.filterwarnings("ignore")
#from minisom import MiniSom

from sklearn.base import clone ## sklearn base models for stacked ensemble model
from sklearn.calibration import CalibratedClassifierCV, CalibrationDisplay

# Interpretability of the model
#import shap
#import eli5
#from eli5.sklearn import PermutationImportance


# pipelines
from sklearn.pipeline import (make_pipeline,
                              Pipeline)


import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import tensorflow.keras.backend as K
import tensorflow_addons as tfa
from keras.utils import FeatureSpace
import keras_nlp

# Import libraries for Hypertuning
import keras_tuner as kt  # `kerastuner` is the deprecated module name of the keras-tuner package
from keras_tuner import RandomSearch, GridSearch, BayesianOptimization
# Model Tuning tools:
import optuna
from optuna.integration import TFKerasPruningCallback
from optuna.integration import LightGBMPruningCallback, XGBoostPruningCallback
from optuna.trial import TrialState
from optuna.visualization import plot_intermediate_values
from optuna.visualization import plot_optimization_history
from optuna.visualization import plot_param_importances
from optuna.visualization import plot_contour
# Feature selection
from BorutaShap import BorutaShap
%matplotlib inline
SEED = 1984
N_SPLITS = 10

print('Done, All the required modules are imported. Time elapsed: {} sec'.format(time.time()-t))

Importing started...
Using TensorFlow backend
Done, All the required modules are imported. Time elapsed: 9.716738939285278 sec


<Figure size 640x480 with 0 Axes>

In [3]:
# Check Versions:
print("CHECK VERSIONS:")
print(f"sns: {sns.__version__}")
print(f"mpl: {mpl.__version__}")
print(f"tensorflow: {tf.__version__}")
print(f"pandas: {pd.__version__}")
print(f"numpy: {np.__version__}")
print(f"scikit-learn: {sk.__version__}")
print(f"statsmodels: {sm.__version__}")
print(f"missingno: {msno.__version__}")
#print(f"TF-addon: {tfa.__version__}")
print(f"imbalanced-learn: {imblearn.__version__}")
print(f"XGBoost: {xgb.__version__}")
#print(f"CatBoost: {cat.__version__}")
#print(f"PyCaret: {pycaret.__version__}")

CHECK VERSIONS:
sns: 0.13.1
mpl: 3.7.1
tensorflow: 2.15.0
pandas: 1.5.3
numpy: 1.23.5
scikit-learn: 1.2.2
statsmodels: 0.14.1
missingno: 0.5.2
imbalanced-learn: 0.10.1
XGBoost: 2.0.3


In [4]:
def seed_everything(seed,
                    tensorflow_init=True,
                    pytorch_init=True):
    """
    Seeds the Python, NumPy and (optionally) TensorFlow / PyTorch
    random number generators for reproducibility of results.
    """
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    if tensorflow_init:
        tf.random.set_seed(seed)
    if pytorch_init:
        # torch is not imported at the top of the notebook, so import it lazily
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False


seed_everything(42,tensorflow_init=True,pytorch_init=False)
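A quick way to check that seeding has the intended effect is to draw a few numbers twice with the same seed and compare them. The sketch below does this for Python's built-in generator only; the helper `draw` is a hypothetical name introduced for the example, and the same idea should carry over to the NumPy and TensorFlow generators seeded above.

```python
import random


def draw(seed, n=3):
    """Seed Python's RNG, then return n pseudo-random integers."""
    random.seed(seed)
    return [random.randint(0, 100) for _ in range(n)]


first = draw(42)
second = draw(42)

# Re-seeding with the same value replays the same sequence.
print(first == second)
```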

### **1.1 Utility Functions**

#### Graph Functions:

In [5]:
def summary(df):
    print(f'data shape: {df.shape}')
    summ = pd.DataFrame(df.dtypes, columns=['data type'])
    summ['#missing'] = df.isnull().sum().values
    summ['%missing'] = df.isnull().sum().values / len(df)* 100
    summ['#unique'] = df.nunique().values
    desc = pd.DataFrame(df.describe(include='all').transpose())
    summ['min'] = desc['min'].values
    summ['max'] = desc['max'].values
    summ['median'] = desc['50%'].values
    summ['mean'] = desc['mean'].values
    return summ

def plot_confusion_matrix(y_true, y_pred, labels):
    """
    This function plots:
        1. Confusion matrix
        2. Precision matrix
        3. Recall matrix

    Parameters
    ----------
    `y_true`: ground truth (or actual) values
    `y_pred`: predicted values
    `labels`: integer encoded target values

    Returns none.
    """
    cmat = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=labels)
    pmat = cmat / cmat.sum(axis=0)
    print("Column sum of precision matrix: {}".format(pmat.sum(axis=0)))
    rmat = ((cmat.T) / (cmat.sum(axis=1).T)).T
    print("Row sum of recall matrix:       {}".format(rmat.sum(axis=1)))

    plt.figure(figsize=(15, 3))
    plt.subplot(131)
    plot_heatmap(matrix=cmat, title='Confusion Matrix', labels=labels)
    plt.subplot(132)
    plot_heatmap(matrix=pmat, title='Precision Matrix', labels=labels)
    plt.subplot(133)
    plot_heatmap(matrix=rmat, title='Recall Matrix', labels=labels)
    plt.show()

def plot_heatmap(matrix, title, labels):
    """
    This function plots the heatmap.

    Parameters
    ----------
    `matrix`: 2D array
    `title`: title
    `labels`: integer encoded target values

    Returns none.
    """
    sns.heatmap(data=matrix, annot=True, fmt='.2f', linewidths=0.1,
                xticklabels=labels, yticklabels=labels)
    plt.xlabel(xlabel='Predicted Class')
    plt.ylabel(ylabel='Actual Class')
    plt.title(label=title, fontsize=10)
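To make the two normalisations in `plot_confusion_matrix` concrete, here is a small sketch in plain Python on a made-up 2x2 confusion matrix (rows are actual classes, columns are predicted classes, following scikit-learn's convention). The counts are illustrative assumptions, not results from this dataset.

```python
# Toy confusion matrix: rows = actual class, columns = predicted class
cmat = [[50, 10],
        [5, 35]]

n = len(cmat)
col_sums = [sum(cmat[i][j] for i in range(n)) for j in range(n)]  # per predicted class
row_sums = [sum(cmat[i]) for i in range(n)]                       # per actual class

# Precision matrix: every column is divided by its column sum,
# so each column sums to 1 and the diagonal holds per-class precision.
pmat = [[cmat[i][j] / col_sums[j] for j in range(n)] for i in range(n)]

# Recall matrix: every row is divided by its row sum,
# so each row sums to 1 and the diagonal holds per-class recall.
rmat = [[cmat[i][j] / row_sums[i] for j in range(n)] for i in range(n)]

print("precision of class 1:", pmat[1][1])  # 35 / 45
print("recall of class 1:   ", rmat[1][1])  # 35 / 40
```

This is why the function prints the column sums of the precision matrix and the row sums of the recall matrix: both should be vectors of ones.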

#### Data Analysis Functions

### **1.2 Connect Drives**

In [6]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

/bin/bash: line 1: nvidia-smi: command not found


Connect to Google Drive:

In [7]:
%%capture
# Connect to Colab:
from google.colab import drive
drive.mount('/content/drive')

In [8]:
folder_data = "/content/drive/MyDrive/Exercises/Studies_Structured_Data/Data/S4E1_BankChurn"
models_folders = "/content/drive/MyDrive/Exercises/Studies_Structured_Data/Models/S4E1_BankChurn"
folders_nn = "/content/drive/MyDrive/Exercises/Studies_Structured_Data/Models/S4E1_BankChurn/neural_networks/"
folders_trees = "/content/drive/MyDrive/Exercises/Studies_Structured_Data/Models/S4E1_BankChurn/trees_models/"

list_directories = [folder_data,models_folders,folders_nn,folders_trees]

for path in list_directories:
    try:
        os.mkdir(path)
    except FileExistsError:
        # Only report the "already exists" case; other OS errors should surface
        print(f"{path} already exists")


os.chdir(folder_data)

/content/drive/MyDrive/Exercises/Studies_Structured_Data/Data/S4E1_BankChurn already exists
/content/drive/MyDrive/Exercises/Studies_Structured_Data/Models/S4E1_BankChurn already exists
/content/drive/MyDrive/Exercises/Studies_Structured_Data/Models/S4E1_BankChurn/neural_networks/ already exists
/content/drive/MyDrive/Exercises/Studies_Structured_Data/Models/S4E1_BankChurn/trees_models/ already exists
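Note that `os.mkdir` creates only the last component of a path, so it fails when a parent folder is missing. A possible alternative, sketched below under the assumption that nested folders may need to be created from scratch, is `os.makedirs(..., exist_ok=True)`. The folder names are placeholders and the example runs inside a temporary directory rather than on Google Drive.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical nested layout, loosely mirroring the folders used above
    target = os.path.join(root, "Models", "S4E1_BankChurn", "neural_networks")

    os.makedirs(target, exist_ok=True)  # creates all missing parent folders
    os.makedirs(target, exist_ok=True)  # a second call is a silent no-op

    created = os.path.isdir(target)

print(created)
```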


## 2.0 Create Datasets

In [9]:
#train = pd.read_csv('new_train_dnn_small_norm_cat.csv',index_col=0)
train = pd.read_csv('new_train_small.csv',index_col=0)
#old_train = pd.read_csv("Churn_Modelling.csv")
#test = pd.read_csv("new_test_dnn_small_norm_cat.csv",index_col=0)
test = pd.read_csv("new_test_small.csv",index_col=0)

ensemble = pd.read_csv("ensemble_new_data.csv",index_col=0)

duplicate_results_test_df = pd.read_csv("known_test_targets.csv",index_col=0)
sample_submission = pd.read_csv('sample_submission.csv',index_col=0)

# Drop column id
#train.drop('id',axis=1,inplace=True)
#test.drop('id',axis=1,inplace=True)
#old_train.dropna(inplace=True,axis=0)
#old_train.rename({"RowNumber":"id"},axis=1,inplace=True)
#old_train.set_index("id", inplace=True)

In [10]:
print("TRAIN DATA shape: {}".format(train.shape))
display(train.head(3))
#print("OLD-TRAIN DATA: {}".format(old_train.shape))
#display(old_train.head(3))
print("TEST DATA: {}".format(test.shape))
display(test.head(3))

TRAIN DATA shape: (165034, 77)


Unnamed: 0,act_nop_count,CreditScore_pca_comb_count_label,0-100_Balance_RangeBalance_Range_OHE,Germany_Male_Geo_GenderGeo_Gender_OHE,1.862856099342586pow2_Age_cat_OHE,EstimatedSalary_Range_enc,log_Age_cat,NumOfProducts_cat_count_label,Age_cat,Surname_tfidf_1,...,NumOfProducts_cat_count,1_Female_bs_genderbs_gender_OHE,Balance_Salary_Range_enc,Surname_tfidf_3,Geography_count,Geography_count/NumOfProducts_cat_count_label,Balance_pca_comb,act_age_enc,Age-NumOfProducts_cat_count,Exited
0,-0.101357,0.397885,-0.309433,-0.35226,-0.149078,-0.130872,-0.51031,-0.945227,-0.510312,-0.122063,...,0.425803,-0.571571,-0.81504,-0.142446,0.86685,1.397681,0.910394,-0.364637,-0.554434,0.0
1,0.823512,-0.847092,-0.309433,-0.35226,-0.149078,-1.974514,-0.51031,-0.945227,-0.510312,-0.122044,...,0.425803,-0.571571,-0.921566,-0.142508,0.86685,1.397681,0.910394,-0.843953,-0.554434,0.0
2,-0.101357,-1.118724,-0.309433,-0.35226,-0.149078,1.367406,-0.510291,-0.945227,-0.510289,-0.153394,...,0.425803,-0.571571,-0.93248,-0.187588,0.86685,1.397681,0.910394,-0.010303,-0.554427,0.0


TEST DATA: (110023, 76)


Unnamed: 0,act_nop_count,CreditScore_pca_comb_count_label,0-100_Balance_RangeBalance_Range_OHE,Germany_Male_Geo_GenderGeo_Gender_OHE,1.862856099342586pow2_Age_cat_OHE,EstimatedSalary_Range_enc,log_Age_cat,NumOfProducts_cat_count_label,Age_cat,Surname_tfidf_1,...,bs_gender_enc,NumOfProducts_cat_count,1_Female_bs_genderbs_gender_OHE,Balance_Salary_Range_enc,Surname_tfidf_3,Geography_count,Geography_count/NumOfProducts_cat_count_label,Balance_pca_comb,act_age_enc,Age-NumOfProducts_cat_count
0,0.823512,-0.405691,-0.309433,-0.35226,-0.149078,0.195666,1.959626,-0.945227,1.959626,-0.12258,...,0.234525,0.425803,1.749564,-0.711784,-0.145081,0.86685,1.397681,0.910394,-0.785562,0.241988
1,0.226791,-1.073452,-0.309433,-0.35226,-0.149078,-1.093178,-0.510276,0.909379,-0.510269,-0.12206,...,0.317463,-0.175903,1.749564,-0.765409,-0.142503,0.86685,-0.887967,0.910394,2.393417,-0.003469
2,-0.101357,-0.699959,-0.309433,-0.35226,-0.149078,0.211634,-0.510307,-0.945227,-0.510309,-0.12206,...,0.146496,0.425803,1.749564,-1.064111,-0.142503,0.86685,1.397681,0.910394,-0.500302,-0.554433


In [11]:
#total = pd.concat([train,test],axis=0,ignore_index=True)
#train = total.iloc[:len(train),:]
#test = total.iloc[len(train):,:]
#cat_col

In [12]:
train = train.astype("float")
test = test.astype("float")
cat_col = [name for name in train.columns if train[name].nunique()<25]

#train_=train.copy()
train[cat_col] = train[cat_col].astype("int")
cat_col.remove("Exited")
test[cat_col] = test[cat_col].astype("int")
summary(train).style.background_gradient(cmap='Reds')

data shape: (165034, 77)


| feature | data type | #missing | %missing | #unique | min | max | median | mean |
|---|---|---|---|---|---|---|---|---|
| act_nop_count | int64 | 0 | 0.0 | 2 | -5.0 | 0.0 | 0.0 | -0.10207 |
| CreditScore_pca_comb_count_label | float64 | 0 | 0.0 | 459 | -1.130042 | 4.053589 | -0.258558 | 0.0 |
| 0-100_Balance_RangeBalance_Range_OHE | int64 | 0 | 0.0 | 2 | 0.0 | 3.0 | 0.0 | 0.262146 |
| Germany_Male_Geo_GenderGeo_Gender_OHE | int64 | 0 | 0.0 | 2 | 0.0 | 2.0 | 0.0 | 0.220779 |
| 1.862856099342586pow2_Age_cat_OHE | int64 | 0 | 0.0 | 2 | 0.0 | 6.0 | 0.0 | 0.130446 |
| EstimatedSalary_Range_enc | float64 | 0 | 0.0 | 165034 | -3.732784 | 4.49813 | -0.059559 | -0.0 |
| log_Age_cat | int64 | 0 | 0.0 | 2 | 0.0 | 1.0 | 0.0 | 0.206606 |
| NumOfProducts_cat_count_label | int64 | 0 | 0.0 | 2 | 0.0 | 2.0 | 0.0 | 0.040828 |
| Age_cat | int64 | 0 | 0.0 | 2 | 0.0 | 1.0 | 0.0 | 0.206606 |
| Surname_tfidf_1 | float64 | 0 | 0.0 | 1007 | -0.193693 | 8.134518 | -0.12206 | 0.0 |


## **3.0 Dataset Manager:**


In [13]:
strat_feature = train["Exited"]

X=train.drop(columns=["Exited"]).copy()
y=train["Exited"].copy()
X_test=test.copy()

X.shape[1]==X_test.shape[1]

True

In [14]:
num_var = X.select_dtypes("float").columns
list_to_stand = [name for name in num_var if X[name].nunique()>25]

scaler = QuantileTransformer(subsample=50_000, output_distribution="normal",random_state=42)

X[list_to_stand] = scaler.fit_transform(X[list_to_stand])
X_test[list_to_stand] = scaler.transform(X_test[list_to_stand])

In [15]:
summary(X.select_dtypes("int")).style.background_gradient(cmap='Reds')

data shape: (165034, 38)


| feature | data type | #missing | %missing | #unique | min | max | median | mean |
|---|---|---|---|---|---|---|---|---|
| act_nop_count | int64 | 0 | 0.0 | 2 | -5.0 | 0.0 | 0.0 | -0.10207 |
| 0-100_Balance_RangeBalance_Range_OHE | int64 | 0 | 0.0 | 2 | 0.0 | 3.0 | 0.0 | 0.262146 |
| Germany_Male_Geo_GenderGeo_Gender_OHE | int64 | 0 | 0.0 | 2 | 0.0 | 2.0 | 0.0 | 0.220779 |
| 1.862856099342586pow2_Age_cat_OHE | int64 | 0 | 0.0 | 2 | 0.0 | 6.0 | 0.0 | 0.130446 |
| log_Age_cat | int64 | 0 | 0.0 | 2 | 0.0 | 1.0 | 0.0 | 0.206606 |
| NumOfProducts_cat_count_label | int64 | 0 | 0.0 | 2 | 0.0 | 2.0 | 0.0 | 0.040828 |
| Age_cat | int64 | 0 | 0.0 | 2 | 0.0 | 1.0 | 0.0 | 0.206606 |
| 1.6829802775748723pow2_Age_cat_OHE | int64 | 0 | 0.0 | 2 | 0.0 | 4.0 | 0.0 | 0.193802 |
| bs_nop_count | int64 | 0 | 0.0 | 3 | -2.0 | 0.0 | 0.0 | -0.33009 |
| Geography_count_label-NumOfProducts_cat_count | int64 | 0 | 0.0 | 2 | 0.0 | 6.0 | 0.0 | 0.122484 |


In [16]:
summary(X.select_dtypes("float")).style.background_gradient(cmap='Blues')

data shape: (165034, 38)


| feature | data type | #missing | %missing | #unique | min | max | median | mean |
|---|---|---|---|---|---|---|---|---|
| CreditScore_pca_comb_count_label | float64 | 0 | 0.0 | 457 | -5.199338 | 5.199338 | 0.003764 | -0.04285 |
| EstimatedSalary_Range_enc | float64 | 0 | 0.0 | 165027 | -5.199338 | 5.199338 | -0.003226 | -0.005359 |
| Surname_tfidf_1 | float64 | 0 | 0.0 | 1007 | -5.199338 | 5.199338 | 0.03388 | 0.005193 |
| NumOfProducts_enc | float64 | 0 | 0.0 | 165027 | -5.199338 | 5.199338 | 0.006432 | 0.006823 |
| bx_cx_CreditScore_count | float64 | 0 | 0.0 | 314 | -5.199338 | 5.199338 | -0.007527 | 0.038753 |
| IsActiveMember_enc | float64 | 0 | 0.0 | 165034 | -5.199338 | 5.199338 | -0.003051 | 0.000323 |
| quant_EstimatedSalary | float64 | 0 | 0.0 | 55296 | -5.199338 | 5.199338 | -0.001681 | -0.003034 |
| Tenure_enc | float64 | 0 | 0.0 | 165032 | -5.199338 | 5.199338 | 0.007474 | 0.007017 |
| Gender_enc | float64 | 0 | 0.0 | 165029 | -5.199338 | 5.199338 | 0.00307 | 0.002051 |
| Balance_Salary | float64 | 0 | 0.0 | 98048 | -5.199338 | 5.199338 | 0.003093 | -0.00309 |
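Every transformed column above shares the same minimum and maximum of roughly ±5.199338. A likely explanation, assuming that scikit-learn's `QuantileTransformer` with `output_distribution="normal"` clips the extreme quantiles at about 1e-7 before applying the inverse normal CDF, is that these are simply the normal quantiles at that threshold. The sketch below checks the arithmetic with the standard library only; the threshold value is an assumption about the library's internals, not something stated in this notebook.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
eps = 1e-7                 # assumed clipping threshold for extreme quantiles

lower = std_normal.inv_cdf(eps)      # quantile of the smallest clipped probability
upper = std_normal.inv_cdf(1 - eps)  # quantile of the largest clipped probability

print(round(lower, 6), round(upper, 6))
```

If the assumption holds, the bounds are a property of the transform rather than of the data, which would explain why they are identical across features.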


In [17]:
num_var = X.select_dtypes("float").columns
cat_var = X.select_dtypes("int").columns

X[num_var] = X[num_var].astype("float32")
X_test[num_var] = X_test[num_var].astype("float32")  # same dtype as the training features

In [18]:
X["Exited"] = y
weight=X[X["Exited"]==0].shape[0]/X[X["Exited"]==1].shape[0]
X["weights"] = [1 if x==0 else weight for x in X["Exited"]]
X_test["Exited"] = 0
X_test["weights"] = 1
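The weighting rule used above (weight 1 for the majority class, the negative-to-positive ratio for the minority class) can be checked on a tiny example. The labels below are made up for illustration; the point of the sketch is that, with this rule, both classes contribute the same total weight to the loss.

```python
# Toy target vector (hypothetical): 8 retained customers, 2 churned
labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]

n_neg = labels.count(0)
n_pos = labels.count(1)
pos_weight = n_neg / n_pos  # ratio of negatives to positives

# Same rule as in the cell above: 1 for class 0, the ratio for class 1
weights = [1.0 if label == 0 else pos_weight for label in labels]

total_neg = sum(w for w, label in zip(weights, labels) if label == 0)
total_pos = sum(w for w, label in zip(weights, labels) if label == 1)

print(pos_weight, total_neg, total_pos)  # 4.0 8.0 8.0
```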

### 3.1 Data Loading

In [19]:
def dataframe_to_dataset(dataframe, shuffle=False, batch_size=32):
    dataframe = dataframe.copy()
    labels = dataframe["Exited"]
    weights = dataframe.pop("weights")
    dataframe = dataframe.drop(columns=["Exited"])

    ds = tf.data.Dataset.from_tensor_slices((
                                             dict(dataframe),
                                             labels,
                                             weights
                                             ))
    if shuffle:
      ds = ds.shuffle(buffer_size=len(dataframe))
    ds = ds.batch(batch_size)
    ds = ds.prefetch(batch_size)
    return ds

In [20]:
train_dataset = dataframe_to_dataset(X, batch_size=32, shuffle=True)
test_dataset = dataframe_to_dataset(X_test, batch_size=32, shuffle=False)

In [None]:
%%time
feature_space = FeatureSpace(
                            features={**{a:FeatureSpace.integer_categorical(num_oov_indices=1, output_mode="int") for a in cat_var},**{a:FeatureSpace.float_normalized() for a in num_var}},
                            output_mode="dict"
                            )

train_ds_with_no_labels = train_dataset.map(lambda x, *_: x)
print("Adapting Features Space....")
feature_space.adapt(train_ds_with_no_labels)

preprocessed_train_ds = train_dataset.map(lambda x, y, w: (feature_space(x), y, w), num_parallel_calls=tf.data.AUTOTUNE).prefetch(tf.data.AUTOTUNE)

Cause: could not parse the source code of <function <lambda> at 0x7e81e8ba0d30>: no matching AST found among candidates:



Adapting Features Space....


In [None]:
gc.collect()

## **3.0 MODELS**