In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns


# Introduction

This month's Tabular Playground Series is a binary classification problem:

> The August 2022 edition of the Tabular Playground Series is an opportunity to help the fictional company Keep It Dry improve its main product Super Soaker. The product is used in factories to absorb spills and leaks.
> 
> The company has just completed a large testing study for different product prototypes. Can you use this data to build a model that predicts product failures?
> 

Let's first read in the data and take a look at what we are dealing with:

In [None]:
sample = pd.read_csv('/kaggle/input/tabular-playground-series-aug-2022/sample_submission.csv')
train = pd.read_csv('/kaggle/input/tabular-playground-series-aug-2022/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-aug-2022/test.csv')

In [None]:
train.head()

In [None]:
train.info()

Nrows,Ncols = train.shape

The data consists of information about different products: a product code, `loading`, the type of construction material (`attribute_0` and `attribute_1`), some unknown numerical data (`attribute_2` and `attribute_3`), and a series of `measurement_i` columns.

In [None]:

plt.figure(figsize=(24,6))
for i in range(6):
    plt.subplot(1,6,i+1)
    train[train.columns[i+1]].hist()
    _=plt.title(f'Distribution of {train.columns[i+1]}')


The `product_code` feature seems more or less uniformly distributed amongst A, B, C, D, and E. The `loading` feature is numerical with a slight positive skew, but we could possibly get away with assuming it is normally distributed. The `attribute_i` features are categorical variables (possibly ordinal) with unbalanced (non-uniform) distributions. We may need to account for this imbalance when splitting the data for cross-validation, and keep it in mind when setting up and choosing the model.

The test set consists of new products as evidenced by the values of the `product_code` feature in the test set (see below). For prediction, we can initially drop the product code from the training and test set, although there might be some hidden information in this feature.

In [None]:
test['product_code'].value_counts()
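
A quick way to confirm that the test products are all new is to intersect the two sets of product codes. The sketch below uses small toy frames (`train_demo` and `test_demo`, with made-up codes) rather than the competition files, so treat it as an illustration of the check only:

```python
import pandas as pd

# Toy stand-ins for the competition frames (hypothetical values).
train_demo = pd.DataFrame({'product_code': ['A', 'B', 'C', 'D', 'E', 'A']})
test_demo = pd.DataFrame({'product_code': ['F', 'G', 'H', 'I', 'F']})

# Product codes present in both sets; an empty set means every test product is new.
shared_codes = set(train_demo['product_code']) & set(test_demo['product_code'])
print(f'Codes shared between train and test: {sorted(shared_codes)}')
```

On the real data, the same two lines with `train` and `test` in place of the toy frames would do the job.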

# Missing data analysis

Let's turn to missing data. I recently discovered the [missingno](https://github.com/ResidentMario/missingno) package, which makes it very easy to draw missing-data figures. It is already installed in the Kaggle environment, so there is no need to `pip install` it; just import it.

In [None]:
import missingno as msno

# Plot a random sample of 250 distinct rows (seeded so the figure is reproducible)
msno.matrix(train.sample(250, random_state=0))

It is clear that missing data needs to be dealt with here. Some of the `loading` values are missing, as are some of the `measurement_i` values. Next, we'll print the number of missing items per feature in both the training set and the test set:

In [None]:
train.drop(['failure'],axis=1).isna().sum().to_frame().rename(columns={0:'Training set'}).join(test.isna().sum().rename('Test set'))

The missingness seems similar between the training and test sets. The product code and the "attribute" features are completely non-missing, as are `measurement_i` for $i \leq 2$. `measurement_i` for $i \geq 3$ have some missing values, with a progressively increasing proportion of missingness. This suggests that there is a series of measurements, the results of which determine whether or not later tests are carried out. The probability of a later measurement being missing doesn't appear to be strongly related to the missingness of earlier measurements, as inferred from `msno.heatmap` (not shown here); however, the dendrogram analysis from msno shows a definite pattern:

In [None]:
msno.dendrogram(train)

The interpretation of this figure (from the [missingno readme file](https://github.com/ResidentMario/missingno#readme)) is that variables that are more likely to be missing together join closer to zero on the y-axis. Thus there is a definite relationship between the missingness of the `measurement_i` variables as $i$ increases. Missingness might also be related to the value (and not just the presence or absence) of previous measurements. We should probably treat infilling of the measurements carefully and try to understand why measurements are missing.
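
The pairwise relationship between missingness in different columns can also be checked in plain pandas by correlating the 0/1 missingness indicators, which is roughly the quantity a nullity heatmap visualizes. The sketch below runs on synthetic data (the columns `m3`, `m4` and `m5` and their missingness rates are invented for illustration), so the numbers say nothing about the competition data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic data: 'm4' is missing whenever 'm3' is missing (plus a little extra),
# while 'm5' is missing independently of both.
demo = pd.DataFrame(rng.normal(size=(n, 3)), columns=['m3', 'm4', 'm5'])
miss3 = rng.random(n) < 0.10
miss4 = miss3 | (rng.random(n) < 0.05)
miss5 = rng.random(n) < 0.10
demo.loc[miss3, 'm3'] = np.nan
demo.loc[miss4, 'm4'] = np.nan
demo.loc[miss5, 'm5'] = np.nan

# Correlation between the 0/1 missingness indicators of each pair of columns.
nullity_corr = demo.isna().astype(int).corr()
print(nullity_corr.round(2))
```

A value near 1 means two columns tend to be missing in the same rows, and a value near 0 means their missingness is unrelated.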

# Distribution of the measurement variables

The measurement variables all look approximately normally distributed. Some of them appear to be integers, particularly `measurement_0`, `measurement_1` and `measurement_2`, while the rest are continuous.

In [None]:
# Let pandas create the grid of subplots itself, one histogram per measurement column
measurement_cols = [x for x in train.columns if 'measurement' in x]
_=train[measurement_cols].hist(figsize=(20,20))

In [None]:
for col in [x for x in train.columns if 'measurement' in x]:
    print(f'Variable {col}: {train[col].nunique()} unique values')
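
Counting unique values hints at which columns are integers, but it can also be tested directly by checking whether every non-missing entry has a zero fractional part. A minimal sketch on a toy frame (the column names `measurement_a` and `measurement_b` are made up):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'measurement_a': [7.0, 4.0, np.nan, 9.0],     # whole numbers stored as floats
    'measurement_b': [17.2, 16.8, 18.1, np.nan],  # genuinely continuous
})

# A column is integer-valued if every non-missing entry has no fractional part.
is_integer_valued = demo.apply(lambda s: (s.dropna() % 1 == 0).all())
print(is_integer_valued)
```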


# Distribution of target variable (success/failure)

The probability of failure is around 1/5, based on the distribution of the `failure` column in the training set. As with the unbalanced `attribute` features, we may need to take this into account in later modelling steps, particularly when designing any cross-validation scheme.

In [None]:
train['failure'].value_counts()
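
To read the failure rate off directly rather than from raw counts, `value_counts` can return proportions. The sketch below uses a toy target (`y_demo`, built to contain exactly one failure in five) in place of `train['failure']`:

```python
import pandas as pd

# Toy target with one failure in five (hypothetical values).
y_demo = pd.Series([0, 0, 0, 0, 1] * 20, name='failure')

# normalize=True returns class proportions instead of raw counts.
class_share = y_demo.value_counts(normalize=True)
print(class_share)
```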

# Simple data cleaning pipeline

The basic steps in data cleaning here are infilling the missing data and encoding the categorical variables (`attribute_0` and `attribute_1`). We saw that missingness in the measurements is probably related to the values and/or missingness of earlier measurements, and possibly other variables. For now, however, let's use a simple imputation method and infill every column that has missing values (`loading` and the measurement variables) with its median. We can drop `id` as it contains no predictive information. `product_code` is not useful at this stage either, as the training and test sets contain different products.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder

# Columns with at least one missing value in the training set
missing_cols = list(train.columns[train.isna().sum()>0])
categorical_cols = ['attribute_0', 'attribute_1']

# Median-impute the incomplete columns and ordinal-encode the categorical ones;
# categories not seen during fitting are encoded as -1
preprocessing = ColumnTransformer([('median_infill', SimpleImputer(strategy='median'), missing_cols),
                                   ('ordinal_encode', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value = -1), categorical_cols)],
                                  remainder='passthrough')

# Quick check that the transformer runs on both data sets
preprocessing.fit_transform(train.drop(['id','product_code','failure'], axis=1))
preprocessing.transform(test.drop(['id','product_code'], axis=1))
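
Since missingness may itself carry information, as noted above, one option for a later iteration is to keep a record of which values were imputed. `SimpleImputer` supports this through `add_indicator=True`. This is not what the pipeline above does; the sketch below only shows the mechanics on a tiny made-up array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_demo = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan],
                   [5.0, 40.0]])

# add_indicator=True appends one 0/1 column for each feature that had
# missing values when the imputer was fitted.
imputer = SimpleImputer(strategy='median', add_indicator=True)
X_filled = imputer.fit_transform(X_demo)
print(X_filled)
```

The first two output columns are the median-filled features and the last two flag the rows where each feature was originally missing.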

# Baseline model - XGBoost regressor

Let's attach a model to the end of the data preprocessing pipeline and make some predictions as a baseline. I manually tuned a few of the hyper-parameters.


In [None]:
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline
modelling_pipeline = Pipeline(steps = [('preprocessing', preprocessing),
                                       ('xgboost', XGBRegressor(n_estimators = 350,
                                                                objective = 'binary:logistic',
                                                                eval_metric = 'auc',
                                                                eta = 0.2,
                                                                max_depth = 2,
                                                                gamma = 1.2,
                                                                random_state = 200))])


First, we split the training data into training and validation sets to estimate the evaluation score before submitting to the competition. For now, let's use `train_test_split`, although in future we may want a more sophisticated cross-validation scheme that stratifies and groups the data, as discussed above.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score # Evaluation metric

# Hold out 20% of the training data for validation
Xtr, Xte, ytr, yte = train_test_split(train.drop(['id','product_code','failure'], axis=1), train['failure'],
                                      test_size=0.2, random_state = 123)
modelling_pipeline.fit(Xtr, ytr)
# With objective='binary:logistic', predict() returns probabilities, which is what AUC needs
print(f'Estimated score: {roc_auc_score(yte, modelling_pipeline.predict(Xte)):0.3f}')

For submission purposes, refit the pipeline on the entire training set and then predict on the test set:

In [None]:
modelling_pipeline.fit(train.drop(['id','product_code','failure'], axis=1), train['failure'])

predictions = modelling_pipeline.predict(test.drop(['id','product_code'], axis=1))

submission = pd.DataFrame({'id': test['id'],
                           'failure': predictions})

submission.to_csv('submission.csv', index=False)
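
Before uploading, it can be worth checking the submission against the sample file. The helper below (`check_submission`, a hypothetical name of my own) is a minimal sketch that runs on toy frames; on the real data it would be called with `submission` and `sample`. The range check assumes the predictions are probabilities, as they are for the pipeline above.

```python
import pandas as pd

# Toy stand-ins for sample_submission.csv and the submission frame (hypothetical values).
sample_demo = pd.DataFrame({'id': [100, 101, 102], 'failure': [0.0, 0.0, 0.0]})
submission_demo = pd.DataFrame({'id': [100, 101, 102], 'failure': [0.12, 0.31, 0.08]})

def check_submission(sub, template):
    """Return a list of problems found; an empty list means the frame looks well formed."""
    if list(sub.columns) != list(template.columns):
        # Without the expected columns the remaining checks cannot run.
        return ['column names differ from the template']
    problems = []
    if not sub['id'].equals(template['id']):
        problems.append('ids differ from the template')
    if sub['failure'].isna().any():
        problems.append('predictions contain missing values')
    if not sub['failure'].between(0, 1).all():
        problems.append('predictions fall outside [0, 1]')
    return problems

print(check_submission(submission_demo, sample_demo))
```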

# Submit to competition

Thanks for reading, comments are welcome.