# Setup

In [1]:
from pathlib import Path
import os

iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')
if iskaggle:
    !pip install -Uqq fastai
    path = Path('/kaggle/input/playground-series-s4e3')
else:
    import zipfile,kaggle
    path = Path('playground-series-s4e3')
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)

Downloading playground-series-s4e3.zip to /home/carljvh/code/git/Kaggle/Steel_plate_defects


100%|███████████████████████████████████████████████████| 1.74M/1.74M [00:04<00:00, 404kB/s]







In [2]:
import pandas as pd
import numpy as np
import warnings

import matplotlib.pyplot as plt
import seaborn as sns

# Loading the data

In [3]:
train_df = pd.read_csv(path/'train.csv')
test_df = pd.read_csv(path/'test.csv')
target_classes = ["Pastry", "Z_Scratch", "K_Scatch", "Stains", "Dirtiness", "Bumps", "Other_Faults"]
targets_df = train_df[target_classes]

In [4]:
train_df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| id | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| X_Minimum | 584.0 | 808.0 | 39.0 | 781.0 | 1540.0 |
| X_Maximum | 590.0 | 816.0 | 192.0 | 789.0 | 1560.0 |
| Y_Minimum | 909972.0 | 728350.0 | 2212076.0 | 3353146.0 | 618457.0 |
| Y_Maximum | 909977.0 | 728372.0 | 2212144.0 | 3353173.0 | 618502.0 |
| Pixels_Areas | 16.0 | 433.0 | 11388.0 | 210.0 | 521.0 |
| X_Perimeter | 8.0 | 20.0 | 705.0 | 16.0 | 72.0 |
| Y_Perimeter | 5.0 | 54.0 | 420.0 | 29.0 | 67.0 |
| Sum_of_Luminosity | 2274.0 | 44478.0 | 1311391.0 | 3202.0 | 48231.0 |
| Minimum_of_Luminosity | 113.0 | 70.0 | 29.0 | 114.0 | 82.0 |


In [5]:
test_df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| id | 19219.0 | 19220.0 | 19221.0 | 19222.0 | 19223.0 |
| X_Minimum | 1015.0 | 1257.0 | 1358.0 | 158.0 | 559.0 |
| X_Maximum | 1033.0 | 1271.0 | 1372.0 | 168.0 | 592.0 |
| Y_Minimum | 3826564.0 | 419960.0 | 117715.0 | 232415.0 | 544375.0 |
| Y_Maximum | 3826588.0 | 419973.0 | 117724.0 | 232440.0 | 544389.0 |
| Pixels_Areas | 659.0 | 370.0 | 289.0 | 80.0 | 140.0 |
| X_Perimeter | 23.0 | 26.0 | 36.0 | 10.0 | 19.0 |
| Y_Perimeter | 46.0 | 28.0 | 32.0 | 11.0 | 15.0 |
| Sum_of_Luminosity | 62357.0 | 39293.0 | 29386.0 | 8586.0 | 15524.0 |
| Minimum_of_Luminosity | 67.0 | 92.0 | 101.0 | 107.0 | 103.0 |


In [6]:
targets_df.head()

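| | Pastry | Z_Scratch | K_Scatch | Stains | Dirtiness | Bumps | Other_Faults |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |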


# Understanding the data

This discussion post gives a good description of what the features and labels mean, which the competition documentation does not cover: https://www.kaggle.com/competitions/playground-series-s4e3/discussion/481015

The following is copied from there as a reference:

An explanation of each of the steel plate faults present in this Kaggle competition, reminding you that all these faults are superficial:

* Pastry: Pastry refers to small patches or irregularities on the surface of the steel plate, typically caused by imperfections in the manufacturing process or handling during transport. These imperfections can affect the surface smoothness and appearance of the steel plate.

* Z_Scratch: Z-scratches are narrow scratches or marks on the surface of the steel plate that run parallel to the rolling direction. Various factors, such as handling, machining, or contact with abrasive materials during production or transportation, can cause these scratches.

* K_Scratch (column K_Scatch in the data): K-scratches are similar to Z-scratches but run perpendicular to the rolling direction. They can also be caused by handling, machining, or contact with abrasive materials during manufacturing or transportation processes.

* Stains: Stains refer to discolored or contaminated areas on the surface of the steel plate. These stains can result from various sources, such as rust, oil, grease, or other foreign substances that come into contact with the steel surface during processing, storage, or handling.

* Dirtiness: Dirtiness indicates the presence of dirt or particulate matter on the surface of the steel plate. This can include various types of debris or contaminants that accumulate during manufacturing, handling, or storage processes.

* Bumps: Bumps are raised or protruding areas on the surface of the steel plate. These can be caused by irregularities in the manufacturing process, such as uneven rolling or cooling, or by physical damage during handling or transportation.

* Other_Faults: This category likely encompasses a broader range of faults or defects not explicitly categorized in the other fault types listed. It could include various types of surface imperfections, irregularities, or abnormalities that affect the quality or usability of the steel plate.

Here is some further information about the features:

The "Steel Plates Faults" dataset contains 27 features that describe each fault in detail. Some of them are explained below.

Location Features:

* X_Minimum: The minimum x-coordinate of the fault.
* X_Maximum: The maximum x-coordinate of the fault.
* Y_Minimum: The minimum y-coordinate of the fault.
* Y_Maximum: The maximum y-coordinate of the fault.

Size Features:

* Pixels_Areas: Area of the fault in pixels.
* X_Perimeter: Perimeter along the x-axis of the fault.
* Y_Perimeter: Perimeter along the y-axis of the fault.

Luminosity Features:

* Sum_of_Luminosity: Sum of luminosity values in the fault area.
* Minimum_of_Luminosity: Minimum luminosity value in the fault area.
* Maximum_of_Luminosity: Maximum luminosity value in the fault area.

Material and Index Features:

* TypeOfSteel_A300: Type of steel (A300).
* TypeOfSteel_A400: Type of steel (A400).
* Steel_Plate_Thickness: Thickness of the steel plate.
* Edges_Index, Empty_Index, Square_Index, Outside_X_Index, Edges_X_Index, Edges_Y_Index, Outside_Global_Index: Various index values related to edges and geometry.

Logarithmic Features:

* LogOfAreas: Logarithm of the area of the fault.
* Log_X_Index, Log_Y_Index: Logarithmic indices related to X and Y coordinates.

Statistical Features:

* Orientation_Index: Index describing orientation.
* Luminosity_Index: Index related to luminosity.
* SigmoidOfAreas: Sigmoid function applied to areas.
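
As a small illustration of how the location features can be combined, the sketch below derives the extent of each fault along both axes from the first five training rows shown earlier. Reading the four coordinates as a bounding box is an assumption, and the column names `bbox_width` and `bbox_height` are hypothetical; they are not part of the dataset.

```python
import pandas as pd

# First five training rows, copied from the train_df.head() output above.
sample = pd.DataFrame({
    "X_Minimum": [584, 808, 39, 781, 1540],
    "X_Maximum": [590, 816, 192, 789, 1560],
    "Y_Minimum": [909972, 728350, 2212076, 3353146, 618457],
    "Y_Maximum": [909977, 728372, 2212144, 3353173, 618502],
})

def add_bbox_extents(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two hypothetical bounding-box columns added."""
    out = df.copy()
    # Extent of the fault along each axis (assumed bounding-box reading).
    out["bbox_width"] = out["X_Maximum"] - out["X_Minimum"]
    out["bbox_height"] = out["Y_Maximum"] - out["Y_Minimum"]
    return out

extents = add_bbox_extents(sample)
print(extents[["bbox_width", "bbox_height"]])
```

The same function could be applied to `train_df` and `test_df`, since both contain the four location columns.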

# Looking at the targets

Looking at the distribution of the targets, we can see that this is an imbalanced dataset. We need to keep this in mind, e.g. by stratifying the test and validation sets and by stratifying the cross-validation folds when we optimise hyperparameters.

In [7]:
targets_df.sum()

Pastry          1466
Z_Scratch       1150
K_Scatch        3432
Stains           568
Dirtiness        485
Bumps           4763
Other_Faults    6558
dtype: int64
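
One possible way to stratify with several binary target columns is to collapse each row's labels into a single key and spread every key evenly over the folds. The sketch below is a minimal pandas-only version on a toy frame, not the approach this notebook necessarily uses later; `stratification_key`, `assign_folds`, and the toy data are hypothetical names introduced for illustration.

```python
import pandas as pd

def stratification_key(targets: pd.DataFrame) -> pd.Series:
    """Join the names of the active labels in each row into one string key."""
    cols = list(targets.columns)
    return targets.apply(
        lambda row: "+".join(c for c in cols if row[c] == 1) or "none", axis=1
    )

def assign_folds(key: pd.Series, n_splits: int = 5, seed: int = 0) -> pd.Series:
    """Spread the rows of each key value round-robin over n_splits folds."""
    shuffled = key.sample(frac=1.0, random_state=seed)  # shuffle row order
    # Number the rows inside each key group, then wrap around the fold count.
    folds = shuffled.groupby(shuffled).cumcount() % n_splits
    return folds.reindex(key.index)  # restore the original row order

# Toy data: 8 rows with Bumps, 4 rows with Stains, 2 rows with no label.
toy = pd.DataFrame({
    "Bumps":  [1] * 8 + [0] * 6,
    "Stains": [0] * 8 + [1] * 4 + [0] * 2,
})
key = stratification_key(toy)
folds = assign_folds(key, n_splits=2)
print(pd.crosstab(key, folds))
```

Each fold ends up with the same share of every key. On the real data, `stratification_key(targets_df)` could also be passed as the `y` argument to scikit-learn's `StratifiedKFold.split`, although very rare label combinations would have fewer members than folds.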

Note that the overwhelming majority of rows have exactly one target, but a few have two targets and a significant number have none. This matters because it makes the problem multi-label rather than multi-class.

In [8]:
targets_df.sum(axis=1).value_counts()

1    18380
0      818
2       21
Name: count, dtype: int64
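
To see why a multi-class encoding would lose information here, the sketch below collapses three toy rows (one label, no label, two labels) into a single class with `idxmax`. The toy rows are made up for illustration; only the column names come from the data.

```python
import pandas as pd

target_classes = ["Pastry", "Z_Scratch", "K_Scatch", "Stains",
                  "Dirtiness", "Bumps", "Other_Faults"]

toy_targets = pd.DataFrame(
    [
        [0, 0, 0, 1, 0, 0, 0],  # one label: Stains
        [0, 0, 0, 0, 0, 0, 0],  # no label at all
        [0, 0, 1, 0, 0, 0, 1],  # two labels: K_Scatch and Other_Faults
    ],
    columns=target_classes,
)

labels_per_row = toy_targets.sum(axis=1)
# idxmax returns the first column holding the row maximum, so it always
# names exactly one class, even when zero or two labels are set.
single_class = toy_targets.idxmax(axis=1)
print(pd.DataFrame({"n_labels": labels_per_row, "idxmax": single_class}))
```

The row without labels is reported as Pastry and the row with two labels loses Other_Faults, which is why treating each target as its own binary output is the safer framing.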

We look at the items with two targets and see that almost all of them are a combination of K_Scatch and Other_Faults, with 2 being K_Scatch + Bumps.
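
A breakdown like this could be produced along the following lines. This is a sketch on a toy frame rather than the notebook's actual cell; `two_label_combinations` is a hypothetical helper, and the toy counts are not the real ones.

```python
import pandas as pd

def two_label_combinations(targets: pd.DataFrame) -> pd.Series:
    """Count each pair of labels among the rows that have exactly two labels."""
    cols = list(targets.columns)
    two = targets[targets.sum(axis=1) == 2]
    # Describe each row by the names of its active labels, in column order.
    combos = two.apply(
        lambda row: " + ".join(c for c in cols if row[c] == 1), axis=1
    )
    return combos.value_counts()

# Toy data: three two-label rows and one single-label row.
toy_targets = pd.DataFrame(
    [
        [1, 0, 1],  # K_Scatch + Other_Faults
        [1, 0, 1],  # K_Scatch + Other_Faults
        [1, 1, 0],  # K_Scatch + Bumps
        [0, 1, 0],  # Bumps only, ignored by the helper
    ],
    columns=["K_Scatch", "Bumps", "Other_Faults"],
)
combo_counts = two_label_combinations(toy_targets)
print(combo_counts)
```

Calling `two_label_combinations(targets_df)` on the real targets would list every pair that occurs among the 21 two-label rows.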