# AI 4 Materials Industry
# Case study 1: Faulty steel plates
# Notebook 1: Exploratory Data Analysis for tabular data


## The dataset

The dataset consists of a series of features describing 6 well-defined classes of defects and one class containing all other faults. The dataset was made available by the [Semeion research center](http://www.semeion.it/wordpress/).

The following URLs link to the dataset itself and to some example code:
* [Dataset at UCI ML](http://archive.ics.uci.edu/ml/datasets/steel+plates+faults "Faulty steel plate dataset")
* [Kaggle page with example code](https://www.kaggle.com/uciml/faulty-steel-plates "Kaggle")

Dependent variables, which we will try to predict (7 types of steel plate faults):
1. Pastry
2. Z_Scratch
3. K_Scratch
4. Stains
5. Dirtiness
6. Bumps
7. Other_Faults


27 independent variables, which we will use to make predictions:
* X_Minimum
* X_Maximum
* Y_Minimum
* Y_Maximum
* Pixels_Areas
* X_Perimeter
* Y_Perimeter
* Sum_of_Luminosity
* Minimum_of_Luminosity
* Maximum_of_Luminosity
* Length_of_Conveyer
* TypeOfSteel_A300
* TypeOfSteel_A400
* Steel_Plate_Thickness
* Edges_Index
* Empty_Index
* Square_Index
* Outside_X_Index
* Edges_X_Index
* Edges_Y_Index
* Outside_Global_Index
* LogOfAreas
* Log_X_Index
* Log_Y_Index
* Orientation_Index
* Luminosity_Index
* SigmoidOfAreas


## Exploratory Data Analysis (EDA)

The first step when doing machine learning is always to explore the data you're going to work with.

* What is the type of data?
* What does it represent?
* What are the limitations?
* Is the dataset balanced? Is each class roughly equally represented?
* Which features correlate with each other? Does this make sense?
* Which features correlate with which output class?

Knowing what the data can and cannot tell you is crucial for deciding what kind of models can be built on top of it, and what these models will be capable of.
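
As a rough illustration, the questions above map onto a handful of pandas calls. The sketch below uses a small made-up dataframe (the column names `area`, `perimeter` and `label` and all values are assumptions for illustration, not part of the steel plates dataset).

```python
import pandas as pd

# Made-up toy data, only meant to illustrate the calls
toy = pd.DataFrame({
    "area": [10, 200, 15, 180, 12, 220],
    "perimeter": [12, 60, 16, 55, 14, 70],
    "label": ["Stains", "Bumps", "Stains", "Bumps", "Stains", "Stains"],
})

# What is the type of data?
print(toy.dtypes)

# Is the dataset balanced? Fraction of samples per class
class_share = toy["label"].value_counts(normalize=True)
print(class_share)

# Which features correlate with each other?
feature_corr = toy[["area", "perimeter"]].corr()
print(feature_corr)

# How do the features differ between classes?
class_means = toy.groupby("label")[["area", "perimeter"]].mean()
print(class_means)
```

The same calls work on the real dataframe once it has been loaded; only the column names change.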

### Preliminary: Setting up the environment

Here we import basic functionality for plotting and doing math. It's not important if not everything in this step makes sense to you right away.

In [None]:
# Install the packages we'll be using. This only needs to happen once.
# It might be necessary to restart the kernel after this step.

! pip install seaborn
! pip install scikit-learn
! pip install shap
! pip install pyarrow
! pip install matplotlib

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

import numpy as np # Math functions
import pandas as pd # pandas handles tabular data; we use it for reading and manipulating the dataset
import matplotlib.pyplot as plt # Plot functions
import seaborn as sns # More plot functions

sns.set_palette('colorblind') # Making the plots colorblind-friendly
sns.set_style('darkgrid') # More info at https://seaborn.pydata.org/tutorial/aesthetics.html

### Reading the data

First, we read the data from the file `faults.csv`. For this we use pandas (`pd`) and read it as a pandas dataframe. In this case all the columns we want to read have been explicitly named. This isn't always necessary, but it helps to know exactly what is being read and what the dataframe contains.

After the dataframe has been read, we use the `describe()` function to get a summary of what is contained and what sorts of values we can find in each column.

It is very important to check that no identifying column is present. For example, if you want to teach a model to rank materials based on a certain property, and you start from an ordered dataset that has the rank encoded as a feature, then the model will probably simply learn to identify the rank and ignore all the "real" information.

In [None]:
df = pd.read_csv('faults.csv',header=0,names=['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas',
       'X_Perimeter', 'Y_Perimeter', 'Sum_of_Luminosity',
       'Minimum_of_Luminosity', 'Maximum_of_Luminosity', 'Length_of_Conveyer',
       'TypeOfSteel_A300', 'TypeOfSteel_A400', 'Steel_Plate_Thickness',
       'Edges_Index', 'Empty_Index', 'Square_Index', 'Outside_X_Index',
       'Edges_X_Index', 'Edges_Y_Index', 'Outside_Global_Index', 'LogOfAreas',
       'Log_X_Index', 'Log_Y_Index', 'Orientation_Index', 'Luminosity_Index',
       'SigmoidOfAreas', 'Pastry', 'Z_Scratch', 'K_Scratch', 'Stains',
       'Dirtiness', 'Bumps', 'Other_Faults'])
df.describe()
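
The check for identifying columns can be sketched roughly as follows. This is only a heuristic on a made-up dataframe: the columns `Sample_ID` and `Rank` are hypothetical, and continuous measurements can also be unique for every row, so any flagged column still needs a manual look.

```python
import pandas as pd

# Made-up toy data with two hypothetical identifier-like columns
toy = pd.DataFrame({
    "Sample_ID": [101, 102, 103, 104, 105],  # hypothetical identifier
    "Thickness": [40, 80, 40, 100, 80],
    "Rank": [1, 2, 3, 4, 5],                 # hypothetical leaked ordering
})

n_rows = len(toy)

# Flag columns that have a distinct value in every row AND increase steadily:
# these behave like row identifiers rather than measurements.
suspicious = [
    col for col in toy.columns
    if toy[col].nunique() == n_rows and toy[col].is_monotonic_increasing
]
print(suspicious)
```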

To get a sense for what the actual data looks like, we can use the `head()` function, which shows us the first 5 rows of the dataframe.

In [None]:
df.head()

We are dealing with tabular data, data stored in a table with columns and rows.

A row is a specific data point and can be addressed with `.iloc` in much the same way as a Python list.

In [None]:
df.iloc[42]

In [None]:
df.iloc[20:25].T # The .T attribute transposes the dataframe, exchanging rows and columns.
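
To make the comparison with Python lists concrete, here is a small sketch on a made-up dataframe (the values and the letter index are assumptions for illustration). It also shows `.loc`, which selects by index label instead of by position.

```python
import pandas as pd

# Made-up toy data with a non-numeric index to separate position from label
toy = pd.DataFrame({"Thickness": [40, 80, 100, 120]}, index=["a", "b", "c", "d"])

first = toy.iloc[0]      # first row, like my_list[0]
last = toy.iloc[-1]      # negative positions count from the end, like my_list[-1]
middle = toy.iloc[1:3]   # slice by position, end excluded, like my_list[1:3]

by_label = toy.loc["b":"c"]  # slice by label, both ends included

print(middle)
print(by_label)
```

Both slices select the rows labeled `b` and `c`, but they get there in different ways: `.iloc` counts positions, `.loc` looks up labels.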

A column is what we call a __feature__. These are typically labeled with a name and can be called as such.

In [None]:
df["X_Maximum"]

We can select multiple columns at once by passing a list.

In [None]:
df[["X_Maximum", "X_Minimum"]] # Note the brackets!

And perform arithmetic on them.

In [None]:
df["X_Maximum"] - df["X_Minimum"]

We can even store this as a new feature. This statement creates a new feature column `X_Size` and stores the result of the operation inside.

In [None]:
df["X_Size"] = df["X_Maximum"] - df["X_Minimum"]

Here, it's also useful to consider what information is being encoded in these features. We are looking at faults on steel plates. The `X_Minimum`, `X_Maximum`, `Y_Minimum` and `Y_Maximum` features tell us the boundaries of where the fault starts and ends. But it might be more useful to store the location and size instead.

### Exercise 1:

Create three new features called `X_Center`, `Y_Center` and `Y_Size` which store the coordinates of the center of the defect and its size. Be aware that these feature names are case sensitive!

We will use the `pandas.DataFrame.drop` method to remove the four original columns `X_Minimum`, `X_Maximum`, `Y_Minimum` and `Y_Maximum` from the dataframe, as we now have a more intuitive way of storing the same information.

In [None]:
df = df.drop(columns=["X_Minimum", "X_Maximum", "Y_Minimum", "Y_Maximum"])
# We could keep the original features, but if we want to be able to interpret our models,
# it helps to not have too many features that encode the same information.

Let's save the dataset for use in the next notebook. `pandas.DataFrame.to_csv` writes a standard CSV file, readable by Excel and other spreadsheet software, but when reading the CSV again pandas has to parse the text and infer how to store each column. The `pandas.DataFrame.to_feather` method stores the data in a binary format that needs no such parsing and is thus much faster to read. If you want to add more features, do so before saving the file, so that they are available when you read it in the next notebook!

In [None]:
df.to_feather("vsc-ai4mi-case1-eda.feather")
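
To see what "interpreting how to store the data" means for CSV, the sketch below round-trips a made-up dataframe through an in-memory CSV buffer (no file is written, and the columns and values are assumptions for illustration). The categorical column does not come back with the same type, because CSV stores only text. The Feather side is not shown here, since it needs `pyarrow` to run.

```python
import io

import pandas as pd

# Made-up toy data with a categorical column
toy = pd.DataFrame({
    "Defect_Type": pd.Categorical(["Bumps", "Stains", "Bumps"]),
    "Thickness": [40, 80, 40],
})

# Write to CSV and read it back, using an in-memory buffer instead of a file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# The values survive, but the column type has to be re-inferred from text
print(toy["Defect_Type"].dtype)
print(restored["Defect_Type"].dtype)
```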

***

## Plotting
Plots can be more easily interpreted than bare numbers, so it can be useful to plot an overview of certain things. For example, using a pie chart to see the distribution of classes, or a scatterplot to see how two features might relate.

Let us see what a plot looks like between the orientation index and square index.

In [None]:
df.plot.scatter(x='Orientation_Index',y='Square_Index')

In this case, it's clear that the Orientation Index encodes information about
how elongated the defect is, and whether its long side is in the X- or Y-direction.
The Square Index encodes how close the defect is to a square: strongly elongated defects
sit at an Orientation Index of -1 or +1, completely square ones at 0.

Note that the linear (Pearson) correlation between these features is close to 0.
Taken at face value, this would imply they're not related at all, while in fact the
positive correlation in the first half cancels the negative correlation in the second half.
This shows that it pays not to rely blindly on summary statistics:
actually looking at the data is useful.
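
The cancellation can be reproduced on synthetic data. The sketch below assumes an idealized tent-shaped relationship, `y = 1 - |x|`, which only mimics the shape seen in the scatter plot and is not taken from the dataset itself.

```python
import numpy as np

# Synthetic, idealized data: a symmetric tent shape on [-1, 1]
x = np.linspace(-1.0, 1.0, 201)
y = 1.0 - np.abs(x)

# Pearson correlation over the full range, and over each half separately
r_all = np.corrcoef(x, y)[0, 1]
r_left = np.corrcoef(x[x <= 0], y[x <= 0])[0, 1]
r_right = np.corrcoef(x[x >= 0], y[x >= 0])[0, 1]

print(f"full range: {r_all:.3f}")
print(f"left half:  {r_left:.3f}")
print(f"right half: {r_right:.3f}")
```

The full-range value is essentially zero, while each half on its own is perfectly correlated with opposite sign. On the real data, the same comparison can be made with `df['Orientation_Index'].corr(df['Square_Index'])`.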

### One-Hot encoding

This dataset uses so-called one-hot encoding, meaning that each target class has a separate column. If a sample belongs to a class, the corresponding column holds a 1 and all other class columns hold a 0. An alternative encoding is a single column that gives the class of defect as a number from 0 to 6.

In [None]:
df[['Pastry', 'Z_Scratch', 'K_Scratch', 'Stains',
       'Dirtiness', 'Bumps', 'Other_Faults']].iloc[[5, 300, 500, 750, 850, 1200, 1800]]

In [None]:
# It's useful to have a combined Defect_Type feature
target = ['Pastry', 'Z_Scratch', 'K_Scratch', 'Stains',
       'Dirtiness', 'Bumps', 'Other_Faults']

df['Defect_Type'] = 0 # Create a new feature, fill it with 0
for t in target:
    df.loc[df[t] == 1, 'Defect_Type'] = target.index(t) # Change one-hot encoding to a combined class
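
As an alternative to the loop, the same conversion can be sketched with `idxmax`, which returns for each row the name of the column holding the largest value. The example uses a made-up one-hot table with only three of the classes, and it first checks that every row has exactly one class set, since that is what this approach assumes.

```python
import pandas as pd

# Made-up one-hot table with three classes
classes = ["Pastry", "Z_Scratch", "K_Scratch"]
one_hot = pd.DataFrame(
    [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]],
    columns=classes,
)

# Sanity check: exactly one class is set in every row
assert (one_hot.sum(axis=1) == 1).all()

# Name of the active class per row
names = one_hot.idxmax(axis=1)

# Integer code per row, following the order of `classes`
codes = pd.Categorical(names, categories=classes).codes

print(list(names))
print(list(codes))
```

Keeping the class names alongside the integer codes makes plots and tables easier to read later on.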

Let's look at the distribution of these classes with a pie chart.

In [None]:
cts = df['Defect_Type'].value_counts().sort_index()

plt.figure(dpi=120)
_ = plt.pie(cts, labels=target, autopct='%1.0f%%')

Another option is a histogram.

In [None]:
plt.figure(dpi=100)
_ = df['Defect_Type'].hist()

What do you think is the impact of the class distribution shown in the pie chart on how well a model will learn to recognize certain classes?

### Selecting parts of the dataframe

It's also possible to give logical statements as a selector to grab specific parts of the dataframe.

As an example, we can select all rows for which the type of steel is A300.

In [None]:
# This logical statement outputs a Series of True and False values, one per row.

df["TypeOfSteel_A300"] == 1

In [None]:
# Putting this in the dataframe itself yields a selection.

df[ df["TypeOfSteel_A300"] == 1 ]

We can also combine the logical row selection with a selection of feature columns, for example, to get an overview of the position of defects on A300 steel.

In [None]:
df[df["TypeOfSteel_A300"] == 1][["X_Center", "Y_Center"]].describe()
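
Conditions can also be combined. The sketch below uses a made-up dataframe that borrows three column names from the dataset (the values are invented for illustration) and selects rows and columns in a single `.loc` call. Note that pandas uses `&`, `|` and `~` rather than `and`, `or` and `not`.

```python
import pandas as pd

# Made-up values; only the column names are borrowed from the dataset
toy = pd.DataFrame({
    "TypeOfSteel_A300": [1, 0, 1, 0, 1],
    "Steel_Plate_Thickness": [40, 100, 80, 200, 40],
    "Pixels_Areas": [120, 3000, 90, 450, 75],
})

is_a300 = toy["TypeOfSteel_A300"] == 1
is_thin = toy["Steel_Plate_Thickness"] < 60

# Rows where both conditions hold, restricted to two columns
selection = toy.loc[is_a300 & is_thin, ["Steel_Plate_Thickness", "Pixels_Areas"]]

# ~ inverts a condition: all rows that are NOT A300
not_a300 = toy[~is_a300]

print(selection)
print(not_a300)
```

Storing each condition in a named variable keeps longer selections readable.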

### Exercise 2:

What can you say about the types of steel in the dataset? Take a look at the properties of A400 steel as well. In what properties do they differ?

### Exercise 3:

Do all defects have the same size and shape? Which are typically the largest? Which are typically the smallest?