# Wafer defect classification in silicon manufacturing

The data comes from a Kaggle dataset (read below from `uci-secom.csv`).

Overall goals:
* Explore the dataset to get a feel for what's in it and spot any quick wins
    * Drill down via pivot tables and exploratory graphs
* Train a supervised machine learning model to predict whether a wafer will be defective
* Apply metaheuristic optimization to suggest optimal operating parameters

In [14]:
# imports
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
%matplotlib inline

data_path = '../../data/uci-secom.csv'

In [18]:
# helper borrowed from fast.ai: show a frame without pandas truncating rows or columns
from IPython.display import display

def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 100):
        display(df)

### Exploratory Data Analysis

In [19]:
# read in data to work with
df = pd.read_csv(data_path)
display_all(df.describe().T)

| feature | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 1561.0 | 3014.452896 | 73.621787 | 2743.24 | 2966.26 | 3011.49 | 3056.65 | 3356.35 |
| 1 | 1560.0 | 2495.850231 | 80.407705 | 2158.75 | 2452.2475 | 2499.405 | 2538.8225 | 2846.44 |
| 2 | 1553.0 | 2200.547318 | 29.513152 | 2060.66 | 2181.0444 | 2201.0667 | 2218.0555 | 2315.2667 |
| 3 | 1553.0 | 1396.376627 | 441.69164 | 0.0 | 1081.8758 | 1285.2144 | 1591.2235 | 3715.0417 |
| 4 | 1553.0 | 4.197013 | 56.35554 | 0.6815 | 1.0177 | 1.3168 | 1.5257 | 1114.5366 |
| 5 | 1553.0 | 100.0 | 0.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| 6 | 1553.0 | 101.112908 | 6.237214 | 82.1311 | 97.92 | 101.5122 | 104.5867 | 129.2522 |
| 7 | 1558.0 | 0.121822 | 0.008961 | 0.0 | 0.1211 | 0.1224 | 0.1238 | 0.1286 |
| 8 | 1565.0 | 1.462862 | 0.073897 | 1.191 | 1.4112 | 1.4616 | 1.5169 | 1.6564 |
| 9 | 1565.0 | -0.000841 | 0.015116 | -0.0534 | -0.0108 | -0.0013 | 0.0084 | 0.0749 |


In [23]:
df.shape

(1567, 592)
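
Comparing this row count with the `count` column in the summary above (for example, 1553 for feature 2) suggests that some sensor columns contain missing values. A minimal sketch for quantifying that, assuming missing readings are loaded as NaN; `missing_summary` is a hypothetical helper name, not part of the original notebook:

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and fractions, worst first."""
    n_missing = frame.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "frac_missing": n_missing / len(frame),
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)
```

On the notebook's frame this would be called as `missing_summary(df)`, for instance wrapped in `display_all` to see every affected column.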

In [26]:
df.columns

Index(['Time', '0', '1', '2', '3', '4', '5', '6', '7', '8',
       ...
       '581', '582', '583', '584', '585', '586', '587', '588', '589',
       'Pass/Fail'],
      dtype='object', length=592)

In [27]:
df.Time.min()

'2008-01-08 02:02:00'

In [28]:
df.Time.max()

'2008-12-10 18:47:00'
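
Both results come back as quoted strings, so `Time` is stored as text rather than as a datetime. String comparison happens to order this ISO-like format correctly, but any time arithmetic needs a real datetime column. A small sketch of the conversion, where `parse_time_column` is a hypothetical helper and coercing bad values to `NaT` is an assumption about how unparseable rows should be handled:

```python
import pandas as pd

def parse_time_column(frame: pd.DataFrame, column: str = "Time") -> pd.DataFrame:
    """Return a copy of `frame` with `column` parsed to datetime64."""
    out = frame.copy()
    # errors="coerce" turns unparseable strings into NaT instead of raising
    out[column] = pd.to_datetime(out[column], errors="coerce")
    return out
```

After `parsed = parse_time_column(df)`, the covered span is `parsed["Time"].max() - parsed["Time"].min()`, and accessors such as `parsed["Time"].dt.hour` become available for time-based features.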

Some initial thoughts from this first look:
* Some columns have zero variance (feature 5 is a constant 100.0), so we will want to drop them
* We should look for highly linearly correlated features
* Most of the data looks numerical rather than categorical
* Column names tell us nothing, so we may want to relabel them to avoid confusing ourselves
* We should look for index-based trends, since these are likely related to time dependencies in the data
* Features are on completely different scales (means in the first ten columns range from about -0.0008 to about 3014), so any non-tree-based method will need scaling
* We should check for outliers, since some features look skewed at first glance (feature 4 has a median of about 1.3 but a max of about 1114.5)
* We only have ~2.7x as many observations as predictor fields (1567 rows vs. 590 sensor columns)
    * This makes dimensionality reduction more urgent
    * It can be done through PCA or other means
* The dates span most of a single year (2008-01-08 to 2008-12-10)
    * Maybe look at typical scheduling routines
        * Time-based features for 8-hour or 12-hour shifts
    * We likely won't have enough data to capture seasonality, which would need multiple years
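
The first two items in the list above (constant columns and highly correlated features) can be sketched with plain pandas. This is only an illustration: the helper names are hypothetical, and the 0.95 cutoff on absolute Pearson correlation is an assumed value, not one chosen from the data.

```python
import numpy as np
import pandas as pd

def drop_zero_variance(frame: pd.DataFrame):
    """Drop numeric columns with at most one distinct non-missing value."""
    numeric = frame.select_dtypes(include="number")
    constant = [c for c in numeric.columns if numeric[c].nunique(dropna=True) <= 1]
    return frame.drop(columns=constant), constant

def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.95):
    """List (col_a, col_b, |r|) for numeric pairs with |Pearson r| > threshold."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN correlations (e.g. from constant columns) fail this comparison and drop out
    pairs = pairs[pairs > threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]
```

A possible call order is `pruned, dropped = drop_zero_variance(df)` followed by `correlated_pairs(pruned.drop(columns=["Pass/Fail"]))`, so that the target column is kept out of the feature-to-feature comparison.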