# Introduction

This notebook demonstrates defect detection on a set of chip wafer maps.

## Data Source

[Qingyi](https://www.kaggle.com/qingyi). (February 2018). WM-811K wafer map, Version 1. Retrieved January 2018 from https://www.kaggle.com/qingyi/wm811k-wafer-map/downloads/wm811k-wafer-map.zip/1.
    
### References

* See also this [kernel](https://www.kaggle.com/ashishpatel26/wm-811k-wafermap) which has graphs showing class distribution.
* This [script](https://github.com/caslabai/wafer-inspection/blob/master/dataset/pkl2tfrecord.py) has additional data loading code.

Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT-0

# Load raw data

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle

In [None]:
dataset = pd.read_pickle('raw-data/LSWMD.pkl')

In [None]:
dataset.info()

# Explore raw data

## Dimensions

Let's look at the range in image dimensions.

In [None]:
def find_dim(x):
    dim0=np.size(x,axis=0)
    dim1=np.size(x,axis=1)
    return dim0,dim1
dataset['waferMapDim']=dataset.waferMap.apply(find_dim)
dataset.sample(5)

In [None]:
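# tuples compare lexicographically, so these are the extremes by row count (ties broken by column count)
max(dataset.waferMapDim), min(dataset.waferMapDim)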

**Conclusion** The dimensions vary widely, so we'll need to resize the images to a common size as a transformation.
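
As an illustration of what resizing to a common size involves, here is a minimal sketch that uses nearest-neighbor sampling, so the output contains only pixel values already present in the input. The helper name `resize_nearest` and the toy array are assumptions made for this example. The notebook itself resizes at load time with `transforms.Resize((32,32))` further down.

```python
import numpy as np

def resize_nearest(wafer_map, out_shape=(32, 32)):
    """Resize a 2-D array with nearest-neighbor sampling (hypothetical helper)."""
    wafer_map = np.asarray(wafer_map)
    rows, cols = wafer_map.shape
    out_rows, out_cols = out_shape
    # For each output row/column, pick the index of the source row/column it maps to.
    row_idx = np.arange(out_rows) * rows // out_rows
    col_idx = np.arange(out_cols) * cols // out_cols
    return wafer_map[np.ix_(row_idx, col_idx)]

# Toy 2x2 "wafer map" (made-up values) upscaled to 4x4.
small = np.array([[0, 1], [2, 1]], dtype=np.uint8)
resized = resize_nearest(small, out_shape=(4, 4))
print(resized.shape)
print(np.unique(resized))
```

Nearest-neighbor sampling is a deliberate choice here: interpolating between categorical pixel values would invent intermediate values that have no meaning on a wafer map.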

## Pixel values

In [None]:
def find_pixel_min_max(x):
    pixel_min=np.min(x)
    pixel_max=np.max(x)
    return pixel_min,pixel_max
dataset['pixelRange']=dataset.waferMap.apply(find_pixel_min_max)
dataset.sample(5)

In [None]:
# tuples compare lexicographically: ordered by minimum pixel value first, then by maximum
max(dataset.pixelRange), min(dataset.pixelRange)

## Class distribution

The graphs in this section are taken from the Kaggle kernel cited above.  They provide a good illustration of the distribution of data in each class.  That will be important in our later analysis.

Reference: https://www.kaggle.com/ashishpatel26/wm-811k-wafermap

In [None]:
dataset['failureNum']=dataset.failureType
dataset['trainTestNum']=dataset.trianTestLabel
mapping_type={'Center':0,'Donut':1,'Edge-Loc':2,'Edge-Ring':3,'Loc':4,'Random':5,'Scratch':6,'Near-full':7,'none':8}
mapping_traintest={'Training':0,'Test':1}
dataset=dataset.replace({'failureNum':mapping_type, 'trainTestNum':mapping_traintest})

In [None]:
df_withlabel = dataset[(dataset['failureNum']>=0) & (dataset['failureNum']<=8)]
df_withlabel =df_withlabel.reset_index()
df_withpattern = dataset[(dataset['failureNum']>=0) & (dataset['failureNum']<=7)]
df_withpattern = df_withpattern.reset_index()
df_nonpattern = dataset[(dataset['failureNum']==8)]
df_withlabel.shape[0], df_withpattern.shape[0], df_nonpattern.shape[0]

In [None]:
from matplotlib import gridspec
tol_wafers = dataset.shape[0]
fig = plt.figure(figsize=(20, 4.5)) 
gs = gridspec.GridSpec(1, 2, width_ratios=[1, 2.5]) 
ax1 = plt.subplot(gs[0])
ax2 = plt.subplot(gs[1])

no_wafers=[tol_wafers-df_withlabel.shape[0], df_withpattern.shape[0], df_nonpattern.shape[0]]

colors = ['silver', 'blue', 'green']
explode = (0.1, 0, 0)  # explode 1st slice
labels = ['no-label','label&pattern','label&non-pattern']
ax1.pie(no_wafers, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=140)

uni_pattern=np.unique(df_withpattern.failureNum, return_counts=True)
labels2 = ['Center','Donut','Edge-Loc','Edge-Ring','Loc','Random','Scratch','Near-full']
# multiply by 100 so the values match the "%" axis label
ax2.bar(uni_pattern[0],uni_pattern[1]/df_withpattern.shape[0]*100, color='blue', align='center', alpha=0.9)
ax2.set_title("failure type frequency")
ax2.set_ylabel("% of pattern wafers")
# pin the ticks to the class indices so each label lines up with its bar
ax2.set_xticks(range(len(labels2)))
ax2.set_xticklabels(labels2)

plt.show()

In [None]:
uni_set=np.unique(df_withlabel.trainTestNum, return_counts=True)
N = len(uni_set[0])
ind = np.arange(N)
width = 0.35       
labels3 = ['Training','Test']
p1 = plt.bar(ind, uni_set[1]/df_withlabel.shape[0]*100, width)

plt.ylabel('% of train/test')
plt.title('split between train/test')
plt.xticks(ind, labels3)
plt.yticks(np.arange(0, 101, 10))

plt.show()

**Conclusion** The classes are imbalanced, so we'll take care to preserve the class ratios across the training, validation, and test sets.

## Visualize images in each class

The graphs in this section are taken from the Kaggle kernel cited above.  They provide a nice visualization of examples from each type of pattern.

Reference: https://www.kaggle.com/ashishpatel26/wm-811k-wafermap

In [None]:
labels2 = ['Center','Donut','Edge-Loc','Edge-Ring','Loc','Random','Scratch','Near-full']

# show the first 10 wafer maps of each failure type, one figure per type
for label in labels2:
    fig, ax = plt.subplots(nrows = 1, ncols = 10, figsize=(18, 12))
    ax = ax.ravel(order='C')
    img = df_withpattern.waferMap[df_withpattern.failureType==label]
    for i in range(10):
        ax[i].imshow(img[img.index[i]])
        ax[i].set_title(df_withpattern.failureType[img.index[i]][0][0], fontsize=10)
        ax[i].set_xlabel(df_withpattern.index[img.index[i]], fontsize=10)
        ax[i].set_xticks([])
        ax[i].set_yticks([])
    plt.tight_layout()
    plt.show()

# Write images to disk

We'll use the standard layout where we have one folder per data set 
(train/valid/test), and inside each of those we have one folder per label.
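
To make the layout concrete, the sketch below builds the path an image would get under this scheme. The helper `image_path` is hypothetical and exists only for illustration. The root folder `vdata` and the class names come from this notebook. This root/class/file arrangement is what `torchvision.datasets.ImageFolder`, used later for the spot check, expects.

```python
from pathlib import PurePosixPath

def image_path(root, split, label, index):
    """Build <root>/<split>/<label>/<index>.png (hypothetical helper)."""
    return PurePosixPath(root) / split / label / f"{index}.png"

print(image_path("vdata", "train", "Center", 0))   # vdata/train/Center/0.png
print(image_path("vdata", "valid", "none", 12))    # vdata/valid/none/12.png
```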

We'll use a stratified split so we maintain the ratio between classes.
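
To show what a stratified split preserves, here is a hand-rolled sketch on made-up labels. It is an illustration only, not scikit-learn's implementation; the function `stratified_split` and the toy 9:1 label list are assumptions for this example. The notebook itself relies on `train_test_split(..., stratify=labels)`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved.

    Hand-rolled illustration only; not scikit-learn's implementation.
    """
    rng = random.Random(seed)
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        # Shuffle within the class, then carve off the same fraction from each class.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy, made-up labels with a 9:1 imbalance.
labels = ["none"] * 90 + ["Scratch"] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.2)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

With these toy labels, the test set receives 18 `none` and 2 `Scratch` samples, the same 9:1 ratio as the full list. A plain random split could easily leave a rare class with no test samples at all.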

In [None]:
import imageio
import math
from pathlib import Path
from sklearn.model_selection import train_test_split
DATA = Path('vdata')
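# assumes pixel values run from 0 to 2, so scaling by 127 spreads them over the 8-bit range (0, 127, 254)
scale_factor = math.floor(255.0 / 2.0)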

images      =   df_withlabel["waferMap"]
labels      =   df_withlabel["failureType"].apply(str)

img_train, img_test, label_train, label_test = train_test_split(images, 
                 labels,
                test_size=0.2,
                stratify=labels)
img_train_v, img_valid, label_train_v, label_valid = train_test_split(img_train, 
                 label_train,
                test_size=0.2,
                stratify=label_train)


In [None]:
plt.imshow(img_test.iloc[0])

In [None]:
def writeImgToDisk(imgdata, labeldata, dset, scale, parent):
    """Write each image as a PNG under parent/dset/<class>/ and report per-class counts."""
    cnt = 0
    pset = parent/dset
    for img, label in zip(imgdata, labeldata):

        # labels were stringified from nested arrays, e.g. "[['Center']]";
        # strip the leading "[['" and the trailing "']]"
        dclass = label[3:-3]
        ipath = pset/dclass

        if not ipath.exists():
            ipath.mkdir(parents=True)
            print("Making " + str(ipath))

        img_scaled = img * scale

        fname = str(cnt) + '.png'
        imageio.imwrite(uri=ipath/fname, im=img_scaled, format='PNG-PIL')
        cnt = cnt + 1

    print("Wrote {0} images".format(cnt))
    # count the files in each class folder as a sanity check
    for child in pset.iterdir():
        if child.is_dir():
            child_cnt = len([x for x in child.iterdir() if x.is_file()])
            print("For class " + child.stem + ", wrote " + str(child_cnt) + " images")
                    

In [None]:
writeImgToDisk(img_train_v, label_train_v, 'train', scale_factor, DATA)

In [None]:
writeImgToDisk(img_test, label_test, 'test', scale_factor, DATA)

In [None]:
writeImgToDisk(img_valid, label_valid, 'valid', scale_factor, DATA)

## Visualize

Let's load up our saved images for a spot check.

In [None]:
from torchvision import datasets
import torch
import torchvision.transforms as transforms
from torch.utils.data.sampler import RandomSampler

# helper function to display an image given as a channels-first array
def imshow(img):
    plt.imshow(np.transpose(img, (1, 2, 0)))  # reorder (C, H, W) to (H, W, C) for matplotlib

In [None]:
# number of subprocesses to use for data loading
num_workers = 0
# how many samples per batch to load
batch_size = 20
# percentage of training set to use as validation
valid_size = 0.2

train_transforms = transforms.Compose([transforms.Resize((32,32)),
                                      transforms.ToTensor()]) 

train_data = datasets.ImageFolder(DATA/'train', transform=train_transforms)
random_sampler = RandomSampler(train_data)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, num_workers=num_workers,
                                          sampler=random_sampler)

In [None]:
# obtain one batch of training images
dataiter = iter(train_loader)
images, labels = next(dataiter)
images = images.numpy() # convert images to numpy for display

# plot the images in the batch, along with the corresponding labels
fig = plt.figure(figsize=(25, 4))
# display 20 images
for idx in np.arange(20):
    ax = fig.add_subplot(2, 20//2, idx+1, xticks=[], yticks=[])
    imshow(images[idx])
    ax.set_title(train_data.classes[labels[idx]])