⚠️ Note: wafers_train.csv is too large for GitHub  
This file must be manually downloaded and placed in the `/data` folder.  
You can request it or replace it with a smaller sample.

---

# 🔍 Scratch Detection on Semiconductor Wafers

This project focuses on detecting scratches on wafer maps — identifying both faulty dies and visually good dies that are part of a physical scratch. This is important in semiconductor manufacturing to avoid using low-quality dies that may have been physically damaged.

---

## 🧠 Project Description

In the semiconductor industry, **"wafers"** are thin discs of semiconductor material (like silicon) used to fabricate microelectronic devices such as transistors and integrated circuits. A wafer can contain hundreds or thousands of **"dies"**, which are later diced from the wafer for use in products.

Scratches on wafers appear as **elongated clusters** of faulty dies and may also include some visually “good” dies located along a physical scratch. These may be misclassified unless explicitly detected. The goal of this project is to train a model that can predict whether a given die is part of a scratch, regardless of whether it failed electrically or not.

<figure>
  <img src="assets/wafer.jpeg" width="300">
  <figcaption>Fig.1 - A semiconductor wafer</figcaption>
</figure>

Today, scratches are tagged manually, often by visual inspection. This project aims to automate that detection by analyzing the **wafer map**, which records the position and status of each die.

<figure>
  <img src="assets/wafer_map.png" width="300">
  <figcaption>Fig.2 - Logical wafer map</figcaption>
</figure>

Some good dies may appear along a scratch path and should also be flagged (“inked”) to avoid using potentially damaged units.

---

## 📊 Dataset Overview

Each row in the training data represents a single die, with:

- `WaferName`: Wafer identifier  
- `DieX`, `DieY`: Position on the wafer  
- `IsGoodDie`: Whether the die passed electrical testing  
- `IsScratchDie`: Whether the die is part of a scratch (our target)

The test set has the same structure but **does not include** the `IsScratchDie` label — your model should generate predictions for it.
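
To make the schema concrete, here is a minimal sketch that builds a toy table with the same five columns and pivots one wafer into a 2D map. The wafer name `W_demo` and every value in it are invented for illustration; none of it comes from the real dataset.

```python
import pandas as pd

# Hypothetical toy data using the column names listed above.
toy = pd.DataFrame({
    "WaferName": ["W_demo"] * 6,
    "DieX": [0, 1, 2, 0, 1, 2],
    "DieY": [0, 0, 0, 1, 1, 1],
    "IsGoodDie": [True, False, True, True, False, True],
    "IsScratchDie": [False, True, False, False, True, False],
})

# Pivot into a grid: rows are DieY, columns are DieX.
wafer_map = toy.pivot(index="DieY", columns="DieX", values="IsGoodDie")

# Yield is the fraction of dies that passed electrical testing.
wafer_yield = toy["IsGoodDie"].mean()
n_scratch = int(toy["IsScratchDie"].sum())

print(wafer_map)
print(f"yield={wafer_yield:.2f}, scratch dies={n_scratch}")
```

In the test set the `IsScratchDie` column would simply be absent from such a table.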

---

## 🎯 Project Goals

- **Predict scratches** using both bad and good dies  
- **Engineer spatial features** such as neighbor defect density and distance from center  
- **Handle imbalanced data** (scratches are rare) using techniques like SMOTE  
- **Optimize model performance** based on precision, recall, and F1 score
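
Because scratch dies are rare, plain accuracy can look high even when most scratches are missed, which is why the goals above name precision, recall, and F1. The sketch below computes the three metrics from confusion-matrix counts; the helper name `precision_recall_f1` and the counts are invented for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive (scratch) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Invented counts: 80 scratch dies found, 20 false alarms, 40 scratch dies missed.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Low precision means good dies are inked unnecessarily; low recall means scratch-damaged dies slip through.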

---

## 📈 Business Relevance

- **Automation**: Reduces cost and error in manual scratch tagging  
- **Quality Control**: Avoids sending risky dies to production  
- **Yield Optimization**: Balances the yield lost to inking good dies against the risk of shipping scratch-damaged dies

---

## 🧪 Technologies Used

- Python (pandas, NumPy, scikit-learn)
- Random Forest classifier
- SMOTE for class balancing
- Jupyter Notebook

---

## 🚀 Running the Project

1. Clone the repo  
2. Download `wafers_train.csv` and place it manually in `/data/`  
3. Install dependencies:  
   ```bash
   pip install -r requirements.txt
   ```


# Start work

In [None]:
import pandas as pd
import zipfile
from datetime import datetime


### Load Data

In [None]:
#load zip file
zf = zipfile.ZipFile('data.zip') 

In [None]:
#load train data
df_wafers = pd.read_csv(zf.open('wafers_train.csv'))
df_wafers.head()

In [None]:
#load test data
df_wafers_test = pd.read_csv(zf.open('wafers_test.csv'))
df_wafers_test.head()

You can draw the wafer maps to see what the wafers in the data look like.

The following helper function draws the wafer maps with or without labels:

In [None]:
def plot_wafer_maps(wafer_df_list, figsize, labels = True):
    """
    Plot wafer maps for a list of wafer DataFrames.

    :param wafer_df_list: list, one DataFrame per wafer
    :param figsize: int, height and width of each subplot, in inches
    :param labels: bool, whether to also show the label layer (based on column 'IsScratchDie')
    
    :return: None
    """
    def plot_wafer_map(wafer_df, ax, map_type):
        wafer_size = len(wafer_df)
        s = 2**17/(wafer_size)
        if map_type == 'Label':
            mes = 'Scratch Wafer' if (wafer_df['IsScratchDie'] == True).sum()>0 else 'Non-Scratch Wafer'
        else:
            mes = 'Yield: ' + str(round((wafer_df['IsGoodDie']).sum()/(wafer_df['IsGoodDie']).count(), 2)) 
        
        ax.set_title(f'{map_type} | Wafer Name: {wafer_df["WaferName"].iloc[0]}, \nSum: {len(wafer_df)} dies. {mes}', fontsize=20)
        ax.scatter(wafer_df['DieX'], wafer_df['DieY'], color = 'green', marker='s', s = s)

        bad_bins = wafer_df.loc[wafer_df['IsGoodDie'] == False]
        ax.scatter(bad_bins['DieX'], bad_bins['DieY'], color = 'red', marker='s', s = s)
        
        if map_type == 'Label':
            scratch_bins = wafer_df.loc[(wafer_df['IsScratchDie'] == True) & (wafer_df['IsGoodDie'] == False)]
            ax.scatter(scratch_bins['DieX'], scratch_bins['DieY'], color = 'blue', marker='s', s = s)

            ink_bins = wafer_df.loc[(wafer_df['IsScratchDie'] == True) & (wafer_df['IsGoodDie'] == True)]
            ax.scatter(ink_bins['DieX'], ink_bins['DieY'], color = 'yellow', marker='s', s = s)

            ax.legend(['Good Die', 'Bad Die', 'Scratch Die', 'Ink Die'], fontsize=8)
        else:
            ax.legend(['Good Die', 'Bad Die'], fontsize=8)

        ax.axes.get_xaxis().set_visible(False)
        ax.axes.get_yaxis().set_visible(False) 
    
    import matplotlib.pyplot as plt

    # squeeze=False keeps `ax` two-dimensional even when a single wafer is passed
    if labels:
        fig, ax = plt.subplots(2, len(wafer_df_list), figsize=(figsize*len(wafer_df_list), figsize*2), squeeze=False)
        for idx1, wafer_df in enumerate(wafer_df_list):
            for idx2, map_type in enumerate(['Input', 'Label']):
                plot_wafer_map(wafer_df, ax[idx2][idx1], map_type)
    else:
        fig, ax = plt.subplots(1, len(wafer_df_list), figsize=(figsize*len(wafer_df_list), figsize), squeeze=False)
        for idx, wafer_df in enumerate(wafer_df_list):
            plot_wafer_map(wafer_df, ax[0][idx], 'Input')

    plt.show()

Select the number of samples you want to display:

In [None]:
n_samples = 4
list_sample_train = [df_wafers.groupby('WaferName').get_group(group) for group in df_wafers['WaferName'].value_counts().sample(n_samples, random_state=20).index]
plot_wafer_maps(list_sample_train, figsize = 8, labels = True)

In [None]:
list_sample_test = [df_wafers_test.groupby('WaferName').get_group(group) for group in df_wafers_test['WaferName'].value_counts().sample(n_samples, random_state=20).index]
plot_wafer_maps(list_sample_test, figsize = 8, labels = False)

# Build my solution

In [None]:
import numpy as np

def extract_features(wafer_df):
    features_list = []
    # separate wafer
    for wafer_name, wafer_data in wafer_df.groupby('WaferName'):
        print(f"Processing wafer {wafer_name}...")
        max_x = wafer_data['DieX'].max()
        max_y = wafer_data['DieY'].max()
        
        # -1 no die, 0 good die, 1 bad die
        # (assumes DieX and DieY are non-negative integers)
        grid = -np.ones((max_x + 1, max_y + 1))
        # fill grid
        for _, row in wafer_data.iterrows():
            x = int(row['DieX'])
            y = int(row['DieY'])
            grid[x, y] = 0 if row['IsGoodDie'] else 1
        
        for _, die in wafer_data.iterrows():
            x = int(die['DieX'])
            y = int(die['DieY'])
            die_features = {
                'WaferName': wafer_name,
                'DieX': die['DieX'],
                'DieY': die['DieY'],
                'IsGoodDie': die['IsGoodDie'],
                'wafer_yield': wafer_data['IsGoodDie'].sum() / len(wafer_data),
            }
            
            # the label is only present in the training data
            if 'IsScratchDie' in die:
                die_features['IsScratchDie'] = die['IsScratchDie']
            
            die_features['normalized_x'] = x / max_x if max_x > 0 else 0
            die_features['normalized_y'] = y / max_y if max_y > 0 else 0
            die_features['distance_to_center_x'] = abs(x - max_x / 2)  / (max_x / 2) if max_x > 0 else 0
            die_features['distance_to_center_y'] = abs(y - max_y / 2)  / (max_y / 2) if max_y > 0 else 0
            
            # line-like patterns
            for radius in range(1, 4):
                h_bad = 0   # horizontal line
                V_bad = 0   # vertical line
                d1_bad = 0  # diagonal top left to bottom right
                d2_bad = 0  # diagonal top right to bottom left
                h_total = 0
                V_total = 0
                d1_total = 0
                d2_total = 0

                # checking around the die
                for delta in range(1, radius+1):
                    # H line
                    for i_x in [x-delta, x+delta]:
                        if i_x >= 0 and i_x <= max_x and y >= 0 and y <= max_y and grid[i_x, y] != -1:
                            h_total += 1
                            if grid[i_x, y] == 1:
                                h_bad += 1
                    # V line
                    for i_y in [y-delta, y+delta]:
                        if x >= 0 and x <= max_x and i_y >= 0 and i_y <= max_y and grid[x, i_y] != -1:
                            V_total += 1
                            if grid[x, i_y] == 1:
                                V_bad += 1
                                
                    # diagonal line 1
                    for i_x, i_y in [(x-delta, y-delta), (x+delta, y+delta)]:
                        if i_x >= 0 and i_x <= max_x and i_y >= 0 and i_y <= max_y and grid[i_x, i_y] != -1:
                            d1_total += 1
                            if grid[i_x, i_y] == 1: #bad die
                                d1_bad += 1
                                
                    # diagonal line 2
                    for i_x, i_y in [(x-delta, y+delta), (x+delta, y-delta)]:
                        if i_x >= 0 and i_x <= max_x and i_y >= 0 and i_y <= max_y and grid[i_x, i_y] != -1:
                            d2_total += 1
                            if grid[i_x, i_y] == 1:
                                d2_bad += 1
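                    # calculating the ratio of bad dies in the radius, total bad neighbors and total neighbors
                    # (recomputed on every delta; the values from the last delta, which cover the full radius, are the ones kept)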
                    die_features[f'(h_bad)_ratio_radius{radius}'] = h_bad / h_total if h_total > 0 else 0
                    die_features[f'(V_bad)_ratio_radius{radius}'] = V_bad / V_total if V_total > 0 else 0
                    die_features[f'(d1_bad)_ratio_radius{radius}'] = d1_bad / d1_total if d1_total > 0 else 0
                    die_features[f'(d2_bad)_ratio_radius{radius}'] = d2_bad / d2_total if d2_total > 0 else 0
                    
                    die_features[f'total_bad_neighbors_radius{radius}'] = h_bad + V_bad + d1_bad + d2_bad
                    die_features[f'total_neighbors_radius{radius}'] = h_total + V_total + d1_total + d2_total
                    
                    die_features[f'overall_bad_ratio_radius{radius}'] = (die_features[f'total_bad_neighbors_radius{radius}'] / die_features[f'total_neighbors_radius{radius}']) if die_features[f'total_neighbors_radius{radius}'] > 0 else 0
                    
                    # looking for the maximum bad ratio in the radius
                    die_features[f'max_bad_ratio_radius{radius}'] = max(die_features[f'(h_bad)_ratio_radius{radius}'],
                                                                        die_features[f'(V_bad)_ratio_radius{radius}'],
                                                                        die_features[f'(d1_bad)_ratio_radius{radius}'],
                                                                        die_features[f'(d2_bad)_ratio_radius{radius}'])
                    
                    
                    max_ratio = die_features[f'max_bad_ratio_radius{radius}']
                    other_ratios = [die_features[f'(h_bad)_ratio_radius{radius}'], die_features[f'(V_bad)_ratio_radius{radius}'], die_features[f'(d1_bad)_ratio_radius{radius}'], die_features[f'(d2_bad)_ratio_radius{radius}']]
                    other_ratios.remove(max_ratio)
                    avg_ratio = sum(other_ratios) / len(other_ratios) if len(other_ratios) > 0 else 0
                    die_features[f'direction_contrast_radius{radius}'] = max_ratio - avg_ratio
                    
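            # adding the features to the list, once per die
            # (the append must sit inside the die loop; otherwise only the last die of each wafer is kept)
            features_list.append(die_features)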
    return pd.DataFrame(features_list) 
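
The nested loops above visit every die and every neighbor in Python, which can be slow on large wafers. As a possible alternative, the sketch below computes one directional bad-die ratio for a whole grid at once by slicing a padded copy of it. It assumes the same grid encoding as `extract_features` (-1 no die, 0 good die, 1 bad die); the helper name `directional_bad_ratio` and the toy grid are hypothetical and not part of the notebook.

```python
import numpy as np

def directional_bad_ratio(grid, step, radius):
    """Bad-die ratio among existing neighbors along +/- step, up to radius cells away.

    grid uses -1 for "no die", 0 for a good die and 1 for a bad die.
    step is (dx, dy), e.g. (1, 0) for the x direction or (1, 1) for a diagonal.
    """
    dx, dy = step
    nx, ny = grid.shape
    # Pad with -1 so that positions outside the wafer count as "no die".
    padded = np.pad(grid, radius, constant_values=-1)
    bad = np.zeros(grid.shape)
    total = np.zeros(grid.shape)
    for delta in range(1, radius + 1):
        for sign in (-1, 1):
            ox = radius + sign * delta * dx
            oy = radius + sign * delta * dy
            # neigh[x, y] is the cell at (x + sign*delta*dx, y + sign*delta*dy)
            neigh = padded[ox:ox + nx, oy:oy + ny]
            total += neigh != -1
            bad += neigh == 1
    # Cells without any existing neighbor in this direction get a ratio of 0.
    return np.divide(bad, total, out=np.zeros_like(bad), where=total > 0)

# Toy wafer: every die with y == 1 is bad, forming a line along x.
toy_grid = np.array([
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
])
along_x = directional_bad_ratio(toy_grid, step=(1, 0), radius=1)
along_y = directional_bad_ratio(toy_grid, step=(0, 1), radius=1)
print(along_x[1, 1], along_y[1, 1])
```

The result has one value per grid cell, including cells with no die, so a caller would still need to pick out the entries at real die positions.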


          
              

                                
               

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay
from imblearn.over_sampling import SMOTE

# Sample 100,000 dies
df_sample = df_wafers.sample(n=100000, random_state=42)
features_df = extract_features(df_sample)

# Split features and target
X = features_df.drop(columns=['WaferName', 'IsScratchDie'])
y = features_df['IsScratchDie'].astype(int)

print(f"The original class distribution:\n{y.value_counts()}")

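# Train validation split 80-20, done before resampling so that synthetic
# samples derived from validation dies cannot leak into the training set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# - SMOTE - applied to the training split only
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
print(f"After SMOTE class distribution (training split):\n{pd.Series(y_train).value_counts()}")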

# Train the model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Evaluate the model
y_pred = clf.predict(X_val)

# Reports 
print("\nConfusion Matrix:")
print(confusion_matrix(y_val, y_pred))
print("\nClassification Report:")
print(classification_report(y_val, y_pred, zero_division=0))

# Plot
ConfusionMatrixDisplay.from_estimator(clf, X_val, y_val)
plt.title("Confusion Matrix - Scratch Detection (SMOTE, 100K Sample)")
plt.show()
# Dataset summary: shapes of the training and test sets
print("Training data shape:", df_wafers.shape)
print("Test data shape:", df_wafers_test.shape)
# Class distribution
print("\nScratch dies distribution:")
print(df_wafers['IsScratchDie'].value_counts())
print(df_wafers['IsScratchDie'].value_counts(normalize=True)) # Normalized
# Group by wafer to get overall structure
wafer_counts = df_wafers.groupby('WaferName').size()
print("\nNumber of dies per wafer:")
print(wafer_counts.describe())
# How many wafers have scratches
scratch_wafers = df_wafers[df_wafers['IsScratchDie'] == True]['WaferName'].unique()
print(f"\nWafers with scratches: {len(scratch_wafers)} out of {df_wafers['WaferName'].nunique()}")
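
The training cell above draws 100,000 individual dies with `df_wafers.sample`. Since the neighborhood features are built per wafer, a die whose neighbors were not sampled looks as if it sits next to empty positions. One alternative, sketched here as an assumption rather than the notebook's method, is to sample whole wafers so that every selected wafer keeps all of its dies. The helper name `sample_whole_wafers` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def sample_whole_wafers(df, n_wafers, seed=42):
    """Return all rows belonging to a random subset of wafers."""
    names = df["WaferName"].unique()
    rng = np.random.default_rng(seed)
    chosen = rng.choice(names, size=min(n_wafers, len(names)), replace=False)
    return df[df["WaferName"].isin(chosen)]

# Toy data: three wafers with four dies each.
toy = pd.DataFrame({
    "WaferName": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "DieX": list(range(4)) * 3,
    "DieY": [0] * 12,
})
subset = sample_whole_wafers(toy, n_wafers=2)
print(subset.groupby("WaferName").size())
```

With this approach the sample size is controlled in wafers rather than dies, so the number of rows varies with wafer size.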

# Additional thoughts

Feature engineering played an important role in improving the model. Using spatial patterns like bad-die ratios in different directions and normalized position helped detect scratches more accurately than relying on raw coordinates.
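
As a small worked example of why the directional ratios help, the sketch below reproduces the `direction_contrast` calculation from `extract_features` on two invented sets of ratios. The standalone function is only an illustration of that calculation.

```python
def direction_contrast(ratios):
    """Max directional bad-die ratio minus the mean of the remaining ratios."""
    max_ratio = max(ratios)
    others = list(ratios)
    others.remove(max_ratio)  # drops only the first occurrence of the maximum
    return max_ratio - sum(others) / len(others)

# Ratios are ordered as (horizontal, vertical, diagonal 1, diagonal 2).
line_like = direction_contrast([1.0, 0.0, 0.0, 0.0])  # bad dies along one direction
blob_like = direction_contrast([0.5, 0.5, 0.5, 0.5])  # bad dies spread in every direction
print(line_like, blob_like)
```

A high contrast indicates bad dies concentrated along one direction, which matches the elongated shape of a scratch; a low contrast is more typical of a round cluster or random failures.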

Class imbalance was a major challenge, as scratched dies were rare. Applying SMOTE allowed the model to see more varied examples of scratches and learn better decision boundaries.
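
SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors. The sketch below shows that single interpolation step with NumPy. It is a simplified illustration, not the `imblearn` implementation, and the helper name `smote_like_sample` and the sample points are invented.

```python
import numpy as np

def smote_like_sample(minority, idx, k=2, rng=None):
    """Create one synthetic point between minority[idx] and one of its k nearest neighbors."""
    if rng is None:
        rng = np.random.default_rng(0)
    base = minority[idx]
    dists = np.linalg.norm(minority - base, axis=1)
    # The closest point is the base itself (distance 0, assuming no duplicates), so skip it.
    neighbors = np.argsort(dists)[1:k + 1]
    neighbor = minority[rng.choice(neighbors)]
    gap = rng.random()  # uniform in [0, 1)
    return base + gap * (neighbor - base)

# Invented minority-class points in a 2D feature space.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
synthetic = smote_like_sample(minority, idx=0, rng=np.random.default_rng(0))
print(synthetic)
```

Because each synthetic point lies close to real ones, resampling is usually applied only to the training split, so that validation scores are measured on real dies.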

In a production environment, it might be useful to add features related to the depth or severity of scratches. This could help the model distinguish between critical defects and harmless surface marks.

For long-term use, it’s important to monitor model performance regularly and retrain when the wafer manufacturing process changes. This helps prevent model drift and maintains accuracy.

In [None]:
# Reload the test set and build the same features used for training
df_wafers_test = pd.read_csv('data/wafers_test.csv')
test_features = extract_features(df_wafers_test)
X_test = test_features.drop(columns=['WaferName'])

# Predict with the trained Random Forest
model = clf
IsScratchDie_preds = model.predict(X_test)

# Attach the predictions to the die coordinates
df_preds = test_features[['WaferName', 'DieX', 'DieY']].copy()
df_preds['IsScratchDie'] = IsScratchDie_preds
df_wafers_test = df_wafers_test.merge(
    df_preds,
    on=['WaferName', 'DieX', 'DieY'],
    how='left'
)
df_wafers_test['IsScratchDie'] = df_wafers_test['IsScratchDie'].fillna(0).astype(int)
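
The `fillna(0)` in the cell above labels any die that did not receive a prediction as non-scratch without reporting it. A hedged sketch of how such gaps could be surfaced before filling them, using the `validate` and `indicator` arguments of `DataFrame.merge`; the toy frames `dies` and `preds` are invented for illustration.

```python
import pandas as pd

# Toy test set with three dies, and predictions for only two of them.
dies = pd.DataFrame({
    "WaferName": ["W1", "W1", "W1"],
    "DieX": [0, 1, 2],
    "DieY": [0, 0, 0],
})
preds = pd.DataFrame({
    "WaferName": ["W1", "W1"],
    "DieX": [0, 1],
    "DieY": [0, 0],
    "IsScratchDie": [0, 1],
})

merged = dies.merge(
    preds,
    on=["WaferName", "DieX", "DieY"],
    how="left",
    validate="one_to_one",  # raises MergeError if a key is duplicated on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)
missing = merged[merged["_merge"] == "left_only"]
print(f"{len(missing)} of {len(merged)} dies have no prediction")
```

If the count is zero, the `fillna(0)` step has nothing to fill; a non-zero count suggests that feature extraction dropped dies.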

