## Feature Engineering

## Setup

Load the three datasets and identify which columns are sensors, because all later feature extraction uses only those fields.

In [32]:
import pandas as pd
import numpy as np

from pathlib import Path

BASE = Path("EquipmentDataFinal")

eq1 = pd.read_csv(BASE / "equipment1.csv", sep=";")
eq2 = pd.read_csv(BASE / "equipment2.csv", sep=";")
resp = pd.read_csv(BASE / "response.csv", sep=";")

print("eq1:", eq1.shape)
print("eq2:", eq2.shape)
print("resp:", resp.shape)

# identify sensors for both
sensor_cols_eq1 = [c for c in eq1.columns if c.startswith("sensor_")]
sensor_cols_eq2 = [c for c in eq2.columns if c.startswith("sensor_")]

print("Equipment1 sensors:", len(sensor_cols_eq1))
print("Equipment2 sensors:", len(sensor_cols_eq2))

eq1: (170896, 27)
eq2: (232144, 35)
resp: (2638, 4)
Equipment1 sensors: 24
Equipment2 sensors: 32


## Aggregate basic statistical features

In [33]:
def compute_basic_stats(df, sensor_cols):
    # by lot-wafer because one wafer is time-series
    grouped = df.groupby(["lot", "wafer"])

    # dictionary for holding all aggregated features
    feature_dict = {}

    for col in sensor_cols:
        stats = grouped[col].agg([
            "mean",
            "std",
            "min",
            "max",
            "median"
        ])
        
        # range is custom, so we calculate separate
        stats["range"] = stats["max"] - stats["min"]

        stats["p25"] = grouped[col].quantile(0.25)
        stats["p75"] = grouped[col].quantile(0.75)

        stats = stats.add_prefix(f"{col}_")

        feature_dict[col] = stats

    result = pd.concat(feature_dict.values(), axis=1)

    return result.reset_index()


In this step, I defined a function called compute_basic_stats that converts the time-series data for each wafer into a single feature vector. For every sensor column, I computed basic statistical values such as mean, standard deviation, minimum, maximum, median, range, and the 25th/75th percentiles. These statistics summarize the overall behavior of each sensor during wafer processing. The function groups by (lot, wafer) because each wafer is a separate time-series.
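To make the idea of "one feature vector per wafer" concrete, here is a minimal sketch on made-up data. The toy frame, the column name `sensor_x`, and the use of named aggregation are illustrative assumptions and not part of the pipeline above.

```python
import pandas as pd

# Toy data (hypothetical): two wafers in one lot, three time steps each.
toy = pd.DataFrame({
    "lot": ["lotA"] * 6,
    "wafer": [1, 1, 1, 2, 2, 2],
    "sensor_x": [0.0, 2.0, 4.0, 1.0, 1.0, 4.0],
})

# Named aggregation: one output column per statistic, one row per wafer.
toy_stats = toy.groupby(["lot", "wafer"])["sensor_x"].agg(
    mean="mean",
    min="min",
    max="max",
    median="median",
    p25=lambda s: s.quantile(0.25),
    p75=lambda s: s.quantile(0.75),
)

# Range is derived from the aggregated min and max.
toy_stats["range"] = toy_stats["max"] - toy_stats["min"]
toy_stats = toy_stats.add_prefix("sensor_x_").reset_index()

print(toy_stats)
```

Each wafer's three time steps collapse into a single row, which is the same reduction the full function applies to every sensor column.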

In [34]:
# run basic feature extraction for equipment1
features_eq1_basic = compute_basic_stats(eq1, sensor_cols_eq1)
print("equipment1 basic features:", features_eq1_basic.shape)

# run basic feature extraction for equipment2
features_eq2_basic = compute_basic_stats(eq2, sensor_cols_eq2)
print("equipment2 basic features:", features_eq2_basic.shape)

equipment1 basic features: (971, 194)
equipment2 basic features: (1319, 258)


I applied the feature extraction function to both equipment datasets. Equipment1 produced a table with 971 rows and 194 columns, and Equipment2 produced 1319 rows and 258 columns. The row counts match the number of unique wafers in each dataset, so the grouping worked correctly. The column counts also make sense: each sensor generates 8 statistical features, and the table includes lot and wafer as identifiers (24 × 8 + 2 = 194 and 32 × 8 + 2 = 258). This confirms that the basic statistical features were created successfully.
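The column-count reasoning can be written down as a small check. This is only a sketch; the constant and function names are hypothetical, and the sensor counts are taken from the printed output above.

```python
N_STATS = 8    # mean, std, min, max, median, range, p25, p75
N_ID_COLS = 2  # lot, wafer


def expected_basic_columns(n_sensors: int) -> int:
    """Expected width of the basic-statistics table for a given sensor count."""
    return n_sensors * N_STATS + N_ID_COLS


eq1_expected = expected_basic_columns(24)
eq2_expected = expected_basic_columns(32)
print(eq1_expected, eq2_expected)  # 194 258
```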

In [35]:
# Check head of equipment1 features
print("Equipment1 Basic Features (head)")
display(features_eq1_basic.head())

# Check how many columns generated
print("Number of columns in eq1 basic features:", features_eq1_basic.shape[1])

# Check one random wafer to see features are not NaN or weird
sample_row_eq1 = features_eq1_basic.sample(1, random_state=42)
print("Sample wafer features from eq1")
display(sample_row_eq1)



# Same check for equipment2
print("Equipment2 Basic Features (head)")
display(features_eq2_basic.head())

print("Number of columns in eq2 basic features:", features_eq2_basic.shape[1])

sample_row_eq2 = features_eq2_basic.sample(1, random_state=42)
print("Sample wafer features from eq2")
display(sample_row_eq2)

Equipment1 Basic Features (head)


Unnamed: 0,lot,wafer,sensor_1_mean,sensor_1_std,sensor_1_min,sensor_1_max,sensor_1_median,sensor_1_range,sensor_1_p25,sensor_1_p75,...,sensor_23_p25,sensor_23_p75,sensor_24_mean,sensor_24_std,sensor_24_min,sensor_24_max,sensor_24_median,sensor_24_range,sensor_24_p25,sensor_24_p75
0,lot10,1,5.14853,11.140761,0.0,29.9332,0.0,29.9332,0.0,0.0,...,-1.0,0.0,1.027955,1.529511,0.0,20.84,0.975,20.84,0.94,1.05
1,lot10,2,5.175845,11.138174,0.0,29.9478,0.0,29.9478,0.0,0.0,...,-1.0,0.0,1.045284,1.995562,0.0,27.05,0.96,27.05,0.94,1.05
2,lot10,3,5.194486,11.197936,0.0,29.9332,0.0,29.9332,0.0,0.0,...,-1.0,0.0,1.022784,1.467563,0.0,20.01,0.96,20.01,0.94,1.05
3,lot10,4,5.158105,11.15808,0.0,29.9478,0.0,29.9478,0.0,0.0,...,-1.0,0.0,1.02875,1.697494,0.0,23.06,0.96,23.06,0.94,1.05
4,lot10,5,5.169599,11.167451,0.0,29.9332,0.0,29.9332,0.0,0.0,...,-1.0,0.0,1.028295,1.462505,0.0,19.96,1.02,19.96,0.93,1.05


Number of columns in eq1 basic features: 194
Sample wafer features from eq1


Unnamed: 0,lot,wafer,sensor_1_mean,sensor_1_std,sensor_1_min,sensor_1_max,sensor_1_median,sensor_1_range,sensor_1_p25,sensor_1_p75,...,sensor_23_p25,sensor_23_p75,sensor_24_mean,sensor_24_std,sensor_24_min,sensor_24_max,sensor_24_median,sensor_24_range,sensor_24_p25,sensor_24_p75
168,lot17,20,2.81254,5.871474,0.0,15.0026,0.0,15.0026,0.0,0.0,...,-1.0,0.0,0.870114,0.354046,0.0,2.03,1.0,2.03,1.0,1.0


Equipment2 Basic Features (head)


Unnamed: 0,lot,wafer,sensor_25_mean,sensor_25_std,sensor_25_min,sensor_25_max,sensor_25_median,sensor_25_range,sensor_25_p25,sensor_25_p75,...,sensor_55_p25,sensor_55_p75,sensor_56_mean,sensor_56_std,sensor_56_min,sensor_56_max,sensor_56_median,sensor_56_range,sensor_56_p25,sensor_56_p75
0,lot10,1,0.001464,0.001903,0.0,0.004882,0.0,0.004882,0.0,0.004136,...,0.0,1.303167,0.596534,0.492699,0.0,1.14,0.98,1.14,0.0,1.0
1,lot10,2,0.001454,0.00189,0.0,0.004873,0.0,0.004873,0.0,0.004139,...,0.0,1.302417,0.704602,0.461606,0.0,1.23,0.99,1.23,0.0,1.01
2,lot10,3,0.001459,0.001899,0.0,0.004873,0.0,0.004873,0.0,0.004138,...,0.0,1.302417,0.61358,0.490355,0.0,1.18,0.98,1.18,0.0,1.0
3,lot10,4,0.001367,0.001868,0.0,0.004876,0.0,0.004876,0.0,0.004135,...,0.0,1.308617,0.592216,0.492416,0.0,1.31,0.975,1.31,0.0,1.0
4,lot10,5,0.001474,0.001902,0.0,0.004881,0.0,0.004881,0.0,0.00414,...,0.0,1.307675,0.619545,0.491343,0.0,1.26,0.98,1.26,0.0,1.0


Number of columns in eq2 basic features: 258
Sample wafer features from eq2


Unnamed: 0,lot,wafer,sensor_25_mean,sensor_25_std,sensor_25_min,sensor_25_max,sensor_25_median,sensor_25_range,sensor_25_p25,sensor_25_p75,...,sensor_55_p25,sensor_55_p75,sensor_56_mean,sensor_56_std,sensor_56_min,sensor_56_max,sensor_56_median,sensor_56_range,sensor_56_p25,sensor_56_p75
677,lot5,7,0.001419,0.001865,0.0,0.004823,0.0,0.004823,0.0,0.004064,...,0.0,1.326631,0.613636,0.491487,0.0,1.26,0.975,1.26,0.0,1.0


In [36]:
# number of unique wafers vs feature rows
n_wafers_eq1 = eq1[["lot", "wafer"]].drop_duplicates().shape[0]
n_feat_eq1 = features_eq1_basic.shape[0]

n_wafers_eq2 = eq2[["lot", "wafer"]].drop_duplicates().shape[0]
n_feat_eq2 = features_eq2_basic.shape[0]

print("eq1 wafers:", n_wafers_eq1, " / feature rows:", n_feat_eq1)
print("eq2 wafers:", n_wafers_eq2, " / feature rows:", n_feat_eq2)


eq1 wafers: 971  / feature rows: 971
eq2 wafers: 1319  / feature rows: 1319


In these checks, I inspected the first few rows of each feature table, counted the total number of columns, and looked at a random wafer row to confirm the values were reasonable. The displayed rows contain no NaN values, and the sensor statistics fall within plausible ranges. I also verified that the number of feature rows equals the number of wafers. This indicates that the step completed properly and the data is ready for the next feature engineering steps.
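Looking at a few rows cannot rule out missing values elsewhere in the table. A more complete check could look like the sketch below; `summarize_missing` is a hypothetical helper and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def summarize_missing(features: pd.DataFrame, id_cols=("lot", "wafer")) -> pd.Series:
    """Count NaN or infinite values per numeric feature column (non-zero counts only)."""
    numeric = features.drop(columns=list(id_cols)).select_dtypes(include="number")
    bad = numeric.isna() | np.isinf(numeric)
    counts = bad.sum()
    return counts[counts > 0]


# Toy feature table (hypothetical) with one missing value.
toy_features = pd.DataFrame({
    "lot": ["lotA", "lotA"],
    "wafer": [1, 2],
    "f1": [1.0, np.nan],
    "f2": [0.5, 0.7],
})

missing = summarize_missing(toy_features)
print(missing)
```

An empty result would mean that no feature column contains NaN or infinite values.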

## Shape / Trend Features

In [37]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def compute_shape_features(df, sensor_cols):
    # group by each wafer
    grouped = df.groupby(["lot", "wafer"])
    feature_dict = {}

    for col in sensor_cols:
        rows = []

        # loop over each wafer
        for (lot, wafer), g in grouped:
            y = g[col].values
            x = np.arange(len(y)).reshape(-1, 1)

            # slope using simple linear regression
            try:
                lr = LinearRegression().fit(x, y)
                slope = lr.coef_[0]
            except Exception:
                slope = np.nan

            # first derivative
            deriv = np.diff(y)

            # variance of derivative (captures how noisy movement is)
            deriv_var = np.var(deriv)

            # count peaks
            # peak = value greater than both neighbors
            if len(y) > 2:
                peaks = np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:]))
            else:
                peaks = 0

            # rising duration = number of times y increases
            rising = np.sum(deriv > 0)

            # falling duration = number of times y decreases
            falling = np.sum(deriv < 0)

            rows.append([lot, wafer, slope, deriv_var, peaks, rising, falling])

        feature_df = pd.DataFrame(
            rows,
            columns=[
                "lot",
                "wafer",
                f"{col}_slope",
                f"{col}_deriv_var",
                f"{col}_peaks",
                f"{col}_rising",
                f"{col}_falling"
            ]
        )

        feature_dict[col] = feature_df.set_index(["lot", "wafer"])

    # combine all sensor features
    result = pd.concat(feature_dict.values(), axis=1).reset_index()
    return result

# run for equipment1
features_eq1_shape = compute_shape_features(eq1, sensor_cols_eq1)
print("equipment1 shape features:", features_eq1_shape.shape)

# run for equipment2
features_eq2_shape = compute_shape_features(eq2, sensor_cols_eq2)
print("equipment2 shape features:", features_eq2_shape.shape)


equipment1 shape features: (971, 122)
equipment2 shape features: (1319, 162)


In this part, I generated several shape-based and trend-related features from the time-series data. These features differ from the basic statistics because they focus on how each sensor value changes over time. For every sensor, I calculated the slope using a simple linear regression, the variance of the first derivative, the number of peaks, and how many steps were rising or falling. These features describe the overall movement of the sensor, such as whether the signal is increasing, decreasing, fluctuating, or staying flat. Since the data is grouped by (lot, wafer), the output is one feature vector per wafer that summarizes the main characteristics of each sensor's time-series dynamics.
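For a single series with equally spaced time steps, the regression slope has a closed form, so it can be computed without fitting a model object. The sketch below is an alternative formulation under that assumption, not the code used above; `ols_slope` and `count_peaks` are hypothetical names.

```python
import numpy as np


def ols_slope(y) -> float:
    """Least-squares slope of y against x = 0, 1, ..., n-1."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    if n < 2:
        return float("nan")  # slope is undefined for fewer than two points
    x_centered = np.arange(n, dtype=float) - (n - 1) / 2.0
    return float(np.dot(x_centered, y - y.mean()) / np.dot(x_centered, x_centered))


def count_peaks(y) -> int:
    """Number of interior points strictly greater than both neighbors."""
    y = np.asarray(y, dtype=float)
    if len(y) < 3:
        return 0
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))


line_slope = ols_slope([1.0, 3.0, 5.0, 7.0])  # perfectly linear series
toy_peaks = count_peaks([0, 1, 3, 2, 2, 4])   # only the value 3 is a strict peak
print(line_slope, toy_peaks)
```

Note that the strict inequality means plateaus (repeated maximum values) are not counted as peaks, which matches the peak definition in the function above.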

In [38]:
print("Equipment1 Shape Features (head)")
display(features_eq1_shape.head())

print("Number of columns in eq1 shape features:", features_eq1_shape.shape[1])

sample_eq1 = features_eq1_shape.sample(1, random_state=42)
print("Sample wafer shape features from eq1")
display(sample_eq1)


print("Equipment2 Shape Features (head)")
display(features_eq2_shape.head())

print("Number of columns in eq2 shape features:", features_eq2_shape.shape[1])

sample_eq2 = features_eq2_shape.sample(1, random_state=42)
display(sample_eq2)


Equipment1 Shape Features (head)


Unnamed: 0,lot,wafer,sensor_1_slope,sensor_1_deriv_var,sensor_1_peaks,sensor_1_rising,sensor_1_falling,sensor_2_slope,sensor_2_deriv_var,sensor_2_peaks,...,sensor_23_slope,sensor_23_deriv_var,sensor_23_peaks,sensor_23_rising,sensor_23_falling,sensor_24_slope,sensor_24_deriv_var,sensor_24_peaks,sensor_24_rising,sensor_24_falling
0,lot10,1,-0.024306,6.20212,7,16,8,-0.025237,6.098565,0,...,-2.019586,105881.171396,0,5,3,-0.006491,2.227867,69,79,80
1,lot10,2,-0.025322,3.967683,2,12,8,-0.026403,6.083424,0,...,-2.152458,105881.171396,0,5,3,-0.008144,3.889236,67,77,79
2,lot10,3,-0.02277,6.339515,9,15,14,-0.02402,6.1344,0,...,-1.967614,105881.171396,0,5,3,-0.006314,2.058156,70,82,78
3,lot10,4,-0.024375,6.162851,6,13,10,-0.025146,6.088485,0,...,-1.967297,105881.171396,0,5,3,-0.007215,2.767908,67,80,79
4,lot10,5,-0.022513,6.247827,7,13,11,-0.023415,6.13115,0,...,-1.967776,105881.171396,0,5,3,-0.00615,2.076497,65,80,80


Number of columns in eq1 shape features: 122
Sample wafer shape features from eq1


Unnamed: 0,lot,wafer,sensor_1_slope,sensor_1_deriv_var,sensor_1_peaks,sensor_1_rising,sensor_1_falling,sensor_2_slope,sensor_2_deriv_var,sensor_2_peaks,...,sensor_23_slope,sensor_23_deriv_var,sensor_23_peaks,sensor_23_rising,sensor_23_falling,sensor_24_slope,sensor_24_deriv_var,sensor_24_peaks,sensor_24_rising,sensor_24_falling
168,lot17,20,-0.016889,2.571926,10,13,17,-0.016889,2.571429,0,...,-1.991041,108145.285682,0,4,3,-0.00354,0.039783,26,37,30


Equipment2 Shape Features (head)


Unnamed: 0,lot,wafer,sensor_25_slope,sensor_25_deriv_var,sensor_25_peaks,sensor_25_rising,sensor_25_falling,sensor_26_slope,sensor_26_deriv_var,sensor_26_peaks,...,sensor_55_slope,sensor_55_deriv_var,sensor_55_peaks,sensor_55_rising,sensor_55_falling,sensor_56_slope,sensor_56_deriv_var,sensor_56_peaks,sensor_56_rising,sensor_56_falling
0,lot10,1,-1.5e-05,1.79282e-07,14,40,26,-3.161513,844.56241,0,...,-0.010577,0.010019,15,43,25,-0.008205,0.007861,39,55,43
1,lot10,2,-1.4e-05,1.205047e-07,14,44,22,-2.7364,845.926853,0,...,-0.005895,0.01329,14,43,31,-0.007102,0.016711,43,64,52
2,lot10,3,-1.5e-05,1.655346e-07,18,37,30,-3.119428,847.29278,0,...,-0.007186,0.013536,17,44,29,-0.008081,0.010095,32,53,44
3,lot10,4,-1.5e-05,1.186051e-07,12,39,23,-3.169153,848.658482,0,...,-0.007384,0.013683,18,50,34,-0.00819,0.014933,31,51,46
4,lot10,5,-1.5e-05,1.71623e-07,19,43,28,-3.106999,850.025288,0,...,-0.006964,0.013738,14,51,27,-0.008046,0.017852,35,53,48


Number of columns in eq2 shape features: 162


Unnamed: 0,lot,wafer,sensor_25_slope,sensor_25_deriv_var,sensor_25_peaks,sensor_25_rising,sensor_25_falling,sensor_26_slope,sensor_26_deriv_var,sensor_26_peaks,...,sensor_55_slope,sensor_55_deriv_var,sensor_55_peaks,sensor_55_rising,sensor_55_falling,sensor_56_slope,sensor_56_deriv_var,sensor_56_peaks,sensor_56_rising,sensor_56_falling
677,lot5,7,-1.4e-05,1.786202e-07,16,41,23,-1.767606,272.291316,0,...,-0.010525,0.005926,15,28,65,-0.008085,0.015809,35,56,45


To confirm that the shape features were generated correctly, I checked the first few rows and also inspected a random wafer sample. The number of rows matches the total number of wafers, so each wafer received one shape-based feature vector. The slope values are small but reasonable since many sensors change slowly over time. The derivative variance also shows differences in noise levels across sensors, although the value for sensor_23 is several orders of magnitude larger than the others and may be worth a closer look. The rising and falling counts do not add up to the length of the time-series, because steps where the value stays flat are counted in neither; for example, sensor_1 on the first wafer has only 16 rising and 8 falling steps. The displayed rows contain no missing values. Overall, this indicates that the shape and trend feature extraction worked as intended.
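The relationship between rising, falling, and flat steps can be illustrated with a short sketch. The `flat` count is not one of the extracted features; it is added here only to show that the three categories together cover all n - 1 steps. The series is made up.

```python
import numpy as np


def step_counts(y):
    """Return (rising, falling, flat) step counts for a 1-D series."""
    deriv = np.diff(np.asarray(y, dtype=float))
    rising = int(np.sum(deriv > 0))
    falling = int(np.sum(deriv < 0))
    flat = int(np.sum(deriv == 0))
    return rising, falling, flat


# Toy series (hypothetical): mostly flat with one short burst.
y = [0, 0, 0, 1, 3, 3, 2, 0, 0]
rising, falling, flat = step_counts(y)
print(rising, falling, flat)  # 2 2 4
```

For sensors that sit at a constant value most of the time, the flat count dominates, which is why rising plus falling can be much smaller than the series length.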

Interpretation of the number of shape features: Equipment1 produced 122 columns and Equipment2 produced 162 columns. This makes sense because each sensor creates five trend features (slope, derivative variance, peaks, rising count, falling count), plus the two identifier columns: 24 × 5 + 2 = 122 and 32 × 5 + 2 = 162. Since Equipment2 has more sensors, it naturally generates more shape features. The results indicate that the feature extraction is consistent with the sensor counts and the dataset structure.

## Sparse-sensor Features (zero-heavy sensors)

In [39]:
import numpy as np
import pandas as pd

def compute_sparse_features(df, sensor_cols, zero_threshold=0.7):
    grouped = df.groupby(["lot", "wafer"])

    # check which sensors are zero-heavy on whole dataset
    zero_ratio_all = (df[sensor_cols] == 0).mean()
    sparse_sensors = [c for c in sensor_cols if zero_ratio_all[c] >= zero_threshold]

    print(f"Total sensors: {len(sensor_cols)}")
    print(f"Sparse sensors (zero_ratio >= {zero_threshold}): {len(sparse_sensors)}")
    print("Sparse sensor list:", sparse_sensors)

    feature_dict = {}

    for col in sparse_sensors:
        rows = []

        # loop over each wafer (lot, wafer pair)
        for (lot, wafer), g in grouped:
            vals = g[col].values

            n = len(vals)
            if n == 0:
                zero_ratio = np.nan
                activation_count = 0
                activation_duration = 0
                activation_mean = np.nan
            else:
                # zero ratio over time-series
                zeros = np.sum(vals == 0)
                zero_ratio = zeros / n

                # active means value > 0
                active_mask = vals > 0
                activation_count = np.sum(active_mask)

                if activation_count > 0:
                    # mean value when sensor is active
                    activation_mean = vals[active_mask].mean()

                    # longest consecutive active run
                    max_run = 0
                    cur_run = 0
                    for flag in active_mask:
                        if flag:
                            cur_run += 1
                            if cur_run > max_run:
                                max_run = cur_run
                        else:
                            cur_run = 0
                    activation_duration = max_run
                else:
                    # never active
                    activation_mean = 0.0
                    activation_duration = 0

            rows.append([
                lot,
                wafer,
                zero_ratio,
                activation_count,
                activation_duration,
                activation_mean,
            ])

        feature_df = pd.DataFrame(
            rows,
            columns=[
                "lot",
                "wafer",
                f"{col}_zero_ratio",
                f"{col}_activation_count",
                f"{col}_activation_duration",
                f"{col}_activation_mean",
            ],
        )

        feature_dict[col] = feature_df.set_index(["lot", "wafer"])

    # combine all sparse sensor features side by side
    if len(feature_dict) > 0:
        result = pd.concat(feature_dict.values(), axis=1).reset_index()
    else:
        print("No sparse sensors found with this threshold.")
        result = df[["lot", "wafer"]].drop_duplicates().reset_index(drop=True)

    return result, sparse_sensors


In [40]:
# run for equipment1
features_eq1_sparse, sparse_eq1 = compute_sparse_features(eq1, sensor_cols_eq1)
print("equipment1 sparse features:", features_eq1_sparse.shape)

print("\nEquipment1 Sparse Features (head)")
display(features_eq1_sparse.head())

sample_eq1_sparse = features_eq1_sparse.sample(1, random_state=42)
print("\nSample wafer sparse features from eq1")
display(sample_eq1_sparse)


# run for equipment2
features_eq2_sparse, sparse_eq2 = compute_sparse_features(eq2, sensor_cols_eq2)
print("equipment2 sparse features:", features_eq2_sparse.shape)

print("\nEquipment2 Sparse Features (head)")
display(features_eq2_sparse.head())

sample_eq2_sparse = features_eq2_sparse.sample(1, random_state=42)
print("\nSample wafer sparse features from eq2")
display(sample_eq2_sparse)

Total sensors: 24
Sparse sensors (zero_ratio >= 0.7): 4
Sparse sensor list: ['sensor_1', 'sensor_2', 'sensor_6', 'sensor_12']
equipment1 sparse features: (971, 18)

Equipment1 Sparse Features (head)


Unnamed: 0,lot,wafer,sensor_1_zero_ratio,sensor_1_activation_count,sensor_1_activation_duration,sensor_1_activation_mean,sensor_2_zero_ratio,sensor_2_activation_count,sensor_2_activation_duration,sensor_2_activation_mean,sensor_6_zero_ratio,sensor_6_activation_count,sensor_6_activation_duration,sensor_6_activation_mean,sensor_12_zero_ratio,sensor_12_activation_count,sensor_12_activation_duration,sensor_12_activation_mean
0,lot10,1,0.818182,32,32,28.316915,0.8125,33,33,27.972727,1.0,0,0,0.0,1.0,0,0,0.0
1,lot10,2,0.806818,34,34,26.79261,0.8125,33,33,27.554545,1.0,0,0,0.0,1.0,0,0,0.0
2,lot10,3,0.818182,32,32,28.569672,0.8125,33,33,27.363636,1.0,0,0,0.0,1.0,0,0,0.0
3,lot10,4,0.8125,33,33,27.509895,0.8125,33,33,27.927273,1.0,0,0,0.0,1.0,0,0,0.0
4,lot10,5,0.818182,32,32,28.432793,0.8125,33,33,28.081818,1.0,0,0,0.0,1.0,0,0,0.0



Sample wafer sparse features from eq1


Unnamed: 0,lot,wafer,sensor_1_zero_ratio,sensor_1_activation_count,sensor_1_activation_duration,sensor_1_activation_mean,sensor_2_zero_ratio,sensor_2_activation_count,sensor_2_activation_duration,sensor_2_activation_mean,sensor_6_zero_ratio,sensor_6_activation_count,sensor_6_activation_duration,sensor_6_activation_mean,sensor_12_zero_ratio,sensor_12_activation_count,sensor_12_activation_duration,sensor_12_activation_mean
168,lot17,20,0.8125,33,33,15.000215,0.8125,33,33,15.0,1.0,0,0,0.0,1.0,0,0,0.0


Total sensors: 32
Sparse sensors (zero_ratio >= 0.7): 3
Sparse sensor list: ['sensor_29', 'sensor_33', 'sensor_36']
equipment2 sparse features: (1319, 14)

Equipment2 Sparse Features (head)


Unnamed: 0,lot,wafer,sensor_29_zero_ratio,sensor_29_activation_count,sensor_29_activation_duration,sensor_29_activation_mean,sensor_33_zero_ratio,sensor_33_activation_count,sensor_33_activation_duration,sensor_33_activation_mean,sensor_36_zero_ratio,sensor_36_activation_count,sensor_36_activation_duration,sensor_36_activation_mean
0,lot10,1,0.806818,34,11,0.01,1.0,0,0,0.0,1.0,0,0,0.0
1,lot10,2,0.676136,57,16,0.01,1.0,0,0,0.0,1.0,0,0,0.0
2,lot10,3,0.784091,38,6,0.01,1.0,0,0,0.0,1.0,0,0,0.0
3,lot10,4,0.801136,35,11,0.01,1.0,0,0,0.0,1.0,0,0,0.0
4,lot10,5,0.761364,42,8,0.01,1.0,0,0,0.0,1.0,0,0,0.0



Sample wafer sparse features from eq2


Unnamed: 0,lot,wafer,sensor_29_zero_ratio,sensor_29_activation_count,sensor_29_activation_duration,sensor_29_activation_mean,sensor_33_zero_ratio,sensor_33_activation_count,sensor_33_activation_duration,sensor_33_activation_mean,sensor_36_zero_ratio,sensor_36_activation_count,sensor_36_activation_duration,sensor_36_activation_mean
677,lot5,7,0.818182,32,6,0.01,1.0,0,0,0.0,1.0,0,0,0.0


In this step, I focused on extracting special features from sensors that show sparse behavior. A sparse sensor is one that stays at zero most of the time, so standard statistical features are not enough to describe its behavior. To identify these sensors, I calculated the zero ratio of every sensor over the whole dataset and selected those with a zero ratio of at least 0.7. Then, for each sparse sensor, I computed four additional wafer-level features: the zero ratio itself, the number of timestamps with positive values (activation_count), the longest continuous run of active values (activation_duration), and the average sensor value while active (activation_mean). These features help capture event-like behavior that normal sensors do not show.
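The longest active run can also be computed without an explicit Python loop by locating the edges of each run. This is a sketch of an equivalent formulation, assuming "active" means a value greater than zero as in the function above; `longest_active_run` is a hypothetical name.

```python
import numpy as np


def longest_active_run(values) -> int:
    """Length of the longest consecutive stretch of values > 0."""
    active = np.asarray(values) > 0
    if not active.any():
        return 0
    # Pad with inactive samples so every run has both a start and an end edge.
    padded = np.concatenate(([False], active, [False])).astype(np.int8)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)   # index where a run begins
    ends = np.flatnonzero(edges == -1)    # index just past where a run ends
    return int((ends - starts).max())


run_a = longest_active_run([0, 1, 1, 0, 1, 1, 1, 0])
run_b = longest_active_run([0, 0, 0])
print(run_a, run_b)  # 3 0
```

Because the wafer loop runs once per sparse sensor per wafer, a vectorized version like this may reduce runtime on larger datasets, although the series here are short enough that the loop is also workable.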

The results match the characteristics I observed earlier in the EDA. For Equipment 1, the method detected four sparse sensors (sensor_1, sensor_2, sensor_6, sensor_12), which aligns with what I saw: sensors 1 and 2 are active for only a short part of each run, while sensors 6 and 12 remain at zero for the entire process. For Equipment 2, three sparse sensors (sensor_29, sensor_33, and sensor_36) were identified, and again this is consistent with their near-constant zero patterns. The extracted feature tables also have the correct number of rows, one per wafer, and the column counts match four features per sparse sensor plus the two identifiers (4 × 4 + 2 = 18 and 3 × 4 + 2 = 14). Activation counts and durations are positive for sensors with occasional activity, and they remain zero for sensors that are permanently inactive. Overall, the outputs indicate that the sparse-sensor detection and feature extraction were performed correctly.

## Equipment1 + Equipment2 merge after feature extraction

In [41]:
## Merge Equipment 1 and 2 Features (Wafer-level)

# Combine all features
features_eq1_all = (
    features_eq1_basic
    .merge(features_eq1_shape, on=["lot", "wafer"], how="left")
    .merge(features_eq1_sparse, on=["lot", "wafer"], how="left")
)

print("Equipment1 combined features:", features_eq1_all.shape)

features_eq2_all = (
    features_eq2_basic
    .merge(features_eq2_shape, on=["lot", "wafer"], how="left")
    .merge(features_eq2_sparse, on=["lot", "wafer"], how="left")
)

print("Equipment2 combined features:", features_eq2_all.shape)

# Merge
merged_eq = features_eq1_all.merge(
    features_eq2_all,
    on=["lot", "wafer"],
    how="inner"
)

print("Merged eq1 + eq2:", merged_eq.shape)

# Check duplicated rows in the response table
dup_mask = resp.duplicated(subset=["lot", "wafer"], keep=False)
print("Duplicated (lot, wafer) rows in response:", dup_mask.sum())

# Remove duplicates from the response table (keeps the first occurrence)
resp_unique = resp.drop_duplicates(subset=["lot", "wafer"])
print("Original resp shape:", resp.shape)
print("Unique resp shape:", resp_unique.shape)

final_features = merged_eq.merge(
    resp_unique,
    on=["lot", "wafer"],
    how="inner"
)

print("Final merged dataset:", final_features.shape)

display(final_features.head())


Equipment1 combined features: (971, 330)
Equipment2 combined features: (1319, 430)
Merged eq1 + eq2: (971, 758)
Duplicated (lot, wafer) rows in response: 2638
Original resp shape: (2638, 4)
Unique resp shape: (1319, 4)
Final merged dataset: (971, 760)


Unnamed: 0,lot,wafer,sensor_1_mean,sensor_1_std,sensor_1_min,sensor_1_max,sensor_1_median,sensor_1_range,sensor_1_p25,sensor_1_p75,...,sensor_33_zero_ratio,sensor_33_activation_count,sensor_33_activation_duration,sensor_33_activation_mean,sensor_36_zero_ratio,sensor_36_activation_count,sensor_36_activation_duration,sensor_36_activation_mean,response,class
0,lot10,1,5.14853,11.140761,0.0,29.9332,0.0,29.9332,0.0,0.0,...,1.0,0,0,0.0,1.0,0,0,0.0,0.4086,good
1,lot10,2,5.175845,11.138174,0.0,29.9478,0.0,29.9478,0.0,0.0,...,1.0,0,0,0.0,1.0,0,0,0.0,0.4032,good
2,lot10,3,5.194486,11.197936,0.0,29.9332,0.0,29.9332,0.0,0.0,...,1.0,0,0,0.0,1.0,0,0,0.0,0.441,good
3,lot10,4,5.158105,11.15808,0.0,29.9478,0.0,29.9478,0.0,0.0,...,1.0,0,0,0.0,1.0,0,0,0.0,0.4032,good
4,lot10,5,5.169599,11.167451,0.0,29.9332,0.0,29.9332,0.0,0.0,...,1.0,0,0,0.0,1.0,0,0,0.0,0.4266,good


In this step, I merged all the engineered features from both machines into a single wafer-level dataset. First, I combined the basic statistics, shape-based features, and sparse-sensor features for Equipment 1 into one table, and did the same for Equipment 2, using lot and wafer as keys. Then I performed an inner join between the two equipment tables so that only wafers that appear in both machines are kept. After that, I checked the response.csv file and found that every (lot, wafer) pair was duplicated (2638 rows for 1319 unique wafers), so I created a cleaned label table resp_unique by dropping duplicate (lot, wafer) rows and keeping the first occurrence. Finally, I merged the combined feature table with this unique response table to attach the target variables.
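Dropping duplicates by key keeps the first row for each wafer, which is only safe if the duplicated rows carry the same labels. The sketch below shows one way to check for conflicting duplicates and to make the final merge fail loudly if keys are not unique. The toy tables are made up; `validate="one_to_one"` is a standard `DataFrame.merge` option.

```python
import pandas as pd

# Toy response table (hypothetical): wafer 2 has two different response values.
resp_toy = pd.DataFrame({
    "lot": ["lotA", "lotA", "lotA", "lotA"],
    "wafer": [1, 1, 2, 2],
    "response": [0.40, 0.40, 0.44, 0.47],
    "class": ["good", "good", "good", "good"],
})

# Distinct label values per wafer; more than 1 means the copies disagree.
n_distinct = resp_toy.groupby(["lot", "wafer"])[["response", "class"]].nunique()
conflicts = n_distinct[(n_distinct > 1).any(axis=1)]
print(conflicts)

# After deduplication, validate that the join is truly one row per wafer.
features_toy = pd.DataFrame({"lot": ["lotA", "lotA"], "wafer": [1, 2], "f1": [1.0, 2.0]})
resp_unique_toy = resp_toy.drop_duplicates(subset=["lot", "wafer"])
merged_toy = features_toy.merge(
    resp_unique_toy, on=["lot", "wafer"], how="inner", validate="one_to_one"
)
print(merged_toy.shape)
```

If `conflicts` is empty on the real response table, keeping the first occurrence loses no information; if not, the duplicated rows would need to be reconciled before training.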

The printed shapes confirm that the pipeline behaved as expected. Equipment 1 features have 971 wafers, Equipment 2 features have 1319 wafers, and the merged equipment table keeps 971 wafers that exist in both. The cleaned response table has 1319 unique wafers, and after the final merge, the dataset has 971 rows and 760 columns. This means each wafer now has one row that includes all features from Equipment 1 and 2 plus a single associated response and class label. The head of the table also looks consistent - no duplicated (lot, wafer) rows, and feature values are in reasonable ranges. This final feature matrix is ready to be used for training the models.

## Feature Cleaning

In [42]:
import numpy as np
import pandas as pd

# Separate ID / target columns
id_cols = ["lot", "wafer"]
target_cols = ["response", "class"]

# Extract only feature columns
feature_cols = [c for c in final_features.columns if c not in id_cols + target_cols]

X_raw = final_features[feature_cols].copy()
y_reg = final_features["response"].copy()
y_cls = final_features["class"].copy()

print("X_raw shape:", X_raw.shape)
print("y_reg shape:", y_reg.shape)
print("y_cls value counts:")
print(y_cls.value_counts())

X_raw shape: (971, 756)
y_reg shape: (971,)
y_cls value counts:
class
good    801
bad     170
Name: count, dtype: int64


In this step, I separated the dataset into three parts - ID columns, feature columns, and target labels. I extracted only the real machine-generated features into X_raw, and saved both the regression target and classification target for later model training.

The raw feature matrix contains 971 wafers × 756 features, which matches the merged dataset once the 2 ID columns and 2 target columns are removed from its 760 columns. The label distribution shows class imbalance: 801 good wafers versus 170 bad wafers (about 17.5% bad). This is important for model training because some models may need class weighting or balanced sampling.
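As an illustration of class weighting, the sketch below computes inverse-frequency weights from the class counts printed above, using the formula n_samples / (n_classes × class_count). This is the same heuristic that scikit-learn applies for `class_weight="balanced"`; whether it is the right choice for the later models is an open assumption.

```python
# Class counts taken from the value_counts output above.
counts = {"good": 801, "bad": 170}

n_samples = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency weights: rarer classes get proportionally larger weights.
class_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}

for label, weight in class_weights.items():
    print(f"{label}: {weight:.4f}")
```

With these counts, a misclassified bad wafer would weigh roughly 4.7 times as much as a misclassified good wafer.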

In [43]:
# Remove near-constant (low variance) features
stds = X_raw.std(axis=0)

low_var_threshold = 1e-4
low_var_cols = stds[stds < low_var_threshold].index.tolist()

print("Number of near-constant features:", len(low_var_cols))
print("Example low-var features:", low_var_cols[:10])

X_no_lowvar = X_raw.drop(columns=low_var_cols)
print("Shape after removing near-constant features:", X_no_lowvar.shape)

Number of near-constant features: 134
Example low-var features: ['sensor_1_median', 'sensor_1_p25', 'sensor_1_p75', 'sensor_2_min', 'sensor_2_median', 'sensor_2_p25', 'sensor_2_p75', 'sensor_4_min', 'sensor_5_min', 'sensor_6_min']
Shape after removing near-constant features: (971, 622)


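I computed the standard deviation of every feature across wafers and removed features whose standard deviation was below 1e-4. Such features barely change across wafers and carry little useful information for an ML model. Removing them reduces dimensionality and improves interpretability. Note that the threshold is absolute, so a feature measured on a very small scale can be removed even if it varies noticeably in relative terms.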

The dataset originally had 756 features. Of these, 134 were near-constant and were removed. Examples include sensor statistics such as medians or percentiles that were always zero or always the same value. After this step, 622 features remain, which is plausible for this type of industrial sensor data, where many channels are idle or fixed for most of the process.
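Because the threshold is applied to unscaled features, it may help to compare it with a scale-relative rule. The sketch below flags columns both ways; `flag_low_variation`, `rel_tol`, and the toy data are hypothetical and only meant to show how the two rules can disagree.

```python
import pandas as pd


def flag_low_variation(X: pd.DataFrame, abs_tol: float = 1e-4, rel_tol: float = 1e-3):
    """Return (columns with std < abs_tol, columns with std / mean|x| < rel_tol)."""
    std = X.std(axis=0)
    scale = X.abs().mean(axis=0)
    # Ratio of spread to typical magnitude; all-zero columns are treated as constant.
    rel = (std / scale.where(scale > 0)).fillna(0.0)
    by_abs = std[std < abs_tol].index.tolist()
    by_rel = rel[rel < rel_tol].index.tolist()
    return by_abs, by_rel


# Toy features (hypothetical): "tiny" varies a lot relative to its own scale.
toy_X = pd.DataFrame({
    "tiny": [1e-6, 2e-6, 3e-6, 4e-6],
    "const": [5.0, 5.0, 5.0, 5.0],
    "big": [1.0, 2.0, 3.0, 4.0],
})

by_abs, by_rel = flag_low_variation(toy_X)
print("absolute rule:", by_abs)
print("relative rule:", by_rel)
```

Here the absolute rule drops `tiny` even though its values differ by a factor of four, while the relative rule keeps it and drops only the truly constant column.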

In [44]:
# Remove perfectly duplicated features
X_T = X_no_lowvar.T
X_T_nodup = X_T.drop_duplicates()

X_clean = X_T_nodup.T

# List which columns were removed
removed_dup_cols = [c for c in X_no_lowvar.columns if c not in X_clean.columns]

print("Number of duplicated features removed:", len(removed_dup_cols))
print("Example duplicated features:", removed_dup_cols[:10])
print("Shape after removing duplicated features:", X_clean.shape)

# Save the final clean feature names
clean_feature_cols = X_clean.columns.tolist()
print("\nTotal clean feature count:", len(clean_feature_cols))

Number of duplicated features removed: 66
Example duplicated features: ['sensor_2_range', 'sensor_4_range', 'sensor_5_range', 'sensor_6_range', 'sensor_7_range', 'sensor_8_range', 'sensor_9_range', 'sensor_10_range', 'sensor_11_range', 'sensor_12_range']
Shape after removing duplicated features: (971, 556)

Total clean feature count: 556


This step identifies and removes features that are exactly identical across all wafers. Different feature extraction steps can unintentionally generate mathematical duplicates. Removing them keeps the feature space clean and avoids feeding the model redundant information.

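From the 622 remaining features, 66 were detected as exact duplicates. Many of these were range features, most likely because a sensor whose minimum is always zero has a range (max minus min) equal to its maximum, so the range column duplicates the max column. For example, sensor_2_min was removed earlier as near-constant, sensor_2_range was dropped here as a duplicate, and sensor_2_max was kept. After this cleanup step, the dataset was reduced to 556 unique features, which is the final size before scaling.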
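The way a range column ends up as an exact duplicate can be reproduced on a small example. The sketch below uses made-up values and hypothetical column names; it applies the same transpose-and-deduplicate approach as the cell above.

```python
import pandas as pd

# Toy features (hypothetical): the minimum is always zero.
toy = pd.DataFrame({
    "s_max": [30.0, 29.9, 30.1],
    "s_min": [0.0, 0.0, 0.0],
    "s_mean": [5.1, 5.2, 5.0],
})
toy["s_range"] = toy["s_max"] - toy["s_min"]  # identical to s_max when min == 0

# Transpose so columns become rows, drop duplicate rows, transpose back.
deduped = toy.T.drop_duplicates().T
dropped = [c for c in toy.columns if c not in deduped.columns]
print(dropped)  # ['s_range']
```

`drop_duplicates` keeps the first occurrence, so the column that appears earlier in the table (here `s_max`) survives and the later copy is removed.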

In [45]:
# Build final clean dataframe (ID + clean features + target)
final_features_clean = pd.concat(
    [
        final_features[id_cols].reset_index(drop=True),
        X_clean.reset_index(drop=True),
        final_features[target_cols].reset_index(drop=True),
    ],
    axis=1,
)

print("\nfinal_features_clean shape:", final_features_clean.shape)
display(final_features_clean.head())


final_features_clean shape: (971, 560)


Unnamed: 0,lot,wafer,sensor_1_mean,sensor_1_std,sensor_1_min,sensor_1_max,sensor_1_range,sensor_2_mean,sensor_2_std,sensor_2_max,...,sensor_29_zero_ratio,sensor_29_activation_count,sensor_29_activation_duration,sensor_29_activation_mean,sensor_33_zero_ratio,sensor_33_activation_count,sensor_36_zero_ratio,sensor_36_activation_count,response,class
0,lot10,1,5.14853,11.140761,0.0,29.9332,29.9332,5.244886,11.237582,30.0,...,0.806818,34.0,11.0,0.01,1.0,0.0,1.0,0.0,0.4086,good
1,lot10,2,5.175845,11.138174,0.0,29.9478,29.9478,5.166477,11.166946,30.0,...,0.676136,57.0,16.0,0.01,1.0,0.0,1.0,0.0,0.4032,good
2,lot10,3,5.194486,11.197936,0.0,29.9332,29.9332,5.130682,11.140863,30.0,...,0.784091,38.0,6.0,0.01,1.0,0.0,1.0,0.0,0.441,good
3,lot10,4,5.158105,11.15808,0.0,29.9478,29.9478,5.236364,11.229013,30.0,...,0.801136,35.0,11.0,0.01,1.0,0.0,1.0,0.0,0.4032,good
4,lot10,5,5.169599,11.167451,0.0,29.9332,29.9332,5.265341,11.259025,30.0,...,0.761364,42.0,8.0,0.01,1.0,0.0,1.0,0.0,0.4266,good


The final step reconstructs a clean, ready-to-train dataset that includes the ID columns, all cleaned features, and the target labels. This table will be used as the input for machine learning models.

The final cleaned dataset has 971 rows × 560 columns: 2 ID columns, 556 cleaned features, and 2 target columns. This confirms that the feature cleaning pipeline worked as intended. No rows were lost, and only near-constant or duplicated columns were removed. The dataset is now ready for training classical ML models.

## Final Interpretation for Feature_Engineering
In this notebook, I transformed the raw time-series sensor data into clean wafer-level features that can be used directly for model training. I extracted three main types of features: basic statistics, shape/trend features, and sparse-sensor activation features. These capture both the overall distribution of each sensor and how the signals change over time.

After feature extraction, I merged Equipment1 and Equipment2 data with the response labels. Then, I removed near-constant features and duplicated columns to reduce noise and redundancy. This step reduced the dimensionality from 756 to 556 useful features.

Overall, the final dataset contains clean, informative, and non-redundant features for each wafer. It is now fully ready for model training, where I will train multiple machine learning models and compare their performance.

In [46]:
output_path = "final_features_clean.csv"
final_features_clean.to_csv(output_path, index=False)

print("Saved to:", output_path)

Saved to: final_features_clean.csv
