# Semiconductor Manufacturing Yield Analysis

## Data Description
Source:
https://archive.ics.uci.edu/dataset/179/secom

* Data comes from semiconductor manufacturing. 
* ~ 600 **unlabeled** (unnamed) features, plus a pass/fail yield label stored in a separate file.
* ~ 1600 rows, indexed by a timestamp. Each timestamp corresponds to a "production unit" per the dataset documentation.

**Goals and Strategy**

In this analysis we will train progressively improved random forest models to identify top yield predictors. 

## Data Preparation

In [None]:
import pandas as pd
import polars as pl

# Read in the data
# Define schema to treat all 591 columns as Float64. 
# This is necessary because polars will infer i64 on some columns, 
# probably because the first values encountered are integers 
schema = {f"column_{i}": pl.Float64 for i in range(591)}

data = pl.read_csv(
    source = r'data/secom.data', 
    has_header = False,
    separator = ' ',
    null_values = 'NaN',
    schema_overrides = schema)
#data.tail()

# Read the index columns (pass/fail label and timestamp)
indeces = pl.read_csv(
    source = r'data/secom_labels.data',
    separator = ' ',
    has_header = False,
    new_columns = ['pass_fail','timestamp'],
    try_parse_dates=True
).select(['timestamp','pass_fail'])
# Select here just reorders the columns for display aesthetics
# indeces.tail()

# Use hstack to stitch the files together, since the relation is implicit (row order).
# Normally we would join on an explicit shared column.
df = indeces.hstack(data)
df.to_pandas().tail()

In [None]:
# As a sanity check, compute the failure rate (label 1 = fail)
fails = indeces.select('pass_fail').filter( pl.col.pass_fail == 1 ).height

fail_rate = 100*fails/indeces.height
print(f"The failure rate is {round(fail_rate)}%")

## Data Analysis
### Random Forest, first try

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np

X = data.to_numpy()
y = df['pass_fail'].to_numpy()

# Get feature names for importances
feature_names = df.drop(["pass_fail", "timestamp"]).columns

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

model = RandomForestClassifier(n_estimators=100, random_state=42)
f = model.fit(X_train, y_train)

After successfully fitting the model, we can extract the most important features.

In [None]:
import pandas as pd
importances = pd.Series(model.feature_importances_, index=feature_names)
print('Top predictors')
print('-----------------------')
print(importances.sort_values(ascending=False).head(10))

As a sanity check, let us compare the class distributions of the train and test sets to ensure that both contain a proper proportion of failures.

In [None]:
# Class distribution
# Let's double check that the test and train sets are properly stratified.
print(pd.Series(y_train).value_counts(normalize=True))
print(pd.Series(y_test).value_counts(normalize=True))
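
To illustrate what the `stratify` argument is expected to do, here is a minimal sketch on synthetic labels (the 930/70 split and the names `X_demo`, `y_demo` are assumptions made up for this example, not values from the SECOM data). With stratification, the failure fraction should be nearly identical in the train and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels for illustration only: 930 passes (-1) and 70 failures (1).
y_demo = np.array([-1] * 930 + [1] * 70)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Fraction of failures in each split
train_fail_frac = float(np.mean(y_tr == 1))
test_fail_frac = float(np.mean(y_te == 1))
print(f"train failure fraction: {train_fail_frac:.3f}")
print(f"test failure fraction:  {test_fail_frac:.3f}")
```

Without `stratify`, a rare class can end up over- or under-represented in the test set purely by chance, which makes the evaluation noisier.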

Now let's evaluate the model. 

In [None]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print( classification_report(y_test, y_pred) )

This model performs very poorly on the failure class. The warnings most likely arise because the model never predicts a failure, so zero failures were correctly identified.
* Precision ( True positive / (False positive + True positive) ) is **zero** for yield failures.
* Recall ( True positive / (False negative + True positive) ) is zero as well.

Because the model does not detect any failures, the features listed above may not be very important as yield predictors. Performance may improve by balancing the model training across pass/fail, which is explored next.
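
The precision and recall definitions above can be checked on a small hand-made example. This is only a sketch: the label vectors below are invented for illustration (with -1 = pass and 1 = fail, matching the dataset's coding) and are not model output. It also shows why accuracy alone is misleading on imbalanced data: a classifier that always predicts "pass" can match the accuracy of one that actually catches failures.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hand-made labels for illustration only: -1 = pass, 1 = fail.
y_true = [-1, -1, -1, -1, -1, -1, 1, 1, 1, 1]
y_some = [-1, -1, -1, -1, -1, 1, 1, -1, -1, -1]  # 1 TP, 1 FP, 3 FN
y_all_pass = [-1] * len(y_true)                   # never predicts a failure

for name, y_hat in [("some failures flagged", y_some), ("always pass", y_all_pass)]:
    acc = accuracy_score(y_true, y_hat)
    # zero_division=0 reports 0 instead of warning when no failure is predicted
    prec = precision_score(y_true, y_hat, pos_label=1, zero_division=0)
    rec = recall_score(y_true, y_hat, pos_label=1, zero_division=0)
    print(f"{name}: accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

Both predictors reach the same accuracy here, but only the first has non-zero precision and recall on the failure class.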

### Random Forest with "balanced" training
Here we use a variant of the random forest algorithm that rebalances the training data fed to every tree in the ensemble, so that passes and failures are represented 50/50.
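
To make the idea of per-tree balancing concrete, here is a minimal sketch of drawing a 50/50 sample of row indices. The helper `balanced_bootstrap_indices` is hypothetical and written for this illustration; it is not the library's implementation, whose sampling details may differ. The labels are synthetic.

```python
import numpy as np

def balanced_bootstrap_indices(y, rng):
    """Hypothetical helper: draw an equal number of row indices per class."""
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()  # size of the rarest class
    picks = [
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=True)
        for c in classes
    ]
    return np.concatenate(picks)

rng = np.random.default_rng(0)
y_demo = np.array([-1] * 90 + [1] * 10)  # synthetic labels: 10% failures
idx = balanced_bootstrap_indices(y_demo, rng)

# Each class contributes the same number of rows to the sample
labels, counts = np.unique(y_demo[idx], return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```

Each tree would then be trained on its own balanced draw, so the majority class is heavily subsampled per tree while the ensemble as a whole still sees most of the data.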

In [None]:
from sklearn.impute import SimpleImputer
from imblearn.ensemble import BalancedRandomForestClassifier

# Impute missing values: in contrast to the non-balanced classifier above, this one cannot handle NaNs.
# Hence we impute with the median of each feature (fitted on the training set only).
imputer = SimpleImputer(strategy="median")
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Train model
model = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_imputed, y_train)

# Predict with default threshold (0.5)
y_pred = model.predict(X_test_imputed)

# Classification report
print("\nClassification Report")
print(classification_report(y_test, y_pred, labels=[-1, 1], target_names=["Pass", "Fail"], zero_division=0))

# Feature importances
# One importance per feature, so len(model.feature_importances_) == len(feature_names)
importances = pd.Series(model.feature_importances_, index=feature_names)
print("\nTop 10 Feature Importances:")
print(importances.sort_values(ascending=False).head(10))

This is a marked improvement: the model now correctly predicts 20% of failures, up from zero. Next, let's explore whether hyperparameter tuning can improve the model further.
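
The prediction step above uses the default 0.5 threshold. As a hedged sketch of how that threshold affects recall, the example below trains a plain random forest on synthetic data (generated with `make_classification`, labels 0/1 rather than the dataset's -1/1; all names and sizes are assumptions for illustration) and compares two thresholds applied to `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data for illustration only (roughly 10% positives).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Column order of predict_proba follows clf.classes_; pick the positive class.
pos_col = list(clf.classes_).index(1)
proba_fail = clf.predict_proba(X_te)[:, pos_col]

recalls = {}
for threshold in (0.5, 0.3):
    y_hat = (proba_fail >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat, zero_division=0)
    print(f"threshold={threshold}: recall={recalls[threshold]:.2f}")
```

Lowering the threshold can only flag more samples as failures, so recall cannot decrease, though precision usually suffers. Whether this trade-off is worthwhile depends on the relative cost of missed failures and false alarms.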