# Semiconductor Manufacturing Yield Analysis

**Data set**:
https://archive.ics.uci.edu/dataset/179/secom
* ~600 features, including the pass/fail yield label.
* ~1600 rows, indexed by a timestamp. Per the documentation, each timestamp corresponds to a "production unit".

## Data Prep
The data and the indices (timestamps and pass/fail labels) are in two separate files, so they need to be stitched together.

In [None]:
import polars as pl

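# Read in the data.
# Define schema overrides to treat all 591 columns as Float64.
# This is necessary because polars will infer i64 on some columns,
# probably because the first values encountered are integers.
# With has_header=False polars names the columns column_1, column_2, ...,
# so the range starts at 1 rather than 0.
schema = {f"column_{i}": pl.Float64 for i in range(1, 592)}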

data = pl.read_csv(
    source = r'data/secom.data', 
    has_header = False,
    separator = ' ',
    null_values = 'NaN',
    schema_overrides = schema)
#data.tail()

# Read the indices (pass/fail label and timestamp)
indeces = pl.read_csv(
    source = r'data/secom_labels.data',
    separator = ' ',
    has_header = False,
    new_columns = ['pass_fail','timestamp'],
    try_parse_dates=True
).select(['timestamp','pass_fail']).with_columns(
    # Pass/fail encoding per documentation: -1 is pass, 1 is fail.
    pl.when(pl.col.pass_fail == -1).then(pl.lit('p')).otherwise(pl.lit('f')).alias('pass_fail')
)
# The select above just reorders the columns for display aesthetics.
# indeces.tail()

# Use hstack to stitch the two frames side by side. The relation between the
# files is implicit: rows are matched purely by position, not by a key.
df = indeces.hstack(data)
df.tail()

In [None]:
# As a sanity check, let's count the failures
indeces.select('pass_fail').filter( pl.col.pass_fail == 'f' ).height
