# PROBLEM DEFINITION
  
- In semiconductor manufacturing, the production of integrated circuits (ICs) involves multiple complex processes, including wafer fabrication. Wafers are thin, round substrates made of semiconductor material that undergo many manufacturing steps to become microchips. Wafer quality is critical to the reliability and performance of the final ICs.

- Wafer manufacturing is susceptible to defects and faults that compromise IC quality and yield. These faults can result from contamination, equipment malfunctions, or process variations. Detecting and classifying them early in the manufacturing process is essential to minimize waste and ensure product quality.

## GOAL
- The goal of this experiment notebook is to find a predictive classifier that generalizes well and accurately classifies semiconductor wafers as either "Good" or "Bad" based on sensor data collected during the manufacturing process.
- Based on input from domain experts, business priorities, and the consequences of each kind of misclassification, the cost function is defined as cost = 10·FN + 1·FP, where FN and FP are the numbers of false negatives and false positives. A false negative is therefore treated as ten times as costly as a false positive.
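As a small worked illustration of the cost above, the sketch below counts false negatives and false positives from two label vectors and applies the 10:1 weighting. It assumes label 1 is the positive class; the function name `weighted_cost` and the toy labels are illustrative and do not come from the notebook.

```python
import numpy as np

def weighted_cost(y_true, y_pred, fn_weight=10, fp_weight=1):
    """Return fn_weight*FN + fp_weight*FP for binary labels (1 = positive class)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # missed positives
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false alarms
    return fn_weight * fn + fp_weight * fp

# toy example: 2 missed positives and 3 false alarms
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [0, 0, 1, 1, 1, 1, 0, 0]
cost = weighted_cost(y_true, y_pred)
print(cost)  # 10*2 + 1*3 = 23
```

With these weights, one missed positive costs as much as ten false alarms, which is why a low decision threshold can be worthwhile.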


****

## DATA EXPLORATION

[x] head, info, description

[x] quick profiling reports

[x] missing values analysis

[x] duplicates

[x] outliers

[x] feature distributions

[x] imbalance check

[x] correlation analysis

[x] notes

### NOTES
- all meaningful input features are float64 dtype
- the wafer id column can be removed
- no duplicated rows
- raw data is highly unscaled
- lots of outliers: even with the IQR quantile threshold set to 5% and the coefficient set to 5, many outliers remained
- as an alternative, I performed multivariate outlier detection using LOF; the elbow method suggested a threshold score of < -2
- there are a lot of (more than 100) duplicated columns; these may all be the all-zero columns
- there are 112 columns with the constant value 0
- best-fit distribution types are found; re-assess after data transformation
- highly imbalanced dataset: use stratified splitting (stratified k-fold) when creating the train-test datasets
- correlation study: a pair of columns counts as highly correlated when 0.95 < abs(corr) < 1.00; columns with more than threshold=5 such partners are flagged, and some columns meet this condition
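To make the widened IQR rule from the notes concrete, the sketch below builds the lower and upper fences from the 5th and 95th percentiles with a coefficient of 5. The helper name `iqr_fences` and the toy series are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def iqr_fences(series, quantile=0.05, coeff=5.0):
    """Fences built from the [quantile, 1 - quantile] range instead of the usual quartiles."""
    low_q = series.quantile(quantile)
    high_q = series.quantile(1 - quantile)
    spread = high_q - low_q
    return low_q - coeff * spread, high_q + coeff * spread

# 100 well-behaved readings plus one extreme value
values = pd.Series(list(range(100)) + [10_000], dtype="float64")
lower, upper = iqr_fences(values)
outliers = values[(values < lower) | (values > upper)]
print(lower, upper)
print(outliers.tolist())
```

Fences this wide only flag extreme values, so a column that still shows many outliers under this rule is likely heavy-tailed rather than merely noisy.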

In [1]:
import pandas as pd 
import numpy as np
import dtale
import seaborn as sns
import matplotlib.pyplot as plt 
from scipy import stats 
import scipy 


from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.neighbors import LocalOutlierFactor

from sklearn.preprocessing import StandardScaler , MinMaxScaler, RobustScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix,accuracy_score,f1_score, roc_auc_score, roc_curve, precision_score,recall_score

from xgboost import XGBClassifier
from catboost import CatBoostClassifier

from imblearn.combine import SMOTETomek # hybrid technique

from sklearn.base import BaseEstimator , TransformerMixin

from xgboost import plot_importance

import optuna 
import dill


import os 
import warnings
warnings.filterwarnings("ignore")







In [2]:
%%html
<style>
.output_wrapper, .output {
    max-height: 800px; /* Adjust the height as needed */
}
</style>

In [3]:
valid_dataset_dir = "../valid_feature_store/valid_training_data/"


In [4]:
def restore_original_data()->pd.DataFrame:
    """Read every CSV file in valid_dataset_dir and merge them row-wise."""
    csv_file_list = os.listdir(valid_dataset_dir)
    frames = [pd.read_csv(os.path.join(valid_dataset_dir,file)) for file in csv_file_list]
    # concatenate once along axis=0; ignore_index gives every row a unique label
    df_merged = pd.concat(objs=frames,ignore_index=True)
    df_merged = df_merged.drop(columns=["Wafer"]) # the wafer id is not a feature
    # encode the target as {1, 0}: label 1 stays 1, every other label becomes 0
    df_merged["Good/Bad"] = np.where(df_merged["Good/Bad"]==1,1,0)

    return df_merged 

In [None]:
df_merged = restore_original_data()

In [None]:
df_merged.head()

In [None]:
df_merged.describe().T 

In [None]:
d= dtale.show(df_merged)

In [None]:
d.open_browser()

### EDA: Missing Value Analysis

In [None]:
def missing_values_table(dataframe:pd.DataFrame,is_return= False):
    na_cols = [col for col in dataframe.columns if dataframe[col].isna().sum()>0]
    na_data = dataframe[na_cols].isna().sum().sort_values(ascending=False)
    ratio = (dataframe[na_cols].isna().sum()/dataframe.shape[0]*100).sort_values(ascending=False)
    missing_df = pd.concat(objs=[na_data,np.round(ratio,2)],axis=1,keys= ["#missing","ratio"])
    print(missing_df,end="\n")
    if is_return:
        return (na_cols,missing_df)
    return missing_df

In [None]:
missing_table = missing_values_table(df_merged)

In [None]:
filtered_missed_table = missing_table.query("ratio>60")

In [None]:
filtered_missed_table

##### ->missing value analysis w/ target variable

In [None]:
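def missing_vs_target(dataframe:pd.DataFrame,target:str,na_cols:list):
    """Print the target mean per missing-flag group when the mean among missing rows lies in (0.30, 0.70)."""
    n_printed = 0
    temp_df = dataframe.copy()
    for col in na_cols:
        temp_df[col + "_NA_FLAG"] = np.where(temp_df[col].isna(),1,0)

    na_flags= [col for col in temp_df.columns if "_NA_FLAG" in col ]

    for col in na_flags:
        table= pd.DataFrame({
            "Target_Mean": temp_df.groupby(col)[target].mean(),
            "count":temp_df.groupby(col)[target].count()
        })

        # flag == 1 marks the rows where the original column is missing;
        # .loc is used because an all-missing column has no flag == 0 group
        missing_target_mean = table.loc[1,"Target_Mean"]
        if 0.30 < abs(missing_target_mean) < 0.70:
            n_printed += 1
            print(table,end="\n\n")

    print(n_printed)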

In [None]:
na_cols = missing_values_table(df_merged,is_return=True)[0]

In [None]:
missing_vs_target(df_merged,"Good/Bad",na_cols)

In [None]:
def missing_row_analysis(dataframe:pd.DataFrame):
    missing_per_row = dataframe.isna().sum(axis=1).sort_values(ascending=False)
    table= pd.DataFrame({
        "#missing":missing_per_row,
        # share of a row's columns that are missing, in percent
        "ratio": missing_per_row/ dataframe.shape[1]*100
    })
    return table 

    

In [None]:
missing_row_analysis(df_merged)

### EDA: Outlier Detection

In [None]:
import matplotlib.pyplot as plt 

In [None]:
df_merged.columns 

In [None]:
fig, axs = plt.subplots(nrows=5, ncols=5, figsize=(16, 16))
#print(axs)

for i, col in enumerate(df_merged.columns[:25]):
    row = i // 5  
    col_num = i % 5  
    #print(row,col_num)
    sns.boxplot(y=df_merged[col], ax=axs[row, col_num])
    axs[row, col_num].set_title("Values")
    axs[row, col_num].set_ylabel(col)

plt.tight_layout() 
plt.show()


In [None]:
def iqr_threshold(dataframe:pd.DataFrame,col_name:str,threshold,coeff):
    q1= dataframe[col_name].quantile(threshold)
    q3 = dataframe[col_name].quantile(1-threshold)
    iqr = q3-q1 
    upper = q3 + coeff*iqr 
    lower = q1 - coeff*iqr
    return lower,upper 


In [None]:
def detect_outliers(dataframe:pd.DataFrame,col_name,index=True,threshold:float=0.05,coeff=1.5):
    lower,upper = iqr_threshold(dataframe,col_name,threshold,coeff)
    filt = (dataframe[col_name] > upper) | (dataframe[col_name] < lower)
    if index:
        return (dataframe[filt].index, dataframe.loc[dataframe[filt].index,col_name])
    else: 
        return dataframe[filt]
        

In [None]:
for col in df_merged.columns:
    
    indices, outliers = detect_outliers(df_merged,col,coeff=5)
    if len(indices)>3:
        print(col,"mean:",df_merged[col].mean())
        print(indices)
        print(outliers)

        

#### EDA- OUTLIER DETECTION : LOCAL OUTLIER FACTOR 

In [None]:
from sklearn.neighbors import LocalOutlierFactor

In [None]:
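clf = LocalOutlierFactor(n_neighbors=100, contamination=0.1)
X = df_merged.fillna(0)
norm_X = (X-X.mean())/X.std() # not used below; LOF is fit on the unscaled X
y_pred = clf.fit_predict(X)
scores = sorted(clf.negative_outlier_factor_) # ascending: strongest outliers first
scores_2= clf.negative_outlier_factor_ # original row order, needed to recover row positions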

In [None]:
scores

In [None]:
scores_2 

In [None]:
scores = pd.DataFrame(scores)
scores.plot( xlim = [0,100],ylim = [-10,0],style='.-')
plt.show()

In [None]:
scores

In [None]:
scores[scores[0]<-2] # then we need to access the indices of these data points

In [None]:
scores_2 = pd.DataFrame(scores_2)
scores_2

In [None]:

outliers = scores_2[scores_2[0]<-2] # then we need to access the indices of these data points

In [None]:
outlier_indices = outliers.index

In [None]:
outlier_indices

In [None]:
for index in outlier_indices:
    
    print(df_merged.iloc[index]["Good/Bad"])

#### EDA: Duplicate Detection

In [None]:
# DUPLICATED COLUMNS
duplicated_df = df_merged.T[df_merged.T.duplicated()]
duplicated_df

In [None]:
duplicated_df.index

In [None]:
df_merged.T[df_merged.T.duplicated()].shape

In [None]:
# DUPLICATED ROWS:
df_merged.duplicated().sum()

In [None]:
filt = df_merged[col]==0

In [None]:
zero_check = [col  for col in df_merged.columns if (df_merged[col].fillna(0)==0.0).sum()==df_merged.shape[0]]

In [None]:
zero_check

In [None]:
len(zero_check)

### EDA: Distribution Analysis

In [None]:
from scipy import stats

In [None]:
dist_types = [stats.norm, stats.expon, stats.gamma, stats.lognorm, stats.pareto]

In [None]:
def get_distribution(dataframe:pd.DataFrame, col_name):
    best_fit = None 
    best_p_value = -np.inf 
    values = dataframe[col_name].fillna(0.00001)

    for dist in dist_types:
        params = dist.fit(values)
        _, p_value = stats.kstest(values, dist.cdf, args=params)
        # a larger KS p-value means the data is more consistent with the candidate distribution
        if p_value > best_p_value:
            best_fit = dist 
            best_p_value = p_value

    print(f"{col_name} -> Best-fit distribution: {best_fit.name}")
    return best_fit.name 

In [None]:
best_fit_list = []
for col in df_merged.columns[:-1]:
    best_fit = get_distribution(df_merged,col)
    best_fit_list.append(best_fit)

In [None]:
best_fits = pd.Series(best_fit_list)

In [None]:
best_fits.value_counts()

### EDA: Target Imbalance Check

In [None]:
df_merged["Good/Bad"].value_counts(normalize=True)

### EDA: Correlation Analysis

In [None]:
corr_matrix =  df_merged.corr(method="pearson")

In [None]:
filt = (abs(corr_matrix)>0.95) & (abs(corr_matrix)<1.00)

In [None]:
threshold = 5
corr_counts = filt.sum(axis=1)

In [None]:
corr_counts[corr_counts>2].sort_values(ascending=False)

In [None]:
df_merged.columns[corr_counts>threshold]

****

****

## DATA TRANSFORMATION & FEATURE ENGINEERING

### NOTES (COPY)
- all meaningful input features are float64 dtype
- the wafer id column can be removed
- raw data is highly unscaled
- lots of outliers: even with the IQR quantile threshold set to 5% and the coefficient set to 5, many outliers remained
- as an alternative, I performed multivariate outlier detection using LOF; the elbow method suggested a threshold score of < -2
- there are no duplicated rows, but a lot of (more than 100) duplicated columns; these may all be the all-zero columns
- there are 112 columns with the constant value 0
- best-fit distribution types are found; re-assess after data transformation
- highly imbalanced dataset: use stratified splitting (stratified k-fold) when creating the train-test datasets
- correlation study: a pair of columns counts as highly correlated when 0.95 < abs(corr) < 1.00; columns with more than threshold=5 such partners are flagged, and some columns meet this condition
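The correlation rule in the last note can be sketched as follows: for each column, count how many other columns have an absolute Pearson correlation above 0.95. This variant masks the diagonal explicitly instead of filtering on `abs(corr) < 1.00`, so exact duplicate columns would also be counted; that is a deliberate difference from the notebook's filter. The function name and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def count_strong_partners(frame, corr_threshold=0.95):
    """For each column, count the other columns whose |Pearson r| exceeds corr_threshold."""
    corr = frame.corr(method="pearson").abs()
    strong = (corr > corr_threshold).to_numpy(copy=True)
    np.fill_diagonal(strong, False)  # ignore each column's correlation with itself
    return pd.Series(strong.sum(axis=1), index=corr.index)

rng = np.random.default_rng(11)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "s1": base,
    "s2": 2.0 * base + rng.normal(scale=0.01, size=200),  # nearly a rescaled copy of s1
    "s3": -base + rng.normal(scale=0.01, size=200),       # nearly a mirrored copy of s1
    "s4": rng.normal(size=200),                           # independent noise
})
counts = count_strong_partners(toy)
print(counts)
```

Dropping every column above the count threshold removes whole correlated clusters at once; keeping one representative per cluster is an alternative worth comparing.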

### DATA TRANSFORMATION TASK LIST
##### (!) turn all the work into modular functions / classes

[x] - handle unwanted data

[x] - drop zero std columns

[x] - drop highly correlated columns

[x] - drop duplicated columns/rows

[x]  - handle missing values

[x] - handle outliers

[] - scale the dataframe

     - provide different scaling options
     
     - check distributions afterwards
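The three scaling options on the task list can be written in plain pandas arithmetic, which also makes it explicit that the statistics are learned from the training rows and then reused on the test rows. This is a minimal sketch under assumptions: the names `fit_scaler` and `apply_scaler` and the toy frames are hypothetical, the "standard" branch uses the population standard deviation, and the "robust" branch uses the median and the 25th to 75th percentile range.

```python
import pandas as pd

def fit_scaler(train, kind="standard"):
    """Learn (center, scale) per column from the training frame only."""
    if kind == "standard":
        center, scale = train.mean(), train.std(ddof=0)
    elif kind == "robust":
        center = train.median()
        scale = train.quantile(0.75) - train.quantile(0.25)
    elif kind == "minmax":
        center, scale = train.min(), train.max() - train.min()
    else:
        raise ValueError(f"unknown scaling kind: {kind}")
    scale = scale.replace(0, 1.0)  # leave constant columns unscaled instead of dividing by zero
    return center, scale

def apply_scaler(frame, center, scale):
    """Reuse previously learned statistics on any frame with the same columns."""
    return (frame - center) / scale

train = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0], "b": [10.0, 10.0, 10.0, 10.0, 10.0]})
test = pd.DataFrame({"a": [3.0, 6.0], "b": [10.0, 12.0]})

center, scale = fit_scaler(train, kind="minmax")
train_scaled = apply_scaler(train, center, scale)
test_scaled = apply_scaler(test, center, scale)
print(test_scaled)
```

Test values can fall outside the training range (here `a = 6.0` maps to 1.25), which is expected when the scaler is fit on the training data alone.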



In [None]:
df_merged.columns

In [None]:
df_merged.shape 

#### -> DATA TRANSFORMATION: ZERO STD CHECK

In [None]:
zero_check = [col  for col in df_merged.columns if (df_merged[col].fillna(0)==0.0).sum()==df_merged.shape[0]]

In [None]:
len(zero_check)

In [None]:
(df_merged.std() == 0 ).sum() # 112 of them are all zero columns

In [5]:
def drop_zero_std(dataframe:pd.DataFrame)->pd.DataFrame:
    zero_std_cols = dataframe.columns[dataframe.std()==0]
    dataframe2 = dataframe.drop(columns=zero_std_cols)
    return dataframe2

In [None]:
df_zero_std_dropped = drop_zero_std(df_merged)

****

#### -> DATA TRANSFORMATION: Drop highly correlated columns

In [None]:
corr_matrix = df_merged.corr(method="pearson")
filt = (abs(corr_matrix)>0.95) & (abs(corr_matrix)<1.00) # threshold 95% 
threshold = 5
corr_counts = filt.sum(axis=1)
corr_counts[corr_counts>2].sort_values(ascending=False)

In [6]:
def drop_highly_correlated_columns(dataframe:pd.DataFrame,corr_threshold=0.95,count_threshold=3):
    """Drop every column that is highly correlated with more than count_threshold other columns."""
    corr_matrix= dataframe.corr(method="pearson")
    # abs(corr) < 1.00 excludes each column's correlation with itself
    filt = (abs(corr_matrix)>corr_threshold) & (abs(corr_matrix)<1.00)
    corr_counts = filt.sum(axis=1)
    # take the labels from corr_counts so the selection stays aligned with the correlation matrix
    highly_correlated_cols = corr_counts[corr_counts>count_threshold].index

    return dataframe.drop(columns=highly_correlated_cols)

In [None]:
df_no_corr = drop_highly_correlated_columns(dataframe=df_merged,count_threshold=5)

In [None]:
df_no_corr.shape 

****

#### -> DATA TRANSFORMATION: Drop duplicated columns

In [None]:
df_merged.T[df_merged.T.duplicated()].index

In [7]:
def drop_duplicated_cols(dataframe:pd.DataFrame)->pd.DataFrame:
    duplicated_cols = dataframe.T[dataframe.T.duplicated()].index
    return dataframe.drop(columns=duplicated_cols)

****

#### -> DATA TRANSFORMATION: Handle Missing Values

In [8]:
class HandleMissingValues:
    def __init__(self,dataframe:pd.DataFrame):
        self.dataframe = dataframe


    def missing_values_table(self,is_return= True):

        """
        Generate a summary of missing values in the DataFrame.

        Parameters:
        - is_return (bool)

        Returns:
        - If is_return is True, returns a tuple containing a list of columns with missing values and a DataFrame summarizing the missing values.
        - If is_return is False, returns a DataFrame summarizing the missing values.
        """

        na_cols = [col for col in self.dataframe.columns if self.dataframe[col].isna().sum()>0]
        na_data = self.dataframe[na_cols].isna().sum().sort_values(ascending=False)
        ratio = (self.dataframe[na_cols].isna().sum()/self.dataframe.shape[0]*100).sort_values(ascending=False)
        missing_df = pd.concat(objs=[na_data,np.round(ratio,2)],axis=1,keys= ["#missing","ratio"])
        print(missing_df,end="\n")
        if is_return:
            return (na_cols,missing_df)
        else:
            return missing_df


    def missing_vs_target(self,target:str,na_cols:list, threshold=0.8):
        """
        Check whether rows with a missing value in a column are associated with the target variable.

        Parameters:
        - target (str): The name of the (binary) target variable.
        - na_cols (list): A list of columns with missing values.
        - threshold (float): Minimum target mean among the rows where the column is missing. Columns above this threshold are considered key columns.

        Returns:
        - key_cols (list): A list of column names whose missingness is strongly associated with the target variable.
        """

        key_cols = []
        temp_df = self.dataframe.copy()
        for col in na_cols:
            temp_df[col + "_NA_FLAG"] = np.where(temp_df[col].isna(),1,0)
        
        na_flags= [col for col in temp_df.columns if "_NA_FLAG" in col ]
        for col in na_flags:
            table= pd.DataFrame({
                "Target_Mean": temp_df.groupby(col)[target].mean(),
                "count":temp_df.groupby(col)[target].count()
            })

            # flag == 1 selects the rows where the original column is missing
            if abs(table.loc[1,"Target_Mean"]) > threshold:
                key_cols.append(col.replace("_NA_FLAG",""))
                print(table,end="\n\n")
        return key_cols



    def detect_highly_missing(self,missing_table, ratio_threshold=80):

        """
        Detect columns with a high ratio of missing values.

        Parameters:
        - missing_table (pd.DataFrame): A DataFrame containing missing values information.
        - ratio_threshold (float): The threshold for considering a column as highly missing.

        Returns:
        - highly_missing_cols (list): A list of column names with a high ratio of missing values.
        """

        highly_missing_cols = missing_table.query(f"ratio > {ratio_threshold}").index 

        return highly_missing_cols
        

    def handle_imputation(self,method="constant"):

        """
        Impute missing values with the chosen method.

        Parameters:
        - method (str): "mean" or "median" (column statistic), "constant" (fills with 0) or "knn" (KNNImputer with 10 neighbors).

        Returns:
        - imputed_df (pd.DataFrame): The DataFrame with missing values imputed.
        """

        if method in ["mean","median"]:
            imputer = SimpleImputer(strategy=method)
        elif method=="constant":
            imputer = SimpleImputer(strategy=method,fill_value=0)
        elif method=="knn":
            imputer = KNNImputer(n_neighbors=10)
        else:
            raise ValueError(f"Unknown imputation method: {method}")

        return pd.DataFrame(imputer.fit_transform(self.dataframe),columns=self.dataframe.columns)

    


In [None]:
handle_missing = HandleMissingValues(dataframe=df_zero_std_dropped)

In [None]:
na_cols , missing_table =  handle_missing.missing_values_table()

In [None]:
highly_missing_cols = handle_missing.detect_highly_missing(missing_table,ratio_threshold=80)

In [None]:
key_cols = handle_missing.missing_vs_target(target="Good/Bad",na_cols=highly_missing_cols)

In [None]:
key_cols

In [None]:
handle_missing.handle_imputation(method="knn").isna().sum().sum()

#### -> DATA TRANSFORMATION: Handle Outliers

In [9]:
class HandleOutliers:
    def __init__(self,dataframe:pd.DataFrame):
        self.dataframe = dataframe 
    

    
    def iqr_threshold(self,col_name:str,threshold,coeff):
        q1= self.dataframe[col_name].quantile(threshold)
        q3 = self.dataframe[col_name].quantile(1-threshold)
        iqr = q3-q1 
        upper = q3 + coeff*iqr 
        lower = q1 - coeff*iqr
        return lower,upper 

    def detect_outliers(self,col_name,index=True,threshold:float=0.05,coeff=1.5):
        lower,upper = self.iqr_threshold(col_name,threshold,coeff)
        filt = (self.dataframe[col_name] > upper) | (self.dataframe[col_name] < lower)
        if index:
            return (self.dataframe[filt].index, self.dataframe.loc[self.dataframe[filt].index,col_name])
        else: 
            return self.dataframe[filt]
            



    def iqr_approach(self,col_list,threshold:float= 0.05,coeff=3):
        indices = []
        values = []
        for col in col_list:
            index_list , value_list = self.detect_outliers(col,threshold=threshold, coeff=coeff)
            if len(index_list)>0:
                indices.append((col,index_list))
        return indices

    
    def find_critical_lof(self, n_neighbors=10, contamination=0.1, threshold=0.01):
        """Estimate an elbow in the sorted LOF scores; requires a dataframe without missing values."""

        lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
        lof.fit_predict(self.dataframe)
        scores= lof.negative_outlier_factor_
        sorted_scores = sorted(scores) # ascending: strongest outliers first

        # the elbow is the first position where consecutive sorted scores differ by less than threshold
        abs_diff_scores = np.abs(np.diff(sorted_scores))
        critical_index = np.argmax(abs_diff_scores < threshold)

        critical_lof = sorted_scores[critical_index] if critical_index > 0 else None

        # Plot the sorted LOF scores against their rank
        plt.plot(range(len(scores)),sorted_scores,marker = 'o', linestyle='-')
        plt.xlabel('rank of data point')
        plt.ylabel('LOF Scores')
        plt.title('Sorted LOF scores')
        plt.grid(True)
        plt.show()

        return (critical_lof,pd.DataFrame(scores))


    """def multivariate_w_lof(self,n_neighbors=20,contamination=0.1):
        # This should be applied to datasets with no missing values

        lof = LocalOutlierFactor(n_neighbors=10)
        y_pred = lof.fit_predict(self.dataframe)
        scores= sorted(lof.negative_outlier_factor_)
        scores = pd.DataFrame(scores)
        scores.plot( xlim = [0,100],ylim = [-10,0],style='.-')
        plt.show()
        return scores"""

    def drop_outliers(self,col,row_list):
        temp_df = self.dataframe.copy()
        for row in row_list:
            temp_df.at[row,col] = float("nan")

        return temp_df
    
    def impute_outliers(self,value):
        pass 
    



In [None]:
handle_outliers = HandleOutliers(df_zero_std_dropped.fillna(0))

In [None]:
ind = handle_outliers.iqr_approach(df_zero_std_dropped.columns[:-1])

In [None]:
ind 

In [None]:
counter = 0

for col,ind_list in ind:
    counter += len(ind_list)

counter 

In [None]:
elbow_point,scores = handle_outliers.find_critical_lof(threshold=0.01)

In [None]:
elbow_point

In [None]:
scores 

In [None]:
scores[scores[0]<elbow_point]

In [None]:
out_ind = scores[scores[0]<elbow_point].index.tolist()

In [None]:
df_zero_std_dropped.iloc[out_ind]["Good/Bad"].value_counts()  # this suggests that being flagged as an outlier by LOF does not imply that the wafer is faulty

****

#### -> DATA TRANSFORMATION: Handle Scaling

In [10]:
class HandleScaling:
    def __init__(self,dataframe:pd.DataFrame):
        self.dataframe = dataframe

    def standard_scaler(self):
        ss = StandardScaler()
        return pd.DataFrame(ss.fit_transform(self.dataframe),columns=self.dataframe.columns)
    
    def robust_scaler(self):
        rs = RobustScaler()
        return pd.DataFrame(rs.fit_transform(self.dataframe),columns=self.dataframe.columns)


    def min_max_scaler(self,feature_range=(0,1)):
        mms = MinMaxScaler(feature_range=feature_range)
        return pd.DataFrame(mms.fit_transform(self.dataframe),columns=self.dataframe.columns)



In [11]:
def handle_imbalance(X,y):
    smt = SMOTETomek(random_state=11,sampling_strategy="minority")
    return smt.fit_resample(X,y)


### <<< NOW SCALE ALL THE COLUMNS THEN CHECK DISTRIBUTION TYPES >>>

## MODEL SELECTION/TRAINING

In [None]:
def evaluate_clf(true,predicted):
    acc = accuracy_score(true,predicted)
    f1 = f1_score(true,predicted)
    precision = precision_score(true,predicted)
    recall = recall_score(true,predicted)
    roc_auc = roc_auc_score(true,predicted)
    
    return acc, f1, precision, recall , roc_auc 

In [None]:
def total_cost(true,pred):
    # labels=[0,1] keeps the matrix 2x2 even when a class is absent from true/pred
    tn, fp, fn, tp = confusion_matrix(true,pred,labels=[0,1]).ravel()
    cost = 10*fn + 1*fp 
    return cost 

In [None]:
def evaluate_models(X,y,models:dict)->pd.DataFrame:
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25, random_state=11,stratify=y)
    
    models_list = []
    accuracy_list = []
    cost_list = []

    for model_name, model_obj in models.items():
        model_obj.fit(X_train,y_train)

        y_train_pred = model_obj.predict(X_train)
        y_test_pred = model_obj.predict(X_test)

        # model performance on training dataset
        train_acc, train_f1, train_precision, train_recall , train_roc_auc = evaluate_clf(y_train,y_train_pred) 
        train_cost = total_cost(y_train, y_train_pred)

        # model performance on testing dataset
        test_acc, test_f1, test_precision, test_recall , test_roc_auc = evaluate_clf(y_test,y_test_pred) 
        test_cost = total_cost(y_test, y_test_pred)

        print(model_name)
        models_list.append(model_name)  


        print('Model performance for Training set')
        print("- Accuracy: {:.4f}".format(train_acc))
        print('- F1 score: {:.4f}'.format(train_f1)) 
        print('- Precision: {:.4f}'.format(train_precision))
        print('- Recall: {:.4f}'.format(train_recall))
        print('- Roc Auc Score: {:.4f}'.format(train_roc_auc))
        print(f'- COST: {train_cost}.')

        print('----------------------------------')

        print('Model performance for Test set')
        print('- Accuracy: {:.4f}'.format(test_acc))
        print('- F1 score: {:.4f}'.format(test_f1))
        print('- Precision: {:.4f}'.format(test_precision))
        print('- Recall: {:.4f}'.format(test_recall))
        print('- Roc Auc Score: {:.4f}'.format(test_roc_auc))
        print(f'- COST: {test_cost}.')
        cost_list.append(test_cost)
        print('='*35)
        print('\n')
        
    report = pd.DataFrame(list(zip(models_list,cost_list)),columns=["Model Name","Cost"])
    return report 



#### Now, we initialize the models we want to investigate in a dictionary

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import IsolationForest

In [None]:
# After checking the performance of various models, ensemble classifiers (especially XGBClassifier) performed and generalized best.
# A neural network could also be trained, but the results from XGBClassifier were satisfactory.

models = {
    "XGBClassifier" : XGBClassifier(
        random_state=11,
        scale_pos_weight=30,
        reg_alpha=0.01),
 
    #"IsolationForest":IsolationForest(n_estimators=100,max_samples=500, 
    #            contamination=0.1,random_state=11, verbose=0)
    #"SVC": SVC(probability=True,class_weight={0: 1, 1: 50}),
    #"knn":KNeighborsClassifier()
}

### Some common data transformations before Model Training Step

In [None]:
df_merged = drop_zero_std(df_merged)

In [None]:
df_merged = drop_duplicated_cols(df_merged)

In [None]:
df_merged.shape 

****

****

missing value imputation = fillna with [constant(0, mean, median),knn imputer]

outlier detection = 

In [None]:
def create_train_test(dataframe):
    X= dataframe.drop("Good/Bad", axis="columns")
    y= dataframe["Good/Bad"]
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.20, random_state=11,stratify=y)

    return X_train, X_test, y_train, y_test
    

In [None]:
model = models["XGBClassifier"]

In [None]:
thresholds = [i*0.0005 for i in range(1,1200)]

In [None]:
df_corr = pd.DataFrame()

In [None]:
df_corr["Sensor_id"] = df_merged.columns[:-1].tolist()


In [None]:
corr_score = [abs(round(df_merged[[col,"Good/Bad"]].corr().iloc[0,1],4)) for col in df_merged.columns[:-1]]

In [None]:
df_corr["corr"] = corr_score

In [None]:
df_corr.sort_values(by="corr",ascending=False)

In [None]:
highly_corr_cols = df_corr.sort_values(by="corr",ascending=False)[:200]["Sensor_id"].to_list()

In [None]:
def get_important_cols(dataframe:pd.DataFrame)->list:
    df_corr = pd.DataFrame()
    df_corr["Sensor_id"] = dataframe.columns[:-1].tolist()
    corr_score = [abs(round(dataframe[[col,"Good/Bad"]].corr().iloc[0,1],4)) for col in dataframe.columns[:-1]]
    df_corr["corr"] = corr_score
    important_cols = df_corr.sort_values(by="corr",ascending=False)[:200]["Sensor_id"].to_list()
    return important_cols


In [None]:
def eval_models(X_train,X_test,y_train,y_test, models:dict, threshold_list:list)->int:

    cost_list = [] 

    for model_name, model_obj in models.items():

        print(f"Results for {model_name} \n")

        model_obj.fit(X_train,y_train)

        y_pred_train_proba = model_obj.predict_proba(X_train)
        y_pred_test_proba = model_obj.predict_proba(X_test)

        for threshold in threshold_list:

            y_train_pred = (y_pred_train_proba[:,1]>threshold).astype(int)
            y_test_pred = (y_pred_test_proba[:,1]>threshold).astype(int)

            train_metrics = evaluate_clf(y_train,y_train_pred)
            test_metrics = evaluate_clf(y_test,y_test_pred)

            train_cost = total_cost(y_train,y_train_pred)
            test_cost = total_cost(y_test,y_test_pred)

            f1_score_train = round(train_metrics[1],4)
            f1_score_test  = round(test_metrics[1],4)
            roc_auc_train  = round(train_metrics[-1],4)
            roc_auc_test   = round(test_metrics[-1],4)

            cost_list.append(test_cost)
            
            print(f"RESULTS FOR THRESHOLD: {threshold}")
            print(f"TRAINING: F1-score: {f1_score_train}, ROC AUC: {roc_auc_train}, Cost: {train_cost}")
            print(f"TESTING : F1-score: {f1_score_test}, ROC AUC: {roc_auc_test}, Cost: {test_cost}")
    
            print("CONFUSION MATRICES:\n",confusion_matrix(y_train,y_train_pred), "\n",confusion_matrix(y_test,y_test_pred))
    
    return sorted(cost_list,reverse=False)[0]
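The threshold sweep inside `eval_models` boils down to the search sketched below: turn probabilities into labels at each candidate threshold and keep the threshold with the lowest 10·FN + 1·FP. The function name `best_threshold` and the toy scores are illustrative assumptions. Selecting the threshold on the same data that is used to report the final cost tends to give an optimistic estimate, so a separate validation split is the safer place for this search.

```python
import numpy as np

def best_threshold(y_true, proba, thresholds, fn_weight=10, fp_weight=1):
    """Return (threshold, cost) with the lowest fn_weight*FN + fp_weight*FP."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    best = (None, np.inf)
    for threshold in thresholds:
        y_pred = (proba > threshold).astype(int)
        fn = int(np.sum((y_true == 1) & (y_pred == 0)))
        fp = int(np.sum((y_true == 0) & (y_pred == 1)))
        cost = fn_weight * fn + fp_weight * fp
        if cost < best[1]:  # strict "<" keeps the first threshold in the list on ties
            best = (threshold, cost)
    return best

# toy scores: the two positives score 0.30 and 0.80
y_true = np.array([0, 0, 0, 0, 1, 0, 1])
proba = np.array([0.05, 0.10, 0.20, 0.40, 0.30, 0.15, 0.80])
threshold, cost = best_threshold(y_true, proba, thresholds=[0.1, 0.25, 0.5])
print(threshold, cost)  # 0.25 1
```

In this toy case the middle threshold wins: it catches both positives at the price of a single false alarm, while the highest threshold misses a positive and pays 10.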


In [None]:
X_train, X_test, y_train, y_test = create_train_test(df_merged)

In [None]:
X_train_res, y_train_res = handle_imbalance(X_train[highly_corr_cols],y_train)

In [None]:
eval_models(X_train_res, X_test[highly_corr_cols], y_train_res, y_test,models,thresholds)

In [None]:
importance_score = model.feature_importances_*100
features = highly_corr_cols # the model was fit on these columns only, in this order

In [None]:
sorted(list(zip(importance_score,features)),reverse=True)[:35]

In [None]:
important_features = [feat for score,feat in sorted(list(zip(importance_score,features)),reverse=True)[:100] ]

*****

In [None]:
from sklearn.model_selection import StratifiedKFold

# Define the number of folds (K)
n_splits = 5

skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=11)
X_test = X_test[highly_corr_cols]
for test_index, val_index in skf.split(X_test,y_test):
    #print(test_index,val_index)
    X_test_cv, X_val = X_test.iloc[test_index], X_test.iloc[val_index]
    #print(X_train_cv)
    y_test_cv, y_val = y_test.iloc[test_index], y_test.iloc[val_index]

    # Apply SMOTE to X_train_cv and y_train_cv
    #X_train_res, y_train_res = handle_imbalance(X_train_cv, y_train_cv)

    #model.fit(X_train_res,y_train_res)
    
    print("##################################\n")
    for threshold in [i*0.001 for i in range(4,10)]:
        
        y_val_pred_proba = model.predict_proba(X_val)
        y_val_pred = (y_val_pred_proba[:,1]>threshold).astype(int)

        metrics = evaluate_clf(y_val,y_val_pred)
        cost = total_cost(y_val, y_val_pred)
        print(f"Threshold {threshold}: F1-score: {round(metrics[1],4)}, ROC AUC: {round(metrics[-1],4)}, Cost: {cost}")
        print(confusion_matrix(y_val,y_val_pred))
    print("##################################\n\n")

In [None]:

n_splits = 5

skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

for test_index, val_index in skf.split(X_test, y_test):
    #print(train_index,val_index)
    X_test_cv, X_val = X_test.iloc[test_index], X_test.iloc[val_index]
    #print(X_train_cv)
    y_test_cv, y_val = y_test.iloc[test_index], y_test.iloc[val_index]

    

    for threshold in thresholds:
        
        y_val_pred_proba = model.predict_proba(X_val)
        y_val_pred = (y_val_pred_proba[:,1]>threshold).astype(int)

        metrics = evaluate_clf(y_val,y_val_pred)
        cost = total_cost(y_val, y_val_pred)
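        # each fold holds 1/n_splits of the test set, so cost and counts are scaled by n_splits
        # to approximate the full test set
        print(f"Threshold {threshold}: F1-score: {round(metrics[1],4)}, ROC AUC: {round(metrics[-1],4)}, Cost: {cost*n_splits}")
        print(confusion_matrix(y_val,y_val_pred)*n_splits)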
    print("##################################\n\n")

## EXPERIMENT #1 

- imputation: SimpleImputer with mean, median and constant values, plus KNN imputation
- imbalance handling: SMOTETomek
- no outlier handling
- scaling : Standard Scaler
- no correlation handling
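
The balancing step above is applied to the training split only; the test split keeps its original class ratio. The sketch below illustrates that idea with plain random oversampling, which is a simpler stand-in for SMOTETomek (SMOTETomek synthesizes new minority samples and then removes Tomek links). The function name `random_oversample`, the toy data and the label encoding (0 = "Good", 1 = "Bad") are assumptions for illustration and are not the notebook's `handle_imbalance` helper.

```python
import random


def random_oversample(X, y, seed=11):
    """Duplicate minority-class rows at random until all classes have equal size.

    Simplified stand-in for SMOTETomek: it copies existing rows instead of
    synthesizing new ones, and it does not remove Tomek links.
    """
    rng = random.Random(seed)
    rows_by_class = {}
    for row, label in zip(X, y):
        rows_by_class.setdefault(label, []).append(row)

    target = max(len(rows) for rows in rows_by_class.values())
    X_res, y_res = [], []
    for label, rows in rows_by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_res.append(row)
            y_res.append(label)
    return X_res, y_res


# Toy training split (made up): 6 "Good" (0) wafers and 2 "Bad" (1) wafers
X_train_toy = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [5.0], [6.0]]
y_train_toy = [0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = random_oversample(X_train_toy, y_train_toy)
print(y_res.count(0), y_res.count(1))
```

Only the resampled training data would be passed to `fit`; evaluation still uses the untouched test split.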

In [None]:
df_1 = restore_original_data()
df_1 = drop_zero_std(df_1)
df_1 = drop_duplicated_cols(df_1)

In [None]:
df_1 = df_1[highly_corr_cols]

In [None]:
X_train, X_test, y_train, y_test = create_train_test(df_1)

In [None]:
df_train = pd.concat([X_train,y_train],axis=1)
df_test  = pd.concat([X_test,y_test],axis=1)

In [None]:
df_train.isna().sum().sum()

In [None]:
"""first we need to split df_merged[highly_corr_cols] into df_train, df_test then 
    perform imputation, balancing and scaling on train dataset, train the model
    then apply scale_ params (mean,std etc.) to test dataset to evaluate model performance
    """

handle_missing = HandleMissingValues(df_train)

methods= ["mean","median","constant","knn"]

imputed_df_dict = {
    "mean_imputed_df":None,
    "median_imputed_df":None,
    "constant_imputed_df":None,
    "knn_imputed_df":None
}

for method in methods:

    imputed_df_dict[f"{method}_imputed_df"] = handle_missing.handle_imputation(method)



In [None]:
imputed_df_dict["mean_imputed_df"].isna().sum().sum()+ imputed_df_dict["median_imputed_df"].isna().sum().sum() + \
imputed_df_dict["constant_imputed_df"].isna().sum().sum() + imputed_df_dict["knn_imputed_df"].isna().sum().sum()


In [None]:
for name, df in imputed_df_dict.items():

    
    X_balanced, y_balanced = handle_imbalance(df.iloc[:,:-1],df.iloc[:,-1])
    imputed_df_dict[name] = (X_balanced,y_balanced) # store as tuple
        

In [None]:
imputed_df_dict["mean_imputed_df"][1].value_counts()  # which is y_train of mean imputed df_train

In [None]:
mean_imp_X_train,mean_imp_y_train = imputed_df_dict["mean_imputed_df"]


In [None]:
s_scaler= StandardScaler()
mean_imp_X_train_scaled= s_scaler.fit_transform(mean_imp_X_train)

In [None]:
X_test_scaled = s_scaler.transform(X_test)

In [None]:
eval_models(mean_imp_X_train_scaled,X_test_scaled,mean_imp_y_train,y_test,models,thresholds)

### Experiment #1 conclusions: standard scaler, mean imputation, balanced train dataset
- for scale_pos_weight = 20, reg_alpha = 0.1:
    - best threshold range: [0.28, 0.35]; taking 0.3 gives ROC AUC = 0.655, F1-score = 0.2917, cost = 133
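
The threshold ranges reported in these conclusions come from sweeping the decision threshold and scoring each candidate with the cost function 10·FN + 1·FP stated in the goal. A minimal sketch of that sweep follows; `total_cost` here is a hypothetical re-implementation (the notebook's own helper is not shown in this section), and the labels and probabilities are made up for illustration.

```python
def total_cost(y_true, y_pred, fn_weight=10, fp_weight=1):
    """Business cost: fn_weight * false negatives + fp_weight * false positives."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fn_weight * fn + fp_weight * fp


def sweep_thresholds(y_true, proba_pos, thresholds):
    """Return a (threshold, cost) pair for every candidate threshold."""
    results = []
    for threshold in thresholds:
        # same rule as in the notebook: predict positive when proba > threshold
        y_pred = [int(p > threshold) for p in proba_pos]
        results.append((threshold, total_cost(y_true, y_pred)))
    return results


# Toy labels and positive-class probabilities (made up for illustration)
y_true = [0, 0, 0, 0, 1, 0, 1, 0, 0, 1]
proba_pos = [0.05, 0.10, 0.40, 0.20, 0.35, 0.15, 0.80, 0.30, 0.02, 0.25]

results = sweep_thresholds(y_true, proba_pos, [0.1, 0.2, 0.3, 0.5])
best_threshold, best_cost = min(results, key=lambda pair: pair[1])
print(results)
print(best_threshold, best_cost)
```

Because a false negative costs ten times as much as a false positive, the cheapest threshold tends to sit well below 0.5, which matches the low thresholds reported in these experiments.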

****

In [None]:
median_imp_X_train,median_imp_y_train = imputed_df_dict["median_imputed_df"]
s_scaler= StandardScaler()
median_imp_X_train_scaled= s_scaler.fit_transform(median_imp_X_train)
X_test_scaled = s_scaler.transform(X_test)

In [None]:
eval_models(median_imp_X_train_scaled,X_test_scaled,median_imp_y_train,y_test,models,thresholds)

In [None]:
r_scaler = RobustScaler()
median_imp_X_train_scaled= r_scaler.fit_transform(median_imp_X_train)
X_test_scaled = r_scaler.transform(X_test)
eval_models(median_imp_X_train_scaled,X_test_scaled,median_imp_y_train,y_test,models,thresholds)
# cost 127

### Experiment #2 conclusions: standard scaler, median imputation, balanced train dataset
- for scale_pos_weight = 20, reg_alpha = 0.1:
    - best threshold range: [0.35, 0.3675]; taking 0.36 gives F1-score = 0.3182, ROC AUC = 0.6617, cost = 129

- if the robust scaler is applied instead: cost = 127
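
One plausible reason the robust scaler can help on this data is the large number of outliers noted during exploration. The sketch below contrasts the two scaling rules in plain Python: mean and standard deviation versus median and interquartile range, which are the statistics that `StandardScaler` and `RobustScaler` use by default. The helper names and the toy readings are assumptions for illustration; in both cases the statistics come from the training values and are then applied to the test values.

```python
import statistics


def standard_scale(train, values):
    """Scale with the training mean and population standard deviation."""
    mean = statistics.fmean(train)
    std = statistics.pstdev(train)
    return [(v - mean) / std for v in values]


def robust_scale(train, values):
    """Scale with the training median and interquartile range (IQR)."""
    median = statistics.median(train)
    q1, _, q3 = statistics.quantiles(train, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


# Toy sensor readings with one extreme outlier (made up for illustration)
train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1000.0]
test = [2.0, 5.0, 8.0]

standard_scaled = standard_scale(train, test)
robust_scaled = robust_scale(train, test)
print(standard_scaled)
print(robust_scaled)
```

With standard scaling the single outlier inflates the standard deviation, so the three test values end up almost indistinguishable; with robust scaling they keep a usable spread.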

****

In [None]:
constant_imp_X_train,constant_imp_y_train = imputed_df_dict["constant_imputed_df"]
s_scaler= StandardScaler()
constant_imp_X_train_scaled= s_scaler.fit_transform(constant_imp_X_train)
X_test_scaled = s_scaler.transform(X_test)

In [None]:
eval_models(constant_imp_X_train_scaled,X_test_scaled,constant_imp_y_train,y_test,models,thresholds)

In [None]:
r_scaler= RobustScaler()
constant_imp_X_train_scaled= r_scaler.fit_transform(constant_imp_X_train)
X_test_scaled = r_scaler.transform(X_test)
eval_models(constant_imp_X_train_scaled,X_test_scaled,constant_imp_y_train,y_test,models,thresholds)

### Experiment #3 conclusions: standard scaler, constant imputation, balanced train dataset
- for scale_pos_weight = 30, reg_alpha = 0.1:
    - best threshold range: [0.25, 0.275]; taking 0.26 gives F1-score = 0.3404, ROC AUC = 0.686, cost = 121

- if the robust scaler is applied instead: cost = 121

In [None]:
def objective(trial):
    # Define the hyperparameters to optimize
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 600,step=50),
        "max_depth": trial.suggest_int("max_depth", 3, 7),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.1, step=0.01),
        "scale_pos_weight": trial.suggest_int("scale_pos_weight", 5, 50, step=5),
        "reg_alpha": trial.suggest_float("reg_alpha", 0, 1.0, step = 0.1),
        "reg_lambda": trial.suggest_float("reg_lambda", 0, 1.0,step = 0.1),
    }

    # Create and train the XGBoost model
    model = XGBClassifier(**params)
    model.fit(constant_imp_X_train_scaled, constant_imp_y_train)

    # Make predictions on the test set, which is used here as the validation set
    y_pred_proba = model.predict_proba(X_test_scaled)
 
    threshold = trial.suggest_float("threshold", 0.005, 0.4, step=0.005) # Add threshold as a hyperparameter
    y_pred = (y_pred_proba[:, 1] > threshold).astype(int)

    # Business cost (10*FN + 1*FP) of the thresholded predictions
    cost = total_cost(y_test,y_pred)

    # The study is created with direction="minimize", so return the cost directly
    return cost

In [None]:
study = optuna.create_study(direction="minimize")  # minimize the cost metric; use "maximize" to optimize ROC AUC instead
study.optimize(objective, n_trials=100)  # You can adjust the number of trials

# Get the best hyperparameters
best_params = study.best_params

In [None]:
best_params_026 = best_params
# ROC AUC: 0.6913793103448276, cost: 124

In [None]:
best_params["threshold"]

In [None]:
best_params = {
 'n_estimators': 550,
 'max_depth': 5,
 'learning_rate': 0.060000000000000005,
 'scale_pos_weight': 50,
 'reg_alpha': 0.2,
 'reg_lambda': 0.6000000000000001
} 

In [None]:
model = XGBClassifier(**best_params)

In [None]:
model.fit(constant_imp_X_train_scaled, constant_imp_y_train)

In [None]:
y_pred_proba = model.predict_proba(X_test_scaled)

threshold = 0.13
y_pred = (y_pred_proba[:, 1] > threshold).astype(int)

# ROC AUC computed from the thresholded labels rather than from the probabilities
roc_auc = roc_auc_score(y_test, y_pred)

cost = total_cost(y_test,y_pred)

print(roc_auc, cost)

****

In [None]:
knn_imp_X_train,knn_imp_y_train = imputed_df_dict["knn_imputed_df"]
s_scaler= StandardScaler()
knn_imp_X_train_scaled= s_scaler.fit_transform(knn_imp_X_train)
X_test_scaled = s_scaler.transform(X_test)

In [None]:
eval_models(knn_imp_X_train_scaled,X_test_scaled,knn_imp_y_train,y_test,models,thresholds)

In [None]:

r_scaler= RobustScaler()
knn_imp_X_train_scaled= r_scaler.fit_transform(knn_imp_X_train)
X_test_scaled = r_scaler.transform(X_test)
eval_models(knn_imp_X_train_scaled,X_test_scaled,knn_imp_y_train,y_test,models,thresholds)

### Experiment #4 conclusions: standard scaler, knn imputation, balanced train dataset
- for scale_pos_weight = 20, reg_alpha = 0.1:
    - best threshold range: [0.093, 0.1]; taking 0.095 gives F1-score = 0.3182, ROC AUC = 0.6617, cost = 129
- if the robust scaler is applied instead: cost = 129
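
To compare the four imputation variants side by side, the costs reported in the conclusions can be collected and ranked. The snippet below only restates the numbers written in this notebook. The dictionary layout is an assumption for illustration, and the runs are not fully like for like, since the constant-imputation run uses scale_pos_weight = 30 while the others use 20.

```python
# Best costs as reported in the experiment conclusions, keyed by imputation
# method and then by scaler. No robust-scaler cost is reported for mean imputation.
reported_costs = {
    "mean": {"standard": 133},
    "median": {"standard": 129, "robust": 127},
    "constant": {"standard": 121, "robust": 121},
    "knn": {"standard": 129, "robust": 129},
}

# Flatten to (cost, imputation, scaler) tuples so they can be ranked
candidates = [
    (cost, imputation, scaler)
    for imputation, by_scaler in reported_costs.items()
    for scaler, cost in by_scaler.items()
]

best_cost = min(cost for cost, _, _ in candidates)
best_configs = [(imp, sc) for cost, imp, sc in candidates if cost == best_cost]
print(best_cost, best_configs)
```

Constant imputation gives the lowest reported cost with either scaler, which is the configuration carried forward below.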

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

### The best-performing transformation pipeline is therefore:
- [drop_zero_std -> drop_duplicated_cols -> constant(0) imputation -> SMOTETomek balancing -> Robust Scaling]

- then train an XGBClassifier with the best params {scale_pos_weight: 30, reg_alpha: 0.1} and a threshold from the range [0.25, 0.275]
  
  This gives the best results: F1-score: 0.3404, ROC AUC: 0.686, Cost: 121, which is 33% lower than the cost of the initial problem case.
  Furthermore, I ran hyperparameter optimization, but it gave no significant improvement, so I will keep this result.

In [None]:
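# no trailing comma after the call, otherwise best_model becomes a 1-element tuple
best_model = XGBClassifier(
        random_state=11,
        scale_pos_weight=30,
        reg_alpha=0.1)  # 0.1 as stated in the Experiment #3 conclusions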

In [None]:
best_model_path = "best_model.pkl"

In [None]:
with open(best_model_path,"wb") as f:
    dill.dump(best_model,f)

In [None]:
transformer_pipeline = Pipeline(
    steps=[
        ("constant_imputer",SimpleImputer(strategy="constant",fill_value=0)),
        ("scaler",RobustScaler())       
    ])

In [None]:
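class TrainingPreprocessor(BaseEstimator,TransformerMixin):
    """drop_zero_std -> drop_duplicated_cols -> column selection -> constant(0) imputation
    -> SMOTETomek balancing (training only) -> robust scaling."""

    def __init__(self,use_y = True):
        self.use_y = use_y

    def _select(self,X):
        X_selected = drop_zero_std(X)
        X_selected = drop_duplicated_cols(X_selected)
        # the last entry of highly_corr_cols is dropped here (presumably the target column)
        return X_selected[highly_corr_cols[:-1]]

    def fit(self,X,y=None):
        if self.use_y:
            self.y = y 
        # learn the imputer and scaler on the training data only, so that test data
        # is later transformed with training statistics instead of being refit
        self.imputer_ = SimpleImputer(strategy="constant",fill_value=0)
        X_fit = self.imputer_.fit_transform(self._select(X))
        if self.use_y:
            X_fit, _ = handle_imbalance(X_fit,y)  # scaler is fit on the balanced data, as before
        self.scaler_ = RobustScaler().fit(X_fit)
        return self 
    
    def transform(self,X,is_testing=False):
        X_transformed = self.imputer_.transform(self._select(X))
        y_out = None  # no labels are returned for test data
        if not is_testing:
            # resample the training set; self.y keeps the original labels
            X_transformed , y_out = handle_imbalance(X_transformed,self.y)
        X_transformed = self.scaler_.transform(X_transformed)
        return X_transformed, y_out
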


In [None]:
preprocessor_obj = TrainingPreprocessor()

In [None]:
preprocessor_obj.fit(X_train,y_train)

In [None]:
file_path = "preprocessor_obj.pkl"
with open(file_path,"wb") as f :
    dill.dump(preprocessor_obj,f)

In [None]:
X_transformed, y_res = preprocessor_obj.transform(X_train)

In [None]:
X_transformed.std() 

In [None]:
y_res.shape 

In [None]:
best_model