# PREDICTING SCRAP RATIO: A DATA-DRIVEN APPROACH TO IMPROVE PRODUCTION PLANNING

### Overview

This notebook presents a data science project that predicts the scrap ratio from **synthetic production data**. The aim is to support production planning: predicting the scrap ratio before production starts can help reduce waste and improve efficiency.

### Key Highlights:

* **Feature Engineering**: The notebook applies several feature engineering steps, each wrapped in a small custom function to keep the code tidy and readable.
* **Encoding Methods**: One-Hot Encoding is applied to the low-cardinality categorical variables (`material_type`, `recycle`, `jumbo`) and Frequency Encoding to the others (`base_code`, `embossing`, `gloss`, `colorname`), showcasing their differences and use cases.
* **Machine Learning Models**:
    * Random Forest Regressor: A tree-based ensemble model that performs well on tabular data.
    * LightGBM Regressor: A gradient boosting model known for its speed and efficiency.
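
To make the two encoding methods listed above concrete, here is a minimal sketch on a made-up four-row frame. The toy values are assumptions for illustration only; the column names mirror ones used later in the notebook.

```python
import pandas as pd

# Toy data (illustrative values, not taken from the production dataset)
toy = pd.DataFrame({
    "material_type": ["PVC", "PVC", "ABS", "PVC"],  # low cardinality
    "base_code": ["b1", "b1", "b2", "b3"],          # stands in for a high-cardinality column
})

# One-hot encoding: one 0/1 column per category (first category dropped)
one_hot = pd.get_dummies(toy[["material_type"]], drop_first=True).astype("int8")

# Frequency encoding: each category is replaced by its share of the rows
freq_encoded = toy["base_code"].map(toy["base_code"].value_counts(normalize=True))

print(one_hot["material_type_PVC"].tolist())  # [1, 1, 0, 1]
print(freq_encoded.tolist())                  # [0.5, 0.5, 0.25, 0.25]
```

One-hot encoding adds a column per category, so it suits columns with few distinct values. Frequency encoding keeps a single numeric column however many categories there are, at the cost of giving equally frequent categories the same value.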

### Goal:

The goal of this project is to predict the scrap ratio from planning data, so that production planners can anticipate waste and optimize the production process. More accurate scrap ratio predictions support better planning decisions, which in turn can reduce costs and improve product quality.

# 1. Import Libraries

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import KFold, cross_val_score, train_test_split, RandomizedSearchCV

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
import lightgbm as lgb

import warnings
warnings.simplefilter("ignore")

# 2. Load Data

In [6]:
%%time
df_main = pd.read_excel("production_data.xlsx")

CPU times: user 13.1 s, sys: 125 ms, total: 13.2 s
Wall time: 13.3 s


In [7]:
df = df_main.copy()
print("Shape of dataframe: \nRows:", df.shape[0], "\nColumns:", df.shape[1])

Shape of dataframe: 
Rows: 119021 
Columns: 24


# 3. Feature Engineering

In [9]:
# Simplify and translate column names
# Drop unnecessary columns

def column_name_preprocess(df):
    df = df.rename(columns = {'Sipariş': "order_no", 'Tanım': "shorttext", 'Tanım.1':"recycle_info", 'Tanım.2': "color", 
                              'Baz': "base_code", 'RomaGofraj':"embossing", 'GLOSS': "gloss", 'Renk': "top_color",
                              'Baskı':"printed", 'WE-Datum': "date", 'Çekme mkt.': "raw_kg",'Scrap': "raw_scrap_kg",
                              'Çekme mkt..1': "color_kg", 'Scrap.1':"color_scrap_kg", 'Kalınlık': "thickness"})
    df = df.drop(['İşyeri','Teyit trh.', 'Birim', 'TÖB', 'TÖB.1', 
                  'TÖB.2', 'İht.mkt.', 'İhtiyaç mkt.', 'Hdf.miktar'],axis=1)
    return df

# Add necessary features
# Drop unnecessary ones

def feature_preprocess(df):
    # Derive categorical flags from the free-text description columns
    df["material_type"] = df["shorttext"].apply(lambda x: "PVC" if "PER" in x else "ABS")
    df["jumbo"] = df["shorttext"].apply(lambda x: "jr" if "0001" in x else "slitted")
    df["recycle"] = df["recycle_info"].apply(lambda x: "Recycled" if "Geri" in x else "Not_recycled")

    # Build the target: total scrap divided by total quantity
    df["total_kg"] = df["raw_kg"] + df["raw_scrap_kg"]
    df["total_scrap_kg"] = df["raw_scrap_kg"] + df["color_scrap_kg"]
    df["scrap_ratio"] = df["total_scrap_kg"] / df["total_kg"]

    df = df.drop(["recycle_info", "shorttext", "raw_kg", "raw_scrap_kg", "order_no", "color", "top_color",
                  "color_kg", "color_scrap_kg"], axis=1)

    # Drop rows with thickness 9.99
    df = df[df["thickness"] != 9.99]
    # Rescale the printed flag from {0, 2} to {0, 1}
    df["printed"] = df["printed"] / 2
    # Keep only rows with more than 100 kg in total
    df = df[df["total_kg"] > 100]
    return df

# Handle missing colornames in data

def replace_invalid_colors_optimized(df, invalid_colors):
    # For each base_code, take the first valid color name as its reference color
    valid_colors = df.loc[~df['colorname'].isin(invalid_colors) & df['colorname'].notna()]
    valid_colors = valid_colors.groupby('base_code')['colorname'].first().reset_index()

    df = df.merge(valid_colors, on='base_code', how='left', suffixes=('', '_valid'))

    # Vectorized replacement: missing or invalid names take the reference color
    needs_fix = df['colorname'].isna() | df['colorname'].isin(invalid_colors)
    df['colorname'] = df['colorname'].mask(needs_fix, df['colorname_valid'])

    df = df.drop(columns=['colorname_valid'])
    return df

# Handle missing values

def missing_values(df):

    invalid_colors = ['VN800673', 'A-M-T', '25kg', 'PVC', 'SİYAHI', 'R-FK-3', '3RLP',
                      'İşlevsiz', 'ABS', "", "NA", "krem"]

    # Copy after filtering so the column assignments below do not act on a view
    df = df.dropna(subset=["recycle_info"]).copy()
    # Clean the color description and take its last word as the color name
    df["color"] = df["color"].str.replace('*', '', regex=False).str.rstrip()
    df["colorname"] = df["color"].apply(lambda x: x.split(" ")[-1] if pd.notna(x) else "NA")
    df = replace_invalid_colors_optimized(df, invalid_colors)
    # printed is 0 when base and top colors match, otherwise 2.0 (halved later in feature_preprocess)
    df['printed'] = np.where(df['base_code'] == df['top_color'], 0, 2.0)
    df = df.fillna("NotAvailable")
    return df

# scrap_ratio adjustment

def adjust_scrap_values(df):
    df.loc[df['scrap_ratio'] > 1, 'total_scrap_kg'] = df['total_kg']
    df.loc[df['scrap_ratio'] > 1, 'scrap_ratio'] = 1
    df.loc[df['scrap_ratio'] < 0, 'total_scrap_kg'] = 0
    df.loc[df['scrap_ratio'] < 0, 'scrap_ratio'] = 0
    return df

# group base codes

def categorize_base_code(df):
    frequency_counts = df['base_code'].value_counts()

    def categorize(freq, base_code):
        if freq < 10:
            return 'E'
        elif 10 <= freq < 50:
            return 'D'
        elif 50 <= freq < 100:
            return 'C'
        elif 100 <= freq < 150:
            return 'B'
        elif 150 <= freq < 200:
            return 'A'
        else:
            return base_code

    df['base_code'] = df['base_code'].map(frequency_counts).combine(df['base_code'], categorize)
    return df
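
The range clamp in `adjust_scrap_values` can also be expressed with `Series.clip`. The sketch below is an alternative under stated assumptions, not the notebook's own method: the toy values are made up, and unlike the function above it recomputes `total_scrap_kg` for every row rather than only for the out-of-range ones.

```python
import pandas as pd

# Toy rows (illustrative values): one in range, one above 1, one below 0
toy = pd.DataFrame({
    "total_kg": [200.0, 150.0, 120.0],
    "total_scrap_kg": [50.0, 180.0, -6.0],
})
toy["scrap_ratio"] = toy["total_scrap_kg"] / toy["total_kg"]

# Clamp the ratio to [0, 1], then make the scrap weight consistent with it
toy["scrap_ratio"] = toy["scrap_ratio"].clip(lower=0, upper=1)
toy["total_scrap_kg"] = toy["scrap_ratio"] * toy["total_kg"]

print(toy["scrap_ratio"].tolist())     # [0.25, 1.0, 0.0]
print(toy["total_scrap_kg"].tolist())  # [50.0, 150.0, 0.0]
```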


In [10]:
%%time
df = column_name_preprocess(df)
df = missing_values(df)
df = feature_preprocess(df)
df = adjust_scrap_values(df)
df = categorize_base_code(df)

CPU times: user 1.38 s, sys: 24.7 ms, total: 1.41 s
Wall time: 1.41 s
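
The bucketing performed by `categorize_base_code` is easier to see on a toy example. The sketch below is a vectorized variant built on `pd.cut` with the same thresholds; `bucket_rare_codes` is a hypothetical helper name and the codes are made up, so treat it as an illustration rather than the function used in this run.

```python
import pandas as pd

def bucket_rare_codes(codes: pd.Series) -> pd.Series:
    """Replace codes seen fewer than 200 times with a frequency bucket label."""
    freq = codes.map(codes.value_counts())
    # Left-closed bins: [0, 10) -> E, [10, 50) -> D, [50, 100) -> C,
    # [100, 150) -> B, [150, 200) -> A
    buckets = pd.cut(freq, bins=[0, 10, 50, 100, 150, 200],
                     labels=["E", "D", "C", "B", "A"], right=False)
    # Frequencies of 200 or more fall outside the bins, so those codes keep their own value
    return codes.where(freq >= 200, buckets.astype(object))

# Toy codes (made up): "X" is frequent, "Y" and "Z" are rare
codes = pd.Series(["X"] * 200 + ["Y"] * 60 + ["Z"] * 3)
bucketed = bucket_rare_codes(codes)
print(bucketed.value_counts().to_dict())  # {'X': 200, 'C': 60, 'E': 3}
```

Grouping rare codes this way limits the number of distinct values the model sees while leaving the frequent codes identifiable.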


# 4. Encoding

In [12]:
# Frequency-encode the categorical columns listed below:
# each category is replaced by its relative frequency in the data
freq_cols = ["base_code", "embossing", "gloss", "colorname"]
for col in freq_cols:
    df[f"{col}_encoded"] = df[col].map(df[col].value_counts(normalize=True))

df = df.drop(freq_cols, axis=1)

In [13]:
num_cols = df.select_dtypes(include=["number"])
cat_cols = df.select_dtypes(include= ["object", "category"])

In [14]:
to_be_one_hot_encoded = df[["material_type", "recycle", "jumbo"]]
dummies = pd.get_dummies(to_be_one_hot_encoded, drop_first=True).astype("int8")
y = df.scrap_ratio
X_ = num_cols.drop(["scrap_ratio", "total_scrap_kg"], axis=1)
X = pd.concat([X_, dummies], axis=1)
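
The frequency encoding above is computed on the full dataset before the train/test split, so test rows contribute to the encoded values. A stricter variant learns the frequencies from the training rows only and reuses them on new data. The sketch below illustrates that idea on toy series; `fit_frequency_map`, `apply_frequency_map`, and the `unseen` default are hypothetical names and choices, not part of the notebook.

```python
import pandas as pd

def fit_frequency_map(train_col: pd.Series) -> pd.Series:
    """Learn relative category frequencies from the training rows only."""
    return train_col.value_counts(normalize=True)

def apply_frequency_map(col: pd.Series, freq_map: pd.Series, unseen: float = 0.0) -> pd.Series:
    """Encode a column with a fitted map; categories not seen in training get `unseen`."""
    return col.map(freq_map).fillna(unseen)

# Toy example (made-up categories)
train_codes = pd.Series(["a", "a", "b", "c"])
test_codes = pd.Series(["a", "d"])

freq_map = fit_frequency_map(train_codes)
encoded_test = apply_frequency_map(test_codes, freq_map)
print(encoded_test.tolist())  # [0.5, 0.0]
```

Whether 0.0 is a sensible value for unseen categories depends on the data; it is used here only to keep the example short.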

# 5. Modelling

In [19]:
%%time
# Tune and evaluate a Random Forest Regressor


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf_model = RandomForestRegressor(random_state=42)

param_dist = {
    'n_estimators': [150],             # [100, 150],
    'max_depth': [None],               # [10, None]
    'min_samples_split': [2],          # [2, 5]
    'min_samples_leaf': [1],           # [1, 2]
    'bootstrap': [True]                # [True]  
}

kfold = KFold(n_splits=3, shuffle=True, random_state=42)
random_search = RandomizedSearchCV(estimator=rf_model, param_distributions=param_dist, 
                                   n_iter=10, cv=kfold, n_jobs=-1, verbose=2, scoring='neg_mean_squared_error', 
                                   random_state=42)

random_search.fit(X_train, y_train)

best_params = random_search.best_params_
print("Best Parameters: ", best_params)

cv_scores = cross_val_score(random_search.best_estimator_, X_train, y_train, cv=kfold, scoring='neg_mean_squared_error')
print(f"Cross-validation MSE: {-cv_scores.mean()}")

y_pred = random_search.best_estimator_.predict(X_test)

test_mse = mean_squared_error(y_test, y_pred)
print(f"Test MSE: {test_mse}")

Fitting 3 folds for each of 1 candidates, totalling 3 fits
Best Parameters:  {'n_estimators': 150, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': None, 'bootstrap': True}
Cross-validation MSE: 0.008868508138510778
Test MSE: 0.008731706735990774
CPU times: user 1min 26s, sys: 848 ms, total: 1min 26s
Wall time: 1min 48s
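
An MSE is easier to judge against a naive reference such as always predicting the training mean. The sketch below computes that baseline on synthetic stand-in targets; `mean_baseline_mse` is a hypothetical helper and the beta-distributed values are an assumption, so the number it prints is not comparable to the results above. In the notebook one would pass `y_train` and `y_test` instead.

```python
import numpy as np

def mean_baseline_mse(y_train, y_test):
    """MSE obtained by predicting the training mean for every test row."""
    y_train = np.asarray(y_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    return float(np.mean((y_test - y_train.mean()) ** 2))

# Synthetic stand-in for the target: ratios in [0, 1] (assumed distribution)
rng = np.random.default_rng(42)
y_toy = rng.beta(2, 8, size=1_000)
y_toy_train, y_toy_test = y_toy[:800], y_toy[800:]

baseline = mean_baseline_mse(y_toy_train, y_toy_test)
print(f"Baseline MSE on toy data: {baseline:.4f}")
```

A model is only useful for planning if its test MSE is clearly below this kind of baseline on the same split.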


In [20]:
%%time
# Tune and evaluate a LightGBM Regressor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lgb_model = lgb.LGBMRegressor(random_state=42)

param_dist = {
    'num_leaves': [31, 50, 70],       
    'max_depth': [-1, 10, 20],             
    'learning_rate': [0.01, 0.1, 0.2], 
    'n_estimators': [100, 150],    
    'min_child_samples': [20, 50],   
    'subsample': [0.8, 1.0],   
    'colsample_bytree': [0.8, 1.0],
}

kfold = KFold(n_splits=3, shuffle=True, random_state=42)

random_search = RandomizedSearchCV(estimator=lgb_model, param_distributions=param_dist, 
                                   n_iter=10, cv=kfold, n_jobs=-1, verbose=2, scoring='neg_mean_squared_error', 
                                   random_state=42)

random_search.fit(X_train, y_train)

best_params = random_search.best_params_
print("Best Parameters: ", best_params)

cv_scores = cross_val_score(random_search.best_estimator_, X_train, y_train, cv=kfold, scoring='neg_mean_squared_error')
print(f"Cross-validation MSE: {-cv_scores.mean()}")

y_pred = random_search.best_estimator_.predict(X_test)

test_mse = mean_squared_error(y_test, y_pred)
print(f"Test MSE: {test_mse}")

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001134 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 478
[LightGBM] [Info] Number of data points in the train set: 93566, number of used features: 10
[LightGBM] [Info] Start training from score 0.203043
Best Parameters:  {'subsample': 0.8, 'num_leaves': 70, 'n_estimators': 150, 'min_child_samples': 20, 'max_depth': -1, 'learning_rate': 0.2, 'colsample_bytree': 0.8}
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000836 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 476
[LightGBM] [Info] Number of data points in the train set: 62377, number of used features: 10
[LightGBM] [Info]
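
The "Fitting 3 folds for each of N candidates" messages follow from the size of each search space. The sketch below counts the combinations with scikit-learn's `ParameterGrid`, using the two dictionaries from the cells above; the variable names `rf_space` and `lgb_space` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Search spaces copied from the Random Forest and LightGBM cells
rf_space = {
    'n_estimators': [150],
    'max_depth': [None],
    'min_samples_split': [2],
    'min_samples_leaf': [1],
    'bootstrap': [True],
}
lgb_space = {
    'num_leaves': [31, 50, 70],
    'max_depth': [-1, 10, 20],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [100, 150],
    'min_child_samples': [20, 50],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

n_rf = len(ParameterGrid(rf_space))
n_lgb = len(ParameterGrid(lgb_space))
print(n_rf)   # 1
print(n_lgb)  # 432
```

With `n_iter=10`, the Random Forest search can only evaluate its single combination, while the LightGBM search samples 10 of 432 combinations, roughly 2% of the space.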

# 6. Conclusion

In this notebook, we explored several feature engineering techniques and applied Random Forest and LightGBM regressors to predict the scrap ratio from planning data. After tuning, LightGBM achieved the best performance with a Mean Squared Error (MSE) of 0.00846, compared with a test MSE of about 0.00873 for Random Forest.
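
To read these MSE values on the scale of the target, they can be converted to a root mean squared error (RMSE). The sketch below uses the MSE quoted above, the Random Forest test MSE from the run output, and the starting score of 0.203043 from the LightGBM log, which for LightGBM's default regression objective should correspond to the mean of the training target. The variable names are introduced here for illustration.

```python
import math

mse_lgbm = 0.00846             # LightGBM MSE quoted in the conclusion
mse_rf = 0.008731706735990774  # Random Forest test MSE from the run output
mean_target = 0.203043         # starting score reported in the LightGBM log

rmse_lgbm = math.sqrt(mse_lgbm)
rmse_rf = math.sqrt(mse_rf)

print(f"LightGBM RMSE: {rmse_lgbm:.3f}")                     # 0.092
print(f"Random Forest RMSE: {rmse_rf:.3f}")                  # 0.093
print(f"RMSE / mean target: {rmse_lgbm / mean_target:.2f}")  # 0.45
```

An RMSE of about 0.092 means a typical prediction error of roughly nine percentage points of scrap ratio, which is sizeable relative to a mean ratio of about 0.20.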

The trained model can now be integrated into the production planning process. By forecasting the scrap ratio before production, planners can make data-driven decisions, anticipate potential waste, and optimize resources, leading to a more efficient and cost-effective production process.
