### ==============================================================================
### 🚀 SCENARIO - 2: The Impact of Raw Material Quality
### ==============================================================================

This scenario selects every production run in the held-out test data that used low-quality raw materials (a quality score below 80), then simulates the outcome had those same runs used high-quality materials (a score of 95) while every other input stays unchanged.

This directly answers the business question: `Is it worth investing in better raw materials to increase our effective capacity?`
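
The what-if approach described above can be sketched on toy data before running it against the real model. This is a minimal illustration, not the notebook's pipeline: `toy_predict`, the column names `quality` and `temperature`, and all values are hypothetical stand-ins for the saved model and the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the trained model: defect rate falls as quality rises.
def toy_predict(runs: pd.DataFrame) -> pd.Series:
    return 5.0 - 0.03 * runs["quality"] + 0.01 * runs["temperature"]

# Invented example runs; only the column roles mirror the real data.
runs = pd.DataFrame({
    "quality": [70.0, 75.0, 90.0],
    "temperature": [200.0, 210.0, 205.0],
})

# 1. Select the runs below the quality threshold.
low = runs[runs["quality"] < 80].copy()

# 2. Copy them and override only the feature under study.
simulated = low.copy()
simulated["quality"] = 95.0

# 3. Predict both versions and compare run by run.
delta = toy_predict(low) - toy_predict(simulated)
print(delta.round(2).tolist())
```

Because both predictions come from the same rows, the difference isolates the effect the model attributes to material quality alone. It is a statement about the model's learned relationship, not a measured causal effect.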

In [1]:
import joblib
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [2]:
print("1. Loading the saved model, preprocessor, and original test data...")

# Load the pre-trained model and the preprocessor
loaded_model = joblib.load("../model/defect_rate_predictor.joblib")
loaded_preprocessor = joblib.load("../model/preprocessor.joblib")

print("✅ Model and preprocessor loaded successfully.")

1. Loading the saved model, preprocessor, and original test data...
✅ Model and preprocessor loaded successfully.


In [3]:
# Load the raw dataset to get the original, untransformed test data
df_raw = pd.read_csv("../data/smart_phone_surface_plastic_manufacture.csv")
X = df_raw.drop(columns=['Defect Rate (%)', 'Production Output (Units)', 'Unnamed: 0', 'Production Run ID', 'Date'])
y = df_raw['Defect Rate (%)'].dropna()

In [4]:
# Split the data into training and testing sets
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X.loc[y.index], y, test_size=0.2, random_state=42)

In [5]:
# --- 2. Identify "Low Quality" Runs from the Real Test Set ---
# A "low-quality run" is any run with a raw material quality score below 80
low_quality_runs = X_test_raw[X_test_raw['Raw Material Quality (Score)'] < 80].copy()

if low_quality_runs.empty:
    print("No low-quality material runs found in the test set to simulate. Please adjust the threshold if needed.")
else:
    # --- 3. Create a "Simulated High Quality" Version of These Runs ---
    # What if these exact runs had used premium materials (score of 95)?
    simulated_high_quality_runs = low_quality_runs.copy()
    simulated_high_quality_runs['Raw Material Quality (Score)'] = 95.0

    # --- Re-engineer features for both original and simulated data ---
    for df in [low_quality_runs, simulated_high_quality_runs]:
        df['Energy_per_Unit'] = df['Energy Consumption (kWh)'] / (df['Batch Size (Units)'] + 1e-6)
        df['Process_Stress_Index'] = df['Temperature (°C)'] * df['Pressure (Pa)']


    # --- 4. Preprocess and Predict for Both Groups ---
    # Preprocess the original low-quality runs
    original_processed = loaded_preprocessor.transform(low_quality_runs)
    # Preprocess the simulated high-quality runs
    simulated_processed = loaded_preprocessor.transform(simulated_high_quality_runs)

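    # Rebuild the column names so the transformed arrays can be wrapped in DataFrames
    # and columns dropped by name. This assumes the ColumnTransformer lists its steps
    # in the order 'skewed', 'symmetric', 'cat'.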
    ohe_feature_names = loaded_preprocessor.named_transformers_['cat'].get_feature_names_out()
    all_feature_names = (
        list(loaded_preprocessor.named_transformers_['skewed'].feature_names_in_) +
        list(loaded_preprocessor.named_transformers_['symmetric'].feature_names_in_) +
        list(ohe_feature_names)
    )

    original_processed_df = pd.DataFrame(original_processed, columns=all_feature_names)
    simulated_processed_df = pd.DataFrame(simulated_processed, columns=all_feature_names)

    # Drop the high-VIF columns so both frames match the feature set the model expects
    cols_to_drop = ['Process_Stress_Index', 'Pressure (Pa)']
    original_final = original_processed_df.drop(columns=cols_to_drop)
    simulated_final = simulated_processed_df.drop(columns=cols_to_drop)

    # Predict the defect rate for both groups
    original_predicted_defects = loaded_model.predict(original_final)
    simulated_predicted_defects = loaded_model.predict(simulated_final)


    # --- 5. Display the Clear Business Impact ---
    avg_original_defect_rate = np.mean(original_predicted_defects)
    avg_simulated_defect_rate = np.mean(simulated_predicted_defects)
    improvement = avg_original_defect_rate - avg_simulated_defect_rate

    print("\n--- Business Impact Simulation: Raw Material Quality ---")
    print(f"Average Predicted Defect Rate for Runs with Low-Quality Materials: {avg_original_defect_rate:.2f}%")
    print(f"Predicted Defect Rate if High-Quality Materials Were Used: {avg_simulated_defect_rate:.2f}%")
    print("-" * 50)
    print(f"✅ By investing in higher-quality raw materials, the model predicts an average improvement of {improvement:.2f} percentage points in the defect rate.")
    print("\nThis provides a second, independent business case for how to increase effective capacity by making a data-driven change to the production inputs.")


--- Business Impact Simulation: Raw Material Quality ---
Average Predicted Defect Rate for Runs with Low-Quality Materials: 2.55%
Predicted Defect Rate if High-Quality Materials Were Used: 2.46%
--------------------------------------------------
✅ By investing in higher-quality raw materials, the model predicts an average improvement of 0.09 percentage points in the defect rate.

This provides a second, independent business case for how to increase effective capacity by making a data-driven change to the production inputs.
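
To relate the 0.09 percentage point improvement to effective capacity, the predicted rates can be converted into good units per run. The sketch below uses the two averages reported above; the run size of 10,000 units is a hypothetical figure chosen only to make the arithmetic concrete, and `good_units` is an illustrative helper, not part of the notebook.

```python
# Predicted average defect rates reported above, in percent.
original_rate_pct = 2.55
simulated_rate_pct = 2.46

# Hypothetical run size, chosen only to make the arithmetic concrete.
units_per_run = 10_000


def good_units(units: int, defect_rate_pct: float) -> float:
    """Units that pass inspection at a given defect rate."""
    return units * (1 - defect_rate_pct / 100)


extra_good_units = (
    good_units(units_per_run, simulated_rate_pct)
    - good_units(units_per_run, original_rate_pct)
)
relative_reduction = (original_rate_pct - simulated_rate_pct) / original_rate_pct

print(f"Extra good units per {units_per_run:,} produced: {extra_good_units:.1f}")
print(f"Relative reduction in defects: {relative_reduction:.1%}")
```

Under that assumed run size, the gain is about 9 additional good units per 10,000 produced, a relative reduction in defects of roughly 3.5%. Whether that justifies the investment depends on the price difference between material grades, which this notebook does not model.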
