# Final Project - Supply Chain GHG Emissions Predictor

### Student Name: Spandan  
### Objective:  
Predict supply chain greenhouse gas (GHG) emission factors (with margins) from the base emission factor, its margin, and five data quality (DQ) indicators using a Random Forest Regressor, and deploy the model with Streamlit.

---

## Contents

- Week 1: Data Collection and Preprocessing  
- Week 2: Model Training and Evaluation  
- Week 3: Streamlit App Deployment  
- Final Summary


## Week 1 - Data Collection and Preprocessing

In this phase, we:
- Loaded an Excel workbook with multiple sheets
- Selected only sheets relevant to commodity emissions
- Combined all the data
- Selected relevant features and cleaned missing values


In [None]:
import pandas as pd

# Load Excel file
excel_path = "Data_Set.xlsx"
excel_file = pd.ExcelFile(excel_path)

# Select relevant sheets
commodity_sheets = [sheet for sheet in excel_file.sheet_names if sheet.endswith('_Summary_Commodity')]

# Define features and target
feature_cols = [
    'Supply Chain Emission Factors without Margins',
    'Margins of Supply Chain Emission Factors',
    'DQ ReliabilityScore of Factors without Margins',
    'DQ TemporalCorrelation of Factors without Margins',
    'DQ GeographicalCorrelation of Factors without Margins',
    'DQ TechnologicalCorrelation of Factors without Margins',
    'DQ DataCollection of Factors without Margins'
]
target_col = 'Supply Chain Emission Factors with Margins'

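# Load and merge data
required_cols = feature_cols + [target_col]
dataframes = []
for sheet in commodity_sheets:
    # Reuse the already-open workbook instead of re-reading the file
    df = excel_file.parse(sheet)
    # Skip sheets that lack any required column; drop rows with missing values
    if all(col in df.columns for col in required_cols):
        dataframes.append(df[required_cols].dropna())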

combined_df = pd.concat(dataframes, ignore_index=True)
combined_df.head()
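
The loading loop skips any sheet that lacks a required column without saying so. A minimal sketch of a helper that reports which columns are missing is shown here. The name `missing_columns` and the sample column lists (including `Commodity Name`) are hypothetical and only serve the illustration.

```python
def missing_columns(columns, required):
    """Return the required column names absent from `columns`, in order."""
    present = set(columns)
    return [col for col in required if col not in present]


# Hypothetical example: a sheet that lacks one DQ column and the target column
required = [
    "Supply Chain Emission Factors without Margins",
    "Margins of Supply Chain Emission Factors",
    "DQ ReliabilityScore of Factors without Margins",
    "Supply Chain Emission Factors with Margins",
]
sheet_columns = [
    "Commodity Name",
    "Supply Chain Emission Factors without Margins",
    "Margins of Supply Chain Emission Factors",
]

skipped = missing_columns(sheet_columns, required)
print(skipped)
```

Inside the loading loop, it could be called as `missing_columns(df.columns, feature_cols + [target_col])`, printing the sheet name whenever the result is non-empty.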


## Week 2 - Model Training and Evaluation

Steps:
- Split dataset into training/testing sets
- Scale the features
- Train Random Forest Regressor
- Evaluate using Mean Squared Error
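
For reference, the scaling and evaluation steps correspond to the standard definitions below. Here $\mu_j$ and $\sigma_j$ are the mean and (population) standard deviation of feature $j$ computed on the training set only, which is why the scaler is fitted on the training data and merely applied to the test data. The symbols are notation introduced here, not taken from the project files.

$$
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},
\qquad
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$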


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Split the data
X = combined_df[feature_cols]
y = combined_df[target_col]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Evaluate
y_pred = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
print("📉 Mean Squared Error:", mse)
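
MSE is expressed in squared units of the target, so it is hard to interpret on its own. RMSE (same units as the target) and R² (fraction of variance explained) are common companions. The sketch below computes all three from their definitions in plain Python. The function name `regression_metrics` is hypothetical and the sample values are made up, not results from this project. scikit-learn also provides `r2_score` in `sklearn.metrics`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, and R^2 from their definitions.

    Assumes y_true is not constant (otherwise R^2 is undefined).
    """
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1.0 - ss_res / ss_tot}


# Made-up values for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 2.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

In the notebook, the same function could be called as `regression_metrics(list(y_test), list(y_pred))`.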


## Week 3 - Streamlit App Deployment

Here we:
- Saved the trained model, scaler, and feature column list with joblib
- Created a Streamlit UI for predictions


In [None]:
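import os
import joblib

# Save model artifacts into models/, the folder app.py loads them from
os.makedirs("models", exist_ok=True)
joblib.dump(model, "models/random_forest_model.pkl")
joblib.dump(scaler, "models/scaler.pkl")
joblib.dump(feature_cols, "models/feature_columns.pkl")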


### Streamlit App Code

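Below is the complete `app.py` source code for the web application. Run it with `streamlit run app.py` from the directory that contains the `models/` folder, since the artifact paths are relative.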


In [None]:
import streamlit as st
import pandas as pd
import joblib
import os

model = joblib.load("models/random_forest_model.pkl")
scaler = joblib.load("models/scaler.pkl")
feature_cols = joblib.load("models/feature_columns.pkl")

st.title("🌿 Supply Chain Emissions Predictor")

st.markdown("""
Estimate **Supply Chain Emission Factors with Margins**  
based on key parameters and DQ metrics.
""")

with st.form("form"):
    # Informational selectors only: their values are not captured or passed to the model
    st.selectbox("Substance", ['Carbon Dioxide', 'Methane', 'Nitrous Oxide', 'Other GHGs'])
    st.selectbox("Unit", ['kg/2018 USD, purchaser price', 'kg CO2e/2018 USD, purchaser price'])
    st.selectbox("Source", ['Commodity', 'Industry'])

    supply_wo_margin = st.number_input("Emission Factor without Margins", min_value=0.0)
    margin = st.number_input("Margin of Emission Factor", min_value=0.0)
    dq_reliability = st.slider("DQ Reliability Score", 0.0, 5.0)
    dq_temporal = st.slider("DQ Temporal Correlation", 0.0, 5.0)
    dq_geo = st.slider("DQ Geographical Correlation", 0.0, 5.0)
    dq_tech = st.slider("DQ Technological Correlation", 0.0, 5.0)
    dq_data = st.slider("DQ Data Collection Quality", 0.0, 5.0)

    submit = st.form_submit_button("Predict Emission Factor")

if submit:
    input_data = {
        'Supply Chain Emission Factors without Margins': supply_wo_margin,
        'Margins of Supply Chain Emission Factors': margin,
        'DQ ReliabilityScore of Factors without Margins': dq_reliability,
        'DQ TemporalCorrelation of Factors without Margins': dq_temporal,
        'DQ GeographicalCorrelation of Factors without Margins': dq_geo,
        'DQ TechnologicalCorrelation of Factors without Margins': dq_tech,
        'DQ DataCollection of Factors without Margins': dq_data
    }

    input_df = pd.DataFrame([input_data])
    input_df = input_df[feature_cols]
    input_scaled = scaler.transform(input_df)
    prediction = model.predict(input_scaled)

    st.success(f"🎯 Predicted Emission Factor: **{prediction[0]:.4f}**")

# Footer
st.markdown(
    "<hr style='margin-top:50px;'><div style='text-align: center; color: gray;'>Made by Spandan</div>",
    unsafe_allow_html=True
)
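
The line `input_df = input_df[feature_cols]` matters because the model was fitted on a scaled array and therefore only sees column positions, not names. The sketch below shows the same idea without pandas: order the form values by the saved feature list and fail loudly if one is missing. The helper name `build_feature_row` and the shortened feature names are hypothetical.

```python
def build_feature_row(inputs, feature_cols):
    """Return input values ordered like `feature_cols`; raise if any is missing."""
    missing = [col for col in feature_cols if col not in inputs]
    if missing:
        raise KeyError(f"Missing inputs: {missing}")
    return [float(inputs[col]) for col in feature_cols]


# Shortened, hypothetical feature names for illustration
feature_cols = ["factor_without_margins", "margin", "dq_reliability"]
form_values = {"dq_reliability": 3, "margin": 0.05, "factor_without_margins": 0.4}

row = build_feature_row(form_values, feature_cols)
print(row)
```

The dictionary's insertion order differs from the training order here on purpose: the returned row still follows `feature_cols`.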


## Final Summary

- We loaded and combined multiple Excel sheets with supply chain emission data.
- Preprocessed and cleaned the data.
- Trained a Random Forest Regressor to predict emission factors with margins from the factor without margins, its margin, and five DQ metrics.
- Evaluated the model using MSE.
- Built and deployed a Streamlit application for real-time prediction.

### Future Scope:
- Train on more comprehensive or domain-specific datasets.
- Test alternative regression models like XGBoost or Gradient Boosting.
- Enhance UI with visualization and explanation of model predictions.

---

**Project by:** Spandan  
**GitHub Repo:** https://github.com/Warmachine019/AICTE-Internship
