# 1. Target Definition

> In this project, our objective is to predict **two ground-truth greenhouse gas emission values** for each company. These targets correspond to the operational emissions defined by the **Greenhouse Gas Protocol**, the most widely adopted global framework for carbon accounting.

---

## **Scope 1 Emissions (Direct Emissions)**

**Scope 1** represents *direct* greenhouse gas emissions released from sources that are **owned or controlled by the company**.

Typical examples include:
- Onsite fuel combustion (boilers, furnaces, generators)
- Emissions from manufacturing or chemical processing equipment
- Company-owned vehicle fleets
- Fugitive emissions from industrial systems (e.g., refrigerants, methane leaks)

These emissions primarily reflect the **physical intensity and operational scale** of the company’s core business activities.  
Sectors such as manufacturing, mining, energy production, and heavy transportation tend to exhibit significantly higher Scope 1 levels.

---

## **Scope 2 Emissions (Indirect Emissions From Purchased Energy)**

**Scope 2** represents *indirect* emissions generated from the production of **purchased electricity, steam, heat, or cooling** that the company consumes.

Scope 2 depends on:
- The company’s total electricity demand  
- The carbon intensity of the regional power grid  
- Local energy mix (renewables vs fossil-based generation)  
- Operational characteristics influencing electricity usage (e.g., factories, data centers, logistics hubs)

Examples of activities contributing to Scope 2 include:
- Electricity used for manufacturing lines  
- Powering large buildings or data centers  
- Purchased heating or cooling for industrial or commercial facilities  

Because these emissions originate outside the company (at the utility/power provider), they differ fundamentally from Scope 1.

---

## **Why Predict Both?**

Together, **Scope 1 + Scope 2** represent the company’s **operational emissions**, a key component of ESG reporting and climate risk analysis.

However, the two scopes have **different physical drivers**:
- Scope 1 reflects *direct operational processes* and asset ownership  
- Scope 2 reflects *electricity consumption and regional grid characteristics*  

This means:
- Different features contribute to each target  
- Sector exposure affects each scope differently  
- Business scale (revenue) alone cannot explain both emission types  

For these reasons, Scope 1 and Scope 2 require **separate modeling approaches** even though both are part of the overall emissions profile.

---


# 2. Why These Predictions Matter

Many companies do not publicly report emissions. Financial institutions, rating agencies, and corporate partners still need approximate estimates for:

- ESG and sustainability assessments
- Investment risk analysis, including transition risk
- Portfolio-level carbon accounting
- Supply-chain emission estimation
- Regulatory reporting and compliance checks

In practice, these predictions are used in aggregate or at portfolio scale, so stable median accuracy is more valuable than precise outlier prediction.

---

## **Who Needs These Estimates and Why?**

### **1. ESG and Sustainability Assessments**
Sustainability analysts and rating agencies incorporate operational emissions into ESG scoring frameworks. Companies with higher estimated emissions may face stricter expectations for environmental management and disclosure.

### **2. Investment Risk and Transition Risk Analysis**
Financial institutions need emissions data to quantify:
- Exposure to carbon-intensive assets  
- Vulnerability to future carbon pricing  
- Regulatory transition risks under climate policies  

Accurate estimates help investors understand long-term climate-related financial risks.

### **3. Portfolio-Level Carbon Accounting**
Asset managers and institutional investors often compute the **total carbon footprint** of their investment portfolios. Missing emission data for even a few companies can distort portfolio metrics, so imputed values are essential.

### **4. Supply-Chain Emission Estimation**
Large enterprises increasingly monitor upstream emissions from suppliers. When suppliers do not disclose emissions, predictive models help estimate:
- Scope 1 and 2 of suppliers  
- Contributions to the buyer’s Scope 3

### **5. Regulatory Reporting and Compliance**
In many jurisdictions, climate disclosure rules are tightening. Companies may need to estimate emissions for:
- Voluntary reporting frameworks  
- Mandatory climate risk disclosures  
- Internal carbon pricing or decarbonization targets

---

## **Why Stable Median Accuracy Is Often More Useful Than Outlier Precision**

Although some companies produce extremely high emissions (e.g., power plants, refineries), most users rely on **aggregated or portfolio-level** analyses rather than exact point predictions for rare outliers.

For this reason:
- Highly stable predictions around the **central tendency** are more valuable  
- Excessive sensitivity to extreme values reduces practical usefulness  
- Median-oriented metrics (like MAE) align better with how emissions data is consumed in the real world  

Regulators, investors, and sustainability analysts often prefer robust, interpretable results rather than noisy predictions dominated by a few outlier companies.

---



# 3. Target Distribution and ML Objective

## 3.1 Observed Distribution

Exploratory data analysis reveals that both **Scope 1** and **Scope 2** are **strongly right-skewed**, with substantial variability across companies.  
Key observations include:

- Both targets contain **heavy tails**, with a small number of very large industrial firms contributing disproportionately to overall emissions.  
- The **scales differ significantly** between companies, reflecting diverse operational sizes and sector activities.  
- After applying a **log transformation**, the distributions become much closer to a bell-shaped form, indicating that emissions scale multiplicatively rather than additively.

This pattern is typical for energy-use and emissions datasets, where operational intensity varies dramatically across industries.
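
The effect of the log transformation can be illustrated with a small sketch. The data here is synthetic (log-normal draws standing in for an emissions column, not the project data), and `skewness` is a helper defined only for this example.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: the mean cubed z-score."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / std) ** 3 for v in values)


# Synthetic stand-in for an emissions column (NOT the project data):
# log-normal draws mimic values that scale multiplicatively.
rng = random.Random(0)
emissions = [rng.lognormvariate(10.0, 2.0) for _ in range(5000)]

# log1p(x) = log(1 + x) stays defined at zero, which helps if some
# companies report zero emissions.
log_emissions = [math.log1p(v) for v in emissions]

raw_skew = skewness(emissions)
log_skew = skewness(log_emissions)
print(f"skewness raw: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

A skewness near zero after the transform is what "closer to bell-shaped" means in practice; the same check can be run on the real Scope 1 and Scope 2 columns.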

## 3.2 Why Median-Focused Prediction Is More Appropriate

For this task, we emphasize stable median predictions rather than precise modeling of extreme outliers. The main reasons are:

- **Extreme emission values** generally arise from specialized industrial assets such as power plants, refineries, or large-scale manufacturing facilities. These activities cannot be captured reliably without detailed operational disclosures, which are not available in this dataset.  
- ESG analysts and financial institutions primarily assess emissions at the **portfolio or sector level**, where stable central tendencies matter more than exact outlier accuracy.  
- Loss functions like **RMSE** square each error, so a small number of extreme values can dominate the loss and produce unstable or biased models.  
- **MAE** is minimized by the median, while **Log-MAE** and **Log-RMSE** compress the heavy tail, so all three weight the bulk of the data more evenly.

Therefore, a median-oriented modeling approach produces more robust and practically useful predictions.
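
The connection between MAE and the median can be checked numerically. The following is a minimal sketch with made-up numbers: for a constant prediction, the median gives the lowest mean absolute error, while the mean gives the lowest mean squared error and is pulled toward the single extreme value.

```python
import statistics


def mean_abs_error(values, guess):
    """Mean absolute error of predicting `guess` for every value."""
    return statistics.fmean(abs(v - guess) for v in values)


def mean_sq_error(values, guess):
    """Mean squared error of predicting `guess` for every value."""
    return statistics.fmean((v - guess) ** 2 for v in values)


# Toy emissions sample (made-up numbers) with one extreme emitter.
sample = [10.0, 12.0, 15.0, 18.0, 20.0, 5000.0]

median = statistics.median(sample)  # stays with the bulk of the data
mean = statistics.fmean(sample)     # dragged upward by the outlier

print(f"median={median:.1f}  mean={mean:.1f}")
print(f"MAE at median={mean_abs_error(sample, median):.1f}, "
      f"MAE at mean={mean_abs_error(sample, mean):.1f}")
print(f"MSE at median={mean_sq_error(sample, median):.0f}, "
      f"MSE at mean={mean_sq_error(sample, mean):.0f}")
```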

## 3.3 Objective and Evaluation Metrics

Based on the distributional characteristics and practical use cases:

- **Primary objective:** Mean Absolute Error (MAE)  
- **Secondary metrics:** Log-MAE, Log-RMSE  
- **Target preprocessing:** Outlier trimming at approximately two standard deviations

These design choices reduce sensitivity to heavy-tail noise, prioritize reliable central estimates, and align with the needs of typical users of emissions forecasts.
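
A minimal sketch of the three metrics follows. The exact definitions of Log-MAE and Log-RMSE are not spelled out above, so this version assumes both apply `log1p` to the true and predicted values before taking the error; the function names and sample numbers are illustrative only.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error on the original scale."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def log_mae(y_true, y_pred):
    """Mean absolute error after log1p (assumed definition)."""
    errors = (abs(math.log1p(t) - math.log1p(p)) for t, p in zip(y_true, y_pred))
    return sum(errors) / len(y_true)


def log_rmse(y_true, y_pred):
    """Root mean squared error after log1p (assumed definition)."""
    squares = ((math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sum(squares) / len(y_true))


# Made-up values: the third prediction is off by a factor of two.
y_true = [100, 1000, 10000]
y_pred = [110, 900, 20000]

print(f"MAE      = {mae(y_true, y_pred):.1f}")
print(f"Log-MAE  = {log_mae(y_true, y_pred):.3f}")
print(f"Log-RMSE = {log_rmse(y_true, y_pred):.3f}")
```

On the raw scale the single large miss dominates MAE, whereas the log-scale metrics treat it as a relative error comparable to the others.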


# 4. Feature Space Overview

The provided dataset includes three broad categories of features:

## Scale of Business

- Revenue
- Revenue distribution across NACE Level 2 sectors

These proxy for operational size and activity mix.

## Geography

- Region
- Country

These relate to energy-grid intensity and regulatory environment.

## Behavioral and Sustainability Attributes

- Environmental, social, and governance (ESG) scores
- Environmental activity adjustments
- SDG commitments

These reflect environmental maturity and may correlate with emissions performance.



# 5. Hypotheses and Empirical Validation

## Hypothesis 1

**Business scale (revenue) is the strongest single driver of emissions.**

- **Reasoning:** Larger operations and higher production volumes lead directly to more emissions.
- **Empirical result:** The Spearman correlation between revenue and total emissions is approximately 0.42. This is a moderate correlation, but a relatively strong signal for a single feature.
- **Conclusion:** Revenue serves as the primary baseline feature.
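
For reference, a Spearman correlation is a Pearson correlation computed on ranks, which makes it insensitive to the heavy tails of both revenue and emissions. The sketch below is a dependency-free illustration on made-up numbers; the helper names are ours and the data is not the project data.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value is tied.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = avg_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up revenue and total-emissions values for five companies.
revenue = [1.0, 2.0, 3.0, 4.0, 5.0]
total_emissions = [10.0, 100.0, 50.0, 1000.0, 500.0]

rho = spearman(revenue, total_emissions)
print(f"Spearman rho = {rho:.2f}")
```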

## Hypothesis 2

**The mix of sectors in which a company operates affects Scope 1 and Scope 2 differently.**

Rationale based on domain knowledge:
- Direct-emission-intensive sectors (manufacturing, mining, transportation) → stronger relationship with Scope 1
- Electricity-intensive or service-oriented sectors (ICT, retail, services) → stronger relationship with Scope 2

Procedure:
1. Define direct-related and indirect-related sector groups based on sustainability domain expertise.
2. Use `revenue_distribution_by_sector.csv` to compute each company’s direct revenue share and indirect revenue share.
3. Measure correlations:
   - direct revenue ↔ Scope 1: strong
   - indirect revenue ↔ Scope 2: strong
   - cross-correlations (direct ↔ Scope 2 or indirect ↔ Scope 1): weak

**Conclusion:** Use sector-partitioned revenue as target-specific predictive features.
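
The share computation in the procedure can be sketched as follows. Everything specific here is an assumption made for illustration: the column names (`company_id`, `nace_level_2`, `revenue_share`), the inline sample rows, and the membership of the two sector groups. The real layout of `revenue_distribution_by_sector.csv` and the expert-defined groups may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sector groups keyed by NACE Level 2 code (illustrative only).
DIRECT_SECTORS = {"05", "24", "49"}    # e.g. mining, basic metals, land transport
INDIRECT_SECTORS = {"47", "62", "63"}  # e.g. retail, programming, information services

# Inline sample with assumed column names, standing in for the real CSV file.
SAMPLE_CSV = """company_id,nace_level_2,revenue_share
A,24,0.70
A,62,0.20
A,85,0.10
B,47,0.60
B,63,0.40
"""


def sector_shares(csv_text):
    """Sum each company's revenue share over the direct and indirect groups."""
    shares = defaultdict(lambda: {"direct_share": 0.0, "indirect_share": 0.0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        share = float(row["revenue_share"])
        company = shares[row["company_id"]]
        if row["nace_level_2"] in DIRECT_SECTORS:
            company["direct_share"] += share
        elif row["nace_level_2"] in INDIRECT_SECTORS:
            company["indirect_share"] += share
        # Sectors in neither group contribute to neither feature.
    return dict(shares)


features = sector_shares(SAMPLE_CSV)
for company_id, values in features.items():
    print(company_id, values)
```

The two resulting columns can then be correlated with Scope 1 and Scope 2 separately to reproduce the comparison described in step 3.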



# 6. Geography Analysis

## Region

- In the train set, WEU and NAM dominate; other regions have only a handful of records.
- In the test set, only WEU and NAM appear.
- Minority regions are grouped as “Others” to reduce noise and instability.

Note: potential fairness and bias issues should be revisited in future work.
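
One way to implement the region grouping is to learn the set of frequent regions from the train set only and reuse it for the test set. In the sketch below, WEU and NAM come from the analysis above, while the other region codes and the threshold of 5 are placeholders chosen for illustration.

```python
from collections import Counter


def frequent_regions(train_regions, min_count):
    """Regions that appear at least min_count times in the training data."""
    counts = Counter(train_regions)
    return {region for region, n in counts.items() if n >= min_count}


def group_regions(regions, keep):
    """Map every region outside `keep` to the catch-all label "Others"."""
    return [region if region in keep else "Others" for region in regions]


# Toy data: the minority codes XX1..XX3 are placeholders, not real labels.
train = ["WEU"] * 6 + ["NAM"] * 5 + ["XX1", "XX2", "XX3"]
test = ["WEU", "NAM", "NAM"]

keep = frequent_regions(train, min_count=5)
train_grouped = group_regions(train, keep)
test_grouped = group_regions(test, keep)

print(Counter(train_grouped))
print(Counter(test_grouped))
```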

## Country

- Some countries have very few data points.
- Countries with more samples often show very high variance.

Country-level features were therefore removed due to lack of consistent signal.



# 7. Behavioral Features

Not yet fully explored.

Potential signals:
- Negative environmental adjustments, which indicate beneficial activities
- SDG commitments (especially SDG 13 “Climate Action”)
- ESG scores, which may reflect process maturity but can be noisy

Further EDA is required to validate their predictive value.



# 8. Target Preprocessing

Due to extreme heavy tails, outliers are trimmed at roughly two standard deviations to stabilize training.
This is consistent with the median-oriented objective.
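
The trimming rule is not specified in detail, so the sketch below makes two assumptions: the cutoff is the mean plus two standard deviations on the raw scale, and rows above it are dropped rather than clipped. The function name and sample values are illustrative, and the rule would be applied to training rows only.

```python
import statistics


def trim_upper_tail(targets, n_std=2.0):
    """Return indices of rows at or below mean + n_std * std, plus the cutoff."""
    mean = statistics.fmean(targets)
    std = statistics.pstdev(targets)
    upper = mean + n_std * std
    kept = [i for i, value in enumerate(targets) if value <= upper]
    return kept, upper


# Made-up training targets with a single extreme emitter at the end.
targets = [10.0, 12.0, 15.0, 18.0, 20.0, 25.0, 30.0, 35.0, 40.0, 5000.0]

kept_idx, upper = trim_upper_tail(targets)
trimmed = [targets[i] for i in kept_idx]

print(f"cutoff = {upper:.1f}, kept {len(trimmed)} of {len(targets)} rows")
```

Returning indices rather than values makes it easy to drop the same rows from the feature matrix.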



# 9. Experimentation

## Baseline (provided by organizers)

- Full feature set
- One-hot encoding
- Variance-threshold filtering
- Standard scaling
- Linear regression model

## Our Model 1 (planned)

- Separate models for Scope 1 and Scope 2
- Log-transformed targets
- Feature engineering using direct vs indirect revenue segmentation
- Tree-based models (e.g., LightGBM/XGBoost) to capture nonlinear interactions
- Cross-validation grouped by revenue decile or region to prevent leakage

Additional ensemble or stacking approaches may be added later.
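
The "separate models, log-transformed targets" pattern can be sketched without any ML library. In this illustration a plain least-squares line on `log1p(revenue)` stands in for the planned tree-based models, revenue is the only feature, and all numbers are made up. The point is the structure: one model per scope, fitted in log space, with predictions mapped back through `expm1`.

```python
import math


def fit_log_linear(revenue, target):
    """Least-squares fit of log1p(target) on log1p(revenue)."""
    x = [math.log1p(r) for r in revenue]
    y = [math.log1p(t) for t in target]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return slope, intercept


def predict(model, revenue):
    """Predict on the original scale; expm1 inverts the log1p transform."""
    slope, intercept = model
    # Clamp at zero because emissions cannot be negative.
    return [max(0.0, math.expm1(intercept + slope * math.log1p(r)))
            for r in revenue]


# Made-up training rows: revenue, Scope 1, and Scope 2 per company.
revenue = [1e6, 5e6, 2e7, 1e8, 5e8]
scope1 = [120.0, 700.0, 2500.0, 9000.0, 60000.0]
scope2 = [300.0, 900.0, 2000.0, 7000.0, 15000.0]

# One independent model per target.
models = {
    "scope1": fit_log_linear(revenue, scope1),
    "scope2": fit_log_linear(revenue, scope2),
}
predictions = {name: predict(model, [1e7])[0] for name, model in models.items()}

for name, value in predictions.items():
    print(f"{name}: slope={models[name][0]:.2f}, prediction={value:.0f}")
```

Because each scope gets its own fit, the two targets are free to scale differently with revenue, which is the motivation for modeling them separately.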

# 10. Conclusion