# Expanded ESG Indicators Analysis

This notebook analyzes the expanded ESG indicators derived from established frameworks:
1. **GRI Standards** - Global Reporting Initiative indicators
2. **SASB Standards** - Sustainability Accounting Standards Board indicators
3. **TCFD Framework** - Task Force on Climate-related Financial Disclosures
4. **Additional ESG Frameworks** - Other recognized standards

These framework-based metrics complement the ontology-derived indicators.
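
The cells below read the CSV through the `Framework`, `Category`, `Indicator`, and `Description` columns. A minimal sketch of an up-front check, assuming those four columns are the expected schema (inferred from the later cells, not from a documented spec); `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names:

```python
# Columns the later cells read; inferred from this notebook, not a formal schema.
REQUIRED_COLUMNS = ("Framework", "Category", "Indicator", "Description")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names absent from `columns`, in order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Works with any iterable of names, e.g. indicators_df.columns.
example_header = ["Indicator", "Framework", "Category", "Source"]
print(missing_columns(example_header))
```

Running the check right after loading the CSV would surface a renamed or missing column before any plotting cell fails with a `KeyError`.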

In [None]:
# Install required packages if not already installed
# !pip install pandas matplotlib seaborn plotly

In [None]:
import sys
sys.path.append('../src')

import pandas as pd
import json
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from pathlib import Path
from collections import Counter

## 1. Load Expanded ESG Indicators

In [None]:
# Load the expanded ESG indicators
indicators_path = Path('../data/indicators/expanded_esg_indicators.csv')
indicators_df = pd.read_csv(indicators_path)

print(f"Loaded {len(indicators_df)} expanded ESG indicators")
print(f"Columns: {list(indicators_df.columns)}")
indicators_df.head()

In [None]:
# Load the JSON version for detailed analysis
json_path = Path('../data/indicators/expanded_esg_indicators.json')
with open(json_path, 'r', encoding='utf-8') as f:
    indicators_json = json.load(f)

print(f"JSON structure keys: {list(indicators_json.keys())}")
indicators = indicators_json.get('indicators')
if indicators:
    print(f"Total indicators in JSON: {len(indicators)}")
    # 'indicators' may be a dict keyed by name or a list of records;
    # next(iter(...)) returns the first key or the first record respectively
    print(f"Example indicator: {next(iter(indicators))}")

## 2. Framework Distribution Analysis

In [None]:
# Analyze framework distribution
framework_counts = indicators_df['Framework'].value_counts()
print("Framework Distribution:")
print(framework_counts)

# Visualize framework distribution
plt.figure(figsize=(10, 6))
framework_counts.plot(kind='bar', color=['#2E8B57', '#4682B4', '#DAA520', '#CD853F'])
plt.title('Distribution of ESG Indicators by Framework')
plt.xlabel('Framework')
plt.ylabel('Number of Indicators')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Create an interactive pie chart
fig = px.pie(values=framework_counts.values, 
             names=framework_counts.index,
             title='ESG Indicators by Framework',
             color_discrete_sequence=['#2E8B57', '#4682B4', '#DAA520', '#CD853F'])
fig.show()

## 3. Category Analysis

In [None]:
# Analyze category distribution
category_counts = indicators_df['Category'].value_counts()
print("Category Distribution:")
print(category_counts)

# Visualize category distribution
plt.figure(figsize=(10, 6))
category_counts.plot(kind='bar', color=['#228B22', '#4169E1', '#B8860B'])
plt.title('Distribution of ESG Indicators by Category')
plt.xlabel('ESG Category')
plt.ylabel('Number of Indicators')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Cross-tabulation of Framework vs Category
cross_tab = pd.crosstab(indicators_df['Framework'], indicators_df['Category'])
print("Framework vs Category Cross-tabulation:")
print(cross_tab)

# Visualize as heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(cross_tab, annot=True, cmap='Blues', fmt='d')
plt.title('ESG Indicators: Framework vs Category')
plt.tight_layout()
plt.show()

## 4. Detailed Framework Analysis

In [None]:
# Analyze GRI indicators
gri_indicators = indicators_df[indicators_df['Framework'] == 'GRI']
print(f"GRI Indicators ({len(gri_indicators)}):")
gri_categories = gri_indicators['Category'].value_counts()
print(gri_categories)

print("Sample GRI indicators:")
for _, indicator in gri_indicators.head(5).iterrows():
    # str() guards against missing (NaN) descriptions, which load as floats
    print(f"- {indicator['Indicator']}: {str(indicator['Description'])[:100]}...")

In [None]:
# Analyze SASB indicators
sasb_indicators = indicators_df[indicators_df['Framework'] == 'SASB']
print(f"SASB Indicators ({len(sasb_indicators)}):")
sasb_categories = sasb_indicators['Category'].value_counts()
print(sasb_categories)

print("Sample SASB indicators:")
for _, indicator in sasb_indicators.head(5).iterrows():
    # str() guards against missing (NaN) descriptions, which load as floats
    print(f"- {indicator['Indicator']}: {str(indicator['Description'])[:100]}...")

In [None]:
# Analyze TCFD indicators
tcfd_indicators = indicators_df[indicators_df['Framework'] == 'TCFD']
print(f"TCFD Indicators ({len(tcfd_indicators)}):")
tcfd_categories = tcfd_indicators['Category'].value_counts()
print(tcfd_categories)

print("Sample TCFD indicators:")
for _, indicator in tcfd_indicators.head(5).iterrows():
    # str() guards against missing (NaN) descriptions, which load as floats
    print(f"- {indicator['Indicator']}: {str(indicator['Description'])[:100]}...")

## 5. Indicator Complexity Analysis

In [None]:
# Analyze description lengths as a proxy for complexity
indicators_df['description_length'] = indicators_df['Description'].str.len()

print("Description Length Statistics:")
print(indicators_df['description_length'].describe())

# Visualize description lengths by framework
plt.figure(figsize=(12, 6))
sns.boxplot(data=indicators_df, x='Framework', y='description_length')
plt.title('Indicator Description Length by Framework')
plt.ylabel('Description Length (characters)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Analyze common keywords in descriptions
import re  # Counter is already imported in the first code cell

# Extract keywords from descriptions
all_descriptions = ' '.join(indicators_df['Description'].astype(str))
words = re.findall(r'\b\w{4,}\b', all_descriptions.lower())

# Remove common stop words
stop_words = {'that', 'with', 'from', 'they', 'been', 'have', 'this', 'will', 'would', 'could', 'should', 'their', 'there', 'where', 'when', 'what', 'which', 'while'}
filtered_words = [word for word in words if word not in stop_words]

word_counts = Counter(filtered_words)
top_keywords = word_counts.most_common(20)

print("Top 20 Keywords in ESG Indicator Descriptions:")
for word, count in top_keywords:
    print(f"{word}: {count}")

## 6. Source and Reference Analysis

In [None]:
# Analyze sources
if 'Source' in indicators_df.columns:
    source_counts = indicators_df['Source'].value_counts()
    print("Source Distribution:")
    print(source_counts.head(10))
    
    # Visualize top sources
    plt.figure(figsize=(12, 6))
    source_counts.head(10).plot(kind='bar')
    plt.title('Top 10 Sources for ESG Indicators')
    plt.xlabel('Source')
    plt.ylabel('Number of Indicators')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
else:
    print("No 'Source' column found in the dataset")

## 7. Integration with Ontology Indicators

In [None]:
# Load ontology indicators for comparison
ontology_path = Path('../data/indicators/esg_indicators_mapping.csv')
if ontology_path.exists():
    ontology_df = pd.read_csv(ontology_path)
    
    print(f"Ontology indicators: {len(ontology_df)}")
    print(f"Expanded indicators: {len(indicators_df)}")
    print(f"Total combined indicators: {len(ontology_df) + len(indicators_df)}")
    
    # Compare categories
    print("\nOntology categories:")
    print(ontology_df['Category'].value_counts())
    
    print("\nExpanded framework categories:")
    print(indicators_df['Category'].value_counts())
else:
    print("Ontology indicators file not found. Run the ontology analysis notebook first.")

In [None]:
# Create a comprehensive indicator summary
summary = {
    'expanded_indicators': {
        'total': len(indicators_df),
        'by_framework': framework_counts.to_dict(),
        'by_category': category_counts.to_dict()
    }
}

if ontology_path.exists():
    summary['ontology_indicators'] = {
        'total': len(ontology_df),
        'by_category': ontology_df['Category'].value_counts().to_dict()
    }
    summary['combined_total'] = len(ontology_df) + len(indicators_df)

print("Comprehensive ESG Indicator Summary:")
print(json.dumps(summary, indent=2))

## 8. Export Analysis Results

In [None]:
# Save analysis results
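output_dir = Path('../data/indicators')
output_dir.mkdir(parents=True, exist_ok=True)  # no-op if the directory already exists

# Save summary statistics
with open(output_dir / 'expanded_indicators_analysis.json', 'w', encoding='utf-8') as f:
    json.dump(summary, f, indent=2)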

# Save framework statistics
framework_stats = pd.DataFrame({
    'Framework': framework_counts.index,
    'Count': framework_counts.values,
    'Percentage': (framework_counts.values / len(indicators_df) * 100).round(2)
})
framework_stats.to_csv(output_dir / 'framework_statistics.csv', index=False)

# Save category statistics
category_stats = pd.DataFrame({
    'Category': category_counts.index,
    'Count': category_counts.values,
    'Percentage': (category_counts.values / len(indicators_df) * 100).round(2)
})
category_stats.to_csv(output_dir / 'category_statistics.csv', index=False)

print(f"Analysis results saved to: {output_dir}")
print("- Summary: expanded_indicators_analysis.json")
print("- Framework stats: framework_statistics.csv")
print("- Category stats: category_statistics.csv")

## Summary

This notebook analyzes **46 comprehensive ESG indicators** from established frameworks:

### Framework Coverage:
- **GRI Standards**: Global sustainability reporting guidelines
- **SASB Standards**: Industry-specific sustainability metrics
- **TCFD Framework**: Climate-related financial disclosures
- **Additional Standards**: Other recognized ESG frameworks

### Key Insights:
1. **Comprehensive Coverage**: Indicators span Environmental, Social, and Governance domains
2. **Industry Relevance**: SASB indicators provide sector-specific metrics
3. **Climate Focus**: TCFD indicators address climate-related risks and opportunities
4. **Global Standards**: GRI indicators ensure international comparability

### Integration with Ontology:
- **Combined Total**: 97 indicators (51 ontology + 46 framework-based)
- **Complementary Coverage**: Ontology provides semantic structure, frameworks provide practical metrics
- **Model Training**: Both sets can be used for comprehensive ESG extraction model training
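
The combined total above is a plain sum of the two row counts, so it counts distinct indicators only if no indicator appears in both sets. A minimal sketch of a merge that tags each record with its origin and reports overlaps, assuming each record is a dict with an `Indicator` key (for example from `DataFrame.to_dict('records')`); `combine_indicator_sets` and the `Origin` field are hypothetical names, and the ontology file is only assumed to share that key:

```python
def combine_indicator_sets(ontology_records, framework_records, key="Indicator"):
    """Merge two lists of indicator dicts, tagging origin and skipping repeats.

    Names are compared case-insensitively after trimming whitespace. Returns
    (combined, duplicates), where duplicates lists the repeated names.
    """
    combined, duplicates, seen = [], [], set()
    tagged = [("ontology", ontology_records), ("framework", framework_records)]
    for origin, records in tagged:
        for record in records:
            normalized = str(record[key]).strip().lower()
            if normalized in seen:
                duplicates.append(record[key])
                continue
            seen.add(normalized)
            combined.append({**record, "Origin": origin})
    return combined, duplicates


# Toy records for illustration only; not taken from the real datasets.
ontology = [{"Indicator": "GHG Emissions"}, {"Indicator": "Board Diversity"}]
framework = [{"Indicator": "ghg emissions "}, {"Indicator": "Water Withdrawal"}]
combined, duplicates = combine_indicator_sets(ontology, framework)
print(len(combined), duplicates)
```

Exact-name matching is a deliberately conservative choice here: it will miss indicators that describe the same metric under different wording, so an empty `duplicates` list does not prove the two sets are disjoint.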

## Next Steps

1. **Data Preparation**: Create training datasets using both indicator sets
2. **Model Development**: Fine-tune FinBERT using comprehensive indicator coverage
3. **Evaluation**: Test extraction performance on corporate reports
4. **Integration**: Combine ontology-based semantic understanding with framework-based practical metrics
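
As one possible starting point for step 1, indicator names can be mapped to integer class ids before training examples are built. A minimal sketch, assuming a single-label classification setup (the notebook does not specify the task format); `build_label_maps` is a hypothetical helper:

```python
def build_label_maps(indicator_names):
    """Map unique indicator names to stable integer ids and back.

    Sorting makes the ids reproducible across runs, independent of input order.
    """
    unique_names = sorted({str(name).strip() for name in indicator_names})
    label2id = {name: index for index, name in enumerate(unique_names)}
    id2label = {index: name for name, index in label2id.items()}
    return label2id, id2label


# Toy names for illustration only; real names would come from both CSV files.
label2id, id2label = build_label_maps(["Water Use", "GHG Emissions", "Water Use"])
print(label2id)
```

Persisting both mappings alongside the training data keeps model predictions traceable back to indicator names.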