# Climate Risk Analysis of Sierra Club Press Releases

## Project Overview

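This project analyzes Sierra Club press releases to identify and quantify mentions of climate risk, distinguishing transition risks (those arising from the shift to a low-carbon economy, such as policy and technology change) from physical risks (those arising from the physical effects of climate change, such as storms, floods, and sea level rise). We use both a traditional technique (TF-IDF) and a modern one (BERT) to process and analyze the text data.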

## Installation

Before we begin, let's install the necessary packages for this lab. Run the following cell to install the required libraries:


In [None]:
%pip install nlp4ss


## Setup and Data Loading

- We import necessary libraries and initialize the project environment using HyFI.
- NLTK data is downloaded for text processing tasks.
- The Sierra Club press release data is loaded from a JSONL file into a pandas DataFrame.
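
For readers unfamiliar with JSONL: each line of the file is one standalone JSON object. The sketch below parses a small in-memory sample with the standard library only. The two records are made up for illustration; only the field names `content` and `timestamp` are taken from later cells in this notebook, and `read_jsonl` is a hypothetical helper, not part of HyFI.

```python
import io
import json

# Two made-up records in JSONL form: one JSON object per line.
# Field names mirror the columns used later ("content", "timestamp").
sample_jsonl = io.StringIO(
    '{"timestamp": "2021-03-01", "content": "Example release about sea level rise."}\n'
    '{"timestamp": "2021-04-15", "content": "Example release about clean energy."}\n'
)


def read_jsonl(handle):
    """Parse a JSONL stream into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in handle if line.strip()]


records = read_jsonl(sample_jsonl)
print(len(records), sorted(records[0]))
```

A list of dicts like `records` can be passed straight to `pd.DataFrame(records)` if a DataFrame is needed.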


In [None]:
from hyfi import HyFI

if HyFI.is_colab():
    HyFI.mount_google_drive()
    project_root = "/content/drive/MyDrive/nlp4ss"
else:
    project_root = "$HOME/workspace/courses/nlp4ss"

h = HyFI.initialize(
    project_name="nlp4ss",
    project_root=project_root,
    logging_level="INFO",
    verbose=True,
)

print("Project directory:", h.project.root_dir)
print("Project workspace directory:", h.project.workspace_dir)

In [None]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
import gensim.downloader as api
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("punkt")
nltk.download("punkt_tab")
nltk.download("stopwords")
nltk.download("wordnet")

In [None]:
# Load the data
raw_data_file = h.project.workspace_dir / "data/raw/articles.jsonl"
rdata = h.load_dataset("json", data_files=raw_data_file.as_posix())
df = rdata["train"].to_pandas()

print(df.info())
print("\nSample of the data:")
print(df.head())

# Work with a random sample of up to 500 articles to keep runtime manageable
df = df.sample(n=min(500, len(df)), random_state=42)

## Define Initial Climate Risk Keywords

- We start with two lists of initial keywords: one for transition risks and another for physical risks.
- These keywords are based on common terms associated with each type of climate risk.

> Bua, G., Kapp, D., Ramella, F., & Rognone, L. (2024). Transition versus physical climate risk pricing in European financial markets: A text-based approach. The European Journal of Finance, 1-35.


In [None]:
# Initial keyword lists
initial_transition_risk_keywords = [
    "EJ/YR",
    "Radiative Forcing",
    "HCFC",
    "Ozone",
    "Bioenergy",
    "Technical Potential",
    "GHG Emissions",
    "Refrigerant",
    "IPCC",
    "GHG",
    "Ozone Layer",
    "Geothermal",
    "Pathways",
    "Exajoules",
    "Biomass",
    "Hydropower",
    "GigaJoules",
    "Photovoltaics",
    "Chlorofluorocarbon",
    "Heat Pumps",
    "Ocean Energies",
    "Carbon Dioxide Capture and Storage",
    "Mitigation Scenarios",
    "Lifecycle",
    "USD/kWh",
    "Fluid",
    "Equivalent CO2",
    "Methane",
    "Halon",
    "Blowing Agent",
    "Aerosols",
    "Leakage",
    "Sustainable Development",
    "UNEP",
    "Montreal Protocol",
    "Anthropogenic",
    "Radiative",
    "Wind Energy",
    "Solar energy",
    "Hydrogen",
    "UNFCCC",
    "Product carbon footprints",
    "report safeguarding",
    "geological storage",
    "direct solar",
    "Reservoir",
    "IEA",
    "anthropogenic",
    "adaptation options",
    "ecosystems",
    "global warming potential",
    "ozone-depleting substances",
    "GTCO2",
    "global warming",
    "primary energies",
    "ocean",
    "atmosphere",
    "EQ/YR",
    "dioxide capture and storage",
    "methane",
    "ocean storage",
    "equivalent",
    "dioxide capture",
    "change mitigation",
    "teap",
    "levels cost",
    "energies systems",
    "life cycle climate performance",
    "mitigation options",
    "capacity factors",
    "TWH/YR",
    "feedstock",
    "foam",
    "solvent",
    "biofuels",
    "ozone depletion",
    "sustainable development",
    "Tco2",
    "MTCO2",
    "MTCO2 EQ",
    "stratospheric",
    "climate systems",
    "troposphere",
    "investment cost",
    "human system",
]

initial_physical_risk_keywords = [
    "coastal",
    "ecosystem services",
    "climate models",
    "wetlands",
    "ipcc",
    "adaptation",
    "cryosphere",
    "ice sheet",
    "biodiversity",
    "species",
    "phytoplankton",
    "antarctic",
    "climate variables",
    "biophysical",
    "ghg",
    "pathways",
    "climate change",
    "precipitation",
    "anthropogenic",
    "coupled model",
    "intercomparison projects",
    "cyclones",
    "climate related",
    "ocean",
    "streamflow",
    "adaptation response",
    "change impacts",
    "observed change",
    "socioeconomic",
    "freshwater",
    "temperature increase",
    "coastal zones",
    "sea level",
    "phenology",
    "future climate",
    "upwelling",
    "fisheries",
    "hazards",
    "general circulation models",
    "nutrient",
    "adaptation",
    "permafrost",
    "arid",
    "reefs",
    "water resources",
    "terrestrial",
    "spatial",
    "coral",
    "land degradation",
    "RCP",
    "adaptation planning",
    "change climate",
    "glaciers",
    "salinity",
    "hydrological variables",
    "sediment",
    "tropical cyclones",
    "latitudes",
    "projected change",
]

In [None]:
# Build these once rather than on every call
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()


def preprocess_text(text):
    # Convert to lowercase and remove everything except letters and whitespace
    text = re.sub(r"[^a-zA-Z\s]", "", text.lower())

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords and words of three letters or fewer
    # (this also drops short acronyms such as "ghg" and "rcp")
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) > 3]

    # Lemmatize
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]

    return " ".join(tokens)


# Preprocess keywords
initial_transition_risk_keywords = [
    preprocess_text(keyword) for keyword in initial_transition_risk_keywords
]
initial_physical_risk_keywords = [
    preprocess_text(keyword) for keyword in initial_physical_risk_keywords
]

# Remove duplicates
initial_transition_risk_keywords = list(set(initial_transition_risk_keywords))
initial_physical_risk_keywords = list(set(initial_physical_risk_keywords))

# Replace space with underscore
initial_transition_risk_keywords = [
    keyword.replace(" ", "_")
    for keyword in initial_transition_risk_keywords
    if len(keyword) > 0
]
initial_physical_risk_keywords = [
    keyword.replace(" ", "_")
    for keyword in initial_physical_risk_keywords
    if len(keyword) > 0
]

print("Number of transition risk keywords:", len(initial_transition_risk_keywords))
print("Transition risk keywords:", initial_transition_risk_keywords)

print("Number of physical risk keywords:", len(initial_physical_risk_keywords))
print("Physical risk keywords:", initial_physical_risk_keywords)
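
To see what the keyword cleanup does, here is a simplified, dependency-free sketch. It is a stand-in, not the notebook's function: it only lowercases, strips non-letters, and joins words with underscores, and it leaves out the stopword removal, length filter, and lemmatization that `preprocess_text` performs. The helper name `normalize_keyword` is hypothetical.

```python
import re


def normalize_keyword(keyword):
    """Simplified stand-in for the keyword cleanup: lowercase, letters only, underscores."""
    cleaned = re.sub(r"[^a-z\s]", "", keyword.lower())
    return "_".join(cleaned.split())


raw_keywords = ["Methane", "methane", "GHG Emissions", "Ozone Layer", "USD/kWh"]

# dict.fromkeys drops duplicates while keeping first-seen order
normalized = list(dict.fromkeys(normalize_keyword(k) for k in raw_keywords))
normalized = [k for k in normalized if k]  # discard keywords that became empty
print(normalized)
```

Unlike `list(set(...))`, `dict.fromkeys` keeps the keywords in a stable order, which can make printed lists easier to compare between runs.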

## Expand Keywords Using Word Embeddings

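- We use pre-trained Word2Vec embeddings (the `word2vec-google-news-300` model, loaded through gensim) to find words that are semantically similar to our initial keywords.
- The `expand_keywords` function takes a list of keywords and returns the related terms it finds; keywords that are missing from the model's vocabulary are skipped.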
- We combine the original keywords with the expanded ones to create our final keyword lists.
- This expansion helps capture a broader range of terms related to climate risks, potentially improving our analysis.
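
The idea behind the expansion can be sketched without downloading the full model. The example below uses made-up three-dimensional vectors in place of real embeddings and a hand-written cosine similarity in place of gensim's `most_similar`; `toy_vectors`, `most_similar`, and `expand` are illustrative names, not the notebook's code.

```python
import math

# Made-up 3-dimensional vectors standing in for real word embeddings.
toy_vectors = {
    "flood": [0.9, 0.1, 0.0],
    "storm": [0.8, 0.3, 0.1],
    "drought": [0.7, 0.0, 0.3],
    "solar": [0.0, 0.9, 0.2],
    "wind_energy": [0.1, 0.8, 0.3],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def most_similar(word, vectors, topn=2):
    """Return the topn nearest neighbours of `word` by cosine similarity."""
    query = vectors[word]  # raises KeyError for out-of-vocabulary words
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


def expand(keywords, vectors, topn=2):
    expanded = set()
    for keyword in keywords:
        try:
            expanded.update(w for w, _ in most_similar(keyword, vectors, topn))
        except KeyError:
            continue  # skip keywords the toy vocabulary does not contain
    return sorted(expanded)


print(expand(["flood", "heat_pump"], toy_vectors))
```

Here `heat_pump` is silently skipped because it has no vector, which is the same behavior the notebook relies on for out-of-vocabulary keywords.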


In [None]:
%%time

# Load pre-trained word embeddings
word2vec_model = api.load("word2vec-google-news-300")


def expand_keywords(keywords, model, topn=5):
    expanded_keywords = set()
    for keyword in keywords:
        try:
            similar_words = model.most_similar(keyword, topn=topn)
            expanded_keywords.update([word.lower() for word, _ in similar_words])
        except KeyError:
            continue  # Skip words not in the vocabulary
    return list(expanded_keywords)


# Expand keyword lists
expanded_transition_keywords = expand_keywords(
    initial_transition_risk_keywords, word2vec_model
)
expanded_physical_keywords = expand_keywords(
    initial_physical_risk_keywords, word2vec_model
)

# Combine original and expanded keywords
transition_risk_keywords = (
    initial_transition_risk_keywords + expanded_transition_keywords
)
physical_risk_keywords = initial_physical_risk_keywords + expanded_physical_keywords

# Remove duplicates
transition_risk_keywords = list(set(transition_risk_keywords))
physical_risk_keywords = list(set(physical_risk_keywords))

print("Number of expanded transition risk keywords:", len(transition_risk_keywords))
print("Number of expanded physical risk keywords:", len(physical_risk_keywords))

print("Expanded transition risk keywords:", transition_risk_keywords)
print("Expanded physical risk keywords:", physical_risk_keywords)

## Text Preprocessing

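- We define a function to preprocess the text, which includes:
  - Converting to lowercase and removing non-letter characters
  - Tokenizing the text, removing stopwords and short words, and lemmatizing (all via `preprocess_text`)
  - Creating both unigrams and bigrams, with each bigram joined by an underscore so that it lines up with multi-word keywords such as `coastal_zone`
- This preprocessing step is crucial for capturing both single words and two-word phrases in our analysis.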
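
As a minimal illustration of the unigram-plus-bigram step, the sketch below starts from a list of already cleaned tokens (the token list is made up, and `unigrams_and_bigrams` is a hypothetical name for this example).

```python
def unigrams_and_bigrams(tokens):
    """Return the tokens followed by underscore-joined adjacent pairs."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


tokens = ["rising", "level", "threaten", "coastal", "zone"]
features = unigrams_and_bigrams(tokens)
print(features)
```

The resulting list contains `coastal_zone`, so a keyword stored in that underscore form can be matched as a single term.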


In [None]:
# Preprocess text to include bigrams
def preprocess_text_with_bigrams(text):
    # Clean and lemmatize with preprocess_text, then split into tokens
    tokens = preprocess_text(text).split()
    # Create unigrams and bigrams
    unigrams = tokens
    bigrams = [f"{tokens[i]}_{tokens[i+1]}" for i in range(len(tokens) - 1)]
    return unigrams + bigrams


# Update the dataframe with preprocessed text including bigrams
df["processed_content_bigrams"] = (
    df["content"].apply(preprocess_text_with_bigrams).apply(" ".join)
)

# Print the updated dataframe
df[["processed_content_bigrams"]].head()

## TF-IDF Analysis

- We use TF-IDF (Term Frequency-Inverse Document Frequency) to analyze the importance of climate risk keywords in each document.
- The TF-IDF vectorizer is configured to use our specific climate risk vocabulary.
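- We calculate separate scores for transition risks and physical risks based on the TF-IDF matrix: a summed keyword weight and a mean cosine similarity to the keywords, each min-max normalized and then averaged.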
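
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF restricted to a fixed vocabulary. It follows the smoothed-IDF, L2-normalized weighting that scikit-learn's `TfidfVectorizer` uses by default, but the whitespace tokenizer, the three toy documents, and the function name `tfidf_fixed_vocab` are simplifications made for this example.

```python
import math
from collections import Counter


def tfidf_fixed_vocab(docs, vocabulary):
    """TF-IDF restricted to `vocabulary`, using smoothed IDF and L2 row normalization."""
    n_docs = len(docs)
    counts = [Counter(t for t in doc.split() if t in vocabulary) for doc in docs]
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {
        term: math.log((1 + n_docs) / (1 + sum(term in c for c in counts))) + 1
        for term in vocabulary
    }
    rows = []
    for c in counts:
        weights = {term: c[term] * idf[term] for term in vocabulary}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: (w / norm if norm else 0.0) for t, w in weights.items()})
    return rows


docs = [
    "flood flood coastal_zone solar",
    "solar wind policy",
    "budget meeting agenda",
]
transition_terms = ["solar", "wind"]
physical_terms = ["flood", "coastal_zone"]
rows = tfidf_fixed_vocab(docs, transition_terms + physical_terms)

for row in rows:
    transition = sum(row[t] for t in transition_terms)
    physical = sum(row[t] for t in physical_terms)
    print(round(transition, 3), round(physical, 3))
```

Words outside the vocabulary (`policy`, `budget`, and so on) contribute nothing, so a document with no risk keywords gets a score of zero for both risk types.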


In [None]:
%%time

# TF-IDF Analysis with bigrams
tfidf_vectorizer = TfidfVectorizer(
    vocabulary=set(transition_risk_keywords + physical_risk_keywords)
)
tfidf_matrix = tfidf_vectorizer.fit_transform(df["processed_content_bigrams"])

# Calculate risk scores using sum method
df["tfidf_transition_score_sum"] = tfidf_matrix[
    :,
    [
        tfidf_vectorizer.vocabulary_.get(word)
        for word in transition_risk_keywords
        if word in tfidf_vectorizer.vocabulary_
    ],
].sum(axis=1)
df["tfidf_physical_score_sum"] = tfidf_matrix[
    :,
    [
        tfidf_vectorizer.vocabulary_.get(word)
        for word in physical_risk_keywords
        if word in tfidf_vectorizer.vocabulary_
    ],
].sum(axis=1)

# Calculate risk scores using cosine similarity
transition_risk_keywords_vector = tfidf_vectorizer.transform(transition_risk_keywords)
physical_risk_keywords_vector = tfidf_vectorizer.transform(physical_risk_keywords)
df["tfidf_transition_score_sim"] = df["processed_content_bigrams"].apply(
    lambda x: cosine_similarity(
        tfidf_vectorizer.transform([x]), transition_risk_keywords_vector
    ).mean()
)
df["tfidf_physical_score_sim"] = df["processed_content_bigrams"].apply(
    lambda x: cosine_similarity(
        tfidf_vectorizer.transform([x]), physical_risk_keywords_vector
    ).mean()
)

# Normalize scores
for col in [
    "tfidf_transition_score_sum",
    "tfidf_physical_score_sum",
    "tfidf_transition_score_sim",
    "tfidf_physical_score_sim",
]:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

# Combine scores
df["tfidf_transition_score"] = (
    df["tfidf_transition_score_sum"] + df["tfidf_transition_score_sim"]
) / 2
df["tfidf_physical_score"] = (
    df["tfidf_physical_score_sum"] + df["tfidf_physical_score_sim"]
) / 2
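
One caveat about the normalization above: when every document has the same value in a column, `max - min` is zero and the division yields NaN in pandas. The sketch below shows one possible guard using plain Python lists; the `min_max` helper and the example scores are assumptions for illustration, not part of the notebook.

```python
def min_max(values):
    """Scale values to [0, 1]; return zeros when all values are equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


sum_scores = [0.2, 0.5, 1.1]
sim_scores = [0.03, 0.03, 0.03]  # constant column: plain min-max would divide by zero

# Average the two normalized scores, as the notebook does for its combined score
combined = [(a + b) / 2 for a, b in zip(min_max(sum_scores), min_max(sim_scores))]
print(combined)
```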

In [None]:
# Display results
print("Top 10 articles by TF-IDF physical risk score (combined):")
print(
    df[["processed_content_bigrams", "tfidf_physical_score"]]
    .sort_values("tfidf_physical_score", ascending=False)
    .head(10)
)

print("\nTop 10 articles by TF-IDF transition risk score (combined):")
print(
    df[["processed_content_bigrams", "tfidf_transition_score"]]
    .sort_values("tfidf_transition_score", ascending=False)
    .head(10)
)

# Visualization
plt.figure(figsize=(12, 6))
plt.scatter(df["tfidf_transition_score"], df["tfidf_physical_score"], alpha=0.5)
plt.xlabel("TF-IDF Transition Risk Score (Combined)")
plt.ylabel("TF-IDF Physical Risk Score (Combined)")
plt.title("TF-IDF: Transition vs Physical Risk in Sierra Club Press Releases")
plt.show()

## BERT-based Analysis

- We use BERT (Bidirectional Encoder Representations from Transformers) for a more context-aware analysis of climate risk mentions.
- The `get_bert_embedding` function generates an embedding for a text by mean-pooling BERT's last hidden states; inputs longer than 512 tokens are truncated.
- The `contextual_keyword_importance` function calculates the importance of keywords in the context of each document, considering both semantic similarity (via BERT embeddings) and keyword frequency.
- We calculate BERT-based scores for both transition and physical risks.
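
The scoring rule can be tried out without loading BERT. In the sketch below, made-up two-dimensional vectors stand in for BERT's 768-dimensional token embeddings, and `mean_pool` and `keyword_importance` are illustrative names; the rule itself (cosine similarity to the document multiplied by the keyword's occurrence count, summed over keywords) mirrors the description above.

```python
import numpy as np


def mean_pool(token_vectors):
    """Average token vectors into one fixed-size text embedding."""
    return np.asarray(token_vectors, dtype=float).mean(axis=0)


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def keyword_importance(text, text_embedding, keyword_embeddings):
    """Sum over keywords of (cosine similarity to the text) x (occurrences in the text)."""
    total = 0.0
    for keyword, embedding in keyword_embeddings.items():
        count = text.lower().count(keyword.replace("_", " "))
        total += cosine(text_embedding, embedding) * count
    return total


# Made-up 2-dimensional "token embeddings"; a real BERT model would supply them.
text = "Sea level rise threatens coastal towns; sea level data confirm it."
text_embedding = mean_pool([[1.0, 0.0], [0.8, 0.2], [0.6, 0.4]])
keyword_embeddings = {
    "sea_level": np.array([1.0, 0.0]),
    "permafrost": np.array([0.0, 1.0]),
}
score = keyword_importance(text, text_embedding, keyword_embeddings)
print(round(score, 3))
```

Because the similarity is multiplied by the count, a keyword that never appears in the text (`permafrost` here) contributes nothing, however similar its embedding is.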


In [None]:
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")


# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").to(device)

In [None]:
%%time

def get_bert_embedding(text):
    inputs = tokenizer(
        text, return_tensors="pt", truncation=True, padding=True, max_length=512
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).cpu().numpy().flatten()


def contextual_keyword_importance(text, keywords):
    # Get BERT embeddings for the full text and keywords
    text_embedding = get_bert_embedding(text)
    keyword_embeddings = np.array(
        [get_bert_embedding(keyword.replace("_", " ")) for keyword in keywords]
    )

    # Cosine similarity between the document and each keyword (used as a relevance weight)
    attention_scores = cosine_similarity(
        text_embedding.reshape(1, -1), keyword_embeddings
    ).flatten()

    # Count keyword occurrences (considering bigrams)
    keyword_counts = np.array(
        [text.lower().count(keyword.replace("_", " ")) for keyword in keywords]
    )

    # Combine attention scores and counts
    importance_scores = attention_scores * keyword_counts

    return importance_scores.sum()


# Calculate BERT-based scores
df["bert_transition_score"] = df["content"].apply(
    lambda x: contextual_keyword_importance(x, transition_risk_keywords)
)
df["bert_physical_score"] = df["content"].apply(
    lambda x: contextual_keyword_importance(x, physical_risk_keywords)
)

# Normalize scores
for col in ["bert_transition_score", "bert_physical_score"]:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
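
In the cell above, `contextual_keyword_importance` re-embeds every keyword for every document, although a keyword's embedding does not depend on the document. One possible way to avoid the repeated work is to cache embeddings by text. The sketch below demonstrates the idea with a dummy `slow_embed` function that stands in for a BERT forward pass; all names are hypothetical.

```python
from functools import lru_cache

calls = {"count": 0}


def slow_embed(text):
    """Stand-in for a BERT forward pass; counts how often it runs."""
    calls["count"] += 1
    return (float(len(text)), float(text.count(" ")))  # dummy 2-d "embedding"


@lru_cache(maxsize=None)
def cached_embed(text):
    return slow_embed(text)


keywords = ["sea level", "permafrost", "coral"]
documents = ["doc one", "doc two", "doc three", "doc four"]

for _ in documents:
    # Keyword embeddings do not depend on the document, so they are reused here
    keyword_vectors = [cached_embed(k) for k in keywords]

print(calls["count"])  # 3 forward passes instead of 3 * 4 = 12
```

An equivalent option is to compute the keyword embedding matrix once before the loop and pass it into the scoring function.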

## Results Analysis and Visualization

- We display the top 10 articles for each risk type and analysis method.
- Scatter plots are created to visualize the relationship between transition and physical risk scores for both TF-IDF and BERT methods.


In [None]:
# Display results
print("Top 10 articles by TF-IDF transition risk score:")
print(
    df[["content", "tfidf_transition_score"]]
    .sort_values("tfidf_transition_score", ascending=False)
    .head(10)
)

print("\nTop 10 articles by TF-IDF physical risk score:")
print(
    df[["content", "tfidf_physical_score"]]
    .sort_values("tfidf_physical_score", ascending=False)
    .head(10)
)

print("\nTop 10 articles by BERT transition risk score:")
print(
    df[["content", "bert_transition_score"]]
    .sort_values("bert_transition_score", ascending=False)
    .head(10)
)

print("\nTop 10 articles by BERT physical risk score:")
print(
    df[["content", "bert_physical_score"]]
    .sort_values("bert_physical_score", ascending=False)
    .head(10)
)

# Visualization
plt.figure(figsize=(12, 6))
plt.scatter(df["tfidf_transition_score"], df["tfidf_physical_score"], alpha=0.5)
plt.xlabel("TF-IDF Transition Risk Score")
plt.ylabel("TF-IDF Physical Risk Score")
plt.title("TF-IDF: Transition vs Physical Risk in Sierra Club Press Releases")
plt.show()

plt.figure(figsize=(12, 6))
plt.scatter(df["bert_transition_score"], df["bert_physical_score"], alpha=0.5)
plt.xlabel("BERT Transition Risk Score")
plt.ylabel("BERT Physical Risk Score")
plt.title("BERT: Transition vs Physical Risk in Sierra Club Press Releases")
plt.show()

## Time Series Analysis

- We convert the timestamp to a datetime index and resample the data to monthly averages.
- A time series plot is created to show how different risk scores change over time.
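
The date handling can be tried on a tiny made-up table first. The sketch below coerces unparseable timestamps to `NaT`, drops them, and averages by calendar month. It groups by `to_period("M")` rather than calling `resample`, which is an assumption made to keep the example independent of the pandas version; the scores are invented for illustration.

```python
import pandas as pd

# Made-up scores for illustration; column names mirror the ones used in this notebook.
toy = pd.DataFrame(
    {
        "timestamp": ["2021-01-05", "2021-01-20", "2021-02-10", "not a date"],
        "tfidf_physical_score": [0.2, 0.4, 0.9, 0.5],
    }
)

toy["date"] = pd.to_datetime(toy["timestamp"], errors="coerce")  # bad values become NaT
toy = toy.dropna(subset=["date"]).set_index("date")

# Group by calendar month and average the scores within each month
monthly = toy.groupby(toy.index.to_period("M"))["tfidf_physical_score"].mean()
print(monthly)
```

One difference to keep in mind: `resample` also emits rows (filled with NaN) for months that have no articles, whereas this grouping only returns months that occur in the data.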


In [None]:
# Time series analysis
df["date"] = pd.to_datetime(df["timestamp"], errors="coerce")
df = df.dropna(subset=["date"])
df.set_index("date", inplace=True)

# Monthly average risk scores ("M" = month end; pandas >= 2.2 prefers the alias "ME")
monthly_risks = df.resample("M")[
    [
        "tfidf_physical_score",
        "tfidf_transition_score",
        "bert_physical_score",
        "bert_transition_score",
    ]
].mean()

# Plot time series
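# DataFrame.plot opens its own figure unless given an axis, so create one and pass it in
fig, ax = plt.subplots(figsize=(14, 7))
monthly_risks.plot(ax=ax)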
plt.title("Monthly Average Risk Scores Over Time")
plt.xlabel("Date")
plt.ylabel("Risk Score")
plt.legend(loc="best")
plt.show()

## Keyword Frequency Analysis

- We analyze the frequency of each keyword in the entire corpus.
- The results are displayed for the top 20 most frequent keywords in each category.
- Bar plots are created to visualize the top 10 keywords for each risk type.
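
A note on counting: plain substring counts on text that contains both unigrams and underscore-joined bigrams will also match a single-word keyword inside longer words and inside bigrams. The sketch below contrasts `str.count` with a whole-token regular expression on a made-up string; `count_whole_word` is a hypothetical alternative, not the notebook's method.

```python
import re


def count_substring(text, keyword):
    """Plain substring count, as str.count does."""
    return text.count(keyword)


def count_whole_word(text, keyword):
    """Count matches that are not embedded in a longer token."""
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return len(re.findall(pattern, text))


# Unigrams followed by their bigrams, in the same layout as the processed column
text = "ocean warming oceanic current ocean_warming warming_oceanic oceanic_current"
print(count_substring(text, "ocean"), count_whole_word(text, "ocean"))
```

Since `\w` includes the underscore, the whole-token version ignores `ocean` inside `ocean_warming` as well as inside `oceanic`, so single-word and two-word keywords are counted on the same footing.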


In [None]:
def analyze_keyword_frequency(keywords, df_column):
    keyword_freq = {
        keyword: df_column.apply(lambda x: x.count(keyword.replace("_", " "))).sum()
        for keyword in keywords
    }
    return pd.Series(keyword_freq).sort_values(ascending=False)


transition_freq = analyze_keyword_frequency(
    transition_risk_keywords, df["processed_content_bigrams"]
)
physical_freq = analyze_keyword_frequency(
    physical_risk_keywords, df["processed_content_bigrams"]
)

print("Top 20 most frequent transition risk keywords:")
print(transition_freq.head(20))

print("\nTop 20 most frequent physical risk keywords:")
print(physical_freq.head(20))

# Visualize top keywords
plt.figure(figsize=(12, 6))
transition_freq.head(10).plot(kind="bar")
plt.title("Top 10 Transition Risk Keywords")
plt.xlabel("Keywords")
plt.ylabel("Frequency")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

plt.figure(figsize=(12, 6))
physical_freq.head(10).plot(kind="bar")
plt.title("Top 10 Physical Risk Keywords")
plt.xlabel("Keywords")
plt.ylabel("Frequency")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

## Conclusion

This project analyzes climate risk discussions in a sample of Sierra Club press releases, using both traditional (TF-IDF) and modern (BERT) NLP techniques. Scoring transition and physical risks separately shows how different aspects of climate change are addressed in the organization's communications.

Key findings and implications:

1. Temporal Trends: The time series analysis tracks how monthly average risk scores change over time; shifts in these scores may reflect changing priorities or external events influencing the Sierra Club's messaging.

2. Risk Type Comparison: By quantifying the emphasis on transition versus physical risks, we can understand which aspects of climate change receive more attention in the organization's communications.

3. Keyword Analysis: The frequency analysis of specific climate risk terms provides a granular view of the most prominent topics within each risk category, offering insights into the Sierra Club's focus areas.

4. Methodological Comparison: The use of both TF-IDF and BERT-based approaches allows for a nuanced understanding of climate risk mentions, showcasing the strengths and potential complementarity of different NLP techniques.

5. Keyword Expansion Impact: The incorporation of word embeddings to expand our initial keyword lists demonstrates how semantic relationships can enhance the detection of climate risk discussions, potentially capturing more nuanced or varied terminology.

Limitations and Future Directions:

- While keyword expansion increases coverage, it may introduce some noise. Future work could involve refining the expanded keyword list based on domain expertise.
- The analysis could be extended to compare results using initial versus expanded keyword lists to quantify the impact of this approach.
- Experimenting with different word embedding models or expansion techniques could further optimize the keyword selection process.
- Comparative analysis with other environmental organizations' communications could provide broader context for the Sierra Club's approach to climate risk discussion.

This project demonstrates the potential of combining traditional and modern NLP techniques to analyze complex environmental communications. By providing a data-driven approach to understanding climate risk discourse, this analysis can inform strategic communication decisions, policy discussions, and further research in environmental studies and climate change communication.
