
### Thesis 6 – Sentiment Analysis with FinBERT

In this step, we apply FinBERT to every sentence previously classified as ESG-related to determine its sentiment. This complements the ESG classification by capturing the tone (positive, negative, or neutral) of the ESG discourse. The results are then exported to Google Drive for further use.

1 – Setup & Mount Drive

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# 📄 Path to the previously classified ESG sentences
path = "/content/drive/MyDrive/Thèse Master/Exports2/classified_all_sentences.csv"

# Load essential libraries
import pandas as pd
import numpy as np


Mounted at /content/drive


2 – Load the dataset and filter for ESG-classified sentences only

In [None]:
# Load the dataset
df = pd.read_csv(path)

# Keep only sentences classified as Environmental, Social, or Governance
mask_esg = (df["label_env"] == "environmental") | (df["label_soc"] == "social") | (df["label_gov"] == "governance")
df_esg = df[mask_esg].copy().reset_index(drop=True)

print(f"✅ {len(df_esg)} ESG sentences retained (out of {len(df)} total)")


✅ 47272 ESG sentences retained (out of 201247 total)
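
The filter above keeps a sentence as soon as any one of the three classifiers assigns its ESG class. A minimal sketch on a toy DataFrame, assuming the same column names as in the notebook; the example rows and the `"none"` placeholder for the non-ESG class are invented for illustration:

```python
import pandas as pd

# Toy data; "none" is an assumed placeholder for the non-ESG class
toy = pd.DataFrame({
    "sentence": ["We cut emissions.", "Revenue grew.", "The board met.", "We train staff."],
    "label_env": ["environmental", "none", "none", "none"],
    "label_soc": ["none", "none", "none", "social"],
    "label_gov": ["none", "none", "governance", "none"],
})

# A row is kept when at least one of the three labels matches its ESG class
mask = (
    (toy["label_env"] == "environmental")
    | (toy["label_soc"] == "social")
    | (toy["label_gov"] == "governance")
)
toy_esg = toy[mask].reset_index(drop=True)
print(len(toy_esg), "of", len(toy), "rows kept")
```

Because the conditions are combined with `|`, a sentence tagged by several classifiers still appears only once in the filtered frame.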


In [None]:
# Save the filtered dataset to Drive
output_path = "/content/drive/MyDrive/Thèse Master/Exports2/df_esg.csv"
df_esg.to_csv(output_path, index=False)
print(f"✅ Final file saved with {len(df_esg)} rows ➤ {output_path}")

✅ Final file saved with 47272 rows ➤ /content/drive/MyDrive/Thèse Master/Exports2/df_esg.csv


3 – Load FinBERT Model on GPU

This step loads FinBERT (the yiyanghkust/finbert-tone checkpoint), a BERT model adapted to the financial domain and fine-tuned for sentiment classification.
The pipeline is configured to use the GPU (device=0) for significantly faster inference on large datasets.
Guaranteed GPU access on Google Colab generally requires a paid plan (such as Colab Pro), but free users are often granted access depending on availability and usage history, which made it possible to run this step at no cost in this case.
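
Since a GPU is not always granted, it can help to choose the device at runtime instead of hard-coding it. The sketch below is one possible approach, not part of the original notebook: `pick_device` is a hypothetical helper that returns `0` (first CUDA GPU) when PyTorch reports one and `-1` (CPU, in the `pipeline` convention) otherwise.

```python
def pick_device() -> int:
    """Return 0 (first CUDA GPU) when one is usable, otherwise -1 (CPU)."""
    try:
        import torch  # imported lazily so the helper also runs without PyTorch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("Using GPU" if device == 0 else "Using CPU")
```

The returned value could then be passed as the `device` argument when the pipeline is created, at the price of much slower inference when it falls back to the CPU.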

In [None]:
# Install the Transformers library
!pip install transformers --quiet

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

#  Load FinBERT sentiment model
model_name = "yiyanghkust/finbert-tone"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create a sentiment analysis pipeline using GPU (device=0)
pipe_sentiment = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Device set to use cuda:0


4 – Apply FinBERT to All ESG Sentences

This function applies FinBERT to each ESG-classified sentence and extracts the probability scores for positive, negative, and neutral tones.

The label with the highest score is saved as the predicted sentiment (sent_label).

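Sentences longer than 512 tokens are truncated. Any remaining processing error (e.g., due to unusual characters or a non-text value) is caught and printed, and the affected row is labeled ERROR.

Calling tqdm.pandas() registers progress_apply on pandas objects, which adds a progress bar to monitor the process.

Finally, the results are concatenated column-wise with the ESG dataset for downstream analysis.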
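
To make the score extraction concrete, here is a small sketch of the dictionary step on its own. The `preds` list is a hypothetical pipeline output for one sentence (the scores are invented); the capitalized labels are an assumption about the model's label names, which is why they are lowercased before lookup.

```python
# Hypothetical pipeline output for a single sentence (invented scores)
preds = [
    {"label": "Positive", "score": 0.91},
    {"label": "Neutral", "score": 0.07},
    {"label": "Negative", "score": 0.02},
]

# Map each lowercased label to its probability
scores = {d["label"].lower(): d["score"] for d in preds}

# The predicted sentiment is the label with the highest probability
sent_label = max(scores, key=scores.get)
print(sent_label, scores["positive"])
```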



In [None]:
from tqdm.notebook import tqdm
tqdm.pandas()

# Function to classify sentiment using FinBERT
def classify_sentiment(text):
    try:
        preds = pipe_sentiment(text, truncation=True, max_length=512, top_k=3)
        output = {d["label"].lower(): d["score"] for d in preds}
        return pd.Series({
            "sent_pos": output.get("positive", 0),
            "sent_neg": output.get("negative", 0),
            "sent_neu": output.get("neutral", 0),
            "sent_label": max(output, key=output.get)  # Most probable sentiment label
        })
    except Exception as e:
        print(f"❌ Error with: {str(text)[:50]}... ➤ {e}")
        return pd.Series({
            "sent_pos": None,
            "sent_neg": None,
            "sent_neu": None,
            "sent_label": "ERROR"
        })

# Apply the sentiment classification to all ESG sentences
sentiment_df = df_esg["sentence"].progress_apply(classify_sentiment)

# Merge sentiment results back into the ESG dataframe
df_esg_sentiment = pd.concat([df_esg.reset_index(drop=True), sentiment_df], axis=1)


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
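
The warning above appears because the pipeline is called once per sentence. One possible remedy, sketched here and not used in the original run, is to send sentences in batches. `batches` and `classify_in_batches` are hypothetical helpers, and `fake_classifier` is a stand-in so the sketch runs without loading a model; in the notebook, the callable could instead wrap `pipe_sentiment` called on a list of sentences (an assumption to verify against the installed Transformers version).

```python
from typing import Callable, Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def classify_in_batches(
    sentences: Sequence[str],
    classify_batch: Callable[[List[str]], list],
    batch_size: int = 32,
) -> list:
    """Run `classify_batch` on each chunk and flatten the results in order."""
    results = []
    for batch in batches(sentences, batch_size):
        results.extend(classify_batch(batch))
    return results


def fake_classifier(batch: List[str]) -> list:
    """Stand-in for the real model: returns one invented prediction per sentence."""
    return [[{"label": "Neutral", "score": 1.0}] for _ in batch]


out = classify_in_batches(["a", "b", "c", "d", "e"], fake_classifier, batch_size=2)
print(len(out), "predictions")
```

Results come back in input order, so they can still be aligned row by row with `df_esg`.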


5 – Save the final dataset with sentiment scores

In [None]:
output_path = "/content/drive/MyDrive/Thèse Master/Exports2/df_esg_with_sentiment.csv"
df_esg_sentiment.to_csv(output_path, index=False)
print(f"✅ Final file saved with {len(df_esg_sentiment)} rows ➤ {output_path}")


✅ Final file saved with 47272 rows ➤ /content/drive/MyDrive/Thèse Master/Exports2/df_esg_with_sentiment.csv


This file now includes all ESG-classified sentences enriched with sentiment scores (positive, negative, neutral) and the dominant sentiment label (sent_label). It is ready for further analysis (e.g. scoring companies based on ESG tone).
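
Before using the file downstream, a quick sanity check can be useful. The sketch below works on a toy frame with the same sentiment columns (the values are invented): it counts rows marked ERROR, drops them, and computes the share of each label. The `net_tone` column (positive minus negative probability) is only an illustrative assumption, not a metric defined in this notebook.

```python
import pandas as pd

# Toy frame mimicking the sentiment columns; values are invented
toy = pd.DataFrame({
    "sent_pos": [0.9, 0.1, None],
    "sent_neg": [0.05, 0.7, None],
    "sent_neu": [0.05, 0.2, None],
    "sent_label": ["positive", "negative", "ERROR"],
})

# Count failed rows, then keep only successfully scored sentences
n_errors = int((toy["sent_label"] == "ERROR").sum())
clean = toy[toy["sent_label"] != "ERROR"].copy()

# Illustrative tone measure (assumption): positive minus negative probability
clean["net_tone"] = clean["sent_pos"] - clean["sent_neg"]

# Share of each sentiment label among the scored sentences
label_share = clean["sent_label"].value_counts(normalize=True)
print(n_errors, "error row(s);", len(clean), "scored rows")
```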