<a href="https://colab.research.google.com/github/alexandrastna/AI-for-ESG/blob/main/Notebooks/7_1_Thesis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Thesis 7_1 – GPT-3.5 for Sentiment Analysis
1 - OpenAI Batch Requests: build the sentiment-classification prompt, write every request to a single JSONL file, then write the same requests as 4 JSONL files for OpenAI's Batch API endpoint.

In [None]:
# Initial setup
!pip install --upgrade openai pandas tqdm

import pandas as pd
from tqdm import tqdm
import json

from google.colab import drive
drive.mount('/content/drive')  # Authorize access to Google Drive

# Load data
df = pd.read_csv('/content/drive/MyDrive/Thèse Master/Exports2/df_esg.csv')
print(f"✅ Fichier chargé avec {len(df)} phrases")

# Create JSONL file for batch API
system_prompt = (
    "You are an assistant that performs sentiment classification for ESG-related sentences. "
    "For each input, respond only with one of the following labels: 'positive', 'neutral', or 'negative'. "
    "Use 'positive' if the sentence describes an improvement, benefit, or progress. "
    "Use 'negative' if it describes controversies, problems, or deteriorations. "
    "Use 'neutral' if it is descriptive without clear judgment or consequence."
)

# Save all requests in a single JSONL file (one JSON object per line)
with open("/content/batch_sentiment.jsonl", "w") as f:
    for i, row in tqdm(df.iterrows(), total=len(df)):
        sentence = row["sentence"].strip().replace("\n", " ")
        prompt = {
            "custom_id": f"row-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-3.5-turbo-0125",
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": sentence}
                ],
                "max_tokens": 1,
                "temperature": 0
            }
        }
        f.write(json.dumps(prompt) + "\n")

print("✅ Fichier JSONL prêt pour la Batch API.")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Fichier chargé avec 47272 phrases


100%|██████████| 47272/47272 [00:04<00:00, 10954.37it/s]

✅ Fichier JSONL prêt pour la Batch API.
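
Before uploading, it can help to sanity-check the generated file. The following is a minimal sketch that is not part of the original notebook: `validate_batch_file` is a hypothetical helper, and the 50,000-request and 200 MB limits are assumptions based on OpenAI's Batch API documentation, so verify them against the current docs. The demo writes two sample requests to a temporary file instead of reading `/content/batch_sentiment.jsonl`.

```python
import json
import os
import tempfile

# Assumed Batch API limits; check the current OpenAI documentation.
MAX_REQUESTS = 50_000
MAX_BYTES = 200 * 1024 * 1024


def validate_batch_file(path):
    """Return (request_count, problems) for a Batch API input file."""
    problems = []
    seen_ids = set()
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            count += 1
            try:
                request = json.loads(line)
            except json.JSONDecodeError:
                problems.append(f"line {line_no}: invalid JSON")
                continue
            if not isinstance(request, dict):
                problems.append(f"line {line_no}: not a JSON object")
                continue
            # custom_id is what links each result back to its input row
            custom_id = request.get("custom_id")
            if not isinstance(custom_id, str) or not custom_id:
                problems.append(f"line {line_no}: missing custom_id")
            elif custom_id in seen_ids:
                problems.append(f"line {line_no}: duplicate custom_id {custom_id!r}")
            else:
                seen_ids.add(custom_id)
            body = request.get("body")
            if not isinstance(body, dict) or not body.get("messages"):
                problems.append(f"line {line_no}: body has no messages")
    if count > MAX_REQUESTS:
        problems.append(f"{count} requests exceeds the assumed limit of {MAX_REQUESTS}")
    if os.path.getsize(path) > MAX_BYTES:
        problems.append("file is larger than the assumed 200 MB limit")
    return count, problems


# Demo on two sample requests shaped like the ones written above
sample_requests = [
    {
        "custom_id": f"row-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-3.5-turbo-0125",
            "messages": [{"role": "user", "content": text}],
            "max_tokens": 1,
            "temperature": 0,
        },
    }
    for i, text in enumerate(["Emissions fell this year.", "The board met in May."])
]

with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for request in sample_requests:
        tmp.write(json.dumps(request) + "\n")
    sample_path = tmp.name

request_count, issues = validate_batch_file(sample_path)
os.remove(sample_path)
print(f"{request_count} requests, {len(issues)} problems")
```

On the real file, the call would be `validate_batch_file("/content/batch_sentiment.jsonl")`; an empty problem list only means the file passed these few checks, not that the API will accept it.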





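Split the dataset into 4 chunks and write one JSONL batch file per chunk, so that the 4 batches can be submitted in parallel. The cell rebuilds the requests from the CSV rather than splitting the JSONL file written above.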

In [None]:
import pandas as pd
from tqdm import tqdm
import json
import os

# Reload the ESG sentence dataset
df = pd.read_csv('/content/drive/MyDrive/Thèse Master/Exports2/df_esg.csv')
print(f"✅ Fichier chargé avec {len(df)} phrases")

# Reuse the same sentiment classification prompt
system_prompt = (
    "You are an assistant that performs sentiment classification for ESG-related sentences. "
    "For each input, respond only with one of the following labels: 'positive', 'neutral', or 'negative'. "
    "Use 'positive' if the sentence describes an improvement, benefit, or progress. "
    "Use 'negative' if it describes controversies, problems, or deteriorations. "
    "Use 'neutral' if it is descriptive without clear judgment or consequence."
)

# Create output folder if it doesn’t exist
os.makedirs("/content/batches_sentiment", exist_ok=True)

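# Split the dataset into 4 contiguous parts. Ceiling division keeps this at
# 4 chunks even when len(df) is not a multiple of 4 (n // 4 would yield a 5th).
n = len(df)
chunk_size = max(1, -(-n // 4))
chunks = [df.iloc[i:i + chunk_size] for i in range(0, n, chunk_size)]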

# Create 4 separate batch files in JSONL format
for idx, chunk in enumerate(chunks):
    output_path = f"/content/batches_sentiment/batch_sentiment_part{idx+1}.jsonl"
    with open(output_path, "w") as f:
        for i, row in tqdm(chunk.iterrows(), total=len(chunk), desc=f"Batch {idx+1}"):
            prompt = {
                "custom_id": f"row-{idx}-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-3.5-turbo-0125",
                    "messages": [
                        {"role": "system", "content": system_prompt},
                        # Same cleaning as in the single-file cell above
                        {"role": "user", "content": row["sentence"].strip().replace("\n", " ")}
                    ],
                    "max_tokens": 1,
                    "temperature": 0
                }
            }
            f.write(json.dumps(prompt) + "\n")
    print(f"✅ Fichier batch {idx+1} créé ➤ {output_path}")


✅ Fichier chargé avec 47272 phrases


Batch 1: 100%|██████████| 11818/11818 [00:01<00:00, 10069.69it/s]


✅ Fichier batch 1 créé ➤ /content/batches_sentiment/batch_sentiment_part1.jsonl


Batch 2: 100%|██████████| 11818/11818 [00:01<00:00, 11503.47it/s]


✅ Fichier batch 2 créé ➤ /content/batches_sentiment/batch_sentiment_part2.jsonl


Batch 3: 100%|██████████| 11818/11818 [00:01<00:00, 10733.96it/s]


✅ Fichier batch 3 créé ➤ /content/batches_sentiment/batch_sentiment_part3.jsonl


Batch 4: 100%|██████████| 11818/11818 [00:01<00:00, 10394.90it/s]

✅ Fichier batch 4 créé ➤ /content/batches_sentiment/batch_sentiment_part4.jsonl
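
The 4 files can then be uploaded and submitted as separate batch jobs. The following is a minimal sketch, not the notebook's own submission code: `submit_batch_files` is a hypothetical helper, and it assumes the official `openai` Python client, whose `files.create(..., purpose="batch")` and `batches.create(...)` calls are used here with the `"24h"` completion window listed in the Batch API documentation. The client is passed in as an argument so that the function itself makes no assumption about how credentials are configured.

```python
from pathlib import Path


def submit_batch_files(client, folder, pattern="batch_sentiment_part*.jsonl"):
    """Upload each JSONL file in `folder` and create one batch job per file.

    Returns a dict mapping file name -> batch id, to be kept for polling later.
    """
    batch_ids = {}
    for path in sorted(Path(folder).glob(pattern)):
        # Step 1: upload the JSONL file with the "batch" purpose
        with open(path, "rb") as f:
            uploaded = client.files.create(file=f, purpose="batch")
        # Step 2: create the batch job against the chat completions endpoint,
        # matching the "url" field used in every request line
        batch = client.batches.create(
            input_file_id=uploaded.id,
            endpoint="/v1/chat/completions",
            completion_window="24h",
        )
        batch_ids[path.name] = batch.id
    return batch_ids
```

With the official client, this would be called as `submit_batch_files(OpenAI(), "/content/batches_sentiment")` after `from openai import OpenAI`, assuming the `OPENAI_API_KEY` environment variable is set. Keeping the returned batch ids makes it possible to check each job's status and download its output file once it completes.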



