# Creating an Instruction Tuning Dataset for ESG-specialised Model Development

We'll fine-tune a specialised model to perform ESG (Environmental, Social, and Governance) analysis of news articles. Several publicly accessible ESG datasets label news content and headlines with ESG impact scores and topics. In this tutorial, we'll demonstrate how to convert one of these datasets into an "instruction" format before fine-tuning our model.

For an explanation of instruction tuning requirements, including some model-specific technical details, see: <https://ai.google.dev/responsible/safety_tuning>
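As a rough illustration of what an "instruction" format can look like, the sketch below bundles one training example into instruction, input, and output fields. The field names, the `make_record` helper, and the sample text are assumptions made for illustration; the exact schema depends on the fine-tuning framework you use.

```python
import json

def make_record(instruction: str, input_text: str, output_text: str) -> dict:
    """Bundle one training example as an instruction/input/output triple (hypothetical schema)."""
    return {"instruction": instruction, "input": input_text, "output": output_text}

# Toy example; the wording is illustrative and not taken from any dataset.
record = make_record(
    instruction="Classify the ESG impact type of the news headline.",
    input_text="Company X announces a coral restoration programme.",
    output_text='Impact_Type:"Opportunity"',
)
print(json.dumps(record, ensure_ascii=False, indent=2))
```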

In [6]:
import json
import csv

In [None]:
!git clone https://github.com/ymntseng/DynamicESG.git
!mv DynamicESG/ ../data

In [3]:
json_file = '../data/DynamicESG_dataset.json'
# The headlines are Chinese text, so read the file explicitly as UTF-8
with open(json_file, 'r', encoding='utf-8') as f:
    data = json.load(f)

In [10]:
keys = list(data[0].keys())
keys

['URL', 'News_Headline', 'Impact_Type', 'Impact_Duration', 'ESG_Category']

In [5]:
data[0]

{'URL': 'https://esg.businesstoday.com.tw/article/category/180687/post/202211170005',
 'News_Headline': '台達電前進COP27！LED照養試管珊瑚，台達珊瑚復育傳入聯合國',
 'Impact_Type': ['Opportunity', 'Risk'],
 'Impact_Duration': ['>5', '>5'],
 'ESG_Category': [['E01', 'E04'], ['E04', 'E07']]}
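In the record above, `Impact_Type`, `Impact_Duration`, and `ESG_Category` appear to be parallel lists with one entry per annotated impact. Assuming that reading is right, a minimal sketch for pairing them up could look like this (`split_impacts` is a hypothetical helper, and the sample values are copied from the output above):

```python
def split_impacts(record: dict) -> list[dict]:
    """Pair up the parallel label lists into one dict per annotated impact."""
    return [
        {"Impact_Type": t, "Impact_Duration": d, "ESG_Category": c}
        for t, d, c in zip(
            record["Impact_Type"], record["Impact_Duration"], record["ESG_Category"]
        )
    ]

# Label fields copied from data[0] as printed above
sample = {
    "Impact_Type": ["Opportunity", "Risk"],
    "Impact_Duration": [">5", ">5"],
    "ESG_Category": [["E01", "E04"], ["E04", "E07"]],
}
for impact in split_impacts(sample):
    print(impact)
```

Note that `zip` stops at the shortest list, so a record with mismatched list lengths would be truncated silently; a length check may be worth adding before relying on this.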

In [11]:
for i, record in enumerate(data):
    # Build the instruction, then append this record's own headline as the query
    prompt_template = f'You are an ESG specialist with the expertise to identify {keys[2]}, {keys[3]}, and {keys[4]}. {keys[2]} can be only one of three categories: "Opportunity", "Risk", and "Cannot_Distinguish". {keys[3]} can be only one of three categories: "<2", "2~5" or ">5". The {keys[4]} can be only one of nine categories: "Climate_Change", "Natural_Capital", "Pollution_Waste", "Env_Opportunity", "Human_Capital", "Product_Liability", "Stakeholder_Opposition", "Social_Opportunity" and "Corporate Governance". Based on the given {keys[1]} define {keys[2]}, {keys[3]}, and {keys[4]}. Your answer should only contain a key-value pair like the example below.\nExample: "News_Headline":"台達電前進COP27！LED照養試管珊瑚，台達珊瑚復育傳入聯合國" Answer: {keys[2]}:"Opportunity" {keys[3]}:">5" {keys[4]}:"Climate_Change"\n"{keys[1]}":{record[keys[1]]} Answer:'
    if i < 2:
        print(prompt_template)  # preview only the first two prompts
    # add this prompt to every record, not just the previewed ones
    record['instruction'] = prompt_template

You are an ESG specialist with the expertise to identify Impact_Type, Impact_Duration, and ESG_Category. Impact_Type can be only one of three categories: "Opportunity", "Risk", and "Cannot_Distinguish". Impact_Duration can be only one of three categories: "<2", "2~5" or ">5". The ESG_Category can be only one of nine categories: "Climate_Change", "Natural_Capital", "Pollution_Waste", "Env_Opportunity", "Human_Capital", "Product_Liability", "Stakeholder_Opposition", "Social_Opportunity" and "Corporate Governance". Based on the given News_Headline define Impact_Type, Impact_Duration, and ESG_Category. Your answer should only contain a key-value pair like the example below.
Example: "News_Headline":"台達電前進COP27！LED照養試管珊瑚，台達珊瑚復育傳入聯合國" Answer: Impact_Type:"Opportunity" Impact_Duration:">5" ESG_Category:"Climate_Change"
"News_Headline":台達電前進COP27！LED照養試管珊瑚，台達珊瑚復育傳入聯合國 Answer:
You are an ESG specialist with the expertise to identify Impact_Type, Impact_Duration, and ESG_Category. Impact_Type can 
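An instruction record also needs the expected answer that the model should learn to produce. The sketch below assumes the answer should mirror the key-value layout of the example in the prompt and that only the first annotated impact is used; both are assumptions, and `format_answer` is a hypothetical helper.

```python
def format_answer(record: dict) -> str:
    """Render the first annotated impact in the prompt's key:"value" layout."""
    impact_type = record["Impact_Type"][0]
    duration = record["Impact_Duration"][0]
    # ESG_Category holds a list of codes per impact; join them with commas.
    categories = ",".join(record["ESG_Category"][0])
    return (
        f'Impact_Type:"{impact_type}" '
        f'Impact_Duration:"{duration}" '
        f'ESG_Category:"{categories}"'
    )

# Label fields copied from the first record of the dataset
sample = {
    "Impact_Type": ["Opportunity", "Risk"],
    "Impact_Duration": [">5", ">5"],
    "ESG_Category": [["E01", "E04"], ["E04", "E07"]],
}
print(format_answer(sample))
# Impact_Type:"Opportunity" Impact_Duration:">5" ESG_Category:"E01,E04"
```

The dataset stores category codes such as `E01`, while the prompt lists category names. The mapping between the two is not given here, so the sketch leaves the codes untouched.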

In [12]:
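csv_file = '../data/DynamicESG_instruction.csv'

# Save the records to CSV, including the new 'instruction' column.
# newline='' avoids blank rows on Windows; UTF-8 preserves the Chinese headlines.
columns = keys + ['instruction']
with open(csv_file, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    for d in data:
        # Records without an instruction get an empty cell
        writer.writerow([d.get(col, '') for col in columns])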
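Because `csv.writer` stringifies the nested label lists, reading them back requires extra parsing. As an alternative sketch (not part of the original workflow), JSON Lines keeps the nesting intact. The file name, the helpers `write_jsonl` and `read_jsonl`, and the toy record are assumptions for illustration.

```python
import json
import os
import tempfile

def write_jsonl(records: list[dict], path: str) -> None:
    """Write one JSON object per line; ensure_ascii=False keeps non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def read_jsonl(path: str) -> list[dict]:
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Toy record standing in for an entry of `data`
records = [
    {
        "News_Headline": "Sample headline",
        "Impact_Type": ["Opportunity", "Risk"],
        "ESG_Category": [["E01", "E04"], ["E04", "E07"]],
        "instruction": "Sample instruction",
    }
]
path = os.path.join(tempfile.mkdtemp(), "DynamicESG_instruction.jsonl")
write_jsonl(records, path)
print(read_jsonl(path) == records)  # nested lists survive the round trip
```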

In [18]:
# Let's explore a different dataset, this time it is a Kaggle dataset
!kaggle datasets download -d equintel/dax-esg-media-dataset
!mv dax-esg-media-dataset.zip ../data
!unzip ../data/dax-esg-media-dataset.zip -d ../data/dax-esg-media-dataset
!rm ../data/dax-esg-media-dataset.zip

Dataset URL: https://www.kaggle.com/datasets/equintel/dax-esg-media-dataset
License(s): unknown
Downloading dax-esg-media-dataset.zip to /Users/asabuncuoglu/Documents/equitable-ai-cookbook/docs/development
Archive:  ../data/dax-esg-media-dataset.zip
  inflating: ../data/dax-esg-media-dataset/esg_documents_for_dax_companies.csv  
  inflating: ../data/dax-esg-media-dataset/sdg_descriptions_with_targetsText.csv  


In [20]:
# Read the pipe-delimited CSV file with pandas.
# Note: on_bad_lines='skip' silently drops any row that fails to parse.
import pandas as pd
csv_file = '../data/dax-esg-media-dataset/esg_documents_for_dax_companies.csv'
df_esg = pd.read_csv(csv_file, on_bad_lines='skip', delimiter="|", encoding='latin1')

In [21]:
df_esg.describe()

|  | Unnamed: 0 | internal |
|---|---|---|
| count | 11548.0 | 11548.0 |
| mean | 5682.517406 | 0.007967 |
| std | 3332.037359 | 0.088904 |
| min | 0.0 | 0.0 |
| 25% | 2794.75 | 0.0 |
| 50% | 5681.5 | 0.0 |
| 75% | 8568.25 | 0.0 |
| max | 11455.0 | 1.0 |
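If `internal` is a 0/1 flag (an assumption, suggested by its minimum of 0 and maximum of 1), its mean is the share of flagged rows, so the summary above implies roughly how many documents are marked internal. A quick back-of-the-envelope check:

```python
# Values taken from the describe() summary above
count = 11548
mean_internal = 0.007967

# For a 0/1 column, mean * count recovers the number of rows equal to 1
approx_internal_rows = round(count * mean_internal)
print(approx_internal_rows)  # 92
```

Under that assumption, fewer than 1% of the rows are internal documents, so the dataset consists mostly of external material.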


In [22]:
df_esg.head(2)

|  | Unnamed: 0 | company | content | datatype | date | domain | esg_topics | internal | symbol | title | url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Beiersdorf AG | Sustainability Highlight Report CARE BEYOND SK... | sustainability_report | 2021-03-31 |  | ['CleanWater', 'GHGEmission', 'ProductLiabilit... | 1 | BEI | BeiersdorfAG Sustainability Report 2021 |  |
| 1 | 3 | Deutsche Telekom AG | Corporate Responsibility Report 2021 2 Content... | sustainability_report | 2021-03-31 |  | ['DataSecurity', 'Iso50001', 'GlobalWarming', ... | 1 | DTE | DeutscheTelekomAG Sustainability Report 2021 |  |
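The `esg_topics` cells look like stringified Python lists rather than real lists. Assuming that is how the column is stored, a minimal sketch for converting a cell back into a list could use `ast.literal_eval` (`parse_topics` is a hypothetical helper, and the sample string reuses topic names visible in the preview above):

```python
import ast

def parse_topics(cell) -> list:
    """Turn a stringified list such as "['A', 'B']" into a real list; return [] on failure."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return []  # malformed, empty, or non-string cell (e.g. NaN)
    return list(value) if isinstance(value, (list, tuple)) else []

print(parse_topics("['CleanWater', 'GHGEmission']"))
# ['CleanWater', 'GHGEmission']
```

On the full DataFrame this could be applied with `df_esg['esg_topics'].apply(parse_topics)`.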


In [24]:
# read the csv file with pandas
import pandas as pd
csv_file = '../data/dax-esg-media-dataset/sdg_descriptions_with_targetsText.csv'
df_sdg = pd.read_csv(csv_file, on_bad_lines='skip', encoding='latin1')

In [25]:
df_sdg.describe()

|  | id |
|---|---|
| count | 17.0 |
| mean | 9.0 |
| std | 5.049752 |
| min | 1.0 |
| 25% | 5.0 |
| 50% | 9.0 |
| 75% | 13.0 |
| max | 17.0 |


In [26]:
df_sdg.head(2)

|  | id | name | description | targets | targets_json_array | progress |
|---|---|---|---|---|---|---|
| 0 | 1 | No Poverty | End poverty in all its forms everywhere | ['1.1', 'By 2030, eradicate extreme poverty fo... | [{"target":"1.1","description":"By 2030, eradi... | ['The impact of the COVID-19 pandemic reversed... |
| 1 | 2 | Zero Hunger | End hunger, achieve food security and improved... | ['2.1', 'By 2030, end hunger and ensure access... | [{"target":"2.1","description":"By 2030, end h... | ['Between 2014 and the onset of the pandemic, ... |
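The `targets_json_array` column appears to hold a JSON array of objects with `target` and `description` keys. Assuming that structure holds for every row, a minimal sketch for turning one cell into a lookup table could look like this (`parse_targets` is a hypothetical helper, and the sample description is a placeholder because the preview above is truncated):

```python
import json

def parse_targets(cell: str) -> dict:
    """Map each target id to its description from a JSON array of {"target", "description"} objects."""
    return {item["target"]: item["description"] for item in json.loads(cell)}

# Placeholder cell; real cells carry the full target descriptions
sample_cell = '[{"target": "1.1", "description": "(description text)"}]'
print(parse_targets(sample_cell))
# {'1.1': '(description text)'}
```

A lookup like this could later help attach SDG target descriptions to prompts, though how the two files should be joined is not specified here.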
