# Notebook Tasks

1. Create a tokenizer and initialize a model with the __PATH__ `nbroad/ESG-BERT`.
2. Use the `TextClassificationPipeline` from Hugging Face.
3. Initialize the pipeline with the model and tokenizer, then run it on the first row of the dataframe.
4. Use the `Pandas` function `idxmax()` to find the label with the highest score.

# Imports

In [1]:
import pandas as pd

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TextClassificationPipeline

#ignore warnings
import warnings 
warnings.filterwarnings('ignore')



## Data Path

In [2]:
PATH = 'data/'
FILENAME = 'report_data.csv'

## Load the Data

In [3]:
# we load our extracted data from the previous milestone
text = pd.read_csv(PATH + FILENAME)
text.head()

| Unnamed: 0 | text |
|---|---|
| 0 | We believe that well-designed products have lo... |
| 1 | Our energy efficiency goals extend well beyond... |
| 2 | Last year, one of our manufacturers in Guangzh... |
| 3 | We’re at a pivotal moment in addressing climat... |
| 4 | Our goal to reach carbon neutrality by 2030 re... |
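
The `Unnamed: 0` column in the preview is what pandas typically produces when a CSV was written together with its index and then read back without `index_col`. The sketch below assumes that is how `report_data.csv` was created, which the notebook does not state; the in-memory CSV and its two rows are placeholders, not the real data.

```python
import io

import pandas as pd

# Hypothetical stand-in for report_data.csv: the first header field is empty
# because the index was written to the file without a name.
csv_text = ",text\n0,first paragraph\n1,second paragraph\n"

# Default read: the unnamed index column appears as 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 uses that column as the index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(clean.columns))
```

Either form works for the rest of the notebook, since only the `text` column is used.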


## Load the model from Hugging Face

1. Create a `tokenizer` with `AutoTokenizer.from_pretrained` ([documentation](https://huggingface.co/docs/transformers/v4.21.1/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained)) using the __PATH__ `'nbroad/ESG-BERT'`.
2. Initialize the `model` with `AutoModelForSequenceClassification.from_pretrained` using the same __PATH__ `'nbroad/ESG-BERT'`.

In [4]:
tokenizer = AutoTokenizer.from_pretrained("nbroad/ESG-BERT")

model = AutoModelForSequenceClassification.from_pretrained("nbroad/ESG-BERT")

### Info about the model's tokenizer

In [5]:
# get info about the tokenizer
print(tokenizer.is_fast)
print()
print(tokenizer)

True

BertTokenizerFast(name_or_path='nbroad/ESG-BERT', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


Use the `TextClassificationPipeline` from Hugging Face; see the [documentation](https://huggingface.co/docs/transformers/v4.21.1/en/main_classes/pipelines#transformers.TextClassificationPipeline).

In [6]:
# initialize the pipeline for the text classification
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True)
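
With `return_all_scores=True` the pipeline returns one score per label rather than only the top one. Assuming the pipeline's documented default for a model with several labels, those scores are a softmax over the model's raw logits, so they are positive and sum to 1 across all labels. A minimal pure-Python sketch of that step follows; the `softmax` helper and the logits are illustrative and are not taken from ESG-BERT.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtract the largest logit for numerical stability; the result is unchanged.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three labels (illustrative values only).
logits = [0.5, 1.2, 6.0]
scores = softmax(logits)
print(scores)
print(sum(scores))
```

One logit that is much larger than the rest pushes nearly all of the probability mass onto a single label.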

Access the first row of the dataframe's `text` column and pass it to `pipe()`, i.e. `pipe(text['text'][0])`.

In [7]:
test = pipe(text['text'][0])
test

[[{'label': 'Business_Ethics', 'score': 0.0005536633543670177},
  {'label': 'Data_Security', 'score': 0.0005997914122417569},
  {'label': 'Access_And_Affordability', 'score': 0.0010173521004617214},
  {'label': 'Business_Model_Resilience', 'score': 0.002356546698138118},
  {'label': 'Competitive_Behavior', 'score': 0.0005852491594851017},
  {'label': 'Critical_Incident_Risk_Management',
   'score': 0.0008547173929400742},
  {'label': 'Customer_Welfare', 'score': 0.002451641019433737},
  {'label': 'Director_Removal', 'score': 0.001316296518780291},
  {'label': 'Employee_Engagement_Inclusion_And_Diversity',
   'score': 0.0008579465211369097},
  {'label': 'Employee_Health_And_Safety', 'score': 0.001422567875124514},
  {'label': 'Human_Rights_And_Community_Relations',
   'score': 0.0006334123318083584},
  {'label': 'Labor_Practices', 'score': 0.0004994926857762039},
  {'label': 'Management_Of_Legal_And_Regulatory_Framework',
   'score': 0.0011158727575093508},
  {'label': 'Physical_Impacts
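
The result is a list with one inner list per input text, and each inner list holds one dictionary with a `label` and a `score` per class. As a sketch of how the top class can be read off without pandas, the snippet below uses a shortened stand-in for `test` (three entries from the output above, scores rounded); `top` is a name introduced here for illustration.

```python
# Shortened stand-in for the pipeline output: one input text, three labels.
test = [[
    {'label': 'Business_Ethics', 'score': 0.000554},
    {'label': 'Data_Security', 'score': 0.000600},
    {'label': 'Customer_Welfare', 'score': 0.002452},
]]

# test[0] holds the per-label scores for the first (and only) input text.
top = max(test[0], key=lambda entry: entry['score'])
print(top['label'], top['score'])
```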

Use the `Pandas` function `idxmax()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.idxmax.html)). It returns the index of the row that holds the maximum value, not the value itself; that index is then used to look up the top label and its score in each dataframe for the print-out.
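
As a small illustration of `idxmax()`, the sketch below builds a DataFrame from three rows copied from the classification output shown further down and calls `Series.idxmax()` on the `score` column. Selecting by column name rather than by position is a choice made for this example.

```python
import pandas as pd

# Three rows taken from the classification output (subset for brevity).
df = pd.DataFrame({
    'label': ['Customer_Welfare', 'Product_Quality_And_Safety',
              'Product_Design_And_Lifecycle_Management'],
    'score': [0.002452, 0.003557, 0.9679],
})

# idxmax() returns the index label of the row with the largest score,
# not the score itself.
max_index = df['score'].idxmax()
max_label = df.loc[max_index, 'label']
max_score = df.loc[max_index, 'score']
print(max_index, max_label, max_score)
```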

In [9]:
# iterate over the rows of the text dataframe, applying the pipeline to each row, which returns a list of lists
for i in range(text.shape[0]):
    # flatten the list of lists into a single list, store it in result, then convert it to a dataframe
    result = sum(pipe(text.text[i]), [])
    df = pd.DataFrame(result)
    print(df)
    print()

    # find the label of the max score
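    # select by column name; positional [0]/[1] lookups on labelled Series are deprecated in recent pandas
    max_index = df['score'].idxmax()
    max_label = df.loc[max_index, 'label']
    max_score = df.loc[max_index, 'score']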

    #print(f"index of max score: {max_index}")
    #print(f"label of max score: {max_label}")
    #print(f"max score: {max_score}")

    print('\n dataframe {}'.format(i))
    print(df)
    print('\n',39 * '--')
    print(f'Max value from classification: label {max_label} with score {round(max_score, 4)}')
    print(40 *'==') 

                                           label     score
0                                Business_Ethics  0.000554
1                                  Data_Security  0.000600
2                       Access_And_Affordability  0.001017
3                      Business_Model_Resilience  0.002357
4                           Competitive_Behavior  0.000585
5              Critical_Incident_Risk_Management  0.000855
6                               Customer_Welfare  0.002452
7                               Director_Removal  0.001316
8    Employee_Engagement_Inclusion_And_Diversity  0.000858
9                     Employee_Health_And_Safety  0.001423
10          Human_Rights_And_Community_Relations  0.000633
11                               Labor_Practices  0.000499
12  Management_Of_Legal_And_Regulatory_Framework  0.001116
13            Physical_Impacts_Of_Climate_Change  0.000801
14                    Product_Quality_And_Safety  0.003557
15       Product_Design_And_Lifecycle_Management  0.9679