# Evaluation of the Minutes (procès-verbaux) of the Commune of Nyon
<br>

---
### Table of Contents
1. [Model 1: camembert-base](#cb)
2. [Model 2: camembert-base-long](#cb-long)
3. [Model 3: camembert-large](#cbl)
4. [Model 4: camembert-large-long](#cbl-long)

---

In [1]:
import torch
from transformers import AutoTokenizer

import models
import utilsNb as utils

ID_TO_LABEL = {
    0: 'non-esg',
    1: 'environnemental',
    2: 'social',
    3: 'gouvernance'
}
pdf_file_path = "../data/ccpv230306.pdf"
LABEL_TO_ID = {v: k for k, v in ID_TO_LABEL.items()}
NUM_LABELS = len(ID_TO_LABEL)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# reload modules
%load_ext autoreload
%autoreload 2

## Load the minutes (PV)

In [2]:
pvp = utils.PvParser(pdf_file_path)

text = pvp.read_pv()

In [3]:
pv_df_1024 = pvp.pv_to_df(chunk_size=1024)
pv_df_512 = pvp.pv_to_df(chunk_size=512)
pv_df_1024

Unnamed: 0,section_number,text
0,1,1. \n\nAppel : \n\n84 Conseillères et Conseil...
1,2,2. \n\nProcès-verbal de la séance du 30 janvie...
2,3,3. \n\nApprobation de l’ordre du jour \n\nM. l...
3,4,4. \n\nCommunications du Bureau \n\n• M. le ...
4,4,"• Pour remplacer M. Olivier Riesen, PLR, ..."
...,...,...
85,19,19. \n\nInterpellation de M. Joël Vetter intit...
86,20,20. \n\nDivers en rapport avec la séance \n\nM...
87,20,Elle vient de relire ce préavis et \ncom...
88,20,C’est peut-être une bonne \nmanière de fa...
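
The implementation of `pv_to_df` is not shown in this notebook. As a rough illustration of what chunking by `chunk_size` might look like, the sketch below splits each numbered section into pieces of at most `chunk_size` whitespace-separated words and keeps the section number on every piece. The helper name `chunk_sections` and the word-based length measure are assumptions made for this example; the real parser may count characters or tokens instead.

```python
from typing import Dict, List


def chunk_sections(sections: Dict[int, str], chunk_size: int) -> List[dict]:
    """Split each section into chunks of at most `chunk_size` words.

    Hypothetical helper: length is measured in whitespace-separated words,
    which only approximates what a tokenizer would count.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    rows = []
    for section_number, text in sections.items():
        words = text.split()
        # Walk through the section in steps of `chunk_size` words.
        for start in range(0, len(words), chunk_size):
            rows.append({
                "section_number": section_number,
                "text": " ".join(words[start:start + chunk_size]),
            })
    return rows


# Toy input, not the real PV content.
sections = {
    1: "Appel : 84 Conseillères et Conseillers",
    2: "Procès-verbal de la séance",
}
rows = chunk_sections(sections, chunk_size=4)
for row in rows:
    print(row["section_number"], row["text"])
```

With a layout like this, a long section yields several consecutive rows that share one `section_number`, which matches the repeated section numbers visible in the table above.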


<a id="cb"></a>

## Model camembert-base

Fine-tuned `camembert-base` model, with a maximum input sequence length of 512 tokens.

In [4]:
CHECKPOINT = 'camembert-base'
TOKENIZER = AutoTokenizer.from_pretrained('camembert-base')

cb_sd_path ='../models/model-cb/run1/models/state_dict/model_cb_sd.pt'
model = models.ModelCB(checkpoint=CHECKPOINT, num_labels=NUM_LABELS, id2label=ID_TO_LABEL)
model.load_state_dict(torch.load(cb_sd_path, map_location=torch.device('cpu')))
print('\nModel loaded for evaluation')

cb_predictions = utils.predict_df(model,TOKENIZER,pv_df_512,tokenizer_max_len=512)

cb_predictions.to_csv("./predictions/model_cb.csv", index=False, encoding='utf-8')

cb_predictions.head()

Some weights of the model checkpoint at camembert-base were not used when initializing CamembertModel: ['lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing CamembertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Model loaded for evaluation


100%|██████████| 182/182 [00:15<00:00, 11.75it/s]


Unnamed: 0,section_number,text,predicted_class,probability,probas_dict
0,1,1. \n\nAppel : \n\n84 Conseillères et Conseil...,environnemental,0.709,non-esg: 0.186\nenvironnemental: 0.709\nsocial...
1,1,"Jean-Marc DUCRY, huissier \n\nExcusés : \n\nDA...",gouvernance,0.984,non-esg: 0.001\nenvironnemental: 0.0\nsocial: ...
2,2,2. \n\nProcès-verbal de la séance du 30 janvie...,gouvernance,0.831,non-esg: 0.052\nenvironnemental: 0.043\nsocial...
3,3,3. \n\nApprobation de l’ordre du jour \n\nM. l...,environnemental,0.998,non-esg: 0.002\nenvironnemental: 0.998\nsocial...
4,3,Les modifications proposées sont acceptées à l...,non-esg,0.883,non-esg: 0.883\nenvironnemental: 0.009\nsocial...
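
The `predicted_class`, `probability`, and `probas_dict` columns are consistent with applying a softmax to the four class scores and reporting the top class. `utils.predict_df` is not shown here, so the following is only a sketch of that reading; `summarize_logits` and the example logits are hypothetical.

```python
import math
from typing import Dict, List, Tuple

ID_TO_LABEL = {0: "non-esg", 1: "environnemental", 2: "social", 3: "gouvernance"}


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summarize_logits(logits: List[float]) -> Tuple[str, float, Dict[str, float]]:
    """Return (top label, its probability, per-label probabilities)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    probas = {ID_TO_LABEL[i]: round(p, 3) for i, p in enumerate(probs)}
    return ID_TO_LABEL[best], round(probs[best], 3), probas


# Made-up logits for one chunk; real values would come from the model.
label, probability, probas = summarize_logits([0.2, 1.5, -0.7, -1.0])
print(label, probability)
print("\n".join(f"{name}: {value}" for name, value in probas.items()))
```

Under this reading, `probability` is simply the entry of `probas_dict` that belongs to `predicted_class`.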


<a id="cb-long"></a>

## Model camembert-base-long

Fine-tuned `camembert-base` model, with the maximum input sequence length increased to 1024 tokens.

In [5]:
CHECKPOINT = 'camembert-base'
TOKENIZER = AutoTokenizer.from_pretrained('camembert-base')

cb_long_sd_path ='../models/model-cb-long/run1/models/state_dict/camembert_long_state_dict.pt'
model = models.ModelCBLong(checkpoint=CHECKPOINT, num_labels=NUM_LABELS, id2label=ID_TO_LABEL, max_length=1024)
model.load_state_dict(torch.load(cb_long_sd_path, map_location=torch.device('cpu')))
print('\nModel loaded for evaluation')

cb_long_predictions = utils.predict_df(model,TOKENIZER,pv_df_1024,tokenizer_max_len=1024)

cb_long_predictions.to_csv("./predictions/model_cb_long.csv", index=False, encoding='utf-8')

cb_long_predictions.head()

Some weights of the model checkpoint at camembert-base were not used when initializing CamembertModel: ['lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing CamembertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Model loaded for evaluation


100%|██████████| 90/90 [00:11<00:00,  7.61it/s]


Unnamed: 0,section_number,text,predicted_class,probability,probas_dict
0,1,1. \n\nAppel : \n\n84 Conseillères et Conseil...,non-esg,0.948,non-esg: 0.948\nenvironnemental: 0.001\nsocial...
1,2,2. \n\nProcès-verbal de la séance du 30 janvie...,non-esg,0.621,non-esg: 0.621\nenvironnemental: 0.042\nsocial...
2,3,3. \n\nApprobation de l’ordre du jour \n\nM. l...,environnemental,0.995,non-esg: 0.0\nenvironnemental: 0.995\nsocial: ...
3,4,4. \n\nCommunications du Bureau \n\n• M. le ...,non-esg,0.995,non-esg: 0.995\nenvironnemental: 0.0\nsocial: ...
4,4,"• Pour remplacer M. Olivier Riesen, PLR, ...",non-esg,0.927,non-esg: 0.927\nenvironnemental: 0.001\nsocial...


<a id="cbl"></a>

## Model camembert-large

Fine-tuned `camembert-large` model, with a maximum input sequence length of 512 tokens.

In [6]:
CHECKPOINT = 'camembert/camembert-large'
TOKENIZER = AutoTokenizer.from_pretrained('camembert/camembert-large')

# Use a dedicated name so the camembert-base path is not silently overwritten
cbl_sd_path = '../models/model-cbl/run1/models/state_dict/cb_large_model1_sd.pt'
model = models.ModelCBL(checkpoint=CHECKPOINT, num_labels=NUM_LABELS, id2label=ID_TO_LABEL)
model.load_state_dict(torch.load(cbl_sd_path, map_location=torch.device('cpu')))
print("Model loaded")

cbl_predictions = utils.predict_df(model,TOKENIZER,pv_df_512,tokenizer_max_len=512)

cbl_predictions.to_csv("./predictions/model_cbl.csv", index=False, encoding='utf-8')

cbl_predictions.head()

Some weights of the model checkpoint at camembert/camembert-large were not used when initializing CamembertModel: ['lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing CamembertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Model loaded


100%|██████████| 182/182 [00:41<00:00,  4.34it/s]


Unnamed: 0,section_number,text,predicted_class,probability,probas_dict
0,1,1. \n\nAppel : \n\n84 Conseillères et Conseil...,non-esg,1.0,non-esg: 1.0\nenvironnemental: 0.0\nsocial: 0....
1,1,"Jean-Marc DUCRY, huissier \n\nExcusés : \n\nDA...",social,0.719,non-esg: 0.278\nenvironnemental: 0.001\nsocial...
2,2,2. \n\nProcès-verbal de la séance du 30 janvie...,non-esg,0.986,non-esg: 0.986\nenvironnemental: 0.002\nsocial...
3,3,3. \n\nApprobation de l’ordre du jour \n\nM. l...,social,0.571,non-esg: 0.266\nenvironnemental: 0.154\nsocial...
4,3,Les modifications proposées sont acceptées à l...,non-esg,0.996,non-esg: 0.996\nenvironnemental: 0.002\nsocial...



<a id="cbl-long"></a>

## Model camembert-large-long

Fine-tuned `camembert-large` model, with the maximum input sequence length increased to 1024 tokens.

In [7]:
CHECKPOINT = 'camembert/camembert-large'
TOKENIZER = AutoTokenizer.from_pretrained('camembert/camembert-large')

# Use a dedicated name so the camembert-base-long path is not silently overwritten
cbl_long_sd_path = '../models/model-cbl-long/run1/models/state_dict/cbl_model_long_sd.pt'
model = models.ModelCBLlong(checkpoint=CHECKPOINT, num_labels=NUM_LABELS, id2label=ID_TO_LABEL, max_length=1024)
model.load_state_dict(torch.load(cbl_long_sd_path, map_location=torch.device('cpu')))
print("Model loaded")

cbl_long_predictions = utils.predict_df(model,TOKENIZER,pv_df_1024,tokenizer_max_len=1024)

cbl_long_predictions.to_csv("./predictions/model_cbl_long.csv", index=False, encoding='utf-8')

cbl_long_predictions.head()

Some weights of the model checkpoint at camembert/camembert-large were not used when initializing CamembertModel: ['lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing CamembertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Model loaded


100%|██████████| 90/90 [00:38<00:00,  2.33it/s]


Unnamed: 0,section_number,text,predicted_class,probability,probas_dict
0,1,1. \n\nAppel : \n\n84 Conseillères et Conseil...,non-esg,0.61,non-esg: 0.61\nenvironnemental: 0.057\nsocial:...
1,2,2. \n\nProcès-verbal de la séance du 30 janvie...,non-esg,0.896,non-esg: 0.896\nenvironnemental: 0.005\nsocial...
2,3,3. \n\nApprobation de l’ordre du jour \n\nM. l...,non-esg,0.762,non-esg: 0.762\nenvironnemental: 0.01\nsocial:...
3,4,4. \n\nCommunications du Bureau \n\n• M. le ...,environnemental,0.957,non-esg: 0.002\nenvironnemental: 0.957\nsocial...
4,4,"• Pour remplacer M. Olivier Riesen, PLR, ...",environnemental,0.999,non-esg: 0.0\nenvironnemental: 0.999\nsocial: ...


## Comparison of models

In [20]:
# Put the two 512-token models side by side; both ran on pv_df_512, so rows align
cb = cb_predictions.copy()
cb.drop(columns=['probability', 'probas_dict'], inplace=True)
cb.rename(columns={'predicted_class': 'cb_preds'}, inplace=True)
cb['cbl_preds'] = cbl_predictions['predicted_class']


cb.to_csv("./predictions/cb_models_comparisons.csv", index=False, encoding='utf-8')
cb.head()

Unnamed: 0,section_number,text,cb_preds,cbl_preds
0,1,1. \n\nAppel : \n\n84 Conseillères et Conseil...,environnemental,non-esg
1,1,"Jean-Marc DUCRY, huissier \n\nExcusés : \n\nDA...",gouvernance,social
2,2,2. \n\nProcès-verbal de la séance du 30 janvie...,gouvernance,non-esg
3,3,3. \n\nApprobation de l’ordre du jour \n\nM. l...,environnemental,social
4,3,Les modifications proposées sont acceptées à l...,non-esg,non-esg


In [21]:
# Put the two 1024-token models side by side; both ran on pv_df_1024, so rows align
cb_long = cb_long_predictions.copy()
cb_long.drop(columns=['probability', 'probas_dict'], inplace=True)
cb_long.rename(columns={'predicted_class': 'cb_long_preds'}, inplace=True)
cb_long['cbl_long_preds'] = cbl_long_predictions['predicted_class']


cb_long.to_csv("./predictions/cb_long_models_comparisons.csv", index=False, encoding='utf-8')
cb_long.head()

Unnamed: 0,section_number,text,cb_long_preds,cbl_long_preds
0,1,1. \n\nAppel : \n\n84 Conseillères et Conseil...,non-esg,non-esg
1,2,2. \n\nProcès-verbal de la séance du 30 janvie...,non-esg,non-esg
2,3,3. \n\nApprobation de l’ordre du jour \n\nM. l...,environnemental,non-esg
3,4,4. \n\nCommunications du Bureau \n\n• M. le ...,non-esg,environnemental
4,4,"• Pour remplacer M. Olivier Riesen, PLR, ...",non-esg,environnemental
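
The side-by-side tables can be summarised with an agreement rate and a count of the disagreeing label pairs. The sketch below uses plain lists copied from the five rows displayed by `cb.head()` and `cb_long.head()`, so its figures describe those five rows only and say nothing about the full document. The helper names `agreement_rate` and `disagreement_pairs` are introduced here for illustration.

```python
from collections import Counter
from typing import List


def agreement_rate(a: List[str], b: List[str]) -> float:
    """Fraction of positions where both models predict the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("prediction lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def disagreement_pairs(a: List[str], b: List[str]) -> Counter:
    """Count (label_a, label_b) pairs for rows where the models differ."""
    return Counter((x, y) for x, y in zip(a, b) if x != y)


# First five rows of each comparison table shown above.
cb_preds = ["environnemental", "gouvernance", "gouvernance", "environnemental", "non-esg"]
cbl_preds = ["non-esg", "social", "non-esg", "social", "non-esg"]
cb_long_preds = ["non-esg", "non-esg", "environnemental", "non-esg", "non-esg"]
cbl_long_preds = ["non-esg", "non-esg", "non-esg", "environnemental", "environnemental"]

print("512-token models: ", agreement_rate(cb_preds, cbl_preds))
print("1024-token models:", agreement_rate(cb_long_preds, cbl_long_preds))
print(disagreement_pairs(cb_preds, cbl_preds).most_common())
```

On the full DataFrames, the same rate could be obtained with `(cb['cb_preds'] == cb['cbl_preds']).mean()`.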
