# Sentiment regression demo

AUTHOR: Michal Mochtak (michal.mochtak@ru.nl), Peter Rupnik (peter.rupnik@ijs.si), Nikola Ljubešić

DATE: 2024-06-24

---

In this notebook we show how to annotate a sample file from ParlaMint with sentiment scores.

On the first run, the data will be downloaded from the internet. In the cells below we work with a single country, Bosnia and Herzegovina (BA).

Download a single country from [ParlaMint-4.0](https://www.clarin.si/repository/xmlui/handle/11356/1859):

In [6]:
! curl --remote-name-all https://www.clarin.si/repository/xmlui/bitstream/handle/11356/1859/ParlaMint-BA.tgz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 82.7M  100 82.7M    0     0  58.7M      0  0:00:01  0:00:01 --:--:-- 58.7M


Uncompress the files:

In [7]:
! tar -xzvf ParlaMint-BA.tgz

README-BA.md
ParlaMint-BA.TEI/
ParlaMint-BA.TEI/2014/
ParlaMint-BA.TEI/2014/ParlaMint-BA_2014-07-25-0.xml
ParlaMint-BA.TEI/2014/ParlaMint-BA_2014-07-24-0.xml
ParlaMint-BA.TEI/2014/ParlaMint-BA_2014-06-06-0.xml
ParlaMint-BA.TEI/2014/ParlaMint-BA_2014-12-09-0.xml
ParlaMint-BA.TEI/2014/ParlaMint-BA_2014-09-10-0.xml
ParlaMint-BA.TEI/2014/ParlaMint-BA_2014-02-06-0.xml
ParlaMint-BA.TEI/2014/ParlaMint-BA_2014-12-29-0.xml
ParlaMint-BA.TEI/2014/ParlaMint-BA_2014-03-13-0.xml
ParlaMint-BA.TEI/2014/ParlaMint-BA_2014-01-23-0.xml
ParlaMint-BA.TEI/2014/ParlaMint-BA_2014-02-25-0.xml
ParlaMint-BA.TEI/2014/ParlaMint-BA_2014-09-25-0.xml
ParlaMint-BA.TEI/2014/ParlaMint-BA_2014-07-10-0.xml
ParlaMint-BA.TEI/2014/ParlaMint-BA_2014-07-31-0.xml
ParlaMint-BA.TEI/2014/ParlaMint-BA_2014-09-04-0.xml
ParlaMint-BA.TEI/2014/ParlaMint-BA_2014-03-26-0.xml
ParlaMint-BA.TEI/2014/ParlaMint-BA_2014-01-24-0.xml
ParlaMint-BA.TEI/2014/ParlaMint-BA_2014-03-12-0.xml
ParlaMint-BA.TEI/2014/ParlaMint-BA_2014-04-29-0.xml
ParlaMint-

Let's load the first 3 rows of a single file, together with its metadata, so that we can split the text into sentences and assign sentiment to them:

In [8]:

import pandas as pd

df = pd.read_csv("ParlaMint-BA.txt/1998/ParlaMint-BA_1998-11-26-0.txt", sep="\t", names=["utterance", "text"]).head(3)
meta = pd.read_csv("ParlaMint-BA.txt/1998/ParlaMint-BA_1998-11-26-0-meta-en.tsv", sep="\t")
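
The `names=` argument above suggests that the speech files carry no header row. A minimal sketch of that kind of read, run on an in-memory stand-in (the two rows below are invented for illustration and are not taken from the corpus):

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for a ParlaMint .txt file:
# one utterance per line, ID and text separated by a tab, no header row.
sample = io.StringIO(
    "demo.u1\tFirst speech. It has two sentences.\n"
    "demo.u2\tSecond speech.\n"
)

# names= supplies the column labels, so the first line is read as data.
toy = pd.read_csv(sample, sep="\t", names=["utterance", "text"])
print(toy.shape)
print(toy.columns.tolist())
```

Without `names=`, pandas would treat the first utterance as the header and silently drop it from the data.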

Let's construct a function that splits the input text into sentences. We use the Croatian pipeline for this, as it should work for BA as well.

In [9]:
import classla
import conllu

classla.download("hr")
nlp = classla.Pipeline("hr")


def split_into_sentences(s: str) -> list[str]:
    # Run the pipeline, then read each sentence's text from the CoNLL-U metadata.
    doc = nlp(s)
    parsed = conllu.parse(doc.to_conll())
    return [i.metadata.get("text") for i in parsed]


split_into_sentences(df.text[0])


Downloading https://raw.githubusercontent.com/clarinsi/classla-resources/main/resources_1.0.1.json: 10.3kB [00:00, 15.4MB/s]                   
2024-06-13 10:20:15 INFO: Downloading these customized packages for language: hr (Croatian)...
| Processor | Package  |
------------------------
| tokenize  | standard |
| pos       | standard |
| lemma     | standard |
| depparse  | standard |
| ner       | standard |
| pretrain  | standard |

2024-06-13 10:20:15 INFO: File exists: /home/peterr/classla_resources/hr/pos/standard.pt.
2024-06-13 10:20:16 INFO: File exists: /home/peterr/classla_resources/hr/lemma/standard.pt.
2024-06-13 10:20:16 INFO: File exists: /home/peterr/classla_resources/hr/depparse/standard.pt.
2024-06-13 10:20:16 INFO: File exists: /home/peterr/classla_resources/hr/ner/standard.pt.
2024-06-13 10:20:17 INFO: File exists: /home/peterr/classla_resources/hr/pretrain/standard.pt.
2024-06-13 10:20:17 INFO: Finished downloading models and saved to /home/peterr/classla_resources.

['Dame i gospodo predlažem da počnemo sa radom.',
 'Pripala mi je ugodna dužnost da otvorim Konstituirajuću sjednicu Predstavničkog doma Parlamentarne skupštine Bosne i Hercegovine.',
 'Sačekat ćemo trenutak da uđu poslanici sa prostora Republike Srpske.',
 'Od cjelokupnog sastava Predstavničkog doma Parlamenta Bosne i Hercegovine, koji broji 42 poslanika i u odnosu sa 28 sa prostora Federacije i 14 sa prostora Republike Srpske, prisutno je ukupno 34 poslanika, odnosno 26 sa prostora Federacije i 8 sa prostora Republike Srpske tj. imamo kvorum za pravovaljano odlučivanje i rad ovog Predstavničkog doma, ove sjednice.',
 'Prije svega dozvolite mi da pozdravim novoizabrane poslanike i zastupnike i zaželim puno uspjeha u vršenju ove odgovorne funkcije.',
 'Posebno pozdravljam članove Predsjedništva Bosne i Hercegovine gospodina Radišića, gospodina Izetbegovića i gospodina Jelavića, dopredsjedavajuće potpredsjednika i članove Vijeća ministara ovdje prisutne, a zatim ambasdora Westendorpa, v

Let's split the text in the `text` column and save the result in a `sentences` column. We will also calculate the length of each sentence in characters.

In [10]:
df["sentences"] = df.text.apply(split_into_sentences)
df["lengths"] = df.sentences.apply(lambda l: [len(i) for i in l])
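
The same two-step pattern can be tried without loading the classla models. The sketch below swaps in a naive regex splitter (`naive_split`, a hypothetical stand-in that is much cruder than the pipeline above) purely to show the shape of the resulting columns:

```python
import re

import pandas as pd


def naive_split(text: str) -> list[str]:
    """Hypothetical stand-in for split_into_sentences: split after ., ! or ?"""
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]


toy = pd.DataFrame({"text": ["One. Two is longer!", "Only one here."]})
toy["sentences"] = toy.text.apply(naive_split)
# One character count per sentence, aligned with the sentences list.
toy["lengths"] = toy.sentences.apply(lambda sents: [len(s) for s in sents])
print(toy.lengths.tolist())
```

Each row ends up with two parallel lists, which is what the length-weighted average later in the notebook relies on.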

The cell below sets up the parlasent sentiment regression model and tests it on two random sentences:

In [11]:
import torch
from simpletransformers.classification import ClassificationArgs, ClassificationModel

model_args = ClassificationArgs(regression=True)
model = ClassificationModel(
    model_type="xlmroberta",
    model_name="classla/xlm-r-parlasent",
    use_cuda=torch.cuda.is_available(),
    num_labels=1,
    args=model_args,
)
model.predict([
    "This is where sentences to be evaluated can be passed to the model.",
    "The model returns scores from 0-5, with 0 meaning negative sentiment and 5 positive sentiment.",
])



0it [00:00, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

(array([3.05859375, 3.49023438]), array([3.05859375, 3.49023438]))

To apply the model, we first gather all sentences into a single list and pass it to the model in one call. We store the scores in a dictionary of the form `{sentence: score}`.

In [12]:
all_sentences = [sentence for list_of_sentences in df.sentences for sentence in list_of_sentences]
logits, _ = model.predict(all_sentences)
mapper = {s:l for s, l in zip(all_sentences, logits)}

0it [00:00, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

Let's assign sentiment scores to all the sentences in the sentences column:

In [13]:
df["logits"] = df.sentences.apply(lambda l: [mapper[i] for i in l])
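
The flatten, score, and map-back steps can be sketched end to end without the model. Here `fake_predict` is a hypothetical scorer (it simply counts words) standing in for `model.predict`, and the sentences are invented:

```python
import pandas as pd


def fake_predict(sentences: list[str]) -> list[float]:
    """Hypothetical scorer standing in for model.predict: score = word count."""
    return [float(len(s.split())) for s in sentences]


toy = pd.DataFrame({"sentences": [["a b", "c"], ["c", "d e f"]]})

# Flatten all rows into one batch so the scorer is called once.
flat = [s for row in toy.sentences for s in row]
scores = fake_predict(flat)

# Duplicate sentences collapse to a single key; they carry the same score.
lookup = dict(zip(flat, scores))

# Map the scores back onto the per-row sentence lists.
toy["scores"] = toy.sentences.apply(lambda row: [lookup[s] for s in row])
print(toy.scores.tolist())
```

Keying the dictionary by sentence text means a sentence that occurs in several rows is stored once, which is harmless as long as the scorer returns the same value for the same input.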

For every row, we calculate two averages: a simple average of the sentence scores and an average weighted by sentence length:

In [14]:
from numpy import average
df["logits_averaged"] = df.logits.apply(average)
df["logits_pondered"] = df.apply(lambda row: average(row["logits"], weights=row["lengths"]), axis=1)
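
To see how the two averages differ, here is a small worked example with invented numbers (two sentences, the longer one scoring higher):

```python
import numpy as np

# Hypothetical scores and character lengths for one speech.
scores = [4.0, 2.0]
lengths = [30, 10]

simple = np.average(scores)
weighted = np.average(scores, weights=lengths)  # (4*30 + 2*10) / (30 + 10)
print(simple, weighted)
```

The weighted average is pulled towards the score of the longer sentence, so a short procedural remark influences the speech-level score less than a long argument does.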

Let's see what our dataframe looks like:

In [15]:
df

| | utterance | text | sentences | lengths | logits | logits_averaged | logits_pondered |
|---|---|---|---|---|---|---|---|
| 0 | ParlaMint-BA_1998-11-26-0.u775 | Dame i gospodo predlažem da počnemo sa radom. ... | [Dame i gospodo predlažem da počnemo sa radom.... | [45, 129, 68, 364, 130, 598, 201, 168, 298, 14... | [4.2109375, 4.1484375, 3.224609375, 3.55664062... | 3.8012 | 3.865108 |
| 1 | ParlaMint-BA_1998-11-26-0.u777 | Ja bih imao samo jednu korekciju do ovog trenu... | [Ja bih imao samo jednu korekciju do ovog tren... | [121] | [3.0234375] | 3.023438 | 3.023438 |
| 2 | ParlaMint-BA_1998-11-26-0.u778 | Zahvaljujem, ima li još prijedloga i sugesitja... | [Zahvaljujem, ima li još prijedloga i sugesitj... | [47, 13, 227, 119, 122, 339, 342, 123, 106] | [3.48046875, 1.7783203125, 4.8046875, 3.685546... | 3.186957 | 3.314663 |


Next, we join in the speaker and session metadata from the metadata file:

In [16]:
merged = df.merge(meta, left_on="utterance", right_on="ID", how="left")
merged.head()

| | utterance | text | sentences | lengths | logits | logits_averaged | logits_pondered | Text_ID | ID | Title | ... | Speaker_MP | Speaker_minister | Speaker_party | Speaker_party_name | Party_status | Party_orientation | Speaker_ID | Speaker_name | Speaker_gender | Speaker_birth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ParlaMint-BA_1998-11-26-0.u775 | Dame i gospodo predlažem da počnemo sa radom. ... | [Dame i gospodo predlažem da počnemo sa radom.... | [45, 129, 68, 364, 130, 598, 201, 168, 298, 14... | [4.2109375, 4.1484375, 3.224609375, 3.55664062... | 3.8012 | 3.865108 | ParlaMint-BA_1998-11-26-0 | ParlaMint-BA_1998-11-26-0.u775 | | ... | MP | notMinister | KCD_BiH | Koalicija za cjelovitu i demokratsku BiH | Coalition | - | GenjacHalid | Genjac, Halid | M | 1958 |
| 1 | ParlaMint-BA_1998-11-26-0.u777 | Ja bih imao samo jednu korekciju do ovog trenu... | [Ja bih imao samo jednu korekciju do ovog tren... | [121] | [3.0234375] | 3.023438 | 3.023438 | ParlaMint-BA_1998-11-26-0 | ParlaMint-BA_1998-11-26-0.u777 | | ... | MP | notMinister | Sloga | Sloga | Coalition | - | DokićBranko | Dokić, Branko | M | 1949 |
| 2 | ParlaMint-BA_1998-11-26-0.u778 | Zahvaljujem, ima li još prijedloga i sugesitja... | [Zahvaljujem, ima li još prijedloga i sugesitj... | [47, 13, 227, 119, 122, 339, 342, 123, 106] | [3.48046875, 1.7783203125, 4.8046875, 3.685546... | 3.186957 | 3.314663 | ParlaMint-BA_1998-11-26-0 | ParlaMint-BA_1998-11-26-0.u778 | | ... | MP | notMinister | KCD_BiH | Koalicija za cjelovitu i demokratsku BiH | Coalition | - | GenjacHalid | Genjac, Halid | M | 1958 |
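
A left join keeps every speech even when no metadata row matches its ID. The sketch below illustrates this on invented frames and uses pandas' `indicator=True` option to flag unmatched rows; the toy column values are assumptions made for the example:

```python
import pandas as pd

speeches = pd.DataFrame({"utterance": ["u1", "u2", "u3"], "score": [3.1, 2.4, 4.0]})
# Hypothetical metadata that lacks an entry for u3.
meta_toy = pd.DataFrame({"ID": ["u1", "u2"], "Speaker_name": ["A", "B"]})

joined = speeches.merge(
    meta_toy, left_on="utterance", right_on="ID", how="left", indicator=True
)
# Rows without a metadata match are kept and flagged as "left_only".
print(joined[["utterance", "Speaker_name", "_merge"]])
```

Checking the `_merge` column after a join like this is a quick way to spot utterances whose metadata is missing.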


Next, let's add the columns `char_length` and `country`, and report the columns in the same order as in the prepared sample:


In [17]:
final_df = merged.assign(
    char_length=merged.lengths.apply(sum),
    country=merged.utterance.apply(lambda s: s.split("-")[1].split("_")[0]),
).rename(columns={"utterance": "newdoc id"})[[
    'newdoc id', 'logits_pondered', 'logits_averaged', 'char_length',
    'country', 'Text_ID', 'ID', 'Title', 'Date', 'Body', 'Term', 'Session',
    'Meeting', 'Sitting', 'Agenda', 'Subcorpus', 'Lang', 'Speaker_role',
    'Speaker_MP', 'Speaker_minister', 'Speaker_party', 'Speaker_party_name',
    'Party_status', 'Party_orientation', 'Speaker_ID', 'Speaker_name',
    'Speaker_gender', 'Speaker_birth',
]]
final_df

| | newdoc id | logits_pondered | logits_averaged | char_length | country | Text_ID | ID | Title | Date | Body | ... | Speaker_MP | Speaker_minister | Speaker_party | Speaker_party_name | Party_status | Party_orientation | Speaker_ID | Speaker_name | Speaker_gender | Speaker_birth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ParlaMint-BA_1998-11-26-0.u775 | 3.865108 | 3.8012 | 3002 | BA | ParlaMint-BA_1998-11-26-0 | ParlaMint-BA_1998-11-26-0.u775 | | 1998-11-26 | Unicameralism | ... | MP | notMinister | KCD_BiH | Koalicija za cjelovitu i demokratsku BiH | Coalition | - | GenjacHalid | Genjac, Halid | M | 1958 |
| 1 | ParlaMint-BA_1998-11-26-0.u777 | 3.023438 | 3.023438 | 121 | BA | ParlaMint-BA_1998-11-26-0 | ParlaMint-BA_1998-11-26-0.u777 | | 1998-11-26 | Unicameralism | ... | MP | notMinister | Sloga | Sloga | Coalition | - | DokićBranko | Dokić, Branko | M | 1949 |
| 2 | ParlaMint-BA_1998-11-26-0.u778 | 3.314663 | 3.186957 | 1438 | BA | ParlaMint-BA_1998-11-26-0 | ParlaMint-BA_1998-11-26-0.u778 | | 1998-11-26 | Unicameralism | ... | MP | notMinister | KCD_BiH | Koalicija za cjelovitu i demokratsku BiH | Coalition | - | GenjacHalid | Genjac, Halid | M | 1958 |
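
The `country` column is derived from the utterance ID by string splitting. The sketch below traces that logic on one ID from this notebook and compares it with a regex variant. The regex and the regional ID in the second call are assumptions about the ID layout (`ParlaMint-<code>_<date>...`), not something this notebook confirms:

```python
import re


def country_by_split(utterance_id: str) -> str:
    # The notebook's approach: the chunk after the first "-", up to the first "_".
    return utterance_id.split("-")[1].split("_")[0]


def country_by_regex(utterance_id: str):
    # Assumed layout: two capital letters, optionally followed by "-XX" for a region.
    match = re.match(r"ParlaMint-([A-Z]{2}(?:-[A-Z]{2})?)_", utterance_id)
    return match.group(1) if match else None


ba_id = "ParlaMint-BA_1998-11-26-0.u775"
print(country_by_split(ba_id), country_by_regex(ba_id))

# Hypothetical ID with a hyphenated regional code.
regional_id = "ParlaMint-ES-CT_2015-01-01-0.u1"
print(country_by_split(regional_id), country_by_regex(regional_id))
```

For plain two-letter codes such as `BA` both functions agree. With a hyphenated code the split-based version keeps only the part before the second hyphen, which may or may not be what is wanted.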


Finally, let's clean up by removing all the downloaded files.

In [18]:
!rm -r *BA*
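
Note that the `*BA*` glob matches any file or directory in the working directory whose name contains `BA`, not only the ParlaMint downloads. A more conservative sketch in Python removes only the names seen earlier in this notebook. It runs in a throwaway directory, and `NOTES-BACKUP.md` is an invented bystander file:

```python
import shutil
import tempfile
from pathlib import Path

# Work in a throwaway directory so the sketch touches nothing real.
workdir = Path(tempfile.mkdtemp())
(workdir / "ParlaMint-BA.tgz").write_text("archive stand-in")
(workdir / "ParlaMint-BA.txt").mkdir()
(workdir / "NOTES-BACKUP.md").write_text("unrelated file that *BA* would match")

# Remove only the names the notebook is known to have created.
for name in ["ParlaMint-BA.tgz", "ParlaMint-BA.txt", "ParlaMint-BA.TEI", "README-BA.md"]:
    target = workdir / name
    if target.is_dir():
        shutil.rmtree(target)
    elif target.exists():
        target.unlink()

remaining = sorted(p.name for p in workdir.iterdir())
print(remaining)
shutil.rmtree(workdir)
```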