# Creation of a dataset for the CamemBERT LLM based on parliamentary debates

**Objective:** We want to use the open data of the French parliament to create a dataset of interventions labeled with the political group of their speaker.
The dataset will look like this:

input:
>[THÈME] Projet de loi sur la sécurité intérieure
>
>[CONTEXTE]
>
>Intervenant 1 : Nous voulons plus de sécurité dans nos quartiers.
>
>Intervenant 2 : Ce projet est une atteinte aux libertés publiques.
>
>[INTERVENTION]
>Je soutiens cette mesure pour protéger nos concitoyens.

label:
>RN

Explanation:
The input is divided into three parts:
- Theme: the name of the topic being discussed.
- Context: a fixed number of previous interventions.
- Intervention: the text of the current intervention being analyzed. The goal is to predict the political group of the speaker of this intervention.

The output is the political group associated with the intervention, for example RN.
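
The format described above can be assembled with a small helper. This is only a minimal sketch: the function name `build_input` and the choice to number speakers by their position in the context are assumptions made for illustration, not something fixed by the description.

```python
def build_input(theme, context, intervention):
    """Assemble the model input from a theme, previous interventions and the current one."""
    lines = [f"[THÈME] {theme}", "", "[CONTEXTE]", ""]
    # Assumption: speakers are simply numbered by their position in the context window
    for i, text in enumerate(context, start=1):
        lines.append(f"Intervenant {i} : {text}")
        lines.append("")
    lines.append("[INTERVENTION]")
    lines.append(intervention)
    return "\n".join(lines)


example = {
    "input": build_input(
        "Projet de loi sur la sécurité intérieure",
        [
            "Nous voulons plus de sécurité dans nos quartiers.",
            "Ce projet est une atteinte aux libertés publiques.",
        ],
        "Je soutiens cette mesure pour protéger nos concitoyens.",
    ),
    "label": "RN",
}
print(example["input"])
```

Keeping the three markers as plain text makes each row a single string, which is the usual input shape for a text classifier.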



### How to proceed?
Here is what we will go through:


### 1) Get the data
We will check where we can get the data from.

### 2) Check the structure of the data
Here we will quickly explore the structure of each type of file.
#### 2.a) The organs files
#### 2.b) The acteurs files (actors)
#### 2.c) The debates minutes files

### 3) Parse the XML files
Here we start the actual job of parsing the xml files, and collecting the relevant data for our dataset.
#### 3.a) Extract the list of organes
First, we check how to extract the list of organs from the XML (the organs are all the bodies saved in the open data system of the Assemblée Nationale).
The aim is to get a pandas dataframe of all of them, but we will only keep the political groups (GP), because this is what we will work with.
#### 3.b) Extract the list of acteurs
Then we will extract the list of acteurs (actors), which are all the physical persons saved in the open data system (MPs, ministers, etc.).
The aim is again to get a pandas dataframe.
#### 3.c) Two useful functions!
#### 3.d) Parse Compte Rendu and create the data set
##### 3.d.1) Extract Point: the *parsePoint()* function
##### 3.d.2) Parse Compte Rendu 
##### 3.d.3) Create dataset
### 4) Some last checks and we save!

# Let's get started !! 🤓

### 1) Get the data
We will use the open data from the French National Assembly.
You can find all the open data here : https://data.assemblee-nationale.fr/

We will specifically use the debate minutes from the 17th *Législature* (17th assembly, the one that opened on 18 July 2024).
You can find the zip with the data here: https://data.assemblee-nationale.fr/static/openData/repository/17/vp/syceronbrut/syseron.xml.zip

Additionally, we will collect information about all the *acteurs* (actors, i.e. any physical person in this database) and all the *organes* (organs, i.e. any political group, body, commission, etc. registered in this database). You can download a file that compiles the full history here: [download here](https://data.assemblee-nationale.fr/static/openData/repository/16/amo/tous_acteurs_mandats_organes_xi_legislature/AMO30_tous_acteurs_tous_mandats_tous_organes_historique.xml.zip)
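
If you prefer to fetch the archives from a script, something like the following could work. It is a sketch under assumptions: the helper name `download_and_extract` is made up for this example, and extracting each archive into a folder named after the zip (for example `syseron.xml`) is a guess meant to match the folder paths used later in this notebook.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url, data_dir):
    """Download a zip archive into data_dir (if missing) and extract it next to it."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)

    zip_path = data_dir / url.rsplit("/", 1)[-1]
    if not zip_path.exists():  # skip the download when the archive is already there
        urllib.request.urlretrieve(url, str(zip_path))

    # "syseron.xml.zip" -> folder "syseron.xml" (assumed layout)
    target = data_dir / zip_path.stem
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
    return target


# Usage (needs network access):
# download_and_extract(
#     "https://data.assemblee-nationale.fr/static/openData/repository/17/vp/syceronbrut/syseron.xml.zip",
#     "data",
# )
```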







### 2) Check the structure of the data

Some documentation on the data structure is available on the website of the *Assemblée Nationale*, but in my opinion it lacks some information and examples: [access documentation here](https://www.assemblee-nationale.fr/opendata/Index_pub.html).


#### 2.a) The organs files
The organ files are located in the "organe" folder. Each file corresponds to one organ.
You can find interesting information about each organ. One of them is the *uid*, which is referenced later on as *organeRef*.

You can find here the [relevant documentation about *organes*](https://www.assemblee-nationale.fr/opendata/Schemas_Entites/AMO/Schemas_Organes.html).

We mostly care about the *Groupes Politiques*, which are in fact the *Groupes Parlementaires* (parliamentary groups). You can learn more about what they are here: [Parliamentary Group Wikipedia article](https://en.wikipedia.org/wiki/Parliamentary_group).

Here we show one file of an organ. The organ type is identified with the *codeType*, in our case *GP*, for *Groupe Politique*.
Its *uid* (or *organeRef*) is PO758835. The group name is *Socialistes et apparentés*, abbreviated as *SOC*. It has a start date (*dateDebut*) and an end date (*dateFin*), and it is classified as part of the opposition.
```
<?xml version='1.0' encoding='UTF-8'?>
<organe xmlns="http://schemas.assemblee-nationale.fr/referentiel" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="GroupePolitique_type">
  <uid>PO758835</uid>
  <codeType>GP</codeType>
  <libelle>Socialistes et apparentés</libelle>
  <libelleEdition>du groupe Socialistes et apparentés</libelleEdition>
  <libelleAbrege>SOC</libelleAbrege>
  <libelleAbrev>SOC</libelleAbrev>
  <viMoDe>
    <dateDebut>2018-09-12</dateDebut>
    <dateAgrement xsi:nil="true"/>
    <dateFin>2022-06-21</dateFin>
  </viMoDe>
  <organeParent xsi:nil="true"/>
  <chambre xsi:nil="true"/>
  <regime>5ème République</regime>
  <legislature>15</legislature>
  <secretariat>
    <secretaire01 xsi:nil="true"/>
    <secretaire02 xsi:nil="true"/>
  </secretariat>
  <positionPolitique>Opposition</positionPolitique>
  <preseance>4</preseance>
  <couleurAssociee>#D46CA9</couleurAssociee>
</organe>

```
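
To get a feel for how such a file can be read, here is a small sketch using `xml.etree.ElementTree` with an explicit namespace map. The XML string is a shortened copy of the example above, and the prefix `an` is an arbitrary name chosen for this example.

```python
import xml.etree.ElementTree as ET

# Arbitrary prefix mapped to the default namespace used by the open data files
NS = {"an": "http://schemas.assemblee-nationale.fr/referentiel"}

sample = """<organe xmlns="http://schemas.assemblee-nationale.fr/referentiel">
  <uid>PO758835</uid>
  <codeType>GP</codeType>
  <libelle>Socialistes et apparentés</libelle>
  <libelleAbrev>SOC</libelleAbrev>
  <viMoDe>
    <dateDebut>2018-09-12</dateDebut>
    <dateFin>2022-06-21</dateFin>
  </viMoDe>
  <positionPolitique>Opposition</positionPolitique>
</organe>"""

root = ET.fromstring(sample)

# findtext returns the default when the element is missing
organ = {
    "uid": root.findtext("an:uid", default="", namespaces=NS),
    "codeType": root.findtext("an:codeType", default="", namespaces=NS),
    "libelleAbrev": root.findtext("an:libelleAbrev", default="", namespaces=NS),
    "positionPolitique": root.findtext("an:positionPolitique", default="", namespaces=NS),
    "dateDebut": root.findtext("an:viMoDe/an:dateDebut", default="", namespaces=NS),
    "dateFin": root.findtext("an:viMoDe/an:dateFin", default="", namespaces=NS),
}
print(organ)
```

Passing a namespace map is one option; stripping the namespace from every tag before searching is another, and it keeps the lookups shorter.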


#### 2.b) The acteurs files (actors)

The actor files are located in the *acteur* folder. We will collect all the *acteurs*, but we are really only interested in MPs, since they are the ones who belong to a political group. We are especially interested in the list of mandates of each MP: it lets us match the *acteurRef* id of each speaker in a debate to the corresponding MP, and then, through the mandate's *organeRef*, look up the political group in the list of *organes* to get the abbreviation of the GP.

Here is what a typical XML file of an *acteur* looks like:
```
<?xml version='1.0' encoding='UTF-8'?>
<acteur xmlns="http://schemas.assemblee-nationale.fr/referentiel">
  <uid xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="IdActeur_type">PA794734</uid>
  <etatCivil>
    <ident>
      <civ>M.</civ>
      <prenom>Michel</prenom>
      <nom>Guiniot</nom>
      <alpha>Guiniot</alpha>
      <trigramme>MGI</trigramme>
    </ident>
    <infoNaissance>
      <dateNais>1954-11-29</dateNais>
      <villeNais>Chauny</villeNais>
      <depNais xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:nil="true"/>
      <paysNais>France</paysNais>
    </infoNaissance>
    <dateDeces xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:nil="true"/>
  </etatCivil>
  <profession>
    <libelleCourant>Ancien cadre</libelleCourant>
    <socProcINSEE>
      <catSocPro>Cadres des services administratifs et commerciaux d’entreprise</catSocPro>
      <famSocPro>Cadres et professions intellectuelles supérieures</famSocPro>
    </socProcINSEE>
  </profession>
  <uri_hatvp>https://www.hatvp.fr/pages_nominatives/guiniot-michel-24237</uri_hatvp>
  <adresses>
    <adresse xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="AdressePostale_Type">
      <uid>AD794736</uid>
      <type>0</type>
      <typeLibelle>Adresse officielle</typeLibelle>
      <poids>1</poids>
      <adresseDeRattachement xsi:nil="true"/>
      <intitule>Assemblée nationale,</intitule>
      <numeroRue>126</numeroRue>
      <nomRue>Rue de l'Université,</nomRue>
      <complementAdresse xsi:nil="true"/>
      <codePostal>75355</codePostal>
      <ville>Paris 07 SP</ville>
    </adresse>
    <adresse xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="AdresseMail_Type">
      <uid>AD798023</uid>
      <type>15</type>
      <typeLibelle>Mèl</typeLibelle>
      <poids xsi:nil="true"/>
      <adresseDeRattachement xsi:nil="true"/>
      <valElec>michel.guiniot@assemblee-nationale.fr</valElec>
    </adresse>
    <adresse xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="AdresseSiteWeb_Type">
      <uid>AD798759</uid>
      <type>25</type>
      <typeLibelle>Facebook</typeLibelle>
      <poids xsi:nil="true"/>
      <adresseDeRattachement xsi:nil="true"/>
      <valElec>guiniotofficiel</valElec>
    </adresse>
    <adresse xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="AdresseSiteWeb_Type">
      <uid>AD798803</uid>
      <type>24</type>
      <typeLibelle>Twitter</typeLibelle>
      <poids xsi:nil="true"/>
      <adresseDeRattachement xsi:nil="true"/>
      <valElec>@MichelGuiniot</valElec>
    </adresse>
    <adresse xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="AdressePostale_Type">
      <uid>AD844550</uid>
      <type>2</type>
      <typeLibelle>Adresse publiée de circonscription</typeLibelle>
      <poids>22</poids>
      <adresseDeRattachement xsi:nil="true"/>
      <intitule xsi:nil="true"/>
      <numeroRue>11</numeroRue>
      <nomRue>Rue de Grèce</nomRue>
      <complementAdresse xsi:nil="true"/>
      <codePostal>60400</codePostal>
      <ville>Noyon</ville>
    </adresse>
  </adresses>
  <mandats>
    <mandat xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="MandatSimple_Type">
      <uid>PM845918</uid>
      <acteurRef>PA794734</acteurRef>
      <legislature>17</legislature>
      <typeOrgane>GP</typeOrgane>
      <dateDebut>2024-07-19</dateDebut>
      <datePublication>2024-07-19</datePublication>
      <dateFin xsi:nil="true"/>
      <preseance>20</preseance>
      <nominPrincipale>1</nominPrincipale>
      <infosQualite>
        <codeQualite>Membre</codeQualite>
        <libQualite>Membre du</libQualite>
        <libQualiteSex>Membre du</libQualiteSex>
      </infosQualite>
      <organes>
        <organeRef>PO845401</organeRef>
      </organes>
    </mandat>
```

... The actual list of mandates is very long, as it includes every sub-group of the parliament, such as commissions.

```
    <mandat xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="MandatParlementaire_type">
      <uid>PM843146</uid>
      <acteurRef>PA794734</acteurRef>
      <legislature>17</legislature>
      <typeOrgane>ASSEMBLEE</typeOrgane>
      <dateDebut>2024-07-07</dateDebut>
      <datePublication xsi:nil="true"/>
      <dateFin xsi:nil="true"/>
      <preseance>50</preseance>
      <nominPrincipale>1</nominPrincipale>
      <infosQualite>
        <codeQualite>membre</codeQualite>
        <libQualite>membre</libQualite>
        <libQualiteSex>membre</libQualiteSex>
      </infosQualite>
      <organes>
        <organeRef>PO838901</organeRef>
      </organes>
      <suppleants>
        <suppleant>
          <dateDebut>2024-07-07</dateDebut>
          <dateFin xsi:nil="true"/>
          <suppleantRef>PA841583</suppleantRef>
        </suppleant>
      </suppleants>
      <chambre xsi:nil="true"/>
      <election>
        <lieu>
          <region>Hauts-de-France</region>
          <regionType>Métropolitain</regionType>
          <departement>Oise</departement>
          <numDepartement>60</numDepartement>
          <numCirco>6</numCirco>
        </lieu>
        <causeMandat>élections générales</causeMandat>
        <refCirconscription>PO839649</refCirconscription>
      </election>
      <mandature>
        <datePriseFonction>2024-07-08</datePriseFonction>
        <causeFin xsi:nil="true"/>
        <premiereElection>1</premiereElection>
        <placeHemicycle>008</placeHemicycle>
        <mandatRemplaceRef xsi:nil="true"/>
      </mandature>
      <collaborateurs>
        <collaborateur>
          <qualite>Mme</qualite>
          <prenom>Annick</prenom>
          <nom>Sézille</nom>
          <dateDebut xsi:nil="true"/>
          <dateFin xsi:nil="true"/>
        </collaborateur>
        <collaborateur>
          <qualite>M.</qualite>
          <prenom>Yannick</prenom>
          <nom>Lozac'h de Cordoue-Hecquard</nom>
          <dateDebut xsi:nil="true"/>
          <dateFin xsi:nil="true"/>
        </collaborateur>
      </collaborateurs>
    </mandat>
  </mandats>
</acteur>
```

From this XML, we are mostly interested in the *uid* (corresponding to *acteurRef*), the *organeRef* of the political group, and the start and end dates of the MP's membership.
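
As an illustration of how these fields could be pulled out, the sketch below keeps only the mandates whose *typeOrgane* is `GP` and then picks the one active on a given day. The sample is a trimmed version of the file above; the helper names `gp_mandates` and `group_on`, and the rule that an empty *dateFin* means "still active", are assumptions for this example.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"an": "http://schemas.assemblee-nationale.fr/referentiel"}

sample = """<acteur xmlns="http://schemas.assemblee-nationale.fr/referentiel"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <uid>PA794734</uid>
  <mandats>
    <mandat>
      <acteurRef>PA794734</acteurRef>
      <typeOrgane>GP</typeOrgane>
      <dateDebut>2024-07-19</dateDebut>
      <dateFin xsi:nil="true"/>
      <organes><organeRef>PO845401</organeRef></organes>
    </mandat>
    <mandat>
      <acteurRef>PA794734</acteurRef>
      <typeOrgane>ASSEMBLEE</typeOrgane>
      <dateDebut>2024-07-07</dateDebut>
      <dateFin xsi:nil="true"/>
      <organes><organeRef>PO838901</organeRef></organes>
    </mandat>
  </mandats>
</acteur>"""


def gp_mandates(root):
    """Return one dict per political-group (GP) mandate found in an acteur element."""
    rows = []
    for mandat in root.findall("an:mandats/an:mandat", NS):
        if mandat.findtext("an:typeOrgane", default="", namespaces=NS) != "GP":
            continue
        rows.append({
            "acteurRef": mandat.findtext("an:acteurRef", default="", namespaces=NS),
            "organeRef": mandat.findtext("an:organes/an:organeRef", default="", namespaces=NS),
            "dateDebut": mandat.findtext("an:dateDebut", default="", namespaces=NS),
            # an xsi:nil element has no text, so this becomes ""
            "dateFin": mandat.findtext("an:dateFin", default="", namespaces=NS) or "",
        })
    return rows


def group_on(mandates, day):
    """Return the organeRef of the GP mandate active on `day`, or None if there is none."""
    for m in mandates:
        if not m["dateDebut"]:
            continue
        start = date.fromisoformat(m["dateDebut"])
        # Assumption: an empty end date means the mandate is still running
        end = date.fromisoformat(m["dateFin"]) if m["dateFin"] else date.max
        if start <= day <= end:
            return m["organeRef"]
    return None


mandates = gp_mandates(ET.fromstring(sample))
print(group_on(mandates, date(2024, 10, 1)))
```

Checking the date range matters because an MP can change groups during a legislature, so the group has to be resolved for the day of the debate.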


#### 2.c) The debates minutes files

The debate minutes record all the speakers' interventions.
Each file starts with a metadata header, followed by the actual content of the debate.
The debates are organised into *point* and *paragraphe* elements. Each *paragraphe* is an intervention by a speaker.
Each element carries a *code_grammaire* attribute that describes the type of intervention. The main one we will look at is *PAROLE_GENERIQUE* (generic speech); the second interesting one is *INTERRUPTION_1_10*, which marks an interruption. Note that interruptions are sometimes included directly in the text of the main speaker.

It looks like this:
```
<?xml version='1.0' encoding='UTF-8'?>
<compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
  <uid>CRSANR5L17S2024D1N001</uid>
  <seanceRef>RUANR5L17S2024IDS28537</seanceRef>
  <sessionRef>SCR5A2024D1</sessionRef>
  <metadonnees>
    <dateSeance>20240718150000000</dateSeance>
    <dateSeanceJour>jeudi 18 juillet 2024</dateSeanceJour>
    <numSeanceJour>Unique</numSeanceJour>
    <numSeance>1</numSeance>
    <typeAssemblee>AN</typeAssemblee>
    <legislature>17</legislature>
    <session>Session de droit juillet 2024</session>
    <nomFichierJo>20245001</nomFichierJo>
    <validite>valide</validite>
    <etat>complet</etat>
    <diffusion>public</diffusion>
    <version>avant_JO</version>
    <environnement>PROD</environnement>
    <heureGeneration>2024-07-23T12:12:48.000+02:00</heureGeneration>
    <sommaire>
      <presidentSeance id_syceron="3508735">Présidence de M. José Gonzalez, doyen d’âge</presidentSeance>
      <sommaire1 valeur_pts_odj="1">
        <titreStruct id_syceron="3508739">
          <intitule>Ouverture de la XVII<exposant>e </exposant>législature</intitule>
          <sousIntitule>0</sousIntitule>
        </titreStruct>
      </sommaire1>
```
... This section contains the list of all the points to be discussed; it is the summary.
```
      <sommaire1 valeur_pts_odj="8">
        <titreStruct id_syceron="3509305">
          <intitule>Ordre du jour de la prochaine séance</intitule>
        </titreStruct>
      </sommaire1>
    </sommaire>
  </metadonnees>
```
... Here starts the actual content
```
  <contenu>
    <quantiemes>
      <journee>Séance du jeudi 18 juillet 2024</journee>
      <session>Session de droit juillet 2024</session>
    </quantiemes>
    <ouvertureSeance nivpoint="1" valeur_ptsodj="0" ordinal_prise="1" id_preparation="2536865" ordre_absolu_seance="1" code_grammaire="OUV_SEAN_1_1" code_style="Présidence" code_parole="" sommaire="1" id_syceron="3508735" valeur="">
      <orateurs/>
      <texte>Présidence de M. José Gonzalez, doyen d’âge</texte>
      <paragraphe valeur_ptsodj="0" ordinal_prise="1" id_preparation="2536866" ordre_absolu_seance="2" id_acteur="PA793362" id_mandat="PM842465" id_nomination_oe="0" id_nomination_op="-1" code_grammaire="OUV_SEAN_2_1" code_style="NORMAL" code_parole="" sommaire="0" id_syceron="3508736" valeur="" roledebat="president">
        <orateurs>
          <orateur>
            <nom>M. le président</nom>
            <id>793362</id>
            <qualite/>
          </orateur>
        </orateurs>
        <texte stime="1099.79">La séance est ouverte.</texte>
      </paragraphe>
      <paragraphe valeur_ptsodj="0" ordinal_prise="1" id_preparation="2536867" ordre_absolu_seance="3" code_grammaire="OUV_SEAN_2_2" code_style="Info Italiques" code_parole="" sommaire="0" id_syceron="3508737" valeur="">
        <orateurs/>
        <texte>
          <italique>(La séance est ouverte à quinze heures.)</italique>
        </texte>
      </paragraphe>
    </ouvertureSeance>
    <point nivpoint="1" valeur_ptsodj="1" ordinal_prise="1" id_preparation="2536869" ordre_absolu_seance="5" code_grammaire="TITRE_TEXTE_DISCUSSION" code_style="Titre" code_parole="" sommaire="1" id_syceron="3508739" valeur="">
      <orateurs/>
      <texte>Ouverture de la XVII<exposant>e </exposant>législature</texte>
      <paragraphe valeur_ptsodj="1" ordinal_prise="1" id_preparation="2536870" ordre_absolu_seance="6" id_acteur="PA793362" id_mandat="PM842465" id_nomination_oe="0" id_nomination_op="-1" code_grammaire="ODJ_APPEL_DISCUSSION" code_style="NORMAL" code_parole="" sommaire="0" id_syceron="3508740" valeur="" roledebat="president">
        <orateurs>
          <orateur>
            <nom>M. le président</nom>
            <id>793362</id>
            <qualite/>
          </orateur>
        </orateurs>
        <texte stime="1114.09">Je déclare ouverte la XVII<exposant>e</exposant> législature de l’Assemblée nationale et la session de droit prévue par l’article 12 de la Constitution.</texte>
      </paragraphe>
    </point>
```
... Here we have more points discussed
```
    <point nivpoint="1" valeur_ptsodj="7" ordinal_prise="9" id_preparation="2536932" ordre_absolu_seance="95" code_grammaire="TITRE_TEXTE_DISCUSSION" code_style="Titre" code_parole="" sommaire="1" id_syceron="3509288" valeur="">
      <orateurs/>
      <texte>Allocution de Mme la présidente</texte>
      <paragraphe valeur_ptsodj="7" ordinal_prise="9" id_preparation="2536933" ordre_absolu_seance="96" id_acteur="PA721908" id_mandat="PM843467" id_nomination_oe="0" id_nomination_op="PM845400" code_grammaire="PAROLE_GENERIQUE" code_style="NORMAL" code_parole="" sommaire="1" id_syceron="3509289" valeur="">
        <orateurs>
          <orateur>
            <nom>Mme la présidente</nom>
            <id>721908</id>
            <qualite/>
          </orateur>
        </orateurs>
        <texte stime="21535.51">Mes chers collègues, c’est avec une immense émotion que je prends la parole devant vous. Les dernières semaines ont été particulièrement tendues. Notre pays est inquiet et fracturé. Nous avons une immense responsabilité. Pour la première fois depuis plusieurs dizaines d’années, les Français se sont rendus massivement aux urnes. Près de 70 % d’entre eux ont voté aux dernières élections législatives.</texte>
      </paragraphe>
      <paragraphe valeur_ptsodj="7" ordinal_prise="9" id_preparation="2536934" ordre_absolu_seance="97" id_acteur="PA791812" id_mandat="PM840384" id_nomination_oe="-1" id_nomination_op="-1" code_grammaire="INTERRUPTION_1_10" code_style="NORMAL" code_parole="" sommaire="0" id_syceron="3509290" valeur="">
        <orateurs>
          <orateur>
            <nom>Mme Sophia Chikirou</nom>
            <id>791812</id>
            <qualite/>
          </orateur>
        </orateurs>
        <texte stime="21578.64">Ils n’ont pas voté pour vous ! <italique>(Protestations.)</italique></texte>
      </paragraphe>
      <paragraphe valeur_ptsodj="7" ordinal_prise="9" id_preparation="2536935" ordre_absolu_seance="98" id_acteur="PA721908" id_mandat="PM843467" id_nomination_oe="0" id_nomination_op="-1" code_grammaire="PAROLE_GENERIQUE" code_style="NORMAL" code_parole="" sommaire="0" id_syceron="3509291" valeur="">
        <orateurs>
          <orateur>
            <nom>Mme la présidente</nom>
            <id>721908</id>
            <qualite/>
          </orateur>
        </orateurs>
        <texte stime="21583.13">Ces voix, ces suffrages, cette mobilisation exceptionnelle et inédite nous confèrent une immense responsabilité. Si nos compatriotes ont été aussi nombreux à se rendre aux urnes, c’est qu’ils ont compris que la démocratie était un bien précieux, que certains enjeux étaient majeurs, que les hommes et les femmes politiques que nous sommes pouvaient avoir un effet direct sur leurs vies – nos décisions et nos actions peuvent changer leurs vies. Ils nous ont dit : « Occupez-vous de nous ; occupez-vous de notre pouvoir d’achat,…</texte>
      </paragraphe>
      <paragraphe valeur_ptsodj="7" ordinal_prise="9" id_preparation="2536936" ordre_absolu_seance="99" id_acteur="PA721410" id_mandat="PM843350" id_nomination_oe="-1" id_nomination_op="-1" code_grammaire="INTERRUPTION_1_10" code_style="NORMAL" code_parole="" sommaire="0" id_syceron="3509292" valeur="">
        <orateurs>
          <orateur>
            <nom>Mme Émilie Bonnivard</nom>
            <id>721410</id>
            <qualite/>
          </orateur>
        </orateurs>
        <texte stime="21638.35">Très bien !</texte>
      </paragraphe>
      <paragraphe valeur_ptsodj="7" ordinal_prise="9" id_preparation="2536937" ordre_absolu_seance="100" id_acteur="PA721908" id_mandat="PM843467" id_nomination_oe="0" id_nomination_op="-1" code_grammaire="PAROLE_GENERIQUE" code_style="NORMAL" code_parole="" sommaire="0" id_syceron="3509293" valeur="">
        <orateurs>
          <orateur>
            <nom>Mme la présidente</nom>
            <id>721908</id>
            <qualite/>
          </orateur>
        </orateurs>
        <texte stime="21645.41">…des déserts médicaux, de nos écoles, de nos services publics <italique>(Applaudissements), </italique>de l’emploi, de nos enfants, de notre planète et de l’environnement, de notre sécurité,…</texte>
      </paragraphe>
      <paragraphe valeur_ptsodj="7" ordinal_prise="9" id_preparation="2536938" ordre_absolu_seance="101" id_acteur="PA719118" id_mandat="PM842501" id_nomination_oe="-1" id_nomination_op="-1" code_grammaire="INTERRUPTION_1_10" code_style="NORMAL" code_parole="" sommaire="0" id_syceron="3509294" valeur="">
        <orateurs>
          <orateur>
            <nom>M. François Cormier-Bouligeon</nom>
            <id>719118</id>
            <qualite/>
          </orateur>
        </orateurs>
        <texte stime="21658.46">Oui !</texte>
      </paragraphe>
      <paragraphe valeur_ptsodj="7" ordinal_prise="9" id_preparation="2536939" ordre_absolu_seance="102" id_acteur="PA721908" id_mandat="PM843467" id_nomination_oe="0" id_nomination_op="-1" code_grammaire="PAROLE_GENERIQUE" code_style="NORMAL" code_parole="" sommaire="0" id_syceron="3509295" valeur="">
        <orateurs>
          <orateur>
            <nom>Mme la présidente</nom>
            <id>721908</id>
            <qualite/>
          </orateur>
        </orateurs>
        <texte stime="21673.99">…de notre défense ». Quels que soient nos bords politiques et nos territoires d’élection, nous devons entendre ce message et apporter de nouvelles solutions, avec de nouvelles méthodes.</texte>
      </paragraphe>
      <paragraphe valeur_ptsodj="7" ordinal_prise="9" id_preparation="2536940" ordre_absolu_seance="103" id_acteur="PA795010" id_mandat="PM840354" id_nomination_oe="-1" id_nomination_op="-1" code_grammaire="INTERRUPTION_1_10" code_style="NORMAL" code_parole="" sommaire="0" id_syceron="3509296" valeur="">
        <orateurs>
          <orateur>
            <nom>Mme Marie-Charlotte Garin</nom>
            <id>795010</id>
            <qualite/>
          </orateur>
        </orateurs>
        <texte stime="21689.81">Mais oui, bien sûr !</texte>
      </paragraphe>
      <paragraphe valeur_ptsodj="7" ordinal_prise="9" id_preparation="2536941" ordre_absolu_seance="104" id_acteur="PA721908" id_mandat="PM843467" id_nomination_oe="0" id_nomination_op="-1" code_grammaire="PAROLE_GENERIQUE" code_style="NORMAL" code_parole="" sommaire="0" id_syceron="3509297" valeur="">
        <orateurs>
          <orateur>
            <nom>Mme la présidente</nom>
            <id>721908</id>
            <qualite/>
          </orateur>
        </orateurs>
        <texte stime="21693.42">Cette Assemblée nationale, sans doute plus représentative que jamais, est aussi plus divisée que jamais. Mais parce que notre responsabilité est immense, nous n’avons pas le choix : nous devons nous entendre, coopérer, être capables de rechercher des compromis, de dialoguer, de nous écouter et d’avancer. <italique>(Applaudissements.) </italique>Vous me trouverez toujours à vos côtés pour dialoguer, innover, tracer le nouveau chemin que l’Assemblée nationale doit emprunter.<br/>Je ne m’étendrai pas davantage, car la journée a été longue et l’Assemblée nationale poursuivra ses travaux demain et samedi. Je veux cependant remercier chacun d’entre vous du fond du cœur, et plus particulièrement nos collègues candidats à la présidence de l’Assemblée, que je félicite : tout d’abord, M. le président Chassaigne <italique>(De nombreux députés, dont certains se lèvent, applaudissent longuement)</italique>, qui sait à quel point je le respecte et je l’apprécie ; ensuite, M. Sébastien Chenu <italique>(Mêmes mouvements)</italique>, Mme Naïma Moutchou <italique>(Mêmes mouvements),</italique> M. Philippe Juvin <italique>(Mêmes mouvements)</italique> et M. Charles de Courson<italique>.</italique> <italique>(Mêmes mouvements.)</italique> Chacun d’entre vous a porté une voix singulière, la voix de son groupe parlementaire, au-delà des députés présents dans l’hémicycle. Sachez que je m’engage à travailler avec chacun d’entre vous, tout au long de ce mandat.</texte>
      </paragraphe>
      <paragraphe valeur_ptsodj="7" ordinal_prise="9" id_preparation="2536946" ordre_absolu_seance="108" id_acteur="PA736201" id_mandat="PM840423" id_nomination_oe="-1" id_nomination_op="-1" code_grammaire="INTERRUPTION_1_10" code_style="NORMAL" code_parole="" sommaire="0" id_syceron="3509302" valeur="">
        <orateurs>
          <orateur>
            <nom>Mme Sophie Taillé-Polian</nom>
            <id>736201</id>
            <qualite/>
          </orateur>
        </orateurs>
        <texte stime="21900.61">Nous avons déjà entendu ce discours !</texte>
      </paragraphe>
    </point>
```
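
To make the structure concrete, here is a sketch that walks the *point* and *paragraphe* elements, keeps only the interventions with a chosen *code_grammaire*, and flattens the *texte* content. It is not the parser built later in this notebook: the helper names `clean_text` and `speeches` are invented here, and dropping everything inside `<italique>` assumes that italics mostly carry stage directions such as *(Applaudissements.)*, which will occasionally remove real text.

```python
import re
import xml.etree.ElementTree as ET

NS = {"an": "http://schemas.assemblee-nationale.fr/referentiel"}

sample = """<compteRendu xmlns="http://schemas.assemblee-nationale.fr/referentiel">
  <contenu>
    <point code_grammaire="TITRE_TEXTE_DISCUSSION">
      <texte>Allocution de Mme la présidente</texte>
      <paragraphe id_acteur="PA721908" code_grammaire="PAROLE_GENERIQUE">
        <texte>…de nos écoles, de nos services publics <italique>(Applaudissements), </italique>de l’emploi,…</texte>
      </paragraphe>
      <paragraphe id_acteur="PA719118" code_grammaire="INTERRUPTION_1_10">
        <texte>Oui !</texte>
      </paragraphe>
    </point>
  </contenu>
</compteRendu>"""


def clean_text(texte):
    """Flatten a <texte> element: drop <italique> content, turn <br/> into a space."""
    parts = [texte.text or ""]
    for child in texte:
        tag = child.tag.split("}", 1)[-1]  # local tag name without the namespace
        if tag == "br":
            parts.append(" ")
        elif tag != "italique":
            parts.append("".join(child.itertext()))  # e.g. <exposant>
        parts.append(child.tail or "")  # text that follows the child element
    return re.sub(r"\s+", " ", "".join(parts)).strip()


def speeches(root, codes=("PAROLE_GENERIQUE",)):
    """Return one dict per paragraphe whose code_grammaire is in `codes`."""
    rows = []
    for point in root.iterfind(".//an:point", NS):
        title = point.find("an:texte", NS)
        theme = clean_text(title) if title is not None else ""
        for para in point.findall("an:paragraphe", NS):
            texte = para.find("an:texte", NS)
            if para.get("code_grammaire") not in codes or texte is None:
                continue
            rows.append({
                "theme": theme,
                "id_acteur": para.get("id_acteur", ""),
                "code_grammaire": para.get("code_grammaire"),
                "text": clean_text(texte),
            })
    return rows


rows = speeches(ET.fromstring(sample))
print(rows)
```

Passing `codes=("PAROLE_GENERIQUE", "INTERRUPTION_1_10")` would also keep the interruptions as separate rows.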


### 3) Parse the XML files
Here we start the actual job of parsing the xml files, and collecting the relevant data for our dataset.
First we add a few imports and prepare access to the files we want to process.


In [28]:
import xml.etree.ElementTree as ET
import pandas as pd
# from google.colab import drive  # only needed when running in Google Colab
import re  # regular expressions, e.g. to strip <italique> and <br/> tags and get rid of in-text interruptions
import os
import pathlib
from tqdm import tqdm  # progress bar for the file loops

# Get the root project directory (assumes this script is in a subfolder, e.g., /src)
project_path = pathlib.Path().resolve().parent
# Point to your data folder relative to the project root
data_path = project_path / "data"

# Define specific folders
folder_path_organs = data_path / "AMO30_tous_acteurs_tous_mandats_tous_organes_historique.xml" / "xml" / "organe"
folder_path_acteurs = data_path / "AMO30_tous_acteurs_tous_mandats_tous_organes_historique.xml" / "xml" / "acteur"
folder_path_debatsL17 = data_path / "syseron.xml" / "xml" / "compteRendu"

# Here is the code in case you want to run it in Google Colab
# drive.mount('/content/drive')
# folder_path = "/content/drive/My Drive/Colab Notebooks/Project AN/Data/syseron.xml/xml/compteRendu/"
# define the other folder paths in the same way

file_list_organs = [f for f in os.listdir(folder_path_organs) if f.endswith(".xml")]
file_list_acteurs = [f for f in os.listdir(folder_path_acteurs) if f.endswith(".xml")]
file_list_debatsL17 = [f for f in os.listdir(folder_path_debatsL17) if f.endswith(".xml")]
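
Since the folders are already `pathlib` paths, the file lists could also be built with `glob`. The sketch below is an optional alternative, with `list_xml_files` being a made-up helper name; sorting is added because `os.listdir` returns entries in arbitrary order, and a fixed order makes runs easier to reproduce.

```python
from pathlib import Path


def list_xml_files(folder):
    """Return the .xml files directly inside `folder`, sorted by name."""
    return sorted(p for p in Path(folder).glob("*.xml") if p.is_file())


# Usage with the folders defined above (returns full paths, not just file names):
# file_list_organs = list_xml_files(folder_path_organs)
```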



#### 3.a) Extract the list of organes
First, we check how to extract the list of organs from the XML (the organs are all the bodies saved in the open data system of the Assemblée Nationale).
The aim is to get a pandas dataframe of all of them, but we will only keep the political groups (GP), because this is what we will work with.

For this, we create a function that parses the files: **Function parseOrgan**
Each XML file contains the information about one organ.
From this XML file, we will extract:
- *uid*: Unique identifier of the political body.
- *codeType*: Type code of the political body.
- *libelle*: Full label.
- *libelleAbrev*: Abbreviated label.
- *positionPolitique*: Political position, if available.
- *dateDebut*: Start date of the organ's activity (from <viMoDe> block).
- *dateFin*: End date of the organ's activity (from <viMoDe> block).
- *organeRef*: The file name without extension, used as a reference ID.

Then we put all of this in a dictionary, wrap it in a list, and return it.


In [31]:
def parseOrgan(folder_path, file_name):
    """
    Parses an XML file representing a political body ("organe") from the French National Assembly dataset.

    Parameters:
    - folder_path (str): Path to the folder containing the XML file.
    - file_name (str): Name of the XML file to parse.

    Returns:
    - data (list of dict): A list containing one dictionary with the extracted information:
        - uid: Unique identifier of the political body.
        - codeType: Type code of the political body.
        - libelle: Full label.
        - libelleAbrev: Abbreviated label.
        - positionPolitique: Political position, if available.
        - dateDebut: Start date of the organ's activity (from <viMoDe> block).
        - dateFin: End date of the organ's activity (from <viMoDe> block).
        - organeRef: The file name without extension, used as a reference ID.
    """
    file_path = os.path.join(folder_path, file_name)
    data = []
    tree = ET.parse(file_path)
    root = tree.getroot()

    # Remove namespace
    for elem in root.iter():
        if '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]

    # Safe text extractor
    def get_text(tag: str, parent=root) -> str:
        element = parent.find(tag)
        return element.text if element is not None and element.text is not None else ""

    # Main fields
    fields = ["uid", "codeType", "libelle", "libelleAbrev", "positionPolitique"]
    entry = {field: get_text(field) for field in fields}

    # Handle viMoDe block
    viMoDe = root.find("viMoDe")
    if viMoDe is not None:
        entry["dateDebut"] = get_text("dateDebut", viMoDe)
        entry["dateFin"] = get_text("dateFin", viMoDe)
    else:
        entry["dateDebut"] = ""
        entry["dateFin"] = ""

    # Add organeRef
    entry["organeRef"] = os.path.splitext(file_name)[0]

    data.append(entry)
    return data

Here we simply parse all the XML files and collect the results in a `data` variable.

In [32]:
# this is where we gather the data that we will later turn into a pandas DataFrame
data = []
for file in tqdm(file_list_organs, desc="Processing files", unit="file"):
    data.extend(parseOrgan(folder_path_organs, file))  # extend, because parseOrgan returns a list

print(len(data))

Processing files: 100%|██████████████████████████████████████████████████████████████| 10636/10636 [04:00<00:00, 44.15file/s]

10636





Now we put the `data` in a dataframe

In [34]:
df = pd.DataFrame(data)
print("This is the size with all organs:", df.shape)
df.head(5)

This is the size with all organs: (10636, 8)


|  | uid | codeType | libelle | libelleAbrev | positionPolitique | dateDebut | dateFin | organeRef |
|---|---|---|---|---|---|---|---|---|
| 0 | PO191887 | ORGEXTPARL | Commission nationale pour l'élimination des mi... | 160 |  | 1999-05-11 |  | PO191887 |
| 1 | PO201115 | API | Section française de l'Assemblée parlementaire... | APF |  | 1998-07-09 |  | PO201115 |
| 2 | PO201269 | DELEG | Délégation de l'Assemblée nationale aux droits... | EGA |  | 1999-07-12 |  | PO201269 |
| 3 | PO201275 | DELEG | Délégation de l'Assemblée nationale à l'aménag... | TER |  | 1999-06-25 | 2009-06-15 | PO201275 |
| 4 | PO201361 | ORGEXTPARL | Comité local d'information et de suivi du labo... | 161 |  | 1991-12-30 |  | PO201361 |


Now we keep only the organs that are GP (political groups).

In [36]:
# filter to only keep organs that have the "GP" code type, which corresponds to Political Groups
df_GP = df[df['codeType']=="GP"]

print("this is now the new size:", df_GP.shape)

df_GP.head(5)

this is now the new size: (62, 8)


|  | uid | codeType | libelle | libelleAbrev | positionPolitique | dateDebut | dateFin | organeRef |
|---|---|---|---|---|---|---|---|---|
| 613 | PO266900 | GP | Députés n'appartenant à aucun groupe | NI |  | 2002-06-19 | 2007-06-19 | PO266900 |
| 615 | PO270903 | GP | Union pour un Mouvement Populaire | UMP |  | 2002-06-25 | 2007-06-19 | PO270903 |
| 616 | PO270907 | GP | Socialiste | SOC |  | 2002-06-25 | 2007-06-19 | PO270907 |
| 617 | PO270911 | GP | Union pour la Démocratie Française | UDF |  | 2002-06-25 | 2007-06-19 | PO270911 |
| 618 | PO270915 | GP | Député-e-s Communistes et Républicains | CR |  | 2002-06-25 | 2007-06-19 | PO270915 |


And now we can see that we are down to only 62 political groups!


#### 3.b) Extract the list of acteurs

Similarly to what we did with the *organes*, we will create a *parseActeur* function and run it on all the files. Then we will turn the collected data into a dataframe.


First, we need two small helper functions that will be useful later on

In [37]:
# helper that converts a French date string ("weekday day month year") into ISO format (YYYY-MM-DD)
from datetime import datetime

def convert_french_date(date_string):
    # Define the French weekday and month names
    weekday_translation = {
        "lundi": "Monday",
        "mardi": "Tuesday",
        "mercredi": "Wednesday",
        "jeudi": "Thursday",
        "vendredi": "Friday",
        "samedi": "Saturday",
        "dimanche": "Sunday"
    }

    month_translation = {
        "janvier": "January",
        "février": "February",
        "mars": "March",
        "avril": "April",
        "mai": "May",
        "juin": "June",
        "juillet": "July",
        "août": "August",
        "septembre": "September",
        "octobre": "October",
        "novembre": "November",
        "décembre": "December"
    }

    # Split the date string to extract the weekday, day, month, and year
    parts = date_string.split()
    day = parts[1]
    month = month_translation[parts[2].lower()]
    year = parts[3]

    # Construct the date string in the format "day month year"
    french_date = f"{day} {month} {year}"

    # Parse the French date using datetime and format it
    date_object = datetime.strptime(french_date, "%d %B %Y")
    formatted_date = date_object.strftime("%Y-%m-%d")

    return formatted_date

# and another small function to check that the text is valid UTF-8
def is_valid_utf8(string):
    try:
        string.encode('utf-8').decode('utf-8')
        return True
    except UnicodeError:  # parent class of UnicodeEncodeError (e.g. lone surrogates) and UnicodeDecodeError
        return False
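
As a quick sanity check, here is a minimal, locale-independent sketch of the same date conversion. It assumes the session date has the form `weekday day month year`. The helper name `french_date_to_iso`, the sample dates, and the handling of the ordinal `1er` are illustrative assumptions, not something taken from the Assemblée nationale files.

```python
from datetime import date

# French month names mapped to month numbers (no locale setting needed).
MONTHS_FR = {
    "janvier": 1, "février": 2, "mars": 3, "avril": 4,
    "mai": 5, "juin": 6, "juillet": 7, "août": 8,
    "septembre": 9, "octobre": 10, "novembre": 11, "décembre": 12,
}

def french_date_to_iso(date_string):
    """Hypothetical helper: 'mardi 3 octobre 2017' -> '2017-10-03'."""
    _weekday, day, month, year = date_string.split()
    # Assumption: the first day of a month may be written '1er'.
    day_number = int(day.lower().replace("er", ""))
    return date(int(year), MONTHS_FR[month.lower()], day_number).isoformat()

print(french_date_to_iso("mardi 3 octobre 2017"))    # 2017-10-03
print(french_date_to_iso("mardi 1er octobre 2024"))  # 2024-10-01
```

Building the date from numbers avoids `strptime("%B")`, whose month names depend on the active locale.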

Now this is the actual parseActeur function. Basically, we want to extract the following data from the XML:
- acteurRef: Unique identifier of the actor.
- civ: Civil title (e.g., Mr, Ms).
- prenom: First name of the actor.
- nom: Last name of the actor.
- mandats_GP: List of mandates related to political groups (type "GP"), each represented as a dictionary with:
    - organeRef: Reference ID of the associated group.
    - dateDebut: Start date of the mandate.
    - dateFin: End date of the mandate.
    - legislature: Legislature number during which the mandate occurred.

In [39]:
def parseActeur(folder_path, file_name):
    """
    Parses an XML file representing a political actor ("acteur") from the French National Assembly dataset.

    Parameters:
    - folder_path (str): Path to the folder containing the XML file.
    - file_name (str): Name of the XML file to parse.

    Returns:
    - data (list of dict): A list containing one dictionary with the extracted information:
        - acteurRef: Unique identifier of the actor.
        - civ: Civil title (e.g., Mr, Ms).
        - prenom: First name of the actor.
        - nom: Last name of the actor.
        - mandats_GP: List of mandates related to political groups (type "GP"), each represented as a dictionary with:
            - organeRef: Reference ID of the associated group.
            - dateDebut: Start date of the mandate.
            - dateFin: End date of the mandate.
            - legislature: Legislature number during which the mandate occurred.
    """
    file_path = os.path.join(folder_path, file_name)
    data = []
    tree = ET.parse(file_path)
    root = tree.getroot()

    # Remove namespace
    for elem in root.iter():
        if '}' in elem.tag:
            elem.tag = elem.tag.split('}', 1)[1]

    # Safe text extractor
    def get_text(tag: str, parent=root) -> str:
        element = parent.find(tag)
        return element.text if element is not None and element.text is not None else ""

    entry = {}
    entry["acteurRef"] = get_text("uid")
    
    # here we collect the civ, prenom and nom
    etatCivil = root.find("etatCivil")
    if etatCivil is not None:
        ident = etatCivil.find("ident")
        if ident is not None:
            for field in ["civ", "prenom", "nom"]:
                entry[field] = get_text(field, parent=ident)

    # collect all mandats de type GP
    listMandatsCurrentMP = []
    mandats = root.findall("mandats/mandat")
    for mandat in mandats:
        if get_text("typeOrgane", parent=mandat) == "GP":
            listMandatsCurrentMP.append({
                "organeRef": get_text("organes/organeRef", parent=mandat),
                "dateDebut": get_text("dateDebut", parent=mandat),
                "dateFin": get_text("dateFin", parent=mandat),
                "legislature": get_text("legislature", parent=mandat)
            })

    entry["mandats_GP"] = listMandatsCurrentMP

    data.append(entry)
    return data
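
To see what the mandate filter does without touching the real files, here is a small sketch that applies the same paths to an in-memory XML string. The document below is heavily simplified and entirely made up (identifiers, names and dates are hypothetical); only the tag names are taken from the function above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified acteur document (all values are invented).
SAMPLE_XML = """
<acteur>
  <uid>PA_DEMO</uid>
  <etatCivil><ident><civ>Mme</civ><prenom>Jeanne</prenom><nom>Exemple</nom></ident></etatCivil>
  <mandats>
    <mandat>
      <typeOrgane>GP</typeOrgane>
      <legislature>17</legislature>
      <dateDebut>2024-07-19</dateDebut>
      <dateFin></dateFin>
      <organes><organeRef>PO_DEMO_GP</organeRef></organes>
    </mandat>
    <mandat>
      <typeOrgane>ORGEXTPARL</typeOrgane>
      <legislature>17</legislature>
      <dateDebut>2024-07-20</dateDebut>
      <dateFin></dateFin>
      <organes><organeRef>PO_DEMO_OTHER</organeRef></organes>
    </mandat>
  </mandats>
</acteur>
"""

root = ET.fromstring(SAMPLE_XML.strip())

def text_of(parent, path):
    """Return the text found at `path`, or '' when the tag is missing or empty."""
    element = parent.find(path)
    return element.text if element is not None and element.text else ""

# Keep only the mandates attached to a political group (typeOrgane == "GP").
gp_mandates = [
    {
        "organeRef": text_of(mandat, "organes/organeRef"),
        "dateDebut": text_of(mandat, "dateDebut"),
        "dateFin": text_of(mandat, "dateFin"),
        "legislature": text_of(mandat, "legislature"),
    }
    for mandat in root.findall("mandats/mandat")
    if text_of(mandat, "typeOrgane") == "GP"
]
print(gp_mandates)
```

Only the first mandate survives the filter, and its empty `dateFin` is stored as an empty string, which marks a mandate that is still running.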


Now we run the function on all the files

In [40]:
# this is where we gather the parsed acteurs, to later turn them into a pandas DataFrame
data = []
for file in tqdm(file_list_acteurs, desc="Processing files", unit="file"):
    data.extend(parseActeur(folder_path_acteurs, file))  # extend() appends the parsed entries in place



Processing files: 100%|████████████████████████████████████████████████████████████████| 3064/3064 [01:39<00:00, 30.66file/s]


In [41]:
print("We got ", len(data), " acteurs")

We got  3064  acteurs


Now we turn this into a dataframe

In [42]:
df_Acteurs = pd.DataFrame(data)
df_Acteurs.head(3)


Now we create two small functions:
- one to get the ID of the political group from an *acteurRef* and a given date (that is, which GP this MP belonged to on that date)
- one to get the abbreviation (*libelleAbrev*) of an *organe* from its ID (*organeRef*)

In [53]:
def getOrganeRefFromActeurRefAtGivenDate(acteurRef, dateSeanceJour, df_Acteurs):
    #we collect the list of mandates for a given MP identified with their acteurRef id
    mandats = df_Acteurs.loc[df_Acteurs["acteurRef"] == acteurRef, "mandats_GP"].values

    def trouver_mandat_a_date(mandats, date_seance):
        date_seance = datetime.fromisoformat(date_seance)
        if mandats is not None and hasattr(mandats, "__len__") and len(mandats) > 0:
            for mandat in mandats.tolist()[0]:
                debut = datetime.fromisoformat(mandat['dateDebut'])
                # If 'dateFin' is empty, the mandate is still running: we use a far-future date (9999-12-31)
                fin_str = mandat['dateFin']
                fin = datetime.fromisoformat(fin_str) if fin_str else datetime(9999, 12, 31)
        
                if debut <= date_seance <= fin:
                    return mandat['organeRef']
        return None
    
    organeRef = trouver_mandat_a_date(mandats, dateSeanceJour)
    return organeRef

In [52]:
def getLibelleAbrevFromOrganeRef(organeRef, df_GP):
    organRef = str(organeRef).strip() if organeRef is not None else None
    
    if organRef is None:
        print(f"[WARN] organeRef is None")
        return None

    matches = df_GP[df_GP["organeRef"].astype(str).str.strip() == organRef]["libelleAbrev"].tolist()
    
    if matches:
        return matches[0]
    else:
        print(f"[WARN] organeRef '{organRef}' not found in df_GP")
        return None
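
The two lookups can be chained: actor and date give an *organeRef*, which gives an abbreviation. Below is a minimal sketch of that chain using plain dictionaries instead of dataframes. The two group IDs and abbreviations come from the `df_GP` preview above; the actor `PA_DEMO`, its mandates, and the function name `group_at` are hypothetical.

```python
from datetime import datetime

# Two political groups taken from the df_GP preview.
abbrev_by_organe = {"PO270903": "UMP", "PO270907": "SOC"}

# Hypothetical actor who changes group during the legislature.
mandates_by_acteur = {
    "PA_DEMO": [
        {"organeRef": "PO270907", "dateDebut": "2002-06-25", "dateFin": "2004-03-31"},
        {"organeRef": "PO270903", "dateDebut": "2004-04-01", "dateFin": ""},
    ]
}

def group_at(acteur_ref, iso_date):
    """Return the group abbreviation of `acteur_ref` on `iso_date`, or None."""
    day = datetime.fromisoformat(iso_date)
    for mandate in mandates_by_acteur.get(acteur_ref, []):
        start = datetime.fromisoformat(mandate["dateDebut"])
        # An empty end date means the mandate is still running.
        end = datetime.fromisoformat(mandate["dateFin"]) if mandate["dateFin"] else datetime.max
        if start <= day <= end:
            return abbrev_by_organe.get(mandate["organeRef"])
    return None

print(group_at("PA_DEMO", "2003-01-15"))  # SOC
print(group_at("PA_DEMO", "2005-01-15"))  # UMP
print(group_at("PA_DEMO", "2001-01-15"))  # None
```

A dictionary keyed by ID is one possible design choice here: each lookup is a direct access, whereas filtering a dataframe scans its rows on every call.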


#### 3.d) Parse Compte Rendu and create the data set
In this section, we will parse the debate files. Again, the objective is to create a dataset of examples that look like this:

````
{
    "text":" [THEME] Projet de loi sur la sécurité intérieure 
            [CONTEXTE] 
            Intervenant 1 : Nous voulons plus de sécurité dans nos quartiers.
            Intervenant 2 : Ce projet est une atteinte aux libertés publiques.
            [INTERVENTION] Je soutiens cette mesure pour protéger nos concitoyens.",
    "label":"RN"
}
````
Where the label (here "RN"), indicates the Political Group of the speaker in the [INTERVENTION].



First we write a quick function that cleans the text we collect

In [51]:
def clean_paragraph_text(rawText):
    cleanText = re.sub(r"<br\s*/?>", "\n", rawText or "")  # turn <br/> line breaks into newlines
    cleanText = re.sub(r"<italique>.*?</italique>", "", cleanText)  # drop in-paragraph interruptions (italic passages)
    cleanText = cleanText.replace('\xa0', ' ')  # replace non-breaking spaces with regular spaces
    cleanText = re.sub(r'\s+', ' ', cleanText)  # collapse runs of whitespace (newlines included) into one space
    if not is_valid_utf8(cleanText):  # warn if the text cannot be encoded as UTF-8
        print("[WARN] text is not valid UTF-8")
    return cleanText.strip()
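
To make the effect of each step concrete, here is a small sketch that applies the same four substitutions to a made-up sentence (the sample text is an assumption, not taken from a real transcript).

```python
import re

# Invented paragraph with a non-breaking space, an italic interruption,
# a <br/> tag and a double space.
raw = "Merci,\xa0monsieur le président. <italique>(Applaudissements.)</italique><br/>Je  poursuis."

text = re.sub(r"<br\s*/?>", "\n", raw)                # <br/> becomes a newline
text = re.sub(r"<italique>.*?</italique>", "", text)  # the italic passage is removed
text = text.replace("\xa0", " ")                      # non-breaking space becomes a space
text = re.sub(r"\s+", " ", text).strip()              # whitespace runs collapse to one space

print(text)  # Merci, monsieur le président. Je poursuis.
```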


##### 3.d.1) Extract Point: the *parsePoint()* function
We create a function *parsePoint()* that takes a given *point* of a debate file and extracts one or more supervised learning examples from this XML `<point>` block, which represents part of a parliamentary debate.



Each example includes:
- a theme (the topic of the point),
- a context (a sequence of previous interventions),
- a target intervention (to be predicted),
- and the political group of the speaker (used as the label).

**Function signature:**
```python
def parsePoint(
    point,
    dateSeanceJour,
    df_Acteurs,
    df_GP,
    numberOfPreviousInterventionsForContext=0,
    mentionPreviousOrateursGP=False,
    removeInterruptions=False
):
```
**Parameters:**

|Name | Type | Description|
|-----|------|------------|
|point | xml.etree.ElementTree.Element | XML `<point>` element containing debate paragraphs.|
|dateSeanceJour | str | Date of the session in ISO format (YYYY-MM-DD), used to identify the speaker’s political group at that time.|
|df_Acteurs | pd.DataFrame | Metadata about MPs (speakers), including mandate and affiliated groups.|
|df_GP | pd.DataFrame | Political group metadata (abbreviations, IDs, etc.).|
|numberOfPreviousInterventionsForContext | int | Number of previous interventions to include as context.|
|mentionPreviousOrateursGP | bool | Whether to include the political group of each context speaker in the text.|
|removeInterruptions | bool | If True, paragraphs marked as interruptions are excluded.|

**Returns:**

|Type|	Description|
|-----|------|
|list[dict]	|List of training examples, each as a dictionary: {"text": ..., "label": ...}.|

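 **Processing Steps:**
- Extract Theme: Gets the content of the `<texte>` element within the `<point>`.
- Filter Paragraphs: Keeps only those with code_grammaire == "PAROLE_GENERIQUE", plus those with code_grammaire == "INTERRUPTION_1_10" unless removeInterruptions is True.
- Generate Examples: For each window of context + 1 target paragraph:
    - Clean and format previous paragraphs as [CONTEXT].
    - Optionally prepend speaker political group (if mentionPreviousOrateursGP is True).
    - Clean and format the target paragraph as [INTERVENTION].
- Format Output: Each training sample is a dictionary `{"text": ..., "label": ...}`, where the text concatenates [THEME], [CONTEXT] and [INTERVENTION], and the label is the abbreviation of the target speaker's political group. Targets whose speaker has no political group mandate on the session date are skipped.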

In [50]:
def parsePoint(point, dateSeanceJour, df_Acteurs, df_GP, numberOfPreviousInterventionsForContext = 0, mentionPreviousOrateursGP = False, removeInterruptions = False):
    listOfExamplesForThisPoint = []

    # this "texte" is basically the topic of the "point", that's why we save it
    textPoint = point.find("texte")
    
    # Now we will list all the paragraphs
    paragrapheList = point.findall("paragraphe")
    if len(paragrapheList) ==0:
        # print("  "+"Paragraphe not found !!!!")
        pass
    else:
        # we only want to keep PAROLE_GENERIQUE paragraphs, plus INTERRUPTION_1_10 ones (if the argument removeInterruptions is False),
        # which means the regular interventions/statements by MPs, excluding all other paragraph types
        # for this we just create a new list and keep only the matching paragraphs
        # Here we could also filter out interventions that carry little political content

        cleanParagrapheList = []
        
        for paragraphe in paragrapheList:
            codeGrammaire = paragraphe.get("code_grammaire")
            if (codeGrammaire == "PAROLE_GENERIQUE"):
                cleanParagrapheList.append(paragraphe)
            if (removeInterruptions == False and codeGrammaire =="INTERRUPTION_1_10"):
                cleanParagrapheList.append(paragraphe)
                
        #Now cleanParagrapheList contains only the paragraphs we care about
        #First we check that the new list of paragraphs is not empty
        if len(cleanParagrapheList) ==0:
            # print("  "+"Paragraphe not found !!!!")
            pass
        elif len(cleanParagrapheList) <numberOfPreviousInterventionsForContext +1:
            pass
            # print("Not enough Interventions in the cleanParagrapheList: needed="+str(numberOfPreviousInterventionsForContext)+", actual="+str(len(cleanParagrapheList)))
        else:
    
            # We have enough paragraphs to build at least one example.

            # Slide a window over the paragraphs: each offset yields one training example
            # (the context paragraphs followed by the target intervention)
            maxOffset = len(cleanParagrapheList) - (numberOfPreviousInterventionsForContext + 1)
            for paragraphOffset in range(maxOffset + 1):
                firstContextParagraphIndex = paragraphOffset
                lastContextParagraphIndex = paragraphOffset + numberOfPreviousInterventionsForContext - 1
                interventionIndex = paragraphOffset + numberOfPreviousInterventionsForContext


                theme = "[THEME]" + "\n" + textPoint.text
                context = "[CONTEXT]"
                intervention = "[INTERVENTION]"

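                # lastContextParagraphIndex is inclusive, hence the "+ 1" (otherwise the last context paragraph is skipped)
                for contextParagraphIndex in range(firstContextParagraphIndex, lastContextParagraphIndex + 1):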
                    currentContextParagraph = cleanParagrapheList[contextParagraphIndex]
                    
                    # we save information about who speaks
                    acteurRef = currentContextParagraph.get("id_acteur")
                    mandatRef = currentContextParagraph.get("id_mandat")

                    organeRef = getOrganeRefFromActeurRefAtGivenDate(acteurRef, dateSeanceJour, df_Acteurs)
                    if organeRef is not None:
                    # print("organeRef: ", organeRef)
                        gpLibelleAbrev = getLibelleAbrevFromOrganeRef(organeRef, df_GP)
                    # print("LibelleAbrev :", gpLibelleAbrev)
                    else:
                        gpLibelleAbrev = "X"

                    # now we collect what the person is saying
                    texteEl = currentContextParagraph.find("texte")
                    rawText = texteEl.text
                    
                    cleanText = clean_paragraph_text(rawText)

                    if mentionPreviousOrateursGP == False:
                        currentContextParagraphParsed = cleanText
                    else:
                        currentContextParagraphParsed = "intervenant "+gpLibelleAbrev+" :" + cleanText

                    context = context + "\n" + currentContextParagraphParsed
                #here we have just finished building the context.
                
                #Now we get the info about the main intervention
                paragraphIntervention = cleanParagrapheList[interventionIndex]
                # we save information about who speaks
                acteurRefInt = paragraphIntervention.get("id_acteur")
                mandatRefInt = paragraphIntervention.get("id_mandat")
                organeRefInt = getOrganeRefFromActeurRefAtGivenDate(acteurRefInt, dateSeanceJour, df_Acteurs)
                if organeRefInt is not None:
                    # print("organeRef: ", organeRef)
                    gpLibelleAbrevInt = getLibelleAbrevFromOrganeRef(organeRefInt, df_GP)
                                    # now we collect what the person is saying
                    texteEl = paragraphIntervention.find("texte")
                    rawText = texteEl.text
                    
                    cleanText = clean_paragraph_text(rawText)
    
                    intervention = intervention + "\n" + cleanText
                    
                    inputLLM = theme + "\n" + context + "\n" + intervention
    
                    example = {"text" : inputLLM, 'label': gpLibelleAbrevInt}
                    listOfExamplesForThisPoint.append(example)
                else:
                    pass
            #We are now done looping over all the paragraphs of this point
    return listOfExamplesForThisPoint
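
The windowing logic inside *parsePoint()* can be isolated in a few lines. The sketch below is a simplified illustration, not the function itself: `build_windows` is a hypothetical name, and the interventions are reduced to single letters.

```python
def build_windows(interventions, n_context):
    """Pair each target intervention with the n_context interventions before it."""
    windows = []
    # The first possible target is at index n_context, so that a full context exists.
    for target_index in range(n_context, len(interventions)):
        context = interventions[target_index - n_context:target_index]
        windows.append((context, interventions[target_index]))
    return windows

speeches = ["A", "B", "C", "D", "E"]
for context, target in build_windows(speeches, 2):
    print(context, "->", target)
# ['A', 'B'] -> C
# ['B', 'C'] -> D
# ['C', 'D'] -> E
```

A point with `k` kept paragraphs therefore yields `k - n_context` examples, and none at all when it has fewer than `n_context + 1` paragraphs.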

##### 3.d.2) Parse Compte Rendu 
We will now create a function that processes a file containing debate minutes, that is, an XML transcript of a French National Assembly session ("compte rendu").
This function processes the XML structure, removes namespaces, and loops over all "point" sections (i.e., topics or agenda points). It then delegates to parsePoint() to extract labeled text data suitable for machine learning (e.g., group classification tasks).

**Function's signature**
```` 
def parseCompteRendu(
    file_path,
    df_Acteurs,
    df_GP,
    numberOfPreviousInterventionsForContext=0,
    mentionPreviousOrateursGP=False,
    removeInterruptions=False
):
````

**Function's parameters**
 |Name | Type | Description |
 |-----|------|------------|
 |file_path | str | Path to the XML transcript file (compte rendu). |
 |df_Acteurs | pd.DataFrame | DataFrame containing metadata about MPs (députés). |
 |df_GP | pd.DataFrame | DataFrame containing information about political groups. |
 |numberOfPreviousInterventionsForContext | int, optional | Number of previous interventions to include as context (default: 0). |
 |mentionPreviousOrateursGP | bool, optional | Whether to include the political group of previous speakers in the context (default: False). |
 |removeInterruptions | bool, optional | Whether to remove paragraphs labeled as interruptions (default: False). |


**Function's return**
`List[Dict[str, str]]`: a list of training examples. Each element is a dictionary of the form:
````
{
  "text": "[THEME] .....[CONTEXT] ... [INTERVENTION]....",
  "label": "<group_label>"
}
````

**Workflow Summary**
- Loads and parses the XML transcript using ElementTree.
- Removes XML namespaces to simplify tag access.
- Extracts metadata such as session ID and date.
- Iterates over each point in the debate.
- For each point, calls parsePoint() to generate one or more labeled text examples.
- Collects and returns all generated examples as a list.
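
The namespace removal and the loop over points can be tried on a tiny in-memory document. This is a sketch under assumptions: the namespace URI `http://example.org/demo` and all identifiers and texts are invented, and the structure is reduced to the tags used in this section.

```python
import xml.etree.ElementTree as ET

# Invented, minimal compte rendu with a default namespace.
SAMPLE = (
    '<compteRendu xmlns="http://example.org/demo">'
    "<uid>CR_DEMO</uid>"
    "<contenu>"
    "<point><texte>Premier sujet</texte>"
    '<paragraphe code_grammaire="PAROLE_GENERIQUE" id_acteur="PA_DEMO">'
    "<texte>Bonjour.</texte></paragraphe>"
    "</point>"
    "<point><texte>Second sujet</texte></point>"
    "</contenu>"
    "</compteRendu>"
)

root = ET.fromstring(SAMPLE)
print(root.tag)  # {http://example.org/demo}compteRendu

# Strip the "{namespace}" prefix from every tag, root included.
for elem in root.iter():
    elem.tag = elem.tag.split("}")[-1]

print(root.tag)  # compteRendu

# Plain tag names now work in find() / findall().
themes = [point.find("texte").text for point in root.find("contenu").findall("point")]
print(themes)  # ['Premier sujet', 'Second sujet']
```

Without the stripping step, every path would have to carry the namespace, which makes calls such as `root.find("uid")` return `None`.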

In [48]:
def parseCompteRendu(file_path, df_Acteurs, df_GP,numberOfPreviousInterventionsForContext = 0, mentionPreviousOrateursGP = False, removeInterruptions = False):
    data = [] #we will put the data here
    tree = ET.parse(file_path)
    root = tree.getroot()
    
    #this removes the namespace
    #we remove the namespaces because we will only work with this data internally,
    # and we will not re-export the XML
    for elem in root.iter():
        elem.tag = elem.tag.split("}")[-1]  # Remove namespace prefix
    
    # here we already gather some nice identifiers that we will add to the data
    uid = root.find("uid").text
    sessionRef = root.find("sessionRef").text
    seanceRef = root.find("seanceRef").text
    dateSeanceJour = root.find("metadonnees/dateSeanceJour").text
    dateSeanceJour = convert_french_date(dateSeanceJour)
    
    #the minutes file is divided into "points", each "point" contains "paragraphs" (the speaker is attached to each paragraph)
    listPoints = root.find("contenu").findall("point")
    
    # now we explore each "point" one by one
    for point in listPoints:
        examplesListForThisPoint = parsePoint(point, dateSeanceJour, df_Acteurs, df_GP, numberOfPreviousInterventionsForContext , mentionPreviousOrateursGP, removeInterruptions)
        data.extend(examplesListForThisPoint)
    
    return data




##### 3.d.3) Create dataset
Now we process all the debate files we collected to create the dataset. The dataset is stored in the variable *data*.

We call it with the following parameters:
- numberOfPreviousInterventionsForContext = 5, because we want to have the 5 previous interventions in the context
- mentionPreviousOrateursGP = True, we want to mention the speaker's GP for each intervention in the context
- removeInterruptions = False, we want to keep the interruptions

In [54]:
# this is where we gather the examples that make up the dataset
data = []
# print(os.path.exists(folder_path_debatsL17))  # Check if the directory exists
# print(os.listdir(folder_path_debatsL17))  # List files in the directory
for file in tqdm(file_list_debatsL17, desc="Processing files", unit="file"):
    file_path = os.path.join(folder_path_debatsL17, file)
    data.extend(parseCompteRendu(file_path, df_Acteurs, df_GP,numberOfPreviousInterventionsForContext = 5, mentionPreviousOrateursGP = True, removeInterruptions = False))

Processing files: 100%|██████████████████████████████████████████████████████████████████| 116/116 [04:25<00:00,  2.29s/file]


We quickly check what is the size of the dataset we extracted:

In [55]:
len(data)

17942

### 4) Some last checks and we save!

#### 4.1) Undersampling

We count how many examples we got per class

In [30]:
from collections import Counter

label_counts = Counter(example["label"] for example in data)

# Print sorted by count (ascending)
for label, count in sorted(label_counts.items(), key=lambda x: x[1]):
    print(f"{label}: {count}")

NI: 95
LIOT: 292
UDR: 445
HOR: 778
GDR: 787
DEM: 917
SOC: 1302
RN: 1927
ECOS: 2093
EPR: 2625
DR: 2992
LFI-NFP: 3689


We can see that the groups do not contribute to the parliamentary debate to the same extent. One obvious reason is that some groups have far fewer members than others. To optimize learning, we want each group (that is, each class) to be equally represented. This is dataset balancing. We do it by removing the classes that are very under-represented, and then undersampling: we limit the number of examples of the over-represented classes to the size of the smallest remaining class, so that the dataset is balanced.

Here is the function we will use to do that:

In [59]:
import random
from collections import defaultdict

def undersample_list_dataset(dataset, seed=42):
    random.seed(seed)

    # Group items by label
    grouped = defaultdict(list)
    for item in dataset:
        grouped[item["label"]].append(item)

    # Find the minimum class size
    min_size = min(len(items) for items in grouped.values())

    # Sample min_size elements from each group
    undersampled_data = []
    for label, items in grouped.items():
        undersampled_items = random.sample(items, min_size)
        undersampled_data.extend(undersampled_items)

    # Shuffle final dataset
    random.shuffle(undersampled_data)

    return undersampled_data


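We decide to drop the groups NI and LIOT, the two smallest classes (95 and 292 examples). Since undersampling cuts every class down to the size of the smallest one, keeping them would leave very few examples per class.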

In [65]:
labels_to_keep = {"LFI-NFP", "DR", "EPR", "ECOS", "RN", "SOC", "DEM", "GDR", "HOR", "UDR"}  # dropped NI and LIOT

filtered_dataset = [ex for ex in data if ex["label"] in labels_to_keep]
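
The effect of dropping classes on the size of the balanced dataset is simple arithmetic: number of kept classes times the size of the smallest kept class. The sketch below works it out from the per-class counts printed earlier; the helper name `balanced_size` is hypothetical.

```python
# Per-class counts, copied from the output of the counting cell.
label_counts = {
    "NI": 95, "LIOT": 292, "UDR": 445, "HOR": 778, "GDR": 787, "DEM": 917,
    "SOC": 1302, "RN": 1927, "ECOS": 2093, "EPR": 2625, "DR": 2992, "LFI-NFP": 3689,
}

def balanced_size(counts, dropped=()):
    """Size of the dataset after undersampling every kept class to the smallest one."""
    kept = {label: n for label, n in counts.items() if label not in dropped}
    return min(kept.values()) * len(kept)

print(balanced_size(label_counts))                  # 1140  (12 classes x 95)
print(balanced_size(label_counts, {"NI"}))          # 3212  (11 classes x 292)
print(balanced_size(label_counts, {"NI", "LIOT"}))  # 4450  (10 classes x 445)
```

Dropping the two smallest groups roughly quadruples the number of usable examples compared with keeping all twelve classes.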

Now we balance the dataset, then double-check the number of examples per class

In [66]:
balanced_filtered_dataset = undersample_list_dataset(filtered_dataset)

print(Counter(item["label"] for item in balanced_filtered_dataset))


Counter({'GDR': 445, 'UDR': 445, 'RN': 445, 'ECOS': 445, 'DEM': 445, 'EPR': 445, 'HOR': 445, 'DR': 445, 'LFI-NFP': 445, 'SOC': 445})


This looks nice, we are ready to go to the next step (training!), so let's save our data for now!

In [70]:
import json
import os

# Define folder where to save the dataset
folder_path_csv_to_save = data_path / "processedData" / "GPClassification"
folder_path_csv_to_save.mkdir(parents=True, exist_ok=True)  # Ensure the folder exists

# Define the output file path
json_file_path = folder_path_csv_to_save / "balanced_dataset-removed-2-classes-THEME-CONTEXT-INTERVENTION-5InterventionsForContext-GPmentionedInContext.json"

# Save to a JSON file (explicit UTF-8 encoding, so the result does not depend on the platform default)
with open(json_file_path, 'w', encoding='utf-8') as json_file:
    json.dump(balanced_filtered_dataset, json_file)
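
Before moving on to training, it can be worth checking that the saved file loads back unchanged. Here is a minimal round-trip sketch; it writes a single made-up example to a temporary folder rather than to the real output path, and `ensure_ascii=False` is an optional choice that keeps accented characters readable in the file.

```python
import json
import tempfile
from pathlib import Path

# One invented example in the same {"text", "label"} shape as the dataset.
examples = [
    {
        "text": "[THEME]\nSécurité intérieure\n[CONTEXT]\n[INTERVENTION]\nJe soutiens cette mesure.",
        "label": "RN",
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "demo_dataset.json"  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump(examples, f, ensure_ascii=False)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded == examples)  # True
```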
