# Tutorial 1 - First steps with the ParlaMint dataset

**ParlaMint**:
- Multilingual corpus of **European parliamentary debates**
    - Contains transcribed speeches with rich metadata (speaker role, party affiliation, dates, speech length, etc.)
    - ParlaCAP: adds **Comparative Agendas Project (CAP)** policy topic codes (categories)
    - ParlaSent: adds **sentiment scores** per speech segment
- Covers **multiple countries** and time periods
- Enables **comparative political research**
- Useful for **topic modeling** and **sentiment analysis**
- Source: Erjavec, Tomaž; et al., 2025, Multilingual comparable corpora of parliamentary debates ParlaMint 5.0, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/2004


**1. Setup**

Setting up the requirements:

In [1]:
import pandas as pd
from pathlib import Path
import csv

**2. Data Loading & Filtering**

The following code:
- **creates a list of country codes** for which we have processed *.tsv* speech data files
- **loops through each country code** in the list, reads the matching file in chunks and adds a column to label the country
- and then **combines all the files** into **one big dataset**

While reading, it keeps only speeches held by regular MPs (*Members of Parliament*) and drops rows whose CAP category or sentiment labels are missing. The CAP categories "Mix" and "Other" are not removed at this stage, as the value counts further down show. It also creates separate DataFrames for coalition and opposition party speeches.

In [2]:
# ---- 1. First, we have to increase the CSV field size limit ----
max_int = 2**31 - 1
while True:
    try:
        csv.field_size_limit(max_int)
        break
    except OverflowError:
        max_int = max_int // 10

countries = ["AT", "BA", "BE", "BG", "CZ", "DK", "EE", "ES-CT", "ES-GA", "ES-PV",
             "FR", "GB", "GR", "HR", "HU", "IS", "IT", "LV", "NL", "NO", "PL", "PT", "RS", "SE", "SI", "TR", "UA"] #change country codes according to your available datasets

base_dir = Path().resolve()

# ---- 2. Choose what columns to read (including CAP and sentiment columns) ----
cols_to_keep = [
    "id", "date", "lang_code", "lang", "speaker_role", "speaker_MP",
    "speaker_minister", "speaker_party", "speaker_party_name", "party_status",
    "party_orientation", "speaker_id", "speaker_name", "speaker_gender",
    "speaker_birth", "word_count", "CAP_category", "sent3_category", "sent6_category", "sent_logit"
]

# ---- 3. Define dtypes to reduce memory ----
dtypes = {
    "id": str,
    "date": str,
    "lang_code": "category",
    "lang": "category",
    "speaker_role": "category",
    "speaker_MP": "category",
    "speaker_minister": "category",
    "speaker_party": "category",
    "speaker_party_name": "category",
    "party_status": "category",
    "party_orientation": "category",
    "speaker_id": str,
    "speaker_name": str,
    "speaker_gender": "category",
    "speaker_birth": "Int32",
    "word_count": "Int32",
    "CAP_category": "category",
    "sent3_category": "category",
    "sent6_category": "category",
    "sent_logit": "float64"
}

# ---- 4. Create lists to accumulate filtered chunks ----
all_chunks = []
coalition_chunks = []
opposition_chunks = []

for country in countries:
    file_path = base_dir / f"ParlaMint-{country}_processed_no_text.tsv"

    # --- 4.1. Read in chunks using pandas.read_csv ----
    for chunk in pd.read_csv(file_path, sep="\t", usecols=cols_to_keep,
                             dtype=dtypes, chunksize=50_000, engine="python"):
        chunk["country"] = country
        chunk["country"] = chunk["country"].astype("category")

        # ---- 4.2. Filter MPs with regular role ----
        filtered_chunk = chunk.query("speaker_MP == 'MP' and speaker_role == 'Regular'")

        # ---- 4.3. Drop rows where CAP_category or sentiment is empty ----
        filtered_chunk = filtered_chunk[
            filtered_chunk["CAP_category"].notna() & (filtered_chunk["CAP_category"] != "") &
            filtered_chunk["sent3_category"].notna() & (filtered_chunk["sent3_category"] != "") &
            filtered_chunk["sent6_category"].notna() & (filtered_chunk["sent6_category"] != "")
        ]

        # ---- 4.4. Accumulate filtered chunks ----
        if not filtered_chunk.empty:
            all_chunks.append(filtered_chunk)

            # ---- 4.5. Split coalition / opposition ----
            coalition_chunk = filtered_chunk[filtered_chunk["party_status"] == "Coalition"]
            if not coalition_chunk.empty:
                coalition_chunks.append(coalition_chunk)

            opposition_chunk = filtered_chunk[filtered_chunk["party_status"] == "Opposition"]
            if not opposition_chunk.empty:
                opposition_chunks.append(opposition_chunk)

# ---- 5. Concatenate all accumulated chunks into DataFrames ----
filtered_all = pd.concat(all_chunks, ignore_index=True)
filtered_all_coalition = pd.concat(coalition_chunks, ignore_index=True)
filtered_all_opposition = pd.concat(opposition_chunks, ignore_index=True)

print("DataFrames ready:")
print("All filtered:", filtered_all.shape)
print("Coalition:", filtered_all_coalition.shape)
print("Opposition:", filtered_all_opposition.shape)

DataFrames ready:
All filtered: (4565042, 21)
Coalition: (1456092, 21)
Opposition: (2047981, 21)
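
The chunked read-and-filter pattern used above can be tried out on a tiny in-memory table before running it on the full files. The following is only a minimal sketch: the five rows, the reduced set of columns and the country label "XX" are made up for illustration and are not part of the ParlaMint data.

```python
import io
import pandas as pd

# Hypothetical miniature TSV standing in for one processed country file.
tsv_text = (
    "id\tspeaker_role\tspeaker_MP\tparty_status\tCAP_category\n"
    "s1\tRegular\tMP\tCoalition\tHealth\n"
    "s2\tChairperson\tMP\tCoalition\tHealth\n"
    "s3\tRegular\tnotMP\tOpposition\tEnergy\n"
    "s4\tRegular\tMP\tOpposition\t\n"
    "s5\tRegular\tMP\tOpposition\tEnergy\n"
)

kept = []
# chunksize=2 forces several chunks even for this tiny input.
for chunk in pd.read_csv(io.StringIO(tsv_text), sep="\t", chunksize=2):
    chunk["country"] = "XX"  # placeholder country label (assumption)
    # Keep regular MPs only, then drop rows without a CAP category.
    chunk = chunk.query("speaker_MP == 'MP' and speaker_role == 'Regular'")
    chunk = chunk[chunk["CAP_category"].notna()]
    if not chunk.empty:
        kept.append(chunk)

# Combine the surviving chunks and split by party status.
demo = pd.concat(kept, ignore_index=True)
demo_coalition = demo[demo["party_status"] == "Coalition"]
demo_opposition = demo[demo["party_status"] == "Opposition"]
print(demo.shape, demo_coalition.shape, demo_opposition.shape)
```

Filtering each chunk before appending it means that only the rows we actually need stay in memory, which matters for files with millions of speeches.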


Now, we can explore the general structure of the data:
- "*.head(10)*" returns the **first ten rows** of our DataFrame (*filtered_all*)
    - In a notebook, the output also shows "10 rows x n columns" at the bottom, which tells us how many columns the DataFrame has

In [3]:
filtered_all.head(10)

Unnamed: 0,id,date,lang_code,lang,speaker_role,speaker_MP,speaker_minister,speaker_party,speaker_party_name,party_status,...,speaker_id,speaker_name,speaker_gender,speaker_birth,word_count,CAP_category,sent_logit,sent3_category,sent6_category,country
0,ParlaMint-AT_2014-05-20-025-XXV-NRSITZ-00026_d...,2014-05-20,AT,German,Regular,MP,notMinister,NEOS,parliamentary group of NEOS,-,...,PAD_10304,"Strolz, Matthias",M,1973,685,Macroeconomics,1.600716,Neutral,Neutral Negative,AT
1,ParlaMint-AT_2014-05-20-025-XXV-NRSITZ-00026_d...,2014-05-20,AT,German,Regular,MP,notMinister,NEOS,parliamentary group of NEOS,-,...,PAD_10304,"Strolz, Matthias",M,1973,685,Macroeconomics,1.600716,Neutral,Neutral Negative,AT
2,ParlaMint-AT_2014-05-20-025-XXV-NRSITZ-00026_d...,2014-05-20,AT,German,Regular,MP,notMinister,SPÖ,parliamentary group of the Social Democratic P...,Coalition,...,PAD_35504,"Schieder, Andreas",M,1969,367,Macroeconomics,1.182058,Negative,Mixed Negative,AT
3,ParlaMint-AT_2014-05-20-025-XXV-NRSITZ-00026_d...,2014-05-20,AT,German,Regular,MP,notMinister,SPÖ,parliamentary group of the Social Democratic P...,Coalition,...,PAD_35504,"Schieder, Andreas",M,1969,367,Macroeconomics,1.182058,Negative,Mixed Negative,AT
4,ParlaMint-AT_2014-05-20-025-XXV-NRSITZ-00026_d...,2014-05-20,AT,German,Regular,MP,notMinister,FPÖ,parliamentary group of the Austrian Freedom Party,Opposition,...,PAD_59908,"Podgorschek, Elmar",M,1958,491,Macroeconomics,2.138607,Neutral,Neutral Negative,AT
5,ParlaMint-AT_2014-05-20-025-XXV-NRSITZ-00026_d...,2014-05-20,AT,German,Regular,MP,notMinister,FPÖ,parliamentary group of the Austrian Freedom Party,Opposition,...,PAD_59908,"Podgorschek, Elmar",M,1958,491,Macroeconomics,2.138607,Neutral,Neutral Negative,AT
6,ParlaMint-AT_2014-05-20-025-XXV-NRSITZ-00026_d...,2014-05-20,AT,German,Regular,MP,notMinister,ÖVP,parliamentary group of the Austrian People's P...,Coalition,...,PAD_15526,"Lopatka, Reinhold",M,1960,429,Macroeconomics,1.885045,Neutral,Neutral Negative,AT
7,ParlaMint-AT_2014-05-20-025-XXV-NRSITZ-00026_d...,2014-05-20,AT,German,Regular,MP,notMinister,ÖVP,parliamentary group of the Austrian People's P...,Coalition,...,PAD_15526,"Lopatka, Reinhold",M,1960,429,Macroeconomics,1.885045,Neutral,Neutral Negative,AT
8,ParlaMint-AT_2014-05-20-025-XXV-NRSITZ-00026_d...,2014-05-20,AT,German,Regular,MP,notMinister,Grüne,parliamentary group of The Greens – The Green ...,-,...,PAD_35516,"Rossmann, Bruno",M,1952,646,Macroeconomics,1.745513,Neutral,Neutral Negative,AT
9,ParlaMint-AT_2014-05-20-025-XXV-NRSITZ-00026_d...,2014-05-20,AT,German,Regular,MP,notMinister,STRONACH,parliamentary group Team Stronach,-,...,PAD_83143,"Vetter, Georg",M,1962,384,Macroeconomics,1.174776,Negative,Mixed Negative,AT
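
The "..." column in the output above stands for columns that the display has hidden. A minimal sketch of two ways to see every column, using a made-up 30-column frame (*wide*) as a stand-in for *filtered_all*:

```python
import pandas as pd

# Hypothetical wide frame standing in for filtered_all (30 made-up columns).
wide = pd.DataFrame([range(30)], columns=[f"col_{i}" for i in range(30)])

# Option 1: list the column names directly.
column_names = wide.columns.tolist()
print(column_names)

# Option 2: temporarily lift the column display limit for one print call.
# The previous setting is restored when the "with" block ends.
with pd.option_context("display.max_columns", None):
    print(wide.head(10))
```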


*.shape* lets us see how many **rows and columns** our DataFrame has 

In [6]:
filtered_all.shape

(4565042, 21)

Get an **overview of the columns "country", "CAP_category" and "sent3_category"** in the DataFrame (number of values, number of unique values, most frequent value and its frequency)

In [8]:
filtered_all[["country", "CAP_category", "sent3_category"]].describe()


Unnamed: 0,country,CAP_category,sent3_category
count,4565042,4565042,4565042
unique,24,23,3
top,TR,Other,Neutral
freq,1005966,1469743,2271571
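
To see how the four rows of this summary come about, the same call can be run on a small made-up table. This is only an illustrative sketch; the five rows below are invented and the frame name *toy* is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns.
toy = pd.DataFrame({
    "country": ["AT", "AT", "SI", "SI", "SI"],
    "sent3_category": ["Neutral", "Negative", "Neutral", "Neutral", "Positive"],
}).astype("category")

# For non-numeric columns, describe() reports:
#   count  = number of non-missing values
#   unique = number of distinct values
#   top    = most frequent value
#   freq   = how often the most frequent value occurs
summary = toy.describe()
print(summary)
```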


See which **unique values** a column contains, e.g. for "CAP_category"

In [None]:
pd.Series(filtered_all["CAP_category"].unique())

0            Macroeconomics
1                   Culture
2     Government Operations
3              Civil Rights
4                       Mix
5               Immigration
6            Transportation
7                   Defense
8            Social Welfare
9                     Labor
10                  Housing
11                Education
12                   Energy
13              Agriculture
14                   Health
15              Environment
16        Domestic Commerce
17                    Other
18    International Affairs
19            Law and Crime
20             Public Lands
21            Foreign Trade
22               Technology
dtype: category
Categories (23, object): ['Agriculture', 'Civil Rights', 'Culture', 'Defense', ..., 'Public Lands', 'Social Welfare', 'Technology', 'Transportation']

... or for "party_status"

In [11]:
pd.Series(filtered_all["party_status"].unique())

0             -
1     Coalition
2    Opposition
dtype: object

... or for "party_orientation"

In [12]:
pd.Series(filtered_all["party_orientation"].unique())

0                                       Centre
1                                  Centre-left
2                           Right to far-right
3                        Centre-right to right
4                          Centre-left to left
5                                            -
6                           Syncretic politics
7                                 Centre-right
8                                        Right
9                       Centre to centre-right
10                                    Big tent
11                       Centre to centre-left
12                                        Left
13                            Left to far-left
14                                   Far-right
15               Centre to centre-right;Centre
16                    Centre-right;Centre-left
17         Centre-right;;Centre to centre-left
18          Centre-right;Centre to centre-left
19                                    Far-left
20                                Pirate Party
21           
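
If only the number of distinct values is of interest, rather than the values themselves, *.nunique()* can be used instead of *.unique()*. A minimal sketch on a made-up Series (the name *status* and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical column with one missing value.
status = pd.Series(["-", "Coalition", "Opposition", "Coalition", None])

# unique() returns the distinct values in order of first appearance;
# a missing value is listed as well.
values = status.unique()

# nunique() returns how many distinct values there are;
# missing values are ignored unless dropna=False is passed.
n_distinct = status.nunique()
n_distinct_with_missing = status.nunique(dropna=False)

print(values, n_distinct, n_distinct_with_missing)
```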

See how many times each **unique value** appears **in the "CAP_category" column** (excluding NaN by default)

In [13]:
filtered_all["CAP_category"].value_counts() # instead of "CAP_category", we could also count the values of "party_orientation", "speaker_role", etc.

CAP_category
Other                    1469743
Mix                       323632
Macroeconomics            292208
Law and Crime             276087
Government Operations     258561
Health                    220831
Civil Rights              213281
International Affairs     212455
Education                 147987
Labor                     130152
Agriculture               118269
Domestic Commerce         116287
Transportation            111285
Social Welfare            109840
Environment                92655
Immigration                90605
Defense                    87652
Housing                    86961
Energy                     78317
Technology                 43128
Culture                    29611
Foreign Trade              28426
Public Lands               27069
Name: count, dtype: int64
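
"Other" and "Mix" are by far the most frequent labels in this output. If an analysis should focus on the substantive policy topics only, one possible approach is to exclude these two labels and look at relative shares instead of raw counts. The sketch below uses a made-up Series (*cap*) in place of *filtered_all["CAP_category"]*.

```python
import pandas as pd

# Hypothetical stand-in for filtered_all["CAP_category"].
cap = pd.Series(
    ["Other", "Other", "Mix", "Health", "Health", "Energy"], dtype="category"
)

# Drop the two catch-all labels, then remove them from the category list
# so they no longer show up with a count of zero.
substantive = cap[~cap.isin(["Mix", "Other"])].cat.remove_unused_categories()

# normalize=True returns shares (summing to 1) instead of raw counts.
shares = substantive.value_counts(normalize=True)
print(shares)
```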

Check for **missing values**

In [14]:
filtered_all["CAP_category"].isnull().values.any()

np.False_
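
The same check can be extended to all columns at once. Note also that the "party_status" and "party_orientation" outputs above contain the value "-", which is an ordinary string and is therefore not counted as missing; treating it as a placeholder for unknown values is an assumption here. A minimal sketch on a made-up frame (*toy*):

```python
import pandas as pd

# Hypothetical toy data: one truly missing birth year, one "-" placeholder.
toy = pd.DataFrame({
    "CAP_category": ["Health", "Energy", "Health"],
    "party_status": ["-", "Coalition", "Opposition"],
    "speaker_birth": pd.array([1973, None, 1960], dtype="Int32"),
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# "-" is not detected by isna(), so it has to be counted explicitly.
placeholder_rows = (toy["party_status"] == "-").sum()
print(placeholder_rows)
```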