# Tutorial 1 - First steps with ParlaMint dataset

Welcome to this **hands-on tutorial exploring the ParlaMint dataset**, a rich multilingual corpus of European parliamentary debates. This tutorial is designed for researchers, data scientists and political science enthusiasts who want to analyze parliamentary proceedings across different European countries. 

The **ParlaMint dataset** is a comprehensive and multilingual corpus of transcribed speeches from different European parliaments. It serves as a valuable resource for computational social science and digital humanities research. 

At its core, ParlaMint contains the full text of parliamentary speeches annotated with rich metadata, including the speaker's role (e.g. *Member of Parliament*), political party affiliation, date of the session and the length of each speech. This foundational data was significantly expanded with **CAP categories** and **sentiment scores**.

First, the ParlaCAP component assigns each available speech segment to a policy domain from the **Comparative Agendas Project (CAP)**, e.g. healthcare, education or foreign affairs.

Second, the **ParlaSent** extension provides detailed **sentiment scores** for each segment, allowing researchers to analyze the emotional tone of the debates.

The dataset's power lies in its comparative design. It covers **multiple countries** and spans different time periods, enabling cross-parliament analysis. This makes ParlaMint very useful for **comparative political research**, allowing users to systematically study differences across nations, e.g. in political discourse or policy priorities. Besides that, its structure makes it ideal for technical applications like **topic modeling** and **sentiment analysis** on a large scale.

*Source: Erjavec, Tomaž; et al. (2025). Multilingual comparable corpora of parliamentary debates ParlaMint 5.0. Slovenian language resource repository CLARIN.SI. ISSN 2820-4042. http://hdl.handle.net/11356/2004*


**1. Setup**

Before we begin working with the data, we need to prepare our Python environment. This involves two steps: first, ensuring the needed packages are installed, and second, importing them into our session so we can use them.

**Package Installation**:

If you are running this code in a new environment (e.g. Google Colab, a new conda environment, or a Jupyter notebook), you may need to install some packages, like the 'pandas' library, first. You can typically do this by running the command below in a code cell. The exclamation mark ('!') is used in environments like Jupyter to pass the command to the system shell instead of running it as Python code.


In [None]:
# Uncomment and run the following line if you haven't installed pandas yet
# !pip install pandas

**Importing Libraries**:

The following lines of code import the necessary tools for our analysis. In this step, we *gather* all the tools that we need before starting our analysis. 'pandas' is one of our primary tools for handling and analyzing tabular data, 'Path' helps us navigate the computer's filesystem to find our dataset files, and 'csv' is a module we need to read, write and parse through tabular data that is written in the comma-separated values (csv) format. 

In [None]:
import pandas as pd
from pathlib import Path
import csv

**Note: Dataset size and memory management**

Before we load the data, it's important to understand its scale. We are working with 28 country-specific datasets, which together represent a very large corpus of parliamentary speech. The individual files in your directory vary in size. We can group them into four broad categories:
- **Small datasets (< 50 MB)**: Including countries like Sweden (SE) and Spain's regional parliaments (ES-CT, ES-GA, ES-PV).
- **Medium datasets (50 - 150 MB)**: This is the most common category, encompassing the majority of countries such as the Czech Republic (CZ), Denmark (DK), Greece (GR), and Poland (PL).
- **Large datasets (150 - 250 MB)**: This group includes larger parliamentary bodies like those of France (FR), the United Kingdom (GB) and the Netherlands (NL).
- **The largest dataset (> 350 MB)**: The corpus for Turkey (TR) is the single largest file at 370 MB.

**The combined size of all these files is over 2.5 GB**. However, when loaded into a pandas DataFrame in memory, this can expand significantly, potentially requiring 8-10 GB of RAM or more if loaded without smart filtering. To prevent your computer from crashing or slowing down, we need to work efficiently with the data and use these two important strategies:

1. **Optimized data types**: In the code below, we explicitly tell pandas whether a column contains text, categories or numbers. This significantly reduces the amount of RAM required. For example, storing a limited set of values like 'Coalition' and 'Opposition' as a 'category' is much more efficient than storing them as text strings.

2. **Chunked loading**: Instead of loading each dataset file all at once, we read it in smaller, manageable pieces (e.g. 50,000 rows at a time), process each piece (filtering and optimizing it) and then combine the results. This keeps peak memory use low, because only the rows that survive filtering are kept.

The code below implements these strategies to ensure our analysis can run smoothly on a standard laptop.
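As a small, self-contained illustration of the first strategy (separate from the loading script that follows), the sketch below compares the memory footprint of the same column stored as plain text and as a 'category'. The toy Series and its length are assumptions chosen for illustration; exact byte counts depend on your pandas version.

```python
import pandas as pd

# Toy column (hypothetical data): two distinct values repeated many times
status = pd.Series(["Coalition", "Opposition"] * 50_000)

# deep=True also counts the memory used by the string objects themselves
bytes_as_text = status.memory_usage(deep=True)
bytes_as_category = status.astype("category").memory_usage(deep=True)

print(f"as text:     {bytes_as_text:,} bytes")
print(f"as category: {bytes_as_category:,} bytes")
```

A 'category' column stores each distinct label once and keeps only small integer codes per row, which is why the saving grows with the number of repeated values.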

**2. Data Loading & Filtering**

The following code is structured in 5 main steps to load and filter the data efficiently:

**1. Increase CSV field size limit (preparation)**: Parliamentary speeches can be very long, so a single field may exceed Python's default maximum field size. This part raises Python's built-in limit to prevent errors when reading the files.

**2. & 3. Memory optimization**: Here, the code specifies exactly which columns of the datasets to load and defines the data types to reduce RAM usage.

**4. Loading & processing loop**: This is the core of the script. For each country in our list, the code **reads the large dataset file in small, manageable pieces (chunks of 50,000 rows) (4.1.)**. For each chunk, it adds a 'country' label, **filters speeches (4.2.)** to keep only those given by regular Members of Parliament (MPs), removing technical speakers, ministers and others to focus on core parliamentary debate, and **cleans the data (4.3.)** by removing rows where the policy topic (**CAP_category**) or sentiment category is missing.

**5. Combine the final datasets**: This part runs after all chunks from all countries have been processed and combines everything into a final master DataFrame *filtered_all*.

In [None]:
# ---- 1. First, we have to increase the CSV field size limit ----
max_int = 2**31 - 1
while True:
    try:
        csv.field_size_limit(max_int)
        break
    except OverflowError:
        max_int = max_int // 10

countries = ["AT", "BA", "BE", "BG", "CZ", "DK", "EE", "ES", "ES-CT", "ES-GA", "ES-PV",
             "FR", "GB", "GR", "HR", "HU", "IS", "IT", "LV",
             "NL", "NO", "PL", "PT", "RS", "SE", "SI", "TR", "UA"] #change country codes according to your available datasets

base_dir = Path().resolve()

# ---- 2. Choose what columns to read (including CAP and sentiment columns) ----
cols_to_keep = [
    "id", "date", "lang_code", "lang", "speaker_role", "speaker_MP",
    "speaker_minister", "speaker_party", "speaker_party_name", "party_status",
    "party_orientation", "speaker_id", "speaker_name", "speaker_gender",
    "speaker_birth", "word_count", "CAP_category", "sent3_category", "sent6_category", "sent_logit"
]

# ---- 3. Define dtypes to reduce memory ----
dtypes = {
    "id": str,
    "date": str,
    "lang_code": "category",
    "lang": "category",
    "speaker_role": "category",
    "speaker_MP": "category",
    "speaker_minister": "category",
    "speaker_party": "category",
    "speaker_party_name": "category",
    "party_status": "category",
    "party_orientation": "category",
    "speaker_id": "category",
    "speaker_name": "category",
    "speaker_gender": "category",
    "speaker_birth": "Int32",
    "word_count": "Int32",
    "CAP_category": "category",
    "sent3_category": "category",
    "sent6_category": "category",
    "sent_logit": "float32"
}

# ---- 4. Create lists to accumulate filtered chunks ----
all_chunks = []

for country in countries:
    file_path = base_dir / f"ParlaMint-{country}_processed_no_text.tsv"

    # --- 4.1. Read in chunks using pandas.read_csv ----
    for chunk in pd.read_csv(file_path, sep="\t", usecols=cols_to_keep,
                             dtype=dtypes, chunksize=50_000, engine="python"):
        chunk["country"] = country
        chunk["country"] = chunk["country"].astype("category")
       # chunk["CAP_category"] = chunk["CAP_category"].astype("category")

        # ---- 4.2. Filter MPs with regular role ----
        filtered_chunk = chunk.query("speaker_MP == 'MP' and speaker_role == 'Regular'")

        # ---- 4.3. Drop rows where CAP_category or sentiment is empty ----
        filtered_chunk = filtered_chunk[
            filtered_chunk["CAP_category"].notna() & (filtered_chunk["CAP_category"] != "") &
            filtered_chunk["sent3_category"].notna() & (filtered_chunk["sent3_category"] != "") &
            filtered_chunk["sent6_category"].notna() & (filtered_chunk["sent6_category"] != "")
        ]

        # ---- 4.4. Accumulate filtered chunks ----
        if not filtered_chunk.empty:
            all_chunks.append(filtered_chunk)

# ---- 5. Concatenate all accumulated chunks into DataFrames ----
filtered_all = pd.concat(all_chunks, ignore_index=True)
del all_chunks
print("All filtered:", filtered_all.shape)
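To see the read, filter and combine pattern from steps 4 and 5 in isolation, here is a minimal sketch that runs on a tiny in-memory table instead of the real files. The rows, the chunk size of 2 and the values 'notMP' and 'Chairperson' are hypothetical and only serve the illustration.

```python
import io
import pandas as pd

# Tiny stand-in for one country file (hypothetical rows, tab-separated)
toy_tsv = io.StringIO(
    "id\tspeaker_MP\tspeaker_role\tCAP_category\n"
    "s1\tMP\tRegular\tHealth\n"
    "s2\tnotMP\tRegular\tEducation\n"
    "s3\tMP\tChairperson\tHealth\n"
    "s4\tMP\tRegular\t\n"
    "s5\tMP\tRegular\tEducation\n"
)

kept_chunks = []
for chunk in pd.read_csv(toy_tsv, sep="\t",
                         dtype={"CAP_category": "category"}, chunksize=2):
    # Keep regular MPs only, then drop rows without a policy topic
    chunk = chunk.query("speaker_MP == 'MP' and speaker_role == 'Regular'")
    chunk = chunk[chunk["CAP_category"].notna()]
    if not chunk.empty:
        kept_chunks.append(chunk)

toy_filtered = pd.concat(kept_chunks, ignore_index=True)
print(toy_filtered["id"].tolist())
```

Only one small chunk is held in memory at a time, and only the surviving rows are stored. One thing to keep in mind: when chunks carry different sets of categories, 'pd.concat' can fall back to a plain text (object) column, so a column may need to be converted with '.astype("category")' again after combining.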

**2.1. Final filtering**

Before we begin exploring our dataset, we will make two final adjustments to the 'CAP_category' variable to ensure our analysis of policy topics will be clean and meaningful. 

First, we want to exclude the two catch-all categories, '**Mix**' (for speeches that cover multiple topics) and '**Other**' (for topics that don't fit into the main taxonomy). For many research questions, these categories are too vague to provide interpretable results. The line below filters our DataFrame to **remove** any speeches that have been classified into these two categories: '.isin()' marks the rows that belong to one of them, and '~' inverts that selection (it means 'not').

In [None]:
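filtered_all["CAP_category"] = filtered_all["CAP_category"].astype("category")
# .copy() makes the result an independent DataFrame, so later column assignments do not raise a SettingWithCopyWarning
filtered_all = filtered_all[~filtered_all["CAP_category"].isin(["Mix", "Other"])].copy()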


After this filtering, our dataset no longer contains these categories. However, pandas internally still remembers that 'Mix' and 'Other' are possible values for the 'CAP_category' column. This is like an archive having empty folders for documents you've already removed. To tidy this up and make our analysis more efficient, we use this command:

In [None]:
filtered_all["CAP_category"] = filtered_all["CAP_category"].cat.remove_unused_categories()
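The effect of these two steps is easiest to see on a toy column. The sketch below uses a hypothetical five-value Series; it shows that the filtered data still lists 'Mix' and 'Other' as categories until they are explicitly removed.

```python
import pandas as pd

# Toy categorical column (hypothetical values)
topics = pd.Series(["Health", "Mix", "Education", "Other", "Health"], dtype="category")

# Step 1: drop the catch-all categories ('~' inverts the isin() selection)
kept = topics[~topics.isin(["Mix", "Other"])]
print(list(kept.cat.categories))   # 'Mix' and 'Other' are still listed

# Step 2: remove categories that no longer occur in the data
tidy = kept.cat.remove_unused_categories()
print(list(tidy.cat.categories))   # only categories that are actually present
```

This matters later on, because methods such as '.value_counts()' report every category of a categorical column, including unused ones with a count of zero.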

**Extra: Applying filters to other variables**

*This same filtering can easily be adapted to focus your analysis on other key variables, such as gender, party affiliation or political orientation, by changing the column name and the values you wish to keep or exclude.*

In [None]:
# Note: .astype("category") ensures that the .cat accessor also works if a column was turned into plain text when the chunks were combined

# Example 1: Keep only speeches from female speakers
# filtered_all = filtered_all[filtered_all["speaker_gender"].isin(["female"])]
# filtered_all["speaker_gender"] = filtered_all["speaker_gender"].astype("category").cat.remove_unused_categories()

# Example 2: Focus analysis on a specific set of parties
# parties_to_keep = ["Social Democratic Party", "Green Party", "Conservative Party"]
# filtered_all = filtered_all[filtered_all["speaker_party_name"].isin(parties_to_keep)]
# filtered_all["speaker_party_name"] = filtered_all["speaker_party_name"].astype("category").cat.remove_unused_categories()

# Example 3: Exclude speeches from politically left parties (each value must be a separate list item)
# filtered_all = filtered_all[~filtered_all["party_orientation"].isin(["Left", "Far-left"])]
# filtered_all["party_orientation"] = filtered_all["party_orientation"].astype("category").cat.remove_unused_categories()

**3. Initial Data Exploration: Getting to know your data**

Now that we have loaded and filtered our data into the 'filtered_all' DataFrame (provided by the 'pandas' library), it's time to get acquainted with it. In traditional research, this would be like a first skim through a new archive box: checking which documents are inside, how many there are, and getting a general sense of the content before moving on to a deep analysis. In data science, we do this with simple methods that give us a high-level overview.

We start with the most basic question: **What does this data actually look like?**

**3.1. head()**

The '.head()' method is our go-to tool for this. It allows us to **peek at the first few rows** of our dataset. This shows us a sample of our actual data and the values in each column, and confirms that our data loaded correctly.

In [None]:
filtered_all.head(10)

When you run this code, you will see a neatly formatted table output. The header row shows the **names of all our columns** (like 'country', 'date', 'speaker_party', 'CAP_category') and below it, you will see the actual data for the first 10 parliamentary speeches.

Crucially, at the bottom of this output, you will see a line that says something like:

*10 rows × 21 columns*

This is a quick summary telling us that we are looking at 10 rows (speeches) and that each row is described by 21 different variables (columns) that we imported.

**3.2. shape**

Our next question is: **How much data are we working with?**

The '.shape' attribute answers this with a pair of numbers in this format: '(number_of_rows, number_of_columns)'. Note that '.shape' is an attribute, not a method, so it is written without parentheses.

In [None]:
filtered_all.shape

The output, e.g. (2754914, 21), would tell us that we are analyzing **2,754,914 rows (individual speeches)**, each described by **21 different variables**.
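Because '.shape' is an ordinary tuple, its two numbers can be unpacked and reused, for example in a printed summary. A minimal sketch on a hypothetical three-row DataFrame:

```python
import pandas as pd

# Toy DataFrame (hypothetical values)
toy = pd.DataFrame({"country": ["AT", "SE", "SE"], "word_count": [120, 45, 300]})

# .shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = toy.shape
print(f"{n_rows:,} rows and {n_cols} columns")
```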

**3.3. describe()**

The '.describe()' method provides a quick **statistical overview** of the data in the columns we name. 

In [None]:
filtered_all[["country", "CAP_category", "sent3_category"]].describe()


For each chosen column ('country', 'CAP_category' and 'sent3_category'), '.describe()' calculates summary statistics:
- **count**: number of entries
- **unique**: number of different values that exist per chosen column (e.g. the number of unique countries or policy topics).
- **top**: the most frequently occurring value (e.g. the most common policy topic)
- **freq**: number of times the top value appears
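To make these four statistics concrete, the sketch below runs '.describe()' on a hypothetical three-row table with categorical columns, where the values can be checked by eye.

```python
import pandas as pd

# Toy table (hypothetical values), stored as categories like the real columns
toy = pd.DataFrame({
    "country": ["AT", "AT", "SE"],
    "CAP_category": ["Health", "Education", "Health"],
}).astype("category")

summary = toy.describe()
print(summary)

# Individual statistics can be looked up by row label and column name
print(summary.loc["top", "country"], summary.loc["freq", "country"])
```

Here 'AT' is the top value for 'country' because it appears twice, and both columns have two unique values.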

**3.4. unique()**

The '.unique()' method extracts every distinct value that occurs in a column (e.g. 'CAP_category'), without duplicates or counts.

Wrapping the result in 'pd.Series()' displays it as an indexed, easy-to-read list.

In [None]:
pd.Series(filtered_all["CAP_category"].unique())

Here, we do the same but instead of 'CAP_category', we look at all unique values in the column 'party_status'.

In [None]:
pd.Series(filtered_all["party_status"].unique())

... or 'party_orientation'.

In [None]:
pd.Series(filtered_all["party_orientation"].unique())

**3.5. value_counts()**

To understand the data on a deeper level, we use '.value_counts()'. This method **counts how many times each unique value appears** in a column. Also, it shows us what values are the most common.

In [None]:
filtered_all["CAP_category"].value_counts() #instead of "CAP_category", we could also count the values of "party_orientation", "speaker_role", etc.

In the case of 'CAP_category', the command above answers the question: **"Which policy topics dominate the parliamentary agenda?"**. The output is a ranked list, showing the most frequently debated topics at the top. 

*The same method can be applied to other columns, such as 'party_orientation' to see the left-right balance of speeches.*
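When comparing columns or countries of very different sizes, shares are often easier to read than raw counts. As a small sketch on hypothetical data, passing 'normalize=True' to '.value_counts()' returns proportions instead of counts:

```python
import pandas as pd

# Toy column (hypothetical values)
topics = pd.Series(["Health", "Health", "Education", "Health"])

counts = topics.value_counts()                 # absolute frequencies, most common first
shares = topics.value_counts(normalize=True)   # same ranking, as proportions that sum to 1

print(counts)
print(shares.round(2))
```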

**3.6. Checking for missing values**

Before any analysis, we must verify that our key variables are complete. The '.isnull()' method checks each value in a column to see if it's *empty*. Combining it with '.values.any()' condenses the result into a single, clear answer.

In [None]:
filtered_all["CAP_category"].isnull().values.any()
#filtered_all["sent_logit"].isnull().values.any()
#filtered_all["word_count"].isnull().values.any()

This code answers a simple yes/no question: **"Are there any gaps in our policy topic data?"**. The output will be:
- **False**: Nice! This means there are **no missing values**; every speech has been assigned a policy topic.
- **True**: This would indicate that **at least one value is missing** and requires further investigation or reloading the notebook.
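If the answer is True, a natural follow-up is to ask how many values are missing and in which columns. A minimal sketch on a hypothetical table: '.isna()' (the same check as '.isnull()') followed by '.sum()' counts the gaps per column.

```python
import pandas as pd

# Toy table with deliberate gaps (hypothetical values)
toy = pd.DataFrame({
    "CAP_category": ["Health", None, "Education"],
    "sent_logit": [0.4, 1.2, None],
    "word_count": [120, 45, 300],
})

# True counts as 1 when summed, so this gives the number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```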

**Conclusion**

In this first tutorial, we took the essential steps toward working with the ParlaMint dataset. We learned how to:
- **Set up the environment** by installing and importing the necessary Python libraries
- **Load and filter the data efficiently**, using memory optimization techniques and chunked processing to handle very large files.
- **Clean and refine the dataset**
- **Explore the data** with foundational 'pandas' tools such as '.head()', '.shape', '.describe()', '.unique()' and '.value_counts()' to get a first overview of the structure and content

You now have a **clean, filtered dataset** of parliamentary debates that is ready for deeper analysis.

In the next tutorial (**Tutorial 2**), we will take this filtered dataset and explore **topic and sentiment distributions** in depth - visualizing which CAP categories dominate parliamentary debates or how sentiment varies by topic and country.