# Sampling Procedure for Manual Labeling
In this notebook, a random sample is drawn from the previously created ParlaMint_PT_interventions dataset. This dataset, created during the preprocessing stage, contains a total of 248577 interventions. The random sample for manual labeling should include 1500 interventions.

In [1]:
import pandas as pd
import numpy as np
from google.colab import drive
import re

### Import the data

In [2]:
#connect to google drive
drive_path = '/content/drive/MyDrive/Thesis/ParlaMint_PT_interventions.csv'

drive.mount('/content/drive')
print("Google Drive successfully accessed.")

#Import the data from google drive
print(f"Reading data from: {drive_path}")
try:
    df_raw = pd.read_csv(drive_path, encoding='utf-8')
except FileNotFoundError:
    print(f"ERROR: File not found at {drive_path}. Please check your path and filename.")
    raise  # stop here: df_raw would otherwise be undefined in the lines below
except UnicodeDecodeError:
    print("UTF-8 decoding failed. Trying 'latin1' encoding...")
    df_raw = pd.read_csv(drive_path, encoding='latin1')

print(f"Interventions full dataset size: {len(df_raw)}")

# Clean up column names by removing spaces and converting to lowercase
df_raw.columns = df_raw.columns.str.strip().str.lower()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Google Drive successfully accessed.
Reading data from: /content/drive/MyDrive/Thesis/ParlaMint_PT_interventions.csv
Interventions full dataset size: 248577


### Random Sampling and Empty Label Column for Manual Labeling

In [3]:
Intervention_sample_size = 1500

# Create random sample
df_final_labeling_set = df_raw.sample(
    n=Intervention_sample_size,
).reset_index(drop=True)

print(f"Random sample size: {len(df_final_labeling_set)}")

# Create an empty column for manual annotation
df_final_labeling_set['intervention_label'] = ''

Random sample size: 1500
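
The sampling cell above does not fix a seed, so re-running it draws a different set of 1500 interventions each time. If the labeling set needs to be reproducible, one option is to pass `random_state` to `DataFrame.sample`. The following is a minimal sketch under stated assumptions: `df_demo` is a hypothetical stand-in for `df_raw`, and the seed value `42` and the helper name `draw_labeling_sample` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for df_raw; the real notebook loads it from Google Drive.
df_demo = pd.DataFrame({
    "intervention_id": range(10_000),
    "text": [f"intervention {i}" for i in range(10_000)],
})

SAMPLE_SIZE = 1500
RANDOM_SEED = 42  # assumed value; any fixed integer gives a repeatable draw


def draw_labeling_sample(df, n, seed):
    """Draw n rows without replacement; the same seed returns the same rows."""
    sample = df.sample(n=n, random_state=seed).reset_index(drop=True)
    sample["intervention_label"] = ""  # empty column for manual annotation
    return sample


sample_a = draw_labeling_sample(df_demo, SAMPLE_SIZE, RANDOM_SEED)
sample_b = draw_labeling_sample(df_demo, SAMPLE_SIZE, RANDOM_SEED)

# Both draws contain the same rows in the same order.
print(sample_a.equals(sample_b))
```

Because `sample` draws without replacement by default, no intervention appears twice in the labeling set.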


In [4]:
# Columns to keep (this was done to remove extra columns not needed for the labeling process)
columns_for_labeling = ['speech_id','intervention_id','party','speaker_name','text','text_length','intervention_label']

# Create a copy with only these columns
df_final_labeling_set_reduced = df_final_labeling_set[columns_for_labeling].copy()
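
Since the column names were lowercased and stripped earlier, it can help to confirm that every expected column is present before subsetting. The sketch below is one possible check, not part of the original notebook: the helper name `select_labeling_columns` and the small demo frame are hypothetical.

```python
import pandas as pd


def select_labeling_columns(df, columns):
    """Return a copy of df restricted to columns, failing early if any are missing."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"Missing columns for labeling: {missing}")
    return df[columns].copy()


# Hypothetical demo frame with one column that is not needed for labeling.
df_demo = pd.DataFrame({
    "speech_id": ["s1", "s2"],
    "text": ["first intervention", "second intervention"],
    "extra_column": [1, 2],
    "intervention_label": ["", ""],
})

reduced = select_labeling_columns(df_demo, ["speech_id", "text", "intervention_label"])
print(list(reduced.columns))
```

Plain indexing with a list of names also raises a `KeyError` when a column is absent; the explicit check only makes the message list exactly which names are missing.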

### Final Output for Manual Labeling

In [5]:
# Save sample
output_path = '/content/drive/MyDrive/Thesis/intervention_sample_for_manual_labeling.xlsx'
# The path ends in .xlsx, so write a real Excel file (to_excel relies on openpyxl)
df_final_labeling_set_reduced.to_excel(output_path, index=False)

print(f"\nSample created for manual labeling: {len(df_final_labeling_set_reduced)} interventions.")
print(f"\nRandom sample is saved to: {output_path}")


Sample created for manual labeling: 1500 interventions.

Random sample is saved to: /content/drive/MyDrive/Thesis/intervention_sample_for_manual_labeling.xlsx
