# Zero-shot classification of political corruption in multilingual news using the BART model

This notebook applies zero-shot classification to news texts to determine whether they involve political corruption. Classification uses the BART model fine-tuned on MNLI (`facebook/bart-large-mnli`). Each text in the dataset is assigned one of two labels: `political corruption` or `no political corruption`. A second step classifies whether the article mentions an `identifiable victim` and, if so, `which type of victim`. The predicted labels and confidence scores are saved to CSV files for further analysis.
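
To make the mechanism concrete: an NLI-based zero-shot classifier turns each candidate label into a hypothesis by filling in a template, scores how strongly the text entails each hypothesis, and (in single-label mode) applies a softmax over the per-label entailment logits. The sketch below reproduces only that bookkeeping in plain Python. The template, labels, and logit values are made up for illustration, and no model is called.

```python
import math

def build_hypotheses(template, labels):
    """Fill the template once per candidate label."""
    return [template.format(label) for label in labels]

def softmax(logits):
    """Convert per-label entailment logits into scores that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical inputs for illustration only
template = "This text is about: {}."
labels = ["political corruption", "no political corruption"]
hypotheses = build_hypotheses(template, labels)

# Made-up entailment logits, one per hypothesis (not model output)
example_logits = [1.2, 1.0]
scores = softmax(example_logits)

print(hypotheses)
print([round(s, 3) for s in scores])
```

With only two labels, a small gap between the entailment logits yields scores close to 0.5, which is worth keeping in mind when reading the confidence scores.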

In [3]:
from tqdm import tqdm
from transformers import pipeline
import pandas as pd
import time

path_to_news = "/home/akroon/data/volume_2/RESPONDE/data/data_conbined/"
path_to_RESPOND_data = '/home/akroon/data/volume_2/RESPONDE/'

sample_df = pd.read_csv(f'{path_to_RESPOND_data}translated_sample_df.csv')



In [4]:

# Initialize the zero-shot classification pipeline.
# Note: facebook/bart-large-mnli was trained on English NLI data (MNLI); it is not a multilingual model.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Custom hypothesis template with context and definitions; {} is replaced by each candidate label
prompt = (
    "Analyze the following text to determine if it involves political corruption. "
    "Political corruption refers to actions by political actors such as politicians, public officials, or political parties "
    "that undermine democratic processes, distort decision-making, or involve undue influence, bribery, embezzlement, or nepotism. "
    "This text is about: {}."
)
# Candidate labels
candidate_labels = ["political corruption", "no political corruption"]

# Function to classify each text with detailed output
def classify_text(row):
    result = classifier(
        row["combined_text"],
        candidate_labels,
        hypothesis_template=prompt
    )
    # Extract the most probable label and its confidence score
    label = result["labels"][0]
    score = result["scores"][0]
    return pd.Series([label, score])

tqdm.pandas(desc="Classifying texts")

sample_df[["corruption_label", "corruption_score"]] = sample_df.progress_apply(classify_text, axis=1)

# Save the classification results
output_path = 'classified_sample_df.csv'
sample_df.to_csv(output_path, index=False)

# Preview results
print(sample_df[["combined_text", "corruption_label", "corruption_score"]].head())

Classifying texts: 100%|██████████| 900/900 [20:11<00:00,  1.35s/it]

                                       combined_text      corruption_label  \
0  "Нова демокрация" прекрати партийното членство...  political corruption   
1  Съдът намали паричната гаранция на Васил Божко...  political corruption   
2  "Равен мач" за Зеленски, но всъщност - победа ...  political corruption   
3  Трима задържани за измама с евросредства за зе...  political corruption   
4  Окончателно: Стайко Стайков ще се лекува под д...  political corruption   

   corruption_score  
0          0.518439  
1          0.603267  
2          0.546086  
3          0.503022  
4          0.689050  
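
The previewed scores sit between roughly 0.50 and 0.69, which for a two-label problem is close to an even split. One possible way to handle this, sketched below, is to mark predictions under a cutoff as uncertain rather than accepting the top label. The function name `flag_uncertain` and the 0.6 threshold are assumptions for illustration, not part of the original analysis.

```python
def flag_uncertain(label, score, threshold=0.6):
    """Return the label, or 'uncertain' when the score is below the threshold."""
    return label if score >= threshold else "uncertain"

# Scores copied from the preview above; all five rows had the same top label
preview_scores = [0.518439, 0.603267, 0.546086, 0.503022, 0.689050]
flagged = [flag_uncertain("political corruption", s) for s in preview_scores]
print(flagged)
```

On the DataFrame this could be applied row-wise, for example `sample_df.apply(lambda r: flag_uncertain(r["corruption_label"], r["corruption_score"]), axis=1)`.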





In [None]:
general_prompt = (
    "Analyze the following text to determine if it involves identifiable victims of corruption. "
    "An identifiable victim refers to a specific individual, group, or entity harmed directly by corrupt actions, "
    "such as embezzlement, bribery, nepotism, or abuse of power. This text involves: {}."
)

victim_type_prompt = (
    "Analyze the following text to identify the type of identifiable victim(s) harmed by corruption. "
    "Corruption involves harm through bribery, embezzlement, nepotism, abuse of power, or distortion of public services. "
    "The identifiable victim(s) in this text are best described as: {}."
)

general_labels = ["Identifiable victims of corruption", "No identifiable victims of corruption"]
victim_type_labels = ["Individual citizens", "Specific communities", "Organizations or institutions", "No identifiable victims"]

# Function to classify each text for victim presence and type
def classify_victims(row):
    text = row["combined_text"]
    
    # Step 1: Identify if there are victims
    general_result = classifier(text, general_labels, hypothesis_template=general_prompt)
    general_label = general_result["labels"][0]  # Most probable label
    general_score = general_result["scores"][0]  # Confidence score
    
    if general_label == "Identifiable victims of corruption":
        # Step 2: Classify the type of victim
        type_result = classifier(text, victim_type_labels, hypothesis_template=victim_type_prompt)
        victim_type_label = type_result["labels"][0]
        victim_type_score = type_result["scores"][0]
    else:
        # No victims detected: skip step 2 and record a placeholder type and score
        victim_type_label = "No identifiable victims"
        victim_type_score = 0.0

    return pd.Series([general_label, general_score, victim_type_label, victim_type_score])

# Apply classification with progress bar
tqdm.pandas(desc="Classifying texts for victim presence and type")
sample_df[["victim_label", "victim_score", "victim_type", "type_score"]] = sample_df.progress_apply(classify_victims, axis=1)

# Save results of victim classification
output_path = "classified_victims_sample_df.csv"
sample_df.to_csv(output_path, index=False)

print(sample_df[["combined_text", "victim_label", "victim_type", "victim_score", "type_score"]].head())

Classifying texts for victim presence and type:  96%|█████████▋| 868/900 [58:48<02:33,  4.80s/it]  
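
The victim classification calls the model up to twice per text, and the run shown takes close to an hour for 900 rows. For long runs like this, one option is to process rows in chunks and write intermediate results after each chunk, so that an interruption does not lose all progress. The helper below is a minimal sketch of the chunking step only; `chunked` and the chunk size of 100 are hypothetical names and values.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Stand-in for the 900 row positions of the sample; no model is called here
row_positions = list(range(900))
chunks = list(chunked(row_positions, 100))
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

Each chunk could then be classified and appended to a CSV file before moving on to the next one.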