<a href="https://colab.research.google.com/github/ahmed-boutar/techniques-for-explainability/blob/main/explainable_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import os

# Remove Colab default sample_data
!rm -r ./sample_data

## Description

This notebook visualizes the embedding space generated by the e5-large-v2 embedding model, a pre-trained model featured on the MTEB leaderboard (source: [Hugging Face](https://huggingface.co/intfloat/e5-large-v2)).

We’ll be using a dataset of campaign speeches from the 2020 U.S. presidential election, which includes a variety of speech types—debates, rallies, official events, press interviews, etc.—by Joe Biden, Donald Trump, Kamala Harris, and Mike Pence.

The notebook, started on October 29, aims to explore whether there are discernible similarities in speech patterns within and between the politicians of each party. Our goal is to prompt interest and reflection on what might underlie these speech similarities. However, it’s worth noting that embedding models offer approximations rather than absolute truths; thus, the visualizations should be interpreted with caution. Since the model’s 512-token limit necessitates dividing each speech into chunks, the average embedding of all chunks for a speech is used to represent its overall embedding.

To visualize the embeddings, we will use t-SNE, PCA, and UMAP.

NOTE: It takes around 20 to 30 minutes to run the model on the given data

In [1]:
# Basic
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import os
import pandas as pd
import glob
import json

# Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
from torch import Tensor
import torch
from transformers import AutoTokenizer, AutoModel
import torch.nn.functional as F

# Embeddings
import gensim.downloader as api
from sentence_transformers import SentenceTransformer

## Model
To analyze the political speeches, we will use the e5-large-v2 embedding model, as it ranks well on the MTEB leaderboard for semantic similarity tasks. This model is well-suited for segmenting speeches into coherent, topical chunks, as the 512-token limit typically captures the main themes and rhetoric effectively. Additionally, it has demonstrated robust performance on similarity tasks, providing efficient processing and yielding cleaner results by focusing on specific contextual windows. However, there are a few limitations to consider with this model. The 512-token constraint requires us to chunk the speeches into smaller segments, which may result in the loss of cross-paragraph context. Lastly, it is important to note that embedding models performance declines as the token size increases. 

In [2]:
model_name = 'intfloat/e5-large-v2'

In [3]:
try:
    tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-large-v2')
    model = AutoModel.from_pretrained('intfloat/e5-large-v2')
    print("Successfully loaded the e5-large-v2 model")
except Exception as e:
    raise Exception(f"Error loading model: {str(e)}")

Successfully loaded the e5-large-v2 model


## Dataset

This dataset includes speeches from the 2020 U.S. presidential campaign by key Democratic and Republican candidates: Kamala Harris, Donald Trump, Joe Biden, and Mike Pence. Published on September 20, 2024, by Ioannis Chalkiadakis, Louise Anglès d'Auriac, Gareth W. Peters, and Divina Frau-Meigs, and licensed under MIT, the dataset contains 1,056 speeches spanning January 2019 to January 2021.

Reflecting the growing interest in analyzing political rhetoric through text-as-data methods, this dataset aims to support research on election outcomes, candidate appeal, ideologies, and campaign strategies. It stands out by providing speeches from varied settings—debates, rallies, official events, and interviews—while maintaining a structured approach to rhetorical consistency. This focus on speech structure enables more rigorous quantitative analysis, enhancing the dataset’s relevance for studies requiring detailed semantic or syntactical insights.

The dataset addresses gaps in existing collections, which are often limited by size and source variety. By offering a carefully curated but comprehensive text corpus, this resource supports high-quality, timely analysis of U.S. presidential elections.

Dataset source: [Figshare](https://figshare.com/articles/dataset/A_text_dataset_of_campaign_speeches_of_the_main_tickets_in_the_2020_US_presidential_election/26862064?file=48850534)

In [4]:
def extract_organization(filename: str) -> str:
    try:
        # Split on '*' and take the first part
        org = filename.split('_')[0]
        # Remove any path components and get just the filename
        org = os.path.basename(org)
        # Convert to lowercase for consistency
        return org.lower()
    except Exception:
        return "unknown"

In [5]:
def process_directory_to_dataframe(directory_path: str, output_csv_path: str) -> pd.DataFrame:
    """
    Process all JSONL files in directory and create a consolidated DataFrame
    
    Args:
        directory_path: Path to directory containing JSONL files
        output_csv_path: Path where the CSV should be saved
    
    Returns:
        pandas DataFrame containing all processed speeches
    """
    
    
    # Get all JSONL files in directory
    jsonl_files = glob.glob(os.path.join(directory_path, "*.jsonl"))
    if not jsonl_files:
        raise ValueError(f"No JSONL files found in {directory_path}")
    
    # Process all files
    all_speeches = []
    
    for jsonl_path in jsonl_files:
        try:
            # Extract organization from filename
            organization = extract_organization(os.path.basename(jsonl_path))
            
            print(f"Processing file: {jsonl_path}")
            with open(jsonl_path, 'r', encoding='utf-8') as file:
                for line_number, line in enumerate(file, 1):
                    try:
                        data = json.loads(line.strip())
                        # Extract text, ensuring it's not empty
                        text = data.get('RawText', '').strip()
                        
                        if not text:
                            continue
                        
                        # Extract other fields with default values
                        speech_data = {
                            'SpeechID': data.get('SpeechID'),
                            'Speaker': data.get('POTUS', 'Unknown'),
                            'Date': data.get('Date', 'Unknown'),
                            'SpeechType': str(data.get('Type')).lower(),
                            'Speech': text,
                            'Speech_length': len(text.split()),
                            'Organization': organization  # Add organization
                        }
                        
                        all_speeches.append(speech_data)
                        
                    except json.JSONDecodeError as e:
                        print(f"Error decoding JSON in file {jsonl_path}, line {line_number}: {str(e)}")
                        continue
                        
        except Exception as e:
            print(f"Error processing file {jsonl_path}: {str(e)}")
            continue
    
    if not all_speeches:
        raise ValueError("No valid speeches found in any JSONL files")
    
    # Convert to DataFrame
    df = pd.DataFrame(all_speeches)
    
    # Save to CSV
    df.to_csv(output_csv_path, index=False)
    print(f"Saved processed data to {output_csv_path}") 
    
    # Print summary statistics
    print("\nSummary:")
    print(f"Total speeches processed: {len(df)}")
    print("\nSpeeches per organization:")
    print(df['Organization'].value_counts())
    
    return df

In [6]:
df = process_directory_to_dataframe('./data/ElectionSpeeches/', './data/ElectionSpeeches/merged/election_speech_data.csv')

Processing file: ./data/ElectionSpeeches/votesmart_JoeBiden.jsonl
Processing file: ./data/ElectionSpeeches/millercenter_DonaldTrump.jsonl
Processing file: ./data/ElectionSpeeches/medium_KamalaHarris.jsonl
Processing file: ./data/ElectionSpeeches/votesmart_KamalaHarris.jsonl
Processing file: ./data/ElectionSpeeches/cspan_JoeBiden.jsonl
Error processing file ./data/ElectionSpeeches/cspan_JoeBiden.jsonl: 'NoneType' object has no attribute 'strip'
Processing file: ./data/ElectionSpeeches/votesmart_DonaldTrump.jsonl
Processing file: ./data/ElectionSpeeches/cspan_MikePence.jsonl
Error processing file ./data/ElectionSpeeches/cspan_MikePence.jsonl: 'NoneType' object has no attribute 'strip'
Processing file: ./data/ElectionSpeeches/cspan_KamalaHarris.jsonl
Error processing file ./data/ElectionSpeeches/cspan_KamalaHarris.jsonl: 'NoneType' object has no attribute 'strip'
Processing file: ./data/ElectionSpeeches/medium_JoeBiden.jsonl
Processing file: ./data/ElectionSpeeches/cspan_DonaldTrump.jsonl

Out of the 1056 speeches, there are 136 speeches that did not have a recorded value. We will focus our visualization on the 920 speeches that were recorded.

Add a political party affiliation to the dataframe 

In [7]:
df['Party'] = np.where(df['Speaker'].isin(['Joe Biden', 'Kamala Harris']), 'Democrat', 'Republican')

Let's visualize the dataframe columns. The organization column corresponds to the organization that broadcasted/relayed the political speech. 

In [8]:
df.columns

Index(['SpeechID', 'Speaker', 'Date', 'SpeechType', 'Speech', 'Speech_length',
       'Organization', 'Party'],
      dtype='object')

## Viualization

In [9]:
def average_pool(last_hidden_state, attention_mask):
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, 1)
    sum_mask = input_mask_expanded.sum(1)
    return sum_embeddings / sum_mask 

In [10]:
import numpy as np
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# This function was inspired by the code included in the e5-large-v2 page on Hugging Face (source included in the description)
def average_pool(last_hidden_state, attention_mask):
    """Apply average pooling to get embeddings, masking out padding tokens."""
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
    sum_embeddings = torch.sum(last_hidden_state * input_mask_expanded, 1)
    sum_mask = input_mask_expanded.sum(1)
    return sum_embeddings / sum_mask  # Average across the tokens

# This function was created with the help of Claude
def get_embeddings(model, tokenizer, texts, chunk_size=256, batch_size=32):
    def chunk_text(text):
        """Split text into chunks of roughly `chunk_size` words."""
        words = text.split()
        if len(words) <= chunk_size:
            return [text]
        
        chunks = []
        current_chunk = []
        current_size = 0
        
        sentences = text.split('. ')
        for sentence in sentences:
            sentence_words = sentence.split()
            if current_size + len(sentence_words) > chunk_size and current_chunk:
                chunks.append(' '.join(current_chunk))
                current_chunk = [sentence]
                current_size = len(sentence_words)
            else:
                current_chunk.append(sentence)
                current_size += len(sentence_words)
        
        if current_chunk:
            chunks.append(' '.join(current_chunk))
        
        return chunks

    all_embeddings = []
    
    try:
        for text in texts:
            # Split text into chunks if necessary
            chunks = chunk_text(text)
            chunk_embeddings = []
            
            # Process each chunk and get its embedding
            for chunk in chunks:
                inputs = tokenizer(chunk, max_length=256, padding='max_length', truncation=True, return_tensors='pt')
                output = model(**inputs)
                embeddings = average_pool(output.last_hidden_state, inputs['attention_mask'])
                embeddings = F.normalize(embeddings, p=2, dim=1)  # Normalize embeddings
                chunk_embeddings.append(embeddings.cpu().detach().numpy())
            
            # Average the chunk embeddings
            if len(chunk_embeddings) > 1:
                text_embedding = np.mean(chunk_embeddings, axis=0)
            else:
                text_embedding = chunk_embeddings[0]
            
            all_embeddings.append(text_embedding)
        
        return np.array(all_embeddings)
    
    except Exception as e:
        raise Exception(f"Error generating embeddings: {str(e)}")


We can call the function we defined that take care of creating a list of the embedding vectors, where each embedding vector corresponds to a speech. The embeddings are generated by first splitting a text to 256 tokens and processing it in chunks. The average of all embedding vectors of the different chunks is taken to represent the speech's embedding vector. 

In [11]:
# This code figures on the e5-large-v2 model page (source included in the description)
# Initialize model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-large-v2')
model = AutoModel.from_pretrained('intfloat/e5-large-v2')
texts = df['Speech'].tolist()
embeddings = get_embeddings(model, tokenizer, texts)
embeddings = embeddings.squeeze() #The output has an extra dimension so get rid of it here

### PCA Visualization

In [32]:
def visualize_pca(n_components, dimensions, df_main):
     # Apply PCA
    pca = PCA(n_components=n_components)
    embeddings_pca = pca.fit_transform(embeddings)

    # Add PCA components to dataframe
    df = df_main.copy()
    df['PC1'] = embeddings_pca[:, 0]
    df['PC2'] = embeddings_pca[:, 1]
    if dimensions == 3:
        df['PC3'] = embeddings_pca[:, 2]

    # Create plot
    if dimensions == 2:
        fig_pca = px.scatter(
            df, x='PC1', y='PC2',
            color='Party',
            color_discrete_map={"Republican": "red", "Democrat": "blue"},
            title="2D PCA of Speech Embeddings by Party Affiliation",
            hover_data=['Speaker', 'Date', 'Organization'],  # Add Speaker and Date to hover info
            labels={'PC1': 'Principal Component 1', 'PC2': 'Principal Component 2'},
        )
    else:
        fig_pca = px.scatter_3d(
            df, x='PC1', y='PC2', z='PC3',
            color='Party',
            color_discrete_map={"Republican": "red", "Democrat": "blue"},
            title="3D PCA of Speech Embeddings by Party Affiliation",
            hover_data=['Speaker', 'Date', 'Organization'],
            labels={'PC1': 'Principal Component 1', 
                    'PC2': 'Principal Component 2',
                    'PC3': 'Principal Component 3'}
        )

    fig_pca.update_traces(marker=dict(size=8))
    fig_pca.show()
    if (dimensions == 3):
        fig_pca.write_html('./visualizations/3d-pca-visuals.html')
    

Let's visualize PCA in 2-D and in 3-D

In [13]:
visualize_pca(3, 2, df)

In [33]:
visualize_pca(3, 3, df)

It is interesting to see how isolated the speech given by Donald Trump on Jan, 06, 2021, the day the Capitol Riot happened, which only shows up as an isolated instance only when we display a 3D plot

### T-SNE

In [25]:
def visualize_tsne(n_components, dimensions, df_main):
    # Apply t-SNE
    tsne = TSNE(n_components=n_components, perplexity=30, n_iter=300, random_state=42)
    embeddings_tsne = tsne.fit_transform(embeddings)

    # Add t-SNE components to dataframe
    df = df_main.copy()
    df['TSNE1'] = embeddings_tsne[:, 0]
    df['TSNE2'] = embeddings_tsne[:, 1]
    if dimensions == 3:
        df['TSNE3'] = embeddings_tsne[:, 2]

    # Create plot
    if dimensions == 2:
        fig_tsne = px.scatter(
            df, x='TSNE1', y='TSNE2',
            color='Party',
            color_discrete_map={"Republican": "red", "Democrat": "blue"},
            title="2D t-SNE of Speech Embeddings by Party Affiliation",
            hover_data=['Speaker', 'Date', 'Organization'],  # Add Speaker and Date to hover info
            labels={'TSNE1': 'Component 1', 'TSNE2': 'Component 2'}
        )
    else:
        fig_tsne = px.scatter_3d(
            df, x='TSNE1', y='TSNE2', z='TSNE3',
            color='Party',
            color_discrete_map={"Republican": "red", "Democrat": "blue"},
            title="3D t-SNE of Speech Embeddings by Party Affiliation",
            hover_data=['Speaker', 'Date', 'Organization'],
            labels={'TSNE1': 'Component 1', 
                    'TSNE2': 'Component 2',
                    'TSNE3': 'Component 3'}
        )

    fig_tsne.update_traces(marker=dict(size=8))
    fig_tsne.show()
    if (dimensions == 3):
        fig_tsne.write_html('./visualizations/3d-tsne-visuals.html')

In [26]:
visualize_tsne(3, 2, df)

In [27]:
visualize_tsne(3, 3, df)

### UMAP

In [29]:
def visualize_umap(n_components, dimensions, df_main):
    # Apply UMAP
    umap_reducer = umap.UMAP(
        n_neighbors=15,
        min_dist=0.1,
        n_components=n_components,
        random_state=42
    )
    embeddings_umap = umap_reducer.fit_transform(embeddings)

    # Add UMAP components to dataframe
    df = df_main.copy()
    df['UMAP1'] = embeddings_umap[:, 0]
    df['UMAP2'] = embeddings_umap[:, 1]
    if dimensions == 3:
        df['UMAP3'] = embeddings_umap[:, 2]

    # Create plot
    if dimensions == 2:
        fig_umap = px.scatter(
            df, x='UMAP1', y='UMAP2',
            color='Party',
            color_discrete_map={"Republican": "red", "Democrat": "blue"},
            title="2D UMAP of Speech Embeddings by Party Affiliation",
            hover_data=['Speaker', 'Date', 'Organization'],  # Add Speaker and Date to hover info
            labels={'UMAP1': 'Component 1', 'UMAP2': 'Component 2'}
        )
    else:
        fig_umap = px.scatter_3d(
            df, x='UMAP1', y='UMAP2', z='UMAP3',
            color='Party',
            color_discrete_map={"Republican": "red", "Democrat": "blue"},
            title="3D UMAP of Speech Embeddings by Party Affiliation",
            hover_data=['Speaker', 'Date', 'Organization'],
            labels={'UMAP1': 'Component 1', 
                    'UMAP2': 'Component 2',
                    'UMAP3': 'Component 3'}
        )

    fig_umap.update_traces(marker=dict(size=8))
    fig_umap.show()
    if (dimensions == 3):
        fig_umap.write_html('./visualizations/3d-umap-visuals.html')

In [30]:
visualize_umap(3, 2, df)


n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



In [31]:
visualize_umap(3, 3, df)


n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.



## Visualization Comparison

NOTE: You can visualize the 3-d scatter plots by opening the corresponding HTML file found under /visualizations

**UMAP Visualization:**

The UMAP visualization highlights strong clustering by party affiliation, indicating that the e5-large-v2 embeddings capture distinct rhetorical patterns between Democrats and Republicans. Republican speeches (shown in red) form a tighter, more cohesive cluster with higher Component 1 values. In contrast, Democratic speeches (in blue) display more dispersion but generally cluster towards lower Component 1. The overlapping area in the middle may reflect common political language or shared campaign themes. UMAP’s non-linear approach effectively captures both local and global structures, making it particularly suited for revealing overall rhetorical distinctions.

**t-SNE Visualization:**

The t-SNE visualization provides a more granular view of the speech embeddings, with the vertical spread (Component 2) possibly indicating different speech contexts, such as rallies, debates, or interviews. Republican speeches tend to cluster densely in the upper region, while Democratic speeches show broader dispersion, especially in the negative range of Component 2. Though the separation is less linear than in the UMAP visualization, t-SNE potentially uncovers more nuanced differences in speech patterns. Given that our e5-large-v2 model is set to process speeches in 256-token chunks, these clusters may correspond to specific rhetorical strategies or topics within each party’s discourse.

**PCA Visualization:**

PCA presents the most conservative but interpretable view of the speech data. The principal components likely capture major variations in vocabulary and rhetorical style, with Democratic speeches showing greater variance across both components, hinting at more diverse language use. Republican speeches, on the other hand, form a more compact cluster, suggesting consistent messaging. The diagonal trend may illustrate a spectrum from partisan-specific language to common political phrasing. Because PCA is a linear method, this visualization best represents fundamental differences in word choice and sentence structure between the two parties.

It is important to note how the 3-d models give us a different perspective of the datapoints, showing clusters within other clusters. When hovering over the datapoints, we can see that a lot of them are close in terms of dates.

It would be interesting to see if the location where the speech was given can provide an explanation to the similarities. We would need to manually add the location for each speech by first cross-referencing multiple sources to ensure accurate reporting of the location.