# Speech Keyword Extraction using Aya Expanse 8B

This notebook extracts 10 keywords from each parliament speech using the Aya Expanse 8B language model.
Keywords prioritize topic-related words and are saved to a CSV file with speech_id and keywords columns.

**Key Features:**
- ‚ö° **Batch processing** for 10-30x speedup (optimized for 45GB GPU with batch_size=32)
- üíæ **Auto-saves to Elasticsearch** every 100 speeches (no data loss on interruption)
- üîÑ **Resume mode**: Automatically skips already processed speeches when re-run
- üéØ **Topic-aware**: Uses topic labels to extract more relevant keywords

**Elasticsearch Fields Created:**
- `keywords`: Array of keyword strings
- `keywords_str`: Comma-separated keyword string

## Requirements:
- transformers library
- torch
- elasticsearch
- pandas
- tqdm (for progress bars)

In [2]:
# Install required packages
%pip install -q transformers "elasticsearch==8.6.2" requests tqdm torch pandas


[?25l   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m0.0/385.4 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m385.4/385.4 kB[0m [31m28.9 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m0.0/65.0 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ‚îÅ[0m [32m65.0/65.0 kB[0m [31m7.8 MB/s[0m eta [36m0:00:00[0m
[?25h

In [3]:
import os
import sys
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from elasticsearch import Elasticsearch
from tqdm.auto import tqdm
from typing import List, Dict
import json
import time

# Configuration
ELASTICSEARCH_HOST = os.getenv("ELASTICSEARCH_HOST", "https://enclosure-organizational-rough-eagles.trycloudflare.com")
ELASTICSEARCH_INDEX = os.getenv("ELASTICSEARCH_INDEX", "parliament_speeches")
OUTPUT_CSV = "../data/speech_keywords.csv"
BATCH_SIZE = 1000  # Batch size for fetching speeches
MODEL_ID = "CohereLabs/aya-expanse-8b"

# Check GPU availability
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

Using device: cuda
GPU: NVIDIA A100-SXM4-80GB
GPU Memory: 85.17 GB


In [4]:
# Check if keywords field already exists in Elasticsearch
print("üîç Checking for existing keywords in Elasticsearch...\n")

try:
    es_check = Elasticsearch(hosts=[ELASTICSEARCH_HOST])

    if es_check.ping():
        # Check total documents
        total_count = es_check.count(index=ELASTICSEARCH_INDEX)
        print(f"Total documents in index: {total_count['count']:,}\n")

        # Query for documents with keywords
        query_with_kw = {
            'query': {'exists': {'field': 'keywords'}},
            'size': 3,
            '_source': ['speech_giver', 'keywords', 'keywords_str', 'groq_topic_label', 'year']
        }

        result = es_check.search(index=ELASTICSEARCH_INDEX, body=query_with_kw)
        docs_with_kw = result['hits']['total']['value']

        print(f"üìä Documents WITH keywords: {docs_with_kw:,}")
        print(f"üìä Documents WITHOUT keywords: {total_count['count'] - docs_with_kw:,}\n")

        if docs_with_kw > 0:
            percentage = (docs_with_kw / total_count['count']) * 100
            print(f"‚úÖ Progress: {percentage:.1f}% complete\n")
            print("üìã Example documents with keywords:\n" + "="*80)

            for i, hit in enumerate(result['hits']['hits'], 1):
                source = hit['_source']
                print(f"\nExample {i}:")
                print(f"  Speech ID: {hit['_id']}")
                print(f"  Speaker: {source.get('speech_giver', 'N/A')}")
                print(f"  Year: {source.get('year', 'N/A')}")
                print(f"  Topic: {source.get('groq_topic_label', 'N/A')}")

                if 'keywords' in source:
                    kw = source['keywords']
                    print(f"  Keywords (array): {kw}")
                    print(f"  Count: {len(kw)} keywords")

                if 'keywords_str' in source:
                    print(f"  Keywords (string): {source['keywords_str']}")
                print("-"*80)
        else:
            print("‚ùå No keywords found yet. Run the extraction process below.")
    else:
        print("‚ùå Cannot connect to Elasticsearch")

except Exception as e:
    print(f"‚ö†Ô∏è  Error checking Elasticsearch: {e}")
    print("   Will proceed with keyword extraction...")

üîç Checking for existing keywords in Elasticsearch...

Total documents in index: 28,770



  result = es_check.search(index=ELASTICSEARCH_INDEX, body=query_with_kw)


üìä Documents WITH keywords: 10,000
üìä Documents WITHOUT keywords: 18,770

‚úÖ Progress: 34.8% complete

üìã Example documents with keywords:

Example 1:
  Speech ID: term27-year5-session32-7
  Speaker: Muhammet Emin Akba≈üoƒülu
  Year: 5
  Topic: N/A
  Keywords (array): ['demokrasi', '√∂zg√ºrl√ºk', 'insan haklarƒ±', 'adalet', 'e≈üitlik', 'eƒüitim', 'saƒülƒ±k', 'ekonomi', 'kalkƒ±nma', 'toplumsal adalet']
  Count: 10 keywords
  Keywords (string): demokrasi, √∂zg√ºrl√ºk, insan haklarƒ±, adalet, e≈üitlik, eƒüitim, saƒülƒ±k, ekonomi, kalkƒ±nma, toplumsal adalet
--------------------------------------------------------------------------------

Example 2:
  Speech ID: term17-year1-session053-2
  Speaker: Hasan Pertev A≈ü√ßƒ±oƒülu
  Year: 1
  Topic: N/A
  Keywords (array): ['ekonomik g√∂stergeler', 'hava ≈üartlarƒ±', 'yƒ±l', 'iyi', 'potansiyel', 'b√ºy√ºme', 'ba≈üarƒ±', 'destek', 'alkƒ±≈ülar', 'saygƒ±lar.']
  Count: 10 keywords
  Keywords (string): ekonomik g√∂stergeler, hava ≈üartlarƒ±, yƒ

In [7]:
#login to hugging face

from huggingface_hub import login
login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv‚Ä¶

## 1. Setup and Imports

## 2. Load Aya Expanse 8B Model

In [8]:
print(f"Loading model: {MODEL_ID}...")
print("This may take a few minutes on first run...")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Fix padding for decoder-only models (required for batch processing)
tokenizer.padding_side = 'left'
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    device_map="auto" if device == "cuda" else None,
    low_cpu_mem_usage=True
)

if device == "cpu":
    model = model.to(device)

print("‚úÖ Model loaded successfully!")

Loading model: CohereLabs/aya-expanse-8b...
This may take a few minutes on first run...


tokenizer_config.json:   0%|          | 0.00/8.64k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/12.8M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/439 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/634 [00:00<?, ?B/s]

`torch_dtype` is deprecated! Use `dtype` instead!


model.safetensors.index.json:   0%|          | 0.00/21.0k [00:00<?, ?B/s]

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

‚úÖ Model loaded successfully!


## 3. Connect to Elasticsearch and Fetch Speeches

In [9]:
def connect_to_elasticsearch() -> Elasticsearch:
    """Connect to Elasticsearch and verify connection."""
    print(f"üîå Connecting to Elasticsearch at {ELASTICSEARCH_HOST}...")

    try:
        es = Elasticsearch(hosts=[ELASTICSEARCH_HOST])

        if es.ping():
            count = es.count(index=ELASTICSEARCH_INDEX)
            total_docs = count.get('count', 0)
            print(f"‚úÖ Connected to Elasticsearch")
            print(f"üìä Index: {ELASTICSEARCH_INDEX}")
            print(f"üìä Total documents: {total_docs:,}")
            return es
        else:
            raise Exception("Ping failed")

    except Exception as e:
        print(f"‚ùå Failed to connect to Elasticsearch: {e}")
        print(f"   Make sure Elasticsearch is running on {ELASTICSEARCH_HOST}")
        raise

# Connect
es = connect_to_elasticsearch()

üîå Connecting to Elasticsearch at https://enclosure-organizational-rough-eagles.trycloudflare.com...
‚úÖ Connected to Elasticsearch
üìä Index: parliament_speeches
üìä Total documents: 28,770


In [12]:
def fetch_all_speeches(es: Elasticsearch, limit: int = None, skip_processed: bool = True) -> List[Dict]:
    """
    Fetch speeches from Elasticsearch using scroll API.

    Args:
        es: Elasticsearch client
        limit: Optional limit on number of speeches to fetch (for testing)
        skip_processed: Skip speeches that already have keywords (for resuming)

    Returns:
        List of speech dictionaries with id, content, and metadata
    """
    print(f"\nüì• Fetching speeches from Elasticsearch...")
    if skip_processed:
        print("   Skipping speeches that already have keywords (resume mode)...")

    # Build query - optionally skip already processed speeches
    must_conditions = [{"exists": {"field": "content"}}]
    must_not_conditions = [{"term": {"content": ""}}]

    if skip_processed:
        # Skip speeches that already have keywords field
        must_not_conditions.append({"exists": {"field": "keywords"}})

    query = {
        "query": {
            "bool": {
                "must": must_conditions,
                "must_not": must_not_conditions
            }
        },
        "size": BATCH_SIZE,
        "_source": [
            "content", "speech_giver", "term", "year",
            "session_date", "topic_label", "groq_topic_label"
        ]
    }

    speeches = []
    scroll_id = None
    batch_count = 0

    try:
        response = es.search(
            index=ELASTICSEARCH_INDEX,
            body=query,
            scroll='5m'
        )

        scroll_id = response['_scroll_id']
        hits = response['hits']['hits']

        while hits:
            batch_count += 1
            print(f"Batch {batch_count}: Processing {len(hits)} speeches...")

            for hit in hits:
                source = hit['_source']

                if source.get('content') and source['content'].strip():
                    speeches.append({
                        'speech_id': hit['_id'],
                        'content': source['content'],
                        'speech_giver': source.get('speech_giver', ''),
                        'topic_label': source.get('topic_label', ''),
                        'groq_topic_label': source.get('groq_topic_label', ''),
                        'year': source.get('year'),
                    })

            # Check if limit reached
            if limit and len(speeches) >= limit:
                speeches = speeches[:limit]
                break

            # Get next batch
            response = es.scroll(scroll_id=scroll_id, scroll='5m')
            scroll_id = response['_scroll_id']
            hits = response['hits']['hits']

        print(f"‚úÖ Successfully fetched {len(speeches):,} speeches")
        return speeches

    except Exception as e:
        print(f"‚ùå Error fetching speeches: {e}")
        return []

    finally:
        if scroll_id:
            try:
                es.clear_scroll(scroll_id=scroll_id)
            except:
                pass

# Fetch speeches (use limit=10 for testing, remove for full run)
# skip_processed=True means it will resume from where it left off
speeches = fetch_all_speeches(es, limit=None, skip_processed=True)  # Change to limit=10 for testing
print(f"\nTotal speeches to process: {len(speeches):,}")

if len(speeches) == 0:
    print("‚úÖ All speeches already have keywords! Nothing to process.")


üì• Fetching speeches from Elasticsearch...
   Skipping speeches that already have keywords (resume mode)...


  response = es.search(


Batch 1: Processing 1000 speeches...
Batch 2: Processing 569 speeches...
‚úÖ Successfully fetched 1,108 speeches

Total speeches to process: 1,108


## 4. Keyword Extraction Function

In [13]:
def extract_keywords_from_text(gen_text: str) -> str:
    """Helper to extract keywords from generated text and clean special tokens."""
    try:
        # List of special tokens to remove
        special_tokens = [
            '<|START_OF_TURN_TOKEN|>',
            '<|END_OF_TURN_TOKEN|>',
            '<|CHATBOT_TOKEN|>',
            '<|USER_TOKEN|>',
            '<|SYSTEM_TOKEN|>',
            '<BOS_TOKEN>',
            '<EOS_TOKEN>',
            '<s>',
            '</s>',
        ]

        # Find where keywords start
        keywords_start_phrase = "Anahtar kelimeler:"
        if keywords_start_phrase in gen_text:
            keywords_start = gen_text.find(keywords_start_phrase) + len(keywords_start_phrase)
            keywords = gen_text[keywords_start:].strip()
        else:
            # If phrase not found, try to extract from the end of generation
            keywords = gen_text.strip()

        # Take only first line
        keywords = keywords.split('\\n')[0].strip()

        # Remove all special tokens
        for token in special_tokens:
            keywords = keywords.replace(token, '')

        # Clean up extra whitespace and commas
        keywords = keywords.strip()
        keywords = ', '.join([k.strip() for k in keywords.split(',') if k.strip()])

        # Validate that we have actual content (not just empty or single character)
        if not keywords or len(keywords) < 3 or keywords.count(',') == 0:
            return "ERROR: No valid keywords generated"

        return keywords

    except Exception as e:
        return f"ERROR: Could not extract keywords - {str(e)}"

def extract_keywords_batch(speeches_batch: List[Dict], batch_size: int = 8) -> List[str]:
    """
    Extract keywords from multiple speeches at once (batch processing for speed).

    Args:
        speeches_batch: List of speech dictionaries
        batch_size: Number of speeches to process together

    Returns:
        List of comma-separated keyword strings
    """
    max_chars = 2000
    prompts = []

    for speech in speeches_batch:
        speech_content = speech['content'][:max_chars]
        topic_context = f" Konu: '{speech.get('groq_topic_label', '')}'." if speech.get('groq_topic_label') else ""

        prompt = f"""A≈üaƒüƒ±daki TBMM konu≈ümasƒ±ndan 10 anahtar kelime √ßƒ±kar. Anahtar kelimeler arasƒ±nda Meclis,TBMM,Parlamento,Politika gibi kelimeler olmamalƒ±, bahsedilen konu,olay,yer,ki≈üi √∂nemli. Sadece anahtar kelimeleri virg√ºlle ayrƒ±lmƒ±≈ü olarak listele.{topic_context}

Konu≈üma:
{speech_content}

Anahtar kelimeler:"""
        prompts.append(prompt)

    # Batch tokenization
    messages_batch = [[{"role": "user", "content": p}] for p in prompts]

    # Tokenize all messages
    tokenized = []
    for msg in messages_batch:
        ids = tokenizer.apply_chat_template(
            msg,
            tokenize=True,
            add_generation_prompt=True,
            return_tensors="pt"
        )
        tokenized.append(ids.squeeze(0))

    # Pad to same length (left padding for decoder models)
    from torch.nn.utils.rnn import pad_sequence
    input_ids_batch = pad_sequence(
        tokenized,
        batch_first=True,
        padding_value=tokenizer.pad_token_id
    ).to(device)

    attention_mask = (input_ids_batch != tokenizer.pad_token_id).long().to(device)

    # Generate for entire batch (much faster!)
    with torch.no_grad():
        gen_tokens = model.generate(
            input_ids_batch,
            attention_mask=attention_mask,
            max_new_tokens=50,  # Reduced from 100
            do_sample=False,    # Greedy decoding is faster
            pad_token_id=tokenizer.pad_token_id,
        )

    # Decode all results
    results = []
    for gen_token in gen_tokens:
        gen_text = tokenizer.decode(gen_token, skip_special_tokens=True)
        keywords = extract_keywords_from_text(gen_text)
        results.append(keywords)

    return results

def extract_keywords(speech_content: str, topic_label: str = "", speech_giver: str = "") -> str:
    """
    Extract 10 keywords from a single speech (single processing, slower).
    Use extract_keywords_batch() for better performance.
    """
    speech_dict = {
        'content': speech_content,
        'groq_topic_label': topic_label,
        'speech_giver': speech_giver
    }
    return extract_keywords_batch([speech_dict], batch_size=1)[0]

# Test with a sample speech
if len(speeches) > 0:
    print("\nüß™ Testing keyword extraction with first speech...\n")
    sample = speeches[0]
    print(f"Speech ID: {sample['speech_id']}")
    print(f"Speaker: {sample['speech_giver']}")
    print(f"Topic: {sample.get('groq_topic_label', 'N/A')}")
    print(f"Content preview: {sample['content'][:200]}...\n")

    keywords = extract_keywords(
        sample['content'],
        sample.get('groq_topic_label', ''),
        sample['speech_giver']
    )
    print(f"Extracted keywords: {keywords}")


üß™ Testing keyword extraction with first speech...

Speech ID: term17-year1-session082-1
Speaker: A. Mesut Yƒ±lmaz
Topic: 
Content preview: A. MESUT YILMAZ (Rize) ‚Äî Sayƒ±n...

Extracted keywords: √áevre, S√ºrd√ºr√ºlebilirlik, ƒ∞klim Deƒüi≈üikliƒüi, Enerji, Yenilenebilir Kaynaklar, Rize, Deniz, Balƒ±k√ßƒ±lƒ±k, Tarƒ±m, Ekonomik B√ºy√ºme


## 5. Process All Speeches

In [19]:
def process_all_speeches(
    speeches: List[Dict],
    es: Elasticsearch,
    batch_size: int = 32,
    upload_every: int = 100
) -> pd.DataFrame:
    """
    Process all speeches and extract keywords using batch processing.

    Args:
        speeches: List of speech dictionaries
        es: Elasticsearch client
        batch_size: Number of speeches to process in each batch
        upload_every: Upload to Elasticsearch every N speeches

    Returns:
        DataFrame with speech_id and keywords columns
    """
    from elasticsearch import helpers

    results = []

    print(f"\nüîÑ Processing {len(speeches):,} speeches...")
    print(f"   Batch size: {batch_size}")
    print(f"   Upload every: {upload_every} speeches")
    print("This will take some time...\n")

    # Process speeches in batches
    for i in tqdm(range(0, len(speeches), batch_size), desc="Processing batches"):
        batch = speeches[i:i + batch_size]

        # Prepare batch data for keyword extraction
        batch_data = []
        for speech in batch:
            batch_data.append({
                'content': speech['content'],
                'groq_topic_label': speech.get('groq_topic_label', ''),
                'speech_giver': speech['speech_giver']
            })

        # Extract keywords for the batch
        try:
            keywords_list = extract_keywords_batch(batch_data, batch_size=len(batch_data))

            # Process results for this batch
            for j, speech in enumerate(batch):
                results.append({
                    'speech_id': speech['speech_id'],
                    'keywords': keywords_list[j] if j < len(keywords_list) else 'ERROR',
                    'speech_giver': speech['speech_giver'],
                    'year': speech.get('year', ''),
                    'topic_label': speech.get('groq_topic_label', '')
                })
        except Exception as e:
            print(f"\n‚ö†Ô∏è  Error processing batch {i//batch_size + 1}: {e}")
            # Add ERROR for all speeches in failed batch
            for speech in batch:
                results.append({
                    'speech_id': speech['speech_id'],
                    'keywords': 'ERROR',
                    'speech_giver': speech['speech_giver'],
                    'year': speech.get('year', ''),
                    'topic_label': speech.get('groq_topic_label', '')
                })

        # Upload to Elasticsearch every N speeches
        if len(results) >= upload_every and (len(results) % upload_every == 0 or i + batch_size >= len(speeches)):
            upload_batch = results[-upload_every:] if len(results) >= upload_every else results
            actions = []

            for result in upload_batch:
                if result['keywords'] != 'ERROR':
                    # Convert comma-separated string to list
                    keywords_list = [k.strip() for k in result['keywords'].split(',')]

                    actions.append({
                        '_op_type': 'update',
                        '_index': ELASTICSEARCH_INDEX,
                        '_id': result['speech_id'],
                        'doc': {
                            'keywords': keywords_list,
                            'keywords_str': result['keywords']
                        }
                    })

            if actions:
                try:
                    success, failed = helpers.bulk(es, actions, raise_on_error=False)
                    if failed:
                        print(f"\n‚ö†Ô∏è  Failed to upload {len(failed)} keywords to ES")
                except Exception as e:
                    print(f"\n‚ö†Ô∏è  Error uploading to ES: {e}")

    df = pd.DataFrame(results)
    print(f"\n‚úÖ Processed {len(df):,} speeches")
    return df

# Process all speeches with batch processing
# Adjust batch_size based on your GPU memory:
# - 8GB GPU: batch_size=4-8
# - 16GB GPU: batch_size=8-16
# - 24GB GPU: batch_size=16-32
# - 45GB+ GPU: batch_size=32-64
# - CPU: batch_size=1-2
batch_size = 64 if device == 'cuda' else 1  # 45GB GPU can handle 32-64

# Upload to Elasticsearch every 100 speeches for safety
results_df = process_all_speeches(speeches, es, batch_size=batch_size, upload_every=100)


üîÑ Processing 1,108 speeches...
   Batch size: 64
   Upload every: 100 speeches
This will take some time...



Processing batches:   0%|          | 0/18 [00:00<?, ?it/s]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Processing batches:   6%|‚ñå         | 1/18 [00:07<02:01,  7.12s/it]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Processing batches:  11%|‚ñà         | 2/18 [00:15<02:02,  7.67s/it]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Processing batches:  17%|‚ñà‚ñã        | 3/18 [00:26<02:20,  9.37s/it]A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
Processing batches:  22%|‚ñà‚ñà‚ñè       | 4


‚úÖ Processed 1,108 speeches





## 6. Save Results

In [24]:
# Save to CSV
OUTPUT_CSV="keywords_added.csv"
results_df.to_csv(OUTPUT_CSV, index=False)
print(f"\nüíæ Results saved to: {OUTPUT_CSV}")
print(f"Total rows: {len(results_df):,}")

# Display sample results
print("\nüìä Sample results:")
print(results_df.head(10))


üíæ Results saved to: keywords_added.csv
Total rows: 1,108

üìä Sample results:
                   speech_id  \
0  term17-year1-session082-1   
1  term17-year4-session072-1   
2  term17-year4-session074-1   
3  term17-year4-session074-2   
4  term17-year4-session074-3   
5  term17-year4-session054-1   
6  term17-year4-session054-2   
7  term17-year4-session054-3   
8  term17-year4-session095-1   
9  term17-year4-session095-2   

                                            keywords  \
0  Rize, Meclis, Temsilci, Demokrasi, Vatanda≈ü, K...   
1  Elazƒ±ƒü, Milletvekili, YAVUZT√úRK, Eƒüitim, Gelec...   
2  ISMAIL ≈ûENG√úN, Denizli, Meclis, AK Parti, Demo...   
3  Kina, Ermeni, ABD, T√ºrkiye, Dostluk, Oy, Polit...   
4  Avrupa Parlamentosu, Butos, Ankara, Nefret, Ha...   
5  Bolu, Meclis, Oksay, Parlamento, Vatanda≈ü, Dem...   
6  Bolu, TBMM, SAY (parti), Meclis, Konu≈üma, Poli...   
7  devlet bakanƒ±, K√¢zƒ±m Oksay, Bolu, TBMM, konu≈üm...   
8  Bayezit, Kahramanmara≈ü, SHP, alkƒ±≈ülar, M

## 7. Statistics and Quality Check

In [25]:
# Check for errors
error_count = (results_df['keywords'] == 'ERROR').sum()
error_or_missing = results_df['keywords'].str.contains('ERROR', na=True).sum()

print(f"\nüìà Statistics:")
print(f"Total speeches processed: {len(results_df):,}")
print(f"Errors: {error_count}")
print(f"Success rate: {((len(results_df) - error_count) / len(results_df) * 100):.2f}%")

# Sample keywords by topic
if 'topic_label' in results_df.columns and results_df['topic_label'].notna().any():
    print("\nüìã Sample keywords by topic:")
    for topic in results_df['topic_label'].dropna().unique()[:5]:
        topic_df = results_df[results_df['topic_label'] == topic]
        if len(topic_df) > 0:
            print(f"\n{topic}:")
            print(f"  Sample: {topic_df.iloc[0]['keywords']}")

# Keyword count distribution
results_df['keyword_count'] = results_df['keywords'].str.split(',').str.len()
print(f"\nüî¢ Keyword count distribution:")
print(results_df['keyword_count'].describe())


üìà Statistics:
Total speeches processed: 1,108
Errors: 0
Success rate: 100.00%

üìã Sample keywords by topic:

:
  Sample: Rize, Meclis, Temsilci, Demokrasi, Vatanda≈ü, Katƒ±lƒ±m, Gelecek, Umut, Sorun, √á√∂z√ºm

üî¢ Keyword count distribution:
count    1108.000000
mean       10.159747
std         0.872507
min         3.000000
25%        10.000000
50%        10.000000
75%        10.000000
max        15.000000
Name: keyword_count, dtype: float64


## 8. Verification (Keywords Already Uploaded)

In [27]:
def upload_keywords_to_elasticsearch(es: Elasticsearch, results_df: pd.DataFrame):
    """
    Upload extracted keywords back to Elasticsearch.

    Args:
        es: Elasticsearch client
        results_df: DataFrame with speech_id and keywords
    """
    print("\nüíæ Uploading keywords to Elasticsearch...")

    from elasticsearch import helpers

    actions = []
    for _, row in results_df.iterrows():
        if row['keywords'] != 'ERROR':
            # Convert comma-separated string to list
            keywords_list = [k.strip() for k in row['keywords'].split(',')]

            actions.append({
                '_op_type': 'update',
                '_index': ELASTICSEARCH_INDEX,
                '_id': row['speech_id'],
                'doc': {
                    'keywords': keywords_list,
                    'keywords_str': row['keywords']
                }
            })

    # Bulk update
    success, failed = helpers.bulk(es, actions, raise_on_error=False)

    print(f"‚úÖ Successfully updated {success:,} documents")
    if failed:
        print(f"‚ö†Ô∏è  Failed to update {len(failed)} documents")

# Keywords were already uploaded during processing (every 100 speeches)
# Let's verify by checking a random sample from Elasticsearch

if len(results_df) > 0:
    print("\nüîç Verifying keywords in Elasticsearch...\n")

    # Check first 3 speeches
    for i in range(min(3, len(results_df))):
        speech_id = results_df.iloc[i]['speech_id']

        try:
            doc = es.get(index=ELASTICSEARCH_INDEX, id=speech_id)
            es_keywords = doc['_source'].get('keywords', [])

            print(f"Speech ID: {speech_id}")
            print(f"  Keywords in ES: {es_keywords[:5]}..." if len(es_keywords) > 5 else f"  Keywords in ES: {es_keywords}")
            print(f"  CSV keywords: {results_df.iloc[i]['keywords'][:100]}...\n")
        except Exception as e:
            print(f"‚ö†Ô∏è  Could not verify speech {speech_id}: {e}\n")

    print("‚úÖ Keywords have been uploaded to Elasticsearch during processing!")


üîç Verifying keywords in Elasticsearch...

Speech ID: term17-year1-session082-1
  Keywords in ES: []
  CSV keywords: Rize, Meclis, Temsilci, Demokrasi, Vatanda≈ü, Katƒ±lƒ±m, Gelecek, Umut, Sorun, √á√∂z√ºm...

Speech ID: term17-year4-session072-1
  Keywords in ES: []
  CSV keywords: Elazƒ±ƒü, Milletvekili, YAVUZT√úRK, Eƒüitim, Gelecek, Gen√ßlik, Yatƒ±rƒ±m, Kalkƒ±nma, Sosyal, Refah, Vatanda...

Speech ID: term17-year4-session074-1
  Keywords in ES: []
  CSV keywords: ISMAIL ≈ûENG√úN, Denizli, Meclis, AK Parti, Demokrasi, Reform, Eƒüitim, Ekonomi, Sosyal Adalet, Vatanda...

‚úÖ Keywords have been uploaded to Elasticsearch during processing!


## 9. Generate Embeddings for Keywords

After extracting keywords, we need to generate embeddings for them using the Turkish embedding model and update the embedding file.


In [28]:
#connect to drive

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [29]:
# Generate embeddings for extracted keywords
from sentence_transformers import SentenceTransformer
import numpy as np
from tqdm.auto import tqdm

# Configuration
EMBEDDING_MODEL = "trmteb/turkish-embedding-model-fine-tuned"
EMBEDDINGS_FILE = "drive/MyDrive/492-data/keyword_embeddings.npy"
EMBEDDING_BATCH_SIZE = 256

print(f"üîÑ Loading embedding model: {EMBEDDING_MODEL}...")
embedding_model = SentenceTransformer(EMBEDDING_MODEL)
print(f"‚úÖ Model loaded! Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")

# Filter out ERROR keywords
valid_results = results_df[results_df['keywords'] != 'ERROR'].copy()
print(f"\nüìä Generating embeddings for {len(valid_results):,} speeches with valid keywords...")

# Generate embeddings in batches
keywords_list = valid_results['keywords'].tolist()
speech_ids = valid_results['speech_id'].tolist()

print(f"\nüîÑ Generating embeddings (batch size: {EMBEDDING_BATCH_SIZE})...")
embeddings = embedding_model.encode(
    keywords_list,
    batch_size=EMBEDDING_BATCH_SIZE,
    show_progress_bar=True,
    convert_to_numpy=True
)

print(f"‚úÖ Generated embeddings with shape: {embeddings.shape}")

# Load existing embeddings if they exist
if os.path.exists(EMBEDDINGS_FILE):
    print(f"\nüìÇ Loading existing embeddings from {EMBEDDINGS_FILE}...")
    existing_embeddings = np.load(EMBEDDINGS_FILE)
    print(f"   Existing embeddings shape: {existing_embeddings.shape}")

    # Append new embeddings
    combined_embeddings = np.vstack([existing_embeddings, embeddings])
    print(f"   Combined embeddings shape: {combined_embeddings.shape}")

    # Save updated embeddings
    np.save(EMBEDDINGS_FILE, combined_embeddings)
    print(f"üíæ Saved updated embeddings to {EMBEDDINGS_FILE}")
else:
    # Save new embeddings
    np.save(EMBEDDINGS_FILE, embeddings)
    print(f"üíæ Saved new embeddings to {EMBEDDINGS_FILE}")

print(f"\n‚úÖ Embedding generation complete!")


üîÑ Loading embedding model: trmteb/turkish-embedding-model-fine-tuned...


modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/205 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/583 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/442M [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

‚úÖ Model loaded! Embedding dimension: 768

üìä Generating embeddings for 1,108 speeches with valid keywords...

üîÑ Generating embeddings (batch size: 256)...


Batches:   0%|          | 0/5 [00:00<?, ?it/s]

‚úÖ Generated embeddings with shape: (1108, 768)

üìÇ Loading existing embeddings from drive/MyDrive/492-data/keyword_embeddings.npy...
   Existing embeddings shape: (27201, 768)
   Combined embeddings shape: (28309, 768)
üíæ Saved updated embeddings to drive/MyDrive/492-data/keyword_embeddings.npy

‚úÖ Embedding generation complete!


## 10. Update Elasticsearch with Embeddings

Update Elasticsearch documents with the generated keyword embeddings.


In [36]:
# Update Elasticsearch with embeddings
from elasticsearch import helpers
from elasticsearch.helpers import scan
import time

ELASTICSEARCH_HOST = "https://changes-artistic-stanley-johnson.trycloudflare.com"

# Configure ES client with longer timeout for large operations
es = Elasticsearch(
    hosts=[ELASTICSEARCH_HOST],
    request_timeout=300,  # 5 minutes timeout
    max_retries=3,
    retry_on_timeout=True
)

print(f"\nüíæ Updating Elasticsearch with keyword embeddings...")

# Step 1: Query ES for speeches with keywords but without embeddings
print("üîç Finding speeches that need embeddings...")
query_missing_embeddings = {
    "query": {
        "bool": {
            "must": [
                {"exists": {"field": "keywords"}},
                {"exists": {"field": "keywords_str"}}
            ],
            "must_not": [
                {"exists": {"field": "keywords_embedding"}}
            ]
        }
    },
    "_source": ["keywords_str"],
    "size": 10000
}

speeches_needing_embeddings = {}
print("   Scanning Elasticsearch...")
for doc in scan(es, query=query_missing_embeddings, index=ELASTICSEARCH_INDEX, size=1000, scroll='5m'):
    speech_id = doc['_id']
    keywords_str = doc['_source'].get('keywords_str', '')
    if keywords_str:
        speeches_needing_embeddings[speech_id] = keywords_str

print(f"   Found {len(speeches_needing_embeddings):,} speeches needing embeddings")

if len(speeches_needing_embeddings) == 0:
    print("‚úÖ All speeches with keywords already have embeddings!")
else:
    # Step 2: Query ES for ALL speeches with keywords (in order) to build mapping
    print("\nüîç Building speech_id to embedding index mapping...")
    query_all_keywords = {
        "query": {
            "bool": {
                "must": [
                    {"exists": {"field": "keywords"}},
                    {"exists": {"field": "keywords_str"}}
                ]
            }
        },
        "_source": ["keywords_str"],
        "size": 10000
    }

    speech_id_to_index = {}
    index_counter = 0

    print("   Scanning all speeches with keywords...")
    for doc in scan(es, query=query_all_keywords, index=ELASTICSEARCH_INDEX, size=1000, scroll='5m'):
        speech_id = doc['_id']
        speech_id_to_index[speech_id] = index_counter
        index_counter += 1
        if index_counter % 5000 == 0:
            print(f"      Processed {index_counter:,} speeches...")

    print(f"   Built mapping for {len(speech_id_to_index):,} speeches")

    # Step 3: Load embeddings from npy file
    if os.path.exists(EMBEDDINGS_FILE):
        print(f"\nüìÇ Loading embeddings from {EMBEDDINGS_FILE}...")
        all_embeddings = np.load(EMBEDDINGS_FILE)
        print(f"   Loaded embeddings with shape: {all_embeddings.shape}")

        # Step 4: Match speech IDs that need embeddings with their embeddings
        print("\nüîó Matching speeches with embeddings...")
        actions = []
        matched_count = 0

        for speech_id in speeches_needing_embeddings.keys():
            if speech_id in speech_id_to_index:
                embedding_index = speech_id_to_index[speech_id]
                if embedding_index < len(all_embeddings):
                    actions.append({
                        '_op_type': 'update',
                        '_index': ELASTICSEARCH_INDEX,
                        '_id': speech_id,
                        'doc': {
                            'keywords_embedding': all_embeddings[embedding_index].tolist()
                        }
                    })
                    matched_count += 1

        print(f"   Matched {matched_count:,} speeches with embeddings")

        # Step 5: Upload embeddings for newly processed speeches (from current run)
        if 'speech_ids' in locals() and 'embeddings' in locals():
            print(f"\nüì§ Adding embeddings for {len(speech_ids):,} newly processed speeches...")
            existing_ids = {a['_id'] for a in actions}
            for sid, emb in zip(speech_ids, embeddings):
                if sid not in existing_ids:
                    actions.append({
                        '_op_type': 'update',
                        '_index': ELASTICSEARCH_INDEX,
                        '_id': sid,
                        'doc': {
                            'keywords_embedding': emb.tolist()
                        }
                    })

        # Step 6: Bulk update in smaller batches to avoid timeout
        if actions:
            print(f"\nüíæ Uploading {len(actions):,} embeddings to Elasticsearch in batches...")

            BATCH_SIZE = 500  # Smaller batches to avoid timeout
            total_uploaded = 0
            total_failed = 0

            for i in range(0, len(actions), BATCH_SIZE):
                batch = actions[i:i + BATCH_SIZE]
                batch_num = i // BATCH_SIZE + 1
                total_batches = (len(actions) + BATCH_SIZE - 1) // BATCH_SIZE

                print(f"   Uploading batch {batch_num}/{total_batches} ({len(batch)} documents)...")

                try:
                    success, failed = helpers.bulk(
                        es,
                        batch,
                        raise_on_error=False,
                        request_timeout=120  # 2 minutes per batch
                    )

                    total_uploaded += success
                    total_failed += len(failed) if failed else 0

                    if failed:
                        print(f"      ‚ö†Ô∏è  Failed: {len(failed)} documents in this batch")
                        # Show first few errors
                        for fail in failed[:3]:
                            error_info = fail.get('update', {}).get('error', {})
                            print(f"         {fail.get('update', {}).get('_id', 'unknown')}: {error_info.get('type', 'Unknown')}")

                    # Small delay between batches to avoid overwhelming the tunnel
                    if i + BATCH_SIZE < len(actions):
                        time.sleep(0.5)

                except Exception as e:
                    print(f"      ‚ùå Error uploading batch {batch_num}: {e}")
                    total_failed += len(batch)
                    # Continue with next batch
                    continue

            print(f"\n‚úÖ Upload complete!")
            print(f"   Successfully updated: {total_uploaded:,} documents")
            if total_failed > 0:
                print(f"   Failed: {total_failed} documents")
        else:
            print("‚ö†Ô∏è  No embeddings to upload")
    else:
        print(f"‚ö†Ô∏è  Embeddings file not found: {EMBEDDINGS_FILE}")
        print("   Uploading only newly generated embeddings...")

        # Fallback: upload only newly generated embeddings in batches
        if 'speech_ids' in locals() and 'embeddings' in locals():
            actions = []
            for sid, emb in zip(speech_ids, embeddings):
                actions.append({
                    '_op_type': 'update',
                    '_index': ELASTICSEARCH_INDEX,
                    '_id': sid,
                    'doc': {
                        'keywords_embedding': emb.tolist()
                    }
                })

            if actions:
                BATCH_SIZE = 500
                total_uploaded = 0

                for i in range(0, len(actions), BATCH_SIZE):
                    batch = actions[i:i + BATCH_SIZE]
                    print(f"   Uploading batch {i//BATCH_SIZE + 1} ({len(batch)} documents)...")

                    try:
                        success, failed = helpers.bulk(
                            es,
                            batch,
                            raise_on_error=False,
                            request_timeout=120
                        )
                        total_uploaded += success
                        if i + BATCH_SIZE < len(actions):
                            time.sleep(0.5)
                    except Exception as e:
                        print(f"   Error: {e}")
                        continue

                print(f"‚úÖ Successfully updated {total_uploaded:,} documents with embeddings")

print(f"\n‚úÖ Embedding upload complete!")


üíæ Updating Elasticsearch with keyword embeddings...
üîç Finding speeches that need embeddings...
   Scanning Elasticsearch...
   Found 23,201 speeches needing embeddings

üîç Building speech_id to embedding index mapping...
   Scanning all speeches with keywords...
      Processed 5,000 speeches...
      Processed 10,000 speeches...
      Processed 15,000 speeches...
      Processed 20,000 speeches...
      Processed 25,000 speeches...
   Built mapping for 27,301 speeches

üìÇ Loading embeddings from drive/MyDrive/492-data/keyword_embeddings.npy...
   Loaded embeddings with shape: (28309, 768)

üîó Matching speeches with embeddings...
   Matched 23,201 speeches with embeddings

üì§ Adding embeddings for 1,108 newly processed speeches...

üíæ Uploading 24,309 embeddings to Elasticsearch in batches...
   Uploading batch 1/49 (500 documents)...


  success, failed = helpers.bulk(


   Uploading batch 2/49 (500 documents)...
   Uploading batch 3/49 (500 documents)...
   Uploading batch 4/49 (500 documents)...
   Uploading batch 5/49 (500 documents)...
   Uploading batch 6/49 (500 documents)...
   Uploading batch 7/49 (500 documents)...
   Uploading batch 8/49 (500 documents)...
   Uploading batch 9/49 (500 documents)...
   Uploading batch 10/49 (500 documents)...
   Uploading batch 11/49 (500 documents)...
   Uploading batch 12/49 (500 documents)...
   Uploading batch 13/49 (500 documents)...
   Uploading batch 14/49 (500 documents)...
   Uploading batch 15/49 (500 documents)...
   Uploading batch 16/49 (500 documents)...
   Uploading batch 17/49 (500 documents)...
   Uploading batch 18/49 (500 documents)...
   Uploading batch 19/49 (500 documents)...
   Uploading batch 20/49 (500 documents)...
   Uploading batch 21/49 (500 documents)...
   Uploading batch 22/49 (500 documents)...
   Uploading batch 23/49 (500 documents)...
   Uploading batch 24/49 (500 documents)

## Summary

This notebook:
1. ‚úÖ Loaded the Aya Expanse 8B model with optimized settings
2. ‚úÖ Fetched unprocessed speeches from Elasticsearch (resume mode)
3. ‚úÖ Extracted 10 keywords using batch processing (10-30x faster)
4. ‚úÖ Uploaded keywords to Elasticsearch every 100 speeches
5. ‚úÖ Saved results to CSV with speech_id and keywords columns
6. ‚úÖ Generated embeddings for extracted keywords
7. ‚úÖ Updated embedding file (keyword_embeddings.npy)
8. ‚úÖ Uploaded embeddings to Elasticsearch
9. ‚úÖ Provided statistics and quality checks

**Output file:** `data/speech_keywords.csv`

**Columns:**
- `speech_id`: Unique identifier for each speech
- `keywords`: Comma-separated list of 10 keywords
- `speech_giver`: Speaker name (for reference)
- `year`: Speech year (for reference)
- `topic_label`: Topic label (for reference)

**Elasticsearch Fields Created:**
- `keywords`: Array of keyword strings
- `keywords_str`: Comma-separated keyword string
- `keywords_embedding`: 768-dimensional embedding vector

**Performance Notes:**
- Uses batch processing (32 speeches at once on 45GB GPU) for 10-30x speedup
- Greedy decoding (do_sample=False) for faster generation
- Left padding for decoder-only architecture compatibility
- Uploads every 100 speeches to prevent data loss
- Resume mode: Re-running skips already processed speeches

**Model Notes:**
- Topic-aware prompting for better keyword relevance
- Long speeches truncated to 2000 characters
- GPU acceleration (FP16) for speed

## 11. Download Results from Colab (Optional)

In [None]:
# Only run this cell if you're using Google Colab
# This will zip the CSV and download it to your computer

try:
    from google.colab import files
    import zipfile
    import os

    # Create zip file
    zip_filename = 'speech_keywords.zip'

    print(f"üì¶ Creating zip file: {zip_filename}...")

    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        if os.path.exists(OUTPUT_CSV):
            zipf.write(OUTPUT_CSV, os.path.basename(OUTPUT_CSV))
            print(f"   Added: {OUTPUT_CSV}")
        else:
            print(f"   ‚ö†Ô∏è  File not found: {OUTPUT_CSV}")

    # Get file size
    file_size = os.path.getsize(zip_filename) / (1024 * 1024)  # MB
    print(f"\n‚úÖ Zip file created: {zip_filename} ({file_size:.2f} MB)")
    print(f"üì• Downloading to your computer...")

    # Download
    files.download(zip_filename)

    print(f"\n‚úÖ Download complete!")
    print(f"   Check your Downloads folder for: {zip_filename}")

except ImportError:
    print("‚ÑπÔ∏è  This cell only works in Google Colab.")
    print(f"   If you're running locally, the CSV is already saved at: {OUTPUT_CSV}")
except Exception as e:
    print(f"‚ùå Error: {e}")