# 🏛️ Italian Parliament Speech Analyzer

This notebook runs the complete analysis pipeline for Italian Parliament speeches.

## What this notebook does:
1. **Scrapes** speeches from senato.it and camera.it
2. **Generates embeddings** using Sentence Transformers
3. **Computes analytics** (identity, relations, temporal, sentiment)
4. **Exports JSON** files for the frontend visualization

---

⚠️ **GPU Recommended**: `Runtime → Change runtime type → T4 GPU`
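
To confirm that the runtime actually exposes a GPU before starting the long-running steps, a quick check along these lines may help. This is a minimal sketch that assumes PyTorch is importable (Colab ships it pre-installed); `detect_device` is a hypothetical helper name and not part of the repository.

```python
def detect_device() -> str:
    """Return 'cuda' if PyTorch can see a GPU, otherwise 'cpu'."""
    try:
        import torch  # pre-installed on Colab; may be missing elsewhere
    except ImportError:
        # Without PyTorch we cannot probe for a GPU, so report CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = detect_device()
print(f"Detected device: {device}")
```

If this prints `cpu` on Colab, the runtime type has probably not been switched to a GPU yet.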

## 1. Setup Environment

In [None]:
# Suppress TensorFlow and other noisy warnings
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
os.environ['TRANSFORMERS_VERBOSITY'] = 'error'
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

import warnings
warnings.filterwarnings('ignore')

print('✅ Warnings suppressed')

In [None]:
# Clone the repository
!git clone https://github.com/WeridFire/Parliament-Speech-Analyzer.git
%cd Parliament-Speech-Analyzer

In [None]:
# Install dependencies (don't install torch - use Colab's pre-installed version)
!pip install -q pandas numpy requests beautifulsoup4 SPARQLWrapper
!pip install -q scikit-learn sentence-transformers plotly tqdm
!pip install -q spacy transformers
!python -m spacy download it_core_news_sm -q

print('✅ All dependencies installed!')

## 2. Configuration

In [None]:
# Configuration
DATA_SOURCE = 'both'  # Options: 'senate', 'camera', 'both'
USE_TRANSFORMER_SENTIMENT = True  # True for better accuracy, False for speed

print(f'üìã Data source: {DATA_SOURCE}')
print(f'üìã Transformer sentiment: {USE_TRANSFORMER_SENTIMENT}')

## 3. Run Analysis Pipeline

Running the cell below will:
- Scrape speeches (~10-15 min on the first run)
- Generate embeddings (~5 min on a GPU)
- Compute all analytics
- Export JSON files

In [None]:
# Build and run command
# Use -u to force unbuffered output so we see logs in real-time
cmd = f'python -u backend/export_data.py --source {DATA_SOURCE} --refetch --reembed --verbose'
if USE_TRANSFORMER_SENTIMENT:
    cmd += ' --transformer-sentiment'

print(f'üöÄ Running: {cmd}')
print('=' * 60)
!{cmd}
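
Outside a notebook, the same command can be assembled as an argument list and passed to `subprocess.run`, which avoids shell quoting issues. The following is a sketch only: `build_command` is a hypothetical helper, and the flags are simply the ones used in the cell above, assumed to be accepted by `backend/export_data.py` as shown there.

```python
import sys


def build_command(source: str, transformer_sentiment: bool) -> list[str]:
    """Assemble the export command as a list of arguments."""
    cmd = [
        sys.executable, "-u", "backend/export_data.py",
        "--source", source,
        "--refetch", "--reembed", "--verbose",
    ]
    if transformer_sentiment:
        cmd.append("--transformer-sentiment")
    return cmd


cmd = build_command("both", transformer_sentiment=True)
print(" ".join(cmd))
# To execute from the repository root:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Using `sys.executable` keeps the child process on the same interpreter, and therefore the same installed packages, as the caller.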

## 4. Verify Output

In [None]:
import os, json

output_dir = 'frontend/public'
files = [f for f in os.listdir(output_dir) if f.endswith('.json')]

print('üìÅ Generated files:')
for f in files:
    path = os.path.join(output_dir, f)
    size_mb = os.path.getsize(path) / (1024 * 1024)
    with open(path, 'r', encoding='utf-8') as file:
        data = json.load(file)
    print(f'   ‚úÖ {f}: {size_mb:.2f} MB | {len(data.get("speeches", []))} speeches')
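
The check above prints a summary. For scripted use, the same information can be returned as a dictionary so that empty or malformed exports are easy to flag. This is a sketch under the assumption that each export is a JSON object with the `speeches` and `analytics` keys read elsewhere in this notebook; `summarize_export` is a hypothetical name.

```python
import json
from pathlib import Path


def summarize_export(path) -> dict:
    """Load one exported JSON file and report its size and speech count."""
    path = Path(path)
    with path.open("r", encoding="utf-8") as fh:
        data = json.load(fh)
    # Only the keys this notebook already reads are inspected here.
    is_object = isinstance(data, dict)
    speeches = data.get("speeches", []) if is_object else []
    return {
        "file": path.name,
        "size_mb": path.stat().st_size / (1024 * 1024),
        "n_speeches": len(speeches),
        "has_analytics": is_object and "analytics" in data,
    }

# Example (run from the repository root after the pipeline has finished):
#   for p in sorted(Path("frontend/public").glob("*.json")):
#       print(summarize_export(p))
```

A file with `n_speeches == 0` or `has_analytics == False` is a reasonable signal that the pipeline run should be inspected before downloading.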

## 5. Download Results

In [None]:
from google.colab import files

!cd frontend/public && zip -r ../../parliament_data.zip *.json
files.download('parliament_data.zip')

print('\nüì• Download complete! Extract to frontend/public/')
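
When the pipeline runs outside Colab, `google.colab` and the `zip` command-line tool may not be available. A portable alternative is the standard-library `zipfile` module. The sketch below mirrors the flat archive layout produced by the cell above; `zip_json_files` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def zip_json_files(src_dir, archive_path) -> list[str]:
    """Write every *.json file in src_dir to a zip archive; return the names added."""
    src = Path(src_dir)
    names = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.glob("*.json")):
            # Store files without their directory prefix, as `zip *.json`
            # does when run from inside the directory.
            zf.write(path, arcname=path.name)
            names.append(path.name)
    return names

# Example (run from the repository root):
#   zip_json_files("frontend/public", "parliament_data.zip")
```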

## 6. Quick Exploration

In [None]:
import json

with open('frontend/public/camera.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

analytics = data.get('analytics', {}).get('global', {})
print('üìä Available Analytics:', list(analytics.keys()))

# Show sample keywords
keywords = analytics.get('identity', {}).get('distinctive_keywords', {})
for party, words in list(keywords.items())[:3]:
    print(f'\nüè∑Ô∏è {party}: {', '.join(words[:8])}')
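
The nested lookup used above can be wrapped in a small function that tolerates missing keys, which is convenient when exploring several export files. This is a sketch: `top_keywords` is a hypothetical name, and the key path (`analytics → global → identity → distinctive_keywords`) is assumed from the cell above rather than from a documented schema.

```python
def top_keywords(data: dict, n_parties: int = 3, n_words: int = 8) -> dict:
    """Return up to n_words distinctive keywords for the first n_parties parties."""
    keywords = (
        data.get("analytics", {})
        .get("global", {})
        .get("identity", {})
        .get("distinctive_keywords", {})
    )
    # Any missing level yields an empty dict, so the result is simply empty.
    return {
        party: list(words[:n_words])
        for party, words in list(keywords.items())[:n_parties]
    }

# Example, with `data` loaded as in the cell above:
#   for party, words in top_keywords(data).items():
#       print(party, words)
```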

---
## Next Steps

1. Download `parliament_data.zip`
2. Extract to `frontend/public/`
3. Run frontend: `cd frontend && npm install && npm run dev`
4. Open http://localhost:5173