In [1]:
!pip install ollama

Collecting ollama
  Downloading ollama-0.5.1-py3-none-any.whl.metadata (4.3 kB)
Downloading ollama-0.5.1-py3-none-any.whl (13 kB)
Installing collected packages: ollama
Successfully installed ollama-0.5.1


In [2]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: fineGrained).
The token `Test` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `Test`


In [3]:
import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from sklearn.metrics import classification_report, confusion_matrix
import re
from tqdm import tqdm
import json

class ParliamentaryToneClassifier:
    def __init__(self, model_name="meta-llama/Llama-3.2-1B"):
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
            device_map="auto" if torch.cuda.is_available() else None
        )

        # Add padding token if it doesn't exist
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

    def create_few_shot_prompt(self, text_to_classify):
        """Build a few-shot prompt (two examples per tone) ending with the text to classify."""
        prompt = """Classify the tone of parliamentary proceedings sentences into one of these categories: supportive, antagonistic, mournful, celebratory, other.

Examples:

Text: "I want to congratulate the minister on this excellent initiative that will benefit all Canadians."
Tone: supportive

Text: "The government's complete failure on this issue is absolutely disgraceful and unacceptable."
Tone: antagonistic

Text: "We gather today to remember the tragic loss of life and express our deepest condolences to the families."
Tone: mournful

Text: "This is a historic achievement and a wonderful day for our great nation!"
Tone: celebratory

Text: "The committee will meet next Tuesday to review the proposed amendments to the bill."
Tone: other

Text: "I thank the honourable member for raising this important question about healthcare funding."
Tone: supportive

Text: "This reckless policy will destroy jobs and hurt working families across the country."
Tone: antagonistic

Text: "Our hearts go out to those affected by this devastating natural disaster."
Tone: mournful

Text: "We are thrilled to announce the successful completion of this landmark project."
Tone: celebratory

Text: "The document contains three main sections outlining the regulatory framework."
Tone: other

Now classify this text:
Text: "{}"
Tone:""".format(text_to_classify)

        return prompt

    def classify_text(self, text, max_length=1024):
        prompt = self.create_few_shot_prompt(text)

        # Tokenize the prompt
        inputs = self.tokenizer(
            prompt,
            return_tensors="pt",
            truncation=True,
            max_length=max_length,
            padding=True
        )

        # Move inputs to the same device as the model (GPU when available, else CPU)
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}

        # Generate a short continuation; temperature=0.1 keeps sampling close to greedy
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=10,
                temperature=0.1,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id
            )

        # Decode the response
        response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Extract the classification from the response
        classification = self.extract_classification(response)
        return classification

    def extract_classification(self, response):
        # The decoded response includes the prompt, so take the last "Tone:" match,
        # which is the model's answer for the text being classified
        tone_match = re.findall(r'Tone:\s*(\w+)', response)

        if tone_match:
            predicted_tone = tone_match[-1].lower().strip()

            # Map to valid categories
            valid_tones = ['supportive', 'antagonistic', 'mournful', 'celebratory', 'other']

            for tone in valid_tones:
                if tone in predicted_tone:
                    return tone

        # Default to 'other' if no clear classification
        return 'other'

    def classify_batch(self, texts, batch_size=8):
        """Classify texts one at a time; batch_size only sets the progress-bar step."""
        results = []

        for i in tqdm(range(0, len(texts), batch_size), desc="Classifying texts"):
            batch = texts[i:i+batch_size]
            batch_results = []

            for text in batch:
                try:
                    classification = self.classify_text(text)
                    batch_results.append(classification)
                except Exception as e:
                    print(f"Error classifying text: {e}")
                    batch_results.append('other')

            results.extend(batch_results)

        return results
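
# Sketch, not part of the original pipeline: extract_classification above scans
# the whole decoded string, prompt included. Assuming only the newly generated
# tokens are decoded, the label can be read from the first word of that
# continuation. parse_completion and VALID_TONES are hypothetical names.
import re

VALID_TONES = ("supportive", "antagonistic", "mournful", "celebratory", "other")

def parse_completion(completion, valid_tones=VALID_TONES, default="other"):
    """Return the first word of the completion if it is a known tone, else default."""
    match = re.match(r"\s*([A-Za-z]+)", completion)
    if match is None:
        return default
    word = match.group(1).lower()
    return word if word in valid_tones else default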

def load_parliamentary_data(file_path):
    # Expects a CSV with a 'sentence' or 'text' column; otherwise the first
    # string-typed column is used. Adjust to match your actual data format.
    try:
        df = pd.read_csv(file_path)
        # If the column is named differently, adjust here
        if 'sentence' in df.columns:
            texts = df['sentence'].tolist()
        elif 'text' in df.columns:
            texts = df['text'].tolist()
        else:
            # Take the first text column found
            text_columns = df.select_dtypes(include=['object']).columns
            texts = df[text_columns[0]].tolist()

        # Clean the texts
        texts = [str(text).strip() for text in texts if pd.notna(text) and len(str(text).strip()) > 10]
        return texts[:500]  # Limit to 500 for testing
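    except Exception as e:
        print(f"Error loading data: {e}")
        # Fall back to sample data for a smoke test. These sentences are near-copies
        # of the few-shot examples in the prompt, so predictions on them do not
        # measure how well the classifier generalizes.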
        return [
            "I want to congratulate the minister on this excellent initiative.",
            "The government's complete failure on this issue is absolutely disgraceful.",
            "We gather today to remember the tragic loss of life.",
            "This is a historic achievement and a wonderful day for our nation!",
            "The committee will meet next Tuesday to review the amendments.",
            "This reckless policy will destroy jobs and hurt families.",
            "Our hearts go out to those affected by this disaster.",
            "We are thrilled to announce the successful completion of this project.",
            "The document contains three main sections.",
            "I thank the honourable member for raising this important question."
        ]
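
# Sketch, assuming evaluation texts may overlap with the few-shot examples in
# the prompt: measure what fraction of a text's words also appear in a single
# prompt example. max_prompt_overlap is a hypothetical helper, and any cutoff
# applied to its score is a judgment call.
import re

def max_prompt_overlap(text, prompt_examples):
    """Return the highest share of the text's distinct words found in one example."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    if not words:
        return 0.0
    best = 0.0
    for example in prompt_examples:
        example_words = set(re.findall(r"[a-z']+", example.lower()))
        best = max(best, len(words & example_words) / len(words))
    return best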

def evaluate_results(texts, predictions, true_labels=None):
    # Count predictions by category
    prediction_counts = pd.Series(predictions).value_counts()
    print("Prediction Distribution:")
    print(prediction_counts)
    print()

    # Show some examples for each category
    results_df = pd.DataFrame({
        'text': texts,
        'predicted_tone': predictions
    })

    print("Sample predictions by category:")
    for tone in ['supportive', 'antagonistic', 'mournful', 'celebratory', 'other']:
        samples = results_df[results_df['predicted_tone'] == tone].head(3)
        if len(samples) > 0:
            print(f"\n{tone.upper()}:")
            for idx, row in samples.iterrows():
                print(f"- {row['text'][:100]}...")

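    # When gold labels are supplied, report per-class precision, recall, and F1
    if true_labels is not None:
        tones = ['supportive', 'antagonistic', 'mournful', 'celebratory', 'other']
        print("\nClassification report:")
        print(classification_report(true_labels, predictions, labels=tones, zero_division=0))

    return results_df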

def prepare_for_lit_nlp(results_df, output_file="parliamentary_results.jsonl"):
    # Convert to format suitable for LIT-NLP
    lit_data = []
    for idx, row in results_df.iterrows():
        lit_data.append({
            'text': row['text'],
            'predicted_tone': row['predicted_tone'],
            'id': idx
        })

    # Save as JSONL for LIT-NLP
    with open(output_file, 'w') as f:
        for item in lit_data:
            f.write(json.dumps(item) + '\n')

    print(f"Data prepared for LIT-NLP analysis: {output_file}")

    # Instructions for LIT-NLP
    print("\nTo use with LIT-NLP:")
    print("1. Install LIT-NLP: pip install lit-nlp")
    print("2. Load the JSONL file in a lit_nlp.api.dataset.Dataset subclass and serve it with lit_nlp.dev_server.Server")
    print("3. Open browser to the provided URL to analyze results")
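
# Sketch of reading the JSONL file back, for example to confirm that the export
# round-trips before loading it elsewhere. load_jsonl is a hypothetical helper.
import json

def load_jsonl(path):
    """Parse one JSON object per non-empty line and return them as a list."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records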

def main():
    print("Initializing Parliamentary Tone Classifier...")

    # Initialize classifier (this will download the model if not cached)
    classifier = ParliamentaryToneClassifier()

    # Load data (replace with your actual file path)
    print("Loading parliamentary data...")
    data_file = "parliamentary_proceedings.csv"  # Replace with actual path
    texts = load_parliamentary_data(data_file)

    print(f"Loaded {len(texts)} texts for classification")

    # Classify texts
    print("Starting classification...")
    predictions = classifier.classify_batch(texts, batch_size=4)

    # Evaluate results
    print("\nEvaluating results...")
    results_df = evaluate_results(texts, predictions)

    # Prepare for LIT-NLP analysis
    prepare_for_lit_nlp(results_df)

    # Save results
    results_df.to_csv("parliamentary_classification_results.csv", index=False)
    print("\nResults saved to: parliamentary_classification_results.csv")

    return results_df

# Instructions for improvement based on LIT-NLP analysis
def improvement_suggestions():
    """Print suggestions for improving the model based on common issues"""

    suggestions = """
    IMPROVEMENT SUGGESTIONS BASED ON LIT-NLP ANALYSIS:

    1. ANALYZE ERRORS IN LIT-NLP:
       - Look for patterns in misclassified examples
       - Check if certain keywords lead to wrong classifications
       - Identify if sentence length affects performance

    2. IMPROVE FEW-SHOT EXAMPLES:
       - Add more diverse examples for each category
       - Include examples that are similar to your misclassified cases
       - Balance the examples better across categories

    3. PROMPT ENGINEERING:
       - Add more context about parliamentary language
       - Include definitions of each tone category
       - Add negative examples (what each category is NOT)

    4. POST-PROCESSING:
       - Add keyword-based rules for obvious cases
       - Use confidence scores to flag uncertain predictions
       - Implement ensemble methods with multiple prompts

    5. DATA PREPROCESSING:
       - Clean parliamentary-specific formatting
       - Handle speaker attributions
       - Normalize common parliamentary phrases
    """

    print(suggestions)
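
# Sketch of suggestion 4 (ensemble over several prompts, flag uncertain cases),
# assuming each text has already been classified once per prompt variant.
# majority_vote and the 0.6 agreement cutoff are illustrative assumptions.
from collections import Counter

def majority_vote(labels, min_agreement=0.6):
    """Return (winning label, agreement share, uncertain flag) for one text."""
    if not labels:
        return 'other', 0.0, True
    winner, count = Counter(labels).most_common(1)[0]
    agreement = count / len(labels)
    return winner, agreement, agreement < min_agreement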

if __name__ == "__main__":
    results = main()
    print("\n" + "="*50)
    improvement_suggestions()

Initializing Parliamentary Tone Classifier...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/301 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/843 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

Loading parliamentary data...
Error loading data: [Errno 2] No such file or directory: 'parliamentary_proceedings.csv'
Loaded 10 texts for classification
Starting classification...


Classifying texts: 100%|██████████| 3/3 [02:29<00:00, 49.92s/it]



Evaluating results...
Prediction Distribution:
supportive      2
antagonistic    2
mournful        2
celebratory     2
other           2
Name: count, dtype: int64

Sample predictions by category:

SUPPORTIVE:
- I want to congratulate the minister on this excellent initiative....
- I thank the honourable member for raising this important question....

ANTAGONISTIC:
- The government's complete failure on this issue is absolutely disgraceful....
- This reckless policy will destroy jobs and hurt families....

MOURNFUL:
- We gather today to remember the tragic loss of life....
- Our hearts go out to those affected by this disaster....

CELEBRATORY:
- This is a historic achievement and a wonderful day for our nation!...
- We are thrilled to announce the successful completion of this project....

OTHER:
- The committee will meet next Tuesday to review the amendments....
- The document contains three main sections....
Data prepared for LIT-NLP analysis: parliamentary_results.jsonl

To use wit