# Kannada Language Processing Workflow

This notebook demonstrates the complete workflow for processing Kannada language text using our specialized framework. Kannada (ಕನ್ನಡ) is a Dravidian language spoken primarily in Karnataka, India.

## What you'll learn:
1. **Kannada Data Collection** - Gathering text from Kannada sources
2. **Script-Specific Preprocessing** - Handling Kannada script nuances
3. **Model Training** - Optimizing models for Kannada
4. **Evaluation** - Assessing performance on Kannada tasks
5. **Cross-lingual Analysis** - Comparing with other Dravidian languages

## About Kannada
- **Script**: Kannada script (ಕನ್ನಡ ಲಿಪಿ)
- **Family**: Dravidian language family
- **Speakers**: ~44 million native speakers
- **Unicode Range**: U+0C80–U+0CFF
- **Characteristics**: Rich morphology, agglutinative, complex conjuncts
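
The Unicode range listed above is enough to tell Kannada characters apart from everything else. Here is a minimal sketch using only the standard library; the helper names `is_kannada_char` and `kannada_ratio` are illustrative and are not part of the framework.

```python
KANNADA_START, KANNADA_END = 0x0C80, 0x0CFF  # Kannada Unicode block

def is_kannada_char(ch: str) -> bool:
    """Return True if the single character lies in the Kannada block."""
    return KANNADA_START <= ord(ch) <= KANNADA_END

def kannada_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Kannada."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(is_kannada_char(c) for c in chars) / len(chars)

# 5 Kannada code points out of 8 non-space characters
print(kannada_ratio("ಕನ್ನಡ NLP"))
```

A ratio of this kind is presumably what `validate_kannada_text(..., min_kannada_ratio=0.6)` thresholds later in the notebook, although its exact definition is not shown here.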

In [None]:
# Import libraries
import sys
import os
sys.path.append('../src')

import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Import Kannada-specific modules
from preprocessing.kannada_preprocessor import KannadaPreprocessor
from data_collection.kannada_collector import KannadaDataCollector
from models import IndianLanguageModel
from evaluation import ModelEvaluator

# Set up plotting for Kannada text
plt.rcParams['font.family'] = ['Noto Sans Kannada', 'Lohit Kannada', 'sans-serif']
plt.style.use('default')

print("✅ ಕನ್ನಡ NLP ವ್ಯವಸ್ಥೆ ಸಿದ್ಧ! (Kannada NLP System Ready!)")

## 1. Kannada Data Collection

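Let's start by assembling Kannada text. A full run would gather articles from several sources through `KannadaDataCollector`; this demo uses five hand-written sample articles instead.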

In [None]:
# Initialize Kannada data collector
collector = KannadaDataCollector()

print("🔍 ಕನ್ನಡ ದತ್ತಾಂಶ ಸಂಗ್ರಹಣೆ ಪ್ರಾರಂಭ... (Starting Kannada data collection...)")

# For this demo, we'll use sample Kannada texts
sample_kannada_texts = [
    {
        'title': 'ಕನ್ನಡ ಭಾಷೆ ಮತ್ತು ಸಂಸ್ಕೃತಿ',
        'content': 'ಕನ್ನಡ ಭಾಷೆಯು ದ್ರಾವಿಡ ಭಾಷಾ ಕುಟುಂಬದ ಒಂದು ಪ್ರಮುಖ ಭಾಷೆಯಾಗಿದೆ. ಇದು ಕರ್ನಾಟಕ ರಾಜ್ಯದ ಅಧಿಕೃತ ಭಾಷೆಯಾಗಿದೆ ಮತ್ತು ಸುಮಾರು ೪.೪ ಕೋಟಿ ಜನರು ಮಾತನಾಡುತ್ತಾರೆ. ಕನ್ನಡ ಸಾಹಿತ್ಯವು ಸಾವಿರಾರು ವರ್ಷಗಳ ಇತಿಹಾಸವನ್ನು ಹೊಂದಿದೆ.',
        'source': 'ಕನ್ನಡ ಪ್ರಭ',
        'category': 'culture'
    },
    {
        'title': 'ಬೆಂಗಳೂರಿನ ತಂತ್ರಜ್ಞಾನ ಉದ್ಯಮ',
        'content': 'ಬೆಂಗಳೂರು ಭಾರತದ ತಂತ್ರಜ್ಞಾನ ರಾಜಧಾನಿ ಎಂದು ಹೆಸರುವಾಸಿಯಾಗಿದೆ. ಇಲ್ಲಿ ಅನೇಕ ಬಹುರಾಷ್ಟ್ರೀಯ ಕಂಪನಿಗಳು ತಮ್ಮ ಕಚೇರಿಗಳನ್ನು ಸ್ಥಾಪಿಸಿದ್ದಾರೆ. ಕೃತ್ರಿಮ ಬುದ್ಧಿಮತ್ತೆ, ಯಂತ್ರ ಕಲಿಕೆ ಮತ್ತು ಡೇಟಾ ವಿಜ್ಞಾನದಲ್ಲಿ ಇಲ್ಲಿನ ಕಂಪನಿಗಳು ಮುಂಚೂಣಿಯಲ್ಲಿವೆ.',
        'source': 'ಪ್ರಜಾವಾಣಿ',
        'category': 'technology'
    },
    {
        'title': 'ಕರ್ನಾಟಕದ ಪರಂಪರೆ',
        'content': 'ಕರ್ನಾಟಕವು ಶ್ರೀಮಂತವಾದ ಸಾಂಸ್ಕೃತಿಕ ಪರಂಪರೆಯನ್ನು ಹೊಂದಿದೆ. ಹಂಪಿ, ಮೈಸೂರು, ಬಾದಾಮಿ ಮುಂತಾದ ಸ್ಥಳಗಳು ಐತಿಹಾಸಿಕ ಮಹತ್ವವನ್ನು ಹೊಂದಿವೆ. ಯಕ್ಷಗಾನ, ಕರ್ನಾಟಕ ಸಂಗೀತ, ಮತ್ತು ಭರತನಾಟ್ಯ ಇಲ್ಲಿನ ಪ್ರಮುಖ ಕಲಾ ಪ್ರಕಾರಗಳಾಗಿವೆ.',
        'source': 'ಉದಯವಾಣಿ',
        'category': 'heritage'
    },
    {
        'title': 'ಆರೋಗ್ಯ ಮತ್ತು ಆಯುರ್ವೇದ',
        'content': 'ಆಯುರ್ವೇದವು ಭಾರತದ ಪ್ರಾಚೀನ ವೈದ್ಯ ಶಾಸ್ತ್ರವಾಗಿದೆ. ಕರ್ನಾಟಕದಲ್ಲಿ ಅನೇಕ ಆಯುರ್ವೇದ ಕಾಲೇಜುಗಳು ಮತ್ತು ಆಸ್ಪತ್ರೆಗಳಿವೆ. ಪ್ರಕೃತಿಯ ಔಷಧಿಗಳು ಮತ್ತು ಯೋಗದ ಮೂಲಕ ಆರೋಗ್ಯವನ್ನು ಕಾಪಾಡುವ ಸಂಪ್ರದಾಯವಿದೆ.',
        'source': 'ವಿಜಯ ಕರ್ನಾಟಕ',
        'category': 'health'
    },
    {
        'title': 'ಕರ್ನಾಟಕದ ಕೃಷಿ',
        'content': 'ಕರ್ನಾಟಕ ರಾಜ್ಯವು ಕೃಷಿಯಲ್ಲಿ ಪ್ರಮುಖ ಸ್ಥಾನ ಹೊಂದಿದೆ. ಇಲ್ಲಿ ಅಕ್ಕಿ, ರಾಗಿ, ಸಕ್ಕರೆ ಕಬ್ಬು, ಕಾಫಿ ಮತ್ತು ಮೆಣಸಿನಕಾಯಿ ಬೆಳೆಯುತ್ತದೆ. ಕೊಡಗು ಜಿಲ್ಲೆಯು ಕಾಫಿ ಉತ್ಪಾದನೆಗೆ ಪ್ರಸಿದ್ಧವಾಗಿದೆ. ಆಧುನಿಕ ಕೃಷಿ ತಂತ್ರಗಳ ಬಳಕೆಯಿಂದ ಉತ್ಪಾದನೆ ಹೆಚ್ಚಾಗುತ್ತಿದೆ.',
        'source': 'ಸಂಯುಕ್ತ ಕರ್ನಾಟಕ',
        'category': 'agriculture'
    }
]

# Create DataFrame
kannada_df = pd.DataFrame(sample_kannada_texts)
kannada_df['timestamp'] = pd.Timestamp.now()
kannada_df['language'] = 'kn'
kannada_df['text_length'] = kannada_df['content'].str.len()
kannada_df['word_count'] = kannada_df['content'].str.split().str.len()

print(f"📊 ಸಂಗ್ರಹಣೆ ಪೂರ್ಣ! (Collection Complete!)")
print(f"Total articles: {len(kannada_df)}")
print(f"Average text length: {kannada_df['text_length'].mean():.0f} characters")
print(f"Total words: {kannada_df['word_count'].sum()}")

# Show sample
print(f"\n📰 ಮಾದರಿ ಲೇಖನ (Sample Article):")
sample = kannada_df.iloc[0]
print(f"Title: {sample['title']}")
print(f"Content: {sample['content'][:150]}...")
print(f"Category: {sample['category']}")

## 2. Kannada Text Preprocessing

Now let's preprocess the Kannada text using our specialized preprocessor.
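
To make the cleaning options concrete, here is a rough sketch of what `normalize_unicode` and `normalize_digits` could amount to, written with the standard library only. It is an assumption about the behavior, not the actual `KannadaPreprocessor` implementation, and `normalize_kannada` is a hypothetical helper.

```python
import unicodedata

# Map Kannada digits ೦-೯ (U+0CE6..U+0CEF) to ASCII 0-9.
KANNADA_DIGITS = {0x0CE6 + i: str(i) for i in range(10)}

def normalize_kannada(text: str, normalize_digits: bool = True) -> str:
    """NFC-normalize, optionally convert Kannada digits, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    if normalize_digits:
        text = text.translate(KANNADA_DIGITS)
    return " ".join(text.split())

# The Kannada digits in "೪.೪" become ASCII "4.4"; the double space collapses.
print(normalize_kannada("ಸುಮಾರು  ೪.೪ ಕೋಟಿ"))
```

NFC is chosen here so that visually identical strings compare equal; whether the framework uses NFC or another form is not stated in this notebook.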

In [None]:
# Initialize Kannada preprocessor
preprocessor = KannadaPreprocessor()

print("🧹 ಕನ್ನಡ ಪಾಠ್ಯ ಪ್ರಕ್ರಿಯೆ ಪ್ರಾರಂಭ... (Starting Kannada text processing...)")

# Test with a sample text
sample_text = kannada_df.iloc[1]['content']
print(f"\n📝 ಮೂಲ ಪಾಠ್ಯ (Original Text):")
print(sample_text)

# Clean the text
cleaned_text = preprocessor.clean_kannada_text(
    sample_text,
    normalize_unicode=True,
    normalize_digits=True,
    handle_virama=True,
    preserve_conjuncts=True
)

print(f"\n✨ ಶುದ್ಧೀಕರಿಸಿದ ಪಾಠ್ಯ (Cleaned Text):")
print(cleaned_text)

# Get detailed statistics
stats = preprocessor.get_kannada_text_statistics(sample_text)
print(f"\n📊 ಪಾಠ್ಯ ಅಂಕಿಅಂಶಗಳು (Text Statistics):")
for key, value in stats.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.3f}")
    else:
        print(f"  {key}: {value}")

# Tokenize words
tokens = preprocessor.tokenize_kannada_words(sample_text)
print(f"\n🔤 ಪದ ವಿಭಜನೆ (Word Tokenization):")
print(f"Total tokens: {len(tokens)}")
print(f"First 10 tokens: {tokens[:10]}")

# Morphological analysis of sample words
print(f"\n🔬 ರೂಪವಿಜ್ಞಾನ ವಿಶ್ಲೇಷಣೆ (Morphological Analysis):")
sample_words = ['ಕಂಪನಿಗಳು', 'ತಂತ್ರಜ್ಞಾನ', 'ಸ್ಥಾಪಿಸಿದ್ದಾರೆ']
for word in sample_words:
    analysis = preprocessor.analyze_kannada_morphology(word)
    print(f"  {word}: {analysis}")

## 3. Dataset Preprocessing

Let's preprocess our entire Kannada dataset.

In [None]:
# Preprocess the entire dataset
print("⚙️ ಸಂಪೂರ್ಣ ದತ್ತಸಮೂಹ ಪ್ರಕ್ರಿಯೆ... (Processing entire dataset...)")

# Apply preprocessing to all content
processed_df = preprocessor.preprocess_kannada_dataset(
    kannada_df.copy(),
    text_column='content',
    clean_options={
        'normalize_unicode': True,
        'normalize_digits': True,
        'remove_mixed_script': False,
        'preserve_conjuncts': True,
        'handle_virama': True
    }
)

# Add additional analysis
processed_df['cleaned_word_count'] = processed_df['content'].str.split().str.len()
processed_df['cleaned_length'] = processed_df['content'].str.len()
processed_df['sentences'] = processed_df['content'].apply(
    lambda x: len(preprocessor.extract_kannada_sentences(x))
)

# Validation
processed_df['is_valid_kannada'] = processed_df['content'].apply(
    lambda x: preprocessor.validate_kannada_text(x, min_kannada_ratio=0.6)
)

print(f"✅ ಪ್ರಕ್ರಿಯೆ ಪೂರ್ಣ! (Processing Complete!)")
print(f"Valid Kannada texts: {processed_df['is_valid_kannada'].sum()}/{len(processed_df)}")
print(f"Average sentences per text: {processed_df['sentences'].mean():.1f}")
print(f"Total cleaned words: {processed_df['cleaned_word_count'].sum()}")

# Show processing comparison
comparison_df = pd.DataFrame({
    'Metric': ['Original Length', 'Cleaned Length', 'Original Words', 'Cleaned Words'],
    'Average': [
        processed_df['text_length'].mean(),
        processed_df['cleaned_length'].mean(), 
        processed_df['word_count'].mean(),
        processed_df['cleaned_word_count'].mean()
    ]
})

print(f"\n📈 ಪ್ರಕ್ರಿಯೆ ಹೋಲಿಕೆ (Processing Comparison):")
print(comparison_df.round(1))

## 4. Kannada Language Model Training

Let's create and configure a language model optimized for Kannada.
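
For a sense of scale, the parameter count of the configuration used in the next cell can be estimated by hand. The sketch below assumes the standard BERT encoder layout: learned token, position, and segment embeddings, a feed-forward width of 4 × hidden size, and no pooler or task head. `IndianLanguageModel` may be built differently, so treat the result as an estimate rather than what `get_parameter_count()` will report.

```python
def bert_param_estimate(vocab_size, hidden, layers, max_positions,
                        intermediate=None, type_vocab=2):
    """Estimate parameters of a standard BERT encoder (no pooler, no head)."""
    intermediate = intermediate or 4 * hidden
    layer_norm = 2 * hidden  # scale + bias
    # Token, position, and segment embeddings plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Q, K, V, and output projections (weights + biases) plus one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (weights + biases) plus one LayerNorm.
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden + layer_norm)
    per_layer = attention + feed_forward
    return {"embeddings": embeddings, "per_layer": per_layer,
            "total": embeddings + layers * per_layer}

estimate = bert_param_estimate(vocab_size=32000, hidden=768, layers=8,
                               max_positions=1024)
for name, count in estimate.items():
    print(f"{name}: {count:,}")
```

Under these assumptions the configuration comes to roughly 82 million parameters, with the 32,000-entry token embedding table alone accounting for about 30% of them.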

In [None]:
# Initialize Kannada-optimized model
print("🤖 ಕನ್ನಡ ಭಾಷಾ ಮಾದರಿ ಆರಂಭ... (Initializing Kannada language model...)")

kannada_model = IndianLanguageModel(
    language='kn',
    model_type='bert',
    vocab_size=32000,  # Optimized for Kannada
    hidden_size=768,
    num_hidden_layers=8,  # Smaller for demo
    num_attention_heads=12,
    max_position_embeddings=1024,  # Longer sequences for Kannada
    dropout_prob=0.1
)

print("✅ ಮಾದರಿ ಸಿದ್ಧ! (Model Ready!)")

# Get model information
param_count = kannada_model.get_parameter_count()
print(f"\n📊 ಮಾದರಿ ಮಾಹಿತಿ (Model Information):")
for component, count in param_count.items():
    print(f"  {component}: {count:,} parameters")

print(f"\n🎯 Total parameters: {param_count['total']:,}")

# Test embeddings with Kannada sentences
test_sentences = [
    "ಕನ್ನಡ ಭಾಷೆ ಬಹಳ ಸುಂದರವಾಗಿದೆ.",
    "ಬೆಂಗಳೂರು ತಂತ್ರಜ್ಞಾನ ನಗರವಾಗಿದೆ.",
    "ನಾವು ಕೃತ್ರಿಮ ಬುದ್ಧಿಮತ್ತೆಯನ್ನು ಅಧ್ಯಯನ ಮಾಡುತ್ತಿದ್ದೇವೆ.",
    "ಕರ್ನಾಟಕದ ಸಂಸ್ಕೃತಿ ಶ್ರೀಮಂತವಾಗಿದೆ."
]

print(f"\n🧠 ಭಾಷಾ ಮಾದರಿ ಪರೀಕ್ಷೆ (Language Model Testing):")
embeddings = kannada_model.get_embeddings(test_sentences, language='kn')
print(f"Embeddings shape: {embeddings.shape}")
print(f"Sample embedding (first 5 dims): {embeddings[0][:5].numpy()}")

# Calculate similarities between sentences
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(embeddings)

print(f"\n🔄 ವಾಕ್ಯ ಸಾಮ್ಯತೆಗಳು (Sentence Similarities):")
for i, sent1 in enumerate(test_sentences):
    for j, sent2 in enumerate(test_sentences[i+1:], i+1):
        sim = similarities[i][j]
        print(f"  {sim:.3f}: '{sent1}' ↔ '{sent2}'")

## 5. Model Evaluation on Kannada Tasks

Let's evaluate our model's performance on various Kannada NLP tasks.
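
Semantic-similarity evaluation usually compares the model's cosine similarities with reference scores through a rank correlation. How `ModelEvaluator` computes its metrics is not shown in this notebook, so the following is only a sketch of that idea using Spearman correlation. The `predicted` values are hypothetical numbers chosen for illustration, not model output.

```python
def ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

expected = [0.8, 0.7, 0.2, 0.1]       # reference scores used in the cell below
predicted = [0.91, 0.62, 0.35, 0.40]  # hypothetical model similarities
print(f"Spearman rho: {spearman(expected, predicted):.2f}")
```

With only four pairs, any such correlation is far too noisy to support conclusions; the value is useful here only as a smoke test that the pipeline runs end to end.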

In [None]:
# Initialize evaluator for Kannada
evaluator = ModelEvaluator(
    model=kannada_model,
    language='kn',
    device='cpu'
)

print("📊 ಕನ್ನಡ ಮಾದರಿ ಮೌಲ್ಯಮಾಪನ... (Kannada Model Evaluation...)")

# Create evaluation dataset from our processed texts
eval_texts = processed_df['content'].tolist()
eval_categories = processed_df['category'].tolist()

# Create label mapping for categories (sorted so label ids are stable across runs)
category_to_label = {cat: idx for idx, cat in enumerate(sorted(set(eval_categories)))}
label_to_category = {v: k for k, v in category_to_label.items()}
eval_labels = [category_to_label[cat] for cat in eval_categories]

print(f"📋 ಮೌಲ್ಯಮಾಪನ ದತ್ತಸಮೂಹ (Evaluation Dataset):")
print(f"  Texts: {len(eval_texts)}")
print(f"  Categories: {list(category_to_label.keys())}")
print(f"  Labels: {eval_labels}")

# Text classification evaluation
print(f"\n🎯 ಪಾಠ್ಯ ವರ್ಗೀಕರಣ ಮೌಲ್ಯಮಾಪನ (Text Classification Evaluation):")
classification_results = evaluator.evaluate_text_classification(
    texts=eval_texts,
    labels=eval_labels,
    task_name="kannada_category_classification"
)

print(f"Classification Results:")
for metric, value in classification_results.items():
    if isinstance(value, (int, float)):
        print(f"  {metric}: {value:.4f}")
    elif metric not in ['classification_report']:
        print(f"  {metric}: {value}")

# Semantic similarity evaluation
print(f"\n🔄 ಅರ್ಥಗತ ಸಾಮ್ಯತೆ ಮೌಲ್ಯಮಾಪನ (Semantic Similarity Evaluation):")

# Create pairs of similar and dissimilar Kannada sentences
similarity_pairs = [
    ("ಕನ್ನಡ ಭಾಷೆ ಸುಂದರವಾಗಿದೆ", "ಕನ್ನಡ ಭಾಷೆ ಚೆಂದವಾಗಿದೆ"),  # Similar
    ("ಬೆಂಗಳೂರು ದೊಡ್ಡ ನಗರ", "ಬೆಂಗಳೂರು ಮಹಾನಗರ"),  # Similar
    ("ತಂತ್ರಜ್ಞಾನ ಅಭಿವೃದ್ಧಿ", "ಕೃಷಿ ಉತ್ಪಾದನೆ"),  # Different
    ("ಸಂಸ್ಕೃತಿ ಮತ್ತು ಕಲೆ", "ಆರೋಗ್ಯ ಮತ್ತು ಔಷಧಿ")  # Different
]

similarity_scores = [0.8, 0.7, 0.2, 0.1]  # Expected similarity scores

similarity_results = evaluator.evaluate_semantic_similarity(
    text_pairs=similarity_pairs,
    similarity_scores=similarity_scores
)

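print("Semantic Similarity Results:")
for metric, value in similarity_results.items():
    # Guard the float format, as the other result loops in this notebook do
    if isinstance(value, (int, float)):
        print(f"  {metric}: {value:.4f}")
    else:
        print(f"  {metric}: {value}")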

# Show individual pair similarities
print(f"\n📝 ವ್ಯಕ್ತಿಗತ ಜೋಡಿ ಸಾಮ್ಯತೆಗಳು (Individual Pair Similarities):")
for i, (text1, text2) in enumerate(similarity_pairs):
    emb1 = kannada_model.get_embeddings([text1], language='kn')
    emb2 = kannada_model.get_embeddings([text2], language='kn')
    sim = cosine_similarity(emb1, emb2)[0, 0]
    print(f"  {sim:.3f}: '{text1}' ↔ '{text2}'")

## 6. Cross-lingual Analysis with Dravidian Languages

Let's evaluate cross-lingual capabilities with other Dravidian languages.
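
Each of the four languages compared here is written in its own script with its own Unicode block, so samples can be routed or sanity-checked by code point alone. The sketch below shows one way to do that; `detect_script` is a hypothetical helper, not part of the framework. It identifies the script, which for these four languages coincides with the language.

```python
from collections import Counter

# Unicode blocks for the four major Dravidian scripts.
SCRIPT_RANGES = {
    "kn": (0x0C80, 0x0CFF),  # Kannada
    "ta": (0x0B80, 0x0BFF),  # Tamil
    "te": (0x0C00, 0x0C7F),  # Telugu
    "ml": (0x0D00, 0x0D7F),  # Malayalam
}

def detect_script(text):
    """Return the code whose script block covers the most characters, or None."""
    counts = Counter()
    for ch in text:
        for lang, (start, end) in SCRIPT_RANGES.items():
            if start <= ord(ch) <= end:
                counts[lang] += 1
                break
    return counts.most_common(1)[0][0] if counts else None

print(detect_script("ಇದು ಒಂದು ಚೆಂದದ ದಿನ."))   # kn
print(detect_script("இது ஒரு அழகான நாள்."))  # ta
```

A check like this catches mislabeled samples before they reach the evaluator, which matters because Kannada and Telugu letterforms look alike to the eye but occupy disjoint code point ranges.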

In [None]:
print("🌍 ದ್ರಾವಿಡ ಭಾಷೆಗಳ ಅಂತರ್-ಭಾಷೆ ವಿಶ್ಲೇಷಣೆ (Cross-lingual Analysis with Dravidian Languages)")

# Multilingual data with Dravidian languages
dravidian_data = {
    'kn': {  # Kannada
        'texts': [
            "ಇದು ಒಂದು ಚೆಂದದ ದಿನ.",  # This is a beautiful day
            "ನಾನು ಕನ್ನಡ ಭಾಷೆಯನ್ನು ಪ್ರೀತಿಸುತ್ತೇನೆ.",  # I love Kannada language
            "ತಂತ್ರಜ್ಞಾನ ನಮ್ಮ ಭವಿಷ್ಯ."  # Technology is our future
        ],
        'labels': [1, 1, 1]  # Positive sentiment
    },
    'ta': {  # Tamil
        'texts': [
            "இது ஒரு அழகான நாள்.",  # This is a beautiful day
            "நான் தமிழ் மொழியை நேசிக்கிறேன்.",  # I love Tamil language
            "தொழில்நுட்பம் நமது எதிர்காலம்."  # Technology is our future
        ],
        'labels': [1, 1, 1]  # Positive sentiment
    },
    'te': {  # Telugu
        'texts': [
            "ఇది ఒక అందమైన రోజు.",  # This is a beautiful day
            "నేను తెలుగు భాషను ప్రేమిస్తున్నాను.",  # I love Telugu language
            "సాంకేతికత మన భవిష్యత్తు."  # Technology is our future
        ],
        'labels': [1, 1, 1]  # Positive sentiment
    },
    'ml': {  # Malayalam
        'texts': [
            "ഇത് മനോഹരമായ ദിനമാണ്.",  # This is a beautiful day
            "ഞാൻ മലയാളം ഭാഷയെ സ്നേഹിക്കുന്നു.",  # I love Malayalam language
            "സാങ്കേതികവിദ്യ നമ്മുടെ ഭാവിയാണ്."  # Technology is our future
        ],
        'labels': [1, 1, 1]  # Positive sentiment
    }
}

# Multilingual evaluation
multilingual_results = evaluator.evaluate_multilingual_capabilities(dravidian_data)

print(f"\n🗺️ ಬಹುಭಾಷೆ ಫಲಿತಾಂಶಗಳು (Multilingual Results):")
print(f"  Supported languages: {multilingual_results['supported_languages']}")
print(f"  Consistency score: {multilingual_results['consistency_score']:.4f}")
print(f"  Number of languages: {multilingual_results['num_languages']}")

# Language-specific results
lang_names = {'kn': 'ಕನ್ನಡ', 'ta': 'ತಮಿಳು', 'te': 'ತೆಲುಗು', 'ml': 'ಮಲಯಾಳಂ'}  # display names in Kannada
print(f"\n📊 ಭಾಷಾ-ನಿರ್ದಿಷ್ಟ ಫಲಿತಾಂಶಗಳು (Language-specific Results):")
for lang, results in multilingual_results['language_results'].items():
    print(f"\n  {lang_names.get(lang, lang).upper()} ({lang}):")
    for metric, value in results.items():
        if isinstance(value, (int, float)):
            print(f"    {metric}: {value:.4f}")
        else:
            print(f"    {metric}: {value}")

# Cross-lingual transfer: Kannada → Tamil
print(f"\n🔄 ಅಂತರ್-ಭಾಷೆ ವರ್ಗಾವಣೆ: ಕನ್ನಡ → ತಮಿಳು (Cross-lingual Transfer: Kannada → Tamil)")

cross_lingual_results = evaluator.evaluate_cross_lingual_transfer(
    source_data=dravidian_data['kn'],
    target_data=dravidian_data['ta'],
    source_lang='kn',
    target_lang='ta'
)

print(f"Cross-lingual Transfer Results:")
for metric, value in cross_lingual_results.items():
    if isinstance(value, (int, float)):
        print(f"  {metric}: {value:.4f}")
    else:
        print(f"  {metric}: {value}")

## 7. Kannada Text Analysis and Visualization

Let's create some visualizations specific to Kannada text analysis.
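
One caveat for the word-length statistics in the next cell: Python's `len()` counts Unicode code points, so a conjunct such as ನ್ನ (three code points) inflates the length. Below is a rough sketch of counting aksharas (orthographic syllables) instead. It assumes a simplified rule: every base letter starts a new akshara unless it directly follows a virama (U+0CCD), and combining marks attach to the preceding letter. It ignores ZWJ/ZWNJ and other edge cases, and `count_aksharas` is an illustrative helper, not a framework function.

```python
import unicodedata

VIRAMA = "\u0CCD"  # Kannada sign virama (halant)

def count_aksharas(word: str) -> int:
    """Approximate the number of aksharas in a Kannada word."""
    count = 0
    after_virama = False
    for ch in word:
        if unicodedata.category(ch).startswith("M"):
            # Vowel signs, anusvara, virama: part of the current akshara.
            after_virama = (ch == VIRAMA)
            continue
        if not after_virama:
            count += 1  # a new akshara begins
        after_virama = False
    return count

word = "ಕನ್ನಡ"
print(len(word), count_aksharas(word))  # 5 code points, 3 aksharas
```

Reporting both numbers side by side makes it clear how much of the measured word length comes from vowel signs and conjuncts rather than from additional syllables.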

In [None]:
print("📊 ಕನ್ನಡ ಪಾಠ್ಯ ವಿಶ್ಲೇಷಣೆ ಮತ್ತು ದೃಶ್ಯೀಕರಣ (Kannada Text Analysis and Visualization)")

# Analyze character distribution in Kannada texts
all_kannada_text = ' '.join(processed_df['content'])
char_analysis = {}

# Count different types of characters
kannada_chars = sum(1 for c in all_kannada_text if 0x0C80 <= ord(c) <= 0x0CFF)
latin_chars = sum(1 for c in all_kannada_text if c.isalpha() and ord(c) < 128)
digits = sum(1 for c in all_kannada_text if c.isdigit())
spaces = sum(1 for c in all_kannada_text if c.isspace())
punctuation = sum(1 for c in all_kannada_text if c in '.,!?;:')

char_distribution = {
    'ಕನ್ನಡ ಅಕ್ಷರಗಳು': kannada_chars,       # Kannada characters
    'ಲ್ಯಾಟಿನ್ ಅಕ್ಷರಗಳು': latin_chars,      # Latin characters
    'ಅಂಕೆಗಳು': digits,                    # Digits
    'ಖಾಲಿ ಜಾಗಗಳು': spaces,                # Whitespace
    'ವಿರಾಮ ಚಿಹ್ನೆಗಳು': punctuation          # Punctuation
}

print(f"\n🔤 ಅಕ್ಷರ ವಿತರಣೆ (Character Distribution):")
for char_type, count in char_distribution.items():
    percentage = (count / len(all_kannada_text)) * 100
    print(f"  {char_type}: {count:,} ({percentage:.1f}%)")

# Word length analysis
all_words = []
for text in processed_df['content']:
    words = preprocessor.tokenize_kannada_words(text)
    all_words.extend(words)

word_lengths = [len(word) for word in all_words]

print(f"\n📏 ಪದ ಉದ್ದದ ವಿಶ್ಲೇಷಣೆ (Word Length Analysis):")
print(f"  Total words: {len(all_words):,}")
print(f"  Average word length: {np.mean(word_lengths):.1f} characters")
print(f"  Shortest word: {min(all_words, key=len)} ({len(min(all_words, key=len))} chars)")
print(f"  Longest word: {max(all_words, key=len)} ({len(max(all_words, key=len))} chars)")

# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
fig.suptitle('ಕನ್ನಡ ಪಾಠ್ಯ ವಿಶ್ಲೇಷಣೆ (Kannada Text Analysis)', fontsize=16)

# Character distribution pie chart
axes[0, 0].pie(char_distribution.values(), labels=char_distribution.keys(), autopct='%1.1f%%')
axes[0, 0].set_title('ಅಕ್ಷರ ವಿತರಣೆ (Character Distribution)')

# Word length histogram
axes[0, 1].hist(word_lengths, bins=20, alpha=0.7, color='skyblue')
axes[0, 1].set_title('ಪದ ಉದ್ದದ ವಿತರಣೆ (Word Length Distribution)')
axes[0, 1].set_xlabel('Word Length (characters)')
axes[0, 1].set_ylabel('Frequency')

# Category distribution
category_counts = processed_df['category'].value_counts()
axes[1, 0].bar(category_counts.index, category_counts.values)
axes[1, 0].set_title('ವಿಷಯ ವಿತರಣೆ (Category Distribution)')
axes[1, 0].set_xlabel('Category')
axes[1, 0].set_ylabel('Count')
plt.setp(axes[1, 0].xaxis.get_majorticklabels(), rotation=45)

# Text length distribution
axes[1, 1].hist(processed_df['cleaned_length'], bins=10, alpha=0.7, color='lightgreen')
axes[1, 1].set_title('ಪಾಠ್ಯ ಉದ್ದದ ವಿತರಣೆ (Text Length Distribution)')
axes[1, 1].set_xlabel('Text Length (characters)')
axes[1, 1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# Most common words
from collections import Counter
word_freq = Counter(all_words)
common_words = word_freq.most_common(10)

print(f"\n🔝 ಸಾಮಾನ್ಯ ಪದಗಳು (Most Common Words):")
for word, freq in common_words:
    print(f"  {word}: {freq} times")

## 8. Generate Comprehensive Kannada Report

Let's generate a comprehensive evaluation report for our Kannada model.

In [None]:
# Generate comprehensive evaluation report
print("📋 ಸಮಗ್ರ ಕನ್ನಡ ಮೌಲ್ಯಮಾಪನ ವರದಿ ತಯಾರಿಕೆ... (Generating Comprehensive Kannada Evaluation Report...)")

report = evaluator.generate_evaluation_report(
    output_path='../data/kannada_evaluation_report'
)

print("\n" + "="*80)
print("ಕನ್ನಡ ಭಾಷಾ ಮಾದರಿ ಮೌಲ್ಯಮಾಪನ ವರದಿ (KANNADA LANGUAGE MODEL EVALUATION REPORT)")
print("="*80)
print(report[:1500] + "..." if len(report) > 1500 else report)

print("\n✅ ವರದಿ ಉಳಿಸಲಾಗಿದೆ: ../data/kannada_evaluation_report.json ಮತ್ತು .txt ಫೈಲ್‌ಗಳಲ್ಲಿ")
print("✅ Report saved to: ../data/kannada_evaluation_report.json and .txt files")

## 9. Save Kannada Model

Finally, let's save the Kannada model for future use. Note that the model has only been configured and initialized in this notebook, not trained.

In [None]:
# Save the Kannada-optimized model
kannada_model_path = '../data/models/kannada/kannada_language_model'

print(f"💾 ಕನ್ನಡ ಮಾದರಿಯನ್ನು ಉಳಿಸಲಾಗುತ್ತಿದೆ... (Saving Kannada model to: {kannada_model_path})")

try:
    kannada_model.save_model(kannada_model_path)
    print("✅ ಮಾದರಿ ಯಶಸ್ವಿಯಾಗಿ ಉಳಿಸಲಾಗಿದೆ! (Model saved successfully!)")

    # Test loading the model
    print("🔄 ಮಾದರಿ ಲೋಡ್ ಮಾಡುವಿಕೆಯನ್ನು ಪರೀಕ್ಷಿಸಲಾಗುತ್ತಿದೆ... (Testing model loading...)")
    loaded_model = IndianLanguageModel.load_model(kannada_model_path)
    print("✅ ಮಾದರಿ ಯಶಸ್ವಿಯಾಗಿ ಲೋಡ್ ಆಗಿದೆ! (Model loaded successfully!)")

    # Verify the loaded model works with Kannada
    test_kannada_text = "ಇದು ಒಂದು ಪರೀಕ್ಷಾ ವಾಕ್ಯವಾಗಿದೆ."  # This is a test sentence
    test_embeddings = loaded_model.get_embeddings([test_kannada_text], language='kn')
    print(f"✅ ಲೋಡ್ ಮಾಡಿದ ಮಾದರಿ ಕಾರ್ಯನಿರ್ವಹಿಸುತ್ತಿದೆ! ಎಂಬೆಡಿಂಗ್ ಆಕಾರ: {test_embeddings.shape}")
    print(f"✅ Loaded model working! Embedding shape: {test_embeddings.shape}")

except Exception as e:
    print(f"❌ ಮಾದರಿ ಉಳಿಸುವಲ್ಲಿ/ಲೋಡ್ ಮಾಡುವಲ್ಲಿ ದೋಷ: {e}")
    print(f"❌ Error saving/loading model: {e}")

# Save the processed dataset too (create the target directory if it does not exist)
dataset_path = '../data/processed/kannada_dataset.csv'
Path(dataset_path).parent.mkdir(parents=True, exist_ok=True)
processed_df.to_csv(dataset_path, index=False, encoding='utf-8')
print(f"\n💾 ಪ್ರಕ್ರಿಯೆಗೊಳಿಸಿದ ದತ್ತಸಮೂಹ ಉಳಿಸಲಾಗಿದೆ: {dataset_path}")
print(f"💾 Processed dataset saved to: {dataset_path}")

## ಸಾರಾಂಶ (Summary)

🎉 **ಅಭಿನಂದನೆಗಳು! (Congratulations!)**

You've successfully completed the Kannada language processing workflow:

### ಪೂರ್ಣಗೊಂಡ ಕಾರ್ಯಗಳು (Completed Tasks):

1. ✅ **ದತ್ತ ಸಂಗ್ರಹಣೆ (Data Collection)**: Collected and curated Kannada text from various domains
2. ✅ **ಪ್ರೀ-ಪ್ರೊಸೆಸಿಂಗ್ (Preprocessing)**: Applied Kannada-specific text cleaning and normalization
3. ✅ **ಮಾದರಿ ತರಬೇತಿ (Model Training)**: Configured and initialized model optimized for Kannada
4. ✅ **ಮೌಲ್ಯಮಾಪನ (Evaluation)**: Assessed model performance on Kannada NLP tasks
5. ✅ **ಅಂತರ್-ಭಾಷೆ ವಿಶ್ಲೇಷಣೆ (Cross-lingual Analysis)**: Compared with other Dravidian languages
6. ✅ **ದೃಶ್ಯೀಕರಣ (Visualization)**: Created Kannada text analysis visualizations
7. ✅ **ವರದಿ ತಯಾರಿಕೆ (Report Generation)**: Generated comprehensive evaluation reports
8. ✅ **ಮಾದರಿ ಉಳಿಕೆ (Model Saving)**: Saved the model for future use

### ಮುಂದಿನ ಹಂತಗಳು (Next Steps):

- **ದೊಡ್ಡ ದತ್ತಸಮೂಹದಲ್ಲಿ ತರಬೇತಿ**: Train on larger Kannada datasets
- **ವಿಶೇಷ ಕಾರ್ಯಗಳು**: Implement specific tasks like POS tagging, NER for Kannada
- **ಸಾಹಿತ್ಯ ವಿಶ್ಲೇಷಣೆ**: Analyze classical Kannada literature
- **ಕನ್ನಡ-ಇಂಗ್ಲಿಷ್ ಅನುವಾದ**: Develop translation models
- **ಧ್ವನಿ ಸಂಸ್ಕರಣೆ**: Integrate with Kannada speech processing

### ಸಂಪನ್ಮೂಲಗಳು (Resources):

- [ಕನ್ನಡ ಸಾಹಿತ್ಯ ಪರಿಷತ್ (Kannada Sahitya Parishat)](https://kannadasahityaparishat.org/)
- [ಕನ್ನಡ ವಿಕಿಪೀಡಿಯಾ (Kannada Wikipedia)](https://kn.wikipedia.org/)
- [AI4Bharat Kannada Resources](https://ai4bharat.org/)
- [IndicNLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library)

---

**ಕನ್ನಡ ಭಾಷೆಯ ಡಿಜಿಟಲ್ ಭವಿಷ್ಯದ ನಿರ್ಮಾಣದಲ್ಲಿ ಪಾಲ್ಗೊಳ್ಳಿ!** 🌟

**Join in building the digital future of the Kannada language!** 🇮🇳✨