# Getting Started with Indian Language NLP

This notebook demonstrates how to use the Indian Language NLP framework to:
1. Collect data for underrepresented Indian languages
2. Preprocess and clean the text data
3. Initialize a language model configured for Indian languages (this notebook sets the model up but does not run a training loop)
4. Evaluate model performance across various tasks

## Prerequisites

Make sure you have installed all dependencies:
```bash
pip install -r requirements.txt
```

In [None]:
# Import necessary libraries
import sys
import os
sys.path.append('../src')

import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns

# Import our custom modules
from data_collection import WebScraper
from preprocessing import TextCleaner
from models import IndianLanguageModel
from evaluation import ModelEvaluator

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

print("✅ All imports successful!")

## 1. Data Collection

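Let's start with data collection for Hindi. The cell below configures a scraper and lists two Hindi news sites, but for demonstration it builds sample articles instead of actually scraping.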

In [None]:
# Initialize web scraper for Hindi
scraper = WebScraper(
    language='hi',
    delay=2.0,  # Be respectful to websites
    max_retries=3
)

# Define some Hindi news sites
hindi_news_sites = [
    {'name': 'BBC Hindi', 'url': 'https://www.bbc.com/hindi'},
    {'name': 'NDTV India', 'url': 'https://khabar.ndtv.com/'}
]

print("🔍 Starting data collection for Hindi...")

# Note: This is a demo - be careful with actual scraping
# Make sure to respect robots.txt and terms of service

# For demonstration, we'll create some sample data instead of actual scraping
sample_data = {
    'url': 'https://example.com/hindi-article',
    'title': 'भारत में नई तकनीक का विकास',
    'content': 'भारत में तकनीकी विकास तेजी से बढ़ रहा है। नई खोजें और नवाचार देश को आगे बढ़ाने में मदद कर रहे हैं।',
    'language': 'hi',
    'source': 'Demo'
}

# Create 10 sample articles; copy the dict so each entry is an independent object
sample_articles = [dict(sample_data) for _ in range(10)]

print(f"📰 Collected {len(sample_articles)} sample articles")
print(f"Sample article title: {sample_articles[0]['title']}")
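
Before pointing a scraper at real sites, it helps to check what `robots.txt` allows. The sketch below uses Python's standard `urllib.robotparser`. The rules, the `demo-bot` user agent, and the helper names `build_robot_parser` and `is_allowed` are made up for illustration and are not part of the framework's `WebScraper`. A real run would download the site's own `robots.txt` instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real run would fetch it from the site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_robot_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "demo-bot") -> bool:
    """Return True if the parsed rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


robots = build_robot_parser(EXAMPLE_ROBOTS_TXT)
print(is_allowed(robots, "https://example.com/hindi-article"))  # allowed path
print(is_allowed(robots, "https://example.com/private/draft"))  # disallowed path
print(robots.crawl_delay("demo-bot"))                           # delay in seconds, if declared
```

If a site declares a crawl delay, it is reasonable to use the larger of that value and the scraper's own `delay` setting.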

## 2. Text Preprocessing

Now let's clean and preprocess our Hindi text data.
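
To make the cleaning options used in this section concrete, here is a minimal sketch of what Unicode normalization, digit normalization, and whitespace cleanup could look like. It is not the framework's `TextCleaner`. The function `simple_clean` is hypothetical, and mapping Devanagari digits to ASCII is an assumption about what `normalize_digits` means.

```python
import re
import unicodedata

# Map Devanagari digits (U+0966..U+096F) to ASCII digits 0..9.
# Assumption: "normalize digits" means converting to ASCII.
DEVANAGARI_TO_ASCII_DIGITS = {0x0966 + i: str(i) for i in range(10)}


def simple_clean(text: str) -> str:
    """Illustrative cleaner: NFC-normalize, map digits, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)           # canonical composition
    text = text.translate(DEVANAGARI_TO_ASCII_DIGITS)   # २०२४ -> 2024
    text = re.sub(r"\s+", " ", text).strip()            # tabs/newlines/runs -> one space
    return text


raw = "  भारत   में  २०२४   में\tनई तकनीक  "
print(simple_clean(raw))
```

Normalizing to a single Unicode form before tokenization matters for Indic scripts because visually identical strings can be encoded with different code point sequences.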

In [None]:
# Initialize text cleaner for Hindi
cleaner = TextCleaner(language='hi')

print("🧹 Cleaning text data...")

# Clean the sample text
sample_text = sample_articles[0]['content']
print(f"Original text: {sample_text}")

# Apply various cleaning operations
cleaned_text = cleaner.clean_text(
    sample_text,
    normalize_unicode=True,
    remove_extra_whitespace=True,
    normalize_digits=True
)

print(f"Cleaned text: {cleaned_text}")

# Get text statistics
stats = cleaner.get_text_statistics(sample_text)
print("\n📊 Text Statistics:")
for key, value in stats.items():
    print(f"  {key}: {value}")

# Sentence tokenization
sentences = cleaner.sentence_tokenize(sample_text)
print(f"\n📝 Found {len(sentences)} sentences:")
for i, sentence in enumerate(sentences[:3]):  # Show first 3
    print(f"  {i+1}. {sentence}")
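
Hindi sentences usually end with a danda (`।`) rather than a period, so a sentence splitter has to treat it as a boundary. The sketch below shows one simple regex-based approach. `split_sentences` is a hypothetical helper and may behave differently from `cleaner.sentence_tokenize`.

```python
import re

# Split on whitespace that follows a danda (।), double danda (॥), "?" or "!".
_SENTENCE_BOUNDARY = re.compile(r"(?<=[।॥?!])\s+")


def split_sentences(text: str) -> list:
    """Illustrative splitter for Devanagari text; not the framework's tokenizer."""
    parts = _SENTENCE_BOUNDARY.split(text.strip())
    return [part for part in parts if part]


sample = "भारत में तकनीकी विकास तेजी से बढ़ रहा है। नई खोजें और नवाचार देश को आगे बढ़ाने में मदद कर रहे हैं।"
for i, sentence in enumerate(split_sentences(sample), start=1):
    print(i, sentence)
```

This version only splits when whitespace follows the terminator, so text with no space after a danda stays in one piece. It also does not handle abbreviations.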

## 3. Model Training

Let's create and initialize an Indian language model optimized for Hindi.

In [None]:
# Initialize Indian Language Model for Hindi
print("🤖 Initializing Indian Language Model...")

model = IndianLanguageModel(
    language='hi',
    model_type='bert',
    vocab_size=30000,
    hidden_size=768,
    num_hidden_layers=6,  # Smaller model for demo
    num_attention_heads=12,
    max_position_embeddings=512
)

print("✅ Model initialized successfully!")

# Get model parameter information
param_count = model.get_parameter_count()
print("\n📈 Model Parameters:")
for component, count in param_count.items():
    print(f"  {component}: {count:,} parameters")

print(f"\n🔧 Total parameters: {param_count['total']:,}")

# Test the model with some sample text
sample_texts = [
    "भारत एक महान देश है।",
    "तकनीक का विकास तेजी से हो रहा है।",
    "हमें अपनी भाषाओं को संरक्षित करना चाहिए।"
]

print("\n🧠 Testing model embeddings...")
embeddings = model.get_embeddings(sample_texts, language='hi')
print(f"Generated embeddings shape: {embeddings.shape}")
print(f"Sample embedding (first 5 dimensions): {embeddings[0][:5]}")
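
As a sanity check on the configuration above, the parameter count of a standard BERT-style encoder can be estimated by hand. The sketch below assumes the usual BERT layout: a feed-forward size of 4 × `hidden_size`, two segment types, and LayerNorm after each sub-block. None of these are stated in this notebook, so the number reported by `model.get_parameter_count()` may differ. `bert_encoder_param_count` is a hypothetical helper.

```python
from typing import Optional


def bert_encoder_param_count(
    vocab_size: int,
    hidden_size: int,
    num_hidden_layers: int,
    num_attention_heads: int,
    max_position_embeddings: int,
    intermediate_size: Optional[int] = None,  # assumption: 4 * hidden_size
    type_vocab_size: int = 2,                 # assumption: two segment types
) -> dict:
    """Estimate parameters of a BERT-style encoder (no pooler, no task head)."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    if intermediate_size is None:
        intermediate_size = 4 * hidden_size

    layer_norm = 2 * hidden_size  # scale + bias
    embeddings = (
        vocab_size * hidden_size
        + max_position_embeddings * hidden_size
        + type_vocab_size * hidden_size
        + layer_norm
    )
    # Query, key, value and output projections, each with a bias, plus LayerNorm.
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm
    # Two linear layers with biases, plus LayerNorm.
    feed_forward = (
        hidden_size * intermediate_size + intermediate_size
        + intermediate_size * hidden_size + hidden_size
        + layer_norm
    )
    per_layer = attention + feed_forward
    encoder = num_hidden_layers * per_layer
    return {
        "embeddings": embeddings,
        "per_layer": per_layer,
        "encoder": encoder,
        "total": embeddings + encoder,
    }


estimate = bert_encoder_param_count(
    vocab_size=30000,
    hidden_size=768,
    num_hidden_layers=6,
    num_attention_heads=12,
    max_position_embeddings=512,
)
for component, count in estimate.items():
    print(f"{component}: {count:,}")
```

Under these assumptions the 6-layer configuration comes to roughly 66 million parameters, of which about 23 million sit in the embedding tables.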

## 4. Model Evaluation

Let's evaluate our model using various metrics.
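
For reference, two metrics that classification evaluators commonly report are accuracy and macro-averaged F1. The sketch below computes both in plain Python. The predictions are made up, and whether `ModelEvaluator` reports exactly these metrics is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples where the prediction equals the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


gold_labels = [2, 0, 2, 2, 0]         # same label scheme as this section: 0=neg, 1=neu, 2=pos
hypothetical_preds = [2, 0, 1, 2, 2]  # made-up predictions, for illustration only

print(f"accuracy: {accuracy(gold_labels, hypothetical_preds):.4f}")
print(f"macro F1: {macro_f1(gold_labels, hypothetical_preds, labels=[0, 1, 2]):.4f}")
```

With only five examples, a single wrong prediction moves accuracy by 20 points, so scores on a set this small are only a smoke test.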

In [None]:
# Initialize model evaluator
evaluator = ModelEvaluator(
    model=model,
    language='hi',
    device='cpu'  # Use CPU for demo
)

print("📊 Starting model evaluation...")

# Create some sample evaluation data
eval_texts = [
    "यह एक अच्छी फिल्म है।",
    "मुझे यह किताब पसंद नहीं आई।",
    "आज मौसम बहुत अच्छा है।",
    "मैं खुश हूं।",
    "यह समाचार दुखद है।"
]

# Sample labels for sentiment (0=negative, 1=neutral, 2=positive)
eval_labels = [2, 0, 2, 2, 0]

# Evaluate text classification
classification_results = evaluator.evaluate_text_classification(
    texts=eval_texts,
    labels=eval_labels,
    task_name="sentiment_analysis"
)

print("\n🎯 Classification Results:")
for metric, value in classification_results.items():
    if isinstance(value, (int, float)):
        print(f"  {metric}: {value:.4f}")
    elif metric != 'classification_report':
        print(f"  {metric}: {value}")

# Test semantic similarity
text_pairs = [
    ("मुझे खुशी है", "मैं खुश हूं"),
    ("यह अच्छा है", "यह बुरा है"),
    ("आज बारिश है", "मौसम बहुत अच्छा है")
]

similarity_scores = [0.9, 0.1, 0.5]  # Ground truth similarities

similarity_results = evaluator.evaluate_semantic_similarity(
    text_pairs=text_pairs,
    similarity_scores=similarity_scores
)

print("\n🔄 Semantic Similarity Results:")
for metric, value in similarity_results.items():
    if isinstance(value, (int, float)):
        print(f"  {metric}: {value:.4f}")
    else:
        print(f"  {metric}: {value}")
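
A common way to score semantic similarity is to take the cosine similarity between the two sentence embeddings of each pair and then correlate those scores with the ground-truth values. The sketch below shows that recipe with made-up 3-dimensional vectors. It is an assumption that `evaluate_semantic_similarity` works along these lines.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up "embeddings" for three sentence pairs, for illustration only.
pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),  # same direction
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # orthogonal
    ([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]),  # 45 degrees apart
]
gold = [0.9, 0.1, 0.5]  # ground-truth similarities from the cell above

predicted = [cosine_similarity(u, v) for u, v in pairs]
print("predicted:", [round(score, 4) for score in predicted])
print("pearson:", round(pearson(predicted, gold), 4))
```

Correlation measures whether the model ranks and spaces the pairs like the gold scores do. It does not require the raw cosine values to match the gold values.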

## 5. Cross-lingual Evaluation

Let's test the model's ability to handle multiple Indian languages.

In [None]:
# Create multilingual test data
multilingual_data = {
    'hi': {
        'texts': [
            "यह एक अच्छा दिन है।",
            "मुझे यह पसंद है।",
            "तकनीक बहुत उपयोगी है।"
        ],
        'labels': [2, 2, 2]  # Positive sentiment (2=positive, same scheme as section 4)
    },
    'bn': {
        'texts': [
            "এটি একটি ভাল দিন।",  # This is a good day
            "আমি এটা পছন্দ করি।",  # I like this
            "প্রযুক্তি খুব উপকারী।"  # Technology is very useful
        ],
        'labels': [2, 2, 2]  # Positive sentiment (2=positive, same scheme as section 4)
    }
}

print("🌍 Evaluating multilingual capabilities...")

# Evaluate multilingual performance
multilingual_results = evaluator.evaluate_multilingual_capabilities(
    multilingual_data
)

print("\n🗺️ Multilingual Results:")
print(f"  Supported languages: {multilingual_results['supported_languages']}")
print(f"  Consistency score: {multilingual_results['consistency_score']:.4f}")

# Language-specific results
for lang, results in multilingual_results['language_results'].items():
    print(f"\n  {lang.upper()} Language Results:")
    for metric, value in results.items():
        if isinstance(value, (int, float)):
            print(f"    {metric}: {value:.4f}")
        else:
            print(f"    {metric}: {value}")

# Cross-lingual transfer evaluation
if len(multilingual_data) >= 2:
    source_lang = 'hi'
    target_lang = 'bn'
    
    cross_lingual_results = evaluator.evaluate_cross_lingual_transfer(
        source_data=multilingual_data[source_lang],
        target_data=multilingual_data[target_lang],
        source_lang=source_lang,
        target_lang=target_lang
    )
    
    print(f"\n🔄 Cross-lingual Transfer ({source_lang} → {target_lang}):")
    for metric, value in cross_lingual_results.items():
        if isinstance(value, (int, float)):
            print(f"  {metric}: {value:.4f}")
        else:
            print(f"  {metric}: {value}")
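
When building multilingual test data by hand, it is easy to file a sentence under the wrong language code. One cheap check is to confirm that each text is written in the script expected for its language, using Unicode block ranges. The helper `script_fraction` below is hypothetical and covers only the two languages used in this section.

```python
# Unicode block ranges for the scripts used in this section.
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari
    "bn": (0x0980, 0x09FF),  # Bengali
}

# The danda and double danda live in the Devanagari block but are shared
# by several Indic scripts, including Bengali, so they are not counted.
SHARED_PUNCTUATION = {"\u0964", "\u0965"}


def script_fraction(text: str, lang: str) -> float:
    """Fraction of non-ASCII, non-space characters inside the block for `lang`."""
    low, high = SCRIPT_RANGES[lang]
    chars = [
        ch for ch in text
        if not ch.isspace() and not ch.isascii() and ch not in SHARED_PUNCTUATION
    ]
    if not chars:
        return 0.0
    return sum(low <= ord(ch) <= high for ch in chars) / len(chars)


samples = {
    "hi": "यह एक अच्छा दिन है।",
    "bn": "এটি একটি ভাল দিন।",
}
for lang, text in samples.items():
    print(lang, script_fraction(text, lang))
```

A fraction well below 1.0 for a text's own language code suggests mixed-script content or a mislabeled entry. Note that a script check cannot tell apart languages that share a script, such as Hindi and Marathi.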

## 6. Generate Comprehensive Report

Let's generate a comprehensive evaluation report.

In [None]:
# Generate evaluation report
print("📋 Generating comprehensive evaluation report...")

report = evaluator.generate_evaluation_report(
    output_path='../data/evaluation_report'
)

print("\n" + "="*60)
print("EVALUATION REPORT SUMMARY")
print("="*60)
print((report[:1000] + "...") if len(report) > 1000 else report)

print("\n✅ Full report saved to: ../data/evaluation_report.json and .txt")
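
If you want to write your own results next to the generated report, the sketch below shows one way to save a dictionary as both JSON and plain text. The `save_report` helper and the sample results are hypothetical, and it is an assumption that `generate_evaluation_report` writes its files in a similar way. Passing `ensure_ascii=False` keeps Indic text readable in the JSON file instead of turning it into `\u` escapes.

```python
import json
import tempfile
from pathlib import Path


def save_report(results: dict, output_path: str) -> tuple:
    """Write `results` to <output_path>.json and a summary to <output_path>.txt."""
    base = Path(output_path)
    base.parent.mkdir(parents=True, exist_ok=True)  # create ../data etc. if missing

    json_path = base.with_suffix(".json")
    txt_path = base.with_suffix(".txt")

    # ensure_ascii=False stores Devanagari as-is rather than as \u escapes.
    json_path.write_text(
        json.dumps(results, ensure_ascii=False, indent=2), encoding="utf-8"
    )

    lines = [f"{name}: {value}" for name, value in results.items()]
    txt_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return json_path, txt_path


# Made-up results, for illustration only.
demo_results = {"language": "hi", "task": "sentiment_analysis", "note": "डेमो"}

with tempfile.TemporaryDirectory() as tmp_dir:
    json_file, txt_file = save_report(demo_results, str(Path(tmp_dir) / "evaluation_report"))
    print(json_file.name, txt_file.name)
    print(json.loads(json_file.read_text(encoding="utf-8")) == demo_results)
```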

## 7. Visualization

Create visualizations of the evaluation results.

In [None]:
# Create visualizations
print("📊 Creating evaluation visualizations...")

try:
    evaluator.visualize_results(
        save_path='../data/evaluation_charts.png'
    )
    print("✅ Visualizations saved to: ../data/evaluation_charts.png")
except Exception as e:
    print(f"⚠️ Visualization failed: {e}")
    print("This is normal in some environments - check the saved report files instead.")

## 8. Save Model

Finally, let's save the model for future use and check that it loads back correctly.

In [None]:
# Save the model
model_save_path = '../data/models/hindi_language_model'

print(f"💾 Saving model to: {model_save_path}")

try:
    model.save_model(model_save_path)
    print("✅ Model saved successfully!")
    
    # Test loading the model
    print("🔄 Testing model loading...")
    loaded_model = IndianLanguageModel.load_model(model_save_path)
    print("✅ Model loaded successfully!")
    
    # Verify the loaded model works
    test_text = "यह एक परीक्षा है।"  # This is a test
    embeddings = loaded_model.get_embeddings([test_text], language='hi')
    print(f"✅ Loaded model working! Embedding shape: {embeddings.shape}")
    
except Exception as e:
    print(f"❌ Error saving/loading model: {e}")

## Summary

🎉 **Congratulations!** You've successfully completed the Indian Language NLP workflow:

1. ✅ **Data Collection**: Learned how to collect text data from web sources
2. ✅ **Text Preprocessing**: Cleaned and normalized Indian language text
3. ✅ **Model Training**: Initialized and configured an Indian language model
4. ✅ **Evaluation**: Assessed model performance across various tasks
5. ✅ **Cross-lingual**: Tested multilingual capabilities
6. ✅ **Reporting**: Generated comprehensive evaluation reports
7. ✅ **Visualization**: Created performance charts
8. ✅ **Model Saving**: Saved the model for future use

## Next Steps

- Explore other notebooks in this repository for advanced topics
- Try training on larger datasets
- Experiment with different model architectures
- Test with more Indian languages
- Deploy your model for real-world applications

## Resources

- [IndicNLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library)
- [AI4Bharat](https://ai4bharat.org/)
- [Hugging Face Transformers](https://huggingface.co/transformers/)
- [Indian Language Datasets](http://www.cfilt.iitb.ac.in/)

---

**Happy coding and building bridges between technology and India's linguistic diversity!** 🇮🇳✨