# Tamil Language Detection
## FIRE 2025 - CoLI-Dravidian Shared Task

This notebook implements word-level language identification for Tamil by fine-tuning LaBSE (`setu4993/LaBSE`) as a word classifier.

**Paper:** Transformer Driven Word Level Classification of Dravidian Languages (FIRE 2025)

**Competition Result:**
- Macro F1: **0.7434**
- Rank: **🥇 1st**

In [None]:
# Install required libraries
!pip install transformers==4.10.0
!pip install simpletransformers==0.64.3

In [None]:
import pandas as pd
import numpy as np
import torch
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report
from simpletransformers.classification import ClassificationModel
import matplotlib.pyplot as plt
import seaborn as sns

## 1. Data Loading

In [None]:
# Load training data
df_train = pd.read_csv('../data/tamil/tm_train.csv')
print(f'Training samples: {len(df_train)}')
print(f'\nLabel distribution:\n{df_train["Tag"].value_counts()}')
df_train.head()

In [None]:
# Load validation data
df_eval = pd.read_csv('../data/tamil/tm_val.csv')
print(f'Validation samples: {len(df_eval)}')
df_eval.head()

## 2. Label Encoding

Fit the encoder on the union of training and validation tags so that both splits share a single tag-to-integer mapping.

In [None]:
# Fit label encoder on combined train + eval tags
le = LabelEncoder()
all_tags = pd.concat([df_train['Tag'], df_eval['Tag']]).unique()
le.fit(all_tags)

# Transform training labels
df_train['Tag'] = le.transform(df_train['Tag'])

# Display label mapping
label_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print(f'Label mapping: {label_mapping}')
print(f'Number of labels: {len(le.classes_)} (expected: 8)')
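
To make the mapping concrete, here is a minimal plain-Python sketch of what the encoder does: `LabelEncoder` sorts the distinct tags and assigns each one its index in that order. The tag names and the `build_label_mapping` helper are hypothetical and chosen only for illustration; the real tag set comes from the CSV files.

```python
def build_label_mapping(train_tags, eval_tags):
    """Map each distinct tag to an integer id, in sorted order (as LabelEncoder does)."""
    classes = sorted(set(train_tags) | set(eval_tags))
    tag_to_id = {tag: i for i, tag in enumerate(classes)}
    id_to_tag = {i: tag for tag, i in tag_to_id.items()}
    return tag_to_id, id_to_tag

# Hypothetical tags, for illustration only
train_tags = ["tamil", "english", "mixed", "tamil"]
eval_tags = ["english", "name"]  # "name" appears only in validation

tag_to_id, id_to_tag = build_label_mapping(train_tags, eval_tags)
encoded = [tag_to_id[t] for t in eval_tags]   # tags -> integer ids
decoded = [id_to_tag[i] for i in encoded]     # integer ids -> tags
print(tag_to_id)
print(encoded, decoded)
```

Building the mapping from both splits means a tag that occurs only in validation still receives an id, so encoding the validation set later does not fail on an unseen label.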

## 3. Model Configuration

**Model:** setu4993/LaBSE (Language-agnostic BERT Sentence Embedding)

**Hyperparameters (from paper Section 4.1):**
- Training epochs: **10**
- Batch size: **32**
- Optimizer: Adam (the code below does not set an optimizer, so the simpletransformers default, AdamW, is used)
- Number of labels: **8** (for Tamil)
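
As a rough guide to what these settings imply, the number of optimizer steps follows from the dataset size, batch size, and epoch count. The sketch below assumes no gradient accumulation and that the last partial batch is kept; `training_steps` and the sample count are hypothetical, so substitute `len(df_train)` for a real estimate.

```python
import math

def training_steps(n_samples, batch_size=32, epochs=10):
    """Return (steps per epoch, total steps) for the given settings."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs

# Hypothetical dataset size, for illustration only
steps_per_epoch, total_steps = training_steps(n_samples=20_000)
print(steps_per_epoch, total_steps)
```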

In [None]:
# Clear GPU cache
torch.cuda.empty_cache()

# Initialize model with paper's hyperparameters
model = ClassificationModel(
    'bert',
    'setu4993/LaBSE',
    num_labels=8,
    use_cuda=torch.cuda.is_available(),
    args={
        'reprocess_input_data': True,
        'use_cached_eval_features': False,
        'overwrite_output_dir': True,
        'num_train_epochs': 10,
        'train_batch_size': 32,
        'eval_batch_size': 32,
        'save_steps': 500,
        'logging_steps': 100,
        'output_dir': '../models/tamil/',
        'best_model_dir': '../models/tamil/best_model/'
    }
)

print('Model initialized successfully!')

## 4. Model Training

In [None]:
# Train the model
model.train_model(df_train)
print('Training completed!')

## 5. Model Evaluation

In [None]:
# Transform validation labels
df_eval['Tag'] = le.transform(df_eval['Tag'])

# Evaluate model
result, model_outputs, wrong_predictions = model.eval_model(df_eval)
print(f'\nEvaluation Results: {result}')

## 6. Performance Analysis

In [None]:
# Get predictions
y_true = df_eval['Tag'].values
y_pred = np.argmax(model_outputs, axis=1)

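# Confusion matrix over all encoded classes, including any missing from validation
class_ids = np.arange(len(le.classes_))
cm = confusion_matrix(y_true, y_pred, labels=class_ids)
print('Confusion Matrix:')
print(cm)

# Plot confusion matrix with tag names on both axes
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=le.classes_, yticklabels=le.classes_)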
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Tamil')
plt.savefig('../results/tamil_confusion_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

# Classification Report
print('\nClassification Report:')
print(classification_report(
    y_true, y_pred,
    labels=np.arange(len(le.classes_)),  # keep labels aligned with target_names
    target_names=le.classes_
))
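
Since the competition result is reported as macro F1, it helps to see how that score is assembled: F1 is computed per class and then averaged with equal weight, so rare tags count as much as frequent ones. The following is a minimal plain-Python sketch with hypothetical labels. The `macro_f1` helper is not part of the notebook; in practice `sklearn.metrics.f1_score(y_true, y_pred, average='macro')` should give the same number when every class occurs in the data.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes an F1 of 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical encoded labels: class 0 is frequent, class 2 is rare
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, labels=[0, 1, 2])
print(f"accuracy={accuracy:.4f}  macro_f1={score:.4f}")
```

In this toy case accuracy is well above macro F1 because the rare class is missed entirely, which shows how the two numbers can diverge on an imbalanced tag distribution.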

## 7. Test Set Predictions

In [None]:
# Load test data
df_test = pd.read_csv('../data/tamil/tm_test.csv')
print(f'Test samples: {len(df_test)}')
df_test.head()

In [None]:
# Make predictions
predictions, raw_outputs = model.predict(df_test['Word'].astype(str).tolist())

# Add predictions to dataframe
df_test['Tag'] = predictions

# Inverse transform to get original labels
df_test['Tag'] = le.inverse_transform(df_test['Tag'])

# Display sample predictions
print('Sample predictions:')
print(df_test.head(20))
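
`model.predict` also returns `raw_outputs`, which the cell above does not use. Assuming these are unnormalized per-class scores (logits), a softmax turns each row into probabilities that can serve as a rough confidence estimate. The sketch below uses hypothetical scores and a hypothetical `softmax` helper rather than real model output.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one word across three classes
row = [2.0, 0.5, -1.0]
probs = softmax(row)

# Index of the most probable class and its probability
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, round(probs[predicted], 3))
```

Low-confidence words found this way are natural candidates for manual error analysis, although softmax scores from a fine-tuned classifier are not guaranteed to be well calibrated.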

In [None]:
# Save predictions
df_test.to_csv('../results/tamil_predictions.csv', index=False)
print('Predictions saved to ../results/tamil_predictions.csv')

## Results Summary

**Expected Results (from paper):**
- Macro F1: **0.7434**
- Macro Precision: 0.7696
- Macro Recall: 0.7353
- Accuracy: 0.9249
- **Competition Rank: 🥇 1st**