```
Team Name: RAV
Team Members: VIGNESH J, ASHWATH VINODKUMAR, RAHUL BHARGAV TALLADA
Leaderboard Rank: 50
```

# Soil Classification Challenge - Training Notebook

This notebook contains the training pipeline for the soil classification challenge. It uses a pre-trained CLIP vision transformer (ViT-H-14, loaded through open_clip) to extract features from soil images, then trains a logistic regression classifier on those features.

## 1. Setup and Dependencies

First, install the required packages and import the necessary libraries.

In [None]:
!pip install open-clip-torch pandas pillow scikit-learn --quiet

import sys
import os
sys.path.append('../src')

import open_clip
import torch
import pandas as pd
import numpy as np
from PIL import Image
from pathlib import Path
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Import custom utilities
from preprocessing import get_image_embeddings, generate_text_embeddings
from postprocessing import predict_image, batch_predict

## 2. Configuration

Define the model configuration and parameters.

In [None]:
# Configuration (MODIFY IF NEEDED)
MODEL_NAME = "ViT-H-14"        # High-performance vision transformer
PRETRAINED = "laion2b_s32b_b79k"  # Pretrained weights tag (LAION-2B checkpoint)
BATCH_SIZE = 8                 # Reduce to 4 if OOM errors persist
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
CLASSES = ["Alluvial soil", "Black Soil", "Clay soil", "Red soil"]

print(f"Using device: {DEVICE}")

## 3. Model Loading

Load the pre-trained CLIP model and preprocessor.

In [None]:
# Load model and preprocessor
model, _, preprocess = open_clip.create_model_and_transforms(
    model_name=MODEL_NAME,
    pretrained=PRETRAINED
)
model = model.to(DEVICE).eval()
print(f"Loaded {MODEL_NAME} model with {PRETRAINED} weights")

## 4. Data Loading

Load the training data and metadata.

In [None]:
# Load metadata
# Update paths as needed for your environment
DATA_DIR = Path("../data")
if not DATA_DIR.exists():
    # For Kaggle environment
    DATA_DIR = Path("/kaggle/input/soil-classification/soil_classification-2025")

train_df = pd.read_csv(DATA_DIR / "train_labels.csv")
print(f"Loaded {len(train_df)} training samples")
print(train_df.head())
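
Because the labels are later mapped to integer ids through CLASSES, it can help to confirm that every label in the CSV matches an entry in CLASSES exactly, including capitalization ("Black Soil" is capitalized differently from "Clay soil"). The sketch below is only an illustration: `check_labels` is a hypothetical helper, not part of the project's utilities, and the short label list stands in for `train_df.soil_type.tolist()`.

```python
from collections import Counter

CLASSES = ["Alluvial soil", "Black Soil", "Clay soil", "Red soil"]

def check_labels(labels, classes):
    """Return (per-class counts, sorted list of labels not present in classes)."""
    counts = Counter(labels)
    unknown = sorted(set(counts) - set(classes))
    return {cls: counts.get(cls, 0) for cls in classes}, unknown

# Hypothetical labels standing in for train_df.soil_type.tolist()
sample_labels = ["Red soil", "Black Soil", "Black soil", "Clay soil", "Red soil"]
counts, unknown = check_labels(sample_labels, CLASSES)
print(counts)   # how many samples each known class has
print(unknown)  # labels that would not map to any class id
```

The per-class counts also give a quick view of class imbalance, which is the reason the classifier in Section 8 uses class_weight="balanced".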

## 5. Prompt Engineering

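Define several descriptive text prompts per soil class to support zero-shot classification. Their embeddings are saved alongside the classifier in Section 9 for use at inference; they are not used to train the logistic regression in this notebook.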

In [None]:
# Enhanced prompt engineering
class_prompts = {
    "Alluvial soil": [
        "A high-resolution photo of alluvial soil: light brown, fine-textured, river-deposited",
        "Satellite image showing alluvial plains with fertile soil",
        "Microscopic view of alluvial soil particles"
    ],
    "Black Soil": [
        "Agricultural black soil with high clay content",
        "Vertisol soil cracking in dry conditions",
        "Aerial view of black cotton soil fields"
    ],
    "Clay soil": [
        "Sticky clay soil with poor drainage",
        "Cracked clay surface during drought",
        "Red clay soil with high iron content"
    ],
    "Red soil": [
        "Lateritic red soil in tropical regions",
        "Red earth with visible iron oxide deposits",
        "Terra rossa soil in Mediterranean climate"
    ]
}

## 6. Text Embedding Generation

Precompute text embeddings for each class prompt.

In [None]:
# Precompute text embeddings using utility function
text_embeddings = generate_text_embeddings(model, class_prompts, device=DEVICE)
print(f"Generated text embeddings for {len(text_embeddings)} classes")
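
The implementation of generate_text_embeddings lives in ../src and is not shown here. A common way to combine several prompts per class, assumed here rather than confirmed by the source, is to L2-normalize each prompt embedding, average them, and re-normalize. Zero-shot scores are then a softmax over scaled cosine similarities between an image embedding and the class embeddings. The sketch below uses toy 3-dimensional vectors in place of real CLIP outputs; the function names and the temperature of 100 (a typical CLIP logit scale) are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def ensemble_prompts(prompt_embeddings):
    """Average the normalized prompt embeddings of one class and re-normalize."""
    return l2_normalize(l2_normalize(prompt_embeddings).mean(axis=0))

def zero_shot_probs(image_embedding, class_embeddings, temperature=100.0):
    """Softmax over scaled cosine similarities between one image and each class."""
    logits = temperature * (l2_normalize(image_embedding) @ class_embeddings.T)
    logits = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

# Toy embeddings: three prompts for each of two hypothetical classes
class_a = ensemble_prompts(np.array([[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [1.0, 0.0, 0.0]]))
class_b = ensemble_prompts(np.array([[0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.0, 1.0, 0.0]]))
class_matrix = np.stack([class_a, class_b])

# A toy image embedding that points mostly toward class_a
probs = zero_shot_probs(np.array([0.8, 0.2, 0.0]), class_matrix)
print(probs)
```

Averaging several prompts tends to smooth out the wording of any single prompt, which is the usual motivation for writing more than one prompt per class.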

## 7. Feature Extraction

Extract features from training images.

In [None]:
# Prepare training data
train_images = [DATA_DIR / "train" / img_id for img_id in train_df.image_id]
print(f"Extracting features from {len(train_images)} images...")

# Use utility function for feature extraction
X_train = get_image_embeddings(model, preprocess, train_images, batch_size=BATCH_SIZE, device=DEVICE)
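label_to_idx = {cls: i for i, cls in enumerate(CLASSES)}
y_train = train_df.soil_type.map(label_to_idx)
# map() silently yields NaN for labels missing from CLASSES, so fail early
assert y_train.notna().all(), f"Unknown labels: {train_df.soil_type[y_train.isna()].unique().tolist()}"
y_train = y_train.astype(int).values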
print(f"Feature extraction complete. Shape: {X_train.shape}")
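
The BATCH_SIZE setting exists because encoding all images at once would not fit in GPU memory. The sketch below shows the batching pattern in isolation, under assumptions: `embed_in_batches` and `fake_encoder` are hypothetical names, and whether the project's get_image_embeddings normalizes its output is not stated in this notebook. In the real pipeline the encoder callable would presumably apply preprocess to each image and call the model's image encoder without gradient tracking.

```python
import numpy as np

def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(items, encode_batch, batch_size=8):
    """Encode items batch by batch, L2-normalize each row, and stack the results.

    Assumes `items` is non-empty and `encode_batch` returns one vector per item.
    """
    chunks = []
    for batch in batched(items, batch_size):
        feats = np.asarray(encode_batch(batch), dtype=np.float32)
        norms = np.linalg.norm(feats, axis=1, keepdims=True)
        chunks.append(feats / np.clip(norms, 1e-12, None))
    return np.concatenate(chunks, axis=0)

# Stand-in encoder: maps each "image" (here just an integer) to a 4-d vector
def fake_encoder(batch):
    return [[i + 1.0, 1.0, 0.0, 2.0] for i in batch]

features = embed_in_batches(list(range(19)), fake_encoder, batch_size=8)
print(features.shape)  # (19, 4): the last batch holds the 3 leftover items
```

Peak memory depends on the batch size rather than the dataset size, so lowering BATCH_SIZE trades speed for a smaller footprint without changing the resulting features.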

## 8. Model Training

Hold out a stratified 20% of the data for validation, then train an L2-regularized logistic regression classifier on the remaining extracted features.

In [None]:
# Train classifier
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42
)

clf = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",
    C=0.1,
    penalty="l2",
    random_state=42
)
clf.fit(X_train_split, y_train_split)
val_accuracy = clf.score(X_val_split, y_val_split)
print(f"Validation Accuracy: {val_accuracy:.2%}")
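
A single accuracy figure can hide which soil types get confused with each other. A confusion matrix and per-class recall make that visible. The sketch below computes both with NumPy on toy labels; in this notebook the inputs would be `y_val_split` and `clf.predict(X_val_split)`. The helper names are illustrative, and scikit-learn's `sklearn.metrics.confusion_matrix` provides the same matrix directly.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count predictions; rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def per_class_recall(cm):
    """Fraction of each true class predicted correctly (NaN if a class is absent)."""
    totals = cm.sum(axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(totals > 0, np.diag(cm) / totals, np.nan)

# Toy labels for four classes, standing in for y_val_split and clf.predict(...)
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2, 3, 0]

cm = confusion_matrix(y_true, y_pred, n_classes=4)
recall = per_class_recall(cm)
print(cm)
print(recall)  # [1.  0.5 1.  0.5]
```

In the toy data the overall accuracy is 75%, yet classes 1 and 3 are each recognized only half the time, which is the kind of gap the aggregate number does not show.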

## 9. Model Saving

Save the trained model and embeddings for inference.

In [None]:
# Save model and embeddings
import pickle

# Create output directory if it doesn't exist
os.makedirs('../models', exist_ok=True)

# Save classifier
with open('../models/classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)

# Save text embeddings
torch.save(text_embeddings, '../models/text_embeddings.pt')

print("Model and embeddings saved successfully")
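
Before relying on the saved files at inference time, a quick round trip (save, reload, compare) confirms that the artifact can be read back. The sketch below shows the pattern with a plain dict standing in for the fitted classifier and a temporary directory standing in for ../models; `save_pickle` and `load_pickle` are hypothetical helper names.

```python
import pickle
import tempfile
from pathlib import Path

def save_pickle(obj, path):
    """Pickle `obj` to `path`, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Load a pickled object; only use this on files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)

# A plain dict stands in for the fitted LogisticRegression object
stand_in = {"classes": ["Alluvial soil", "Black Soil", "Clay soil", "Red soil"], "C": 0.1}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "models" / "classifier.pkl"
    save_pickle(stand_in, target)
    restored = load_pickle(target)

print(restored == stand_in)
```

Pickled scikit-learn models are best reloaded with the same scikit-learn version that produced them, so recording the installed version next to the artifact is a reasonable precaution.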

## 10. Training Summary

Summarize the training process and results.

In [None]:
print("Training Summary:")
print(f"- Model: {MODEL_NAME} with {PRETRAINED} weights")
print(f"- Training samples: {len(X_train_split)}")
print(f"- Validation samples: {len(X_val_split)}")
print(f"- Validation accuracy: {val_accuracy:.2%}")
print(f"- Classes: {CLASSES}")