```
Team Name: RAV
Team Members: VIGNESH J, ASHWATH VINODKUMAR, RAHUL BHARGAV TALLADA
Leaderboard Rank: 50
```

# Soil Classification Challenge - Inference Notebook

This notebook contains the inference pipeline for the soil classification challenge. It predicts soil types from images by combining a logistic regression classifier trained on CLIP image embeddings with CLIP zero-shot classification.

## 1. Setup and Dependencies

First, we'll install the required packages and import necessary libraries.

In [None]:
!pip install open-clip-torch pandas pillow scikit-learn --quiet

import sys
import os
sys.path.append('../src')

import open_clip
import torch
import pandas as pd
import numpy as np
from PIL import Image
from pathlib import Path
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Import custom utilities
try:
    from preprocessing import get_image_embeddings, generate_text_embeddings
    from postprocessing import predict_image, batch_predict, create_submission
    print("Successfully imported custom utilities")
except ImportError:
    print("Custom utilities not found, using internal functions")

## 2. Configuration

Define the model configuration and parameters.

In [None]:
# Configuration (MODIFY IF NEEDED)
MODEL_NAME = "ViT-H-14"        # ViT-Huge image encoder, 14x14 patches
PRETRAINED = "laion2b_s32b_b79k"  # open_clip weights tag (trained on LAION-2B)
BATCH_SIZE = 8                 # Reduce to 4 if OOM errors persist
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
CLASSES = ["Alluvial soil", "Black Soil", "Clay soil", "Red soil"]

print(f"Using device: {DEVICE}")

## 3. Model Loading

Load the pre-trained CLIP model and preprocessor.

In [None]:
# Load model and preprocessor
model, _, preprocess = open_clip.create_model_and_transforms(
    model_name=MODEL_NAME,
    pretrained=PRETRAINED
)
model = model.to(DEVICE).eval()
print(f"Loaded {MODEL_NAME} model with {PRETRAINED} weights")

## 4. Data Loading

Load the training and test data.

In [None]:
# Load metadata
DATA_DIR = Path("../data")
if not DATA_DIR.exists():
    # For Kaggle environment
    DATA_DIR = Path("/kaggle/input/soil-classification/soil_classification-2025")

# DATA_DIR already points at the Kaggle path when ../data is missing
train_df = pd.read_csv(DATA_DIR / "train_labels.csv")
test_df = pd.read_csv(DATA_DIR / "test_ids.csv")

print(f"Training samples: {len(train_df)}")
print(f"Test samples: {len(test_df)}")

# Display sample data
print("\nTraining data sample:")
display(train_df.head())
print("\nTest data sample:")
display(test_df.head())
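
The local-versus-Kaggle fallback above can be generalized to any number of candidate locations. The following is a minimal sketch, not part of the original pipeline: `resolve_data_dir` is a hypothetical helper that returns the first candidate directory that exists and fails loudly otherwise, so a missing dataset is reported before `pd.read_csv` is called.

```python
from pathlib import Path
from typing import Iterable, Union


def resolve_data_dir(candidates: Iterable[Union[str, Path]]) -> Path:
    """Return the first candidate that is an existing directory.

    Hypothetical helper for illustration; the notebook itself only
    checks two hard-coded locations.
    """
    tried = []
    for candidate in candidates:
        path = Path(candidate)
        tried.append(str(path))
        if path.is_dir():
            return path
    raise FileNotFoundError(f"No data directory found; tried: {tried}")
```

With this helper, the cell above could set `DATA_DIR = resolve_data_dir(["../data", "/kaggle/input/soil-classification/soil_classification-2025"])`.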

## 5. Prompt Engineering

Define several descriptive text prompts for each soil class. Each class's prompts are later embedded and combined into a single class embedding for zero-shot classification.

In [None]:
# Enhanced prompt engineering
class_prompts = {
    "Alluvial soil": [
        "A high-resolution photo of alluvial soil: light brown, fine-textured, river-deposited",
        "Satellite image showing alluvial plains with fertile soil",
        "Microscopic view of alluvial soil particles"
    ],
    "Black Soil": [
        "Agricultural black soil with high clay content",
        "Vertisol soil cracking in dry conditions",
        "Aerial view of black cotton soil fields"
    ],
    "Clay soil": [
        "Sticky clay soil with poor drainage",
        "Cracked clay surface during drought",
        "Red clay soil with high iron content"
    ],
    "Red soil": [
        "Lateritic red soil in tropical regions",
        "Red earth with visible iron oxide deposits",
        "Terra rossa soil in Mediterranean climate"
    ]
}

print(f"Created prompts for {len(class_prompts)} soil classes")

## 6. Text Embedding Generation

Precompute one text embedding per class. The inline fallback encodes each prompt and averages the prompt embeddings of a class into a single vector.

In [None]:
# Precompute text embeddings
# Use utility function if available, otherwise use inline code
try:
    text_embeddings = generate_text_embeddings(model, class_prompts, device=DEVICE)
except NameError:
    # Fallback if utility function is not available
    with torch.no_grad():
        text_embeddings = {}
        for cls, prompts in class_prompts.items():
            embeddings = []
            for prompt in prompts:
                text = open_clip.tokenize([prompt]).to(DEVICE)
                embeddings.append(model.encode_text(text))
            text_embeddings[cls] = torch.mean(torch.cat(embeddings), dim=0, keepdim=True)
            
print(f"Generated text embeddings for {len(text_embeddings)} classes")
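
The inline fallback averages the raw prompt embeddings. A common variant of prompt ensembling normalizes each prompt embedding to unit length before averaging, so that no single prompt dominates because of its norm, and then renormalizes the mean. The sketch below illustrates that variant on plain Python lists standing in for tensors; `l2_normalize` and `ensemble_prompts` are hypothetical names, and this is not necessarily what `generate_text_embeddings` does.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def ensemble_prompts(prompt_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Normalize each prompt embedding, average them, and renormalize."""
    unit = [l2_normalize(emb) for emb in prompt_embeddings]
    dim = len(unit[0])
    mean = [sum(emb[i] for emb in unit) / len(unit) for i in range(dim)]
    return l2_normalize(mean)
```

For example, `ensemble_prompts([[3.0, 0.0], [0.0, 0.5]])` weights both prompts equally even though their norms differ by a factor of six.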

## 7. Training Data Preparation

Extract image embeddings for the training set and map the soil-type labels to integer class indices.

In [None]:
# Prepare training data
train_images = [DATA_DIR / "train" / img_id for img_id in train_df.image_id]

# Use utility function if available, otherwise use inline code
try:
    X_train = get_image_embeddings(model, preprocess, train_images, batch_size=BATCH_SIZE, device=DEVICE)
except NameError:
    # Fallback if utility function is not available
    def get_image_embeddings_inline(image_paths):
        """Batch processing to prevent OOM errors"""
        embeddings = []
        for i in range(0, len(image_paths), BATCH_SIZE):
            batch_paths = image_paths[i:i+BATCH_SIZE]
            batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in batch_paths])
            
            with torch.no_grad():
                batch = batch.to(DEVICE)
                batch_emb = model.encode_image(batch)
                embeddings.append(batch_emb.cpu().numpy())
            
            # Explicit memory cleanup
            del batch, batch_emb
            torch.cuda.empty_cache()
        
        return np.concatenate(embeddings)
    
    X_train = get_image_embeddings_inline(train_images)

y_train = train_df.soil_type.map({cls:i for i, cls in enumerate(CLASSES)}).values

print(f"Prepared {len(X_train)} training samples with shape {X_train.shape}")
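
The label mapping above relies on the strings in `train_df.soil_type` matching `CLASSES` exactly, including capitalization. With pandas, `Series.map` on a dict returns `NaN` for any value that is not a key, and that would only surface later as an error in training. The sketch below shows a small check that could run before the mapping; `build_label_index` and `find_unmapped_labels` are hypothetical helpers, and the sample labels in the usage note are made up.

```python
from typing import Dict, Iterable, List


def build_label_index(classes: List[str]) -> Dict[str, int]:
    """Map each class name to its position in the class list."""
    return {name: idx for idx, name in enumerate(classes)}


def find_unmapped_labels(labels: Iterable[str], classes: List[str]) -> List[str]:
    """Return the distinct labels that do not appear in the class list."""
    known = set(classes)
    return sorted({label for label in labels if label not in known})
```

Calling `find_unmapped_labels(train_df.soil_type, CLASSES)` and asserting that the result is empty would catch a mismatch such as `"Black soil"` versus `"Black Soil"`.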

## 8. Model Training

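Train a logistic regression classifier on the extracted image embeddings, holding out 20% of the training data (stratified by class) to estimate validation accuracy.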

In [None]:
# Train classifier
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42
)

clf = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",
    C=0.1,
    penalty="l2",
    random_state=42
)
clf.fit(X_train_split, y_train_split)
val_accuracy = clf.score(X_val_split, y_val_split)
print(f"Validation Accuracy: {val_accuracy:.2%}")
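
The classifier above uses `class_weight="balanced"`, which scikit-learn documents as weighting each class by `n_samples / (n_classes * count)`, so rarer classes receive larger weights. The sketch below reproduces that formula in plain Python for illustration; `balanced_class_weights` is a hypothetical helper, and the label counts in the usage note are made up rather than taken from the challenge data.

```python
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(y: Sequence[int]) -> Dict[int, float]:
    """Compute n_samples / (n_classes * count) for each class label."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }
```

For a toy label vector with six samples of class 0 and two of class 1, class 1 gets three times the weight of class 0.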

## 9. Hybrid Prediction Function

Define a function that blends the logistic regression probabilities (weight 0.7) with the zero-shot probabilities (weight 0.3) and returns the class with the highest combined score.

In [None]:
# Hybrid prediction function
# Use utility function if available, otherwise use inline code
if 'predict_image' not in globals():
    def predict_image_inline(image_path):
        """Predict soil class for a single image using hybrid approach"""
        # Get embeddings for logistic regression
        if 'get_image_embeddings' in globals():
            img_emb = get_image_embeddings(model, preprocess, [image_path], batch_size=1, device=DEVICE)
        else:
            img_emb = get_image_embeddings_inline([image_path])
            
        probe_pred = clf.predict_proba(img_emb)
        
        # Get embeddings for zero-shot classification
        image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(DEVICE)
        with torch.no_grad():
            image_features = model.encode_image(image)
            image_features /= image_features.norm(dim=-1, keepdim=True)
            
            zero_shot_probs = []
            for cls in CLASSES:
                text_features = text_embeddings[cls].to(DEVICE)
                text_features /= text_features.norm(dim=-1, keepdim=True)
                zero_shot_probs.append((image_features @ text_features.T).item())
            zero_shot_probs = torch.softmax(torch.tensor(zero_shot_probs), dim=0).numpy()
        
        # Weighted ensemble (70% logistic regression, 30% zero-shot)
        combined_probs = 0.7*probe_pred + 0.3*zero_shot_probs
        return CLASSES[np.argmax(combined_probs)]
    
    # The utility function is not available, so use the inline implementation
    predict_image = predict_image_inline

print("Hybrid prediction function ready")
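
The inline function applies softmax directly to cosine similarities. These lie in [-1, 1], so the resulting distribution is close to uniform and the zero-shot term shifts the blended scores only slightly. CLIP models normally multiply the similarities by a learned logit scale before the softmax (open_clip exposes it as `model.logit_scale`). The sketch below compares both cases and shows the weighted blend; the similarity values and the scale of 100 are illustrative assumptions, and `softmax` and `blend` are hypothetical helpers.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend(probe_probs: Sequence[float],
          zero_shot_probs: Sequence[float],
          probe_weight: float = 0.7) -> List[float]:
    """Weighted average of two probability vectors of equal length."""
    return [
        probe_weight * p + (1.0 - probe_weight) * z
        for p, z in zip(probe_probs, zero_shot_probs)
    ]


# Made-up cosine similarities between one image and the four class embeddings
sims = [0.28, 0.24, 0.22, 0.20]

raw = softmax(sims)                         # softmax on raw similarities
scaled = softmax([100.0 * s for s in sims])  # assumed logit scale of 100
```

With these values the top class gets a probability of roughly 0.26 without scaling and roughly 0.98 with it, so whether to scale changes how much the 30% zero-shot weight can influence the final prediction.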

## 10. Test Set Prediction

Generate predictions for the test set.

In [None]:
# Generate predictions for test set
print(f"Generating predictions for {len(test_df)} test images...")
test_images = [DATA_DIR / "test" / img_id for img_id in test_df.image_id]

# Use batch prediction if available, otherwise predict one by one
if 'batch_predict' in globals():
    predictions = batch_predict(test_images, model, preprocess, clf, text_embeddings, CLASSES, device=DEVICE, batch_size=BATCH_SIZE)
else:
    predictions = [predict_image(img_path) for img_path in test_images]

print(f"Predictions complete for {len(predictions)} images")

## 11. Create Submission

Create a submission file with the predictions.

In [None]:
# Create submission file
output_file = "submission.csv"

# Use utility function if available, otherwise use inline code
if 'create_submission' in globals():
    submission_path = create_submission(test_df, predictions, output_file)
else:
    test_df["soil_type"] = predictions
    test_df.to_csv(output_file, index=False)
    submission_path = output_file
    
    # Print prediction distribution
    print(f"Prediction distribution:\n{test_df['soil_type'].value_counts()}")

print(f"\nSubmission file created at: {submission_path}")
display(test_df.head(10))
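
Before uploading, it can help to check the written file for structural problems. The sketch below assumes the submission needs the `image_id` and `soil_type` columns used in this notebook, one row per test image, and labels drawn from `CLASSES`; the official format may differ, and `validate_submission` is a hypothetical helper.

```python
import csv
import io
from typing import List


def validate_submission(csv_text: str, classes: List[str], expected_rows: int) -> List[str]:
    """Return a list of problems found; an empty list means none were detected."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    problems = [
        f"missing column: {col}"
        for col in ("image_id", "soil_type")
        if col not in fieldnames
    ]
    if problems:
        return problems

    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")

    unknown = sorted({row["soil_type"] for row in rows if row["soil_type"] not in classes})
    if unknown:
        problems.append(f"unknown labels: {unknown}")
    return problems
```

A possible call is `validate_submission(Path(output_file).read_text(), CLASSES, len(test_df))`, followed by printing any reported problems.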

## 12. Conclusion

Summarize the inference process and results.

In [None]:
print("Inference Summary:")
print(f"- Model: {MODEL_NAME} with {PRETRAINED} weights")
print("- Hybrid approach: 70% Logistic Regression, 30% Zero-Shot Classification")
print(f"- Test samples: {len(test_df)}")
print(f"- Validation accuracy: {val_accuracy:.2%}")
print(f"- Classes: {CLASSES}")
print(f"- Submission file: {submission_path}")