# Rice Diseases: Image-to-Graph Conversion

This notebook converts rice disease images to graph structures using superpixel segmentation.

**Output**: Preprocessed graph dataset ready for Graphormer training (fairseq-compatible)

**Dataset**: 4 rice disease classes (BrownSpot, Healthy, Hispa, LeafBlast)

## 1. Setup Environment

Mount Google Drive and clone the Graphormer repository.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

print("✓ Google Drive mounted successfully")

In [None]:
# Clone Graphormer repository (if not already cloned)
import os

if not os.path.exists('/content/Graphormer'):
    !git clone https://github.com/microsoft/Graphormer.git
    print("✓ Cloned Graphormer repository")
else:
    print("✓ Graphormer repository already exists")

%cd /content/Graphormer

## 2. Install Dependencies

Install fairseq, Graphormer, and the dependencies specific to `rice_diseases`.

In [None]:
# Install Graphormer and fairseq
!bash install_updated.sh

# Install rice_diseases specific dependencies
!pip install -q -r examples/rice_diseases/requirements.txt

print("\n" + "=" * 60)
print("✓ All dependencies installed")
print("=" * 60)

In [None]:
# Verify installation
import sys
sys.path.append('/content/Graphormer')

from examples.rice_diseases.colab_setup import verify_installation

verify_installation()

## 3. Copy and Extract Dataset

Copy dataset from Google Drive to local Colab storage and extract.

**Note**: The archive is copied to `/tmp` before extraction, which avoids RAM overflow.
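
As a rough sketch of the copy-then-extract pattern behind this step, using only the standard library: the helper name `copy_then_extract` is hypothetical, and this is not the implementation of `copy_and_extract_dataset`, whose internals this notebook does not show.

```python
import shutil
import zipfile
from pathlib import Path


def copy_then_extract(src_zip, temp_dir, extract_dir):
    """Copy a zip archive to local scratch space, then extract it from there.

    Hypothetical helper for illustration only.
    """
    temp_dir, extract_dir = Path(temp_dir), Path(extract_dir)
    temp_dir.mkdir(parents=True, exist_ok=True)
    extract_dir.mkdir(parents=True, exist_ok=True)

    # Copy the archive off the Drive mount onto local disk first.
    local_zip = temp_dir / Path(src_zip).name
    shutil.copy2(src_zip, local_zip)

    # Extract from the local copy rather than reading through the mount.
    with zipfile.ZipFile(local_zip) as archive:
        archive.extractall(extract_dir)

    local_zip.unlink()  # free the scratch space once extraction is done
    return extract_dir
```

In this sketch the temporary copy is deleted after extraction, so only the extracted files remain on local disk.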

In [None]:
from examples.rice_diseases.colab_setup import copy_and_extract_dataset

# Copy from Drive and extract
data_dir = copy_and_extract_dataset(
    drive_zip_path="MyDrive/Rice_Diseases_Dataset/rice-diseases-image-dataset.zip",
    temp_dir="/tmp",
    extract_dir="/content/rice_diseases_data"
)

print(f"\n✓ Dataset extracted to: {data_dir}")

In [None]:
# Analyze dataset structure
from examples.rice_diseases.colab_setup import get_dataset_structure

dataset_structure = get_dataset_structure(data_dir)

## 4. Convert Images to Graphs

Convert all images to graph structures using superpixel segmentation.

**Results are cached** in `cache_dir`, so subsequent runs skip the conversion unless `force_reprocess=True`.
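
To make the image-to-graph idea concrete, here is a minimal pure-Python sketch that turns a superpixel label map into a graph. It rests on several assumptions that this notebook does not confirm: the label map is taken as given (in practice it would come from a superpixel algorithm such as SLIC), node features are the mean RGB colour of each superpixel, edges link superpixels that share a pixel border, and the edge feature is the Euclidean distance between the two mean colours. The function name `label_map_to_graph` is hypothetical and is not part of `RiceDiseasesGraphDataset`.

```python
from collections import defaultdict


def label_map_to_graph(image, labels):
    """Build a region adjacency graph from an RGB image and a label map.

    image:  H x W nested list of (r, g, b) tuples
    labels: H x W nested list of integer superpixel ids
    Returns (node_features, edges, edge_features).
    """
    height, width = len(labels), len(labels[0])

    # Node features (assumed): mean RGB colour of each superpixel.
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for y in range(height):
        for x in range(width):
            seg = labels[y][x]
            counts[seg] += 1
            for c in range(3):
                sums[seg][c] += image[y][x][c]
    node_features = {
        seg: tuple(total / counts[seg] for total in sums[seg]) for seg in sums
    }

    # Edges (assumed): superpixels that touch under 4-connectivity.
    # Checking the right and down neighbours covers every adjacent pair.
    edges = set()
    for y in range(height):
        for x in range(width):
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < height and nx < width and labels[ny][nx] != labels[y][x]:
                    edges.add(tuple(sorted((labels[y][x], labels[ny][nx]))))
    edges = sorted(edges)

    # Edge features (assumed): Euclidean distance between mean colours.
    edge_features = [
        sum((a - b) ** 2 for a, b in zip(node_features[u], node_features[v])) ** 0.5
        for u, v in edges
    ]
    return node_features, edges, edge_features


# Toy 2x4 image with two superpixels: left half red, right half blue.
image = [[(255, 0, 0)] * 2 + [(0, 0, 255)] * 2 for _ in range(2)]
labels = [[0, 0, 1, 1] for _ in range(2)]
nodes, edges, edge_feats = label_map_to_graph(image, labels)
print(nodes)
print(edges)
print(edge_feats)
```

The toy example yields two nodes joined by a single edge, which mirrors the 3D node features and 1D edge features reported by the inspection cell later in this section.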

In [None]:
from examples.rice_diseases.rice_diseases_dataset import RiceDiseasesGraphDataset

# Create dataset (this processes all images and caches graphs)
dataset = RiceDiseasesGraphDataset(
    data_dir="/content/rice_diseases_data",
    cache_dir="/content/rice_diseases_graphs",
    n_segments=75,  # Number of superpixels per image
    force_reprocess=False,  # Set True to reprocess even if cache exists
    use_labelled=True,  # Use LabelledRice/Labelled structure
    seed=42
)

print("\n" + "=" * 60)
print("✓ Dataset created successfully!")
print("=" * 60)
print(f"Total samples: {len(dataset)}")
print(f"Number of classes: {dataset.num_classes}")
print(f"Train: {len(dataset.train_idx)} samples")
print(f"Valid: {len(dataset.valid_idx)} samples")
print(f"Test: {len(dataset.test_idx)} samples")

In [None]:
# Inspect a sample graph
from examples.rice_diseases.rice_diseases_dataset import CLASS_NAMES

sample_graph = dataset[0]

print("Sample graph structure:")
print("-" * 40)
print(f"Nodes: {sample_graph.x.shape[0]}")
print(f"Edges: {sample_graph.edge_index.shape[1]}")
print(f"Node features shape: {sample_graph.x.shape}  (RGB color)")
print(f"Edge features shape: {sample_graph.edge_attr.shape}  (color difference)")
print(f"Label: {sample_graph.y.item()} ({CLASS_NAMES[sample_graph.y.item()]})")
print("-" * 40)

## 5. Save Preprocessed Dataset

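The dataset was already cached in `/content/rice_diseases_graphs/` when it was created in Section 4, so nothing else needs to be saved. The cell below only lists the cache files.

This cache can be used directly with fairseq for training.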
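
The load-or-build pattern behind such a cache might look like the following sketch. The helper `load_or_build` and its `build_fn` argument are hypothetical; only the `.pkl` file format and the `force_reprocess` flag come from this notebook, and the real dataset class may organise its cache differently.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_path, build_fn, force_reprocess=False):
    """Return cached data if present, otherwise build it and write the cache.

    Hypothetical helper for illustration only.
    """
    cache_path = Path(cache_path)
    if cache_path.exists() and not force_reprocess:
        with cache_path.open("rb") as f:
            return pickle.load(f)

    data = build_fn()  # the expensive image-to-graph conversion would go here
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(data, f)
    return data


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "graphs.pkl"
    first = load_or_build(path, lambda: {"n_graphs": 3})    # builds and writes the cache
    second = load_or_build(path, lambda: {"n_graphs": 99})  # cache hit, build_fn is not called
    rebuilt = load_or_build(path, lambda: {"n_graphs": 99}, force_reprocess=True)
    print(first == second, rebuilt)
```

The second call returns the cached value without rebuilding, while `force_reprocess=True` ignores the cache and overwrites it.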

In [None]:
from pathlib import Path

cache_dir = "/content/rice_diseases_graphs"
# Sort so the listing order is stable between runs
cache_files = sorted(Path(cache_dir).glob("*.pkl"))

print("Preprocessed dataset cache:")
print("-" * 60)
print(f"Cache directory: {cache_dir}")
print(f"Number of cache files: {len(cache_files)}")

for cache_file in cache_files:
    size_mb = cache_file.stat().st_size / (1024 * 1024)
    print(f"  {cache_file.name} ({size_mb:.2f} MB)")

print("-" * 60)
print("\n✓ Preprocessed graphs are ready for Graphormer training!")
print("\nThe dataset is registered with fairseq as 'rice_diseases'.")
print("Example training command:")
print("  fairseq-train --dataset-name rice_diseases --dataset-source pyg ...")

## 6. Generate Visualizations

Create sample visualizations showing the image-to-graph conversion process.

In [None]:
from examples.rice_diseases.visualize_graphs import create_sample_visualizations

# Create visualizations (2 samples per class)
viz_dir = "/content/Graphormer/examples/rice_diseases/visualizations"

saved_files = create_sample_visualizations(
    dataset,
    output_dir=viz_dir,
    samples_per_class=2,
    n_segments=75
)

print(f"\n✓ Created {len(saved_files)} visualizations")
print(f"✓ Saved to: {viz_dir}")

In [None]:
# Display some visualizations
from IPython.display import Image, display
import os

print("Sample visualizations:")
print("=" * 60)

# Show first 4 visualizations
for viz_file in saved_files[:4]:
    if os.path.exists(viz_file):
        print(f"\n{os.path.basename(viz_file)}:")
        display(Image(filename=viz_file, width=800))

## 7. Display Dataset Statistics

Show class distribution and train/val/test split statistics.

In [None]:
from examples.rice_diseases.visualize_graphs import plot_dataset_statistics
import matplotlib.pyplot as plt

# Plot statistics
fig = plot_dataset_statistics(
    dataset,
    save_path="/content/rice_diseases_statistics.png"
)

plt.show()

print("\n✓ Dataset statistics displayed above")

## 8. Summary

### What Was Created

1. **Preprocessed Graph Dataset**: Cached in `/content/rice_diseases_graphs/`
2. **Visualizations**: Sample conversions saved to `/content/Graphormer/examples/rice_diseases/visualizations/`
3. **Dataset Statistics**: Class distribution and split information

### Dataset Information

- **Total samples**: One graph per converted image; the exact count is printed as `Total samples` in Section 4
- **Classes**: 4 (BrownSpot, Healthy, Hispa, LeafBlast)
- **Graph structure**:
  - Nodes: ~75 superpixels per image
  - Node features: RGB color (3D)
  - Edge features: Color difference (1D)
- **Split**: 70% train, 15% validation, 15% test
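
A seeded 70/15/15 split of sample indices could be sketched as follows. This is an assumption about how such a split might be produced, not the code inside `RiceDiseasesGraphDataset`: the function `split_indices` is hypothetical, and the real dataset may, for example, stratify by class.

```python
import random


def split_indices(n_samples, seed=42, train_frac=0.70, valid_frac=0.15):
    """Shuffle indices with a fixed seed and cut them into train/valid/test.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # same seed gives the same split

    n_train = int(n_samples * train_frac)
    n_valid = int(n_samples * valid_frac)
    train_idx = indices[:n_train]
    valid_idx = indices[n_train:n_train + n_valid]
    test_idx = indices[n_train + n_valid:]  # remainder, so no sample is dropped
    return train_idx, valid_idx, test_idx


train_idx, valid_idx, test_idx = split_indices(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Fixing the seed keeps the three subsets identical across runs, which matters when cached graphs are reused for later training.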

### Next Steps

The dataset is now ready for training with Graphormer! The graphs are cached and can be loaded directly by fairseq.

**To train a model**:
1. Pass `--dataset-name rice_diseases` to `fairseq-train`
2. Set `--dataset-source pyg` for PyTorch Geometric compatibility
3. Use `--num-classes 4` for 4-class classification
4. Use `--criterion multiclass_cross_entropy` for classification
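
The four settings above can be collected into an argument list, as in the sketch below. It contains only the flags named in this list; a complete `fairseq-train` call needs further arguments (task, architecture, optimizer and so on) that this notebook does not specify, so the resulting command is incomplete by design.

```python
import shlex

# Only the flags listed above. A complete fairseq-train call needs more
# arguments that this notebook does not specify.
train_args = [
    "fairseq-train",
    "--dataset-name", "rice_diseases",
    "--dataset-source", "pyg",
    "--num-classes", "4",
    "--criterion", "multiclass_cross_entropy",
]
command = shlex.join(train_args)
print(command)
```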

An example training script is planned for `examples/rice_diseases/train.sh`.

In [None]:
print("=" * 80)
print("                    DATA CONVERSION COMPLETE!")
print("=" * 80)
print("\nPreprocessed graphs: /content/rice_diseases_graphs/")
print(f"Visualizations: {viz_dir}")
print("\nDataset is ready for Graphormer training!")
print("=" * 80)