<a href="https://colab.research.google.com/github/fjadidi2001/Cyber-Attack-Detection/blob/main/SatelliteImageEnvironment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vegetation Monitoring with Sentinel-2 RGB Dataset Using Classical Computer Vision

## Project Overview
Develop a vegetation monitoring system using the Sentinel-2 RGB captioned dataset with classical computer vision techniques to analyze vegetation cover, health, and changes over time.

## Dataset Information
**Dataset**: `sshh12/sentinel-2-rgb-captioned` from Hugging Face
- **Content**: Pre-processed Sentinel-2 RGB images with captions
- **Format**: RGB images (Red, Green, Blue bands)
- **Advantages**: Clean, pre-processed data with descriptive captions
- **Focus**: Vegetation analysis using visible spectrum

## Detailed Workflow

### Phase 1: Dataset Setup & Exploration ( 1)
**Objectives:**
- Load and explore the Sentinel-2 RGB dataset
- Understand data structure and captions
- Set up vegetation monitoring framework

**Tasks:**
1. **Dataset Loading**
   ```python
   from datasets import load_dataset
   ds = load_dataset("sshh12/sentinel-2-rgb-captioned")
   ```
2. **Data Exploration**
   - Analyze image dimensions and RGB channel distributions
   - Study caption content for vegetation-related keywords
   - Create sample visualizations of different vegetation types
3. **Environment Setup**
   - Install libraries: OpenCV, scikit-image, matplotlib, pandas, numpy
   - Set up project structure for vegetation analysis

### Phase 2: Vegetation-Focused Preprocessing ( 2)
**Objectives:**
- Enhance RGB images for vegetation analysis
- Extract vegetation-specific features from limited spectral bands

**Classical CV Techniques:**
1. **RGB Enhancement for Vegetation**
   - Histogram equalization on individual channels
   - Contrast Limited Adaptive Histogram Equalization (CLAHE)
   - Color space conversions (RGB → HSV, RGB → LAB)

2. **Vegetation Index Approximation**
   - **Visible Atmospherically Resistant Index (VARI)**: (Green - Red) / (Green + Red - Blue)
   - **Green Leaf Index (GLI)**: (2×Green - Red - Blue) / (2×Green + Red + Blue)
   - **Red-Green Ratio**: Red/Green for vegetation stress detection

3. **Color-Based Vegetation Enhancement**
   - Green channel enhancement
   - Color thresholding for vegetation masking
   - HSV-based vegetation extraction

### Phase 3: Vegetation Feature Extraction ( 3)
**Objectives:**
- Extract vegetation-specific features from RGB imagery

**Classical CV Techniques:**
1. **Color-Based Vegetation Features**
   - **Green Dominance Analysis**: Quantify green pixel distribution
   - **Color Moment Analysis**: Mean, variance, skewness of each channel
   - **Color Histogram Features**: Vegetation-specific color patterns

2. **Texture Analysis for Vegetation**
   - **GLCM on Green Channel**: Vegetation texture characterization
   - **Local Binary Patterns (LBP)**: Forest vs. grassland texture differentiation
   - **Gabor Filters**: Directional texture analysis for crop patterns

3. **Morphological Features**
   - **Vegetation Boundary Detection**: Canny edge detection on green-enhanced images
   - **Shape Analysis**: Contour analysis for vegetation patches
   - **Canopy Structure**: Morphological operations to identify tree crowns

4. **Spatial Vegetation Patterns**
   - **Vegetation Density Maps**: Green pixel density analysis
   - **Patch Size Distribution**: Connected component analysis
   - **Fragmentation Metrics**: Edge-to-area ratios

### Phase 4: Vegetation Classification & Segmentation ( 4)
**Objectives:**
- Classify different vegetation types and health conditions

**Vegetation Categories:**
- Dense Forest
- Sparse Forest/Woodland
- Grassland/Shrubland
- Agricultural Crops
- Stressed/Unhealthy Vegetation
- Non-Vegetation (Urban, Water, Bare Soil)

**Classical CV Techniques:**
1. **Color-Based Segmentation**
   - **K-means Clustering**: Separate vegetation types by color characteristics
   - **HSV Thresholding**: Isolate healthy green vegetation
   - **Watershed Segmentation**: Separate individual vegetation patches

2. **Machine Learning Classification**
   - **Support Vector Machine (SVM)**: Multi-class vegetation classification
   - **Random Forest**: Combine multiple vegetation features
   - **Decision Trees**: Interpretable vegetation health assessment

3. **Rule-Based Classification**
   - **Vegetation Index Thresholding**: VARI and GLI-based classification
   - **Color Rule Sets**: IF-THEN rules for vegetation types
   - **Multi-criteria Decision**: Combine color, texture, and shape features

### Phase 5: Vegetation Health Assessment ( 5)
**Objectives:**
- Assess vegetation health and stress conditions

**Classical CV Approaches:**
1. **Health Indicators from RGB**
   - **Greenness Assessment**: Green channel intensity analysis
   - **Color Deviation Analysis**: Deviation from healthy vegetation colors
   - **Browning Detection**: Red/Brown pixel identification for stress

2. **Vegetation Vigor Analysis**
   - **VARI Trend Analysis**: Vegetation activity and vigor
   - **Seasonal Color Changes**: Multi-temporal color analysis
   - **Stress Pattern Recognition**: Identify yellowing/browning patterns

3. **Canopy Analysis**
   - **Canopy Coverage**: Percentage of vegetation cover
   - **Canopy Density**: Pixel intensity-based density estimation
   - **Gap Analysis**: Identify clearings and deforestation

### Phase 6: Temporal Vegetation Analysis ( 6)
**Objectives:**
- Monitor vegetation changes over time using available temporal data

**Change Detection Methods:**
1. **RGB-Based Change Detection**
   - **Image Differencing**: Compare vegetation indices across time
   - **Color Change Analysis**: Track color shifts indicating phenology
   - **Threshold-Based Change**: Binary change detection

2. **Vegetation Trend Analysis**
   - **Greenness Trends**: Long-term vegetation health trends
   - **Seasonal Pattern Recognition**: Identify phenological cycles
   - **Disturbance Detection**: Identify sudden vegetation loss

### Phase 7: Validation & Results ( 7)
**Objectives:**
- Validate results and create comprehensive vegetation analysis

**Validation Methods:**
1. **Caption-Based Validation**
   - Use image captions to validate classification results
   - Cross-reference vegetation descriptions with analysis
   - Accuracy assessment using caption keywords

2. **Visual Validation**
   - Expert interpretation of results
   - Comparison with known vegetation patterns
   - Ground truth validation where available

## Technical Implementation Stack

### Core Libraries
```python
# Data handling
from datasets import load_dataset
import pandas as pd
import numpy as np

# Image processing
import cv2
from skimage import filters, segmentation, measure, morphology
from skimage.feature import graycomatrix, graycoprops, local_binary_pattern

# Machine learning
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
```

### Key Vegetation Algorithms
1. **VARI Calculation**: `(Green - Red) / (Green + Red - Blue)`
2. **GLI Calculation**: `(2*Green - Red - Blue) / (2*Green + Red + Blue)`
3. **Green Dominance**: `Green / (Red + Green + Blue)`
4. **Vegetation Masking**: HSV-based green extraction
5. **Canopy Coverage**: Green pixel percentage calculation

## Vegetation-Specific Features to Extract

### Color Features
- Mean, std, skewness of R, G, B channels
- VARI and GLI vegetation indices
- Green dominance ratio
- HSV color moments
- Color histogram bins

### Texture Features
- GLCM properties (contrast, dissimilarity, homogeneity, energy)
- LBP histogram for vegetation texture
- Gabor filter responses for directional patterns

### Morphological Features
- Vegetation patch area and perimeter
- Compactness and roundness of vegetation areas
- Edge density within vegetation regions

## Expected Vegetation Classification Results

### Vegetation Types to Identify
1. **Dense Forest**: High green intensity, coarse texture
2. **Open Woodland**: Moderate green, mixed texture
3. **Grassland**: Uniform green, fine texture
4. **Cropland**: Regular patterns, seasonal color changes
5. **Stressed Vegetation**: Yellow/brown tones, reduced green intensity
6. **Mixed Vegetation**: Varied color and texture patterns

### Performance Metrics
- Overall classification accuracy > 80%
- Vegetation vs. non-vegetation accuracy > 90%
- Healthy vs. stressed vegetation accuracy > 75%
- F1-score per vegetation class > 0.7

## Sample Code Structure

```python
# 1. Dataset loading and exploration
ds = load_dataset("sshh12/sentinel-2-rgb-captioned")
explore_vegetation_dataset(ds)

# 2. Preprocessing
enhanced_images = preprocess_for_vegetation(ds['image'])
vegetation_indices = calculate_vegetation_indices(enhanced_images)

# 3. Feature extraction
color_features = extract_color_features(enhanced_images)
texture_features = extract_texture_features(enhanced_images)
vegetation_features = combine_features(color_features, texture_features, vegetation_indices)

# 4. Classification
vegetation_classifier = train_vegetation_classifier(vegetation_features, labels)
vegetation_map = classify_vegetation(test_images)

# 5. Analysis and visualization
analyze_vegetation_health(vegetation_map)
create_vegetation_visualizations(results)
```

## Final Deliverables
1. **Vegetation Classification System**: Automated vegetation type identification
2. **Vegetation Health Assessment Tool**: RGB-based health monitoring
3. **Vegetation Coverage Analysis**: Quantitative vegetation coverage metrics
4. **Temporal Vegetation Monitoring**: Change detection capabilities
5. **Comprehensive Report**: Methodology, results, and vegetation insights
6. **Interactive Visualizations**: Vegetation maps and health indicators

## Success Criteria
- Accurate vegetation type classification using only RGB data
- Effective vegetation health assessment from color analysis
- Reliable vegetation change detection over time
- Clear visualization of vegetation patterns and trends
- Validation against image captions and expert knowledge

In [2]:
!pip install mlcroissant

Collecting mlcroissant
  Downloading mlcroissant-1.0.17-py2.py3-none-any.whl.metadata (10 kB)
Collecting jsonpath-rw (from mlcroissant)
  Downloading jsonpath-rw-1.4.0.tar.gz (13 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting rdflib (from mlcroissant)
  Downloading rdflib-7.1.4-py3-none-any.whl.metadata (11 kB)
Downloading mlcroissant-1.0.17-py2.py3-none-any.whl (141 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m141.4/141.4 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading rdflib-7.1.4-py3-none-any.whl (565 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m565.1/565.1 kB[0m [31m28.1 MB/s[0m eta [36m0:00:00[0m
[?25hBuilding wheels for collected packages: jsonpath-rw
  Building wheel for jsonpath-rw (setup.py) ... [?25l[?25hdone
  Created wheel for jsonpath-rw: filename=jsonpath_rw-1.4.0-py3-none-any.whl size=15127 sha256=ed64276709c68af7de0fd34ad26873f225111d948da1816ee465b7a7a7d5f403
  Stored in directory: /

In [1]:
from mlcroissant import Dataset

# The Croissant metadata exposes the first 5GB of this dataset
ds = Dataset(jsonld="https://huggingface.co/api/datasets/sshh12/sentinel-2-rgb-captioned/croissant")
records = ds.records("default")

ModuleNotFoundError: No module named 'mlcroissant'

In [1]:
# =============================================================================
# VEGETATION MONITORING WITH SENTINEL-2 RGB DATASET - PHASES 1-3
# Google Colab Notebook
# =============================================================================

#%% CELL 1: Environment Setup and Installations
"""
Install required packages for vegetation monitoring project
"""
!pip install datasets transformers opencv-python scikit-image matplotlib seaborn pandas numpy scikit-learn pillow
!apt-get update
!apt-get install libgl1-mesa-glx -y

# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
from PIL import Image
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

print("✅ Environment setup complete!")

#%% CELL 2: Dataset Loading and Initial Setup
"""
PHASE 1: Dataset Setup & Exploration
Load and explore the Sentinel-2 RGB dataset
"""
from datasets import load_dataset
import random
from collections import Counter

print("🌿 PHASE 1: Dataset Setup & Exploration")
print("="*50)

# Load the Sentinel-2 RGB captioned dataset
print("Loading Sentinel-2 RGB dataset...")
try:
    ds = load_dataset("sshh12/sentinel-2-rgb-captioned", split="train[:1000]")  # Load first 1000 samples
    print(f"✅ Dataset loaded successfully!")
    print(f"📊 Dataset size: {len(ds)} samples")
except Exception as e:
    print(f"⚠️ Error loading dataset: {e}")
    print("Please check your internet connection and try again.")

# Display dataset structure
print("\n📋 Dataset Structure:")
print(f"Features: {ds.features}")
print(f"Column names: {ds.column_names}")

#%% CELL 3: Dataset Exploration and Analysis
"""
Explore dataset content and analyze vegetation-related information
"""
print("\n🔍 Dataset Content Analysis:")
print("="*40)

# Sample a few examples
sample_indices = random.sample(range(len(ds)), min(5, len(ds)))
print(f"Analyzing {len(sample_indices)} random samples...\n")

vegetation_keywords = ['forest', 'tree', 'vegetation', 'green', 'grass', 'crop', 'field', 'plant', 'canopy', 'agricultural']
caption_analysis = []

for i, idx in enumerate(sample_indices):
    sample = ds[idx]
    image = sample['image']
    caption = sample.get('text', 'No caption available')

    print(f"Sample {i+1} (Index {idx}):")
    print(f"  Caption: {caption}")
    print(f"  Image size: {image.size}")
    print(f"  Image mode: {image.mode}")

    # Check for vegetation keywords in caption
    veg_found = [kw for kw in vegetation_keywords if kw.lower() in caption.lower()]
    if veg_found:
        print(f"  🌱 Vegetation keywords found: {veg_found}")

    caption_analysis.append({
        'index': idx,
        'caption': caption,
        'vegetation_keywords': veg_found,
        'image_size': image.size
    })
    print()

# Create analysis dataframe
analysis_df = pd.DataFrame(caption_analysis)
print(f"📈 Analysis complete for {len(analysis_df)} samples")

#%% CELL 4: Vegetation Keywords Analysis
"""
Analyze captions for vegetation-related content
"""
print("\n🏷️ Caption Analysis for Vegetation Content:")
print("="*45)

# Collect all captions
all_captions = [ds[i]['text'] for i in range(min(500, len(ds)))]  # Analyze first 500 captions

# Count vegetation keywords
vegetation_counts = Counter()
for caption in all_captions:
    for keyword in vegetation_keywords:
        if keyword.lower() in caption.lower():
            vegetation_counts[keyword] += 1

print("🌿 Vegetation keyword frequency:")
for keyword, count in vegetation_counts.most_common():
    percentage = (count / len(all_captions)) * 100
    print(f"  {keyword}: {count} occurrences ({percentage:.1f}%)")

# Visualize keyword frequency
if vegetation_counts:
    plt.figure(figsize=(12, 6))
    keywords = list(vegetation_counts.keys())
    counts = list(vegetation_counts.values())

    plt.bar(keywords, counts, color='green', alpha=0.7)
    plt.title('Vegetation Keywords Frequency in Captions', fontsize=14, fontweight='bold')
    plt.xlabel('Keywords')
    plt.ylabel('Frequency')
    plt.xticks(rotation=45)
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.show()

#%% CELL 5: Image Visualization Setup
"""
Create visualization functions for RGB images and analysis
"""
def display_rgb_images(dataset, indices, figsize=(15, 10)):
    """Display multiple RGB images with their captions"""
    n_images = len(indices)
    cols = min(3, n_images)
    rows = (n_images + cols - 1) // cols

    fig, axes = plt.subplots(rows, cols, figsize=figsize)
    if n_images == 1:
        axes = [axes]
    elif rows == 1:
        axes = axes.reshape(1, -1)

    for i, idx in enumerate(indices):
        row = i // cols
        col = i % cols

        sample = dataset[idx]
        image = np.array(sample['image'])
        caption = sample.get('text', 'No caption')

        if rows > 1:
            ax = axes[row, col]
        else:
            ax = axes[col]

        ax.imshow(image)
        ax.set_title(f'Sample {idx}\n{caption[:60]}...', fontsize=10)
        ax.axis('off')

    # Hide empty subplots
    for i in range(n_images, rows * cols):
        row = i // cols
        col = i % cols
        if rows > 1:
            axes[row, col].axis('off')
        else:
            axes[col].axis('off')

    plt.tight_layout()
    plt.show()

def analyze_rgb_channels(image_array):
    """Analyze RGB channel statistics"""
    r_channel = image_array[:, :, 0]
    g_channel = image_array[:, :, 1]
    b_channel = image_array[:, :, 2]

    stats = {
        'Red': {'mean': np.mean(r_channel), 'std': np.std(r_channel), 'max': np.max(r_channel)},
        'Green': {'mean': np.mean(g_channel), 'std': np.std(g_channel), 'max': np.max(g_channel)},
        'Blue': {'mean': np.mean(b_channel), 'std': np.std(b_channel), 'max': np.max(b_channel)}
    }
    return stats

print("✅ Visualization functions created!")

#%% CELL 6: Sample Image Display and Analysis
"""
Display sample images and analyze their RGB characteristics
"""
print("\n🖼️ Sample Image Display and RGB Analysis:")
print("="*45)

# Select diverse samples for display
display_indices = random.sample(range(len(ds)), min(6, len(ds)))
print(f"Displaying {len(display_indices)} sample images...")

# Display images
display_rgb_images(ds, display_indices)

# Analyze RGB characteristics of sample images
print("\n📊 RGB Channel Analysis:")
rgb_stats_all = []

for i, idx in enumerate(display_indices[:3]):  # Analyze first 3 images
    sample = ds[idx]
    image_array = np.array(sample['image'])
    caption = sample.get('text', 'No caption')

    stats = analyze_rgb_channels(image_array)
    rgb_stats_all.append({
        'sample_id': idx,
        'caption': caption[:50] + '...',
        'stats': stats
    })

    print(f"\nSample {idx}:")
    print(f"Caption: {caption[:50]}...")
    for channel, values in stats.items():
        print(f"  {channel}: Mean={values['mean']:.1f}, Std={values['std']:.1f}, Max={values['max']}")

print(f"\n✅ Phase 1 Complete: Dataset loaded and explored!")

#%% CELL 7: PHASE 2 - Vegetation-Focused Preprocessing Setup
"""
PHASE 2: Vegetation-Focused Preprocessing
Set up image enhancement and vegetation index calculation functions
"""
print("\n🌱 PHASE 2: Vegetation-Focused Preprocessing")
print("="*50)

def enhance_rgb_for_vegetation(image_array):
    """
    Enhance RGB image for vegetation analysis
    """
    # Convert to float for processing
    img_float = image_array.astype(np.float32) / 255.0

    # Histogram equalization on individual channels
    img_eq = np.zeros_like(img_float)
    for i in range(3):
        # Convert back to uint8 for CLAHE
        channel_uint8 = (img_float[:,:,i] * 255).astype(np.uint8)
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
        img_eq[:,:,i] = clahe.apply(channel_uint8) / 255.0

    return img_eq

def calculate_vegetation_indices(image_array):
    """
    Calculate vegetation indices from RGB image
    """
    # Normalize to 0-1 range
    img = image_array.astype(np.float32) / 255.0

    red = img[:,:,0]
    green = img[:,:,1]
    blue = img[:,:,2]

    # Avoid division by zero
    epsilon = 1e-8

    # Visible Atmospherically Resistant Index (VARI)
    denominator_vari = green + red - blue + epsilon
    vari = (green - red) / denominator_vari

    # Green Leaf Index (GLI)
    denominator_gli = 2*green + red + blue + epsilon
    gli = (2*green - red - blue) / denominator_gli

    # Red-Green Ratio
    rg_ratio = red / (green + epsilon)

    # Green Dominance
    total_intensity = red + green + blue + epsilon
    green_dominance = green / total_intensity

    return {
        'vari': vari,
        'gli': gli,
        'rg_ratio': rg_ratio,
        'green_dominance': green_dominance
    }

def convert_color_spaces(image_array):
    """
    Convert RGB to different color spaces for vegetation analysis
    """
    # Convert to HSV
    hsv = cv2.cvtColor(image_array, cv2.COLOR_RGB2HSV)

    # Convert to LAB
    lab = cv2.cvtColor(image_array, cv2.COLOR_RGB2LAB)

    return {
        'hsv': hsv,
        'lab': lab
    }

def create_vegetation_mask(image_array, method='hsv'):
    """
    Create vegetation mask using color thresholding
    """
    if method == 'hsv':
        hsv = cv2.cvtColor(image_array, cv2.COLOR_RGB2HSV)
        # Define range for green colors (vegetation)
        lower_green = np.array([35, 30, 30])
        upper_green = np.array([85, 255, 255])
        mask = cv2.inRange(hsv, lower_green, upper_green)

    elif method == 'green_threshold':
        # Simple green channel thresholding
        green_channel = image_array[:,:,1]
        red_channel = image_array[:,:,0]
        blue_channel = image_array[:,:,2]

        # Vegetation typically has high green and lower red/blue
        mask = (green_channel > red_channel) & (green_channel > blue_channel) & (green_channel > 100)
        mask = mask.astype(np.uint8) * 255

    return mask

print("✅ Vegetation preprocessing functions created!")

#%% CELL 8: Apply Preprocessing to Sample Images
"""
Apply vegetation-focused preprocessing to sample images
"""
print("\n🔧 Applying Preprocessing to Sample Images:")
print("="*45)

# Select a sample for preprocessing demonstration
sample_idx = random.choice(range(len(ds)))
sample = ds[sample_idx]
original_image = np.array(sample['image'])
caption = sample.get('text', 'No caption')

print(f"Processing sample {sample_idx}: {caption[:50]}...")

# Apply preprocessing steps
enhanced_image = enhance_rgb_for_vegetation(original_image)
vegetation_indices = calculate_vegetation_indices(original_image)
color_spaces = convert_color_spaces(original_image)
vegetation_mask = create_vegetation_mask(original_image, method='hsv')

# Create visualization
fig, axes = plt.subplots(2, 4, figsize=(16, 8))

# Original image
axes[0,0].imshow(original_image)
axes[0,0].set_title('Original RGB')
axes[0,0].axis('off')

# Enhanced image
axes[0,1].imshow(enhanced_image)
axes[0,1].set_title('Enhanced RGB')
axes[0,1].axis('off')

# HSV
axes[0,2].imshow(color_spaces['hsv'])
axes[0,2].set_title('HSV Color Space')
axes[0,2].axis('off')

# Vegetation mask
axes[0,3].imshow(vegetation_mask, cmap='gray')
axes[0,3].set_title('Vegetation Mask')
axes[0,3].axis('off')

# Vegetation indices
im1 = axes[1,0].imshow(vegetation_indices['vari'], cmap='RdYlGn', vmin=-1, vmax=1)
axes[1,0].set_title('VARI Index')
axes[1,0].axis('off')
plt.colorbar(im1, ax=axes[1,0], fraction=0.046)

im2 = axes[1,1].imshow(vegetation_indices['gli'], cmap='RdYlGn', vmin=-1, vmax=1)
axes[1,1].set_title('GLI Index')
axes[1,1].axis('off')
plt.colorbar(im2, ax=axes[1,1], fraction=0.046)

im3 = axes[1,2].imshow(vegetation_indices['green_dominance'], cmap='Greens', vmin=0, vmax=1)
axes[1,2].set_title('Green Dominance')
axes[1,2].axis('off')
plt.colorbar(im3, ax=axes[1,2], fraction=0.046)

im4 = axes[1,3].imshow(vegetation_indices['rg_ratio'], cmap='RdYlGn_r', vmin=0, vmax=2)
axes[1,3].set_title('Red/Green Ratio')
axes[1,3].axis('off')
plt.colorbar(im4, ax=axes[1,3], fraction=0.046)

plt.suptitle(f'Vegetation Preprocessing Results\nSample {sample_idx}: {caption[:60]}...', fontsize=14)
plt.tight_layout()
plt.show()

# Print vegetation index statistics
print(f"\n📊 Vegetation Index Statistics for Sample {sample_idx}:")
for index_name, index_values in vegetation_indices.items():
    print(f"  {index_name.upper()}: Mean={np.mean(index_values):.3f}, Std={np.std(index_values):.3f}")

print(f"\n✅ Phase 2 Complete: Preprocessing applied successfully!")

#%% CELL 9: PHASE 3 - Feature Extraction Setup
"""
PHASE 3: Vegetation Feature Extraction
Set up feature extraction functions for vegetation analysis
"""
from skimage.feature import graycomatrix, graycoprops, local_binary_pattern
from skimage import measure
from scipy import ndimage

print("\n🔬 PHASE 3: Vegetation Feature Extraction")
print("="*50)

def extract_color_features(image_array):
    """
    Extract color-based features for vegetation analysis
    """
    # Convert to float
    img = image_array.astype(np.float32) / 255.0

    features = {}

    # Basic color statistics for each channel
    for i, channel in enumerate(['red', 'green', 'blue']):
        channel_data = img[:,:,i]
        features[f'{channel}_mean'] = np.mean(channel_data)
        features[f'{channel}_std'] = np.std(channel_data)
        features[f'{channel}_skewness'] = scipy.stats.skew(channel_data.flatten()) if 'scipy.stats' in globals() else 0
        features[f'{channel}_max'] = np.max(channel_data)
        features[f'{channel}_min'] = np.min(channel_data)

    # Color ratios important for vegetation
    red, green, blue = img[:,:,0], img[:,:,1], img[:,:,2]
    epsilon = 1e-8

    features['green_red_ratio'] = np.mean(green / (red + epsilon))
    features['green_blue_ratio'] = np.mean(green / (blue + epsilon))
    features['red_green_ratio'] = np.mean(red / (green + epsilon))

    # Green dominance
    total_intensity = red + green + blue + epsilon
    features['green_dominance_mean'] = np.mean(green / total_intensity)
    features['green_dominance_std'] = np.std(green / total_intensity)

    return features

def extract_vegetation_indices_features(image_array):
    """
    Extract features based on vegetation indices
    """
    indices = calculate_vegetation_indices(image_array)
    features = {}

    for index_name, index_values in indices.items():
        features[f'{index_name}_mean'] = np.mean(index_values)
        features[f'{index_name}_std'] = np.std(index_values)
        features[f'{index_name}_median'] = np.median(index_values)
        features[f'{index_name}_range'] = np.max(index_values) - np.min(index_values)

        # Percentiles
        features[f'{index_name}_p25'] = np.percentile(index_values, 25)
        features[f'{index_name}_p75'] = np.percentile(index_values, 75)

    return features

def extract_texture_features(image_array):
    """
    Extract texture features for vegetation analysis
    """
    # Convert to grayscale (using green channel as it's most relevant for vegetation)
    if len(image_array.shape) == 3:
        gray = image_array[:,:,1]  # Green channel
    else:
        gray = image_array

    # Normalize to 0-255 range for texture analysis
    gray_norm = ((gray - gray.min()) / (gray.max() - gray.min()) * 255).astype(np.uint8)

    features = {}

    # GLCM texture features
    try:
        # Calculate GLCM
        distances = [1, 2]
        angles = [0, np.pi/4, np.pi/2, 3*np.pi/4]

        glcm = graycomatrix(gray_norm, distances=distances, angles=angles,
                           levels=256, symmetric=True, normed=True)

        # Calculate texture properties
        properties = ['dissimilarity', 'correlation', 'homogeneity', 'contrast', 'energy']

        for prop in properties:
            prop_values = graycoprops(glcm, prop)
            features[f'glcm_{prop}_mean'] = np.mean(prop_values)
            features[f'glcm_{prop}_std'] = np.std(prop_values)

    except Exception as e:
        print(f"Warning: GLCM calculation failed: {e}")
        # Set default values
        for prop in ['dissimilarity', 'correlation', 'homogeneity', 'contrast', 'energy']:
            features[f'glcm_{prop}_mean'] = 0
            features[f'glcm_{prop}_std'] = 0

    # Local Binary Pattern
    try:
        radius = 3
        n_points = 8 * radius
        lbp = local_binary_pattern(gray_norm, n_points, radius, method='uniform')

        # LBP histogram
        lbp_hist, _ = np.histogram(lbp.ravel(), bins=n_points + 2,
                                  range=(0, n_points + 2), density=True)

        features['lbp_uniformity'] = np.sum(lbp_hist ** 2)
        features['lbp_entropy'] = -np.sum(lbp_hist * np.log(lbp_hist + 1e-8))

    except Exception as e:
        print(f"Warning: LBP calculation failed: {e}")
        features['lbp_uniformity'] = 0
        features['lbp_entropy'] = 0

    return features

def extract_morphological_features(image_array):
    """
    Extract morphological features for vegetation analysis
    """
    # Create vegetation mask
    veg_mask = create_vegetation_mask(image_array, method='hsv')

    features = {}

    # Basic morphological properties
    features['vegetation_coverage'] = np.sum(veg_mask > 0) / veg_mask.size

    try:
        # Connected components analysis
        labeled_mask = measure.label(veg_mask > 0)
        props = measure.regionprops(labeled_mask)

        if props:
            areas = [prop.area for prop in props]
            features['num_vegetation_patches'] = len(props)
            features['mean_patch_area'] = np.mean(areas)
            features['std_patch_area'] = np.std(areas)
            features['largest_patch_area'] = np.max(areas)

            # Shape features
            eccentricities = [prop.eccentricity for prop in props]
            solidities = [prop.solidity for prop in props]

            features['mean_eccentricity'] = np.mean(eccentricities)
            features['mean_solidity'] = np.mean(solidities)
        else:
            # No vegetation patches found
            features['num_vegetation_patches'] = 0
            features['mean_patch_area'] = 0
            features['std_patch_area'] = 0
            features['largest_patch_area'] = 0
            features['mean_eccentricity'] = 0
            features['mean_solidity'] = 0

    except Exception as e:
        print(f"Warning: Morphological analysis failed: {e}")
        # Set default values
        for key in ['num_vegetation_patches', 'mean_patch_area', 'std_patch_area',
                   'largest_patch_area', 'mean_eccentricity', 'mean_solidity']:
            features[key] = 0

    return features

def extract_all_vegetation_features(image_array):
    """
    Extract all vegetation-related features from an image
    """
    color_features = extract_color_features(image_array)
    vegetation_features = extract_vegetation_indices_features(image_array)
    texture_features = extract_texture_features(image_array)
    morphological_features = extract_morphological_features(image_array)

    # Combine all features
    all_features = {}
    all_features.update(color_features)
    all_features.update(vegetation_features)
    all_features.update(texture_features)
    all_features.update(morphological_features)

    return all_features

# Import scipy.stats if available
try:
    import scipy.stats
    print("✅ SciPy available for advanced statistics")
except ImportError:
    print("⚠️ SciPy not available, skipping skewness calculation")

print("✅ Feature extraction functions created!")

#%% CELL 10: Extract Features from Sample Images
"""
Extract features from sample images to demonstrate the feature extraction pipeline
"""
print("\n🔍 Extracting Features from Sample Images:")
print("="*45)

# Select samples for feature extraction
feature_sample_indices = random.sample(range(len(ds)), min(5, len(ds)))
all_sample_features = []

print(f"Extracting features from {len(feature_sample_indices)} samples...\n")

for i, idx in enumerate(feature_sample_indices):
    print(f"Processing sample {i+1}/{len(feature_sample_indices)} (Index {idx})...")

    sample = ds[idx]
    image_array = np.array(sample['image'])
    caption = sample.get('text', 'No caption')

    # Extract all features
    features = extract_all_vegetation_features(image_array)
    features['sample_id'] = idx
    features['caption'] = caption

    all_sample_features.append(features)

    print(f"  ✅ Extracted {len(features)-2} features")  # -2 for sample_id and caption

# Create feature dataframe
feature_df = pd.DataFrame(all_sample_features)
print(f"\n📊 Feature Extraction Summary:")
print(f"  Total samples processed: {len(feature_df)}")
print(f"  Total features per sample: {len(feature_df.columns)-2}")

# Display feature categories
color_features = [col for col in feature_df.columns if any(color in col for color in ['red', 'green', 'blue', 'ratio', 'dominance'])]
vegetation_features = [col for col in feature_df.columns if any(vi in col for vi in ['vari', 'gli', 'rg_ratio', 'green_dominance'])]
texture_features = [col for col in feature_df.columns if any(tex in col for tex in ['glcm', 'lbp'])]
morphological_features = [col for col in feature_df.columns if any(morph in col for morph in ['patch', 'coverage', 'eccentricity', 'solidity'])]

print(f"\n🎨 Feature Categories:")
print(f"  Color features: {len(color_features)}")
print(f"  Vegetation index features: {len(vegetation_features)}")
print(f"  Texture features: {len(texture_features)}")
print(f"  Morphological features: {len(morphological_features)}")

#%% CELL 11: Feature Analysis and Visualization
"""
Analyze and visualize the extracted features
"""
print("\n📈 Feature Analysis and Visualization:")
print("="*45)

# Select key features for visualization
key_features = ['green_mean', 'vari_mean', 'gli_mean', 'vegetation_coverage',
                'glcm_contrast_mean', 'num_vegetation_patches']

# Check which features are available
available_key_features = [f for f in key_features if f in feature_df.columns]
print(f"Analyzing {len(available_key_features)} key features: {available_key_features}")

if available_key_features:
    # Create correlation matrix for key features
    correlation_matrix = feature_df[available_key_features].corr()

    # Visualize feature distributions and correlations
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))

    # Feature distributions
    feature_df[available_key_features].hist(bins=20, ax=axes[0,0], figsize=(8, 6))
    axes[0,0].set_title('Feature Distributions')

    # Correlation heatmap
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
                ax=axes[0,1], square=True)
    axes[0,1].set_title('Feature Correlations')

    # Vegetation coverage vs other features
    if 'vegetation_coverage' in available_key_features and len(available_key_features) > 1:
        other_feature = [f for f in available_key_features if f != 'vegetation_coverage'][0]
        axes[1,0].scatter(feature_df['vegetation_coverage'], feature_df[other_feature], alpha=0.7)
        axes[1,0].set_xlabel('Vegetation Coverage')
        axes[1,0].set_ylabel(other_feature.replace('_', ' ').title())
        axes[1,0].set_title(f'Vegetation Coverage vs {other_feature.replace("_", " ").title()}')

    # Feature summary statistics
    summary_stats = feature_df[available_key_features].describe()
    axes[1,1].axis('tight')
    axes[1,1].axis('off')
    table = axes[1,1].table(cellText=summary_stats.round(3).values,
                           rowLabels=summary_stats.index,
                           colLabels=[col.replace('_', '\n') for col in summary_stats.columns],
                           cellLoc='center',
                           loc='center')
    table.auto_set_font_size(False)
    table.set_fontsize(8)
    axes[1,1].set_title('Feature Summary Statistics')

    plt.tight_layout()
    plt.show()

# Display detailed feature statistics
print(f"\n📊 Detailed Feature Statistics:")
numeric_features = feature_df.select_dtypes(include=[np.number]).columns
if len(numeric_features) > 0:
    print(f"Mean feature values across {len(feature_df)} samples:")
    feature_means = feature_df[numeric_features].mean().sort_values(ascending=False)

    for feature, value in feature_means.head(10).items():
        print(f"  {feature}: {value:.4f}")

print(f"\n✅ Phase 3 Complete: Feature extraction and analysis finished!")

#%% CELL 12: Summary and Next Steps
"""
Summary of Phases 1-3 and preparation for next phases
"""
print("\n🎯 PROJECT SUMMARY - PHASES 1-3 COMPLETE")
print("="*60)

print("✅ PHASE 1 - Dataset Setup & Exploration:")
print(f"   • Loaded Sentinel-2 RGB dataset with {len(ds)} samples")
print(f"   • Analyzed vegetation keywords in captions")
print(f"   • Explored RGB image characteristics")

print("\n✅ PHASE 2 - Vegetation-Focused Preprocessing:")
print("   • Implemented RGB enhancement for vegetation analysis")
print("   • Created vegetation indices (VARI, GLI, RG-ratio, Green dominance)")
print("   • Developed color space conversions (HSV, LAB)")
print("   • Created vegetation masking functions")

print("\n✅ PHASE 3 - Vegetation Feature Extraction:")
print("   • Extracted color-based features (RGB statistics, ratios)")
print("   • Calculated vegetation index features (statistics, percentiles)")
print("   • Implemented texture analysis (GLCM, LBP)")
print("   • Developed morphological features (patch analysis, coverage)")

# Create feature summary for export
if len(all_sample_features) > 0:
    print(f"\n📋 FEATURE EXTRACTION RESULTS:")
    print(f"   • Total samples processed: {len(all_sample_features)}")
    print(f"   • Features per sample: {len(all_sample_features[0])-2}")
    print(f"   • Feature categories: Color, Vegetation Indices, Texture, Morphological")

print(f"\n🚀 READY FOR NEXT PHASES:")
print("   • Phase 4: Vegetation Classification & Segmentation")
print("   • Phase 5: Vegetation Health Assessment")
print("   • Phase 6: Temporal Vegetation Analysis")
print("   • Phase 7: Validation & Results")

# Save processed data for next phases
processed_data = {
    'dataset_size': len(ds),
    'sample_features': all_sample_features,
    'feature_columns': list(feature_df.columns) if 'feature_df' in locals() else [],
    'vegetation_keywords_found': dict(vegetation_counts) if 'vegetation_counts' in locals() else {}
}

print(f"\n💾 Data prepared for phases 4-7:")
print(f"   • {len(processed_data['sample_features'])} samples with extracted features")
print(f"   • {len(processed_data['feature_columns'])} feature types identified")
print(f"   • Vegetation analysis pipeline established")

print(f"\n" + "="*60)
print("🌿 VEGETATION MONITORING PROJECT - PHASES 1-3 COMPLETE! 🌿")
print("="*60)

Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Get:6 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease [18.1 kB]
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Get:11 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 Packages [1,553 kB]
Get:12 http://security.ubuntu.com/ubuntu jammy-security/universe amd64 Packages [1,246 kB]
Get:13 http://archive.ubuntu.com/ubuntu jammy-updates/main amd

Downloading readme:   0%|          | 0.00/880 [00:00<?, ?B/s]

⚠️ Error loading dataset: Invalid pattern: '**' can only be an entire path component
Please check your internet connection and try again.

📋 Dataset Structure:


NameError: name 'ds' is not defined