# GreenCast Data Preprocessing Pipeline

This notebook contains the complete data preprocessing pipeline for the GreenCast agriculture AI project.

## Dataset Overview
- **PlantVillage Dataset**: Plant disease images in color, grayscale, and segmented formats
- **Data2**: Organized crop datasets with Train/Test/Val splits

## Pipeline Steps
1. Dataset exploration and analysis
2. Image preprocessing and augmentation
3. Data cleaning and validation
4. Feature extraction
5. Dataset splitting and organization
6. Export processed data

## 1. Import Libraries and Setup

In [None]:
# Core libraries
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Image processing
import cv2
from PIL import Image, ImageEnhance, ImageFilter
import imageio
from skimage import io, transform, filters, exposure
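from skimage.feature import local_binary_pattern
try:
    # scikit-image 0.19 renamed these to the "gray" spelling and later removed the old names;
    # alias them so code that uses the old names keeps working
    from skimage.feature import graycomatrix as greycomatrix, graycoprops as greycoprops
except ImportError:
    # Older scikit-image releases only provide the "grey" spelling
    from skimage.feature import greycomatrix, greycoprops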
from skimage.measure import label, regionprops

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

# Deep Learning
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator, load_img, img_to_array
from tensorflow.keras.applications import VGG16, ResNet50, EfficientNetB0
from tensorflow.keras.utils import to_categorical

# Utilities
import json
import pickle
from datetime import datetime
import logging
from tqdm import tqdm
import glob
from collections import Counter, defaultdict

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Configure matplotlib
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 8)

print("Libraries imported successfully!")
print(f"TensorFlow version: {tf.__version__}")
print(f"OpenCV version: {cv2.__version__}")
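
The seeding above covers NumPy and TensorFlow but not Python's built-in `random` module, which shuffling code often relies on. The following is a minimal sketch of a single helper that seeds whichever of these libraries are already loaded. The name `seed_everything` is hypothetical and not part of this project or of any library.

```python
import os
import random
import sys


def seed_everything(seed: int = 42) -> None:
    """Seed Python's RNG plus NumPy/TensorFlow if they are already imported.

    Hypothetical helper for illustration only.
    """
    # Affects only subprocesses started later; the hash seed of the
    # current interpreter is fixed at startup.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

    # Look libraries up in sys.modules so this helper never triggers
    # a heavy import by itself.
    np_mod = sys.modules.get("numpy")
    if np_mod is not None:
        np_mod.random.seed(seed)

    tf_mod = sys.modules.get("tensorflow")
    if tf_mod is not None:
        tf_mod.random.set_seed(seed)


seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
print(first == second)  # True: reseeding reproduces the same draws
```

Seeding makes shuffles and augmentations repeatable within one environment. It does not by itself guarantee bit-identical training runs, for example on GPUs with non-deterministic kernels.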

## 2. Configuration and Paths

In [None]:
# Project paths
PROJECT_ROOT = Path('/Users/debabratapattnayak/web-dev/greencast')
DATASET_PATH = PROJECT_ROOT / 'dataset'
PLANTVILLAGE_PATH = DATASET_PATH / 'plantvillage dataset'
DATA2_PATH = DATASET_PATH / 'data2'
PROCESSED_DATA_PATH = PROJECT_ROOT / 'processed_data'
MODELS_PATH = PROJECT_ROOT / 'models'

# Create output directories (and any missing parents) if they don't exist
PROCESSED_DATA_PATH.mkdir(parents=True, exist_ok=True)
MODELS_PATH.mkdir(parents=True, exist_ok=True)

# Image processing configuration
IMG_CONFIG = {
    'target_size': (224, 224),  # Default ImageNet input size for VGG16, ResNet50, and EfficientNetB0
    'batch_size': 32,
    'color_mode': 'rgb',
    'class_mode': 'categorical',
    'validation_split': 0.2,
    'test_split': 0.1
}

# Data augmentation parameters
AUGMENTATION_CONFIG = {
    'rotation_range': 20,
    'width_shift_range': 0.2,
    'height_shift_range': 0.2,
    'shear_range': 0.2,
    'zoom_range': 0.2,
    'horizontal_flip': True,
    'fill_mode': 'nearest',
    'brightness_range': [0.8, 1.2]
}

print(f"Dataset path: {DATASET_PATH}")
print(f"PlantVillage path: {PLANTVILLAGE_PATH}")
print(f"Data2 path: {DATA2_PATH}")
print(f"Processed data will be saved to: {PROCESSED_DATA_PATH}")
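
To make the `validation_split` and `test_split` values in `IMG_CONFIG` concrete, here is a minimal sketch of how they could translate into train/validation/test index sets. It assumes both fractions are measured against the full dataset, which the configuration itself does not state. The helper `split_indices` is hypothetical and uses only the standard library.

```python
import random


def split_indices(n_samples, validation_split=0.2, test_split=0.1, seed=42):
    """Shuffle range(n_samples) and cut it into train/val/test index lists.

    Assumption: both fractions refer to the full dataset, not to the
    remainder left after an earlier split.
    """
    if not 0 <= validation_split + test_split < 1:
        raise ValueError("validation_split + test_split must be in [0, 1)")

    indices = list(range(n_samples))
    # A dedicated Random instance keeps the shuffle reproducible without
    # touching the global RNG state.
    random.Random(seed).shuffle(indices)

    n_val = round(n_samples * validation_split)
    n_test = round(n_samples * test_split)

    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 200 100
```

This sketch ignores class labels. For imbalanced disease classes, a stratified split (for instance via the `stratify` argument of scikit-learn's `train_test_split`, imported above) would likely be the better choice.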