<a href="https://colab.research.google.com/github/Uzmafaheem/EcoSort-Waste-Management-Assistant/blob/main/Copy_of_waste_management_summative.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EcoSort Waste Management Assistant
# Module 8 Summative Lab

## Overview

You are a data scientist at "EcoSort," a technology company that specializes in developing AI solutions for waste management. EcoSort has partnered with Metro City's waste management department to develop an intelligent waste management assistant that can help residents properly dispose of waste items so less time is spent sorting material at facilities.

This assistant needs to:

1. Identify waste materials from images uploaded by residents (CNN)
2. Classify waste items based on text descriptions provided by residents (RNN/Transformer)
3. Generate specific recycling instructions based on identified waste type and city policies (Generative Transformer with RAG)

Your task is to build this integrated system using the RealWaste dataset along with generated text data that simulates real-world waste management operations.
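
The three capabilities listed above can be pictured as a thin routing layer that picks a classifier based on the input a resident provides, then asks the instruction generator for guidance. The sketch below is illustrative only: `route_request` and the stand-in lambdas are hypothetical names, not part of the lab template, and the real models are built in Parts 2 to 4.

```python
from typing import Callable, Optional


def route_request(
    image_path: Optional[str],
    description: Optional[str],
    classify_image: Callable[[str], str],
    classify_text: Callable[[str], str],
    generate_instructions: Callable[[str], str],
) -> dict:
    """Pick a classifier based on the input provided, then request instructions."""
    if image_path is not None:
        category = classify_image(image_path)
        source = "image"
    elif description is not None:
        category = classify_text(description)
        source = "text"
    else:
        raise ValueError("Provide an image path or a text description.")
    return {
        "source": source,
        "category": category,
        "instructions": generate_instructions(category),
    }


# Stand-in components (hypothetical) so the sketch runs without trained models.
result = route_request(
    image_path=None,
    description="empty aluminum soda can",
    classify_image=lambda path: "Metal",
    classify_text=lambda text: "Metal",
    generate_instructions=lambda cat: f"Look up the city policy for: {cat}",
)
print(result)
```

Passing the components in as callables keeps the routing logic testable before any model is trained.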

## Part 1: Dataset Exploration and Preparation

In this section, you will explore and prepare the datasets for your models.

### 1.1 Load and Explore the RealWaste Dataset

In [None]:
# Import necessary libraries
import os
import json
import numpy as np
import pandas as pd
import tensorflow as tf
import random
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
from PIL import Image
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)
random.seed(42)

In [None]:
# Mount Google Drive once; the dataset paths below assume this mount point
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# List the top-level contents of the mounted Drive
print(os.listdir("/content/drive"))

In [None]:
image_dir = "/content/drive/MyDrive/realwaste-main/RealWaste/"
description_file = "/content/waste_descriptions.csv"
policy_file = "/content/waste_policy_documents.json"

In [None]:
categories = os.listdir(image_dir)
print("Waste categories:", categories)

In [None]:
# Paths
image_dir = "/content/drive/MyDrive/realwaste-main/RealWaste/"
description_file = "/content/waste_descriptions.csv"
policy_file = "/content/waste_policy_documents.json"

# Load text descriptions
descriptions_df = pd.read_csv(description_file)
print("Sample descriptions:")
display(descriptions_df.head())

# Load policy documents
with open(policy_file, 'r') as f:
    policies = json.load(f)
print("Sample policy document:")
print(json.dumps(policies[0], indent=2))

# Explore image dataset
categories = os.listdir(image_dir)
categories = [cat for cat in categories if os.path.isdir(os.path.join(image_dir, cat))]
print("Waste categories:", categories)

# Count images per category
category_counts = {cat: len(os.listdir(os.path.join(image_dir, cat))) for cat in categories}
plt.figure(figsize=(10,6))
sns.barplot(x=list(category_counts.keys()), y=list(category_counts.values()))
plt.title("Number of Images per Waste Category")
plt.ylabel("Count")
plt.xlabel("Category")
plt.show()

# Sample images inspection
for cat in categories[:3]:  # just sample first 3 categories
    sample_img_name = os.listdir(os.path.join(image_dir, cat))[0]
    img = Image.open(os.path.join(image_dir, cat, sample_img_name))
    print(f"Category: {cat}, Image: {sample_img_name}, Size: {img.size}, Mode: {img.mode}")
    plt.imshow(img)
    plt.axis('off')
    plt.show()

### 1.2 Explore Text Datasets

In [None]:
# Load CSV
descriptions_df = pd.read_csv(description_file)

# Quick look at the data
print("First 5 rows of waste_descriptions.csv:")
print(descriptions_df.head())

# Info about columns and missing values
print("\nDataFrame info:")
print(descriptions_df.info())

# Check for missing values
print("\nMissing values per column:")
print(descriptions_df.isnull().sum())


plt.figure(figsize=(10,6))
sns.countplot(y='category', data=descriptions_df, order=descriptions_df['category'].value_counts().index)
plt.title("Distribution of Waste Categories in Text Descriptions")
plt.xlabel("Count")
plt.ylabel("Waste Category")
plt.show()


num_descriptions = len(descriptions_df)
print(f"Total number of waste descriptions: {num_descriptions}")

# Average length of descriptions (in words)
descriptions_df['text_length'] = descriptions_df['description'].apply(lambda x: len(str(x).split()))
avg_length = descriptions_df['text_length'].mean()
print(f"Average description length (words): {avg_length:.2f}")

# Vocabulary size (approximate)
from collections import Counter
all_words = ' '.join(descriptions_df['description'].astype(str)).lower().split()
vocab = Counter(all_words)
print(f"Approximate vocabulary size: {len(vocab)}")

# Most common words
print("Top 20 most common words:")
print(vocab.most_common(20))


In [None]:
import json

with open(policy_file, 'r') as f:
    policies = json.load(f)

# Number of documents
print(f"Total number of policy documents: {len(policies)}")

# Preview the first document
print("\nSample policy document (formatted):")
print(json.dumps(policies[0], indent=2))


#  Understand document structure

# Assuming each policy document is a dictionary with keys like 'title', 'content', 'category', etc.
keys = set()
for doc in policies:
    keys.update(doc.keys())
print("\nAll keys present in policy documents:", keys)

# Check distribution of categories (documents without a 'category' key are counted as 'Unknown').
# A separate name is used so the image `categories` list from section 1.1 is not overwritten.
policy_categories = [doc.get('category', 'Unknown') for doc in policies]

cat_df = pd.DataFrame({'category': policy_categories})
plt.figure(figsize=(10,6))
sns.countplot(y='category', data=cat_df, order=cat_df['category'].value_counts().index)
plt.title("Distribution of Policy Document Categories")
plt.xlabel("Count")
plt.ylabel("Category")
plt.show()


#  Analyze text length and content

# Length of each policy text in words.
# The text is read from 'document_text' (the key used in section 1.3), falling back to 'content'.
policy_texts_raw = [str(doc.get('document_text', doc.get('content', ''))) for doc in policies]
content_lengths = [len(text.split()) for text in policy_texts_raw]
print(f"Average length of policy content (words): {sum(content_lengths)/len(content_lengths):.2f}")
print(f"Minimum content length: {min(content_lengths)}, Maximum: {max(content_lengths)}")

# Print a short excerpt from the first few documents
print("\nPolicy excerpts:")
for i, content in enumerate(policy_texts_raw[:3]):
    print(f"\nDocument {i+1} excerpt:", content[:300], "...")


### 1.3 Create Data Pipelines

In [None]:
# Run this code to set up the images as train, validation, and test sets
# Set your data directory path - update this with your actual path
import pathlib
data_dir = pathlib.Path("/content/drive/MyDrive/realwaste-main/RealWaste")

#data_dir = pathlib.Path('RealWaste')

# Parameters
BATCH_SIZE = 32
IMG_HEIGHT = 224
IMG_WIDTH = 224

# Calculate the total number of classes automatically from the directory structure
num_classes = len([item for item in data_dir.glob('*') if item.is_dir()])
print(f"Number of classes: {num_classes}")

# List all class folders
class_names = sorted([item.name for item in data_dir.glob('*') if item.is_dir()])
print(f"Class names: {class_names}")

# Count all images
image_count = len(list(data_dir.glob('*/*.jpg'))) + len(list(data_dir.glob('*/*.png')))
print(f"Total images found: {image_count}")

# Create a dataset using tf.keras.utils.image_dataset_from_directory
# This will automatically split the data into training and validation sets
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,  # 20% for validation
    subset="training",
    seed=42,
    image_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=BATCH_SIZE,
    label_mode='categorical',  # For one-hot encoded labels
    shuffle=True
)

validation_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,  # 20% for validation
    subset="validation",
    seed=42,
    image_size=(IMG_HEIGHT, IMG_WIDTH),
    batch_size=BATCH_SIZE,
    label_mode='categorical',  # For one-hot encoded labels
    shuffle=True
)

# Create a separate test dataset by taking part of the validation set
# First, let's get the number of batches in the validation set
val_batches = tf.data.experimental.cardinality(validation_ds)
test_dataset = validation_ds.take(val_batches // 2)
validation_ds = validation_ds.skip(val_batches // 2)

print(f"Number of training batches: {tf.data.experimental.cardinality(train_ds)}")
print(f"Number of validation batches: {tf.data.experimental.cardinality(validation_ds)}")
print(f"Number of test batches: {tf.data.experimental.cardinality(test_dataset)}")

# Configure dataset for performance
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
validation_ds = validation_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_dataset = test_dataset.cache().prefetch(buffer_size=AUTOTUNE)
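
The split above works on batch counts: 20% of the images are held out, and half of the held-out batches become the test set. The arithmetic can be sketched as follows. This is a rough illustration that assumes the held-out count is `int(validation_split * n_images)`; the helper name `split_batch_counts` and the image count of 1000 are hypothetical.

```python
import math


def split_batch_counts(n_images, validation_split=0.2, batch_size=32):
    """Return (train, validation, test) batch counts for the take/skip split."""
    n_heldout = int(n_images * validation_split)
    n_train = n_images - n_heldout
    train_batches = math.ceil(n_train / batch_size)
    heldout_batches = math.ceil(n_heldout / batch_size)
    test_batches = heldout_batches // 2          # validation_ds.take(...)
    val_batches = heldout_batches - test_batches  # validation_ds.skip(...)
    return train_batches, val_batches, test_batches


counts = split_batch_counts(1000)
print(counts)  # (25, 4, 3)
```

Because the split is done on whole batches, the validation and test sets may differ in size by up to one batch.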

In [None]:
descriptions_df = pd.read_csv(description_file)
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Quick check
print(descriptions_df.head())


#  Text cleaning function

def clean_text(text):
    text = str(text).lower()  # lowercase
    text = re.sub(r'[^a-z0-9\s]', '', text)  # remove punctuation/special chars
    text = re.sub(r'\s+', ' ', text)  # remove extra spaces
    return text.strip()

descriptions_df['cleaned_description'] = descriptions_df['description'].apply(clean_text)


#  Encode labels

# Create a mapping from category names to integers
category_list = sorted(descriptions_df['category'].unique())
category_to_index = {cat:i for i, cat in enumerate(category_list)}
descriptions_df['label'] = descriptions_df['category'].map(category_to_index)

num_classes = len(category_list)
print(f"Number of categories: {num_classes}")
print("Category mapping:", category_to_index)


#  Tokenization & padding

MAX_VOCAB_SIZE = 5000
MAX_SEQUENCE_LENGTH = 50  # max words per description

tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE, oov_token="<OOV>")
tokenizer.fit_on_texts(descriptions_df['cleaned_description'])

sequences = tokenizer.texts_to_sequences(descriptions_df['cleaned_description'])
padded_sequences = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post', truncating='post')

print("Example sequence (padded):")
print(padded_sequences[0])


# Train-test split

X = padded_sequences
y = tf.keras.utils.to_categorical(descriptions_df['label'], num_classes=num_classes)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Train samples: {len(X_train)}, Test samples: {len(X_test)}")


# create TensorFlow dataset

BATCH_SIZE = 32

train_ds_text = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(1000).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_ds_text = tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
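
To make the tokenization and padding step less opaque, here is a pure-Python sketch of the same idea: words are ranked by frequency, index 0 is reserved for padding, index 1 for out-of-vocabulary words, and every sequence is cut or padded to a fixed length. It mirrors the concept only and is not the Keras implementation; `build_vocab` and `encode` are hypothetical helper names.

```python
from collections import Counter


def build_vocab(texts, max_vocab_size, oov_token="<OOV>"):
    """Map the most frequent words to integer ids (0 = padding, 1 = OOV)."""
    counts = Counter(word for text in texts for word in text.split())
    vocab = {oov_token: 1}
    for word, _ in counts.most_common(max_vocab_size - 2):
        vocab[word] = len(vocab) + 1
    return vocab


def encode(text, vocab, max_len, oov_token="<OOV>"):
    """Convert text to ids, truncate to max_len, then pad with zeros at the end."""
    ids = [vocab.get(word, vocab[oov_token]) for word in text.split()][:max_len]
    return ids + [0] * (max_len - len(ids))


texts = ["plastic bottle", "glass bottle", "plastic bag"]
vocab = build_vocab(texts, max_vocab_size=10)
encoded = encode("plastic cup", vocab, max_len=5)
print(encoded)  # [2, 1, 0, 0, 0]
```

The word "cup" never appeared in the training texts, so it maps to the OOV id 1 rather than being dropped.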

In [None]:
#  Imports

import json
import re
from sentence_transformers import SentenceTransformer
import numpy as np
import pickle


# 2 Load policy documents

with open(policy_file, 'r') as f:
    policies = json.load(f)


# 3 Preprocess text
# - lowercase, remove extra spaces, remove non-alphanumeric chars

def preprocess_text(text):
    text = str(text).lower()
    text = re.sub(r'\s+', ' ', text)  # remove extra spaces/newlines
    text = re.sub(r'[^a-z0-9\s]', '', text)  # remove special characters
    return text.strip()

# Extract and clean the text of each document
documents = []
for doc in policies:
    content = doc.get('document_text', '')  # policy text is stored under 'document_text'
    cleaned = preprocess_text(content)  # named `cleaned` to avoid shadowing the clean_text() function
    documents.append(cleaned)

print(f"Total documents processed: {len(documents)}")
print("Sample processed document:", documents[0][:300], "...")


# 4 Split documents into smaller chunks for retrieval

CHUNK_SIZE = 150  # words per chunk
doc_chunks = []

for doc in documents:
    words = doc.split()
    for i in range(0, len(words), CHUNK_SIZE):
        chunk = " ".join(words[i:i+CHUNK_SIZE])
        doc_chunks.append(chunk)

print(f"Total chunks created: {len(doc_chunks)}")
if len(doc_chunks) > 0:
  print("Sample chunk:", doc_chunks[0])
else:
  print("No chunks created.")


# Create embeddings

# Load a pre-trained sentence transformer model.
# Named embedding_model so it is not overwritten by the Keras `model` built in Part 2.
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # lightweight, fast

# Generate embeddings
embeddings = embedding_model.encode(doc_chunks, show_progress_bar=True)
embeddings = np.array(embeddings)

print("Embeddings shape:", embeddings.shape)


#  Save chunks and embeddings (optional)

with open("policy_chunks.pkl", "wb") as f:
    pickle.dump(doc_chunks, f)

with open("policy_embeddings.npy", "wb") as f:
    np.save(f, embeddings)

print("Chunks and embeddings saved successfully.")
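
The chunking loop above cuts each document into consecutive, non-overlapping 150-word windows. A common variation, shown here only as a sketch and not as the notebook's method, lets neighboring chunks share some words so that a sentence split at a boundary still appears whole in one chunk. The `overlap` parameter and the `chunk_words` helper are assumptions introduced for illustration.

```python
def chunk_words(text, chunk_size=150, overlap=0):
    """Split text into word windows of chunk_size, sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("Require chunk_size > 0 and 0 <= overlap < chunk_size.")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
plain = chunk_words(sample, chunk_size=4)
overlapping = chunk_words(sample, chunk_size=4, overlap=2)
print(plain)
print(overlapping)
```

With `overlap=0` the function reproduces the behavior of the loop in the cell above.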

## Part 2: Waste Material Classification with CNN

In this section, you will build a CNN model to classify waste materials from images.

### 2.1 Preprocess Images

In [None]:
# TODO: Implement image preprocessing
# - Apply the preprocessing pipeline created earlier

import tensorflow as tf

# Assuming you already created train_ds, validation_ds, and test_dataset previously

# 1. Pixel scaling
# The Keras EfficientNet model used in section 2.2 contains its own rescaling layer
# and expects raw pixel values in [0, 255], so the images are NOT divided by 255 here.
# (Add tf.keras.layers.Rescaling(1./255) only for a base model that expects [0, 1] inputs.)

# 2. (Optional) Data augmentation
# Helps prevent overfitting and improves generalization
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])

# 3. Optimize pipeline performance
AUTOTUNE = tf.data.AUTOTUNE

# Apply augmentation only to the training dataset. train_ds was cached in section 1.3,
# so mapping after that cache (and not caching again) gives new random transforms each epoch.
train_ds = (
    train_ds
    .shuffle(1000)
    .map(lambda x, y: (data_augmentation(x, training=True), y), num_parallel_calls=AUTOTUNE)
    .prefetch(buffer_size=AUTOTUNE)
)
# validation_ds and test_dataset were already cached and prefetched in section 1.3.

# Confirm preprocessing
for images, labels in train_ds.take(1):
    print("Image batch shape:", images.shape)
    print("Label batch shape:", labels.shape)

### 2.2 Implement CNN Model with Transfer Learning

In [None]:
# TODO: Select an appropriate base model and implement transfer learning
# - Choose from MobileNet, EfficientNet, etc.
# - Add custom classification layers for the 9 waste categories
# - Configure loss function and metrics

# Import necessary modules
from tensorflow.keras import layers, models, optimizers, regularizers
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Number of waste classes (you can update if needed)
NUM_CLASSES = len(class_names)  # or manually set to 9

# 1. Choose a pretrained base model
# EfficientNetB0 is light, accurate, and ideal for image classification tasks
base_model = tf.keras.applications.EfficientNetB0(
    input_shape=(IMG_HEIGHT, IMG_WIDTH, 3),
    include_top=False,  # exclude final dense layers
    weights='imagenet'
)

# Freeze the base model to retain pretrained ImageNet features
base_model.trainable = False

# 2. Build the transfer learning model
inputs = tf.keras.Input(shape=(IMG_HEIGHT, IMG_WIDTH, 3))

# Preprocessing step. For Keras EfficientNet this call is a pass-through (rescaling is built
# into the model, which expects [0, 255] inputs); it is kept so the base model is easy to swap.
x = tf.keras.applications.efficientnet.preprocess_input(inputs)

# Pass through base model
x = base_model(x, training=False)

# Global average pooling to reduce dimensions
x = layers.GlobalAveragePooling2D()(x)

# Optional dropout for regularization
x = layers.Dropout(0.3)(x)

# Final classification layer (softmax for multi-class)
outputs = layers.Dense(NUM_CLASSES, activation='softmax')(x)

# Define model
model = tf.keras.Model(inputs, outputs)

# 3. Compile the model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# 4. Model summary
model.summary()
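
With the base model frozen, the only trainable weights are in the final `Dense` layer, since global average pooling and dropout have no parameters. The short calculation below shows where that number in the summary comes from. It assumes 9 waste classes (as stated in the TODO comment) and the 1280-channel feature vector that EfficientNetB0 produces with `include_top=False`; `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs: int, n_outputs: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs


FEATURE_DIM = 1280      # channels output by EfficientNetB0 without its top
NUM_WASTE_CLASSES = 9   # assumption taken from the TODO comment above

head_params = dense_params(FEATURE_DIM, NUM_WASTE_CLASSES)
print(head_params)  # 11529
```

If `model.summary()` reports a different trainable count, the number of class folders found on disk probably differs from 9.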

### 2.3 Train and Evaluate the Model

In [None]:
# TODO: Train the CNN model
# - Use appropriate batch size and epochs
# - Implement regularization to prevent overfitting
# - Monitor training and validation metrics

# Import necessary callbacks
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# 1. Training configuration
EPOCHS = 15
BATCH_SIZE = 32

# Early stopping and learning rate scheduling to prevent overfitting
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=3,
    restore_best_weights=True
)

reduce_lr = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.2,
    patience=2,
    min_lr=1e-6,
    verbose=1
)

# 2. Train the model
history = model.fit(
    train_ds,
    validation_data=validation_ds,
    epochs=EPOCHS,
    callbacks=[early_stop, reduce_lr],
    verbose=1
)

# 3. Evaluate on test dataset
test_loss, test_acc = model.evaluate(test_dataset)
print(f"\n✅ Test Accuracy: {test_acc:.4f}")
print(f"✅ Test Loss: {test_loss:.4f}")

# 4. Visualize training performance
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs_range = range(len(acc))

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy')
plt.plot(epochs_range, val_acc, label='Validation Accuracy')
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')

plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss')
plt.plot(epochs_range, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')

plt.show()

In [None]:
# TODO: Evaluate model performance
# - Calculate accuracy on test set
# - Generate confusion matrix
# - Analyze error patterns

from sklearn.metrics import confusion_matrix, classification_report

# 1. Evaluate accuracy on test set
test_loss, test_acc = model.evaluate(test_dataset, verbose=1)
print(f"\n✅ Test Accuracy: {test_acc:.4f}")
print(f"✅ Test Loss: {test_loss:.4f}")

# 2. Generate predictions
y_true = []
y_pred = []

for images, labels in test_dataset:
    preds = model.predict(images)
    y_pred.extend(np.argmax(preds, axis=1))
    y_true.extend(np.argmax(labels.numpy(), axis=1))

y_true = np.array(y_true)
y_pred = np.array(y_pred)

# 3. Confusion matrix
cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names,
            yticklabels=class_names)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix - Waste Classification')
plt.show()

# 4. Classification report
print("\nClassification Report:")
print(classification_report(y_true, y_pred, target_names=class_names))

# 5. Analyze common misclassifications
misclassified_indices = np.where(y_true != y_pred)[0]
print(f"\nNumber of misclassified samples: {len(misclassified_indices)}")

# Optional: visualize a few misclassified examples
for images, labels in test_dataset.take(1):
    preds = model.predict(images)
    pred_labels = np.argmax(preds, axis=1)
    true_labels = np.argmax(labels.numpy(), axis=1)

    mis_idx = np.where(pred_labels != true_labels)[0]
    print(f"Displaying {len(mis_idx)} misclassified examples from this batch...")

    plt.figure(figsize=(12, 6))
    for i, idx in enumerate(mis_idx[:6]):  # show up to 6 examples
        plt.subplot(2, 3, i + 1)
        plt.imshow(images[idx].numpy().astype("uint8"))
        plt.title(f"True: {class_names[true_labels[idx]]}\nPred: {class_names[pred_labels[idx]]}")
        plt.axis('off')
    plt.show()
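
The per-class numbers in the classification report can be read directly off the confusion matrix: with true labels on the rows and predicted labels on the columns, precision divides the diagonal entry by its column sum and recall divides it by its row sum. The sketch below uses a small made-up matrix; the counts and the `per_class_metrics` helper are illustrative only, not results from this notebook.

```python
def per_class_metrics(cm, labels):
    """Return {label: (precision, recall)} from a rows=true, cols=predicted matrix."""
    n = len(labels)
    report = {}
    for k, name in enumerate(labels):
        true_positive = cm[k][k]
        predicted_as_k = sum(cm[row][k] for row in range(n))  # column sum
        actually_k = sum(cm[k])                               # row sum
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        report[name] = (precision, recall)
    return report


# Made-up counts for three classes, for illustration only.
example_cm = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
metrics = per_class_metrics(example_cm, ["Glass", "Metal", "Plastic"])
for name, (precision, recall) in metrics.items():
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
```

A class with high recall but low precision is one that other categories are frequently mistaken for, which is a useful starting point for error analysis.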

### 2.4 Fine-tune the Model

In [None]:
# TODO: Tune model parameters to improve performance
# - Adjust learning rate
# - Add regularization, dropout
# - Modify architecture if needed
from tensorflow.keras import layers, models, optimizers, regularizers
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
# 1. Unfreeze top layers of the base model for fine-tuning
base_model.trainable = True

# Optionally, freeze most layers and only fine-tune the top ones
fine_tune_at = int(len(base_model.layers) * 0.7)  # unfreeze top 30%
for layer in base_model.layers[:fine_tune_at]:
    layer.trainable = False

# 2. Re-compile with a lower learning rate (important for fine-tuning)
fine_tune_lr = 1e-5
model.compile(
    optimizer=optimizers.Adam(learning_rate=fine_tune_lr),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# 3. Add regularization and dropout if not already present
# (You can rebuild the classifier head with dropout/regularization if needed)
# The model architecture was already defined in section 2.2.
# We will fine-tune the existing 'model' object directly after unfreezing layers.

# 4. Callbacks for stable training
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, verbose=1)

# 5. Train the fine-tuned model
history_fine = model.fit(
    train_ds,
    validation_data=validation_ds,
    epochs=10,
    callbacks=[early_stop, reduce_lr]
)

# 6. Evaluate after fine-tuning
test_loss, test_acc = model.evaluate(test_dataset, verbose=1)
print(f"\n✅ Fine-tuned Test Accuracy: {test_acc:.4f}")
print(f"✅ Fine-tuned Test Loss: {test_loss:.4f}")

## Part 3: Waste Description Classification

In this section, you will build a text classification model to categorize waste based on descriptions.

### 3.1 Preprocess Text Data

In [None]:
# TODO: Implement text preprocessing
# - Apply the text preprocessing pipeline created earlier

import pandas as pd
import re
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the dataset
descriptions_df = pd.read_csv("/content/waste_descriptions.csv")
print("✅ Data loaded successfully!")
print(descriptions_df.head())

# 1. Clean text function
def clean_text(text):
    text = str(text).lower()  # lowercase
    text = re.sub(r'[^a-z0-9\s]', '', text)  # remove punctuation/special chars
    text = re.sub(r'\s+', ' ', text)  # remove extra spaces
    return text.strip()

# Apply cleaning
descriptions_df['cleaned_description'] = descriptions_df['description'].apply(clean_text)

# 2. Prepare labels
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
descriptions_df['label_encoded'] = label_encoder.fit_transform(descriptions_df['category'])
num_classes = len(label_encoder.classes_)
print(f"✅ Number of classes: {num_classes}")
print(f"Class names: {label_encoder.classes_}")

# 3. Tokenization
MAX_VOCAB_SIZE = 5000
MAX_SEQUENCE_LENGTH = 50

tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE, oov_token="<OOV>")
tokenizer.fit_on_texts(descriptions_df['cleaned_description'])

# Convert text to sequences and pad
sequences = tokenizer.texts_to_sequences(descriptions_df['cleaned_description'])
padded_sequences = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post', truncating='post')

# 4. Split into train/test
X_train, X_test, y_train, y_test = train_test_split(
    padded_sequences,
    descriptions_df['label_encoded'],
    test_size=0.2,
    random_state=42,
    stratify=descriptions_df['label_encoded']
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")

# 5. Example check
print("\nExample cleaned + tokenized text:")
print(descriptions_df[['description', 'cleaned_description']].head(3))

### 3.2 Implement Text Classification Model

In [None]:
# TODO: Choose and implement a text classification model
# Option A: Traditional ML model (Naive Bayes, Random Forest, etc.)
# Option B: Fine-tune a transformer-based model (BERT, DistilBERT, etc.)

# Chosen approach: a bidirectional LSTM (the RNN option from the overview), built with Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional

# Parameters
EMBEDDING_DIM = 128
LSTM_UNITS = 128
DROPOUT_RATE = 0.4

# Build the model
lstm_model = Sequential([
    Embedding(input_dim=MAX_VOCAB_SIZE, output_dim=EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH),
    Bidirectional(LSTM(LSTM_UNITS, return_sequences=False)),
    Dropout(DROPOUT_RATE),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(num_classes, activation='softmax')
])

# Compile the model
lstm_model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

# Summary
lstm_model.summary()

# Train
history = lstm_model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=8,
    batch_size=32,
    verbose=1
)

# Evaluate
loss, accuracy = lstm_model.evaluate(X_test, y_test)
print(f"\n✅ Test Accuracy: {accuracy:.4f}")


### 3.3 Train and Evaluate the Model

In [None]:
# TODO: Train the text classification model
# - Use appropriate training parameters
# - Monitor training progress

# Training parameters
EPOCHS = 10
BATCH_SIZE = 32

# Continue training the model from section 3.2 (fit() resumes from the current weights)
history = lstm_model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    verbose=1
)

# Evaluate the model
test_loss, test_accuracy = lstm_model.evaluate(X_test, y_test, verbose=0)
print(f"\n✅ Test Accuracy: {test_accuracy:.4f}")

# Plot training history


plt.figure(figsize=(10, 4))
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title("Model Accuracy Over Epochs")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

plt.figure(figsize=(10, 4))
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title("Model Loss Over Epochs")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()

In [None]:
# TODO: Evaluate model performance
# - Calculate accuracy on test set
# - Generate confusion matrix
# - Analyze error patterns

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Predict on test data
y_pred_probs = lstm_model.predict(X_test)
y_pred = np.argmax(y_pred_probs, axis=1)

# Accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"\n✅ Test Accuracy: {accuracy:.4f}")

# Classification report
target_names = label_encoder.classes_
print("\n📊 Classification Report:")
print(classification_report(y_test, y_pred, target_names=target_names))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=target_names)
fig, ax = plt.subplots(figsize=(8, 6))
disp.plot(ax=ax, cmap='Blues', xticks_rotation=45)
plt.title("Confusion Matrix: Waste Description Classifier (LSTM)")
plt.show()

# Error analysis: show the first few misclassified examples
y_test_array = np.asarray(y_test)
mismatch_indices = np.where(y_test_array != y_pred)[0]
print(f"\n❌ Misclassified examples: {len(mismatch_indices)} / {len(y_test)}")

# Show a few examples. mismatch_indices are positions within the test split,
# so map them back to descriptions_df through the original index kept by y_test.
for i in mismatch_indices[:5]:
    original_row = y_test.index[i]
    print(f"\n🗑️ Description: {descriptions_df.loc[original_row, 'description']}")
    print(f"Predicted: {target_names[y_pred[i]]} | Actual: {target_names[y_test_array[i]]}")

### 3.4 Create Classification Function

In [None]:
# TODO: Create a function that takes a text description and returns the predicted waste category

def classify_waste_description(description):
    """
    Classifies a waste description into an appropriate category.

    Args:
        description (str): Text description of waste item

    Returns:
        str: Predicted waste category
    """
    # Your code here
    pass

In [None]:
def classify_waste_description(description):
    """
    Classifies a waste description into an appropriate category (LSTM version).
    """
    # 1. Clean text
    def clean_text(text):
        text = str(text).lower()
        text = re.sub(r'[^a-z0-9\s]', '', text)
        text = re.sub(r'\s+', ' ', text)
        return text.strip()

    cleaned_text = clean_text(description)

    # 2. Tokenize + pad (same tokenizer and sequence length used for training)
    seq = tokenizer.texts_to_sequences([cleaned_text])
    padded = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH, padding='post', truncating='post')

    # 3. Predict
    pred_probs = lstm_model.predict(padded, verbose=0)
    pred_label = np.argmax(pred_probs, axis=1)[0]

    # 4. Decode to category name
    predicted_category = label_encoder.inverse_transform([pred_label])[0]

    return predicted_category

In [None]:
test_desc = "Empty soda can made of aluminum"
pred = classify_waste_description(test_desc)
print(f"🧾 Description: {test_desc}\n🔍 Predicted Category: {pred}")
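
The function above always returns the top-scoring category, even when the model is unsure. One possible refinement, sketched here under the assumption that a softmax probability vector is available, is to return a fallback label when the highest probability is below a threshold. The helper `label_with_threshold`, the 0.6 threshold, and the example probabilities are all hypothetical.

```python
def label_with_threshold(probabilities, class_names, threshold=0.6, fallback="uncertain"):
    """Return (label, confidence); use the fallback label below the threshold."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    confidence = probabilities[best]
    if confidence < threshold:
        return fallback, confidence
    return class_names[best], confidence


names = ["Glass", "Metal", "Plastic"]
confident = label_with_threshold([0.1, 0.8, 0.1], names)
unsure = label_with_threshold([0.4, 0.35, 0.25], names)
print(confident)  # ('Metal', 0.8)
print(unsure)     # ('uncertain', 0.4)
```

In an assistant setting, the fallback case could prompt the resident for a clearer description instead of giving possibly wrong disposal advice.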

## Part 4: Recycling Instruction Generation with RAG

In this section, you will implement a Retrieval-Augmented Generation (RAG) system to generate recycling instructions.

### 4.1 Preprocess Documents for Retrieval

In [None]:
# TODO: Prepare documents for retrieval
# - Process policy documents and disposal instructions
# - Create embeddings for efficient retrieval

policy_file = "/content/waste_policy_documents.json"  # update path
with open(policy_file, 'r') as f:
    policies = json.load(f)

print(f"✅ Loaded {len(policies)} policy documents")

# Optional: inspect first document
print(json.dumps(policies[0], indent=2))

# 2. Preprocess text documents
def preprocess_text(text):
    text = str(text).lower()
    text = text.replace("\n", " ")
    text = ' '.join(text.split())  # remove extra spaces
    return text

# Flatten documents into a list of text snippets for retrieval
policy_texts = []
for doc in policies:
    # The policy text is stored under 'document_text' (see section 1.3); 'title' and
    # 'content' are kept as fallbacks in case the file uses those keys instead.
    title = preprocess_text(doc.get('title', ''))
    content = preprocess_text(doc.get('document_text', doc.get('content', '')))
    policy_texts.append(f"{title}. {content}" if title else content)

print(f"✅ Total processed documents/snippets: {len(policy_texts)}")

# 3. Create embeddings
# Using a SentenceTransformer model for embeddings
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # lightweight & fast

# Compute embeddings
policy_embeddings = embedding_model.encode(policy_texts, show_progress_bar=True)

# Convert to numpy array for efficient retrieval
policy_embeddings = np.array(policy_embeddings)
print(f"✅ Embeddings shape: {policy_embeddings.shape}")

# 4. Simple retrieval function (cosine similarity)
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_policy(query, top_k=3):
    """
    Retrieve top-k relevant policy snippets for a given query.
    """
    query_emb = embedding_model.encode([preprocess_text(query)])
    similarities = cosine_similarity(query_emb, policy_embeddings)[0]
    top_indices = similarities.argsort()[-top_k:][::-1]
    top_texts = [policy_texts[i] for i in top_indices]
    return top_texts

# Example usage
query = "How should I dispose of a plastic bottle?"
top_docs = retrieve_policy(query, top_k=2)
print("\nTop retrieved policy snippets:")
for i, doc in enumerate(top_docs):
    print(f"{i+1}. {doc}\n")
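
The retrieval function ranks policy snippets by the cosine similarity between the query embedding and each document embedding, then keeps the `top_k` highest scores. The pure-Python sketch below shows that ranking on tiny two-dimensional vectors. The vectors are made up for illustration (real embeddings from `all-MiniLM-L6-v2` have 384 dimensions), and `cosine` and `top_k_indices` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_indices(query_vec, doc_vecs, k=3):
    """Return the indices of the k most similar documents, best first."""
    scored = sorted(
        ((cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)),
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]


doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vector = [1.0, 0.1]
best = top_k_indices(query_vector, doc_vectors, k=2)
print(best)  # [0, 2]
```

Cosine similarity ignores vector length, so a long policy document is not favored over a short one simply because its embedding has a larger norm.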

### 4.2 Implement RAG-based System

In [None]:
!pip install transformers sentence-transformers faiss-cpu -q


In [None]:
# TODO: Select a pre-trained language model and implement RAG
# - Choose an appropriate language model
# - Create a retrieval mechanism

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1️⃣ Load a pre-trained generative model (T5/Flan-T5 for instruction generation)
model_name = "google/flan-t5-base"  # lightweight yet powerful
tokenizer_rag = AutoTokenizer.from_pretrained(model_name)
model_rag = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# 2️⃣ RAG helper function
def generate_recycling_instructions(waste_description, top_k=3, max_length=150):
    """
    Generates recycling instructions for a waste item using RAG.

    Args:
        waste_description (str): Description of the waste item
        top_k (int): Number of retrieved policy snippets to condition generation
        max_length (int): Max length of generated instruction

    Returns:
        str: Generated recycling instructions
    """
    # Retrieve top-k policy documents
    retrieved_docs = retrieve_policy(waste_description, top_k=top_k)

    # Combine waste description + retrieved policies into prompt
    prompt = f"Item: {waste_description}\n\nPolicies:\n" + "\n".join(retrieved_docs) + "\n\nInstruction:"

    # Tokenize input
    inputs = tokenizer_rag(prompt, return_tensors="pt", truncation=True)

    # Generate output
    outputs = model_rag.generate(
        **inputs,
        max_length=max_length,
        num_beams=4,
        early_stopping=True
    )

    # Decode generated text
    instruction = tokenizer_rag.decode(outputs[0], skip_special_tokens=True)
    return instruction

# 3️⃣ Example usage
example_description = "Broken glass bottle"
instruction = generate_recycling_instructions(example_description, top_k=2)
print(f"\n🗑️ Waste Item: {example_description}\n📄 Recycling Instruction: {instruction}")
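
With `truncation=True`, the tokenizer cuts an over-long input from the end by default, so long retrieved policies can push the trailing `Instruction:` cue out of the model input. One possible mitigation, sketched below, is to cap the policy section before building the prompt. The helper `build_prompt` and the character budget are assumptions made for illustration; a character count is only a rough proxy for the token limit.

```python
def build_prompt(item, docs, max_doc_chars=1200):
    """Assemble a RAG prompt whose policy section is capped at max_doc_chars characters."""
    kept, used = [], 0
    for doc in docs:
        remaining = max_doc_chars - used
        if remaining <= 0:
            break
        snippet = doc[:remaining]  # trim the last snippet that still fits
        kept.append(snippet)
        used += len(snippet)
    # Same prompt layout as in the cell above, with the cue kept at the very end
    return f"Item: {item}\n\nPolicies:\n" + "\n".join(kept) + "\n\nInstruction:"

# Made-up, deliberately long snippets to exercise the cap
demo_docs = ["glass policy " * 200, "plastic policy " * 200]
prompt = build_prompt("Broken glass bottle", demo_docs, max_doc_chars=300)
print(len(prompt), prompt.endswith("Instruction:"))
```

A stricter variant could measure the budget in tokens with `tokenizer_rag` instead of characters.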

### 4.3 Adjust and Evaluate the System

In [None]:
# TODO: Train the RAG-based system
# - Adjust sampling methods/parameters

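generation_params = {
    "max_length": 150,
    "num_beams": 4,
    "do_sample": True,   # required: temperature and top_p are ignored unless sampling is enabled
    "temperature": 0.7,  # lower values make sampling more conservative
    "top_p": 0.9,        # nucleus sampling
    "early_stopping": True
}

# 2️⃣ Function to generate instruction with adjustable parameters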
def generate_instruction_rag(description, top_k=3, params=generation_params):
    retrieved_docs = retrieve_policy(description, top_k=top_k)
    prompt = f"Item: {description}\n\nPolicies:\n" + "\n".join(retrieved_docs) + "\n\nInstruction:"

    inputs = tokenizer_rag(prompt, return_tensors="pt", truncation=True)

    outputs = model_rag.generate(**inputs, **params)
    instruction = tokenizer_rag.decode(outputs[0], skip_special_tokens=True)

    return instruction

# 3️⃣ Evaluate on sample test set
test_descriptions = [
    "Plastic soda bottle",
    "Broken brown glass jar",
    "Used aluminum can",
    "Food waste scraps"
]

for desc in test_descriptions:
    instr = generate_instruction_rag(desc, top_k=2)
    print(f"\n🗑️ Waste Item: {desc}\n📄 Recycling Instruction: {instr}")

# 4️⃣ Optional: Evaluate quality metrics
# For production, you can compare generated instructions against a reference dataset
# using BLEU, ROUGE, or human evaluation.

# Example (pseudo-code, requires reference instructions):
# from rouge_score import rouge_scorer
# scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
# score = scorer.score(reference_text, generated_text)
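
The commented ROUGE example above needs the `rouge_score` package and reference instructions. As a dependency-free stand-in, the sketch below computes a simplified ROUGE-1-style unigram overlap (whitespace tokens, no stemming). The function `unigram_f1` and the reference/generated pair are hypothetical and written only for this illustration.

```python
from collections import Counter

def unigram_f1(reference, generated):
    """Unigram-overlap precision, recall and F1 between two strings (ROUGE-1 style)."""
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    # Counter intersection clips each token to its smaller count
    overlap = sum((ref_counts & gen_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(gen_counts.values())
    recall = overlap / sum(ref_counts.values())
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical reference and generated instruction
reference_text = "rinse the bottle and place it in the blue recycling bin"
generated_text = "place the bottle in the blue bin"
precision, recall, f1 = unigram_f1(reference_text, generated_text)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

High precision with lower recall, as in this pair, would indicate an instruction that is correct as far as it goes but omits steps present in the reference.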

In [None]:
# TODO: Evaluate the quality of generated instructions
# - Test with various waste categories
# - Assess relevance and accuracy

from sklearn.metrics import accuracy_score

# Sample test waste descriptions for evaluation
test_descriptions = [
    "Plastic soda bottle",
    "Broken brown glass jar",
    "Used aluminum can",
    "Food waste scraps",
    "Cardboard box",
    "Green glass wine bottle"
]

# Reference categories (for categorical consistency check)
reference_categories = [
    "plastic",
    "brown-glass",
    "metal",
    "biological",
    "cardboard",
    "green-glass"
]

# 1️⃣ Generate instructions and predict categories
generated_instructions = []
predicted_categories = []

for desc in test_descriptions:
    instr = generate_instruction_rag(desc, top_k=2)
    generated_instructions.append(instr)

    # Classify the original description (not the generated instruction) with the text classifier
    cat = classify_waste_description(desc)
    predicted_categories.append(cat)

    print(f"\n🗑️ Waste Item: {desc}")
    print(f"📄 Generated Instruction: {instr}")
    print(f"🏷️ Predicted Category: {cat}")

# 2️⃣ Evaluate category consistency
consistency = accuracy_score(reference_categories, predicted_categories)
print(f"\n✅ Category Consistency Accuracy: {consistency:.4f}")

# 3️⃣ Optional: Automated text quality metrics (requires reference instructions)
# from rouge_score import rouge_scorer
# scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
# for gen, ref in zip(generated_instructions, reference_instructions):
#     score = scorer.score(ref, gen)
#     print(score)

# 4️⃣ Manual inspection
# Review the instructions to ensure:
# - Correct disposal method
# - Material-specific instructions
# - Relevance to municipal policies
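
Part of the manual relevance check listed above can be approximated automatically. The sketch below measures how much of a generated instruction is lexically supported by the retrieved policies. It is a crude heuristic, not an established metric; `grounding_score` and the demo policies are invented for illustration.

```python
import re

def grounding_score(instruction, retrieved_docs):
    """Fraction of distinct instruction words that also occur in the retrieved policy text."""
    def words(text):
        return set(re.findall(r"[a-z]+", text.lower()))

    instr_words = words(instruction)
    if not instr_words:
        return 0.0
    policy_words = words(" ".join(retrieved_docs))
    return len(instr_words & policy_words) / len(instr_words)

# Made-up policy snippets for the demonstration
demo_policies = ["glass bottles go in the green bin", "rinse containers before recycling"]
print(grounding_score("rinse glass bottles", demo_policies))
print(grounding_score("burn it in the yard", demo_policies))
```

A low score flags instructions that introduce wording absent from the policies and may deserve a closer manual look; a high score does not by itself prove the instruction is correct.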

### 4.4 Create Instruction Generation Function

In [None]:
# TODO: Create a function that takes a waste category and generates recycling instructions

def generate_recycling_instructions(waste_category):
    """
    Generates detailed recycling instructions for a given waste category.

    Args:
        waste_category (str): Waste category

    Returns:
        str: Detailed recycling instructions
        list: Relevant policy documents
    """
    # Your code here
    pass

In [None]:
def generate_recycling_instructions(waste_category, top_k=3, max_length=150):
    """
    Generates detailed recycling instructions for a given waste category.

    Args:
        waste_category (str): Waste category or description
        top_k (int): Number of top policy documents to retrieve
        max_length (int): Maximum length of generated instruction

    Returns:
        instruction (str): Detailed recycling instructions
        retrieved_docs (list): List of relevant policy snippets
    """
    # 1️⃣ Retrieve top-k relevant policy documents
    retrieved_docs = retrieve_policy(waste_category, top_k=top_k)

    # 2️⃣ Construct prompt for RAG generator
    prompt = f"Waste Category/Item: {waste_category}\n\nPolicies:\n" + \
             "\n".join(retrieved_docs) + "\n\nInstruction:"

    # 3️⃣ Tokenize and generate instruction
    inputs = tokenizer_rag(prompt, return_tensors="pt", truncation=True)
    outputs = model_rag.generate(
        **inputs,
        max_length=max_length,
        num_beams=4,
        do_sample=True,  # required: temperature and top_p are ignored unless sampling is enabled
        temperature=0.7,
        top_p=0.9,
        early_stopping=True
    )

    # 4️⃣ Decode generated text
    instruction = tokenizer_rag.decode(outputs[0], skip_special_tokens=True)

    return instruction, retrieved_docs

## Part 5: Integrated Waste Management Assistant

In this section, you will integrate all three models into a unified waste management assistant.

### 5.1 Design Integration Architecture

In [None]:
# TODO: Design an architecture that integrates all three models
# - Create interfaces between components
# - Handle input/output flow

def eco_sort_assistant(image=None, text_description=None, top_k=3):
    """
    Integrated assistant for waste management.

    Args:
        image: Path to waste image (optional)
        text_description (str): Text description (optional)
        top_k (int): Number of retrieved policies for RAG

    Returns:
        dict: {
            'predicted_category': str,
            'recycling_instructions': str,
            'retrieved_policies': list
        }
    """
    # 1️⃣ Predict category
    category_from_image = None
    category_from_text = None

    if image is not None:
        # Preprocess & predict using CNN
        category_from_image = classify_waste_image(image)  # You need a CNN helper function
    if text_description is not None:
        # Predict using text classifier
        category_from_text = classify_waste_description(text_description)

    # 2️⃣ Resolve category
    if category_from_image and category_from_text:
        # Example logic: prefer image prediction if available
        predicted_category = category_from_image
    elif category_from_image:
        predicted_category = category_from_image
    elif category_from_text:
        predicted_category = category_from_text
    else:
        predicted_category = "Unknown"

    # 3️⃣ Generate recycling instructions
    if predicted_category != "Unknown":
        instructions, retrieved_docs = generate_recycling_instructions(predicted_category, top_k=top_k)
    else:
        instructions, retrieved_docs = "Cannot determine category", []

    # 4️⃣ Return results
    return {
        'predicted_category': predicted_category,
        'recycling_instructions': instructions,
        'retrieved_policies': retrieved_docs
    }
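
The resolution step above always prefers the image prediction when both inputs are present. If both classifiers exposed a confidence, the more confident one could be chosen instead. The sketch below assumes each prediction arrives as a `(label, confidence)` pair, which the notebook's text classifier does not currently return; `resolve_category` and `min_confidence` are hypothetical names.

```python
def resolve_category(image_pred=None, text_pred=None, min_confidence=0.5):
    """Pick a category from optional (label, confidence) pairs; 'Unknown' if none qualifies."""
    candidates = [
        pred for pred in (image_pred, text_pred)
        if pred is not None and pred[1] >= min_confidence
    ]
    if not candidates:
        return "Unknown"
    # max() keeps the first of equally confident candidates, so the image wins ties
    return max(candidates, key=lambda pred: pred[1])[0]

print(resolve_category(image_pred=("plastic", 0.55), text_pred=("glass", 0.90)))
print(resolve_category(image_pred=("plastic", 0.30)))
```

The threshold of 0.5 is arbitrary here and would need tuning against validation data.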

### 5.2 Implement Integrated Assistant

In [None]:
# TODO: Implement the integrated waste management assistant

def waste_management_assistant(input_data, input_type="image"):
    """
    Integrated waste management assistant that processes either images or text descriptions
    and returns waste classification and recycling instructions.

    Args:
        input_data: Either an image file path/array or a text description
        input_type (str): Type of input - "image" or "text"

    Returns:
        dict: Dictionary containing waste category, confidence, and recycling instructions
    """
    # Your code here
    pass

In [None]:
def waste_management_assistant(input_data, input_type="image", top_k=3, cnn_model=None):
    """
    Integrated waste management assistant that processes either images or text descriptions
    and returns waste classification and recycling instructions.

    Args:
        input_data: Either an image file path/array or a text description
        input_type (str): Type of input - "image" or "text"
        top_k (int): Number of policy documents to retrieve for instruction generation
        cnn_model: The trained CNN model for image classification

    Returns:
        dict: {
            'predicted_category': str,
            'confidence': float (if available, else None),
            'recycling_instructions': str,
            'retrieved_policies': list
        }
    """
    predicted_category = None
    confidence = None

    # 1️⃣ Predict category from image
    if input_type == "image":
        if cnn_model is None:
            return {"error": "CNN model not provided for image input."}

        # Preprocess image for CNN
        from tensorflow.keras.preprocessing import image as keras_image
        import numpy as np

        img = keras_image.load_img(input_data, target_size=(IMG_HEIGHT, IMG_WIDTH))
        img_array = keras_image.img_to_array(img)
        img_array = np.expand_dims(img_array, axis=0) / 255.0  # normalize

        # Predict
        pred_probs = cnn_model.predict(img_array)
        pred_index = np.argmax(pred_probs)
        predicted_category = class_names[pred_index]
        confidence = float(np.max(pred_probs))

    # 2️⃣ Predict category from text
    elif input_type == "text":
        predicted_category = classify_waste_description(input_data)
        confidence = None  # could add probability if classifier returns it

    else:
        return {"error": "Invalid input_type. Choose 'image' or 'text'"}

    # 3️⃣ Generate recycling instructions using RAG
    if predicted_category:
        instructions, retrieved_docs = generate_recycling_instructions(predicted_category, top_k=top_k)
    else:
        instructions = "Unable to determine category"
        retrieved_docs = []

    # 4️⃣ Return results
    return {
        'predicted_category': predicted_category,
        'confidence': confidence,
        'recycling_instructions': instructions,
        'retrieved_policies': retrieved_docs
    }

In [None]:
# Example 1: Image input
#result_img = waste_management_assistant("/content/RealWaste/plastic/bottle1.jpg", input_type="image")
#print(result_img)

# Example 2: Text description input
result_text = waste_management_assistant("Broken green wine bottle", input_type="text")
print(result_text)

### 5.3 Evaluate the Integrated System

In [None]:
# TODO: Evaluate the integrated system on test cases
# - Test with images from test dataset
# - Test with text descriptions from test dataset
# - Assess overall performance

import random
import tensorflow as tf
from tensorflow import keras

# 1️⃣ Evaluate on a few sample test images
print("🔹 Testing on image inputs from test dataset:")
test_image_paths = []
for batch_idx, (images, labels) in enumerate(test_dataset.take(3)):  # take 3 batches as example
    for i in range(min(3, images.shape[0])):  # take up to 3 images per batch
        # Save temporary image to file (needed for keras load_img);
        # the batch index keeps later batches from overwriting earlier files
        temp_img_path = f"temp_img_{batch_idx}_{i}.png"
        keras.preprocessing.image.save_img(temp_img_path, images[i].numpy())
        test_image_paths.append(temp_img_path)

for img_path in test_image_paths:
    # Pass the trained CNN model ('model') for image input
    result = waste_management_assistant(img_path, input_type="image", cnn_model=model)
    print(f"\nImage: {img_path}")
    print(f"Predicted Category: {result['predicted_category']} | Confidence: {result['confidence']:.2f}")
    print(f"Recycling Instructions: {result['recycling_instructions']}")
    print(f"Retrieved Policies: {len(result['retrieved_policies'])} documents used")

# 2️⃣ Evaluate on text descriptions
print("\n🔹 Testing on text description inputs:")
sample_texts = random.sample(list(descriptions_df['description']), 5)  # 5 random test descriptions

for text in sample_texts:
    result = waste_management_assistant(text, input_type="text")
    print(f"\nText: {text}")
    print(f"Predicted Category: {result['predicted_category']}")
    print(f"Recycling Instructions: {result['recycling_instructions']}")
    print(f"Retrieved Policies: {len(result['retrieved_policies'])} documents used")

# 3️⃣ Optional: Measure overall performance
# - For images: compare predicted_category with true labels in test_dataset
# - For text: compare predicted_category with true categories in descriptions_df
# - Calculate accuracy and identify misclassified cases
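
The optional performance measurement described in the comments above could look roughly like the following. This is a minimal sketch that assumes the true and predicted labels have already been collected into parallel lists; `summarize_predictions` and the demo data are illustrative and not taken from the notebook's datasets.

```python
def summarize_predictions(inputs, true_labels, predicted_labels):
    """Return overall accuracy and the list of (input, true, predicted) mismatches."""
    if not true_labels:
        return 0.0, []
    errors = [
        (item, truth, pred)
        for item, truth, pred in zip(inputs, true_labels, predicted_labels)
        if truth != pred
    ]
    accuracy = 1 - len(errors) / len(true_labels)
    return accuracy, errors

# Made-up labels for the demonstration
demo_inputs = ["soda bottle", "tin can", "banana peel"]
demo_true = ["plastic", "metal", "biological"]
demo_pred = ["plastic", "metal", "plastic"]

accuracy, errors = summarize_predictions(demo_inputs, demo_true, demo_pred)
print(f"Accuracy: {accuracy:.2f}")
for item, truth, pred in errors:
    print(f"Misclassified '{item}': expected {truth}, got {pred}")
```

Running the same summary separately for image and text inputs would show which path contributes more errors to the integrated system.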

## Submission Guidelines

1. Make sure all code cells are properly commented and annotated
2. Ensure that all functions are implemented and working correctly
3. Verify that all evaluation metrics are calculated and analyzed
4. Double-check that the integrated system works as expected
5. Submit your completed and annotated Jupyter notebook file

Remember to demonstrate your understanding of the underlying concepts and provide justification for your design decisions throughout the notebook.