## **New Approach B: Data Alignment & Feature Extraction**

### **Objective**
This notebook addresses the critical data misalignment issue discovered in `pipeline2_visual_descriptors.py`.

**Context in Pipeline:**
1.  **Previous Step:** `prepare_visual_metadata.py` generated a raw, unordered CSV of species metadata (`full_visual_metadata.csv`).
2.  **This Step:** We read the official dataset order (`train.txt`) and force our metadata to match that exact sequence. We also generate frozen DINOv2 embeddings to serve as a baseline.
3.  **Next Step:** The aligned metadata produced by this notebook (`aligned_train_metadata.pkl`) is fed into the LoRA training script.

**Outputs:**
*   `aligned_train_metadata.pkl`: The master dataframe, perfectly ordered to match the image directory.
*   `aligned_train_embeddings.npy`: Pre-computed DINOv2 features (used for baseline comparison).

# **1.0 Environment Setup**
We mount Google Drive to access the dataset and install the necessary libraries (`timm` for the model, `pandas` for data manipulation).

## **1.1 Mount Google Drive**


In [None]:
from google.colab import drive
drive.mount('/content/drive')

print("✅ Google Drive mounted successfully.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Google Drive mounted successfully.


## **1.2 Install Required Libraries**

In [None]:
# Install all necessary libraries. The '-q' flag makes the output less verbose.
!pip install pandas scikit-learn torch torchvision tqdm matplotlib timm kagglehub -q

print("✅ Required libraries are installed.")

✅ Required libraries are installed.


## **1.3 Define Project Paths and Constants**

In [None]:
# This cell centralizes all configuration for easy access.
import os
import torch

# --- Configuration ---

# 1. Define the base path to the project folder in Google Drive.
PROJECT_PATH = '/content/drive/My Drive/Colab Notebooks/COS30082_ML_Project'

# 2. Set the device for processing.
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

# 3. Define paths to key data files.
DATA_DIR = os.path.join(PROJECT_PATH, 'AML_dataset')
METADATA_FILE = os.path.join(PROJECT_PATH, 'full_visual_metadata.csv')

# 4. Set up a persistent cache on Drive for the DINOv2 model download.
#    kagglehub reads its cache location from the KAGGLEHUB_CACHE variable.
cache_dir = os.path.join(PROJECT_PATH, 'kaggle_cache')
os.makedirs(cache_dir, exist_ok=True)
os.environ['KAGGLEHUB_CACHE'] = cache_dir

print(f"Project Path: {PROJECT_PATH}")
print(f"Dataset Path: {DATA_DIR}")
print(f"Using device: {DEVICE}")
print(f"Kaggle cache path set to: {cache_dir}")

Project Path: /content/drive/My Drive/Colab Notebooks/COS30082_ML_Project
Dataset Path: /content/drive/My Drive/Colab Notebooks/COS30082_ML_Project/AML_dataset
Using device: cpu
Kaggle cache path set to: /content/drive/My Drive/Colab Notebooks/COS30082_ML_Project/kaggle_cache


# **2.0 Load the DINOv2 Feature Extractor**
To verify that all images are readable and to generate baseline features, we load the same DINOv2 model used in the previous baselines. We remove the classifier head and disable pooling, so the model returns per-token features; the 768-dimensional CLS token is later taken as the image embedding.
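
As a rough illustration of the feature vectors described above: with the head removed, a ViT of this kind returns one vector per token, and the image embedding is read from the first (CLS) token. The sketch below uses a dummy NumPy array in place of real model output. The token count (1 CLS + 4 register + 37×37 patch tokens for a 518×518 input with patch size 14) is an assumption inferred from the model name and the transform, not something measured in this notebook, and `cls_embedding` is a hypothetical helper.

```python
import numpy as np

# Assumed token layout for a 518x518 input with 14x14 patches and 4 register tokens.
EMBED_DIM = 768
NUM_PATCHES = (518 // 14) ** 2        # 37 * 37 = 1369 patch tokens
NUM_TOKENS = 1 + 4 + NUM_PATCHES      # CLS + registers + patches


def cls_embedding(token_features: np.ndarray) -> np.ndarray:
    """Return the first (CLS) token of every item in the batch."""
    if token_features.ndim != 3:
        raise ValueError("expected an array shaped (batch, tokens, dim)")
    return token_features[:, 0, :]


# Dummy stand-in for the model output of a single image.
dummy_output = np.zeros((1, NUM_TOKENS, EMBED_DIM), dtype="float32")
embedding = cls_embedding(dummy_output).flatten()
print(embedding.shape)
```

If the real model's output had only two dimensions (already pooled), the `[:, 0, :]` indexing used later in this notebook would fail, so checking the output shape once on a single image is a cheap safeguard.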


In [None]:
import kagglehub
import timm
import torchvision.transforms as transforms

print("--- Loading DINOv2 Model ---")
try:
    path_to_model_files = kagglehub.model_download("juliostat/dinov2_patch14_reg4_onlyclassifier_then_all/pyTorch/default")
    timm_model_name = 'vit_base_patch14_reg4_dinov2.lvd142m'
    dinov2_model = timm.create_model(timm_model_name, pretrained=False)
    dinov2_model.reset_classifier(0, '')

    weights_filename = next((f for f in os.listdir(path_to_model_files) if f.endswith(('.pth', '.pt', '.tar'))), None)
    if weights_filename is None:
        raise FileNotFoundError("Could not find a model weights file in the cache directory.")

    weights_path = os.path.join(path_to_model_files, weights_filename)
    print(f"Found weights file: {weights_filename}")

    # Use weights_only=False as required by this specific checkpoint file
    checkpoint = torch.load(weights_path, map_location='cpu', weights_only=False)

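    state_dict = checkpoint.get('state_dict', checkpoint)
    # strict=False lets loading proceed when keys differ (e.g. weights for the removed head),
    # so warn if any of the model's own parameters were left uninitialized.
    load_result = dinov2_model.load_state_dict(state_dict, strict=False)
    if load_result.missing_keys:
        print(f"Warning: {len(load_result.missing_keys)} model keys were not found in the checkpoint.")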
    dinov2_model.to(DEVICE)
    dinov2_model.eval()
    print("✅ DINOv2 model ready and moved to device.")

    # Define the required image transformations
    transform = transforms.Compose([
        transforms.Resize(518, interpolation=transforms.InterpolationMode.BICUBIC),
        transforms.CenterCrop(518),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    print("✅ Image transformer ready.")

except Exception as e:
    print(f"❌ Failed to load DINOv2 model. Error: {e}")
    raise

--- Loading DINOv2 Model ---
Found weights file: model_best.pth.tar
✅ DINOv2 model ready and moved to device.
✅ Image transformer ready.


# **3.0 Generate and Save Aligned Data**
This is the **most critical step**.
The `full_visual_metadata.csv` file created earlier is in no guaranteed order (it is likely sorted by ID or name).
However, the dataset expects images to be loaded in the order defined by `list/train.txt`.

We load `train.txt` first to establish the "Ground Truth" order.
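
The alignment idea can be sketched on toy data as follows. The file names and the `species` column are made up for illustration. `validate` and `indicator` are standard `pandas.merge` options added here as optional safety checks; they are not part of this notebook's own merge call.

```python
import pandas as pd

# Hypothetical "official" order, standing in for list/train.txt.
order_df = pd.DataFrame({
    "filepath": ["train/b.jpg", "train/c.jpg", "train/a.jpg"],
    "classid": [2, 3, 1],
})

# Hypothetical metadata sorted by name, i.e. in a different order.
metadata_df = pd.DataFrame({
    "filepath": ["train/a.jpg", "train/b.jpg", "train/c.jpg"],
    "classid": [1, 2, 3],
    "species": ["alpha", "beta", "gamma"],
})

# A left merge keeps the row order of the left frame.
aligned = pd.merge(
    order_df,
    metadata_df,
    on=["filepath", "classid"],
    how="left",
    validate="many_to_one",  # raise if a metadata key appears more than once
    indicator=True,          # adds a "_merge" column describing each row's match
)

# Rows marked "left_only" are images that have no metadata entry.
unmatched = int((aligned["_merge"] == "left_only").sum())
aligned = aligned.drop(columns="_merge")

print(aligned["species"].tolist())
print(f"Rows without metadata: {unmatched}")
```

Duplicate keys in the metadata would otherwise add extra rows to the result, and unmatched keys would silently produce NaN columns, so both checks guard against the output drifting out of step with `train.txt`.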

In [None]:
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm

# --- Step 1: Get the OFFICIAL file order from train.txt ---
print("--- Reading Official File Order ---")
train_list_path = os.path.join(DATA_DIR, 'list/train.txt')
train_order_df = pd.read_csv(train_list_path, sep=' ', header=None, names=['filepath', 'classid'])
print(f"Found {len(train_order_df)} images to process.")

--- Reading Official File Order ---
Found 4744 images to process.


In [None]:
from PIL import Image

# --- Step 2: Process images IN ORDER and collect embeddings ---
aligned_embeddings = []
print("\n--- Generating Embeddings in Correct Order ---")

# Loop through the official list of training files
for index, row in tqdm(train_order_df.iterrows(), total=len(train_order_df), desc="Extracting Features"):
    image_path = os.path.join(DATA_DIR, row['filepath'])
    try:
        img = Image.open(image_path).convert("RGB")
        img_tensor = transform(img).unsqueeze(0).to(DEVICE)
        with torch.no_grad():
            full_features = dinov2_model(img_tensor)
            embedding = full_features[:, 0, :].cpu().numpy().flatten()
        aligned_embeddings.append(embedding)
    except FileNotFoundError:
        # Skipping the file would shift every later embedding out of alignment with the
        # metadata, so store a NaN placeholder to keep one row per line of train.txt.
        print(f"Warning: Could not find file {image_path}. Using a NaN placeholder.")
        aligned_embeddings.append(np.full(dinov2_model.num_features, np.nan, dtype='float32'))

# Convert the list of embeddings to a single NumPy array
final_embeddings_array = np.array(aligned_embeddings, dtype='float32')


--- Generating Embeddings in Correct Order ---



In [None]:
# --- Step 3: Create the aligned metadata DataFrame ---
print("\n--- Aligning Full Metadata ---")
df_unordered = pd.read_csv(METADATA_FILE)
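# Normalize Windows-style backslashes to forward slashes for a successful merge.
# regex=False treats the backslash as a literal character on every pandas version.
df_unordered['filepath'] = df_unordered['filepath'].str.replace('\\', '/', regex=False)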

# Merge the full metadata onto our correctly ordered DataFrame.
# This enforces the order of `train_order_df`.
aligned_metadata_df = pd.merge(train_order_df, df_unordered, on=['filepath', 'classid'], how='left')


--- Aligning Full Metadata ---


In [None]:
# --- Step 4: Save the new, perfectly aligned files ---
EMBEDDINGS_SAVE_PATH = os.path.join(PROJECT_PATH, 'aligned_train_embeddings.npy')
METADATA_SAVE_PATH = os.path.join(PROJECT_PATH, 'aligned_train_metadata.pkl')

np.save(EMBEDDINGS_SAVE_PATH, final_embeddings_array)
aligned_metadata_df.to_pickle(METADATA_SAVE_PATH)

print("\n--- ✅ SUCCESS! ---")
print(f"Final Embeddings Shape: {final_embeddings_array.shape}")
print(f"Final Metadata Shape:   {aligned_metadata_df.shape}")
print(f"\nSaved aligned embeddings to: {EMBEDDINGS_SAVE_PATH}")
print(f"Saved aligned metadata to:   {METADATA_SAVE_PATH}")


--- ✅ SUCCESS! ---
Final Embeddings Shape: (4744, 768)
Final Metadata Shape:   (4744, 9)

Saved aligned embeddings to: /content/drive/My Drive/Colab Notebooks/COS30082_ML_Project/aligned_train_embeddings.npy
Saved aligned metadata to:   /content/drive/My Drive/Colab Notebooks/COS30082_ML_Project/aligned_train_metadata.pkl
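
Before handing these files to the next step, it may be worth reloading them and confirming that they still line up row for row. The helper below is a hypothetical sketch (the name `check_alignment` is not part of the project). It only assumes that the embeddings were saved with `np.save` and the metadata with `DataFrame.to_pickle`, as in the cell above, and it is demonstrated on small dummy files rather than the real outputs.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def check_alignment(embeddings_path: str, metadata_path: str) -> int:
    """Reload both files and return the shared row count, or raise on mismatch."""
    embeddings = np.load(embeddings_path)
    metadata = pd.read_pickle(metadata_path)
    if embeddings.shape[0] != len(metadata):
        raise ValueError(
            f"{embeddings.shape[0]} embeddings vs {len(metadata)} metadata rows"
        )
    return len(metadata)


# Demonstration on small dummy files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    emb_path = os.path.join(tmp_dir, "embeddings.npy")
    meta_path = os.path.join(tmp_dir, "metadata.pkl")
    np.save(emb_path, np.zeros((3, 768), dtype="float32"))
    pd.DataFrame({"filepath": ["a.jpg", "b.jpg", "c.jpg"]}).to_pickle(meta_path)
    n_rows = check_alignment(emb_path, meta_path)

print(f"Row counts match: {n_rows}")
```

In the notebook itself, the same function could be called with `EMBEDDINGS_SAVE_PATH` and `METADATA_SAVE_PATH`. A matching row count is a necessary condition for alignment, not proof of it, since it cannot detect rows that are present but in the wrong order.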
