# Image Analysis and Object Detection Program

## Overview

This program uses advanced machine learning models to analyze images, detect objects, and generate detailed descriptions. It combines object detection models (DETR and YOLOS) with an image captioning model (BLIP) to provide a comprehensive analysis of input images.

## Features

1. Object detection using either DETR or YOLOS models
2. Image captioning using the BLIP model
3. Color extraction from detected objects
4. Activity detection for objects
5. Support for both local image files and image URLs

## Requirements

To run this program, you need:

- Python 3.7+
- PyTorch
- Transformers library
- PIL (Python Imaging Library)
- OpenCV (cv2)
- NumPy
- scikit-learn
- webcolors
- requests
- timm (installed by a notebook cell)
- Google Colab is recommended; running both models can take up to 12 GB of RAM

You can install the required libraries with pip. The notebook also contains install cells for them:

```
pip install torch transformers timm Pillow opencv-python numpy scikit-learn webcolors requests
```

## How It Works

1. **Image Input**: The program accepts either a local file path or a URL to an image.

2. **Image Processing**: The input image is opened and converted to a NumPy array for processing.

3. **Object Detection**: Depending on which version you run, either the DETR or YOLOS model is used to detect objects in the image. Both models return bounding boxes and labels for detected objects.

4. **Image Captioning**: The BLIP model is used to generate captions for the entire image and for individual detected objects.

5. **Color and Activity Extraction**: The program analyzes the captions to extract color information and potential activities associated with the objects.

6. **Vision Record Generation**: All the gathered information is compiled into a "vision record" - a dictionary containing details about detected objects, their colors, activities, bounding boxes, and an overall image summary.
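
As an illustration of step 1, the sketch below shows one way to tell a URL from a local path. It is an assumption-laden alternative, not the notebook's own check: it uses the standard library's `urllib.parse` instead of an extension-based regex, and the helper names `looks_like_url` and `describe_source` are hypothetical.

```python
from urllib.parse import urlparse


def looks_like_url(source: str) -> bool:
    """Return True when `source` parses as an http(s) URL with a host."""
    parsed = urlparse(source.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def describe_source(source: str) -> str:
    """Label an input as 'url' or 'path' (hypothetical helper for this sketch)."""
    return "url" if looks_like_url(source) else "path"


print(describe_source("https://example.com/image.jpg"))  # url
print(describe_source("/path/to/your/image.jpg"))        # path
```

A check of this kind does not look at the file extension, so it also accepts image URLs that carry a query string. Whether the URL really points at an image is only known once the download is opened.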

## How to Run

1. Ensure all required libraries are installed.

2. Run each of the cells in the notebook, in order.

3. When prompted, enter either a local file path or an image URL:

   ```
   Input path to your image (from internet or path): /path/to/your/image.jpg
   ```
   or
   ```
   Input path to your image (from internet or path): https://example.com/image.jpg
   ```

4. The program will process the image and output a vision record containing all the analyzed information.

## Output

The program outputs a vision record dictionary with the following information:

- Time: Timestamp of the analysis
- Objects: List of detected objects
- Objects Activities: List of activities associated with each object (if any)
- Object Colors: List of colors associated with each object
- Object Bounding Boxes: Coordinates of bounding boxes for each detected object
- Frame Size: Dimensions of the input image as (height, width)
- Frame Summary: Overall description of the image
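
To make the shape of the record concrete, here is a minimal sketch that assembles a dictionary with the keys listed above. The helper `build_vision_record`, the tuple layout of `detections`, and every value in the example are assumptions made up for illustration; they are not output from the models.

```python
import datetime


def build_vision_record(frame_size, detections, summary):
    """Assemble a vision record from per-object results.

    `detections` is a list of (label, activity, color, box) tuples.
    This tuple layout is an assumption made for the sketch.
    """
    record = {
        "Time": datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f"),
        "Objects": [],
        "Objects Activities": [],
        "Object Colors": [],
        "Object Bounding Boxes": [],
        "Frame Size": frame_size,  # (height, width)
        "Frame Summary": summary,
    }
    for label, activity, color, box in detections:
        record["Objects"].append(label)
        record["Objects Activities"].append(activity)
        record["Object Colors"].append(color)
        record["Object Bounding Boxes"].append(box)
    return record


# Made-up values, for illustration only
example = build_vision_record(
    frame_size=(480, 640),
    detections=[("dog", "running", "brown", [12.5, 40.0, 300.25, 420.0])],
    summary="a dog running on grass",
)
print(example["Objects"], example["Object Colors"])
```

The four per-object lists are parallel: index `i` in each list refers to the same detected object.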

## Note on Model Versions

The code includes two versions of the `generate_vision_record` function:

1. Using the DETR (DEtection TRansformer) model
2. Using the YOLOS (You Only Look at One Sequence) model

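Both versions are defined under the same name, each in its own cell together with its own `main()`. Running every cell in order therefore runs the DETR version first and the YOLOS version second, and the YOLOS definitions are the ones left in scope afterwards. To use only one version, run only that cell, or give the two functions different names and call the one you want from `main()`.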
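
One possible way to make the choice explicit, instead of keeping two functions with the same name, is to select the detector by a short name. The sketch below only maps a name to the checkpoint identifiers already used in the notebook; the names `DETECTOR_CHECKPOINTS` and `resolve_checkpoint` are hypothetical.

```python
# Checkpoint identifiers are the ones loaded in the notebook cells.
DETECTOR_CHECKPOINTS = {
    "detr": "facebook/detr-resnet-50",
    "yolos": "hustvl/yolos-small",
}


def resolve_checkpoint(detector: str = "yolos") -> str:
    """Map a short detector name to its checkpoint identifier."""
    key = detector.strip().lower()
    try:
        return DETECTOR_CHECKPOINTS[key]
    except KeyError:
        choices = ", ".join(sorted(DETECTOR_CHECKPOINTS))
        raise ValueError(
            f"Unknown detector {detector!r}; choose one of: {choices}"
        ) from None


print(resolve_checkpoint("DETR"))  # facebook/detr-resnet-50
print(resolve_checkpoint())        # hustvl/yolos-small
```

The returned string could then be passed to the matching `from_pretrained` calls, so that a single `generate_vision_record` takes the detector name as a parameter.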

## Limitations and Considerations

- The accuracy of object detection, color extraction, and activity recognition depends on the training and capabilities of the underlying models.
- Processing large images or running the program on a CPU can be time-consuming. A GPU is recommended for faster processing, though this notebook wasn't optimized for GPU usage.
- The program requires an internet connection to download the model weights if they're not already cached.
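
For the point about large images, one common mitigation is to shrink the image before inference. The notebook does not do this; the sketch below is an assumption that only computes a reduced size preserving the aspect ratio. The helper name `fit_within` and the default limit of 1024 pixels are arbitrary choices for illustration.

```python
def fit_within(width: int, height: int, max_side: int = 1024):
    """Return (width, height) scaled down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough, never upscale
    scale = max_side / longest
    # Keep each side at least 1 pixel for extreme aspect ratios
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4000, 3000, 1000))  # (1000, 750)
print(fit_within(800, 600, 1000))    # (800, 600)
```

The resulting size can be given to Pillow's `Image.resize`. Bounding boxes reported afterwards refer to the resized image, not the original.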


In [None]:
!pip install opencv-python-headless tensorflow numpy Pillow accelerate flash_attn

In [None]:
!pip install transformers requests torch

In [None]:
!pip install webcolors

In [None]:
!pip install timm

In [5]:
# Imports

import torch
import transformers
from transformers import DetrImageProcessor, DetrForObjectDetection, BlipProcessor, BlipForConditionalGeneration, YolosImageProcessor, YolosForObjectDetection
from PIL import Image
import warnings
import requests
import cv2
import numpy as np
import datetime
import re
from sklearn.cluster import KMeans
import webcolors

In [21]:
import io
def is_url(s):
    """
    Check if a given string is a valid URL for an image.

    Args:
    s (str): The string to check.

    Returns:
    bool: True if the string is a valid image URL, False otherwise.
    """
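    # Regex pattern to check if the string is an http(s) URL ending in an image extension.
    # The match is case-insensitive; URLs with a query string after the extension are not matched.
    url_pattern = r'^https?://(?:www\.)?.*\.(jpg|jpeg|png|gif|bmp|tiff|webp)$'
    return re.match(url_pattern, s, re.IGNORECASE) is not None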

def open_image(image_source: str):
    """
    Open an image from a URL or file path.

    Args:
    image_source (str): URL or file path of the image.

    Returns:
    numpy.array: The image as a NumPy array, or None if there's an error.
    """
    try:
        if is_url(image_source):
            # Open the image from a URL (the timeout avoids hanging on an unresponsive host)
            response = requests.get(image_source, timeout=10)
            response.raise_for_status()  # Check if the request was successful
            image_bytes = io.BytesIO(response.content)
            image = Image.open(image_bytes).convert("RGB")
        else:
            # Open the image from a file path
            image = Image.open(image_source).convert("RGB")

        return np.array(image)

    except requests.exceptions.RequestException as e:
        print(f"Error fetching image from URL: {e}")
    except IOError as e:
        print(f"Error opening image file: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")

    return None

def crop_image(image: np.array, coordinates):
    """
    Crop an image using given coordinates.

    Args:
    image (numpy.array): The input image as a NumPy array.
    coordinates (tuple): The coordinates for cropping (left, top, right, bottom).

    Returns:
    numpy.array: The cropped image as a NumPy array.
    """

    # Convert NumPy array back to PIL Image
    image_pil = Image.fromarray(image)

    # Unpack coordinates
    left, top, right, bottom = coordinates

    # Crop the image using the coordinates
    cropped_image_pil = image_pil.crop((left, top, right, bottom))

    # Convert cropped image back to NumPy array
    return np.array(cropped_image_pil)


def extract_colors(sentence):
    """
    Extract color names from a given sentence.

    Args:
    sentence (str): The input sentence.

    Returns:
    str: A string of extracted colors or 'colorless' if no colors are found.
    """

    # List of common colors
    color_list = ['red', 'blue', 'green', 'yellow', 'orange', 'purple', 'pink', 'brown', 'black', 'white', 'gray', 'grey', 'silver', 'gold', 'colorless']

    # Convert sentence to lowercase for case-insensitive matching
    sentence = sentence.lower()

    # Find all color words in the sentence, matching whole words only
    # so that e.g. 'red' is not found inside 'covered'
    found_colors = [color for color in color_list if re.search(rf'\b{color}\b', sentence)]

    # If no colors found, return 'colorless'
    if not found_colors:
        return 'colorless'

    # If only one color found, return it
    if len(found_colors) == 1:
        return found_colors[0]

    # If multiple colors found, format them
    if len(found_colors) == 2:
        return f"{found_colors[0]} and {found_colors[1]}"
    else:
        return ", ".join(found_colors[:-1]) + f", and {found_colors[-1]}"

def extract_activity(sentence):
    """
    Extract activity words (ending with 'ing') from a given sentence.

    Args:
    sentence (str): The input sentence.

    Returns:
    str: The first activity word found, or None if no activity is found.
    """

    # Convert sentence to lowercase for case-insensitive matching
    sentence = sentence.lower()

    # Find all words ending with 'ing'
    ing_words = re.findall(r'\b\w+ing\b', sentence)

    # If 'ing' words found, return the first one.
    # Note: the pattern also matches nouns such as 'building' or 'ring',
    # and it already covers 'pointing', so no special case is needed.
    if ing_words:
        return ing_words[0]

    # If no activity found, return None
    return None


In [None]:
# Model 2


# Load model and processor once
model_name = "Salesforce/blip-image-captioning-large"

# Disable some warnings
import warnings
warnings.filterwarnings('ignore')

# Load model and processor
processor = BlipProcessor.from_pretrained(model_name)
model = BlipForConditionalGeneration.from_pretrained(model_name)

def model_two(image: np.array, conditional_prompt: str = None) -> dict:
    """
    Generate image captions using the BLIP model.

    Args:
    image (numpy.array): The input image as a NumPy array.
    conditional_prompt (str, optional): A conditional prompt for captioning.

    Returns:
    dict: A dictionary containing unconditional and (if provided) conditional captions.
    """

    raw_image = Image.fromarray(image)
    inputs_unconditional = processor(raw_image, return_tensors="pt")
    out_unconditional = model.generate(**inputs_unconditional)
    caption_unconditional = processor.decode(out_unconditional[0], skip_special_tokens=True)

    results = {"unconditional": caption_unconditional}

    if conditional_prompt:
        # Conditional image captioning
        inputs_conditional = processor(raw_image, conditional_prompt, return_tensors="pt")
        out_conditional = model.generate(**inputs_conditional)
        caption_conditional = processor.decode(out_conditional[0], skip_special_tokens=True)
        results["conditional"] = caption_conditional

    return results

In [None]:
def generate_vision_record(image: np.array) -> dict:
    """
    Generate a vision record using the DETR object detection model and BLIP captioning.

    Args:
    image (numpy.array): The input image as a NumPy array.

    Returns:
    dict: A vision record containing detected objects, activities, colors, and other information.
    """

    # These local names shadow the BLIP `processor` and `model` only inside this function;
    # model_two() still uses the module-level BLIP objects.
    processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
    model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

    # Convert NumPy array to PIL Image for processing
    image_pil = Image.fromarray(image)

    # Prepare the image for detection and run inference without tracking gradients
    inputs = processor(images=image_pil, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Rescale the predicted boxes to the original image size (height, width)
    # and keep only detections with a score above 0.9
    target_sizes = torch.tensor([image_pil.size[::-1]])
    results = processor.post_process_object_detection(outputs, target_sizes=target_sizes, threshold=0.9)[0]

    # Initialize the vision record dictionary
    vision_record = {
        "Time": datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f"),
        "Objects": [],
        "Objects Activities": [],
        "Object Colors": [],
        "Object Bounding Boxes": [],
        "Frame Size": image.shape[:2],  # (height, width)
        "Frame Summary" : ""
    }

    # Caption the full frame once, so the summary is set even when
    # no object passes the detection threshold
    vision_record["Frame Summary"] = model_two(image).get("unconditional")

    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        box = [round(i, 2) for i in box.tolist()]
        # Crop image using bounding box coordinates
        cropped_image = crop_image(image, box)
        # Only the unconditional caption is read below; the prompts are passed
        # to BLIP, but the conditional captions they produce are not used
        response_two_one = model_two(cropped_image, "This object(s) is color")
        response_two_two = model_two(cropped_image, "This object(s) is")

        color = extract_colors(response_two_one.get("unconditional"))
        activity = extract_activity(response_two_two.get("unconditional"))

        vision_record["Objects"].append(model.config.id2label[label.item()])
        vision_record["Objects Activities"].append(activity)
        vision_record["Object Colors"].append(color)
        vision_record["Object Bounding Boxes"].append(box)
    return vision_record

def main():
    """
    Main function to process an image and generate a vision record using the DETR model.
    """
    image_source = input("Input path to your image (from internet or path): ")
    image = open_image(image_source)
    if image is None:
        # open_image() has already printed the reason for the failure
        return
    vision_record = generate_vision_record(image)
    print(vision_record)

if __name__ == "__main__":
    main()

In [None]:
def generate_vision_record(image: np.array):
    """
    Generate a vision record using the YOLOS object detection model and BLIP captioning.

    Args:
    image (numpy.array): The input image as a NumPy array.

    Returns:
    dict: A vision record containing detected objects, activities, colors, and other information.
    """

    model = YolosForObjectDetection.from_pretrained('hustvl/yolos-small')
    image_processor = YolosImageProcessor.from_pretrained("hustvl/yolos-small")

    # Convert NumPy array to PIL Image for processing
    image_pil = Image.fromarray(image)

    # Prepare the image for detection and run inference without tracking gradients
    inputs = image_processor(images=image_pil, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Rescale the predicted boxes to the original image size (height, width)
    # and keep only detections with a score above 0.9
    target_sizes = torch.tensor([image_pil.size[::-1]])
    results = image_processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[0]

    # Initialize the vision record dictionary
    vision_record = {
        "Time": datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f"),
        "Objects": [],
        "Objects Activities": [],
        "Object Colors": [],
        "Object Bounding Boxes": [],
        "Frame Size": image.shape[:2],  # (height, width)
        "Frame Summary" : ""
    }

    # Caption the full frame once, so the summary is set even when
    # no object passes the detection threshold
    vision_record["Frame Summary"] = model_two(image).get("unconditional")

    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        box = [round(i, 2) for i in box.tolist()]
        # Crop image using bounding box coordinates
        cropped_image = crop_image(image, box)
        # Only the unconditional caption is read below; the prompts are passed
        # to BLIP, but the conditional captions they produce are not used
        response_two_one = model_two(cropped_image, "This object(s) is color")
        response_two_two = model_two(cropped_image, "This object(s) is")

        color = extract_colors(response_two_one.get("unconditional"))
        activity = extract_activity(response_two_two.get("unconditional"))

        vision_record["Objects"].append(model.config.id2label[label.item()])
        vision_record["Objects Activities"].append(activity)
        vision_record["Object Colors"].append(color)
        vision_record["Object Bounding Boxes"].append(box)
    return vision_record

def main():
    """
    Main function to process an image and generate a vision record using the YOLOS model.
    """
    image_source = input("Input path to your image (from internet or path): ")
    image = open_image(image_source)
    if image is None:
        # open_image() has already printed the reason for the failure
        return
    vision_record = generate_vision_record(image)
    print(vision_record)

if __name__ == "__main__":
    main()