## Object Detection using PyTorch

### Import necessary libraries

In [None]:
import torch
import torchvision

from PIL import Image
from pprint import pprint
from collections import Counter
import requests
import ast

In [None]:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### Load a pretrained torchvision model

In [None]:
# EXERCISE: Write a function to load a pretrained object detection model from torchvision in eval mode

def load_model():
    # ...
    # write code here
    # ...
    return model

In [None]:
model = load_model()

# print(model)

### Get images to analyze

In [None]:
!curl "https://www.sfmta.com/sites/default/files/imce-images/2021/pedestrian_scramble.jpg" -o pedestrian_scramble.jpg
!curl "https://static.wixstatic.com/media/0b1913_a8d6b79a2f624015b42ecf5b8efa54fc~mv2.jpg" -o cats.jpg

In [None]:
Image.open("pedestrian_scramble.jpg").show()

### Preprocess the images

In [None]:
# EXERCISE: Write a function that accepts the image file path and returns a tensor

def load_as_tensor(img_path: str):
    # ...
    # write code to Load as PIL image
    # write code to Convert PIL image to tensor
    # ...
    return image

In [None]:
img1 = load_as_tensor("pedestrian_scramble.jpg")
print(img1.size())

In [None]:
img2 = load_as_tensor("cats.jpg")
print(img2.size())

### Batchify
- The operations on each image are identical and independent of one another, so they can run in parallel. This is why inputs to deep learning models are batches of images (or text, audio, or whatever your model consumes).

In [None]:
# Create list of all images of a batch
batch = [img1, img2]

# Convert list to tensor
input_batch = torch.stack(batch)

Oh no! You just got an error! Don't fret, let's figure out what went wrong...

The stack trace says `torch.stack` couldn't create a batch because the image tensors have different sizes.

When sizes are different, the operations are no longer identical (large images will need more operations). For parallel processing, the batch must contain images of the same size.
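The mismatch can be reproduced without any real images. This is a minimal sketch on random tensors; the shapes are made up for illustration and are not the sizes of the downloaded files.

```python
import torch

# Two stand-in "images" with different heights and widths (shapes are invented).
small = torch.rand(3, 100, 120)
large = torch.rand(3, 200, 300)

try:
    torch.stack([small, large])
    stacked_ok = True
except RuntimeError as err:
    # torch.stack refuses tensors whose shapes differ
    stacked_ok = False
    print("stack failed:", err)

# Tensors with identical shapes stack without complaint.
same = torch.stack([small, torch.rand(3, 100, 120)])
print(same.shape)
```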

In the real world, it's likely we won't always get images of the same size. To resize our images, we can use `torchvision.transforms.Resize` in our preprocessing function. Let's try that!

### Update the preprocessing function

Rewrite the preprocessing function from above so that after the image is loaded as a tensor, we resize it to 224 pixels in height and width.

Use [torchvision.transforms.Resize](https://pytorch.org/vision/main/generated/torchvision.transforms.Resize.html)

In [None]:
# EXERCISE: Update `load_as_tensor` to resize the image tensor to 224x224

def load_as_tensor(img_path: str):
    # ...
    # write code to Load as PIL image
    # write code to Convert PIL image to tensor
    # write code to resize image tensor
    # ...
    return image

Test it! Make sure the image tensor sizes are the same

In [None]:
img1 = load_as_tensor("pedestrian_scramble.jpg")
print(img1.size())

img2 = load_as_tensor("cats.jpg")
print(img2.size())

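Why (3, 224, 224)? 224x224 is the conventional input size for many pretrained ImageNet classification models, and small images keep inference fast. It is not a hard minimum: torchvision detection models rescale their inputs internally and accept other sizes too.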

### Batchify... again

In [None]:
batch = [img1, img2]
input_batch = torch.stack(batch)

In [None]:
# EXERCISE: What is the size of the `input_batch` tensor?

# ...
# write code here
# ...

The input batch tensor follows the classic (N, C, H, W) layout (batch size, channels, height, width) that you will encounter often in your computer vision journey.
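To make the layout concrete, here is a small sketch on random tensors. The image count and sizes are invented and deliberately differ from the ones used in this notebook.

```python
import torch

# Four made-up "images", each with 3 channels, 32 rows, and 48 columns.
images = [torch.rand(3, 32, 48) for _ in range(4)]

# torch.stack adds a new leading dimension that indexes the image.
batch = torch.stack(images)

n, c, h, w = batch.shape
print(n, c, h, w)

# Indexing along dim 0 recovers an individual image.
first = batch[0]
```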

### Run inference on the image
Pass the input batch through the model

In [None]:
with torch.no_grad():  # inference only, so skip gradient tracking
    predictions = model(input_batch)

In [None]:
# EXERCISE: How many elements does `predictions` contain? 

# ...
# write code here
# ...

How does the number of elements in `predictions` relate to the number of images in the input batch?

In [None]:
# EXERCISE: Explore what each prediction contains. What do you think all these numbers mean?

# ...
# write code here
# ...

For each image, the model returns a dictionary with three entries:
- boxes: coordinates of the bounding boxes around detected objects, as `[x1, y1, x2, y2]` corners
- labels: what it thinks each detected object is
- scores: confidence in each predicted label (from 0 to 1, higher is more confident)
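The sketch below shows one way to work with such a dictionary. The prediction is written by hand with invented values, and it assumes the `[x1, y1, x2, y2]` box format that torchvision detection models document. It is an illustration of tensor indexing, not the model's actual output.

```python
import torch

# Hand-written stand-in for one element of the model output (values are invented).
fake_prediction = {
    "boxes": torch.tensor([[10.0, 20.0, 110.0, 220.0],
                           [50.0, 60.0, 90.0, 100.0],
                           [0.0, 0.0, 30.0, 40.0]]),
    "labels": torch.tensor([17, 18, 17]),
    "scores": torch.tensor([0.98, 0.75, 0.10]),
}

# A boolean mask selects the detections above a confidence threshold.
keep = fake_prediction["scores"] > 0.5
confident_boxes = fake_prediction["boxes"][keep]
confident_labels = fake_prediction["labels"][keep]

# Width and height follow from the corner coordinates.
widths = confident_boxes[:, 2] - confident_boxes[:, 0]
heights = confident_boxes[:, 3] - confident_boxes[:, 1]
print(confident_labels.tolist(), widths.tolist(), heights.tolist())
```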

In [None]:
# EXERCISE: See what objects have been detected in the first image

# ...
# write code here
# ...

In [None]:
# EXERCISE: What are the scores of the most-confident and least-confident predictions?

# ...
# write code here
# ...

### Post-process output

The model has given us integers for labels. These integers are indices that map to object names in the COCO dataset.

Here's a function to load the lookup map:

In [None]:
def get_mapping_dict():
    idx_to_labels_url = "https://gist.githubusercontent.com/suraj813/1fe4c9dd0bc7e1dd1ce79462712ac9ce/raw/0e2c65813946769a375d673a34a1c0236b0505f1/coco_idx_to_labels.txt"
    r = requests.get(idx_to_labels_url).text
    # Parse the dict literal safely and force integer keys; avoid shadowing the builtin `map`
    mapping = {int(k): v for k, v in ast.literal_eval(r).items()}
    return mapping

label_lookup = get_mapping_dict()
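The parsing step can be tried offline. In this sketch the sample string is invented: it only assumes the downloaded file is a dict literal whose keys may be strings, which is what the `int(k)` conversion in `get_mapping_dict` suggests. The real file's contents are not reproduced here.

```python
import ast

# Invented sample in the assumed style of the file: a dict literal with string keys.
sample_text = "{'3': 'car', '17': 'cat', '18': 'dog'}"

# literal_eval only accepts Python literals, so it will not run arbitrary code.
parsed = ast.literal_eval(sample_text)
sample_lookup = {int(k): v for k, v in parsed.items()}

print(sample_lookup[17])
```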

Try it out! `1` seems to be a common label in the first image. What does it correspond to?

In [None]:
# EXERCISE: What is the object the model predicts as `1`?

# ...
# write code here
# ...

### Build a report

Now that we can translate the model's output labels to actual object names, let's try to build a report for each image.

The report should list the objects in the image, but the model isn't equally confident about every prediction it has made. We should ignore predictions whose score falls below a certain threshold.

There might be multiple occurrences of an object in the image; instead of listing every occurrence, the report can just contain an aggregate count per object.
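The thresholding and counting idea can be sketched on plain Python lists before touching the model output. The names, scores, and threshold below are invented, and the variable names are hypothetical rather than the ones the exercise expects.

```python
from collections import Counter

# Invented labels and scores standing in for one image's predictions.
names = ["cat", "dog", "cat", "car", "cat"]
scores = [0.99, 0.91, 0.88, 0.40, 0.30]
threshold = 0.85

# Keep only the names whose score clears the threshold.
confident = [name for name, score in zip(names, scores) if score >= threshold]

# Counter tallies how often each name appears.
tally = Counter(confident)

print(confident)
print(tally)
```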

In [None]:

def create_detection_report(model_output, confidence_threshold=0.8):
    # ...
    # write code here
    # ...

    # HINTS
    # - Unpack the output dictionary to get the bbox, labels, and confidence values
    # - Convert the labels and confidence arrays to lists for easier processing
    # - Get a lookup dictionary for the class labels
    # - Loop through each label and its corresponding confidence value. Record only those predictions above 
    #   the confidence threshold
    # - Use a Counter object to count the number of times each class appears in the detected_objects list
    # - Return a tuple containing the list of detected objects and the class counts
    
    return detected_objects, counts



In [None]:
for c, pred in enumerate(predictions):
    detected_objects, counts = create_detection_report(pred, confidence_threshold=0.85)   

    print(f"Objects detected in image {c+1}:\n{'='*20}")
    pprint(detected_objects)
    print()

    print(f"Count of objects:\n{'='*20}")
    pprint(counts)
    
    print("\n\n")


### Take-home assignment

Improve this report by drawing boxes on the input image and labelling each box with the detected object and confidence score.

HINT: https://pytorch.org/vision/main/generated/torchvision.utils.draw_bounding_boxes.html