## Object Detection using PyTorch

### Import necessary libraries

In [1]:
import torch
import torchvision

from PIL import Image
from pprint import pprint
from collections import Counter
import requests
import ast

In [2]:
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

### Load a pretrained torchvision model

In [17]:
# EXERCISE: Write a function to load a pretrained object detection model from torchvision in eval mode

def load_model():
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    # Set it to `eval` mode because we aren't training the model
    model.eval()
    return model

In [4]:
model = load_model()

# print(model)



### Get images to analyze

In [5]:
!curl "https://www.sfmta.com/sites/default/files/imce-images/2021/pedestrian_scramble.jpg" -o pedestrian_scramble.jpg
!curl "https://static.wixstatic.com/media/0b1913_a8d6b79a2f624015b42ecf5b8efa54fc~mv2.jpg" -o cats.jpg

# !wget "https://www.sfmta.com/sites/default/files/imce-images/2021/pedestrian_scramble.jpg" -O pedestrian_scramble.jpg
# !wget "https://static.wixstatic.com/media/0b1913_a8d6b79a2f624015b42ecf5b8efa54fc~mv2.jpg" -O cats.jpg



In [6]:
Image.open("pedestrian_scramble.jpg").show()

### Preprocess the images

In [7]:
# EXERCISE: Write a function that accepts the image file path and returns a tensor

def load_as_tensor(img_path):
    image = Image.open(img_path) # Load as PIL image
    image = torchvision.transforms.ToTensor()(image) # Convert PIL image to tensor
    return image

In [8]:
img1 = load_as_tensor("pedestrian_scramble.jpg")
print(img1.size())

torch.Size([3, 806, 1200])


In [9]:
img2 = load_as_tensor("cats.jpg")
print(img2.size())

torch.Size([3, 5074, 5074])


### Batchify
- Since the operations on each image are identical and independent of each other, they can be performed in parallel. This is why inputs to deep learning models are usually batches of images (or text, audio, or whatever else your model consumes).

In [10]:
batch = [img1, img2]
input_batch = torch.stack(batch)

RuntimeError: stack expects each tensor to be equal size, but got [3, 806, 1200] at entry 0 and [3, 5074, 5074] at entry 1

Oh no! You just got an error! Don't fret, let's figure out what went wrong...

The error message says we couldn't create a batch because the image sizes are different.

When sizes are different, the operations are no longer identical (large images will need more operations). For parallel processing, the batch must contain images of the same size.
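
The shape rule can be seen in isolation with small random tensors standing in for images. This is only a sketch; `small`, `large`, `same_a`, `same_b` and `batch_demo` are placeholder names, not part of the notebook.

```python
import torch

# Two stand-in "images" with different heights and widths
small = torch.rand(3, 4, 6)
large = torch.rand(3, 8, 8)

try:
    torch.stack([small, large])
    stacked_ok = True
except RuntimeError as err:
    # torch.stack refuses tensors whose shapes differ
    stacked_ok = False
    print("stack failed:", err)

# Tensors with identical shapes stack into a single batch tensor
same_a = torch.rand(3, 8, 8)
same_b = torch.rand(3, 8, 8)
batch_demo = torch.stack([same_a, same_b])
print(batch_demo.shape)  # torch.Size([2, 3, 8, 8])
```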

In the real world, it's likely we won't always get images of the same size. To resize our images, we can use `torchvision.transforms.Resize` in our preprocessing function. Let's try that!

### Update the preprocessing function

Rewrite the preprocessing function from above so that after the image is loaded as a tensor, we resize it to 224 pixels in height and width.

In [11]:
# EXERCISE: Update `load_as_tensor` to resize the image tensor to 224x224

def load_as_tensor(img_path):
    image = Image.open(img_path) # Load as PIL image
    image = torchvision.transforms.ToTensor()(image) # Convert PIL image to tensor
    image = torchvision.transforms.Resize(size=(224,224))(image)
    return image

Test it! Make sure the image tensor sizes are the same.

In [12]:
img1 = load_as_tensor("pedestrian_scramble.jpg")
print(img1.size())

img2 = load_as_tensor("cats.jpg")
print(img2.size())

torch.Size([3, 224, 224])
torch.Size([3, 224, 224])


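Why (3, 224, 224)? 224x224 is the conventional input size for many pretrained image classification models, so it is a familiar default. It is not a hard minimum: torchvision's detection models rescale their inputs internally, so other sizes work too. Resizing both images to the same square shape simply lets us stack them into one batch, at the cost of distorting their aspect ratios.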

### Batchify... again

In [13]:
batch = [img1, img2]
input_batch = torch.stack(batch)

In [14]:
# EXERCISE: What is the size of the `input_batch` tensor?

print(input_batch.size())

torch.Size([2, 3, 224, 224])


The input batch tensor follows the classic (N, C, H, W) format (batch size, channels, height, width) that you will encounter often in your computer vision journey.
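
To make the four dimensions concrete, here is a small sketch that uses a zero-filled tensor of the same shape as a stand-in for the real batch. `demo_batch` and `first_image` are illustrative names only.

```python
import torch

# Stand-in batch: 2 images, 3 color channels, 224x224 pixels each
demo_batch = torch.zeros(2, 3, 224, 224)

# Unpack the shape in (N, C, H, W) order
n, c, h, w = demo_batch.shape
print(f"N={n} images, C={c} channels, H={h} px, W={w} px")

# Indexing along the first dimension recovers a single (C, H, W) image
first_image = demo_batch[0]
print(first_image.shape)  # torch.Size([3, 224, 224])
```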

### Run inference on the image
Pass the input batch through the model.

In [15]:
predictions = model(input_batch)

In [18]:
# EXERCISE: How many elements does `predictions` contain? 

print(len(predictions))  # must match input batch size

2


How does the number of elements in `predictions` relate to the number of images in the input batch?

In [19]:
# EXERCISE: Explore what each prediction contains. What do you think all these numbers mean?

p0 = predictions[0]

# print(p0)
print(p0.keys())

dict_keys(['boxes', 'labels', 'scores'])


The model has returned three things for each image:
- boxes: coordinates of the bounding boxes around detected objects, in (x1, y1, x2, y2) format
- labels: the integer class index of what it thinks each detected object is
- scores: confidence in each predicted label (ranging from 0 to 1, higher is more confident)
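
The three tensors line up row by row, so one boolean mask built from the scores can filter all of them at once. The sketch below uses a hand-written dictionary, `fake_prediction`, with made-up values in place of real model output, and an arbitrary cutoff of 0.5.

```python
import torch

# Hand-written stand-in for one element of `predictions` (values are invented)
fake_prediction = {
    "boxes": torch.tensor([[10.0, 20.0, 110.0, 220.0],
                           [50.0, 60.0, 90.0, 100.0],
                           [0.0, 0.0, 30.0, 40.0]]),
    "labels": torch.tensor([1, 3, 1]),
    "scores": torch.tensor([0.95, 0.60, 0.10]),
}

# Boolean mask: True where the score clears the cutoff
keep = fake_prediction["scores"] > 0.5
kept_boxes = fake_prediction["boxes"][keep]
kept_labels = fake_prediction["labels"][keep]

# Each box is (x1, y1, x2, y2), so width and height are simple differences
x1, y1, x2, y2 = kept_boxes[0].tolist()
print(kept_labels.tolist())  # [1, 3]
print(x2 - x1, y2 - y1)      # 100.0 200.0
```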

In [20]:
# EXERCISE: See what objects have been detected in the first image

p0['labels']

tensor([ 3,  3, 10,  8,  3,  1,  1,  1,  1,  3,  1,  1,  1,  3, 10, 10,  6,  3,
         1,  6,  3,  1,  3,  1,  8,  1,  1,  1,  3,  1,  1, 10,  6,  3,  8, 10,
         3,  3,  1,  1,  3, 10,  3,  8,  3,  8,  3,  1,  3,  1, 10,  3,  6,  1,
         8,  1,  1,  1,  3,  6,  6,  1,  8,  3,  3,  1,  3, 10,  1,  1,  1,  1,
        31,  6, 10,  1,  3,  6,  1, 33,  3,  1, 10, 41, 27,  1, 14, 31,  8,  8,
         3,  8,  6, 31,  6,  2,  1,  3,  1,  6])

In [21]:
# EXERCISE: What are the scores of the most-confident and least-confident predictions?

print("Score of most-confident prediction: ", p0['scores'].max().item())
print("Score of least-confident prediction: ", p0['scores'].min().item())

Score of most-confident prediction:  0.973721981048584
Score of least-confident prediction:  0.05608990788459778


### Post-process output

The model has given us integers for labels. These integers are indices that map to object names in the COCO dataset.

Here's a function to load the lookup map:

In [22]:
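def get_mapping_dict():
    idx_to_labels_url = "https://gist.githubusercontent.com/suraj813/1fe4c9dd0bc7e1dd1ce79462712ac9ce/raw/0e2c65813946769a375d673a34a1c0236b0505f1/coco_idx_to_labels.txt"
    response = requests.get(idx_to_labels_url)
    response.raise_for_status()  # fail early if the download did not succeed
    # The file holds a Python dict literal; parse it and convert the keys to ints
    mapping = {int(k): v for k, v in ast.literal_eval(response.text).items()}
    return mapping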

label_lookup = get_mapping_dict()

Try it out! `1` seems to be a common label in the first image. What does it correspond to?

In [23]:
# EXERCISE: What is the object the model predicts as `1`?

print(label_lookup[1])

person


### Build a report

Now that we can translate the model's output labels to actual object names, let's try to build a report for each image.

The report should contain all the objects in the image BUT the model isn't confident about every prediction it has made. We should ignore predictions below a certain threshold.

There might be multiple occurrences of an object in the image; instead of listing every occurrence, the report can just contain an aggregate count for each object.

In [24]:
def create_detection_report(model_output, confidence_threshold=0.8):
    # Look up outputs by key rather than relying on the dict's ordering
    labels = model_output['labels'].tolist()
    scores = model_output['scores'].tolist()

    label_lookup = get_mapping_dict()
    detected_objects = []

    # Keep only the predictions the model is sufficiently confident about
    for label, score in zip(labels, scores):
        if score > confidence_threshold:
            classname = label_lookup[label]
            detected_objects.append((classname, score))

    # Aggregate the surviving detections into per-class counts
    counts = Counter(classname for classname, _ in detected_objects)
    return detected_objects, counts
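
The filter-then-count logic can also be tried without the model or a network connection. The sketch below is a simplified variation, not the notebook's function: `summarize_detections` is a hypothetical helper that takes plain lists and receives the lookup table as an argument, and all of the toy values are invented.

```python
from collections import Counter

def summarize_detections(labels, scores, lookup, threshold=0.8):
    """Hypothetical helper: count class names whose score exceeds the threshold."""
    names = [lookup[label] for label, score in zip(labels, scores) if score > threshold]
    return Counter(names)

# Invented data: a two-entry lookup and five detections
toy_lookup = {1: "person", 3: "car"}
toy_labels = [3, 1, 1, 3, 1]
toy_scores = [0.97, 0.94, 0.91, 0.62, 0.30]

toy_counts = summarize_detections(toy_labels, toy_scores, toy_lookup, threshold=0.85)
print(toy_counts)  # Counter({'person': 2, 'car': 1})
```

Passing the lookup in as an argument also means the mapping is downloaded once rather than on every call, which may matter when reporting on many images.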


In [29]:
for c, pred in enumerate(predictions):
    detected_objects, counts = create_detection_report(pred, confidence_threshold=0.85)   

    print(f"Objects detected in image {c+1}:\n", "="*20)
    pprint(detected_objects)
    print()

    print("Count of objects:\n", "="*20)
    pprint(counts)
    
    print("\n\n")


Objects detected in image 1:
[('car', 0.973721981048584),
 ('car', 0.9571205973625183),
 ('traffic light', 0.9557090401649475),
 ('truck', 0.9524416327476501),
 ('car', 0.9516623020172119),
 ('person', 0.9422258734703064),
 ('person', 0.9256057739257812),
 ('person', 0.9153063893318176),
 ('person', 0.8720185160636902),
 ('car', 0.8719595074653625),
 ('person', 0.863501250743866)]

Count of objects:
Counter({'person': 5, 'car': 4, 'traffic light': 1, 'truck': 1})



Objects detected in image 2:
[('cat', 0.9777542352676392), ('cat', 0.967362105846405)]

Count of objects:
Counter({'cat': 2})





### Take-home assignment

Improve this report by drawing boxes on the input image and labelling each box with the detected object and confidence score.

HINT: https://pytorch.org/vision/main/generated/torchvision.utils.draw_bounding_boxes.html