In [1]:
import os
import time
import uuid
import numpy as np
import pandas as pd

import tensorflow as tf
from object_detection.utils import label_map_util, config_util
from object_detection.utils import visualization_utils as viz_utils
from object_detection.builders import model_builder
import cv2

import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision
from mediapipe_model_maker import gesture_recognizer


# Object Detection with Naruto Hand Seals

---

Anime and manga were a big part of my daily entertainment while growing up as a 90s kid, in particular the Big 3 - One Piece, Naruto and Bleach. While One Piece still remains my personal favourite (and still ongoing as of 2023!), today's focus will be on [**Naruto**](https://naruto.fandom.com/wiki/Narutopedia), which tells the story of a young ninja called Uzumaki Naruto trying to achieve his dream of becoming the Hokage.

One of the fundamental concepts within Naruto is the use of chakra, which in turn allows a user to perform a jutsu (technique). Through the use of hand seals, a ninja can better control and manipulate their chakra when performing their technique. There are twelve basic seals, each of them named after an animal in the Chinese Zodiac. Every technique has its own sequence of hand seals, but a skilled ninja can also perform a technique with fewer or no hand seals.

![Twelve Basic Hand Seals](Naruto_Hand_Seals_by_Megan.gif)
<br>[Source: Naruto Hand Seals by Megan #1](https://www.youtube.com/watch?v=y_NRTgVuaNo)

While I am not savvy enough to perform hand seals with the same accuracy and speed as [Megan](https://www.youtube.com/@DreamSilver05) in the image above, I wondered how the Python and deep learning skills I've picked up recently could be applied. This is by no means an original project, and I have referenced the work of various other enthusiasts who have developed their own computer vision models. However, it is a great opportunity for me to practise using deep learning and computer vision libraries, like TensorFlow and OpenCV, that I did not have many opportunities to work with during my Data Science Immersive.

## Problem Statement

Using `Object Detection`, we will attempt to train a model that can recognise the 12 basic hand seals in a live video feed with XXX accuracy.

## Getting Started

Referencing the [TensorFlow Object Detection Tutorial](https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html), I will be using transfer learning from a pre-trained network to customise it to our current task.

We will prepare our workspace as recommended in the tutorial and assign our file paths to constants for easy reference.

In [2]:
# set up paths for easier reference

SCRIPTS_PATH = '../../scripts/preprocessing'
# APIMODEL_PATH = '../../models'
ANNOTATION_PATH = './annotations'
IMAGE_PATH = './images'
COLLECTION_PATH = IMAGE_PATH + '/collected'
MODEL_PATH = './models/my_ssd_mobilenet_v2_fpnlite'
LOG_PATH = MODEL_PATH + '/train'
# PRETRAINED_MODEL_PATH = './pre-trained-models'
CONFIG_PATH = MODEL_PATH + '/pipeline.config'
CHECKPOINT_PATH = MODEL_PATH

## Create Label Maps

---

We will assign a label name and id to each of the 12 hand seals, creating a label map that TensorFlow will use in the training and detection process.

In [3]:
labels = [
    {'name':'rat', 'id':1}, 
    {'name':'ox', 'id':2},
    {'name':'tiger', 'id':3},
    {'name':'hare', 'id':4},
    {'name':'dragon', 'id':5},
    {'name':'snake', 'id':6},
    {'name':'horse', 'id':7},
    {'name':'ram', 'id':8},
    {'name':'monkey', 'id':9},
    {'name':'bird', 'id':10},
    {'name':'dog', 'id':11},
    {'name':'boar', 'id':12},
    ]

In [4]:
with open(ANNOTATION_PATH + '/label_map.pbtxt', 'w') as f:
    for label in labels:
        f.write('item { \n')
        f.write('\tname:\'{}\'\n'.format(label['name']))
        f.write('\tid:{}\n'.format(label['id']))
        f.write('}\n')
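For reference, the loop above writes one `item` block per hand seal. The sketch below builds the same text in memory so the format can be inspected before anything is written to disk. `render_label_map` is a hypothetical helper introduced here for illustration (it is not part of the TensorFlow Object Detection API), and the two-entry sample list simply mirrors the shape of `labels`.

```python
# Hypothetical helper (not part of the TensorFlow Object Detection API):
# renders the label map text in memory instead of writing it to disk.
def render_label_map(labels):
    blocks = []
    for label in labels:
        blocks.append(
            "item { \n"
            f"\tname:'{label['name']}'\n"
            f"\tid:{label['id']}\n"
            "}\n"
        )
    return "".join(blocks)

# Two-entry sample with the same shape as the full `labels` list.
sample_labels = [{'name': 'rat', 'id': 1}, {'name': 'ox', 'id': 2}]
label_map_text = render_label_map(sample_labels)
print(label_map_text)
```

Note that the ids start at 1, since id 0 is reserved for the background class in the Object Detection API.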

## Collecting the Images

---

While I'm not the first to attempt such a project, I did not come across any existing datasets online, so this would be a great opportunity to ~~toy around~~ practise building my own dataset.

I will use OpenCV to capture some images of myself in various lighting conditions, and search online for a mixture of anime/manga and real-life samples, which should hopefully help the model generalise. We will target approximately 200 samples per hand seal, which can then be split into our train and validation sets.
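One way to do that split is label by label, so that every hand seal keeps the same proportion in both sets. The sketch below is only an illustration under assumptions: `split_by_label` is a hypothetical helper, the 80/20 ratio is my own choice rather than anything prescribed, and file names are assumed to begin with the label followed by an underscore.

```python
import random
from collections import defaultdict

def split_by_label(filenames, train_fraction=0.8, seed=42):
    """Split file names into train/validation lists, label by label."""
    by_label = defaultdict(list)
    for name in filenames:
        # File names are assumed to start with the label, e.g. 'rat_0_<id>.jpg'.
        by_label[name.split('_', 1)[0]].append(name)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, validation = [], []
    for label in sorted(by_label):
        names = sorted(by_label[label])
        rng.shuffle(names)
        cut = int(len(names) * train_fraction)
        train.extend(names[:cut])
        validation.extend(names[cut:])
    return train, validation

# Made-up file names: 10 images for each of two seals.
example_files = [f'{seal}_{i}_abc.jpg' for seal in ('rat', 'ox') for i in range(10)]
train_files, validation_files = split_by_label(example_files)
print(len(train_files), len(validation_files))
```

Remember that each image's `.xml` annotation has to move together with the image once the images have been labelled.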

In [15]:
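def capture_images(label: str, number_images: int = 20):
    """Capture `number_images` webcam frames for `label` into COLLECTION_PATH."""
    os.makedirs(COLLECTION_PATH, exist_ok=True)
    # CAP_DSHOW selects the DirectShow backend, which is Windows-only
    cap = cv2.VideoCapture(0, cv2.CAP_DSHOW)
    print(f'Starting collection of images for {label} in 3 seconds')
    time.sleep(3)
    for image_number in range(number_images):
        image_name = os.path.join(COLLECTION_PATH, f'{label}_{image_number}_{uuid.uuid4().hex}.jpg')
        print(f'Capturing {image_name}')
        time.sleep(1)
        ret, frame = cap.read()
        if not ret:
            # skip this frame if the camera did not return an image
            print(f'Failed to capture {image_name}')
            continue
        cv2.imwrite(image_name, frame)
    cap.release()
    cv2.destroyAllWindows()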

In [19]:
for label in labels:
    print(f"Prepare for {label['name']}, starting in 5 seconds")
    time.sleep(5)
    capture_images(label['name'], number_images=5)

Prepare for rat, starting in 5 seconds
Starting collection of images for rat in 3 seconds
Capturing ./images/collected\rat_0_0b163023493c4f2fb0e64250aeb8f7e7.jpg
Capturing ./images/collected\rat_1_96c6561886454734bad1783eb3047be7.jpg
Capturing ./images/collected\rat_2_02ad6ccf1c744342b2f133b6590e9637.jpg
Capturing ./images/collected\rat_3_3ebe7da4bcbf4f47bce28f104e4412c6.jpg
Capturing ./images/collected\rat_4_9a283f6671774fa7ba77ddd54702608e.jpg
Prepare for ox, starting in 5 seconds
Starting collection of images for ox in 3 seconds
Capturing ./images/collected\ox_0_a73b75f7a7f5430291b450e29205e399.jpg
Capturing ./images/collected\ox_1_e329dc1477014cd99afa7edf317cf461.jpg
Capturing ./images/collected\ox_2_e29f045adb114b94875e7bf89a9be1e8.jpg
Capturing ./images/collected\ox_3_f348f269c9b64ba7bfe86ab3cccd8aa8.jpg
Capturing ./images/collected\ox_4_d1cf79c45eb047259feb02e00ea831dd.jpg
Prepare for tiger, starting in 5 seconds
Starting collection of images for tiger in 3 seconds
Capt

In [5]:
# count the training images per hand seal (the .xml annotation files are skipped)
annotation_list = []

for annotation in os.listdir(IMAGE_PATH+'/train'):
    if ".xml" not in annotation:
        annotation_list.append(annotation.split('.')[0])

df = pd.DataFrame(annotation_list, columns=['image'])
df[['sign', 'image']] = df['image'].str.split('_', expand=True, n=1)
print(f"Total train records: {len(df)}")
df.value_counts('sign')

Total train records: 884


sign
ram       86
tiger     79
snake     76
bird      75
rat       73
dragon    72
ox        71
horse     71
hare      71
monkey    70
dog       70
boar      70
dtype: int64

In [13]:
capture_images('rat', 1)

Starting collection of images for rat in 3 seconds
Capturing ./images/collected\rat_0_3061bb7957d84da38c56959149047e96.jpg


## Annotation with LabelImg

Using LabelImg, I went through the images and annotated each one with the bounding box and label that our model will eventually learn to recognise.

My first attempt at training did not produce a model that performed well. After reviewing my workflow, I realised that my bounding boxes were not tight enough and I was leaving too much empty space. Since I am more concerned with recognising the type of hand seal than with drawing an accurate bounding box on screen, I opted to "zoom in" on the key features of each hand seal to hopefully help the model perform better.
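A rough way to spot loose boxes is to measure how much of each image its bounding box covers. The sketch below assumes the annotations are in the Pascal VOC XML format that LabelImg saves by default; `box_coverage` is a hypothetical helper and the sample annotation is made up for illustration.

```python
import xml.etree.ElementTree as ET

def box_coverage(xml_text):
    """Return (label, fraction of image area covered) for each box."""
    root = ET.fromstring(xml_text)
    size = root.find('size')
    image_area = int(size.find('width').text) * int(size.find('height').text)
    results = []
    for obj in root.iter('object'):
        box = obj.find('bndbox')
        xmin, ymin, xmax, ymax = (int(box.find(tag).text)
                                  for tag in ('xmin', 'ymin', 'xmax', 'ymax'))
        area = (xmax - xmin) * (ymax - ymin)
        results.append((obj.find('name').text, area / image_area))
    return results

# Made-up annotation for illustration only.
example_xml = """
<annotation>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>tiger</name>
    <bndbox><xmin>200</xmin><ymin>120</ymin><xmax>440</xmax><ymax>360</ymax></bndbox>
  </object>
</annotation>
""".strip()
coverage = box_coverage(example_xml)
print(coverage)
```

Boxes that cover a much larger share of the frame than others of the same seal are good candidates for re-annotation.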

In [19]:
!Labelimg

## Create TFRecords

In [23]:
!python {SCRIPTS_PATH + '/generate_tfrecord.py'} -x {IMAGE_PATH + '/train'} -l {ANNOTATION_PATH + '/label_map.pbtxt'} -o {ANNOTATION_PATH + '/train.record'}
!python {SCRIPTS_PATH + '/generate_tfrecord.py'} -x {IMAGE_PATH + '/test'} -l {ANNOTATION_PATH + '/label_map.pbtxt'} -o {ANNOTATION_PATH + '/test.record'}

Successfully created the TFRecord file: ./annotations/train.record
Successfully created the TFRecord file: ./annotations/test.record


## Training the Model

In [3]:
print(f"tensorboard --logdir {LOG_PATH}")

tensorboard --logdir ./models/my_ssd_mobilenet_v2_fpnlite/train


In [22]:
# create statement to paste into command line
print(f'python model_main_tf2.py --model_dir={MODEL_PATH} --pipeline_config_path={CONFIG_PATH}')

python model_main_tf2.py --model_dir=./models/my_ssd_mobilenet_v2_fpnlite --pipeline_config_path=./models/my_ssd_mobilenet_v2_fpnlite/pipeline.config


## Load Trained Model from Checkpoint

In [6]:
# load pipeline config and build model
configs = config_util.get_configs_from_pipeline_file(CONFIG_PATH)
detection_model = model_builder.build(model_config=configs['model'], is_training=False)

# restore checkpoint
ckpt = tf.compat.v2.train.Checkpoint(model=detection_model)
ckpt.restore(os.path.join(CHECKPOINT_PATH, 'ckpt-13')).expect_partial()

@tf.function
def detect_fn(image):
    image, shapes = detection_model.preprocess(image)
    prediction_dict = detection_model.predict(image, shapes)
    detections = detection_model.postprocess(prediction_dict, shapes)
    return detections
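The checkpoint name `ckpt-13` is hard-coded above, so it needs updating after every training run. As a sketch, the highest-numbered checkpoint can be found by scanning the folder for `ckpt-N.index` files. `latest_checkpoint_prefix` is a hypothetical helper, and the demonstration uses empty placeholder files rather than real checkpoints. TensorFlow also provides `tf.train.latest_checkpoint`, which reads the `checkpoint` state file in the folder instead.

```python
import os
import re
import tempfile

def latest_checkpoint_prefix(checkpoint_dir):
    """Return the path prefix of the highest-numbered 'ckpt-N', or None."""
    pattern = re.compile(r'^ckpt-(\d+)\.index$')
    numbers = []
    for filename in os.listdir(checkpoint_dir):
        match = pattern.match(filename)
        if match:
            numbers.append(int(match.group(1)))
    if not numbers:
        return None
    return os.path.join(checkpoint_dir, f'ckpt-{max(numbers)}')

# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as demo_dir:
    for n in (2, 13, 7):
        open(os.path.join(demo_dir, f'ckpt-{n}.index'), 'w').close()
    latest_name = os.path.basename(latest_checkpoint_prefix(demo_dir))
print(latest_name)
```

The returned prefix could then be passed to `ckpt.restore(...)` in place of the hard-coded path.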

## Perform Real Time Detections

In [7]:
category_index = label_map_util.create_category_index_from_labelmap(ANNOTATION_PATH+'/label_map.pbtxt')

In [9]:
# setup CV2 capture
cap = cv2.VideoCapture(0, cv2.CAP_DSHOW)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

while True:
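    ret, frame = cap.read()
    if not ret:
        # stop cleanly if the camera did not return a frame
        cap.release()
        cv2.destroyAllWindows()
        break
    image_np = np.array(frame)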

    input_tensor = tf.convert_to_tensor(np.expand_dims(image_np, 0), dtype=tf.float32)
    detections = detect_fn(input_tensor)

    num_detections = int(detections.pop('num_detections'))
    detections = {key: value[0, :num_detections].numpy() for key, value in detections.items()}
    detections['num_detections'] = num_detections

    detections['detection_classes'] = detections['detection_classes'].astype(np.int32)

    label_id_offset = 1
    image_np_with_detections = image_np.copy()

    viz_utils.visualize_boxes_and_labels_on_image_array(
        image_np_with_detections,
        detections['detection_boxes'],
        detections['detection_classes']+label_id_offset,
        detections['detection_scores'],
        category_index,
        use_normalized_coordinates=True,
        max_boxes_to_draw=1,
        min_score_thresh=0.85,
        agnostic_mode=False
    )

    cv2.imshow('Naruto Hand Seal Detection', cv2.resize(image_np_with_detections, (800, 600)))

    if cv2.waitKey(1) & 0xFF == ord('q'):
        cap.release()
        cv2.destroyAllWindows()
        cv2.waitKey(1)
        break
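The loop above only draws the single best box, and only when its score clears 0.85. If the detected seal is needed as a value (for example, to track a sequence of seals), the same idea can be sketched in plain Python. `top_detection` is a hypothetical helper, the sample values are made up, and the class indices are assumed to be zero-based, which is why `label_id_offset` is added before the lookup.

```python
def top_detection(classes, scores, id_to_name, min_score=0.85, label_id_offset=1):
    """Return (name, score) of the best detection, or None if below threshold."""
    if len(scores) == 0:
        return None
    best = max(range(len(scores)), key=lambda i: scores[i])
    if scores[best] < min_score:
        return None
    return id_to_name[int(classes[best]) + label_id_offset], float(scores[best])

# Made-up values for illustration; class indices are zero-based.
id_to_name = {1: 'rat', 2: 'ox', 3: 'tiger'}
result = top_detection(classes=[2, 0, 1], scores=[0.91, 0.40, 0.12], id_to_name=id_to_name)
print(result)
```

In the notebook, `classes` and `scores` would correspond to `detections['detection_classes']` and `detections['detection_scores']`.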

The model performs relatively well at identifying the hand seals in general, but a few seals are harder for it to distinguish, or are only recognised at certain angles. This could perhaps be overcome by providing more training samples taken at slightly different angles rather than only straight on to the camera.

In particular, the 

# Utilising MediaPipe Hands

Reference: [MediaPipe gesture recognizer customisation guide](https://developers.google.com/mediapipe/solutions/vision/gesture_recognizer/customize)

The dataset for gesture recognition in Model Maker requires the following format: `<dataset_path>/<label_name>/<img_name>.*`. In addition, one of the label names (`label_names`) must be `none`. The `none` label represents any gesture that isn't classified as one of the other gestures.
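Before handing a folder to Model Maker, the layout can be checked with a few lines of standard-library Python. This is a sketch under assumptions: `check_gesture_dataset` is a hypothetical helper, and the demonstration builds a throwaway folder with empty placeholder files instead of real images.

```python
import tempfile
from pathlib import Path

def check_gesture_dataset(dataset_path):
    """Return {label_name: file_count}; raise if the 'none' label is missing."""
    root = Path(dataset_path)
    counts = {
        folder.name: sum(1 for f in folder.iterdir() if f.is_file())
        for folder in sorted(root.iterdir()) if folder.is_dir()
    }
    if 'none' not in counts:
        raise ValueError("dataset needs a 'none' label folder")
    return counts

# Demonstration on a throwaway folder with empty placeholder files.
with tempfile.TemporaryDirectory() as demo_root:
    for label, n_files in (('rat', 2), ('ox', 3), ('none', 1)):
        folder = Path(demo_root) / label
        folder.mkdir()
        for i in range(n_files):
            (folder / f'{label}_{i}.jpg').touch()
    layout = check_gesture_dataset(demo_root)
print(layout)
```

The per-label counts also give a quick view of how balanced the gesture classes are.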

## Load the Dataset

In [3]:
model_path = '/Users/brkit/Documents/DSI33-Shawn/Tensorflow/workspace/training_naruto/pre-trained-models/mediapipe_hand_landmarker/hand_landmarker.task'

In [None]:
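# NOTE: dataset_path is not defined anywhere in this notebook; set it to the root
# folder laid out as <dataset_path>/<label_name>/<img_name>.* before running this cell
data = gesture_recognizer.Dataset.from_folder(
    dirname=dataset_path,
    hparams=gesture_recognizer.HandDataPreprocessingParams()
)
# keep 80% for training, then split the remainder evenly into validation and test
train_data, rest_data = data.split(0.8)
validation_data, test_data = rest_data.split(0.5)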

## Train the Model

In [None]:
hparams = gesture_recognizer.HParams(export_dir="exported_model")
options = gesture_recognizer.GestureRecognizerOptions(hparams=hparams)
model = gesture_recognizer.GestureRecognizer.create(
    train_data=train_data,
    validation_data=validation_data,
    options=options
)

## Evaluate Performance

In [None]:
loss, acc = model.evaluate(test_data, batch_size=1)
print(f"Test loss:{loss}, Test accuracy:{acc}")