# Sign Language Translator with MediaPipe and Transformers

This notebook walks through the process of building a real-time sign language translator. We will use **MediaPipe** for pose, face, and hand landmark detection and a **Transformer** model built with **TensorFlow/Keras** for gesture classification.

## 1. Install and Import Dependencies

In [1]:
# !pip install tensorflow opencv-python mediapipe scikit-learn matplotlib

import cv2
import numpy as np
import os
from matplotlib import pyplot as plt
import time
import mediapipe as mp
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Layer, Dense, Dropout, LayerNormalization, MultiHeadAttention, Input
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.metrics import multilabel_confusion_matrix, accuracy_score

## 2. Keypoints using MediaPipe Holistic

We'll start by setting up the MediaPipe Holistic model, which detects pose, face, and hand landmarks in a single pass. We'll also define helper functions that run detection on a video frame and draw the resulting landmarks.

In [2]:
# Holistic model for pose, face, and hand tracking
mp_holistic = mp.solutions.holistic
# Drawing utilities to visualize landmarks
mp_drawing = mp.solutions.drawing_utils

def mediapipe_detection(image, model):
    """
    Takes an image and a MediaPipe model, processes the image, and returns the detection results.
    """
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) # COLOR CONVERSION BGR 2 RGB
    image.flags.writeable = False                  # Image is no longer writeable
    results = model.process(image)                 # Make prediction
    image.flags.writeable = True                   # Image is now writeable
    image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR) # COLOR CONVERSION RGB 2 BGR
    return image, results

def draw_styled_landmarks(image, results):
    """
    Draws the detected landmarks on the image with custom styling.
    """
    # Draw face connections
    mp_drawing.draw_landmarks(image, results.face_landmarks, mp_holistic.FACEMESH_TESSELATION,
                             mp_drawing.DrawingSpec(color=(80,110,10), thickness=1, circle_radius=1),
                             mp_drawing.DrawingSpec(color=(80,255,121), thickness=1, circle_radius=1)
                             )
    # Draw pose connections
    mp_drawing.draw_landmarks(image, results.pose_landmarks, mp_holistic.POSE_CONNECTIONS,
                             mp_drawing.DrawingSpec(color=(80,22,10), thickness=2, circle_radius=4),
                             mp_drawing.DrawingSpec(color=(80,44,121), thickness=2, circle_radius=2)
                             )
    # Draw left hand connections
    mp_drawing.draw_landmarks(image, results.left_hand_landmarks, mp_holistic.HAND_CONNECTIONS,
                             mp_drawing.DrawingSpec(color=(121,22,76), thickness=2, circle_radius=4),
                             mp_drawing.DrawingSpec(color=(121,44,250), thickness=2, circle_radius=2)
                             )
    # Draw right hand connections
    mp_drawing.draw_landmarks(image, results.right_hand_landmarks, mp_holistic.HAND_CONNECTIONS,
                             mp_drawing.DrawingSpec(color=(245,117,66), thickness=2, circle_radius=4),
                             mp_drawing.DrawingSpec(color=(245,66,230), thickness=2, circle_radius=2)
                             )

## 3. Extract Keypoint Values

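This function takes the landmark data from MediaPipe and flattens it into a single NumPy array, substituting zeros for any part (pose, face, or a hand) that was not detected so the array always has the same length. This array is the per-frame input feature for our model.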

In [3]:
def extract_keypoints(results):
    """
    Extracts the coordinates of all landmarks into a single flattened NumPy array.
    """
    pose = np.array([[res.x, res.y, res.z, res.visibility] for res in results.pose_landmarks.landmark]).flatten() if results.pose_landmarks else np.zeros(33*4)
    face = np.array([[res.x, res.y, res.z] for res in results.face_landmarks.landmark]).flatten() if results.face_landmarks else np.zeros(468*3)
    lh = np.array([[res.x, res.y, res.z] for res in results.left_hand_landmarks.landmark]).flatten() if results.left_hand_landmarks else np.zeros(21*3)
    rh = np.array([[res.x, res.y, res.z] for res in results.right_hand_landmarks.landmark]).flatten() if results.right_hand_landmarks else np.zeros(21*3)
    return np.concatenate([pose, face, lh, rh])
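
As a quick sanity check on the layout that `extract_keypoints` produces, the per-frame vector length can be worked out from the landmark counts used in the zero-filled fallbacks above. This is only a sketch: `LANDMARK_LAYOUT`, `feature_length`, and `slice_offsets` are illustrative names that are not part of the notebook.

```python
# Landmark counts and values per landmark, taken from the np.zeros(...)
# fallbacks in extract_keypoints. All names here are illustrative only.
LANDMARK_LAYOUT = {
    "pose": (33, 4),        # x, y, z, visibility
    "face": (468, 3),       # x, y, z
    "left_hand": (21, 3),   # x, y, z
    "right_hand": (21, 3),  # x, y, z
}

def feature_length(layout):
    """Total number of values in one flattened frame vector."""
    return sum(count * dims for count, dims in layout.values())

def slice_offsets(layout):
    """Start and end index of each part inside the concatenated vector."""
    offsets, start = {}, 0
    for name, (count, dims) in layout.items():
        end = start + count * dims
        offsets[name] = (start, end)
        start = end
    return offsets

print(feature_length(LANDMARK_LAYOUT))              # 1662
print(slice_offsets(LANDMARK_LAYOUT)["left_hand"])  # (1536, 1599)
```

The total of 1662 is the size of the last input dimension the model expects, and the offsets can be used to slice one part (for example, only the hands) out of a saved frame.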

## 4. Setup Folders for Collection

We need to create a directory structure to store our collected data. Each action will have its own folder, containing subfolders for each video sequence.

In [None]:
# Path for exported data, numpy arrays
DATA_PATH = os.path.join('MP_Data')

# Actions that we try to detect
actions = np.array(['hello', 'thanks', 'iloveyou'])

# Thirty videos worth of data
num_sequences = 30

# Videos are going to be 30 frames in length
sequence_length = 30

# Create folders for each action
for action in actions:
    for sequence in range(num_sequences):
        # exist_ok=True keeps the cell safe to re-run without a bare try/except
        os.makedirs(os.path.join(DATA_PATH, action, str(sequence)), exist_ok=True)

## 5. Collect Keypoint Values for Training and Testing

This is the data collection step. We'll use OpenCV to capture video from the webcam. For each sign we record `num_sequences` sequences of `sequence_length` frames, and for each frame we save the extracted keypoints as a `.npy` file in that sequence's folder.
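
The collection cell itself is not shown here, so the following is only a minimal sketch of the saving logic under stated assumptions. The function `collect_sequences` and its parameters `frame_source` and `extract_fn` are hypothetical names, and the webcam and MediaPipe are replaced by stand-ins so the example runs without a camera.

```python
import os
import tempfile
import numpy as np

def collect_sequences(frame_source, extract_fn, data_path, actions,
                      num_sequences, sequence_length):
    """Save one .npy keypoint file per frame under data_path/action/sequence/."""
    frames = iter(frame_source)
    for action in actions:
        for sequence in range(num_sequences):
            seq_dir = os.path.join(data_path, action, str(sequence))
            os.makedirs(seq_dir, exist_ok=True)
            for frame_num in range(sequence_length):
                frame = next(frames)
                keypoints = extract_fn(frame)
                # np.save appends ".npy", giving "<frame_num>.npy"
                np.save(os.path.join(seq_dir, str(frame_num)), keypoints)

# Stand-ins (assumptions) for the webcam feed and the MediaPipe pipeline.
def fake_frames():
    while True:
        yield np.zeros((480, 640, 3), dtype=np.uint8)

def fake_extract(frame):
    return np.zeros(1662)

demo_dir = tempfile.mkdtemp()
collect_sequences(fake_frames(), fake_extract, demo_dir,
                  actions=["hello"], num_sequences=2, sequence_length=3)
print(sorted(os.listdir(os.path.join(demo_dir, "hello", "0"))))
# ['0.npy', '1.npy', '2.npy']
```

In the notebook, `frame_source` would presumably yield frames read from `cv2.VideoCapture`, and `extract_fn` would call `mediapipe_detection` followed by `extract_keypoints`. The file naming matches what the loading code in the next section expects.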

## 6. Preprocess Data and Create Labels and Features

Now that we've collected the data, we need to load it from the saved `.npy` files, create corresponding labels, and split it into training and testing sets.

In [None]:
label_map = {label:num for num, label in enumerate(actions)}

sequences, labels = [], []
for action in actions:
    for sequence in range(num_sequences):
        window = []
        for frame_num in range(sequence_length):
            res = np.load(os.path.join(DATA_PATH, action, str(sequence), "{}.npy".format(frame_num)))
            window.append(res)
        sequences.append(window)
        labels.append(label_map[action])

X = np.array(sequences)
y = to_categorical(labels).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05)
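
To make the resulting shapes concrete, here is a small sketch that rebuilds them with placeholder data instead of the recorded `.npy` files. The one-hot step is written out with `np.eye` as an illustration of what `to_categorical` produces for integer labels. The split sizes assume that scikit-learn rounds the test fraction up, which agrees with the five test samples in the evaluation output later in the notebook.

```python
import math
import numpy as np

num_actions, num_sequences, sequence_length, num_features = 3, 30, 30, 1662

# Placeholder data with the same shape as the stacked .npy files.
X = np.zeros((num_actions * num_sequences, sequence_length, num_features),
             dtype=np.float32)
labels = np.repeat(np.arange(num_actions), num_sequences)

# One-hot encoding written out by hand: row i of the identity matrix
# is the one-hot vector for class i.
y = np.eye(num_actions, dtype=int)[labels]

# Sizes of a 5% test split over 90 samples (assumes the test count is rounded up).
n_test = math.ceil(0.05 * len(X))
n_train = len(X) - n_test

print(X.shape, y.shape)  # (90, 30, 1662) (90, 3)
print(n_train, n_test)   # 85 5
```

With only five test samples, the reported test accuracy is a very coarse estimate, so a larger `test_size` may be worth considering.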

## 7. Build the Transformer Model

Here's the core of our project. We define a Transformer block, which includes multi-head attention and a feed-forward network. We also create a custom embedding layer that adds positional information to our input keypoints. Finally, we assemble these components into a Keras model.

In [None]:
# numpy, tensorflow, and the other layer classes were already imported in section 1
from tensorflow.keras.layers import GlobalAveragePooling1D
from tensorflow.keras.models import Model

class TransformerBlock(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        if embed_dim % num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")
        # key_dim is size per head
        self.att = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim // num_heads)
        self.ffn = tf.keras.Sequential(
            [
            Dense(ff_dim, activation="relu"),
            Dense(embed_dim)
            ]
        )
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, training=None):
        # `training` defaults to None because Keras may call the layer without it while building the model
        attn_output = self.att(inputs, inputs, training=training)  # self-attention
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

class TokenAndPositionEmbedding(Layer):
    def __init__(self, maxlen, embed_dim):
        super().__init__()
        self.proj = Dense(embed_dim)  # project raw numeric features to embed_dim
        self.pos_emb = tf.keras.layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        # x shape: (batch, seq_len, features)
        seq_len = tf.shape(x)[1]
        positions = tf.range(start=0, limit=seq_len, delta=1)
        pos_embeddings = self.pos_emb(positions)            # (seq_len, embed_dim)
        x = self.proj(x)                                   # (batch, seq_len, embed_dim)
        return x + pos_embeddings                          # broadcast add (batch, seq_len, embed_dim)

# -------------------------
# Build example model
# -------------------------
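embed_dim = 64   # must be divisible by num_heads
num_heads = 4
ff_dim = 64
sequence_length = 30  # frames per sequence, as in section 4
num_keypoints = 1662  # values per frame: 33*4 (pose) + 468*3 (face) + 2 * 21*3 (hands)

actions = np.array(['hello', 'thanks', 'iloveyou'])  # same labels as in section 4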

inputs = Input(shape=(sequence_length, num_keypoints))
x = TokenAndPositionEmbedding(sequence_length, embed_dim)(inputs)
x = TransformerBlock(embed_dim, num_heads, ff_dim)(x)
x = GlobalAveragePooling1D()(x)
x = Dropout(0.1)(x)
x = Dense(20, activation="relu")(x)
x = Dropout(0.1)(x)
outputs = Dense(actions.shape[0], activation="softmax")(x)

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['categorical_accuracy'])
model.summary()
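
To illustrate what the attention step inside `TransformerBlock` computes, here is a simplified single-head sketch of scaled dot-product self-attention in NumPy. It is not how Keras implements `MultiHeadAttention` (which runs several heads and adds an output projection). The random weights and the helper names `softmax` and `self_attention` are assumptions made for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, embed_dim). Returns the attended output and the weights."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, embed_dim, head_dim = 30, 64, 16    # 64 // 4 heads = 16 per head
x = rng.normal(size=(seq_len, embed_dim))
w_q, w_k, w_v = (rng.normal(size=(embed_dim, head_dim)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (30, 16) (30, 30)
```

Each of the 30 frames gets a weight over every other frame in the sequence, which is how the block mixes information across time before the pooling layer averages the frames.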


## 8. Train the Model

Now we'll train our Transformer model, using a `TensorBoard` callback to log the training process. After training, the model is saved to `action.h5`; if that file already exists, its weights are loaded and training is skipped.

In [13]:
log_dir = os.path.join('Logs')
tb_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir)

# Check if a pre-trained model exists
if os.path.exists('action.h5'):
    print("Loading pre-trained model...")
    model.load_weights('action.h5')
else:
    print("Training new model...")
    model.fit(X_train, y_train, epochs=200, callbacks=[tb_callback])
    model.save('action.h5')

Training new model...
Epoch 1/200
3/3 ━━━━━━━━━━━━━━━━━━━━ 5s 50ms/step - categorical_accuracy: 0.3444 - loss: 1.4752
Epoch 2/200
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 39ms/step - categorical_accuracy: 0.3366 - loss: 1.1078
Epoch 3/200
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step - categorical_accuracy: 0.3288 - loss: 1.1056
Epoch 4/200
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step - categorical_accuracy: 0.2720 - loss: 1.1019
Epoch 5/200
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step - categorical_accuracy: 0.3602 - loss: 1.1076
Epoch 6/200
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step - categorical_accuracy: 0.3131 - loss: 1.0947
Epoch 7/200
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step - categorical_accuracy: 0.2407 - loss: 1.1048
Epoch 8/200
...



## 9. Make Predictions and Evaluate

Let's evaluate the model's performance on the test data we set aside earlier. We'll look at the confusion matrix and the overall accuracy.

In [14]:
yhat = model.predict(X_test)
ytrue = np.argmax(y_test, axis=1).tolist()
yhat = np.argmax(yhat, axis=1).tolist()

print("Confusion Matrix:\n", multilabel_confusion_matrix(ytrue, yhat))
print("Accuracy Score:", accuracy_score(ytrue, yhat))

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 172ms/step
Confusion Matrix:
 [[[4 0]
  [0 1]]

 [[3 0]
  [0 2]]

 [[3 0]
  [0 2]]]
Accuracy Score: 1.0
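
To help read the output above: `multilabel_confusion_matrix` returns one 2x2 matrix per class in the layout `[[TN, FP], [FN, TP]]`, treating that class as positive and all other classes as negative. The pure-Python sketch below shows the counting. The function name `one_vs_rest_confusion` is illustrative, and the label lists are hypothetical values chosen to reproduce the printed counts.

```python
def one_vs_rest_confusion(y_true, y_pred, num_classes):
    """Return one [[TN, FP], [FN, TP]] matrix per class."""
    matrices = []
    for c in range(num_classes):
        tn = fp = fn = tp = 0
        for t, p in zip(y_true, y_pred):
            if t == c and p == c:
                tp += 1      # class c, predicted c
            elif t == c:
                fn += 1      # class c, predicted something else
            elif p == c:
                fp += 1      # another class, predicted c
            else:
                tn += 1      # another class, not predicted c
        matrices.append([[tn, fp], [fn, tp]])
    return matrices

# Hypothetical labels: one 'hello', two 'thanks', two 'iloveyou', all predicted correctly.
y_true = [0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 2, 2]
print(one_vs_rest_confusion(y_true, y_pred, 3))
# [[[4, 0], [0, 1]], [[3, 0], [0, 2]], [[3, 0], [0, 2]]]
```

Zeros in the off-diagonal positions mean there were no false positives or false negatives for any class, which is why the accuracy is 1.0 on these five samples.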


## 10. Test in Real Time
