# Recognizing Sign Language Using HMMs
- [Introduction](#intro)
- [Part 1: Feature Selection](#part1)
- [Part 2: HMMs Training](#part2)
- [Part 3: HMMs Testing](#part3)

<a id='intro'></a>
## Introduction
In this project, I build a word recognizer for American Sign Language (ASL) video sequences using Hidden Markov Models (HMMs). I use the [RWTH-BOSTON-104 Database](http://www-i6.informatik.rwth-aachen.de/~dreuw/database-rwth-boston-104.php), which consists of 201 videos of ASL sentences annotated with the x-y positions of the speaker's nose tip and left and right hands.

The HMMs are trained using the [hmmlearn](https://hmmlearn.readthedocs.io/en/latest/) library. 

This notebook is composed of three parts. In Part 1, I derive a variety of feature sets. In Part 2, I implement three different model selection criteria to determine the optimal number of hidden states for each word model. Finally, in Part 3, I train, test and compare the different models.  

<a id='part1'></a>
## PART 1: Feature Selection

Each video frame has been annotated with the positions of the speaker's nose, left hand and right hand. As features, I try combinations of Cartesian and polar coordinates, as well as the change in hand position between frames.

Loading the database and looking at the data for a single frame:

In [1]:
import math
import statistics
import warnings

import numpy as np
import pandas as pd
from hmmlearn.hmm import GaussianHMM
from sklearn.model_selection import KFold

from asl_data import AslDb, SinglesData
from asl_utils import combine_sequences, show_errors

asl = AslDb() # initializes the database using the AslDb class in the asl_data module
asl.df.loc[(37, 98), :]  # look at the data available for an individual frame (.ix was removed in pandas 1.0)

left-x         155
left-y         188
right-x        171
right-y        164
nose-x         169
nose-y          56
speaker    woman-2
Name: (37, 98), dtype: object

This corresponds to the following frame: ![Video 37](http://www-i6.informatik.rwth-aachen.de/~dreuw/database/rwth-boston-104/overview/images/orig/037-end.jpg)

As the first set of features, I compute the Cartesian coordinates of the left and right hands, min-max scaled to the range [0, 1]:

In [2]:
scaled_cartesian = ['cartesian-rx', 'cartesian-ry', 'cartesian-lx', 'cartesian-ly']

asl.df['cartesian-rx'] = (asl.df['right-x'] - asl.df['right-x'].min()) / (asl.df['right-x'].max() - asl.df['right-x'].min())
asl.df['cartesian-lx'] = (asl.df['left-x'] - asl.df['left-x'].min()) / (asl.df['left-x'].max() - asl.df['left-x'].min())
asl.df['cartesian-ry'] = (asl.df['right-y'] - asl.df['right-y'].min()) / (asl.df['right-y'].max() - asl.df['right-y'].min())
asl.df['cartesian-ly'] = (asl.df['left-y'] - asl.df['left-y'].min()) / (asl.df['left-y'].max() - asl.df['left-y'].min())
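
The four assignments above all apply the same min-max formula. As a minimal sketch, the formula can be isolated in a small helper; `min_max_scale` is a hypothetical name that is not part of the notebook, and the input values are made up for illustration:

```python
import numpy as np

def min_max_scale(values):
    """Rescale a 1-D array linearly so its minimum maps to 0 and its maximum to 1."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        # Constant input: avoid dividing by zero and return all zeros.
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)

# Toy x-coordinates (illustrative values, not taken from the database).
scaled = min_max_scale([150, 160, 170, 200])
print(scaled)
```

Because the minimum and maximum are taken over the whole column, the scaling is shared by all speakers and videos.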

As the second set of features, I use the polar coordinates of the hands relative to the nose (the values are not rescaled, despite the `norm_polar` name used in the code):

In [3]:
# first compute the hand coordinates relative to the nose

asl.df['grnd-ry'] = asl.df['right-y'] - asl.df['nose-y']
asl.df['grnd-rx'] = asl.df['right-x'] - asl.df['nose-x']
asl.df['grnd-ly'] = asl.df['left-y'] - asl.df['nose-y']
asl.df['grnd-lx'] = asl.df['left-x'] - asl.df['nose-x']

df_means = asl.df.groupby('speaker').mean()
df_std = asl.df.groupby('speaker').std()

norm_polar = ['polar-rr', 'polar-rtheta', 'polar-lr', 'polar-ltheta']

# radius: distance from the nose to the hand
# angle: arctan2 is called as (x, y) rather than (y, x), so theta is measured from the vertical axis
asl.df['polar-rr'] = np.sqrt(np.square(asl.df['grnd-rx']) + np.square(asl.df['grnd-ry']))
asl.df['polar-rtheta'] = np.arctan2(asl.df['grnd-rx'], asl.df['grnd-ry'])

asl.df['polar-lr'] = np.sqrt(np.square(asl.df['grnd-lx']) + np.square(asl.df['grnd-ly']))
asl.df['polar-ltheta'] = np.arctan2(asl.df['grnd-lx'], asl.df['grnd-ly'])
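
To make the angle convention concrete, here is a small sketch on made-up offsets. `to_polar` is a hypothetical helper, and the reading of the convention (image y grows downward, so a hand directly below the nose gets an angle of 0) is my interpretation of the code rather than something stated by the database:

```python
import numpy as np

def to_polar(dx, dy):
    """Convert offsets from the nose into (radius, angle).

    The angle uses arctan2(dx, dy), i.e. it is measured from the y axis,
    matching the argument order used in the notebook cell.
    """
    r = np.hypot(dx, dy)
    theta = np.arctan2(dx, dy)
    return r, theta

# Illustrative offsets: straight below the nose, straight to the side, and a 3-4-5 triangle.
dx = np.array([0.0, 100.0, 3.0])
dy = np.array([100.0, 0.0, 4.0])
r, theta = to_polar(dx, dy)
print(r)
print(theta)
```

With this argument order, the jump between -pi and +pi occurs for points directly above the nose, presumably a region the hands rarely visit.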

As the third set of features, I compute the change in hand position over a lag of two frames:

In [4]:
delta_2 = ['delta_2-rx', 'delta_2-ry', 'delta_2-lx', 'delta_2-ly']

asl.df['delta_2-rx'] = asl.df['right-x'].diff(2)
asl.df['delta_2-ry'] = asl.df['right-y'].diff(2)
asl.df['delta_2-lx'] = asl.df['left-x'].diff(2)
asl.df['delta_2-ly'] = asl.df['left-y'].diff(2)

# the first rows have no earlier frame to compare with; set their delta to 0
# (assign the result back: fillna(inplace=True) on a column selection may not update asl.df in recent pandas)
asl.df[delta_2] = asl.df[delta_2].fillna(0)

Finally, as the fourth set of features, I compute the change in hand position over a lag of three frames:

In [5]:
delta_3 = ['delta_3-rx', 'delta_3-ry', 'delta_3-lx', 'delta_3-ly']

asl.df['delta_3-rx'] = asl.df['right-x'].diff(3)
asl.df['delta_3-ry'] = asl.df['right-y'].diff(3)
asl.df['delta_3-lx'] = asl.df['left-x'].diff(3)
asl.df['delta_3-ly'] = asl.df['left-y'].diff(3)

# the first rows have no earlier frame to compare with; set their delta to 0
# (assign the result back: fillna(inplace=True) on a column selection may not update asl.df in recent pandas)
asl.df[delta_3] = asl.df[delta_3].fillna(0)
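
One caveat: `diff` on a whole column subtracts rows regardless of which video they belong to, so the first frames of each video are compared with the last frames of the previous one. A possible alternative, shown here only as a sketch on toy data, computes the delta inside each sequence separately. `delta_within_sequence` is a hypothetical helper and is not what the results in this notebook were produced with:

```python
import numpy as np

def delta_within_sequence(values, lag):
    """Difference between each value and the one `lag` steps earlier.

    The first `lag` entries have no predecessor and are set to 0, so nothing
    leaks in from a neighbouring sequence.
    """
    values = np.asarray(values, dtype=float)
    delta = np.zeros_like(values)
    if 0 < lag < len(values):
        delta[lag:] = values[lag:] - values[:-lag]
    return delta

# Two toy "videos" (illustrative x-coordinates only).
video_a = [10, 12, 15, 19]
video_b = [100, 90]
print(delta_within_sequence(video_a, lag=2))
print(delta_within_sequence(video_b, lag=2))
```

Applied per video, this keeps each sequence's features independent of the order in which videos are stored.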

Combining all features:

In [6]:
features = scaled_cartesian + norm_polar + delta_2 + delta_3

<a id='part2'></a>
## PART 2: HMMs Training

I determine the optimal number of states for each word model using three different selection criteria: 
- Log likelihood using cross-validation folds (CV)
- Bayesian Information Criterion (BIC)
- Discriminative Information Criterion (DIC) 

Implementing the base class for model selection:

In [7]:
class ModelSelector(object):
    '''
    base class for model selection
    '''

    def __init__(self, all_word_sequences: dict, all_word_Xlengths: dict, this_word: str,
                 n_constant=3,
                 min_n_components=2, max_n_components=10,
                 random_state=14, verbose=False):
        self.words = all_word_sequences
        self.hwords = all_word_Xlengths
        self.sequences = all_word_sequences[this_word]
        self.X, self.lengths = all_word_Xlengths[this_word]
        self.this_word = this_word
        self.n_constant = n_constant
        self.min_n_components = min_n_components
        self.max_n_components = max_n_components
        self.random_state = random_state
        self.verbose = verbose

Implementing the selector that picks the number of states with the best average log-likelihood over k-fold cross-validation:

In [8]:
class SelectorCV(ModelSelector):
    ''' select best model based on average log Likelihood of cross-validation folds
    '''

    def select(self):
        warnings.filterwarnings("ignore", category=DeprecationWarning)

        best_model = None
        max_logL = float('-inf')

        # use at most 3 folds; with a single sequence there is nothing to hold out
        number_splits = min(len(self.sequences), 3)
        if number_splits < 2:
            return best_model

        split_method = KFold(n_splits=number_splits)

        for num_states in range(self.min_n_components, self.max_n_components + 1):
            current_logL = 0
            try:
                for cv_train_idx, cv_test_idx in split_method.split(self.sequences):
                    X_train, lengths_train = combine_sequences(cv_train_idx, self.sequences)
                    X_test, lengths_test = combine_sequences(cv_test_idx, self.sequences)

                    model = GaussianHMM(n_components=num_states, covariance_type="diag", n_iter=1000,
                                        random_state=self.random_state, verbose=False).fit(X_train, lengths_train)
                    current_logL += model.score(X_test, lengths_test)
            except Exception:
                # fitting or scoring can fail for some state counts; skip those
                continue

            # average held-out log-likelihood over the folds
            current_logL /= number_splits
            if current_logL > max_logL:
                max_logL = current_logL
                # note: this is the model fitted on the last fold's training split
                best_model = model

        return best_model
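
The selector relies on `combine_sequences` to turn a list of fold indices into the `(X, lengths)` pair that hmmlearn's `fit` and `score` accept. The sketch below shows what such a helper might look like; `concat_sequences` is a hypothetical stand-in, not the actual `asl_utils` implementation, and the sequences are toy data:

```python
import numpy as np

def concat_sequences(indices, sequences):
    """Stack the selected sequences into one array and record each sequence's length.

    hmmlearn expects all frames concatenated row-wise in X, plus a list of
    per-sequence lengths so it knows where one sequence ends and the next begins.
    """
    chosen = [np.asarray(sequences[i], dtype=float) for i in indices]
    X = np.concatenate(chosen, axis=0)
    lengths = [len(seq) for seq in chosen]
    return X, lengths

# Three toy sequences with two features per frame (illustrative values only).
sequences = [
    [[0.1, 0.2], [0.2, 0.3], [0.3, 0.4]],
    [[0.5, 0.6], [0.6, 0.7]],
    [[0.9, 1.0]],
]
X_train, lengths_train = concat_sequences([0, 2], sequences)
print(X_train.shape, lengths_train)  # (4, 2) [3, 1]
```

The lengths always sum to the number of rows in `X`, which is a quick sanity check before fitting.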

Implementing the BIC model selector, which selects the model with the lowest BIC value, where BIC = -2 * logL + p * log(N), with logL the log-likelihood of the model, p the number of free parameters of the model, and N the number of observations. The p * log(N) term penalizes models with a high number of states.

In [9]:
class SelectorBIC(ModelSelector):
    """ select the model with the lowest Bayesian Information Criterion(BIC) score
    """

    def select(self):
        """ select the best model for self.this_word based on
        BIC score for n between self.min_n_components and self.max_n_components

        :return: GaussianHMM object
        """
        warnings.filterwarnings("ignore", category=DeprecationWarning)

        best_model = None
        lowest_BIC_score = float('inf')

        for num_states in range(self.min_n_components, self.max_n_components + 1):
            try:
                model = GaussianHMM(n_components=num_states, covariance_type="diag", n_iter=1000,
                                    random_state=self.random_state, verbose=False).fit(self.X, self.lengths)
                logL = model.score(self.X, self.lengths)
            except Exception:
                # fitting or scoring can fail for some state counts; skip those
                continue

            # free parameters of a diagonal-covariance Gaussian HMM:
            #   transition probabilities:     num_states * (num_states - 1)
            #   initial state probabilities:  num_states - 1
            #   emission means and variances: 2 * num_states * num_features
            # total: num_states^2 - 1 + 2 * num_states * num_features
            num_features = len(self.X[0])
            num_params = num_states * num_states - 1 + 2 * num_states * num_features
            num_frames = len(self.X)

            current_BIC_score = num_params * math.log(num_frames) - 2 * logL
            if current_BIC_score < lowest_BIC_score:
                lowest_BIC_score = current_BIC_score
                best_model = model

        return best_model
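
The parameter count is the part of BIC that is easiest to get wrong, so it can help to check it in isolation. The following is a minimal sketch with hypothetical helper names (`count_free_parameters`, `bic_score`), assuming a Gaussian HMM with diagonal covariance as in the selector:

```python
import math

def count_free_parameters(n_states, n_features):
    """Free parameters of a Gaussian HMM with diagonal covariance."""
    transitions = n_states * (n_states - 1)   # each row of the transition matrix sums to 1
    initial = n_states - 1                    # initial state distribution sums to 1
    emissions = 2 * n_states * n_features     # one mean and one variance per state and feature
    return transitions + initial + emissions

def bic_score(log_likelihood, n_states, n_features, n_observations):
    """BIC = -2 * logL + p * log(N); lower is better."""
    p = count_free_parameters(n_states, n_features)
    return -2.0 * log_likelihood + p * math.log(n_observations)

# With 3 states and the 16 features used in this notebook: 6 + 2 + 96 = 104 parameters.
print(count_free_parameters(3, 16))  # 104
```

The count agrees with the closed form used in the selector: 3^2 - 1 + 2 * 3 * 16 = 104.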

Implementing the DIC model selector, which selects the model with the highest DIC value. DIC favors discrimination between classes: it is the difference between the log-likelihood of the model on the word it was trained on and the average log-likelihood of the same model on all other words. It is computed following Biem, Alain. "A model selection criterion for classification: Application to HMM topology optimization." Proceedings of the Seventh International Conference on Document Analysis and Recognition. IEEE, 2003:

In [10]:
class SelectorDIC(ModelSelector):
    ''' select best model based on Discriminative Information Criterion    
    '''

    def select(self):
        warnings.filterwarnings("ignore", category=DeprecationWarning)

        best_model = None
        highest_DIC_score = float('-inf')
        other_words = [word for word in self.words if word != self.this_word]

        for num_states in range(self.min_n_components, self.max_n_components + 1):
            try:
                model = GaussianHMM(n_components=num_states, covariance_type="diag", n_iter=1000,
                                    random_state=self.random_state, verbose=False).fit(self.X, self.lengths)

                # log-likelihood of the model on the word it was trained on
                logL_evidence = model.score(self.X, self.lengths)

                # summed log-likelihood of the same model on every other word
                logL_antievidence = sum(model.score(*self.hwords[word]) for word in other_words)

                current_DIC_score = logL_evidence - logL_antievidence / len(other_words)
            except Exception:
                # fitting or scoring can fail for some state counts; skip those
                continue

            if current_DIC_score > highest_DIC_score:
                highest_DIC_score = current_DIC_score
                best_model = model

        return best_model
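
Stripped of the model fitting, the score computed in the selector reduces to a single line of arithmetic. A minimal sketch, with a hypothetical `dic_score` helper and made-up log-likelihood values:

```python
def dic_score(log_l_this_word, log_l_other_words):
    """Log-likelihood on the target word minus the mean log-likelihood on all other words."""
    if not log_l_other_words:
        raise ValueError("DIC needs at least one competing word")
    return log_l_this_word - sum(log_l_other_words) / len(log_l_other_words)

# Illustrative values: the model fits its own word much better than the two others.
score = dic_score(-100.0, [-300.0, -500.0])
print(score)  # 300.0
```

A larger score means the model explains its own word well while explaining the competing words poorly, which is the behaviour the criterion rewards.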

<a id='part3'></a>
## PART 3: HMMs Testing

Here I test the HMMs on a test set of about 40 sentences (178 words in total). With a vocabulary of 112 trained words, the word error rate (WER) is 50% for the models selected by cross-validation, 45% for BIC, and 48% for DIC.

I first implement the train_all_words function, which builds the training set, then trains and selects an HMM for each word according to the chosen selection criterion:

In [11]:
def train_all_words(features, model_selector):
    training = asl.build_training(features)
    sequences = training.get_all_sequences()
    Xlengths = training.get_all_Xlengths()
    model_dict = {}
    for word in training.words:
        model = model_selector(sequences, Xlengths, word, 
                        n_constant=3).select()
        model_dict[word]=model
    return model_dict

Building test data:

In [12]:
test_set = asl.build_test(features)

Here I implement a function that outputs the log-likelihood of each word for each test sequence, along with the word with the highest log-likelihood value: 

In [13]:
def recognize(models: dict, test_set: SinglesData):
    """ Recognize test word sequences from word models set

    :param models: dict of trained models
        {'SOMEWORD': GaussianHMM model object, 'SOMEOTHERWORD': GaussianHMM model object, ...}
    :param test_set: SinglesData object
    :return: (list, list)  as probabilities, guesses
        both lists are ordered by the test set word_id
        probabilities is a list of dictionaries where each key is a word and each value is its log-likelihood
            [{'SOMEWORD': LogLvalue, 'SOMEOTHERWORD': LogLvalue, ... },
             {'SOMEWORD': LogLvalue, 'SOMEOTHERWORD': LogLvalue, ... },
             ]
        guesses is a list of the best guess words ordered by the test set word_id
            ['WORDGUESS0', 'WORDGUESS1', 'WORDGUESS2',...]
    """
    warnings.filterwarnings("ignore", category=DeprecationWarning)

    probabilities = []
    guesses = []
    for word_id in range(test_set.num_items):
        X_test, lengths_test = test_set.get_item_Xlengths(word_id)
        logL = {}
        for word, model in models.items():
            try:
                logL[word] = model.score(X_test, lengths_test)
            except Exception:
                # skip words without a model (None) or whose model cannot score this sequence
                continue
        probabilities.append(logL)
        # the best guess is the word whose model gives the highest log-likelihood
        best_guess = max(logL, key=logL.get)
        guesses.append(best_guess)

    return probabilities, guesses
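
The decision rule itself can be tried without any trained HMM. In the sketch below, `StubModel` and `best_guess` are hypothetical names, and the stub simply returns a fixed log-likelihood from `score` where a fitted model would compute one:

```python
class StubModel:
    """Hypothetical stand-in that exposes score(X, lengths) like a fitted HMM."""

    def __init__(self, log_likelihood):
        self._log_likelihood = log_likelihood

    def score(self, X, lengths=None):
        # A real model would compute this from X; the stub returns a fixed value.
        return self._log_likelihood

def best_guess(models, X, lengths):
    """Score one test sequence against every word model and return (scores, best word)."""
    scores = {}
    for word, model in models.items():
        if model is None:
            # No model could be trained for this word.
            continue
        scores[word] = model.score(X, lengths)
    guess = max(scores, key=scores.get) if scores else None
    return scores, guess

models = {"JOHN": StubModel(-120.5), "MARY": StubModel(-98.2), "BOOK": None}
scores, guess = best_guess(models, X=[[0.0]], lengths=[1])
print(guess)  # MARY
```

Unlike the notebook function, this sketch returns `None` when no model can score the sequence, which is one possible way to handle that edge case.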

Finally, I test the models. 

First, testing the models whose number of states was selected by cross-validation. Results are displayed as follows: the ground truth for each sentence is in the right column, and the recognized words are in the left column. Errors are marked with an asterisk (WER: word error rate).
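
The WER figures in the listings that follow are consistent with a simple position-by-position comparison: the number of wrong words divided by the total number of words. The sketch below assumes that definition; `word_error_rate` is a hypothetical helper, not the `show_errors` implementation, and the example words are those of video 7 in the cross-validation results:

```python
def word_error_rate(guesses, truth):
    """Fraction of positions where the recognized word differs from the ground truth."""
    if len(guesses) != len(truth):
        raise ValueError("guesses and truth must have the same length")
    if not truth:
        raise ValueError("cannot compute WER on an empty word list")
    errors = sum(g != t for g, t in zip(guesses, truth))
    return errors / len(truth)

# Video 7: recognized "JOHN HAVE GO HAVE", correct "JOHN CAN GO CAN".
wer = word_error_rate(["JOHN", "HAVE", "GO", "HAVE"], ["JOHN", "CAN", "GO", "CAN"])
print(wer)  # 0.5
```

This differs from the edit-distance WER common in speech recognition: each test item is an isolated word, so there are no insertions or deletions to account for.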

In [14]:
model_selector = SelectorCV

models = train_all_words(features, model_selector)
test_set = asl.build_test(features)
probabilities, guesses = recognize(models, test_set)
show_errors(guesses, test_set)


**** WER = 0.5
Total correct: 89 out of 178
Video  Recognized                                                    Correct
    2: *POSS *NEW *ARRIVE                                            JOHN WRITE HOMEWORK
    7: JOHN *HAVE GO *HAVE                                           JOHN CAN GO CAN
   12: JOHN CAN *SOMETHING-ONE CAN                                   JOHN CAN GO CAN
   21: JOHN *JOHN WONT *JOHN *CAR *CAR *FUTURE *FUTURE               JOHN FISH WONT EAT BUT CAN EAT CHICKEN
   25: JOHN *IX IX IX IX                                             JOHN LIKE IX IX IX
   28: *MARY *IX IX IX IX                                            JOHN LIKE IX IX IX
   30: *IX LIKE IX IX IX                                             JOHN LIKE IX IX IX
   36: MARY *MARY *SOMETHING-ONE IX *IX *MARY                        MARY VEGETABLE KNOW IX LIKE CORN1
   40: JOHN IX *CORN MARY *MARY                                      JOHN IX THINK MARY LOVE
   43: JOHN *POSS BUY HOUSE                        

Testing the models selected using BIC:

In [15]:
model_selector = SelectorBIC

models = train_all_words(features, model_selector)
test_set = asl.build_test(features)
probabilities, guesses = recognize(models, test_set)
show_errors(guesses, test_set)


**** WER = 0.449438202247191
Total correct: 98 out of 178
Video  Recognized                                                    Correct
    2: JOHN WRITE *ARRIVE                                            JOHN WRITE HOMEWORK
    7: JOHN CAN GO *ARRIVE                                           JOHN CAN GO CAN
   12: JOHN CAN *GO1 CAN                                             JOHN CAN GO CAN
   21: JOHN *ARRIVE WONT *JOHN *STUDENT *GIVE1 *FUTURE *MARY         JOHN FISH WONT EAT BUT CAN EAT CHICKEN
   25: JOHN LIKE *MARY IX *MARY                                      JOHN LIKE IX IX IX
   28: JOHN LIKE IX *JOHN IX                                         JOHN LIKE IX IX IX
   30: JOHN LIKE *MARY IX IX                                         JOHN LIKE IX IX IX
   36: MARY *JOHN *IX IX *MARY *MARY                                 MARY VEGETABLE KNOW IX LIKE CORN1
   40: JOHN IX *CORN MARY *MARY                                      JOHN IX THINK MARY LOVE
   43: JOHN *SHOULD BUY HOUSE        

Testing the models selected using DIC:

In [16]:
model_selector = SelectorDIC

models = train_all_words(features, model_selector)
test_set = asl.build_test(features)
probabilities, guesses = recognize(models, test_set)
show_errors(guesses, test_set)


**** WER = 0.47752808988764045
Total correct: 93 out of 178
Video  Recognized                                                    Correct
    2: JOHN *BUY *ARRIVE                                             JOHN WRITE HOMEWORK
    7: JOHN *CAR GO *JOHN                                            JOHN CAN GO CAN
   12: JOHN CAN *GIVE1 CAN                                           JOHN CAN GO CAN
   21: JOHN *WHO *JOHN *JOHN *WHAT *GIVE1 *FUTURE *MARY              JOHN FISH WONT EAT BUT CAN EAT CHICKEN
   25: JOHN *JOHN *LOVE IX IX                                        JOHN LIKE IX IX IX
   28: JOHN LIKE IX *JOHN IX                                         JOHN LIKE IX IX IX
   30: JOHN *MARY IX IX IX                                           JOHN LIKE IX IX IX
   36: MARY *JOHN *IX IX *MARY *MARY                                 MARY VEGETABLE KNOW IX LIKE CORN1
   40: JOHN IX *CORN MARY *MARY                                      JOHN IX THINK MARY LOVE
   43: JOHN *JOHN BUY HOUSE        