# Sign Language Recognition System

Reduced notebook solving the 'Sign Language Recognition System' project of the [Udacity's Artificial Intelligence Nanodegree](https://www.udacity.com/course/artificial-intelligence-nanodegree--nd889). 

Source: https://github.com/udacity/AIND-Recognizer

The goal of this project is to build a word recognizer for American Sign Language video sequences, demonstrating the power of probabalistic models.  In particular, this project employs  [hidden Markov models (HMM's)](https://en.wikipedia.org/wiki/Hidden_Markov_model) to analyze a series of measurements taken from videos of American Sign Language (ASL) collected for research (see the [RWTH-BOSTON-104 Database](http://www-i6.informatik.rwth-aachen.de/~dreuw/database-rwth-boston-104.php)).  In this video, the right-hand x and y locations are plotted as the speaker signs the sentence.
[![ASLR demo](http://www-i6.informatik.rwth-aachen.de/~dreuw/images/demosample.png)](https://drive.google.com/open?id=0B_5qGuFe-wbhUXRuVnNZVnMtam8)

The raw data, train, and test sets are pre-defined. The project comprises three parts: <br>
<ol>
<li>Create a variety of feature sets</li>
<li>Implement three different model selection criterion to determine the optimal number of hidden states for each word model</li>
<li>Implement the recognizer and compare the effects the different combinations of feature sets and model selection criteria</li> 
</ol>


This project requires Python 3 and the following Python libraries installed:

[NumPy](http://www.numpy.org/), [SciPy](https://www.scipy.org/), [scikit-learn](http://scikit-learn.org/0.17/install.html), [pandas](http://pandas.pydata.org/), [matplotlib](http://matplotlib.org/), [jupyter](http://ipython.org/notebook.html), [hmmlearn](http://hmmlearn.readthedocs.io/en/latest/)


# Data and Features

The data handler class `AslDb` creates the initial pandas dataframe as well as dictionaries suitable for extracting data to the [hmmlearn](https://hmmlearn.readthedocs.io/en/latest/) library

In [1]:
import numpy as np
import pandas as pd
from asl_recognizer.asl_data import AslDb

asl = AslDb() # initializes the database
asl.df.head() # displays the first rows of the asl database, indexed by video and frame

Unnamed: 0_level_0,Unnamed: 1_level_0,left-x,left-y,right-x,right-y,nose-x,nose-y,speaker
video,frame,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
98,0,149,181,170,175,161,62,woman-1
98,1,149,181,170,175,161,62,woman-1
98,2,149,181,170,175,161,62,woman-1
98,3,149,181,170,175,161,62,woman-1
98,4,149,181,170,175,161,62,woman-1


In [2]:
asl.df.loc[98,1]  # look at the data available for an individual frame

left-x         149
left-y         181
right-x        170
right-y        175
nose-x         161
nose-y          62
speaker    woman-1
Name: (98, 1), dtype: object

The frame represented by video 98, frame 1 is shown here:
![Video 98](http://www-i6.informatik.rwth-aachen.de/~dreuw/database/rwth-boston-104/overview/images/orig/098-start.jpg)

## Features example

##### Ground position (hand relative to nose)

In [3]:
asl.df['grnd-ry'] = asl.df['right-y'] - asl.df['nose-y']
asl.df['grnd-ly'] = asl.df['left-y'] - asl.df['nose-y']
asl.df['grnd-rx'] = asl.df['right-x'] - asl.df['nose-x']
asl.df['grnd-lx'] = asl.df['left-x'] - asl.df['nose-x']
asl.df.head() 

Unnamed: 0_level_0,Unnamed: 1_level_0,left-x,left-y,right-x,right-y,nose-x,nose-y,speaker,grnd-ry,grnd-ly,grnd-rx,grnd-lx
video,frame,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
98,0,149,181,170,175,161,62,woman-1,113,119,9,-12
98,1,149,181,170,175,161,62,woman-1,113,119,9,-12
98,2,149,181,170,175,161,62,woman-1,113,119,9,-12
98,3,149,181,170,175,161,62,woman-1,113,119,9,-12
98,4,149,181,170,175,161,62,woman-1,113,119,9,-12


In [4]:
features_ground = ['grnd-rx','grnd-ry','grnd-lx','grnd-ly']
 #show a single set of features for a given (video, frame) tuple
[asl.df.loc[98,1][v] for v in features_ground]

[9, 113, -12, 119]

##### Build the training set
Now that we have a feature list defined, we can pass that list to the `build_training` method to collect the features for all the words in the training set.  Each word in the training set has multiple examples from various videos.  Below we can see the unique words that have been loaded into the training set:

In [5]:
training = asl.build_training(features_ground)
print("Training words: {}".format(training.words))

Training words: ['JOHN', 'WRITE', 'HOMEWORK', 'IX-1P', 'SEE', 'YESTERDAY', 'IX', 'LOVE', 'MARY', 'CAN', 'GO', 'GO1', 'FUTURE', 'GO2', 'PARTY', 'FUTURE1', 'HIT', 'BLAME', 'FRED', 'FISH', 'WONT', 'EAT', 'BUT', 'CHICKEN', 'VEGETABLE', 'CHINA', 'PEOPLE', 'PREFER', 'BROCCOLI', 'LIKE', 'LEAVE', 'SAY', 'BUY', 'HOUSE', 'KNOW', 'CORN', 'CORN1', 'THINK', 'NOT', 'PAST', 'LIVE', 'CHICAGO', 'CAR', 'SHOULD', 'DECIDE', 'VISIT', 'MOVIE', 'WANT', 'SELL', 'TOMORROW', 'NEXT-WEEK', 'NEW-YORK', 'LAST-WEEK', 'WILL', 'FINISH', 'ANN', 'READ', 'BOOK', 'CHOCOLATE', 'FIND', 'SOMETHING-ONE', 'POSS', 'BROTHER', 'ARRIVE', 'HERE', 'GIVE', 'MAN', 'NEW', 'COAT', 'WOMAN', 'GIVE1', 'HAVE', 'FRANK', 'BREAK-DOWN', 'SEARCH-FOR', 'WHO', 'WHAT', 'LEG', 'FRIEND', 'CANDY', 'BLUE', 'SUE', 'BUY1', 'STOLEN', 'OLD', 'STUDENT', 'VIDEOTAPE', 'BORROW', 'MOTHER', 'POTATO', 'TELL', 'BILL', 'THROW', 'APPLE', 'NAME', 'SHOOT', 'SAY-1P', 'SELF', 'GROUP', 'JANA', 'TOY1', 'MANY', 'TOY', 'ALL', 'BOY', 'TEACHER', 'GIRL', 'BOX', 'GIVE2', 'GIVE3

The training data in `training` is an object of class `WordsData` defined in the `asl_data` module.  in addition to the `words` list, data can be accessed with the `get_all_sequences`, `get_all_Xlengths`, `get_word_sequences`, and `get_word_Xlengths` methods. We need the `get_word_Xlengths` method to train multiple sequences with the `hmmlearn` library.  In the following example, notice that there are two lists; the first is a concatenation of all the sequences(the X portion) and the second is a list of the sequence lengths(the Lengths portion).

In [6]:
training.get_word_Xlengths('CHICKEN')

(array([[-19,  38,  27,  89],
        [-19,  38,  23,  89],
        [-15,  39,  24,  92],
        [-12,  33,  24,  92],
        [-12,  33,  24,  92],
        [ -9,  32,  24,  92],
        [-10,  29,  21,  93],
        [-10,  29,  20,  94],
        [-10,  29,  20,  94],
        [-10,  29,  20,  94],
        [-10,  29,  20,  94],
        [-10,  29,  20,  98],
        [-10,  29,  20,  98],
        [ -8,  34,  20,  98],
        [ -8,  34,  13, 104],
        [-19,  35,  29,  75],
        [-19,  35,  29,  75],
        [-19,  35,  29,  75],
        [-19,  35,  25,  79],
        [-19,  35,  22,  80],
        [-14,  32,  21,  86],
        [-14,  32,  21,  86],
        [-14,  32,  14,  90],
        [-12,  32,  16,  90],
        [-12,  32,  16,  92],
        [-12,  32,  15,  94],
        [-12,  32,  15,  94],
        [-12,  32,  15,  94],
        [-12,  32,  15,  94],
        [-12,  32,  15,  97],
        [-10,  35,  15, 101],
        [ -6,  41,  11, 104]]), [15, 17])

## Features Implementation 

- normalized Cartesian coordinates
- polar coordinates
- delta difference
- custom features: delta2

In [7]:
df_means = asl.df.groupby('speaker').mean()  # Dataframe with mean grouped by speaker
df_std = asl.df.groupby('speaker').std() # Dataframe with standard deviations grouped by speaker

# Add features for normalized by speaker values of left, right, x, y, using Z-score scaling (X-Xmean)/Xstd
asl.df['left-x-mean']= asl.df['speaker'].map(df_means['left-x'])
asl.df['left-y-mean']= asl.df['speaker'].map(df_means['left-y'])
asl.df['right-x-mean']= asl.df['speaker'].map(df_means['right-x'])
asl.df['right-y-mean']= asl.df['speaker'].map(df_means['right-y'])

asl.df['left-x-std']= asl.df['speaker'].map(df_std['left-x'])
asl.df['left-y-std']= asl.df['speaker'].map(df_std['left-y'])
asl.df['right-x-std']= asl.df['speaker'].map(df_std['right-x'])
asl.df['right-y-std']= asl.df['speaker'].map(df_std['right-y'])

asl.df['norm-rx'] = (asl.df['right-x'] - asl.df['right-x-mean']) / asl.df['right-x-std']
asl.df['norm-ry'] = (asl.df['right-y'] - asl.df['right-y-mean']) / asl.df['right-y-std']
asl.df['norm-lx'] = (asl.df['left-x'] - asl.df['left-x-mean']) / asl.df['left-x-std']
asl.df['norm-ly'] = (asl.df['left-y'] - asl.df['left-y-mean']) / asl.df['left-y-std']

features_norm = ['norm-rx', 'norm-ry', 'norm-lx','norm-ly']

In [8]:
# Add features for polar coordinate values where the nose is the origin
# Note that 'polar-rr' and 'polar-rtheta' refer to the radius and angle
asl.df['polar-rr'] = (asl.df['grnd-rx']**2 + asl.df['grnd-ry']**2)**0.5
asl.df['polar-rtheta'] = np.arctan2(asl.df['grnd-rx'], asl.df['grnd-ry'])
asl.df['polar-lr'] = (asl.df['grnd-lx']**2 + asl.df['grnd-ly']**2)**0.5
asl.df['polar-ltheta'] = np.arctan2(asl.df['grnd-lx'], asl.df['grnd-ly'])

features_polar = ['polar-rr', 'polar-rtheta', 'polar-lr', 'polar-ltheta']

In [9]:
# Add features for left, right, x, y differences by one time step
asl.df['delta-rx'] = asl.df['right-x'].diff().fillna(method='bfill')
asl.df['delta-ry'] = asl.df['right-y'].diff().fillna(method='bfill')
asl.df['delta-lx'] = asl.df['left-x'].diff().fillna(method='bfill')
asl.df['delta-ly'] = asl.df['left-y'].diff().fillna(method='bfill')

features_delta = ['delta-rx', 'delta-ry', 'delta-lx', 'delta-ly']

### Custom feature: delta2

Delta difference by two time steps of cartesian coordinates (delta2) was chosen empirically after obtaining better results at the end of the project. In particular, delta2 resulted in better performance that other combinations of the above features, including the former choice 'delta polar for several time steps differences'.

In [10]:
# delta2: features for left, right, x, y differences by TWO time steps
asl.df['delta2-rx'] = asl.df['right-x'].diff(periods=2).fillna(method='bfill')
asl.df['delta2-ry'] = asl.df['right-y'].diff(periods=2).fillna(method='bfill')
asl.df['delta2-lx'] = asl.df['left-x'].diff(periods=2).fillna(method='bfill')
asl.df['delta2-ly'] = asl.df['left-y'].diff(periods=2).fillna(method='bfill')

features_custom = ['delta2-rx', 'delta2-ry', 'delta2-lx', 'delta2-ly']
asl.df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,left-x,left-y,right-x,right-y,nose-x,nose-y,speaker,grnd-ry,grnd-ly,grnd-rx,...,polar-lr,polar-ltheta,delta-rx,delta-ry,delta-lx,delta-ly,delta2-rx,delta2-ry,delta2-lx,delta2-ly
video,frame,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
98,0,149,181,170,175,161,62,woman-1,113,119,9,...,119.603512,-0.100501,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,1,149,181,170,175,161,62,woman-1,113,119,9,...,119.603512,-0.100501,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,2,149,181,170,175,161,62,woman-1,113,119,9,...,119.603512,-0.100501,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,3,149,181,170,175,161,62,woman-1,113,119,9,...,119.603512,-0.100501,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,4,149,181,170,175,161,62,woman-1,113,119,9,...,119.603512,-0.100501,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## PART 2: Model Selection
### Model Selection Tutorial
The objective of Model Selection is to tune the number of states for each word HMM prior to testing on unseen data.  In this section you will explore three methods: 
- Log likelihood using cross-validation folds (CV)
- Bayesian Information Criterion (BIC)
- Discriminative Information Criterion (DIC) 

##### Train a single word
Now that we have built a training set with sequence data, we can "train" models for each word.  As a simple starting example, we train a single word using Gaussian hidden Markov models (HMM).   By using the `fit` method during training, the [Baum-Welch Expectation-Maximization](https://en.wikipedia.org/wiki/Baum%E2%80%93Welch_algorithm) (EM) algorithm is invoked iteratively to find the best estimate for the model *for the number of hidden states specified* from a group of sample seequences. For this example, we *assume* the correct number of hidden states is 3, but that is just a guess.  How do we know what the "best" number of states for training is?  We will need to find some model selection technique to choose the best parameter.

In [None]:
import warnings
import sklearn
from hmmlearn.hmm import GaussianHMM

def train_a_word(word, num_hidden_states, features):
    
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    training = asl.build_training(features)  
    X, lengths = training.get_word_Xlengths(word)
    model = GaussianHMM(n_components=num_hidden_states, n_iter=1000).fit(X, lengths)
    logL = model.score(X, lengths)
    return model, logL

demoword = 'CHICKEN'
model, logL = train_a_word(demoword, 3, features_ground)
print("Number of states trained in model for {} is {}".format(demoword, model.n_components))
print("logL = {}".format(logL))

The HMM model has been trained and information can be pulled from the model, including means and variances for each feature and hidden state.  The [log likelihood](http://math.stackexchange.com/questions/892832/why-we-consider-log-likelihood-instead-of-likelihood-in-gaussian-distribution) for any individual sample or group of samples can also be calculated with the `score` method.

In [None]:
def show_model_stats(word, model):
    print("Number of states trained in model for {} is {}".format(word, model.n_components))    
    variance=np.array([np.diag(model.covars_[i]) for i in range(model.n_components)])    
    for i in range(model.n_components):  # for each hidden state
        print("hidden state #{}".format(i))
        print("mean = ", model.means_[i])
        print("variance = ", variance[i])
        print()
    
show_model_stats(demoword, model)

##### Try it!
Experiment by changing the feature set, word, and/or num_hidden_states values in the next cell to see changes in values.  

In [None]:
my_testword = 'CHOCOLATE'
model, logL = train_a_word(my_testword, 4, features_custom) # Experiment here with different parameters
show_model_stats(my_testword, model)
print("logL = {}".format(logL))

##### Visualize the hidden states
We can plot the means and variances for each state and feature.  Try varying the number of states trained for the HMM model and examine the variances.  Are there some models that are "better" than others?  How can you tell?  We would like to hear what you think in the classroom online.

In [None]:
%matplotlib inline

In [None]:
import math
from matplotlib import (cm, pyplot as plt, mlab)

def visualize(word, model):
    """ visualize the input model for a particular word """
    variance=np.array([np.diag(model.covars_[i]) for i in range(model.n_components)])
    figures = []
    for parm_idx in range(len(model.means_[0])):
        xmin = int(min(model.means_[:,parm_idx]) - max(variance[:,parm_idx]))
        xmax = int(max(model.means_[:,parm_idx]) + max(variance[:,parm_idx]))
        fig, axs = plt.subplots(model.n_components, sharex=True, sharey=False)
        colours = cm.rainbow(np.linspace(0, 1, model.n_components))
        for i, (ax, colour) in enumerate(zip(axs, colours)):
            x = np.linspace(xmin, xmax, 100)
            mu = model.means_[i,parm_idx]
            sigma = math.sqrt(np.diag(model.covars_[i])[parm_idx])
            ax.plot(x, mlab.normpdf(x, mu, sigma), c=colour)
            ax.set_title("{} feature {} hidden state #{}".format(word, parm_idx, i))

            ax.grid(True)
        figures.append(plt)
    for p in figures:
        p.show()
        
visualize(my_testword, model)

#####  ModelSelector class
Review the `SelectorModel` class from the codebase found in the `my_model_selectors.py` module.  It is designed to be a strategy pattern for choosing different model selectors.  For the project submission in this section, subclass `SelectorModel` to implement the following model selectors.  In other words, you will write your own classes/functions in the `my_model_selectors.py` module and run them from this notebook:

- `SelectorCV `:  Log likelihood with CV
- `SelectorBIC`: BIC 
- `SelectorDIC`: DIC

You will train each word in the training set with a range of values for the number of hidden states, and then score these alternatives with the model selector, choosing the "best" according to each strategy. The simple case of training with a constant value for `n_components` can be called using the provided `SelectorConstant` subclass as follow:

In [None]:
import math
import statistics
import warnings

import numpy as np
from hmmlearn.hmm import GaussianHMM
from sklearn.model_selection import KFold
from asl_recognizer.asl_utils import combine_sequences


class ModelSelector(object):
    '''
    base class for model selection (strategy design pattern)
    '''

    def __init__(self, all_word_sequences: dict, all_word_Xlengths: dict, this_word: str,
                 n_constant=3,
                 min_n_components=2, max_n_components=10,
                 random_state=14, verbose=False):
        self.words = all_word_sequences
        self.hwords = all_word_Xlengths
        self.sequences = all_word_sequences[this_word]
        self.X, self.lengths = all_word_Xlengths[this_word]
        self.this_word = this_word
        self.n_constant = n_constant
        self.min_n_components = min_n_components
        self.max_n_components = max_n_components
        self.random_state = random_state
        self.verbose = verbose

    def select(self):
        raise NotImplementedError

    def base_model(self, num_states):
        # with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=DeprecationWarning)
        # warnings.filterwarnings("ignore", category=RuntimeWarning)
        try:
            hmm_model = GaussianHMM(n_components=num_states, covariance_type="diag", n_iter=1000,
                                    random_state=self.random_state, verbose=False).fit(self.X, self.lengths)
            if self.verbose:
                print("model created for {} with {} states".format(self.this_word, num_states))
            return hmm_model
        except:
            if self.verbose:
                print("failure on {} with {} states".format(self.this_word, num_states))
            return None


class SelectorConstant(ModelSelector):
    """ select the model with value self.n_constant

    """
    def select(self):
        """ select based on n_constant value

        :return: GaussianHMM object
        """
        best_num_components = self.n_constant
        return self.base_model(best_num_components)


class SelectorBIC(ModelSelector):
    """ select the model with the lowest Baysian Information Criterion(BIC) score

    http://www2.imm.dtu.dk/courses/02433/doc/ch6_slides.pdf
    Bayesian information criteria: BIC = -2 * logL + p * logN
    """

    def select(self):
        """ select the best model for self.this_word based on
        BIC score for n between self.min_n_components and self.max_n_components
        :return: GaussianHMM object
        """
        warnings.filterwarnings("ignore", category=DeprecationWarning)

        # TODO implement model selection based on BIC scores

        best_BIC = float('inf')     # hold the lowest BIC found
        best_model = None           # hold the model with the lowest BIC found

        # main loop: get model and score for different number of hidden states and update best_BIC and best_model
        for n_components in range(self.min_n_components, self.max_n_components+1):
            logL = None
            model = self.base_model(n_components) # avoid code duplication using the inherited method
            if model is None:   # failed
                continue
            try:  # hmmlearn stability issues
                logL = model.score(self.X, self.lengths)
            except:
                continue

            # BIC = -2 log L + p log N  (the lower the better)
            # p = num free params = transistion probs(n*n) + means(n*f) + covars(n*f)  (feedback from project review)
            p = n_components**2 + (2 * len(self.X[0]) * n_components)
            BIC = -2 * logL + p * np.log(len(self.X))
            if  BIC < best_BIC:
                best_BIC = BIC
                best_model = model

        return best_model


class SelectorDIC(ModelSelector):
    """ select best model based on Discriminative Information Criterion

    Biem, Alain. "A model selection criterion for classification: Application to hmm topology optimization."
    Document Analysis and Recognition, 2003. Proceedings. Seventh International Conference on. IEEE, 2003.
    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.58.6208&rep=rep1&type=pdf
    DIC = log(P(X(i)) - 1/(M-1)SUM(log(P(X(all but i))
    """

    def select(self):
        warnings.filterwarnings("ignore", category=DeprecationWarning)

        # TODO implement model selection based on DIC scores

        best_DIC = float('-inf')    # hold the highest BIC found
        best_model = None           # hold the model with the highest BIC found
        my_word = self.this_word    # word to evaluate

        # main loop: get model and score for different number of hidden states and update best_DIC and best_model
        for n_components in range(self.min_n_components, self.max_n_components+1):
            logL = None     # likelihood of the trained model
            anti_logL = []  # list containing all the anti likelihoods of the trained model
            self.this_word = my_word

            # train the model for current word (self.this_word)
            model = self.base_model(n_components)  # avoid code duplication using the inherited method
            if model is None:   # failed
                continue
            # get the likelihood of the model
            try:  # hmmlearn stability issues
                logL = model.score(self.X, self.lengths)
            except:
                continue

            # get the anti-likehoods (score every other word for the current model)
            for other_word in self.hwords:
                if other_word == my_word:
                    continue
                X, lengths = self.hwords[other_word]
                try:  # stability issues for hmmlearn v<0.2.1
                    anti_logL.append(model.score(X, lengths))
                except:
                    continue

            # I found the docstring formula a bit confusing   DIC = log(P(X(i)) - 1/(M-1)SUM(log(P(X(all but i)) :
            # The size of anti_logL is already M-1, as the word evaluated is not included in anti_logL.
            # Biem, Alain: "DIC = Difference between the likelihood of the data an the the average of anti-likelihood"
            # From the paper we can state: DIC = log(P(X(i)) - average(log(P(X(all but i))
            DIC = logL - np.average(anti_logL)  # The higher the better
            if  DIC > best_DIC:
                best_DIC = DIC
                best_model = model

        return best_model


class SelectorCV(ModelSelector):
    """ select best model based on average log Likelihood of cross-validation folds """

    def select(self):
        """ :return: GaussianHMM object """

        warnings.filterwarnings("ignore", category=DeprecationWarning)
        # TODO implement model selection using CV

        # KFold default n_splits = 3 is used here. Sequences of size lower than 3 must be considered:
        if len(self.sequences) >= 3:
            n_splits = 3
        elif len(self.sequences) == 2:
            n_splits = 2
        else:
            # print("No Cross Validation is possible for word: ", self.this_word, "only has 1 occurrence")
            return None

        best_score = float('-inf')  # hold the highest average of cross-validation scores found
        best_model = None           # hold the model with the highest average of cross-validation score found

        # main loop: get model and score for different number of hidden states and update best_score and best_model
        for n_components in range(self.min_n_components, self.max_n_components+1):

            split_method = KFold(n_splits=n_splits)
            logL = []  # list of cross validation scores obtained

            # get the model for the combined cross-validation training sequences and score with their combined
            #  validation sequences filling the list 'logL'
            for cv_train_idx, cv_test_idx in split_method.split(self.sequences):

                self.X, self.lengths = combine_sequences(cv_train_idx, self.sequences)
                model = self.base_model(n_components)  # avoid code duplication using the inherited method
                if model is None:   # failed
                    continue

                test_X, test_lengths = combine_sequences(cv_test_idx, self.sequences)
                try:   # hmmlearn stability issues
                    logL.append(model.score(test_X, test_lengths))
                except:
                    continue

            if model is None or logL == []:
                continue

            score = np.mean(logL)  # The higher the better
            if  score > best_score:
                best_score = score
                best_model = model

        return best_model

In [None]:
training = asl.build_training(features_ground)  # Experiment here with different feature sets defined in part 1
word = 'CHICKEN' # Experiment here with different words
model = SelectorConstant(training.get_all_sequences(), training.get_all_Xlengths(), word, n_constant=3).select()
print("Number of states trained in model for {} is {}".format(word, model.n_components))

##### Cross-validation folds
If we simply score the model with the Log Likelihood calculated from the feature sequences it has been trained on, we should expect that more complex models will have higher likelihoods. However, that doesn't tell us which would have a better likelihood score on unseen data.  The model will likely be overfit as complexity is added.  To estimate which topology model is better using only the training data, we can compare scores using cross-validation.  One technique for cross-validation is to break the training set into "folds" and rotate which fold is left out of training.  The "left out" fold scored.  This gives us a proxy method of finding the best model to use on "unseen data". In the following example, a set of word sequences is broken into three folds using the [scikit-learn Kfold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) class object. When you implement `SelectorCV`, you will use this technique.

In [None]:
from sklearn.model_selection import KFold

training = asl.build_training(features_ground) # Experiment here with different feature sets
word = 'VEGETABLE' # Experiment here with different words
word_sequences = training.get_word_sequences(word)
split_method = KFold()
for cv_train_idx, cv_test_idx in split_method.split(word_sequences):
    print("Train fold indices:{} Test fold indices:{}".format(cv_train_idx, cv_test_idx))  # view indices of the folds

**Tip:** In order to run `hmmlearn` training using the X,lengths tuples on the new folds, subsets must be combined based on the indices given for the folds.  A helper utility has been provided in the `asl_utils` module named `combine_sequences` for this purpose.

##### Scoring models with other criterion
Scoring model topologies with **BIC** balances fit and complexity within the training set for each word.  In the BIC equation, a penalty term penalizes complexity to avoid overfitting, so that it is not necessary to also use cross-validation in the selection process.  There are a number of references on the internet for this criterion.  These [slides](http://www2.imm.dtu.dk/courses/02433/doc/ch6_slides.pdf) include a formula you may find helpful for your implementation.

The advantages of scoring model topologies with **DIC** over BIC are presented by Alain Biem in this [reference](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.58.6208&rep=rep1&type=pdf) (also found [here](https://pdfs.semanticscholar.org/ed3d/7c4a5f607201f3848d4c02dd9ba17c791fc2.pdf)).  DIC scores the discriminant ability of a training set for one word against competing words.  Instead of a penalty term for complexity, it provides a penalty if model liklihoods for non-matching words are too similar to model likelihoods for the correct word in the word set.

<a id='part2_submission'></a>
### Model Selection Implementation Submission
Implement `SelectorCV`, `SelectorBIC`, and `SelectorDIC` classes in the `my_model_selectors.py` module.  Run the selectors on the following five words. Then answer the questions about your results.

**Tip:** The `hmmlearn` library may not be able to train or score all models.  Implement try/except contructs as necessary to eliminate non-viable models from consideration.

In [None]:
words_to_train = ['FISH', 'BOOK', 'VEGETABLE', 'FUTURE', 'JOHN']
import timeit

In [None]:
# TODO: Implement SelectorCV in my_model_selector.py

# from importlib import reload
# import my_model_selectors
# reload(my_model_selectors)

training = asl.build_training(features_ground)  # Experiment here with different feature sets defined in part 1
sequences = training.get_all_sequences()
Xlengths = training.get_all_Xlengths()
for word in words_to_train:
    start = timeit.default_timer()
    model = SelectorCV(sequences, Xlengths, word, 
                    min_n_components=2, max_n_components=15, random_state = 14).select()
    end = timeit.default_timer()-start
    if model is not None:
        print("Training complete for {} with {} states with time {} seconds".format(word, model.n_components, end))
    else:
        print("Training failed for {}".format(word))

In [None]:
training = asl.build_training(features_ground)  # Experiment here with different feature sets defined in part 1
sequences = training.get_all_sequences()
Xlengths = training.get_all_Xlengths()
for word in words_to_train:
    start = timeit.default_timer()
    model = SelectorBIC(sequences, Xlengths, word, 
                    min_n_components=2, max_n_components=15, random_state = 14).select()
    end = timeit.default_timer()-start
    if model is not None:
        print("Training complete for {} with {} states with time {} seconds".format(word, model.n_components, end))
    else:
        print("Training failed for {}".format(word))

In [None]:
# TODO: Implement SelectorDIC in module my_model_selectors.py

training = asl.build_training(features_ground)  # Experiment here with different feature sets defined in part 1
sequences = training.get_all_sequences()
Xlengths = training.get_all_Xlengths()
for word in words_to_train:
    start = timeit.default_timer()
    model = SelectorDIC(sequences, Xlengths, word, 
                    min_n_components=2, max_n_components=15, random_state = 14).select()
    end = timeit.default_timer()-start
    if model is not None:
        print("Training complete for {} with {} states with time {} seconds".format(word, model.n_components, end))
    else:
        print("Training failed for {}".format(word))

**Question 2:**  Compare and contrast the possible advantages and disadvantages of the various model selectors implemented.

**Answer 2:**

**CV**:  Common in supervised learning to avoid partitioning the training set for validation. Among the tree model selectors, CV is the only one (cross)validated with sequences different of the training sets. However the running time needed is higher than direct selectors like BIC, since for each word several models must be created for cross-validation (combinations of training-validation sequences).  

**BIC**: Simple selector that penalizes the resulted likelihood with the complexity of the model (more hidden layers increases the risk of overfitting). It needs less running time than CV and DIC, as only one model and one score must be computed per number of hidden nodes. 

**DIC**: The penalty term in DIC is based on the average likelihood for the other words. That is why it has the highest computational cost of the three: a score for each word of the training set must be computed for each number of hidden states.

Further comparisons will be discussed together with the results of the part 3 of the project.


<a id='part3_tutorial'></a>
## PART 3: Recognizer
The objective of this section is to "put it all together".  Using the four feature sets created and the three model selectors, you will experiment with the models and present your results.  Instead of training only five specific words as in the previous section, train the entire set with a feature set and model selector strategy.  
### Recognizer Tutorial
##### Train the full training set
The following example trains the entire set with the example `features_ground` and `SelectorConstant` features and model selector.  Use this pattern for you experimentation and final submission cells.



In [None]:
# autoreload for automatically reloading changes made in my_model_selectors and my_recognizer
%load_ext autoreload
%autoreload 2

def train_all_words(features, model_selector):
    training = asl.build_training(features)  # Experiment here with different feature sets defined in part 1
    sequences = training.get_all_sequences()
    Xlengths = training.get_all_Xlengths()
    model_dict = {}
    for word in training.words:
        model = model_selector(sequences, Xlengths, word, 
                        n_constant=3).select()
        model_dict[word]=model
    return model_dict

models = train_all_words(features_ground, SelectorConstant)
print("Number of word models returned = {}".format(len(models)))

##### Load the test set
The `build_test` method in `ASLdb` is similar to the `build_training` method already presented, but there are a few differences:
- the object is type `SinglesData` 
- the internal dictionary keys are the index of the test word rather than the word itself
- the getter methods are `get_all_sequences`, `get_all_Xlengths`, `get_item_sequences` and `get_item_Xlengths`

In [None]:
test_set = asl.build_test(features_ground)
print("Number of test set items: {}".format(test_set.num_items))
print("Number of test set sentences: {}".format(len(test_set.sentences_index)))

<a id='part3_submission'></a>
### Recognizer Implementation Submission
For the final project submission, students must implement a recognizer following guidance in the `my_recognizer.py` module.  Experiment with the four feature sets and the three model selection methods (that's 12 possible combinations). You can add and remove cells for experimentation or run the recognizers locally in some other way during your experiments, but retain the results for your discussion.  For submission, you will provide code cells of **only three** interesting combinations for your discussion (see questions below). At least one of these should produce a word error rate of less than 60%, i.e. WER < 0.60 . 

**Tip:** The hmmlearn library may not be able to train or score all models.  Implement try/except contructs as necessary to eliminate non-viable models from consideration.

In [None]:
import warnings
from asl_recognizer.asl_data import SinglesData


def recognize(models: dict, test_set: SinglesData):
    """ Recognize test word sequences from word models set

   :param models: dict of trained models
       {'SOMEWORD': GaussianHMM model object, 'SOMEOTHERWORD': GaussianHMM model object, ...}
   :param test_set: SinglesData object
   :return: (list, list)  as probabilities, guesses
       both lists are ordered by the test set word_id
       probabilities is a list of dictionaries where each key a word and value is Log Liklehood
           [{SOMEWORD': LogLvalue, 'SOMEOTHERWORD' LogLvalue, ... },
            {SOMEWORD': LogLvalue, 'SOMEOTHERWORD' LogLvalue, ... },
            ]
       guesses is a list of the best guess words ordered by the test set word_id
           ['WORDGUESS0', 'WORDGUESS1', 'WORDGUESS2',...]
   """
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    probabilities = []
    guesses = []
    # TODO implement the recognizer

    test_sequences = list(test_set.get_all_Xlengths().values()) # extract X and Xlength lists in a variable

    # Main loop: iterate for each word sequence and fill the lists probabilities and guesses
    for test_X, test_Xlength in  test_sequences:

        best_score = float('-inf')  # hold the highest likelihood found for the current sequence
        best_word = None            # hold the word with the highest likelihood found for the current sequence
        prob_dict = {}              # dictionary to fill before being added to the probabilities list {word: LogL}

        # Score the models avaliable for different words for the current sequence, fill prob_dict and update best_score
        # and best_word found
        for word, model in models.items():
            try:  # hmmlearn stability issues
                logL = model.score(test_X, test_Xlength)
            except:
                logL = None

            prob_dict[word] = logL
            if logL is not None:
                if logL > best_score:
                    best_score = logL
                    best_word = word

        probabilities.append(prob_dict)
        guesses.append(best_word)

    return probabilities, guesses


# if __name__ == "__main__":
#     from  asl_test_recognizer import TestRecognize
#     test_model = TestRecognize()
#     test_model.setUp()
#     test_model.test_recognize_probabilities_interface()

In [None]:
from asl_recognizer.asl_utils import show_errors

In [None]:
%%time
#  BASE COMBINATION:  SelectorConstant with features_ground.  WER = 0.669

# TODO Choose a feature set and model selector
features = features_ground # change as needed
model_selector = SelectorConstant # change as needed

# TODO Recognize the test set and display the result with the show_errors method
models = train_all_words(features, model_selector)
test_set = asl.build_test(features)
probabilities, guesses = recognize(models, test_set)
show_errors(guesses, test_set)

In [None]:
%%time
#  COMBINATION 1:   SelectorCV with features_polar.  WER = 0.607

# TODO Choose a feature set and model selector
features = features_polar # change as needed
model_selector = SelectorCV # change as needed

# TODO Recognize the test set and display the result with the show_errors method
models = train_all_words(features, model_selector)
test_set = asl.build_test(features)
probabilities, guesses = recognize(models, test_set)
show_errors(guesses, test_set)

In [None]:
%%time
#  COMBINATION 2:  SelectorBIC with features_polar.  WER = 0.545

# TODO Choose a feature set and model selector
features = features_polar # change as needed
model_selector = SelectorBIC # change as needed

# TODO Recognize the test set and display the result with the show_errors method
models = train_all_words(features, model_selector)
test_set = asl.build_test(features)
probabilities, guesses = recognize(models, test_set)
show_errors(guesses, test_set)

In [None]:
%%time
#  COMBINATION 3:  SelectorCV with features_custom(delta2).  WER = 0.539

# TODO Choose a feature set and model selector
features = features_custom # change as needed
model_selector = SelectorCV # change as needed

# TODO Recognize the test set and display the result with the show_errors method
models = train_all_words(features, model_selector)
test_set = asl.build_test(features)
probabilities, guesses = recognize(models, test_set)
show_errors(guesses, test_set)

**Question 3:**  Summarize the error results from three combinations of features and model selectors.  What was the "best" combination and why?  What additional information might we use to improve our WER?  For more insight on improving WER, take a look at the introduction to Part 4.

**Answer 3:**

A - SelectorCV + features_polar. WER = 0.607  (wall time: 112s)

B - SelectorBIC + features_polar. WER = 0.545  (wall time: 71s)

C - SelectorCV  + features_custom(delta2). WER = 0.539  (wall time: 118s)

The best combination, SelectorCV + features_custom(delta2), is mainly due to an acurrate choice for the custom features "delta2":
- With normalized features, the recognizer always performed worse than with the other features analyzed. 
- With polar features, less than 60% could be achieved with DIC and BIC only (54.4%). 
- With delta features, however, the recognizer never produced a word error rate lower than 60%, being the best result, 61.2%, obtained with selectorCV. 
- After using the custom "delta2" features it was easy to get error rates lower than 60% for the three model selectors, allowing a better comparison of the them, where SelectorCV had a better performance than BIC and DIC (opposite to using polar features)

Attending the running time, BIC model selector is the fastest for all tests for the reasons discussed in part 2 followed by Cross-Validation (~1.5 times longer), and finally DIC (~3 times longer). DIC was discarded after the first combinations since similar or better performances were usually reached with cross-validation and BIC, also needing less running time.

Finally, to improve our WER we could modify "delta2" or "polar" including terms from other features or combinations of features (linear function, etc.). Other variables like zoom and hand orientations, unavailable in this project, could help to elaborate more accurate features.

A better option is to include statistical data (as suggested in the lesson and also proposed in the optional exercise) with information about the occurrence of a word in a sentence, or the probability of the word to appears next to other words (biased priors could be created before training). By using semantics rules, for example, a heuristic strategy based on previous sequences could restrict the search to meaningful words only. 

Moreover, as we are using ordered sequences, we could implement a recurrent neural network (LSTM) to improve our WER.

