# Project Delivery Report: Lips Reading
## Motivation

Multi-modal deep learning has made significant progress in recent years, and some of its results are already part of
daily life, such as audio-to-text, text-to-audio, and text-to-image systems. Lipreading is a challenging task in this
family: it decodes text from the movement of a speaker’s mouth or face. Many applications could benefit from
lipreading, including hearing aids, speech recognition in noisy environments, and outdoor communication with
AR glasses. Human performance on this task is unsatisfactory. By leveraging online video and text resources,
lipreading with deep learning may achieve better results than humans. Unlike classification on a single frame,
lipreading requires extracting spatio-temporal features from video sequences, which makes the project more
difficult.

  * Potential applications and project rationale:
    * Hearing aids.
    * Outdoor communication with AR glasses.
    * A large dataset can be collected from online resources.
    * The project applies what we learned in this module, such as CNNs and RNNs.


## Introduction

In this task, the input is a video, i.e. a series of image frames containing the visual features of a speaker’s mouth or
face, and the output is a sequence of text. The core goal of this project is therefore to recognize a sequence of characters
from a sequence of images (a seq2seq task). As a simplifying assumption, each input video contains only the lip
movements of a speaker reading `alphabet characters (a, b, c, d, e,..., z)` in random order, and our goal is to use the proposed
methods to recognize the corresponding characters.



## Dataset
This is not a public dataset; we created this lip-reading dataset ourselves.  

### How did we generate this dataset?  
  * We invited 7 people (including our team members) to record videos of themselves pronouncing alphabet letters on their phones.
  
### Details
  * Number of training videos: 171
  * Number of validation videos: 20
  * Maximum number of characters per label sequence: 5
  * Character classes: 28 (26 alphabet letters (a, b, c, d, e, f,..., z), plus a blank and an eos flag)
  * Number of recording participants: 7
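
The 28-class vocabulary above can be illustrated with a small label codec. This is a minimal sketch, not the project's actual encoding: the index layout (blank at 0, letters at 1 to 26, eos at 27), the use of blank as padding, and the names `encode_label` and `decode_label` are all assumptions made for illustration.

```python
import string

# Assumed index layout (not taken from the project code):
# 0 = blank, 1..26 = 'a'..'z', 27 = eos.
BLANK, EOS = 0, 27
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(string.ascii_lowercase)}
INDEX_TO_CHAR = {i: ch for ch, i in CHAR_TO_INDEX.items()}
MAX_LABEL_LEN = 5  # maximum label length reported for this dataset


def encode_label(text, pad_to=MAX_LABEL_LEN + 1):
    """Map a letter string to class indices, append eos, pad with blank."""
    if len(text) > MAX_LABEL_LEN:
        raise ValueError(f"label longer than {MAX_LABEL_LEN} characters")
    ids = [CHAR_TO_INDEX[ch] for ch in text.lower()]
    ids.append(EOS)
    return ids + [BLANK] * (pad_to - len(ids))


def decode_label(ids):
    """Invert encode_label: stop at eos and skip blanks."""
    chars = []
    for i in ids:
        if i == EOS:
            break
        if i != BLANK:
            chars.append(INDEX_TO_CHAR[i])
    return "".join(chars)
```

With this layout, every label has a fixed length of 6 (up to 5 letters plus eos), which makes batching straightforward.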

## Proposed Methods

Each data instance contains multiple images (the frames of one video): the X in an (X, Y) pair in the
dataset is not a single image but a variable-length list of images.  

* **2D CNN + RNN**: The frames of a video are concatenated into one larger 2D grid image. A 2D CNN extracts the visual features of the lip movements from this grid image, and an RNN then decodes the corresponding characters.
* **3D CNN + RNN**: A 3D CNN encodes the spatio-temporal features directly from the video sequence. These features are then fed into an RNN, which decodes the output character sequence.
* **MLP + RNN**: The same idea, but the video is flattened and fed into MLP layers to extract the visual features. The RNN again acts as the character decoder.
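
The frame-to-grid step of the 2D CNN variant can be sketched as follows. This is only an illustration under assumptions: the function name `frames_to_grid`, the near-square layout, and zero padding of unused cells are hypothetical choices, not details taken from the project code.

```python
import math

import numpy as np


def frames_to_grid(frames, cols=None):
    """Tile frames of shape (T, H, W, C) into one (rows*H, cols*W, C) image.

    Frames are placed row by row; unused cells are filled with zeros.
    """
    frames = np.asarray(frames)
    t, h, w, c = frames.shape
    if cols is None:
        cols = math.ceil(math.sqrt(t))  # near-square grid (assumed layout)
    rows = math.ceil(t / cols)

    # Pad the frame list so it fills the grid completely.
    padded = np.zeros((rows * cols, h, w, c), dtype=frames.dtype)
    padded[:t] = frames

    # (rows, cols, H, W, C) -> (rows, H, cols, W, C) -> (rows*H, cols*W, C)
    grid = padded.reshape(rows, cols, h, w, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * h, cols * w, c)
```

Because videos have different numbers of frames, a fixed `cols` and a fixed maximum number of rows would be needed in practice so that every grid image has the same size.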

    
## Preprocessing Method


Our team recorded 191 raw video clips. To give the neural network a more constrained input, we preprocess the raw frames with a face detection model. The steps are shown in the pipeline below: we first detect five facial landmarks, then apply an affine transform (`cv2.warpAffine()`) so that the mouth is at a consistent position in every frame.
![preprocess.png](report_img/preprocess.png)
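
The alignment step can be sketched as a least-squares fit of an affine matrix that maps the detected landmarks onto a fixed template. This is a minimal sketch under assumptions: the report does not name the face detection model or the landmark layout, so the template coordinates (two eyes, nose tip, two mouth corners in a 96x96 crop) and the helper names `estimate_affine` and `apply_affine` are hypothetical.

```python
import numpy as np

# Hypothetical target positions (x, y) of the five landmarks in a 96x96 crop:
# left eye, right eye, nose tip, left mouth corner, right mouth corner.
TEMPLATE = np.array(
    [[30, 36], [66, 36], [48, 56], [34, 74], [62, 74]], dtype=np.float64
)


def estimate_affine(landmarks, template=TEMPLATE):
    """Least-squares 2x3 affine matrix mapping landmarks onto the template."""
    landmarks = np.asarray(landmarks, dtype=np.float64)
    ones = np.ones((landmarks.shape[0], 1))
    design = np.hstack([landmarks, ones])  # (N, 3): rows of [x, y, 1]
    solution, *_ = np.linalg.lstsq(design, template, rcond=None)  # (3, 2)
    return solution.T  # (2, 3), the matrix layout cv2.warpAffine expects


def apply_affine(matrix, points):
    """Apply a 2x3 affine matrix to an (N, 2) array of points."""
    points = np.asarray(points, dtype=np.float64)
    return points @ matrix[:, :2].T + matrix[:, 2]
```

The resulting matrix could then be passed to OpenCV, for example `cv2.warpAffine(frame, matrix, (96, 96))`, to produce the aligned crop for each frame.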



## Training Data Statistics
![download.png](report_img/download.png)
The first graph shows the distribution of label sequence lengths in our training dataset. The second graph shows how often each of the 26 alphabet letters occurs.

## Network Design
In this part, we present the detailed network design and model architecture of each proposed method.



### MLP + LSTM

Model script: `python mlp/mlp_model.py`


![MLP+LSTM](report_img/MLP+LSTM.png)




### 2DCNN + LSTM

![2DCNN+LSTM](./report_img/2DCNN+LSTM.png)


### 3DCNN + LSTM
![3DCNN+LSTM](report_img/3DCNN+LSTM.png)

![3DCNN+LSTM+Details](report_img/3DCNN_RNN_model.png)
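
As a rough illustration of the 3D CNN + LSTM idea, the following PyTorch sketch encodes a clip with 3D convolutions, keeps the time axis, and runs an LSTM over the per-frame features. It is not the architecture shown in the figures: the class name `Tiny3DCNNLSTM`, the layer sizes, and the choice to emit one set of class scores per frame are assumptions, and the loss and decoding scheme are left open.

```python
import torch
from torch import nn


class Tiny3DCNNLSTM(nn.Module):
    """Illustrative 3D CNN encoder followed by an LSTM (sizes are assumed)."""

    def __init__(self, num_classes=28, hidden_size=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool space only, keep time
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # collapse H and W, keep T
        )
        self.lstm = nn.LSTM(32, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, video):
        # video: (batch, channels, frames, height, width)
        feats = self.encoder(video)  # (B, 32, T, 1, 1)
        feats = feats.flatten(2).transpose(1, 2)  # (B, T, 32)
        out, _ = self.lstm(feats)  # (B, T, hidden_size)
        return self.classifier(out)  # (B, T, num_classes)


if __name__ == "__main__":
    model = Tiny3DCNNLSTM()
    clip = torch.randn(2, 3, 8, 32, 32)  # 2 clips, 8 RGB frames of 32x32
    print(model(clip).shape)
```

The output has one score vector over the 28 character classes per input frame; how these are turned into a character sequence depends on the decoder and loss, which this sketch does not cover.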

## Training and Inference
### [Training and inference of the 3DCNN_RNN model](3DCNN_RNN_lipsreading.ipynb)
### [Training and inference of the MLP_RNN model](MLP_RNN_lipsreading.ipynb)

## Performance


Structure | Train loss | Val loss | Val acc |
--------- | ---------- | -------- | ------- |
MLP + LSTM | 2.24 | 2.61 | 0.31 |
2DCNN + LSTM | ? | ? | ? |
3DCNN + LSTM | 1.90 | 2.25 | 0.67 |

