# Post-mortem on custom model training

---

For the past 3 months I have been working to create a new model to detect whether or not someone is looking at a screen.

### Overcoming the lack of data

The lack of data made this tricky. This was my first time training a model for a task which, as far as I know, hasn't been tackled before. The task itself is binary classification: the model predicts whether a face is looking or not looking at the screen.

#### Eye-Contact CNN - why I moved away from it

The closest existing model is the Eye-Contact CNN (https://github.com/rehg-lab/eye-contact-cnn), which was trained to detect eye contact in children. However, due to its license I cannot use it in EyesOff. Nevertheless, I am grateful to the authors, as their paper is immensely useful in detailing its approach. The model was also too restrictive: it only classifies a person as looking if they look directly at the camera, which is too specific for the EyesOff use case.

#### First steps to a custom dataset

Given the lack of data, I had to come up with a way to create my own dataset. At first I thought I would strictly need images of people using their laptops, as this would be the closest data to what the model would see in production. However, this type of data was quite hard to come by. Then I realised that all we need is images of people facing towards the camera, as this is essentially what the webcam will see. In other words, we take images with people in them and label whether they are looking towards the camera and where a screen could be (we assume a webcam at the top of the display). This allowed me to widen the range of data I could use.

Next up was the data collection phase. Initially I thought I could do this with images of myself and my friends, but I quickly realised this would hurt generalisability. To get around this, I started looking for face datasets which I could label myself.

First, I took the FFHQ dataset, as I knew it from my StyleGAN work. I manually labelled a subset of FFHQ (4900 images) to test my hypothesis. To my eye the resulting model worked OK, but it failed on a small test set of images taken of myself. I figured this was because the FFHQ images are quite unlike real-life images: Nvidia applied very heavy augmentations on top of them, making them look a little weird. So, the next step was to find images which looked real.

I began looking for gaze datasets but didn't find many which were easily accessible (GazeFace, MPIIGaze etc. all require you to sign up for access, and my requests weren't replied to). I then found the following dataset on Kaggle: https://www.kaggle.com/datasets/jigrubhatt/selfieimagedetectiondataset. This provided a great starting point, and from it I took and labelled 3400 images. [TODO - run the model trained on selfie data on benchmark and get performance]. Given that the dataset is of selfies, a lot of the time faces were occluded by phones, or eyes were looking at phones in an unnatural manner (show an image), which meant the data wasn't the best for my use case. I did try labelling only images without phones in them, but it didn't help.

#### Developing a consistent labelling framework

When I started labelling data, a lot of time went into developing my labelling framework, i.e. how to consistently label 1000s of images. As I was following the eye-contact-cnn, one of the first things I had to iron out was where the boundary for someone looking at the screen lies. I began with quite a wide boundary, but then decided to follow the paper more closely, labelling as looking only when the gaze is directly at the camera or slightly around it. I also had to make assumptions about the camera position. To make my life easier (put a drawing here of what I expect), I assume a laptop setup where the camera is at the top of the screen. This is a limitation, but we will work on solving it soon. Another idea I had was to use the eye-contact-cnn to label my data for me, but this did not produce great results. Its looking bounds are too tight: the EyesOff model is useless if it only says you are looking when you look directly at the camera. Having tried this, I realised how important hand-labelled data can be.

#### Quirks of YuNet

Mention YuNet quirks (1080 images etc) [run experiment to show results + explain - find notebook (lol goodluck)] Perhaps move this part higher? Where do i talk about pipeline YuNet & face crop first then looking detection.

#### Start of the VCD dataset phase

HERE Write about VCD and mention train test split faces no overlap etc.

Realising the limitations of the selfie dataset, namely low-quality images and situations which were not close to the production environment the EyesOff model would sit in, I began a search for new datasets. Then I found the Video Conferencing Dataset (VCD), https://github.com/microsoft/VCD. This dataset was originally created to evaluate video codecs for video conferencing. However, it is also perfect for the EyesOff use case: people in video calls, sitting right in front of the webcam and occasionally looking around. The dataset contains 160 unique individuals in different video conferencing settings. I set to work labelling the dataset, and the pipeline goes like this (it's the same for all data collection I undertake):

- Run through the videos frame by frame, but only extract frames at a fixed interval. Extracting every single frame creates issues. Firstly, frames close to each other are mostly the same, and diversity in images is important. Secondly, a 30fps video which lasts 30 seconds gives 900 frames, so with 160 videos you end up with 144000 images to label! [INCLUDE VCD every 5 and VCD EVERY results on test bench and explain why diversity is needed/duplicate images dont help]

- Next, we run YuNet on the extracted frames and crop out the faces in each image. I added this step partly to utilise YuNet because I love it, but more importantly because it's an amazing face detection model, and by using it to do the heavy work of detecting faces we break up our task. YuNet handles face detection, and the EyesOff model only needs to worry about whether the person is looking or not. It also helps when multiple people are in the scene, making data collection much simpler (imagine having to label images where 3 people are looking but 2 are not, and how would we get diversity in such scenes?). By extracting the face we transform the task. It's a bit hacky, but we end up with face crops, and I envisioned the task as "this face is in front of the camera, is it looking?".

- Then I take the face crops and run them through my labeller as before. The labeller is a small tool built to speed up this process. I did look into proper labelling tools such as Label Studio, but found them too heavy for such a use case. Using Claude [GIVEN i built the labeller earlier than finding VCD this whole bullet point can probably be an expandable box above - also can have another box which describes generally how I developed my labelling process] I built a simple labeller. The labelling tool shows one image at a time, with 4 main keys: 1 = label not looking, 2 = label looking, 3 = skip and q = go back to the previous image. At first labelling was a very slow process, but the more I labelled the faster I got. By the end I could label around 1000 images in 15 minutes. To get this fast I used skip pretty frequently: if a case is too ambiguous, it makes more sense to skip it than to waste time labelling it. In the future I will go back and review the skipped cases, labelling them correctly and adding them to the train set. [explain the labeller, perhaps go over how i refined my process + sped things up?]

- After labelling the images we can train the model! However, we have to be careful here. One of the things I learnt is that the train test split for facial images has to be done by face. By this I mean the same face cannot appear in both train and test, even if the image is different. To see why this is required, imagine the following: you have a face labelled in 100 different scenarios and poses, but it is always looking. The model may learn that this particular face is always looking, and if that face also appears in the test set, the evaluation result is not reliable.
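
Two steps of the pipeline above, fixed-interval frame extraction and the face-disjoint train test split, can be sketched in a few lines. This is a minimal illustration and not the actual EyesOff pipeline code: the function names, the `(person_id, image_path, label)` tuple layout, the interval of 5 and the 20% test fraction are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sampled_frame_indices(total_frames, interval):
    """Indices of the frames kept when extracting one frame every `interval` frames."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return list(range(0, total_frames, interval))


def split_by_identity(samples, test_fraction=0.2, seed=0):
    """Split (person_id, image_path, label) tuples so no person is in both sets.

    The tuple layout is hypothetical; any sample with a person identifier works.
    """
    by_person = defaultdict(list)
    for sample in samples:
        by_person[sample[0]].append(sample)

    # Shuffle identities (not images) so all images of a person move together.
    person_ids = sorted(by_person)
    random.Random(seed).shuffle(person_ids)

    # Hold out at least one identity for the test set.
    n_test = max(1, round(len(person_ids) * test_fraction))
    test = [s for pid in person_ids[:n_test] for s in by_person[pid]]
    train = [s for pid in person_ids[n_test:] for s in by_person[pid]]
    return train, test


frames_per_video = 30 * 30  # 30 fps for 30 seconds
print(frames_per_video * 160)  # every frame, 160 videos: 144000
print(len(sampled_frame_indices(frames_per_video, 5)) * 160)  # every 5th frame: 28800
```

Splitting on the person identifier rather than on individual images is what guarantees that a face seen during training never shows up at evaluation time. The price is that the image-level split ratio is only approximate, since people contribute different numbers of crops.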

That's it for the data labelling process for the VCD dataset! All in all I got 5900 images from the VCD dataset. Time to discuss training details!

[how do i talk about additional data? perhaps best to talk about it here then move on to discussing the training regime - No let's cover training process + model decisions, show results of VCD only and then explain why more data needed.]

TODO - explain thinking behind dataset creation - experimenting first with FFHQ faces, them not realistic so switching to ppl in cam and using the pseudo looking etc, training first iter models, why i chose efficient net, include results too of different models, different ablations, labelling process and why it was important to do it by hand.

#### Model Choice

Given the nature of EyesOff, an application which will always run in the background, I want the model to be relatively small. YuNet already satisfies this, having only 75856 parameters. The EyesOff model is a pre-trained EfficientNetB0, which has 5288548 parameters. It is considerably larger than YuNet, but still small enough to run on a CPU without affecting performance or draining the battery too much [Battery usage benchmark here?].
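
To put the two parameter counts in perspective, here is a rough back-of-the-envelope calculation. It assumes the weights are stored as 32-bit floats (4 bytes each), which is an assumption for illustration, and it only covers the weights themselves, not activations or runtime overhead.

```python
YUNET_PARAMS = 75_856               # parameter count quoted above
EFFICIENTNET_B0_PARAMS = 5_288_548  # parameter count quoted above
BYTES_PER_PARAM = 4                 # assumption: float32 weights


def weights_size_mib(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Approximate size of the raw weights in MiB."""
    return n_params * bytes_per_param / (1024 ** 2)


ratio = EFFICIENTNET_B0_PARAMS / YUNET_PARAMS
print(f"EfficientNetB0 has roughly {ratio:.0f}x the parameters of YuNet")
print(f"YuNet weights: ~{weights_size_mib(YUNET_PARAMS):.2f} MiB")
print(f"EfficientNetB0 weights: ~{weights_size_mib(EFFICIENTNET_B0_PARAMS):.2f} MiB")
```

Under that assumption, both sets of weights together are on the order of tens of MiB, which fits comfortably in memory on a laptop.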

The decision to use EfficientNetB0 is arbitrary. I could use any other model or architecture and probably get similar or better results, so it is definitely worthwhile trying other models. Ultimately, I think the best approach would probably be to take the YuNet architecture and build the EyesOff model on that. YuNet is so strong for such a small model that I'm sure the architecture can be adapted to give great results in the EyesOff setting. This is future work I endeavour to undertake as the application is built out further.

#### Training the EyesOff Model - A Two Stage Approach

[Train VCD model with and without pre-training to prove it helps!]

[EXPLAIN BRIEFLY the two-stage training process from the Eye-Contact-CNN paper, then describe how exactly i implement it. cover which layers i freeze and unfreeze, how i created the gaze vectors (mediapipe + openVino pipeline), the dataset I used for gaze regression ,what is a gaze vector, the importance of this phase] - [simply cover the approach first and then my method of applying it]

During training I used the approach of the Eye-Contact-CNN as a guide for my own model. Given my limited initial dataset of only 5900 images, I use an ImageNet pre-trained model. Furthermore, to aid generalisation (as in the Eye-Contact-CNN paper), I employ a two-stage training process. Phase one takes the EfficientNetB0 model and freezes blocks 0-4, i.e. their weights are marked as not trainable [INCLUDE BLOCK COUNT img from pipeline.ipynb, to show how much params are in each layer], which means most of the model will still be updated. The training goal in this phase is gaze regression: we take images of faces with the corresponding 2D gaze vector label and train the model to predict that vector. This

