# Evaluating lip sync using the LSE-D and LSE-C metrics

The LSE-D (Lip-Sync Error Distance) and LSE-C (Lip-Sync Error Confidence) metrics evaluate how well the lip movements in a video are aligned with its audio. They were introduced with Wav2Lip, where the goal is to generate videos whose lip movements match a given audio input. Both are computed with a pre-trained SyncNet model from the video and its audio track alone, so no ground-truth video or lip landmarks are needed. Here's a detailed explanation of each metric:

### LSE-D (Lip-Sync Error Distance)
LSE-D measures the distance between the audio embeddings and the mouth-region video embeddings that SyncNet produces for the evaluated video. It indicates how closely the generated lip movements match the speech in the audio.

- **Calculation**: SyncNet embeds short windows of audio and the corresponding video frames, and the distance between the two embeddings is averaged over the video. The reported score is the average distance at the best-matching audio-video offset.
- **Interpretation**: A lower LSE-D value indicates better performance, as it means the audio and visual representations are closer together. High LSE-D values suggest poor alignment between the audio and the generated lip movements.

### LSE-C (Lip-Sync Error Confidence)
LSE-C measures how confident SyncNet is that the audio and the lip movements are in sync.

- **Calculation**: In the SyncNet evaluation code, the confidence is the difference between the median and the minimum of the average audio-video distances across the tested time offsets. A clear minimum at one offset gives a high confidence.
- **Interpretation**: A higher LSE-C value indicates better performance, as it means the audio and video are more clearly correlated. Low LSE-C values suggest poor synchronization in the generated lip movements.

### Application in Wav2Lip
In the Wav2Lip project, both LSE-D and LSE-C are used to evaluate the lip-synced videos generated by the model. These metrics give quantitative measures of the alignment between the audio and the generated lip movements, helping researchers and developers improve the model's performance.

- **LSE-D** measures how far apart the audio and lip-movement representations are (lower is better).
- **LSE-C** measures how confident SyncNet is that the audio and video are in sync (higher is better).

By using both metrics, researchers can comprehensively evaluate the performance of lip-syncing models and make necessary adjustments to improve the quality of the generated videos.
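As a rough illustration of how the two scores relate, the sketch below follows the approach used in the SyncNet evaluation code: given the audio-video embedding distances for every frame at every tested offset, LSE-D is taken as the smallest per-offset mean distance and LSE-C as the gap between the median and that minimum. The function name `lse_scores` and the toy distance matrix are assumptions made for illustration; the real scripts obtain the distances from SyncNet embeddings, so this is a simplified sketch rather than the repository's implementation.

```python
import numpy as np

def lse_scores(dists):
    """Return (lse_d, lse_c, best_offset_index) from a distance matrix.

    dists has shape (n_frames, n_offsets): the audio-video embedding
    distance for each frame at each candidate time offset.
    """
    dists = np.asarray(dists, dtype=float)
    mean_per_offset = dists.mean(axis=0)          # average over frames
    best = int(np.argmin(mean_per_offset))        # best-matching offset
    lse_d = float(mean_per_offset[best])          # lower is better
    lse_c = float(np.median(mean_per_offset) - lse_d)  # higher is better
    return lse_d, lse_c, best

# Toy example (hypothetical values): 2 frames, 3 candidate offsets
toy = [[9.0, 7.0, 10.0],
       [11.0, 5.0, 12.0]]
lse_d, lse_c, best = lse_scores(toy)
print(lse_d, lse_c, best)
```

A sharp dip in the per-offset mean distance gives both a low LSE-D and a high LSE-C, which is why the two metrics usually move in opposite directions.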

# Notes for the reviewers:
- This notebook provides an overview of the LSE-D and LSE-C metrics used to evaluate lip-syncing models.
- For the main evaluation code, please refer to the [Wav2Lip](https://github.com/Rudrabha/Wav2Lip) repo and the [syncnet_python](https://github.com/joonson/syncnet_python) evaluation repo.
- Please refer to "modified scripts" for the modified scripts used in the evaluation.

In [1]:
# install the necessary packages

# clone the required repo for LSE-D and LSE-C metric
!git clone https://github.com/joonson/syncnet_python.git 

# clone the required repo for wav2lip    
!git clone https://github.com/Rudrabha/Wav2Lip.git

# install the requirements and download the pre-trained SyncNet model
%cd syncnet_python
!pip install -r requirements.txt
!sh download_model.sh

# Copy the evaluation scripts from the Wav2Lip folder into the syncnet_python folder.
%cd /kaggle/working/Wav2Lip/evaluation/scores_LSE/
!cp *.py /kaggle/working/syncnet_python/
!cp *.sh /kaggle/working/syncnet_python/ 

Cloning into 'syncnet_python'...
remote: Enumerating objects: 123, done.
remote: Counting objects: 100% (48/48), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 123 (delta 35), reused 28 (delta 28), pack-reused 75
Receiving objects: 100% (123/123), 97.36 KiB | 6.49 MiB/s, done.
Resolving deltas: 100% (70/70), done.
Cloning into 'Wav2Lip'...
remote: Enumerating objects: 381, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 381 (delta 0), reused 1 (delta 0), pack-reused 378
Receiving objects: 100% (381/381), 538.67 KiB | 13.81 MiB/s, done.
Resolving deltas: 100% (209/209), done.
/kaggle/working/syncnet_python
Collecting scenedetect==0.5.1 (from -r requirements.txt (line 5))
  Downloading scenedetect-0.5.1-py3-none-any.whl.metadata (4.4 kB)
Collecting python_speech_features (from -r requirements.txt (line 7))
  Downloading python_speech_features-0.6.tar.gz (5.6 kB)
  Preparing metada

In [2]:
# Install the 'gdown' and 'rarfile' packages
!pip install gdown rarfile
# Install the 'unrar' package for extracting RAR files
!apt-get install unrar

Collecting gdown
  Downloading gdown-5.2.0-py3-none-any.whl.metadata (5.8 kB)
Collecting rarfile
  Downloading rarfile-4.2-py3-none-any.whl.metadata (4.4 kB)
Downloading gdown-5.2.0-py3-none-any.whl (18 kB)
Downloading rarfile-4.2-py3-none-any.whl (29 kB)
Installing collected packages: rarfile, gdown
Successfully installed gdown-5.2.0 rarfile-4.2
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  unrar
0 upgraded, 1 newly installed, 0 to remove and 80 not upgraded.
Need to get 113 kB of archives.
After this operation, 406 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/multiverse amd64 unrar amd64 1:5.6.6-2build1 [113 kB]
Fetched 113 kB in 1s (126 kB/s) 
Selecting previously unselected package unrar.
(Reading database ... 113807 files and directories currently installed.)
Preparing to unpack .../unrar_1%3a5.6.6-2build1_amd64.deb ...
Unpacking unrar (1:5.6.

In [3]:
# import the necessary libraries

import rarfile
import gdown
import os

# Preparation before calculation

In [None]:
# Modify the shell script 'calculate_scores_real_videos.sh' so that each video's name is written alongside its metric scores.
# The modified shell script is in the "modified scripts" folder.

'''
********{modify}*********
modify the following  
do
   python run_pipeline.py --videofile $1/$eachfile --reference wav2lip --data_dir tmp_dir
   score_output=$(python calculate_scores_real_videos.py --videofile $1/$eachfile --reference wav2lip --data_dir tmp_dir)
   echo "$eachfile $score_output" >> all_scores.txt
done
'''

url = "https://drive.google.com/uc?id=***********************"
   
gdown.download(url, '/kaggle/working/syncnet_python/calculate_scores_real_videos.sh', quiet=False)


In [None]:
# The following fixes the use of the "np.int" alias in the script box_utils.py (the alias was deprecated in NumPy 1.20 and removed in NumPy 1.24).
# The modified script is in the "modified scripts" folder.

'''
********{modify}*********
modify the following 

    return np.array(keep).astype(int)

'''
    
url = "https://drive.google.com/uc?id=***********************"

gdown.download(url, '/kaggle/working/syncnet_python/detectors/s3fd/box_utils.py', quiet=False)
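The one-line change described in the cell above can be checked in isolation. The sketch below is a minimal illustration, assuming a small hypothetical `keep` list in place of the indices that `box_utils.py` builds; it shows that the built-in `int` yields an integer array without relying on the `np.int` alias.

```python
import numpy as np

# Hypothetical stand-in for the list of kept box indices built in box_utils.py
keep = [3, 0, 7]

# The built-in `int` works on every NumPy version, unlike the `np.int` alias
indices = np.array(keep).astype(int)

# dtype.kind is 'i' for signed integer dtypes
print(indices, indices.dtype.kind)
```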

In [5]:
# The following fixes an issue in the 'content_detector.py' script: it tries to change the values of a tuple (which does not support item assignment),
# so simply convert the tuple to a list, modify it, then convert it back to a tuple.

content_detector_path = '/opt/conda/lib/python3.10/site-packages/scenedetect/detectors/content_detector.py'

# read the content of the file
with open(content_detector_path, 'r') as file:
    content = file.readlines()

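# Skip patching if the file was already patched; re-running would otherwise patch the inserted lines again
already_patched = any("curr_hsv = list(curr_hsv)" in line for line in content)

# Iterate through the content and replace the specific lines if found
for i, line in enumerate(content):
    if already_patched:
        break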
    if "curr_hsv[i] = curr_hsv[i].astype(numpy.int32)" in line:
        content[i] = (
            "                    curr_hsv = list(curr_hsv)  # Convert tuple to list\n"
            "                    for i in range(len(curr_hsv)):\n"
            "                        curr_hsv[i] = curr_hsv[i].astype(numpy.int32)\n"
            "                    curr_hsv = tuple(curr_hsv)  # Convert back to tuple if needed\n"
        )
    if "last_hsv[i] = last_hsv[i].astype(numpy.int32)" in line:
        content[i] = (
            "                    last_hsv = list(last_hsv)  # Convert tuple to list\n"
            "                    for i in range(len(last_hsv)):\n"
            "                        last_hsv[i] = last_hsv[i].astype(numpy.int32)\n"
            "                    last_hsv = tuple(last_hsv)  # Convert back to tuple if needed\n"
        )

# write the modified content back to the file
with open(content_detector_path, 'w') as file:
    file.writelines(content)

# print the file to make sure the changes took place
with open(content_detector_path, 'r') as file:
    print(file.read())


# -*- coding: utf-8 -*-
#
#         PySceneDetect: Python-Based Video Scene Detector
#   ---------------------------------------------------------------
#     [  Site: http://www.bcastell.com/projects/pyscenedetect/   ]
#     [  Github: https://github.com/Breakthrough/PySceneDetect/  ]
#     [  Documentation: http://pyscenedetect.readthedocs.org/    ]
#
# Copyright (C) 2012-2019 Brandon Castellano <http://www.bcastell.com>.
#
# PySceneDetect is licensed under the BSD 3-Clause License; see the included
# LICENSE file, or visit one of the following pages for details:
#  - https://github.com/Breakthrough/PySceneDetect/
#  - http://www.bcastell.com/projects/pyscenedetect/
#
# This software uses Numpy, OpenCV, click, tqdm, simpletable, and pytest.
# See the included LICENSE files or one of the above URLs for more information.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR

In [None]:
# download the videos generated with the "wav2lip_gan.pth" and "wav2lip.pth" models from the Wav2Lip repo
# create the wav2lip_gen folder

os.makedirs('/kaggle/working/wav2lip_gen', exist_ok=True)

url = "https://drive.google.com/uc?id=***********************"

gdown.download(url, '/kaggle/working/wav2lip_gen.rar', quiet=False)

with rarfile.RarFile('/kaggle/working/wav2lip_gen.rar') as rf:
    rf.extractall('/kaggle/working/wav2lip_gen')
    
    
# create wav2lip folder

os.makedirs('/kaggle/working/wav2lip', exist_ok=True)

url = "https://drive.google.com/uc?id=*************************"

gdown.download(url, '/kaggle/working/wav2lip.rar', quiet=False)

with rarfile.RarFile('/kaggle/working/wav2lip.rar') as rf:
    rf.extractall('/kaggle/working/wav2lip')

In [None]:
# create a folder for the ground truth video 

   
os.makedirs('/kaggle/working/ground_video', exist_ok=True)

url = "https://drive.google.com/uc?id=************************"

gdown.download(url, '/kaggle/working/ground_video/mian_video.mp4', quiet=False)


In [9]:
%cd /kaggle/working/syncnet_python

/kaggle/working/syncnet_python


## Run The LSE-D and LSE-C metric Evaluation
- run on the wav2lip_gen generated videos

In [12]:
# run the lipSync evaluation on the wav2lip_gen folder

!sh calculate_scores_real_videos.sh /kaggle/working/wav2lip_gen

# the 'all_scores.txt' file contains each video name followed by its LSE-D and LSE-C scores
# the file is found in "/kaggle/working/syncnet_python"

ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
  built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --e

## Run The LSE-D and LSE-C metric Evaluation
- run on the wav2lip generated videos

In [15]:
# run the  lipSync evaluation on the wav2lip folder

!sh calculate_scores_real_videos.sh /kaggle/working/wav2lip

# the 'all_scores.txt' file contains each video name followed by its LSE-D and LSE-C scores
# the file is found in "/kaggle/working/syncnet_python"

ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
  built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --e

### Results Discussion 

In [14]:
# # the results of the gen model 

# with open("/kaggle/working/syncnet_python/all_scores.txt", 'r') as file:
#     content = file.read()
# print(content)

# with open("/kaggle/working/all_scores_gen_model.txt", 'w') as new_file:
#     new_file.write(content)

Arabic_gen_model.mp4 7.4505444 6.946064
English_gen_model.mp4 7.3036866 7.7088923
Korean_gen_model.mp4 8.0658455 6.2819786
Spanish_gen_model.mp4 7.7571344 6.3895636



In [16]:
# the results of the not_gen model

with open("/kaggle/working/syncnet_python/all_scores.txt", 'r') as file:
    content = file.read()
print(content)

with open("/kaggle/working/all_scores_not_gen_model.txt", 'w') as new_file:
    new_file.write(content)

Arabic_not_gen_model.mp4 7.452433 6.8285794
English_not_gen_model.mp4 7.2815084 7.450014
Korean_not_gen_model.mp4 7.900413 6.4207473
Spanish_not_gen_model.mp4 7.7730327 6.183518



## The Results:

| Video                     | LSE-D (lower is better) | LSE-C (higher is better) | Model           |
|---------------------------|-------------------------|--------------------------|-----------------|
| Arabic_not_gen_model.mp4  | 7.452433                | 6.8285794                | wav2lip.pth     |
| English_not_gen_model.mp4 | 7.2815084               | 7.450014                 | wav2lip.pth     |
| Korean_not_gen_model.mp4  | 7.900413                | 6.4207473                | wav2lip.pth     |
| Spanish_not_gen_model.mp4 | 7.7730327               | 6.183518                 | wav2lip.pth     |
| Arabic_gen_model.mp4      | 7.4505444               | 6.946064                 | wav2lip_gan.pth |
| English_gen_model.mp4     | 7.3036866               | 7.7088923                | wav2lip_gan.pth |
| Korean_gen_model.mp4      | 8.0658455               | 6.2819786                | wav2lip_gan.pth |
| Spanish_gen_model.mp4     | 7.7571344               | 6.3895636                | wav2lip_gan.pth |
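To make the per-language comparison easier to check, the sketch below parses the score lines shown above and prints, for each language, how the `gen_model` scores differ from the `not_gen_model` scores. It is a minimal sketch: the helper name `parse_scores` and the `gen` / `not_gen` keys are assumptions introduced here, and the scores are pasted in as a string rather than read from `all_scores.txt`.

```python
from collections import defaultdict

# Scores copied from the two all_scores.txt outputs: name, LSE-D, LSE-C
RAW_SCORES = """\
Arabic_gen_model.mp4 7.4505444 6.946064
English_gen_model.mp4 7.3036866 7.7088923
Korean_gen_model.mp4 8.0658455 6.2819786
Spanish_gen_model.mp4 7.7571344 6.3895636
Arabic_not_gen_model.mp4 7.452433 6.8285794
English_not_gen_model.mp4 7.2815084 7.450014
Korean_not_gen_model.mp4 7.900413 6.4207473
Spanish_not_gen_model.mp4 7.7730327 6.183518
"""

def parse_scores(text):
    """Return {language: {model: (lse_d, lse_c)}} from 'name lse_d lse_c' lines."""
    table = defaultdict(dict)
    for line in text.splitlines():
        if not line.strip():
            continue
        name, lse_d, lse_c = line.split()
        language = name.split("_")[0]
        model = "not_gen" if "_not_gen_" in name else "gen"
        table[language][model] = (float(lse_d), float(lse_c))
    return dict(table)

scores = parse_scores(RAW_SCORES)
for language, models in sorted(scores.items()):
    d_gen, c_gen = models["gen"]
    d_ref, c_ref = models["not_gen"]
    # Negative LSE-D change and positive LSE-C change both favour the gen model
    print(f"{language:8s} LSE-D change: {d_gen - d_ref:+.4f}  LSE-C change: {c_gen - c_ref:+.4f}")
```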

### Analysis of the Results

1. **Language Comparison (Accuracy)**:
    - **English**: 
        - `not_gen_model`: LSE-D: 7.2815084, LSE-C: 7.450014
        - `gen_model`: LSE-D: 7.3036866, LSE-C: 7.7088923
        - **Insight**: English has the lowest LSE-D and highest LSE-C among the languages in both models, indicating that it has the best alignment and confidence in lip-syncing. This suggests that the models might be better trained or have more data for English.
    
    - **Arabic**: 
        - `not_gen_model`: LSE-D: 7.452433, LSE-C: 6.8285794
        - `gen_model`: LSE-D: 7.4505444, LSE-C: 6.946064
        - **Insight**: Arabic shows slightly better LSE-D and improved LSE-C with the `gen_model`. However, it still lags behind English.
    
    - **Spanish**: 
        - `not_gen_model`: LSE-D: 7.7730327, LSE-C: 6.183518
        - `gen_model`: LSE-D: 7.7571344, LSE-C: 6.3895636
        - **Insight**: Spanish has improved LSE-C with the `gen_model`, but its LSE-D is still relatively high, indicating less accurate alignment.
    
    - **Korean**: 
        - `not_gen_model`: LSE-D: 7.900413, LSE-C: 6.4207473
        - `gen_model`: LSE-D: 8.0658455, LSE-C: 6.2819786
        - **Insight**: Korean shows the highest LSE-D values, indicating the poorest alignment among the languages. The `gen_model` did not improve on this: LSE-D rose from 7.900413 to 8.0658455 and LSE-C fell from 6.4207473 to 6.2819786.

2. **Model Comparison**:
    - **LSE-D**:
        - **General Insight**: The two models are close. `wav2lip.pth` has the lower (better) LSE-D for English and Korean, with the largest gap on Korean (7.900413 vs. 8.0658455), while `wav2lip_gan.pth` (the `gen_model` videos) is marginally lower for Arabic and Spanish.
    
    - **LSE-C**:
        - **General Insight**: `wav2lip_gan.pth` shows higher LSE-C values for English, Arabic, and Spanish, indicating better confidence. Korean is the exception: its LSE-C is lower than with `wav2lip.pth`.

### Conclusion

- **Most Accurate Language**:
    - **English** is the most accurate based on both metrics. Wav2Lip was trained on LRS2, an English-language dataset, which is a plausible explanation.

- **Comparison Between Models**:
    - **wav2lip.pth**: This model shows better LSE-D (distance) for English and Korean; for Arabic and Spanish the two models are nearly tied.
    - **wav2lip_gan.pth** (the `gen_model` videos): This model shows better LSE-C (confidence) for English, Arabic, and Spanish, but not for Korean.

Overall, **English** is the best-performing language in both models, while **Korean** has the highest LSE-D in both models and Korean and Spanish have the lowest LSE-C. Neither model is consistently better: `wav2lip.pth` has a small advantage in LSE-D for two of the four languages, while `wav2lip_gan.pth` has higher confidence for three of the four. Because each language is represented by a single video, these differences are indicative rather than conclusive. They suggest potential areas for improvement in the models' training data or techniques, particularly for languages like Korean and Spanish.


-------------------------------------------------------------------------

# Evaluation Of The Image Quality 

FID (Fréchet Inception Distance) is a metric used to evaluate the quality of generated images, commonly used in the context of generative models like GANs (Generative Adversarial Networks). It measures the similarity between two sets of images, often comparing generated images with real images to assess the quality and diversity of the generated set. Here's a detailed explanation of FID and how it is used:

### Fréchet Inception Distance (FID)

#### Concept
FID evaluates the distance between the distributions of features extracted from two sets of images. These features are typically extracted using a pre-trained Inception network, which is a convolutional neural network trained on a large dataset like ImageNet. The idea is to compare the statistical properties of these features to determine how similar the two sets of images are.

#### Calculation
1. **Feature Extraction**: Both real and generated images are passed through the Inception network to obtain feature representations. These features are usually the activations from an intermediate layer of the network.
   
2. **Distribution Modeling**: The feature representations are modeled as multidimensional Gaussian distributions. For each set of images, the mean (\(\mu\)) and covariance (\(\Sigma\)) of the features are computed.

3. **Fréchet Distance**: The FID score is then calculated as the Fréchet distance between the two Gaussian distributions, given by the formula:
   
   \[
   \text{FID} = ||\mu_r - \mu_g||^2 + \text{Tr}(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2})
   \]
   
   where \(\mu_r\) and \(\Sigma_r\) are the mean and covariance of the real images' features, and \(\mu_g\) and \(\Sigma_g\) are the mean and covariance of the generated images' features.
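The three steps above can be sketched in NumPy. This is a minimal illustration, not the `pytorch-fid` implementation: the function names `frechet_distance` and `fid_from_features` are assumptions, random vectors stand in for Inception activations, and the trace of \((\Sigma_r \Sigma_g)^{1/2}\) is computed from the eigenvalues of \(\Sigma_r \Sigma_g\) instead of a matrix square root routine. Taking the real part of the eigenvalues discards the small imaginary components that numerical error can introduce.

```python
import numpy as np

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Fréchet distance between N(mu_r, sigma_r) and N(mu_g, sigma_g)."""
    diff = mu_r - mu_g
    # Tr((sigma_r sigma_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of sigma_r @ sigma_g, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    sqrt_trace = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * sqrt_trace)

def fid_from_features(real_feats, gen_feats):
    """Compute the distance from two (n_samples, n_dims) feature arrays."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)

# Hypothetical features: 200 samples with 4 dimensions each
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
shifted = real + 1.0  # same covariance, mean moved by 1 in every dimension

print(fid_from_features(real, real))     # close to 0 for identical sets
print(fid_from_features(real, shifted))  # close to 4, the squared mean shift
```

With identical covariances the trace term cancels, so only the squared distance between the means remains; this makes the shifted example a convenient sanity check.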

#### Interpretation
- **Lower FID Scores**: Indicate that the generated images are more similar to the real images, suggesting higher quality and more realistic generated images.
- **Higher FID Scores**: Indicate greater differences between the generated and real images, suggesting lower quality or less realistic generated images.

### Advantages of FID
- **Sensitivity to Image Quality**: FID is sensitive to both the visual quality and the diversity of the generated images. It penalizes generators that produce images that are either of poor quality or lack variety.
- **Better Than Inception Score (IS)**: FID has been shown to correlate better with human judgment of image quality than the Inception Score (IS), another popular metric for evaluating GANs.

### Limitations
- **Computational Complexity**: Calculating FID can be computationally expensive, especially for large datasets.
- **Dependence on Inception Network**: The choice of the Inception network as the feature extractor means that the FID score is somewhat dependent on the specific characteristics of this network.

### Application
In the context of evaluating lip-syncing models or any other generative models, FID can be used to assess the quality of the generated images. For example, in lip-syncing, FID could be used to evaluate how realistic the generated frames of the video are, comparing them to real frames to ensure that the synthetic lips look natural and coherent with the rest of the face.


By using FID, researchers can quantitatively assess and improve the quality of images generated by their models, ensuring that the outputs are both realistic and diverse.

In [18]:
# install the required packages
!pip install pytorch-fid

Collecting pytorch-fid
  Downloading pytorch_fid-0.3.0-py3-none-any.whl.metadata (5.3 kB)
Downloading pytorch_fid-0.3.0-py3-none-any.whl (15 kB)
Installing collected packages: pytorch-fid
Successfully installed pytorch-fid-0.3.0


In [39]:
# import the necessary libraries

import shutil
import subprocess
import glob


## Preparations Before The Calculations

In [None]:
# modify the fid_score.py script to handle imaginary numbers
# the modified script is in the "modified scripts" folder

'''
********{modify}*********
modify the following 

    # Check for and handle any imaginary component
    if np.iscomplexobj(covmean):
        covmean = covmean.real
'''

url = "https://drive.google.com/uc?id=*******************************"
gdown.download(url, '/kaggle/working/fid_score_m.py', quiet=False)


In [28]:
# function to create a folder of frames from a video

def create_frames(dir_path, video_path):
    # If the directory already exists, remove it so that stale frames are not mixed in
    if os.path.exists(dir_path):
        shutil.rmtree(dir_path)
    
    # Create the directory
    os.makedirs(dir_path, exist_ok=True)
    
    # Extract the frames at 25 fps; check=True raises an error if ffmpeg fails
    subprocess.run(['ffmpeg', '-i', video_path, '-vf', 'fps=25', f'{dir_path}/frame_%06d.png'], check=True)

In [37]:
# define functions to match the number of frames of each video

def trim_frames(directory, target_count):
    frame_files = sorted(glob.glob(os.path.join(directory, '*.png')))
    for frame_file in frame_files[target_count:]:
        os.remove(frame_file)

def match_frames(test="/kaggle/working/frames/test",ground_truth="/kaggle/working/frames/ground_truth"):
    # get frame counts
    ground_truth_frame_count = len(glob.glob(f'{ground_truth}/*.png'))
    generated_frame_count = len(glob.glob(f'{test}/*.png'))

    # match frame counts
    if ground_truth_frame_count < generated_frame_count:
        trim_frames(f'{test}', ground_truth_frame_count)
        print("the number of the frames is",ground_truth_frame_count)
        
    elif generated_frame_count < ground_truth_frame_count:
        trim_frames(f'{ground_truth}', generated_frame_count)
        print("the number of the frames is",generated_frame_count)


In [None]:
# I will use 13_K.mp4 as the ground truth for my calculations
# create directories for frames

ground_truth_url = "https://drive.google.com/uc?id=**************************"
gdown.download(ground_truth_url, '/kaggle/working/ground_truth_video.mp4', quiet=False)

# create the ground_truth frames
create_frames(dir_path = '/kaggle/working/frames/ground_truth', video_path ='/kaggle/working/ground_truth_video.mp4')

In [31]:
# list all the links of the videos 

# links of the gen_model
base_url = '/kaggle/working/wav2lip_gen'
all_videos_links = []
for item in os.listdir(base_url):
    item_path = os.path.join(base_url, item)
    all_videos_links.append(str(item_path))
    
# links of the not_gen_model    
base_url = '/kaggle/working/wav2lip'
for item in os.listdir(base_url):
    item_path = os.path.join(base_url, item)
    all_videos_links.append(str(item_path))

In [49]:
# import the modified fid_score_m.py

from fid_score_m import calculate_fid_given_paths

# calculate all the FIDs of the videos in the list 

text = ""

for i in all_videos_links:
    create_frames(dir_path = '/kaggle/working/frames/test', video_path = i)
    match_frames()
    
    # paths to directories
    path_to_original_images = '/kaggle/working/frames/ground_truth'
    path_to_generated_images = '/kaggle/working/frames/test'

    # define parameters for the FID calculation
    batch_size = 50
    device = 'cuda'  # or 'cpu' if CUDA is not available
    dims = 2048

    # Calculate the FID score
    fid_value = calculate_fid_given_paths(
        [path_to_original_images, path_to_generated_images],
        batch_size=batch_size,
        device=device,
        dims=dims
    )

    text += f"The FID of {os.path.basename(i).split('.')[0]} is {fid_value}\n"

# save the calculated FIDs to .txt file    
with open("/kaggle/working/FID.txt", 'w') as new:
    new.write(text)

ffmpeg version 4.2.7-0ubuntu0.1 Copyright (c) 2000-2022 the FFmpeg developers
  built with gcc 9 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --e

the number of the frames is 777


100%|██████████| 16/16 [00:45<00:00,  2.82s/it]
100%|██████████| 16/16 [00:42<00:00,  2.67s/it]

the number of the frames is 777


100%|██████████| 16/16 [00:45<00:00,  2.83s/it]
100%|██████████| 16/16 [00:41<00:00,  2.62s/it]

the number of the frames is 591


100%|██████████| 12/12 [00:35<00:00,  2.93s/it]
100%|██████████| 12/12 [00:33<00:00,  2.82s/it]

the number of the frames is 591


100%|██████████| 12/12 [00:35<00:00,  2.93s/it]
100%|██████████| 12/12 [00:32<00:00,  2.71s/it]

the number of the frames is 591


100%|██████████| 12/12 [00:35<00:00,  2.92s/it]
100%|██████████| 12/12 [00:32<00:00,  2.73s/it]

the number of the frames is 591


100%|██████████| 12/12 [00:35<00:00,  2.92s/it]
100%|██████████| 12/12 [00:32<00:00,  2.72s/it]

the number of the frames is 591


100%|██████████| 12/12 [00:35<00:00,  2.92s/it]
100%|██████████| 12/12 [00:33<00:00,  2.75s/it]


In [50]:
# The FID of the videos 

with open("/kaggle/working/FID.txt", 'r') as file:
    content = file.read()
print(content)

The FID of Korean_gen_model is 2.3120862558399153
The FID of Arabic_gen_model is 2.280514213509388
The FID of Spanish_gen_model is 2.389169484441494
The FID of English_gen_model is 2.4446369026039747
The FID of Korean_not_gen_model is 2.3007421794956375
The FID of English_not_gen_model is 2.3435178692535867
The FID of Spanish_not_gen_model is 2.3006673257909256
The FID of Arabic_not_gen_model is 2.3549242277811224
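
The printed lines follow a fixed pattern, so they can be collected into a per-language structure before comparing them. A minimal sketch, assuming every line has the form `The FID of <Language>_<gen|not_gen>_model is <value>`; the helper name `parse_fid_report` is hypothetical and not part of the notebook:

```python
import re
from collections import defaultdict

# Matches lines such as "The FID of Korean_not_gen_model is 2.30".
# The lazy language group stops at the first "_" that is followed by
# "gen_model" or "not_gen_model", so "not" never ends up in the language name.
LINE_RE = re.compile(r"The FID of (\w+?)_(not_gen|gen)_model is ([0-9.]+)")

def parse_fid_report(text):
    """Return {language: {"gen": fid, "not_gen": fid}} from the report text."""
    table = defaultdict(dict)
    for match in LINE_RE.finditer(text):
        language, variant, value = match.groups()
        table[language][variant] = float(value)
    return dict(table)

sample = (
    "The FID of Korean_gen_model is 2.3120862558399153\n"
    "The FID of Korean_not_gen_model is 2.3007421794956375\n"
)
print(parse_fid_report(sample))
```

In the notebook, `content` read from `/kaggle/working/FID.txt` could be passed in place of `sample`.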



## Discussion of the Results

### Analysis of FID Values

The Fréchet Inception Distance (FID) compares the distribution of features extracted from the generated frames with that of the ground-truth frames. A lower FID means the generated frames are statistically closer to the real ones, which is usually read as better visual quality. Here are the FID values for each video:

### FID Values for `wav2lip_gen.pth` and `wav2lip.pth`

| Language | FID (wav2lip_gen.pth) | FID (wav2lip.pth) |
|----------|-----------------------|-------------------|
| Korean   | 2.3120862558399153    | 2.3007421794956375 |
| Arabic   | 2.280514213509388     | 2.3549242277811224 |
| Spanish  | 2.389169484441494     | 2.3006673257909256 |
| English  | 2.4446369026039747    | 2.3435178692535867 |
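
For reference, FID is conventionally defined as the Fréchet distance between two Gaussians fitted to Inception features of the real and the generated frames. Whether the script that produced `FID.txt` follows exactly this definition is an assumption, since that script is not shown in this section:

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
$$

Here $\mu_r, \Sigma_r$ are the mean and covariance of the features of the real frames, and $\mu_g, \Sigma_g$ are those of the generated frames. The distance is zero only when both means and covariances coincide.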

### Insights

1. **Comparison of Models**:
    - **Generated Model (`_gen_model`, i.e. `wav2lip_gen.pth`)**:
        - Lowest FID: Arabic (2.280514213509388)
        - Highest FID: English (2.4446369026039747)
        - Insight: For this model, the Arabic output is closest to the ground truth and the English output is furthest from it.

    - **Non-generated Model (`_not_gen_model`, i.e. `wav2lip.pth`)**:
        - Lowest FID: Spanish (2.3006673257909256)
        - Highest FID: Arabic (2.3549242277811224)
        - Insight: For this model, the Spanish output is closest to the ground truth and the Arabic output is furthest from it.

2. **Language Comparison within Models**:
    - **Generated Model**:
        - Arabic_gen_model (2.280514213509388) is the best performing, closely followed by Korean_gen_model (2.3120862558399153).
        - English_gen_model (2.4446369026039747) has the highest FID, indicating lower quality compared to the other generated models.

    - **Non-generated Model**:
        - Spanish_not_gen_model (2.3006673257909256) is the best performing, with Korean_not_gen_model (2.3007421794956375) being very close.
        - Arabic_not_gen_model (2.3549242277811224) has the highest FID, indicating lower quality compared to the other non-generated models.


3. **Overall Best Performance**: 
    - **Arabic_gen_model** (2.280514213509388) for `wav2lip_gen.pth`.
    - **Spanish_not_gen_model** (2.3006673257909256) for `wav2lip.pth`.


4. **Overall Worst Performance**:
    - **English_gen_model** (2.4446369026039747) for `wav2lip_gen.pth`.
    - **Arabic_not_gen_model** (2.3549242277811224) for `wav2lip.pth`.


5. **Language Comparison**:
    - For **Korean**, the FID values are very close between the two models, with `wav2lip.pth` being slightly better.
    - For **Arabic**, `wav2lip_gen.pth` performs better.
    - For **Spanish**, `wav2lip.pth` performs better.
    - For **English**, `wav2lip.pth` performs better.
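
The per-language comparison can be reproduced with a few lines of arithmetic. A minimal sketch, with the values copied from the FID table; the helper name `compare` is hypothetical:

```python
# FID values copied from the table: (wav2lip_gen.pth, wav2lip.pth).
FID = {
    "Korean":  (2.3120862558399153, 2.3007421794956375),
    "Arabic":  (2.280514213509388, 2.3549242277811224),
    "Spanish": (2.389169484441494, 2.3006673257909256),
    "English": (2.4446369026039747, 2.3435178692535867),
}

def compare(fid_table):
    """Return {language: (checkpoint with the lower FID, absolute gap)}."""
    result = {}
    for language, (gen, base) in fid_table.items():
        winner = "wav2lip_gen.pth" if gen < base else "wav2lip.pth"
        result[language] = (winner, abs(gen - base))
    return result

for language, (winner, gap) in compare(FID).items():
    print(f"{language:8s} lower FID: {winner:16s} gap: {gap:.4f}")
```

The gaps range from roughly 0.01 (Korean) to roughly 0.10 (English), which helps when judging how much weight each comparison can carry.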

### Summary of Insights

- **Arabic_gen_model** has the lowest FID of all eight runs (2.2805), so the generated model does best on Arabic. These results do not establish the cause; better-suited training data or tuning for this language are possible explanations.
- **English_gen_model** has the highest FID of all eight runs (2.4446), suggesting room for improvement in the generated model's training or data quality for English.
- **Spanish_not_gen_model** and **Korean_not_gen_model** have nearly identical, low FID values (both about 2.3007), indicating high-quality outputs for the non-generated model in these languages.
- The **non-generated model** has the lower FID for Korean, Spanish, and English, while the **generated model** has the lower FID only for Arabic. All eight values lie between 2.28 and 2.45, so the differences between models are small.

### Recommendations

- **Focus on Improving English Generation**: Given that English_gen_model has the highest FID, efforts could be made to improve the generated model's performance for English. This could involve augmenting the training data or fine-tuning the model parameters.
- **Leverage Strengths in Arabic Generation**: The strong performance of Arabic_gen_model suggests that the techniques or data used for this language could potentially be applied to improve the model's performance in other languages.
- **Evaluate Training Data**: Ensuring high-quality, diverse, and well-aligned training data for each language can help improve the model's performance across the board.

Overall, the analysis highlights the relative strengths and weaknesses of the two models across languages and indicates where future improvements could focus.



____________________________________

# Training And Fine Tuning

### The previous section suggests that the model may have deficiencies in generating frames or in synchronizing them with the audio, so training it intensively on a language where it performed worse may improve the quality.

In [None]:
# URL of a single training video, used only to check that the training process runs
import os
import gdown

url = "https://drive.google.com/uc?id=1H-***********************"

# directory for the main training data set
os.makedirs('/kaggle/working/main/data_file', exist_ok=True)

# directory where the trained model checkpoints are saved
os.makedirs('/kaggle/working/my_check_point', exist_ok=True)

# directory where the preprocessed data is saved
os.makedirs('/kaggle/working/final', exist_ok=True)

# Download the video
gdown.download(url, output="/kaggle/working/main/data_file/12345.mp4", quiet=False)



## Training the Wav2Lip models
### without the additional visual quality discriminator

In [None]:
# Download the pretrained lipsync_expert.pth model
# (see the Wav2Lip repo for this checkpoint and for more details)

# URL of the checkpoint
url = "https://drive.google.com/uc?id=*******************************"

# Download the file
gdown.download(url, output="/kaggle/working/lipsync_expert.pth", quiet=False)

# download the face detection pretrained model
!wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O "/kaggle/working/Wav2Lip/face_detection/detection/sfd/s3fd.pth"


In [58]:
%cd /kaggle/working/Wav2Lip/

/kaggle/working/Wav2Lip


In [59]:
# Preprocess the training video
! python "/kaggle/working/Wav2Lip/preprocess.py" --ngpu 1 --batch_size 16 --data_root "/kaggle/working/main" --preprocessed_root "/kaggle/working/final"


Started processing for /kaggle/working/main with 1 GPUs
100%|████████████████████████████████████████████| 1/1 [04:14<00:00, 254.71s/it]
Dumping audios...
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  7.97it/s]


In [61]:
# Write the txt files that specify the training, validation, and test sets (the same video is used for all three in this check)

! echo  "12345" > /kaggle/working/Wav2Lip/filelists/train.txt
! echo "12345" > /kaggle/working/Wav2Lip/filelists/val.txt
! echo "12345" > /kaggle/working/Wav2Lip/filelists/test.txt
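
Before starting a long run, it can help to confirm that every filelist entry points at a preprocessed folder that actually contains frames. A minimal sketch, assuming the training script resolves each filelist line as a path relative to `--data_root` and that `preprocess.py` stores the frames as `.jpg` files in one folder per video; both are assumptions about the Wav2Lip repository layout, not something confirmed in this notebook. The helper `check_filelist` and the `min_frames` threshold are hypothetical:

```python
import os
from glob import glob

def check_filelist(filelist_path, preprocessed_root, min_frames=16):
    """Return (entry, problem) pairs for filelist entries that look unusable.

    min_frames is a hypothetical threshold; adjust it to the training script.
    """
    problems = []
    with open(filelist_path) as f:
        entries = [line.strip() for line in f if line.strip()]
    for entry in entries:
        folder = os.path.join(preprocessed_root, entry)
        if not os.path.isdir(folder):
            problems.append((entry, "folder not found"))
            continue
        n_frames = len(glob(os.path.join(folder, "*.jpg")))
        if n_frames < min_frames:
            problems.append((entry, f"only {n_frames} frames"))
    return problems

# Example call with the paths used in this notebook (skipped if absent).
train_list = "/kaggle/working/Wav2Lip/filelists/train.txt"
if os.path.exists(train_list):
    print(check_filelist(train_list, "/kaggle/working/final"))
```

An empty list means every entry resolved to a folder with enough frames. If an entry is reported as missing, listing `/kaggle/working/final` shows which relative path the preprocessing step actually produced.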



In [62]:
# Run the training script
! python "/kaggle/working/Wav2Lip/wav2lip_train.py" --data_root "/kaggle/working/final" --checkpoint_dir "/kaggle/working/my_check_point" --syncnet_checkpoint_path "/kaggle/working/lipsync_expert.pth"

use_cuda: True
total trainable params 36298035
Load checkpoint from: /kaggle/working/lipsync_expert.pth
Starting Epoch: 0
0it [00:00, ?it/s]^C
0it [00:18, ?it/s]
Traceback (most recent call last):
  File "/kaggle/working/Wav2Lip/wav2lip_train.py", line 371, in <module>
    train(device, model, train_data_loader, test_data_loader, optimizer,
  File "/kaggle/working/Wav2Lip/wav2lip_train.py", line 210, in train
    for step, (x, indiv_mels, mel, gt) in prog_bar:
  File "/opt/conda/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1328, in _next_data
    idx, data = self._get_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1294, in _get_data
    success, data = self._try_get_data()
  File "

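# To train the above model
### The model has 36,298,035 trainable parameters. On the available machines the training did not succeed, even with a T4 or P100 GPU: the run was interrupted (`^C` in the log above) before completing a single training step, while the data loader was still fetching the first batch.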
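
To put the parameter count in perspective, here is a rough back-of-the-envelope memory estimate. It is only a sketch under stated assumptions: float32 weights and an Adam-style optimizer that keeps two extra values per parameter, neither of which is confirmed by the log above.

```python
# Rough memory estimate for the parameter count printed in the training log.
# Assumptions (not confirmed by the log): float32 weights and an Adam-style
# optimizer that keeps two extra values per parameter.
N_PARAMS = 36_298_035
BYTES_PER_FLOAT32 = 4

def mib(n_bytes):
    """Convert bytes to mebibytes."""
    return n_bytes / (1024 ** 2)

weights = N_PARAMS * BYTES_PER_FLOAT32
# weights + gradients + two optimizer moment buffers
training_state = 4 * weights

print(f"weights only:   {mib(weights):.1f} MiB")
print(f"training state: {mib(training_state):.1f} MiB (activations excluded)")
```

Activations, which scale with the batch size, and data-loading buffers are not included, so this is a lower bound on what a training run needs.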
