# NSMQ - Kwame AI Project
##### Title: A Script for Cropping Videos Based on Timestamps
###### By: Ernest Samuel, Team member; Data preprocessing Team
###### Date: 05-08-2023

This script crops videos based on timestamps. It is currently saved in the Data Curation folder (path: Videos/2019_2020), while the cropped videos are saved in the Technical folder (path: Data Preprocessing/Cropped_riddles). It takes two inputs: a CSV file containing the timestamps (passed to the script as `excel_file`) and a directory containing the videos to be cropped. The script first reads the timestamp file into a Pandas DataFrame. It then iterates through the DataFrame and crops the corresponding videos. Each cropped clip, together with its extracted audio track, is saved to the output folders specified in the script.

### Key points for using this code:

- It can be used to crop videos based on any set of timestamps.
- It is easy to use and can be run from the command line.
- It is efficient and can crop videos quickly.
- Make sure to specify the path to the folder containing the videos you wish to crop; otherwise, the script looks for them in the "videos" folder, which is the default path.

The code is written in Python and uses the moviepy library. The moviepy library is a powerful library for working with videos in Python. It allows you to read, write, edit, and process videos.

To run the code, you will need to install the *moviepy library*. You can do this by running the following command in your terminal:

                pip install moviepy

Once you have installed the moviepy library, save the code as a Python file and run it from the command line. For example, if you saved the code as crop_video.py, run:

                python crop_video.py
                
This will crop the videos and save the clips to the output folders configured in the script.



In [12]:
# Install needed library if not already installed
# pip install moviepy

# Updated  16/08/2023
"""
#-------------------------------------------------------------Instructions Start---------------------------------------------------------------------------------#

# Instructions on running the code:

# 1. Ensure you have the 'moviepy' library installed. You can install it using pip.
#    Uncomment the following line to install the library (remove '#' at the beginning of the line):
#    !pip install moviepy

# 2. Ensure csv_timestamps_data_preprocessing.ipynb is in the same directory as this file.

# 3. Make sure you have a CSV file containing video timestamps.
#    Modify the 'excel_file' variable in the 'if __name__ == "__main__":' section to point to that file.

# 4. Organize your video files in a folder.
#    Set the 'videos_path' variable in the 'if __name__ == "__main__":' section to the folder containing your video files.

# 5. Run the script. You can execute it in your terminal or IDE. Make sure you are in the correct directory.

# The script will read the timestamps from the CSV file and crop the videos based on those timestamps.

#------------------------------------------------------------Instructions End ------------------------------------------------------------------------#
"""
# Import necessary libraries

#!pip install moviepy
from moviepy.video.io.VideoFileClip import VideoFileClip
from datetime import datetime
import os
import pandas as pd
import re
from numpy import nan

# file for CSV file preprocessing
%run csv_timestamps_data_preprocessing.ipynb
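
The main script calls a helper named `time_to_seconds` that is not defined in this notebook; it is presumably provided by `csv_timestamps_data_preprocessing.ipynb`. The block below is only a minimal sketch of what such a helper might look like, assuming timestamps are written as `HH:MM:SS` or `MM:SS`. The real implementation may differ.

```python
def time_to_seconds(timestamp: str) -> int:
    """Convert 'HH:MM:SS', 'MM:SS' or 'SS' into a number of seconds.

    Illustrative stand-in only; the project's own helper lives in the
    preprocessing notebook and may behave differently.
    """
    parts = [int(part) for part in timestamp.strip().split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"Unrecognised timestamp: {timestamp!r}")
    seconds = 0
    for part in parts:
        # Each step shifts the running total up by one base-60 place.
        seconds = seconds * 60 + part
    return seconds


print(time_to_seconds("00:01:30"))
print(time_to_seconds("12:05"))
```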

## For Google drive mounting in Google Colab
- Uncomment the section below

In [8]:
# #------------------------------------ Load directories in google colab ----------------------------------------#
# from google.colab import drive
# drive.mount('/content/drive')
# drive_content = os.listdir('/content/drive/MyDrive/NSMQ AI Project/Data Curation/Videos/2019_2020')

# # Directory containing video files
# video_folder = "/content/drive/MyDrive/NSMQ AI Project/Data Curation/Videos/2019_2020/2019"
# # CSV file path
# csv_file = "/content/drive/MyDrive/NSMQ AI Project/Data Curation/Videos/2019_2020/2019_riddle_start_end_file.csv"
# drive_content

In [5]:
#----------------------------------------Video folder Data Cleaning---------------------------------------------------------------------------------------------------------#
#----------- Video file name cleaning to remove unwanted characters --------------#


# Iterate over the files in the specified directory
for filename in os.listdir(video_folder):
    if os.path.isfile(os.path.join(video_folder, filename)):
        # Remove unwanted characters (apostrophes, quotes, hyphens, commas) and whitespace from the filename
        new_filename = re.sub(r'[’\'"\-,\s]', '', filename)

        # Check if the filename has changed
        if filename != new_filename:
            # Rename the file with the updated name
            os.rename(os.path.join(video_folder, filename), os.path.join(video_folder, new_filename))
            print(f'Renamed: {filename} to {new_filename}')

Renamed: NSMQ_2020_QUARTER FINALS_CONTEST_KUMASI_SEC_TECH_SCH_VS_ ACCRA _ACADEMY_VS_ST_. AUGUSTINE'S_COLLEGE.mp4 to NSMQ_2020_QUARTERFINALS_CONTEST_KUMASI_SEC_TECH_SCH_VS_ACCRA_ACADEMY_VS_ST_.AUGUSTINES_COLLEGE.mp4
Renamed: NSM_Q2020_QUARTER_FINALS_CONTEST_GHANA_NATIONAL_COLLEGE_VS_KOFORIDUA_SEC_TECH_ SCH_VS_PRESEC_LEGON.mp4 to NSM_Q2020_QUARTER_FINALS_CONTEST_GHANA_NATIONAL_COLLEGE_VS_KOFORIDUA_SEC_TECH_SCH_VS_PRESEC_LEGON.mp4
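
The cleaning rule can be tried out on plain strings before any file is renamed. The sketch below wraps the same regular expression in a function and builds a list of planned renames without touching the disk. The names `clean_filename` and `plan_renames` are hypothetical helpers introduced here for illustration.

```python
import re

# Same character class as the cleaning cell: curly and straight apostrophes,
# double quotes, hyphens, commas and any whitespace.
UNWANTED = re.compile(r'[’\'"\-,\s]')


def clean_filename(filename: str) -> str:
    """Return the filename with the unwanted characters removed."""
    return UNWANTED.sub("", filename)


def plan_renames(filenames):
    """Return (old, new) pairs for the names that would actually change."""
    return [(name, clean_filename(name))
            for name in filenames
            if clean_filename(name) != name]


for old, new in plan_renames(["ST_. AUGUSTINE'S_COLLEGE.mp4", "already_clean.mp4"]):
    print(f"Would rename: {old} to {new}")
```

Reviewing the planned pairs first makes it easier to spot two source files that would collapse to the same cleaned name.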


In [None]:
#-------------------------------------------------------------------------------------------------------------------------------------------------#

#------------------------------------------------------------ Data Validation: layer 1 ------------------------------------------------------------------------#

#----------- Check if video names in video folder are in CSV file --------------#

# Load the CSV file into a DataFrame
df = pd.read_csv(csv_file, encoding='ISO-8859-1')

# Get a list of video names from the CSV file
csv_video_names = df['video_name'].tolist()
print(csv_video_names)
# Iterate through the files in the video folder
for filename in os.listdir(video_folder):
    if filename.endswith((".webm", ".mp4")):  # Videos are expected in mp4 or webm format
        video_name = os.path.splitext(filename)[0]  # Extract the name without the extension
        if video_name not in csv_video_names:
            print(f"Video not found in the CSV file: {video_name}, {filename}")

print("Done checking for unfound videos.")

print("\n\n These videos are in the CSV file")
# Iterate through the files in the video folder
for filename in os.listdir(video_folder):
    if filename.endswith((".webm", ".mp4")):  # Videos are expected in mp4 or webm format
        video_name = os.path.splitext(filename)[0]  # Extract the name without the extension
        if video_name in csv_video_names:
            print(f"Video found in the CSV file: {video_name}")
print("Done Checking for videos")


['NSMQ_2019_ONEEIGHTH_POPE_JOHN_SHS_VS_NOTRE_DAME_SHS_VS_KOFORIDUA_SHTS', nan, nan, nan, 'NSMQ_2019_ONEEIGHTH_PRESEC_LEGON_VS_ASSIN_STATE_COLLEGE_VS_OFORI_PANIN_SHS', nan, nan, nan, 'NSMQ_2019_ONEEIGHTH_ST._JOHNS_SCHOOL_VS_TAMALE_ISLAMIC_SCIENCE_SHS_VS_ARCHBISHOP_PORTER_GIRLS_SHS', nan, nan, nan, 'NSMQ_2019_ONEEIGHTH_KUMASI_ACADEMY__VS_ANLO_SHS_VS_HOLY_CHILD_SCHOOL', nan, nan, nan, 'NSMQ_2019_ONEEIGHTH_GHANA_SHS_TAMALE_VS_ARMED_FORCES_SHTS_VS_NEW_JUABEN_SHS', nan, nan, nan, 'NSMQ_2019_ONEEIGHTH_ADISADEL_COLLEGE_VS_NSUTAMAN_CATH._SHS_VS_GHANATA_SCH', nan, nan, nan, '\nNSMQ_2019_ONEEIGHTH_ST._IGNATIUS_OF_LOYOLA_SHS_VS_KUMASI_SHTS_VS_BEREKUM_SHS', nan, nan, nan, nan, nan, nan, nan, 'NSMQ_2019_ONEEIGHTH_OPOKU_WARE_SCHOOL_VS_ANGLICAN_SHS_KUMASI_VS_OSEI_KYERETWIE_SHS', nan, nan, nan, nan, nan, nan, nan, 'NSMQ_2019_ONEEIGHTH_TEPA_SHS_VS_ISLAMIC_SHS_KUMASI_VS_AWUDOME_SHS']
Video not found in the CSV file: NSMQ_2019_ONE_EIGHTH_ACHIMOTA_SCHOOL_VS_ST._THOMAS_AQUINAS_SHS_VS_PRESBY_SHS_NKWATIA, NSM
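
When checking for several extensions, note that `str.endswith` accepts a tuple of suffixes, whereas an expression such as `".webm" or ".mp4"` evaluates to just `".webm"`. The sketch below shows the folder-to-CSV check as a small function on plain lists. The name `unlisted_videos` and the sample data are assumptions made for illustration.

```python
import os

VIDEO_EXTENSIONS = (".mp4", ".webm")


def unlisted_videos(filenames, csv_names):
    """Return video files whose name (without extension) is not in csv_names."""
    listed = set(csv_names)
    missing = []
    for filename in filenames:
        # endswith() with a tuple matches any of the listed suffixes.
        if filename.lower().endswith(VIDEO_EXTENSIONS):
            stem = os.path.splitext(filename)[0]
            if stem not in listed:
                missing.append(filename)
    return missing


# `or` between two non-empty strings returns the first one.
print(".webm" or ".mp4")
print(unlisted_videos(["a.mp4", "b.webm", "notes.txt", "c.mp4"], ["a", "c"]))
```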

In [None]:
#-------------------------------------------------------------------------------------------------------------------------------------------------#

#------------------------------------------------------------ Data Validation: Layer 2 ------------------------------------------------------------------------#

#----------- Check if names in CSV are the same as the video files in video folder--------------#


# List all video file names in the video folder
video_files = [os.path.splitext(filename)[0] for filename in os.listdir(video_folder) if filename.endswith((".mp4", ".webm"))]

# Load the CSV file into a DataFrame
df = pd.read_csv(csv_file, encoding='ISO-8859-1')

# Get all values in the "video_name" column
csv_video_names = df['video_name'].tolist()

# Find video names in the CSV file that are not in the video folder
missing_video_names = [video_name for video_name in csv_video_names if video_name not in video_files]

# Print the missing video names
if missing_video_names:
    print("Video names in the CSV file not found in the video folder:")
    for missing_name in missing_video_names:
        print(missing_name)
else:
    print("All video names in the CSV file have corresponding files in the video folder.")

print("Done checking for missing video names.")




# Calculate the total number of entries in the CSV file and the number of missing entries
total_entries = len(csv_video_names)
total_found = total_entries - len(missing_video_names)

# Print the results
print(f"Total number of video names found in the folder: {total_found}")
print(f"Total number of entries in the CSV file: {total_entries}")

print("Done checking for missing video names.")


All video names in the CSV file have corresponding files in the video folder.
Done checking for missing video names.
Total number of video names found in the folder: 30
Total number of entries in the CSV file: 30
Done checking for missing video names.
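
Rows that carry only timestamps leave `video_name` empty, so `df['video_name'].tolist()` can contain `nan` values, and a name can pick up stray whitespace such as a leading newline (both appear in the list printed earlier). A minimal sketch of normalising the names before comparing them, assuming empty cells arrive as float `nan`; `normalize_names` is a hypothetical helper, not part of the original script.

```python
import math


def normalize_names(values):
    """Drop NaN entries and strip surrounding whitespace from the rest."""
    names = []
    for value in values:
        if isinstance(value, float) and math.isnan(value):
            continue  # Empty CSV cell: this row only carries timestamps.
        names.append(str(value).strip())
    return names


raw = ["NSMQ_A", float("nan"), "\nNSMQ_B", float("nan"), "NSMQ_C "]
print(normalize_names(raw))
```

With pandas, a similar effect can be had with `df['video_name'].dropna().str.strip().tolist()`.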


In [13]:

#-------------------------------------------------------------------------------------------------------------------------------------------------#

#------------------------------------------------------------ Main Script ------------------------------------------------------------------------#

# Function to crop videos based on timestamps in the Excel file
# crop_video(excel_file,save_to_all_audio_path, save_to_all_video_path,save_to_specific_audio_path, save_to_specific_video_path, videos_path)
def crop_video(excel_file,audio_save_folder, video_save_folder, save_to_specific_audio_path, save_to_specific_video_path, cropped = 'cropped', input_file="videos"):
    """
    Crop a video based on the timestamps in the specified Excel file.

    Args:
        excel_file (str): The path to the CSV file containing the timestamps.
        audio_save_folder (str): General folder path where all cropped audio files are saved.
        video_save_folder (str): General folder path where all cropped video files are saved.
        save_to_specific_audio_path (str): Folder path for the audio files of a particular year.
        save_to_specific_video_path (str): Folder path for the video files of a particular year.
        cropped (str): Prefix used at the start of each cropped file's name.
        input_file (str): The folder path where the video files are located.
    """
    # Specify the encoding (e.g., 'utf-8', 'latin1', 'ISO-8859-1', 'cp1252', etc.)

    # Try multiple encodings
    csv_encodings = ['utf-8', 'latin1', 'ISO-8859-1']  # Add more encodings if needed

    for csv_encoding in csv_encodings:
        try:
            video_data = pd.read_csv(excel_file, encoding=csv_encoding)
            break  # If successful, exit the loop
        except UnicodeDecodeError:
            continue  # Try the next encoding if decoding fails

    for _, row in video_data.iterrows():
        if not pd.isna(row['video_name']):
            # Rows without a video name reuse the previous one, so a single video can have multiple timestamps
            video_name = row['video_name'] + ".mp4"
            path_name = row['video_name']

        if not pd.isna(row['start_time']):
            start_time = row['start_time']
            stop_time = row['stop_time']

            start_seconds = time_to_seconds(start_time)
            stop_seconds = time_to_seconds(stop_time)

            video_file = os.path.join(input_file, video_name)
#-----------------------------------------------------------------------------------------#
            # Create a contest-specific folder for the video clips
            video_folder_path = os.path.join(save_to_specific_video_path, path_name)
            os.makedirs(video_folder_path, exist_ok=True)
            # Create a contest-specific folder for the audio clips
            audio_folder_path = os.path.join(save_to_specific_audio_path, path_name)
            os.makedirs(audio_folder_path, exist_ok=True)
#----------------------------------------------------------------------------------------------#
            try:
                video_clip = VideoFileClip(video_file)
            except IOError as e:
                print(f"Error: {e}")
                print(f"Skipping video '{video_name}' because it could not be found.")
                continue  # Continue to the next video

            cropped_video = video_clip.subclip(start_seconds, stop_seconds)

            # Build output names from the name without its extension, so ".mp4" does not appear mid-filename
            output_video_filename = f"{cropped}_{path_name}_{start_time.replace(':', '_')}_{stop_time.replace(':', '_')}.mp4"
            output_audio_filename = f"audio_{cropped}_{path_name}_{start_time.replace(':', '_')}_{stop_time.replace(':', '_')}.wav"

            # General output path
            output_video_path = os.path.join(video_save_folder, output_video_filename)
            output_audio_path = os.path.join(audio_save_folder, output_audio_filename)

            # specific folder path
            output_specific_video_path = os.path.join(video_folder_path, output_video_filename)
            output_specific_audio_path = os.path.join(audio_folder_path, output_audio_filename)

            # Save the cropped video
            cropped_video.write_videofile(output_specific_video_path, codec='libx264') #1 to specific video file
            cropped_video.write_videofile(output_video_path, codec='libx264') # to all video

            # Save the audio as a .wav file
            cropped_video.audio.write_audiofile(output_specific_audio_path, codec='pcm_s16le') #1 to specific audio file
            cropped_video.audio.write_audiofile(output_audio_path, codec='pcm_s16le') # to all audio
            cropped_video.close()  # Close the video clip to release resources

    print("All done")
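
The encoding fallback used when reading the CSV can be illustrated on raw bytes. In Python, `latin1` and `ISO-8859-1` are aliases for the same codec, and that codec maps every byte to a character, so it never raises `UnicodeDecodeError`; once it is in the list, the loop always ends with some decoding, even if the result is not the intended text. The sketch below is a standalone illustration, and `decode_with_fallback` is a hypothetical name.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "latin1")):
    """Return (text, encoding) for the first encoding that decodes raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # Try the next candidate.
    # Only reachable if no candidate can decode the bytes.
    raise ValueError(f"None of {encodings} could decode the data")


text, used = decode_with_fallback("café".encode("latin1"))
print(text, used)
```

Raising an explicit error when every candidate fails avoids a later `NameError` on a variable that was never assigned.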



In [None]:
# preparing the csv file and checking for incorrect timestamps
start = 'start_time'
stop = 'stop_time'
csv_file = "testing_file.csv"
time_stamp_CSV_File_validation(csv_file,start,stop)
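
`time_stamp_CSV_File_validation` is defined in the preprocessing notebook and is not shown here. As a rough idea of what checking for incorrect timestamps can involve, the sketch below flags rows whose timestamps cannot be parsed or whose stop time is not after the start time. It is an assumption about the kind of check performed, not the project's actual function; `find_bad_rows` and `_to_seconds` are hypothetical names.

```python
def _to_seconds(timestamp) -> int:
    """Parse 'HH:MM:SS' or 'MM:SS'; raises ValueError on malformed input."""
    total = 0
    for part in str(timestamp).strip().split(":"):
        total = total * 60 + int(part)
    return total


def find_bad_rows(rows):
    """Return (row_index, reason) for every suspicious (start, stop) pair."""
    problems = []
    for index, (start, stop) in enumerate(rows):
        try:
            start_s, stop_s = _to_seconds(start), _to_seconds(stop)
        except ValueError:
            problems.append((index, "unparseable timestamp"))
            continue
        if stop_s <= start_s:
            problems.append((index, "stop is not after start"))
    return problems


sample = [("00:01:00", "00:02:00"), ("00:05:00", "00:04:00"), ("abc", "00:01:00")]
print(find_bad_rows(sample))
```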


In [None]:
# Example usage:
if __name__ == "__main__":

    # Note that the path here differs from the one in the .ipy script
    excel_file = "/content/drive/MyDrive/NSMQ AI Project/Data Curation/Videos/2019_2020/2020_riddle_start_end_file.csv"  # Replace with the path to your Excel file
    videos_path = "/content/drive/MyDrive/NSMQ AI Project/Data Curation/Videos/2019_2020/2020"
    save_to_all_video_path =  "/content/drive/MyDrive/NSMQ AI Project/Technical/Data Preprocessing/Cropped_riddles/All Videos/2020"
    save_to_all_audio_path =  "/content/drive/MyDrive/NSMQ AI Project/Technical/Data Preprocessing/Cropped_riddles/All Audios/2020"
    save_to_specific_video_path = "/content/drive/MyDrive/NSMQ AI Project/Technical/Data Preprocessing/Cropped_riddles/2020/Video"
    save_to_specific_audio_path = "/content/drive/MyDrive/NSMQ AI Project/Technical/Data Preprocessing/Cropped_riddles/2020/Audio"

    
    cropped = 'RIDDLE1'
    #cropped = 'clue2'
    #cropped = 'clue3'
    #cropped = 'clue4'
    #cropped = 'clue5'
    #cropped = 'cropped'

    crop_video(excel_file,save_to_all_audio_path, save_to_all_video_path,save_to_specific_audio_path, save_to_specific_video_path, cropped, videos_path)
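
The row handling inside `crop_video` can be sketched without any video files: a row that names a video sets the current video, and following rows with an empty name reuse it, so one contest can yield several clips. The block below is a simplified illustration on plain dictionaries, assuming empty cells are empty strings or `None`. `iter_clips` and `output_name` are hypothetical helpers, and the naming pattern shown is an illustrative convention rather than a guarantee of the exact filenames the script writes.

```python
def iter_clips(rows):
    """Yield (video_name, start, stop), carrying the last seen name forward."""
    current_name = None
    for row in rows:
        name = row.get("video_name")
        if name:
            current_name = name.strip()
        start = row.get("start_time")
        if not start or current_name is None:
            continue  # Nothing to crop for this row.
        yield current_name, start, row.get("stop_time")


def output_name(prefix, video_name, start, stop, extension=".mp4"):
    """Build a clip filename; colons are replaced since they are unsafe in filenames."""
    return (f"{prefix}_{video_name}_"
            f"{start.replace(':', '_')}_{stop.replace(':', '_')}{extension}")


rows = [
    {"video_name": "CONTEST_A", "start_time": "00:01:00", "stop_time": "00:02:00"},
    {"video_name": "", "start_time": "00:03:00", "stop_time": "00:04:00"},
]
for name, start, stop in iter_clips(rows):
    print(output_name("RIDDLE1", name, start, stop))
```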
