# Weverse Downloader

The Weverse Downloader downloads, processes, and assembles video and audio files from Weverse Live streams. It includes functions to build a YouTube title from the stream's page, create an output directory, download the video segments, concatenate the downloaded clips, write the final video to an .mp4 file, and play the video within a Jupyter Notebook.

Additionally, the code uses the Whisper ASR (Automatic Speech Recognition) model to transcribe the audio from the video, converts the transcription result into SubRip subtitle format, and writes it to an .srt file.

Weverse Live streams are delivered differently from other video broadcasting websites. The src path to the stream is a blob link that leads to an error page. Third-party downloaders cannot process the link or are locked behind a paywall. 

Clips are sent from the servers as encoded segments, each fetched through its own network request. The request URL has a consistent format, so the segment number in it can be substituted programmatically. The request URL expires when the browser session expires.
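
The substitution described above can be sketched in isolation. This is a minimal illustration, not the notebook's own code: the helper names `build_url_template` and `segment_urls` are hypothetical, the example URL is a placeholder, and the six-digit counter before `.ts` is assumed from the request URL shown in the execution cell.

```python
import re


def build_url_template(segment_url):
    """Swap the six-digit segment counter before '.ts' for a '{code}' placeholder."""
    template, count = re.subn(r"-(\d{6})\.ts", "-{code}.ts", segment_url, count=1)
    if count == 0:
        raise ValueError("no six-digit segment counter found before '.ts'")
    return template


def segment_urls(segment_url, last_index):
    """Return the URLs for segments 1..last_index, zero-padded to six digits."""
    template = build_url_template(segment_url)
    return [template.replace("{code}", f"{i:06d}") for i in range(1, last_index + 1)]


# Placeholder URL with the same shape as a captured request URL (assumption).
example = "https://example.com/hls/abc-000022.ts?token=xyz"
for url in segment_urls(example, 3):
    print(url)
```

Anchoring the pattern to `-NNNNNN.ts` keeps the replacement from touching other digit runs in the URL, such as the query-string token.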

### imports

In [1]:
#imports
import os
import logging
import time

import ffmpeg
import moviepy
from moviepy.editor import concatenate_videoclips, VideoFileClip

from tqdm import tqdm

import requests

import selenium
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

import undetected_chromedriver as uc

from googletrans import Translator

from IPython.display import display, HTML, Markdown

import Credentials

import whisper

### definitions

In [2]:
def get_youtube_title():
    '''
    Opens a Chrome browser and automatically logs in to a website.
    Retrieves the YouTube title and publish date.
    Translates the title into different languages.
    Prints the translated titles.
    '''
    
    # Initialize webdriver
    options = uc.ChromeOptions()
    options.headless = False
    
    driverWeverse = uc.Chrome(options = options)
    driverWeverse.get(video_url)
    
    # Login process
    time.sleep(3)
    WebDriverWait(driverWeverse, 20).until(EC.element_to_be_clickable((By.CLASS_NAME, 'UserJoinInduceLayerView_link__wcuim'))).click()
    WebDriverWait(driverWeverse, 10).until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'button[aria-label="confirm modal"].ModalButtonView_button__B5k-Z'))).click()
    driverWeverse.find_element(By.TAG_NAME, 'body').send_keys(Keys.TAB + username + Keys.ENTER)
    time.sleep(2)
    driverWeverse.find_element(By.TAG_NAME, 'body').send_keys(Keys.TAB + Keys.TAB + password + Keys.ENTER)

    # Retrieve title and date
    h2_text = WebDriverWait(driverWeverse, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'TitleView_title__SSnHb'))).text.replace("replay\n", '')
    span_text = driverWeverse.find_element(By.CLASS_NAME, 'HeaderView_info__j-KNX').text
    month = span_text.split('.')[0]
    day = span_text.split('.')[1]
    global output_name
    output_name = f'{year}{month}{day}'

    # Translates the title
    ENG_title = Translator().translate(h2_text, dest = 'en')
    SPA_title = Translator().translate(h2_text, dest = 'es')
    JPN_title = Translator().translate(h2_text, dest = 'ja')

    global youtube_title, youtube_title_SPA, youtube_title_KOR, youtube_title_JPN
    youtube_title = f"[ENG SUB] {output_name} | LE SSERAFIM Weverse Live 🔴 ({h2_text}) {ENG_title.text}"
    youtube_title_SPA = f"[SPA SUB] {output_name} | LE SSERAFIM Weverse EN VIVO 🔴 ({h2_text}) {SPA_title.text}"
    youtube_title_KOR = f"[한국어 자막] {output_name} | 르세라핌 위버스 라이브 🔴 ({h2_text})"
    youtube_title_JPN = f"[日本語字幕] {output_name} | レ・セラフィム ウィバースライブ 🔴 ({h2_text}) {JPN_title.text}"

    #Prints the translated titles.
    print(youtube_title)
    print(youtube_title_SPA)
    print(youtube_title_KOR)
    print(youtube_title_JPN)
        
    driverWeverse.quit()

In [3]:
def createDir():
    try:
        # Creates the output directory if it doesn't already exist.
        if not os.path.exists(output_name):
            os.makedirs(output_name)
    except OSError as e:
        print(f"An error occurred while creating the directory: {e}")

In [4]:
def downloader():
    with requests.Session() as session:
        #Iterates over a range of network request codes and downloads the corresponding files.
        for i in tqdm(range(1, last_code_index), desc="Downloading", unit="file", leave=True):
            code = f'{i:06d}'
            url = url_template.replace('{code}', code)

            try:
                response = session.get(url)
                if response.status_code == 200: #Checks if valid URL.
                    file_path = os.path.join(output_name, f'output_{code}.ts')
                    with open(file_path, 'wb') as file:
                        file.write(response.content)
                else:
                    print(f"Failed to retrieve TS file with code {code}")
            except requests.exceptions.RequestException as e:
                logging.error(f"Error downloading file with code {code}: {e}")
                # Handle the exception, such as retrying the download or logging the error message
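
The retry mentioned in the comment above could look roughly like the sketch below. `fetch_with_retry` is a hypothetical helper, not part of the notebook; it takes any callable that returns the payload or raises, so it can be tried without a network connection.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times and return the first successful result.

    `fetch` is any callable that returns the payload or raises an exception.
    The wait between attempts grows linearly (delay, 2 * delay, ...).
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # narrow this to the errors you expect
            last_error = error
            if attempt < retries:
                time.sleep(delay * attempt)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

With `requests`, `fetch` could be a small wrapper that calls `session.get(url, timeout=10)`, then `response.raise_for_status()`, and returns `response.content`; the timeout value is an assumption.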

In [5]:
def videoClipsList():
    file_range = range(1, last_code_index)
    # Build the paths relative to the output directory, matching where downloader() writes the segments.
    file_paths = [os.path.join(output_name, f"output_{i:06d}.ts") for i in file_range]

    def load_video_clip(file_path):
        try:
            return VideoFileClip(file_path)
        except Exception as e:
            logging.error(f"Error loading video clip from {file_path}: {e}")
            return None

    global video_clips
    # Load each clip once and keep only the ones that opened successfully.
    loaded_clips = (load_video_clip(file_path) for file_path in tqdm(file_paths, desc="Loading Clips", unit="clip"))
    video_clips = [clip for clip in loaded_clips if clip is not None]

In [6]:
def concatenate():
    batch_size = 10  # Number of clips to concatenate at a time

    # Concatenate clips in batches
    clip_batches = [video_clips[i:i+batch_size] for i in range(0, len(video_clips), batch_size)]
    concatenated_clips = []

    for clips in tqdm(clip_batches, desc='Concatenating Batches', unit='batch'):
        concatenated_clips.append(concatenate_videoclips(clips, method='compose'))

    # Concatenate the batches into a final clip
    global final_clip
    final_clip = concatenate_videoclips(concatenated_clips, method='compose')
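
The batching step can be checked on plain lists before any video is loaded. The sketch below uses hypothetical helper names and assumes nothing about MoviePy; it only shows how slicing splits the clips and how many batches to expect.

```python
import math


def split_into_batches(items, batch_size):
    """Slice `items` into consecutive batches of at most `batch_size` elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def batch_count(n_items, batch_size):
    """Number of batches the slicing above produces (ceiling division)."""
    return math.ceil(n_items / batch_size)


clips = list(range(20))  # stand-ins for 20 loaded clips
batches = split_into_batches(clips, 10)
print(len(batches), batch_count(len(clips), 10))
```

Ceiling division gives 2 batches for 20 clips and 3 for 21, which matches the length of the sliced list in both cases.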

In [7]:
def write():
    #Writes the final clip to an .mp4 file.
    final_clip.write_videofile(f"{output_name}.mp4", fps = 30, threads = 4)

In [8]:
def playVideo(output_name, width = 640, height = 480, title = None):
    #Plays the video file in the Jupyter Notebook environment using HTML and Markdown.
    video_path = f"{output_name}.mp4"
    if title:
        display(Markdown(f"# {title}"))
    video_html = f'<video width="{width}" height="{height}" controls><source src="{video_path}" type="video/mp4"></video>'
    display(HTML(video_html))

In [9]:
def runPackage(video_url_in, network_request_url_in, last_code_index_in):
    defVariables(video_url_in, network_request_url_in, last_code_index_in) #Defines the variables.
    get_youtube_title() #Retrieves the YouTube title.
    createDir() #Creates the output directory.
    downloader() #Downloads the video files.
    videoClipsList() #loads the video clips.
    concatenate() #Concatenates them.
    write() #Writes the final clip.
    playVideo(output_name, title = youtube_title) #Plays the video.

In [10]:
def transcribe():
    global model
    model = whisper.load_model("tiny", device = 'cuda') #Loads the Whisper ASR model; specifies CUDA as the device to use the GPU.
    global audio
    audio = whisper.load_audio(f'{output_name}.mp4') #Loads the audio track from the video file.
    global result
    result = model.transcribe(audio, verbose = True, language = 'Korean', task = 'translate') #Transcribes the Korean audio, translates it into English, and stores the result.

In [11]:
def format_time(timestamp):
    #Formats a time value in seconds into the standard subtitle format (HH:MM:SS,mmm).
    #The parameter is named timestamp so it does not shadow the imported time module.
    hours = int(timestamp // 3600)
    minutes = int((timestamp % 3600) // 60)
    seconds = int(timestamp % 60)
    milliseconds = int((timestamp % 1) * 1000)

    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{milliseconds:03d}"

In [12]:
def convert_to_srt(segments, output_file):
    with open(output_file, 'w', encoding='utf-8') as file:
        index = 1
        #Converts a list of segments into SubRip subtitle format.
        for segment in segments:
            
            start_time = format_time(segment['start'])
            end_time = format_time(segment['end'])
            text = segment['text']
            
            #Writes the subtitles to a file.
            file.write(f"{index}\n")
            file.write(f"{start_time} --> {end_time}\n")
            file.write(f"{text}\n")
            file.write("\n")
            
            index += 1
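
To see what the resulting SubRip text looks like without running Whisper, the conversion can be sketched on hand-written segments. The helper names and the sample segments are made up for illustration; the dictionary keys `start`, `end`, and `text` follow the ones used in the cell above. This variant works in whole milliseconds and strips the segment text, which is a design choice of the sketch rather than the notebook's behavior.

```python
def format_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm using integer milliseconds."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build SubRip text: numbered blocks separated by a blank line."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_timestamp(segment["start"])
        end = format_timestamp(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)


# Hand-written sample segments (not real Whisper output).
sample = [
    {"start": 0.0, "end": 2.5, "text": " Hello"},
    {"start": 3661.25, "end": 3662.0, "text": " Bye"},
]
print(segments_to_srt(sample))
```

Rounding to whole milliseconds first avoids the occasional off-by-one that truncating a floating-point remainder can produce.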

In [13]:
def runWhisper(output_name):
    #Executes the steps to transcribe the audio, convert the result to the .srt format, and save it as an .srt file.
    transcribe()
    convert_to_srt(result['segments'], f'{output_name}.srt')
    print(f'\n\nFile written as {output_name}.srt')

### variables

In [14]:
def defVariables(video_url_in, network_request_url_in, last_code_index_in):
    #URL of the video to download
    global video_url
    video_url = video_url_in
    #network_request_url
    global network_request_url
    network_request_url = network_request_url_in
    #template string for constructing URLs
    global url_template
    url_template = network_request_url.replace(network_request_url.split('-')[-1].split('.')[0], '{code}')
    #last index used in constructing URLs
    global last_code_index
    last_code_index = (last_code_index_in) + 1

    global username
    username = Credentials.username
    global password
    password = Credentials.password
    global year
    year = 22

    ###output_name = '230706'

### execution

In [None]:
%%time
runPackage(
    video_url_in = 'https://weverse.io/lesserafim/live/3-123907106',
    network_request_url_in = 'https://weverse-rmcnmv.akamaized.net/c/read/v2/VOD_ALPHA/weverse_2023_07_06_0/hls/499a455b-1bed-11ee-b37e-a0369ffdede8-000022.ts?__gda__=1688975774_fb4cfbe5ae2def79caa216d1d96676e1',
    last_code_index_in = 20
)

[ENG SUB] 220706 | LE SSERAFIM Weverse Live 🔴 (🍄) 🍄
[SPA SUB] 220706 | LE SSERAFIM Weverse EN VIVO 🔴 (🍄) 🍄
[한국어 자막] 220706 | 르세라핌 위버스 라이브 🔴 (🍄)
[日本語字幕] 220706 | レ・セラフィム ウィバースライブ 🔴 (🍄) 🍄


Downloading: 100%|███████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 21.77file/s]
Loading Clips: 100%|█████████████████████████████████████████████████████████████████| 20/20 [00:23<00:00,  1.16s/clip]
Concatenating Batches: 100%|██████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 11.83batch/s]


Moviepy - Building video 220706.mp4.
MoviePy - Writing audio in 220706TEMP_MPY_wvf_snd.mp3
MoviePy - Done.
Moviepy - Writing video 220706.mp4



t:  99%|████████████████████████████████████████████████████████████████▏| 2402/2433 [07:22<00:06,  4.74it/s, now=None]

In [None]:
%%time
###runWhisper()

https://colab.research.google.com/drive/1kmnzxf7a-wGjsEDXjO46PyVxQOv54tW0?usp=sharing

## Post Notes

### To-do

- Create helper functions: _login_to_website, _get_title_info, and _translate_title.
- Automate file run on new upload.
- Automate network request URL.
- Automate last_code_index.
- Automate upload to YouTube with API.
- Reduce redundancy in defVariables().
- Find solution to run program off home machine.
- Whisper JAX.
- Kaggle Notebook TPUs.
- Docker?

### Limitations

- Requires manual link entry.
- get_youtube_title() sometimes misclicks in the webdriver.
- Requires a lot of CPU and RAM.
- The Whisper large model requires about 10 GB of VRAM.
- Cannot be run on my work machine due to lack of RAM.
- Currently runs Whisper on Google Colab.
- Concatenation leaves behind a residual stutter.
- Whisper large-v2 has WER of 13%.
- Whisper cannot distinguish music from speech.
- Whisper needs to "warm-up".
- Speed decreases until the kernel is restarted.
- Cannot load too many files; kernel must be restarted after about 2 calls.
- Too many global variables.
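
One possible way to cut down on the globals is to carry the inputs in a single object and pass it to each step. The sketch below is only an illustration: `DownloadConfig` and its members are hypothetical names, the URLs are placeholders, and the template logic mirrors the string splitting used in defVariables().

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DownloadConfig:
    """Inputs that the notebook currently stores in global variables."""
    video_url: str
    network_request_url: str
    last_code_index: int

    @property
    def url_template(self):
        # Same idea as defVariables(): take the counter after the last '-'
        # and before the first '.', then swap it for a placeholder.
        code = self.network_request_url.split('-')[-1].split('.')[0]
        return self.network_request_url.replace(code, '{code}')

    def segment_urls(self):
        # Segment numbers run from 1 to last_code_index inclusive.
        return [
            self.url_template.replace('{code}', f'{i:06d}')
            for i in range(1, self.last_code_index + 1)
        ]


config = DownloadConfig(
    video_url="https://example.com/live/1",  # placeholder
    network_request_url="https://example.com/hls/abc-000022.ts?token=xyz",  # placeholder
    last_code_index=3,
)
print(config.url_template)
```

Functions such as downloader() could then take `config` as a parameter instead of reading module-level names.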

### Issues

- Memory leak?
- Must optimize.