## <p style="text-align:center" color="red"><span style="color:red">Youtube Video Transcriptor - updated</span></p>



<table align="center">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/labrijisaad/Youtube-video-transcriptor"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

- **`This script is meant to be run in google colaboratory!`**
- In order to optimize the transcription time, I updated the script using python threads 😋
- This script can sometimes not detect text correctly, it's mainly due to noises or the way we speak in the video (speaking too fast or too slow)
- The general idea summary model is a community model available at [Huggingface](https://huggingface.co/) , it can sometimes get the general idea wrong, especially if there is a lack of data.


> 🙌 Notebook made by [@labriji_saad](https://github.com/labrijisaad)

### Installing the requirements

In [1]:
import yt_dlp
import time
import re
import os
from pydub import AudioSegment
import speech_recognition as sr
import math
from tqdm import tqdm
from googletrans import Translator
from threading import Thread
import scrapetube
from youtube_transcript_api import YouTubeTranscriptApi

### Downloading the audio (`url = video_link`)
> - Specify here the link of the video you want to transcribe.

In [2]:
videos = scrapetube.get_channel(channel_url="https://www.youtube.com/@ToaoDotNet")
videos = [video for video in videos]

In [3]:
video_ids = [v["videoId"] for v in videos]

In [4]:
transcripts = YouTubeTranscriptApi.get_transcripts(video_ids=video_ids, continue_after_error=True)

In [16]:
transcripts_txt = {}
for vid, transcript in transcripts[0].items():
    if not transcript:
        continue
    f = open(f"./transcripts/{vid}.txt", "w+")
    transcript_txt = " ".join([transcript[i]["text"] for i, t in enumerate(transcript) if i > 0])
    transcripts_txt[vid] = transcript_txt
    f.write(transcript_txt)

In [27]:
import pickle
outputs_pos = pickle.load(open("outputs_pos.pkl", "rb"))
outputs_neg = pickle.load(open("outputs_neg.pkl", "rb"))
print(type(outputs_pos["prompts"]))

<class 'list'>


In [23]:
outputs = {
    "prompt": outputs_pos["prompts"] + outputs_neg["prompts"],
    "response": outputs_pos["outputs"] + outputs_neg["outputs"]
}

In [39]:
N = len(outputs["prompt"])
lines = []
for i in range(N):
    lines.append({"text": f"<s>[INST] {outputs['prompt'][i]} [/INST] {outputs['response'][i]}</s>"})

In [41]:
lines

[{'text': '<s>[INST]     ```system\n    You are an AI assistant tasked with classifying cell phone conversations as scam calls.\n    I will provide you the general structure for scam calls, examples of scam call topics and patterns, and finally the transcript of the call in question. \n    I want you to analyze the transcript and decide whether the transcript describes a scam call or a normal call. \n    Respond only with 1 of the 4 following categories: "Very Likely Scam", "Likely Scam", "Unlikely Scam", "Very Unlikely Scam." On a new line add a 2-3 sentence justification for each decision.\n\tGive the following extraneous information depending on the decided category:\n        Very Likely: On a new line provide an action for the user to do. For example, "Hang up immediately.", or "Do NOT give any personal information."\n        Likely: Provide clarifying questions for the user to ask. For example, "Why do you need this information?"\n        Unlikely: Do NOT provide any actions or qu

In [44]:
import jsonlines

with jsonlines.open("scam_finetune.jsonl", mode="w") as writer:
    for line in lines:
        writer.write(line)

AttributeError: 'Writer' object has no attribute 'writelines'

In [25]:
import json

json.dump(outputs, open("scam_finetune.json", "w+"))

In [None]:
f = open("./transcripts/processed_ids.csv", "w+")
processed_ids = f.read().split(",")
for video in videos:
    if video["videoId"] in processed_ids:
        continue
    url = f"https://www.youtube.com/watch?v={video['videoId']}"
    ydl_opts={}
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
            info_dict = ydl.extract_info(url, download=False)
    video_title = info_dict['title']
    video_name = re.sub('[\\\\/*?:"<>|]', '', video_title)
    name = video_name
    ydl_opts = {
        'format': 'm4a/bestaudio/best',
            'noplaylist': True,
            'continue_dl': True,
            'outtmpl': f'./transcripts/{name}.wav',
            'postprocessors': [{
                'key': 'FFmpegExtractAudio',
                'preferredcodec': 'wav',
                'preferredquality': '192',
            }],
            'geobypass':True,
            #  'ffmpeg_location':'/usr/bin/ffmpeg'
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        error_code = ydl.download(url)
    f.write(",".join(processed_ids) + video["videoId"])

### Spliting the audio (`min_per_split = 1`)
> - When it comes to using the free version of Google's transcriber (**`speech_recognition`**), there is a limit on the length of the video (or audio) that should not be exceeded (this limit is around 5 minutes ). To remedy this problem, the following script splits the video into one minute long intervals and puts the generated mini-videos in a directory that has a name in the form of **`split files for: Video_Name.wav`**

In [92]:
import unicodedata
import re

def slugify(value, allow_unicode=False):
    """
    Taken from https://github.com/django/django/blob/master/django/utils/text.py
    Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated
    dashes to single dashes. Remove characters that aren't alphanumerics,
    underscores, or hyphens. Convert to lowercase. Also strip leading and
    trailing whitespace, dashes, and underscores.
    """
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value.lower())
    return re.sub(r'[-\s]+', '-', value).strip('-_')

In [93]:
class SplitWavAudioMubin():
    def __init__(self, folder, filename):
        self.folder = folder
        self.filename = slugify(filename)[0:-6]
        self.filepath = f"{folder}/{filename}"
        self.audio = AudioSegment.from_wav(self.filepath)

    def get_duration(self):
        return self.audio.duration_seconds

    def single_split(self, from_min, to_min, split_filename):
        t1 = from_min * 60 * 1000
        t2 = to_min * 60 * 1000
        split_audio = self.audio[t1:t2]
        split_audio.export(f"/Users/tanush/Programming/Projects/Shascam/scraper/splits/{self.filename}/{split_filename}", format="wav")

    def multiple_split(self, min_per_split):
        os.makedirs(f"/Users/tanush/Programming/Projects/Shascam/scraper/splits/{self.filename}", exist_ok=True)
        total_mins = math.ceil(self.get_duration() / 60)
        for i in range(0, total_mins, min_per_split):
            split_fn = str(i) + '_' + self.filename + ".wav"
            self.single_split(i, i+min_per_split, split_fn)
            if i == total_mins - min_per_split:
                print('All splited successfully')
        print('>>> Video duration: ' + str(self.get_duration()))

def split_audio(folder, file_name):
    directory = "/Users/tanush/Programming/Projects/Shascam/scraper/splits"
    os.makedirs(directory, exist_ok=True)
    try:
        split_wav = SplitWavAudioMubin(folder, file_name)
        split_wav.multiple_split(min_per_split=1)
    except:
        return

In [94]:
for video in os.listdir("/Users/tanush/Programming/Projects/Shascam/scraper/transcripts"):
    folder = "/Users/tanush/Programming/Projects/Shascam/scraper/transcripts"
    split_audio(folder, video)

All splited successfully
>>> Video duration: 506.2414512471655
All splited successfully
>>> Video duration: 374.37532879818593
All splited successfully
>>> Video duration: 1085.6954195011338
All splited successfully
>>> Video duration: 293.1287074829932
All splited successfully
>>> Video duration: 234.2661224489796
All splited successfully
>>> Video duration: 531.3422222222222
All splited successfully
>>> Video duration: 624.4078004535147
All splited successfully
>>> Video duration: 406.09378684807257
All splited successfully
>>> Video duration: 1730.9082993197278
All splited successfully
>>> Video duration: 243.41478458049886
All splited successfully
>>> Video duration: 306.5034013605442
All splited successfully
>>> Video duration: 207.86503401360545
All splited successfully
>>> Video duration: 498.78784580498865
All splited successfully
>>> Video duration: 550.7308843537415
All splited successfully
>>> Video duration: 885.144671201814
All splited successfully
>>> Video duration: 846.

### Recognizing the text (` language = "en-US"`) https://cloud.google.com/speech-to-text/docs/languages
> - To perform text detection, we must first specify the language spoken in the video. To do this, we must search for the keyword equivalent to language in the language catalog available in the link on the title. ( In our case, it's **`English`** so the keyword is **`en-US`** )

In [3]:
all_files = []
for video in os.listdir("/Users/tanush/Programming/Projects/Shascam/scraper/splits"):
    search_dir = f"/Users/tanush/Programming/Projects/Shascam/scraper/splits/{video}"
    files = os.listdir(search_dir)
    files = [os.path.join(search_dir, f) for f in files]
    files.sort(key=lambda x: os.path.getmtime(x))
    all_files.append(files)

In [9]:
def speech_recognizer(files, frames, i):  ## This function recognizes speech in our WAV files
    texts = []
    recognizer = sr.Recognizer()

    for file in files:
        with sr.AudioFile(file) as source:
            recorded_audio = recognizer.listen(source)
        try:
            text = recognizer.recognize_google(
                recorded_audio,
                language="en-US"  ## Replace with language keyword
            )
            texts.append(text)
        except Exception as ex:
            print(ex)
    result = ""
    for text in texts:
        result += " " + text
    frames[i] = result
    return result

def split_files(files, n_batches): ## This function split the files evenly between the threads we have.
    k, m = divmod(len(files), n_batches)
    return list(files[i*k+min(i, m):(i+1)*k+min(i+1, m)] for i in range(n_batches))

def main(n_batches=8, verbose=True): ## By default, the maximum capacity of threads supported in collab is 8
    all_frames = []
    for files in tqdm(all_files):
        print(files)
        threads = [None]*n_batches
        frames = [None]*n_batches
        batches = split_files(files, n_batches)
        start = 0
        for i in range(len(batches)):
            if i>0:
                start_index=len(batches[i-1])
            else:
                start_index = 0
            t = Thread(target=speech_recognizer, args=(batches[i], frames, i))
            threads[i] = t
            t.start()
        for t in threads:
            t.join()
        all_frames.append(frames)
        print(frames)
    return all_frames

In [12]:
frames = main()

  0%|          | 0/149 [00:00<?, ?it/s]

['/Users/tanush/Programming/Projects/Shascam/scraper/splits/lenny-does-not-remember-the-expiration-date-on-his-quadit-card/0_lenny-does-not-remember-the-expiration-date-on-his-quadit-card.wav', '/Users/tanush/Programming/Projects/Shascam/scraper/splits/lenny-does-not-remember-the-expiration-date-on-his-quadit-card/1_lenny-does-not-remember-the-expiration-date-on-his-quadit-card.wav', '/Users/tanush/Programming/Projects/Shascam/scraper/splits/lenny-does-not-remember-the-expiration-date-on-his-quadit-card/2_lenny-does-not-remember-the-expiration-date-on-his-quadit-card.wav', '/Users/tanush/Programming/Projects/Shascam/scraper/splits/lenny-does-not-remember-the-expiration-date-on-his-quadit-card/3_lenny-does-not-remember-the-expiration-date-on-his-quadit-card.wav', '/Users/tanush/Programming/Projects/Shascam/scraper/splits/lenny-does-not-remember-the-expiration-date-on-his-quadit-card/4_lenny-does-not-remember-the-expiration-date-on-his-quadit-card.wav', '/Users/tanush/Programming/Project

  1%|          | 1/149 [00:18<45:42, 18.53s/it]

[" hello this is Lenny I believe you're responding to get a lower rate on your credit card right sorry I believe you are responding to get a lower interest rate on your credit card right and do you have any ideas", ' my name is Kevin eldest Larissa she was driving at doing', ' how do you say do you remember the expiration date remember what is the expiration date card', ' hello', '', '', '', '']
['/Users/tanush/Programming/Projects/Shascam/scraper/splits/tech-support-telemarketer-gets-frustrated-when-lenny-wont-switch-on-his-computer/0_tech-support-telemarketer-gets-frustrated-when-lenny-wont-switch-on-his-computer.wav', '/Users/tanush/Programming/Projects/Shascam/scraper/splits/tech-support-telemarketer-gets-frustrated-when-lenny-wont-switch-on-his-computer/1_tech-support-telemarketer-gets-frustrated-when-lenny-wont-switch-on-his-computer.wav', '/Users/tanush/Programming/Projects/Shascam/scraper/splits/tech-support-telemarketer-gets-frustrated-when-lenny-wont-switch-on-his-computer/2_

  1%|▏         | 2/149 [00:26<30:05, 12.28s/it]

[' hello this is Lenny', ' yes so yeah so could you please go ahead and switch on the computer', " yes of course your computer is infected do some infections okay which are coming from the internet so that is why I'm telling you go ahead and turn on the computer so that I will tell you where you can find out the infection and how can you remove it that's also", '', '', '', '', '']
['/Users/tanush/Programming/Projects/Shascam/scraper/splits/mae-garcia-from-american-energy-efficient-offers-lenny-a-lot-of-benefits/0_mae-garcia-from-american-energy-efficient-offers-lenny-a-lot-of-benefits.wav', '/Users/tanush/Programming/Projects/Shascam/scraper/splits/mae-garcia-from-american-energy-efficient-offers-lenny-a-lot-of-benefits/1_mae-garcia-from-american-energy-efficient-offers-lenny-a-lot-of-benefits.wav', '/Users/tanush/Programming/Projects/Shascam/scraper/splits/mae-garcia-from-american-energy-efficient-offers-lenny-a-lot-of-benefits/2_mae-garcia-from-american-energy-efficient-offers-lenny-

  1%|▏         | 2/149 [00:35<43:15, 17.66s/it]


KeyboardInterrupt: 

### Saving the recognized text
> - After performing the speech detection, we save the resulting text in a file which is in the form of **`Transcription_Video_Name.txt`**

In [None]:
result = ""
for text in frames:
    result += " " + text

os.chdir("../")
text_file = open("Transcription_"+ file_name[:-4] +".txt", "w")
text_file.write(result)
text_file.close()

In [8]:
print(frames)

NameError: name 'frames' is not defined

### Translating the recognized text (`dest='fr'`)
> - In addition, we have tried here to translate the text into French after transcription, using the Google API (**`googletrans`**)
> - To correctly use this API, we must replace the dest variable with the output language keyword ( In our case, **`dest='fr'`**)

In [None]:
translator = Translator()

translate_text = ""
for text in frames:
    translate_text += " " + translator.translate(text, dest='fr').text
print(translate_text)

 préparez-vous à 19 ans à partir de l'espace des mauvaises idées c'est des lecteurs complètement incompris comme quel genre de journal le matin mais ensuite pensez oh attendez non vélos que vous pourriez offrir un t3micro complet par mouvement et que vous puissiez coller sur n'importe quelle surface que vous vouliez matériau caoutchouté spécial supplémentaire à l'arrière ils viennent avec l'impression que ces êtres humains hautement optimisés par des tempes de lunettes encombrantes et encombrantes sont tout le monde rêve regardez mes vidéos dès le début, vous savez peut-être que j'ai construit cela en filmant sur une table Ikea pour la simple raison qu'ils ont un modèle commercial, c'est que quel problème était l'Ikea ​​a oublié le simple fait que les sèche-cheveux soufflent de l'air chaud et que le plastique fond voir où je veux en venir ne dites pas que l'attaque n'était pas intelligente, essentiellement la divulgation a un haut-parleur à ultrasons à l'intérieur de laquelle Soundwave

### Saving the translated text
> - We save the resulting translated text in a file which is in the form of **`Transcription_translated_Video_Name.txt`**

In [None]:
text_file = open("Transcription_translated_"+ file_name[:-4] +".txt", "w")
text_file.write(translate_text)
text_file.close()

### General Idea summarization
> - Finally, we can use the text we have recovered to have a summary of the general idea discussed in the video
> - Here it is necessary to specify the **`max_length`** and the **`min_length`**, by default we have chosen that the length of the general idea of a text must be at least 10% of the total length of the text.

In [None]:
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

In [None]:
ARTICLE = result
summary_text = summarizer(ARTICLE, max_length=100, min_length=int(len(result.split(" "))/10), do_sample=False)[0]["summary_text"]
print(summary_text)

I built this by filming on an Ikea table for simple reason they have a business model is that what issue was the Ikea forgot the simple fact that hair dryers blow out hot air and the plastic melts. Microsoft correctly predicted the gaming with a huge and growing markets. Mike mattock fails that videotape biggest text that you've ever seen your entire life.


> - 🙌 Notebook made by [@labriji_saad](https://github.com/labrijisaad)
> - 🔗 Linledin [@labriji_saad](https://www.linkedin.com/in/labrijisaad/)