# **YouTube Video-to-Text Summary for Educational Centers and Project Research**

# **By: Dev Chakraborty**

**"Installation of various NLP, ML, and related libraries"**

In [None]:
!pip install nltk
!pip install spacy
!pip install tensorflow
!pip install torch torchvision
!pip show spacy
!pip install transformers
!python -m spacy download en_core_web_sm
!pip install pytube
!pip install sentence-transformers
!pip install youtube_transcript_api
!pip install google-cloud-speech
!pip install ncluster
!pip install gensim

Name: spacy
Version: 3.7.4
Summary: Industrial-strength Natural Language Processing (NLP) in Python
Home-page: https://spacy.io
Author: Explosion
Author-email: contact@explosion.ai
License: MIT
Location: /usr/local/lib/python3.10/dist-packages
Requires: catalogue, cymem, jinja2, langcodes, murmurhash, numpy, packaging, preshed, pydantic, requests, setuptools, smart-open, spacy-legacy, spacy-loggers, srsly, thinc, tqdm, typer, wasabi, weasel
Required-by: en-core-web-sm, fastai
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 27.7 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to res

In [None]:
import nltk

**"NLTK Resource Download: Tokenization and POS Tagging"**

In [None]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

**"Testing the tokenization with NLTK: Breaking Down Text into Words"**

In [None]:
sample_text = "NLTK is a leading platform for building Python programs to work with human language data."
tokens = nltk.word_tokenize(sample_text)
print(tokens)

['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.']


**"Testing the text Processing with SpaCy: Extracting Tokens from Text"**

In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(sample_text)
tokens = [token.text for token in doc]
print(tokens)

['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.']


In [None]:
import json
from pytube import YouTube
import nltk
from nltk.tokenize import sent_tokenize, PunktSentenceTokenizer
import spacy
from transformers import BartTokenizer, BartForConditionalGeneration
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from youtube_transcript_api import YouTubeTranscriptApi
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
from transformers import pipeline


# Download the NLTK data needed for sentence tokenization
nltk.download('punkt')

# Load English language model for SpaCy
nlp = spacy.load('en_core_web_sm')

# Load BART tokenizer and model for abstractive summarization
bart_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
bart_model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

# Load a pre-trained sentence embedding model for extractive summarization
embedder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Function to fetch YouTube video transcript
def get_youtube_subtitle(video_url, language='en'):
    try:
        video_id = video_url.split('v=')[-1].split('&')[0]  # Drop trailing URL parameters such as &t=30s
        subtitles = YouTubeTranscriptApi.get_transcript(video_id, languages=[language])
        return ' '.join(segment['text'] for segment in subtitles) if subtitles else None
    except Exception as e:
        print(f"Error fetching subtitles: {e}")
        return None

# Function to clean text using SpaCy
def clean_text(text):
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc if not token.is_stop and not token.is_punct)

# Function for abstractive summarization using BART
def abstractive_summarization(transcript):
    inputs = bart_tokenizer([transcript], max_length=1024, return_tensors='pt', truncation=True)
    summary_ids = bart_model.generate(inputs.input_ids, num_beams=4, min_length=30, max_length=200, early_stopping=True)
    return bart_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

def long_form_summarization(transcript):
    inputs = bart_tokenizer([transcript], max_length=1024, return_tensors='pt', truncation=True)
    summary_ids = bart_model.generate(inputs.input_ids, num_beams=4, min_length=300, max_length=600, early_stopping=True)
    return bart_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Function for clustering-based summarization
def clustering_based_summarization(transcript):
    sentences = sent_tokenize(transcript)
    embeddings = embedder.encode(sentences)
    # Ensure there is at least one cluster
    n_clusters = max(1, min(10, len(sentences) // 5))
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(embeddings)
    avg = []
    for j in range(n_clusters):
        idx = np.where(kmeans.labels_ == j)[0]
        avg.append(np.mean(idx))
    closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, embeddings)
    ordered = sorted(range(n_clusters), key=lambda k: avg[k])
    summarized = ' '.join(sentences[closest[k]] for k in ordered)  # Sentences keep their own end punctuation
    return summarized

# Function for extractive summarization using TF-IDF
def extractive_summarization(transcript):
    sentences = sent_tokenize(transcript)
    cleaned_sentences = [clean_text(sentence) for sentence in sentences]
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(cleaned_sentences)
    similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
    sentence_scores = similarity_matrix.sum(axis=1)
    summary_sentences_indices = sentence_scores.argsort()[-3:][::-1]
    return ' '.join(sentences[i] for i in summary_sentences_indices)

# Main summarization function
def generate_summary(video_url):
    subtitle = get_youtube_subtitle(video_url)
    if subtitle:
        long_summary = long_form_summarization(subtitle)
        clustering_summary = clustering_based_summarization(subtitle)

        print("Long-form Summary (50 lines):")
        print("-" * 50)
        print("\n".join(long_summary.split('.')[:50]))  # Print only the first 50 sentences split by periods

        print("\nClustering-based Summary:")
        print("-" * 50)
        print(clustering_summary.replace('. ', '.\n'))  # Put each sentence on its own line for readability

        print("\nAbstractive Summary:")
        print("-" * 50)
        print(abstractive_summarization(subtitle).replace('. ', '.\n'))  # Put each sentence on its own line for readability

        print("\nExtractive Summary:")
        print("-" * 50)
        print(extractive_summarization(subtitle).replace('. ', '.\n'))  # Put each sentence on its own line for readability
    else:
        print("No subtitles found for the video.")

# User interaction
def main():
    video_url = input("Enter the URL of the YouTube video: ")
    generate_summary(video_url)

if __name__ == "__main__":
    main()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Long-form Summary (50 lines):
--------------------------------------------------
When you buy a home in the US there is a basic promise that home values will go up over time
 Median wages have not kept pace with the increase in the cost of housing
 Nearly half of federal student loan borrowers don't know how much they owe or who they owe it to
 Only one in three students understand the financial terms of their loans and nearly half of students are confused about how to pay off their student loans
 We're going to talk about why it's become so hard for lower income and middle inome people to be come and remain middle class in this week's episode of The Story Goes Like This
 We'll also talk about how financial literacy and student debt management can help you manage your debt and get out of debt in the first place
 The Story goes Like This airs on Sundays at 10pm ET on CNN and 1am ET on Monday and Tuesday on CNN Deportes and CNN Living
 For more information on The Story Go Like This, visi
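
Both BART functions above call the tokenizer with `truncation=True` and `max_length=1024`, so for a long video only the first 1,024 tokens of the transcript reach the model. One possible workaround is to summarize the transcript in chunks and join the partial summaries. The following is a minimal sketch, not part of the original notebook: the names `chunk_sentences`, `summarize_in_chunks`, and `max_words` are hypothetical, and the word count is only a rough stand-in for the model's token count.

```python
from typing import Callable, Iterable, List


def chunk_sentences(sentences: Iterable[str], max_words: int = 600) -> List[str]:
    """Group consecutive sentences into chunks of roughly max_words words.

    Assumption: words are a rough proxy for BART tokens, so max_words
    should stay well below the 1,024-token input limit.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_words = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if this sentence would exceed the budget.
        if current and current_words + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_words = [], 0
        # A single sentence longer than max_words still becomes its own chunk.
        current.append(sentence)
        current_words += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize_in_chunks(
    sentences: Iterable[str],
    summarize_fn: Callable[[str], str],
    max_words: int = 600,
) -> str:
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    return " ".join(summarize_fn(chunk) for chunk in chunk_sentences(sentences, max_words))
```

With the functions defined in the cell above, a call could look like `summarize_in_chunks(sent_tokenize(subtitle), abstractive_summarization)`. Each chunk triggers a separate `generate` call, so processing time grows with the length of the transcript.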

In [1]:
# Project research references:

# https://github.com/cybertronai/ncluster

# https://www.datastax.com/guides/what-is-cosine-similarity

# https://www.analyticsvidhya.com/blog/2022/01/youtube-summariser-mini-nlp-project/

# https://huggingface.co/facebook/bart-large-cnn

# https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2

# https://www.sbert.net/docs/pretrained_models.html

# https://www.simplilearn.com/tutorials/machine-learning-tutorial/k-means-clustering-algorithm#:~:text=K%2DMeans%20clustering%20is%20an,'K'%20is%20a%20number.

# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html

**"Status and Maintenance for IT Staff"**

1. Use appropriate tools and techniques: A cloud-based system was a much better match for large language model (LLM) processing because of its scalability options.
2. Update and review regularly: The versions of the installed NLP, ML, and related libraries have to be updated from time to time. Google Colab was able to fetch packages very easily, whereas with Jupyter Notebook most libraries had to be installed manually through the terminal.
3. Train and support users: The end user should only be responsible for entering the link and pressing Enter to process it.
4. Seek feedback and improvement: As time goes by, new versions of LLMs will come to market. The IT staff should look to replace the current models if:
   1. there are no compatibility issues;
   2. processing cost or time is reduced;
   3. the changes are very easy to implement on the current LLM, ML, and library stack.
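
To support the regular review described in point 2, IT staff could record the installed library versions before and after each update. The following is a minimal sketch using only the Python standard library. The function name `installed_versions` and the `PACKAGES` list are hypothetical; the list is assembled here from the install cell at the top of this notebook and should be adjusted to the actual deployment.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed package list, taken from the install cell of this notebook.
PACKAGES = [
    "nltk",
    "spacy",
    "torch",
    "transformers",
    "sentence-transformers",
    "youtube-transcript-api",
    "scikit-learn",
]


def installed_versions(packages):
    """Return a dict mapping each package name to its installed version."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # The package is missing from the current environment.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name}: {ver}")
```

Saving this report alongside each notebook run makes it easier to trace a compatibility issue back to a specific library update.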