<a href="https://colab.research.google.com/github/girishsenthil/NLP/blob/main/PegasusForYouTubeVideoSummarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning Pegasus Summarization Model for YouTube Movie Summary Channels

Why watch a movie when you can read a generated abstractive summary of a video summarizing a movie??

I had a lot of fun learning to work with the YouTube API, processing the data, and especially exploring the Pegasus model. Unfortunately, there is a limit to how often one can request video transcripts from the server, so I have to wait before I can actually train the model.

Despite this setback I hope you enjoy the project! The model will be trained as soon as the data is available.

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
!pip install youtube-transcript-api

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:
import json
import urllib
from urllib import request, parse

In [5]:
import torch
import transformers
from transformers import PegasusTokenizer, PegasusForConditionalGeneration, Trainer, TrainingArguments

In [6]:
from youtube_transcript_api import YouTubeTranscriptApi

In [7]:
import pandas as pd, numpy as np
import re

In [8]:
config = transformers.PegasusConfig

In [9]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [10]:
model = PegasusForConditionalGeneration.from_pretrained('google/pegasus-large', max_position_embeddings = 2048).to(device)

In [11]:
tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-large')

## Retrieving Text Data from Movie Recaps Channel

For this niche task, the inputs (video transcripts) and labels (video descriptions) follow a very consistent format across all of the channel's videos.

Using a YouTube API key created through the Google Cloud Platform, I will query the channel's videos in JSON format and create a dataframe containing each video's title, description, and cleaned transcript.
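
As a rough sketch of how such a query can be assembled, the helper below builds a `playlistItems` request URL from its parameters instead of concatenating strings by hand. The name `build_playlist_url` and the placeholder key `YOUR_API_KEY` are my own illustration, not part of the notebook; only the endpoint and the parameter names come from the requests used here.

```python
from urllib import parse

API_URL = 'https://www.googleapis.com/youtube/v3/playlistItems'

def build_playlist_url(playlist_id, api_key, max_results=5, page_token=None):
  # Hypothetical helper: assemble the query string for one page of results
  query = {'part': 'snippet,contentDetails',
           'maxResults': max_results,
           'playlistId': playlist_id,
           'key': api_key}
  # pageToken is only needed from the second page onward
  if page_token is not None:
    query['pageToken'] = page_token
  # safe=',' keeps the comma in the part list unescaped
  return API_URL + '?' + parse.urlencode(query, safe=',')

url = build_playlist_url('UUyXD1jAZBdZ4u0K-GLYC77Q', 'YOUR_API_KEY')
print(url)
```

Keeping the parameters in a dictionary makes it harder to drop an `&` or append the same parameter twice when paging through results.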

In [12]:
api_key = 'AIzaSyA4PUdHD4RtFQmrDLF5ePt27TsST4_DL4g'
playlist_id = 'UUyXD1jAZBdZ4u0K-GLYC77Q'

### Investigating the JSON outputs

In [13]:
# Reuse the variables defined above instead of repeating the key and playlist ID
url = f'https://www.googleapis.com/youtube/v3/playlistItems?part=snippet,contentDetails&maxResults=5&playlistId={playlist_id}&key={api_key}'
with request.urlopen(url) as response:
  data = json.loads(response.read().decode())
  print(data)


{'kind': 'youtube#playlistItemListResponse', 'etag': 'Zbf0eueff16WZ_7SU9Is-lG5YPA', 'nextPageToken': 'EAAaBlBUOkNBVQ', 'items': [{'kind': 'youtube#playlistItem', 'etag': 'yl-Hx2Bkz2uvCdwsnTJi5QeFR40', 'id': 'VVV5WEQxakFaQmRaNHUwSy1HTFlDNzdRLi0tbVVPRDlUb2s0', 'snippet': {'publishedAt': '2022-07-16T18:42:42Z', 'channelId': 'UCyXD1jAZBdZ4u0K-GLYC77Q', 'title': 'After 27 Years in Prison, He Became President and Changed The Whole Country', 'description': 'The true story of how the president of South Africa helped the rugby team win the world cup as inspiration to bring people together through the universal enjoyment of sports.\n\n\n\n\n\n\nSubscribe to our friends channel: https://tinyurl.com/Movie-Recaps', 'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/--mUOD9Tok4/default.jpg', 'width': 120, 'height': 90}, 'medium': {'url': 'https://i.ytimg.com/vi/--mUOD9Tok4/mqdefault.jpg', 'width': 320, 'height': 180}, 'high': {'url': 'https://i.ytimg.com/vi/--mUOD9Tok4/hqdefault.jpg', 'width'

In [14]:
data.keys()

dict_keys(['kind', 'etag', 'nextPageToken', 'items', 'pageInfo'])

In [15]:
for i in range(5):
  print(data['items'][i]['snippet']['title'])

After 27 Years in Prison, He Became President and Changed The Whole Country
Fallen Soldier Wakes up on His Funeral and Learns he Has Become a Zombie
Hiker Finds a Stranded Man Wearing Shorts at The Top of a Snowy Mountain
Young Mother Accused of Killing Her Best Friend Must Find The True Killer to Save Herself
Grandpa Discovered a Lottery Loophole And Earned $27 Million in 9 years


In [16]:
cont = data['items'][0]

In [17]:
for keys in cont.keys():
  print(keys)
  print(cont[keys])
  print('*' * 20)

kind
youtube#playlistItem
********************
etag
yl-Hx2Bkz2uvCdwsnTJi5QeFR40
********************
id
VVV5WEQxakFaQmRaNHUwSy1HTFlDNzdRLi0tbVVPRDlUb2s0
********************
snippet
{'publishedAt': '2022-07-16T18:42:42Z', 'channelId': 'UCyXD1jAZBdZ4u0K-GLYC77Q', 'title': 'After 27 Years in Prison, He Became President and Changed The Whole Country', 'description': 'The true story of how the president of South Africa helped the rugby team win the world cup as inspiration to bring people together through the universal enjoyment of sports.\n\n\n\n\n\n\nSubscribe to our friends channel: https://tinyurl.com/Movie-Recaps', 'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/--mUOD9Tok4/default.jpg', 'width': 120, 'height': 90}, 'medium': {'url': 'https://i.ytimg.com/vi/--mUOD9Tok4/mqdefault.jpg', 'width': 320, 'height': 180}, 'high': {'url': 'https://i.ytimg.com/vi/--mUOD9Tok4/hqdefault.jpg', 'width': 480, 'height': 360}, 'standard': {'url': 'https://i.ytimg.com/vi/--mUOD9Tok4/sddefault

In [18]:
title = cont['snippet']['title']
description = cont['snippet']['description'].split('\n')[0]  # keep only the first line; the lines after it plug their friend's channel

In [19]:
print(f'Title: {title} \nDescription: {description}')

Title: After 27 Years in Prison, He Became President and Changed The Whole Country 
Description: The true story of how the president of South Africa helped the rugby team win the world cup as inspiration to bring people together through the universal enjoyment of sports.


### Functions to extract desired information and store in pd.DataFrame

In [20]:
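def clean(dirty_text):
  # Drop caption cues such as [Music], which contain square brackets
  text = [i['text'] for i in dirty_text if '[' not in i['text']]
  # Captions wrap mid-sentence, so turn line breaks into spaces
  text = [t.replace('\n', ' ') for t in text]
  clean_text = ' '.join(text)
  # Collapse every run of non-alphanumeric characters into a single space
  clean_text = re.sub('[^A-Za-z0-9]+', ' ', clean_text)

  return clean_text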
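
To make the effect of `clean` concrete, here is a small run on a made-up transcript. The sample entries are invented for illustration (real ones come from `YouTubeTranscriptApi`), and the function is repeated so the snippet runs on its own.

```python
import re

def clean(dirty_text):
  # Same logic as the cell above, repeated so this example is self-contained
  text = [i['text'] for i in dirty_text if i['text'].find('[') == -1]
  text = [t.replace('\n', ' ') for t in text]
  clean_text = ' '.join(text)
  return re.sub('[^A-Za-z0-9]+', ' ', clean_text)

# Invented transcript entries in the list-of-dicts shape the function expects
sample = [{'text': '[Music]'},
          {'text': "Hello there,\nit's a movie!"},
          {'text': 'The end.'}]

cleaned = clean(sample)
print(repr(cleaned))
```

The `[Music]` cue is dropped and all punctuation becomes spaces, so `it's` comes out as `it s` and the final period leaves a trailing space. That may be worth revisiting later, since a summarization model might benefit from keeping sentence punctuation.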

In [21]:
def playlist_to_dataframe(playlist_id, api_key, max_results):
  # Page through the playlist and collect ID, title, description, and transcript per video

  api_url = 'https://www.googleapis.com/youtube/v3/playlistItems?'
  query = {'part': 'snippet,contentDetails',
           'maxResults': max_results,
           'playlistId': playlist_id,
           'key': api_key}

  rows = []
  nextPageToken = None

  while True:

    # The first request has no page token; later requests pass it exactly once
    if nextPageToken is not None:
      query['pageToken'] = nextPageToken

    with request.urlopen(api_url + parse.urlencode(query, safe = ',')) as request_url:
      data = json.loads(request_url.read().decode())

    for content_dictionary in data['items']:

      videoID = content_dictionary['contentDetails']['videoId']
      title = content_dictionary['snippet']['title']
      description = content_dictionary['snippet']['description'].split('\n')[0]

      try:
        text = YouTubeTranscriptApi.get_transcript(videoID)
        text = clean(text)
      except Exception:
        # Transcript missing or request refused: keep the row, mark the text as missing
        text = np.nan

      # A list of dicts keeps np.nan as a real missing value instead of the string 'nan'
      rows.append({'videoID': videoID, 'title': title, 'description': description, 'text': text})

    # The last page has no nextPageToken
    nextPageToken = data.get('nextPageToken')
    if nextPageToken is None:
      break
    print('Accessing Next Page')

  df = pd.DataFrame(rows, columns = ['videoID', 'title', 'description', 'text'])

  return df


## Preparing Data

In [22]:
df = playlist_to_dataframe(playlist_id = playlist_id,
                             api_key = api_key,
                             max_results = 50)

Accessing Next Page
Accessing Next Page
Accessing Next Page
Accessing Next Page
Accessing Next Page


In [29]:
df

| Unnamed: 0 | videoID | title | description | text |
|---|---|---|---|---|
| 0 | --mUOD9Tok4 | After 27 Years in Prison, He Became President ... | The true story of how the president of South A... | |
| 1 | OTD436RwFuE | Fallen Soldier Wakes up on His Funeral and Lea... | A fallen soldier wakes up in his coffin and di... | |
| 2 | TK76DFJskPs | Hiker Finds a Stranded Man Wearing Shorts at T... | The true story of a search and rescue voluntee... | |
| 3 | jcpZJeDnr0o | Young Mother Accused of Killing Her Best Frien... | During a vacation overseas, a young woman must... | |
| 4 | puwkcC7P3rg | Grandpa Discovered a Lottery Loophole And Earn... | The real story of a retired married couple fro... | |
| ... | ... | ... | ... | ... |
| 274 | 0SE11VVrl5Q | A Group of People Are Trapped in an Elevator A... | Time is running out for the occupants of the e... | |
| 275 | fYkw4MgPR8A | Hybrid Children Are The Only Hope For The Huma... | A scientist and a teacher living in a dystopia... | |
| 276 | 5rCygdGq_AI | A Family Struggles For Survival in The Face of... | A family fights for survival as a planet-killi... | |
| 277 | 4_19GDyr8KA | Shady Legal Guardian Lands in Hot Water When S... | This is the story of Marla Grayson. Profession... | |


Update (7:51 AM, 7/18/2022): my IP has been blocked from making more transcript requests to the server. In the meantime, I will set up the training parameters to fine-tune the Pegasus model.
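
One way to be gentler on the transcript server next time is to retry failed requests with growing pauses rather than firing them back to back. The sketch below is only an illustration: `fetch_with_backoff`, its parameters, and the fake fetcher are hypothetical names of mine, the delays are arbitrary, and I have no evidence that this would avoid a block.

```python
import time

def fetch_with_backoff(fetch, video_id, retries=3, base_delay=1.0, sleep=time.sleep):
  # fetch is any callable that returns a transcript or raises on failure
  for attempt in range(retries):
    try:
      return fetch(video_id)
    except Exception:
      # Give up after the last allowed attempt
      if attempt == retries - 1:
        raise
      # Wait 1x, 2x, 4x ... the base delay before trying again
      sleep(base_delay * 2 ** attempt)

# Demo with a fake fetcher that fails twice, then succeeds
calls = []
def flaky_fetch(video_id):
  calls.append(video_id)
  if len(calls) < 3:
    raise RuntimeError('blocked')
  return [{'text': 'hello'}]

waits = []  # record the pauses instead of actually sleeping
result = fetch_with_backoff(flaky_fetch, 'abc123', sleep=waits.append)
print(result, waits)
```

In the notebook, `fetch` could be `YouTubeTranscriptApi.get_transcript`, with the default `time.sleep` doing the waiting.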

In [30]:
inputs, labels = df['text'], df['description']

In [43]:
def tokenize(inputs, labels, tokenizer, max_length = 2048):
  # The tokenizer expects lists of strings, so convert the pandas Series first
  tok_inputs = tokenizer(list(inputs), max_length = max_length, truncation = True, padding = True, return_tensors = 'pt')
  tok_labels = tokenizer(list(labels), max_length = max_length, truncation = True, padding = True, return_tensors = 'pt')
  return tok_inputs, tok_labels

In [33]:
#Use torch.utils.data.Dataset
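
A minimal sketch of what that dataset wrapper might look like, assuming the tokenized inputs and labels are dict-like objects with an `input_ids` entry. The class name `SummaryDataset` and the toy token IDs are placeholders of mine; the real data would come from `tokenize` once the transcripts are available.

```python
import torch

class SummaryDataset(torch.utils.data.Dataset):
  # Hypothetical wrapper: pairs tokenized transcripts with tokenized descriptions
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels

  def __len__(self):
    return len(self.labels['input_ids'])

  def __getitem__(self, idx):
    # One example: every input field plus the label token IDs
    item = {key: torch.as_tensor(val[idx]) for key, val in self.encodings.items()}
    item['labels'] = torch.as_tensor(self.labels['input_ids'][idx])
    return item

# Toy token IDs standing in for real tokenizer output
enc = {'input_ids': [[5, 6, 1], [7, 8, 1]], 'attention_mask': [[1, 1, 1], [1, 1, 1]]}
lab = {'input_ids': [[9, 1], [10, 1]]}
train_dataset = SummaryDataset(enc, lab)
print(len(train_dataset), train_dataset[0])
```

Each item is a dictionary with `input_ids`, `attention_mask`, and `labels`, which is the shape `Trainer` usually expects for a sequence-to-sequence model.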

## Pegasus

In [46]:
#reference https://gist.github.com/jiahao87/50cec29725824da7ff6dd9314b53c4b3

In [None]:
#To Init: Train_datasets, Test_datasets

In [34]:
output_dir = '/content/fine_tune'

In [35]:
training_args = TrainingArguments(output_dir = output_dir,
                                  num_train_epochs = 5,
                                  save_steps = 500,
                                  save_total_limit = 3,
                                  warmup_steps = 100,
                                  weight_decay = 1e-2,
                                  )

In [136]:
trainer = Trainer(model = model, args = training_args, train_dataset = train_dataset,
                  tokenizer = tokenizer)
