<a href="https://colab.research.google.com/github/girishsenthil/NLP/blob/main/PegasusForYouTubeVideoSummarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning Pegasus Summarization Model for YouTube Movie Summary Channels

Why watch a movie when you can read a generated abstractive summary of a video summarizing a movie?

I had a lot of fun learning to work with the APIs, process the data, and especially use the Pegasus model. Unfortunately, the transcript server limits how often requests can be made, so there is a waiting period before I can actually train the model.

Despite this setback I hope you enjoy the project! The model will be trained as soon as the data is available.

## Imports

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling 

In [2]:
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.96


In [3]:
!pip install youtube-transcript-api

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting youtube-transcript-api
  Downloading youtube_transcript_api-0.4.4-py3-none-any.whl (22 kB)
Installing collected packages: youtube-transcript-api
Successfully installed youtube-transcript-api-0.4.4


In [4]:
import json
import urllib
from urllib import request, parse

In [5]:
import torch
import transformers
from transformers import PegasusTokenizer, PegasusForConditionalGeneration, Trainer, TrainingArguments

In [6]:
from youtube_transcript_api import YouTubeTranscriptApi

In [47]:
import pandas as pd, numpy as np
import re
from sklearn.model_selection import train_test_split

In [8]:
config = transformers.PegasusConfig

In [9]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [10]:
model = PegasusForConditionalGeneration.from_pretrained('google/pegasus-large', max_position_embeddings = 2048).to(device)


In [11]:
tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-large')


## Retrieving Text Data from Movie Recaps Channel

For this niche task, the inputs (video transcripts) and labels (video descriptions) follow a very consistent format across all of the channel's videos.

Using a YouTube Data API key created through the Google Cloud Platform, I will query the channel's videos as JSON and build a dataframe containing each video's ID, title, description, and cleaned transcript.
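As a minimal sketch of how such a request URL can be assembled, the snippet below builds a `playlistItems` query with `urllib.parse.urlencode`. The helper name `build_playlist_url` and the placeholder IDs are assumptions made for illustration; they are not part of the notebook.

```python
from urllib import parse

API_ENDPOINT = 'https://www.googleapis.com/youtube/v3/playlistItems'

def build_playlist_url(playlist_id, api_key, max_results=50, page_token=None):
    """Return a playlistItems request URL (hypothetical helper for illustration)."""
    params = {
        'part': 'snippet,contentDetails',
        'maxResults': max_results,
        'playlistId': playlist_id,
        'key': api_key,
    }
    # pageToken is only sent when a previous response supplied one
    if page_token is not None:
        params['pageToken'] = page_token
    # urlencode percent-escapes values, so the comma in `part` becomes %2C
    return API_ENDPOINT + '?' + parse.urlencode(params)

# Placeholder values, not real credentials
example_url = build_playlist_url('PLAYLIST_ID', 'API_KEY', max_results=5)
print(example_url)
```

Building the query from a dictionary keeps each parameter in one place and avoids mistakes when a page token has to be appended later.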

In [12]:
api_key = 'YOUR_API_KEY'  # placeholder: substitute your own YouTube Data API key and keep it out of published notebooks
playlist_id = 'UUyXD1jAZBdZ4u0K-GLYC77Q'

### Investigating the JSON outputs

In [13]:
# Build the request from the variables defined above instead of hard-coding them in the URL
sample_url = ('https://www.googleapis.com/youtube/v3/playlistItems'
              f'?part=snippet,contentDetails&maxResults=5&playlistId={playlist_id}&key={api_key}')
with request.urlopen(sample_url) as url:
  data = json.loads(url.read().decode())
  print(data)


{'kind': 'youtube#playlistItemListResponse', 'etag': 'WoZLPkPvKx2Qh4FpweR9Pmy-ang', 'nextPageToken': 'EAAaBlBUOkNBVQ', 'items': [{'kind': 'youtube#playlistItem', 'etag': 'mSoeAYTVrd0ArB4YsXnxZQsp4Z8', 'id': 'VVV5WEQxakFaQmRaNHUwSy1HTFlDNzdRLktybmRxSkVQbjZr', 'snippet': {'publishedAt': '2022-07-18T20:07:42Z', 'channelId': 'UCyXD1jAZBdZ4u0K-GLYC77Q', 'title': 'The Boy Has 9 Lives But His Mother Kills Him Every Year on His Birthday', 'description': 'An accident-prone boy falls into a coma, triggering a series of investigations that reveal a supernatural factor connecting him to his doctor.\n\n\n\n\n\n\nSubscribe to our friends channel: https://tinyurl.com/Movie-Recaps', 'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/KrndqJEPn6k/default.jpg', 'width': 120, 'height': 90}, 'medium': {'url': 'https://i.ytimg.com/vi/KrndqJEPn6k/mqdefault.jpg', 'width': 320, 'height': 180}, 'high': {'url': 'https://i.ytimg.com/vi/KrndqJEPn6k/hqdefault.jpg', 'width': 480, 'height': 360}, 'standard': {

In [14]:
data.keys()

dict_keys(['kind', 'etag', 'nextPageToken', 'items', 'pageInfo'])

In [15]:
for i in range(5):
  print(data['items'][i]['snippet']['title'])

The Boy Has 9 Lives But His Mother Kills Him Every Year on His Birthday
After 27 Years in Prison, He Became President and Changed The Whole Country
Fallen Soldier Wakes up on His Funeral and Learns he Has Become a Zombie
Hiker Finds a Stranded Man Wearing Shorts at The Top of a Snowy Mountain
Young Mother Accused of Killing Her Best Friend Must Find The True Killer to Save Herself


In [22]:
cont = data['items'][0]

In [23]:
for keys in cont.keys():
  print(keys)
  print(cont[keys])
  print('*' * 20)

kind
youtube#playlistItem
********************
etag
mSoeAYTVrd0ArB4YsXnxZQsp4Z8
********************
id
VVV5WEQxakFaQmRaNHUwSy1HTFlDNzdRLktybmRxSkVQbjZr
********************
snippet
{'publishedAt': '2022-07-18T20:07:42Z', 'channelId': 'UCyXD1jAZBdZ4u0K-GLYC77Q', 'title': 'The Boy Has 9 Lives But His Mother Kills Him Every Year on His Birthday', 'description': 'An accident-prone boy falls into a coma, triggering a series of investigations that reveal a supernatural factor connecting him to his doctor.\n\n\n\n\n\n\nSubscribe to our friends channel: https://tinyurl.com/Movie-Recaps', 'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/KrndqJEPn6k/default.jpg', 'width': 120, 'height': 90}, 'medium': {'url': 'https://i.ytimg.com/vi/KrndqJEPn6k/mqdefault.jpg', 'width': 320, 'height': 180}, 'high': {'url': 'https://i.ytimg.com/vi/KrndqJEPn6k/hqdefault.jpg', 'width': 480, 'height': 360}, 'standard': {'url': 'https://i.ytimg.com/vi/KrndqJEPn6k/sddefault.jpg', 'width': 640, 'height': 480},

The response is a dictionary of nested dictionaries, but the values needed here (title and description) are straightforward to access.
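For responses where a key might be missing, a small lookup helper can make the nested access more forgiving. The sketch below is only an illustration: `dig` is a hypothetical helper, and `item` is a trimmed stand-in that reuses a few of the values printed above.

```python
def dig(mapping, *keys, default=None):
    """Follow `keys` through nested dicts, returning `default` if any key is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# Trimmed stand-in for one element of data['items']
item = {
    'kind': 'youtube#playlistItem',
    'snippet': {
        'title': 'The Boy Has 9 Lives But His Mother Kills Him Every Year on His Birthday',
        'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/KrndqJEPn6k/default.jpg'}},
    },
}

title = dig(item, 'snippet', 'title')
thumb = dig(item, 'snippet', 'thumbnails', 'default', 'url')
# 'contentDetails' is absent from this stand-in, so the default is returned
video_id = dig(item, 'contentDetails', 'videoId', default='unknown')
print(title, thumb, video_id, sep='\n')
```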

In [18]:
title = cont['snippet']['title']
description = cont['snippet']['description'].split('\n')[0]  # keep only the first line; the rest is blank lines and a plug for a friend's channel

In [19]:
print(f'Title: {title} \nDescription: {description}')

Title: The Boy Has 9 Lives But His Mother Kills Him Every Year on His Birthday 
Description: An accident-prone boy falls into a coma, triggering a series of investigations that reveal a supernatural factor connecting him to his doctor.


### Functions to extract desired information and store in pd.DataFrame

In [20]:
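def clean(dirty_text):
  # Drop caption segments that contain '[' (bracketed cues rather than narration)
  text = [i['text'] for i in dirty_text if '[' not in i['text']]
  # Replace line breaks inside each segment with spaces
  text = [t.replace('\n', ' ') for t in text]
  clean_text = ' '.join(text)
  # Replace every run of non-alphanumeric characters (punctuation included) with one space
  clean_text = re.sub('[^A-Za-z0-9]+', ' ', clean_text)

  return clean_text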
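To see what the cleaning step does, the sketch below runs the same logic on a few made-up transcript segments. The segments are hypothetical; they only imitate the list-of-dicts shape (`text`, `start`, `duration`) that the transcript API returns.

```python
import re

def clean(dirty_text):
    # Same logic as the notebook's clean(): drop segments containing '[', flatten newlines,
    # join, then replace runs of non-alphanumeric characters with a single space
    text = [seg['text'] for seg in dirty_text if '[' not in seg['text']]
    text = [t.replace('\n', ' ') for t in text]
    return re.sub('[^A-Za-z0-9]+', ' ', ' '.join(text))

# Hypothetical transcript segments
sample_segments = [
    {'text': "It's a lovely\nsummer day", 'start': 0.0, 'duration': 2.5},
    {'text': '[Music]', 'start': 2.5, 'duration': 1.0},
    {'text': 'in Croatia.', 'start': 3.5, 'duration': 1.5},
]

cleaned = clean(sample_segments)
print(repr(cleaned))  # 'It s a lovely summer day in Croatia '
```

Because apostrophes and periods are replaced by spaces, contractions split ("It s") and sentence boundaries are lost. That keeps the text uniform, but it may be worth revisiting if the model benefits from punctuation.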

In [21]:
def playlist_to_dataframe(playlist_id, api_key, max_results):
  # Page through a playlist and collect video ID, title, description, and cleaned transcript

  api_url = 'https://www.googleapis.com/youtube/v3/playlistItems?'
  param_url = f'part=snippet,contentDetails&maxResults={max_results}&playlistId={playlist_id}&'
  key_param = f'key={api_key}'

  nextPageToken = None
  # First row holds the column names; one row per video is stacked underneath
  desired = np.array(['videoID', 'title', 'description', 'text'])

  while True:

    # Only add the pageToken parameter once the API has returned one
    pageToken = '' if nextPageToken is None else f'&pageToken={nextPageToken}'

    # The page token must appear in the URL exactly once
    concat_url = api_url + param_url + key_param + pageToken

    with request.urlopen(concat_url) as request_url:
      data = json.loads(request_url.read().decode())

    for content_dictionary in data['items']:

      videoID = content_dictionary['contentDetails']['videoId']
      title = content_dictionary['snippet']['title']
      # Keep only the first line of the description
      description = content_dictionary['snippet']['description'].split('\n')[0]

      try:
        text = clean(YouTubeTranscriptApi.get_transcript(videoID))
      except Exception:
        # Transcript unavailable or request refused; the string array stores this as 'nan'
        text = np.nan

      desired = np.vstack((desired, np.array([videoID, title, description, text])))

    # Stop when the response has no further page
    nextPageToken = data.get('nextPageToken')
    if nextPageToken is None:
      break
    print('Accessing Next Page')

  df = pd.DataFrame(data = desired[1:], columns = desired[0])

  return df
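The pagination pattern used above (request a page, read `nextPageToken`, repeat until it is absent) can be isolated into a small generator. This is a sketch under stated assumptions: `iter_items` and `fetch_page` are hypothetical names, and `FAKE_PAGES` stands in for real HTTP responses so the example runs offline.

```python
def iter_items(fetch_page):
    """Yield every item from a paginated API, following nextPageToken until it is absent.

    `fetch_page(token)` is a caller-supplied function (an assumption of this sketch)
    that returns one decoded JSON response as a dict.
    """
    token = None
    while True:
        page = fetch_page(token)
        yield from page['items']
        token = page.get('nextPageToken')
        if token is None:
            break

# Stand-in for the real HTTP call: three fake pages keyed by page token
FAKE_PAGES = {
    None: {'items': ['video-1', 'video-2'], 'nextPageToken': 'A'},
    'A': {'items': ['video-3', 'video-4'], 'nextPageToken': 'B'},
    'B': {'items': ['video-5']},  # last page: no nextPageToken
}

all_items = list(iter_items(lambda token: FAKE_PAGES[token]))
print(all_items)
```

Separating pagination from per-video processing would also make it easier to retry only the transcript requests that failed.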


## Preparing Data

In [24]:
df = playlist_to_dataframe(playlist_id = playlist_id,
                             api_key = api_key,
                             max_results = 50)

Accessing Next Page
Accessing Next Page
Accessing Next Page
Accessing Next Page
Accessing Next Page


In [25]:
df

| | videoID | title | description | text |
|---|---|---|---|---|
| 0 | KrndqJEPn6k | The Boy Has 9 Lives But His Mother Kills Him E... | An accident-prone boy falls into a coma, trigg... | Since he was born Louis Drax has been in hundr... |
| 1 | --mUOD9Tok4 | After 27 Years in Prison, He Became President ... | The true story of how the president of South A... | In 1990 Nelson Mandela is finally freed from t... |
| 2 | OTD436RwFuE | Fallen Soldier Wakes up on His Funeral and Lea... | A fallen soldier wakes up in his coffin and di... | In Iraq a group of American soldiers is travel... |
| 3 | TK76DFJskPs | Hiker Finds a Stranded Man Wearing Shorts at T... | The true story of a search and rescue voluntee... | It is almost six a m in the morning and search... |
| 4 | jcpZJeDnr0o | Young Mother Accused of Killing Her Best Frien... | During a vacation overseas, a young woman must... | It s a lovely summer day in Croatia and Beth h... |
| ... | ... | ... | ... | ... |
| 275 | 0SE11VVrl5Q | A Group of People Are Trapped in an Elevator A... | Time is running out for the occupants of the e... | |
| 276 | fYkw4MgPR8A | Hybrid Children Are The Only Hope For The Huma... | A scientist and a teacher living in a dystopia... | |
| 277 | 5rCygdGq_AI | A Family Struggles For Survival in The Face of... | A family fights for survival as a planet-killi... | |
| 278 | 4_19GDyr8KA | Shady Legal Guardian Lands in Hot Water When S... | This is the story of Marla Grayson. Profession... | |


In [40]:
has_text = df.loc[df['text'] != 'nan'].reset_index(drop = True)

In [41]:
has_text

| | videoID | title | description | text |
|---|---|---|---|---|
| 0 | KrndqJEPn6k | The Boy Has 9 Lives But His Mother Kills Him E... | An accident-prone boy falls into a coma, trigg... | Since he was born Louis Drax has been in hundr... |
| 1 | --mUOD9Tok4 | After 27 Years in Prison, He Became President ... | The true story of how the president of South A... | In 1990 Nelson Mandela is finally freed from t... |
| 2 | OTD436RwFuE | Fallen Soldier Wakes up on His Funeral and Lea... | A fallen soldier wakes up in his coffin and di... | In Iraq a group of American soldiers is travel... |
| 3 | TK76DFJskPs | Hiker Finds a Stranded Man Wearing Shorts at T... | The true story of a search and rescue voluntee... | It is almost six a m in the morning and search... |
| 4 | jcpZJeDnr0o | Young Mother Accused of Killing Her Best Frien... | During a vacation overseas, a young woman must... | It s a lovely summer day in Croatia and Beth h... |
| ... | ... | ... | ... | ... |
| 210 | 8Z4fVj43JIM | A Damaged Spaceship Carrying Settlers to Mars ... | A Mars-bound spaceship gets knocked off course... | Welcome back to Movie Recaps Today I will show... |
| 211 | Q_xtMu6bqv8 | A Woman Vampire is Forced Into Action When Ter... | A woman with a Mysterious illness who is heade... | Welcome back to Movie Recaps Today I will show... |
| 212 | NryQxqPAn4Q | Five American Soldiers Encounter an Enemy More... | American soldiers are assigned to hold a Frenc... | Welcome back to Movie Recaps Today I will show... |
| 213 | 3pwJcaWqOu4 | A Soldier Wakes Up in Someone Else's Body and ... | An Army Captain becomes a part of an experimen... | Welcome back to Movie Recaps Today I will show... |


As can be seen, transcript retrieval stopped succeeding after 215 videos. This is most likely a rate limit on the transcript server, and the exact cutoff may depend on how much time passes between requests. I will collect the video IDs whose transcripts could not be retrieved and retry them once requests are available again, so that the dataset can be completed.
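One common way to cope with rate limits is to retry failed requests with increasing delays. The sketch below is a generic illustration, not the notebook's method: `retry_with_backoff` and `flaky_fetch` are hypothetical, the delays are arbitrary, and it is an assumption that the transcript server recovers on a timescale where retrying is practical.

```python
import time

def retry_with_backoff(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Stand-in for a rate-limited request: fails twice, then succeeds
calls = {'count': 0}

def flaky_fetch():
    calls['count'] += 1
    if calls['count'] < 3:
        raise RuntimeError('too many requests')
    return 'transcript text'

waits = []
# Record the waits instead of actually sleeping so the example runs instantly
result = retry_with_backoff(flaky_fetch, sleep=waits.append)
print(result, waits)
```

In the notebook, `func` could be something like `lambda: YouTubeTranscriptApi.get_transcript(video_id)`, applied only to the video IDs that are still missing a transcript.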

In [45]:
missing_text = df.loc[df['text'] == 'nan'].reset_index(drop = True)
missing_text

| | videoID | title | description | text |
|---|---|---|---|---|
| 0 | h1AojkhAxZc | Autistic Hotel Clerk Uses Cameras to Spy on a ... | A hotel clerk with Asperger's syndrome spies o... | |
| 1 | pNPXSOaflnU | Girl Takes Revenge For Her Death in a Strange Way | After fleeing the scene of an accident, a youn... | |
| 2 | seT46uZHLvg | He Gains The Ability To See Ghosts But They Ar... | After almost losing his life, a young man can ... | |
| 3 | AwtQeurKBi8 | Stuck on a Deserted Island in The Middle of a ... | A depressed man jumps into the river only to e... | |
| 4 | Wg66FJ3Us9s | World Where Our Memory Lasts Only For a Few Hours | A decade after a global pandemic, a group of s... | |
| ... | ... | ... | ... | ... |
| 60 | 0SE11VVrl5Q | A Group of People Are Trapped in an Elevator A... | Time is running out for the occupants of the e... | |
| 61 | fYkw4MgPR8A | Hybrid Children Are The Only Hope For The Huma... | A scientist and a teacher living in a dystopia... | |
| 62 | 5rCygdGq_AI | A Family Struggles For Survival in The Face of... | A family fights for survival as a planet-killi... | |
| 63 | 4_19GDyr8KA | Shady Legal Guardian Lands in Hot Water When S... | This is the story of Marla Grayson. Profession... | |


To avoid further trouble with API limits, both dataframes are saved as .csv files for future use.

In [44]:
has_text.to_csv('/content/has_text.csv')

In [46]:
missing_text.to_csv('/content/missing_text.csv')

## Creating training split for model

In [48]:
inputs, labels = has_text['text'], has_text['description']

In [49]:
X_train, X_test, y_train, y_test = train_test_split(inputs, labels, 
                                                    test_size = .05,
                                                    shuffle = True,
                                                    random_state = 48)

In [56]:
len(X_train[0])  # character count of the first transcript

16663

In [57]:
y_train[0]

'An accident-prone boy falls into a coma, triggering a series of investigations that reveal a supernatural factor connecting him to his doctor.'

In [60]:
tokenize(X_train[0], y_train[0], tokenizer = tokenizer)

({'input_ids': tensor([[ 1685,   178,   140,  ...,   120, 18532,     1]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])},
 {'input_ids': tensor([[  983,  2648,   121, 35475,  2955,  4786,   190,   114, 30098,   108,
          24171,   114,   679,   113,  9051,   120,  4494,   114, 18512,  2634,
           4815,   342,   112,   169,  2214,   107,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1]])})

In [58]:
def tokenize(inputs, labels, tokenizer, max_length = 2048):
  tok_inputs = tokenizer(inputs, max_length = max_length, truncation = True, 
                         padding = True, return_tensors = 'pt')
  tok_labels = tokenizer(labels, max_length = max_length, truncation = True, 
                         padding = True, return_tensors = 'pt')
  return tok_inputs, tok_labels

In [52]:
X_train_tok, y_train_tok = tokenize(X_train, y_train, 
                                    tokenizer = tokenizer)

ValueError: ignored
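The `ValueError` above is most likely raised because the tokenizer receives pandas Series objects, while Hugging Face tokenizers expect a `str` or a list of `str`. The sketch below converts the inputs to plain lists first. To keep it runnable without downloading a model, a toy `stub_tokenizer` stands in for `PegasusTokenizer`; both it and `tokenize_batch` are hypothetical names introduced for illustration.

```python
def tokenize_batch(inputs, labels, tokenizer, max_length=2048):
    """Convert inputs/labels to plain lists of str before calling the tokenizer."""
    input_list = [str(text) for text in inputs]
    label_list = [str(text) for text in labels]
    tok_inputs = tokenizer(input_list, max_length=max_length, truncation=True)
    tok_labels = tokenizer(label_list, max_length=max_length, truncation=True)
    return tok_inputs, tok_labels

def stub_tokenizer(texts, max_length, truncation):
    """Toy whitespace tokenizer that only accepts a list of str.

    This is a simplification of the real tokenizer's input check.
    """
    if not isinstance(texts, list) or not all(isinstance(t, str) for t in texts):
        raise ValueError('text input must be a str or a list of str')
    return [t.split()[:max_length] if truncation else t.split() for t in texts]

# Any iterable of strings works here, including a pandas Series such as X_train
tok_in, tok_lab = tokenize_batch(('a long transcript', 'another one'),
                                 ('short summary', 'second summary'),
                                 tokenizer=stub_tokenizer, max_length=2)
print(tok_in, tok_lab)
```

With the real tokenizer, the equivalent change would be to pass `X_train.tolist()` and `y_train.tolist()` to the existing `tokenize` function, keeping its `padding` and `return_tensors` arguments.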

## Pegasus

In [None]:
#reference https://gist.github.com/jiahao87/50cec29725824da7ff6dd9314b53c4b3

In [None]:
# TODO: build train_dataset and test_dataset before creating the Trainer

In [None]:
output_dir = '/content/fine_tune'

In [None]:
training_args = TrainingArguments(output_dir = output_dir,
                                  num_train_epochs = 5,
                                  save_steps = 500,
                                  save_total_limit = 3,
                                  warmup_steps = 100,
                                  weight_decay = 1e-2,
                                  )

In [None]:
trainer = Trainer(model = model, args = training_args, train_dataset = train_dataset,
                  tokenizer = tokenizer)
