<a href="https://colab.research.google.com/github/girishsenthil/NLP/blob/main/PegasusForYouTubeVideoSummarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetuning Pegasus Summarization Model for YouTube Movie Summary Channels

Why watch a movie when you can read a generated abstractive summary of a video summarizing a movie?

I had a lot of fun learning how to work with the YouTube APIs, process the data, and especially use the Pegasus model. Unfortunately, there is a limit on how often one can request video transcripts from the server, which means waiting before I can actually train the model.

Despite this setback, I hope you enjoy the project! The model will be trained as soon as the data is available.

Currently training the model on CPU, as the GPU does not have sufficient memory.

## Imports

In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uni

In [2]:
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.96


In [3]:
!pip install youtube-transcript-api

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting youtube-transcript-api
  Downloading youtube_transcript_api-0.4.4-py3-none-any.whl (22 kB)
Installing collected packages: youtube-transcript-api
Successfully installed youtube-transcript-api-0.4.4


In [4]:
import json
import urllib
from urllib import request, parse

In [5]:
import torch
import transformers
from transformers import PegasusTokenizer, PegasusForConditionalGeneration, Trainer, TrainingArguments

In [6]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

In [7]:
from youtube_transcript_api import YouTubeTranscriptApi

In [8]:
import pandas as pd, numpy as np
import re
from sklearn.model_selection import train_test_split
from tqdm import tqdm

In [9]:
config = transformers.PegasusConfig

In [10]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [11]:
model = PegasusForConditionalGeneration.from_pretrained('google/pegasus-large', max_position_embeddings = 2048).to(device)

Downloading:   0%|          | 0.00/3.02k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.12G [00:00<?, ?B/s]

In [12]:
tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-large')

## Retrieving Text Data from Movie Recaps Channel

For this niche task, the inputs (video transcripts) and labels (video descriptions) follow a very consistent format across all of the channel's videos.

Using a YouTube Data API key created through the Google Cloud Platform, I will query the channel's videos in JSON format and create a dataframe containing each video's ID, title, description, and cleaned transcript.
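As a minimal sketch of how such a query URL can be assembled, the helper below builds a `playlistItems` request with `urllib.parse.urlencode`, so that the parameters are encoded consistently. The function name `build_playlist_url` and the placeholder key `YOUR_API_KEY` are illustrative assumptions, not part of the original notebook.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.googleapis.com/youtube/v3/playlistItems"


def build_playlist_url(playlist_id, api_key, max_results=5, page_token=None):
    """Return a playlistItems request URL with encoded query parameters."""
    params = {
        "part": "snippet,contentDetails",
        "maxResults": max_results,
        "playlistId": playlist_id,
        "key": api_key,
    }
    # The page token is only sent when a later page is requested.
    if page_token:
        params["pageToken"] = page_token
    return f"{API_ENDPOINT}?{urlencode(params)}"


# "YOUR_API_KEY" is a placeholder, not a working key.
url = build_playlist_url("UUyXD1jAZBdZ4u0K-GLYC77Q", "YOUR_API_KEY")
print(url)
```

Building the query string from a dictionary keeps each parameter in one place and avoids accidentally appending the same parameter twice.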

In [13]:
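import os

# Read the key from the environment instead of hard-coding it in the notebook
# (the variable name YOUTUBE_API_KEY is an arbitrary choice).
api_key = os.environ['YOUTUBE_API_KEY']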
playlist_id = 'UUyXD1jAZBdZ4u0K-GLYC77Q'

### Investigating the JSON output

In [14]:
# Build the request from the variables above instead of repeating the ID and key inline
sample_url = ('https://www.googleapis.com/youtube/v3/playlistItems'
              f'?part=snippet,contentDetails&maxResults=5&playlistId={playlist_id}&key={api_key}')
with request.urlopen(sample_url) as url:
  data = json.loads(url.read().decode())
  print(data)


{'kind': 'youtube#playlistItemListResponse', 'etag': 'WoZLPkPvKx2Qh4FpweR9Pmy-ang', 'nextPageToken': 'EAAaBlBUOkNBVQ', 'items': [{'kind': 'youtube#playlistItem', 'etag': 'mSoeAYTVrd0ArB4YsXnxZQsp4Z8', 'id': 'VVV5WEQxakFaQmRaNHUwSy1HTFlDNzdRLktybmRxSkVQbjZr', 'snippet': {'publishedAt': '2022-07-18T20:07:42Z', 'channelId': 'UCyXD1jAZBdZ4u0K-GLYC77Q', 'title': 'The Boy Has 9 Lives But His Mother Kills Him Every Year on His Birthday', 'description': 'An accident-prone boy falls into a coma, triggering a series of investigations that reveal a supernatural factor connecting him to his doctor.\n\n\n\n\n\n\nSubscribe to our friends channel: https://tinyurl.com/Movie-Recaps', 'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/KrndqJEPn6k/default.jpg', 'width': 120, 'height': 90}, 'medium': {'url': 'https://i.ytimg.com/vi/KrndqJEPn6k/mqdefault.jpg', 'width': 320, 'height': 180}, 'high': {'url': 'https://i.ytimg.com/vi/KrndqJEPn6k/hqdefault.jpg', 'width': 480, 'height': 360}, 'standard': {

In [15]:
data.keys()

dict_keys(['kind', 'etag', 'nextPageToken', 'items', 'pageInfo'])

In [16]:
for item in data['items']:
  print(item['snippet']['title'])

The Boy Has 9 Lives But His Mother Kills Him Every Year on His Birthday
After 27 Years in Prison, He Became President and Changed The Whole Country
Fallen Soldier Wakes up on His Funeral and Learns he Has Become a Zombie
Hiker Finds a Stranded Man Wearing Shorts at The Top of a Snowy Mountain
Young Mother Accused of Killing Her Best Friend Must Find The True Killer to Save Herself


In [17]:
cont = data['items'][0]

In [18]:
for key, value in cont.items():
  print(key)
  print(value)
  print('*' * 20)

kind
youtube#playlistItem
********************
etag
mSoeAYTVrd0ArB4YsXnxZQsp4Z8
********************
id
VVV5WEQxakFaQmRaNHUwSy1HTFlDNzdRLktybmRxSkVQbjZr
********************
snippet
{'publishedAt': '2022-07-18T20:07:42Z', 'channelId': 'UCyXD1jAZBdZ4u0K-GLYC77Q', 'title': 'The Boy Has 9 Lives But His Mother Kills Him Every Year on His Birthday', 'description': 'An accident-prone boy falls into a coma, triggering a series of investigations that reveal a supernatural factor connecting him to his doctor.\n\n\n\n\n\n\nSubscribe to our friends channel: https://tinyurl.com/Movie-Recaps', 'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/KrndqJEPn6k/default.jpg', 'width': 120, 'height': 90}, 'medium': {'url': 'https://i.ytimg.com/vi/KrndqJEPn6k/mqdefault.jpg', 'width': 320, 'height': 180}, 'high': {'url': 'https://i.ytimg.com/vi/KrndqJEPn6k/hqdefault.jpg', 'width': 480, 'height': 360}, 'standard': {'url': 'https://i.ytimg.com/vi/KrndqJEPn6k/sddefault.jpg', 'width': 640, 'height': 480},

The response contains many nested dictionaries, but the values we need are straightforward to access.
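A small sketch of reading values out of such a nested response without risking a `KeyError` when a field is absent. The helper `dig` is hypothetical (not part of any library), and the sample `item` is a trimmed copy of the output shown above.

```python
def dig(mapping, *keys, default=None):
    """Follow a chain of keys through nested dicts; return default if any is missing."""
    current = mapping
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Trimmed copy of one playlist item from the response above.
item = {
    "snippet": {
        "title": "The Boy Has 9 Lives But His Mother Kills Him Every Year on His Birthday",
        "thumbnails": {
            "default": {
                "url": "https://i.ytimg.com/vi/KrndqJEPn6k/default.jpg",
                "width": 120,
                "height": 90,
            }
        },
    }
}

title = dig(item, "snippet", "title")
width = dig(item, "snippet", "thumbnails", "default", "width")
# This trimmed sample has no contentDetails, so the default is returned.
video_id = dig(item, "contentDetails", "videoId", default="unknown")
print(title, width, video_id)
```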

In [19]:
title = cont['snippet']['title']
description = cont['snippet']['description'].split('\n')[0]  # keep only the first line; the new lines that follow precede the plug for their friend's channel

In [20]:
print(f'Title: {title} \nDescription: {description}')

Title: The Boy Has 9 Lives But His Mother Kills Him Every Year on His Birthday 
Description: An accident-prone boy falls into a coma, triggering a series of investigations that reveal a supernatural factor connecting him to his doctor.


### Functions to extract desired information and store in pd.DataFrame

In [21]:
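def clean(dirty_text):
  """Join transcript segments into one string of letters, digits, and spaces."""

  # Skip segments containing '[' (bracketed cues, e.g. [Music])
  text = [i['text'] for i in dirty_text if '[' not in i['text']]
  # A segment can span several lines; flatten them
  text = [segment.replace('\n', ' ') for segment in text]
  clean_text = ' '.join(text)
  # Collapse every run of other characters (punctuation, apostrophes) into one space
  clean_text = re.sub('[^A-Za-z0-9]+', ' ', clean_text)

  return clean_text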
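A quick, self-contained check of what `clean` does to a raw transcript. The sample segments are made up for illustration; they only assume that each transcript segment is a dictionary with a `'text'` field, which is how `clean` reads them.

```python
import re


def clean(dirty_text):
    # Same logic as the notebook's clean(): drop bracketed cues, flatten
    # new lines, then replace every non-alphanumeric run with one space.
    text = [i['text'] for i in dirty_text if i['text'].find('[') == -1]
    text = list(map(lambda x: x.replace('\n', ' '), text))
    clean_text = ' '.join(text)
    clean_text = re.sub('[^A-Za-z0-9]+', ' ', clean_text)
    return clean_text


# Hypothetical transcript segments.
sample = [
    {'text': '[Music]'},
    {'text': "It's a lovely\nsummer day"},
    {'text': 'in Croatia.'},
]

result = clean(sample)
print(repr(result))
```

Note that apostrophes and the final period become spaces, which explains fragments such as "It s a lovely summer day" in the cleaned transcripts.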

In [22]:
def playlist_to_dataframe(playlist_id, api_key, max_results):
  """Collect videoID, title, description, and cleaned transcript for every playlist item."""

  api_url = 'https://www.googleapis.com/youtube/v3/playlistItems?'
  param_url = f'part=snippet,contentDetails&maxResults={max_results}&playlistId={playlist_id}&'
  key_param = f'key={api_key}'

  nextPageToken = None
  # First row holds the column names; stacking strings turns np.nan into the string 'nan'
  desired = np.array(['videoID', 'title', 'description', 'text'])

  while True:

    if nextPageToken is None:
      pageToken = ''
    else:
      pageToken = f'&pageToken={nextPageToken}'

    # Append the page token exactly once
    concat_url = api_url + param_url + key_param + pageToken

    with request.urlopen(concat_url) as request_url:
      data = json.loads(request_url.read().decode())

    for content_dictionary in data['items']:

      videoID = content_dictionary['contentDetails']['videoId']
      title = content_dictionary['snippet']['title']
      description = content_dictionary['snippet']['description'].split('\n')[0]

      try:
        text = YouTubeTranscriptApi.get_transcript(videoID)
        text = clean(text)
      except Exception:
        # Transcript unavailable or request refused; mark it as missing
        text = np.nan

      desired = np.vstack((desired, np.array([videoID, title, description, text])))

    try:
      nextPageToken = data['nextPageToken']
      print('Accessing Next Page')
    except KeyError:
      break

  df = pd.DataFrame(data = desired[1:], columns = desired[0])

  return df
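The pagination logic inside `playlist_to_dataframe` can be isolated as a small generator. This is only a sketch: `iter_pages`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names, and the fake pages stand in for real API responses so the example runs offline.

```python
def iter_pages(fetch_page):
    """Yield each response page, following nextPageToken until it is absent."""
    token = None
    while True:
        page = fetch_page(token)
        yield page
        token = page.get("nextPageToken")
        if token is None:
            break


# Stand-in for the API: maps a page token to a response-like dictionary.
FAKE_PAGES = {
    None: {"items": ["a", "b"], "nextPageToken": "p2"},
    "p2": {"items": ["c"], "nextPageToken": "p3"},
    "p3": {"items": ["d"]},  # last page: no nextPageToken
}


def fake_fetch(token):
    return FAKE_PAGES[token]


items = [item for page in iter_pages(fake_fetch) for item in page["items"]]
print(items)  # ['a', 'b', 'c', 'd']
```

Separating "how to walk the pages" from "what to do with each item" makes the loop easier to test without spending API quota.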


## Preparing Data

### Initial Data Loading

In [24]:
df = playlist_to_dataframe(playlist_id = playlist_id,
                             api_key = api_key,
                             max_results = 50)

Accessing Next Page
Accessing Next Page
Accessing Next Page
Accessing Next Page
Accessing Next Page


In [25]:
df

| | videoID | title | description | text |
|---|---|---|---|---|
| 0 | KrndqJEPn6k | The Boy Has 9 Lives But His Mother Kills Him E... | An accident-prone boy falls into a coma, trigg... | Since he was born Louis Drax has been in hundr... |
| 1 | --mUOD9Tok4 | After 27 Years in Prison, He Became President ... | The true story of how the president of South A... | In 1990 Nelson Mandela is finally freed from t... |
| 2 | OTD436RwFuE | Fallen Soldier Wakes up on His Funeral and Lea... | A fallen soldier wakes up in his coffin and di... | In Iraq a group of American soldiers is travel... |
| 3 | TK76DFJskPs | Hiker Finds a Stranded Man Wearing Shorts at T... | The true story of a search and rescue voluntee... | It is almost six a m in the morning and search... |
| 4 | jcpZJeDnr0o | Young Mother Accused of Killing Her Best Frien... | During a vacation overseas, a young woman must... | It s a lovely summer day in Croatia and Beth h... |
| ... | ... | ... | ... | ... |
| 275 | 0SE11VVrl5Q | A Group of People Are Trapped in an Elevator A... | Time is running out for the occupants of the e... | |
| 276 | fYkw4MgPR8A | Hybrid Children Are The Only Hope For The Huma... | A scientist and a teacher living in a dystopia... | |
| 277 | 5rCygdGq_AI | A Family Struggles For Survival in The Face of... | A family fights for survival as a planet-killi... | |
| 278 | 4_19GDyr8KA | Shady Legal Guardian Lands in Hot Water When S... | This is the story of Marla Grayson. Profession... | |


In [40]:
has_text = df.loc[df['text'] != 'nan'].reset_index(drop = True)

In [41]:
has_text

| | videoID | title | description | text |
|---|---|---|---|---|
| 0 | KrndqJEPn6k | The Boy Has 9 Lives But His Mother Kills Him E... | An accident-prone boy falls into a coma, trigg... | Since he was born Louis Drax has been in hundr... |
| 1 | --mUOD9Tok4 | After 27 Years in Prison, He Became President ... | The true story of how the president of South A... | In 1990 Nelson Mandela is finally freed from t... |
| 2 | OTD436RwFuE | Fallen Soldier Wakes up on His Funeral and Lea... | A fallen soldier wakes up in his coffin and di... | In Iraq a group of American soldiers is travel... |
| 3 | TK76DFJskPs | Hiker Finds a Stranded Man Wearing Shorts at T... | The true story of a search and rescue voluntee... | It is almost six a m in the morning and search... |
| 4 | jcpZJeDnr0o | Young Mother Accused of Killing Her Best Frien... | During a vacation overseas, a young woman must... | It s a lovely summer day in Croatia and Beth h... |
| ... | ... | ... | ... | ... |
| 210 | 8Z4fVj43JIM | A Damaged Spaceship Carrying Settlers to Mars ... | A Mars-bound spaceship gets knocked off course... | Welcome back to Movie Recaps Today I will show... |
| 211 | Q_xtMu6bqv8 | A Woman Vampire is Forced Into Action When Ter... | A woman with a Mysterious illness who is heade... | Welcome back to Movie Recaps Today I will show... |
| 212 | NryQxqPAn4Q | Five American Soldiers Encounter an Enemy More... | American soldiers are assigned to hold a Frenc... | Welcome back to Movie Recaps Today I will show... |
| 213 | 3pwJcaWqOu4 | A Soldier Wakes Up in Someone Else's Body and ... | An Army Captain becomes a part of an experimen... | Welcome back to Movie Recaps Today I will show... |


As can be seen, some transcripts could not be retrieved: the transcript API appears to limit the number of requests, and the limit may depend on how much time passes between bursts of requests. I will collect the videoIDs whose transcripts could not be retrieved and retry once requests are available again, in order to finish the dataset.
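One common way to handle request limits is to retry with exponentially growing pauses. The sketch below is not what this notebook does (the retries here were done by hand, a day or more apart), and it is unknown whether short pauses would be enough for this particular server. `fetch_with_backoff` and `flaky_fetch` are hypothetical names; the flaky function simulates a fetch that fails twice before succeeding.

```python
import time


def fetch_with_backoff(fetch, video_id, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(video_id); after a failure wait base_delay * 2**attempt and retry.

    Returns None if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return fetch(video_id)
        except Exception:
            if attempt == max_attempts - 1:
                return None
            sleep(base_delay * 2 ** attempt)


# Simulated fetch: raises twice, then returns a transcript-like list.
calls = {"count": 0}


def flaky_fetch(video_id):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limited")
    return [{"text": f"transcript for {video_id}"}]


# Record the requested delays instead of actually sleeping.
delays = []
result = fetch_with_backoff(flaky_fetch, "KrndqJEPn6k", sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test; in real use the default `time.sleep` would apply.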

In [45]:
missing_text = df.loc[df['text'] == 'nan'].reset_index(drop = True)
missing_text

| | videoID | title | description | text |
|---|---|---|---|---|
| 0 | h1AojkhAxZc | Autistic Hotel Clerk Uses Cameras to Spy on a ... | A hotel clerk with Asperger's syndrome spies o... | |
| 1 | pNPXSOaflnU | Girl Takes Revenge For Her Death in a Strange Way | After fleeing the scene of an accident, a youn... | |
| 2 | seT46uZHLvg | He Gains The Ability To See Ghosts But They Ar... | After almost losing his life, a young man can ... | |
| 3 | AwtQeurKBi8 | Stuck on a Deserted Island in The Middle of a ... | A depressed man jumps into the river only to e... | |
| 4 | Wg66FJ3Us9s | World Where Our Memory Lasts Only For a Few Hours | A decade after a global pandemic, a group of s... | |
| ... | ... | ... | ... | ... |
| 60 | 0SE11VVrl5Q | A Group of People Are Trapped in an Elevator A... | Time is running out for the occupants of the e... | |
| 61 | fYkw4MgPR8A | Hybrid Children Are The Only Hope For The Huma... | A scientist and a teacher living in a dystopia... | |
| 62 | 5rCygdGq_AI | A Family Struggles For Survival in The Face of... | A family fights for survival as a planet-killi... | |
| 63 | 4_19GDyr8KA | Shady Legal Guardian Lands in Hot Water When S... | This is the story of Marla Grayson. Profession... | |


To avoid further trouble with API limits, the dataframes are saved as .csv files for future use.

In [44]:
has_text.to_csv('/content/has_text.csv')

In [46]:
missing_text.to_csv('/content/missing_text.csv')

### Reloaded Data

In [23]:
# has_text = pd.read_csv('/content/has_text.csv')
# missing_text = pd.read_csv('/content/missing_text.csv')

has_text = pd.read_csv('/content/has_text (1).csv')
missing_text = pd.read_csv('/content/missing_text (1).csv')

Trying to retrieve transcripts from the missing_text df

In [None]:
missing_text.count()

In [None]:
retrieved_transcripts = []
missing_ID = []

for videoID in missing_text['videoID']:

  try:
    transcript = clean(YouTubeTranscriptApi.get_transcript(videoID))
    retrieved_transcripts.append(transcript)

  except Exception:
    # Still unavailable: keep a NaN placeholder so the list stays aligned with the rows
    retrieved_transcripts.append(np.nan)
    missing_ID.append(videoID)


In [25]:
val = retrieved_transcripts[0]

In [27]:
val

nan

In [29]:
len(missing_ID)

24

In [30]:
missing_text['text'] = retrieved_transcripts

In [31]:
missing_text.count()

Unnamed: 0      24
Unnamed: 0.1    24
videoID         24
title           24
description     24
text             0
dtype: int64

All 24 transcripts are still missing, and waiting is slightly tedious. I will have to invest in a paid API for future projects that use YouTube data.

### July 20th: Had server requests available, added more data

In [46]:
still_missing = missing_text.loc[missing_text['text'].isna()]  # rows whose transcript is still NaN

In [54]:
found = missing_text[missing_text['text'].notna()]  # rows whose transcript was retrieved

In [56]:
found = found.drop(columns = 'Unnamed: 0')

In [58]:
has_text = has_text.drop(columns = 'Unnamed: 0')

In [59]:
has_text = pd.concat([has_text, found])

In [60]:
has_text

| | videoID | title | description | text |
|---|---|---|---|---|
| 0 | KrndqJEPn6k | The Boy Has 9 Lives But His Mother Kills Him E... | An accident-prone boy falls into a coma, trigg... | Since he was born Louis Drax has been in hundr... |
| 1 | --mUOD9Tok4 | After 27 Years in Prison, He Became President ... | The true story of how the president of South A... | In 1990 Nelson Mandela is finally freed from t... |
| 2 | OTD436RwFuE | Fallen Soldier Wakes up on His Funeral and Lea... | A fallen soldier wakes up in his coffin and di... | In Iraq a group of American soldiers is travel... |
| 3 | TK76DFJskPs | Hiker Finds a Stranded Man Wearing Shorts at T... | The true story of a search and rescue voluntee... | It is almost six a m in the morning and search... |
| 4 | jcpZJeDnr0o | Young Mother Accused of Killing Her Best Frien... | During a vacation overseas, a young woman must... | It s a lovely summer day in Croatia and Beth h... |
| ... | ... | ... | ... | ... |
| 58 | y5IEs7Vr9as | Genetic Engineers Want to Create New Hybrid An... | Genetic engineers are hoping to achieve fame b... | Welcome back to Movie Recaps Today i will show... |
| 59 | ctswinaJ8ZA | Chemical Company That Spread Cancer Around The... | A corporate defense attorney takes on an envir... | Welcome back to Movie Recaps Today I will show... |
| 60 | 0SE11VVrl5Q | A Group of People Are Trapped in an Elevator A... | Time is running out for the occupants of the e... | welcome back to movie recaps today i will show... |
| 61 | fYkw4MgPR8A | Hybrid Children Are The Only Hope For The Huma... | A scientist and a teacher living in a dystopia... | welcome back to movie recaps today i will show... |


In [61]:
has_text.to_csv('/content/has_text.csv')

In [62]:
still_missing.to_csv('/content/missing_text.csv')

## Creating training split for model

In [24]:
inputs, labels = has_text['text'], has_text['description']

In [25]:
X_train, X_test, y_train, y_test = train_test_split(inputs, labels, 
                                                    test_size = .05,
                                                    shuffle = True,
                                                    random_state = 48)

In [26]:
print(len(X_train), len(X_test))

243 13


In [27]:
train_encodings = tokenizer.batch_encode_plus(X_train, padding = True,
                                              truncation = True,
                                              max_length = 2048,
                                              return_tensors = 'pt')
train_decodings = tokenizer.batch_encode_plus(y_train, padding = True,
                                              truncation = True,
                                              max_length = 2048,
                                              return_tensors = 'pt')

### Test Set Pre-Fine Tuning

In [74]:
X_test_tok = list(map(lambda x: tokenizer(x, max_length = 2048,
                                          truncation = True,
                                          padding = True,
                                          return_tensors = 'pt'), X_test))

In [77]:
device = 'cuda'

In [78]:
pred = []
for i in X_test_tok:
  i.to(device)
  gen = model.generate(i['input_ids'], max_length = 40)
  pred.append(gen)

RuntimeError: ignored

In [34]:
pred = list(map(lambda x: tokenizer.batch_decode(x, skip_special_tokens = True, clean_up_tokenization_spaces = True), pred))

In [35]:
pred

[['Welcome back to Movie Recaps Today I will show you a crime drama mystery film from 2017 titled Otherlife Spoilers ahead They watch out and take care Music like a heartbeat can be heard along'],
 ['Welcome back to Movie Recaps Today I will show you a dark comedy from 2017 titled Two Pigeons Spoilers ahead Watch out and take care In the city of London Hussein works as a very'],
 ['Welcome back to Movie Recaps Today I will show you an action sci fi thriller film from 2010 titled Repo Men Spoilers ahead Watch out and take care In the year 2025 technology has advanced'],
 ['Welcome back to Movie Recaps Today I will show you a sci fi thriller film from 2016 titled Domain Spoilers ahead Phoenix is slowly wiping out humanity spreading beyond doctors abilities to contain it This could'],
 ['Welcome back to Movie Recaps Today I will show you an action thriller film from 2007 titled Shooter Shooter Spoilers ahead and afterwards Watch out and take care In the middle of the Ethiopian desert snip

## Pegasus

In [36]:
#reference https://gist.github.com/jiahao87/50cec29725824da7ff6dd9314b53c4b3

In [28]:
class PegasusDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels['input_ids'][idx] 
        return item
    def __len__(self):
        return len(self.labels['input_ids'])

In [29]:
output_dir = '/content/fine_tune'

In [30]:
train_dataset = PegasusDataset(train_encodings, train_decodings)

In [31]:
training_args = TrainingArguments(output_dir = output_dir,
                                  num_train_epochs = 3,
                                  save_steps = 5,
                                  save_total_limit = 3,
                                  warmup_steps = 50,
                                  weight_decay = 1e-2,
                                  per_device_train_batch_size = 1
                                  )

In [59]:
trainer = Trainer(model = model, args = training_args, 
                  train_dataset = train_dataset,
                  tokenizer = tokenizer)

In [66]:
torch.cuda.empty_cache()

The GPU that Colab provides does not have enough memory to train the model, even with a batch size of 1. Instead, a high-RAM CPU runtime will be used, but training will take a very long time.
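A back-of-the-envelope estimate helps explain why a batch size of 1 does not save the day. The sketch below starts from the roughly 2.12 GB download reported for `google/pegasus-large` earlier in this notebook and assumes (these are assumptions, not measurements) that the checkpoint consists almost entirely of 32-bit weights and that training uses an AdamW-style optimizer in full precision, which keeps two extra buffers per parameter.

```python
CHECKPOINT_BYTES = 2.12e9  # download size reported for google/pegasus-large
BYTES_PER_FP32 = 4

# Assumption: the checkpoint is essentially all fp32 weights.
n_params = CHECKPOINT_BYTES / BYTES_PER_FP32

# Assumption: full-precision AdamW-style training keeps four fp32 copies
# per parameter: weights, gradients, and two moment estimates.
training_state_bytes = n_params * BYTES_PER_FP32 * 4

print(f"approx. parameters: {n_params / 1e6:.0f}M")
print(f"weights + gradients + optimizer state: {training_state_bytes / 1e9:.2f} GB")
```

Under these assumptions roughly 8.5 GB is needed before any activations are stored, and activations for inputs of up to 2048 tokens come on top of that, so the batch size is not the dominant factor.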

In [61]:
torch.cuda

<module 'torch.cuda' from '/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py'>

In [62]:
trainer.train()

***** Running training *****
  Num examples = 243
  Num Epochs = 3
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 729


RuntimeError: ignored

Because of these memory difficulties, I will use a pre-trained Pegasus checkpoint (google/pegasus-xsum) for now as a baseline for what summarization of these videos looks like.

In [32]:
tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-xsum')

In [34]:
model = PegasusForConditionalGeneration.from_pretrained('google/pegasus-xsum', max_position_embeddings = 2048)

In [None]:
model.config

Summarizations for X_test using pre-trained model

In [38]:
X_test

191    Welcome back to Movie Recaps Today I will show...
112    Welcome back to Movie Recaps Today I will show...
163    Welcome back to Movie Recaps Today I will show...
53     Welcome back to Movie Recaps Today I will show...
62     Welcome back to Movie Recaps Today I will show...
209    Welcome back to Movie Recaps Today I will show...
31     Welcome back to Movie Recaps Today I will show...
246    welcome back to movie recaps today i m going t...
97     In 1898 Daniel Plainview is doing some prospec...
26     In Paty Do Alferes sixty miles from Rio de Jan...
70     Welcome back to Movie Recaps Today I will show...
186    Welcome back to Movie Recaps Today I will show...
171    Hi Movie Recaps here Today I will show you a d...
Name: text, dtype: object

In [40]:
np.mean([len(description) for description in y_train])  # average length in characters

175.41152263374485

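The average description length is about 175 characters. Since max_length in generate counts tokens rather than characters, max_length = 180 is a generous upper bound for the generated summaries.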



In [None]:
model.to(device)

In [44]:
device

'cuda'

In [45]:
summaries = []
for transcript in X_test:
  inputs = tokenizer(transcript, max_length = 2048, padding = True,
                     truncation = True, return_tensors = 'pt')
  summary_ids = model.generate(inputs['input_ids'].to(device), max_length = 180)

  summaries.append(tokenizer.batch_decode(summary_ids, skip_special_tokens = True)[0])




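ROUGE evaluates automatically produced summaries by comparing them with reference summaries.

ROUGE-1 measures the overlap of unigrams (single words) between the generated and reference summaries.

ROUGE-L is based on the longest common subsequence of words shared by the two summaries; the words must appear in the same order but need not be contiguous.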
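To make the two metrics concrete, here is a simplified, self-contained sketch of both. It assumes lowercase whitespace tokenization and non-empty inputs; the `rouge-score` package used below tokenizes differently and can apply stemming, so its numbers will not match this sketch exactly. The function names and the two sample sentences are illustrative.

```python
from collections import Counter


def rouge1(reference, prediction):
    """Return (precision, recall) of the clipped unigram overlap."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    overlap = sum((Counter(ref_tokens) & Counter(pred_tokens)).values())
    return overlap / len(pred_tokens), overlap / len(ref_tokens)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(reference, prediction):
    """Return (precision, recall) based on the longest common subsequence."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    lcs = lcs_length(ref_tokens, pred_tokens)
    return lcs / len(pred_tokens), lcs / len(ref_tokens)


reference = "a boy falls into a coma"
prediction = "into a coma the boy falls"

print(rouge1(reference, prediction))   # 5 of 6 words overlap in each direction
print(rouge_l(reference, prediction))  # (0.5, 0.5): longest in-order match has 3 words
```

The prediction shares five words with the reference, but in a different order, so ROUGE-1 stays high while ROUGE-L drops: ROUGE-L rewards matching word order, ROUGE-1 does not.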

In [46]:
!pip install rouge-score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rouge-score
  Downloading rouge_score-0.0.4-py2.py3-none-any.whl (22 kB)
Installing collected packages: rouge-score
Successfully installed rouge-score-0.0.4


In [47]:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer = True)

In [59]:
scores = []

for label, pred in zip(y_test, summaries):
  
  score = scorer.score(label, pred)
  scores.append(score)

Computing Averages
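As a sketch of the averaging step, the helper below averages precision, recall, and F-measure for one metric across all score dictionaries. The local `Score` namedtuple stands in for the score objects returned by `rouge-score` (which expose the same three fields), and the sample values are made up for illustration; `average_scores` is a hypothetical helper name.

```python
from collections import namedtuple
from statistics import mean

# Stand-in with the same fields as the scores returned by rouge-score.
Score = namedtuple("Score", ["precision", "recall", "fmeasure"])


def average_scores(scores, metric):
    """Average precision, recall, and F-measure of one metric over all examples."""
    values = [score[metric] for score in scores]
    return Score(
        precision=mean([v.precision for v in values]),
        recall=mean([v.recall for v in values]),
        fmeasure=mean([v.fmeasure for v in values]),
    )


# Made-up scores for two examples.
example_scores = [
    {"rouge1": Score(0.5, 0.5, 0.5), "rougeL": Score(0.25, 0.25, 0.25)},
    {"rouge1": Score(0.25, 0.25, 0.25), "rougeL": Score(0.25, 0.25, 0.25)},
]

avg_rouge1 = average_scores(example_scores, "rouge1")
avg_rougeL = average_scores(example_scores, "rougeL")
print(avg_rouge1)
print(avg_rougeL)
```

Reporting recall and F-measure alongside precision gives a fuller picture, since precision alone can look good for very short summaries.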

In [64]:
# ROUGE-1 average precision
precision_rouge1 = [score['rouge1'].precision for score in scores]
print(np.mean(precision_rouge1))

0.2689356542482043


In [65]:
# ROUGE-L average precision
precision_rougeL = [score['rougeL'].precision for score in scores]
print(np.mean(precision_rougeL))

0.22234532453836492


As can be seen, the pre-trained model performs poorly on the YouTube data (average ROUGE-1 precision of about 0.27 and ROUGE-L precision of about 0.22). Fine-tuning will most likely improve this once the memory constraints are handled.