<a href="https://colab.research.google.com/github/girishsenthil/NLP/blob/main/PegasusForYouTubeVideoSummarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pegasus Summarization Model for YouTube Movie Summary Channels


Building this notebook taught me a lot about working with APIs, general data processing, and especially how the Pegasus model works.

I wanted an effective way to summarize YouTube transcripts with NLG models, and chose Pegasus as one of the most popular and effective summarization models currently available.

While building this notebook I had to troubleshoot GPU memory issues. Reducing max_position_embeddings (the maximum input length) proved to be the most effective way to lower GPU memory requirements.

## Imports

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstal

In [2]:
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.96


In [3]:
!pip install youtube-transcript-api

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting youtube-transcript-api
  Downloading youtube_transcript_api-0.4.4-py3-none-any.whl (22 kB)
Installing collected packages: youtube-transcript-api
Successfully installed youtube-transcript-api-0.4.4


In [169]:
!pip install accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting accelerate
  Downloading accelerate-0.11.0-py3-none-any.whl (123 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.11.0


In [207]:
!pip install rouge-score

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... done
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24955 sha256=22446cb553b58092b57e05a539d67bcc7881fddce23fc83a48d0e45bfc820166
  Stored in directory: /root/.cache/pip/wheels/84/ac/6b/38096e3c5bf1dc87911e3585875e21a3ac610348e740409c76
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


In [4]:
import json
import urllib
from urllib import request, parse

In [5]:
import torch
import transformers
from transformers import PegasusTokenizer, PegasusForConditionalGeneration, Trainer, TrainingArguments

In [6]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

In [7]:
from youtube_transcript_api import YouTubeTranscriptApi

In [8]:
import pandas as pd, numpy as np
import re
from sklearn.model_selection import train_test_split
from tqdm import tqdm

In [10]:
config = transformers.PegasusConfig

In [11]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [12]:
model = PegasusForConditionalGeneration.from_pretrained('google/pegasus-large', max_position_embeddings = 512)

Downloading:   0%|          | 0.00/3.02k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.12G [00:00<?, ?B/s]

In [None]:
model.to(device)

In [14]:
tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-large')

Downloading:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/88.0 [00:00<?, ?B/s]

## Retrieving Text Data from Movie Recaps Channel

I am using the Movie Recaps channel because its videos follow an extremely consistent format, much like research papers share a common structure of abstract > methods > results > discussion.

Using a YouTube Data API key created through the Google Cloud Platform, I will query the videos in JSON format and create a dataframe containing the video ID, video title, description, and cleaned transcript.

If you are using this notebook, make sure you have your own API key for querying videos from your desired playlist.
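
One way to assemble the request URL without concatenating strings by hand is `urllib.parse.urlencode`. The sketch below is only illustrative: `build_playlist_url` is a hypothetical helper name that does not appear elsewhere in this notebook, and its parameters mirror the ones queried here.

```python
from urllib import parse

API_URL = 'https://www.googleapis.com/youtube/v3/playlistItems'

def build_playlist_url(playlist_id, api_key, max_results = 50, page_token = None):
  #Query parameters mirror the playlistItems request used in this notebook
  params = {'part': 'snippet,contentDetails',
            'maxResults': max_results,
            'playlistId': playlist_id,
            'key': api_key}
  #pageToken is only sent when requesting a page after the first
  if page_token is not None:
    params['pageToken'] = page_token
  return API_URL + '?' + parse.urlencode(params)

print(build_playlist_url('UUyXD1jAZBdZ4u0K-GLYC77Q', 'Your API Key HERE', max_results = 5))
```

`urlencode` percent-encodes special characters such as the comma in the part parameter, so the printed URL will look slightly different from a hand-written one while carrying the same query.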

In [13]:
api_key = 'Your API Key HERE'
playlist_id = 'UUyXD1jAZBdZ4u0K-GLYC77Q'

### Investigating the JSON outputs

In [19]:
with request.urlopen(f'https://www.googleapis.com/youtube/v3/playlistItems?part=snippet,contentDetails&maxResults=5&playlistId={playlist_id}&key={api_key}') as url:
  data = json.loads(url.read().decode())
  print(data)


{'kind': 'youtube#playlistItemListResponse', 'etag': 'b_RjtxkTxk5xULiUhuZqaTLeEVs', 'nextPageToken': 'EAAaBlBUOkNBVQ', 'items': [{'kind': 'youtube#playlistItem', 'etag': 'lAu6bolSfb0eqgOccz2upzKOyLI', 'id': 'VVV5WEQxakFaQmRaNHUwSy1HTFlDNzdRLnc5Zzl5ZUhjX0h3', 'snippet': {'publishedAt': '2022-07-22T20:45:43Z', 'channelId': 'UCyXD1jAZBdZ4u0K-GLYC77Q', 'title': 'Girl Must Wear 25 Kg Iron Shoes Everyday to Avoid Floating Away', 'description': "A teenager follows his grandfather's stories to find a secret house filled with superpowered children that he must protect because he's the only one that can see the monsters chasing them.\n\n\n\n\n\n\nSubscribe to our friends channel: https://tinyurl.com/Movie-Recaps", 'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/w9g9yeHc_Hw/default.jpg', 'width': 120, 'height': 90}, 'medium': {'url': 'https://i.ytimg.com/vi/w9g9yeHc_Hw/mqdefault.jpg', 'width': 320, 'height': 180}, 'high': {'url': 'https://i.ytimg.com/vi/w9g9yeHc_Hw/hqdefault.jpg', 'widt

In [15]:
data.keys()

dict_keys(['kind', 'etag', 'nextPageToken', 'items', 'pageInfo'])

In [16]:
for i in range(len(data['items'])):
  print(data['items'][i]['snippet']['title'])

Girl Must Wear 25 Kg Iron Shoes Everyday to Avoid Floating Away
Before Being Discharged, 3 US Soldiers Discover a Stash of Gold Worth Billions
The Boy Has 9 Lives But His Mother Kills Him Every Year on His Birthday
After 27 Years in Prison, He Became President and Changed The Whole Country
Fallen Soldier Wakes up on His Funeral and Learns he Has Become a Zombie


In [17]:
cont = data['items'][0]

In [18]:
for keys in cont.keys():
  print(keys)
  print(cont[keys])
  print('*' * 20)

kind
youtube#playlistItem
********************
etag
lAu6bolSfb0eqgOccz2upzKOyLI
********************
id
VVV5WEQxakFaQmRaNHUwSy1HTFlDNzdRLnc5Zzl5ZUhjX0h3
********************
snippet
{'publishedAt': '2022-07-22T20:45:43Z', 'channelId': 'UCyXD1jAZBdZ4u0K-GLYC77Q', 'title': 'Girl Must Wear 25 Kg Iron Shoes Everyday to Avoid Floating Away', 'description': "A teenager follows his grandfather's stories to find a secret house filled with superpowered children that he must protect because he's the only one that can see the monsters chasing them.\n\n\n\n\n\n\nSubscribe to our friends channel: https://tinyurl.com/Movie-Recaps", 'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/w9g9yeHc_Hw/default.jpg', 'width': 120, 'height': 90}, 'medium': {'url': 'https://i.ytimg.com/vi/w9g9yeHc_Hw/mqdefault.jpg', 'width': 320, 'height': 180}, 'high': {'url': 'https://i.ytimg.com/vi/w9g9yeHc_Hw/hqdefault.jpg', 'width': 480, 'height': 360}}, 'channelTitle': 'Movie Recaps', 'playlistId': 'UUyXD1jAZBdZ4u0

The response contains many nested dictionaries, but the necessary values should be straightforward to access.
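
If a key might be missing somewhere along the path, a small helper can walk the nested dictionaries safely. This is a sketch under assumptions: `dig` is a hypothetical helper, and `item` is a trimmed-down stand-in for one entry of data['items'], not the full response.

```python
def dig(mapping, *keys, default = None):
  #Walk down nested dictionaries, returning default if any key is missing
  current = mapping
  for key in keys:
    if not isinstance(current, dict) or key not in current:
      return default
    current = current[key]
  return current

#Trimmed-down stand-in for one entry of data['items']
item = {'snippet': {'title': 'Girl Must Wear 25 Kg Iron Shoes Everyday to Avoid Floating Away'},
        'contentDetails': {'videoId': 'w9g9yeHc_Hw'}}

print(dig(item, 'snippet', 'title'))
print(dig(item, 'snippet', 'thumbnails', 'default', 'url', default = 'no thumbnail'))
```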

In [20]:
title = cont['snippet']['title']
description = cont['snippet']['description'].split('\n')[0] #Keep only the first line; the rest is blank lines and a plug for their friend's channel

In [21]:
print(f'Title: {title} \nDescription: {description}')

Title: Girl Must Wear 25 Kg Iron Shoes Everyday to Avoid Floating Away 
Description: A teenager follows his grandfather's stories to find a secret house filled with superpowered children that he must protect because he's the only one that can see the monsters chasing them.


### Functions to extract desired information and store in pd.DataFrame

In [22]:
def clean(dirty_text):
  
  text = [i['text'] for i in dirty_text if i['text'].find('[') == -1]
  text = list(map(lambda x: x.replace('\n', ' '), text))
  clean_text = ' '.join(text)
  clean_text = re.sub('[^A-Za-z0-9]+', ' ', clean_text)

  return clean_text
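
To see what the cleaning step does, here is a small self-contained run. The function body repeats the logic of clean() above so the example runs on its own, and `sample` is a made-up transcript fragment; real transcript entries carry more fields than 'text'.

```python
import re

def clean(dirty_text):
  #Same logic as the notebook's clean(), repeated so the example runs on its own
  text = [i['text'] for i in dirty_text if i['text'].find('[') == -1]
  text = list(map(lambda x: x.replace('\n', ' '), text))
  clean_text = ' '.join(text)
  return re.sub('[^A-Za-z0-9]+', ' ', clean_text)

#Hypothetical transcript snippets, only for illustration
sample = [{'text': '[Music]'},
          {'text': "It's a lovely\nsummer day"},
          {'text': 'in Croatia.'}]

print(repr(clean(sample)))  # → 'It s a lovely summer day in Croatia '
```

Bracketed captions such as [Music] are dropped entirely, and every run of punctuation becomes a single space, which is why apostrophes turn into gaps ("It s") in the dataframe shown later.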

In [23]:
def playlist_to_dataframe(playlist_id, api_key, max_results):
  
  #Constructing URL for use with YouTube API
  api_url = 'https://www.googleapis.com/youtube/v3/playlistItems?'
  param_url = f'part=snippet,contentDetails&maxResults={max_results}&playlistId={playlist_id}&'
  api_key = f'key={api_key}'

  #Setup for looping through all videos in a playlist
  loop = True
  nextPageToken = None

  #Initializing columns for DataFrame Result
  desired = np.array(['videoID', 'title', 'description', 'text'])

  #Begin retrieving data
  while loop:

    if nextPageToken is None: #First Page
      pageToken = ''
    else:                     #Used Only for Subsequent Pages
      pageToken = f'&pageToken={nextPageToken}'

    #Fully created URL (pageToken is empty on the first page)
    concat_url = api_url + param_url + api_key + pageToken

    with request.urlopen(concat_url) as request_url:
      data = json.loads(request_url.read().decode())
    
    #Looping only for the amount of items in data, as on the last page there
    #will be less than the maximum amount for certain instances

    query_length = len(data['items'])

    #Loop to retrieve relevant data from JSON dictionary
    for item in range(query_length):

      content_dictionary = data['items'][item]

      videoID = content_dictionary['contentDetails']['videoId']
      title = content_dictionary['snippet']['title']
      description = content_dictionary['snippet']['description'].split('\n')[0]

      #The transcript request can fail, for example once the server limit is
      #reached, so a missing transcript is stored as NaN instead
      try: 
        text = YouTubeTranscriptApi.get_transcript(videoID)
        text = clean(text)
      except Exception:
        text = np.nan

      #vstacking to desired array (the string array stores np.nan as 'nan')
      desired = np.vstack((desired, np.array([videoID, title, description, text])))

    #Every JSON dictionary will have a nextPageToken key until the final page,
    #Allowing for this loop to exit when encountering the final page's expected
    #KeyError when calling the nextPageToken key
    try:
      nextPageToken = data['nextPageToken']
      print('Accessing Next Page')
    except KeyError:
      break

  df = pd.DataFrame(data = desired[1:], columns = desired[0])

  return df
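
The paging logic inside the function can be looked at in isolation. The sketch below is an assumption-laden simplification: `iter_items` and `fetch_page` are hypothetical names, and a dictionary of fake pages stands in for the real API so the loop can be run without a key.

```python
def iter_items(fetch_page):
  #fetch_page(token) must return one decoded JSON response as a dict
  token = None
  while True:
    data = fetch_page(token)
    yield from data['items']
    #The last page carries no nextPageToken, which ends the loop
    token = data.get('nextPageToken')
    if token is None:
      break

#Fake two-page response standing in for the YouTube API
pages = {None: {'items': ['a', 'b'], 'nextPageToken': 'p2'},
         'p2': {'items': ['c']}}

print(list(iter_items(pages.get)))  # → ['a', 'b', 'c']
```

Using `data.get('nextPageToken')` plays the same role as the KeyError handling above: the absence of the key on the final page is what stops the loop.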


## Preparing Data

### Initial Data Loading

In [None]:
df = playlist_to_dataframe(playlist_id = playlist_id,
                             api_key = api_key,
                             max_results = 50)

Accessing Next Page
Accessing Next Page
Accessing Next Page
Accessing Next Page
Accessing Next Page


In [None]:
df

| | videoID | title | description | text |
|---|---|---|---|---|
| 0 | KrndqJEPn6k | The Boy Has 9 Lives But His Mother Kills Him E... | An accident-prone boy falls into a coma, trigg... | Since he was born Louis Drax has been in hundr... |
| 1 | --mUOD9Tok4 | After 27 Years in Prison, He Became President ... | The true story of how the president of South A... | In 1990 Nelson Mandela is finally freed from t... |
| 2 | OTD436RwFuE | Fallen Soldier Wakes up on His Funeral and Lea... | A fallen soldier wakes up in his coffin and di... | In Iraq a group of American soldiers is travel... |
| 3 | TK76DFJskPs | Hiker Finds a Stranded Man Wearing Shorts at T... | The true story of a search and rescue voluntee... | It is almost six a m in the morning and search... |
| 4 | jcpZJeDnr0o | Young Mother Accused of Killing Her Best Frien... | During a vacation overseas, a young woman must... | It s a lovely summer day in Croatia and Beth h... |
| ... | ... | ... | ... | ... |
| 275 | 0SE11VVrl5Q | A Group of People Are Trapped in an Elevator A... | Time is running out for the occupants of the e... | |
| 276 | fYkw4MgPR8A | Hybrid Children Are The Only Hope For The Huma... | A scientist and a teacher living in a dystopia... | |
| 277 | 5rCygdGq_AI | A Family Struggles For Survival in The Face of... | A family fights for survival as a planet-killi... | |
| 278 | 4_19GDyr8KA | Shady Legal Guardian Lands in Hot Water When S... | This is the story of Marla Grayson. Profession... | |


In [None]:
has_text = df.loc[df['text'] != 'nan'].reset_index(drop = True)

In [None]:
has_text

| | videoID | title | description | text |
|---|---|---|---|---|
| 0 | KrndqJEPn6k | The Boy Has 9 Lives But His Mother Kills Him E... | An accident-prone boy falls into a coma, trigg... | Since he was born Louis Drax has been in hundr... |
| 1 | --mUOD9Tok4 | After 27 Years in Prison, He Became President ... | The true story of how the president of South A... | In 1990 Nelson Mandela is finally freed from t... |
| 2 | OTD436RwFuE | Fallen Soldier Wakes up on His Funeral and Lea... | A fallen soldier wakes up in his coffin and di... | In Iraq a group of American soldiers is travel... |
| 3 | TK76DFJskPs | Hiker Finds a Stranded Man Wearing Shorts at T... | The true story of a search and rescue voluntee... | It is almost six a m in the morning and search... |
| 4 | jcpZJeDnr0o | Young Mother Accused of Killing Her Best Frien... | During a vacation overseas, a young woman must... | It s a lovely summer day in Croatia and Beth h... |
| ... | ... | ... | ... | ... |
| 210 | 8Z4fVj43JIM | A Damaged Spaceship Carrying Settlers to Mars ... | A Mars-bound spaceship gets knocked off course... | Welcome back to Movie Recaps Today I will show... |
| 211 | Q_xtMu6bqv8 | A Woman Vampire is Forced Into Action When Ter... | A woman with a Mysterious illness who is heade... | Welcome back to Movie Recaps Today I will show... |
| 212 | NryQxqPAn4Q | Five American Soldiers Encounter an Enemy More... | American soldiers are assigned to hold a Frenc... | Welcome back to Movie Recaps Today I will show... |
| 213 | 3pwJcaWqOu4 | A Soldier Wakes Up in Someone Else's Body and ... | An Army Captain becomes a part of an experimen... | Welcome back to Movie Recaps Today I will show... |


As can be seen, the API for retrieving YouTube transcripts has a request limit, which may reset depending on how much time has passed since the server limit was reached. I will collect the videoIDs whose transcripts could not be retrieved and wait until requests are available again to finish the dataset.

In [None]:
missing_text = df.loc[df['text'] == 'nan'].reset_index(drop = True)
missing_text
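
The waiting step described above could be sketched as a retry loop with growing pauses. Everything here is an assumption for illustration: `fetch_with_retry` is a hypothetical helper, the wait times are guesses rather than documented limits, and any callable that takes a video ID can be passed as `fetch`.

```python
import time

def fetch_with_retry(fetch, video_id, attempts = 3, wait_seconds = 60, sleep = time.sleep):
  #Try fetch(video_id) up to `attempts` times, pausing between failures
  for attempt in range(attempts):
    try:
      return fetch(video_id)
    except Exception:
      #Give up and signal a missing transcript after the final attempt
      if attempt == attempts - 1:
        return None
      #Wait a little longer after each failed attempt
      sleep(wait_seconds * (attempt + 1))
```

In this notebook the call might look like fetch_with_retry(YouTubeTranscriptApi.get_transcript, videoID), with a None result treated the same way as the NaN placeholder.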

To avoid further trouble with API limits, the dataframes are saved as .csv files for future use.

In [None]:
has_text.to_csv('/content/has_text.csv')

In [None]:
missing_text.to_csv('/content/missing_text.csv')

Further Data Retrieval/API Usage: https://colab.research.google.com/drive/1ZIhCM6gukUKckrnssOSZOH6jhiEeio8p?usp=sharing

## Creating training split for model

Loading Data from .csv file

In [15]:
has_text = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/has_text (1).csv')

In [16]:
inputs, labels = has_text['text'], has_text['description']

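It is important to remove the redundant intro from each transcript for higher quality training data.

The intro is typically no longer than 170 characters, and the code treats the first occurrence of 'care' as its end, so the text up to 'care' is dropped only when 'care' appears within the first 170 characters.

In [220]:
def strip_intro(x, limit = 170):
  idx = x.find('care') #position of the first 'care', or -1 if absent
  return x[idx + 4:].strip() if 0 <= idx <= limit else x
inputs = list(map(strip_intro, inputs.tolist()))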

In [221]:
X_train, X_test, y_train, y_test = train_test_split(inputs, labels, 
                                                    test_size = .05,
                                                    shuffle = True,
                                                    random_state = 48)

In [222]:
print(f'Train Length: {len(X_train)} \nTest Length: {len(X_test)}')

Train Length: 243 
Test Length: 13


200 training examples and 43 validation examples

In [223]:
train_encodings = tokenizer.batch_encode_plus(X_train[:200], padding = 'longest',
                                              truncation = True,
                                              max_length = 512,
                                              return_tensors = 'pt')

train_decodings = tokenizer.batch_encode_plus(y_train[:200], padding = 'longest',
                                              truncation = True,
                                              max_length = 128,
                                              return_tensors = 'pt')

In [224]:
eval_encodings = tokenizer.batch_encode_plus(X_train[200:], padding = 'longest',
                                             truncation = True,
                                             max_length = 512,
                                             return_tensors = 'pt')

eval_decodings = tokenizer.batch_encode_plus(y_train[200:], padding = 'longest',
                                             truncation = True,
                                             max_length = 128,
                                             return_tensors = 'pt')

Checking the 1st and 3rd quartiles of the label lengths gives a general intuition for how long the generated output (min_length/max_length) should be.

In [225]:
np.quantile(list(map(lambda x: len(x), train_decodings['input_ids'])), q = [.25, .75])

array([73., 73.])
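
Both quartiles come out as 73 because the labels were encoded with padding = 'longest', so every row of input_ids has the same padded length. Counting the ones in the attention mask instead gives the number of real tokens per example. The sketch below uses a toy mask and a hypothetical helper name, `token_length_quartiles`, since the real values depend on the dataset.

```python
import numpy as np

def token_length_quartiles(attention_mask):
  #Each row of the mask has 1 for real tokens and 0 for padding
  lengths = np.asarray(attention_mask).sum(axis = 1)
  return np.quantile(lengths, q = [.25, .75])

#Toy mask standing in for train_decodings['attention_mask']
toy_mask = [[1, 1, 0, 0],
            [1, 1, 1, 0],
            [1, 1, 1, 1],
            [1, 1, 1, 1]]

print(token_length_quartiles(toy_mask))
```

Passing train_decodings['attention_mask'] in place of the toy mask should give quartiles that reflect the actual description lengths.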

### Test Set Pre-Fine Tuning

In [245]:
def generate_predictions(model, test_encodings, min_length, max_length):
  
  pred = []

  for encoding in test_encodings:
    encoding.to(device)
    gen = model.generate(encoding['input_ids'], 
                         min_length = min_length,
                         max_length = max_length)
    pred.append(gen)

  pred = list(map(lambda x: tokenizer.batch_decode(x, 
                                                   skip_special_tokens = True, 
                                                   clean_up_tokenization_spaces = True)[0], 
                  pred))

  return pred


In [227]:
test_encodings = list(map(lambda x: tokenizer(x, max_length = 512,
                                          truncation = True,
                                          padding = True,
                                          return_tensors = 'pt'), X_test))

In [None]:
pred = generate_predictions(model, test_encodings, min_length = 30, max_length = 120)

## Pegasus Training

In [38]:
#reference https://gist.github.com/jiahao87/50cec29725824da7ff6dd9314b53c4b3

In [188]:
class PegasusDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels['input_ids'][idx] 
        return item
    def __len__(self):
        return len(self.labels['input_ids'])
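
One possible refinement, not part of the original run: Hugging Face sequence-to-sequence models skip label positions set to -100 when computing the loss, so padded label positions can be masked out rather than trained on. The sketch shows the idea on plain lists; `mask_padding` is a hypothetical helper, and 0 is only a stand-in for tokenizer.pad_token_id.

```python
def mask_padding(label_ids, pad_token_id, ignore_index = -100):
  #Replace padding ids with ignore_index so they do not contribute to the loss
  return [[ignore_index if token == pad_token_id else token for token in row]
          for row in label_ids]

#Toy label batch; 0 stands in for tokenizer.pad_token_id
toy_labels = [[57, 12, 1, 0, 0],
              [33, 1, 0, 0, 0]]

print(mask_padding(toy_labels, pad_token_id = 0))
# → [[57, 12, 1, -100, -100], [33, 1, -100, -100, -100]]
```

On tensors, the same replacement could be done inside __getitem__ with masked_fill on the label row before it is stored in item['labels'].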

In [189]:
output_dir = '/content/fine_tuned_pegasus.bin'

In [190]:
train_dataset = PegasusDataset(train_encodings, train_decodings)

In [191]:
eval_dataset = PegasusDataset(eval_encodings, eval_decodings)

In [None]:
training_args = TrainingArguments(output_dir = output_dir,
                                  num_train_epochs = 20,
                                  save_steps = 100,
                                  save_total_limit = 1,
                                  warmup_steps = 100,
                                  save_strategy = 'epoch',
                                  load_best_model_at_end = True,
                                  logging_strategy = 'epoch',
                                  metric_for_best_model = 'eval_loss',
                                  evaluation_strategy = 'epoch',
                                  auto_find_batch_size = True,
                                  )

In [201]:
trainer = Trainer(model = model, args = training_args, 
                  train_dataset = train_dataset,
                  eval_dataset = eval_dataset,
                  tokenizer = tokenizer)

In [202]:
torch.cuda.empty_cache()

The initial goal was to train the model with a maximum input length of 2048 positions, but GPU memory limitations prevented training. A length of 512 was selected instead to lower the strain on the GPU.
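
A rough way to see why input length matters so much: each self-attention head builds a score matrix with one entry per pair of positions, so that part of the memory grows with the square of the sequence length. The sketch counts those entries only. It ignores weights, optimizer state, and all other activations, so it is an intuition aid rather than a memory estimate, and `attention_score_entries` is a hypothetical helper.

```python
def attention_score_entries(seq_len):
  #One score per (query position, key position) pair in a single attention head
  return seq_len * seq_len

for seq_len in (512, 1024, 2048):
  ratio = attention_score_entries(seq_len) / attention_score_entries(512)
  print(f'{seq_len} tokens: {ratio:.0f}x the attention scores of 512 tokens')
```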

In [None]:
trainer.train()

## Inference/Model Performance after Fine Tuning

In [246]:
summaries = generate_predictions(model, test_encodings, min_length = 30, max_length = 120)

In [247]:
summaries

["A young woman is given the chance to create new memories indistinguishable from reality in order to cure her brother's drowning accident. Now she must find a way to survive.",
 'A successful real estate agent is kidnapped by a man who lives in his closet and must find a way to find his girlfriend before he is killed.',
 'A repo man is hired by a company to locate and forcibly repossess a bio-mechanical organ known as artiforgs that is being sold at very high prices.',
 'A group of struggling to survive a pandemic that has wiped out humanity, they must find a way to communicate with each other in order to stay alive.',
 'A former soldier and his spotter are on a peacekeeping mission in the middle of the Ethiopian desert when they discover a plan to assassinate the president of the United States.',
 'A special boarding school that works as a foster home for young women is about to move on to the next level, so the girls must find a way to survive.',
 'The year is 2022 and a serious loo

### Rouge Metric

ROUGE evaluates automatically produced summaries by comparing them against the actual reference summaries.

ROUGE-1 measures the overlap of unigrams (n-grams consisting of a single item from a sequence).

ROUGE-L is based on the longest common subsequence of words, which must appear in the same order but not necessarily contiguously.
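
To make the two definitions concrete, here is a simplified hand computation. It is a sketch under assumptions: tokens are lowercased and split on whitespace with no stemming, so the numbers will not exactly match the rouge_score package used below, and both function names are hypothetical.

```python
from collections import Counter

def rouge1_precision_recall(reference, prediction):
  #Clipped unigram overlap between lowercased, whitespace-split tokens
  ref, pred = Counter(reference.lower().split()), Counter(prediction.lower().split())
  overlap = sum((ref & pred).values())
  return overlap / sum(pred.values()), overlap / sum(ref.values())

def lcs_length(a, b):
  #Dynamic programming table for the longest common subsequence
  table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
  for i, x in enumerate(a, 1):
    for j, y in enumerate(b, 1):
      table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
  return table[len(a)][len(b)]

reference = 'the cat sat on the mat'
prediction = 'the cat lay on a mat'

print(rouge1_precision_recall(reference, prediction))
print(lcs_length(reference.split(), prediction.split()))  # → 4
```

Four unigrams overlap (the, cat, on, mat) out of six in each sentence, and the same four words form the longest common subsequence, so ROUGE-L precision here is also 4 out of 6.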

In [239]:
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer = True)

In [249]:
scores = []

for label, prediction in zip(y_test, summaries):
  
  score = scorer.score(label, prediction)
  scores.append(score)

Computing Averages

In [250]:
#Rouge1 Average Precision
precision_rouge1 = []
for score in scores:
  precision_rouge1.append(score['rouge1'][0])
print(np.mean(precision_rouge1))

0.2983286679316456


In [251]:
#RougeL Average Precision
precision_rougeL = []
for score in scores: 
  precision_rougeL.append(score['rougeL'][0])
print(np.mean(precision_rougeL))

0.1875539312263878
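
The two averaging cells above only look at precision. Each result from the scorer also carries recall and F-measure, and all three can be averaged in one pass. This is a sketch: `average_scores` is a hypothetical helper, and `toy_scores` uses a stand-in tuple with made-up numbers so the block runs without the rouge_score package.

```python
from collections import namedtuple
from statistics import mean

#Stand-in with the same three fields as the rouge_score result tuples
Score = namedtuple('Score', ['precision', 'recall', 'fmeasure'])

def average_scores(scores, metric):
  #Mean precision, recall and F-measure for one metric across all examples
  return {field: mean(getattr(score[metric], field) for score in scores)
          for field in Score._fields}

#Made-up numbers, only to show the shape of the computation
toy_scores = [{'rouge1': Score(0.5, 0.25, 0.3)},
              {'rouge1': Score(0.25, 0.75, 0.4)}]

print(average_scores(toy_scores, 'rouge1'))
```

With the real results, average_scores(scores, 'rouge1') and average_scores(scores, 'rougeL') would report recall and F-measure alongside the precision values printed above.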


As ROUGE focuses solely on n-gram similarity, it is important to also review the generated summaries by hand.

In [261]:
y_test.tolist()[4]

'A former marksman returns to action when he is hired to protect the president from a murder complot only to end up framed for the attempt and running away from the law until he can prove his innocence.'

In [260]:
summaries[4]

'A former soldier and his spotter are on a peacekeeping mission in the middle of the Ethiopian desert when they discover a plan to assassinate the president of the United States.'

Generally, the premise of the movie is somewhat captured. For some summaries, however, too much detail without enough context makes the generated abstraction more ambiguous and confusing than the true label.

Generating Random Examples from training data

In [291]:
def gen_random(inputs, labels, model, min_length = 30, max_length = 120, size = 5):

  #randint excludes the upper bound, so len(inputs) covers every index
  rand_ind = np.random.randint(len(inputs), size = size)

  random_data, random_label = [], []
  for ind in rand_ind:
    random_data.append(inputs[ind])
    random_label.append(labels[ind])

  tok = list(map(lambda x: tokenizer(x, max_length = 512,
                                          truncation = True,
                                          padding = True,
                                          return_tensors = 'pt'), random_data))
  
  preds = generate_predictions(model, tok, min_length,
                              max_length)
  


  return list(zip(random_label, preds))


In [296]:
pairs = gen_random(inputs, labels, model)

In [297]:
pairs[0]

('A corporate defense attorney takes on an environmental lawsuit against a chemical company that exposes a lengthy history of pollution which led to many health issues around the world.',
 "A lawyer is hired by a chemical company to investigate the poisoning of a farmer's cattle. Now he must find a way to stop the poisoning before it's too late.")

In [298]:
pairs[1]

('A college girl is trying to enjoy her birthday but soon realizes that this is her last one. Until she figures out who her killer is. She must relive that day, over and over again, dying in a different way each time. ',
 'A young woman wakes up in a dorm room on her 18th birthday and decides to hide her birthdate in order to hide her true identity.')

Hope you enjoyed! There is definitely room to improve the abstractions. One clear way is to use a more powerful GPU and increase the maximum input length to 1024 or even 2048.