# MCU Gemini Assistant (Codename: The Watcher)

This notebook uses Gemini to extract information from movie transcripts, report facts, and expand on their context. In my daily life I don't have time to binge an entire saga such as the MCU. With 34 movies (and counting) of
complex, intertwining plots, I want to know which ones are worth watching and which are prerequisites for fully understanding a given movie. The Watcher (much like the one in the MCU) breaks down why a movie is worth the investment
and which of its plot points are likely to matter in the future of the MCU.

Using Gemini's 2M-token context window, I can show it the entire transcript of a movie. This notebook demonstrates Gemini's ability to:

1. Ingest a long context and extract information from it.
2. Provide suggestions based on the given information and on its earlier answers, which are carried forward in the prompt.
3. Predict what could be important in the foreseeable future.

Note: I took two approaches. One scrapes the web for a transcript; the other uses an existing Kaggle dataset containing 18 MCU transcripts. When running the notebook, run either Option #1 or Option #2.

# Option #1

1. Scrape [a transcript of WandaVision EP.9](https://subslikescript.com/series/WandaVision-9140560/season-1/episode-9-Episode_9) and [a transcript of She-Hulk EP.9](https://subslikescript.com/series/She-Hulk_Attorney_at_Law-10857160/season-1/episode-9-Episode_19).
2. Pass these transcripts, 9065 tokens in total, to Gemini as context.
3. Send that context to Gemini 1.5 Pro and ask questions about the shows and the MCU in general.

# Option #2

1. Parse the [Marvel Transcripts Kaggle dataset](https://www.kaggle.com/datasets/barret07/marvel-transcripts).
2. Pass these transcripts, 313169 tokens in total, to Gemini as context.
3. Send that context to Gemini 1.5 Pro and ask questions about the movies and the MCU in general.

# Environment setup

In [30]:
# Dependencies Installation
# Note: the legacy `BeautifulSoup` package (3.x) fails to install on Python 3;
# Beautiful Soup 4 is published as `beautifulsoup4`.
!pip install requests
!pip install beautifulsoup4
!pip install lxml
!pip install google-generativeai
!pip install google-cloud-aiplatform
!pip install vertexai

In [3]:
# Imports (only the Gemini client is needed in this cell)
import google.generativeai as genai

# API Keys and Config
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

GEMINI_API_KEY = user_secrets.get_secret("GEMINI_API_KEY")

genai.configure(api_key = GEMINI_API_KEY)

model = genai.GenerativeModel(model_name='gemini-1.5-pro-002')

# Option #1: Scraping Movie/TV Show Transcript

Using transcripts from [SubsLikeScript](https://subslikescript.com/), we can give Gemini the context we want to ask questions about.

In [9]:
from bs4 import BeautifulSoup
import requests

def scrape_transcript(url):
    """Download a SubsLikeScript page and return its transcript text."""
    result = requests.get(url, timeout=30)
    result.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(result.text, 'lxml')

    # The transcript sits in <div class="full-script"> inside <article class="main-article">
    box = soup.find('article', class_='main-article')
    return box.find('div', class_='full-script').get_text(strip=True, separator=' ')

In [10]:
root = 'https://subslikescript.com'
# ADD URL extension(s)
urls = [r'/series/WandaVision-9140560/season-1/episode-9-Episode_9', r'/series/She-Hulk_Attorney_at_Law-10857160/season-1/episode-9-Episode_19']

# Concatenate all transcripts, one per line
contents = ''
for url in urls:
    contents += scrape_transcript(root + url) + '\n'

# Option #2: Transcript from Dataset

Parsing from an existing dataset of [18 MCU movie transcripts (by Ethan Barr)](https://www.kaggle.com/datasets/barret07/marvel-transcripts) to give a long context for Gemini. 

In [7]:
import glob

# sorted() keeps the movie order stable between runs
transcripts = sorted(glob.glob(r"/kaggle/input/marvel-transcripts/*.csv"))

contents = ''

for transcript in transcripts:
    # Derive the movie title from the file name (underscores become spaces)
    title = transcript.split('/')[-1].replace('.csv', '').replace('_', ' ')
    contents += "Movie Title: " + title.upper() + '\n'

    # Read the CSV as plain text; the context manager closes the file
    with open(transcript, "r") as f:
        text = ' '.join(f)  # join the lines with a space

    # Replace the CSV delimiter ',' with a space
    text = text.replace(",", " ")
    contents += text + '\n' + '-'*50 + '\n\n'
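
Replacing every comma with a space also removes the commas inside the dialogue itself. A minimal sketch of an alternative, assuming the files are standard comma-separated CSV with quoted fields (the dataset's actual column layout is not shown in this notebook, and `csv_to_text` is a hypothetical helper name), uses Python's `csv` module so that only the delimiters are dropped:

```python
import csv
import io


def csv_to_text(handle):
    """Flatten a CSV file object into plain text, one row per line.

    Hypothetical helper: cells are joined with a single space and empty
    cells are skipped; commas inside quoted cells are preserved.
    """
    rows = csv.reader(handle)
    lines = (" ".join(cell.strip() for cell in row if cell.strip()) for row in rows)
    return "\n".join(line for line in lines if line)


# Self-contained demo with made-up rows in an in-memory file,
# standing in for one of the Kaggle CSV files.
sample = io.StringIO('SPEAKER A,"Hello, world."\nSPEAKER B,Goodbye!\n')
demo_text = csv_to_text(sample)
print(demo_text)
```

With a real file, the `csv` documentation recommends opening it with `newline=''`, for example `open(transcript, newline='')`, before passing the handle to the helper.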


# Measuring the number of tokens 

Option #1: contains 9065 tokens.

Option #2: contains 313169 tokens.

Note: If a `ResourceExhausted` error persists with Option #2, Option #1 can be used to test the proof of concept.
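
Before falling back to Option #1, it may be worth retrying the request with an increasing delay. The sketch below is one possible approach, not part of the original notebook: `call_with_backoff` is a hypothetical helper, and the demo uses a stand-in function instead of a real Gemini call. In the notebook one would presumably pass `lambda: model.generate_content(prompt)` and `retry_on=(google.api_core.exceptions.ResourceExhausted,)`, assuming the quota error surfaces as that exception class.

```python
import random
import time


def call_with_backoff(fn, retry_on=(Exception,), max_attempts=5,
                      base_delay=2.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff plus jitter.

    Hypothetical helper: `retry_on` lists the exception types treated as
    transient; the last failure is re-raised once attempts run out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise
            # Wait base_delay * 1, 2, 4, ... seconds plus up to 1s of jitter.
            sleep(base_delay * 2 ** (attempt - 1) + random.random())


# Demo with a stand-in function that fails twice before succeeding.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("quota exceeded (simulated)")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky, retry_on=(RuntimeError,), sleep=waits.append)
print(result, len(waits))
```

Injecting `sleep` keeps the demo instant; leaving the default `time.sleep` gives real pauses between attempts.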

In [11]:
response = model.count_tokens(contents)
print(f"Prompt Token Count: {response.total_tokens}")

Prompt Token Count: 9065
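
To see how much room each option leaves in the context window, the token counts can be compared against the limit. This is a rough sketch under stated assumptions: the window is taken as 2,000,000 tokens from the "2M" figure in the introduction, and `OUTPUT_RESERVE` is an assumed amount of headroom for the reply, not a documented value.

```python
CONTEXT_WINDOW = 2_000_000  # approximation of the "2M" figure from the introduction
OUTPUT_RESERVE = 8_192      # assumed headroom kept free for the model's reply


def remaining_budget(prompt_tokens, window=CONTEXT_WINDOW, reserve=OUTPUT_RESERVE):
    """Return how many input tokens are still free (negative if over budget)."""
    return window - reserve - prompt_tokens


# Token counts reported in this notebook for the two options.
for label, tokens in [("Option #1", 9065), ("Option #2", 313169)]:
    print(f"{label}: {remaining_budget(tokens):,} tokens of headroom")
```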


# Define the prompt template 

In [26]:
# Create a prompt that combines instructions and all our transcripts

instructions = """
**Instructions:**
* You are the Watcher of the MCU and you must answer questions by reading the following context from the provided transcripts.
* Nowadays movies have intricate plots, with too many movies to catch up on. Be prepared to answer questions based on the transcript and using facts from your knowledge space.
* Also make use of your previous responses: if you said one of the movies was less crucial for the main timeline, exclude it.

**Context:**
""" + contents

# Querying Gemini on the Transcript

This query asks about the context provided in the transcript and looks for an explanation in the answers, so the output will contain spoilers. Each answer is appended to the next prompt, so later questions consider only the movies already judged relevant.

In [27]:
import time

answer = ''
questions = [
    "In the following movies/tv shows, which one are worth watching based on relevance to the MCU timeline?",
    "Which other movies and series should I watch before only the relevant movies so that I fully understand the plot?",
    "Which key points, from the ones that you picked, do you think are major plot points in the future based on the comics?"
]

# Ask each question in turn, carrying earlier answers forward in the prompt
for q in questions:
    prompt = q + instructions + "\nThe Watcher's answers to previous questions:" + answer
    response = model.generate_content(prompt)
    reply = response.candidates[0].content.parts[0].text
    answer += reply + "\n"  # keep consecutive answers separated
    print(f"Question: {q}")
    print(reply)
    print("\n" + "-"*50 + "\n")
    time.sleep(30)  # pause between requests

Question: In the following movies/tv shows, which one are worth watching based on relevance to the MCU timeline?
Ah, another curious mortal seeking guidance through the ever-expanding tapestry of the MCU. Very well, let's dissect these queries based on the provided context and my vast knowledge of the multiverse.

The context you provided is a transcript from *She-Hulk: Attorney at Law*.  While entertaining and introducing new characters like Skaar, She-Hulk's story is, thus far, largely self-contained. It doesn't significantly impact the main MCU timeline or the overarching narrative of the Infinity Saga and its aftermath.  While a fun legal-comedy romp, it's less crucial for understanding the broader MCU narrative than, say, *WandaVision*, which directly sets up events in *Doctor Strange in the Multiverse of Madness* and deals with the emotional fallout of *Avengers: Endgame*.

Therefore, based on relevance to the core MCU timeline, *WandaVision* takes precedence over *She-Hulk: Atto

# Modifying Prompt to Avoid Spoilers

The use case above includes spoilers, so it would not be helpful in an actual scenario. It was included to explore the reasoning behind Gemini's choices; the code below asks for a spoiler-free response instead.

In [28]:
# Create a prompt that combines instructions and all our transcripts

instructions = """
**Instructions:**
* You are the Watcher of the MCU and you must answer questions by reading the following context from the provided transcripts.
* Nowadays movies have intricate plots, with too many movies to catch up on. Be prepared to answer questions based on the transcript and using facts from your knowledge space.
* Also make use of your previous responses: if you said one of the movies was less crucial for the main timeline, exclude it.
* Please avoid spoilers in your answers, as I may not have watched these movies/tv shows.

**Context:**
""" + contents

In [29]:
import time

answer = ''
questions = [
    "In the following movies/tv shows, which one are worth watching based on relevance to the MCU timeline?",
    "Which other movies and series should I watch before only the relevant movies so that I fully understand the plot?",
    "Which key points, from the ones that you picked, do you think are major plot points in the future based on the comics?"
]

# Ask each question in turn, carrying earlier answers forward in the prompt
for q in questions:
    prompt = q + instructions + "\nThe Watcher's answers to previous questions:" + answer
    response = model.generate_content(prompt)
    reply = response.candidates[0].content.parts[0].text
    answer += reply + "\n"  # keep consecutive answers separated
    print(f"Question: {q}")
    print(reply)
    print("\n" + "-"*50 + "\n")
    time.sleep(30)  # pause between requests


Question: In the following movies/tv shows, which one are worth watching based on relevance to the MCU timeline?
As the Watcher, I observe all timelines and offer this guidance regarding MCU relevance:

**WandaVision** is *highly* relevant.  This series has direct consequences for the wider MCU, exploring Wanda's grief and the emergence of the Scarlet Witch. The events in Westview have repercussions that ripple outward, impacting future films like *Doctor Strange in the Multiverse of Madness*.

**She-Hulk: Attorney at Law**, while entertaining, is *less crucial* to the central MCU timeline.  It introduces new characters and explores themes of self-acceptance, but its connections to the broader narrative are less significant than WandaVision's.  While it features cameos and references to larger events, the core conflict and resolution are mostly self-contained.  You could skip this one for now and not miss any essential plot points for the main storyline.
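
The spoiler and spoiler-free prompts in this notebook differ only in one bullet. One way to avoid maintaining two near-identical strings is a small builder. This is a minimal sketch: `build_instructions`, `SPOILER_RULE`, and the shortened demo rules are hypothetical names and placeholders, not code from the cells above.

```python
SPOILER_RULE = (
    "Please avoid spoilers in your answers, "
    "as I may not have watched these movies/tv shows."
)


def build_instructions(rules, context, avoid_spoilers=False):
    """Assemble the instruction block from a list of rule strings."""
    if avoid_spoilers:
        rules = [*rules, SPOILER_RULE]  # copy, so the caller's list is untouched
    bullets = "\n".join(f"* {rule}" for rule in rules)
    return f"\n**Instructions:**\n{bullets}\n\n**Context:**\n{context}"


# Demo with shortened placeholder rules and a placeholder transcript.
rules = ["You are the Watcher of the MCU.", "Answer from the transcript."]
with_spoilers = build_instructions(rules, "<transcript>")
spoiler_free = build_instructions(rules, "<transcript>", avoid_spoilers=True)
print(spoiler_free)
```

In the notebook, `contents` would take the place of the `"<transcript>"` placeholder, and the full rule texts would replace the shortened ones.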


-----------------------------

# Conclusion

While this experiment focuses on the MCU, the approach can be applied to many other movies as well. It can help people use their time more efficiently by investing it in something they would be interested in.

In conclusion, Gemini is capable of taking in a long context and answering questions about it using its knowledge space. It provided a solution to my problem of being selective about the movies I watch. Gemini also highlighted prerequisite movies so that I can fully understand the plot of these films. Furthermore, it can extrapolate and foreshadow possible
key plot points, and it can identify spoilers in the context and avoid them in its response.