# 🎬 Movie Recommender Chatbot using T5
This project fine-tunes a T5-small transformer model on a custom movie Q&A dataset to build a chatbot that can recommend movies based on genres, actors, moods, and more.


In [1]:
import pandas as pd


## 📦 Step 1: Load and Clean TMDB Dataset
We load two TMDB 5000 CSV files: `tmdb_5000_movies.csv` (titles, genres, runtime, and other metadata) and `tmdb_5000_credits.csv` (cast and crew). The columns we need are selected and cleaned in the steps that follow.


In [2]:
# Load the datasets
movies_df = pd.read_csv('../data/tmdb_5000_movies.csv')
credits_df = pd.read_csv('../data/tmdb_5000_credits.csv')

# Show the first few rows of movies_df
movies_df.head()


Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


## 🧹 Step 2: Parse and Fix `cast` and `crew` Columns
The `genres`, `cast`, and `crew` columns are stored as JSON-encoded strings. After merging the two tables on the movie ID, we parse them into Python lists of names for easier filtering.
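As a small illustration of the parsing step, the sketch below turns one encoded cell into a list of names. The sample string is abbreviated from the `genres` values shown earlier; `names_from_cell` is a hypothetical helper, not part of the notebook. It tries `json.loads` first on the assumption that the cells are valid JSON, and falls back to `ast.literal_eval` otherwise.

```python
import ast
import json

def names_from_cell(cell):
    """Return the 'name' field of every dict in an encoded list string."""
    if not isinstance(cell, str) or not cell.strip():
        return []  # missing values (NaN, None, empty string) become an empty list
    try:
        items = json.loads(cell)  # assumption: the column holds valid JSON
    except json.JSONDecodeError:
        try:
            items = ast.literal_eval(cell)  # fallback for Python-literal strings
        except (ValueError, SyntaxError):
            return []
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items if isinstance(item, dict) and "name" in item]

sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
print(names_from_cell(sample))  # ['Action', 'Adventure']
```

One reason to prefer `json.loads` here is that JSON literals such as `null` or `true` are not valid Python literals, so `ast.literal_eval` would reject a cell that contains them.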


In [3]:
movies_df.columns


Index(['budget', 'genres', 'homepage', 'id', 'keywords', 'original_language',
       'original_title', 'overview', 'popularity', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'vote_average',
       'vote_count'],
      dtype='object')

In [4]:
credits_df.columns


Index(['movie_id', 'title', 'cast', 'crew'], dtype='object')

In [5]:
# Keep only important columns
movies_df = movies_df[['id', 'original_title', 'genres', 'runtime', 'overview', 'budget', 'original_language', 'production_countries']]
credits_df = credits_df[['movie_id', 'cast', 'crew']]

# Show the cleaned datasets
movies_df.head(), credits_df.head()


(       id                            original_title  \
 0   19995                                    Avatar   
 1     285  Pirates of the Caribbean: At World's End   
 2  206647                                   Spectre   
 3   49026                     The Dark Knight Rises   
 4   49529                               John Carter   
 
                                               genres  runtime  \
 0  [{"id": 28, "name": "Action"}, {"id": 12, "nam...    162.0   
 1  [{"id": 12, "name": "Adventure"}, {"id": 14, "...    169.0   
 2  [{"id": 28, "name": "Action"}, {"id": 12, "nam...    148.0   
 3  [{"id": 28, "name": "Action"}, {"id": 80, "nam...    165.0   
 4  [{"id": 28, "name": "Action"}, {"id": 12, "nam...    132.0   
 
                                             overview     budget  \
 0  In the 22nd century, a paraplegic Marine is di...  237000000   
 1  Captain Barbossa, long believed to be dead, ha...  300000000   
 2  A cryptic message from Bond’s past sends him o...  24500

In [6]:
# Merge movies_df and credits_df based on id and movie_id
merged_df = movies_df.merge(credits_df, left_on='id', right_on='movie_id')

# Show the first few rows
merged_df.head()


Unnamed: 0,id,original_title,genres,runtime,overview,budget,original_language,production_countries,movie_id,cast,crew
0,19995,Avatar,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",162.0,"In the 22nd century, a paraplegic Marine is di...",237000000,en,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",19995,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,285,Pirates of the Caribbean: At World's End,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",169.0,"Captain Barbossa, long believed to be dead, ha...",300000000,en,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",285,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,206647,Spectre,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",148.0,A cryptic message from Bond’s past sends him o...,245000000,en,"[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",206647,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,49026,The Dark Knight Rises,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",165.0,Following the death of District Attorney Harve...,250000000,en,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",49026,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,49529,John Carter,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",132.0,"John Carter is a war-weary, former military ca...",260000000,en,"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",49529,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


In [7]:
# Convert budget to readable format (e.g., 2600000 -> 2.6M)
def format_budget(budget):
    if pd.isna(budget) or budget == 0:
        return "Unknown"
    return f"{round(budget/1_000_000, 1)}M"

# Apply the formatting to the budget column
merged_df['budget'] = merged_df['budget'].apply(format_budget)

# Show some examples
merged_df[['original_title', 'budget']].head()


Unnamed: 0,original_title,budget
0,Avatar,237.0M
1,Pirates of the Caribbean: At World's End,300.0M
2,Spectre,245.0M
3,The Dark Knight Rises,250.0M
4,John Carter,260.0M


In [8]:
import ast  # To safely evaluate string to list/dict

# Function to clean genres
def extract_genres(genres_str):
    if pd.isna(genres_str):
        return []
    genres_list = ast.literal_eval(genres_str)
    return [genre['name'] for genre in genres_list]

# Apply it to the 'genres' column
merged_df['genres'] = merged_df['genres'].apply(extract_genres)

# Show an example
merged_df[['original_title', 'genres']].head()


Unnamed: 0,original_title,genres
0,Avatar,"[Action, Adventure, Fantasy, Science Fiction]"
1,Pirates of the Caribbean: At World's End,"[Adventure, Fantasy, Action]"
2,Spectre,"[Action, Adventure, Crime]"
3,The Dark Knight Rises,"[Action, Crime, Drama, Thriller]"
4,John Carter,"[Action, Adventure, Science Fiction]"


In [9]:
# Function to clean cast (keep only top 3 actors)
def extract_cast(cast_str):
    if pd.isna(cast_str):
        return []
    cast_list = ast.literal_eval(cast_str)
    return [actor['name'] for actor in cast_list[:3]]  # Only top 3 actors

# Apply it to 'cast' column
merged_df['cast'] = merged_df['cast'].apply(extract_cast)

# Show an example
merged_df[['original_title', 'cast']].head()


Unnamed: 0,original_title,cast
0,Avatar,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]"
1,Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley]"
2,Spectre,"[Daniel Craig, Christoph Waltz, Léa Seydoux]"
3,The Dark Knight Rises,"[Christian Bale, Michael Caine, Gary Oldman]"
4,John Carter,"[Taylor Kitsch, Lynn Collins, Samantha Morton]"


In [10]:
# Function to clean crew (extract the director's name)
def extract_director(crew_str):
    if pd.isna(crew_str):
        return []
    crew_list = ast.literal_eval(crew_str)
    for member in crew_list:
        if member['job'] == 'Director':
            return [member['name']]  # Keep as list for easier searching
    return []

# Apply it to 'crew' column
merged_df['crew'] = merged_df['crew'].apply(extract_director)

# Show an example
merged_df[['original_title', 'crew']].head()


Unnamed: 0,original_title,crew
0,Avatar,[James Cameron]
1,Pirates of the Caribbean: At World's End,[Gore Verbinski]
2,Spectre,[Sam Mendes]
3,The Dark Knight Rises,[Christopher Nolan]
4,John Carter,[Andrew Stanton]


In [11]:
# List of possible genres
all_genres = ['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary', 
              'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music', 'Mystery', 
              'Romance', 'Science Fiction', 'TV Movie', 'Thriller', 'War', 'Western']

def extract_genre(user_input):
    matched_genres = []
    for genre in all_genres:
        if genre.lower() in user_input.lower():
            matched_genres.append(genre)
    return matched_genres

# Example test
extract_genre("I want a comedy movie or an action one")


['Action', 'Comedy']
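Because `extract_genre` only matches literal genre names, a request for a "funny" movie yields no genre at all. A possible extension is sketched below: whole-word matching plus a small synonym table. The synonym entries and the shortened genre list are illustrative assumptions, and `extract_genre_with_synonyms` is a hypothetical name.

```python
import re

# Shortened genre list for the example; the notebook's all_genres is longer
ALL_GENRES = ["Action", "Comedy", "Horror", "Romance", "Science Fiction"]

# Hypothetical synonym table; these words are illustrative assumptions
GENRE_SYNONYMS = {
    "funny": "Comedy",
    "scary": "Horror",
    "sci-fi": "Science Fiction",
    "romantic": "Romance",
}

def has_word(word, text):
    """True if `word` occurs in `text` as a whole word (case already lowered)."""
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None

def extract_genre_with_synonyms(user_input):
    text = user_input.lower()
    # Literal genre names first, in list order
    matched = [g for g in ALL_GENRES if has_word(g.lower(), text)]
    # Then synonyms, skipping genres that are already matched
    for word, genre in GENRE_SYNONYMS.items():
        if has_word(word, text) and genre not in matched:
            matched.append(genre)
    return matched

print(extract_genre_with_synonyms("I want a short funny movie"))  # ['Comedy']
```

Whole-word matching also avoids accidental hits such as "action" inside "transaction".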

In [12]:
def extract_duration(user_input):
    user_input = user_input.lower()
    
    if 'short' in user_input:
        return 'short'
    elif 'long' in user_input:
        return 'long'
    elif 'medium' in user_input or 'normal' in user_input:
        return 'medium'
    else:
        return None  # No preference

# Example tests
print(extract_duration("I want a short funny movie"))
print(extract_duration("Give me a long action movie"))
print(extract_duration("Suggest a medium length movie"))
print(extract_duration("Just any movie is fine"))


short
long
medium
None
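The substring checks in `extract_duration` also fire inside longer words, so a prompt containing "along" or "belong" is treated as a request for a long movie. The sketch below is one way to restrict the match to whole words while keeping the same priority order; `extract_duration_strict` is a hypothetical name.

```python
import re

def extract_duration_strict(user_input):
    """Like extract_duration, but matches whole words only."""
    text = user_input.lower()
    # Same priority as the original: short, then long, then medium/normal
    for label, pattern in (
        ("short", r"\bshort\b"),
        ("long", r"\blong\b"),
        ("medium", r"\b(?:medium|normal)\b"),
    ):
        if re.search(pattern, text):
            return label
    return None  # no preference

print(extract_duration_strict("Give me a long action movie"))   # long
print(extract_duration_strict("Something to sing along to"))    # None
```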


In [13]:
def extract_actor(user_input, all_actors):
    matched_actors = []
    for actor in all_actors:
        if actor.lower() in user_input.lower():
            matched_actors.append(actor)
    return matched_actors

# Prepare a list of all known actors (from merged_df)
all_known_actors = set()

for cast_list in merged_df['cast']:
    for actor in cast_list:
        all_known_actors.add(actor)

all_known_actors = list(all_known_actors)

# Example test
extract_actor("I want a movie with Kevin Hart", all_known_actors)


['Kevin Hart']

In [14]:
def extract_director(user_input, all_directors):
    matched_directors = []
    for director in all_directors:
        if director.lower() in user_input.lower():
            matched_directors.append(director)
    return matched_directors

# Prepare a list of all known directors (from merged_df)
all_known_directors = set()

for crew_list in merged_df['crew']:
    for director in crew_list:
        all_known_directors.add(director)

all_known_directors = list(all_known_directors)

# Example test
extract_director("Give me a movie directed by Tim Story", all_known_directors)


['Tim Story']

In [15]:
def search_movies(user_input):
    # Extract info from user input
    desired_genres = extract_genre(user_input)
    desired_duration = extract_duration(user_input)
    desired_actors = extract_actor(user_input, all_known_actors)
    desired_directors = extract_director(user_input, all_known_directors)
    
    # Start with all movies
    filtered_df = merged_df.copy()
    
    # Filter by genres
    if desired_genres:
        filtered_df = filtered_df[filtered_df['genres'].apply(lambda movie_genres: any(genre in movie_genres for genre in desired_genres))]
    
    # Filter by duration
    if desired_duration == 'short':
        filtered_df = filtered_df[filtered_df['runtime'] < 90]
    elif desired_duration == 'medium':
        filtered_df = filtered_df[(filtered_df['runtime'] >= 90) & (filtered_df['runtime'] <= 120)]
    elif desired_duration == 'long':
        filtered_df = filtered_df[filtered_df['runtime'] > 120]
    
    # Filter by actors
    if desired_actors:
        filtered_df = filtered_df[filtered_df['cast'].apply(lambda movie_cast: any(actor in movie_cast for actor in desired_actors))]
    
    # Filter by directors
    if desired_directors:
        filtered_df = filtered_df[filtered_df['crew'].apply(lambda movie_crew: any(director in movie_crew for director in desired_directors))]
    
    # Return the results
    if filtered_df.empty:
        return "Sorry, no movies found matching your description."
    else:
        return filtered_df[['original_title', 'genres', 'runtime', 'budget', 'overview']].head(5)  # Show top 5 matches



In [16]:
# Example search
search_movies("I want a short funny movie with Kevin Hart directed by Tim Story")


Unnamed: 0,original_title,genres,runtime,budget,overview
4471,Kevin Hart: Laugh at My Pain,"[Comedy, Documentary]",89.0,Unknown,Experience the show that quickly became a nati...


In [17]:
def chatbot_answer(user_input):
    # Search movies
    results = search_movies(user_input)
    
    # If no results
    if isinstance(results, str):
        return results
    
    # Build chatbot style answers
    responses = []
    for index, row in results.iterrows():
        title = row['original_title']
        genres = ", ".join(row['genres'])
        runtime = int(row['runtime']) if not pd.isna(row['runtime']) else "Unknown"
        budget = row['budget']
        overview = row['overview'][:100] + "..." if isinstance(row['overview'], str) else ""
        
        response = f"🎬 **{title}**\nGenres: {genres}\nRuntime: {runtime} minutes\nBudget: {budget}\nOverview: {overview}\n"
        responses.append(response)
    
    return "\n\n".join(responses)


In [18]:
print(chatbot_answer("I want a short funny movie with Kevin Hart directed by Tim Story"))


🎬 **Kevin Hart: Laugh at My Pain**
Genres: Comedy, Documentary
Runtime: 89 minutes
Budget: Unknown
Overview: Experience the show that quickly became a national phenomenon. Get an up-close and personal look at ...



In [19]:
# Combine all TMDB movie overviews into one big training text

# Remove missing overviews
tmdb_texts = merged_df['overview'].dropna().tolist()

# Join all overviews into one big text
full_text = " ".join(tmdb_texts)

# Save it to a new file
with open('../data/tmdb_overviews_combined.txt', 'w', encoding='utf-8') as f:
    f.write(full_text)

print(f"Total characters in text: {len(full_text)}")


Total characters in text: 1470713


## 🧠 Step 3: Generate Q&A Pairs from Movie Metadata
We create a custom dataset of question–answer pairs where the questions are natural language prompts (e.g., "Suggest a short movie with Tom Hanks") and the answers are real movie titles.
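To make the idea concrete, here is a minimal sketch of template-based pair generation on two in-memory records taken from the outputs above. It enumerates every pair instead of sampling at random as the notebook does, and `make_pairs` and the dictionary layout are assumptions made for the example.

```python
# Tiny stand-in for the cleaned DataFrame (values abbreviated from the outputs above)
movies = [
    {"title": "Avatar", "genres": ["Action", "Adventure"],
     "cast": ["Sam Worthington"], "runtime": 162},
    {"title": "Kevin Hart: Laugh at My Pain", "genres": ["Comedy", "Documentary"],
     "cast": ["Kevin Hart"], "runtime": 89},
]

def make_pairs(movies):
    """Build (question, answer) pairs from fixed question templates."""
    pairs = set()  # a set removes duplicate pairs automatically
    for m in movies:
        for g in m["genres"]:
            pairs.add((f"Suggest a {g.lower()} movie", m["title"]))
        for a in m["cast"]:
            pairs.add((f"Suggest a movie with {a}", m["title"]))
        if m["runtime"] < 90:  # same "short" threshold as the search function
            pairs.add(("Suggest a short movie", m["title"]))
    return sorted(pairs)

for question, answer in make_pairs(movies):
    print(f"Q: {question}\nA: {answer}\n")
```

Note that one question can map to many valid answers (every action movie answers "Suggest an action movie"), so a model trained on such pairs sees conflicting targets for the same input.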


In [20]:
import random
import ast

def safe_parse(entry):
    try:
        return ast.literal_eval(entry) if isinstance(entry, str) else entry
    except (ValueError, SyntaxError):  # malformed literal; a bare except would also hide unrelated errors
        return []

# Apply parsing safely to the DataFrame.
# 'cast' and 'crew' were already converted to lists of names above
# (top-3 actors and the director), so this is a no-op unless they are still strings.
merged_df['crew'] = merged_df['crew'].apply(safe_parse)
merged_df['cast'] = merged_df['cast'].apply(safe_parse)

def clean_crew(crew_list):
    # Normalise 'crew' to a list of director names. At this point the entries are
    # plain strings, so keeping only dicts would empty every list; raw dict
    # entries are still handled in case the earlier extraction was skipped.
    names = []
    for d in crew_list:
        if isinstance(d, str):
            names.append(d)
        elif isinstance(d, dict) and d.get("job") == "Director":
            names.append(d["name"])
    return names

merged_df['crew'] = merged_df['crew'].apply(clean_crew)

def generate_qa_dataset_large(df, target_size=50000):
    qa_pairs = []
    # Genre names must match the TMDB labels exactly ("Science Fiction", not "Sci-Fi")
    genres = ["Action", "Comedy", "Drama", "Horror", "Romance", "Adventure", "Thriller", "Science Fiction", "Fantasy", "Animation"]
    actors, directors = [], []

    for cast in df['cast']:
        actors.extend(cast if isinstance(cast, list) else [])
    for crew in df['crew']:
        directors.extend(crew if isinstance(crew, list) else [])

    actors = list(set(actors))
    directors = list(set(directors))

    def random_movie(df, genre=None, actor=None, director=None, short=False):
        filtered = df
        if genre:
            filtered = filtered[filtered['genres'].apply(lambda x: genre in x)]
        if actor:
            filtered = filtered[filtered['cast'].apply(lambda x: actor in x)]
        if director:
            filtered = filtered[filtered['crew'].apply(lambda x: director in x)]
        if short:
            filtered = filtered[filtered['runtime'] < 90]
        if len(filtered) == 0:
            return None
        return random.choice(filtered['original_title'].tolist())

    used_pairs = set()
    while len(qa_pairs) < target_size:
        choice = random.choice(["genre", "actor", "director", "short", "mood"])
        question = ""
        answer = None

        if choice == "genre":
            g = random.choice(genres)
            question = f"Suggest a {g.lower()} movie"
            answer = random_movie(df, genre=g)

        elif choice == "actor" and actors:
            a = random.choice(actors)
            question = f"Suggest a movie with {a}"
            answer = random_movie(df, actor=a)

        elif choice == "director" and directors:
            d = random.choice(directors)
            question = f"Suggest a movie directed by {d}"
            answer = random_movie(df, director=d)

        elif choice == "short":
            question = "Suggest a short movie"
            answer = random_movie(df, short=True)

        elif choice == "mood":
            question = random.choice([
                "Suggest a feel-good movie",
                "I'm feeling sad, suggest a movie to cheer me up",
                "Suggest an exciting movie to watch",
                "Suggest a deep and emotional movie",
                "Recommend a classic movie"
            ])
            answer = random_movie(df)

        if answer and (question, answer) not in used_pairs:
            qa_pairs.append((question, answer))
            used_pairs.add((question, answer))

    return qa_pairs


## 📊 Step 4: Preview Sample Training Pairs
We generate 10,000 pairs and preview the first ten to check that the format is consistent and the data is clean.


In [21]:
train_data = generate_qa_dataset_large(merged_df, target_size=10000)

# Preview a few examples
for q, a in train_data[:10]:
    print(f"Q: {q}\nA: {a}\n")

print(f"✅ Total Q&A pairs generated: {len(train_data)}")


Q: Suggest an exciting movie to watch
A: The Life Aquatic with Steve Zissou

Q: Suggest a short movie
A: Deuce Bigalow: Male Gigolo

Q: Suggest a short movie
A: Banshee Chapter

Q: Suggest a short movie
A: Pieces of April

Q: Suggest a movie with Ray Milland
A: The Lost Weekend

Q: Suggest a movie with Michael Harris
A: I Love You, Don't Touch Me!

Q: Suggest a fantasy movie
A: Oz: The Great and Powerful

Q: Suggest a short movie
A: Walter

Q: Suggest a deep and emotional movie
A: Monsters vs Aliens

Q: Suggest a movie with Shane Dawson
A: Not Cool

✅ Total Q&A pairs generated: 10000


## 🔄 Step 5: Convert Q&A Data to HuggingFace Dataset
We format the list of question–answer pairs into a HuggingFace `Dataset` object and split it into training and validation sets.


In [22]:
from datasets import Dataset

qa_formatted = [{"input_text": q, "target_text": a} for q, a in train_data]
dataset = Dataset.from_list(qa_formatted)
dataset = dataset.train_test_split(test_size=0.1)  # 90% train, 10% validation

dataset




DatasetDict({
    train: Dataset({
        features: ['input_text', 'target_text'],
        num_rows: 9000
    })
    test: Dataset({
        features: ['input_text', 'target_text'],
        num_rows: 1000
    })
})
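The split above is random, so the train and validation rows change on every run. `train_test_split` accepts a `seed` argument for reproducibility; the sketch below shows the same idea in plain Python on placeholder pairs. `split_pairs` and the seed value 42 are assumptions for illustration only.

```python
import random

def split_pairs(pairs, test_size=0.1, seed=42):
    """Shuffle with a fixed seed, then cut off the first test_size fraction."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder data standing in for the generated Q&A pairs
pairs = [(f"question {i}", f"answer {i}") for i in range(10000)]
train, test = split_pairs(pairs)
print(len(train), len(test))  # 9000 1000
```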

## ✂️ Step 6: Tokenize with T5 Tokenizer
Each question and answer is tokenized using the T5 tokenizer. We pad/truncate as needed to ensure consistent input shapes.


In [23]:
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

def preprocess(example):
    input_enc = tokenizer(example["input_text"], padding="max_length", truncation=True, max_length=64)
    target_enc = tokenizer(example["target_text"], padding="max_length", truncation=True, max_length=32)
    input_enc["labels"] = target_enc["input_ids"]
    return input_enc

tokenized_dataset = dataset.map(preprocess, batched=True)


Map: 100%|████████████████████████████████████████████████████████████████| 9000/9000 [00:05<00:00, 1550.40 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 1626.81 examples/s]
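One detail worth checking in `preprocess` is that the padded target IDs are used directly as labels, so the loss is also computed on padding positions. A common variant replaces padding with -100, which the cross-entropy loss ignores. The sketch below shows that masking on hand-written token IDs; `mask_padding` is a hypothetical helper and the IDs are made up, assuming only that T5's pad token ID is 0.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

def mask_padding(label_ids, pad_token_id=0):
    """Replace pad token IDs with IGNORE_INDEX in a batch of label sequences."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Made-up IDs: two targets padded to length 5 (1 is T5's end-of-sequence ID)
batch = [[71, 8, 1, 0, 0], [5, 1, 0, 0, 0]]
print(mask_padding(batch))
# [[71, 8, 1, -100, -100], [5, 1, -100, -100, -100]]
```

In the notebook this would apply to `target_enc["input_ids"]` before assigning it to `input_enc["labels"]`. Training loss values would then no longer be comparable with the ones logged in this notebook.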


## 🧠 Step 7: Fine-Tune T5-small on Our Dataset
We fine-tune a pre-trained T5-small model using our movie Q&A dataset. We use the Trainer API for simplicity.


In [24]:
from transformers import T5ForConditionalGeneration, TrainingArguments, Trainer

model = T5ForConditionalGeneration.from_pretrained("t5-small")

training_args = TrainingArguments(
    output_dir="./movie-t5",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    logging_steps=10,
    learning_rate=5e-4,
    weight_decay=0.01,
    logging_dir="./logs",
    save_safetensors=False  # save PyTorch .bin checkpoints instead of safetensors (workaround for a checkpoint-saving issue)
)
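For a sense of scale, the arithmetic below estimates how many optimizer steps these settings imply. It assumes a single device and no gradient accumulation, neither of which is stated in the notebook; `total_training_steps` is a hypothetical helper.

```python
import math

def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps on one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

steps = total_training_steps(num_examples=9000, batch_size=4, epochs=5)
print(steps)        # 11250
print(steps // 10)  # 1125 log lines at logging_steps=10
```

Under these assumptions, a loss log that stops at step 100 covers 400 training examples, well under a tenth of the first epoch.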



In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
)

trainer.train()





Step,Training Loss
10,4.0677
20,1.3687
30,1.2254
40,0.8925
50,0.9189
60,0.9014
70,0.8287
80,0.7638
90,0.8208
100,0.8575


## 💾 Step 8: Save Fine-Tuned Model
After training, we save the model and tokenizer for future use and inference.


In [None]:
trainer.save_model("./movie-t5")
tokenizer.save_pretrained("./movie-t5")


## 🤖 Step 9: Ask the Chatbot
We test the chatbot with custom movie-related prompts and print the model’s predictions.


In [None]:
def ask(question):
    inputs = tokenizer(question, return_tensors="pt", padding=True).to(model.device)  # keep inputs on the model's device (GPU after training)
    outputs = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Try some questions
print(ask("Suggest a comedy movie"))
print(ask("Suggest a movie with Kevin Hart"))
print(ask("Suggest a short movie"))
print(ask("Suggest a movie directed by Christopher Nolan"))


In [None]:
response = ask("Suggest a comedy movie")
print("RESPONSE:", response)


In [None]:
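# Step 1: Build a set of real movie titles
all_movie_titles = set(merged_df["original_title"].dropna().tolist())

# Step 2: Define the new response function with title verification
def recommend_real_movie(question):
    inputs = tokenizer(question, return_tensors="pt", padding=True).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=32, num_beams=4)
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()

    # An empty prediction is a substring of every title, so reject it up front
    if not answer:
        return f"🤖 Predicted: {answer}\n❌ No real match found"

    # Sort the titles so the result does not depend on set iteration order
    titles = sorted(all_movie_titles)

    # Prefer an exact (case-insensitive) match over a partial one
    for title in titles:
        if answer.lower() == title.lower():
            return f"🤖 Predicted: {answer}\n🎥 Matched: {title}"

    for title in titles:
        if answer.lower() in title.lower() or title.lower() in answer.lower():
            return f"🤖 Predicted: {answer}\n🎥 Matched: {title}"

    return f"🤖 Predicted: {answer}\n❌ No real match found"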


In [None]:
print(recommend_real_movie("Suggest a horror movie"))
print(recommend_real_movie("Suggest a movie with Tom Hanks"))


## 🎨 Step 10 (Optional): Build a Gradio UI
Use Gradio to launch a web interface that allows users to type in movie questions and get real-time answers.


In [None]:
import gradio as gr

def recommend_movie(user_input):
    inputs = tokenizer(user_input, return_tensors="pt", padding=True).to(model.device)  # keep inputs on the model's device
    outputs = model.generate(**inputs, max_new_tokens=32, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

gr.Interface(
    fn=recommend_movie,
    inputs=gr.Textbox(label="Ask me for a movie recommendation! 🎬"),
    outputs=gr.Textbox(label="🎥 Movie Suggestion"),
    title="🎬 Movie Recommender Chatbot",
    description="Type anything like: 'Suggest a comedy movie' or 'I want a short movie with Dwayne Johnson'",
    theme="soft"
).launch()


## ✅ Conclusion

This project demonstrates how transformer models like T5-small can be fine-tuned for practical NLP applications such as movie recommendation. By training the model on thousands of templated question–answer pairs derived from real movie metadata, the chatbot learns to map a variety of natural language prompts to movie titles.

The notebook does not include a quantitative evaluation, and the model can generate titles that do not exist in the dataset. Further improvements could include:
- Expanding and diversifying the training dataset further
- Using retrieval-based techniques to avoid hallucinated movie titles
- Deploying the chatbot to a cloud environment for public access
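As a lightweight step toward the second point, a generated title can be snapped to the closest known title with the standard library's `difflib`. This is fuzzy string lookup, a much simpler stand-in for a retrieval component; `snap_to_known_title` and the 0.6 cutoff are assumptions made for this sketch.

```python
import difflib

def snap_to_known_title(prediction, known_titles, cutoff=0.6):
    """Return the known title closest to `prediction`, or None if none is close."""
    cleaned = prediction.strip().lower()
    if not cleaned:
        return None
    # Compare in lowercase, but return the title in its original casing
    lookup = {title.lower(): title for title in known_titles}
    matches = difflib.get_close_matches(cleaned, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None

titles = ["Avatar", "Spectre", "The Dark Knight Rises", "John Carter"]
print(snap_to_known_title("the dark knight rise", titles))  # The Dark Knight Rises
print(snap_to_known_title("qqqq", titles))                  # None
```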

Overall, this project highlights the power and flexibility of NLP fine-tuning for building intelligent, real-world assistants.
