# Genre-Based Speech Analysis of Movie Trailers

This notebook transcribes the speech in movie trailers and analyzes it by **genre**.  
Each trailer can belong to multiple genres.

We will:
- Transcribe audio using Whisper
- Label narration vs. dialogue
- Expand multi-genre labels
- Analyze word count, sentiment, word frequency, and word clouds **per genre**


In [None]:
!pip install git+https://github.com/openai/whisper.git
!pip install torch pandas matplotlib seaborn wordcloud textblob


## Import Libraries

We import the Whisper model and supporting tools for analysis.


In [None]:
import os
import whisper
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from textblob import TextBlob
from collections import Counter
import string


## Transcribe All Audio Files

This block loads Whisper and transcribes each `.wav` file into timestamped segments.


In [None]:
model = whisper.load_model("base")

def transcribe_audio(file_path):
    result = model.transcribe(file_path)
    return result.get("segments", [])

audio_dir = "../data/audio"
transcripts = []

for fname in os.listdir(audio_dir):
    if fname.endswith(".wav"):
        print(f"Transcribing {fname}")
        segments = transcribe_audio(os.path.join(audio_dir, fname))
        for seg in segments:
            transcripts.append({
                "file": fname,
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"]
            })

df = pd.DataFrame(transcripts)
df.to_csv("../data/transcripts.csv", index=False)
df.head()
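
The flattening step can be checked without running the model. The sketch below uses a hand-written dictionary standing in for the result of `model.transcribe` (the segment values are invented) and a hypothetical helper, `segments_to_rows`, that mirrors the loop above. It also strips the leading whitespace that Whisper often leaves on segment text, which is an optional cleanup and not part of the original loop.

```python
import pandas as pd

# Hand-written stand-in for the dict returned by model.transcribe();
# the values are invented for illustration.
fake_result = {
    "text": " In a world on the brink. Run!",
    "language": "en",
    "segments": [
        {"id": 0, "start": 0.0, "end": 3.2, "text": " In a world on the brink."},
        {"id": 1, "start": 3.2, "end": 4.0, "text": " Run!"},
    ],
}

def segments_to_rows(fname, result):
    """Hypothetical helper: flatten one transcription result into row dicts."""
    return [
        {
            "file": fname,
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"].strip(),  # optional cleanup of leading spaces
        }
        for seg in result.get("segments", [])
    ]

rows = segments_to_rows("example.wav", fake_result)
demo_df = pd.DataFrame(rows)
print(demo_df)
```

A result with no `"segments"` key yields an empty list, so silent or failed files contribute no rows.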


## Merge Genre Metadata

We load `mtgc.csv` and attach all relevant genre labels to each trailer segment.


In [None]:
meta = pd.read_csv("../data/mtgc.csv")
genre_cols = ['action', 'adventure', 'comedy', 'crime', 'drama',
              'fantasy', 'horror', 'romance', 'sci-fi', 'thriller']

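# Assumption: each audio file is named "<mid>.wav"; adjust both keys if your filenames differ
meta["base_file"] = meta["mid"].astype(str)
df["base_file"] = df["file"].str.replace(r"\.wav$", "", regex=True)  # key must exist on both sides

df = df.merge(meta[["base_file"] + genre_cols], on="base_file", how="left")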
df.head()
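
A left merge keeps every transcript row, so files with no metadata match pass through silently with `NaN` genre values. The sketch below shows one way to surface them using the `indicator` option of `DataFrame.merge`. The tables are toy stand-ins with invented values, and the `<mid>.wav` naming scheme is an assumption.

```python
import os
import pandas as pd

# Toy stand-ins for the transcript and metadata tables (values invented).
toy_df = pd.DataFrame({
    "file": ["101.wav", "101.wav", "999.wav"],
    "text": ["In a world...", "Run!", "Coming soon."],
})
toy_meta = pd.DataFrame({"mid": [101, 202], "action": [1, 0], "comedy": [0, 1]})

# Assumption: each audio file is named "<mid>.wav".
toy_df["base_file"] = toy_df["file"].map(lambda f: os.path.splitext(f)[0])
toy_meta["base_file"] = toy_meta["mid"].astype(str)

merged = toy_df.merge(
    toy_meta[["base_file", "action", "comedy"]],
    on="base_file",
    how="left",
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# Files that found no metadata row keep NaN in every genre column.
unmatched = merged.loc[merged["_merge"] == "left_only", "file"].unique()
print("Files without metadata:", list(unmatched))
```

Rows flagged `left_only` would later be dropped by the `is_genre == 1` filter, so checking them here explains any trailers that go missing downstream.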


## Expand Genre Labels (One Row Per Genre)

Each trailer may belong to multiple genres.  
We explode the DataFrame so each row reflects one (file, genre) pair.


In [None]:
# Melt binary genre columns into rows
df_genres = df.melt(
    id_vars=["file", "start", "end", "text", "base_file"],
    value_vars=genre_cols,
    var_name="genre",
    value_name="is_genre"
)

# Keep only rows where genre is active
df_genres = df_genres[df_genres["is_genre"] == 1].drop(columns="is_genre").reset_index(drop=True)
df_genres.head()
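
To see what the melt does, here is a minimal sketch on a toy frame with invented values: one trailer tagged with two genres and one tagged with a single genre.

```python
import pandas as pd

# Toy frame: 101.wav is action + thriller, 202.wav is romance only.
toy = pd.DataFrame({
    "file": ["101.wav", "202.wav"],
    "text": ["Run!", "I love you."],
    "action": [1, 0],
    "thriller": [1, 0],
    "romance": [0, 1],
})

long_df = toy.melt(
    id_vars=["file", "text"],
    value_vars=["action", "thriller", "romance"],
    var_name="genre",
    value_name="is_genre",
)

# Keep only the (segment, genre) pairs where the flag is set.
long_df = long_df[long_df["is_genre"] == 1].drop(columns="is_genre").reset_index(drop=True)
print(long_df)
```

The segment from `101.wav` appears twice, once per genre. A segment from a trailer with k genres is therefore counted k times, so the per-genre statistics are not independent samples.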


## Label Segments as Narration or Dialogue

We use a basic heuristic to distinguish between narration and dialogue:
- Segments longer than 10 words with at most one period = narration
- Everything else (short or heavily punctuated lines) = dialogue


In [None]:
def classify_speech(text):
    if len(text.split()) > 10 and text.count(".") <= 1:
        return "narration"
    else:
        return "dialogue"

df_genres["type"] = df_genres["text"].apply(classify_speech)
df_genres["word_count"] = df_genres["text"].apply(lambda x: len(x.split()))
df_genres["sentiment"] = df_genres["text"].apply(lambda x: TextBlob(x).sentiment.polarity)
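
The heuristic is easy to probe on a few hand-written lines. The sketch below repeats `classify_speech` so that it runs on its own; the sample sentences are invented.

```python
def classify_speech(text):
    # Same rule as above: more than 10 words and at most one period.
    if len(text.split()) > 10 and text.count(".") <= 1:
        return "narration"
    return "dialogue"

samples = {
    "short": "Run!",
    "long_single_sentence": "In a world where nothing is what it seems one man must find the truth.",
    "long_many_sentences": "I told you. I told you not to come here. Now look at what you have done.",
    "ellipsis": "This summer... everything you thought you knew about love is about to change forever",
}

labels = {name: classify_speech(text) for name, text in samples.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

The last sample shows a limitation: an ellipsis counts as three periods, so a narration-style line containing `...` is labeled dialogue.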


## Analyze Per Genre

For each genre and speech type, we'll generate:
- Word count histogram
- Sentiment distribution
- Dialogue vs. narration segment counts
- Word cloud
- Top 10 most frequent words


In [None]:
genre_stats = []

for genre in genre_cols:
    genre_df = df_genres[df_genres["genre"] == genre]
    
    if genre_df.empty:
        continue

    print(f"\nGenre: {genre.upper()} ({len(genre_df)} segments total)")

    for speech_type in ["dialogue", "narration"]:
        subset = genre_df[genre_df["type"] == speech_type]
        if subset.empty:
            print(f"No {speech_type} found for {genre}")
            continue

        print(f"\n{speech_type.capitalize()} in {genre.capitalize()}")

        # Store stats
        genre_stats.append({
            "genre": genre,
            "type": speech_type,
            "avg_word_count": subset["word_count"].mean(),
            "avg_sentiment": subset["sentiment"].mean(),
            "segment_count": len(subset)
        })

        # Histogram: word count
        sns.histplot(subset["word_count"], bins=20, kde=True)
        plt.title(f"Word Count - {speech_type.capitalize()} - {genre.capitalize()}")
        plt.xlabel("Word Count")
        plt.show()

        # Sentiment histogram
        sns.histplot(subset["sentiment"], bins=20, kde=True)
        plt.title(f"Sentiment - {speech_type.capitalize()} - {genre.capitalize()}")
        plt.xlabel("Sentiment Polarity")
        plt.show()

        # Word cloud
        text = " ".join(subset["text"])
        wc = WordCloud(width=800, height=400, background_color="white").generate(text)
        plt.imshow(wc, interpolation="bilinear")
        plt.axis("off")
        plt.title(f"Word Cloud - {speech_type.capitalize()} - {genre.capitalize()}")
        plt.show()

        # Top 10 words
        words = subset["text"].str.lower().str.replace(r"[^\w\s]", "", regex=True).str.split().explode()
        top_words = words.value_counts().head(10)
        print("Top 10 Words:")
        print(top_words)
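
Raw frequency counts like the one above tend to be dominated by function words such as "the" and "you". As an optional variation, not part of the analysis above, the sketch below filters a small hand-picked stop-word list before counting. The helper `top_words` and the `STOPWORDS` set are hypothetical, and `string.punctuation` covers ASCII punctuation only.

```python
import string
from collections import Counter

# Small hand-picked stop-word list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "you", "i"}

def top_words(texts, n=10, stopwords=STOPWORDS):
    """Hypothetical helper: count the most frequent non-stop-words."""
    table = str.maketrans("", "", string.punctuation)  # strips ASCII punctuation
    counts = Counter()
    for text in texts:
        tokens = text.lower().translate(table).split()
        counts.update(tok for tok in tokens if tok not in stopwords)
    return counts.most_common(n)

sample = ["The hunt is on.", "You cannot run from the hunt!", "Run, run, run."]
print(top_words(sample, n=2))  # [('run', 4), ('hunt', 2)]
```

A fuller stop-word list would change the ranking, so treat the list as a tuning choice and report which one was used.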


In [None]:
summary = pd.DataFrame(genre_stats)
summary = summary[["genre", "type", "segment_count", "avg_word_count", "avg_sentiment"]]
summary.round(2)
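
The dialogue vs. narration balance per genre can be derived from the segment counts. The sketch below is one possible way to do it, shown on a toy summary table with invented numbers; the `dialogue_share` column name is hypothetical.

```python
import pandas as pd

# Toy summary with invented counts; comedy has no narration rows at all.
toy_summary = pd.DataFrame({
    "genre": ["action", "action", "comedy"],
    "type": ["dialogue", "narration", "dialogue"],
    "segment_count": [30, 10, 12],
})

counts = (
    toy_summary
    .pivot_table(index="genre", columns="type", values="segment_count",
                 aggfunc="sum", fill_value=0)
    # Guarantee both columns exist even if one speech type never occurs.
    .reindex(columns=["dialogue", "narration"], fill_value=0)
)

# Share of segments labeled dialogue within each genre.
counts["dialogue_share"] = counts["dialogue"] / (counts["dialogue"] + counts["narration"])
print(counts.round(2))
```

Using a share instead of a dialogue-to-narration ratio avoids division by zero for genres with no narration segments.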
