## Technical Metadata Generation
In this notebook, we extract the insights gathered from analyzing past movie trailers in the previous two homework assignments. The insights are summarized at a higher level than raw numbers and packed into a JSON string, which an LLM then uses to generate a complete trailer script for a given genre.

In [5]:
# import openai # or together, or any other API provider
from together import Together
import pandas as pd
import json
import os

# Set your API key (Together is used here; the OpenAI line is kept as an alternative)
# openai.api_key = "YOUR_API_KEY"
os.environ['TOGETHER_API_KEY'] = "MY_API_KEY"

In [6]:
# Combine all analysis results into a single dataframe

# Load all analysis results
audio_analysis_df = pd.read_csv("data/audio_analysis_results.csv")

# Rename the column from "genre_list" to "genre"
audio_analysis_df = audio_analysis_df.rename(columns={"genre_list": "genre"})

dialogue_analysis_df = pd.read_csv("data/dialogue_analysis_results_without_stopwords.csv")
avg_brightness_df = pd.read_csv("data/avg_brightness.csv")

# Merge all analysis results into a single dataframe
merged_df = pd.merge(audio_analysis_df, dialogue_analysis_df, on="genre", how="inner")
merged_df = pd.merge(merged_df, avg_brightness_df, on="genre", how="inner")

# Display the merged dataframe (uncomment the last line to also save it)
merged_df
# merged_df.to_csv("data/merged_analysis_results.csv", index=False)

Unnamed: 0,genre,tempo,spectral_centroid,zcr,rms,mfcc_1,mfcc_2,mfcc_3,mfcc_4,mfcc_5,...,chroma_7,chroma_8,chroma_9,chroma_10,chroma_11,chroma_12,word_count,sentiment,top_words,average_brightness
0,action,[124.36288436],[1614.87252979],0.090363,0.089993,-210.31418,125.160194,-30.875294,44.779716,-16.835392,...,0.421147,0.443839,0.429335,0.43167,0.455683,0.425768,182,0.2817,"[('game', 4), ('right', 3), ('morning', 2), ('...",49.562532
1,adventure,[123.9790483],[1670.02218864],0.087155,0.069942,-225.72115,120.45447,-24.429781,45.25401,-17.47893,...,0.430831,0.425811,0.430328,0.461064,0.407086,0.391524,303,-0.8956,"[('going', 5), ('game', 5), ('like', 4), ('nev...",59.761
2,comedy,[130.30395407],[1760.75999042],0.095823,0.053118,-241.66449,122.60115,-30.631477,46.187584,-18.369783,...,0.435117,0.448367,0.435995,0.454043,0.425064,0.403464,471,0.9924,"[('think', 7), ('like', 7), ('going', 6), ('an...",60.290589
3,crime,[147.66557173],[1573.35086471],0.076802,0.07616,-230.50964,125.86955,-22.114851,51.3151,-13.494143,...,0.4935,0.473641,0.468546,0.505251,0.472417,0.431782,162,0.852,"[('going', 5), ('think', 3), ('bank', 3), ('ne...",54.669143
4,drama,[133.39005695],[1509.07438125],0.080991,0.078847,-244.03885,126.83167,-20.47563,43.72983,-14.275898,...,0.422245,0.428096,0.430099,0.462502,0.416622,0.403541,471,0.9967,"[('want', 7), ('going', 7), ('think', 6), ('ne...",53.815369
5,fantasy,[122.89151278],[1629.69415315],0.085081,0.112684,-182.9145,129.26941,-26.764755,42.726177,-18.0592,...,0.417521,0.432085,0.411747,0.413174,0.380018,0.405466,148,-0.9618,"[('like', 4), ('game', 4), ('going', 3), ('thi...",45.354886
6,horror,[127.33871177],[1537.61416053],0.085244,0.166445,-154.90012,140.4438,-26.234093,41.11699,-15.794506,...,0.420835,0.439445,0.438435,0.470879,0.452816,0.516977,175,0.9355,"[('candidates', 3), ('carpenter', 2), ('lookin...",47.142939
7,romance,[144.3264836],[1560.1925967],0.088109,0.040701,-286.04572,128.696,-24.969086,39.80083,-11.762167,...,0.394881,0.445149,0.413972,0.450883,0.426805,0.383333,275,0.9963,"[('think', 5), ('days', 5), ('like', 4), ('wan...",46.468858
8,sci-fi,[133.83634665],[1564.78795432],0.087778,0.082127,-224.77054,129.41185,-22.849241,44.23874,-14.348557,...,0.415811,0.413565,0.427172,0.473793,0.450169,0.449706,260,0.0932,"[('animal', 5), ('future', 4), ('something', 3...",61.454051
9,thriller,[140.01599019],[1418.23868292],0.072776,0.082493,-238.09671,132.5928,-16.862095,42.845554,-9.772017,...,0.455109,0.466311,0.480732,0.491756,0.473663,0.482246,275,0.7135,"[('going', 4), ('future', 4), ('think', 3), ('...",42.205618
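The `tempo` and `spectral_centroid` columns above are displayed with square brackets, which suggests they were saved as one-element arrays and may come back from `pd.read_csv` as strings. A minimal sketch of how they could be converted to plain floats, assuming that is the case (the helper name `unwrap_scalar` and the `demo` frame are hypothetical, not part of the original notebook):

```python
import ast

import pandas as pd


def unwrap_scalar(value):
    """Turn a one-element list string such as "[124.36]" into a float."""
    if isinstance(value, str):
        # Safely parse the bracketed text into a Python list
        value = ast.literal_eval(value.strip())
    if isinstance(value, (list, tuple)):
        if len(value) != 1:
            raise ValueError(f"Expected exactly one element, got {value!r}")
        value = value[0]
    return float(value)


# Two rows copied from the table above, for illustration only
demo = pd.DataFrame({
    "genre": ["action", "crime"],
    "tempo": ["[124.36288436]", "[147.66557173]"],
    "spectral_centroid": ["[1614.87252979]", "[1573.35086471]"],
})

for col in ("tempo", "spectral_centroid"):
    demo[col] = demo[col].map(unwrap_scalar)

print(demo.dtypes)
```

Applying the same loop to `merged_df` before building the prompt would give the model bare numbers instead of bracketed strings.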


In [7]:
# Convert dataframe to markdown table for better readability in the prompt
def df_to_markdown(df):
    return df.to_markdown(index=False)

table_str = df_to_markdown(merged_df)

In [8]:
genres_to_analyze = ["fantasy", "romance", "horror", "thriller"]

In [46]:
# Build prompt
prompt = f"""
You are an expert data analyst, musicologist, and movie trailer researcher. You are given the following dataset, containing features extracted from movie trailers across multiple genres. Given the raw features extracted from movie trailers, summarize them into high-level qualitative or quantitative insights by comparing each metric of each genre to one another.

The columns are:
- genre: movie genre string. There are only 10 genres in the dataset.
- tempo: average tempo in BPM of that genre.
- spectral_centroid: average spectral centroid (Hz) of that genre.
- zcr: average zero crossing rate of that genre, as a fraction between 0 and 1.
- rms: average loudness as RMS energy of that genre (not in dB)
- mfcc: average MFCCs of that genre, split into 13 coefficients from mfcc_1 to mfcc_13.
- chroma: average chroma of that genre starting from chroma_1 to chroma_12.
- word_count: average word count of the dialogue and narration in the trailers of that genre.
- top_words: a list of tuples, each containing a word and its average frequency in trailers of that genre. Stop words are removed from this list.
- sentiment: average sentiment of the dialogue and narration in the trailers of that genre ranging from -1 to 1.
- average_brightness: average brightness (unnormalized) of that genre.

Here is the dataset:

{table_str}

Analyze the dataset as a whole. Compare across genres and extract high-level insights about:

- Which genres tend to have higher loudness, tempo, pitch (spectral centroid), brightness, percussiveness (ZCR), word count. Respond with string "low", "medium", "high", "very low", "very high" for these metrics and do not use numbers.
- Which words are characteristic of each genre. Respond with a list of words that are characteristic of each genre and their frequency level as "high", "medium", "low". Do not make up words that don't exist in the dataset.
- Common dominant notes in the genre, and perhaps chords, and the sound characteristics of the genre. Be as descriptive as possible. Summarize the sound characteristics of the genre in a few sentences.
- Which genres tend to have higher positive or negative sentiment in their dialogue and narration. Respond with string "low", "medium", "high", "very low", "very high" for these metrics and do not use numbers.
- Overall patterns or differences across genres summarized in a few sentences.

Return only a valid JSON array with the following structure:
[
  {{
    "genre": "<genre>",
    "word_summary": {{
      "top_words": ["<string>", "<string>", "<string>", "<string>", "<string>"],
      "dialogue_and_narration_sentiment_positivity": "<string>"
    }},
    "audio_summary": {{
      "average_loudness_level": "<string>",
      "average_tempo_level": "<string>",
      "average_pitch_level": "<string>",
      "average_percussiveness_level": "<string>",
      "average_brightness_level": "<string>",
      "sound_characteristics": "<string>"
    }}
  }},
  {{
    "genre": "<genre>",
    "word_summary": {{
      "top_words": ["<string>", "<string>", "<string>", "<string>", "<string>"],
      "dialogue_and_narration_sentiment_positivity": "<string>"
    }},
    "audio_summary": {{
      "average_loudness_level": "<string>",
      "average_tempo_level": "<string>",
      "average_pitch_level": "<string>",
      "average_percussiveness_level": "<string>",
      "average_brightness_level": "<string>",
      "sound_characteristics": "<string>"
    }}
  }},
  ...
]

Do not wrap the output in ```json and ``` fences. Make sure the JSON is valid and that no commas are missing.

and return the result for only the following genres:
{genres_to_analyze}


Be analytical, concise, and avoid repeating the raw numbers. Focus on summarizing meaningful trends and insights.
"""
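The prompt asks the model to turn each metric into a label from "very low" to "very high" by comparing genres. As a cross-check, the same labels can be computed deterministically from ranks. The sketch below is an assumption and not part of the original pipeline: the helper `to_levels` is hypothetical, and the five equal-sized rank buckets are just one possible convention. It uses the `rms` values from the table above.

```python
import pandas as pd

LEVELS = ["very low", "low", "medium", "high", "very high"]


def to_levels(series: pd.Series) -> pd.Series:
    """Label each value by its rank within the series, using five buckets."""
    # Rank 1 is the smallest value; ties share the lowest rank
    ranks = series.rank(method="min").astype(int)
    # Integer arithmetic maps ranks 1..n onto bucket indices 0..4
    bucket = (ranks - 1) * len(LEVELS) // len(series)
    return bucket.map(lambda i: LEVELS[i])


# rms values copied from the merged table above
rms = pd.Series({
    "action": 0.089993, "adventure": 0.069942, "comedy": 0.053118,
    "crime": 0.07616, "drama": 0.078847, "fantasy": 0.112684,
    "horror": 0.166445, "romance": 0.040701, "sci-fi": 0.082127,
    "thriller": 0.082493,
})

loudness_levels = to_levels(rms)
print(loudness_levels[["fantasy", "romance", "horror", "thriller"]])
```

Under this convention romance has the lowest `rms` in the table and fantasy the second highest, so comparing these labels with the model's answer is a quick way to spot labels that contradict the data.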

In [47]:
# Call LLM API
system_intel = "You are a helpful senior data scientist, Hollywood movie trailer expert, and renowned composer.\nKnowledge cutoff: 2021-09-01\nCurrent date: 2023-03-02"

client = Together(api_key=os.environ.get("TOGETHER_API_KEY"))
response = client.chat.completions.create(
        messages=[{"role": "system",
                   "content": system_intel},
                  {"role": "user", "content": prompt}],
        # model="meta-llama/Llama-3.3-70B-Instruct-Turbo-Free",
        model="deepseek-ai/DeepSeek-R1-Distill-Llama-70B-free",
    )

summary_json = response.choices[0].message.content

print(summary_json)

<think>
Alright, I need to help the user by analyzing the provided dataset of movie trailer features across different genres. The user has specifically asked for insights on four genres: fantasy, romance, horror, and thriller. 

First, I'll go through each of the required sections: loudness, tempo, pitch, brightness, percussiveness, word count, characteristic words, sound characteristics, sentiment, and overall patterns.

Starting with loudness, I'll look at the RMS values. Comparing the four genres, horror has the highest RMS, indicating very high loudness. Thriller is next, so high. Romance is lower, around medium, and fantasy is the lowest, so very low.

For tempo, I'll check the BPM. Romance has the highest tempo, so very high. Thriller is next, high. Fantasy is lower, medium, and horror is the lowest, low.

Pitch is determined by the spectral centroid. Fantasy has the highest, so very high. Horror is next, high. Thriller is medium, and romance is low.

Brightness is straightforwar

In [48]:
# Fix the json format
import re
raw_response = summary_json

# Step 1: Remove all <think>...</think> blocks (non-greedy, so each block is matched separately)
clean_response = re.sub(r"<think>.*?</think>", "", raw_response, flags=re.DOTALL)

# Step 2: Optional - strip whitespace
clean_response = clean_response.strip()

# # Step 3: If there are multiple JSON objects concatenated, split and parse each
json_objects = []
# for match in re.finditer(r'{.*?}', clean_response, re.DOTALL):
#     try:
#         obj = json.loads(match.group())
#         json_objects.append(obj)
#     except json.JSONDecodeError as e:
#         print("Error decoding JSON object:", e)

# # Step 4: Use or save the cleaned JSON
# for obj in json_objects:
#     print(json.dumps(obj, indent=2))

# print(json_objects)
print(clean_response)

[]
[
  {
    "genre": "fantasy",
    "word_summary": {
      "top_words": ["like", "game", "going", "think", "monsters"],
      "dialogue_and_narration_sentiment_positivity": "very low"
    },
    "audio_summary": {
      "average_loudness_level": "very low",
      "average_tempo_level": "medium",
      "average_pitch_level": "very high",
      "average_percussiveness_level": "high",
      "average_brightness_level": "very high",
      "sound_characteristics": "Fantasy trailers often feature bright, airy soundscapes with high spectral centroids, suggesting ethereal or magical elements. The MFCCs indicate a balance between mid and high-frequency content, creating a sense of openness and wonder. Chroma features suggest a preference for major keys, contributing to a sense of adventure and exploration."
    }
  },
  {
    "genre": "romance",
    "word_summary": {
      "top_words": ["think", "days", "like", "want", "marry"],
      "dialogue_and_narration_sentiment_positivity": "very high"
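The cleanup above removes the reasoning block, but a model may still wrap the JSON in code fences or add a sentence before it. A slightly more defensive variant could look like the sketch below; `extract_json_array` is a hypothetical helper, and it assumes the answer contains exactly one top-level JSON array.

```python
import json
import re


def extract_json_array(raw: str):
    """Strip reasoning blocks and code fences, then parse the JSON array."""
    # Remove every <think>...</think> block
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    # Remove markdown fences (three backticks, optionally followed by "json")
    text = re.sub(r"`{3}(?:json)?", "", text)
    # Keep only the span from the first "[" to the last "]"
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("No JSON array found in the response.")
    return json.loads(text[start:end + 1])


# Illustrative input only; not an actual model response
fence = "`" * 3
raw_demo = (
    "<think>reasoning that mentions [brackets]</think>\n"
    + fence + 'json\n[{"genre": "fantasy"}]\n' + fence
)
print(extract_json_array(raw_demo))
```

A response whose `<think>` block is never closed would still fail here, which is arguably the right outcome since the answer was cut off.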


In [51]:
# Parse the response and save to local file
try:
    parsed = json.loads(clean_response)
except json.JSONDecodeError as e:
    # Chain the original error so the position of the bad character is kept
    raise ValueError("Failed to decode JSON. Check the response format.") from e

output_path = 'data/technical_metadata.json'
with open(output_path, 'w') as f:
    json.dump(parsed, f, indent=4)

print(f"Metadata saved to {output_path}")
print(parsed)

Metadata saved to data/technical_metadata.json
[{'genre': 'fantasy', 'word_summary': {'top_words': ['like', 'game', 'going', 'think', 'monsters'], 'dialogue_and_narration_sentiment_positivity': 'very low'}, 'audio_summary': {'average_loudness_level': 'very low', 'average_tempo_level': 'medium', 'average_pitch_level': 'very high', 'average_percussiveness_level': 'high', 'average_brightness_level': 'very high', 'sound_characteristics': 'Fantasy trailers often feature bright, airy soundscapes with high spectral centroids, suggesting ethereal or magical elements. The MFCCs indicate a balance between mid and high-frequency content, creating a sense of openness and wonder. Chroma features suggest a preference for major keys, contributing to a sense of adventure and exploration.'}}, {'genre': 'romance', 'word_summary': {'top_words': ['think', 'days', 'like', 'want', 'marry'], 'dialogue_and_narration_sentiment_positivity': 'very high'}, 'audio_summary': {'average_loudness_level': 'medium', 'av
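Parsing succeeds whenever the text is valid JSON, even if a key is missing or a level contains a number. Before the file is used downstream, each entry could be checked against the structure requested in the prompt. This is a minimal sketch with a hypothetical helper `find_problems`; the key names and allowed levels are taken from the prompt above, and the sample entry reuses values from the printed output.

```python
ALLOWED_LEVELS = {"very low", "low", "medium", "high", "very high"}
LEVEL_KEYS = (
    "average_loudness_level",
    "average_tempo_level",
    "average_pitch_level",
    "average_percussiveness_level",
    "average_brightness_level",
)
SENTIMENT_KEY = "dialogue_and_narration_sentiment_positivity"


def find_problems(entry, expected_genres):
    """Return a list of human-readable problems for one genre entry."""
    problems = []
    if entry.get("genre") not in expected_genres:
        problems.append(f"unexpected genre: {entry.get('genre')!r}")
    words = entry.get("word_summary", {})
    if not isinstance(words.get("top_words"), list):
        problems.append("word_summary.top_words is missing or not a list")
    if words.get(SENTIMENT_KEY) not in ALLOWED_LEVELS:
        problems.append(f"word_summary.{SENTIMENT_KEY} is not an allowed level")
    audio = entry.get("audio_summary", {})
    for key in LEVEL_KEYS:
        if audio.get(key) not in ALLOWED_LEVELS:
            problems.append(f"audio_summary.{key} is not an allowed level")
    if not isinstance(audio.get("sound_characteristics"), str):
        problems.append("audio_summary.sound_characteristics is missing")
    return problems


genres = ["fantasy", "romance", "horror", "thriller"]
sample = {
    "genre": "fantasy",
    "word_summary": {
        "top_words": ["like", "game", "going", "think", "monsters"],
        SENTIMENT_KEY: "very low",
    },
    "audio_summary": {
        "average_loudness_level": "very low",
        "average_tempo_level": "medium",
        "average_pitch_level": "very high",
        "average_percussiveness_level": "high",
        "average_brightness_level": "very high",
        "sound_characteristics": "Fantasy trailers often feature bright, "
        "airy soundscapes with high spectral centroids, suggesting "
        "ethereal or magical elements.",
    },
}
print(find_problems(sample, genres))
```

Running `find_problems` over every item in `parsed` and stopping on a non-empty result would catch schema drift, such as a model answering with the alternative sentiment key name.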