<a href="https://colab.research.google.com/github/frank-morales2020/MLxDL/blob/main/FINAL_MOVIEAGENT_DEMO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

https://www.kaggle.com/code/imgowthamg/movie-genre-classification/notebook

## Components

In [None]:
!pip install rake-nltk -q
!pip install nltk -q
!pip install kagglehub -q
!pip install tqdm -q

!pip install google-generativeai -q
!pip install backoff -q
!pip install textblob -q
!pip install colab-env -q

import colab_env

  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for colab-env (setup.py) ... [?25l[?25hdone
Mounted at /content/gdrive


## ML MODEL

REALITY-TV VERSUS DOCUMENTARY

In [None]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score, classification_report
from rake_nltk import Rake
import nltk
import re
import string
from tqdm import tqdm
from nltk.stem import LancasterStemmer

nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('wordnet')

# --- clean_text function ---
def clean_text(text):
    text = text.lower()  # Lowercase all characters
    text = re.sub(r'@\S+', '', text)  # Remove Twitter handles
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'pic.\S+', '', text)
    text = re.sub(r"[^a-zA-Z+']", ' ', text)  # Keep only characters
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text + ' ')  # Keep words with length > 1 only
    text = "".join([i for i in text if i not in string.punctuation])
    words = nltk.word_tokenize(text)
    stopwords = nltk.corpus.stopwords.words('english')  # Remove stopwords
    text = " ".join([i for i in words if i not in stopwords and len(i) > 2])
    text = re.sub("\s[\s]+", " ", text).strip()  # Remove repeated/leading/trailing spaces
    return text
# --- End of clean_text function ---

import kagglehub

!rm -rf /root/.cache/kagglehub/

# Download latest version
path = kagglehub.dataset_download("hijest/genre-classification-dataset-imdb")
print("Path to dataset files:", path)

# --- File paths and data loading ---
train_path = "/root/.cache/kagglehub/datasets/hijest/genre-classification-dataset-imdb/versions/1/Genre Classification Dataset/train_data.txt"
test_path = "/root/.cache/kagglehub/datasets/hijest/genre-classification-dataset-imdb/versions/1/Genre Classification Dataset/test_data.txt"

# Load only 100 records for POC
train_data = pd.read_csv(train_path, sep=':::', names=['Title', 'Genre', 'Description'], engine='python', nrows=7000)

# Reduce test data size by 9 times (assuming original size is 54214)
test_data_size = len(pd.read_csv(test_path, sep=':::', names=['Id', 'Title', 'Description'], engine='python'))
reduced_test_size = int(test_data_size / 3)  # Calculate the reduced size
test_data = pd.read_csv(test_path, sep=':::', names=['Id', 'Title', 'Description'], engine='python', nrows=reduced_test_size)
# --- End of file paths and data loading ---

# --- Preprocessing ---
train_data['Text_cleaning'] = train_data['Description'].apply(clean_text)
test_data['Text_cleaning'] = test_data['Description'].apply(clean_text)
# --- End of preprocessing ---

# --- Feature Engineering for Crime and Reality TV ---
crime_keywords = ["investigation", "detective", "murder", "police", "suspect", "court", "trial", "prison", "crime", "criminal"]
reality_tv_keywords = ["personal stories", "dramatic situations", "competition", "unscripted", "dramatization", "reconstruction", "confessionals", "elimination", "reality", "show"]

def crime_keyword_feature(text):
    count = 0
    for keyword in crime_keywords:
        count += len(re.findall(keyword, text.lower()))
    return count

def reality_tv_keyword_feature(text):
    count = 0
    for keyword in reality_tv_keywords:
        count += len(re.findall(keyword, text.lower()))
    return count

train_data['Crime_Keywords'] = train_data['Text_cleaning'].apply(crime_keyword_feature)
train_data['Reality_TV_Keywords'] = train_data['Text_cleaning'].apply(reality_tv_keyword_feature)

test_data['Crime_Keywords'] = test_data['Text_cleaning'].apply(crime_keyword_feature)
test_data['Reality_TV_Keywords'] = test_data['Text_cleaning'].apply(reality_tv_keyword_feature)
# --- End of Feature Engineering ---

# --- Create TF-IDF features ---
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # Include bi-grams
X_train_tfidf = tfidf_vectorizer.fit_transform(train_data['Text_cleaning'])
X_test_tfidf = tfidf_vectorizer.transform(test_data['Text_cleaning'])

# --- Combine TF-IDF features with engineered features ---
X_train = pd.DataFrame(X_train_tfidf.toarray())  # Convert to DataFrame
X_test = pd.DataFrame(X_test_tfidf.toarray())



X_train['Crime_Keywords'] = train_data['Crime_Keywords']
X_train['Reality_TV_Keywords'] = train_data['Reality_TV_Keywords']

X_test['Crime_Keywords'] = test_data['Crime_Keywords']
X_test['Reality_TV_Keywords'] = test_data['Reality_TV_Keywords']

# Convert column names to strings
X_train.columns = X_train.columns.astype(str)  # Fix here
X_test.columns = X_test.columns.astype(str)   # Fix here


# --- End of Feature Creation ---

# --- Splitting and Evaluation ---
X = X_train
y = train_data['Genre']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


# Imputation
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)
X_val = imputer.transform(X_val)


# Initialize and train a LinearSVC classifier
classifier = LinearSVC()
classifier.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = classifier.predict(X_val)

# Evaluate the performance of the model
accuracy = accuracy_score(y_val, y_pred)
print("Validation Accuracy:", accuracy)
print(classification_report(y_val, y_pred))
# --- End of Splitting and Evaluation ---

# --- Prediction on test data and output formatting ---
y_pred_test = classifier.predict(X_test)  # Predict on test data

# Create a DataFrame for predictions
predictions_df = pd.DataFrame({'Id': test_data['Id'], 'Predicted_Genre': y_pred_test})

# Print or save the predictions
print("\nPredictions on Test Data:")
print(predictions_df)
# --- End of Prediction on test data and output formatting ---

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Downloading from https://www.kaggle.com/api/v1/datasets/download/hijest/genre-classification-dataset-imdb?dataset_version_number=1...


100%|██████████| 41.7M/41.7M [00:00<00:00, 239MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/hijest/genre-classification-dataset-imdb/versions/1




Validation Accuracy: 0.5321428571428571
               precision    recall  f1-score   support

      action        0.33      0.03      0.05        36
       adult        0.00      0.00      0.00        19
   adventure        0.33      0.07      0.11        15
   animation        0.00      0.00      0.00         9
   biography        0.00      0.00      0.00         7
      comedy        0.49      0.42      0.45       194
       crime        0.00      0.00      0.00        12
 documentary        0.57      0.89      0.70       359
       drama        0.49      0.81      0.61       361
      family        0.00      0.00      0.00        31
     fantasy        0.00      0.00      0.00         8
   game-show        1.00      0.50      0.67         2
     history        0.00      0.00      0.00         6
      horror        0.54      0.28      0.37        46
       music        1.00      0.08      0.14        13
     musical        0.00      0.00      0.00         2
     mystery        0.00

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))



Predictions on Test Data:
          Id Predicted_Genre
0          1          drama 
1          2          drama 
2          3    documentary 
3          4          drama 
4          5          drama 
...      ...             ...
18061  18062          drama 
18062  18063         comedy 
18063  18064          drama 
18064  18065          drama 
18065  18066          drama 

[18066 rows x 2 columns]


PROD

In [None]:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score, classification_report
from rake_nltk import Rake
import nltk
import re
import string
from tqdm import tqdm
from nltk.stem import LancasterStemmer

import warnings
warnings.filterwarnings("ignore")

nltk.download('stopwords')
nltk.download('punkt_tab')
nltk.download('wordnet')

# --- clean_text function ---
def clean_text(text):
    text = text.lower()  # Lowercase all characters
    text = re.sub(r'@\S+', '', text)  # Remove Twitter handles
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'pic.\S+', '', text)
    text = re.sub(r"[^a-zA-Z+']", ' ', text)  # Keep only characters
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text + ' ')  # Keep words with length > 1 only
    text = "".join([i for i in text if i not in string.punctuation])
    words = nltk.word_tokenize(text)
    stopwords = nltk.corpus.stopwords.words('english')  # Remove stopwords
    text = " ".join([i for i in words if i not in stopwords and len(i) > 2])
    text = re.sub("\s[\s]+", " ", text).strip()  # Remove repeated/leading/trailing spaces
    return text
# --- End of clean_text function ---

import kagglehub

!rm -rf /root/.cache/kagglehub/

# Download latest version
path = kagglehub.dataset_download("hijest/genre-classification-dataset-imdb")
print("Path to dataset files:", path)

print('\n')
print('File paths and data loading....')
print('\n')

# --- File paths and data loading ---
train_path = "/root/.cache/kagglehub/datasets/hijest/genre-classification-dataset-imdb/versions/1/Genre Classification Dataset/train_data.txt"
test_path = "/root/.cache/kagglehub/datasets/hijest/genre-classification-dataset-imdb/versions/1/Genre Classification Dataset/test_data.txt"

train_data = pd.read_csv(train_path, sep=':::', names=['Title', 'Genre', 'Description'], engine='python')
test_data = pd.read_csv(test_path, sep=':::', names=['Id', 'Title', 'Description'], engine='python')
# --- End of file paths and data loading ---

print('\n')
print('Preprocessing....')
print('\n')
# --- Preprocessing ---
train_data['Text_cleaning'] = train_data['Description'].apply(clean_text)
test_data['Text_cleaning'] = test_data['Description'].apply(clean_text)
# --- End of preprocessing ---

# --- Create TF-IDF features ---
tfidf_vectorizer = TfidfVectorizer()
X_train = tfidf_vectorizer.fit_transform(train_data['Text_cleaning'])
X_test = tfidf_vectorizer.transform(test_data['Text_cleaning'])
# --- End of Create TF-IDF features ---

# --- Splitting and Evaluation ---
X = X_train
y = train_data['Genre']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


print('\n')
print('Initialize and train a LinearSVC classifier.......')
print('\n')
# Initialize and train a LinearSVC classifier
classifier = LinearSVC()
classifier.fit(X_train, y_train)

# Make predictions on the validation set
y_pred = classifier.predict(X_val)

# Evaluate the performance of the model
accuracy = accuracy_score(y_val, y_pred)
print("Validation Accuracy:", accuracy)
print(classification_report(y_val, y_pred))
print('\n')
print('Prediction on test data and output formatting....')
print('\n')
# --- End of Splitting and Evaluation ---

# --- Prediction on test data and output formatting ---
y_pred_test = classifier.predict(X_test)  # Predict on test data

# Create a DataFrame for predictions
predictions_df = pd.DataFrame({'Id': test_data['Id'], 'Predicted_Genre': y_pred_test})

# Print or save the predictions
print("\nPredictions on Test Data:")
print(predictions_df)
# --- End of Prediction on test data and output formatting ---

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Downloading from https://www.kaggle.com/api/v1/datasets/download/hijest/genre-classification-dataset-imdb?dataset_version_number=1...


100%|██████████| 41.7M/41.7M [00:01<00:00, 26.2MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/hijest/genre-classification-dataset-imdb/versions/1


File paths and data loading....




Preprocessing....




Initialize and train a LinearSVC classifier.......


Validation Accuracy: 0.5883980448215439
               precision    recall  f1-score   support

      action        0.48      0.35      0.40       263
       adult        0.76      0.43      0.55       112
   adventure        0.51      0.22      0.31       139
   animation        0.34      0.10      0.15       104
   biography        0.00      0.00      0.00        61
      comedy        0.53      0.59      0.56      1443
       crime        0.32      0.07      0.11       107
 documentary        0.69      0.83      0.75      2659
       drama        0.57      0.73      0.64      2697
      family        0.38      0.17      0.23       150
     fantasy        0.09      0.01      0.02        74
   game-show        0.82      0.68      0.74        40
     history        0.50

## dataset

In [None]:
# Define the paths to your data files
train_path = "/root/.cache/kagglehub/datasets/hijest/genre-classification-dataset-imdb/versions/1/Genre Classification Dataset/train_data.txt"
test_path = "/root/.cache/kagglehub/datasets/hijest/genre-classification-dataset-imdb/versions/1/Genre Classification Dataset/test_data.txt"
train_data = pd.read_csv(train_path, sep=':::', names=['Title', 'Genre', 'Description'], engine='python')
sample_size = len(train_data)
print(f"Sample size: {sample_size}")

Sample size: 54214


In [None]:
print(train_data.info())

<class 'pandas.core.frame.DataFrame'>
Index: 54214 entries, 1 to 54214
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Title        54214 non-null  object
 1   Genre        54214 non-null  object
 2   Description  54214 non-null  object
dtypes: object(3)
memory usage: 1.7+ MB
None


In [None]:
print(train_data.describe())

                                 Title    Genre  \
count                            54214    54214   
unique                           54214       27   
top      Oscar et la dame rose (2009)    drama    
freq                                 1    13613   

                                              Description  
count                                               54214  
unique                                              54086  
top      Grammy - music award of the American academy ...  
freq                                                   12  


In [None]:
# Load the test data
test_data = pd.read_csv(test_path, sep=':::', names=['Id', 'Title', 'Description'], engine='python')
test_data.head()

Unnamed: 0,Id,Title,Description
0,1,Edgar's Lunch (1998),"L.R. Brane loves his life - his car, his apar..."
1,2,La guerra de papá (1977),"Spain, March 1964: Quico is a very naughty ch..."
2,3,Off the Beaten Track (2010),One year in the life of Albin and his family ...
3,4,Meu Amigo Hindu (2015),"His father has died, he hasn't spoken with hi..."
4,5,Er nu zhai (1955),Before he was known internationally as a mart...


## agent

In [None]:
!pip install google-generativeai -q
!pip install backoff -q
!pip install textblob -q
!pip install colab-env -q

openai

In [96]:
import openai
import re
from sklearn.feature_extraction.text import TfidfVectorizer

class MovieGenreAgent:
    def __init__(self, classifier_model, openai_model="o3-mini", tfidf_vectorizer=None, clean_text_func=None, genre_mapping=None):
        """
        Initializes the MovieGenreAgent.

        Args:
            classifier_model: The pre-trained classifier model for genre prediction.
            openai_model: The OpenAI model name for text generation (e.g., "gpt-3.5-turbo").
            tfidf_vectorizer: The TF-IDF vectorizer for feature extraction.
            clean_text_func: The function to clean movie descriptions.
        """
        self.classifier_model = classifier_model
        self.openai_model = openai_model
        self.tfidf_vectorizer = tfidf_vectorizer if tfidf_vectorizer else TfidfVectorizer()
        self.clean_text_func = clean_text_func

        # use genre_mapping passed to __init__ or a default one
        self.genre_mapping = genre_mapping if genre_mapping is not None else {
            "drama": ["drama", "family film", "comedy-drama", "romance"],
            "thriller": ["thriller", "psychological thriller", "suspense"],
            "adult": ["adult", "adult/erotic"],
            "animated musical/comedy": ["animation", "musical", "comedy"],
            "comedy-drama": ["comedy", "drama"],
            "romantic comedy": ["romance", "comedy"],
            "action comedy": ["action", "comedy"],
            "sci-fi comedy": ["sci-fi", "comedy"],
            "sci-fi horror": ["sci-fi", "horror"],
            "action thriller": ["action", "thriller"],
            "crime drama": ["crime", "drama"],
            "horror thriller": ["horror", "thriller"],
            "fantasy adventure": ["fantasy", "adventure"],
            "historical drama": ["history", "drama"],
            "biographical drama": ["biography", "drama"],
            "war drama": ["war", "drama"],
            "dark comedy": ["comedy", "drama"],
            "psychological thriller": ["thriller"],
            "family film": ["family"],
            "adult/erotic": ["adult"],
            "documentary": ["documentary", "historical", "investigation", "analysis", "exposé", "biographical", "scientific", "nature", "wildlife", "social impact", "interviews", "archival footage", "narration", "research-based", "expert commentary",  "to educate", "to inform", "to raise awareness", "to explore", "to examine"],
            "animated": ["animation"],
            "musical": ["musical"],
            "science fiction": ["science fiction", "sci-fi"],
            "western": ["western"],
            "crime": ["crime", "lawlessness", "criminal", "investigation", "justice"],
            "reality tv": ["reality tv", "reality-tv", "docuseries", "unscripted", "dramatization", "reconstruction", "disaster", "survival", "rescue", "emergency", "natural disaster", "personal stories", "dramatic situations", "competition"],
            "crime drama": ["crime drama"], # new entry

        }



        # Set your OpenAI API key
        #client = OpenAI(api_key = os.getenv("OPENAI_API_KEY"))
        openai.api_key = "OPENAI_API_KEY"  # Replace with your actual key



    def observe(self, movie_data):
        """Observes the movie data and extracts relevant information."""
        self.movie_title = movie_data.get("Title", "Unknown")
        description = movie_data.get("Description", "")
        self.movie_description = description  # You might need to add filter_sensitive_content here
        self.features = self.extract_features(movie_data)

    def orient(self):
        """Identifies dominant genres using OpenAI Chat Completion API."""
        prompt = f"Identify the 5 most important keywords in this movie description: {self.movie_description}"

        from openai import OpenAI
        import os
        client = OpenAI(api_key = os.getenv("OPENAI_API_KEY"))

        response = client.chat.completions.create(
          model=self.openai_model,
          messages=[
                {"role": "user", "content": prompt},
            ],
          #max_tokens=500,  # Adjust as needed
          #temperature=0.7  # Adjust for creativity vs. conciseness
        )

        generated_text = response.choices[0].message.content.strip()

        keywords = response.choices[0].message.content.strip()
        self.dominant_genre = keywords.split(",") if keywords else []


    #def decide(self):

    def decide(self, drama_keyword_count=0):
        """Predicts the genre using the classifier and OpenAI Chat Completion API."""

        # --- Classifier Prediction ---
        probabilities = self.classifier_model.decision_function(self.features)[0]
        top_3_indices = probabilities.argsort()[-3:][::-1]
        top_3_genres = [self.classifier_model.classes_[i] for i in top_3_indices]
        top_3_probs = [probabilities[i] for i in top_3_indices]
        ml_prediction_text = ", ".join([f"{genre} ({prob:.2f})" for genre, prob in zip(top_3_genres, top_3_probs)])

        # --- OpenAI Prompt ---
        prompt = f"""
        # ... (Your existing prompt content) ...
        """

        prompt = f"""
                      You are a highly skilled movie genre classification expert. Your task is to analyze the provided movie description and determine the **single most dominant genre** that best represents the film's overall style and content.

                      **Important Instructions:**

                      1. **Prioritize Genres:** Your classification should be based on the following list of target genres: drama, thriller, adult, documentary, comedy, horror, fantasy, science fiction, music, reality tv, western, crime, and science fiction (with elements of dystopian/post-apocalyptic). If the movie clearly belongs to one of these genres, select it.
                      2. **Handle Ambiguity:** If the description is ambiguous or contains elements of multiple genres, carefully consider the primary themes, setting, target audience, and overall tone of the movie to deduce the most dominant genre.
                      3. **Consider ML Prediction:** You are provided with the top 3 predictions from a machine learning model: "{ml_prediction_text}". Use this prediction as a helpful starting point, but feel free to override it if your own analysis leads to a different conclusion.
                      4. **Explain Your Reasoning:** Briefly explain your genre choice, highlighting the key factors that influenced your decision.
                      5. **Handle Overlapping Genres:** Some movies might blend elements of multiple genres. When this occurs, identify the primary driving force of the narrative. If the story revolves around criminal activities or investigations, consider "crime" as the dominant genre, even if it has significant dramatic or emotional elements.
                      6. **Consider Implicit Cues:** Even if a movie description doesn't explicitly focus on crime, consider the potential consequences or implications of criminal activity for the characters or plot. If crime significantly shapes the narrative or character development, it might be the dominant genre.


                      **Primary Focus:**
                      When deciding between genres, prioritize the primary driving force of the movie's narrative. Does it primarily focus on:
                      * Exploring the emotional experiences and relationships of characters within a particular context (drama)?
                      * Presenting factual information and analysis about real-world events, people, or topics (documentary)?


                      **Important Considerations:**

                      **Real-World Content with Entertainment Value:**
                      Some movies might present factual information about real-world events but also incorporate elements of entertainment, drama, or personal narratives. In such cases, carefully consider whether the primary purpose is to inform or to entertain. If the focus is on engaging the audience through personal stories, dramatic situations, or competition, it might be more appropriate to classify it as Reality TV, even if it involves factual elements.

                      **Emotional Depth in Real-world content :**
                      Even if a film incorporates factual content,  it might still be a drama if it primarily seeks to explore human emotions, relationships, and character experiences within a real-world context.

                      **Example:**
                      A movie about the aftermath of a terrorist attack could be a documentary if it focuses on the factual events, the investigation, and the political implications. However, if the same movie focuses on the emotional journeys of individuals affected by the attack, their personal losses, and their struggles to cope, it would likely be classified as a drama.


                      * **Documentary vs. Reality TV:**
                          - **Documentaries:** Primarily focus on educating or informing, often employing interviews, archival footage, and narration to convey information. Their objective is to present factual information and analysis.
                          - **Reality TV:** Focuses on entertainment and engagement, often using real-world events as a backdrop for personal stories, dramatic situations, and competition. Look for cues like "eyewitness accounts," "dramatic presentation," "episodes," and "series," which are characteristic of reality TV, even if it involves real-world events.

                      * **Drama vs. Documentary:**
                          - **Drama:** Explores complex human emotions, relationships, and social issues, often featuring character-driven narratives and focusing on the emotional journeys and internal conflicts of the characters.
                          - **Documentary:** Presents factual information about real-world events, people, or topics. Even if it deals with emotionally charged topics, the primary focus is on informing and educating, not on character development or emotional journeys. If the description primarily focuses on exploring emotions, relationships, and human experiences, consider 'drama' even if it is based on real-world events.

                      * **Crime's Impact:** Even if a movie description doesn't explicitly focus on crime, consider the potential consequences or implications of criminal activity for the characters or plot. If crime significantly shapes the narrative or character development, 'crime' might be the dominant genre, even if it has dramatic or emotional elements.
                          - Consider if the character's actions or choices are driven by avoiding or engaging in criminal activity or its consequences. If the descriptions mention crime or violence significantly impacting the characters or influencing their decisions, that might indicate that crime is a central element.

                      **Primary Focus:**
                      When deciding between genres, prioritize the primary driving force of the movie's narrative. Does it primarily focus on:
                      * Exploring the emotional experiences and relationships of characters within a particular context (drama)?
                      * Presenting factual information and analysis about real-world events, people, or topics (documentary)?



                      **Important Considerations:**

                      * **Documentary vs. Reality TV:** When classifying movies that involve real-world events, carefully consider the primary focus and style of the film. Documentaries typically present factual information with an educational or informative purpose. Reality TV, on the other hand, often focuses on the lives and interactions of real people, with an emphasis on drama, entertainment, or competition.
                      * **Real-World Content with Entertainment Value:** Some movies might present factual information about real-world events but also incorporate elements of entertainment, drama, or personal narratives. In such cases, carefully consider whether the primary purpose is to inform or to entertain. If the focus is on engaging the audience through personal stories, dramatic situations, or competition, it might be more appropriate to classify it as Reality TV, even if it involves factual elements.
                      * **Reality TV's Entertainment Focus:** Even when Reality TV involves real-world events or situations, it is crucial to remember that its primary purpose is to engage and entertain the audience. This can be achieved through personal stories, conflicts, emotional responses, and dramatic elements.
                      * **Reality TV's Presentation Style:** Reality TV often employs dramatic editing, music, and narrative techniques to heighten entertainment value, even when dealing with real-world events. Consider these stylistic elements when classifying a movie as Reality TV.
                      * **Consider the Target Audience:** Different genres often have specific target audiences. Documentaries typically focus on educating or informing a particular demographic. Conversely, Reality TV primarily targets a broad audience and prioritizes entertainment and engagement. Analyze the descriptions for clues about the intended audience to guide your genre classification.
                      **Genre characteristics to discern:**
                      * **Documentary** - educational, informative, objective, historical, analysis-based, impersonal.
                      * **Reality TV** - dramatic, emotional, engaging, focused on human interactions, unscripted or lightly scripted, emphasis on characters and their actions/reactions, might follow specific events but centers around personal experiences.

                      **Genre Characteristics:**

                      * **Drama:** Explores complex human emotions, relationships, and social issues. Often features character-driven narratives and realistic settings. Can include subgenres like documentary if it focuses on real-world issues with a dramatic approach.
                      * **Documentary:** Presents factual information about real-world events, people, or topics. May employ interviews, archival footage, and narration to convey information. However, if it uses a highly dramatic or fictionalized approach, it might be classified as drama.
                      * **Thriller:** Creates suspense, tension, and excitement. Often involves crime, mystery, or psychological elements.
                      * **Comedy:** Intended to provoke laughter and amusement. May use humor, satire, or slapstick.
                      * **Horror:** Aims to evoke fear, dread, and suspense. Often features supernatural elements, violence, or psychological disturbances.
                      * **Fantasy:** Features magical or supernatural elements, often set in imaginary worlds. May involve mythical creatures, quests, and epic battles.
                      * **Science Fiction:** Explores futuristic or speculative concepts, often involving advanced technology, space travel, or alternate realities.
                      * **Music:** Focuses on musical performances, often featuring singing, dancing, and instrumental music.
                      * **Reality TV:** Presents unscripted or semi-scripted situations and interactions involving real people. Often focuses on competition, drama, or personal relationships.
                      * **Western:** Set in the American West, often featuring cowboys, outlaws, and Native Americans. Typically involves themes of frontier life, lawlessness, and adventure.
                      * **Science Fiction (with elements of dystopian/post-apocalyptic):** Explores futuristic or speculative concepts in a dystopian or post-apocalyptic setting. Often involves societal breakdown, oppressive regimes, or survival in harsh environments.
                      * **Crime:** Focuses on criminal acts, their investigation, and their consequences. Often features detectives, criminals, and legal proceedings. May explore themes of justice, morality, and the impact of crime on society.
                      * **Other genres:** You may encounter movies that belong to other genres not explicitly listed here. Use your best judgment and the information provided in the description to make the most appropriate classification.

                      * **Documentary vs. Reality TV:** Carefully distinguish between these genres. Documentaries primarily aim to educate or inform, while Reality TV focuses on entertainment, often using real-world events as a backdrop for personal stories and dramatic situations.
                        * **Consider the Target Audience:** Documentaries often target specific demographics seeking information, while Reality TV aims for a broader audience seeking entertainment.
                        * **Example:** A movie showing the daily lives of firefighters responding to emergencies with a focus on personal challenges and emotional responses would be classified as Reality TV, even if it contains factual elements. A movie analyzing the history of firefighting techniques with expert interviews and archival footage would be classified as Documentary.

                      * **Drama:** Explores complex human emotions, relationships, and social issues. Often features character-driven narratives and realistic settings. Can include subgenres like documentary if it focuses on real-world issues with a dramatic approach.
                        * **Consider Character Development:** Even if a movie is based on real-world events, if it focuses on the emotional journeys and internal conflicts of the characters, it's likely a drama.
                        * **Emphasis on Human Experience:** Dramas are ultimately driven by character experiences and the exploration of human emotions within a specific situation. Look for cues highlighting the human element and emotional responses to events.



                      **Emotional Depth in Documentaries vs. Dramas:**

                      - Even documentaries can explore emotionally charged topics. However, dramas often delve deeper into the emotional experiences and internal conflicts of characters, even when discussing real-world events or social issues.

                      - When evaluating a movie that seems to blur the lines between documentary and drama, consider the following:
                        - Does the movie primarily focus on presenting factual information and analysis, or does it prioritize exploring the emotional impact on individuals?
                        - Are there significant character arcs or emotional journeys depicted, even if the narrative is presented in an observational style?

                      If the movie emphasizes emotional depth and character experiences, it might be more appropriate to classify it as a drama, even if it incorporates documentary-style elements.


                      **Implicit Character-Driven Narratives:**

                      - Some dramas might not have a clear plot or central characters but still focus on exploring human emotions and experiences in response to significant events.

                      - Consider whether the movie, even with an observational style, implicitly portrays the emotional journeys of individuals or groups affected by the events depicted. If so, it might be classified as a drama.


                      # Specific Information for "The Unrecovered (2007)"

                      **The Unrecovered (2007) Context:**

                      Title: The Unrecovered
                      Director: Roger Copeland
                      Description: A feature-length narrative film exploring the aftermath of 9/11.
                      Trailer Observations: Observational style, focus on real-world events and interviews, emphasis on information and analysis.

                      Possible Genres: Documentary (most likely), Drama (less likely)


                      **Movie Information:**

                      Title: {self.movie_title}
                      Description: {self.movie_description}

                      **Your Response Format:**

                      {{
                        "genre": "The dominant genre of the movie",
                        "explanation": "A brief explanation of the genre choice"
                      }}


                      """


        try:
            # --- OpenAI API Call ---
            from openai import OpenAI
            import os
            import json
            client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
            functions = [
                {
                    "name": "classify_genre",
                    "description": "Classifies the genre of a movie based on its description.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "genre": {
                                "type": "string",
                                "description": "The dominant genre of the movie."
                            },
                            "explanation": {
                                "type": "string",
                                "description": "A brief explanation of the genre choice."
                            },
                            "probability": {
                                "type": "number",
                                "format": "float",
                                "description": "The probability score of the prediction (between 0 and 1)."
                            }
                        },
                        "required": ["genre", "explanation", "probability"]
                    }
                }
            ]

            response = client.chat.completions.create(
                model="gpt-4",  # or gpt-3.5-turbo
                messages=[
                    {"role": "user", "content": prompt}
                ],
                functions=functions,
                function_call={"name": "classify_genre"}
            )

            response_message = response.choices[0].message

            if response_message.function_call:
                function_args = json.loads(response_message.function_call.arguments)

                # --- Check for Required Keys and Probability ---
                if isinstance(function_args, dict) and \
                  "genre" in function_args and \
                  "explanation" in function_args:

                    if "probability" in function_args and \
                      isinstance(function_args["probability"], (int, float)) and \
                      0 <= function_args["probability"] <= 1:
                        probability = function_args["probability"]
                    else:
                        probability = 0.5  # Default if probability is missing or invalid

                    self.predicted_genre = [[function_args["genre"]]]
                    self.genre_explanation = function_args["explanation"]

                else:
                    self.predicted_genre = [[ml_prediction_text]]
                    self.genre_explanation = f"Error processing OpenAI response: Missing keys. Falling back to classifier's prediction: {ml_prediction_text}"

            else:
                self.predicted_genre = [[ml_prediction_text]]
                self.genre_explanation = f"Error processing OpenAI response: Function call not found. Falling back to classifier's prediction: {ml_prediction_text}"

        except (json.JSONDecodeError, KeyError, TypeError, AttributeError) as e:
            self.predicted_genre = [[ml_prediction_text]]
            self.genre_explanation = f"Error processing OpenAI response: {e}. Falling back to classifier's prediction: {ml_prediction_text}"



    def extract_features(self, movie_data):
        """Extracts features from the movie data."""
        description = self.clean_text_func(movie_data["Description"]) if self.clean_text_func else movie_data["Description"]
        features = self.tfidf_vectorizer.transform([description])
        return features

    def act(self):
        """Prints the prediction and explanation."""
        print('\n')
        print("-" * 20)
        print(f"Movie: {self.movie_title}")
        #print(f"Predicted genre labels: {self.predicted_genre}")
        self.communicate_prediction()
        #print("-" * 20)

    def communicate_prediction(self):
        """Prints the genre explanation."""
        print(f"Description: {self.genre_explanation}")

## test agent

In [None]:
#openai.api_key = os.environ.get("OPENAI_API_KEY") # Get the API key from environment variables

In [110]:
## test agent
import os
import openai
import colab_env  # If using Google Colab, otherwise remove/replace
import re
from tqdm import tqdm
import warnings
import time
import backoff

# --- Set OpenAI API Key ---
# Replace with your actual OpenAI API key
openai.api_key = os.environ.get("OPENAI_API_KEY")

# --- Number of test samples ---
nsample_test = 10  # Adjust as needed

# --- Create Agent Instance ---
agent = MovieGenreAgent(
    classifier_model=classifier,
    tfidf_vectorizer=tfidf_vectorizer,
    clean_text_func=clean_text,
)

# --- Suppress Warnings ---
warnings.filterwarnings("ignore")

# --- Select Test Data ---
agent_test_data = train_data.head(nsample_test)

# --- Initialize Counters ---
correct_predictions = 0
incorrect_predictions = 0

# --- Function with Backoff for Rate Limiting ---
@backoff.on_exception(backoff.expo, (openai.OpenAIError,), max_tries=3)
def process_movie(movie_data):
    agent.observe(movie_data)
    agent.orient()
    agent.decide()
    agent.act()

# --- Function to Normalize Genres ---
def normalize_genre(genre_string):
    return genre_string.lower().replace('-', '').strip()

# --- Main Testing Loop ---
for index, row in tqdm(agent_test_data.iterrows(), total=agent_test_data.shape[0]):
    movie_data = {
        "Title": row["Title"],
        "Description": row["Description"],
        "Actual_Genre": row["Genre"]
    }

    try:
        process_movie(movie_data)
    except openai.OpenAIError as e:
        print(f"Rate limit error: {e}")
        print("Pausing for 60 seconds...")
        time.sleep(60)
        process_movie(movie_data)  # Retry after pause

    # --- Genre Comparison and Accuracy Tracking ---
    predicted_genre_normalized = normalize_genre(agent.predicted_genre[0][0])
    actual_genre_normalized = normalize_genre(movie_data["Actual_Genre"])

    is_correct = predicted_genre_normalized == actual_genre_normalized

    #print(f"Movie: {movie_data['Title']}")
    #print(f"Description: {agent.genre_explanation}")
    print(f"Predicted Genre: {agent.predicted_genre[0][0]}")
    print(f"Actual Genre: {movie_data['Actual_Genre']}")

    if is_correct:
        print("Prediction is correct!")
        correct_predictions += 1
    else:
        print("Prediction is incorrect.")
        incorrect_predictions += 1

    print("-" * 20)
    print('\n')

# --- Calculate and Print Accuracy ---
total_predictions = correct_predictions + incorrect_predictions
accuracy_percentage = (correct_predictions / total_predictions) * 100 if total_predictions else 0

print("\nPrediction Accuracy:")
print("-" * 20)
print(f"Correct Predictions: {correct_predictions}")
print(f"Incorrect Predictions: {incorrect_predictions}")
print(f"Accuracy: {accuracy_percentage:.2f}%")
print("-" * 20)

 10%|█         | 1/10 [00:19<02:56, 19.56s/it]



--------------------
Movie:  Oscar et la dame rose (2009) 
Description: The movie 'Oscar et la dame rose' primarily explores the emotional experiences of 10-year-old Oscar as he comes to terms with his terminal illness. The narrative revolves around Oscar's relationship with the 'lady in pink' and the fantastical experiences they share. The emphasis is noticeably on character development, emotions, and relationships within a real-world context - all characteristic elements of the drama genre. While the story involves elements of fantasy (due to the imaginative situations created by Rose), they are not set in a different reality but represent coping mechanisms in the face of real-world adversity, keeping the strong dramatic elements dominant over the fantasy ones.
Predicted Genre: drama
Actual Genre:  drama 
Prediction is correct!
--------------------




 20%|██        | 2/10 [00:30<01:57, 14.71s/it]



--------------------
Movie:  Cupid (1997) 
Description: The movie features a plot involving murder and an intense relationship between the main characters.
Predicted Genre: thriller
Actual Genre:  thriller 
Prediction is correct!
--------------------




 30%|███       | 3/10 [00:47<01:49, 15.64s/it]



--------------------
Movie:  Young, Wild and Wonderful (1980) 
Description: The film’s primary focus is on explicit erotic scenarios and fantasies, which aligns it with the adult genre. Despite the setting in a Museum of Natural History, the exploration of human sexuality and erotic themes overshadows any potential educational or documentary aspects.
Predicted Genre: adult
Actual Genre:  adult 
Prediction is correct!
--------------------




 40%|████      | 4/10 [01:07<01:43, 17.30s/it]



--------------------
Movie:  The Secret Sin (1915) 
Description: The movie centers heavily on the emotional experiences and relationships of the characters, particularly the familial and romantic relationships between the sisters and Jack. The drama ensues from the twin sister's drug addiction, jealousy, and communication breakdown. There is also considerable exploration of the themes of deceit, addiction's destructive power, and redemption, which are all indicative of a drama. Although the narrative presents instances of crime, like opium use and deceit, the central focus remains on the emotional and relational turmoil experienced by the characters, classifying this story within the drama genre.
Predicted Genre: drama
Actual Genre:  drama 
Prediction is correct!
--------------------




 50%|█████     | 5/10 [01:18<01:15, 15.10s/it]



--------------------
Movie:  The Unrecovered (2007) 
Description: While the film clearly deals with real-world events associated with the aftermath of 9/11, its primary focus is on exploring emotional and mental states such as anxiety, heightened alertness, and paranoia. The description emphasizes the impact of the event on the 'average mind', indicating a focus on human experience, which identifies it as a Drama. The mention of 'imaginative connections' and a state akin to that of 'artists and conspiracy theorists' also suggests a level of emotional depth and internal conflict typical of dramas
Predicted Genre: Drama
Actual Genre:  drama 
Prediction is correct!
--------------------




 60%|██████    | 6/10 [01:33<00:59, 14.85s/it]



--------------------
Movie:  Quality Control (2011) 
Description: Based on the descriptive information provided, 'Quality Control (2011)' is a documentary. It features real-life work scenarios and explores quotidian tasks of workers with a focus on labor conditions and work processes. The observational filming style supports this genre classification. The movie aims to provide a factual, detailed, and realistic depiction of the work routine at a dry cleaners facility in Alabama, highlighting Robert Everson's interest in labor, work conditions, and daily life tasks. There are no clear indications of a dramatic, character-driven narrative or emotional arcs that would suggest classification as a drama.
Predicted Genre: documentary
Actual Genre:  documentary 
Prediction is correct!
--------------------




 70%|███████   | 7/10 [01:54<00:50, 16.91s/it]



--------------------
Movie:  "Pink Slip" (2009) 
Description: The description of the movie includes phrases such as 'keeps you on your toes', 'cross-dressing', and 'hilarious series' that are indicative of a comedic tone. The story revolves around two friends who resort to outrageous and humorous situations to deal with their economic struggles, suggesting that comedy is the primary genre of this film.
Predicted Genre: comedy
Actual Genre:  comedy 
Prediction is correct!
--------------------




 80%|████████  | 8/10 [02:07<00:31, 15.64s/it]



--------------------
Movie:  One Step Away (1985) 
Description: The description of the movie highlights themes related to crime, notably the protagonist's inclination towards it. Ron Petrie, the main character, seems to be involved in criminal activities - breaking and entering - and the narrative suggests his story might continue in this direction as one of street crime. Although the film may contain dramatic elements given the protagonist's personal and familial struggles, the narrative appears largely centered around crime and its implications.
Predicted Genre: crime
Actual Genre:  crime 
Prediction is correct!
--------------------




 90%|█████████ | 9/10 [02:17<00:13, 13.90s/it]



--------------------
Movie:  "Desperate Hours" (2016) 
Description: Although the movie incorporates elements of a documentary, presenting factual accounts of real-life disasters, it is presented as a series focusing on the dramatic aspect of these events. The audience is encouraged to witness the disasters in real-time and compare them across time and distance, engaging them through the personal and dramatic narratives of the events rather than seeking to educate or inform, which are characteristics more associated with the reality-tv genre.
Predicted Genre: reality-tv
Actual Genre:  reality-tv 
Prediction is correct!
--------------------




100%|██████████| 10/10 [02:28<00:00, 14.83s/it]



--------------------
Movie:  Spirits (2014/I) 
Description: Based on the description provided, the dominant genre for the movie 'Spirits' is horror. The movie involves a terrifying journey and paranormal investigating, elements akin to the horror genre. Moreover, the described scenario presents an eerie environment fitting horror movies. The ML prediction also has 'horror' as the top prediction, which aligns with the given context about the movie.
Predicted Genre: horror
Actual Genre:  horror 
Prediction is correct!
--------------------



Prediction Accuracy:
--------------------
Correct Predictions: 10
Incorrect Predictions: 0
Accuracy: 100.00%
--------------------



