**CSI 4142 Data Science** <br/>
*Assignment 3: Predictive Analysis and Classification*

# Identification

Name: Eli Wynn<br/>
Student Number: 300248135

Name: Jack Snelgrove<br/>
Student Number: 300247435


Our datasets have been uploaded from the public repository:

- [github.com/eli-wynn/Datasets](https://github.com/eli-wynn/Datasets)

# Running Instructions
1. Generate a Kaggle API key on the Kaggle website under Account > Settings > Generate Key.
2. This downloads a `kaggle.json` file.
3. Place the file in the location for your operating system:
   - On Windows: `C:\Users\<YourUsername>\.kaggle\kaggle.json`
   - On Mac: `~/.kaggle/kaggle.json`
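
Before running the download cell, it can help to confirm that the key file is where the steps above put it. The snippet below is a minimal sketch using only the standard library; the helper name `kaggle_credentials_path` is hypothetical and not part of the `kaggle` package.

```python
from pathlib import Path


def kaggle_credentials_path(home=None):
    """Return the expected location of kaggle.json under the given home directory.

    `home` defaults to the current user's home directory, which matches
    C:\\Users\\<YourUsername> on Windows and ~ on Mac.
    """
    home = Path(home) if home is not None else Path.home()
    return home / ".kaggle" / "kaggle.json"


path = kaggle_credentials_path()
if path.is_file():
    print(f"Found credentials at {path}")
else:
    print(f"No kaggle.json at {path}; follow the steps above first")
```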

# Imports

In [None]:
#configure database -> can't use github because files are too large
#!pip install kaggle
#!pip install python-Levenshtein
#Run the above lines if library has not been previously installed


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import kaggle
import Levenshtein
import re
import ast
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN


In [None]:
# Load the dataset
import os
import kaggle

# Credentials are read from kaggle.json (see Running Instructions);
# keep the username and API key out of the notebook

# Download the dataset
dataset = "rounakbanik/the-movies-dataset"
download_path = "./movies_dataset"

#download entire dataset
kaggle.api.dataset_download_files(dataset, download_path, unzip=True)

# low_memory=False avoids mixed-dtype warnings on the metadata file
metadataDF = pd.read_csv(f"{download_path}/movies_metadata.csv", low_memory=False)
display(metadataDF.head())
ratingsDF = pd.read_csv(f"{download_path}/ratings.csv")
display(ratingsDF.head())
ratingsSmallDF = pd.read_csv(f"{download_path}/ratings_small.csv")
display(ratingsSmallDF.head())

# **Dataset Description**

## **Overview**
| Attribute | Description |
|-----------|-------------|
| Dataset Name | The Movies Dataset |
| Author | Rounak Banik |
| Purpose | Provides metadata and ratings for movies to facilitate film analysis and recommendation system development |
| Shape | 45,466 rows × 24 columns (`movies_metadata.csv`), 100,004 rows × 4 columns (`ratings_small.csv`) |

## **Features**

### **Movies Metadata (`movies_metadata.csv`)**
| Feature Name | Type | Description |
|-------------|------|-------------|
| id | Categorical | Unique identifier for each movie |
| title | Categorical | Title of the movie |
| release_date | DateTime | Release date of the movie |
| budget | Numerical | Production budget in USD |
| revenue | Numerical | Box office revenue in USD |
| genres | Categorical | List of genres associated with the movie |
| popularity | Numerical | Popularity score based on TMDb metrics |
| vote_average | Numerical | Average user rating (0-10) |
| vote_count | Numerical | Total number of votes received |
| production_companies | Categorical | List of companies that produced the movie |
| production_countries | Categorical | Countries where the movie was produced |
| runtime | Numerical | Duration of the movie in minutes |
| spoken_languages | Categorical | List of languages spoken in the movie |

### **Ratings (`ratings_small.csv`)**
| Feature Name | Type | Description |
|-------------|------|-------------|
| userId | Categorical | Unique identifier for each user |
| movieId | Categorical | Unique identifier for each movie (links to `id` in movies metadata) |
| rating | Numerical | User rating of the movie (0.5 - 5.0) |
| timestamp | DateTime | Unix timestamp of when the rating was given |


## Data Preparation

In [None]:
def prepare_movie_datasets():
    """
    Prepare the movie datasets (movies_metadata.csv and ratings_small.csv)
    """

    metadataDF = pd.read_csv('./movies_dataset/movies_metadata.csv', low_memory=False)
    ratingsDF = pd.read_csv('./movies_dataset/ratings_small.csv')
    
    # Display basic information
    print("Movies Metadata Shape:", metadataDF.shape)
    print("Ratings Shape:", ratingsDF.shape)
    
    # Clean the movies dataset
    print("Cleaning movies dataset...")
    
    # Convert 'id' to numeric, coercing errors to NaN
    metadataDF['id'] = pd.to_numeric(metadataDF['id'], errors='coerce')
    
    # Drop rows with NaN in 'id'
    metadataDF = metadataDF.dropna(subset=['id'])
    
    # Convert 'id' to integer
    metadataDF['id'] = metadataDF['id'].astype(int)
    
    # Select relevant columns
    metadataDF = metadataDF[['id', 'title', 'release_date', 'popularity', 'vote_average', 'vote_count', 'budget', 'revenue', 'runtime', 'genres', 'overview']]
    
    # Convert date column to datetime
    metadataDF['release_date'] = pd.to_datetime(metadataDF['release_date'], errors='coerce')
    
    # Extract year from release_date
    metadataDF['release_year'] = metadataDF['release_date'].dt.year
    
    # Drop rows with missing values in important columns
    metadataDF = metadataDF.dropna(subset=['release_year', 'popularity', 'vote_average', 'vote_count'])
    
    # Convert numeric columns to appropriate types
    numeric_cols = ['popularity', 'vote_average', 'vote_count', 'budget', 'revenue', 'runtime']
    for col in numeric_cols:
        metadataDF[col] = pd.to_numeric(metadataDF[col], errors='coerce')
    
    # Fill missing values with median
    for col in numeric_cols:
        metadataDF[col] = metadataDF[col].fillna(metadataDF[col].median())
    
    # Prepare the ratings dataset
    print("Preparing ratings dataset...")
    
    # Rename movieId to match with metadataDF id
    ratingsDF = ratingsDF.rename(columns={'movieId': 'id'})
    
    # Merge datasets
    print("Merging datasets...")
    merged_df = pd.merge(ratingsDF, metadataDF, on='id', how='inner')
    
    print("Merged dataset shape:", merged_df.shape)
    
    # Create features and target
    # Target: High rating (1 if rating >= 4.0, 0 otherwise)
    merged_df['high_rating'] = (merged_df['rating'] >= 4.0).astype(int)
    
    # Features: movie attributes
    features = ['popularity', 'vote_average', 'vote_count', 'budget', 'revenue', 'runtime', 'release_year']
    X = merged_df[features]
    y = merged_df['high_rating']
    
    # Split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Standardize features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    print("Movie datasets prepared successfully!")
    return X_train_scaled, X_test_scaled, y_train, y_test, merged_df, metadataDF

## Study 1: Similarity Measures

In [None]:
X_train, X_test, y_train, y_test, merged_df, metadataDF = prepare_movie_datasets()

# Helper functions for processing movie data
def extract_genres(genres_json):
    """Extract genre names from the stringified list of dicts in the genres column"""
    if not isinstance(genres_json, str):
        return []
    try:
        genres = ast.literal_eval(genres_json)
        return [genre['name'] for genre in genres]
    except (ValueError, SyntaxError, TypeError, KeyError):
        # Malformed or unexpected entries are treated as having no genres
        return []

def clean_title(title):
    """Clean title for better similarity matching"""
    if pd.isna(title):
        return ""
    # Remove special characters and convert to lowercase
    return re.sub(r'[^\w\s]', '', title).lower()

def extract_keywords(overview):
    """Extract keywords from overview text"""
    if pd.isna(overview):
        return ""
    # Simple keyword extraction - remove stopwords and keep only words with 4+ characters
    words = re.findall(r'\b\w{4,}\b', overview.lower())
    # Remove common stopwords
    stopwords = ['this', 'that', 'with', 'from', 'have', 'they', 'will', 'what', 'when', 'where', 'which']
    return ' '.join([w for w in words if w not in stopwords])

# Prepare additional features for similarity measures
print("Preparing data for similarity measures...")
metadataDF['genres_list'] = metadataDF['genres'].apply(extract_genres)
metadataDF['clean_title'] = metadataDF['title'].apply(clean_title)
metadataDF['keywords'] = metadataDF['overview'].apply(extract_keywords)

# 1. Jaccard Similarity for Genres
def jaccard_similarity_genres(movie_data, movie_title):
    """
    Jaccard similarity for genres
    Jaccard(A,B) = |A ∩ B| / |A ∪ B|
    """
    # Find the movie by title
    movie = movie_data[movie_data['title'] == movie_title]
    if len(movie) == 0:
        print(f"Movie '{movie_title}' not found. Please check the title.")
        return []
    
    movie = movie.iloc[0]
    movie_genres = set(movie['genres_list'])
    
    # Calculate Jaccard similarity for all movies
    similarities = []
    for idx, row in movie_data.iterrows():
        other_genres = set(row['genres_list'])
        if not movie_genres or not other_genres:
            sim = 0
        else:
            intersection = len(movie_genres.intersection(other_genres))
            union = len(movie_genres.union(other_genres))
            sim = intersection / union if union > 0 else 0
        similarities.append((row['title'], sim, row['genres_list'], row['popularity']))
    
    # Sort by similarity (descending) and then by popularity (descending)
    return sorted(similarities, key=lambda x: (-x[1], -x[3]))

# 2. Euclidean Similarity for Revenue
def euclidean_similarity_revenue(movie_data, movie_title):
    """
    Euclidean distance for revenue
    Converted to similarity: 1 / (1 + distance)
    """
    # Find the movie by title
    movie = movie_data[movie_data['title'] == movie_title]
    if len(movie) == 0:
        print(f"Movie '{movie_title}' not found. Please check the title.")
        return []
    
    movie = movie.iloc[0]
    movie_revenue = movie['revenue']
    
    # Calculate Euclidean similarity for all movies
    similarities = []
    for idx, row in movie_data.iterrows():
        other_revenue = row['revenue']
        # Calculate Euclidean distance
        distance = abs(movie_revenue - other_revenue)
        # Convert to similarity (higher is more similar)
        sim = 1 / (1 + distance/1e6)  # Normalize by dividing by 1 million
        similarities.append((row['title'], sim, row['revenue'], row['popularity']))
    
    # Sort by similarity (descending) and then by popularity (descending)
    return sorted(similarities, key=lambda x: (-x[1], -x[3]))

# 3. Manhattan Similarity for Runtime
def manhattan_similarity_runtime(movie_data, movie_title):
    """
    Manhattan distance for runtime
    Converted to similarity: 1 / (1 + distance)
    """
    # Find the movie by title
    movie = movie_data[movie_data['title'] == movie_title]
    if len(movie) == 0:
        print(f"Movie '{movie_title}' not found. Please check the title.")
        return []
    
    movie = movie.iloc[0]
    movie_runtime = movie['runtime']
    
    # Calculate Manhattan similarity for all movies
    similarities = []
    for idx, row in movie_data.iterrows():
        other_runtime = row['runtime']
        # Calculate Manhattan distance
        distance = abs(movie_runtime - other_runtime)
        # Convert to similarity (higher is more similar)
        sim = 1 / (1 + distance/10)  # Normalize by dividing by 10
        similarities.append((row['title'], sim, row['runtime'], row['popularity']))
    
    # Sort by similarity (descending) and then by popularity (descending)
    return sorted(similarities, key=lambda x: (-x[1], -x[3]))

# 4. Levenshtein (Edit Distance) Similarity for Title
def levenshtein_similarity_title(movie_data, movie_title):
    """
    Levenshtein (edit) distance for titles
    Converted to similarity: 1 - (distance / max_length)
    """
    # Find the movie by title
    movie = movie_data[movie_data['title'] == movie_title]
    if len(movie) == 0:
        print(f"Movie '{movie_title}' not found. Please check the title.")
        return []
    
    movie = movie.iloc[0]
    movie_clean_title = movie['clean_title']
    
    # Calculate Levenshtein similarity for all movies
    similarities = []
    for idx, row in movie_data.iterrows():
        other_clean_title = row['clean_title']
        # Calculate Levenshtein distance
        distance = Levenshtein.distance(movie_clean_title, other_clean_title)
        # Convert to similarity (higher is more similar)
        max_len = max(len(movie_clean_title), len(other_clean_title))
        sim = 1 - (distance / max_len) if max_len > 0 else 0
        similarities.append((row['title'], sim, distance, row['popularity']))
    
    # Sort by similarity (descending) and then by popularity (descending)
    return sorted(similarities, key=lambda x: (-x[1], -x[3]))

# 5. Cosine Similarity for Budget
def cosine_similarity_budget(movie_data, movie_title):
    """
    Cosine similarity for budget
    """
    # Find the movie by title
    movie = movie_data[movie_data['title'] == movie_title]
    if len(movie) == 0:
        print(f"Movie '{movie_title}' not found. Please check the title.")
        return []
    
    movie = movie.iloc[0]
    movie_budget = movie['budget']
    
    # Calculate Cosine similarity for all movies
    similarities = []
    for idx, row in movie_data.iterrows():
        other_budget = row['budget']
        # Calculate Cosine similarity
        if movie_budget == 0 and other_budget == 0:
            sim = 1  # Both budgets are 0, consider them similar
        elif movie_budget == 0 or other_budget == 0:
            sim = 0  # One budget is 0, the other isn't
        else:
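            # Cosine similarity for 1D is just the dot product divided by magnitudes.
            # For scalars this reduces to sign(a) * sign(b), so any two positive
            # budgets score exactly 1 no matter how far apart they are
            sim = (movie_budget * other_budget) / (abs(movie_budget) * abs(other_budget))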
        similarities.append((row['title'], sim, row['budget'], row['popularity']))
    
    # Sort by similarity (descending) and then by popularity (descending)
    return sorted(similarities, key=lambda x: (-x[1], -x[3]))

def display_results(results, title, attribute_name, top_n=10):
    """Display the top N results in a formatted way"""
    print(f"\n--- Top {top_n} movies with similar {attribute_name} to '{title}' ---")
    
    # Create a DataFrame for better display in notebook
    result_df = pd.DataFrame([
        {
            'Title': movie_title,
            'Similarity': f"{similarity:.4f}",
            f'{attribute_name.capitalize()}': attribute,
            'Popularity': f"{popularity:.2f}"
        }
        for movie_title, similarity, attribute, popularity in results
        if movie_title != title  # Skip the query movie (and same-titled entries) before truncating
    ][:top_n])
    
    display(result_df)
    
    return result_df

# Run the similarity study
print("\n=== Study 1 – Similarity Measures ===")

# Define query movies
toy_story = "Toy Story"
titanic = "Titanic"
apollo_13 = "Apollo 13"
fight_club = "Fight Club"
matrix = "The Matrix"

# Verify movies exist in the dataset
for movie in [toy_story, titanic, apollo_13, fight_club, matrix]:
    if movie not in metadataDF['title'].values:
        print(f"Warning: '{movie}' not found in dataset. Using a similar title.")
        # Find closest match
        closest = metadataDF.iloc[metadataDF['title'].apply(
            lambda x: Levenshtein.distance(str(x).lower(), movie.lower())).argmin()]['title']
        print(f"Using '{closest}' instead of '{movie}'")
        if movie == toy_story:
            toy_story = closest
        elif movie == titanic:
            titanic = closest
        elif movie == apollo_13:
            apollo_13 = closest
        elif movie == fight_club:
            fight_club = closest
        elif movie == matrix:
            matrix = closest

# 1. Jaccard similarity for genres
print("\nRequest 1: Show me movies of the same genre as 'Toy Story'")
genre_results = jaccard_similarity_genres(metadataDF, toy_story)
genre_df = display_results(genre_results, toy_story, "genres")

# 2. Euclidean similarity for revenue
print("\nRequest 2: Show me movies with similar revenue to 'Titanic'")
revenue_results = euclidean_similarity_revenue(metadataDF, titanic)
revenue_df = display_results(revenue_results, titanic, "revenue")

# 3. Manhattan similarity for runtime
print("\nRequest 3: Show me movies with similar length as 'Apollo 13'")
runtime_results = manhattan_similarity_runtime(metadataDF, apollo_13)
runtime_df = display_results(runtime_results, apollo_13, "runtime (minutes)")

# 4. Levenshtein similarity for title
print("\nRequest 4: Show me movies with similar title to 'Fight Club'")
title_results = levenshtein_similarity_title(metadataDF, fight_club)
title_df = display_results(title_results, fight_club, "title")

# 5. Cosine similarity for budget
print("\nRequest 5: Show me movies with similar budget to 'The Matrix'")
budget_results = cosine_similarity_budget(metadataDF, matrix)
budget_df = display_results(budget_results, matrix, "budget")

# Store all results for further analysis
similarity_results = {
    'genre': genre_df,
    'revenue': revenue_df,
    'runtime': runtime_df,
    'title': title_df,
    'budget': budget_df
}
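
To make two of the measures above concrete, here is a small worked sketch on made-up values (the genre lists and dollar amounts are hypothetical, not taken from the dataset). It also shows why cosine similarity on a single number cannot separate movies: two positive scalars always point in the same direction. Using a vector of two features, as in the last call, is one possible alternative and not what the notebook does.

```python
import math


def jaccard(a, b):
    """Jaccard(A, B) = |A ∩ B| / |A ∪ B|, with 0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Two shared genres out of four distinct genres
genre_sim = jaccard(["Animation", "Comedy", "Family"],
                    ["Comedy", "Family", "Adventure"])
print(genre_sim)

# One-dimensional case: very different (hypothetical) budgets still score 1
scalar_sim = cosine([63_000_000], [1_000])
print(scalar_sim)

# Two-dimensional case, (budget, revenue): the angle between vectors now matters
vector_sim = cosine([63_000_000, 463_000_000], [1_000, 2_000_000])
print(vector_sim)
```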



## Study 2: Clustering Algorithms

In [None]:
# Function to perform KMeans clustering
def perform_kmeans(data, feature_pairs, k_values):
    results = {}
    
    for features in feature_pairs:
        feature_name = f"{features[0]}_vs_{features[1]}"
        results[feature_name] = {}
        
        # Extract features
        X = data[list(features)].values
        
        # Standardize features
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        for k in k_values:
            # Perform KMeans clustering
            kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
            clusters = kmeans.fit_predict(X_scaled)
            
            # Calculate silhouette score if k > 1
            if k > 1:
                silhouette = silhouette_score(X_scaled, clusters)
            else:
                silhouette = 0  # Not applicable for k=1
            
            # Store results
            results[feature_name][k] = {
                'clusters': clusters,
                'centers': scaler.inverse_transform(kmeans.cluster_centers_),
                'inertia': kmeans.inertia_,
                'silhouette': silhouette
            }
    
    return results

# Function to perform DBSCAN clustering
def perform_dbscan(data, feature_pairs, eps_values, min_samples_values):
    results = {}
    
    for features in feature_pairs:
        feature_name = f"{features[0]}_vs_{features[1]}"
        results[feature_name] = {}
        
        # Extract features
        X = data[list(features)].values
        
        # Standardize features
        scaler = StandardScaler()
        X_scaled = scaler.fit_transform(X)
        
        for eps in eps_values:
            for min_samples in min_samples_values:
                param_key = f"eps_{eps}_min_samples_{min_samples}"
                
                # Perform DBSCAN clustering
                dbscan = DBSCAN(eps=eps, min_samples=min_samples)
                clusters = dbscan.fit_predict(X_scaled)
                
                # Calculate silhouette score if more than one cluster and no noise points (-1)
                unique_clusters = np.unique(clusters)
                if len(unique_clusters) > 1 and -1 not in unique_clusters:
                    silhouette = silhouette_score(X_scaled, clusters)
                else:
                    silhouette = 0  # Not applicable
                
                # Store results
                results[feature_name][param_key] = {
                    'clusters': clusters,
                    'n_clusters': len(np.unique(clusters[clusters >= 0])),
                    'n_noise': np.sum(clusters == -1),
                    'silhouette': silhouette
                }
    
    return results

# Function to visualize clustering results
def visualize_clustering(data, feature_pairs, kmeans_results, dbscan_results, k_values, eps_values, min_samples_values):
    for features in feature_pairs:
        feature_name = f"{features[0]}_vs_{features[1]}"
        x_label = features[0]
        y_label = features[1]
        
        # Create figure for this feature pair
        fig = plt.figure(figsize=(20, 10))
        
        # Plot KMeans results
        for i, k in enumerate(k_values):
            ax = fig.add_subplot(2, len(k_values) + len(eps_values) * len(min_samples_values), i + 1)
            
            # Get cluster assignments
            clusters = kmeans_results[feature_name][k]['clusters']
            centers = kmeans_results[feature_name][k]['centers']
            
            # Plot data points colored by cluster
            scatter = ax.scatter(data[x_label], data[y_label], c=clusters, cmap='viridis', 
                       alpha=0.5, s=10)
            
            # Plot cluster centers
            ax.scatter(centers[:, 0], centers[:, 1], c='red', marker='X', s=100, label='Centroids')
            
            ax.set_title(f'KMeans (k={k})\nInertia: {kmeans_results[feature_name][k]["inertia"]:.2f}\nSilhouette: {kmeans_results[feature_name][k]["silhouette"]:.2f}')
            ax.set_xlabel(x_label)
            ax.set_ylabel(y_label)
            ax.legend()
            
            # Add colorbar
            plt.colorbar(scatter, ax=ax, label='Cluster')
        
        # Plot DBSCAN results
        plot_idx = len(k_values) + 1
        for eps in eps_values:
            for min_samples in min_samples_values:
                param_key = f"eps_{eps}_min_samples_{min_samples}"
                
                ax = fig.add_subplot(2, len(k_values) + len(eps_values) * len(min_samples_values), plot_idx)
                
                # Get cluster assignments
                clusters = dbscan_results[feature_name][param_key]['clusters']
                
                # Plot data points colored by cluster
                scatter = ax.scatter(data[x_label], data[y_label], c=clusters, cmap='viridis', 
                           alpha=0.5, s=10)
                
                ax.set_title(f'DBSCAN (eps={eps}, min_samples={min_samples})\nClusters: {dbscan_results[feature_name][param_key]["n_clusters"]}\nNoise: {dbscan_results[feature_name][param_key]["n_noise"]}\nSilhouette: {dbscan_results[feature_name][param_key]["silhouette"]:.2f}')
                ax.set_xlabel(x_label)
                ax.set_ylabel(y_label)
                
                # Add colorbar
                plt.colorbar(scatter, ax=ax, label='Cluster')
                
                plot_idx += 1
        
        plt.tight_layout()
        plt.savefig(f'clustering_{feature_name}.png', dpi=300)
        plt.show()

# Main function to run the clustering study
def run_clustering_study(metadataDF):
    movie_data = metadataDF.copy()
    movie_data = movie_data[(movie_data['budget'] > 0) & (movie_data['revenue'] > 0)]
    
    # Log transform budget and revenue (they're highly skewed)
    movie_data['log_budget'] = np.log1p(movie_data['budget'])
    movie_data['log_revenue'] = np.log1p(movie_data['revenue'])
    print(f"Loaded {len(movie_data)} movies.")
    
    # Define feature pairs to analyze
    feature_pairs = [
        ('log_budget', 'log_revenue'),  # Budget vs Revenue
        ('runtime', 'log_budget')       # Runtime vs Budget
    ]
    
    # Define parameters to test
    k_values = [3, 5]                   # Number of clusters for KMeans
    eps_values = [0.5, 1.0]             # Epsilon values for DBSCAN
    min_samples_values = [5, 10]        # Minimum samples for DBSCAN
    
    # Perform clustering
    print("\nPerforming KMeans clustering...")
    kmeans_results = perform_kmeans(movie_data, feature_pairs, k_values)
    
    print("\nPerforming DBSCAN clustering...")
    dbscan_results = perform_dbscan(movie_data, feature_pairs, eps_values, min_samples_values)
    
    # Visualize results
    print("\nVisualizing clustering results...")
    visualize_clustering(movie_data, feature_pairs, kmeans_results, dbscan_results, 
                        k_values, eps_values, min_samples_values)
    
    print("\nClustering study completed!")

if __name__ == "__main__":
    run_clustering_study(metadataDF)

The visualizations suggest that a hybrid approach might work best: DBSCAN first, to detect outliers and discover patterns, then KMeans on the remaining points for more structured segmentation. DBSCAN seems better at identifying natural clusters in the log budget vs log revenue relationship, where the two variables have a strong positive correlation. KMeans seems more effective for the runtime vs log budget relationship, where the correlation is weaker and there are fewer natural clusters, so assigning every point to a cluster works better.
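
The hybrid idea described above could be sketched roughly as follows. This is an illustration under assumptions, not something we ran on the movie data: the function name `hybrid_cluster`, the synthetic points, and the parameter values are all hypothetical, chosen only to show the two-stage flow with scikit-learn's `DBSCAN` and `KMeans`.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import StandardScaler


def hybrid_cluster(X, eps=0.5, min_samples=5, k=3, random_state=42):
    """Label DBSCAN noise as -1, then run KMeans on the remaining points."""
    X_scaled = StandardScaler().fit_transform(X)

    # Stage 1: DBSCAN marks outliers with the label -1
    dbscan_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
    kept_mask = dbscan_labels != -1

    # Stage 2: KMeans segments only the points DBSCAN did not flag as noise
    labels = np.full(len(X_scaled), -1)
    if kept_mask.sum() >= k:
        kmeans = KMeans(n_clusters=k, random_state=random_state, n_init=10)
        labels[kept_mask] = kmeans.fit_predict(X_scaled[kept_mask])
    return labels


# Synthetic demo data: three tight blobs plus two far-away outliers
rng = np.random.default_rng(0)
blobs = np.vstack([rng.normal(loc=center, scale=0.2, size=(50, 2))
                   for center in ((0, 0), (5, 5), (0, 5))])
outliers = np.array([[20.0, 20.0], [-15.0, 10.0]])
X_demo = np.vstack([blobs, outliers])

labels = hybrid_cluster(X_demo)
print("Noise points:", int(np.sum(labels == -1)))
print("Cluster labels used:", sorted(set(labels[labels >= 0].tolist())))
```

On the movie data, `X` would be one of the feature pairs from `run_clustering_study`, and `eps`, `min_samples`, and `k` would need the same kind of tuning as in the study above.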

## Study 3: Content-Based Recommendation System

### Heuristic 1: Combines Jaccard similarity on genres and Euclidean similarity on popularity.

Jaccard Similarity (Genres): This heuristic is based on the fact that movies with similar genres are likely to be similar in terms of user preferences. For instance, action movie fans might also enjoy other action movies.

Euclidean Similarity (Popularity): This approach assumes that the popularity of a movie is correlated with its appeal. More popular movies could attract users with similar interests.

### Heuristic 2: Combines Levenshtein (edit) distance on titles and Cosine similarity on budget.

Levenshtein Similarity (Titles): Title similarity is used to match movies with similar titles (e.g., sequels or movies from the same series). This helps identify movies with names that are alike, which might be conceptually linked in users' minds.

Cosine Similarity (Budget): Budget reflects production quality, which might influence the viewer's preference. Movies with similar budgets could be similar in terms of production values, targeting the same type of audience.
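
Both heuristics boil down to a weighted sum of two similarity scores for the same movie. The sketch below shows that combination on hypothetical scores keyed by a movie identifier, so that each pair of scores is guaranteed to belong to one movie; the function name `combine_scores` and the example ids are illustrative only. The 0.6 / 0.4 weights mirror the ones used for Heuristic 1.

```python
def combine_scores(scores_a, scores_b, weight_a=0.6, weight_b=0.4):
    """Combine two {movie_id: similarity} dicts into a ranked list.

    Only movies present in both dicts are scored, and the result is sorted
    from most to least similar.
    """
    shared_ids = scores_a.keys() & scores_b.keys()
    combined = {
        movie_id: weight_a * scores_a[movie_id] + weight_b * scores_b[movie_id]
        for movie_id in shared_ids
    }
    return sorted(combined.items(), key=lambda item: -item[1])


# Hypothetical similarity scores for three movies
genre_scores = {101: 1.0, 102: 0.5, 103: 0.0}
popularity_scores = {101: 0.2, 102: 0.9, 103: 1.0}

ranking = combine_scores(genre_scores, popularity_scores)
for movie_id, score in ranking:
    print(movie_id, round(score, 2))
```

Keying by identifier rather than by list position matters because each single-measure ranking is sorted by its own score, so the same position in two rankings usually refers to two different movies.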


In [None]:
def display_heuristic_results(heuristic_results, title, top_n=10):
    """
    Display the top N results for a heuristic in a formatted table.
    Args:
        heuristic_results (list): List of tuples containing (movie_title, combined_similarity, popularity).
        title (str): The movie title that the user is querying for.
        top_n (int): Number of top results to display.
    """
    # Keep the scores numeric so sorting is by value rather than by formatted string,
    # and drop the query movie before truncating to top_n
    result_df = pd.DataFrame(
        [
            {'Title': movie_title, 'Similarity': combined_similarity, 'Popularity': popularity}
            for movie_title, combined_similarity, popularity in heuristic_results
            if movie_title != title  # Skip the query movie itself
        ][:top_n],
        columns=['Title', 'Similarity', 'Popularity']
    )
    
    # Sort by combined similarity (descending) and popularity (descending)
    result_df_sorted = result_df.sort_values(by=['Similarity', 'Popularity'], ascending=[False, False])
    
    # Format the scores for display only
    result_df_sorted['Similarity'] = result_df_sorted['Similarity'].map('{:.4f}'.format)
    result_df_sorted['Popularity'] = result_df_sorted['Popularity'].map('{:.2f}'.format)
    
    # Display the DataFrame (in a Jupyter notebook, this will show the table)
    display(result_df_sorted)
    
    return result_df_sorted


# Combine Jaccard similarity on genres and Euclidean similarity on popularity
def combined_heuristic_1(movie_data, movie_title):
    # Find the movie by title
    movie = movie_data[movie_data['title'] == movie_title]
    if len(movie) == 0:
        print(f"Movie '{movie_title}' not found. Please check the title.")
        return []
    
    query = movie.iloc[0]
    query_genres = set(query['genres_list'])
    
    combined_results = []
    
    # Score each movie row by row. The single-measure functions return lists that are
    # each sorted by their own score, so zipping them would pair different movies.
    for _, row in movie_data.iterrows():
        # Jaccard similarity on genres (0 when either genre set is empty)
        other_genres = set(row['genres_list'])
        union = query_genres | other_genres
        genre_sim = len(query_genres & other_genres) / len(union) if union else 0
        
        # Euclidean distance on popularity, converted to similarity: 1 / (1 + distance)
        popularity_sim = 1 / (1 + abs(query['popularity'] - row['popularity']))
        
        combined_similarity = genre_sim * 0.6 + popularity_sim * 0.4  # The combined similarity score
        
        # Append only the relevant data: Title, Combined Similarity, and Popularity
        combined_results.append((row['title'], combined_similarity, row['popularity']))
    
    # Sort by combined similarity (descending) and then by popularity (descending)
    return sorted(combined_results, key=lambda x: (-x[1], -x[2]))

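# Combine Levenshtein similarity on titles and Cosine similarity on budget
def combined_heuristic_2(movie_data, movie_title):
    # Find the movie by title
    movie = movie_data[movie_data['title'] == movie_title]
    if len(movie) == 0:
        print(f"Movie '{movie_title}' not found. Please check the title.")
        return []
    
    query = movie.iloc[0]
    query_title = query['clean_title']
    query_budget = query['budget']
    
    combined_results = []
    
    # Score each movie row by row. The single-measure functions return lists that are
    # each sorted by their own score, so zipping them would pair different movies.
    for _, row in movie_data.iterrows():
        # Levenshtein distance on cleaned titles, converted to similarity
        other_title = row['clean_title']
        max_len = max(len(query_title), len(other_title))
        distance = Levenshtein.distance(query_title, other_title)
        title_sim = 1 - (distance / max_len) if max_len > 0 else 0
        
        # Scalar cosine similarity on budget, same rules as cosine_similarity_budget
        other_budget = row['budget']
        if query_budget == 0 and other_budget == 0:
            budget_sim = 1
        elif query_budget == 0 or other_budget == 0:
            budget_sim = 0
        else:
            budget_sim = (query_budget * other_budget) / (abs(query_budget) * abs(other_budget))
        
        combined_similarity = title_sim * 0.5 + budget_sim * 0.5  # The combined similarity score
        
        # Append only the relevant data: Title, Combined Similarity, and Popularity
        combined_results.append((row['title'], combined_similarity, row['popularity']))
    
    # Sort by combined similarity (descending) and then by popularity (descending)
    return sorted(combined_results, key=lambda x: (-x[1], -x[2]))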




# Test combined heuristic 1 with 3 requests

print("\nRequest 1: Show me movies similar to 'Toy Story' using Heuristic 1")
heuristic_1_results_toy_story = combined_heuristic_1(metadataDF, "Toy Story")
heuristic_1_df_toy_story = display_heuristic_results(heuristic_1_results_toy_story, "Toy Story", top_n=10)

print("\nRequest 2: Show me movies similar to 'Apollo 13' using Heuristic 1")
heuristic_1_results_apollo_13 = combined_heuristic_1(metadataDF, "Apollo 13")
heuristic_1_df_apollo_13 = display_heuristic_results(heuristic_1_results_apollo_13, "Apollo 13", top_n=10)

print("\nRequest 3: Show me movies similar to 'Fight Club' using Heuristic 1")
heuristic_1_results_fight_club = combined_heuristic_1(metadataDF, "Fight Club")
heuristic_1_df_fight_club = display_heuristic_results(heuristic_1_results_fight_club, "Fight Club", top_n=10)

# Test combined heuristic 2 with 3 requests

print("\nRequest 1: Show me movies similar to 'Toy Story' using Heuristic 2")
heuristic_2_results_toy_story = combined_heuristic_2(metadataDF, "Toy Story")
heuristic_2_df_toy_story = display_heuristic_results(heuristic_2_results_toy_story, "Toy Story", top_n=10)

print("\nRequest 2: Show me movies similar to 'Apollo 13' using Heuristic 2")
heuristic_2_results_apollo_13 = combined_heuristic_2(metadataDF, "Apollo 13")
heuristic_2_df_apollo_13 = display_heuristic_results(heuristic_2_results_apollo_13, "Apollo 13", top_n=10)

print("\nRequest 3: Show me movies similar to 'Fight Club' using Heuristic 2")
heuristic_2_results_fight_club = combined_heuristic_2(metadataDF, "Fight Club")
heuristic_2_df_fight_club = display_heuristic_results(heuristic_2_results_fight_club, "Fight Club", top_n=10)







### Discussion
#### Heuristic #1 
Based on my personal knowledge of the movies, the combination of genre and popularity returned movies that were quite similar to the ones in the queries. One interesting detail: when we ran `Request 2: Show me movies similar to 'Apollo 13' using Heuristic 1`, Fight Club was the second result, but when we ran the reverse, `Request 3: Show me movies similar to 'Fight Club' using Heuristic 1`, Apollo 13 was not listed as a similar movie.

#### Heuristic #2 
In contrast to heuristic #1, I don't believe that the combination of movie title and budget was an effective way to find similar movies. From personal experience, movies can have very similar titles yet be made for completely different audiences. The same goes for budget: how much a movie cost to make is not a reliable indicator of whether two movies are similar.
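
The point about titles can be illustrated with a small pure-Python version of the normalized edit-distance similarity. This is a sketch for illustration only: the notebook itself uses the `Levenshtein` package, and the function names `levenshtein` and `title_similarity` below are hypothetical stand-ins.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def title_similarity(a, b):
    """Similarity = 1 - distance / max_length, with 0 for two empty titles."""
    max_len = max(len(a), len(b))
    return 1 - levenshtein(a, b) / max_len if max_len else 0.0


# A sequel and an unrelated film can both score high against the same query
for other in ("toy story 2", "love story"):
    print(other, round(title_similarity("toy story", other), 3))
```

A sequel such as "toy story 2" scores high for the right reason, but an unrelated title that happens to share most of its characters also scores well, which is the weakness described above.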

In [None]:
# Create the utility matrix (Rui)
utility_matrix = ratingsSmallDF.pivot_table(index='userId', columns='movieId', values='rating')

# Fill NaN values with 0 (indicating unrated movies)
utility_matrix = utility_matrix.fillna(0)

# Convert utility matrix to numpy array for matrix factorization
R = utility_matrix.values

def matrix_factorization(R, K, steps=5000, alpha=0.0002, beta=0.02):
    '''
    R: rating matrix
    P: |U| * K (User features matrix)
    Q: |D| * K (Item features matrix)
    K: latent features
    steps: iterations
    alpha: learning rate
    beta: regularization parameter'''

    # Initialize P and Q randomly
    num_users, num_items = R.shape
    P = np.random.rand(num_users, K)
    Q = np.random.rand(num_items, K)
    Q = Q.T  # Transpose Q to match matrix multiplication

    # Indices of the observed (non-zero) ratings, in the same row-major order
    # as a nested loop over users and items
    rated = np.argwhere(R > 0)

    # Perform SGD to factorize R
    for step in range(steps):
        for i, j in rated:  # Only update for rated movies
            eij = R[i][j] - np.dot(P[i, :], Q[:, j])  # Calculate error

            for k in range(K):  # Update the P and Q matrices
                P[i][k] += alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                Q[k][j] += alpha * (2 * eij * P[i][k] - beta * Q[k][j])

        # Calculate the error and regularization term
        e = 0
        for i, j in rated:
            e += (R[i][j] - np.dot(P[i, :], Q[:, j]))**2
            for k in range(K):
                e += (beta / 2) * (P[i][k]**2 + Q[k][j]**2)

        # Stop if the error is below a threshold
        if e < 0.001:
            break

    return P, Q.T  # Return the factorized matrices

# Perform matrix factorization with 5 latent factors (K=5)
K = 5
P, Q = matrix_factorization(R, K)

# Check the shape of P and Q
print("P shape:", P.shape)
print("Q shape:", Q.shape)
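
Because the full ratings matrix takes a long time to factorize, the idea is easier to check on a toy example. The sketch below is a simplified pure-Python variant, not the cell above: the names `factorize`, `predict`, and `squared_error`, the 5 × 4 rating matrix, and the hyperparameters are all hypothetical, and both factors are updated from their values before the step. It shows how the learned factors give predictions for the unrated (zero) entries.

```python
import random


def predict(P, Q, user, item):
    """Predicted rating = dot product of the user and item factor vectors."""
    return sum(p * q for p, q in zip(P[user], Q[item]))


def squared_error(ratings, P, Q):
    """Sum of squared errors over the observed (non-zero) ratings only."""
    return sum((ratings[u][i] - predict(P, Q, u, i)) ** 2
               for u in range(len(ratings))
               for i in range(len(ratings[0]))
               if ratings[u][i] > 0)


def factorize(ratings, k=2, steps=2000, alpha=0.01, beta=0.02, seed=0):
    """Factorize a small rating matrix with SGD; zeros mean 'not rated'."""
    rng = random.Random(seed)
    n_users, n_items = len(ratings), len(ratings[0])
    P = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.random() for _ in range(k)] for _ in range(n_items)]
    observed = [(u, i) for u in range(n_users) for i in range(n_items)
                if ratings[u][i] > 0]
    for _ in range(steps):
        for u, i in observed:
            err = ratings[u][i] - predict(P, Q, u, i)
            for f in range(k):
                p_uf, q_if = P[u][f], Q[i][f]  # values before this update
                P[u][f] += alpha * (2 * err * q_if - beta * p_uf)
                Q[i][f] += alpha * (2 * err * p_uf - beta * q_if)
    return P, Q


# Toy utility matrix: rows are users, columns are movies, 0 = unrated
toy_ratings = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
]

P0, Q0 = factorize(toy_ratings, steps=0)  # random starting factors
P_toy, Q_toy = factorize(toy_ratings)     # trained factors, same seed
initial_error = squared_error(toy_ratings, P0, Q0)
trained_error = squared_error(toy_ratings, P_toy, Q_toy)
print("Error before training:", round(initial_error, 3))
print("Error after training:", round(trained_error, 3))
print("Predicted rating for user 0, movie 2:", round(predict(P_toy, Q_toy, 0, 2), 2))
```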

### Conclusion
In this notebook, we explored different similarity measures, clustering techniques, combined heuristics, and recommender systems. Unfortunately, we were unable to finish study #4 because of time constraints; as a next step, we would finish the implementation.

### References:
- K-means clustering algorithm: https://www.w3schools.com/python/python_ml_k-means.asp
- Matrix Factorization: https://medium.com/data-science/recommendation-system-matrix-factorization-d61978660b4b