# Movie Recommendation System

This notebook creates personalized movie recommendations using collaborative filtering based on your Letterboxd ratings and similar users' preferences.

## Features:
- Uses your complete Letterboxd rating history
- Excludes movies you've already watched 
- Gives bonus weighting to movies rated 5/5 stars by similar users
- Finds users with similar taste through cosine similarity
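
As a rough illustration of the last point, cosine similarity compares the direction of two users' rating vectors rather than their magnitude. The sketch below uses a made-up 3-user, 4-movie matrix (not data from this notebook), where 0 stands for "not rated". A user rating on a 0.5-5 scale and a user giving the same ratings doubled on a 1-10 scale come out as identical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: rows are users, columns are movies, 0 = not rated.
toy_ratings = csr_matrix(np.array([
    [5.0, 4.0, 0.0, 1.0],    # user 0, rating on a 0.5-5 scale
    [10.0, 8.0, 0.0, 2.0],   # user 1, same taste expressed on a 1-10 scale
    [1.0, 0.0, 5.0, 4.0],    # user 2, quite different taste
]))

# Compare user 0 against every user (including themselves).
toy_similarity = cosine_similarity(toy_ratings[0], toy_ratings).ravel()
for user_idx, score in enumerate(toy_similarity):
    print(f"user 0 vs user {user_idx}: {score:.3f}")
```

Unrated movies count as zeros here, so two users who share few rated titles tend to score low even when they agree on the titles they do share.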

In [2]:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
import re

# Load ratings export dataset
ratings = pd.read_csv('data/ratings_export.csv')

# Robust CSV loading for movie database with multiple fallback methods
try:
    movies = pd.read_csv('data/movie_data.csv', 
                        on_bad_lines='skip',
                        quoting=1,  # csv.QUOTE_ALL
                        engine='python')
    print("✅ Successfully loaded movies data")
except Exception as e:
    # Report why the first attempt failed instead of discarding the error
    print(f"⚠️  First attempt failed ({e}), trying alternative method...")
    try:
        movies = pd.read_csv('data/movie_data.csv', 
                            on_bad_lines='skip',
                            sep=',',
                            engine='python',
                            encoding='utf-8',
                            quotechar='"',
                            escapechar='\\')
        print("✅ Successfully loaded movies data with method 2")
    except Exception as e2:
        # Last resort: load only the first 10,000 rows
        print(f"⚠️  Second attempt failed ({e2}), loading sample data as fallback...")
        movies = pd.read_csv('data/movie_data.csv', 
                           nrows=10000,
                           on_bad_lines='skip',
                           engine='python')
        print("✅ Successfully loaded sample movies data")

# Merge datasets to create comprehensive user ratings
user_ratings = pd.merge(ratings, movies, on='movie_id')

print(f"📊 Dataset Summary:")
print(f"   • {len(ratings):,} user ratings loaded")
print(f"   • {len(movies):,} movies in database")  
print(f"   • {len(user_ratings):,} combined user-movie ratings")

# Movie name cleaning function for matching
def clean_movie_name(name):
    """Clean movie names for better matching"""
    if pd.isna(name) or not isinstance(name, str):
        return ""
    name = re.sub(r'^The\s+', '', name, flags=re.IGNORECASE)
    name = re.sub(r'\s*\([^)]*\)$', '', name)  # Remove year/info in parentheses
    name = re.sub(r'[^\w\s]', '', name)  # Remove special characters
    return name.strip().lower()

✅ Successfully loaded movies data
📊 Dataset Summary:
   • 11,078,167 user ratings loaded
   • 285,963 movies in database
   • 11,079,666 combined user-movie ratings
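
The combined table has 1,499 more rows than the ratings table (11,079,666 vs. 11,078,167). An inner merge against a movie table with unique keys cannot add rows, so some `movie_id` values must appear more than once in the movie data. Below is a minimal sketch of one way to guard against that, using toy frames rather than the real files. Keeping the first row per `movie_id` is an assumption about which duplicate is preferable.

```python
import pandas as pd

# Toy stand-ins for the ratings export and movie database (made-up values).
toy_ratings = pd.DataFrame({
    'user_id': ['a', 'b', 'c'],
    'movie_id': ['dune', 'dune', 'barbie'],
    'rating_val': [8, 9, 6],
})
toy_movies = pd.DataFrame({
    'movie_id': ['dune', 'dune', 'barbie'],   # 'dune' is listed twice
    'movie_title': ['Dune', 'Dune', 'Barbie'],
})

# Duplicated keys on the movie side inflate the row count.
naive = pd.merge(toy_ratings, toy_movies, on='movie_id')
print(len(toy_ratings), '->', len(naive))

# Keep one row per movie_id, then let pandas verify the many-to-one assumption.
unique_movies = toy_movies.drop_duplicates(subset='movie_id', keep='first')
checked = pd.merge(toy_ratings, unique_movies, on='movie_id', validate='many_to_one')
print(len(toy_ratings), '->', len(checked))
```

With `validate='many_to_one'`, pandas raises a `MergeError` if the right-hand keys are not unique, which surfaces the problem at load time instead of in the row counts.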


## Step 1: Load Your Letterboxd Ratings

In [3]:
# Let's use the original clean ratings.csv instead of the processed version
print("Loading original ratings.csv with all your clean movie ratings...")

# Load your original clean ratings
original_ratings = pd.read_csv('data/ratings.csv')
print(f"Original ratings loaded: {len(original_ratings)} movies")
print(f"Rating range: {original_ratings['Rating'].min()} to {original_ratings['Rating'].max()}")
print(f"Movies with ratings > 0: {(original_ratings['Rating'] > 0).sum()}")

# Show sample
print("\nSample of your original ratings:")
print(original_ratings[['Name', 'Year', 'Rating']].head(10))

Loading original ratings.csv with all your clean movie ratings...
Original ratings loaded: 204 movies
Rating range: 0.5 to 5.0
Movies with ratings > 0: 204

Sample of your original ratings:
                         Name  Year  Rating
0                Interstellar  2014     5.0
1        (500) Days of Summer  2009     4.5
2       Friends with Benefits  2011     3.5
3  Terminator 2: Judgment Day  1991     3.5
4               Groundhog Day  1993     4.0
5            The Hunger Games  2012     3.5
6                  About Time  2013     5.0
7                      Barbie  2023     3.0
8                        Dune  2021     4.0
9        John Wick: Chapter 4  2023     3.5


## Step 2: Load and Match Your Watched Movies

In [6]:
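# Load your watched movies to exclude them from recommendations
# Requires movies_clean_filtered and my_ratings_final, which are defined in the rating-matching cell (In [5])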
print("Loading your watched movies to exclude from recommendations...")
watched_movies = pd.read_csv('data/watched.csv')
print(f"Total watched movies: {len(watched_movies)}")

# Clean watched movie names for matching
watched_movies['clean_name'] = watched_movies['Name'].apply(clean_movie_name)

# Match watched movies with the database
watched_matched = watched_movies.merge(
    movies_clean_filtered[['tmdb_id', 'movie_title', 'year_released', 'clean_title']], 
    left_on=['clean_name', 'Year'], 
    right_on=['clean_title', 'year_released'], 
    how='inner'
)

print(f"Matched {len(watched_matched)} watched movies with database")

# Also try name-only matching for watched movies
watched_unmatched = watched_movies[~watched_movies.index.isin(watched_matched.index)]
if len(watched_unmatched) > 0:
    watched_name_matches = watched_unmatched.merge(
        movies_clean_filtered[['tmdb_id', 'movie_title', 'year_released', 'clean_title']],
        left_on='clean_name',
        right_on='clean_title',
        how='inner'
    )
    if len(watched_name_matches) > 0:
        watched_matched = pd.concat([watched_matched, watched_name_matches], ignore_index=True)
        print(f"Found {len(watched_name_matches)} additional watched movie matches")

# Create set of all movies you've watched (both rated and unrated)
all_watched_tmdb_ids = set(watched_matched['tmdb_id'].unique())
rated_tmdb_ids = set(my_ratings_final['tmdb_id'].unique())

print(f"Total unique movies you've watched: {len(all_watched_tmdb_ids)}")
print(f"Movies you've rated: {len(rated_tmdb_ids)}")
print(f"Movies watched but not rated: {len(all_watched_tmdb_ids - rated_tmdb_ids)}")

# Sample of watched but not rated movies
unrated_watched = all_watched_tmdb_ids - rated_tmdb_ids
if unrated_watched:
    sample_unrated = list(unrated_watched)[:5]
    print("Sample movies watched but not rated:")
    for tmdb_id in sample_unrated:
        movie_info = movies_clean_filtered[movies_clean_filtered['tmdb_id'] == tmdb_id]
        if len(movie_info) > 0:
            title = movie_info.iloc[0]['movie_title']
            year = movie_info.iloc[0]['year_released']
            print(f"  - {title} ({year})")

Loading your watched movies to exclude from recommendations...
Total watched movies: 426
Matched 401 watched movies with database
Found 57 additional watched movie matches
Total unique movies you've watched: 419
Movies you've rated: 198
Movies watched but not rated: 226
Sample movies watched but not rated:
  - Triple Frontier (2019.0)
  - Neighbors (2014.0)
  - Pitch Perfect 2 (2015.0)
  - Deadpool 2 (2018.0)
  - Fantastic Four (2005.0)
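
The counts above report more matches (401 + 57 = 458) than watched films (426). One contributing factor is that `watched_matched.index` is the fresh 0..n-1 index produced by `merge`, not the index of `watched_movies`, so the `isin` filter does not isolate the rows that failed to match. Below is a sketch of an alternative on toy data. It tags every original row with a hypothetical `row_id` column before merging, so the leftovers can be recovered reliably.

```python
import pandas as pd

# Made-up watched list and movie database.
watched = pd.DataFrame({
    'Name': ['Dune', 'Heat', 'Solaris'],
    'Year': [2021, 1995, 1972],
    'clean_name': ['dune', 'heat', 'solaris'],
})
database = pd.DataFrame({
    'tmdb_id': [1, 2, 3],
    'clean_title': ['dune', 'heat', 'solaris'],
    'year_released': [2021.0, 1995.0, 2002.0],
})

# Hypothetical helper column that survives the merge and identifies the source row.
watched['row_id'] = range(len(watched))

exact = watched.merge(database, left_on=['clean_name', 'Year'],
                      right_on=['clean_title', 'year_released'], how='inner')

# Rows whose row_id never appeared in the exact match are the true leftovers.
leftover = watched[~watched['row_id'].isin(exact['row_id'])]
by_name = leftover.merge(database, left_on='clean_name', right_on='clean_title', how='inner')

matched = pd.concat([exact, by_name], ignore_index=True).drop_duplicates(subset='row_id')
print(f"{matched['row_id'].nunique()} of {len(watched)} watched films matched")
```

For an exclusion list it may be safer to skip the final `drop_duplicates` and keep every candidate `tmdb_id`, since excluding an extra same-titled film costs little.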


In [5]:
# Match your rated movies with the movie database using movie names
print("Matching your rated movies with the movie database...")

# Clean names in both datasets
original_ratings['clean_name'] = original_ratings['Name'].apply(clean_movie_name)

# Clean the movies dataset
movies_clean = movies.dropna(subset=['movie_title', 'year_released'])
movies_clean_filtered = movies_clean.copy()
movies_clean_filtered['clean_title'] = movies_clean_filtered['movie_title'].apply(clean_movie_name)

# Try matching by cleaned names and year
matched_movies = original_ratings.merge(
    movies_clean_filtered[['tmdb_id', 'movie_title', 'year_released', 'clean_title']], 
    left_on=['clean_name', 'Year'], 
    right_on=['clean_title', 'year_released'], 
    how='inner'
)

print(f"Matched {len(matched_movies)} movies by exact name and year")

# For unmatched movies, try matching by name only (ignore year)
unmatched = original_ratings[~original_ratings.index.isin(matched_movies.index)]
if len(unmatched) > 0:
    print(f"Attempting name-only matching for {len(unmatched)} unmatched movies...")
    
    name_only_matches = unmatched.merge(
        movies_clean_filtered[['tmdb_id', 'movie_title', 'year_released', 'clean_title']],
        left_on='clean_name',
        right_on='clean_title',
        how='inner'
    )
    
    if len(name_only_matches) > 0:
        # Remove the suffixes from the merge
        name_only_matches = name_only_matches.drop(columns=['clean_title'])
        matched_movies = pd.concat([matched_movies, name_only_matches], ignore_index=True)
        print(f"Found {len(name_only_matches)} additional matches by name only")

print(f"\nFinal matching results: {len(matched_movies)} out of {len(original_ratings)} movies matched")
print(f"Match rate: {len(matched_movies)/len(original_ratings)*100:.1f}%")

# Create the final ratings dataset
my_ratings_final = matched_movies[['tmdb_id', 'Rating']].copy()
my_ratings_final['user_id'] = "brimell"

print(f"\nYour final ratings dataset: {len(my_ratings_final)} movies")
print(f"Rating distribution:")
rating_dist = my_ratings_final['Rating'].value_counts().sort_index()
for rating, count in rating_dist.items():
    print(f"  {rating}: {count} movies")

# Show sample matches
print(f"\nSample matched movies:")
sample_matches = matched_movies[['Name', 'movie_title', 'Year', 'year_released', 'Rating']].head(10)
for _, row in sample_matches.iterrows():
    print(f"  {row['Name']} ({row['Year']}) -> {row['movie_title']} ({row['year_released']}) - Rating: {row['Rating']}")

Matching your rated movies with the movie database...
Matched 174 movies by exact name and year
Attempting name-only matching for 30 unmatched movies...
Found 57 additional matches by name only

Final matching results: 231 out of 204 movies matched
Match rate: 113.2%

Your final ratings dataset: 231 movies
Rating distribution:
  0.5: 1 movies
  1.0: 6 movies
  1.5: 6 movies
  2.0: 13 movies
  2.5: 11 movies
  3.0: 20 movies
  3.5: 43 movies
  4.0: 80 movies
  4.5: 37 movies
  5.0: 14 movies

Sample matched movies:
  Interstellar (2014) -> Interstellar (2014.0) - Rating: 5.0
  (500) Days of Summer (2009) -> (500) Days of Summer (2009.0) - Rating: 4.5
  Friends with Benefits (2011) -> Friends with Benefits (2011.0) - Rating: 3.5
  Terminator 2: Judgment Day (1991) -> Terminator 2: Judgment Day (1991.0) - Rating: 3.5
  Groundhog Day (1993) -> Groundhog Day (1993.0) - Rating: 4.0
  The Hunger Games (2012) -> The Hunger Games (2012.0) - Rating: 3.5
  About Time (2013) -> About Time (2013.0)
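
A match rate above 100% (231 matches for 204 ratings) means some rated films were joined to more than one database row, so those ratings are counted several times downstream. One possible remedy, sketched on made-up rows, is to keep only the candidate whose release year is closest to the Letterboxd year. The `year_gap` column and the tie-breaking rule are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy candidates from a name-only merge: one rated film can match several database rows.
candidates = pd.DataFrame({
    'Name': ['Dune', 'Dune', 'Dune', 'About Time'],
    'Year': [2021, 2021, 2021, 2013],
    'Rating': [4.0, 4.0, 4.0, 5.0],
    'tmdb_id': [101, 102, 103, 201],          # hypothetical ids
    'year_released': [1984.0, 2021.0, 2000.0, 2013.0],
})

# Distance between the Letterboxd year and each candidate's release year.
candidates['year_gap'] = (candidates['Year'] - candidates['year_released']).abs()

# Keep the closest candidate per rated film, then restore the original order.
best = (candidates.sort_values('year_gap', kind='mergesort')
                  .drop_duplicates(subset=['Name', 'Year'], keep='first')
                  .sort_index())
print(best[['Name', 'Year', 'tmdb_id', 'year_gap']])
```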

## Step 3: Match Ratings with Movie Database

In [7]:
# Now rebuild the recommendation system with the complete dataset
print("Rebuilding recommendation system with your complete rating dataset...")

# Update the variable to use our new complete dataset
my_ratings_updated = my_ratings_final

# Rebuild the user mapping
combined_ratings_new = pd.concat([user_ratings, my_ratings_updated.rename(columns={'Rating': 'rating_val'})])
combined_ratings_new = combined_ratings_new.dropna(subset=['user_id', 'rating_val'])

# Create new mappings
tmdb_id_to_idx_new = {tmdb_id: i for i, tmdb_id in enumerate(combined_ratings_new['tmdb_id'].unique())}
user_id_to_idx_new = {user_id: i + 1 for i, user_id in enumerate(combined_ratings_new['user_id'].unique())}
user_id_to_idx_new["brimell"] = 0

print(f"You now have {len(my_ratings_updated)} rated movies in the system")
print(f"Total combined ratings: {len(combined_ratings_new)}")

# Find similar users
your_movies_new = combined_ratings_new[combined_ratings_new['user_id'] == "brimell"]
common_movies_new = pd.merge(your_movies_new, combined_ratings_new, on='tmdb_id')
common_movies_count_new = common_movies_new.groupby('user_id_y').size()

# Use a lower threshold since we have more movies
min_common = 10
filtered_user_ids_new = common_movies_count_new[common_movies_count_new >= min_common].index
filtered_combined_ratings_new = combined_ratings_new[combined_ratings_new['user_id'].isin(filtered_user_ids_new)]

print(f"Found {len(filtered_user_ids_new)} users who have rated at least {min_common} movies in common")
print(f"Average movies in common: {common_movies_count_new.mean():.1f}")
print(f"Max movies in common: {common_movies_count_new.max()}")

# Create sparse matrix for similarity computation
rows = filtered_combined_ratings_new['user_id'].map(user_id_to_idx_new)
cols = filtered_combined_ratings_new['tmdb_id'].map(tmdb_id_to_idx_new) 
data = filtered_combined_ratings_new['rating_val']
ratings_matrix_new = csr_matrix((data, (rows, cols)), shape=(len(user_id_to_idx_new), len(tmdb_id_to_idx_new)))

print("Computing user similarities...")
user_similarity_new = cosine_similarity(ratings_matrix_new)

# Find most similar users
top_similar_indices_new = np.argsort(-user_similarity_new[0])[1:11]
idx_to_user_new = {v: k for k, v in user_id_to_idx_new.items()}

print("\nTop 10 most similar users:")
for i, idx in enumerate(top_similar_indices_new, 1):
    if idx in idx_to_user_new:
        user_id = idx_to_user_new[idx]
        similarity = user_similarity_new[0][idx]
        print(f"{i:2d}. User: {user_id}, Similarity: {similarity:.4f}")

Rebuilding recommendation system with your complete rating dataset...
You now have 231 rated movies in the system
Total combined ratings: 11079897
Found 7156 users who have rated at least 10 movies in common
Average movies in common: 167.4
Max movies in common: 7232
Computing user similarities...

Top 10 most similar users:
 1. User: inesrmarques, Similarity: 0.5108
 2. User: drtfx7, Similarity: 0.5087
 3. User: starrysunflower, Similarity: 0.5084
 4. User: lamalgre, Similarity: 0.5074
 5. User: seysant, Similarity: 0.5069
 6. User: cesarjaimes, Similarity: 0.5055
 7. User: rutuj_p, Similarity: 0.5044
 8. User: nateakira, Similarity: 0.5011
 9. User: vv4nte, Similarity: 0.5006
10. User: ajasmine, Similarity: 0.4992
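
Only row 0 of the similarity matrix (your own row) is used in this cell, so the full user-by-user matrix is not strictly needed. The sketch below, on an invented toy matrix, computes just that row by passing a second argument to `cosine_similarity`. It also removes your own index explicitly rather than assuming it sorts first.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy (user_idx, movie_idx, rating) triplets; user 0 plays the role of "you".
rows = np.array([0, 0, 0, 1, 1, 2, 2, 3])
cols = np.array([0, 1, 2, 0, 1, 1, 2, 3])
vals = np.array([5.0, 4.0, 1.0, 10.0, 8.0, 2.0, 9.0, 7.0])
toy_matrix = csr_matrix((vals, (rows, cols)), shape=(4, 4))

# One row of similarities: user 0 against everyone.
my_similarities = cosine_similarity(toy_matrix[0], toy_matrix).ravel()

# Drop yourself explicitly, then rank the remaining users by similarity.
candidate_idx = np.flatnonzero(np.arange(toy_matrix.shape[0]) != 0)
ranked = candidate_idx[np.argsort(-my_similarities[candidate_idx])]
for idx in ranked[:3]:
    print(f"user {idx}: {my_similarities[idx]:.4f}")
```

Note that `csr_matrix` built from triplets sums duplicate (row, column) entries, so a user-movie pair that appears twice in the ratings table would be stored as the sum of its ratings. Deduplicating the ratings first avoids that.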


## Step 4: Build Collaborative Filtering System

In [9]:
# Generate final movie recommendations with EXTREME IMPACT WEIGHTING
# EXCLUDING all movies you've watched (both rated and unrated)
print("Generating movie recommendations with extreme rating impact weighting...")
print("Excluding ALL movies you've watched (rated + unrated)...")
print("🎯 EXTREME WEIGHTING: Ratings far from average get exponentially higher impact!")

# Use ALL watched movies (not just rated ones) for exclusion
my_watched_movies_set = all_watched_tmdb_ids  # This includes both rated and unrated
print(f"Excluding {len(my_watched_movies_set)} movies you've already watched")

recommended_movies_dict = {}

# Use up to 50 similar users (top_similar_indices_new holds only the 10 found earlier, so at most 10 are used)
top_users_count = min(50, len(top_similar_indices_new))
print(f"Using top {top_users_count} similar users for recommendations")

for idx in top_similar_indices_new[:top_users_count]:
    if idx in idx_to_user_new:
        user_id = idx_to_user_new[idx] 
        user_similarity_score = user_similarity_new[0][idx]
        
        # Get ratings by this user (named neighbor_ratings so the merged
        # user_ratings DataFrame from the first cell is not overwritten)
        neighbor_ratings = filtered_combined_ratings_new[
            filtered_combined_ratings_new['user_id'] == user_id
        ]
        
        # Pre-calculate weights for all this user's ratings
        user_ratings_copy = neighbor_ratings.copy()
        user_ratings_copy['rating_weight'] = user_ratings_copy['rating_val'].apply(
            lambda r: extreme_impact_weight(r, dataset_rating_mean, dataset_rating_std)
        )
        user_ratings_copy['weighted_rating'] = user_ratings_copy['rating_val'] * user_ratings_copy['rating_weight']
        
        # Only consider ratings that have significant impact (weight > 1.5 OR rating >= 7)
        significant_ratings = user_ratings_copy[
            (user_ratings_copy['rating_weight'] > 1.5) | (user_ratings_copy['rating_val'] >= 7)
        ]
        
        for _, row in significant_ratings.iterrows():
            tmdb_id = row['tmdb_id']
            rating = row['rating_val']
            weighted_rating = row['weighted_rating']
            rating_weight = row['rating_weight']
            
            # Exclude movies you've watched (both rated and unrated)
            if tmdb_id not in my_watched_movies_set:
                if tmdb_id not in recommended_movies_dict:
                    recommended_movies_dict[tmdb_id] = {
                        'users': [user_id], 
                        'ratings': [rating], 
                        'weighted_ratings': [weighted_rating],
                        'rating_weights': [rating_weight],
                        'similarities': [user_similarity_score],
                        'perfect_ratings': 1 if rating == 5.0 else 0
                    }
                else:
                    recommended_movies_dict[tmdb_id]['users'].append(user_id)
                    recommended_movies_dict[tmdb_id]['ratings'].append(rating)
                    recommended_movies_dict[tmdb_id]['weighted_ratings'].append(weighted_rating)
                    recommended_movies_dict[tmdb_id]['rating_weights'].append(rating_weight)
                    recommended_movies_dict[tmdb_id]['similarities'].append(user_similarity_score)
                    if rating == 5.0:
                        recommended_movies_dict[tmdb_id]['perfect_ratings'] += 1

print(f"Found {len(recommended_movies_dict)} potential NEW recommendations")

# Score recommendations using EXTREME WEIGHTED ratings
scored_recommendations = []
extreme_weight_count = 0

for tmdb_id, data in recommended_movies_dict.items():
    avg_rating = np.mean(data['ratings'])
    avg_weighted_rating = np.mean(data['weighted_ratings'])
    avg_rating_weight = np.mean(data['rating_weights'])
    avg_similarity = np.mean(data['similarities'])
    num_recommenders = len(data['users'])
    perfect_ratings = data['perfect_ratings']
    
    # NEW SCORING: Use weighted ratings for much more extreme impact
    base_score = avg_weighted_rating * avg_similarity * np.log(1 + num_recommenders)
    
    # Smaller perfect bonus since extreme weighting handles this
    perfect_bonus = 1.0 + (perfect_ratings * 0.1)
    
    # Track movies that got extreme weighting boost
    if avg_rating_weight > 2.0:
        extreme_weight_count += 1
    
    combined_score = base_score * perfect_bonus
    
    scored_recommendations.append({
        'tmdb_id': tmdb_id,
        'avg_rating': avg_rating,
        'avg_weighted_rating': avg_weighted_rating,
        'avg_rating_weight': avg_rating_weight,
        'avg_similarity': avg_similarity, 
        'num_recommenders': num_recommenders,
        'perfect_ratings': perfect_ratings,
        'combined_score': combined_score
    })

print(f"🎯 {extreme_weight_count} movies received extreme impact weighting (>2x)!")

# Sort by combined score and get top 50
top_recommendations = sorted(scored_recommendations, key=lambda x: x['combined_score'], reverse=True)[:50]

# Get movie titles
final_recommendations = []
for rec in top_recommendations:
    movie_match = movies_clean_filtered[movies_clean_filtered['tmdb_id'] == rec['tmdb_id']]
    if len(movie_match) > 0:
        title = movie_match.iloc[0]['movie_title']
        year = movie_match.iloc[0]['year_released']
        final_recommendations.append({
            'title': title,
            'year': int(year) if pd.notna(year) else 'Unknown',
            'avg_rating': rec['avg_rating'],
            'avg_weighted_rating': rec['avg_weighted_rating'],
            'avg_rating_weight': rec['avg_rating_weight'],
            'avg_similarity': rec['avg_similarity'],
            'num_recommenders': rec['num_recommenders'],
            'perfect_ratings': rec['perfect_ratings'],
            'combined_score': rec['combined_score']
        })

print(f"\nTop {len(final_recommendations)} NEW Movie Recommendations (with EXTREME IMPACT):")
print("=" * 95)
print("✨ These are movies you HAVEN'T watched yet! ✨")
print("🎯 Extreme ratings get exponentially higher impact! 🎯")
print("=" * 95)

for i, movie in enumerate(final_recommendations, 1):
    extreme_indicator = f" ⚡×{movie['avg_rating_weight']:.1f}" if movie['avg_rating_weight'] > 2.0 else ""
    perfect_indicator = f" 🌟×{movie['perfect_ratings']}" if movie['perfect_ratings'] > 0 else ""
    
    print(f"{i:2d}. {movie['title']} ({movie['year']}){extreme_indicator}{perfect_indicator}")
    print(f"    Rating: {movie['avg_rating']:.2f}/10 (weighted: {movie['avg_weighted_rating']:.1f}) | Similarity: {movie['avg_similarity']:.3f} | {movie['num_recommenders']} users")
    print()

Generating movie recommendations with extreme rating impact weighting...
Excluding ALL movies you've watched (rated + unrated)...
🎯 EXTREME WEIGHTING: Ratings far from average get exponentially higher impact!
Excluding 419 movies you've already watched
Using top 10 similar users for recommendations
Found 2647 potential NEW recommendations
🎯 2110 movies received extreme impact weighting (>2x)!

Top 50 NEW Movie Recommendations (with EXTREME IMPACT):
✨ These are movies you HAVEN'T watched yet! ✨
🎯 Extreme ratings get exponentially higher impact! 🎯
 1. Gone Girl (2014) ⚡×4.4
    Rating: 9.44/10 (weighted: 43.2) | Similarity: 0.505 | 9 users

 2. Portrait of a Lady on Fire (2019) ⚡×4.8
    Rating: 9.71/10 (weighted: 47.4) | Similarity: 0.506 | 7 users

 3. Call Me by Your Name (2017) ⚡×4.6 🌟×1
    Rating: 7.88/10 (weighted: 34.7) | Similarity: 0.505 | 8 users

 4. Lady Bird (2017) ⚡×4.8
    Rating: 9.60/10 (weighted: 46.7) | Similarity: 0.503 | 5 users

 5. Phantom Thread (2017) ⚡×4.6
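
The ranking score in this cell is the mean weighted rating, times the mean similarity, times log(1 + number of recommenders), multiplied by a small bonus per perfect rating. A standalone sketch of that formula with invented inputs follows. `perfect_value` is a hypothetical parameter added here because the cell tests `rating == 5.0` while the neighbours' ratings are printed out of 10, so a top mark in that data may actually be 10.0.

```python
import numpy as np

def combined_score(ratings, weights, similarities, perfect_value=10.0, bonus_per_perfect=0.1):
    """Score one candidate movie from its neighbours' ratings (sketch of the cell's formula)."""
    ratings = np.asarray(ratings, dtype=float)
    weights = np.asarray(weights, dtype=float)
    similarities = np.asarray(similarities, dtype=float)

    avg_weighted_rating = np.mean(ratings * weights)
    avg_similarity = np.mean(similarities)
    num_recommenders = len(ratings)
    perfect_ratings = int(np.sum(ratings == perfect_value))

    # Same structure as the cell: weighted rating x similarity x log(1 + recommenders).
    base_score = avg_weighted_rating * avg_similarity * np.log(1 + num_recommenders)
    return base_score * (1.0 + bonus_per_perfect * perfect_ratings)

# Invented example: three neighbours, one of whom gave a top mark.
score = combined_score(ratings=[10.0, 9.0, 8.0],
                       weights=[5.0, 3.0, 2.0],
                       similarities=[0.51, 0.50, 0.50])
print(f"{score:.2f}")
```

The logarithm keeps a film recommended by nine neighbours from simply outscoring one recommended by five on headcount alone.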
    

## Step 5: Generate Personalized Recommendations

This step generates movie recommendations using the collaborative filtering system with special bonus weighting for movies that similar users rated 5/5 stars.

In [8]:
# Implement extreme rating impact weighting to flatten effective distribution
print("📊 Implementing extreme rating impact weighting...")

# Check your rating distribution
your_rating_mean = my_ratings_updated['Rating'].mean()
your_rating_std = my_ratings_updated['Rating'].std()
print(f"Your rating mean: {your_rating_mean:.2f}")
print(f"Your rating std: {your_rating_std:.2f}")

# Check dataset rating distribution  
dataset_rating_mean = filtered_combined_ratings_new['rating_val'].mean()
dataset_rating_std = filtered_combined_ratings_new['rating_val'].std()
print(f"Dataset rating mean: {dataset_rating_mean:.2f}")
print(f"Dataset rating std: {dataset_rating_std:.2f}")

def extreme_impact_weight(rating, mean_rating, std_rating):
    """
    Apply exponential weighting based on distance from mean:
    - Ratings close to mean get weight ~1.0 (normal impact)
    - Ratings far from mean get exponentially higher weight
    - This flattens the effective distribution by amplifying extremes
    """
    # Calculate z-score (standard deviations from mean)
    z_score = abs(rating - mean_rating) / std_rating
    
    # Exponential weighting: weight = e^(z_score)
    # This gives exponentially more weight to ratings far from average
    weight = np.exp(z_score)
    
    return weight

# Test the weighting function
test_ratings = [0.5, 2.0, 4.0, 5.0, 6.5, 8.0, 10.0]
print(f"\n🎯 Extreme impact weighting examples (mean = {dataset_rating_mean:.1f}, std = {dataset_rating_std:.1f}):")
print("Rating | Z-Score | Weight")
print("-" * 28)

for rating in test_ratings:
    z_score = abs(rating - dataset_rating_mean) / dataset_rating_std
    weight = extreme_impact_weight(rating, dataset_rating_mean, dataset_rating_std)
    print(f"{rating:6.1f} | {z_score:7.2f} | {weight:6.2f}x")

print(f"\n✅ Extreme ratings will now have exponentially higher impact!")
print(f"   • Ratings near average ({dataset_rating_mean:.1f}): ~1x weight")  
print(f"   • Ratings 1 std away: ~{np.exp(1):.1f}x weight")
print(f"   • Ratings 2 std away: ~{np.exp(2):.1f}x weight")
print(f"   • Very extreme ratings: 20x+ weight")

📊 Implementing extreme rating impact weighting...
Your rating mean: 3.62
Your rating std: 0.93
Dataset rating mean: 6.49
Dataset rating std: 2.08

🎯 Extreme impact weighting examples (mean = 6.5, std = 2.1):
Rating | Z-Score | Weight
----------------------------
   0.5 |    2.88 |  17.85x
   2.0 |    2.16 |   8.67x
   4.0 |    1.20 |   3.31x
   5.0 |    0.72 |   2.05x
   6.5 |    0.01 |   1.01x
   8.0 |    0.73 |   2.07x
  10.0 |    1.69 |   5.43x

✅ Extreme ratings will now have exponentially higher impact!
   • Ratings near average (6.5): ~1x weight
   • Ratings 1 std away: ~2.7x weight
   • Ratings 2 std away: ~7.4x weight
   • Very extreme ratings: 20x+ weight
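
Because the weight depends on the absolute z-score, low ratings are amplified as much as high ones. When the recommendation cell multiplies each rating by its weight, a strongly disliked film can end up contributing a larger weighted rating than an average one. The sketch below recomputes this with NumPy using the rounded mean and standard deviation printed above (6.49 and 2.08), so the last digits may differ slightly from the table. The zero-spread guard is an added assumption.

```python
import numpy as np

def extreme_impact_weights(ratings, mean_rating, std_rating):
    """Vectorised form of the weighting above: exp(|rating - mean| / std)."""
    ratings = np.asarray(ratings, dtype=float)
    if std_rating <= 0:
        # Assumption: with no spread, fall back to neutral weights instead of dividing by zero.
        return np.ones_like(ratings)
    return np.exp(np.abs(ratings - mean_rating) / std_rating)

ratings = np.array([2.0, 6.5, 10.0])
weights = extreme_impact_weights(ratings, mean_rating=6.49, std_rating=2.08)
weighted = ratings * weights

for r, w, wr in zip(ratings, weights, weighted):
    print(f"rating {r:4.1f} | weight {w:5.2f}x | rating x weight {wr:6.2f}")
```

If that behaviour is not wanted, one option is to apply the weight only to ratings above the mean. That would be a design change, not what the notebook currently does.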


In [72]:
# Save the complete NEW movie recommendations to file (with 5⭐ bonus weighting)
filename = 'data/movie_recommendations_NEW_with_5star_bonus.txt'
with open(filename, 'w', encoding='utf-8') as f:  # explicit UTF-8 so the emoji write on every platform
    f.write("🎬 NEW MOVIE RECOMMENDATIONS (UNWATCHED) with 5⭐ BONUS\n")
    f.write("=" * 70 + "\n")
    f.write(f"Based on {len(my_ratings_updated)} of your movie ratings\n")
    f.write(f"Excluding {len(my_watched_movies_set)} movies you've already watched\n")
    f.write(f"Analyzed {len(filtered_user_ids_new)} users with similar taste\n")
    f.write(f"🌟 BONUS WEIGHTING for movies rated 5.0/5.0 (perfect Letterboxd ratings)!\n")
    f.write(f"Generated on: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
    f.write("✨ These are movies you HAVEN'T watched yet! ✨\n\n")
    
    for i, movie in enumerate(final_recommendations, 1):
        perfect_indicator = f" 🌟×{movie['perfect_ratings']}" if movie['perfect_ratings'] > 0 else ""
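        # 'perfect_bonus' is not stored by the scoring cell (In [9]), so default to 1.0 (no bonus text)
        perfect_bonus = movie.get('perfect_bonus', 1.0)
        bonus_text = f" (×{perfect_bonus:.1f} bonus)" if perfect_bonus > 1.0 else ""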
        
        f.write(f"{i:2d}. {movie['title']} ({movie['year']}){perfect_indicator}\n")
        f.write(f"    ⭐ Rating: {movie['avg_rating']:.2f}/5.0\n")
        f.write(f"    👥 Recommended by: {movie['num_recommenders']} similar users\n")
        f.write(f"    🎯 Similarity Score: {movie['avg_similarity']:.3f}\n")
        f.write(f"    📊 Combined Score: {movie['combined_score']:.2f}{bonus_text}\n")
        if movie['perfect_ratings'] > 0:
            f.write(f"    🌟 Perfect 5.0/5.0 ratings: {movie['perfect_ratings']}\n")
        f.write("\n")

print(f"✅ Saved {len(final_recommendations)} NEW recommendations (with 5⭐ bonus) to '{filename}'")

# Count movies with perfect ratings in final recommendations
perfect_movies_in_recs = sum(1 for movie in final_recommendations if movie['perfect_ratings'] > 0)
total_perfect_ratings = sum(movie['perfect_ratings'] for movie in final_recommendations)

# Updated summary statistics
print(f"\n📈 UPDATED RECOMMENDATION SYSTEM SUMMARY (with 5⭐ bonus):")
print(f"━" * 70)
print(f"Your Ratings:              {len(my_ratings_updated)} movies (0.5-5.0 Letterboxd scale)")
print(f"Total Movies Watched:      {len(my_watched_movies_set)} movies (rated + unrated)")
print(f"Movies Rated:             {len(rated_tmdb_ids)} movies")  
print(f"Movies Watched (No Rating): {len(my_watched_movies_set - rated_tmdb_ids)} movies")
print(f"Similar Users:             {len(filtered_user_ids_new)} users found")
print(f"Avg Movies Shared:         {common_movies_count_new.mean():.1f} movies per user")
print(f"Top Similarity:            {user_similarity_new[0][top_similar_indices_new[0]]:.3f}")
print(f"NEW Recommendations:       {len(final_recommendations)} movies you haven't watched")
print(f"🌟 Movies with 5⭐ bonus:   {perfect_movies_in_recs} movies ({total_perfect_ratings} perfect ratings)")
print(f"Database Match Rate:       {len(watched_matched)/len(watched_movies)*100:.1f}% (watched movies)")
print(f"\n🎯 All recommendations are movies you've NEVER seen before!")
print(f"🌟 Movies rated 5.0/5.0 by similar users get 1.5x-2.5x bonus weighting!")
print(f"📊 Perfect 5-star ratings in dataset: {five_star_ratings:,} ({five_star_ratings/total_ratings*100:.2f}%)")

✅ Saved 50 NEW recommendations (with 5⭐ bonus) to 'data/movie_recommendations_NEW_with_5star_bonus.txt'

📈 UPDATED RECOMMENDATION SYSTEM SUMMARY (with 5⭐ bonus):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Your Ratings:              239 movies (0.5-5.0 Letterboxd scale)
Total Movies Watched:      426 movies (rated + unrated)
Movies Rated:             208 movies
Movies Watched (No Rating): 229 movies
Similar Users:             7019 users found
Avg Movies Shared:         78.4 movies per user
Top Similarity:            0.323
NEW Recommendations:       50 movies you haven't watched
🌟 Movies with 5⭐ bonus:   32 movies (36 perfect ratings)
Database Match Rate:       108.5% (watched movies)

🎯 All recommendations are movies you've NEVER seen before!
🌟 Movies rated 5.0/5.0 by similar users get 1.5x-2.5x bonus weighting!
📊 Perfect 5-star ratings in dataset: 1,110,093 (10.06%)


# Machine Learning Content-Based Recommendations

This section trains ML models using your rating patterns to predict ratings for unwatched movies based on their features (genres, cast, directors, year, etc.). This complements the collaborative filtering approach by learning your personal preferences.
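
To make the idea concrete, here is a minimal sketch of the kind of model this section has in mind: a regressor trained on numeric movie features to predict a personal rating, evaluated with cross-validation. The data is entirely synthetic (random years and runtimes, with ratings loosely tied to runtime), and the choice of `RandomForestRegressor` with these settings is an assumption for illustration, not the notebook's final model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for "movies you rated": columns are year_released and runtime.
X = np.column_stack([
    rng.integers(1970, 2024, size=200),
    rng.integers(80, 180, size=200),
])
# Synthetic ratings on the 0.5-5 Letterboxd scale, loosely tied to runtime plus noise.
y = np.clip(2.0 + (X[:, 1] - 80) / 50 + rng.normal(0, 0.5, size=200), 0.5, 5.0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
mae_scores = -cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
print(f"Cross-validated MAE: {mae_scores.mean():.2f} stars")

# Fit on everything and predict a rating for one hypothetical unwatched film.
model.fit(X, y)
predicted = model.predict(np.array([[2019, 130]]))[0]
print(f"Predicted rating: {predicted:.2f}")
```

With only a few hundred personal ratings, cross-validated error is a more trustworthy guide than training error when comparing models.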

In [11]:
# Prepare data for ML training
print("🤖 Preparing data for machine learning content-based recommendations...")

# Import additional ML libraries
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import mean_squared_error, mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

# Get movie features for ML training
print(f"Available movie features: {movies_clean_filtered.columns.tolist()}")

# Select relevant features for training
feature_columns = []
if 'year_released' in movies_clean_filtered.columns:
    feature_columns.append('year_released')
if 'genre' in movies_clean_filtered.columns:
    feature_columns.append('genre')
if 'runtime' in movies_clean_filtered.columns:
    feature_columns.append('runtime')
if 'imdb_rating' in movies_clean_filtered.columns:
    feature_columns.append('imdb_rating')
if 'imdb_votes' in movies_clean_filtered.columns:
    feature_columns.append('imdb_votes')

print(f"Using features for ML: {feature_columns}")

# Create training dataset by merging your ratings with movie features
ml_training_data = my_ratings_final.merge(
    movies_clean_filtered[['tmdb_id'] + feature_columns], 
    on='tmdb_id', 
    how='inner'
)

print(f"ML training dataset: {len(ml_training_data)} movies with features")
print(f"Your rating distribution for ML:")
print(ml_training_data['Rating'].value_counts().sort_index())

🤖 Preparing data for machine learning content-based recommendations...
Available movie features: ['_id', 'genres', 'image_url', 'imdb_id', 'imdb_link', 'movie_id', 'movie_title', 'original_language', 'overview', 'popularity', 'production_countries', 'release_date', 'runtime', 'spoken_languages', 'tmdb_id', 'tmdb_link', 'vote_average', 'vote_count', 'year_released', 'clean_title']
Using features for ML: ['year_released', 'runtime']
ML training dataset: 24140 movies with features
Your rating distribution for ML:
Rating
0.5        1
1.0       12
1.5        6
2.0       17
2.5       11
3.0       20
3.5       43
4.0     6914
4.5    10276
5.0     6840
Name: count, dtype: int64
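
The printed column list contains `genres`, `vote_average`, and `vote_count`, but the selection code looks for `genre`, `imdb_rating`, and `imdb_votes`, which is why only two features are picked up. The training set also has 24,140 rows, far more than the number of films rated, which suggests rows are being multiplied in the merge (for example through duplicated or missing `tmdb_id` values, since pandas matches missing keys with each other). Below is a hedged sketch of both fixes on toy frames. The candidate list and the genre string format are assumptions.

```python
import pandas as pd

# Made-up movie table with a duplicated tmdb_id and a missing one.
toy_movies = pd.DataFrame({
    'tmdb_id': [1, 1, 2, None],
    'genres': ['["Drama"]', '["Drama"]', '["Comedy"]', '["Horror"]'],
    'runtime': [120, 120, 95, 88],
    'vote_average': [7.9, 7.9, 6.4, 5.1],
    'vote_count': [5000, 5000, 800, 40],
    'year_released': [2014.0, 2014.0, 2011.0, 1999.0],
})
toy_my_ratings = pd.DataFrame({'tmdb_id': [1, 2], 'Rating': [5.0, 3.5]})

# Assumed candidates, named after columns that appear in the printed column list.
candidate_features = ['year_released', 'genres', 'runtime', 'vote_average', 'vote_count']
feature_columns = [c for c in candidate_features if c in toy_movies.columns]

# One row per known tmdb_id on the movie side before merging.
movie_features = (toy_movies.dropna(subset=['tmdb_id'])
                            .drop_duplicates(subset='tmdb_id')[['tmdb_id'] + feature_columns])
toy_training = toy_my_ratings.merge(movie_features, on='tmdb_id', how='inner',
                                    validate='one_to_one')
print(feature_columns)
print(len(toy_training))
```

Selecting `genres`, `vote_average`, and `vote_count` at this stage also matters later, because the feature-engineering function only builds genre and popularity features when those columns are present in the training frame.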


In [12]:
# Feature Engineering for ML Model
print("🔧 Engineering features for machine learning...")

# Clean the training data - remove duplicates and prepare features
ml_data_clean = ml_training_data.drop_duplicates(subset=['tmdb_id']).copy()
print(f"After removing duplicates: {len(ml_data_clean)} unique movies")

# Feature engineering
def engineer_features(df):
    """Engineer features from movie data for ML training"""
    features_df = df.copy()
    
    # Year features
    if 'year_released' in features_df.columns:
        features_df['year_released'] = pd.to_numeric(features_df['year_released'], errors='coerce')
        features_df['decade'] = (features_df['year_released'] // 10) * 10
        features_df['is_recent'] = (features_df['year_released'] >= 2010).astype(int)
        features_df['is_classic'] = (features_df['year_released'] <= 1980).astype(int)
    
    # Runtime features  
    if 'runtime' in features_df.columns:
        features_df['runtime'] = pd.to_numeric(features_df['runtime'], errors='coerce')
        features_df['is_short'] = (features_df['runtime'] <= 90).astype(int)
        features_df['is_long'] = (features_df['runtime'] >= 150).astype(int)
    
    # Genre features (if available)
    if 'genres' in features_df.columns:
        # Extract main genres
        features_df['genres_str'] = features_df['genres'].fillna('')
        
        # Create binary features for common genres
        common_genres = ['Action', 'Comedy', 'Drama', 'Horror', 'Romance', 'Thriller', 'Sci-Fi', 'Fantasy']
        for genre in common_genres:
            features_df[f'is_{genre.lower()}'] = features_df['genres_str'].str.contains(genre, case=False).astype(int)
    
    # Popularity features (if available)
    if 'vote_average' in features_df.columns:
        features_df['vote_average'] = pd.to_numeric(features_df['vote_average'], errors='coerce')
        features_df['is_highly_rated'] = (features_df['vote_average'] >= 7.0).astype(int)
    
    if 'vote_count' in features_df.columns:
        features_df['vote_count'] = pd.to_numeric(features_df['vote_count'], errors='coerce')
        features_df['is_popular'] = (features_df['vote_count'] >= 1000).astype(int)
    
    return features_df

# Apply feature engineering
ml_features = engineer_features(ml_data_clean)

# Select numeric features for training
numeric_features = ['year_released', 'decade', 'is_recent', 'is_classic', 'runtime', 'is_short', 'is_long']

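# Add genre flags if available. Skip 'is_' columns that are already listed above
# or added below, otherwise they would appear in the feature list twice.
if 'genres' in ml_features.columns:
    non_genre_flags = {'is_highly_rated', 'is_popular'}
    genre_features = [col for col in ml_features.columns
                      if col.startswith('is_')
                      and col not in numeric_features
                      and col not in non_genre_flags]
    numeric_features.extend(genre_features)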

# Add popularity features if available
if 'vote_average' in ml_features.columns:
    numeric_features.extend(['vote_average', 'is_highly_rated'])
if 'vote_count' in ml_features.columns:  
    numeric_features.extend(['vote_count', 'is_popular'])

# Remove features that don't exist
numeric_features = [col for col in numeric_features if col in ml_features.columns]

print(f"Engineered features: {numeric_features}")

# Prepare final training data
X = ml_features[numeric_features].fillna(0)
y = ml_features['Rating']

print(f"Training data shape: {X.shape}")
print(f"Target distribution:")
print(y.value_counts().sort_index())

# Check for any remaining issues
print(f"Missing values in features: {X.isnull().sum().sum()}")
print(f"Missing values in target: {y.isnull().sum()}")

🔧 Engineering features for machine learning...
After removing duplicates: 198 unique movies
Engineered features: ['year_released', 'decade', 'is_recent', 'is_classic', 'runtime', 'is_short', 'is_long']
Training data shape: (198, 7)
Target distribution:
Rating
0.5     1
1.0     4
1.5     6
2.0     9
2.5     9
3.0    16
3.5    39
4.0    68
4.5    34
5.0    12
Name: count, dtype: int64
Missing values in features: 0
Missing values in target: 0
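
`X` is built with `fillna(0)`, so a movie with a missing year or runtime is treated as released in year 0 or as having zero runtime. One alternative, sketched here on toy data, is to impute with the column median. The `impute_median` helper is hypothetical, and in practice the medians would be computed on the training split only and reused for prediction.

```python
import numpy as np
import pandas as pd

def impute_median(df, columns):
    """Hypothetical helper: fill NaNs in `columns` with each column's median."""
    out = df.copy()
    medians = out[columns].median()          # NaNs are skipped by default
    out[columns] = out[columns].fillna(medians)
    return out, medians

toy = pd.DataFrame({
    'year_released': [1994, np.nan, 2010, 1972],
    'runtime': [142, 95, np.nan, 175],
})

filled, medians = impute_median(toy, ['year_released', 'runtime'])
print(filled)
```

Returning `medians` alongside the filled frame lets the same values be applied to the unwatched catalogue later.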


In [13]:
# Train Multiple ML Models
print("🎯 Training multiple ML models on your rating preferences...")

# Split data for training and validation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define models to try
models = {
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42, max_depth=10),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42, max_depth=6),
    'Ridge Regression': Ridge(alpha=1.0)
}

# Train and evaluate models
model_performance = {}
trained_models = {}

for name, model in models.items():
    print(f"\n📊 Training {name}...")
    
    # Use scaled data for Ridge, original for tree-based models
    if name == 'Ridge Regression':
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        # Cross validation on scaled data
        cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='neg_mean_squared_error')
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        # Cross validation on original data  
        cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    
    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    cv_mse = -cv_scores.mean()
    
    model_performance[name] = {
        'MSE': mse,
        'MAE': mae,
        'CV_MSE': cv_mse,
        'RMSE': np.sqrt(mse)
    }
    
    trained_models[name] = model
    
    print(f"  MSE: {mse:.4f}")
    print(f"  MAE: {mae:.4f}")  
    print(f"  RMSE: {np.sqrt(mse):.4f}")
    print(f"  CV MSE: {cv_mse:.4f}")

# Find best model
best_model_name = min(model_performance.keys(), key=lambda k: model_performance[k]['CV_MSE'])
best_model = trained_models[best_model_name]

print(f"\n🏆 Best model: {best_model_name}")
print(f"Best model CV RMSE: {np.sqrt(model_performance[best_model_name]['CV_MSE']):.4f}")

# Feature importance for tree-based models
if best_model_name in ['Random Forest', 'Gradient Boosting']:
    feature_importance = pd.DataFrame({
        'feature': numeric_features,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print(f"\n📈 Feature importance for {best_model_name}:")
    for _, row in feature_importance.head().iterrows():
        print(f"  {row['feature']}: {row['importance']:.4f}")
else:
    # For Ridge regression, show coefficients
    coefficients = pd.DataFrame({
        'feature': numeric_features,
        'coefficient': best_model.coef_
    }).sort_values('coefficient', key=abs, ascending=False)
    
    print(f"\n📈 Top coefficients for {best_model_name}:")
    for _, row in coefficients.head().iterrows():
        print(f"  {row['feature']}: {row['coefficient']:.4f}")

🎯 Training multiple ML models on your rating preferences...

📊 Training Random Forest...
  MSE: 0.8195
  MAE: 0.7141
  RMSE: 0.9053
  CV MSE: 1.0186

📊 Training Gradient Boosting...
  MSE: 0.8989
  MAE: 0.7481
  RMSE: 0.9481
  CV MSE: 1.4602

📊 Training Ridge Regression...
  MSE: 0.6768
  MAE: 0.6740
  RMSE: 0.8227
  CV MSE: 0.8454

🏆 Best model: Ridge Regression
Best model CV RMSE: 0.9194

📈 Top coefficients for Ridge Regression:
  decade: -0.3151
  is_classic: -0.3000
  is_long: 0.1980
  runtime: -0.0993
  year_released: -0.0781
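
In the cell above, the scaler is fitted on all of `X_train` before `cross_val_score` runs, so every validation fold has already influenced the scaling. A common alternative is to wrap the scaler and the model in a single scikit-learn pipeline, so that scaling is refitted inside each fold. A minimal sketch on synthetic data (the data and the `_demo` names are illustrative only):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 7-feature training matrix
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 7))
y_demo = 3.5 + 0.3 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# The pipeline refits StandardScaler on the training part of every fold
ridge_pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv_scores_demo = cross_val_score(
    ridge_pipeline, X_demo, y_demo, cv=5, scoring='neg_mean_squared_error'
)
cv_rmse_demo = np.sqrt(-cv_scores_demo.mean())
print(f"CV RMSE: {cv_rmse_demo:.4f}")
```

The fitted pipeline can also be used directly for prediction, which removes the need for a separate `scaler.transform` step.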


In [14]:
# Generate ML-based Movie Recommendations
print("🤖 Generating ML-based movie recommendations...")

# Prepare all movies for prediction (excluding watched ones)
all_movies_features = engineer_features(movies_clean_filtered)

# Get movies you haven't watched
unwatched_movies = all_movies_features[~all_movies_features['tmdb_id'].isin(my_watched_movies_set)].copy()
print(f"Movies available for ML prediction: {len(unwatched_movies)}")

# Prepare features for prediction
unwatched_features = unwatched_movies[numeric_features].fillna(0)

# Make predictions using the best model
if best_model_name == 'Ridge Regression':
    unwatched_features_scaled = scaler.transform(unwatched_features)
    predicted_ratings = best_model.predict(unwatched_features_scaled)
else:
    predicted_ratings = best_model.predict(unwatched_features)

# Add predictions to the dataframe
unwatched_movies['predicted_rating'] = predicted_ratings

# Filter for high predicted ratings (4.0+ stars)
high_rated_predictions = unwatched_movies[unwatched_movies['predicted_rating'] >= 4.0].copy()

# Sort by predicted rating and get top recommendations
ml_recommendations = high_rated_predictions.nlargest(50, 'predicted_rating')

print(f"🎯 Found {len(high_rated_predictions)} movies predicted to be 4.0+ stars")
print(f"Top 50 ML-based recommendations:")

# Display top ML recommendations
ml_final_recommendations = []
for i, (_, movie) in enumerate(ml_recommendations.iterrows(), 1):
    title = movie['movie_title'] 
    year = int(movie['year_released']) if pd.notna(movie['year_released']) else 'Unknown'
    pred_rating = movie['predicted_rating']
    
    ml_final_recommendations.append({
        'rank': i,
        'title': title,
        'year': year,
        'predicted_rating': pred_rating,
        'tmdb_id': movie['tmdb_id']
    })
    
    if i <= 20:  # Show top 20
        print(f"{i:2d}. {title} ({year}) - Predicted Rating: {pred_rating:.2f}/5.0")

# Compare with collaborative filtering recommendations
collaborative_tmdb_ids = {movie['tmdb_id'] for movie in final_recommendations if 'tmdb_id' in movie}
ml_tmdb_ids = {movie['tmdb_id'] for movie in ml_final_recommendations}

# Find overlap and unique recommendations
overlap = len(collaborative_tmdb_ids.intersection(ml_tmdb_ids))
ml_unique = len(ml_tmdb_ids - collaborative_tmdb_ids)
collaborative_unique = len(collaborative_tmdb_ids - ml_tmdb_ids)

print(f"\n📊 Recommendation Comparison:")
print(f"   • Overlap between ML and Collaborative: {overlap} movies")
print(f"   • Unique to ML recommendations: {ml_unique} movies")
print(f"   • Unique to Collaborative filtering: {collaborative_unique} movies")

# Save ML recommendations
ml_filename = 'data/movie_recommendations_ML_based.txt'
with open(ml_filename, 'w', encoding='utf-8') as f:  # explicit UTF-8: the file contains emoji
    f.write("🤖 ML-BASED MOVIE RECOMMENDATIONS\n")
    f.write("=" * 50 + "\n")
    f.write(f"Model: {best_model_name} (CV RMSE: {np.sqrt(model_performance[best_model_name]['CV_MSE']):.3f})\n")
    f.write(f"Based on {len(ml_features)} of your rated movies\n")
    f.write(f"Features used: {', '.join(numeric_features)}\n")
    f.write(f"Generated on: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
    f.write("🎯 Movies predicted to match your taste:\n\n")
    
    for movie in ml_final_recommendations:
        f.write(f"{movie['rank']:2d}. {movie['title']} ({movie['year']})\n")
        f.write(f"    🤖 Predicted Rating: {movie['predicted_rating']:.2f}/5.0\n\n")

print(f"✅ Saved {len(ml_final_recommendations)} ML recommendations to '{ml_filename}'")

🤖 Generating ML-based movie recommendations...
Movies available for ML prediction: 277403
🎯 Found 79899 movies predicted to be 4.0+ stars
Top 50 ML-based recommendations:
 1. Passage of Venus (1874) - Predicted Rating: 5.95/5.0
 2. The Musician Monkey (1878) - Predicted Rating: 5.93/5.0
 3. Sallie Gardner at a Gallop (1878) - Predicted Rating: 5.93/5.0
 4. Le Rotisseur (1878) - Predicted Rating: 5.93/5.0
 5. Les Bulles de Savon (1878) - Predicted Rating: 5.93/5.0
 6. L'Équilibriste (1878) - Predicted Rating: 5.93/5.0
 7. Le Jongleur (1878) - Predicted Rating: 5.93/5.0
 8. The Aquarium (1878) - Predicted Rating: 5.93/5.0
 9. Les Chiens Savants (1878) - Predicted Rating: 5.93/5.0
10. Le Steeple-chase (1878) - Predicted Rating: 5.93/5.0
11. La Nageuse (1878) - Predicted Rating: 5.93/5.0
12. Le Repas des Poulets (1878) - Predicted Rating: 5.93/5.0
13. Le Fumeur (1878) - Predicted Rating: 5.93/5.0
14. La Balançoire (1878) - Predicted Rating: 5.93/5.0
15. La Charmeuse (1878) - Predicted Rati
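
The top predictions exceed the 5.0 ceiling (5.95/5.0 for an 1874 title). A linear model has no upper bound, and the negative `decade` and `year_released` coefficients reward very old films. One possible guard, shown as a sketch and not as part of the notebook's method, is to clip predictions to the 0.5 to 5.0 scale seen in the rating distribution. `clip_to_rating_scale` is a hypothetical helper.

```python
import numpy as np

RATING_MIN, RATING_MAX = 0.5, 5.0   # half-star scale, as in the rating distribution

def clip_to_rating_scale(predictions):
    """Hypothetical helper: bound predicted ratings to the valid rating range."""
    return np.clip(np.asarray(predictions, dtype=float), RATING_MIN, RATING_MAX)

raw_predictions = [5.95, 5.93, 4.70, 0.2]
clipped = clip_to_rating_scale(raw_predictions)
print(clipped)
```

Clipping only bounds the displayed values. Everything above 5.0 becomes a tie, so the ranking problem caused by extrapolation still needs a separate fix, such as a year filter.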

In [15]:
# Create Balanced ML Recommendations and Hybrid System
print("🎯 Creating balanced ML recommendations and hybrid system...")

# Filter for more modern movies to get practical recommendations
# (the predicted_rating column carries over from unwatched_movies)
modern_unwatched = unwatched_movies[unwatched_movies['year_released'] >= 1980].copy()

# Get top modern ML recommendations
modern_ml_recs = modern_unwatched.nlargest(50, 'predicted_rating')

print(f"\n🎬 Top 20 Modern ML Recommendations (1980+):")
modern_ml_final = []
for i, (_, movie) in enumerate(modern_ml_recs.iterrows(), 1):
    title = movie['movie_title'] 
    year = int(movie['year_released']) if pd.notna(movie['year_released']) else 'Unknown'
    pred_rating = movie['predicted_rating']
    
    modern_ml_final.append({
        'rank': i,
        'title': title,
        'year': year,
        'predicted_rating': pred_rating,
        'tmdb_id': movie['tmdb_id'],
        'source': 'ML'
    })
    
    if i <= 20:
        print(f"{i:2d}. {title} ({year}) - Predicted: {pred_rating:.2f}/5.0")

# Create Hybrid Recommendations (combine both approaches)
print(f"\n🤝 Creating Hybrid Recommendations...")

# Prepare collaborative filtering scores for comparison
collaborative_scores = {}
for movie in final_recommendations:
    collaborative_scores[movie.get('tmdb_id')] = {
        'title': movie['title'],
        'year': movie['year'], 
        'collaborative_score': movie['combined_score'],
        'avg_rating': movie['avg_rating'],
        'source': 'Collaborative'
    }

# Prepare ML scores for modern movies
ml_scores = {}
for movie in modern_ml_final:
    ml_scores[movie['tmdb_id']] = {
        'title': movie['title'],
        'year': movie['year'],
        'ml_score': movie['predicted_rating'],
        'source': 'ML'
    }

# Create hybrid scoring
hybrid_recommendations = []

# Add collaborative filtering movies with hybrid scores
for tmdb_id, collab_data in collaborative_scores.items():
    if tmdb_id in ml_scores:
        # Movie appears in both systems - combine scores
        ml_data = ml_scores[tmdb_id]
        hybrid_score = (collab_data['collaborative_score'] / 100) + (ml_data['ml_score'])  # Collaborative score scaled by 1/100, plus the ML predicted rating
        source = 'Hybrid'
    else:
        # Only in collaborative filtering
        hybrid_score = collab_data['collaborative_score'] / 100
        source = 'Collaborative Only'
    
    hybrid_recommendations.append({
        'tmdb_id': tmdb_id,
        'title': collab_data['title'],
        'year': collab_data['year'],
        'hybrid_score': hybrid_score,
        'source': source,
        'collaborative_score': collab_data.get('collaborative_score', 0),
        'ml_score': ml_scores.get(tmdb_id, {}).get('ml_score', 0)
    })

# Add ML-only movies
for tmdb_id, ml_data in ml_scores.items():
    if tmdb_id not in collaborative_scores:
        hybrid_recommendations.append({
            'tmdb_id': tmdb_id,
            'title': ml_data['title'],
            'year': ml_data['year'],
            'hybrid_score': ml_data['ml_score'],
            'source': 'ML Only',
            'collaborative_score': 0,
            'ml_score': ml_data['ml_score']
        })

# Sort hybrid recommendations by score
hybrid_recommendations.sort(key=lambda x: x['hybrid_score'], reverse=True)
top_hybrid = hybrid_recommendations[:30]

print(f"\n🏆 Top 20 Hybrid Recommendations:")
for i, movie in enumerate(top_hybrid[:20], 1):
    print(f"{i:2d}. {movie['title']} ({movie['year']}) - {movie['source']}")
    if movie['source'] == 'Hybrid':
        print(f"    Hybrid Score: {movie['hybrid_score']:.2f} (ML: {movie['ml_score']:.2f}, Collab: {movie['collaborative_score']:.0f})")
    elif movie['source'] == 'Collaborative Only':
        print(f"    Collaborative Score: {movie['collaborative_score']:.0f}")
    else:
        print(f"    ML Score: {movie['ml_score']:.2f}")
    print()

# Save all three recommendation types
files_saved = []

# Modern ML recommendations
modern_ml_filename = 'data/movie_recommendations_ML_modern.txt'
with open(modern_ml_filename, 'w', encoding='utf-8') as f:  # explicit UTF-8: the file contains emoji
    f.write("🎬 MODERN ML-BASED MOVIE RECOMMENDATIONS (1980+)\n")
    f.write("=" * 60 + "\n")
    f.write(f"Model: {best_model_name}\n")
    f.write(f"Based on analysis of your rating patterns\n")
    f.write(f"Filtered for movies from 1980 onwards\n")
    f.write(f"Generated on: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
    
    for movie in modern_ml_final:
        f.write(f"{movie['rank']:2d}. {movie['title']} ({movie['year']})\n")
        f.write(f"    🤖 Predicted Rating: {movie['predicted_rating']:.2f}/5.0\n\n")

files_saved.append(modern_ml_filename)

# Hybrid recommendations
hybrid_filename = 'data/movie_recommendations_HYBRID.txt'
with open(hybrid_filename, 'w', encoding='utf-8') as f:  # explicit UTF-8: the file contains emoji
    f.write("🤝 HYBRID MOVIE RECOMMENDATIONS\n")
    f.write("=" * 50 + "\n")
    f.write("Combines Machine Learning + Collaborative Filtering\n")
    f.write(f"Generated on: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
    
    for i, movie in enumerate(top_hybrid, 1):
        f.write(f"{i:2d}. {movie['title']} ({movie['year']}) - {movie['source']}\n")
        if movie['source'] == 'Hybrid':
            f.write(f"    🤝 Hybrid Score: {movie['hybrid_score']:.2f}\n")
            f.write(f"    🤖 ML Score: {movie['ml_score']:.2f}\n")
            f.write(f"    👥 Collaborative Score: {movie['collaborative_score']:.0f}\n")
        elif movie['source'] == 'Collaborative Only':
            f.write(f"    👥 Collaborative Score: {movie['collaborative_score']:.0f}\n")
        else:
            f.write(f"    🤖 ML Score: {movie['ml_score']:.2f}\n")
        f.write("\n")

files_saved.append(hybrid_filename)

print(f"✅ Saved recommendations to:")
for filename in files_saved:
    print(f"   • {filename}")

print(f"\n📊 Final Summary:")
print(f"   • Collaborative Filtering: {len(final_recommendations)} recommendations")
print(f"   • Modern ML Predictions: {len(modern_ml_final)} recommendations")  
print(f"   • Hybrid Approach: {len(top_hybrid)} recommendations")
print(f"   • Both approaches complement each other for comprehensive coverage!")

🎯 Creating balanced ML recommendations and hybrid system...

🎬 Top 20 Modern ML Recommendations (1980+):
 1. The Bunker (1981) - Predicted: 4.70/5.0
 2. Jacqueline Bouvier Kennedy (1981) - Predicted: 4.70/5.0
 3. Miracle on Ice (1981) - Predicted: 4.70/5.0
 4. Das Boot (1981) - Predicted: 4.70/5.0
 5. Of Mice and Men (1981) - Predicted: 4.70/5.0
 6. The Chelsea Murders (1981) - Predicted: 4.70/5.0
 7. Netrikan (1981) - Predicted: 4.70/5.0
 8. Midnight (1981) - Predicted: 4.70/5.0
 9. Jibon Nouka (1981) - Predicted: 4.70/5.0
10. Why Not? (1981) - Predicted: 4.70/5.0
11. Zoombie (1982) - Predicted: 4.69/5.0
12. Camelot (1982) - Predicted: 4.69/5.0
13. Kalyug (1981) - Predicted: 4.69/5.0
14. Schöne Tage (1981) - Predicted: 4.69/5.0
15. The Night of Varennes (1982) - Predicted: 4.69/5.0
16. Threshold (1982) - Predicted: 4.69/5.0
17. Sophie's Choice (1982) - Predicted: 4.69/5.0
18. The Victor (1982) - Predicted: 4.69/5.0
19. Carnival (1982) - Predicted: 4.69/5.0
20. The Uppercrust (1981) - 
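
The hybrid score adds `collaborative_score / 100` to an ML rating on a roughly 0.5 to 5.0 scale, so the relative weight of the two signals depends on how large the collaborative scores happen to be. A sketch of one alternative, assuming min-max scaling and an equal 50/50 weighting (both are assumptions, as are the helper names and the toy scores):

```python
def min_max_scale(scores):
    """Map a dict of {id: score} onto [0, 1]; constant inputs map to 0.5."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.5 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}

def hybrid_scores(collab, ml, ml_weight=0.5):
    """Weighted sum of the two scaled scores; a missing score counts as 0."""
    collab_scaled = min_max_scale(collab)
    ml_scaled = min_max_scale(ml)
    combined = {}
    for key in set(collab_scaled) | set(ml_scaled):
        c = collab_scaled.get(key, 0.0)
        m = ml_scaled.get(key, 0.0)
        combined[key] = (1 - ml_weight) * c + ml_weight * m
    return combined

# Toy scores keyed by a made-up movie id
toy_collab = {101: 250.0, 102: 400.0, 103: 100.0}
toy_ml = {102: 4.6, 104: 4.7, 105: 4.2}

ranked = sorted(hybrid_scores(toy_collab, toy_ml).items(),
                key=lambda item: item[1], reverse=True)
for movie_id, score in ranked:
    print(f"{movie_id}: {score:.2f}")
```

With both signals on the same scale, a movie found by both systems ranks above one found by only one of them, and `ml_weight` makes the trade-off explicit.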

In [17]:
# IMPROVED ML MODEL WITH GENRE INFO AND OVERFITTING
print("🚀 Creating OVERFITTED ML model with genre information...")

# First, let's explore genre information in the dataset
print("🎭 Analyzing genre information...")
genre_cols = [col for col in movies_clean_filtered.columns if 'genre' in col.lower()]
print(f"Available genre columns: {genre_cols}")

# Check what genre data we have
if 'genres' in movies_clean_filtered.columns:
    print("Sample genre data:")
    sample_genres = movies_clean_filtered[movies_clean_filtered['genres'].notna()]['genres'].head(10)
    for i, genres in enumerate(sample_genres):
        print(f"  {genres}")
elif 'genre' in movies_clean_filtered.columns:
    print("Sample genre data:")
    sample_genres = movies_clean_filtered[movies_clean_filtered['genre'].notna()]['genre'].head(10)
    for i, genre in enumerate(sample_genres):
        print(f"  {genre}")

# Enhanced feature engineering with extensive genre features
def engineer_overfitted_features(df, your_ratings=None):
    """Engineer LOTS of features to overfit to your specific taste"""
    features_df = df.copy()
    
    # Year features (more granular)
    if 'year_released' in features_df.columns:
        features_df['year_released'] = pd.to_numeric(features_df['year_released'], errors='coerce').fillna(1950)
        
        # Decade features
        for decade in range(1920, 2030, 10):
            features_df[f'decade_{decade}s'] = (features_df['year_released'].between(decade, decade+9)).astype(int)
        
        # Era features
        features_df['is_silent_era'] = (features_df['year_released'] <= 1929).astype(int)
        features_df['is_golden_age'] = features_df['year_released'].between(1930, 1959).astype(int)
        features_df['is_new_hollywood'] = features_df['year_released'].between(1960, 1979).astype(int)
        features_df['is_blockbuster_era'] = features_df['year_released'].between(1980, 1999).astype(int)
        features_df['is_digital_era'] = features_df['year_released'].between(2000, 2019).astype(int)
        features_df['is_streaming_era'] = (features_df['year_released'] >= 2020).astype(int)
        
        # Age features
        current_year = 2025
        features_df['movie_age'] = current_year - features_df['year_released']
        features_df['is_very_old'] = (features_df['movie_age'] > 50).astype(int)
        features_df['is_classic'] = features_df['movie_age'].between(25, 50).astype(int)
        features_df['is_modern'] = features_df['movie_age'].between(10, 25).astype(int)
        features_df['is_recent'] = (features_df['movie_age'] <= 10).astype(int)
    
    # Runtime features (more granular)
    if 'runtime' in features_df.columns:
        features_df['runtime'] = pd.to_numeric(features_df['runtime'], errors='coerce').fillna(90)
        
        # Length categories
        features_df['is_short'] = (features_df['runtime'] <= 90).astype(int)
        features_df['is_medium'] = features_df['runtime'].between(91, 130).astype(int)
        features_df['is_long'] = features_df['runtime'].between(131, 180).astype(int)
        features_df['is_epic'] = (features_df['runtime'] > 180).astype(int)
        
        # Specific runtime preferences
        for length in [80, 90, 100, 110, 120, 140, 160]:
            features_df[f'runtime_around_{length}'] = (abs(features_df['runtime'] - length) <= 10).astype(int)
    
    # EXTENSIVE Genre features
    genre_col = None
    if 'genres' in features_df.columns and features_df['genres'].notna().any():
        genre_col = 'genres'
    elif 'genre' in features_df.columns and features_df['genre'].notna().any():
        genre_col = 'genre'
    
    if genre_col:
        features_df['genre_string'] = features_df[genre_col].fillna('')
        
        # Major genres
        major_genres = ['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary', 
                       'Drama', 'Family', 'Fantasy', 'Horror', 'Music', 'Mystery', 
                       'Romance', 'Science Fiction', 'Sci-Fi', 'Thriller', 'War', 'Western']
        
        for genre in major_genres:
            col_name = f'genre_{genre.lower().replace(" ", "_").replace("-", "_")}'
            features_df[col_name] = features_df['genre_string'].str.contains(genre, case=False, na=False).astype(int)
        
        # Genre combinations (your specific preferences)
        features_df['is_action_adventure'] = ((features_df.get('genre_action', 0) == 1) & 
                                            (features_df.get('genre_adventure', 0) == 1)).astype(int)
        features_df['is_drama_romance'] = ((features_df.get('genre_drama', 0) == 1) & 
                                         (features_df.get('genre_romance', 0) == 1)).astype(int)
        features_df['is_comedy_romance'] = ((features_df.get('genre_comedy', 0) == 1) & 
                                          (features_df.get('genre_romance', 0) == 1)).astype(int)
        features_df['is_sci_fi_action'] = ((features_df.get('genre_science_fiction', 0) == 1) & 
                                         (features_df.get('genre_action', 0) == 1)).astype(int)
        
        # Count genres (diversity measure)
        genre_count_cols = [col for col in features_df.columns if col.startswith('genre_') and col != 'genre_string']
        if genre_count_cols:
            features_df['genre_count'] = features_df[genre_count_cols].sum(axis=1)
            features_df['is_single_genre'] = (features_df['genre_count'] == 1).astype(int)
            features_df['is_multi_genre'] = (features_df['genre_count'] > 1).astype(int)
    
    # Rating-based features (if available)
    if 'vote_average' in features_df.columns:
        features_df['vote_average'] = pd.to_numeric(features_df['vote_average'], errors='coerce').fillna(5.0)
        features_df['is_highly_rated'] = (features_df['vote_average'] >= 7.5).astype(int)
        features_df['is_critically_acclaimed'] = (features_df['vote_average'] >= 8.0).astype(int)
        features_df['is_poor_rated'] = (features_df['vote_average'] <= 5.0).astype(int)
        
        # Rating ranges
        for rating in [6.0, 6.5, 7.0, 7.5, 8.0, 8.5]:
            features_df[f'rating_around_{rating:.1f}'] = (abs(features_df['vote_average'] - rating) <= 0.25).astype(int)
    
    if 'vote_count' in features_df.columns:
        features_df['vote_count'] = pd.to_numeric(features_df['vote_count'], errors='coerce').fillna(100)
        features_df['log_vote_count'] = np.log1p(features_df['vote_count'])
        features_df['is_popular'] = (features_df['vote_count'] >= 1000).astype(int)
        features_df['is_very_popular'] = (features_df['vote_count'] >= 10000).astype(int)
        features_df['is_niche'] = (features_df['vote_count'] <= 100).astype(int)
    
    # If we have your ratings, create personalized features
    if your_ratings is not None:
        # Add features based on your rating patterns.
        # Caution: these flags are derived from the target (Rating), so building them
        # before the train/test split leaks label information into the evaluation.
        if 'year_released' in features_df.columns and 'year_released' in your_ratings.columns:
            # Years you tend to rate highly (deduplicated so each column is built once)
            high_rated_years = your_ratings.loc[your_ratings['Rating'] >= 4.5, 'year_released'].dropna().unique()
            for year in high_rated_years:
                features_df[f'your_fav_year_{int(year)}'] = (features_df['year_released'] == year).astype(int)
    
    return features_df

# Apply overfitted feature engineering
print("🔧 Creating overfitted features...")
ml_data_overfitted = engineer_overfitted_features(ml_data_clean, ml_data_clean)

# Get all engineered features (excluding non-numeric columns)
exclude_cols = ['tmdb_id', 'movie_title', 'Rating', 'genre_string', 'user_id', 'genres']
feature_cols = [col for col in ml_data_overfitted.columns if col not in exclude_cols]

# Only keep numeric features
numeric_feature_cols = []
for col in feature_cols:
    try:
        pd.to_numeric(ml_data_overfitted[col], errors='raise')
        numeric_feature_cols.append(col)
    except (ValueError, TypeError):
        continue

print(f"Total engineered features: {len(numeric_feature_cols)}")
print(f"Sample features: {numeric_feature_cols[:10]}...")

# Prepare training data
X_overfitted = ml_data_overfitted[numeric_feature_cols].fillna(0)
y_overfitted = ml_data_overfitted['Rating']

print(f"Overfitted training data shape: {X_overfitted.shape}")
print(f"Features with non-zero variance: {(X_overfitted.var() > 0).sum()}")

# Remove features with zero variance
non_zero_variance_cols = X_overfitted.columns[X_overfitted.var() > 0]
X_overfitted_clean = X_overfitted[non_zero_variance_cols]

print(f"Final feature set: {X_overfitted_clean.shape[1]} features")

# Split data
X_train_over, X_test_over, y_train_over, y_test_over = train_test_split(
    X_overfitted_clean, y_overfitted, test_size=0.2, random_state=42
)

# Scale features
scaler_overfitted = StandardScaler()
X_train_over_scaled = scaler_overfitted.fit_transform(X_train_over)
X_test_over_scaled = scaler_overfitted.transform(X_test_over)

🚀 Creating OVERFITTED ML model with genre information...
🎭 Analyzing genre information...
Available genre columns: ['genres']
Sample genre data:
  ["Music","Animation"]
  []
  ["Drama"]
  ["Drama"]
  ["Documentary"]
  ["Romance"]
  ["Drama","Thriller"]
  []
  ["Music","Documentary"]
  ["Drama"]
🔧 Creating overfitted features...
Total engineered features: 64
Sample features: ['year_released', 'runtime', 'decade_1920s', 'decade_1930s', 'decade_1940s', 'decade_1950s', 'decade_1960s', 'decade_1970s', 'decade_1980s', 'decade_1990s']...
Overfitted training data shape: (198, 64)
Features with non-zero variance: 62
Final feature set: 62 features
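
The sample output shows `genres` stored as JSON-style list strings such as `["Drama","Thriller"]`. Substring matching works on these, but parsing the lists gives exact genre names. A sketch on toy rows, assuming every non-null value is valid JSON (the `parse_genres` helper and the toy frame are hypothetical):

```python
import json
import pandas as pd

def parse_genres(value):
    """Return a list of genre names; missing or unparseable values give []."""
    if not isinstance(value, str):
        return []
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []

toy = pd.DataFrame({
    'genres': ['["Music","Animation"]', '[]', '["Drama","Thriller"]', None]
})
genre_lists = toy['genres'].apply(parse_genres)

# One 0/1 column per genre that actually occurs in the data
all_genres = sorted({genre for genres in genre_lists for genre in genres})
for genre in all_genres:
    col_name = f"genre_{genre.lower().replace(' ', '_')}"
    toy[col_name] = genre_lists.apply(lambda genres, g=genre: int(g in genres))

print(toy.drop(columns='genres'))
```

Deriving the columns from the parsed lists avoids maintaining a hard-coded genre list and aliases such as `Sci-Fi` versus `Science Fiction`.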


In [18]:
# Train OVERFITTED Models (designed to memorize your preferences)
print("🧠 Training OVERFITTED models to learn your specific taste...")

# Multiple overfitted models
overfitted_models = {
    'Random Forest (Overfitted)': RandomForestRegressor(
        n_estimators=500,          # Many trees
        max_depth=None,            # No depth limit (overfit!)
        min_samples_split=2,       # Minimal splits (overfit!)
        min_samples_leaf=1,        # Minimal leaf size (overfit!)
        max_features='sqrt',
        random_state=42
    ),
    'Gradient Boosting (Overfitted)': GradientBoostingRegressor(
        n_estimators=500,          # Many iterations
        learning_rate=0.05,        # Lower rate, more iterations
        max_depth=10,              # Deep trees
        min_samples_split=2,       # Minimal splits
        min_samples_leaf=1,        # Minimal leaf size
        subsample=0.8,
        random_state=42
    ),
    'Ridge (High Complexity)': Ridge(
        alpha=0.01,                # Less regularization (more overfitting)
        max_iter=10000
    )
}

# Train and evaluate overfitted models
overfitted_results = {}
trained_overfitted_models = {}

for name, model in overfitted_models.items():
    print(f"\n🤖 Training {name}...")
    
    if 'Ridge' in name:
        # Use scaled features for Ridge
        model.fit(X_train_over_scaled, y_train_over)
        y_pred_train = model.predict(X_train_over_scaled)
        y_pred_test = model.predict(X_test_over_scaled)
        
        # Cross-validation (runs on the full, unscaled feature matrix, not the scaled split fitted above)
        cv_scores = cross_val_score(model, X_overfitted_clean, y_overfitted, 
                                  cv=3, scoring='neg_mean_squared_error')
        cv_mse = -cv_scores.mean()
    else:
        # Use unscaled features for tree-based models
        model.fit(X_train_over, y_train_over)
        y_pred_train = model.predict(X_train_over)
        y_pred_test = model.predict(X_test_over)
        
        # Cross-validation with unscaled features  
        cv_scores = cross_val_score(model, X_overfitted_clean, y_overfitted,
                                  cv=3, scoring='neg_mean_squared_error')
        cv_mse = -cv_scores.mean()
    
    # Calculate metrics
    train_mse = mean_squared_error(y_train_over, y_pred_train)
    test_mse = mean_squared_error(y_test_over, y_pred_test)
    train_mae = mean_absolute_error(y_train_over, y_pred_train)
    test_mae = mean_absolute_error(y_test_over, y_pred_test)
    
    overfitted_results[name] = {
        'Train_MSE': train_mse,
        'Test_MSE': test_mse,
        'Train_MAE': train_mae,
        'Test_MAE': test_mae,
        'CV_MSE': cv_mse,
        'CV_RMSE': np.sqrt(cv_mse),
        'Overfitting': train_mse / test_mse  # Lower = more overfitted (good for our purpose!)
    }
    
    trained_overfitted_models[name] = model
    
    print(f"   Train RMSE: {np.sqrt(train_mse):.4f}")
    print(f"   Test RMSE: {np.sqrt(test_mse):.4f}")
    print(f"   CV RMSE: {np.sqrt(cv_mse):.4f}")
    print(f"   Overfitting Ratio: {train_mse / test_mse:.3f} (lower = more overfitted)")

# Select best overfitted model (prioritize low CV error but allow overfitting)
print(f"\n🏆 OVERFITTED MODEL COMPARISON:")
print("="*80)
for name, metrics in overfitted_results.items():
    print(f"{name:25} | CV RMSE: {metrics['CV_RMSE']:.4f} | Overfit: {metrics['Overfitting']:.3f}")

# Choose model with lowest CV RMSE (we want overfitting to your specific taste)
best_overfitted_name = min(overfitted_results.keys(), key=lambda x: overfitted_results[x]['CV_RMSE'])
best_overfitted_model = trained_overfitted_models[best_overfitted_name]

print(f"\n🎯 Selected OVERFITTED model: {best_overfitted_name}")
print(f"   Cross-validation RMSE: {overfitted_results[best_overfitted_name]['CV_RMSE']:.4f}")
print(f"   Overfitting ratio: {overfitted_results[best_overfitted_name]['Overfitting']:.3f}")

# Feature importance for overfitted model
if hasattr(best_overfitted_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'feature': X_overfitted_clean.columns,
        'importance': best_overfitted_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    print(f"\n🔍 TOP 15 MOST IMPORTANT FEATURES (for your taste):")
    for i, (_, row) in enumerate(feature_importance.head(15).iterrows(), 1):
        print(f"   {i:2d}. {row['feature']:25} | Importance: {row['importance']:.4f}")

elif hasattr(best_overfitted_model, 'coef_'):
    feature_importance = pd.DataFrame({
        'feature': X_overfitted_clean.columns,
        'coefficient': np.abs(best_overfitted_model.coef_)
    }).sort_values('coefficient', ascending=False)
    
    print(f"\n🔍 TOP 15 MOST IMPORTANT FEATURES (by coefficient magnitude):")
    for i, (_, row) in enumerate(feature_importance.head(15).iterrows(), 1):
        print(f"   {i:2d}. {row['feature']:25} | |Coef|: {row['coefficient']:.4f}")

🧠 Training OVERFITTED models to learn your specific taste...

🤖 Training Random Forest (Overfitted)...
   Train RMSE: 0.4486
   Test RMSE: 0.8450
   CV RMSE: 1.0129
   Overfitting Ratio: 0.282 (lower = more overfitted)

🤖 Training Gradient Boosting (Overfitted)...
   Train RMSE: 0.3023
   Test RMSE: 0.9706
   CV RMSE: 1.1541
   Overfitting Ratio: 0.097 (lower = more overfitted)

🤖 Training Ridge (High Complexity)...
   Train RMSE: 0.7617
   Test RMSE: 0.9599
   CV RMSE: 1.0840
   Overfitting Ratio: 0.630 (lower = more overfitted)

🏆 OVERFITTED MODEL COMPARISON:
Random Forest (Overfitted) | CV RMSE: 1.0129 | Overfit: 0.282
Gradient Boosting (Overfitted) | CV RMSE: 1.1541 | Overfit: 0.097
Ridge (High Complexity)   | CV RMSE: 1.0840 | Overfit: 0.630

🎯 Selected OVERFITTED model: Random Forest (Overfitted)
   Cross-validation RMSE: 1.0129
   Overfitting ratio: 0.282

🔍 TOP 15 MOST IMPORTANT FEATURES (for your taste):
    1. runtime                   | Importance: 0.2422
    2. movie_age   
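
The overfitting ratio printed above is train MSE divided by test MSE, so it can be recovered from the printed RMSE values by squaring their ratio. The following is a minimal sketch that assumes only the rounded RMSE figures from the output above; the helper name `overfitting_ratio` is hypothetical and not part of the notebook.

```python
# Hypothetical helper (not in the notebook): recompute the "Overfitting"
# metric, train_mse / test_mse, from a pair of RMSE values.
def overfitting_ratio(train_rmse: float, test_rmse: float) -> float:
    """Return train MSE / test MSE; values near 1 indicate little overfitting."""
    return (train_rmse / test_rmse) ** 2


# Rounded (train RMSE, test RMSE) pairs copied from the training output above.
printed_rmse = {
    "Random Forest (Overfitted)": (0.4486, 0.8450),
    "Gradient Boosting (Overfitted)": (0.3023, 0.9706),
    "Ridge (High Complexity)": (0.7617, 0.9599),
}

ratios = {
    name: overfitting_ratio(train, test)
    for name, (train, test) in printed_rmse.items()
}

for name, ratio in ratios.items():
    print(f"{name:32} | ratio: {ratio:.3f}")
```

Because the ratio is built from MSE rather than RMSE, it makes the gap look larger than the RMSE figures suggest: the Random Forest's train RMSE is roughly 53% of its test RMSE, while the reported ratio is 0.282.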

In [19]:
# Generate OVERFITTED Predictions with Proper Rating Scale (0.5-5.0)
print("🎬 Generating OVERFITTED movie predictions with proper rating scale...")

# Apply overfitted feature engineering to ALL movies
print("🔧 Applying overfitted features to entire movie database...")
all_movies_overfitted = engineer_overfitted_features(movies_clean_filtered, ml_data_clean)

# Filter for unwatched movies
unwatched_overfitted = all_movies_overfitted[~all_movies_overfitted['tmdb_id'].isin(my_watched_movies_set)].copy()
print(f"Unwatched movies for overfitted prediction: {len(unwatched_overfitted)}")

# Prepare features (same as training)
unwatched_features_over = unwatched_overfitted[non_zero_variance_cols].fillna(0)

# Make predictions with overfitted model
print(f"Making predictions with {best_overfitted_name}...")
if 'Ridge' in best_overfitted_name:
    unwatched_features_over_scaled = scaler_overfitted.transform(unwatched_features_over)
    raw_predictions = best_overfitted_model.predict(unwatched_features_over_scaled)
else:
    raw_predictions = best_overfitted_model.predict(unwatched_features_over)

# PROPERLY CLIP PREDICTIONS TO LETTERBOXD SCALE (0.5 to 5.0)
clipped_predictions = np.clip(raw_predictions, 0.5, 5.0)

print(f"Raw prediction range: {raw_predictions.min():.2f} to {raw_predictions.max():.2f}")
print(f"Clipped prediction range: {clipped_predictions.min():.2f} to {clipped_predictions.max():.2f}")

# Add clipped predictions to dataframe
unwatched_overfitted['predicted_rating'] = clipped_predictions

# Filter for high predicted ratings (3.5+ stars to see more variety)
high_rated_overfitted = unwatched_overfitted[unwatched_overfitted['predicted_rating'] >= 3.5].copy()

# Sort and get recommendations
overfitted_recommendations = high_rated_overfitted.nlargest(100, 'predicted_rating')

print(f"🎯 Found {len(high_rated_overfitted)} movies predicted 3.5+ stars")
print(f"🎯 Found {len(unwatched_overfitted[unwatched_overfitted['predicted_rating'] >= 4.0])} movies predicted 4.0+ stars")
print(f"🎯 Found {len(unwatched_overfitted[unwatched_overfitted['predicted_rating'] >= 4.5])} movies predicted 4.5+ stars")

# Display top overfitted recommendations
print(f"\n🧠 TOP 25 OVERFITTED ML RECOMMENDATIONS:")
overfitted_final_recs = []

for i, (_, movie) in enumerate(overfitted_recommendations.iterrows(), 1):
    title = movie['movie_title'] 
    year = int(movie['year_released']) if pd.notna(movie['year_released']) else 'Unknown'
    pred_rating = movie['predicted_rating']
    
    # Get genre info if available
    genre_info = ""
    if 'genre_string' in movie and pd.notna(movie['genre_string']) and movie['genre_string']:
        genre_text = str(movie['genre_string'])  # Coerce once so slicing and len() operate on the same string
        genres = genre_text[:50] + "..." if len(genre_text) > 50 else genre_text
        genre_info = f" | {genres}"
    
    overfitted_final_recs.append({
        'rank': i,
        'title': title,
        'year': year,
        'predicted_rating': pred_rating,
        'tmdb_id': movie['tmdb_id'],
        'genre_info': genre_info
    })
    
    if i <= 25:  # Show top 25
        print(f"{i:2d}. {title} ({year}) - {pred_rating:.2f}/5.0{genre_info}")

# Compare prediction distributions
print(f"\n📊 OVERFITTED PREDICTION ANALYSIS:")
print(f"Your actual rating distribution:")
actual_dist = ml_data_clean['Rating'].value_counts().sort_index()
for rating, count in actual_dist.items():
    print(f"   {rating:.1f} stars: {count:3d} movies ({count/len(ml_data_clean)*100:.1f}%)")

print(f"\nPredicted rating distribution (all unwatched movies):")
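pred_bins = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
# include_lowest=True keeps predictions clipped to exactly 0.5 in the first bin;
# no extra upper bin is needed because predictions are already clipped to 5.0
pred_hist = pd.cut(clipped_predictions, bins=pred_bins, include_lowest=True).value_counts().sort_index()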
for interval, count in pred_hist.items():
    if count > 0:
        print(f"   {interval}: {count:5d} movies ({count/len(clipped_predictions)*100:.1f}%)")

# Save overfitted recommendations with genre info
overfitted_filename = 'data/movie_recommendations_OVERFITTED_with_genres.txt'
with open(overfitted_filename, 'w', encoding='utf-8') as f:  # Explicit UTF-8 so the emoji headers write on any platform
    f.write("🧠 OVERFITTED ML RECOMMENDATIONS (with Genre Info)\n")
    f.write("=" * 70 + "\n")
    f.write(f"Model: {best_overfitted_name}\n")
    f.write(f"Features: {len(non_zero_variance_cols)} engineered features including genres\n")
    f.write(f"Training data: {len(ml_data_clean)} of your rated movies\n")
    f.write(f"Cross-validation RMSE: {overfitted_results[best_overfitted_name]['CV_RMSE']:.4f}\n")
    f.write(f"🎯 DESIGNED TO OVERFIT TO YOUR SPECIFIC TASTE\n")
    f.write(f"Generated on: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
    
    # Write non-overlapping rating tiers so each movie is listed only once
    for min_rating, upper_bound in [(4.5, float('inf')), (4.0, 4.5), (3.5, 4.0)]:
        tier_movies = [m for m in overfitted_final_recs if min_rating <= m['predicted_rating'] < upper_bound]
        if tier_movies:
            f.write(f"🌟 TIER {min_rating}+ STARS ({len(tier_movies)} movies):\n")
            f.write("-" * 50 + "\n")
            for movie in tier_movies[:20]:  # Top 20 per tier
                f.write(f"{movie['rank']:3d}. {movie['title']} ({movie['year']})\n")
                f.write(f"     🤖 Predicted: {movie['predicted_rating']:.2f}/5.0{movie['genre_info']}\n\n")
            f.write("\n")

print(f"✅ Saved {len(overfitted_final_recs)} OVERFITTED recommendations to '{overfitted_filename}'")

🎬 Generating OVERFITTED movie predictions with proper rating scale...
🔧 Applying overfitted features to entire movie database...
Unwatched movies for overfitted prediction: 277403
Making predictions with Random Forest (Overfitted)...
Raw prediction range: 1.72 to 4.70
Clipped prediction range: 1.72 to 4.70
🎯 Found 229692 movies predicted 3.5+ stars
🎯 Found 85796 movies predicted 4.0+ stars
🎯 Found 346 movies predicted 4.5+ stars

🧠 TOP 25 OVERFITTED ML RECOMMENDATIONS:
 1. The Saint (1997) - 4.70/5.0 | ["Thriller","Action","Romance","Science Fiction","...
 2. Ashes of Paradise (1997) - 4.70/5.0 | ["Crime","Drama","Mystery","Thriller"]
 3. Beat (1997) - 4.70/5.0 | ["Action","Drama"]
 4. Secrets of Madonna (1997) - 4.70/5.0 | ["Horror","Drama"]
 5. A Rifle for Sleeping (1997) - 4.70/5.0 | ["Crime","Thriller"]
 6. The Mill on the Floss (1997) - 4.70/5.0 | ["Drama","Romance"]
 7. Con Air (1997) - 4.70/5.0 | ["Action","Thriller","Crime"]
 8. Run for Your Life (1997) - 4.70/5.0 | ["Drama","T