# Movie Recommendation System - Model Training
## Train Collaborative Filtering Models in Kaggle

This notebook trains collaborative filtering models on a sample of the MovieLens 25M dataset.

**Steps:**
1. Add MovieLens dataset from Kaggle
2. Install dependencies
3. Train models (SVD, KNN, ALS)
4. Download trained model file

**⚠️ IMPORTANT FOR KAGGLE:**
- Turn ON Internet in Settings (right sidebar)
- Add MovieLens dataset: Click "+ Add Data" → Search "movielens-25m" → Add
- If the dataset is not attached, Step 1 downloads it directly from GroupLens

In [None]:
# STEP 1: Locate or Download the MovieLens 25M Dataset
# On Kaggle, use the attached dataset if present; otherwise download it.

import os

KAGGLE_DATASET_PATH = '/kaggle/input/movielens-25m-dataset'

if os.path.exists('/kaggle/input'):
    print("✅ Running on Kaggle!")

if os.path.exists(KAGGLE_DATASET_PATH):
    print("✅ MovieLens dataset found in Kaggle input!")
    data_path = KAGGLE_DATASET_PATH
else:
    # Dataset not attached (or not on Kaggle): download from GroupLens
    print("📥 Downloading MovieLens 25M dataset...")
    !wget -q https://files.grouplens.org/datasets/movielens/ml-25m.zip
    !unzip -q ml-25m.zip
    data_path = 'ml-25m'
    print("✅ Dataset downloaded!")

print(f"\nDataset location: {data_path}")
!ls {data_path}

In [None]:
# STEP 1.5: Check GPU Status
# Runs before Step 2, so it relies on torch being preinstalled (as on Kaggle and Colab)
print("🔍 Checking GPU status...")

import torch

if torch.cuda.is_available():
    print("✅ GPU is available!")
    print(f"   Device count: {torch.cuda.device_count()}")
    for i in range(torch.cuda.device_count()):
        print(f"   GPU {i}: {torch.cuda.get_device_name(i)}")
        print(f"   Memory: {torch.cuda.get_device_properties(i).total_memory / 1024**3:.1f} GB")
    
    # Clear GPU cache
    torch.cuda.empty_cache()
    print("✅ GPU cache cleared and ready!")
else:
    print("⚠️ No GPU detected - will use CPU")
    print("   Training will be slower but still work")

print("\n💡 TIP: You should see GPU usage increase during training!")
print("   Watch the GPU percentage in the session stats (right sidebar)")

In [None]:
# STEP 2: Install Required Libraries (GPU-Accelerated)
print("📦 Installing GPU-accelerated libraries...")
!pip install -q pandas numpy scikit-learn scipy torch

import pandas as pd
import numpy as np
import torch
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix
import pickle
import logging
import json

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Check GPU availability
if torch.cuda.is_available():
    device = torch.device('cuda')
    gpu_count = torch.cuda.device_count()
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   GPU count: {gpu_count}")
    for i in range(gpu_count):
        print(f"   GPU {i}: {torch.cuda.get_device_name(i)}")
    
    # Test GPU
    test_tensor = torch.rand(5, 5).to(device)
    print(f"✅ GPU test successful! Tensor device: {test_tensor.device}")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected, using CPU")

print("✅ Libraries installed!")

In [None]:
# STEP 3: Define MEMORY-OPTIMIZED Collaborative Filtering Model
class CollaborativeFilteringModel:
    def __init__(self, use_gpu=True):
        self.user_movie_matrix = None
        self.user_similarity_matrix = None
        self.user_ids = None
        self.movie_ids = None
        self.svd_model = None
        self.knn_model = None
        self.user_factors = None
        self.item_factors = None
        self.n_factors = 50
        
        # GPU setup
        self.use_gpu = use_gpu and torch.cuda.is_available()
        self.device = torch.device('cuda' if self.use_gpu else 'cpu')
        print(f"🔧 Model initialized on: {self.device}")
        
    def prepare_data(self, ratings_data, movies_data):
        try:
            print("   Converting to DataFrames...")
            self.ratings_df = pd.DataFrame(ratings_data)
            self.movies_df = pd.DataFrame(movies_data)
            
            print("   Creating user-movie matrix...")
            # Cap the number of users BEFORE pivoting: pivot_table builds a
            # dense DataFrame, so a large user set can exhaust memory.
            max_users = 5000
            unique_users = self.ratings_df['user_id'].unique()
            if len(unique_users) > max_users:
                print(f"   ⚠️ {len(unique_users):,} users found, sampling {max_users:,}...")
                rng = np.random.default_rng(42)  # seeded for reproducibility
                sampled_users = rng.choice(unique_users, max_users, replace=False)
                self.ratings_df = self.ratings_df[self.ratings_df['user_id'].isin(sampled_users)]

            self.user_movie_matrix = self.ratings_df.pivot_table(
                index='user_id', columns='movie_id', values='rating'
            ).fillna(0)
            
            self.user_ids = list(self.user_movie_matrix.index)
            self.movie_ids = list(self.user_movie_matrix.columns)
            print(f"✅ Data prepared: {len(self.user_ids)} users, {len(self.movie_ids)} movies")
            
            # Clean up
            import gc
            gc.collect()
            return True
        except Exception as e:
            print(f"❌ Error: {e}")
            return False
    
    def compute_user_similarity(self):
        """Compute similarity with memory optimization"""
        print("   Computing similarity matrix (memory-optimized)...")
        try:
            if self.use_gpu:
                # Process in batches to avoid GPU OOM
                batch_size = 500
                n_users = len(self.user_ids)
                
                print(f"   Processing {n_users} users in batches of {batch_size}...")
                similarity_matrix = np.zeros((n_users, n_users), dtype=np.float32)
                
                matrix_np = self.user_movie_matrix.values.astype(np.float32)

                # Upload and L2-normalize the full matrix once, outside the loop
                full_norm = torch.nn.functional.normalize(
                    torch.tensor(matrix_np, device=self.device, dtype=torch.float32), p=2, dim=1
                )

                for i in range(0, n_users, batch_size):
                    end_i = min(i + batch_size, n_users)

                    # Cosine similarity of this batch of users against all users
                    similarity_batch = torch.mm(full_norm[i:end_i], full_norm.t())
                    similarity_matrix[i:end_i] = similarity_batch.cpu().numpy()
                    del similarity_batch

                    if (i // batch_size + 1) % 5 == 0:
                        print(f"   Progress: {end_i}/{n_users} users processed")

                # Release GPU memory once all batches are done
                del full_norm
                torch.cuda.empty_cache()

                self.user_similarity_matrix = similarity_matrix
                print("   ✅ Computed on GPU (batched)")
            else:
                # CPU with memory optimization
                from sklearn.metrics.pairwise import cosine_similarity
                self.user_similarity_matrix = cosine_similarity(self.user_movie_matrix.values.astype(np.float32))
                print("   ✅ Computed on CPU")
            
            print("✅ User similarity computed")
            import gc
            gc.collect()
            return True
        except Exception as e:
            print(f"❌ Error in similarity computation: {e}")
            return False
    
    def train_svd_model(self, n_components=30):
        """Reduced components for memory efficiency"""
        print(f"   Fitting SVD model ({n_components} components)...")
        sparse_matrix = csr_matrix(self.user_movie_matrix.values.astype(np.float32))
        self.svd_model = TruncatedSVD(n_components=n_components, random_state=42)
        self.svd_model.fit(sparse_matrix)
        print(f"✅ SVD trained ({n_components} components)")
        import gc
        gc.collect()
        return True
    
    def train_knn_model(self, n_neighbors=15):
        """Reduced neighbors for faster training"""
        print(f"   Fitting KNN model ({n_neighbors} neighbors)...")
        self.knn_model = NearestNeighbors(
            n_neighbors=n_neighbors, 
            metric='cosine', 
            algorithm='brute',
            n_jobs=-1
        )
        self.knn_model.fit(self.user_movie_matrix.values.astype(np.float32))
        print(f"✅ KNN trained ({n_neighbors} neighbors)")
        import gc
        gc.collect()
        return True
    
    def train_als_model(self, n_factors=30, n_iterations=5, lambda_reg=0.1):
        """OPTIMIZED ALS: Fewer factors and iterations"""
        print(f"   Training ALS model ({n_factors} factors, {n_iterations} iterations)...")
        self.n_factors = n_factors
        R = csr_matrix(self.user_movie_matrix.values.astype(np.float32))
        n_users, n_items = R.shape
        
        if self.use_gpu:
            print("   🚀 Using GPU for ALS training")
            # Convert to dense only for small matrices
            if n_users * n_items < 10_000_000:  # ~10M elements
                R_dense = torch.tensor(R.toarray(), device=self.device, dtype=torch.float32)
                
                # Initialize factors on GPU
                torch.manual_seed(42)
                user_factors = torch.randn(n_users, n_factors, device=self.device, dtype=torch.float32) * 0.1
                item_factors = torch.randn(n_items, n_factors, device=self.device, dtype=torch.float32) * 0.1
                
                # Training loop
                for iteration in range(n_iterations):
                    user_factors = self._als_step_gpu(R_dense, item_factors, lambda_reg)
                    item_factors = self._als_step_gpu(R_dense.t(), user_factors, lambda_reg)
                    
                    # Compute RMSE every iteration
                    predictions = torch.mm(user_factors, item_factors.t())
                    mask = R_dense > 0
                    diff = (R_dense[mask] - predictions[mask]) ** 2
                    rmse = torch.sqrt(diff.mean()).item()
                    print(f"  Iteration {iteration+1}/{n_iterations}, RMSE: {rmse:.4f}")
                    
                    # Clear cache
                    torch.cuda.empty_cache()
                
                self.user_factors = user_factors.cpu().numpy()
                self.item_factors = item_factors.cpu().numpy()
                del R_dense, user_factors, item_factors
            else:
                print("   Matrix too large for GPU, using CPU...")
                self._train_als_cpu(R, n_factors, n_iterations, lambda_reg)
        else:
            self._train_als_cpu(R, n_factors, n_iterations, lambda_reg)
        
        if self.use_gpu:
            torch.cuda.empty_cache()
        import gc
        gc.collect()
        print("✅ ALS trained")
        return True
    
    def _train_als_cpu(self, R, n_factors, n_iterations, lambda_reg):
        """CPU ALS training"""
        print("   Using CPU for ALS training")
        n_users, n_items = R.shape
        np.random.seed(42)
        self.user_factors = np.random.normal(0, 0.1, (n_users, n_factors)).astype(np.float32)
        self.item_factors = np.random.normal(0, 0.1, (n_items, n_factors)).astype(np.float32)
        
        for iteration in range(n_iterations):
            self.user_factors = self._als_step_cpu(R, self.item_factors, lambda_reg)
            self.item_factors = self._als_step_cpu(R.T, self.user_factors, lambda_reg)
            
            # Compute RMSE over observed (non-zero) ratings without densifying R
            coo = R.tocoo()
            predicted = np.sum(self.user_factors[coo.row] * self.item_factors[coo.col], axis=1)
            rmse = np.sqrt(np.mean((coo.data - predicted) ** 2))
            print(f"  Iteration {iteration+1}/{n_iterations}, RMSE: {rmse:.4f}")
    
    def _als_step_gpu(self, R, fixed_factors, lambda_reg):
        """GPU ALS step with memory optimization"""
        n_users, n_factors = R.shape[0], fixed_factors.shape[1]
        updated_factors = torch.zeros(n_users, n_factors, device=self.device, dtype=torch.float32)
        lambda_eye = lambda_reg * torch.eye(n_factors, device=self.device, dtype=torch.float32)
        
        # Process in batches
        batch_size = 100
        for start in range(0, n_users, batch_size):
            end = min(start + batch_size, n_users)
            for u in range(start, end):
                rated_mask = R[u] > 0
                if not rated_mask.any():
                    continue
                
                ratings = R[u][rated_mask]
                factors = fixed_factors[rated_mask]
                
                A = torch.mm(factors.t(), factors) + lambda_eye
                b = torch.mv(factors.t(), ratings)
                
                try:
                    updated_factors[u] = torch.linalg.solve(A, b)
                except RuntimeError:
                    # Fall back to least squares if A is singular; lstsq expects a 2-D right-hand side
                    updated_factors[u] = torch.linalg.lstsq(A, b.unsqueeze(1)).solution.squeeze(1)
        
        return updated_factors
    
    def _als_step_cpu(self, R, fixed_factors, lambda_reg):
        """CPU ALS step"""
        n_users, n_factors = R.shape[0], fixed_factors.shape[1]
        updated_factors = np.zeros((n_users, n_factors), dtype=np.float32)
        
        for u in range(n_users):
            rated_items = R[u].nonzero()[1]
            if len(rated_items) == 0:
                continue
            
            ratings = R[u, rated_items].toarray().flatten().astype(np.float32)
            factors = fixed_factors[rated_items].astype(np.float32)
            
            A = factors.T @ factors + lambda_reg * np.eye(n_factors, dtype=np.float32)
            b = factors.T @ ratings
            
            try:
                updated_factors[u] = np.linalg.solve(A, b)
            except np.linalg.LinAlgError:
                # Fall back to least squares if A is singular
                updated_factors[u] = np.linalg.lstsq(A, b, rcond=None)[0]
        
        return updated_factors
    
    def save_model(self, filepath):
        model_data = {
            'user_movie_matrix': self.user_movie_matrix,
            'user_similarity_matrix': self.user_similarity_matrix,
            'user_ids': self.user_ids,
            'movie_ids': self.movie_ids,
            'svd_model': self.svd_model,
            'knn_model': self.knn_model,
            'user_factors': self.user_factors,
            'item_factors': self.item_factors,
            'n_factors': self.n_factors
        }
        with open(filepath, 'wb') as f:
            pickle.dump(model_data, f)
        print(f"✅ Model saved to {filepath}")
        return True

print("✅ Memory-optimized model class loaded!")
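
The ALS steps in the class above (`_als_step_gpu` and `_als_step_cpu`) solve one small ridge-regression problem per user, or per item when called on the transposed matrix. As a sketch in notation that is not part of the notebook: let $Y_u$ hold the rows of the fixed factor matrix for the items that user $u$ rated, $r_u$ the corresponding ratings, $k$ the number of factors, and $\lambda$ the value of `lambda_reg`. Each update minimizes

$$
\min_{x_u} \; \lVert r_u - Y_u x_u \rVert_2^2 + \lambda \lVert x_u \rVert_2^2
$$

and setting the gradient to zero gives the closed form

$$
x_u = \left( Y_u^\top Y_u + \lambda I_k \right)^{-1} Y_u^\top r_u .
$$

In the code, `A` corresponds to $Y_u^\top Y_u + \lambda I_k$ and `b` to $Y_u^\top r_u$. For $\lambda > 0$ the matrix `A` is positive definite, so the direct solve should normally succeed and the least-squares fallback is only a safeguard.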

In [None]:
# STEP 4: Load and Sample Data (OPTIMIZED FOR KAGGLE)
print("📂 Loading MovieLens data...")

# Use the data_path from Step 1
movies_df = pd.read_csv(f'{data_path}/movies.csv')
print(f"✅ Loaded {len(movies_df):,} movies")

# OPTIMIZED: Load only the needed columns with compact dtypes to reduce memory
print("📥 Loading ratings (this may take a moment)...")
ratings_df = pd.read_csv(f'{data_path}/ratings.csv', 
                         usecols=['userId', 'movieId', 'rating'],  # Only load needed columns
                         dtype={'userId': 'int32', 'movieId': 'int32', 'rating': 'float32'})  # Use smaller dtypes
print(f"✅ Loaded {len(ratings_df):,} ratings")

# KAGGLE-OPTIMIZED SAMPLE SIZES (prevents kernel crashes)
SAMPLE_SIZE = 50000  # REDUCED from 100k - safer for Kaggle
# Recommended sizes:
# - 50k ratings = 5-10 min, LOW memory usage ✅ SAFEST
# - 100k ratings = 10-15 min, MEDIUM memory usage
# - 200k ratings = 20-30 min, HIGH memory usage (risky on Kaggle)

print(f"\n📊 Sampling {SAMPLE_SIZE:,} ratings for training...")
print("💡 Using smaller sample to prevent kernel crashes")

# Uniform random sample of ratings (seeded for reproducibility)
ratings_sample = ratings_df.sample(n=min(SAMPLE_SIZE, len(ratings_df)), random_state=42)

# Get only movies that appear in sample
movie_ids_in_sample = ratings_sample['movieId'].unique()
movies_sample = movies_df[movies_df['movieId'].isin(movie_ids_in_sample)]

print(f"✅ Sampled: {len(movies_sample):,} movies, {len(ratings_sample):,} ratings")
print(f"📊 Memory usage: ~{ratings_sample.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

# Clear original dataframes to free memory
del ratings_df
import gc
gc.collect()
print("✅ Memory cleaned")

In [None]:
# STEP 5: Format Data
print("🔄 Formatting data...")
ratings_data = [{
    'user_id': f"user_{int(row['userId'])}",  # iterrows upcasts ints to float; cast back
    'movie_id': int(row['movieId']),
    'rating': float(row['rating'])
} for _, row in ratings_sample.iterrows()]

movies_data = [{
    'id': int(row['movieId']),
    'title': row['title'],
    'genres': json.dumps([{"name": g} for g in row['genres'].split('|')])
} for _, row in movies_sample.iterrows()]

print(f"✅ Formatted {len(ratings_data):,} ratings, {len(movies_data):,} movies")

In [None]:
# STEP 6: Train All Models (MEMORY-OPTIMIZED)
print("\n" + "="*60)
print("🤖 TRAINING MODELS (MEMORY-OPTIMIZED)")
print("="*60)
print("💡 Using reduced parameters to prevent kernel crashes")

model = CollaborativeFilteringModel(use_gpu=True)

print("\n1️⃣ Preparing data...")
model.prepare_data(ratings_data, movies_data)

print("\n2️⃣ Computing user similarity (batched)...")
model.compute_user_similarity()

print("\n3️⃣ Training SVD (30 components)...")
model.train_svd_model(n_components=30)  # Reduced from 50

print("\n4️⃣ Training KNN (15 neighbors)...")
model.train_knn_model(n_neighbors=15)  # Reduced from 20

print("\n5️⃣ Training ALS (30 factors, 5 iterations)...")
model.train_als_model(n_factors=30, n_iterations=5)  # Reduced from 50 factors, 10 iterations

print("\n" + "="*60)
print("✅ ALL MODELS TRAINED!")
print("="*60)

# Show memory usage
if torch.cuda.is_available():
    print(f"\n📊 GPU Memory Usage:")
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1024**2
        reserved = torch.cuda.memory_reserved(i) / 1024**2
        print(f"   GPU {i}: {allocated:.1f} MB allocated, {reserved:.1f} MB reserved")
    
    # Final cleanup
    torch.cuda.empty_cache()
    print("✅ GPU cache cleared")

# CPU memory cleanup
import gc
gc.collect()
print("✅ Memory optimized")

In [None]:
# STEP 7: Save Model
print("\n" + "="*60)
print("💾 SAVING MODEL")
print("="*60)

model_filename = 'collaborative_filtering_trained.pkl'
print(f"Saving model to: {model_filename}")

if model.save_model(model_filename):
    print(f"✅ Model saved successfully!")
    
    # Get file size
    import os
    file_size = os.path.getsize(model_filename) / (1024 * 1024)
    print(f"📦 File size: {file_size:.1f} MB")
else:
    print("❌ Failed to save model")

print("="*60)

In [None]:
# STEP 8: Download Model (Kaggle & Colab Compatible)
print("\n" + "="*60)
print("📥 PREPARING MODEL FOR DOWNLOAD")
print("="*60)

# Check environment
if os.path.exists('/kaggle/working'):
    print("✅ Running on Kaggle!")
    print(f"✅ Model saved to: /kaggle/working/{model_filename}")
    print("\n📥 TO DOWNLOAD:")
    print("1. Click 'Save Version' (top right, blue button)")
    print("2. Select 'Save & Run All (Commit)'")
    print("3. Wait for completion notification")
    print("4. Go to 'Output' tab (right sidebar)")
    print("5. Click download icon next to the .pkl file")
    print("\n💡 The file will be in the Output section after saving!")
    
elif os.path.exists('/content'):
    print("✅ Running on Google Colab!")
    try:
        from google.colab import files
        print("📥 Downloading model file...")
        files.download(model_filename)
        print("✅ Download started! Check your browser downloads.")
    except Exception:
        print("⚠️ Auto-download failed. File saved as:", model_filename)
else:
    print("✅ Running locally!")
    print(f"Model saved as: {model_filename}")
    print("Copy this file to: d:\\Movie recommendation system\\backend\\saved_models\\")

print("="*60)

## 📥 Next Steps

### **For Kaggle Users:**

After the notebook finishes:

1. **Click "Save Version"** (top right, blue button)
2. **Select "Save & Run All (Commit)"**
3. **Wait for it to complete** (check notification)
4. **Go to the "Output" tab** (right side)
5. **Download** `collaborative_filtering_trained.pkl`

### **For Colab Users:**

The file downloads automatically to your browser's download folder.

---

### **Place the Downloaded File:**

```
d:\Movie recommendation system\backend\saved_models\collaborative_filtering_trained.pkl
```

Create the `saved_models` folder if it doesn't exist.
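
Once the file is in place, it can be read back with `pickle`. The sketch below assumes the dictionary layout written by `save_model` (`user_ids`, `movie_ids`, `user_factors`, `item_factors`); `load_model` and `recommend_from_als` are hypothetical helper names, not part of the notebook or the backend. Unpickling the real file also requires pandas and scikit-learn to be installed, because the saved dictionary contains a DataFrame and fitted scikit-learn models.

```python
import pickle

import numpy as np


def load_model(filepath):
    """Load the dictionary written by CollaborativeFilteringModel.save_model."""
    with open(filepath, "rb") as f:
        return pickle.load(f)


def recommend_from_als(model_data, user_id, top_n=5, exclude=()):
    """Hypothetical helper: rank movies for one user by ALS score.

    The score of a movie is the dot product of the user's factor vector
    with the movie's factor vector. Movie ids listed in `exclude`
    (for example, movies already rated) are skipped.
    """
    user_index = model_data["user_ids"].index(user_id)
    scores = model_data["item_factors"] @ model_data["user_factors"][user_index]
    excluded = set(exclude)
    ranked = [
        (model_data["movie_ids"][i], float(scores[i]))
        for i in np.argsort(-scores)  # highest score first
        if model_data["movie_ids"][i] not in excluded
    ]
    return ranked[:top_n]
```

For example, `model_data = load_model("collaborative_filtering_trained.pkl")` followed by `recommend_from_als(model_data, model_data["user_ids"][0])` would return up to five `(movie_id, score)` pairs for the first user in the trained matrix.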

---

## 🎯 Kaggle Tips

- **Turn ON Internet:** Settings → Internet → ON (required for pip installs)
- **Use GPU:** Settings → Accelerator → GPU (faster training)
- **Increase Sample Size:** Kaggle's weekly GPU quota (about 30 hours) leaves room for larger runs, e.g. `SAMPLE_SIZE = 500000`
- **Save Often:** Click "Save Version" to avoid losing progress

---

## 🚀 Training Options

- **Quick (5-10 min):** `SAMPLE_SIZE = 50000`
- **Medium (10-20 min):** `SAMPLE_SIZE = 100000`
- **Better (20-40 min):** `SAMPLE_SIZE = 200000`
- **Best (1-2 hours):** `SAMPLE_SIZE = 1000000` or remove sampling
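
Larger samples take longer partly because they contain more distinct movies, and the notebook stores ratings as a dense user-movie matrix, so memory grows with users × movies. A rough way to gauge this before raising `SAMPLE_SIZE` is sketched below; `dense_matrix_gib` is a hypothetical helper, the 5,000-user figure is the cap applied in `prepare_data`, and the movie counts are assumed values for illustration only.

```python
def dense_matrix_gib(n_users, n_movies, bytes_per_value=4):
    """Size in GiB of a dense n_users x n_movies matrix (4 bytes per float32 value)."""
    return n_users * n_movies * bytes_per_value / 1024**3


# 5,000 users matches the cap in prepare_data; the movie counts are assumptions.
for n_movies in (10_000, 20_000, 40_000):
    size = dense_matrix_gib(5_000, n_movies)
    print(f"5,000 users x {n_movies:,} movies: {size:.2f} GiB")
```

The same function applies to the user similarity matrix by passing the user count for both dimensions.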

---

**Happy Training! 🎉**