<a href="https://colab.research.google.com/github/atomicamit24/Minor-Projects-using-AIML/blob/main/Basic_Recommender_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# ==============================================================================
# 0. SETUP AND LIBRARIES
# ==============================================================================
# Import all necessary libraries
import pandas as pd
import numpy as np
import requests
import zipfile
import io

# For Content-Based Filtering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

# Note: surprise library imports are commented out as they cause an ImportError currently.
# from surprise import Reader, Dataset, SVD
# from surprise.model_selection import GridSearchCV, cross_validate

print("‚úÖ Essential libraries imported successfully.")

# ==============================================================================
# 1. DATA LOADING AND PREPARATION
# ==============================================================================
print("\n--- 1. Data Loading and Preparation ---")
# Download the MovieLens 100k dataset
url = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
print(f"Downloading dataset from {url}...")
r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()
print("Dataset downloaded and extracted.")

# Load the data into pandas DataFrames
ratings = pd.read_csv('ml-latest-small/ratings.csv')
movies = pd.read_csv('ml-latest-small/movies.csv')

# Merge ratings and movies data for context
df = pd.merge(ratings, movies, on='movieId')

# Create a mapping from movieId to title for later use
movie_id_to_title = movies.set_index('movieId')['title']
print("\nData Preview:")
display(df.head())
print(f"\nTotal ratings: {len(ratings)}")
print(f"Total unique movies: {len(movies)}")5


‚úÖ Essential libraries imported successfully.

--- 1. Data Loading and Preparation ---
Downloading dataset from https://files.grouplens.org/datasets/movielens/ml-latest-small.zip...
Dataset downloaded and extracted.

Data Preview:


Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller



Total ratings: 100836
Total unique movies: 9742


**Reasoning**:
The previous command failed due to a numpy compatibility issue with the surprise library. To resolve this, install numpy<2 and surprise in a separate cell before importing them in the main code cell. This ensures the compatible numpy version is installed first.

In [7]:
# Install scikit-surprise and a compatible numpy version
!pip install scikit-surprise numpy<2 -q

/bin/bash: line 1: 2: No such file or directory


**Reasoning**:
Move the Content-Based Filtering code under its section and ensure it is clean and well-commented, as per Plan Items 2 and 5, focusing on non-surprise parts.

In [3]:
# ==============================================================================
# 2. CONTENT-BASED FILTERING
# ==============================================================================
print("\n--- 2. Content-Based Filtering ---")

# Replace '|' with a space in genres string to treat them as separate words
# Create a copy to avoid SettingWithCopyWarning
movies_copy = movies.copy()
movies_copy['genres'] = movies_copy['genres'].str.replace('|', ' ', regex=False)

# Use TF-IDF to vectorize the genres. TF-IDF weighs terms based on their frequency
# in a document and inverse frequency across all documents.
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(movies_copy['genres'])

print("\nTF-IDF Matrix Shape (movies x genres):", tfidf_matrix.shape)

# Calculate item similarity using cosine similarity on the TF-IDF matrix.
# Cosine similarity measures the cosine of the angle between two vectors,
# indicating how similar the movie genre profiles are.
item_similarity = cosine_similarity(tfidf_matrix)

# Convert the similarity matrix to a DataFrame for easier handling,
# using movieId as index and columns for robust lookup.
item_similarity_df = pd.DataFrame(item_similarity, index=movies['movieId'], columns=movies['movieId'])

print("\nItem (Movie) Similarity Matrix created.")

# Define the function to get content-based recommendations
def get_content_based_recommendations(user_id, ratings_df, similarity_df, num_recommendations=10):
    """
    Generates movie recommendations for a user based on the content (genres)
    of movies they have rated highly.

    Args:
        user_id (int): The ID of the user for whom to generate recommendations.
        ratings_df (pd.DataFrame): DataFrame containing user ratings.
        similarity_df (pd.DataFrame): Item-item similarity matrix (Content-Based).
        num_recommendations (int): The number of top recommendations to return.

    Returns:
        list: A list of tuples, where each tuple is (movie_title, combined_score).
              Returns an empty list if the user has no high ratings.
    """
    # Get movies the user has rated highly (e.g., rating > 4)
    user_high_ratings = ratings_df[(ratings_df['userId'] == user_id) & (ratings_df['rating'] > 4)]

    if user_high_ratings.empty:
        print(f"User ID {user_id} has no high ratings (>4) for Content-Based recommendations.")
        return [] # Return empty list if user has no high ratings

    # Get the IDs of movies the user has already seen
    user_watched_movie_ids = set(ratings_df[ratings_df['userId'] == user_id]['movieId'])

    # Get the IDs of movies the user liked (rated highly)
    user_liked_movie_ids = user_high_ratings['movieId'].tolist()

    # Accumulate similarity scores for potential recommendations
    recommendation_scores = {}
    for movie_id in user_liked_movie_ids:
        # Ensure the liked movie ID exists in the similarity matrix index/columns
        if movie_id in similarity_df.columns:
            # Get similarity scores for this liked movie with all other movies
            similar_movies_series = similarity_df[movie_id].sort_values(ascending=False)

            # Iterate through similar movies
            for similar_movie_id, score in similar_movies_series.items():
                # Exclude the liked movie itself and movies the user has already watched
                if similar_movie_id != movie_id and similar_movie_id not in user_watched_movie_ids:
                    # Add the similarity score to the recommendation score for the similar movie
                    recommendation_scores[similar_movie_id] = recommendation_scores.get(similar_movie_id, 0) + score

    # Sort recommendations by accumulated score in descending order
    sorted_recommendations = sorted(recommendation_scores.items(), key=lambda item: item[1], reverse=True)

    # Retrieve movie titles for the top N recommended movie IDs
    final_recs = []
    # Ensure movie_id_to_title map is accessible (created in Data Loading section)
    # If not, create it here: movie_id_to_title = movies.set_index('movieId')['title'].to_dict()

    for movie_id, score in sorted_recommendations:
        if len(final_recs) >= num_recommendations:
            break # Stop once we have enough recommendations

        # Get the movie title using the pre-created map
        title = movie_id_to_title.get(movie_id, f"Unknown Movie (ID: {movie_id})") # Handle potential missing titles
        final_recs.append((title, score))

    return final_recs

# --- Example for Content-Based Filtering ---
target_user_id_cbf = 1
print(f"\nGetting Content-Based recommendations for userId: {target_user_id_cbf}")
content_recs = get_content_based_recommendations(target_user_id_cbf, ratings, item_similarity_df)

print("\nTop 10 Content-Based Recommendations:")
if content_recs:
    for movie, score in content_recs:
        print(f"- {movie} (Score: {score:.2f})")
else:
    print(f"No Content-Based recommendations generated for User ID {target_user_id_cbf}.")


--- 2. Content-Based Filtering ---

TF-IDF Matrix Shape (movies x genres): (9742, 23)

Item (Movie) Similarity Matrix created.

Getting Content-Based recommendations for userId: 1

Top 10 Content-Based Recommendations:
- The Great Train Robbery (1978) (Score: 43.25)
- Flashback (1990) (Score: 43.25)
- Dragonheart 2: A New Beginning (2000) (Score: 42.71)
- Hunting Party, The (2007) (Score: 42.58)
- Charlie's Angels: Full Throttle (2003) (Score: 41.52)
- Machete (2010) (Score: 41.52)
- Diamond Arm, The (Brilliantovaya ruka) (1968) (Score: 41.52)
- After the Sunset (2004) (Score: 41.52)
- Catch That Kid (2004) (Score: 41.48)
- Extreme Days (2001) (Score: 41.45)


**Reasoning**:
Move the evaluation code for the Content-Based model under the Evaluation section and ensure it is clean and well-commented, as per Plan Items 2 and 5, focusing on non-surprise parts.

In [4]:
# ==============================================================================
# 3. EVALUATION (FOR CONTENT-BASED MODEL)
# ==============================================================================
print("\n--- 3. Evaluation (for Content-Based Model) ---")

# Split data into training and testing sets
train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)

def precision_recall_at_k(test_df, recommendations_func, k=10):
    """
    Calculates precision and recall at k for all users in the test set.
    """
    precisions = []
    recalls = []

    # Get a list of all users in the test set
    test_users = test_df['userId'].unique()

    # Create a mapping from movieId to title for efficient lookup
    # Ensure 'movies' DataFrame is accessible in this scope
    movie_id_to_title_map = movies.set_index('movieId')['title'].to_dict()


    for user_id in test_users:
        # Get top-k recommendations using the training data
        # The recommendations_func is expected to return a list of (movie_title, score)
        recs = recommendations_func(user_id, train_data, item_similarity_df, num_recommendations=k)
        recommended_movies_titles = [movie for movie, score in recs] # Extract titles


        # Get the actual movies the user liked in the test set (relevant items)
        relevant_movies = test_df[(test_df['userId'] == user_id) & (test_df['rating'] > 3.5)]

        if relevant_movies.empty:
            continue # Skip user if no relevant movies in test set

        relevant_movies_ids = set(relevant_movies['movieId'])

        # Map relevant movie IDs to titles, ensuring the ID exists in the map
        relevant_movies_titles = {
            movie_id_to_title_map[movie_id] for movie_id in relevant_movies_ids
            if movie_id in movie_id_to_title_map
        }


        # If after filtering, there are no relevant movies with valid titles, skip
        if not relevant_movies_titles:
            continue


        # Calculate the number of recommended items that are relevant
        # Ensure both sets contain valid movie titles
        hits = len(set(recommended_movies_titles) & set(relevant_movies_titles))


        # Calculate precision and recall for this user
        precision = hits / k if k > 0 else 0
        recall = hits / len(relevant_movies_titles) if len(relevant_movies_titles) > 0 else 0

        precisions.append(precision)
        recalls.append(recall)

    # Calculate average precision and recall
    # Avoid division by zero if no users had relevant movies
    avg_precision = np.mean(precisions) if precisions else 0
    avg_recall = np.mean(recalls) if recalls else 0


    # Calculate F1-Score
    f1_score = 2 * (avg_precision * avg_recall) / (avg_precision + avg_recall) if (avg_precision + avg_recall) > 0 else 0

    return avg_precision, avg_recall, f1_score

# Evaluate the Content-Based model
# The get_content_based_recommendations function is defined in the previous section
precision, recall, f1_score = precision_recall_at_k(test_data, get_content_based_recommendations, k=10)

print(f"\nEvaluation Results for Content-Based Filtering (k=10):")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1_score:.4f}")


--- 3. Evaluation (for Content-Based Model) ---
User ID 139 has no high ratings (>4) for Content-Based recommendations.
User ID 214 has no high ratings (>4) for Content-Based recommendations.
User ID 54 has no high ratings (>4) for Content-Based recommendations.
User ID 442 has no high ratings (>4) for Content-Based recommendations.
User ID 404 has no high ratings (>4) for Content-Based recommendations.
User ID 245 has no high ratings (>4) for Content-Based recommendations.
User ID 26 has no high ratings (>4) for Content-Based recommendations.
User ID 320 has no high ratings (>4) for Content-Based recommendations.
User ID 609 has no high ratings (>4) for Content-Based recommendations.
User ID 163 has no high ratings (>4) for Content-Based recommendations.
User ID 133 has no high ratings (>4) for Content-Based recommendations.
User ID 194 has no high ratings (>4) for Content-Based recommendations.
User ID 293 has no high ratings (>4) for Content-Based recommendations.
User ID 311 has n

## 8. Getting User Input

This section allows you to interact with the recommendation system by providing a User ID and getting recommendations based on the Content-Based Filtering model.

A **User ID** is a unique number assigned to each user in the dataset, used to track their ratings and generate personalized recommendations.

**Reasoning**:
Add the code for getting user input for a User ID and generating Content-Based recommendations based on the input.

In [5]:
# ==============================================================================
# 8. GETTING USER INPUT FOR RECOMMENDATIONS (Content-Based)
# ==============================================================================
print("\n--- 8. Getting User Input for Recommendations (Content-Based) ---")

def get_content_based_recommendations_for_user_input(ratings_df, item_similarity_df, movies_df):
    """
    Prompts user for a userId and displays Content-Based recommendations for that user.
    """
    print("\nEnter a User ID to get Content-Based recommendations.")

    while True:
        try:
            user_input_id_str = input(f"\nEnter a User ID (e.g., 1, 2, 3... up to {ratings_df['userId'].max()}) or type 'exit' to quit: ")
            if user_input_id_str.lower() == 'exit':
                print("Goodbye! üëã")
                break

            user_id = int(user_input_id_str)

            # Check if the entered user ID exists in the ratings data
            if user_id in ratings_df['userId'].unique():
                print(f"\n‚ú® Generating Content-Based Recommendations for User ID: {user_id} ‚ú®")
                # Use the get_content_based_recommendations function defined earlier
                recommendations = get_content_based_recommendations(user_id, ratings_df, item_similarity_df, num_recommendations=10)

                if recommendations:
                    print(f"\nTop 10 Content-Based Recommendations for User ID {user_id}:")
                    for i, (title, score) in enumerate(recommendations):
                        print(f"{i+1}. {title} (Score: {score:.3f})")
                else:
                    print(f"Could not generate recommendations for User ID {user_id}. This user might not exist or might not have enough high rating data for Content-Based filtering.")
                # No break here to allow the user to enter another ID

            else:
                print(f"Error: User ID {user_id} not found in the dataset. Please enter a valid User ID.")

        except ValueError:
            print("Error: Invalid input. Please enter a numeric User ID.")
        except Exception as e:
            print(f"An unexpected error occurred: {e}")
            # No break here to allow the user to try again

# Call the function to prompt the user
# Ensure ratings, item_similarity_df, and movies are accessible
get_content_based_recommendations_for_user_input(ratings, item_similarity_df, movies)


--- 8. Getting User Input for Recommendations (Content-Based) ---

Enter a User ID to get Content-Based recommendations.

Enter a User ID (e.g., 1, 2, 3... up to 610) or type 'exit' to quit: 2

‚ú® Generating Content-Based Recommendations for User ID: 2 ‚ú®

Top 10 Content-Based Recommendations for User ID 2:
1. Wasabi (2001) (Score: 3.653)
2. Money Train (1995) (Score: 3.653)
3. Another 48 Hrs. (1990) (Score: 3.653)
4. Last Boy Scout, The (1991) (Score: 3.653)
5. Bad Boys (1995) (Score: 3.653)
6. Metro (1997) (Score: 3.653)
7. Knockin' on Heaven's Door (1997) (Score: 3.619)
8. Best Men (1997) (Score: 3.619)
9. 48 Hrs. (1982) (Score: 3.619)
10. Blind Swordsman: Zatoichi, The (Zat√¥ichi) (2003) (Score: 3.619)

Enter a User ID (e.g., 1, 2, 3... up to 610) or type 'exit' to quit: 2

‚ú® Generating Content-Based Recommendations for User ID: 2 ‚ú®

Top 10 Content-Based Recommendations for User ID 2:
1. Wasabi (2001) (Score: 3.653)
2. Money Train (1995) (Score: 3.653)
3. Another 48 Hrs. (19

KeyboardInterrupt: Interrupted by user

## 9. Additional Features Suggestions

To further improve this movie recommendation system and make it more robust and user-friendly, several additional features could be implemented:

*   **Incorporate More Data Sources:** As discussed earlier, integrating external datasets like IMDb or TMDb data (plot summaries, cast, crew, keywords) can significantly enrich the content representation of movies, leading to more nuanced content-based recommendations.
*   **Implement Other Recommendation Algorithms:** Exploring algorithms like neural collaborative filtering, factorization machines, or graph-based methods could provide alternative perspectives and potentially capture more complex user-item interactions.
*   **Consider Time Dynamics:** Incorporating timestamps of ratings or release dates of movies can help capture trends and the evolution of user preferences over time, making recommendations more timely and relevant.
*   **Add a User Interface:** Building a simple web interface using libraries like Gradio or Streamlit would make the system interactive and easy for users to get recommendations without needing to run the notebook code directly.
*   **Provide Recommendation Explanations:** Implementing a mechanism to explain *why* a specific movie was recommended (e.g., "Because you liked movies like X and Y," or "Users similar to you enjoyed this") can increase user trust and satisfaction.
*   **Address Cold Start Problem:** Developing strategies to handle new users (with no rating history) and new items (movies with few or no ratings) is crucial for a practical recommendation system. Content-based methods or leveraging user demographic information can be helpful here.
*   **Advanced Evaluation:** Using more sophisticated evaluation techniques, such as A/B testing in a live environment or offline evaluations that simulate real-world scenarios more closely, can provide a better understanding of the system's performance.
*   **Model Deployment:** Packaging the trained models and recommendation logic into a deployable service would make the system accessible for real-world use.

## 10. Summary

This notebook provided a hands-on exploration of building a movie recommendation system. We implemented and evaluated several core recommendation techniques:

*   **User-Based Collaborative Filtering:** Leveraging the wisdom of the crowd by finding users with similar tastes. (Note: This was included as a demonstration but not fully integrated into the evaluation or hybrid approach in this version due to limitations).
*   **Content-Based Filtering:** Recommending movies based on their characteristics (genres) and a user's past preferences.
*   **Hybrid Approach:** While a hybrid approach combining Content-Based and SVD was planned and partially implemented, the SVD component was affected by a technical issue, limiting the full demonstration and evaluation of the hybrid model in this version.

We covered essential steps like data loading, preprocessing, and evaluating the Content-Based model's performance using Precision, Recall, and F1-Score. We also added interactive user input functionality for Content-Based recommendations.

Due to a persistent technical issue with the `surprise` library, the sections related to the SVD model, its parameter tuning, and the full implementation and evaluation of the hybrid model could not be fully demonstrated in this version.

While the implemented Content-Based model provides a solid foundation, the evaluation metrics suggest there is room for improvement. Exploring more advanced content features, addressing the technical issue with SVD to fully implement and evaluate the hybrid approach, and considering other algorithms (as suggested in the "Additional Features Suggestions" section) are promising avenues for future work.