
In [None]:
!pip install pandas scikit-learn



In [None]:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

movies_url = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
!wget $movies_url -O ml-latest-small.zip
!unzip -o ml-latest-small.zip

movies_df = pd.read_csv("ml-latest-small/movies.csv")

print("Sample movies:")
print(movies_df.head())

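# Build a text "soup" per movie by turning the pipe-separated genres into space-separated words.
movies_df['content_soup'] = movies_df['genres'].str.replace('|', ' ', regex=False)

# Vectorize the genre text with TF-IDF.
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies_df['content_soup'])

# Pairwise cosine similarity between all movies.
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Map each title to its row index, keeping the first row when a title appears more than once.
# Deduplicate on the title index; Series.drop_duplicates() would only compare the values (row numbers).
title_to_index = pd.Series(movies_df.index, index=movies_df['title'])
title_to_index = title_to_index[~title_to_index.index.duplicated(keep='first')]

def recommend_movies(title, cosine_sim=cosine_sim, df=movies_df):
    if title not in title_to_index:
        return f"Movie '{title}' not found in dataset."

    idx = title_to_index[title]
    sim_scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)
    # Exclude the query movie explicitly: movies with identical genres tie with it,
    # so it is not guaranteed to be sorted first.
    movie_indices = [i for i, _ in sim_scores if i != idx][:5]  # Top 5 excluding the movie itself
    return df['title'].iloc[movie_indices]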

movie_title = "Toy Story (1995)"
print(f"\nTop 5 recommendations for '{movie_title}':")
print(recommend_movies(movie_title))


--2025-10-15 08:24:53--  https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.96.204
Connecting to files.grouplens.org (files.grouplens.org)|128.101.96.204|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2025-10-15 08:24:54 (2.85 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]

Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  
Sample movies:
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Fa

 **Major Limitation of a Purely Content-Based Recommendation System**

A major limitation of a purely content-based recommendation system is its **inability to capture user preferences beyond item features**. Since recommendations are based solely on the attributes of items (e.g., genres, actors, keywords), the system tends to suggest items that are very similar to what the user has already consumed.

This leads to two key issues:

- **Lack of Serendipity**: Users are rarely surprised or exposed to diverse content. The system reinforces existing preferences and may never suggest something outside the user's known interests—even if they might enjoy it.
  
- **Cold Start for Users**: For new users with limited interaction history, the system struggles to personalize recommendations effectively, as it doesn't learn from other users' behaviors.

In contrast, collaborative filtering can uncover hidden patterns by leveraging the preferences of similar users, enabling more diverse and unexpected suggestions.
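The lack of serendipity described above can be seen on a tiny example. The sketch below is only an illustration: the catalogue is hypothetical (not taken from MovieLens), and it uses plain binary genre vectors rather than the TF-IDF weighting used in this notebook.

```python
import numpy as np

# Hypothetical toy catalogue: title -> set of genres (not MovieLens data).
catalogue = {
    "Animated Adventure A": {"Adventure", "Animation", "Children"},
    "Animated Adventure B": {"Adventure", "Animation", "Children"},
    "Animated Comedy": {"Animation", "Children", "Comedy"},
    "Courtroom Drama": {"Drama"},
    "Space Thriller": {"Sci-Fi", "Thriller"},
}

titles = list(catalogue)
genres = sorted(set().union(*catalogue.values()))

# Binary genre vectors (assumption: a simplification of TF-IDF weighting).
vectors = np.array(
    [[1.0 if g in catalogue[t] else 0.0 for g in genres] for t in titles]
)

# Cosine similarity between every pair of movies.
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
similarity = unit @ unit.T

def most_similar(title, k=2):
    """Return the k titles closest in genre to `title`, excluding itself."""
    idx = titles.index(title)
    order = np.argsort(-similarity[idx], kind="stable")
    return [titles[i] for i in order if i != idx][:k]

recommended = most_similar("Animated Adventure A")
print(recommended)
```

Only titles that share genres with the query can score above zero, so a drama or a thriller never surfaces for a viewer of animated films, however much that viewer might enjoy one.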


In [None]:
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD

ratings = pd.read_csv("ml-latest-small/ratings.csv")
movies = pd.read_csv("ml-latest-small/movies.csv")

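# Pivot to a users x movies matrix; unrated entries are filled with 0,
# so the SVD below treats a missing rating as a rating of 0.
user_item_matrix = ratings.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)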
print("User-Item Matrix shape:", user_item_matrix.shape)


User-Item Matrix shape: (610, 9724)


In [None]:

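# Factorize the user-item matrix into 20 latent dimensions.
svd = TruncatedSVD(n_components=20, random_state=42)
matrix_svd = svd.fit_transform(user_item_matrix)  # user factors, shape (n_users, 20)

# Multiply user factors by item factors to reconstruct an approximate rating matrix.
approx_ratings = np.dot(matrix_svd, svd.components_)

predicted_df = pd.DataFrame(approx_ratings, index=user_item_matrix.index, columns=user_item_matrix.columns)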

# Sample prediction: User 1, Movie 1
sample_user = 1
sample_movie = 1
predicted_rating = predicted_df.loc[sample_user, sample_movie]
print(f"\nPredicted rating for user {sample_user} on movie {sample_movie}: {predicted_rating:.2f}")



Predicted rating for user 1 on movie 1: 2.79


**Core Assumption of Collaborative Filtering (SVD)**

Collaborative filtering models like **SVD** assume that:

> **Users with similar rating patterns share similar preferences.**

This means the model learns latent features from the user-item matrix that represent hidden dimensions of taste. It predicts unknown ratings by projecting users and items into this shared latent space.

### Implications:
- No need for item metadata (e.g., genres or cast)
- Can uncover surprising recommendations based on behavior
- Suffers from cold-start issues for new users or items

SVD is powerful when there's enough rating data to reveal meaningful patterns.
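To make the idea of a shared latent space concrete, here is a minimal sketch on a hypothetical 4x4 rating matrix. It uses `np.linalg.svd` directly instead of `TruncatedSVD`, and marks unrated entries with 0 to mirror the `fillna(0)` convention above. The matrix and the choice of `k = 2` are assumptions made for illustration.

```python
import numpy as np

# Hypothetical ratings: users 0-1 like movies 0-1, users 2-3 like movies 2-3.
# A 0 means "not rated" (same zero-fill convention as the notebook).
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

k = 2  # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)

user_factors = U[:, :k] * s[:k]   # each row places one user in latent space
item_factors = Vt[:k, :]          # each column places one movie in latent space

# Predicted ratings are dot products of user and item factors.
approx = user_factors @ item_factors
print(np.round(approx, 2))
```

In this toy matrix the rank-2 reconstruction comes out at about 4.5 inside each taste group and about 0.5 across groups, so users with similar rating patterns receive the same predictions. The low cross-group values also show how the zero fill pulls predictions for unrated movies toward 0.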


## **Cold Start Problem in Recommender Systems**

The **cold start problem** happens when a recommender system struggles to make good suggestions because it doesn't have enough data.

### New User Problem
When a new user joins, the system doesn’t know their preferences yet, so it’s hard to recommend anything personalized.

### New Movie Problem
When a new movie is added, no one has rated it yet, so the system doesn’t know who might like it.


---

##  Which Model Handles Cold Start Better?

| Scenario  | Model                   | Handles cold start? | Why? |
|-----------|-------------------------|---------------------|------|
| New User  | Content-Based           | Yes | It uses movie features (like genres), not rating history, so it can still recommend movies similar to a title the user picks. |
| New Movie | Content-Based           | Yes | It uses the movie’s own content (genres), so it can suggest it to users who liked similar genres. |
| New User  | Collaborative Filtering | No  | It needs user ratings to find similar users. Without data, it can't personalize. |
| New Movie | Collaborative Filtering | No  | It needs ratings from users to learn who likes the movie. No ratings means no recommendations. |

---

Content-based models are better at handling cold start situations because they rely on item features (like genres) instead of user behavior. Collaborative filtering needs interaction data, so it struggles when that data is missing.
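The new-movie case can be sketched in a few lines. Everything below is hypothetical data chosen for illustration: a small rating matrix, a new movie with no ratings, and three-genre vectors. The `cosine` helper is also an assumption of this sketch, not part of the notebook.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity that returns 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(a @ b / denom)

# Hypothetical ratings: 3 users x 3 existing movies (0 = not rated).
ratings = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 1.0],
    [1.0, 0.0, 5.0],
])

# A new movie arrives: nobody has rated it, so its column is all zeros.
ratings_with_new = np.hstack([ratings, np.zeros((3, 1))])

# Collaborative view: compare the new movie's rating column with the others.
collab_scores = [
    cosine(ratings_with_new[:, 3], ratings_with_new[:, j]) for j in range(3)
]

# Content view: genre vectors over [Animation, Comedy, Drama].
genre_vectors = np.array([
    [1.0, 1.0, 0.0],  # movie 0
    [1.0, 0.0, 0.0],  # movie 1
    [0.0, 0.0, 1.0],  # movie 2
    [1.0, 1.0, 0.0],  # new movie, same genres as movie 0
])
content_scores = [cosine(genre_vectors[3], genre_vectors[j]) for j in range(3)]

print("collaborative:", collab_scores)
print("content-based:", content_scores)
```

With no ratings, every collaborative similarity is zero and the new movie cannot be ranked, while the genre vectors already place it next to movie 0.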


In [None]:
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.base import BaseEstimator, TransformerMixin

ratings = pd.read_csv("ml-latest-small/ratings.csv")

user_item_matrix = ratings.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)
print("User-Item Matrix shape:", user_item_matrix.shape)


User-Item Matrix shape: (610, 9724)


In [None]:
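class SVDWrapper(BaseEstimator, TransformerMixin):
    def __init__(self, n_components=20):
        # Only store hyperparameters here: GridSearchCV changes them through set_params(),
        # so a model built in __init__ would keep the default n_components.
        self.n_components = n_components

    def fit(self, X, y=None):
        # Build the SVD at fit time so it uses the current n_components.
        self.model_ = TruncatedSVD(n_components=self.n_components, random_state=42)
        self.model_.fit(X)
        return self

    def transform(self, X):
        return self.model_.transform(X)

    def score(self, X, y=None):
        # Negative reconstruction MSE, so that higher is better for GridSearchCV.
        reconstructed = np.dot(self.model_.transform(X), self.model_.components_)
        return -mean_squared_error(X, reconstructed)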

In [None]:
param_grid = {'n_components': [10, 20, 30, 40]}

grid = GridSearchCV(SVDWrapper(), param_grid, cv=3)
grid.fit(user_item_matrix)

print("Best parameters:", grid.best_params_)
print("Best score (negative MSE):", grid.best_score_)


Best parameters: {'n_components': 10}
Best score (negative MSE): -0.15491735106214533


In [None]:

best_n = grid.best_params_['n_components']
final_svd = TruncatedSVD(n_components=best_n, random_state=42)
user_features = final_svd.fit_transform(user_item_matrix)
approx_ratings = np.dot(user_features, final_svd.components_)
predicted_df = pd.DataFrame(approx_ratings, index=user_item_matrix.index, columns=user_item_matrix.columns)

# Sample prediction: User 1, Movie 1
user_id = 1
movie_id = 1
if movie_id in predicted_df.columns:
    predicted_rating = predicted_df.loc[user_id, movie_id]
    print(f"\nPredicted rating for user {user_id} on movie {movie_id}: {predicted_rating:.2f}")
else:
    print(f"Movie ID {movie_id} not found.")



Predicted rating for user 1 on movie 1: 2.86


##  Why Hyperparameter Tuning Is Important

Hyperparameter tuning helps a machine learning model perform better by finding the best settings for learning. It’s like adjusting the knobs on a radio to get the clearest signal.

### Risks of Using Default Parameters

- They may not fit your data well.
- They can lead to poor predictions or overfitting.
- You might miss out on better performance.

By tuning parameters like `n_components` in SVD, we help the model learn the most useful patterns in the data, leading to more accurate and reliable recommendations.
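One way to compare values of `n_components` is to hide some known ratings, fit on the rest, and measure the error on the hidden ones. The sketch below does this with NumPy on synthetic data. The data, the 20% hold-out rate, the candidate list, and the `heldout_rmse` helper are all assumptions for illustration; this is not the evaluation used elsewhere in this notebook.

```python
import numpy as np

def heldout_rmse(ratings, mask, k):
    """RMSE on held-out entries after a rank-k SVD fit on the remaining ones."""
    # Hide the held-out ratings with zeros (same zero-fill convention as the notebook).
    train = np.where(mask, 0.0, ratings)
    U, s, Vt = np.linalg.svd(train, full_matrices=False)
    approx = (U[:, :k] * s[:k]) @ Vt[:k, :]
    errors = ratings[mask] - approx[mask]
    return float(np.sqrt(np.mean(errors ** 2)))

rng = np.random.default_rng(0)

# Synthetic "ratings" driven by 3 hidden taste dimensions plus a little noise.
true_ratings = (
    rng.normal(size=(40, 3)) @ rng.normal(size=(3, 25))
    + 0.1 * rng.normal(size=(40, 25))
)

# Hold out roughly 20% of the entries for evaluation.
heldout_mask = rng.random(true_ratings.shape) < 0.2

candidates = [1, 2, 3, 5, 10]
scores = {k: heldout_rmse(true_ratings, heldout_mask, k) for k in candidates}
best_k = min(scores, key=scores.get)

print(scores)
print("Best n_components on held-out entries:", best_k)
```

Scoring on entries the model never saw is what lets the comparison reflect prediction quality, not just how closely a larger model can reproduce its own input.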


##  Hybrid Recommendation Strategy

We combined the top 10 recommendations from:
- Content-Based Filtering (based on genre similarity to the selected movie)
- Collaborative Filtering (based on the user's past ratings)

Then we removed duplicates and kept the first 10 unique titles.

### Ranking Strategy
We did not prioritize one model over the other. The final list preserves the original order—content-based first, then collaborative—until we reach 10 unique movies.
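The merge described above can be sketched as a small helper. The function name `merge_recommendations` and the sample titles are hypothetical; the logic follows the stated rule of content-based first, then collaborative, skipping duplicates until the limit is reached.

```python
def merge_recommendations(content_based, collaborative, limit=10):
    """Concatenate two ranked lists, drop duplicates, keep the first `limit` titles."""
    merged = []
    seen = set()
    for title in list(content_based) + list(collaborative):
        if len(merged) >= limit:
            break
        if title in seen:
            continue  # already recommended by the other model
        seen.add(title)
        merged.append(title)
    return merged

# Hypothetical ranked outputs from the two models.
content_top = ["Movie A", "Movie B", "Movie C"]
collab_top = ["Movie B", "Movie D", "Movie A", "Movie E"]

hybrid = merge_recommendations(content_top, collab_top, limit=4)
print(hybrid)
```

Because the content-based list is consumed first, the collaborative list only contributes when the content-based list has fewer unique titles than the limit.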

### Why Hybrid Is Powerful
- Content-Based captures item similarity.
- Collaborative Filtering captures user behavior.
- Hybrid systems balance both, improving personalization and robustness—especially when one model lacks data.
