 
# KNN-based Movie Recommender with MovieLens (ml-latest-small)

In this notebook we will:

1. Load the **MovieLens ml-latest-small** dataset.
2. Explore the ratings data briefly.
3. Build a **User × Movie** rating matrix.
4. Use **K-Nearest Neighbours (KNN)** on items (movies) to:
   - Find movies similar to a given movie.
   - Recommend movies to a specific user based on their past ratings.

**Conceptual point (compared to K-means on customers)**

- In your customer-segmentation dataset, we cluster customers with K-means.
- Here, we have **users × movies** and use **KNN on movies** to find:
  - “Movies that are close together in the rating space”.
- Recommendation then is:
  - “Given what a user liked, suggest similar movies they haven’t seen.”


In [1]:
import os

import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors


 
## 1. Set up paths and load data



In [2]:
 
 
DATA_DIR = r"C:\Users\er_si\Desktop\Corporate Institutions Trainings & Business\Final Clients Trainings\Educational Institutes\RPS-Wipro\python_AIML\cum_segment\knn_Cus_recommendation\ml-latest-small"  # e.g. "C:\\Users\\yourname\\Downloads\\ml-latest-small"

ratings_path = os.path.join(DATA_DIR, "ratings.csv")
movies_path = os.path.join(DATA_DIR, "movies.csv")

ratings = pd.read_csv(ratings_path)
movies = pd.read_csv(movies_path)

ratings.head(), movies.head()


(   userId  movieId  rating  timestamp
 0       1        1     4.0  964982703
 1       1        3     4.0  964981247
 2       1        6     4.0  964982224
 3       1       47     5.0  964983815
 4       1       50     5.0  964982931,
    movieId                               title  \
 0        1                    Toy Story (1995)   
 1        2                      Jumanji (1995)   
 2        3             Grumpier Old Men (1995)   
 3        4            Waiting to Exhale (1995)   
 4        5  Father of the Bride Part II (1995)   
 
                                         genres  
 0  Adventure|Animation|Children|Comedy|Fantasy  
 1                   Adventure|Children|Fantasy  
 2                               Comedy|Romance  
 3                         Comedy|Drama|Romance  
 4                                       Comedy  )

 
## 2. Quick EDA
We'll quickly inspect:
- Number of users
- Number of movies
- Number of ratings
- Basic rating statistics
 


In [3]:
 
n_users = ratings["userId"].nunique()
n_movies = ratings["movieId"].nunique()
n_ratings = len(ratings)

print(f"Number of users   : {n_users}")
print(f"Number of movies  : {n_movies}")
print(f"Number of ratings : {n_ratings}")

print("\nRatings summary:")
print(ratings["rating"].describe())


Number of users   : 610
Number of movies  : 9724
Number of ratings : 100836

Ratings summary:
count    100836.000000
mean          3.501557
std           1.042529
min           0.500000
25%           3.000000
50%           3.500000
75%           4.000000
max           5.000000
Name: rating, dtype: float64


 
## 3. Build the User–Movie ratings matrix

We create a matrix:

- Rows: `userId`
- Columns: `movieId`
- Values: rating (NaN if user has not rated that movie)

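Then we fill NaNs with **0** so that every movie vector has a numeric entry for every user. Note that this treats "not rated" as a rating of 0, which is lower than the smallest real rating in the data (0.5).

> Note: In a more advanced setup, we might:
> - Normalize by user mean (centered ratings)
> - Use sparse matrices
>
> For this small dataset, a dense matrix with 0-filled missing values is okay for experimentation.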
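
The two options mentioned in the note can be sketched on a toy ratings table. This is only an illustration and is not used by the rest of the notebook; `toy_ratings` and every other name in the block are made up for the example, and SciPy is assumed to be installed.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy data (hypothetical), same columns as ratings.csv minus the timestamp
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 2.0, 5.0, 3.0, 1.0],
})

# Option 1: centre each rating on the mean rating of the user who gave it
user_mean = toy_ratings.groupby("userId")["rating"].transform("mean")
toy_ratings["rating_centered"] = toy_ratings["rating"] - user_mean

# Option 2: build a sparse movie x user matrix directly, without a dense pivot
user_codes, user_ids = pd.factorize(toy_ratings["userId"], sort=True)
movie_codes, movie_ids = pd.factorize(toy_ratings["movieId"], sort=True)
sparse_item_user = csr_matrix(
    (toy_ratings["rating"].to_numpy(), (movie_codes, user_codes)),
    shape=(len(movie_ids), len(user_ids)),
)

print(toy_ratings)
print(sparse_item_user.shape, sparse_item_user.nnz)
```

A user with a single rating (user 3 here) ends up with a centred rating of 0, which is one reason centring needs some care on sparse data. `NearestNeighbors` with `algorithm="brute"` accepts a SciPy sparse matrix as input, so a matrix like `sparse_item_user` could in principle replace the dense one.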
"""


In [4]:
 
user_item_matrix = ratings.pivot_table(
    index="userId",
    columns="movieId",
    values="rating"
)

user_item_matrix.head()


userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,NaN,4.0,NaN,NaN,4.0,NaN,NaN,NaN,NaN,...,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN
2,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,...,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN
3,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,...,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN
4,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,...,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN
5,4.0,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,...,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN


In [5]:
 
# Fill NaNs with 0 (unrated movies)
user_item_matrix_filled = user_item_matrix.fillna(0)

user_item_matrix_filled.shape


(610, 9724)

 
The matrix shape is:

- `rows = number of users`
- `columns = number of movies`

Now we will **transpose** it to get an **item-user** matrix:

- Rows: movies
- Columns: users

This is convenient for item-based KNN:
- Each movie is represented as a vector of user ratings.


In [6]:
 
item_user_matrix = user_item_matrix_filled.T  # shape: (n_movies, n_users)
item_user_matrix.shape


(9724, 610)

 
## 4. Fit an Item-based KNN model

We will use:

- `NearestNeighbors` from scikit-learn
- Distance metric: **cosine distance** (works well for high-dimensional sparse data)
- Algorithm: `"brute"` (simple and fine for this dataset size)

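Fitting does not train any parameters: with `algorithm="brute"` the model stores the movie vectors and computes distances at query time. It can then answer:
> For each movie, find the nearest neighbour movies in the user-rating space.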
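
To make the choice of cosine distance concrete, here is a small sketch on a made-up 4 × 3 item-user matrix. The matrix and the names `toy_item_user` and `toy_knn` are hypothetical; only the `NearestNeighbors` settings match the ones used in this notebook.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy item-user matrix (hypothetical): 4 movies (rows) x 3 users (columns)
toy_item_user = np.array([
    [5.0, 4.0, 0.0],   # movie A
    [4.0, 5.0, 0.0],   # movie B: rated much like A
    [0.0, 0.0, 5.0],   # movie C: rated only by a different user
    [2.5, 2.0, 0.0],   # movie D: same direction as A, half the magnitude
])

toy_knn = NearestNeighbors(metric="cosine", algorithm="brute")
toy_knn.fit(toy_item_user)

# Neighbours of movie A (row 0); ask for all 4 rows, A itself included
distances, indices = toy_knn.kneighbors(toy_item_user[[0]], n_neighbors=4)

# Cross-check one value against the definition: 1 - (a . b) / (|a| |b|)
a, b = toy_item_user[0], toy_item_user[1]
manual = 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(indices[0], distances[0].round(4), round(manual, 4))
```

Cosine distance compares the direction of two rating vectors and ignores their length, so movie D is at distance 0 from movie A even though its ratings are half as high. It also means the query movie is not guaranteed to be the first neighbour returned when another movie ties with it at distance 0.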
"""


In [7]:
 
knn_model = NearestNeighbors(
    metric="cosine",
    algorithm="brute"
)

knn_model.fit(item_user_matrix.values)


 
## 5. Helper functions

We will create:

1. A function to **search for movies by partial title** (to get the movieId).
2. A function to **find similar movies** given a movie title.
3. A function to **recommend movies to a user** based on their past ratings.

### 5.1: Movie lookup by partial title


In [8]:
 
def search_movie(title_substring, n=10):
    """
    Search movies whose titles contain the given substring (case-insensitive).
    Returns a small dataframe with movieId and title.
    """
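    # regex=False: match the substring literally, so "(" or "[" in a title is not read as regex syntax
    mask = movies["title"].str.contains(title_substring, case=False, na=False, regex=False)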
    result = movies.loc[mask, ["movieId", "title"]].head(n)
    return result


# Example search
search_movie("Toy Story")


,movieId,title
0,1,Toy Story (1995)
2355,3114,Toy Story 2 (1999)
7355,78499,Toy Story 3 (2010)


 
### 5.2: Find similar movies (item-based KNN)

- Input: movie title (or part of it), number of neighbours.
- Steps:
  1. Find the corresponding `movieId`.
  2. Locate that movie in the `item_user_matrix` (row index).
  3. Use `kneighbors` to fetch nearest neighbour movies.
  4. Map neighbour movieIds back to titles.


In [9]:
 
def get_similar_movies(movie_title_substring, n_neighbors=10):
    """
    Given a movie title (or substring), find n_neighbors similar movies
    using item-based KNN.
    """
    # 1. Find movie(s) matching the substring
    candidates = search_movie(movie_title_substring, n=1)
    if candidates.empty:
        print(f"No movie found containing: {movie_title_substring}")
        return None

    # Take the first match for simplicity
    target_movie_id = candidates.iloc[0]["movieId"]
    target_title = candidates.iloc[0]["title"]
    print(f"Target movie: {target_movie_id} – {target_title}")

    # 2. Find the row index in the item_user_matrix
    if target_movie_id not in item_user_matrix.index:
        print("Movie ID not in item-user matrix (no ratings).")
        return None

    movie_idx = item_user_matrix.index.get_loc(target_movie_id)

    # 3. Query nearest neighbors
    distances, indices = knn_model.kneighbors(
        item_user_matrix.iloc[movie_idx, :].values.reshape(1, -1),
        n_neighbors=n_neighbors + 1  # +1 because the query movie is among its own neighbours
    )

    # 4. Collect neighbour movieIds and titles
    similar_movies = []
    for dist, idx in zip(distances[0], indices[0]):
        neighbor_movie_id = item_user_matrix.index[idx]
        # Skip the target itself; with tied distances it is not guaranteed to come first
        if neighbor_movie_id == target_movie_id:
            continue
        neighbor_title = movies.loc[movies["movieId"] == neighbor_movie_id, "title"].values[0]
        similar_movies.append((neighbor_movie_id, neighbor_title, dist))

    # Keep at most n_neighbors results
    similar_movies = similar_movies[:n_neighbors]

    result_df = pd.DataFrame(similar_movies, columns=["movieId", "title", "distance"])
    # Lower distance = more similar
    return result_df


# Example usage:
similar_to_toy_story = get_similar_movies("Toy Story", n_neighbors=5)
similar_to_toy_story


Target movie: 1 – Toy Story (1995)


,movieId,title,distance
0,3114,Toy Story 2 (1999),0.427399
1,480,Jurassic Park (1993),0.434363
2,780,Independence Day (a.k.a. ID4) (1996),0.435738
3,260,Star Wars: Episode IV - A New Hope (1977),0.442612
4,356,Forrest Gump (1994),0.452904


 
### 5.3: Recommend movies to a user

Logic (simple version):

1. Take a **target userId**.
2. Get all their rated movies.
3. Select movies the user **liked** (e.g., rating >= 4.0).
4. For each liked movie:
   - Find similar movies (via KNN).
   - Give each candidate movie a **score** based on similarity.
5. Aggregate scores over all liked movies.
6. Exclude movies the user has already rated.
7. Return top-N recommendations.

This is a straightforward item-based collaborative filtering approach.
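
Steps 4 to 7 can be illustrated without any model, using hand-written neighbour lists. The movie labels and distances below are invented for the example; the function in the next cell obtains the real ones from `knn_model.kneighbors`.

```python
# Hypothetical neighbour lists: liked movie -> [(candidate movie, cosine distance)]
toy_neighbours = {
    "Liked A": [("X", 0.2), ("Y", 0.5)],
    "Liked B": [("X", 0.3), ("Z", 0.1), ("Liked A", 0.4)],
}
already_rated = {"Liked A", "Liked B"}

scores = {}
for liked, neighbours in toy_neighbours.items():
    for candidate, distance in neighbours:
        if candidate in already_rated:
            continue  # step 6: never recommend a movie the user has rated
        # steps 4-5: similarity = 1 - distance, summed over all liked movies
        scores[candidate] = scores.get(candidate, 0.0) + (1 - distance)

# step 7: keep the top-N candidates by aggregated score
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top)
```

Candidate `X` is a neighbour of both liked movies, so its two similarities add up and it ranks first. Because scores are sums, they are not bounded by 1: a candidate that is close to many liked movies can reach a score well above that.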
"""


In [10]:
 
def recommend_movies_for_user(
    user_id,
    n_neighbors_per_movie=10,
    n_recommendations=10,
    min_rating_for_like=4.0
):
    """
    Recommend movies for a given user based on item-based KNN.
    """
    if user_id not in user_item_matrix.index:
        raise ValueError(f"user_id {user_id} not found in the data.")

    # 1. Get the user's ratings (with NaNs removed)
    user_ratings = user_item_matrix.loc[user_id].dropna()
    if user_ratings.empty:
        print("This user has no ratings.")
        return None

    # 2. Find movies the user liked
    liked_movies = user_ratings[user_ratings >= min_rating_for_like].index.tolist()
    if not liked_movies:
        print("This user has no movies with rating >= threshold.")
        return None

    print(f"User {user_id} liked {len(liked_movies)} movies (rating >= {min_rating_for_like}).")

    # 3. Accumulate candidate scores
    candidate_scores = {}  # movieId -> aggregated score

    for movie_id in liked_movies:
        if movie_id not in item_user_matrix.index:
            continue

        movie_idx = item_user_matrix.index.get_loc(movie_id)

        distances, indices = knn_model.kneighbors(
            item_user_matrix.iloc[movie_idx, :].values.reshape(1, -1),
            n_neighbors=n_neighbors_per_movie + 1
        )

        for dist, idx in zip(distances[0], indices[0]):
            neighbor_movie_id = item_user_matrix.index[idx]

            # Skip the movie itself
            if neighbor_movie_id == movie_id:
                continue

            # Similarity score: 1 - distance (cosine distance)
            similarity = 1 - dist

            # Accumulate score
            candidate_scores[neighbor_movie_id] = candidate_scores.get(neighbor_movie_id, 0.0) + similarity

    # 4. Remove movies the user has already rated
    rated_movie_ids = set(user_ratings.index)
    candidate_scores = {
        mid: score for mid, score in candidate_scores.items()
        if mid not in rated_movie_ids
    }

    if not candidate_scores:
        print("No new candidates found for this user.")
        return None

    # 5. Sort candidates by score
    sorted_candidates = sorted(candidate_scores.items(), key=lambda x: x[1], reverse=True)
    top_candidates = sorted_candidates[:n_recommendations]

    # 6. Map to titles
    rec_movie_ids = [mid for mid, _ in top_candidates]
    rec_scores = [score for _, score in top_candidates]

    rec_titles = []
    for mid in rec_movie_ids:
        title = movies.loc[movies["movieId"] == mid, "title"].values
        title = title[0] if len(title) > 0 else f"(movieId {mid})"
        rec_titles.append(title)

    recommendations_df = pd.DataFrame({
        "movieId": rec_movie_ids,
        "title": rec_titles,
        "score": rec_scores
    })

    return recommendations_df


 
## 6. Try recommendations for a sample user
 

1. Pick a random `userId`.
2. See some of their rated movies.
3. Generate recommendations for that user.


In [11]:
 
# Pick a user by sampling one rating row; users with more ratings are more likely to be picked
sample_user_id = ratings["userId"].sample(1, random_state=42).iloc[0]
sample_user_id


432

In [12]:
 
# Show some movies this user has rated
user_ratings_sample = ratings[ratings["userId"] == sample_user_id].merge(
    movies, on="movieId"
).sort_values("rating", ascending=False)

user_ratings_sample.head(10)


,userId,movieId,rating,timestamp,title,genres
111,432,3949,5.0,1315242940,Requiem for a Dream (2000),Drama
247,432,81591,5.0,1315244863,Black Swan (2010),Drama|Thriller
231,432,71379,5.0,1316391330,Paranormal Activity (2009),Horror|Thriller
173,432,8533,5.0,1315244178,"Notebook, The (2004)",Drama|Romance
66,432,1835,5.0,1315244490,City of Angels (1998),Drama|Fantasy|Romance
67,432,1873,5.0,1316391161,"Misérables, Les (1998)",Crime|Drama|Romance|War
73,432,1997,5.0,1316391091,"Exorcist, The (1973)",Horror|Mystery
20,432,364,5.0,1316391519,"Lion King, The (1994)",Adventure|Animation|Children|Drama|Musical|IMAX
194,432,45503,5.0,1316391388,Peaceful Warrior (2006),Drama
209,432,57274,5.0,1316391360,[REC] (2007),Drama|Horror|Thriller


In [13]:
 
# Get recommendations for this user
user_recs = recommend_movies_for_user(
    user_id=sample_user_id,
    n_neighbors_per_movie=10,
    n_recommendations=10,
    min_rating_for_like=4.0
)

user_recs


User 432 liked 143 movies (rating >= 4.0).


,movieId,title,score
0,33794,Batman Begins (2005),6.603689
1,8961,"Incredibles, The (2004)",5.413832
2,1089,Reservoir Dogs (1992),5.411245
3,45499,X-Men: The Last Stand (2006),4.167082
4,380,True Lies (1994),4.021148
5,8636,Spider-Man 2 (2004),3.582946
6,457,"Fugitive, The (1993)",3.568333
7,150,Apollo 13 (1995),3.508736
8,89745,"Avengers, The (2012)",3.487925
9,2028,Saving Private Ryan (1998),3.338411


 
## 7. Summary

- We used **KNN** on **item (movie) vectors** derived from the User × Movie rating matrix.
- This is an example of **item-based collaborative filtering**:
  - Items (movies) are similar if rated similarly by users.
  - Recommendations are “movies similar to the ones you liked”.

### Connection back to your earlier discussion

- In your **segmented_customers** dataset:
  - K-means/K-medoids cluster **customers** based on features (Age, Income, etc.).
  - KNN can classify a new customer into a cluster.
- In this **MovieLens** example:
  - We use KNN on **items** (or users) to do recommendation.
  - This requires a **user–item interaction matrix**, which your customer-only dataset doesn’t have.

You can now extend this notebook by:

- Trying **user-based KNN** (neighbors on user vectors instead of items).
- Normalizing ratings (subtract user mean).
- Evaluating recommendation quality with metrics like precision@k, recall@k, etc.
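
As a starting point for the evaluation idea, here is a minimal sketch of precision@k and recall@k for a single user. The helper `precision_recall_at_k` is not part of this notebook, and the held-out set below is invented; a real evaluation would hide some of each user's ratings before building the matrix and then check whether the hidden liked movies are recommended.

```python
def precision_recall_at_k(recommended, relevant, k):
    """recommended: ranked list of movieIds; relevant: set of held-out liked movieIds."""
    top_k = recommended[:k]
    hits = sum(1 for mid in top_k if mid in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 5 recommended ids and 3 held-out liked ids
recommended_ids = [33794, 8961, 1089, 45499, 380]
held_out_liked = {8961, 380, 2571}

p, r = precision_recall_at_k(recommended_ids, held_out_liked, k=5)
print(p, r)
```

Here two of the five recommendations are in the held-out set, so precision@5 is 2/5 and recall@5 is 2/3. Averaging these values over many users gives a single number for comparing variants such as user-based KNN or mean-centred ratings.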
"""
