## Nearest Neighbour with Collaborative Filtering

In [27]:
import pandas as pd
import numpy as np

In [28]:
movies_df = pd.read_csv("movies.csv",usecols=["movieId","title"])
rating_df = pd.read_csv("ratings.csv",usecols=["userId","movieId","rating"])

In [29]:
rating_df.head()

,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [30]:
movies_df.head()

,movieId,title
0,1,Toy Story (1995)
1,2,Jumanji (1995)
2,3,Grumpier Old Men (1995)
3,4,Waiting to Exhale (1995)
4,5,Father of the Bride Part II (1995)


In [31]:
df = pd.merge(rating_df,movies_df,on="movieId")

In [32]:
df.head()

,userId,movieId,rating,title
0,1,1,4.0,Toy Story (1995)
1,1,3,4.0,Grumpier Old Men (1995)
2,1,6,4.0,Heat (1995)
3,1,47,5.0,Seven (a.k.a. Se7en) (1995)
4,1,50,5.0,"Usual Suspects, The (1995)"


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column   Non-Null Count   Dtype  
---  ------   --------------   -----  
 0   userId   100836 non-null  int64  
 1   movieId  100836 non-null  int64  
 2   rating   100836 non-null  float64
 3   title    100836 non-null  object 
dtypes: float64(1), int64(2), object(1)
memory usage: 3.1+ MB


In [34]:
df.isnull().sum()

userId     0
movieId    0
rating     0
title      0
dtype: int64

In [35]:
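# `title` has no nulls (see the check above), so this is a safeguard rather than a required step.
combine_movie_rating = df.dropna(axis=0, subset=["title"])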
movieRatingCount = (combine_movie_rating.groupby(by=["title"])["rating"]
                    .count()
                    .reset_index()
                    .rename(columns={'rating':'totalRatingCount'})
                    [['title','totalRatingCount']]
                   )
movieRatingCount

,title,totalRatingCount
0,'71 (2014),1
1,'Hellboy': The Seeds of Creation (2004),1
2,'Round Midnight (1986),2
3,'Salem's Lot (2004),1
4,'Til There Was You (1997),2
...,...,...
9714,eXistenZ (1999),22
9715,xXx (2002),24
9716,xXx: State of the Union (2005),5
9717,¡Three Amigos! (1986),26


In [36]:
ratingWithTotalRatingCount = combine_movie_rating.merge(movieRatingCount, on='title', how='left')
ratingWithTotalRatingCount.head(10)

,userId,movieId,rating,title,totalRatingCount
0,1,1,4.0,Toy Story (1995),215
1,1,3,4.0,Grumpier Old Men (1995),52
2,1,6,4.0,Heat (1995),102
3,1,47,5.0,Seven (a.k.a. Se7en) (1995),203
4,1,50,5.0,"Usual Suspects, The (1995)",204
5,1,70,3.0,From Dusk Till Dawn (1996),55
6,1,101,5.0,Bottle Rocket (1996),23
7,1,110,4.0,Braveheart (1995),237
8,1,151,5.0,Rob Roy (1995),44
9,1,157,5.0,Canadian Bacon (1995),11


In [37]:
ratingWithTotalRatingCount['totalRatingCount'].describe()

count    100836.000000
mean         58.758777
std          61.965384
min           1.000000
25%          13.000000
50%          39.000000
75%          84.000000
max         329.000000
Name: totalRatingCount, dtype: float64

In [39]:
popularity_threshold = 50
rating_popular_movie = ratingWithTotalRatingCount.query("totalRatingCount >= @popularity_threshold")

In [40]:
rating_popular_movie.head(5)

,userId,movieId,rating,title,totalRatingCount
0,1,1,4.0,Toy Story (1995),215
1,1,3,4.0,Grumpier Old Men (1995),52
2,1,6,4.0,Heat (1995),102
3,1,47,5.0,Seven (a.k.a. Se7en) (1995),203
4,1,50,5.0,"Usual Suspects, The (1995)",204


In [42]:
rating_popular_movie.shape

(41362, 5)
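
The count, merge, and filter steps above can also be collapsed into a single pass. The following is a minimal sketch on made-up toy data (the frame `toy`, its titles, and the threshold `toy_threshold` are illustrative assumptions, not part of the notebook's dataset); it uses `groupby(...).transform("count")` to attach each title's rating count directly to its rows.

```python
import pandas as pd

# Toy long-form ratings; titles and values are made up for illustration.
toy = pd.DataFrame({
    "userId": [1, 2, 3, 1, 2, 1],
    "title":  ["A", "A", "A", "B", "B", "C"],
    "rating": [4.0, 5.0, 3.0, 2.0, 4.5, 1.0],
})

# transform("count") broadcasts each title's rating count back onto its rows,
# so no separate merge is needed.
toy["totalRatingCount"] = toy.groupby("title")["rating"].transform("count")

toy_threshold = 2  # hypothetical threshold chosen for this toy data
toy_popular = toy[toy["totalRatingCount"] >= toy_threshold]
print(toy_popular)
```

Whether this is preferable to the explicit merge is a matter of taste; the merge version keeps `movieRatingCount` around as a separate table, which can be handy for inspection.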

# Understanding the Difference Between a Normal DataFrame and a Pivot Matrix

## 1. Introduction
In this guide, we explain the difference between a **normal DataFrame** and a **pivot matrix** using a movie ratings dataset as an example.

## 2. Normal DataFrame (Tabular Format)
A normal DataFrame contains data in a structured tabular format where each row represents an observation, and each column represents a feature.

### Example: Normal DataFrame (`rating_popular_movie`)
```plaintext
| title   | userId | rating | totalRatingCount |
|---------|--------|--------|------------------|
| Movie A | 101    | 5      | 100              |
| Movie B | 102    | 4      | 60               |
| Movie A | 103    | 3      | 100              |
| Movie C | 104    | 2      | 75               |
```

- Each row represents a **single rating** given by a user to a movie.
- The `title` column contains movie names.
- The `userId` column identifies the user who gave the rating.
- The `rating` column shows the rating given by the user.
- The `totalRatingCount` column indicates how many ratings the movie has received in total.

## 3. Pivot Matrix (User-Item Matrix Format)
A pivot matrix transforms the DataFrame so that **users become features (columns), and movies become index rows**. Each cell represents the rating given by a specific user to a movie.

### Example: Pivot Matrix (`movie_user_matrix`)
```plaintext
| title   | 101  | 102  | 103  | 104  |
|---------|------|------|------|------|
| Movie A | 5.0  | NaN  | 3.0  | NaN  |
| Movie B | NaN  | 4.0  | NaN  | NaN  |
| Movie C | NaN  | NaN  | NaN  | 2.0  |
```

- **Rows (`title`) represent movies**.
- **Columns (`userId` values) represent individual users**.
- **Values in cells represent ratings given by users**.
- `NaN` means the user has not rated that movie.

## 4. Key Differences
```plaintext
| Feature           | Normal DataFrame                          | Pivot Matrix                                      |
|-------------------|-------------------------------------------|---------------------------------------------------|
| Structure         | Long-form, row-based                      | Wide-form, matrix-style                           |
| Rows Represent    | Individual ratings                        | Movies                                            |
| Columns Represent | Features like `title`, `userId`, `rating` | Users (one column per `userId`)                   |
| Missing Values    | None expected (each row is a full rating) | `NaN` wherever a user has not rated a movie       |
| Use Case          | Storing and analyzing structured data     | Recommendation systems and similarity analysis    |
```

## 5. How to Create a Pivot Matrix in Pandas
You can convert a DataFrame into a pivot matrix using the `pivot_table()` method in Pandas:
```python
movie_user_matrix = rating_popular_movie.pivot_table(index='title', columns='userId', values='rating')
```
This will create a matrix where:
- `index='title'` makes movies the index.
- `columns='userId'` makes users the features.
- `values='rating'` fills the matrix with ratings.
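
As a small, self-contained illustration of the call above, the sketch below pivots a four-row toy table (the frame `toy_ratings` and its values are made up, and it uses the notebook's actual column name `userId`). Note that `pivot_table` aggregates duplicate (title, user) pairs with the mean by default, and that unrated cells come out as `NaN` until they are filled.

```python
import pandas as pd

# Toy long-form ratings; values are illustrative only.
toy_ratings = pd.DataFrame({
    "title":  ["Movie A", "Movie B", "Movie A", "Movie C"],
    "userId": [101, 102, 103, 104],
    "rating": [5.0, 4.0, 3.0, 2.0],
})

# Movies become rows, users become columns, ratings become cell values.
toy_matrix = toy_ratings.pivot_table(index="title", columns="userId", values="rating")

# Distance-based models cannot handle NaN, so unrated cells are set to 0 here.
toy_matrix_filled = toy_matrix.fillna(0)
print(toy_matrix_filled)
```

Filling with 0 treats "not rated" the same as "rated zero", which is a simplification; it is a common choice for cosine-based neighbour search but not the only one.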

## 6. Conclusion
A **normal DataFrame** is best for storing raw data, while a **pivot matrix** is useful for **analyzing user-movie interactions**, such as building **recommendation systems**.

By converting the DataFrame into a pivot matrix, we can apply collaborative filtering, similarity measures, and ML models for personalized recommendations.

---

📌 **Use the pivot matrix when working with recommendation systems, user behavior analysis, and data visualization!**



In [46]:
# Pivot Table
movie_feature_df = rating_popular_movie.pivot_table(columns='userId',index='title',values='rating').fillna(0)
movie_feature_df.head(10)

title / userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
10 Things I Hate About You (1999),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0
12 Angry Men (1957),0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2001: A Space Odyssey (1968),0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,5.0,0.0,0.0,5.0,0.0,3.0,0.0,4.5
28 Days Later (2002),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.5,0.0,5.0
300 (2007),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,3.0,0.0,0.0,5.0,0.0,4.0
"40-Year-Old Virgin, The (2005)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.5
A.I. Artificial Intelligence (2001),0.0,0.0,0.0,0.0,0.0,0.0,4.5,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,3.5,0.0,4.5,0.0,3.5
"Abyss, The (1989)",4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,3.0,0.0,4.0
Ace Ventura: Pet Detective (1994),0.0,0.0,0.0,0.0,3.0,3.0,0.0,0.0,0.0,0.0,...,0.0,2.0,0.0,2.0,0.0,0.0,0.0,3.5,0.0,3.0
Ace Ventura: When Nature Calls (1995),0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.0,2.0,0.0,1.0,0.0,2.0,0.0,2.0,0.0,0.0



## `csr_matrix` from `scipy.sparse`

`csr_matrix` from `scipy.sparse` is a **Compressed Sparse Row (CSR) matrix** format used for efficiently storing and processing sparse matrices. It is particularly useful when you have a large matrix with mostly zero values and only a few non-zero entries.

### 🔹 Why Use `csr_matrix`?
- **Compact storage** – Instead of storing all elements (including zeros), it stores only the non-zero values and their positions.
- **Fast row slicing** – Great for row-based operations (e.g., retrieving all ratings for a specific movie in a recommendation system).
- **Memory-efficient** – Memory use grows with the number of non-zero entries rather than with rows × columns, so a mostly-zero matrix takes far less space than its dense equivalent.
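
To make the storage idea concrete, here is a minimal sketch that inspects the three arrays a CSR matrix keeps internally. The toy matrix `dense` is an assumption made up for this illustration; `data`, `indices`, and `indptr` are the attribute names SciPy exposes on `csr_matrix`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small toy matrix with four non-zero entries.
dense = np.array([
    [0, 0, 1],
    [2, 0, 0],
    [0, 3, 4],
])
sparse = csr_matrix(dense)

print(sparse.data)     # non-zero values, stored row by row
print(sparse.indices)  # column index of each stored value
print(sparse.indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]

# Row slicing only needs that row's segment of `data` and `indices`.
last_row = sparse[2].toarray().ravel()
print(last_row)
```

Only the four non-zero values are stored, along with one column index per value and one pointer per row boundary.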

### 🔹 Example Usage
#### Dense Matrix:
```python
import numpy as np
from scipy.sparse import csr_matrix

dense_matrix = np.array([
    [0, 0, 3, 0, 4],
    [0, 0, 5, 7, 0],
    [0, 2, 0, 6, 0]
])

# Convert to a sparse matrix and print its stored (row, column) entries
sparse_matrix = csr_matrix(dense_matrix)
print(sparse_matrix)
```

#### Output:
```
  (0, 2)    3
  (0, 4)    4
  (1, 2)    5
  (1, 3)    7
  (2, 1)    2
  (2, 3)    6
```

In [48]:
from scipy.sparse import csr_matrix
movie_feature_df_matrix = csr_matrix(movie_feature_df.values)
movie_feature_df_matrix

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 41360 stored elements and shape (450, 606)>

In [49]:
from sklearn.neighbors import NearestNeighbors
model = NearestNeighbors(metric='cosine',algorithm='brute')
model.fit(movie_feature_df_matrix)
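
With `metric='cosine'` and `algorithm='brute'`, the model compares the query vector against every row using cosine distance, i.e. one minus cosine similarity. The sketch below reproduces that computation by hand with NumPy on a made-up toy matrix; the helper `cosine_distances_to` and the array `items` are hypothetical names for this illustration, and the code assumes no row is all zeros (which would divide by zero).

```python
import numpy as np

def cosine_distances_to(query, matrix):
    """Cosine distance (1 - cosine similarity) from `query` to every row of `matrix`."""
    dots = matrix @ query
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    return 1.0 - dots / norms

# Toy item-by-user matrix (rows: items, columns: users); values are made up.
items = np.array([
    [5.0, 0.0, 3.0],
    [4.0, 0.0, 4.0],
    [0.0, 2.0, 0.0],
])

dist = cosine_distances_to(items[0], items)
order = np.argsort(dist)  # nearest first; the query itself comes first at distance ~0
print(order, dist[order])
```

Because cosine distance depends only on the angle between rating vectors, two movies rated by the same users in similar proportions end up close even if one has systematically higher scores.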

In [63]:
movie_feature_df.shape

(450, 606)

In [67]:
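# Pick a random movie (row) to use as the query
query_index = np.random.choice(movie_feature_df.shape[0])
print(query_index)
# Ask for 6 neighbours: the nearest one is the query movie itself, leaving 5 recommendations
distances, indice = model.kneighbors(movie_feature_df.iloc[query_index,:].values.reshape(1,-1),n_neighbors=6)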

204


In [68]:
movie_feature_df.head()

title / userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
10 Things I Hate About You (1999),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0
12 Angry Men (1957),0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2001: A Space Odyssey (1968),0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,...,0.0,0.0,5.0,0.0,0.0,5.0,0.0,3.0,0.0,4.5
28 Days Later (2002),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.5,0.0,5.0
300 (2007),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,3.0,0.0,0.0,5.0,0.0,4.0


In [72]:
for i in range(0,len(distances.flatten())):
    if i==0:
        print(f"Recommendations for {movie_feature_df.index[query_index]}:")
    else:
        print(f"{i}:{movie_feature_df.index[indice.flatten()[i]]} with distance {distances.flatten()[i]}")

Recommendations for Hot Fuzz (2007):
1:Shaun of the Dead (2004) with distance 0.31382370004269133
2:Superbad (2007) with distance 0.37973145855538504
3:Zombieland (2009) with distance 0.44040062622041143
4:Casino Royale (2006) with distance 0.4472459073266031
5:Sherlock Holmes (2009) with distance 0.46693049182809276
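
For repeated use, the lookup and printing loop can be wrapped in a function that takes a title instead of a random index. This is a minimal sketch under stated assumptions: `recommend`, `toy_df`, and `toy_model` are hypothetical names introduced here, and the toy matrix stands in for `movie_feature_df`. Only `NearestNeighbors.kneighbors` from scikit-learn is relied upon, as in the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def recommend(title, feature_df, fitted_model, n=5):
    """Return (neighbour_title, cosine_distance) pairs for `title`, nearest first."""
    query = feature_df.loc[[title]].values  # shape (1, n_users)
    # Ask for one extra neighbour because the query is expected to match itself.
    distances, indices = fitted_model.kneighbors(query, n_neighbors=n + 1)
    pairs = zip(feature_df.index[indices.ravel()], distances.ravel())
    # Drop the query title itself and keep at most n results.
    return [(t, float(d)) for t, d in pairs if t != title][:n]

# Toy title-by-user matrix; values are made up for illustration.
toy_df = pd.DataFrame(
    [[5.0, 0.0, 3.0], [4.0, 0.0, 4.0], [0.0, 2.0, 0.0], [1.0, 5.0, 0.0]],
    index=["A", "B", "C", "D"],
    columns=[1, 2, 3],
)
toy_model = NearestNeighbors(metric="cosine", algorithm="brute").fit(toy_df.values)
print(recommend("A", toy_df, toy_model, n=2))
```

In the notebook's setting the call would look like `recommend("Hot Fuzz (2007)", movie_feature_df, model)`, assuming the title is present in the index.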
