# ITCS 6162: Data Mining - Programming Assignment

**In this assignment, you will explore data analysis, recommendation algorithms, and graph-based techniques using the MovieLens dataset. Your tasks will range from basic data exploration to advanced recommendation models, including:**
- Data manipulation with pandas
- User-item collaborative filtering
- Similarity-based recommendation models
- A Pixie-inspired Graph-based recommendation using adjacency lists with weighted random walks (without using NetworkX)


#### **Dataset Files:**
- **`u.data`**: User-movie ratings (`user_id  movie_id  rating  timestamp`)
- **`u.item`**: Movie metadata (`movie_id | title | release_date | video_release_date | IMDb_URL | 19 genre flags`)
- **`u.user`**: User demographics (`user_id | age | gender | occupation | zip_code`)

## **Part 1: Exploring and Cleaning Data**

### Inspecting the Dataset Format

The dataset is not in a traditional CSV format. To examine its structure, use the following shell command to display the first 10 lines of the file:

```sh
!head <file_name>
```

**In the cells below, write the code to read the files.**

In [98]:
# u.data - Shows the first 10 rows of the dataset 
!head u.data

196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
115	265	2	881171488
253	465	5	891628467
305	451	3	886324817
6	86	3	883603013


In [99]:
# u.item - Displays the first 10 movies 
!head u.item 

1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
6|Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)|01-Jan-1995||http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|0|0|0|0
7|Twelve Monkeys (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Twelve%20Monkeys%20(1995)|0|0|0|0|0|0|0|0|1|0|0|0|0|0|0|1|0|0|0
8|Babe (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Babe%20(1995)|0|0|0|0|1

In [100]:
# u.user - Displays the first 10 users
!head u.user 

1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
9|29|M|student|01002
10|53|M|lawyer|90703


#### Loading the Dataset with Pandas

Use **pandas** to load the dataset into a DataFrame for analysis. Follow these steps:  

1. Import the necessary library: `pandas`.  
2. Use `pd.read_csv()` (or an appropriate function) to read the dataset file.  
3. Ensure the dataset is loaded with the correct delimiter (e.g., `','`, `'\t'`, `'|'`, or another separator if needed).  
4. Select and display the first few rows using `.head()`.

Ensure that:  

- The `ratings` dataset is read from `"u.data"` using tab (`'\t'`) as a separator and column names (`"user_id"`, `"movie_id"`, `"rating"` and `"timestamp"`).  
- The `movies` dataset is read from `"u.item"` using `'|'` as a separator, use columns (`0`, `1`, `2`), encoding (`"latin-1"`) and name the columns (`movie_id`, `title`, and `release_date`).  
- The `users` dataset is read from `"u.user"` using `'|'` as a separator, use columns (`0`, `1`, `2`, `3`) and name the columns (`user_id`, `age`, `gender`, and `occupation`).

In [101]:
import pandas as pd 

In [102]:
# Load ratings file (user_id, movie_id, rating, timestamp)
ratings_col = ["user_id", "movie_id", "rating", "timestamp"]
ratings = pd.read_csv("u.data", sep="\t", names=ratings_col)

#Show First few rows 
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [137]:
# Load movie info (movie_id, title, release_date) from u.item
movie_col = ["movie_id", "title", "release_date"]
movies = pd.read_csv("u.item", sep="|", usecols=[0,1,2], names=movie_col, encoding="latin-1")
# Show first few rows
movies.head()

Unnamed: 0,movie_id,title,release_date
0,1,Toy Story (1995),01-Jan-1995
1,2,GoldenEye (1995),01-Jan-1995
2,3,Four Rooms (1995),01-Jan-1995
3,4,Get Shorty (1995),01-Jan-1995
4,5,Copycat (1995),01-Jan-1995


In [138]:
# Load basic user info (user_id, age, gender, occupation)
# Using only the first 4 columns from u.user.
user_col = ["user_id", "age", "gender", "occupation"]
users = pd.read_csv("u.user", sep="|", usecols=[0,1,2,3], names=user_col)

# Show first few rows
users.head()

Unnamed: 0,user_id,age,gender,occupation
0,1,24,M,technician
1,2,53,F,other
2,3,23,M,writer
3,4,24,M,technician
4,5,33,F,other


**Note:** As a **bonus** task, save the `ratings`, `movies`, and `users` DataFrames as `.csv` files. <br>
**Hint:** Use the `to_csv()` function in pandas to save these DataFrames as CSV files.

In [105]:
# ratings (Save ratings Dataframe) 
ratings.to_csv("ratings.csv", index=False)

In [106]:
# movies (Save Movies Dataframe)
movies.to_csv("movies.csv", index=False)

In [107]:
# users (Save users Dataframe)
users.to_csv("users.csv", index=False)

**Display the first 10 rows of each file.**

In [108]:
# ratings - Display first 10 rows 
ratings.head(10) 

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596
5,298,474,4,884182806
6,115,265,2,881171488
7,253,465,5,891628467
8,305,451,3,886324817
9,6,86,3,883603013


In [109]:
# movies - Display the first 10 rows 
movies.head(10)

Unnamed: 0,movie_id,title,release_date
0,1,Toy Story (1995),01-Jan-1995
1,2,GoldenEye (1995),01-Jan-1995
2,3,Four Rooms (1995),01-Jan-1995
3,4,Get Shorty (1995),01-Jan-1995
4,5,Copycat (1995),01-Jan-1995
5,6,Shanghai Triad (Yao a yao yao dao waipo qiao) ...,01-Jan-1995
6,7,Twelve Monkeys (1995),01-Jan-1995
7,8,Babe (1995),01-Jan-1995
8,9,Dead Man Walking (1995),01-Jan-1995
9,10,Richard III (1995),22-Jan-1996


In [110]:
# users - Display the first 10 rows 
users.head(10)

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
6,7,57,M,administrator,91344
7,8,36,M,administrator,5201
8,9,29,M,student,1002
9,10,53,M,lawyer,90703


### Data Cleaning and Exploration with Pandas  

After loading the dataset, it’s important to clean and explore the data to ensure consistency and accuracy. Below are key **pandas** functions for cleaning and understanding the dataset.

#### 1. Handle Missing Values  
- `df.dropna()` – Removes rows with missing values.  
- `df.fillna(value)` – Fills missing values with a specified value.  

#### 2. Remove Duplicates  
- `df.drop_duplicates()` – Drops duplicate rows from the dataset.  

#### 3. Handle Incorrect Data Types  
- `df.astype(dtype)` – Converts columns to the appropriate data type.  

#### 4. Filter Outliers (if applicable)  
- `df[df['column_name'] > threshold]` – Filters rows based on a condition.  

#### 5. Rename Columns (if needed)  
- `df.rename(columns={'old_name': 'new_name'})` – Renames columns for clarity.  

#### 6. Reset Index  
- `df.reset_index(drop=True, inplace=True)` – Resets the index after cleaning.  

### Data Exploration Functions  

To better understand the dataset, use these **pandas** functions:  

- `df.shape` – Returns the number of rows and columns in the dataset.  
- `df.nunique()` – Displays the number of unique values in each column.  
- `df['column_name'].unique()` – Returns unique values in a specific column.  

**Example Usage in Pandas:**  
```python
import pandas as pd

# Load dataset
df = pd.read_csv("your_file.csv")

# Drop missing values
df_cleaned = df.dropna()

# Remove duplicate rows
df_cleaned = df_cleaned.drop_duplicates()

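# Convert 'timestamp' column to datetime format
# (assuming Unix seconds, as in u.data; without unit='s' integers are read as nanoseconds)
df_cleaned['timestamp'] = pd.to_datetime(df_cleaned['timestamp'], unit='s')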

# Display dataset shape
print("Dataset shape:", df_cleaned.shape)

# Display number of unique values in each column
print("Unique values per column:\n", df_cleaned.nunique())

# Display unique movie IDs
print("Unique movie IDs:", df_cleaned['movie_id'].unique()[:10])  # Show first 10 unique movie IDs
```

**Note:** The functions above are among the most widely used **pandas** functions for data cleaning and exploration. Not all of them are needed in the exercises below; use them as the dataset and each task require.

**Convert Timestamps into Readable dates.**

In [111]:
# ratings - Convert timestamp column to readable datetime format
ratings['timestamp'] = pd.to_datetime(ratings['timestamp'], unit='s')

# Display the first 10 rows to verify
ratings.head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp
0,196,242,3,1997-12-04 15:55:49
1,186,302,3,1998-04-04 19:22:22
2,22,377,1,1997-11-07 07:18:36
3,244,51,2,1997-11-27 05:02:03
4,166,346,1,1998-02-02 05:33:16
5,298,474,4,1998-01-07 14:20:06
6,115,265,2,1997-12-03 17:51:28
7,253,465,5,1998-04-03 18:34:27
8,305,451,3,1998-02-01 09:20:17
9,6,86,3,1997-12-31 21:16:53


**Check for Missing Values**

In [112]:
# ratings - Checking for missing values 
ratings.isnull().sum()

user_id      0
movie_id     0
rating       0
timestamp    0
dtype: int64

In [113]:
# movies - Checking for missing values 
movies.isnull().sum()

movie_id        0
title           0
release_date    1
dtype: int64

In [114]:
# users - Checking for missing values 
users.isnull().sum()

user_id       0
age           0
gender        0
occupation    0
zip_code      0
dtype: int64

**Print the total number of users, movies, and ratings.**

In [115]:
# Total number of users, movies, and ratings 

total_users = users['user_id'].nunique()
total_movies = movies['movie_id'].nunique()
total_ratings = ratings.shape[0]

print(f"Total Users: {total_users}")
print(f"Total Movies: {total_movies}")
print(f"Total Ratings: {total_ratings}")

Total Users: 943
Total Movies: 1682
Total Ratings: 100000


## **Part 2: Collaborative Filtering-Based Recommendation**

### **Create a User-Item Matrix**

#### Instructions for Creating a User-Movie Rating Matrix

In this exercise, you will create a user-movie rating matrix using **pandas**. This matrix will represent the ratings that users have given to different movies.

1. **Dataset Overview**:  
   The dataset has already been loaded. It includes the following key columns:
   - `user_id`: The ID of the user.
   - `movie_id`: The ID of the movie.
   - `rating`: The rating the user gave to the movie.

2. **Create the User-Movie Rating Matrix**:  
   Use the **`pivot()`** function in **pandas** to reshape the data. Your goal is to create a matrix where:
   - Each **row** represents a **user**.
   - Each **column** represents a **movie**.
   - Each **cell** contains the **rating** that the user has given to the movie.

   Specify the following parameters for the `pivot()` function:
   - **`index`**: The `user_id` column (this will define the rows).
   - **`columns`**: The `movie_id` column (this will define the columns).
   - **`values`**: The `rating` column (this will fill the matrix with ratings).

3. **Inspect the Matrix**:  
   After creating the matrix, examine the first few rows of the resulting matrix to ensure it has been constructed correctly.

4. **Handle Missing Values**:  
   It's likely that some users have not rated every movie, resulting in `NaN` values in the matrix. You will need to handle these missing values. Consider the following options:
   - **Fill with 0**: If you wish to represent missing ratings as zeros (indicating no rating).
   - **Fill with the average rating**: Alternatively, replace missing values with the average rating for each movie.
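The two fill strategies above can be compared on a small example. This is only a sketch on a made-up matrix: the name `toy_matrix` and its values are assumptions for illustration, not part of the MovieLens data.

```python
import numpy as np
import pandas as pd

# Toy user-movie matrix (hypothetical data): rows are users, columns are movies.
toy_matrix = pd.DataFrame(
    {101: [5.0, np.nan, 3.0], 102: [np.nan, 2.0, 4.0]},
    index=pd.Index([1, 2, 3], name="user_id"),
)

# Option 1: treat a missing rating as 0.
filled_zero = toy_matrix.fillna(0)

# Option 2: replace a missing rating with that movie's mean rating.
# DataFrame.mean() skips NaN by default, and fillna() matches the
# resulting Series to the column labels.
filled_mean = toy_matrix.fillna(toy_matrix.mean())

print(filled_zero)
print(filled_mean)
```

Filling with 0 pulls unrated movies toward "disliked", while filling with the movie mean treats them as "typical"; which is preferable depends on how the matrix is used afterwards.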

**Create the user-movie rating matrix using the `pivot()` function.**

### Building the User–Movie Rating Matrix

To prepare the dataset for collaborative filtering, we first construct a **user–movie
rating matrix**. In this matrix:

- **Rows** represent users  
- **Columns** represent movies  
- **Cells** contain the rating a user gave to a movie (or `NaN` if they never rated it)

This matrix forms the foundation for both **user-based** and **item-based**
collaborative filtering. By organizing the data in this structure, we can
easily compute similarities between users or between movies and generate
personalized recommendations.
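When reshaping, it may help to know how `pivot()` and `pivot_table()` differ if a (user, movie) pair appears more than once. The sketch below uses a made-up `toy_ratings` frame (an assumption for illustration) with one deliberately repeated pair.

```python
import pandas as pd

# Hypothetical long-format ratings; user 1 rated movie 10 twice.
toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 2],
    "movie_id": [10, 10, 20],
    "rating":   [4, 2, 5],
})

# pivot() refuses to reshape when an (index, column) pair repeats.
try:
    toy_ratings.pivot(index="user_id", columns="movie_id", values="rating")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates repeats, using the mean by default.
table = toy_ratings.pivot_table(index="user_id", columns="movie_id", values="rating")

print(pivot_failed)
print(table)
```

If every pair occurs at most once, both functions should produce the same matrix.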

The following code creates this matrix using `pivot_table()`, which gives the same result as `pivot()` when each (user, movie) pair appears at most once:


In [116]:
# Create the user-movie rating matrix.
user_movie_matrix = ratings.pivot_table(
    index="user_id",     # Each row represents a user
    columns="movie_id",  # Each column represents a movie
    values="rating"     # Rating the user gave the movie
)

user_movie_matrix.head() # Display first rows of the matrix 

movie_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,,,,,,,,,,
2,4.0,,,,,,,,,2.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,3.0,,,,,,,,,...,,,,,,,,,,


**Display the matrix to verify the transformation.**

In [117]:
# Replace all missing values (NaN) with 0
user_movie_matrix_filled = user_movie_matrix.fillna(0)

# Show first rows to verify filled matrix
user_movie_matrix_filled.head()  

movie_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,5.0,3.0,4.0,3.0,3.0,5.0,4.0,1.0,5.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### **User-Based Collaborative Filtering Recommender System**

#### **Objective**
In this task, you will implement a **user-based collaborative filtering** movie recommendation system using the **MovieLens dataset**. The goal is to recommend movies to a user based on the preferences of similar users.

##### **Step 1: Import Required Libraries**
Before starting, ensure you have the necessary libraries installed. Use the following imports:

```python
import pandas as pd  # For handling data
import numpy as np   # For numerical computations
from sklearn.metrics.pairwise import cosine_similarity  # For computing user similarity
```

##### **Step 2: Compute User-User Similarity**
- We will use **cosine similarity** to measure how similar each pair of users is based on their movie ratings.
- Since `cosine_similarity` does not handle missing values (NaN), replace them with `0` before computation.

##### **Instructions:**
1. Fill missing values with `0` using `.fillna(0)`.
2. Compute similarity using `cosine_similarity()`.
3. Convert the result into a **Pandas DataFrame**, with users as both row and column labels.

##### **Hint:**  
You can achieve this using the following approach:

```python
user_similarity = cosine_similarity(user_movie_matrix.fillna(0))
user_sim_df = pd.DataFrame(user_similarity, index=user_movie_matrix.index, columns=user_movie_matrix.index)
```

##### **Step 3: Implement the Recommendation Function**
Now, implement the function `recommend_movies_for_user(user_id, num=5)` to recommend movies for a given user.

##### **Function Inputs:**
- `user_id`: The target user for whom we need recommendations.
- `num`: The number of movies to recommend (default is 5).

##### **Function Steps:**
1. Find **similar users**:
   - Retrieve the similarity scores for the given `user_id`.
   - Sort them in **descending** order (highest similarity first).
   - Exclude the user themselves.
   
2. Get the **movie ratings** from these similar users.

3. Compute the **average rating** for each movie based on these users' preferences.

4. Sort the movies in **descending order** based on the computed average ratings.

5. Retrieve the **top `num` recommended movies**.

6. Map **movie IDs** to their **titles** using the `movies` DataFrame.

7. Return the results as a **Pandas DataFrame** with rankings.

##### **Step 4: Return the Final Recommendation List**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

##### **Hint:** Your final DataFrame should be created like this:
```python
result_df = pd.DataFrame({
    'Ranking': range(1, num+1),
    'Movie Name': movie_names     
})
result_df.set_index('Ranking', inplace=True)
```

#### **Example: User-Based Collaborative Filtering**
```python
recommend_movies_for_user(10, num = 5)
```
**Output:**
```
| Ranking | Movie Name                     |
|---------|--------------------------------|
| 1       | In the Company of Men (1997)   |
| 2       | Misérables, Les (1995)         |
| 3       | Thin Blue Line, The (1988)     |
| 4       | Braindead (1992)               |
| 5       | Boys, Les (1997)               |
```


### Computing User–User Similarity

Once we have the user–movie rating matrix, the next step in user-based
collaborative filtering is to measure how similar each pair of users is.

To compute user similarity, we use **cosine similarity**, which measures the
angle between two rating vectors:

- A value close to **1** means two users rate movies very similarly.
- A value close to **0** means their preferences are dissimilar.
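As a small worked example of these two cases, the sketch below computes cosine similarity by hand with NumPy. The three rating vectors and the helper name `cosine` are made up for illustration, with 0 standing in for an unrated movie.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical rating vectors over the same four movies (0 = not rated).
alice = np.array([5.0, 3.0, 0.0, 1.0])
bob = np.array([4.0, 3.0, 0.0, 1.0])    # rates much like alice
carol = np.array([0.0, 0.0, 5.0, 0.0])  # shares no rated movie with alice

print(round(cosine(alice, bob), 3))    # close to 1
print(round(cosine(alice, carol), 3))  # exactly 0: no overlap at all
```

Because ratings are never negative, the similarity for rating vectors stays between 0 and 1.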

Because cosine similarity requires numerical vectors, we temporarily replace
`NaN` values (missing ratings) with **0** for the similarity calculation only.
The original matrix remains unchanged for prediction later.

The code below:

1. Computes cosine similarity for every pair of users.  
2. Converts the result into a pandas DataFrame with user IDs as both rows and columns.  
3. Displays the first few rows of the similarity matrix.

In [118]:
# Importing Libraries 
import pandas as pd  # For loading and manipulating data tables
import numpy as np   # For numerical operations 
from sklearn.metrics.pairwise import cosine_similarity  # For measuring similarity between users

# Compute the user–user similarity matrix using cosine similarity.
# We fill missing ratings with 0 temporarily because cosine similarity
# requires numeric values and cannot process NaN values.
user_similarity = cosine_similarity(user_movie_matrix.fillna(0))

# Convert the similarity matrix into a DataFrame 
user_sim_df = pd.DataFrame(
    user_similarity,
    index=user_movie_matrix.index,   # user_id as rows labels
    columns=user_movie_matrix.index  # user_id as columns labels
)
# Display the first few rows to verify the structure
user_sim_df.head()

user_id,1,2,3,4,5,6,7,8,9,10,...,934,935,936,937,938,939,940,941,942,943
user_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.166931,0.04746,0.064358,0.378475,0.430239,0.440367,0.319072,0.078138,0.376544,...,0.369527,0.119482,0.274876,0.189705,0.197326,0.118095,0.314072,0.148617,0.179508,0.398175
2,0.166931,1.0,0.110591,0.178121,0.072979,0.245843,0.107328,0.103344,0.161048,0.159862,...,0.156986,0.307942,0.358789,0.424046,0.319889,0.228583,0.22679,0.161485,0.172268,0.105798
3,0.04746,0.110591,1.0,0.344151,0.021245,0.072415,0.066137,0.08306,0.06104,0.065151,...,0.031875,0.042753,0.163829,0.069038,0.124245,0.026271,0.16189,0.101243,0.133416,0.026556
4,0.064358,0.178121,0.344151,1.0,0.031804,0.068044,0.09123,0.18806,0.101284,0.060859,...,0.052107,0.036784,0.133115,0.193471,0.146058,0.030138,0.196858,0.152041,0.170086,0.058752
5,0.378475,0.072979,0.021245,0.031804,1.0,0.237286,0.3736,0.24893,0.056847,0.201427,...,0.338794,0.08058,0.094924,0.079779,0.148607,0.071459,0.239955,0.139595,0.152497,0.313941


### User-Based Collaborative Filtering: Generating Recommendations

With the user–user similarity matrix computed, we can now generate movie
recommendations for any target user.

This function takes a `user_id` and recommends the top movies they have not yet rated,
based on the weighted opinions of their most similar users.

#### How the Recommendation Algorithm Works

1. **Retrieve similarity scores for the target user**
   - We extract the row corresponding to `user_id` from the user–similarity matrix.
   - The target user is removed from their own similarity list.
   - Remaining users are sorted by similarity in descending order.

2. **Collect movie ratings from similar users**
   - We gather all movie ratings from the ranked list of similar users.

3. **Compute similarity-weighted movie scores**
   - Each movie receives a score using a weighted average:
      sum(sim[user] * rating[user, movie]) / sum(sim)
   - This gives more influence to users who are more similar to the target user.

4. **Remove movies the user has already rated**
   - We only want to recommend **new** movies.
   - Any movie the user has already rated is filtered out.

5. **Rank the remaining movies**
   - Movies are sorted from highest predicted rating to lowest.
   - The top `num` movies are selected as recommendations.

6. **Map movie IDs to movie titles**
   - Using the `movies` dataframe, we translate movie IDs into human-readable titles.

7. **Return a clean results table**
   - Final output is a DataFrame listing:
     - Ranking (1 → best)
     - Movie Name  
   - This makes it easy to display recommendations in the notebook.
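Steps 3 to 5 can be traced on a tiny example. This is a sketch under stated assumptions: the similarities, neighbour ratings, and the `already_rated` list are invented, and unrated entries are treated as 0 so that the score follows the formula above literally.

```python
import numpy as np
import pandas as pd

# Hypothetical similarities of two neighbours to the target user.
sim = pd.Series({2: 0.9, 3: 0.1})

# Hypothetical neighbour ratings; NaN means "not rated".
neighbour_ratings = pd.DataFrame(
    {"A": [5.0, 1.0], "B": [np.nan, 4.0], "C": [2.0, 5.0]},
    index=[2, 3],
)

# Movies the target user has already rated (assumed for this example).
already_rated = ["C"]

# Similarity-weighted score: sum(sim * rating) / sum(sim).
# NaN is replaced by 0 first; otherwise a single missing rating
# would turn that movie's whole score into NaN.
scores = neighbour_ratings.fillna(0).T.dot(sim) / sim.sum()

# Drop rated movies, then rank the rest from highest score to lowest.
ranked = scores.drop(index=already_rated).sort_values(ascending=False)
print(ranked)
```

Movie A ranks first because the far more similar neighbour (user 2) rated it highly; movie B is held back by that neighbour's missing rating, which counts as 0 under this scoring.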

#### Purpose of This Function

This function provides **personalized** recommendations based solely on user similarity.

In [119]:
def recommend_movies_for_user(user_id, num=5):
    """
    Recommend movies for a given user using user-based collaborative filtering
    Returns a DataFrame with columns: Ranking, Movie Name
    """

    # 1 - Get similarity scores for given user and sort in descending order
    sim_scores = user_sim_df.loc[user_id]                # Get similarity scores for the target user 
    sim_scores = sim_scores.drop(index=user_id)          # Remove the target user from their own similarity 
    sim_scores = sim_scores.sort_values(ascending=False) # Sort all other users by similarity score in descending order

    # 2 - Get movie ratings from these similar users
    similar_users = sim_scores.index
    similar_ratings = user_movie_matrix.loc[similar_users]

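    # 3 - Compute a similarity-weighted average rating for each movie
    # Formula:
    # weighted_avg(movie) = sum(sim[user] * rating[user, movie]) / sum(sim)
    # Unrated entries are filled with 0 first: DataFrame.dot() propagates NaN,
    # which would otherwise turn almost every movie's score into NaN.
    weighted_avg_ratings = similar_ratings.fillna(0).T.dot(sim_scores) / sim_scores.sum()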

    # 4 - Remove movies the target user has already rated
    user_ratings = user_movie_matrix.loc[user_id]
    unrated_mask = user_ratings.isna()      
    candidate_scores = weighted_avg_ratings[unrated_mask]

    # 5 - Select the top recommended movies 
    top_movies = candidate_scores.sort_values(ascending=False).head(num)

    # 6 - Convert movie IDs to movie titles 
    movie_titles_lookup = movies.set_index('movie_id')['title']
    movie_names = movie_titles_lookup.loc[top_movies.index].values

    # 7 - Build the final results Dataframe
    result_df = pd.DataFrame({
        'Ranking': range(1, len(top_movies) + 1),
        'Movie Name': movie_names
    })
    result_df.set_index('Ranking', inplace=True)

    return result_df

### Testing the User-Based Recommendation Function

To verify that our user-based collaborative filtering implementation works as
intended, we can call the `recommend_movies_for_user()` function for a specific
user.  

In the example below, we request **5 movie recommendations for user 10**:

- The function looks up user 10’s similarity scores to all other users.
- It aggregates ratings from the most similar users using a similarity-weighted average.
- It filters out movies that user 10 has already rated.
- It returns the **top 5 unseen movies** with the highest predicted scores.

In [120]:
# Test the User-Based Collaborative Filtering function
recommend_movies_for_user(10, num=5)

Unnamed: 0_level_0,Movie Name
Ranking,Unnamed: 1_level_1
1,GoldenEye (1995)
2,Four Rooms (1995)
3,Copycat (1995)
4,Shanghai Triad (Yao a yao yao dao waipo qiao) ...
5,Babe (1995)


### **Item-Based Collaborative Filtering Recommender System**

#### **Objective**
In this task, you will implement an **item-based collaborative filtering** recommendation system using the **MovieLens dataset**. The goal is to recommend movies similar to a given movie based on user rating patterns.

#### **Step 1: Import Required Libraries**
These imports were already covered in the previous task; they are repeated here for completeness.

Before starting, ensure you have the necessary libraries installed. Use the following imports:

```python
import pandas as pd  # For handling data
import numpy as np   # For numerical computations
from sklearn.metrics.pairwise import cosine_similarity  # For computing item similarity
```

#### **Step 2: Compute Item-Item Similarity**
- We will use **cosine similarity** to measure how similar each pair of movies is based on their user ratings.
- Since `cosine_similarity` does not handle missing values (NaN), replace them with `0` before computation.
- Unlike user-based filtering, we need to **transpose** (`.T`) the `user_movie_matrix` because we want similarity between movies (columns) instead of users (rows).

##### **Instructions:**
1. Transpose the user-movie matrix using `.T` to make movies the rows.
2. Fill missing values with `0` using `.fillna(0)`.
3. Compute similarity using `cosine_similarity()`.
4. Convert the result into a **Pandas DataFrame**, with movies as both row and column labels.

##### **Hint:**  
You can achieve this using the following approach:

```python
item_similarity = cosine_similarity(user_movie_matrix.T.fillna(0))
item_sim_df = pd.DataFrame(item_similarity, index=user_movie_matrix.columns, columns=user_movie_matrix.columns)
```

#### **Step 3: Implement the Recommendation Function**
Now, implement the function `recommend_movies(movie_name, num=5)` to recommend movies similar to a given movie.

##### **Function Inputs:**
- `movie_name`: The target movie for which we need recommendations.
- `num`: The number of similar movies to recommend (default is 5).

##### **Function Steps:**
1. Find the **movie_id** corresponding to the given `movie_name` in the `movies` DataFrame.
2. If the movie is not found, return an appropriate message.
3. Extract the **similarity scores** for this movie from `item_sim_df`.
4. Sort the movies in **descending order** based on similarity (excluding the movie itself).
5. Retrieve the **top `num` similar movies**.
6. Map **movie IDs** to their **titles** using the `movies` DataFrame.
7. Return the results as a **Pandas DataFrame** with rankings.

#### **Step 4: Return the Final Recommendation List**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

##### **Hint:** Your final DataFrame should be created like this:
```python
result_df = pd.DataFrame({
    'ranking': range(1, num+1),
    'movie_name': movie_names
})
result_df.set_index('ranking', inplace=True)
```

#### **Example: Item-Based Collaborative Filtering**
```python
recommend_movies("Jurassic Park (1993)", num=5)
```
**Output:**
```
| Ranking | Movie Name                               |
|---------|------------------------------------------|
| 1       | Top Gun (1986)                           |
| 2       | Empire Strikes Back, The (1980)          |
| 3       | Raiders of the Lost Ark (1981)           |
| 4       | Indiana Jones and the Last Crusade (1989)|
| 5       | Speed (1994)                             |
```


### Computing Item–Item Similarity for Item-Based Collaborative Filtering

To implement item-based collaborative filtering, we need to measure how similar each
movie is to every other movie. Movies that receive similar ratings from the same users
are considered similar, and these similarities drive the recommendations.

This section computes the **movie–movie similarity matrix**.

#### Steps Performed

1. **Transpose the user–movie rating matrix**
   - The original matrix has:
     - Rows = users  
     - Columns = movies  
   - For item-based CF, we flip this so that:
     - Rows = movies  
     - Columns = users  
   - Each row now represents a movie’s rating vector across all users.

2. **Handle missing values**
   - Movies are not rated by every user, so the matrix contains many `NaN` values.
   - Cosine similarity requires numerical values, so we temporarily replace `NaN` with **0**.
   - This does not change the original data—only the similarity input.

3. **Compute cosine similarity between all movie pairs**
   - Cosine similarity measures the angle between two rating vectors.
   - Values close to **1** indicate highly similar movies.
   - Values near **0** indicate unrelated rating patterns.

4. **Store results in a DataFrame**
   - Rows and columns are labeled with `movie_id`.
   - This creates the item–item similarity matrix, where:
     - `item_sim_df.loc[A, B]` = similarity between movie A and movie B.

5. **Preview the matrix**
   - Displaying the first few rows helps verify that the structure is correct.

This similarity matrix is used in the next step to recommend the movies most
similar to a given title.


In [121]:
import pandas as pd  # For loading and manipulating data
import numpy as np   # For numerical operations
from sklearn.metrics.pairwise import cosine_similarity  # For computing item similarity

# Transpose user_movie_matrix so that rows = movies, columns = users
movie_user_matrix = user_movie_matrix.T  

# Fill missing values with 0 
# Cosine similarity cannot process NaN values, so we replace them with 0.
movie_user_matrix_filled = movie_user_matrix.fillna(0)

# Compute movie–movie similarity using cosine similarity 
item_similarity = cosine_similarity(movie_user_matrix_filled) 

# Put into Dataframe with movie_ids as index/columns 
item_sim_df = pd.DataFrame(
    item_similarity,
    index=user_movie_matrix.columns,   # movie_ids as rows
    columns=user_movie_matrix.columns  # movie_ids as columns
)
# Show First few rows 
item_sim_df.head()

movie_id,1,2,3,4,5,6,7,8,9,10,...,1673,1674,1675,1676,1677,1678,1679,1680,1681,1682
movie_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1.0,0.402382,0.330245,0.454938,0.286714,0.116344,0.620979,0.481114,0.496288,0.273935,...,0.035387,0.0,0.0,0.0,0.035387,0.0,0.0,0.0,0.047183,0.047183
2,0.402382,1.0,0.273069,0.502571,0.318836,0.083563,0.383403,0.337002,0.255252,0.171082,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.078299,0.078299
3,0.330245,0.273069,1.0,0.324866,0.212957,0.106722,0.372921,0.200794,0.273669,0.158104,...,0.0,0.0,0.0,0.0,0.032292,0.0,0.0,0.0,0.0,0.096875
4,0.454938,0.502571,0.324866,1.0,0.334239,0.090308,0.489283,0.490236,0.419044,0.252561,...,0.0,0.0,0.094022,0.094022,0.037609,0.0,0.0,0.0,0.056413,0.075218
5,0.286714,0.318836,0.212957,0.334239,1.0,0.037299,0.334769,0.259161,0.272448,0.055453,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.094211


### Item-Based Collaborative Filtering: Generating Similar Movie Recommendations

The function below takes a movie title and returns the top similar movies based on
cosine similarity.

#### How the Algorithm Works

1. **Look up the movie ID from the title**
   - The input title must match exactly what appears in the `movies` DataFrame.
   - If no matching movie is found, the function returns an empty result.

2. **Retrieve the similarity scores for the movie**
   - Using `item_sim_df`, we extract the row corresponding to this movie.
   - Each value represents the similarity between the target movie and another movie.

3. **Sort similarity scores**
   - The scores are sorted in descending order.
   - The movie is removed from its own similarity list to avoid recommending itself.

4. **Select the top-N similar movies**
   - We keep the highest-scoring `num` movies.

5. **Convert movie IDs back to titles**
   - Using a lookup table from the `movies` DataFrame, we translate IDs into readable titles.

6. **Build and return the recommendation table**
   - The result is a DataFrame with:
     - `ranking` (1 = most similar)
     - `movie_name` (title of the recommended movie)

#### When This Method Works Well

- It excels when a user has rated several movies: the method can then find movies most similar to those favorites.
- It tends to produce closely related recommendations (for example, sequels or films in the same genre), although similarity here reflects shared rating patterns rather than movie content.

In [122]:
def recommend_movies(movie_name, num=5):
    """
    Recommend movies similar to a given movie using item-based collaborative filtering.
    movie_name: title string exactly as in the 'movies' DataFrame
    num: number of similar movies to return
    """
    # 1 - Find the movie_id corresponding to movie_name
    movie_row = movies[movies['title'] == movie_name]

    # If the movie is not found in the dataset, return an empty result
    if movie_row.empty:
        print(f"Movie '{movie_name}' not found in dataset.")
        return pd.DataFrame(columns=['ranking', 'movie_name'])

    movie_id = movie_row.iloc[0]['movie_id']

    # 2 - Get similarity scores for this movie, one row of the similarity matrix 
    sim_scores = item_sim_df.loc[movie_id]

    # 3 - Sort in descending order of similarity and exclude the movie itself
    sim_scores = sim_scores.drop(index=movie_id)
    sim_scores = sim_scores.sort_values(ascending=False)

    # 4 - Take the top num similar movies
    top_matches = sim_scores.head(num)

    # 5 -  Convert movie_ids to movie titles
    movie_titles_lookup = movies.set_index('movie_id')['title']
    movie_names = movie_titles_lookup.loc[top_matches.index].values

    # 6 - Build the result DataFrame with rankings
    result_df = pd.DataFrame({
        'ranking': range(1, len(movie_names) + 1),
        'movie_name': movie_names
    })
    result_df.set_index('ranking', inplace=True)

    return result_df
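
Since the lookup above requires an exact title match, a small helper that suggests close titles can make the "not found" case easier to recover from. The following is only a sketch using the standard-library `difflib`; the helper name `suggest_titles` and the sample titles list are assumptions, not part of the notebook.

```python
import difflib

def suggest_titles(query, titles, n=3, cutoff=0.6):
    """Return up to n titles that closely match query, ignoring case."""
    lowered = {t.lower(): t for t in titles}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=n, cutoff=cutoff)
    return [lowered[h] for h in hits]

# Hypothetical subset of titles; in the notebook this could be movies['title'].
known_titles = ["Toy Story (1995)", "GoldenEye (1995)", "Four Rooms (1995)"]
suggestions = suggest_titles("toy story", known_titles)
```

A caller could print these suggestions before returning the empty result, so that a missing release year or a different capitalization does not end in a dead end.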

### Testing the Item-Based Recommendation Function

To verify that our item-based collaborative filtering implementation is working
correctly, we can test it on a known movie. In the example below, we request  
the **top 5 movies most similar to _Toy Story (1995)_**.

This test checks that:

- The function correctly identifies the movie in the dataset.
- The similarity scores are retrieved and ranked properly.
- The recommended movies are reasonable and reflect meaningful item–item similarity.
- The function outputs titles (not just IDs) in a clear ranked format.

Running the cell below will return the top 5 movies whose rating patterns most
closely resemble those of *Toy Story (1995)*.


In [123]:
# Testing Item-Based Collaborative Filtering
recommend_movies("Toy Story (1995)", num=5)

ranking,movie_name
1,Star Wars (1977)
2,Return of the Jedi (1983)
3,Independence Day (ID4) (1996)
4,"Rock, The (1996)"
5,Mission: Impossible (1996)


## **Part 3: Graph-Based Recommender (Pixie-Inspired Algorithm)**

### **Adjacency List**

#### **Objective**
In this task, you will preprocess the MovieLens dataset and construct a **graph representation** where:
- **Users** are connected to the movies they have rated.
- **Movies** are connected to users who have rated them.
  
This graph structure will help in exploring **user-movie relationships** for recommendations.

#### **Step 1: Merge Ratings with Movie Titles**
Since we have **movie IDs** in the ratings dataset but need human-readable movie titles, we will:
1. Merge the `ratings` DataFrame with the `movies` DataFrame using the `'movie_id'` column.
2. This allows each rating to be associated with a **movie title**.

#### **Hint:**
Use the following Pandas operation to merge:
```python
ratings = ratings.merge(movies, on='movie_id')
```


#### **Step 2: Aggregate Ratings**
Since a user may rate the same movie more than once, we:
1. Group the dataset by `['user_id', 'movie_id', 'title']`.
2. Compute the **mean rating** for each movie by each user.
3. Reset the index to ensure we maintain a clean DataFrame structure.

#### **Hint:**  
Use `groupby()` and `mean()` as follows:
```python
ratings = ratings.groupby(['user_id', 'movie_id', 'title'])['rating'].mean().reset_index()
```

#### **Step 3: Normalize Ratings**
Since different users have different rating biases, we normalize ratings by:
1. **Computing each user's mean rating**.
2. **Subtracting the mean rating** from each individual rating.

#### **Instructions:**
- Use `groupby('user_id')` to group ratings by users.
- Apply `transform(lambda x: x - x.mean())` to adjust ratings.

#### **Hint:**  
Normalize ratings using:
```python
ratings['rating'] = ratings.groupby('user_id')['rating'].transform(lambda x: x - x.mean())
```
This ensures each user’s ratings are centered around zero, making similarity calculations fairer.

#### **Step 4: Construct the Graph Representation**
We represent the user-movie interactions as an **undirected graph** using an **adjacency list**:
- Each **user** is a node connected to movies they rated.
- Each **movie** is a node connected to users who rated it.

#### **Graph Construction Steps:**
1. Initialize an empty dictionary `graph = {}`.
2. Iterate through the **ratings dataset**.
3. For each `user_id` and `movie_id` pair:
   - Add the movie to the user’s set of connections.
   - Add the user to the movie’s set of connections.

#### **Hint:**  
The following code builds the graph:

```python
graph = {}
for _, row in ratings.iterrows():
    user, movie = row['user_id'], row['movie_id']
    if user not in graph:
        graph[user] = set()
    if movie not in graph:
        graph[movie] = set()
    graph[user].add(movie)
    graph[movie].add(user)
```

This results in a **bipartite graph**, where:
- **Users** are connected to multiple movies.
- **Movies** are connected to multiple users.

#### **Step 5: Understanding the Graph**
- **Nodes** in the graph represent **users and movies**.
- **Edges** exist between a user and a movie **if the user has rated the movie**.
- This structure allows us to find **users with similar movie tastes** and **movies frequently watched together**.

#### **Exploring the Graph**
- **Find a user’s rated movies:**  
  ```python
  user_id = 1
  print(graph[user_id])  # Movies rated by user 1
  ```

- **Find users who rated a movie:**  
  ```python
  movie_id = 50
  print(graph[movie_id])  # Users who rated movie 50
  ```

### Step 1 — Merging Ratings with Movie Titles

Before building the graph for the random-walk (Pixie-inspired) recommendation model,
we first enrich the ratings dataset by adding the corresponding movie titles.

The original `ratings` DataFrame contains only:

- `user_id`
- `movie_id`
- `rating`
- `timestamp`

To make analysis and debugging easier, we merge the `ratings` DataFrame with the
`movies` DataFrame so that each rating now includes the movie title as well.

This merge allows us to:

- Quickly see which movies users rated
- Build a graph using readable movie names 
- Debug or inspect specific ratings

In [124]:
"""
Step 1 - Merge ratings with movie titles.
"""
# Merge the ratings table with the movies table using movie_id 
ratings = ratings.merge(
    movies[['movie_id', 'title']], # Only bring in movie_id and title
    on='movie_id'
)
# Display the first few rows of the updated ratings DataFrame
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp,title
0,196,242,3,1997-12-04 15:55:49,Kolya (1996)
1,186,302,3,1998-04-04 19:22:22,L.A. Confidential (1997)
2,22,377,1,1997-11-07 07:18:36,Heavyweights (1994)
3,244,51,2,1997-11-27 05:02:03,Legends of the Fall (1994)
4,166,346,1,1998-02-02 05:33:16,Jackie Brown (1997)
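
An inner merge like the one above silently drops ratings whose `movie_id` has no match in `movies`, and it duplicates rows if a `movie_id` appears more than once there. The sketch below, on made-up frames named `toy_ratings` and `toy_movies`, shows one possible guard: pandas' `validate` argument for the uniqueness assumption and a left join plus row-count check for dropped rows.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "user_id": [1, 1, 2],
    "movie_id": [10, 20, 10],
    "rating": [5, 3, 4],
})
toy_movies = pd.DataFrame({
    "movie_id": [10, 20],
    "title": ["Movie A", "Movie B"],
})

# validate="many_to_one" raises a MergeError if movie_id is not unique in toy_movies.
merged = toy_ratings.merge(toy_movies, on="movie_id", how="left", validate="many_to_one")

# With how="left" no rating is dropped; unmatched movies would appear as NaN titles.
rows_preserved = len(merged) == len(toy_ratings)
missing_titles = int(merged["title"].isna().sum())
```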


### Step 2 — Aggregating Ratings (Handling Duplicate User–Movie Entries)

Before building the graph structure, we need to ensure that each user–movie
interaction appears only once in the dataset.

1. **Group by**  
   - `user_id`  
   - `movie_id`  
   - `title`

2. **Compute the mean rating** for any duplicate entries  
   - This gives a single representative rating for each user–movie pair.

3. **Reset the index** to return the result as a clean DataFrame.

This ensures the graph will contain one edge per user–movie interaction, which is important for consistent random-walk behavior and cleaner graph analysis.

The code below performs this aggregation and displays the first few rows of the
cleaned dataset.


In [125]:
"""
Step 2 - Aggregate ratings mean per user-movie.
"""
# Group by user, movie, and title, then compute the average rating
ratings = (ratings.groupby(['user_id', 'movie_id', 'title'])['rating']
                 .mean()             # Take the mean rating for duplicates
                 .reset_index()     # Convert grouped data back to a normal DataFrame
          )
# Show the result to verify the cleaned structure
ratings.head()   

Unnamed: 0,user_id,movie_id,title,rating
0,1,1,Toy Story (1995),5.0
1,1,2,GoldenEye (1995),3.0
2,1,3,Four Rooms (1995),4.0
3,1,4,Get Shorty (1995),3.0
4,1,5,Copycat (1995),3.0
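
To see what the aggregation does when duplicates are actually present, the sketch below applies the same `groupby` to a made-up frame (`toy`) in which user 1 rated movie 10 twice. The data and variable names are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 2],
    "movie_id": [10, 10, 20, 10],
    "title":    ["Movie A", "Movie A", "Movie B", "Movie A"],
    "rating":   [4, 2, 5, 3],
})

# Count repeated (user, movie) pairs before aggregating.
n_duplicates = int(toy.duplicated(subset=["user_id", "movie_id"]).sum())

# Same aggregation as in the cell above, applied to the toy frame.
collapsed = (toy.groupby(["user_id", "movie_id", "title"])["rating"]
                .mean()
                .reset_index())
```

Here the two ratings of 4 and 2 collapse into a single row with a mean of 3.0, leaving one row per user–movie pair.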


### Step 3 — Normalizing Ratings Per User

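Before constructing the graph for the random-walk recommender, it is often helpful to
**normalize each user’s ratings**. Different users have different rating biases: some
rate generously, while others rarely give high scores.

Without normalization, these differences can distort the relationships in the graph.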
To address this, we apply **mean-centering per user**, which subtracts a user’s
average rating from all of their ratings.

#### What This Code Does

1. Groups ratings by `user_id`.
2. Computes each user’s average rating.
3. Subtracts that average from all of their movie ratings.
4. Stores the normalized ratings back into the `ratings` DataFrame.

The cell below displays the updated, normalized ratings:

In [126]:
"""
Step 3 - Normalize ratings per user.
"""
# For each user, subtract their mean rating from each of their ratings
ratings['rating'] = (ratings.groupby('user_id')['rating'] 
                           .transform(lambda x: x - x.mean())
                    )
# Display the normalized ratings
ratings.head()

Unnamed: 0,user_id,movie_id,title,rating
0,1,1,Toy Story (1995),1.389706
1,1,2,GoldenEye (1995),-0.610294
2,1,3,Four Rooms (1995),0.389706
3,1,4,Get Shorty (1995),-0.610294
4,1,5,Copycat (1995),-0.610294
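
A quick way to confirm that mean-centering behaved as intended is to check that each user's centered ratings average to zero. The sketch below does this on a made-up frame (`toy`); it also uses `transform("mean")` rather than a lambda, which is an equivalent formulation and not the notebook's exact code.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "rating":  [5.0, 4.0, 3.0, 2.0, 1.0],
})

# Subtract each user's mean rating from that user's individual ratings.
toy["centered"] = toy["rating"] - toy.groupby("user_id")["rating"].transform("mean")

# After centering, every user's ratings should average to (approximately) zero.
per_user_mean = toy.groupby("user_id")["centered"].mean()
```

Note that centered ratings can be negative, so they would need to be shifted or clipped before being used as sampling weights in a weighted random walk.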


### Step 4 — Building the Bipartite Graph (Adjacency List)

To apply a random-walk–based recommendation approach, we represent the dataset as a
**bipartite graph** connecting users and movies. In this graph:

- Each `user_id` is treated as a user node.  
- Each `movie_id` is treated as a movie node.  
- An undirected edge is added between a user and a movie when the user has rated that movie.

This representation captures the interaction structure of the data and allows the random
walk to traverse from movies to users and back to other movies. The graph is stored as an
**adjacency list**, implemented as a dictionary in which each key corresponds to a node and
its value is the set of neighboring nodes.

The code below iterates through the cleaned and normalized ratings and constructs the
adjacency lists for both user and movie nodes.

In [127]:
"""
Step 4 - Build the Graph Representation (Adjacency List) 
"""
graph = {}   # Initialize the empty adjacency list

# Loop through each row in the ratings DataFrame
for _, row in ratings.iterrows():
    user = row['user_id']
    movie = row['movie_id']

    # If this user node doesn't exist yet, create it
    if user not in graph:
        graph[user] = set()

    # If this movie node doesn't exist yet, create it
    if movie not in graph:
        graph[movie] = set()

    # Add connections both ways
    graph[user].add(movie)   # user → movie
    graph[movie].add(user)   # movie → user

# Check to look at a few entries
list(graph.items())[:5]

[(1,
  {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ..., 136, 137, 138, ...
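
One caveat about this representation: in this dataset both user IDs and movie IDs are integers starting at 1, so using the raw IDs as dictionary keys places user 1 and movie 1 in the same entry. A possible way to keep the two node types apart, sketched here on made-up data with hypothetical names (`build_graph` and the `"u"`/`"m"` tags), is to use tagged tuples as keys.

```python
from collections import defaultdict

def build_graph(pairs):
    """Build an undirected bipartite adjacency list from (user_id, movie_id) pairs."""
    graph = defaultdict(set)
    for user_id, movie_id in pairs:
        user_node = ("u", user_id)    # tag keeps user 1 distinct from movie 1
        movie_node = ("m", movie_id)
        graph[user_node].add(movie_node)
        graph[movie_node].add(user_node)
    return dict(graph)

# Toy interactions: user 1 rated movies 1 and 2, user 2 rated movie 1.
toy_pairs = [(1, 1), (1, 2), (2, 1)]
toy_graph = build_graph(toy_pairs)

movies_of_user_1 = toy_graph[("u", 1)]
users_of_movie_1 = toy_graph[("m", 1)]
```

With tagged keys, a node's type can be read from the key itself (`node[0] == "m"`), so no separate set of movie IDs is needed to tell movies from users.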

### Step 5 — Inspecting the Graph Structure

We perform two basic checks on the bipartite user–movie graph:

1. **Movies rated by a specific user**
   - Given a user node, list all connected movie nodes.

2. **Users who rated a specific movie**
   - Given a movie node, list all connected user nodes.

These checks help verify that edges were added to the graph correctly and that the
bipartite structure is preserved.
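
A further check, sketched here on toy adjacency lists, is that every edge is stored in both directions, as an undirected graph requires. The helper name `is_symmetric` and the string node labels are assumptions for illustration.

```python
def is_symmetric(graph):
    """Return True if every edge u -> v also appears as v -> u."""
    return all(
        node in graph.get(neighbor, set())
        for node, neighbors in graph.items()
        for neighbor in neighbors
    )

# One well-formed graph and one with a missing reverse edge.
good = {"user_1": {"movie_a"}, "movie_a": {"user_1"}}
broken = {"user_1": {"movie_a"}, "movie_a": set()}

good_ok = is_symmetric(good)
broken_ok = is_symmetric(broken)
```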

In [128]:
"""
Step 5 - Explore/Understand the graph 
"""
# Movies rated by a specific user 
user_id = 1
print("Movies rated by user", user_id, ":", graph[user_id])

# Users who rated a specific movie 
movie_id = 50
print("Users who rated movie", movie_id, ":", graph[movie_id])

Movies rated by user 1 : {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 21

### **Implement Weighted Random Walks**

#### **Random Walk-Based Movie Recommendation System (Weighted Pixie)**

#### **Objective**
In this task, you will implement a **random-walk-based recommendation algorithm** using the **Weighted Pixie** method. This technique uses a **user-movie bipartite graph** to recommend movies by simulating a random walk from a given user or movie.

#### **Step 1: Import Required Libraries**
Make sure you have the necessary libraries:

```python
import random  # For random walks
import pandas as pd  # For handling data
```

#### **Step 2: Implement the Random Walk Algorithm**
Your task is to **simulate a random walk** from a given starting point in the **bipartite user-movie graph**.

##### **Hints for Implementation**
- Start from **either a user or a movie**.
- At each step, **randomly move** to a connected node.
- Keep track of **how many times each movie is visited**.
- After completing the walk, **rank movies by visit count**.

#### **Step 3: Implement User-Based Recommendation**
**Hints:**
- Check if the `user_id` exists in the `graph`.
- Start a loop that runs for `walk_length` steps.
- Randomly pick a **connected node** (user or movie).
- Track how many times each **movie** is visited.
- Sort movies by visit frequency and return the **top N**.

#### **Step 4: Implement Movie-Based Recommendation**
**Hints:**
- Find the `movie_id` corresponding to the given `movie_name`.
- Ensure the movie exists in the `graph`.
- Start a random walk from that movie.
- Follow the same **tracking and ranking** process as the user-based version.

**Note:**  
**Your task:** Implement a function `weighted_pixie_recommend(user_id, walk_length=15, num=5)` or `weighted_pixie_recommend(movie_name, walk_length=15, num=5)`.  
**Implement either Step 3 or Step 4.**

#### **Step 5: Running Your Recommendation System**
Once your function is implemented, test it by calling:

##### **Example: User-Based Recommendation**
```python
weighted_pixie_recommend(1, walk_length=15, num=5)
```
| Ranking | Movie Name                     |
|---------|--------------------------------|
| 1       | My Own Private Idaho (1991)   |
| 2       | Aladdin (1992)                |
| 3       | 12 Angry Men (1957)           |
| 4       | Happy Gilmore (1996)          |
| 5       | Copycat (1995)                |


##### **Example: Movie-Based Recommendation**
```python
weighted_pixie_recommend("Jurassic Park (1993)", walk_length=10, num=5)
```
| Ranking | Movie Name                           |
|---------|-------------------------------------|
| 1       | Rear Window (1954)                 |
| 2       | Great Dictator, The (1940)         |
| 3       | Field of Dreams (1989)             |
| 4       | Casablanca (1942)                  |
| 5       | Nightmare Before Christmas, The (1993) |


#### **Step 6: Understanding the Results**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

Each movie is ranked based on **how frequently it was visited** during the walk.

#### **Experiment with Different Parameters**
- Try different **`walk_length`** values and observe how it changes recommendations.
- Adjust the number of recommended movies (`num`).

In [129]:
import random   # Used to perform random walks across the graph
import pandas as pd # Used for handling data tables

### Step 6 — Implementing a Pixie-Style Random-Walk Recommendation Function

Given a starting movie, the algorithm proceeds as follows:

#### 1. Identify the starting node
The input movie title is mapped to its `movie_id`. The algorithm checks that this
movie exists in the graph. Otherwise, a recommendation cannot be generated.

#### 2. Perform a random walk on the bipartite graph
A walk begins at the movie node and repeatedly selects one random neighbor at each
step. Because the graph is bipartite, the walk alternates between:

- movie → user  
- user → movie  

At each step, if the walk lands on a movie node (other than the starting movie), we
record a visit.

The parameter `walk_length` controls how far the walk explores the surrounding region
of the graph.

#### 3. Rank candidate movies
After completing the walk, movies are ranked in descending order according to their
visit counts. Movies that are encountered more frequently are considered more relevant
to the starting movie.

#### 4. Convert movie IDs to titles
The top-ranked movie IDs are mapped back to human-readable titles using the `movies`
DataFrame.

#### 5. Return the recommendation list
The function returns a DataFrame containing:

- `Ranking` 
- `Movie Name` 

This output represents the movies most strongly connected to the starting movie under
the random-walk dynamics.

In [130]:
# Create a set of all movie IDs
movie_id_set = set(movies['movie_id'].values)

In [131]:
def weighted_pixie_recommend(movie_name, walk_length=15, num=5):
    """
    Recommend movies using a Pixie-style random walk on the user–movie graph.

    Parameters:
        movie_name (str): Starting movie title.
        walk_length (int): Number of random-walk steps to perform.
        num (int): Number of movies to recommend.
    """
    
    # 1 - Look up the movie_id for the given movie_name
    row = movies[movies['title'] == movie_name]

    if row.empty:
        print(f"Movie '{movie_name}' not found in dataset.")
        return pd.DataFrame(columns=['Ranking', 'Movie Name'])

    start_movie = int(row.iloc[0]['movie_id'])

    # Make sure this movie exists as a node in the graph
    if start_movie not in graph:
        print(f"Movie id {start_movie} not present in the graph.")
        return pd.DataFrame(columns=['Ranking', 'Movie Name'])

    # 2 - Perform the random walk starting from the movie node
    position = start_movie
    visit_counts = {}  

    for _ in range(walk_length):
        neighbors = list(graph.get(position, []))
        if not neighbors:
            break  # stuck, no neighbors

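        # Move to a uniformly random neighbor (a user node or a movie node)
        position = random.choice(neighbors)

        # If we land on a movie other than the starting movie, count the visit.
        # Caveat: this test assumes movie IDs and user IDs never coincide; with
        # raw MovieLens integer IDs they overlap, so user nodes can be counted too.
        if position in movie_id_set and position != start_movie:
            visit_counts[position] = visit_counts.get(position, 0) + 1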

    # If we never visited any other movies, return empty result
    if not visit_counts:
        return pd.DataFrame(columns=['Ranking', 'Movie Name'])

    # 3 - Rank movies by how many times they were visited
    sorted_movies = sorted(visit_counts.items(),
                           key=lambda x: x[1],
                           reverse=True)
    
    # Extract the top movie IDs
    top_movies = [m for m, count in sorted_movies[:num]]

    # 4 - Convert movie_ids back into human-readable titles
    title_lookup = movies.set_index('movie_id')['title']
    movie_names = title_lookup.loc[top_movies].values

    # 5 - Build the result DataFrame with rankings
    result_df = pd.DataFrame({
        'Ranking': range(1, len(movie_names) + 1),
        'Movie Name': movie_names
    })
    result_df.set_index('Ranking', inplace=True)

    return result_df
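
The function above picks each neighbor uniformly at random. A weighted variant, sketched below under several assumptions, would bias each step by an edge weight such as the raw (non-centered) rating. The tagged node keys, the `toy_weights` dictionary, and the function name `weighted_walk_counts` are illustrative and not part of the notebook.

```python
import random
from collections import Counter

def weighted_walk_counts(graph, weights, start, walk_length, rng):
    """Count movie visits along one weighted random walk.

    graph:   dict mapping node -> list of neighbor nodes
    weights: dict mapping (node, neighbor) -> positive edge weight
    """
    counts = Counter()
    position = start
    for _ in range(walk_length):
        neighbors = graph.get(position, [])
        if not neighbors:
            break
        # Sample the next node with probability proportional to the edge weight.
        edge_weights = [weights[(position, n)] for n in neighbors]
        position = rng.choices(neighbors, weights=edge_weights, k=1)[0]
        if position[0] == "m" and position != start:
            counts[position] += 1
    return counts

# Toy graph: nodes are tagged tuples so user and movie IDs cannot collide.
toy_graph = {
    ("m", 1): [("u", 1), ("u", 2)],
    ("m", 2): [("u", 1)],
    ("m", 3): [("u", 2)],
    ("u", 1): [("m", 1), ("m", 2)],
    ("u", 2): [("m", 1), ("m", 3)],
}
toy_ratings = {(("u", 1), ("m", 1)): 5, (("u", 1), ("m", 2)): 4,
               (("u", 2), ("m", 1)): 3, (("u", 2), ("m", 3)): 1}
# Use the same weight in both directions of each edge.
toy_weights = {**toy_ratings, **{(m, u): w for (u, m), w in toy_ratings.items()}}

counts = weighted_walk_counts(toy_graph, toy_weights, ("m", 1), 200, random.Random(0))
```

Ranking the movies then amounts to `counts.most_common(num)`. Because the walk alternates between movie and user nodes, at most half of the steps can land on a movie.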

### Step 7 — Testing the Pixie-Style Random-Walk Recommender

To evaluate the behavior of the Pixie-style random-walk recommender, we test the
`weighted_pixie_recommend` function on a specific starting movie. In the example
below, the starting point is **"Jurassic Park (1993)"**.

The function is called with:

- `walk_length = 10`: the number of steps taken in the random walk.  

- `num = 5`: the number of movies to return as recommendations.  

This test allows us to:

- Verify that the function correctly maps from a movie title to its node in the graph.
- Confirm that the random walk traverses between users and movies without errors.
- Inspect whether the recommended movies are **similar in genre, content, or style** to *"Jurassic Park (1993)"*, which would indicate that the graph structure and random-walk procedure are behaving as intended.


In [132]:
# Testing the Pixie-Inspired Random Walk Recommendation System
weighted_pixie_recommend("Jurassic Park (1993)", walk_length=10, num=5)

Ranking,Movie Name
1,Rear Window (1954)
2,French Twist (Gazon maudit) (1995)
3,North by Northwest (1959)
4,Schindler's List (1993)
5,Wallace & Gromit: The Best of Aardman Animatio...
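
Because a single short walk visits only a handful of movies, the ranking can change from run to run. One way to make experiments repeatable and less noisy, sketched here on a toy graph with assumed names (`walk_counts`, `toy_graph`), is to seed the random generator and sum visit counts over several walks.

```python
import random
from collections import Counter

def walk_counts(graph, start, walk_length, n_walks, seed):
    """Sum movie-visit counts over several seeded random walks from start."""
    rng = random.Random(seed)  # a fixed seed makes the result repeatable
    counts = Counter()
    for _ in range(n_walks):
        position = start
        for _ in range(walk_length):
            neighbors = sorted(graph.get(position, ()))  # sorted for a stable order
            if not neighbors:
                break
            position = rng.choice(neighbors)
            if position.startswith("movie") and position != start:
                counts[position] += 1
    return counts

# Toy bipartite graph with string labels for users and movies.
toy_graph = {
    "movie_a": {"user_1", "user_2"},
    "movie_b": {"user_1"},
    "movie_c": {"user_2"},
    "user_1": {"movie_a", "movie_b"},
    "user_2": {"movie_a", "movie_c"},
}

first = walk_counts(toy_graph, "movie_a", walk_length=10, n_walks=50, seed=42)
second = walk_counts(toy_graph, "movie_a", walk_length=10, n_walks=50, seed=42)
```

With the same seed the two calls return identical counts, which makes it easier to attribute changes in the ranking to `walk_length` or `num` rather than to chance.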


---

## **Submission Requirements:**

To successfully complete this assignment, ensure that you submit the following:


### **1. Jupyter Notebook Submission**
- Submit a **fully completed Jupyter Notebook** that includes:
  - **All implemented recommendation functions** (user-based, item-based, and random walk-based recommendations).
  - **Code explanations** in markdown cells to describe each step.
  - **Results and insights** from running your recommendation models.


### **2. Explanation of Pixie-Inspired Algorithms (3-5 Paragraphs)**
- Write a **detailed explanation** of **Pixie-inspired random walk algorithms** used for recommendations.
- Your explanation should cover:
  - What **Pixie-inspired recommendation systems** are.
  - How **random walks** help in identifying relevant recommendations.
  - Any real-world applications of such algorithms in industry.


### **3. Report for the Submitted Notebook**
Your report should be structured as follows:

#### **Title: Movie Recommendation System Report**

#### **1. Introduction**
- Briefly introduce **movie recommendation systems** and why they are important.
- Explain the **different approaches used** (user-based, item-based, random-walk).

#### **2. Dataset Description**
- Describe the **MovieLens 100K dataset**:
  - Number of users, movies, and ratings.
  - What features were used.
  - Any preprocessing performed.

#### **3. Methodology**
- Explain the three recommendation techniques implemented:
  - **User-based collaborative filtering** (how user similarity was calculated).
  - **Item-based collaborative filtering** (how item similarity was determined).
  - **Random-walk-based Pixie algorithm** (why graph-based approaches are effective).
  
#### **4. Implementation Details**
- Discuss the steps taken to build the functions.
- Describe how the **adjacency list graph** was created.
- Explain how **random walks** were performed and how visited movies were ranked.

#### **5. Results and Evaluation**
- Present **example outputs** from each recommendation approach.
- Compare the different methods in terms of accuracy and usefulness.
- Discuss any **limitations** in the implementation.

#### **6. Conclusion**
- Summarize the key takeaways from the project.
- Discuss potential improvements (e.g., **hybrid models, additional features**).
- Suggest real-world applications of the methods used.

### **Submission Instructions**

- Submit a `.zip` file containing the Jupyter Notebook, all provided data files, and the saved files (i.e., `users.csv`, `movies.csv`, and `ratings.csv`). Also include the Report and the Pixie Algorithm explanation document.
- [`Bonus 10 Points`] **Upload your Jupyter Notebook, Explanation Document, and Report** to your GitHub repository.
- Ensure the repository is public and contains:
  - `users.csv`, `movies.csv`, and `ratings.csv` (these are the DataFrames created in Part 1; save and export each one as a `.csv` file)
  - `Movie_Recommendation.ipynb`
  - `Pixie_Algorithm_Explanation.pdf` or `.md`
  - `Recommendation_Report.pdf` or `.md`
- **Submit the GitHub repository link in the cell below.**


#### **Example Submission Format**
```text
GitHub Repository: https://github.com/username/Movie-Recommendation
```

In [133]:
# Submit the GitHub link here:


### **Grading Rubric: ITCS 6162 - Data Mining Assignment**


| **Category**                              | **Criteria**                                                     | **Points** |
|-------------------------------------------|----------------------------------------------------------------|------------|
| **Part 1: Exploring and Cleaning Data (15 pts)**  | Properly loads `u.user`, `u.data`, and `u.item` datasets into DataFrames | 5 |
|                                           | Handles missing values, duplicates, and inconsistencies appropriately | 5 |
|                                           | Saves the cleaned datasets into CSV files: `users.csv`, `movies.csv`, `ratings.csv` | 5 |
| **Part 2: Collaborative Filtering-Based Recommendation (30 pts)** | Implements user-based collaborative filtering correctly | 10 |
|                                           | Implements item-based collaborative filtering correctly | 10 |
|                                           | Computes similarity measures accurately and provides valid recommendations | 10 |
| **Part 3: Graph-Based Recommender (Pixie-Inspired Algorithm) (35 pts)** | Constructs adjacency lists properly from user-movie interactions | 10 |
|                                           | Implements weighted random walk-based recommendation correctly | 15 |
|                                           | Explains and justifies the algorithm design choices (Pixie-inspired) | 10 |
| **Code Quality & Documentation (10 pts)** | Code is well-structured, efficient, and follows best practices | 5 |
|                                           | Markdown explanations and comments are clear and enhance understanding | 5 |
| **Results & Interpretation (5 pts)**      | Provides meaningful insights from the recommendation system's output | 5 |
| **Submission & Report (5 pts)**          | Submits all required files in the correct format (ZIP file with Jupyter notebook, processed CSV files, and project report) | 5 |
| **Total**                                 |                              | 100 |

#### **Bonus (10 pts)**
| **Category**                              | **Criteria**                                                     | **Points** |
|-------------------------------------------|----------------------------------------------------------------|------------|
| **GitHub Submission**                     | Provides a well-documented GitHub repository with CSV files, a structured README, and a properly formatted Jupyter Notebook | 10 |