# ITCS 6162: Data Mining - Programming Assignment

**In this assignment, you will explore data analysis, recommendation algorithms, and graph-based techniques using the MovieLens dataset. Your tasks will range from basic data exploration to advanced recommendation models, including:**
- Data manipulation with pandas
- User-item collaborative filtering
- Similarity-based recommendation models
- Pixie-inspired graph-based recommendation using adjacency lists with weighted random walks (without using NetworkX)


#### **Dataset Files:**
- **`u.data`**: User-movie ratings (`user_id  movie_id  rating  timestamp`)
- **`u.item`**: Movie metadata (`movie_id | title | release date | IMDB_website`)
- **`u.user`**: User demographics (`user_id | age | gender | occupation | zip_code`)

## **Part 1: Exploring and Cleaning Data**

### Inspecting the Dataset Format

The dataset is not in a traditional CSV format. To examine its structure, use the following shell command to display the first 10 lines of the file:

```sh
!head <file_name>
```

**In the cells below, write the code to read the files.**

In [57]:
# u.data
# File paths
data_file = 'u.data'
item_file = 'u.item'
user_file = 'u.user'

# Reading u.data (user_id, movie_id, rating, timestamp)
data_list = []
with open(data_file, 'r') as file:
    for line in file:
        data_list.append(list(map(int, line.strip().split('\t'))))  # Convert to integers

# Displaying the first 5 rows 
print("u.data:", data_list[:5]) 

u.data: [[196, 242, 3, 881250949], [186, 302, 3, 891717742], [22, 377, 1, 878887116], [244, 51, 2, 880606923], [166, 346, 1, 886397596]]


In [58]:
# u.item
# Reading u.item (movie_id | title | release date | IMDB_website)
item_list = []
with open(item_file, 'r', encoding='ISO-8859-1') as file:  # Encoding for special characters
    for line in file:
        parts = line.strip().split('|')
        item_list.append([int(parts[0]), parts[1], parts[2], parts[4]])  # Extract relevant fields
print("u.item:", item_list[:5])  


u.item: [[1, 'Toy Story (1995)', '01-Jan-1995', 'http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)'], [2, 'GoldenEye (1995)', '01-Jan-1995', 'http://us.imdb.com/M/title-exact?GoldenEye%20(1995)'], [3, 'Four Rooms (1995)', '01-Jan-1995', 'http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)'], [4, 'Get Shorty (1995)', '01-Jan-1995', 'http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)'], [5, 'Copycat (1995)', '01-Jan-1995', 'http://us.imdb.com/M/title-exact?Copycat%20(1995)']]


In [59]:
# u.user
# Reading u.user (user_id | age | gender | occupation | zip_code)
user_list = []
with open(user_file, 'r') as file:
    for line in file:
        parts = line.strip().split('|')
        user_list.append([int(parts[0]), int(parts[1]), parts[2], parts[3], parts[4]])  
print("u.user:", user_list[:5]) 


u.user: [[1, 24, 'M', 'technician', '85711'], [2, 53, 'F', 'other', '94043'], [3, 23, 'M', 'writer', '32067'], [4, 24, 'M', 'technician', '43537'], [5, 33, 'F', 'other', '15213']]


#### Loading the Dataset with Pandas

Use **pandas** to load the dataset into a DataFrame for analysis. Follow these steps:  

1. Import the necessary library: `pandas`.  
2. Use `pd.read_csv()` (or an appropriate function) to read the dataset file.  
3. Ensure the dataset is loaded with the correct delimiter (e.g., `','`, `'\t'`, `'|'`, or another separator if needed).  
4. Select and display the first few rows using `.head()`.

Ensure that:  

- The `ratings` dataset is read from `"u.data"` using tab (`'\t'`) as a separator and column names (`"user_id"`, `"movie_id"`, `"rating"` and `"timestamp"`).  
- The `movies` dataset is read from `"u.item"` using `'|'` as a separator, use columns (`0`, `1`, `2`), encoding (`"latin-1"`) and name the columns (`movie_id`, `title`, and `release_date`).  
- The `users` dataset is read from `"u.user"` using `'|'` as a separator, use columns (`0`, `1`, `2`, `3`) and name the columns (`user_id`, `age`, `gender`, and `occupation`).

In [60]:
# ratings
import pandas as pd

# Load the ratings dataset
ratings = pd.read_csv(
    'u.data', 
    sep='\t', 
    names=['user_id', 'movie_id', 'rating', 'timestamp'], 
    header=None
)

# Display the first few rows
print("Ratings dataset:")
print(ratings.head())


Ratings dataset:
   user_id  movie_id  rating  timestamp
0      196       242       3  881250949
1      186       302       3  891717742
2       22       377       1  878887116
3      244        51       2  880606923
4      166       346       1  886397596


In [61]:
# movies
# Load the movies dataset
movies = pd.read_csv(
    'u.item', 
    sep='|', 
    usecols=[0, 1, 2], 
    names=['movie_id', 'title', 'release_date'], 
    encoding='latin-1', 
    header=None
)

print("\nMovies dataset:")
print(movies.head())




Movies dataset:
   movie_id              title release_date
0         1   Toy Story (1995)  01-Jan-1995
1         2   GoldenEye (1995)  01-Jan-1995
2         3  Four Rooms (1995)  01-Jan-1995
3         4  Get Shorty (1995)  01-Jan-1995
4         5     Copycat (1995)  01-Jan-1995


In [62]:
# users
# Load the users dataset
users = pd.read_csv(
    'u.user', 
    sep='|', 
    usecols=[0, 1, 2, 3], 
    names=['user_id', 'age', 'gender', 'occupation'], 
    header=None
)

print("\nUsers dataset:")
print(users.head())


Users dataset:
   user_id  age gender  occupation
0        1   24      M  technician
1        2   53      F       other
2        3   23      M      writer
3        4   24      M  technician
4        5   33      F       other


**Note:** As a **Bonus** task, save the `ratings`, `movies`, and `users` DataFrames to `.csv` files. <br>
**Hint:** Use the `to_csv()` function in pandas to save these DataFrames as CSV files.

In [63]:
# ratings
ratings.to_csv('ratings.csv', index=False)

In [64]:
# movies
movies.to_csv('movies.csv', index=False)

In [65]:
# users
users.to_csv('users.csv', index=False)

**Display the first 10 rows of each file.**

In [66]:
# ratings
print("\nRatings dataset:")
print(ratings.head(10))


Ratings dataset:
   user_id  movie_id  rating  timestamp
0      196       242       3  881250949
1      186       302       3  891717742
2       22       377       1  878887116
3      244        51       2  880606923
4      166       346       1  886397596
5      298       474       4  884182806
6      115       265       2  881171488
7      253       465       5  891628467
8      305       451       3  886324817
9        6        86       3  883603013


In [67]:
# movies
print("\nMovies dataset:")
print(movies.head(10))


Movies dataset:
   movie_id                                              title release_date
0         1                                   Toy Story (1995)  01-Jan-1995
1         2                                   GoldenEye (1995)  01-Jan-1995
2         3                                  Four Rooms (1995)  01-Jan-1995
3         4                                  Get Shorty (1995)  01-Jan-1995
4         5                                     Copycat (1995)  01-Jan-1995
5         6  Shanghai Triad (Yao a yao yao dao waipo qiao) ...  01-Jan-1995
6         7                              Twelve Monkeys (1995)  01-Jan-1995
7         8                                        Babe (1995)  01-Jan-1995
8         9                            Dead Man Walking (1995)  01-Jan-1995
9        10                                 Richard III (1995)  22-Jan-1996


In [68]:
# users
print("\nUsers dataset:")
print(users.head(10))


Users dataset:
   user_id  age gender     occupation
0        1   24      M     technician
1        2   53      F          other
2        3   23      M         writer
3        4   24      M     technician
4        5   33      F          other
5        6   42      M      executive
6        7   57      M  administrator
7        8   36      M  administrator
8        9   29      M        student
9       10   53      M         lawyer


### Data Cleaning and Exploration with Pandas  

After loading the dataset, it’s important to clean and explore the data to ensure consistency and accuracy. Below are key **pandas** functions for cleaning and understanding the dataset.

#### 1. Handle Missing Values  
- `df.dropna()` – Removes rows with missing values.  
- `df.fillna(value)` – Fills missing values with a specified value.  

#### 2. Remove Duplicates  
- `df.drop_duplicates()` – Drops duplicate rows from the dataset.  

#### 3. Handle Incorrect Data Types  
- `df.astype(dtype)` – Converts columns to the appropriate data type.  

#### 4. Filter Outliers (if applicable)  
- `df[df['column_name'] > threshold]` – Filters rows based on a condition.  

#### 5. Rename Columns (if needed)  
- `df.rename(columns={'old_name': 'new_name'})` – Renames columns for clarity.  

#### 6. Reset Index  
- `df.reset_index(drop=True, inplace=True)` – Resets the index after cleaning.  

### Data Exploration Functions  

To better understand the dataset, use these **pandas** functions:  

- `df.shape` – Returns the number of rows and columns in the dataset.  
- `df.nunique()` – Displays the number of unique values in each column.  
- `df['column_name'].unique()` – Returns unique values in a specific column.  

**Example Usage in Pandas:**  
```python
import pandas as pd

# Load dataset
df = pd.read_csv("your_file.csv")

# Drop missing values
df_cleaned = df.dropna()

# Remove duplicate rows
df_cleaned = df_cleaned.drop_duplicates()

# Convert 'timestamp' column to datetime format
# (assumes Unix seconds, as in u.data; without unit='s' integers are read as nanoseconds)
df_cleaned['timestamp'] = pd.to_datetime(df_cleaned['timestamp'], unit='s')

# Display dataset shape
print("Dataset shape:", df_cleaned.shape)

# Display number of unique values in each column
print("Unique values per column:\n", df_cleaned.nunique())

# Display unique movie IDs
print("Unique movie IDs:", df_cleaned['movie_id'].unique()[:10])  # Show first 10 unique movie IDs
```

**Note:** The functions above are some of the most widely used **pandas** functions for data cleaning and exploration. Not all of them will be needed in the exercises below; use them as the dataset and the specific task require.

**Convert Timestamps into Readable dates.**

In [69]:
# ratings
ratings['timestamp'] = pd.to_datetime(ratings['timestamp'], unit='s')
print("\nRatings dataset with readable timestamps:")
print(ratings.head(10))



Ratings dataset with readable timestamps:
   user_id  movie_id  rating           timestamp
0      196       242       3 1997-12-04 15:55:49
1      186       302       3 1998-04-04 19:22:22
2       22       377       1 1997-11-07 07:18:36
3      244        51       2 1997-11-27 05:02:03
4      166       346       1 1998-02-02 05:33:16
5      298       474       4 1998-01-07 14:20:06
6      115       265       2 1997-12-03 17:51:28
7      253       465       5 1998-04-03 18:34:27
8      305       451       3 1998-02-01 09:20:17
9        6        86       3 1997-12-31 21:16:53


In [None]:
# find min and max of timestamp
min_timestamp = ratings['timestamp'].min()
max_timestamp = ratings['timestamp'].max()
print(f"\nMinimum timestamp: {min_timestamp}")
print(f"Maximum timestamp: {max_timestamp}")



Minimum timestamp: 1997-09-20 03:05:10
Maximum timestamp: 1998-04-22 23:10:38


**Check for Missing Values**

In [70]:
# ratings
print("\nMissing values in ratings dataset:")
print(ratings.isnull().sum())



Missing values in ratings dataset:
user_id      0
movie_id     0
rating       0
timestamp    0
dtype: int64


In [71]:
# movies
print("\nMissing values in movies dataset:")
print(movies.isnull().sum())


Missing values in movies dataset:
movie_id        0
title           0
release_date    1
dtype: int64


In [72]:
# users
print("\nMissing values in users dataset:")
print(users.isnull().sum())




Missing values in users dataset:
user_id       0
age           0
gender        0
occupation    0
dtype: int64


**Print the total number of users, movies, and ratings.**

In [73]:
print(f"Total Users: {users['user_id'].nunique()}")  
print(f"Total Movies: {movies['movie_id'].nunique()}")  
print(f"Total Ratings: {ratings.shape[0]}") 

Total Users: 943
Total Movies: 1682
Total Ratings: 100000


---

## Data Loading and Initial Exploration
The MovieLens dataset consists of three main files that contain information about users, movies, and their interactions. The loading process demonstrates how to read structured data from different file formats with various delimiters, an essential skill in data preprocessing.

Loading data using raw Python file operations allows us to understand the underlying structure before moving to pandas DataFrames for more efficient manipulation.

After examining the raw data structure, we used pandas to create clean, structured DataFrames with appropriate column names. This transformation makes the data more accessible for analysis and modeling.

## Data Cleaning and Processing
The data cleaning process involved:

1. Converting Unix timestamps to readable datetime format
2. Checking for missing values across all datasets
3. Summarizing dataset statistics to understand the scale of the recommendation task

The preprocessing steps reveal that we have 943 unique users, 1,682 unique movies, and 100,000 total ratings, creating a sparse matrix where most users have rated only a small subset of available movies.

---

### Insight from part 1:

**Dataset Characteristics**

The MovieLens dataset contains 943 users, 1,682 movies, and 100,000 ratings, creating a sparse matrix where most users have rated only a small percentage of available movies.

The data preprocessing revealed nearly clean data: the `ratings` and `users` datasets have no missing values, and the `movies` dataset has a single missing `release_date`. This provides a solid foundation for recommendation algorithms.

Rating timestamps span from 20 September 1997 to 22 April 1998, so this is a historical dataset covering roughly seven months.
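
The sparsity described in this insight can be quantified directly from the three totals printed in Part 1. The following is a minimal sketch; the variable names are illustrative and not part of the assignment.

```python
# Totals reported in Part 1
n_users, n_movies, n_ratings = 943, 1682, 100_000

# Fraction of user-movie cells that actually hold a rating
density = n_ratings / (n_users * n_movies)
sparsity = 1 - density

print(f"Density:  {density:.2%}")   # Density:  6.30%
print(f"Sparsity: {sparsity:.2%}")  # Sparsity: 93.70%
```

In other words, roughly 94 out of every 100 cells in the user-movie matrix are empty.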

---

## **Part 2: Collaborative Filtering-Based Recommendation**

### **Create a User-Item Matrix**

#### Instructions for Creating a User-Movie Rating Matrix

In this exercise, you will create a user-movie rating matrix using **pandas**. This matrix will represent the ratings that users have given to different movies.

1. **Dataset Overview**:  
   The dataset has already been loaded. It includes the following key columns:
   - `user_id`: The ID of the user.
   - `movie_id`: The ID of the movie.
   - `rating`: The rating the user gave to the movie.

2. **Create the User-Movie Rating Matrix**:  
   Use the **`pivot()`** function in **pandas** to reshape the data. Your goal is to create a matrix where:
   - Each **row** represents a **user**.
   - Each **column** represents a **movie**.
   - Each **cell** contains the **rating** that the user has given to the movie.

   Specify the following parameters for the `pivot()` function:
   - **`index`**: The `user_id` column (this will define the rows).
   - **`columns`**: The `movie_id` column (this will define the columns).
   - **`values`**: The `rating` column (this will fill the matrix with ratings).

3. **Inspect the Matrix**:  
   After creating the matrix, examine the first few rows of the resulting matrix to ensure it has been constructed correctly.

4. **Handle Missing Values**:  
   It's likely that some users have not rated every movie, resulting in `NaN` values in the matrix. You will need to handle these missing values. Consider the following options:
   - **Fill with 0**: If you wish to represent missing ratings as zeros (indicating no rating).
   - **Fill with the average rating**: Alternatively, replace missing values with the average rating for each movie.
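
To make the two options above concrete, here is a small sketch on a hypothetical three-user, two-movie dataset (the `toy` data and variable names are assumptions for illustration only). It shows how the same gap is filled under each strategy.

```python
import pandas as pd

# Toy ratings: user 2 has not rated movie 20
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 3, 3],
    "movie_id": [10, 20, 10, 10, 20],
    "rating":   [5, 3, 4, 2, 1],
})
toy_matrix = toy.pivot(index="user_id", columns="movie_id", values="rating")

# Option 1: represent "not rated" as 0
filled_zero = toy_matrix.fillna(0)

# Option 2: replace each gap with that movie's (column's) mean rating
filled_mean = toy_matrix.fillna(toy_matrix.mean())

print(filled_zero)
print(filled_mean)
```

With zeros, the missing cell looks like a very low rating; with the column mean, it takes the value 2.0, the average of the two existing ratings for movie 20.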

**Create the user-movie rating matrix using the `pivot()` function.**

In [74]:
user_movie_matrix = ratings.pivot(
    index='user_id', 
    columns='movie_id', 
    values='rating'
)


**Display the matrix to verify the transformation.**

In [75]:

print("User-Movie Rating Matrix (with NaN for missing values):")
print(user_movie_matrix.head())

User-Movie Rating Matrix (with NaN for missing values):
movie_id  1     2     3     4     5     6     7     8     9     10    ...  \
user_id                                                               ...   
1          5.0   3.0   4.0   3.0   3.0   5.0   4.0   1.0   5.0   3.0  ...   
2          4.0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   2.0  ...   
3          NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
4          NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   
5          4.0   3.0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...   

movie_id  1673  1674  1675  1676  1677  1678  1679  1680  1681  1682  
user_id                                                               
1          NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  
2          NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  
3          NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  
4          NaN   NaN   NaN   NaN   NaN   NaN   Na

In [76]:
user_movie_matrix_avg_filled = user_movie_matrix.apply(lambda col: col.fillna(col.mean()), axis=0)
print("\nUser-Movie Rating Matrix (missing values filled with average rating):")
print(user_movie_matrix_avg_filled.head())


User-Movie Rating Matrix (missing values filled with average rating):
movie_id      1         2         3         4         5         6     \
user_id                                                                
1         5.000000  3.000000  4.000000  3.000000  3.000000  5.000000   
2         4.000000  3.206107  3.033333  3.550239  3.302326  3.576923   
3         3.878319  3.206107  3.033333  3.550239  3.302326  3.576923   
4         3.878319  3.206107  3.033333  3.550239  3.302326  3.576923   
5         4.000000  3.000000  3.033333  3.550239  3.302326  3.576923   

movie_id      7         8         9         10    ...  1673  1674  1675  1676  \
user_id                                           ...                           
1         4.000000  1.000000  5.000000  3.000000  ...   3.0   4.0   3.0   2.0   
2         3.798469  3.995434  3.896321  2.000000  ...   3.0   4.0   3.0   2.0   
3         3.798469  3.995434  3.896321  3.831461  ...   3.0   4.0   3.0   2.0   
4         3.798469 

---

### Insight:

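The user-movie matrix is extremely sparse: only about 6.3% of its 943 × 1,682 cells hold a rating. Sparsity makes similarity estimates noisy, and it is closely related to the "cold start" problem, where new users or movies have too few ratings to support recommendations.

Filling missing values with movie averages rather than zeros keeps each movie's mean rating unchanged, so unrated movies are not treated as if they were strongly disliked.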

---

### **User-Based Collaborative Filtering Recommender System**

#### **Objective**
In this task, you will implement a **user-based collaborative filtering** movie recommendation system using the **Movie dataset**. The goal is to recommend movies to a user based on the preferences of similar users.

##### **Step 1: Import Required Libraries**
Before starting, ensure you have the necessary libraries installed. Use the following imports:

```python
import pandas as pd  # For handling data
import numpy as np   # For numerical computations
from sklearn.metrics.pairwise import cosine_similarity  # For computing user similarity
```

##### **Step 2: Compute User-User Similarity**
- We will use **cosine similarity** to measure how similar each pair of users is based on their movie ratings.
- Since `cosine_similarity` does not handle missing values (NaN), replace them with `0` before computation.

##### **Instructions:**
1. Fill missing values with `0` using `.fillna(0)`.
2. Compute similarity using `cosine_similarity()`.
3. Convert the result into a **Pandas DataFrame**, with users as both row and column labels.

##### **Hint:**  
You can achieve this using the following approach:

```python
user_similarity = cosine_similarity(user_movie_matrix.fillna(0))
user_sim_df = pd.DataFrame(user_similarity, index=user_movie_matrix.index, columns=user_movie_matrix.index)
```
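
For intuition about what `cosine_similarity` computes, the sketch below reproduces the formula by hand with NumPy on three hypothetical rating vectors. The helper `cosine` and the example users are illustrative assumptions, not part of the assignment.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Ratings over the same three movies; 0 stands for "not rated"
alice = [5, 3, 0]
bob = [4, 0, 0]
carol = [0, 0, 2]

print(cosine(alice, bob))    # about 0.857: they share one rated movie
print(cosine(alice, carol))  # 0.0: no rated movie in common
```

Because unrated movies are stored as 0, they add nothing to the dot product, so two users with no co-rated movies always get a similarity of 0.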

##### **Step 3: Implement the Recommendation Function**
Now, implement the function `recommend_movies_for_user(user_id, num=5)` to recommend movies for a given user.

##### **Function Inputs:**
- `user_id`: The target user for whom we need recommendations.
- `num`: The number of movies to recommend (default is 5).

##### **Function Steps:**
1. Find **similar users**:
   - Retrieve the similarity scores for the given `user_id`.
   - Sort them in **descending** order (highest similarity first).
   - Exclude the user themselves.
   
2. Get the **movie ratings** from these similar users.

3. Compute the **average rating** for each movie based on these users' preferences.

4. Sort the movies in **descending order** based on the computed average ratings.

5. Retrieve the **top `num` recommended movies**.

6. Map **movie IDs** to their **titles** using the `movies` DataFrame.

7. Return the results as a **Pandas DataFrame** with rankings.

##### **Step 4: Return the Final Recommendation List**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

##### **Hint:** Your final DataFrame should be created like this:
```python
result_df = pd.DataFrame({
    'Ranking': range(1, num+1),
    'Movie Name': movie_names     
})
result_df.set_index('Ranking', inplace=True)
```

#### **Example: User-Based Collaborative Filtering**
```python
recommend_movies_for_user(10, num = 5)
```
**Output:**
```
| Ranking | Movie Name                     |
|---------|--------------------------------|
| 1       | In the Company of Men (1997)   |
| 2       | Misérables, Les (1995)         |
| 3       | Thin Blue Line, The (1988)     |
| 4       | Braindead (1992)               |
| 5       | Boys, Les (1997)               |
```


In [77]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def recommend_movies_for_user(user_id, num=5):
    user_ratings = ratings[ratings['user_id'] == user_id]

    # Create user-movie matrix
    user_movie_matrix = ratings.pivot(index='user_id', columns='movie_id', values='rating')

    user_movie_matrix = user_movie_matrix.fillna(0)  # Fill NaN with 0 for cosine similarity calculation
    
    # Calculate user similarity using cosine similarity
    user_similarity = cosine_similarity(user_movie_matrix)
    user_sim_df = pd.DataFrame(user_similarity, index=user_movie_matrix.index, columns=user_movie_matrix.index)

    # Get the 5 most similar users, explicitly excluding the target user
    similar_users = user_sim_df[user_id].drop(user_id).sort_values(ascending=False).head(5)

    # Get movies rated by similar users but not by target user
    user_rated_movies = set(user_ratings['movie_id'])
    similar_users_ratings = ratings[ratings['user_id'].isin(similar_users.index)]
    candidate_movies = similar_users_ratings[~similar_users_ratings['movie_id'].isin(user_rated_movies)]

    # Calculate weighted average rating
    movie_ratings = candidate_movies.groupby('movie_id').apply(
        lambda x: np.average(x['rating'], weights=similar_users[x['user_id']]), include_groups=False
    ).sort_values(ascending=False)

    # Get top N movies
    top_movies = movie_ratings.head(num)

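    # Map movie IDs to titles, preserving the ranked order of top_movies
    title_by_id = movies.set_index('movie_id')['title']
    movie_names = title_by_id.loc[top_movies.index].tolist()

    # Create result DataFrame (may have fewer than `num` rows if candidates run out)
    result_df = pd.DataFrame({
        'Ranking': range(1, len(movie_names) + 1),
        'Movie Name': movie_names
    })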
    result_df.set_index('Ranking', inplace=True)

    return result_df

In [78]:
recommend_movies_for_user(10, num = 5)

                           Movie Name
Ranking                              
1        In the Company of Men (1997)
2              Misérables, Les (1995)
3          Thin Blue Line, The (1988)
4                    Braindead (1992)
5                    Boys, Les (1997)


---

### Insight:
**User-Based Collaborative Filtering**: This method produces personalized recommendations from the preferences of similar users. For user 10, it recommended less mainstream films such as "In the Company of Men" and "Misérables, Les", which suggests that this user's nearest neighbors rate such titles highly.

---

### **Item-Based Collaborative Filtering Recommender System**

#### **Objective**
In this task, you will implement an **item-based collaborative filtering** recommendation system using the **Movie dataset**. The goal is to recommend movies similar to a given movie based on user rating patterns.

#### **Step 1: Import Required Libraries**
We already did this in the previous task; the imports are repeated here for emphasis.

Before starting, ensure you have the necessary libraries installed. Use the following imports:

```python
import pandas as pd  # For handling data
import numpy as np   # For numerical computations
from sklearn.metrics.pairwise import cosine_similarity  # For computing item similarity
```

#### **Step 2: Compute Item-Item Similarity**
- We will use **cosine similarity** to measure how similar each pair of movies is based on their user ratings.
- Since `cosine_similarity` does not handle missing values (NaN), replace them with `0` before computation.
- Unlike user-based filtering, we need to **transpose** (`.T`) the `user_movie_matrix` because we want similarity between movies (columns) instead of users (rows).

##### **Instructions:**
1. Transpose the user-movie matrix using `.T` to make movies the rows.
2. Fill missing values with `0` using `.fillna(0)`.
3. Compute similarity using `cosine_similarity()`.
4. Convert the result into a **Pandas DataFrame**, with movies as both row and column labels.

##### **Hint:**  
You can achieve this using the following approach:

```python
item_similarity = cosine_similarity(user_movie_matrix.T.fillna(0))
item_sim_df = pd.DataFrame(item_similarity, index=user_movie_matrix.columns, columns=user_movie_matrix.columns)
```

#### **Step 3: Implement the Recommendation Function**
Now, implement the function `recommend_movies(movie_name, num=5)` to recommend movies similar to a given movie.

##### **Function Inputs:**
- `movie_name`: The target movie for which we need recommendations.
- `num`: The number of similar movies to recommend (default is 5).

##### **Function Steps:**
1. Find the **movie_id** corresponding to the given `movie_name` in the `movies` DataFrame.
2. If the movie is not found, return an appropriate message.
3. Extract the **similarity scores** for this movie from `item_sim_df`.
4. Sort the movies in **descending order** based on similarity (excluding the movie itself).
5. Retrieve the **top `num` similar movies**.
6. Map **movie IDs** to their **titles** using the `movies` DataFrame.
7. Return the results as a **Pandas DataFrame** with rankings.

#### **Step 4: Return the Final Recommendation List**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

##### **Hint:** Your final DataFrame should be created like this:
```python
result_df = pd.DataFrame({
    'ranking': range(1, num+1),
    'movie_name': movie_names
})
result_df.set_index('ranking', inplace=True)
```

#### **Example: Item-Based Collaborative Filtering**
```python
recommend_movies("Jurassic Park (1993)", num=5)
```
**Output:**
```
| Ranking | Movie Name                               |
|---------|------------------------------------------|
| 1       | Top Gun (1986)                           |
| 2       | Empire Strikes Back, The (1980)          |
| 3       | Raiders of the Lost Ark (1981)           |
| 4       | Indiana Jones and the Last Crusade (1989)|
| 5       | Speed (1994)                             |
```

In [79]:
# Code the function here

# Import required libraries
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def compute_item_similarity(user_movie_matrix):
    """
    Compute the similarity between movies based on user ratings.
    
    Parameters:
    user_movie_matrix (pd.DataFrame): User-movie rating matrix
    
    Returns:
    pd.DataFrame: Item similarity matrix
    """
    # Transpose the matrix to get movies as rows and users as columns
    # Fill missing values with 0
    item_matrix = user_movie_matrix.T.fillna(0)
    
    # Compute cosine similarity between movies
    item_similarity = cosine_similarity(item_matrix)
    
    # Convert to DataFrame with proper indices (movie IDs)
    item_sim_df = pd.DataFrame(
        item_similarity, 
        index=user_movie_matrix.columns, 
        columns=user_movie_matrix.columns
    )
    
    return item_sim_df

def recommend_movies(movie_name, num=5, user_movie_matrix=None, movies_df=None, item_sim_df=None):
    """
    Recommend movies similar to a given movie.
    
    Parameters:
    movie_name (str): Name of the movie for which to find similar movies
    num (int): Number of movies to recommend
    user_movie_matrix (pd.DataFrame): User-movie rating matrix
    movies_df (pd.DataFrame): DataFrame containing movie information
    item_sim_df (pd.DataFrame): Item similarity matrix
    
    Returns:
    pd.DataFrame: DataFrame with recommended movies
    """
    # If similarity matrix is not provided, compute it
    if item_sim_df is None and user_movie_matrix is not None:
        item_sim_df = compute_item_similarity(user_movie_matrix)
    
    # Find the movie_id corresponding to the given movie_name
    movie_row = movies_df[movies_df['title'] == movie_name]
    
    # Check if the movie exists
    if movie_row.empty:
        return pd.DataFrame({'Message': ['Movie not found in the database']})
    
    movie_id = movie_row.iloc[0]['movie_id']
    
    # Check if the movie exists in the similarity matrix
    if movie_id not in item_sim_df.index:
        return pd.DataFrame({'Message': ['Movie has no similarity data']})
    
    # Get similarity scores for this movie with all other movies
    movie_similarities = item_sim_df[movie_id]
    
    # Sort similarities in descending order and exclude the movie itself
    similar_movies = movie_similarities.drop(movie_id).sort_values(ascending=False)
    
    # Get the top num similar movies
    top_similar_movies = similar_movies.head(num)
    
    # Map movie IDs to their titles, preserving the similarity ranking
    movie_ids = top_similar_movies.index.tolist()
    title_by_id = movies_df.set_index('movie_id')['title']
    movie_names = title_by_id.loc[movie_ids].tolist()
    
    # Create a DataFrame with the results
    result_df = pd.DataFrame({
        'Ranking': range(1, len(movie_names) + 1),
        'Movie Name': movie_names
    })
    result_df.set_index('Ranking', inplace=True)
    
    return result_df

# Example usage:
# First compute the item similarity matrix
item_sim_df = compute_item_similarity(user_movie_matrix)
#  get recommendations
recommendations = recommend_movies("Jurassic Park (1993)", num=5, user_movie_matrix=user_movie_matrix, movies_df=movies, item_sim_df=item_sim_df)
print(recommendations)

                                        Movie Name
Ranking                                           
1                                   Top Gun (1986)
2                  Empire Strikes Back, The (1980)
3                   Raiders of the Lost Ark (1981)
4        Indiana Jones and the Last Crusade (1989)
5                                     Speed (1994)


---

### Insight

**Item-Based Collaborative Filtering**: This method finds movies that tend to be rated similarly by the same users. The recommendations for "Jurassic Park (1993)" are other blockbuster action films such as "Top Gun (1986)" and "Empire Strikes Back, The (1980)", which suggests that co-rating patterns alone can reflect genre and style similarity even though no genre information is used.

---

## **Part 3: Graph-Based Recommender (Pixie-Inspired Algorithm)**

### **Adjacency List**

#### **Objective**
In this task, you will preprocess the Movie dataset and construct a **graph representation** where:
- **Users** are connected to the movies they have rated.
- **Movies** are connected to users who have rated them.
  
This graph structure will help in exploring **user-movie relationships** for recommendations.

#### **Step 1: Merge Ratings with Movie Titles**
Since we have **movie IDs** in the ratings dataset but need human-readable movie titles, we will:
1. Merge the `ratings` DataFrame with the `movies` DataFrame using the `'movie_id'` column.
2. This allows each rating to be associated with a **movie title**.

#### **Hint:**
Use the following Pandas operation to merge:
```python
ratings = ratings.merge(movies, on='movie_id')
```


#### **Step 2: Aggregate Ratings**
Since a user may rate the same movie more than once, we:
1. Group the dataset by `['user_id', 'movie_id', 'title']`.
2. Compute the **mean rating** for each movie by each user.
3. Reset the index to ensure we maintain a clean DataFrame structure.

#### **Hint:**  
Use `groupby()` and `mean()` as follows:
```python
ratings = ratings.groupby(['user_id', 'movie_id', 'title'])['rating'].mean().reset_index()
```
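
To see what Steps 1 and 2 do, here is a small sketch on made-up data. The `toy_ratings` and `toy_movies` frames are hypothetical; only the `merge` and `groupby` calls mirror the hints above.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "user_id":  [1, 1, 1, 2],
    "movie_id": [10, 10, 20, 10],
    "rating":   [4, 2, 5, 3],   # user 1 rated movie 10 twice
})
toy_movies = pd.DataFrame({"movie_id": [10, 20], "title": ["Movie A", "Movie B"]})

# Step 1: attach titles to each rating row.
merged = toy_ratings.merge(toy_movies, on="movie_id")

# Step 2: collapse duplicate (user, movie) pairs into their mean rating.
aggregated = (
    merged.groupby(["user_id", "movie_id", "title"])["rating"].mean().reset_index()
)
print(aggregated)
```

The two ratings that user 1 gave movie 10 (4 and 2) collapse into a single row with rating 3.0, leaving one row per user-movie pair.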

#### **Step 3: Normalize Ratings**
Since different users have different rating biases, we normalize ratings by:
1. **Computing each user's mean rating**.
2. **Subtracting the mean rating** from each individual rating.

#### **Instructions:**
- Use `groupby('user_id')` to group ratings by users.
- Apply `transform(lambda x: x - x.mean())` to adjust ratings.

#### **Hint:**  
Normalize ratings using:
```python
ratings['rating'] = ratings.groupby('user_id')['rating'].transform(lambda x: x - x.mean())
```
This ensures each user’s ratings are centered around zero, making similarity calculations fairer.
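
A minimal sketch of the mean-centering on made-up ratings follows. The `toy` frame and the `centered` column name are assumptions; writing to a separate column keeps the raw ratings available.

```python
import pandas as pd

toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "rating":  [5.0, 4.0, 3.0, 2.0, 1.0],
})

# Subtract each user's own mean rating from that user's ratings.
toy["centered"] = toy.groupby("user_id")["rating"].transform(lambda x: x - x.mean())
print(toy)
```

User 1 (mean 4.0) ends up with 1.0, 0.0, -1.0 and user 2 (mean 1.5) with 0.5, -0.5, so a generous rater and a harsh rater become comparable.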

#### **Step 4: Construct the Graph Representation**
We represent the user-movie interactions as an **undirected graph** using an **adjacency list**:
- Each **user** is a node connected to movies they rated.
- Each **movie** is a node connected to users who rated it.

#### **Graph Construction Steps:**
1. Initialize an empty dictionary `graph = {}`.
2. Iterate through the **ratings dataset**.
3. For each `user_id` and `movie_id` pair:
   - Add the movie to the user’s set of connections.
   - Add the user to the movie’s set of connections.

#### **Hint:**  
The following code builds the graph:

```python
graph = {}
for _, row in ratings.iterrows():
    user, movie = row['user_id'], row['movie_id']
    if user not in graph:
        graph[user] = set()
    if movie not in graph:
        graph[movie] = set()
    graph[user].add(movie)
    graph[movie].add(user)
```

This results in a **bipartite graph**, where:
- **Users** are connected to multiple movies.
- **Movies** are connected to multiple users.
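
One caveat: in MovieLens 100K, user IDs and movie IDs are both integers starting at 1, so with raw IDs as dictionary keys, user 1 and movie 1 share the same entry. A possible way to keep the two sides apart is to tag each key with its node type. The sketch below assumes tuple keys of the form `("user", id)` and `("movie", id)`; the helper name `build_namespaced_graph` is hypothetical.

```python
from collections import defaultdict
import pandas as pd

def build_namespaced_graph(ratings_df):
    """Build a bipartite adjacency list keyed by ('user', id) / ('movie', id)."""
    graph = defaultdict(set)
    for user_id, movie_id in zip(ratings_df["user_id"], ratings_df["movie_id"]):
        user_node = ("user", int(user_id))
        movie_node = ("movie", int(movie_id))
        graph[user_node].add(movie_node)
        graph[movie_node].add(user_node)
    return dict(graph)

# Toy data in which user 1 and movie 1 have the same numeric ID.
toy_ratings = pd.DataFrame({"user_id": [1, 1, 2], "movie_id": [1, 2, 1]})
toy_graph = build_namespaced_graph(toy_ratings)
print(toy_graph[("user", 1)])   # movies rated by user 1
print(toy_graph[("movie", 1)])  # users who rated movie 1
```

With tagged keys, a walk can also tell which side of the graph it is on by looking at the first element of the current node.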

#### **Step 5: Understanding the Graph**
- **Nodes** in the graph represent **users and movies**.
- **Edges** exist between a user and a movie **if the user has rated the movie**.
- This structure allows us to find **users with similar movie tastes** and **movies frequently watched together**.

#### **Exploring the Graph**
- **Find a user’s rated movies:**  
  ```python
  user_id = 1
  print(graph[user_id])  # Movies rated by user 1
  ```

- **Find users who rated a movie:**  
  ```python
  movie_id = 50
  print(graph[movie_id])  # Users who rated movie 50
  ```
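
Going two hops (movie → users → movies) is one way to find movies that are frequently rated together. The sketch below is an illustration on a made-up graph; string node names such as `"u1"` and `"m1"` and the helper `co_rated_movies` are assumptions, not part of the assignment code.

```python
from collections import Counter

# Toy adjacency list; string keys keep users and movies apart.
toy_graph = {
    "u1": {"m1", "m2"},
    "u2": {"m1", "m2", "m3"},
    "u3": {"m3"},
    "m1": {"u1", "u2"},
    "m2": {"u1", "u2"},
    "m3": {"u2", "u3"},
}

def co_rated_movies(graph, movie):
    """Count how many users rated both `movie` and each other movie."""
    counts = Counter()
    for user in graph[movie]:        # first hop: users who rated the movie
        for other in graph[user]:    # second hop: everything those users rated
            if other != movie:
                counts[other] += 1
    return counts

print(co_rated_movies(toy_graph, "m1"))
```

Here `m2` shares two raters with `m1` and `m3` shares one, so `m2` would be the stronger "rated together" candidate.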

In [80]:
# Code the function here

import pandas as pd
import numpy as np

def create_user_movie_graph(ratings_df, movies_df):
    """
    Preprocess the data and construct a bipartite graph representation
    where users are connected to the movies they've rated and vice versa.
    
    Parameters:
    ratings_df (pd.DataFrame): DataFrame containing user ratings
    movies_df (pd.DataFrame): DataFrame containing movie information
    
    Returns:
    dict: Adjacency list representation of the user-movie graph
    pd.DataFrame: Processed ratings with normalized values
    """
    # Step 1: Merge ratings with movie titles
    merged_ratings = ratings_df.merge(movies_df, on='movie_id')
    
    # Step 2: Aggregate ratings (in case there are duplicate ratings)
    aggregated_ratings = merged_ratings.groupby(['user_id', 'movie_id', 'title'])['rating'].mean().reset_index()
    
    # Step 3: Normalize ratings by subtracting each user's mean rating
    processed_ratings = aggregated_ratings.copy()
    processed_ratings['normalized_rating'] = processed_ratings.groupby('user_id')['rating'].transform(lambda x: x - x.mean())
    
    # Step 4: Construct the graph representation as an adjacency list
    graph = {}
    
    # Iterate through each rating
    for _, row in processed_ratings.iterrows():
        user, movie = row['user_id'], row['movie_id']
        
        # Initialize sets for user and movie if they don't exist
        if user not in graph:
            graph[user] = set()
        if movie not in graph:
            graph[movie] = set()
        
        # Add connections between user and movie
        graph[user].add(movie)
        graph[movie].add(user)
    
    return graph, processed_ratings

def explore_graph(graph, user_id=None, movie_id=None):
    """
    Explore the graph to find connections for a specific user or movie.
    
    Parameters:
    graph (dict): Adjacency list representation of the user-movie graph
    user_id (int, optional): User ID to explore connections for
    movie_id (int, optional): Movie ID to explore connections for
    
    Returns:
    set: Connected nodes for the specified user or movie
    """
    if user_id is not None and user_id in graph:
        print(f"Movies rated by user {user_id}:")
        return graph[user_id]
    
    elif movie_id is not None and movie_id in graph:
        print(f"Users who rated movie {movie_id}:")
        return graph[movie_id]
    
    else:
        print("User or movie not found in the graph.")
        return set()



In [84]:
# Usage
graph, processed_ratings = create_user_movie_graph(ratings, movies)

# Explore connections for user 1
user_movies = explore_graph(graph, user_id=1)
print(f"User 1 has rated {len(user_movies)} movies:")
print(list(user_movies)) 

# Explore connections for movie 50
movie_users = explore_graph(graph, movie_id=50)
print(f"Movie 50 has been rated by {len(movie_users)} users:")
print(list(movie_users)) 

# Print some graph statistics
num_nodes = len(graph)
avg_connections = sum(len(connections) for connections in graph.values()) / num_nodes
print(f"Graph has {num_nodes} nodes with an average of {avg_connections:.2f} connections per node")

Movies rated by user 1:
User 1 has rated 601 movies:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211,

---

# Pixie-Inspired Random Walk Algorithm Explanation

**What are Pixie-Inspired Recommendation Systems?**

Pixie is a graph-based recommendation algorithm developed by Pinterest that leverages random walks on large-scale graphs. Unlike traditional collaborative filtering methods that rely on matrix operations, Pixie-inspired systems model relationships as a graph and use stochastic processes to explore connections between entities.


The core idea is to represent users, items, and their interactions as a heterogeneous graph where different types of nodes (users and movies in our case) are connected by edges representing interactions. By performing biased random walks on this graph, the algorithm can discover complex patterns and relationships that might not be apparent with traditional methods.


**How Do Random Walks Help in Identifying Relevant Recommendations?**

Random walks operate on the principle of exploration with a probabilistic component. Starting from a node (either a user or a movie), the algorithm:

1. **Traverses the Graph**: Moving from the current node to a randomly selected neighboring node
2. **Tracks Visitation Frequency**: Counting how often each movie node is visited during the walks
3. **Identifies Patterns**: Movies visited more frequently during walks from a starting point are considered more relevant

This approach captures both direct connections (movies rated by a user) and indirect connections (movies rated by similar users) in a unified framework. The stochastic nature of random walks allows the algorithm to discover non-obvious relationships that might be missed by deterministic methods.
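
The three steps above can be sketched on a made-up graph as follows. This is a minimal illustration, not Pinterest's implementation: the node names, the `"m"` prefix for movies, and the helper `random_walk_counts` are all assumptions.

```python
import random
from collections import Counter

# Toy bipartite graph; neighbor lists keep the walk reproducible with a seed.
toy_graph = {
    "u1": ["m1", "m2"],
    "u2": ["m1", "m2", "m3"],
    "u3": ["m3"],
    "m1": ["u1", "u2"],
    "m2": ["u1", "u2"],
    "m3": ["u2", "u3"],
}

def random_walk_counts(graph, start, walk_length, rng):
    """Walk `walk_length` steps from `start`, counting visits to movie nodes."""
    visits = Counter()
    current = start
    for _ in range(walk_length):
        neighbors = graph[current]
        if not neighbors:
            break
        current = rng.choice(neighbors)   # traverse to a random neighbor
        if current.startswith("m"):       # track visits to movie nodes only
            visits[current] += 1
    return visits

counts = random_walk_counts(toy_graph, "u1", walk_length=1000, rng=random.Random(0))
print(counts.most_common())  # movies ranked by visit frequency
```

Because the graph is bipartite, a walk that starts at a user lands on a movie at every other step, so 1000 steps produce 500 movie visits in total.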

**Real-World Applications**

Algorithms of this kind run in production at large scale: Pixie itself was built at Pinterest to serve recommendations in real time. Similar graph-based methods suit streaming platforms and online retailers that need personalized suggestions from complex user-item interactions. Because a walk can reach items several hops away from the starting node, graph-based methods can uncover connections in sparse datasets, which makes them a useful alternative to matrix factorization when user data is limited.

**Advantages and Challenges**

The primary advantage of Pixie-inspired random walks is that they naturally incorporate both direct and indirect relationships among items. This allows the algorithm to explore beyond the immediate neighborhood of a user’s interactions and capture deeper patterns. However, challenges such as computational complexity and the need for careful tuning of parameters (like the number of walk iterations and restart probabilities) must be addressed to achieve optimal performance.
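
To show what a restart probability does, here is a small sketch of a walk that occasionally jumps back to its starting node. The chain-shaped toy graph, the parameter name `restart_prob`, and the helper `walk_with_restart` are assumptions for illustration; a restart is counted as one step of the walk.

```python
import random
from collections import Counter

# Toy chain: u1 - m1 - u2 - m2 - u3, so m2 is farther from u1 than m1.
chain_graph = {
    "u1": ["m1"],
    "m1": ["u1", "u2"],
    "u2": ["m1", "m2"],
    "m2": ["u2", "u3"],
    "u3": ["m2"],
}

def walk_with_restart(graph, start, walk_length, restart_prob, rng):
    """Random walk that returns to `start` with probability `restart_prob` per step."""
    visits = Counter()
    current = start
    for _ in range(walk_length):
        if rng.random() < restart_prob:
            current = start               # restart: jump back to the origin
            continue
        current = rng.choice(graph[current])
        if current.startswith("m"):       # toy convention: movies start with "m"
            visits[current] += 1
    return visits

no_restart = walk_with_restart(chain_graph, "u1", 1000, 0.0, random.Random(0))
frequent_restart = walk_with_restart(chain_graph, "u1", 1000, 0.5, random.Random(0))
print(no_restart.most_common())
print(frequent_restart.most_common())
```

A higher restart probability should keep the walk closer to the start node, which trades exploration for relevance; the right value is something to tune per dataset.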

---

### **Implement Weighted Random Walks**

#### **Random Walk-Based Movie Recommendation System (Weighted Pixie)**

#### **Objective**
In this task, you will implement a **random-walk-based recommendation algorithm** using the **Weighted Pixie** method. This technique uses a **user-movie bipartite graph** to recommend movies by simulating a random walk from a given user or movie.

#### **Step 1: Import Required Libraries**
Make sure you have the necessary libraries:

```python
import random  # For random walks
import pandas as pd  # For handling data
```

#### **Step 2: Implement the Random Walk Algorithm**
Your task is to **simulate a random walk** from a given starting point in the **bipartite user-movie graph**.

##### **Hints for Implementation**
- Start from **either a user or a movie**.
- At each step, **randomly move** to a connected node.
- Keep track of **how many times each movie is visited**.
- After completing the walk, **rank movies by visit count**.
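
One possible way to make the walk weighted is to store a weight on each edge and pick the next node in proportion to it. The sketch below assumes an adjacency structure of the form `{node: {neighbor: weight}}` with raw ratings (1 to 5) as weights; the graph, the `"m"` prefix for movies, and the helper `weighted_walk_counts` are hypothetical. Note that `random.choices` assumes non-negative weights, so mean-centered ratings would need to be shifted or rescaled before being used this way.

```python
import random
from collections import Counter

# Assumed structure: node -> {neighbor: weight}; weights are raw ratings.
toy_weighted_graph = {
    "u1": {"m1": 5.0, "m2": 1.0},
    "u2": {"m1": 4.0, "m2": 2.0},
    "m1": {"u1": 5.0, "u2": 4.0},
    "m2": {"u1": 1.0, "u2": 2.0},
}

def weighted_walk_counts(graph, start, walk_length, rng):
    """Random walk where each step picks a neighbor in proportion to edge weight."""
    visits = Counter()
    current = start
    for _ in range(walk_length):
        neighbors = list(graph[current])
        weights = list(graph[current].values())
        if not neighbors:
            break
        current = rng.choices(neighbors, weights=weights, k=1)[0]
        if current.startswith("m"):   # toy convention: movie nodes start with "m"
            visits[current] += 1
    return visits

counts = weighted_walk_counts(
    toy_weighted_graph, "u1", walk_length=2000, rng=random.Random(42)
)
print(counts.most_common())
```

With these weights, highly rated edges are followed more often, so a well-liked movie should collect more visits than a poorly rated one.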

#### **Step 3: Implement User-Based Recommendation**
**Hints:**
- Check if the `user_id` exists in the `graph`.
- Start a loop that runs for `walk_length` steps.
- Randomly pick a **connected node** (user or movie).
- Track how many times each **movie** is visited.
- Sort movies by visit frequency and return the **top N**.

#### **Step 4: Implement Movie-Based Recommendation**
**Hints:**
- Find the `movie_id` corresponding to the given `movie_name`.
- Ensure the movie exists in the `graph`.
- Start a random walk from that movie.
- Follow the same **tracking and ranking** process as the user-based version.

**Note:**  
**Your task:** Implement a function `weighted_pixie_recommend(user_id, walk_length=15, num=5)` or `weighted_pixie_recommend(movie_name, walk_length=15, num=5)`.  
**Implement either Step 3 or Step 4.**

#### **Step 5: Running Your Recommendation System**
Once your function is implemented, test it by calling:

##### **Example: User-Based Recommendation**
```python
weighted_pixie_recommend(1, walk_length=15, num=5)
```
| Ranking | Movie Name                     |
|---------|--------------------------------|
| 1       | My Own Private Idaho (1991)   |
| 2       | Aladdin (1992)                |
| 3       | 12 Angry Men (1957)           |
| 4       | Happy Gilmore (1996)          |
| 5       | Copycat (1995)                |


##### **Example: Movie-Based Recommendation**
```python
weighted_pixie_recommend("Jurassic Park (1993)", walk_length=10, num=5)
```
| Ranking | Movie Name                           |
|---------|-------------------------------------|
| 1       | Rear Window (1954)                 |
| 2       | Great Dictator, The (1940)         |
| 3       | Field of Dreams (1989)             |
| 4       | Casablanca (1942)                  |
| 5       | Nightmare Before Christmas, The (1993) |


#### **Step 6: Understanding the Results**
Your function should return a **DataFrame** structured as follows:

| Ranking | Movie Name |
|---------|-----------|
| 1       | Movie A   |
| 2       | Movie B   |
| 3       | Movie C   |
| 4       | Movie D   |
| 5       | Movie E   |

Each movie is ranked based on **how frequently it was visited** during the walk.
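
As a sketch of how visit counts could be turned into that structure, the helper below sorts a `{movie_id: visit_count}` dictionary and looks up titles. The names `visits_to_dataframe`, `toy_visits`, and `toy_titles` are hypothetical; ties are broken by the smaller movie ID so that the output is reproducible.

```python
import pandas as pd

def visits_to_dataframe(visit_counts, titles, num=5):
    """Turn {movie_id: visit_count} into a ranked DataFrame of movie names."""
    # Sort by visit count (descending), then by movie ID (ascending) for ties.
    ranked = sorted(visit_counts.items(), key=lambda kv: (-kv[1], kv[0]))[:num]
    return pd.DataFrame({
        "Ranking": list(range(1, len(ranked) + 1)),
        "Movie Name": [titles[movie_id] for movie_id, _ in ranked],
    })

toy_visits = {10: 3, 20: 7, 30: 3}
toy_titles = {10: "Movie A", 20: "Movie B", 30: "Movie C"}
print(visits_to_dataframe(toy_visits, toy_titles, num=2))
```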

#### **Experiment with Different Parameters**
- Try different **`walk_length`** values and observe how the recommendations change.
- Adjust the number of recommended movies (`num`).

In [99]:
movies

Unnamed: 0,movie_id,title,release_date
0,1,Toy Story (1995),01-Jan-1995
1,2,GoldenEye (1995),01-Jan-1995
2,3,Four Rooms (1995),01-Jan-1995
3,4,Get Shorty (1995),01-Jan-1995
4,5,Copycat (1995),01-Jan-1995
...,...,...,...
1677,1678,Mat' i syn (1997),06-Feb-1998
1678,1679,B. Monkey (1998),06-Feb-1998
1679,1680,Sliding Doors (1998),01-Jan-1998
1680,1681,You So Crazy (1994),01-Jan-1994


In [None]:
import random
import pandas as pd

def weighted_pixie_recommend(input_item, walk_length=15, num=5):
    """
    Recommends movies using random walks on a user-movie bipartite graph.
    
    Args:
        input_item: Either a user_id (int) or movie_name (str)
        walk_length: Number of steps in the random walk
        num: Number of movies to recommend
        
    Returns:
        DataFrame with top recommended movies
    """
    # Determine if input is a user_id (int) or movie_name (str)
    if isinstance(input_item, int):
        # User-based recommendation
        user_id = input_item
        if user_id not in graph:
            raise ValueError(f"User ID {user_id} not found in the graph")
        current_node = user_id
    else:
        # Movie-based recommendation
        movie_name = input_item
        # Find movie_id for the given movie_name (first exact title match)
        matches = movies.loc[movies['title'] == movie_name, 'movie_id']
        movie_id = int(matches.iloc[0]) if not matches.empty else None
        
        if movie_id is None:
            raise ValueError(f"Movie '{movie_name}' not found in the dataset")
        if movie_id not in graph:
            raise ValueError(f"Movie ID {movie_id} not found in the graph")
        current_node = movie_id
    
    # Initialize counter for visited movies
    movie_visits = {}
    
    # Perform random walk
    for _ in range(walk_length):
        # Get connections for current node (which is a set in this implementation)
        connections = graph[current_node]
        if not connections:
            break
        
        # Choose next node randomly (unweighted since we're using sets, not dicts)
        next_node = random.choice(list(connections))
        
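        # Count the visit if the ID appears in the movie table. Caveat: user IDs
        # and movie IDs share the same integer range, so this check cannot
        # separate a user node from a movie node that has the same number.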
        if isinstance(next_node, int) and next_node in movies['movie_id'].values:
            movie_visits[next_node] = movie_visits.get(next_node, 0) + 1
        
        # Move to the next node
        current_node = next_node
    
    # Get top visited movies
    top_movies = sorted(movie_visits.items(), key=lambda x: x[1], reverse=True)[:num]
    
    # Create results DataFrame
    result = []
    for i, (movie_id, visits) in enumerate(top_movies, 1):
        # Get the movie name from movies
        movie_name = movies[movies['movie_id'] == movie_id]['title'].values[0]
        result.append({
            'Ranking': i,
            'Movie Name': movie_name
        })
    
    return pd.DataFrame(result)


User-based recommendations:
   Ranking                                         Movie Name
0        1                            Army of Darkness (1993)
1        2  Adventures of Priscilla, Queen of the Desert, ...
2        3                                    Maverick (1994)
3        4                                 Chasing Amy (1997)
4        5                              Cable Guy, The (1996)

Movie-based recommendations:
   Ranking                         Movie Name
0        1  Wes Craven's New Nightmare (1994)
1        2                  Unforgiven (1992)
2        3                    Power 98 (1995)
3        4                      Grease (1978)
4        5                Shining, The (1980)


In [102]:
# User-based recommendation
user_recommendations = weighted_pixie_recommend(1, walk_length=15, num=5)
print("User-based recommendations:")
print(user_recommendations)

# Movie-based recommendation
movie_recommendations = weighted_pixie_recommend("Toy Story (1995)", walk_length=10, num=5)
print("\nMovie-based recommendations:")
print(movie_recommendations)

User-based recommendations:
   Ranking                         Movie Name
0        1  City of Lost Children, The (1995)
1        2          African Queen, The (1951)
2        3          Angels and Insects (1995)
3        4        English Patient, The (1996)
4        5                Lost Highway (1997)

Movie-based recommendations:
   Ranking                         Movie Name
0        1  City of Lost Children, The (1995)
1        2                    Heathers (1989)
2        3              Fatal Instinct (1993)
3        4              Cool Hand Luke (1967)
4        5                White Squall (1996)


---

### Insight

**Random Walk-Based Recommendations**: This method generates more diverse recommendations by exploring the graph structure. The recommendations for user 1 included "City of Lost Children" and "African Queen", reflecting how the algorithm can reach content through indirect connections. Because each result comes from a single short walk (10 to 15 steps), the output changes from run to run, as the two sets of results above show.

Overall, the graph-based approach appears particularly effective at generating serendipitous recommendations that might not be obvious through direct similarity metrics, potentially addressing the "filter bubble" problem common in recommendation systems.

---
---

## **Submission Requirements:**

To successfully complete this assignment, ensure that you submit the following:


### **1. Jupyter Notebook Submission**
- Submit a **fully completed Jupyter Notebook** that includes:
  - **All implemented recommendation functions** (user-based, item-based, and random walk-based recommendations).
  - **Code explanations** in markdown cells to describe each step.
  - **Results and insights** from running your recommendation models.


### **2. Explanation of Pixie-Inspired Algorithms (3-5 Paragraphs)**
- Write a **detailed explanation** of **Pixie-inspired random walk algorithms** used for recommendations.
- Your explanation should cover:
  - What **Pixie-inspired recommendation systems** are.
  - How **random walks** help in identifying relevant recommendations.
  - Any real-world applications of such algorithms in industry.


### **3. Report for the Submitted Notebook**
Your report should be structured as follows:

#### **Title: Movie Recommendation System Report**

#### **1. Introduction**
- Briefly introduce **movie recommendation systems** and why they are important.
- Explain the **different approaches used** (user-based, item-based, random-walk).

#### **2. Dataset Description**
- Describe the **MovieLens 100K dataset**:
  - Number of users, movies, and ratings.
  - What features were used.
  - Any preprocessing performed.

#### **3. Methodology**
- Explain the three recommendation techniques implemented:
  - **User-based collaborative filtering** (how user similarity was calculated).
  - **Item-based collaborative filtering** (how item similarity was determined).
  - **Random-walk-based Pixie algorithm** (why graph-based approaches are effective).
  
#### **4. Implementation Details**
- Discuss the steps taken to build the functions.
- Describe how the **adjacency list graph** was created.
- Explain how **random walks** were performed and how visited movies were ranked.

#### **5. Results and Evaluation**
- Present **example outputs** from each recommendation approach.
- Compare the different methods in terms of accuracy and usefulness.
- Discuss any **limitations** in the implementation.

#### **6. Conclusion**
- Summarize the key takeaways from the project.
- Discuss potential improvements (e.g., **hybrid models, additional features**).
- Suggest real-world applications of the methods used.

### **Submission Instructions**

- Submit a `.zip` file containing the Jupyter Notebook, all provided data files, and the saved CSV files (`users.csv`, `movies.csv`, and `ratings.csv`). Also include the Report and the Pixie Algorithm explanation document.
- [`Bonus 10 Points`] **Upload your Jupyter Notebook, Explanation Document, and Report** to your GitHub repository.
- Ensure the repository is public and contains:
  - `users.csv`, `movies.csv` and `ratings.csv` [These are the DataFrames created in Part 1. Save and export them as `.csv` files]
  - `Movie_Recommendation.ipynb`
  - `Pixie_Algorithm_Explanation.pdf` or `.md`
  - `Recommendation_Report.pdf` or `.md`
- **Submit the GitHub repository link in the cell below.**


#### **Example Submission Format**
```text
GitHub Repository: https://github.com/username/Movie-Recommendation
```

In [None]:
# Submit the GitHub link here:
# https://github.com/gcnabingc/movie_recommendation_sys.git

### **Grading Rubric: ITCS 6162 - Data Mining Assignment**


| **Category**                              | **Criteria**                                                     | **Points** |
|-------------------------------------------|----------------------------------------------------------------|------------|
| **Part 1: Exploring and Cleaning Data (15 pts)**  | Properly loads `u.data`, `u.item`, and `u.user` datasets into DataFrames | 5 |
|                                           | Handles missing values, duplicates, and inconsistencies appropriately | 5 |
|                                           | Saves the cleaned datasets into CSV files: `users.csv`, `movies.csv`, `ratings.csv` | 5 |
| **Part 2: Collaborative Filtering-Based Recommendation (30 pts)** | Implements user-based collaborative filtering correctly | 10 |
|                                           | Implements item-based collaborative filtering correctly | 10 |
|                                           | Computes similarity measures accurately and provides valid recommendations | 10 |
| **Part 3: Graph-Based Recommender (Pixie-Inspired Algorithm) (35 pts)** | Constructs adjacency lists properly from user-movie interactions | 10 |
|                                           | Implements weighted random walk-based recommendation correctly | 15 |
|                                           | Explains and justifies the algorithm design choices (Pixie-inspired) | 10 |
| **Code Quality & Documentation (10 pts)** | Code is well-structured, efficient, and follows best practices | 5 |
|                                           | Markdown explanations and comments are clear and enhance understanding | 5 |
| **Results & Interpretation (5 pts)**      | Provides meaningful insights from the recommendation system's output | 5 |
| **Submission & Report (5 pts)**          | Submits all required files in the correct format (ZIP file with Jupyter notebook, processed CSV files, and project report) | 5 |
| **Total**                                 |                              | 100 |

#### **Bonus (10 pts)**
| **Category**                              | **Criteria**                                                     | **Points** |
|-------------------------------------------|----------------------------------------------------------------|------------|
| **GitHub Submission**                     | Provides a well-documented GitHub repository with CSV files, a structured README, and a properly formatted Jupyter Notebook | 10 |