# Course: Introduction To GenAI

*Notebook: missing_values.ipynb*

<a href="https://colab.research.google.com/github/gassaf2/RecommenderSystems/blob/main/week1/missing_values.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Objective:
The goal of this exercise is to predict the missing ratings in the rating.csv dataset by iteratively imputing them with a regression model such as a decision tree or a neural network.

# Dataset Details:
The dataset ([download here](https://www.kaggle.com/datasets/CooperUnion/anime-recommendations-database)) contains the following columns:

- user_id: A randomly generated, non-identifiable user ID.
- anime_id: The ID of the anime rated by the user.
- rating: The rating the user assigned to the anime (1 to 10), or -1 if the user watched the anime but didn’t provide a rating.

# Steps for the Exercise:

# 1. Data Preparation:
Load and Explore the Data:

Load the dataset into a Pandas DataFrame.
Explore the data to understand the structure, missing values (rating = -1), and general distribution.

In [67]:
import pandas as pd
data = pd.read_csv('rating.csv')
print(data.head())
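# A rating of -1 marks "watched but not rated"; also catch true NaN values explicitly
missing_data = data[(data['rating'] == -1) | data['rating'].isna()]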
print("Missing data:\n", missing_data)

   user_id  anime_id  rating
0        1        20      -1
1        1        24      -1
2        1        79      -1
3        1       226      -1
4        1       241      -1
Missing data:
          user_id  anime_id  rating
0              1        20      -1
1              1        24      -1
2              1        79      -1
3              1       226      -1
4              1       241      -1
...          ...       ...     ...
7813628    73515      2385      -1
7813629    73515      2386      -1
7813631    73515      2490      -1
7813635    73515      2680      -1
7813668    73515      5252      -1

[1476496 rows x 3 columns]
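The exploration step also asks for the general distribution of ratings, which the cell above does not print. A minimal sketch of how that could look, assuming a small hand-made DataFrame (`toy`, with hypothetical values) in place of the full rating.csv:

```python
import numpy as np
import pandas as pd

# Toy stand-in for rating.csv (hypothetical values, same three columns)
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3, 3],
    "anime_id": [20, 24, 20, 79, 24, 79],
    "rating":   [-1, 8, 7, -1, 10, 6],
})

# Replace the -1 placeholder with NaN so pandas treats it as missing
observed = toy["rating"].replace(-1, np.nan)

# Fraction of rows without a real rating
missing_share = observed.isna().mean()

# Count of each real rating value, ordered by rating
rating_counts = observed.value_counts().sort_index()

print(f"Share of missing ratings: {missing_share:.2%}")
print(rating_counts)
print(observed.describe())
```

On the real data, the output above shows 1,476,496 rows flagged as missing, which is roughly 19% of the 7,813,737 rows reported later in the notebook.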


# 2. Regression-Based Imputation (Iterative Approach):
Setting Up the Regression Model:

For simplicity, you can use a decision tree regression model to predict the missing values.
Create a feature matrix (X) and target vector (y), where X consists of user_id, anime_id, and any other useful features (like user averages), and y is the rating.

In [68]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Feature engineering: per-user and per-anime average ratings.
# Caveat: these means are computed over all rows, so the -1 placeholders are
# averaged in and pull the values down (user 1 ends up at about -0.64).
data['user_avg_rating'] = data.groupby('user_id')['rating'].transform('mean')
data['anime_avg_rating'] = data.groupby('anime_id')['rating'].transform('mean')

# Keep only rows with a real rating (drop the rows flagged as missing in step 1)
train_data = data.drop(missing_data.index)

# Features: user_id, anime_id, user_avg_rating, anime_avg_rating
X = train_data[['user_id', 'anime_id', 'user_avg_rating', 'anime_avg_rating']]
y = train_data['rating']

# Train-test split (to ensure good model performance evaluation)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [42]:
train_data

         user_id  anime_id  rating  user_avg_rating  anime_avg_rating
0              1        20      -1        -0.640523          6.641967
1              1        24      -1        -0.640523          6.791696
2              1        79      -1        -0.640523          6.099128
3              1       226      -1        -0.640523          6.804843
4              1       241      -1        -0.640523          5.426316
...          ...       ...     ...              ...               ...
7813732    73515     16512       7         7.841837          5.802797
7813733    73515     17187       9         7.841837          6.179376
7813734    73515     22145      10         7.841837          6.257336
7813735    73516       790       9         9.000000          6.988754
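A negative user_avg_rating such as -0.640523 cannot come from ratings in the 1 to 10 range alone, which suggests the -1 placeholders are being averaged in. One possible way to avoid that is to mask -1 as NaN before grouping. The sketch below assumes a toy DataFrame with hypothetical values; the `rating_obs` column and the global-mean fallback are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for rating.csv (hypothetical values)
toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2],
    "anime_id": [20, 24, 79, 20, 24],
    "rating":   [-1, -1, 8, 6, 10],
})

# Hypothetical helper column: observed ratings only, -1 becomes NaN
toy["rating_obs"] = toy["rating"].replace(-1, np.nan)

# The grouped mean skips NaN, so only real ratings are averaged
toy["user_avg_rating"] = toy.groupby("user_id")["rating_obs"].transform("mean")
toy["anime_avg_rating"] = toy.groupby("anime_id")["rating_obs"].transform("mean")

# Users or titles with no observed rating get NaN; fall back to the global mean
global_mean = toy["rating_obs"].mean()
avg_cols = ["user_avg_rating", "anime_avg_rating"]
toy[avg_cols] = toy[avg_cols].fillna(global_mean)

print(toy)
```

In this toy example, user 1 has one real rating of 8 and two placeholders. Averaging the raw column would give (-1 - 1 + 8) / 3 = 2, while the masked average is 8.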


# 3. Train a Regression Model (Decision Tree in this case):

In [43]:
# Train the Decision Tree Regressor model
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

# Evaluate model performance
score = model.score(X_test, y_test)
print(f"Model R-squared score: {score}")

Model R-squared score: 0.33275694336166584
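An R-squared value is easier to interpret next to a trivial baseline. The sketch below is an illustration on synthetic data (not the anime ratings): it compares scikit-learn's `DummyRegressor`, which always predicts the training mean, with a depth-limited decision tree. The `max_depth=4` setting is an assumption chosen for the example.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: the target depends linearly on the first feature
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 2))
y = 2.0 * X[:, 0] + rng.normal(0, 1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline that ignores the features and predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

# Depth-limited tree (max_depth is an illustrative choice)
tree = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_train, y_train)

baseline_r2 = baseline.score(X_test, y_test)
tree_r2 = tree.score(X_test, y_test)

print(f"Baseline R-squared: {baseline_r2:.3f}")
print(f"Tree R-squared:     {tree_r2:.3f}")
```

A constant prediction never scores above 0 on R-squared, so a score of about 0.33 means the model explains roughly a third of the variance that the mean-only baseline leaves unexplained.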


# 4. Impute Missing Values Using the Trained Model:

For each missing rating, predict it using the trained regression model.


In [44]:
# Now, predict ratings for the missing data (where rating == -1 or NaN)
missing_data = data[(data['rating'] == -1) | data['rating'].isna()]  # Treat both -1 and NaN as missing

# Predict the missing ratings
predicted_ratings = model.predict(missing_data[['user_id', 'anime_id', 'user_avg_rating', 'anime_avg_rating']])

# Update the DataFrame with the predicted ratings for missing values
data.loc[missing_data.index, 'rating'] = predicted_ratings

# Output the updated DataFrame
print("\nUpdated DataFrame with predicted ratings:\n", data)


Updated DataFrame with predicted ratings:
          user_id  anime_id  rating  user_avg_rating  anime_avg_rating
0              1        20      -1        -0.640523          6.641967
1              1        24      -1        -0.640523          6.791696
2              1        79      -1        -0.640523          6.099128
3              1       226      -1        -0.640523          6.804843
4              1       241      -1        -0.640523          5.426316
...          ...       ...     ...              ...               ...
7813732    73515     16512       7         7.841837          5.802797
7813733    73515     17187       9         7.841837          6.179376
7813734    73515     22145      10         7.841837          6.257336
7813735    73516       790       9         9.000000          6.988754
7813736    73516      8074       9         9.000000          6.375374

[7813737 rows x 5 columns]


# 5. Repeat the Process:

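After updating the ratings, you can retrain the model on the data that now includes the imputed values and predict again, with the aim of refining the imputed values. In this notebook the first pass already replaces every -1, so the loop at the end of this step finds nothing left to impute and stops at iteration 1.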

In [45]:
missing_data = data[(data['rating'] == -1) | data['rating'].isna()]  # Treat both -1 and NaN as missing
missing_data

         user_id  anime_id  rating  user_avg_rating  anime_avg_rating
0              1        20      -1        -0.640523          6.641967
1              1        24      -1        -0.640523          6.791696
2              1        79      -1        -0.640523          6.099128
3              1       226      -1        -0.640523          6.804843
4              1       241      -1        -0.640523          5.426316
...          ...       ...     ...              ...               ...
7813603    73515      1237      -1         7.841837          5.033461
7813604    73515      1238      -1         7.841837          5.173737
7813623    73515      2020      -1         7.841837          5.228471
7813628    73515      2385      -1         7.841837          5.332016


In [69]:
max_iterations = 5
for i in range(max_iterations):
    # Treat both -1 and NaN as missing
    missing_data = data[(data['rating'] == -1) | data['rating'].isna()]

    # Stop as soon as nothing is left to impute
    if missing_data.empty:
        print(f"no more missing value at the iteration {i}")
        break

    # Retrain on every row that currently has a rating
    train_data = data.drop(missing_data.index)
    X = train_data[['user_id', 'anime_id', 'user_avg_rating', 'anime_avg_rating']]
    y = train_data['rating']
    model.fit(X, y)

    # Predict the missing ratings and write them back
    pred_rating = model.predict(missing_data[['user_id', 'anime_id', 'user_avg_rating', 'anime_avg_rating']])
    data.loc[missing_data.index, 'rating'] = pred_rating

no more missing value at the iteration 1
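Because the loop looks for -1 values on every pass, it has nothing to do once the first pass has filled them in. One possible variant, sketched below under stated assumptions, records the originally missing rows once and re-predicts only those rows on each pass, recomputing the average features from the current ratings. The toy data, the mean initialisation, `max_depth=3`, and the `tolerance` threshold are all hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy stand-in for rating.csv (hypothetical values)
toy = pd.DataFrame({
    "user_id":  [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "anime_id": [20, 24, 79, 20, 24, 79, 20, 24, 79],
    "rating":   [8, -1, 7, 9, 6, -1, -1, 5, 6],
}).astype({"rating": float})

# Remember which rows were missing originally; this mask never changes
missing_mask = toy["rating"].eq(-1) | toy["rating"].isna()
features = ["user_id", "anime_id", "user_avg_rating", "anime_avg_rating"]

# Initial fill: mean of the observed ratings
toy.loc[missing_mask, "rating"] = toy.loc[~missing_mask, "rating"].mean()

model = DecisionTreeRegressor(max_depth=3, random_state=42)
max_iterations = 5
tolerance = 1e-3  # hypothetical stopping threshold

for i in range(max_iterations):
    previous = toy.loc[missing_mask, "rating"].to_numpy().copy()

    # Recompute the average features from the current (partly imputed) ratings
    toy["user_avg_rating"] = toy.groupby("user_id")["rating"].transform("mean")
    toy["anime_avg_rating"] = toy.groupby("anime_id")["rating"].transform("mean")

    # Refit on all rows, then re-predict only the originally missing ones
    model.fit(toy[features], toy["rating"])
    toy.loc[missing_mask, "rating"] = model.predict(toy.loc[missing_mask, features])

    change = np.abs(toy.loc[missing_mask, "rating"].to_numpy() - previous).max()
    print(f"iteration {i}: max change = {change:.4f}")
    if change < tolerance:
        break
```

The observed ratings are never overwritten; only the rows selected by `missing_mask` change between passes, and the loop stops early once the imputed values stabilise.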


In [64]:
missing_data

user_id  anime_id  rating  user_avg_rating  anime_avg_rating


In [62]:
# After completing the iterations, print the final updated values
print("\nFinal Updated DataFrame with Imputed Ratings:")
print(data[['user_id', 'anime_id', 'rating']].head())  # Print the top rows with updated ratings for review


Final Updated DataFrame with Imputed Ratings:
   user_id  anime_id  rating
0        1        20       5
1        1        24       4
2        1        79       5
3        1       226       4
4        1       241      10


# 6. Evaluation:
Measure Accuracy:

The true values of the imputed ratings are unknown, so the model is evaluated on held-out ratings that were actually observed (X_test, y_test).
Use root mean squared error (RMSE) or mean absolute error (MAE) to measure prediction accuracy.

In [63]:
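from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# RMSE on the held-out split from step 2. The loop in step 5 refits `model` on
# rows that include X_test, so refit on X_train first for a clean estimate.
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"RMSE of model: {rmse}")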

RMSE of model: 3.2369718648648376
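One way to estimate imputation accuracy is to hide some known ratings, fit only on the rest, and score the predictions for the hidden rows with both RMSE and MAE. The sketch below assumes synthetic ratings (a per-user bias plus noise) and a depth-limited tree; none of these values come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic ratings (hypothetical): 20 users, 30 titles, per-user bias plus noise
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "user_id": rng.integers(1, 21, size=n),
    "anime_id": rng.integers(1, 31, size=n),
})
user_bias = rng.normal(0, 1.5, size=21)
raw = 7 + user_bias[toy["user_id"].to_numpy()] + rng.normal(0, 1, size=n)
toy["rating"] = np.clip(np.round(raw), 1, 10)

features = ["user_id", "anime_id"]

# Hide 20% of the known ratings; the model never sees them during fitting
train, held_out = train_test_split(toy, test_size=0.2, random_state=42)

model = DecisionTreeRegressor(max_depth=5, random_state=42)
model.fit(train[features], train["rating"])

pred = model.predict(held_out[features])
rmse = np.sqrt(mean_squared_error(held_out["rating"], pred))
mae = mean_absolute_error(held_out["rating"], pred)

print(f"RMSE on hidden ratings: {rmse:.3f}")
print(f"MAE on hidden ratings:  {mae:.3f}")
```

RMSE penalises large errors more heavily than MAE, so a wide gap between the two points to a few badly mispredicted ratings.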
