***
## Implementation Option 2: Use `numpy`, `pandas`, and `sklearn`

### If you prefer more hands-on coding practice over the one-stop Surprise solution, you can implement the KNN model using `numpy`, `pandas`, and possibly `sklearn`:

<div style="background-color:#7FB3D5; font-size:2em; color:#34495E; padding:5px; border:5px solid #D7DBDD; text-align:center;">
<strong>A possible solution</strong>
</div>



In [1]:
from tqdm import tqdm # import the TQDM library
import time
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from math import sqrt
import random

import warnings
warnings.filterwarnings('ignore')

starttime = time.time()

# Load the data
rating_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/ratings.csv"
data_temp = pd.read_csv(rating_url)
data_temp['rating'] = data_temp['rating'].astype(float)

users = set(data_temp.user.tolist()) # --> len(users) will be 33901

# reduce the size: keep a random third of the users
random.seed(42)
# random.sample() needs a sequence (passing a set raises TypeError since Python 3.11)
user_sample = random.sample(list(users), len(users)//3) # -->len(user_sample) will be 11300

# data is a reduced/randomly filtered dataframe:
data = data_temp.loc[data_temp['user'].isin(user_sample)] # --> data.shape will be (77824, 3)
del data_temp
print(u'\u2713'+" data loaded")


# Create user-item interaction matrix
interaction_matrix = data.pivot_table(index='user', columns='item', values='rating')
interaction_matrix = interaction_matrix.fillna(0)
n_users, n_items = interaction_matrix.shape
print(u'\u2713'+" interaction matrix creation done")

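# Split the data into train and test sets
# (only test_data is used below; the interaction matrix above was built from all of data)
train_data, test_data = train_test_split(data, test_size=0.3, random_state=42)

# Calculate the user-user similarity matrix
# (not used below: NearestNeighbors computes the cosine distances itself)
similarity_matrix = cosine_similarity(interaction_matrix)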

# Find the k nearest neighbors for each user
k = 3
knn = NearestNeighbors(n_neighbors=k, metric='cosine')
knn.fit(interaction_matrix)
print(u'\u2713'+" knn.fit done")

# Calculate the estimated ratings for the test set
actual_ratings = []
estimated_ratings = []
for index, row in tqdm(test_data.iterrows(), total=len(test_data)): # use tqdm to create a progress bar    
    user = row['user']
    item = row['item']
    rating = row['rating']
    actual_ratings.append(rating)

    # Find the k nearest neighbors of the user; knn was fitted on the full
    # interaction matrix, so the user's own row is normally among them
    neighbors = knn.kneighbors([interaction_matrix.loc[user]], return_distance=False)[0]

    # Calculate the average rating for the item among the k nearest neighbors
    item_ratings = []
    for neighbor in neighbors:
        neighbor_rating = interaction_matrix.iloc[neighbor][item]
        if neighbor_rating > 0:
            item_ratings.append(neighbor_rating)
    if len(item_ratings) > 0:
        estimated_rating = np.mean(item_ratings)
    else:
        estimated_rating = 0
    estimated_ratings.append(estimated_rating)

print(u'\u2713'+" estimated ratings calculation done")


# Calculate the RMSE for the test set
rmse = sqrt(mean_squared_error(actual_ratings, estimated_ratings))
print(u'\u2713'+" execution done")
print(f"RMSE: {rmse}\n")

endtime = time.time()
execution_time = time.strftime("%H:%M:%S", time.gmtime(endtime-starttime))
print("Time required to run the code completely (in HH:MM:SS): ", execution_time)


✓ data loaded
✓ interaction matrix creation done
✓ knn.fit done


100%|████████████████████████████████████████████████████████████████████████████| 23348/23348 [05:47<00:00, 67.10it/s]

✓ estimated ratings calculation done
✓ execution done
RMSE: 0.03854829079433337

Time required to run the code completely (in HH:MM:SS):  00:07:08
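
One thing to keep in mind when reading the RMSE above: `interaction_matrix` is built from `data` before the split, and each user is normally its own nearest neighbour, so the test ratings are already visible when they are "predicted". Below is a minimal sketch of a held-out evaluation. The toy ratings, the helper name `predict_rating`, and the global-mean fallback are assumptions made for illustration; they are not part of the run shown above.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy ratings (hypothetical data, only to illustrate the evaluation protocol)
ratings = pd.DataFrame({
    "user":   [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "item":   ["a", "b", "c", "a", "b", "c", "a", "b", "d", "b", "c", "d"],
    "rating": [5.0, 4.0, 3.0, 5.0, 4.0, 2.0, 4.0, 5.0, 3.0, 2.0, 3.0, 5.0],
})

# Hold out two rows as the test set: (user 1, item c) and (user 3, item d)
test_idx = [2, 8]
test = ratings.loc[test_idx]
train = ratings.drop(index=test_idx)

# Build the interaction matrix from the TRAINING rows only
train_matrix = train.pivot_table(index="user", columns="item", values="rating").fillna(0)
global_mean = train["rating"].mean()  # fallback when no neighbour rated the item (assumption)

k = 2
# Ask for k + 1 neighbours because each user is (normally) its own nearest neighbour
knn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(train_matrix.values)

def predict_rating(user, item):
    """Mean rating of the item among the user's k nearest other users."""
    if user not in train_matrix.index or item not in train_matrix.columns:
        return global_mean
    row = train_matrix.index.get_loc(user)
    col = train_matrix.columns.get_loc(item)
    found = knn.kneighbors(train_matrix.values[[row]], return_distance=False)[0]
    others = [n for n in found if n != row][:k]  # drop the user itself
    vals = train_matrix.values[others, col]
    vals = vals[vals > 0]                        # 0 means "not rated"
    return vals.mean() if len(vals) else global_mean

preds = [predict_rating(u, i) for u, i in zip(test["user"], test["item"])]
rmse = np.sqrt(np.mean((test["rating"].values - np.array(preds)) ** 2))
print(f"held-out RMSE on the toy data: {rmse:.4f}")
```

Because the matrix is built only from `train`, the held-out cells are zero when the neighbours are looked up, so the prediction cannot simply read back the true rating.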





<font size=5 color=#FAD7A0>Summary</font>

<font size=4 color=#FAD7A0>Although <code>KNN</code> is a fairly straightforward algorithm, it is relatively memory-intensive.<br>
→ As you saw above, I kept only a random third of the users and worked with their ratings.<br>
→ I also chose a relatively small number of neighbours (k=3).<br>
Both choices were made with memory in mind. Even so, the run took just over seven minutes.</font>
<br>
<br>
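
As a hedged illustration of the memory point: most entries of a user-item matrix are zeros, so a SciPy sparse matrix stores far less than the dense version, and `NearestNeighbors` with `metric="cosine"` accepts sparse input. The sketch below uses synthetic ratings (the sizes and variable names are assumptions, not the course dataset) and also looks up the neighbours of all users in a single `kneighbors` call rather than once per test row.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
n_users, n_items, n_ratings = 1000, 120, 5000  # synthetic sizes (assumption)

# Random (user, item, rating) triples; duplicate pairs simply overwrite each other
user_idx = rng.integers(0, n_users, size=n_ratings)
item_idx = rng.integers(0, n_items, size=n_ratings)
values = rng.integers(1, 6, size=n_ratings).astype(float)

dense = np.zeros((n_users, n_items))
dense[user_idx, item_idx] = values
sparse = csr_matrix(dense)  # stores only the non-zero entries

dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")

k = 3
knn = NearestNeighbors(n_neighbors=k, metric="cosine", algorithm="brute").fit(sparse)

# One batched query for every user; each row holds the indices of that user's
# k nearest rows (the user itself is typically one of them)
all_neighbors = knn.kneighbors(sparse, return_distance=False)
print(all_neighbors.shape)
```

With the neighbour indices precomputed, the per-row loop only needs array lookups, which is usually where most of the running time goes.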

><font size=4 color=#FAD7A0>Please note: you can likely improve the accuracy by using the entire dataset and by tuning the number of neighbours `k`. There are methods for finding the best `k`, but I did not implement them here because memory was my priority.<br>
Keep in mind the following: if sufficient resources are available, the better procedure would be <code style="color:#34495E; background-color:#2ECC71;">to run the algorithm on the entire dataframe with different neighbour numbers</code> and then <code style="color:#34495E; background-color:#2ECC71;">select the most accurate one</code>.
    
</font>
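
The selection procedure described in the note could look roughly like the sketch below: evaluate a few candidate values of `k` on held-out ratings and keep the one with the lowest RMSE. The synthetic ratings, the candidate list, and the helper `rmse_for_k` are assumptions for illustration. Since the data is random, the chosen `k` carries no meaning here; only the procedure does.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Synthetic ratings: 60 users x 15 items, each user rates about half of the items
rows = [(u, i, float(rng.integers(1, 6)))
        for u in range(60) for i in range(15) if rng.random() < 0.5]
ratings = pd.DataFrame(rows, columns=["user", "item", "rating"])

# Hold out validation ratings and build the matrix from the training part only
train, valid = train_test_split(ratings, test_size=0.3, random_state=42)
matrix = train.pivot_table(index="user", columns="item", values="rating").fillna(0)
fallback = train["rating"].mean()  # used when no neighbour rated the item

def rmse_for_k(k):
    """Validation RMSE of a user-based KNN mean predictor with k neighbours."""
    knn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(matrix.values)
    neighbours = knn.kneighbors(matrix.values, return_distance=False)
    errors = []
    for user, item, actual in valid.itertuples(index=False):
        pred = fallback
        if user in matrix.index and item in matrix.columns:
            row = matrix.index.get_loc(user)
            col = matrix.columns.get_loc(item)
            others = [n for n in neighbours[row] if n != row][:k]  # skip the user itself
            vals = matrix.values[others, col]
            vals = vals[vals > 0]
            if len(vals):
                pred = vals.mean()
        errors.append((actual - pred) ** 2)
    return float(np.sqrt(np.mean(errors)))

candidates = [1, 3, 5, 10]
scores = {k: rmse_for_k(k) for k in candidates}
best_k = min(scores, key=scores.get)
print(scores, "-> best k:", best_k)
```

On the real dataset the same loop would run over the full interaction matrix, which is where the memory and running-time cost mentioned above comes in.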

