# Assignment: Neighborhood Collaborative Filtering Models (User- and Item-Based CF)

**Course:** AIE425 Intelligent Recommender Systems  
**Assignment #1:** Neighborhood CF Models (User-, Item-Based CF)  
**Submission Date:** Week 6 Lab (Tuesday, November 5, 2024)  

---

## Overview

This assignment involves building a neighborhood-based collaborative filtering (CF) model for a recommender system using user- and item-based CF methods. The project covers the following core steps:

1. **Data Collection**: Scraping quotes, authors, and tags from the `quotes.toscrape.com` website as the source dataset.
2. **Data Preprocessing**: Cleaning the scraped data and extracting the unique tags, which serve as the items (genres or themes of quotes) in the user-item matrix.
3. **Data Simulation**: Generating a simulated user-item matrix where each cell represents a hypothetical user rating (from 1 to 5) for a tag. This matrix simulates user preferences for different themes.
4. **Collaborative Filtering**:
   - Implementing both **User-User** and **Item-Item Collaborative Filtering**.
   - Calculating similarity using **Cosine Similarity** and **Pearson Correlation**.
5. **Prediction & Recommendation**:
   - Predicting ratings for each user-item pair.
   - Generating Top-N recommendations for each user based on predicted ratings.

---

## Objectives

- **Implement Neighborhood-Based Collaborative Filtering**: Use user-based and item-based CF to identify similar users/items and provide recommendations based on these relationships.
- **Understand Similarity Metrics**: Compare Cosine Similarity and Pearson Correlation and understand their impact on recommendations.
- **Simulate a User-Item Matrix**: Generate user ratings to facilitate collaborative filtering without real user data.
- **Provide Top-N Recommendations**: Use predicted ratings to recommend the most relevant tags (genres) for each user.
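
As a minimal sketch of how the two metrics differ, the snippet below compares them on two made-up rating vectors. The vectors `a` and `b` and the helper names `cosine_sim` and `pearson_sim` are illustrative assumptions, not part of the assignment code. It relies on the standard identity that Pearson correlation equals the cosine similarity of the mean-centered vectors.

```python
import numpy as np

def cosine_sim(x, y):
    """Cosine of the angle between two rating vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson_sim(x, y):
    """Pearson correlation: cosine similarity after mean-centering each vector."""
    return cosine_sim(x - x.mean(), y - y.mean())

# Two hypothetical users rating the same five tags on a 1-5 scale
a = np.array([5.0, 4.0, 1.0, 2.0, 3.0])
b = np.array([1.0, 2.0, 5.0, 4.0, 3.0])

print("cosine :", round(cosine_sim(a, b), 3))
print("pearson:", round(pearson_sim(a, b), 3))
```

The two users have opposite tastes, yet cosine similarity is about 0.64 because all ratings are positive, while Pearson correlation is -1.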

---

## Methodology

1. **Data Collection**: Data was collected using web scraping from `quotes.toscrape.com`, capturing quotes, their authors, and associated tags.
2. **Data Preparation**:
   - Extracted unique tags (themes) to use as item features in the matrix.
   - Created a **User-Item Matrix** with simulated user ratings for each tag, representing user preferences.
3. **Collaborative Filtering Algorithms**:
   - Calculated **User-User and Item-Item Cosine Similarity** to find similar users and items based on tag ratings.
   - Calculated **User-User and Item-Item Pearson Correlation** as an alternative similarity metric.
4. **Rating Prediction**:
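   - Used the similarity matrices to predict a rating for every user-tag pair. The simulated matrix is dense, so no tag is actually unrated; predictions are computed for all cells.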
   - Calculated average ratings per tag.
5. **Top-N Recommendations**: Extracted the top-5 recommendations for each user based on the predicted ratings.
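
Read from the `predict_ratings` function in the notebook, the prediction step can be written as below. The symbols are introduced here for illustration: $r_{vi}$ is the rating of user $v$ for tag $i$, $\bar{r}_u$ is the mean rating of user $u$, $\bar{r}_i$ is the mean rating of tag $i$, and $s$ is the chosen similarity. The sums run over all users (or all tags), including the target itself, because the code does not restrict them to a set of nearest neighbors.

$$
\hat{r}_{ui}^{\text{user}} = \bar{r}_u + \frac{\sum_{v} s_{uv}\,(r_{vi} - \bar{r}_v)}{\sum_{v} \lvert s_{uv} \rvert},
\qquad
\hat{r}_{ui}^{\text{item}} = \bar{r}_i + \frac{\sum_{j} s_{ij}\,(r_{uj} - \bar{r}_j)}{\sum_{j} \lvert s_{ij} \rvert}
$$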

---

## Code Structure

1. **Data Collection & Preprocessing**: Collect data from the website, clean it, and extract unique tags.
2. **Simulated Data Creation**: Generate a User-Item Matrix with random ratings for unique tags.
3. **Similarity Calculations**:
   - **User-User and Item-Item Cosine Similarity**: Measure similarity between users and items based on tag ratings.
   - **User-User and Item-Item Pearson Correlation**: Alternative similarity measure.
4. **Rating Prediction & Recommendation**:
   - Predict ratings for user-item pairs based on similarity matrices.
   - Generate Top-N recommendations for each user.

---

## Key Libraries

- `requests` and `BeautifulSoup`: For web scraping.
- `pandas` and `numpy`: For data handling and matrix manipulation.
- `sklearn.metrics.pairwise`: For calculating cosine similarity.

---

## Deliverables

- **Notebook with Code and Outputs**: Documenting each step with code, comments, and results.
- **Report**: A written report summarizing the methodology, results, and conclusions.
- **GitHub Submission**: The complete assignment, including code, dataset, and plagiarism report, submitted to GitHub.

---

## Results Summary

The results include:
- **Unique Tags**: Extracted themes/genres of quotes.
- **User-Item Matrix**: Simulated ratings matrix showing user preferences for each tag.
- **Cosine and Pearson Similarity Matrices**: Showcasing user-user and item-item similarities.
- **Predicted Ratings**: Using collaborative filtering to predict ratings for each user-tag combination.
- **Top-N Recommendations**: List of top-5 recommendations for each user based on predicted ratings.

---

## Conclusion

This project provides hands-on experience with neighborhood-based collaborative filtering, using both cosine similarity and Pearson correlation to measure user-user and item-item relationships. Simulated data made it possible to run the full pipeline, from similarity computation to Top-5 recommendations, without real user ratings. Because the ratings are random, the two metrics behave very differently on the same matrix: cosine similarities cluster around 0.8, since all ratings are positive, while Pearson correlations scatter around 0 once each vector is mean-centered. That contrast shows how the choice of similarity metric shapes neighborhood-based predictions in practice.

---

## References

1. BeautifulSoup Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
2. Scikit-learn Cosine Similarity Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
3. Quotes to Scrape: http://quotes.toscrape.com
4. OpenAI's ChatGPT (for assistance in generating, formatting, and structuring content)


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

# Function to scrape quotes from a given page URL
def scrape_quotes(page_url):
    try:
        # Send a GET request to the page URL
        response = requests.get(page_url, timeout=10)  # Time out instead of hanging on a stalled connection
        response.raise_for_status()  # Raise an error if the request was unsuccessful
        
        # Parse the page content with BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # List to store data for each quote on the page
        quotes_data = []
        
        # Find all quotes on the page
        for quote_item in soup.find_all('div', class_='quote'):
            # Extract the quote text
            quote_text = quote_item.find('span', class_='text').get_text()
            # Extract the author's name
            author = quote_item.find('small', class_='author').get_text()
            # Extract the tags associated with the quote
            tags = [tag.get_text() for tag in quote_item.find_all('a', class_='tag')]
            
            # Append the data as a dictionary to the quotes_data list
            quotes_data.append({
                'Quote': quote_text,
                'Author': author,
                'Tags': ", ".join(tags)  # Join tags into a single comma-separated string
            })
        
        # Return the list of quotes data for this page
        return quotes_data
    
    except requests.exceptions.RequestException as e:
        # Print an error message if there was a problem with the request
        print(f"Error loading {page_url}: {e}")
        return []

# Initialize an empty list to hold all quote data across pages
all_quotes = []

# Iterate through each page (1 to 10) to scrape quotes
for page_num in range(1, 11):
    # Define the URL for the current page
    page_url = f'http://quotes.toscrape.com/page/{page_num}/'
    # Scrape quotes from the current page
    page_quotes = scrape_quotes(page_url)
    # Add the scraped quotes to the all_quotes list
    all_quotes.extend(page_quotes)
    # Wait for 1 second between requests to avoid overloading the server
    time.sleep(1)

# Create a DataFrame from the collected quotes data
quotes_df = pd.DataFrame(all_quotes)

# Display the first few rows of the DataFrame to verify the data
quotes_df.head()


,Quote,Author,Tags
0,“The world as we have created it is a process ...,Albert Einstein,"change, deep-thoughts, thinking, world"
1,"“It is our choices, Harry, that show what we t...",J.K. Rowling,"abilities, choices"
2,“There are only two ways to live your life. On...,Albert Einstein,"inspirational, life, live, miracle, miracles"
3,"“The person, be it gentleman or lady, who has ...",Jane Austen,"aliteracy, books, classic, humor"
4,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe,"be-yourself, inspirational"


In [2]:
output_path = 'quotes_data.csv'
quotes_df.to_csv(output_path, index=False)
print(f"Data saved to '{output_path}'")


Data saved to 'quotes_data.csv'


In [3]:
# Extract unique tags from the 'Tags' column
unique_tags = sorted(list(set(tag for tags in quotes_df['Tags'] for tag in tags.split(", "))))

# Display the unique tags (genres)
print("Unique tags (genres):", unique_tags)


Unique tags (genres): ['', 'abilities', 'activism', 'adulthood', 'adventure', 'age', 'alcohol', 'aliteracy', 'apathy', 'attributed', 'attributed-no-source', 'authors', 'be-yourself', 'beatles', 'better-life-empathy', 'bilbo', 'books', 'change', 'children', 'chocolate', 'choices', 'christianity', 'classic', 'comedy', 'connection', 'contentment', 'courage', 'death', 'deep-thoughts', 'difficult', 'dreamers', 'dreaming', 'dreams', 'drug', 'dumbledore', 'edison', 'education', 'elizabeth-bennet', 'failure', 'fairy-tales', 'fairytales', 'faith', 'fantasy', 'fate', 'fear', 'food', 'friends', 'friendship', 'girls', 'god', 'good', 'growing-up', 'grown-ups', 'happiness', 'hate', 'heartbreak', 'hope', 'humor', 'imagination', 'indifference', 'insanity', 'inspiration', 'inspirational', 'integrity', 'jane-austen', 'journey', 'knowledge', 'lack-of-friendship', 'lack-of-love', 'learning', 'library', 'lies', 'life', 'literature', 'live', 'live-death-love', 'lost', 'love', 'lying', 'marriage', 'mind', 'm
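
The first entry in the output above is an empty string. It most likely comes from quotes that have no tags: their `Tags` cell is `''`, and `''.split(", ")` returns `['']`. Here is a minimal sketch of one way to drop it, shown on a small hypothetical DataFrame (`demo_df` is made up for illustration) rather than on the scraped data:

```python
import pandas as pd

# Hypothetical stand-in for quotes_df; the last row has no tags
demo_df = pd.DataFrame({"Tags": ["change, world", "books, humor", ""]})

# Keep only non-empty tags after stripping surrounding whitespace
unique_tags_clean = sorted({
    tag.strip()
    for tags in demo_df["Tags"]
    for tag in tags.split(",")
    if tag.strip()
})

print(unique_tags_clean)  # ['books', 'change', 'humor', 'world']
```

Applying the same filter to `quotes_df` would remove the blank column from the user-item matrix. The outputs shown in this notebook were produced without it.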

In [6]:
import numpy as np

# Define the number of users for the simulation
num_users = 50

# Step 1: Create a user-item matrix with random ratings from 1 to 5
# Each row represents a simulated user, and each column represents a tag (genre)
user_item_matrix = pd.DataFrame(
    np.random.randint(1, 6, size=(num_users, len(unique_tags))),
    columns=unique_tags
)

# Step 2: Display the first few rows of the user-item matrix to verify structure
print("User-Item Matrix (First 5 Rows):")
user_item_matrix.head()


User-Item Matrix (First 5 Rows):


,,abilities,activism,adulthood,adventure,age,alcohol,aliteracy,apathy,attributed,...,unhappy-marriage,value,wander,wisdom,women,world,write,writers,writing,yourself
0,5,3,2,4,3,2,3,3,2,5,...,2,2,1,3,4,2,1,5,1,2
1,3,1,5,1,4,5,4,1,4,5,...,1,2,3,1,5,4,2,1,5,1
2,5,1,1,2,1,4,4,4,1,1,...,2,2,3,4,5,2,5,3,2,3
3,5,3,2,1,2,1,1,3,3,1,...,3,3,5,4,1,2,4,2,5,2
4,4,3,3,2,1,3,2,1,5,2,...,2,5,2,1,1,2,5,1,2,4
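
Because `np.random.randint` is called without a seed, each run produces a different matrix, so the numbers shown in this notebook will not reproduce exactly. The sketch below is one possible reproducible variant using NumPy's `default_rng`. The function name `simulate_ratings`, the seed value 42, and the three example tags are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

def simulate_ratings(tags, num_users, seed):
    """Return a dense user-item matrix of integer ratings in [1, 5]."""
    rng = np.random.default_rng(seed)
    ratings = rng.integers(1, 6, size=(num_users, len(tags)))  # upper bound is exclusive
    return pd.DataFrame(ratings, columns=tags)

example_tags = ["books", "humor", "life"]  # illustrative subset of tags
m1 = simulate_ratings(example_tags, num_users=4, seed=42)
m2 = simulate_ratings(example_tags, num_users=4, seed=42)
print(m1.equals(m2))  # True: same seed, same matrix
```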


In [7]:
# Calculate the average rating for each tag (genre) in the user-item matrix
average_ratings = user_item_matrix.mean()

# Convert to DataFrame for better presentation and potential inclusion in the report
average_ratings_df = average_ratings.reset_index()
average_ratings_df.columns = ['Tag', 'Average Rating']

# Display the first few rows of the average ratings
print("Average Rating for Each Tag (Genre):")
average_ratings_df.head()

# Optional: Save the result if the assignment requires submission of this data
# average_ratings_df.to_csv('average_ratings.csv', index=False)


Average Rating for Each Tag (Genre):


,Tag,Average Rating
0,,2.96
1,abilities,3.16
2,activism,2.7
3,adulthood,2.92
4,adventure,3.1


In [8]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Calculate User-User Cosine Similarity
user_similarity_cosine = cosine_similarity(user_item_matrix)
user_similarity_df = pd.DataFrame(user_similarity_cosine, index=user_item_matrix.index, columns=user_item_matrix.index)
print("User-User Cosine Similarity:\n", user_similarity_df.head())

# Calculate Item-Item Cosine Similarity
item_similarity_cosine = cosine_similarity(user_item_matrix.T)
item_similarity_df = pd.DataFrame(item_similarity_cosine, index=user_item_matrix.columns, columns=user_item_matrix.columns)
print("\nItem-Item Cosine Similarity:\n", item_similarity_df.head())


User-User Cosine Similarity:
          0         1         2         3         4         5         6   \
0  1.000000  0.796988  0.842846  0.817112  0.776735  0.759911  0.817504   
1  0.796988  1.000000  0.791390  0.796414  0.792961  0.816991  0.812855   
2  0.842846  0.791390  1.000000  0.834447  0.788861  0.779259  0.779728   
3  0.817112  0.796414  0.834447  1.000000  0.816144  0.817426  0.823501   
4  0.776735  0.792961  0.788861  0.816144  1.000000  0.817428  0.811255   

         7         8         9   ...        40        41        42        43  \
0  0.797672  0.800781  0.774630  ...  0.812769  0.812221  0.857730  0.817886   
1  0.879837  0.830915  0.822447  ...  0.799173  0.771808  0.826126  0.777981   
2  0.806316  0.788246  0.791975  ...  0.790310  0.761137  0.817449  0.793331   
3  0.816131  0.808796  0.824192  ...  0.832625  0.827437  0.848318  0.817106   
4  0.844634  0.846432  0.864262  ...  0.789219  0.757776  0.838955  0.811660   

         44        45        46       

In [9]:
# Calculate User-User Pearson Correlation
user_similarity_pearson = user_item_matrix.T.corr()
print("\nUser-User Pearson Correlation:\n", user_similarity_pearson.head())

# Calculate Item-Item Pearson Correlation
item_similarity_pearson = user_item_matrix.corr()
print("\nItem-Item Pearson Correlation:\n", item_similarity_pearson.head())



User-User Pearson Correlation:
          0         1         2         3         4         5         6   \
0  1.000000 -0.023518  0.248056  0.040010 -0.109427 -0.196313  0.028073   
1 -0.023518  1.000000 -0.005186 -0.076755 -0.035963  0.081741 -0.004343   
2  0.248056 -0.005186  1.000000  0.172223 -0.003521 -0.051939 -0.118642   
3  0.040010 -0.076755  0.172223  1.000000  0.042592  0.046489  0.010494   
4 -0.109427 -0.035963 -0.003521  0.042592  1.000000  0.097034  0.002823   

         7         8         9   ...        40        41        42        43  \
0 -0.101157 -0.041498 -0.275563  ... -0.010849  0.052542  0.186020  0.071718   
1  0.342565  0.109619 -0.010981  ... -0.092883 -0.159546 -0.006249 -0.139855   
2 -0.002666 -0.056091 -0.117529  ... -0.078456 -0.152061  0.006454 -0.006223   
3 -0.054386 -0.050861 -0.049354  ...  0.048047  0.086721  0.079572  0.021430   
4  0.162684  0.203863  0.242870  ... -0.129225 -0.212992  0.085177  0.047317   

         44        45        46    

In [10]:
def predict_ratings(user_item_matrix, similarity_matrix, kind='user'):
    """Predict every user-item rating as a mean plus a similarity-weighted deviation.

    Returns a NumPy array of shape (num_users, num_items) for both modes.
    """
    ratings = np.asarray(user_item_matrix, dtype=float)
    similarity = np.asarray(similarity_matrix, dtype=float)
    # Normalizer: sum of absolute similarities per user (or per item)
    norm = np.abs(similarity).sum(axis=1)

    if kind == 'user':
        # Deviation of each rating from that user's mean rating
        mean_user_rating = ratings.mean(axis=1, keepdims=True)
        ratings_diff = ratings - mean_user_rating
        pred = mean_user_rating + similarity.dot(ratings_diff) / norm[:, None]
    elif kind == 'item':
        # Deviation of each rating from that item's mean rating
        mean_item_rating = ratings.mean(axis=0, keepdims=True)
        ratings_diff = ratings - mean_item_rating
        pred = mean_item_rating + ratings_diff.dot(similarity) / norm[None, :]
    else:
        raise ValueError(f"kind must be 'user' or 'item', got {kind!r}")

    return pred

# Predict ratings using user-user cosine similarity
user_based_prediction_cosine = predict_ratings(user_item_matrix, user_similarity_cosine, kind='user')
print("User-Based Rating Prediction (Cosine Similarity):\n", pd.DataFrame(user_based_prediction_cosine).head())

# Predict ratings using item-item cosine similarity
item_based_prediction_cosine = predict_ratings(user_item_matrix, item_similarity_cosine, kind='item')
print("\nItem-Based Rating Prediction (Cosine Similarity):\n", pd.DataFrame(item_based_prediction_cosine).head())


User-Based Rating Prediction (Cosine Similarity):
         0         1         2         3         4         5         6    \
0  2.989577  3.185224  2.710689  2.946319  3.118354  3.047401  2.731157   
1  2.890764  3.074172  2.641049  2.835131  3.039341  2.981232  2.637624   
2  2.734981  2.906979  2.440480  2.670044  2.839746  2.794045  2.462978   
3  2.823232  3.016299  2.545824  2.758347  2.940155  2.878447  2.538802   
4  2.880103  3.075706  2.618753  2.822743  3.005670  2.952320  2.609356   

        7         8         9    ...       128       129       130       131  \
0  2.606429  3.072323  3.207942  ...  2.957335  2.946835  2.923195  2.883371   
1  2.494864  2.995830  3.123874  ...  2.858881  2.858101  2.848255  2.785253   
2  2.343948  2.794342  2.917486  ...  2.692780  2.679715  2.666835  2.619936   
3  2.431257  2.906028  3.014994  ...  2.791798  2.784848  2.774072  2.719564   
4  2.481472  2.987806  3.087607  ...  2.848839  2.857321  2.826422  2.756510   

        132      
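
The prediction above weights every user, including the target user, whose self-similarity is 1.0. A common neighborhood variant restricts the sum to the k most similar other users. The sketch below is one possible way to do that, not the method used in this notebook; `predict_user_knn`, `k`, and the small matrices `R` and `S` are hypothetical and made up for illustration.

```python
import numpy as np

def predict_user_knn(ratings, similarity, k=2):
    """User-based CF restricted to each user's k most similar other users."""
    ratings = np.asarray(ratings, dtype=float)
    sim = np.array(similarity, dtype=float)      # copy so the caller's matrix is untouched
    np.fill_diagonal(sim, -np.inf)               # never pick a user as their own neighbor

    # Mark the k largest similarities in each row
    keep = np.argsort(-sim, axis=1)[:, :k]
    mask = np.zeros(sim.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    weights = np.where(mask, sim, 0.0)           # all non-neighbors get weight 0

    user_means = ratings.mean(axis=1, keepdims=True)
    centered = ratings - user_means
    denom = np.abs(weights).sum(axis=1, keepdims=True)
    denom[denom == 0] = 1.0                      # no usable neighbors: fall back to the user mean
    return user_means + weights @ centered / denom

# Tiny made-up example: 3 users x 3 tags and a hand-written similarity matrix
R = np.array([[5, 3, 1], [4, 3, 2], [1, 3, 5]])
S = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
pred = predict_user_knn(R, S, k=1)
print(pred)
```

With `k=1`, each user's prediction is their own mean plus the mean-centered ratings of their single nearest neighbor, so user 0 is predicted from user 1 alone.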

In [11]:
import numpy as np

def get_top_n_recommendations(predictions, n=5):
    top_n_recommendations = {}
    for user_id in range(predictions.shape[0]):
        user_ratings = predictions[user_id]
        
        # Sort indices of ratings in descending order
        sorted_indices = np.argsort(-user_ratings)
        
        # Select the top N items based on the highest predicted ratings
        top_n_recommendations[user_id] = sorted_indices[:n]
    
    return top_n_recommendations

# Get Top-5 recommendations for each user based on user-based predictions
top_n_user_based = get_top_n_recommendations(user_based_prediction_cosine, n=5)
print("\nTop-5 User-Based Recommendations:\n", top_n_user_based)



Top-5 User-Based Recommendations:
 {0: array([22, 76, 54, 67, 68], dtype=int64), 1: array([22, 76, 54, 67, 68], dtype=int64), 2: array([22, 76, 54, 68, 17], dtype=int64), 3: array([22, 76, 54, 17, 68], dtype=int64), 4: array([22, 76, 54, 68, 17], dtype=int64), 5: array([22, 76, 54, 68, 67], dtype=int64), 6: array([22, 76, 54, 67, 68], dtype=int64), 7: array([22, 76, 54, 68, 17], dtype=int64), 8: array([22, 76, 54, 68, 17], dtype=int64), 9: array([22, 76, 54, 68, 17], dtype=int64), 10: array([22, 76, 54, 67, 68], dtype=int64), 11: array([22, 76, 54, 68, 67], dtype=int64), 12: array([22, 76, 54, 17, 68], dtype=int64), 13: array([22, 76, 54, 68, 67], dtype=int64), 14: array([22, 76, 54, 17, 67], dtype=int64), 15: array([22, 76, 54, 68, 17], dtype=int64), 16: array([22, 76, 54, 68, 67], dtype=int64), 17: array([22, 76, 54, 17, 67], dtype=int64), 18: array([22, 76, 54, 68, 17], dtype=int64), 19: array([22, 76, 54, 17, 67], dtype=int64), 20: array([22, 76, 54, 67, 68], dtype=int64), 21: arr

In [12]:
print("User-Based Prediction Cosine Matrix Sample:\n", pd.DataFrame(user_based_prediction_cosine).head())


User-Based Prediction Cosine Matrix Sample:
         0         1         2         3         4         5         6    \
0  2.989577  3.185224  2.710689  2.946319  3.118354  3.047401  2.731157   
1  2.890764  3.074172  2.641049  2.835131  3.039341  2.981232  2.637624   
2  2.734981  2.906979  2.440480  2.670044  2.839746  2.794045  2.462978   
3  2.823232  3.016299  2.545824  2.758347  2.940155  2.878447  2.538802   
4  2.880103  3.075706  2.618753  2.822743  3.005670  2.952320  2.609356   

        7         8         9    ...       128       129       130       131  \
0  2.606429  3.072323  3.207942  ...  2.957335  2.946835  2.923195  2.883371   
1  2.494864  2.995830  3.123874  ...  2.858881  2.858101  2.848255  2.785253   
2  2.343948  2.794342  2.917486  ...  2.692780  2.679715  2.666835  2.619936   
3  2.431257  2.906028  3.014994  ...  2.791798  2.784848  2.774072  2.719564   
4  2.481472  2.987806  3.087607  ...  2.848839  2.857321  2.826422  2.756510   

        132       133  

In [13]:
# Display a snapshot of the user-item matrix for report clarity
print("User-Item Matrix (Sample):")
display(user_item_matrix.head(10))  # Display only the first 10 rows


User-Item Matrix (Sample):


,,abilities,activism,adulthood,adventure,age,alcohol,aliteracy,apathy,attributed,...,unhappy-marriage,value,wander,wisdom,women,world,write,writers,writing,yourself
0,5,3,2,4,3,2,3,3,2,5,...,2,2,1,3,4,2,1,5,1,2
1,3,1,5,1,4,5,4,1,4,5,...,1,2,3,1,5,4,2,1,5,1
2,5,1,1,2,1,4,4,4,1,1,...,2,2,3,4,5,2,5,3,2,3
3,5,3,2,1,2,1,1,3,3,1,...,3,3,5,4,1,2,4,2,5,2
4,4,3,3,2,1,3,2,1,5,2,...,2,5,2,1,1,2,5,1,2,4
5,5,2,5,1,1,5,2,4,5,2,...,2,4,5,5,5,1,4,4,5,5
6,4,4,5,4,4,3,4,3,1,4,...,2,5,4,1,4,2,3,2,2,3
7,2,3,5,3,4,4,5,3,5,5,...,4,3,4,2,4,3,4,2,4,3
8,3,5,2,1,5,3,1,1,5,4,...,3,3,3,2,4,3,3,3,2,2
9,4,4,5,4,5,5,1,1,5,4,...,2,3,3,3,1,4,3,5,4,5


In [14]:
# Display average ratings in a cleaner format
average_ratings_df["Average Rating"] = average_ratings_df["Average Rating"].round(2)  # Round to 2 decimal places
print("Average Ratings per Tag (Sample):")
display(average_ratings_df.head(10))  # Display first 10 tags with average ratings


Average Ratings per Tag (Sample):


,Tag,Average Rating
0,,2.96
1,abilities,3.16
2,activism,2.7
3,adulthood,2.92
4,adventure,3.1
5,age,3.04
6,alcohol,2.7
7,aliteracy,2.58
8,apathy,3.06
9,attributed,3.18


In [15]:
# User-User Cosine Similarity Matrix
user_similarity_df_cosine = pd.DataFrame(user_similarity_cosine, index=user_item_matrix.index, columns=user_item_matrix.index)
user_similarity_df_cosine = user_similarity_df_cosine.round(2)  # Round values for readability
print("User-User Cosine Similarity (Sample):")
display(user_similarity_df_cosine.iloc[:10, :10])  # Display a 10x10 portion

# Item-Item Cosine Similarity Matrix
item_similarity_df_cosine = pd.DataFrame(item_similarity_cosine, index=user_item_matrix.columns, columns=user_item_matrix.columns)
item_similarity_df_cosine = item_similarity_df_cosine.round(2)
print("\nItem-Item Cosine Similarity (Sample):")
display(item_similarity_df_cosine.iloc[:10, :10])  # Display a 10x10 portion

# User-User Pearson Correlation Matrix
# Round a copy for display so the full-precision matrix from the earlier cell is not overwritten
user_similarity_pearson_rounded = user_item_matrix.T.corr().round(2)
print("\nUser-User Pearson Correlation (Sample):")
display(user_similarity_pearson_rounded.iloc[:10, :10])  # Display a 10x10 portion

# Item-Item Pearson Correlation Matrix
item_similarity_pearson_rounded = user_item_matrix.corr().round(2)
print("\nItem-Item Pearson Correlation (Sample):")
display(item_similarity_pearson_rounded.iloc[:10, :10])  # Display a 10x10 portion


User-User Cosine Similarity (Sample):


,0,1,2,3,4,5,6,7,8,9
0,1.0,0.8,0.84,0.82,0.78,0.76,0.82,0.8,0.8,0.77
1,0.8,1.0,0.79,0.8,0.79,0.82,0.81,0.88,0.83,0.82
2,0.84,0.79,1.0,0.83,0.79,0.78,0.78,0.81,0.79,0.79
3,0.82,0.8,0.83,1.0,0.82,0.82,0.82,0.82,0.81,0.82
4,0.78,0.79,0.79,0.82,1.0,0.82,0.81,0.84,0.85,0.86
5,0.76,0.82,0.78,0.82,0.82,1.0,0.81,0.83,0.82,0.82
6,0.82,0.81,0.78,0.82,0.81,0.81,1.0,0.82,0.82,0.82
7,0.8,0.88,0.81,0.82,0.84,0.83,0.82,1.0,0.87,0.85
8,0.8,0.83,0.79,0.81,0.85,0.82,0.82,0.87,1.0,0.84
9,0.77,0.82,0.79,0.82,0.86,0.82,0.82,0.85,0.84,1.0



Item-Item Cosine Similarity (Sample):


,,abilities,activism,adulthood,adventure,age,alcohol,aliteracy,apathy,attributed
,1.0,0.8,0.78,0.78,0.77,0.83,0.78,0.81,0.8,0.8
abilities,0.8,1.0,0.83,0.85,0.86,0.81,0.82,0.77,0.83,0.82
activism,0.78,0.83,1.0,0.79,0.85,0.82,0.8,0.74,0.76,0.78
adulthood,0.78,0.85,0.79,1.0,0.87,0.86,0.8,0.82,0.78,0.8
adventure,0.77,0.86,0.85,0.87,1.0,0.85,0.82,0.82,0.83,0.82
age,0.83,0.81,0.82,0.86,0.85,1.0,0.83,0.82,0.84,0.86
alcohol,0.78,0.82,0.8,0.8,0.82,0.83,1.0,0.77,0.81,0.85
aliteracy,0.81,0.77,0.74,0.82,0.82,0.82,0.77,1.0,0.77,0.75
apathy,0.8,0.83,0.76,0.78,0.83,0.84,0.81,0.77,1.0,0.87
attributed,0.8,0.82,0.78,0.8,0.82,0.86,0.85,0.75,0.87,1.0



User-User Pearson Correlation (Sample):


,0,1,2,3,4,5,6,7,8,9
0,1.0,-0.02,0.25,0.04,-0.11,-0.2,0.03,-0.1,-0.04,-0.28
1,-0.02,1.0,-0.01,-0.08,-0.04,0.08,-0.0,0.34,0.11,-0.01
2,0.25,-0.01,1.0,0.17,-0.0,-0.05,-0.12,-0.0,-0.06,-0.12
3,0.04,-0.08,0.17,1.0,0.04,0.05,0.01,-0.05,-0.05,-0.05
4,-0.11,-0.04,-0.0,0.04,1.0,0.1,0.0,0.16,0.2,0.24
5,-0.2,0.08,-0.05,0.05,0.1,1.0,0.01,0.06,0.08,-0.03
6,0.03,-0.0,-0.12,0.01,0.0,0.01,1.0,-0.07,-0.01,-0.09
7,-0.1,0.34,-0.0,-0.05,0.16,0.06,-0.07,1.0,0.24,0.09
8,-0.04,0.11,-0.06,-0.05,0.2,0.08,-0.01,0.24,1.0,0.04
9,-0.28,-0.01,-0.12,-0.05,0.24,-0.03,-0.09,0.09,0.04,1.0



Item-Item Pearson Correlation (Sample):


,,abilities,activism,adulthood,adventure,age,alcohol,aliteracy,apathy,attributed
,1.0,-0.18,-0.04,-0.12,-0.25,-0.03,-0.19,0.09,-0.06,-0.12
abilities,-0.18,1.0,0.09,0.16,0.13,-0.24,-0.05,-0.19,0.05,-0.12
activism,-0.04,0.09,1.0,0.01,0.26,0.04,0.04,-0.16,-0.15,-0.09
adulthood,-0.12,0.16,0.01,1.0,0.32,0.19,-0.01,0.19,-0.09,-0.07
adventure,-0.25,0.13,0.26,0.32,1.0,0.07,-0.01,0.1,0.09,-0.01
age,-0.03,-0.24,0.04,0.19,0.07,1.0,-0.0,0.08,0.09,0.13
alcohol,-0.19,-0.05,0.04,-0.01,-0.01,-0.0,1.0,-0.11,-0.01,0.16
aliteracy,0.09,-0.19,-0.16,0.19,0.1,0.08,-0.11,1.0,-0.09,-0.23
apathy,-0.06,0.05,-0.15,-0.09,0.09,0.09,-0.01,-0.09,1.0,0.3
attributed,-0.12,-0.12,-0.09,-0.07,-0.01,0.13,0.16,-0.23,0.3,1.0


In [16]:
# Display predicted ratings with rounding
user_based_prediction_cosine_df = pd.DataFrame(user_based_prediction_cosine).round(2)
print("Predicted Ratings (User-Based CF, Cosine Similarity):")
display(user_based_prediction_cosine_df.head(10))  # Display the first 10 users with predictions


Predicted Ratings (User-Based CF, Cosine Similarity):


,0,1,2,3,4,5,6,7,8,9,...,128,129,130,131,132,133,134,135,136,137
0,2.99,3.19,2.71,2.95,3.12,3.05,2.73,2.61,3.07,3.21,...,2.96,2.95,2.92,2.88,2.75,2.81,3.04,2.93,3.16,3.29
1,2.89,3.07,2.64,2.84,3.04,2.98,2.64,2.49,3.0,3.12,...,2.86,2.86,2.85,2.79,2.67,2.74,2.96,2.82,3.1,3.19
2,2.73,2.91,2.44,2.67,2.84,2.79,2.46,2.34,2.79,2.92,...,2.69,2.68,2.67,2.62,2.48,2.55,2.8,2.66,2.91,3.03
3,2.82,3.02,2.55,2.76,2.94,2.88,2.54,2.43,2.91,3.01,...,2.79,2.78,2.77,2.72,2.56,2.65,2.9,2.75,3.02,3.13
4,2.88,3.08,2.62,2.82,3.01,2.95,2.61,2.48,2.99,3.09,...,2.85,2.86,2.83,2.76,2.63,2.72,2.96,2.8,3.07,3.2
5,2.88,3.07,2.63,2.82,3.01,2.97,2.61,2.5,2.99,3.09,...,2.86,2.86,2.85,2.79,2.64,2.71,2.96,2.82,3.08,3.2
6,2.97,3.17,2.72,2.92,3.11,3.05,2.71,2.59,3.05,3.2,...,2.95,2.96,2.93,2.86,2.73,2.81,3.05,2.9,3.16,3.28
7,3.17,3.38,2.93,3.13,3.32,3.26,2.93,2.79,3.29,3.4,...,3.16,3.15,3.14,3.07,2.95,3.02,3.26,3.11,3.39,3.49
8,2.93,3.15,2.68,2.89,3.09,3.02,2.68,2.55,3.05,3.16,...,2.92,2.92,2.89,2.83,2.71,2.79,3.03,2.87,3.13,3.25
9,3.07,3.27,2.82,3.04,3.22,3.16,2.8,2.67,3.18,3.29,...,3.05,3.05,3.03,2.97,2.82,2.92,3.15,3.01,3.28,3.4


In [18]:
def get_top_n_recommendations(predictions, n=5):
    top_n_recommendations = {}
    for user_id in range(predictions.shape[0]):
        # Get the top N items with the highest predicted ratings for each user
        top_items = predictions[user_id].argsort()[-n:][::-1]
        top_n_recommendations[user_id] = top_items
    return top_n_recommendations

# Generate top-5 recommendations based on user-based predictions (or other predictions)
top_5_recommendations = get_top_n_recommendations(user_based_prediction_cosine, n=5)


In [19]:
# Convert top-N recommendations to DataFrame for a cleaner look
top_5_recommendations_df = pd.DataFrame.from_dict(top_5_recommendations, orient="index", columns=[f"Top-{i+1}" for i in range(5)])
print("Top-5 User-Based Recommendations (Sample):")
display(top_5_recommendations_df.head(10))  # Display the top-5 recommendations for the first 10 users


Top-5 User-Based Recommendations (Sample):


,Top-1,Top-2,Top-3,Top-4,Top-5
0,22,76,54,67,68
1,22,76,54,67,68
2,22,76,54,68,17
3,22,76,54,17,68
4,22,76,54,68,17
5,22,76,54,68,67
6,22,76,54,67,68
7,22,76,54,68,17
8,22,76,54,68,17
9,22,76,54,68,17
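
The table above lists column indices rather than tag names. Since the prediction matrix keeps the column order of `user_item_matrix`, the indices can be translated back through the tag list. Here is a minimal sketch with made-up tags and predictions (`top_n_tag_names`, `demo_tags`, and `demo_pred` are illustrative names, not part of the original notebook):

```python
import numpy as np
import pandas as pd

def top_n_tag_names(predictions, tags, n=5):
    """Return a DataFrame of the n highest-scoring tag names per user."""
    predictions = np.asarray(predictions)
    tags = np.asarray(tags)
    top_idx = np.argsort(-predictions, axis=1)[:, :n]   # best first
    columns = [f"Top-{i + 1}" for i in range(top_idx.shape[1])]
    return pd.DataFrame(tags[top_idx], columns=columns)

demo_tags = ["books", "humor", "life", "love"]
demo_pred = np.array([[2.9, 3.4, 3.1, 2.5],
                      [3.3, 2.8, 3.0, 3.6]])
demo_top = top_n_tag_names(demo_pred, demo_tags, n=2)
print(demo_top)
```

In the notebook this could be called as `top_n_tag_names(user_based_prediction_cosine, unique_tags)`. The near-identical rows in the table above are plausible for this setup: random ratings give almost uniform cosine similarities, so the predictions likely follow each tag's average rating.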


In [21]:
# Display unique tags in a DataFrame for a clearer snapshot
unique_tags_df = pd.DataFrame(unique_tags, columns=["Unique Tags"])
print("Unique Tags (Genres):")
display(unique_tags_df.head(10))  # Display the first 10 unique tags for readability


Unique Tags (Genres):


,Unique Tags
0,
1,abilities
2,activism
3,adulthood
4,adventure
5,age
6,alcohol
7,aliteracy
8,apathy
9,attributed


In [22]:
# Save the final user-item matrix to a CSV file
user_item_matrix.to_csv('final_user_item_matrix.csv', index=False)
