#### Collaborative Filtering Algorithm
Collaborative filtering is a technique used in recommendation systems to predict a user's preferences based on the preferences of similar users. It operates on the principle that if two users agree on one issue, they are likely to agree on others as well. There are two main types of collaborative filtering:

- User-based Collaborative Filtering: Recommends items to a user based on the preferences of other users who are similar to them.
- Item-based Collaborative Filtering: Recommends items that are similar to items the user has liked in the past.

#### Generate Random Data
We can simulate a user-item rating matrix that represents users’ preferences for various items. Here’s how to generate random data suitable for collaborative filtering:

In [1]:
import numpy as np
import pandas as pd

# Set random seed for reproducibility
np.random.seed(42)

# Generate random user-item rating data
num_users = 10
num_items = 5

# Random ratings between 1 and 5 (0 indicates no rating)
ratings = np.random.randint(1, 6, size=(num_users, num_items))

# Set some values to 0 to simulate users not rating every item
mask = np.random.choice([1, 0], size=ratings.shape, p=[0.7, 0.3])
ratings *= mask

# Create a DataFrame for better visualization
ratings_df = pd.DataFrame(ratings, columns=[f'Item {i+1}' for i in range(num_items)],
                          index=[f'User {i+1}' for i in range(num_users)])

print(ratings_df)


         Item 1  Item 2  Item 3  Item 4  Item 5
User 1        4       5       3       0       5
User 2        2       3       0       3       5
User 3        4       3       0       2       4
User 4        0       0       5       1       4
User 5        0       0       0       1       1
User 6        0       3       2       4       4
User 7        3       4       4       1       3
User 8        5       0       0       1       0
User 9        4       0       0       2       2
User 10       1       2       5       2       4


The data presented in the user-item rating matrix indicates user preferences for various items, typically used in recommendation systems. Each row corresponds to a user, while each column represents an item. The values in the matrix indicate the rating or score that a user has given to an item. Here's a breakdown of what this data means:

#### Matrix Explanation
- Matrix Structure:
    - Rows: Represent individual users (e.g., User 1, User 2, etc.).
    - Columns: Represent different items (e.g., Item 1, Item 2, etc.).
    - Values: Numeric ratings assigned by users to items. In this case:
        - A value of 0 indicates that the user did not rate the item (i.e., no interaction).
        - Positive values (1-5) indicate the user's rating of the item, where a higher number typically signifies a greater preference or satisfaction with the item.

Interpretation of the Data
- User Ratings:
    - User 1 rated Item 2 with a 4, Item 4 with a 5, and Item 5 with a 3. They did not rate Items 1 and 3 (indicated by 0).
    - User 2 gave a rating of 2 to Item 3, 1 to Item 4, and 4 to Item 5, with no ratings for Items 1 and 2.
    - User 3 rated Item 1 as 3, Item 3 as 5, while not interacting with Items 2, 4, and 5.

#### Implement Collaborative Filtering from Scratch
Here's a simple implementation of user-based collaborative filtering using the cosine similarity to find similar users and recommend items:

In [2]:
class CollaborativeFiltering:
    def __init__(self, ratings):
        self.ratings = ratings
    
    def _cosine_similarity(self, user_a, user_b):
        # Calculate the cosine similarity between two users
        dot_product = np.dot(user_a, user_b)
        norm_a = np.linalg.norm(user_a)
        norm_b = np.linalg.norm(user_b)
        return dot_product / (norm_a * norm_b) if norm_a and norm_b else 0

    def get_similar_users(self, target_user_index, top_n=3):
        # Get the ratings of the target user
        target_user_ratings = self.ratings[target_user_index]

        similarities = []
        for i in range(self.ratings.shape[0]):
            if i != target_user_index:
                similarity = self._cosine_similarity(target_user_ratings, self.ratings[i])
                similarities.append((i, similarity))

        # Sort by similarity and get the top_n similar users
        similar_users = sorted(similarities, key=lambda x: x[1], reverse=True)[:top_n]
        return similar_users
    
    def recommend(self, target_user_index):
        similar_users = self.get_similar_users(target_user_index)
        
        # Collect item recommendations based on similar users
        recommendations = {}
        for user_index, similarity in similar_users:
            for item_index in range(self.ratings.shape[1]):
                if self.ratings[user_index][item_index] > 0:  # Only consider rated items
                    if item_index not in recommendations:
                        recommendations[item_index] = 0
                    recommendations[item_index] += self.ratings[user_index][item_index] * similarity

        # Sort recommendations by score
        recommended_items = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)
        return recommended_items

# Example Usage
ratings_matrix = np.array(ratings_df)  # Use the ratings matrix from earlier
cf = CollaborativeFiltering(ratings_matrix)

# Get recommendations for User 1 (index 0)
recommendations = cf.recommend(0)
print("Recommendations for User 1:")
for item_index, score in recommendations:
    print(f"Item {item_index + 1} with score {score:.2f}")


Recommendations for User 1:
Item 5 with score 10.42
Item 2 with score 8.87
Item 1 with score 7.99
Item 4 with score 5.14
Item 3 with score 3.82


#### When to Use Collaborative Filtering
When to Use:
- Sufficient User Data: When there is a large amount of user interaction data available (ratings, clicks, etc.).
- User Similarity: When users have similar preferences and can provide meaningful recommendations based on peer choices.
- New Items: Collaborative filtering can recommend new items based on user behavior without requiring detailed item descriptions.

When Not to Use:
- Cold Start Problem: When there are new users or items with insufficient data, leading to unreliable recommendations.
- Sparse Data: When the user-item matrix is highly sparse, making it challenging to find meaningful similarities.
- Scalability: For large datasets, collaborative filtering can become computationally expensive and may require optimization techniques or algorithms.

In summary, collaborative filtering is a powerful technique for generating recommendations based on user interactions but may face challenges when data is sparse or when new users/items are introduced.

#### Key Concepts

- Cold Start Problem:

    - New Users: If a new user does not have any ratings or interactions, it becomes challenging to make personalized recommendations. Strategies to mitigate this include using demographic information or asking users to rate a few items at the beginning.
    - New Items: New items without user ratings cannot be recommended until sufficient interaction data is available.

- Sparsity:

    - In many real-world scenarios, the user-item interaction matrix is sparse, meaning most users do not interact with most items. This sparsity can hinder the effectiveness of collaborative filtering algorithms.

- Scalability:

    - As the number of users and items increases, the computational complexity of calculating similarities and generating recommendations also increases. Efficient algorithms and data structures (like sparse matrices) are necessary for scaling.

- Diversity and Serendipity:

    - It's essential to ensure that recommendations are not only accurate but also diverse and serendipitous to enhance user experience. This means including items that the user may not have considered but are still relevant.

- Evaluation Metrics:

    - Common evaluation metrics for recommendation systems include Precision, Recall, F1 Score, and Mean Absolute Error (MAE). Evaluating the system's performance is crucial to ensuring it meets user needs.

Matrix factorization is generally considered more advanced and accurate than basic collaborative filtering, especially for large datasets. Here’s a quick comparison:

- Matrix Factorization:

    - Technique: Decomposes the user-item interaction matrix into lower-dimensional matrices representing latent factors for users and items.
    - Accuracy: Often more accurate because it captures complex, hidden patterns in user preferences and item attributes.
    - Advanced: More sophisticated, often used in modern recommender systems (e.g., Singular Value Decomposition (SVD), Alternating Least Squares (ALS)).

- Collaborative Filtering (Basic):

    - Technique: Directly calculates similarity between users (user-based) or items (item-based) based on observed ratings or interactions.
    - Accuracy: Generally less accurate for large, sparse datasets because it doesn't handle missing data as well as matrix factorization.
    - Simplicity: Easier to implement and understand, but limited in capturing deeper, hidden relationships.

#### Building a newsfeed recommendation engine using collaborative filtering

There are several datasets available.

##### 1. [MIND (Microsoft News Dataset)](https://msnews.github.io/)
- A large-scale dataset specifically for news recommendation.
- Contains user click behaviors, news content (e.g., title, abstract, category), and user-news interactions.
- There are two versions: MIND-small and MIND-large, both collected from anonymized user interactions on the Microsoft News website.

##### 2. [Adressa Dataset](https://reclab.idi.ntnu.no/dataset/)
- Collected from the Norwegian news website Adresseavisen.
- Contains user interaction data like clicks, scrolls, and time spent on articles, along with article metadata.
- Useful for exploring user behavior patterns in news reading.

##### 3. News360 Dataset
- Contains user interactions on the News360 platform.
- Includes data on clicks, shares, and time spent on articles, along with detailed metadata for each news item.
- This dataset was made available through research collaborations, so you may need to contact them to gain access.

##### 4. [GDELT (Global Database of Events, Language, and Tone)](https://www.gdeltproject.org/)
- Not strictly user interaction data, but provides a vast amount of news content data globally.
- Useful for augmenting interaction data with content-based features, such as event metadata and sentiment analysis.

##### 5. [Click-Through Rate (CTR) Prediction for News Recommendations (Alibaba)](https://tianchi.aliyun.com/dataset/dataDetail?dataId=56)
- Published by Alibaba for click-through rate prediction research.
- Includes user click behavior, article metadata, and historical interaction data.
- Contains a large number of interactions, making it suitable for both collaborative and content-based filtering.

##### 6. [Outbrain Click Prediction Dataset](https://www.kaggle.com/c/outbrain-click-prediction)
- From the Outbrain competition on Kaggle.
- Contains user click data on recommended articles, along with contextual information and ad-click patterns.
- Focuses on click-through prediction, but could be adapted for general newsfeed recommendation modeling.

##### 7. Yahoo News User Click Log Dataset
- Released by Yahoo Labs, it includes anonymized user interactions with news articles on Yahoo's platform.
- Contains user click history, article metadata, and timestamps.
- Though no longer widely available, there may be archived versions or similar datasets from Yahoo.

##### 8. [Coveo Data Challenge Dataset](https://github.com/coveooss/SIGIR-ecom-data-challenge)
- Includes search and recommendation interaction data with news articles from various domains.
- Contains clickstream data, search events, and content metadata.


#### MIND (Microsoft News Dataset)

Let’s start by loading each file within the MINDsmall_train dataset and examining what each one contains. The dataset files in .tsv and .vec formats provide different types of information:

- behaviors.tsv - Contains user interaction data.
- news.tsv - Contains information about news articles.
- entity_embedding.vec - Embeddings for entities mentioned in the news articles.
- relation_embedding.vec - Embeddings for relations between entities.
Here's a breakdown of how to read each file and understand its structure.

In [2]:
import pandas as pd

# Load `behaviors.tsv`
behaviors_df = pd.read_csv("../dataset/MINDsmall_train/behaviors.tsv", sep='\t', header=None, 
                           names=["ImpressionID", "UserID", "Time", "History", "Impressions"])
print("Behaviors Data:")
behaviors_df.head()


Behaviors Data:


Unnamed: 0,ImpressionID,UserID,Time,History,Impressions
0,1,U13740,11/11/2019 9:05:58 AM,N55189 N42782 N34694 N45794 N18445 N63302 N104...,N55689-1 N35729-0
1,2,U91836,11/12/2019 6:11:30 PM,N31739 N6072 N63045 N23979 N35656 N43353 N8129...,N20678-0 N39317-0 N58114-0 N20495-0 N42977-0 N...
2,3,U73700,11/14/2019 7:01:48 AM,N10732 N25792 N7563 N21087 N41087 N5445 N60384...,N50014-0 N23877-0 N35389-0 N49712-0 N16844-0 N...
3,4,U34670,11/11/2019 5:28:05 AM,N45729 N2203 N871 N53880 N41375 N43142 N33013 ...,N35729-0 N33632-0 N49685-1 N27581-0
4,5,U8125,11/12/2019 4:11:21 PM,N10078 N56514 N14904 N33740,N39985-0 N36050-0 N16096-0 N8400-1 N22407-0 N6...


Columns in behaviors.tsv:
- ImpressionID: Unique ID for each impression (recommendation instance).
- UserID: Unique ID for each user.
- Time: Timestamp of the impression.
- History: List of article IDs the user has previously read.
- Impressions: List of article IDs shown to the user with click indicators (e.g., "N55689-1" indicates article N55689 was clicked).

In [4]:
# Load `news.tsv`
news_df = pd.read_csv("../dataset/MINDsmall_train/news.tsv", sep='\t', header=None, 
                      names=["NewsID", "Category", "SubCategory", "Title", "Abstract", "URL", "TitleEntities", "AbstractEntities"])
print("\nNews Data:")
news_df.head()



News Data:


Unnamed: 0,NewsID,Category,SubCategory,Title,Abstract,URL,TitleEntities,AbstractEntities
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"",...",[]
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...,https://assets.msn.com/labs/mind/AAB19MK.html,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik...","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik..."
2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId..."
3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",https://assets.msn.com/labs/mind/AACk2N6.html,[],"[{""Label"": ""National Basketball Association"", ..."
4,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re...",https://assets.msn.com/labs/mind/AAAKEkt.html,"[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI...","[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI..."


Columns in news.tsv:
- NewsID: Unique ID for each news article.
- Category: General category of the news (e.g., Sports, Health).
- SubCategory: Sub-category of the news.
- Title: The title of the news article.
- Abstract: A brief summary of the article.
- URL: Link to the news article.
- TitleEntities: Entities mentioned in the title.
- AbstractEntities: Entities mentioned in the abstract.

This file provides the content-based features of each news article, such as title, category, and abstract, which are useful for content-based filtering.

In [5]:
# Load `entity_embedding.vec`
with open("../dataset/MINDsmall_train/entity_embedding.vec", "r") as file:
    entity_embeddings = file.readlines()

# Display a sample of the entity embeddings
print("\nEntity Embeddings (First 5 lines):")
entity_embeddings[:5]



Entity Embeddings (First 5 lines):


['Q41\t-0.063388\t-0.181451\t0.057501\t-0.091254\t-0.076217\t-0.052525\t0.050500\t-0.224871\t-0.018145\t0.030722\t0.064276\t0.073063\t0.039489\t0.159404\t-0.128784\t0.016325\t0.026797\t0.137090\t0.001849\t-0.059103\t0.012091\t0.045418\t0.000591\t0.211337\t-0.034093\t-0.074582\t0.014004\t-0.099355\t0.170144\t0.109376\t-0.014797\t0.071172\t0.080375\t0.045563\t-0.046462\t0.070108\t0.015413\t-0.020874\t-0.170324\t-0.001130\t0.059810\t0.054342\t0.027358\t-0.028995\t-0.224508\t0.066281\t-0.200006\t0.018186\t0.082396\t0.167178\t-0.136239\t0.055134\t-0.080195\t-0.001460\t0.031078\t-0.017084\t-0.091176\t-0.036916\t0.124642\t-0.098185\t-0.054836\t0.152483\t-0.053712\t0.092816\t-0.112044\t-0.072247\t-0.114896\t-0.036541\t-0.186339\t-0.160610\t0.037342\t-0.133474\t0.110080\t0.070678\t-0.005586\t-0.046667\t-0.072010\t0.086424\t0.026165\t0.030561\t0.077888\t-0.117226\t0.211597\t0.112512\t0.079999\t-0.083398\t-0.121117\t0.071751\t-0.017654\t-0.134979\t-0.051949\t0.001861\t0.124535\t-0.151043\t-0.2636

entity_embedding.vec:
- Each line contains an entity ID and its embedding vector (often used to represent entities like people, places, or topics in the news).
- These embeddings allow the model to understand relationships between entities based on context.

These are used to add context to each article by associating entities mentioned in the title or abstract, which may improve recommendation accuracy by linking related entities.

In [7]:
# Load `relation_embedding.vec`
with open("../dataset/MINDsmall_train/relation_embedding.vec", "r") as file:
    relation_embeddings = file.readlines()

# Display a sample of the relation embeddings
print("\nRelation Embeddings (First 5 lines):")
relation_embeddings[:5]



Relation Embeddings (First 5 lines):


['P31\t-0.073467\t-0.132227\t0.034173\t-0.032769\t0.008289\t-0.107088\t-0.031712\t-0.039581\t0.101882\t-0.106961\t-0.053441\t0.068202\t-0.045584\t-0.140448\t-0.079402\t0.001022\t0.059921\t-0.062510\t0.102848\t0.077947\t-0.063644\t0.050070\t-0.019180\t0.064456\t-0.052222\t0.071078\t-0.036413\t-0.039235\t0.137947\t0.067378\t-0.137468\t0.103482\t0.121755\t-0.006587\t0.063077\t-0.024954\t-0.031300\t-0.056833\t-0.139115\t-0.053570\t0.165815\t-0.022143\t0.006561\t-0.108691\t-0.149139\t0.080943\t0.054542\t-0.034564\t0.082343\t-0.095843\t-0.068758\t0.013850\t-0.025589\t-0.012451\t0.116367\t-0.066981\t-0.006472\t0.136078\t-0.057084\t-0.066427\t-0.035916\t-0.028447\t-0.070395\t-0.052364\t-0.040038\t0.037342\t-0.073347\t0.112529\t0.106537\t0.107426\t0.086297\t0.085833\t0.054393\t0.053187\t0.066242\t0.058507\t-0.047180\t-0.086089\t0.050148\t0.053491\t-0.042370\t-0.110435\t-0.058929\t0.063987\t-0.037393\t-0.057942\t-0.032128\t0.141226\t-0.106979\t0.072183\t-0.045641\t-0.050068\t-0.053686\t-0.045389

relation_embedding.vec:
- Each line includes a relation ID and its embedding vector, used to represent relationships between entities (like "is_related_to" or "is_about").
- Relation embeddings can help in fine-tuning content-based recommendations by understanding entity relationships.

Relation embeddings can help when associating content in hybrid or content-based models by adding deeper contextual understanding of entity relationships within and across articles.

- behaviors.tsv: Provides user interaction data, essential for collaborative filtering.
- news.tsv: Contains article metadata, useful for content-based filtering.
- entity_embedding.vec: Entity representations to add context to articles.
- relation_embedding.vec: Relationship embeddings to enrich content understanding.

Entity embeddings represent topics or entities (like “NASA” or “Mars”) within articles, adding context by showing which topics are relevant to each other. They enable personalized recommendations by connecting users to articles on similar subjects.

Relation embeddings capture the connections between entities (e.g., "NASA" is exploring "Mars"), enriching content understanding by showing how topics relate. This allows for more accurate content matching and enhances recommendations by suggesting articles that explore related themes, not just exact keywords.

Together, these embeddings help models recommend contextually similar articles, improving both content-based and hybrid recommendation approaches.


In [None]:
import pandas as pd
import scipy.sparse as sparse
import numpy as np

# Preprocess to get user-item interactions (binary: clicked or not clicked)
# Split the impressions column by the click indicator
interactions = []

for _, row in behaviors_df.iterrows():
    user = row["User_ID"]
    impressions = row["Impressions"].split()  # Format: ["1234-0", "5678-1", ...]
    for impression in impressions:
        article_id, clicked = impression.split("-")
        if clicked == "1":  # We consider only clicked items
            interactions.append((user, article_id))

# Convert to a DataFrame
interactions_df = pd.DataFrame(interactions, columns=["User_ID", "Article_ID"])

# Encode user and article IDs for matrix construction
user_codes = interactions_df["User_ID"].astype("category").cat.codes
article_codes = interactions_df["Article_ID"].astype("category").cat.codes
interactions_df["User_Code"] = user_codes
interactions_df["Article_Code"] = article_codes

# Create a sparse user-item matrix
user_item_matrix = sparse.coo_matrix((np.ones(len(interactions_df)), 
                                     (user_codes, article_codes)))


In [43]:
interactions_df.head()

Unnamed: 0,User_ID,Article_ID,User_Code,Article_Code
0,U13740,N55689,2246,5951
1,U91836,N17059,48340,895
2,U73700,N23814,37557,1753
3,U34670,N49685,14625,5151
4,U8125,N8400,42080,7500


In [45]:
# Print the non-zero entries (data) and positions (row and col) in COO format
print("Data:", user_item_matrix.data[:5])
print("Rows:", user_item_matrix.row[:5])
print("Cols:", user_item_matrix.col[:5])

Data: [1. 1. 1. 1. 1.]
Rows: [ 2246 48340 37557 14625 42080]
Cols: [5951  895 1753 5151 7500]


In [46]:
user_id_map = {user: idx for idx, user in enumerate(interactions_df['User_ID'].unique())}
user_id_revmap = {idx:user for idx, user in enumerate(interactions_df['User_ID'].unique())}

#### Building a Collaborative Filtering Model on the MIND Dataset

To develop a collaborative filtering model using the MIND dataset, we can leverage implicit feedback derived from user click behaviors. Libraries such as **Surprise** or **Implicit** are particularly well-suited for this type of collaborative filtering. Given that the MIND dataset contains user-article interactions, employing a collaborative filtering approach based on these interactions will allow us to recommend articles that users with similar behaviors have engaged with.

#### Steps to Build a Collaborative Filtering Model with Implicit Feedback

#### 1. Install Libraries

Utilize the **Implicit** library for collaborative filtering with implicit feedback.

#### 2. Prepare the Data

- Load the `behaviors.tsv` file to obtain user-item interactions.
- Construct a user-item matrix where rows represent users, columns represent articles, and values indicate the number of clicks or binary indicators of clicks.

#### 3. Train the Model

- Train a collaborative filtering model such as **Alternating Least Squares (ALS)** using the implicit feedback data.
- ALS is particularly effective for implicit feedback scenarios and is optimized to handle sparse matrices.

#### 4. Make Predictions

- For each user, generate a list of top article recommendations based on past interactions.

#### Explanation of the Model

- **Data Preparation:** A sparse user-item matrix represents user interactions with articles, indicating clicks as implicit feedback.
- **ALS Model:** This model identifies latent factors for users and items, utilizing collaborative filtering to generate recommendations based on similar users' behaviors.
- **Predictions:** The model produces a ranked list of articles for each user by suggesting those that similar users have clicked on.

#### Advantages of Collaborative Filtering with ALS

- **Personalization:** Recommendations are tailored based on interactions of similar users, enhancing relevance.
- **Scalability:** ALS efficiently handles sparse matrices, making it well-suited for large datasets.

This approach serves as a solid baseline for personalized recommendations in newsfeed applications.


In [47]:
user_item_matrix

<50000x7713 sparse matrix of type '<class 'numpy.float64'>'
	with 236344 stored elements in COOrdinate format>

In [None]:
import implicit

# Convert the matrix to a Compressed Sparse Row (CSR) format for efficient row access
user_item_csr = user_item_matrix.tocsr()

# Train ALS model
als_model = implicit.als.AlternatingLeastSquares(factors=50, regularization=0.1, iterations=20)
als_model.fit(user_item_csr)

  check_blas_config()


  0%|          | 0/20 [00:00<?, ?it/s]

In [48]:
# Get recommendations for a user

user_id = 0  # Example user ID (0 corresponds to the first user in encoded matrix)
recommendations = als_model.recommend(user_id, user_item_csr[user_id], N=10)

# Print the type and content of recommendations to confirm its structure
print("Raw Recommendations:", recommendations)

# Extract the recommended article indices from the tuple
recommended_indices = recommendations[0]  # Get the first array from the tuple

# Convert recommendations back to article IDs using the indices
recommended_articles = [interactions_df["Article_ID"].astype("category").cat.categories[i]
                        for i in recommended_indices]  # Access the indices directly

print("Recommended Articles for User:", user_id_revmap[user_id], 'News:', recommended_articles)


Raw Recommendations: (array([5798, 5097,  513, 4156,  532, 6856, 4724, 6690, 3468, 3106],
      dtype=int32), array([0.00319199, 0.00225009, 0.00171585, 0.00164096, 0.00146707,
       0.00109179, 0.00108429, 0.00092599, 0.00088498, 0.00080291],
      dtype=float32))
Recommended Articles for User: U13740 News: ['N54489', 'N49180', 'N14029', 'N41881', 'N14184', 'N62360', 'N4642', 'N61233', 'N36789', 'N34185']


In [16]:
# Access user and item embeddings
user_embeddings = als_model.user_factors  # Shape: (num_users, num_factors)
item_embeddings = als_model.item_factors  # Shape: (num_items, num_factors)

# View the shape of the embeddings
print("User Embeddings Shape:", user_embeddings.shape)
print("Item Embeddings Shape:", item_embeddings.shape)

# Optionally, view the embeddings for a specific user and item
user_id = 0  # Example user ID
item_id = 0  # Example item ID

print("User Embedding for User ID 0:", user_embeddings[user_id])
print("Item Embedding for Item ID 0:", item_embeddings[item_id])

User Embeddings Shape: (50000, 50)
Item Embeddings Shape: (7713, 50)
User Embedding for User ID 0: [-6.10826391e-05 -1.98667505e-04 -1.12994094e-04  4.25650418e-04
 -1.22063699e-04 -1.97014393e-04 -4.81787691e-04  2.22405361e-04
 -3.88185988e-04 -2.30767298e-04 -2.30254023e-04  1.32836110e-04
 -7.69869221e-05  2.60578763e-05 -8.19234192e-05  6.56452612e-04
 -4.08868327e-05 -3.68227164e-04 -3.52951232e-04  4.30373562e-04
 -4.75030902e-05 -4.78772738e-04 -2.23632203e-04  6.10771240e-05
  2.16088461e-04  9.93336726e-05 -9.24663691e-05 -3.36726051e-04
  7.15017261e-04 -2.80726235e-05  5.51563571e-04 -6.16752295e-05
 -2.39596447e-05 -1.54777663e-04  1.74155735e-04 -8.59818319e-05
  6.40619663e-04 -4.05327562e-04  6.64181134e-05 -1.43632366e-04
 -2.57996406e-04  2.99616440e-05 -8.17779146e-05  2.20759350e-04
  4.37924493e-04  2.25014213e-04  1.93007909e-05  2.61107547e-04
  8.17847977e-05 -6.28211710e-05]
Item Embedding for Item ID 0: [ 1.2315308e-03 -4.5512940e-04 -1.1293784e-04 -2.6260418e

##### Content Filtering

#### Overview
Content-based recommendation systems suggest items based on their features, utilizing attributes like titles and descriptions to recommend similar items to users.

#### Steps to Build the Model

1. **Load Data**: Import article and user interaction data from the MIND dataset.
   
2. **Preprocess Data**: Clean and combine article metadata (e.g., titles, abstracts) into a single text field.

3. **Feature Extraction**: Use TF-IDF (Term Frequency-Inverse Document Frequency) to convert text data into numerical vectors.

4. **Build the Model**: Calculate cosine similarity between article feature vectors to assess similarity.

5. **Generate Recommendations**: For each user-interacted article, suggest top N similar articles based on similarity scores.

#### Key Concepts
- **TF-IDF**: Transforms text into numerical format by evaluating word importance.
- **Cosine Similarity**: Measures similarity between two vectors; a score of 1 indicates identical items.

#### Benefits
- **Personalization**: Tailors recommendations based on user interactions.
- **No User Data Required**: Operates solely on item characteristics.
- **Transparency**: Recommendations are based on item features, making them understandable.

#### Use Case
- **Article Recommendations**: Suggest articles similar to those a user has previously read based on shared attributes.


In [49]:
news_df.head()

Unnamed: 0,NewsID,Category,SubCategory,Title,Abstract,URL,TitleEntities,AbstractEntities
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"",...",[]
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...,https://assets.msn.com/labs/mind/AAB19MK.html,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik...","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik..."
2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId..."
3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",https://assets.msn.com/labs/mind/AACk2N6.html,[],"[{""Label"": ""National Basketball Association"", ..."
4,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re...",https://assets.msn.com/labs/mind/AAAKEkt.html,"[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI...","[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI..."


In [50]:
# Combine title and abstract for feature extraction
news_df['Content'] = news_df['Title'] + " " + news_df['Abstract']


In [52]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF Vectorizer
# Replace NaN values in 'Content' with an empty string
news_df['Content'] = news_df['Content'].fillna('')

tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(news_df['Content'])


In [54]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Create a DataFrame to hold the article IDs for easy reference
news_df = news_df.reset_index()
indices = pd.Series(news_df.index, index=news_df['NewsID'])


In [55]:
news_df.head()

Unnamed: 0,level_0,index,NewsID,Category,SubCategory,Title,Abstract,URL,TitleEntities,AbstractEntities,Content
0,0,0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, an...","Shop the notebooks, jackets, and more that the...",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"",...",[],"The Brands Queen Elizabeth, Prince Charles, an..."
1,1,1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding yo...,https://assets.msn.com/labs/mind/AAB19MK.html,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik...","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""Wik...",50 Worst Habits For Belly Fat These seemingly ...
2,2,2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches...,Lt. Ivan Molchanets peeked over a parapet of s...,https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId...",The Cost of Trump's Aid Freeze in the Trenches...
3,3,3,N53526,health,voices,I Was An NBA Wife. Here's How It Affected My M...,"I felt like I was a fraud, and being an NBA wi...",https://assets.msn.com/labs/mind/AACk2N6.html,[],"[{""Label"": ""National Basketball Association"", ...",I Was An NBA Wife. Here's How It Affected My M...
4,4,4,N38324,health,medical,"How to Get Rid of Skin Tags, According to a De...","They seem harmless, but there's a very good re...",https://assets.msn.com/labs/mind/AAAKEkt.html,"[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI...","[{""Label"": ""Skin tag"", ""Type"": ""C"", ""WikidataI...","How to Get Rid of Skin Tags, According to a De..."


In [58]:
def get_recommendations(news_id, cosine_sim=cosine_sim):
    # Get the index of the article that matches the article_id
    idx = indices[news_id]

    # Get the pairwise similarity scores of all articles with that article
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the articles based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar articles
    sim_scores = sim_scores[1:11]  # Exclude the first article (itself)

    # Get the article indices
    article_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar articles
    return news_df['NewsID'].iloc[article_indices].values

# Example usage
new_id = "N55528"
recommended_articles = get_recommendations(new_id)  # Replace with a valid Article_ID
print("Recommended Articles for Article :",new_id,  recommended_articles[:10])


Recommended Articles for Article : N55528 ['N9056' 'N38133' 'N43522' 'N63495' 'N43301' 'N63823' 'N42777' 'N60671'
 'N15619' 'N23446']


In [67]:
# Assume we have a function to get recent articles read by a user
def get_recent_articles(user_id, num_articles=5):
    # This function should return the last `num_articles` articles that the user has read.
    recent_articles = interactions_df[interactions_df['User_ID'] == user_id].tail(num_articles)['Article_ID'].tolist()
    return recent_articles

class ContentRecommender:
    def __init__(self, similarity_matrix):
        self.similarity_matrix = similarity_matrix
    
    def recommend(self, article_id, N=10):
        idx = news_df.index[news_df['NewsID'] == article_id][0]
        similarity_scores = list(enumerate(self.similarity_matrix[idx]))
        similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
        similar_articles_indices = [i[0] for i in similarity_scores[1:N+1]]
        # Return both the article IDs and their similarity scores
        return [(news_df['NewsID'].iloc[idx], score) for idx, score in similarity_scores[1:N+1]]

# New function to recommend articles based on user's recent reads
def recommend_for_user(user_id, content_model, N=10):
    # Get recent articles read by the user
    recent_articles = get_recent_articles(user_id)
    
    # Store recommendations with scores
    recommendations = {}
    
    for article in recent_articles:
        similar_articles_with_scores = content_model.recommend(article, N)
        for rec_id, score in similar_articles_with_scores:
            # Aggregate scores
            if rec_id in recommendations:
                recommendations[rec_id] += score  # Increment the score for this article
            else:
                recommendations[rec_id] = score  # Initialize the score
    
    # Sort recommendations based on aggregated scores
    sorted_recommendations = sorted(recommendations.items(), key=lambda x: x[1], reverse=True)
    
    # Return top-N recommendations based on aggregated scores
    top_n_recommendations = [item for item in sorted_recommendations[:N]]
    return top_n_recommendations

# Example usage
user_id = 'U13740'
content_model = ContentRecommender(cosine_sim)
recommended_articles = recommend_for_user(user_id, content_model, N=10)
print("Recommended Articles for User:", recommended_articles)


Recommended Articles for User: [('N17741', 0.7441272403329986), ('N47415', 0.6664363463252515), ('N31755', 0.5756498496755538), ('N25821', 0.5753420971296641), ('N26173', 0.5333115389049654), ('N20088', 0.4749948347145272), ('N51132', 0.45989686022872667), ('N16579', 0.451297742139377), ('N33023', 0.42532121883497864), ('N27536', 0.3885339964237947)]


In [62]:
collab_recs

(array([5798, 5097,  513, 4156,  532, 6856, 4724, 6690, 3468, 3106],
       dtype=int32),
 array([0.00319199, 0.00225009, 0.00171585, 0.00164096, 0.00146707,
        0.00109179, 0.00108429, 0.00092599, 0.00088498, 0.00080291],
       dtype=float32))

In [68]:
#### Hybrid Solution
user_id = 0  # Example user ID (0 corresponds to the first user in encoded matrix)
collab_recs = als_model.recommend(user_id, user_item_csr[user_id], N=10)

# Generate top-N recommendations from content-based model
content_recs = recommend_for_user(user_id_revmap[user_id], content_model, N=10)

# Convert recommendations into dictionaries for easy lookup
collab_item_ids, collab_scores = collab_recs

collab_dict = {item_id: score * 0.6 for item_id, score in zip(collab_item_ids, collab_scores)}
content_dict = {item_id: score * 0.4 for item_id, score in content_recs}

# Combine scores from both models
hybrid_scores = {}
for item_id in set(collab_dict.keys()).union(content_dict.keys()):
    hybrid_scores[item_id] = collab_dict.get(item_id, 0) + content_dict.get(item_id, 0)

# Sort by hybrid score to get final recommendations
final_recommendations = sorted(hybrid_scores.items(), key=lambda x: x[1], reverse=True)

# Output top-N recommendations
top_n_recommendations = [item_id for item_id, score in final_recommendations[:10]]
print("Hybrid Recommendations for User:", top_n_recommendations)


Hybrid Recommendations for User: ['N17741', 'N47415', 'N31755', 'N25821', 'N26173', 'N20088', 'N51132', 'N16579', 'N33023', 'N27536']
