# Problem Statement

iPrint, an Indian media house, provides diverse news and information services across sports, weather, health, stocks, and more. While iPrint has historically recommended popular and similar articles to users, this approach lacks personalization and fails to engage many users, resulting in declining user retention and revenue. To address this, iPrint aims to deliver a more personalized experience by recommending new, relevant articles to users each day and suggesting similar articles on individual news pages.

To meet these goals, iPrint seeks a robust recommendation system that will:

1. Display the top 10 new, relevant articles to each user at the start of each day.
2. Recommend the top 10 similar articles when users click on any news item.

The recommendation system must not show content that the user has already viewed or that has been removed from the platform, and only English-language articles will be used for content-based recommendations. The final output should provide the names and IDs of the recommended articles.

# Procedure Overview

1. Data Pre-processing
- Impute Ratings: Since the dataset lacks user ratings, generate a "ratings" feature based on the interaction type, assigning higher weights to types indicating greater engagement (e.g., highest to content_followed, then content_commented_on, etc.). This processed dataset will serve as the base for collaborative filtering models.
- Filter English Content: Extract only English-language articles from the platform_content data for content-based filtering.
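
The two pre-processing steps above could be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: only the ordering (content_followed highest, then content_commented_on) comes from the description, while the numeric weights, the other interaction type names, the `language` column name, and the `en` language code are assumptions.

```python
import pandas as pd

# Hypothetical weights: only the ordering of the first two entries comes from
# the text; the values and the remaining type names are assumptions.
INTERACTION_WEIGHTS = {
    "content_followed": 5,
    "content_commented_on": 4,
    "content_saved": 3,
    "content_liked": 2,
    "content_watched": 1,
}

def impute_ratings(transactions: pd.DataFrame) -> pd.DataFrame:
    """Map each interaction to a weight and keep the strongest one per (consumer, item)."""
    rated = transactions.copy()
    rated["ratings"] = rated["interaction_type"].map(INTERACTION_WEIGHTS)
    rated = rated.dropna(subset=["ratings"])  # drop interaction types with no weight
    return rated.groupby(["consumer_id", "item_id"], as_index=False)["ratings"].max()

def filter_english(content: pd.DataFrame, lang_col: str = "language") -> pd.DataFrame:
    """Keep English rows only; the column name and the 'en' code are assumptions."""
    return content[content[lang_col].str.lower() == "en"].copy()

# Toy data for illustration only
toy_transactions = pd.DataFrame({
    "consumer_id": [1, 1, 2, 2],
    "item_id": [10, 10, 10, 20],
    "interaction_type": [
        "content_watched", "content_followed",
        "content_watched", "content_commented_on",
    ],
})
toy_content = pd.DataFrame({"item_id": [10, 20, 30], "language": ["en", "pt", "EN"]})

ratings = impute_ratings(toy_transactions)
english = filter_english(toy_content)
```

Taking the maximum weight per consumer-item pair is one possible choice; summing the weights would instead reward repeated engagement.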

2. Exploratory Data Analysis (EDA)
- Analyze key features like interaction types, consumer and producer locations, item types, and language distributions.
- Identify trends such as popular content types, common languages, and primary regions for article consumption.

3. Recommendation Techniques
- User-based Collaborative Filtering: Build a user-item matrix using the rating values, then generate a user-similarity matrix. Predict ratings for user-item pairs to make recommendations.
- Item-based Collaborative Filtering: Create an item-similarity matrix to recommend the top 10 similar items based on similarity scores.
- Content-based Filtering: Use text processing (e.g., TF-IDF) on article descriptions to recommend similar items based on content relevance.
- ALS (Alternating Least Squares): Build a sparse user-item interaction matrix from the ratings, train the ALS model on it, and tune its hyperparameters to improve recommendation quality.
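
A minimal sketch of the user-based and item-based steps, assuming a small toy ratings table with the `consumer_id`, `item_id`, and `ratings` columns described earlier. The helper names are hypothetical, cosine similarity is one common choice rather than a confirmed one, and ALS is left out of this sketch.

```python
import numpy as np
import pandas as pd

def cosine_similarity_matrix(m: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = m / norms
    return unit @ unit.T

# Toy ratings for illustration only
ratings = pd.DataFrame({
    "consumer_id": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "item_id": ["a", "b", "a", "c", "b", "c"],
    "ratings": [5, 3, 4, 2, 1, 5],
})

# User-item matrix: one row per user, one column per item, 0 = no interaction
user_item = ratings.pivot_table(
    index="consumer_id", columns="item_id", values="ratings", fill_value=0
)
R = user_item.to_numpy(dtype=float)

user_sim = cosine_similarity_matrix(R)    # users x users
item_sim = cosine_similarity_matrix(R.T)  # items x items

# User-based prediction: similarity-weighted average of other users' ratings
np.fill_diagonal(user_sim, 0.0)
denom = np.abs(user_sim).sum(axis=1, keepdims=True)
denom[denom == 0] = 1.0
predicted = (user_sim @ R) / denom

def recommend_for_user(user, k=10):
    """Top-k items by predicted rating, excluding items the user already interacted with."""
    row = user_item.index.get_loc(user)
    scores = pd.Series(predicted[row], index=user_item.columns)
    seen = user_item.columns[R[row] > 0]
    return scores.drop(seen).nlargest(k).index.tolist()

def similar_items(item, k=10):
    """Top-k items most similar to `item` by item-item cosine similarity."""
    col = user_item.columns.get_loc(item)
    scores = pd.Series(item_sim[col], index=user_item.columns).drop(item)
    return scores.nlargest(k).index.tolist()
```

Dropping already-seen items inside `recommend_for_user` reflects the requirement that previously viewed content is not recommended again; removed articles would need a similar filter against the content table.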

4. Hybrid Recommendation System
- Combine scores from content-based and collaborative filtering models, assigning appropriate weightings to each. Experiment with model hybrids, such as Content + Item-based or ALS + Content-based, to enhance recommendation accuracy.
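
One way to picture the weighted combination is the sketch below, which blends a TF-IDF content similarity with an item-item collaborative similarity. The article descriptions, the collaborative similarity values, and the `w_content` weight are all made up for illustration; suitable weights would have to be found by experiment.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical article descriptions, in the same order as item_ids
item_ids = ["a", "b", "c"]
descriptions = [
    "cricket match report and score",
    "cricket team wins the final match",
    "stock market closes higher today",
]

# Content-based similarity from TF-IDF vectors of the descriptions
tfidf = TfidfVectorizer(stop_words="english")
content_sim = cosine_similarity(tfidf.fit_transform(descriptions))

# Hypothetical item-item collaborative similarity (same item order)
collab_sim = np.array([
    [1.0, 0.2, 0.6],
    [0.2, 1.0, 0.1],
    [0.6, 0.1, 1.0],
])

def hybrid_similar_items(item, k=10, w_content=0.5):
    """Rank other items by a weighted sum of content and collaborative similarity."""
    i = item_ids.index(item)
    scores = w_content * content_sim[i] + (1 - w_content) * collab_sim[i]
    order = [j for j in np.argsort(-scores) if j != i]  # best first, skip the item itself
    return [item_ids[j] for j in order[:k]]
```

Setting `w_content=1.0` reduces this to pure content-based ranking and `w_content=0.0` to pure collaborative ranking, which makes it easy to compare the hybrid against each component.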

5. Model Evaluation
- Evaluate using metrics like RMSE, MAE, and precision@k for user-specific recommendations, and global precision@k for overall system performance.
- For the second goal (similar-article recommendations on individual news pages), explore online evaluation techniques that use real-time or dynamic user feedback to improve recommendation accuracy in production.
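
The offline metrics named above can be sketched in plain Python as follows. Two points are assumptions rather than something the text specifies: precision@k is divided by the number of items actually recommended (at most k), and "global precision@k" is taken to mean the average of the per-user values.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length rating sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error between two equal-length rating sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def precision_at_k(recommended, relevant, k=10):
    """Share of the top-k recommended items that appear in the relevant set."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    return sum(1 for item in top_k if item in relevant) / len(top_k)

def global_precision_at_k(recs_by_user, relevant_by_user, k=10):
    """Average precision@k over users (assumed meaning of 'global precision@k')."""
    scores = [
        precision_at_k(recs, relevant_by_user.get(user, set()), k)
        for user, recs in recs_by_user.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0

# Toy usage
recs = {"u1": ["a", "b"], "u2": ["c", "d"]}
relevant = {"u1": {"a"}, "u2": {"c", "d"}}
overall = global_precision_at_k(recs, relevant, k=2)
```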

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Data Preprocessing

In [None]:
# Load the datasets
consumer_transactions = pd.read_csv('data/consumer_transactions.csv')
consumer_transactions.head()

Unnamed: 0,event_timestamp,interaction_type,item_id,consumer_id,consumer_session_id,consumer_device_info,consumer_location,country
0,1465413032,content_watched,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,content_watched,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US
2,1465416190,content_watched,310515487419366995,-1130272294246983140,2631864456530402479,,,
3,1465413895,content_followed,310515487419366995,344280948527967603,-3167637573980064150,,,
4,1465412290,content_watched,-7820640624231356730,-445337111692715325,5611481178424124714,,,
