#### What is Matrix Factorization?
Matrix factorization is a collaborative filtering technique commonly used in recommendation systems. It approximates a large matrix (such as a user-item interaction matrix) as the product of two smaller, lower-rank matrices. The objective is to represent the original matrix in terms of latent factors that capture the underlying patterns between users and items.

For example, in a user-item matrix where each entry represents a rating or interaction between a user and an item, matrix factorization decomposes this matrix into:

- User Matrix (U): Latent factors for users
- Item Matrix (V): Latent factors for items

Multiplying the user and item matrices approximates the original matrix, including its missing entries: the predicted score for a user-item pair is the dot product of that user's and that item's latent factor vectors. This is how the model predicts preferences or ratings for items a user hasn't interacted with.
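
As a minimal sketch of that multiplication, the NumPy snippet below builds small, made-up factor matrices and multiplies them. The values of U and V are hypothetical and chosen only so the arithmetic is easy to follow; in a real model they are learned from data.

```python
import numpy as np

# Hypothetical latent factors: 3 users and 4 items, 2 factors each.
U = np.array([[1.0, 0.5],
              [0.2, 1.5],
              [0.9, 0.1]])   # shape (n_users, k)
V = np.array([[4.0, 0.5],
              [3.0, 1.0],
              [0.5, 3.0],
              [1.0, 2.0]])   # shape (n_items, k)

# Each predicted score is the dot product of one user row and one item row.
R_hat = U @ V.T              # shape (n_users, n_items)

print(R_hat.shape)           # (3, 4)
print(R_hat[0, 2])           # 1.0*0.5 + 0.5*3.0 = 2.0
```

Every cell of R_hat is filled in, including user-item pairs that had no rating in the original matrix, which is what makes the factorization usable for prediction.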

#### Where is Matrix Factorization Used?
Matrix factorization is widely used in recommendation systems, especially in collaborative filtering. Key applications include:

- Movie Recommendations: Predicting movies that a user may like based on their viewing history.
- Music Recommendations: Suggesting songs or artists based on users’ listening habits.
- E-commerce: Recommending products to users based on purchase history or browsing patterns.
- Content Filtering: Filtering content on social media platforms based on user interests.

#### Getting a Dataset in Python
Let’s use the MovieLens dataset, a popular dataset for recommendation systems. The MovieLens dataset provides user ratings for movies, making it ideal for matrix factorization.

You can download and load it as follows:

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the MovieLens 100k dataset
!wget -nc http://files.grouplens.org/datasets/movielens/ml-100k.zip
!unzip -o ml-100k.zip

# Read the ratings data
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=column_names)
ratings = ratings[['user_id', 'item_id', 'rating']]  # Drop timestamp for simplicity

# Split into train and test sets
train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)


--2024-10-27 12:51:44--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘ml-100k.zip’


2024-10-27 12:51:45 (6.83 MB/s) - ‘ml-100k.zip’ saved [4924029/4924029]

Archive:  ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base    

#### Building a Matrix Factorization Model in Python
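The Surprise library has a built-in implementation of matrix factorization called SVD (the Funk-style algorithm popularized during the Netflix Prize, not a classical singular value decomposition), so we can quickly set up a recommendation model: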

In [2]:
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

# Prepare data for Surprise library
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings, reader)

# Split data into train and test sets
trainset, testset = train_test_split(data, test_size=0.2)

# Build and train the SVD model
model = SVD()
model.fit(trainset)

# Make predictions on the test set
predictions = model.test(testset)

In this code, Surprise's SVD model uses its default of 100 latent factors (the n_factors parameter), so each user and each item is represented by a 100-dimensional vector. This number can be tuned to improve model performance.

#### Optimizing the Number of Latent Dimensions
To find the optimal number of latent dimensions, you can use grid search with cross-validation. In Surprise, the n_factors parameter of the SVD class controls the number of latent factors. Here’s how to run a 5-fold cross-validated grid search over it:

In [5]:
from surprise.model_selection import GridSearchCV

# Define a parameter grid for SVD
param_grid = {'n_factors': [20, 50, 100, 150, 200]}

# Set up GridSearchCV
grid_search = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)
grid_search.fit(data)

# Best score and parameters
print("Best RMSE score:", grid_search.best_score['rmse'])
print("Best parameters:", grid_search.best_params['rmse'])


Best RMSE score: 0.9357386532003883
Best parameters: {'n_factors': 50}


#### Evaluation Metrics
For evaluating the recommendation model, consider these metrics:

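- Root Mean Square Error (RMSE): The square root of the mean squared difference between predicted and actual ratings; large errors are penalized more heavily.
- Mean Absolute Error (MAE): The mean absolute difference between predicted and actual ratings; less sensitive to large errors than RMSE.
- Precision@K and Recall@K: For top-K recommendations, the fraction of the top-K items that are relevant (precision) and the fraction of all relevant items that appear in the top K (recall).
- Mean Average Precision (MAP): The mean, across users, of average precision, which rewards placing relevant items near the top of the ranking.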
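
To make these definitions concrete, here is a small sketch of RMSE, MAE, and Precision@K/Recall@K written with NumPy and plain Python. The function names and the sample ratings are illustrative assumptions, not part of Surprise or any other library.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def precision_recall_at_k(recommended, relevant, k):
    """recommended: ranked list of item ids; relevant: set of relevant item ids."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up ratings and predictions for illustration.
actual = [4, 3, 5, 2]
predicted = [3.5, 3.0, 4.0, 3.0]
print(rmse(actual, predicted))   # 0.75
print(mae(actual, predicted))    # 0.625

# One relevant item (20) appears among the top 3 recommendations.
print(precision_recall_at_k([10, 20, 30, 40], {20, 40, 50}, k=3))
```

The same errors give a larger RMSE than MAE here because squaring weights the two errors of size 1 more heavily than the error of size 0.5.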

#### Calculating RMSE
Surprise's accuracy module, imported earlier, computes the RMSE of the test-set predictions:

In [3]:
# Calculate RMSE
rmse = accuracy.rmse(predictions)
rmse

RMSE: 0.9498


0.9498076019628686

#### Understanding RMSE in Recommendation Systems

- Range of RMSE: RMSE (Root Mean Squared Error) ranges from 0 to ∞.
- 0 represents perfect prediction (no error), while larger values indicate higher error between predictions and actual values.
- In practical recommendation systems, a lower RMSE means the model's predictions are closer to actual user preferences.

#### What’s a Good RMSE Value?

- Good RMSE values depend on the dataset and rating scale. For movie ratings on a 1-5 scale, an RMSE between 0.8 and 1.0 is generally considered reasonable.
- Bad RMSE values vary, but on the same scale anything over 1.2 may indicate the model is struggling to capture user preferences accurately.
- Benchmarking: RMSE alone isn’t always sufficient for determining a "good" model. It’s useful to benchmark against other models or baselines (e.g., a random recommender, a popular-item recommender, or a global-mean predictor) to gauge relative performance.
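
As one way to set up such a benchmark, the sketch below compares a model's RMSE with that of a trivial baseline that always predicts the global mean training rating. All numbers, including the `model_pred` values, are made up for illustration.

```python
import numpy as np

# Made-up ratings; in practice these come from the train/test split.
train_ratings = np.array([4, 5, 3, 4, 2, 5, 4, 3], dtype=float)
test_ratings = np.array([4, 2, 5, 3], dtype=float)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Baseline: predict the global mean of the training ratings for every test pair.
global_mean = train_ratings.mean()
baseline_pred = np.full_like(test_ratings, global_mean)
baseline_rmse = rmse(test_ratings, baseline_pred)

# Hypothetical predictions from a trained model for the same test pairs.
model_pred = np.array([3.8, 2.9, 4.4, 3.2])
model_rmse = rmse(test_ratings, model_pred)

print(f"baseline RMSE: {baseline_rmse:.3f}")
print(f"model RMSE:    {model_rmse:.3f}")
```

A model is only worth deploying if it beats a baseline like this by a meaningful margin on held-out data.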

#### Other Considerations with Matrix Factorization

- Cold Start Problem: Matrix factorization struggles with new users or items that have little interaction data. Hybrid methods (combining matrix factorization with content-based features) can help mitigate this.
- Implicit vs. Explicit Feedback: Matrix factorization is often designed for explicit feedback like ratings, but in real-world applications, implicit feedback (e.g., clicks or views) is more common. Extensions like Alternating Least Squares (ALS) for implicit feedback are often used.
- Regularization: Regularization is key to prevent overfitting, as it helps the model generalize by penalizing large factor values.
- Latent Factor Interpretation: Latent factors can sometimes be interpretable, e.g., representing genres in movies or price sensitivity in e-commerce. But interpretation is not guaranteed and often requires careful analysis.
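
To illustrate the regularization point, here is a minimal sketch of matrix factorization trained by stochastic gradient descent with an L2 penalty on the factors. It is a simplified stand-in, not the implementation used by Surprise; the function name `factorize`, the hyperparameter values, and the toy ratings are all assumptions.

```python
import numpy as np

def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.05, epochs=300, seed=0):
    """Fit U (n_users x k) and V (n_items x k) by SGD on observed (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):
            u, i, r = ratings[idx]
            err = r - U[u] @ V[i]          # prediction error on this rating
            u_old = U[u].copy()            # keep the old user vector for the item update
            # Gradient step; the "- reg * ..." term shrinks large factor values.
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V

# Toy data: (user index, item index, rating).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 1, 2.0), (2, 2, 5.0), (3, 0, 5.0), (3, 2, 2.0)]

U, V = factorize(ratings, n_users=4, n_items=3)
train_rmse = float(np.sqrt(np.mean([(r - U[u] @ V[i]) ** 2 for u, i, r in ratings])))
print(f"training RMSE: {train_rmse:.3f}")
```

Raising `reg` trades a higher training error for smaller factors, which usually helps on sparse data where each user or item has only a few ratings.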


#### Generating a Rating or Score from User Actions in E-Commerce
In e-commerce, we often don’t have explicit ratings. Instead, we can generate an implicit score based on user behavior and interaction data. Here’s how we can approximate a "rating" from different website events:

- Assign Implicit Scores to User Actions:

    - Page View: Indicates mild interest. Assign a lower score, e.g., 1.
    - Click on Product: Stronger interest. Assign a score, e.g., 2.
    - Add to Cart: Indicates high interest. Assign a score, e.g., 3.
    - Purchase: Strongest signal. Assign a high score, e.g., 5.

- Weight Different Actions:

Combine these actions for each user-item pair. For example:
```
score = α × views + β × clicks + γ × add to cart + δ × purchase
```

where views, clicks, add to cart, and purchase are the number of times each action occurred for the pair, and α, β, γ, δ are weights representing the relative importance of each action. Adjust these weights based on which behaviors best predict purchase or user preference.

- Time Decay:

    - Apply time decay to make recent actions more influential. For example, if a user recently viewed an item, it’s more indicative of interest than a view from months ago.

- Build an Interaction Matrix:

    - With these scores, create an interaction matrix similar to a user-item rating matrix. Matrix factorization can then be applied to predict missing values in this implicit score matrix.

- Event-Driven Modeling:

    - Advanced e-commerce recommenders use event-driven scoring. For example, the model may update scores in real-time as a user interacts with items, adjusting recommendations dynamically.
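
The weighting and time-decay steps above can be sketched in a few lines of plain Python. The weights reuse the example scores from the list (1, 2, 3, 5), and the 30-day half-life is an assumption chosen for illustration; both would need tuning on real data.

```python
# Example action weights (the α, β, γ, δ of the formula above).
WEIGHTS = {"view": 1.0, "click": 2.0, "add_to_cart": 3.0, "purchase": 5.0}
HALF_LIFE_DAYS = 30.0  # assumed: an event loses half its weight every 30 days

def decayed_weight(action, age_days):
    """Weight of a single event after exponential time decay."""
    return WEIGHTS[action] * 0.5 ** (age_days / HALF_LIFE_DAYS)

def implicit_scores(events):
    """Sum decayed weights per (user_id, item_id).

    events: iterable of (user_id, item_id, action, age_days) tuples.
    """
    scores = {}
    for user_id, item_id, action, age_days in events:
        key = (user_id, item_id)
        scores[key] = scores.get(key, 0.0) + decayed_weight(action, age_days)
    return scores

events = [
    (1, 101, "view", 0),        # today: full weight 1.0
    (1, 101, "purchase", 30),   # one half-life ago: 5.0 * 0.5 = 2.5
    (2, 101, "click", 60),      # two half-lives ago: 2.0 * 0.25 = 0.5
]
scores = implicit_scores(events)
print(scores)   # {(1, 101): 3.5, (2, 101): 0.5}
```

The resulting dictionary maps directly onto the entries of the interaction matrix, with one value per user-item pair.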

#### Example in Python for Implicit Ratings with Matrix Factorization
Here’s an outline using the implicit library in Python for ALS on implicit feedback data:

```python
import pandas as pd
from scipy.sparse import coo_matrix
from implicit.als import AlternatingLeastSquares

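# Sample implicit feedback data
data = {'user_id': [1, 2, 3, 1], 'item_id': [101, 101, 102, 103], 'action': ['view', 'add_to_cart', 'purchase', 'click']}
df = pd.DataFrame(data)

# Assign scores for each action
score_map = {'view': 1, 'click': 2, 'add_to_cart': 3, 'purchase': 5}
df['score'] = df['action'].map(score_map).astype('float32')

# Map raw IDs to contiguous 0-based indices; using the raw IDs directly
# would allocate empty rows and columns up to the largest ID
user_idx = df['user_id'].astype('category').cat.codes
item_idx = df['item_id'].astype('category').cat.codes

# Create a sparse user-item matrix (rows = users, columns = items);
# duplicate (user, item) pairs are summed when converting to CSR
user_item_matrix = coo_matrix((df['score'], (user_idx, item_idx))).tocsr()

# Train ALS model
# Note: implicit >= 0.5 expects a user-item matrix in fit(); releases before 0.5
# expected item-user, in which case pass user_item_matrix.T.tocsr() instead
model = AlternatingLeastSquares(factors=20, regularization=0.1)
model.fit(user_item_matrix)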
```

#### Evaluation Metrics for Implicit Feedback Models
- Mean Average Precision at K (MAP@K): Averages, across users, the precision at each rank in the top K where a relevant item appears, so earlier hits count more.
- Precision@K and Recall@K: Track how many relevant items are in the top-K results.
- Normalized Discounted Cumulative Gain (NDCG): Measures ranking quality and prioritizes correct recommendations in the top positions.
- Hit Rate: The fraction of users for whom at least one relevant item appears in the top-K recommendations.
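
For a single user, these ranking metrics can be sketched as below. The sketch assumes binary relevance (an item is either relevant or not) and normalizes average precision by min(K, number of relevant items); other conventions exist, and the function names are illustrative.

```python
import math

def hit_at_k(recommended, relevant, k):
    """1.0 if any of the top-k items is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in recommended[:k]) else 0.0

def average_precision_at_k(recommended, relevant, k):
    """Average of precision@rank over the ranks where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    denom = min(k, len(relevant))
    return total / denom if denom else 0.0

def ndcg_at_k(recommended, relevant, k):
    """DCG of the ranking divided by the DCG of an ideal ranking (binary relevance)."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(recommended[:k], start=1)
              if item in relevant)
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

recommended = [10, 20, 30, 40]   # ranked recommendations for one user
relevant = {20, 40}              # items the user actually interacted with

print(hit_at_k(recommended, relevant, k=4))                 # 1.0
print(average_precision_at_k(recommended, relevant, k=4))   # (1/2 + 2/4) / 2 = 0.5
print(ndcg_at_k(recommended, relevant, k=4))
```

MAP@K and Hit Rate for a whole system are then the means of `average_precision_at_k` and `hit_at_k` over all evaluated users.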

By combining implicit scoring techniques with matrix factorization or other recommendation algorithms, we can build effective recommendation systems in real-world e-commerce environments.

In [4]:
from IPython.display import IFrame

# Embed the YouTube video
IFrame('https://www.youtube.com/embed/ZspR5PZemcs', width=800, height=450)


The video explains the concept of matrix factorization, a technique used in recommendation systems like Netflix to predict user ratings for items they haven't interacted with.

#### Key points:

- Implicit Ratings: In e-commerce, explicit ratings are often unavailable. Implicit ratings can be derived from user behavior (e.g., views, clicks, purchases).
- Matrix Factorization: This technique decomposes a large user-item rating matrix into two smaller matrices: user features and item features.
- Feature Engineering: Features can be explicit (e.g., genre, director) or latent (discovered through the factorization process).
- Predicting Ratings: By multiplying the user and item feature matrices, we can predict ratings for unrated items.
- Gradient Descent: This optimization algorithm is used to find the optimal values for the user and item features.
- Benefits of Matrix Factorization:
    - Improved storage efficiency
    - Ability to handle sparse data
    - Effective prediction of user preferences
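
The storage benefit can be checked with quick arithmetic. The sketch below uses the MovieLens 100k dimensions (943 users, 1,682 movies) and the 50 latent factors selected by the grid search earlier, and compares against a fully dense user-item matrix, which is an assumption about how the ratings would otherwise be stored.

```python
n_users, n_items = 943, 1682   # MovieLens 100k dimensions
k = 50                         # number of latent factors

dense_cells = n_users * n_items           # entries in the full user-item matrix
factor_params = (n_users + n_items) * k   # entries in U and V combined

print(dense_cells)                              # 1586126
print(factor_params)                            # 131250
print(round(dense_cells / factor_params, 1))    # 12.1
```

The two factor matrices hold roughly 12 times fewer numbers than the dense matrix, and the gap widens as the number of users and items grows while k stays fixed.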