# Recommender systems

## Collaborative filtering

### Making recommendations

**Overview:**
- Recommender systems are crucial for many online platforms like Amazon, Netflix, and food delivery services, driving significant sales and user engagement.
- They suggest products, movies, articles, etc., based on user preferences.

**Key Concepts:**
1. **Users and Items:**
   - Users are individuals (e.g., customers, viewers), and items are the things being recommended (e.g., products, movies).
   - $n_u$ = number of users (in this example, 4 users: Alice, Bob, Carol, Dave).
   - $n_m$ = number of items/movies (in this example, 5 movies).

2. **Ratings:**
   - Users rate items from 0 to 5 stars.
   - Not all users rate every item. Missing ratings are marked as a question mark.

3. **Matrix Representation:**
   - $r(i,j) = 1$ if user $j$ has rated movie $i$, and $r(i,j) = 0$ if not.
   - Example: Alice (user 1) rated Movie 1, so $r(1,1) = 1$. Alice did not rate Movie 3, so $r(3,1) = 0$.
   - $y(i,j)$ = the rating user $j$ gave to movie $i$ (e.g., $y(3,2) = 4$ means Bob, user 2, rated Movie 3 as 4 stars). It is defined only where $r(i,j) = 1$.

4. **Goal of Recommender Systems:**
   - Predict how users will rate movies they haven't watched yet.
   - Recommend movies that users are more likely to rate highly (e.g., 5 stars).

5. **Assumptions:**
   - Temporarily assume access to **features** about movies (e.g., genre like romance, action).
   - Later, explore how to build algorithms without these features.
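
As a concrete picture of the notation above, the sketch below stores the ratings as a nested list, with `None` for a missing rating, and derives $r(i,j)$ from it. The rating values are illustrative only (chosen to agree with the examples above, such as $r(3,1) = 0$ and $y(3,2) = 4$), and the names `Y`, `R`, `n_m`, and `n_u` are assumptions of this sketch. Python indexes from 0, so movie $i$ and user $j$ map to `Y[i - 1][j - 1]`.

```python
# Rows are movies (i), columns are users (j) in the order Alice, Bob, Carol, Dave.
# None stands for a missing rating ("?"). The values are illustrative only.
Y = [
    [5,    5,    0,    0],     # Movie 1
    [5,    None, None, 0],     # Movie 2
    [None, 4,    0,    None],  # Movie 3
    [0,    0,    5,    4],     # Movie 4
    [0,    0,    5,    None],  # Movie 5
]

# r(i,j) = 1 if user j has rated movie i, else 0.
R = [[0 if rating is None else 1 for rating in row] for row in Y]

n_m = len(Y)     # number of movies
n_u = len(Y[0])  # number of users

print(n_u, n_m)          # 4 5
print(R[0][0], R[2][0])  # r(1,1) = 1 and r(3,1) = 0
print(Y[2][1])           # y(3,2) = 4
```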

**Next Steps:**
- Develop an algorithm for prediction based on the features.
- Explore feature-less algorithms for recommendation.

This provides a foundational understanding of how recommender systems work with user-item matrices and introduces the key mathematical notations used in rating prediction models.

### Using per-item features

**Overview:**
- Enhancing recommender systems by incorporating features of items (e.g., movies).
- Focus on predicting movie ratings using features related to movie genres.

**Key Concepts:**
1. **User-Item Matrix:**
   - Users (e.g., Alice, Bob, Carol, Dave) rate movies (e.g., Love at Last, Nonstop Car Chases).
   - Each movie is described by two features, $x_1$ and $x_2$:
     - $x_1$: Romance level
     - $x_2$: Action level
   - Example features:
     - Love at Last: $(0.9, 0)$
     - Nonstop Car Chases: $(0.1, 1.0)$

2. **Prediction Model:**
   - For user $j$, predict the rating for movie $i$ as:
     $$\hat{y}(i,j) = w^{(j)} \cdot x^{(i)} + b^{(j)}$$
   - This is linear regression with a weight vector $w^{(j)}$ and a bias $b^{(j)}$, applied to the movie's feature vector $x^{(i)}$.

3. **Parameter Notation:**
   - Each user has their own parameters, indexed by a superscript (e.g., $w^{(1)}$ and $b^{(1)}$ for Alice).
   - Likewise, $x^{(i)}$ denotes the feature vector of movie $i$.

4. **Cost Function:**
   - Squared-error cost for user $j$, where $m^{(j)}$ is the number of movies rated by user $j$:
     $$J(w^{(j)}, b^{(j)}) = \frac{1}{2m^{(j)}} \sum_{i: r(i,j)=1} \left( \hat{y}(i,j) - y(i,j) \right)^2$$
   - The sum runs only over movies rated by user $j$.

5. **Regularization:**
   - To prevent overfitting, add a regularization term over the $n$ feature weights:
     $$J(w^{(j)}, b^{(j)}) = \frac{1}{2m^{(j)}} \sum_{i: r(i,j)=1} \left( \hat{y}(i,j) - y(i,j) \right)^2 + \frac{\lambda}{2m^{(j)}} \sum_{k=1}^{n} \left( w_k^{(j)} \right)^2$$
   - $\lambda$: Regularization parameter.

6. **Learning Parameters:**
   - Minimize the total cost over all users, where $n_u$ is the number of users:
     $$J = \sum_{j=1}^{n_u} J(w^{(j)}, b^{(j)})$$
   - Use optimization techniques (e.g., gradient descent) to find optimal $w$ and $b$ for all users.
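
A minimal sketch of the per-user model and its regularized cost, in plain Python. The two feature vectors are the ones listed above. Alice's parameters (`w_alice`, `b_alice`), her two ratings, and `lam` are hypothetical values picked for illustration. The cost here scales both the squared-error term and the regularization term by $1/(2m^{(j)})$, which is an assumption of this sketch.

```python
def predict(w, b, x):
    """Linear prediction w . x + b for one user and one movie."""
    return sum(wk * xk for wk, xk in zip(w, x)) + b

def user_cost(w, b, rated, lam):
    """Regularized squared-error cost over the movies one user has rated.

    rated: list of (feature_vector, rating) pairs, i.e. only movies with r(i,j) = 1.
    """
    m = len(rated)
    squared_error = sum((predict(w, b, x) - y) ** 2 for x, y in rated)
    regularization = lam * sum(wk ** 2 for wk in w)
    return (squared_error + regularization) / (2 * m)

# Features (romance, action) from the example above.
love_at_last = [0.9, 0.0]
nonstop_car_chases = [0.1, 1.0]

# Hypothetical parameters for a user who likes romance and dislikes action.
w_alice, b_alice = [5.0, 0.0], 0.0

pred_romance = predict(w_alice, b_alice, love_at_last)
pred_action = predict(w_alice, b_alice, nonstop_car_chases)

# Hypothetical ratings: 5 stars for the romance movie, 0 for the action movie.
cost = user_cost(w_alice, b_alice, [(love_at_last, 5), (nonstop_car_chases, 0)], lam=0.1)
print(pred_romance, pred_action, cost)
```

With these values the predictions come out at about 4.5 and 0.5 stars, and the cost at about 0.75, most of which is the regularization penalty on the large romance weight.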

**Next Steps:**
- Explore modifications to the algorithm for cases where features $X$ are not available.
- Develop a recommender system without relying on detailed features to make predictions.

This summary provides a concise understanding of how features can be integrated into recommender systems and the associated modeling and optimization techniques.

### Collaborative filtering algorithm

This lecture covers how to derive features for movies in a collaborative filtering context, especially when those features are not known in advance. Here’s a summary of the key points:

1. **Learning Features**: In collaborative filtering, if you lack predefined features (like how much a movie is a romance or an action), you can infer these features (denoted as $x_1$ and $x_2$) based on user ratings and learned parameters.

2. **User Parameters**: Suppose the user parameters $w^{(j)}$ (weights) and $b^{(j)}$ (biases) have already been learned. To keep the example simple, the biases are set to zero.

3. **Predicting Ratings**: The prediction for user $j$'s rating of movie $i$ is the dot product of the user's weights and the movie's features, plus the bias:
   $$\text{Predicted rating} = w^{(j)} \cdot x^{(i)} + b^{(j)}$$
   With $b^{(j)} = 0$, this simplifies to $w^{(j)} \cdot x^{(i)}$.

4. **Cost Function for Features**: To learn the movie features, a cost function based on minimizing the squared error between predicted ratings and actual ratings is used. This involves summing over all users who rated the movie.

5. **Regularization**: A regularization term can be added to prevent overfitting, yielding a final cost function that sums over all movies and users.

6. **Combining Cost Functions**: The overall cost function for collaborative filtering combines the costs for learning user parameters ($w$ and $b$) and movie features ($x$). It can be minimized using gradient descent.

7. **Gradient Descent Updates**: Updates are performed for all parameters—weights, biases, and features—based on their respective gradients.

8. **Collaborative Filtering**: This algorithm relies on the collaborative input of multiple users to help predict ratings for unobserved items, leveraging the shared ratings to infer movie features.

9. **Binary Labels**: The approach discussed applies to numeric star ratings (e.g., 0 to 5 stars). Future videos will address how to adapt this framework for binary labels (e.g., like/dislike).
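
Written out in the notation of the earlier sections, the cost functions described in points 4 to 6 can be reconstructed as follows. This is a sketch under two assumptions: a $\frac{1}{2}$ scaling with no division by the number of ratings, and $n$ features per movie.

For a single movie $i$, with the user parameters held fixed:

$$J(x^{(i)}) = \frac{1}{2} \sum_{j: r(i,j)=1} \left( w^{(j)} \cdot x^{(i)} + b^{(j)} - y(i,j) \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

Combining the user-parameter cost and the feature cost gives one objective over all rated pairs:

$$J(w, b, x) = \frac{1}{2} \sum_{(i,j): r(i,j)=1} \left( w^{(j)} \cdot x^{(i)} + b^{(j)} - y(i,j) \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( w_k^{(j)} \right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left( x_k^{(i)} \right)^2$$

Gradient descent then updates every parameter, including the features, with learning rate $\alpha$:

$$w_k^{(j)} \gets w_k^{(j)} - \alpha \frac{\partial J}{\partial w_k^{(j)}}, \qquad b^{(j)} \gets b^{(j)} - \alpha \frac{\partial J}{\partial b^{(j)}}, \qquad x_k^{(i)} \gets x_k^{(i)} - \alpha \frac{\partial J}{\partial x_k^{(i)}}$$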

This summary highlights the innovative aspect of collaborative filtering, where features can be learned from user ratings rather than needing them to be predefined, enabling more flexible and powerful recommendation systems.
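
The following plain-Python sketch shows what learning the user parameters and the movie features together can look like. It is an illustration under assumptions, not the lecture's implementation: the ratings are made up, the cost uses a $\frac{1}{2}$ scaling with an L2 penalty on both `W` and `X`, and the names `cofi_cost` and `cofi_step`, the number of features `n`, and the values of `lam` and `alpha` are hypothetical choices.

```python
import random

def cofi_cost(X, W, b, Y, R, lam):
    """Squared error over rated pairs plus L2 penalties on W and X."""
    error = 0.0
    for i, x_i in enumerate(X):
        for j, w_j in enumerate(W):
            if R[i][j]:
                pred = sum(wk * xk for wk, xk in zip(w_j, x_i)) + b[j]
                error += (pred - Y[i][j]) ** 2
    penalty = sum(v * v for row in W for v in row) + sum(v * v for row in X for v in row)
    return 0.5 * error + 0.5 * lam * penalty

def cofi_step(X, W, b, Y, R, lam, alpha):
    """One simultaneous gradient-descent update of X, W and b, in place."""
    n = len(X[0])
    grad_X = [[lam * v for v in row] for row in X]
    grad_W = [[lam * v for v in row] for row in W]
    grad_b = [0.0] * len(b)
    for i in range(len(X)):
        for j in range(len(W)):
            if R[i][j]:  # only rated pairs contribute to the error gradient
                err = sum(W[j][k] * X[i][k] for k in range(n)) + b[j] - Y[i][j]
                for k in range(n):
                    grad_W[j][k] += err * X[i][k]
                    grad_X[i][k] += err * W[j][k]
                grad_b[j] += err
    # All gradients were computed from the old values; now apply the updates.
    for i in range(len(X)):
        for k in range(n):
            X[i][k] -= alpha * grad_X[i][k]
    for j in range(len(W)):
        for k in range(n):
            W[j][k] -= alpha * grad_W[j][k]
        b[j] -= alpha * grad_b[j]

# Illustrative ratings: rows are movies, columns are users. R marks the rated entries;
# the zeros in Y at unrated positions are placeholders that R masks out.
Y = [[5, 5, 0, 0], [5, 0, 0, 0], [0, 4, 0, 0], [0, 0, 5, 4], [0, 0, 5, 0]]
R = [[1, 1, 1, 1], [1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0]]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2                   # number of features to learn per movie (a free choice)
X = [[rng.gauss(0, 0.1) for _ in range(n)] for _ in Y]     # movie features
W = [[rng.gauss(0, 0.1) for _ in range(n)] for _ in Y[0]]  # user weights
b = [0.0] * len(Y[0])                                      # user biases

cost_before = cofi_cost(X, W, b, Y, R, lam=0.1)
for _ in range(2000):
    cofi_step(X, W, b, Y, R, lam=0.1, alpha=0.005)
cost_after = cofi_cost(X, W, b, Y, R, lam=0.1)
print(cost_before, cost_after)
```

The features start as small random numbers rather than zeros: if both `X` and `W` started at exactly zero, their gradients would also be zero and only the biases would ever move.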

### Binary labels: favs, likes and clicks

This lecture delves into generalizing collaborative filtering algorithms to handle binary labels, where users indicate whether they liked or engaged with an item (e.g., a movie) rather than providing a numerical rating. Here’s a summary of the key points covered:

#### Overview of Binary Labels
1. **Binary Label Definition**:
   - A label of **1** indicates that the user liked or engaged with an item after being shown it (e.g., watched a movie to the end, hit "like").
   - A label of **0** indicates that the user was shown the item but did not engage (e.g., stopped watching partway through).
   - A **?** signifies that the user has not yet been shown the item.

2. **Examples of Labels**:
   - **E-commerce**: 1 if a user purchased an item after being shown it, 0 if not.
   - **Social Media**: 1 if a user liked an item after being shown it, 0 if not.
   - **Behavioral Metrics**: 1 if a user spent at least a threshold amount of time with an item (e.g., 30 seconds), 0 if not.

#### Generalizing the Algorithm
1. **Predictive Model**:
   - Transition from predicting a numeric rating $y(i,j)$ with a linear model to predicting the probability of engagement using the logistic function:
     $$P(y(i,j) = 1) = g(w^{(j)} \cdot x^{(i)} + b^{(j)})$$
     where $g(z) = \frac{1}{1 + e^{-z}}$.

2. **Cost Function Modification**:
   - Replace the squared error cost function with the binary cross-entropy loss for a single example, where $f(x)$ is the predicted probability:
     $$L(f(x), y) = -y \log(f(x)) - (1 - y) \log(1 - f(x))$$
   - Adapt this loss to collaborative filtering by setting $f(x) = g(w^{(j)} \cdot x^{(i)} + b^{(j)})$ and summing over all user-item pairs that have a label:
     $$J(w, b, x) = \sum_{(i,j): r(i,j) = 1} L\left( g(w^{(j)} \cdot x^{(i)} + b^{(j)}),\; y(i,j) \right)$$
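
The binary-label cost can be sketched in a few lines of plain Python. The toy data (`X`, `W`, `b`, `Y`, `R`) and the function names are hypothetical and only serve to exercise the formula; no regularization term is included here.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Estimated probability that the label is 1 for one user and one item."""
    return sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b)

def binary_cofi_cost(X, W, b, Y, R):
    """Sum of binary cross-entropy losses over the pairs with r(i,j) = 1."""
    total = 0.0
    for i, x_i in enumerate(X):
        for j, w_j in enumerate(W):
            if R[i][j]:
                f = predict_proba(w_j, b[j], x_i)
                y = Y[i][j]
                total += -y * math.log(f) - (1 - y) * math.log(1 - f)
    return total

# Hypothetical toy data: 2 items, 2 users, 1 learned feature per item.
X = [[1.0], [-1.0]]    # item features
W = [[2.0], [0.0]]     # user weights
b = [0.0, 0.0]         # user biases
Y = [[1, 0], [0, 0]]   # binary labels; the entry at [1][1] is unobserved
R = [[1, 1], [1, 0]]   # 1 where a label exists

cost = binary_cofi_cost(X, W, b, Y, R)
print(cost)
```

The second user has zero weights, so their predicted probability is exactly 0.5 and that pair contributes $\log 2$ to the cost regardless of the label.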

#### Implementation Tips
- The lecture concludes by hinting at practical implementation strategies and optimizations that can enhance the algorithm's efficiency and performance. The next lecture promises to cover these implementation details.

#### Key Takeaway
This generalization significantly expands the applicability of collaborative filtering algorithms to scenarios where explicit ratings are not available, utilizing user engagement data instead to inform recommendations. 

## Recommender systems implementation detail

### Mean normalization

1. **Mean Normalization Overview**:
   - Enhances algorithm efficiency and prediction accuracy, especially for new users with few or no ratings.

2. **Dataset Example**:
   - Initial ratings table includes movie ratings by users, with some ratings as question marks (indicating unrated movies).

3. **Effect on New Users**:
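   - Without normalization, a new user with no ratings ends up with parameters at or near zero: regularization pulls the weights to zero, and no data moves the bias away from its initial value (assumed here to be zero). The model then predicts roughly zero stars for every movie, which is unrealistic.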

4. **Mean Normalization Process**:
   - Calculate the average rating $\mu_i$ of each movie $i$, using only the users who rated it.
   - Subtract $\mu_i$ from each rating of movie $i$ to get the normalized ratings used for training.

5. **Implementation**:
   - For predictions, add the movie's average rating back so that the predicted scores stay on the original scale:
     - $\text{Predicted Rating} = w^{(j)} \cdot x^{(i)} + b^{(j)} + \mu_i$

6. **Benefits of Mean Normalization**:
   - Provides more reasonable initial ratings for new users based on existing data.
   - Speeds up optimization algorithms.

7. **Normalization Alternatives**:
   - The ratings can instead be normalized per column (per user), which helps with new items that have no ratings. Normalizing per row (per movie) is usually the more important choice because it handles new users.

8. **Conclusion**:
   - Mean normalization is critical for improving predictions in collaborative filtering systems, particularly for users with limited input.
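
A short sketch of the normalization and its effect on a brand-new user, in plain Python. The ratings are illustrative, `mean_normalize` is a hypothetical helper name, and the new user's parameters are simply set to zero to mimic what training with regularization would produce.

```python
def mean_normalize(Y, R):
    """Subtract each movie's mean rating (over rated entries only) from its row."""
    mu = []
    Y_norm = []
    for y_row, r_row in zip(Y, R):
        rated = [y for y, r in zip(y_row, r_row) if r]
        mean = sum(rated) / len(rated) if rated else 0.0
        mu.append(mean)
        # Unrated entries are left at 0; they are masked out by R during training.
        Y_norm.append([(y - mean) if r else 0.0 for y, r in zip(y_row, r_row)])
    return Y_norm, mu

# Illustrative ratings: rows are movies, columns are users; R marks the rated entries.
Y = [[5, 5, 0, 0], [5, 0, 0, 0], [0, 4, 0, 0], [0, 0, 5, 4], [0, 0, 5, 0]]
R = [[1, 1, 1, 1], [1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0]]

Y_norm, mu = mean_normalize(Y, R)
print(mu)  # approximately [2.5, 2.5, 2.0, 2.25, 1.67]

# A new user with no ratings: weights and bias are taken to be zero here.
w_new, b_new = [0.0, 0.0], 0.0
x_any = [0.3, 0.7]  # hypothetical learned features; irrelevant when w is zero
predictions = [sum(wk * xk for wk, xk in zip(w_new, x_any)) + b_new + m for m in mu]
print(predictions)  # equal to mu: each movie's average rating
```

Because the mean is added back at prediction time, the new user is predicted to give each movie its average rating instead of zero stars.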


### TensorFlow implementation of collaborative filtering

1. **Overview of TensorFlow for Collaborative Filtering**:
   - TensorFlow can be utilized not only for neural networks but also for algorithms like collaborative filtering.
   - Automatic differentiation (Auto Diff) in TensorFlow simplifies the process of implementing gradient descent.

2. **Gradient Descent Basics**:
   - Key formula for updating parameters $w$:
     $$w \gets w - \alpha \frac{dJ}{dw}$$
   - Example cost function: $J = (wx - 1)^2$, where $wx$ is the model prediction and 1 is the target value.

3. **Using TensorFlow's Gradient Tape**:
   - The gradient tape records operations for automatic differentiation.
   - Code implementation involves initializing parameters and using `tf.GradientTape()` to compute derivatives automatically.

4. **Implementation Steps**:
   - Define $w$ as a `tf.Variable` and initialize it (e.g., $w = 3.0$).
   - Compute the cost $J$ inside a `with tf.GradientTape() as tape:` block so that the operations are recorded.
   - Call `tape.gradient()` to retrieve the derivatives of $J$ with respect to the parameters.

5. **Optimization with Adam**:
   - TensorFlow allows the use of advanced optimizers like Adam:
     - Set up optimizer with learning rate.
     - Run iterations to compute gradients and update parameters.

6. **Collaborative Filtering Cost Function**:
   - Cost function $J$ inputs include:
     - Parameters $x, w, b$
     - Normalized ratings
     - Regularization parameter $\lambda$

7. **Real Dataset Example**:
   - The MovieLens dataset will be used in practice labs for implementing collaborative filtering.

8. **Limitations of Standard Layers**:
   - The collaborative filtering algorithm's cost function doesn't fit standard dense layers in TensorFlow, necessitating a custom implementation.

9. **Conclusion**:
   - TensorFlow's Auto Diff and optimization capabilities facilitate effective implementation of collaborative filtering.
   - Upcoming content will explore finding related items in the context of collaborative filtering.
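
The update rule and example cost from point 2 can be traced by hand. The sketch below is plain Python rather than TensorFlow: it uses the hand-derived derivative $dJ/dw = 2x(wx - 1)$, which is the quantity automatic differentiation would compute for this cost, and checks it against a finite-difference estimate. The values `x = 1.0`, `alpha = 0.01`, and the 30 iterations are assumptions chosen for illustration.

```python
def cost(w, x=1.0):
    """Example cost J = (w*x - 1)^2."""
    return (w * x - 1.0) ** 2

def dcost_dw(w, x=1.0):
    """Hand-derived derivative dJ/dw = 2 * x * (w*x - 1)."""
    return 2.0 * x * (w * x - 1.0)

w = 3.0        # initial parameter value
alpha = 0.01   # learning rate

# Sanity check of the derivative with a central finite difference.
eps = 1e-6
numeric = (cost(w + eps) - cost(w - eps)) / (2 * eps)
analytic = dcost_dw(w)  # 4.0 at w = 3, x = 1

for _ in range(30):
    w = w - alpha * dcost_dw(w)  # w <- w - alpha * dJ/dw

print(numeric, analytic)
print(w)  # has moved from 3.0 toward the minimizer w = 1
```

Automatic differentiation removes the need to write `dcost_dw` by hand, which matters for the collaborative filtering cost, where the derivatives with respect to $w$, $b$, and $x$ are tedious to derive manually.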

This summary captures essential points for implementing collaborative filtering in TensorFlow, focusing on Auto Diff and the optimization process.

### Finding related items

1. **Introduction to Collaborative Filtering**:
   - Collaborative filtering algorithms can recommend related items based on user interactions, such as similar books or movies.

2. **Feature Representation**:
   - Each item (e.g., a movie) is represented by a feature vector $x^{(i)}$.
   - Features are learned automatically and may be difficult to interpret individually (e.g., distinguishing between genres).

3. **Similarity Measurement**:
   - To find items related to item $i$, calculate the squared distance between its feature vector and that of every other item $k$:
     $$\text{Distance}(x^{(i)}, x^{(k)}) = \sum_{l=1}^{n} (x^{(k)}_l - x^{(i)}_l)^2$$
   - Return the few items (e.g., the 5 or 10) with the smallest distance as the related items.

4. **Practical Application**:
   - When users browse an item, the website can show related products based on the similarity of feature vectors.
   - This method provides a systematic approach to enhance user experience on e-commerce sites.

5. **Limitations of Collaborative Filtering**:
   - **Cold Start Problem**: New items with few ratings or new users with limited interactions struggle with accurate recommendations.
     - Mean normalization can help, but additional strategies are needed for better recommendations in these cases.
   - **Lack of Side Information**: Collaborative filtering doesn't utilize extra information about items or users (e.g., demographics, genres, browsing behavior).
     - Understanding these correlations could enhance recommendations significantly.

6. **Transition to Content-Based Filtering**:
   - Content-based filtering can address the limitations of collaborative filtering by incorporating side information and user preferences.
   - The next topic will cover content-based filtering algorithms, which are widely used in commercial applications.
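
The nearest-item lookup described in point 3 can be sketched in plain Python. The feature vectors and movie names below are hypothetical stand-ins for learned features, and `most_similar` is an illustrative helper name.

```python
def squared_distance(a, b):
    """Sum of squared differences between two feature vectors."""
    return sum((ak - bk) ** 2 for ak, bk in zip(a, b))

def most_similar(item, features, top=2):
    """Return the names of the `top` items closest to `item`, excluding itself."""
    others = [name for name in features if name != item]
    others.sort(key=lambda name: squared_distance(features[item], features[name]))
    return others[:top]

# Hypothetical learned feature vectors for four movies.
features = {
    "Movie A": [0.9, 0.1],
    "Movie B": [0.8, 0.2],
    "Movie C": [0.1, 0.9],
    "Movie D": [0.5, 0.5],
}

print(most_similar("Movie A", features))  # ['Movie B', 'Movie D']
```

This version compares the item against every other item, which is fine for a small catalog; for a large one, the results would typically be computed ahead of time.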

This summary encapsulates the key concepts around collaborative filtering and its limitations, providing a foundation for understanding the upcoming content-based filtering methods.

## Content-based filtering

### Collaborative filtering vs. content-based filtering

1. **Introduction to Content-Based Filtering**:
   - Content-based filtering recommends items by matching the features of a user against the features of an item. Collaborative filtering, in contrast, recommends items based on the ratings of users who rated items similarly to you.

2. **Key Concepts**:
   - **User Features**: Information such as age, gender (one-hot encoded), country, and past behaviors (e.g., average ratings per genre).
   - **Item Features**: Attributes of items (e.g., movie year, genres, critic reviews, and average ratings).

3. **Feature Vectors**:
   - Each user has a feature vector $x_u^{(j)}$ and each item has a feature vector $x_m^{(i)}$.
   - The two feature vectors can differ in size (e.g., there may be many more user features than movie features).

4. **Prediction Mechanism**:
   - The goal is to predict how much user $j$ will like movie $i$ by computing:
     $$\text{Prediction} = v_u^{(j)} \cdot v_m^{(i)}$$
   - Here, $v_u^{(j)}$ is a vector computed from the user features $x_u^{(j)}$, and $v_m^{(i)}$ is a vector computed from the item features $x_m^{(i)}$.

5. **Dot Product Calculation**:
   - The dot product of the user and movie vectors yields a prediction of the user's rating for the movie, capturing how well the movie matches the user's preferences.

6. **Challenges**:
   - The challenge lies in computing the vectors $v_u^{(j)}$ and $v_m^{(i)}$ from their respective features.
   - Although $x_u^{(j)}$ and $x_m^{(i)}$ can differ in size, $v_u^{(j)}$ and $v_m^{(i)}$ must have the same dimension for the dot product to be defined, so the raw features have to be mapped to vectors of a common size.

7. **Comparison with Collaborative Filtering**:
   - Collaborative filtering uses user ratings to recommend items based on similar user behaviors, while content-based filtering leverages user and item features for matching.

#### Conclusion
Content-based filtering allows for personalized recommendations by analyzing user preferences and item attributes, and the subsequent development of the algorithm will focus on computing the necessary feature vectors to optimize recommendations. 

### Deep learning for content-based filtering

1. **Introduction to Deep Learning in Content-Based Filtering**:
   - Deep learning techniques are increasingly used to develop sophisticated content-based filtering algorithms, leveraging neural networks to compute user and item vectors.

2. **User and Movie Networks**:
   - **User Network**: Takes user features (e.g., age, gender, country) and outputs a user vector $v_u$ with a fixed size (e.g., 32 dimensions).
   - **Movie Network**: Takes movie features (e.g., release year, ratings) and outputs a movie vector $v_m$ of the same size.

3. **Prediction Mechanism**:
   - The rating prediction is done using the dot product of the user and movie vectors:
     $$\text{Prediction} = v_u^{(j)} \cdot v_m^{(i)}$$
   - If binary labels are used, a sigmoid function can be applied to the dot product to predict a probability.

4. **Training the Networks**:
   - A single cost function $J$ is constructed to minimize the squared error between predicted and actual ratings over all user-item pairs that have a rating:
     $$J = \sum_{(i,j): r(i,j)=1} \left( v_u^{(j)} \cdot v_m^{(i)} - y(i,j) \right)^{2}$$
   - Both user and movie networks are trained simultaneously using gradient descent and may include regularization terms to manage parameter sizes.

5. **Finding Similar Items**:
   - After training, the model can identify similar items by calculating the distance between movie vectors, allowing for recommendations of movies that are close in vector space.

6. **Pre-Computing Similarities**:
   - Similarities between movies can be pre-computed to improve response times when users browse the catalog, which is essential for scalability.

7. **Advantages of Neural Networks**:
   - Neural networks facilitate the integration of multiple models (user and movie networks) into a cohesive architecture that enhances predictive performance.

8. **Feature Engineering**:
   - Careful feature design is critical in implementing content-based filtering effectively, impacting the model's performance significantly.

9. **Challenges and Scaling**:
   - The algorithm can become computationally intensive with large item catalogs, and strategies for scaling will be discussed in future videos.
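
To make the two-network structure concrete, here is a forward pass written in plain Python with untrained random weights. It is a sketch under assumptions: the feature counts (3 user features, 5 movie features), the hidden layer sizes, and all helper names are hypothetical; only the 32-dimensional output and the dot product come from the description above. Since the weights are random, the resulting prediction has no meaning until the networks are trained.

```python
import random

def dense(x, weights, biases, relu=True):
    """One fully connected layer; `weights` has one row per output unit."""
    out = [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for a layer with n_in inputs and n_out units."""
    weights = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def make_network(n_in, hidden, n_out, rng):
    """A list of layers: ReLU hidden layers followed by a linear output layer."""
    sizes = [n_in] + hidden + [n_out]
    return [init_layer(size_in, size_out, rng) for size_in, size_out in zip(sizes, sizes[1:])]

def forward(network, x):
    """Run features through the network; the last layer has no activation."""
    for idx, (weights, biases) in enumerate(network):
        x = dense(x, weights, biases, relu=(idx < len(network) - 1))
    return x

rng = random.Random(0)
user_net = make_network(n_in=3, hidden=[16, 8], n_out=32, rng=rng)   # 3 user features
movie_net = make_network(n_in=5, hidden=[16, 8], n_out=32, rng=rng)  # 5 movie features

x_u = [0.3, 1.0, 0.0]            # hypothetical user features
x_m = [0.5, 0.0, 1.0, 0.2, 0.8]  # hypothetical movie features

v_u = forward(user_net, x_u)     # user vector
v_m = forward(movie_net, x_m)    # movie vector
prediction = sum(a * b for a, b in zip(v_u, v_m))  # dot product of the two 32-vectors
print(len(v_u), len(v_m), prediction)
```

The two networks take inputs of different sizes but end in layers of the same width, which is what makes the final dot product possible.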

#### Conclusion
Deep learning provides a robust framework for developing content-based filtering systems, allowing for personalized recommendations based on user and item features. The integration of user and movie networks enhances the predictive capability of the system, while effective training and feature engineering are crucial for successful implementation.

### Recommending from a large catalogue

1. **Need for Efficiency**:
   - Recommender systems must handle vast catalogs (thousands to millions of items) efficiently to provide timely recommendations.

2. **Two-Step Process**:
   - **Retrieval Step**: Quickly generates a large list of plausible item candidates.
     - Example: For each of the last 10 movies watched by the user, find the 10 most similar movies.
     - The list can also include top movies from genres the user enjoys and popular items in the user's country.
     - This step prioritizes broad coverage over precision, producing a list of perhaps hundreds of candidates.

3. **Ranking Step**:
   - Takes the list from the retrieval step and ranks items using a learned model.
   - The user feature vector and movie feature vectors are fed into the neural network to predict ratings for each candidate item.
   - Final recommendations are based on predicted ratings, displaying the highest-rated items to the user.

4. **Optimization**:
   - If movie vectors are pre-computed, only the user vector needs to be computed during inference.
   - The inner product between the user and movie vectors is calculated for items retrieved in the first step, enhancing efficiency.

5. **Item Retrieval Decisions**:
   - The number of items to retrieve impacts performance; more items can lead to better recommendations but slower processing.
   - Conduct offline experiments to determine the optimal number of items to retrieve based on performance metrics.

6. **Ethical Considerations**:
   - Be aware of the ethical implications of recommender systems, as they can potentially cause harm.
   - Aim to develop systems that serve users and society positively, rather than solely focusing on company profit.
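
The retrieval and ranking steps above can be sketched as follows, in plain Python. Everything concrete here is hypothetical: the 2-dimensional movie vectors, the user vector, and the helper names `retrieve` and `rank`. Retrieval uses only the "similar to recently watched" rule, and ranking scores each candidate with the dot product against the user vector.

```python
def squared_distance(a, b):
    """Sum of squared differences between two vectors."""
    return sum((ak - bk) ** 2 for ak, bk in zip(a, b))

def dot(a, b):
    """Inner product of two vectors."""
    return sum(ak * bk for ak, bk in zip(a, b))

def retrieve(watched, movie_vectors, per_movie=2):
    """Retrieval: for each watched movie, collect its nearest unwatched neighbors."""
    candidates = set()
    for seen in watched:
        others = [m for m in movie_vectors if m not in watched]
        others.sort(key=lambda m: squared_distance(movie_vectors[seen], movie_vectors[m]))
        candidates.update(others[:per_movie])
    return candidates

def rank(v_u, candidates, movie_vectors):
    """Ranking: score each candidate with the user vector and sort best-first."""
    return sorted(candidates, key=lambda m: dot(v_u, movie_vectors[m]), reverse=True)

# Hypothetical pre-computed movie vectors (2-D here for readability).
movie_vectors = {
    "A": [1.0, 0.0],
    "B": [0.9, 0.1],
    "C": [0.8, 0.3],
    "D": [0.0, 1.0],
    "E": [0.1, 0.9],
}
v_u = [0.5, 1.0]  # hypothetical user vector, computed once at request time

candidates = retrieve(watched=["A"], movie_vectors=movie_vectors, per_movie=2)
recommendations = rank(v_u, candidates, movie_vectors)
print(recommendations)  # ['C', 'B']
```

Retrieval picks B and C as the nearest neighbors of A, and ranking then puts C first because it scores higher against this user's vector. Only the small candidate set is scored, not the whole catalog.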

#### Conclusion
Implementing a two-step retrieval and ranking approach allows recommender systems to efficiently manage large catalogs while still providing relevant recommendations. Balancing the number of retrieved items and the ethical considerations is crucial for building responsible and effective systems.

### Ethical use of recommender systems

1. **Profit vs. User Benefit**:
   - While recommender systems can be profitable, they can also lead to harmful societal impacts. It's crucial to prioritize user well-being over profit maximization.

2. **Goal Setting**:
   - When designing recommender systems, it's important to define clear goals. For example:
     - Recommend items likely to be rated highly by users.
     - Show products that users are most likely to purchase.
   - However, systems can also prioritize clicks or profits over relevance.

3. **Problematic Use Cases**:
   - **Advertising**:
     - Companies may show ads that maximize clicks or profits rather than relevance, which can lead to unethical outcomes (e.g., payday loan companies exploiting customers).
   - **Engagement Metrics**:
     - Maximizing user engagement can result in the amplification of harmful content, such as conspiracy theories or hate speech, which keeps users engaged but has negative societal effects.

4. **Ameliorative Measures**:
   - Consider filtering out exploitative businesses and harmful content, though defining what constitutes "exploitative" or "harmful" is complex.
   - Strive for transparency in how recommendations are made to build user trust.

5. **Importance of Diverse Perspectives**:
   - Engage in discussions and debates on the ethical implications of recommender systems. Involve diverse viewpoints to better understand potential harms and benefits.

6. **Collective Responsibility**:
   - As developers and researchers in AI and recommender systems, aim to create technology that genuinely improves society and enhances user experiences, rather than simply focusing on profit.

#### Conclusion
Recommender systems hold significant power and responsibility. While they can drive profits, careful consideration of their societal impacts is essential. By prioritizing ethical practices and transparent operations, we can create systems that benefit users and society at large.

### TensorFlow implementation of content-based filtering

1. **Model Structure**:
   - **User Network**: Implemented as a sequential model with dense layers.
     - Contains two dense hidden layers, each with a specified number of units.
     - The final layer outputs a vector of 32 numbers.
   - **Item (Movie) Network**: Similar structure to the user network.
     - Also consists of dense layers and outputs a vector of 32 numbers.

2. **Activation Function**:
   - The hidden layers use the ReLU (Rectified Linear Unit) activation function.

3. **Input Feature Feeding**:
   - User and item features are extracted and fed into their respective networks to compute vectors $v_u$ (user vector) and $v_m$ (movie vector).
   - Both output vectors are L2-normalized (scaled so that each has length one), which is reported to make the algorithm work better.

4. **Dot Product Calculation**:
   - The dot product of the normalized vectors $v_u$ and $v_m$ is computed using a dedicated Keras layer (`tf.keras.layers.Dot`), which outputs the final prediction.

5. **Model Definition**:
   - The overall model is defined by specifying the inputs (user and item features) and the output (the result of the dot product).

6. **Cost Function**:
   - The mean squared error cost function is used for training the model.

7. **Normalization**:
   - Normalizing the vectors helps improve the performance of the content-based filtering algorithm.
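
The normalization, dot product, and cost from points 3, 4, and 6 reduce to a few lines of arithmetic. The sketch below uses plain Python rather than Keras layers, with 2-dimensional vectors instead of 32-dimensional ones; the vector values and the target of 1.0 are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)

def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def mse(predictions, targets):
    """Mean squared error between predicted and actual values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

v_u = l2_normalize([3.0, 4.0])  # hypothetical user-network output, becomes about [0.6, 0.8]
v_m = l2_normalize([4.0, 3.0])  # hypothetical movie-network output, becomes about [0.8, 0.6]

prediction = dot(v_u, v_m)      # about 0.96
loss = mse([prediction], [1.0]) # about 0.0016 against a hypothetical target of 1.0
print(prediction, loss)
```

Because both vectors have length one, the dot product always lies between -1 and 1, so the ratings would need to be rescaled to a comparable range before training against them with mean squared error.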

#### Conclusion
The implementation of content-based filtering in TensorFlow involves building one neural network for users and one for items, L2-normalizing their output vectors, taking the dot product of those vectors, and defining the model with the appropriate inputs and output. This foundational approach enables personalized recommendations based on user and item features.