Content-based filtering is a recommendation technique that uses item attributes and a user’s inferred preferences over those attributes to suggest new items with similar characteristics. It focuses on **what** an item is (its content/features) rather than on what other users did.

### How content-based filtering works

1. Item profile  
   - Each item is represented by a vector of features.  
   - Examples:  
     - Music: genre, artist, tempo, mood, language, acoustic features.  
     - Movies: genre, director, cast, keywords, runtime, year.  
     - Products: category, brand, price range, text description, tags.

2. User profile  
   - The system learns what feature values the user tends to like.  
   - It aggregates features from items the user has interacted with (listened, watched, clicked, purchased, highly rated).  
   - A simple version: average the feature vectors of all positively rated items; more advanced versions weight by rating, recency, or time spent.

3. Matching / filtering and scoring  
   - Candidate items are “filtered” by computing similarity between each item profile and the user profile.  
   - Common similarity measures:  
     - Cosine similarity between feature vectors.  
     - Dot product in an embedding space.  
     - Distance-based measures (e.g., 1 / (1 + Euclidean distance), which stays finite when the distance is 0).  
   - Items with the highest similarity scores are recommended and ranked from most to least similar.

In other words, the filtering step is the process of turning “all available items” into “a ranked list of items whose feature vectors are closest to the user’s preference vector.”
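
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production design: the feature order `[pop, rock, energy]`, the item ids, and the helper names `mean_vector`, `cosine_similarity`, and `rank_items` are all assumptions made for this example, and the user profile uses the simplest option mentioned above (an unweighted average of liked items).

```python
import math

def mean_vector(vectors):
    """User profile as the plain average of the liked items' feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat an all-zero vector as matching nothing instead of dividing by zero.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_items(user_profile, item_profiles):
    """Score every candidate against the profile and sort best-first."""
    scored = [(item_id, cosine_similarity(user_profile, vector))
              for item_id, vector in item_profiles.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical feature order: [pop, rock, energy]
liked = [[1.0, 0.0, 0.9], [1.0, 0.0, 0.7]]
candidates = {
    "a": [1.0, 0.0, 0.8],  # energetic pop
    "b": [0.0, 1.0, 0.8],  # energetic rock
    "c": [0.0, 1.0, 0.0],  # low-energy rock
}

user_profile = mean_vector(liked)
ranked = rank_items(user_profile, candidates)
print([item_id for item_id, _ in ranked])
```

With these toy vectors the energetic pop item ranks first, the energetic rock item second (it shares only the energy feature), and the low-energy rock item last.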

### Spotify-style example with filtering

- Item profiles  
  - Each song is represented by features such as genre (pop, rock), tempo (BPM), artist, mood (happy, sad), energy, danceability, presence of female vocals, etc.  
- User profile  
  - A user often listens to upbeat pop songs by female vocalists like Taylor Swift and Dua Lipa.  
  - The system builds a profile that emphasizes: pop genre, high tempo, high energy, positive mood, female vocals.  
- Filtering step  
  - Start with the full catalog of songs.  
  - Compute similarity between each song’s feature vector and the user’s profile vector.  
  - Filter and rank: songs with high similarity (e.g., other upbeat pop tracks with female vocals and similar acoustic properties) move to the top and get recommended, even if the user has never played them before.
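
Before any similarity can be computed, each song's attributes have to be turned into numbers. The sketch below shows one possible encoding, assuming a small genre vocabulary, one-hot genre encoding, and min-max scaling of tempo. The field names (`genre`, `tempo_bpm`, `energy`, `female_vocals`), the BPM bounds, and the `encode_song` helper are hypothetical and do not reflect any real streaming service's schema.

```python
GENRES = ["pop", "rock", "hip-hop"]   # hypothetical genre vocabulary
TEMPO_MIN, TEMPO_MAX = 60.0, 200.0    # assumed BPM bounds used for scaling

def encode_song(song):
    """Map a song's raw attributes to a numeric feature vector in [0, 1]."""
    # One-hot encode the genre: 1.0 in the matching slot, 0.0 elsewhere.
    genre_one_hot = [1.0 if song["genre"] == g else 0.0 for g in GENRES]
    # Min-max scale tempo, clamping values outside the assumed bounds.
    tempo = (song["tempo_bpm"] - TEMPO_MIN) / (TEMPO_MAX - TEMPO_MIN)
    tempo = max(0.0, min(1.0, tempo))
    vocals = 1.0 if song["female_vocals"] else 0.0
    return genre_one_hot + [tempo, song["energy"], vocals]

song = {"genre": "pop", "tempo_bpm": 130, "energy": 0.8, "female_vocals": True}
vector = encode_song(song)
print(vector)
```

Scaling every feature to a comparable range matters here: without it, a raw BPM value in the hundreds would dominate any dot product or distance.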

### What “filtering” means here

In content-based filtering, “filtering” covers two closely related actions:

- Narrowing the candidate set  
  - You may first discard items that clearly do not match basic constraints (e.g., wrong language, explicit content for a kid profile, already consumed items).  
  - This reduces the search space before detailed scoring.

- Ranking by similarity  
  - Among remaining items, the system produces a relevance score based on feature similarity to the user profile.  
  - The top k items (e.g., top 50 or 100) are presented as recommendations.  
  - This is the core filtering behavior: items that do not align with the user’s learned feature preferences effectively get “filtered out” by low scores.

A simple mental model: imagine every item and every user as points in a high-dimensional “feature space”; filtering is “show me the items closest to this user’s point.”
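
The two actions can be combined into one small function: hard constraints first, similarity ranking second. The following is a sketch under stated assumptions: the catalog layout (dicts with `id`, `language`, `explicit`, `features`), the `recommend` function, and the use of a plain dot product as the score are illustrative choices, not a prescribed design.

```python
import heapq

def recommend(user_profile, catalog, consumed_ids, language, allow_explicit, k):
    # Stage 1: narrow the candidate set with hard constraints.
    candidates = [
        item for item in catalog
        if item["language"] == language
        and (allow_explicit or not item["explicit"])
        and item["id"] not in consumed_ids
    ]

    # Stage 2: score the survivors and keep the k best.
    def score(item):
        return sum(u * f for u, f in zip(user_profile, item["features"]))

    return [item["id"] for item in heapq.nlargest(k, candidates, key=score)]

# Hypothetical catalog; features are [pop, energy].
catalog = [
    {"id": "s1", "language": "en", "explicit": False, "features": [0.9, 0.8]},
    {"id": "s2", "language": "en", "explicit": True,  "features": [1.0, 1.0]},
    {"id": "s3", "language": "es", "explicit": False, "features": [1.0, 0.9]},
    {"id": "s4", "language": "en", "explicit": False, "features": [0.2, 0.1]},
    {"id": "s5", "language": "en", "explicit": False, "features": [0.7, 0.9]},
    {"id": "s6", "language": "en", "explicit": False, "features": [0.6, 0.5]},
]

top = recommend(
    user_profile=[1.0, 0.5],
    catalog=catalog,
    consumed_ids={"s5"},
    language="en",
    allow_explicit=False,
    k=2,
)
print(top)
```

Here `s2`, `s3`, and `s5` never reach the scoring stage (explicit, wrong language, already consumed), while `s4` is scored but falls outside the top 2 because its similarity is low.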

### Benefits of content-based filtering

- Personalization  
  - Recommendations are customized to each user’s learned feature preferences, not just global popularity.  
- Independence from other users  
  - It does not need many other users or overlapping histories, so it works even when there are few similar users or in niche domains.  
- Transparency / explainability  
  - It is easier to say *why* something was recommended: “We suggested this because it’s an upbeat pop track with a similar mood and tempo to songs you often play.”
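
One way to produce such an explanation, sketched below, is to look at which features contribute most to the match between the user profile and the item. The per-feature product used as a "contribution", the feature names, and the `explain` helper are assumptions for illustration; real systems may derive explanations differently.

```python
def explain(user_profile, item_vector, feature_names, top_n=2):
    """Return the names of the features that contribute most to the match."""
    contributions = [
        (name, u * x)
        for name, u, x in zip(feature_names, user_profile, item_vector)
    ]
    # Keep only features that actually pushed the score up.
    positive = [c for c in contributions if c[1] > 0]
    positive.sort(key=lambda c: c[1], reverse=True)
    return [name for name, _ in positive[:top_n]]

# Hypothetical features and values.
feature_names = ["pop", "rock", "high energy", "female vocals"]
user_profile = [0.9, 0.1, 0.8, 0.7]
item_vector = [1.0, 0.0, 0.9, 0.0]

reasons = explain(user_profile, item_vector, feature_names)
print("Recommended because it matches your taste for: " + ", ".join(reasons))
```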

### Limitations and their link to filtering

- Limited discovery  
  - Because filtering relies on similarity to past likes, the system tends to recommend “more of the same.”  
  - This can trap users in a narrow bubble (e.g., only pop if they started with pop), reducing diversity and serendipitous finds.  
- Feature dependence  
  - It needs good, informative item features.  
  - Poor or missing features cause weak similarity estimates, which in turn distort the filtering and ranking.  
- Cold-start for items with sparse metadata  
  - If new items have little or no feature information (e.g., no textual description, no tags), the system struggles to place them in feature space and thus cannot score or rank them reliably.

### Item factors and user factors

- Each item is represented by a vector of numeric “item factors” capturing content features. These values range between 0 and 1 but are not probabilities, need not sum to 1, and are not mutually exclusive.  

- Some features may be correlated with or opposed to one another, but many are largely independent.  

- For each user, a separate linear regression model is fitted that predicts that user’s ratings from the item factors, yielding an intercept (bias) and one weight per feature. These weights are the “user factors,” indicating how much that user likes or dislikes each content feature.  
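
To make the per-user regression concrete, here is a minimal sketch that fits one user's intercept and weights by ordinary least squares, solving the normal equations with Gauss-Jordan elimination in plain Python. The function name `fit_user_factors`, the two-feature items, and the ratings are invented for illustration, and the code assumes the user has rated enough distinct items for the system to have a unique solution. In practice a numerical library would normally be used instead of a hand-written solver.

```python
def fit_user_factors(item_factors, ratings):
    """Ordinary least squares for one user. Returns (intercept, weights)."""
    # Prepend a constant 1.0 so the first coefficient is the intercept.
    X = [[1.0] + list(row) for row in item_factors]
    n = len(X[0])
    # Build the augmented normal-equation system [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(n)]
        + [sum(r[i] * y for r, y in zip(X, ratings))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    coeffs = [A[i][n] for i in range(n)]
    return coeffs[0], coeffs[1:]

# Hypothetical item factors (two content features per item) and one user's ratings.
item_factors = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.5], [0.1, 0.2]]
ratings = [4.15, 1.7, 3.15, 2.4]

intercept, weights = fit_user_factors(item_factors, ratings)
print(round(intercept, 3), [round(w, 3) for w in weights])
```

The toy ratings were generated from an intercept of 2.5 and weights of 2.0 and -1.5, so the fit recovers those values up to floating-point error: this user likes the first feature and dislikes the second.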

### Linear model and prediction mechanism

- The model for a user is a standard linear regression: predicted rating = intercept + dot product of user factors with item factors.  

- Each prediction is interpreted as the user’s estimated rating for that item, and prediction errors are expected and unavoidable because a simple linear model cannot perfectly capture all nuances.  

- This same model naturally handles missing ratings: even if a user has never rated a specific item, the system can still predict a rating from the learned user factors and the item’s known factors.  
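
Written as a formula, the prediction for user $u$ and item $i$ is shown below. The symbols are notation chosen for this illustration: $b_u$ is the user's intercept, $\mathbf{w}_u$ the vector of user factors, $\mathbf{x}_i$ the vector of item factors, and $K$ the number of content features.

$$
\hat{r}_{u,i} = b_u + \mathbf{w}_u \cdot \mathbf{x}_i = b_u + \sum_{k=1}^{K} w_{u,k}\, x_{i,k}
$$

As a worked example with made-up numbers, take $b_u = 2.5$, $\mathbf{w}_u = (2.0, -1.5)$, and an unrated item with $\mathbf{x}_i = (0.9, 0.1)$:

$$
\hat{r}_{u,i} = 2.5 + (2.0)(0.9) + (-1.5)(0.1) = 2.5 + 1.8 - 0.15 = 4.15
$$

No rating from this user for this item is needed; the item's known factors and the user's learned factors are enough.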

### Filtering and use in recommendations

- Once user factors and item factors are learned, the system can compute predicted ratings for many items a user has not yet consumed.  

- The “filtering” comes from using these predicted ratings to remove or down-rank items with low predicted scores and surface items with high predicted scores (for example, when choosing which one of several candidate albums to recommend next).  

- Predicted ratings can fall outside the nominal rating scale (e.g., above 5 or below 0); if desired, they can be clipped with min/max functions so recommendations show values within the allowed range.  
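
A short sketch of this last step: predict a rating for each unseen item, rank by the raw prediction, and clip only the value that is displayed. The 0 to 5 scale follows the example above; the user factors, item ids, and the helper names `predict_rating` and `clip` are hypothetical.

```python
RATING_MIN, RATING_MAX = 0.0, 5.0  # nominal rating scale assumed here

def predict_rating(intercept, user_factors, item_factors):
    """Linear prediction: intercept plus dot product of user and item factors."""
    return intercept + sum(w * x for w, x in zip(user_factors, item_factors))

def clip(value, low, high):
    return max(low, min(high, value))

# Hypothetical learned parameters for one user.
intercept = 2.5
user_factors = [3.0, -3.0]

# Hypothetical items the user has not consumed yet.
unseen = {"A": [1.0, 0.0], "B": [0.5, 0.5], "C": [0.0, 1.0]}

raw = {item_id: predict_rating(intercept, user_factors, factors)
       for item_id, factors in unseen.items()}

# Rank on the raw scores so clipping cannot create artificial ties,
# then clip only for display.
ranking = sorted(raw, key=raw.get, reverse=True)
results = [(item_id, clip(raw[item_id], RATING_MIN, RATING_MAX))
           for item_id in ranking]
print(results)
```

Item A has a raw prediction of 5.5 and item C of -0.5, so both are clipped for display (to 5.0 and 0.0), while the ordering still reflects the unclipped scores.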

### Practical limitations highlighted

- A key limitation is the need for predefined content categories and reliable feature values: realistic systems rarely have clean, expert-defined item factors for every item.  

- For large catalogs, manually deciding all features and assigning their values is not scalable, motivating the move to alternative approaches in later lessons (such as techniques that *learn* factors automatically instead of hand-defining them).