
---

### 1. Project Understanding & Design

* **How does content-based differ from collaborative filtering?**
  Content-based recommends items similar to those a user liked based on item features, while collaborative filtering uses user interaction patterns across many users.

* **Why CountVectorizer over TF-IDF or embeddings?**
  CountVectorizer is simple, fast, and works well with limited textual data; embeddings can be more accurate but require more resources.

* **What happens if two movies are similar but one is more popular?**
  This method ignores popularity: both movies are ranked purely by content similarity, so the more popular one gets no boost. This is a limitation to address.

* **Why cosine similarity?**
  Cosine similarity measures the angle between vectors, focusing on direction rather than magnitude, which suits text similarity well.
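
To make the "direction rather than magnitude" point concrete, here is a minimal sketch in plain Python. It assumes each movie is reduced to a whitespace-separated tag string; `bag_of_words` is a simplified stand-in for what CountVectorizer does per document, and all names and tag strings are illustrative rather than taken from the project.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count token occurrences (simplified stand-in for CountVectorizer)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


short = bag_of_words("space alien war")
# Same tags repeated three times: larger magnitude, same direction.
repeated = bag_of_words("space alien war " * 3)
other = bag_of_words("romance paris wedding")

print(cosine_similarity(short, repeated))  # approximately 1.0
print(cosine_similarity(short, other))     # 0.0, no shared tokens
```

The repeated document has three times the counts but points in the same direction, so its similarity to the short one stays at roughly 1.0.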

---

### 2. NLP & Feature Engineering

* **Text preprocessing done?**
  Lowercasing, removing spaces, and combining everything into a single 'tags' field; no stopword removal or stemming was applied.

* **Would TF-IDF help over Bag-of-Words?**
  TF-IDF could reduce the weight of common words and highlight unique terms, potentially improving recommendations.

* **How to improve 'tags'?**
  Use NLP techniques like lemmatization, named entity recognition, or add metadata like genres and keywords.
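
The TF-IDF answer above can be illustrated with a small sketch. It uses the textbook weighting `count * log(N / document_frequency)`; scikit-learn's `TfidfVectorizer` applies a smoothed variant plus normalization by default, so its numbers would differ. The tag lists are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term count by log(N / document_frequency)."""
    n_docs = len(docs)
    # Number of documents each token appears in at least once.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return weighted


tags = [
    "action movie hero".split(),
    "romance movie paris".split(),
    "horror movie ghost".split(),
]
weights = tf_idf(tags)
print(weights[0])
```

Because "movie" appears in every document, its weight drops to zero, while a distinguishing tag such as "action" keeps a positive weight.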

---

### 3. Scalability & Performance

* **Scalability for 1 million movies?**
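  Computing the full pairwise similarity matrix is O(n²) in time and memory: 1 million movies means 10^12 entries, about 4 TB at 4 bytes each. Approximate nearest neighbor (ANN) search or indexing is needed instead.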

* **New movie addition?**
  Currently the whole similarity matrix must be recomputed; an incremental update would compare only the new movie's vector against the existing ones.

* **Real-time recommendations?**
  Precompute embeddings and use fast search indices like FAISS for real-time querying.
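
As a rough sketch of the incremental idea, the snippet below scores one new movie against stored vectors in a single pass instead of rebuilding the full matrix. It is an exact brute-force scan, not an approximate nearest neighbor index; a library such as FAISS would replace the scan at scale. The catalog, the tag vectors, and the function names are hypothetical.

```python
import heapq
import math


def cosine(a: dict[str, int], b: dict[str, int]) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def top_k_similar(query: dict[str, int], catalog: dict[str, dict[str, int]], k: int = 2) -> list[str]:
    """Return the k catalog titles most similar to `query` in one O(n) pass."""
    scored = ((cosine(query, vec), title) for title, vec in catalog.items())
    return [title for _, title in heapq.nlargest(k, scored)]


# Toy precomputed vectors (illustrative only).
catalog = {
    "Alien": {"space": 1, "alien": 1, "horror": 1},
    "Gravity": {"space": 1, "survival": 1},
    "Notting Hill": {"romance": 1, "london": 1},
}
new_movie = {"space": 1, "alien": 1, "war": 1}
print(top_k_similar(new_movie, catalog))  # ['Alien', 'Gravity']
```

Adding a movie then costs one comparison per existing title rather than a full n × n recomputation.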

---

### 4. Frontend & API Integration

* **How does Streamlit communicate with backend?**
  All logic runs in the same app; Streamlit calls Python functions directly.

* **Why TMDB API? Rate limits?**
  TMDB offers rich poster data; rate limits exist, so caching posters is advised.

* **Could it work offline?**
  If poster images are cached locally and no API calls are needed, yes.
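
One way to cache posters is a small disk cache in front of the API call. The sketch below assumes a caller-supplied `fetch` function that wraps the actual TMDB request (not shown here); `get_poster`, `fake_fetch`, and the movie ID are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable


def get_poster(movie_id: int, fetch: Callable[[int], bytes], cache_dir: Path) -> bytes:
    """Return poster bytes, calling `fetch` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / f"{movie_id}.jpg"
    if cached.exists():
        return cached.read_bytes()
    data = fetch(movie_id)  # assumption: wraps the real API request
    cached.write_bytes(data)
    return data


calls = []


def fake_fetch(movie_id: int) -> bytes:
    """Stand-in for the network call so the example runs offline."""
    calls.append(movie_id)
    return b"poster-bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = get_poster(603, fake_fetch, Path(tmp))
    second = get_poster(603, fake_fetch, Path(tmp))  # served from disk

print(len(calls))  # 1
```

The second lookup never reaches `fetch`, which is what keeps the app under rate limits and lets it run offline once the cache is warm.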

---

### 5. Limitations & Improvements

* **Limitations of content-based?**
  Cold start for new users (no history to build a profile from), over-specialization (lack of diversity), and no learning from user preferences or feedback.

* **Improve with deep learning?**
  Use pretrained language models like BERT for richer text embeddings.

* **Generative model for recommendations?**
  Could generate personalized recommendations based on user profiles and content.

---

### 6. Deployment & Reusability

* **Why pickle for serialization?**
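  Simple for prototyping, but not suited to production: unpickling untrusted data can execute arbitrary code, and pickle files are tied to Python and can break across library versions.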

* **How to deploy?**
  Containerize with Docker, expose REST API, and deploy on cloud platforms.

* **Track/version models?**
  Use tools like MLflow or DVC to manage model lifecycle.
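
On the pickle question, one possible alternative for simple artifacts is a plain data format. The sketch below assumes the recommender's output can be reduced to a precomputed neighbor list per title, which may not match the project's actual artifacts; the titles and file name are illustrative.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical precomputed top-N neighbors per movie.
neighbors = {
    "Alien": ["Aliens", "Prometheus"],
    "Gravity": ["Interstellar"],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "neighbors.json"
    # JSON holds only data, so loading it cannot run code.
    path.write_text(json.dumps(neighbors), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(loaded == neighbors)  # True
```

A text format like this is also easy to diff and to track with a tool such as DVC.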

---


* **Add user ratings, redesign?**
  Combine content-based with collaborative filtering to personalize better.

* **Challenges faced?**
  Integrating TMDB API reliably and optimizing recommendation speed.

* **Key learning?**
  Importance of clean data, model evaluation, and user-friendly UI.
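
A simple way to combine the two approaches is a weighted blend of their scores. This is a minimal sketch of one hybrid strategy, assuming both scores are already scaled to the range 0 to 1; `hybrid_score`, `alpha`, and the candidate values are hypothetical.

```python
def hybrid_score(content_sim: float, collab_score: float, alpha: float = 0.5) -> float:
    """Blend content similarity with a collaborative score.

    alpha=1.0 is purely content-based; alpha=0.0 is purely collaborative.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * content_sim + (1 - alpha) * collab_score


# (content similarity, collaborative score) per candidate, made-up values.
candidates = {"A": (0.9, 0.2), "B": (0.6, 0.9)}
ranked = sorted(candidates, key=lambda t: hybrid_score(*candidates[t]), reverse=True)
print(ranked)  # ['B', 'A']
```

Candidate A wins on content alone, but B ranks first once user ratings are given equal weight.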

---