# 📘 Week 4 Assignment – Building a Recommender System Using Batch and Streaming Pipelines on AWS

🛠️ **Note:** All AWS resources (RDS, S3 buckets, Lambda functions, Firehose, etc.) were pre-configured and provisioned for the assignment. Our main job was to connect the components and understand how the recommender architecture works end to end. Later courses cover Terraform and the AWS tools in more depth.

---

## 🎯 Assignment Objective

In this lab, we are tasked with implementing a complete **Recommender System Pipeline** using AWS services. This includes:

- ✅ Extracting and transforming training data for a recommender system  
- ✅ Making the transformed data available to data scientists  
- ✅ Using the trained model outputs (embeddings) to create a **Vector Database**  
- ✅ Setting up a **streaming architecture** to deliver real-time product recommendations to users  
- ✅ Ensuring all outputs (recommendations) are stored properly in an S3 bucket for future use  

---

## 📦 Overview of the Architecture

We are essentially helping **build a machine learning-powered recommender system** that uses **batch processing for training** and **streaming processing for real-time inference**. Here's how it works at a high level:

- Raw data from an **Amazon RDS** instance is extracted and transformed using **AWS Glue ETL**  
- The transformed data is stored in **Amazon S3** and used by data scientists to train the recommender model  
- The **trained model**, along with **user/item embeddings**, is stored in another S3 bucket  
- These embeddings are uploaded to a **Vector Database** (PostgreSQL) for efficient similarity searches  
- A **streaming pipeline** is created using **Kinesis Data Streams**, **Lambda functions**, and **Data Firehose**  
- Real-time user interactions are used to compute recommendations using the trained model and embeddings  
- Final recommendations are sent to the client and also stored in a dedicated **S3 recommendations bucket**

---

## 🧱 Step 1: Batch Pipeline Setup

Our first task is to extract ratings data from a MySQL database hosted in **Amazon RDS**. This data represents how users rated products and is the key input for training a supervised ML model.

- We use **AWS Glue ETL** to extract and transform the ratings data  
- The output of this transformation is stored in an **S3 bucket** called `data lake`  
- A **Glue Crawler** is then used to create a catalog/table from the S3 data for easy access by data scientists  

📸 *Batch Pipeline Diagram*  
![Batch Pipeline](../images/batch_pipline_data_lake.png)
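
To make the transformation step more concrete, here is a minimal sketch in plain Python of the kind of join an ETL job might perform. The real job runs inside AWS Glue, and the column names (`user_id`, `item_id`, `category`, `rating`) and both functions are hypothetical, chosen only for illustration.

```python
import csv
import io

# Hypothetical output schema; the lab's real column names may differ.
FIELDS = ["user_id", "item_id", "category", "rating"]


def transform_ratings(ratings, products):
    """Join each rating with its product attributes and cast the rating to int."""
    products_by_id = {p["item_id"]: p for p in products}
    rows = []
    for r in ratings:
        product = products_by_id.get(r["item_id"])
        if product is None:
            continue  # skip ratings that reference an unknown product
        rows.append({
            "user_id": r["user_id"],
            "item_id": r["item_id"],
            "category": product["category"],
            "rating": int(r["rating"]),
        })
    return rows


def to_csv(rows):
    """Serialize the transformed rows as CSV text, ready to be written to storage."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


ratings = [{"user_id": 1, "item_id": "A", "rating": "4"}]
products = [{"item_id": "A", "category": "toys"}]
print(to_csv(transform_ratings(ratings, products)))
```

The output of a step like this is what lands in the data lake bucket and what the Glue Crawler later catalogs.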

---

## 🧪 Step 2: Model Training (by Data Scientist)

Once the transformed data is available in the data lake bucket, it is used by the **data scientist** to train a recommender model.

- The **trained model** is stored in another S3 bucket called `ML Artifacts`  
- This bucket contains:
  - `models/` → serialized model (e.g. .pkl)
  - `embeddings/` → user and item embeddings CSV files
  - `scalars/` → preprocessing information  

We do not train the model ourselves; the data scientist handles that. Our role is to enable the next part of the pipeline.

---

## 🧲 Step 3: Vector Database Setup

The data scientist asks us to create a **Vector Database** using PostgreSQL to enable fast similarity searches based on item/user embeddings.

- We create a PostgreSQL database via Terraform  
- We collect the DB host, username, and password from the outputs  
- We use SQL scripts to upload the contents of `item_embeddings.csv` and `user_embeddings.csv` from the `ML Artifacts` bucket into the vector database  
- These embeddings allow us to retrieve similar products later during inference  

📸 *Vector Database Diagram*  
![Vector Database](../images/vector_database.png)
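
Conceptually, a similarity search ranks stored embeddings by how close they are to a query embedding. The sketch below does this in plain Python with cosine similarity. It assumes each CSV row has the form `id,v1,v2,...` with no header, and cosine similarity is itself an assumption: the lab notes do not state which distance metric the database uses. The function names are hypothetical.

```python
import csv
import io
import math


def load_embeddings(csv_text):
    """Parse rows of the assumed form `id,v1,v2,...` into {id: [floats]}."""
    embeddings = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # ignore blank lines
        embeddings[row[0]] = [float(x) for x in row[1:]]
    return embeddings


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(item_id, embeddings, k=2):
    """Return the ids of the k items whose embeddings are closest to item_id's."""
    query = embeddings[item_id]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != item_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


items = load_embeddings("A,1,0\nB,0.9,0.1\nC,0,1\nD,-1,0\n")
print(most_similar("A", items, k=2))
```

This brute-force version compares the query against every stored vector. A vector database performs the same ranking inside the database, which is what makes it practical at inference time.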

---

## 🚀 Step 4: Real-time Streaming Pipeline

Now that the model and embeddings are ready, we move to the **streaming pipeline** that will power live recommendations based on user activity.

- An **Amazon Kinesis Data Streams** stream is already set up in the background and continuously receives user events (cart actions, clicks)
- We configure three components via Terraform:
  - `Model Inference Lambda` → uses trained model + vector DB to generate recommendations  
  - `Stream Transformation Lambda` → extracts user/item features from Kinesis records  
  - `Amazon Data Firehose` → orchestrates the flow: reads from Kinesis, invokes the Stream Transformation Lambda (whose output is passed to the Model Inference Lambda), and stores the results in S3  

📸 *Streaming Pipeline Diagram*  
![Streaming Pipeline](../images/streaming_pipeline.png)

---

### 🔄 Streaming Pipeline Flow (Step-by-step)

1. User actions are pushed to **Kinesis Data Streams**  
2. **Firehose** reads the incoming records  
3. Firehose invokes the **Stream Transformation Lambda** to extract user/product data  
4. The output is passed to the **Model Inference Lambda**, which:
   - Loads the trained model from the `ML Artifacts` S3 bucket  
   - Connects to the vector DB  
   - Retrieves similar items for the given user/cart combination  
5. Firehose stores the **final recommendations** in a dedicated **S3 bucket (`recommendations`)**  
6. These recommendations can also be served back to the user via the platform
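
Steps 3 and 4 can be sketched as a Firehose transformation handler. Firehose passes the Lambda a batch of base64-encoded records and expects each one back with the same `recordId`, a `result` status, and base64-encoded `data`. The event field names (`user_id`, `cart_items`) and the `get_recommendations` stub, which stands in for the call to the Model Inference Lambda, are assumptions for illustration, not the lab's actual code.

```python
import base64
import json


def extract_features(payload):
    """Pull out the fields the inference step needs (hypothetical field names)."""
    return {
        "user_id": payload["user_id"],
        "cart_items": payload.get("cart_items", []),
    }


def get_recommendations(features):
    """Hypothetical stand-in for invoking the Model Inference Lambda."""
    return []


def lambda_handler(event, context, recommend=get_recommendations):
    """Transform each Firehose record and attach recommendations to it."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            features = extract_features(payload)
            result = {**features, "recommended_items": recommend(features)}
            # Newline-delimited JSON keeps the objects separable in S3.
            encoded = base64.b64encode((json.dumps(result) + "\n").encode("utf-8"))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": encoded.decode("utf-8"),
            })
        except (KeyError, ValueError):
            # Malformed record: hand it back unchanged and flag the failure.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

The `recommend` parameter exists only so the handler can be exercised locally with a fake inference function; Lambda itself calls the handler with just `event` and `context`.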

---

## 🧠 Summary

By the end of this lab:

- We implemented a complete **batch + streaming recommendation system**  
- We transformed data from **RDS → S3 → ML model**  
- We stored model artifacts and **created a vector DB** for similarity search  
- We connected all pieces in a **real-time stream** using Kinesis, Lambda, and Firehose  
- All resources were provisioned via Terraform; our task was to **understand the connections, not the configurations**

This lab provides foundational experience in designing and implementing modern ML pipelines on AWS, the kind of work data engineers are expected to do in real-world projects.