## ETL Milestone

This notebook runs the ETL pipeline step by step: scrape, clean, chunk, and featurize. The first cells list the URLs that will be scraped, which are read from the `media_urls` collection of the `rag` database in MongoDB.

In [1]:
%load_ext autoreload

In [2]:
%autoreload all

In [1]:
from loguru import logger
from app.preprocessing.clean import etl_clean
from app.helpers.mongo_client import get_mongo_client
from app.scraper import etl_scrape
from app.preprocessing.chunk import etl_chunk
from app.featurizer import etl_featurize_step

In [4]:
mongo_client = get_mongo_client()
db = mongo_client["rag"]
collection = db["media_urls"]

In [5]:
# Displaying all the Media URLs being scraped

logger.info("Media URLs being scraped:")

media_urls = []
for i, media in enumerate(collection.find()):
    media_urls.append(media['url'])
    logger.info(f"{i}. {media['url']}")

2024-12-08 13:00:08.551 | INFO     | __main__:<module>:3 - Media URLs being scraped:


2024-12-08 13:00:08.558 | INFO     | __main__:<module>:8 - 0. https://www.youtube.com/watch?v=idQb2pB-h2Q&ab_channel=RoboticsBack-End
2024-12-08 13:00:08.559 | INFO     | __main__:<module>:8 - 1. https://github.com/ros-realtime/ros-realtime.github.io
2024-12-08 13:00:08.559 | INFO     | __main__:<module>:8 - 2. https://github.com/ros-realtime/linux-real-time-kernel-builder
2024-12-08 13:00:08.559 | INFO     | __main__:<module>:8 - 3. https://github.com/ros-realtime/ros-realtime-rpi4-image
2024-12-08 13:00:08.559 | INFO     | __main__:<module>:8 - 4. https://github.com/ros-realtime/community
2024-12-08 13:00:08.560 | INFO     | __main__:<module>:8 - 5. https://github.com/ros-navigatio

In [6]:
# Running the scraper for each media URL
etl_scrape()

ClearML Task: overwriting (reusing) task id=fdc6c1ec983f42f7b7864eb1b847b846
ClearML results page: http://localhost:8080/projects/f6dc6f61ba1f4b69acea3decc23dcbd8/experiments/fdc6c1ec983f42f7b7864eb1b847b846/output/log
2024-12-08 13:00:13,941 - clearml.Task - INFO - Storing jupyter notebook directly as code


2024-12-08 13:00:13.969 | INFO     | app.scraper:etl_scrape:221 - Starting scraping...


CLEARML-SERVER new package available: UPGRADE to v1.17.0 is recommended!
Release Notes:
### New Features 
- New ClearML Model dashboard: View all live model endpoints in a single location, complete with real time metrics reporting.
- New UI pipeline run table comparative view: compare plots and scalars of selected pipeline runs
- Improve services agent behavior: If no credentials are specified, agent uses default credentials ([ClearML Server GitHub issue #140](https://github.com/allegroai/clearml-server/issues/140))
- Add UI re-enqueue of failed tasks
- Add UI experiment scalar results table view
- Add "Block running user's scripts in the browser" UI setting option for added security
- Add UI "Reset" to set task installed packages to originally recorded values 
- Add UI edit of default Project default output destination

### Bug Fixes
- Fix broken download links to artifacts stored in Azure ([ClearML Server GitHub issue #247](https://github.com/allegroai/clearml-server/issues/247))
- F

2024-12-08 13:00:14.625 | ERROR    | app.scraper:scrape:210 - Error scraping YouTube video: https://www.youtube.com/watch?v=idQb2pB-h2Q&ab_channel=RoboticsBack-End
2024-12-08 13:00:14.626 | ERROR    | app.scraper:scrape:211 - name 'link' is not defined
2024-12-08 13:00:14.626 | INFO     | app.scraper:scrape:214 - Finished scraping Youtube video: https://www.youtube.com/watch?v=idQb2pB-h2Q&ab_channel=RoboticsBack-End
2024-12-08 13:00:14.631 | INFO     | app.scraper:scrape:97 - Starting scraping GitHub repository: https://github.com/ros-realtime/ros-realtime.github.io
Cloning into 'ros-realtime.github.io'...
2024-12-08 13:00:15.299 | INFO     | app.scraper:scrape:134 - Finished scraping GitHub repository: https://git

ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring


Cloning into 'ros-realtime-rpi4-image'...
2024-12-08 13:00:15.996 | INFO     | app.scraper:scrape:134 - Finished scraping GitHub repository: https://github.com/ros-realtime/ros-realtime-rpi4-image
2024-12-08 13:00:16.005 | INFO     | app.scraper:scrape:97 - Starting scraping GitHub repository: https://github.com/ros-realtime/community
Cloning into 'community'...
2024-12-08 13:00:16.282 | INFO     | app.scraper:scrape:134 - Finished scraping GitHub repository: https://github.com/ros-realtime/community
2024-12-08 13:00:16.288 | INFO     | app.scraper:scrape:97 - Starting scraping GitHub repository: https://github.com/ros-navigation/docs.nav2.org
Cloning into 'docs.nav2.org'...
2024-12-08 13:00:23.384 | ERROR    | app.scraper:scrape:130

ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start


2024-12-08 13:03:16.942 | INFO     | app.scraper:scrape:76 - Finished scraping Medium article: https://medium.com/spinor/getting-started-with-ros2-create-and-set-up-a-workspace-f60a6c52328c
2024-12-08 13:03:16.944 | INFO     | app.scraper:scrape:30 - Starting scraping Medium article: https://medium.com/@kabilankb2003/building-a-simple-ros2-object-avoidance-robot-using-python-962f5b8485d7
2024-12-08 13:03:17.139 | INFO     | app.scraper:scrape:76 - Finished scraping Medium article: https://medium.com/@kabilankb2003/building-a-simple-ros2-object-avoidance-robot-using-python-962f5b8485d7
2024-12-08 13:03:17.141 | INFO     | app.scraper:scrape:30 - Starting scraping Medium article: https://medium.com/spinor/getting-started-with-ros2-asynchronous-task-handling-using-ros2-actions-0edef14e6be

In [3]:
# Running the cleaner for each scraped post
etl_clean()

ClearML Task: overwriting (reusing) task id=ad5586b627974dd387d74d8c8afc744c
ClearML results page: http://localhost:8080/projects/f6dc6f61ba1f4b69acea3decc23dcbd8/experiments/ad5586b627974dd387d74d8c8afc744c/output/log
2024-12-08 13:46:03,272 - clearml.Task - INFO - Storing jupyter notebook directly as code


2024-12-08 13:46:03.303 | INFO     | app.preprocessing.clean:etl_clean:147 - Starting cleaning process of raw docs...



2024-12-08 13:46:03.513 | INFO     | app.preprocessing.clean:medium_cleaner:32 - Cleaning content for post: https://medium.com/@psreeram/building-a-home-service-robot-a-learning-journey-6890262ad5a7
2024-12-08 13:46:03.517 | INFO     | app.preprocessing.clean:medium_cleaner:56 - Content cleaning and saving complete for post: https://medium.com/@psreeram/building-a-home-service-robot-a-learning-journey-6890262ad5a7
2024-12-08 13:46:03.524 | INFO     | app.preprocessing.clean:github_cleaner:138 - GitHub content cleaning and saving complete for repo: https://github.com/ros-realtime/ros-realtime.github.io
2024-12-08 13:46:03.527 | INFO     | app.preprocessing.clean:github_cleaner:138 - GitHub content cleaning and saving complete for repo: https://github.com/ros-realtime/linux-real-time-ker

ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring


2024-12-08 13:46:05.484 | INFO     | app.preprocessing.clean:github_cleaner:138 - GitHub content cleaning and saving complete for repo: https://github.com/ros2/geometry2
2024-12-08 13:46:05.508 | INFO     | app.preprocessing.clean:github_cleaner:138 - GitHub content cleaning and saving complete for repo: https://github.com/ros2/common_interfaces
2024-12-08 13:46:05.716 | INFO     | app.preprocessing.clean:github_cleaner:138 - GitHub content cleaning and saving complete for repo: https://github.com/ros2/rcl
2024-12-08 13:46:06.574 | INFO     | app.preprocessing.clean:github_cleaner:138 - GitHub content cleaning and saving complete for repo: https://github.com/ros-planning/moveit2
2024-12-08 13:46:06.686 | INFO     | app.preprocessing.clean:github_clean

In [None]:
# Running the chunker for each cleaned post

etl_chunk()

Could not read Jupyter Notebook: No module named 'nbconvert'
Please install nbconvert using "pip install nbconvert"
2024-12-08 14:04:35.648 | INFO     | app.preprocessing.chunk:etl_chunk:41 - Starting chunking process of cleaned docs...


ClearML Task: overwriting (reusing) task id=d3eff6909ca444b1934f52f8a9352210
ClearML results page: http://localhost:8080/projects/f6dc6f61ba1f4b69acea3decc23dcbd8/experiments/d3eff6909ca444b1934f52f8a9352210/output/log


2024-12-08 14:04:35.950 | INFO     | app.preprocessing.chunk:etl_chunk:63 - Chunked content for post: https://medium.com/@psreeram/building-a-home-service-robot-a-learning-journey-6890262ad5a7
2024-12-08 14:04:35.965 | INFO     | app.preprocessing.chunk:etl_chunk:63 - Chunked content for post: https://github.com/ros-realtime/ros-realtime.github.io
2024-12-08 14:04:35.978 | INFO     | app.preprocessing.chunk:etl_chunk:63 - Chunked content for post: https://github.com/ros-realtime/linux-real-time-kernel-builder
2024-12-08 14:04:36.062 | INFO     | app.preprocessing.chunk:etl_chunk:63 - Chunked content for post: https://github.com/ros-realtime/ros-realtime-rpi4-image
2024-12-08 14:04:36.063 | INFO     | app.preprocessing.chunk:etl_chunk:63 -

ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring


2024-12-08 14:04:37.806 | INFO     | app.preprocessing.chunk:etl_chunk:63 - Chunked content for post: https://github.com/gazebosim/gz-sensors
2024-12-08 14:04:37.813 | INFO     | app.preprocessing.chunk:etl_chunk:63 - Chunked content for post: https://github.com/ros2-java/ament_gradle_plugin
2024-12-08 14:04:37.824 | INFO     | app.preprocessing.chunk:etl_chunk:63 - Chunked content for post: https://github.com/ros2-java/ros2_java_examples
2024-12-08 14:04:37.988 | INFO     | app.preprocessing.chunk:etl_chunk:63 - Chunked content for post: https://github.com/ros2-java/ros2_java
2024-12-08 14:04:37.994 | INFO     | app.preprocessing.chunk:etl_chunk:63 - Chunked content for post: https://github.com/ros2-java/ament_java
2024-12-0

The featurizer (`app.featurizer`) embeds the chunked text with a transformer model and stores the resulting embeddings in Qdrant (a vector database) and MongoDB. It logs progress with Loguru and tracks each run as a ClearML task.

### Key Components
1. **Libraries Used**:

- **Transformers**: For text embedding using pre-trained models.
- **Torch**: For model inference and tensor operations.
- **Qdrant**: To store vector embeddings.
- **MongoDB**: For storing the processed text alongside its embedding.
- **ClearML**: For managing and logging tasks.
- **Loguru**: For better logging and debugging.

2. **Main Classes/Functions**:

- `Featurizer`: Handles text preprocessing, embedding generation, and storing data in MongoDB and Qdrant.
- `etl_featurize_step()`: Orchestrates the process by initializing tasks, fetching data, and invoking the Featurizer.

### Class: `Featurizer`
This class encapsulates the logic for embedding generation and storage.

**Initialization**

- Sets up:
    - Qdrant client (`qdrant_client`) to interact with the Qdrant vector database.
    - Pre-trained `sentence-transformers/all-MiniLM-L6-v2` model for generating embeddings.

**Methods**
1. `preprocess_text(self, text)`:

- Converts the input text to lowercase and removes extra spaces.
- Prepares the text for embedding generation.

2. `generate_embeddings(self, text)`:

- Uses the tokenizer and model to generate embeddings for the input text.
- Performs **mean pooling** on the model's last hidden state to get the final embedding vector.

3. `featurize_and_store(self, chunks, featurized_collection)`:

- Iterates over the text chunks; for each one it preprocesses the text, generates an embedding, and stores the result in:
    - MongoDB: Saves the processed text and its embedding.
    - Qdrant: Creates the collection if it does not already exist, then uploads the embedding as a vector point.

- Assigns each Qdrant point a unique identifier (UUID).
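
The preprocessing and pooling steps described above can be sketched without loading the model. This is a minimal illustration, not the code in `app.featurizer`: the whitespace rule (collapse any run of whitespace to one space), the use of an attention mask to skip padding, and the plain-list representation of token vectors are all assumptions. In the real pipeline the token vectors would be the model's last hidden state, held as tensors.

```python
import re


def preprocess_text(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace (assumed rule)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def mean_pool(token_vectors: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average the token vectors at positions where the mask is 1.

    token_vectors stands in for one sequence of the model's last hidden
    state; padding positions (mask 0) are left out of the average.
    """
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        # Nothing to average: return a zero vector of the right size.
        return [0.0] * dim
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


cleaned = preprocess_text("  ROS 2   Navigation\nStack ")
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
```

Here the third token is treated as padding, so `pooled` is the average of the first two vectors only. Pooling over every position instead would let padding tokens shift the embedding.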

### Error Handling
- Catches exceptions raised during featurization and logs them.
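
One common way to catch and log errors in a loop like this is to wrap each chunk in its own `try`/`except`, so that a single failing chunk does not stop the run. The sketch below is an assumption about structure, not the project's code: `process_all` and `process_chunk` are hypothetical names, and the standard-library `logging` module stands in for Loguru to keep the example free of dependencies.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("featurizer_sketch")


def process_all(chunks: Iterable[dict], process_chunk: Callable[[dict], object]) -> list[dict]:
    """Run process_chunk on every chunk, logging failures instead of raising.

    Returns the chunks that failed so the caller can inspect or retry them.
    """
    failed = []
    for index, chunk in enumerate(chunks, start=1):
        try:
            process_chunk(chunk)
        except Exception:
            # logger.exception records the message together with the traceback.
            logger.exception("Featurization failed for document %d", index)
            failed.append(chunk)
    return failed
```

Returning the failed chunks is a design choice for this sketch. It makes partial failures visible to the caller rather than leaving them only in the log.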

### Function: `etl_featurize_step()`
1. Initializes a ClearML task for tracking.
2. Connects to MongoDB and retrieves chunks of data from the `rag_chunked_data` collection.
3. Creates an instance of the `Featurizer` class.
4. Calls `featurize_and_store()` to process and store data.
5. Marks the ClearML task as complete.
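
The fetch, featurize, and store flow above can be sketched with plain Python stand-ins. Everything named here is hypothetical (`build_records`, `etl_featurize_sketch`, the `"text"` key, the payload layout, and the toy embedding), and ClearML task tracking is left out. Only the overall shape follows the description above: one MongoDB-style document and one Qdrant-style point with a UUID per chunk.

```python
import uuid
from typing import Callable


def build_records(
    chunks: list[dict], embed: Callable[[str], list[float]]
) -> tuple[list[dict], list[dict]]:
    """Turn chunks into MongoDB-style documents and Qdrant-style points.

    Both outputs are plain dicts here; a real run would insert the documents
    with a MongoDB client and upload the points with a Qdrant client.
    """
    documents, points = [], []
    for chunk in chunks:
        text = " ".join(chunk["text"].lower().split())  # lowercase, trim extra spaces
        vector = embed(text)
        documents.append({"text": text, "embedding": vector})
        points.append({"id": str(uuid.uuid4()), "vector": vector, "payload": {"text": text}})
    return documents, points


def etl_featurize_sketch(chunks: list[dict]) -> int:
    """Featurize all chunks and return how many points were produced."""
    # Toy embedding (assumption): a one-dimensional vector holding the text length.
    _, points = build_records(chunks, embed=lambda text: [float(len(text))])
    return len(points)


docs, pts = build_records([{"text": "  Hello   ROS "}], lambda text: [float(len(text))])
```

Passing `embed` in as a parameter keeps the record-building logic testable without a model. The real featurizer would supply its transformer-based embedding function at that point.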

### Main Entry Point
The script starts by calling the `etl_featurize_step()` function.

In [None]:
# Running the featurizer for each chunked post

etl_featurize_step()

ClearML Task: overwriting (reusing) task id=2c44a0363b8e4c34a3879de43bef1fe3
2024-12-08 14:56:31,087 - clearml.Task - INFO - Storing jupyter notebook directly as code


2024-12-08 14:56:31.160 | INFO     | app.featurizer:etl_featurize_step:126 - Featurizing data...


ClearML results page: http://localhost:8080/projects/f6dc6f61ba1f4b69acea3decc23dcbd8/experiments/2c44a0363b8e4c34a3879de43bef1fe3/output/log

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
2024-12-08 14:56:31.384 | INFO     | app.featurizer:featurize_and_store:92 - Stored document 1 in Mongo

ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring


2024-12-08 14:56:33.087 | INFO     | app.featurizer:featurize_and_store:92 - Stored document 15 in MongoDB
2024-12-08 14:56:33.096 | INFO     | app.featurizer:featurize_and_store:112 - Featurization for document 15 complete and data stored.
2024-12-08 14:56:33.153 | INFO     | app.featurizer:featurize_and_store:92 - Stored document 16 in MongoDB
2024-12-08 14:56:33.161 | INFO     | app.featurizer:featurize_and_store:112 - Featurization for document 16 complete and data stored.
2024-12-08 14:56:33.233 | INFO     | app.featurizer:featurize_and_store:92 - Stored document 17 in MongoDB
2024-12-08 14:56:33.242 | INFO     | app.featurizer:featurize_and_store:112 - Featurization for documen