### 🔧 FOSS Project Embedding Pipeline (Multimodel Ingestion to Weaviate)

This notebook outlines the complete pipeline for embedding Free and Open Source Software (FOSS) project metadata (names and descriptions) using nine different transformer models. The resulting vector representations are ingested into a local Weaviate instance for semantic search and similarity-based retrieval.

### 🎯 Goal

To enable robust semantic matching of CVEs (Common Vulnerabilities and Exposures) to relevant FOSS projects by populating a vector database with diverse model embeddings of project descriptions and names.

---

### 🚀 Pipeline Overview

1. **Data Ingestion**
   - Load FOSS project metadata, specifically the `name` and `description` fields, from a structured dataset (e.g., JSON, CSV).
   - Clean and preprocess the text (e.g., lowercasing, trimming, deduplication, handling nulls).

2. **Model Embedding**
   - Apply **nine transformer models** (e.g., from HuggingFace or Sentence Transformers) to generate vector representations of each FOSS project:
     - Each model processes both the `name` and `description` separately.
     - Models may include general-purpose encoders, domain-specific variants, or multilingual models.

3. **Vector Storage in Weaviate**
   - Connect to a **locally running Weaviate instance** with vector index enabled.
   - Define a Weaviate collection (a "class" in older Weaviate terminology; the cells below use `FOSS_vectors`) with multiple named vector fields (e.g., `name_vector_model1`, `desc_vector_model1`, etc.).
   - Insert project entries into Weaviate, including:
     - Original metadata (name, description).
     - All generated vectors with model-specific field names.
     - Optional: metadata such as model used, processing time, or source info.

4. **Search and Validation**
   - Use Weaviate's vector search (`nearVector` in GraphQL, `near_vector` in the v4 Python client) or hybrid search to validate embeddings by querying semantically similar entries.
   - Optionally log distance or certainty metadata for inspection.
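
The cleaning step in stage 1 can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code: `clean_text` and `clean_records` are hypothetical helpers, and the sample records are made up. Lowercasing is included because the overview lists it, although some encoders are case-sensitive and may do better on the original casing.

```python
from typing import Iterable, Optional


def clean_text(value: Optional[str]) -> str:
    """Lowercase, collapse runs of whitespace, and map None to an empty string."""
    if value is None:
        return ""
    return " ".join(value.split()).lower()


def clean_records(records: Iterable[dict]) -> list[dict]:
    """Normalise name/description and drop nameless or duplicate entries."""
    seen: set[tuple[str, str]] = set()
    cleaned: list[dict] = []
    for record in records:
        name = clean_text(record.get("name"))
        description = clean_text(record.get("description"))
        if not name:
            continue  # a project without a name cannot be identified later
        key = (name, description)
        if key in seen:
            continue  # exact duplicate after normalisation
        seen.add(key)
        cleaned.append({"name": name, "description": description})
    return cleaned


# Made-up sample records, for illustration only.
sample = [
    {"name": "  FreeCodeCamp ", "description": "Learn to code\nfor free."},
    {"name": "freecodecamp", "description": "learn to code for free."},
    {"name": None, "description": "orphaned description"},
    {"name": "awesome", "description": None},
]
cleaned = clean_records(sample)
print(cleaned)
```

Deduplication runs after normalisation, so entries that differ only in case or spacing collapse into one.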

---

### 🧠 Key Benefits

- **Multimodel Embedding**: Redundancy and diversity in embeddings improve robustness for downstream tasks (e.g., matching CVEs with fuzzy descriptions).
- **Modular Architecture**: Easy to add/remove models, change Weaviate schema, or plug into external CVE parsing pipelines.
- **Production-Ready**: Designed for local reproducibility and scalable production use, with extensibility for distributed Weaviate or inference acceleration.
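
One way to keep the model list modular is to derive the vector field names from a small registry, so that adding or removing a model is a one-line change. The sketch below is an assumption about how this could be organised, not the pipeline's actual code: `EmbeddingModel`, `MODELS`, and the model identifiers are placeholders, and only the `name_vector_model1` / `desc_vector_model1` naming pattern comes from the overview above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingModel:
    key: str       # short label used in vector field names (hypothetical)
    model_id: str  # placeholder identifier, not a model named in this notebook


# The two text fields that each model embeds separately.
FIELDS = ("name", "desc")


def vector_field_names(models: list[EmbeddingModel]) -> list[str]:
    """Return one vector field name per (model, text field) pair."""
    return [f"{field}_vector_{model.key}" for model in models for field in FIELDS]


# Placeholder registry; the real pipeline would list its nine models here.
MODELS = [
    EmbeddingModel(key="model1", model_id="placeholder/first-encoder"),
    EmbeddingModel(key="model2", model_id="placeholder/second-encoder"),
]

names = vector_field_names(MODELS)
print(names)
```

With nine models and two text fields, this scheme yields eighteen vector fields per project.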


In [None]:
import sys
from pathlib import Path

project_root = Path().resolve()
while not (project_root / 'embedding_pipeline').exists() and project_root != project_root.parent:
    project_root = project_root.parent

sys.path.append(str(project_root))


In [None]:
from pathlib import Path
from embedding_pipeline.weaviate_db.weaviate_write_operations import create_data_objects

foss_name_description_json: Path = Path("../../json_github_data_cleaned/github_repositories_final_ordered.json")

# Uncomment to rebuild the data objects from the JSON file.
# data_objects = create_data_objects(foss_name_description_json)

# TODO: pickle data_objects after creation so a later failure does not force a full re-run.


In [None]:


import pickle

# Uncomment to cache the data objects on disk; fill in the output path first.
# with open("", "wb") as f:
#     pickle.dump(data_objects, f)
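
The caching idea from the TODO above could be wrapped in a small "load or build" helper. This is a sketch under assumptions: `load_or_build` is a hypothetical function, not part of `embedding_pipeline`, and the demo uses a throwaway temporary directory with a stand-in build function instead of the real data objects.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_build(cache_path: Path, build: Callable[[], Any]) -> Any:
    """Return the cached object if the file exists; otherwise build and cache it."""
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)
    result = build()
    # Write to a temporary file first so an interrupted dump
    # does not leave a truncated cache behind.
    tmp_path = cache_path.with_suffix(cache_path.suffix + ".tmp")
    with tmp_path.open("wb") as f:
        pickle.dump(result, f)
    tmp_path.replace(cache_path)
    return result


# Demo with a stand-in for the expensive build step.
build_calls = []


def fake_build() -> list[dict]:
    build_calls.append(1)
    return [{"name": "demo"}]


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = Path(tmp_dir) / "cache.pkl"
    first = load_or_build(cache_file, fake_build)   # builds and writes the cache
    second = load_or_build(cache_file, fake_build)  # reads the cache

print(first == second, len(build_calls))
```

In this notebook the call might look like `load_or_build(Path("foss_vectors.pkl"), lambda: create_data_objects(foss_name_description_json))`, assuming `create_data_objects` returns picklable objects. Unpickling executes code from the file, so only load pickles you created yourself.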


In [None]:
from embedding_pipeline.weaviate_db.weaviate_config import (
    close_weaviate_client,
    connect_to_local_weaviate_client,
    create_weaviate_collection,
    inspect_collection_properties,
    list_weaviate_collections,
    retrieve_existing_weaviate_collection,
    verify_weaviate_client_ready,
)

local_client = connect_to_local_weaviate_client()

try:
    list_weaviate_collections(local_client)
finally:
    # Always release the connection, even if listing fails.
    close_weaviate_client(local_client)


In [None]:
import weaviate

local_client = weaviate.connect_to_local()

try:
    list_weaviate_collections(local_client)
    inspect_collection_properties(local_client, collection_name="FOSS_vectors")
finally:
    # Always release the connection, even if inspection fails.
    close_weaviate_client(local_client)




In [None]:
NAME = "FOSS_vectors"

local_client = connect_to_local_weaviate_client()

try:
    foss_collection = retrieve_existing_weaviate_collection(NAME, local_client)
    inspect_collection_properties(local_client, NAME)
finally:
    close_weaviate_client(local_client)

In [None]:
import pickle

with open("foss_vectors.pkl", "rb") as f:
    data_objects = pickle.load(f)


In [None]:
### TIME TO COOK ###
from embedding_pipeline.weaviate_db.weaviate_write_operations import batch_import_data_objects

local_client = connect_to_local_weaviate_client()

try:
    foss_collection = retrieve_existing_weaviate_collection(NAME, local_client)

    ready_huh = verify_weaviate_client_ready(local_client)
    if ready_huh:
        print("freaking ready to cook and change the game")

    # Uncomment to run the batch import.
    # batch_import_data_objects(data_objects=data_objects, collection=foss_collection)
finally:
    # Close the connection even if the import raises.
    close_weaviate_client(local_client)


Connecting to local client...
Connected to local weaviate client--> is ready: True
Weaviate client is ready True
freaking ready to cook and change the game

##################################################
Starting to batch import data objects into Weaviate!!!!
##################################################

ImportingfreeCodeCamp freeCodeCamp...
Successfully importedfreeCodeCamp freeCodeCamp...
Importingcodecrafters-io build-your-own-x...
Successfully importedcodecrafters-io build-your-own-x...
Importingsindresorhus awesome...
Successfully importedsindresorhus awesome...
ImportingEbookFoundation free-programming-books...
Successfully importedEbookFoundation free-programming-books...
Importingpublic-apis public-apis...
Successfully importedpublic-apis public-apis...
Importingjwasham coding-interview-university...
Successfully importedjwasham coding-interview-university...
Importingkamranahmedse developer-roadmap...
Successfully importedkamranahmedse developer-roadmap...
Importingd

In [24]:
import weaviate

client = weaviate.connect_to_local()

try:
    collection = client.collections.get(NAME)
    response = collection.aggregate.over_all(total_count=True)
    
    print(response.total_count)
    
finally:
    client.close()

12670


            Please make sure to close the connection using `client.close()`.
  client = weaviate.connect_to_local()


In [None]:
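client = weaviate.connect_to_local()

try:
    collection = client.collections.get(NAME)

    # Fetch a small sample, including stored vectors, to spot-check the import.
    response = collection.query.fetch_objects(
        include_vector=True,
        limit=10,
    )

    for obj in response.objects:
        print(obj.properties)  # Original metadata (name, description)
        print(obj.vector)      # Stored vector(s)
        print(obj.uuid)        # Object ID
finally:
    # Close the connection even if the query fails.
    client.close()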
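
As a quick sanity check on fetched vectors, two objects can be compared model by model with cosine similarity. The sketch below assumes each object's vectors arrive as a mapping from vector name to a list of floats. Whether `obj.vector` has that shape depends on the client version and the collection configuration. `cosine_similarity` and `compare_named_vectors` are hypothetical helpers, and the vectors are toy values rather than real embeddings.

```python
import math
from typing import Mapping, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def compare_named_vectors(
    first: Mapping[str, Sequence[float]],
    second: Mapping[str, Sequence[float]],
) -> dict[str, float]:
    """Compare two objects on every vector name they have in common."""
    shared = sorted(set(first) & set(second))
    return {name: cosine_similarity(first[name], second[name]) for name in shared}


# Toy vectors standing in for two fetched objects.
project_a = {"name_vector_model1": [1.0, 0.0], "desc_vector_model1": [1.0, 1.0]}
project_b = {"name_vector_model1": [0.0, 1.0], "desc_vector_model1": [2.0, 2.0]}

scores = compare_named_vectors(project_a, project_b)
for vector_name, score in scores.items():
    print(f"{vector_name}: {score:.3f}")
```

Projects that are clearly related should score noticeably higher than unrelated ones under most models. If they do not, the vector-to-field mapping used during import is worth checking.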