# Space Biology Knowledge Engine - Embeddings

This notebook generates and analyzes text embeddings for semantic similarity search.

## Objectives:
1. Generate sentence embeddings with a Sentence-Transformers model (`all-MiniLM-L6-v2`)
2. Create embedding visualizations
3. Implement similarity search
4. Prepare embeddings for the API


In [1]:
# Cell 1: Imports
import pandas as pd
import numpy as np
import os
import pickle

from sentence_transformers import SentenceTransformer
import faiss





In [2]:
# Cell 2: Load preprocessed dataset
df = pd.read_csv("datasets/sb_publications_clean.csv")

print("Dataset shape:", df.shape)
df.head()


Dataset shape: (624, 6)


Unnamed: 0,title,link,text,clean_text,word_count,topic
0,Mice in Bion-M 1 space mission: training and s...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,Mice in Bion-M 1 space mission: training and s...,Mice in Bion-M 1 space mission: training and s...,6,2
1,Microgravity induces pelvic bone loss through ...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,Microgravity induces pelvic bone loss through ...,Microgravity induces pelvic bone loss through ...,14,3
2,Stem Cell Health and Tissue Regeneration in Mi...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,Stem Cell Health and Tissue Regeneration in Mi...,Stem Cell Health and Tissue Regeneration in Mi...,6,-1
3,Microgravity Reduces the Differentiation and R...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Microgravity Reduces the Differentiation and R...,Microgravity Reduces the Differentiation and R...,8,-1
4,Microgravity validation of a novel system for ...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,Microgravity validation of a novel system for ...,Microgravity validation of a novel system for ...,17,1


In [3]:
# Cell 3: Load embedding model
# You can use a smaller model for speed (e.g., 'all-MiniLM-L6-v2') 
# or a larger one for better quality (e.g., 'all-mpnet-base-v2')
MODEL_NAME = "all-MiniLM-L6-v2"
model = SentenceTransformer(MODEL_NAME)

print("✅ Model loaded:", MODEL_NAME)



✅ Model loaded: all-MiniLM-L6-v2


In [4]:
# Cell 4: Generate embeddings
embeddings = model.encode(df["clean_text"].tolist(), batch_size=64, show_progress_bar=True)

embeddings = np.array(embeddings).astype("float32")
print("Embeddings shape:", embeddings.shape)


Batches:   0%|          | 0/10 [00:00<?, ?it/s]

Embeddings shape: (624, 384)
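
The numbers in this output can be sanity-checked with a little arithmetic. The sketch below is an illustration, not part of the original notebook: it reproduces the 10 batches shown by the progress bar from the 624 rows and `batch_size=64`, and estimates the raw size of the `float32` embedding matrix. The helper names are hypothetical.

```python
import math

def num_batches(n_rows: int, batch_size: int) -> int:
    """Number of batches needed to encode n_rows texts (the last batch may be partial)."""
    return math.ceil(n_rows / batch_size)

def float32_matrix_bytes(n_rows: int, dim: int) -> int:
    """Raw size of an (n_rows, dim) float32 array: 4 bytes per value."""
    return n_rows * dim * 4

# Values taken from the cell outputs above
n_rows, dim, batch_size = 624, 384, 64

batches = num_batches(n_rows, batch_size)
size_bytes = float32_matrix_bytes(n_rows, dim)
print(f"{batches} batches, {size_bytes / 1024:.0f} KiB of embeddings")
```

At under one megabyte, the full matrix fits comfortably in memory, which is one reason an exhaustive (flat) index is a reasonable choice at this scale.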


In [5]:
# Cell 5: Save embeddings + metadata
os.makedirs("datasets", exist_ok=True)

np.save("datasets/embeddings.npy", embeddings)
df.to_json("datasets/metadata.json", orient="records", lines=True)

print("✅ Embeddings and metadata saved")


✅ Embeddings and metadata saved
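
Because `to_json(..., orient="records", lines=True)` writes JSON Lines (one JSON object per line) rather than a single JSON array, a consumer such as the API has to parse the file line by line. A minimal standard-library sketch follows; `read_jsonl` is a hypothetical helper, and the two sample records are made up for illustration rather than taken from the dataset.

```python
import io
import json

def read_jsonl(handle):
    """Parse a JSON Lines stream into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in handle if line.strip()]

# Illustrative stand-in for datasets/metadata.json (not real dataset rows)
sample = io.StringIO(
    '{"title": "Example paper A", "word_count": 3}\n'
    '{"title": "Example paper B", "word_count": 3}\n'
)

records = read_jsonl(sample)
print(len(records), records[0]["title"])
```

With pandas, `pd.read_json(path, lines=True)` reads the same format back into a DataFrame. Since the embeddings were encoded from `df["clean_text"]` in row order, record `i` in the metadata file describes row `i` of `embeddings.npy`.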


In [6]:
# Cell 6: Build FAISS index
d = embeddings.shape[1]  # embedding dimension
index = faiss.IndexFlatL2(d)
index.add(embeddings)

faiss.write_index(index, "datasets/faiss_index.idx")

print("✅ FAISS index built and saved")


✅ FAISS index built and saved


In [7]:
# Cell 7: Quick similarity search test
query = "plant growth in microgravity"
query_vec = model.encode([query]).astype("float32")

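# IndexFlatL2 returns squared L2 distances, so lower values mean closer matches
distances, indices = index.search(query_vec, k=5)

print("\n🔎 Query:", query)
print("\nTop 5 results:")
for i, idx in enumerate(indices[0]):
    # "score" here is the squared L2 distance to the query (smaller is better)
    print(f"{i+1}. {df.iloc[idx]['title']}  (score={distances[0][i]:.4f})")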



🔎 Query: plant growth in microgravity

Top 5 results:
1. Comparison of Microgravity Analogs to Spaceflight in Studies of Plant Growth and Development  (score=0.3082)
2. Plant cell proliferation and growth are altered by microgravity conditions in spaceflight.  (score=0.3368)
3. Conserved plant transcriptional responses to microgravity from two consecutive spaceflight experiments.  (score=0.5161)
4. Plant growth strategies are remodeled by spaceflight  (score=0.5887)
5. Fifteen days of microgravity causes growth in calvaria of mice  (score=0.6782)
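
The scores printed above are distances, not similarities: `IndexFlatL2` performs an exhaustive search and reports squared Euclidean distances, so the smallest value is the best match. The dependency-free sketch below illustrates the same ranking on toy 2-D vectors, plus a conversion to cosine similarity that holds only when both vectors have unit length. Whether this model's outputs are unit-normalized is an assumption worth verifying (for example with `np.linalg.norm(embeddings, axis=1)`). All function names and vectors here are hypothetical.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query, vectors, k):
    """Brute-force nearest neighbours: (distance, index) pairs, closest first."""
    scored = sorted((squared_l2(query, v), i) for i, v in enumerate(vectors))
    return scored[:k]

def cosine_from_squared_l2(d2):
    """For unit-length vectors, ||a - b||^2 = 2 - 2*cos(a, b)."""
    return 1.0 - d2 / 2.0

# Toy unit vectors standing in for embeddings
vectors = [(1.0, 0.0), (0.0, 1.0), (math.sqrt(0.5), math.sqrt(0.5))]
query = (1.0, 0.0)

for dist, idx in top_k(query, vectors, k=2):
    print(idx, round(dist, 4), round(cosine_from_squared_l2(dist), 4))
```

Under the unit-length assumption, the top result's distance of 0.3082 would correspond to a cosine similarity of roughly 0.85.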
