# Space Biology Knowledge Engine - Text Preprocessing

This notebook focuses on cleaning and preprocessing the text data from the space biology publications.

## Objectives:
1. Clean and normalize text data
2. Handle missing values and duplicates
3. Tokenize text and remove stop words
4. Apply stemming/lemmatization
5. Prepare data for topic modeling
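
The cells below start from an already-cleaned CSV, so the cleaning steps themselves are not shown in this notebook. A minimal sketch of objectives 1 to 3, using only the standard library, might look like the following. The function names and the tiny `STOP_WORDS` set are hypothetical and are not the pipeline that produced `clean_text`.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "are", "for", "on", "by"}


def clean_text(text):
    """Lowercase, replace non-alphanumeric characters with spaces, collapse whitespace."""
    if not isinstance(text, str):
        return ""  # treat missing values (None, NaN) as empty text
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split cleaned text on whitespace and drop stop words."""
    return [tok for tok in clean_text(text).split() if tok not in STOP_WORDS]
```

Note that this sketch splits hyphenated names such as "Bion-M" into two tokens. Whether that is acceptable depends on the downstream model.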


In [1]:
# Cell 1: Imports
import pickle

import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer





In [6]:
# Cell 2: Load cleaned dataset
df = pd.read_csv("datasets/sb_publications_clean.csv")
print("Dataset shape:", df.shape)
print(df.head())


Dataset shape: (624, 6)
                                               title  \
0  Mice in Bion-M 1 space mission: training and s...   
1  Microgravity induces pelvic bone loss through ...   
2  Stem Cell Health and Tissue Regeneration in Mi...   
3  Microgravity Reduces the Differentiation and R...   
4  Microgravity validation of a novel system for ...   

                                                link  \
0  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...   
1  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...   
2  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...   
3  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...   
4  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...   

                                                text  \
0  Mice in Bion-M 1 space mission: training and s...   
1  Microgravity induces pelvic bone loss through ...   
2  Stem Cell Health and Tissue Regeneration in Mi...   
3  Microgravity Reduces the Differentiation and R...   
4  Microgravity valida

In [7]:
# Cell 3: Initialize embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

In [9]:
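# Cell 4: Create embeddings
# Missing values become empty strings so that row i of `embeddings` still
# corresponds to row i of `df`.
texts = df['clean_text'].fillna("").tolist()
embeddings = model.encode(texts, show_progress_bar=True, convert_to_numpy=True)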

print("Embeddings shape:", embeddings.shape)

Embeddings shape: (624, 384)
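
The embeddings above are used as returned by the model. If cosine similarity is preferred over raw L2 distance, one common option is to scale each vector to unit length first. The sketch below uses NumPy only, with small stand-in data instead of the real `embeddings`; `l2_normalize` is a hypothetical helper name. It also checks the identity that makes the two metrics interchangeable for unit vectors.

```python
import numpy as np


def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit L2 norm; eps avoids division by zero for all-zero rows."""
    vectors = np.asarray(vectors, dtype="float32")
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)


# Stand-in data with the same layout as `embeddings` (rows = documents).
demo = np.array([[3.0, 4.0], [0.0, 2.0], [1.0, 0.0]], dtype="float32")
unit = l2_normalize(demo)

# For unit vectors, squared L2 distance and cosine similarity are linked by
# ||a - b||^2 = 2 - 2 * cos(a, b), so both produce the same ranking.
cos = unit @ unit.T
sq_dist = ((unit[:, None, :] - unit[None, :, :]) ** 2).sum(axis=-1)
```

In the notebook itself, passing `normalize_embeddings=True` to `model.encode` should have the same effect as `l2_normalize`, though this was not done in the cells shown here.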


In [10]:
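# Cell 5: Build FAISS index
dimension = embeddings.shape[1]  # 384 for MiniLM
index = faiss.IndexFlatL2(dimension)  # exact (brute-force) search on L2 distance
# FAISS expects a C-contiguous float32 array.
index.add(np.ascontiguousarray(embeddings, dtype="float32"))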

print("FAISS index contains:", index.ntotal, "documents")

FAISS index contains: 624 documents
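
`IndexFlatL2` performs an exact search: it compares the query against every stored vector and ranks by squared L2 distance. To make that concrete, here is a NumPy-only sketch of the same computation on toy data. `brute_force_l2_search` is a hypothetical helper, not part of FAISS, and it ignores the optimizations FAISS applies internally.

```python
import numpy as np


def brute_force_l2_search(corpus, queries, k):
    """Return (distances, indices) of the k nearest corpus rows per query, by squared L2."""
    corpus = np.asarray(corpus, dtype="float32")
    queries = np.asarray(queries, dtype="float32")
    # Pairwise squared distances, shape (n_queries, n_corpus).
    diff = queries[:, None, :] - corpus[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    # Sort each row ascending and keep the first k columns.
    order = np.argsort(sq_dist, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(sq_dist, order, axis=1), order


corpus = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
dists, ids = brute_force_l2_search(corpus, [[0.9, 0.0]], k=2)
```

For a corpus of 624 documents this brute-force approach is cheap, which is presumably why a flat index is sufficient here. Approximate indexes only start to pay off on much larger collections.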


In [11]:
# Cell 6: Save FAISS index and metadata
faiss.write_index(index, "publications.index")

# Records are stored in DataFrame row order, so list position i matches
# FAISS row id i.
with open("metadata.pkl", "wb") as f:
    pickle.dump(df.to_dict(orient="records"), f)

print("✅ Index + metadata saved!")

✅ Index + metadata saved!
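
A later session would need to load both files again and map FAISS row ids back to metadata records. The index side is handled by `faiss.read_index("publications.index")`. The sketch below covers only the metadata side, using stand-in records and a temporary directory so that it runs on its own; `lookup` is a hypothetical helper.

```python
import os
import pickle
import tempfile

# Stand-in records with the same shape as df.to_dict(orient="records").
records = [
    {"title": "Paper A", "link": "https://example.org/a"},
    {"title": "Paper B", "link": "https://example.org/b"},
]

# Round-trip the records through a pickle file, as Cell 6 does for the real data.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metadata.pkl")
    with open(path, "wb") as f:
        pickle.dump(records, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)


def lookup(loaded_records, row_id):
    """Map a FAISS row id back to its metadata record; out-of-range ids give None."""
    if row_id < 0 or row_id >= len(loaded_records):
        return None  # FAISS uses -1 to mark "no result"
    return loaded_records[row_id]
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.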


In [12]:
# Cell 7: Test semantic search
query = "plant growth in microgravity"
query_embedding = model.encode([query], convert_to_numpy=True)

# Search the top 5 nearest documents.
# D holds squared L2 distances, I holds row positions into df.
D, I = index.search(query_embedding, k=5)

print("Query:", query)
print("\nTop results:")
for idx in I[0]:
    row = df.iloc[idx]
    print("-", row['title'], "|", row['link'])

Query: plant growth in microgravity

Top results:
- Comparison of Microgravity Analogs to Spaceflight in Studies of Plant Growth and Development | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6908503/
- Plant cell proliferation and growth are altered by microgravity conditions in spaceflight. | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9287483/
- Conserved plant transcriptional responses to microgravity from two consecutive spaceflight experiments. | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10800490/
- Plant growth strategies are remodeled by spaceflight | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11869762/
- Fifteen days of microgravity causes growth in calvaria of mice | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4110898/
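
The first four hits are about plants, while the fifth concerns mouse calvaria, which suggests that a fixed `k` can pull in weaker matches. Printing the distance next to each title, and optionally applying a cutoff, makes this easier to spot. The sketch below is one way to do that. `format_hits` is a hypothetical helper, and the distances, titles, and threshold are made up for illustration.

```python
def format_hits(distances, indices, records, max_distance=None):
    """Pair each hit with its distance; optionally drop hits above max_distance."""
    hits = []
    for dist, idx in zip(distances, indices):
        if idx < 0:
            continue  # FAISS pads with -1 when fewer than k results exist
        if max_distance is not None and dist > max_distance:
            continue
        hits.append((float(dist), records[idx]["title"]))
    return hits


# Made-up distances and titles, for illustration only.
demo_records = [{"title": "Plants A"}, {"title": "Plants B"}, {"title": "Mouse bone"}]
demo_hits = format_hits([0.4, 0.6, 1.3], [0, 1, 2], demo_records, max_distance=1.0)
```

With the notebook's variables, the call would be along the lines of `format_hits(D[0], I[0], df.to_dict(orient="records"))`. A sensible `max_distance` would have to be chosen by inspecting real distances for a few queries.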
