# Space Biology Knowledge Engine - Topic Modeling

This notebook applies topic modeling techniques to identify research themes in space biology publications.

## Objectives:
1. Apply Latent Dirichlet Allocation (LDA) for topic modeling
2. Perform clustering analysis
3. Visualize topics and clusters
4. Analyze topic trends over time


In [1]:
# Cell 1: Imports
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pickle
import os


In [2]:
# Cell 2: Load preprocessed dataset
df = pd.read_csv("datasets/sb_publications_clean.csv")

print("Dataset shape:", df.shape)
df.head()


Dataset shape: (624, 6)


Unnamed: 0,title,link,text,clean_text,word_count,topic
0,Mice in Bion-M 1 space mission: training and s...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...,Mice in Bion-M 1 space mission: training and s...,Mice in Bion-M 1 space mission: training and s...,6,2
1,Microgravity induces pelvic bone loss through ...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,Microgravity induces pelvic bone loss through ...,Microgravity induces pelvic bone loss through ...,14,3
2,Stem Cell Health and Tissue Regeneration in Mi...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...,Stem Cell Health and Tissue Regeneration in Mi...,Stem Cell Health and Tissue Regeneration in Mi...,6,-1
3,Microgravity Reduces the Differentiation and R...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...,Microgravity Reduces the Differentiation and R...,Microgravity Reduces the Differentiation and R...,8,-1
4,Microgravity validation of a novel system for ...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,Microgravity validation of a novel system for ...,Microgravity validation of a novel system for ...,17,1


In [3]:
# Cell 3: Vectorize text
# Drop terms found in more than 95% of documents or in fewer than 2 documents
vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")
dtm = vectorizer.fit_transform(df["clean_text"])

print("Document-Term Matrix shape:", dtm.shape)


Document-Term Matrix shape: (624, 898)


In [4]:
# Cell 4: LDA Topic Modeling
NUM_TOPICS = 10  # Number of topics to fit; a hyperparameter worth tuning
lda = LatentDirichletAllocation(n_components=NUM_TOPICS, random_state=42)
lda.fit(dtm)

# Save model & vectorizer
os.makedirs("models", exist_ok=True)
with open("models/lda_model.pkl", "wb") as f:
    pickle.dump(lda, f)
with open("models/vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

print("✅ LDA model trained and saved")


✅ LDA model trained and saved
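Saving both the model and the vectorizer matters because new text has to be mapped onto the same vocabulary before the model can score it. The sketch below illustrates that round trip on a hypothetical toy corpus, pickling in memory rather than to the `models/` files so that it runs on its own; `toy_docs` and `new_titles` are invented for illustration.

```python
import pickle

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for df["clean_text"].
toy_docs = [
    "microgravity induces bone loss in mice",
    "spaceflight alters skeletal muscle in mice",
    "arabidopsis gene expression during spaceflight",
    "plant root growth under simulated microgravity",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(toy_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(dtm)

# In-memory round trip; the notebook writes these bytes to disk instead.
lda_loaded = pickle.loads(pickle.dumps(lda))
vectorizer_loaded = pickle.loads(pickle.dumps(vectorizer))

# Use transform (not fit_transform) so the saved vocabulary is reused.
new_titles = ["Spaceflight alters bone density in mice"]
new_dtm = vectorizer_loaded.transform(new_titles)

# One row per document, one column per topic; each row sums to 1.
topic_mix = lda_loaded.transform(new_dtm)
print(topic_mix.round(3))
```

Words in a new title that the vectorizer never saw during fitting are silently ignored, so very short or off-topic inputs can yield a nearly uniform topic mix.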


In [5]:
# Cell 5: Display topics
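def display_topics(model, feature_names, no_top_words=10):
    """Print and return the highest-weighted terms for each LDA topic."""
    topics = {}
    for idx, topic in enumerate(model.components_):
        # argsort is ascending, so the reversed slice takes the
        # no_top_words largest weights, highest first.
        top_features = [feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]
        topics[f"Topic {idx}"] = top_features
        print(f"Topic {idx}: {', '.join(top_features)}")
    return topics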

topics = display_topics(lda, vectorizer.get_feature_names_out())


Topic 0: spaceflight, arabidopsis, thaliana, analysis, space, response, protein, proteomic, plants, responses
Topic 1: arabidopsis, space, protein, response, drosophila, gravity, thaliana, novel, network, health
Topic 2: spaceflight, calcium, space, mouse, arabidopsis, effects, stress, signaling, radiation, reveals
Topic 3: spaceflight, arabidopsis, microbial, analysis, transcriptome, microbiome, growth, space, gene, retina
Topic 4: muscle, skeletal, signaling, mice, radiation, gravity, age, bone, cell, mechanical
Topic 5: spaceflight, microgravity, simulated, plant, rna, adaptation, seq, expression, gene, alters
Topic 6: spaceflight, mice, bone, effects, omics, skeletal, exploration, stress, repertoire, induced
Topic 7: bone, mice, microgravity, plants, signaling, spaceflight, tissue, expression, changes, gene
Topic 8: space, stem, human, cell, data, cells, science, open, plant, spaceflight
Topic 9: space, international, station, isolated, genome, draft, sequences, characterization, s
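The top-word lists describe each topic, but they do not say which publications belong to it. One common follow-up, sketched here on a hypothetical toy corpus rather than the real dataset, is to take the per-document topic distribution from `transform` and label each document with its highest-probability topic.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for df["clean_text"].
toy_docs = [
    "microgravity induces bone loss in mice",
    "spaceflight alters skeletal muscle in mice",
    "bone and muscle signaling in mice",
    "arabidopsis gene expression during spaceflight",
    "plant root growth under simulated microgravity",
    "arabidopsis plant transcriptome response",
]
n_topics = 2  # Assumed value for the toy corpus; the notebook uses NUM_TOPICS = 10

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(toy_docs)
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42).fit(dtm)

# Rows are documents, columns are topics, and each row sums to 1.
doc_topic = lda.transform(dtm)

# Index of the most probable topic for each document.
dominant_topic = doc_topic.argmax(axis=1)

# Number of documents assigned to each topic.
topic_sizes = np.bincount(dominant_topic, minlength=n_topics)

for doc, topic_id in zip(toy_docs, dominant_topic):
    print(f"Topic {topic_id}: {doc}")
print("Documents per topic:", topic_sizes.tolist())
```

On the real data, the same `dominant_topic` array could be stored as a new column of `df`, which would also make it possible to count documents per topic.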

In [6]:
# Cell 6: Save topics as CSV
topics_df = pd.DataFrame.from_dict(topics, orient="index").transpose()
topics_df.to_csv("datasets/topics.csv", index=False)

print("\n✅ Topics saved to datasets/topics.csv")
topics_df.head()



✅ Topics saved to datasets/topics.csv


Unnamed: 0,Topic 0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,Topic 7,Topic 8,Topic 9
0,spaceflight,arabidopsis,spaceflight,spaceflight,muscle,spaceflight,spaceflight,bone,space,space
1,arabidopsis,space,calcium,arabidopsis,skeletal,microgravity,mice,mice,stem,international
2,thaliana,protein,space,microbial,signaling,simulated,bone,microgravity,human,station
3,analysis,response,mouse,analysis,mice,plant,effects,plants,cell,isolated
4,space,drosophila,arabidopsis,transcriptome,radiation,rna,omics,signaling,data,genome
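The `from_dict(..., orient="index").transpose()` idiom in Cell 6 yields one column per topic and pads shorter lists with missing values instead of raising an error. The sketch below shows this on a hypothetical two-topic dictionary and reads the CSV back into a dictionary; an in-memory buffer stands in for `datasets/topics.csv`.

```python
import io

import pandas as pd

# Hypothetical topics with deliberately unequal list lengths.
toy_topics = {
    "Topic 0": ["spaceflight", "arabidopsis", "protein"],
    "Topic 1": ["bone", "mice"],
}

# Rows become topics first; transposing turns each topic into a column.
wide = pd.DataFrame.from_dict(toy_topics, orient="index").transpose()
print(wide)

# Write to an in-memory buffer instead of datasets/topics.csv.
buffer = io.StringIO()
wide.to_csv(buffer, index=False)
buffer.seek(0)

# Reload and rebuild the dictionary, dropping the padding values.
reloaded = pd.read_csv(buffer)
restored = {col: reloaded[col].dropna().tolist() for col in reloaded.columns}
print(restored)
```

Keeping one column per topic makes the CSV easy to scan by eye, at the cost of needing `dropna` when topics have different numbers of terms.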
