# Create Topic Dictionary

I'm going to build a dictionary that captures the hierarchical structure of the topics in the bioRxiv data.

I could use the Nomic Python API for this, but I've since cleaned up the topic labels, so I'll build it manually from the dataframe instead.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../data/biorxiv_nomic_normalized.csv')

In [3]:
df.head()

Unnamed: 0,doi,title,authors,author_corresponding,author_corresponding_institution,date,category,abstract,published,x,y,topic_depth_1,topic_depth_2,topic_depth_3,id
0,10.1101/000109,Speciation and introgression between Mimulus n...,Yaniv Brandvain;Amanda M Kenney;Lex Fagel;Grah...,Yaniv Brandvain,Department of Evolution and Ecology & Center f...,2013-11-07,Evolutionary Biology,Mimulus guttatus and M. nasutus are an evoluti...,10.1371/journal.pgen.1004410,0.440659,0.262446,Evolutionary Biology,Evolutionary Change,Evolutionary Genetics,0
1,10.1101/000075,A Scalable Formulation for Engineering Combina...,Vanessa Jonsson;Anders Rantzer;Richard M Murray;,Vanessa Jonsson,Caltech,2013-11-07,Evolutionary Biology,It has been shown that optimal controller synt...,10.1109/ACC.2014.6859452,-0.496457,-0.087003,Genomics Analysis,Systems Biology,Cellular Networks,1
2,10.1101/000240,Genome-wide targets of selection: female respo...,Paolo Innocenti;Ilona Flis;Edward H Morrow;,Edward H Morrow,University of Sussex,2013-11-12,Evolutionary Biology,Despite the common assumption that promiscuity...,,0.359061,0.359294,Evolutionary Biology,Animal Behavior,Evolutionary Biology,2
3,10.1101/000208,Population genomics of parallel hybrid zones i...,Nicola Nadeau;Mayte Ruiz;Patricio Salazar;Bria...,Chri Jiggins,Cambridge,2013-11-12,Evolutionary Biology,Hybrid zones can be valuable tools for studyin...,10.1101/gr.169292.113,0.425488,0.269389,Evolutionary Biology,Animal Behavior,Butterfly Colors,3
4,10.1101/000398,The Origin of Human-infecting Avian Influenza ...,Liangsheng Zhang;Zhenguo Zhang;,Zhenguo Zhang,"Department of Biology, The Pennsylvania State ...",2013-11-14,Evolutionary Biology,"In this study, we retraced the origin of the r...",,0.562014,-0.578407,Viral Infections,Zoonotic Diseases,Avian Influenza,4


In [4]:
unique_topic_depth_1 = df['topic_depth_1'].unique().tolist()
unique_topic_depth_1

['Evolutionary Biology',
 'Genomics Analysis',
 'Viral Infections',
 'Microbial Ecology',
 'Neural Science',
 'Cancer Research',
 'Cell Biology',
 'Neurological Disorders']

These depth-1 labels aren't very useful, so I'll use the categories (the `category` column) for this topic depth instead.

In [5]:
unique_topic_depth_2 = df['topic_depth_2'].unique().tolist()
unique_topic_depth_2

['Evolutionary Change',
 'Systems Biology',
 'Animal Behavior',
 'Zoonotic Diseases',
 'Evolutionary Dynamics',
 'Genetic Engineering',
 'Genetics',
 'Neural Networks',
 'Genetic Traits',
 'Genetic Evolution',
 'Noncoding RNAs',
 'Ecosystem Management',
 'Genomics',
 'Protein Networks',
 'Glioblastoma',
 'Liver Inflammation',
 'HIV Research',
 'Plant Stress',
 'Pharmaceuticals',
 'Nematode Biology',
 'Protein Structure',
 'Aging',
 'Soil Microbiology',
 "Alzheimer's Disease",
 'Parkinsons Disease',
 'Psychological Disorders',
 'Embryonic Development',
 'Mitochondria',
 'Microbial Pathogens',
 'Neuroscience',
 'Muscle Kinematics',
 'Biomedical Imaging',
 'Learning Theory',
 'Memory and Fear',
 'Virology',
 'Single Cell',
 'Malaria',
 'Influenza Virus',
 'Gene Expression',
 'Microscopy',
 'Microbial Ecology',
 'Immune Response',
 'Cell Division',
 'Honeybees',
 'Genome Regulation',
 'Neural Development',
 'Bacterial Biofilm',
 'Marine Biology',
 'Cell Mechanics',
 'Epigenetic Aging',
 'C

In [6]:
muscle_kinematics_rows = df[df['topic_depth_2'] == 'Muscle Kinematics']

muscle_kinematics_rows

Unnamed: 0,doi,title,authors,author_corresponding,author_corresponding_institution,date,category,abstract,published,x,y,topic_depth_1,topic_depth_2,topic_depth_3,id
67,10.1101/001156,Influence of walking speed on locomotor time p...,Fabrice MEGROT;Carole MEGROT;,Fabrice MEGROT,French Red Cross,2013-12-04,Neuroscience,The aim of the present study was to determine ...,,0.046694,0.856476,Neural Science,Muscle Kinematics,Exercise Physiology,67
403,10.1101/005538,Hip and knee kinematics display complex and ti...,Corey Scholes;Michael McDonald;Anthony Parker;,Corey Scholes,Sydney Orthopaedic Research Institute,2014-05-26,Physiology,The validity of fatigue protocols involving mu...,,0.288280,0.735520,Neural Science,Muscle Kinematics,Exercise Physiology,403
1140,10.1101/015040,Decoding of human hand actions to handle missi...,Jovana Belic;Aldo Faisal;,Jovana Belic,Royal Institute of Technology,2015-02-09,Neuroscience,The only way we can interact with the world is...,10.3389/fncom.2015.00027,0.212212,0.699732,Neural Science,Muscle Kinematics,Motor Control,1140
4376,10.1101/056812,The Human Octopus: controlling supernumerary h...,Sander Kulu;Madis Vasser;Raul Vicente Zafra;Ja...,Jaan Aru,"Institute for Computer Science, University of ...",2016-06-03,Neuroscience,"We investigated the \""human octopus\"" phenomen...",,0.195505,0.700397,Neural Science,Muscle Kinematics,Motor Control,4376
4874,10.1101/063404,Visual guidance of bimanual coordination relie...,Janina Brandes;Farhad Rezvani;Tobias Heed;,Janina Brandes,Faculty of Psychology and Human Movement Scien...,2016-07-12,Neuroscience,Visual spatial information is paramount in gui...,10.1038/s41598-017-16860-x,0.176387,0.710916,Neural Science,Muscle Kinematics,Motor Control,4874
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
257992,10.1101/2024.11.21.624713,E-bike riding: A metabolic evaluation in the c...,"Bonardi, A.; Iannetta, D.; Negro, F.",Danilo Iannetta,University of Brescia,2024-11-22,physiology,IntroductionE-bikes are being promoted as a mo...,,0.308692,0.740351,Neural Science,Muscle Kinematics,Exercise Physiology,257992
258417,10.1101/2024.11.26.625431,Cortical neural activity during responses to m...,"Hooks, K.; Kiani, K.; Fu, Q.",Qiushi Fu,"Mechanical and Aerospace Engineering, Universi...",2024-11-26,neuroscience,"Handedness, as measured by self-reported hand ...",,0.281859,0.812538,Neural Science,Muscle Kinematics,Motor Control,258417
258450,10.1101/2024.11.27.625740,Changes in interlimb coordination induced by w...,"Hall, B. L.; Roemmich, R. T.; Banks, C. L.",Ryan T Roemmich,Kennedy Krieger Institute/Johns Hopkins Univer...,2024-11-27,neuroscience,"During walking, interlimb coordination involve...",,0.279079,0.754857,Neural Science,Muscle Kinematics,Exercise Physiology,258450
258635,10.1101/2024.11.25.624926,Effects of Uni- and Bidirectional Interaction ...,"Short, M. R.; Ludvig, D.; Di Tommaso, F.; Vian...",Matthew R Short,Shirley Ryan AbilityLab,2024-11-28,bioengineering,Haptic human-robot-human interaction allows us...,,0.222113,0.709513,Neural Science,Muscle Kinematics,Motor Control,258635


In [7]:
honeybees_rows = df[df['topic_depth_2'] == 'Honeybees']
honeybees_rows

Unnamed: 0,doi,title,authors,author_corresponding,author_corresponding_institution,date,category,abstract,published,x,y,topic_depth_1,topic_depth_2,topic_depth_3,id
122,10.1101/001750,Shifts in stability and control effectiveness ...,Dennis Evangelista;Sharlene Cam;Tony Huynh;Aus...,Dennis Evangelista,University of North Carolina at Chapel Hill,2014-01-13,Biophysics,The capacity for aerial maneuvering was likely...,10.7717/peerj.632,0.442054,0.427260,Evolutionary Biology,Honeybees,Kinematics,122
139,10.1101/001925,Complex behavioral manipulation drives mismatc...,Fabricio Baccaro;João Araújo;Harry Evans;Jorge...,David Hughes,Penn State,2014-01-21,Ecology,Parasites and hosts are intimately associated ...,,0.843289,-0.015681,Evolutionary Biology,Honeybees,Ant Colonies,139
175,10.1101/002501,Within the fortress: A specialized parasite of...,Emilia S. Gracia;Charissa de Bekker;Jim Russel...,Emilia S. Gracia,Pennsylvania State University,2014-02-07,Ecology,Every level of biological organization from ce...,,0.530758,0.582947,Evolutionary Biology,Honeybees,Ant Colonies,175
265,10.1101/003574,3D mapping of disease in ant societies reveals...,Raquel G Loreto;Simon L Elliot;Mayara LR Freit...,Raquel G Loreto,Pennsylvania State University & Federal Univer...,2014-03-27,Evolutionary Biology,Despite the widely held position that the soci...,10.1371/journal.pone.0103516,0.537416,0.577258,Evolutionary Biology,Honeybees,Ant Colonies,265
630,10.1101/008516,Amino acid and carbohydrate tradeoffs by honey...,Harmen P. Hendriksma;Karmi L. Oxman;Sharoni Sh...,Harmen P. Hendriksma,The Hebrew University of Jerusalem,2014-08-28,Animal Behavior and Cognition,"Honey bees are important pollinators, requirin...",10.1016/j.jinsphys.2014.05.025,0.854429,0.301209,Evolutionary Biology,Honeybees,Honeybee Behavior,630
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
258324,10.1101/2024.11.26.625348,Lateralised memory networks explain the use of...,"Filippi, G.; Knight, J.; Philippides, A.; Grah...",Giulio Filippi,University of Sussex,2024-11-26,neuroscience,Many insects use memories of their visual envi...,,0.498261,0.594319,Evolutionary Biology,Honeybees,Ant Colonies,258324
258584,10.1101/2024.11.28.625904,Gustatory sensitivity to amino acids in bumble...,"Rossoni, S.; Parkinson, R. H.; Niven, J. E.; N...",Elizabeth Nicholls,University of Sussex,2024-11-28,neuroscience,Bees rely on amino acids obtained from nectar ...,,0.853556,0.300586,Evolutionary Biology,Honeybees,Honeybee Behavior,258584
258589,10.1101/2024.11.27.625613,The swarming behaviors of Vorticella,"Tabata, T.; Maeyama, Y.",Tetsuya Tabata,The University of Tokyo,2024-11-28,animal behavior and cognition,Two Vorticella species undergo a synchronous t...,,0.326555,-0.296308,Evolutionary Biology,Honeybees,Kinematics,258589
258604,10.1101/2024.11.27.625616,Aggressive conflict impacts path integration i...,"Bollig, A.; Freire, M.; Buecking, K.; Kuehnapf...",Markus Knaden,Max Planck Institute for Chemical Ecology,2024-11-28,animal behavior and cognition,The desert ant Cataglyphis fortis inhabits the...,,0.505658,0.592075,Evolutionary Biology,Honeybees,Ant Colonies,258604
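
Scanning rows like this works, but a per-label count can make the mix inside a depth-2 topic easier to see. The sketch below is only an illustration: the `toy` dataframe and the `depth_3_counts` helper are hypothetical names, and the rows are invented stand-ins rather than real bioRxiv records.

```python
import pandas as pd

# Toy stand-in for the real dataframe; these rows are invented for illustration.
toy = pd.DataFrame({
    "topic_depth_2": ["Honeybees", "Honeybees", "Honeybees", "Honeybees", "Muscle Kinematics"],
    "topic_depth_3": ["Ant Colonies", "Ant Colonies", "Honeybee Behavior", "Kinematics", "Motor Control"],
})

def depth_3_counts(frame, depth_2_label):
    """Count the depth-3 labels among rows carrying the given depth-2 label."""
    subset = frame[frame["topic_depth_2"] == depth_2_label]
    return subset["topic_depth_3"].value_counts()

honeybee_counts = depth_3_counts(toy, "Honeybees")
print(honeybee_counts)
```

On the real data, a depth-2 label whose counts are spread over unrelated depth-3 labels (ants, bees, kinematics) is a likely candidate for a broader name.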


Okay, here are the edits to the Depth 2 topic labels:

- Muscle Kinematics -> Motor Control and Movement
- Honeybees -> Insect Behavior and Ecology
- Brain Processing -> Cortical Systems and Sensory Processing
- Cardiovascular -> Cardiovascular Biology
- Microbial Health -> Microbial Communities
- Eye Vision -> Eye and Vision Biology

I'll also drop the "Research Paper" topic, because all of its entries are bioRxiv withdrawals.

In [8]:
unique_topic_depth_3 = df['topic_depth_3'].unique().tolist()
unique_topic_depth_3


['Evolutionary Genetics',
 'Cellular Networks',
 'Evolutionary Biology',
 'Butterfly Colors',
 'Avian Influenza',
 'Genetic Mutations',
 'Social Networks',
 'Genetic Evolution',
 'Sex Determination',
 'Biological Circuits',
 'Biotech',
 'Neural Plasticity',
 'Genetic Regulation',
 'Genetics',
 'MicroRNAs',
 'Genetic Traits',
 'Ecological Diversity',
 'Genetic Analysis',
 'Gene Expression',
 'Systems Biology',
 'Bioinformatics',
 'Optimal Foraging',
 'Inflammation Response',
 'HIV Cure',
 'Plant Development',
 'Plant Extracts',
 'Immune Response',
 'Neural Behavior',
 'Brain Connectivity',
 'Biological Assemblies',
 'Telomere Maintenance',
 'Plant Disease',
 'Viral Infection',
 "Alzheimer's Disease",
 'DNA Origami',
 'Genetic Disorders',
 'Genetic Variation',
 'Human Genetics',
 'Photosynthesis',
 'Biodiversity Conservation',
 'Forestry',
 'Thermal Tolerance',
 'Cannabis',
 'Stem Cell Biology',
 'Mitochondria',
 'Cancer Microenvironment',
 'Cancer Biomarkers',
 'Bacterial Growth',
 'Gen

In [9]:
len(unique_topic_depth_3)

418

Edits for Depth 3:

Based on manual inspection of the abstracts for these topics in the Neon DB console:

- Butterfly Colors -> Color Phenotypes
- Biotech -> Technology Development
- Optimal Foraging -> Population-Level Dynamics
- HIV Cure -> HIV Treatment
- Brain Navigation -> Hippocampal Research
- Biology -> Spatial Biology and Environmental Response
- Academic Research -> Meta-Science
- Brain Computer -> Brain Computer Interfaces
- Brain Activity -> Sleep Research
- Protein Enzyme -> Enzymology
- Neurodegenerative -> Neurodegeneration
- Medical -> Bacterial Pathogens

Drop 'Research Paper', because all of its entries are bioRxiv withdrawals.

Similar or redundant topics, identified with ChatGPT o1 and merged as follows:

- Animal Behaviour -> Animal Behavior
- Immune Cell -> Immune Cells
- Neuro Imaging -> Neuroimaging
- Metabolic Pathways -> Metabolic Pathway
- COVID Disease -> COVID-19
- Genetic Evolution -> Evolutionary Genetics
- Neuronal Plasticity -> Neural Plasticity
- Genetic Regulation -> Gene Regulation
- Genetic Disorders -> Genetic Disease
- Neurological Disorders -> Neurological Disease
- Cancer Treatment -> Cancer Therapy

In [10]:
def replace_value_in_column(df, column, search_value, replace_value):
    """
    Replace occurrences of search_value with replace_value in one column.

    Note: the column is reassigned on the dataframe that is passed in, so the
    caller's dataframe is modified in place. The same object is returned so
    that calls can be written as `df = replace_value_in_column(df, ...)`.

    Parameters:
    df (pd.DataFrame): The dataframe to modify.
    column (str): The column to search within.
    search_value: The value to search for (exact match on the whole cell).
    replace_value: The value to replace the search_value with.

    Returns:
    pd.DataFrame: The same dataframe, with the values replaced.
    """
    df[column] = df[column].replace(search_value, replace_value)
    return df

In [11]:
# Replace specific topic labels in the dataframe
df = replace_value_in_column(df, 'topic_depth_2', 'Muscle Kinematics', 'Motor Control and Movement')
df = replace_value_in_column(df, 'topic_depth_2', 'Honeybees', 'Insect Behavior and Ecology')
df = replace_value_in_column(df, 'topic_depth_2', 'Brain Processing', 'Cortical Systems and Sensory Processing')
df = replace_value_in_column(df, 'topic_depth_2', 'Cardiovascular', 'Cardiovascular Biology')
df = replace_value_in_column(df, 'topic_depth_2', 'Microbial Health', 'Microbial Communities')
df = replace_value_in_column(df, 'topic_depth_2', 'Eye Vision', 'Eye and Vision Biology')

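# Drop the "Research Paper" topic.
# .copy() makes the filtered result an independent dataframe, so the later
# column assignments do not trigger pandas' SettingWithCopyWarning.
df = df[df['topic_depth_2'] != 'Research Paper'].copy()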

In [12]:
# Replace specific topic labels in the dataframe for topic_depth_3
df = replace_value_in_column(df, 'topic_depth_3', 'Butterfly Colors', 'Color Phenotypes')
df = replace_value_in_column(df, 'topic_depth_3', 'Biotech', 'Technology Development')
df = replace_value_in_column(df, 'topic_depth_3', 'Optimal Foraging', 'Population-Level Dynamics')
df = replace_value_in_column(df, 'topic_depth_3', 'HIV Cure', 'HIV Treatment')
df = replace_value_in_column(df, 'topic_depth_3', 'Brain Navigation', 'Hippocampal Research')
df = replace_value_in_column(df, 'topic_depth_3', 'Biology', 'Spatial Biology and Environmental Response')
df = replace_value_in_column(df, 'topic_depth_3', 'Academic Research', 'Meta-Science')
df = replace_value_in_column(df, 'topic_depth_3', 'Brain Computer', 'Brain Computer Interfaces')
df = replace_value_in_column(df, 'topic_depth_3', 'Brain Activity', 'Sleep Research')
df = replace_value_in_column(df, 'topic_depth_3', 'Protein Enzyme', 'Enzymology')
df = replace_value_in_column(df, 'topic_depth_3', 'Neurodegenerative', 'Neurodegeneration')
df = replace_value_in_column(df, 'topic_depth_3', 'Medical', 'Bacterial Pathogens')

# Drop the "Research Paper" topic.
# .copy() keeps the filtered result independent, avoiding SettingWithCopyWarning below.
df = df[df['topic_depth_3'] != 'Research Paper'].copy()

# Merge similar/redundant topic_depth_3 labels (identified with ChatGPT o1)
df = replace_value_in_column(df, 'topic_depth_3', 'Animal Behaviour', 'Animal Behavior')
df = replace_value_in_column(df, 'topic_depth_3', 'Immune Cell', 'Immune Cells')
df = replace_value_in_column(df, 'topic_depth_3', 'Neuro Imaging', 'Neuroimaging')
df = replace_value_in_column(df, 'topic_depth_3', 'Metabolic Pathways', 'Metabolic Pathway')
df = replace_value_in_column(df, 'topic_depth_3', 'COVID Disease', 'COVID-19')
df = replace_value_in_column(df, 'topic_depth_3', 'Genetic Evolution', 'Evolutionary Genetics')
df = replace_value_in_column(df, 'topic_depth_3', 'Neuronal Plasticity', 'Neural Plasticity')
df = replace_value_in_column(df, 'topic_depth_3', 'Genetic Regulation', 'Gene Regulation')
df = replace_value_in_column(df, 'topic_depth_3', 'Genetic Disorders', 'Genetic Disease')
df = replace_value_in_column(df, 'topic_depth_3', 'Neurological Disorders', 'Neurological Disease')
df = replace_value_in_column(df, 'topic_depth_3', 'Cancer Treatment', 'Cancer Therapy')
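
The renames above could also be kept in a single mapping and applied in one pass, since `Series.replace` accepts a dict of old-to-new values. This is a sketch of that alternative, not what the notebook ran: `DEPTH_2_RENAMES` and `apply_renames` are hypothetical names, the pairs are copied from the Depth 2 list, and `toy` is an invented stand-in for the real dataframe.

```python
import pandas as pd

# Hypothetical mapping; the pairs are copied from the Depth 2 edits listed earlier.
DEPTH_2_RENAMES = {
    "Muscle Kinematics": "Motor Control and Movement",
    "Honeybees": "Insect Behavior and Ecology",
    "Brain Processing": "Cortical Systems and Sensory Processing",
    "Cardiovascular": "Cardiovascular Biology",
    "Microbial Health": "Microbial Communities",
    "Eye Vision": "Eye and Vision Biology",
}

def apply_renames(frame, column, renames):
    """Return a copy of frame with whole-cell labels in column renamed via the dict."""
    out = frame.copy()
    out[column] = out[column].replace(renames)
    return out

# Invented rows, for illustration only.
toy = pd.DataFrame({"topic_depth_2": ["Honeybees", "Genetics", "Eye Vision"]})
renamed = apply_renames(toy, "topic_depth_2", DEPTH_2_RENAMES)
print(renamed["topic_depth_2"].tolist())
```

Keeping the pairs in one dict means the markdown list and the code cannot drift apart as easily, and labels that are not in the mapping (such as 'Genetics' here) pass through unchanged.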

In [13]:
new_unique_topic_depth_3 = df['topic_depth_3'].unique().tolist()

new_unique_topic_depth_3

['Evolutionary Genetics',
 'Cellular Networks',
 'Evolutionary Biology',
 'Color Phenotypes',
 'Avian Influenza',
 'Genetic Mutations',
 'Social Networks',
 'Sex Determination',
 'Biological Circuits',
 'Technology Development',
 'Neural Plasticity',
 'Gene Regulation',
 'Genetics',
 'MicroRNAs',
 'Genetic Traits',
 'Ecological Diversity',
 'Genetic Analysis',
 'Gene Expression',
 'Systems Biology',
 'Bioinformatics',
 'Population-Level Dynamics',
 'Inflammation Response',
 'HIV Treatment',
 'Plant Development',
 'Plant Extracts',
 'Immune Response',
 'Neural Behavior',
 'Brain Connectivity',
 'Biological Assemblies',
 'Telomere Maintenance',
 'Plant Disease',
 'Viral Infection',
 "Alzheimer's Disease",
 'DNA Origami',
 'Genetic Disease',
 'Genetic Variation',
 'Human Genetics',
 'Photosynthesis',
 'Biodiversity Conservation',
 'Forestry',
 'Thermal Tolerance',
 'Cannabis',
 'Stem Cell Biology',
 'Mitochondria',
 'Cancer Microenvironment',
 'Cancer Biomarkers',
 'Bacterial Growth',
 'G

In [14]:
len(new_unique_topic_depth_3)

406
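
The count falls from 418 to 406, a net change of 12 labels. That would be consistent with one dropped label plus eleven merges, provided every merge target already existed and none of the first twelve renames collided with an existing label, but I haven't verified this here. A set comparison is one way to check; `summarize_label_changes` is a hypothetical helper and the two lists below are small invented samples, not the full label lists.

```python
def summarize_label_changes(before, after):
    """Compare two label lists and report what disappeared, what appeared, and the net change."""
    before_set, after_set = set(before), set(after)
    return {
        "removed": sorted(before_set - after_set),
        "added": sorted(after_set - before_set),
        "net_change": len(after_set) - len(before_set),
    }

# Small invented samples: one rename, one merge into an existing label, one drop.
before = ["Butterfly Colors", "Genetic Evolution", "Evolutionary Genetics", "Research Paper"]
after = ["Color Phenotypes", "Evolutionary Genetics"]

summary = summarize_label_changes(before, after)
print(summary)
```

On the real data this would be called with `unique_topic_depth_3` and `new_unique_topic_depth_3`; any unexpected entry under "removed" or "added" would point to a typo in one of the rename pairs.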

In [15]:
df.to_csv("../data/biorxiv_nomic_edited_topics.csv", index=False)

## Save the topic dictionary to JSON

In [16]:
unique_topic_depth_2 = df['topic_depth_2'].unique().tolist()

In [17]:
unique_topic_depth_2[0:10]

['Evolutionary Change',
 'Systems Biology',
 'Animal Behavior',
 'Zoonotic Diseases',
 'Evolutionary Dynamics',
 'Genetic Engineering',
 'Genetics',
 'Neural Networks',
 'Genetic Traits',
 'Genetic Evolution']

In [18]:
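# Map each depth-2 topic to the list of depth-3 topics found under it
topic_dict = {}

for topic in unique_topic_depth_2:
    subset = df[df['topic_depth_2'] == topic]
    # Use a local name here so the earlier unique_topic_depth_3 list is not overwritten
    depth_3_topics = subset['topic_depth_3'].unique().tolist()
    topic_dict[topic] = depth_3_topics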

topic_dict

{'Evolutionary Change': ['Evolutionary Genetics',
  'Thermal Tolerance',
  'Ecosystem Ecology'],
 'Systems Biology': ['Cellular Networks',
  'Statistical Modeling',
  'Epidemiology',
  'Data Clustering',
  'Metabolic Pathway',
  'Meta-Science'],
 'Animal Behavior': ['Evolutionary Biology',
  'Color Phenotypes',
  'Social Networks',
  'Population-Level Dynamics',
  'Ecological Niche',
  'Behavioral Tracking',
  'Bird Migration',
  'Animal Communication',
  'Animal Behavior',
  'Primates'],
 'Zoonotic Diseases': ['Avian Influenza',
  'Livestock Disease',
  'Bats and Rabies'],
 'Evolutionary Dynamics': ['Genetic Mutations',
  'Bacterial Growth',
  'Evolutionary Biology',
  'Aging',
  'Ecosystem Dynamics',
  'Disease Transmission',
  'Microbial Communities'],
 'Genetic Engineering': ['Evolutionary Genetics',
  'Biological Circuits',
  'Technology Development',
  'Bacterial Genetics',
  'Transposons',
  'Antibiotic Resistance',
  'Microbial Evolution',
  'Bacteriophage Therapy',
  'Bacteria
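
The same mapping can probably be built with a single `groupby` instead of filtering the dataframe once per topic. This is a sketch under assumptions: `build_topic_dict` is a hypothetical helper and `toy` holds invented rows. One difference to keep in mind is that `groupby` skips missing (NaN) depth-2 labels by default, whereas the loop above would give such a key an empty list.

```python
import pandas as pd

# Invented rows, for illustration only.
toy = pd.DataFrame({
    "topic_depth_2": ["Zoonotic Diseases", "Zoonotic Diseases", "Evolutionary Change", "Zoonotic Diseases"],
    "topic_depth_3": ["Avian Influenza", "Livestock Disease", "Evolutionary Genetics", "Avian Influenza"],
})

def build_topic_dict(frame):
    """Map each depth-2 label to its unique depth-3 labels, in order of first appearance."""
    # sort=False keeps the groups in the order they first appear in the dataframe
    grouped = frame.groupby("topic_depth_2", sort=False)["topic_depth_3"]
    return {topic: values.unique().tolist() for topic, values in grouped}

toy_topic_dict = build_topic_dict(toy)
print(toy_topic_dict)
```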

In [19]:
import json

# Convert topic_dict to the required format:
# one record per broad topic, sorted by title, with 1-based ids assigned in that order
broadTopics = [
    {"id": idx, "title": title, "specificTopics": topic_dict[title]}
    for idx, title in enumerate(sorted(topic_dict), start=1)
]

# Flatten the specific topics, remove duplicates (the same specific topic can
# appear under more than one broad topic), and sort
specificTopics = sorted({name for names in topic_dict.values() for name in names})

# Save broadTopics to a JSON file
with open('../data/broadTopics.json', 'w') as f:
    json.dump(broadTopics, f, indent=2)

# Save specificTopics to a JSON file
with open('../data/specificTopics.json', 'w') as f:
    json.dump(specificTopics, f, indent=2)
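
A quick round-trip check after writing the files can catch mismatches between the two outputs. The sketch below is an assumption-laden illustration: `check_topic_files` is a hypothetical helper, and it runs against small invented files in a temporary directory rather than the real `../data` outputs.

```python
import json
import tempfile
from pathlib import Path

def check_topic_files(broad_path, specific_path):
    """Reload both JSON files and report basic consistency checks."""
    broad = json.loads(Path(broad_path).read_text())
    specific = set(json.loads(Path(specific_path).read_text()))
    ids = [entry["id"] for entry in broad]
    listed = {name for entry in broad for name in entry["specificTopics"]}
    return {
        # ids should run 1..n in file order
        "ids_sequential": ids == list(range(1, len(broad) + 1)),
        # every specific topic named under a broad topic should be in the flat list
        "missing_from_specific": sorted(listed - specific),
    }

with tempfile.TemporaryDirectory() as tmp:
    broad_file = Path(tmp) / "broadTopics.json"
    specific_file = Path(tmp) / "specificTopics.json"
    # Invented sample content mirroring the structure written above.
    broad_file.write_text(json.dumps([
        {"id": 1, "title": "Animal Behavior", "specificTopics": ["Bird Migration", "Primates"]},
        {"id": 2, "title": "Zoonotic Diseases", "specificTopics": ["Avian Influenza"]},
    ]))
    specific_file.write_text(json.dumps(["Avian Influenza", "Bird Migration", "Primates"]))
    report = check_topic_files(broad_file, specific_file)

print(report)
```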