In [14]:
import gzip
import json

from llama_index.core import (
    Document,
    Settings,
    SimpleDirectoryReader,
    VectorStoreIndex,
)
from llama_index.core.node_parser import (
    SemanticSplitterNodeParser,
    SentenceSplitter,
)
from llama_index.core.response.notebook_utils import display_source_node
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

In [8]:
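# Local LLM served by Ollama; the query engine uses it to write the answer.
Settings.llm = Ollama(model="llama3")
# Embedding model shared by the semantic splitter and the vector index.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")

# Semantic splitter: starts a new chunk where dissimilarity exceeds the 95th percentile.
splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=95,
    embed_model=Settings.embed_model,
)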

# Baseline: sentence splitter with fixed 512-token chunks, defined for comparison but not used below.
base_splitter = SentenceSplitter(chunk_size=512)
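
As a rough illustration of what `breakpoint_percentile_threshold=95` controls: the semantic splitter compares embeddings of neighboring sentence groups and starts a new chunk where the dissimilarity between neighbors is unusually large. The sketch below is a simplified stand-in, not LlamaIndex's implementation. It uses toy 2-D vectors instead of model embeddings, ignores `buffer_size`, and the helper names (`cosine_distance`, `percentile`, `find_breakpoints`) are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values, pct):
    """Linearly interpolated percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    low = math.floor(rank)
    high = math.ceil(rank)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def find_breakpoints(embeddings, pct=95):
    """Return indices i where a chunk boundary falls between sentence i and i + 1."""
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    threshold = percentile(distances, pct)
    return [i for i, d in enumerate(distances) if d > threshold]


# Toy data (assumption): three "sentences" on one topic, then three on another.
toy_embeddings = [
    (1.0, 0.0), (0.99, 0.05), (0.98, 0.1),
    (0.0, 1.0), (0.05, 0.99), (0.1, 0.98),
]
breakpoints = find_breakpoints(toy_embeddings)
print(breakpoints)
```

With a high percentile only the largest jumps become boundaries, so lowering the threshold would tend to produce more, smaller chunks.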

In [5]:
jsonfile = "../data/pubmed_cis_json/pubmed24n1073_cis.json.gz"

In [6]:
# Open in text mode so the JSON is decoded as UTF-8 explicitly.
with gzip.open(jsonfile, "rt", encoding="utf-8") as f:
    data = json.load(f)
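
`gzip.open` defaults to binary mode, which `json.load` accepts, while passing `"rt"` with an encoding makes the decoding step explicit. The round trip below is a minimal sketch of that pattern on a temporary file. The sample record and file name are made up for illustration and are not taken from the PubMed dump.

```python
import gzip
import json
import os
import tempfile

# Hypothetical record, shaped loosely like the entries in `data`.
sample = {"00000001": {"pmid": "00000001", "abstract": "Toy abstract."}}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.json.gz")

    # Write compressed JSON in text mode.
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(sample, f)

    # Read it back; "rt" decodes bytes to str before json.load parses them.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == sample)
```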
    

In [12]:
data

{'33633159': {'pmid': '33633159',
  'title': 'Integrated analysis of behavioral, epigenetic, and gut microbiome analyses in AppNL-G-F, AppNL-F, and wild type mice.',
  'abstract': "Epigenetic mechanisms occurring in the brain as well as alterations in the gut microbiome composition might contribute to Alzheimer's disease (AD). Human amyloid precursor protein knock-in (KI) mice contain the Swedish and Iberian mutations (AppNL-F) or those two and also the Arctic mutation (AppNL-G-F). In this study, we assessed whether behavioral and cognitive performance in 6-month-old AppNL-F, AppNL-G-F, and C57BL/6J wild-type (WT) mice was associated with the gut microbiome, and whether the genotype modulates this association. The genotype effects observed in behavioral tests were test-dependent. The biodiversity and composition of the gut microbiome linked to various aspects of mouse behavioral and cognitive performance but differences in genotype modulated these relationships. These genotype-dependen

In [21]:
documents = []
for pmid, record in data.items():
    # Index the abstract text; keep bibliographic fields as metadata.
    document = Document(
        text=record["abstract"],
        metadata={
            "pmid": pmid,
            "journal": record["journal"],
            "pubdate": record["pubdate"],
        },
    )
    documents.append(document)
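
The loop above assumes every record carries a non-empty `abstract`, `journal`, and `pubdate`. Whether that holds for this file is not shown here. If it does not, a small filter could run first. The sketch below uses a hypothetical helper, `usable_records`, and toy records rather than the real data.

```python
def usable_records(records, required=("abstract", "journal", "pubdate")):
    """Yield (pmid, record) pairs whose required fields are all present and non-empty."""
    for pmid, record in records.items():
        if all(record.get(field) for field in required):
            yield pmid, record


# Toy records (assumption): one complete, one with a null abstract, one missing the key.
toy_data = {
    "1": {"abstract": "Some text.", "journal": "J", "pubdate": "2021"},
    "2": {"abstract": None, "journal": "J", "pubdate": "2021"},
    "3": {"journal": "J", "pubdate": "2020"},
}
kept = [pmid for pmid, _ in usable_records(toy_data)]
print(kept)
```

Iterating over `usable_records(data)` instead of `data.items()` would then skip incomplete entries rather than raising a `KeyError` or indexing empty text.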
    

In [22]:
documents[0]

Document(id_='c69127ed-b8d0-4ccd-9fff-f406c7b4fefe', embedding=None, metadata={'pmid': '33633159', 'journal': 'Scientific reports', 'pubdate': '2021'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="Epigenetic mechanisms occurring in the brain as well as alterations in the gut microbiome composition might contribute to Alzheimer's disease (AD). Human amyloid precursor protein knock-in (KI) mice contain the Swedish and Iberian mutations (AppNL-F) or those two and also the Arctic mutation (AppNL-G-F). In this study, we assessed whether behavioral and cognitive performance in 6-month-old AppNL-F, AppNL-G-F, and C57BL/6J wild-type (WT) mice was associated with the gut microbiome, and whether the genotype modulates this association. The genotype effects observed in behavioral tests were test-dependent. The biodiversity and composition of the gut microbiome linked to various aspects of mouse behavioral and cognitive performance but differences in geno

In [23]:
nodes = splitter.get_nodes_from_documents(documents)

In [24]:
print(nodes[1].get_content())

These genotype-dependent associations include members of the Lachnospiraceae and Ruminococcaceae families. In a subset of female mice, we assessed DNA methylation in the hippocampus and investigated whether alterations in hippocampal DNA methylation were associated with the gut microbiome. Among other differentially methylated regions, we identified a 1 Kb region that overlapped ing 3'UTR of the Tomm40 gene and the promoter region of the Apoe gene that and was significantly more methylated in the hippocampus of AppNL-G-F than WT mice. The integrated gut microbiome hippocampal DNA methylation analysis revealed a positive relationship between amplicon sequence variants (ASVs) within the Lachnospiraceae family and methylation at the Apoe gene. Hence, these microbes may elicit an impact on AD-relevant behavioral and cognitive performance via epigenetic changes in AD-susceptibility genes in neural tissue or that such changes in the epigenome can elicit alterations in intestinal physiology t

In [25]:
vector_index = VectorStoreIndex(nodes)
query_engine = vector_index.as_query_engine()

In [26]:
response = query_engine.query(
    "how to identify cis-regulatory elements in the genome?"
)

In [27]:
print(response)

The identification of the cis-regulatory elements controlling cell-type specific gene expression patterns is essential for understanding the origin of cellular diversity.


In [28]:
# Show each retrieved chunk with its similarity score; the large source_length avoids truncation.
for n in response.source_nodes:
    display_source_node(n, source_length=20000)

**Node ID:** 32662a18-4cc2-4c98-bd55-09e94fd9ebb2<br>**Similarity:** 0.6555635826416948<br>**Text:** Identification of the cis-regulatory elements controlling cell-type specific gene expression patterns is essential for understanding the origin of cellular diversity. Conventional assays to map regulatory elements via open chromatin analysis of primary tissues is hindered by sample heterogeneity. Single cell analysis of accessible chromatin (scATAC-seq) can overcome this limitation.<br>

**Node ID:** 313eea0e-dd8a-4462-a362-0ed24380cf8b<br>**Similarity:** 0.61874482712326<br>**Text:** Over the past decades, transposable elements (TEs) have been shown to play important roles shaping genome architecture and as major promoters of genetic diversification and evolution of species. Likewise, TE accumulation is tightly linked to heterochromatinization and centromeric dynamics, which can ultimately contribute to speciation. Despite growing efforts to characterize the repeat landscape of species, few studies have focused on mapping the accumulation profiles of TEs on chromosomes.<br>
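
The `Similarity` values above look like cosine similarities between the query embedding and each chunk embedding, and two returned nodes would fit a top-2 retrieval. Both points are assumptions about the default retriever settings, since this notebook does not configure them. The sketch below shows that ranking step on toy 2-D vectors. The helper names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, node_vecs, k=2):
    """Return the k (node_id, score) pairs with the highest cosine similarity."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (assumption): "a" matches the query direction, "c" is orthogonal to it.
toy_nodes = {"a": (1.0, 0.0), "b": (0.6, 0.8), "c": (0.0, 1.0)}
results = top_k((1.0, 0.0), toy_nodes)
for node_id, score in results:
    print(node_id, round(score, 3))
```

A score near 0.62 for the second node, as in the output above, indicates a loosely related chunk, so a similarity cutoff may be worth considering before passing retrieved nodes to the LLM.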