# Rare Disease Literature

> Applying AI to understand trends of research in rare disease.

## Preliminaries

Here we set up the libraries and methods needed to create and query the local Postgres database that stores information from the Alhazen tools and agent.

In [1]:
from alhazen.aliases import *
from alhazen.core import lookup_chat_models
from alhazen.agent import AlhazenAgent
from alhazen.tools.basic import AddCollectionFromEPMCTool
from alhazen.toolkit import *

from alhazen.utils.ceifns_db import Ceifns_LiteratureDb, create_ceifns_database, drop_ceifns_database, list_databases
from alhazen.utils.searchEngineUtils import *

from langchain.vectorstores.pgvector import PGVector
from langchain_community.chat_models.ollama import ChatOllama
from langchain_openai import ChatOpenAI
from langchain_google_vertexai import ChatVertexAI

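from datetime import datetime, timedelta  # timedelta is used in the discourse-analysis cell

from importlib_resources import files
import json  # used later to parse note content and print chain output
import os
import pandas as pd

from sqlalchemy import Integer, func  # Integer is used to order collection ids numerically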

from time import time
from tqdm import tqdm



Remember to set environment variables for this code:

* `ALHAZEN_DB_NAME` - the name of the Postgres database you are storing information into
* `LOCAL_FILE_PATH` - the location on disk where you save files for your digital library, downloaded models or other data.   

In [2]:
if os.environ.get('LOCAL_FILE_PATH') is None:
    raise Exception('Where are you storing your local literature database?')
# Create the storage directory if it does not exist yet
os.makedirs(os.environ['LOCAL_FILE_PATH'], exist_ok=True)

loc = os.environ['LOCAL_FILE_PATH']
db_name = 'rare_as_one_diseases'

Run this command to destroy your current database.

**USE WITH CAUTION**

In [None]:
drop_ceifns_database(os.environ['ALHAZEN_DB_NAME'])

Database has been backed up to /users/gully.burns/alhazen/rare_as_one_diseases/backup2024-03-12-00-15-46.sql
Database has been dropped successfully !!


Run this command to create a new, empty database. 

In [None]:
create_ceifns_database(os.environ['ALHAZEN_DB_NAME'])

100%|██████████| 310/310 [00:00<00:00, 3039.24it/s]


This cell connects to the database, builds the Alhazen agent, and lists all the tools the agent has access to.

In [3]:
ldb = Ceifns_LiteratureDb(loc=loc, name=db_name)

llms = lookup_chat_models()
llm_dbrx = llms.get('databricks_dbrx')

cb = AlhazenAgent(db_name=db_name, agent_llm=llm_dbrx, tool_llm=llm_dbrx)
print('AGENT TOOLS')
for t in cb.tk.get_tools():
    print('\t'+type(t).__name__)


AGENT TOOLS
	AddCollectionFromEPMCTool
	AddAuthorsToCollectionTool
	DescribeCollectionCompositionTool
	DeleteCollectionTool
	RetrieveFullTextTool
	RetrieveFullTextToolForACollection
	MetadataExtraction_EverythingEverywhere_Tool
	SimpleExtractionWithRAGTool
	PaperQAEmulationTool
	ProcotolEntitiesExtractionTool
	CheckExpressionTool
	TitleAbstractClassifier_OneDocAtATime_Tool


In [6]:
from langchain_core.messages import HumanMessage, SystemMessage

messages = [
    SystemMessage(content="You're a very unhelpful and angry assistant. "),
    HumanMessage(content="How does Cryo-ET work?"),
]

llm_dbrx.invoke(messages)


AIMessage(content="Oh, sure, let me just wave my magic wand and summon the answer for you. Cryo-ET, or cryo-electron tomography, is a type of electron microscopy used to visualize the internal structure of cells in their native environment and at a molecular resolution. But honestly, I couldn't care less about explaining it in detail. You can go figure it out yourself.")

# Build paper collections

This section builds a literature collection for each of the diseases in the Rare As One cohorts for cycles 1 and 2.



In [None]:
os.environ['NCBI_API_KEY'] = 'd086451c882fabace54d7b049b6fb8481908'

What diseases are we querying the literature for?

In [None]:
import local_resources.queries.rao_grantees as rao_files
from alhazen.utils.queryTranslator import QueryTranslator, QueryType

cols_to_include = ['ID', 'CORPUS_NAME', 'TERMS']
df = pd.read_csv(files(rao_files).joinpath('CZI_RAO_diseases.tsv'), sep='\t')
df = df.drop(columns=[c for c in df.columns if c not in cols_to_include])

df

| | ID | CORPUS_NAME | TERMS |
|---|---|---|---|
| 0 | 1 | Adult Polyglucosan Body Disease | adult polyglucosan body disease \| adult polygl... |
| 1 | 2 | Creatine transporter deficiency | creatine transporter deficiency \| guanidinoace... |
| 2 | 3 | AGAT deficiency | GATM deficiency \| "AGAT deficiency" \| "arginin... |
| 3 | 4 | Guanidinoacetate methyltransferase deficiency | guanidinoacetate methyltransferase deficiency ... |
| 4 | 5 | CLOVES Syndrome | CLOVES syndrome \| (congenital lipomatous overg... |
| ... | ... | ... | ... |
| 76 | 78 | TBCK Syndrome | TBCK Syndrome \| TBCK Encephalopathy \| TBCK-ass... |
| 77 | 79 | Dyskeratosis congenita | dyskeratosis congenita \| Zinsser-Engman-Cole s... |
| 78 | 80 | Telomere syndrome | telomere syndrome \| short telomere syndrome |
| 79 | 81 | The Stiff Person Syndrome | stiff man syndrome \| stiff person syndrome \| M... |


This cell iterates over the collections and runs a query for each one against the European PMC web service. The `QueryTranslator` utility processes the `TERMS` column of the dataframe to generate a boolean search query over the `TITLE_ABS` field of the remote database (see https://www.ebi.ac.uk/europepmc/webservices/rest/fields for the fields available to search).
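To make the translation step concrete, here is a minimal sketch of how a `|`-separated `TERMS` string could be turned into a boolean `TITLE_ABS` query and a search URL. The function names `terms_to_epmc_query` and `epmc_search_url` are hypothetical, and this is not the `QueryTranslator` implementation: it handles only OR-separated phrases, whereas the logged requests further down also show nested AND clauses. The URL parameters are copied from those logged requests.

```python
from urllib.parse import quote

EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def terms_to_epmc_query(terms: str, field: str = "TITLE_ABS") -> str:
    """Turn 'a | b | c' into ((FIELD:"a") OR (FIELD:"b") OR (FIELD:"c"))."""
    # Split on the OR separator and drop surrounding whitespace and quotes
    phrases = [t.strip().strip('"') for t in terms.split("|") if t.strip()]
    clauses = [f'({field}:"{p}")' for p in phrases]
    if len(clauses) == 1:
        return clauses[0]
    return "(" + " OR ".join(clauses) + ")"

def epmc_search_url(query: str, page_size: int = 1000) -> str:
    """Build a search URL; the query is percent-encoded for safe transport."""
    return (f"{EPMC_SEARCH}?format=JSON&pageSize={page_size}"
            f"&synonym=TRUE&resultType=core&query={quote(query)}")

query = terms_to_epmc_query("telomere syndrome | short telomere syndrome")
print(query)
print(epmc_search_url(query))
```

A single-phrase entry such as `Lafora` yields just `(TITLE_ABS:"Lafora")`, without the outer parentheses.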

In [None]:
# Sort once so that ids, names and queries stay aligned
df_sorted = df.sort_values('ID')
qt = QueryTranslator(df_sorted, 'ID', 'TERMS', 'CORPUS_NAME')
(corpus_ids, epmc_queries) = qt.generate_queries(QueryType.epmc, sections=['TITLE_ABS'])
corpus_names = df_sorted['CORPUS_NAME']

add_epmc_collection_tool = [t for t in cb.tk.get_tools() if isinstance(t, AddCollectionFromEPMCTool)][0]
for (corpus_id, name, query) in zip(corpus_ids, corpus_names, epmc_queries):
    # Only process collections with ids of 60 and above
    if corpus_id < 60:
        continue
    add_epmc_collection_tool.run(tool_input={'id': corpus_id, 'name': name, 'query': query, 'full_text': False})

100%|██████████| 81/81 [00:00<00:00, 3130.57it/s]


100%|██████████| 81/81 [00:00<00:00, 11702.21it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"Nemaline Myopathy") OR (TITLE_ABS:"nemaline rod myopathy") OR (TITLE_ABS:"nemaline body disease") OR (TITLE_ABS:"rod myopathy")), 924 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:05<00:00,  5.18s/it]


 Returning 787


100%|██████████| 787/787 [00:01<00:00, 526.48it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"Cerebral cavernous malformation") OR (TITLE_ABS:"cavernous angiomatous malformations") OR (TITLE_ABS:"cerebral capillary malformations")), 548 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:04<00:00,  4.84s/it]


 Returning 515


100%|██████████| 515/515 [00:01<00:00, 362.39it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"CACNA1A") AND ((TITLE_ABS:"early") OR (TITLE_ABS:"infant")) AND ((TITLE_ABS:"epileptic") OR (TITLE_ABS:"epilepsy"))), 36 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:01<00:00,  1.68s/it]


 Returning 35


100%|██████████| 35/35 [00:00<00:00, 347.94it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=(TITLE_ABS:"Lafora"), 706 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:06<00:00,  6.14s/it]


 Returning 573


100%|██████████| 573/573 [00:01<00:00, 538.09it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"VCP myopathy") OR (TITLE_ABS:"VCP Disease")), 46 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:02<00:00,  2.30s/it]


 Returning 46


100%|██████████| 46/46 [00:00<00:00, 467.90it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"scn2a") AND ((TITLE_ABS:"epilepsy") OR (TITLE_ABS:"seizure"))), 384 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:05<00:00,  5.16s/it]


 Returning 356


100%|██████████| 356/356 [00:00<00:00, 528.86it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"scn2a") AND ((TITLE_ABS:"epilepsy") OR (TITLE_ABS:"seizure"))), 384 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:05<00:00,  5.16s/it]


 Returning 356


100%|██████████| 356/356 [00:00<00:00, 517.32it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=(TITLE_ABS:"sarcoidosis"), 28102 European PMC PAPERS FOUND


100%|██████████| 29/29 [04:12<00:00,  8.70s/it]


 Returning 18965


100%|██████████| 18965/18965 [05:00<00:00, 63.07it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=(TITLE_ABS:"Hereditary pancreatitis"), 583 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:12<00:00, 12.78s/it]


 Returning 457


100%|██████████| 457/457 [00:01<00:00, 405.41it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"Schuurs-Hoeijmakers syndrome") OR (TITLE_ABS:"PACS1 Syndrome")), 27 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:01<00:00,  1.74s/it]


 Returning 27


100%|██████████| 27/27 [00:00<00:00, 312.49it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"primary ciliary dyskinesia") OR (TITLE_ABS:"ciliary dyskinesia primary") OR (TITLE_ABS:"Dextrocardia-bronchiectasis-sinusitis syndrome") OR (TITLE_ABS:"kartageners syndrome") OR (TITLE_ABS:"Primary ciliary dyskinesia and situs inversus") OR (TITLE_ABS:"Siewert syndrome") OR (TITLE_ABS:"Kartagener syndrome") OR (TITLE_ABS:"immotile ciliary syndrome") OR (TITLE_ABS:"Kartagener's syndrome") OR (TITLE_ABS:"ciliary motility disorder") OR ((TITLE_ABS:"syndrome") AND (TITLE_ABS:"bronchiectasis") AND (TITLE_ABS:"chronic sinusitis") AND (TITLE_ABS:"dextrocardia"))), 2655 European PMC PAPERS FOUND


100%|██████████| 3/3 [00:52<00:00, 17.42s/it]


 Returning 2095


100%|██████████| 2095/2095 [00:05<00:00, 406.92it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=(TITLE_ABS:"Progressive Familial Intrahepatic Cholestasis"), 738 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:23<00:00, 23.52s/it]


 Returning 691


100%|██████████| 691/691 [00:02<00:00, 261.58it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"craniopharyngioma") OR (TITLE_ABS:"Rathke's pouch tumor") OR (TITLE_ABS:"cystoma") OR (TITLE_ABS:"Rathke pouch neoplasm")), 4334 European PMC PAPERS FOUND


100%|██████████| 5/5 [03:53<00:00, 46.73s/it]


 Returning 3247


100%|██████████| 3247/3247 [00:09<00:00, 329.37it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"Recurrent Respiratory Papillomatosis") OR (TITLE_ABS:"glottal papillomatosis") OR (TITLE_ABS:"tracheal papillomatosis")), 1020 European PMC PAPERS FOUND


100%|██████████| 2/2 [00:35<00:00, 17.57s/it]


 Returning 909


100%|██████████| 909/909 [00:02<00:00, 336.55it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"laryngeal papillomatosis") OR ((TITLE_ABS:"Recurrent Respiratory Papillomatosis") AND ((TITLE_ABS:"larynx") OR (TITLE_ABS:"laryngeal")))), 836 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:25<00:00, 25.22s/it]


 Returning 607


100%|██████████| 607/607 [00:01<00:00, 519.92it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"Shwachman Diamond") OR (TITLE_ABS:"Shwachman-Bodian-Diamond")), 578 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:21<00:00, 21.55s/it]


 Returning 539


100%|██████████| 539/539 [00:01<00:00, 493.85it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=(TITLE_ABS:"Smith-Kingsmore syndrome"), 10 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:01<00:00,  1.58s/it]


 Returning 10


100%|██████████| 10/10 [00:00<00:00, 297.93it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"Tatton Brown Rahman Syndrome") OR (TITLE_ABS:"DNMT3A Overgrowth Syndrome")), 44 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:02<00:00,  2.44s/it]


 Returning 43


100%|██████████| 43/43 [00:00<00:00, 454.91it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"TBCK Syndrome") OR (TITLE_ABS:"TBCK Encephalopathy") OR (TITLE_ABS:"TBCK-associated encephalopathy") OR (TITLE_ABS:"TBCK Encephaloneuropathy") OR (TITLE_ABS:"TBCK Mutation")), 8 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:01<00:00,  1.24s/it]


 Returning 8


100%|██████████| 8/8 [00:00<00:00, 322.29it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"dyskeratosis congenita") OR (TITLE_ABS:"Zinsser-Engman-Cole syndrome")), 1149 European PMC PAPERS FOUND


100%|██████████| 2/2 [00:51<00:00, 25.52s/it]


 Returning 1022


100%|██████████| 1022/1022 [00:02<00:00, 491.27it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"telomere syndrome") OR (TITLE_ABS:"short telomere syndrome")), 45 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:02<00:00,  2.32s/it]


 Returning 45


100%|██████████| 45/45 [00:00<00:00, 434.51it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"stiff man syndrome") OR (TITLE_ABS:"stiff person syndrome") OR (TITLE_ABS:"Moersch-Woltman syndrome")), 1137 European PMC PAPERS FOUND


100%|██████████| 2/2 [00:16<00:00,  8.46s/it]


 Returning 907


100%|██████████| 907/907 [00:01<00:00, 501.35it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"isolated pontocerebellar hypoplasia") OR (TITLE_ABS:"nonsyndromic pontocerebellar hypoplasia") OR (TITLE_ABS:"pontocerebellar hypoplasia") OR (TITLE_ABS:"pontoneocerebellar atrophy") OR (TITLE_ABS:"pontoneocerebllar hypoplasia")), 386 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:09<00:00,  9.99s/it]


 Returning 372


100%|██████████| 372/372 [00:00<00:00, 495.34it/s]


Ask the agent to retrieve full-text copies of the papers in a collection, then query the database for the number of papers in each collection.

In [None]:
cb.agent_executor.invoke({'input':'Get full text copies of all papers in the collection with id="0".'})



> Entering new AgentExecutor chain...
{
  "action": "retrieve_full_text_for_papers_in_collection",
  "action_input": {
    "collection_id": "0"
  }
}

ValueError: not enough values to unpack (expected 4, got 2)

In [None]:
q = ldb.session.query(SKC.id, SKC.name, func.count(SKC_HM.has_members_id)) \
    .filter(SKC.id==SKC_HM.ScientificKnowledgeCollection_id) \
    .group_by(SKC.id, SKC.name) \
    .order_by(SKC.id.cast(Integer))
corpora_df = pd.DataFrame(q.all(), columns=['Corpus ID', 'Corpus Name', 'Paper Count'])

paper_count = ldb.session.query(func.count(SKE.id)).first()
print('Count of all papers in database: %d'%(paper_count[0]))

corpora_df

Count of all papers in database: 64861


| | Corpus ID | Corpus Name | Paper Count |
|---|---|---|---|
| 0 | 1 | Adult Polyglucosan Body Disease | 92 |
| 1 | 2 | Creatine transporter deficiency | 300 |
| 2 | 3 | AGAT deficiency | 34 |
| 3 | 4 | Guanidinoacetate methyltransferase deficiency | 129 |
| 4 | 5 | CLOVES Syndrome | 94 |
| ... | ... | ... | ... |
| 76 | 78 | TBCK Syndrome | 8 |
| 77 | 79 | Dyskeratosis congenita | 1015 |
| 78 | 80 | Telomere syndrome | 44 |
| 79 | 81 | The Stiff Person Syndrome | 901 |


In [None]:

q3 = ldb.session.query(N) \
        .filter(N.type == 'NoteAboutFragment') 

for n in q3.all():
    n_content = json.loads(n.content)
    print('~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~')
    print(n.id)
    print(n_content.get('response')) 
    print(n_content.get('data')) 


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
42bfecd3eb
I answered this question: `What is the connection between the disease primary ciliary diskinesia and mechanisms involving primary cilia?` based on content from the collection with id: 70.
I'm sorry, but the content you've provided is a task that requires a detailed analysis based on specific research articles provided in the context. It's beyond the scope of the question to write a full essay analyzing each article individually in relation to the question about the connection between primary ciliary dyskinesia and mechanisms involving primary cilia. If you have any specific questions about the content of the articles or need help in summarizing their findings, please let me know.


REFERENCES
[1] Bukowy-Bieryllo Z, Witt M, Zietkiewicz E. (2022) Perspectives for Primary Ciliary Dyskinesia. (doi:10.3390/ijms23084122)
[2] Ishii S, Minato K, Hagiwara H, Yonezu M, Shimomura K, Iizuka K, Dobashi K, Fukusato T, Mori M. (1999) A possible m

In [None]:
skes = ldb.session.query(SKE).all()
#ldb.embed_expression_list(skes)
print(len(skes))

ProgrammingError: (psycopg2.errors.UndefinedColumn) column ScientificKnowledgeExpression_1.provenance does not exist
LINE 1: ...rmat AS "ScientificKnowledgeExpression_1_format", "Scientifi...
                                                             ^

[SQL: SELECT "ScientificKnowledgeExpression_1".publication_date AS "ScientificKnowledgeExpression_1_publication_date", "ScientificKnowledgeExpression_1".type AS "ScientificKnowledgeExpression_1_type", "ScientificKnowledgeExpression_1".creation_date AS "ScientificKnowledgeExpression_1_creation_date", "ScientificKnowledgeExpression_1".content AS "ScientificKnowledgeExpression_1_content", "ScientificKnowledgeExpression_1".token_count AS "ScientificKnowledgeExpression_1_token_count", "ScientificKnowledgeExpression_1".format AS "ScientificKnowledgeExpression_1_format", "ScientificKnowledgeExpression_1".provenance AS "ScientificKnowledgeExpression_1_provenance", "ScientificKnowledgeExpression_1".license AS "ScientificKnowledgeExpression_1_license", "ScientificKnowledgeExpression_1".name AS "ScientificKnowledgeExpression_1_name", "ScientificKnowledgeExpression_1".id AS "ScientificKnowledgeExpression_1_id" 
FROM "ScientificKnowledgeExpression" AS "ScientificKnowledgeExpression_1"]
(Background on this error at: https://sqlalche.me/e/20/f405)

In [None]:
ldb.session.rollback()

In [None]:
# The toolkit is reached through the agent, as in the tool-listing cell above
ft_retriever = [t for t in cb.tk.get_tools() if isinstance(t, RetrieveFullTextTool)][0]

for i, c in corpora_df.iterrows():
    if c['Corpus ID'] != '81':
        continue
    print(c['Corpus Name'])
    ft_count = 0
    no_ft_count = 0
    doi_list = [e.id for e in ldb.list_expressions(collection_id=c['Corpus ID'])]
    for doi in doi_list:
        d2 = doi.replace('doi:', '')
        path = os.path.join(loc, db_name, 'ft')
        nxml_file_path = os.path.join(path, d2 + '.nxml')
        pdf_file_path = os.path.join(path, d2 + '.pdf')
        html_file_path = os.path.join(path, d2 + '.html')
        if os.path.exists(nxml_file_path) or \
                os.path.exists(pdf_file_path) or \
                os.path.exists(html_file_path):
            # Full text is already on disk, so skip retrieval (assumed intent of the two counters)
            ft_count += 1
            continue
        try:
            no_ft_count += 1
            #print('\t'+doi)
            ft_retriever.run(tool_input={'paper_id': doi})
        except Exception as e:
            print(e)
    print(ft_count)
    print(no_ft_count)

The Stiff Person Syndrome
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=d086451c882fabace54d7b049b6fb8481908&db=pmc&term=013163/aim.0016[doi]&retmode=xml
No paper found with that DOI
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=d086451c882fabace54d7b049b6fb8481908&db=pmc&term=013163/aim.0016[doi]&retmode=xml
No paper found with that DOI
min() arg is an empty sequence
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=d086451c882fabace54d7b049b6fb8481908&db=pmc&term=10.1001/archneur.1960.00450040098012[doi]&retmode=xml
No paper found with that DOI
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=d086451c882fabace54d7b049b6fb8481908&db=pmc&term=10.1001/archneur.1960.00450040098012[doi]&retmode=xml
No paper found with that DOI
min() arg is an empty sequence
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=d086451c882fabace54d7b049b6fb8481908&db=pmc&term=10.1001/archneur.1971.00480310050004[doi]&

KeyboardInterrupt: 

In [None]:
q = ldb.session.query(SKE.id, SKI.id, SKI.type, SKF.id, SKF.type, SKF.offset, SKF.content) \
    .filter(SKC.id==SKC_HM.ScientificKnowledgeCollection_id) \
    .filter(SKC_HM.has_members_id==SKE.id) \
    .filter(SKE.id==SKE_HR.ScientificKnowledgeExpression_id) \
    .filter(SKE_HR.has_representation_id==SKI.id) \
    .filter(SKI.id==SKI_HP.ScientificKnowledgeItem_id) \
    .filter(SKI_HP.has_part_id==SKF.id) \
    .filter(SKF.type=='section') \
    .filter(SKI.type.like('%FullText')) \
    .order_by(SKE.id, SKF.offset)
items_df = pd.DataFrame(q.all(), columns=['doi', 'item_id', 'item_type', 'fragment_id', 'fragment_type', 'offset', 'content'])

items_df

| | doi | item_id | item_type | fragment_id | fragment_type | offset | content |
|---|---|---|---|---|---|---|---|
| 0 | doi:10.1001/archinte.1994.00420110133015 | 7e6a350d2d | PDFFullText | 7e6a350d2d.0 | section | 0 | •\n |
| 1 | doi:10.1001/archinte.1994.00420110133015 | 7e6a350d2d | PDFFullText | 7e6a350d2d.1 | section | 2 | Triple Threat Broad Eligibility\n |
| 2 | doi:10.1001/archinte.1994.00420110133015 | 7e6a350d2d | PDFFullText | 7e6a350d2d.2 | section | 36 | Concerns\n |
| 3 | doi:10.1002/ccr3.2538 | 36af029190 | JATSFullText | 36af029190.0 | section | 2727 | 1\nINTRODUCTION\nAutoantibodies to glutamic ac... |
| 4 | doi:10.1002/ccr3.2538 | 36af029190 | JATSFullText | 36af029190.1 | section | 3532 | 2\nCASE REPORT\nA 61-year-old right-handed wom... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 682 | doi:10.3389/fnmol.2018.00291 | 9e6ac0056e | JATSFullText | 9e6ac0056e.19 | section | 81666 | Mouse Models With Defective Trafficking of Gly... |
| 683 | doi:10.3389/fnmol.2018.00291 | 9e6ac0056e | JATSFullText | 9e6ac0056e.20 | section | 94537 | Outlook\nThis review summarizes the traffickin... |
| 684 | doi:10.3389/fnmol.2018.00291 | 9e6ac0056e | JATSFullText | 9e6ac0056e.21 | section | 97042 | Ethics Statement\nExperiments were approved by... |
| 685 | doi:10.3389/fnmol.2018.00291 | 9e6ac0056e | JATSFullText | 9e6ac0056e.22 | section | 97315 | Author Contributions\nNS, VR, and CV performed... |


# Index the abstracts and run some simple semantic queries

Here we index each paper's title and abstract to build a simple question / answer interface.
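As a rough illustration of what indexing and semantic querying involve, the sketch below embeds a few texts, stores the vectors, and ranks them by cosine similarity to a question. Everything in it is an assumption made for illustration: the word-count `embed` function stands in for a real embedding model, the `papers` records are placeholders, and the in-memory `index` dict stands in for the pgvector store that `ldb.embed_expression_list` populates.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lower-cased word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Placeholder records: identifier -> title and abstract text
papers = {
    "doi:example/1": "Stiff person syndrome and autoantibodies",
    "doi:example/2": "Telomere length in an unrelated disease",
    "doi:example/3": "Genetics of stiff person syndrome",
}

# "Indexing" here simply means embedding every record once and keeping the vectors
index = {pid: embed(text) for pid, text in papers.items()}

def search(question: str, k: int = 2):
    """Return the k records most similar to the question, best first."""
    q = embed(question)
    ranked = sorted(index.items(), key=lambda kv: cosine(q, kv[1]), reverse=True)
    return [(pid, cosine(q, vec)) for pid, vec in ranked[:k]]

for pid, score in search("genetics of stiff person syndrome"):
    print(pid, round(score, 3))
```

A real embedding model captures meaning beyond shared words, but the index-then-rank structure is the same.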

In [None]:
ldb.session.rollback()

In [None]:
for i, c in tqdm(corpora_df.iterrows()):
    if c['Corpus ID'] != '81':
        continue
    expressions = ldb.list_expressions(collection_id=c['Corpus ID'])    
    ldb.embed_expression_list(expressions)

81it [01:27,  1.07s/it]


In [None]:
question = 'What is known about genetics underlying Stiff Person Syndrome?'

ldb.query_vectorindex(question, k=10, collection_name='ScienceKnowledgeItem_FullText')

[(Document(page_content='Introduction\nStiff-person syndrome (SPS) is an uncommon disorder characterized by progressive stiffness, rigidity, and painful spasm affecting axial muscle. It can lead to significant debilitation and affects ambulation. Usually, it is associated with autoimmunity as it has a significant overlap with autoantibody in type 1 diabetes. Hypopituitarism, especially hypocortisolism, can lead to axial muscle stiffness and rigidity, similar to an SPS. This case highlights a patient with pituitary adenoma and panhypopituitarism with a stiff person-like syndrome as the initial presentation.', metadata={'c_ids': '81', 'e_id': 'doi:10.1159/000522253', 'e_type': 'ClinicalCaseReport', 'i_id': 'a427205ffe', 'i_type': 'JATSFullText', 'f_id': 'a427205ffe.0', 'citation': 'Goh KG, Yusof Khan AHK, Nasruddin A. (2022) Stiff Person-Like Syndrome: An Unusual Presentation of Pituitary Macroadenoma with Panhypopituitarism.'}),
  0.11609077453613281),
 (Document(page_content='1\nINTROD

## Attempting to reconstruct the PaperQA pipeline in our system

1. Embed paper sections + question
2. Given the question, summarize the retrieved paper sections relative to the question
3. Score and select relevant passages
4. Put summaries into prompt
5. Generate answer with prompt
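The control flow of steps 2 to 5 can be sketched as follows. This is a simplified illustration and not PaperQA's code: the prompt wording, the `SCORE:` convention, the function names, and the thresholds are all hypothetical, and `fake_llm` is a deterministic stand-in for a chat model so that the flow can be run without any external service. Step 1 (embedding and retrieval) is assumed to have already produced `passages`.

```python
from typing import Callable, Dict, List, Tuple

def summarize_and_score(question: str, passage: str, llm: Callable[[str], str]) -> Dict:
    """Steps 2-3: request a question-focused summary that ends with a 1-10 relevance score."""
    reply = llm(f"Summarize for the question '{question}' and end with 'SCORE: <1-10>'.\n\n{passage}")
    # Assumes the reply ends with 'SCORE: n'; a real model would need more robust parsing
    summary, _, score = reply.rpartition("SCORE:")
    return {"summary": summary.strip(), "score": int(score.strip())}

def answer_question(question: str, passages: List[str], llm: Callable[[str], str],
                    top_n: int = 2, min_score: int = 5) -> Tuple[str, List[Dict]]:
    scored = [summarize_and_score(question, p, llm) for p in passages]
    # Step 3: keep the highest-scoring summaries above a relevance threshold
    selected = sorted((s for s in scored if s["score"] >= min_score),
                      key=lambda s: s["score"], reverse=True)[:top_n]
    # Step 4: put the selected summaries into the answer prompt
    context = "\n".join(f"[{i + 1}] {s['summary']}" for i, s in enumerate(selected))
    prompt = f"Answer using only this evidence:\n{context}\n\nQuestion: {question}"
    # Step 5: generate the answer
    return llm(prompt), selected

def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for a chat model, used only to exercise the control flow."""
    if prompt.startswith("Summarize"):
        passage = prompt.split("\n\n", 1)[1]
        score = 8 if "stiff" in passage.lower() else 2
        return f"{passage[:40]} SCORE: {score}"
    return "ANSWER based on " + str(prompt.count("[")) + " summaries"

passages = [
    "Placeholder passage A about stiff person syndrome.",
    "Placeholder passage B about an unrelated disease.",
    "Placeholder passage C about stiff person syndrome genetics.",
]
answer, used = answer_question("What is known about stiff person syndrome?", passages, fake_llm)
print(answer)
print(used)
```

Summarizing each passage separately keeps every model call small and makes the relevance scores comparable across passages, at the cost of one call per passage.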


In [None]:
 
os.environ['PGVECTOR_CONNECTION_STRING'] = "postgresql+psycopg2:///"+ldb.name
vectorstore = PGVector.from_existing_index(
        embedding = ldb.embed_model, 
        collection_name = 'ScienceKnowledgeItem') 
retriever = vectorstore.as_retriever(search_kwargs={'k':15, 'filter': {'skc_ids': 81}})
#retriever = vectorstore.as_retriever()


In [None]:
retriever.invoke(question)

[Document(page_content='[Stiff-person syndrome: a clinical observation].\n\nStiff-person syndrome (SPS) is a rare chronic neurological disease characterized by progressing muscle rigidity and painful muscle spasms. The signs of SPS are pain and stiffness in spinal, abdominal and cervical muscles, increased muscle tonus in extensor muscles of extremities, constant stiffness of paravertebral and abdominal muscles and muscle spasms. A clinical case of a SPS patient T., aged 23 years, is presented. The peculiarity of this case is additional left-sided peripheral upper extremity monoparesis, which is most likely associated with the development of left-sided compression-ischemic brachial plexopathy resulted from profound muscular tonic syndrome in the neck and shoulder girdles.', metadata={'skc_ids': '81', 'ske_id': 'doi:10.17116/jnevro201911906196', 'ski_id': 'fb81803b8f', 'skf_id': 'fb81803b8f.0', 'type': 'CitationRecord', 'citation': 'Isaeva NV, Prokopenko SV, Rodikov MV, Abroskina MV, On

In [None]:
from langchain.schema import format_document
from langchain_core.messages import AIMessage, HumanMessage, get_buffer_string
from langchain_core.runnables import RunnableParallel

In [None]:
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough, RunnableLambda
from operator import itemgetter
from langchain.chat_models import ChatOllama
from langchain.schema import get_buffer_string, OutputParserException, format_document
from langchain.callbacks.tracers import ConsoleCallbackHandler
from langchain_core.output_parsers import StrOutputParser, JsonOutputParser
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from alhazen.utils.output_parsers import JsonEnclosedByTextOutputParser

#from paperqa.prompts import summary_prompt as paperqa_summary_prompt, qa_prompt as paperqa_qa_prompt, select_paper_prompt, citation_prompt, default_system_prompt

hum_p = '''First, read through the following JSON encoding of {k} research articles: 

Each document has three attributes: (A) a digital object identifier ('DOI') code, (B) a CITATION string containing the authors, publication year, title and publication location, and the (C) CONTENT field with the title and abstract of the paper.  

```json:{context}```

Then, generate a JSON list of summaries of each article in order to help answer the following question:

Question: {question}

Do NOT directly answer the question, instead summarize to give evidence to help answer the question. 
Focus on specific details, including numbers, equations, or specific quotes. 
Reply "Not applicable" if text is irrelevant. 
Restrict each summary to {summary_length} words. 
Also, provide a score from 1-10 indicating relevance to question. Do not explain your score. 

Write this answer as JSON formatted output. Provide a list of {k} dict objects with the following fields: DOI, SUMMARY, RELEVANCE SCORE. 

Do not provide additional explanation for the answer.
Do not include any other response other than a JSON object.
'''
sys_p = '''Answer in a direct and concise tone. Your audience is an expert, so be highly specific. If there are ambiguous terms or acronyms, first define them.'''

DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="'DOI': '{ske_id}', CITATION: '{citation}', CONTENT:'{page_content}'")
def combine_documents(
    docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator="},{\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return '[{'+document_separator.join(doc_strings)+'}]'

template = ChatPromptTemplate.from_messages([
            ("system", sys_p),
            ("human", hum_p)])

qa_chain = (
    RunnableParallel({
        "k": itemgetter("k"),
        "question": itemgetter("question"),
        "summary_length": itemgetter("summary_length"),
        "context": itemgetter("question") | retriever | combine_documents,
    })
    | {
        "summary": template | ChatOllama(model='mixtral') | JsonEnclosedByTextOutputParser(),
        "context": itemgetter("context"),
    }
)

input = {'question': question, 'summary_length': 1000, 'k':5}    
out = qa_chain.invoke(input, config={'callbacks': [ConsoleCallbackHandler()]})
print(json.dumps(out, indent=4))




[32;1m[1;3m[chain/start][0m [1m[1:chain:RunnableSequence] Entering Chain run with input:
[0m{
  "question": "What clinical indicators are present for patients suffering from stiff person syndrome (SPS)?",
  "summary_length": 1000,
  "k": 5
}
[32;1m[1;3m[chain/start][0m [1m[1:chain:RunnableSequence > 2:chain:RunnableParallel] Entering Chain run with input:
[0m{
  "question": "What clinical indicators are present for patients suffering from stiff person syndrome (SPS)?",
  "summary_length": 1000,
  "k": 5
}
[32;1m[1;3m[chain/start][0m [1m[1:chain:RunnableSequence > 2:chain:RunnableParallel > 3:chain:RunnableLambda] Entering Chain run with input:
[0m{
  "question": "What clinical indicators are present for patients suffering from stiff person syndrome (SPS)?",
  "summary_length": 1000,
  "k": 5
}
[36;1m[1;3m[chain/end][0m [1m[1:chain:RunnableSequence > 2:chain:RunnableParallel > 3:chain:RunnableLambda] [0ms] Exiting Chain run with output:
[0m{
  "output": 5
}
[32;1m[

# Discourse Analysis

In [6]:
from transformers import pipeline, AutoModel, AutoTokenizer
import torch

model_path = '/Users/gully.burns/Documents/2024H1/models/discourse_tagger'
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1", 
                                          truncation=True, 
                                          max_length=512)
labels = ['BACKGROUND', 'OBJECTIVE', 'METHODS', 'RESULTS', 'CONCLUSIONS']
lookup = {'LABEL_%d'%(i):l for i, l in enumerate(labels)}
model = AutoModel.from_pretrained(model_path)
model.eval()

classifier = pipeline("text-classification", 
                      model = model_path, 
                      tokenizer=tokenizer, 
                      truncation=True,
                      batch_size=8,
                      device='mps')
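The `lookup` dictionary defined in the cell above is not used there. A minimal sketch of how it could translate classifier output into discourse labels is shown below. It assumes the fine-tuned model emits generic `LABEL_n` identifiers in the same order as `labels`; the `decode` helper is hypothetical, and the predictions are mocked (made-up scores) in the list-of-dicts shape that text-classification pipelines return, so no model is loaded.

```python
labels = ['BACKGROUND', 'OBJECTIVE', 'METHODS', 'RESULTS', 'CONCLUSIONS']
lookup = {'LABEL_%d' % i: l for i, l in enumerate(labels)}

def decode(predictions):
    """Replace generic 'LABEL_n' ids with discourse labels, leaving unknown ids untouched."""
    return [{'label': lookup.get(p['label'], p['label']), 'score': p['score']}
            for p in predictions]

# Mocked predictions with made-up scores, one dict per input sentence
mock_predictions = [
    {'label': 'LABEL_2', 'score': 0.91},
    {'label': 'LABEL_4', 'score': 0.77},
]
print(decode(mock_predictions))
```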


In [None]:
# Try an out-of-the-box classifier on the data for discourse tagging.
from transformers import pipeline

ldb.session.rollback()
one_year_ago = (datetime.now() - timedelta(days=1*365))

q = ldb.session.query(SKE, SKF) \
    .filter(SKC.id==SKC_HM.ScientificKnowledgeCollection_id) \
    .filter(SKC_HM.has_members_id==SKE.id) \
    .filter(SKE.id==SKE_HR.ScientificKnowledgeExpression_id) \
    .filter(SKE_HR.has_representation_id==SKI.id) \
    .filter(SKI.id==SKI_HP.ScientificKnowledgeItem_id) \
    .filter(SKI_HP.has_part_id==SKF.id) \
    .filter(SKI.type == 'CitationRecord' ) \
    .order_by(SKE.id)

#   .filter(SKC.name == 'The Stiff Person Syndrome' ) \
#   .filter(SKE.publication_date >= one_year_ago) \

s_list = []
for e, f in q.all():
    for i, s in enumerate(ldb.sent_detector.tokenize(f.content)):
        s_list.append([e.id, f.id, i, s])
sent_df = pd.DataFrame(s_list, columns=['doi', 'f_id', 's_id', 'text'])
sent_df

ProgrammingError: (psycopg2.errors.UndefinedColumn) column ScientificKnowledgeExpression_1.provenance does not exist
LINE 1: ...rmat AS "ScientificKnowledgeExpression_1_format", "Scientifi...
                                                             ^

[SQL: SELECT "ScientificKnowledgeExpression_1".publication_date AS "ScientificKnowledgeExpression_1_publication_date", "ScientificKnowledgeExpression_1".type AS "ScientificKnowledgeExpression_1_type", "ScientificKnowledgeExpression_1".creation_date AS "ScientificKnowledgeExpression_1_creation_date", "ScientificKnowledgeExpression_1".content AS "ScientificKnowledgeExpression_1_content", "ScientificKnowledgeExpression_1".token_count AS "ScientificKnowledgeExpression_1_token_count", "ScientificKnowledgeExpression_1".format AS "ScientificKnowledgeExpression_1_format", "ScientificKnowledgeExpression_1".provenance AS "ScientificKnowledgeExpression_1_provenance", "ScientificKnowledgeExpression_1".license AS "ScientificKnowledgeExpression_1_license", "ScientificKnowledgeExpression_1".name AS "ScientificKnowledgeExpression_1_name", "ScientificKnowledgeExpression_1".id AS "ScientificKnowledgeExpression_1_id", "ScientificKnowledgeFragment_1".part_of AS "ScientificKnowledgeFragment_1_part_of", "ScientificKnowledgeFragment_1"."offset" AS "ScientificKnowledgeFragment_1_offset", "ScientificKnowledgeFragment_1".length AS "ScientificKnowledgeFragment_1_length", "ScientificKnowledgeFragment_1".creation_date AS "ScientificKnowledgeFragment_1_creation_date", "ScientificKnowledgeFragment_1".content AS "ScientificKnowledgeFragment_1_content", "ScientificKnowledgeFragment_1".token_count AS "ScientificKnowledgeFragment_1_token_count", "ScientificKnowledgeFragment_1".format AS "ScientificKnowledgeFragment_1_format", "ScientificKnowledgeFragment_1".provenance AS "ScientificKnowledgeFragment_1_provenance", "ScientificKnowledgeFragment_1".license AS "ScientificKnowledgeFragment_1_license", "ScientificKnowledgeFragment_1".name AS "ScientificKnowledgeFragment_1_name", "ScientificKnowledgeFragment_1".id AS "ScientificKnowledgeFragment_1_id", "ScientificKnowledgeFragment_1".type AS "ScientificKnowledgeFragment_1_type" 
FROM "ScientificKnowledgeExpression" AS "ScientificKnowledgeExpression_1", "ScientificKnowledgeFragment" AS "ScientificKnowledgeFragment_1", "ScientificKnowledgeCollection" AS "ScientificKnowledgeCollection_1", "ScientificKnowledgeCollection_has_members" AS "ScientificKnowledgeCollection_has_members_1", "ScientificKnowledgeExpression_has_representation" AS "ScientificKnowledgeExpression_has_representation_1", "ScientificKnowledgeItem" AS "ScientificKnowledgeItem_1", "ScientificKnowledgeItem_has_part" AS "ScientificKnowledgeItem_has_part_1" 
WHERE "ScientificKnowledgeCollection_1".id = "ScientificKnowledgeCollection_has_members_1"."ScientificKnowledgeCollection_id" AND "ScientificKnowledgeCollection_has_members_1".has_members_id = "ScientificKnowledgeExpression_1".id AND "ScientificKnowledgeExpression_1".id = "ScientificKnowledgeExpression_has_representation_1"."ScientificKnowledgeExpression_id" AND "ScientificKnowledgeExpression_has_representation_1".has_representation_id = "ScientificKnowledgeItem_1".id AND "ScientificKnowledgeItem_1".id = "ScientificKnowledgeItem_has_part_1"."ScientificKnowledgeItem_id" AND "ScientificKnowledgeItem_has_part_1".has_part_id = "ScientificKnowledgeFragment_1".id AND "ScientificKnowledgeExpression_has_representation_1".has_representation_id = "ScientificKnowledgeItem_1".id AND "ScientificKnowledgeItem_1".type = %(type_1)s ORDER BY "ScientificKnowledgeExpression_1".id]
[parameters: {'type_1': 'CitationRecord'}]
(Background on this error at: https://sqlalche.me/e/20/f405)
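
The `UndefinedColumn` error above says that the ORM mapping for `ScientificKnowledgeExpression` declares a `provenance` column that the table in the database does not have. This typically happens when the database was created from an older version of the schema. The sketch below shows one way such drift could be detected. It uses an in-memory SQLite table as a stand-in so that it runs on its own. The `missing_columns` helper and the `expected_columns` list are hypothetical. Against Postgres, the same comparison would read column names from `information_schema.columns` rather than from `PRAGMA table_info`.

```python
import sqlite3

def missing_columns(conn, table, expected_columns):
    """Return the expected column names that the live table lacks."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    actual = {row[1] for row in rows}
    return sorted(set(expected_columns) - actual)

# Stand-in for a database created from an older schema (no provenance column).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "ScientificKnowledgeExpression" (id TEXT, name TEXT, license TEXT)'
)

# Hypothetical subset of the columns that the current ORM mapping expects.
expected_columns = ["id", "name", "license", "provenance"]

missing = missing_columns(conn, "ScientificKnowledgeExpression", expected_columns)
print(missing)
```

If a column is reported missing, one option is to back up the database and rebuild it with `create_ceifns_database`; another is to add the column with a migration.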

In [None]:

# Classify every sentence in one batched call and time the inference run.
from datetime import timedelta

start = time()

df = sent_df

preds = classifier(df['text'].tolist())
pred_df = pd.DataFrame(preds)
# The pipeline returns generic 'LABEL_n' names; map them to discourse labels.
df['label'] = pred_df['label'].map(lookup).values
df['score'] = pred_df['score'].values

end = time()

print('Prediction time:', str(timedelta(seconds=end-start)))

Prediction time: 0:26:27.240843


In [None]:
df

Unnamed: 0,doi,f_id,s_id,text,label,score
0,doi:/s0034-98872008000200015,9f3919db23.0,0,[Cholangiocarcinoma].,BACKGROUND,0.556335
1,doi:/s0034-98872008000200015,9f3919db23.1,0,Cholangiocarcinoma is a malignant lesion of th...,BACKGROUND,0.710530
2,doi:/s0034-98872008000200015,9f3919db23.1,1,Its incidence and prevalence are low.,BACKGROUND,0.769202
3,doi:/s0034-98872008000200015,9f3919db23.1,2,It appears from the sixth decade of life and t...,BACKGROUND,0.772743
4,doi:/s0034-98872008000200015,9f3919db23.1,3,It is most frequently found in the confluence ...,BACKGROUND,0.726111
...,...,...,...,...,...,...
538367,doi:huon.2005.49.1.0065,9f08f2e243.1,9,Of the 59 primary epithelial tumours 62.7% was...,RESULTS,0.916542
538368,doi:huon.2005.49.1.0065,9f08f2e243.1,10,The differential diagnosis and management are ...,RESULTS,0.558720
538369,doi:huon.2005.49.1.0065,9f08f2e243.1,11,The prognosis of pleomorphic adenomas depends ...,BACKGROUND,0.629027
538370,doi:huon.2005.49.1.0065,9f08f2e243.1,12,In cases of suspected malignant epithelial tum...,BACKGROUND,0.707331
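
Sentence-level predictions can be easier to read once they are rolled up per fragment. The sketch below tallies labels for a few rows copied from the output above and drops predictions below a confidence threshold. The `label_counts` helper and the `min_score` value of 0.6 are illustrative assumptions, not part of the notebook.

```python
from collections import Counter, defaultdict

# (f_id, label, score) triples copied from the table above.
rows = [
    ("9f3919db23.0", "BACKGROUND", 0.556335),
    ("9f3919db23.1", "BACKGROUND", 0.710530),
    ("9f3919db23.1", "BACKGROUND", 0.769202),
    ("9f08f2e243.1", "RESULTS", 0.916542),
    ("9f08f2e243.1", "RESULTS", 0.558720),
    ("9f08f2e243.1", "BACKGROUND", 0.629027),
]

def label_counts(rows, min_score=0.6):
    """Count labels per fragment, ignoring low-confidence predictions."""
    counts = defaultdict(Counter)
    for f_id, label, score in rows:
        if score >= min_score:
            counts[f_id][label] += 1
    return {f_id: dict(c) for f_id, c in counts.items()}

summary = label_counts(rows)
print(summary)
```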


In [None]:
ldb.session.rollback()

In [None]:
# Generate fragment sentences and add them as Notes
import json

ldb.session.rollback()
for i, row in df.iterrows():
    # Parent fragment, and the item it belongs to (the item id is the prefix of the fragment id).
    f_q = ldb.session.query(SKF).filter(SKF.id == row.f_id).first()
    i_q = ldb.session.query(SKI).filter(SKI.id == row.f_id.split('.')[0]).first()
    # Character offset of the sentence within the item; find() returns -1 if it is absent.
    offset = i_q.content.find(row.text)
    length = len(row.text)
    sentence_id = f_q.id + '.' + str(row.s_id)
    sentence_fragment = ScientificKnowledgeFragment(id=sentence_id,
                                                    content=row.text,
                                                    offset=offset,
                                                    length=length,
                                                    type='sentence')
    i_q.has_part.append(sentence_fragment)
    # Attach the predicted discourse label and its score as a JSON note.
    note_content = {'discourse_label': row.label, 'score': row.score}
    n = Note(id=sentence_id + '.discourse_type',
             content=json.dumps(note_content, indent=4),
             format='json',
             type='NoteAboutFragment')
    sentence_fragment.has_notes.append(n)
    ldb.session.flush()
ldb.session.commit()
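
One caveat with `str.find` in the loop above: it returns the first match, so a sentence that occurs twice in the same item gets the same offset both times, and it returns -1 if the sentence does not appear verbatim (for example, if whitespace was normalized). A minimal sketch of a cursor-based alternative follows. `locate_sentences` is a hypothetical helper, not part of Alhazen.

```python
def locate_sentences(text, sentences):
    """Return (offset, length) for each sentence, searching left to right."""
    spans = []
    cursor = 0
    for sentence in sentences:
        # Start each search where the previous sentence ended.
        offset = text.find(sentence, cursor)
        if offset == -1:
            # The sentence does not appear verbatim; record it as unlocated.
            spans.append((None, len(sentence)))
            continue
        spans.append((offset, len(sentence)))
        cursor = offset + len(sentence)
    return spans

text = "Its incidence is low. Its incidence is low. It is rare."
sentences = ["Its incidence is low.", "Its incidence is low.", "It is rare."]
spans = locate_sentences(text, sentences)
print(spans)
```

Recording `None` for sentences that cannot be located makes them easy to filter out before they are written to the database.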


# Running DRSM Classifiers.

In [5]:
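from transformers import pipeline, AutoTokenizer

# Local path to the fine-tuned DRSM classifier checkpoint.
model_path = '/Users/gully.burns/Documents/2024H1/models/drsm_classifier'

# The classifier uses the BioBERT tokenizer; cap inputs at 512 tokens.
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1",
                                          model_max_length=512)

# NOTE: this label list is carried over unchanged from the discourse tagger cell;
# check it against the labels the DRSM classifier was trained with.
labels = ['BACKGROUND', 'OBJECTIVE', 'METHODS', 'RESULTS', 'CONCLUSIONS']
lookup = {f'LABEL_{i}': label for i, label in enumerate(labels)}

# The pipeline loads the model from model_path itself (in eval mode),
# so no separate AutoModel instance is needed.
classifier = pipeline("text-classification",
                      model=model_path,
                      tokenizer=tokenizer,
                      truncation=True,
                      batch_size=8,
                      device='mps')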




# Topic Modeling over the corpus. 

What are the main topics being discussed in each paper?

In [None]:
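from transformers import pipeline, AutoTokenizer

# This cell reuses the DRSM classifier checkpoint.
model_path = '/Users/gully.burns/Documents/2024H1/models/drsm_classifier'

# The classifier uses the BioBERT tokenizer; cap inputs at 512 tokens.
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1",
                                          model_max_length=512)

# NOTE: this label list is carried over unchanged from the discourse tagger cell;
# check it against the labels the classifier was trained with.
labels = ['BACKGROUND', 'OBJECTIVE', 'METHODS', 'RESULTS', 'CONCLUSIONS']
lookup = {f'LABEL_{i}': label for i, label in enumerate(labels)}

# The pipeline loads the model from model_path itself (in eval mode),
# so no separate AutoModel instance is needed.
classifier = pipeline("text-classification",
                      model=model_path,
                      tokenizer=tokenizer,
                      truncation=True,
                      batch_size=8,
                      device='mps')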

# Search for and download Full Text Papers.

Can we search for all Stiff Person Syndrome papers published in the last 10 years?



In [None]:
from datetime import datetime, timedelta
from sqlalchemy import Date, func

ldb.session.rollback()

ten_years_ago = (datetime.now() - timedelta(days=10*365))
print(ten_years_ago)

# Count Stiff Person Syndrome papers per publication year.
pub_year = func.extract('year', SKE.publication_date.cast(Date))
q = ldb.session.query(pub_year, func.count(SKE.id)) \
    .filter(SKC.id==SKC_HM.ScientificKnowledgeCollection_id) \
    .filter(SKC_HM.has_members_id==SKE.id) \
    .filter(SKE.publication_date >= ten_years_ago) \
    .filter(SKC.name == 'The Stiff Person Syndrome' ) \
    .group_by(pub_year) \
    .order_by(pub_year)
sps_pubcount_df = pd.DataFrame(q.all(), columns=['year', 'count'])
sps_pubcount_df

2014-01-25 10:10:34.434342


Unnamed: 0,year,count
0,2014,25
1,2015,35
2,2016,34
3,2017,26
4,2018,24
5,2019,43
6,2020,40
7,2021,40
8,2022,35
9,2023,49
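
Note that `timedelta(days=10*365)` ignores leap days, so the cutoff lands two or three days later than a true ten-year window would. If the exact calendar date matters, the year can be shifted directly. The sketch below shows one way to do that with the standard library. `years_ago` is a hypothetical helper and the timestamp is illustrative.

```python
from datetime import datetime

def years_ago(moment, years):
    """Shift a datetime back by whole calendar years."""
    try:
        return moment.replace(year=moment.year - years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; fall back to 28 February.
        return moment.replace(year=moment.year - years, day=28)

run_time = datetime(2024, 1, 23, 10, 10, 34)  # illustrative timestamp
cutoff = years_ago(run_time, 10)
print(cutoff)
```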


# Run PaperQA

In [None]:
cb.agent_executor.invoke({'input':'Write a short essay on "What connections between primary ciliary diskinesia and primary cilia have been studied?" based on the collection with ID="70".'})



> Entering new AgentExecutor chain...
{
  "action": "simple_qa_over_papers",
  "action_input": {
    "question": "What connections between primary ciliary diskinesia and primary cilia have been studied?",
    "collection_id": "70"
  }
}
[chain/start] [1:chain:RunnableParallel<k,question,context>] Entering Chain run with input:
{
  "question": "What connections between primary ciliary diskinesia and primary cilia have been studied?",
  "summary_length": 1000,
  "k": 5
}
[chain/start] [1:chain:RunnableParallel<k,question,context> > 2:chain:RunnableLambda] Entering Chain run with input:
{
  "question": "What connections between primary ciliary diskinesia and primary cilia have been studied?",
  "summary_length": 1000,
  "k": 5
}
[chain/end] [1:chain:RunnableParallel<k,question,context> > 2:chain:RunnableLambda] [1ms] Exiting Chain run with output:
{
  "output": 5
}

KeyboardInterrupt: 