# CryoET Tutorial 

> This tutorial walks first-time users through Alhazen. It is based on our analysis of the Cryo-Electron Tomography (CryoET) literature and the tools we have developed to analyze that data.

## Introduction to CryoET

Cryo-electron Tomography (CryoET) involves rapidly freezing biological samples in their natural state to preserve their three-dimensional structure without the need for staining or crystallization. This methodology allows researchers to visualize proteins and other biomolecules at near-atomic resolution.

This digital library captures all papers that mention the technique in their titles, abstracts, or methods sections, and then analyzes the methods used and their applications. Our focus is on supporting the work of the [Chan Zuckerberg Imaging Institute (CZII)](https://www.czimaginginstitute.org/) in developing [the CryoET data portal](https://cryoetdataportal.czscience.com/), an open source repository for CryoET-based data.

## Basics

### Python Imports

This cell sets up the Python imports, environment variables, and other setup parameters that the tutorial needs.

In [1]:
from alhazen.aliases import *
from alhazen.core import lookup_chat_models
from alhazen.agent import AlhazenAgent
from alhazen.schema_sqla import *
from alhazen.tools.basic import AddCollectionFromEPMCTool, DeleteCollectionTool
from alhazen.tools.paperqa_emulation_tool import PaperQAEmulationTool
from alhazen.tools.metadata_extraction_tool import * 
from alhazen.tools.protocol_extraction_tool import *
from alhazen.tools.tiab_classifier_tool import *
from alhazen.tools.tiab_extraction_tool import *
from alhazen.tools.tiab_mapping_tool import *
from alhazen.toolkit import *
from alhazen.utils.jats_text_extractor import NxmlDoc

from alhazen.utils.ceifns_db import Ceifns_LiteratureDb, create_ceifns_database, drop_ceifns_database, backup_ceifns_database

from alhazen.utils.searchEngineUtils import *

from langchain.callbacks.tracers import ConsoleCallbackHandler
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.pgvector import PGVector
from langchain_community.chat_models.ollama import ChatOllama
from langchain_google_vertexai import ChatVertexAI
from langchain_openai import ChatOpenAI

import nltk
nltk.download('punkt')

from bs4 import BeautifulSoup,Tag,Comment,NavigableString
from databricks import sql
from datetime import datetime
from importlib_resources import files
import json
import os
import pandas as pd
from pathlib import Path
import re
import requests

from sqlalchemy import text, create_engine, exists, func, or_, and_, not_, desc, asc
from sqlalchemy.orm import sessionmaker, aliased

from time import time,sleep
from tqdm import tqdm
from urllib.request import urlopen
from urllib.parse import quote_plus, quote, unquote
from urllib.error import URLError, HTTPError
import uuid
import yaml

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


### Environment Variables

You must set the following environment variable for this code:

* `LOCAL_FILE_PATH` - the location on disk where you save temporary files, downloaded models or other data.   

Note that this notebook builds and uses a database named `cryoet_tutorial`, as set in `db_name` below.
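One way to satisfy this requirement from inside the notebook is to set the variable before the check in the next cell runs. This is a minimal sketch; the `~/alhazen_data` directory is a hypothetical location chosen only for illustration, and any writable directory would do.

```python
import os
from pathlib import Path

# Hypothetical default location (an assumption); replace with any directory you can write to.
default_path = Path.home() / "alhazen_data"

# setdefault keeps a value already exported in the shell and only fills in a missing one.
os.environ.setdefault("LOCAL_FILE_PATH", str(default_path))
print(os.environ["LOCAL_FILE_PATH"])
```

Using `setdefault` rather than plain assignment means a value exported in your shell still takes precedence.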

In [2]:
if os.environ.get('LOCAL_FILE_PATH') is None: 
    raise Exception('Where are you storing your local literature database?')
# Create the directory if it does not exist yet
os.makedirs(os.environ['LOCAL_FILE_PATH'], exist_ok=True)

loc = os.environ['LOCAL_FILE_PATH']
db_name = 'cryoet_tutorial'

# Variable to prevent accidental deletion of the database or any records
OK_TO_DELETE = False

### Setup utils, agents, and tools 

This cell sets up a database engine (`ldb`) and lists the available large language models (LLMs) you can use.

In [3]:
ldb = Ceifns_LiteratureDb(loc=loc, name=db_name)
llms_lookup = lookup_chat_models()
print(llms_lookup.keys())

dict_keys(['ollama_llama3', 'ollama_mixtral', 'databricks_dbrx', 'databricks_mixtral', 'databricks_llama3', 'groq_mixtral', 'groq_llama3', 'gpt4_1106', 'gpt35'])
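The keys printed above come from one particular run, so a given key may not exist on another installation (this is an assumption on our part). The sketch below shows one way to pick the first available model from an ordered preference list. `pick_model` is a hypothetical helper, not part of Alhazen, and the dictionary is a stand-in for `llms_lookup`.

```python
def pick_model(available, preferred):
    """Return the first key in `preferred` that is present in `available`."""
    for key in preferred:
        if key in available:
            return key
    raise KeyError(f"None of {preferred} found; available: {sorted(available)}")

# Stand-in for `llms_lookup`, using the keys printed above (model objects omitted).
available = dict.fromkeys([
    'ollama_llama3', 'ollama_mixtral', 'databricks_dbrx', 'databricks_mixtral',
    'databricks_llama3', 'groq_mixtral', 'groq_llama3', 'gpt4_1106', 'gpt35',
])

choice = pick_model(available, ['databricks_llama3', 'groq_llama3', 'ollama_llama3'])
print(choice)
```

In the notebook itself, the chosen key would then be passed to `llms_lookup.get(...)`.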




This cell instantiates an `AlhazenAgent` that you can use to run tools or execute commands.

In [4]:
llm = llms_lookup.get('databricks_llama3')

cb = AlhazenAgent(llm, llm, db_name=db_name)
print('AGENT TOOLS')
for t in cb.tk.get_tools():
    print('\t'+type(t).__name__)

AGENT TOOLS
	AddCollectionFromEPMCTool
	AddAuthorsToCollectionTool
	DescribeCollectionCompositionTool
	DeleteCollectionTool
	RetrieveFullTextTool
	RetrieveFullTextToolForACollection
	MetadataExtraction_EverythingEverywhere_Tool
	MetadataExtraction_MethodsSectionOnly_Tool
	SimpleExtractionWithRAGTool
	PaperQAEmulationTool
	ProcotolEntitiesExtractionTool
	CheckExpressionTool
	TitleAbstractClassifier_OneDocAtATime_Tool
	TitleAbstractDiscourseMappingTool
	TitleAbstractExtraction_OneDocAtATime_Tool
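Several cells in this notebook select a tool from this list with a list comprehension followed by `[0]`, which raises an uninformative `IndexError` when the tool is missing. The sketch below shows a small lookup helper with a clearer error. `get_tool` and the dummy classes are hypothetical and exist only so the example runs without Alhazen installed.

```python
def get_tool(tools, tool_class):
    """Return the first tool that is an instance of `tool_class`."""
    for tool in tools:
        if isinstance(tool, tool_class):
            return tool
    raise LookupError(f"No tool of type {tool_class.__name__} is registered")

# Dummy stand-ins (assumptions) for the tool classes listed above.
class DummyAddTool:
    pass

class DummyDeleteTool:
    pass

class DummyMissingTool:
    pass

tools = [DummyAddTool(), DummyDeleteTool()]
found = get_tool(tools, DummyDeleteTool)
print(type(found).__name__)
```

With the real agent, the call would look like `get_tool(cb.tk.get_tools(), AddCollectionFromEPMCTool)`.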


## Building the database


### Scripts to Build / Delete the database

If you need to restore a deleted database from backup, use the following shell commands:

```
$ createdb cryoet_tutorial
$ psql -d cryoet_tutorial -f /local/file/path/cryoet_tutorial/backup<date_time>.sql
```
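The backup cell in this notebook names its files `backup<YYYY-mm-dd-HH-MM-SS>.sql`. Because that timestamp is zero-padded, sorting the file names alphabetically also sorts them by time. The sketch below uses this to find the most recent backup to restore. `latest_backup` is a hypothetical helper, and the timestamps are made up for the demonstration.

```python
from pathlib import Path
import tempfile

def latest_backup(db_dir):
    """Return the newest `backup*.sql` file in `db_dir`, or None if there is none."""
    backups = sorted(Path(db_dir).glob("backup*.sql"))
    return backups[-1] if backups else None

# Demonstrate on a temporary directory holding two empty, made-up backup files.
with tempfile.TemporaryDirectory() as tmp:
    for stamp in ("2024-01-05-09-00-00", "2024-03-20-17-30-00"):
        (Path(tmp) / f"backup{stamp}.sql").touch()
    newest_name = latest_backup(tmp).name
    print(newest_name)
```

The returned path can then be passed to `psql -f` as in the commands above.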

This command will delete your existing database (but will also store a copy).

In [5]:
if OK_TO_DELETE:
    drop_ceifns_database(db_name, backupFirst=True)

This command will back up your current database.

In [6]:
if OK_TO_DELETE:
    current_date_time = datetime.now()
    formatted_date_time = f'{current_date_time:%Y-%m-%d-%H-%M-%S}'
    backup_path = loc+'/'+db_name+'/backup'+formatted_date_time+'.sql'
    backup_ceifns_database(db_name, backup_path)

This command will create a new, fresh, empty copy of your database.  

In [6]:
create_ceifns_database(db_name)

100%|██████████| 309/309 [00:00<00:00, 2035.56it/s]


### Build CEIFNS database from queries

#### Add a collection of all CryoET papers based on a query

This runs a query on European PMC for terms and synonyms related to Cryo-Electron Tomography.

In [5]:
cryoet_query = '''
("Cryoelectron Tomography" OR "Cryo Electron Tomography" OR "Cryo-Electron Tomography" OR
    "Cryo-ET" OR "CryoET" OR "Cryoelectron Tomography" OR "cryo electron tomography" or 
    "cryo-electron tomography" OR "cryo-et" OR cryoet)
'''
addEMPCCollection_tool = [t for t in cb.tk.get_tools() if isinstance(t, AddCollectionFromEPMCTool)][0]
addEMPCCollection_tool.run(tool_input={'id': '1', 
                                       'name': 'CryoET Papers', 
                                       'query': cryoet_query})

https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=
("Cryoelectron Tomography" OR "Cryo Electron Tomography" OR "Cryo-Electron Tomography" OR
    "Cryo-ET" OR "CryoET" OR "Cryoelectron Tomography" OR "cryo electron tomography" or 
    "cryo-electron tomography" OR "cryo-et" OR cryoet)
, 6337 European PMC PAPERS FOUND


100%|██████████| 7/7 [02:12<00:00, 18.88s/it]


 Returning 6230


100%|██████████| 6230/6230 [00:35<00:00, 174.60it/s]


{'response': 'We added a collection to the database called `CryoET Papers` containing 0 papers from this query: `\n("Cryoelectron Tomography" OR "Cryo Electron Tomography" OR "Cryo-Electron Tomography" OR\n    "Cryo-ET" OR "CryoET" OR "Cryoelectron Tomography" OR "cryo electron tomography" or \n    "cryo-electron tomography" OR "cryo-et" OR cryoet)\n`.'}
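To make the shape of the request concrete, the sketch below builds an OR-query from a list of synonyms and assembles a search URL with the same parameters that appear in the log output above (`format`, `pageSize`, `synonym`, `resultType`, `query`). It illustrates the URL structure only. It is not the implementation used by `AddCollectionFromEPMCTool`, and both helper functions are hypothetical.

```python
from urllib.parse import urlencode

EPMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def build_or_query(terms):
    """Quote each term and join the terms with OR inside one pair of parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

def build_search_url(query, page_size=1000):
    """Assemble a search URL using the parameters seen in the log output above."""
    params = {
        "format": "JSON",
        "pageSize": page_size,
        "synonym": "TRUE",
        "resultType": "core",
        "query": query,
    }
    # urlencode percent-encodes the query so quotes and spaces are safe in a URL.
    return EPMC_SEARCH + "?" + urlencode(params)

query = build_or_query(["Cryo-Electron Tomography", "Cryo-ET", "CryoET"])
url = build_search_url(query)
print(query)
print(url)
```

Building the query from a list makes it easier to avoid repeated terms and to keep every operator in the same form.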

In [6]:
# Count the `SKE` records (papers) now in the database
q = ldb.session.query(SKE)
print(len(q.all()))

6228


#### Add a collection of machine learning papers based on a query

In [7]:
ml_query = '''
("Cryoelectron Tomography" OR "Cryo Electron Tomography" OR "Cryo-Electron Tomography" OR
    "Cryo-ET" OR "CryoET" OR "Cryoelectron Tomography" OR "cryo electron tomography" or 
    "cryo-electron tomography" OR "cryo-et" OR cryoet ) AND 
("Machine Learning" OR "Artificial Intelligence" OR "Deep Learning" OR "Neural Networks")
'''
addEMPCCollection_tool = [t for t in cb.tk.get_tools() if isinstance(t, AddCollectionFromEPMCTool)][0]
addEMPCCollection_tool.run(tool_input={'id': '2', 
                                       'name': 'Machine Learning in CryoET', 
                                       'query': ml_query, 
                                       'full_text': False})

https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=
("Cryoelectron Tomography" OR "Cryo Electron Tomography" OR "Cryo-Electron Tomography" OR
    "Cryo-ET" OR "CryoET" OR "Cryoelectron Tomography" OR "cryo electron tomography" or 
    "cryo-electron tomography" OR "cryo-et" OR cryoet ) AND 
("Machine Learning" OR "Artificial Intelligence" OR "Deep Learning" OR "Neural Networks")
, 680 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:14<00:00, 14.23s/it]


 Returning 654


100%|██████████| 654/654 [00:01<00:00, 446.43it/s]


{'response': 'We added a collection to the database called `Machine Learning in CryoET` containing 0 papers from this query: `\n("Cryoelectron Tomography" OR "Cryo Electron Tomography" OR "Cryo-Electron Tomography" OR\n    "Cryo-ET" OR "CryoET" OR "Cryoelectron Tomography" OR "cryo electron tomography" or \n    "cryo-electron tomography" OR "cryo-et" OR cryoet ) AND \n("Machine Learning" OR "Artificial Intelligence" OR "Deep Learning" OR "Neural Networks")\n`.'}

#### Create a new collection of randomly sampled papers to showcase full-text download capability

In [8]:
ldb.create_new_collection_from_sample('3', 'CryoET Papers Tests', '1', 20, ['ScientificPrimaryResearchArticle', 'ScientificPrimaryResearchPreprint'])

ScientificKnowledgeCollection(creation_date=None,content=None,token_count=None,format=None,provenance=None,license=None,name=CryoET Papers Tests,id=3,type=skem:ScientificKnowledgeCollection,)
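As we read the call above, its arguments are the new collection id, its name, the source collection id, the sample size, and the permitted paper types. The sketch below shows the general idea of type-filtered random sampling on in-memory records. It is an illustration under that assumption, not Alhazen's implementation. The records, the placeholder DOIs, and the `Review` type are made up.

```python
import random

def sample_papers(papers, n, allowed_types, seed=None):
    """Randomly pick up to `n` paper ids whose type is in `allowed_types`."""
    eligible = [p["id"] for p in papers if p["type"] in allowed_types]
    rng = random.Random(seed)  # a seed makes the sample reproducible
    return rng.sample(eligible, min(n, len(eligible)))

# Made-up records: every third paper is a (hypothetical) 'Review' and is excluded.
papers = [
    {"id": f"doi:10.0000/example.{i}",
     "type": "ScientificPrimaryResearchArticle" if i % 3 else "Review"}
    for i in range(30)
]

allowed = {"ScientificPrimaryResearchArticle", "ScientificPrimaryResearchPreprint"}
picked = sample_papers(papers, 20, allowed, seed=0)
print(len(picked))
```

Fixing the seed is useful in a tutorial so that readers who rerun the sampling step get the same papers.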

## Analyze Collections

### Survey + Run Classifications over Papers

This invokes the following classification process on each paper (defined in the prompt definition in `./local_resources/prompts/tiab_prompts`):
    
* A - Structural descriptions of Viral Pathogens (such as HIV, Influenza, SARS-CoV-2, etc.)
* B - Studies of mutated protein structures associated with disease (such as Alzheimer's, Parkinson's, etc.) 
* C - Structural studies of bacterial pathogens (such as E. coli, Salmonella, etc.)
* D - Structural studies of plant cells
* E - Structural studies of material science of non-biological samples
* F - Structural studies of transporters or transport mechanisms within cells, studies involving the cytoskeleton or active transport processes. 
* G - Structural studies of synapses or other mechanisms of releasing vesicles over the plasma membrane
* H - Structural studies of any other organelle or structured component of a cell. 
* I - Studies of dynamic biological processes at a cellular level (such as cell division, cell migration, etc.)
* J - Studies of dynamics of molecular interactions within a cell.    
* K - Development of new CryoET imaging methods (including grid preparation techniques, such as lift-out). 
* L - Development of new data analysis methods (including machine learning, segmentation, point-picking, object recognition, or reconstruction). 
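Because the classifier is a language model, it can help to check that each returned code is one of the twelve categories above before storing or analyzing it. The sketch below does this for a JSON record. The key `cryoet_study_type_code` is taken from the results table shown later in this notebook. The category names are shortened versions of the list above, and `parse_classification` is a hypothetical helper.

```python
import json

# Shortened category names (examples in parentheses dropped) for the codes above.
CRYOET_STUDY_TYPES = {
    "A": "Structural descriptions of Viral Pathogens",
    "B": "Studies of mutated protein structures associated with disease",
    "C": "Structural studies of bacterial pathogens",
    "D": "Structural studies of plant cells",
    "E": "Structural studies of material science of non-biological samples",
    "F": "Structural studies of transporters or transport mechanisms within cells",
    "G": "Structural studies of synapses or other vesicle release mechanisms",
    "H": "Structural studies of any other organelle or structured component of a cell",
    "I": "Studies of dynamic biological processes at a cellular level",
    "J": "Studies of dynamics of molecular interactions within a cell",
    "K": "Development of new CryoET imaging methods",
    "L": "Development of new data analysis methods",
}

def parse_classification(raw):
    """Parse a JSON classification record and check that its code is known."""
    record = json.loads(raw)
    code = record.get("cryoet_study_type_code")
    if code not in CRYOET_STUDY_TYPES:
        raise ValueError(f"Unexpected study type code: {code!r}")
    return code, CRYOET_STUDY_TYPES[code]

code, name = parse_classification('{"cryoet_study_type_code": "K", "explanation": "example"}')
print(code, "-", name)
```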

In [None]:
t = [t for t in cb.tk.get_tools() if isinstance(t, TitleAbstractClassifier_OneDocAtATime_Tool)][0]
t.run({'collection_id': '3', 'classification_type':'cryoet_study_types', 'repeat_run':True})

In [None]:
# USE WITH CAUTION - this will delete all `cryoet_study_types` classification notes in the database
if OK_TO_DELETE:
    q = ldb.session.query(N, SKE) \
            .filter(N.id == NIA.Note_id) \
            .filter(NIA.is_about_id == SKE.id) \
            .filter(N.type == 'TiAbClassificationNote__cryoet_study_types')

    print(len(q.all()))  # number of notes before deletion
    for n, ske in q.all():
        ldb.delete_note(n.id)
    print(len(q.all()))  # number of notes after deletion
    

The following query reads the notes saved to the database and shows the zero-shot document classifications based on titles and abstracts.

In [10]:
q = ldb.session.query(N, SKE) \
        .filter(N.id == NIA.Note_id) \
        .filter(NIA.is_about_id == SKE.id) \
        .filter(N.type == 'TiAbClassificationNote__cryoet_study_types') \
        .order_by(SKE.id)

# Build one row per classification note, adding provenance and citation details
output = []
for n, ske in q.all():
    tup = json.loads(n.content)
    tup['prov'] = n.name
    tup['doi'] = 'http://doi.org/'+re.sub('doi:', '', ske.id)
    tup['year'] = ske.publication_date.year
    tup['month'] = ske.publication_date.month
    tup['ref'] = ske.content
    output.append(tup)

# Most recent papers first; save a tab-separated copy alongside the database files
df = pd.DataFrame(output).sort_values(['year', 'month'], ascending=[False, False])
df.to_csv(loc+'/'+db_name+'/cryoet_study_types.tsv', sep='\t')
df

index,cryoet_study_type_code,cryoet_study_type_name,explanation,prov,doi,year,month,ref
6,I,Studies of dynamic biological processes at a c...,The paper primarily focuses on understanding t...,TitleAbstractClassifier_OneDocAtATime_Tool__cr...,http://doi.org/10.1038/s41594-024-01281-y,2024,4,"Dendooven T, Yatskevich S, Burt A, Chen ZA, Be..."
5,A,Structural descriptions of Viral Pathogens,This paper is primarily about the structural d...,TitleAbstractClassifier_OneDocAtATime_Tool__cr...,http://doi.org/10.1038/s41467-023-39756-z,2023,7,"Li F, Hou CD, Lokareddy RK, Yang R, Forti F, B..."
13,L,Development of new data analysis methods,The paper primarily focuses on developing a ne...,TitleAbstractClassifier_OneDocAtATime_Tool__cr...,http://doi.org/10.1101/2023.05.31.542975,2023,6,"Powell BM, Davis JH. (2023) Learning structura..."
8,A,Structural descriptions of Viral Pathogens (su...,The paper presents cryo-EM structures of HIV-1...,TitleAbstractClassifier_OneDocAtATime_Tool__cr...,http://doi.org/10.1038/s42003-023-04916-w,2023,5,"Wang K, Zhang S, Go EP, Ding H, Wang WL, Nguye..."
12,A,Structural descriptions of Viral Pathogens (su...,The paper is primarily focused on the structur...,TitleAbstractClassifier_OneDocAtATime_Tool__cr...,http://doi.org/10.1101/2022.08.11.503604,2022,8,"Hover S, Charlton FW, Hellert J, Barr JN, Mank..."
10,D,Structural studies of plant cells,The paper primarily focuses on the structural ...,TitleAbstractClassifier_OneDocAtATime_Tool__cr...,http://doi.org/10.1093/jb/mvab113,2022,1,"Kurisu G, Tsukihara T. (2022) Forty years of t..."
19,G,Structural studies of synapses or other mechan...,The paper primarily focuses on the structure a...,TitleAbstractClassifier_OneDocAtATime_Tool__cr...,http://doi.org/10.3389/fnsyn.2021.798225,2022,1,Szule JA. (2022) Hypothesis Relating the Struc...
0,C,Structural studies of bacterial pathogens,The paper primarily focuses on the structure a...,TitleAbstractClassifier_OneDocAtATime_Tool__cr...,http://doi.org/10.1016/j.isci.2021.103458,2021,11,"Masson F, Pierrat X, Lemaitre B, Persat A. (20..."
7,A,Structural descriptions of Viral Pathogens (su...,The paper is primarily focused on developing a...,TitleAbstractClassifier_OneDocAtATime_Tool__cr...,http://doi.org/10.1038/s42003-021-02128-8,2021,5,"Chiba S, Frey SJ, Halfmann PJ, Kuroda M, Maemu..."
3,A,Structural descriptions of Viral Pathogens (su...,The paper is primarily concerned with the stru...,TitleAbstractClassifier_OneDocAtATime_Tool__cr...,http://doi.org/10.1017/qrd.2020.16,2020,11,"Zhang K, Li S, Pintilie G, Chmielewski D, Schm..."


In [22]:
df['explanation'][44]

"This paper is classified under 'C' because it focuses on the structural study of magnetosomes in a specific bacterium, Magnetovibrio blakemorei strain MV-1, and investigates their magnetic orientation and chain formation mechanisms."
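A quick tally of the codes gives a first impression of how the sample is distributed across study types. The sketch below counts the code and year values copied from the ten rows displayed in the table above. It uses the standard library only, so it runs without the database. With the real `df`, `df['cryoet_study_type_code'].value_counts()` would give the equivalent result.

```python
from collections import Counter

# (code, year) pairs copied from the ten rows displayed in the table above.
rows = [
    ("I", 2024), ("A", 2023), ("L", 2023), ("A", 2023), ("A", 2022),
    ("D", 2022), ("G", 2022), ("C", 2021), ("A", 2021), ("A", 2020),
]

by_code = Counter(code for code, _ in rows)
by_year = Counter(year for _, year in rows)

# Print the study type codes from most to least frequent.
for code, n in by_code.most_common():
    print(code, n)
```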

## Run Metadata Extraction Chain over listed papers

Here, we run various versions of the metadata extraction tool to examine performance over the CryoET dataset.

#### Get full text copies of the sampled CryoET papers


In [5]:
cb.agent_executor.invoke({'input':'Get full text copies of all papers in the collection with id="3".'})



> Entering new AgentExecutor chain...
{
  "action": "retrieve_full_text_for_papers_in_collection",
  "action_input": {
    "collection_id": "3"
  }
}

  0%|          | 0/20 [00:00<?, ?it/s]

	10.1101/2024.03.25.586532 does not exist


  5%|▌         | 1/20 [00:19<06:03, 19.14s/it]

Huridocs `pdf_paragraphs_extraction` service not running. Please download it and run it from `https://github.com/GullyBurns/pdf_paragraphs_extraction`
https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=10910546&format=pdf
	10.1107/s2059798324000986 no full text as PDF file.
Message: The element with the reference 71c2a41b-f6b1-45e1-8328-9b8fc22e9c63 is stale; either its node document is not the active document, or it is no longer connected to the DOM; For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#stale-element-reference-exception
Stacktrace:
RemoteError@chrome://remote/content/shared/RemoteError.sys.mjs:8:8
WebDriverError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:193:5
StaleElementReferenceError@chrome://remote/content/shared/webdriver/Errors.sys.mjs:725:5
getKnownElement@chrome://remote/content/marionette/json.sys.mjs:401:11
deserializeJSON@chrome://remote/content/marionette/json.sys.mjs:259:20
cl

 10%|█         | 2/20 [00:36<05:23, 17.95s/it]

Error loading HTML file: /Users/gully.burns/alhazen/cryoet_tutorial/ft/10.1107/s2059798324000986.html


 15%|█▌        | 3/20 [00:47<04:12, 14.82s/it]

Huridocs `pdf_paragraphs_extraction` service not running. Please download it and run it from `https://github.com/GullyBurns/pdf_paragraphs_extraction`


 20%|██        | 4/20 [00:47<02:25,  9.11s/it]

https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=9816115&format=pdf
	10.1038/s41598-022-26760-4 no full text as PDF file.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=<NCBI_API_KEY>&db=pmc&term=10.1038/s41598-022-26760-4[doi]&retmode=xml


rewritetex: likely error invoking catdvi (empty output)
rewritetex: likely error invoking catdvi (empty output)
rewritetex: likely error invoking catdvi (empty output)
 25%|██▌       | 5/20 [01:20<04:25, 17.67s/it]

Huridocs `pdf_paragraphs_extraction` service not running. Please download it and run it from `https://github.com/GullyBurns/pdf_paragraphs_extraction`
https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=9565162&format=pdf
	10.1073/pnas.2210249119 no full text as PDF file.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=<NCBI_API_KEY>&db=pmc&term=10.1073/pnas.2210249119[doi]&retmode=xml


 30%|███       | 6/20 [01:42<04:29, 19.23s/it]

Error loading HTML file: /Users/gully.burns/alhazen/cryoet_tutorial/ft/10.1073/pnas.2210249119.html
https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=9038204&format=pdf
	10.1371/journal.pbio.3001601 no full text as PDF file.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=<NCBI_API_KEY>&db=pmc&term=10.1371/journal.pbio.3001601[doi]&retmode=xml
Error loading HTML file: /Users/gully.burns/alhazen/cryoet_tutorial/ft/10.1371/journal.pbio.3001601.html


 35%|███▌      | 7/20 [02:02<04:12, 19.41s/it]

https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=8855526&format=pdf
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=<NCBI_API_KEY>&db=pmc&term=10.1093/jmicro/dfab041[doi]&retmode=xml
Error loading HTML file: /Users/gully.burns/alhazen/cryoet_tutorial/ft/10.1093/jmicro/dfab041.html


 40%|████      | 8/20 [02:27<04:12, 21.06s/it]

https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=10060710&format=pdf
	10.1093/bioinformatics/btab794 no full text as PDF file.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=<NCBI_API_KEY>&db=pmc&term=10.1093/bioinformatics/btab794[doi]&retmode=xml


 45%|████▌     | 9/20 [02:46<03:44, 20.45s/it]

Huridocs `pdf_paragraphs_extraction` service not running. Please download it and run it from `https://github.com/GullyBurns/pdf_paragraphs_extraction`
https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=8844581&format=pdf
	10.1002/advs.202103498 no full text as PDF file.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=<NCBI_API_KEY>&db=pmc&term=10.1002/advs.202103498[doi]&retmode=xml
Type Error: this element does not have children or attributes
Error loading HTML file: /Users/gully.burns/alhazen/cryoet_tutorial/ft/10.1002/advs.202103498.html


 50%|█████     | 10/20 [03:06<03:25, 20.52s/it]

https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=7490828&format=pdf
	10.1038/s41594-020-0489-2 no full text as PDF file.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=<NCBI_API_KEY>&db=pmc&term=10.1038/s41594-020-0489-2[doi]&retmode=xml


 55%|█████▌    | 11/20 [03:55<04:20, 28.99s/it]

Huridocs `pdf_paragraphs_extraction` service not running. Please download it and run it from `https://github.com/GullyBurns/pdf_paragraphs_extraction`
https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=7185972&format=pdf
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=<NCBI_API_KEY>&db=pmc&term=10.1091/mbc.e19-11-0658[doi]&retmode=xml


 60%|██████    | 12/20 [04:16<03:32, 26.62s/it]

Huridocs `pdf_paragraphs_extraction` service not running. Please download it and run it from `https://github.com/GullyBurns/pdf_paragraphs_extraction`
https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=6989802&format=pdf
	10.1128/jb.00592-19 no full text as PDF file.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=<NCBI_API_KEY>&db=pmc&term=10.1128/jb.00592-19[doi]&retmode=xml


 65%|██████▌   | 13/20 [04:38<02:55, 25.14s/it]

Error loading HTML file: /Users/gully.burns/alhazen/cryoet_tutorial/ft/10.1128/jb.00592-19.html
https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=6476638&format=pdf
	10.1016/j.cell.2018.10.056 no full text as PDF file.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=<NCBI_API_KEY>&db=pmc&term=10.1016/j.cell.2018.10.056[doi]&retmode=xml
Error loading HTML file: /Users/gully.burns/alhazen/cryoet_tutorial/ft/10.1016/j.cell.2018.10.056.html


 70%|███████   | 14/20 [04:52<02:11, 22.00s/it]

	10.7554/elife.34271 does not exist
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=<NCBI_API_KEY>&db=pmc&term=10.7554/elife.34271[doi]&retmode=xml
Error loading HTML file: /Users/gully.burns/alhazen/cryoet_tutorial/ft/10.7554/elife.34271.html


 75%|███████▌  | 15/20 [05:48<02:40, 32.05s/it]

https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=5512264&format=pdf
	10.1128/jvi.00641-17 no full text as PDF file.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=<NCBI_API_KEY>&db=pmc&term=10.1128/jvi.00641-17[doi]&retmode=xml


 80%|████████  | 16/20 [06:20<02:07, 31.99s/it]

Error loading HTML file: /Users/gully.burns/alhazen/cryoet_tutorial/ft/10.1128/jvi.00641-17.html
https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=5380593&format=pdf
	10.1021/jacs.6b08744 no full text as PDF file.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=<NCBI_API_KEY>&db=pmc&term=10.1021/jacs.6b08744[doi]&retmode=xml


 85%|████████▌ | 17/20 [06:33<01:19, 26.47s/it]

Error loading HTML file: /Users/gully.burns/alhazen/cryoet_tutorial/ft/10.1021/jacs.6b08744.html
https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=4028063&format=pdf
	10.1111/tra.12166 no full text as PDF file.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=<NCBI_API_KEY>&db=pmc&term=10.1111/tra.12166[doi]&retmode=xml


 90%|█████████ | 18/20 [06:47<00:45, 22.74s/it]

Error loading HTML file: /Users/gully.burns/alhazen/cryoet_tutorial/ft/10.1111/tra.12166.html
https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=3540082&format=pdf
	10.1371/journal.pone.0053368 no full text as PDF file.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=<NCBI_API_KEY>&db=pmc&term=10.1371/journal.pone.0053368[doi]&retmode=xml
Error loading HTML file: /Users/gully.burns/alhazen/cryoet_tutorial/ft/10.1371/journal.pone.0053368.html


 95%|█████████▌| 19/20 [07:04<00:21, 21.03s/it]

https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id=2789973&format=pdf
	10.2217/nnm.09.56 no full text as PDF file.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?api_key=<NCBI_API_KEY>&db=pmc&term=10.2217/nnm.09.56[doi]&retmode=xml


100%|██████████| 20/20 [07:20<00:00, 22.02s/it]

Huridocs `pdf_paragraphs_extraction` service not running. Please download it and run it from `https://github.com/GullyBurns/pdf_paragraphs_extraction`
[38;5;200m[1;3m{'response': 'I retrieved full text papers for the collection named `CryoET Papers Tests` (id=`3).'}[0m
[32;1m[1;3m[0m

[1m> Finished chain.[0m





{'input': 'Get full text copies of all papers in the collection with id="3".',
 'output': {'response': 'I retrieved full text papers for the collection named `CryoET Papers Tests` (id=`3).'},
 'intermediate_steps': [(AgentAction(tool='retrieve_full_text_for_papers_in_collection', tool_input={'collection_id': '3'}, log='{\n  "action": "retrieve_full_text_for_papers_in_collection",\n  "action_input": {\n    "collection_id": "3"\n  }\n}'),
   {'response': 'I retrieved full text papers for the collection named `CryoET Papers Tests` (id=`3).'})]}
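The retrieval log above is long, so it can be useful to summarize how many papers hit each kind of problem. The sketch below tallies three message types that appear in that log using regular expressions. The patterns were written by reading the log lines above and may need adjusting if the messages change. `summarise_log` is a hypothetical helper.

```python
import re
from collections import Counter

# Patterns modelled on messages that appear in the log above (an assumption about their stability).
PATTERNS = {
    "no_pdf": re.compile(r"^\s*(10\.\S+) no full text as PDF file\.$"),
    "not_found": re.compile(r"^\s*(10\.\S+) does not exist$"),
    "html_error": re.compile(r"^Error loading HTML file: (\S+)$"),
}

def summarise_log(text):
    """Count how many log lines match each known message type."""
    counts = Counter()
    for line in text.splitlines():
        for label, pattern in PATTERNS.items():
            if pattern.match(line):
                counts[label] += 1
                break
    return counts

# Four lines copied from the log above.
sample = "\n".join([
    "\t10.1101/2024.03.25.586532 does not exist",
    "\t10.1107/s2059798324000986 no full text as PDF file.",
    "Error loading HTML file: /Users/gully.burns/alhazen/cryoet_tutorial/ft/10.1107/s2059798324000986.html",
    "\t10.1038/s41598-022-26760-4 no full text as PDF file.",
])
counts = summarise_log(sample)
print(dict(counts))
```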

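List the DOIs of the papers in a collection. The query below filters on collection id `'2'` (the `Machine Learning in CryoET` collection); change the id to `'3'` to list the randomly sampled collection instead.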

In [7]:
q = ldb.session.query(SKE.id) \
        .filter(SKC.id==SKC_HM.ScientificKnowledgeCollection_id) \
        .filter(SKC_HM.has_members_id==SKE.id) \
        .filter(SKC.id=='2')  
dois = [e.id for e in q.all()]
dois


['doi:10.1002/1873-3468.14067',
 'doi:10.1002/2211-5463.13206',
 'doi:10.1002/2211-5463.13473',
 'doi:10.1002/bies.202200060',
 'doi:10.1002/cnr2.1185',
 'doi:10.1002/jev2.12224',
 'doi:10.1002/mrd.23346',
 'doi:10.1002/pld3.271',
 'doi:10.1002/pmic.202000198',
 'doi:10.1002/pro.2504',
 'doi:10.1002/pro.3018',
 'doi:10.1002/pro.3513',
 'doi:10.1002/pro.3710',
 'doi:10.1002/pro.3805',
 'doi:10.1002/pro.3967',
 'doi:10.1002/pro.4038',
 'doi:10.1002/pro.4191',
 'doi:10.1002/pro.4472',
 'doi:10.1002/pro.4792',
 'doi:10.1002/smll.202301838',
 'doi:10.1007/82_2021_243',
 'doi:10.1007/978-3-030-57821-3_8',
 'doi:10.1007/s00418-023-02191-8',
 'doi:10.1007/s10278-018-0101-z',
 'doi:10.1007/s10974-019-09537-7',
 'doi:10.1007/s11274-024-03891-6',
 'doi:10.1007/s11426-022-1408-5',
 'doi:10.1007/s12154-009-0033-7',
 'doi:10.1007/s12551-022-01013-w',
 'doi:10.1007/s12551-023-01049-6',
 'doi:10.1007/s13238-021-00895-y',
 'doi:10.1007/s40484-019-0191-8',
 'doi:10.1007/s41048-017-0040-0',
 'doi:10.1007

Iterate over those DOIs and extract the metadata variables defined by the questions in `./local_resources/prompt_elements/metadata_extraction.yaml` (the output table below contains 16 such columns).

In [8]:
# Get the metadata extraction tool
t2 = [t for t in cb.tk.get_tools() if isinstance(t, MetadataExtraction_MethodsSectionOnly_Tool)][0]

# Load any metadata that was already extracted for these papers into a dataframe
df = pd.DataFrame()
for d in dois:
    l = t2.read_metadata_extraction_notes(d, 'cryoet', 'test')
    df = pd.concat([df, pd.DataFrame(l)])

# Run the metadata extraction tool over the papers that have not been processed yet
for d in dois:

    # Skip if metadata for this doi is already in the database
    if len(df)>0 and d in df.doi.unique():
        continue

    # Run the metadata extraction tool on the doi
    t2.run(tool_input={'paper_id': d, 'extraction_type': 'cryoet', 'run_label': 'test'})

    # Add the results to the dataframe
    l2 = t2.read_metadata_extraction_notes(d, 'cryoet', 'test')
    df = pd.concat([df, pd.DataFrame(l2)])

df

[32;1m[1;3m[chain/start][0m [1m[1:chain:RunnableSequence] Entering Chain run with input:
[0m{
  "section_text": "Materials and Methods\n\nModeling Paradigm.\nOur approach was to dramatically speed up the sampling of the intermolecular energy landscape by skipping the low-probability (high-energy) states, focusing only on the set of high-probability (low-energy) states corresponding to the energy minima. The \"minima hopping\" paradigm has been widely used since the early days of molecular modeling for the sampling of the energy landscapes of biomolecules, such as conformational analysis of biopolymers (<<REF:Galaktionov-1988-33-556>>), rotamer libraries (<<REF:Kirys-2012-80-2089>>), and refinement of protein-protein interfaces (<<REF:Dauzhenka-2018-39-2012>>), providing extraordinary savings of computing time by avoiding travel in low-probability areas of the landscape. Markov state models have been used to study protein folding, dynamics (<<REF:Shukla-2015-48-414>>), and associat

index,sample_type,sample_preparation_type,sample_preparation_buffer_ph,grid_vitrification_cryogen,sample_preparation_cryo_protectant,grid_model,grid_material,grid_mesh,grid_support_topology,grid_vit_ctemp,grid_vit_chumid,organism_name,tissue,cell_type,cell_strain,cell_component,doi,extraction_type,run_label
0,not present,Unknown,not present,not present,not present,not present,not present,not present,not present,not present,not present,,,Unknown,Unknown,,doi:10.1073/pnas.2210249119,cryoet,test
0,"[micro-organism, virus, other]",Unknown,Unknown,"[ETHANE, NITROGEN]",glucose,Unknown,GOLD,Unknown,HOLEY,Unknown,Unknown,,,Unknown,Unknown,,doi:10.1093/jmicro/dfab041,cryoet,test
0,micro-organism,subtomogram averaging,Unknown,[ETHANE],,Unknown,GOLD,Unknown,HOLEY,Unknown,Unknown,Tetrahymena thermophila,,,Unknown,,doi:10.1101/2023.11.28.569001,cryoet,test
0,micro-organism,tomography,6.8,[ETHANE],,Quantifoil R2/1,COPPER,Unknown,HOLEY,30 degreesC,95%,,,HEK293T,Unknown,,doi:10.1101/2024.03.25.586532,cryoet,test
0,not present,not present,Unknown,not present,,Unknown,not present,Unknown,Unknown,Unknown,Unknown,,,Unknown,Unknown,,doi:10.1371/journal.pbio.3001601,cryoet,test
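Many cells in this table hold `Unknown`, `not present`, or nothing at all, so a per-column fill rate is a simple way to see which variables the extraction recovers. The sketch below computes it with the standard library on three columns copied from three of the rows above. Treating those three values as missing is our assumption, and `is_missing` and `fill_rate` are hypothetical helpers.

```python
# Values treated as "no answer" (an assumption based on the table above).
MISSING = {"", "unknown", "not present"}

def is_missing(value):
    """True when a cell carries no extracted information."""
    return value is None or str(value).strip().lower() in MISSING

def fill_rate(rows):
    """Fraction of rows with a usable value, per column."""
    columns = rows[0].keys()
    return {c: sum(not is_missing(r.get(c)) for r in rows) / len(rows) for c in columns}

# Three columns copied from three of the rows shown above.
rows = [
    {"grid_material": "GOLD", "grid_model": "Unknown", "sample_preparation_buffer_ph": "Unknown"},
    {"grid_material": "COPPER", "grid_model": "Quantifoil R2/1", "sample_preparation_buffer_ph": "6.8"},
    {"grid_material": "not present", "grid_model": "not present", "sample_preparation_buffer_ph": "not present"},
]

rates = fill_rate(rows)
for column, rate in rates.items():
    print(f"{column}: {rate:.2f}")
```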


In [10]:
ldb.create_zip_archive_of_full_text_files('2', loc+'/'+db_name+'/full_text_files.zip')


In [9]:
q3 = ldb.session.query(SKE.id, N.name, N.provenance, N.content) \
        .filter(N.id == NIA.Note_id) \
        .filter(NIA.is_about_id == SKE.id) \
        .filter(N.type == 'MetadataExtractionNote') 

# Collect one dictionary of extracted values per note
l = []
for paper, name, provenance, content in q3.all():
    kv = dict(json.loads(content))
    kv['DOI'] = paper
    kv['run'] = name
    l.append(kv)

# Create a dataframe from the list of dictionaries, indexed by DOI and run
if len(l)>0:
    df = pd.DataFrame(l).set_index(['DOI', 'run'])
else: 
    df = pd.DataFrame()
df 

DOI,run,sample_type,sample_preparation_type,sample_preparation_buffer_ph,grid_vitrification_cryogen,sample_preparation_cryo_protectant,grid_model,grid_material,grid_mesh,grid_support_topology,grid_vit_ctemp,grid_vit_chumid,organism_name,tissue,cell_type,cell_strain,cell_component
doi:10.1073/pnas.2210249119,cryoet__doi:10.1073/pnas.2210249119__test,not present,Unknown,not present,not present,not present,not present,not present,not present,not present,not present,not present,,,Unknown,Unknown,
doi:10.1093/jmicro/dfab041,cryoet__doi:10.1093/jmicro/dfab041__test,"[micro-organism, virus, other]",Unknown,Unknown,"[ETHANE, NITROGEN]",glucose,Unknown,GOLD,Unknown,HOLEY,Unknown,Unknown,,,Unknown,Unknown,
doi:10.1101/2023.11.28.569001,cryoet__doi:10.1101/2023.11.28.569001__test,micro-organism,subtomogram averaging,Unknown,[ETHANE],,Unknown,GOLD,Unknown,HOLEY,Unknown,Unknown,Tetrahymena thermophila,,,Unknown,
doi:10.1101/2024.03.25.586532,cryoet__doi:10.1101/2024.03.25.586532__test,micro-organism,tomography,6.8,[ETHANE],,Quantifoil R2/1,COPPER,Unknown,HOLEY,30 degreesC,95%,,,HEK293T,Unknown,
doi:10.1371/journal.pbio.3001601,cryoet__doi:10.1371/journal.pbio.3001601__test,not present,not present,Unknown,not present,,Unknown,not present,Unknown,Unknown,Unknown,Unknown,,,Unknown,Unknown,
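The `run` values above appear to follow the layout `<extraction_type>__<doi>__<run_label>`. This is inferred from the table rather than taken from Alhazen's documentation. Under that assumption, the sketch below splits a run name back into its parts, which can help when comparing several runs. `parse_run_name` is a hypothetical helper.

```python
def parse_run_name(name):
    """Split '<extraction_type>__<doi>__<run_label>' into its three parts."""
    # Split once from each end so the DOI in the middle is left untouched.
    extraction_type, rest = name.split("__", 1)
    doi, run_label = rest.rsplit("__", 1)
    return {"extraction_type": extraction_type, "doi": doi, "run_label": run_label}

# Run name copied from the first row of the table above.
parts = parse_run_name("cryoet__doi:10.1073/pnas.2210249119__test")
print(parts)
```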


In [11]:
# USE WITH CAUTION - this deletes every note attached to any paper that has a
# `MetadataExtractionNote` (the papers returned by `q3` above)
if OK_TO_DELETE:
    for row in q3.all():
        d_id = row[0]
        e = ldb.session.query(SKE).filter(SKE.id==d_id).first()
        notes_to_delete = []
        for n in ldb.read_notes_about_x(e):
            notes_to_delete.append(n.id)
        for n in notes_to_delete:
            ldb.delete_note(n)
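Before running a destructive cell like the ones guarded by `OK_TO_DELETE`, it can be reassuring to see how many notes would be removed. The sketch below shows a dry-run pattern for this. `delete_notes` is a hypothetical helper and the note ids are made up. In the notebook, `delete_fn` would be `ldb.delete_note`.

```python
def delete_notes(note_ids, delete_fn, ok_to_delete=False):
    """Delete notes only when `ok_to_delete` is True; otherwise just report the count."""
    if not ok_to_delete:
        print(f"Dry run: {len(note_ids)} notes would be deleted")
        return 0
    for note_id in note_ids:
        delete_fn(note_id)
    return len(note_ids)

# Stand-in for the database: record which ids were "deleted".
deleted = []
n_dry = delete_notes(["note-1", "note-2"], deleted.append)
n_real = delete_notes(["note-1", "note-2"], deleted.append, ok_to_delete=True)
print(n_dry, n_real, deleted)
```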