# Imaging Technology Innovation Stages  

> Methods to extract and model how imaging technology evolves.

## Modeling Technological Evolution and Innovation

This notebook builds a digital library of publications drawn from four subdisciplines of biomedical imaging:

1. Cryo-Electron Tomography
2. Volume Electron Microscopy
3. Hierarchical Phase-Contrast Tomography
4. Photoacoustic Imaging


### Python Imports

Python imports, environment variables, and other essential setup parameters are defined here.

In [2]:
from alhazen.apps.chat import  AlhazenAgentChatBot
from alhazen.core import get_langchain_chatmodel, MODEL_TYPE
from alhazen.schema_sqla import *
from alhazen.tools.basic import AddCollectionFromEPMCTool, DeleteCollectionTool
from alhazen.tools.paperqa_emulation_tool import PaperQAEmulationTool
from alhazen.tools.metadata_extraction_tool import * 
from alhazen.tools.protocol_extraction_tool import *
from alhazen.toolkit import *
from alhazen.utils.jats_text_extractor import NxmlDoc
from alhazen.utils.ceifns_db import Ceifns_LiteratureDb, create_ceifns_database, drop_ceifns_database, restore_ceifns_database
from alhazen.utils.searchEngineUtils import *

from langchain.callbacks.tracers import ConsoleCallbackHandler
from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.pgvector import PGVector
from langchain_community.chat_models.ollama import ChatOllama
from langchain_google_vertexai import ChatVertexAI
from langchain_openai import ChatOpenAI

from bs4 import BeautifulSoup,Tag,Comment,NavigableString
from databricks import sql
from datetime import datetime
from importlib_resources import files
import json
import os
import pandas as pd
from pathlib import Path
import re
import requests

from sqlalchemy import create_engine, exists, func, or_, and_, not_, desc, asc
from sqlalchemy.orm import sessionmaker, aliased

from time import time,sleep
from tqdm import tqdm
from urllib.request import urlopen
from urllib.parse import quote_plus, quote, unquote
from urllib.error import URLError, HTTPError
import yaml

In [3]:
# Short aliases for the schema classes keep SQLAlchemy queries concise
IR = aliased(InformationResource)

SKC = aliased(ScientificKnowledgeCollection)
SKC_HM = aliased(ScientificKnowledgeCollectionHasMembers)
SKE = aliased(ScientificKnowledgeExpression)
SKE_XREF = aliased(ScientificKnowledgeExpressionXref)
SKE_IRI = aliased(ScientificKnowledgeExpressionIri)
SKE_HR = aliased(ScientificKnowledgeExpressionHasRepresentation)
SKE_MO = aliased(ScientificKnowledgeExpressionMemberOf)
SKI = aliased(ScientificKnowledgeItem)
SKI_HP = aliased(ScientificKnowledgeItemHasPart)
SKF = aliased(ScientificKnowledgeFragment)

N = aliased(Note)
NIA = aliased(NoteIsAbout)
SKC_HN = aliased(ScientificKnowledgeCollectionHasNotes)
SKE_HN = aliased(ScientificKnowledgeExpressionHasNotes)
SKI_HN = aliased(ScientificKnowledgeItemHasNotes)
SKF_HN = aliased(ScientificKnowledgeFragmentHasNotes)

### Environment Variables

Set the following environment variables before running this code:

* `ALHAZEN_DB_NAME` - the name of the PostgreSQL database where information is stored
* `LOCAL_FILE_PATH` - the directory on disk for temporary files, downloaded models, and other data

In [4]:
os.environ['ALHAZEN_DB_NAME'] = 'imaging_tech_innovation'
os.environ['LOCAL_FILE_PATH'] = '/users/gully.burns/alhazen/'

In [5]:
if os.environ.get('ALHAZEN_DB_NAME') is None: 
    raise Exception('Which database do you want to use for this application?')
db_name = os.environ['ALHAZEN_DB_NAME']

if os.environ.get('LOCAL_FILE_PATH') is None: 
    raise Exception('Where are you storing your local literature database?')
loc = os.environ['LOCAL_FILE_PATH']

# Create the local working directory only after confirming the variable is set
if not os.path.exists(loc):
    os.makedirs(loc)
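
The same checks can be wrapped in a small reusable helper. The following is a minimal sketch, not part of Alhazen: `require_env` and `resolve_settings` are hypothetical names introduced here for illustration, and only the standard library is used.

```python
import os
from pathlib import Path
from typing import Tuple


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise EnvironmentError(f"Environment variable {name} is not set")
    return value


def resolve_settings() -> Tuple[str, Path]:
    """Validate both variables first, then create the working directory if needed."""
    db_name = require_env("ALHAZEN_DB_NAME")
    loc = Path(require_env("LOCAL_FILE_PATH")).expanduser()
    # mkdir with exist_ok=True is a no-op when the directory already exists
    loc.mkdir(parents=True, exist_ok=True)
    return db_name, loc
```

Validating before touching the filesystem means a missing variable produces a descriptive error rather than a `KeyError` from `os.environ`.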

### Setup utils, agents, and tools 

In [6]:
ldb = Ceifns_LiteratureDb(loc=loc, name=db_name)
llm = ChatOllama(model='mixtral:instruct') 
llm2 = ChatOpenAI(model='gpt-4-1106-preview') 

cb = AlhazenAgentChatBot()
print('AGENT TOOLS')
for t in cb.tk.get_tools():
    print('\t'+type(t).__name__)

AGENT TOOLS
	AddCollectionFromEPMCTool
	AddAuthorsToCollectionTool
	DescribeCollectionCompositionTool
	DeleteCollectionTool
	RetrieveFullTextTool
	RetrieveFullTextToolForACollection
	MetadataExtraction_EverythingEverywhere_Tool
	SimpleExtractionWithRAGTool
	PaperQAEmulationTool
	ProcotolExtractionTool
	CheckExpressionTool
	IntrospectionTool


## Building the database


### Scripts to Build / Delete the database

If you need to restore a deleted database from backup, use the following shell commands:

```
$ createdb em_tech
$ psql -d em_tech -f /local/file/path/em_tech/backup<date_time>.sql
```
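
When several backups exist, it can help to pick the most recent one programmatically before passing it to `psql`. The sketch below assumes backup files are named `backup<date_time>.sql` with a `YYYY-MM-DD-HH-MM-SS` timestamp, as in the backup path this notebook prints when a database is dropped. `latest_backup` is a hypothetical helper, not an Alhazen function.

```python
import re
from datetime import datetime
from pathlib import Path
from typing import Optional

# Assumed filename pattern: backup<YYYY-MM-DD-HH-MM-SS>.sql
BACKUP_RE = re.compile(r"^backup(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})\.sql$")


def latest_backup(backup_dir) -> Optional[Path]:
    """Return the newest backup in backup_dir, judged by the filename timestamp."""
    dated = []
    for path in Path(backup_dir).glob("backup*.sql"):
        match = BACKUP_RE.match(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d-%H-%M-%S")
        dated.append((stamp, path))
    if not dated:
        return None
    # Compare on the parsed timestamp only, then return the matching path
    return max(dated, key=lambda pair: pair[0])[1]
```

The function returns `None` when the directory holds no matching files, so the caller can decide whether that is an error.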

In [59]:
drop_ceifns_database(os.environ['ALHAZEN_DB_NAME'])

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Database has been backed up to /users/gully.burns/alhazen/imaging_tech_innovation/backup2024-03-04-23-34-21.sql
Database has been dropped successfully !!


In [60]:
create_ceifns_database(os.environ['ALHAZEN_DB_NAME'])

100%|██████████| 310/310 [00:00<00:00, 3467.45it/s]


### Build CEIFNS database from queries

#### Run queries on European PMC based on innovation categories 

Here we build general corpora across the categories of interest. 

* Hierarchical phase-contrast tomography
* Cryo-Electron Tomography
* Volume Electron Microscopy
* Photoacoustic imaging

In [7]:
import local_resources.queries.imaging_tech as imaging_tech
from alhazen.utils.queryTranslator import QueryTranslator, QueryType

cols_to_include = ['ID', 'CORPUS_NAME', 'QUERY']
df = pd.read_csv(files(imaging_tech).joinpath('imaging_tech.tsv'), sep='\t', )
df = df.drop(columns=[c for c in df.columns if c not in cols_to_include])
df

,ID,CORPUS_NAME,QUERY
0,1,Hierarchical phase-contrast tomography,Hierarchical phase-contrast tomography | HIP-C...
1,2,Cryo-Electron Tomography,Cryoelectron Tomography | Cryo Electron Tomogr...
2,3,Volume Electron Microscopy,Volume Electron Microscopy | Volume EM | (seri...
3,4,Photoacoustic imaging,Photoacoustic imaging | Photoacoustic microscopy


In [62]:
qt = QueryTranslator(df.sort_values('ID'), 'ID', 'QUERY', 'CORPUS_NAME')
(corpus_ids, epmc_queries) = qt.generate_queries(QueryType.epmc, sections=['TITLE_ABS', 'METHODS'])
corpus_names = df.sort_values('ID')['CORPUS_NAME']  # keep names in the same ID order as the queries

addEMPCCollection_tool = [t for t in cb.tk.get_tools() if isinstance(t, AddCollectionFromEPMCTool)][0]
for (id, name, query) in zip(corpus_ids, corpus_names, epmc_queries):
    addEMPCCollection_tool.run(tool_input={'id': id, 'name':name, 'query':query, 'full_text':False})

100%|██████████| 4/4 [00:00<00:00, 7533.55it/s]
100%|██████████| 4/4 [00:00<00:00, 3442.19it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"Hierarchical phase-contrast tomography" OR METHODS:"Hierarchical phase-contrast tomography") OR (TITLE_ABS:"HIP-CT" OR METHODS:"HIP-CT") OR (TITLE_ABS:"Hierarchical phase contrast tomography" OR METHODS:"Hierarchical phase contrast tomography")), 143 European PMC PAPERS FOUND


100%|██████████| 1/1 [00:03<00:00,  3.54s/it]


 Returning 135


100%|██████████| 135/135 [00:00<00:00, 507.06it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"Cryoelectron Tomography" OR METHODS:"Cryoelectron Tomography") OR (TITLE_ABS:"Cryo Electron Tomography" OR METHODS:"Cryo Electron Tomography") OR (TITLE_ABS:"Cryo-Electron Tomography" OR METHODS:"Cryo-Electron Tomography") OR (TITLE_ABS:"Cryo-ET" OR METHODS:"Cryo-ET") OR (TITLE_ABS:"CryoET" OR METHODS:"CryoET")), 2581 European PMC PAPERS FOUND


100%|██████████| 3/3 [00:55<00:00, 18.45s/it]


 Returning 2558


100%|██████████| 2558/2558 [00:06<00:00, 375.25it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"Volume Electron Microscopy" OR METHODS:"Volume Electron Microscopy") OR (TITLE_ABS:"Volume EM" OR METHODS:"Volume EM") OR (TITLE_ABS:"multibeam SEM" OR METHODS:"multibeam SEM") OR (TITLE_ABS:"FAST-SEM" OR METHODS:"FAST-SEM") OR ((TITLE_ABS:"serial section" OR METHODS:"serial section") AND ((TITLE_ABS:"electron microscopy" OR METHODS:"electron microscopy") OR (TITLE_ABS:"EM" OR METHODS:"EM") OR (TITLE_ABS:"transmission electron microscopy" OR METHODS:"transmission electron microscopy") OR (TITLE_ABS:"TEM" OR METHODS:"TEM") OR (TITLE_ABS:"scanning electron microscopy" OR METHODS:"scanning electron microscopy") OR (TITLE_ABS:"SEM" OR METHODS:"SEM") OR (TITLE_ABS:"electron tomography" OR METHODS:"electron tomography"))) OR ((TITLE_ABS:"serial block-face" OR METHODS:"serial block-face") AND ((TITLE_ABS:"scanning electron microscopy" OR METHODS:"scanning electron 

100%|██████████| 7/7 [02:28<00:00, 21.18s/it]


 Returning 6820


100%|██████████| 6820/6820 [00:41<00:00, 164.39it/s]


https://www.ebi.ac.uk/europepmc/webservices/rest/search?format=JSON&pageSize=1000&synonym=TRUE&resultType=core&query=((TITLE_ABS:"Photoacoustic imaging" OR METHODS:"Photoacoustic imaging") OR (TITLE_ABS:"Photoacoustic microscopy" OR METHODS:"Photoacoustic microscopy")), 4600 European PMC PAPERS FOUND


100%|██████████| 5/5 [00:59<00:00, 11.84s/it]


 Returning 4478


100%|██████████| 4478/4478 [00:17<00:00, 257.83it/s]
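
The logged URLs show how each pipe-separated query is expanded into a fielded European PMC query. The sketch below reproduces that expansion for flat `term | term` queries only. It is an illustration inferred from the logged output, not the actual `QueryTranslator` implementation, and `expand_or_query` is a hypothetical name.

```python
from typing import List, Sequence


def expand_or_query(query: str, sections: Sequence[str]) -> str:
    """Expand 'term A | term B' into a fielded European PMC boolean query."""
    # Split on the pipe separator and drop surrounding whitespace
    terms = [t.strip() for t in query.split("|") if t.strip()]
    clauses: List[str] = []
    for term in terms:
        # Search the same quoted phrase in every requested section
        fielded = " OR ".join(f'{section}:"{term}"' for section in sections)
        clauses.append(f"({fielded})")
    return "(" + " OR ".join(clauses) + ")"


print(expand_or_query(
    "Photoacoustic imaging | Photoacoustic microscopy",
    ["TITLE_ABS", "METHODS"],
))
```

For the photoacoustic row this prints the same `query=` string that appears in the logged URL. Parenthesised sub-expressions and `AND` clauses, such as those in the Volume Electron Microscopy query, are not handled by this sketch.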


#### Run queries on known lists of papers from CZI grantees on the four imaging innovation categories 

Here we search pre-compiled lists of papers from CZI grantees' work, indexed in a local file: `./local_resources/queries/imaging_tech/grantee_dois.json`

In [15]:
with open(files(imaging_tech).joinpath('grantee_dois.json'), 'r') as f:
    dict_lists = json.load(f)

addEMPCCollection_tool = [t for t in cb.tk.get_tools() if isinstance(t, AddCollectionFromEPMCTool)][0]
for i, k in enumerate(dict_lists.keys()):
    query = ' OR '.join(['doi:"'+d_id+'"' for d_id in dict_lists[k] ])
    print('%s: Searching for %d'%(k, len(dict_lists[k])))
    addEMPCCollection_tool.run(tool_input={'id': str(5+i), 'name': k + ' (grantees)', 'query':query})


Cryo-Electron Tomography: Searching for 23
Volume Electron Microscopy: Searching for 12
Hierarchical phase-contrast tomography: Searching for 14
Photoacoustic imaging: Searching for 26


## Analyze Collections

In [13]:
q = ldb.session.query(SKC.id, SKC.name, SKE.id, SKI.type) \
        .filter(SKC.id==SKC_HM.ScientificKnowledgeCollection_id) \
        .filter(SKC_HM.has_members_id==SKE.id) \
        .filter(SKE.id==SKE_HR.ScientificKnowledgeExpression_id) \
        .filter(SKE_HR.has_representation_id==SKI.id) 
df = pd.DataFrame(q.all(), columns=['id', 'collection name', 'doi', 'item type'])    
df.pivot_table(index=['id', 'collection name'], columns='item type', values='doi', aggfunc=lambda x: len(x.unique()))

id,collection name,CitationRecord
1,Hierarchical phase-contrast tomography,135
2,Cryo-Electron Tomography,2556
3,Volume Electron Microscopy,6817
4,Photoacoustic imaging,4477
5,Cryo-Electron Tomography (grantees),20
6,Volume Electron Microscopy (grantees),11
7,Hierarchical phase-contrast tomography (grantees),12
8,Photoacoustic imaging (grantees),23
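
The grantee collections contain fewer records than the number of DOIs searched for. A quick way to quantify the gap is sketched below, using only the counts printed in this notebook. `coverage_report` is a hypothetical helper, and the sketch makes no claim about why records are missing (for example, DOIs not indexed in European PMC, or duplicates).

```python
# Counts copied from this notebook's output: DOIs searched per grantee list,
# and distinct expressions found in the matching '(grantees)' collection.
searched = {
    "Cryo-Electron Tomography": 23,
    "Volume Electron Microscopy": 12,
    "Hierarchical phase-contrast tomography": 14,
    "Photoacoustic imaging": 26,
}
retrieved = {
    "Cryo-Electron Tomography": 20,
    "Volume Electron Microscopy": 11,
    "Hierarchical phase-contrast tomography": 12,
    "Photoacoustic imaging": 23,
}


def coverage_report(searched, retrieved):
    """Return (name, searched, found, missing) tuples, one per collection."""
    rows = []
    for name, n_searched in searched.items():
        n_found = retrieved.get(name, 0)
        rows.append((name, n_searched, n_found, n_searched - n_found))
    return rows


for name, n_searched, n_found, n_missing in coverage_report(searched, retrieved):
    print(f"{name}: {n_found}/{n_searched} found, {n_missing} missing")
```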
