In [None]:
%reload_ext openad.notebooks.styles

<!-- Header banner -->
<div class="banner"><div>Working with the Deep Search Plugin</div><b>OpenAD <span>Tutorial</span></b></div>

### Table of Contents

1. [Getting Started](#Getting-Started)
    1. [Installation](#Installation)
    1. [Authentication](#Authentication)
    1. [Magic Commands](#Magic-Commands)
    1. [About Deep Search](#About-Deep-Search)
    1. [Command Documentation](#Command-Documentation)
1. [Searching for Molecules](#Searching-for-Molecules)
    1. [Similar Molecules](#Similar-Molecules)
    1. [By Substructure](#By-Substructure)
    1. [Across Patents](#Across-Patents)
1. [Exploring Collections](#Exploring-Collections)
    1. [Overview of Collections](#Overview-of-Collections)
    1. [Find Collections by Domain](#Find-Collections-by-Domain)
    1. [Find Collections by Content](#Find-Collections-by-Content)
1. [Searching a Collection](#Searching-a-Collection)
    1. [Search Parameters](#Search-Parameters)
    1. [Example A: Search for "Ibuprofen" on PubChem](#Example-A:-Search-for-"Ibuprofen"-on-PubChem)
    1. [Example B: Query arXiv for "power conversion efficiency"](#Example-B:-Query-arXiv-for-"power-conversion-efficiency")
1. [Enriching your Molecules with Deep Search Results](#Enriching-your-Molecules-with-Deep-Search-Results)

## Getting Started

### Installation
If you haven't already, you can install the plugin directly from its [GitHub repo](https://github.com/acceleratedscience/openad-plugin-ds#readme).
    
    pip install git+https://github.com/acceleratedscience/openad-plugin-ds

### Authentication
Before you can work with the Deep Search plugin, you'll need to sign up for a Deep Search account and create an API key.<br>We have [detailed instructions in the plugin's README](https://github.com/acceleratedscience/openad-plugin-ds#authentication). 

### Magic Commands
Magic commands let you interact with the OpenAD shell.
1. `%openad` - Display results directly in your notebook<br>
2. `%openadd` - Store the returned data in a variable

To learn more, check the [OpenAD intro to magic commands](https://github.com/acceleratedscience/openad-toolkit/blob/main/openad/notebooks/magic_commands.ipynb).

### About Deep Search
To learn about what this plugin does and list all available commands, run:

    ds

In [None]:
%openad ds

### Command Documentation

Every command has detailed documentation where you can find everything you need to know, including optional parameters and examples.

To see the documentation of a command, just run the beginning of the command followed by a question mark.

In [None]:
%openad ds reset ?

## Searching for Molecules

### Similar Molecules

    ds search for molecules similar to <smiles>

In [None]:
%openad ds search for molecules similar to 'CC(C)(c1ccccn1)C(CC(=O)O)Nc1nc(-c2c[nH]c3ncc(Cl)cc23)c(C#N)cc1F'

### By Substructure

    ds search for molecules with substructure <smiles>

In [None]:
%openad ds search for molecules with substructure 'C1(C(=C)C([O-])C1C)=O'

### Across Patents

#### From a List

    ds search for molecules in patents from list ['<patent_id>','<patent_id>',...]

In [None]:
# Basic example
%openad ds search for molecules in patents from list ['CN108473493B','US20190023713A1']

In [None]:
# Practical example
from IPython.display import display, HTML

# 1) Find patents containing a certain molecule
smiles = 'CC(C)(c1ccccn1)C(CC(=O)O)Nc1nc(-c2c[nH]c3ncc(Cl)cc23)c(C#N)cc1F'
patents = None
patents = %openadd ds search for patents containing molecule {smiles}
patents

# 2) Search for other molecules in these patents
if patents is not None:
    patent_ids = list(patents["publication_id"])
    %openad ds search for molecules in patents from list {patent_ids}
else:
    display(HTML(f'<span style="color:#d00">Something went wrong finding patents containing {smiles}</span>'))
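
The `publication_id` column may list the same patent more than once (an assumption; this depends on what the search returns). A minimal sketch of cleaning the ids before interpolating them into the command, where `unique_patent_ids` is a hypothetical helper and not part of the plugin:

```python
def unique_patent_ids(ids):
    """Return patent ids with blanks and duplicates removed, keeping first-seen order."""
    cleaned = (str(pid).strip() for pid in ids)
    # dict preserves insertion order, so fromkeys() de-duplicates in order
    return list(dict.fromkeys(pid for pid in cleaned if pid))


patent_ids = unique_patent_ids(['CN108473493B', 'US20190023713A1', 'CN108473493B', ' '])
print(patent_ids)
```

In the practical example above, this would replace `list(patents["publication_id"])` with `unique_patent_ids(patents["publication_id"])`.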

#### From a DataFrame

    ds search for molecules in patents from dataframe <dataframe_name>

Please note that your patent ids should be stored in a column named `patent id` for this command to work.

In [None]:
import pandas as pd

# Create a Pandas DataFrame with patent ids
patent_ids = ['CN108473493B','US20190023713A1']
df = pd.DataFrame(patent_ids, columns=['patent id'])

In [None]:
%openad ds search for molecules in patents from dataframe df

#### From a File

    ds search for molecules in patents from file '<filename.csv>'

Please note that your patent ids should be stored in a column named `patent id` for this command to work.

For the purpose of this demo, we'll store a .csv file with patent ids in your workspace.

In [None]:
# Prep
import pandas as pd
patent_ids = ['CN108473493B','US20190023713A1']
cmd_pointer = %openadd cmd_pointer
workspace_path = cmd_pointer.workspace_path()
csv_file_path = f'{workspace_path}/ds_demo_patents.csv'

# Store patent ids in a CSV file in your workspace
df = pd.DataFrame(patent_ids, columns=['patent id'])
df.to_csv(csv_file_path)

In [None]:
# Inspect the file we just created
%cat {csv_file_path}

In [None]:
%openad ds search for molecules in patents from file 'ds_demo_patents.csv'
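
If you'd rather not depend on pandas for this step, a file with the required `patent id` column can also be written with Python's standard `csv` module. The sketch below is only an illustration: `write_patent_csv` and `read_patent_csv` are hypothetical helpers, and a temporary directory stands in for your workspace path.

```python
import csv
import tempfile
from pathlib import Path


def write_patent_csv(path, patent_ids):
    """Write patent ids to a CSV file with a single `patent id` column."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['patent id'])  # Column name the command expects
        writer.writerows([pid] for pid in patent_ids)


def read_patent_csv(path):
    """Read the `patent id` column back, failing early if it is missing."""
    with open(path, newline='', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        if 'patent id' not in (reader.fieldnames or []):
            raise ValueError("CSV has no 'patent id' column")
        return [row['patent id'] for row in reader]


# Demo in a temporary directory (stand-in for your workspace path)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / 'ds_demo_patents.csv'
    write_patent_csv(demo_path, ['CN108473493B', 'US20190023713A1'])
    round_trip = read_patent_csv(demo_path)

print(round_trip)
```

Reading the file back before running the search is a cheap way to catch a misnamed column early.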

## Exploring Collections

Before you can search a collection, you'll need to know _what_ collections to search.

### Overview of Collections

    ds list all collections [ details ]

In [None]:
# Overview of all available collections
%openad ds list all collections

In [None]:
# Description of all available collections
%openad ds list all collections details

You can also request the description of a single collection.

    ds list collection details '<collection_name_or_key>'

In [None]:
%openad ds list collection details 'ipcc'

### Find Collections by Domain

If you are looking for collections within a certain domain, you can first list the available domains...

    ds list all domains
    
... and then list the collections for the domain(s) you want.

    ds list collections for domain '<domain_name>'
    ds list collections for domains ['<domain_name>','<domain_name>',...]

In [None]:
%openad ds list all domains

In [None]:
%openad ds list collections for domain 'Materials Science'

In [None]:
%openad ds list collections for domains ['Materials Science','Scientific Literature']

### Find Collections by Content

If you're still not sure what collection to search, you can find collections relevant to your topic.

    ds list collections containing '<search_query>'

In [None]:
%openad ds list collections containing '"carbon capture"'

## Searching a Collection

Deep Search allows you to search across a variety of collections, returning documents with snippets highlighting the data matching your search criteria.

    ds search collection '<collection_name_or_key>' for '<search_query>'

### Search Parameters
Because of the large number of parameters, it is recommended to start by looking at the available options, only some of which we'll cover here.

In [None]:
%openad ds search collection ?

### <span style="color: green">Example A:</span> Search for "_Ibuprofen_" on PubChem

In this basic example we'll search for all PubChem entries which contain the string _Ibuprofen_, then visualize all molecules that are listed.


In [None]:
# Search PubChem for mentions of Ibuprofen
%openad ds search collection 'pubchem' for 'Ibuprofen' show (data)

#### Visualizing Results

By using the `%openadd` magic command, we can store the results in a dataframe and manipulate them as we wish.

In [None]:
# Load results in a dataframe
pubchem_df = %openadd ds search collection 'pubchem' for 'Ibuprofen' show (data)

In [None]:
# Display the dataframe
pubchem_df

In [None]:
# Count the results
result_count = len(pubchem_df.index)
print(f'Your query returned {result_count} molecules:')

# List the returned smiles
smiles_list = pubchem_df['SMILES'].tolist()
for sm in smiles_list:
    print('- ' + sm)

In [None]:
# Load the results in your molecule working set and enrich them with PubChem data
%openad load molecules from dataframe pubchem_df enrich

In [None]:
# List the molecules in your working set
%openad list molecules

In [None]:
# Visualize the molecules in your working set
%openad show molecules

In [None]:
# Visualize a single molecule
%openad show molecule CC(C)C1CC2=C(C1)C=C(C=C2)C(C)C(=O)O

### <span style="color: green">Example B:</span> Query arXiv for "*power conversion efficiency*"

In this example we'll search for the input query in documents from the arXiv.org data collection. For each matched document we'll return the title, the authors, and the link to the original document on arXiv.org.

#### Getting the result estimate

Before launching our query, we can get an estimate of how many documents may be returned, so we can adjust the query as needed to get more or fewer results.

In [None]:
# Estimate results
%openad ds search collection 'arXiv abstracts' for 'ide("power conversion efficiency" OR PCE) AND organ*' show (docs) estimate only
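
When you adjust a query repeatedly, it can help to assemble it from parts rather than retyping it. The sketch below only reproduces the pattern visible in the example query (a quoted phrase, `OR`, `AND`, and a `*` wildcard); `any_of` and `all_of` are hypothetical helpers, and nothing here claims to cover the full Deep Search query syntax.

```python
def any_of(*terms):
    """Join terms with OR inside parentheses, quoting multi-word phrases."""
    quoted = [f'"{t}"' if ' ' in t else t for t in terms]
    return '(' + ' OR '.join(quoted) + ')'


def all_of(*parts):
    """Join query parts with AND."""
    return ' AND '.join(parts)


query = all_of(any_of('power conversion efficiency', 'PCE'), 'organ*')
print(query)
```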

For queries that would return more than 100 results, you get the option to abort, as fetching the results may take a considerable amount of time.

In [None]:
# An overly general query won't be executed unless you confirm
%openad ds search collection 'arXiv abstracts' for 'organ*' show (docs)

#### Displaying Results

Your results table will be displayed with the matching snippets highlighted.

In [None]:
%openad ds search collection 'arXiv abstracts' for 'ide("power conversion efficiency" OR PCE) AND organ* ' using (slop=5) show (data docs)

In [None]:
%openad ds search collection 'arXiv abstracts' for 'ide("power conversion efficiency" OR PCE) AND organ* ' using (slop=5) show (data)

In [None]:
%openad ds search collection 'arXiv abstracts' for 'ide("power conversion efficiency" OR PCE) AND organ* ' using (slop=5) show (docs)

#### Processing Results
Alternatively, you can use the `%openadd` magic command to store the results in a dataframe and process them as you wish.

In [None]:
arxiv_df = %openadd ds search collection 'arXiv abstracts' for 'ide("power conversion efficiency" OR PCE) AND organ* ' using (slop=5) show (data docs)

You can still display your data in a later step, but because we stored the raw data, the highlighting is no longer available.

<div class="alert alert-info"><b>Tip</b> By right-clicking the cell's output, you can enable or disable cell scrolling, so the data doesn't take up your entire notebook.</div>

In [None]:
arxiv_df

Now that the data is stored in a dataframe, you can process it however you wish.

In [None]:
# Count results
result_count = len(arxiv_df.index)

# Create set of all authors
authors_column = list(arxiv_df['Authors'])
authors = set()
for author_group in authors_column:
    paper_authors = author_group.split(',')
    for a in paper_authors:
        authors.add(a.strip())

# Sort by last name
authors = list(authors)
authors_sorted = []
for a in authors:
    a = a.split(' ')
    a = a[-1] + ', ' + ' '.join(a[:-1])
    authors_sorted.append(a)
authors_sorted.sort()
    
# Print result
title = f'There are {result_count} results by {len(authors_sorted)} authors:'
print(title)
print(len(title) * '-' + '\n')
for i, a in enumerate(authors_sorted, start=1):
    print(f'{i:>3}. {a}')
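
As another example of processing the results, you could count how many papers each author appears on. This is a minimal sketch using only the standard library. It assumes, as in the cell above, that each entry of the `Authors` column is a comma-separated string. `papers_per_author` is a hypothetical helper and the sample rows are made-up placeholders.

```python
from collections import Counter


def papers_per_author(author_groups):
    """Count in how many author strings (i.e. papers) each name appears."""
    counts = Counter()
    for group in author_groups:
        # A set avoids counting a name twice within the same paper
        names = {name.strip() for name in group.split(',') if name.strip()}
        counts.update(names)
    return counts


# Made-up sample rows; with real results use list(arxiv_df['Authors'])
sample = ['A. Author, B. Writer', 'B. Writer, C. Scholar', 'B. Writer']
for name, n in papers_per_author(sample).most_common(1):
    print(f'{n} paper(s): {name}')
```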

## Enriching your Molecules with Deep Search Results

After running a Deep Search query, you can add the results to the related molecules in your molecule working set.

    enrich molecules with analysis

Commands that are supported by this functionality:
- `ds search for molecules similar to <smiles>`
- `ds search for patents containing molecule <smiles>`

In [None]:
# Clear any previously stored results
%openad clear analysis cache

# Empty your molecule working set
%openad clear mols

In [None]:
# Run similarity search (using %openadd to skip the printout)
smiles = 'CC(C)(c1ccccn1)C(CC(=O)O)Nc1nc(-c2c[nH]c3ncc(Cl)cc23)c(C#N)cc1F'
_ = %openadd ds search for molecules similar to {smiles}

# Add the relevant molecule to your molecule working set (MWS)
%openad add molecule {smiles}

# Enrich the MWS with the Deep Search result
%openad enrich molecules with analysis

# Display the molecule to see the result (scroll down to analysis).
# From here you can export the molecule to a new file.
%openad show molecule {smiles}