In [2]:
from utils import find_similar_docs, search_fuzzy, search_match_phrase, format_search_output
import pandas as pd 

In [14]:
INDEX_NAME = "arxiv-cosine" 
query = "what are fossil fuel s?"

In [15]:
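# Keep only tokens longer than two characters; this also drops the stray "s?" left by the typo in the query.
query_keywords = [q for q in query.split(" ") if len(q) > 2]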
query_keywords

['what', 'are', 'fossil', 'fuel']

## Retrieve lexical search results on OpenSearch index

### Run exact match phrase search

In [16]:
# Run one phrase search per keyword and stack the formatted hits into a single DataFrame.
lexical_frames = []
for q in query_keywords:
    out_shard = search_match_phrase(field='text', query=q, index_name=INDEX_NAME)
    lexical_frames.append(format_search_output(out_shard))
lexical_df = pd.concat(lexical_frames, axis=0)

Searching for `what` in the field `text`
Searching for `are` in the field `text`
Searching for `fossil` in the field `text`
Searching for `fuel` in the field `text`


In [17]:
lexical_df.head()

| | score | abstract | title | arxiv_id | embeddings |
|---|---|---|---|---|---|
| 0 | 7.011356 | During the last three decades, evidence has ... | Cool Stars in Hot Places | 704.1045 | [-1.1015625, -0.2487793, -0.74072266, 0.734863... |
| 1 | 6.443816 | The amount $Q$ of particles that are transpo... | Counting statistics in multiple path geometrie... | 704.3506 | [-1.2714844, -0.23791504, -2.4316406, 0.610351... |
| 2 | 6.321819 | The Large Hadron Collider, a 7 + 7 TeV proto... | Higgs Bosons, Electroweak Symmetry Breaking, a... | 704.2045 | [-0.08380127, -3.2636719, -1.5712891, -0.45947... |
| 3 | 6.000948 | We present recent measurements of B and B^0_... | Y(5S): What has been learned and what can be l... | 705.0342 | [0.53466797, -2.8886719, 0.4975586, 2.9589844,... |
| 4 | 5.982053 | Category theory has foundational importance ... | Adjoint Functors and Heteromorphisms | 704.2207 | [-1.0878906, -1.5673828, 1.0390625, 2.1894531,... |
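
Because the per-keyword results are concatenated in query order, the first rows shown by `head()` most likely all come from the first keyword (`what`) rather than from a ranking across all keywords. One possible way to build a single ranking is sketched below, assuming each hit is a plain dict with `arxiv_id` and `score` keys. The `merge_hits` helper and the toy data are hypothetical and not part of `utils`, and comparing scores across separate queries is only a rough heuristic.

```python
def merge_hits(hits_per_keyword):
    """Merge per-keyword hit lists into one ranking.

    Keeps the highest-scoring hit for each arxiv_id, then sorts by score,
    best first.
    """
    best = {}
    for hits in hits_per_keyword:
        for hit in hits:
            current = best.get(hit["arxiv_id"])
            if current is None or hit["score"] > current["score"]:
                best[hit["arxiv_id"]] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)


# Made-up hits for two keywords; document "B" is returned by both searches.
toy_hits = [
    [{"arxiv_id": "A", "score": 7.0}, {"arxiv_id": "B", "score": 6.4}],
    [{"arxiv_id": "B", "score": 8.1}, {"arxiv_id": "C", "score": 2.5}],
]
merged = merge_hits(toy_hits)
print([hit["arxiv_id"] for hit in merged])
```

With the DataFrames used in this notebook, a similar effect should be achievable with `lexical_df.sort_values("score", ascending=False).drop_duplicates("arxiv_id")`.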


### Run fuzzy word search 
By setting `fuzziness`, we control how much tolerance the search has for misspellings, typos, and similar errors. `fuzziness` is an integer >= 0: `fuzziness=0` allows no fuzziness and requires an exact match, while `fuzziness=1` accepts terms that are one character edit away from the search term.
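
To make "one character off" concrete, here is a minimal sketch of the edit distance that fuzzy matching is based on. It implements plain Levenshtein distance (insertions, deletions, substitutions); OpenSearch may additionally count a swap of two adjacent characters as a single edit, which this sketch does not model. The helper names `edit_distance` and `within_fuzziness` are hypothetical and not part of `utils`.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn `a` into `b`."""
    # prev[j] is the distance between the prefix of `a` processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep a match)
            ))
        prev = curr
    return prev[-1]


def within_fuzziness(term: str, candidate: str, fuzziness: int) -> bool:
    """True if `candidate` is at most `fuzziness` edits away from `term`."""
    return edit_distance(term, candidate) <= fuzziness


for candidate in ["fuel", "fuels", "full", "fossil"]:
    print(candidate, within_fuzziness("fuel", candidate, fuzziness=1))
```

Under this definition, `fuzziness=1` for the term `fuel` accepts `fuels` and `full` but rejects `fossil`.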

In [18]:
# Run one fuzzy search per keyword (at most one edit away) and stack the formatted hits.
fuzzy_frames = []
for q in query_keywords:
    out_shard = search_fuzzy(field='text', query=q, fuzziness=1, index_name=INDEX_NAME)
    fuzzy_frames.append(format_search_output(out_shard))
fuzzy_df = pd.concat(fuzzy_frames, axis=0)

Search for `what` in the `text` field with fuzziness set to 1
Search for `are` in the `text` field with fuzziness set to 1
Search for `fossil` in the `text` field with fuzziness set to 1
Search for `fuel` in the `text` field with fuzziness set to 1
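
The helpers in `utils` are not shown in this notebook, so the following is only a sketch of the kind of request bodies the two lexical searches above might send, assuming they wrap the standard OpenSearch `match_phrase` and `fuzzy` queries. The builder function names and the `size` parameter are hypothetical.

```python
def build_match_phrase_body(field, query, size=10):
    """Request body for an exact phrase match on `field` (hypothetical helper)."""
    return {
        "size": size,
        "query": {"match_phrase": {field: {"query": query}}},
    }


def build_fuzzy_body(field, query, fuzziness=1, size=10):
    """Request body for a fuzzy term match on `field` (hypothetical helper)."""
    return {
        "size": size,
        "query": {"fuzzy": {field: {"value": query, "fuzziness": fuzziness}}},
    }


phrase_body = build_match_phrase_body(field="text", query="fossil fuel")
fuzzy_body = build_fuzzy_body(field="text", query="fuel", fuzziness=1)
print(phrase_body)
print(fuzzy_body)

# Assumption: with an opensearch-py client, a body like this would be sent as
# client.search(index=INDEX_NAME, body=phrase_body)
```

Note that a phrase query is most useful for multi-word input such as `fossil fuel`; for a single keyword it behaves much like an ordinary term match.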


In [19]:
fuzzy_df.head()

| | score | abstract | title | arxiv_id | embeddings |
|---|---|---|---|---|---|
| 0 | 1.193325 | The quantum Zeno effect arises due to freque... | Quantum Zeno Effect in the Decoherent Histories | 704.1551 | [3.9804688, -0.67041016, -0.6225586, 2.8027344... |
| 1 | 1.173753 | The amount $Q$ of particles that are transpo... | Counting statistics in multiple path geometrie... | 704.3506 | [-1.2714844, -0.23791504, -2.4316406, 0.610351... |
| 2 | 1.107973 | We discuss a scenario that gravitinos produc... | Gravitino Dark Matter from Inflaton Decay | 705.0579 | [0.52490234, -2.5, -1.6357422, 1.8994141, -0.1... |
| 3 | 1.089642 | Category theory has foundational importance ... | Adjoint Functors and Heteromorphisms | 704.2207 | [-1.0878906, -1.5673828, 1.0390625, 2.1894531,... |
| 4 | 1.072099 | J. G. Thompson showed that a finite group G ... | Two Generator Subalgebras of Lie Algebras | 704.2723 | [-0.92578125, 0.6074219, -0.81396484, 0.036437... |


### Retrieve semantic search output using OpenSearch knn-vector search and co:here embeddings

In [20]:
semantic_out = find_similar_docs(query=query, k=2, num_results=3, index_name=INDEX_NAME) 
semantic_df = format_search_output(semantic_out)

In [21]:
semantic_df.head()

| | score | abstract | title | arxiv_id | embeddings |
|---|---|---|---|---|---|
| 0 | 0.689815 | Unconstrained CO2 emission from fossil fuel ... | Implications of "peak oil" for atmospheric CO2... | 704.2782 | [-1.7470703, -1.1748047, -1.3642578, 2.6113281... |
| 1 | 0.56827 | The structure of three laminar premixed rich... | Rich methane premixed laminar flames doped by ... | 704.0375 | [0.5957031, -0.41552734, -0.984375, 1.2890625,... |
| 2 | 0.56795 | Author offers and researches a new, cheap me... | Extraction of Freshwater and Energy from Atmos... | 704.2571 | [1.7832031, 0.5048828, -0.52783203, -0.6523437... |
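
The index name `arxiv-cosine` suggests that documents are compared by cosine similarity between embedding vectors. As a rough illustration of that idea, the sketch below ranks a few made-up 2-dimensional vectors against a query vector by brute force. This is an assumption-laden toy: OpenSearch uses approximate nearest-neighbour search rather than an exhaustive scan, the score it reports may be a transformed value rather than the raw cosine, and `cosine_similarity` and `top_k` are hypothetical helpers, not part of `utils`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def top_k(query_vec, doc_vecs, k):
    """Rank documents (a dict of id -> vector) by similarity to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings.
docs = {"a": [2.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 3.0]}
results = top_k([1.0, 0.0], docs, k=2)
print(results)
```

Because the comparison is made on embeddings of the whole query and abstract, a document can rank highly without sharing many exact keywords with the query.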


## Visualize outputs
Let's take the top abstract from each of `lexical_df`, `fuzzy_df`, and `semantic_df` and compare them. The query keywords are highlighted in every abstract to show that, although the semantic result may not contain the most keyword matches, it is semantically more meaningful than the results of the lexical and fuzzy approaches.

In [22]:
from utils import colorize

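def visualize(top_row, color): 
    # Print the top hit's arXiv ID and score, then its abstract with the query keywords highlighted.
    # Note: the label "lexical search" is hard-coded, so it is also printed for fuzzy and semantic results.
    print(f'''Top result for lexical search is arxiv_id={top_row['arxiv_id']} with score={top_row['score']}\n''')
    print(colorize(top_row.abstract, query_keywords, color=color))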
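
The implementation of `colorize` is not shown here. A minimal sketch of how such keyword highlighting could work with ANSI escape codes is given below, assuming the text is lower-cased and every occurrence of a keyword (including inside longer words such as `fuels`) is wrapped in blink, reverse-video, and colour codes. `highlight_keywords` and `ANSI_COLORS` are hypothetical names, and the real helper may behave differently.

```python
import re

RESET = "\033[0m"
# ANSI foreground colour codes for the colours used in this notebook.
ANSI_COLORS = {"blue": "34", "green": "32", "cyan": "36"}


def highlight_keywords(text, keywords, color="cyan"):
    """Lower-case `text` and wrap each keyword occurrence in ANSI codes."""
    lowered = text.lower()
    if not keywords:
        return lowered
    # 5 = blink, 7 = reverse video, followed by the foreground colour.
    start = f"\033[5m\033[7m\033[{ANSI_COLORS[color]}m"
    # Try longer keywords first so a short keyword cannot pre-empt a longer one.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ordered), re.IGNORECASE)
    return pattern.sub(lambda m: f"{start}{m.group(0)}{RESET}", lowered)


highlighted = highlight_keywords("Fossil fuels", ["fossil", "fuel"], color="green")
print(highlighted)
```

In a terminal or notebook that interprets ANSI codes, the matched keywords appear highlighted; in plain text the codes show up as literal escape sequences.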

In [23]:
visualize(lexical_df.iloc[0], color="cyan")


Top result for lexical search is arxiv_id=704.1045 with score=7.0113564

  during the last three decades, evidence has mounted that star and planet
formation is not an isolated process, but is influenced by current and previous
generations of stars. although cool stars form in a range of environments, from
isolated globules to rich embedded clusters, the influences of other stars on
cool star and planet formation may be most significant in embedded clusters,
where hundreds to thousands of cool stars form in close proximity to ob stars.
at the cool stars 14 meeting, a splinter session was convened to discuss the
role of environment in the formation of cool stars and planetary systems; with
an emphasis on the ``hot'' environment found in rich clusters. we review here
the basic results, ideas and questions presented at the session. we have
organized this contribution into five basic questions: **what** is the typical
environment of cool star formation, **what** r

In [24]:
visualize(fuzzy_df.iloc[0], color="blue")

Top result for lexical search is arxiv_id=704.1551 with score=1.193325

  the quantum zeno effect arises due to frequent observation. that implies the
existence of some experimenter and its interaction with the system. in this
contribution, we examine **what** happens for a closed system if one considers a
quantum zeno type of question, namely: "**what** is the probability of a system,
remaining always in a particular subspace". this has implications to the
arrival time problem that is also discussed. we employ the decoherent histories
approach to quantum theory, as this is the better developed formulation of
closed system quantum mechanics, and in particular, dealing with questions that
involve time in a non-trivial way. we get a very restrictive decoherence
condition, that implies that even if we do introduce an environment, there will
be very few cases that we can assign probabilities to these histories, but in
those cases, the quantum zeno effect is still 

In [25]:
visualize(semantic_df.iloc[0], color="green")

Top result for lexical search is arxiv_id=704.2782 with score=0.68981475

  unconstrained co2 emission from **fossil** **fuel** burning has been the dominant
cause of observed anthropogenic global warming. the amounts of "proven" and
potential **fossil** **fuel** reserves **are** uncertain and debated. regardless of the
true values, society has flexibility in the degree to which it chooses to
exploit these reserves, especially unconventional **fossil** **fuel**s and those
located in extreme or pristine environments. if conventional oil production
peaks within the next few decades, it may have a large effect on future
atmospheric co2 and climate change, depending upon subsequent energy choices.
assuming that proven oil and gas reserves do not greatly exceed estimates of
the energy information administration, and recent trends **are** toward lower
estimates, we show that it is feasible