# Smart Querying

*Last updated: 01/21/2022*
 
This notebook lets you automatically create a Boolean search string from an input text, such as an abstract (see Tool 1). It also helps you find synonyms for search terms and translate a search string from British to U.S. English (and vice versa).

If you prefer to get creative with natural-language search strings, Tool 2 provides some support.

## Working with Jupyter notebooks

If you are not familiar with Jupyter notebooks, here is how they work: To execute a piece of code, click inside a cell (the ones with `[]` to the left) and press Shift+Enter. Wait until the cell is done (that's when the `*` in `[]` turns into a number), then move on to the next cell.

If you get incomprehensible error messages or the notebook gets stuck, choose "Restart & Clear Output" in the "Kernel" dropdown menu above.
___
**Please help us to improve this tool by [emailing us](mailto:ai4ki.dev@gmail.com?subject=ai4ki-tools:%20Smart%20Querying) your update ideas or error reports.**
___

In [None]:
import pprint

from PyDictionary import PyDictionary
from ai4ki_utils.smart_query_utils import *

!python -m spacy download en_core_web_sm

## Tool 1: Create Query String from Text

### Preparation: Load Input Text
Create a `.txt` file of your text, upload it to the base directory, change the variable `filename` below accordingly, and run the cell.

In [None]:
filename = 'abstract.txt'
with open(filename, 'r', encoding='utf-8') as f:
    text = f.read()
text = text.replace('\n',' ')
my_queries = [] # We need that later for exporting queries

### (A)  The fast, efficient, and solid way: statistical keyword extraction
For extracting relevant keywords from your text, we use the method described [here](https://www.sciencedirect.com/science/article/abs/pii/S0020025519308588?via%3Dihub) and documented [here](https://github.com/LIAAD/yake).

Note that this method doesn't select for different parts of speech (but it also doesn't return stopwords). However, we found that using only proper nouns, nouns, and verbs yields better search strings. By default, therefore, we discard all other types of words, except when they appear in a multi-term keyword (that is, an n-gram with n>1). If you want to override this setting, change the parameter `org_kw` to `True`.

After keyword extraction, we apply the following simple recipe to cook up a Boolean search string: We chain the first `top_k` keywords together with the AND-operator, concatenate the remaining terms with the OR-operator, and link up the two resulting expressions with 'AND'.
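
The recipe can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code behind `query_constructor`: the function name `build_boolean_query` is hypothetical, and quoting multi-term keywords as phrases is an assumption of this sketch.

```python
def build_boolean_query(keywords, top_k=3):
    """AND-chain the first top_k keywords, OR-chain the rest, and join both parts with AND."""
    # Assumption of this sketch: multi-term keywords are quoted so they are searched as phrases
    terms = [f'"{kw}"' if ' ' in kw else kw for kw in keywords]
    must, optional = terms[:top_k], terms[top_k:]
    query = ' AND '.join(must)
    if optional:
        or_part = ' OR '.join(optional)
        # Only add the linking AND if there is an AND-chained part to link to
        query = f'{query} AND ({or_part})' if query else f'({or_part})'
    return query


# Made-up keyword list, ordered by relevance
print(build_boolean_query(['livestock', 'emissions', 'methane', 'cattle', 'feed additives'], top_k=3))
# prints: livestock AND emissions AND methane AND (cattle OR "feed additives")
```

A larger `top_k` makes the query stricter, because more terms become mandatory; a smaller one moves more terms into the optional OR group.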

If you're unsatisfied with the result, try adjusting the following parameters:

- `num_keywords`: maximum number of keywords to extract
- `ngram_limit`: maximum number of terms in a keyword
- `dedup_value`: a value in [0, 1] that controls the repetition of words in the keyword list (only relevant when `ngram_limit>1`); set it low to reduce repetitions; we found that a value of 0.3 yields good results (note that low values may result in fewer keywords overall)

If that doesn't work, try changing the `language` parameter to match the language of your text (use the standard two-letter language codes like "en" or "de").

*Run the following cell to create your query.*

In [None]:
kwe_query, keywords = query_constructor(text,
        org_kw=False,
        language='en',
        num_keywords=15,
        top_k=3,
        ngram_limit=1,
        dedup_value=0.3)

*If you like a suggested query, run the following cell to add it to the export list.*

In [None]:
my_queries.append(kwe_query)

### (B) The slow, expensive, and extravagant way: neural query generation
Here we use the [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) library to let GPT-2 generate a Boolean search string all on its own.

Large language models like GPT-2 use a lot of compute. Therefore, for the sake of our climate, we urge you to use this option with restraint.  

If you're unsatisfied with the result, try changing the `temperature` parameter of the model (between 0 and 1). Temperature controls the randomness of the output text: The higher the value, the more random (or 'creative') it is. Note that it sometimes takes several model runs to get a reasonable result. Not surprisingly, we found that GPT-3 performs much better at this task. You can try this out in our notebook `GPT_Playground.ipynb`.

You can also adjust the maximum length of the output with the parameter `maxq_length` (measured in the number of tokens, with 10 tokens corresponding to roughly 7.5 words).   
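
To make the effect of `temperature` more concrete, here is a minimal sketch of the usual mechanism: the model's raw scores for candidate tokens are divided by the temperature before they are turned into probabilities. The scores below are made up, and the function `softmax_with_temperature` is written for illustration only; it is not part of `query_composer`.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError('temperature must be positive')
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability; this does not change the result
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 0.9):
    probs = softmax_with_temperature(logits, temperature=t)
    print(f'temperature={t}:', [round(p, 3) for p in probs])
```

At a low temperature almost all probability mass sits on the highest-scoring token, so the output is nearly deterministic. At a higher temperature the distribution flattens and less likely tokens get sampled more often.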

*Run the following cell to generate your query.*

In [None]:
llm_query = query_composer(text, maxq_length=30, temperature=0.9)
print('THIS IS YOUR SEARCH STRING:')
print(llm_query)

*If you like a query suggested by GPT-2, run the following cell to add it to the export list.*

In [None]:
my_queries.append(llm_query)

### Refine your search string with synonyms
To refine your search string, you can replace a particular term in it with a sequence of OR-chained synonyms.

Note that terms appearing in n-gram-keywords (with n>1) are not replaced.  
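
The idea can be sketched as follows. This is a simplified illustration, not the implementation of `query_synonymizer`: the function `replace_with_synonyms` is hypothetical, the synonyms are passed in by hand, and the sketch assumes that multi-term keywords appear as quoted phrases in the query.

```python
import re


def replace_with_synonyms(query, term, synonyms):
    """Replace standalone occurrences of term with an OR-chained group of term and synonyms."""
    group = '(' + ' OR '.join([term] + list(synonyms)) + ')'
    # Splitting on a capturing group keeps the quoted phrases as odd-indexed parts
    parts = re.split(r'("[^"]*")', query)
    pattern = re.compile(r'\b' + re.escape(term) + r'\b')
    # Leave quoted phrases (assumed to be n-gram keywords) untouched
    return ''.join(
        part if i % 2 else pattern.sub(lambda m: group, part)
        for i, part in enumerate(parts)
    )


query = 'livestock AND "livestock farming" AND methane'
print(replace_with_synonyms(query, 'livestock', ['cattle', 'herd']))
# prints: (livestock OR cattle OR herd) AND "livestock farming" AND methane
```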

*Follow the instructions in the cell below and run it.*  

In [None]:
# Enter the term you want replaced with synonyms:
term_to_be_replaced = "livestock" 

# Enter the search string you want modified (options: kwe_query, llm_query);
# in case you want more than one term replaced, choose syn_query and run the cell again
query_to_be_refined = kwe_query   

syn_query = query_synonymizer(query_to_be_refined, term=term_to_be_replaced)
print(syn_query)

*If you like a synonymized query, run the following cell to add it to the export list.*

In [None]:
my_queries.append(syn_query)

### Translate between American and British English
*Follow the instructions in the cell below and run it.* 

In [None]:
# Choose query to be translated (options: kwe_query, llm_query, syn_query)
query = kwe_query

# Choose direction of translation (options: us2uk and uk2us)
direction = "uk2us"
trl_query = us_vs_uk_en(query, direction)
my_queries.append(trl_query)
print(trl_query)
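
A dictionary-based spelling translation can be sketched like this. The word list is a tiny made-up sample and the function `translate_spelling` is hypothetical; how `us_vs_uk_en` works internally is not shown in this notebook.

```python
import re

# Tiny illustrative word list; a real tool needs a far larger mapping
UK_TO_US = {
    'colour': 'color',
    'behaviour': 'behavior',
    'fertiliser': 'fertilizer',
    'organisation': 'organization',
}
US_TO_UK = {us: uk for uk, us in UK_TO_US.items()}


def translate_spelling(query, direction='uk2us'):
    """Swap known spellings word by word and leave everything else untouched."""
    if direction not in ('uk2us', 'us2uk'):
        raise ValueError("direction must be 'uk2us' or 'us2uk'")
    mapping = UK_TO_US if direction == 'uk2us' else US_TO_UK

    def swap(match):
        word = match.group(0)
        # Lookup is case-insensitive; replacements come back in lower case
        return mapping.get(word.lower(), word)

    return re.sub(r'[A-Za-z]+', swap, query)


print(translate_spelling('colour AND (fertiliser OR manure)', 'uk2us'))
# prints: color AND (fertilizer OR manure)
```

Because unknown words are returned unchanged, Boolean operators and parentheses pass through as they are.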

## Tool 2: Get Creative with Natural Language Queries

### Enter some query
*If you use a well-defined technical term that you want to keep as such, enclose it in exclamation marks, like so `!biodiversity!`*<br>
*If you use technical composites, stitch them together with the ampersand, like so `!climate&change!`.* 
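
To illustrate how this markup can be read, the following sketch splits a query into ordinary words and protected technical terms. It is an assumption about the general approach, not the parsing code used by `find_synonyms`; the function `parse_query` is hypothetical.

```python
import re


def parse_query(query):
    """Split a query into (text, protected) pairs."""
    tokens = []
    # Match either a !protected! term or a plain run of non-space characters
    for match in re.finditer(r'!([^!\s]+)!|(\S+)', query):
        protected, plain = match.groups()
        if protected is not None:
            # The ampersand stands for the space inside a composite term
            tokens.append((protected.replace('&', ' '), True))
        else:
            tokens.append((plain, False))
    return tokens


print(parse_query('impact of !climate&change! on !biodiversity! loss'))
# prints: [('impact', False), ('of', False), ('climate change', True), ('on', False), ('biodiversity', True), ('loss', False)]
```

Terms flagged as protected would then be skipped when synonyms are looked up.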

In [None]:
# Enter query
query = input('Enter query: ') 
print('Accepted query: ', query)

### Find synonyms for each term in your query

In [None]:
# Find synonyms and suggest alternative queries
query_syns = find_synonyms(query)

pp = pprint.PrettyPrinter(indent=1)
pp.pprint(query_syns)

#### Quickly check the meaning of a word

In [None]:
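word = input('Enter word: ')
pydict = PyDictionary()
meanings = pydict.meaning(word)
if meanings is None:  # No dictionary entry was found for this word
    print('No entry found for {}.'.format(word))
else:
    for k, v in meanings.items():
        print('==> PART OF SPEECH: ', k)
        print('==> MEANING(s) OF {}: '.format(word))
        print(*v, sep='\n')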

### Automatically suggest alternative query formulations
*You can run this cell as often as you like. Or until you come across something useful...* 

In [None]:
# Pick random alternative query
rnd_query = rand_query(query_syns)
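
The random selection can be pictured with the sketch below. It assumes that the synonyms are available as a dictionary mapping each query term to a list of alternatives; the actual structure of `query_syns` and the behavior of `rand_query` may differ, and the function `random_alternative` is hypothetical.

```python
import random


def random_alternative(synonyms_by_term, rng=random):
    """Build an alternative query by picking each term or one of its synonyms at random."""
    words = []
    # Dictionaries keep insertion order, so the original word order is preserved
    for term, synonyms in synonyms_by_term.items():
        words.append(rng.choice([term] + list(synonyms)))
    return ' '.join(words)


# Made-up synonym lists for a two-word query
example_syns = {'forest': ['woodland', 'woods'], 'decline': ['decrease', 'loss']}
print(random_alternative(example_syns))
```

Each run can produce a different combination, which is why running the cell repeatedly yields new suggestions.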

*If you like a random query, run the following cell to add it to the export list.*

In [None]:
my_queries.append(rnd_query)

## Export your Queries 
*Run the following cell to export all selected queries to the file `my_queries.txt`.*

In [None]:
with open('my_queries.txt', 'w', encoding='utf-8') as f:
    for q in my_queries:
        f.write(f"{q}\n")