# Assignment 4
### Murashko Artem SD20-01 | ar.murashko@innopolis.university

# Sugges_

One way to improve the user experience is to offer hints as the user types, in other words, to autocomplete their queries. This assignment looks at query suggestion (suggest).

Today we will practice generating suggestions using the [Trie](https://en.wikipedia.org/wiki/Trie) data structure (a prefix tree); see the example below.

Plan of your homework:

1. Build a trie from real search query data provided by AOL;
2. Generate suggestions from the trie;
3. Measure suggestion speed.

![image](https://www.ritambhara.in/wp-content/uploads/2017/05/Screen-Shot-2017-05-01-at-4.01.38-PM.png)
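To make the idea concrete, here is a minimal dictionary-based prefix tree in plain Python. It is only a sketch for intuition: the names `ToyTrie`, `insert` and `with_prefix` are made up for this illustration and are not part of pygtrie, which the rest of the assignment uses.

```python
class ToyTrie:
    """A tiny prefix tree: each node is a dict from character to child node."""

    _END = object()  # sentinel key marking that a stored word ends at this node

    def __init__(self):
        self.root = {}

    def insert(self, word, value):
        # Walk down the tree, creating child nodes as needed.
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = value

    def with_prefix(self, prefix):
        """Return all (word, value) pairs whose word starts with `prefix`."""
        node = self.root
        for ch in prefix:
            if ch not in node:
                return []  # no stored word has this prefix
            node = node[ch]
        # Collect every word stored in the subtree below the prefix node.
        found = []
        stack = [(prefix, node)]
        while stack:
            word, current = stack.pop()
            for key, child in current.items():
                if key is self._END:
                    found.append((word, child))
                else:
                    stack.append((word + key, child))
        return sorted(found)


toy = ToyTrie()
for w, count in [("tree", 3), ("trie", 5), ("tried", 2), ("algo", 1)]:
    toy.insert(w, count)

print(toy.with_prefix("tri"))
```

All words that share a prefix live in one subtree, so a lookup costs time proportional to the prefix length plus the size of that subtree, not the size of the whole dataset.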

## 0. Install Trie data structure support

You are free to use any library implementation of a trie, or the one we suggest (read the docs before asking any questions!): https://github.com/google/pygtrie

In [1]:
# !pip install pygtrie



### Imports 

In [30]:
import pandas as pd
import pygtrie
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords
from tqdm.auto import tqdm

## 1. Build a trie upon a dataset

### 1.1. [5] Read the dataset

Download the [dataset](https://github.com/IUCVLab/information-retrieval/tree/main/datasets/aol). For simplicity, we provide only the first part of the original data (about 3.5 million queries).

Explore the data and see the readme file. Load the dataset and pass the assert.

In [13]:
aol_data = pd.read_csv('datasets/aol/user-ct-test-collection-01.txt.gz', compression='gzip', delimiter='\t')
assert aol_data.shape[0] == 3558411, "Dataset size does not match"

In [15]:
aol_data.sample(5)

| | AnonID | Query | QueryTime | ItemRank | ClickURL |
|---|---|---|---|---|---|
| 685846 | 1750999 | al hirschfeld theatre | 2006-03-06 19:34:07 | 5.0 | http://www.nytix.com |
| 2938376 | 12660155 | ultra b-100 complex | 2006-03-12 15:42:03 | | |
| 3042481 | 13898195 | equations of the first degree in one unknown | 2006-03-15 22:14:14 | 18.0 | http://www.homeschoolmath.net |
| 3166922 | 15553989 | - | 2006-04-07 00:13:47 | | |
| 413126 | 996567 | sexandtrash.free.fr | 2006-04-14 12:34:51 | | |


### 1.2. [10] Build a Trie

We want the suggest function to be **insensitive to stop words**, because we don't want to upset users who confuse or omit prepositions. Consider *"public events in Innopolis"* vs *"public events at Innopolis"* or *"public events Innopolis"*: they all mean the same.
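A minimal sketch of that normalization follows. It assumes plain whitespace tokenization and a small illustrative stop-word set; `STOP_WORDS` and `normalize` are hypothetical names used only here, and the notebook itself tokenizes with NLTK.

```python
STOP_WORDS = {"in", "at", "of", "the", "a"}  # illustrative subset, not a full list


def normalize(query, stop_words=STOP_WORDS):
    """Lowercase the query and drop stop words, keeping the remaining order."""
    tokens = query.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)


variants = [
    "public events in Innopolis",
    "public events at Innopolis",
    "public events Innopolis",
]
# All three variants collapse to a single key.
keys = {normalize(v) for v in variants}
print(keys)
```

Using the normalized string as the trie key means the three variants share one node, so their statistics accumulate in one place.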

Build a trie based on the dataset, **storing query statistics such as query _frequency_, URLs and ranks in the nodes**. Some queries may have no associated URLs, others may have multiple ranked URLs. Think about how to store this information.

Pass the asserts.
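One possible way to hold these statistics is sketched below. It is an illustration under assumptions, not the layout used in the solution cells: the `QueryStats` class and its `add` method are hypothetical, and it keeps only the best (lowest) rank seen for each URL.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class QueryStats:
    """Hypothetical per-key record: a hit counter plus the best rank per URL."""

    frequency: int = 0
    best_rank_by_url: Dict[str, float] = field(default_factory=dict)

    def add(self, url: Optional[str], rank: Optional[float]) -> None:
        # Every log row counts towards the frequency, clicked or not.
        self.frequency += 1
        if url is None or rank is None:
            return  # a query without a click contributes no URL
        previous = self.best_rank_by_url.get(url)
        if previous is None or rank < previous:
            self.best_rank_by_url[url] = rank


stats = QueryStats()
stats.add("http://a.example", 5.0)
stats.add(None, None)
stats.add("http://b.example", 2.0)
stats.add("http://a.example", 7.0)
print(stats.frequency, stats.best_rank_by_url)
```

Keeping a dictionary instead of parallel lists avoids storing the same URL many times for popular queries.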

In [36]:
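def clean_sentence(query, stop_words):
    # Tokenize the query and drop stop words, keeping the original token order.
    tokens = [w for w in word_tokenize(query) if w not in stop_words]
    return ' '.join(tokens)

def castNan(data):
    # NaN is not equal to itself, so this maps missing values to None.
    if data != data:
        return None
    return data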

In [60]:
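class MetaData:
    """Statistics stored in a trie node for one stop-word-free query key."""

    def __init__(self, frequency, urls, ranks, queries):
        self.frequency = frequency  # number of occurrences of the raw query
        self.urls = urls            # clicked URLs, None where there was no click
        self.ranks = ranks          # ranks aligned with `urls`, None where missing
        self.queries = queries      # raw queries that were reduced to this key


def build_trie(aol_data):
    # A short hand-picked stop-word list; the NLTK list is kept as an alternative.
    # stop_words = set(stopwords.words('english'))
    stop_words = {'at', 'using', 'the', 'a', 'in', 'of', 'for', 'and', '&', 'on', 'with', 'is', 'from', 'to'}

    trie = pygtrie.CharTrie()
    data = aol_data.reset_index()

    # Frequency of every raw query, computed once up front.
    query_counts = data.Query.value_counts()
    for idx, row in tqdm(data.iterrows(), total=data.shape[0]):
        # Skip rows with a missing query (NaN is not equal to itself).
        if row.Query != row.Query:
            continue

        freq = query_counts[row.Query]
        filtered_query = clean_sentence(row.Query.lower(), stop_words)
        rank = castNan(row.ItemRank)
        url = castNan(row.ClickURL)

        if filtered_query in trie:
            # Extend the existing node. Note that the stored frequency is the count
            # of the raw query in the current row, so when several raw queries
            # collapse to one key, the last one seen determines the value.
            node = trie[filtered_query]
            trie[filtered_query] = MetaData(freq, node.urls + [url], node.ranks + [rank], node.queries + [row.Query])
        else:
            trie[filtered_query] = MetaData(freq, [url], [rank], [row.Query])

    return trie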

In [61]:
aol_trie = build_trie(aol_data)

# test trie
# print(aol_trie.iteritems)
bag = []
for key, val in aol_trie.iteritems("sample q"):
    print(key, '~', val)
    
    #NB: here we assume you store urls in a property of list type. But you can do something different. 
    bag += val.urls
    
    assert "sample question" in key, "All examples have `sample question` substring"
    assert key[:len("sample question")] == "sample question", "All examples have `sample question` starting string"

for url in ["http://www.surveyconnect.com", "http://www.custominsight.com", 
            "http://jobsearchtech.about.com", "http://www.troy.k12.ny.us",
            "http://www.flinders.edu.au", "http://uscis.gov"]:
    assert url in bag, "This url should be in a trie"


sample question surveys ~ <__main__.MetaData object at 0x147dac940>
sample questions immigration interview ~ <__main__.MetaData object at 0x1badd24f0>
sample questions interview ~ <__main__.MetaData object at 0x1badcf520>
sample questions family interview ~ <__main__.MetaData object at 0x1badcf310>
sample questions sociology race ethnicity ~ <__main__.MetaData object at 0x151826d90>
sample questions biology ~ <__main__.MetaData object at 0x143a7e880>
sample questions us citizenship test ~ <__main__.MetaData object at 0x149c17e80>
sample questionarie teaching evaluation ~ <__main__.MetaData object at 0x131c9c970>
sample questionnaire teaching evaluation ~ <__main__.MetaData object at 0x131c74e50>
sample questionnaire clinical research coordinators certification ~ <__main__.MetaData object at 0x151282c70>


## 2. [15] Write a suggest function which is non-sensitive to stop words

Suggest options for a user query based on the trie you just built.
Output the results sorted by frequency and print the query count for each suggestion. If a URL is available, print it too. If multiple URLs are available, print the one with the best rank (the lower, the better).
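The ranking step can be sketched independently of the trie. The snippet below assumes the candidates have already been collected as `(query, frequency, rank_by_url)` tuples; this tuple layout and the names `best_url` and `top_suggestions` are assumptions made for illustration only.

```python
import heapq


def best_url(rank_by_url):
    """Return the URL with the smallest rank, or None when nothing was clicked."""
    if not rank_by_url:
        return None
    return min(rank_by_url, key=rank_by_url.get)


def top_suggestions(candidates, top_k=5):
    """Pick the top_k most frequent candidates, most frequent first."""
    best = heapq.nlargest(top_k, candidates, key=lambda c: c[1])
    return [(query, freq, best_url(urls)) for query, freq, urls in best]


candidates = [
    ("trie data structure", 3, {"http://x.example": 4.0, "http://y.example": 1.0}),
    ("tried and true", 10, {}),
    ("triest", 1, {}),
]
for query, freq, url in top_suggestions(candidates, top_k=2):
    print(query, freq, url)
```

`heapq.nlargest` avoids sorting the full candidate list when only a handful of suggestions are needed, which matters for short prefixes with many matches.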

Pass the asserts.

Question for analysis: what is the empirical threshold for the minimum prefix length at which suggest should be triggered?

In [None]:
def complete_user_query(query: str, trie, top_k=5) -> list[str]:
    #TODO: suggest top_k options for a user query
    # sort results by frequency (!), 
    # suggest the QUERIES for first k ranked urls if available
    pass

        
inp = "trie"
print("Query:", inp)
print("Results:")
res = complete_user_query(inp, aol_trie)
print(res)

#NB we assume you return suggested query string only
assert res[0] == "tried and true tattoo"
assert res[1] == "triest" or res[1] == "triethanalomine"

assert "boys and girls club of conyers georgia" \
            in complete_user_query("boys girls club conyers", aol_trie, 10), "Should be here"

## 3. Measure suggest speed

### 3.1. [10] Full Trie test

Check how fast your search runs. Consider changing your code if it takes too long on average.

Success criteria:
- there is an average and a standard deviation for **multiple runs** of the given bucket of queries.
- there is an average and a standard deviation for **multiple runs** of naive search in the unindexed dataset.
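A minimal timing harness for these criteria might look as follows. It is a sketch under assumptions: `time_per_query` is a hypothetical helper, and `naive_search` over a four-element `corpus` stands in for both the trie lookup and the scan of the real dataset.

```python
import statistics
import time


def time_per_query(func, queries, runs=5):
    """Return {query: (mean_ms, stdev_ms)} over `runs` calls of func(query)."""
    results = {}
    for q in queries:
        samples = []
        for _ in range(runs):
            start = time.perf_counter()
            func(q)
            samples.append((time.perf_counter() - start) * 1000.0)
        # stdev needs at least two samples, so keep runs >= 2.
        results[q] = (statistics.mean(samples), statistics.stdev(samples))
    return results


# Toy stand-in for the unindexed dataset.
corpus = ["information retrieval", "inference", "new york times", "sherlock holmes"]


def naive_search(prefix):
    # Linear scan: checks every stored query against the prefix.
    return [q for q in corpus if q.startswith(prefix)]


timings = time_per_query(naive_search, ["inf", "new york"], runs=5)
for q, (mean_ms, std_ms) in timings.items():
    print(f"{q!r}: {mean_ms:.4f} ms +/- {std_ms:.4f} ms")
```

Passing the trie-based function and the naive one through the same harness keeps the two measurements comparable.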

In [None]:
inp_queries = ["inf", "the best ", "information retrieval", "sherlock hol", "carnegie mell", 
               "babies r", "new york", "googol", "inter", "USA sta", "Barbara "]

#TODO: measure average execution time and standard deviation (in milliseconds) per query and print it out
# Repeat this for index and for no index.

## 4. [10] Add spellchecking to your suggest

Make the suggestions for a misspelled query as close as possible to those for the correctly spelled one. Compare the top-5 results of each query with the top-5 results for its corrected version.

You can use the [pyspellchecker](https://pypi.org/project/pyspellchecker/) `candidates()` call, or use any other spellchecker implementation.
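The sketch below shows one way to correct a query before passing it to suggest. To stay self-contained it uses `difflib.get_close_matches` from the standard library as a stand-in for pyspellchecker, and `VOCABULARY`, `correct_token` and `correct_query` are hypothetical names. It assumes that the last token is an unfinished prefix unless the query ends with a space, so that token is left untouched.

```python
import difflib

# Hypothetical vocabulary; in practice it could be built from the dataset tokens.
VOCABULARY = ["information", "retrieval", "sherlock", "carnegie", "babies", "barbara"]


def correct_token(token, vocabulary=VOCABULARY, cutoff=0.7):
    """Return the closest vocabulary word, or the token itself if none is close."""
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_query(query):
    """Lowercase the query and correct every token that the user has finished typing."""
    lowered = query.lower()
    tokens = lowered.split()
    if not tokens:
        return lowered
    if lowered.endswith(" "):
        # A trailing space means the last token is complete as well.
        return " ".join(correct_token(t) for t in tokens) + " "
    fixed = [correct_token(t) for t in tokens[:-1]]
    return " ".join(fixed + [tokens[-1]])


for q in ["inormation retrieval", "shelrock hol", "carnagie mell", "babis r", "Barrbara "]:
    print(repr(q), "->", repr(correct_query(q)))
```

Correcting only completed tokens avoids "fixing" a prefix such as `hol` into an unrelated dictionary word before the trie lookup.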

In [None]:
def complete_user_query_with_spellchecker(query, trie, top_k=5) -> list[str]:
    #TODO: suggest top_k options for a user query
    # sort results by frequency (!!), 
    # suggest the QUERIES for first k ranked urls if available
    pass

In [None]:
inp_queries = ["inormation retrieval", "shelrock hol", "carnagie mell", "babis r", "Barrbara "]
inp_queries_corrected = ["information retrieval", "sherlock hol", "carnegie mell", "babies r", "Barbara "]

for q, qc in zip(inp_queries, inp_queries_corrected):
    # Use the trie built above (`aol_trie`); the name `trie` is not defined at this point.
    assert complete_user_query(qc, aol_trie, 5) == \
           complete_user_query_with_spellchecker(q, aol_trie, 5), "Assert {} and {} give different results".format(q, qc)