# Problem 9
Implement an autocomplete system. That is, given a query string s and a set of all possible query strings, return all strings in the set that have s as a prefix.

For example, given the query string `de` and the set of strings `[dog, deer, deal]`, return `[deer, deal]`.

Hint: Try preprocessing the dictionary into a more efficient data structure to speed up queries.
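
One structure often used for this kind of preprocessing is a trie (prefix tree). The sketch below is only an illustration of the hint, not the approach taken in the Solution section; the names `TrieNode`, `build_trie`, and `words_with_prefix` are hypothetical and chosen for this example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a letter to the next node
        self.is_word = False  # True if a word ends at this node


def build_trie(words):
    # insert every word letter by letter, sharing nodes for common prefixes
    root = TrieNode()
    for word in words:
        node = root
        for letter in word:
            node = node.children.setdefault(letter, TrieNode())
        node.is_word = True
    return root


def words_with_prefix(root, prefix):
    # walk down to the node for the prefix; stop early if no word has it
    node = root
    for letter in prefix:
        node = node.children.get(letter)
        if node is None:
            return []
    # collect every word below that node with an iterative depth-first search
    results = []
    stack = [(node, prefix)]
    while stack:
        node, word = stack.pop()
        if node.is_word:
            results.append(word)
        for letter, child in node.children.items():
            stack.append((child, word + letter))
    return sorted(results)


trie = build_trie(["dog", "deer", "deal"])
print(words_with_prefix(trie, "de"))  # ['deal', 'deer']
```

Reaching the prefix node costs one step per letter of the query, so the remaining work is proportional to the number of matching words rather than to the size of the whole dictionary.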

---
## Test Cases

In [7]:
# test cases

# this test case is based on the python library 'english_words' as a test dictionary
from english_words import get_english_words_set
test_dict = list(get_english_words_set(['web2'], lower=True))
test_dict.sort()
print("Number of words in 'english_words' python library for test dictionary:", len(test_dict), "words")

Number of words in 'english_words' python library for test dictionary: 234450 words


---
## Solution

In [8]:
# create column database for word list
import pandas as pd
import numpy as np
import itertools
from tqdm.notebook import tqdm

# creates all headers needed for column database:
# every letter combination from 1 to n letters long (e.g. 'a', 'ab', 'abc' for n = 3)
def headers(n = 3):
    if(n < 1): n = 1
    # '-' acts as a blank so that combinations shorter than n letters are generated too
    letters = 'a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,-'.split(',')
    # join each combination into a string and drop the blanks; the set removes duplicates
    columns = {''.join(combo).replace('-', '') for combo in itertools.product(letters, repeat=n)}
    columns.discard('')
    # sort alphabetically, then by length, so the order is the same on every run
    return sorted(sorted(columns), key = len)

# creates column database based on word list given and headers created in headers function
def column_df(word_list, header_list):
    columns = {}
    # tqdm is used to help indicate how long database takes to create
    for header in tqdm(header_list):
        temp_word_list = [word for word in word_list if word.startswith(header)]
        # skips any column that would be empty
        if(len(temp_word_list) != 0):
            columns[header] = temp_word_list
    # all columns must have the same length, so shorter ones are padded with '' up to the longest
    # (a fixed filler size would fail as soon as one column held more words than the filler)
    longest = max(map(len, columns.values()), default = 0)
    return pd.DataFrame({header: words + [''] * (longest - len(words))
                         for header, words in columns.items()})

In [9]:
# command to make column database based on 'max_autocomplete_word_length'
# as max_autocomplete_word_length increases, autocomplete queries get faster, but the database takes significantly longer to build
max_autocomplete_word_length = 3
column_database = column_df(test_dict,headers(max_autocomplete_word_length))

In [117]:
import time

# solution based on column database data structure
def autocomplete(user_input, dataframe):
    columns = dataframe.columns
    # try the full input first, then progressively shorter prefixes, until one is a column header
    for length in range(len(user_input), 0, -1):
        prefix = user_input[:length]
        if(prefix in columns):
            # keep only words that start with user_input;
            # this also drops the '' padding, and user_input itself is excluded
            return [word for word in dataframe[prefix].to_list()
                    if word != user_input and word.startswith(user_input)]
    # no column header matches any prefix of user_input
    return "No results found"

def pretty_print(user_input, column_database):
    # use time.perf_counter (a high-resolution clock) to time the autocomplete query in milliseconds
    start = time.perf_counter() * 1000
    results = autocomplete(user_input, column_database)
    end = time.perf_counter() * 1000
    # print user friendly output
    print("User inputed:", user_input)
    print("-" * 25)
    if(isinstance(results, list)):
        print(f"{len(results)} autocomplete results found.")
        # show at most the first 5 results
        shown = results[:5]
        print(f"First {len(shown)} autocomplete results are: {shown}")
    else:
        print("No autocomplete results found.")
    print(f"This autocomplete took {round(end - start, 2)} milliseconds to query.")

---
## Test Solution

In [118]:
user_input = "dog"
pretty_print(user_input, column_database)

User inputed: dog
-------------------------
88 autocomplete results found.
First 5 autocomplete results are: ['dogal', 'dogate', 'dogbane', 'dogberry', 'dogberrydom']
This autocomplete took 2.03 milliseconds to query.


In [114]:
user_input = "yello"
pretty_print(user_input, column_database)

User inputed: yello
-------------------------
37 autocomplete results found.
First 5 autocomplete results are: ['yelloch', 'yellow', 'yellowammer', 'yellowback', 'yellowbelly']
This autocomplete took 3.0 milliseconds to query.


In [115]:
user_input = "s"
pretty_print(user_input, column_database)

User inputed: s
-------------------------
24936 autocomplete results found.
First 5 autocomplete results are: ['sa', 'saa', 'saad', 'saan', 'saarbrucken']
This autocomplete took 8.12 milliseconds to query.


In [116]:
user_input = "wsq"
pretty_print(user_input, column_database)

User inputed: wsq
-------------------------
0 autocomplete results found.
First 0 autocomplete results are: []
This autocomplete took 91.23 milliseconds to query.


---
## Solution Explained

### autocomplete(user_input, dataframe) solution
This solution is based on a column database. The database is built by `column_df(word_list, header_list)`, which takes a list of words (here taken from the `english_words` Python library to stand in for an English dictionary) and a list of headers. The headers come from `headers(max_autocomplete_word_length)`, which generates every combination of English letters from 1 to `max_autocomplete_word_length` letters long: `a`, `aa`, `ab`, `aaa`, `aab`, ... Each column holds the words that start with its header, and headers that match no words are dropped.

Once the database is built, `autocomplete(user_input, dataframe)` answers queries. It takes the output of `column_df(word_list, header_list)` as `dataframe` and the text the user is trying to complete as `user_input`. If `user_input` is one of the dataframe's column headers, that column is converted to a list, the empty-string padding and `user_input` itself are removed, and the list is returned. Otherwise, the function falls back to the column for the longest prefix of `user_input` that is a header, removes every word that does not start with `user_input`, and returns what is left. If no column matches any prefix of `user_input`, the function returns the string `"No results found"`.
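
The same prefix-to-words idea can be sketched with a plain dictionary on the toy example from the problem statement. This is a minimal illustration, not the notebook's implementation: `build_prefix_index` and `lookup` are hypothetical names, and the index is built in a single pass over the words instead of scanning the word list once per header.

```python
from collections import defaultdict


def build_prefix_index(words, max_prefix_length=3):
    # one pass over the words: file each word under its own 1..max_prefix_length letter prefixes
    index = defaultdict(list)
    for word in words:
        for length in range(1, min(len(word), max_prefix_length) + 1):
            index[word[:length]].append(word)
    return dict(index)


def lookup(user_input, index, max_prefix_length=3):
    # inputs longer than the indexed length fall back to their first max_prefix_length letters
    key = user_input[:max_prefix_length]
    candidates = index.get(key, [])
    # keep words that start with the full input, excluding the input itself
    return [word for word in candidates
            if word != user_input and word.startswith(user_input)]


index = build_prefix_index(["dog", "deer", "deal"])
print(lookup("de", index))    # ['deer', 'deal']
print(lookup("deer", index))  # []
```

Each key plays the role of a column header, and its list plays the role of that column without the padding a dataframe needs to keep every column the same length.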

To make the output more user friendly and to measure how long a query takes, use `pretty_print(user_input, column_database)`. This function times the call to `autocomplete(user_input, dataframe)`, then prints how many results were found, the first five results (in alphabetical order, because the word list is sorted), and how long the query took.
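
A single timed call at the millisecond scale can vary noticeably from run to run. One way to get a steadier number, sketched here under the assumption that the query has no side effects, is to repeat the call and report the mean; `time_query` is a hypothetical helper and is not part of the notebook.

```python
import time


def time_query(func, *args, repeats=100):
    # run the call `repeats` times (repeats must be at least 1)
    # and return the last result together with the mean time in milliseconds
    start = time.perf_counter()
    for _ in range(repeats):
        result = func(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms / repeats


# stand-in query for demonstration; in the notebook this could be
# time_query(autocomplete, "dog", column_database)
result, mean_ms = time_query(sorted, ["dog", "deer", "deal"])
print(result, f"{mean_ms:.4f} ms per call")
```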