# Semantic Search Word2Vec Model First Draft

This Jupyter notebook serves as an introduction to reading GitHub `.md` documentation and analyzing it...

## Phase 1: Documentation Data Reading and Pre-Processing

### Step 1: Reading and Storing the Documentation Data

In this section, we'll read the markdown (`.md`) files, collect their contents, and store them for processing. We do this by going through every `.md` file in a directory, reading each one as plain text, and storing the result.

In [111]:
import doc_reader as reader

doc_data = reader.collect_doc_data("docs/docs")
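
The `doc_reader` module is not shown in this notebook. As a rough idea of what a helper like `collect_doc_data` might do, here is a minimal sketch that walks a directory and maps each `.md` file to its raw text. The function name `collect_markdown` and the dict layout (relative path to text) are assumptions for illustration, not the actual implementation.

```python
from pathlib import Path


def collect_markdown(root: str) -> dict[str, str]:
    """Hypothetical stand-in for reader.collect_doc_data: map each .md path to its raw text."""
    root_path = Path(root)
    docs = {}
    # rglob also descends into subdirectories; sorted() keeps the order deterministic
    for md_file in sorted(root_path.rglob("*.md")):
        key = md_file.relative_to(root_path).as_posix()
        # errors="replace" avoids crashing on a stray non-UTF-8 byte
        docs[key] = md_file.read_text(encoding="utf-8", errors="replace")
    return docs
```

Keying by relative path rather than bare filename avoids collisions when two subdirectories contain files with the same name.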

### Step 2: Cleaning the Documentation Data

In this section, we'll take the documentation data collected in Step 1 and clean it up so we can use it. This can include removing HTML tags, punctuation, and special characters; collapsing extra whitespace; lowercasing all text for semantic searching; and catching any misspellings in the documentation.

In [112]:
import md_cleaner as cleaner

cleaned_doc_data = cleaner.clean_doc_data(doc_data)
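
The `md_cleaner` module is also not shown. The sketch below illustrates the cleaning steps listed above (tag removal, punctuation removal, lowercasing, whitespace collapsing) on a single string. `clean_text` is a hypothetical name, and spelling correction is left out because it needs a dictionary or an external library.

```python
import re
import string

_TAG_RE = re.compile(r"<[^>]+>")
# Map every ASCII punctuation character to a space so words stay separated
_PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_text(text: str) -> str:
    """Hypothetical single-document cleaning pass (not the md_cleaner implementation)."""
    text = _TAG_RE.sub(" ", text)        # drop HTML tags
    text = text.translate(_PUNCT_TABLE)  # punctuation and special characters become spaces
    text = text.lower()                  # case-fold for matching
    return " ".join(text.split())        # collapse runs of whitespace
```

Tags are removed before punctuation so that the angle brackets are still present when the tag pattern runs.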

### Step 3: Pre-processing the Documentation Data

In this section, we'll take the cleaned documentation data from Step 2 and pre-process it with tokenization, stemming and lemmatization, and stop-word removal.

In [113]:
import md_preprocessor as preprocessor

preproc_docs = preprocessor.preprocess_doc_data(cleaned_doc_data)
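
As with the earlier helpers, `md_preprocessor` is not shown. A minimal sketch of two of the steps named above, tokenization and stop-word removal, might look like the following. The `tokenize` name and the tiny `STOP_WORDS` set are assumptions for illustration; stemming and lemmatization are omitted because they normally come from an NLP library.

```python
# A deliberately small, made-up stop-word list; real lists are much longer
STOP_WORDS = frozenset({
    "a", "an", "and", "the", "is", "are", "of", "to",
    "in", "for", "on", "it", "this", "that", "with",
})


def tokenize(text: str) -> list[str]:
    """Split cleaned, lowercased text on whitespace and drop stop words."""
    return [tok for tok in text.split() if tok not in STOP_WORDS]
```

The output is a list of tokens per document, which matches the shape Word2Vec training expects: one list of strings per sentence or document.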

## Phase 2: Implementing Semantic Search with Word2Vec

Now we can use `Gensim`'s Word2Vec algorithm to implement semantic search over the cleaned and pre-processed documentation data. Word2Vec maps words and phrases to dense vector representations in a high-dimensional space, so that terms used in similar contexts end up close together.

In [115]:
import gensim.downloader

pretrained_model = gensim.downloader.load('word2vec-google-news-300')

In [116]:
import gensim
from gensim.models import Word2Vec
import numpy as np

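# preproc_docs maps filename -> list of tokens, so the corpus is a list of token lists
corpus = list(preproc_docs.values())
model = Word2Vec(corpus, vector_size=500, window=5, min_count=5, workers=4)
# Caution: build_vocab_from_freq expects word -> frequency counts, but key_to_index
# maps word -> index. This call only extends the vocabulary; it does not copy the
# pretrained 300-d vectors into this 500-d model.
model.build_vocab_from_freq(pretrained_model.key_to_index, corpus_count=len(corpus), update=True)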

In [117]:
model.save("word2vec_model.bin")

In [7]:
document_embeddings = {}

for filename, tokens in preproc_docs.items():
    embeddings = [model.wv[word] for word in tokens if word in model.wv]
    if embeddings:
        document_embeddings[filename] = gensim.matutils.unitvec(np.mean(embeddings, axis=0))
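
The cell above averages each document's word vectors and scales the result to unit length. Query vectors are built the same way later on, so the dot product used for ranking is the cosine similarity. The sketch below works through that arithmetic in plain Python with made-up 2-d vectors; the helper names are illustrative and are not part of Gensim or NumPy.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def unit(vec):
    """Scale a vector to length 1; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy 2-d "word vectors", invented for illustration
doc_vec = unit(mean_vector([[1.0, 0.0], [1.0, 2.0]]))  # mean is [1.0, 1.0]
query_vec = unit([2.0, 0.0])                            # points along the first axis
similarity = dot(doc_vec, query_vec)                    # cosine of 45 degrees
```

Normalizing once, when the embeddings are built, means each query only needs one dot product per document instead of recomputing norms every time.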

In [8]:
def run_query(query_str: str):
    """
    Performs a similarity search on the inputted query against the given Word2Vec model.

    Parameters:
        query_str (str) : user query
    
    Returns:
        similar_docs (list[tuple[str, float]]) : (filename, similarity score) pairs, most similar first
    """
    corrected_query = cleaner.correct_spelling(query_str)
    query_tokens = preprocessor.preprocess_str(cleaner.clean_str(corrected_query))
    # Collect the vectors of in-vocabulary tokens, then average and normalize them
    token_vecs = [model.wv[token] for token in query_tokens if token in model.wv]
    query_embedding = gensim.matutils.unitvec(np.mean(token_vecs, axis=0))

    similar_docs = []
    for filename, doc_embedding in document_embeddings.items():
        similarity_score = np.dot(doc_embedding, query_embedding)
        similar_docs.append((filename, similarity_score))
    
    similar_docs = sorted(similar_docs, key=lambda x: x[1], reverse=True)

    return similar_docs

In [9]:
def get_relevant_files(query, top_k=5, include_score=False, verbose=False):
    """
    Gets the top 'k' relevant files from an inputted query. Defaults to top
    5 most relevant files.

    Parameters:
        query (str) : question to search PW documentation for
        top_k (int) : top 'k' most relevant files to return (default: 5)
        include_score (bool) : if True, includes similarity score of file
        verbose (bool) : if True, prints files in addition to returning
    
    Returns:
        rel_files (list) : top 'k' most relevant files
    """
    try:
        similar_docs = run_query(query)
    except TypeError:
        print("Your query does not match anything in our system.")
        return []

    if include_score:
        rel_files = similar_docs[:top_k]
        if verbose:
            print(f"Top {top_k} most relevant files to your query with similarity scores included:\n")
            for i, (file, sim_score) in enumerate(rel_files):
                print(f"{i + 1}. {file}: {sim_score}")
        return rel_files
    else:
        rel_files = [filename for filename, _ in similar_docs[:top_k]]
        if verbose:
            print(f"Top {top_k} most relevant files to your query:\n")
            for i, file in enumerate(rel_files):
                print(f"{i + 1}. {file}")
    return rel_files

In [69]:
def get_file_content_from_filenames(filenames: list, docs: dict) -> dict:
    """
    Helper function that takes a list of filenames and returns a dict of
    the cleaned and preprocessed contents of those files.

    Parameters:
        filenames (list) : filenames to get content of
        docs (dict) : documents keyed by filename and valued with content

    Returns:
        file_content (dict) : content of each inputted filename, keyed by filename
    """
    return {file: docs[file] for file in filenames}

In [92]:
def format_gpt_input(gpt_docs: dict) -> dict:
    """
    Helper function that joins each file's token list into a single
    space-separated string, keeping the same filename keys.
    """
    new_docs = {}
    for file, content in gpt_docs.items():
        new_docs[file] = " ".join(content)
    
    return new_docs

In [None]:
def ensure_token_length(*args):
    """
    Helper function to make sure that inputted tokens are below the 4097
    token input cap.

    Parameters:
        *args : inputs whose combined token count is checked
    
    Returns:
        (bool) : if the inputted arguments are below 4097 tokens
    """

In [99]:
import openai
import json
import helper_funcs as helper

def run_gpt(query, *args, api_key=helper.get_api_key()):
    """
    Function that runs the gpt-3.5-turbo AI API on a query and set of arguments.
    Arguments should consist of a variable length list, where each
    element contains a list of tokens from the most relevant files related to
    the inputted query.

    Parameters:
        query (str) : inputted query from user
        *args (list[list[str]]) : array containing document information tokens
        api_key (str) : user API key to run
    
    Returns:
        reply (str) : GPT AI response to query with supporting relevant documents
    """
    openai.api_key = api_key

    gpt_prompt = "You are a helpful assistant in charge of helping users understand our platform."
    clarification_1 = "Your responses should not require users to search through our files. Instead, you can include relevant filenames as additional support resources if they need it."
    clarification_2 = "When including a filename in your response, print in Proper formatting -- so without dashes or the file extensions."

    messages = [
        {"role": "system", "content": gpt_prompt},
        {"role": "system", "content": clarification_1},
        {"role": "system", "content": clarification_2},
        {"role": "user", "content": query}
    ]

    for tokens in args:
        messages.append({"role": "user", "content": json.dumps(tokens)})

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages
    )

    reply = response.choices[0].message.content
    return reply

In [101]:
query = "How do I create a new user container?"

rel_docs = get_relevant_files(query)

rel_content = get_file_content_from_filenames(rel_docs, preproc_docs)

In [102]:
formatted_rel_content = format_gpt_input(rel_content)
# Calls the OpenAI API; the next cell needs `response` to be defined
response = run_gpt(query, formatted_rel_content)

In [103]:
print(f"ChatGPT: {response}")

ChatGPT: To create a new user container, navigate to your organization settings, select the "User" tab, and click "Add User." On the next page, you can enter the new user's credentials such as their name, username, email, and password. You can also choose to add their location (optional). Once you have completed this form, click "Create Account" and the new user will be listed in the User tab.

If you need to add several new users, you can use the mass-importing method. Navigate to the organization settings and click the "User" tab, then choose "Import User". Follow the steps and use the CSV template provided to input the new user's information. Once completed, click "Import User," and the new users will be listed in the User tab. 

For troubleshooting, please note that sometimes CSV files can cause issues with compatibility. MacOS, for example, sometimes saves CSV files as a ".numbers" file or prompts to save it as an Excel format file instead of CSV. However, please note that the use