# Semantic Search Word2Vec Model First Draft

This Jupyter notebook introduces reading GitHub `.md` documentation and analyzing it...

## Phase 1: Documentation Data Reading and Pre-Processing

### Step 1: Reading and Storing the Documentation Data

In this section, we'll read the markdown (`.md`) documentation, collect it, and store it for processing. We do this by walking through all of the `.md` files in a directory, reading each one as plain text, and storing the results.

In [52]:
import doc_reader as reader

doc_data = reader.collect_doc_data("docs/docs")
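
The `doc_reader` module is not shown in this notebook, so the following is only a minimal sketch of what a directory reader like this could look like. The function name `collect_markdown` is hypothetical, and keying the result by file name is an assumption based on the file names printed at the end of the notebook.

```python
from pathlib import Path


def collect_markdown(root: str) -> dict[str, str]:
    """Map each .md file name under root to its raw text (hypothetical helper)."""
    docs = {}
    for path in sorted(Path(root).rglob("*.md")):
        # errors="replace" keeps one badly encoded file from aborting the run.
        docs[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return docs
```

Because the keys are bare file names, two files with the same name in different subdirectories would overwrite each other; keying by relative path would avoid that.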

### Step 2: Cleaning the Documentation Data

In this section, we'll take the documentation data collected in Step 1 and clean it up so we can use it. Cleaning can include removing HTML tags, removing punctuation and special characters, collapsing extra whitespace, lowercasing all text for semantic searching, and catching any misspellings in the documentation.

In [53]:
import md_cleaner as cleaner

cleaned_doc_data = cleaner.clean_doc_data(doc_data)
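
The `md_cleaner` module is not shown here, so this is only a rough sketch of the cleaning steps listed above, using the standard library. The name `basic_clean` and the regular expressions are assumptions, and spelling correction is left out.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")
SPACE_RE = re.compile(r"\s+")


def basic_clean(text: str) -> str:
    """Lowercase text and strip HTML tags, punctuation, and extra whitespace."""
    text = TAG_RE.sub(" ", text)       # drop HTML tags
    text = html.unescape(text)         # turn entities such as "&amp;" into "&"
    text = text.lower()
    text = NON_WORD_RE.sub(" ", text)  # drop punctuation and special characters
    return SPACE_RE.sub(" ", text).strip()


print(basic_clean("<p>Hello,   <b>World</b>!</p>"))
```

Replacing punctuation with a space rather than deleting it keeps hyphenated terms such as `set-up` from fusing into a single token.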

### Step 3: Pre-processing the Documentation Data

In this section, we'll take the cleaned documentation data from Step 2 and pre-process it with tokenization, stemming or lemmatization, and stop-word removal.

In [54]:
import md_preprocessor as preprocessor

preproc_docs = preprocessor.preprocess_doc_data(cleaned_doc_data)
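
The `md_preprocessor` module is not shown either. As a hedged illustration of the three steps, the sketch below tokenizes on whitespace, drops a tiny hand-written stop-word list, and applies a crude suffix stripper. All names here are hypothetical, and a real pipeline would more likely use an established stemmer or lemmatizer and a fuller stop-word list.

```python
# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "into", "what"}
# Crude stand-in for a real stemmer; checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token: str) -> str:
    """Strip one common suffix, keeping at least three characters of the stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def basic_preprocess(text: str) -> list[str]:
    """Tokenize on whitespace, drop stop words, then stem what remains."""
    return [naive_stem(t) for t in text.split() if t not in STOP_WORDS]


print(basic_preprocess("creating the clusters"))
```

The input is assumed to be already cleaned and lowercased, as in Step 2.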

## Phase 2: Implementing Semantic Search with Word2Vec

Now, we can use `Gensim` and its Word2Vec algorithm to implement semantic search over the cleaned and pre-processed documentation data. Word2Vec maps words to dense vector representations in a high-dimensional space, placing words that appear in similar contexts near one another.

In [55]:
import gensim.downloader

pretrained_model = gensim.downloader.load('word2vec-google-news-300')

In [56]:
import gensim
from gensim.models import Word2Vec
import numpy as np

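# Each value in preproc_docs is one document's list of tokens.
corpus = list(preproc_docs.values())
# Train a Word2Vec model from scratch on the documentation corpus.
model = Word2Vec(corpus, vector_size=500, window=5, min_count=5, workers=4)
# Extend the vocabulary with the words of the pretrained model.
# Caveat: key_to_index maps each word to its index, not its frequency, and no
# pretrained vectors are copied here (the pretrained model is 300-dimensional,
# this one 500-dimensional).
model.build_vocab_from_freq(pretrained_model.key_to_index, corpus_count=len(corpus), update=True)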

In [57]:
document_embeddings = {}

for filename, tokens in preproc_docs.items():
    embeddings = [model.wv[word] for word in tokens if word in model.wv]
    if embeddings:
        document_embeddings[filename] = gensim.matutils.unitvec(np.mean(embeddings, axis=0))
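
To make the averaging step concrete, here is a small dependency-free sketch of what the cell above computes for one document: the element-wise mean of its word vectors, scaled to unit length. The function name and the 2-dimensional toy vectors are made up for illustration; the notebook itself relies on `np.mean` and `gensim.matutils.unitvec`.

```python
import math


def mean_unit_vector(vectors: list[list[float]]) -> list[float]:
    """Average equal-length vectors element-wise, then scale the result to length 1."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Leave an all-zero mean untouched rather than divide by zero.
    return mean if norm == 0 else [x / norm for x in mean]


# Hypothetical 2-dimensional word vectors for one document.
toy_word_vectors = [[3.0, 0.0], [0.0, 4.0], [3.0, 4.0]]
doc_vector = mean_unit_vector(toy_word_vectors)
print(doc_vector)
```

Averaging discards word order and gives every token equal weight, which is simple but can make long documents look alike.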

In [58]:
def run_query(query_str: str):
    """
    Performs a similarity search on the inputted query against the given Word2Vec model.

    Parameters:
        query_str (str) : user query
    
    Returns:
        similar_docs (list[str]) : most similar docs in order
    """
    corrected_query = cleaner.correct_spelling(query_str)
    query_tokens = preprocessor.preprocess_str(cleaner.clean_str(corrected_query))
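    # Collect one vector per query token that the model knows.
    token_vectors = [model.wv[token] for token in query_tokens if token in model.wv]
    # If no query token is in the vocabulary, this list is empty and the mean below
    # is undefined; get_relevant_files catches the resulting TypeError.
    query_embedding = gensim.matutils.unitvec(np.mean(token_vectors, axis=0))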

    similar_docs = []
    for filename, doc_embedding in document_embeddings.items():
        similarity_score = np.dot(doc_embedding, query_embedding)
        similar_docs.append((filename, similarity_score))
    
    similar_docs = sorted(similar_docs, key=lambda x: x[1], reverse=True)

    return similar_docs
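
The scoring loop in `run_query` works because both the query and document embeddings have unit length, so their dot product equals their cosine similarity. The sketch below shows the same ranking idea on hypothetical 2-dimensional embeddings, without NumPy; the function and file names are illustrative only.

```python
def rank_by_similarity(query_vec, doc_vecs):
    """Return (name, score) pairs sorted from most to least similar.

    Assumes every vector already has unit length, so the dot product
    equals the cosine similarity.
    """
    scores = [
        (name, sum(q * d for q, d in zip(query_vec, vec)))
        for name, vec in doc_vecs.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical unit-length embeddings for three files.
toy_docs = {
    "storage.md": [1.0, 0.0],
    "clusters.md": [0.0, 1.0],
    "aws.md": [0.6, 0.8],
}
ranking = rank_by_similarity([0.6, 0.8], toy_docs)
print([name for name, _ in ranking])
```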

In [59]:
def get_relevant_files(query, top_k=5, include_score=False, verbose=False):
    """
    Gets the top 'k' relevant files from an inputted query. Defaults to top
    5 most relevant files.

    Parameters:
        query (str) : question to search PW documentation for
        top_k (int) : top 'k' most relevant files to return (default: 5)
        include_score (bool) : if True, includes similarity score of file
        verbose (bool) : if True, prints files in addition to returning
    
    Returns:
        rel_files (list) : top 'k' most relevant files
    """
    try:
        similar_docs = run_query(query)
    except TypeError:
        print("Your query does not match anything in our system.")
        return []

    if include_score:
        rel_files = similar_docs[:top_k]
        if verbose:
            print(f"Top {top_k} most relevant files to your query with similarity scores included:\n")
            for i, (file, sim_score) in enumerate(rel_files):
                print(f"{i + 1}. {file}: {sim_score}")
        return rel_files
    else:
        rel_files = [filename for filename, _ in similar_docs[:top_k]]
        if verbose:
            print(f"Top {top_k} most relevant files to your query:\n")
            for i, file in enumerate(rel_files):
                print(f"{i + 1}. {file}")
    return rel_files
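
As a possible alternative to sorting every document and slicing, the top `k` entries can be selected directly with `heapq.nlargest` from the standard library. This is only a sketch: `top_k_files` is a hypothetical helper that takes a list of `(filename, score)` pairs like the one `run_query` returns.

```python
import heapq


def top_k_files(scored, top_k=5, include_score=False):
    """Pick the top_k highest-scoring (filename, score) pairs."""
    best = heapq.nlargest(top_k, scored, key=lambda pair: pair[1])
    return best if include_score else [name for name, _ in best]


toy_scores = [("a.md", 0.2), ("b.md", 0.9), ("c.md", 0.5)]
print(top_k_files(toy_scores, top_k=2))
```

For a documentation set of this size the difference is negligible; the main benefit is that the selection logic lives in one place for both return formats.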

In [60]:
# for i, val in enumerate(preproc_docs.values()):
#     print(f"{i}. {len(val)}")

In [63]:
query = "What is the best way to authenticate into an S3 bucket?"

get_relevant_files(query, include_score=True, verbose=True);

Top 5 most relevant files to your query with similarity scores included:

1. transferring-data-aws.md: 0.9999222755432129
2. creating-storage.md: 0.9999147653579712
3. creating-clusters.md: 0.9999105334281921
4. configuring-storage.md: 0.9999102354049683
5. configuring-clusters.md: 0.9999101161956787
