# Semantic Search Model First Draft

This Jupyter notebook is meant to serve as an introduction to reading GitHub `.md` documentation and analyzing it...

## Phase 1: Documentation Data Reading and Pre-Processing

### Step 1: Reading and Storing the Documentation Data

In this section, we'll read the markdown (`.md`) files, convert each one to plain text, and store the result for later processing. To do this, we walk through a directory, read every `.md` file we find, and keep its plain text.

In [161]:
import markdown
from bs4 import BeautifulSoup

def read_md_file(filepath: str) -> str:
    """
    Reads a markdown file and processes it into a string of plain text.

    Parameters:
        filepath (str) : the filepath of the markdown file to read
    
    Returns:
        text (str) : the plain text from the inputted markdown file
    """
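    with open(filepath, 'r', encoding='utf-8') as f:
        content = f.read()
        # Render the markdown to HTML, then keep only the text nodes.
        # Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning.
        html = markdown.markdown(content)
        text = ''.join(BeautifulSoup(html, 'html.parser').find_all(string=True))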

    return text


In [162]:
import os

def collect_doc_data(directory: str) -> dict[str, str]:
    """
    Scans through a directory and collects the documentation data from all
    '.md' files into a dictionary keyed by file name.

    Parameters:
        directory (str) : directory to scan through
    
    Returns:
        doc_data (dict[str, str]) : plain text of each `.md` file, keyed by file name
    """
    doc_data = {}
    for dirpath, _, filenames in os.walk(directory):
        for file in filenames:
            if file.endswith('.md'):
                file_path = os.path.join(dirpath, file)
                text = read_md_file(file_path)
                doc_data[file] = text
    
    return doc_data
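
One thing to watch in `collect_doc_data`: because entries are keyed by bare file name, two files with the same name in different subdirectories (for example, two `index.md` files) would overwrite each other. Below is a minimal sketch of one way around this, assuming it is acceptable to key by the path relative to the scanned directory. `collect_paths` is a hypothetical helper, not part of the notebook, and it reads raw text rather than converting markdown so that it runs with the standard library alone.

```python
import os
import tempfile

def collect_paths(directory: str) -> dict[str, str]:
    """Map each '.md' file's path (relative to `directory`) to its raw contents."""
    found = {}
    for dirpath, _, filenames in os.walk(directory):
        for name in filenames:
            if name.endswith('.md'):
                full_path = os.path.join(dirpath, name)
                # Relative paths stay unique even when file names repeat.
                key = os.path.relpath(full_path, directory)
                with open(full_path, encoding='utf-8') as f:
                    found[key] = f.read()
    return found

# Build a throwaway tree containing two files that share a name.
with tempfile.TemporaryDirectory() as root:
    for sub in ('storage', 'clusters'):
        os.makedirs(os.path.join(root, sub))
        with open(os.path.join(root, sub, 'index.md'), 'w', encoding='utf-8') as f:
            f.write(f'# {sub}\n')
    docs = collect_paths(root)

print(sorted(docs))  # two entries, one per subdirectory
```

Keying by file name alone would have kept only one of the two `index.md` files here.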


In [163]:
doc_data = collect_doc_data("docs/docs")



### Step 2: Cleaning the Documentation Data

In this section, we'll take the documentation data collected in Step 1 and clean it up so we can use it. This could include removing HTML tags, removing punctuation and special characters, collapsing extra whitespace, lowercasing the text for semantic searching, and catching any misspellings in the documentation.

In [164]:
import re

def _remove_whitespace(input_str: str) -> str:
    """
    Collapses each run of whitespace in an input string into a single space
    and trims both ends.
    """
    return ' '.join(input_str.split())

def _lower_str(input_str: str) -> str:
    """
    Converts the input string to lower case.
    """
    return input_str.lower()

def _remove_punct_and_special_chars(input_str: str) -> str:
    """
    Removes punctuation and special characters using Regex
    """
    pattern = r'[^\w\s]'
    return re.sub(pattern, '', input_str)


def _filter_sidebar_pos(input_str: str) -> str:
    """
    Removes sidebar positioning.
    """
    pattern = r"sidebar_position(?: \d+)? "
    return re.sub(pattern, "", input_str)

def clean_str(input_str: str) -> str:
    """
    Applies data cleaning measures to input string, including removing whitespaces,
    lowercasing the string, removing punctuations and special characters, and
    filtering sidebar positioning from .md files.

    Parameters:
        input_str (str) : inputted string to be cleaned
    
    Returns:
        cleaned_str (str) : cleaned string
    """
    cleaning_funcs = [_remove_whitespace, _lower_str,
                        _remove_punct_and_special_chars, _filter_sidebar_pos]
    
    cleaned_str = input_str
    for func in cleaning_funcs:
        cleaned_str = func(cleaned_str)

    return cleaned_str

def clean_doc_data(doc_data: dict[str, str]) -> dict[str, str]:
    """
    Cleans the documentation data by applying `clean_str` to every document:
    collapsing extra whitespace, lowercasing, removing punctuation and special
    characters, and filtering sidebar positioning.

    Parameters:
        doc_data (dict[str, str]) : the collected `.md` data, keyed by file name
    
    Returns:
        cleaned_doc_data (dict[str, str]) : the cleaned documentation data
    """
    return {file : clean_str(content) for file, content in doc_data.items()}

In [165]:
cleaned_doc_data = clean_doc_data(doc_data)
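
To make the effect of the cleaning steps concrete, here is a small worked example. It is a sketch only: `clean_sketch` is a hypothetical stand-in that repeats the four steps of `clean_str` in the same order so it can run on its own, and the sample text is made up for illustration.

```python
import re

def clean_sketch(text: str) -> str:
    """Stand-in for clean_str: the same four steps, in the same order."""
    text = ' '.join(text.split())                           # collapse whitespace
    text = text.lower()                                     # lowercase
    text = re.sub(r'[^\w\s]', '', text)                     # drop punctuation
    text = re.sub(r"sidebar_position(?: \d+)? ", "", text)  # drop sidebar entry
    return text

# Made-up plain text, as it might look after the markdown conversion in Step 1.
sample = 'sidebar_position: 3\nMonitoring Costs\n\nOpen the   Cost dashboard, then click "Usage"!'

print(clean_sketch(sample))
# monitoring costs open the cost dashboard then click usage
```

Note that the underscore in `sidebar_position` survives the punctuation step because `\w` matches it, which is what lets the last pattern find the entry.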

### Step 3: Pre-processing the Documentation Data

In this section, we'll take our cleaned documentation data from Step 2 and pre-process it through tokenization, stemming or lemmatization, and stop-word removal.

In [166]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

def tokenize_str(input_str: str) -> list[str]:
    """
    Tokenize an input string.
    """
    return word_tokenize(input_str)

def _remove_stopwords(tokens: list[str]) -> list[str]:
    """
    Remove stop-words from a list of tokens.
    """
    stop_words = set(stopwords.words('english'))
    return [token for token in tokens if token not in stop_words]

def _stem_tokens(tokens: list[str]) -> list[str]:
    """
    Stems tokens.
    """
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]

def _lemmatize_tokens(tokens: list[str]) -> list[str]:
    """
    Lemmatizes tokens.
    """
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in tokens]

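def _clean_tokens(tokens: list[str]) -> list[str]:
    """
    Strips any remaining non-alphanumeric characters from each token and
    drops tokens that end up empty.
    """
    stripped = (re.sub(r'[^a-zA-Z0-9]', '', token) for token in tokens)
    return [token for token in stripped if token]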

def preprocess_str(cleaned_str: str) -> list[str]:
    """
    Helper function to preprocess an inputted string via tokenization, stemming or 
    lemmatizing, and stop-word removal.

    Parameters:
        cleaned_str (str) : a pre-cleaned string
    
    Returns:
        preproc_tokens (list[str]) : preprocessed tokens of a string
    """
    tokens = tokenize_str(cleaned_str)

    preproc_funcs = [_remove_stopwords, _lemmatize_tokens, _clean_tokens] 

    preproc_tokens = tokens
    for func in preproc_funcs:
        preproc_tokens = func(preproc_tokens)
    
    return preproc_tokens

def preprocess_doc_data(cleaned_doc_data: dict[str, str]) -> dict[str, list[str]]:
    """
    Preprocesses the full documentation data set by tokenizing each document,
    then removing stop words and lemmatizing the tokens with WordNetLemmatizer.

    Parameters:
        cleaned_doc_data (dict[str, str]) : cleaned documentation data, keyed by file name
    
    Returns:
        preproc_doc_data (dict[str, list[str]]) : pre-processed tokens for each document
    """
    return {file : preprocess_str(content) for file, content in cleaned_doc_data.items()}

In [167]:
preproc_docs = preprocess_doc_data(cleaned_doc_data)
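
As a rough illustration of what pre-processing does to a cleaned string, the sketch below runs a simplified version of the pipeline on a cleaned query. It is not the NLTK pipeline above: it assumes whitespace tokenization, uses a deliberately tiny hypothetical stop-word list (`TOY_STOP_WORDS`), and skips lemmatization, so it only shows the shape of the result.

```python
import re

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is much longer.
TOY_STOP_WORDS = {"the", "a", "to", "of", "and", "for", "do", "i", "my"}

def preprocess_sketch(cleaned: str) -> list[str]:
    """Tokenize on whitespace, drop stop words, and strip non-alphanumerics."""
    tokens = cleaned.split()
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    tokens = [re.sub(r'[^a-zA-Z0-9]', '', t) for t in tokens]
    return [t for t in tokens if t]  # discard tokens left empty by the previous step

print(preprocess_sketch("where do i look for a question about monitoring my costs"))
# ['where', 'look', 'question', 'about', 'monitoring', 'costs']
```

With the full NLTK stop-word list, more of the filler words would be removed, leaving mostly the content-bearing tokens.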

## Phase 2: Implementing Semantic Search with Word2Vec

Now we can use `Gensim`'s Word2Vec implementation to run semantic search over the cleaned and pre-processed documentation data. Word2Vec maps each word in its vocabulary to a dense vector, so that words used in similar contexts end up close together in the vector space.
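
Before training a real model, it may help to see the search idea on toy data. The sketch below assumes hand-written 3-dimensional vectors (`toy_vectors`) in place of trained Word2Vec vectors and uses plain Python lists instead of NumPy. Each document is represented by the unit-length average of its word vectors, and documents are ranked by cosine similarity to the query.

```python
import math

# Hypothetical 3-dimensional word vectors; a trained Word2Vec model would supply these.
toy_vectors = {
    "cost":    [0.9, 0.1, 0.0],
    "billing": [0.8, 0.2, 0.1],
    "cluster": [0.1, 0.9, 0.2],
    "node":    [0.0, 0.8, 0.3],
}

def embed(tokens: list[str]) -> list[float]:
    """Average the vectors of known tokens and scale the result to unit length."""
    known = [toy_vectors[t] for t in tokens if t in toy_vectors]
    if not known:
        raise ValueError("no token is in the vocabulary")
    mean = [sum(dim) / len(known) for dim in zip(*known)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]

def cosine(a: list[float], b: list[float]) -> float:
    """Dot product of two unit vectors, which equals their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

docs = {
    "costs.md": ["cost", "billing"],
    "clusters.md": ["cluster", "node"],
}

# Out-of-vocabulary tokens such as "unknownword" are simply skipped.
query = embed(["billing", "cost", "unknownword"])
ranked = sorted(docs, key=lambda name: cosine(embed(docs[name]), query), reverse=True)
print(ranked)
# ['costs.md', 'clusters.md']
```

Averaging is a simple way to turn word vectors into a document vector; it ignores word order and gives every word the same weight.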

In [168]:
# import gensim.downloader

# pretrained_model = gensim.downloader.load('word2vec-google-news-300')

In [169]:
import gensim
from gensim.models import Word2Vec
import numpy as np

corpus = list(preproc_docs.values())
model = Word2Vec(corpus, vector_size=500, window=5, min_count=5, workers=4)
# model.build_vocab_from_freq(pretrained_model.key_to_index, corpus_count=len(corpus), update=True)

In [170]:
document_embeddings = {}

for filename, tokens in preproc_docs.items():
    embeddings = [model.wv[word] for word in tokens if word in model.wv]
    if embeddings:
        document_embeddings[filename] = gensim.matutils.unitvec(np.mean(embeddings, axis=0))

In [171]:
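def run_query(query_str: str) -> list[tuple[str, float]]:
    """
    Performs a similarity search on the inputted query against the document
    embeddings built from the Word2Vec model.

    Parameters:
        query_str (str) : user query
    
    Returns:
        similar_docs (list[tuple[str, float]]) : (filename, similarity score)
            pairs, most similar first; empty if no query token is in the vocabulary
    """
    query_tokens = preprocess_str(clean_str(query_str))
    query_vectors = [model.wv[token] for token in query_tokens if token in model.wv]
    if not query_vectors:
        # Averaging an empty list would give NaN, so there is nothing to rank.
        return []
    query_embedding = gensim.matutils.unitvec(np.mean(query_vectors, axis=0))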

    similar_docs = []
    for filename, doc_embedding in document_embeddings.items():
        similarity_score = np.dot(doc_embedding, query_embedding)
        similar_docs.append((filename, similarity_score))
    
    similar_docs = sorted(similar_docs, key=lambda x: x[1], reverse=True)

    return similar_docs

In [172]:
def get_relevant_files(query, top_k=5, include_score=False, verbose=False):
    """
    Gets the top 'k' relevant files from an inputted query. Defaults to top
    5 most relevant files.

    Parameters:
        query (str) : question to search PW documentation for
        top_k (int) : top 'k' most relevant files to return (default: 5)
        include_score (bool) : if True, includes similarity score of file
        verbose (bool) : if True, prints files in addition to returning
    
    Returns:
        rel_files (list) : top 'k' most relevant files
    """
    similar_docs = run_query(query)

    if include_score:
        rel_files = similar_docs[:top_k]
        if verbose:
            print(f"Top {top_k} most relevant files to your query with similarity scores included:\n")
            for i, (file, sim_score) in enumerate(rel_files):
                print(f"{i + 1}. {file}: {sim_score}")
        return rel_files
    else:
        rel_files = [filename for filename, _ in similar_docs[:top_k]]
        if verbose:
            print(f"Top {top_k} most relevant files to your query:\n")
            for i, file in enumerate(rel_files):
                print(f"{i + 1}. {file}")
    return rel_files

In [174]:
query = "Where do I look for a question about monitoring my costs?"

get_relevant_files(query, include_score=True, verbose=True);

Top 5 most relevant files to your query with similarity scores included:

1. configuring-storage.md: 0.9999051094055176
2. configuring-clusters.md: 0.99989914894104
3. navigating-the-admin-panel.md: 0.9998986124992371
4. logging-in-controller.md: 0.9998984932899475
5. configuring-instances.md: 0.9998974800109863
