# Semantic Search Word2Vec Model First Draft

This Jupyter notebook introduces reading GitHub `.md` documentation and analyzing it...

In [7]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append("..")

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Phase 1: Documentation Data Reading and Pre-Processing

### Step 1: Reading and Storing the Documentation Data

In this section, we'll read the markdown (`.md`) documentation files, collect their contents, and store them for processing. To do this, we walk through every `.md` file in a directory, read each one as plain text, and store the result.
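
The `doc_reader` module itself is not shown in this notebook, so the following is only a minimal sketch of what such a reader might look like. The function name `collect_markdown` and the choice to map each file path to a single string are assumptions for illustration; the real `collect_doc_data` may store its data differently.

```python
from pathlib import Path


def collect_markdown(root: str) -> dict[str, str]:
    """Map each .md file under `root` (searched recursively) to its raw text.

    Hypothetical helper, not the notebook's actual `collect_doc_data`.
    """
    docs = {}
    for path in sorted(Path(root).rglob("*.md")):
        # errors="replace" keeps one badly encoded file from aborting the run
        docs[str(path)] = path.read_text(encoding="utf-8", errors="replace")
    return docs
```

With string values like these, a helper that sums `len(v)` over the dictionary would report the total number of characters read.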

In [8]:
import doc_reader as reader

doc_data = reader.collect_doc_data("docs/docs")

In [9]:
def dict_len(doc_data: dict) -> int:
    # Total length across all values; a generator avoids shadowing the built-in `sum`
    return sum(len(v) for v in doc_data.values())

In [10]:
dict_len(doc_data)

0

### Step 2: Cleaning the Documentation Data

In this section, we'll take the documentation data collected in Step 1 and clean it so we can use it. Cleaning can include removing HTML tags, punctuation, and special characters, collapsing extra whitespace, lowercasing all text for semantic searching, and catching misspellings in the documentation.
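
As a rough illustration of the cleaning steps listed above, here is a minimal sketch using only the standard library. The function name `clean_text` is hypothetical and is not the notebook's `md_cleaner` implementation; spell checking is left out because it needs an external dictionary.

```python
import re
import string

# Translation table that turns every ASCII punctuation character into a space
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(raw: str) -> str:
    """Lowercase text and strip HTML tags, punctuation, and extra whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)      # drop HTML tags such as <br> or <div ...>
    text = text.lower()
    text = text.translate(_PUNCT_TO_SPACE)   # also removes markdown symbols like # and *
    return re.sub(r"\s+", " ", text).strip() # collapse runs of whitespace


print(clean_text("<b>Install</b> the CLI!"))  # install the cli
```

Replacing punctuation with spaces, rather than deleting it, keeps words on either side of a symbol from being glued together, at the cost of splitting contractions such as "don't" into two tokens.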

In [11]:
import md_cleaner as cleaner

cleaned_doc_data = cleaner.clean_doc_data(doc_data)

In [12]:
dict_len(cleaned_doc_data)

0

### Step 3: Pre-processing the Documentation Data

In this section, we'll take the cleaned documentation data from Step 2 and pre-process it through tokenization, stemming, lemmatization, and stop-word removal.
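
The sketch below shows two of these steps, tokenization and stop-word removal, in plain Python. The function name `preprocess` and the tiny stop-word list are assumptions for illustration only. Stemming and lemmatization are deliberately omitted here, since in practice they usually come from an NLP library rather than hand-written rules.

```python
import re

# Tiny illustrative stop-word list; real pipelines use a much larger one
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def preprocess(text: str) -> list[str]:
    """Split text into lowercase alphanumeric tokens and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(preprocess("The workflow of a user is in the docs"))
# ['workflow', 'user', 'docs']
```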

In [13]:
import md_preprocessor as preprocessor

preproc_docs = preprocessor.preprocess_doc_data(cleaned_doc_data)

In [14]:
dict_len(preproc_docs)

0

## Phase 2: Implementing Semantic Search with Word2Vec

Now we can use the Word2Vec algorithm from `Gensim` to implement semantic search over the cleaned and pre-processed documentation data. Word2Vec maps words and phrases to dense vector representations in a high-dimensional space, where terms used in similar contexts lie close together.
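
How `w2v_semantic_search` ranks files is not shown in this notebook. One common approach, sketched below, is to average the word vectors of the query and of each document, then rank documents by cosine similarity to the query. The three-dimensional vectors in `TOY_VECTORS` are made up for illustration and stand in for a trained model; all function names are hypothetical.

```python
import math

# Made-up 3-dimensional vectors, purely for illustration; a trained
# Word2Vec model would supply much higher-dimensional vectors.
TOY_VECTORS = {
    "workflow": [0.9, 0.1, 0.0],
    "pipeline": [0.8, 0.2, 0.1],
    "user": [0.1, 0.9, 0.0],
    "account": [0.2, 0.8, 0.1],
    "install": [0.0, 0.1, 0.9],
}


def mean_vector(tokens):
    """Average the vectors of known tokens; return None if none are known."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_docs(query_tokens, docs):
    """Return (name, score) pairs sorted from most to least similar."""
    q = mean_vector(query_tokens)
    scored = []
    for name, tokens in docs.items():
        d = mean_vector(tokens)
        # Skip documents (or queries) with no tokens in the vocabulary
        if q is not None and d is not None:
            scored.append((name, cosine(q, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = {"a.md": ["workflow", "account"], "b.md": ["install"]}
for name, score in rank_docs(["workflow", "user"], docs):
    print(f"{name}: {score:.3f}")
```

Averaging discards word order, which makes it a simple baseline rather than a full semantic model, but it lets a query match documents that use related words instead of the exact query terms.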

In [15]:
import w2v_semantic_search as semsearch
import helper_funcs as helper

In [23]:
model = helper.load_w2v("../../models/word2vec_model.bin")

In [24]:
query = "Workflow user"

rel_docs = semsearch.get_relevant_files(query, preproc_docs, model, include_score=False, verbose=True)

Top 5 most relevant files to your query:

