## Documentation Data Reader

This Jupyter notebook introduces reading GitHub Markdown (`.md`) documentation and analyzing it.

### Step 1: Reading and Storing the Documentation Data

In this section, we'll read the Markdown (`.md`) files, collect their contents, and store them for processing. To do this, we walk through every `.md` file in a directory, convert each one to plain text, and keep the results in a list.

In [28]:
import markdown
from bs4 import BeautifulSoup

def read_md_file(filepath: str) -> str:
    """
    Reads a markdown file and processes it into a string of plain text.

    Parameters:
        filepath (str) : the filepath of the markdown file to read
    
    Returns:
        text (str) : the plain text from the inputted markdown file
    """
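    # Assumes the files are UTF-8 encoded; adjust the encoding if yours differ.
    with open(filepath, 'r', encoding='utf-8') as f:
        content = f.read()

    # Convert Markdown to HTML, then strip the tags to get plain text.
    # The parser is named explicitly, and get_text() replaces the older
    # findAll(text=True) form.
    html = markdown.markdown(content)
    text = BeautifulSoup(html, 'html.parser').get_text()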

    return text


In [23]:
text1 = read_md_file("../proj-overview.md")


In [7]:
import os

def collect_doc_data(directory: str) -> list[str]:
    """
    Scans through a directory and collects the documentation data from all
    '.md' files into a list.

    Parameters:
        directory (str) : directory to scan through
    
    Returns:
        doc_data (list[str]) : documentation data from `.md` files
    """
    doc_data = []
    for dirpath, _, filenames in os.walk(directory):
        for file in filenames:
            if file.endswith('.md'):
                file_path = os.path.join(dirpath, file)
                text = read_md_file(file_path)
                doc_data.append(text)
    
    return doc_data


In [34]:
doc_data = collect_doc_data("../docs/docs")
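
Step 1 talks about storing the collected text, but `doc_data` lives only in memory. If it needs to outlive the notebook session, one option is to write it to disk. The sketch below assumes a JSON file is an acceptable format; `save_doc_data`, `load_doc_data`, and the file name `doc_data.json` are hypothetical and not part of the original notebook.

```python
import json
import tempfile
from pathlib import Path


def save_doc_data(doc_data: list[str], out_path: str) -> None:
    """Hypothetical helper: write the list of document strings to a JSON file."""
    Path(out_path).write_text(
        json.dumps(doc_data, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def load_doc_data(in_path: str) -> list[str]:
    """Hypothetical helper: read the list of document strings back from JSON."""
    return json.loads(Path(in_path).read_text(encoding="utf-8"))


# Round-trip a small stand-in list through a temporary directory.
sample_docs = ["First document text.", "Second document text."]
with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = str(Path(tmp_dir) / "doc_data.json")
    save_doc_data(sample_docs, out_file)
    restored_docs = load_doc_data(out_file)

print(restored_docs == sample_docs)
```

JSON keeps the list structure and handles non-ASCII text, which is why it is used here; any other serialization format would work just as well.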


### Step 2: Cleaning the Documentation Data

In this section, we'll take the documentation data collected in Step 1 and clean it so that it is ready to use. Cleaning could include removing HTML tags, removing punctuation and special characters, collapsing extra whitespace, lowercasing all text for semantic searching, and catching any misspellings in the documentation.

In [42]:
import re

def remove_punct_and_special_chars(input_str: str) -> str:
    """
    Removes punctuation and special characters using a regular expression.
    Letters, digits, underscores, and whitespace are kept.
    """
    return re.sub(r'[^\w\s]', '', input_str)

def remove_whitespace(input_str: str) -> str:
    """
    Collapses runs of whitespace into single spaces and trims both ends.
    """
    return ' '.join(input_str.split())

def clean_str(input_str: str) -> str:
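    """
    Applies data cleaning to input string: collapses whitespace, lowercases,
    then removes punctuation and special characters.

    Note: because punctuation is removed last, dropping a standalone symbol
    can leave two consecutive spaces in the result.
    """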
    return remove_punct_and_special_chars(remove_whitespace(input_str).lower())

def clean_doc_data(doc_data: list[str]) -> list[str]:
    """
    Clean the documentation data by removing punctuation and special
    characters, collapsing extra whitespace, and making everything lowercase.
    HTML tags are already stripped by `read_md_file`; catching misspelled
    words is not implemented here.

    Parameters:
        doc_data (list[str]) : the collected and read `.md` data
    
    Returns:
        cleaned_doc_data (list[str]) : the cleaned documentation data
    """
    return [clean_str(i) for i in doc_data]

In [43]:
cleaned_doc_data = clean_doc_data(doc_data)

In [44]:
cleaned_doc_data

['sidebar_position 6 working with cloud snapshots cloud snapshots let you make persistent changes to the operating system image running on your cluster nodes by using an image provided by parallel works as a base you can stage automations install additional software or enable additional services creating cloud snapshots navigate to your account settings username  account in profile settings click cloud snapshots in cloud snapshots click new snapshot snapshot configuration settings there are several configurable parameters for cloud snapshots which are outlined below type use this dropdown menu to select whether your snapshot will be built for aws azure or google cloud account use this dropdown menu to select which cloud account will provision your snapshot for most users this menu should be left as the default option unless youre a member of multiple organizations group use this dropdown menu to select the group name that your organization uses to allocate costs this menu is especially
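
Step 2 lists catching misspellings as a possible cleaning step, but the functions above do not attempt it. One way it could be approached, sketched here with only the standard library, is to compare each word against a list of known terms using `difflib.get_close_matches`. The function name `suggest_corrections`, the sample vocabulary, and the `0.8` similarity cutoff are all assumptions made for illustration.

```python
import difflib


def suggest_corrections(
    text: str, vocabulary: set[str], cutoff: float = 0.8
) -> dict[str, str]:
    """
    Hypothetical helper: map each unknown word in `text` to its closest
    match in `vocabulary`, if one is similar enough.
    """
    suggestions = {}
    candidates = sorted(vocabulary)  # get_close_matches expects a sequence
    for word in set(text.split()):
        if word in vocabulary:
            continue  # already a known term
        matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
        if matches:
            suggestions[word] = matches[0]
    return suggestions


# Illustrative vocabulary; a real one might be built from the corpus itself.
known_terms = {"cloud", "snapshot", "cluster", "settings"}
corrections = suggest_corrections("cloud snapshot clutser setings", known_terms)
print(corrections)
```

This only reports suggestions and leaves the text untouched, since automatic replacement can easily corrupt product names and technical terms.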

### Step 3: Pre-processing the Documentation Data

In this section, we'll take the cleaned documentation data from Step 2 and pre-process it with tokenization, stemming or lemmatization, and stop-word removal.

In [10]:
def preprocess_doc_data(cleaned_doc_data: list[str]) -> list[str]:
    """
    Preprocess the documentation data by tokenization, stemming or
    lemmatization, and stop-word removal.

    Parameters:
        cleaned_doc_data (list[str]) : the cleaned documentation data
    
    Returns:
        proc_doc_data (list[str]) : preprocessed documentation data
    """
    raise NotImplementedError
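
The function above is still a stub. As a rough illustration of the three steps it names, the sketch below uses only the standard library: whitespace tokenization, a tiny hand-written stop-word list, and a crude suffix-stripping stemmer. None of this comes from the original notebook. `preprocess_doc_data_sketch`, `naive_stem`, `STOP_WORDS`, and `SUFFIXES` are hypothetical names, and the stemmer stands in for a real stemming or lemmatization algorithm rather than implementing one.

```python
# Assumption: a tiny illustrative stop-word list; a real one would be much larger.
STOP_WORDS = {
    "a", "an", "the", "and", "or", "of", "to", "in", "on",
    "is", "are", "for", "by", "as", "you", "your",
}

# Assumption: a few common English suffixes, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess_doc_data_sketch(cleaned_doc_data: list[str]) -> list[str]:
    """Tokenize on whitespace, drop stop words, stem, and rejoin each document."""
    processed = []
    for doc in cleaned_doc_data:
        tokens = doc.split()                                  # tokenization
        tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
        tokens = [naive_stem(t) for t in tokens]              # crude stemming
        processed.append(" ".join(tokens))
    return processed


sample = ["creating cloud snapshots in the cluster"]
print(preprocess_doc_data_sketch(sample))
```

Each document is rejoined into a single string so the result matches the `list[str]` return type declared by the stub; returning lists of tokens instead would be an equally reasonable design.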