## Documentation Data Reader

This Jupyter notebook serves as an introduction to reading GitHub `.md` documentation and analyzing it.

### Step 1: Reading and Storing the Documentation Data

In this section, we'll read the markdown (`.md`) files, collect their contents, and store them for processing. We do this by walking through every `.md` file in a directory, converting each one to plain text, and storing the results in a list.

In [18]:
import markdown
from bs4 import BeautifulSoup

def read_md_file(filepath: str) -> str:
    """
    Reads a markdown file and processes it into a string of plain text.

    Parameters:
        filepath (str) : the filepath of the markdown file to read
    
    Returns:
        text (str) : the plain text from the inputted markdown file
    """
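    with open(filepath, 'r', encoding='utf-8') as f:
        content = f.read()

    # Render the markdown to HTML, then keep only the text nodes.
    # Naming the parser avoids bs4's "no parser specified" warning, and
    # find_all(string=True) replaces the deprecated findAll(text=True).
    html = markdown.markdown(content)
    text = ''.join(BeautifulSoup(html, 'html.parser').find_all(string=True))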

    return text


In [19]:
text1 = read_md_file("../proj-overview.md")
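
To build intuition for the "keep only the text nodes" step, here is a minimal standard-library sketch of the same idea. It swaps BeautifulSoup for `html.parser.HTMLParser`, so it is a stand-in rather than what `read_md_file` actually calls. The names `TextCollector`, `html_to_text`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every text node it sees and ignores the tags around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the text between tags, including bare newlines.
        self.chunks.append(data)


def html_to_text(html: str) -> str:
    """Join all text nodes of an HTML string (hypothetical helper)."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return ''.join(parser.chunks)


sample_html = "<h1>Cloud Snapshots</h1>\n<p>Click <strong>New Snapshot</strong>.</p>"
plain = html_to_text(sample_html)
print(repr(plain))
```

The tags disappear while the text and the newline between the two elements survive, which is why a later whitespace clean-up step is still useful.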



In [20]:
import os

def collect_doc_data(directory: str) -> list[str]:
    """
    Scans through a directory and collects the documentation data from all
    '.md' files into a list.

    Parameters:
        directory (str) : directory to scan through
    
    Returns:
        doc_data (list[str]) : documentation data from `.md` files
    """
    doc_data = []
    for dirpath, _, filenames in os.walk(directory):
        for file in filenames:
            if file.endswith('.md'):
                file_path = os.path.join(dirpath, file)
                text = read_md_file(file_path)
                doc_data.append(text)
    
    return doc_data


In [21]:
doc_data = collect_doc_data("../docs/docs")
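
One way to sanity-check the directory walk without touching real documentation is to run it against a throwaway tree. The sketch below is an illustration only: `find_md_files` is a hypothetical helper that uses the same `os.walk` and `.endswith('.md')` logic as `collect_doc_data` but returns paths instead of text, and the file names are made up. It sorts the paths because `os.walk` does not promise a particular file order.

```python
import os
import tempfile


def find_md_files(directory: str) -> list[str]:
    """Return the sorted paths of all '.md' files under `directory`."""
    paths = []
    for dirpath, _, filenames in os.walk(directory):
        for name in filenames:
            if name.endswith('.md'):
                paths.append(os.path.join(dirpath, name))
    return sorted(paths)


with tempfile.TemporaryDirectory() as root:
    # Build a tiny tree: two markdown files (one nested) and one other file.
    os.makedirs(os.path.join(root, 'guide'))
    for rel in ('intro.md', 'notes.txt', os.path.join('guide', 'setup.md')):
        with open(os.path.join(root, rel), 'w', encoding='utf-8') as f:
            f.write('# placeholder\n')

    found = [os.path.relpath(p, root) for p in find_md_files(root)]

print(found)
```

Only the two `.md` files are reported, including the one in the nested `guide` directory.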



### Step 2: Cleaning the Documentation Data

In this section, we'll take the documentation data collected in Step 1 and clean it up so we can use it. Cleaning can include removing HTML tags, removing punctuation and special characters, collapsing extra whitespace, lowercasing all text for semantic searching, and catching any misspellings in the documentation. The helpers below implement the whitespace, lowercasing, and punctuation steps, plus a filter for leftover `sidebar_position` entries.

In [22]:
import re

def _remove_whitespace(input_str: str) -> str:
    """
    Collapses runs of whitespace into single spaces and strips both ends.
    """
    return ' '.join(input_str.split())

def _lower_str(input_str: str) -> str:
    """
    Converts the input string to lowercase.
    """
    return input_str.lower()

def _remove_punct_and_special_chars(input_str: str) -> str:
    """
    Removes punctuation and special characters, i.e. anything that is not a
    word character (letters, digits, underscore) or whitespace.
    """
    pattern = r'[^\w\s]'
    return re.sub(pattern, '', input_str)


def _filter_sidebar_pos(input_str: str) -> str:
    """
    Removes leftover `sidebar_position <n>` front-matter entries.
    """
    pattern = r"sidebar_position \d+ "
    return re.sub(pattern, "", input_str)

def _clean_str(input_str: str) -> str:
    """
    Applies data cleaning to input string.
    """
    cleaning_funcs = [_remove_whitespace, _lower_str,
                        _remove_punct_and_special_chars, _filter_sidebar_pos]
    
    cleaned_str = input_str
    for func in cleaning_funcs:
        cleaned_str = func(cleaned_str)

    return cleaned_str

def clean_doc_data(doc_data: list[str]) -> list[str]:
    """
    Clean the documentation data by collapsing extra whitespace, making
    everything lowercase, removing punctuation and special characters, and
    filtering out `sidebar_position` entries.

    Parameters:
        doc_data (list[str]) : the collected and read `.md` data
    
    Returns:
        cleaned_doc_data (list[str]) : the cleaned documentation data
    """
    return [_clean_str(i) for i in doc_data]

In [23]:
cleaned_doc_data = clean_doc_data(doc_data)
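
To see what each cleaning step does, the following self-contained sketch applies the same four operations, in the same order as `_clean_str`, to a single made-up string and prints the intermediate results. The `steps` list and the sample text are illustrative assumptions, not part of the notebook's pipeline.

```python
import re

# Mirrors the order used by _clean_str: collapse whitespace, lowercase,
# strip punctuation, then drop the sidebar_position entry.
steps = [
    ("collapse whitespace", lambda s: ' '.join(s.split())),
    ("lowercase", str.lower),
    ("strip punctuation", lambda s: re.sub(r'[^\w\s]', '', s)),
    ("drop sidebar_position", lambda s: re.sub(r"sidebar_position \d+ ", "", s)),
]

# Hypothetical input resembling a page with leftover front matter.
sample = "sidebar_position: 3 Cloud   Snapshots - persist changes, e.g. to the OS image!"

result = sample
for name, func in steps:
    result = func(result)
    print(f"{name:>22}: {result!r}")
```

Two details are worth noticing. The underscore in `sidebar_position` survives the punctuation step because `\w` matches it, which is what lets the final pattern find the entry. And because whitespace is collapsed before punctuation is removed, deleting the standalone `-` leaves a double space behind; this should be harmless for whitespace-based tokenization, but a second whitespace pass at the end would remove it.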

### Step 3: Pre-processing the Documentation Data

In this section, we'll take our cleaned documentation data from Step 2 and pre-process it through tokenization, stop-word removal, and stemming.

In [24]:
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

def tokenize_str(input_str: str) -> list[str]:
    """
    Tokenize an input string into a list of word tokens.
    """
    return word_tokenize(input_str)

def _remove_stopwords(tokens: list[str]) -> list[str]:
    """
    Remove stop-words from a list of tokens.
    """
    stop_words = set(stopwords.words('english'))
    return [token for token in tokens if token not in stop_words]

def _stem_tokens(tokens: list[str]) -> list[str]:
    """
    Stems tokens.
    """
    stemmer = PorterStemmer()
    return [stemmer.stem(token) for token in tokens]

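def _clean_tokens(tokens: list[str]) -> list[str]:
    """
    Strips non-alphanumeric characters from each token and drops any tokens
    that end up empty.
    """
    stripped = (re.sub(r'[^a-zA-Z0-9]', '', token) for token in tokens)
    return [token for token in stripped if token]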

def _preprocess_str(cleaned_str: str) -> list[str]:
    """
    Helper function to preprocess an input string via tokenization,
    stop-word removal, and stemming.

    Parameters:
        cleaned_str (str) : a pre-cleaned string
    
    Returns:
        preproc_tokens (list[str]) : preprocessed tokens of a string
    """
    tokens = tokenize_str(cleaned_str)

    preproc_funcs = [_remove_stopwords, _stem_tokens, _clean_tokens] 

    preproc_tokens = tokens
    for func in preproc_funcs:
        preproc_tokens = func(preproc_tokens)
    
    return preproc_tokens

def preprocess_doc_data(cleaned_doc_data: list[str]) -> list[list[str]]:
    """
    Preprocesses the full documentation data set by tokenizing each string entry
    in the input documentation data, then removing stop words and stemming
    the tokens via the Porter stemming algorithm (NLTK's `PorterStemmer`).

    Parameters:
        cleaned_doc_data (list[str]) : cleaned documentation data
    
    Returns:
        preproc_doc_data (list[list[str]]) : full pre-processed documentation data
    """
    return [_preprocess_str(i) for i in cleaned_doc_data]

In [25]:
preprocessed_doc_data = preprocess_doc_data(cleaned_doc_data)
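
The final token clean-up can be checked in isolation, without NLTK. This is a minimal sketch under the assumption that tokens reduced to nothing should be discarded; `clean_tokens` and the sample tokens are hypothetical. It shows why a second clean-up is needed at all: underscores pass through the earlier `[^\w\s]` filter, so tokens such as `__` can still reach this stage.

```python
import re


def clean_tokens(tokens: list[str]) -> list[str]:
    """Strip non-alphanumeric characters, then drop tokens left empty."""
    stripped = (re.sub(r'[^a-zA-Z0-9]', '', token) for token in tokens)
    return [token for token in stripped if token]


# Made-up tokens: one with an embedded underscore, one that is only
# underscores, and one that is already empty.
sample_tokens = ['cloud', 'snap_shot', '__', 'aw', '']
cleaned = clean_tokens(sample_tokens)
print(cleaned)
```

Filtering after the substitution matters here: checking `if token` before stripping would keep `'__'`, which then turns into an empty string in the output.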

In [28]:
for i in preprocessed_doc_data:
    print(i)

['work', 'cloud', 'snapshot', 'cloud', 'snapshot', 'let', 'make', 'persist', 'chang', 'oper', 'system', 'imag', 'run', 'cluster', 'node', 'use', 'imag', 'provid', 'parallel', 'work', 'base', 'stage', 'autom', 'instal', 'addit', 'softwar', 'enabl', 'addit', 'servic', 'creat', 'cloud', 'snapshot', 'navig', 'account', 'set', 'usernam', 'account', 'profil', 'set', 'click', 'cloud', 'snapshot', 'cloud', 'snapshot', 'click', 'new', 'snapshot', 'snapshot', 'configur', 'set', 'sever', 'configur', 'paramet', 'cloud', 'snapshot', 'outlin', 'type', 'use', 'dropdown', 'menu', 'select', 'whether', 'snapshot', 'built', 'aw', 'azur', 'googl', 'cloud', 'account', 'use', 'dropdown', 'menu', 'select', 'cloud', 'account', 'provis', 'snapshot', 'user', 'menu', 'left', 'default', 'option', 'unless', 'your', 'member', 'multipl', 'organ', 'group', 'use', 'dropdown', 'menu', 'select', 'group', 'name', 'organ', 'use', 'alloc', 'cost', 'menu', 'especi', 'import', 'organ', 'use', 'multipl', 'group', 'your', 'sur