## Purpose
This script converts a folder of MS Word documents into Markdown files that can be hosted on a GitHub Pages website.


### How to use
IMPORTANT: Make sure that the website repository's current branch is set to `src` before running this script.

First, populate the `input` folder with the desired subfolders and `.docx` Word documents. The top-level subfolders should correspond to the collections defined in the website's `_config.yml`; see [here](https://github.com/chrisnielsen/chrisnielsen.github.io/blob/src/_config.yml#:~:text=collections%3A,news%3A) for examples.

Each `.docx` Word document must follow these formatting rules:
1. The file name of the `.docx` document is used as the title of the associated web page
2. The allowed `.docx` styles are `Normal`, `Title`, `Heading 1`, `Heading 2`, `Heading 3`, `List Bullet`, and `List Number`
3. All MathType equations must be converted to `MathJax:LaTeX` by clicking the `Convert Equations` button and selecting `MathJax:LaTeX` from the MathType translator dropdown menu
4. To center a math equation, leave a line break above and below the equation in the document
5. Use common Markdown syntax to bold, italicize, or otherwise style text in the `.docx` document
6. For each image in the document, write `Figure__width-in-pixels__figure-name` on the line above the image, where `width-in-pixels` is the width in pixels of the image on the web page and `figure-name` is the figure name shown below the image in the Markdown document
7. For `projects` documents, the first paragraph in the `.docx` file is used as the abstract displayed on the `projects` page
8. All hyperlinks must be removed from the `.docx` file
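
Rules 2 and 6 can be checked mechanically before running the notebook. The following is a minimal sketch and not part of the notebook: `find_disallowed_styles` and `parse_figure_marker` are hypothetical helper names, and the parser is deliberately stricter than the notebook (it rejects a non-numeric width). In practice the style names would probably be collected with python-docx, for example `[p.style.name for p in Document(path).paragraphs]`.

```python
# Styles permitted by rule 2.
ALLOWED_STYLES = {
    'Normal', 'Title', 'Heading 1', 'Heading 2', 'Heading 3',
    'List Bullet', 'List Number',
}


def find_disallowed_styles(style_names):
    """Return the sorted list of style names that rule 2 does not allow."""
    return sorted(set(style_names) - ALLOWED_STYLES)


def parse_figure_marker(text):
    """Split a `Figure__width-in-pixels__figure-name` line into (width, name).

    Returns None when the line is not a well-formed figure marker.
    """
    parts = text.strip().split('__')
    if len(parts) != 3 or parts[0] != 'Figure' or not parts[1].isdigit():
        return None
    return int(parts[1]), parts[2]


print(find_disallowed_styles(['Normal', 'Heading 4', 'Title']))
print(parse_figure_marker('Figure__600__Site layout overview'))
```

A document that passes both checks can still break other rules (for example, leftover hyperlinks), so this is only a first filter.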

Next, run the Python notebook `process_documents.ipynb`. The path to the website repository must be specified using the `repo_path` variable defined at the top of the notebook. Running the notebook processes all documents located in the `input` folder and stores the results in the `output` folder. The notebook then merges the `output` folder into the `repo_path` folder to update the website contents. This merge replaces existing files that have the same file names but does not delete any other preexisting files in the website directory. After the notebook has run, push the `src` branch and run `./bin/deploy` to deploy the website.
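
The merge semantics described above (overwrite files with the same name, leave everything else alone) can be illustrated in isolation. This is a sketch under assumptions: it uses `shutil.copytree(..., dirs_exist_ok=True)` (Python 3.8+) rather than the notebook's own `mergefolders` function, and the file names are made up for the demonstration.

```python
import os
import shutil
import tempfile


def write(path, text):
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)


def read(path):
    with open(path, encoding='utf-8') as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    output_dir = os.path.join(tmp, 'output')   # stands in for the `output` folder
    repo_dir = os.path.join(tmp, 'repo')       # stands in for `repo_path`
    os.makedirs(os.path.join(output_dir, '_pages'))
    os.makedirs(os.path.join(repo_dir, '_pages'))

    write(os.path.join(output_dir, '_pages', 'writing.md'), 'new')
    write(os.path.join(repo_dir, '_pages', 'writing.md'), 'old')
    write(os.path.join(repo_dir, '_pages', 'about.md'), 'untouched')

    # Copy output into the repo folder, overwriting files with the same name.
    shutil.copytree(output_dir, repo_dir, dirs_exist_ok=True)

    pages_dir = os.path.join(repo_dir, '_pages')
    merged = {name: read(os.path.join(pages_dir, name))
              for name in sorted(os.listdir(pages_dir))}

print(merged)
```

Because nothing is ever deleted, pages for documents that were removed from `input` stay in the repository until they are deleted by hand.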


### Important gotchas
1. To ensure the Markdown output is encoded in UTF-8, pass the argument `encoding='utf8'` to `open()` when writing each file to disk
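
A small round-trip sketch of this gotcha follows. The file name and sample text are illustrative only. Without the `encoding` argument, `open()` falls back to a platform-dependent default, which on Windows is often not UTF-8.

```python
import os
import tempfile

# Sample text with characters outside ASCII (curly quotes and a Greek letter).
text = 'Curly “quotes” and the symbol π must survive the round trip.'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.md')

    # Write with an explicit encoding ('utf8' and 'utf-8' are aliases).
    with open(path, 'w', encoding='utf8') as f:
        f.write(text)

    # Read the raw bytes back to confirm what actually landed on disk.
    with open(path, 'rb') as f:
        raw = f.read()

    # Read the text back with the same explicit encoding.
    with open(path, encoding='utf8') as f:
        round_tripped = f.read()

print(raw == text.encode('utf-8'), round_tripped == text)
```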



### Data provided in repo
There are two sets of example data provided in this repo:
1. The `input` folder contains the current subfolders and documents used by my website
2. The `input_test` folder contains a number of subfolders and documents created to test the system 


In [1]:
import shutil
import os
import docx2txt
from docx import Document

# local folder that contains documents to process
input_path = 'input'

# local folder that contains constructed markdown documents
output_path = 'output'

# base path to website
repo_path = 'C:/GitHub/chrisnielsen.github.io'

### Place subfolders and documents in alphabetical order

In [2]:
import os

document_ordering = []
for root, dirs, files in os.walk(input_path):
    for file in files:
        if file.endswith(".docx"):
            
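            # Folder names below input_path, split with the platform's own separator
            # (the hard-coded '\\' only worked on Windows)
            relative_root = os.path.relpath(root, input_path)
            folders = () if relative_root == '.' else tuple(relative_root.split(os.sep))
            document_ordering.append(folders + (file,))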

sorted_document_ordering = sorted(document_ordering)
print(sorted_document_ordering)

[('projects', 'Utility Software', 'Personal Website Built Using Github Pages With Jekyll.docx')]
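
Each entry in the list above is a tuple of the folder names below `input` followed by the file name. As an alternative illustration, the same tuple can be derived with `pathlib`, which handles the path separator itself. This is a sketch: `path_key` is a hypothetical helper, and the pure path classes are used only so the example behaves identically on every platform (a real script would likely use `pathlib.Path`).

```python
from pathlib import PurePosixPath, PureWindowsPath


def path_key(root, file_name, input_root='input', path_cls=PurePosixPath):
    """Return the folder names below `input_root` plus the file name."""
    relative = path_cls(root).relative_to(input_root)
    return relative.parts + (file_name,)


# The same document, addressed with Windows and POSIX separators.
windows_key = path_key('input\\projects\\Utility Software', 'Example.docx',
                       path_cls=PureWindowsPath)
posix_key = path_key('input/projects/Utility Software', 'Example.docx')

# A document placed directly in the input folder has no folder names.
top_level_key = path_key('input', 'Example.docx')

print(windows_key)
print(posix_key)
print(top_level_key)
```

Sorting a list of such tuples orders documents first by collection, then by subfolder, then by file name.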


### Create empty output directory 

In [3]:
# Remove any previous output so the folder starts out empty
shutil.rmtree(output_path, ignore_errors=True)

os.makedirs(output_path)

### Populate the `writing` Markdown page

In [4]:
# Create the folder that holds the generated pages (no error if it already exists)
os.makedirs(output_path + '/_pages', exist_ok=True)


writing_page_str = ''

writing_page_str += '---\n'
writing_page_str += 'layout: page\n'
writing_page_str += 'permalink: /writing/\n'
writing_page_str += 'description: This page will be populated with my writing projects\n'
writing_page_str += 'title: Writing\n'
writing_page_str += 'nav: Writing\n\n'
writing_page_str += '---\n\n'

writing_page_str += '<br/>\n'
writing_page_str += '<!-- MarkdownTOC depth=4 -->\n\n\n'


level1_set = set()
level2_set = set()
level_dict = {}
level_order = []

for document_path in sorted_document_ordering:
    if len(document_path) > 3:
        print('error: document path length greater than 3')
        
    level1 = document_path[0]
    if level1 == 'projects':
        continue
        
    level2 = ''
    if level1 not in level1_set:
        level1_set.add(level1)
        level1_punctuated = level1.replace('-',' ').capitalize()
        writing_page_str += '-  [**' + level1_punctuated + '**](#' + level1 + ')\n'
        
    if len(document_path) == 3:
        document_name = document_path[2]
        level2 = document_path[1]
        if level2 not in level2_set:
            level2_set.add(level2)
            writing_page_str += '    -  [**' + level2 + '**](#' + level2.lower().replace(' ','-') + ')\n'
    else:
        document_name = document_path[1]
        
    
    if (level1, level2) not in level_order:
        level_order.append((level1, level2))
    
    
    if (level1, level2) not in level_dict:
        level_dict[(level1, level2)] = []
    
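    # Build the URL path, skipping the level-2 segment for documents that sit
    # directly inside a collection (otherwise the link contains '//')
    url_parts = [level1]
    if level2:
        url_parts.append(level2.lower().replace(' ','-'))
    url_parts.append(document_name[:-5].lower().replace(' ','-'))
    level_dict[(level1, level2)].append('* [**' + document_name[:-5] + '**](' + 'https://chrisnielsen.github.io/'
                                        + '/'.join(url_parts) + ')')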
    
    
writing_page_str += '\n' + '<!-- /MarkdownTOC -->\n\n\n\n' + '<br/>\n\n'
                                        
    
level1_seen = set()
for level1, level2 in level_order:
    print(level1,level2)
    if level2 == '':
        writing_page_str += '\n<br/>\n<br/>\n\n\n'
    else:
        writing_page_str += '\n<br/>\n\n\n'
    
    if level1 not in level1_seen:
        level1_seen.add(level1)
        writing_page_str += '<a name="' + level1 + '"></a>\n'
        writing_page_str += '---\n'
        writing_page_str += '#### **' + level1.replace('-',' ').capitalize() + '**' + '\n'
        writing_page_str += '---\n\n\n'
    
    if level2 != '':
        # Use the same slug as the table-of-contents link so the anchor resolves
        writing_page_str += '<a name="' + level2.lower().replace(' ','-') + '"></a>\n'
        writing_page_str += '**' + level2 + '**' + '\n\n\n'
        
    for document_name in level_dict[(level1, level2)]:
        writing_page_str += document_name + '\n'
        
        

with open(output_path + '/_pages/writing.md', 'w', encoding='utf8') as file:
    file.write(writing_page_str)
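
The cells in this notebook repeat the expressions `.lower().replace(' ','-')` and `document_name[:-5]` wherever a URL or anchor is built. A minimal sketch of how that logic could be factored out is shown here; `slugify`, `strip_docx`, and `document_url` are hypothetical names that do not exist in the notebook, and the `news` example document is made up.

```python
def slugify(name):
    """Lowercase a name and replace spaces with hyphens, as the notebook does."""
    return name.lower().replace(' ', '-')


def strip_docx(file_name):
    """Drop a trailing '.docx' extension if present."""
    return file_name[:-5] if file_name.endswith('.docx') else file_name


def document_url(base_url, level1, level2, file_name):
    """Build the page URL; `level2` may be '' for documents without a subfolder."""
    parts = [level1]
    if level2:
        parts.append(slugify(level2))
    parts.append(slugify(strip_docx(file_name)))
    return base_url.rstrip('/') + '/' + '/'.join(parts)


print(document_url('https://chrisnielsen.github.io/', 'projects', 'Utility Software',
                   'Personal Website Built Using Github Pages With Jekyll.docx'))
print(document_url('https://chrisnielsen.github.io/', 'news', '', 'Some Post.docx'))
```

Keeping the slug rule in one place makes it harder for a link and its anchor to drift apart.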
        
        
        
    


### Populate the `projects` Markdown page

In [5]:
projects_page_str = ''
projects_page_str += '''---
layout: page
title: Projects
nav: Projects
permalink: /projects/
description: This page will be populated with my research projects
---\n\n'''


level2_set = set()
for document_path in sorted_document_ordering:
    if len(document_path) > 3:
        print('error: document path length greater than 3')
        
    level1 = document_path[0]
    
    if level1 != 'projects':
        continue
    
    level2 = document_path[1]
    document_name = document_path[2]
    
        
    if level2 not in level2_set:
        level2_set.add(level2)
        projects_page_str += '\n<br/>\n<br/>\n<h3 class="mt-4">' + level2 + '</h3>\n\n'
        
    href = 'https://chrisnielsen.github.io/projects/' + level2.lower().replace(' ','-') \
            + '/' + document_name[:-5].lower().replace(' ','-')
    
    
    document = Document(input_path + '/projects/' + level2 + '/' + document_name)
    
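    # Use the first non-empty paragraph as the abstract; default to an empty
    # string so a blank document cannot reuse the previous document's abstract
    abstract = ''
    for paragraph in document.paragraphs:
        if paragraph.text != '':
            abstract = paragraph.text
            break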
        
    
    

    projects_page_str += '''<div class="card mt-3">
          <div class="p-3">
            <div class="row">
              <div class="col-sm-10">
                <h5 class="font-weight-bold"> <a href="%s">%s</a> </h5>
              </div>
            </div>
            <h6 class="mt-2 mt-sm-0">%s</h6>
          </div>
        </div>\n\n''' % (href, document_name[:-5], abstract)
        
    
    
    
    
with open(output_path + '/_pages/projects.md','w', encoding='utf8') as file:
    file.write(projects_page_str)
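
The abstract selection in the cell above can be expressed as a small pure function, which also makes it easy to escape the text before it is placed inside HTML. This is a sketch with assumptions: `first_nonempty` and `render_card` are hypothetical names, the card markup is abbreviated, whitespace-only paragraphs are treated as empty (the notebook only skips exact empty strings), and HTML escaping is an addition the notebook does not perform.

```python
import html


def first_nonempty(paragraph_texts, default=''):
    """Return the first paragraph that contains non-whitespace text."""
    for text in paragraph_texts:
        if text.strip():
            return text
    return default


def render_card(href, title, abstract):
    """Render a simplified project card with HTML-escaped values."""
    return ('<div class="card mt-3">\n'
            '  <h5 class="font-weight-bold"> <a href="%s">%s</a> </h5>\n'
            '  <h6 class="mt-2 mt-sm-0">%s</h6>\n'
            '</div>\n') % (html.escape(href, quote=True),
                           html.escape(title),
                           html.escape(abstract))


abstract = first_nonempty(['', '   ', 'Speed & accuracy for x < 10', 'Second paragraph'])
print(render_card('https://example.com/projects/demo', 'Demo', abstract))
```

Escaping matters when an abstract contains characters such as `<` or `&`, which would otherwise be interpreted as markup.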
    
    

### Generate a Markdown page for each of the documents

In [6]:
character_replacement_dict = {'“': '"',
                              '”': '"',}

for document_path in sorted_document_ordering:
    level1 = document_path[0]
    
    if len(document_path) == 2:
        level2 = ''
        document_name = document_path[1]
        input_folder = level1 + '/' + document_name
        document_output_path = level1 + '/' + document_name[:-5].lower().replace(' ','-')
        output_folder = '_' + level1
    else:
        level2 = document_path[1]
        document_name = document_path[2]
        input_folder = level1 + '/' + level2 + '/' + document_name
        document_output_path = level1 + '/' + level2.lower().replace(' ','-') + '/' +document_name[:-5].lower().replace(' ','-')
        output_folder = '_' + level1 + '/' + level2.lower().replace(' ','-')
        
        
    image_path = output_path + '/assets/img/' + document_output_path
    
    # Create the image folder for this document (no error if it already exists)
    os.makedirs(image_path, exist_ok=True)
    
    text = docx2txt.process(input_path + '/' + input_folder, image_path) 
    
    
    
    image_figure_dict = {}
    

    files = os.listdir(image_path)
    
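    # Map each image number to its file name; the extracted files are expected
    # to be named like image1.png, so the first five characters are dropped
    for file in files:
        number = file.strip().split('.')[0][5:]
        image_figure_dict[int(number)] = file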
    
    
    
    document = Document(input_path + '/' + input_folder)

    output_header_str = ''
    output_toc_str = ''
    output_document_str = ''


    #############
    #############
    ## Header
    #############
    #############

    output_header_str += '---\n'
    output_header_str += 'layout: page\n'
    output_header_str += 'title: ' + document_name[:-5] + '\n'
    output_header_str += 'permalink: /' + document_output_path +'/' + '\n'
    output_header_str += '---\n'
    output_header_str += '<br />\n'





    #############
    #############
    ## Document and TOC
    #############
    #############

    output_toc_str += '### **Table of Contents**\n'
    output_toc_str += '<!-- MarkdownTOC depth=4 -->\n'
    

    figure_number = 0
    list_number = 0

    output_document_str += '{% raw %}\n'


    for paragraph_index, paragraph in enumerate(document.paragraphs):
        text_str = paragraph.text
        
        if paragraph_index == 0:
            continue

        pre_line = ''
        post_line = ''
        out_string = ''

        
        text_str = text_str.replace('\\[','$$')
        text_str = text_str.replace('\\]','$$')

        if 'Heading' in paragraph.style.name:
            handle_str = text_str.lower().replace(' ','-')
            pre_line = '<a name="' + handle_str + '"></a>\n\n' + '<br />\n\n' + '---\n'
            post_line = '---\n'


        if paragraph.style.name == 'Title':
            pre_line = '<br />\n'
            post_line = '---\n'
            out_string += '# '
        if paragraph.style.name == 'Heading 1':
            output_toc_str += '-  [' + text_str + '](#' + handle_str + ')\n'
            out_string += '## '
        if paragraph.style.name == 'Heading 2':
            output_toc_str += '    -  [' + text_str + '](#' + handle_str + ')\n'
            out_string += '#### '
        if paragraph.style.name == 'Heading 3':
            output_toc_str += '        -  [' + text_str + '](#' + handle_str + ')\n'
            out_string += '##### '


        if 'List Number' in paragraph.style.name:
            list_number += 1
            out_string += str(list_number) +'. '
        else:
            list_number = 0

        if 'Bullet' in paragraph.style.name:
            out_string += '- '


        if 'Figure__' in text_str:
            figure_number += 1
            width = text_str.strip().split('__')[1]
            alternate_text = text_str.strip().split('__')[2]
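            # Site-relative image URL; the local output folder name is left out because
            # the contents of output_path are merged into the root of the website repo
            img_src = '/assets/img/' + document_output_path + '/' + image_figure_dict[figure_number]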
            text_str = '<figure><center><img src="' + img_src + \
            '" alt="' + alternate_text + \
            '" width="' + width + '"/> <figcaption> <em>' + alternate_text + ' </em> </figcaption> </center></figure>'


        for char_to_replace in character_replacement_dict:
            replacement_char = character_replacement_dict[char_to_replace]
            text_str = text_str.replace(char_to_replace,replacement_char)


        

        output_document_str += pre_line
        output_document_str += out_string + text_str + '\n'
    #         file.write('{: style="text-align: justify"}\n')
        output_document_str += post_line


    output_document_str += '{% endraw %}\n'
    
    
    output_toc_str += '<!-- /MarkdownTOC -->\n\n\n'
    output_toc_str += '---\n<br/>\n\n'
    

    # Create the collection folder for this document (no error if it already exists)
    os.makedirs(output_path + '/' + output_folder, exist_ok=True)
    

    with open(output_path + '/_' + document_output_path + '.md', 'w', encoding='utf8') as file:
        file.write(output_header_str + output_toc_str + output_document_str)
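
The core of the conversion loop above is a mapping from Word style names to Markdown heading prefixes and table-of-contents indentation, plus the `\[` / `\]` to `$$` substitution for equations. The sketch below isolates just that mapping. `convert_paragraph` is a hypothetical helper, and it omits the figure, list, and separator handling that the notebook cell performs.

```python
# Markdown prefix for each heading-like style, as used in the notebook.
HEADING_PREFIX = {
    'Title': '# ',
    'Heading 1': '## ',
    'Heading 2': '#### ',
    'Heading 3': '##### ',
}

# Indentation of the table-of-contents entry for each heading level.
TOC_INDENT = {
    'Heading 1': '',
    'Heading 2': '    ',
    'Heading 3': '        ',
}


def convert_paragraph(style_name, text):
    """Return (markdown_line, toc_entry); toc_entry is None for non-headings."""
    # Display-math delimiters become $$ so MathJax renders them.
    text = text.replace('\\[', '$$').replace('\\]', '$$')
    line = HEADING_PREFIX.get(style_name, '') + text

    toc_entry = None
    if style_name in TOC_INDENT:
        anchor = text.lower().replace(' ', '-')
        toc_entry = TOC_INDENT[style_name] + '-  [' + text + '](#' + anchor + ')'
    return line, toc_entry


print(convert_paragraph('Heading 2', 'Data Sources'))
print(convert_paragraph('Normal', '\\[x^2\\]'))
```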



    


### Merge folder code taken from [here](https://lukelogbook.tech/2018/01/25/merging-two-folders-in-python/)

In [7]:
#recursively merge two folders including subfolders
def mergefolders(root_src_dir, root_dst_dir):
    for src_dir, dirs, files in os.walk(root_src_dir):
        dst_dir = src_dir.replace(root_src_dir, root_dst_dir, 1)
        if not os.path.exists(dst_dir):
            os.makedirs(dst_dir)
        for file_ in files:
            src_file = os.path.join(src_dir, file_)
            dst_file = os.path.join(dst_dir, file_)
            if os.path.exists(dst_file):
                os.remove(dst_file)
            shutil.copy(src_file, dst_dir)
            
            
mergefolders(output_path, repo_path)