# Markdown conversion
This notebook helps you convert documents to Markdown, a plain-text format that can be used to display content on the web. We use Markdown on GitHub for WCS courses.

Run the cells below with "Shift + Enter" or by clicking the "Play" (triangle) button on the left-hand side of the cell.

The first cell installs conda, an open-source system for managing tools and libraries. More information on the library used to install conda on Google Colab is at this [website](https://inside-machinelearning.com/en/how-to-install-use-conda-on-google-colab/). The runtime restarts once the installation finishes, which is expected; continue with the next cell afterwards.


In [None]:
!pip install -q condacolab
import condacolab
condacolab.install()

# Install pandoc
Next we install pandoc, a very useful tool for converting documents. There are [demos](https://pandoc.org/demos.html) of the other kinds of conversion pandoc can do.

In [None]:
!conda install -y -c conda-forge pandoc

# Retrieve example files:
Next we will clone a GitHub repo that contains an example Word document, which we will convert. You may also upload your own documents by clicking on the "folder" icon in the left-hand side tab, and then the "upload files" icon under "Files".

In [None]:
!git clone https://github.com/WCSCourses/format_convert.git

# Running pandoc
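Next we will run pandoc, giving the source file as an argument and the name of the output file after "-o". The "-s" flag is short for "--standalone" and asks pandoc to produce a complete document rather than a fragment.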

In [4]:
!pandoc -s /content/format_convert/example_docs/Example_wordoc_number1.docx -o example_markdown_no1.md
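The same conversion can also be driven from Python, which helps when you want to check that pandoc is installed before calling it. The following is a minimal sketch, not part of the original notebook: `build_pandoc_command` and `convert` are hypothetical helper names, and the paths are the ones used in the cell above.

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(source, output):
    """Return the argument list for a standalone pandoc conversion."""
    # The input file is a positional argument; -s asks for a standalone
    # document and -o names the output file. Pandoc infers the target
    # format from the output file extension.
    return ["pandoc", "-s", str(source), "-o", str(output)]


def convert(source, output):
    """Run pandoc if it is installed; raise a clear error otherwise."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    subprocess.run(build_pandoc_command(source, output), check=True)
    return Path(output)


# Show the command that the cell above runs, without executing it.
print(" ".join(build_pandoc_command(
    "/content/format_convert/example_docs/Example_wordoc_number1.docx",
    "example_markdown_no1.md",
)))
```

Calling `convert(...)` with the same two paths would run the conversion itself, assuming pandoc was installed by the earlier cell.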

# Using the converted docs
Now you may use the converted document on GitHub or any other website that supports Markdown. If you don't see the document yet, right click on the empty space in the "Files" tab on the left and select "Refresh". You can then download the "example_markdown_no1.md" file by right clicking it, or by clicking the three dots to the right of its name, and selecting "Download". Markdown is best opened in a text editor: Notepad on Windows, TextEdit on Mac, or Gedit on Linux.

# Retrieving images from docs
You will notice that the image links in the converted file fail, because Markdown needs a source file for each image. We will first create a PDF from a given docx file, then pull the images out of that PDF.
Begin by installing [poppler utilities](https://pypi.org/project/poppler-utils/) on the underlying Ubuntu system with the following commands. These also install the TeX Live packages that pandoc uses to convert to PDF.


In [None]:
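!sudo apt-get update
# -y answers the confirmation prompts so the cell does not wait for input
!sudo apt-get install -y poppler-utils
!sudo apt-get install -y texlive-fonts-recommended
!sudo apt-get install -y texlive-latex-base
!sudo apt-get install -y texlive-fonts-extra
# Note: on newer Ubuntu releases this package may be unavailable; texlive-plain-generic is its replacement
!sudo apt-get install -y texlive-generic-extra
!sudo apt-get install -y texlive-latex-extra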


First convert from docx to PDF with pandoc.

In [6]:
!pandoc -s /content/format_convert/example_docs/Example_wordoc_number1.docx -o Example_wordoc_number1.pdf

Then retrieve the images from the PDF. The word "images" in the command is the prefix given to each extracted file. Note that this is not a perfect process, and some blank images may be produced because of spacing in the input document. You can change the image file type by replacing "-png" with another pdfimages format flag, such as "-tiff".

In [7]:
!pdfimages /content/Example_wordoc_number1.pdf images -png
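Because some of the extracted images may be blank, it can help to sort them before use. The sketch below is one possible approach and not part of the original workflow: it assumes pdfimages wrote files named like `images-000.png`, and it treats very small files as suspect. The function name `list_extracted_images` and the `min_bytes` threshold are hypothetical, and file size is only a rough heuristic, so check suspect files by eye before deleting anything.

```python
from pathlib import Path


def list_extracted_images(folder, prefix="images", suffix=".png", min_bytes=1024):
    """Split extracted images into (kept, suspect) lists by file size."""
    kept, suspect = [], []
    for path in sorted(Path(folder).glob(f"{prefix}-*{suffix}")):
        # Assumed heuristic: very small files are often blank spacers.
        if path.stat().st_size < min_bytes:
            suspect.append(path)
        else:
            kept.append(path)
    return kept, suspect
```

For example, `kept, suspect = list_extracted_images("/content")` would sort the images written by the cell above.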

# Building loops
Now that you have converted your example file, it is time to loop through a series of files to maximise your productivity. First create a base folder. We copy the example files into it to show how this works.

In [1]:
!mkdir -p base_folder
!cp /content/format_convert/example_docs/*.docx /content/base_folder/

Remember that you can upload files to "base_folder" directly by right clicking its name in the file explorer panel on the left-hand side. Note that Unix scripts often break when filenames contain spaces. After uploading to "base_folder", you can replace these spaces with "_" using the cell below.

In [11]:
!cd /content/base_folder/ && for f in *\ *; do [ -e "$f" ] || continue; mv "$f" "${f// /_}"; done
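If you prefer to do the renaming in Python, the following sketch does a similar job. It is an illustration under a few assumptions rather than a drop-in replacement: `replace_spaces` is a hypothetical helper name, it only renames regular files (the shell version also renames folders), and it skips a file rather than overwrite an existing one.

```python
from pathlib import Path


def replace_spaces(folder):
    """Rename files in folder so that spaces become underscores."""
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and " " in path.name:
            target = path.with_name(path.name.replace(" ", "_"))
            # Skip rather than overwrite if the new name is already taken.
            if target.exists():
                continue
            path.rename(target)
            renamed.append(target.name)
    return renamed
```

Calling `replace_spaces("/content/base_folder")` returns the new names of the files it renamed.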

Next are two loops that "glob" through every docx file in "base_folder". Be careful when modifying the loops: you may need to run some tests if you want to change where the output files go. The first loop converts to Markdown, and the second to PDF in preparation for the image extraction step. Each output is written to a folder named after its input file, for example "name.docx_folder".

In [None]:
!for file in /content/base_folder/*.docx; do echo "$file"; name=${file##*/}; mkdir -p "${file}_folder"; pandoc -s "$file" -o "${file}_folder/$name.md"; done

In [None]:
!for file in /content/base_folder/*.docx; do echo "$file"; name=${file##*/}; mkdir -p "${file}_folder"; pandoc -s "$file" -o "${file}_folder/$name.pdf"; done
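Before changing where the output files go, it can be useful to preview the paths the loops produce. The sketch below mirrors the naming used above (a folder called `<input name>_folder` holding `<input name>.md` and `<input name>.pdf`) without running pandoc. The function name `plan_outputs` is hypothetical and only serves as an illustration.

```python
from pathlib import Path


def plan_outputs(base_folder, extensions=("md", "pdf")):
    """Map each docx file to the output paths the loops would write."""
    plan = {}
    for docx in sorted(Path(base_folder).glob("*.docx")):
        # Same naming as the shell loops: the folder and the outputs
        # keep the full input file name, including ".docx".
        out_dir = docx.parent / f"{docx.name}_folder"
        plan[docx] = [out_dir / f"{docx.name}.{ext}" for ext in extensions]
    return plan
```

Printing `plan_outputs("/content/base_folder")` lists every planned output, which makes it easier to spot a wrong path before any files are written.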

And now to retrieve the images from your converted files. This uses the paths of your input files, and assumes you converted to PDF with the naming used in the loop above. The images are added to the folder created for each original input file.

In [None]:
!for file in /content/base_folder/*.docx; do echo "$file"; name=${file##*/}; (cd "${file}_folder" && pdfimages -png "$name.pdf" images); done
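Once the images are extracted, the image references in each converted Markdown file still need to point at real files. As a starting point, the sketch below lists the image references found in a Markdown string so you can see which ones need updating. It is a simplified illustration: `find_image_links` is a hypothetical helper, and the regular expression only covers the common inline form `![alt](target)`, not reference-style or HTML images.

```python
import re

# Matches inline Markdown images: ![alt text](target ...)
IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)")


def find_image_links(markdown_text):
    """Return (alt_text, target) pairs for inline image references."""
    return [(m.group(1), m.group(2)) for m in IMAGE_LINK.finditer(markdown_text)]


# Example input in the style pandoc typically writes for docx images.
sample = 'Some text\n\n![](media/image1.png){width="2in"}\n'
print(find_image_links(sample))
```

To check a converted file, read it with `Path(...).read_text()` and pass the text to `find_image_links`.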