# Getting R Comp html and making install.R file

Meant to be run in sessions launched from [here](https://github.com/fomightez/rcomp_testenv). However, because it doesn't need anything special except the one package that gets installed as the notebook runs, it can be run in pretty much any session launched via MyBinder.org or in a current JupyterHub environment.

This is to make a Binder-launchable environment with all the necessary R packages installed.

In [None]:
%pip install beautifulsoup4

In [None]:
r_companion_index_url = "https://rcompanion.org/rcompanion/index.html"

In [None]:
site_prefix = "https://rcompanion.org/rcompanion/"

import os
import sys
import urllib.request
from bs4 import BeautifulSoup as BS

def extract_name_of_the_html(url, add_html_extension):
    '''
    Make a file name based on the URL, e.g., "https://rcompanion.org/rcompanion/index.html".
    If `add_html_extension` is True, then add the `.html` extension
    to the file name.
    
    Return filename
    '''
    split_url = url.split("/")
    fn = split_url[-1]
    if add_html_extension:
        fn += ".html"
    if fn == 'index.html':
        fn = "rcomp_index.html"
    return fn

def get_html_and_save(url):
    '''
    Take a URL for a web page, get the HTML, and store the text in a file.
    Return the HTML text, too.
    
    based on https://stackoverflow.com/a/30890016/8508004
    '''
    global the_html # so can save using `%store` the variable needs to be global
    global fn_save_name # so can save using `%store` the variable needs to be global
    hh = urllib.request.urlopen(url)
    hbytes = hh.read()

    the_html = hbytes.decode("utf8")
    #print (the_html[:200])
    hh.close()
    
    fn_save_name = extract_name_of_the_html(url, add_html_extension=False)
    
    %store the_html > {fn_save_name}
    
    return the_html 


pages_and_titles_dict = {}
index_html = get_html_and_save(r_companion_index_url)
# mine from the Contents panel on the left, the list of the pages
nav_code = index_html.split("<!-- Begin Navigation -->")[1].split("<!-- End Navigation -->")[0]
contents_code = nav_code.split("<ul>Introduction")[1].split('<div id="adskyscraper">')[0]
#print(nav_code )

# ul and li tags based on https://stackoverflow.com/a/17246983/8508004
soup = BS(nav_code, "html.parser")  # name the parser explicitly to avoid the bs4 warning
for ultag in soup.find_all('ul'):
    for litag in ultag.find_all('li'):
        #print(litag.text.strip())  #<--ends up being same as `print(link.text.strip())`
        pass
        for link in litag.find_all('a'):
            #print(link.get('title')) #based on https://stackoverflow.com/a/32542575/8508004
            #print(link.text.strip())
            #print(link.get('href')) #based on https://python.gotrained.com/beautifulsoup-extracting-urls/
            if link.get('href').startswith("http://rcompanion.org/"):
                full_link = link.get('href')
            else:
                full_link = f"{site_prefix}{link.get('href')}"
            pages_and_titles_dict[full_link] = link.text.strip()
pages_and_titles_dict
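
The cell above treats any `href` that does not start with `http://rcompanion.org/` as relative and glues `site_prefix` in front of it. As a minimal sketch of an alternative, assuming the links are either relative page names or full URLs, `urllib.parse.urljoin` from the standard library handles both cases, including `https://` links. The helper name `resolve_link` is made up for illustration and is not part of the notebook.

```python
from urllib.parse import urljoin

# Same value as `site_prefix` defined earlier in this notebook.
site_prefix = "https://rcompanion.org/rcompanion/"

def resolve_link(href, base=site_prefix):
    """Return an absolute URL for `href`; absolute links pass through unchanged."""
    return urljoin(base, href)

# A relative page name gets the site prefix ...
print(resolve_link("d_06.html"))
# ... while a link that is already absolute is left alone.
print(resolve_link("http://rcompanion.org/handbook/"))
```

Swapping this in would only change how `full_link` is built; the dictionary of pages and titles would be filled the same way.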

Here I do not remove the few pages that I already converted to notebooks, because I still need to collect the packages those pages need, too. That is why the next cell is disabled by wrapping it in a string.

In [None]:
'''drafts_made_already = [
    "Reading_SAS_Datalines_in_R.ipynb",
    "Power_Analysis.ipynb",
    "Exact_Test_of_Goodness-of-Fit.ipynb"
    
]

drafts_made_already_for_matching = [x.rsplit(".ipynb")[0].replace("_"," ") for x in drafts_made_already]
ones_to_remove = [] # need to make a list because cannot delete while iterating over
for url,page_name in pages_and_titles_dict.items():
    if page_name in drafts_made_already_for_matching:
        ones_to_remove.append(url)
for t in ones_to_remove:
    del pages_and_titles_dict[t]
pages_and_titles_dict
'''

Remove the last one as it doesn't include code. (It just leads to another online book by the author, Salvatore.) Also remove the index, so I don't clobber the one in the environment. (As noted above, I renamed the `index` file from RComp so that it doesn't clobber the Jupyter environment notebook with the same name.)

In [None]:
remove_specifically = [r_companion_index_url, "http://rcompanion.org/handbook/"]
for p in remove_specifically:
    del pages_and_titles_dict[p]

## Now get the html for each and make markdown from it.

In [None]:
urls_to_get = list(pages_and_titles_dict.keys())

In [None]:
import os
import sys
import urllib.request


def extract_name_of_the_html(url, add_html_extension):
    '''
    Make a file name based on the URL, e.g., "https://rcompanion.org/rcompanion/index.html".
    If `add_html_extension` is True, then add the `.html` extension
    to the file name.
    
    Return filename
    '''
    split_url = url.split("/")
    fn = split_url[-1]
    if add_html_extension:
        fn += ".html"
    return fn

def get_html_and_save(url):
    '''
    Take a URL for a web page and get the HTML.
    
    Return the HTML text and the name of the file to save it under.
    (Turns out `%store` magics didn't work in the function?!)
    
    based on https://stackoverflow.com/a/30890016/8508004
    '''
    hh = urllib.request.urlopen(url)
    hbytes = hh.read()

    the_html = hbytes.decode("utf8")
    #print (the_html[:200])
    hh.close()
    fn_save_name = extract_name_of_the_html(url, add_html_extension=False)
    
    #%store the_html > {fn_save_name} #seems cannot use this in a function?;
    # probably because it needs to be a global and here it would be local
    # variable it would be trying to save.
    
    return the_html,fn_save_name

htmls_collected = []
markdowns_made = []
for url in urls_to_get:
    the_html,fn_save_name = get_html_and_save(url)
    %store the_html > {fn_save_name}
    htmls_collected.append(fn_save_name)
    markdown_name = fn_save_name.rsplit(".html")[0] + ".md"
    !pandoc -s -f html -t markdown {fn_save_name} -o {markdown_name}
    sys.stderr.write("'{}' has been generated.\n".format(markdown_name))
    markdowns_made.append(markdown_name)
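
The loop above leans on the `%store` and `!pandoc` magics, which only work inside IPython/Jupyter. A minimal sketch of a plain-Python equivalent follows, assuming `pandoc` is on the PATH when the conversion is actually run. The function names `build_pandoc_command` and `save_and_convert` are hypothetical; the pandoc flags are the ones already used in the loop.

```python
import shutil
import subprocess
from pathlib import Path

def build_pandoc_command(html_name):
    """Return (command, markdown_name) mirroring the `!pandoc` line in the loop."""
    markdown_name = Path(html_name).with_suffix(".md").name
    command = ["pandoc", "-s", "-f", "html", "-t", "markdown",
               html_name, "-o", markdown_name]
    return command, markdown_name

def save_and_convert(html_text, html_name):
    """Write the HTML to disk, then convert it if pandoc is available."""
    Path(html_name).write_text(html_text, encoding="utf8")
    command, markdown_name = build_pandoc_command(html_name)
    if shutil.which("pandoc") is None:
        return None  # pandoc not found, so nothing was converted
    subprocess.run(command, check=True)
    return markdown_name

# Only build the command here; nothing is downloaded or converted.
command, markdown_name = build_pandoc_command("d_06.html")
print(command)
print(markdown_name)
```

Passing the command as a list avoids shell quoting problems if a file name ever contains spaces.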

Iterate over the markdown produced and collect the packages needed to run all the R code
--------------------------------------------------------------

In [None]:
#markdowns_made = ["d_06.md"] # Uncomment for debugging only

In [None]:
def extract_libraries_from_md(md):
    '''
    Go through markdown line by line and collect packages needed from lines like:
    
    if(!require(DescTools)){install.packages(\"DescTools\")}\
    if(!require(RVAideMemoire)){install.packages(\"RVAideMemoire\")}\
    
    Return packages as list
    '''
    packages_needed = []
    with open(md, 'r') as md_file:  # avoid shadowing the built-in `input`
        for line in md_file:
            if (line.strip().startswith("if(!require")) and (
                "{install.packages(" in line):
                package_nom = line.split('{install.packages(')[1].split('\")}')[0].strip()
                packages_needed.append(package_nom[2:-1])
    return packages_needed

R_packages = []
for md in markdowns_made:
    extracted_packages = extract_libraries_from_md(md)
    R_packages += extracted_packages
R_packages = list(set(R_packages))
R_packages
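
The extraction above relies on fixed split strings and on slicing off the backslash-escaped quotes that pandoc leaves in the markdown. As a hedged alternative sketch, a regular expression can pull the package name out whether or not the quotes are escaped. The names `INSTALL_PATTERN` and `packages_in_line` are made up for illustration; the sample lines are the ones quoted in the docstring above.

```python
import re

# Captures the package name inside install.packages(...), tolerating an
# optional backslash before each double quote (as pandoc writes them).
INSTALL_PATTERN = re.compile(r'install\.packages\(\\?"([^"\\]+)\\?"\)')

def packages_in_line(line):
    """Return package names found in an `if(!require(...))` install line."""
    if not line.strip().startswith("if(!require"):
        return []
    return INSTALL_PATTERN.findall(line)

# A line as it appears in the pandoc-generated markdown (escaped quotes,
# trailing backslash).
escaped = 'if(!require(DescTools)){install.packages(\\"DescTools\\")}\\'
# The same kind of line without escaping.
plain = 'if(!require(RVAideMemoire)){install.packages("RVAideMemoire")}'

print(packages_in_line(escaped))
print(packages_in_line(plain))
```

This is only a sketch of the matching step; collecting across files and de-duplicating with `set` would stay as in the cell above.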

## Make the install.R file for the repository

While doing this, address any peculiarities I have seen come up. For example, for `RVAideMemoire`, I've seen that simply including `install.packages("RVAideMemoire")` won't result in an environment where it runs when launched via Binder. It reports that it needs `mixOmics` as well, and that doesn't seem to get installed because `mixOmics` has been moved from CRAN / MRAN to Bioconductor. I found it was then necessary to add the special handling below to the `install.R` file if `RVAideMemoire` is among the list of libraries required:

install.packages("BiocManager")
BiocManager::install("mixOmics")

Nothing is needed in the `install.R` file for the package `agricolae`, but while on the subject of special handling for `RVAideMemoire`, this is a good place to point out that special handling was also needed to make `agricolae` usable. Because so much goes on when the image builds, I probably missed that `agricolae` was having issues, but it became apparent that installation had failed when I tried to run notebooks that used it, like `d_06`: `library(agricolae)` didn't work even though the package is listed in `install.R`. I thought it might be like the case of `RVAideMemoire` and need some dependency specially handled, so I looked around. I found [this post](https://www.reddit.com/r/RStudio/comments/63hkje/cant_install_agricolae_package/) that suggested it wasn't working because it needed gfortran. Luckily, while getting David Goodsell's Illustrate script working in my pymol-binder, I had learned how to install gfortran on Binder-launchable systems using `apt.txt`, so I could do that.

As another avenue for learning what the issue might be, I made a repo that only tried installing `agricolae` and then tried bringing it into the session with `library(agricolae)`, but it still failed. That didn't indicate what it needed, but when I tried `install.packages("agricolae")` in the running session, I saw it also installing some other dependencies, which I would have expected to be installed already since I had `install.packages("agricolae")` in the `install.R` script included in the repo. The three listed were ‘units’, ‘sf’, and ‘spdep’. It then said installation of each of those packages returned a non-zero exit status. So I looked into installing those libraries and found [this](https://philmikejones.me/tutorials/2018-08-29-install-sf-ubuntu/) about `units` and `sf`, and I added the two listed for `units` to the list in `apt.txt` that I was making to include gfortran.

I also found [this](https://r-spatial.github.io/sf/), which listed the four `libudunits2-dev libgdal-dev libgeos-dev libproj-dev`, and so I added the additional three listed there. (I already had `libudunits2-dev` from the listing just mentioned.)

The next time I ran the build process, I noticed `gdal-data` among the list of things being installed, and so I went back to https://philmikejones.me/tutorials/2018-08-29-install-sf-ubuntu/ and added `libgdal-dev` to the growing `apt.txt` file, too. I ran the launch with the new `apt.txt` file with gfortran and those other libraries, and when it came up I tried `library(agricolae)`, and it worked. So the `apt.txt` has been added to the repo, too. I don't know if I needed everything I added to `apt.txt` (aside: later I did see [this](https://github.com/satijalab/seurat/issues/791) suggesting `robustbase` also needs gfortran, so it is probably good that it is included), but at least it solved the issue and it works.
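
For reference, Binder reads `apt.txt` as one system package name per line. The sketch below writes such a file from the libraries named explicitly in these notes. It is an illustrative subset, not a copy of the `apt.txt` in the repo, which may list more (for instance the second package added for `units`, which is not named here). The function name `make_apt_text` and the output file name `apt_example.txt` are hypothetical.

```python
from pathlib import Path

# Only the system libraries named in the notes above; the repo's real
# apt.txt may contain additional entries.
APT_PACKAGES = [
    "gfortran",
    "libudunits2-dev",
    "libgdal-dev",
    "libgeos-dev",
    "libproj-dev",
]

def make_apt_text(packages):
    """Return apt.txt content: one package name per line."""
    return "\n".join(packages) + "\n"

apt_text = make_apt_text(APT_PACKAGES)
# Written under a hypothetical name so an existing apt.txt is not overwritten.
Path("apt_example.txt").write_text(apt_text)
print(apt_text)
```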

Some libraries never appear in a line like `if(!require(PACKAGE)){install.packages("PACKAGE")}` at the top of the pertinent pages, so they were not picked up when I mined those lines for the packages needed. These need to be added:

* robustbase
* popbio
* PerformanceAnalytics
* nlstools

Plus, I thought I'd add a few additional popular packages.

In [None]:
special_additions_needed_for_RVAideMemoire = '''\
install.packages("BiocManager")
BiocManager::install("mixOmics")
'''

additions_for_additional_packages = '''install.packages("robustbase")
install.packages("popbio")
install.packages("PerformanceAnalytics")
install.packages("nlstools")
'''

additional_popular_packages='''install.packages("tidyverse")
install.packages("rmarkdown")
install.packages("httr")
'''

basic_line = 'install.packages("PLACEHOLDER")'
text_2_save = ''
if "RVAideMemoire" in R_packages:
    text_2_save += special_additions_needed_for_RVAideMemoire
for x in R_packages:
    text_2_save += basic_line.replace("PLACEHOLDER",x)
    text_2_save += "\n"
text_2_save += additions_for_additional_packages
text_2_save += additional_popular_packages
%store text_2_save > install.R
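
Because `R_packages` comes from a `set`, the order of lines in `install.R` can differ from run to run. The sketch below shows one way to build the same kind of text with sorted package names so the file is stable under version control. Sorting is my assumption, not something the cell above does, and `build_install_r` is a hypothetical helper name.

```python
def build_install_r(packages, extra_lines=()):
    """Return install.R text: Bioconductor preamble if needed, then one
    install.packages() line per package (sorted), then any extra lines."""
    lines = []
    if "RVAideMemoire" in packages:
        # mixOmics comes from Bioconductor, as described in the notes above.
        lines.append('install.packages("BiocManager")')
        lines.append('BiocManager::install("mixOmics")')
    for name in sorted(set(packages)):
        lines.append(f'install.packages("{name}")')
    lines.extend(extra_lines)
    return "\n".join(lines) + "\n"

demo_text = build_install_r(
    ["RVAideMemoire", "DescTools"],
    extra_lines=['install.packages("robustbase")'],
)
print(demo_text)
```

The returned text could then be saved with `%store`, as above, or with an ordinary `open(...).write(...)` outside of Jupyter.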

Download `install.R` for placing in the repository.

----
