# Getting R Companion HTML and making the install.R file

Meant to be run in sessions launched from [here](https://github.com/fomightez/rcomp_testenv). (This can also be run [here](https://github.com/fomightez/muscle-binder) where I added the recent version of pandoc on Feb 3, 2019. That version added conversion to markdown.) 

The goal is to produce an `install.R` file for a Binder-launchable environment with all the necessary R packages installed.

In [None]:
%pip install beautifulsoup4

In [None]:
r_companion_index_url = "https://rcompanion.org/rcompanion/index.html"

In [None]:
site_prefix = "https://rcompanion.org/rcompanion/"

import os
import sys
import urllib.request
from bs4 import BeautifulSoup as BS

def extract_name_of_the_html(url, add_html_extension):
    '''
    Make a file name from the last segment of the URL, e.g. `index.html`
    for "https://rcompanion.org/rcompanion/index.html".
    If `add_html_extension` is True, then append `.html` to the file name.
    
    Return the file name.
    '''
    split_url = url.split("/")
    fn = split_url[-1]
    if add_html_extension:
        fn += ".html"
    return fn

def get_html_and_save(url):
    '''
    Take a URL for a web page, get the HTML, and store the text in a file.
    Return the HTML text as well.
    
    based on https://stackoverflow.com/a/30890016/8508004
    '''
    global the_html # so can save using `%store` the variable needs to be global
    global fn_save_name # so can save using `%store` the variable needs to be global
    hh = urllib.request.urlopen(url)
    hbytes = hh.read()

    the_html = hbytes.decode("utf8")
    #print (the_html[:200])
    hh.close()
    
    fn_save_name = extract_name_of_the_html(url, add_html_extension=False)
    
    %store the_html > {fn_save_name}
    
    return the_html 


pages_and_titles_dict = {}
index_html = get_html_and_save(r_companion_index_url)
# mine from the Contents panel on the left, the list of the pages
nav_code = index_html.split("<!-- Begin Navigation -->")[1].split("<!-- End Navigation -->")[0]
contents_code = nav_code.split("<ul>Introduction")[1].split('<div id="adskyscraper">')[0]
#print(nav_code )

# ul and li tags based on https://stackoverflow.com/a/17246983/8508004
soup = BS(nav_code, "html.parser")  # name the parser explicitly so bs4 does not have to guess one
for ultag in soup.find_all('ul'):
    for litag in ultag.find_all('li'):
        #print(litag.text.strip())  #<--ends up being same as `print(link.text.strip())`
        for link in litag.find_all('a'):
            #print(link.get('title')) #based on https://stackoverflow.com/a/32542575/8508004
            #print(link.text.strip())
            href = link.get('href') #based on https://python.gotrained.com/beautifulsoup-extracting-urls/
            if href is None:
                continue  # skip anchors that have no href
            # absolute links are kept as-is; relative ones get the site prefix
            if href.startswith(("http://", "https://")):
                full_link = href
            else:
                full_link = f"{site_prefix}{href}"
            pages_and_titles_dict[full_link] = link.text.strip()
pages_and_titles_dict
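
The link handling above can be sketched in isolation. This is a minimal sketch, assuming a small hypothetical navigation fragment (`SAMPLE_NAV`, with made-up link titles) rather than the real page. It uses only the standard library's `html.parser` and `urllib.parse.urljoin` so that it runs without `beautifulsoup4`; `LinkCollector` is an illustrative name and not part of this notebook.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

SITE_PREFIX = "https://rcompanion.org/rcompanion/"

# Hypothetical stand-in for the navigation HTML; the real page may differ.
SAMPLE_NAV = """
<ul>
  <li><a href="d_06.html">Example page</a></li>
  <li><a href="http://rcompanion.org/handbook/">Other book</a></li>
  <li><a>No link here</a></li>
</ul>
"""

class LinkCollector(HTMLParser):
    """Collect {absolute_url: link_text} for every <a> tag that has an href."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = {}
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # anchors without an href are skipped
                # urljoin keeps absolute URLs and resolves relative ones
                self._current_href = urljoin(self.base_url, href)
                self._text_parts = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            self.links[self._current_href] = "".join(self._text_parts).strip()
            self._current_href = None

collector = LinkCollector(SITE_PREFIX)
collector.feed(SAMPLE_NAV)
sample_links = collector.links
```

Resolving links with `urljoin` means the code does not need to test for a particular site prefix, which may matter if the site mixes `http://` and `https://` links.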

Do not remove the few pages that I already converted to notebooks here, because the packages they need must be collected too. The cell below is therefore disabled by wrapping it in a string.

In [None]:
'''drafts_made_already = [
    "Reading_SAS_Datalines_in_R.ipynb",
    "Power_Analysis.ipynb",
    "Exact_Test_of_Goodness-of-Fit.ipynb"
    
]

drafts_made_already_for_matching = [x.rsplit(".ipynb")[0].replace("_"," ") for x in drafts_made_already]
ones_to_remove = [] # need to make a list because cannot delete while iterating over
for url,page_name in pages_and_titles_dict.items():
    if page_name in drafts_made_already_for_matching:
        ones_to_remove.append(url)
for t in ones_to_remove:
    del pages_and_titles_dict[t]
pages_and_titles_dict
'''
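
If that filter is ever re-enabled, a dictionary comprehension would avoid the separate to-delete list. The following is a minimal sketch under stated assumptions: `drop_pages_by_title` is a hypothetical helper name, and the URLs in `example_pages` are made up for illustration.

```python
import os

def drop_pages_by_title(pages_and_titles, notebook_names):
    """Return a copy of the mapping without pages whose title matches a notebook name."""
    # "Power_Analysis.ipynb" -> "Power Analysis"
    titles_to_drop = {
        os.path.splitext(name)[0].replace("_", " ") for name in notebook_names
    }
    return {
        url: title
        for url, title in pages_and_titles.items()
        if title not in titles_to_drop
    }

# Hypothetical URLs; only the notebook name mirrors the list in the cell above.
example_pages = {
    "https://rcompanion.org/rcompanion/x_01.html": "Power Analysis",
    "https://rcompanion.org/rcompanion/x_02.html": "Some Other Page",
}
remaining = drop_pages_by_title(example_pages, ["Power_Analysis.ipynb"])
```

Because a new dictionary is built, the original mapping is left untouched and nothing is deleted while iterating.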

Remove the last entry because it doesn't include code. (It just leads to another online book by the author, Salvatore.)

In [None]:
remove_specifically = "http://rcompanion.org/handbook/"
del pages_and_titles_dict[remove_specifically]

## Now get the html for each and make markdown from it.

In [None]:
urls_to_get = list(pages_and_titles_dict.keys())

In [None]:
import os
import sys
import urllib.request


def extract_name_of_the_html(url, add_html_extension):
    '''
    Make a file name from the last segment of the URL, e.g. `index.html`
    for "https://rcompanion.org/rcompanion/index.html".
    If `add_html_extension` is True, then append `.html` to the file name.
    
    Return the file name.
    '''
    split_url = url.split("/")
    fn = split_url[-1]
    if add_html_extension:
        fn += ".html"
    return fn

def get_html_and_save(url):
    '''
    Take a URL for a web page and get the HTML.
    
    Return the HTML text and the name of the file to save it under.
    (Turns out `%store` magics didn't work in the function?!)
    
    based on https://stackoverflow.com/a/30890016/8508004
    '''
    hh = urllib.request.urlopen(url)
    hbytes = hh.read()

    the_html = hbytes.decode("utf8")
    #print (the_html[:200])
    hh.close()
    fn_save_name = extract_name_of_the_html(url, add_html_extension=False)
    
    #%store the_html > {fn_save_name} # this does not seem to work inside a function,
    # probably because `%store` needs a global variable and here it would be
    # trying to save a local one.
    
    return the_html,fn_save_name

htmls_collected = []
markdowns_made = []
for url in urls_to_get:
    the_html,fn_save_name = get_html_and_save(url)
    %store the_html > {fn_save_name}
    htmls_collected.append(fn_save_name)
    markdown_name = fn_save_name.rsplit(".html")[0] + ".md"
    !pandoc -s -f html -t markdown {fn_save_name} -o {markdown_name}
    sys.stderr.write("'{}' has been generated.\n".format(markdown_name))
    markdowns_made.append(markdown_name)
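
Outside IPython, where the `!pandoc` shell escape is not available, the same conversion could be run with `subprocess`. This is a minimal sketch, assuming `pandoc` is on the `PATH`; `build_pandoc_command` and `convert_html_to_markdown` are hypothetical helper names. Only the command builder is exercised below, so the example runs even where pandoc is not installed.

```python
import subprocess
from pathlib import Path

def build_pandoc_command(html_path):
    """Return (command, markdown_path) for converting one HTML file to markdown."""
    markdown_path = str(Path(html_path).with_suffix(".md"))
    # Same flags as the notebook's `!pandoc` line.
    command = ["pandoc", "-s", "-f", "html", "-t", "markdown",
               html_path, "-o", markdown_path]
    return command, markdown_path

def convert_html_to_markdown(html_path):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    command, markdown_path = build_pandoc_command(html_path)
    subprocess.run(command, check=True)
    return markdown_path

# Build the command without running pandoc.
command, markdown_name = build_pandoc_command("d_06.html")
```

Passing the command as a list avoids shell quoting problems, and `check=True` makes a failed conversion stop the loop rather than pass silently.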

Iterate over the markdown produced and collect the packages needed to run all the R code
----------------------------------------------------------------------------------------

In [None]:
#markdowns_made = ["d_06.md"] # Uncomment for debugging only

In [None]:
def extract_libraries_from_md(md):
    '''
    Go through markdown line by line and collect packages needed from lines like:
    
    if(!require(DescTools)){install.packages(\"DescTools\")}\
    if(!require(RVAideMemoire)){install.packages(\"RVAideMemoire\")}\
    
    Return packages as list
    '''
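    packages_needed = []
    with open(md, 'r') as md_file:  # avoid shadowing the built-in `input`
        for line in md_file:
            if (line.strip().startswith("if(!require")) and (
                "{install.packages(" in line):
                # pandoc escapes the quotes, so this leaves `\"PackageName\`
                package_nom = line.split('{install.packages(')[1].split('\")}')[0].strip()
                # drop the leading `\"` and the trailing backslash
                packages_needed.append(package_nom[2:-1])
    return packages_needed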

R_packages = []
for md in markdowns_made:
    extracted_packages = extract_libraries_from_md(md)
    R_packages += extracted_packages
R_packages = list(set(R_packages))
R_packages
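
The slicing in `extract_libraries_from_md` relies on pandoc escaping the quotes. A regular expression is one way to tolerate both escaped and plain quotes. This is a minimal sketch, not the approach used in this notebook: `INSTALL_PATTERN` and `extract_packages_from_text` are hypothetical names, and the sample text is adapted from the docstring above.

```python
import re

# Matches install.packages("Name") and install.packages(\"Name\").
# R package names consist of letters, digits, and periods.
INSTALL_PATTERN = re.compile(r'install\.packages\(\s*\\?"([A-Za-z0-9.]+)\\?"\s*\)')

def extract_packages_from_text(text):
    """Return the sorted, de-duplicated package names found in `text`."""
    return sorted(set(INSTALL_PATTERN.findall(text)))

# Sample lines: two with pandoc-escaped quotes, one with plain quotes.
sample = r'''
if(!require(DescTools)){install.packages(\"DescTools\")}\
if(!require(RVAideMemoire)){install.packages(\"RVAideMemoire\")}\
if(!require(DescTools)){install.packages("DescTools")}
'''
found = extract_packages_from_text(sample)
```

Sorting the result also gives a stable order, which a plain `list(set(...))` does not guarantee.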

## Make the install.R file for the repository

While doing this, address any peculiarities I have seen come up. For example, for `RVAideMemoire`, simply including `install.packages("RVAideMemoire")` does not produce an environment in which the package runs when launched via Binder. It reports that it also needs `mixOmics`, which does not seem to get installed because `mixOmics` has been moved from CRAN / MRAN to Bioconductor. I found it necessary to add the following special handling to the `install.R` file whenever `RVAideMemoire` is among the required libraries:

    install.packages("BiocManager")
    BiocManager::install("mixOmics")

In [None]:
special_additions_needed_for_RVAideMemoire = '''\
install.packages("BiocManager")
BiocManager::install("mixOmics")
'''

basic_line = 'install.packages("PLACEHOLDER")'
text_2_save = ''
if "RVAideMemoire" in R_packages:
    text_2_save += special_additions_needed_for_RVAideMemoire
for x in sorted(R_packages):  # sorted so that install.R is the same on every run
    text_2_save += basic_line.replace("PLACEHOLDER",x)
    text_2_save += "\n"
%store text_2_save > install.R

Download `install.R` and place it in the repository.
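
The assembly of `install.R` can also be written as a small pure function, which makes the output easy to check before saving. This is a minimal sketch under stated assumptions: `build_install_r` is a hypothetical helper name, and the `mixOmics` special case is copied from the cell above.

```python
def build_install_r(packages):
    """Return the text of an install.R file for the given R package names."""
    lines = []
    if "RVAideMemoire" in packages:
        # mixOmics comes from Bioconductor, so it is installed via BiocManager first.
        lines.append('install.packages("BiocManager")')
        lines.append('BiocManager::install("mixOmics")')
    # Sorted and de-duplicated so the file is identical from run to run.
    lines.extend(f'install.packages("{name}")' for name in sorted(set(packages)))
    return "\n".join(lines) + "\n"

install_r_text = build_install_r(["RVAideMemoire", "DescTools"])
```

The returned text could then be saved with ordinary file I/O, for example `open("install.R", "w").write(install_r_text)`, instead of the `%store` magic.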

----


In [None]:
'''
archive_file_name = "FirstSetmarkdown_from_RCompanion.tar.gz"
import os
import sys
# store `urls_to_get` and `markdowns_made` as JSON since it is lighter-weight and more portable than pickling,
# and their order will correspond to the index I made, so I can use them with papermill
# in conjunction without needing to make a new dictionary.
RCompanion_urls_to_get_storedfn = "RCompanion_urls_to_get.json"
RCompanion_markdowns_made_storedfn = "RCompanion_markdowns_made.json"
import json
with open(RCompanion_urls_to_get_storedfn, 'w') as f:
    json.dump(urls_to_get, f)
with open(RCompanion_markdowns_made_storedfn, 'w') as f:
    json.dump(markdowns_made, f)
files_to_archive = markdowns_made + [RCompanion_urls_to_get_storedfn] + [RCompanion_markdowns_made_storedfn]
!tar czf {archive_file_name} {" ".join(files_to_archive)}
sys.stderr.write("***************************DONE***********************************\n"
    "'{}' generated. Download it.\n"
    "***************************DONE***********************************".format(archive_file_name))
''';

Follow this up with `?????`.

In [None]:
import time

def executeSomething():
    # placeholder task: print a dot, then wait
    print('.')
    time.sleep(480)  # 8 minutes x 60 seconds

while True:
    executeSomething()