# XML Sitemap Generator

This notebook generates all of the sitemaps that are stored in the Platform webapp ``/sitemap`` directory and linked from the [Platform index sitemap](https://www.targetvalidation.org/sitemaps/1804/index.xml). The index sitemap is the one we submit to search engines (Google, Bing, Yahoo, Yandex, etc.) to improve our SEO.

This notebook should be **run at least 3 times per year (every other release)** to ensure that search engines are correctly indexing the Platform. 

Links and documentation:

* [JS sitemap generator (stored in webapp repo)](https://github.com/opentargets/webapp/blob/master/sitemap-generator.js)
* [Creating XML sitemap from list](https://stackoverflow.com/questions/16681543/create-xml-file-with-python-by-iterating-over-lists)
* [ElementTree XML API documentation](https://docs.python.org/3.4/library/xml.etree.elementtree.html#building-xml-documents) (the pipeline below uses `lxml.etree`, which largely follows this API)
* [Sitemap XML format](https://www.sitemaps.org/protocol.html)

### 1. Create list of sitemaps that need to be generated by pipeline

In [1]:
sitemaps = [
    {
        "title": "target association pages",
        "file_name": "target_association_pages.xml",
        "default_priority": "0.9",
        "is_association_page": True,
        "entity": "target"
    },
    {
        "title": "target profile pages",
        "file_name": "target_profile_pages.xml",
        "default_priority": "0.9",
        "is_association_page": False,
        "entity": "target"
    },
    {
        "title": "disease association pages",
        "file_name": "disease_association_pages.xml",
        "default_priority": "0.9",
        "is_association_page": True,
        "entity": "disease"
    },
    {
        "title": "disease profile pages",
        "file_name": "disease_profile_pages.xml",
        "default_priority": "0.6",
        "is_association_page": False,
        "entity": "disease"
    },
    {
        "title": "static pages",
        "file_name": "static_pages.xml",
        "default_priority": "1",
        "is_association_page": False,
        "entity": ""
    },
]
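
As a quick sanity check on the configuration above, the sketch below validates a single entry before the pipeline runs. `validate_sitemap_config` and `REQUIRED_KEYS` are hypothetical names that are not part of the original notebook. The only outside fact it relies on is that the sitemap protocol restricts `<priority>` to values between 0.0 and 1.0.

```python
# Hypothetical helper, not part of the original notebook.
REQUIRED_KEYS = {
    "title", "file_name", "default_priority", "is_association_page", "entity"
}

def validate_sitemap_config(entry):
    """Return a list of problems found in one sitemap configuration entry."""
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        return ["missing keys: " + ", ".join(sorted(missing))]

    problems = []
    # the sitemap protocol restricts <priority> to the range 0.0 to 1.0
    try:
        priority = float(entry["default_priority"])
    except (TypeError, ValueError):
        problems.append("default_priority is not a number")
    else:
        if not 0.0 <= priority <= 1.0:
            problems.append("default_priority outside 0.0-1.0")
    # the pipeline only distinguishes targets, diseases, and static pages
    if entry["entity"] not in ("target", "disease", ""):
        problems.append("unknown entity: " + repr(entry["entity"]))
    if not entry["file_name"].endswith(".xml"):
        problems.append("file_name should end in .xml")
    return problems

example_entry = {
    "title": "target profile pages",
    "file_name": "target_profile_pages.xml",
    "default_priority": "0.9",
    "is_association_page": False,
    "entity": "target",
}
print(validate_sitemap_config(example_entry))  # []
```

Running it over every entry, for example `[validate_sitemap_config(s) for s in sitemaps]`, should return only empty lists if the configuration is well formed.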

### 2. Generate list of target and disease IDs

In [2]:
# import relevant libraries
import json
import time

# set location of JSON file
filename = "data/1906_data.json"

# create two temporary lists to store target and disease IDs
target_ids = []
disease_ids = []

# set start time
start = time.time()

# open JSON file and extract target ID and disease ID from each line
with open(filename, "r") as f:
    for line in f:
        data = json.loads(line)
        target_ids.append(data["target"]["id"])
        disease_ids.append(data["disease"]["id"])
        
# set end time
end = time.time()

# remove duplicates by transforming target ID and disease ID lists into sets
targets = set(target_ids) 
diseases = set(disease_ids)

# number of targets and diseases should match data available
# at https://platform-api-qc.opentargets.io/v3/platform/public/utils/stats
print("Loaded association data JSON in a time of %s" % (end - start))
print("%i target IDs and %i disease IDs" % (len(targets), len(diseases)))

Loaded association data JSON in a time of 111.64060711860657
27021 target IDs and 10473 disease IDs
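
The loading step could also add IDs straight into sets, which avoids keeping one list entry per association line in memory. The following is a minimal sketch, assuming the same one-JSON-object-per-line layout as the data file. `collect_ids` and the sample IDs are hypothetical and only illustrate the idea.

```python
import io
import json

# Hypothetical helper, not part of the original notebook.
def collect_ids(lines):
    """Collect unique target and disease IDs from JSON Lines input."""
    target_ids, disease_ids = set(), set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        target_ids.add(record["target"]["id"])
        disease_ids.add(record["disease"]["id"])
    return target_ids, disease_ids

# made-up IDs standing in for the real association file
sample = io.StringIO(
    '{"target": {"id": "TARGET_A"}, "disease": {"id": "DISEASE_1"}}\n'
    '{"target": {"id": "TARGET_A"}, "disease": {"id": "DISEASE_2"}}\n'
    '{"target": {"id": "TARGET_B"}, "disease": {"id": "DISEASE_1"}}\n'
)
sample_targets, sample_diseases = collect_ids(sample)
print("%i target IDs and %i disease IDs" % (len(sample_targets), len(sample_diseases)))
# prints: 2 target IDs and 2 disease IDs
```

With the real file, the same function could be called as `collect_ids(f)` inside the `with open(filename, "r") as f:` block.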


### 3. Define list of static pages

In [3]:
static_pages = [
    "/",
    "/about",
    "/downloads/data",
    "/batch-search",
    "/variants",
    "/terms-of-use",
    "/faq",
    "/scoring",
    "/outreach",
]
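
Because the pipeline appends each static path to `https://www.targetvalidation.org` without adding a separator, every path needs a leading `/`, and duplicates would produce repeated `<url>` entries. A small sketch of a check for both cases follows. `find_bad_static_paths` is a hypothetical helper and the sample list is made up.

```python
# Hypothetical check, not part of the original notebook.
def find_bad_static_paths(paths):
    """Return paths that are duplicated or do not start with '/'."""
    seen = set()
    bad = []
    for path in paths:
        if not path.startswith("/") or path in seen:
            bad.append(path)
        seen.add(path)
    return bad

# made-up list containing one malformed path and one duplicate
sample_pages = ["/", "/about", "faq", "/about"]
print(find_bad_static_paths(sample_pages))  # ['faq', '/about']
```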

### 4. Pipeline code to generate sitemaps

In [4]:
# import relevant libraries
import lxml.etree as etree
from datetime import datetime
import time

# determine current date and format in YYYY-MM-DD format
today_date = datetime.today().strftime('%Y-%m-%d')

def create_individual_sitemap(sitemap, data_list):
    
    # create <urlset> root and set attribute values
    attribute_qname = etree.QName("http://www.w3.org/2001/XMLSchema-instance", "schemaLocation")
    namespace_mappings = {
        "xsi": "http://www.w3.org/2001/XMLSchema-instance",
        None: "http://www.sitemaps.org/schemas/sitemap/0.9"
    }
    urlset = etree.Element("urlset",
                           {attribute_qname: "http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"},
                           nsmap=namespace_mappings
                          )

    # iterate through list of targets, diseases, or static pages
    for item in data_list:
        
        # create <url> and <loc> elements
        url = etree.SubElement(urlset,"url")
        loc = etree.SubElement(url, "loc")
        
        # check what type of entities are in the list and assign the correct loc_base_url value
        if sitemap["entity"] == "target":
            loc_base_url = "https://www.targetvalidation.org/target/"
        elif sitemap["entity"] == "disease":
            loc_base_url = "https://www.targetvalidation.org/disease/"
        else:
            loc_base_url = "https://www.targetvalidation.org"
        
        # set the text for the <loc> element by checking whether the sitemap is for
        # association pages, profile pages, or static pages
        if sitemap["is_association_page"]:
            loc.text = loc_base_url + item + "/associations"
        else:
            loc.text = loc_base_url + item

        # create <lastmod>, <changefreq>, and <priority> elements
        lastmod = etree.SubElement(url, "lastmod")
        lastmod.text = today_date
        changefreq = etree.SubElement(url, "changefreq")
        changefreq.text = "monthly"
        priority = etree.SubElement(url, "priority")
        priority.text = sitemap["default_priority"]
        
    # create and save XML file
    xml_tree_raw = etree.ElementTree(urlset)
    with open("sitemaps/" + sitemap["file_name"], "wb") as xml_file:
        xml_file.write(etree.tostring(xml_tree_raw, xml_declaration=True, encoding="UTF-8", pretty_print=True))
    
    print("Created " + sitemap["file_name"] + " sitemap")
    
def create_index_sitemap(sitemaps):
    
    # create <sitemapindex> root node
    sitemapindex = etree.Element("sitemapindex")

    # declare the sitemap protocol namespace on the root node
    sitemapindex.set("xmlns", "http://www.sitemaps.org/schemas/sitemap/0.9")

    # create one <sitemap> child node per entry in the sitemaps list
    for sitemap in sitemaps:
        sitemap_element = etree.SubElement(sitemapindex,"sitemap")
        loc = etree.SubElement(sitemap_element, "loc")
        loc.text = "https://www.targetvalidation.org/sitemaps/" + sitemap["file_name"]
        lastmod = etree.SubElement(sitemap_element, "lastmod")
        lastmod.text = today_date
    
    # create and save XML file
    xml_tree_raw = etree.ElementTree(sitemapindex)
    with open("sitemaps/index.xml", "wb") as xml_file:
        xml_file.write(etree.tostring(xml_tree_raw, xml_declaration=True, encoding="UTF-8", pretty_print=True))
    
    print("Created index.xml sitemap")
    
def create_all_sitemaps(sitemaps, targets, diseases, static_pages):
    
    print("*****")
    print("Started XML sitemap generation pipeline")
    print("*****")
    
    # set start time
    start = time.time()
    
    # create sitemaps for target profile, target associations, disease profile, disease association, and static pages
    for sitemap in sitemaps:
        if sitemap["entity"] == "target":
            create_individual_sitemap(sitemap, targets)
        elif sitemap["entity"] == "disease":
            create_individual_sitemap(sitemap, diseases)
        else:
            create_individual_sitemap(sitemap, static_pages)
    
    # create index sitemap to link all sitemaps together
    # note: the URL for this sitemap is what is submitted to search engines for indexing
    create_index_sitemap(sitemaps)
    
    # set end time
    end = time.time()
    
    print("*****")
    print("Finished XML sitemap generation pipeline in %s" % (end - start))
    print("*****")
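
To make the shape of the generated XML easier to picture, here is a minimal sketch that builds a single `<url>` entry. It deliberately swaps in the standard library `xml.etree.ElementTree` for `lxml` so that it runs without extra dependencies. `build_urlset`, the placeholder ID, and the date are illustrative assumptions, not the notebook's own code or data.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# serialize the sitemap namespace as the default (prefix-less) namespace
ET.register_namespace("", SITEMAP_NS)

def build_urlset(locations, lastmod, priority, changefreq="monthly"):
    """Build a <urlset> element with one <url> child per location."""
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    for location in locations:
        url = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        ET.SubElement(url, "{%s}loc" % SITEMAP_NS).text = location
        ET.SubElement(url, "{%s}lastmod" % SITEMAP_NS).text = lastmod
        ET.SubElement(url, "{%s}changefreq" % SITEMAP_NS).text = changefreq
        ET.SubElement(url, "{%s}priority" % SITEMAP_NS).text = priority
    return urlset

# EXAMPLE_ID and the date are placeholders, not real data
example_loc = "https://www.targetvalidation.org/target/EXAMPLE_ID/associations"
example_urlset = build_urlset([example_loc], "2019-06-01", "0.9")
xml_text = ET.tostring(example_urlset, encoding="unicode")
print(xml_text)
```

The printed document has a `<urlset>` root in the sitemap namespace with one `<url>` child holding `<loc>`, `<lastmod>`, `<changefreq>`, and `<priority>`, which is the same layout the `lxml` pipeline writes for each target, disease, or static page.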

### 5. Run pipeline

In [5]:
create_all_sitemaps(sitemaps, targets, diseases, static_pages)

*****
Started XML sitemap generation pipeline
*****
Created target_association_pages.xml sitemap
Created target_profile_pages.xml sitemap
Created disease_association_pages.xml sitemap
Created disease_profile_pages.xml sitemap
Created static_pages.xml sitemap
Created index.xml sitemap
*****
Finished XML sitemap generation pipeline in 0.6294372081756592
*****
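
After a run, it can be worth confirming that each file holds the expected number of entries and stays within the sitemap protocol's limit of 50,000 URLs per file. The sketch below parses sitemap XML with the standard library and counts `<url>` elements. `count_urls`, `check_sitemap`, and the inline sample are hypothetical and stand in for the real generated files.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50000  # per-file limit in the sitemap protocol

# Hypothetical helpers, not part of the original notebook.
def count_urls(xml_bytes):
    """Count the <url> entries in a serialized sitemap."""
    root = ET.fromstring(xml_bytes)
    return sum(1 for _ in root.iter("{%s}url" % SITEMAP_NS))

def check_sitemap(xml_bytes, expected_count):
    """Return True if the count matches and is within the protocol limit."""
    found = count_urls(xml_bytes)
    return found == expected_count and found <= MAX_URLS_PER_SITEMAP

# small inline sample standing in for a generated file
SAMPLE = b"""<?xml version='1.0' encoding='UTF-8'?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.targetvalidation.org/</loc></url>
  <url><loc>https://www.targetvalidation.org/about</loc></url>
</urlset>
"""
print(count_urls(SAMPLE))        # 2
print(check_sitemap(SAMPLE, 2))  # True
```

For the real output, the bytes could be read with `open("sitemaps/target_profile_pages.xml", "rb").read()` and compared against `len(targets)`, and likewise for the disease and static page files.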
