### Bioinformatics and text files 

The following text is from:  
https://medium.com/ngs-sh/a-simple-introduction-to-xml-and-json-4547b93c9aae

Most bioinformatics file formats are simple text files; a well-known example is the FASTA format for storing sequences. Historically, most file formats were proposed ad hoc to address a specific need, resulting in a fragmented universe of formats.

Well-known bioinformatics formats include FASTA and FASTQ for sequences, SAM for storing the details of sequence mappings, VCF for describing the variants of an individual compared to a reference genome, and GFF and BED for describing features in a genome (e.g. genes, enhancers, binding sites).

#### XML and JSON
Among the general-purpose formats commonly used in computer science, two are XML (eXtensible Markup Language) and JSON (JavaScript Object Notation). The former was very popular in the early 2000s, while the latter gained popularity later. Both are meant to encode structured information and can describe almost any kind of document (not necessarily in an ideal way). XML is more formal and enables strict adherence to a defined structure, while JSON is a simpler data container. That simplicity has made JSON popular; the BIOM 1.0 format is an example of a widely adopted JSON-based format.
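
As a small illustration of the difference, the sketch below encodes the same gene record once as XML and once as JSON and reads both with Python's standard library. The record and the variable names are made up for this example.

```python
import json
import xml.etree.ElementTree as ET

# The same (illustrative) record in both formats
xml_text = '<gene symbol="BRCA1"><id>672</id></gene>'
json_text = '{"symbol": "BRCA1", "id": 672}'

# XML: every value is text; structure comes from tags and attributes
gene_elem = ET.fromstring(xml_text)
xml_record = {
    "symbol": gene_elem.get("symbol"),
    "id": int(gene_elem.find("id").text),  # explicit conversion needed
}

# JSON: maps directly onto Python dicts, lists, strings and numbers
json_record = json.loads(json_text)

print(xml_record == json_record)  # True
```

Note that the XML version needs an explicit `int()` conversion, while JSON already distinguishes numbers from strings.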


_________

### Extensible Markup Language -- `xml` -- format

The following text is from:    
https://www.tutorialspoint.com/xml/xml_overview.htm

XML stands for Extensible Markup Language. It is a text-based markup language derived from Standard Generalized Markup Language (SGML).

XML tags identify the data and are used to store and organize it, rather than specifying how to display it, as HTML tags do. XML is not going to replace HTML in the near future, but it introduces new possibilities by adopting many successful features of HTML.

There are three important characteristics of XML that make it useful in a variety of systems and solutions:  
* XML is extensible − XML allows you to create your own self-descriptive tags, or language, that suits your application.
* XML carries the data, does not present it − XML allows you to store the data irrespective of how it will be presented.
* XML is a public standard − XML was developed by an organization called the World Wide Web Consortium (W3C) and is available as an open standard.

##### XML Usage
A short list of XML usage says it all:
* XML can work behind the scenes to simplify the creation of HTML documents for large web sites.
* XML can be used to exchange information between organizations and systems.
* XML can be used for offloading and reloading databases.
* XML can be used to store and arrange data in a way that suits your data handling needs.
* XML can easily be merged with style sheets to create almost any desired output.

Virtually any type of data can be expressed as an XML document.

##### What is Markup?   
XML is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. So what exactly is a markup language? Markup is information added to a document that enhances its meaning in certain ways, in that it identifies the parts and how they relate to each other. More specifically, a markup language is a set of symbols that can be placed in the text of a document to demarcate and label the parts of that document.




#### Parsing `xml` files

```python
import xml.etree.ElementTree as ET
```

https://docs.python.org/3/library/xml.html

The XML handling submodules are:

* xml.etree.ElementTree: the ElementTree API, a simple and lightweight XML processor
* xml.dom: the DOM API definition
* xml.dom.minidom: a minimal DOM implementation
* xml.dom.pulldom: support for building partial DOM trees
* xml.sax: SAX2 base classes and convenience functions
* xml.parsers.expat: the Expat parser binding
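
The rest of this lesson uses `xml.etree.ElementTree`. For comparison, here is a minimal sketch of reading the same kind of data with the DOM-style API from `xml.dom.minidom`; the inline XML string is a shortened, illustrative version of the gene data used below.

```python
from xml.dom import minidom

# Shortened, illustrative gene data
xml_text = """<data>
    <gene symbol="BRCA1"><id>672</id></gene>
    <gene symbol="BRCA2"><id>675</id></gene>
</data>"""

# Parse the string into a DOM document
dom = minidom.parseString(xml_text)

# getElementsByTagName() searches the whole document for matching elements
symbols = [gene.getAttribute("symbol") for gene in dom.getElementsByTagName("gene")]
print(symbols)  # ['BRCA1', 'BRCA2']
```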



`genes.xml`

```xml
<?xml version="1.0"?>
<data>
    <gene symbol="BRCA1">
        <id>672</id>
        <name>BRCA1 DNA repair associated</name>
        <alias sym="IRIS"/>
        <alias sym="PSCP"/>
    </gene>
    <gene symbol="BRCA2">
        <id>675</id>
        <name>BRCA2 DNA repair associated</name>
        <alias sym="FAD"/>
        <alias sym="FAD1"/>
        <alias sym="BRCC2"/>
    </gene>
</data>
```

In [None]:
import xml.etree.ElementTree as ET

In [None]:
# see what is available for the ElementTree objects 
[elem for elem in dir(ET) if not elem.startswith("_")]

In [None]:
help(ET.parse)

In [None]:
tree = ET.parse('genes.xml')

In [None]:
[elem for elem in dir(tree) if not elem.startswith("_")]

In [None]:
help(tree.getroot)

In [None]:
root = tree.getroot()

In [None]:
[elem for elem in dir(root) if not elem.startswith("_")]

In [None]:
# As an Element object, root has a tag and a dictionary of attributes:
root.tag

In [None]:
root.attrib

In [None]:
# root also has child Elements
for child in root:
    print(child.tag, child.attrib)

In [None]:
# Children are nested, and we can access specific child nodes by index:

root[0].attrib

In [None]:
root[0][0].text

In [None]:
root[0][1].text

`Element` has some useful methods that help iterate recursively over the whole sub-tree below it (its children, their children, and so on). One example is `Element.iter()`:

In [None]:
# we can explore specific tags

for alias in root.iter('alias'):
    print(alias.attrib)
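
`Element.iter()` searches the whole sub-tree, whereas `Element.findall()` with a plain tag name only looks at direct children. The sketch below contrasts the two on a shortened, illustrative version of the gene data.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative gene data
xml_text = """<data>
    <gene symbol="BRCA1"><alias sym="IRIS"/><alias sym="PSCP"/></gene>
    <gene symbol="BRCA2"><alias sym="FAD"/></gene>
</data>"""
root = ET.fromstring(xml_text)

# iter() walks all descendants, at any depth
all_aliases = [a.get("sym") for a in root.iter("alias")]

# findall("alias") only checks direct children of root, so nothing is found
direct_aliases = root.findall("alias")

# a path expression reaches the nested elements explicitly
nested_aliases = [a.get("sym") for a in root.findall("./gene/alias")]

print(all_aliases)     # ['IRIS', 'PSCP', 'FAD']
print(direct_aliases)  # []
print(nested_aliases)  # ['IRIS', 'PSCP', 'FAD']
```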

In [None]:
genes = []
for gene in root.findall('gene'):
    gene_id = gene.find('id').text
    symbol = gene.get('symbol')
    name = gene.find('name').text
    aliases = []
    for alias in gene.findall('alias'):
        aliases.append(alias.get('sym'))
    genes.append((gene_id, symbol, name, aliases))
genes
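
The loop above assumes that every gene has an `id` and a `name` child. `find()` returns `None` when a child is missing, so `find(...).text` would raise an `AttributeError`. A minimal sketch of a more defensive version uses `findtext()` with a default; the `TEST1` gene is a hypothetical record added only to show a missing element.

```python
import xml.etree.ElementTree as ET

# Illustrative data; TEST1 is a hypothetical gene without a <name> child
xml_text = """<data>
    <gene symbol="BRCA1"><id>672</id><name>BRCA1 DNA repair associated</name></gene>
    <gene symbol="TEST1"><id>0</id></gene>
</data>"""
root = ET.fromstring(xml_text)

records = []
for gene in root.findall("gene"):
    records.append({
        "symbol": gene.get("symbol"),
        "id": int(gene.findtext("id")),
        # findtext() returns the default instead of failing on a missing child
        "name": gene.findtext("name", default="unknown"),
    })

print(records[1]["name"])  # unknown
```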

In [None]:
# xml to pandas DataFrame
import pandas as pd
pd.DataFrame(genes, columns = ("gene_id", "symbol", "name", "aliases"))

#### xml to pandas DataFrame
from: https://www.geeksforgeeks.org/how-to-create-pandas-dataframe-from-nested-xml/

Save the following content into a file named `stores.xml`:

```xml
<?xml version="1.0" encoding="UTF-8"?>
  
       <Food>
  
           <Info>
           <Msg>Food Store items.</Msg>
           </Info>
  
           <store slNo="1">
               <foodItem>oranges</foodItem>
               <price>5</price>
               <quantity>1kg</quantity>
               <discount>7%</discount>
           </store>
  
           <store slNo="2">
               <foodItem>carrots</foodItem>
               <price>2</price>
               <quantity>1kg</quantity>
               <discount>5%</discount>
           </store>
  
       </Food>
```


In [None]:
import xml.etree.ElementTree as ETree
import pandas as pd
  
# Path to the XML file saved above
prstree = ETree.parse("stores.xml")
root = prstree.getroot()

# Collect one row per <store> element
all_items = []

for storeno in root.iter('store'):
    store_Nr = storeno.attrib.get('slNo')
    itemsF = storeno.find('foodItem').text
    price = storeno.find('price').text
    quan = storeno.find('quantity').text
    dis = storeno.find('discount').text

    all_items.append([store_Nr, itemsF, price, quan, dis])

# Build the DataFrame; all values are strings as read from the XML
xmlToDf = pd.DataFrame(all_items, columns=['SL No', 'FOOD_ITEM', 'PRICE', 'QUANTITY', 'DISCOUNT'])
  
print(xmlToDf.to_string(index=False))

Resources:

* https://www.geeksforgeeks.org/xml-parsing-python/
* https://www.w3schools.com/xml/
* https://www.tutorialspoint.com/python/python_xml_processing.htm
* https://docs.python.org/3/library/xml.etree.elementtree.html
* https://realpython.com/python-xml-parser/
* https://docs.python-guide.org/scenarios/xml/
* https://www.geeksforgeeks.org/reading-and-writing-xml-files-in-python/
* https://www.guru99.com/manipulating-xml-with-python.html


______

### JavaScript Object Notation -- `json` -- format

A JSON document is composed of items stored as key-value pairs.    
Values can be single values (strings, integers, floating-point numbers…) or lists of values.

https://medium.com/ngs-sh/a-simple-introduction-to-xml-and-json-4547b93c9aae


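JSON supports primitive types (strings, numbers, booleans, and null) as well as arrays and objects, which can be nested. In Python these correspond to `str`, `int`/`float`, `bool`, `None`, `list`, and `dict`. JSON has no tuple type; Python tuples are written out as arrays.  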
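
As a quick check of which Python types the standard `json` module produces when decoding, the sketch below parses a small made-up record (the `score`, `reviewed`, and `note` keys are illustrative only) and lists the type of each value.

```python
import json

# Illustrative record covering each JSON value type
json_text = (
    '{"symbol": "BRCA1", "id": 672, "score": 0.5, '
    '"reviewed": true, "note": null, "aliases": ["IRIS", "PSCP"]}'
)

record = json.loads(json_text)

# Map each key to the name of the Python type it was decoded into
types = {key: type(value).__name__ for key, value in record.items()}
print(types)
```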

#### Working with `json` format data

Import the `json` module.   
https://docs.python.org/3/library/json.html    
To work with JSON data (a string or a file), it first has to be translated into a Python data structure. In this lesson, we use Python's built-in `json` module to do that.   

```python
import json
```
   
The `json` module provides four main functions for reading and writing JSON data:   

* load(): reads JSON from an open file and returns the corresponding Python object (a dictionary when the top-level JSON value is an object).
* loads(): parses JSON from a string and returns the corresponding Python object.
* dump(): writes a Python object as JSON to an open file.
* dumps(): converts a Python object into a JSON-formatted string.
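
The four functions can be tried together in a short round trip. This is a minimal sketch; the file name `gene_example.json` is hypothetical and the file is written to a temporary directory so nothing is left behind.

```python
import json
import os
import tempfile

gene = {"id": 672, "symbol": "BRCA1", "aliases": ["IRIS", "PSCP"]}

# dumps() / loads(): Python object <-> JSON string
text = json.dumps(gene)
from_text = json.loads(text)

# dump() / load(): Python object <-> JSON file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "gene_example.json")  # hypothetical file name
    with open(path, "w") as handle:
        json.dump(gene, handle)
    with open(path) as handle:
        from_file = json.load(handle)

print(from_text == gene, from_file == gene)  # True True
```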

https://www.networkacademy.io/devnet-associate/data-formats/parsing-json-with-python

Data type conversion: Python to JSON  

| Python                                 | JSON   |
|----------------------------------------|--------|
| dict                                   | object |
| list, tuple                            | array  |
| str                                    | string |
| int, float, int- & float-derived Enums | number |
| True                                   | true   |
| False                                  | false  |
| None                                   | null   |
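
One consequence of the table above is that a round trip through JSON does not always give back the original Python types: a tuple is written as an array and comes back as a list. A small sketch with made-up values:

```python
import json

# Illustrative values: a tuple, None and a boolean
original = {"pair": ("test", "gene"), "missing": None, "flag": True}

encoded = json.dumps(original)
decoded = json.loads(encoded)

print(encoded)          # {"pair": ["test", "gene"], "missing": null, "flag": true}
print(decoded["pair"])  # ['test', 'gene']
```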



`gene.json`

```json
{
  "id": 672,
  "symbol": "BRCA1",
  "full_name": "BRCA1 DNA repair associated",
  "aliases": [
    "IRIS",
    "PSCP",
    "BRCAI",
    "FANCS",
    "PNCA4",
    "RNF53",
    "BROVCA1",
    "PPP1R53"
  ]
}
```

In [None]:
import json

In [None]:
help(json.load)

In [None]:
with open("gene.json") as gene_file: 
    gene1_dict = json.load(gene_file) 
gene1_dict

In [None]:
test_lst = [1,2,3, ("test", "gene")]

In [None]:
# create json format string from list object
json.dumps(test_lst)

In [None]:
# create json format string from dict object
res = json.dumps(gene1_dict)
res

In [None]:
type(res)
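
By default `json.dumps()` returns a compact single-line string. For output meant to be read by people, the `indent` and `sort_keys` parameters are useful; the sketch below uses a shortened, illustrative gene record.

```python
import json

# Shortened, illustrative gene record
gene = {"symbol": "BRCA1", "id": 672, "aliases": ["IRIS", "PSCP"]}

# indent=2 puts each item on its own line; sort_keys=True orders keys alphabetically
pretty = json.dumps(gene, indent=2, sort_keys=True)
print(pretty)
```

The formatted string still parses back to the same data, so the extra whitespace only affects readability.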

Resources:
    
* https://www.geeksforgeeks.org/working-with-json-data-in-python/
* https://www.geeksforgeeks.org/read-json-file-using-python/
* https://www.networkacademy.io/devnet-associate/data-formats/parsing-json-with-python
* https://medium.com/ngs-sh/a-simple-introduction-to-xml-and-json-4547b93c9aae
* https://www.tutorialspoint.com/python_data_science/python_processing_json_data.htm
* https://www.w3schools.com/js/js_json_intro.asp
* https://www.w3schools.com/python/python_json.asp
* https://www.tutorialspoint.com/json/json_comparison.htm

### Exercise:

Add the BRCA2 gene to the json file to have a list of 2 genes, and read the results.    
Make a pandas DataFrame out of the list of genes.   
Convert the data from the DataFrame into json format.   
Save it to a file.   
Load the data from the file into a DataFrame.


https://www.ncbi.nlm.nih.gov/gene/675


```json
{
  "id": 675,
  "symbol": "BRCA2",
  "full_name": "BRCA2 DNA repair associated",
  "aliases": [
    "FAD",
    "FACD",
    "FAD1",
    "GLM3",
    "BRCC2",
    "FANCD",
    "PNCA2",
    "FANCD1",
    "XRCC11",
    "BROVCA2"
  ]
}
```

    