# XML Parsing Using the xml.etree.ElementTree Module


The xml.etree.ElementTree module implements a simple and efficient API for parsing and creating XML data.

This is a short tutorial on using xml.etree.ElementTree (ET for short). The goal is to demonstrate some of the building blocks and basic concepts of the module.

**XML tree and elements:**
XML is an inherently hierarchical data format, and the most natural way to represent it is with a tree. ET has two classes for this purpose - ElementTree represents the whole XML document as a tree, and Element represents a single node in this tree. Interactions with the whole document (reading and writing to/from files) are usually done on the ElementTree level. Interactions with a single XML element and its sub-elements are done on the Element level.
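
As a minimal sketch of the two levels, the snippet below builds a tiny tree in memory, wraps its root in an ElementTree, and serializes it. The tag names (`library`, `book`) and the in-memory buffer are invented for illustration and are not part of the sample data used later.

```python
import io
import xml.etree.ElementTree as ET

# Element level: create a root node and one child (hypothetical tag names)
root = ET.Element("library")
book = ET.SubElement(root, "book", {"id": "1"})
book.text = "Example title"

# ElementTree level: wrap the root to read or write the whole document
tree = ET.ElementTree(root)
buffer = io.BytesIO()  # stands in for a file on disk
tree.write(buffer, encoding="utf-8")

serialized = buffer.getvalue().decode("utf-8")
print(serialized)
```

Writing to a real file works the same way: pass a filename to `tree.write()` instead of the buffer.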

**Parsing XML**
We’ll use this [abstract file](samples/test.xml) as the sample data for this section.

We can import this data by reading from a file:

In [68]:
import xml.etree.ElementTree as ET
tree = ET.parse('samples/test.xml')
root = tree.getroot()

Alternatively, ET.fromstring() parses XML from a string directly into an Element, which is the root element of the parsed tree. Other parsing functions, such as ET.parse() above, may create an ElementTree instead. Check the documentation to be sure.
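
The difference can be sketched with an inline XML string. The string below is invented for this example and only mimics the outer shape of the sample file; `io.StringIO` stands in for a file on disk.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical XML, shaped loosely like the sample file
xml_text = "<PubmedArticleSet><PubmedArticle/><PubmedArticle/></PubmedArticleSet>"

# fromstring() returns the root Element directly
root_from_string = ET.fromstring(xml_text)
print(type(root_from_string).__name__, root_from_string.tag)

# parse() takes a filename or file object and returns an ElementTree,
# so the root has to be fetched with getroot()
tree = ET.parse(io.StringIO(xml_text))
root_from_tree = tree.getroot()
print(type(tree).__name__, root_from_tree.tag)
```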

As an Element, root has a tag and a dictionary of attributes:



In [69]:
print(root.tag)

print(root.attrib)  # PubmedArticleSet has no attributes, so this prints an empty dict

PubmedArticleSet
{}


It also has child nodes over which we can iterate:



In [70]:
for child in root:
    print(child.tag, child.attrib)

PubmedArticle {}
PubmedArticle {}
PubmedArticle {}
PubmedArticle {}
PubmedArticle {}


Children are nested, and we can access specific child nodes by index:

In [71]:
print("\n", root[0][0].attrib)

print("\n", root[0][0][3].attrib)

print("\n", root[0][0][3][3][0].attrib)

print("\n", root[0][0][3][3][0].text[0:800])  # print only the first 800 characters


 {'Status': 'Publisher', 'Owner': 'NLM'}

 {'PubModel': 'Electronic-eCollection'}

 {'NlmCategory': 'UNASSIGNED'}

 We report a case study of natural variations and correlations of some photosynthetic parameters, green biomass and grain yield in Cappelle Desprez and Plainsman V winter wheat (Triticum aestivum L.) cultivars, which are classified as being drought sensitive and tolerant, respectively. We monitored biomass accumulation from secondary leaves in the vegetative phase and grain yield from flag leaves in the grain filling period. Interestingly, we observed higher biomass production, but lower grain yield stability in the sensitive Cappelle cultivar, as compared to the tolerant Plainsman cv. Higher biomass production in the sensitive variety was correlated with enhanced water-use efficiency. Increased cyclic electron flow around PSI was also observed in the Cappelle cv. under drought stress as sh





**Finding interesting elements:**
Element has some useful methods that help iterate recursively over all the sub-tree below it (its children, their children, and so on). For example, Element.iter():

In [72]:
text = ""
for abstract in root.iter('AbstractText'):
    # Walk every AbstractText element in the tree and collect its text
    text += abstract.text or ""  # .text is None for an element with no text
print("\n\n", text[0:500])  # print only the first 500 characters



 We report a case study of natural variations and correlations of some photosynthetic parameters, green biomass and grain yield in Cappelle Desprez and Plainsman V winter wheat (Triticum aestivum L.) cultivars, which are classified as being drought sensitive and tolerant, respectively. We monitored biomass accumulation from secondary leaves in the vegetative phase and grain yield from flag leaves in the grain filling period. Interestingly, we observed higher biomass production, but lower grain yi
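
To make the recursion explicit, here is a small sketch on a made-up document (the tags `a`, `b`, and `c` are placeholders, not tags from the sample file). It contrasts plain iteration, which visits direct children only, with `iter()`, which visits the element itself and every descendant in document order.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one <c> is nested inside <b>, the other is a direct child
root = ET.fromstring("<a><b><c>one</c></b><c>two</c></a>")

direct_tags = [child.tag for child in root]     # direct children only
all_tags = [elem.tag for elem in root.iter()]   # root plus all descendants
c_texts = [elem.text for elem in root.iter("c")]  # every <c>, at any depth

print(direct_tags)
print(all_tags)
print(c_texts)
```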


**Element.findall()** finds only elements with a given tag that are direct children of the current element. Element.find() finds the first child with a particular tag, and Element.text accesses the element’s text content. Element.get() accesses the element’s attributes.

In [73]:
for article in root.findall('PubmedArticle'):
    print("\n", article)  # PubmedArticle elements are direct children of root, so they are found


for abstract_text in root.findall('AbstractText'):
    print(abstract_text)  # AbstractText is not a direct child of root, so nothing matches and the loop body never runs
    
    


 <Element 'PubmedArticle' at 0x7f6576c4c368>

 <Element 'PubmedArticle' at 0x7f6576c4a1d8>

 <Element 'PubmedArticle' at 0x7f6576c58688>

 <Element 'PubmedArticle' at 0x7f6576c5fc78>

 <Element 'PubmedArticle' at 0x7f6576c6a2c8>
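
The other accessors mentioned above, find(), text, and get(), can be sketched on a small stand-in document. The XML below is hypothetical and heavily simplified: the tag names `MedlineCitation` and `PMID` and all values are assumptions for illustration, not read from the sample file. The last two lines also show that a path such as `.//AbstractText` lets findall() reach descendants that a bare tag name misses.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified record; tag names and values are assumptions
xml_text = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation Status="Publisher" Owner="NLM">
      <PMID>12345</PMID>
      <Abstract>
        <AbstractText>First abstract.</AbstractText>
      </Abstract>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""
root = ET.fromstring(xml_text)

article = root.find("PubmedArticle")        # first direct child with this tag
citation = article.find("MedlineCitation")
print(citation.get("Status"))               # attribute lookup
print(citation.get("Missing", "n/a"))       # default when the attribute is absent
print(citation.find("PMID").text)           # text content of a child element

# A bare tag name matches direct children only; a path can search deeper
print(root.findall("AbstractText"))
print([a.text for a in root.findall(".//AbstractText")])
```

Note that find() returns None when nothing matches, so real code should check the result before reading `.text` from it.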
