Parsing
1. ConfigParser
2. HTML Parser
3. XML Parser
4. JSON

### Config Parser

* A sample configuration file with section “bug_tracker” and three options would look like:
```
[bug_tracker]
url = http://localhost:8080/bugs/
username = pace
password = SECRET
```
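As a minimal sketch of how such a file is read, the snippet below parses the same text from a string instead of a file. It assumes Python 3, where the module is named `configparser` (the cells in these notes use the Python 2 name `ConfigParser`); the variable names are only illustrative.

```python
import configparser  # Python 3 name; on Python 2 the module is ConfigParser

SAMPLE = """
[bug_tracker]
url = http://localhost:8080/bugs/
username = pace
password = SECRET
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)  # read_string() parses text held in memory

# Sections and options come back as plain lists of names.
section_names = config.sections()
option_names = config.options('bug_tracker')
tracker_url = config.get('bug_tracker', 'url')

print(section_names)
print(option_names)
print(tracker_url)
```

Values are always returned as strings; `getint`, `getfloat` and `getboolean` convert them on request.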

* Example: back up all MySQL databases, one file per database, with a date stamp appended to each file name.

```python
# Import the modules
import os
import ConfigParser
import time

# On Debian, /etc/mysql/debian.cnf contains a root-like login and password.
config = ConfigParser.ConfigParser()
config.read("/etc/mysql/debian.cnf")
username = config.get('client', 'user')
password = config.get('client', 'password')
hostname = config.get('client', 'host')
filestamp = time.strftime('%Y-%m-%d')

# Get a list of databases, skipping MySQL's built-in schemas.
database_list_command = "mysql -u %s -p%s -h %s --silent -N -e 'show databases'" % (username, password, hostname)
for database in os.popen(database_list_command).readlines():
    database = database.strip()
    if database in ('information_schema', 'performance_schema'):
        continue
    filename = "/backups/mysql/%s-%s.sql" % (database, filestamp)
    # Note: -d is short for --no-data, so only table definitions are dumped.
    os.popen("mysqldump --single-transaction -u %s -p%s -h %s -d %s | gzip -c > %s.gz" % (username, password, hostname, database, filename))
```
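The script builds its shell commands by string interpolation, which breaks if a password or database name contains spaces or quotes. The sketch below shows one possible alternative, assuming Python 3: the pieces are built as plain values that can be checked before anything is executed. The helper names are hypothetical, and the `-d` flag is left out here because it is short for `--no-data`.

```python
import time

# Schemas skipped by the loop in the script above.
SKIPPED_DATABASES = ('information_schema', 'performance_schema')


def backup_filename(database, directory='/backups/mysql', when=None):
    """Return the dated target path, e.g. /backups/mysql/<db>-<date>.sql.gz."""
    stamp = time.strftime('%Y-%m-%d', when or time.localtime())
    return '%s/%s-%s.sql.gz' % (directory, database, stamp)


def dump_command(username, password, hostname, database):
    """Build the mysqldump call as a list, so no shell quoting is involved."""
    return ['mysqldump', '--single-transaction',
            '-u', username, '-p' + password, '-h', hostname, database]


def databases_to_backup(names):
    """Strip whitespace and drop empty lines and the skipped schemas."""
    cleaned = (name.strip() for name in names)
    return [name for name in cleaned if name and name not in SKIPPED_DATABASES]
```

A list built this way can be passed to `subprocess.Popen` with `stdout=subprocess.PIPE`, and the output written through `gzip.open`, instead of going through a shell pipeline.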



In [5]:
import ConfigParser
cfg = ConfigParser.ConfigParser()
cfg.read('config.cfg')
print cfg

print dir(cfg)

<ConfigParser.ConfigParser instance at 0x7faa740e25a8>
['OPTCRE', 'OPTCRE_NV', 'SECTCRE', '_KEYCRE', '__doc__', '__init__', '__module__', '_boolean_states', '_defaults', '_dict', '_get', '_interpolate', '_interpolation_replace', '_optcre', '_read', '_sections', 'add_section', 'defaults', 'get', 'getboolean', 'getfloat', 'getint', 'has_option', 'has_section', 'items', 'options', 'optionxform', 'read', 'readfp', 'remove_option', 'remove_section', 'sections', 'set', 'write']


In [2]:
print cfg.sections()

['bug_tracker']


In [3]:
!cat config.cfg


[bug_tracker]
url = http://localhost:8080/bugs/
username = pace
password = SECRET



In [4]:
print cfg.has_section('test')

False


In [None]:
print cfg.get('bug_tracker', 'url')
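
Asking `get` for a section or option that is not in the file raises `NoSectionError` or `NoOptionError`. A minimal sketch of guarding the lookup with `has_section` and `has_option` follows; it assumes Python 3 (`configparser`), and `get_option` is a hypothetical helper, not part of the library.

```python
import configparser  # Python 3 name; on Python 2 the module is ConfigParser

config = configparser.ConfigParser()
config.read_string("[bug_tracker]\nusername = pace\n")


def get_option(cfg, section, option, default=None):
    """Hypothetical helper: return the option's value, or default if absent."""
    if cfg.has_section(section) and cfg.has_option(section, option):
        return cfg.get(section, option)
    return default


print(get_option(config, 'bug_tracker', 'username'))
print(get_option(config, 'section1', 'bb', default='not set'))
```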

### HTML Parsing


In [6]:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_doc)

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a href="http://example.com/elsie" class="sister" id="link1">
    Elsie
   </a>
   ,
   <a href="http://example.com/lacie" class="sister" id="link2">
    Lacie
   </a>
   and
   <a href="http://example.com/tillie" class="sister" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


In [7]:
soup.title
# <title>The Dormouse's story</title>


<title>The Dormouse's story</title>

In [8]:

soup.title.name
# u'title'


u'title'

In [9]:

soup.title.string
# u"The Dormouse's story"


u"The Dormouse's story"

In [10]:
print soup.title.parent.name
# head

head


In [11]:
soup.p
# <p class="title"><b>The Dormouse's story</b></p>


<p class="title"><b>The Dormouse's story</b></p>

In [12]:
soup.p['class']
# u'title'

u'title'

In [13]:
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>

In [14]:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_doc)
print [x for x in dir(soup) if 'find' in x]
print soup.findAll('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


['_findAll', '_findOne', 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild', 'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings', 'findParent', 'findParents', 'findPrevious', 'findPreviousSibling', 'findPreviousSiblings']
[<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>, <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>]


In [16]:
soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>

In [18]:
# Extract all the URLs found within a page's <a> tags:
for link in soup.findAll('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie


http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
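
For comparison, the same link extraction can be sketched with only the standard library. This is not BeautifulSoup's approach: it assumes Python 3's `html.parser` module, and `LinkCollector` is a hypothetical class name.

```python
from html.parser import HTMLParser  # Python 3; the Python 2 module is HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)


SNIPPET = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""

collector = LinkCollector()
collector.feed(SNIPPET)
print(collector.links)
```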


In [19]:
# Another common task is extracting all the text from a page:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_doc)
print [x for x in dir(soup) if 'text' in x]
print(soup.text)


['_SGMLParser__starttag_text', 'get_starttag_text', 'text']
The Dormouse's storyThe Dormouse's storyOnce upon a time there were three little sisters; and their names wereElsie,LacieandTillie;
and they lived at the bottom of a well....
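
The text comes out as one run because the pieces between tags are joined with no separator. A rough standard-library sketch of that idea, assuming Python 3's `html.parser` (`TextCollector` is a hypothetical name and this is not how BeautifulSoup is implemented):

```python
from html.parser import HTMLParser  # Python 3; the Python 2 module is HTMLParser


class TextCollector(HTMLParser):
    """Gather every piece of text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once for each run of text between two tags.
        self.chunks.append(data)

    def text(self):
        # Joining without a separator makes adjacent elements run together.
        return ''.join(self.chunks)


collector = TextCollector()
collector.feed('<p class="title"><b>The Dormouse\'s story</b></p>'
               '<p class="story">...</p>')
page_text = collector.text()
print(page_text)
```

Joining with `' '` or `'\n'` instead keeps the pieces apart.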


### XML Parsing

```xml
<data>
    <items>
        <item name="item1"></item>
        <item name="item2"></item>
        <item name="item3"></item>
        <item name="item4"></item>
    </items>
</data>
```

In [21]:
from xml.dom import minidom
xmldoc = minidom.parse('sample.xml')
itemlist = xmldoc.getElementsByTagName('item')
print(len(itemlist))
#print(itemlist[0].attributes['name'].value)
for s in itemlist:
    print(s.attributes['name'].value)

4
item1
item2
item3
item4
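
The same listing can be sketched with `xml.etree.ElementTree`, another standard-library module. The example parses the sample document from a string so that it runs without `sample.xml` on disk; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<data>
    <items>
        <item name="item1"></item>
        <item name="item2"></item>
        <item name="item3"></item>
        <item name="item4"></item>
    </items>
</data>
"""

root = ET.fromstring(SAMPLE)

# iter('item') walks the whole tree, much like getElementsByTagName('item').
names = [item.get('name') for item in root.iter('item')]

print(len(names))
for name in names:
    print(name)
```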


__lxml__ is another library for parsing XML, and new ones keep appearing. Two examples:

* __untangle__ is a simple library which takes an XML document and returns a Python object that mirrors the nodes and attributes in its structure.

* __xmltodict__ is another simple library that aims to make working with XML feel like working with JSON.

In [None]:
import untangle
obj = untangle.parse('sample.xml')
# Child elements are reached as attributes; XML attributes use dictionary syntax.
obj.data.items.item[0]['name']

```xml
<mydocument has="an attribute">
  <and>
    <many>elements</many>
    <many>more elements</many>
  </and>
  <plus a="complex">
    element as well
  </plus>
</mydocument>
```

In [27]:

!cat sample2.xml



<mydocument has="an attribute">
  <and>
    <many>elements</many>
    <many>more elements</many>
  </and>
  <plus a="complex">
    element as well
  </plus>
</mydocument>




In [26]:
import xmltodict

with open('sample2.xml') as fd:
    doc = xmltodict.parse(fd.read())
    
# You can then access elements, attributes and values like this:

print doc['mydocument']['@has'] # == u'an attribute'
print doc['mydocument']['and']['many'] # == [u'elements', u'more elements']
print doc['mydocument']['plus']['@a'] # == u'complex'
print doc['mydocument']['plus']['#text'] # == u'element as well'



an attribute
[u'elements', u'more elements']
complex
element as well
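
To make the `@` and `#text` key convention concrete, here is a rough sketch of a converter built on the standard library. It is not xmltodict's actual implementation: `element_to_dict` is a hypothetical helper that covers only the cases in this sample, and it assumes Python 3.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<mydocument has="an attribute">
  <and>
    <many>elements</many>
    <many>more elements</many>
  </and>
  <plus a="complex">
    element as well
  </plus>
</mydocument>
"""


def element_to_dict(elem):
    """Hypothetical helper: map an Element to '@attr' / '#text' style values."""
    result = {}
    for key, value in elem.attrib.items():
        result['@' + key] = value  # attributes get an '@' prefix
    for child in elem:
        value = element_to_dict(child)
        if child.tag in result:
            # A repeated tag turns the entry into a list of values.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    text = (elem.text or '').strip()
    if not result:
        return text or None  # a text-only element collapses to its string
    if text:
        result['#text'] = text  # text next to attributes goes under '#text'
    return result


root = ET.fromstring(SAMPLE)
doc = {root.tag: element_to_dict(root)}

print(doc['mydocument']['@has'])
print(doc['mydocument']['and']['many'])
print(doc['mydocument']['plus']['#text'])
```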


### JSON

In [None]:
import simplejson

```python
    >>> import simplejson as json
    >>> json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])
    '["foo", {"bar": ["baz", null, 1.0, 2]}]'
    >>> print(json.dumps("\"foo\bar"))
    "\"foo\bar"
    >>> print(json.dumps(u'\u1234'))
    "\u1234"
    >>> print(json.dumps('\\'))
    "\\"
    >>> print(json.dumps({"c": 0, "b": 0, "a": 0}, sort_keys=True))
    {"a": 0, "b": 0, "c": 0}

```
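
Decoding is the reverse step. The sketch below round-trips the first value with `dumps` and `loads`, using the standard library `json` module as an assumption in place of `simplejson`; the two share these calls.

```python
import json  # standard library; dumps/loads mirror the simplejson calls above

original = ['foo', {'bar': ('baz', None, 1.0, 2)}]

encoded = json.dumps(original)   # Python object -> JSON text
decoded = json.loads(encoded)    # JSON text -> Python object

print(encoded)
print(decoded)
```

The round trip is not exact: JSON has no tuple type, so the tuple comes back as a list, and `None` travels as `null`.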