# Stack Overflow solution

* Question: [Extract a content from `<a href=“data:text/csv;…content>`](https://stackoverflow.com/q/58616135/1913726)
* Answer: [My Answer](https://stackoverflow.com/a/58616865/1913726)

My first thought was that this question on Stack Overflow was dumb, because anybody can split strings to get the wanted data. I consider that a hack, though, if the data follows a scheme and could be parsed by a proper parser.

The asker edited the question to clarify that he wanted ~~pure python~~ a proper parser for URL data. I'm happy he did because I learned something new. I had been creating the data URLs to embed into a website by joining strings.

Now I know better.

It turns out there is a parser, and data URLs follow a well-defined scheme. Usually when I think "this is ~~dumb~~ an anti-pattern" I then do some research and learn something along the way. Some things do remain dumb, though, and knowing which ones are still ~~dumb~~ anti-patterns is a skill, too.

For example, [trying to parse HTML with regular expressions](https://stackoverflow.com/a/1732454/1913726) is ~~dumb~~ an anti-pattern.

## Resources

* [data URLs on MDN](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URIs)
* [python-datauri](https://pypi.org/project/python-datauri/)
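
For comparison, the standard library can also decode a data URL without any string splitting, because `urllib.request.urlopen` accepts the `data:` scheme. The following is only a minimal sketch and not part of the original answer; it reuses the URL from the example cell and assumes a text payload whose charset is declared in the URL.

```python
from urllib.request import urlopen

# Same data URL as in the example cell; urlopen percent-decodes the payload.
url = "data:text/csv;charset=UTF-8,%22csvcontentfollows"

with urlopen(url) as response:
    # The media type and its parameters are exposed as response headers.
    mimetype = response.headers.get_content_type()
    charset = response.headers.get_content_charset()
    payload = response.read()  # raw bytes

# Fall back to ASCII when the URL does not declare a charset (an assumption).
text = payload.decode(charset or "ascii")
print(mimetype, charset, text)
```

This avoids a third-party dependency, but it returns a file-like response rather than an object with named fields, so `python-datauri` remains the more convenient choice for the cells that follow.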

In [1]:
html_string = """
<a href="data:text/csv;charset=UTF-8,%22csvcontentfollows">
"""

In [2]:
import lxml.etree
from datauri import DataURI

tree = lxml.etree.fromstring(html_string, lxml.etree.HTMLParser())

uris = (
    DataURI(item.attrib["href"])
    for item in tree.iterdescendants()
    if item.attrib.get("href")
)
attrs = ("mimetype", "charset", "is_base64", "data")
print([{attr: getattr(uri, attr) for attr in attrs} for uri in uris])

[{'mimetype': 'text/csv', 'charset': 'UTF-8', 'is_base64': False, 'data': '"csvcontentfollows'}]


In [3]:
from html.parser import HTMLParser
from datauri import DataURI

uri_attrs = ("mimetype", "charset", "is_base64", "data")

class MyHTMLParser(HTMLParser):
    """Collect the parsed fields of every data URI found in an <a href>."""

    def __init__(self):
        super().__init__()
        self.data = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Parse only the href value; other attributes are not data URIs.
            if name == "href" and value:
                uri = DataURI(value)
                self.data.append({attr: getattr(uri, attr) for attr in uri_attrs})

parser = MyHTMLParser()
parser.feed(html_string)
print(parser.data)

[{'mimetype': 'text/csv', 'charset': 'UTF-8', 'is_base64': False, 'data': '"csvcontentfollows'}]
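
Real pages usually mix ordinary links with data URLs, and the cells above hand every `href` to the data URI parser. One possible way to pre-filter them is sketched below with the standard library only. `DataHrefCollector` and the `sample` markup are hypothetical names made up for this illustration; the scheme check relies on `urllib.parse.urlsplit`.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class DataHrefCollector(HTMLParser):
    """Collect href values that use the data: scheme (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value.
            if name == "href" and value and urlsplit(value).scheme == "data":
                self.hrefs.append(value)


# Made-up sample markup: one ordinary link and one embedded data URL.
sample = """
<a href="https://example.com/report.csv">remote file</a>
<a href="data:text/csv;charset=UTF-8,%22csvcontentfollows">embedded file</a>
"""

collector = DataHrefCollector()
collector.feed(sample)
print(collector.hrefs)
```

Each collected string can then be passed to a data URI parser as in the cells above, while the ordinary link is skipped.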
