# Python for Data Extraction, Transforms, and Loading

Data wrangling is one of the most critical skills of a data scientist. Data comes in a wide variety of formats, and its veracity varies greatly from source to source. Python has become one of the most important tools in the data scientist's toolbox for this work, mainly because of its flexibility, extensibility, and ease of use.

## Extraction

Data can come in a variety of _styles_, and a key to choosing the right tool is recognizing the style of the data at hand.
These styles vary along several dimensions:


 * Structured vs. Unstructured
 * Flat vs. Hierarchical
 * Strong vs. Weak Semantics
 


For example, _comma separated values_ (CSV) files are **weak-semantic structured flat** files.
The data are arranged in rows and columns, conceptually similar to a spreadsheet or single database table.
These files are considered **flat** because there is no parent-child relationship implicit in the structure of the file. 
They are **structured** precisely because you know the commas are field separators and the rows are records.
These files have **weak semantics**, as the column headings (if available) provide only names, not format and data type constraints.
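
As a minimal sketch of what weak semantics means in practice, the snippet below reads a small CSV with the standard library `csv` module. The sample data and column names (`id`, `name`, `weight_kg`) are hypothetical. Every field arrives as a string, so the reader of the file has to impose the types.

```python
import csv
import io

# Hypothetical sample data; a real script would use open("some_file.csv", newline="").
raw = io.StringIO("id,name,weight_kg\n1,widget,2.5\n2,gadget,10\n")

# The header row supplies field names only, not data types.
reader = csv.DictReader(raw)
rows = list(reader)

# Every value is a string, because the file carries no type information.
print(rows[0])

# Types are an assumption made by the consumer of the file.
typed = [
    {"id": int(r["id"]), "name": r["name"], "weight_kg": float(r["weight_kg"])}
    for r in rows
]
print(typed[1]["weight_kg"])
```

If a conversion such as `int(...)` fails here, that is usually the first sign that the file does not follow the format its column names suggest.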

 

The _eXtensible Markup Language_ (XML) is an example of a class of **_structured, hierarchical_**, and possibly **_strongly semantic_** file formats.
Other examples of this class include HTML, KML, and many others.
These files are composed of _elements_ that have two sets of children:

 * Attributes: key/value pairs
 * Child elements

XML supports document type definitions (DTD) and XML Schema definitions (XSD), as well as other schema languages. These companion documents give strong semantic meaning to the elements and attributes within a document, including which children are permitted for each element type.
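
The two kinds of children can be seen in a short parsing sketch. It uses the standard library `xml.etree.ElementTree` so that it runs without extra installs. The document, its tag names, and its attribute names are hypothetical, and no DTD or XSD validation is performed here.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: each <book> has attributes (id, lang)
# and child elements (<title>, <author>).
doc = """<library>
  <book id="b1" lang="en">
    <title>Title A</title>
    <author>Author A</author>
  </book>
  <book id="b2" lang="fr">
    <title>Title B</title>
    <author>Author B</author>
  </book>
</library>"""

root = ET.fromstring(doc)

books = []
for book in root.findall("book"):
    books.append({
        "id": book.attrib["id"],             # attribute: key/value pair
        "lang": book.get("lang"),            # same, but returns None if missing
        "title": book.findtext("title"),     # text of a child element
        "author": book.findtext("author"),
    })

for b in books:
    print(b)
```

The result is a list of flat records, which is often the first step in moving hierarchical data toward rows and columns.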

 

On the other end of the spectrum, you may encounter **_unstructured, flat_** log files.
Log files come from a variety of sources, including transaction systems, monitoring systems, and sensors.
These are the most challenging files to extract data from, but they can also be very rich sources of information.
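
One common way to pull fields out of such files is a regular expression with named groups. The sketch below assumes a hypothetical line layout (date, time, level, source, free-text message). Real logs vary widely, so the pattern would need to be adapted, and lines that do not match are set aside rather than discarded.

```python
import re

# Hypothetical log lines; the layout is an assumption for illustration.
log_lines = [
    "2024-01-15 10:32:07 ERROR payment-service Timeout after 30s contacting gateway",
    "2024-01-15 10:32:09 INFO  payment-service Retry succeeded",
    "-- log rotated --",
]

LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<source>\S+) "
    r"(?P<message>.*)$"
)

records = []   # lines that matched, as dictionaries
unparsed = []  # lines that did not match, kept for inspection

for line in log_lines:
    match = LINE_RE.match(line)
    if match:
        records.append(match.groupdict())
    else:
        unparsed.append(line)

print(len(records), "parsed;", len(unparsed), "unparsed")
```

Keeping the unparsed lines makes it easier to see which formats the pattern has not yet accounted for.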

 



## Libraries

 * lxml (XML and HTML parser, writer, and serializer): http://lxml.de/index.html
 * BeautifulSoup (wrapper around lxml and other parsers): http://www.crummy.com/software/BeautifulSoup/

Please read about HTML scraping: http://docs.python-guide.org/en/latest/scenarios/scrape/

 


## Transforming Data

Often, once you have developed a suitable parser for a set of files, you must then transform the parsed data into a usable format for subsequent analysis. This can include merging data from multiple files, slicing out rows or columns, or applying numerical transformations.
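
The three kinds of transformation can be sketched with plain Python data structures. The records and field names below are hypothetical stand-ins for the output of two parsed files. A library such as pandas would be a reasonable alternative for larger data sets.

```python
# Hypothetical records, as they might come out of two parsed files.
orders = [
    {"order_id": "1", "customer_id": "c1", "amount_cents": "1250"},
    {"order_id": "2", "customer_id": "c2", "amount_cents": "300"},
    {"order_id": "3", "customer_id": "c1", "amount_cents": "4999"},
]
customers = [
    {"customer_id": "c1", "name": "Ada", "email": "ada@example.com"},
    {"customer_id": "c2", "name": "Grace", "email": "grace@example.com"},
]

# Merge: index one data set by its key, then look each order's customer up.
customers_by_id = {c["customer_id"]: c for c in customers}

merged = []
for order in orders:
    customer = customers_by_id[order["customer_id"]]
    merged.append({
        "order_id": int(order["order_id"]),
        "customer": customer["name"],                 # column slice: email is dropped
        "amount": int(order["amount_cents"]) / 100,   # numerical transform: cents to units
    })

# Row slice: keep only orders of at least 10 currency units.
large_orders = [row for row in merged if row["amount"] >= 10]

print(large_orders)
```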

## Loading Data

Once you have transformed data, you may want to automatically load it into a data repository.  
These repositories come in a variety of flavors; for our examples here we will use the ubiquitous SQL database. 
Specifically, we will load the data into an SQLite database file.

A common scenario is to use Python to extract data from hierarchical, structured files and load it either into a series of CSV files and then into a database, or directly into a database.
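
A direct load into SQLite can be sketched with the standard library `sqlite3` module. The table name, columns, and rows are hypothetical. The sketch connects to `":memory:"` so that it leaves nothing behind, while passing a file path instead would create or open an SQLite database file.

```python
import sqlite3

# Hypothetical transformed records: (order_id, customer, amount).
records = [(1, "Ada", 12.5), (2, "Grace", 3.0), (3, "Ada", 49.99)]

# ":memory:" keeps the sketch self-contained; use a file path for a database file.
conn = sqlite3.connect(":memory:")

conn.execute(
    """
    CREATE TABLE IF NOT EXISTS orders (
        order_id INTEGER PRIMARY KEY,
        customer TEXT NOT NULL,
        amount   REAL NOT NULL
    )
    """
)

# The connection as a context manager commits on success and rolls back on error.
# Placeholders (?) let the driver handle quoting of the values.
with conn:
    conn.executemany(
        "INSERT INTO orders (order_id, customer, amount) VALUES (?, ?, ?)",
        records,
    )

# Once loaded, the data can be queried with ordinary SQL.
totals = conn.execute(
    "SELECT customer, COUNT(*), SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

print(totals)
```

Unlike a CSV file, the table definition records a type and a constraint for each column, so the database can reject rows that are missing required values.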

