# Introduction to Frictionless Data

This material was originally developed for a presentation at [PyWebIL](https://www.meetup.com/PyWeb-IL/events/238288247/) and [HaSadna](http://www.hasadna.org.il) on April 3rd, 2017, in Tel Aviv, Israel.

Presented by [Paul Walsh](http://github.com/pwalsh) and [Adam Kariv](http://github.com/akariv). The slides are [available here](https://hackmd.io/JwUwTAZgHFAmEFoBsICsYEBZQEYEEMAGAZmCwHYcJjVz8IIcBjIA?edit).

## Demo: Basics

A simple first example of loading a Data Package.

This example demonstrates the primary API for reading a Data Package and getting a stream of data out of each Data Resource it contains.
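
To make the idea concrete before using the library, here is a minimal standard-library sketch of what "reading a Data Package and streaming its resources" amounts to. It is not the `datapackage` API: the helper names `iter_resources` and `iter_rows`, the `budget.csv` file, and its contents are all hypothetical, and only local CSV resources with a `path` are handled.

```python
import csv
import json
import tempfile
from pathlib import Path


def iter_rows(csv_path):
    """Lazily yield each CSV row as a dict keyed by header (values stay strings)."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)


def iter_resources(descriptor_path):
    """Yield (resource name, row iterator) for each resource in a descriptor."""
    descriptor_path = Path(descriptor_path)
    descriptor = json.loads(descriptor_path.read_text(encoding="utf-8"))
    for resource in descriptor["resources"]:
        # Assumption: resource paths are relative to the descriptor's directory.
        csv_path = descriptor_path.parent / resource["path"]
        yield resource["name"], iter_rows(csv_path)


# Build a tiny, hypothetical package on disk so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "budget.csv").write_text(
        "CODE,BUDGET\n1.1,4000\n1.2,15000\n", encoding="utf-8"
    )
    (base / "datapackage.json").write_text(
        json.dumps({"name": "demo",
                    "resources": [{"name": "budget", "path": "budget.csv"}]}),
        encoding="utf-8",
    )
    # Consume the row streams while the temporary files still exist.
    loaded = {name: list(rows)
              for name, rows in iter_resources(base / "datapackage.json")}

print(loaded)
```

Every value comes back as a string here because no schema is applied; the library cells below show the same flow with schema-aware resources.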

In [1]:
from pprint import pprint
import datapackage

descriptor = 'data/israel-muni/datapackage.json'

dp = datapackage.DataPackage(descriptor)

In [2]:
# The loaded Descriptor
pprint(dp.descriptor)

{'resources': [{'data <NEW_IN_V1>': ['budget-tree.csv'],
                'name': 'budget-tree',
                'path': 'budget-tree.csv',
                'schema': {'fields': [{'contraints': {'required': True},
                                       'name': 'CODE',
                                       'type': 'string'},
                                      {'name': 'COMPARABLE', 'type': 'boolean'},
                                      {'constraints': {'enum': ['EXPENDITURE',
                                                                'REVENUE']},
                                       'name': 'DIRECTION',
                                       'type': 'string'}]}},
               {'data <NEW_IN_V1>': ['tel-aviv-2013.csv'],
                'name': 'tel-aviv-2013',
                'path': 'tel-aviv-2013.csv',
                'schema': {'fields': [{'constraints': {'minLength': 2,
                                                       'required': True},
                           

In [3]:
# The loaded Data Resource objects
pprint(dp.resources)

for resource in dp.resources:
    pprint(resource.descriptor)

(<datapackage.resource.TabularResource object at 0x1106abe10>,
 <datapackage.resource.TabularResource object at 0x111a6cf28>)
{'data <NEW_IN_V1>': ['budget-tree.csv'],
 'name': 'budget-tree',
 'path': 'budget-tree.csv',
 'schema': {'fields': [{'contraints': {'required': True},
                        'name': 'CODE',
                        'type': 'string'},
                       {'name': 'COMPARABLE', 'type': 'boolean'},
                       {'constraints': {'enum': ['EXPENDITURE', 'REVENUE']},
                        'name': 'DIRECTION',
                        'type': 'string'}]}}
{'data <NEW_IN_V1>': ['tel-aviv-2013.csv'],
 'name': 'tel-aviv-2013',
 'path': 'tel-aviv-2013.csv',
 'schema': {'fields': [{'constraints': {'minLength': 2, 'required': True},
                        'name': 'PARENT',
                        'type': 'string'},
                       {'constraints': {'pattern': '[0-9|]*', 'required': True},
                        'name': 'PARENT SCOPE',
                 

In [4]:
# Each resource provides a stream over the data
budget_tree = dp.resources[0].iter()
pprint(budget_tree)

<generator object TabularResource._iter_from_tabulator at 0x111a84d00>


In [5]:
# When a Data Resource is a Tabular Data Resource (i.e. it has a schema),
# values are cast to their declared types on iteration
tel_aviv_budget = dp.resources[1].iter()

for idx, budget_line in enumerate(tel_aviv_budget):
    if idx > 9:
        break
    pprint(budget_line)

{'ACTUAL': None,
 'BUDGET': Decimal('202000'),
 'CODE': '1.611112.124',
 'NAME': 'החזר הוצאות',
 'NAME_AR': '',
 'NAME_EN': '',
 'NAME_RU': '',
 'PARENT': '12',
 'PARENT SCOPE': '6111|611|61|6'}
{'ACTUAL': None,
 'BUDGET': Decimal('4000'),
 'CODE': '1.611114.126',
 'NAME': 'הבראה',
 'NAME_AR': '',
 'NAME_EN': '',
 'NAME_RU': '',
 'PARENT': '14',
 'PARENT SCOPE': '6111|611|61|6'}
{'ACTUAL': None,
 'BUDGET': Decimal('131000'),
 'CODE': '1.611115.127',
 'NAME': 'השתתפות באחזקת רכב03',
 'NAME_AR': '',
 'NAME_EN': '',
 'NAME_RU': '',
 'PARENT': '15',
 'PARENT SCOPE': '6111|611|61|6'}
{'ACTUAL': None,
 'BUDGET': Decimal('15000'),
 'CODE': '1.611119.130',
 'NAME': 'תשלומים מיוחדים',
 'NAME_AR': '',
 'NAME_EN': '',
 'NAME_RU': '',
 'PARENT': '19',
 'PARENT SCOPE': '6111|611|61|6'}
{'ACTUAL': None,
 'BUDGET': Decimal('484000'),
 'CODE': '1.611112.181',
 'NAME': 'מיסים וביטוח לאומי',
 'NAME_AR': '',
 'NAME_EN': '',
 'NAME_RU': '',
 'PARENT': '12',
 'PARENT SCOPE': '6111|611|61|6'}
{'ACTUAL': Non
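
The output above shows `BUDGET` arriving as a `Decimal` and an empty `ACTUAL` arriving as `None`. The following is a minimal sketch of how casting on iteration could work, using only the standard library. The `CASTERS` table, the helper names, and the rule that an empty non-string cell becomes `None` are assumptions made for illustration, not the library's implementation.

```python
from decimal import Decimal

# Hypothetical, simplified casters; real Table Schema types have more options.
CASTERS = {
    "string": str,
    "number": Decimal,
    "boolean": lambda text: text.strip().lower() in ("true", "1", "yes"),
}


def cast_row(row, fields):
    """Cast one row of strings using a list of {'name', 'type'} descriptors."""
    cast = {}
    for field in fields:
        raw = row[field["name"]]
        if raw == "" and field["type"] != "string":
            # Assumption: an empty non-string cell is treated as missing.
            cast[field["name"]] = None
        else:
            cast[field["name"]] = CASTERS[field["type"]](raw)
    return cast


def iter_cast(rows, fields):
    """Cast each row only when it is consumed, so nothing is converted up front."""
    for row in rows:
        yield cast_row(row, fields)


fields = [
    {"name": "CODE", "type": "string"},
    {"name": "BUDGET", "type": "number"},
    {"name": "ACTUAL", "type": "number"},
]
raw_rows = [{"CODE": "1.611114.126", "BUDGET": "4000", "ACTUAL": ""}]
cast_rows = list(iter_cast(raw_rows, fields))
print(cast_rows)
```

Because `iter_cast` is a generator, a malformed value would only raise when its row is reached, which is one reason to cast lazily on large files.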

## Demo: Core libraries

Some demonstration code showing the main functionality of the core libraries.

In [6]:
import datapackage
import jsontableschema as tableschema  # the package was recently renamed to `tableschema` for v1

# TABULAR DATA PACKAGE

# read existing TDP
# iterate over its resources
# show data pre cast / post cast

# read and show malformed descriptor
# read and show malformed data (casting error)

# create new DP

# TABLE SCHEMA

# create schema and read it in
# inspect schema with headers, unique, etc.
# cast one or two example rows against this schema (show errors too)
# Infer: have a data file without schema, and infer a schema.
# use inferred schema to iterate over cast data stream
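
The inference step in the outline above can be illustrated without the library. This is a rough sketch under stated assumptions: it knows only three types (`integer`, `number`, `string`), ignores empty cells, and picks the narrowest type that fits every remaining value. The function names are hypothetical and the real inference logic in the Table Schema library is richer than this.

```python
from decimal import Decimal, InvalidOperation


def is_integer(text):
    """True if the text parses as a Python int."""
    try:
        int(text)
    except ValueError:
        return False
    return True


def is_number(text):
    """True if the text parses as a finite decimal number."""
    try:
        return Decimal(text).is_finite()
    except InvalidOperation:
        return False


def guess_type(values):
    """Return the narrowest of 'integer', 'number', 'string' that fits all values."""
    non_empty = [value for value in values if value != ""]
    if not non_empty:
        return "string"  # nothing to go on, so fall back to the widest type
    if all(is_integer(value) for value in non_empty):
        return "integer"
    if all(is_number(value) for value in non_empty):
        return "number"
    return "string"


def infer_schema(headers, rows):
    """Build a schema-like dict with one field per header."""
    columns = list(zip(*rows)) if rows else [() for _ in headers]
    return {
        "fields": [
            {"name": name, "type": guess_type(column)}
            for name, column in zip(headers, columns)
        ]
    }


# Hypothetical sample data, loosely modelled on the budget rows shown earlier.
headers = ["CODE", "BUDGET", "RATE"]
rows = [["1.611112.124", "202000", "0.5"], ["1.611114.126", "4000", ""]]
schema = infer_schema(headers, rows)
print(schema)
```

An inferred schema of this shape can then drive casting, which is the last step named in the outline.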

## Demo: goodtables

Some demonstration code showing features of goodtables.

In [7]:
import goodtables

inspector = goodtables.Inspector()
print(inspector.inspect('data/invalid.csv'))

# show what we get when data is valid
# get decent sized, more complex data source and inspect
# create a simple custom check (range, median, etc.) and run it against a data file before and after

{'error-count': 7,
 'errors': [],
 'table-count': 1,
 'tables': [{'error-count': 7,
             'errors': [{'code': 'blank-header',
                         'column-number': 3,
                         'message': 'Header in column 3 is blank',
                         'row': None,
                         'row-number': None},
                        {'code': 'duplicate-header',
                         'column-number': 4,
                         'message': 'Header in column 4 is duplicated to '
                                    'header in column(s) 2',
                         'row': None,
                         'row-number': None},
                        {'code': 'missing-value',
                         'column-number': 3,
                         'message': 'Row 2 has a missing value in column 3',
                         'row': ['1', 'english'],
                         'row-number': 2},
                        {'code': 'missing-value',
                         'column-numbe
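
The report above lists three kinds of structural error: `blank-header`, `duplicate-header` and `missing-value`. As a minimal sketch of what such checks involve, the standard-library function below reproduces those three codes on an in-memory table. The function name `check_table`, the sample data, and the trimmed-down error dicts are assumptions for illustration; this is not how goodtables is implemented.

```python
def check_table(headers, rows):
    """Return error dicts for blank headers, duplicate headers and short rows."""
    errors = []
    seen = set()
    for column_number, header in enumerate(headers, start=1):
        if header.strip() == "":
            errors.append({"code": "blank-header",
                           "column-number": column_number,
                           "row-number": None})
        elif header in seen:
            errors.append({"code": "duplicate-header",
                           "column-number": column_number,
                           "row-number": None})
        else:
            seen.add(header)
    # Data rows are numbered from 2 because the header occupies row 1.
    for row_number, row in enumerate(rows, start=2):
        # Every header position beyond the end of a short row is a missing value.
        for column_number in range(len(row) + 1, len(headers) + 1):
            errors.append({"code": "missing-value",
                           "column-number": column_number,
                           "row-number": row_number})
    return errors


# Hypothetical table: a blank third header, a repeated fourth header, a short row.
headers = ["id", "name", "", "name"]
rows = [["1", "english"], ["2", "b", "c", "d"]]
report = check_table(headers, rows)
for error in report:
    print(error)
```

A custom check such as a range or median test would follow the same pattern: walk the rows, compare against a rule, and append an error dict when the rule fails.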

## Demo: Data Package Pipelines

Some demonstration code using Data Package Pipelines.