# Introduction to Frictionless Data

This material was originally developed for a presentation at [PyWebIL](https://www.meetup.com/PyWeb-IL/events/238288247/) and [HaSadna](http://www.hasadna.org.il) on April 3rd, 2017, in Tel Aviv, Israel.

It was presented by [Paul Walsh](http://github.com/pwalsh) and [Adam Kariv](http://github.com/akariv). The slides are [available here](https://hackmd.io/JwUwTAZgHFAmEFoBsICsYEBZQEYEEMAGAZmCwHYcJjVz8IIcBjIA?edit).

## Demo: Basics

A simple, initial example of loading a data package.

This example demonstrates the primary API for reading a Data Package, and getting a stream of data out of each Data Resource it contains.

In [1]:
from pprint import pprint
import datapackage

descriptor = 'data/israel-muni/datapackage.json'

dp = datapackage.DataPackage(descriptor)

In [2]:
# The loaded Descriptor
pprint(dp.descriptor)

{'resources': [{'data <NEW_IN_V1: replaces path>': ['budget-tree.csv'],
                'name': 'budget-tree',
                'path': 'budget-tree.csv',
                'schema': {'fields': [{'contraints': {'required': True},
                                       'name': 'CODE',
                                       'type': 'string'},
                                      {'name': 'COMPARABLE', 'type': 'boolean'},
                                      {'constraints': {'enum': ['EXPENDITURE',
                                                                'REVENUE']},
                                       'name': 'DIRECTION',
                                       'type': 'string'}]}},
               {'data <NEW_IN_V1: replaces path>': ['tel-aviv-2013.csv'],
                'name': 'tel-aviv-2013',
                'path': 'tel-aviv-2013.csv',
                'schema': {'fields': [{'constraints': {'minLength': 2,
                                                       'required': True

In [3]:
# The loaded Data Resource objects
for resource in dp.resources:
    pprint(resource.descriptor)

{'data <NEW_IN_V1: replaces path>': ['budget-tree.csv'],
 'name': 'budget-tree',
 'path': 'budget-tree.csv',
 'schema': {'fields': [{'contraints': {'required': True},
                        'name': 'CODE',
                        'type': 'string'},
                       {'name': 'COMPARABLE', 'type': 'boolean'},
                       {'constraints': {'enum': ['EXPENDITURE', 'REVENUE']},
                        'name': 'DIRECTION',
                        'type': 'string'}]}}
{'data <NEW_IN_V1: replaces path>': ['tel-aviv-2013.csv'],
 'name': 'tel-aviv-2013',
 'path': 'tel-aviv-2013.csv',
 'schema': {'fields': [{'constraints': {'minLength': 2, 'required': True},
                        'name': 'PARENT',
                        'type': 'string'},
                       {'constraints': {'pattern': '[0-9|]*', 'required': True},
                        'name': 'PARENT SCOPE',
                        'type': 'string'},
                       {'constraints': {'maxLength': 12,
             

In [4]:
# Each resource provides a stream over the data
budget_tree = dp.resources[0].iter()
pprint(budget_tree)

<generator object TabularResource._iter_from_tabulator at 0x1029f5ca8>


In [5]:
# When a Data Resource is a Tabular Data Resource (it has a schema),
# values are cast to their schema types on iteration
tel_aviv_budget = dp.resources[1].iter()

for idx, budget_line in enumerate(tel_aviv_budget):
    if idx > 4:
        break
    pprint(budget_line)

{'ACTUAL': None,
 'BUDGET': Decimal('202000'),
 'CODE': '1.611112.124',
 'NAME': 'החזר הוצאות',
 'NAME_AR': '',
 'NAME_EN': '',
 'NAME_RU': '',
 'PARENT': '12',
 'PARENT SCOPE': '6111|611|61|6'}
{'ACTUAL': None,
 'BUDGET': Decimal('4000'),
 'CODE': '1.611114.126',
 'NAME': 'הבראה',
 'NAME_AR': '',
 'NAME_EN': '',
 'NAME_RU': '',
 'PARENT': '14',
 'PARENT SCOPE': '6111|611|61|6'}
{'ACTUAL': None,
 'BUDGET': Decimal('131000'),
 'CODE': '1.611115.127',
 'NAME': 'השתתפות באחזקת רכב03',
 'NAME_AR': '',
 'NAME_EN': '',
 'NAME_RU': '',
 'PARENT': '15',
 'PARENT SCOPE': '6111|611|61|6'}
{'ACTUAL': None,
 'BUDGET': Decimal('15000'),
 'CODE': '1.611119.130',
 'NAME': 'תשלומים מיוחדים',
 'NAME_AR': '',
 'NAME_EN': '',
 'NAME_RU': '',
 'PARENT': '19',
 'PARENT SCOPE': '6111|611|61|6'}
{'ACTUAL': None,
 'BUDGET': Decimal('484000'),
 'CODE': '1.611112.181',
 'NAME': 'מיסים וביטוח לאומי',
 'NAME_AR': '',
 'NAME_EN': '',
 'NAME_RU': '',
 'PARENT': '12',
 'PARENT SCOPE': '6111|611|61|6'}
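
To make "values are cast on iteration" concrete, here is a minimal standard-library sketch. It is not how `datapackage` implements `iter()`. The `iter_cast_rows` helper, the `CASTERS` mapping, and the three-column CSV excerpt (modeled on the output above) are all assumptions made for illustration.

```python
import csv
import io
from decimal import Decimal

# Hypothetical three-column excerpt modeled on the output above.
SAMPLE_CSV = "CODE,BUDGET,ACTUAL\n1.611112.124,202000,\n1.611114.126,4000,\n"

# Hypothetical mapping from column name to a casting function.
CASTERS = {"CODE": str, "BUDGET": Decimal, "ACTUAL": Decimal}


def iter_cast_rows(stream, casters):
    """Lazily yield one dict per CSV row, casting each value by column."""
    for row in csv.DictReader(stream):
        # Empty strings are treated as missing values and become None.
        yield {
            name: (casters[name](value) if value != "" else None)
            for name, value in row.items()
        }


rows = iter_cast_rows(io.StringIO(SAMPLE_CSV), CASTERS)
first = next(rows)  # nothing is read or cast until a row is requested
print(first)
```

Because the function is a generator, a large file is never loaded into memory at once, which is the same reason `iter()` above returns a generator object rather than a list.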


## Demo: Core libraries

Demonstration code showing the main functionality of the core libraries.

In [6]:
import io
import csv
import datapackage
import jsontableschema as tableschema  # the library was recently renamed "tableschema" for v1

In [7]:
source = 'data/mailing-list/data.csv'
# When we just have a source of data, we can still get a schema
with io.open(source) as stream:
    reader = csv.reader(stream)
    headers = next(reader)
    values = list(reader)

# take a sample of values to feed to the inference engine
sample = values[:4]

schema = tableschema.infer(headers, sample)

# Inference currently only infers "default" types
# It does not try to infer format yet (coming)
pprint(schema)

{'fields': [{'description': '',
             'format': 'default',
             'name': 'first_name',
             'title': '',
             'type': 'string'},
            {'description': '',
             'format': 'default',
             'name': 'last_name',
             'title': '',
             'type': 'string'},
            {'description': '',
             'format': 'default',
             'name': 'age',
             'title': '',
             'type': 'integer'},
            {'description': '',
             'format': 'default',
             'name': 'rating',
             'title': '',
             'type': 'number'},
            {'description': '',
             'format': 'default',
             'name': 'contactable',
             'title': '',
             'type': 'boolean'},
            {'description': '',
             'format': 'default',
             'name': 'created',
             'title': '',
             'type': 'date'}]}
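
The idea behind inference can be sketched with the standard library alone. This is a simplified illustration, not the algorithm `tableschema.infer` uses: the `infer_schema` helper, the order of the type checks, and the raw sample strings (reconstructed from the cast values shown later in this demo) are assumptions.

```python
import re
from datetime import datetime


def _is_date(value):
    """True if the value parses as an ISO 8601 calendar date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


# Candidate types, tried from most to least specific (an assumed ordering).
CHECKS = [
    ("integer", lambda v: re.fullmatch(r"[+-]?\d+", v) is not None),
    ("number", lambda v: re.fullmatch(r"[+-]?\d+(\.\d+)?", v) is not None),
    ("boolean", lambda v: v.lower() in {"true", "false"}),
    ("date", _is_date),
]


def infer_schema(headers, rows):
    """Pick, per column, the first type that every sample value satisfies."""
    fields = []
    for index, name in enumerate(headers):
        column = [row[index] for row in rows]
        field_type = next(
            (t for t, check in CHECKS if all(check(v) for v in column)),
            "string",  # fallback when no stricter type fits
        )
        fields.append({"name": name, "type": field_type})
    return {"fields": fields}


headers = ["first_name", "last_name", "age", "rating", "contactable", "created"]
# Hypothetical raw strings; the real file's text representation is not shown.
sample_rows = [
    ["Jane", "Roberts", "42", "4.8", "true", "2016-01-06"],
    ["Joe", "Walsh", "64", "3.0", "false", "2010-02-04"],
]
inferred = infer_schema(headers, sample_rows)
print([field["type"] for field in inferred["fields"]])
```

A small sample keeps inference fast, at the cost of missing values further down the file that would contradict the guessed type.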


In [8]:
# we can validate any schema
pprint(tableschema.validate(schema))

True


In [9]:
# and catch if a schema is not valid
try:
    tableschema.validate({"fields": {}})
except tableschema.exceptions.SchemaValidationError as e:
    pprint(e.message)

"{} is not of type 'array'"


In [10]:
# We get some helper methods to work with schemas
model = tableschema.Schema(schema)

pprint(model.headers)
pprint(model.has_field('occupation'))
pprint(model.cast_row(['Amos', 'Levy', '13', '2.0', 'T', '2011-02-05']))

['first_name', 'last_name', 'age', 'rating', 'contactable', 'created']
False
['Amos', 'Levy', 13, Decimal('2.0'), True, datetime.date(2011, 2, 5)]
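
Casting a row amounts to pairing each value with its field and applying a per-type conversion. The sketch below shows that idea using only the standard library; it is not the library's `cast_row` implementation. The `CASTERS` table and the accepted boolean spellings are assumptions.

```python
from datetime import date, datetime
from decimal import Decimal

# Assumed boolean spellings; the library's accepted values may differ.
TRUE_VALUES = {"true", "t", "yes", "1"}
FALSE_VALUES = {"false", "f", "no", "0"}


def cast_boolean(value):
    lowered = value.strip().lower()
    if lowered in TRUE_VALUES:
        return True
    if lowered in FALSE_VALUES:
        return False
    raise ValueError(f"not a boolean: {value!r}")


def cast_date(value):
    return datetime.strptime(value, "%Y-%m-%d").date()


# Hypothetical mapping from Table Schema type names to casting functions.
CASTERS = {
    "string": str,
    "integer": int,
    "number": Decimal,
    "boolean": cast_boolean,
    "date": cast_date,
}


def cast_row(schema, row):
    """Cast each value using the type of the field in the same position."""
    fields = schema["fields"]
    if len(fields) != len(row):
        raise ValueError("row length does not match the number of fields")
    return [CASTERS[field["type"]](value) for field, value in zip(fields, row)]


example_schema = {"fields": [
    {"name": "first_name", "type": "string"},
    {"name": "last_name", "type": "string"},
    {"name": "age", "type": "integer"},
    {"name": "rating", "type": "number"},
    {"name": "contactable", "type": "boolean"},
    {"name": "created", "type": "date"},
]}
cast = cast_row(example_schema, ["Amos", "Levy", "13", "2.0", "T", "2011-02-05"])
print(cast)
```

A value that cannot be converted raises an exception instead of passing through silently, so bad data surfaces at the row where it occurs.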


In [11]:
# We can iterate over a stream and cast values
table = tableschema.Table(source, schema)

# unkeyed
pprint(next(table.iter()))

# keyed
pprint(next(table.iter(keyed=True)))

['Jane', 'Roberts', 42, Decimal('4.8'), True, datetime.date(2016, 1, 6)]
{'age': 42,
 'contactable': True,
 'created': datetime.date(2016, 1, 6),
 'first_name': 'Jane',
 'last_name': 'Roberts',
 'rating': Decimal('4.8')}


In [12]:
# We saw the basics of handling Data Packages in the first demo.
# Now let's use our inferred schema and source data to build a new Tabular Data Package

tdp = datapackage.DataPackage(schema='tabular')
pprint(tdp.descriptor)

{}


In [13]:
# We've just got an empty descriptor, so it is not actually valid
try:
    tdp.validate()
except datapackage.exceptions.ValidationError as e:
    pprint(e.message)

"'name' is a required property"
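
The kind of check that produces this message can be approximated by hand. The sketch below is a deliberately simplified stand-in, not the library's validator, which appears to validate against a full JSON Schema profile. The `find_descriptor_errors` helper and the specific rules it applies are assumptions.

```python
def find_descriptor_errors(descriptor):
    """Return a list of human-readable problems found in a descriptor dict."""
    errors = []
    # Assumed minimum: a package needs a name and a list of resources.
    for key in ("name", "resources"):
        if key not in descriptor:
            errors.append(f"'{key}' is a required property")

    resources = descriptor.get("resources", [])
    if not isinstance(resources, list):
        errors.append("'resources' is not of type 'array'")
        return errors

    for index, resource in enumerate(resources):
        # Assumed rule: each resource must point at its data somehow.
        if "path" not in resource and "data" not in resource:
            errors.append(f"resource {index} has neither 'path' nor 'data'")
    return errors


print(find_descriptor_errors({}))
print(find_descriptor_errors({
    "name": "my-mailing-lists",
    "resources": [{"name": "mailer", "path": "data/mailing-list/data.csv"}],
}))
```

Collecting every problem in a list, rather than raising on the first one, lets a caller report all issues in a single pass.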


In [14]:
# Add the minimum for a Tabular Data Resource
tdp.descriptor.update({
    'name': 'my-mailing-lists',
    'resources': [
        {
            'name': 'mailer',
            'path': source,
            'schema': schema
        }
    ]
})

tdp.validate()
pprint(tdp.descriptor)

{'name': 'my-mailing-lists',
 'resources': [{'name': 'mailer',
                'path': 'data/mailing-list/data.csv',
                'schema': {'fields': [{'description': '',
                                       'format': 'default',
                                       'name': 'first_name',
                                       'title': '',
                                       'type': 'string'},
                                      {'description': '',
                                       'format': 'default',
                                       'name': 'last_name',
                                       'title': '',
                                       'type': 'string'},
                                      {'description': '',
                                       'format': 'default',
                                       'name': 'age',
                                       'title': '',
                                       'type': 'integer'},
                       

In [15]:
for line in tdp.resources[0].iter():
    pprint(line)

{'age': 42,
 'contactable': True,
 'created': datetime.date(2016, 1, 6),
 'first_name': 'Jane',
 'last_name': 'Roberts',
 'rating': Decimal('4.8')}
{'age': 64,
 'contactable': False,
 'created': datetime.date(2010, 2, 4),
 'first_name': 'Joe',
 'last_name': 'Walsh',
 'rating': Decimal('3.0')}
{'age': 19,
 'contactable': True,
 'created': datetime.date(2011, 8, 19),
 'first_name': 'Ruby',
 'last_name': 'Smith',
 'rating': Decimal('5.0')}
{'age': 28,
 'contactable': True,
 'created': datetime.date(2013, 11, 15),
 'first_name': 'Peter',
 'last_name': 'Harrison',
 'rating': Decimal('2.3')}
{'age': 31,
 'contactable': True,
 'created': datetime.date(2008, 7, 28),
 'first_name': 'Mary',
 'last_name': 'Campbell',
 'rating': Decimal('1.9')}
{'age': 57,
 'contactable': True,
 'created': datetime.date(2014, 4, 9),
 'first_name': 'Emanuel',
 'last_name': 'Webb',
 'rating': Decimal('3.9')}
{'age': 23,
 'contactable': True,
 'created': datetime.date(2016, 2, 21),
 'first_name': 'Natasha',
 'last_na

## Demo: goodtables

Some demonstration code showing features of goodtables.

In [7]:
import goodtables

inspector = goodtables.Inspector()
print(inspector.inspect('data/invalid.csv'))

# show what we get when data is valid
# get decent sized, more complex data source and inspect
# create a simple custom check (range, median, etc) and run it against a data file before and after

{'error-count': 7,
 'errors': [],
 'table-count': 1,
 'tables': [{'error-count': 7,
             'errors': [{'code': 'blank-header',
                         'column-number': 3,
                         'message': 'Header in column 3 is blank',
                         'row': None,
                         'row-number': None},
                        {'code': 'duplicate-header',
                         'column-number': 4,
                         'message': 'Header in column 4 is duplicated to '
                                    'header in column(s) 2',
                         'row': None,
                         'row-number': None},
                        {'code': 'missing-value',
                         'column-number': 3,
                         'message': 'Row 2 has a missing value in column 3',
                         'row': ['1', 'english'],
                         'row-number': 2},
                        {'code': 'missing-value',
                         'column-numbe
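
The two header errors in the report above are simple structural checks. As a rough illustration of what such checks involve, here is a standard-library sketch. It is not goodtables' implementation: the `check_headers` helper, its message wording, and the example header row are assumptions.

```python
def check_headers(headers):
    """Return goodtables-style error dicts for blank and duplicated headers."""
    errors = []
    first_seen = {}  # header text -> column number where it first appeared
    for number, header in enumerate(headers, start=1):
        if header is None or header.strip() == "":
            errors.append({
                "code": "blank-header",
                "column-number": number,
                "message": f"Header in column {number} is blank",
            })
        elif header in first_seen:
            errors.append({
                "code": "duplicate-header",
                "column-number": number,
                "message": f"Header in column {number} duplicates "
                           f"column {first_seen[header]}",
            })
        else:
            first_seen[header] = number
    return errors


# Hypothetical header row; the real contents of data/invalid.csv are not shown here.
for error in check_headers(["id", "name", "", "name"]):
    print(error["code"], error["column-number"])
```

Column numbers are 1-based, matching the report format, so they line up with what a user sees in a spreadsheet.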

## Demo: Data Package Pipelines

Some demonstration code using Data Package Pipelines.