# Introduction to Frictionless Data

Originally developed for a presentation at [PyWebIL](https://www.meetup.com/PyWeb-IL/events/238288247/) and [HaSadna](http://www.hasadna.org.il) on April 3rd, 2017 in Tel Aviv, Israel. 

Presented by [Paul Walsh](http://github.com/pwalsh) and [Adam Kariv](http://github.com/akariv). The slides are [available here](https://hackmd.io/JwUwTAZgHFAmEFoBsICsYEBZQEYEEMAGAZmCwHYcJjVz8IIcBjIA?edit).

## Demo: Basics

A simple first example of loading a Data Package.

This example demonstrates the primary API for reading a Data Package and getting a stream of data out of each Data Resource it contains.

In [1]:
from pprint import pprint
import datapackage

descriptor = 'data/israel-muni/datapackage.json'

dp = datapackage.DataPackage(descriptor)

In [2]:
# The loaded Descriptor
dp.descriptor

{'name': 'israel-municipal-budget-data',
 'resources': [{'data <NEW_IN_V1: replaces path>': ['budget-tree.csv'],
   'name': 'budget-tree',
   'path': 'budget-tree.csv',
   'schema': {'fields': [{'contraints': {'required': True},
      'name': 'CODE',
      'type': 'string'},
     {'name': 'PARENT', 'type': 'string'},
     {'name': 'PARENT SCOPE', 'type': 'string'},
     {'constraints': {'enum': ['EXPENDITURE', 'REVENUE']},
      'name': 'DIRECTION',
      'type': 'string'},
     {'name': 'INVERSE', 'type': 'string'},
     {'name': 'INVERSE SCOPE', 'type': 'string'},
     {'name': 'COMPARABLE', 'type': 'boolean'},
     {'name': 'NAME', 'type': 'string'},
     {'name': 'NAME_EN', 'type': 'string'},
     {'name': 'NAME_RU', 'type': 'string'},
     {'name': 'NAME_AR', 'type': 'string'},
     {'name': 'DESCRIPTION', 'type': 'string'},
     {'name': 'DESCRIPTION_EN', 'type': 'string'},
     {'name': 'DESCRIPTION_RU', 'type': 'string'},
     {'name': 'DESCRIPTION_AR', 'type': 'string'}]}},
  

In [3]:
# The loaded Data Resource objects
dp.resources

(<datapackage.resource.TabularResource at 0x108b0ab70>,
 <datapackage.resource.TabularResource at 0x108b28780>)

In [4]:
# Each resource provides a stream over the data
israel_muni_budget_tree = dp.resources[0].iter()

israel_muni_budget_tree

<generator object TabularResource._iter_from_tabulator at 0x10922fb48>

In [5]:
# When a Data Resource is a Tabular Data Resource,
# values from the CSV are cast on iteration
tel_aviv_budget = dp.resources[1].iter()

next(tel_aviv_budget)

{'ACTUAL': None,
 'BUDGET': Decimal('202000'),
 'CODE': '1.611112.124',
 'NAME': 'החזר הוצאות',
 'NAME_AR': '',
 'NAME_EN': '',
 'NAME_RU': '',
 'PARENT': '12',
 'PARENT SCOPE': '6111|611|61|6'}
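
To make "cast on iteration" a bit more concrete, here is a rough sketch of the idea using only the standard library. It is not how `datapackage` does it internally: the `CASTERS` table and the `iter_cast` helper are hypothetical names, the rule that empty non-string cells become `None` is an assumption based on the output above, and the one-row CSV is made up for illustration.

```python
import csv
import io
from decimal import Decimal

# Hypothetical mapping from schema type names to Python casters.
CASTERS = {
    'string': str,
    'integer': int,
    'number': Decimal,
}


def iter_cast(stream, schema):
    """Lazily yield one dict per CSV row, casting each value by field type."""
    types = {field['name']: field['type'] for field in schema['fields']}
    for row in csv.DictReader(stream):
        # Assumption: an empty cell in a non-string field means "missing".
        yield {
            name: (None if value == '' and types[name] != 'string'
                   else CASTERS[types[name]](value))
            for name, value in row.items()
        }


schema = {'fields': [
    {'name': 'CODE', 'type': 'string'},
    {'name': 'BUDGET', 'type': 'number'},
    {'name': 'ACTUAL', 'type': 'number'},
]}

# Illustrative data only; the trailing comma leaves ACTUAL empty.
sample = io.StringIO('CODE,BUDGET,ACTUAL\n1.611112.124,202000,\n')

rows = iter_cast(sample, schema)  # a generator: nothing is read yet
first = next(rows)                # the first row is parsed and cast here
```

Because `iter_cast` is a generator, rows are only read and cast as they are requested, which is the same shape as the stream returned by `iter()` above.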

## Demo: Core libraries

Some demonstration code showing the main functionality of the core libraries.

In [6]:
import io
import csv
import datapackage
import jsontableschema as tableschema  # the package has been renamed to tableschema for v1

In [7]:
source = 'data/mailing-list/data.csv'
# When we just have a source of data, we can still get a schema
with io.open(source) as stream:
    reader = csv.reader(stream)
    headers = next(reader)
    values = list(reader)

# take a sample of values to feed to the inference engine
sample = values[:4]

schema = tableschema.infer(headers, sample)

# Inference currently only infers "default" types
# It does not try to infer format yet (coming)
schema

{'fields': [{'description': '',
   'format': 'default',
   'name': 'first_name',
   'title': '',
   'type': 'string'},
  {'description': '',
   'format': 'default',
   'name': 'last_name',
   'title': '',
   'type': 'string'},
  {'description': '',
   'format': 'default',
   'name': 'age',
   'title': '',
   'type': 'integer'},
  {'description': '',
   'format': 'default',
   'name': 'rating',
   'title': '',
   'type': 'number'},
  {'description': '',
   'format': 'default',
   'name': 'contactable',
   'title': '',
   'type': 'boolean'},
  {'description': '',
   'format': 'default',
   'name': 'created',
   'title': '',
   'type': 'date'}]}
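
To give a feel for what inference involves, here is a minimal sketch using only the standard library. It is not the implementation behind `tableschema.infer`: the helper names (`infer_type`, `infer_schema`), the order of the checks, the accepted boolean spellings, and the single date format are all assumptions made for illustration. The sample rows reuse values that appear elsewhere in this demo.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def _parses(caster, value):
    """Return True if caster(value) succeeds without a parse error."""
    try:
        caster(value)
        return True
    except (ValueError, InvalidOperation):
        return False


# Assumed boolean spellings; the real library may accept a different set.
BOOLEANS = {'true', 'false', 't', 'f', 'yes', 'no'}

# Checks run from most to least specific; 'string' is the fallback.
CHECKS = [
    ('integer', lambda v: _parses(int, v)),
    ('number', lambda v: _parses(Decimal, v)),
    ('boolean', lambda v: v.lower() in BOOLEANS),
    ('date', lambda v: _parses(lambda s: datetime.strptime(s, '%Y-%m-%d'), v)),
]


def infer_type(values):
    """Pick the first type whose check passes for every sampled value."""
    for name, check in CHECKS:
        if all(check(v) for v in values):
            return name
    return 'string'


def infer_schema(headers, rows):
    """Build a minimal schema dict from headers and a sample of rows."""
    columns = zip(*rows)  # transpose rows into per-field columns
    return {'fields': [{'name': header, 'type': infer_type(column)}
                       for header, column in zip(headers, columns)]}


headers = ['first_name', 'last_name', 'age', 'rating', 'contactable', 'created']
rows = [
    ['Jane', 'Roberts', '42', '4.8', 'T', '2016-01-06'],
    ['Amos', 'Levy', '13', '2.0', 'T', '2011-02-05'],
]
sketch_schema = infer_schema(headers, rows)
inferred_types = [field['type'] for field in sketch_schema['fields']]
```

The ordering matters: every integer also parses as a number, so the narrower check has to run first.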

In [8]:
# we can validate any schema
tableschema.validate(schema)

True

In [9]:
# and catch if a schema is not valid
try:
    tableschema.validate({"fields": {}})
except tableschema.exceptions.SchemaValidationError as e:
    msg = e.message

msg

"{} is not of type 'array'"

In [10]:
# We also get some helper methods to work with schemas
model = tableschema.Schema(schema)

In [11]:
model.headers

['first_name', 'last_name', 'age', 'rating', 'contactable', 'created']

In [12]:
model.has_field('occupation')

False

In [13]:
model.cast_row(['Amos', 'Levy', '13', '2.0', 'T', '2011-02-05'])

['Amos', 'Levy', 13, Decimal('2.0'), True, datetime.date(2011, 2, 5)]

In [14]:
# We can iterate over a stream and cast values
table = tableschema.Table(source, schema)

In [15]:
next(table.iter())

['Jane', 'Roberts', 42, Decimal('4.8'), True, datetime.date(2016, 1, 6)]

In [16]:
next(table.iter(keyed=True))

{'age': 42,
 'contactable': True,
 'created': datetime.date(2016, 1, 6),
 'first_name': 'Jane',
 'last_name': 'Roberts',
 'rating': Decimal('4.8')}

In [17]:
# We saw the basics of handling Data Packages in the first demo.
# Now let's use our inferred schema and source data for a new Tabular Data Package

tdp = datapackage.DataPackage(schema='tabular')
tdp.descriptor

{}

In [18]:
# We've just got an empty descriptor, so it is not actually valid
try:
    tdp.validate()
except datapackage.exceptions.ValidationError as e:
    msg = e.message

msg

"'name' is a required property"

In [19]:
# Add the minimum for a Tabular Data Resource
tdp.descriptor.update({
    'name': 'my-mailing-lists',
    'resources': [
        {
            'name': 'mailer',
            'path': source,
            'schema': schema
        }
    ]
})

tdp.validate()
tdp.descriptor

{'name': 'my-mailing-lists',
 'resources': [{'name': 'mailer',
   'path': 'data/mailing-list/data.csv',
   'schema': {'fields': [{'description': '',
      'format': 'default',
      'name': 'first_name',
      'title': '',
      'type': 'string'},
     {'description': '',
      'format': 'default',
      'name': 'last_name',
      'title': '',
      'type': 'string'},
     {'description': '',
      'format': 'default',
      'name': 'age',
      'title': '',
      'type': 'integer'},
     {'description': '',
      'format': 'default',
      'name': 'rating',
      'title': '',
      'type': 'number'},
     {'description': '',
      'format': 'default',
      'name': 'contactable',
      'title': '',
      'type': 'boolean'},
     {'description': '',
      'format': 'default',
      'name': 'created',
      'title': '',
      'type': 'date'}]}}]}
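
Since the descriptor is a plain dictionary, one simple way to persist it as a `datapackage.json` file is the standard `json` module. This is only a sketch under that assumption: the library may well provide its own save helper, which is not shown here, and the descriptor below is a trimmed, hand-written copy of the one above rather than the live `tdp.descriptor` object.

```python
import json
import os
import tempfile

# A trimmed copy of the descriptor above (two fields only, for brevity).
descriptor = {
    'name': 'my-mailing-lists',
    'resources': [{
        'name': 'mailer',
        'path': 'data/mailing-list/data.csv',
        'schema': {'fields': [
            {'name': 'first_name', 'type': 'string'},
            {'name': 'age', 'type': 'integer'},
        ]},
    }],
}

# Write to a temporary directory so the sketch does not touch real data.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'datapackage.json')
    with open(target, 'w', encoding='utf-8') as f:
        json.dump(descriptor, f, indent=2, ensure_ascii=False)

    # Read it back to confirm the round trip preserves the descriptor.
    with open(target, encoding='utf-8') as f:
        reloaded = json.load(f)
```

The file written this way has the same shape as the `data/israel-muni/datapackage.json` descriptor loaded in the first demo.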

In [20]:
next(tdp.resources[0].iter())

{'age': 42,
 'contactable': True,
 'created': datetime.date(2016, 1, 6),
 'first_name': 'Jane',
 'last_name': 'Roberts',
 'rating': Decimal('4.8')}

## Demo: goodtables

Some demonstration code showing features of goodtables.

In [21]:
import goodtables

inspector = goodtables.Inspector()
inspector.inspect('data/invalid.csv')

{'error-count': 7,
 'errors': [],
 'table-count': 1,
 'tables': [{'error-count': 7,
   'errors': [{'code': 'blank-header',
     'column-number': 3,
     'message': 'Header in column 3 is blank',
     'row': None,
     'row-number': None},
    {'code': 'duplicate-header',
     'column-number': 4,
     'message': 'Header in column 4 is duplicated to header in column(s) 2',
     'row': None,
     'row-number': None},
    {'code': 'missing-value',
     'column-number': 3,
     'message': 'Row 2 has a missing value in column 3',
     'row': ['1', 'english'],
     'row-number': 2},
    {'code': 'missing-value',
     'column-number': 4,
     'message': 'Row 2 has a missing value in column 4',
     'row': ['1', 'english'],
     'row-number': 2},
    {'code': 'duplicate-row',
     'column-number': None,
     'message': 'Row 3 is duplicated to row(s) 2',
     'row': ['1', 'english'],
     'row-number': 3},
    {'code': 'blank-row',
     'column-number': None,
     'message': 'Row 4 is completely

In [22]:
# We can customize our inspector
inspector = goodtables.Inspector(checks={
    'duplicate-header': False,
    'extra-header': False,
    'missing-value': False,
    'blank-header': False
})
inspector.inspect('data/invalid.csv')

{'error-count': 3,
 'errors': [],
 'table-count': 1,
 'tables': [{'error-count': 3,
   'errors': [{'code': 'duplicate-row',
     'column-number': None,
     'message': 'Row 3 is duplicated to row(s) 2',
     'row': ['1', 'english'],
     'row-number': 3},
    {'code': 'blank-row',
     'column-number': None,
     'message': 'Row 4 is completely blank',
     'row': [],
     'row-number': 4},
    {'code': 'extra-value',
     'column-number': 5,
     'message': 'Row 5 has an extra value in column 5',
     'row': ['2', 'german', '1', '2', '3'],
     'row-number': 5}],
   'headers': ['id', 'name', '', 'name'],
   'row-count': 5,
   'source': 'data/invalid.csv',
   'time': 0.002,
   'valid': False}],
 'time': 0.007,
 'valid': False}
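
The report is an ordinary nested dictionary, so it is easy to post-process. As a small sketch, the hypothetical helper `count_errors_by_code` below tallies errors per code. It assumes only the keys visible in the report printed above (`tables`, `errors`, `code`, `error-count`), and it works on a trimmed, hand-copied report rather than a live `inspect()` result.

```python
from collections import Counter


def count_errors_by_code(report):
    """Tally error codes across every table in a goodtables-style report."""
    return Counter(
        error['code']
        for table in report['tables']
        for error in table['errors']
    )


# Trimmed copy of the report printed above, keeping only the keys used here.
report = {
    'error-count': 3,
    'tables': [{
        'errors': [
            {'code': 'duplicate-row', 'row-number': 3},
            {'code': 'blank-row', 'row-number': 4},
            {'code': 'extra-value', 'row-number': 5},
        ],
    }],
}

counts = count_errors_by_code(report)
```

A tally like this is a quick way to see which checks account for most of the errors before deciding which ones to switch off.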

In [23]:
# We can also inspect all Data Resources in a Data Package
inspector = goodtables.Inspector()
result = inspector.inspect('data/israel-muni/datapackage.json', preset='datapackage')

result['error-count'], result['table-count']

(0, 2)

goodtables also has a command-line interface, with one command for a single table and one for a Data Package:

`goodtables table data/israel-muni/tel-aviv-2013.csv`

`goodtables datapackage data/israel-muni/datapackage.json`

In [24]:
%%bash
goodtables

Usage: goodtables [OPTIONS] COMMAND [ARGS]...

Options:
  --checks TEXT
  --error-limit INTEGER
  --table-limit INTEGER
  --row-limit INTEGER
  --infer-schema
  --infer-fields
  --order-fields
  --json
  --help                 Show this message and exit.

Commands:
  datapackage
  table


## Demo: Data Package Pipelines

Some demonstration code using Data Package Pipelines.