# Datadotworld example flow

## What's the problem?
When BPIO is processing data in-house, the best option is going to be direct SQL queries with the `pyodbc` Python package. That provides faster access to the freshest possible data on facilities and fleet activity. But we aren't going to be handing out our database credentials to everyone who's involved in a data partnership with us, or who just wants to explore a bit of DGS data. 

For external partners, analysts, and collaborators, the `datadotworld` Software Development Kit for Python offers a clean alternative. We break off tables that can be saved as .csv or .xlsx files, host them at our page on the [Data.World site](https://data.world/dgsbpio), and partners can access them without us needing to pass them files. This has several advantages:

- Keeping data files out of version control makes for cleaner GitHub repositories and workflows.
- We can update the data in one place and all partners will have the update, without us needing to send files to each of them.
- The data dictionary uploads along with the data, so we can be sure everybody has the same column definitions.

The code below demonstrates some of the features and flows we can expect from data served by data.world.

## Setup

In [1]:
import datadotworld as dw  # here's the Python SDK
import pandas as pd  # at DGS, we typically pair datadotworld with Pandas
import pprint as pp
from pathlib import Path

### Configure the SDK if needed
This library requires a data.world API authentication token to work.

Your authentication token can be obtained on data.world once you enable Python under Integrations > Python.

To configure the library, run the following command. You will be prompted to provide your API token.

In [2]:
# dw configure
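Before calling the SDK, it can help to confirm that a token is actually available. The sketch below assumes two things that are not stated above and should be checked against your installed version: that the SDK also honors a `DW_AUTH_TOKEN` environment variable, and that `dw configure` writes its settings to `~/.dw/config`. The helper name `dw_token_source` is hypothetical.

```python
import os
from pathlib import Path


def dw_token_source(env=None, config_path=None):
    """Report where a data.world token appears to be available, if anywhere.

    Assumptions (verify against your SDK version): the token may live in the
    DW_AUTH_TOKEN environment variable or in the file written by `dw configure`.
    """
    env = os.environ if env is None else env
    if config_path is None:
        config_path = Path.home() / '.dw' / 'config'  # assumed default location
    if env.get('DW_AUTH_TOKEN'):
        return 'environment'
    if Path(config_path).is_file():
        return 'config file'
    return None


# None here suggests running `dw configure` from a terminal first.
print(dw_token_source())
```

The helper only checks for the presence of a token, not its validity, so a stale token would still pass this check.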

## Basic data import
### Load dataset
The `dw.load_dataset()` function is probably the one that partners would use the most, if they simply want to use Data.World to obtain data. 

Take a look at the structure of the dictionary returned by the `describe()` method on a datadotworld dataset. 

In [3]:
facilities = dw.load_dataset(dataset_key='dgsbpio/facilities-sandbox')
pp.pprint(facilities.describe())

{'description': 'This dataset contains .csv files exported from Archibus. This '
                'is meant to be a toy dataset for demonstration purposes.',
 'homepage': 'https://data.world/dgsbpio/facilities-sandbox',
 'name': 'dgsbpio_facilities-sandbox',
 'resources': [{'format': 'csv',
                'name': 'buildings',
                'path': 'data/buildings.csv'},
               {'format': 'csv',
                'name': 'work_requests',
                'path': 'data/work_requests.csv'},
               {'bytes': 304730,
                'description': 'This is the buildings table, where each row '
                               'represents a unique building. \n'
                               '\n'
                               'Data sourced from DGS Archibus on March 9, '
                               '2020.',
                'format': 'csv',
                'keywords': ['raw data'],
                'mediatype': 'text/csv',
                'name': 'original/buildings.csv',
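The `resources` list in that dictionary mixes two kinds of entries: the tabular views that can be queried (`buildings`, `work_requests`) and the files as originally uploaded (`original/buildings.csv`). A minimal sketch for separating them follows. It assumes the `original/` name prefix is a reliable marker, which is inferred only from the output shown here, and the helper name `summarize_resources` is hypothetical. The sample dictionary is a trimmed copy of that output.

```python
def summarize_resources(description):
    """Split the resources of a describe() dictionary into tables and original files."""
    tables, originals = [], []
    for resource in description.get('resources', []):
        name = resource['name']
        # Assumption: uploaded source files carry an 'original/' name prefix.
        if name.startswith('original/'):
            originals.append(name)
        else:
            tables.append(name)
    return tables, originals


# Trimmed copy of the describe() output for the sandbox dataset.
sample_description = {
    'name': 'dgsbpio_facilities-sandbox',
    'resources': [
        {'format': 'csv', 'name': 'buildings', 'path': 'data/buildings.csv'},
        {'format': 'csv', 'name': 'work_requests', 'path': 'data/work_requests.csv'},
        {'format': 'csv', 'name': 'original/buildings.csv'},
    ],
}

tables, originals = summarize_resources(sample_description)
print(tables)     # ['buildings', 'work_requests']
print(originals)  # ['original/buildings.csv']
```

With the real dataset object, `summarize_resources(facilities.describe())` would give the same kind of split, and each table name should then be usable as a key into `facilities.dataframes`. Because `load_dataset()` works from a locally cached copy, passing `force_update=True` is the way to refresh it after the data changes on data.world.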


### Import a specific table for analysis
#### ... as a QueryResults object
We can use the SDK's `query()` function to grab whole tables, or selections or combinations of tables, based on a SQL query. This returns a `QueryResults` object.

In [4]:
wr = dw.query(dataset_key='dgsbpio/facilities-sandbox', 
              query='SELECT * FROM work_requests')

type(wr)

datadotworld.models.query.QueryResults
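Since `query()` accepts arbitrary SQL, partners can pull just the rows they need rather than a whole table. The sketch below filters work requests by problem type. It assumes that `query()` accepts a `parameters` list for positional `?` placeholders in SQL queries, which is worth confirming against the SDK docstring. The helper `work_requests_by_type` and its `query_fn` argument are hypothetical, added so the function can be exercised without network access.

```python
def work_requests_by_type(prob_type,
                          dataset_key='dgsbpio/facilities-sandbox',
                          query_fn=None):
    """Return work requests of one problem type as a QueryResults object."""
    if query_fn is None:
        # Imported lazily so the helper can be defined without the SDK installed.
        import datadotworld as dw
        query_fn = dw.query
    sql = ('SELECT bl_id, wr_id, prob_type, date_requested '
           'FROM work_requests WHERE prob_type = ?')
    # Assumption: positional SQL parameters are passed as a list.
    return query_fn(dataset_key=dataset_key, query=sql, parameters=[prob_type])


# Example call (requires a configured token and network access):
# hvac = work_requests_by_type('HVAC')
# hvac_df = hvac.dataframe
```

Passing the value as a parameter instead of formatting it into the SQL string avoids quoting problems when a problem type contains unusual characters.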

The main advantage of keeping the data as a `QueryResults` object is that the column descriptions are all accessible through its `describe()` method:

In [5]:
pp.pprint(wr.describe(), depth=3)

{'fields': [{'description': 'Unique identifier for each building; PK in table '
                            '`bl`.',
             'name': 'bl_id',
             'rdfType': None,
             'type': 'string'},
            {'description': 'Unique identifier for each work request; PK in '
                            'table `wr`',
             'name': 'wr_id',
             'rdfType': None,
             'type': 'string'},
            {'description': 'Categorical variable representing the type of '
                            'issue the request refers to.',
             'name': 'prob_type',
             'rdfType': None,
             'type': 'string'},
            {'description': 'Date when work request was created, always by a '
                            'person. Please ignore time-related section of '
                            'timestamp.',
             'format': 'any',
             'name': 'date_requested',
             'rdfType': 'http://www.w3.org/2001/XMLSchema#dateTime',
          
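The nested `fields` list is thorough but awkward to scan. One way to flatten it into a simple column-to-description lookup is sketched below. The helper name `data_dictionary` is hypothetical, and the sample input is copied from the first three fields in the output above.

```python
def data_dictionary(description):
    """Map each column name to its description from a describe() dictionary."""
    return {
        field['name']: field.get('description', '')
        for field in description.get('fields', [])
    }


# First three fields from the work_requests description shown above.
sample_description = {
    'fields': [
        {'name': 'bl_id', 'type': 'string', 'rdfType': None,
         'description': 'Unique identifier for each building; PK in table `bl`.'},
        {'name': 'wr_id', 'type': 'string', 'rdfType': None,
         'description': 'Unique identifier for each work request; PK in table `wr`'},
        {'name': 'prob_type', 'type': 'string', 'rdfType': None,
         'description': 'Categorical variable representing the type of issue '
                        'the request refers to.'},
    ],
}

for column, text in data_dictionary(sample_description).items():
    print(f'{column}: {text}')
```

With a live result, `data_dictionary(wr.describe())` should produce the same lookup for every column in the query.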

#### ... as a Pandas DataFrame
But what if we want to just analyze the data in Pandas?

A `QueryResults` object returned by `query()` exposes a `.dataframe` attribute, which returns the query results as a Pandas DataFrame.

In [6]:
wr_df = wr.dataframe

In [7]:
wr_df.head()

| | bl_id | wr_id | prob_type | date_requested | date_assigned | date_completed | work_team_id |
|---|---|---|---|---|---|---|---|
| 0 | B00027 | 19566 | OVERHDDOOR | 2015-01-02 | 2015-01-02 | 2015-06-23 | XYZ |
| 1 | B00020 | 19585 | OVERHDDOOR | 2015-01-05 | 2015-01-05 | 2015-06-23 | CONTRACT ... |
| 2 | B04031 | 19756 | OVERHDDOOR | 2015-01-07 | 2015-01-07 | 2015-06-23 | CONTRACT ... |
| 3 | B04016 | 19757 | OVERHDDOOR | 2015-01-07 | 2015-01-07 | 2015-05-13 | CONTRACT ... |
| 4 | B00166 | 19800 | HVAC | 2015-01-08 | 2015-01-08 | NaT | DISPATCHERS ... |


Notice that the numerical ID columns, which we saved, correctly, as _strings_ in the Data.World browser-based UI, are still strings when we import them into a Pandas DataFrame here. **That is awesome!** Similarly, the datetime columns remain in the correct format, and the user of the Python SDK does not need to coerce them into the correct data type.

In [8]:
wr_df.dtypes

bl_id                     object
wr_id                     object
prob_type                 object
date_requested    datetime64[ns]
date_assigned     datetime64[ns]
date_completed    datetime64[ns]
work_team_id              object
dtype: object
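Because the date columns arrive as `datetime64[ns]`, date arithmetic works immediately. As a small illustration, the sketch below computes how many days each work request took to complete. The helper name `days_to_complete` is hypothetical, and the sample frame reuses three rows from the `head()` output above so the block runs without a data.world connection.

```python
import pandas as pd


def days_to_complete(df):
    """Days between request and completion; NaN where a request is still open."""
    return (df['date_completed'] - df['date_requested']).dt.days


# Three rows copied from the head() output above (index 0, 3, and 4).
sample = pd.DataFrame({
    'wr_id': ['19566', '19757', '19800'],
    'prob_type': ['OVERHDDOOR', 'OVERHDDOOR', 'HVAC'],
    'date_requested': pd.to_datetime(['2015-01-02', '2015-01-07', '2015-01-08']),
    'date_completed': pd.to_datetime(['2015-06-23', '2015-05-13', None]),
})

sample['days_to_complete'] = days_to_complete(sample)
print(sample[['wr_id', 'days_to_complete']])
```

The open request (`wr_id` 19800) has `NaT` for its completion date, so its duration comes out as `NaN` rather than raising an error. The same call on the full table would be `days_to_complete(wr_df)`.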

## Use the API client to upload data
The idea of using an API to push data up to Data.World originally got the Baltimore DGS team pretty excited because we hoped to be able to use GitHub as a familiar space to store all the data descriptions, labels, column descriptions, and so on. Then, if we were interested in applying version control to this information, that would just work out of the box.  

It turns out, though, that Data.World's API is still a bit limited. Only file labels (e.g. "clean data", "raw data", "documentation") and a file-level description can be created in this way. The following cells demonstrate this functionality. 

In [9]:
# fire up a client object
client = dw.api_client()
path = Path.cwd() / 'data' / 'buildings.csv'

In [10]:
# create an empty dictionary
metadata = {}
# put some labels and a description into the dictionary
metadata['buildings.csv'] =  { 'labels': ['raw data'], 
                              'description': 'the file description for this file'}

# upload the data to the DGS sandbox!
client.upload_files(dataset_key='dgsbpio/facilities-sandbox',
                    files=[path], 
                    files_metadata=metadata)
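When several files share the same labels, the metadata dictionary can be generated rather than typed by hand. The sketch below is one way to do that. It assumes, based on the cell above, that `files_metadata` is keyed by bare file name. The helper name `build_files_metadata` and the `work_requests.csv` path are hypothetical.

```python
from pathlib import Path


def build_files_metadata(paths, labels, description):
    """Build a files_metadata dictionary keyed by file name for upload_files()."""
    return {
        Path(path).name: {'labels': list(labels), 'description': description}
        for path in paths
    }


paths = [Path('data') / 'buildings.csv', Path('data') / 'work_requests.csv']
metadata = build_files_metadata(
    paths,
    labels=['raw data'],
    description='the file description for this file',
)
print(metadata)

# The result can then be passed along as in the cell above:
# client.upload_files(dataset_key='dgsbpio/facilities-sandbox',
#                     files=paths,
#                     files_metadata=metadata)
```

Storing the labels and descriptions in a script like this keeps them under version control, even though column-level descriptions still have to be maintained in the browser UI.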

Taking a look at the methods and attributes of the client object shows us that the API can do lots of things, most of which we haven't tried yet. Options include:

- appending rows to an existing table
- adding or deleting datasets, insights, and projects
- syncing files

In [11]:
dir(client)

['_RestApiClient__build_dataset_obj',
 '_RestApiClient__build_insight_obj',
 '_RestApiClient__build_project_obj',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_build_api_client',
 '_config',
 '_datasets_api',
 '_download_api',
 '_download_host',
 '_host',
 '_insights_api',
 '_projects_api',
 '_protocol',
 '_streams_api',
 '_uploads_api',
 '_user_api',
 'add_files_via_url',
 'add_linked_dataset',
 'append_records',
 'create_dataset',
 'create_insight',
 'create_project',
 'delete_dataset',
 'delete_files',
 'delete_insight',
 'delete_project',
 'download_datapackage',
 'download_dataset',
 'download_file',
 'fetch_contributing_datasets',
 'fetch_contributing_project