## DemystData Python Toolkit

DemystData connects users to external data, with sources that can enrich consumer, commercial, and property records, among others. With the Demyst Python library, users can access that data and tailor the workflow to their own needs and strengths. This notebook lays out the main functions of the library.

In [1]:
# Import some popular python packages for handling data
import csv
import pandas as pd
import numpy as np
import random

# Import and instantiate an Analytics object from demyst-analytics
from demyst.analytics import Analytics
analytics = Analytics()

# Import 'report' for post processing
from demyst.analytics.report import *

## Inputs

Inputs are a necessity for accessing Demyst's data sources. These sources provide matching technology that appends data in their stores to the inputs (consumer, business, or property records) that users bring to the table. 

If you do not have an input file handy, do not worry. Demyst has sample files that are perfect for testing and exploring.

### Hosted Inputs

Hosted inputs are ready-made input files with the columns and formatting that Demyst's data sources require.

In [2]:
# List all hosted inputs

analytics.input_files()

['us_business_entity']

In [3]:
# Download the 'us_business_entity' input file, a set of US business records.

analytics.input_file('us_business_entity')

Unnamed: 0,city,post_code,country,naics_code,business_post_code,state,street,blackbelly_ts,business_state,business_name,business_city,row_number,business_street
0,Union,63084,us,492210,63084,MO,731 Second Creek Road.,2019-02-25 15:09:44,MO,Dinamite D & A LLC,Union,1,731 Second Creek Road.
1,Orlando,32818,us,541940,32818,FL,2608 Powers Drive,2019-02-25 15:09:44,FL,Dr. Jeffrey Bacia and Dr. Loui,Orlando,2,2608 Powers Drive
2,RICHFIELD,84701,us,339932,84701,UT,673 N MAIN ST,2019-02-25 15:09:44,UT,"FOREST CREATION, INC.",RICHFIELD,3,673 N MAIN ST
3,HOUSTON,77084,us,722212,77084,TX,2902 Greenhouse Road,2019-02-25 15:09:44,TX,Al's Pizza House Inc.,HOUSTON,4,2902 Greenhouse Road
4,SAN FRANCISCO,94123,us,722110,94123,CA,2953 BAKER ST,2019-02-25 15:09:44,CA,BAKER STREET BISTRO,SAN FRANCISCO,5,2953 BAKER ST
5,Salt lake city,84150,us,722211,84150,UT,28 South State.,2019-02-25 15:09:44,UT,Juicemaker Enterprises LLC and,Salt lake city,6,28 South State.
6,TUCSON,85705,us,337122,85705,AZ,224 N 4TH AVE,2019-02-25 15:09:44,AZ,ARROYO DESIGN,TUCSON,7,224 N 4TH AVE
7,MAGEE,39111,us,424460,39111,MS,9544 Highway 18 West,2019-02-25 15:09:44,MS,"Fish Depot, Inc.",MAGEE,8,9544 Highway 18 West
8,Westminster,21157,us,621999,21157,MD,1011 Baltimore Blvd,2019-02-25 15:09:44,MD,"Express Care of Westminster, L",Westminster,9,1011 Baltimore Blvd
9,CHICAGO,60617,us,624410,60617,IL,8515 S STONY ISLAND AVE,2019-02-25 15:09:44,IL,"LINKS TO LEARNING CHILD CARE,",CHICAGO,10,8515 S STONY ISLAND AVE


#### Optional Arguments

The `input_file` command defaults to 50 records and does not apply any filters to the data set. However, it accepts two optional arguments.

Users can provide the number of rows to return (second parameter, type: int).

example: `10`

Users can also provide a filter based on values in a column (third parameter, type: dict).

example: `{"state" : "ca", "naics_code" : "722110"}`

In [4]:
analytics.input_file('us_business_entity', 10, {"state" : "ca", "naics_code" : "722110"})

Unnamed: 0,city,post_code,country,naics_code,business_post_code,state,street,blackbelly_ts,business_state,business_name,business_city,row_number,business_street
0,SAN FRANCISCO,94123,us,722110,94123,CA,2953 BAKER ST,2019-02-25 15:09:44,CA,BAKER STREET BISTRO,SAN FRANCISCO,5,2953 BAKER ST
1,FAIRFIELD,94533,us,722110,94533,CA,1430 N. TEXAS STREET,2019-02-25 15:09:44,CA,"YO SUSHI, INC.",FAIRFIELD,82,1430 N. TEXAS STREET
2,Artesia,90701,us,722110,90701,CA,18854 Norwalk Blvd..,2019-02-25 15:09:44,CA,La Szechwan Garden Inc,Artesia,239,18854 Norwalk Blvd..
3,Tustin,92780,us,722110,92780,CA,17245 Seventeenth St.,2019-02-25 15:09:44,CA,"Dosa Place International, Inc.",Tustin,286,17245 Seventeenth St.
4,North Hollywood,91605,us,722110,91605,CA,11669 Sherman Way,2019-02-25 15:09:44,CA,"Salsa & Beer, Inc.",North Hollywood,501,11669 Sherman Way
5,APTOS,95003,us,722110,95003,CA,102 RANCHO DEL MAR,2019-02-25 15:09:44,CA,ERIK'S DELICAFE OF APTOS,APTOS,659,102 RANCHO DEL MAR
6,San luis obispo,93401,us,722110,93401,CA,1819 Osos Street.,2019-02-25 15:09:44,CA,Bridgeview Asian Grill LLC,San luis obispo,878,1819 Osos Street.
7,San luis obispo,93405,us,722110,93405,CA,290 Madonna Road.,2019-02-25 15:09:44,CA,Alex Chiang & Jie Zhu Ouyang G,San luis obispo,917,290 Madonna Road.
8,PETALUMA,94952,us,722110,94952,CA,6 PETALUMA BLVD NORTH STE A5,2019-02-25 15:09:44,CA,NANCY DELORENZO,PETALUMA,977,6 PETALUMA BLVD NORTH STE A5
9,BURBANK,91505,us,722110,91505,CA,4300 W RIVERSIDE DR,2019-02-25 15:09:44,CA,Leona FGardner,BURBANK,1066,4300 W RIVERSIDE DR
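
The filter in the call above is written in lowercase (`"ca"`) while the returned rows show `CA`, which suggests the match ignores case. A minimal local sketch of that kind of filter on a pandas DataFrame follows. `filter_records` is a hypothetical helper written for illustration, not part of the Demyst library, and the case-insensitive comparison is an assumption based on the output above.

```python
import pandas as pd

def filter_records(df, filters, n_rows=50):
    """Return up to n_rows rows where every filtered column matches (case-insensitive)."""
    mask = pd.Series(True, index=df.index)
    for column, wanted in filters.items():
        # Compare as lowercase strings so "ca" matches "CA" and 722110 matches "722110"
        mask &= df[column].astype(str).str.lower() == str(wanted).lower()
    return df[mask].head(n_rows)

# Hypothetical sample rows, shaped like the hosted input file
sample = pd.DataFrame({
    "business_name": ["BAKER STREET BISTRO", "ARROYO DESIGN", "YO SUSHI, INC."],
    "state": ["CA", "AZ", "CA"],
    "naics_code": [722110, 337122, 722110],
})
matches = filter_records(sample, {"state": "ca", "naics_code": "722110"}, n_rows=10)
print(matches)
```

The same helper can be applied to a user-supplied file to pull a small, targeted sample before running any paid enrichment.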


### Validate

Users can also start from their own files. The `validate` function checks that those files are formatted correctly.

In [5]:
inputs = analytics.input_file('us_business_entity', 10, {"state" : "ca", "naics_code" : "722110"})
analytics.validate(inputs)

Column,Type,Error %
city,City,0.0
country,Country,0.0
state,State,0.0
street,Street,0.0
business_name,BusinessName,0.0

Column,Suggestions
post_code,This column should be of type string
naics_code,This column should be of type string
business_post_code,This column should be of type string
blackbelly_ts,No suggestions found for this column
business_state,"State (hit rate: 100.0%), Country (hit rate: 100.0%)"
business_city,No suggestions found for this column
row_number,This column should be of type string
business_street,No suggestions found for this column


In [6]:
# Changing post_code to a string, as recommended

inputs['post_code'] = inputs['post_code'].astype(str)

analytics.validate(inputs)

Column,Type,Error %
city,City,0.0
post_code,PostCode,0.0
country,Country,0.0
state,State,0.0
street,Street,0.0
business_name,BusinessName,0.0

Column,Suggestions
naics_code,This column should be of type string
business_post_code,This column should be of type string
blackbelly_ts,No suggestions found for this column
business_state,"State (hit rate: 100.0%), Country (hit rate: 100.0%)"
business_city,No suggestions found for this column
row_number,This column should be of type string
business_street,No suggestions found for this column
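
The remaining suggestions can be handled the same way. The sketch below converts several columns in one step and shows why code-like columns are better read as strings from the start: integer parsing drops leading zeros. The CSV content is made up for illustration, and only standard pandas calls are used.

```python
import io
import pandas as pd

# Hypothetical CSV content; note the leading zero in the first post code
raw = (
    "business_name,post_code,naics_code\n"
    "Acme Bakery,02139,311811\n"
    "Baker Street Bistro,94123,722110\n"
)

# Default parsing infers integers, so the leading zero is lost
as_int = pd.read_csv(io.StringIO(raw))

# Reading the code-like columns as strings keeps them intact
code_columns = ["post_code", "naics_code"]
as_str = pd.read_csv(io.StringIO(raw), dtype={col: str for col in code_columns})

# For a DataFrame already in memory, convert several columns at once
converted = as_int.copy()
converted[code_columns] = converted[code_columns].astype(str)

print(as_int["post_code"].tolist())
print(as_str["post_code"].tolist())
print(converted["post_code"].tolist())
```

Converting after the fact gives strings but cannot restore a zero that was already dropped, so setting `dtype` at read time is the safer choice for user-supplied files.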


### Finding Data

Demyst connects to hundreds of data sources, so it can be challenging to decide on the right sources to run. In the Demyst Python toolkit, the `search` function helps to find relevant sources, and the `product_stats` function helps to compare them.

#### search

In [7]:
# Use the inputs param to see the sources that will work with your input data set.

analytics.search(inputs=inputs)

Unnamed: 0,street,state,first_name,country,city,post_code,last_name
Option 1,☒,☒,☐,☒,☒,☒,☐

Unnamed: 0,street,email_address,ip4,state,phone,first_name,country,city,post_code,last_name
Option 1,☒,,,☒,,☐,☒,☒,☒,☐
Option 2,☒,,,☒,☐,☐,☒,☒,☒,☐
Option 3,,☐,,,,☐,,,,☐
Option 4,☒,,☐,☒,☐,,☒,☒,☒,
Option 5,☒,,☐,☒,,,☒,☒,☒,
Option 6,,,☐,,☐,,,,,

Unnamed: 0,street,phone,state,city,post_code
Option 1,☒,☐,☒,☒,☒

Unnamed: 0,street,email_address,state,first_name,city,last_name
Option 1,☒,☐,☒,☐,☒,☐

Unnamed: 0,phone,country
Option 1,☐,☒

Unnamed: 0,street,country
Option 1,☒,☒

Unnamed: 0,street,phone,state,city,post_code
Option 1,☒,☐,☒,☒,☒

Unnamed: 0,street,state,post_code,city
Option 1,☒,☒,☒,☒

Unnamed: 0,street,state,first_name,city,post_code,last_name
Option 1,☒,☒,☐,☒,☒,☐

Unnamed: 0,street,state,first_name,city,post_code,last_name
Option 1,☒,☒,☐,☒,☒,☐

Unnamed: 0,phone,country
Option 1,☐,☒

Unnamed: 0,phone,country
Option 1,☐,☒

Unnamed: 0,phone,country
Option 1,☐,☒

Unnamed: 0,phone,country
Option 1,☐,☒

Unnamed: 0,first_name,last_name,business_name
Option 1,☐,☐,☒

Unnamed: 0,business_name
Option 1,☒

Unnamed: 0,street,state,post_code,city
Option 1,☒,☒,☒,☒

Unnamed: 0,street,state,post_code,city
Option 1,☒,☒,☒,☒

Unnamed: 0,street_line_1,model,email_address,phone,state,first_name,city,post_code,last_name
Option 1,☐,☐,☐,☐,☒,☐,☒,☒,☐

Unnamed: 0,street_line_1,email_address,phone,state,first_name,city,post_code,last_name
Option 1,☐,☐,☐,☒,☐,☒,☒,☐

Unnamed: 0,street_line_1,email_address,phone,state,first_name,city,post_code,last_name
Option 1,☐,☐,☐,☒,☐,☒,☒,☐

Unnamed: 0,street,latitude,state,first_name,country,city,longitude,post_code,last_name
Option 1,☒,☐,☒,☐,☒,☒,☐,☒,☐

Unnamed: 0,state,city,post_code,business_name
Option 1,☒,☒,☒,☒

Unnamed: 0,business_name
Option 1,☒

Unnamed: 0,business_name
Option 1,☒

Unnamed: 0,business_name
Option 1,☒

Unnamed: 0,city_id,latitude,country,city,longitude,post_code
Option 1,☐,☐,☒,☒,☐,☒

Unnamed: 0,city_id,latitude,country,city,longitude,post_code
Option 1,☐,☐,☒,☒,☐,☒

Unnamed: 0,city_id,latitude,country,city,longitude,post_code
Option 1,☐,☐,☒,☒,☐,☒

Unnamed: 0,city_id,latitude,number_of_hours,country,city,longitude,date_time
Option 1,☐,☐,☐,☒,☒,☐,☐

Unnamed: 0,street,post_code,state,country,city,freeform
Option 1,☒,☒,☒,☒,☒,☐

Unnamed: 0,state,city,business_name
Option 1,☒,☒,☒

Unnamed: 0,state,city,business_name
Option 1,☒,☒,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,block_id,post_code
Option 1,☒,☐,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,block_id,post_code
Option 1,☒,☐,☒

Unnamed: 0,street,block_id,post_code
Option 1,☒,☐,☒

Unnamed: 0,street,block_id,post_code
Option 1,☒,☐,☒

Unnamed: 0,street,block_id,post_code
Option 1,☒,☐,☒

Unnamed: 0,street,block_id,post_code
Option 1,☒,☐,☒

Unnamed: 0,street,email_address,state,first_name,city,last_name
Option 1,☒,☐,☒,☐,☒,☐

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,block_id,post_code
Option 1,☒,☐,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,latitude,state,city,longitude,post_code
Option 1,☒,☐,☒,☒,☐,☒

Unnamed: 0,street,latitude,state,city,longitude,post_code
Option 1,☒,☐,☒,☒,☐,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,email_address,phone,state,first_name,city,post_code,last_name
Option 1,☒,☐,☐,☒,☐,☒,☒,☐

Unnamed: 0,email_address,phone,first_name,post_code,last_name
Option 1,☐,☐,☐,☒,☐

Unnamed: 0,street,email_address,phone,state,first_name,city,post_code,last_name
Option 1,☒,☐,☐,☒,☐,☒,☒,☐

Unnamed: 0,street,state,country,city,post_code
Option 1,☒,☒,☒,☒,☒

Unnamed: 0,business_name
Option 1,☒

Unnamed: 0,business_name
Option 1,☒

Unnamed: 0,latitude,business_name,radius,partner_id,phone,country,longitude
Option 1,☐,☒,☐,☐,☐,☒,☐

Unnamed: 0,street,state,post_code,city
Option 1,☒,☒,☒,☒

Unnamed: 0,street,business_name,state_region,min_conf,city
Option 1,☒,☒,☐,☐,☒

Unnamed: 0,business_name
Option 1,☒

Unnamed: 0,street,url,business_name,email_address,phone,state,country,city,duns_number,registration_number
Option 1,☒,☐,☒,☐,☐,☒,☒,☒,☐,☐

Unnamed: 0,abn,business_name
Option 1,☐,☒

Unnamed: 0,street,email_address,phone,state,country,city,post_code
Option 1,☒,☐,☐,☒,☒,☒,☒

Unnamed: 0,street,email_address,phone,state,country,city,post_code
Option 1,☒,☐,☐,☒,☒,☒,☒

Unnamed: 0,post_code
Option 1,☒

Unnamed: 0,business_name
Option 1,☒

Unnamed: 0,begins_with,business_name
Option 1,☐,☒

Unnamed: 0,street,post_code,inp_state,first_name,country,city,date_of_birth,last_name
Option 1,☒,☒,☐,☐,☒,☒,☐,☐

Unnamed: 0,country,number
Option 1,☒,☐

Unnamed: 0,country,number
Option 1,☒,☐

Unnamed: 0,country,name
Option 1,☒,☐

Unnamed: 0,country,name
Option 1,☒,☐

Unnamed: 0,country
Option 1,☒

Unnamed: 0,street,post_code,inp_state,first_name,city,date_of_birth,last_name
Option 1,☒,☒,☐,☐,☒,☐,☐

Unnamed: 0,country
Option 1,☒

Unnamed: 0,street,business_name,state,country,city,post_code
Option 1,☒,☒,☒,☒,☒,☒

Unnamed: 0,country,business_name
Option 1,☒,☒

Unnamed: 0,company_code,country
Option 1,☐,☒

Unnamed: 0,product_type,country,application_id,date_of_upload,verification_type
Option 1,☐,☒,☐,☐,☐

Unnamed: 0,country,business_name
Option 1,☒,☒

Unnamed: 0,company_code,country
Option 1,☐,☒

Unnamed: 0,product_type,country,application_id,date_of_upload,verification_type
Option 1,☐,☒,☐,☐,☐

Unnamed: 0,product_type,country,application_id,date_of_upload,verification_type
Option 1,☐,☒,☐,☐,☐

Unnamed: 0,product_type,country,application_id,date_of_upload,verification_type
Option 1,☐,☒,☐,☐,☐

Unnamed: 0,product_type,country,application_id,date_of_upload,verification_type
Option 1,☐,☒,☐,☐,☐

Unnamed: 0,product_type,country,application_id,date_of_upload,verification_type
Option 1,☐,☒,☐,☐,☐

Unnamed: 0,application_id,country
Option 1,☐,☒

Unnamed: 0,product_type,country,application_id,date_of_upload,verification_type
Option 1,☐,☒,☐,☐,☐

Unnamed: 0,application_id,country
Option 1,☐,☒

Unnamed: 0,business_name
Option 1,☒

Unnamed: 0,country,business_name
Option 1,☒,☒

Unnamed: 0,company_code,country
Option 1,☐,☒

Unnamed: 0,application_id,country
Option 1,☐,☒

Unnamed: 0,address,state,city
Option 1,☐,☒,☒

Unnamed: 0,address,state,city
Option 1,☐,☒,☒


In [8]:
# Optionally, add "tags" to narrow your search

analytics.search(inputs=inputs, tags=["Property"])

Unnamed: 0,street,state,post_code,city
Option 1,☒,☒,☒,☒

Unnamed: 0,street,state,post_code,city
Option 1,☒,☒,☒,☒

Unnamed: 0,street,state,post_code,city
Option 1,☒,☒,☒,☒

Unnamed: 0,city_id,latitude,country,city,longitude,post_code
Option 1,☐,☐,☒,☒,☐,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,block_id,post_code
Option 1,☒,☐,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,block_id,post_code
Option 1,☒,☐,☒

Unnamed: 0,street,block_id,post_code
Option 1,☒,☐,☒

Unnamed: 0,street,block_id,post_code
Option 1,☒,☐,☒

Unnamed: 0,street,block_id,post_code
Option 1,☒,☐,☒

Unnamed: 0,street,block_id,post_code
Option 1,☒,☐,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,block_id,post_code
Option 1,☒,☐,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,latitude,state,city,longitude,post_code
Option 1,☒,☐,☒,☒,☐,☒

Unnamed: 0,street,latitude,state,city,longitude,post_code
Option 1,☒,☐,☒,☒,☐,☒

Unnamed: 0,street,post_code
Option 1,☒,☒

Unnamed: 0,street,email_address,phone,state,first_name,city,post_code,last_name
Option 1,☒,☐,☐,☒,☐,☒,☒,☐

Unnamed: 0,street,state,country,city,post_code
Option 1,☒,☒,☒,☒,☒

Unnamed: 0,street,state,post_code,city
Option 1,☒,☒,☒,☒


In [9]:
# To retrieve the raw data, add a 'notebook=False' argument.

data_products = analytics.search(inputs=inputs, tags=["Property"], notebook=False)
data_product_names = [data_product["name"] for data_product in data_products]
data_product_names

['infutor_property_append',
 'utilityscore_bill',
 'utilityscore_savings',
 'openweather_current',
 'housecanary_property_geocode',
 'housecanary_property_census',
 'housecanary_property_details',
 'housecanary_block_hazard_hail',
 'housecanary_property_details_enhanced',
 'housecanary_property_flood',
 'housecanary_block_hazard_tornado',
 'housecanary_block_hazard_earthquake',
 'housecanary_block_hazard_wind',
 'housecanary_block_hazard_hurricane',
 'housecanary_crime',
 'housecanary_property_schools',
 'housecanary_superfund',
 'housecanary_property_mortgage_lien',
 'housecanary_property_notice_of_default',
 'housecanary_property_sales_history',
 'hazardhub_risks',
 'hazardhub_property',
 'housecanary_property_value',
 'acxiom_place',
 'attom_expanded_profile_report',
 'core_logic_property_search']

#### product_stats

To better understand the strengths and limitations of products in the catalog, Demyst has kicked off a study of the performance of those products, down to the attribute level. Users can leverage that data to decide which data products they're interested in.

In [10]:
# Pass a list of product names into the product_stats function to get data for each attribute. 

stats = analytics.product_stats(data_product_names)
stats



Unnamed: 0,consistency_rate,entity_name,error_rate,field_is_populated_rate,flattened_name,generic_flattened_name,hit_rate,last_updated_at,num_distinct_values,product
0,,property_entity,0.077236,0.754065,address[0].carrier_route,,0.755420,2019-04-18 00:11:00,199.0,infutor_property_append
1,,property_entity,0.077236,0.755420,address[0].city,,0.755420,2019-04-18 00:11:00,830.0,infutor_property_append
2,,property_entity,0.077236,0.752710,address[0].delivery_point_code,,0.755420,2019-04-18 00:11:00,597.0,infutor_property_append
3,,property_entity,0.077236,0.754065,address[0].delivery_point_validation,,0.755420,2019-04-18 00:11:00,4.0,infutor_property_append
4,,property_entity,0.077236,0.755420,address[0].postcode_type,,0.755420,2019-04-18 00:11:00,2.0,infutor_property_append
5,,property_entity,0.077236,0.755420,address[0].post_code,,0.755420,2019-04-18 00:11:00,1049.0,infutor_property_append
6,,property_entity,0.077236,0.752710,address[0].post_code_extension,,0.755420,2019-04-18 00:11:00,1016.0,infutor_property_append
7,,property_entity,0.077236,0.755420,address[0].state,,0.755420,2019-04-18 00:11:00,51.0,infutor_property_append
8,,property_entity,0.077236,0.755420,address[0].street,,0.755420,2019-04-18 00:11:00,1115.0,infutor_property_append
9,,property_entity,0.077236,0.752710,address[0].vacant,,0.755420,2019-04-18 00:11:00,2.0,infutor_property_append


In [11]:
# Filter for providers that have > 75% hit rate and fields that have > 50% populated rate.

high_hit_rate_stats = stats.loc[(stats['hit_rate'] > 0.75) & (stats['field_is_populated_rate'] > 0.5)]
high_hit_rate_stats

Unnamed: 0,consistency_rate,entity_name,error_rate,field_is_populated_rate,flattened_name,generic_flattened_name,hit_rate,last_updated_at,num_distinct_values,product
0,,property_entity,0.077236,0.754065,address[0].carrier_route,,0.755420,2019-04-18 00:11:00,199.0,infutor_property_append
1,,property_entity,0.077236,0.755420,address[0].city,,0.755420,2019-04-18 00:11:00,830.0,infutor_property_append
2,,property_entity,0.077236,0.752710,address[0].delivery_point_code,,0.755420,2019-04-18 00:11:00,597.0,infutor_property_append
3,,property_entity,0.077236,0.754065,address[0].delivery_point_validation,,0.755420,2019-04-18 00:11:00,4.0,infutor_property_append
4,,property_entity,0.077236,0.755420,address[0].postcode_type,,0.755420,2019-04-18 00:11:00,2.0,infutor_property_append
5,,property_entity,0.077236,0.755420,address[0].post_code,,0.755420,2019-04-18 00:11:00,1049.0,infutor_property_append
6,,property_entity,0.077236,0.752710,address[0].post_code_extension,,0.755420,2019-04-18 00:11:00,1016.0,infutor_property_append
7,,property_entity,0.077236,0.755420,address[0].state,,0.755420,2019-04-18 00:11:00,51.0,infutor_property_append
8,,property_entity,0.077236,0.755420,address[0].street,,0.755420,2019-04-18 00:11:00,1115.0,infutor_property_append
9,,property_entity,0.077236,0.752710,address[0].vacant,,0.755420,2019-04-18 00:11:00,2.0,infutor_property_append


In [12]:
# On top of that, filter for categorical variables that have more than 1 and fewer than 10 distinct values observed.

categorical_stats = high_hit_rate_stats.loc[(high_hit_rate_stats['num_distinct_values'] > 1) & (high_hit_rate_stats['num_distinct_values'] < 10)]
categorical_stats

Unnamed: 0,consistency_rate,entity_name,error_rate,field_is_populated_rate,flattened_name,generic_flattened_name,hit_rate,last_updated_at,num_distinct_values,product
3,,property_entity,0.077236,0.754065,address[0].delivery_point_validation,,0.755420,2019-04-18 00:11:00,4.0,infutor_property_append
4,,property_entity,0.077236,0.755420,address[0].postcode_type,,0.755420,2019-04-18 00:11:00,2.0,infutor_property_append
9,,property_entity,0.077236,0.752710,address[0].vacant,,0.755420,2019-04-18 00:11:00,2.0,infutor_property_append
13,,property_entity,0.077236,0.740515,address[1].delivery_point_validation,,0.755420,2019-04-18 00:11:00,3.0,infutor_property_append
14,,property_entity,0.077236,0.740515,address[1].postcode_type,,0.755420,2019-04-18 00:11:00,3.0,infutor_property_append
19,,property_entity,0.077236,0.740515,address[1].vacant,,0.755420,2019-04-18 00:11:00,2.0,infutor_property_append
20,,property_entity,0.077236,0.755420,category,,0.755420,2019-04-18 00:11:00,3.0,infutor_property_append
31,,property_entity,0.077236,0.922764,is_hit,,0.755420,2019-04-18 00:11:01,2.0,infutor_property_append
139,,property_entity,0.077236,0.715447,property[0].absentee_owner_indicator,,0.755420,2019-04-18 00:11:05,5.0,infutor_property_append
141,,property_entity,0.077236,0.755420,property[0].address_indicator,,0.755420,2019-04-18 00:11:05,2.0,infutor_property_append
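
Once the attribute-level stats are filtered, it can help to roll them up per product so that sources can be compared side by side. The sketch below does this with a pandas `groupby` on a small, made-up DataFrame that reuses the column names from the output above. The product names and numbers are illustrative, not real product statistics.

```python
import pandas as pd

# Made-up stats rows using the same column names as the product_stats output
stats_sample = pd.DataFrame({
    "product": ["product_a", "product_a", "product_b", "product_b", "product_b"],
    "flattened_name": ["city", "vacant", "score", "zone", "is_hit"],
    "hit_rate": [0.76, 0.76, 0.90, 0.90, 0.90],
    "field_is_populated_rate": [0.75, 0.40, 0.88, 0.60, 0.90],
    "num_distinct_values": [830.0, 2.0, 5.0, 3.0, 2.0],
})

# Keep fields that pass the same hit rate and populated rate thresholds
passing = stats_sample[
    (stats_sample["hit_rate"] > 0.75)
    & (stats_sample["field_is_populated_rate"] > 0.5)
]

# One row per product: how many fields pass, and their average populated rate
summary = (
    passing.groupby("product")
    .agg(
        fields_passing=("flattened_name", "count"),
        mean_populated_rate=("field_is_populated_rate", "mean"),
    )
    .sort_values("fields_passing", ascending=False)
)
print(summary)
```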


In [13]:
# See the data products that these fields encompass.

products = list(set(categorical_stats["product"].values))

In [14]:
# Save the field names themselves.

categorical_stats["full_field_name"] = categorical_stats["product"].map(str) + "." + categorical_stats["flattened_name"]
flattened_field_names = list(set(categorical_stats["full_field_name"].values))
flattened_field_names

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


['hazardhub_risks.radon.description',
 'housecanary_property_census.address_info.geo_precision',
 'hazardhub_risks.murder.score',
 'hazardhub_risks.wind.description',
 'housecanary_property_details.details.assessment.assessment_year',
 'housecanary_property_geocode.address_info.geo_precision',
 'hazardhub_risks.hail.score',
 'infutor_property_append.property[0].value_calculated_indicator',
 'hazardhub_risks.fema_all_flood_params.zone_subty',
 'housecanary_property_census.api_code',
 'hazardhub_risks.radon.score',
 'hazardhub_property.is_hit',
 'housecanary_property_details_enhanced.details_enhanced.public_record.assessment.assessment_year',
 'hazardhub_risks.fema_all_flood_params.study_typ',
 'hazardhub_risks.earthquake.score',
 'hazardhub_risks.motor_vehicle_theft.description',
 'housecanary_block_hazard_hurricane.api_code',
 'hazardhub_risks.toxic_release_facilities.score',
 'housecanary_property_value.api_code_description',
 'utilityscore_savings.data.dw_scorechange',
 'infutor_prop
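
The warning printed in cell 14 is pandas' `SettingWithCopyWarning`: `categorical_stats` was produced by filtering another DataFrame, so pandas cannot tell whether the new column is being written to a copy or to the original. One common way to avoid it is to take an explicit copy before adding the column, as sketched below on made-up rows.

```python
import pandas as pd

# Made-up rows standing in for the product_stats output
stats_sample = pd.DataFrame({
    "product": ["product_a", "product_a", "product_b"],
    "flattened_name": ["address[0].vacant", "category", "radon.score"],
    "num_distinct_values": [2.0, 3.0, 120.0],
})

# .copy() makes the filtered result an independent DataFrame,
# so adding a column to it is unambiguous
categorical_sample = stats_sample.loc[stats_sample["num_distinct_values"] < 10].copy()
categorical_sample["full_field_name"] = (
    categorical_sample["product"] + "." + categorical_sample["flattened_name"]
)

# sorted() gives a stable order, unlike list(set(...))
field_names = sorted(set(categorical_sample["full_field_name"]))
print(field_names)
```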

### Enrich

The Demyst Python library is another way to execute data appends through the Demyst platform.

In [15]:
# Running an enrichment costs credits. Before starting, let's see how many credits it will cost.

analytics.enrich_credits(products, inputs)

Verifying providers...


801.8

In [16]:
# Now, let's check our credit balance for our organization.

analytics.credits()

999455432
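
A simple guard can compare the estimated cost with the balance before starting a job. The sketch below uses the two numbers shown above as plain values rather than calling the library. `has_enough_credits` and its `reserve` parameter are hypothetical, introduced only for illustration.

```python
def has_enough_credits(estimated_cost, balance, reserve=0.0):
    """Return True when the balance covers the cost plus an optional reserve."""
    return balance - estimated_cost >= reserve

# Values copied from the outputs above
estimated_cost = 801.8
balance = 999455432

if has_enough_credits(estimated_cost, balance):
    print(f"OK to enrich: {balance - estimated_cost:,.1f} credits would remain")
else:
    print("Not enough credits for this enrichment")
```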

In [None]:
# Now, assuming we have sufficient credits, we can kick off the enrichment.

# Pass the list of products and the inputs into the enrich_and_download function to kick off.

results = analytics.enrich_and_download(products, inputs)

Verifying providers...
Starting enrichment...
Uploading data...


This enrichment will use 801.8 credits of the 999455432 credits your organization currently has.


Enrich Job ID: 4883


IntProgress(value=1, max=2)

Label(value='Checking status...')

We now have a block of data with all fields from the data products that were filtered down above. The package returns them as a pandas DataFrame.

In [None]:
results

### Post Enrich



In [None]:
# Only look at columns that met previous criteria

keep_columns = list(set(flattened_field_names) & set(results.columns))
reduced_results = results[keep_columns]
reduced_results
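
A `set` intersection does not preserve column order, so the columns in `reduced_results` may come out in a different order from one run to the next. If a stable order matters, a list comprehension over the result columns is one alternative. The column names below are made up for illustration.

```python
# Made-up names standing in for flattened_field_names and results.columns
wanted_fields = {"product_b.score", "product_a.category", "product_c.missing"}
result_columns = [
    "inputs.business_name",
    "product_a.category",
    "product_a.city",
    "product_b.score",
]

# Walk the result columns in order and keep those that were selected earlier
keep_ordered = [col for col in result_columns if col in wanted_fields]
print(keep_ordered)
```

Fields that were selected but never returned, such as `product_c.missing` here, simply drop out, which matches the behavior of the intersection.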

#### Report

The Demyst results are flattened, and each header indicates which data product the column was appended from. As raw data for modeling, this format works well. However, for analyzing how the data products and fields performed, the report that we imported at the start will provide more clarity.

Each output field is listed as a row, and the match rate, fill rate, and number of unique outcomes are listed as columns.

In [None]:
# Generate a report to get an overview of the results

# Remember that with a very small sample size, nunique may be smaller than expected.

report(inputs,reduced_results)
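
To make the report columns concrete, fill rate and unique counts can be approximated directly in pandas. This is a rough sketch on made-up data, not the implementation behind `report`, and the column names are hypothetical.

```python
import pandas as pd

# Made-up enrichment results with some missing values
results_sample = pd.DataFrame({
    "product_a.category": ["retail", None, "retail", "food"],
    "product_b.score": [0.7, 0.2, None, None],
})

summary = pd.DataFrame({
    # Share of rows where the field has a value
    "fill_rate": results_sample.notna().mean(),
    # Number of distinct non-missing values
    "nunique": results_sample.nunique(),
})
print(summary)
```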

### Modeling

It is up to the user how to find value in the appended data for their own use case. One logical next step is to test the predictive power of the data by building models. 

Demyst passes through all of the input data into the results so that users can join internal data and response variables to their results.

In [None]:
# Columns containing input data are prefixed with the string 'inputs.'

results["inputs.business_name"]

We will fake a response variable and internal score for demonstration.

In [None]:
# Faking internal score and binary response, associated with business names run through Demyst

fake_internal = pd.DataFrame()
fake_internal["business_name"] = results["inputs.business_name"]
fake_internal['score'] = np.random.rand(fake_internal.shape[0])
fake_internal["binary_response"] = np.random.randint(0, 2, fake_internal.shape[0])
fake_internal
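
Because the fake values are random, they change on every run. If reproducible demo data is wanted, NumPy's `default_rng` with a fixed seed is one option. The seed value and business names below are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # arbitrary seed, chosen for illustration

names = ["BAKER STREET BISTRO", "YO SUSHI, INC.", "ARROYO DESIGN"]
fake_sample = pd.DataFrame({"business_name": names})
fake_sample["score"] = rng.random(len(names))                    # floats in [0, 1)
fake_sample["binary_response"] = rng.integers(0, 2, len(names))  # 0 or 1
print(fake_sample)
```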

In [None]:
joined = pd.merge(fake_internal, results, left_on='business_name', right_on='inputs.business_name')
joined
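
Joining on a free-text key such as business name silently multiplies rows if a name appears more than once on either side. The `validate` argument of `pd.merge` raises an error in that case instead. The sketch below uses made-up frames to show the idea.

```python
import pandas as pd

internal = pd.DataFrame({
    "business_name": ["ARROYO DESIGN", "BAKER STREET BISTRO"],
    "score": [0.31, 0.84],
})
enriched = pd.DataFrame({
    "inputs.business_name": ["BAKER STREET BISTRO", "ARROYO DESIGN"],
    "product_a.category": ["food", "retail"],
})

# validate="one_to_one" raises pandas.errors.MergeError if either key has duplicates
joined_sample = pd.merge(
    internal,
    enriched,
    left_on="business_name",
    right_on="inputs.business_name",
    validate="one_to_one",
)
print(joined_sample)
```

When the internal data carries a unique identifier, passing it through as an input column and joining on it is usually more robust than joining on names.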

Now, we will filter again to the columns we identified, plus the joined-in columns.

In [None]:
join_keep_columns = ["business_name", "score", "binary_response"] + keep_columns
ready_for_modeling_data = joined[join_keep_columns]
ready_for_modeling_data

This block of data is now ready to feed into your data science pipeline. It can be saved as a CSV and uploaded to DataRobot, kept in a DataFrame and processed with Python scripts, or used in many other ways.
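
Saving to CSV is a single call in pandas. The sketch below writes a made-up frame to a temporary directory and reads it back. The file name is arbitrary, and in practice the path would point wherever the pipeline expects the file.

```python
import os
import tempfile
import pandas as pd

modeling_sample = pd.DataFrame({
    "business_name": ["ARROYO DESIGN", "BAKER STREET BISTRO"],
    "score": [0.31, 0.84],
    "binary_response": [0, 1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ready_for_modeling.csv")
    # index=False keeps the row index out of the file
    modeling_sample.to_csv(path, index=False)
    round_trip = pd.read_csv(path)

print(round_trip)
```

Leaving out `index=False` writes the row index as an extra unnamed column, which then shows up as `Unnamed: 0` when the file is read back.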