## DemystData Python Toolkit

DemystData connects users to external data sources that can enrich consumer, commercial, and property records, among others. With the Demyst Python library, users can access that data and tailor the workflow to their own needs and strengths. This notebook lays out the main functions of the library.

In [None]:
# Import some popular python packages for handling data
import csv
import pandas as pd
import numpy as np
import random

# Import and instantiate an Analytics object from demyst-analytics
from demyst.analytics import Analytics
analytics = Analytics()

# Import 'report' for post processing
from demyst.analytics.report import report

## Inputs

Every request to Demyst's data sources requires inputs. Each source matches the inputs that users supply (consumer, business, or property records) against its own store and appends the data it finds.

If you do not have an input file handy, Demyst hosts sample files that are well suited to testing and exploring.

### Hosted Inputs

Hosted inputs are ready-made input files that already have the columns and formatting Demyst's data sources expect.

In [None]:
# List all hosted inputs

analytics.input_files()

In [None]:
# Download the 'us_business_entity' input file, a set of US business records.

analytics.input_file('us_business_entity')

#### Optional Arguments

The `input_file` command returns 50 records by default and does not apply any filters to the data set. It also accepts two optional arguments.

Users can provide a parameter for number of rows (second param, type: int).

example: `10`

Users can also provide a parameter to filter on values in one or more columns (third param, type: dict).

example: `{"state" : "ca", "naics_code" : "722110"}`

In [None]:
analytics.input_file('us_business_entity', 10, {"state" : "ca", "naics_code" : "722110"})
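
To make the two optional arguments concrete, here is a minimal sketch of the same selection on a local pandas DataFrame. It is not the library's implementation: the helper `filter_rows`, the toy records, and the assumption that filters are exact matches combined with AND are all illustrative.

```python
import pandas as pd

# Toy stand-in for a hosted input file; the values are invented for illustration.
records = pd.DataFrame({
    "business_name": ["Acme Diner", "Bay Bistro", "Coast Cafe", "Delta Deli"],
    "state": ["ca", "ny", "ca", "ca"],
    "naics_code": ["722110", "722110", "722110", "445110"],
})


def filter_rows(df, n_rows, filters):
    """Hypothetical helper: keep rows matching every filter, then take the first n_rows."""
    mask = pd.Series(True, index=df.index)
    for column, value in filters.items():
        # Assumption: each filter is an exact match, and all filters must hold.
        mask &= df[column] == value
    return df[mask].head(n_rows)


subset = filter_rows(records, 10, {"state": "ca", "naics_code": "722110"})
print(subset)
```

Only two of the four toy rows satisfy both filters, so the row limit of 10 is an upper bound rather than a guaranteed count.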

### Validate

Users can also start from their own files. The `validate` function checks that those files are formatted correctly.

In [None]:
inputs = analytics.input_file('us_business_entity', 10, {"state" : "ca", "naics_code" : "722110"})
analytics.validate(inputs)

In [None]:
# Changing post_code to a string, as recommended

inputs['post_code'] = inputs['post_code'].astype(str)

analytics.validate(inputs)
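
One reason postal codes are better kept as strings is that numeric parsing drops leading zeros. The sketch below uses plain pandas on invented records and assumes five-digit US ZIP codes; it shows how to preserve the zero when reading a file, and how to restore it when it is already lost.

```python
import io

import pandas as pd

# Invented sample file; the first post code starts with a zero.
csv_text = "business_name,post_code\nHarbor Books,02134\nGolden Gate Grill,94105\n"

# Default parsing infers an integer column, so "02134" becomes 2134.
as_numbers = pd.read_csv(io.StringIO(csv_text))

# Reading the column as text keeps the leading zero.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"post_code": str})

# If the zero is already lost, astype(str) alone gives "2134";
# zero-padding to five digits restores the code (assumes 5-digit US ZIPs).
repaired = as_numbers["post_code"].astype(str).str.zfill(5)

print(as_numbers["post_code"].tolist())
print(as_text["post_code"].tolist())
print(repaired.tolist())
```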

### Finding Data

Demyst connects to hundreds of data sources, so it can be challenging to decide on the right sources to run. In the Demyst Python toolkit, the `search` function helps to find relevant sources, and the `product_stats` function helps to compare them.

#### search

In [None]:
# Use the inputs param to see the sources that will work with your input data set.

analytics.search(inputs=inputs)

In [None]:
# Optionally, add "tags" to narrow your search

analytics.search(inputs=inputs, tags=["Property"])

In [None]:
# To retrieve the raw data instead of the rendered view, add a 'notebook=False' argument.

data_products = analytics.search(inputs=inputs, tags=["Property"], notebook=False)
data_product_names = [data_product["name"] for data_product in data_products]
data_product_names

#### product_stats

To help users understand the strengths and limitations of products in the catalog, Demyst has started a study of how those products perform, down to the attribute level. Users can draw on that data to decide which data products they are interested in.

In [None]:
# Pass a list of product names into the product_stats function to get data for each attribute. 

stats = analytics.product_stats(data_product_names)
stats

In [None]:
# Filter for providers that have > 75% hit rate and fields that have > 50% populated rate.

high_hit_rate_stats = stats.loc[(stats['hit_rate'] > 0.75) & (stats['field_is_populated_rate'] > 0.5)]
high_hit_rate_stats

In [None]:
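# On top of that, filter for categorical variables that have between 2 and 9 distinct values observed.
# .copy() makes the result an independent DataFrame, so adding a column to it later does not raise SettingWithCopyWarning.

categorical_stats = high_hit_rate_stats.loc[(high_hit_rate_stats['num_distinct_values'] > 1) & (high_hit_rate_stats['num_distinct_values'] < 10)].copy()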
categorical_stats

In [None]:
# See the data products that these fields encompass.

products = list(set(categorical_stats["product"].values))

In [None]:
# Save the field names themselves.

categorical_stats["full_field_name"] = categorical_stats["product"].map(str) + "." + categorical_stats["flattened_name"]
flattened_field_names = list(set(categorical_stats["full_field_name"].values))
flattened_field_names
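
The filtering and naming steps above can be tried without calling the API. The sketch below rebuilds them on a small invented table that reuses the column names from the cells above; the product names, field names, and rates are hypothetical.

```python
import pandas as pd

# Hypothetical stats table; every value here is invented for illustration.
toy_stats = pd.DataFrame({
    "product": ["provider_a", "provider_a", "provider_b", "provider_c"],
    "flattened_name": ["status", "id", "segment", "status"],
    "hit_rate": [0.90, 0.90, 0.80, 0.40],
    "field_is_populated_rate": [0.95, 0.99, 0.60, 0.90],
    "num_distinct_values": [4, 5000, 7, 3],
})

# Same three thresholds as above, written as a single query.
selected = toy_stats.query(
    "hit_rate > 0.75 and field_is_populated_rate > 0.5 "
    "and num_distinct_values > 1 and num_distinct_values < 10"
)

# assign() returns a new DataFrame, so no copy warning is involved.
selected = selected.assign(
    full_field_name=selected["product"] + "." + selected["flattened_name"]
)
toy_field_names = selected["full_field_name"].tolist()
print(toy_field_names)
```

The `id` field is dropped for having too many distinct values, and `provider_c` is dropped for its low hit rate.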

### Enrich

The Demyst Python library is another way to run data appends (enrichments) through the Demyst platform.

In [None]:
# Running an enrichment costs credits. Let's check how many credits our organization has.

analytics.credits()

In [None]:
# Pass the list of products and the inputs to kick off an enrichment job.

results = analytics.enrich_and_download(products, inputs)

We now have a block of data containing all fields from the data products selected above. The package returns it as a pandas DataFrame.

In [None]:
results

### Post Enrich



In [None]:
# Only look at columns that met previous criteria

keep_columns = list(set(flattened_field_names) & set(results.columns))
reduced_results = results[keep_columns]
reduced_results
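
Intersecting two sets does not preserve column order, and it silently skips requested fields that were not returned. A minimal alternative, sketched on an invented results frame with hypothetical column names, keeps the DataFrame's own order and reports what is missing:

```python
import pandas as pd

# Invented stand-in for an enrichment result.
results_demo = pd.DataFrame({
    "inputs.business_name": ["Acme Diner"],
    "provider_a.status": ["active"],
    "provider_b.segment": ["retail"],
})
wanted = ["provider_b.segment", "provider_a.status", "provider_c.status"]

# Keep the DataFrame's column order and skip names that were not returned.
wanted_set = set(wanted)
ordered_columns = [c for c in results_demo.columns if c in wanted_set]

# List the requested fields that are absent from the results.
missing_columns = sorted(wanted_set - set(results_demo.columns))

print(ordered_columns)
print(missing_columns)
```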

#### Report

The Demyst results are flattened, and each column header indicates which data product the column was appended from. This format works well as raw data for modeling. For analyzing how the data products and fields performed, however, the `report` function imported at the start gives a clearer view.

Each output field is listed as a row, and the match rate, fill rate, and number of unique outcomes are listed as columns.

In [None]:
# Generate a report to get an overview of the results

# Remember that with a very small sample size, nunique may be smaller than expected.

report(inputs, reduced_results)
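
For intuition about two of those columns, fill rate and number of unique values can be approximated directly in pandas. This sketch is not how `report` is implemented: it runs on invented data and assumes fill rate means the share of non-null values. Match rate is left out because its exact definition is not given here.

```python
import pandas as pd

# Invented enrichment output with some missing values.
demo = pd.DataFrame({
    "provider_a.status": ["active", "active", None, "closed"],
    "provider_b.segment": ["retail", None, None, None],
})

summary = pd.DataFrame({
    # Share of rows where the field is non-null.
    "fill_rate": demo.notna().mean(),
    # Distinct non-null values observed in each field.
    "nunique": demo.nunique(),
})
print(summary)
```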

### Modeling

It is up to the user how to find value in the appended data for their own use case. One logical next step is to test the predictive power of the data by building models. 

Demyst passes through all of the input data into the results so that users can join internal data and response variables to their results.

In [None]:
# Columns containing input data are prefixed with the string 'inputs.'

results["inputs.business_name"]

We will fake a response variable and internal score for demonstration.

In [None]:
# Faking an internal score and binary response, associated with the business names run through Demyst

fake_internal = pd.DataFrame()
fake_internal["business_name"] = results["inputs.business_name"]
fake_internal['score'] = np.random.rand(fake_internal.shape[0])
fake_internal["binary_response"] = np.random.randint(0, 2, fake_internal.shape[0])
fake_internal
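
Because the values above are unseeded, they change on every run. If repeatable demo data is preferred, one option is NumPy's seeded `Generator`. The helper name `make_fake_internal` and the sample names below are hypothetical.

```python
import numpy as np
import pandas as pd


def make_fake_internal(names, seed):
    """Hypothetical helper: build repeatable fake scores and binary responses."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "business_name": names,
        # Uniform floats in [0, 1).
        "score": rng.random(len(names)),
        # Integers drawn from {0, 1}; the upper bound is exclusive.
        "binary_response": rng.integers(0, 2, size=len(names)),
    })


first = make_fake_internal(["Acme Diner", "Bay Bistro", "Coast Cafe"], seed=42)
second = make_fake_internal(["Acme Diner", "Bay Bistro", "Coast Cafe"], seed=42)
print(first.equals(second))
```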

In [None]:
joined = pd.merge(fake_internal, results, left_on='business_name', right_on='inputs.business_name')
joined
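
A join on business name multiplies rows whenever a name appears more than once on either side. The sketch below, on invented frames, shows the effect and one way to catch it with the `validate` argument of `pd.merge`.

```python
import pandas as pd

left = pd.DataFrame({
    "business_name": ["Acme Diner", "Bay Bistro"],
    "score": [0.2, 0.7],
})
right = pd.DataFrame({
    "inputs.business_name": ["Acme Diner", "Acme Diner", "Bay Bistro"],
    "provider_a.status": ["active", "closed", "active"],
})

# Without a check, the duplicated name turns two input rows into three.
unchecked = pd.merge(
    left, right, left_on="business_name", right_on="inputs.business_name"
)

# validate="one_to_one" raises MergeError when either key has duplicates.
try:
    pd.merge(
        left,
        right,
        left_on="business_name",
        right_on="inputs.business_name",
        validate="one_to_one",
    )
    duplicate_keys_found = False
except pd.errors.MergeError:
    duplicate_keys_found = True

print(len(unchecked), duplicate_keys_found)
```

When duplicates are expected, joining on a unique record identifier rather than a name avoids the problem.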

Now, we will filter again to the columns we identified earlier, plus the joined-in columns.

In [None]:
join_keep_columns = ["business_name", "score", "binary_response"] + keep_columns
ready_for_modeling_data = joined[join_keep_columns]
ready_for_modeling_data

This block of data is now ready for ingestion into your data science pipeline. It can be saved as a CSV and uploaded to DataRobot, kept in a DataFrame and processed with Python scripts, or handled in whatever other way suits your workflow.
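
As one possible hand-off, categorical fields can be one-hot encoded and the frame written out as CSV. This is a sketch on invented data with hypothetical column names. It writes to an in-memory buffer so that it runs anywhere; in practice a file path would be passed to `to_csv` instead.

```python
import io

import pandas as pd

# Invented stand-in for the modeling-ready frame.
modeling_demo = pd.DataFrame({
    "business_name": ["Acme Diner", "Bay Bistro", "Coast Cafe"],
    "score": [0.2, 0.7, 0.5],
    "binary_response": [0, 1, 0],
    "provider_a.status": ["active", "closed", "active"],
})

# One-hot encode the categorical field; the new columns are appended at the end.
encoded = pd.get_dummies(modeling_demo, columns=["provider_a.status"], dtype=int)

# Write the CSV to a buffer without the index column.
buffer = io.StringIO()
encoded.to_csv(buffer, index=False)
csv_output = buffer.getvalue()
print(csv_output)
```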