In this notebook, we will learn about the WhyLogs Python library and the profile summaries it produces.

# Getting Started with WhyLogs Profile Summaries

We will first read raw data from a file into Pandas and explore it briefly. To run WhyLogs, we will then import the WhyLogs library, initialize a logging session, and create a profile of that data, resulting in a WhyLogs profile summary. Finally, we'll explore some of the features of the profile summary content.

First, we will install the necessary libraries and import a few standard data science Python libraries.

In [1]:
!pip install pandas numpy altair

Looking in indexes: https://aws:****@dev-207285235248.d.codeartifact.us-west-2.amazonaws.com/pypi/python-dev/simple/


In [10]:
!conda list

# packages in environment at /Users/bernease/miniconda3/envs/wldev:
#
# Name                    Version                   Build  Channel
altair                    3.2.0                    py38_0  
appnope                   0.1.0                 py38_1001  
argh                      0.26.2                   pypi_0    pypi
attrs                     19.3.0                     py_0  
awscli                    1.18.99                  pypi_0    pypi
backcall                  0.1.0                    pypi_0    pypi
blas                      1.0                         mkl  
bleach                    3.1.5                      py_0  
boto3                     1.14.19                  pypi_0    pypi
botocore                  1.17.22                  pypi_0    pypi
bump2version              1.0.0                    pypi_0    pypi
ca-certificates           2020.6.24                     0  
certifi                   2020.6.20                py38_0  
chardet                   3.0.4                

In [2]:
import os.path
import pandas as pd
import numpy as np

WhyLogs allows you to characterize and store key characteristics of a growing dataset efficiently. In machine learning, datasets often consist of both input features and outputs of the model. In deployed systems, you often have a relatively static training dataset as well as a growing dataset from model input and output at inference time.

## Downloading and exploring the raw Lending Club data

In our case, we will download and explore a sample from the Lending Club dataset before logging a WhyLogs profile summary. Lending Club is a peer-to-peer lending and alternative investing website on which members may apply for personal loans and invest in personal loans to other Lending Club members. The company published a dataset with information starting in 2013(?). This particular dataset contains only the accepted loans.

Before downloading, we can first orient ourselves to ensure that we point to the correct file.

In [3]:
print("Current working directory:", os.getcwd())
print("Directory contents:\n", os.listdir())

Current working directory: /Users/bernease/repos/cli-demo-1/example-notebooks
Directory contents:
 ['GettingStarted.ipynb', '.ipynb_checkpoints']


If you see a file named `lending_club_1000.csv`, we are done. If not, navigate to the `whylogs-python/data` folder and try the above cell again.

In Jupyter, you may prefix a line with `!` to run a shell command such as `ls`. Note that `!cd` runs in a temporary subshell and does not change the notebook's working directory; use the `%cd` magic for that.
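
The check described above can also be done in Python. The following is a minimal sketch, not part of WhyLogs: the helper name `find_data_file` and the list of candidate directories are illustrative assumptions that you should adapt to your own checkout.

```python
import os


def find_data_file(filename, candidate_dirs):
    """Return the first existing path to `filename` among `candidate_dirs`, or None."""
    for directory in candidate_dirs:
        path = os.path.join(directory, filename)
        if os.path.isfile(path):
            return path
    return None


# Hypothetical candidate locations, relative to the current working directory.
candidates = [".", "../example-input", "../data"]
found = find_data_file("lending_club_1000.csv", candidates)
print("Found:", found)
```

If `found` is `None`, none of the candidate directories contains the file and the path needs adjusting before the data is read.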

In [4]:
data_file = "../example-input/lending_club_1000.csv"

Let's read that data file into a Pandas dataframe. Each row refers to a particular loan instance while each column refers to a variable in our dataset.

In [5]:
data = pd.read_csv(data_file)
data

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,90671227,,4800.0,4800.0,4800.0,36 months,13.49,162.87,C,C2,...,,,Cash,N,,,,,,
1,90060135,,21600.0,21600.0,21600.0,60 months,9.49,453.54,B,B2,...,,,Cash,N,,,,,,
2,90501423,,24200.0,24200.0,24200.0,36 months,9.49,775.09,B,B2,...,,,Cash,N,,,,,,
3,90186302,,3600.0,3600.0,3600.0,36 months,11.49,118.70,B,B5,...,,,Cash,N,,,,,,
4,90805192,,8000.0,8000.0,8000.0,36 months,10.49,259.99,B,B3,...,,,Cash,N,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,88985880,,40000.0,40000.0,40000.0,60 months,10.49,859.56,B,B3,...,,,Cash,N,,,,,,
996,88224441,,24000.0,24000.0,24000.0,60 months,14.49,564.56,C,C4,...,,,Cash,Y,Mar-2019,ACTIVE,Mar-2019,10000.0,44.82,1.0
997,88215728,,14000.0,14000.0,14000.0,60 months,14.49,329.33,C,C4,...,,,Cash,N,,,,,,
998,Total amount funded in policy code 1: 1465324575,,,,,,,,,,...,,,,,,,,,,


A Pandas dataframe is built on top of NumPy arrays, so we can use many helpful functions and gather useful information like the shape of the data. The shape gives the number of rows followed by the number of columns.

In [6]:
data.shape

(1000, 151)

One important variable is the `issue_d` column which indicates the month and year during which that loan was originated. In this dataset, it's represented by a string in the `MMM-YYYY` format.

We might imagine that if this public dataset was still collecting data from Lending Club's website, new rows in the dataset would be added with `issue_d` matching the date that the loan was accepted and information gathered. For this sample dataset, let's look at the values for that variable.

In [7]:
data['issue_d']

0      Oct-2016
1      Oct-2016
2      Oct-2016
3      Oct-2016
4      Oct-2016
         ...   
995    Oct-2016
996    Oct-2016
997    Oct-2016
998         NaN
999         NaN
Name: issue_d, Length: 1000, dtype: object
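
Because `issue_d` is stored as an `MMM-YYYY` string, it has to be parsed before it can be used as a timestamp. The sketch below uses only the standard library; the helper name `parse_issue_date` and the sample values are assumptions for illustration, and `%b` expects English month abbreviations under the default locale.

```python
import datetime
import math


def parse_issue_date(value):
    """Parse an `MMM-YYYY` string such as 'Oct-2016'; return None for missing values."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    # %b is the abbreviated month name; the day defaults to the 1st of the month.
    return datetime.datetime.strptime(value, "%b-%Y")


# Made-up sample mirroring the shape of the column, including a missing value.
sample = ["Oct-2016", "Oct-2016", float("nan")]
parsed = [parse_issue_date(v) for v in sample]
print(parsed)
```

Missing entries, like the last rows of the column, are mapped to `None` rather than raising an error.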

You may notice a number of other interesting variables in the columns of the dataframe above.

Let's first look at `funded_amnt` which contains the amount of money that was committed to that particular loan. It is a numeric value that is represented by a floating point number.

In [8]:
print("Min:", data['funded_amnt'].min())  # Series.min() skips NaN by default
print("Max:", data['funded_amnt'].max())

Min: 1000.0
Max: 40000.0
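
One caveat: this column contains missing values, and every comparison with NaN is false, so the result of Python's built-in `min` and `max` can depend on where the NaN sits in the sequence. The sketch below uses a made-up list (not the real column) to show the effect and one way around it. In Pandas, `Series.min()` and `Series.max()` skip NaN by default.

```python
import math

# Made-up sample with a leading NaN; not taken from the Lending Club file.
values = [float("nan"), 4800.0, 1000.0, 40000.0]

# The built-in keeps the leading NaN because `x < nan` is always False.
naive_min = min(values)

# Dropping NaN first gives the expected answer.
clean = [v for v in values if not math.isnan(v)]
print("Naive min:", naive_min)
print("Min:", min(clean))
print("Max:", max(clean))
```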


## Running WhyLogs for logging a single dataset

Let's now import a function from the WhyLogs library that allows us to create a logging session.

This session can be connected with multiple writers that output the results of our profiling locally in JSON, a flat CSV, or binary protobuf format as well as writers to an AWS S3 bucket in the cloud. Further writing functionality will be added as well.

Let's create a default session below.

In [9]:
from whylabs.logs import get_or_create_session

ImportError: cannot import name 'get_or_create_session' from 'whylabs.logs' (unknown location)

In [None]:
session = get_or_create_session()
logger = session.logger()

In [None]:
session.log_dataframe(data.head(100), 'test.data')

Now that we've logged our dataset, we can see the output of the WhyLogs profiling process in a newly created directory. Inside our original directory there is now an `output` directory. It contains a directory with the optional name we supplied, `test.data`, which in turn contains a directory named with the Unix timestamp of the run.
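
To inspect that nested layout without clicking through folders, a small directory walk is enough. This is a sketch using only the standard library; the helper name `list_profile_files` is an assumption, and the `output` directory name is taken from the description above, so it may differ in your setup.

```python
import os


def list_profile_files(root):
    """Return sorted paths (relative to `root`) of every file under `root`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append(os.path.relpath(full, root))
    return sorted(found)


# "output" is the directory name mentioned in the text; skip quietly if absent.
if os.path.isdir("output"):
    for path in list_profile_files("output"):
        print(path)
```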

In [None]:
print("Current working directory:", os.getcwd())
print("Directory contents:\n", os.listdir())

In [None]:
!ls ..

Inside of that directory, we see a number of files:
* `whylogs.json`
* `summary_summary.csv`
* `summary_histogram.json`
* `summary_strings.json`
* `protobuf.bin`

We could read these files into Pandas using the `pd.read_csv` and `pd.read_json` functions to explore these profile summaries.

WhyLogs also provides a static `dataframe_profile` function that returns a DatasetProfile object when passed in a Pandas dataframe with our raw data. We will take this opportunity to use this method.

This particular function does not require an active session to be running. Because the remainder of the notebook uses this functionality instead of the typical writing logs to disk or S3, we can close the session now. Typically, this task would be saved until the end.

In [None]:
session.close()

In [None]:
from whylabs.logs.core.datasetprofile import dataframe_profile

profile = dataframe_profile(data, 'testname')
profile

This DatasetProfile object, stored in the `profile` variable, can now be explored in greater detail.

This object contains helpful information about the profile, such as the session ID, the dates associated with both the data and session, and user-specified metadata and tags.

For this simple example, we can see a data timestamp attribute (it defaults to the time at which `dataframe_profile` was run) that associates this data with an appropriate timestamp for temporal analysis. This will soon become helpful.

In [None]:
print(profile.data_timestamp)

First, let's transform the dataset profile into its flat summary form. Unlike the binary `protobuf.bin` file and the hierarchical `whylogs.json` file that were written by the logger, the summary format is much easier to analyze with standard data science tools. Its structure is flat: either a table or a single-depth dictionary keyed by variable.

These less hierarchical formats were also created with the `log_dataframe` functionality and can be found in the `summary_summary.csv`, `summary_histogram.json` and `summary_strings.json` files.

In [None]:
summaries = profile.flat_summary()

Let's first look at the overall summary for the profiled dataset.

In [None]:
summary = summaries['summary']
summary

We can see that this summary object is much smaller at **151 rows x 32 columns** than the original dataset at **1000 rows x 151 columns**. Smaller storage sizes are important in reducing costs and making it easier for your data scientists to complete monitoring and post-analysis on large amounts of data.
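
As a rough back-of-the-envelope comparison, we can count cells in each table using the shapes quoted above. Cell count is only a crude proxy for storage size, so treat the ratio as indicative rather than exact.

```python
# Shapes quoted in the text above.
raw_rows, raw_cols = 1000, 151
summary_rows, summary_cols = 151, 32

raw_cells = raw_rows * raw_cols
summary_cells = summary_rows * summary_cols

print("Raw cells:", raw_cells)
print("Summary cells:", summary_cells)
print("Ratio:", raw_cells / summary_cells)
```

Note that the summary has one row per column of the raw data, so in this layout its row count depends on the number of variables rather than on the number of loans.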

Each row of our flat profile summary corresponds to one variable in the dataset; the variable's name is stored in the `column` column.

We can also see a number of useful metrics as columns in our summary: descriptive statistics, type information, unique estimates and bounds, as well as specially formulated metrics like inferred_dtype and dtype_fraction.

In [None]:
summary.columns

Let's explore the output of WhyLogs for a few of the variables we mentioned earlier. For example, let's look at the `funded_amnt` variable.

In [None]:
summary[summary['column']=='funded_amnt'].T

You may notice that the count recorded for this variable is **1000**, with a minimum loan amount of **\$1,000.00 USD** and a maximum loan amount of **\$40,000.00 USD**.

For numerical variables like `funded_amnt`, we can view further information in the histograms dictionary from the profile summaries object. The variable's histogram object contains bin edges along with counts.

In [None]:
histograms = summaries['hist']

In [None]:
histograms['funded_amnt']

In [None]:
print("Bin edges length:", len(histograms['funded_amnt']['bin_edges']))
print("Counts length:", len(histograms['funded_amnt']['counts']))
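
A histogram with n bins is described by n + 1 edges, so we would expect the edges list to be one element longer than the counts list. The sketch below checks that relationship and uses bin midpoints to approximate the mean. It assumes a dictionary with `bin_edges` and `counts` keys, and the example values are made up rather than produced by WhyLogs.

```python
def bin_midpoints(bin_edges):
    """Midpoint of each bin; n + 1 edges describe n bins."""
    return [(lo + hi) / 2 for lo, hi in zip(bin_edges[:-1], bin_edges[1:])]


def approximate_mean(hist):
    """Estimate the mean by weighting each bin midpoint by its count."""
    edges, counts = hist["bin_edges"], hist["counts"]
    if len(edges) != len(counts) + 1:
        raise ValueError("expected one more bin edge than counts")
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    mids = bin_midpoints(edges)
    return sum(m * c for m, c in zip(mids, counts)) / total


# Made-up histogram for illustration only.
example_hist = {"bin_edges": [0.0, 10.0, 20.0, 30.0], "counts": [1, 2, 1]}
print(approximate_mean(example_hist))
```

The estimate is only as fine-grained as the bins, so it will generally differ slightly from the mean reported in the summary table.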

Let's plot this histogram and note any patterns.

In [None]:
# Histogram plot for funded_amnt
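
The plotting cell above is left as a placeholder in this notebook. As one possible stand-in, the sketch below renders a histogram dictionary as text bars using only the standard library. The function name `render_histogram`, the assumed `bin_edges` and `counts` keys, and the example values are illustrative assumptions; the same (start, end, count) rows could instead be passed to a charting library such as Altair.

```python
def render_histogram(hist, width=40):
    """Return one text line per bin, with a bar scaled to the largest count."""
    edges, counts = hist["bin_edges"], hist["counts"]
    peak = max(counts) if counts else 0
    lines = []
    for lo, hi, count in zip(edges[:-1], edges[1:], counts):
        bar = "#" * (round(width * count / peak) if peak else 0)
        lines.append(f"[{lo:>9.1f}, {hi:>9.1f}) {count:>5} {bar}")
    return lines


# Made-up histogram spanning the funded_amnt range seen earlier.
example_hist = {
    "bin_edges": [1000.0, 14000.0, 27000.0, 40000.0],
    "counts": [5, 10, 2],
}
for line in render_histogram(example_hist, width=20):
    print(line)
```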

For another variable, `loan_status`, we will see interesting information in different metrics. This is because loan status is a categorical field whose values are strings.

In [None]:
summary[summary['column']=='loan_status']

Let's look at a few relevant metrics for string variables.

In [None]:
summary.loc[116, ['type_string_count', 'type_null_count', 'nunique_str', 'nunique_str_lower', 'nunique_str_upper']]

Notice that there are **2** elements of null type, with the remaining **998** elements of string type. Also, the unique string fields show **6** unique strings. The lower and upper bounds for the estimate are also **6**, meaning that this count is exact. You will see many instances of this -- DataSketches in WhyLogs returns exact counts for columns with as many as 400 unique values.
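
The reasoning above (an estimate is exact when its lower and upper bounds coincide) can be written as a small check. This is a sketch: the helper name `describe_unique_estimate` is an assumption, and the second call uses made-up numbers to show the non-exact case.

```python
def describe_unique_estimate(estimate, lower, upper):
    """Describe a unique-count estimate given its lower and upper bounds."""
    if not lower <= estimate <= upper:
        raise ValueError("estimate must lie between its bounds")
    if lower == upper:
        return f"exactly {estimate:g} unique values"
    return f"about {estimate:g} unique values (between {lower:g} and {upper:g})"


# Values reported for loan_status in the text above.
print(describe_unique_estimate(6.0, 6.0, 6.0))
# Made-up numbers illustrating an approximate estimate.
print(describe_unique_estimate(1012.4, 980.0, 1045.0))
```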

Let's now explore the frequent strings object from our profile summaries.

In [None]:
frequent_strings = summaries['frequent_strings']

In [None]:
frequent_strings['loan_status']
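
For a small column, an exact tally is a handy cross-check against whatever the frequent strings object reports. The sketch below uses `collections.Counter` on a made-up list of statuses, not on rows from the Lending Club file.

```python
from collections import Counter

# Made-up sample; None stands in for a missing value.
statuses = [
    "Current", "Fully Paid", "Current", None,
    "Charged Off", "Current", "Fully Paid",
]

non_null = [s for s in statuses if s is not None]
exact_counts = Counter(non_null)

print("Null count:", len(statuses) - len(non_null))
print("Unique strings:", len(exact_counts))
print("Most common:", exact_counts.most_common(2))
```

An exact counter stores every distinct value, which is why sketch-based summaries are preferred once the number of distinct strings grows large.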

## Visualizing multiple datasets across time with WhyLogs

Now that we've seen one dataset for October 2016, let's calculate profile summaries for a series of months to be analyzed in sequential order.

We'll be creating a list of profile summaries manually using the `issue_d` variable, but WhyLogs will soon be able to do this subsetting automatically when passing along the desired timestamp variable if available.

In most use cases, you would be passing in live data to a WhyLogs session. Instead of gathering the date timestamp from the dataset, you will associate each profile with the default which is the current date and time. Using past dates can be helpful to backfill with past runs of your machine learning model, however.

In [None]:
import datetime

# Create a list of data profiles
full_data = data

remaining_dates = ['Nov-2016', 'Dec-2016', 'Jan-2017', 'Feb-2017', 'Mar-2017', 'Apr-2017', 
                   'May-2017', 'Jun-2017', 'Jul-2017', 'Aug-2017', 'Sep-2017']

profiles = [profile]  # list with original profile
for date in remaining_dates:
    timestamp = datetime.datetime.strptime(date, '%b-%Y')
    subset_data = full_data[full_data['issue_d']==date]
    subset_profile = dataframe_profile(subset_data, timestamp=timestamp)
    profiles.append(subset_profile)

profiles
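
The hand-written `remaining_dates` list could also be generated. The sketch below uses only the standard library; the helper name `month_labels` is an assumption, and `%b` produces English month abbreviations under the default locale.

```python
import datetime


def month_labels(start, count):
    """Return `count` consecutive `MMM-YYYY` labels beginning at `start`."""
    labels = []
    year, month = start.year, start.month
    for _ in range(count):
        labels.append(datetime.date(year, month, 1).strftime("%b-%Y"))
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return labels


generated = month_labels(datetime.date(2016, 11, 1), 11)
print(generated)
```

Generating the labels avoids typos in the month strings, which would otherwise silently produce empty subsets when compared against `issue_d`.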

Let's now see how even more valuable WhyLogs profiles become when we collect them in sequence.

First, let's look at the `funded_amnt` column over time.

In [None]:
# Visualize the funded_amnt column null view

In [None]:
# Visualize