# Structured Dataset Profiling with Lens

### Find the code
This notebook can be found on [GitHub](https://github.com/credo-ai/credoai_lens/blob/develop/docs/notebooks/lens_demos/dataset_profiling.ipynb).

## Contents

1. [What is Covered](#What-is-Covered)
2. [Introduction](#Introduction)
3. [Dataset](#Dataset)
4. [Running Lens](#Running-Lens)

## What is Covered <a name="What-is-Covered"></a>
* **Domain:**
  * Applications that rely on structured datasets.


* **ML task:**
  * Exploratory data analysis for model training, validation, and testing with structured datasets.

## Introduction <a name="Introduction"></a>
Structured data conforms to a tabular format with relationships between the different rows and columns. Many machine learning models are trained, validated, and tested on structured datasets.

Exploratory analysis of a structured dataset provides insights for a more informed assessment of the ML model. The Lens Dataset Profiling module uses pandas_profiling to enable this analysis by generating data profiles.
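
To give a rough sense of what a data profile contains, the sketch below computes a few per-column statistics with plain pandas. It is only an illustration: `quick_profile` is a hypothetical helper, the toy data is made up, and the profiles generated through Lens contain much more than this.

```python
import pandas as pd


def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one summary row per column of df."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),       # storage type of each column
        "n_missing": df.isna().sum(),         # count of missing entries
        "frac_missing": df.isna().mean(),     # share of missing entries
        "n_unique": df.nunique(dropna=True),  # distinct non-missing values
    })


# Toy data, made up for illustration (not the census dataset)
toy = pd.DataFrame({
    "age": [39, 50, 38, None],
    "education": ["Bachelors", "Bachelors", "HS-grad", None],
})
summary = quick_profile(toy)
print(summary)
```

Statistics like these help spot columns with many missing values or unexpectedly few distinct values before a model is assessed.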

## Dataset <a name="Dataset"></a>
The [Census Adult Dataset](https://archive.ics.uci.edu/ml/datasets/adult) comes from the Census Bureau. The label indicates whether a given adult makes more than $50K a year, based on attributes such as sex and education.

The dataset provides 13 input variables that are a mixture of categorical, ordinal, and numerical data types. The complete list of variables is as follows:

Age, Workclass, Education, Education Number of Years, Marital-status, Occupation, Relationship, Race, Sex, Capital-gain, Capital-loss, Hours-per-week, and Native-country.
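
As a minimal sketch of how such a mixture of types can be inspected, the snippet below separates numeric columns from the rest with pandas. The toy frame and its column names are assumptions chosen for illustration, not the actual census columns.

```python
import pandas as pd

# Toy frame with assumed column names (illustration only)
toy = pd.DataFrame({
    "age": [39, 50],
    "hours.per.week": [40, 13],
    "workclass": ["State-gov", "Self-emp-not-inc"],
    "sex": ["Male", "Female"],
})

# Columns stored with a numeric dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical in this sketch
non_numeric_cols = [c for c in toy.columns if c not in numeric_cols]

print("numeric:", numeric_cols)
print("non-numeric:", non_numeric_cols)
```

Note that an ordinal variable stored as integers would land in the numeric group, so a dtype check is a starting point rather than a full classification.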

In [None]:
import numpy as np
# Imports for demo data
from credoai.data import fetch_censusincome

# Base Lens imports
import credoai.lens as cl
import credoai.assessment as assess

cl.set_logging_level('info')
# set default format for image displays. Change to 'png' if 'svg' is failing
%config InlineBackend.figure_formats = ['svg']

In [None]:
data = fetch_censusincome()
df = data['data'].copy()
df['target'] = data['target']

In [None]:
df.head(3)

In [None]:
# Prepare missing values
df = df.replace("\\?", np.nan, regex=True)
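
The effect of the replacement above can be seen on a small example. This is a sketch on made-up data: with `regex=True` and a non-string replacement value, pandas replaces any string cell that contains a `?`, including values with surrounding whitespace.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "workclass": ["Private", "?", " ?"],
    "occupation": ["Sales", "Tech-support", "?"],
})

# Same call as in the notebook: cells matching the pattern become NaN
cleaned = toy.replace("\\?", np.nan, regex=True)

# Count the missing values per column after cleaning
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

Because the pattern matches anywhere in the string, a legitimate value that happens to contain a question mark would also be turned into a missing value.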

In [None]:
import credoai.modules as mod
dp = mod.DatasetProfiling(X=df[['education', 'occupation']], y=df['target'])
dp.profile_data()

## Running Lens <a name="Running-Lens"></a>
The first step is to create a Lens CredoData artifact, which holds the structured dataset and the metadata needed for the assessment. CredoData takes the following parameters:

* `name`: an arbitrary name that you want to assign to the object (str)
* `data`: dataset dataframe that includes all features and labels (pd.DataFrame)
* `sensitive_feature_key`: name of the sensitive feature column in your data, like 'race' or 'gender' (str)
* `label_key`: name of the label column in your data, like 'label' (str)
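
Since `label_key` and `sensitive_feature_key` refer to columns of `data`, it can help to confirm those columns exist before building the artifact. The sketch below uses a hypothetical helper, `check_keys`, which is not part of Lens. Treating `sensitive_feature_key` as optional is an assumption made here for illustration.

```python
import pandas as pd


def check_keys(df, label_key, sensitive_feature_key=None):
    """Hypothetical helper: list the requested key columns absent from df."""
    requested = [label_key]
    if sensitive_feature_key is not None:
        requested.append(sensitive_feature_key)
    return [key for key in requested if key not in df.columns]


# Toy data, made up for illustration
toy = pd.DataFrame({"sex": ["Male", "Female"], "target": [0, 1]})

# 'gender' is not a column of the toy frame, so it is reported as missing
missing = check_keys(toy, label_key="target", sensitive_feature_key="gender")
print(missing)
```

An empty list means every requested key maps to a column in the dataframe.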

In [None]:
label_key = 'target'
categorical_features_keys = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']

# Set up the data artifact
credo_data = cl.CredoData(name='census-income',
                          data=df, 
                          label_key=label_key)

In [None]:
lens = cl.Lens(data=credo_data, assessments=[assess.DatasetProfilingAssessment()])
results = lens.run_assessments().get_results()

In [None]:
lens.create_report()