# YData Profiling

## About

`ydata-profiling`'s primary goal is to provide a **one-line Exploratory Data Analysis (EDA) experience** in a consistent and fast solution. Like the handy pandas `df.describe()` function, `ydata-profiling` delivers an extended analysis of a DataFrame, and it lets you export that analysis in different formats such as HTML and JSON.

## Quickstart

### Install

``` bash
pip install ydata-profiling
```
or
``` bash
conda install -c conda-forge ydata-profiling
```

### Start profiling

Start by loading your pandas `DataFrame` as you normally would, e.g. by using:

``` python
import numpy as np
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])
```

To generate the standard profiling report, run:

``` python
profile = ProfileReport(df, title="Profiling Report")
```

## Key features

- **Type inference**: automatic detection of columns' data types (*Categorical*, *Numerical*, *Date*, etc.)
- **Warnings**: A summary of the problems/challenges in the data that you might need to work on (*missing data*, *inaccuracies*, *skewness*, etc.)
- **Univariate analysis**: including descriptive statistics (mean, median, mode, etc.) and informative visualizations such as distribution histograms
- **Multivariate analysis**: including correlations, a detailed analysis of missing data, duplicate rows, and visual support for pairwise interactions between variables
- **Time-Series**: including different statistical information relative to time-dependent data such as auto-correlation and seasonality, along with ACF and PACF plots
- **Text analysis**: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)
- **File and Image analysis**: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata
- **Compare datasets**: one-line solution to enable a fast and complete report on the comparison of datasets
- **Flexible output formats**: all analysis can be exported to an HTML report that can be easily shared with different parties, as JSON for an easy integration in automated systems and as a widget in a Jupyter Notebook.

The report contains three additional sections:

- **Overview**: mostly global details about the dataset (number of records, number of variables, overall missingness and duplicates, memory footprint)
- **Alerts**: a comprehensive and automatic list of potential data quality issues (high correlation, skewness, uniformity, zeros, missing values, constant values, among others)
- **Reproduction**: technical details about the analysis (time, version and configuration)

## Use Cases

`ydata-profiling` supports a variety of use cases. The documentation includes guides, tips and tricks for tackling them:

| Use case | Description |
|----------|-------------|
| [Comparing datasets](https://docs.profiling.ydata.ai/latest/features/comparing_datasets) | Comparing multiple versions of the same dataset |
| [Profiling a Time-Series dataset](https://docs.profiling.ydata.ai/latest/features/time_series_datasets) | Generating a report for a time-series dataset with a single line of code |
| [Profiling large datasets](https://docs.profiling.ydata.ai/latest/features/big_data) | Tips on how to prepare data and configure `ydata-profiling` for working with large datasets |
| [Handling sensitive data](https://docs.profiling.ydata.ai/latest/features/sensitive_data) | Generating reports which are mindful about sensitive data in the input dataset |
| [Dataset metadata and data dictionaries](https://docs.profiling.ydata.ai/latest/features/metadata) | Complementing the report with dataset details and column-specific data dictionaries |
| [Customizing the report's appearance](https://docs.profiling.ydata.ai/latest/features/custom_reports) | Changing the appearance of the report's page and of the contained visualizations |
| [Profiling Databases](https://docs.profiling.ydata.ai/latest/features/collaborative_data_profiling) | For a seamless profiling experience in your organization's databases, check [Fabric Data Catalog](https://ydata.ai/products/data_catalog), which can consume data from different types of storage such as RDBMSs (Azure SQL, PostgreSQL, Oracle, etc.) and object storages (Google Cloud Storage, AWS S3, Snowflake, etc.), among others. |

## Example Code for the Advanced Analytics Workspace (AAW)

To install `ydata-profiling` on the Advanced Analytics Workspace (AAW), simply run the cell below. Remove `%%capture` if you want to see the console output during installation.

``` python
%%capture
!pip install -U ydata-profiling ipywidgets scikit-learn
```
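Because `%%capture` hides the installer output, it can be useful to confirm afterwards which versions ended up in the environment. The helper below is a small sketch built on the standard library only; `installed_version` is a name made up for this example.

``` python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# The three packages installed by the cell above
for name in ("ydata-profiling", "ipywidgets", "scikit-learn"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If a package was upgraded while an older version was already imported, restarting the kernel may be needed before the new version is picked up.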

### Generating a Standard Report 

This cell fetches a Pokemon dataset and displays the default `ProfileReport` from `ydata_profiling`. The `correlations` argument controls which correlation measures are calculated; here `auto`, `phi_k` and `cramers` are enabled. The final call, `profile_report.to_file("pokemon.html")`, saves the report to an HTML file which you can open in a new browser tab.

[Click here to see the output.](pokemon.html)

``` python
import pandas as pd

from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file

# Download the dataset once and reuse the cached copy afterwards
file_name = cache_file(
    "pokemon.csv",
    "https://raw.githubusercontent.com/bryanpaget/html/main/pokemon.csv"
)

pokemon_df = pd.read_csv(file_name)

profile_report = ProfileReport(
    pokemon_df,
    sort=None,
    html={
        "style": {"full_width": True}
    },
    progress_bar=False,
    # Choose which correlation measures to calculate
    correlations={
        "auto": {"calculate": True},
        "pearson": {"calculate": False},
        "spearman": {"calculate": False},
        "kendall": {"calculate": False},
        "phi_k": {"calculate": True},
        "cramers": {"calculate": True},
    },
    explorative=True,
    title="Profiling Report"
)

# Save the report as a standalone HTML file
profile_report.to_file("pokemon.html")

# Display the report in the notebook
profile_report
```

### Comparing Datasets

We can also generate reports comparing two datasets. The example below compares training and test splits of the Pokemon dataset. `train_test_split` from `scikit-learn` is used to create the two splits.

[Click here to see the output.](comparison.html)

``` python
import pandas as pd
from sklearn.model_selection import train_test_split

from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file

# Download the dataset once and reuse the cached copy afterwards
file_name = cache_file(
    "pokemon.csv",
    "https://raw.githubusercontent.com/bryanpaget/html/main/pokemon.csv"
)

pokemon_df = pd.read_csv(file_name)

# Numeric features and type labels
X = pokemon_df[['Total', 'HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]
y = pokemon_df[['Type 1', 'Type 2']]

# Hold out a third of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Profile the feature columns of each split
train_df = X_train
train_report = ProfileReport(train_df, title="Train")

test_df = X_test
test_report = ProfileReport(test_df, title="Test")

# Build the side-by-side comparison and save it as HTML
comparison_report = train_report.compare(test_report)
comparison_report.to_file("comparison.html")

# Display the comparison in the notebook
comparison_report
```

### Time Series Data

`ydata-profiling` has a time-series mode which can be activated by passing `tsmode=True` to `ProfileReport`. We'll look at Microsoft's stock price.

[Click here to see the output.](msft-report-timeseries.html)

``` python
import pandas as pd

from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file

# Download the dataset once and reuse the cached copy afterwards
file_name = cache_file(
    "msft.csv",
    "https://raw.githubusercontent.com/bryanpaget/html/main/msft.csv"
)

msft_df = pd.read_csv(file_name)
# Parse the date strings so the column can be used for chronological ordering
msft_df["Date"] = pd.to_datetime(msft_df["Date"])

# Set tsmode=True to automatically identify time-series variables
# sortby names the column that gives the chronological order of the time series
profile = ProfileReport(msft_df, tsmode=True, sortby="Date", title="Time-Series EDA")

# Save the report as a standalone HTML file
profile.to_file("msft-report-timeseries.html")

# Display the report in the notebook
profile
```