# `plot()`: analyze distributions

## Overview

The `plot()` function explores the distributions and statistics of a dataset. It generates a variety of visualizations and statistics that help the user understand the column distributions and the relationships between columns. The following describes the functionality of `plot()` for a given dataframe `df`.

1. `plot(df)`: plots the distribution of each column and computes dataset statistics
2. `plot(df, x)`: plots the distribution of column `x` in various ways, and computes its statistics
3. `plot(df, x, y)`: generates plots depicting the relationship between columns `x` and `y`

The generated plots are different for numerical and categorical columns. The following table summarizes the output for the different column types.

| `x` | `y` | Output |
| --- | --- | --- |
| None | None | dataset statistics, [histogram](https://www.wikiwand.com/en/Histogram) or [bar chart](https://www.wikiwand.com/en/Bar_chart) for each column |
| Numerical | None | column statistics, histogram, [kde plot](https://www.wikiwand.com/en/Kernel_density_estimation), [qq-normal plot](https://www.wikiwand.com/en/Q%E2%80%93Q_plot), [box plot](https://www.wikiwand.com/en/Box_plot) |
| Categorical | None | column statistics, bar chart, [pie chart](https://www.wikiwand.com/en/Pie_chart), [word cloud](https://www.wikiwand.com/en/Tag_cloud), word frequencies |
| Numerical | Numerical | [scatter plot](https://www.wikiwand.com/en/Scatter_plot), [hexbin plot](https://www.data-to-viz.com/graph/hexbinmap.html), binned box plot |
| Numerical | Categorical | categorical box plot, multi-[line chart](https://www.wikiwand.com/en/Line_chart) |
| Categorical | Numerical | categorical box plot, multi-line chart |
| Categorical | Categorical | [nested bar chart](https://www.wikiwand.com/en/Bar_chart#/Grouped_and_stacked), [stacked bar chart](https://www.wikiwand.com/en/Bar_chart#/Grouped_and_stacked), [heat map](https://www.wikiwand.com/en/Heat_map) |
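
The table can be read as a lookup keyed by the types of `x` and `y`. The sketch below restates it in plain pandas. The helpers `column_kind` and `expected_output` are hypothetical names introduced here for illustration, and the dtype check is an assumption; the type detection inside `plot()` may well differ.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Outputs copied from the table above, keyed by (type of x, type of y).
OUTPUTS = {
    (None, None): "dataset statistics, histogram or bar chart for each column",
    ("numerical", None): "column statistics, histogram, kde plot, qq-normal plot, box plot",
    ("categorical", None): "column statistics, bar chart, pie chart, word cloud, word frequencies",
    ("numerical", "numerical"): "scatter plot, hexbin plot, binned box plot",
    ("numerical", "categorical"): "categorical box plot, multi-line chart",
    ("categorical", "numerical"): "categorical box plot, multi-line chart",
    ("categorical", "categorical"): "nested bar chart, stacked bar chart, heat map",
}


def column_kind(df, name):
    """Hypothetical helper: classify a column as numerical or categorical."""
    if name is None:
        return None
    return "numerical" if is_numeric_dtype(df[name]) else "categorical"


def expected_output(df, x=None, y=None):
    """Hypothetical helper: look up the table row for the given columns."""
    return OUTPUTS[(column_kind(df, x), column_kind(df, y))]


# Tiny made-up frame, used only to exercise the lookup.
toy = pd.DataFrame({"age": [39, 50, 38], "education": ["Bachelors", "HS-grad", "HS-grad"]})
print(expected_output(toy))
print(expected_output(toy, "age"))
print(expected_output(toy, "age", "education"))
```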

Next, we demonstrate the functionality of `plot()`. 

## Load the dataset
`dataprep.eda` supports **Pandas** and **Dask** dataframes. Here, we will load the well-known [adult dataset](http://archive.ics.uci.edu/ml/datasets/Adult) into a Pandas dataframe.

In [1]:
import pandas as pd
# Treat " ?" (a question mark preceded by a space) as a missing value.
df = pd.read_csv("https://www.openml.org/data/get_csv/1595261/phpMawTba", na_values=[" ?"])
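
To see what `na_values` does without downloading anything, the following sketch parses a three-row CSV string twice. The rows are made up for illustration and only imitate the layout implied by the call above, where a missing entry appears as `" ?"`.

```python
import io

import pandas as pd

# Three made-up rows: fields after the first are preceded by a space,
# and " ?" stands for a missing value.
sample_csv = (
    "age,workclass,occupation\n"
    "39, State-gov, Adm-clerical\n"
    "50, ?, ?\n"
    "38, Private, Sales\n"
)

raw = pd.read_csv(io.StringIO(sample_csv))
clean = pd.read_csv(io.StringIO(sample_csv), na_values=[" ?"])

print(raw["workclass"].isna().sum())    # " ?" is kept as an ordinary string
print(clean["workclass"].isna().sum())  # " ?" is parsed as NaN
```

Without `na_values`, the placeholder would be counted as one more category rather than as a missing value.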

## Get an overview of the dataset with `plot(df)`

We start by calling `plot(df)`, which computes dataset-level statistics and plots a histogram for each numerical column and a bar chart for each categorical column. The number of bins in each histogram can be set with the parameter `bins`, and the number of categories shown in each bar chart with the parameter `ngroups`. If a column contains missing values, the percentage of missing values is shown in the plot title, and the missing values are ignored when generating the plot.

In [2]:
from dataprep.eda import plot
plot(df)
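
The handling of missing values and bins can be approximated by hand. The sketch below uses a made-up column and NumPy's histogram routine; it illustrates the idea and is not the computation `plot()` performs internally.

```python
import numpy as np
import pandas as pd

# Made-up numerical column with two missing entries.
hours = pd.Series([40, 13, 40, np.nan, 45, 50, np.nan, 16, 45, 40], name="hours-per-week")

# Share of missing values, the kind of figure reported in a plot title.
missing_pct = hours.isna().mean() * 100

# Histogram over the remaining values; `bins` sets the number of bars.
bins = 4
counts, edges = np.histogram(hours.dropna(), bins=bins)

print(f"{hours.name} ({missing_pct:.0f}% missing)")
print(counts)
print(edges)
```

Dropping the missing entries first means the bar heights add up to the number of observed values, not to the length of the column.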

## Understand a column with `plot(df, x)`

After getting an overview of the dataset, we can thoroughly investigate a column of interest `x` using `plot(df, x)`. The output of `plot(df, x)` differs for numerical and categorical columns.

When `x` is a numerical column, it computes column statistics and generates a histogram, kde plot, box plot and qq-normal plot:

In [3]:
plot(df, "age")
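
Of these plots, the qq-normal plot is probably the least familiar. It compares the sorted sample with the quantiles of a normal distribution. The sketch below computes the points of such a plot with the standard library only. The ages are made up, and both the plotting positions and the choice of a normal distribution fitted to the sample are assumptions; `plot()` may use different conventions.

```python
from statistics import NormalDist, mean, stdev

# Made-up ages, used only to illustrate the computation.
ages = [39, 50, 38, 53, 28, 37, 49, 52, 31, 42]

n = len(ages)
fitted = NormalDist(mean(ages), stdev(ages))

# Plotting positions (i + 0.5) / n avoid the undefined 0% and 100% quantiles.
theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
observed = sorted(ages)

# Each (theoretical, observed) pair is one point of the qq-normal plot.
for t, o in zip(theoretical, observed):
    print(f"{t:6.2f}  {o}")
```

If the sample is close to normally distributed, the points lie near the diagonal where both values are equal.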

When `x` is a categorical column, it computes column statistics and plots a bar chart and pie chart, along with the word cloud and word frequencies listed in the table above:

In [4]:
plot(df, "education")
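
The data behind a bar chart and a pie chart limited to a fixed number of categories can be sketched with `value_counts`. The column below is made up, `ngroups` mirrors the parameter of the same name mentioned for `plot(df)`, and pooling the remaining categories into an "Others" slice is an assumption about how such a pie chart is usually drawn.

```python
import pandas as pd

# Made-up categorical column.
education = pd.Series(
    ["HS-grad", "Bachelors", "HS-grad", "Masters", "HS-grad",
     "Bachelors", "Doctorate", "Masters", "HS-grad", "Bachelors"],
    name="education",
)

ngroups = 3  # number of categories to show

# Bar chart data: counts of the most frequent categories.
top_counts = education.value_counts().head(ngroups)

# Pie chart data: share of each shown category, with the rest pooled.
shares = top_counts / len(education)
shares["Others"] = 1 - shares.sum()

print(top_counts)
print(shares)
```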

## Understand the relationship between two columns with `plot(df, x, y)`

Next, we can explore the relationship between columns `x` and `y` using `plot(df, x, y)`. The output depends on the types of the columns. 

When `x` and `y` are both numerical columns, it generates a scatter plot, hexbin plot and binned box plot:

In [5]:
plot(df, "age", "hours-per-week")
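
A binned box plot groups `x` into intervals and draws one box of `y` values per interval. The sketch below computes the quartiles behind such boxes for a made-up frame. Equal-width intervals from `pd.cut` are an assumption; the binning used by `plot()` may differ.

```python
import pandas as pd

# Made-up numerical columns.
toy = pd.DataFrame({
    "age": [19, 23, 28, 31, 35, 38, 42, 47, 51, 58, 63, 66],
    "hours-per-week": [20, 25, 40, 40, 45, 50, 40, 45, 40, 38, 30, 16],
})

# Cut x into equal-width intervals, then summarize y inside each interval.
age_bins = pd.cut(toy["age"], bins=3)
box_stats = (
    toy.groupby(age_bins, observed=True)["hours-per-week"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

print(box_stats)
```

Each row of `box_stats` holds the lower quartile, median and upper quartile for one age interval.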

When `x` and `y` are both categorical columns, it plots a nested bar chart, stacked bar chart and heat map:

In [6]:
plot(df, "education", "marital-status")
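
All three charts can be derived from a contingency table of the two columns. The sketch below builds one with `pd.crosstab` on a made-up frame; it shows the underlying counts only and does not reproduce how `plot()` computes or renders them.

```python
import pandas as pd

# Made-up categorical columns.
toy = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "HS-grad", "Masters", "Bachelors", "HS-grad"],
    "marital-status": ["Married", "Never-married", "Divorced", "Married", "Married", "Married"],
})

# One row per education value, one column per marital-status value.
counts = pd.crosstab(toy["education"], toy["marital-status"])

# Row-normalized shares, useful when the categories differ a lot in size.
row_shares = pd.crosstab(toy["education"], toy["marital-status"], normalize="index")

print(counts)
print(row_shares)
```

A heat map colors each cell of `counts`, while nested and stacked bar charts draw each row as a group or a stack of bars.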

When one of `x` and `y` is numerical and the other is categorical, it generates a box plot per category and a multi-line chart:

In [7]:
plot(df, "age", "education")
# or plot(df, "education", "age")
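
The numbers behind a box plot per category are the five-number summary of the numerical column within each group. The sketch below computes them for a made-up frame with `groupby`; it is an approximation of the idea, not the code behind `plot()`.

```python
import pandas as pd

# Made-up frame with one categorical and one numerical column.
toy = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "HS-grad", "Bachelors", "HS-grad", "Bachelors"],
    "age": [25, 30, 45, 40, 35, 50],
})

# Five-number summary of age within each education level.
summary = toy.groupby("education")["age"].describe()[["min", "25%", "50%", "75%", "max"]]

print(summary)
```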