# Creating histograms in Pandas and Seaborn

## Overview

We should **NOT** use bar charts with continuous variables! A bar chart COUNTS the number of rows for each unique value, and continuous variables typically have FAR TOO MANY unique values to count and visualize that way. Instead, we want to see the intervals where the variable is CONCENTRATED versus the intervals where it is NOT concentrated!
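To make the problem concrete, here is a minimal sketch on synthetic data. The Series `values`, the seed, and the distribution parameters are assumptions chosen for illustration; they are not the Gapminder data. It suggests that a continuous variable produces nearly one unique value per row.

```python
import numpy as np
import pandas as pd

# Synthetic continuous variable: 1000 draws rounded to 3 decimals (illustrative only)
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=60, scale=12, size=1000).round(3))

# A bar chart would need one bar per unique value
n_unique = values.nunique()
print(f"rows: {len(values)}, unique values: {n_unique}")

# How many distinct values occur once, twice, and so on?
print(values.value_counts().value_counts())
```

Almost every value occurs exactly once, so a bar chart of these counts would be a wall of equally short bars.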

## Import Modules

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

## Read data

In [None]:
gap_url = 'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/gapminder.tsv'

In [None]:
gap_df = pd.read_csv( gap_url, sep='\t' )

In [None]:
gap_df.info()

In [None]:
penguins = sns.load_dataset('penguins')

In [None]:
penguins.info()

## Gapminder data

In [None]:
gap_df.dtypes

In [None]:
gap_df.nunique()

In [None]:
gap_df.lifeExp.value_counts()

In [None]:
# Count how many distinct values occur once, twice, and so on
gap_df.lifeExp.value_counts().value_counts()

If we tried to make a bar chart for this continuous variable with MANY unique values...

In [None]:
fig, ax = plt.subplots()

gap_df.lifeExp.value_counts().plot(kind='bar', ax=ax)

plt.show()

We need a plot type that is SPECIFIC to the NUMERIC data type!

We still want to count, but we CANNOT count the raw individual values. Instead, we need to count how many rows fall into each INTERVAL (or bin)!
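Counting by interval can be sketched directly with `pd.cut()`. The data below is synthetic (the Series `values` and its parameters are assumptions for illustration), and 10 equal-width intervals is just one possible choice.

```python
import numpy as np
import pandas as pd

# Synthetic continuous variable (illustrative only)
rng = np.random.default_rng(1)
values = pd.Series(rng.normal(loc=60, scale=12, size=500))

# Split the observed range into 10 equal-width intervals
intervals = pd.cut(values, bins=10)

# Count the rows in each interval, keeping the intervals in order
counts = intervals.value_counts(sort=False)
print(counts)
```

A histogram is essentially a bar chart of these interval counts, drawn on a numeric axis.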

Let's call a Pandas method that produces HISTOGRAMS for us. Calling `.hist()` on a DataFrame draws one histogram per numeric column.

In [None]:
gap_df.hist()

plt.show()

Let's focus on a SINGLE continuous variable.

In [None]:
fig, ax = plt.subplots()

gap_df.lifeExp.hist(ax=ax)

plt.show()

Let's make the corresponding figure in Seaborn using the AXES-level function for histograms, `sns.histplot()`.

In [None]:
fig, ax = plt.subplots()

sns.histplot( data = gap_df, x='lifeExp', ax=ax )

plt.show()

Pandas uses 10 bins by default, while Seaborn chooses the number of bins automatically from the data. Let's force Seaborn to also use 10 bins.

In [None]:
fig, ax = plt.subplots()

sns.histplot( data = gap_df, x='lifeExp', bins=10, ax=ax )

plt.show()
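The difference between a fixed bin count and an automatic one can be sketched with NumPy alone. This assumes synthetic data (`values` is hypothetical) and relies on Seaborn's documented default of `bins='auto'`, which its documentation describes as being passed to `numpy.histogram_bin_edges()`. The number of bins Seaborn actually draws may differ from NumPy's result and may vary between versions.

```python
import numpy as np

# Synthetic continuous variable (illustrative only)
rng = np.random.default_rng(2)
values = rng.normal(loc=60, scale=12, size=1000)

# bins=10: ten equal-width bins from the minimum to the maximum (11 edges)
edges_10 = np.histogram_bin_edges(values, bins=10)

# bins="auto": NumPy picks the number of bins from the data
edges_auto = np.histogram_bin_edges(values, bins="auto")

print("fixed:", len(edges_10) - 1, "bins")
print("auto: ", len(edges_auto) - 1, "bins")
```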

We can also create the histogram using a FIGURE-level function.

The figure-level function for bar charts was `sns.catplot()`, because the goal there was to work with categorical variables. But the GOAL when visualizing the **DISTRIBUTION** of a continuous variable is different from counting categories, so the figure-level function is different too: it is `sns.displot()`!

In [None]:
sns.displot( data = gap_df, x='lifeExp', kind='hist', bins=10 )

plt.show()

What happens if we use fewer bins...

In [None]:
sns.displot(data = gap_df, x='lifeExp', kind='hist', bins=3)

plt.show()

What if we use many, many bins...

In [None]:
sns.displot(data = gap_df, x='lifeExp', kind='hist', bins=201)

plt.show()

We just want a GENERAL or ROUGH idea of how the CONCENTRATION is distributed! Too few bins hide the shape, and too many bins turn it into noise.

In [None]:
sns.displot(data = gap_df, x='lifeExp', kind='hist', bins=21)

plt.show()
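The effect of the bin count can also be seen in the numbers behind the plot. This is a minimal sketch on synthetic data (`values` is an assumption for illustration), reusing the bin counts 3, 21, and 201 from the plots above.

```python
import numpy as np

# Synthetic continuous variable (illustrative only)
rng = np.random.default_rng(4)
values = rng.normal(loc=60, scale=12, size=1000)

for n_bins in (3, 21, 201):
    # counts[i] is the number of values between edges[i] and edges[i + 1]
    counts, edges = np.histogram(values, bins=n_bins)
    empty = int((counts == 0).sum())
    print(f"{n_bins:>3} bins: width={edges[1] - edges[0]:.2f}, "
          f"max count={counts.max()}, empty bins={empty}")
```

With very wide bins almost all rows land in a few bars, while with very narrow bins each bar holds only a handful of rows and some may be empty.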

In addition to the HISTOGRAM, you could use the Kernel Density Estimate (KDE) plot, which draws a smooth curve in place of the bars.

In [None]:
sns.displot(data = gap_df, x='lifeExp', kind='kde')

plt.show()
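To give a feel for what a KDE curve represents, here is a hand-rolled Gaussian KDE in NumPy. It is an illustrative sketch and not Seaborn's implementation: the synthetic `values`, the evaluation grid, and the use of Scott's rule of thumb for the bandwidth are all assumptions made for this example.

```python
import numpy as np

# Synthetic continuous variable (illustrative only)
rng = np.random.default_rng(5)
values = rng.normal(loc=60, scale=12, size=300)

# Bandwidth from Scott's rule of thumb (an assumption for this sketch)
n = values.size
bandwidth = values.std(ddof=1) * n ** (-1 / 5)

# Evaluate the estimate on a grid that extends a little past the data
grid = np.linspace(values.min() - 3 * bandwidth, values.max() + 3 * bandwidth, 400)

# The estimate is the average of Gaussian bumps centered on each observation
z = (grid[:, None] - values[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# The area under a density curve should be close to 1
area = float((density * (grid[1] - grid[0])).sum())
print(f"area under the curve: {area:.3f}")
```

A larger bandwidth gives a smoother curve and a smaller one gives a bumpier curve, much like choosing fewer or more bins in a histogram.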

COMBINE the histogram, using the default number of bins, with the KDE by setting `kde=True`.

In [None]:
sns.displot( data = gap_df, x='lifeExp', kind='hist', kde=True )

plt.show()

Look at the DISTRIBUTIONS of the other continuous variables in `gap_df`.

In [None]:
sns.displot( data = gap_df, x='pop', kind='hist', kde=True )

plt.show()

In [None]:
sns.displot( data = gap_df, x='gdpPercap', kind='hist', kde=True )

plt.show()

## Penguins

In [None]:
penguins.dtypes

In [None]:
penguins.nunique()

In [None]:
sns.displot(data = penguins, x='flipper_length_mm', kind='hist', kde=True)

plt.show()

In [None]:
sns.displot(data = penguins, x='bill_depth_mm', kind='hist', kde=True)

plt.show()