# Data Visualization Tutorial

Notebook from [Eric Elmoznino](https://github.com/EricElmoznino/lighthouse_visualization_tutorial).

In this tutorial, we will visualize a dataset using both the `matplotlib` and the `seaborn` libraries.

---
## The dataset

In [1]:
import seaborn as sns

df = sns.load_dataset("tips")
df.head(5)

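|   | total_bill | tip  | sex    | smoker | day | time   | size |
|---|------------|------|--------|--------|-----|--------|------|
| 0 | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1 | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2 | 21.01      | 3.50 | Male   | No     | Sun | Dinner | 3    |
| 3 | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4 | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |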


---
## Matplotlib
This is the visualization library you will use when you want to quickly make a simple plot, or when you want very
fine-grained control over every aspect of the plot.

- Matplotlib = lower-level
- Seaborn = higher-level

> **Note:** We will only be using the "stateless" (object-oriented) API, as opposed to the "stateful" API (which is used in the compass exercises). Tomorrow you will cover the stateless API. I think the stateless API is the one you should try to be more comfortable with, as it is a) more intuitive to use and b) provides more functionality.

In [None]:
import matplotlib.pyplot as plt

### Basic API
The general workflow consists of:
- Using `plt.subplots()` to create a Figure (the canvas) and as many Axes (the individual graphs) as you want.
- Using the `ax.` plotting methods to generate the visualizations.
- Using the `ax.` customization methods to fine-tune your plots.
- Displaying or saving the plot.

In [None]:
#stateless API
# Create Figure and Axes objects with which to do your plotting
fig, ax = plt.subplots()

# Get numpy arrays for the data you want to plot
total_bill = df['total_bill'].values
tip = df['tip'].values

# Plot the data using one of Matplotlib's plotting functions
ax.scatter(total_bill, tip)

# Customize other aspects of the plot
ax.set_title('Tip amounts as a function of total bill')
ax.set_xlabel('Total Bill (USD)')
ax.set_ylabel('Tip (USD)')

# Display the plot
plt.show()
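
The last workflow step mentions saving as an alternative to displaying. A minimal sketch of that step, using the first five rows of the dataset shown earlier as stand-in values; the output path and file name are illustrative assumptions:

```python
import os
import tempfile

import matplotlib.pyplot as plt

# Stand-in values (the first five rows shown by df.head(5));
# in the notebook these would come from df
total_bill = [16.99, 10.34, 21.01, 23.68, 24.59]
tip = [1.01, 1.66, 3.50, 3.31, 3.61]

fig, ax = plt.subplots()
ax.scatter(total_bill, tip)
ax.set_xlabel('Total Bill (USD)')
ax.set_ylabel('Tip (USD)')

# Save through the Figure object; the file extension selects the format.
# The path below is an illustrative choice, not part of the original notebook.
out_path = os.path.join(tempfile.gettempdir(), 'tips_scatter.png')
fig.savefig(out_path, dpi=150, bbox_inches='tight')

# Release the figure once it has been written to disk
plt.close(fig)
```

Calling `fig.savefig` on the Figure object, rather than relying on whichever figure is current, keeps the save step explicit when several figures are open.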

In [None]:
#stateful API
# Don't define Figure and Axes objects. Instead, the "state" of plt is remembered.
# Figures and Axes are remembered, but it is difficult to access specific ones.

# Plot the data using one of Matplotlib's plotting functions
plt.scatter(total_bill, tip)

# Customize other aspects of the plot
plt.title('Tip amounts as a function of total bill')
plt.xlabel('Total Bill (USD)')
plt.ylabel('Tip (USD)')

# Display the plot
plt.show()
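
To make the idea of "remembered state" concrete, the sketch below creates two figures and shows that `plt.` calls act on whichever Axes is current at that moment (by default, the one created most recently). The data values and the title are placeholders:

```python
import matplotlib.pyplot as plt

# Two separate figures; the most recently created one becomes "current"
fig1, ax1 = plt.subplots()
fig2, ax2 = plt.subplots()

# Stateful calls act on the current Axes, which is ax2 here
plt.scatter([1, 2, 3], [1, 4, 9])   # placeholder data
plt.title('Drawn via pyplot')

# plt.gca() ("get current axes") returns the Axes that pyplot is tracking
current_ax = plt.gca()
print(current_ax is ax2)

# With the stateless API there is no ambiguity: you name the Axes directly
ax1.scatter([1, 2, 3], [3, 2, 1])   # placeholder data
ax1.set_title('Drawn via the Axes object')
```

This is the main practical reason to prefer the object-oriented style once a notebook holds more than one figure.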

### Multiple plots on the same Axes object (graph)

In [None]:
# Create the figure and get an Axes object with which to do your plotting
fig, ax = plt.subplots(figsize=(10, 8))    # You can manually set the figure size (width and height in inches)

# Get numpy arrays for the data you want to plot
df_female = df[df['sex'] == 'Female']
df_male = df[df['sex'] == 'Male']
total_bill_female = df_female['total_bill'].values
tip_female = df_female['tip'].values
total_bill_male = df_male['total_bill'].values
tip_male = df_male['tip'].values

# To put multiple plots in the same graph, just call
# multiple plotting functions. You can also pass in
# a label which will be used if you display a legend
ax.scatter(total_bill_female, tip_female, label='Female', color='red')
ax.scatter(total_bill_male, tip_male, label='Male', color='blue')

# Customize other aspects of the plot
ax.set_title('Tip amounts as a function of total bill')
ax.set_xlabel('Total Bill (USD)')
ax.set_ylabel('Tip (USD)')

# Display the legend
ax.legend()
#ax.legend(loc='lower right')

# Display the plot
plt.show()

### Multiple plots on different Axes objects (graphs)

In [None]:
# Create 2 graphs arranged in a grid of 1 row x 2 columns. "axes" will be a 1D array of Axes objects.
# Note: if the grid had more than 1 row and more than 1 column, "axes" would be a 2D array of Axes objects.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Make a scatterplot on the first graph
axes[0].scatter(total_bill, tip)

# Customize the first graph
axes[0].set_title('Tip amounts as a function of total bill')
axes[0].set_xlabel('Total Bill (USD)')
axes[0].set_ylabel('Tip (USD)')

# Make a histogram on the second graph
axes[1].hist(tip)

# Customize the second graph
axes[1].set_title('Distribution of tip amounts')
axes[1].set_xlabel('Tip (USD)')
axes[1].set_xlim(left=0)

# Display the plot
plt.show()
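
The 2D case mentioned in the comment above can be sketched as follows. The plotted lines are placeholder data, and the panel titles are only there to show which index maps to which position:

```python
import matplotlib.pyplot as plt

# A 2 x 2 grid: "axes" is a 2D numpy array indexed as axes[row, column]
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
print(axes.shape)

# axes.flat iterates over every Axes in row-major order,
# which is handy when each panel gets similar treatment
for i, ax in enumerate(axes.flat):
    ax.plot([0, 1, 2], [0, i, 2 * i])   # placeholder data
    ax.set_title(f'Panel {i}')

# Individual panels are still reachable by row and column
axes[1, 0].set_xlabel('Bottom-left panel')

# Adjust spacing so titles and labels do not overlap
fig.tight_layout()
```

As in the cells above, follow this with `plt.show()` to display the figure.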

---
## Seaborn
This is likely the visualization library that you will use the most when trying
to present your data at a more professional level, or when attempting to visualize
more complex relationships easily. The advantages of Seaborn over Matplotlib are:
- It is designed to work well with Pandas, and this will be the format of most of your datasets.
- It handles a lot of the plotting logic for you, so the same plot usually takes fewer lines of code than with Matplotlib alone.
- It supports many kinds of plots natively and has a simple API to customize them.
- When necessary, you can still use Matplotlib to fine-tune your figures on top of Seaborn ones, since Seaborn is just using Matplotlib under the hood.

For a quick introduction to many different seaborn plots using simple
examples, visit this [tutorial](https://seaborn.pydata.org/introduction.html)
on their website.

In [None]:
# Seaborn has five built-in styles (darkgrid, whitegrid, dark, white, ticks).
# The style you set applies to all subsequent plots.
# https://www.codecademy.com/articles/seaborn-design-i
sns.set_style("darkgrid")
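
`sns.set_style` changes the style globally. If you only want a style for one figure, a possible approach is `sns.axes_style`, which returns the style's settings and can also be used as a context manager. The sketch below uses placeholder data:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# The five built-in style names
style_names = ["darkgrid", "whitegrid", "dark", "white", "ticks"]

# axes_style() returns the Matplotlib rc settings behind a style,
# so you can inspect what a style actually changes
darkgrid_params = sns.axes_style("darkgrid")
white_params = sns.axes_style("white")
print(darkgrid_params["axes.grid"], white_params["axes.grid"])

# Used as a context manager, the style only applies to Axes
# created inside the "with" block
with sns.axes_style("whitegrid"):
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [2, 4, 3])   # placeholder data
```

Plots created after the `with` block go back to whatever style was set globally.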

In [None]:
df.head(5)

### Basic API
Most Seaborn plotting functions take in:
- A `data` argument, which is your Pandas dataframe
- One or more column names from your dataframe, which determine which columns are plotted (for example, on the x and y axes)

In [None]:
sns.scatterplot(x='total_bill', y='tip', data=df) #x and y are the column names from df

plt.show()

### Multiple plots in one figure
Most of the time, plotting functions take in an optional `hue` parameter
which corresponds to a column name that will be used to split the data.
The colours and legend will be created automatically.
There are often many other parameters too, depending on the plot type,
for which you can simply pass in column names that will dictate plot attributes.
This flexibility to work with a single dataframe and use various columns
to customize different properties of the resulting graph is one of the
things that makes Seaborn so powerful and easy to use.

In [None]:
sns.scatterplot(x='total_bill', y='tip', hue='sex', data=df)
#sns.scatterplot(x='total_bill', y='tip', hue='size', data=df)

plt.show()

In [None]:
sns.scatterplot(x='total_bill', y='tip', hue='sex', style='time', data=df)

plt.show()

In [None]:
sns.scatterplot(x='total_bill', y='tip', hue='sex', style='time', size='size', data=df)

plt.show()

### There are many more complex plot types

Some plot types will even automatically compute things for you and display them
(such as error bars).

In [None]:
sns.barplot(x='day', y='total_bill', color='steelblue', data=df) #displays averages (and error bars) by default

plt.show()
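
To see what `barplot` computes for you, the sketch below reproduces the bar heights with a plain pandas `groupby`. It uses a small hypothetical sample rather than the tips dataset so the numbers are easy to check by hand:

```python
import pandas as pd
import seaborn as sns

# Small hypothetical sample (not the real tips data)
sample = pd.DataFrame({
    'day': ['Thur', 'Thur', 'Fri', 'Fri', 'Sat', 'Sat'],
    'total_bill': [10.0, 20.0, 15.0, 25.0, 30.0, 50.0],
})

# The mean of total_bill for each day, kept in order of first appearance
day_means = sample.groupby('day', sort=False)['total_bill'].mean()
print(day_means)

# barplot performs the same aggregation internally and draws one bar per day;
# the error bars are an additional estimate of uncertainty around each mean
ax = sns.barplot(x='day', y='total_bill', data=sample)

# Heights of the drawn bars, for comparison with day_means
bar_heights = [bar.get_height() for bar in ax.patches]
print(bar_heights)
```

With `groupby` alone you would still need to draw the bars and compute the error bars yourself, which is the work Seaborn saves you.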

In [None]:
sns.violinplot(x='day', y='total_bill', hue='sex', data=df)

plt.show()

### Integration with Matplotlib
Seaborn plots can be customized further using Matplotlib. Axes-level Seaborn plotting functions (such as `scatterplot`, `barplot`, and `violinplot`):
- Return a Matplotlib `Axes` object, which you can use just like any other Matplotlib `Axes`.
- Can optionally take a Matplotlib `Axes` object as input (via the `ax` argument), if you want to draw a Seaborn plot on an existing figure.

In [None]:
ax = sns.scatterplot(x='total_bill', y='tip', data=df)

# Change the default axis labels
ax.set_xlabel('Total Bill (USD)')
ax.set_ylabel('Tip (USD)')

# Manually set the range of an axis
ax.set_ylim(bottom=0)

plt.show()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

sns.scatterplot(x='total_bill', y='tip', data=df, ax=axes[0])
sns.violinplot(x='day', y='tip', hue='sex', data=df, ax=axes[1])

#plt.savefig('filename.jpg') #must come before plt.show()
plt.show()

### Seaborn examples
Check out Seaborn's example plotting gallery [here](https://seaborn.pydata.org/examples/index.html) for inspiration on the different types of graphs possible with Seaborn.