# Seaborn stands on the shoulders of Matplotlib

- You can (and often will want to) use Matplotlib directly
- First, let's import the Matplotlib library, which comes preinstalled on Kaggle

In [None]:
import matplotlib.pyplot as plt

- Let's plot some lines

In [None]:
def plot_some_lines():
    plt.figure()
    x1 = [10,20,5,40,8]
    x2 = [30,43,9,7,20]
    plt.plot(x1, label="Group Spam")
    plt.plot(x2, label="Group Eggs")
    plt.legend()
    plt.show()

plot_some_lines()

- That doesn't look too bad, but let's try out Seaborn

In [None]:
import seaborn as sns

- Seaborn can take over from Matplotlib very easily.

In [None]:
sns.set_theme()  # preferred name; sns.set() is an alias for it

- Seaborn has taken control of the figure's aesthetics.
- Now show the matplotlib plots again with Seaborn in charge.

In [None]:
plot_some_lines()
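
- If restyling every later figure is more than you want, one option is to scope the style instead. The sketch below assumes Seaborn's built-in `"darkgrid"` style and uses `sns.axes_style` as a context manager; it only inspects Matplotlib's `rcParams`, and the `grid_*` variable names are ours, not part of either library.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Remember the current grid setting so we can compare afterwards.
grid_before = plt.rcParams["axes.grid"]

# Inside the block, Seaborn's "darkgrid" style is active.
with sns.axes_style("darkgrid"):
    grid_inside = plt.rcParams["axes.grid"]
    # Any figure created here (e.g. by plot_some_lines()) picks up the style.

# On exit, the previous settings are restored.
grid_after = plt.rcParams["axes.grid"]

print(grid_inside, grid_after == grid_before)
```

- Figures drawn outside the `with` block keep whatever style was active before it.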

## Seaborn has some very handy data sets built in

In [None]:
sns.get_dataset_names()

- And we can use them just like a dataframe that came from a CSV

In [None]:
mpg_df = sns.load_dataset("mpg")
mpg_df.head()

## Seaborn has many types of easy-to-use plots

- Look how easy it is to render a "count plot"

In [None]:
countplot = sns.countplot(data=mpg_df, x="cylinders")

## Let's Check Some Assumptions


In [None]:
# Keep only four-cylinder cars, then average mpg per region of origin.
four_cylinders = mpg_df[mpg_df.cylinders == 4]
by_origin = four_cylinders.groupby("origin", as_index=False)
mpg_by_origin = by_origin.mpg.mean()
barplot = sns.barplot(x="origin", y="mpg", data=mpg_by_origin)

In [None]:
avg_mpg = mpg_df.groupby("model_year", as_index=False).mpg.mean()
relplot = sns.relplot(x="model_year", y="mpg", data=avg_mpg)

## Enough of cars, let's take to the skies

In [None]:
flights = sns.load_dataset("flights")
flights.tail()

In [None]:
flights_plot = sns.relplot(x="year", y="passengers", data=flights, hue="month")

- Let's consolidate the data by combining months

In [None]:
year_sums = flights.groupby("year", as_index=False).passengers.sum()
sums_plot = sns.relplot(x="year", y="passengers", data=year_sums)

- We can make some predictions and let Seaborn handle the boilerplate of fitting a linear model.
- `lmplot` to the rescue

In [None]:
sums_lmplot = sns.lmplot(x="year", y="passengers", data=year_sums)
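
- To see roughly what kind of boilerplate `lmplot` saves us, here is a minimal sketch of a straight-line fit done by hand with NumPy. The yearly totals are hypothetical numbers chosen to lie exactly on a line (they are not the real `flights` sums), and `np.polyfit` stands in for whatever fitting routine Seaborn uses internally.

```python
import numpy as np

# Hypothetical yearly totals that lie exactly on a line,
# so the fitted coefficients are easy to check.
years = np.array([1949, 1950, 1951, 1952, 1953])
passengers = np.array([1500, 1700, 1900, 2100, 2300])

# Degree-1 least-squares fit: passengers ~ slope * year + intercept.
slope, intercept = np.polyfit(years, passengers, deg=1)

# Extrapolate one year past the data.
predicted_1954 = slope * 1954 + intercept

print(round(slope, 2), round(predicted_1954, 1))
```

- `lmplot` does the fit, draws the line, and adds a confidence band in a single call.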

- Apparently more people are flying year over year (unless there's a pandemic)
- But do those visualizations best convey that notion? Or are there better choices available?
- How much work would it take to use another type of graph in Seaborn?
- Let's try a Bar Plot instead...

In [None]:
barplot = sns.barplot(x="year", y="passengers", data=flights)

- How hard is it to swap out month for year as the x variable?
  - Spoiler alert, not hard at all.

In [None]:
by_month = sns.barplot(x="month", y="passengers", data=flights)

- What are those lines jutting out of the top of each bar? They are error bars; by default Seaborn draws a 95% confidence interval around each bar's mean.
- You can modify (or remove) them if you like

In [None]:
by_month_no_ci = sns.barplot(x="month", y="passengers", data=flights, errorbar=None)

## The Importance of Plotting Data

- Plotting data isn't just for good looks. Seeing data visually can reveal problems that are tougher to spot in the raw numbers.
- The famous [Anscombe's Quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) data set demonstrates this point.

In [None]:
anscombe = sns.load_dataset("anscombe")
anscombe

- Anscombe's quartet comprises four data sets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. 

In [None]:
from sklearn.linear_model import LinearRegression

datasets = ["I", "II", "III", "IV"]

for dataset in datasets:
    data = anscombe.query(f"dataset == '{dataset}'")

    # Fit y = slope * x + intercept; scikit-learn expects a 2-D feature array.
    am = LinearRegression().fit(data.x.values.reshape(-1, 1), data.y.values)
    print(f"Dataset {dataset}: slope={am.coef_[0]:.3f}, intercept={am.intercept_:.3f}")

- All four data sets yield nearly the same fitted slope and intercept, so on paper they seem quite similar.
- Check how closely the numbers match across the data sets
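
- The match goes beyond the regression line. As a hedged, standard-library-only sketch, the block below types in the quartet's published values (so it does not depend on `sns.load_dataset`) and computes the mean, variance, and Pearson correlation for each set. The `summarize` helper and the `quartet` dict are our own illustrative names.

```python
from statistics import mean, variance

# Anscombe's quartet, typed in from the published table.
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return (mean x, mean y, variance x, variance y, Pearson r)."""
    mx, my = mean(x), mean(y)
    # Sample covariance, using the same n - 1 divisor as statistics.variance.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    r = cov / (variance(x) ** 0.5 * variance(y) ** 0.5)
    return mx, my, variance(x), variance(y), r

summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, [round(v, 2) for v in stats])
```

- With the `anscombe` DataFrame already loaded, `anscombe.groupby("dataset").agg(["mean", "var"])` should give the same means and variances in one line.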

## But how do these similar-seeming datasets look when visualized?

### Data Set 1

In [None]:
sns.regplot(x='x', y='y', data=anscombe.query("dataset == 'I'"))

### Data Set 2

In [None]:
sns.regplot(x='x', y='y', data=anscombe.query("dataset == 'II'"), order=2)

### Data Set 3

In [None]:
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),
           robust=True, ci=None, scatter_kws={"s": 80})

### Data Set 4

In [None]:
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'IV'"),
           ci=None, scatter_kws={"s": 80})

## In summary, ALWAYS plot your data