## Visualising the tips dataset

The `pandas` functions above produced summary statistics for the dataset. The `seaborn` package will now be used to create visualisations that can help verify those summary statistics. `pandas` has some basic plotting functionality built on the `matplotlib` package; `seaborn` is also built on top of `matplotlib` and is closely integrated with `pandas` data structures.
 
Plots can highlight any obvious relationships between the different variables in the dataset. They can also be used to identify groups of observations that are clearly separate from other groups. There are many different ways to visualise this dataset using the seaborn library and no universally best way. I will look through the examples and documentation on <https://seaborn.pydata.org> to see which plots are most suitable for the tips dataset and to learn more about the seaborn plotting functions.

Most of the plots that follow in the notebook are based on the official [seaborn tutorial](https://seaborn.pydata.org/tutorial.html) and adapted for this project.
***
 
[Visualizing statistical relationships](https://seaborn.pydata.org/tutorial/relational.html#visualizing-statistical-relationships)
>Statistical analysis is a process of understanding how variables in a dataset relate to each other and how those relationships depend on other variables. Visualization can be a core component of this process because, when data are visualized properly, the human visual system can see trends and patterns that indicate a relationship.



Seaborn's `relplot()` is a figure-level function for visualizing statistical relationships using scatter plots or line plots. These are fairly simple two-dimensional plots, but further dimensions can be added: the hue (colour), size, and style semantics of the points can each represent an additional variable in the same plot.
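As a minimal sketch of these semantics, the call below maps three extra variables onto a single scatter plot. The small `tips` frame is a made-up stand-in that only borrows the column names of the tips dataset so the example runs on its own; in the notebook the real `tips` DataFrame would be used instead.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs as a script; omit inside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical stand-in rows (not the real data) with tips-style column names.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
    "smoker": ["No", "No", "Yes", "No", "Yes", "No", "Yes", "No"],
    "time": ["Dinner", "Dinner", "Lunch", "Dinner", "Lunch", "Dinner", "Lunch", "Dinner"],
    "size": [2, 3, 3, 2, 4, 4, 2, 4],
})

# x and y give the two plot dimensions; hue, style and size each encode
# one further variable through colour, marker shape and marker area.
g = sns.relplot(
    data=tips,
    x="total_bill",
    y="tip",
    hue="smoker",
    style="time",
    size="size",
    kind="scatter",
)
```

Mapping three semantics at once can quickly become hard to read, so in practice it may be worth starting with `hue` alone and adding the others only if the plot stays legible.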

Showing several semantic variables at once on a single plot is not always suitable; multiple plots may be more appropriate for showing the relationship between several variables. Instead of mapping additional variables to semantics, they can be used to facet the plot: multiple axes are created and a subset of the data is plotted on each one, using a `FacetGrid`.
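Faceting can be sketched by passing a variable to `col` rather than to a semantic such as `style`. Again, the small `tips` frame here is a made-up stand-in with tips-style column names, included only so the example is self-contained.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs as a script; omit inside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical stand-in rows (not the real data) with tips-style column names.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
    "smoker": ["No", "No", "Yes", "No", "Yes", "No", "Yes", "No"],
    "time": ["Dinner", "Dinner", "Lunch", "Dinner", "Lunch", "Dinner", "Lunch", "Dinner"],
})

# col="time" draws one subplot per level of `time`,
# while hue still distinguishes smokers within each subplot.
g = sns.relplot(data=tips, x="total_bill", y="tip", hue="smoker", col="time")

# The returned FacetGrid holds the axes in a 2-D array (rows x columns).
n_rows, n_cols = g.axes.shape
```

A `row=` argument can be used in the same way to facet along the other direction of the grid.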

The [Figure-level and axes-level functions](https://seaborn.pydata.org/introduction.html#figure-level-and-axes-level-functions) section of the seaborn introductory guide discusses the differences between seaborn plotting functions. 

In summary, "figure-level" functions are optimized for exploratory analysis: they set up the matplotlib figure containing the plot(s) and make it easy to spread the visualization across multiple axes. To do this they use a seaborn `FacetGrid`, which also takes care of details such as placing the legend outside the axes.

Scatter and line plots are used to visualise relationships between numerical variables.
A scatter plot shows the joint distribution of two variables, where each point represents one observation in the dataset, and can be used to spot relationships. Line plots are generally used to show changes in a variable as a function of time. However, this does not apply to the tips dataset, which has no true time variable: the `time` variable is just a binary categorical variable that records whether the meal was lunch or dinner.



 
 
The `pairplot()` function in seaborn shows scatter plots of each pair of numeric variables against each other. A kernel density estimate or histogram of each variable is displayed along the diagonal.

The `catplot()` function can show different representations of the relationship between one numeric variable and one (or more) categorical variables by specifying the `kind` of plot to use.
- `kind="swarm"` creates a scatter plot where the positions of the points along the categorical axis are adjusted to avoid overlapping points.
    `sns.catplot(x="day", y="total_bill", hue="smoker", kind="swarm", data=tips)`
- `kind="violin"` draws a kernel density estimate to represent the underlying distribution that the points are sampled from.
    `sns.catplot(x="day", y="total_bill", hue="smoker", kind="violin", split=True, data=tips)`
- `kind="bar"` shows only the mean value and its confidence interval within each nested category.
    `sns.catplot(x="day", y="total_bill", hue="smoker", kind="bar", data=tips)`


See also [Specialised categorical plots](https://seaborn.pydata.org/introduction.html#specialized-categorical-plots) in the seaborn introduction.

### Figure-level and axes-level functions

>Each different figure-level plot kind combines a particular “axes-level” function with the FacetGrid object. For example, the scatter plots are drawn using the scatterplot() function, and the bar plots are drawn using the barplot() function. These functions are called “axes-level” because they draw onto a single matplotlib axes and don’t otherwise affect the rest of the figure.

A figure-level function must control the figure it lives in, while axes-level functions can be combined into a more complex matplotlib figure with other axes that may or may not have seaborn plots on them.
Axes-level plotting functions take an `ax=` parameter, while figure-level functions do not. A figure-level function such as `relplot()` or `catplot()` returns a `FacetGrid`, while an axes-level function returns the `matplotlib` axes it drew on.
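The difference can be sketched by drawing two axes-level plots into a figure created with matplotlib and then calling a figure-level function, which creates its own figure. The `tips` frame is a made-up stand-in with tips-style column names, and the variable names are chosen only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs as a script; omit inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in rows (not the real data) with tips-style column names.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
    "time": ["Dinner", "Dinner", "Lunch", "Dinner", "Lunch", "Dinner", "Lunch", "Dinner"],
})

# Axes-level: we create the figure ourselves and tell each function where to draw.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))
scatter_ax = sns.scatterplot(data=tips, x="total_bill", y="tip", ax=ax_left)
bar_ax = sns.barplot(data=tips, x="time", y="tip", ax=ax_right)

# Figure-level: no ax= argument; relplot builds a new figure and returns a FacetGrid.
grid = sns.relplot(data=tips, x="total_bill", y="tip")

# The axes-level calls hand back the axes they were given,
# while the grid's axes belong to a separate figure.
same_axes = scatter_ax is ax_left and bar_ax is ax_right
separate_figure = grid.ax.figure is not fig
```

This is why axes-level functions are the natural choice when a seaborn plot has to sit alongside other matplotlib plots in one figure.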

