# Chapter 5 Notes

- Intro to matplotlib
  - Plotting in `pandas` and `seaborn` is powered by `matplotlib`; both libraries provide wrappers around `matplotlib`'s lower-level functionality
    - These wrappers offer a wide variety of visualizations that are quick to create, but more extensive customization requires working with `matplotlib` directly
  - The most commonly used `matplotlib` module is `pyplot`, typically imported as `plt`
  - When using `matplotlib`, `plt.show()` must be called after creating and customizing a visualization to actually display the plot
    - The magic command `%matplotlib inline` can be used in Jupyter Notebooks to bypass the need for `plt.show()`; the plot is shown when the cell is executed
  - `plt.plot()` generates a line plot by default
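    - x and y data are typically passed; if only one sequence is given, it is treated as y and plotted against its positions (0, 1, 2, ...)
    - A format string (e.g. `'r--'` or `'o'`) can be passed to change the line style or to draw markers only, which turns the plot into a scatter plot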
  - `plt.hist()` generates a histogram
    - The number of bins (or the bin edges) can be specified with the `bins` argument
  - Subplots can be generated using the `plt.subplots()` function
    - Common inputs to `plt.subplots()` are `nrows`, specifying the number of rows of subplots, `ncols`, specifying the column count, and `figsize`, specifying the overall size of the figure
      - Figure size defaults can be changed through `matplotlib.rcParams` (the `figure.figsize` key)
    - More complex layouts are also possible, such as a plot inside another plot
  - When creating a plot, two important objects are created: a `Figure` object and an `Axes` object
    - `Figure` objects encompass the entire plot, containing all of the elements
      - Ex. The figure contains the plot area along with the title and axis labels
    - `Axes` objects encompass the area where the data is plotted
  - The `fig.savefig()` method saves the figure to a file, so it can be used outside the workspace
  - It is recommended to run `plt.close('all')` after work is complete with the figures to free the memory used to create the figures
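
The matplotlib workflow above can be sketched roughly as follows. This is a minimal illustration, not the chapter's own code: the data is synthetic, the output file name is made up, and the `Agg` backend is an assumption so that the script runs without a display. It uses the `Axes` methods `plot()` and `hist()`, which take the same arguments as `plt.plot()` and `plt.hist()`.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, made up for illustration only
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = np.sin(x)
values = rng.normal(size=500)

# One row, two columns of subplots; figsize is (width, height) in inches
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# Line plot (the default), plus a marker-only series via a format string
axes[0].plot(x, y, "b--", label="dashed line")
axes[0].plot(x, y + 1, "ro", label="markers only")
axes[0].set_title("plot()")
axes[0].legend()

# Histogram with the number of bins set through the `bins` argument
counts, bin_edges, _ = axes[1].hist(values, bins=20)
axes[1].set_title("hist()")

# Save the figure to a file (hypothetical name in the temp directory),
# then free the memory held by all open figures
out_path = os.path.join(tempfile.gettempdir(), "ch5_matplotlib_sketch.png")
fig.savefig(out_path)
plt.close("all")
```

In an interactive session (outside Jupyter with `%matplotlib inline`), `plt.show()` would be called before `plt.close('all')` to display the figure.
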
- Plotting with pandas
  - The pandas `plot()` method allows a user to specify the type of plot with the `kind` argument to the method
  - Evolution over time
    - Passing `line` to the `kind` argument generates a line plot (this is the default)
      - If `x` is not specified, `plot()` uses the dataframe index; if `y` is not specified, all numeric columns are plotted
      - A list of columns can be passed to `y` to plot multiple lines on a single plot, with a matching list of format strings passed to the `style` argument
      - The `subplots` argument will instead create a separate subplot for each line
        - Axes can be shared across the subplots by passing booleans to the `sharex` or `sharey` arguments
        - Each subplot can show a different plot type by plotting on its `Axes` object individually
    - Passing `area` to the `kind` argument generates an area plot
      - Area plots are handy to show not only changes over time but individual component contributions to the total
  - Relationships between variables
    - Scatter plots are one way to show relationships between variables
      - Scatter plots can be generated by passing `scatter` to the `kind` argument
      - The x and y axes can be displayed on either a linear or a logarithmic scale (`logx`, `logy`), which avoids having to transform the data first
      - Scatter plots with lots of overlapping data can make it difficult to judge point density; reducing the `alpha` argument adds transparency to the points
        - A `hexbin` plot is a different way of visualizing this data, creating a two-dimensional histogram with hexagonal bins
    - A correlation matrix shows the magnitude (value) and direction (positive or negative) of the correlation
      - Plotting the correlation matrix as a heatmap helps visualize the coefficients
        - Best performed with a diverging colormap
  - Distributions
    - Distributions can be visualized by passing `hist` or `kde` to the `kind` argument
    - Kernel density estimates (KDEs) can be used with continuous data to determine value density, giving an estimated probability density function
      - KDEs and histograms can be plotted on the same axes by passing the same `Axes` object to the `ax` argument of each plot call
    - Cumulative distribution functions (CDFs) are used to show the probability of getting a value less than or equal to a given value
    - Box plots are useful to show outliers and distributions with quartiles
  - Counts and frequencies
    - Bar charts can be used with categorical data to show counts and frequencies of particular values
      - Horizontal bar charts are useful when category names are long
        - With these bar charts, the first item in the list of categories is plotted at the bottom of the chart
      - Vertical bar charts are useful when there are a lot of categories or there is some order to the categories
      - Grouped bar charts can be created by reshaping the data with the `groupby()` and `unstack()` methods before plotting
      - Stacked bar charts can be created by passing `True` to the `stacked` argument
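
Several of the `plot()` options above can be sketched in one place. The dataframes and their column names (`a`, `b`, `group`, `answer`) are hypothetical and filled with random data, and the `Agg` backend is an assumption so the code runs without a display. KDE plots are left out here because `kind='kde'` additionally requires SciPy.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data, made up for illustration only
rng = np.random.default_rng(0)
dates = pd.date_range("2021-01-01", periods=100, freq="D")
df = pd.DataFrame(
    {"a": rng.uniform(1, 2, size=100), "b": rng.uniform(1, 2, size=100)},
    index=dates,
)
survey = pd.DataFrame(
    {
        "group": rng.choice(["x", "y"], size=200),
        "answer": rng.choice(["yes", "no", "maybe"], size=200),
    }
)

# Evolution over time: the index is used for x since x is not given
line_ax = df.plot(kind="line", y=["a", "b"], style=["b-", "r--"])
sub_axes = df.plot(kind="line", subplots=True, sharex=True)
area_ax = df.plot(kind="area")

# Relationships: log-scaled x axis and transparency for overlapping points
scatter_ax = df.plot(kind="scatter", x="a", y="b", logx=True, alpha=0.25)
hexbin_ax = df.plot(kind="hexbin", x="a", y="b", gridsize=15)

# Correlation matrix as a heatmap with a diverging colormap centered on 0
fig, heat_ax = plt.subplots()
im = heat_ax.imshow(df.corr().to_numpy(), cmap="RdBu_r", vmin=-1, vmax=1)
fig.colorbar(im)

# Distributions: two histograms on the same Axes via the `ax` argument
hist_ax = df["a"].plot(kind="hist", bins=15, alpha=0.5)
df["b"].plot(kind="hist", bins=15, alpha=0.5, ax=hist_ax)
box_ax = df.plot(kind="box")

# Counts and frequencies: horizontal, grouped, and stacked bar charts
barh_ax = survey["answer"].value_counts().plot(kind="barh")
counts = survey.groupby(["group", "answer"]).size().unstack(fill_value=0)
grouped_ax = counts.plot(kind="bar")
stacked_ax = counts.plot(kind="bar", stacked=True)

plt.close("all")
```

Each call without an `ax` argument creates a new figure, which is why `plt.close('all')` is called at the end.
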
- pandas.plotting module
  - Scatter matrices can be created using the `scatter_matrix()` function
    - These types of plots show scatter plots for all of the different combinations of data and can be useful to quickly identify correlations in data
    - Often used in machine learning
    - The diagonals usually contain histograms, but they can be changed to KDEs with the `diagonal` argument
  - Lag plots, generated by the `lag_plot()` function, show the relationship between values at a given time and values a certain number of periods before that time
    - The default lag is 1, but it can be specified by passing a value to the `lag` argument
  - Autocorrelation plots, generated by the `autocorrelation_plot()` function, extend the idea of lag plots by showing the autocorrelation at many different lags
  - Bootstrap plots, generated by the `bootstrap_plot()` function, are used to assess the uncertainty of common summary statistics (mean, median, and midrange) by taking a specified number of random samples (with replacement) from the variable in question and calculating the statistics for each sample
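
A minimal sketch of the four `pandas.plotting` functions, assuming random data and hypothetical column names (`a`, `b`, `c`); the `Agg` backend is again an assumption so the code runs without a display. The scatter matrix keeps the default histogram diagonal, since `diagonal='kde'` additionally requires SciPy.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import (
    autocorrelation_plot,
    bootstrap_plot,
    lag_plot,
    scatter_matrix,
)

# Hypothetical data, made up for illustration only
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
series = pd.Series(rng.normal(size=200).cumsum())  # random walk

# Scatter plots for every pair of columns, histograms on the diagonal
matrix_axes = scatter_matrix(df, figsize=(8, 8))

# Lag plot (5 periods) and autocorrelation plot, side by side
fig, (lag_ax, acf_ax) = plt.subplots(1, 2, figsize=(10, 4))
lag_plot(series, lag=5, ax=lag_ax)
autocorrelation_plot(series, ax=acf_ax)

# Bootstrap plot: 100 samples of 50 values each
boot_fig = bootstrap_plot(series, size=50, samples=100)

plt.close("all")
```

A random walk is used for `series` so that the lag plot shows a clear pattern; for purely random data the lag plot would look like an unstructured cloud.
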
