## Matplotlib

Matplotlib is a Python library for creating plots and charts. Visualization is an important step in machine learning: it makes it much easier to understand how data is split and how predictions are made. This is the last notebook that you can skip on a first pass, because most of its content is explained again in the Machine Learning notebooks.

<img src="./resources/matplotlib.webp" style="height: 75px"/>

To get started we need to install Matplotlib.

In [None]:
%pip install matplotlib

## 1. Intro to pyplot

matplotlib.pyplot is a collection of command-style functions. Each pyplot function changes the current figure in some way: for example, it creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, or decorates the plot with labels.

In [None]:
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([1, 4, 9, 16])

import matplotlib.pyplot as plt

plt.plot(x, y)
plt.xlabel('Some numbers')
plt.ylabel('Other numbers')
plt.show()

plot() accepts an optional third argument, the format string, which sets the color and the marker or line type of the plot. For example, 'ro' draws the data above as red (r) circles (o). The axis() function takes a list of [xmin, xmax, ymin, ymax] and sets the viewport of the axes.

In [None]:
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([1, 4, 9, 16])

import matplotlib.pyplot as plt

plt.plot(x, y, 'ro')
plt.axis([0, 6, 0, 20])
plt.show()
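
The format string combines a color letter with a marker or line-style code. As a small sketch (the data and the three format strings are chosen only for illustration), several series can be drawn in a single plot() call, each with its own style:

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data, not taken from the cells above
t = np.arange(0.0, 5.0, 0.5)

# Each (x, y, format) triple adds one line:
# 'r--' = red dashed line, 'bs' = blue squares, 'g^' = green triangles
lines = plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
plt.show()
```

plot() returns one line object per series, so lines holds three entries here.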

## 2. Plotting with categorical variables

It is also possible to create a plot using categorical variables.

In [None]:
names = ['apples', 'oranges', 'bananas']
values = [9, 13, 3]

plt.figure(figsize=(10, 3)) # width = 10, height = 3

plt.subplot(131) # 1 row, 3 columns, put this subplot in the first column
plt.bar(names, values)
plt.subplot(132) # 1 row, 3 columns, put this subplot in the second column
plt.scatter(names, values)
plt.subplot(133) # 1 row, 3 columns, put this subplot in the third column
plt.plot(names, values)
plt.suptitle('Categorical Plotting')
plt.show()
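
The same figure can also be built with plt.subplots(), which returns the figure and an array of axes in one call. This is an alternative to the plt.subplot(131) style used above, not a replacement for it, sketched here with the same data:

```python
import matplotlib.pyplot as plt

names = ['apples', 'oranges', 'bananas']
values = [9, 13, 3]

# 1 row, 3 columns; axes is an array holding the three subplots
fig, axes = plt.subplots(1, 3, figsize=(10, 3))

axes[0].bar(names, values)      # first column
axes[1].scatter(names, values)  # second column
axes[2].plot(names, values)     # third column
fig.suptitle('Categorical Plotting')
plt.show()
```

Working with explicit axes objects tends to be easier to follow once a figure has more than a couple of subplots, because every call names the subplot it draws on.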

## 3. Pandas

Another great thing about Pandas is that it integrates with Matplotlib, so you can plot directly from DataFrames and Series. First, let's load the movie data once again.

In [None]:
import pandas as pd

movies_df = pd.read_csv("resources/IMDB-Movie-Data.csv", index_col="Title")

movies_df.rename(columns={
        'Runtime (Minutes)': 'Runtime', 
        'Revenue (Millions)': 'Revenue_millions'
    }, inplace=True)

movies_df.columns = [col.lower() for col in movies_df.columns]

Let's plot the relationship between ratings and revenue. All we need to do is call .plot() on movies_df with some info about how to construct the plot.

In [None]:
movies_df.plot(kind='scatter', x='rating', y='revenue_millions', title='Revenue (millions) vs Rating');

What's with the semicolon? It's not a syntax error, just a way to hide the *<matplotlib.axes.subplots.AxesSubplot at 0x26613b5cc18>* output when plotting in Jupyter notebooks. You can see the difference when you omit the semicolon.

If we want to plot a simple histogram of a single column, we can call .plot() on that column.

In [None]:
movies_df['rating'].plot(kind='hist', title='Rating');
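
To see what a histogram computes, the following sketch counts values per bin with np.histogram(). The ratings are made-up numbers for illustration, not values from the movie data set, and the choice of three bins is arbitrary.

```python
import numpy as np

# Hypothetical ratings, invented for this example
ratings = np.array([5.2, 6.1, 6.6, 6.8, 7.0, 7.2, 7.4, 7.9, 8.1, 8.3])

# Split the range [min, max] into 3 equal-width bins and count the values in each
counts, edges = np.histogram(ratings, bins=3)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.2f} to {right:.2f}: {count} ratings")
```

Pandas uses 10 bins by default; passing bins=... to .plot(kind='hist') changes that.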

Do you remember the .describe() example from Pandas? There is a graphical representation of the quartiles and the interquartile range it reports, called the boxplot. Let's recall what describe() gives us on the rating column.

In [None]:
movies_df['rating'].describe()

Using a boxplot we can visualize this data.

In [None]:
movies_df['rating'].plot(kind="box");

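A boxplot is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. Note that by default Matplotlib draws the whiskers only as far as the most extreme data points within 1.5 times the interquartile range of the box, and plots anything beyond that as individual outlier points.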
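
The five-number summary can also be computed by hand. The sketch below uses np.percentile() on made-up ratings (not values from the movie data set) and then applies the 1.5 × IQR rule, which is Matplotlib's default for deciding which points count as outliers.

```python
import numpy as np

# Hypothetical ratings, invented for this example
ratings = np.array([2.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 9.0])

# Five-number summary: minimum, Q1, median, Q3, maximum
minimum, q1, median, q3, maximum = np.percentile(ratings, [0, 25, 50, 75, 100])

# Interquartile range: the height of the box
iqr = q3 - q1

# Points outside these fences are drawn as outliers
# (1.5 is Matplotlib's default whisker factor)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ratings[(ratings < lower_fence) | (ratings > upper_fence)]

print("min, Q1, median, Q3, max:", minimum, q1, median, q3, maximum)
print("IQR:", iqr)
print("outliers:", outliers)
```

With these numbers the rating 2.0 falls below the lower fence, so a boxplot would show it as a separate point rather than as the end of the lower whisker.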

<img src="./resources/boxplot.gif" style="height: 450px"/>