# Lesson: Matplotlib and Google Colab

## Introduction

There are amazing Python packages that support visualizations.

- [Matplotlib](https://matplotlib.org/)
- [Plotly](https://plotly.com/python/)
- [Seaborn](https://seaborn.pydata.org/)

We'll use Matplotlib. 

Matplotlib is common in Jupyter notebooks, and its primary focus is data science. We don't usually present Matplotlib visualizations to executives and managers (unless they want to see them); we present them to data professionals. As data professionals, we want to focus on the technical details: stats, distributions, trends, and model analysis. A lot of these visualizations are one-offs. We can noodle around, analyze, and move on to the next thing. If a visualization proves consistently useful, we probably want to include it in a dashboard.

[Google Colab (Colaboratory)](https://colab.research.google.com/) is a data science, machine learning-focused environment that uses Pandas, Matplotlib, and [PyTorch](https://pytorch.org/docs/stable/index.html) (tensors, neural networks, LLMs).

<blockquote>
<p><strong>What is Colab?</strong></p>
Colab, or "Colaboratory", allows you to write and execute Python in your browser, with
<ul>
<li>Zero configuration required</li>
<li>Access to GPUs free of charge</li>
<li>Easy sharing</li>
</ul>
</blockquote>

### Learning Outcomes

When you've finished this lesson and its exercises, you should be able to:

- Integrate Pandas and Matplotlib.
- Generate a scatter plot in a Jupyter notebook.
- Generate a bar chart with Pandas and Matplotlib.
- Write code that implements a histogram.

## Set Up

Upload both your `lesson.ipynb` and `cars93.csv` to [Colab](https://colab.research.google.com/).

## Example

In the Colab environment, Pandas, Matplotlib, NumPy, and a huge number of other packages come preinstalled.

We can list those packages using `pip list`. A _bang_ (exclamation point) prefix is required so the notebook runs it as a shell command.

In [None]:
!pip list
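
If we only care about a few packages, scrolling through the full `pip list` output is tedious. A minimal sketch, assuming we want to check versions from Python itself, uses the standard library's `importlib.metadata`. The helper name `installed_version` is hypothetical and not part of this lesson's notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Print the version of each package this lesson relies on.
for name in ["pandas", "numpy", "matplotlib"]:
    print(name, installed_version(name))
```

A `None` result means the package is not installed in the current environment.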

In [147]:
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [148]:
# Extract cars93 data
df = pd.read_csv("cars93.csv")
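
Before plotting, it can help to confirm that the columns we plan to use were actually loaded. The sketch below uses a tiny in-memory CSV with illustrative rows in place of `cars93.csv`; only the column names come from this lesson, and the `required` list is an assumption about what the later cells need.

```python
import io

import pandas as pd

# Illustrative rows standing in for cars93.csv (not the real file).
sample_csv = io.StringIO(
    "Weight,MPG.city,MPG.highway,Price,Cylinders\n"
    "2705,25,31,15.9,4\n"
    "3560,18,25,33.9,6\n"
)
sample = pd.read_csv(sample_csv)

# Columns used by the plots in this lesson.
required = ["Weight", "MPG.city", "MPG.highway", "Price", "Cylinders"]
missing = [col for col in required if col not in sample.columns]

print(sample.dtypes)
print("Missing columns:", missing)
```

With the real file, the same check would run against `df` instead of `sample`.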

### Scatter Plot: Weight vs MPG

In Excel, our scatter plot required a lot of UI manipulation.

With Pandas and Matplotlib, it takes about 10 statements to generate the chart (not counting comments and line wrapping)!

In [None]:
# Weight vs MPG.city
ax = df.plot.scatter(
    x="Weight",
    y="MPG.city",
    alpha=0.5,
    label="MPG (city)",
    legend=True,
)
# add a trend line
# y = mx + b, m = slope, b = y-intercept
m, b = np.polyfit(df["Weight"], df["MPG.city"], 1)
ax.plot(df["Weight"], m * df["Weight"] + b, color="red")

# Weight vs MPG.highway
df.plot.scatter(
    x="Weight", y="MPG.highway", color="darkorange", label="MPG (highway)", ax=ax
)
# add a trend line
m, b = np.polyfit(df["Weight"], df["MPG.highway"], 1)
ax.plot(df["Weight"], m * df["Weight"] + b, color="blue")

ax.set_title("Weight vs MPG")
ax.set_xlabel("Weight")
ax.set_ylabel("MPG")

plt.show()
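
The trend lines above rely on `np.polyfit` with degree 1, which returns the coefficients highest power first, so the slope comes before the intercept. A minimal sketch with made-up points that lie exactly on a known line shows the fit recovering that line:

```python
import numpy as np

# Made-up points that lie exactly on y = -2x + 10.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = -2.0 * x + 10.0

# Degree-1 fit: returns (slope, intercept).
m, b = np.polyfit(x, y, 1)

# Evaluate the fitted line at each x, as the scatter plot cell does.
trend = m * x + b
print(f"slope={m:.2f}, intercept={b:.2f}")
```

The fitted slope should come out close to -2 and the intercept close to 10. On real data such as weight and MPG, the points will not sit on the line, and the fit minimizes the squared vertical distances instead.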

### Box and Whisker: Cylinders vs Price

Pandas DataFrames and Series have a built-in integration with Matplotlib.

- `df.boxplot`
- `df.plot.scatter`
- `df.plot.bar`
- `df.plot.hist`

In [None]:
# box and whisker plot
df.boxplot(column="Price", by="Cylinders")

# add a title
plt.suptitle("")
plt.title("Cylinders vs Price")
plt.ylabel("Price")
plt.grid(axis="x")

plt.show()

When we use the Matplotlib package directly, we need to do a little more data manipulation work, so we'll just settle on DataFrame integration.

In [None]:
cylinders_sorted_unique = sorted(df["Cylinders"].unique())
cylinder_price = [
    df[df["Cylinders"] == cyl]["Price"] for cyl in cylinders_sorted_unique
]
# tick_labels needs Matplotlib 3.9+; older versions use labels= instead
plt.boxplot(cylinder_price, tick_labels=cylinders_sorted_unique)
plt.title("Cylinders vs Price")
plt.xlabel("Cylinders")
plt.ylabel("Price")
plt.grid(axis="y")
plt.show()
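
Each box summarizes a group's quartiles, with the median drawn as the line inside the box. To see those numbers rather than read them off the chart, one option is a grouped `quantile` call. This is a sketch on made-up prices, not the real dataset:

```python
import pandas as pd

# Made-up data: two cylinder groups with three prices each.
sample = pd.DataFrame(
    {
        "Cylinders": ["4", "4", "4", "6", "6", "6"],
        "Price": [10.0, 12.0, 14.0, 20.0, 25.0, 30.0],
    }
)

# One row per cylinder group, one column per quantile.
quartiles = (
    sample.groupby("Cylinders")["Price"].quantile([0.25, 0.5, 0.75]).unstack()
)
print(quartiles)
```

The 0.25 and 0.75 columns correspond to the bottom and top edges of each box, and the 0.5 column is the median.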

### Bar Chart: Cylinders vs Car Count

The `value_counts` method counts the records (cars) for each cylinder value.

In Excel, we need to generate a pivot table. That's a little fussy.

In [None]:
df["Cylinders"].value_counts().plot.bar(
    title="Cylinders vs Car Count", xlabel="Cylinders", ylabel="Car Count"
)
plt.grid(axis="y", alpha=0.5)
plt.show()
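
Note that `value_counts` orders its result by count, largest first, so the bars are not necessarily in cylinder order. If ordering by category is preferred, `sort_index` can be chained before plotting. A small sketch on made-up values:

```python
import pandas as pd

# Made-up cylinder values.
cylinders = pd.Series(["6", "6", "6", "4", "4", "8"])

# Default: sorted by count, descending.
counts = cylinders.value_counts()

# Alternative: sorted by the cylinder label.
counts_by_label = counts.sort_index()

print(counts)
print(counts_by_label)
```

Calling `.plot.bar()` on `counts_by_label` would draw the bars in label order instead of count order.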

### Bar Chart: Weight vs Average Price

Our DataFrame visualizations made it look easy, but this data manipulation is a little more complex. With Excel, our grouping had a min weight of 1695, a max weight of 4105, and an interval of 250. We use the same approach in Python.

Our bins and labels are each generated from a list comprehension. The `pd.cut` function generates a new column that assigns each car's weight to a labeled bin. Then we group by weight interval and average the price (`mean`).

Finally, we render the resulting Series as a bar chart.

In [None]:
# group by bin
weight_min = df["Weight"].min()
weight_max = df["Weight"].max()
interval = 250

# separate bins
bins = [value for value in range(weight_min, weight_max, interval)]
if bins[-1] < weight_max:
    bins.append(weight_max)
labels = [f"{bins[i]}-{bins[i + 1]}" for i in range(len(bins) - 1)]

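# include_lowest=True keeps the lightest car in the first bin;
# by default the left edge of the first bin is excluded
df["Weight_Interval"] = pd.cut(
    df["Weight"], bins=bins, labels=labels, include_lowest=True
)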
# new Series
series_weight_interval_avg_price = df.groupby("Weight_Interval", observed=False)[
    "Price"
].mean()

series_weight_interval_avg_price.plot.bar()
plt.title("Weight vs Avg Price")
plt.xlabel("Weight Intervals")
plt.ylabel("Avg Price")
plt.grid(axis="y", alpha=0.5)
plt.xticks(rotation=45)
plt.show()
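
One detail of `pd.cut` worth knowing: bins are closed on the right by default, so a value equal to the very first bin edge falls outside every bin unless `include_lowest=True` is passed. A minimal sketch with made-up weights shows the difference:

```python
import pandas as pd

# Made-up weights; the first value equals the lowest bin edge.
weights = pd.Series([1695, 1800, 2000, 2195])
bins = [1695, 1945, 2195]
labels = ["1695-1945", "1945-2195"]

# Default: intervals are (1695, 1945] and (1945, 2195], so 1695 is unassigned.
default_cut = pd.cut(weights, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well.
inclusive_cut = pd.cut(weights, bins=bins, labels=labels, include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

In the first result the value 1695 comes back as missing (`NaN`), while in the second it lands in the `1695-1945` bin.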

### Histogram: Price Frequency

We didn't look at histograms in Excel (we could always generate one from quantitative data), but this histogram is pretty straightforward.

- column (`y`): `df["Price"]`
- bins: 10 (equal width)

In [None]:
df.plot.hist(
    y="Price",
    bins=10,
    legend=False,
    title="Price Frequency",
    xlabel="Price",
    ylabel="Frequency",
)

plt.grid(axis="y", alpha=0.5)
plt.show()
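
To see what `bins=10` means numerically, `np.histogram` performs the same kind of equal-width binning and returns the counts and bin edges without drawing anything. This sketch uses made-up prices, not the real dataset:

```python
import numpy as np

# Made-up prices ranging from 8 to 58.
prices = np.array([8.0, 10.0, 11.0, 12.0, 15.0, 16.0, 18.0, 20.0, 35.0, 58.0])

# 10 equal-width bins between the min and max: 11 edges, 10 counts.
counts, edges = np.histogram(prices, bins=10)

print("bin width:", edges[1] - edges[0])
print("counts:", counts)
```

Every value lands in exactly one bin, so the counts add up to the number of prices.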

Our price distribution skews right: most cars fall in the lower price bins, with a long tail toward higher prices.
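
Skew direction is easy to misread from a chart, so a numeric check can help. A sketch on made-up prices compares the mean with the median and calls the pandas `skew` method; a positive value indicates a longer tail on the right, a negative value a longer tail on the left.

```python
import pandas as pd

# Made-up prices: mostly low values plus a couple of expensive outliers.
prices = pd.Series([8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 40.0, 60.0])

mean_price = prices.mean()
median_price = prices.median()
skewness = prices.skew()

print(f"mean={mean_price:.1f}, median={median_price:.1f}, skew={skewness:.2f}")
```

With the real data, the same check would be `df["Price"].skew()`.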