# The cars dataset - Graphs

The cars dataset is a basic dataset of some cars and their fuel economy. Many versions of this dataset are available; we'll be using the `mpg` version that ships with R's ggplot2 package.

In this notebook we'll import the dataset and use it to draw some graphs. In another notebook we'll clean it up. Normally you'd combine both in one notebook.

In [None]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/mjochen/Beobank_course_material/main/Data_science/5%20Pandas%20introduction/Notebooks/files/mpg.csv")  
    
df.head(10) 

The import worked, but the first column is wrong: the CSV already contains an index, which pandas read as an ordinary column while adding a new index of its own.
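
The same effect can be reproduced on a tiny inline CSV. This is only a sketch: the two rows and the empty first header are made up for illustration and are an assumption about how such a file is laid out, not a copy of `mpg.csv`. Passing `index_col=0` tells pandas to use the first column as the index instead of creating a new one.

```python
import io

import pandas as pd

# Hypothetical two-row CSV whose first column is an exported index (empty header)
csv_text = ",manufacturer,hwy\n1,audi,29\n2,audi,31\n"

# Default: the exported index becomes an extra column
without_index = pd.read_csv(io.StringIO(csv_text))
print(without_index.columns.tolist())  # ['Unnamed: 0', 'manufacturer', 'hwy']

# index_col=0: the first column is used as the index
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(with_index.columns.tolist())     # ['manufacturer', 'hwy']
print(with_index.index.tolist())       # [1, 2]
```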

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/mjochen/Beobank_course_material/main/Data_science/5%20Pandas%20introduction/Notebooks/files/mpg.csv", index_col=0) 
df.head(10)

Perfect! Now let's create some graphs.

![](https://raw.githubusercontent.com/mjochen/Beobank_course_material/main/Data_science/5%20Pandas%20introduction/Notebooks/files/2022-08-24-09-54-06.png)

First, make sure matplotlib is installed.

In [None]:
!pip install matplotlib

## Scatterplots

To start off, we'll be looking at how two variables relate to each other. We'll be using the scatterplot for this.

We'll plot the highway miles per gallon against the engine displacement (how big the engine is).

In [None]:
df.plot(x="displ", y="hwy", kind="scatter")

This is a scatterplot. It draws every car as a point, positioned by its engine displacement and its highway miles per gallon. The `kind` argument accepts the following types of graph:


1. "area" is for area plots.
1. "bar" is for vertical bar charts.
1. "barh" is for horizontal bar charts.
1. "box" is for box plots.
1. "hexbin" is for hexbin plots.
1. "hist" is for histograms.
1. "kde" is for kernel density estimate charts.
1. "density" is an alias for "kde".
1. "line" is for line graphs.
1. "pie" is for pie charts.
1. "scatter" is for scatter plots.
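
As a quick illustration of switching between these, the sketch below draws the same data with two other values of `kind`. The small `sample` frame is a hypothetical stand-in with made-up values, so the cell runs without downloading anything; with the real data you would call `df.plot(...)` in the same way.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the cars data (values are made up)
sample = pd.DataFrame({
    "displ": [1.8, 2.0, 2.8, 3.1, 4.2, 5.3],
    "hwy": [29, 31, 26, 27, 23, 20],
})

# A histogram of a single column
ax_hist = sample["hwy"].plot(kind="hist", bins=3, title="Histogram of hwy")

# One box plot per numeric column (drawn in a new figure)
ax_box = sample.plot(kind="box", title="Box plots of displ and hwy")
```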

You could also import matplotlib separately and show the graphs from there.

In [None]:
import matplotlib.pyplot as plt

plt.scatter(df["displ"], df["hwy"])
plt.show()

Maybe the class of the car has an impact on this graph. Could we color the dots according to the class?

In [None]:
plt.scatter(df.displ, df.hwy, c=pd.factorize(df['class'])[0])
plt.gca().set(xlabel='Engine displacement', ylabel='Highway miles per gallon', title='About cars')

Good, but what do the colors stand for?

In [None]:
import matplotlib.patches

levels, categories = pd.factorize(df['class'])
colors = [plt.cm.tab10(i) for i in levels] # using the "tab10" colormap
handles = [matplotlib.patches.Patch(color=plt.cm.tab10(i), label=c) for i, c in enumerate(categories)]

plt.scatter(df.displ, df.hwy, c=colors)
plt.gca().set(xlabel='Engine displacement', ylabel='Highway miles per gallon', title='About cars')
plt.legend(handles=handles, title='Class')
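
The coloring relies on `pd.factorize`, which turns each distinct value into an integer code and also returns the distinct values themselves. A minimal sketch on a hypothetical list of classes (not taken from the dataset):

```python
import pandas as pd

# Hypothetical class labels, just to show what factorize returns
classes = pd.Series(["compact", "suv", "compact", "pickup"])

codes, uniques = pd.factorize(classes)
print(codes)          # [0 1 0 2] -> one integer per row
print(list(uniques))  # ['compact', 'suv', 'pickup'] -> what each integer stands for
```

The codes pick a color per dot, and the unique values provide the labels for the legend.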

There seems to be something of a trend in this data. Could we show this?

In [None]:
# https://www.statology.org/matplotlib-trendline/

import numpy as np

plt.scatter(df.displ, df.hwy)

#calculate equation for trendline
z = np.polyfit(df.displ, df.hwy, 1)
p = np.poly1d(z)

#add trendline to plot
plt.plot(df.displ, p(df.displ), color="red")

Maybe a second-order trendline? This means the fitted equation contains a term in x², not just in x. The line is slightly curved, and the higher the order, the more it can bend. (With the risk of overfitting, but that's a different story.)

In [None]:
# https://www.statology.org/matplotlib-trendline/

plt.scatter(df.displ, df.hwy)

#calculate equation for trendline
z = np.polyfit(df.displ, df.hwy, 2)
p = np.poly1d(z)

#add trendline to plot
plt.plot(df.displ.sort_values(), p(df.displ.sort_values()), color="red")

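Note how we had to sort the values before plotting them? That's because `plt.plot` connects the points in the order it receives them, so the x values have to be sorted for the curve to run smoothly from left to right. (The straight first-order line didn't need this: jumping back and forth along a straight line still draws the same line.) Remove the sorting to see what happens!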
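
The difference can be shown side by side. The sketch below uses randomly generated, hypothetical data rather than the cars dataset, so the names `x`, `y` and the coefficients are assumptions chosen only to give a curved trend.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: unsorted x values with a curved relationship plus noise
rng = np.random.default_rng(0)
x = rng.uniform(1.5, 7.0, size=50)
y = 40 - 6 * x + 0.5 * x**2 + rng.normal(0, 1, size=50)

# Second-order trendline, fitted the same way as above
p = np.poly1d(np.polyfit(x, y, 2))

fig, (left, right) = plt.subplots(1, 2, figsize=(12, 5))

# Unsorted: the line jumps back and forth between points in data order
left.scatter(x, y)
left.plot(x, p(x), color="red")
left.set_title("Unsorted x")

# Sorted: the line runs from left to right as one smooth curve
x_sorted = np.sort(x)
right.scatter(x, y)
right.plot(x_sorted, p(x_sorted), color="red")
right.set_title("Sorted x")
```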

More information:

* [realpython.com/pandas-plot-python/](https://realpython.com/pandas-plot-python/)

## Distributions and boxplots

After comparing two variables, we'll be looking at just one variable at a time. This is done using a histogram or a boxplot.

In [None]:
plt.hist(df.hwy)

A very important consideration in this type of diagram is the bin width. It determines how many values are grouped together in one bar. A small bin width gives you very detailed (but noisy) information; a large one gives more of a general overview. There are several rules of thumb for choosing it, like the [Freedman–Diaconis rule](https://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule).

Also note how we show multiple diagrams at once using a subplot.
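
As a rough sketch of that rule: the bin width is 2 × IQR / n^(1/3), where IQR is the interquartile range and n the number of samples. The example below works it out by hand on hypothetical, randomly generated values (not the cars data) and compares it with NumPy's own `"fd"` option of `np.histogram_bin_edges`.

```python
import numpy as np

# Hypothetical sample, generated only for this illustration
rng = np.random.default_rng(1)
values = rng.normal(loc=25, scale=5, size=200)

# Freedman-Diaconis by hand: width = 2 * IQR / n^(1/3)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
bin_width = 2 * iqr / len(values) ** (1 / 3)
print(f"bin width by hand: {bin_width:.2f}")

# NumPy applies the same rule when asked for "fd" bins
edges = np.histogram_bin_edges(values, bins="fd")
print(f"number of bins chosen by NumPy: {len(edges) - 1}")
```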

In [None]:
figure, axis = plt.subplots(1, 3, figsize=(15,5))
# plt.figure(figsize=(1, 3))

axis[0].hist(df.hwy, bins=3)
axis[0].title.set_text("Not enough")
axis[1].hist(df.hwy, bins=15)
axis[1].title.set_text("About right")
axis[2].hist(df.hwy, bins=30)
axis[2].title.set_text("Too much")

plt.show()

These diagrams look way nicer when using data that actually has a normal distribution, like the height of the people in a group, or the weight of cars, or the lifespan of a raccoon in the wild.

Let's generate some nice data and show it in a histogram and a boxplot.

In [None]:
np.random.seed(1) # seed the random number generator, so the same random numbers are generated on every run

data = np.random.normal(loc=0, scale=1, size=200)

figure, axis = plt.subplots(1, 2, figsize=(12,5))

axis[0].hist(data, bins=45)
axis[1].boxplot(data)
plt.show()


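A boxplot is a more compact way of looking at a distribution. It shows:

* the centerline (the median or middle value; matplotlib draws it in orange by default)
* the main body (the actual box, which runs from the first to the third quartile, so 50% of all samples are here)
* the whiskers (the lines extending from the box; roughly 25% of the samples lie above the box and 25% below it, and by default each whisker stops at the last sample within 1.5 times the box height)
* the outliers (the dots, data beyond the whiskers that is not within the expected range)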
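
The parts of a boxplot can also be computed by hand, which makes clear what the drawing encodes. This is a minimal sketch on hypothetical normally distributed values; the factor 1.5 is matplotlib's default for `whis` and is used here as an assumption about the plot settings.

```python
import numpy as np

# Hypothetical sample, generated only for this illustration
rng = np.random.default_rng(1)
sample = rng.normal(loc=0, scale=1, size=200)

# Centerline and box: the median and the first and third quartiles
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1  # height of the box

# Whisker limits: at most 1.5 box heights beyond the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Everything beyond the limits is drawn as an outlier
inside = sample[(sample >= lower_limit) & (sample <= upper_limit)]
outliers = sample[(sample < lower_limit) | (sample > upper_limit)]

print(f"median: {median:.2f}, box: {q1:.2f} to {q3:.2f}")
print(f"whiskers reach: {inside.min():.2f} to {inside.max():.2f}")
print(f"number of outliers: {len(outliers)}")
```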

Boxplots are easier to analyse than histograms when you have multiple categories, like with our cars. What is the engine displacement when separated by type (or class) of car?

In [None]:
df.groupby('class').displ.hist(alpha=0.5)

A short command, but the resulting graph is hard to read because the overlapping histograms hide each other. What would this look like using multiple boxplots?

By the way: up until now we've used only matplotlib, either directly or through pandas. That's because it is the basic package that started it all. Now we'll take a step up and start using seaborn, which is based on matplotlib but is easier to use and has more functionality. You could, as an exercise, try to get this plot using only matplotlib.

In [None]:
!pip install seaborn

In [None]:
import seaborn as sns

sns.boxplot(data=df, x='class', y='displ')

And why not a violinplot to finish up?

In [None]:
sns.violinplot(x='class', y='displ', data=df)