# Lecture 14 Graphing
__Math 3080: Fundamentals of Data Science__

Reading:
* [McKinney: Chapter 9](https://wesmckinney.com/book/plotting-and-visualization)
* [Irizarry: Chapter 7](https://rafalab.dfci.harvard.edu/dsbook-part-1/dataviz/distributions.html)

-----
## Intro to Exploratory Data Analysis
We all want to do __machine learning__. However, we can't just jump in; we have to prepare the data first. The usual workflow is:
* Obtain data
* Clean data
* Analyze data
* Prepare data
* Model
* Evaluate

We have learned how to obtain and clean the data. Now we are going to analyze it. We often call this stage __exploratory data analysis__ (EDA).

Why do we do EDA?
* Explore how we can use each variable
  * What variables are related to each other?
  * What variables can be used?
  * Are there any patterns/trends in the data?
* Make a plan for our model

What is involved in EDA?
* Graphing
* Statistics
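
As a rough illustration of the statistics side of EDA, the sketch below summarizes a tiny table and checks how two variables are related. The DataFrame and its column names (`hours`, `score`, `group`) are made up for this example and are not part of the course datasets.

```python
import pandas as pd

# Hypothetical toy data, invented for this sketch (not from the lecture)
df = pd.DataFrame({
    'hours': [1, 2, 3, 4, 5, 6],
    'score': [52, 55, 61, 70, 74, 83],
    'group': ['A', 'B', 'A', 'B', 'A', 'B'],
})

print(df.describe())               # count, mean, std, quartiles of numeric columns
print(df['group'].value_counts())  # how often each category appears

# Pearson correlation between the two numeric variables
corr = df[['hours', 'score']].corr()
print(corr)
```

A correlation close to 1 or -1 suggests two variables are strongly related, which is the kind of pattern a graph can then confirm visually.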

Let's find out how to make effective graphs in Python. Today, we will just make graphs. The next two classes will focus on more intricate graphing (including interactive graphs) and analysis.

-----
## Graphing with Matplotlib

`matplotlib` is the basic package we use for data visualization. There are other packages that are built on matplotlib (like `seaborn`), and other packages which are independent of matplotlib (like some interactive graphs we'll see later). Today, we'll learn the basics of matplotlib.

First, load matplotlib. Let's also create a simple dataset we can work with:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 101)
print(x)
y = x**2 + np.random.randn(len(x))/50
print(y)

Now, we need to create a frame for our figure and a set of axes:

In [None]:
fig = plt.figure()
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8]) # left, bottom, width, height (range 0 to 1)
plt.show()
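
The four numbers passed to `add_axes` are fractions of the figure, not inches. A small sketch, assuming a 6 by 4 inch figure (a size chosen only for this example), reads the position back and converts it to inches:

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 4))          # 6 inches wide, 4 inches tall
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])   # left, bottom, width, height

box = ax.get_position()                   # bounding box in figure fractions
print(box.x0, box.y0, box.width, box.height)

# Convert the fractional width and height to inches
fig_w, fig_h = fig.get_size_inches()
print(box.width * fig_w, box.height * fig_h)
plt.show()
```

Here the axes cover 80% of the figure in each direction, so they are about 4.8 inches wide and 3.2 inches tall.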

From here, we can go ahead and graph our data, or we can set up the environment some more. Here is a quick graph using our data.

In [None]:
fig = plt.figure()
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])
ax.plot(x,y)
plt.show()

We can add information to the graph. For example, let's add a $y=x^2$ line, a title, axis labels, and a legend. And let's make the $y=x^2$ line dashed.

In [None]:
fig = plt.figure()
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])

ax.plot(x,y, label='original data')
ax.plot(x,x**2,c='orange', linestyle='dashed', label='$y=x^2$')

plt.xlabel('The x value')
plt.ylabel('The y value')
plt.title('Random data following the $y=x^2$ line')

ax.legend(loc='lower right')
plt.show()
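
The `plt.xlabel`, `plt.ylabel`, and `plt.title` calls act on whichever axes are currently active. A sketch of the same graph using the equivalent methods on `ax` itself is shown below; the seeded generator is an addition so that the noise is reproducible.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)             # seeded so the noise is reproducible
x = np.linspace(0, 1, 101)
y = x**2 + rng.standard_normal(len(x))/50

fig = plt.figure()
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8])

ax.plot(x, y, label='original data')
ax.plot(x, x**2, c='orange', linestyle='dashed', label='$y=x^2$')

# Axes-method equivalents of plt.xlabel, plt.ylabel, and plt.title
ax.set_xlabel('The x value')
ax.set_ylabel('The y value')
ax.set_title('Random data following the $y=x^2$ line')

ax.legend(loc='lower right')
plt.show()
```

Calling the methods on a specific axes object removes any doubt about which axes gets labeled once a figure holds more than one.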

Now, let's say that we want an inset, like a zoom-in view, of the range $x=[0.4,0.5]$.

Note that `figsize=(10,5)` also resizes the figure to 10 inches wide by 5 inches tall.

In [None]:
fig = plt.figure(figsize=(10,5))
ax = fig.add_axes([0.1, 0.1, 0.8, 0.8]) # left, bottom, width, height (range 0 to 1)

ax.plot(x,y, label='original data')
ax.plot(x,x**2,c='orange', linestyle='dashed', label='$y=x^2$')

plt.xlabel('The x value')
plt.ylabel('The y value')
plt.title('Random data following the $y=x^2$ line')

ax_inset = fig.add_axes([0.2, 0.6, 0.3, 0.25]) # left, bottom, width, height (range 0 to 1)

ax_inset.plot(x,y)
ax_inset.plot(x,x**2,c='orange')
ax_inset.set_xlim([0.4,0.5])
ax_inset.set_ylim([0.1,0.3])

ticks = ax_inset.set_xticks([0.4,0.42,0.44,0.46,0.48,0.5])
labels = ax_inset.set_xticklabels(['','0.42','0.44','0.46','0.48',''])

ax.legend(loc='lower right')
plt.show()
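
As an alternative sketch (not the approach used in the cell above), matplotlib also provides `Axes.inset_axes`, which positions the inset relative to the parent axes, and `Axes.indicate_inset_zoom`, which outlines the zoomed region. The inset position used here is an arbitrary choice for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)             # seeded so the noise is reproducible
x = np.linspace(0, 1, 101)
y = x**2 + rng.standard_normal(len(x))/50

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x, y, label='original data')
ax.plot(x, x**2, c='orange', linestyle='dashed', label='$y=x^2$')

# Inset position is given as fractions of the parent axes, not the figure
ax_inset = ax.inset_axes([0.1, 0.55, 0.35, 0.35])
ax_inset.plot(x, y)
ax_inset.plot(x, x**2, c='orange')
ax_inset.set_xlim(0.4, 0.5)
ax_inset.set_ylim(0.1, 0.3)

# Outline the zoomed region on the main axes and connect it to the inset
ax.indicate_inset_zoom(ax_inset, edgecolor='black')

ax.legend(loc='lower right')
plt.show()
```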

-----
## Seaborn and Graph Types
Other packages that are built on Matplotlib can simplify the process. We are going to look at the `seaborn` package. Let's load an actual dataset with both numerical and categorical data to see how these work.

In [None]:
import seaborn as sns
tips = sns.load_dataset('tips')
tips.head()

Now, let's make a scatterplot of the total bill compared to the tip left by the customer.

In [None]:
fig = plt.figure()
ax = fig.add_axes([0.1,0.1,0.8,0.8])

ax.scatter(tips['total_bill'], tips['tip'])

plt.title('Tips as a function of the Total Bill')
plt.xlabel('Total Bill ($US)')
plt.ylabel('Tip Amount ($US)')

plt.show()

In [None]:
sns.scatterplot(data=tips, x='total_bill', y='tip')

plt.title('Tips as a function of the Total Bill')
plt.xlabel('Total Bill ($US)')
plt.ylabel('Tip Amount ($US)')

plt.show()
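
Because seaborn is built on matplotlib, `sns.scatterplot` returns the matplotlib `Axes` it drew on, so the labels can also be set on that object. The sketch below uses a small hand-made DataFrame as a stand-in for `tips`; its values are invented.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical stand-in for the tips data (values invented for this sketch)
df = pd.DataFrame({
    'total_bill': [10.5, 18.2, 23.7, 31.0, 42.4],
    'tip':        [1.5, 3.0, 3.3, 4.5, 6.0],
})

ax = sns.scatterplot(data=df, x='total_bill', y='tip')  # returns a matplotlib Axes

# One call sets the title and both axis labels
ax.set(title='Tips as a function of the Total Bill',
       xlabel='Total Bill ($US)',
       ylabel='Tip Amount ($US)')
plt.show()
```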

One thing that seaborn can do is color each point by a third variable. We can do this in matplotlib, but it takes one `scatter` call per category; in seaborn it takes a single `hue` argument.

In [None]:
fig = plt.figure()
ax = fig.add_axes([0.1,0.1,0.8,0.8])

ax.scatter(tips[tips['time'] == 'Lunch']['total_bill'],
           tips[tips['time'] == 'Lunch']['tip'],
           c='orange',
           label='Lunch')
ax.scatter(tips[tips['time'] == 'Dinner']['total_bill'],
           tips[tips['time'] == 'Dinner']['tip'],
           c='blue',
           label='Dinner')

plt.title('Tips as a function of the Total Bill')
plt.xlabel('Total Bill ($US)')
plt.ylabel('Tip Amount ($US)')

plt.legend()
plt.show()

In [None]:
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='time')

plt.title('Tips as a function of the Total Bill')
plt.xlabel('Total Bill ($US)')
plt.ylabel('Tip Amount ($US)')

plt.show()
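
When a variable has many categories, writing one matplotlib `scatter` call per category by hand gets tedious. One possible sketch loops over `groupby` instead; the DataFrame is an invented stand-in for `tips`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the tips data (values invented for this sketch)
df = pd.DataFrame({
    'total_bill': [10.5, 18.2, 23.7, 31.0, 42.4, 15.8],
    'tip':        [1.5, 3.0, 3.3, 4.5, 6.0, 2.0],
    'time':       ['Lunch', 'Dinner', 'Dinner', 'Dinner', 'Dinner', 'Lunch'],
})

fig, ax = plt.subplots()

# One scatter call per category; matplotlib picks the next color each time
for name, group in df.groupby('time'):
    ax.scatter(group['total_bill'], group['tip'], label=name)

ax.set_title('Tips as a function of the Total Bill')
ax.set_xlabel('Total Bill ($US)')
ax.set_ylabel('Tip Amount ($US)')
ax.legend(title='time')
plt.show()
```

This is roughly the bookkeeping that `hue='time'` handles automatically in seaborn.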

Other options that can be applied in both matplotlib and seaborn:
* Transparency
* Size

In [None]:
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='time',
                alpha=0.7, size='size')

plt.title('Tips as a function of the Total Bill')
plt.xlabel('Total Bill ($US)')
plt.ylabel('Tip Amount ($US)')

plt.show()
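
For comparison, a sketch of the same two options in plain matplotlib: `alpha` sets transparency and `s` sets the marker area in points squared. The data is an invented stand-in for `tips`, and the scale factor of 30 is an arbitrary choice to make the size differences visible.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the tips data (values invented for this sketch)
df = pd.DataFrame({
    'total_bill': [10.5, 18.2, 23.7, 31.0, 42.4],
    'tip':        [1.5, 3.0, 3.3, 4.5, 6.0],
    'size':       [2, 2, 3, 4, 6],
})

fig, ax = plt.subplots()
sc = ax.scatter(df['total_bill'], df['tip'],
                s=df['size'] * 30,   # marker area grows with party size
                alpha=0.7)           # 0 is fully transparent, 1 is opaque

ax.set_title('Tips as a function of the Total Bill')
ax.set_xlabel('Total Bill ($US)')
ax.set_ylabel('Tip Amount ($US)')
plt.show()
```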

So far, each figure has held a single graph. A figure can also hold several graphs: we can use insets, as we saw above, or a grid of subplots. Here, we place three subplots in a row. Let's also see how to make:
* countplots (bar graphs)
* boxplots

In [None]:
fig, ax = plt.subplots(1,3, figsize=(13,4)) # ax is an array of 3 axes

sns.scatterplot(data=tips, x='total_bill', y='tip', hue='sex', alpha=0.7, ax=ax[0])
ax[0].set_title('Bill vs. Tip')

sns.countplot(data=tips, x='day', ax=ax[1])
ax[1].set_title('Meals by day')

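# One box per day; putting 'day' on the y-axis also works in seaborn versions
# that reject hue without both x and y
sns.boxplot(data=tips, x='tip', y='day', ax=ax[2])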
ax[2].set_title('Tips')
plt.show()

Now, we'll make a 2x3 grid of subplots. Let's also see how to make:
* Histograms
* KDE graphs (1D and 2D)
* Heatmaps

In [None]:
fig, ax = plt.subplots(2,3, figsize=(13,8)) # ax is a matrix of 6 axes

sns.scatterplot(data=tips, x='total_bill', y='tip', hue='sex', alpha=0.7, ax=ax[0,0])
ax[0,0].set_title('Bill vs. Tip')

sns.countplot(data=tips, x='day', ax=ax[0,1], hue='sex')
ax[0,1].set_title('Meals by day')

# Horizontal barplot
sns.countplot(data=tips, y='time', hue='sex', ax=ax[0,2])
ax[0,2].set_title('Meals by time')

sns.histplot(data=tips, x='tip', ax=ax[1,0], hue='sex', kde=True)
ax[1,0].set_title('Tips')

# KDE plot
sns.kdeplot(data=tips, x='total_bill', hue='sex', ax=ax[1,1])
ax[1,1].set_title('Total Bill')

# 2D KDE plot
sns.kdeplot(data=tips, x='size', y='tip', hue='sex', ax=ax[1,2])
ax[1,2].set_title('Tips based on party size')

plt.subplots_adjust(hspace=0.4, wspace=0.3)
plt.show()
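
The grid above does not include a heatmap. A minimal sketch, assuming we want to display a correlation matrix, is shown below; the randomly generated columns only stand in for the numeric columns of `tips`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical numeric data standing in for tips (generated, not real)
rng = np.random.default_rng(0)
total_bill = rng.uniform(5, 50, 100)
df = pd.DataFrame({
    'total_bill': total_bill,
    'tip': 0.15 * total_bill + rng.normal(0, 1, 100),
    'size': rng.integers(1, 7, 100),
})

corr = df.corr()   # 3x3 matrix of pairwise Pearson correlations

fig, ax = plt.subplots(figsize=(5, 4))
# annot=True prints each value in its cell; vmin/vmax pin the color scale to [-1, 1]
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap='coolwarm', ax=ax)
ax.set_title('Correlation heatmap')
plt.show()
```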

-----
## Student Play Time
Can we predict a wizard's/witch's Hogwarts House based on hair color? eye color? ancestry?
* https://github.com/drolsonmi/math3080/tree/main/Datasets and select [HarryPotterCharacters.csv](https://raw.githubusercontent.com/drolsonmi/math3080/main/Datasets/HarryPotterCharacters.csv)
* https://www.kaggle.com/datasets/gulsahdemiryurek/harry-potter-dataset?select=shortversioncharacters.csv

Import a dataset that has both quantitative and categorical variables, then graph it.

Still to cover:
* Scatterplots
* Histograms
* Boxplots
* Barplot (`.bar()` and `.barh()`)
* Timeseries
* Subplots
* Seaborn
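
As a starting point for the plain-matplotlib graph types in this list, here is a rough sketch of `.bar()`, `.barh()`, `.hist()`, and `.boxplot()` side by side. The category names, counts, and random values are all invented for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.normal(loc=3, scale=1, size=200)   # hypothetical numeric data
categories = ['A', 'B', 'C']                    # hypothetical category names
counts = [12, 7, 15]                            # hypothetical counts

fig, ax = plt.subplots(1, 4, figsize=(14, 3))

ax[0].bar(categories, counts)        # vertical bars
ax[0].set_title('.bar()')

ax[1].barh(categories, counts)       # horizontal bars
ax[1].set_title('.barh()')

ax[2].hist(values, bins=20)          # histogram with 20 bins
ax[2].set_title('.hist()')

ax[3].boxplot(values)                # box-and-whisker plot
ax[3].set_title('.boxplot()')

plt.show()
```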

```python
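import matplotlib.pyplot as plt

fig, ax = plt.subplots(1,3, sharey=True) # 1 row, 3 columns; all panels share one y-axis
ax[0].plot() # Left panel (placeholder: pass x and y data to draw)
ax[1].plot() # Middle panel
ax[2].plot() # Right panel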
```

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2,3, sharey=False, sharex=True) # 2 rows, 3 columns; panels share the x-axis
ax[0,0].plot() # Row 0, Column 0
ax[0,1].plot() # Row 0, Column 1
ax[1,0].plot() # Row 1, Column 0
ax[1,1].plot() # Row 1, Column 1
```
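
The templates above leave the `plot()` calls empty. A filled-in sketch, assuming we simply want to draw $y=x^p$ for $p=1,\dots,6$, loops over every panel with `axes.flat` rather than indexing each one by hand:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 101)

fig, axes = plt.subplots(2, 3, sharex=True, sharey=True, figsize=(9, 5))

# axes.flat walks the grid row by row; p is the exponent drawn in each panel
for p, panel in enumerate(axes.flat, start=1):
    panel.plot(x, x**p)
    panel.set_title(f'$y=x^{p}$')

plt.show()
```

With `sharex=True` and `sharey=True`, all six panels use the same axis limits, which makes the curves directly comparable.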

In [None]:
import pandas as pd
co2 = pd.read_csv('../Datasets/co2_mm_mlo.csv', header=40)
co2.head()

In [None]:
co2['date'] = co2['year'] + (co2['month']-1)/12
co2.head()
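
To see what the formula in the cell above does, here is the same arithmetic on three made-up rows. Each month adds one twelfth of a year, so January maps to the start of the year.

```python
import pandas as pd

# Hypothetical rows; only the year and month columns matter here
demo = pd.DataFrame({'year': [1958, 1958, 1959], 'month': [3, 12, 1]})

# Same formula as above: January adds 0/12, February 1/12, ..., December 11/12
demo['date'] = demo['year'] + (demo['month'] - 1)/12
print(demo)
```

March 1958 becomes $1958 + 2/12 \approx 1958.167$, December 1958 becomes $1958 + 11/12 \approx 1958.917$, and January 1959 becomes exactly $1959$, so the values increase steadily across the year boundary.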

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plot against the 'date' column computed in the previous cell
sns.lineplot(data=co2, x='date', y='average')
plt.show()