# Week 6
# Introduction to Python Plotting Tools

Making informative visualizations of data is one of the most important tasks in data analysis. Visualizations help us:
- Learn the distribution of data
- Explore trends and patterns in data
- Identify outliers
- Generate ideas for modeling
- Present your findings

Today, we will study how to create several of the most frequently used types of plots in Python.
- Scatter plots
- Bar plots
- Histograms
- Pie plots
- Box plots

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Scatter Plots
A **scatter plot** uses dots to represent values for two numerical variables. Each dot represents one observation: its horizontal position gives the value of one variable and its vertical position gives the value of the other. Scatter plots are helpful for identifying relationships between variables.

In [None]:
# A simple example of scatter plots
# Source: https://www.who.int/growthref/hfa_boys_5_19years_z.pdf?ua=1
heights_boys = pd.DataFrame({'Age': range(5, 20),
                   'Height': [110, 116, 122, 127, 133, 137, 143, 149, 156, 163, 169, 173, 175, 176, 176.5]})
heights_boys

In [None]:
# Plot Age vs. Height (by default, plt.plot connects the points with a line)
plt.plot(heights_boys['Age'], heights_boys['Height'])

In [None]:
# Draw red dots ('r.') instead of a line, and add a title and axis labels
plt.plot(heights_boys['Age'], heights_boys['Height'], 'r.')
plt.title("Average Height for Boys")
plt.xlabel("Age")
plt.ylabel("Height (cm)")
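
The cells above draw dots by passing a marker format string such as `'r.'` to `plt.plot`. Matplotlib also has a dedicated scatter function. The following is a minimal sketch of the same figure using `Axes.scatter`; the variable names `ages`, `heights_cm`, and `points` are illustrative and not part of the original notebook.

```python
import matplotlib.pyplot as plt

# Same WHO reference values as in the example above
ages = list(range(5, 20))
heights_cm = [110, 116, 122, 127, 133, 137, 143, 149,
              156, 163, 169, 173, 175, 176, 176.5]

fig, ax = plt.subplots()
# scatter() draws one marker per (x, y) pair and never connects them
points = ax.scatter(ages, heights_cm, color='red', marker='.')
ax.set_title("Average Height for Boys")
ax.set_xlabel("Age")
ax.set_ylabel("Height (cm)")
```

Either approach works for simple figures; `scatter` becomes more convenient when marker size or color should vary from point to point.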

In [None]:
# Multiple sequences of data
heights = pd.DataFrame({'Age': range(5, 20),
                        'BoyHeight': [110, 116, 122, 127, 133, 137, 143, 149, 156, 163, 169, 173, 175, 176, 176.5],
                        'GirlHeight': [109.6, 115, 121, 126.5, 132.5, 139, 145, 151, 156, 160, 161.7, 162.5, 162.8, 163, 163.2]})
heights

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(heights['Age'], heights['BoyHeight'], 'r^', label="Boys")
plt.plot(heights['Age'], heights['GirlHeight'], 'gs', label='Girls')
plt.title("Average Heights")
plt.xlabel("Age")
plt.ylabel("Height (cm)")
plt.legend()

**Q: What can we see from this plot?**

## Bar Plots

**Bar plots** are useful for comparing values across labeled categories.

In [None]:
df = pd.DataFrame([[67, 76],
                   [78, 87],
                   [89, 98],
                   [90, 95]],
                  index=['Alice', 'Bob', 'Clare', 'David'],
                  columns=['Midterm', 'Final'])
df

In [None]:
df['Midterm'].plot.bar()

In [None]:
df[['Midterm', 'Final']].plot.bar()

In [None]:
df[['Midterm', 'Final']].plot.bar(stacked=True)

In [None]:
df[['Midterm', 'Final']].plot.barh(stacked=True)

## Histograms
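**Histograms** are useful for showing the distribution of a variable.
- Each bar covers a range of values (a *bin*).
- The height of each bar represents the number of data points in the corresponding range.
- By convention in NumPy and Matplotlib, each bin includes its left edge but not its right edge, so a value that falls exactly on a boundary is counted in the bar to its right. The last bin is the exception: it also includes its right edge.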
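
How boundary values are counted can be checked directly with `np.histogram`, which uses the same binning as Matplotlib's histogram. The sketch below uses hypothetical values chosen so that several of them fall exactly on a bin edge.

```python
import numpy as np

# Hypothetical data: 1.0 sits on the edge shared by both bins,
# and 2.0 sits on the right edge of the last bin
values = np.array([0.0, 0.5, 1.0, 1.0, 1.5, 2.0])
edges = [0, 1, 2]  # two bins: [0, 1) and [1, 2]

counts, bin_edges = np.histogram(values, bins=edges)
print(counts)  # [2 4]
```

Both copies of 1.0 land in the second bin because bins are closed on the left and open on the right. The value 2.0 is still counted because the last bin includes its right edge.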

In [None]:
# Generate 100 values using np.random.rand()
df = pd.DataFrame(np.random.rand(100), columns=['Rand'])
df

In [None]:
df['Rand'].hist()

In [None]:
# Add a column 'Randn' with values generated by np.random.randn()
df['Randn'] = np.random.randn(100)
df

In [None]:
# df['Randn'].hist()
df['Randn'].plot.hist()

In [None]:
df[['Rand', 'Randn']].plot.hist(alpha=0.5)

## Pie Plots

**Pie plots** are useful for showing each category's share of a total.

In [None]:
df = pd.DataFrame([5, 10, 20, 7, 3],
                  index=['A', 'B', 'C', 'D', 'F'],
                  columns=['Students'])
df

In [None]:
df['Students'].plot.pie(autopct='%.2f', figsize=(6, 6))
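
The numbers that `autopct` prints on each wedge are percentages of the total. As a rough check, they can be computed by hand. This sketch rebuilds the same counts in a standalone Series; the names `students` and `shares` are illustrative.

```python
import pandas as pd

# Same grade counts as in the data frame above
students = pd.Series([5, 10, 20, 7, 3],
                     index=['A', 'B', 'C', 'D', 'F'])

# Each wedge's percentage is its count divided by the total count
shares = students / students.sum() * 100
print(shares.round(2))
```

With 45 students in total, grade C accounts for 20 / 45, or about 44.44% of the pie.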

## Box Plots

**Box plots** are used for depicting groups of numerical data through their quartiles.

- Upper edge of the box: third quartile (75% of the data are below this value)
- Lower edge of the box: first quartile (25% of the data are below this value)
- Middle line: median (50% of the data are below this value)
- Upper whisker: the largest value within 1.5 times the *interquartile range* (IQR, the height of the box) above the upper edge; values above this whisker are considered outliers
- Lower whisker: the smallest value within 1.5 times the IQR below the lower edge; values below this whisker are considered outliers
- Dots: outliers
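
The elements listed above can be computed by hand. This is a minimal sketch, assuming the common 1.5 × IQR rule for outliers (Matplotlib's default) and a small hypothetical data set; names such as `lower_fence` and `upper_fence` are illustrative.

```python
import numpy as np

# Hypothetical data with one unusually large value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Lower edge, middle line, and upper edge of the box
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# Values beyond these limits are drawn as individual dots
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

# Whiskers end at the most extreme values that are not outliers
inside = data[(data >= lower_fence) & (data <= upper_fence)]
lower_whisker, upper_whisker = inside.min(), inside.max()

print(q1, median, q3)                # 3.25 5.5 7.75
print(outliers)                      # [30]
print(lower_whisker, upper_whisker)  # 1 9
```

Note that the whiskers stop at actual data values (1 and 9 here), not at the computed limits of -3.5 and 14.5.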

In [None]:
# Import the iris dataset
from sklearn import datasets
iris_raw = datasets.load_iris()
iris_raw

<img src="https://gadictos.com/wp-content/uploads/2019/03/iris-machinelearning-1060x397.png">

In [None]:
# Turn the data into a data frame
iris_df = pd.DataFrame(data=iris_raw['data'],
                       columns=iris_raw['feature_names'])
iris_df

In [None]:
iris_df.plot.box()

In [None]:
# Add target labels
iris_df['Target'] = iris_raw['target']
iris_df

In [None]:
iris_raw['target_names']

**1. What is the distribution of sepal length for each type of iris?**

**2. What is the distribution of petal length and width for each type of iris?**

**3. What is the distribution of sepal length and width for each type of iris?**

**4. What is the distribution of petal length and width for each type of iris?**

**5. Based on the above observations, can you come up with a simple rule for classification?**

**6. Can you show the accuracy of your classification rules on each type of iris?**