# Lesson 6:  Visualization

Adapted from material by Ani Adhikari, Suraj Rampure, Fernando Pérez, Josh Hug, and Narges Norouzi


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
births = pd.read_csv('data/baby.csv')

In [None]:
births.head()

In [None]:
births.shape

# Visualizing Distributions: Qualitative Variables
## Bar Plots

We often use bar plots to display distributions of a categorical variable:

In [None]:
babies = births['Maternal Smoker'].value_counts()


Some basic plotting functionality is built directly into Pandas. 

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html 

For example, you can call `.plot(kind='bar')` on the Series of counts for a qualitative variable:

In [None]:
births['Maternal Smoker'].value_counts().plot(kind='bar')

**Practice:  Make a horizontal bar plot of the distribution of the Maternal Smoker variable:**

In [None]:
...

### Matplotlib 

`matplotlib` is a comprehensive library for creating static, animated, and interactive visualizations in Python. 

https://matplotlib.org/ 

Its `pyplot` interface is modeled on the plotting paradigm of MATLAB. We will typically import `matplotlib.pyplot` with the alias `plt`.

We can draw a bar plot in matplotlib using

`plt.bar(x, height)`

where `x` and `height` are arrays such that the bars are positioned at `x` with heights given by `height`.

In [None]:
import matplotlib.pyplot as plt

plt.bar(babies.index,babies.values);


You can use matplotlib's `barh` to make horizontal bar plots:

In [None]:
plt.barh(babies.index,babies.values);

### Seaborn

https://seaborn.pydata.org/

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
We will usually import seaborn with the alias `sns`.

`countplot` draws one bar per category of a qualitative variable, with height equal to the number of rows in that category:

In [None]:
import seaborn as sns

sns.countplot(data = births, x = 'Maternal Smoker');

### Plotly
https://plotly.com/python/getting-started/

The plotly Python library is an interactive, open-source plotting library that supports over 40 unique chart types covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional use-cases.

Built on top of the Plotly JavaScript library (plotly.js), plotly enables Python users to create beautiful interactive web-based visualizations that can be displayed in Jupyter notebooks, saved to standalone HTML files, or served as part of pure Python-built web applications using Dash.

In [None]:
import plotly.express as px
px.histogram(births, x = 'Maternal Smoker', color = 'Maternal Smoker')

In [None]:
px.bar(births,  y="Maternal Smoker", color = "Maternal Smoker")

# Visualizing Distributions:  Quantitative Variables

In [None]:
sns.countplot(data = births, x = 'Maternal Pregnancy Weight');

In [None]:
sns.histplot(data = births, x = 'Maternal Pregnancy Weight');

In [None]:
sns.histplot(data = births, x = 'Maternal Pregnancy Weight', stat='density', kde = True);


**Practice:  Use seaborn to create a density histogram showing the distribution of the babies' birth weights.
Include the kernel density estimate (KDE) curve on your histogram:**

In [None]:
...

In [None]:
# We can also use plotly
px.histogram(births, x = 'Maternal Pregnancy Weight')

**Practice:  How many data points have baby weights in the interval [110, 115) oz?**

### Percentiles

The nth percentile is the value q such that n% of the data values fall at or below it.
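
As a small check of this definition, the sketch below computes the 10th percentile of a made-up array (the values are hypothetical, not taken from `births`) and then measures the share of values at or below it. Note that `np.percentile` interpolates between data values by default, so on real data the share is only approximately n%.

```python
import numpy as np

# Hypothetical toy data (10 values), used only for illustration
values = np.array([4, 8, 15, 16, 23, 42, 50, 61, 77, 90])

# 10th percentile; NumPy linearly interpolates between neighbors by default
p10_toy = np.percentile(values, 10)

# Share of the data at or below that value
share_at_or_below = np.mean(values <= p10_toy)

print(p10_toy, share_at_or_below)
```

Here only the smallest of the ten values falls at or below the computed percentile, which matches the 10% in the definition.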

In [None]:
p10 = np.percentile(births['Maternal Pregnancy Weight'], 10)

# Temporarily label each row by its position relative to the 10th percentile
births['category'] = None
births.loc[(births['Maternal Pregnancy Weight'] <= p10), 'category'] = 'At or below the 10th percentile'
births.loc[(births['Maternal Pregnancy Weight'] > p10), 'category'] = 'Above the 10th percentile'

sns.histplot(births, x = 'Maternal Pregnancy Weight', hue = 'category', bins = 35, stat='density');

# Remove the helper column so that births is left unchanged
births.drop(columns = ['category'], inplace = True)

# Mark the 10th percentile on the graph
plt.scatter(p10, -.001, marker='^', color='red', s=400)

print(p10)

In [None]:
q1, median, q3 = np.percentile(births['Maternal Pregnancy Weight'], [25, 50, 75])
iqr = q3 - q1

births['category'] = None
births.loc[(births['Maternal Pregnancy Weight'] <= q1) | (births['Maternal Pregnancy Weight'] >= q3), 'category'] = 'Outside of the middle 50%'
births.loc[(births['Maternal Pregnancy Weight'] > q1) & (births['Maternal Pregnancy Weight'] < q3), 'category'] = 'In the middle 50%'

sns.histplot(births, x = 'Maternal Pregnancy Weight', hue = 'category', bins = 35
             , stat = "density");

births.drop(columns = ['category'], inplace = True)


plt.scatter(q1, -.001, marker='^', color='orange', s=400)

plt.scatter(median, -.001, marker='^', color='red', s=400)

plt.scatter(q3, -.001, marker='^', color='green', s=400)

display([q1, median, q3])

## Box Plots

In [None]:
plt.figure(figsize = (3, 6))
sns.boxplot(data = births, y = 'Maternal Pregnancy Weight');

In [None]:
bweights = births['Maternal Pregnancy Weight']
q1 = np.percentile(bweights, 25)
q2 = np.percentile(bweights, 50)
q3 = np.percentile(bweights, 75)
iqr = q3 - q1


q1, q2, q3

**Practice:  Create a boxplot of the distribution of the babies' birth weights**

In [None]:
...

**Practice:  Calculate the IQR of the babies' birth weights**

In [None]:
...

## Violin Plots

In [None]:
plt.figure(figsize = (3, 6))
sns.violinplot(data = births, y = 'Maternal Pregnancy Weight');


In [None]:
#You can put a boxplot inside a violin plot...

px.violin(births, y = "Maternal Pregnancy Weight", box=True, width = 350, height = 450)

## Describing Distributions

In [None]:
median = births['Maternal Pregnancy Weight'].median()
mean = births['Maternal Pregnancy Weight'].mean()

print("Median", median)
print("Mean", mean)

# Visualizing Relationships Between Variables

## Relationships Between 2 Quantitative Variables


If both features are quantitative, then we often
examine their relationship with a scatter plot.
Each point in a scatter plot
marks the position of a pair of values for an observation.
So we can think of a scatter plot as a two-dimensional rug plot.

With scatter plots, we look for linear and simple nonlinear relationships, and we examine the strength of the relationships.
We also look to see if a transformation of one or the other or both features leads to a linear relationship.
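
To illustrate the last point, here is a minimal sketch using synthetic data (not the `births` table). `y` is generated from a power law, so the relationship between `x` and `y` is curved, but the relationship between `log(x)` and `log(y)` is a straight line. The constants 3 and 2 are arbitrary choices for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data following y = 3 * x^2 (arbitrary, hypothetical constants)
x = np.linspace(1, 100, 50)
y = 3 * x ** 2

# After a log transformation of both features, the points fall on a line:
# log(y) = log(3) + 2 * log(x)
slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].scatter(x, y)
axes[0].set_title('Original scale')
axes[1].scatter(np.log(x), np.log(y))
axes[1].set_title('Log-log scale');
```

The fitted slope recovers the exponent and the intercept recovers the log of the multiplier, which is one way to confirm that the transformed relationship is linear.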

## Scatter Plots

In [None]:
births.head()

In [None]:
plt.scatter(births['Maternal Height'], births['Birth Weight']);
plt.xlabel('Maternal Height')
plt.ylabel('Birth Weight');

Most `matplotlib` and `seaborn` plotting functions accept a `data=` keyword. In this mode, you can refer to x and y by the names of columns in the `data` DataFrame instead of passing each Series explicitly:

In [None]:
sns.scatterplot(data = births, x = 'Maternal Height', y = 'Birth Weight');
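
The same `data=` mode is available in matplotlib itself. The sketch below uses a small hypothetical table (the numbers are invented; only the column names mirror `births`) and passes column names to `Axes.scatter`:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy table; the column names mirror those in `births`
toy_xy = pd.DataFrame({
    'Maternal Height': [60, 62, 64, 66, 68],
    'Birth Weight': [108, 115, 120, 124, 131],
})

fig, ax = plt.subplots()
# With data=, the x and y arguments are column names, not Series
points = ax.scatter('Maternal Height', 'Birth Weight', data=toy_xy)
ax.set_xlabel('Maternal Height')
ax.set_ylabel('Birth Weight');
```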

In [None]:
# We can add some "jittering" to the data to help deal with overplotting

sns.stripplot(data = births, x = 'Maternal Height', y = 'Birth Weight', jitter = 0.25);

## Hex Plots and Contour Plots

In [None]:
sns.jointplot(data = births, x = 'Maternal Height', y = 'Birth Weight');

In [None]:
sns.jointplot(data = births, x = 'Maternal Height', y = 'Birth Weight', kind = 'hex');

In [None]:
sns.jointplot(data = births, x = 'Maternal Height', y = 'Birth Weight', kind = 'kde', fill = True);

## Relationships Between Two Qualitative Variables

With two qualitative features, we often compare the distribution of one feature
across subgroups defined by the other feature. In effect, we hold one feature constant
and plot the distribution of the other one. To do this, we can use some of the same plots
we used to display the distribution of one qualitative feature, such as a line plot or 
bar plot.
As an example, let's examine the relationship between the suitability of a breed for children and the size of the breed. 

In [None]:
dogs = pd.read_csv('data/akc.txt')

kids = {1:"high", 2:"medium", 3:"low"}
dogs["kids"] = dogs['children'].map(kids)

dogs

To examine the relationship between these two qualitative features, we calculate three sets of
proportions (one each for low, medium, and high suitability). 
Within each suitability category, we find the proportion of small, medium, and large dogs.
These proportions are displayed in the following table. Notice that each column sums to 1 (equivalent to 100%):

In [None]:
def proportions(series):
    # Series.sum() skips missing values (NaN), unlike the built-in sum()
    return series / series.sum()

counts = (dogs.groupby(['kids', 'size'])
 .size()
 .rename('count')
)

prop_table = (counts
 .unstack(level=1)
 .reindex(['high', 'medium', 'low'])
 .apply(proportions, axis=1)
)

prop_table_t= prop_table.transpose()

In [None]:
prop_table_t
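
As a cross-check on a table like this one, the sketch below builds a small hypothetical data frame (the rows are invented, not taken from `data/akc.txt`) and computes within-category proportions with `pd.crosstab` and `normalize='columns'`, an alternative to the `groupby`/`unstack` approach above. Each column then sums to 1.

```python
import pandas as pd

# Hypothetical toy data with the same two qualitative features
toy_dogs = pd.DataFrame({
    'kids': ['high', 'high', 'high', 'medium', 'medium', 'low', 'low', 'low', 'low'],
    'size': ['large', 'medium', 'small', 'large', 'small', 'small', 'small', 'small', 'medium'],
})

# Rows are sizes, columns are suitability levels;
# normalize='columns' divides each column by its total
toy_props = pd.crosstab(toy_dogs['size'], toy_dogs['kids'], normalize='columns')

print(toy_props)
print(toy_props.sum(axis=0))  # each column total should be 1
```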

The line plot that follows provides a visualization of these proportions.
There is one "line" (set of connected dots) for each suitability level.
The connected dots give the breakdown of size within a suitability category.
We see that breeds with low suitability for kids are primarily small:

In [None]:
fig = px.line(prop_table_t, y=prop_table_t.columns, 
        x=prop_table_t.index, line_dash='kids',
        markers=True, width=500, height=250)

fig.update_layout(
    yaxis_title="proportion", xaxis_title="Size",
    legend_title="Suitability <br>for children"
)

We can also present these proportions as a collection of side-by-side bar plots as shown here:

In [None]:
fig = px.bar(prop_table_t, y=prop_table_t.columns, x=prop_table_t.index,
        barmode='group', width=500, height=250)

fig.update_layout(
    yaxis_title="proportion", xaxis_title="Size", 
    legend_title="Suitability <br>for children"
)

## Relationships Between Mixed Variables

## Side-by-Side Box Plots and Violin Plots

### Overlaid Histograms

In [None]:
# OPTION 1: Using displot
sns.displot(data = births, x = 'Birth Weight', stat = 'density', hue = 'Maternal Smoker');

In [None]:
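# OPTION 2: Using Matplotlib

# Split the birth weights by the mother's smoking status
non_smoker = births[births["Maternal Smoker"] == False]
smoker = births[births["Maternal Smoker"] == True]

# Overlay the two density histograms; alpha < 1 keeps both visible
plt.hist(non_smoker["Birth Weight"], density=True, alpha=0.5, label="Non-smoker")
plt.hist(smoker["Birth Weight"], density=True, alpha=0.7, label="Smoker")
plt.legend();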

In [None]:
sns.displot(data = births, x = 'Birth Weight', kde = True, stat = 'density', hue = 'Maternal Smoker');

In [None]:
sns.displot(data = births, x = 'Birth Weight', kind = 'kde', hue = 'Maternal Smoker');

In [None]:
plt.figure(figsize=(5, 8))
sns.boxplot(data = births, x = 'Maternal Smoker', y = 'Birth Weight');

In [None]:
plt.figure(figsize=(5, 8))
sns.violinplot(data = births, x = 'Maternal Smoker', y = 'Birth Weight');

## Visualizing More than 2 Variables


Here we summarize the various plotting techniques for making comparisons when we have three (or more) features:

**Two quantitative features and one qualitative feature:** You can use a scatter plot that varies the markers (or colors) according to the qualitative feature’s categories, or panels of scatter plots, with one panel for each category.



In [None]:
# Color the points by the qualitative feature
sns.scatterplot(data = births, x = 'Maternal Height', y = 'Birth Weight',
           hue = 'Maternal Smoker');

In [None]:
sns.jointplot(data = births, x = 'Maternal Height', y = 'Birth Weight', hue = 'Maternal Smoker');

In [None]:
sns.jointplot(data = births, x = 'Maternal Height', y = 'Birth Weight', kind = 'kde', hue = 'Maternal Smoker');

**Two qualitative and one quantitative feature:** We can compare the basic shape of a distribution across subgroups with side-by-side box plots. When we have two or more qualitative features, we can organize the box plots into groups according to one of the qualitative features.

**Three quantitative features:** We can use a technique similar to the one for two quantitative features and one qualitative feature. This time, we convert one of the quantitative features into an ordinal feature, where each category typically has roughly the same number of records. Then we make faceted scatter plots of the other two features. We again look for similarities in relationships across the facets.
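
One way to sketch this conversion is with `pd.qcut`, which cuts a quantitative column into bins holding roughly equal numbers of records. The data below are randomly generated, the column names `a`, `b`, and `c` are hypothetical, and the choice of three bins is arbitrary.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data: three made-up quantitative features
rng = np.random.default_rng(0)
toy3 = pd.DataFrame({
    'a': rng.normal(size=300),
    'b': rng.normal(size=300),
    'c': rng.uniform(0, 100, size=300),
})

# Convert 'c' into an ordinal feature with about the same number of rows per level
toy3['c_level'] = pd.qcut(toy3['c'], q=3, labels=['low', 'medium', 'high'])

# One scatter plot (facet) of 'a' versus 'b' for each level of 'c'
levels = toy3['c_level'].cat.categories
fig, axes = plt.subplots(1, len(levels), figsize=(9, 3), sharex=True, sharey=True)
for ax, level in zip(axes, levels):
    subset = toy3[toy3['c_level'] == level]
    ax.scatter(subset['a'], subset['b'], s=10)
    ax.set_title(f'c: {level}')
```

Sharing the axes across facets makes it easier to compare the relationship between `a` and `b` from one level of `c` to the next.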

**Three qualitative features:** 
When we examine relationships between qualitative features, we examine proportions of one feature within subgroups defined by another. In the previous section, the three line plots in one figure and the side-by-side bar plots both display such comparisons. With three (or more) qualitative features, we can continue to subdivide the data according to the combinations of levels of the features and compare these proportions using line plots, dot plots, side-by-side bar charts, and so forth. But these plots tend to get increasingly difficult to understand with further subdivisions.