# Lesson 10: Data Visualization
### April 9th, 2019
### Jinping Liu & Michael Chambers
---
### Outline
* Review: Pandas - import dataset, create dataset, merge together
* Data Visualization Packages
    * Matplotlib
    * Seaborn
    * Altair
    * Plotly
* Group Exercise

**NOTE:**  
* Be sure you have the Seaborn, Altair, and Plotly libraries installed
* You may need to run this notebook in Jupyter Lab to visualize Altair interactive plots

### Resources
* [Matplotlib tutorials](https://matplotlib.org/tutorials/introductory/pyplot.html#plotting-with-categorical-variables)
* [Visualization with Matplotlib](https://www.oreilly.com/library/view/python-data-science/9781491912126/ch04.html)
* [Python graph gallery](https://python-graph-gallery.com/50-basic-violinplot-and-input-formats/)

## Matplotlib and Seaborn
* Figures:
    * Simple pyplot
    * Scatter plot
    * Bubble plot
    * Histogram
    * Density plot 
    * Violin plot - distribution and comparison
    * Heatmap - relationship and comparison
* Add-ons:
    * Shape 
    * Color
    * Transparency
    * Legend 
    * Scale 
    * Multiple plots - plot pairwise relationships
    * Cluster methods 

## Review: Pandas
* Create a Pandas dataframe from the 'data/iris.tab' file
* How many unique values are in the "species" column?  What are they?
* Create a new column titled "masked" that replaces the species name with a number
    * e.g. "setosa" = 0
    * Hint: you can create a dictionary of the species and your replacement values, then use the replace() function
* Create a new column titled 'sepal_area' that multiplies the sepal_length by the sepal_width
* Create a new column titled 'petal_area' that multiplies the petal_length by the petal_width

**BONUS:** 
* Try using the pd.get_dummies() function on the "species" column.  Try adding the returned dataframe as new columns to the iris dataset
* Create a new dataframe called "df_setosa" that is composed only of the species identified as "setosa"
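
The hint and the bonus items above can be tried out on a small toy dataframe first. The sketch below is not the exercise solution: the `fruit` data, the `codes` dictionary, and all variable names are made up for illustration, and it uses `Series.map` for the dictionary lookup, which gives the same result as a dictionary-based `replace()` when every value has a key.

```python
import pandas as pd

# Hypothetical toy data (not the iris dataset)
toy = pd.DataFrame({"fruit": ["apple", "pear", "apple", "plum"],
                    "length": [1.0, 2.0, 3.0, 4.0],
                    "width": [2.0, 2.0, 0.5, 1.0]})

# Look every label up in a dictionary to get a numeric code
codes = {"apple": 0, "pear": 1, "plum": 2}
toy["masked"] = toy["fruit"].map(codes)

# New column from the element-wise product of two columns
toy["area"] = toy["length"] * toy["width"]

# One indicator column per label, appended to the original columns
dummies = pd.get_dummies(toy["fruit"])
toy_wide = pd.concat([toy, dummies], axis=1)

# Boolean mask keeps only the rows for one label
apples = toy[toy["fruit"] == "apple"]
print(toy_wide)
```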

In [None]:
# work space



In [None]:
#load the libraries 
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np

#%matplotlib inline # jupyter magic to view plots automatically

matplotlib simple pyplot 

In [None]:
#x, y  [1, 2, 3, 4], [1, 4, 9, 16] plt.plot() plt.ylabel(), plt.xlabel(), plt.show()
x=[1, 2, 3, 4]
y=[1, 4, 9, 16] 
plt.plot(x,y)
plt.show()

scatter plot: shape & color

In [None]:
#scatter plot ,'ro'
plt.plot(x,y,'ro')
plt.show()

In [None]:
#multi-data scatter plot
#np.arange(start, stop, step) Return evenly spaced values within a given interval.
t1 = np.arange(0, 5, 0.2)
t2 = t1**2
t3 = t1**3

#red dashes, blue squares and green triangles
#color and shape: 'r--'; 'bs'; 'g^',  label = ''
#plt.legend (loc = 'upper left'), 
plt.plot(t1,t1,'r--', label='t1')
plt.plot(t1,t2,'bs', label='t2')
plt.plot(t1,t3,'g^', label='t3')
plt.legend()
plt.show()

Bubble plot: scatter plot with size and color

In [None]:
#data
#np.arange(stop)
#np.random.randint(low, high, size) Return random integers from low (inclusive) to high (exclusive).
#np.random.randn(n)                 Return samples from the “standard normal” distribution
#np.abs
data = {'a': np.arange(50),
        'c': np.random.randint(0, 50, 50),
        'd': np.random.randn(50)}
data['b'] = data['a'] + 10 * np.random.randn(50)
data['d'] = np.abs(data['d']) * 100

#plt.scatter(), x = 'a'; y ='b', c='c', s='d', data= data, alpha = 0.6 #alpha for transparency
plt.scatter('a','b',
            c='c',
            s='d',
            data= data,
            alpha=0.6)
#plt.xlabel('a') #plt.ylabel('b') #plt.colorbar();  #show color scale #plt.show()
plt.xlabel('a')
plt.ylabel('b')
plt.colorbar()
plt.show()

Histograms, Binnings, and Density

In [None]:
#np.random.randn
data = np.random.randn(2000)   

In [None]:
#white background plt.style.use('seaborn-white')

#histogram plt.hist(data)
plt.hist(data)
plt.show()

In [None]:
#Binning data, bins = , density=True, alpha=0.5, color='steelblue', edgecolor='darkblue', plt.show()
plt.hist(data,
        bins=30,
        density = True,
        alpha=0.5,
        color='steelblue',
        edgecolor='darkblue')
plt.show()
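
With `density=True` the bar heights are rescaled so that the total area of the bars equals 1, which makes histograms of differently sized samples comparable. The sketch below checks this with `np.histogram` on its own; the seed and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed, chosen arbitrarily
sample = rng.standard_normal(2000)

# Raw counts per bin, and the same bins rescaled to a density
counts, edges = np.histogram(sample, bins=30)
density, _ = np.histogram(sample, bins=30, density=True)

widths = np.diff(edges)               # width of each bin
area = np.sum(density * widths)       # total area under the bars

print(counts.sum())   # every sample falls into one of the bins
print(area)           # approximately 1.0
```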

In [None]:
#three datasets histogram overlay 
#data  np.random.normal()
x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)

kwargs = dict(bins=40,
              histtype='stepfilled', 
              alpha=0.3, 
              density=True)

# * and ** unpacking, *args and **kwargs: https://www.agiliq.com/blog/2012/06/understanding-args-and-kwargs/
# Single star (*) in a function call unpacks a list or tuple into positional arguments.
# *args in a function definition collects any number of positional arguments.
# Double star (**) in a function call unpacks a dictionary into keyword arguments.
# **kwargs in a function definition collects any number of keyword arguments.

#plt.hist(x1, **kwargs) x2, x3 plt.show()
plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs)
plt.show()
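
The `*` and `**` syntax used in the cell above can be tried without any plotting. In this minimal sketch, `describe`, `values`, and `options` are hypothetical names used only for illustration.

```python
def describe(*args, **kwargs):
    """Collect positional arguments in a tuple and keyword arguments in a dict."""
    return args, kwargs


values = [1, 2, 3]
options = dict(bins=40, alpha=0.3)

# * unpacks the list into positional arguments,
# ** unpacks the dictionary into keyword arguments
positional, keyword = describe(*values, **options)

print(positional)  # (1, 2, 3)
print(keyword)     # {'bins': 40, 'alpha': 0.3}
```

Calling `plt.hist(x1, **kwargs)` works the same way: each key in the dictionary is passed as a keyword argument of `plt.hist`.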


In [None]:
#three datasets density plot seaborn library: sns.kdeplot
# p1=sns.kdeplot(x1, shade=True, label = 'x1'), layers:x1 -> x2, x3, plt.legend(loc='upper left')
p1=sns.kdeplot(x1, shade=True, label = 'x1')
p1=sns.kdeplot(x2, shade=True, label = 'x2')
p1=sns.kdeplot(x3, shade=True, label = 'x3')
plt.legend(loc='upper left')
plt.show()

Violinplot distribution for multiple groups

In [None]:
# library & dataset import seaborn as sns
#seaborn iris flower dataset
df = sns.load_dataset('iris')
df.head()

In [None]:
# Unique species  
np.unique(df['species'])

In [None]:
# Violin plot: sns.violinplot(x=x, y=y) shows data distribution x= None, y= df["sepal_length"] 
x = None
y = df["sepal_length"]
# pass x and y by keyword; recent seaborn versions no longer accept them positionally
sns.violinplot(x=x, y=y)
plt.show()

In [None]:
# plot several groups compare values x=df["species"], y=df["sepal_length"]
x = df["species"]
y = df["sepal_length"]
sns.violinplot(x=x, y=y)
plt.show()

In [None]:
# Load dataset: sns.load_dataset("tips")
tips = sns.load_dataset("tips")
tips.head(5)

In [None]:
#pairing boxplot sns.boxplot() x="day", y="total_bill", hue="smoker",data=tips
sns.boxplot(x="day", 
            y="total_bill", 
            hue="smoker",
            data = tips)
plt.show()

In [None]:
#pairing violinplot sns.violinplot
sns.violinplot(x="day", 
            y="total_bill", 
            hue="smoker",
            data = tips)
plt.show()

In [None]:
#sns.set()
sns.set(style="whitegrid", #style background 
        palette="pastel",  # color palette range
        color_codes=True)  #If True and palette is a seaborn palette
       
# Draw a nested violinplot
#split the violins for easier comparison
#add split=True,   #split to True will draw half of a violin for each level
#add inner="quartile",#{“box”, “quartile”, “point”, “stick”, None}
#add palette={"Yes": "y", "No": "b"},
#plt.legend(loc = 'upper left')
sns.violinplot(x="day", 
            y="total_bill", 
            hue="smoker",
            split=True,
            inner="quartile",
            palette={"Yes": "y", "No": "b"},
            data = tips)
plt.legend(loc = 'upper left')
plt.show()

In [None]:
#change color through palette: palette={"Yes": "lightblue", "No": "lightpink"} 
sns.violinplot(x="day", 
            y="total_bill", 
            hue="smoker",
            split=True,
            inner="quartile",
            palette={"Yes": "lightblue", "No": "lightpink"},
            data = tips)
plt.legend(loc = 'upper left')
plt.show()

Boxplot

Multiple plots: Plot pairwise relationships in a dataset: https://seaborn.pydata.org/generated/seaborn.pairplot.html

In [None]:
iris = sns.load_dataset("iris")
iris.head()

In [None]:
#sns.pairplot() iris, hue='species', height=2.5 (the size parameter was renamed to height in seaborn 0.9)
sns.pairplot(iris, hue='species', height=2.5)
plt.show()

Heatmap: Plot a matrix dataset for hierarchical cluster -
similarity and dissimilarity
sns.clustermap :https://seaborn.pydata.org/generated/seaborn.clustermap.html

In [None]:
# Data set
url = 'https://python-graph-gallery.com/wp-content/uploads/mtcars.csv'
df = pd.read_csv(url)
df = df.set_index('model')
df.head(5)

Clustering and segmentation in 9 steps: https://inseaddataanalytics.github.io/INSEADAnalytics/CourseSessions/Sessions45/ClusterAnalysisReading.html#step_2:_scale_the_data
* Step 1: Confirm data is metric/numeric
* Step 2: Scale the data
* Step 3: Select Segmentation Variables (which variables to use for clustering is a critically important decision)
* Step 4: Define similarity measure
* Step 5: Visualize Pair-wise Distances
* Step 6: Method and Number of Segments (K-means clustering method and the hierarchical clustering method)
* Step 7: Profile and interpret the segments
* Step 8: Robustness Analysis

In [None]:
# Default plot sns.clustermap()
sns.clustermap(df)
plt.show()

In [None]:
# scale the data : standardize or z-score
# Scale Method1: Standardize
#standard_scale 1 (columns), subtract the minimum and divide each by its maximum: standard_scale=1
sns.clustermap(df,
              standard_scale=1)
plt.show()

In [None]:
# Scale Method2: z-score
#z_score 1 (columns) z = (x - mean)/std, standardizing scores on the same scale: z_score=1
#https://corporatefinanceinstitute.com/resources/excel/functions/z-score-standardize-function/
sns.clustermap(df,
              z_score=1)
plt.show()
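
The two scaling options can be written out by hand to see what they do to each column. The sketch below follows the formulas in the comments above on a made-up table; it is an illustration, and seaborn's internal implementation may differ in details such as how the standard deviation is computed.

```python
import pandas as pd

# Hypothetical toy table: two columns on very different scales
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "b": [100.0, 300.0, 200.0, 400.0]})

# Min-max scaling per column: every column ends up in the range [0, 1]
min_max = (toy - toy.min()) / (toy.max() - toy.min())

# z-score per column: z = (x - mean) / std, giving mean 0 and standard deviation 1
z = (toy - toy.mean()) / toy.std()

print(min_max)
print(z.round(2))
```

After either transformation, a column with large raw values (here `b`) no longer dominates the distances used for clustering.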

In [None]:
sns.clustermap(df,
              z_score=1,
              metric="euclidean")
plt.show()

In [None]:
# Distance between individuals
# Now that the columns are on the same scale we can compare individuals. But how do we measure the similarity between two cars?
# There are several ways to measure the distance between observations in order to define clusters.
# The two most common are correlation and Euclidean distance: metric="correlation" or metric="euclidean"
# The cluster method is left at its default here
#https://journals.ametsoc.org/doi/full/10.1175/1520-0493%282001%29129%3C0540%3AEDAASM%3E2.0.CO%3B2
sns.clustermap(df,
              z_score=1,
              metric="correlation")

plt.show()
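
The difference between the two metrics is easiest to see on two vectors that have the same shape but different magnitudes. In this sketch the vectors `a` and `b` are made up, and correlation distance is taken to be 1 minus the Pearson correlation.

```python
import numpy as np

# Two hypothetical observations: same profile, ten times the magnitude
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([10.0, 20.0, 30.0, 40.0])

# Euclidean distance: straight-line distance between the two vectors
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Correlation distance: 1 - Pearson correlation of the two vectors
correlation = 1 - np.corrcoef(a, b)[0, 1]

print(euclidean)    # large, because the magnitudes differ
print(correlation)  # close to 0, because the profiles rise and fall together
```

Euclidean distance groups observations with similar values, while correlation distance groups observations with similar patterns regardless of scale.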

In [None]:
# Cluster methods
# OK now we determined the distance between 2 individuals. But how to do the clusterisation. Several methods exist.
# If you have no idea, ward is probably a good start.
# Single linkage : minimum distance  : method="single
# Complete linkage : maximum distance
# Average linkage : average distance
# Centroid method : distance between cetres
# Wards method : minimization of within-cluster variance :method="ward"
# define cluster methods
sns.clustermap(df,
              z_score=1,
              metric="euclidean",
              method="single")
plt.show()

In [None]:
sns.clustermap(df,
              z_score=1,
              metric="euclidean",
              method="ward")
plt.show()
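
The linkage methods listed above can also be run directly with SciPy's hierarchical clustering functions, without drawing a heatmap. This is a sketch on made-up points; the `points` array and the choice of two clusters are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: two well-separated groups of three 2-D points each
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

results = {}
for method in ["single", "complete", "average", "ward"]:
    # Each row of the linkage matrix records one merge of two clusters
    merges = linkage(points, method=method, metric="euclidean")
    # Cut the tree so that at most two clusters remain
    results[method] = fcluster(merges, t=2, criterion="maxclust")
    print(method, results[method])
```

On data this cleanly separated every method recovers the same two groups; the methods start to disagree when clusters overlap or have elongated shapes.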

In [None]:
#color
# Change color palette
#cmap cmap="mako" or cmap="viridis" or cmap="Blues"
sns.clustermap(df,
              z_score=1,
              metric="euclidean",
              method="ward",
              cmap="Blues")
plt.show()

In [None]:
df.cyl.unique()

In [None]:
# Prepare a vector of color mapped to the 'cyl' column
my_palette = dict(zip(df.cyl.unique(), ["orange","yellow","brown"])) #zip two lists together
row_colors = df.cyl.map(my_palette)
 
# plot
sns.clustermap(df, 
               metric="correlation", 
               method="single", 
               cmap="Blues", 
               standard_scale=1, 
               row_colors=row_colors)
plt.show()

# Interactive Plots
## Altair

In [None]:
from sklearn.decomposition import PCA
import seaborn as sns
import altair as alt
import pandas as pd

#alt.renderers.enable('notebook')
%matplotlib inline

In [None]:
# acquire iris dataset from Seaborn
iris = sns.load_dataset("iris")
type(iris)

In [None]:
iris.head()

In [None]:
# use sklearn's PCA function to create X and Y coordinates from iris dataset
cols = iris.columns
data = iris[cols[:-1]].values # convert the data to a NumPy array for sklearn

pca = PCA(2) # reducing iris dataset to 2 dimensions
projected = pca.fit_transform(data)

# create a dataframe of the iris x and y coordinates
df = pd.DataFrame(projected, columns=['x','y'])
df.head()
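
To see roughly what `pca.fit_transform` computes, the same kind of projection can be written with NumPy: center the data, take a singular value decomposition, and project onto the first two directions. This is a sketch on random made-up data, not sklearn's implementation, and the sign of each component may be flipped relative to sklearn's output.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 observations of 4 correlated variables
X = rng.standard_normal((100, 4)) @ rng.standard_normal((4, 4))

X_centered = X - X.mean(axis=0)            # PCA works on mean-centered data
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

components = Vt[:2]                        # directions of largest variance
projected_np = X_centered @ components.T   # coordinates along those directions

explained = S[:2] ** 2 / np.sum(S ** 2)    # share of variance per component
print(projected_np.shape)                  # (100, 2)
print(explained)
```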

In [None]:
# add the x and y coordinates to the original iris dataset
df = pd.concat([iris,df], axis=1)
df.head()

In [None]:
# plot the x and y coordinates using Seaborn
sns.scatterplot(x='x', y='y', hue='species', data=df)

Guide: [Jake VanderPlas - Using Altair to Make Interactive Plots](https://www.youtube.com/watch?v=OquQ6M7yoGU)

In [None]:
# create side-by-side charts and highlight points within the plot using mouse
brush = alt.selection_interval()

chart = alt.Chart(df).mark_point().encode(
    color=alt.condition(brush, 'species:N', alt.value('lightgray'))
).properties(
    width=250,
    height=250
).add_selection(
    brush
)

chart2 = chart.encode(x='x', y='y') | chart.encode(x='sepal_length:Q', y='sepal_width')
chart2.save('my_plot.html')
chart2

## Plotly

In [10]:
import plotly as py # used to execute the plotting
import plotly.graph_objs as go # used to setup your data for a given plot

import pandas as pd

# formatting your plot
layout = go.Layout(
    title='Iris Dataset Heatmap',
    xaxis = dict(title='Variables', automargin=True),
    yaxis = dict(title='Observations', automargin=True)
)

df = pd.read_csv('data/iris.tab', sep='\t')
df.drop(columns='species', inplace=True)
data = [go.Heatmap( z=df.values.tolist(), colorscale='Viridis')]

fig = go.Figure(data=data, layout=layout)

py.offline.plot(fig, filename='figs/iris_heatmap.html')

'file:///Users/chambersmj/Google Drive/git/spring2019/20190409/figs/iris_heatmap.html'

In [2]:
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V52,V53,V54,V55,V56,V57,V58,V59,V60,V61
0,100,100,101,101,101,101,101,100,100,100,...,107,107,107,106,106,105,105,104,104,103
1,101,101,102,102,102,102,102,101,101,101,...,108,108,107,107,106,106,105,105,104,104
2,102,102,103,103,103,103,103,102,102,102,...,109,108,108,107,107,106,106,105,105,104
3,103,103,104,104,104,104,104,103,103,103,...,109,109,108,108,107,107,106,106,105,105
4,104,104,105,105,105,105,105,104,104,103,...,110,109,109,108,107,107,107,106,106,105


In [None]:
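# plotly.plotly is the online (Chart Studio) API and is no longer shipped with plotly 4 and later;
# plotly.offline provides iplot() without an account
import plotly.offline as py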
import plotly.graph_objs as go

In [None]:
import seaborn as sns

In [None]:
df = sns.load_dataset('iris')

In [None]:
cols = df.columns.tolist()

In [None]:
cols = cols[:-1]

In [None]:
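# go.Heatmap (capital H) is the trace class; it takes the matrix of values as z
go.Heatmap(z=df[cols].values.tolist(), x=cols)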

In [None]:
# basic heatmap
trace = go.Heatmap(z=[[1, 20, 30],
                      [20, 1, 60],
                      [30, 60, 1]])
data=[trace]
py.iplot(data, filename='basic-heatmap')

In [None]:
# interactive heatmap
layout = go.Layout(
    title='Abricate: Antimicrobial Resistance and Virulence Genes Identified',
    xaxis = dict(title='Virulence Factors', automargin=True),
    yaxis = dict(title='S. lugdunensis RefSeq Genome Accessions', automargin=True)
)

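# df is assumed to hold only numeric columns (drop text columns such as 'species' first)
data = [go.Heatmap(z=df.values.tolist(), x=df.columns.tolist(), y = df.index.tolist(), colorscale='Viridis')]

fig = go.Figure(data=data, layout=layout)

import plotly.offline  # makes the name `plotly` available for the call below
plotly.offline.plot(fig, filename='fig/s_lugdunensis.html')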

## Group Project

1.) **Choose a dataset**  
You'll be working in groups to generate a cookbook of plots using datasets in the 'train' folder (feel free to choose another dataset if you prefer!).

* The goal is to write a function that can generate a specific plot type in the format you prefer (a function that makes your custom plots!)

2.) **Generate some plots**
Types of plots:
* line
* scatter
* violin
* barplot
* boxplot
* the list goes on and on!

Plot Requirements:
* title
* labels for x and y axis
* appropriate name for saved plot
* save as .png

You can choose which plots your group would like to focus on.
* You can generate as many different plots as you wish
* If you'd like to make a plot that we don't have a good dataset for, feel free to generate your own

3.) **Write functions that generate your plots**

At the end of class we're going to try running different datasets through a few of your functions to see how they work.

**Goal:** Have a few functions that generate plots given a formatted dataset
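
As a starting point, a plot function that meets the requirements above (title, axis labels, a descriptive file name, saved as .png) might look like the sketch below. The function name `scatter_plot`, its arguments, and the demo data are hypothetical; adapt them to your own dataset and plot type.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def scatter_plot(df, x, y, title, filename):
    """Draw a scatter plot of column y against column x and save it as a .png."""
    fig, ax = plt.subplots()
    ax.scatter(df[x], df[y])
    ax.set_title(title)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    fig.savefig(filename)
    plt.close(fig)  # free the figure once it is written to disk
    return filename


# Hypothetical usage with made-up data; the file goes to the system temp directory
demo = pd.DataFrame({"height": [1, 2, 3, 4], "weight": [2, 4, 5, 8]})
out_path = os.path.join(tempfile.gettempdir(), "weight_vs_height_scatter.png")
saved = scatter_plot(demo, "height", "weight", "Weight vs. height", out_path)
print(saved)
```

Taking the column names as arguments keeps the function reusable: the same call works on any dataframe that has the two named columns.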

**Reach Goal:** Combine all of your group's plot functions into a single script.  The script will take two arguments, the dataset and the type of plot that is desired
* e.g. `python plot_script.py -heatmap heatmap_data.csv`
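
One possible way to structure such a script is a dictionary that maps plot names to functions, combined with `argparse` for the command line. The sketch below is only an outline: the `--plot` option, the placeholder functions, and the `PLOTTERS` table are assumptions, and the interface differs slightly from the `-heatmap` flag in the example above.

```python
import argparse


def make_heatmap(path):
    # Placeholder: a real function would read `path` and draw the plot
    return f"heatmap from {path}"


def make_scatter(path):
    # Placeholder for a second plot type
    return f"scatter from {path}"


# Dispatch table: plot name on the command line -> function to call
PLOTTERS = {"heatmap": make_heatmap, "scatter": make_scatter}


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Generate a plot from a dataset")
    parser.add_argument("--plot", choices=sorted(PLOTTERS), required=True,
                        help="type of plot to generate")
    parser.add_argument("dataset", help="path to the input file")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    return PLOTTERS[args.plot](args.dataset)


# Passing the arguments as a list simulates the command line
print(main(["--plot", "heatmap", "heatmap_data.csv"]))
```

In a real script, calling `main()` with no arguments makes `argparse` read the actual command line, e.g. `python plot_script.py --plot heatmap heatmap_data.csv`.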

**ULTRA REACH GOAL:** Have your script generate a static and interactive plot!

Resources:
* [Matplotlib](https://matplotlib.org/gallery.html)
* [Seaborn](https://seaborn.pydata.org/examples/index.html)
* [Altair](https://altair-viz.github.io/gallery/index.html)
* [Plotly](https://plot.ly/python/)
* [Seaborn Datasets](https://github.com/mwaskom/seaborn-data)

In [None]:
# https://chrisalbon.com/python/data_wrangling/break_list_into_chunks_of_equal_size/
import random

# Create a function called "chunks" with two arguments, l and n:
def chunks(l, n):
    """Shuffle list l and yield random groups of size n

    Args:
        l (list): names of the students (shuffled in place)
        n (int): number of students in each group
    """
    # Fixed seed so the grouping is reproducible
    random.seed(0)
    random.shuffle(l)
    # For each index i from 0 to len(l) in steps of n,
    for i in range(0, len(l), n):
        # yield the slice of l holding the next n items:
        yield l[i:i+n]

# Placeholder roster and group size; replace with the actual student names
students = ['student_%d' % i for i in range(1, 11)]
for group in chunks(students, 3):
    print(group)

In [None]:
# workspace
