## Data Visualization
### BIOINF 575 - Fall 2021


____

#### `matplotlib` - powerful basic plotting library
https://matplotlib.org/stable/gallery/index.html   
https://matplotlib.org/stable/contents.html    
https://matplotlib.org/3.1.1/tutorials/introductory/pyplot.html    

`matplotlib.pyplot` is a collection of command-style functions that make matplotlib work like MATLAB. <br>
Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.

`matplotlib.pyplot` preserves state across function calls: it keeps track of things like the current figure and plotting area, and the plotting functions are directed to the current axes.<br>
(In most places in the documentation, "axes" refers to the plotting area of a figure and not to the strict mathematical term for more than one axis.)


https://github.com/pandas-dev/pandas/blob/v0.25.0/pandas/plotting/_core.py#L504-L1533    
https://matplotlib.org
https://matplotlib.org/tutorials/    
https://github.com/rougier/matplotlib-tutorial     
https://www.tutorialspoint.com/matplotlib/matplotlib_pyplot_api.htm    
https://realpython.com/python-matplotlib-guide/    
https://github.com/matplotlib/AnatomyOfMatplotlib    
https://www.w3schools.com/python/matplotlib_pyplot.asp   
http://scipy-lectures.org/intro/matplotlib/index.html

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
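
The state tracking described above can be observed directly. The following is a minimal sketch (the data values are made up): `plt.gcf()` and `plt.gca()` return the figure and axes that pyplot currently treats as "current".

```python
import matplotlib.pyplot as plt

# plt.subplots() creates a figure with one axes and makes both "current"
fig, ax = plt.subplots()

# These pyplot calls do not name a figure or axes; they act on the current ones
plt.plot([1, 2, 3], [2, 4, 8])
plt.ylabel("y values")

# gcf() / gca() return the figure and axes that pyplot is currently tracking
print(plt.gcf() is fig)
print(plt.gca() is ax)
print(ax.get_ylabel())
```

Calling methods on `ax` directly (for example `ax.set_ylabel(...)`) has the same effect as the `plt` calls, without relying on the hidden state.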

Call signatures:
```
    plot([x], y, [fmt], data=None, **kwargs)
    plot([x], y, [fmt], [x2], y2, [fmt2], ..., **kwargs)
```
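
As a small illustration of the `data=` form of the first signature, the sketch below passes the x and y values as keys of a dictionary. The dictionary and its values are made up for this example.

```python
import matplotlib.pyplot as plt

# Hypothetical measurements, made up for this example
measurements = {"time": [0, 1, 2, 3], "expression": [1.0, 1.8, 3.1, 4.9]}

fig, ax = plt.subplots()

# plot(x, y, fmt, data=...): "time" and "expression" are looked up in `measurements`
lines = ax.plot("time", "expression", "o-", data=measurements)
ax.set_xlabel("time")
ax.set_ylabel("expression")
plt.show()
```

The same call works with a pandas DataFrame as `data`, in which case the strings are column names.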

Quick plot

The two functions you will use most from `plt` are `plot()`, which draws the data, and `show()`, which displays the figure.

In [None]:
plt.plot()
plt.show()

Plot a single list: the values are used as y, and x defaults to the indices 0, 1, 2, ...

In [None]:
plt.plot([8, 24, 27, 42])
plt.ylabel('numbers')
plt.show()

In [None]:
# Plot the two lists, add axes labels
x=[4,5,6,7]
y=[2,5,1,7]
plt.plot(x,y)
plt.xlabel("x numerical values")
plt.ylabel("y numerical values")
plt.show()

`matplotlib` can use *format strings* to quickly declare the type of plots you want. Here are *some* of those formats:

|**Character**|**Description**|
|:-----------:|:--------------|
|'--'|Dashed line|
|':'|Dotted line|
|'o'|Circle marker|
|'^'|Upwards triangle marker|
|'b'|Blue|
|'c'|Cyan|
|'g'|Green|

In [None]:
#dir(plt)
#help(plt.scatter)

In [None]:
plt.plot([3, 4, 9, 20], 'gs--')
plt.axis([-1, 4, 0, 25])
plt.show()

In [None]:
plt.plot([3, 4, 9, 20], '^b--', linewidth=2, markersize=12)
plt.show()

In [None]:
plt.plot([3, 4, 9, 20], color='blue', marker='^', linestyle='dashed', linewidth=2, markersize=12)
plt.show()

#### <font color = "red">Exercise</font>

* Plot the values x = 4,5,6 and y = 7,8,9 with blue color, no line and square marker



In [None]:
x=[4,5,6]
y=[7,8,9]



In [None]:
import numpy as np

# Plot a list with 10 random numbers with a magenta dotted line and circles for points.


In [None]:
# help(plt.plot)

In [None]:
#import numpy as np

# evenly sampled time 
time = np.arange(0, 7, 0.3)
# gene expression
ge = np.arange(1, 8, 0.3)

# red dashes, blue squares, magenta squares joined by a dotted line, and green triangles
plt.plot(time, ge, 'r--', time, ge**2, 'bs', time, ge**2.5, 'ms:', time, ge**3, 'g^')
plt.show()

The `linestyle` (or `ls`) keyword accepts, among other values, `'-'` (solid), `'--'` (dashed), `'-.'` (dash-dot), and `':'` (dotted).
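
To compare these line styles side by side, here is a minimal sketch that draws one horizontal line per style. The x values and the vertical offsets are arbitrary.

```python
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]
styles = ["-", "--", "-.", ":"]

fig, ax = plt.subplots()

# Draw one horizontal line per style, shifted up so they do not overlap
for offset, style in enumerate(styles):
    y = [offset] * len(x)
    ax.plot(x, y, linestyle=style, color="black", label=repr(style))

ax.set_yticks([])  # the y values carry no meaning here
ax.legend(title="linestyle")
plt.show()
```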

In [None]:
# Categorical data plotting: the categories go on the x axis
# figure() creates the figure and sets its size, figsize=(width, height) in inches
# subplot() places several sub-plots at different positions in the figure
# subplot(131) means nrows=1, ncols=3, index=1
# Different types of plots: bar, scatter, and histogram
 
names = ['A', 'B', 'C', 'D']
values = [7, 20, 33, 44]
values1 = np.random.rand(100)
values2 = np.random.rand(100)

plt.figure(figsize=(12, 3))

plt.subplot(131)
plt.bar(names, values)
plt.subplot(132)
plt.scatter(names, values)
plt.subplot(133)
plt.hist(values1)
plt.suptitle('Categorical Plotting')
plt.show()

In [None]:
# help(plt.subplot)

In [None]:
# Add another subplot with another color

names = ['A', 'B', 'C', 'D']
values = [7, 20, 33, 44]
values1 = np.random.rand(100)
values2 = np.random.rand(10000)

plt.figure(figsize=(15, 3))

plt.subplot(141)
plt.bar(names, values)
plt.subplot(142)
plt.scatter(names, values)
plt.subplot(143)
plt.hist(values1)
plt.subplot(144)
plt.hist(values2, color = "green")
plt.suptitle('Categorical Plotting')
plt.show()

In [None]:
# Changing the grid layout

names = ['A', 'B', 'C', 'D']
values = [7, 20, 33, 44]
values1 = np.random.rand(100)
values2 = np.random.rand(10000)

plt.figure(figsize=(9, 6))

plt.subplot(221)
plt.bar(names, values)
plt.subplot(222)
plt.scatter(names, values)
plt.subplot(223)
plt.hist(values1)
plt.subplot(224)
plt.hist(values2, color = "green")
plt.suptitle('Categorical Plotting')
plt.show()

In [None]:
# help(plt.bar)

In [None]:
import pandas as pd

In [None]:
df_iris = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
df_iris.head()

In [None]:
x1 = df_iris.petal_length
y1 = df_iris.petal_width

x2 = df_iris.sepal_length
y2 = df_iris.sepal_width

# Plot the petal measurements as green triangles and the sepal measurements as blue squares

plt.plot(x1, y1, 'g^', x2, y2, 'bs')
plt.show()

#### Histogram

In [None]:
# help(plt.hist)

In [None]:
n, bins, patches = plt.hist(df_iris.petal_length, bins=20,facecolor='#8303A2', alpha=0.8, rwidth=.8, align='mid')
print(n)
# Add a title
plt.title('Iris dataset petal length')

# Add x and y axis labels
plt.ylabel('number of plants')
plt.xlabel('petal length')


plt.show()
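
The three values returned by `plt.hist` are the counts per bin, the bin edges, and the drawn bar patches. A minimal sketch on made-up data (a stand-in for a column such as `petal_length`) shows how they relate:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
lengths = rng.normal(loc=4.0, scale=1.5, size=150)  # made-up stand-in data

counts, bin_edges, patches = plt.hist(lengths, bins=20)

print(len(counts))        # one count per bin
print(len(bin_edges))     # one more edge than there are bins
print(int(counts.sum()))  # every observation falls in exactly one bin
```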

#### Boxplot

In [None]:
# help(plt.boxplot)

In [None]:
plt.boxplot(df_iris.petal_length)

# Add a title
plt.title('Iris dataset petal length')

# Add y axis label
plt.ylabel('petal length')
plt.show()
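
To make explicit what the box shows, the sketch below computes the quartiles with `numpy` on made-up data and compares the median with the line that `plt.boxplot` draws. It assumes the default whisker setting (`whis=1.5`).

```python
import numpy as np
import matplotlib.pyplot as plt

values = np.arange(1, 10)  # 1, 2, ..., 9: made-up data with easy quartiles

# The box spans the first to third quartile; the line inside is the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
print(q1, median, q3)

# Whiskers reach the most extreme data points within 1.5 * IQR of the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

box = plt.boxplot(values)
# The returned dict holds the drawn artists, e.g. the median line
median_drawn = box["medians"][0].get_ydata()[0]
print(median_drawn == median)
```

Points outside `lower_limit` and `upper_limit` would be drawn individually as outliers; this data set has none.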

The biggest issue with `matplotlib` isn't a lack of power; it is that it has too much of it. With great power comes great responsibility. When you are quickly exploring data, you don't want to have to fiddle around with axis limits, colors, figure sizes, etc. Yes, you *can* make good figures with `matplotlib`, but you probably won't.

https://python-graph-gallery.com/matplotlib/

Pandas uses `matplotlib` as its default plotting backend. You can start visualizing dataframes and series with a single command.

#### Using pandas `.plot()`

Pandas abstracts away some of those initial issues with data visualization. However, the result is still a `matplotlib` plot.<br><br>
Every plot returned by `pandas` can be modified with `matplotlib`.
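
As a minimal sketch of that point, the pandas plotting call below returns a `matplotlib` `Axes` object, which is then modified with ordinary `matplotlib` methods. The small dataframe `df_demo` is made up for illustration.

```python
import matplotlib.axes
import matplotlib.pyplot as plt
import pandas as pd

# Made-up measurements, standing in for a real dataframe
df_demo = pd.DataFrame({"length": [1.4, 4.7, 6.0], "width": [0.2, 1.4, 2.5]})

# pandas draws the plot and hands back the matplotlib Axes it used
ax = df_demo.plot(x="length", y="width", kind="scatter")
print(isinstance(ax, matplotlib.axes.Axes))

# From here on it is plain matplotlib
ax.set_title("Made-up measurements")
ax.set_xlabel("length (cm)")
ax.grid(True)
plt.show()
```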

In [None]:
df_iris.plot.box()
plt.show()

In [None]:
df_iris.head()

In [None]:
# Plot the histogram of the petal lengths
df_iris.petal_length.plot.hist()
plt.show()



In [None]:
# Plot the histograms of all 4 numerical characteristics in one plot
df_iris.plot.hist()
plt.show()

In [None]:
df_iris.groupby("species")['petal_length'].mean().plot(kind='bar')
plt.show()

In [None]:
df_iris.groupby("species")['sepal_length'].sum().plot(kind='bar',color = "green")
plt.show()

In [None]:
df_iris.plot(x='petal_length', y='petal_width', kind = "scatter")
plt.savefig('output.png')

In [None]:
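# Caution: with %matplotlib inline, the figure from the previous cell is closed when that cell finishes,
# so this call saves a new, empty figure and overwrites output.png
plt.savefig('output.png')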
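
A figure can also be saved through the `Figure` object that owns the axes, which does not depend on which figure pyplot currently considers "current". This is a minimal sketch: the dataframe and the output path in the temporary directory are made up for illustration.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import pandas as pd

# Made-up measurements, standing in for a real dataframe
df_demo = pd.DataFrame({"length": [1.4, 4.7, 6.0], "width": [0.2, 1.4, 2.5]})

ax = df_demo.plot(x="length", y="width", kind="scatter")
fig = ax.get_figure()  # the Figure that owns these axes

# Hypothetical output path in the system's temporary directory
out_path = os.path.join(tempfile.gettempdir(), "scatter_demo.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")
print(os.path.getsize(out_path) > 0)
```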

https://github.com/pandas-dev/pandas/blob/v0.25.0/pandas/plotting/_core.py#L504-L1533

#### Multiple Plots

In [None]:
df_iris.petal_length.plot(kind='density')
df_iris.sepal_length.plot(kind='density')
df_iris.petal_width.plot(kind='density')
plt.show()

`matplotlib` allows users to define the regions of their plotting canvas. If the user intends to create a canvas with multiple plots, they would use the `subplot()` function. The `subplot` function sets the number of rows and columns the canvas will have **AND** sets the current index of where the next subplot will be rendered.

In [None]:
plt.figure(1)
# Plot three columns from df_iris in separate subplots
# subplot(nrows, ncols, index): the index starts at 1 in the top-left and runs row by row
plt.subplot(3, 1, 1)
df_iris.petal_length.plot(kind='density')
plt.subplot(3, 1, 2)
df_iris.sepal_length.plot(kind='density')
plt.subplot(3, 1, 3)
df_iris.petal_width.plot(kind='density')
# Some plot configuration
plt.subplots_adjust(top=.92, bottom=.08, left=.1, right=.95, hspace=.25, wspace=.35)
plt.show()
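
The same three-row layout can also be built with `plt.subplots`, which creates the figure and all axes in one call and returns them as objects. This is a sketch on made-up data (random stand-ins for three numeric columns), using histograms rather than density curves.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Made-up stand-ins for three numeric columns
columns = {
    "petal_length": rng.normal(3.8, 1.7, 150),
    "sepal_length": rng.normal(5.8, 0.8, 150),
    "petal_width": rng.normal(1.2, 0.7, 150),
}

# One call creates the figure and an array of 3 axes (3 rows, 1 column)
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(6, 8))

for ax, (name, values) in zip(axes, columns.items()):
    ax.hist(values, bins=15)
    ax.set_title(name)

fig.tight_layout()  # space the subplots so the titles do not overlap
plt.show()
```

Working with the `axes` array makes it explicit which subplot each call draws on, instead of relying on the "current" subplot.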

In [None]:
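# Temporary styles: the style applies only inside the with block
with plt.style.context('ggplot'):
    plt.figure(1)

    # Plot three columns from df_iris in separate subplots
    # subplot(nrows, ncols, index): the index starts at 1 in the top-left and runs row by row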
    plt.subplot(3, 1, 1)
    df_iris.petal_length.plot(kind='density')
    plt.subplot(3, 1, 2)
    df_iris.sepal_length.plot(kind='density')
    plt.subplot(3, 1, 3)
    df_iris.petal_width.plot(kind='density')
    # Some plot configuration
    plt.subplots_adjust(top=.92, bottom=.08, left=.1, right=.95, hspace=.25, wspace=.35)
    plt.show()
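
To check that the style really is temporary, the sketch below reads one setting (`axes.facecolor`) from `plt.rcParams` before, inside, and after the `with` block. The choice of that particular setting is just for illustration.

```python
import matplotlib.pyplot as plt

default_facecolor = plt.rcParams["axes.facecolor"]

with plt.style.context("ggplot"):
    # Inside the block the ggplot style overrides the setting
    styled_facecolor = plt.rcParams["axes.facecolor"]

# After the block the previous value is back
restored_facecolor = plt.rcParams["axes.facecolor"]

print(default_facecolor, styled_facecolor, restored_facecolor)
print("ggplot" in plt.style.available)  # list of installed style names
```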

In [None]:
# Plot the histograms of the petal length and width and sepal length and width
# Display them on the columns of a figure with 2x2 subplots
# color them red, green, blue and yellow, respectively


with plt.style.context('ggplot'):
    plt.figure(1)

    # Plot each of the columns from the df in different subplots
    # subplot(nrows, ncols, index): the index starts at 1 in the top-left and runs row by row
    plt.subplot(2, 2, 1)
    df_iris.petal_length.plot(kind='hist', color = "red")
    plt.xlabel("petal length")
    plt.subplot(2, 2, 2)
    df_iris.sepal_length.plot.hist(color = "blue")
    plt.xlabel("sepal length")
    plt.subplot(2, 2, 3)
    df_iris.petal_width.plot(kind='hist', color = "green")
    plt.xlabel("petal width")
    plt.subplot(2, 2, 4)
    df_iris.sepal_width.plot.hist(color = "yellow")
    plt.xlabel("sepal width")
    # Some plot configuration
    plt.subplots_adjust(top=.92, bottom=.001, left=.1, right=.95, hspace=.30, wspace=.35)
    plt.show()

In [None]:
# Adjusting the plot configuration

with plt.style.context('ggplot'):
    plt.figure(1)

    # Plot each of the columns from the df in different subplots
    # subplot(nrows, ncols, index): the index starts at 1 in the top-left and runs row by row
    plt.subplot(2, 2, 1)
    df_iris.petal_length.plot(kind='box', color = "red")
    plt.subplot(2, 2, 2)
    df_iris.sepal_length.plot.box(color = "blue")
    plt.subplot(2, 2, 3)
    df_iris.petal_width.plot(kind='hist', color = "green")
    plt.xlabel("petal width")
    plt.subplot(2, 2, 4)
    df_iris.sepal_width.plot.hist(color = "yellow")
    plt.xlabel("sepal width")
    # Some plot configuration
    plt.subplots_adjust(top=.92, bottom=.08, left=.1, right=.95, hspace=.25, wspace=.35)
    plt.show()

In [None]:
# dir(df_iris.petal_length.plot)

____________

### `seaborn` - dataset-oriented plotting

Seaborn is a library that specializes in making *prettier* `matplotlib` plots of statistical data. <br>
It is built on top of matplotlib and closely integrated with pandas data structures.

https://seaborn.pydata.org/introduction.html<br>
https://python-graph-gallery.com/seaborn/   
https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html
https://seaborn.pydata.org/tutorial/distributions.html

In [None]:
import seaborn as sns

`seaborn` lets users *style* their plotting environment.

In [None]:
sns.set(style='whitegrid')

However, you can always use `matplotlib`'s `plt.style`

In [None]:
#dir(sns)

In [None]:
sns.scatterplot(x='petal_length',y='petal_width',data=df_iris)
plt.show()

In [None]:
# hue argument allows you to color dots by category

sns.scatterplot(x='petal_length',y='petal_width', hue = "species", data=df_iris)
plt.show()

#### Violin plot

A violin plot is a fancier box plot: it draws a kernel density estimate of each distribution, which gets rid of the need for 'jitter' to show how the data points are distributed.

In [None]:
columns = ['petal_length', 'petal_width', 'sepal_length']

fig, axes = plt.subplots(figsize=(5, 5))
sns.violinplot(data=df_iris.loc[:,columns], ax=axes)
axes.set_ylabel('number')
axes.set_xlabel('columns')
plt.show()

#### Distplot

In [None]:
# A distplot plots a univariate distribution of observations.
# The distplot() function combines the matplotlib hist function with the seaborn kdeplot() and rugplot() functions.
# Note: distplot() is deprecated since seaborn 0.11 in favor of histplot() and displot()

sns.set(style='darkgrid', palette='muted')

# 4 rows, 1 column - all have the same x axis
f, axes = plt.subplots(4,1, figsize=(10,10), sharex=True)
sns.despine(left=True)

# Regular distplot: histogram plus kernel density estimate
sns.distplot(df_iris.petal_length, ax=axes[0])

# Histogram only, in a different color
sns.distplot(df_iris.petal_width, kde=False, ax=axes[1], color='orange')

# Kernel density estimate only, shaded
sns.distplot(df_iris.sepal_width, hist=False, kde_kws={'shade':True}, ax=axes[2], color='purple')

# Kernel density estimate plus a rug of the individual observations
sns.distplot(df_iris.sepal_length, hist=False, rug=True, ax=axes[3], color='green')

#### FacetGrid

In [None]:
sns.set()
columns = ['species', 'petal_length', 'petal_width']
facet_column = 'species'
g = sns.FacetGrid(df_iris.loc[:,columns], col=facet_column, hue=facet_column, col_wrap=5)
g.map(plt.scatter, 'petal_length', 'petal_width')

In [None]:
sns.relplot(x="petal_length", y="petal_width", col="species",
            hue="species", style="species", size="species",
            data=df_iris)
plt.show()

#### <font color = "red">Exercise</font>

* Use seaborn to plot a boxplot of the sepal_width for each species 


____

### `plotnine` - grammar of graphics - R ggplot2 in python

plotnine is an implementation of a grammar of graphics in Python, based on ggplot2. The grammar allows users to compose plots by explicitly mapping data to the visual objects that make up the plot.

Plotting with a grammar is powerful: it makes custom (and otherwise complex) plots easy to think about and then create, while simple plots remain simple.



https://plotnine.readthedocs.io/en/stable/api.html   
https://plotnine.readthedocs.io/en/stable/   
http://cmdlinetips.com/2018/05/plotnine-a-python-library-to-use-ggplot2-in-python/  
https://plotnine.readthedocs.io/en/stable/tutorials/miscellaneous-altering-colors.html   
https://datascienceworkshops.com/blog/plotnine-grammar-of-graphics-for-python/   
https://realpython.com/ggplot-python/



### Uncomment and run the following line to install the library

In [None]:
# !pip install plotnine

In [None]:
from plotnine import *

In [None]:
ggplot(data=df_iris) + geom_point(aes(x="petal_length", y = "petal_width"))

In [None]:
# add transparency - to avoid over plotting - alpha argument
ggplot(data=df_iris) + aes(x="petal_length", y = "petal_width") + geom_point(alpha=0.2)

In [None]:
# change point size 
ggplot(data=df_iris) + aes(x="petal_length", y = "petal_width") + geom_point(size = 0.7, alpha=0.3, color = "red")

In [None]:
# more parameters - scale_x_log10 - transform x axis values to log scale, xlab - add label to x axis
ggplot(data=df_iris) + aes(x="petal_length", y = "petal_width", color = "species") \
    + geom_point() + scale_x_log10() + xlab("Petal Length")

In [None]:

title = '3 species : petal length and width'

ggplot(data=df_iris) +aes(x='petal_length',y='petal_width',color="species") + \
    geom_point(size=0.7,alpha=0.7) + facet_wrap('~species',nrow=3) + \
    theme(figure_size=(9,5)) + ggtitle(title)


In [None]:
# Set width of bar for histogram and color for the bar line and bar fill color

p = ggplot(data=df_iris) + aes(x='petal_length') + geom_histogram(binwidth=1,color='black',fill='grey')
p

In [None]:
# Save the plot to a file with the plot's save() method

p.save(filename='hist_plot_with_plotnine.png')


#### Quick analysis on the cars dataset:

| variable | description                              |
|----------|------------------------------------------|
| mpg      | Miles/(US) gallon                        |
| cyl      | Number of cylinders                      |
| disp     | Displacement (cu.in.)                    |
| hp       | Gross horsepower                         |
| drat     | Rear axle ratio                          |
| wt       | Weight (1000 lbs)                        |
| qsec     | 1/4 mile time                            |
| vs       | Engine (0 = V-shaped, 1 = straight)      |
| am       | Transmission (0 = automatic, 1 = manual) |
| gear     | Number of forward gears                  |
| carb     | Number of carburetors                    |

In [None]:
# Fit a linear regression line that uses the weight of the car to explain/predict the miles per gallon
# The data are split into 3 panels by number of gears
# The grey band is the 95% confidence interval around the linear model ("lm") fit

from plotnine.data import mtcars

(ggplot(mtcars, aes('wt', 'mpg', color='factor(gear)'))
 + geom_point()
 + stat_smooth(method='lm')
 + facet_wrap('~gear'))
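
To see what a fitted line represents, an ordinary least-squares line can be computed directly with `numpy`. This is only a conceptual sketch, not how plotnine computes the fit internally. The weights and mpg values are made up and chosen to lie exactly on a line.

```python
import numpy as np

# Made-up values chosen to lie exactly on a line
weight = np.array([2.0, 2.5, 3.0, 3.5, 4.0])   # in 1000 lbs
mpg = np.array([30.0, 26.0, 22.0, 18.0, 14.0])

# Degree-1 polynomial fit: returns the slope and intercept of the least-squares line
slope, intercept = np.polyfit(weight, mpg, deg=1)

# Use the line to predict mpg for a car weighing 3200 lbs
predicted = slope * 3.2 + intercept
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

A negative slope means that heavier cars are predicted to travel fewer miles per gallon.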


https://raw.githubusercontent.com/rstudio/cheatsheets/master/pngs/data-visualization.png

#### <font color = "red">Exercise</font>

* Use ggplot and the df_iris dataset to plot the sepal_length in boxplots separated by species, add new axes labels and make the y axis values log10.



<img src = "https://raw.githubusercontent.com/rstudio/cheatsheets/master/pngs/data-visualization.png" width = "1000"/>