## Data Visualization
### BIOINF 575 - Fall 2023


____

#### `matplotlib` - powerful basic plotting library
https://matplotlib.org/stable/gallery/index.html   
https://matplotlib.org/stable/contents.html    
https://matplotlib.org/3.1.1/tutorials/introductory/pyplot.html    

`matplotlib.pyplot` is a collection of command style functions that make matplotlib work like MATLAB. <br>
Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc.

In `matplotlib.pyplot`, various states are preserved across function calls, so it keeps track of things like the current figure and plotting area, and the plotting functions are directed to the current axes.<br>
(Here, as in most places in the documentation, "axes" refers to the plotting area of a figure and not to the strict mathematical term for more than one axis.)

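The "current figure" and "current axes" that pyplot keeps track of can be inspected with `plt.gcf()` and `plt.gca()`. The following is a minimal sketch with made-up numbers; it only illustrates the bookkeeping and is not part of the original course material.

```python
import matplotlib.pyplot as plt

# plt.figure() creates a figure and makes it the current one
fig = plt.figure()

# plt.plot() finds (or creates) the current axes and draws into it
lines = plt.plot([1, 2, 3], [2, 4, 8])

# gcf = "get current figure", gca = "get current axes"
current_fig = plt.gcf()
current_ax = plt.gca()

print(current_fig is fig)            # is it the figure created above?
print(lines[0].axes is current_ax)   # does the line live in the current axes?

# plt.ylabel() is also directed to the current axes
plt.ylabel("made-up values")
print(current_ax.get_ylabel())
```

Outside a notebook, add `plt.show()` at the end to display the figure.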


**Useful resources:**      
https://github.com/pandas-dev/pandas/blob/v0.25.0/pandas/plotting/_core.py#L504-L1533    
https://matplotlib.org
https://matplotlib.org/tutorials/  
https://matplotlib.org/stable/tutorials/introductory/quick_start.html
https://matplotlib.org/stable/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py
https://matplotlib.org/stable/tutorials/introductory/lifecycle.html      
https://github.com/rougier/matplotlib-tutorial     
https://www.tutorialspoint.com/matplotlib/matplotlib_pyplot_api.htm    
https://realpython.com/python-matplotlib-guide/    
https://github.com/matplotlib/AnatomyOfMatplotlib    
https://www.w3schools.com/python/matplotlib_pyplot.asp   
http://scipy-lectures.org/intro/matplotlib/index.html

```python 
%matplotlib inline
```
Magic command to show plots in the notebook.
Using this magic command, the output of plotting commands is displayed inline within frontends like the Jupyter notebook, directly below the code cell that produced it. The resulting plots will then also be stored in the notebook document.

Starting with IPython 5.0 and matplotlib 2.0 you can avoid the use of IPython’s specific magic and use matplotlib.pyplot.ion()/matplotlib.pyplot.ioff() which have the advantages of working outside of IPython as well.

https://ipython.readthedocs.io/en/stable/interactive/plotting.html
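As a small sketch of the alternative mentioned above, interactive mode can be switched on and off and queried with `plt.isinteractive()`. The variable names are illustrative only.

```python
import matplotlib.pyplot as plt

plt.ioff()                          # interactive mode off: figures appear on plt.show()
state_after_ioff = plt.isinteractive()

plt.ion()                           # interactive mode on: figures update as commands run
state_after_ion = plt.isinteractive()

print(state_after_ioff, state_after_ion)
```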

In [None]:
import matplotlib.pyplot as plt

In [None]:
# what can the pyplot do?
# many many things

for e in dir(plt):
    if not e.startswith("_"):
        print(e)

Call signatures::
```
    plot([x], y, [fmt], data=None, **kwargs)
    plot([x], y, [fmt], [x2], y2, [fmt2], ..., **kwargs)
```
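The call signatures above can be read as three common ways of calling `plot()`. The sketch below uses a hypothetical dictionary of measurements (the names `time` and `expression` are made up) to show each form, including the `data=` keyword.

```python
import matplotlib.pyplot as plt

# Hypothetical measurements keyed by name (a dict here; a DataFrame also works)
measurements = {
    "time": [0, 1, 2, 3],
    "expression": [1.0, 1.5, 2.7, 4.1],
}

# Form 1: y only, so x defaults to the index 0, 1, 2, ...
line_y_only, = plt.plot(measurements["expression"])

# Form 2: x, y and a format string (red circles)
line_xy, = plt.plot(measurements["time"], measurements["expression"], "ro")

# Form 3: names plus data=...; matplotlib looks the names up in `data`
line_named, = plt.plot("time", "expression", data=measurements)

print(list(line_y_only.get_xdata()))
print(list(line_named.get_ydata()))
```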

#### Quick plot

The two `plt` functions used most often are `plot()`, which draws the data, and `show()`, which displays the figure.

In [None]:
help(plt.plot)

In [None]:
help(plt.show)

In [None]:
# create an empty plot 

result = plt.plot()
# plt.show()

#### Display lists of numbers 

In [None]:
# one list - on the y axis
# the x axis will be the index of the element in the list

axis_lst = plt.plot([8, 24, 27, 42])
ylbl = plt.ylabel('numbers')
# plt.show()

In [None]:
# Plot the two lists, add axes labels

x=[4,5,6,7]
y=[2,5,1,7]
plt.plot(x,y)
plt.xlabel("x numerical values")
plt.ylabel("y numerical values")
plt.show()

`matplotlib` can use *format strings* to quickly declare the type of plots you want. Here are *some* of those formats:

|**Character**|**Description**|
|:-----------:|:--------------|
|'--'|Dashed line|
|':'|Dotted line|
|'o'|Circle marker|
|'^'|Upwards triangle marker|
|'b'|Blue|
|'c'|Cyan|
|'g'|Green|

In [None]:
#dir(plt)
#help(plt.scatter)

In [None]:
# plot a green (g), dashed (--) line, and 
# mark the points on the graph with a square (s)
plt.plot([3, 4, 9, 20], 'gs--')

# set axis limits
plt.axis([-1, 4, 0, 25])

plt.show()

In [None]:
# use a triangle (^) to mark the points on the plot
# draw a blue (b), dashed (--) line
# linewidth is self explanatory
# markersize is the size of the point on the plot - in this case the triangle

plt.plot([3, 4, 9, 20], '^b--', linewidth=2, markersize=12)
plt.show()

In [None]:
# we can spell out all parameters

plt.plot([3, 4, 9, 20], 
         color='blue', 
         marker='^', 
         linestyle='dashed', 
         linewidth=2, 
         markersize=12)
plt.show()
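One way to convince yourself that the format string and the spelled-out keywords describe the same line is to inspect the `Line2D` objects that `plot()` returns. This is a sketch for checking, not a required step.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import same_color

values = [3, 4, 9, 20]

# The same line twice: once via a format string, once via keyword arguments
line_fmt, = plt.plot(values, "^b--", linewidth=2, markersize=12)
line_kw, = plt.plot(values, color="blue", marker="^", linestyle="dashed",
                    linewidth=2, markersize=12)

# Compare the properties matplotlib stored for each line
print(same_color(line_fmt.get_color(), line_kw.get_color()))
print(line_fmt.get_marker(), line_kw.get_marker())
print(line_fmt.get_linestyle(), line_kw.get_linestyle())
```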

#### <font color = "red">Exercise</font>

* Plot the values x = 4,5,6 and y = 7,8,9 with blue color, no line and square marker



In [None]:
x=[4,5,6]
y=[7,8,9]



In [None]:
import numpy as np

# Plot a list with 10 random numbers 
# with a magenta dotted line and 
# circles for points.




In [None]:
# help(plt.plot)

In [None]:
import numpy as np

# evenly sampled time 
time = np.arange(0, 7, 0.3)

# gene expression
ge = np.arange(1, 8, 0.3)

# we can plot multiple lines
# we give triplets of x, y and style for each line
# red dashes - time and gene expression
# blue squares - time and gene expression squared
# magenta squares on dotted line - time and gene expression to the power 2.5
# green triangles - time and gene expression cubed

plt.plot(time, ge, 'r--', time, ge**2, 'bs', time, ge**2.5, 'ms:', time, ge**3, 'g^')
plt.show()


In [None]:
# Look in the documentation 
# to find information about 
# the parameters of plot

# help(plt.plot)

* `linestyle or ls`: {'-', '--', '-.', ':', '', (offset, on-off-seq), ...}
    - `'-'` solid line
    - `'--'` dashed line
    - `'-.'` dash and dot line
    - `':'` dotted line
    - `''` no line
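The `(offset, on-off-seq)` form in the list above is the least obvious one, so here is a hedged sketch with made-up values. The dash lengths are given in points and, as far as I know, are scaled by the line width by default.

```python
import matplotlib.pyplot as plt

y = [3, 4, 9, 20]

# A named style from the list above: dash and dot
dash_dot, = plt.plot(y, linestyle="-.")

# Tuple form: (offset, (on, off, on, off, ...))
# Here: 6 on, 2 off, 1 on, 2 off, starting at offset 0
custom, = plt.plot([v + 5 for v in y], linestyle=(0, (6, 2, 1, 2)))

# An empty string draws no connecting line, so add a marker to see the points
markers_only, = plt.plot([v + 10 for v in y], linestyle="", marker="o")

print(dash_dot.get_linestyle(), markers_only.get_linestyle())
```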


In [None]:
# help(plt.figure)

In [None]:
# help(plt.subplot)

In [None]:
# Categorical data plotting using categories on the x axis 
# we also use the figure function to create a more complex figure (size = (width, height))
# and subplot to plot multiple sub-plots at different positions in the figure
# 131 means nrows=1, ncols=3, index=1
# Different types of plots: bar, scatter, and histogram 
 
names = ['A', 'B', 'C', 'D']
values = [7, 20, 33, 44]
values1 = np.random.rand(100)

# create figure size = (width, height) in inches
plt.figure(figsize=(12, 3))

# subplot: no_rows no_cols subplot_position
plt.subplot(131)
# plot a barplot with the names on the x axis and values on the y axis
plt.bar(names, values)

# make another subplot and display a scatterplot
plt.subplot(132)
plt.scatter(names, values)

# make another subplot and display a histogram
plt.subplot(133)
plt.hist(values1)

# add a title centered above all subplots (suptitle = "super title")
plt.suptitle('Categorical Plotting')
plt.show()

In [None]:
# Add another subplot with another color

names = ['A', 'B', 'C', 'D']
values = [7, 20, 33, 44]
values1 = np.random.rand(100)
values2 = np.random.rand(10000)

plt.figure(figsize=(15, 3))

plt.subplot(141)
plt.bar(names, values)

plt.subplot(142)
plt.scatter(names, values)

plt.subplot(143)
plt.hist(values1)

# add the fourth subplot - a histogram
plt.subplot(144)
plt.hist(values2, color = "green")

plt.suptitle('Categorical Plotting')
plt.show()

In [None]:
# Changing the grid layout from 1 by 4 to 2 by 2

names = ['A', 'B', 'C', 'D']
values = [7, 20, 33, 44]
values1 = np.random.rand(100)
values2 = np.random.rand(10000)

# changing the figure size to match the new layout
plt.figure(figsize=(9, 6))

plt.subplot(221)
plt.bar(names, values)

plt.subplot(222)
plt.scatter(names, values)

plt.subplot(223)
plt.hist(values1)

plt.subplot(224)
plt.hist(values2, color = "green")

plt.suptitle('Categorical Plotting')
plt.show()
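The same 2 by 2 layout can also be built with `plt.subplots()`, which returns the figure and an array of axes in one call, so each plot is addressed by row and column instead of a three-digit code. This is an alternative sketch; the seeded random generator is only there to make it repeatable.

```python
import numpy as np
import matplotlib.pyplot as plt

names = ["A", "B", "C", "D"]
values = [7, 20, 33, 44]
rng = np.random.default_rng(0)
samples = rng.random(100)

# One call creates the figure and a 2 x 2 array of Axes objects
fig, axes = plt.subplots(2, 2, figsize=(9, 6))

axes[0, 0].bar(names, values)              # same position as plt.subplot(221)
axes[0, 1].scatter(names, values)          # plt.subplot(222)
axes[1, 0].hist(samples)                   # plt.subplot(223)
axes[1, 1].hist(samples, color="green")    # plt.subplot(224)

fig.suptitle("Categorical Plotting")
print(axes.shape)
```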

In [None]:
# help(plt.bar)

In [None]:
import pandas as pd

In [None]:
# loading the iris dataset
# 150 flowers/rows from 3 species - 50 per species (5th column)
# 4 measured characteristics/columns: 
# petal_length, petal_width, sepal_length, sepal_width

df_iris = pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
df_iris.head()

In [None]:
x1 = df_iris.petal_length
y1 = df_iris.petal_width

x2 = df_iris.sepal_length
y2 = df_iris.sepal_width

# Plot petal length vs. petal width as green triangles and
# sepal length vs. sepal width as blue squares

plt.plot(x1, y1, 'g^', x2, y2, 'bs')
plt.show()

#### Histogram

In [None]:
help(plt.hist)

In [None]:
n, bins, patches = plt.hist(df_iris.petal_length, 
                            bins=20,              # how many bars/intervals
                            facecolor='#8303A2',  # hex color code for the bars in the histogram
                            rwidth=.8,            # relative width of the bars as a fraction of the bin width
                            align='mid')          # horizontal alignment of the bins

# n - The values of the histogram bins - the height
print(n)

print(min(df_iris.petal_length))
print(max(df_iris.petal_length))

# the bin edges: the first one is the minimum value and
# the last one is the maximum value, so there is one more edge than bins
print(bins)

# the container of actual bar objects
print(patches)

# Add a title
plt.title('Iris dataset petal length')

# Add axis labels
plt.ylabel('number of plants')
plt.xlabel('petal length')


plt.show()
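The relationship between the three values returned by `hist()` can be checked directly. The sketch below uses synthetic "petal lengths" drawn from a normal distribution (an assumption made so the example does not need the iris file).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(575)
lengths = rng.normal(loc=4.0, scale=1.5, size=150)   # made-up data

counts, edges, bars = plt.hist(lengths, bins=20, rwidth=0.8)

# 20 bins need 21 edges, and there is one bar object per bin
print(len(counts), len(edges), len(bars))

# every value falls into exactly one bin
print(counts.sum() == lengths.size)

# by default the edges span the data from minimum to maximum
print(np.isclose(edges[0], lengths.min()), np.isclose(edges[-1], lengths.max()))
```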

#### Boxplot

In [None]:
# help(plt.boxplot)

In [None]:
# the result is a dictionary mapping each component of the boxplot 
# to a list of the `.Line2D` instances created

res = plt.boxplot(df_iris.petal_length)

# Add a title
plt.title('Iris dataset petal length')

# Add y axis label
plt.ylabel('petal length')
plt.show()
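Because `boxplot()` returns a dictionary of the drawn components, individual pieces can be read back or restyled afterwards. A minimal sketch on synthetic data (not the iris measurements):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(575)
lengths = rng.normal(loc=4.0, scale=1.5, size=150)   # made-up data

parts = plt.boxplot(lengths)

# the keys name the components of the box plot
print(sorted(parts))

# the median is drawn as a horizontal line; its y value is the median of the data
median_line = parts["medians"][0]
median_from_plot = median_line.get_ydata()[0]
print(median_from_plot, np.median(lengths))

# each component is a regular matplotlib artist and can be restyled
median_line.set_color("red")
```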

#### With great power, comes great responsibility.
- The biggest issue with `matplotlib` isn't its lack of power...it is that it is too much power.     
- When you are quickly exploring data, you don't want to have to fiddle around with axis limits, colors, figure sizes, etc. 
- Yes, you *can* make good figures with `matplotlib`, but you probably won't.

https://python-graph-gallery.com/matplotlib/

Pandas uses `matplotlib` as its default plotting backend.    
You can start visualizing DataFrames and Series with a single command.

#### Using pandas `.plot()`

Pandas abstracts some of those initial issues with data visualization. However, it is still a `matplotlib` plot.<br><br>
Every plot that is returned from `pandas` is subject to `matplotlib` modification.

https://github.com/pandas-dev/pandas/blob/v0.25.0/pandas/plotting/_core.py#L504-L1533
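To illustrate that a pandas plot is still a `matplotlib` object, the sketch below keeps the return value of `.plot()` and modifies it with ordinary `Axes` methods. The small DataFrame is made up so the example runs without downloading the iris file.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.axes import Axes

# A tiny made-up table with iris-like column names
toy = pd.DataFrame({
    "petal_length": [1.4, 1.5, 4.5, 4.7, 5.9, 6.1],
    "petal_width": [0.2, 0.3, 1.5, 1.4, 2.1, 2.3],
})

# pandas returns the matplotlib Axes it drew on
ax = toy.plot(x="petal_length", y="petal_width", kind="scatter")
print(isinstance(ax, Axes))

# so the usual matplotlib methods apply
ax.set_title("Petal width vs. petal length")
ax.set_xlim(0, 7)
ax.axhline(1.0, linestyle=":", color="gray")   # reference line at width 1.0
```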

In [None]:
# plot a boxplot for each of the 4 characteristics 
df_iris.plot.box()
plt.show()

In [None]:
df_iris.head()

In [None]:
# Plot the histogram of the petal lengths
# (the next cell plots the histograms of all 4 numerical characteristics in one plot)
df_iris.petal_length.plot.hist()
plt.show()



In [None]:
# histograms for all 4 characteristics
# alpha parameter adds transparency

df_iris.plot.hist(alpha = 0.7)
plt.show()

In [None]:
# mean of petal_length by species in a bar plot
df_iris.groupby("species")['petal_length'].mean().plot(kind='bar')
plt.show()

In [None]:
# sum of sepal_length by species in a green bar plot
df_iris.groupby("species")['sepal_length'].sum().plot(kind='bar',color = "green")
plt.show()

In [None]:
# scatter plot of petal_length vs petal_width, saved to a file
df_iris.plot(x='petal_length', y='petal_width', kind = "scatter")
plt.savefig('output.png')
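`savefig()` accepts a few options that are often useful when saving figures for a report. The sketch below is an illustration with made-up points and a throwaway directory; the file names are arbitrary.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter([1.4, 4.5, 5.9], [0.2, 1.5, 2.1])   # made-up points
ax.set_xlabel("petal_length")
ax.set_ylabel("petal_width")

out_dir = tempfile.mkdtemp()                    # throwaway directory for this sketch
png_path = os.path.join(out_dir, "scatter.png")
pdf_path = os.path.join(out_dir, "scatter.pdf")

# The file extension selects the format; dpi matters for raster formats such as PNG.
# bbox_inches="tight" trims the empty margin around the plot.
fig.savefig(png_path, dpi=300, bbox_inches="tight")
fig.savefig(pdf_path, bbox_inches="tight")

print(os.path.getsize(png_path) > 0, os.path.exists(pdf_path))
```

When running as a script rather than in a notebook, it is usually safer to call `savefig()` before `plt.show()`, because the figure may be closed once the window is dismissed.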

#### Multiple Plots

In [None]:
df_iris.petal_length.plot(kind='density')
df_iris.sepal_length.plot(kind='density')
df_iris.petal_width.plot(kind='density')
plt.show()

`matplotlib` allows users to define the regions of their plotting canvas. If the user intends to create a canvas with multiple plots, they would use the `subplot()` function. The `subplot` function sets the number of rows and columns the canvas will have **AND** sets the current index of where the next subplot will be rendered.

In [None]:
plt.figure(1)

# Plot all three columns from df in different subplots
# Rows first index (top-left)

plt.subplot(3, 1, 1)
df_iris.petal_length.plot(kind='density')

plt.subplot(3, 1, 2)
df_iris.sepal_length.plot(kind='density')

plt.subplot(3, 1, 3)
df_iris.petal_width.plot(kind='density')

# Some plot configuration
plt.subplots_adjust(top=.92, 
                    bottom=.08, 
                    left=.1, 
                    right=.95, 
                    hspace=.25, 
                    wspace=.35)
plt.show()

In [None]:
# Temporary styles
with plt.style.context(('ggplot')):
    plt.figure(1)

    # Plot all three columns from df in different subplots
    # Rows first index (top-left)
    
    plt.subplot(3, 1, 1)
    df_iris.petal_length.plot(kind='density')
    
    plt.subplot(3, 1, 2)
    df_iris.sepal_length.plot(kind='density')
    
    plt.subplot(3, 1, 3)
    df_iris.petal_width.plot(kind='density')
    
    # Some plot configuration
    plt.subplots_adjust(top=.92, 
                        bottom=.08, 
                        left=.1, 
                        right=.95, 
                        hspace=.25, 
                        wspace=.35)
    plt.show()
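The style set by `plt.style.context()` applies only inside the `with` block. A quick sketch that checks this by looking at one setting, `axes.facecolor`, before, inside and after the block (choosing that setting is my assumption about what the `ggplot` style changes):

```python
import matplotlib.pyplot as plt

# the names that can be passed to plt.style.context() or plt.style.use()
print("ggplot" in plt.style.available)

before = plt.rcParams["axes.facecolor"]

with plt.style.context("ggplot"):
    inside = plt.rcParams["axes.facecolor"]   # setting while the style is active
    plt.plot([3, 4, 9, 20])

after = plt.rcParams["axes.facecolor"]        # restored when the block ends

print(before, inside, after)
```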

In [None]:
# Plot the histograms of the petal length and width and sepal length and width
# Display them on the columns of a figure with 2x2 subplots
# color them red, green, blue and yellow, respectively


with plt.style.context(('ggplot')):
    plt.figure(1)

    # Plot each of the columns from the df in different subplots
    # Rows first index (top-left)
    plt.subplot(2, 2, 1)
    df_iris.petal_length.plot(kind='hist', color = "red")
    plt.xlabel("petal length")
    plt.subplot(2, 2, 2)
    df_iris.sepal_length.plot.hist(color = "blue")
    plt.xlabel("sepal length")
    plt.subplot(2, 2, 3)
    df_iris.petal_width.plot(kind='hist', color = "green")
    plt.xlabel("petal width")
    plt.subplot(2, 2, 4)
    df_iris.sepal_width.plot.hist(color = "yellow")
    plt.xlabel("sepal width")
    # Some plot configuration
    plt.subplots_adjust(top=.92, 
                        bottom=.001, 
                        left=.1, 
                        right=.95, 
                        hspace=.30, 
                        wspace=.35)
    plt.show()

In [None]:
# Adjusting the plot configuration

with plt.style.context(('ggplot')):
    plt.figure(1)

    # Plot each of the columns from the df in different subplots
    # Rows first index (top-left)
    plt.subplot(2, 2, 1)
    df_iris.petal_length.plot(kind='box', color = "red")
    plt.subplot(2, 2, 2)
    df_iris.sepal_length.plot.box(color = "blue")
    plt.subplot(2, 2, 3)
    df_iris.petal_width.plot(kind='hist', color = "green")
    plt.xlabel("petal width")
    plt.subplot(2, 2, 4)
    df_iris.sepal_width.plot.hist(color = "yellow")
    plt.xlabel("sepal width")
    # Some plot configuration
    plt.subplots_adjust(top=0.99, 
                        bottom=0, 
                        left=.1, 
                        right=.95, 
                        hspace=.25, 
                        wspace=.35)
    plt.show()

In [None]:
# see what the pandas dataframe plot can do
# dir(df_iris.petal_length.plot)

____________

### `seaborn` - dataset-oriented plotting

Seaborn is a library that specializes in making *prettier* `matplotlib` plots of statistical data. <br>
- It is built on top of matplotlib and closely integrated with pandas data structures.    
- It works from a data object (typically a pandas DataFrame); parameters such as `x`, `y` and `hue` name the columns to extract from it for the plot.



**Useful resources:**     
https://seaborn.pydata.org/introduction.html<br>
https://python-graph-gallery.com/seaborn/   
https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html
https://seaborn.pydata.org/tutorial/distributions.html


In [None]:
import seaborn as sns

In [None]:
# basic scatterplot

sns.scatterplot(x='petal_length',y='petal_width',data=df_iris)
plt.show()

`seaborn` lets users *style* their plotting environment.

In [None]:
sns.set(style='whitegrid')

However, you can always use `matplotlib`'s `plt.style`

In [None]:
#dir(sns)

In [None]:
sns.scatterplot(x='petal_length',y='petal_width',data=df_iris)
plt.show()

In [None]:
# hue argument allows you to color dots by category

sns.scatterplot(x='petal_length',
                y='petal_width', 
                hue = "species", # color
                data=df_iris)    # DataFrame that holds the columns named above
plt.show()
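With `hue`, seaborn also builds the legend from the categories it finds. The sketch below uses a small made-up DataFrame (so it runs without the iris download) and reads the legend entries back; `palette="Set2"` is just one of the named palettes.

```python
import pandas as pd
import seaborn as sns

# A tiny made-up table with iris-like columns
toy = pd.DataFrame({
    "petal_length": [1.4, 1.5, 4.5, 4.7, 5.9, 6.1],
    "petal_width": [0.2, 0.3, 1.5, 1.4, 2.1, 2.3],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

ax = sns.scatterplot(data=toy, x="petal_length", y="petal_width",
                     hue="species", palette="Set2")

# the legend gets one entry per category in the hue column
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_labels)
```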

#### Violin plot

A violin plot combines a box plot with a kernel density estimate, so the shape of the distribution is visible without adding 'jitter' to the individual data points.

In [None]:
res = sns.violinplot(data=df_iris)

#### Histplot, kdeplot and subplots

In [None]:
# histplot() plots a univariate distribution of observations as a histogram.
# With kde=True it overlays a kernel density estimate; kdeplot() draws only the density curve.
# (These functions replace the deprecated distplot().)

sns.set(style='darkgrid', palette='muted')

# 4 rows, 1 column - all have the same x axis
f, axes = plt.subplots(4,1, figsize=(10,10), sharex=True)

# Regular histplot
sns.histplot(df_iris.petal_length, ax=axes[0])

# Change the color and remove the fill
sns.histplot(df_iris.petal_width, fill=False, ax=axes[1], color='orange')

# Show the kernel density estimate (kde)
sns.histplot(df_iris.sepal_width, kde=True, ax=axes[2], color='purple')

# kdeplot
sns.kdeplot(df_iris.sepal_length,   ax=axes[3], color='green' )



#### FacetGrid - break plot in subplots

In [None]:
# sns.set()
columns = ['species', 'petal_length', 'petal_width']
facet_column = 'species'
g = sns.FacetGrid(df_iris.loc[:,columns], 
                  col=facet_column, 
                  hue=facet_column, 
                  col_wrap=2)
g.map(plt.scatter, 'petal_length', 'petal_width')
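A `FacetGrid` can also be filled with a seaborn function through `map_dataframe()`, and the grid object has helpers for labels and titles. A sketch on a small made-up DataFrame, offered as a variation on the cell above:

```python
import pandas as pd
import seaborn as sns

# A tiny made-up table with iris-like columns
toy = pd.DataFrame({
    "petal_length": [1.4, 1.5, 4.5, 4.7, 5.9, 6.1],
    "petal_width": [0.2, 0.3, 1.5, 1.4, 2.1, 2.3],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# one facet (subplot) per species, wrapping to a new row after 2 columns
g = sns.FacetGrid(toy, col="species", col_wrap=2)

# map_dataframe passes each facet's subset of the data to the plotting function
g.map_dataframe(sns.scatterplot, x="petal_length", y="petal_width")

g.set_axis_labels("petal length", "petal width")
g.set_titles(col_template="{col_name}")

facet_titles = [ax.get_title() for ax in g.axes.flat]
print(facet_titles)
```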

#### sns.relplot - drawing relational plots onto a FacetGrid

In [None]:
# help(sns.relplot)

In [None]:
# relplot - scatterplot
res = sns.relplot(x="petal_length", 
                  y="petal_width", 
                  col="species",
                  hue="species", 
                  style="species", 
                  size="species",
                  data=df_iris)


#### <font color = "red">Exercise</font>

* Use seaborn to plot a boxplot of the sepal_width for each species 


In [None]:
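# one box per species, showing the distribution of sepal_width
sns.boxplot(x='species', y='sepal_width', data=df_iris)
plt.show()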