# worksheet 11a: Plotting with pandas and matplotlib

### Pandas leverages matplotlib
In this notebook you will learn the basics of matplotlib and how it integrates
with pandas.
- pyplot tutorial: https://matplotlib.org/stable/tutorials/pyplot.html
- `plt.plot` reference: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
gene_tx_df = pd.read_csv('genes_transcripts.csv')

In [None]:
gene_tx_df['gene_name'].unique()

In [None]:
gene_tx_df['gene_name'].nunique()

## matplotlib pyplot + pandas

### line plot, only `y` data
When a single list of values is passed to `plt.plot`:
- the values are treated as y-axis values
- matplotlib generates matching x-axis values (the positions 0, 1, 2, ...)
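
The y-only behaviour can be checked on a small example. This is a minimal sketch with made-up numbers (not values from `genes_transcripts.csv`); it asks the plotted line which x values matplotlib generated.

```python
import matplotlib.pyplot as plt

# Hypothetical y values, for illustration only
y_values = [3, 1, 4, 1, 5]

# Only y is passed; plt.plot returns a list with one line object
(line,) = plt.plot(y_values)

# matplotlib filled in x with the positions of the y values
x_generated = list(line.get_xdata())
print(x_generated)
```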

In [None]:
plt.plot(gene_tx_df['gene_name'].unique())

### line plot, `x` and `y` data
- pass the x values first, then the y values

In [None]:
plt.plot(gene_tx_df['gene_name'].unique(), 
         gene_tx_df['gene_name'].unique()
        )

### scatterplot, `x` and `y` data
- `plt.scatter` takes x and y values and draws them as unconnected points

In [None]:
plt.scatter(
    gene_tx_df['gene_name'].unique(), 
    gene_tx_df['gene_name'].unique()
)

### Formatting the style of the plot
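- the default for `plt.plot` is a solid blue line
- change the plot above to red points, either with `color='red'` in `plt.scatter` or with the format string `'or'` in `plt.plot`
- see `Example format strings` at https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot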
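
As a rough illustration of how a format string is read, the sketch below plots the same made-up points twice and then inspects the resulting line objects. The data and variable names are assumptions for this example only.

```python
import matplotlib.pyplot as plt

# Hypothetical points, for illustration only
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# 'or' = marker 'o' (circle) + color 'r' (red); no line style, so no connecting line
(points,) = plt.plot(x, y, 'or')

# '--g' = dashed line style + color 'g' (green); no marker
(dashed,) = plt.plot(x, y, '--g')

print(points.get_marker(), points.get_color())
print(dashed.get_linestyle(), dashed.get_color())
```

The same settings can be given as keywords (`marker='o'`, `color='r'`, `linestyle='--'`), which is often easier to read than the compact string.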

In [None]:
plt.scatter(
    gene_tx_df['gene_name'].unique(), 
    gene_tx_df['gene_name'].unique(),
    color='red'
)

In [None]:
plt.plot(
    gene_tx_df['gene_name'].unique(), 
    gene_tx_df['gene_name'].unique(),
    'or'
)

### compare scatter plot to pandas df.plot

In [None]:
d = {
    'col1': gene_tx_df['gene_name'].unique(),
    'col2': gene_tx_df['gene_name'].unique()
}
df = pd.DataFrame(d)

In [None]:
df

In [None]:
df.plot.scatter(x='col1', y='col2', color='red')

### Plotting with keyword strings
Revisit the scatter plot of `tx_length` by `gene_name`, with the points colored by `biotype`
- when a DataFrame is passed as `data=`, matplotlib accepts column names as strings for `x` and `y`
- the `plt.scatter` command is very similar to `df.plot.scatter`
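
A small sketch of the keyword-string idea, using a toy DataFrame with hypothetical gene names and lengths instead of the worksheet's CSV. The per-row color list is built here with `Series.map`, which is one possible way to do the lookup.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy table with hypothetical genes and transcript lengths
toy_df = pd.DataFrame({
    'gene_name': ['GENE_A', 'GENE_A', 'GENE_B', 'GENE_C'],
    'tx_length': [1200, 800, 450, 3000],
    'biotype': ['protein_coding', 'lncRNA', 'lncRNA', 'protein_coding'],
})

# One color per row, looked up from that row's biotype
biotype_colors = {'protein_coding': 'red', 'lncRNA': 'black'}
row_colors = toy_df['biotype'].map(biotype_colors).tolist()

# With data=toy_df, the strings 'gene_name' and 'tx_length' are looked up as columns
points = plt.scatter(x='gene_name', y='tx_length', c=row_colors, data=toy_df)
plt.xlabel('gene names')
plt.ylabel('tx length in bps')
```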

#### unique biotypes

In [None]:
gene_tx_df['biotype'].unique()

#### assign color for each biotype

In [None]:
# assign a color to each biotype
map_biotype_to_colors = {
    'processed_pseudogene': 'blue',
    'transcribed_unprocessed_pseudogene': 'pink',
    'transcribed_processed_pseudogene': 'yellow',
    'unprocessed_pseudogene': 'brown',
    'protein_coding': 'red',
    'retained_intron': 'grey',
    'nonsense_mediated_decay': 'green',
    'protein_coding_CDS_not_defined': 'purple',
    'lncRNA': 'black',
    'snRNA': 'orange',
}
# one color per row, in the same order as the rows of gene_tx_df
color_list = [map_biotype_to_colors[biotype] for biotype in gene_tx_df['biotype']]

#### plt.scatter

In [None]:
plt.scatter(x='gene_name', y='tx_length', c=color_list, data=gene_tx_df)
plt.xlabel('gene names')
plt.ylabel('tx length in bps')
plt.show()

#### df.plot.scatter

In [None]:
gene_tx_df.plot.scatter(x='gene_name', y='tx_length', c=color_list)
plt.xlabel('gene names')
plt.ylabel('tx length in bps')

### Plotting multiple plots (subplots)
create multiple subplots with the following:
- plot 1: plot of `tx_length` by `gene_name` for `protein_coding` biotype
- plot 2: plot of `tx_length` by `gene_name` for all biotypes
- see `plt.figure` for figsize and other args https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html#matplotlib.pyplot.figure
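
The three numbers passed to `plt.subplot` are rows, columns, and the position of the panel, counting from 1. The sketch below, with made-up data, lays out a 1 x 2 grid; `plt.subplot(1, 2, 1)` is the long form of `plt.subplot(121)`.

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 3))       # width, height in inches

left = plt.subplot(1, 2, 1)            # same as plt.subplot(121)
plt.plot([1, 2, 3], [1, 2, 3])         # drawn on the left panel (the current axes)
plt.title('plot 1')

right = plt.subplot(1, 2, 2)           # same as plt.subplot(122)
plt.plot([1, 2, 3], [3, 2, 1], 'r')    # drawn on the right panel
plt.title('plot 2')

print(len(fig.axes))
```

Each `plt.` call after a `plt.subplot` line applies to the panel selected most recently, which is why the labels and titles in the worksheet cell are repeated per panel.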

In [None]:
protein_coding_df = gene_tx_df[gene_tx_df['biotype'] == 'protein_coding']

In [None]:
# plt.figure(figsize=(width_in_inches, height_in_inches))
plt.figure(figsize=(15, 4))
# plt.subplot(num_rows, num_cols, plot_number)
plt.subplot(121)
plt.scatter(x='gene_name', y='tx_length', data=protein_coding_df)
plt.xlabel('genes')
plt.ylabel('tx_length in bps')
plt.title('tx_length for protein coding biotype')
plt.subplot(122)
plt.scatter(x='gene_name',y='tx_length', data=gene_tx_df, color='red')
plt.xlabel('genes')
plt.ylabel('tx_length in bps')
plt.title('tx_length for all biotypes')
plt.show()

### Controlling line properties

Plot `gene_name` versus counts of transcripts for protein_coding genes using different line styles
- see the list of `Line2D` properties at https://matplotlib.org/stable/api/_as_gen/matplotlib.lines.Line2D.html#matplotlib.lines.Line2D
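
A compact sketch of the idea on toy data: count rows per gene with `groupby`, plot the counts with line-style keywords, then read the properties back from the drawn line. The gene and transcript names are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical transcript table: one row per transcript
toy_df = pd.DataFrame({
    'gene_name': ['GENE_A', 'GENE_A', 'GENE_A', 'GENE_B', 'GENE_C', 'GENE_C'],
    'transcripts': ['tx1', 'tx2', 'tx3', 'tx4', 'tx5', 'tx6'],
})

# Number of transcripts per gene
toy_counts = toy_df.groupby('gene_name')['transcripts'].count()

# Series.plot returns the Axes it drew on; linewidth and linestyle are
# passed through to the matplotlib line
ax = toy_counts.plot(linewidth=5.0, linestyle='--', title='thick dashed line')
line = ax.lines[0]
print(line.get_linewidth(), line.get_linestyle())
```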

In [None]:
protein_coding_df.columns

#### gene_name versus count of transcripts

In [None]:
tx_count = protein_coding_df.groupby('gene_name')['transcripts'].agg('count')

In [None]:
tx_count.name = 'transcript_counts'

In [None]:
tx_count

In [None]:
plt.figure(figsize=(13,13))
plt.subplot(221)
tx_count.plot(title='default')
plt.subplot(222)
tx_count.plot(linewidth=5.0, title='line width=5.0')
plt.subplot(223)
tx_count.plot(linestyle='--', title='linestyle="--"')
plt.subplot(224)
tx_count.plot()
plt.yscale('log')
plt.title('log scale')
plt.grid(True)
plt.show()

## matplotlib axes
- Matplotlib graphs your data on Figures (e.g., windows, Jupyter widgets), each of which can contain one or more Axes. An Axes is an area where points can be specified in terms of x-y coordinates
- The simplest way to create a Figure with an Axes is `pyplot.subplots`. We can then use `Axes.plot` to draw some data on the Axes and display the figure
- see the `Parts of a Figure` diagram in the quick start guide, really useful! https://matplotlib.org/stable/users/explain/quick_start.html
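
To make the Figure/Axes relationship concrete, here is a minimal sketch with made-up data. It creates one Figure with one Axes and checks that pyplot's "current axes" is that same object.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()           # one Figure holding one Axes
ax.plot([0, 1, 2], [0, 1, 4])      # draw on that Axes
ax.set_title('explicit Axes')

# The Figure keeps a list of its Axes; plt.gca() returns the current one
print(fig.axes == [ax])
print(plt.gca() is ax)
```

Holding on to `ax` means every later call states which Axes it acts on, instead of relying on whichever panel pyplot considers current.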

### One plot

#### this is a single figure with a single axes (x, y)

In [None]:
plt.plot(gene_tx_df['gene_name'].unique(), gene_tx_df['gene_name'].unique())

#### this is also a figure containing a single axes
- call `plt.subplots()` and save `ax`
- use `ax` to plot, set the title, labels etc

In [None]:
# Create a figure containing a single Axes.
fig, ax = plt.subplots()   
# Plot some data on the Axes.
ax.plot(gene_tx_df['gene_name'].unique(), gene_tx_df['gene_name'].unique())
ax.set_xlabel('gene name')
ax.set_ylabel('gene name')
ax.set_title('axes experiment')
# show the plot
plt.show()                           

### Two plots, single axes 
- scatter plot of `tx_length` for protein_coding transcripts
- scatter plot of `tx_length` for lncRNA transcripts
- use figure with a single axes
- two overlapping plots
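
The overlap works because both `scatter` calls target the same Axes. A small sketch with hypothetical values, showing that `legend()` picks up the `label=` given to each call:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(5, 3))

# Two scatter calls on the same Axes share one plot (values are hypothetical)
ax.scatter(['GENE_A', 'GENE_B'], [1200, 3000], color='red', label='protein coding')
ax.scatter(['GENE_A', 'GENE_C'], [800, 450], label='lncRNA')

# legend() collects the label= of each call
legend = ax.legend()
legend_labels = [text.get_text() for text in legend.get_texts()]
print(legend_labels)
```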

In [None]:
gene_tx_df.columns

In [None]:
lncRNA_df = gene_tx_df[gene_tx_df['biotype'] == 'lncRNA']

In [None]:
fig, ax = plt.subplots(figsize=(5,3), layout='constrained') # a figure with single axes
ax.scatter(x='gene_name', y='tx_length', data=protein_coding_df, label='tx_length, protein coding', color='red')
ax.scatter(x='gene_name', y='tx_length', data=lncRNA_df, label='tx_length, lncRNA')
ax.set_xlabel('gene name')
ax.set_ylabel('tx length')
ax.set_title('scatter plot of tx_lengths')
ax.legend()
plt.show()

### Two plots, two different axes
- two subplots
- tx_length protein_coding, tx_length lncRNA
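
With separate Axes, settings such as the y-scale apply to one panel only. The sketch below uses made-up lengths and plots the same values on a linear and a log axis side by side.

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

lengths = [450, 800, 1200, 30000]   # hypothetical transcript lengths
positions = range(len(lengths))

ax1.scatter(positions, lengths)
ax1.set_title('linear y-axis')

ax2.scatter(positions, lengths)
ax2.set_yscale('log')               # changes ax2 only
ax2.set_title('log y-axis')

print(ax1.get_yscale(), ax2.get_yscale())
```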

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(15,5))
ax1.scatter(x='gene_name', y='tx_length', data=protein_coding_df, label='tx_length, protein coding', color='red')
ax1.set_xlabel('gene name')
ax1.set_ylabel('tx length, protein coding')
ax2.set_yscale('log')
ax2.scatter(x='gene_name', y='tx_length', data=lncRNA_df, label='tx_length, lncRNA')
ax2.set_xlabel('gene name')
ax2.set_ylabel('tx length, lncRNA')
ax1.legend()
ax2.legend()
plt.show()

### Three plots, three different axes
- three subplots
- tx_count protein coding, tx_count protein coding undef CDS, tx_count lncRNA
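
Counts per gene are a natural fit for bar charts. As a sketch on a toy Series (hypothetical counts), `bar` draws vertical bars whose height is the count, while `barh` draws horizontal bars whose width is the count.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical transcript counts per gene
toy_counts = pd.Series([3, 1, 2], index=['GENE_A', 'GENE_B', 'GENE_C'])

fig, (ax_v, ax_h) = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))
vertical = ax_v.bar(toy_counts.index, toy_counts.values)
horizontal = ax_h.barh(toy_counts.index, toy_counts.values)

# Each call returns a container of rectangles, one per gene
bar_heights = [bar.get_height() for bar in vertical]
barh_widths = [bar.get_width() for bar in horizontal]
print(bar_heights, barh_widths)
```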

In [None]:
gene_tx_df['biotype'].unique()

In [None]:
pc_undef = gene_tx_df[gene_tx_df['biotype'] == 'protein_coding_CDS_not_defined']

In [None]:
tx_count_pc = protein_coding_df.groupby('gene_name')['transcripts'].count()
tx_count_lncRNA = lncRNA_df.groupby('gene_name')['transcripts'].count()
tx_count_pc_undef = pc_undef.groupby('gene_name')['transcripts'].count()

In [None]:
fig, (ax, ax2, ax3) = plt.subplots(nrows=1, ncols=3, figsize=(10,3), layout='constrained')
# x and height are passed directly, so no data= argument is needed
ax.bar(tx_count_pc.index, tx_count_pc.values, color='red')
ax.set_title('tx_count_pc')
# barh draws horizontal bars: genes on the y-axis, counts on the x-axis
ax2.barh(tx_count_lncRNA.index, tx_count_lncRNA.values, color='red')
ax2.set_title('tx_count_lncRNA')
ax3.bar(tx_count_pc_undef.index, tx_count_pc_undef.values, color='red')
ax3.set_title('tx_count_pc_undef')
plt.show()


# Collaborative Exercises

## Exercise 1
Create the following plots
- plot 1: scatter plot of `median_tx_length` by `gene_name` for `protein_coding` biotype. You will need to compute the median by gene for protein coding biotype
- plot 2: boxplot of `tx_length` by `gene_name` for `protein_coding` biotype
- plot 3: overlapping histograms. Plot histograms of `tx_length` for BRCA1 and BRCA2 on the same axes and show a legend
- see `plt.figure` for figsize and other args https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html#matplotlib.pyplot.figure

## Exercise 2
- Plot a violin plot of `tx_length` by `gene_name` for protein_coding biotype
- BONUS: Can you figure out how to change the colors in the violin plot of `tx_length` by `gene_name` for protein_coding biotype? Your code should give the violin for each gene its own color
- see helpful link here https://matplotlib.org/stable/gallery/statistics/customized_violin.html#sphx-glr-gallery-statistics-customized-violin-py
