# Matplotlib, Seaborn, and Pandas

`Matplotlib`, `Seaborn`, and `Pandas` are three very popular scientific Python libraries.

`Matplotlib` is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.  
`Seaborn` is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.  
`Pandas` is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

## Index

* [Matplotlib](matplotlib_pandas.ipynb#Matplotlib)
* [Seaborn](matplotlib_pandas.ipynb#Seaborn)
* [Pandas](matplotlib_pandas.ipynb#Pandas)
    * [Create a DataFrame](matplotlib_pandas.ipynb#Create-a-DataFrame)
    * [Read a file in pandas](matplotlib_pandas.ipynb#Read-a-file-in-pandas)
    * [Operations on a column](matplotlib_pandas.ipynb#Operations-on-a-column)
    * [Operations on a row (or column)](matplotlib_pandas.ipynb#Operations-on-a-row)
    * [Add a column to a DataFrame](matplotlib_pandas.ipynb#Add-a-column-to-a-DataFrame)
    * [Remove a column](matplotlib_pandas.ipynb#Remove-a-column)
    * [Filter a DataFrame](matplotlib_pandas.ipynb#Filter-a-DataFrame)
    * [Save the DataFrame to a file](matplotlib_pandas.ipynb#Save-the-DataFrame-to-a-file)
    * [Create plots with Pandas](matplotlib_pandas.ipynb#Create-plots-with-Pandas)
    * [Other useful pandas methods](matplotlib_pandas.ipynb#Other-useful-pandas-methods)

## Matplotlib
[back to top](matplotlib_pandas.ipynb#Index)

The most convenient way to use matplotlib in a Jupyter environment is to import the library in this way:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

This allows the generated plots to be displayed directly in the notebook.

In [None]:
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
plt.plot(x, y)
plt.title('Plot title')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.show()

In [None]:
import numpy as np

# evenly sampled time at 200ms intervals
t = np.arange(0, 5, 0.2)

# red dashes, blue squares and green triangles
plt.plot(t, t, color='red', linestyle='--', label='linear') 
plt.plot(t, t ** 2, color='blue', linestyle='', marker='s', label='square')
plt.plot(t, t ** 3, color='green', linestyle='', marker='^', label='cubic')

plt.legend()
plt.show()

In [None]:
x = np.arange(50)
y = x + 10 * np.random.randn(50)
color = np.random.randint(0, 50, 50)
size = np.abs(np.random.randn(50)) * 100

plt.scatter(x, y, c=color, s=size)
plt.xlabel('X values')
plt.ylabel('Y values')

plt.show()

In [None]:
names = ['group_a', 'group_b', 'group_c']
values = [1, 10, 100]

plt.figure(figsize=(9, 3))

plt.subplot(131)
plt.bar(names, values)

plt.subplot(132)
plt.scatter(names, values)

plt.subplot(133)
plt.plot(names, values)

plt.suptitle('Categorical Plotting')
plt.show()

In [None]:
def f(t):
    return np.exp(-t) * np.cos(2 * np.pi * t)

t1 = np.arange(0.0, 5.0, 0.1)
t2 = np.arange(0.0, 5.0, 0.02)

plt.figure(figsize=(10, 6))

ax0 = plt.subplot2grid((2, 1), (0, 0))
ax1 = plt.subplot2grid((2, 1), (1, 0))

ax0.plot(t1, f(t1), color='blue', marker='o') 
ax0.plot(t2, f(t2), color='black')
ax1.plot(t2, np.cos(2*np.pi*t2), color='red', linestyle='--')

plt.show()

`plt.hist` can draw different types of histograms, selected with the `histtype` argument: `bar` (the default), `barstacked`, `step`, and `stepfilled`.

In [None]:
mu = 200
sigma = 25
x = np.random.normal(mu, sigma, size=100)

fig = plt.figure(figsize=(8, 8))

n, bins, patches = plt.hist(x, 20, density=True, histtype='stepfilled', facecolor='g', alpha=0.75)
plt.title('stepfilled')

plt.show()
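
To compare the four histogram types side by side, something like the following sketch could be used. The two random samples are made up for illustration (two datasets are needed for `barstacked` to differ visibly from `bar`), and the 2x2 layout built with `plt.subplots` is just one possible arrangement.

```python
import matplotlib.pyplot as plt
import numpy as np

# Two hypothetical samples, so that 'barstacked' has something to stack
rng = np.random.default_rng(0)
samples = [rng.normal(200, 25, size=100), rng.normal(230, 15, size=100)]

hist_types = ['bar', 'barstacked', 'step', 'stepfilled']
fig, axes = plt.subplots(2, 2, figsize=(8, 8))

# Draw the same data once per histogram type
for ax, hist_type in zip(axes.flat, hist_types):
    ax.hist(samples, bins=20, histtype=hist_type, alpha=0.75)
    ax.set_title(hist_type)

plt.show()
```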

## Seaborn
[back to top](matplotlib_pandas.ipynb#Index)

In [None]:
import seaborn as sns

sns.set(color_codes=True)

# Load a dataset
iris = sns.load_dataset("iris")

# Remove the 'species' column from the dataset (pop returns the removed column)
species = iris.pop("species")

# Assign each species a color
lut = dict(zip(species.unique(), "rbg"))
row_colors = species.map(lut)

# Plot the clustermap
g = sns.clustermap(iris, row_colors=row_colors)

## Pandas
[back to top](matplotlib_pandas.ipynb#Index)

In [None]:
import pandas as pd

### Create a DataFrame
[back to top](matplotlib_pandas.ipynb#Index)

In [None]:
df = pd.DataFrame({
    'a': range(10),
    'b': np.arange(0, 1, 0.1),
    'c': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'l'],
    'd': [True, True, False, False, True, True, True, False, False, True]
})

In [None]:
df

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.shape

Columns and rows can be accessed as follows:

In [None]:
df['a']

In [None]:
df.loc[2]
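
Note that `.loc` selects rows by index label, while `.iloc` selects them by position. With the default integer index the two coincide, so the difference is easier to see on a DataFrame with a custom index. The small DataFrame below is hypothetical and only serves as an illustration.

```python
import pandas as pd

# Hypothetical DataFrame with a non-default index, to make the difference visible
small = pd.DataFrame({'a': [10, 20, 30]}, index=['x', 'y', 'z'])

by_label = small.loc['y']       # row whose index label is 'y'
by_position = small.iloc[1]     # second row, whatever its label is
one_cell = small.loc['z', 'a']  # single value: row 'z', column 'a'

print(by_label)
print(by_position)
print(one_cell)
```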

Each column has a data-type:

In [None]:
df.dtypes

We can convert a column to a different, compatible data type:

In [None]:
df['a'] = df['a'].astype('float')

In [None]:
df.head()

In [None]:
df.dtypes
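
"Compatible" here means that every value in the column can actually be converted. As a rough sketch on a made-up DataFrame (the names `toy`, `n`, and `word` are assumptions for this example), strings that contain digits convert to integers, while arbitrary text raises a `ValueError`:

```python
import pandas as pd

# Hypothetical data: one column of numeric strings, one of plain text
toy = pd.DataFrame({'n': ['1', '2', '3'], 'word': ['a', 'b', 'c']})

# Numeric strings can be converted to integers
toy['n'] = toy['n'].astype('int')

# Plain text cannot: the conversion raises a ValueError
conversion_error = None
try:
    toy['word'].astype('int')
except ValueError as error:
    conversion_error = error

print(toy.dtypes)
print(conversion_error)
```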

### Read a file in pandas
[back to top](matplotlib_pandas.ipynb#Index)

In [None]:
df = pd.read_csv('../data/yeast_genes_chrom2.txt', sep='\t', header=0)

In [None]:
df.shape

In [None]:
df.head(3)

In [None]:
df.dtypes

### Operations on a column
[back to top](matplotlib_pandas.ipynb#Index)

In [None]:
df['Chromosome'] += 1

In [None]:
df.head(3)

### Operations on a row
#### (or column)
[back to top](matplotlib_pandas.ipynb#Index)

We can use the `apply` method to run a function on every row (or column) of a DataFrame.  
The syntax is the following:
```
    dataframe.apply(function, axis)
```
where **axis** can be **0** (apply the function to each column) or **1** (apply the function to each row).

In [None]:
# First, I define the function to be applied to each row

def gene_length(row):
    """Calculate the length of each gene
    
    :param row: DataFrame row
    :return: int, length of the gene
    """
    
    length = row['Stop'] - row['Start']
    return abs(length)

In [None]:
# Now apply the function to the dataframe

df['Length'] = df.apply(gene_length, axis=1)

In [None]:
df.head()

In [None]:
# Now I define the function to be applied to each column

def average_length(column):
    """Calculate the average length of the elements of a column
    
    :param column: DataFrame column
    :return: float, average length of the elements of a column
    """
    
    elements = [len(str(i)) for i in column]
    return np.mean(elements)

In [None]:
# Example of applying a function to each column

df.apply(average_length, axis=0)
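
The effect of the `axis` argument may be easier to follow on a tiny table. The sketch below uses a made-up DataFrame (`toy`) with the same `Start` and `Stop` column names. It also shows that a simple row-wise computation such as the gene length can be written directly on the columns, which gives the same result without `apply`.

```python
import pandas as pd

# Hypothetical coordinates, small enough to check by hand
toy = pd.DataFrame({'Start': [100, 900, 50], 'Stop': [400, 300, 80]})

# axis=0: the function receives each column, one result per column
column_max = toy.apply(lambda column: column.max(), axis=0)

# axis=1: the function receives each row, one result per row
row_length = toy.apply(lambda row: abs(row['Stop'] - row['Start']), axis=1)

# Same row-wise result computed on whole columns at once
vectorized_length = (toy['Stop'] - toy['Start']).abs()

print(column_max)
print(row_length)
print(vectorized_length)
```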

### Add a column to a DataFrame
[back to top](matplotlib_pandas.ipynb#Index)

In [None]:
def derive_strand(line):
    """Derive the strand of genomic coordinates
    
    :param line: line of a pandas DataFrame
    :return: string, strand of the gene
    """
    
    if line['Start'] < line['Stop']:
        return '+'
    else:
        return '-'

In [None]:
df['Strand'] = df.apply(derive_strand, axis=1)

In [None]:
df.head(10)

### Remove a column
[back to top](matplotlib_pandas.ipynb#Index)

In [None]:
df['Chromosome'].unique()

In [None]:
del df['Chromosome']

In [None]:
df.head(3)

### Filter a DataFrame
[back to top](matplotlib_pandas.ipynb#Index)

In [None]:
df[df['Strand'] == '+'].head()

In [None]:
df[(df['Strand'] == '+') & (df['Length'] < 600)]
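
Conditions can be combined with `&` (and), `|` (or), and `~` (not). Each condition needs its own parentheses. The sketch below uses a small made-up DataFrame (`genes`) so that the results can be checked by eye; `isin` is a convenient way to keep rows whose value belongs to a list.

```python
import pandas as pd

# Hypothetical genes, not taken from the yeast file
genes = pd.DataFrame({
    'Gene_name': ['g1', 'g2', 'g3', 'g4'],
    'Strand': ['+', '-', '+', '-'],
    'Length': [300, 1200, 900, 450],
})

# Both conditions must hold
plus_and_short = genes[(genes['Strand'] == '+') & (genes['Length'] < 600)]

# At least one condition must hold
minus_or_long = genes[(genes['Strand'] == '-') | (genes['Length'] > 800)]

# Negate a condition
not_plus = genes[~(genes['Strand'] == '+')]

# Keep rows whose gene name is in a given list
selected = genes[genes['Gene_name'].isin(['g1', 'g4'])]

print(plus_and_short)
print(minus_or_long)
print(not_plus)
print(selected)
```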

### Save the DataFrame to a file
[back to top](matplotlib_pandas.ipynb#Index)

In [None]:
df.to_csv('../data/yeast_genes_chrom2_strand.txt', sep='\t', header=True, index=False)
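
To see what these arguments produce without touching the disk, the same call can be pointed at an in-memory buffer. This is only an illustrative sketch on a made-up table: `sep='\t'` separates the fields with tabs, `header=True` writes the column names as the first line, and `index=False` leaves out the row index.

```python
import io

import pandas as pd

# Hypothetical table written to an in-memory buffer instead of a file
table = pd.DataFrame({'Gene_name': ['g1', 'g2'], 'Length': [300, 1200]})

buffer = io.StringIO()
table.to_csv(buffer, sep='\t', header=True, index=False)
text = buffer.getvalue()
print(text)

# Read the same text back, as read_csv would do with a file
buffer.seek(0)
restored = pd.read_csv(buffer, sep='\t', header=0)
print(restored)
```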

### Create plots with Pandas
[back to top](matplotlib_pandas.ipynb#Index)

In [None]:
df['Length'].hist()

In [None]:
df[df['Length'] > 7000]

In [None]:
df.plot(x='Start', y='Length')

In [None]:
df.plot(x='Start', y='Length', kind='scatter')

### Other useful pandas methods
[back to top](matplotlib_pandas.ipynb#Index)

#### **.tolist()**

In [None]:
df[:5]['Gene_name']

In [None]:
df[:5]['Gene_name'].tolist()

#### **.describe()**

In [None]:
df.describe()

#### **.mean()**

In [None]:
df['Length'].mean()

#### **.std()**

In [None]:
df['Length'].std()

#### **.isna()** and **.dropna()**

In [None]:
# Work on a copy, so that the assignments below do not act on a slice of df
new_df = df.head().copy()

In [None]:
new_df.loc[1, 'Length'] = np.nan
new_df.loc[3, 'Strand'] = np.nan

In [None]:
new_df

In [None]:
new_df.isna()

In [None]:
new_df.dropna()
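
A couple of common variations, sketched on a made-up DataFrame (`toy`): summing the result of `isna()` counts the missing values per column, and the `subset` argument of `dropna()` restricts the check to some columns. Both `dropna()` calls return a new DataFrame and leave the original unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing value in each column
toy = pd.DataFrame({
    'Length': [300.0, np.nan, 900.0],
    'Strand': ['+', '-', np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Drop rows with a missing value in any column
complete_rows = toy.dropna()

# Drop rows only if 'Length' is missing
with_length = toy.dropna(subset=['Length'])

print(missing_per_column)
print(complete_rows)
print(with_length)
```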

#### **.pivot_table()**

In [None]:
new_df

In [None]:
# Add a fake gene
new_df.loc[5] = ['PAU9', 7733, 7605, 128.0, '+']

In [None]:
# Average the numeric columns for each (Gene_name, Strand) pair
new_df.pivot_table(index=['Gene_name', 'Strand'])
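
By default `pivot_table` averages the numeric columns within each group. The column to summarize and the aggregation can also be chosen explicitly with `values` and `aggfunc`. The sketch below uses a small made-up DataFrame (`genes`) with numbers that are easy to verify by hand.

```python
import pandas as pd

# Hypothetical genes: two on each strand
genes = pd.DataFrame({
    'Strand': ['+', '+', '-', '-'],
    'Length': [300, 900, 1200, 450],
})

# Mean gene length for each strand
mean_length = genes.pivot_table(values='Length', index='Strand', aggfunc='mean')

# Largest gene length for each strand
max_length = genes.pivot_table(values='Length', index='Strand', aggfunc='max')

print(mean_length)
print(max_length)
```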

#### **.T**

In [None]:
new_df

In [None]:
new_df.T

#### **.value_counts()**

In [None]:
df['Strand'].value_counts()
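
`value_counts()` can also return fractions instead of counts by passing `normalize=True`. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical strand annotations
strands = pd.Series(['+', '-', '+', '+'])

counts = strands.value_counts()
fractions = strands.value_counts(normalize=True)

print(counts)
print(fractions)
```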