# Introduction to Coding

### Matplotlib, Seaborn, and Pandas

`Matplotlib`, `Seaborn`, and `Pandas` are three very popular scientific libraries for Python.

`Matplotlib` is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.  
`Seaborn` is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.  
`Pandas` is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

The most convenient way to use `matplotlib` in a Jupyter environment is to import the library in this way:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

This displays the generated plots directly in the notebook.

## Matplotlib

#### Simple plot

In [None]:
# Create some example data
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# Plots the data
plt.plot(x, y)

# Add a title and labels to the plot
plt.title('Plot title')
plt.xlabel('X values')
plt.ylabel('Y values')

# Show the plot
plt.show()

#### Scatterplot

The same data as above, this time drawn with `plt.scatter`:

In [None]:
# Plots the data
plt.scatter(x, y)

# Add a title and labels to the plot
plt.title('Plot title')
plt.xlabel('X values')
plt.ylabel('Y values')

# Show the plot
plt.show()

#### Changing colors and sizes

In [None]:
# We need this module to create random numbers
import random

In [None]:
# We create random data (coordinates, color, and size of the dots)
x = []
y = []
colors = []
size = []

for i in range(100):
    x.append(random.randrange(50))
    y.append(random.randrange(50))
    colors.append(random.randrange(100))
    size.append(random.randrange(100, 1000))

In [None]:
# We plot the data    
plt.scatter(x, y, c=colors, s=size, alpha=0.5)
plt.xlabel('X values')
plt.ylabel('Y values')

plt.show()

#### Barplot

In [None]:
# Define data
names = ['A', 'B', 'C']
x = [1, 2, 3]
y = [1, 10, 100]
colors = ['blue', 'red', 'green']

# Create plot
plt.bar(x=x, height=y, tick_label=names, color=colors)
plt.title('Example of barplot')

plt.show()

#### Histogram

In [None]:
# Creates random data
x = []
for i in range(500):
    x.append(random.randrange(100))

# Plots the data
plt.hist(x, bins=50)
plt.title('Example of histogram')

plt.show()

### Exercise

Try to plot the nucleotide content of the first chromosome of the yeast genome. A file with the sequence of the first chromosome is available in the `data` folder (`../data/yeast_chr1.fa`).  
The data have been downloaded from the [SGD archive](http://sgd-archive.yeastgenome.org/sequence/S288C_reference/chromosomes/fasta/).

Tips:

* Read the FASTA file and assign the sequence of the chromosome to a variable
* Count the number of A, C, G, and T in the sequence
* Define the color of each bar of the plot (optional)
* Create the barplot (the minimal syntax is `plt.bar(x, height)`; optionally you can add the arguments `tick_label` and `color`)
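
The first two tips can be sketched on a toy example. This is only an illustration under stated assumptions: the FASTA text below is made up and held in memory instead of being read from `../data/yeast_chr1.fa`, and the helper name `read_fasta_sequence` is hypothetical. Plotting is left to the exercise.

```python
from collections import Counter

# A tiny made-up FASTA record (assumption: one header line starting with '>',
# followed by sequence lines), used here instead of the real file
fasta_text = """>toy_sequence
ACGTAC
GGTTAA
"""


def read_fasta_sequence(lines):
    """Join all non-header lines of a single-record FASTA into one string."""
    parts = []
    for line in lines:
        line = line.strip()
        # Skip empty lines and the header line
        if line and not line.startswith('>'):
            parts.append(line.upper())
    return ''.join(parts)


sequence = read_fasta_sequence(fasta_text.splitlines())

# Count every character of the sequence
counts = Counter(sequence)

# Keep the counts in a fixed order, ready to be used as bar heights
nucleotides = ['A', 'C', 'G', 'T']
heights = [counts[n] for n in nucleotides]
print(nucleotides, heights)
```

With a real file, the lines could come from an open file object instead of `fasta_text.splitlines()`.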

[![button](../figures/button_solution_small.png)](solutions.ipynb#Useful-libraries)

## Seaborn

In [None]:
import seaborn as sns

In [None]:
iris = sns.load_dataset("iris")
iris.tail()

In [None]:
sns.set(color_codes=True)

# Load a dataset
iris = sns.load_dataset("iris")

# Remove the 'species' column from the dataset and keep it in a variable
species = iris.pop("species")

# Assign a color to each species
lut = dict(zip(species.unique(), "rbg"))
row_colors = species.map(lut)

# Plot the clustermap
g = sns.clustermap(iris, row_colors=row_colors)
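
The color lookup used for `row_colors` may be easier to follow on a small example. The sketch below uses a hypothetical `labels` Series in place of the real `species` column, so it runs without downloading the dataset.

```python
import pandas as pd

# Hypothetical labels standing in for the 'species' column
labels = pd.Series(['setosa', 'setosa', 'versicolor', 'virginica'])

# unique() returns the labels in order of first appearance;
# zip() pairs each label with one character of the string "rbg"
lut = dict(zip(labels.unique(), "rbg"))

# map() replaces every label with the color found in the lookup table
row_colors = labels.map(lut)

print(lut)
print(row_colors.tolist())
```

Because `zip` stops at the shorter input, a string of three color codes covers at most three distinct labels.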

## Pandas

In [None]:
import pandas as pd
import numpy as np

#### Create a DataFrame

In [None]:
df = pd.DataFrame({
    'a': range(10),
    'b': np.arange(0, 1, 0.1),
    'c': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'l'],
    'd': [True, True, False, False, True, True, True, False, False, True]
})

In [None]:
df

In [None]:
df.head(3)

In [None]:
df.tail()

In [None]:
df.shape

Rows and columns can be accessed as follows:

In [None]:
df['a']

In [None]:
df.loc[2]

Each column has a data-type:

In [None]:
df.dtypes

We can force a column to be a different compatible data-type:

In [None]:
df['a'] = df['a'].astype('float')

In [None]:
df.head()

In [None]:
df.dtypes
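
As a small sketch of what "compatible" means here, the example below converts a hypothetical integer Series to floats and to strings, and shows that converting letters to integers fails. The variable names are illustrative only.

```python
import pandas as pd

numbers = pd.Series([0, 1, 2])

# Integers can always be represented as floats or as text
as_float = numbers.astype('float')
as_text = numbers.astype(str)

letters = pd.Series(['a', 'b', 'c'])

# Letters cannot be interpreted as integers, so pandas raises an error
try:
    letters.astype('int')
    conversion_failed = False
except (ValueError, TypeError):
    conversion_failed = True

print(as_float.dtype, as_text.tolist(), conversion_failed)
```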

#### Read a file in pandas

In [None]:
df = pd.read_csv('../data/yeast_genes_chrom2.txt', sep='\t', header=0)

In [None]:
df.shape

In [None]:
df.head(3)

In [None]:
df.dtypes
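
To try `pd.read_csv` without the data file, the same call can be pointed at an in-memory text buffer. This is a minimal sketch: the two records are invented, and the column layout is an assumption, since the content of `yeast_genes_chrom2.txt` is not shown here.

```python
import io

import pandas as pd

# Invented tab-separated content; the first line holds the column names
text = "Gene_name\tStart\tStop\nGENE1\t100\t400\nGENE2\t900\t650\n"

# sep='\t' splits the fields on tabs, header=0 uses the first line as header
genes = pd.read_csv(io.StringIO(text), sep='\t', header=0)

print(genes.shape)
print(genes.dtypes)
```

The numeric columns are parsed as integers automatically, while the names are kept as text.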

#### Operations on a column

In [None]:
df['Chromosome'] += 1

In [None]:
df.head(3)

#### Operations on a row (or column)

We can use the method `apply` to run a function on every row (or every column) of a DataFrame.  
The syntax is the following:
```
    dataframe.apply(function, axis)
```
where **axis** can be **0** (apply the function to each column) or **1** (apply the function to each row).
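
The effect of the two `axis` values can be seen on a small, made-up DataFrame. In this sketch the same function (the maximum) is applied once per column and once per row; the table `toy` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'Start': [10, 50, 90], 'Stop': [30, 20, 100]})

# axis=0: the function receives each column and returns one value per column
column_max = toy.apply(lambda column: column.max(), axis=0)

# axis=1: the function receives each row and returns one value per row
row_max = toy.apply(lambda row: row.max(), axis=1)

print(column_max)
print(row_max)
```

The first result is indexed by the column names, the second by the row labels.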

In [None]:
# First, I define the function to be applied to each row

def gene_length(row):
    """Calculate the length of each gene
    
    :param row: DataFrame row
    :return: int, length of the gene
    """
    
    length = row['Stop'] - row['Start']
    return abs(length)

In [None]:
# Now apply the function to the dataframe

df['Length'] = df.apply(gene_length, axis=1)

In [None]:
df.head()

In [None]:
# Now I define the function to be applied to each column

def average_length(column):
    """Calculate the average length of the elements of a column
    
    :param column: DataFrame column
    :return: float, average length of the elements of a column
    """
    
    elements = [len(str(i)) for i in column]
    return np.mean(elements)

In [None]:
# Example of applying a function to each column

df.apply(average_length, axis=0)

#### Add a column to a DataFrame

In [None]:
def derive_strand(line):
    """Derive the strand of genomic coordinates
    
    :param line: line of a pandas DataFrame
    :return: string, strand of the gene
    """
    
    if line['Start'] < line['Stop']:
        return '+'
    else:
        return '-'

In [None]:
df['Strand'] = df.apply(derive_strand, axis=1)

In [None]:
df.head(10)
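
Row-wise `apply` calls a Python function once per row. As an alternative sketch (not the approach used in this notebook), the same length and strand rules can be written as operations on whole columns, which is usually faster on large tables. The DataFrame `toy` is invented for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Start': [100, 900, 50], 'Stop': [400, 650, 75]})

# Same rule as gene_length: absolute difference between Stop and Start
toy['Length'] = (toy['Stop'] - toy['Start']).abs()

# Same rule as derive_strand: '+' when Start < Stop, otherwise '-'
toy['Strand'] = np.where(toy['Start'] < toy['Stop'], '+', '-')

print(toy)
```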

#### Remove a column

In [None]:
df['Chromosome'].unique()

In [None]:
del df['Chromosome']

In [None]:
df.head(3)

#### Filter a DataFrame

In [None]:
df[df['Strand'] == '+'].head()

In [None]:
df[(df['Strand'] == '+') & (df['Length'] < 600)]
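
The expressions inside the square brackets are boolean Series (one `True`/`False` per row), and they can be combined with `&` (and), `|` (or), and `~` (not). A minimal sketch on an invented table:

```python
import pandas as pd

toy = pd.DataFrame({
    'Gene_name': ['G1', 'G2', 'G3', 'G4'],
    'Length': [300, 1200, 450, 90],
    'Strand': ['+', '-', '+', '-'],
})

# Each comparison produces one True/False value per row
is_plus = toy['Strand'] == '+'
is_short = toy['Length'] < 600

both = toy[is_plus & is_short]        # rows satisfying both conditions
either = toy[is_plus | is_short]      # rows satisfying at least one condition
neither = toy[~(is_plus | is_short)]  # rows satisfying none of them

print(both)
print(either)
print(neither)
```

Each comparison needs its own parentheses when written inline, because `&` and `|` bind more tightly than `==` and `<`.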

#### Save the DataFrame to a file

In [None]:
df.to_csv('../data/yeast_genes_chrom2_strand.txt', sep='\t', header=True, index=False)

#### Create plots with Pandas

In [None]:
df['Length'].hist()

In [None]:
df[df['Length'] > 7000]

In [None]:
df.plot(x='Start', y='Length')

In [None]:
df.plot(x='Start', y='Length', kind='scatter')

#### Other useful pandas methods

#### **.tolist()**

In [None]:
df[:5]['Gene_name']

In [None]:
df[:5]['Gene_name'].tolist()

#### **.describe()**

In [None]:
df.describe()

#### **.mean()**

In [None]:
df['Length'].mean()

#### **.std()**

In [None]:
df['Length'].std()

#### **.isna()** and **.dropna()**

In [None]:
# Work on a copy so that the assignments below do not touch df
new_df = df.head().copy()

In [None]:
new_df.loc[1, 'Length'] = np.nan
new_df.loc[3, 'Strand'] = np.nan

In [None]:
new_df

In [None]:
new_df.isna()

In [None]:
new_df.dropna()
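
Two common variations, sketched on an invented table: counting the missing values per column by summing the result of `isna()`, and dropping rows only when a specific column is missing, using the `subset` argument of `dropna()`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Gene_name': ['G1', 'G2', 'G3'],
    'Length': [300.0, np.nan, 450.0],
    'Strand': ['+', '-', np.nan],
})

# True counts as 1, so the sum gives the number of missing values per column
missing_per_column = toy.isna().sum()

# Drop every row that contains at least one missing value
complete_rows = toy.dropna()

# Drop only the rows where 'Length' is missing
with_length = toy.dropna(subset=['Length'])

print(missing_per_column)
print(complete_rows)
print(with_length)
```

Note that `dropna()` returns a new DataFrame and leaves the original unchanged.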

#### **.pivot_table()**

In [None]:
new_df

In [None]:
# Add a fake gene
new_df.loc[5] = ['PAU9', 7733, 7605, 128.0, '+']

In [None]:
# Group the numeric columns by gene name and strand (the default aggregation is the mean)
new_df.pivot_table(index=['Gene_name', 'Strand'])
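
What the pivot table computes is easier to see when some rows share the same index values. In this sketch on invented data, two rows have the same gene name and strand, so their lengths are averaged into a single row.

```python
import pandas as pd

toy = pd.DataFrame({
    'Gene_name': ['G1', 'G1', 'G1', 'G2'],
    'Strand': ['+', '+', '-', '+'],
    'Length': [100.0, 300.0, 200.0, 250.0],
})

# One row per (Gene_name, Strand) pair; Length is averaged within each pair
table = toy.pivot_table(values='Length',
                        index=['Gene_name', 'Strand'],
                        aggfunc='mean')

print(table)
```

Here the pair (`G1`, `+`) appears twice in `toy` and once in `table`, with the mean of 100 and 300 as its length.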

#### **.T**

In [None]:
new_df

In [None]:
new_df.T

#### **.value_counts()**

In [None]:
df['Strand'].value_counts()
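
`value_counts()` can also return fractions instead of counts through its `normalize` argument. A short sketch on an invented Series:

```python
import pandas as pd

strands = pd.Series(['+', '-', '+', '+'])

# Number of occurrences of each value, most frequent first
counts = strands.value_counts()

# Same information expressed as a fraction of the total
fractions = strands.value_counts(normalize=True)

print(counts)
print(fractions)
```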