# Data analysis using pandas & matplotlib

In this assignment, you will analyze RNA base pair data using the pandas and matplotlib libraries.
The assignment is structured like a lab exercise: fill in the blank cells and answer the questions along the way.

 - [Pandas docs](https://pandas.pydata.org/)
 - [Matplotlib docs](https://matplotlib.org/index.html)

## 0. RNA base pairs

You will analyze a dataset of nucleotide-nucleotide interactions that were annotated as base pairs by automatic annotation tools.

### [What is a base pair? (wiki)](https://en.wikipedia.org/wiki/Base_pair)

### Leontis-Westhof base pair classification

![image](LW.png)

## 1. Data 

Let's start with the necessary preparations.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

#### Load the dataset from "base_pairs.tsv". Create a pandas.DataFrame named *bps* that uses the N column as its index.

In [2]:
# Paste your code here.
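
As a hint, here is a minimal sketch of reading a tab-separated table with one column used as the index. It reads from an in-memory string so that it runs on its own; `toy_tsv` and `bps_toy` are hypothetical names and the values are invented, so this is not the real content of "base_pairs.tsv".

```python
import io

import pandas as pd

# Toy stand-in for a TSV file; the rows below are invented for illustration.
toy_tsv = io.StringIO(
    "N\tBASE1\tBASE2\tMINDIST\n"
    "1\tG\tC\t2.8\n"
    "2\tA\tU\t2.9\n"
    "3\tG\tU\t3.1\n"
)

# sep="\t" parses tab-separated values; index_col="N" turns the N column into the index.
bps_toy = pd.read_csv(toy_tsv, sep="\t", index_col="N")
print(bps_toy)
```

For the assignment itself, the same call should work with the file name in place of the in-memory buffer.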

Let's see what we've got:

In [None]:
bps.head(4)

Column descriptions:
 - N - ordinal number
 - RESIDUE1 - first nucleotide (chain.base.id)
 - RESIDUE2 - second nucleotide (chain.base.id)
 - BASE1 - base 1
 - BASE2 - base 2
 - CLOSESTATOM1 - first nucleotide's atom of the closest atom-atom pair
 - CLOSESTATOM2 - second nucleotide's atom of the closest atom-atom pair
 - MINDIST - distance of the closest atom-atom pair
 - #ATOMPAIRS - number of mutually closest atom pairs
 - ATOMPAIRS - list of mutually closest atom pairs
 - CONFORMATION1 - first nucleotide conformation annotated with DSSR
 - CONFORMATION2 - second nucleotide conformation annotated with DSSR
 - DSSR_BP - DSSR annotations (Lu et al. 2015)
 - FR3D_BP - FR3D annotations (Sarver et al. 2008)
 - MCANNOTATE_BP - MC-Annotate annotations (Gendron et al. 2001)
 - RNAVIEW_BP - RNAView annotations (Yang et al. 2003)
 - CLARNA_BP - ClaRNA annotations (Waleń et al. 2014)
 

## 2. What do we have here?

Plot the distribution of the MINDIST values and interpret the results.

In [None]:
# Your code here
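
One possible starting point is a plain histogram. The sketch below uses a handful of invented values in a hypothetical `mindist_toy` series; in the notebook you would pass `bps["MINDIST"]` instead, and the number of bins is an arbitrary choice that you should tune.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented values, only to make the sketch runnable on its own.
mindist_toy = pd.Series([2.7, 2.8, 2.8, 2.9, 3.0, 3.1, 3.3, 3.6], name="MINDIST")

fig, ax = plt.subplots()
# ax.hist returns the bin counts, the bin edges, and the drawn patches.
counts, edges, _ = ax.hist(mindist_toy, bins=5)
ax.set_xlabel("MINDIST")
ax.set_ylabel("Number of pairs")
```

`mindist_toy.describe()` gives a quick numeric summary to read alongside the figure.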

What are the most frequent BASE1-BASE2 pairs? 

In [None]:
# Your code here
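
Counting combinations of two columns can be sketched with `groupby(...).size()`. The frame `toy` below is invented; note that this approach counts G-C and C-G as different pairs, and whether to merge them is a decision for your analysis.

```python
import pandas as pd

# Invented rows, standing in for the BASE1 and BASE2 columns of bps.
toy = pd.DataFrame({
    "BASE1": ["G", "A", "G", "G", "A", "C"],
    "BASE2": ["C", "U", "C", "U", "U", "G"],
})

# One count per (BASE1, BASE2) combination, most frequent first.
pair_counts = toy.groupby(["BASE1", "BASE2"]).size().sort_values(ascending=False)
print(pair_counts)
```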

What are the most frequent CLOSESTATOM1-CLOSESTATOM2 pairs? Do the most frequent atom pairs differ among various BASE1-BASE2 pairs?

In [None]:
# Your code here
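
To see whether the preferred atom pair depends on the base pair, a contingency table is one option. This is a sketch on invented rows; `BASEPAIR` and `ATOMPAIR` are hypothetical helper columns, not part of the dataset.

```python
import pandas as pd

# Invented rows; the atom names are only illustrative.
toy = pd.DataFrame({
    "BASE1": ["G", "G", "G", "A", "A"],
    "BASE2": ["C", "C", "C", "U", "U"],
    "CLOSESTATOM1": ["N1", "N1", "O6", "N1", "N1"],
    "CLOSESTATOM2": ["N3", "N3", "N4", "N3", "N3"],
})

# Build one label per row for the base pair and for the atom pair.
toy["BASEPAIR"] = toy["BASE1"] + "-" + toy["BASE2"]
toy["ATOMPAIR"] = toy["CLOSESTATOM1"] + "-" + toy["CLOSESTATOM2"]

# Rows: base pairs; columns: atom pairs; cells: number of occurrences.
table = pd.crosstab(toy["BASEPAIR"], toy["ATOMPAIR"])

# Most frequent atom pair within each base pair.
top_atom_pair = table.idxmax(axis=1)
print(table)
print(top_atom_pair)
```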

## 3. Comparison of the annotations

Compare how often the annotation tools agree or disagree with each other. Treat every non-empty value in an **X_BP** column as an annotated base pair. Make a figure of any type you choose.

In [None]:
# Your code here
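
A minimal sketch of one way to quantify agreement: convert the annotation columns to a boolean "annotated or not" table, then count how often each pair of tools annotates the same row. It assumes that empty cells were read as NaN; the `"x"` values are placeholders rather than real tool output, and only three of the tool columns are shown.

```python
import numpy as np
import pandas as pd

tools = ["DSSR_BP", "FR3D_BP", "RNAVIEW_BP"]

# Placeholder annotations: "x" means "some annotation", NaN means "empty".
toy = pd.DataFrame({
    "DSSR_BP": ["x", "x", np.nan, "x"],
    "FR3D_BP": ["x", np.nan, np.nan, "x"],
    "RNAVIEW_BP": ["x", "x", "x", np.nan],
})

# True where a tool reports any annotation for the row.
annotated = toy[tools].notna()

# Number of tools that annotate each row.
n_tools = annotated.sum(axis=1)

# Tool-by-tool matrix: cell (a, b) counts rows annotated by both a and b.
as_int = annotated.astype(int)
co_counts = as_int.T.dot(as_int)
print(co_counts)
```

The diagonal of `co_counts` holds each tool's total number of annotations, so dividing by it gives pairwise agreement fractions; a matrix like this can be drawn as a heatmap, for example with `plt.imshow`.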

## 4. Data cleaning

To compare the annotations separately for each of the twelve Leontis-Westhof base pair types (LW types), we first need to unify the annotation format. Create **X_BP_CLEAN** columns with unified LW type annotations, e.g., tWW, cHS, etc. Not every annotation corresponds to a defined LW type, so you may ignore some of the "uncertain" annotations. Describe in detail how you produced the **X_BP_CLEAN** columns.

In [None]:
# Your code here
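
The general technique can be sketched with a regular expression and `Series.str.extract`. The raw strings below are hypothetical and do not reproduce the output format of any of the tools; each real **X_BP** column will likely need its own parsing rule, which you should work out by inspecting its unique values.

```python
import pandas as pd

# Hypothetical raw annotations; real tool output will look different.
raw = pd.Series(["cWW", "pair: tHS", "ncWW", None], name="TOOL_BP")

# cis/trans letter followed by two edge letters (W, H, or S).
# The \b word boundaries keep only standalone tokens, so "ncWW" stays unmatched.
lw_pattern = r"\b([ct][WHS][WHS])\b"

# Rows without a match (or with missing input) become NaN.
clean = raw.str.extract(lw_pattern, expand=False)
print(clean)
```

This pattern accepts both orders of the two edges (for example tWH and tHW); whether to keep or merge them is a choice to state in your description.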

## 5. Comparison of the LW type annotations

Separately for each LW type, compare how often the annotation tools agree/disagree with each other. Do we observe any differences between the LW types? Make a figure of any chosen type.

In [None]:
# Your code here

## 6. Examining LW types

Choose one LW type and analyze whether it is sufficient to describe a base pair. First, select the rows annotated with the chosen LW type by at least two tools. Then, group the ATOMPAIRS column values by BASE1-BASE2 pairs and see how many different patterns you observe. Can the ATOMPAIRS list always be unambiguously determined by the LW type and the BASE1-BASE2 pair?

In [None]:
# Your code here
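
The two steps above can be sketched as follows, assuming the cleaned columns from section 4 already exist. All values are invented, `p1` to `p4` are placeholders for real ATOMPAIRS strings, and only three tool columns are shown.

```python
import pandas as pd

clean_cols = ["DSSR_BP_CLEAN", "FR3D_BP_CLEAN", "RNAVIEW_BP_CLEAN"]

# Invented rows; "p1".."p4" stand in for ATOMPAIRS values.
toy = pd.DataFrame({
    "BASE1": ["G", "G", "G", "A"],
    "BASE2": ["C", "C", "U", "U"],
    "ATOMPAIRS": ["p1", "p2", "p3", "p4"],
    "DSSR_BP_CLEAN": ["cWW", "cWW", "cWW", "tHS"],
    "FR3D_BP_CLEAN": ["cWW", "cWW", None, "tHS"],
    "RNAVIEW_BP_CLEAN": ["cWW", None, None, "cWW"],
})

lw_type = "cWW"

# Number of tools that assign the chosen LW type to each row.
votes = (toy[clean_cols] == lw_type).sum(axis=1)
selected = toy[votes >= 2]

# Number of distinct ATOMPAIRS patterns per BASE1-BASE2 pair.
patterns = selected.groupby(["BASE1", "BASE2"])["ATOMPAIRS"].nunique()
print(patterns)
```

A count above 1 for some base pair means that the LW type and the bases alone do not pin down a single ATOMPAIRS list.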

## 7. Modified residues

Does the dataset include any modified residues? If so, which ones? What can we say about how the automatic tools annotate base pairs that involve modified residues?

In [None]:
# Your code here
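
A minimal sketch, assuming that modified residues appear in BASE1 or BASE2 under names other than the four standard bases A, C, G, and U. That encoding is an assumption to verify against the data, and the rows below are invented.

```python
import pandas as pd

standard = ["A", "C", "G", "U"]

# Invented rows; PSU and 5MC are used here only as example non-standard names.
toy = pd.DataFrame({
    "BASE1": ["G", "PSU", "A", "5MC"],
    "BASE2": ["C", "A", "U", "G"],
})

# A row is flagged if either base is not one of the standard four.
is_modified = ~toy["BASE1"].isin(standard) | ~toy["BASE2"].isin(standard)
modified_rows = toy[is_modified]

# How often each non-standard name occurs across both columns.
all_bases = pd.concat([toy["BASE1"], toy["BASE2"]])
modified_counts = all_bases[~all_bases.isin(standard)].value_counts()
print(modified_counts)
```

Comparing the annotation columns between `modified_rows` and the remaining rows is one way to approach the last question.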