# Bio 208: Lecture 06 -- Working with tabular data

## Pandas library

[Pandas](https://pandas.pydata.org/) is a widely used Python library for working with tabular data.

![Image from Pandas tutorial.](https://pandas.pydata.org/docs/_images/01_table_dataframe1.svg)

In [8]:
import pandas as pd

### Creating DataFrames

The core data structure in the Pandas library is called a "data frame" (`DataFrame`).

We can create a `DataFrame` by hand by passing a dictionary that maps column names to lists of values:

In [10]:
df = pd.DataFrame({
    "Name": ["ORF1ab", "S", "E", "M" ],
    "Start": [266,  21563, 26245, 26523],
    "Stop": [21555, 25384, 26472, 27191],
    "Product": ["ORF1ab polyprotein", "surface glycoprotein", "envelope protein", "membrane glycoprotein"]
})

In [11]:
df

|   | Name | Start | Stop | Product |
|---|------|-------|------|---------|
| 0 | ORF1ab | 266 | 21555 | ORF1ab polyprotein |
| 1 | S | 21563 | 25384 | surface glycoprotein |
| 2 | E | 26245 | 26472 | envelope protein |
| 3 | M | 26523 | 27191 | membrane glycoprotein |


### Reading a DataFrame from a file
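
Here is a minimal sketch of reading a table with `pd.read_csv`. To keep the example self-contained, an in-memory `io.StringIO` object stands in for a file on disk; the file name `genes.csv` mentioned in the comment is hypothetical.

```python
import io
import pandas as pd

# Contents of a hypothetical comma-separated file, e.g. "genes.csv"
csv_text = """Name,Start,Stop
ORF1ab,266,21555
S,21563,25384
E,26245,26472
M,26523,27191
"""

# read_csv accepts a path or a file-like object; StringIO plays the role of the file.
# With a real file you would write: genes = pd.read_csv("genes.csv")
genes = pd.read_csv(io.StringIO(csv_text))

# The first line of the file is used as the column names by default
print(genes.head())
```

By default `read_csv` assumes comma-separated values and treats the first line as the header.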

### What are the dimensions (number of rows and columns) of a DataFrame?
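
A short sketch using the hand-built table of genes from above: the `shape` attribute holds the dimensions as a `(rows, columns)` tuple, and `len` gives the number of rows.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["ORF1ab", "S", "E", "M"],
    "Start": [266, 21563, 26245, 26523],
    "Stop": [21555, 25384, 26472, 27191],
    "Product": ["ORF1ab polyprotein", "surface glycoprotein",
                "envelope protein", "membrane glycoprotein"],
})

# shape is an attribute (no parentheses), returned as (n_rows, n_columns)
n_rows, n_cols = df.shape
print(df.shape)   # (4, 4)

# len() of a DataFrame is its number of rows
print(len(df))    # 4
```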

### Getting a specific column from a DataFrame
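
As a sketch, again with the hand-built table of genes: indexing a `DataFrame` with a single column name in square brackets returns that column as a pandas `Series`.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["ORF1ab", "S", "E", "M"],
    "Start": [266, 21563, 26245, 26523],
    "Stop": [21555, 25384, 26472, 27191],
})

# A single column name in brackets gives back a Series
starts = df["Start"]
print(type(starts).__name__)   # Series

# A Series supports summary methods such as min() and max()
print(starts.min(), starts.max())   # 266 26523
```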

### Getting a subset of columns from a DataFrame
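
A sketch of selecting several columns at once: passing a list of column names inside the brackets (note the double square brackets) returns a new `DataFrame` with only those columns, in the order listed.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["ORF1ab", "S", "E", "M"],
    "Start": [266, 21563, 26245, 26523],
    "Stop": [21555, 25384, 26472, 27191],
    "Product": ["ORF1ab polyprotein", "surface glycoprotein",
                "envelope protein", "membrane glycoprotein"],
})

# The outer brackets index the DataFrame; the inner brackets form a Python list
subset = df[["Name", "Product"]]
print(subset.shape)   # (4, 2)
print(subset)
```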

### Getting specific rows from a DataFrame using slices
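
A sketch of row slicing with the hand-built table: a slice inside square brackets selects rows by position and, as with Python lists, excludes the end position.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["ORF1ab", "S", "E", "M"],
    "Start": [266, 21563, 26245, 26523],
    "Stop": [21555, 25384, 26472, 27191],
})

# Rows at positions 1 and 2 (position 3 is excluded)
middle = df[1:3]
print(middle["Name"].tolist())     # ['S', 'E']

# Omitting the start of the slice begins at the first row
first_two = df[:2]
print(first_two["Name"].tolist())  # ['ORF1ab', 'S']
```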

### Selecting a cross section of a DataFrame using `DataFrame.loc`
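
A sketch of label-based selection with `loc`, written as `loc[row_labels, column_labels]`. To make the row labels meaningful, this example assumes we first use the gene names as the index via `set_index`; that step is an illustrative choice, not something the table requires.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["ORF1ab", "S", "E", "M"],
    "Start": [266, 21563, 26245, 26523],
    "Stop": [21555, 25384, 26472, 27191],
})

# Use the gene names as row labels (illustrative choice)
genes = df.set_index("Name")

# A single cell: row label "S", column label "Start"
s_start = genes.loc["S", "Start"]
print(s_start)   # 21563

# Label slices with loc INCLUDE the end label, unlike ordinary Python slices
block = genes.loc["S":"M", ["Start", "Stop"]]
print(block.index.tolist())   # ['S', 'E', 'M']
```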

### Selecting cross sections of a DataFrame by integer position using `DataFrame.iloc`
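
A sketch of position-based selection with `iloc`, written as `iloc[row_positions, column_positions]`. Positions start at 0 and slices exclude the end position, just like Python lists.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["ORF1ab", "S", "E", "M"],
    "Start": [266, 21563, 26245, 26523],
    "Stop": [21555, 25384, 26472, 27191],
})

# A single cell: first row, first column
first_cell = df.iloc[0, 0]
print(first_cell)   # ORF1ab

# Rows at positions 1 and 2, columns at positions 0 and 1
block = df.iloc[1:3, 0:2]
print(block["Name"].tolist())   # ['S', 'E']
print(list(block.columns))      # ['Name', 'Start']

# Negative positions count from the end; a single row comes back as a Series
last_row = df.iloc[-1]
print(last_row["Name"])         # M
```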

### Subsetting the rows of a DataFrame by Boolean indexing
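
A sketch of Boolean indexing: comparing a column to a value produces a `Series` of `True`/`False` values (a "mask"), and indexing the `DataFrame` with that mask keeps only the rows where it is `True`. The cutoff of 25000 is an arbitrary value chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["ORF1ab", "S", "E", "M"],
    "Start": [266, 21563, 26245, 26523],
    "Stop": [21555, 25384, 26472, 27191],
})

# The comparison is applied to every row, giving one True/False per row
mask = df["Start"] > 25000
print(mask.tolist())   # [False, False, True, True]

# Keep only the rows where the mask is True
late_genes = df[mask]
print(late_genes["Name"].tolist())   # ['E', 'M']
```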

### More complex subsetting using Boolean operators
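
A sketch of combining conditions. With pandas masks we use `&` (and), `|` (or), and `~` (not) rather than the Python keywords `and`, `or`, and `not`, and each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` or `<`. The cutoffs are arbitrary values chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["ORF1ab", "S", "E", "M"],
    "Start": [266, 21563, 26245, 26523],
    "Stop": [21555, 25384, 26472, 27191],
})

# AND: both conditions must hold
both = df[(df["Start"] > 20000) & (df["Stop"] < 27000)]
print(both["Name"].tolist())     # ['S', 'E']

# OR: at least one condition must hold
either = df[(df["Name"] == "E") | (df["Stop"] - df["Start"] > 10000)]
print(either["Name"].tolist())   # ['ORF1ab', 'E']

# NOT: invert a mask
not_s = df[~(df["Name"] == "S")]
print(not_s["Name"].tolist())    # ['ORF1ab', 'E', 'M']
```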

## Working with a table of features from the Saccharomyces Genome Database (SGD)

The file [`SGD_features.tsv`](https://github.com/bio208fs-class/bio208fs-lecture/raw/master/data/SGD_features.tsv) is a tab-delimited file I downloaded from SGD that summarizes key information about features in the budding yeast genome. The original file can be found here: http://sgd-archive.yeastgenome.org/curation/chromosomal_feature/

Here's a short summary of the contents of this file, from the "SGD_features.README" document:

```
1. Information on current chromosomal features in SGD, including Dubious ORFs. 
Also contains coordinates of intron, exons, and other subfeatures that are located within a chromosomal feature.

2. The relationship between subfeatures and the feature in which they
are located is identified by the feature name in column #7 (parent
feature). For example, the parent feature of the intron found in
ACT1/YFL039C will be YFL039C. The parent feature of YFL039C is
chromosome 6.

3. The coordinates of all features are in chromosomal coordinates.

Columns within SGD_features.tab:

1.   Primary SGDID (mandatory)
2.   Feature type (mandatory)
3.   Feature qualifier (optional)
4.   Feature name (optional)
5.   Standard gene name (optional)
6.   Alias (optional, multiples separated by |)
7.   Parent feature name (optional)
8.   Secondary SGDID (optional, multiples separated by |)
9.   Chromosome (optional)
10.  Start_coordinate (optional)
11.  Stop_coordinate (optional)
12.  Strand (optional)
13.  Genetic position (optional)
14.  Coordinate version (optional)
15.  Sequence version (optional)
16.  Description (optional)

Note that "chromosome 17" is the mitochondrial chromosome.
```


Download [`SGD_features.tsv`](https://github.com/bio208fs-class/bio208fs-lecture/raw/master/data/SGD_features.tsv) to your computer and then load it using the `read_csv` function, specifying the delimiter argument as a tab:
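
A minimal sketch of that call. So that it runs without the download, a few placeholder rows in an in-memory `io.StringIO` object stand in for the file; the rows are made up and are not real SGD records. The sketch also assumes the file has a header row with column names like those in the README. If your copy has no header, `read_csv` also accepts `header=None` together with a `names=` list.

```python
import io
import pandas as pd

# Placeholder rows, NOT real SGD records; they only mimic the tab-delimited layout
tsv_text = (
    "SGDID\tFeature type\tFeature name\tChromosome\n"
    "ID_0001\tORF\tGENE_A\t1\n"
    "ID_0002\tintron\tGENE_A_intron\t1\n"
)

# With the downloaded file you would pass its path instead, e.g.
#   features = pd.read_csv("SGD_features.tsv", delimiter="\t")
features = pd.read_csv(io.StringIO(tsv_text), delimiter="\t")

print(features.shape)   # (2, 4)
print(features.head())
```

The `delimiter` argument is an alias for `sep`, so `sep="\t"` behaves the same way.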

### How many genome features are there in the yeast genome?
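
One way to approach this question, sketched on a tiny placeholder table rather than the real file: each row is one feature or subfeature, so the number of rows is the total count, and `value_counts` breaks that total down by feature type. The column name `Feature type` follows the README listing, but whether your loaded table uses exactly that name is an assumption, and the rows below are made up.

```python
import pandas as pd

# Placeholder table, NOT real SGD data
features = pd.DataFrame({
    "Feature type": ["ORF", "ORF", "intron", "ORF"],
    "Feature name": ["GENE_A", "GENE_B", "GENE_B_intron", "GENE_C"],
})

# Total number of rows (features plus subfeatures)
n_features = len(features)
print(n_features)   # 4

# Number of rows for each feature type
by_type = features["Feature type"].value_counts()
print(by_type)
```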

In [None]:
### How  