<a href="https://colab.research.google.com/github/chunribu/biotable/blob/main/src/pandas_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

```
Author: @chunribu[GitHub]
```

Welcome🖐!

This section focuses on [pandas](https://pandas.pydata.org/), a powerful tool for **data analysis and manipulation**.

Pandas is built on top of the [Python](https://www.python.org/) programming language, and this tutorial assumes you have learned the fundamentals of Python, including the items below, which will be used later.
+ [String operations](https://docs.python.org/3.10/library/stdtypes.html#string-methods)
+ [List operations](https://docs.python.org/3/tutorial/datastructures.html#more-on-lists)
+ [Comprehensions](https://www.geeksforgeeks.org/comprehensions-in-python/)
+ [Lambda function](https://www.tutorialsteacher.com/python/python-lambda-function)
+ etc.



The goal of this tutorial is to be comprehensible and practical, especially for **biodata processing**, such as working with data from NCBI databases.

First of all, there are several high-level concepts you need to know before diving into details.

+ **Method Chaining**: Method chaining is a programming style in which several method calls are invoked in sequence, each call acting on the object returned by the previous one. In pandas, most operations can be chained because each operation returns a new object instead of modifying the source object in place. To save memory, the returned object may share data with the source; an independent deep copy is only made when you ask for one explicitly, for example with `.copy()`.
+ **Series**: One-dimensional ndarray with axis labels (including time series). You can think of it as a vector or a row (or column) of a table.
+ **DataFrame**: Two-dimensional, size-mutable, potentially heterogeneous tabular data. It consists of zero or more Series; think of it as an ordinary table.
+ **GroupBy**: GroupBy objects are returned by groupby calls: `pandas.DataFrame.groupby()`, `pandas.Series.groupby()`, etc. A groupby operation splits the data into groups based on some criteria. Pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names.
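
To make these four concepts concrete, here is a minimal sketch on a small made-up table. The variable names (`toy`, `taxids`, `counts`) and the data are hypothetical and only serve as an illustration; they are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical toy data, loosely modelled on organism/TaxID records.
toy = pd.DataFrame({
    "organism": ["Arabidopsis thaliana", "Oryza sativa", "Arabidopsis thaliana"],
    "taxid": [3702, 39947, 3702],
})

# Selecting a single column of a DataFrame gives a Series.
taxids = toy["taxid"]

# Method chaining: each call returns a new object, so calls can be stacked.
counts = (
    toy
    .rename(columns={"taxid": "TaxID"})  # new DataFrame, `toy` is untouched
    .groupby("organism")                 # GroupBy object keyed by organism
    .size()                              # Series: number of rows per group
)

print(counts)
print(list(toy.columns))  # the original column names are unchanged
```

Wrapping the chain in parentheses lets each step sit on its own line, which keeps longer chains readable.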

# Input/output
Pandas supports many input/output formats:
+ Pickling
+ Flat file
+ Clipboard
+ Excel
+ JSON
+ HTML
+ XML
+ LaTeX
+ HDFStore: PyTables (HDF5)
+ Feather
+ Parquet
+ ORC
+ SAS
+ SPSS
+ SQL
+ Google BigQuery
+ STATA

Let's take one "flat file" format, `csv` (comma-separated values), as an example. `refseq-genbank.csv` comes from NCBI BioProject and contains 4 columns:
`Refseq accn`, `Genbank accn`, `Organism name` and `TaxID`.

In [1]:
!wget https://ftp.ncbi.nlm.nih.gov/bioproject/refseq-genbank.csv
!head refseq-genbank.csv

--2021-12-25 16:00:40--  https://ftp.ncbi.nlm.nih.gov/bioproject/refseq-genbank.csv
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 165.112.9.230, 130.14.250.7, 2607:f220:41e:250::13, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|165.112.9.230|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 93506 (91K) [text/csv]
Saving to: ‘refseq-genbank.csv’


2021-12-25 16:00:41 (3.51 MB/s) - ‘refseq-genbank.csv’ saved [93506/93506]

Refseq accn,Genbank accn,Organism name,TaxID
PRJNA116,PRJNA10719,Arabidopsis thaliana,3702
PRJNA116,PRJNA11796,Arabidopsis thaliana,3702
PRJNA116,PRJNA13191,Arabidopsis thaliana,3702
PRJNA122,PRJNA12269,Oryza sativa Japonica Group,39947
PRJNA122,PRJDB1747,Oryza sativa Japonica Group,39947
PRJNA127,PRJNA13836,Schizosaccharomyces pombe 972h-,284812
PRJNA127,PRJNA20755,Schizosaccharomyces pombe,4896
PRJNA128,PRJNA43747,Saccharomyces cerevisiae S288c,559292
PRJNA132,PRJNA13841,Neurospora crassa OR74A,367110


In [2]:
import pandas as pd
df = pd.read_csv('refseq-genbank.csv')

The function `read_csv` is designed for loading a plain text file with a specific separator into a `DataFrame`. The `sep` parameter defaults to `,` (comma); you may need to pass `sep='\t'` to load a tab-separated plain text file, also known as `tsv`.
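
As a small sketch of the `sep` parameter, the snippet below parses the same record once as comma-separated and once as tab-separated text. The in-memory strings (`csv_text`, `tsv_text`) are made up for illustration, and `io.StringIO` stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical two-column samples held in memory instead of on disk.
csv_text = "Refseq accn,TaxID\nPRJNA116,3702\n"
tsv_text = "Refseq accn\tTaxID\nPRJNA116\t3702\n"

csv_df = pd.read_csv(io.StringIO(csv_text))            # default sep=','
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")  # tab-separated

# Both calls produce the same table.
print(csv_df.equals(tsv_df))
```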

Notably, pandas infers the compression type from the file suffix, so a compressed `refseq-genbank.csv.gz` can be loaded directly. The same goes for output when using `to_csv`.
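
The suffix-based compression can be checked with a quick round trip. This is only a sketch: the file name `small.csv.gz` and the two-row table are hypothetical, and a temporary directory is used so nothing is left behind.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row table used only for this round trip.
small = pd.DataFrame({
    "Refseq accn": ["PRJNA116", "PRJNA122"],
    "TaxID": [3702, 39947],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "small.csv.gz")
    small.to_csv(path, index=False)  # gzip is inferred from the '.gz' suffix
    with open(path, "rb") as fh:
        magic = fh.read(2)           # gzip files start with bytes 0x1f 0x8b
    restored = pd.read_csv(path)     # decompression is inferred the same way

print(magic == b"\x1f\x8b", restored.equals(small))
```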

In [3]:
df

,Refseq accn,Genbank accn,Organism name,TaxID
0,PRJNA116,PRJNA10719,Arabidopsis thaliana,3702
1,PRJNA116,PRJNA11796,Arabidopsis thaliana,3702
2,PRJNA116,PRJNA13191,Arabidopsis thaliana,3702
3,PRJNA122,PRJNA12269,Oryza sativa Japonica Group,39947
4,PRJNA122,PRJDB1747,Oryza sativa Japonica Group,39947
...,...,...,...,...
1714,PRJNA756971,PRJNA682572,Prionailurus bengalensis,37029
1715,PRJNA758027,PRJDB3949,Aspergillus udagawae,91492
1716,PRJNA758049,PRJDB7449,Aspergillus pseudoviridinutans,1517512
1717,PRJNA759178,PRJNA597580,Colletes gigas,935657


When exporting data, add a `.gz` suffix to save storage. The `index` parameter defaults to `True`; set it to `False` if you don't need the index written to the file.
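
To see what the `index` parameter changes, the sketch below writes a made-up one-column table to a string (calling `to_csv` without a path returns the CSV text) and compares the header lines. The names `with_index`, `without_index` and `reloaded` are illustrative only.

```python
import io

import pandas as pd

# Hypothetical one-column table.
small = pd.DataFrame({"TaxID": [3702, 39947]})

with_index = small.to_csv()                # index=True by default
without_index = small.to_csv(index=False)  # row labels are left out

print(with_index.splitlines()[0])     # header starts with an empty index field
print(without_index.splitlines()[0])

# Reading the indexed version back yields an extra, unnamed first column.
reloaded = pd.read_csv(io.StringIO(with_index))
print(list(reloaded.columns))
```

This is why a file saved with the default `index=True` tends to grow an `Unnamed: 0` column each time it is read and saved again.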

In [4]:
df.to_csv('refseq-genbank.csv.gz', index=False)
!ls -lh refseq*

-rw-r--r-- 1 root root 92K Sep  9 08:30 refseq-genbank.csv
-rw-r--r-- 1 root root 38K Dec 25 16:00 refseq-genbank.csv.gz


# Must-know Usage

## Slice a DataFrame

Slicing is one of the most frequently used operations, and it is handled by a pair of indexers, `loc` and `iloc`. They are basically the same, except that `loc` accepts index/column **name(s)**, while `iloc` accepts **number(s)**, that is, integer positions.
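
A minimal sketch of the difference, using a made-up frame with hypothetical row labels `a`, `b`, `c`. One detail worth knowing: a position slice in `iloc` excludes its end, while a label slice in `loc` includes both endpoints.

```python
import pandas as pd

# Hypothetical toy frame with explicit row labels.
toy = pd.DataFrame(
    {
        "Refseq accn": ["PRJNA116", "PRJNA122", "PRJNA127"],
        "TaxID": [3702, 39947, 284812],
    },
    index=["a", "b", "c"],
)

by_position = toy.iloc[0:2, 0:1]               # positions 0 and 1; end excluded
by_label = toy.loc["a":"b", ["Refseq accn"]]   # labels 'a' through 'b'; end included

print(by_position.equals(by_label))
```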

Let's select part of the `df` we built earlier.

Because the default index labels are identical to the row numbers, let's first set the `Genbank accn` column as the index so that the difference between names and numbers is visible.

*Tip: remember to assign the result back to the original variable; by default, pandas returns a new object and leaves the original unchanged.*
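
The tip can be verified on a made-up two-row frame (the names `toy` and `indexed` are hypothetical): `set_index` returns a new DataFrame, and the frame it was called on keeps its columns.

```python
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame({
    "Genbank accn": ["PRJNA10719", "PRJNA11796"],
    "TaxID": [3702, 3702],
})

indexed = toy.set_index("Genbank accn")  # the result must be captured

print(list(toy.columns))      # the original still has both columns
print(indexed.index.name)     # the new frame uses 'Genbank accn' as its index
print(list(indexed.columns))
```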

In [5]:
df = df.set_index('Genbank accn')
df

Genbank accn,Refseq accn,Organism name,TaxID
PRJNA10719,PRJNA116,Arabidopsis thaliana,3702
PRJNA11796,PRJNA116,Arabidopsis thaliana,3702
PRJNA13191,PRJNA116,Arabidopsis thaliana,3702
PRJNA12269,PRJNA122,Oryza sativa Japonica Group,39947
PRJDB1747,PRJNA122,Oryza sativa Japonica Group,39947
...,...,...,...
PRJNA682572,PRJNA756971,Prionailurus bengalensis,37029
PRJDB3949,PRJNA758027,Aspergillus udagawae,91492
PRJDB7449,PRJNA758049,Aspergillus pseudoviridinutans,1517512
PRJNA597580,PRJNA759178,Colletes gigas,935657


Now make a slice of the first two rows and the first two columns. 

In [6]:
df.iloc[0:2, 0:2]  # which is the same as:
# df.iloc[[0, 1], [0, 1]]

Genbank accn,Refseq accn,Organism name
PRJNA10719,PRJNA116,Arabidopsis thaliana
PRJNA11796,PRJNA116,Arabidopsis thaliana


In [7]:
df.loc[['PRJNA10719', 'PRJNA11796'], ['Refseq accn', 'Organism name']]

Genbank accn,Refseq accn,Organism name
PRJNA10719,PRJNA116,Arabidopsis thaliana
PRJNA11796,PRJNA116,Arabidopsis thaliana


Use `:` to select all rows or all columns; the column selector can also be left out entirely to keep every column.

In [8]:
df.loc[['PRJNA10719', 'PRJNA11796']]  # which is the same as:
# df.loc[['PRJNA10719', 'PRJNA11796'], :]

Genbank accn,Refseq accn,Organism name,TaxID
PRJNA10719,PRJNA116,Arabidopsis thaliana,3702
PRJNA11796,PRJNA116,Arabidopsis thaliana,3702


When you pass in a list, the output is a sliced DataFrame; when you pass in a single label, you get a Series.

In [9]:
df.loc[:, 'TaxID']

Genbank accn
PRJNA10719        3702
PRJNA11796        3702
PRJNA13191        3702
PRJNA12269       39947
PRJDB1747        39947
                ...   
PRJNA682572      37029
PRJDB3949        91492
PRJDB7449      1517512
PRJNA597580     935657
PRJNA736740      32260
Name: TaxID, Length: 1719, dtype: int64
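
The DataFrame-versus-Series rule can be checked directly. The sketch below uses a made-up two-row frame (hypothetical name `toy`) indexed the same way as `df`.

```python
import pandas as pd

# Hypothetical toy frame indexed by GenBank accession.
toy = pd.DataFrame(
    {"Refseq accn": ["PRJNA116", "PRJNA116"], "TaxID": [3702, 3702]},
    index=["PRJNA10719", "PRJNA11796"],
)

as_series = toy.loc[:, "TaxID"]   # single label -> Series
as_frame = toy.loc[:, ["TaxID"]]  # one-element list -> DataFrame
one_row = toy.loc["PRJNA10719"]   # a single row label also gives a Series

print(type(as_series).__name__, type(as_frame).__name__, type(one_row).__name__)
```

Wrapping a single label in a list is a handy way to keep the result two-dimensional when later steps expect a DataFrame.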

## 