# Data Manipulation and Analyses in Jupyter

## Introduction

You can carry out complex biological data manipulation and analyses in Jupyter using the pandas Python package (or by switching kernels to R).

We will look at pandas here, which provides R-like functions for data manipulation and analyses. Pandas is built on top of NumPy.
Most importantly, it offers an R-like `DataFrame` object: a two-dimensional table with explicit row and column names whose columns can hold different types of data, as well as missing values, which plain NumPy arrays do not handle conveniently.
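
To make the contrast with NumPy concrete, here is a minimal sketch. The column names and values are made up for illustration and are not part of the dataset used later.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype, so mixing text and numbers
# coerces every element to a string.
arr = np.array([["Fox", 4.5], ["Wolf", 40.0]])
print(arr.dtype.kind)  # 'U' means unicode strings

# A DataFrame keeps a separate dtype per column, carries row and
# column labels, and can store a missing value (NaN).
df = pd.DataFrame(
    {"name": ["Fox", "Wolf", "Rabbit"], "value": [4.5, 40.0, np.nan]},
    index=["a", "b", "c"],
)
print(df.dtypes)
print(df["value"].isna().sum())  # number of missing entries in 'value'
```

The `value` column stays numeric even though the `name` column holds text, which is the property the paragraph above describes.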

Pandas also implements a number of powerful data operations for filtering, grouping and reshaping data, similar to those available in R or spreadsheet programs.
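
As a rough sketch of what filtering, grouping and reshaping look like in practice, the example below uses a small hypothetical table. The column names `group`, `sex` and `value` and all the numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for this example.
df = pd.DataFrame({
    "group": ["A", "A", "B", "B"],
    "sex": ["m", "f", "m", "f"],
    "value": [1.0, 3.0, 0.4, 0.6],
})

# Filtering: keep only rows that satisfy a boolean condition.
large = df[df["value"] > 0.5]

# Grouping: mean of 'value' within each group.
group_means = df.groupby("group")["value"].mean()

# Reshaping: one row per group, one column per sex.
wide = df.pivot(index="group", columns="sex", values="value")

print(large)
print(group_means)
print(wide)
```

These three operations correspond roughly to row subsetting, `aggregate`-style summaries and long-to-wide reshaping in R.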

Assuming pandas is installed, you can import it and check the version:

In [1]:
import pandas as pd
pd.__version__

'0.21.0'

In [2]:
import scipy as sc

Just as we generally import NumPy under the alias np, we will import pandas under the alias pd.

## Dataframes

The dataframe is the main data object in pandas.

### Importing data

Dataframes can be created from multiple sources, such as CSV files, Excel files and JSON.

In [6]:
MyDF = pd.read_csv('../../../silbiocompmasterepo/Data/testcsv.csv', sep = ',')
MyDF

| | Species | Infraorder | Family | Distribution | Body mass male (Kg) |
|---|---|---|---|---|---|
| 0 | Daubentonia_madagascariensis | Chiromyiformes | Daubentoniidae | Madagascar | 2.7 |
| 1 | Allocebus_trichotis | Lemuriformes | Cheirogaleidae | Madagascar | 0.1 |
| 2 | Avahi_laniger | Lemuriformes | Indridae | America | 1.03 |
| 3 | Avahi_occidentalis | Lemuriformes | Indridae | Madagascar | 0.814 |
| 4 | Avahi_unicolor | Lemuriformes | Indridae | America | 0.83 |
| 5 | Cheirogaleus_adipicaudatus | Lemuriformes | Cheirogaleidae | Madagascar | 0.2 |
| 6 | Cheirogaleus_crossleyi | Lemuriformes | Cheirogaleidae | Madagascar | 0.4 |
| 7 | Cheirogaleus_major | Lemuriformes | Cheirogaleidae | Madagascar | 0.45 |
| 8 | Cheirogaleus_medius | Lemuriformes | Cheirogaleidae | Madagascar | 0.217 |
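
If you do not have the file above to hand, the same call can be tried on text held in memory. This is only a sketch: the CSV content and the column names `id`, `name` and `value` are hypothetical, and `io.StringIO` simply stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV text, invented for this example.
# The last field of the second data row is left empty on purpose.
csv_text = "id,name,value\n1,alpha,0.5\n2,beta,\n"

# sep sets the field delimiter; index_col uses the 'id' column as row labels.
df = pd.read_csv(io.StringIO(csv_text), sep=",", index_col="id")

print(df)
print(df.dtypes)
```

The empty field is read as a missing value (`NaN`), so the `value` column is still numeric.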


### Creating dataframes

You can also create dataframes from a Python dictionary, where each key becomes a column name:

In [7]:
MyDF = pd.DataFrame({
    'col1' : ['Var1', 'Var2', 'Var3', 'Var4'],
    'col2' : ['Grass', 'Rabbit', 'Fox', 'Wolf'],
    'col3' : [1, 2, float('nan'), 4]  # NaN marks a missing value
})

MyDF

| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | Var1 | Grass | 1.0 |
| 1 | Var2 | Rabbit | 2.0 |
| 2 | Var3 | Fox | NaN |
| 3 | Var4 | Wolf | 4.0 |


### Examining your data

In [9]:
MyDF.head()  # accepts an optional int parameter - number of rows to show (default 5).

| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | Var1 | Grass | 1.0 |
| 1 | Var2 | Rabbit | 2.0 |
| 2 | Var3 | Fox | NaN |
| 3 | Var4 | Wolf | 4.0 |


In [10]:
MyDF.tail()  # like head(), but shows the last rows (default 5).

| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | Var1 | Grass | 1.0 |
| 1 | Var2 | Rabbit | 2.0 |
| 2 | Var3 | Fox | NaN |
| 3 | Var4 | Wolf | 4.0 |
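
With only four rows, `head()` and `tail()` return the same thing. A slightly longer, hypothetical dataframe (a single column `x` invented for this sketch) shows the difference and the optional row-count argument:

```python
import pandas as pd

# Hypothetical ten-row dataframe, invented for this example.
df = pd.DataFrame({"x": range(10)})

first_two = df.head(2)   # first two rows (index 0 and 1)
last_three = df.tail(3)  # last three rows (index 7, 8 and 9)

print(first_two)
print(last_three)
```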


In [12]:
MyDF.shape

(4, 3)

In [15]:
len(MyDF)  # number of rows

4
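
It is easy to mix up rows and columns here, so the following sketch spells out how `shape` and `len()` relate. The dataframe is a hypothetical one built only for this example.

```python
import pandas as pd

# Hypothetical dataframe with 4 rows and 3 columns.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": ["w", "x", "y", "z"],
    "c": [0.1, 0.2, 0.3, 0.4],
})

n_rows, n_cols = df.shape    # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(len(df))               # len() of a dataframe counts rows
print(len(df.columns))       # len() of the column index counts columns
```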

In [16]:
MyDF.columns  # An Index object holding the column names

Index(['col1', 'col2', 'col3'], dtype='object')

In [17]:
MyDF.dtypes  # Column names and their types

col1     object
col2     object
col3    float64
dtype: object

In [18]:
MyDF.values  # returns the data as a two-dimensional NumPy array

array([['Var1', 'Grass', 1.0],
       ['Var2', 'Rabbit', 2.0],
       ['Var3', 'Fox', nan],
       ['Var4', 'Wolf', 4.0]], dtype=object)

In [19]:
MyDF.describe()  # displays descriptive stats for the numeric columns (by default)

| | col3 |
|---|---|
| count | 3.0 |
| mean | 2.333333 |
| std | 1.527525 |
| min | 1.0 |
| 25% | 1.5 |
| 50% | 2.0 |
| 75% | 3.0 |
| max | 4.0 |
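
Only `col3` appears in the summary because `describe()` skips non-numeric columns unless told otherwise. A minimal sketch of the `include` argument, using a rebuilt two-column version of the example dataframe:

```python
import pandas as pd

# Two of the columns from the example above, rebuilt so this block runs on its own.
df = pd.DataFrame({
    "col2": ["Grass", "Rabbit", "Fox", "Wolf"],
    "col3": [1, 2, float("nan"), 4],
})

numeric_only = df.describe()             # numeric columns only
everything = df.describe(include="all")  # also summarises text columns

print(numeric_only)
print(everything)
```

For text columns the summary reports `count`, `unique`, `top` and `freq`; statistics that do not apply to a column are shown as `NaN`.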


OK, I am going to stop this brief intro to Jupyter with pandas here! I think you can already see the potential value of Jupyter for data analyses and visualization. As I mentioned above, you can also use R (e.g., using tidyr + ggplot2) for this.