We will work with the Puromycin data set (available in R) in this exercise.

```
Reaction Velocity of an Enzymatic Reaction

Description:

     The ‘Puromycin’ data frame has 23 rows and 3 columns of the
     reaction velocity versus substrate concentration in an enzymatic
     reaction involving untreated cells or cells treated with
     Puromycin.

Usage:

     Puromycin
     
Format:

     This data frame contains the following columns:

     ‘conc’ a numeric vector of substrate concentrations (ppm)

     ‘rate’ a numeric vector of instantaneous reaction rates
          (counts/min/min)

     ‘state’ a factor with levels ‘treated’ ‘untreated’

Details:

     Data on the velocity of an enzymatic reaction were obtained by
     Treloar (1974).  The number of counts per minute of radioactive
     product from the reaction was measured as a function of substrate
     concentration in parts per million (ppm) and from these counts the
     initial rate (or velocity) of the reaction was calculated
     (counts/min/min).  The experiment was conducted once with the
     enzyme treated with Puromycin, and once with the enzyme untreated.

Source:

     Bates, D.M. and Watts, D.G. (1988), _Nonlinear Regression Analysis
     and Its Applications_, Wiley, Appendix A1.3.

     Treloar, M. A. (1974), _Effects of Puromycin on
     Galactosyltransferase in Golgi Membranes_, M.Sc. Thesis, U. of
     Toronto.
```

## Load the Puromycin data set into a Python DataFrame

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

%matplotlib inline
%load_ext rpy2.ipython

In [2]:
datPuro = %R Puromycin
datPuro.head()

Unnamed: 0,conc,rate,state
1,0.02,76.0,treated
2,0.02,47.0,treated
3,0.06,97.0,treated
4,0.06,107.0,treated
5,0.11,123.0,treated
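
If R or rpy2 is not available, the same table can be rebuilt by hand. The sketch below is a stand-in rather than what this notebook ran: it types in the 23 observations that appear in the outputs of this notebook and assumes a 1-based index to mimic the R row names.

```python
import pandas as pd

# Substrate concentrations (ppm). Each level is measured twice per state,
# except that the untreated series has a single run at 1.10.
conc_levels = [0.02, 0.06, 0.11, 0.22, 0.56, 1.10]
treated_rate = [76, 47, 97, 107, 123, 139, 159, 152, 191, 201, 207, 200]
untreated_rate = [67, 51, 84, 86, 98, 115, 131, 124, 144, 158, 160]

conc_treated = [c for c in conc_levels for _ in range(2)]  # 12 values
conc_untreated = conc_treated[:-1]                          # 11 values

datPuro = pd.DataFrame(
    {
        "conc": conc_treated + conc_untreated,
        "rate": [float(r) for r in treated_rate + untreated_rate],
        "state": ["treated"] * 12 + ["untreated"] * 11,
    },
    index=range(1, 24),  # 1-based labels, like the R row names
)
print(datPuro.shape)
```

With this frame in place, the remaining cells should run without the `%R` magic.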


## How many rows and columns are there?

In [3]:
datPuro.shape

(23, 3)

There are 23 rows and 3 columns.

## What is the type of each column?

In [4]:
type(datPuro["conc"])

pandas.core.series.Series

In [5]:
print(datPuro["conc"].dtype)
print(datPuro["rate"].dtype)
print(datPuro["state"].dtype)

float64
float64
object


In [6]:
datPuro.apply(lambda x:type(x))

conc     <class 'pandas.core.series.Series'>
rate     <class 'pandas.core.series.Series'>
state    <class 'pandas.core.series.Series'>
dtype: object

In [8]:
# The dtypes attribute reports the dtype of every column at once
datPuro.dtypes

conc     float64
rate     float64
state     object
dtype: object

## Show all unique values for the state column

Reference
- https://stackoverflow.com/questions/30621887/return-a-column-vector-in-pandas-apply-with-different-length


In [8]:
print(pd.unique(datPuro["conc"]))
print(pd.unique(datPuro["rate"]))
print(pd.unique(datPuro["state"]))

[ 0.02  0.06  0.11  0.22  0.56  1.1 ]
[  76.   47.   97.  107.  123.  139.  159.  152.  191.  201.  207.  200.
   67.   51.   84.   86.   98.  115.  131.  124.  144.  158.  160.]
['treated' 'untreated']


In [9]:
#datPuro.apply(pd.unique)

## Show the first 5 rows

In [10]:
datPuro.head(5)

Unnamed: 0,conc,rate,state
1,0.02,76.0,treated
2,0.02,47.0,treated
3,0.06,97.0,treated
4,0.06,107.0,treated
5,0.11,123.0,treated


## Show the last 5 rows

In [11]:
datPuro.tail(5)

Unnamed: 0,conc,rate,state
19,0.22,131.0,untreated
20,0.22,124.0,untreated
21,0.56,144.0,untreated
22,0.56,158.0,untreated
23,1.1,160.0,untreated


## Show 5 randomly sampled rows

In [12]:
datPuro.sample(5)

Unnamed: 0,conc,rate,state
20,0.22,124.0,untreated
16,0.06,86.0,untreated
8,0.22,152.0,treated
10,0.56,201.0,treated
17,0.11,98.0,untreated


## Show rows 5 to 10 (inclusive)

In [13]:
datPuro.loc[5:10]

Unnamed: 0,conc,rate,state
5,0.11,123.0,treated
6,0.11,139.0,treated
7,0.22,159.0,treated
8,0.22,152.0,treated
9,0.56,191.0,treated
10,0.56,201.0,treated


## Show only rows where the state is untreated

In [14]:
datPuro[datPuro.state == 'untreated']

Unnamed: 0,conc,rate,state
13,0.02,67.0,untreated
14,0.02,51.0,untreated
15,0.06,84.0,untreated
16,0.06,86.0,untreated
17,0.11,98.0,untreated
18,0.11,115.0,untreated
19,0.22,131.0,untreated
20,0.22,124.0,untreated
21,0.56,144.0,untreated
22,0.56,158.0,untreated
23,1.1,160.0,untreated


In [15]:
# can use filter to do it?
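
On the question in this cell: `DataFrame.filter` selects by index or column labels, not by cell values, so it does not express a condition on `state`. `DataFrame.query` is one alternative to a boolean mask. A minimal sketch on a small hypothetical frame `demo` (three rows copied from the data, not the full set):

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "conc": [0.02, 0.02, 1.10],
        "rate": [76.0, 67.0, 160.0],
        "state": ["treated", "untreated", "untreated"],
    },
    index=[1, 13, 23],
)

by_mask = demo[demo["state"] == "untreated"]    # boolean mask, as used above
by_query = demo.query("state == 'untreated'")   # same rows via an expression string
print(by_query)
```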

## Show only rows where the conc is 0.11

In [16]:
datPuro[datPuro.conc == 0.11]

Unnamed: 0,conc,rate,state
5,0.11,123.0,treated
6,0.11,139.0,treated
17,0.11,98.0,untreated
18,0.11,115.0,untreated
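
Exact `==` works here because the stored values and the literal `0.11` are the same double. When a column holds computed values, an exact comparison can miss rows, and a tolerant comparison such as `np.isclose` may be safer. A small sketch with a hypothetical frame `demo` whose first value is computed:

```python
import numpy as np
import pandas as pd

# 0.1 + 0.2 is not exactly 0.3 in binary floating point
demo = pd.DataFrame({"conc": [0.1 + 0.2, 0.3, 0.56]})

exact = demo[demo["conc"] == 0.3]            # misses the computed value
close = demo[np.isclose(demo["conc"], 0.3)]  # comparison within a tolerance
print(len(exact), len(close))
```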


## Show only rows where the conc is less than 0.1

In [17]:
datPuro[datPuro.conc < 0.1]

Unnamed: 0,conc,rate,state
1,0.02,76.0,treated
2,0.02,47.0,treated
3,0.06,97.0,treated
4,0.06,107.0,treated
13,0.02,67.0,untreated
14,0.02,51.0,untreated
15,0.06,84.0,untreated
16,0.06,86.0,untreated


## Show only rows where the state is treated and the rate is more than 100

In [18]:
datPuro[(datPuro.rate > 100) & (datPuro.state == "treated")]

Unnamed: 0,conc,rate,state
4,0.06,107.0,treated
5,0.11,123.0,treated
6,0.11,139.0,treated
7,0.22,159.0,treated
8,0.22,152.0,treated
9,0.56,191.0,treated
10,0.56,201.0,treated
11,1.1,207.0,treated
12,1.1,200.0,treated


## Show only rows where the conc is less than 0.1 or the rate is more than 200

In [19]:
datPuro[(datPuro.conc < 0.1) | (datPuro.rate > 200)]

Unnamed: 0,conc,rate,state
1,0.02,76.0,treated
2,0.02,47.0,treated
3,0.06,97.0,treated
4,0.06,107.0,treated
10,0.56,201.0,treated
11,1.1,207.0,treated
13,0.02,67.0,untreated
14,0.02,51.0,untreated
15,0.06,84.0,untreated
16,0.06,86.0,untreated


## Show only the conc and rate columns

In [20]:
datPuro[["conc", "rate"]].head()

Unnamed: 0,conc,rate
1,0.02,76.0
2,0.02,47.0
3,0.06,97.0
4,0.06,107.0
5,0.11,123.0


In [21]:
datPuro.filter(items=["conc", "rate"]).head()

Unnamed: 0,conc,rate
1,0.02,76.0
2,0.02,47.0
3,0.06,97.0
4,0.06,107.0
5,0.11,123.0


## Show only the columns whose type is numeric

In [None]:
# np.int and np.float were removed from NumPy; "number" selects all numeric dtypes
datPuro.select_dtypes(include="number").head()

## Show only the columns whose names end with the letter e

In [25]:
# DataFrame.select was removed in pandas 1.0; filter with a regex on the column labels instead
datPuro.filter(regex="e$").head()

Unnamed: 0,rate,state
1,76.0,treated
2,47.0,treated
3,97.0,treated
4,107.0,treated
5,123.0,treated


## Convert all column names to UPPERCASE

In [4]:
datPuro.columns.map(str.upper)

Index(['CONC', 'RATE', 'STATE'], dtype='object')

In [7]:
tmp = datPuro.copy()
tmp.columns = tmp.columns.map(str.upper)
tmp.head()

Unnamed: 0,CONC,RATE,STATE
1,0.02,76.0,treated
2,0.02,47.0,treated
3,0.06,97.0,treated
4,0.06,107.0,treated
5,0.11,123.0,treated


## Rearrange the columns in the order state, conc, rate

In [42]:
datPuro.head()
datPuro[["state", "conc", "rate"]].head()

Unnamed: 0,state,conc,rate
1,treated,0.02,76.0
2,treated,0.02,47.0
3,treated,0.06,97.0
4,treated,0.06,107.0
5,treated,0.11,123.0


## Drop the state column

In [49]:
datPuro.head()

Unnamed: 0,conc,rate,state
1,0.02,76.0,treated
2,0.02,47.0,treated
3,0.06,97.0,treated
4,0.06,107.0,treated
5,0.11,123.0,treated


In [51]:
datPuro.drop("state", axis=1).head()

Unnamed: 0,conc,rate
1,0.02,76.0
2,0.02,47.0
3,0.06,97.0
4,0.06,107.0
5,0.11,123.0


## Create a new column rate2 that is the square of rate

In [57]:
datPuro.assign(rate2=datPuro["rate"]**2).head()

Unnamed: 0,conc,rate,state,rate2
1,0.02,76.0,treated,5776.0
2,0.02,47.0,treated,2209.0
3,0.06,97.0,treated,9409.0
4,0.06,107.0,treated,11449.0
5,0.11,123.0,treated,15129.0


## Create a new data frame that only has the 3 columns with conc, conc^2 and conc^3 values. Name them conc, conc2 and conc3

In [94]:
(
    datPuro[["conc"]]   # keep only conc so the result has exactly 3 columns
        .assign(conc2=datPuro["conc"]**2)
        .assign(conc3=datPuro["conc"]**3)
        .head()
)

Unnamed: 0,conc,conc2,conc3
1,0.02,0.0004,8e-06
2,0.02,0.0004,8e-06
3,0.06,0.0036,0.000216
4,0.06,0.0036,0.000216
5,0.11,0.0121,0.001331


## Replace each value of all numeric columns with the square root of the value

In [93]:
(
    datPuro
        .assign(conc=datPuro["conc"]**0.5)
        .assign(rate=datPuro["rate"]**0.5)
        .head()
)

Unnamed: 0,conc,rate,state
1,0.141421,8.717798,treated
2,0.141421,6.855655,treated
3,0.244949,9.848858,treated
4,0.244949,10.34408,treated
5,0.331662,11.090537,treated
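
The cell above names each numeric column explicitly. One way to avoid hard-coding the names is to pick the numeric columns with `select_dtypes` and transform them together. A minimal sketch on a hypothetical two-row frame `demo`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "conc": [0.04, 0.25],
        "rate": [49.0, 100.0],
        "state": ["treated", "untreated"],
    }
)

# Labels of all numeric columns; the string column is left out
num_cols = demo.select_dtypes(include="number").columns

rooted = demo.copy()                           # leave the original untouched
rooted[num_cols] = np.sqrt(rooted[num_cols])   # element-wise square root
print(rooted)
```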


## Sort in ascending rate order

In [8]:
datPuro.sort_values(by="rate").head()

Unnamed: 0,conc,rate,state
2,0.02,47.0,treated
14,0.02,51.0,untreated
13,0.02,67.0,untreated
1,0.02,76.0,treated
15,0.06,84.0,untreated


## Sort in descending rate order

In [9]:
datPuro.sort_values(by="rate", ascending=False).head()

Unnamed: 0,conc,rate,state
11,1.1,207.0,treated
10,0.56,201.0,treated
12,1.1,200.0,treated
9,0.56,191.0,treated
23,1.1,160.0,untreated


## Sort first on conc in ascending order, then rate in ascending order

In [99]:
datPuro.sort_values(by=["conc", "rate"]).head(10)

Unnamed: 0,conc,rate,state
2,0.02,47.0,treated
14,0.02,51.0,untreated
13,0.02,67.0,untreated
1,0.02,76.0,treated
15,0.06,84.0,untreated
16,0.06,86.0,untreated
3,0.06,97.0,treated
4,0.06,107.0,treated
17,0.11,98.0,untreated
18,0.11,115.0,untreated


## Sort in ascending order of the number of characters in the state column

In [101]:
# Sort on the string length of state rather than alphabetically;
# a stable sort keeps the original row order within equal lengths
datPuro.sort_values(by="state", key=lambda s: s.str.len(), kind="stable").head()

Unnamed: 0,conc,rate,state
1,0.02,76.0,treated
2,0.02,47.0,treated
3,0.06,97.0,treated
4,0.06,107.0,treated
5,0.11,123.0,treated


## Find the mean value of numeric columns

In [110]:
datPuro[["conc", "rate"]].apply(np.mean, axis=0)

conc      0.312174
rate    126.826087
dtype: float64

## Find the mean length of the state column

In [None]:
# Interpreted here as the mean number of characters in the state values
datPuro.state.str.len().mean()

## Find the min, median and max of the rate column

In [128]:
print("min  med   max")
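# Aggregate the rate column only; median() over the whole frame fails on the string column in recent pandas
print(datPuro["rate"].min(), datPuro["rate"].median(), datPuro["rate"].max())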

min  med   max
47.0 124.0 207.0


In [12]:
datPuro.loc[:,"rate"].agg(['min', 'median', 'max'])

min        47.0
median    124.0
max       207.0
Name: rate, dtype: float64

## Find the average rate for each state

In [136]:
datPuro.groupby("state").mean().loc[:,"rate"]

state
treated      141.583333
untreated    110.727273
Name: rate, dtype: float64

## Find the number of treated and untreated states in a new column count

In [163]:
datPuro.shape

(23, 3)

In [162]:
(
    datPuro
        .loc[:,"state"]
        .value_counts()
)

treated      12
untreated    11
Name: state, dtype: int64
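
`value_counts` returns a Series, so the counts are not yet in a column named `count`. One possible way to get that shape is `groupby(...).size()` followed by `reset_index(name="count")`. A sketch on a hypothetical frame `demo` that only carries the `state` column:

```python
import pandas as pd

demo = pd.DataFrame({"state": ["treated"] * 12 + ["untreated"] * 11})

# size() counts rows per group; reset_index turns the result into a
# regular DataFrame with the counts stored in a column called "count"
counts = demo.groupby("state").size().reset_index(name="count")
print(counts)
```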

## Find the number of rows with the same conc and state in a new column count and only show rows where the count is an even number.

In [14]:
datPuro.groupby(["conc", "state"]).count().loc[lambda x:x.rate % 2 == 0]

Unnamed: 0_level_0,Unnamed: 1_level_0,rate
conc,state,Unnamed: 2_level_1
0.02,treated,2
0.02,untreated,2
0.06,treated,2
0.06,untreated,2
0.11,treated,2
0.11,untreated,2
0.22,treated,2
0.22,untreated,2
0.56,treated,2
0.56,untreated,2
1.1,treated,2


In [175]:
df = datPuro.groupby(["conc", "state"]).count()
df.columns = ["Count"]
df[df.Count % 2 == 0]

Unnamed: 0_level_0,Unnamed: 1_level_0,Count
conc,state,Unnamed: 2_level_1
0.02,treated,2
0.02,untreated,2
0.06,treated,2
0.06,untreated,2
0.11,treated,2
0.11,untreated,2
0.22,treated,2
0.22,untreated,2
0.56,treated,2
0.56,untreated,2
1.1,treated,2

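Both cells above count the non-null `rate` values per group. An alternative sketch counts rows directly with `size()` and keeps `conc` and `state` as ordinary columns next to a `count` column. The frame `demo` below is a hypothetical copy of just the `conc` and `state` columns.

```python
import pandas as pd

conc = [0.02, 0.02, 0.06, 0.06, 0.11, 0.11, 0.22, 0.22, 0.56, 0.56, 1.10, 1.10]
demo = pd.DataFrame(
    {
        "conc": conc + conc[:-1],  # untreated has one run fewer at 1.10
        "state": ["treated"] * 12 + ["untreated"] * 11,
    }
)

counts = demo.groupby(["conc", "state"]).size().reset_index(name="count")
even = counts[counts["count"] % 2 == 0]  # keep groups with an even row count
print(even)
```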

## Find the mean and standard deviation of rate for each state and conc. Remove any rows with an NA value for the rate standard deviation.

In [180]:
(
    datPuro
        .groupby(["state", "conc"])
        .agg(["mean", "std"])
        .dropna()
)

Unnamed: 0_level_0,Unnamed: 1_level_0,rate,rate
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,std
state,conc,Unnamed: 2_level_2,Unnamed: 3_level_2
treated,0.02,61.5,20.506097
treated,0.06,102.0,7.071068
treated,0.11,131.0,11.313708
treated,0.22,155.5,4.949747
treated,0.56,196.0,7.071068
treated,1.1,203.5,4.949747
untreated,0.02,59.0,11.313708
untreated,0.06,85.0,1.414214
untreated,0.11,106.5,12.020815
untreated,0.22,127.5,4.949747
untreated,0.56,151.0,9.899495

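The result above has two-level column labels and drops rows with any missing value. Named aggregation is one way to get flat column names, and `dropna(subset=...)` restricts the check to the standard deviation only. A minimal sketch on a hypothetical three-row frame `demo`; the names `rate_mean` and `rate_std` are chosen here for illustration.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "state": ["treated", "treated", "untreated"],
        "conc": [0.02, 0.02, 1.10],
        "rate": [76.0, 47.0, 160.0],
    }
)

summary = (
    demo.groupby(["state", "conc"])
        .agg(rate_mean=("rate", "mean"), rate_std=("rate", "std"))
        .dropna(subset=["rate_std"])  # a single observation has no sample std
        .reset_index()
)
print(summary)
```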