# Introduction to Pandas

**pandas** is a Python package providing fast, flexible, and expressive data structures designed to make working with both *relational* and *labeled* data easy. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.

pandas is well suited for:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational or statistical data set; the data need not be labeled at all to be placed into a pandas data structure


Key features:
    
- Easy handling of **missing data**
- **Size mutability**: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit **data alignment**: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
- Powerful, flexible **group by functionality** to perform split-apply-combine operations on data sets
- Intelligent label-based **slicing, fancy indexing, and subsetting** of large data sets
- Intuitive **merging and joining** data sets
- Flexible **reshaping and pivoting** of data sets
- Robust **IO tools** for loading data from flat files, Excel files, databases, and HDF5

In [1]:
import pandas as pd
import numpy as np

## Pandas Data Structures

### Series

A **Series** is a single vector of data (like a NumPy array) with an *index* that labels each element in the vector.

In [2]:
counts = pd.Series([632, 1638, 569, 115])
counts

0     632
1    1638
2     569
3     115
dtype: int64

If an index is not specified, a default sequence of integers is assigned as the index. A NumPy array comprises the values of the `Series`, while the index is a pandas `Index` object.

In [3]:
counts.values

array([ 632, 1638,  569,  115], dtype=int64)

In [4]:
counts.index # since pandas 0.18 the default index is a RangeIndex, which saves memory
#print (counts.index[0])

RangeIndex(start=0, stop=4, step=1)

We can assign meaningful labels to the index, if they are available:

In [5]:
bacteria = pd.Series([632, 1638, 569, 115], 
    index=['Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes'])

bacteria

Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
dtype: int64

These labels can be used to refer to the values in the `Series`.

In [6]:
bacteria['Actinobacteria']

569

In [7]:
b = bacteria[[name.endswith('bacteria') for name in bacteria.index]]
b

Proteobacteria    1638
Actinobacteria     569
dtype: int64

In [8]:
[name.endswith('bacteria') for name in bacteria.index]

[False, True, True, False]

Notice that the indexing operation preserved the association between the values and the corresponding indices.

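We can still use positional indexing if we wish, through the `iloc` indexer. (Plain `bacteria[0]` relied on an integer fallback that recent pandas releases no longer support for label-based indices.)

In [9]:
bacteria.iloc[0]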

632

NumPy's math functions and other operations can be applied to Series without losing the data structure.

In [10]:
np.log(bacteria)

Firmicutes        6.448889
Proteobacteria    7.401231
Actinobacteria    6.343880
Bacteroidetes     4.744932
dtype: float64

We can also filter according to the values in the `Series`:

In [12]:
b = bacteria[bacteria>1000]
print (b)
print (type(b))

Proteobacteria    1638
dtype: int64
<class 'pandas.core.series.Series'>


A `Series` can be thought of as an ordered key-value store. In fact, we can create one from a `dict`:

In [13]:
bacteria_dict = {'Firmicutes': 632, 'Proteobacteria': 1638, 'Actinobacteria': 569, 'Bacteroidetes': 115}
pd.Series(bacteria_dict)

Firmicutes         632
Proteobacteria    1638
Actinobacteria     569
Bacteroidetes      115
dtype: int64

Notice that the `Series` keeps the keys in the order in which they appear in the `dict`. (Older pandas releases, before 0.23, sorted the keys instead.)

If we pass a custom index to `Series`, it will select the corresponding values from the dict, and treat indices without corresponding values as missing. Pandas uses the floating-point value `NaN` (not a number) to mark missing values, which is why the result below has dtype `float64`.

In [14]:
bacteria2 = pd.Series(bacteria_dict, index=['Cyanobacteria','Firmicutes','Proteobacteria','Actinobacteria'])
bacteria2

Cyanobacteria        NaN
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64

In [15]:
bacteria2.isnull()

Cyanobacteria      True
Firmicutes        False
Proteobacteria    False
Actinobacteria    False
dtype: bool

Critically, the labels are used to **align data** when used in operations with other Series objects:

In [16]:
bacteria + bacteria2

Actinobacteria    1138.0
Bacteroidetes        NaN
Cyanobacteria        NaN
Firmicutes        1264.0
Proteobacteria    3276.0
dtype: float64

Contrast this with NumPy arrays, where arrays of the same length combine values element-wise by position; adding two Series combines the values that share a label. Notice also that the missing values were propagated by the addition: any label that is missing from either operand yields `NaN`.
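The difference between positional and label-based addition can be seen in a small sketch. The two Series and their values are made up for illustration; only the alignment behaviour matters.

```python
import pandas as pd

# Toy data (hypothetical values, chosen only for illustration).
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# NumPy ignores labels: values are paired by position.
by_position = s1.to_numpy() + s2.to_numpy()

# pandas pairs values by label; a label present in only one Series gives NaN.
by_label = s1 + s2

print(by_position)
print(by_label)
```

Here `'b'` and `'c'` are the only shared labels, so only those two entries of `by_label` hold numbers.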

### DataFrame

Inevitably, we want to be able to store, view and manipulate data that is *multivariate*, where for every index there are multiple fields or columns of data, often of varying data type.

A `DataFrame` is a tabular data structure, encapsulating multiple series like columns in a spreadsheet. Data are stored internally as a 2-dimensional object, but the `DataFrame` allows us to represent and manipulate higher-dimensional data.

In [41]:
data = pd.DataFrame({'value':[632, 1638, 569, 115, 433, 1130, 754, 555],
                     'patient':[1, 1, 1, 1, 2, 2, 2, 2],
                     'phylum':['Firmicutes', 'Proteobacteria', 'Actinobacteria', 
    'Bacteroidetes', 'Firmicutes', 'Proteobacteria', 'Actinobacteria', 'Bacteroidetes']})
data

   value  patient          phylum
0    632        1      Firmicutes
1   1638        1  Proteobacteria
2    569        1  Actinobacteria
3    115        1   Bacteroidetes
4    433        2      Firmicutes
5   1130        2  Proteobacteria
6    754        2  Actinobacteria
7    555        2   Bacteroidetes


A `DataFrame` has two indices, a row index and an index representing columns:

In [23]:
print (data.index)
print (data.columns)
print (type(data.columns))

RangeIndex(start=0, stop=8, step=1)
Index(['value', 'patient', 'phylum'], dtype='object')
<class 'pandas.core.indexes.base.Index'>


### Reorder the columns

In [42]:
d = data[['phylum','value','patient']]
print (data)
print (d)

   value  patient          phylum
0    632        1      Firmicutes
1   1638        1  Proteobacteria
2    569        1  Actinobacteria
3    115        1   Bacteroidetes
4    433        2      Firmicutes
5   1130        2  Proteobacteria
6    754        2  Actinobacteria
7    555        2   Bacteroidetes
           phylum  value  patient
0      Firmicutes    632        1
1  Proteobacteria   1638        1
2  Actinobacteria    569        1
3   Bacteroidetes    115        1
4      Firmicutes    433        2
5  Proteobacteria   1130        2
6  Actinobacteria    754        2
7   Bacteroidetes    555        2


### Access columns

Dict-like indexing:

In [43]:
data['value']
#data['patient']

0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64

Or by attribute:

In [26]:
print (data.value)
#data.patient

0     632
1    1638
2     569
3     115
4     433
5    1130
6     754
7     555
Name: value, dtype: int64


In [27]:
type(data['value'])

pandas.core.series.Series

In [28]:
type(data[['value']])

pandas.core.frame.DataFrame

### How to access a row or a cell in a data frame?

In [143]:
data.index = [1,2,3,4,5,6,7,8] # replace the row index with the labels 1..8
print (data)
print ('-------------')
# iloc selects by integer position (so it only takes integers).
d = data.iloc[4]
print (d)
#print (data.iloc[:4]) # a single indexer selects rows
#print (data.iloc[:4,2]) # a row indexer plus a column indexer selects cells
print ('-------------')
# loc selects by label.
print (data.loc[4])
print (data.loc[4,'phylum'])
print (data.loc[[1,4],['phylum','value']])
print ('-------------')
# to mix positions and labels, chain iloc and loc
print (data.iloc[0].loc[['phylum','value']]) # returns the phylum and value of the first row

   value  patient          phylum
1    632        1      Firmicutes
2   1638        1  Proteobacteria
3    569        1  Actinobacteria
4    115        1   Bacteroidetes
5    433        2      Firmicutes
6   1130        2  Proteobacteria
7    754        2  Actinobacteria
8    555        2   Bacteroidetes
-------------
value             433
patient             2
phylum     Firmicutes
Name: 5, dtype: object
-------------
value                115
patient                1
phylum     Bacteroidetes
Name: 4, dtype: object
Bacteroidetes
          phylum  value
1     Firmicutes    632
4  Bacteroidetes    115
-------------
phylum    Firmicutes
value            632
Name: 1, dtype: object
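One further difference between the two indexers is how they treat slices. The following sketch uses a small made-up frame whose integer labels start at 1, so that labels and positions do not coincide.

```python
import pandas as pd

# Hypothetical frame: labels 1..4 sit at positions 0..3.
df = pd.DataFrame({'value': [632, 1638, 569, 115]}, index=[1, 2, 3, 4])

# iloc slices by position and excludes the end point, like a Python list.
by_position = df.iloc[1:3]

# loc slices by label and includes the end point.
by_label = df.loc[1:3]

print(by_position)
print(by_label)
```

`by_position` holds the rows labeled 2 and 3, whereas `by_label` holds the rows labeled 1, 2 and 3.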


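It's important to note that the Series returned when a DataFrame is indexed is merely a **view** on the DataFrame, and not a copy of the data itself, so writing to it can change the original DataFrame. (This describes the pandas version used here; releases with Copy-on-Write enabled do not propagate such writes.) So you must be cautious when manipulating this data: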

In [45]:
vals = data.value
vals

1     632
2    1638
3     569
4     115
5     433
6    1130
7     754
8     555
Name: value, dtype: int64

In [46]:
vals[8] = 554 # you will see a warning, you can ignore it. 

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [47]:
print(data)

   value  patient          phylum
1    632        1      Firmicutes
2   1638        1  Proteobacteria
3    569        1  Actinobacteria
4    115        1   Bacteroidetes
5    433        2      Firmicutes
6   1130        2  Proteobacteria
7    754        2  Actinobacteria
8    554        2   Bacteroidetes


In [48]:
data.loc[8,'value'] = 1000
print (data)

   value  patient          phylum
1    632        1      Firmicutes
2   1638        1  Proteobacteria
3    569        1  Actinobacteria
4    115        1   Bacteroidetes
5    433        2      Firmicutes
6   1130        2  Proteobacteria
7    754        2  Actinobacteria
8   1000        2   Bacteroidetes


In [49]:
vals = data.value.copy()
vals[5] = 2000
print (data)

   value  patient          phylum
1    632        1      Firmicutes
2   1638        1  Proteobacteria
3    569        1  Actinobacteria
4    115        1   Bacteroidetes
5    433        2      Firmicutes
6   1130        2  Proteobacteria
7    754        2  Actinobacteria
8   1000        2   Bacteroidetes


### Create or modify columns by assignment:

In [50]:
data.value[3] = 14 # chained assignment: it works here but raises the warning shown below; prefer .loc
print (data)

   value  patient          phylum
1    632        1      Firmicutes
2   1638        1  Proteobacteria
3     14        1  Actinobacteria
4    115        1   Bacteroidetes
5    433        2      Firmicutes
6   1130        2  Proteobacteria
7    754        2  Actinobacteria
8   1000        2   Bacteroidetes


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [51]:
data.loc[3,'value']=37
print (data)

   value  patient          phylum
1    632        1      Firmicutes
2   1638        1  Proteobacteria
3     37        1  Actinobacteria
4    115        1   Bacteroidetes
5    433        2      Firmicutes
6   1130        2  Proteobacteria
7    754        2  Actinobacteria
8   1000        2   Bacteroidetes


### Add a column:

In [52]:
data['year'] = '1995'
print (data)

   value  patient          phylum  year
1    632        1      Firmicutes  1995
2   1638        1  Proteobacteria  1995
3     37        1  Actinobacteria  1995
4    115        1   Bacteroidetes  1995
5    433        2      Firmicutes  1995
6   1130        2  Proteobacteria  1995
7    754        2  Actinobacteria  1995
8   1000        2   Bacteroidetes  1995


But note, we cannot use attribute access to add a new column; the assignment below sets an ordinary Python attribute on the object and leaves the columns unchanged:

In [54]:
data.treatment = 1
print (data)

   value  patient          phylum  year
1    632        1      Firmicutes  1995
2   1638        1  Proteobacteria  1995
3     37        1  Actinobacteria  1995
4    115        1   Bacteroidetes  1995
5    433        2      Firmicutes  1995
6   1130        2  Proteobacteria  1995
7    754        2  Actinobacteria  1995
8   1000        2   Bacteroidetes  1995
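A quick way to confirm this is to test for the column name before and after each kind of assignment. This is a minimal sketch on a made-up two-row frame.

```python
import pandas as pd

df = pd.DataFrame({'value': [632, 1638], 'patient': [1, 1]})

# Attribute assignment stores a plain Python attribute; no column is created.
df.treatment = 1
has_column_after_attribute = 'treatment' in df.columns

# Item assignment creates the column.
df['treatment'] = 1
has_column_after_item = 'treatment' in df.columns

print(has_column_after_attribute, has_column_after_item)
```

Attribute access remains convenient for reading existing columns, but item assignment (or `loc`) is the dependable way to write them.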


Specifying a `Series` as a new column causes its values to be added according to the `DataFrame`'s index; labels with no match in the `Series` receive `NaN`:

In [55]:
treatment = pd.Series([0]*4 + [1]*2)
print (treatment)

0    0
1    0
2    0
3    0
4    1
5    1
dtype: int64


In [56]:
data['treatment'] = treatment
print (data)

   value  patient          phylum  year  treatment
1    632        1      Firmicutes  1995        0.0
2   1638        1  Proteobacteria  1995        0.0
3     37        1  Actinobacteria  1995        0.0
4    115        1   Bacteroidetes  1995        1.0
5    433        2      Firmicutes  1995        1.0
6   1130        2  Proteobacteria  1995        NaN
7    754        2  Actinobacteria  1995        NaN
8   1000        2   Bacteroidetes  1995        NaN
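If the intent is to fill the column by position rather than by label, one option is to hand pandas the underlying array instead of the `Series`. The sketch below uses made-up data and assumes the array has exactly as many elements as the frame has rows.

```python
import pandas as pd

df = pd.DataFrame({'value': [632, 1638, 569]}, index=[1, 2, 3])
flags = pd.Series([0, 1, 1])          # default index 0, 1, 2

# Assigning the Series aligns on labels: label 3 has no match and
# the value stored under label 0 is not used.
df['aligned'] = flags

# Assigning the underlying array ignores labels and fills by position.
df['positional'] = flags.to_numpy()

print(df)
```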


You can also add a NumPy array or a list as a column. Because these structures carry no index, they need to be the same length as the `DataFrame`:

In [57]:
month = ['Jan', 'Feb', 'Mar', 'Apr'] # only 4 items
data['month'] = month # you will get an error because the length of the column should be 8

ValueError: Length of values does not match length of index

In [67]:
data['month'] = ['Jan']*len(data)
print (data)

   value  patient          phylum  year  treatment month
1    632        1      Firmicutes  1995        0.0   Jan
2   1638        1  Proteobacteria  1995        0.0   Jan
3     37        1  Actinobacteria  1995        0.0   Jan
4    115        1   Bacteroidetes  1995        1.0   Jan
5    433        2      Firmicutes  1995        1.0   Jan
6   1130        2  Proteobacteria  1995        NaN   Jan
7    754        2  Actinobacteria  1995        NaN   Jan
8   1000        2   Bacteroidetes  1995        NaN   Jan


We can use `del` to remove columns, in the same way `dict` entries can be removed:

In [68]:
del data['month']
print (data)

   value  patient          phylum  year  treatment
1    632        1      Firmicutes  1995        0.0
2   1638        1  Proteobacteria  1995        0.0
3     37        1  Actinobacteria  1995        0.0
4    115        1   Bacteroidetes  1995        1.0
5    433        2      Firmicutes  1995        1.0
6   1130        2  Proteobacteria  1995        NaN
7    754        2  Actinobacteria  1995        NaN
8   1000        2   Bacteroidetes  1995        NaN


#### Or you can drop columns using:

In [69]:
data.drop('year', axis=1, inplace=True) # axis=1 targets columns; without inplace=True a new DataFrame is returned
print (data)

   value  patient          phylum  treatment
1    632        1      Firmicutes        0.0
2   1638        1  Proteobacteria        0.0
3     37        1  Actinobacteria        0.0
4    115        1   Bacteroidetes        1.0
5    433        2      Firmicutes        1.0
6   1130        2  Proteobacteria        NaN
7    754        2  Actinobacteria        NaN
8   1000        2   Bacteroidetes        NaN
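The non-destructive form of `drop` is often easier to reason about, because the original frame stays intact. A short sketch on a made-up frame, showing two equivalent spellings:

```python
import pandas as pd

df = pd.DataFrame({'value': [632, 1638],
                   'year': ['1995', '1995'],
                   'month': ['Jan', 'Jan']})

# Without inplace=True, drop returns a new DataFrame and leaves df unchanged.
trimmed = df.drop(columns=['year', 'month'])

# Equivalent spelling with axis=1.
trimmed_axis = df.drop(['year', 'month'], axis=1)

print(list(df.columns))
print(list(trimmed.columns))
```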


We can extract the underlying data as a simple `ndarray` by accessing the `values` attribute:

In [70]:
data.values

array([[632, 1, 'Firmicutes', 0.0],
       [1638, 1, 'Proteobacteria', 0.0],
       [37, 1, 'Actinobacteria', 0.0],
       [115, 1, 'Bacteroidetes', 1.0],
       [433, 2, 'Firmicutes', 1.0],
       [1130, 2, 'Proteobacteria', nan],
       [754, 2, 'Actinobacteria', nan],
       [1000, 2, 'Bacteroidetes', nan]], dtype=object)

Notice that because of the mix of string and integer (and `NaN`) values, the dtype of the array is `object`. The dtype will automatically be chosen to be as general as needed to accommodate all the columns.

In [72]:
df = pd.DataFrame({'foo': [1,2,3], 'bar':[0.4, -1.0, 4.5]})
print (df.values)

[[ 1.   0.4]
 [ 2.  -1. ]
 [ 3.   4.5]]
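The per-column dtypes and the common dtype of the extracted array can be inspected directly. This sketch uses `to_numpy()`, the method form of the `values` attribute, on two small made-up frames.

```python
import pandas as pd

numeric = pd.DataFrame({'foo': [1, 2, 3], 'bar': [0.4, -1.0, 4.5]})
mixed = pd.DataFrame({'foo': [1, 2, 3], 'name': ['a', 'b', 'c']})

# Inside the DataFrame each column keeps its own dtype.
print(numeric.dtypes)

# Converting to a single array forces one common dtype:
# integer + float columns give a float array,
# integer + string columns give an object array.
numeric_dtype = numeric.to_numpy().dtype
mixed_dtype = mixed.to_numpy().dtype
print(numeric_dtype, mixed_dtype)
```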


## Importing data

A CSV file such as `microbiome.csv` can be read into a DataFrame using `read_csv`:

In [73]:
mb = pd.read_csv("microbiome.csv")
print (mb)

             Taxon  Patient  Tissue  Stool
0       Firmicutes        1     632    305
1       Firmicutes        2     136   4182
2       Firmicutes        3    1174    703
3       Firmicutes        4     408   3946
4       Firmicutes        5     831   8605
5       Firmicutes        6     693     50
6       Firmicutes        7     718    717
7       Firmicutes        8     173     33
8       Firmicutes        9     228     80
9       Firmicutes       10     162   3196
10      Firmicutes       11     372     32
11      Firmicutes       12    4255   4361
12      Firmicutes       13     107   1667
13      Firmicutes       14      96    223
14      Firmicutes       15     281   2377
15  Proteobacteria        1    1638   3886
16  Proteobacteria        2    2469   1821
17  Proteobacteria        3     839    661
18  Proteobacteria        4    4414     18
19  Proteobacteria        5   12044     83
20  Proteobacteria        6    2310     12
21  Proteobacteria        7    3053    547
22  Proteob

Notice that `read_csv` automatically considered the first row in the file to be a header row.

We can override the default behavior by customizing some of the arguments, like `header`, `names` or `index_col`.

In [74]:
df = pd.read_csv("microbiome1.csv", header=None).head()
print (df)

            0  1       2       3
0  Firmicutes  1   632.0   305.0
1  Firmicutes  2   136.0  4182.0
2  Firmicutes  3  1174.0   703.0
3  Firmicutes  4   408.0     NaN
4  Firmicutes  5     NaN  8605.0


Then assign the column names:

In [75]:
df.columns = ['Taxon','Patient','Tissue','Stool']
print (df)

        Taxon  Patient  Tissue   Stool
0  Firmicutes        1   632.0   305.0
1  Firmicutes        2   136.0  4182.0
2  Firmicutes        3  1174.0   703.0
3  Firmicutes        4   408.0     NaN
4  Firmicutes        5     NaN  8605.0
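The same result can be reached in a single call by passing `names` to `read_csv`, optionally together with `index_col`. The sketch below reads from an in-memory string rather than a file; the three rows are a made-up stand-in for a header-less CSV.

```python
import io
import pandas as pd

# In-memory stand-in for a header-less CSV file (hypothetical contents).
raw = io.StringIO(
    "Firmicutes,1,632,305\n"
    "Proteobacteria,1,1638,3886\n"
    "Bacteroidetes,1,115,718\n"
)

# header=None says the file has no header row, names supplies the labels,
# and index_col picks which of those columns becomes the row index.
mb_small = pd.read_csv(raw, header=None,
                       names=['Taxon', 'Patient', 'Tissue', 'Stool'],
                       index_col='Taxon')
print(mb_small)
```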


### Rename a column:

In [76]:
df.rename(columns = {'Taxon':'Class'},inplace=True)
print (df)

        Class  Patient  Tissue   Stool
0  Firmicutes        1   632.0   305.0
1  Firmicutes        2   136.0  4182.0
2  Firmicutes        3  1174.0   703.0
3  Firmicutes        4   408.0     NaN
4  Firmicutes        5     NaN  8605.0


In [77]:
print (df.isnull()) # returns a new dataframe with bools

   Class  Patient  Tissue  Stool
0  False    False   False  False
1  False    False   False  False
2  False    False   False  False
3  False    False   False   True
4  False    False    True  False
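Because `True` counts as 1, the boolean frame can be summed to count missing values per column. The sketch below rebuilds the two numeric columns from the output above by hand and also shows two common follow-ups, `dropna` and `fillna`.

```python
import numpy as np
import pandas as pd

# Hand-built frame with the same pattern of gaps as the output above.
df = pd.DataFrame({'Tissue': [632.0, 136.0, 1174.0, 408.0, np.nan],
                   'Stool': [305.0, 4182.0, 703.0, np.nan, 8605.0]})

# Summing the boolean frame counts the missing values in each column.
missing_per_column = df.isnull().sum()

# Drop incomplete rows, or substitute a placeholder value instead.
complete_rows = df.dropna()
zero_filled = df.fillna(0)

print(missing_per_column)
print(len(complete_rows))
```

Whether dropping or filling is appropriate depends on the analysis; filling with zero is shown only as one possibility.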


## Pandas Fundamentals

This section introduces the new user to the key functionality of Pandas that is required to use the software effectively.

For some variety, we will leave our digestive tract bacteria behind and employ some baseball data.

In [78]:
baseball = pd.read_csv("baseball.csv", index_col='id')
baseball.head()

          player  year  stint team  lg   g  ab  r   h  X2b  ...  rbi  sb  cs  bb  so  ibb  hbp  sh  sf  gidp
id                                                          ...
88641  womacto01  2006      2  CHN  NL  19  50  6  14    1  ...    2   1   1   4   4    0    0   3   0     0
88643  schilcu01  2006      1  BOS  AL  31   2  0   1    0  ...    0   0   0   0   1    0    0   0   0     0
88645  myersmi01  2006      1  NYA  AL  62   0  0   0    0  ...    0   0   0   0   0    0    0   0   0     0
88648  helliri01  2006      1  MIL  NL  20   3  0   0    0  ...    0   0   0   0   2    0    0   0   0     0
88649  helliri01  2006      1  MIL  NL  20   3  0   0    0  ...    0   0   0   0   2    0    0   0   0     0


Notice that we specified the `id` column as the index, since it appears to be a unique identifier. We could try to create a unique index ourselves by combining `player` and `year`:

In [79]:
player_id = baseball.player + baseball.year.astype(str)
baseball_newind = baseball.copy()
baseball_newind.index = player_id
print (baseball_newind.head())

                  player  year  stint team  lg   g  ab  r   h  X2b  ...   rbi  \
womacto012006  womacto01  2006      2  CHN  NL  19  50  6  14    1  ...     2   
schilcu012006  schilcu01  2006      1  BOS  AL  31   2  0   1    0  ...     0   
myersmi012006  myersmi01  2006      1  NYA  AL  62   0  0   0    0  ...     0   
helliri012006  helliri01  2006      1  MIL  NL  20   3  0   0    0  ...     0   
helliri012006  helliri01  2006      1  MIL  NL  20   3  0   0    0  ...     0   

               sb  cs  bb  so  ibb  hbp  sh  sf  gidp  
womacto012006   1   1   4   4    0    0   3   0     0  
schilcu012006   0   0   0   1    0    0   0   0     0  
myersmi012006   0   0   0   0    0    0   0   0     0  
helliri012006   0   0   0   2    0    0   0   0     0  
helliri012006   0   0   0   2    0    0   0   0     0  

[5 rows x 22 columns]


This looks okay, but let's check:

In [80]:
baseball_newind.index.is_unique

False

So, indices need not be unique. Our choice is not unique because some players change teams within a year.

In [81]:
pd.Series(baseball_newind.index).value_counts() # this is an important function for categorical variables

francju012007    3
benitar012007    2
hernaro012007    2
gomezch022007    2
trachst012007    2
cirilje012007    2
claytro012007    2
wickmbo012007    2
wellsda012007    2
loftoke012007    2
helliri012006    2
sweenma012007    2
coninje012007    2
greensh012007    1
alomasa022007    1
bondsba012007    1
whiteri012007    1
johnsra052006    1
johnsra052007    1
sprinru012007    1
finlest012006    1
sheffga012007    1
mesajo012007     1
ramirma022007    1
perezne012007    1
schmija012007    1
cormirh012007    1
myersmi012006    1
tavarju012007    1
stairma012007    1
                ..
finlest012007    1
seleaa012007     1
womacto012006    1
williwo022007    1
hoffmtr012007    1
rogerke012007    1
gonzalu012007    1
seaneru012007    1
stinnke012007    1
graffto012007    1
parkch012007     1
villoro012007    1
witasja012007    1
gordoto012007    1
vizquom012007    1
delgaca012007    1
glavito022007    1
moyerja012007    1
guarded012007    1
sandere022007    1
maddugr012007    1
suppaje01200

The most important consequence of a non-unique index is that indexing by label will return multiple values for some labels:

In [83]:
baseball_newind.loc['wickmbo012007']

                  player  year  stint team  lg   g  ab  r  h  X2b  ...  rbi  sb  cs  bb  so  ibb  hbp  sh  sf  gidp
wickmbo012007  wickmbo01  2007      2  ARI  NL   8   0  0  0    0  ...    0   0   0   0   0    0    0   0   0     0
wickmbo012007  wickmbo01  2007      1  ATL  NL  47   0  0  0    0  ...    0   0   0   0   0    0    0   0   0     0
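Repeated labels can be located with `Index.duplicated`, and the return type of `loc` shows the practical consequence. The following is a minimal sketch on a three-element Series whose labels imitate the ones above; the values are illustrative.

```python
import pandas as pd

# Hypothetical miniature of the situation above: one label occurs twice.
s = pd.Series([8, 47, 31],
              index=['wickmbo012007', 'wickmbo012007', 'schilcu012006'])

# duplicated(keep=False) marks every occurrence of a repeated label.
repeated = s[s.index.duplicated(keep=False)]

# A unique label returns a scalar; a repeated label returns a Series.
single = s.loc['schilcu012006']
several = s.loc['wickmbo012007']

print(s.index.is_unique)
print(type(single).__name__, type(several).__name__)
```

Code that expects a scalar from `loc` can therefore break silently when a label turns out to be repeated.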


We will learn more about indexing below.

We can try to create a truly unique index by combining `player`, `team` and `year`:

In [84]:
player_unique = baseball.player + baseball.team + baseball.year.astype(str)
baseball_newind = baseball.copy()
baseball_newind.index = player_unique
print (baseball_newind.head())

                     player  year  stint team  lg   g  ab  r   h  X2b  ...   \
womacto01CHN2006  womacto01  2006      2  CHN  NL  19  50  6  14    1  ...    
schilcu01BOS2006  schilcu01  2006      1  BOS  AL  31   2  0   1    0  ...    
myersmi01NYA2006  myersmi01  2006      1  NYA  AL  62   0  0   0    0  ...    
helliri01MIL2006  helliri01  2006      1  MIL  NL  20   3  0   0    0  ...    
helliri01MIL2006  helliri01  2006      1  MIL  NL  20   3  0   0    0  ...    

                  rbi  sb  cs  bb  so  ibb  hbp  sh  sf  gidp  
womacto01CHN2006    2   1   1   4   4    0    0   3   0     0  
schilcu01BOS2006    0   0   0   0   1    0    0   0   0     0  
myersmi01NYA2006    0   0   0   0   0    0    0   0   0     0  
helliri01MIL2006    0   0   0   0   2    0    0   0   0     0  
helliri01MIL2006    0   0   0   0   2    0    0   0   0     0  

[5 rows x 22 columns]


In [85]:
baseball_newind.index.is_unique

False

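The index is still not unique, because this data set contains duplicated records (`helliri01MIL2006` appears twice in the output above). We can create meaningful indices more easily using a hierarchical index; for now, we will stick with the numeric `id` field as our index.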
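As a brief preview of what a hierarchical index looks like, `set_index` accepts a list of columns and builds one index level per column. The rows below are a made-up miniature in the spirit of the baseball table, not the real file.

```python
import pandas as pd

# Hypothetical rows (not read from baseball.csv).
toy = pd.DataFrame({'player': ['wickmbo01', 'wickmbo01', 'schilcu01'],
                    'team': ['ATL', 'ARI', 'BOS'],
                    'year': [2007, 2007, 2006],
                    'g': [47, 8, 31]})

# One index level per listed column; sorting keeps lookups efficient.
indexed = toy.set_index(['player', 'year', 'team']).sort_index()

print(indexed.index.is_unique)

# Selecting on the outer level returns every row for that player.
print(indexed.loc['wickmbo01'])
```

Unlike concatenated strings, the levels stay separate, so each can be used for selection on its own.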

### Manipulating indices


We can remove rows or columns via the `drop` method:

In [86]:
baseball.shape

(102, 22)

In [87]:
print (baseball.drop([89525, 89526]).head())

          player  year  stint team  lg   g  ab  r   h  X2b  ...   rbi  sb  cs  \
id                                                          ...                 
88641  womacto01  2006      2  CHN  NL  19  50  6  14    1  ...     2   1   1   
88643  schilcu01  2006      1  BOS  AL  31   2  0   1    0  ...     0   0   0   
88645  myersmi01  2006      1  NYA  AL  62   0  0   0    0  ...     0   0   0   
88648  helliri01  2006      1  MIL  NL  20   3  0   0    0  ...     0   0   0   
88649  helliri01  2006      1  MIL  NL  20   3  0   0    0  ...     0   0   0   

       bb  so  ibb  hbp  sh  sf  gidp  
id                                     
88641   4   4    0    0   3   0     0  
88643   0   1    0    0   0   0     0  
88645   0   0    0    0   0   0     0  
88648   0   2    0    0   0   0     0  
88649   0   2    0    0   0   0     0  

[5 rows x 22 columns]


In [88]:
baseball.drop(['ibb','hbp'], axis=1)

          player  year  stint team  lg    g   ab   r    h  X2b  X3b  hr  rbi  sb  cs  bb  so  sh  sf  gidp
id
88641  womacto01  2006      2  CHN  NL   19   50   6   14    1    0   1    2   1   1   4   4   3   0     0
88643  schilcu01  2006      1  BOS  AL   31    2   0    1    0    0   0    0   0   0   0   1   0   0     0
88645  myersmi01  2006      1  NYA  AL   62    0   0    0    0    0   0    0   0   0   0   0   0   0     0
88648  helliri01  2006      1  MIL  NL   20    3   0    0    0    0   0    0   0   0   0   2   0   0     0
88649  helliri01  2006      1  MIL  NL   20    3   0    0    0    0   0    0   0   0   0   2   0   0     0
88650  johnsra05  2006      1  NYA  AL   33    6   0    1    0    0   0    0   0   0   0   4   0   0     0
88652  finlest01  2006      1  SFN  NL  139  426  66  105   21   12   6   40   7   0  46  55   3   4     6
88653  gonzalu01  2006      1  ARI  NL  153  586  93  159   52    2  15   73   0   1  69  58   0   6    14
88662   seleaa01  2006      1  LAN  NL   28   26   2    5    1    0   0    0   0   0   1   7   6   0     1
89176  francju01  2007      2  ATL  NL   15   40   1   10    3    0   0    8   0   0   4  10   0   1     1


## Indexing and Selection

Indexing works analogously to indexing in NumPy arrays, except we can use the labels in the `Index` object to extract values in addition to arrays of integers.

In [89]:
# Sample Series object
hits = baseball_newind.h
hits

womacto01CHN2006     14
schilcu01BOS2006      1
myersmi01NYA2006      0
helliri01MIL2006      0
helliri01MIL2006      0
johnsra05NYA2006      1
finlest01SFN2006    105
gonzalu01ARI2006    159
seleaa01LAN2006       5
francju01ATL2007     10
francju01ATL2007     10
francju01NYN2007     10
zaungr01TOR2007      80
witasja01TBA2007      0
williwo02HOU2007      6
wickmbo01ARI2007      0
wickmbo01ATL2007      0
whitero02MIN2007     19
whiteri01HOU2007      0
wellsda01LAN2007      4
wellsda01SDN2007      4
weathda01CIN2007      0
walketo04OAK2007     13
wakefti01BOS2007      0
vizquom01SFN2007    126
villoro01NYA2007      0
valenjo03NYN2007     40
trachst01CHN2007      1
trachst01BAL2007      0
timlimi01BOS2007      0
                   ... 
guarded01CIN2007      0
griffke02CIN2007    146
greensh01NYN2007    130
graffto01MIL2007     55
gordoto01PHI2007      0
gonzalu01LAN2007    129
gomezch02CLE2007     15
gomezch02BAL2007     51
glavito02NYN2007     12
floydcl01CHN2007     80
finlest01COL2007

In [90]:
# Numpy-style indexing
hits[:3]

womacto01CHN2006    14
schilcu01BOS2006     1
myersmi01NYA2006     0
Name: h, dtype: int64

In [92]:
baseball_newind[['h','ab']].head()

                   h  ab
womacto01CHN2006  14  50
schilcu01BOS2006   1   2
myersmi01NYA2006   0   0
helliri01MIL2006   0   3
helliri01MIL2006   0   3


In [93]:
baseball_newind[baseball_newind.ab>500]

                     player  year  stint team  lg    g   ab   r    h  X2b  ...  rbi  sb  cs  bb   so  ibb  hbp  sh  sf  gidp
gonzalu01ARI2006  gonzalu01  2006      1  ARI  NL  153  586  93  159   52  ...   73   0   1  69   58   10    7   0   6    14
vizquom01SFN2007  vizquom01  2007      1  SFN  NL  145  513  54  126   18  ...   51  14   6  44   48    6    1  14   3    14
thomafr04TOR2007  thomafr04  2007      1  TOR  AL  155  531  63  147   30  ...   95   0   0  81   94    3    7   0   5    14
rodriiv01DET2007  rodriiv01  2007      1  DET  AL  129  502  50  141   31  ...   63   2   2   9   96    1    1   1   2    16
griffke02CIN2007  griffke02  2007      1  CIN  NL  144  528  78  146   24  ...   93   6   1  85   99   14    1   0   9    14
delgaca01NYN2007  delgaca01  2007      1  NYN  NL  139  538  71  139   30  ...   87   4   0  52  118    8   11   0   6    12
biggicr01HOU2007  biggicr01  2007      1  HOU  NL  141  517  68  130   31  ...   50   4   3  23  112    0    3   7   5     5


The `loc` indexer allows us to select subsets of rows and columns by label in an intuitive way:

In [96]:
baseball_newind.loc['gonzalu01ARI2006', ['h','X2b', 'X3b', 'hr']]

h      159
X2b     52
X3b      2
hr      15
Name: gonzalu01ARI2006, dtype: object

In [102]:
baseball_newind.loc[['gonzalu01ARI2006','finlest01SFN2006']].iloc[:,5:8] # use the mix of loc and iloc

                    g   ab   r
gonzalu01ARI2006  153  586  93
finlest01SFN2006  139  426  66
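An alternative to chaining `loc` and `iloc` is to translate the column positions into labels first and then make a single `loc` call. The sketch below uses a hand-built frame that copies a few values from the output above.

```python
import pandas as pd

# Small hand-built frame; the numbers are copied from the output above.
df = pd.DataFrame({'g': [153, 139], 'ab': [586, 426],
                   'r': [93, 66], 'h': [159, 105]},
                  index=['gonzalu01ARI2006', 'finlest01SFN2006'])

# Chained: rows by label, then columns by position.
chained = df.loc[['gonzalu01ARI2006']].iloc[:, 1:3]

# Single step: slice the column Index by position to obtain labels,
# then select rows and columns with one loc call.
single_step = df.loc[['gonzalu01ARI2006'], df.columns[1:3]]

print(single_step)
```

Both forms select the same cells; the single-step form avoids creating an intermediate frame.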


## Operations

`DataFrame` and `Series` objects allow for several operations to take place either on a single object, or between two or more objects.

Arithmetic between two objects is aligned on their labels, and `NaN` combined with anything yields `NaN`. While we do want the operation to honor the data labels in this way, we probably do not want missing values to turn every affected result into `NaN`. The `add` method accepts a `fill_value` argument that substitutes a value (here zero) where one operand is missing. Below, walks (`bb`) and intentional walks (`ibb`) are summed first with `+` and then with `add`:

In [103]:
baseball['total_bb'] = baseball.bb + baseball.ibb
baseball.head(10)

          player  year  stint team  lg    g   ab   r    h  X2b  ...  sb  cs  bb  so  ibb  hbp  sh  sf  gidp  total_bb
id                                                              ...
88641  womacto01  2006      2  CHN  NL   19   50   6   14    1  ...   1   1   4   4    0    0   3   0     0         4
88643  schilcu01  2006      1  BOS  AL   31    2   0    1    0  ...   0   0   0   1    0    0   0   0     0         0
88645  myersmi01  2006      1  NYA  AL   62    0   0    0    0  ...   0   0   0   0    0    0   0   0     0         0
88648  helliri01  2006      1  MIL  NL   20    3   0    0    0  ...   0   0   0   2    0    0   0   0     0         0
88649  helliri01  2006      1  MIL  NL   20    3   0    0    0  ...   0   0   0   2    0    0   0   0     0         0
88650  johnsra05  2006      1  NYA  AL   33    6   0    1    0  ...   0   0   0   4    0    0   0   0     0         0
88652  finlest01  2006      1  SFN  NL  139  426  66  105   21  ...   7   0  46  55    2    2   3   4     6        48
88653  gonzalu01  2006      1  ARI  NL  153  586  93  159   52  ...   0   1  69  58   10    7   0   6    14        79
88662   seleaa01  2006      1  LAN  NL   28   26   2    5    1  ...   0   0   1   7    0    0   6   0     1         1
89176  francju01  2007      2  ATL  NL   15   40   1   10    3  ...   0   0   4  10    1    0   0   1     1         5


In [104]:
baseball['bb'] = baseball['bb'].add(baseball['ibb'],fill_value=0) # fill_value for null values
baseball.head(10)

Unnamed: 0_level_0,player,year,stint,team,lg,g,ab,r,h,X2b,...,sb,cs,bb,so,ibb,hbp,sh,sf,gidp,total_bb
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
88641,womacto01,2006,2,CHN,NL,19,50,6,14,1,...,1,1,4,4,0,0,3,0,0,4
88643,schilcu01,2006,1,BOS,AL,31,2,0,1,0,...,0,0,0,1,0,0,0,0,0,0
88645,myersmi01,2006,1,NYA,AL,62,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
88648,helliri01,2006,1,MIL,NL,20,3,0,0,0,...,0,0,0,2,0,0,0,0,0,0
88649,helliri01,2006,1,MIL,NL,20,3,0,0,0,...,0,0,0,2,0,0,0,0,0,0
88650,johnsra05,2006,1,NYA,AL,33,6,0,1,0,...,0,0,0,4,0,0,0,0,0,0
88652,finlest01,2006,1,SFN,NL,139,426,66,105,21,...,7,0,48,55,2,2,3,4,6,48
88653,gonzalu01,2006,1,ARI,NL,153,586,93,159,52,...,0,1,79,58,10,7,0,6,14,79
88662,seleaa01,2006,1,LAN,NL,28,26,2,5,1,...,0,0,1,7,0,0,6,0,1,1
89176,francju01,2007,2,ATL,NL,15,40,1,10,3,...,0,0,5,10,1,0,0,1,1,5
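
The effect of `fill_value` is easiest to see on a small example. The sketch below uses two made-up Series (`walks` and `intentional` are hypothetical names, not columns of the baseball data) whose indexes only partly overlap, and compares plain `+` with `add(..., fill_value=0)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the two Series share only the labels 'b' and 'c'
walks = pd.Series([4, 46, 69], index=['a', 'b', 'c'])
intentional = pd.Series([2, 10, 1], index=['b', 'c', 'd'])

# Plain addition aligns on labels; a label missing from either side gives NaN
plain = walks + intentional

# With fill_value=0, a label missing from one side is treated as 0 before adding
filled = walks.add(intentional, fill_value=0)

print(plain)
print(filled)
```

With plain addition the labels `a` and `d` come out as `NaN`, whereas `add` with `fill_value=0` keeps the value from whichever side has one.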


In [105]:
baseball.hr - baseball.hr.max()

id
88641   -34
88643   -35
88645   -35
88648   -35
88649   -35
88650   -35
88652   -29
88653   -20
88662   -35
89176   -35
89177   -35
89178   -34
89330   -25
89333   -35
89334   -34
89335   -35
89336   -35
89337   -31
89338   -35
89339   -35
89340   -35
89341   -35
89343   -35
89345   -35
89347   -31
89348   -35
89352   -32
89354   -35
89355   -35
89359   -35
         ..
89460   -35
89462    -5
89463   -25
89464   -26
89465   -35
89466   -20
89467   -35
89468   -34
89469   -35
89473   -26
89474   -34
89480   -35
89481   -23
89482   -25
89489   -11
89493   -35
89494   -35
89495   -29
89497   -35
89498   -35
89499   -34
89501   -35
89502   -33
89521    -7
89523   -25
89525   -35
89526   -35
89530   -32
89533   -22
89534   -35
Name: hr, Length: 102, dtype: int64

We can also apply functions to each column of a `DataFrame`:

In [106]:
baseball[['hr','bb']].apply(np.max)

hr     35
bb    175
dtype: int64
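
As a rough sketch of how `apply` dispatches, the example below uses a small made-up frame (the values are illustrative, not taken from the full baseball table). It passes each column to the function by default and each row when `axis=1` is given.

```python
import pandas as pd

# Hypothetical toy frame standing in for two numeric columns
df = pd.DataFrame({'hr': [1, 0, 6, 15], 'bb': [4, 0, 48, 79]})

# Default axis=0: the function receives each column as a Series
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series
row_total = df.apply(lambda row: row.sum(), axis=1)

print(col_range)
print(row_total)
```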

## Sorting and Ranking

Pandas objects include methods for re-ordering data.

In [107]:
baseball_newind.sort_index().head() # sort_index creates a new dataframe

Unnamed: 0,player,year,stint,team,lg,g,ab,r,h,X2b,...,rbi,sb,cs,bb,so,ibb,hbp,sh,sf,gidp
alomasa02NYN2007,alomasa02,2007,1,NYN,NL,8,22,1,3,1,...,0,0,0,0,3,0,0,0,0,0
aloumo01NYN2007,aloumo01,2007,1,NYN,NL,87,328,51,112,19,...,49,3,0,27,30,5,2,0,3,13
ausmubr01HOU2007,ausmubr01,2007,1,HOU,NL,117,349,38,82,16,...,25,6,1,37,74,3,6,4,1,11
benitar01FLO2007,benitar01,2007,2,FLO,NL,34,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
benitar01SFN2007,benitar01,2007,1,SFN,NL,19,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
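
A minimal sketch of `sort_index` on a made-up frame follows (the index labels are borrowed from the output above, the `g` values are only illustrative). It shows ascending and descending label order and confirms that the original object is left untouched.

```python
import pandas as pd

# Hypothetical toy frame with an unsorted string index
df = pd.DataFrame({'g': [87, 8, 117]},
                  index=['aloumo01', 'alomasa02', 'ausmubr01'])

ascending = df.sort_index()                   # labels in ascending order
descending = df.sort_index(ascending=False)   # labels in descending order

print(ascending)
print(descending)
print(df)  # the original row order is unchanged
```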


We can also use `sort_values` to sort a `Series` by value, rather than by label.

In [108]:
baseball.hr.sort_values(ascending=False)

id
89360    35
89462    30
89521    28
89361    26
89378    25
89489    24
89374    21
89371    21
89439    20
89396    20
89466    15
88653    15
89533    13
89481    12
89389    11
89523    10
89330    10
89482    10
89463    10
89473     9
89464     9
89398     8
89430     7
88652     6
89495     6
89438     6
89337     4
89347     4
89352     3
89530     3
         ..
89480     0
89425     0
89451     0
89421     0
89420     0
89412     0
89411     0
89410     0
89406     0
89402     0
89431     0
89442     0
89445     0
89450     0
89388     0
89384     0
89363     0
89452     0
89382     0
89381     0
89460     0
89375     0
89465     0
89372     0
89467     0
89370     0
89367     0
89469     0
89365     0
89534     0
Name: hr, Length: 102, dtype: int64

For a `DataFrame`, we can sort according to the values of one or more columns using the `by` argument of `sort_values`:

In [109]:
baseball[['player','sb','cs']].sort_values(by=['sb', 'cs'], ascending=[False,True]).head(10)

Unnamed: 0_level_0,player,sb,cs
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
89378,sheffga01,22,5
89430,loftoke01,21,4
89347,vizquom01,14,6
89463,greensh01,11,1
88652,finlest01,7,0
89462,griffke02,6,1
89530,ausmubr01,6,1
89466,gonzalu01,6,2
89521,bondsba01,5,0
89438,kleskry01,5,1
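
The tie-breaking role of the second sort key can be sketched on a tiny made-up frame (the player labels `p1` to `p4` are hypothetical). Rows are ordered by `sb` descending, and rows with equal `sb` are then ordered by `cs` ascending.

```python
import pandas as pd

# Hypothetical toy data: p1 and p3 tie on 'sb' and are separated by 'cs'
df = pd.DataFrame({
    'player': ['p1', 'p2', 'p3', 'p4'],
    'sb': [6, 22, 6, 5],
    'cs': [2, 5, 1, 0],
})

ranked = df.sort_values(by=['sb', 'cs'], ascending=[False, True])
print(ranked)
```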


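**Ranking** does not re-arrange data, but instead returns a new `Series` holding the rank of each value relative to the others in the original `Series`. By default, tied values each receive the average of the ranks they span.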

In [110]:
baseball.hr.rank()

id
88641     64.5
88643     30.0
88645     30.0
88648     30.0
88649     30.0
88650     30.0
88652     78.0
88653     91.5
88662     30.0
89176     30.0
89177     30.0
89178     64.5
89330     85.5
89333     30.0
89334     64.5
89335     30.0
89336     30.0
89337     75.5
89338     30.0
89339     30.0
89340     30.0
89341     30.0
89343     30.0
89345     30.0
89347     75.5
89348     30.0
89352     73.5
89354     30.0
89355     30.0
89359     30.0
         ...  
89460     30.0
89462    101.0
89463     85.5
89464     82.5
89465     30.0
89466     91.5
89467     30.0
89468     64.5
89469     30.0
89473     82.5
89474     64.5
89480     30.0
89481     89.0
89482     85.5
89489     97.0
89493     30.0
89494     30.0
89495     78.0
89497     30.0
89498     30.0
89499     64.5
89501     30.0
89502     71.0
89521    100.0
89523     85.5
89525     30.0
89526     30.0
89530     73.5
89533     90.0
89534     30.0
Name: hr, Length: 102, dtype: float64

You can break ties via one of several methods, such as by the order in which they occur in the dataset:

In [111]:
baseball.hr.rank(method='first')

id
88641     60.0
88643      1.0
88645      2.0
88648      3.0
88649      4.0
88650      5.0
88652     77.0
88653     91.0
88662      6.0
89176      7.0
89177      8.0
89178     61.0
89330     84.0
89333      9.0
89334     62.0
89335     10.0
89336     11.0
89337     75.0
89338     12.0
89339     13.0
89340     14.0
89341     15.0
89343     16.0
89345     17.0
89347     76.0
89348     18.0
89352     73.0
89354     19.0
89355     20.0
89359     21.0
         ...  
89460     47.0
89462    101.0
89463     85.0
89464     82.0
89465     48.0
89466     92.0
89467     49.0
89468     67.0
89469     50.0
89473     83.0
89474     68.0
89480     51.0
89481     89.0
89482     86.0
89489     97.0
89493     52.0
89494     53.0
89495     79.0
89497     54.0
89498     55.0
89499     69.0
89501     56.0
89502     72.0
89521    100.0
89523     87.0
89525     57.0
89526     58.0
89530     74.0
89533     90.0
89534     59.0
Name: hr, Length: 102, dtype: float64
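
To compare tie-breaking rules side by side, here is a small sketch on a made-up Series containing repeated values. It uses four of the `method` options that `rank` accepts (`'average'`, `'first'`, `'min'` and `'dense'`) and collects the results in one frame.

```python
import pandas as pd

# Hypothetical toy Series with ties: two 0s and two 3s
s = pd.Series([0, 3, 0, 5, 3])

ranks = pd.DataFrame({
    'value': s,
    'average': s.rank(),               # default: ties share the mean of their ranks
    'first': s.rank(method='first'),   # ties ranked in order of appearance
    'min': s.rank(method='min'),       # ties all get the lowest rank of the group
    'dense': s.rank(method='dense'),   # like 'min', but ranks increase by 1 between groups
})
print(ranks)
```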

## Missing data

The occurrence of missing data is so prevalent that it pays to use tools like Pandas, which integrate missing-data handling seamlessly so that it can be dealt with easily and in the manner required by the analysis at hand.

Missing data are represented in `Series` and `DataFrame` objects by the `NaN` floating point value. However, `None` is also treated as missing, since it is commonly used as such in other contexts (*e.g.* NumPy).

In [112]:
foo = pd.Series([np.nan, -3, None, 'foobar'])
foo

0       NaN
1        -3
2      None
3    foobar
dtype: object

In [113]:
foo.isnull()

0     True
1    False
2     True
3    False
dtype: bool

Missing values may be dropped or indexed out:

In [114]:
print(bacteria2)
bacteria2.dropna() # returns a new series


Cyanobacteria        NaN
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64


Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64
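
"Indexing out" missing values can be sketched with a boolean mask. The example below rebuilds a Series with the same labels and values as the `bacteria2` output above (the reconstruction itself is an assumption, since the original definition is not shown here) and checks that masking with `notnull` gives the same result as `dropna`.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of bacteria2 from the printed output above
bacteria2 = pd.Series(
    [np.nan, 632.0, 1638.0, 569.0],
    index=['Cyanobacteria', 'Firmicutes', 'Proteobacteria', 'Actinobacteria'],
)

# notnull() gives a boolean mask; indexing with it keeps the non-missing entries
kept = bacteria2[bacteria2.notnull()]

print(kept)
print(kept.equals(bacteria2.dropna()))
```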

By default, `dropna` drops entire rows in which one or more values are missing.

In [115]:
print(data)
data.loc[5, 'value'] = np.nan  # a single .loc assignment modifies `data` itself, avoiding chained indexing

   value  patient          phylum  treatment
1    632        1      Firmicutes        0.0
2   1638        1  Proteobacteria        0.0
3     37        1  Actinobacteria        0.0
4    115        1   Bacteroidetes        1.0
5    433        2      Firmicutes        1.0
6   1130        2  Proteobacteria        NaN
7    754        2  Actinobacteria        NaN
8   1000        2   Bacteroidetes        NaN


In [116]:
data.dropna()  # axis=0 by default, so rows containing any missing value are dropped

Unnamed: 0,value,patient,phylum,treatment
1,632.0,1,Firmicutes,0.0
2,1638.0,1,Proteobacteria,0.0
3,37.0,1,Actinobacteria,0.0
4,115.0,1,Bacteroidetes,1.0


This can be overridden by passing the `how='all'` argument, which only drops a row when every field is a missing value.

In [117]:
data.dropna(how='all')

Unnamed: 0,value,patient,phylum,treatment
1,632.0,1,Firmicutes,0.0
2,1638.0,1,Proteobacteria,0.0
3,37.0,1,Actinobacteria,0.0
4,115.0,1,Bacteroidetes,1.0
5,,2,Firmicutes,1.0
6,1130.0,2,Proteobacteria,
7,754.0,2,Actinobacteria,
8,1000.0,2,Bacteroidetes,


This can be customized further with the `thresh` argument, which sets the minimum number of non-missing values a row must contain in order to be kept.

In [118]:
data.dropna(thresh=4)

Unnamed: 0,value,patient,phylum,treatment
1,632.0,1,Firmicutes,0.0
2,1638.0,1,Proteobacteria,0.0
3,37.0,1,Actinobacteria,0.0
4,115.0,1,Bacteroidetes,1.0


This is typically used in time series applications, where there are repeated measurements that are incomplete for some subjects.
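
As a rough illustration of that use case, the sketch below builds a made-up table of repeated measurements (the subject labels `s1` to `s3` and the column names `t1` to `t3` are hypothetical) and compares the default behavior, `how='all'`, and two `thresh` settings.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measurements: s1 is complete, s2 has one value, s3 has none
df = pd.DataFrame(
    {'t1': [1.0, np.nan, np.nan],
     't2': [2.0, 5.0, np.nan],
     't3': [3.0, np.nan, np.nan]},
    index=['s1', 's2', 's3'],
)

any_missing = df.dropna()            # drop rows with at least one NaN
all_missing = df.dropna(how='all')   # drop rows only if every value is NaN
at_least_2 = df.dropna(thresh=2)     # keep rows with >= 2 non-missing values
at_least_1 = df.dropna(thresh=1)     # keep rows with >= 1 non-missing value

print(at_least_2)
print(at_least_1)
```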

If we want to drop missing values column-wise instead of row-wise, we use `axis=1`.

In [120]:
data.dropna(axis=1)

Unnamed: 0,patient,phylum
1,1,Firmicutes
2,1,Proteobacteria
3,1,Actinobacteria
4,1,Bacteroidetes
5,2,Firmicutes
6,2,Proteobacteria
7,2,Actinobacteria
8,2,Bacteroidetes


Rather than omitting missing data from an analysis, in some cases it may be suitable to fill the missing value in, either with a default value (such as zero) or a value that is either imputed or carried forward/backward from similar data points. We can do this programmatically in Pandas with the `fillna` method.

In [121]:
bacteria2.fillna(0)

Cyanobacteria        0.0
Firmicutes         632.0
Proteobacteria    1638.0
Actinobacteria     569.0
dtype: float64

In [122]:
data.fillna({'year': 2013, 'treatment': 2})  # keys that are not columns of `data` (here 'year') are ignored

Unnamed: 0,value,patient,phylum,treatment
1,632.0,1,Firmicutes,0.0
2,1638.0,1,Proteobacteria,0.0
3,37.0,1,Actinobacteria,0.0
4,115.0,1,Bacteroidetes,1.0
5,,2,Firmicutes,1.0
6,1130.0,2,Proteobacteria,2.0
7,754.0,2,Actinobacteria,2.0
8,1000.0,2,Bacteroidetes,2.0


Notice that `fillna` by default returns a new object with the desired filling behavior, rather than changing the `Series` or `DataFrame` in place (**in general, we like to do this, by the way!**).

In [123]:
print(data)

    value  patient          phylum  treatment
1   632.0        1      Firmicutes        0.0
2  1638.0        1  Proteobacteria        0.0
3    37.0        1  Actinobacteria        0.0
4   115.0        1   Bacteroidetes        1.0
5     NaN        2      Firmicutes        1.0
6  1130.0        2  Proteobacteria        NaN
7   754.0        2  Actinobacteria        NaN
8  1000.0        2   Bacteroidetes        NaN


We can alter values in-place using `inplace=True`.

In [124]:

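# Fill through the DataFrame itself (not via data.value) so that `data` is the object modified
data.fillna({'value': 0}, inplace=True)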
data

Unnamed: 0,value,patient,phylum,treatment
1,632.0,1,Firmicutes,0.0
2,1638.0,1,Proteobacteria,0.0
3,37.0,1,Actinobacteria,0.0
4,115.0,1,Bacteroidetes,1.0
5,0.0,2,Firmicutes,1.0
6,1130.0,2,Proteobacteria,
7,754.0,2,Actinobacteria,
8,1000.0,2,Bacteroidetes,


In [125]:
bacteria2.fillna(bacteria2.mean())

Cyanobacteria      946.333333
Firmicutes         632.000000
Proteobacteria    1638.000000
Actinobacteria     569.000000
dtype: float64
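
Carrying values forward or backward, as mentioned earlier in this section, can be sketched with the `ffill` and `bfill` methods. The Series below is made up for illustration; `limit` caps how many consecutive gaps are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical series of measurements with gaps
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()           # propagate the last observed value forward
backward = s.bfill()          # pull the next observed value backward
limited = s.ffill(limit=1)    # fill at most one consecutive gap

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```

Note that a trailing gap cannot be back-filled and a leading gap cannot be forward-filled, so some `NaN` values may remain.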

## Data summarization

We often wish to summarize data in `Series` or `DataFrame` objects, so that they can more easily be understood or compared with similar data. The NumPy package contains several functions that are useful here, but several summarization or reduction methods are built into Pandas data structures.

In [126]:
baseball.sum()

player      womacto01schilcu01myersmi01helliri01helliri01j...
year                                                   204705
stint                                                     116
team        CHNBOSNYAMILMILNYASFNARILANATLATLNYNTORTBAHOUA...
lg          NLALALNLNLALNLNLNLNLNLNLALALNLNLNLALNLNLNLNLAL...
g                                                        5273
ab                                                      13697
r                                                        1870
h                                                        3592
X2b                                                       742
X3b                                                        55
hr                                                        437
rbi                                                      1855
sb                                                        138
cs                                                         46
bb                                                       1731
so      

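Clearly, `sum` is more meaningful for some columns than others. For methods like `mean`, whose application to string variables is not just meaningless but impossible, older versions of pandas automatically excluded these columns. Recent versions (2.0 and later) raise an error instead unless `numeric_only=True` is passed, so we pass it explicitly:

In [127]:
baseball.mean(numeric_only=True)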

year        2006.911765
stint          1.137255
g             51.696078
ab           134.284314
r             18.333333
h             35.215686
X2b            7.274510
X3b            0.539216
hr             4.284314
rbi           18.186275
sb             1.352941
cs             0.450980
bb            16.970588
so            23.725490
ibb            1.745098
hbp            1.098039
sh             1.352941
sf             1.186275
gidp           3.480392
total_bb      16.970588
dtype: float64

In [128]:
bacteria2.mean()

946.3333333333334

Sometimes we may not want to ignore missing values, and instead allow the `nan` to propagate.

In [129]:
bacteria2.mean(skipna=False)

nan

A useful summarization that gives a quick snapshot of multiple statistics for a `Series` or `DataFrame` is `describe`:

In [130]:
baseball.describe()

Unnamed: 0,year,stint,g,ab,r,h,X2b,X3b,hr,rbi,sb,cs,bb,so,ibb,hbp,sh,sf,gidp,total_bb
count,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0,102.0
mean,2006.911765,1.137255,51.696078,134.284314,18.333333,35.215686,7.27451,0.539216,4.284314,18.186275,1.352941,0.45098,16.970588,23.72549,1.745098,1.098039,1.352941,1.186275,3.480392,16.970588
std,0.285037,0.345816,47.802355,180.857002,27.615225,49.912126,11.039225,1.432795,7.919618,28.143807,3.663162,1.058931,29.740613,32.580489,4.996407,2.213862,2.896386,2.018383,5.167933,29.740613
min,2006.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2007.0,1.0,11.25,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2007.0,1.0,33.0,40.0,2.0,8.0,1.0,0.0,0.0,2.0,0.0,0.0,1.0,7.0,0.0,0.0,0.0,0.0,1.0,1.0
75%,2007.0,1.0,82.25,227.0,30.75,56.5,11.0,0.75,5.5,25.75,0.75,0.0,19.75,33.75,1.0,1.0,1.0,2.0,5.75,19.75
max,2007.0,2.0,155.0,586.0,107.0,159.0,52.0,12.0,35.0,96.0,22.0,6.0,175.0,134.0,43.0,11.0,14.0,9.0,21.0,175.0


`describe` can detect non-numeric data and sometimes yield useful information about it.

In [131]:
baseball.player.describe()

count           102
unique           82
top       francju01
freq              3
Name: player, dtype: object

We can also calculate summary statistics *across* multiple columns, for example, correlation and covariance.

$$cov(x,y) = \frac{1}{n-1} \sum_i (x_i - \bar{x})(y_i - \bar{y})$$

In [132]:
baseball.hr.cov(baseball.X2b)

68.20830906620073

$$corr(x,y) = \frac{cov(x,y)}{s_x s_y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$

In [133]:
baseball.hr.corr(baseball.X2b)

0.7801793841954151

In [134]:
baseball.ab.corr(baseball.h)

0.9942592616798702

In [135]:
baseball.corr()

Unnamed: 0,year,stint,g,ab,r,h,X2b,X3b,hr,rbi,sb,cs,bb,so,ibb,hbp,sh,sf,gidp,total_bb
year,1.0,0.023634,-0.028874,0.020466,-0.002516,0.022229,-0.029985,-0.221755,0.072628,0.060078,0.039603,0.067535,0.024218,0.085857,0.025764,0.013844,0.002116,0.011642,0.069389,0.024218
stint,0.023634,1.0,-0.26458,-0.218776,-0.215304,-0.209384,-0.196703,-0.090901,-0.213225,-0.205097,-0.124594,-0.062557,-0.185402,-0.214558,-0.117078,-0.198806,-0.098269,-0.150473,-0.225623,-0.185402
g,-0.028874,-0.26458,1.0,0.936179,0.911008,0.929575,0.88605,0.52066,0.803291,0.891598,0.494459,0.523413,0.801129,0.86701,0.514869,0.731717,0.085523,0.766845,0.86381,0.801129
ab,0.020466,-0.218776,0.936179,1.0,0.965818,0.994259,0.952469,0.537797,0.84423,0.948105,0.535301,0.579171,0.819101,0.924403,0.507328,0.768458,0.099831,0.840264,0.927158,0.819101
r,-0.002516,-0.215304,0.911008,0.965818,1.0,0.970713,0.923766,0.502886,0.890734,0.941535,0.597823,0.578522,0.887757,0.880049,0.589255,0.807588,0.004828,0.839085,0.895421,0.887757
h,0.022229,-0.209384,0.929575,0.994259,0.970713,1.0,0.957482,0.516155,0.856,0.952511,0.531788,0.573616,0.822423,0.907568,0.51393,0.768689,0.050971,0.839702,0.935984,0.822423
X2b,-0.029985,-0.196703,0.88605,0.952469,0.923766,0.957482,1.0,0.495084,0.780179,0.902309,0.415769,0.479705,0.74952,0.862982,0.454537,0.739457,0.010565,0.819751,0.907412,0.74952
X3b,-0.221755,-0.090901,0.52066,0.537797,0.502886,0.516155,0.495084,1.0,0.213219,0.372169,0.451964,0.386295,0.319858,0.411067,0.092691,0.220393,0.189883,0.396306,0.41395,0.319858
hr,0.072628,-0.213225,0.803291,0.84423,0.890734,0.856,0.780179,0.213219,1.0,0.948866,0.366802,0.348187,0.903522,0.866596,0.673932,0.768656,-0.13952,0.782053,0.799536,0.903522
rbi,0.060078,-0.205097,0.891598,0.948105,0.941535,0.952511,0.902309,0.372169,0.948866,1.0,0.396758,0.437347,0.868819,0.929807,0.583763,0.781848,-0.049642,0.855535,0.907413,0.868819


If we have a `DataFrame` with a hierarchical index (or indices), summary statistics can be applied with respect to any of the index levels:

In [136]:
mb.head()

Unnamed: 0,Taxon,Patient,Tissue,Stool
0,Firmicutes,1,632,305
1,Firmicutes,2,136,4182
2,Firmicutes,3,1174,703
3,Firmicutes,4,408,3946
4,Firmicutes,5,831,8605


## Writing Data to Files

As well as being able to read several data input formats, Pandas can also export data to a variety of storage formats. We will bring your attention to just a couple of these.

In [138]:
mb.to_csv("mb.csv")

The `to_csv` method writes a `DataFrame` to a comma-separated values (csv) file. You can specify custom delimiters (via the `sep` argument), how missing values are written (via the `na_rep` argument), whether the index is written (via the `index` argument), and whether the header is included (via the `header` argument), among other options.
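
These options can be tried without touching the file system by writing to an in-memory buffer. The frame below is a made-up two-row example; it is written tab-delimited, with missing values shown as `NA` and without the index.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value
df = pd.DataFrame({'Taxon': ['Firmicutes', 'Proteobacteria'],
                   'Tissue': [632.0, np.nan]})

buffer = io.StringIO()
df.to_csv(buffer, sep='\t', na_rep='NA', index=False, header=True)

text = buffer.getvalue()
print(text)
```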

As Wes warns in his book, it is recommended that binary storage of data via pickle only be used as a temporary storage format, in situations where speed is relevant. This is because there is no guarantee that the pickle format will not change with future versions of Python.
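
For completeness, a pickle round trip might look like the sketch below. It uses a made-up frame and a temporary directory (the file name `mb_toy.pkl` is hypothetical), in keeping with the advice to treat pickles as short-lived storage.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame
df = pd.DataFrame({'Taxon': ['Firmicutes'], 'Tissue': [632]})

# Write to a temporary location and read straight back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'mb_toy.pkl')
    df.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.equals(df))
```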