## Lesson 11 - Pandas Part I

In this lesson we will cover some basic features of [Pandas](http://pandas.pydata.org).

### Readings

* McKinney: [Chapter 4. NumPy Basics: Arrays and Vectorized Computation](http://proquest.safaribooksonline.com/book/programming/python/9781491957653/numpy-basics-arrays-and-vectorized-computation/numpy_html)
* McKinney: [Chapter 5. Getting Started with pandas](http://proquest.safaribooksonline.com/book/programming/python/9781491957653/getting-started-with-pandas/pandas_html)

### Table of Contents

* [Series](#series)
* [DataFrame](#dataframe)
* [index, columns](#indexing)
* [dtypes, info, describe](#attributes)
* [read_csv](#readcsv)
* [head, tail](#headtail)
* [Indexing with bracket/dot notation, loc, iloc](#indexing2)
* [transpose](#transpose)
* [to_csv, to_excel, read_excel](#toread)
* [to_datetime](#datetime)

In [1]:
import pandas as pd
import numpy as np

<a id="series"></a>

### Series

In [2]:
# a list of strings
my_list = ['cubs', 'pirates', 'giants', 'yankees', 'donkeys']
my_list

['cubs', 'pirates', 'giants', 'yankees', 'donkeys']

In [3]:
# pandas Series from list
series_from_list = pd.Series(my_list)
series_from_list

0       cubs
1    pirates
2     giants
3    yankees
4    donkeys
dtype: object

In [4]:
# indexing a Series is similar to lists and arrays
series_from_list[3]

'yankees'

In [5]:
# a numpy array
my_array = np.random.rand(5)
my_array

array([0.67421844, 0.55303709, 0.56433867, 0.54957026, 0.44879646])

In [6]:
# pandas Series from array
series_from_array = pd.Series(my_array)
series_from_array

0    0.674218
1    0.553037
2    0.564339
3    0.549570
4    0.448796
dtype: float64

In [7]:
# indexing supports lists
series_from_array[[1, 3]]

1    0.553037
3    0.549570
dtype: float64

In [8]:
# indexing supports slices
series_from_array[3:]

3    0.549570
4    0.448796
dtype: float64
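
The Series above all use the default integer index. As a minimal sketch (the `wins` data and its team labels are made up for illustration), a Series can also carry its own labels, in which case `.loc` selects by label and `.iloc` selects by position:

```python
import pandas as pd

# hypothetical example data: win counts labeled by team name
wins = pd.Series([92, 76, 81], index=['cubs', 'pirates', 'giants'])

# select one value by label (.loc) and by position (.iloc)
by_label = wins.loc['pirates']
by_position = wins.iloc[1]

# a list of labels returns a smaller Series
subset = wins.loc[['cubs', 'giants']]

print(by_label, by_position)
print(subset)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity about whether an integer inside plain brackets means a label or a position.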

<a id="dataframe"></a>

### DataFrame

#### 2D array to DataFrame

In [9]:
# create a 2D numpy array
my_2d_array = np.random.randn(5,5)
my_2d_array

array([[ 0.30183528,  0.8527996 ,  1.292565  ,  0.77310374, -1.02036927],
       [-0.21313282, -0.50346792, -1.65864876,  0.00325651,  0.06528936],
       [-0.51409901,  0.19897581, -1.56643912,  2.9767219 ,  1.50389804],
       [ 0.16888865,  1.00044238,  0.93583698,  0.07282361, -2.1335255 ],
       [ 0.46805536, -0.42282092,  0.79700182,  0.04224115,  0.55716763]])

In [10]:
# make a DataFrame from the 2D numpy array
pd.DataFrame(my_2d_array)

Unnamed: 0,0,1,2,3,4
0,0.301835,0.8528,1.292565,0.773104,-1.020369
1,-0.213133,-0.503468,-1.658649,0.003257,0.065289
2,-0.514099,0.198976,-1.566439,2.976722,1.503898
3,0.168889,1.000442,0.935837,0.072824,-2.133526
4,0.468055,-0.422821,0.797002,0.042241,0.557168


In [11]:
# we can set the index and column labels when we create the DataFrame
# note that we can combine positional arguments (data) and keyword arguments
# (index, columns) as long as positional arguments come first
df_from_2d_array = pd.DataFrame(
    my_2d_array,
    index=['row1', 'row2', 'row3', 'row4', 'row5'],
    columns=['col1', 'col2', 'col3', 'col4', 'col5'])
df_from_2d_array

Unnamed: 0,col1,col2,col3,col4,col5
row1,0.301835,0.8528,1.292565,0.773104,-1.020369
row2,-0.213133,-0.503468,-1.658649,0.003257,0.065289
row3,-0.514099,0.198976,-1.566439,2.976722,1.503898
row4,0.168889,1.000442,0.935837,0.072824,-2.133526
row5,0.468055,-0.422821,0.797002,0.042241,0.557168


In [12]:
# also note that Python allows line breaks inside brackets/parentheses, e.g. after a comma
mylist = [
    0, 2, 4,
    6, 8, 10]

In [13]:
mylist

[0, 2, 4, 6, 8, 10]

#### Converting multiple Series to a DataFrame

In [14]:
# method 1: passing the data as a list of Series orients each Series as a row
# this is typically not ideal
x = pd.DataFrame(data=[series_from_list, series_from_array])
x

Unnamed: 0,0,1,2,3,4
0,cubs,pirates,giants,yankees,donkeys
1,0.674218,0.553037,0.564339,0.54957,0.448796


In [15]:
# note that each column has dtype object because it mixes a string and a float
x.dtypes

0    object
1    object
2    object
3    object
4    object
dtype: object

In [16]:
# in this example, we need to transpose the table - we'll see this again later in the lesson
x = x.transpose()
x

Unnamed: 0,0,1
0,cubs,0.674218
1,pirates,0.553037
2,giants,0.564339
3,yankees,0.54957
4,donkeys,0.448796


In [17]:
# the transposed columns have dtype object - we'll see how to fix this below
x.dtypes

0    object
1    object
dtype: object
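
One possible way to repair the object columns after a transpose is `infer_objects()`, which attempts a soft conversion of each object column to a more specific dtype. The sketch below uses small stand-in Series (`names`, `values`) rather than the ones defined above:

```python
import pandas as pd

# stand-in data for illustration
names = pd.Series(['cubs', 'pirates', 'giants'])
values = pd.Series([0.67, 0.55, 0.56])

# a list of Series becomes rows, so every column mixes strings and floats
t = pd.DataFrame([names, values]).transpose()

# infer_objects returns a new DataFrame with better dtypes where possible
fixed = t.infer_objects()
print(fixed.dtypes)
```

Column `1` should come back as `float64`, since it holds only floats after the transpose.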

In [18]:
# method 2: pass list/Series as value of dictionary
y = pd.DataFrame({'a': series_from_list, 'b': series_from_array})
y

Unnamed: 0,a,b
0,cubs,0.674218
1,pirates,0.553037
2,giants,0.564339
3,yankees,0.54957
4,donkeys,0.448796


In [19]:
# importing the data as columns gives us the correct dtypes
y.dtypes

a     object
b    float64
dtype: object

In [20]:
# method 3: use pd.concat to combine series in column orientation
df = pd.concat([series_from_list, series_from_array], axis=1)
df

Unnamed: 0,0,1
0,cubs,0.674218
1,pirates,0.553037
2,giants,0.564339
3,yankees,0.54957
4,donkeys,0.448796


In [21]:
# again, importing the data as columns gives us the correct dtypes
df.dtypes

0     object
1    float64
dtype: object
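
With `pd.concat` the resulting columns are numbered `0` and `1`. As a small sketch (the Series here are stand-ins, and the labels `'team'` and `'random'` are chosen for illustration), the `keys` argument can supply column labels at concatenation time:

```python
import pandas as pd

# stand-in data for illustration
teams = pd.Series(['cubs', 'pirates', 'giants'])
values = pd.Series([0.67, 0.55, 0.56])

# with axis=1, keys become the column labels of the result
named = pd.concat([teams, values], axis=1, keys=['team', 'random'])
print(named)
print(named.dtypes)
```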

<a id="indexing"></a>

### index, columns

In [22]:
# numeric indexes
df.index

RangeIndex(start=0, stop=5, step=1)

In [23]:
# numeric column names
df.columns

RangeIndex(start=0, stop=2, step=1)

In [24]:
# rename columns and indexes - method 1
# assign new index and column labels to an existing DataFrame
df.index = ['a', 'b', 'c', 'd', 'e']
df.columns = ['team', 'random']
df

Unnamed: 0,team,random
a,cubs,0.674218
b,pirates,0.553037
c,giants,0.564339
d,yankees,0.54957
e,donkeys,0.448796


In [25]:
# label (object) indexes
df.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [26]:
# label (object) column names
df.columns

Index(['team', 'random'], dtype='object')

In [27]:
# add a new column to the end of a DataFrame
df['integers'] = [2, 3, 5, 8, 13]
df

Unnamed: 0,team,random,integers
a,cubs,0.674218,2
b,pirates,0.553037,3
c,giants,0.564339,5
d,yankees,0.54957,8
e,donkeys,0.448796,13


In [28]:
# add a new column at a specific position
df.insert(0, 'integers2', [2, 3, 5, 8, 13])
df

Unnamed: 0,integers2,team,random,integers
a,2,cubs,0.674218,2
b,3,pirates,0.553037,3
c,5,giants,0.564339,5
d,8,yankees,0.54957,8
e,13,donkeys,0.448796,13


In [29]:
# delete a column
df.drop('integers2', axis=1, inplace=True)
df

Unnamed: 0,team,random,integers
a,cubs,0.674218,2
b,pirates,0.553037,3
c,giants,0.564339,5
d,yankees,0.54957,8
e,donkeys,0.448796,13


In [30]:
# rename columns - method 2
# using dictionary mapping old names to new names
df.rename(columns={'integers': 'fibonacci'}, inplace=True)
df

Unnamed: 0,team,random,fibonacci
a,cubs,0.674218,2
b,pirates,0.553037,3
c,giants,0.564339,5
d,yankees,0.54957,8
e,donkeys,0.448796,13


In [31]:
# reorder the columns by passing a list of columns in the desired order
df[['fibonacci', 'team', 'random']]

Unnamed: 0,fibonacci,team,random
a,2,cubs,0.674218
b,3,pirates,0.553037
c,5,giants,0.564339
d,8,yankees,0.54957
e,13,donkeys,0.448796
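
Selecting columns in a new order returns a new DataFrame and leaves the original untouched. The same holds for `rename` and `drop` when `inplace=True` is omitted. A minimal sketch with a hypothetical `df_demo`:

```python
import pandas as pd

# hypothetical two-column DataFrame for illustration
df_demo = pd.DataFrame({'team': ['cubs', 'pirates'], 'integers': [2, 3]})

# each call returns a new DataFrame; df_demo itself is not modified
renamed = df_demo.rename(columns={'integers': 'fibonacci'})
dropped = df_demo.drop(columns='integers')
reordered = df_demo[['integers', 'team']]

print(list(df_demo.columns))    # original column order and names
print(list(renamed.columns))
print(list(dropped.columns))
print(list(reordered.columns))
```

To keep a result, assign it back to a name, for example `df_demo = df_demo[['integers', 'team']]`.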


<a id="attributes"></a>

### dtypes, info, describe

In [32]:
# gives the datatype of each column
df.dtypes

team          object
random       float64
fibonacci      int64
dtype: object

In [33]:
# shape (dimensions) of dataframe
df.shape

(5, 3)

In [34]:
# information about index and columns
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, a to e
Data columns (total 3 columns):
team         5 non-null object
random       5 non-null float64
fibonacci    5 non-null int64
dtypes: float64(1), int64(1), object(1)
memory usage: 160.0+ bytes


In [35]:
# basic statistics
df.describe()

Unnamed: 0,random,fibonacci
count,5.0,5.0
mean,0.557992,6.2
std,0.07995,4.438468
min,0.448796,2.0
25%,0.54957,3.0
50%,0.553037,5.0
75%,0.564339,8.0
max,0.674218,13.0
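
By default `describe` summarizes only the numeric columns, which is why `team` is missing above. As a sketch on a small hypothetical DataFrame, `include='all'` adds non-numeric columns, reporting `unique`, `top`, and `freq` for them and `NaN` where a statistic does not apply:

```python
import pandas as pd

# hypothetical data for illustration
df_demo = pd.DataFrame({'team': ['cubs', 'pirates', 'giants'],
                        'fibonacci': [2, 3, 5]})

# include='all' summarizes every column, numeric or not
summary = df_demo.describe(include='all')
print(summary)
```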


<a id="readcsv"></a>

### read_csv

In [36]:
# read_csv with defaults
# by default column headers are the first row and row indexes are integers starting from zero
df_sio = pd.read_csv('../data/scripps_pier_20151110.csv')

In [37]:
df_sio.head()

Unnamed: 0,Date,chl (ug/L),pres (dbar),sal (PSU),temp (C)
0,11/10/15 1:42,22.307,3.712,33.199,19.95
1,11/10/15 1:35,22.311,3.588,33.201,19.94
2,11/10/15 1:29,22.305,3.541,33.2,19.95
3,11/10/15 1:23,22.323,3.463,33.2,19.95
4,11/10/15 1:17,22.316,3.471,33.199,19.95


In [38]:
# by default, read_csv infers the dtype of each column
df_sio.dtypes

Date            object
chl (ug/L)     float64
pres (dbar)    float64
sal (PSU)      float64
temp (C)       float64
dtype: object

In [39]:
df_sio.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66 entries, 0 to 65
Data columns (total 5 columns):
Date           66 non-null object
chl (ug/L)     66 non-null float64
pres (dbar)    66 non-null float64
sal (PSU)      66 non-null float64
temp (C)       66 non-null float64
dtypes: float64(4), object(1)
memory usage: 2.7+ KB


In [40]:
df_sio.describe()

Unnamed: 0,chl (ug/L),pres (dbar),sal (PSU),temp (C)
count,66.0,66.0,66.0,66.0
mean,22.349576,3.041818,33.199318,20.06697
std,0.038988,0.254295,0.004959,0.0685
min,22.305,2.714,33.184,19.94
25%,22.319,2.81325,33.197,20.04
50%,22.3335,2.997,33.199,20.07
75%,22.385,3.2155,33.203,20.105
max,22.426,3.712,33.206,20.19


In [41]:
# read_csv specifying dtype (for all columns), index_col, and header
# sometimes it's better to specify the dtype as object and convert to int, float, etc. later
df_sio2 = pd.read_csv('../data/scripps_pier_20151110.csv', 
                     dtype=object, index_col=None, header=0)

In [42]:
df_sio2.dtypes

Date           object
chl (ug/L)     object
pres (dbar)    object
sal (PSU)      object
temp (C)       object
dtype: object

In [43]:
# read_csv specifying dtypes (per column), index_col, and header
# this allows us to have more control over the dtype of each column
df_sio3 = pd.read_csv('../data/scripps_pier_20151110.csv',
                      dtype={'pres (dbar)': np.float64, 'temp (C)': str}, 
                      index_col=None, header=0)

In [44]:
df_sio3.dtypes

Date            object
chl (ug/L)     float64
pres (dbar)    float64
sal (PSU)      float64
temp (C)        object
dtype: object
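
One case where reading a column as a string matters is an identifier with leading zeros. The sketch below uses a made-up in-memory CSV (the `station` column is hypothetical and not part of the Scripps Pier file) to compare inferred and explicit dtypes:

```python
import io
import pandas as pd

# made-up CSV text; io.StringIO lets read_csv treat it like a file
csv_text = "station,temp (C)\n007,19.95\n012,19.94\n"

# with inference, the station codes become integers and lose their zeros
inferred = pd.read_csv(io.StringIO(csv_text))

# with an explicit str dtype, the codes are kept exactly as written
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'station': str})

print(inferred['station'].tolist())
print(as_text['station'].tolist())
```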

#### Changing dtype of columns after DataFrame is created

In [45]:
# method 1: list comprehension (one column)
df_sio['chl (ug/L)'] = [float(x) for x in df_sio['chl (ug/L)']]

In [46]:
# method 2: pd.to_numeric (one column)
df_sio['pres (dbar)'] = pd.to_numeric(df_sio['pres (dbar)'])

In [47]:
# method 3: apply(pd.to_numeric) (multiple columns)
df_sio[['sal (PSU)','temp (C)']] = df_sio[['sal (PSU)','temp (C)']].apply(pd.to_numeric)

In [48]:
df_sio.dtypes

Date            object
chl (ug/L)     float64
pres (dbar)    float64
sal (PSU)      float64
temp (C)       float64
dtype: object
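
The three methods above assume every value converts cleanly. As a sketch on hypothetical data containing a bad entry (`'n/a'`), `astype` is a fourth option that raises on failure, while `pd.to_numeric(..., errors='coerce')` replaces unparseable values with `NaN`:

```python
import pandas as pd

# hypothetical raw data read in as strings
raw = pd.DataFrame({'temp': ['19.95', '19.94', 'n/a'],
                    'n_obs': ['1', '2', '3']})

# astype converts the whole column and raises if any value cannot be converted
raw['n_obs'] = raw['n_obs'].astype(int)

# errors='coerce' turns values that cannot be parsed into NaN instead of raising
raw['temp'] = pd.to_numeric(raw['temp'], errors='coerce')

print(raw)
print(raw.dtypes)
```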

<a id="headtail"></a>

### head, tail

In [49]:
# head shows the first 5 rows by default; pass a number to change how many rows are shown
df_sio.head()

Unnamed: 0,Date,chl (ug/L),pres (dbar),sal (PSU),temp (C)
0,11/10/15 1:42,22.307,3.712,33.199,19.95
1,11/10/15 1:35,22.311,3.588,33.201,19.94
2,11/10/15 1:29,22.305,3.541,33.2,19.95
3,11/10/15 1:23,22.323,3.463,33.2,19.95
4,11/10/15 1:17,22.316,3.471,33.199,19.95


In [50]:
# tail works the same way
df_sio.tail(3)

Unnamed: 0,Date,chl (ug/L),pres (dbar),sal (PSU),temp (C)
63,11/9/15 19:22,22.418,3.316,33.202,19.96
64,11/9/15 19:16,22.41,3.209,33.2,19.96
65,11/9/15 19:10,22.426,3.328,33.203,19.95


In [51]:
# if we view the whole DataFrame, the display is truncated once it exceeds display.max_rows (60 by default)
df_sio

Unnamed: 0,Date,chl (ug/L),pres (dbar),sal (PSU),temp (C)
0,11/10/15 1:42,22.307,3.712,33.199,19.95
1,11/10/15 1:35,22.311,3.588,33.201,19.94
2,11/10/15 1:29,22.305,3.541,33.200,19.95
3,11/10/15 1:23,22.323,3.463,33.200,19.95
4,11/10/15 1:17,22.316,3.471,33.199,19.95
5,11/10/15 1:11,22.315,3.476,33.198,19.95
6,11/10/15 1:05,22.310,3.448,33.199,19.96
7,11/10/15 0:59,22.316,3.377,33.200,19.99
8,11/10/15 0:53,22.311,3.338,33.200,20.00
9,11/10/15 0:47,22.322,3.325,33.201,20.01


<a id="indexing2"></a>

### Indexing with bracket/dot notation, loc, iloc

Pandas has three indexing methods:

* `[ ]` and `.` work on labels of columns
* `.loc` works on labels of indexes and columns
* `.iloc` works on the positions of indexes and columns (so it only takes integers)

In [52]:
df

Unnamed: 0,team,random,fibonacci
a,cubs,0.674218,2
b,pirates,0.553037,3
c,giants,0.564339,5
d,yankees,0.54957,8
e,donkeys,0.448796,13


#### brackets only -- column by header

In [53]:
# to get a column (Series), use the column header (no need for .loc or .iloc)
df['team']

a       cubs
b    pirates
c     giants
d    yankees
e    donkeys
Name: team, dtype: object

In [54]:
# for multiple columns, put a list inside the brackets (so two sets of brackets)
df[['team', 'random']]

Unnamed: 0,team,random
a,cubs,0.674218
b,pirates,0.553037
c,giants,0.564339
d,yankees,0.54957
e,donkeys,0.448796


#### dot-notation

In [55]:
# if the column name is a valid Python identifier (letters, digits, underscores,
# not starting with a digit) and does not clash with a DataFrame attribute or method,
# we can use a dot instead of brackets and quotes
df.team

a       cubs
b    pirates
c     giants
d    yankees
e    donkeys
Name: team, dtype: object
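
Dot notation is a convenience with limits. The sketch below, using a hypothetical `df_demo`, shows two cases where only brackets work: a column name with spaces or symbols, and a column name that collides with a DataFrame method:

```python
import pandas as pd

# hypothetical DataFrame with three kinds of column names
df_demo = pd.DataFrame({'team': ['cubs'], 'temp (C)': [19.95], 'count': [1]})

# a simple name: dot and bracket notation give the same Series
same = df_demo.team.equals(df_demo['team'])

# 'temp (C)' is not a valid identifier, so only brackets can reach it
temp = df_demo['temp (C)']

# 'count' is also a DataFrame method, so the dot returns the method, not the column
dot_is_method = callable(df_demo.count)
bracket_is_series = isinstance(df_demo['count'], pd.Series)

print(same, dot_is_method, bracket_is_series)
```

Bracket notation is also required when creating a new column by assignment.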

#### loc -- row by index

In [56]:
# to get a row by name, use .loc with the row index
df.loc['a']

team             cubs
random       0.674218
fibonacci           2
Name: a, dtype: object

In [57]:
# for multiple rows, put a list inside the brackets (so two sets of brackets)
df.loc[['a', 'd']]

Unnamed: 0,team,random,fibonacci
a,cubs,0.674218,2
d,yankees,0.54957,8


#### iloc -- row (or column) by position

In [58]:
# to get a row by position, use .iloc with the row number
df.iloc[0]

team             cubs
random       0.674218
fibonacci           2
Name: a, dtype: object

In [59]:
# for multiple rows, put a list inside the brackets (so two sets of brackets)
df.iloc[[0, 3]]

Unnamed: 0,team,random,fibonacci
a,cubs,0.674218,2
d,yankees,0.54957,8


In [60]:
# or pass a slice
df.iloc[2:]

Unnamed: 0,team,random,fibonacci
c,giants,0.564339,5
d,yankees,0.54957,8
e,donkeys,0.448796,13


In [61]:
# iloc also works with columns
df.iloc[:, [0, 2]]

Unnamed: 0,team,fibonacci
a,cubs,2
b,pirates,3
c,giants,5
d,yankees,8
e,donkeys,13
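
Like `.iloc`, `.loc` accepts both a row and a column selector. One difference worth noting is that label slices with `.loc` include the end label, while position slices with `.iloc` exclude the end position. A minimal sketch on a hypothetical `df_demo`:

```python
import pandas as pd

# hypothetical DataFrame with a labeled index
df_demo = pd.DataFrame(
    {'team': ['cubs', 'pirates', 'giants'], 'fibonacci': [2, 3, 5]},
    index=['a', 'b', 'c'])

# a single cell: row label first, then column label
cell = df_demo.loc['b', 'team']

# label slice: both 'a' and 'b' are included
label_slice = df_demo.loc['a':'b']

# position slice: position 1 is excluded, as with Python lists
pos_slice = df_demo.iloc[0:1]

print(cell)
print(list(label_slice.index), list(pos_slice.index))
```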


<a id="transpose"></a>

### transpose

In [62]:
df.transpose()

Unnamed: 0,a,b,c,d,e
team,cubs,pirates,giants,yankees,donkeys
random,0.674218,0.553037,0.564339,0.54957,0.448796
fibonacci,2,3,5,8,13


In [63]:
df.T

Unnamed: 0,a,b,c,d,e
team,cubs,pirates,giants,yankees,donkeys
random,0.674218,0.553037,0.564339,0.54957,0.448796
fibonacci,2,3,5,8,13


<a id="toread"></a>

### to_csv, to_excel

In [64]:
# to_csv with defaults (sep=',')
df.to_csv('teams.csv')

In [65]:
# use the sep option if the separator is not a comma
df.to_csv('teams.tsv', sep='\t')

In [66]:
# with index label
df.to_csv('teams.csv', index_label='index')

In [67]:
# to_excel requires the openpyxl package
df.to_excel('teams.xlsx', index_label='index')
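
To inspect what `to_csv` writes without creating a file, one option is to call it with no path, in which case it returns the CSV text as a string. The sketch below (with a hypothetical `df_demo`) also shows `index=False`, which leaves the index out of the output:

```python
import io
import pandas as pd

# hypothetical DataFrame for illustration
df_demo = pd.DataFrame({'team': ['cubs', 'pirates'], 'fibonacci': [2, 3]},
                       index=['a', 'b'])

# with no path, to_csv returns the CSV text instead of writing a file
with_index = df_demo.to_csv()
without_index = df_demo.to_csv(index=False)

# reading the text back with index_col=0 restores the original index
roundtrip = pd.read_csv(io.StringIO(with_index), index_col=0)

print(with_index)
print(without_index)
```

Note that the header line of `with_index` starts with an empty field, because the index has no name unless `index_label` is given.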

### read_csv (revisited), read_excel

In [68]:
# read_csv with defaults
pd.read_csv('teams.csv')

Unnamed: 0,index,team,random,fibonacci
0,a,cubs,0.674218,2
1,b,pirates,0.553037,3
2,c,giants,0.564339,5
3,d,yankees,0.54957,8
4,e,donkeys,0.448796,13


In [69]:
# read_csv specifying first column of csv as index_col
df1 = pd.read_csv('teams.csv', index_col=0)
df1

index,team,random,fibonacci
a,cubs,0.674218,2
b,pirates,0.553037,3
c,giants,0.564339,5
d,yankees,0.54957,8
e,donkeys,0.448796,13


In [70]:
# default datatypes
df1.dtypes

team          object
random       float64
fibonacci      int64
dtype: object

In [71]:
# again, we can specify the dtypes when we read_csv
df2 = pd.read_csv('teams.csv', index_col=0, dtype=object)
df3 = pd.read_csv('teams.csv', index_col=0, 
                  dtype={'team': object, 'random': np.float64, 'fibonacci': np.int64})

In [72]:
# specify datatypes: all object
df2.dtypes

team         object
random       object
fibonacci    object
dtype: object

In [73]:
# specify datatypes: per column
df3.dtypes

team          object
random       float64
fibonacci      int64
dtype: object

In [74]:
# use the sep option if the separator is not a comma
df4 = pd.read_csv('teams.tsv', index_col=0, sep='\t')
df4

Unnamed: 0,team,random,fibonacci
a,cubs,0.674218,2
b,pirates,0.553037,3
c,giants,0.564339,5
d,yankees,0.54957,8
e,donkeys,0.448796,13


In [75]:
# read_excel needs an Excel engine; recent pandas versions use openpyxl for .xlsx files (xlrd reads only legacy .xls files)
df5 = pd.read_excel('teams.xlsx', index_col=0)
df5

index,team,random,fibonacci
a,cubs,0.674218,2
b,pirates,0.553037,3
c,giants,0.564339,5
d,yankees,0.54957,8
e,donkeys,0.448796,13


<a id="datetime"></a>

### to_datetime

We will cover time series in greater detail in a future lesson.

In [76]:
df_sio.head()

Unnamed: 0,Date,chl (ug/L),pres (dbar),sal (PSU),temp (C)
0,11/10/15 1:42,22.307,3.712,33.199,19.95
1,11/10/15 1:35,22.311,3.588,33.201,19.94
2,11/10/15 1:29,22.305,3.541,33.2,19.95
3,11/10/15 1:23,22.323,3.463,33.2,19.95
4,11/10/15 1:17,22.316,3.471,33.199,19.95


In [77]:
df_sio.dtypes

Date            object
chl (ug/L)     float64
pres (dbar)    float64
sal (PSU)      float64
temp (C)       float64
dtype: object

In [78]:
time = pd.to_datetime(df_sio['Date'])
time.head()

0   2015-11-10 01:42:00
1   2015-11-10 01:35:00
2   2015-11-10 01:29:00
3   2015-11-10 01:23:00
4   2015-11-10 01:17:00
Name: Date, dtype: datetime64[ns]

In [79]:
df_sio['Date'] = time

In [80]:
df_sio.head()

Unnamed: 0,Date,chl (ug/L),pres (dbar),sal (PSU),temp (C)
0,2015-11-10 01:42:00,22.307,3.712,33.199,19.95
1,2015-11-10 01:35:00,22.311,3.588,33.201,19.94
2,2015-11-10 01:29:00,22.305,3.541,33.2,19.95
3,2015-11-10 01:23:00,22.323,3.463,33.2,19.95
4,2015-11-10 01:17:00,22.316,3.471,33.199,19.95


In [81]:
df_sio.dtypes

Date           datetime64[ns]
chl (ug/L)            float64
pres (dbar)           float64
sal (PSU)             float64
temp (C)              float64
dtype: object
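
When the date layout is known, passing `format` to `pd.to_datetime` makes the parsing explicit rather than inferred. The sketch below assumes the month/day/two-digit-year layout seen in the `Date` column and uses two sample strings instead of the full file. Once a column is datetime, the `.dt` accessor exposes components such as the hour:

```python
import pandas as pd

# two sample strings in the same layout as the Date column
dates = pd.Series(['11/10/15 1:42', '11/9/15 19:10'])

# format spells out month/day/two-digit year and 24-hour time
parsed = pd.to_datetime(dates, format='%m/%d/%y %H:%M')

# .dt gives access to datetime components of each element
hours = parsed.dt.hour

print(parsed)
print(hours.tolist())
```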

In [82]:
# to do this in a single step, we can use read_csv's parse_dates keyword
df_sio4 = pd.read_csv('../data/scripps_pier_20151110.csv', index_col=None,
                      parse_dates=['Date'])

In [83]:
df_sio4.head()

Unnamed: 0,Date,chl (ug/L),pres (dbar),sal (PSU),temp (C)
0,2015-11-10 01:42:00,22.307,3.712,33.199,19.95
1,2015-11-10 01:35:00,22.311,3.588,33.201,19.94
2,2015-11-10 01:29:00,22.305,3.541,33.2,19.95
3,2015-11-10 01:23:00,22.323,3.463,33.2,19.95
4,2015-11-10 01:17:00,22.316,3.471,33.199,19.95


In [84]:
df_sio4.dtypes

Date           datetime64[ns]
chl (ug/L)            float64
pres (dbar)           float64
sal (PSU)             float64
temp (C)              float64
dtype: object

In [85]:
# for maximal control, this can be combined with dtype
df_sio5 = pd.read_csv('../data/scripps_pier_20151110.csv', index_col=None,
                      parse_dates=['Date'], dtype={'temp (C)': str})

In [86]:
df_sio5.head()

Unnamed: 0,Date,chl (ug/L),pres (dbar),sal (PSU),temp (C)
0,2015-11-10 01:42:00,22.307,3.712,33.199,19.95
1,2015-11-10 01:35:00,22.311,3.588,33.201,19.94
2,2015-11-10 01:29:00,22.305,3.541,33.2,19.95
3,2015-11-10 01:23:00,22.323,3.463,33.2,19.95
4,2015-11-10 01:17:00,22.316,3.471,33.199,19.95


In [87]:
df_sio5.dtypes

Date           datetime64[ns]
chl (ug/L)            float64
pres (dbar)           float64
sal (PSU)             float64
temp (C)               object
dtype: object