# Chapter 5 - Getting Started with pandas
### Python for data analysis

pandas adopts significant parts of NumPy’s idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops. The biggest difference is that pandas is designed for working with tabular or heterogeneous data, whereas NumPy is best suited to homogeneous numerical array data.

In [80]:
import numpy as np  # needed by the np.exp and np.arange calls in later cells
import pandas as pd

## Series

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data.

When you do not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created. You can get the array representation and index object of the Series via its values and index attributes, respectively:

In [83]:
obj = pd.Series([4, 7, -5, 3])
print(obj)
print(obj.values)
print(obj.index) # like range(4)

0    4
1    7
2   -5
3    3
dtype: int64
[ 4  7 -5  3]
RangeIndex(start=0, stop=4, step=1)


Often it will be desirable to create a Series with an index identifying each data point with a label:

In [86]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2)
print(obj2.index)

d    4
b    7
a   -5
c    3
dtype: int64
Index(['d', 'b', 'a', 'c'], dtype='object')


Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values:

In [90]:
print(obj2[:2]) # as with NumPy
print(obj2['a']) # only in pandas

# Here ['c', 'a', 'd'] is interpreted as a list of indices, even though it contains
# strings instead of integers.
print(obj2[['c','a','d']])

# modify the series
obj2['d'] = 6
print(obj2)

d    4
b    7
dtype: int64
-5
c    3
a   -5
d    4
dtype: int64
d    6
b    7
a   -5
c    3
dtype: int64
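
Plain square brackets accept both positions and labels, which can be ambiguous to read. As a minimal sketch (the variable names are illustrative only), the loc and iloc attributes make the intent explicit: loc selects by label and iloc selects by integer position.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

by_label = obj2.loc['a']          # label-based: the value stored under 'a'
by_position = obj2.iloc[2]        # position-based: the third value
label_slice = obj2.loc['d':'a']   # label slices include the end label
pos_slice = obj2.iloc[:2]         # positional slices exclude the end position

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```

Note that the label slice keeps 'a' while the positional slice stops before position 2.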


Using NumPy functions or NumPy-like operations, such as filtering with a boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

In [94]:
print(obj2[obj2>0])
print(obj2 * 2)
print(np.exp(obj2))

d    6
b    7
c    3
dtype: int64
d    12
b    14
a   -10
c     6
dtype: int64
d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64
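
To check the claim about the index-value link directly, the following sketch (which imports NumPy explicitly, since np.exp needs it) confirms that the result of a NumPy function is still a Series carrying the original labels.

```python
import numpy as np
import pandas as pd

obj2 = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

result = np.exp(obj2)                         # a NumPy ufunc applied to a Series
same_index = result.index.equals(obj2.index)  # labels survive the operation
positives = obj2[obj2 > 0]                    # boolean filtering keeps matching labels

print(type(result).__name__, same_index)
print(list(positives.index))
```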


Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict:

In [97]:
print('b' in obj2)
print('e' in obj2)

True
False
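
A few more dict-like operations, shown here as a small sketch rather than a complete list: get returns a default for a missing label, items iterates over (label, value) pairs, and to_dict converts back to a plain dict.

```python
import pandas as pd

obj2 = pd.Series([6, 7, -5, 3], index=['d', 'b', 'a', 'c'])

fallback = obj2.get('e', 0)   # 'e' is not a label, so the default is returned
pairs = list(obj2.items())    # (label, value) pairs in index order
as_dict = obj2.to_dict()      # back to a plain Python dict

print(fallback)
print(pairs[0])
print(as_dict)
```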


Should you have data contained in a Python dict, you can create a Series from it by passing the dict:

In [99]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
print(obj3)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64


When you pass only a dict, the index of the resulting Series takes the dict’s keys in insertion order, as the output above shows (older pandas versions sorted the keys instead). You can override this by passing the dict keys in the order you want them to appear in the resulting Series:

In [100]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
print(obj4)

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64


Here, three values found in sdata were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which is considered in pandas to mark missing or NA values. Since 'Utah' was not included in states, it is excluded from the resulting object. The isnull and notnull functions in pandas should be used to detect missing data:

In [104]:
print(pd.isnull(obj4))

print('---------------')
print(pd.notnull(obj4))

print('---------------')
# Series also has these as instance methods:
print(obj4.isnull())

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
---------------
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
---------------
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
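
Because the result of isnull is a boolean Series, it combines naturally with other operations. The sketch below counts the missing entries and keeps only the labels that have data; the variable names are illustrative.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

n_missing = int(obj4.isnull().sum())   # True counts as 1 when summed
present = obj4[obj4.notnull()]         # boolean mask drops the missing entry

print(n_missing)
print(present)
```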


A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations:

In [105]:
print(obj3)

print('---------------')
print(obj4)

print('---------------')
print(obj3 + obj4)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
---------------
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
---------------
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
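
If NaN for labels present in only one operand is not what you want, one option is the add method with fill_value. The sketch below assumes that treating an absent label as 0 is acceptable for your data. A label that is missing in both operands, such as 'California' here, still yields NaN.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# Labels found in only one Series are filled with 0 before adding.
total = obj3.add(obj4, fill_value=0)
print(total)
```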


Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality:

In [106]:
obj4.name = 'population'
obj4.index.name = 'state'
print(obj4)

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
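
One place where these names show up is reset_index, which turns a Series into a DataFrame. In this sketch the index name becomes the label of the first column and the Series name becomes the label of the value column.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])
obj4.name = 'population'
obj4.index.name = 'state'

# The index moves into a regular column; both names become column labels.
table = obj4.reset_index()
print(list(table.columns))
```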


## DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index. 

In [107]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)
print(frame)

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
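
The "dict of Series sharing one index" picture can be checked directly. In this sketch, selecting a column returns a Series whose index is the DataFrame's row index, and each column keeps its own dtype.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

year = frame['year']                            # one column, returned as a Series
shares_index = year.index.equals(frame.index)   # same row labels as the frame

print(type(year).__name__, shares_index)
print(frame.dtypes)                             # one dtype per column
```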


If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order:

In [108]:
print(pd.DataFrame(data, columns=['year', 'state', 'pop']))

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


If you pass a column that isn’t contained in the dict, it will appear with missing values in the result:

In [113]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four','five', 'six'])
print(frame2)

print('------------------')
print(frame2.columns)

print('------------------')
print(frame2['state'])

# frame2[column] works for any column name, but frame2.column
# only works when the column name is a valid Python variable
# name.
print('------------------')
print(frame2.state)

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN
------------------
Index(['year', 'state', 'pop', 'debt'], dtype='object')
------------------
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
------------------
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
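
Attribute access has a second pitfall besides invalid names: a column whose name matches a DataFrame method. The 'pop' column in this data is such a case, because DataFrame already has a pop method. A short sketch of the difference:

```python
import pandas as pd

data = {'state': ['Ohio', 'Nevada'],
        'year': [2000, 2001],
        'pop': [1.5, 2.4]}
frame = pd.DataFrame(data)

pop_column = frame['pop']   # bracket access always returns the column
pop_attr = frame.pop        # attribute access finds the DataFrame.pop method

print(type(pop_column).__name__)
print(callable(pop_attr))
```

Bracket access is therefore the safer default in scripts.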


Rows can also be retrieved by label with the special loc attribute; retrieval by integer position uses iloc instead (much more on this later):

In [114]:
print(frame2.loc['three'])

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
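
For comparison, here is a small sketch that fetches the same row twice, once by its label with loc and once by its integer position with iloc.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

by_label = frame2.loc['three']   # row whose index label is 'three'
by_position = frame2.iloc[2]     # third row, counting from 0

print(by_label['year'], by_position['year'])
```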


Columns can be modified by assignment. For example, the empty 'debt' column could be assigned a scalar value or an array of values:

In [117]:
frame2['debt'] = 16.5
print(frame2)

print('--------------')
frame2['debt'] = np.arange(6.)
print(frame2)

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5
--------------
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0


When you are assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame’s index, inserting missing values in any holes:

In [118]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
print(frame2)

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
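
The two assignment rules can be contrasted in a short sketch on a smaller, hypothetical frame: a list of the wrong length is rejected with a ValueError, while a Series with fewer labels is aligned and padded with NaN.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002]},
                     index=['one', 'two', 'three'])

# A plain list must have exactly one value per row.
try:
    frame['debt'] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = exc

# A Series is matched on labels; rows without a matching label get NaN.
frame['debt'] = pd.Series([-1.2], index=['two'])

print(type(length_error).__name__)
print(frame)
```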


Assigning a column that doesn’t exist will create a new column. The del keyword will delete columns as with a dict.

In [122]:
frame2['eastern'] = frame2.state == 'Ohio'
print(frame2)

print('----------------')
del frame2['eastern']
print(frame2)

       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False
----------------
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
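
The del keyword modifies the DataFrame in place. When you would rather keep the original intact, one alternative is the drop method, which returns a new object. A minimal sketch on a smaller, hypothetical frame:

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})
frame['eastern'] = frame['state'] == 'Ohio'

# drop returns a new DataFrame; the original keeps its column.
trimmed = frame.drop(columns=['eastern'])

print(list(frame.columns))
print(list(trimmed.columns))
```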


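Another common form of data is a nested dict of dicts. When it is passed to DataFrame, pandas interprets the outer dict keys as the columns and the inner keys as the row index. The T attribute transposes the result, swapping rows and columns:

In [126]: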
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
print(frame3)
print(frame3.T)

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5
        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5
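
The row labels above are the union of the inner keys and are not necessarily sorted. If you want the years in order, one option is sort_index, sketched here:

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)

# sort_index returns a new DataFrame with the row labels in ascending order.
ordered = frame3.sort_index()
print(ordered)
```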


If a DataFrame’s index and columns have their name attributes set, these will also be displayed:

In [127]:
frame3.index.name = 'year'
frame3.columns.name = 'state'
print(frame3)

state  Nevada  Ohio
year               
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5


As with Series, the values attribute returns the data contained in the DataFrame as a two-dimensional ndarray:

In [128]:
print(frame3.values)

[[2.4 1.7]
 [2.9 3.6]
 [nan 1.5]]
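
When the columns have different dtypes, the resulting array needs a dtype that can hold every column. The sketch below uses to_numpy, which returns the same kind of array as values, to compare an all-float frame with a mixed one; the second frame is a shortened, hypothetical version of the earlier data.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
numeric = pd.DataFrame(pop).to_numpy()   # all columns are floats

mixed = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'year': [2000, 2001],
                      'pop': [1.5, 2.4]}).to_numpy()   # strings, ints and floats

print(numeric.dtype, numeric.shape)
print(mixed.dtype)   # object: the only dtype that fits every column
```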


## Index Objects

pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index:

In [130]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
print(index)
print(index[1:])

Index(['a', 'b', 'c'], dtype='object')
Index(['b', 'c'], dtype='object')


Index objects are immutable and thus can’t be modified by the user:

In [131]:
index[1] = 'd' # TypeError

TypeError: Index does not support mutable operations
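
Since an Index cannot be changed in place, changing a label means building a new Index. The sketch below first catches the TypeError and then uses delete and insert, both of which return new Index objects, as one possible way to get a modified copy.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

try:
    index[1] = 'd'            # item assignment is not supported
    error = None
except TypeError as exc:
    error = exc

# delete and insert each return a new Index; the original is untouched.
replaced = index.delete(1).insert(1, 'd')

print(type(error).__name__)
print(list(index), list(replaced))
```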

Immutability makes it safer to share Index objects among data structures:

In [133]:
labels = pd.Index(np.arange(3))
print(labels)

obj2 = pd.Series([1.5, -2.5, 0], index=labels)
print(obj2)


Int64Index([0, 1, 2], dtype='int64')
0    1.5
1   -2.5
2    0.0
dtype: float64
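
As a sketch of such sharing, the same labels object below is passed to two Series (the second one is hypothetical). Because neither Series can alter the labels, arithmetic between them lines up row for row.

```python
import numpy as np
import pandas as pd

labels = pd.Index(np.arange(3))

a = pd.Series([1.5, -2.5, 0], index=labels)
b = pd.Series([10, 20, 30], index=labels)   # reuses the same labels

total = a + b                               # aligned on the shared labels
print(a.index.equals(b.index))
print(total)
```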


## Essential functionality

This section will walk you through the fundamental mechanics of interacting with the data contained in a Series or DataFrame. In the chapters to come, we will delve more deeply into data analysis and manipulation topics using pandas.