# 5.1 Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: **Series** and **DataFrame**.

## Series

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index.

In [144]:
import numpy as np
import pandas as pd

In [76]:
obj = pd.Series([4, 7, -5, 3])

In [77]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [78]:
obj.values

array([ 4,  7, -5,  3])

In [79]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [80]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

In [81]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [83]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [84]:
obj2["a"]

-5

In [85]:
obj2["b"]

7

In [86]:
obj2[["a", "d", "c"]]

a   -5
d    4
c    3
dtype: int64

In [87]:
obj2[obj2 > 2]

d    4
b    7
c    3
dtype: int64

In [88]:
obj2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

In [89]:
np.exp(obj2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [90]:
"b" in obj2

True

In [91]:
"e" in obj2

False

Should you have data contained in a Python dict, you can create a Series from it by passing the dict:

In [92]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}

In [93]:
sdata

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [94]:
obj3 = pd.Series(sdata)

In [95]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When you pass only a dict, the index of the resulting Series follows the dict's key insertion order. You can override this by passing an index with the dict keys in the order you want them to appear in the resulting Series:

In [96]:
states = ["California", 'Ohio', "Utah"]

In [97]:
obj4 = pd.Series(data=sdata, index=states)

In [98]:
obj4

California        NaN
Ohio          35000.0
Utah           5000.0
dtype: float64

In [99]:
obj3.index

Index(['Ohio', 'Texas', 'Oregon', 'Utah'], dtype='object')

In [100]:
obj3.reindex(index=states) # reindex by passing a list

California        NaN
Ohio          35000.0
Utah           5000.0
dtype: float64

In [101]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

Since no value for 'California' was found, it appears as **NaN** (not a number), which pandas uses to mark missing or NA values. Since 'Texas' and 'Oregon' were not included in **states**, they are excluded from the resulting object.

In [102]:
pd.isnull(obj4)

California     True
Ohio          False
Utah          False
dtype: bool

In [103]:
pd.notnull(obj4)

California    False
Ohio           True
Utah           True
dtype: bool

In [104]:
pd.notna(obj4)

California    False
Ohio           True
Utah           True
dtype: bool

Both the Series object itself and its index have a **name** attribute, which integrates with other key areas of pandas functionality:

In [105]:
obj4.name = "population"

In [106]:
obj4.index.name = 'state'

In [107]:
obj4

state
California        NaN
Ohio          35000.0
Utah           5000.0
Name: population, dtype: float64

A Series's index can be altered in-place by assignment:

In [108]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [109]:
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]

In [110]:
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

## DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). 

The DataFrame has both a row and column index.

There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

In [111]:
data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}

In [112]:
frame = pd.DataFrame(data)

In [113]:
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [114]:
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


In [115]:
frame.tail()

    state  year  pop
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


If you specify a sequence of columns, the DataFrame's columns will be arranged in that order:

In [116]:
pd.DataFrame(data, columns=["year", "state", "pop"])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


If you pass a column that isn't contained in the dict, it will appear with missing values in the result:

In [117]:
frame2 = pd.DataFrame(
    data,
    columns=["year", "state", "pop", "debt"],
    index=["one", "two", "three", "four", "five", "six"],
)

In [118]:
frame2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN


In [119]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [120]:
frame2["state"]

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [121]:
frame2.index

Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

In [122]:
frame2.state

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

**frame2[column]** works for any column name, but **frame2.column** only works when the column name is a valid Python variable name and does not clash with an existing DataFrame attribute or method.
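The difference is easiest to see with column names that attribute access cannot reach. The sketch below uses a hypothetical frame (`df` and its `"total debt"` column are made up for illustration). Note that `pop` is also the name of a DataFrame method, so `df.pop` returns that method rather than the column.

```python
import pandas as pd

# Hypothetical frame for illustration only.
df = pd.DataFrame(
    {"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4], "total debt": [0.0, 1.0]}
)

state_by_attr = df.state          # valid identifier, no name clash: works
pop_attr = df.pop                 # resolves to the DataFrame.pop method
pop_column = df["pop"]            # bracket syntax always returns the column
debt_column = df["total debt"]    # a name containing a space needs brackets
```

Bracket syntax is the safer default in scripts, since it behaves the same way for every column name.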

In [124]:
frame2.loc["three"]

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [125]:
frame2.loc["three", "pop"]

3.6

Columns can be modified by assignment. For example, the empty **'debt'** column could be assigned a scalar value or an array of values:

In [126]:
frame2["debt"] = 16.5

In [127]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5


In [128]:
frame2["debt"] = np.arange(6.0)

In [129]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0


When you assign a list or array to a column, the value's length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame's index, inserting missing values in any holes:

In [130]:
val = pd.Series([-1.2, -1.5, -1.7], index=["one", "two", "five"])

In [131]:
frame2["debt"] = val

In [132]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5  -1.2
two    2001    Ohio  1.7  -1.5
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4   NaN
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
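The two assignment rules can be contrasted in a short sketch. The frame `df` below is hypothetical and much smaller than `frame2`: a list that is too short is rejected, while a Series that is too short is aligned by label and padded with NaN.

```python
import pandas as pd

# Hypothetical three-row frame for illustration.
df = pd.DataFrame({"pop": [1.5, 1.7, 3.6]}, index=["one", "two", "three"])

# A list is matched by position, so its length must equal the row count.
try:
    df["debt"] = [-1.2, -1.5]
    list_error = None
except ValueError as exc:
    list_error = exc

# A Series is matched by label, so a shorter one is accepted; gaps become NaN.
df["debt"] = pd.Series([-1.2, -1.5], index=["one", "three"])
```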


In [133]:
frame2["eastern"] = frame2.state == "Ohio"

In [134]:
frame2.columns

Index(['year', 'state', 'pop', 'debt', 'eastern'], dtype='object')

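New columns cannot be created with the **frame2.eastern** attribute syntax. Assigning to a new attribute name sets a plain Python attribute and leaves the columns unchanged, so use the bracket syntax to add a column.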
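A minimal check of what attribute assignment does when the name is not yet a column, using a small hypothetical frame (`df` is not part of the running example). In current pandas the attribute form sets a plain Python attribute and emits a `UserWarning`, which the sketch silences; only bracket assignment adds the column.

```python
import warnings

import pandas as pd

# Hypothetical frame for illustration only.
df = pd.DataFrame({"state": ["Ohio", "Nevada"]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")          # pandas warns about this pattern
    df.eastern = df["state"] == "Ohio"       # sets an attribute, not a column

attr_created_column = "eastern" in df.columns

df["eastern"] = df["state"] == "Ohio"        # bracket assignment adds the column
bracket_created_column = "eastern" in df.columns
```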

In [135]:
frame2["western"] = frame2.state == "Nevada"

In [136]:
frame2.columns

Index(['year', 'state', 'pop', 'debt', 'eastern', 'western'], dtype='object')

In [137]:
frame2

       year   state  pop  debt  eastern  western
one    2000    Ohio  1.5  -1.2     True    False
two    2001    Ohio  1.7  -1.5     True    False
three  2002    Ohio  3.6   NaN     True    False
four   2001  Nevada  2.4   NaN    False     True
five   2002  Nevada  2.9  -1.7    False     True
six    2003  Nevada  3.2   NaN    False     True


The **del** keyword can then be used to remove these columns:

In [138]:
del frame2["eastern"]

In [139]:
del frame2["western"]

In [140]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

Another common form of data is a nested dict of dicts:

In [152]:
pop = {"Nevada": {2001: 2.4, 2002: 2.9}, "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [153]:
pop

{'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [154]:
frame3 = pd.DataFrame(pop)

In [155]:
frame3

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6


You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array:

In [156]:
frame3.T

        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6


The keys in the inner dicts are combined and sorted to form the index in the result. This isn't true if an explicit index is specified:

In [157]:
pd.DataFrame(pop, index=[2001, 2002, 2003])

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN


In [166]:
pdata = {"Ohio": frame3["Ohio"][:-1], "Nevada": frame3["Nevada"][:-1]}

In [167]:
pd.DataFrame(pdata)

      Ohio  Nevada
2000   1.5     NaN
2001   1.7     2.4


If a DataFrame's **index** and **columns** have their **name** attributes set, these will also be displayed:

In [168]:
frame3.index.name = "year"

In [169]:
frame3.columns.name = "state"

In [170]:
frame3

state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6


As with Series, the **values** attribute returns the data contained in the DataFrame as a two-dimensional ndarray:

In [171]:
frame3.values

array([[nan, 1.5],
       [2.4, 1.7],
       [2.9, 3.6]])

If the DataFrame's columns have different dtypes, the dtype of the values array will be chosen to accommodate all of the columns:

In [172]:
frame2.values

array([[2000, 'Ohio', 1.5, -1.2],
       [2001, 'Ohio', 1.7, -1.5],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)
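As a rough illustration of how that common dtype is chosen, the sketch below builds two small hypothetical frames (`numeric` and `mixed` are made-up names) and inspects the dtype of the resulting array.

```python
import pandas as pd

# Hypothetical frames for illustration only.
numeric = pd.DataFrame({"count": [1, 2], "ratio": [1.5, 2.5]})
mixed = pd.DataFrame({"year": [2000, 2001], "state": ["Ohio", "Nevada"]})

numeric_dtype = numeric.values.dtype   # int64 and float64 columns upcast to float64
mixed_dtype = mixed.values.dtype       # numbers plus strings fall back to object
```

An object array is more general but slower to compute on, so it can be worth selecting only the numeric columns before extracting values.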

## Index Objects

pandas's Index objects are responsible for holding the axis labels and other metadata (like the axis name or names).

In [173]:
obj = pd.Series(range(3), index=["a", "b", "c"])

In [174]:
index = obj.index

In [175]:
index

Index(['a', 'b', 'c'], dtype='object')

In [182]:
type(index)

pandas.core.indexes.base.Index

In [176]:
index[1:]

Index(['b', 'c'], dtype='object')

Index objects are immutable and thus can't be modified by the user:

In [None]:
index[1] = "d" # TypeError: Index does not support mutable operations

Immutability makes it safer to share Index objects among data structures:

In [179]:
labels = pd.Index(np.arange(3))

In [180]:
labels

Int64Index([0, 1, 2], dtype='int64')

In [181]:
type(labels)

pandas.core.indexes.numeric.Int64Index

In [183]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)

In [197]:
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [198]:
obj2.index is labels

True

In [216]:
obj2.index == labels

array([ True,  True,  True])

In addition to being array-like, an Index also behaves like a fixed-size set:

In [217]:
frame3

state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6


In [219]:
frame3.columns

Index(['Nevada', 'Ohio'], dtype='object', name='state')

In [220]:
"Ohio" in frame3.columns

True

In [221]:
2003 in frame3.index

False

Unlike Python sets, a pandas Index can contain duplicate labels:

In [225]:
dup_labels = pd.Index(("foo", "foo", "bar", "bar"))

In [226]:
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
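The set-like behavior goes beyond membership tests. As a brief sketch, the Index methods `union`, `intersection`, and `difference` combine two indexes, while `is_unique` and `unique` help when duplicates are present. The names `left` and `right` are made up for this example.

```python
import pandas as pd

# Hypothetical indexes for illustration only.
left = pd.Index(["Nevada", "Ohio"])
right = pd.Index(["Ohio", "Texas"])

both = left.intersection(right)      # labels present in both indexes
either = left.union(right)           # labels present in at least one index
only_left = left.difference(right)   # labels in left but not in right

dup_labels = pd.Index(["foo", "foo", "bar", "bar"])
has_duplicates = not dup_labels.is_unique
deduplicated = dup_labels.unique()   # keeps the first occurrence of each label
```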

# 5.2 Essential Functionality

## Reindexing

An important method on pandas objects is **reindex**, which means to create a new object with the data *conformed* to a new index. Consider an example:

In [227]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

In [228]:
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [229]:
obj2 = obj.reindex(["a", "b", "c", "d", "e"])

In [230]:
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

For ordered data like time series, it may be desirable to do some interpolation or filling of values when reindexing. The **method** option allows us to do this, using a method such as **ffill**, which forward-fills the values:

In [231]:
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

In [232]:
obj3

0      blue
2    purple
4    yellow
dtype: object

In [233]:
obj3.reindex(range(6), method="ffill")

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [234]:
obj3

0      blue
2    purple
4    yellow
dtype: object
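To make the effect of the **method** option explicit, this sketch reindexes the same data with and without forward filling. The variable names are illustrative. In both cases `reindex` returns a new object and leaves the original untouched.

```python
import pandas as pd

colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

plain = colors.reindex(range(6))                    # new labels get NaN
filled = colors.reindex(range(6), method="ffill")   # new labels copy the previous value
```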

## Dropping entries from an Axis

Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. The **drop** method will return a new object with the indicated value or values deleted from an axis:

In [2]:
obj = pd.Series(np.arange(5.0), index=list("abcde"))

In [4]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [5]:
new_obj = obj.drop("c")

In [7]:
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [8]:
obj.drop(["a", "d"])

b    1.0
c    2.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis. To illustrate this, we first create an example DataFrame:

In [9]:
data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

In [10]:
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


Calling **drop** with a sequence of labels will drop values from the row labels (axis 0):

In [11]:
data.drop(["Colorado", "Utah"])

          one  two  three  four
Ohio        0    1      2     3
New York   12   13     14    15


In [13]:
data.drop("two", axis=1)

          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15


In [36]:
data.drop(["one", "three"], axis="columns")

          two  four
Ohio        1     3
Colorado    5     7
Utah        9    11
New York   13    15


Many functions, like **drop**, which modify the size or shape of a Series or DataFrame, can manipulate an object *in-place* without returning a new object:

In [39]:
obj.drop("c", inplace=True)

In [40]:
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

Be careful with **inplace**, as it destroys any data that is dropped.
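The contrast between the two calling styles can be sketched as follows, with a small hypothetical Series `s`. The default call returns a new object and keeps the original, while `inplace=True` modifies the original and returns `None`.

```python
import numpy as np
import pandas as pd

# Hypothetical Series for illustration only.
s = pd.Series(np.arange(3.0), index=["a", "b", "c"])

trimmed = s.drop("b")                  # returns a new Series; s is unchanged
labels_after_copy = list(s.index)

result = s.drop("c", inplace=True)     # modifies s itself and returns None
labels_after_inplace = list(s.index)
```

Because the in-place call returns `None`, writing `s = s.drop("c", inplace=True)` would replace `s` with `None`.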

## Indexing, Selection, and Filtering

Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series's index values instead of only integers. Here are some examples of this:

In [41]:
obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

In [42]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [43]:
obj["b"]

1.0

In [45]:
obj[1]

1.0

In [46]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [48]:
obj[[1, 3]]

b    1.0
d    3.0
dtype: float64

In [49]:
obj[obj < 3]

a    0.0
b    1.0
c    2.0
dtype: float64

Slicing with labels behaves differently from normal Python slicing and ndarray slicing in that the endpoint is inclusive:

In [51]:
array = np.arange(6.0)

In [54]:
array[3:4]

array([3.])

In [57]:
obj["a":"c"]

a    0.0
b    1.0
c    2.0
dtype: float64
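The same contrast can be written with the explicit **loc** and **iloc** operators covered later in this chapter. This is a sketch with a fresh Series `s`, so it does not depend on the state of `obj`; it also shows that assigning through a label slice covers both endpoints.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

by_label = s.loc["b":"c"]      # label slice: both endpoints included
by_position = s.iloc[1:2]      # integer slice: stop position excluded

# Assigning through a label slice updates every label in the closed range.
s.loc["b":"c"] = 5.0
```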

Indexing into a DataFrame is for retrieving one or more columns either with a single value or sequence:

In [60]:
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

In [61]:
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [62]:
data["two"]

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [66]:
data[:2]

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7


In [75]:
data[1:2]

          one  two  three  four
Colorado    4    5      6     7


In [83]:
data[["one", "two"]]

          one  two
Ohio        0    1
Colorado    4    5
Utah        8    9
New York   12   13


The row selection syntax data[:2] is provided as a convenience. Passing a single element or a list to the [] operator selects columns.

Another use case is indexing with a boolean DataFrame, such as one produced by a scalar comparison:

In [84]:
data < 5

            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False


In [86]:
data[data < 5] = 0

In [87]:
data

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


This makes DataFrame syntactically more like a two-dimensional NumPy array in this particular case.

## Selection with loc and iloc

For DataFrame label-indexing on the rows, I introduce the special indexing operators **loc** and **iloc**. They enable you to select a subset of the rows and columns from a DataFrame with NumPy-like notation using either labels (**loc**) or integers (**iloc**).

Let's select a single row and multiple columns by labels:

In [89]:
data.loc["Colorado", ["two", "three"]]

two      5
three    6
Name: Colorado, dtype: int64

We'll then perform some similar selections with integers using **iloc**:

In [90]:
data.iloc[2, [3, 0, 1]]

four    11
one      8
two      9
Name: Utah, dtype: int64

In [91]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [92]:
data.iloc[[1, 2], [3, 0, 2]]

          four  one  three
Colorado     7    0      6
Utah        11    8     10


In [94]:
data.loc[:"Utah", "two"]

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

In [97]:
data.iloc[:, :3][data.three > 5]

          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14
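The chained expression in the last cell first selects columns by position and then filters rows with a boolean Series. As a sketch, the same selection can usually be written as a single **loc** call that takes the row mask and the column labels together. The frame is rebuilt here with its original values (before the zero assignment) so the block is self-contained.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# Two-step form: column slice by position, then boolean row filter.
chained = data.iloc[:, :3][data.three > 5]

# One-step form: boolean row mask and column labels in a single loc call.
single = data.loc[data["three"] > 5, ["one", "two", "three"]]
```

The single-call form is generally easier to reason about when the selection is later used for assignment, since it avoids operating on an intermediate object.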


## Integer Indexes