In [None]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

## 5.1 Introduction to pandas Data Structures


### Series
A Series is a one-dimensional array-like object containing a sequence of values (of
similar types to NumPy types) of the same type and an associated array of data labels,
called its index.

In [None]:
obj = pd.Series([4, 7, -5, 3])
obj

Unnamed: 0,0
0,4
1,7
2,-5
3,3


In [None]:
obj.array

<NumpyExtensionArray>
[np.int64(4), np.int64(7), np.int64(-5), np.int64(3)]
Length: 4, dtype: int64

In [None]:
obj.index

RangeIndex(start=0, stop=4, step=1)

The result of the `.array` attribute is a `PandasArray` (renamed `NumpyExtensionArray` in pandas 2.1, as in the output above), which usually wraps a NumPy
array but can also contain special extension array types.
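
As a hedged illustration of the difference between `.array` and `to_numpy()`, the sketch below (the names `wrapped` and `plain` are illustrative, not from the original notebook) shows that `to_numpy()` returns a plain NumPy `ndarray`, while `.array` returns a pandas wrapper whose class name depends on the installed pandas version.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

wrapped = obj.array        # pandas extension-array wrapper around the data
plain = obj.to_numpy()     # plain NumPy ndarray

print(type(wrapped).__name__)  # class name varies by pandas version
print(type(plain).__name__)
print(plain.dtype)
```

Use `to_numpy()` when a library expects a real NumPy array, and `.array` when you want to stay within pandas's own array types.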

In [None]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
obj2

Unnamed: 0,0
d,4
b,7
a,-5
c,3


In [None]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [None]:
print(obj2["a"])
obj2["d"] = 6
obj2[["c", "a", "d"]]

-5


Unnamed: 0,0
c,3
a,-5
d,6


Here ["c", "a", "d"] is interpreted as a list of indices, even though it contains
strings instead of integers.

In [None]:
print(obj2[obj2 > 0])
print(obj2 * 2)

d    6
b    7
c    3
dtype: int64
d    12
b    14
a   -10
c     6
dtype: int64


In [None]:
np.exp(obj2)

Unnamed: 0,0
d,403.428793
b,1096.633158
a,0.006738
c,20.085537


In [None]:
print("b" in obj2)
print("e" in obj2)

True
False


Should you have data contained in a Python dictionary, you can create a Series from
it by passing the dictionary:

In [None]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj3

Unnamed: 0,0
Ohio,35000
Texas,71000
Oregon,16000
Utah,5000


A Series can be converted back to a dictionary with its to_dict method:

In [None]:
obj3.to_dict()

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

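When you pass only a dictionary, the index of the resulting Series follows the dictionary's key insertion order. You can override this by passing an index with the dictionary
keys in the order you want them to appear in the resulting Series: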

In [None]:
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
obj4

Unnamed: 0,0
California,
Ohio,35000.0
Oregon,16000.0
Texas,71000.0


Here, three values found in sdata were placed in the appropriate locations, but since
no value for `"California"` was found, it appears as NaN (Not a Number), which is
considered in pandas to mark missing or NA values. Since `"Utah"` was not included
in states, it is excluded from the resulting object.

The `isna` and `notna` functions in pandas should be used to detect missing data:

In [None]:
pd.isna(obj4)

Unnamed: 0,0
California,True
Ohio,False
Oregon,False
Texas,False


In [None]:
pd.notna(obj4)

Unnamed: 0,0
California,False
Ohio,True
Oregon,True
Texas,True


In [None]:
obj4.isna()

Unnamed: 0,0
California,True
Ohio,False
Oregon,False
Texas,False


A useful Series feature for many applications is that it automatically aligns by index
label in arithmetic operations:

In [None]:
print(obj3)
print(obj4)
obj3 + obj4

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64


Unnamed: 0,0
California,
Ohio,70000.0
Oregon,32000.0
Texas,142000.0
Utah,


Both the Series object itself and its index have a `name` attribute, which integrates with
other areas of pandas functionality:

In [None]:
obj4.name = "population"
obj4.index.name = "state"
obj4

Unnamed: 0_level_0,population
state,Unnamed: 1_level_1
California,
Ohio,35000.0
Oregon,16000.0
Texas,71000.0


A Series’s index can be altered in place by assignment:

In [None]:
print(obj)
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]
print(obj)

0    4
1    7
2   -5
3    3
dtype: int64
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64


### DataFrame
A DataFrame represents a rectangular table of data and contains an ordered, named
collection of columns, each of which can be a different value type (numeric, string,
Boolean, etc.). The DataFrame has both a row and column index; it can be thought of
as a dictionary of Series all sharing the same index.

In [None]:
data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

In [None]:
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


For large DataFrames, the `head` method selects only the first five rows:

In [None]:
frame.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


Similarly, `tail` returns the last five rows:

In [None]:
frame.tail()

Unnamed: 0,state,year,pop
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [None]:
pd.DataFrame(data, columns=["year", "state", "pop"])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


If you pass a column that isn’t contained in the dictionary, it will appear with missing
values in the result:

In [None]:
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,
5,2003,Nevada,3.2,


In [None]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column in a DataFrame can be retrieved as a Series either by dictionary-like
notation or by using the dot attribute notation:

In [None]:
frame2["state"]

Unnamed: 0,state
0,Ohio
1,Ohio
2,Ohio
3,Nevada
4,Nevada
5,Nevada


In [None]:
frame2.year

Unnamed: 0,year
0,2000
1,2001
2,2002
3,2001
4,2002
5,2003


In [None]:
print(frame2.loc[1])
print(frame2.iloc[2])

frame2["debt"] = 16.5
print(frame2)

frame2["debt"] = np.arange(6.)
print(frame2)


year     2001
state    Ohio
pop       1.7
debt      1.0
Name: 1, dtype: object
year     2002
state    Ohio
pop       3.6
debt      2.0
Name: 2, dtype: object
   year   state  pop  debt
0  2000    Ohio  1.5  16.5
1  2001    Ohio  1.7  16.5
2  2002    Ohio  3.6  16.5
3  2001  Nevada  2.4  16.5
4  2002  Nevada  2.9  16.5
5  2003  Nevada  3.2  16.5
   year   state  pop  debt
0  2000    Ohio  1.5   0.0
1  2001    Ohio  1.7   1.0
2  2002    Ohio  3.6   2.0
3  2001  Nevada  2.4   3.0
4  2002  Nevada  2.9   4.0
5  2003  Nevada  3.2   5.0


When you are assigning lists or arrays to a column, the value’s length must match the
length of the DataFrame. If you assign a Series, its labels will be realigned exactly to
the DataFrame’s index, inserting missing values in any index values not present:

In [None]:
val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])
frame2["debt"] = val
frame2

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,
5,2003,Nevada,3.2,
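
The example above produces only missing values because none of the labels in `val` occur in the integer index of `frame2`. As a small sketch with hypothetical data (`small` and `partial` are illustrative names, not part of the original notebook), the following shows the same alignment when some labels do match:

```python
import pandas as pd

small = pd.DataFrame({"pop": [1.5, 1.7, 3.6]}, index=["one", "two", "three"])

# Only "two" matches a row label; "four" has no matching row and is dropped.
partial = pd.Series([-1.2, -1.5], index=["two", "four"])
small["debt"] = partial

print(small)
```

Rows without a matching label receive a missing value, and labels in the Series that are absent from the DataFrame are silently ignored.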


In [None]:
frame2["eastern"] = frame2["state"] == "Ohio"
frame2

Unnamed: 0,year,state,pop,debt,eastern
0,2000,Ohio,1.5,,True
1,2001,Ohio,1.7,,True
2,2002,Ohio,3.6,,True
3,2001,Nevada,2.4,,False
4,2002,Nevada,2.9,,False
5,2003,Nevada,3.2,,False


The `del` keyword can then be used to remove this column:

In [None]:
del frame2["eastern"]
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')
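
`del` modifies the DataFrame in place. As a hedged aside, the `drop` method (covered later in this chapter) returns a new object and leaves the original untouched. The sketch below, using an illustrative frame named `demo`, contrasts the two:

```python
import pandas as pd

demo = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]})
demo["eastern"] = demo["state"] == "Ohio"

# drop returns a new DataFrame; demo still has the "eastern" column.
trimmed = demo.drop(columns=["eastern"])
print(list(trimmed.columns), list(demo.columns))

# del removes the column from demo itself.
del demo["eastern"]
print(list(demo.columns))
```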

In [None]:
populations = {"Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
                "Nevada": {2001: 2.4, 2002: 2.9}}

In [None]:
frame3 = pd.DataFrame(populations)
frame3

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


You can transpose the DataFrame (swap rows and columns) with similar syntax to a
NumPy array:

In [None]:
frame3.T

Unnamed: 0,2000,2001,2002
Ohio,1.5,1.7,3.6
Nevada,,2.4,2.9


The keys in the inner dictionaries are combined to form the index in the result. This
isn’t true if an explicit index is specified:

In [None]:
pd.DataFrame(populations, index=[2001, 2002, 2003])

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9
2003,,


Dictionaries of Series are treated in much the same way:

In [None]:
pdata = {"Ohio": frame3["Ohio"][:-1],
          "Nevada": frame3["Nevada"][:2]}
pd.DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4


If a DataFrame’s **index** and **columns** have their `name` attributes set, these will also be
displayed:

In [None]:
frame3.index.name = "year"
frame3.columns.name = "state"
frame3

state,Ohio,Nevada
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


Unlike Series, DataFrame does not have a `name` attribute. DataFrame’s `to_numpy`
method returns the data contained in the DataFrame as a two-dimensional ndarray:

In [None]:
frame3.to_numpy()

array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])

In [None]:
frame2.to_numpy()

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, nan],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, nan],
       [2002, 'Nevada', 2.9, nan],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

### Index Objects
pandas’s Index objects are responsible for holding the axis labels and other metadata. Any array
or other sequence of labels you use when constructing a Series or DataFrame is
internally converted to an Index:

In [None]:
obj = pd.Series(np.arange(3), index=["a", "b", "c"])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [None]:
index[1:]

Index(['b', 'c'], dtype='object')

Index objects are immutable and thus can’t be modified by the user:

In [None]:
index[1] = "d"

TypeError: Index does not support mutable operations
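
Because an Index cannot be changed element by element, one common approach is to produce a new object with new labels. A minimal sketch, assuming a Series like `obj` above (the name `relabeled` is illustrative):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(3), index=["a", "b", "c"])

try:
    obj.index[1] = "d"          # item assignment on an Index is not allowed
except TypeError as exc:
    print("TypeError:", exc)

# rename returns a new Series with a new Index; obj itself is unchanged.
relabeled = obj.rename(index={"b": "d"})
print(list(relabeled.index), list(obj.index))
```

Immutability is what makes it safe to share one Index object among several data structures.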

In [None]:
labels = pd.Index(np.arange(3))
print(labels)

obj2 = pd.Series([1.5, -2.5, 0], index=labels)
print(obj2)

Index([0, 1, 2], dtype='int64')
0    1.5
1   -2.5
2    0.0
dtype: float64


In [None]:
obj2.index is labels

True

In addition to being array-like, an Index also behaves like a fixed-size set:

In [None]:
print(frame3)
print(frame3.columns)

print("Ohio" in frame3.columns)

print(2003 in frame3.index)

state  Ohio  Nevada
year               
2000    1.5     NaN
2001    1.7     2.4
2002    3.6     2.9
Index(['Ohio', 'Nevada'], dtype='object', name='state')
True
False


Unlike Python sets, a pandas Index can contain duplicate labels:

In [None]:
pd.Index(["foo", "foo", "bar", "bar"])

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')
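
The set-like behavior mentioned above extends to set operations. The following sketch uses two illustrative Index objects (`left` and `right` are assumed names) to show a few of the methods an Index provides:

```python
import pandas as pd

left = pd.Index(["Ohio", "Nevada", "Utah"])
right = pd.Index(["Utah", "Texas"])

print(left.union(right))             # labels found in either Index
print(left.intersection(right))      # labels found in both
print(left.difference(right))        # labels in left but not in right
print(left.isin(["Ohio", "Texas"]))  # Boolean array, one entry per label
```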

## 5.2 Essential Functionality


### Reindexing

An important method on pandas objects is **reindex**, which means to create a new
object with the values rearranged to align with the new index.

In [None]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
obj

Unnamed: 0,0
d,4.5
b,7.2
a,-5.3
c,3.6


Calling `reindex` on this Series rearranges the data according to the new index,
introducing missing values if any index values were not already present:

In [None]:
obj2 = obj.reindex(["a", "b", "c", "d", "e"])
obj2

Unnamed: 0,0
a,-5.3
b,7.2
c,3.6
d,4.5
e,


In [None]:
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
print(obj3)

obj3.reindex(np.arange(6), method="ffill")

0      blue
2    purple
4    yellow
dtype: object


Unnamed: 0,0
0,blue
1,blue
2,purple
3,purple
4,yellow
5,yellow


With DataFrame, `reindex` can alter the (row) index, columns, or both. When passed
only a sequence, it reindexes the rows in the result:

In [None]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                        index=["a", "c", "d"],
                        columns=["Ohio", "Texas", "California"])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [None]:
frame2 = frame.reindex(index=["a", "b", "c", "d"])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


The columns can be reindexed with the `columns` keyword:

In [None]:
states = ["Texas", "Utah", "California"]
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


Because `"Ohio"` was not in `states`, the data for that column is dropped from the
result.

Another way to reindex a particular axis is to pass the new axis labels as a positional
argument and then specify the axis to reindex with the `axis` keyword:

In [None]:
frame.reindex(states, axis="columns")

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


In [None]:
frame.loc[["a", "d", "c"], ["California", "Texas"]]

Unnamed: 0,California,Texas
a,2,1
d,8,7
c,5,4


### Dropping Entries from an Axis

In [None]:
obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])
print(obj)

new_obj = obj.drop("c")
print(new_obj)

obj.drop(["d", "c"])

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64


Unnamed: 0,0
a,0.0
b,1.0
e,4.0


With DataFrame, index values can be deleted from either axis. To illustrate this, we
first create an example DataFrame:

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
              index=["Ohio", "Colorado", "Utah", "New York"],
              columns=["one", "two", "three", "four"])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


Calling drop with a sequence of labels will drop values from the row labels (axis 0):

In [None]:
data.drop(index=["Colorado", "Ohio"])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
data.drop(columns=["two"])

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


You can also drop values from the columns by passing axis=1 (which is like NumPy)
or `axis="columns"`:

In [None]:
data.drop("two", axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [None]:
data.drop(["two", "four"], axis="columns")

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


### Indexing, Selection, and Filtering
Series indexing (obj[...]) works analogously to NumPy array indexing, except you
can use the Series’s index values instead of only integers.

In [None]:
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
obj

Unnamed: 0,0
a,0.0
b,1.0
c,2.0
d,3.0


In [None]:
print(obj["b"])
print("--------------")
print(obj[1])
print("--------------")

print(obj[2:4])
print("--------------")

print(obj[["b", "a", "d"]])
print("--------------")

print(obj[[1, 3]])
print("--------------")

print(obj[obj < 2])

1.0
--------------
1.0
--------------
c    2.0
d    3.0
dtype: float64
--------------
b    1.0
a    0.0
d    3.0
dtype: float64
--------------
b    1.0
d    3.0
dtype: float64
--------------
a    0.0
b    1.0
dtype: float64


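In recent pandas versions, `obj[1]` and `obj[[1, 3]]` emit a `FutureWarning`: treating integer keys as positions when the index is not integer-based is deprecated. Use `obj.iloc[1]` and `obj.iloc[[1, 3]]` to select by position.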


While you can select data by label this way, the preferred way to select index values is
with the special loc operator:

In [None]:
obj.loc[["b", "a", "d"]]

Unnamed: 0,0
b,1.0
a,0.0
d,3.0


In [None]:
obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])
obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])
obj1

Unnamed: 0,0
2,1
0,2
1,3


In [None]:
print(obj2)

print(obj1[[0, 1, 2]])

print(obj2[[0, 1, 2]])

a    1
b    2
c    3
dtype: int64
0    2
1    3
2    1
dtype: int64
a    1
b    2
c    3
dtype: int64


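With `[]`, integers are treated as labels when the index contains integers (as with `obj1`) and as positions when it does not (as with `obj2`). The positional fallback is deprecated in recent pandas versions, which is why `obj2[[0, 1, 2]]` emits a `FutureWarning`.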


When using `loc`, an expression such as `obj2.loc[[0, 1]]` will fail when the index does
not contain integers:

In [None]:
obj2.loc[[0, 1]]

KeyError: "None of [Index([0, 1], dtype='int64')] are in the [index]"

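Since the `loc` operator indexes exclusively with labels, there is also an `iloc` operator
that indexes exclusively with integers, so that it works consistently whether or not the index
contains integers. You can also slice with labels, but unlike normal Python slicing, the endpoint is inclusive: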

In [None]:
print(obj1.iloc[[0, 1, 2]])
print(obj2.iloc[[0, 1, 2]])
print(obj2.loc["b":"c"])

2    1
0    2
1    3
dtype: int64
a    1
b    2
c    3
dtype: int64
b    2
c    3
dtype: int64
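
To make the difference between label slices and positional slices concrete, here is a short sketch (the names `by_label` and `by_position` are illustrative). A label slice with `loc` includes its endpoint, while a positional slice with `iloc` follows the usual Python convention and excludes it:

```python
import pandas as pd

obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])

by_label = obj2.loc["b":"c"]   # label slice: includes the endpoint "c"
by_position = obj2.iloc[1:2]   # positional slice: excludes position 2

print(by_label)
print(by_position)
```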


Assigning values using these methods modifies the corresponding section of the
Series:

In [None]:
obj2.loc["b":"c"] = 5
obj2

Unnamed: 0,0
a,1
b,5
c,5


Indexing into a DataFrame retrieves one or more columns either with a single value
or sequence:

In [None]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
print(data["two"])
print(data[["three", "one"]])

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64
          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12


In [None]:
print(data[:2])
print(data[data["three"] > 5])


          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


The row selection syntax `data[:2]` is provided as a convenience. Passing a single
element or a list to the `[]` operator selects columns.

In [None]:
print(data<5)

            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False


We can use this DataFrame to assign the value 0 to each location with the value True,
like so:

In [None]:
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15
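
The assignment above changes `data` in place. If you would rather keep the original values, one option is the `where` method, which returns a new DataFrame. This is a sketch of an alternative, not the approach used in the cell above, and `zeroed` is an illustrative name:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# where keeps values where the condition is True and uses 0 elsewhere.
zeroed = data.where(data >= 5, 0)

print(zeroed)
print(data.loc["Ohio"].tolist())  # the original frame is unchanged
```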


#### Selection on DataFrame with loc and iloc
Like Series, DataFrame has special attributes loc and iloc for label-based and
integer-based indexing, respectively. Since DataFrame is two-dimensional, you can
select a subset of the rows and columns with NumPy-like notation using either axis
labels (loc) or integers (iloc).
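
As a quick, self-contained sketch of the two attributes side by side (it rebuilds a fresh copy of the example frame, so the values differ from the modified `data` used in the cells that follow), the same cell can be reached by label or by position. It also shows one way to apply a Boolean mask with `iloc`: `iloc` rejects a Boolean Series, but in the pandas versions I am aware of it accepts a plain Boolean NumPy array.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# The same cell selected by label and by integer position.
by_label = data.loc["Utah", "two"]
by_position = data.iloc[2, 1]
print(by_label, by_position)

# A Boolean Series works with loc; for iloc, pass a plain Boolean array.
mask = data["three"] > 5
print(data.loc[mask])
print(data.iloc[mask.to_numpy()])
```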

In [None]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
print(data.loc["Colorado"])
print(data.loc[["Colorado", "New York"]])
print("You can combine both row and column selection in loc by separating the selections with a comma:")
print(data.loc["Colorado", ["two", "three"]])
print("We’ll then perform some similar selections with integers using iloc:")
print(data.iloc[2])
print(data.iloc[[2, 1]])
print(data.iloc[2, [3, 0, 1]])
print(data.iloc[[1, 2], [3, 0, 1]])
print("Both indexing functions work with slices in addition to single labels or lists of labels:")
print(data.loc[: "Utah", "two"])
print(data.iloc[:, :3][data.three > 5])
print("Boolean arrays can be used with loc but not iloc:")
print(data.loc[data.three >= 2])

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64
          one  two  three  four
Colorado    0    5      6     7
New York   12   13     14    15
You can combine both row and column selection in loc by separating the selections with a comma:
two      5
three    6
Name: Colorado, dtype: int64
We’ll then perform some similar selections with integers using iloc:
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64
          one  two  three  four
Utah        8    9     10    11
Colorado    0    5      6     7
four    11
one      8
two      9
Name: Utah, dtype: int64
          four  one  two
Colorado     7    0    5
Utah        11    8    9
Both indexing functions work with slices in addition to single labels or lists of labels:
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64
          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14
Boolean arrays can be used with loc but not iloc:
          one  two  three  four
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

#### Integer indexing pitfalls
Working with pandas objects indexed by integers can be a stumbling block for new
users since they work differently from built-in Python data structures like lists and
tuples.

In [None]:
ser = pd.Series(np.arange(3.))
ser

Unnamed: 0,0
0,0.0
1,1.0
2,2.0


In [None]:
ser[-1]

KeyError: -1

In this case, pandas could “fall back” on integer indexing, but it is difficult to do
this in general without introducing subtle bugs into the user code. Here we have an
index containing 0, 1, and 2, but pandas does not want to guess what the user wants
(label-based indexing or position-based):

In [None]:
ser

Unnamed: 0,0
0,0.0
1,1.0
2,2.0


On the other hand, with a noninteger index, there is no such ambiguity:

In [None]:
ser2 = pd.Series(np.arange(3.), index=["a", "b", "c"])
print(ser2[-1])
print(ser.iloc[-1])
print(ser[:2])

2.0
2.0
0    0.0
1    1.0
dtype: float64


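In recent pandas versions, `ser2[-1]` emits a `FutureWarning` because positional access through `[]` is deprecated; `ser2.iloc[-1]` is the unambiguous spelling.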


#### Pitfalls with chained indexing
In the previous section we looked at how you can do flexible selections on a
DataFrame using `loc` and `iloc`. These indexing attributes can also be used to modify
DataFrame objects in place, but doing so requires some care.

In [None]:
data.loc[:, "one"] = 1
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,1,5,6,7
Utah,1,9,10,11
New York,1,13,14,15


In [None]:
data.iloc[2] = 5

data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,1,5,6,7
Utah,5,5,5,5
New York,1,13,14,15


In [None]:
data.loc[data["four"] > 5] = 3

data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,3,3,3,3
Utah,5,5,5,5
New York,3,3,3,3


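A common gotcha for new pandas users is to chain selections when assigning, as in
`data.loc[data.three == 5]["three"] = 6`. Depending on the data contents, this may print a
special `SettingWithCopyWarning`, which warns you that you are trying to modify a temporary
value (the nonempty result of `data.loc[data.three == 5]`) instead of the original DataFrame
`data`, which might be what you were intending. The fix is to rewrite the chained assignment
as a single `loc` operation: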

In [None]:
data.loc[data.three == 5, "three"] = 6
data

Unnamed: 0,one,two,three,four
Ohio,1,0,0,0
Colorado,3,3,3,3
Utah,5,5,6,5
New York,3,3,3,3
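
As a sketch of two patterns that avoid ambiguity when assigning to a selection (it rebuilds a fresh example frame, and `subset` is an illustrative name): assign through a single `loc` call when you want to change the original, and call `copy()` explicitly when you want an independent subset.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# One loc call with both the row mask and the column label
# assigns into the original DataFrame.
data.loc[data["three"] == 6, "three"] = 60

# If an independent subset is what you want, copy it explicitly;
# edits to the copy never reach the original.
subset = data.loc[data["four"] > 5].copy()
subset["four"] = 0

print(data)
print(subset)
```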


### Arithmetic and Data Alignment
pandas can make it much simpler to work with objects that have different indexes.
For example, when you add objects, if any index pairs are not the same, the respective
index in the result will be the union of the index pairs.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])
print(s1)
print(s2)

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64


In [None]:
# Adding
s1 + s2

Unnamed: 0,0
a,5.2
c,1.1
d,
e,0.0
f,
g,


In [None]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list("bcd"),
                            index=["Ohio", "Texas", "Colorado"])

df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                index=["Utah", "Ohio", "Texas", "Oregon"])
print(df1)
print(df2)

            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


Adding these returns a DataFrame with index and columns that are the unions of the
ones in each DataFrame:

In [None]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


If you add DataFrame objects with no column or row labels in common, the result
will contain all nulls:

In [None]:
df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"B": [3, 4]})
print(df1)
print(df2)

   A
0  1
1  2
   B
0  3
1  4


In [None]:
df1 + df2

Unnamed: 0,A,B
0,,
1,,


#### Arithmetic methods with fill values
In arithmetic operations between differently indexed objects, you might want to fill
with a special value, like 0, when an axis label is found in one object but not the other.

In [None]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list("abcde"))
df2.loc[1, "b"] = np.nan

print(df1)

print(df2)

     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   NaN   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0


Adding these results in missing values in the locations that don’t overlap:

In [None]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


Using the `add` method on `df1`, I pass `df2` and an argument to `fill_value`, which substitutes the passed value for any missing values in the operation:

In [None]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0
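
One detail worth checking with a small example: `fill_value` only substitutes for a value that is missing on one side. When a value is missing in both operands, the result is still missing. The Series below are hypothetical and exist only to illustrate this:

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, np.nan, 40.0], index=["a", "b", "d"])

# "c" and "d" each exist in only one Series, so the other side is filled with 0.
# "b" is NaN in both operands, so the result stays NaN.
total = s1.add(s2, fill_value=0)
print(total)
```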


In [None]:
1 / df1

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [None]:
df1.rdiv(1)

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909
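
The two cells above give the same result because each arithmetic method has a counterpart, prefixed with `r`, that reverses the operands. A brief sketch with an illustrative frame `df` (chosen without zeros so that no infinities appear):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1.0, 7.0).reshape((2, 3)), columns=list("abc"))

# Each r-prefixed method swaps the operands of its counterpart.
print(df.rdiv(1).equals(1 / df))       # 1 / df
print(df.rsub(10).equals(10 - df))     # 10 - df
print(df.radd(10).equals(df.add(10)))  # addition is commutative
```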


In [None]:
df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,0
1,4.0,5.0,6.0,7.0,0
2,8.0,9.0,10.0,11.0,0


#### Operations between DataFrame and Series
As with NumPy arrays of different dimensions, arithmetic between DataFrame and
Series is also defined.

In [None]:
arr = np.arange(12.).reshape((3, 4))
print(arr)
print("-----------------")
print(arr[0])
print("-----------------")
print(arr - arr[0])

[[ 0.  1.  2.  3.]
 [ 4.  5.  6.  7.]
 [ 8.  9. 10. 11.]]
-----------------
[0. 1. 2. 3.]
-----------------
[[0. 0. 0. 0.]
 [4. 4. 4. 4.]
 [8. 8. 8. 8.]]


Operations between a DataFrame and a Series
are similar:

In [None]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
              index=["Utah", "Ohio", "Texas", "Oregon"])
series = frame.iloc[0]
print(frame)
print("-----------------")

print(series)

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
-----------------
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64


By default, arithmetic between DataFrame and Series matches the index of the Series
on the columns of the DataFrame, broadcasting down the rows:

In [None]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


If an index value is *not found* in either the DataFrame’s columns or the Series’s index,
the objects will be *reindexed* to form the union:

In [None]:
series2 = pd.Series(np.arange(3), index=["b", "e", "f"])
series2

Unnamed: 0,0
b,0
e,1
f,2


In [None]:
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


If you want to instead broadcast over the columns, matching on the rows, you have to
use one of the arithmetic methods and specify to match over the index.

In [None]:
series3 = frame["d"]
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [None]:
series3

Unnamed: 0,d
Utah,1.0
Ohio,4.0
Texas,7.0
Oregon,10.0


In [None]:
frame.sub(series3, axis="index")

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0
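
A typical use of matching on the row index is centering each row around its own mean. The following is a minimal sketch under that assumption (`row_means` and `centered` are illustrative names):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

row_means = frame.mean(axis="columns")   # one mean per row label

# axis="index" matches row_means on the row labels and broadcasts across columns.
centered = frame.sub(row_means, axis="index")
print(centered)
```

Writing `frame - row_means` instead would try to match the row labels against the column labels and produce only missing values.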


### Function Application and Mapping
NumPy ufuncs (element-wise array methods) also work with pandas objects:


In [None]:
frame = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list("bde"),
                    index=["Utah", "Ohio", "Texas", "Oregon"])
print(frame)


               b         d         e
Utah   -0.479911  0.783904 -1.182917
Ohio   -1.278368 -0.193565  1.133823
Texas   0.225550 -0.236620  0.424130
Oregon -0.280566  0.572051  2.700719


In [None]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.479911,0.783904,1.182917
Ohio,1.278368,0.193565,1.133823
Texas,0.22555,0.23662,0.42413
Oregon,0.280566,0.572051,2.700719


In [None]:
def f1(x):
  return x.max() - x.min()

In [None]:
frame.apply(f1)

Unnamed: 0,0
b,1.503918
d,1.020525
e,3.883635


If you pass `axis="columns"` to apply, the function will be invoked once per row
instead. A helpful way to think about this is as “apply across the columns”:

In [None]:
frame.apply(f1, axis="columns")

Unnamed: 0,0
Utah,1.966821
Ohio,2.412191
Texas,0.66075
Oregon,2.981285
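
Many common statistics are already DataFrame methods, so `apply` is often unnecessary. As a hedged sketch with deterministic data (the function name `value_range` is illustrative), the range computed with `apply` matches the one built from `max` and `min` directly:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

def value_range(x):
    # x is one column (or one row) passed in as a Series
    return x.max() - x.min()

via_apply = frame.apply(value_range)
via_methods = frame.max() - frame.min()   # same result, no apply needed

print(via_apply)
print(via_methods.equals(via_apply))
```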


The function passed to apply *need not return a scalar value*; it can also **return a Series**
with multiple values:

In [None]:
def f2(x):
  return pd.Series([x.min(), x.max()], index=["min", "max"])

In [None]:
print(frame.apply(f2))


            b         d         e
min -1.278368 -0.236620 -1.182917
max  0.225550  0.783904  2.700719


Element-wise Python functions can be used, too. Suppose you wanted to compute
a formatted string from each floating-point value in `frame`.

In [None]:
def my_format(x):
  return f"{x:.2f}"

In [None]:
frame.applymap(my_format)

In pandas 2.1 and later, `applymap` is deprecated in favor of `DataFrame.map`, so this call emits a `FutureWarning`.


Unnamed: 0,b,d,e
Utah,-0.48,0.78,-1.18
Ohio,-1.28,-0.19,1.13
Texas,0.23,-0.24,0.42
Oregon,-0.28,0.57,2.7


The reason for the name `applymap` is that Series has a `map` method for applying an
element-wise function:

In [None]:
frame["e"].map(my_format)

Unnamed: 0,e
Utah,-1.18
Ohio,1.13
Texas,0.42
Oregon,2.7
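
If the same notebook has to run on several pandas versions, one possible approach is to check for `DataFrame.map` (available in pandas 2.1 and later) and fall back to `applymap` otherwise. This is a sketch with hypothetical data, not a recommendation from the original text:

```python
import pandas as pd

frame = pd.DataFrame({"b": [1.5, -0.25], "d": [0.125, 2.0]},
                     index=["Utah", "Ohio"])

def my_format(x):
    return f"{x:.2f}"

# DataFrame.map exists in pandas 2.1 and later; fall back to applymap otherwise.
if hasattr(pd.DataFrame, "map"):
    formatted = frame.map(my_format)
else:
    formatted = frame.applymap(my_format)

print(formatted)
```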


### Sorting and Ranking
Sorting a dataset by some criterion is another important built-in operation. To sort
lexicographically by row or column label, use the `sort_index` method, which returns a new, sorted object:

In [None]:
obj = pd.Series(np.arange(4), index=["d", "a", "b", "c"])
print(obj)
print("--------------")
print(obj.sort_index())

d    0
a    1
b    2
c    3
dtype: int64
--------------
a    1
b    2
c    3
d    0
dtype: int64


With a DataFrame, you can sort by index on either axis:

In [None]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=["three", "one"],
                     columns=["d", "a", "b", "c"])
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [None]:
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [None]:
frame.sort_index(axis="columns")

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


The data is sorted in ascending order by default but can be sorted in descending
order, too:

In [None]:
frame.sort_index(axis="columns", ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


In [None]:
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()

Unnamed: 0,0
2,-3
3,2
0,4
1,7


Any missing values are sorted to the end of the Series by default:

In [None]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

Unnamed: 0,0
4,-3.0
5,2.0
0,4.0
2,7.0
1,
3,


Missing values can be sorted to the start instead by using the na_position option:

In [None]:
obj.sort_values(na_position="first")

Unnamed: 0,0
1,
3,
4,-3.0
5,2.0
0,4.0
2,7.0


In [None]:
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})
print(frame)
print("-----------------")
print(frame.sort_values("b"))
print("-----------------")

   b  a
0  4  0
1  7  1
2 -3  0
3  2  1
-----------------
   b  a
2 -3  0
3  2  1
0  4  0
1  7  1
-----------------


To sort by multiple columns, pass a list of names:

In [None]:
frame.sort_values(["a", "b"])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1
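
When sorting by several columns, `ascending` can also be a list with one entry per column. A short sketch using the same data as above (`mixed` is an illustrative name):

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort by "a" ascending, then by "b" descending within each "a" group.
mixed = frame.sort_values(["a", "b"], ascending=[True, False])
print(mixed)
```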


*Ranking* assigns ranks from one through the number of valid data points in an array,
starting from the lowest value. The rank methods for Series and DataFrame are the
place to look; by default, rank breaks ties by assigning each group the mean rank:

In [None]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()

Unnamed: 0,0
0,6.5
1,1.0
2,6.5
3,4.5
4,3.0
5,2.0
6,4.5


In [None]:
obj.rank(method="first")

Unnamed: 0,0
0,6.0
1,1.0
2,7.0
3,4.0
4,3.0
5,2.0
6,5.0


You can rank in descending order, too:

In [None]:
obj.rank(ascending=False)

Unnamed: 0,0
0,1.5
1,7.0
2,1.5
3,3.5
4,5.0
5,6.0
6,3.5
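
Besides the default and `"first"`, `rank` accepts other tie-breaking methods. The sketch below places several of them side by side for the same Series so the treatment of the tied values 7 and 4 can be compared (the frame name `ranks` is illustrative):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Compare how each tie-breaking method ranks the repeated values 7 and 4.
ranks = pd.DataFrame({
    "value": obj,
    "average": obj.rank(),              # default: mean rank of the tied group
    "min": obj.rank(method="min"),      # lowest rank in the tied group
    "max": obj.rank(method="max"),      # highest rank in the tied group
    "dense": obj.rank(method="dense"),  # like "min", but ranks rise by 1 between groups
})
print(ranks)
```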


DataFrame can compute ranks over the rows or the columns:

In [None]:
frame = pd.DataFrame({"b": [4.3, 7, -3, 2], "a": [0, 1, 0, 1], "c":[-2, 5, 8, -2.5]})
print(frame)
print("---------------")
print(frame.rank(axis="columns"))

     b  a    c
0  4.3  0 -2.0
1  7.0  1  5.0
2 -3.0  0  8.0
3  2.0  1 -2.5
---------------
     b    a    c
0  3.0  2.0  1.0
1  3.0  1.0  2.0
2  1.0  2.0  3.0
3  3.0  2.0  1.0


### Axis Indexes with Duplicate Labels
Up until now almost all of the examples we have looked at have unique axis labels
(index values). While many pandas functions (like reindex) require that the labels be
unique, it’s not mandatory. Let’s consider a small Series with duplicate indices:

In [None]:
obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])
print(obj)
print("--------------")
# The is_unique property of the index tells you whether its labels are unique:
print(obj.index.is_unique)

a    0
a    1
b    2
b    3
c    4
dtype: int64
--------------
False


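Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while a label with a single entry returns a scalar value:

In [None]: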
print(obj["a"])
print(obj["c"])

a    0
a    1
dtype: int64
4
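
Because the return type depends on whether a label is repeated, it can help to detect or resolve duplicates before selecting. The following is one possible approach, not the only one: it uses `Index.duplicated` to find repeats and `groupby(level=0)` to combine them. Summing is an arbitrary choice made for this illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])

# Boolean mask marking every repeat of a label after its first occurrence
dup_mask = obj.index.duplicated()
print(dup_mask)

# Option 1: keep only the first entry for each label
first_only = obj[~dup_mask]
print(first_only)

# Option 2: combine the entries that share a label (here by summing)
summed = obj.groupby(level=0).sum()
print(summed)
```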


The same logic extends to indexing rows (or columns) in a DataFrame:

In [None]:
df = pd.DataFrame(np.random.standard_normal((5, 3)), index=["a", "a", "b", "b", "c"])
df

Unnamed: 0,0,1,2
a,0.379643,-1.620296,-0.783822
a,-0.706427,-1.163263,-2.224218
b,-0.874051,0.254865,-0.869201
b,-0.950482,0.576852,0.817988
c,0.207899,0.942505,-0.258197


In [None]:
print(df.loc["b"])
print("-------------------")
print(df.loc["c"])

          0         1         2
b -0.874051  0.254865 -0.869201
b -0.950482  0.576852  0.817988
-------------------
0    0.207899
1    0.942505
2   -0.258197
Name: c, dtype: float64


## 5.3 Summarizing and Computing Descriptive Statistics
pandas objects are equipped with a set of common mathematical and statistical
methods. Most of these fall into the category of *reductions* or *summary statistics*:
methods that extract a single value (like the sum or mean) from a Series, or a Series
of values from the rows or columns of a DataFrame. Compared with the similar methods
found on NumPy arrays, they have built-in handling for missing data.

In [None]:
import pandas as pd
import numpy as np


In [None]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                  [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


Calling DataFrame’s `sum` method returns a Series containing column sums:

In [None]:
df.sum()

Unnamed: 0,0
one,9.25
two,-5.8


Passing `axis="columns"` or `axis=1` sums across the columns instead:

In [None]:
df.sum(axis="columns")

Unnamed: 0,0
a,1.4
b,2.6
c,0.0
d,-0.55


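When an entire row or column contains all NA values, the sum is 0, whereas if any
value is not NA, the result is the sum of the non-NA values. This can be disabled with
the `skipna` option, in which case any NA value in a row or column makes the
corresponding result NA: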

In [None]:
print(df.sum(axis="index", skipna=False))
print("-----------------")
print(df.sum(axis="columns", skipna=False))

one   NaN
two   NaN
dtype: float64
-----------------
a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64
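
To make the contrast with NumPy concrete, the sketch below sums the same column both ways and then uses the `min_count` argument of `sum`, which requires a minimum number of non-NA values before a result is reported. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

# NumPy's sum propagates NaN, while pandas skips it by default.
numpy_total = df["one"].to_numpy().sum()
pandas_total = df["one"].sum()
print(numpy_total, pandas_total)

# min_count=1 requires at least one non-NA value per row, so the all-NA
# row "c" yields NaN instead of 0.
row_sums = df.sum(axis="columns", min_count=1)
print(row_sums)
```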


Some aggregations, like `mean`, require at least one non-NA value to yield a valid
result, so the all-NA row `c` produces NaN here:

In [None]:
df.mean(axis="columns")

Unnamed: 0,0
a,1.4
b,1.3
c,
d,-0.275


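Some methods, like `idxmin` and `idxmax`, return indirect statistics, such as the index
label where the minimum or maximum value is attained: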

In [None]:
print("idxmax:")
print(df.idxmax())
print("idxmin:")
print(df.idxmin())

idxmax:
one    b
two    d
dtype: object
idxmin:
one    d
two    b
dtype: object
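
A common use of the returned label is to look up the rest of the row. This is a minimal sketch using the same `df`; `label` and `best_row` are names introduced here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

# Label of the row holding the largest value in column "one"
label = df["one"].idxmax()
print(label)

# Use that label with loc to retrieve the whole row
best_row = df.loc[label]
print(best_row)
```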


Other methods are *accumulations*:

In [None]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8
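
`cumsum` is one of several accumulation methods; `cummax`, `cummin`, and `cumprod` follow the same pattern. The sketch below, on a small made-up Series, shows that a missing value stays missing at its own position but does not interrupt the running result:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, 2.0])

running_sum = s.cumsum()    # running total, skipping the NaN
running_max = s.cummax()    # largest value seen so far
running_prod = s.cumprod()  # running product

print(pd.DataFrame({"value": s, "cumsum": running_sum,
                    "cummax": running_max, "cumprod": running_prod}))
```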


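Some methods are neither reductions nor accumulations. `describe` is one such example, producing multiple summary statistics in one shot:

In [None]: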
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


On nonnumeric data, `describe` produces alternative summary statistics:

In [None]:
obj = pd.Series(["a", "a", "b", "c"] * 4)
obj.describe()

Unnamed: 0,0
count,16
unique,3
top,a
freq,8
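
On a DataFrame with mixed column types, `describe` summarizes only the numeric columns unless told otherwise. The sketch below uses a small hypothetical DataFrame (`mixed`, invented for this example) to show the `include` and `percentiles` arguments:

```python
import pandas as pd

# Hypothetical data: one numeric column and one string column
mixed = pd.DataFrame({"value": [1.5, 2.5, 4.0, 8.0],
                      "group": ["x", "y", "x", "x"]})

# Default: only the numeric column is summarized
numeric_summary = mixed.describe()
print(numeric_summary)

# include="all" adds the nonnumeric statistics (unique, top, freq)
full_summary = mixed.describe(include="all")
print(full_summary)

# Request specific percentiles instead of the default quartiles
tails = mixed["value"].describe(percentiles=[0.1, 0.9])
print(tails)
```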


### Correlation and Covariance
Some summary statistics, like correlation and covariance, are computed from pairs
of arguments. Let’s consider some DataFrames of stock prices and volumes originally
obtained from Yahoo!

In [None]:
price = pd.read_pickle("examples/yahoo_price.pkl")
volume = pd.read_pickle("examples/yahoo_volume.pkl")

In [None]:
returns = price.pct_change()
returns.tail()

Unnamed: 0_level_0,AAPL,GOOG,IBM,MSFT
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2016-10-17,-0.00068,0.001837,0.002072,-0.003483
2016-10-18,-0.000681,0.019616,-0.026168,0.00769
2016-10-19,-0.002979,0.007846,0.003583,-0.002255
2016-10-20,-0.000512,-0.005652,0.001719,-0.004867
2016-10-21,-0.00393,0.003011,-0.012474,0.042096


The `corr` method of Series computes the correlation of the overlapping, non-NA,
aligned-by-index values in two Series. Relatedly, `cov` computes the covariance:

In [None]:
returns["MSFT"].corr(returns["IBM"])

np.float64(0.49976361144151144)

In [None]:
returns["MSFT"].cov(returns["IBM"])

np.float64(8.870655479703546e-05)
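
To illustrate what "overlapping, non-NA, aligned-by-index" means, the sketch below recomputes the covariance and Pearson correlation by hand on two short made-up Series (the values are not real returns) and compares them with the results of `cov` and `corr`:

```python
import numpy as np
import pandas as pd

# Made-up values; each Series is missing a different label
x = pd.Series([0.01, -0.02, 0.015, np.nan, 0.03], index=list("abcde"))
y = pd.Series([0.02, -0.01, np.nan, 0.005, 0.025], index=list("abcde"))

# Keep only the labels where both Series have a value
both = pd.concat([x, y], axis=1, keys=["x", "y"]).dropna()

# Sample covariance (divides by n - 1) and Pearson correlation
dx = both["x"] - both["x"].mean()
dy = both["y"] - both["y"].mean()
manual_cov = (dx * dy).sum() / (len(both) - 1)
manual_corr = manual_cov / (both["x"].std() * both["y"].std())

print(manual_cov, x.cov(y))
print(manual_corr, x.corr(y))
```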

DataFrame’s `corr` and `cov` methods, on the other hand, return a full correlation or
covariance matrix as a DataFrame, respectively:

In [None]:
returns.corr()

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,1.0,0.407919,0.386817,0.389695
GOOG,0.407919,1.0,0.405099,0.465919
IBM,0.386817,0.405099,1.0,0.499764
MSFT,0.389695,0.465919,0.499764,1.0


In [None]:
returns.cov()

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,0.000277,0.000107,7.8e-05,9.5e-05
GOOG,0.000107,0.000251,7.8e-05,0.000108
IBM,7.8e-05,7.8e-05,0.000146,8.9e-05
MSFT,9.5e-05,0.000108,8.9e-05,0.000215


Using DataFrame’s `corrwith` method, you can compute pairwise correlations
between a DataFrame’s columns or rows and another Series or DataFrame. Passing a
Series returns a Series with the correlation value computed for each column:

In [None]:
returns.corrwith(returns["IBM"])

Unnamed: 0,0
AAPL,0.386817
GOOG,0.405099
IBM,1.0
MSFT,0.499764


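Passing a DataFrame computes the correlations of matching column names. Here, the correlations of percent changes with volume are computed:

In [None]: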
returns.corrwith(volume)

Unnamed: 0,0
AAPL,-0.075565
GOOG,-0.007067
IBM,-0.204849
MSFT,-0.09295
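
Since the stock data files may not be at hand, here is a self-contained sketch of `corrwith` matching columns by name. The DataFrames `left` and `right` are synthetic and constructed so that the expected correlations are obvious:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
left = pd.DataFrame(rng.standard_normal((50, 2)), columns=["A", "B"])

# Synthetic counterpart: "A" is a positive linear function of left["A"],
# and "B" is the negation of left["B"].
right = pd.DataFrame({"A": 2 * left["A"] + 1,
                      "B": -left["B"]})

# Columns are paired by name: A with A, B with B
by_name = left.corrwith(right)
print(by_name)
```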


### Unique Values, Value Counts, and Membership


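Another class of related methods extracts information about the values contained in a one-dimensional Series. To illustrate these, consider this example:

In [None]: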
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

The first method is `unique`, which gives you an array of the unique values in a Series:

In [None]:
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

`value_counts` computes a Series containing value frequencies:

In [None]:
obj.value_counts()

Unnamed: 0,count
c,3
a,3
b,2
d,1
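
Counts are often easier to compare as proportions. As a brief sketch on the same `obj`, the `normalize=True` option of `value_counts` returns relative frequencies, and the unique values can be sorted after the fact if an ordered listing is needed:

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Relative frequencies instead of raw counts; they sum to 1
shares = obj.value_counts(normalize=True)
print(shares)

# unique() is not sorted; sort the result explicitly if needed
sorted_uniques = sorted(obj.unique())
print(sorted_uniques)
```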


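The resulting Series is sorted by count in descending order as a convenience; passing `sort=False` skips that sorting. The same method works on a NumPy array or other Python sequence once it is wrapped in a Series (the top-level `pandas.value_counts` function is deprecated as of pandas 2.1):

In [None]:
pd.Series(obj.to_numpy()).value_counts(sort=False)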


Unnamed: 0,count
c,3
a,3
d,1
b,2


`isin` performs a vectorized set membership check and can be useful in filtering a
dataset down to a subset of values in a Series or column in a DataFrame:

In [None]:
mask = obj.isin(["b", "c"])
mask

Unnamed: 0,0
0,True
1,False
2,False
3,False
4,False
5,True
6,True
7,True
8,True


In [None]:
obj[mask]

Unnamed: 0,0
0,c
5,b
6,b
7,c
8,c
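
The same pattern applies to a column of a DataFrame. In this sketch the `orders` table and the `wanted` list are hypothetical; `~` inverts the mask to select everything outside the list:

```python
import pandas as pd

# Hypothetical data for illustration
orders = pd.DataFrame({"state": ["Ohio", "Texas", "Utah", "Ohio", "Oregon"],
                       "amount": [10, 25, 5, 40, 15]})
wanted = ["Ohio", "Utah"]

# Rows whose state is in the list
selected = orders[orders["state"].isin(wanted)]
print(selected)

# Rows whose state is NOT in the list
excluded = orders[~orders["state"].isin(wanted)]
print(excluded)
```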


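Related to `isin` is the `Index.get_indexer` method, which gives you an index array from
an array of possibly nondistinct values into another array of distinct values: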

In [None]:
to_match = pd.Series(["c", "a", "b", "b", "c", "a"])
unique_vals = pd.Series(["c", "b", "a"])
indices = pd.Index(unique_vals).get_indexer(to_match)
indices

array([0, 2, 1, 1, 0, 2])
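
One detail worth knowing is what happens when a value has no match. In the sketch below (with a made-up lookup list), `get_indexer` reports such values as -1, which can be turned into a membership mask:

```python
import pandas as pd

categories = pd.Index(["c", "b", "a"])

# "z" does not occur in categories, so its position is reported as -1
codes = categories.get_indexer(["a", "z", "c"])
print(codes)

# True where a match was found
found = codes != -1
print(found)
```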

In [None]:
data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                    "Qu2": [2, 3, 1, 2, 3],
                    "Qu3": [1, 5, 2, 4, 4]})
print(data)


   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4


We can compute the value counts for a single column, like so:

In [None]:
data["Qu1"].value_counts().sort_index()

Unnamed: 0_level_0,count
Qu1,Unnamed: 1_level_1
1,1
3,2
4,2


To compute this for all columns, pass `pd.Series.value_counts` to the DataFrame’s
apply method (the top-level `pandas.value_counts` function is deprecated as of pandas 2.1):

In [None]:
result = data.apply(pd.Series.value_counts).fillna(0)

result


Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0


There is also a `DataFrame.value_counts` method, but it computes counts considering
each row of the DataFrame as a tuple to determine the number of occurrences of each
distinct row:

In [None]:
data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})
data

Unnamed: 0,a,b
0,1,0
1,1,0
2,1,1
3,2,0
4,2,0


In [None]:
data.value_counts()

Unnamed: 0_level_0,Unnamed: 1_level_0,count
a,b,Unnamed: 2_level_1
1,0,2
2,0,2
1,1,1
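
One way to read this result is as a count of rows per distinct combination of column values. The sketch below compares it with a `groupby` over both columns followed by `size`, which yields the same counts in key order rather than sorted by frequency:

```python
import pandas as pd

data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})

# Counts per distinct row, sorted by frequency
row_counts = data.value_counts()
print(row_counts)

# Same counts via groupby, ordered by the (a, b) keys instead
group_sizes = data.groupby(["a", "b"]).size()
print(group_sizes)

# Look up the count for one specific row combination
print(row_counts[(1, 0)])
```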


## 5.4 Conclusion
In the next chapter, we will discuss tools for reading (or loading) and writing datasets
with pandas. After that, we will dig deeper into data cleaning, wrangling, analysis, and
visualization tools using pandas.