# Getting Started with Pandas

Pandas contains data structures and data manipulation tools designed to make data cleaning and analysis fast and convenient in Python. It is built on top of NumPy and is intended to integrate well with many other third-party libraries.  
While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.
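
As a small illustration of that difference, the sketch below builds the same two records as a NumPy array and as a DataFrame. The column names `id` and `state` are made up for the example.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype, so mixing numbers and text
# forces every element to a common type (here, Unicode strings).
arr = np.array([[1, "Ohio"], [2, "Texas"]])
print(arr.dtype.kind)  # "U" stands for a Unicode string dtype

# A DataFrame stores each column separately, so each keeps its own dtype.
df = pd.DataFrame({"id": [1, 2], "state": ["Ohio", "Texas"]})
print(df.dtypes)
```

The integer column stays numeric in the DataFrame, while in the array the numbers have silently become strings.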

In [1]:
# imports
import numpy as np
import pandas as pd

## Series
A Series is a one-dimensional array-like object containing a sequence of values of the same type and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:

In [2]:
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [3]:
obj.array

<PandasArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

In [4]:
# create a Series with an index
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [5]:
# you can use labels in the index when selecting single values
obj2["a"]

-5

In [6]:
obj2[["c", "a", "d"]]

c    3
a   -5
d    4
dtype: int64

In [7]:
# NumPy functions and NumPy-like operations preserve the index-value link
obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

In [8]:
obj2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

In [9]:
np.exp(obj2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

You can also think of a Series as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dict.

In [10]:
"b" in obj2

True

In [11]:
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [12]:
# convert back to a dictionary
obj3.to_dict()

{'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [13]:
# create a Series from a dictionary passing an index
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In the previous example, three values found in sdata were placed in the appropriate locations, but since no value for "California" was found, it appears as NaN. NaN (not a number) is the standard missing data marker used in pandas. Since "Utah" was not included in states, it is excluded from the resulting object.

In [14]:
pd.isna(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [15]:
pd.notna(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [16]:
# indexes align when performing operations
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Data alignment features are similar to a join operation in databases.
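
To make the join analogy concrete, here is a minimal sketch using the `align` method of Series, which exposes the reindexing step that arithmetic performs implicitly. The two small Series are invented for the example.

```python
import pandas as pd

left = pd.Series({"Ohio": 35000, "Texas": 71000, "Utah": 5000})
right = pd.Series({"California": 1000, "Ohio": 35000, "Texas": 71000})

# join="outer" keeps the union of the labels, like a full outer join;
# labels missing on one side are filled with NaN on that side.
left_outer, right_outer = left.align(right, join="outer")
print(left_outer)

# join="inner" keeps only the labels present on both sides.
left_inner, right_inner = left.align(right, join="inner")
total = left_inner + right_inner
print(total)
```

Plain `left + right` behaves like the outer variant, which is why labels found on only one side come out as NaN.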

Both the Series object itself and its index have a name attribute, which integrates with other areas of pandas:

In [17]:
obj4.name = "population"
obj4.index.name = "state"
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [18]:
# a Series's index can be altered in place by assignment
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]
obj

Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

## DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index.
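
The "dict of Series sharing the same index" picture can be checked directly. In this sketch the two Series have different labels, and the resulting DataFrame aligns them on a common index.

```python
import pandas as pd

ohio = pd.Series({2000: 1.5, 2001: 1.7, 2002: 3.6})
nevada = pd.Series({2001: 2.4, 2002: 2.9})

# The dict keys become column names; the row index is the union of the labels.
frame = pd.DataFrame({"Ohio": ohio, "Nevada": nevada})
print(frame)

# Every column pulled back out is a Series carrying the shared index.
shares_index = frame["Nevada"].index.equals(frame.index)
print(shares_index)
```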

In [19]:
# one of the most common ways to create a DataFrame is from a dictionary of equal-length lists or NumPy arrays
data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002, 2003],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2],
}
frame = pd.DataFrame(data)
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [20]:
# select only the first 5 rows
frame.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [21]:
# select only the last 5 rows
frame.tail()

Unnamed: 0,state,year,pop
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [22]:
# arrange the columns in a particular order
pd.DataFrame(data, columns=["year", "state", "pop"])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [23]:
# if a column is not found in the data, it will appear with missing values
pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

Unnamed: 0,year,state,pop,debt
0,2000,Ohio,1.5,
1,2001,Ohio,1.7,
2,2002,Ohio,3.6,
3,2001,Nevada,2.4,
4,2002,Nevada,2.9,
5,2003,Nevada,3.2,


In [24]:
# retrieve a column as a Series
frame["state"]

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
5    Nevada
Name: state, dtype: object

In [25]:
frame.year

0    2000
1    2001
2    2002
3    2001
4    2002
5    2003
Name: year, dtype: int64

In [26]:
# retrieve a row by label with loc or by position with iloc
frame.loc[3]

state    Nevada
year       2001
pop         2.4
Name: 3, dtype: object

In [27]:
frame.iloc[2]

state    Ohio
year     2002
pop       3.6
Name: 2, dtype: object

In [28]:
# update column by assignment
frame["debt"] = 16.5
frame

Unnamed: 0,state,year,pop,debt
0,Ohio,2000,1.5,16.5
1,Ohio,2001,1.7,16.5
2,Ohio,2002,3.6,16.5
3,Nevada,2001,2.4,16.5
4,Nevada,2002,2.9,16.5
5,Nevada,2003,3.2,16.5


In [29]:
frame["debt"] = np.arange(6.0)
frame

Unnamed: 0,state,year,pop,debt
0,Ohio,2000,1.5,0.0
1,Ohio,2001,1.7,1.0
2,Ohio,2002,3.6,2.0
3,Nevada,2001,2.4,3.0
4,Nevada,2002,2.9,4.0
5,Nevada,2003,3.2,5.0


In [30]:
# delete a column
del frame["debt"]
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


The column returned from indexing a DataFrame is a view on the underlying data, not a copy. Thus, any in-place modifications to the Series will be reflected in the DataFrame. The column can be explicitly copied using the Series’s copy method.
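
A minimal sketch of the explicit copy follows. Whether the uncopied column behaves as a view can depend on the pandas version and its copy-on-write setting (an assumption worth checking for your installation), but an explicit copy is independent in either case.

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [1.5, 2.4]})

# copy() returns a Series with its own data, detached from the DataFrame.
pop_copy = frame["pop"].copy()
pop_copy.iloc[0] = 99.0

# The DataFrame still holds the original value.
print(frame.loc[0, "pop"])
```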

In [31]:
populations = {
    "Ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6},
    "Nevada": {2001: 2.4, 2002: 2.9},
}

In [32]:
# create a DataFrame from a nested dictionary
frame2 = pd.DataFrame(populations)
frame2

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


In [33]:
# transpose the DataFrame
frame2.T

Unnamed: 0,2000,2001,2002
Ohio,1.5,1.7,3.6
Nevada,,2.4,2.9


It is important to note that transposing discards the column data types if the columns do not all have the same data type.
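
The effect is easy to see by transposing a mixed-type DataFrame twice. This is a small sketch with made-up data:

```python
import pandas as pd

mixed = pd.DataFrame(
    {"year": [2000, 2001], "pop": [1.5, 1.7], "state": ["Ohio", "Ohio"]}
)
print(mixed.dtypes)  # one dtype per column

# After a round trip through .T the numeric columns come back as object,
# because the transposed columns had to hold integers, floats, and strings at once.
round_trip = mixed.T.T
print(round_trip.dtypes)
```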

In [34]:
# set name for the index and columns
frame2.index.name = "year"
frame2.columns.name = "state"
frame2

state,Ohio,Nevada
year,,
2000,1.5,
2001,1.7,2.4
2002,3.6,2.9


In [35]:
frame2.to_numpy()

array([[1.5, nan],
       [1.7, 2.4],
       [3.6, 2.9]])

If the DataFrame's columns have different data types, the data type of the returned array will be chosen to accommodate all of the columns.

In [36]:
frame.to_numpy()

array([['Ohio', 2000, 1.5],
       ['Ohio', 2001, 1.7],
       ['Ohio', 2002, 3.6],
       ['Nevada', 2001, 2.4],
       ['Nevada', 2002, 2.9],
       ['Nevada', 2003, 3.2]], dtype=object)
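
The object array above is the most general case. As a complementary sketch with made-up data, a DataFrame holding only integer and float columns is converted to a floating-point array, since a float type can hold both:

```python
import pandas as pd

numeric = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]})

arr = numeric.to_numpy()
print(arr.dtype)  # a float dtype wide enough for both columns

# A single column keeps its own dtype.
years = numeric["year"].to_numpy()
print(years.dtype)
```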

pandas's Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index. Index objects are also immutable, so their elements can’t be modified by the user.

In [37]:
obj = pd.Series(np.arange(3), index=["a", "b", "c"])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [38]:
index[1:]

Index(['b', 'c'], dtype='object')

In [39]:
index[1] = "z"  # TypeError

TypeError: Index does not support mutable operations
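
Immutability only rules out changing individual elements of an Index. As a sketch of two common alternatives, labels can be changed by building a new object with `rename` or by assigning a complete new sequence to the `index` attribute:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(3), index=["a", "b", "c"])

# rename() returns a new Series with a new Index; obj itself is untouched.
renamed = obj.rename(index={"b": "z"})
print(renamed.index)

# Assigning a full sequence swaps in a new Index object as a whole.
obj.index = ["x", "y", "z"]
print(obj.index)
```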

In [None]:
# an Index also behaves like a fixed-size set
"Ohio" in frame2.columns

True

In [None]:
# unlike Python sets, a pandas Index can contain duplicate labels
pd.Index(["foo", "foo", "bar", "bar"])

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

## Basic Functionality

In [None]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [None]:
obj2 = obj.reindex(["a", "b", "c", "d", "e"])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [None]:
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [None]:
# forward-fill the values
obj3.reindex(range(6), method="ffill")

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [None]:
frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [None]:
frame2 = frame.reindex(["a", "b", "c", "d"])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [None]:
# reindex columns using the columns keyword
states = ["Texas", "Utah", "California"]
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


In [None]:
frame.reindex(states, axis="columns")

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


You can also reindex by using the loc operator.

In [None]:
frame.loc[["a", "d", "c"], ["California", "Texas"]]

Unnamed: 0,California,Texas
a,2,1
d,8,7
c,5,4
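
One difference between the two approaches is worth keeping in mind: `reindex` creates rows for labels that do not exist, while `loc` only selects existing labels. The sketch below assumes a reasonably recent pandas, where a list containing an unknown label raises `KeyError`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)

# reindex introduces a row of missing values for the new label "b"
reindexed = frame.reindex(["a", "b"])
print(reindexed)

# loc refuses the unknown label instead of inventing a row
try:
    frame.loc[["a", "b"]]
except KeyError as exc:
    print("KeyError:", exc)
```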


In [None]:
obj = pd.Series(np.arange(5.0), index=["a", "b", "c", "d", "e"])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [None]:
new_obj = obj.drop("c")
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [None]:
obj.drop(["d", "c"])

a    0.0
b    1.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis.

In [None]:
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
# drop values from the row labels (axis 0)
data.drop(["Colorado", "Ohio"])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
data.drop(columns=["two", "four"])

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


In [None]:
# axis=1 drops columns
data.drop(["two", "four"], axis=1)

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14


Series indexing works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers.

In [None]:
obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [None]:
obj["b"]

1.0

In [None]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [None]:
obj[["b", "a", "d"]]

b    1.0
a    0.0
d    3.0
dtype: float64

In [None]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

Although you can select data by label this way, the preferred way to select index values is with the loc operator. The reason is the different treatment of integers when indexing with []: regular []-based indexing will treat integers as labels if the index contains integers, so the behavior differs depending on the data type of the index.

In [None]:
obj.loc[["b", "a", "d"]]

b    1.0
a    0.0
d    3.0
dtype: float64

In [None]:
obj1 = pd.Series([1, 2, 3], index=[2, 0, 1])
obj2 = pd.Series([1, 2, 3], index=["a", "b", "c"])

In [None]:
obj1[[0, 1, 2]]

0    2
1    3
2    1
dtype: int64

In [None]:
obj2[[0, 1, 2]]

a    1
b    2
c    3
dtype: int64

In [None]:
# when using loc the expression will fail when the index does not contain integers
obj2.loc[[0, 1]]

KeyError: "None of [Int64Index([0, 1], dtype='int64')] are in the [index]"

Since the loc operator indexes exclusively with labels, there is also an iloc operator that indexes exclusively with integers, so that selection works consistently whether or not the index contains integers.

In [None]:
obj1.iloc[[0, 1, 2]]

2    1
0    2
1    3
dtype: int64

In [None]:
obj2.iloc[[0, 1, 2]]

a    1
b    2
c    3
dtype: int64

In [None]:
# assigning a value to a slice modifies the original Series
obj2.loc["b":"c"] = 5
obj2

a    1
b    5
c    5
dtype: int64

Indexing into a DataFrame retrieves one or more columns, either with a single value or sequence.

In [None]:
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
data["two"]

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [None]:
data[["three", "one"]]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


In [None]:
# select rows by slicing
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [None]:
# select rows by boolean indexing
data[data["three"] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [None]:
# we can use boolean indexing to set values
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


Like Series, DataFrame has special attributes loc and iloc for label-based and integer-based indexing respectively. Since DataFrame is two-dimensional, you can also select a subset of the rows and columns with NumPy-like notation using either axis labels (loc) or integers (iloc).

In [None]:
data.loc["Colorado"]

one      0
two      5
three    6
four     7
Name: Colorado, dtype: int64

In [None]:
data.loc[["Colorado", "New York"]]

Unnamed: 0,one,two,three,four
Colorado,0,5,6,7
New York,12,13,14,15


In [None]:
data.loc["Colorado", ["two", "three"]]

two      5
three    6
Name: Colorado, dtype: int64

In [None]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [None]:
data.iloc[[2, 1]]

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
Colorado,0,5,6,7


In [None]:
data.iloc[2, [3, 0, 1]]

four    11
one      8
two      9
Name: Utah, dtype: int64

Both indexing functions work with slices in addition to single labels or lists of labels.

In [None]:
data.loc[:"Utah", "two"]

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

In [None]:
# a Boolean Series can be used with loc; iloc does not accept a Boolean Series
data.loc[data["three"] >= 2]

Unnamed: 0,one,two,three,four
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [None]:
ser = pd.Series(np.arange(3.0))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [None]:
ser[-1]  # Error

KeyError: -1

If you have an axis index containing integers, data selection will always be label oriented. If you use loc (for labels) or iloc (for integers) you will get exactly what you want.

In [None]:
ser2 = pd.Series(np.arange(3.0), index=["a", "b", "c"])
ser2[-1]  # No error

2.0
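
To avoid depending on how [] interprets integers, the two cases above can be written with the explicit operators. A short sketch:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.0))                          # integer labels 0, 1, 2
ser2 = pd.Series(np.arange(3.0), index=["a", "b", "c"])  # string labels

# iloc always counts positions, whatever the index contains
last_by_position = ser.iloc[-1]
last_by_position2 = ser2.iloc[-1]

# loc always looks up labels
by_label = ser.loc[1]
by_label2 = ser2.loc["b"]

print(last_by_position, last_by_position2, by_label, by_label2)
```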

pandas makes it simpler to work with objects that have different indexes. For example, when you add objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs.

In [None]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [None]:
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [None]:
# adding two Series will result in the union of the indices
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In the case of DataFrame, alignment is performed on both the rows and the columns.

In [None]:
df1 = pd.DataFrame(
    np.arange(9.0).reshape((3, 3)),
    columns=list("bcd"),
    index=["Ohio", "Texas", "Colorado"],
)
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [None]:
df2 = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [None]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


Adding DataFrames with no column or row labels in common returns a DataFrame with all nulls.
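
A minimal sketch of that case, using two small invented DataFrames that share neither row nor column labels:

```python
import pandas as pd

df_a = pd.DataFrame({"A": [1, 2]}, index=["x", "y"])
df_b = pd.DataFrame({"B": [3, 4]}, index=["p", "q"])

# The result covers the union of both axes, but no cell has a value
# on both sides, so every entry is NaN.
result = df_a + df_b
print(result)
```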

In [None]:
df1 = pd.DataFrame(np.arange(12.0).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.0).reshape((4, 5)), columns=list("abcde"))

In [None]:
df2.loc[1, "b"] = np.nan

In [None]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [None]:
# fill missing values with 0
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,5.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


As with NumPy arrays of different dimensions, arithmetic between DataFrame and Series is defined.

In [None]:
arr = np.arange(12.0).reshape((3, 4))
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [None]:
arr - arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

The subtraction is performed once for each row. This is referred to as broadcasting. Operations between a DataFrame and a Series are similar.

In [None]:
frame = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [None]:
series = frame.iloc[0]
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

By default arithmetic between DataFrame and Series matches the index of the Series on the DataFrame’s columns, broadcasting down the rows.

In [None]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


If an index value is not found in either the DataFrame’s columns or the Series’s index, the objects will be reindexed to form the union.
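
The following sketch shows the union behavior. The name `series2` and its labels are chosen for the example: "f" is not a column of the DataFrame, and the column "d" is missing from the Series.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)
series2 = pd.Series(np.arange(3), index=["b", "e", "f"])

# Columns become the union b, d, e, f; "d" and "f" have no partner
# on the other side, so they are filled with NaN.
result = frame + series2
print(result)
```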

If instead you want to broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods and specify the axis keyword.
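
For instance, subtracting one column from every column matches on the row labels. This sketch uses the `sub` method with `axis="index"`; the name `series3` is chosen for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)
series3 = frame["d"]

# axis="index" aligns series3 with the row labels and
# broadcasts the subtraction across the columns.
result = frame.sub(series3, axis="index")
print(result)
```

Each row is reduced by its own value of "d", so the "d" column of the result is all zeros.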

## Function Application and Mapping

In [None]:
frame = pd.DataFrame(
    np.random.randn(4, 3),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)
frame

Unnamed: 0,b,d,e
Utah,-0.798028,1.327263,0.783642
Ohio,0.375772,2.005561,-0.618395
Texas,-0.670926,-0.038261,0.041389
Oregon,2.282483,1.043059,-1.034633


NumPy ufuncs also work with pandas objects.

In [None]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.798028,1.327263,0.783642
Ohio,0.375772,2.005561,0.618395
Texas,0.670926,0.038261,0.041389
Oregon,2.282483,1.043059,1.034633


Using DataFrame's apply method, you can apply a function that operates on one-dimensional arrays to each column or row.

In [None]:
def f1(x):
    return x.max() - x.min()


frame.apply(f1)

b    3.080510
d    2.043822
e    1.818275
dtype: float64

In [None]:
frame.apply(f1, axis="columns")

Utah      2.125291
Ohio      2.623956
Texas     0.712315
Oregon    3.317116
dtype: float64

In [None]:
def f2(x):
    return pd.Series([x.min(), x.max()], index=["min", "max"])


frame.apply(f2)

Unnamed: 0,b,d,e
min,-0.798028,-0.038261,-1.034633
max,2.282483,2.005561,0.783642


To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object.

In [None]:
obj = pd.Series(range(4), index=["d", "a", "b", "c"])
obj

d    0
a    1
b    2
c    3
dtype: int64

In [None]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

With a DataFrame you can sort by index on either axis. The data is sorted in ascending order by default, but can be sorted in descending order, too.

In [None]:
frame = pd.DataFrame(
    np.arange(8).reshape((2, 4)), index=["three", "one"], columns=["d", "a", "b", "c"]
)
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [None]:
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [None]:
frame.sort_index(axis="columns")

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


In [None]:
frame.sort_index(axis="columns", ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


To sort a Series by its values, use its sort_values method.

In [None]:
obj = pd.Series([4, 7, -3, 2])
obj

0    4
1    7
2   -3
3    2
dtype: int64

In [None]:
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

Any missing values are sorted to the end of the Series by default, but you can change this by passing the na_position keyword.
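
A short sketch of both placements, using a small invented Series with two missing values:

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Default: missing values go to the end
nan_last = obj.sort_values()
print(nan_last)

# na_position="first" moves them to the front instead
nan_first = obj.sort_values(na_position="first")
print(nan_first)
```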

When sorting a DataFrame, you can use the data in one or more columns as the sort keys.

In [None]:
frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [None]:
frame.sort_values("b")

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [None]:
frame.sort_values(by=["a", "b"])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1


In [None]:
# Series with duplicate indices.
obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [None]:
# tells us if the index is unique or not
obj.index.is_unique

False

Data selection is one of the main things that behaves differently with duplicates. Indexing a label with multiple entries returns a Series, while single entries return a scalar value.

In [None]:
obj["a"]

a    0
a    1
dtype: int64

In [None]:
obj["c"]

4

This behavior makes things complicated as the output type from indexing can vary based on whether a label is repeated or not.
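
One way to keep the output type predictable, sketched below, is to pass a list of labels to loc, which yields a Series regardless of how many entries match. Checking the index for duplicates up front is another option.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=["a", "a", "b", "b", "c"])

# A list of labels always produces a Series, even for a single match
single = obj.loc[["c"]]
repeated = obj.loc[["a"]]
print(type(single).__name__, len(single), len(repeated))

# Listing the labels that occur more than once
dup_labels = obj.index[obj.index.duplicated()].unique().tolist()
print(dup_labels)
```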

In [None]:
# DataFrame with duplicate indices.
df = pd.DataFrame(np.random.standard_normal((5, 3)), index=["a", "a", "b", "b", "c"])
df

Unnamed: 0,0,1,2
a,0.444921,2.308552,1.126439
a,-0.208892,-0.186752,-0.348148
b,0.695585,-0.380315,-0.344052
b,-0.049691,-0.337002,1.445128
c,-1.076606,-0.306704,1.721924


In [None]:
df.loc["b"]

Unnamed: 0,0,1,2
b,0.695585,-0.380315,-0.344052
b,-0.049691,-0.337002,1.445128


In [None]:
df.loc["c"]

0   -1.076606
1   -0.306704
2    1.721924
Name: c, dtype: float64

## Summary and Computing Descriptive Statistics
pandas objects are equipped with a set of common mathematical and statistical methods. Most of these fall into the category of reductions or summary statistics: methods that extract a single value (like the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame.  
Compared with the similar methods found on NumPy arrays, they have built-in handling for missing data.

In [None]:
df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [None]:
# returns a Series with the sum of each column
df.sum()

one    9.25
two   -5.80
dtype: float64

In [None]:
# sums across the columns
df.sum(axis="columns")

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

NA values are skipped by default: when an entire row or column contains only NA values the sum is 0, whereas if any value is not NA the result is the sum of the non-NA values. This can be disabled with the skipna option, in which case any NA value in a row or column makes the corresponding result NA.

In [None]:
df.sum(axis="columns", skipna=False)

a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64
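
If the goal is only to avoid reporting 0 for rows with no data at all, while still skipping isolated NA values, the `min_count` argument of `sum` is an alternative to `skipna=False`. A sketch on the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# min_count=1 requires at least one non-NA value per row:
# row "a" still sums its single value, the all-NA row "c" becomes NaN.
row_sums = df.sum(axis="columns", min_count=1)
print(row_sums)
```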

Some aggregations, like mean, require at least one non-NA value to yield a valid result.

In [None]:
df.mean(axis="columns")

a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64

Some methods like idxmin and idxmax return indirect statistics like the index value where the minimum or maximum values are attained.

In [None]:
df.idxmax()

one    b
two    d
dtype: object

In [None]:
# returns the cumulative sum of the values
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


Some other methods are neither reductions nor accumulations; describe is one example. Calling describe produces multiple summary statistics in one shot.

In [None]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


In [None]:
# On non-numeric data, describe produces alternative summary statistics
obj = pd.Series(["a", "a", "b", "c"] * 4)
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

In [None]:
# number of non-NA values
df.count()

one    3
two    2
dtype: int64

### Correlation and Covariance
Some summary statistics, like correlation and covariance, are computed from pairs of arguments. Let’s consider some DataFrames of stock prices and volumes obtained from Yahoo! Finance using the add-on pandas-datareader package.

In [None]:
price = pd.read_pickle("examples/yahoo_price.pkl")
volume = pd.read_pickle("examples/yahoo_volume.pkl")

In [None]:
price.head()

Date,AAPL,GOOG,IBM,MSFT
2010-01-04,27.990226,313.062468,113.304536,25.884104
2010-01-05,28.038618,311.683844,111.935822,25.892466
2010-01-06,27.592626,303.826685,111.208683,25.733566
2010-01-07,27.541619,296.753749,110.823732,25.465944
2010-01-08,27.724725,300.709808,111.935822,25.641571


In [None]:
volume.head()

Date,AAPL,GOOG,IBM,MSFT
2010-01-04,123432400,3927000,6155300,38409100
2010-01-05,150476200,6031900,6841400,49749600
2010-01-06,138040000,7987100,5605300,58182400
2010-01-07,119282800,12876600,5840600,50559700
2010-01-08,111902700,9483900,4197200,51197400


In [None]:
# compute the percentage change of the prices
returns = price.pct_change()
returns.tail()

Date,AAPL,GOOG,IBM,MSFT
2016-10-17,-0.00068,0.001837,0.002072,-0.003483
2016-10-18,-0.000681,0.019616,-0.026168,0.00769
2016-10-19,-0.002979,0.007846,0.003583,-0.002255
2016-10-20,-0.000512,-0.005652,0.001719,-0.004867
2016-10-21,-0.00393,0.003011,-0.012474,0.042096


The corr method of Series computes the correlation of the overlapping, non-NA, aligned-by-index values in two Series. Relatedly, cov computes the covariance.

In [None]:
returns["MSFT"].corr(returns["IBM"])

0.49976361144151144

In [None]:
returns["MSFT"].cov(returns["IBM"])

8.870655479703546e-05
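
The phrase "overlapping, non-NA, aligned-by-index" can be verified on a small example. The sketch below uses synthetic data (not the stock data above) and compares corr and cov with a manual computation on the pairs that survive alignment and NA removal.

```python
import numpy as np
import pandas as pd

# Synthetic Series with different indexes and some missing values
s1 = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0], index=list("abcde"))
s2 = pd.Series([2.0, 1.0, 3.0, 5.0, np.nan, 6.0], index=list("abcdef"))

r = s1.corr(s2)
c = s1.cov(s2)

# Manual version: keep shared labels, drop rows where either value is NA.
# Only the labels a, b, and d remain.
pairs = pd.concat([s1, s2], axis=1, join="inner").dropna()
r_manual = np.corrcoef(pairs[0], pairs[1])[0, 1]
c_manual = np.cov(pairs[0], pairs[1])[0, 1]

print(r, r_manual)
print(c, c_manual)
```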

DataFrame's corr and cov methods, on the other hand, return a full correlation or covariance matrix as a DataFrame, respectively.

In [None]:
returns.corr()

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,1.0,0.407919,0.386817,0.389695
GOOG,0.407919,1.0,0.405099,0.465919
IBM,0.386817,0.405099,1.0,0.499764
MSFT,0.389695,0.465919,0.499764,1.0


In [None]:
returns.cov()

Unnamed: 0,AAPL,GOOG,IBM,MSFT
AAPL,0.000277,0.000107,7.8e-05,9.5e-05
GOOG,0.000107,0.000251,7.8e-05,0.000108
IBM,7.8e-05,7.8e-05,0.000146,8.9e-05
MSFT,9.5e-05,0.000108,8.9e-05,0.000215


Using DataFrame's corrwith method, you can compute pairwise correlations between a DataFrame’s columns or rows with another Series or DataFrame. Passing a Series returns a Series with the correlation value computed for each column.

In [None]:
returns.corrwith(returns["IBM"])

AAPL    0.386817
GOOG    0.405099
IBM     1.000000
MSFT    0.499764
dtype: float64

Passing a DataFrame computes the correlations of matching column names. Here we compute correlations of percent changes with volume.

In [None]:
returns.corrwith(volume)

AAPL   -0.075565
GOOG   -0.007067
IBM    -0.204849
MSFT   -0.092950
dtype: float64

### Unique Values, Value Counts, and Membership

In [None]:
obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [None]:
# returns the unique values in the Series
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [None]:
# computes a Series containing value frequencies
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

In [None]:
# performs a vectorized set membership check
mask = obj.isin(["b", "c"])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
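
Two common follow-ups are sketched here: using the mask to filter the Series, and the related `Index.get_indexer` method, which maps each value to its position in an array of distinct values. The names `to_match` and `unique_vals` are chosen for the example.

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# The Boolean mask from isin filters the Series down to the matching values
subset = obj[obj.isin(["b", "c"])]
print(subset)

# get_indexer returns, for each value, its position in unique_vals
to_match = pd.Series(["c", "a", "b", "b", "c", "a"])
unique_vals = pd.Series(["c", "b", "a"])
indices = pd.Index(unique_vals).get_indexer(to_match)
print(indices)
```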

In [None]:
# compute a histogram on multiple related columns in a DataFrame
data = pd.DataFrame(
    {
        "Qu1": [1, 3, 4, 3, 4],
        "Qu2": [2, 3, 1, 2, 3],
        "Qu3": [1, 5, 2, 4, 4],
    }
)
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


In [None]:
data["Qu1"].value_counts().sort_index()

1    1
3    2
4    2
Name: Qu1, dtype: int64

In [None]:
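# pd.value_counts is deprecated in recent pandas releases; Series.value_counts is equivalent here
result = data.apply(pd.Series.value_counts).fillna(0)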
result

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0


The DataFrame.value_counts method computes counts considering each row of the DataFrame as a tuple to determine the number of occurrences of each unique row.

In [None]:
data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})
data

Unnamed: 0,a,b
0,1,0
1,1,0
2,1,1
3,2,0
4,2,0


In [None]:
data.value_counts()

a  b
1  0    2
2  0    2
1  1    1
dtype: int64
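
The result is indexed by the distinct rows, so a count can be looked up with a tuple. As a sketch, the same counts can also be obtained by grouping on all columns and taking the group sizes; only the ordering of the result differs.

```python
import pandas as pd

data = pd.DataFrame({"a": [1, 1, 1, 2, 2], "b": [0, 0, 1, 0, 0]})

counts = data.value_counts()
# Look up how often the row (a=1, b=0) occurs
n_row = counts.loc[(1, 0)]
print(n_row)

# Equivalent counts via groupby, ordered by the group keys instead of by count
group_sizes = data.groupby(["a", "b"]).size()
print(group_sizes)
```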