<h2>Creating a DataFrame from scratch</h2>

In [1]:
# Reference numpy and pandas
import numpy as np
import pandas as pd

In [2]:
# Set some pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)

There are several ways to create a DataFrame. Probably the most straightforward is to
create it from a NumPy array. The following code creates a DataFrame from a
two-dimensional NumPy array.

In [3]:
# create a DataFrame from a 2-d ndarray
pd.DataFrame( np.array( [ [10, 11], [20, 21] ] ) )

    0   1
0  10  11
1  20  21

A DataFrame can also be initialized by passing a list of Series objects.

In [4]:
# create a DataFrame from a list of Series objects
df1 = pd.DataFrame( [ pd.Series( np.arange(10, 15) ), pd.Series( np.arange(15, 20) ) ] )
df1

    0   1   2   3   4
0  10  11  12  13  14
1  15  16  17  18  19

The dimensions of a DataFrame object can be determined using its .shape property. A
DataFrame is always two-dimensional. The first value informs us about the number of
rows and the second value is the number of columns:

In [5]:
# what's the shape of this DataFrame
df1.shape # it is two rows by 5 columns

(2, 5)

Column names can be specified at the time of creating the DataFrame by using the
columns parameter of the DataFrame constructor.

In [6]:
# specify column names
df = pd.DataFrame(np.array([ [10, 11], [20, 21] ]), columns=['a', 'b'])
df

    a   b
0  10  11
1  20  21

The names of the columns of a DataFrame can be accessed with its .columns property:

In [7]:
# what are the names of the columns?
df.columns

Index(['a', 'b'], dtype='object')

The value of the .columns property is actually a pandas Index object. The individual column
names can be accessed by position.

In [8]:
# retrieve just the names of the columns by position
"{0}, {1}".format(df.columns[0], df.columns[1])

'a, b'

In [9]:
# rename the columns
df.columns = ['c1', 'c2']
df

   c1  c2
0  10  11
1  20  21

Index labels can likewise be assigned using the index parameter of the constructor or by
assigning a list directly to the .index property.

In [10]:
# create a DataFrame with named columns and rows
df = pd.DataFrame(np.array([[0, 1], [2, 3]]), columns=['c1', 'c2'], index=['r1', 'r2'])
df

    c1  c2
r1   0   1
r2   2   3

Similar to the Series object, the index of a DataFrame object can be accessed with its
.index property:

In [11]:
# retrieve the index of the DataFrame
df.index

Index(['r1', 'r2'], dtype='object')

A DataFrame object can also be created by passing a dictionary containing one or more
Series objects, where the dictionary keys contain the column names and each series is
one column of data:

In [12]:
# create a DataFrame with two Series objects
# and a dictionary
s1 = pd.Series(np.arange(1, 6, 1))
s2 = pd.Series(np.arange(6, 11, 1))
pd.DataFrame({'c1': s1, 'c2': s2})

   c1  c2
0   1   6
1   2   7
2   3   8
3   4   9
4   5  10

A DataFrame also aligns the data of each Series passed in a dictionary. For example, the
following code adds a third column in the DataFrame initialization. This third Series
contains two values and specifies its own index. When the DataFrame is created, the Series
in the dictionary are aligned with each other by index label as they are added to the
DataFrame object, and any label missing from a Series is filled with NaN. The code is as follows:

In [13]:
# demonstrate alignment during creation
s3 = pd.Series(np.arange(12, 14), index=[1, 2])
df = pd.DataFrame({'c1': s1, 'c2': s2, 'c3': s3})
df

   c1  c2    c3
0   1   6   NaN
1   2   7  12.0
2   3   8  13.0
3   4   9   NaN
4   5  10   NaN
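
The alignment described above can be checked on a smaller scale. The following is a minimal sketch using made-up values (the names a, b, c, and aligned are illustrative only). It also shows why the c3 column above is displayed as 12.0 and 13.0: once NaN is introduced, the integer data is converted to floating point.

```python
import numpy as np
import pandas as pd

# two Series with the default 0..2 index
a = pd.Series([1, 2, 3])
b = pd.Series([4, 5, 6])
# a shorter Series whose labels cover only rows 1 and 2
c = pd.Series([10, 20], index=[1, 2])

aligned = pd.DataFrame({'a': a, 'b': b, 'c': c})

# label 0 has no entry in c, so that cell is NaN
print(aligned)
# the NaN forces column c to a floating-point dtype
print(aligned.dtypes)
```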

<h2>Example data</h2>

In [14]:
# read in the data
# use the Symbol column as the index, and
# only read in columns in positions 0, 2, 3, 7
sp500 = pd.read_csv("data/sp500.csv",
                    index_col='Symbol',
                    usecols=[0, 2, 3, 7])

In [15]:
# peek at the first 5 rows of the data using .head()
sp500.head()

                        Sector   Price  Book Value
Symbol                                            
MMM                Industrials  189.09       17.26
ABT                Health Care   45.00       13.94
ABBV               Health Care   63.69        2.91
ACN     Information Technology  124.14       11.95
ATVI    Information Technology   48.06       12.23
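
The index_col and usecols arguments used to load sp500 can be tried without the data file. The sketch below reads a tiny in-memory CSV through io.StringIO. The CSV text is a hypothetical stand-in with fewer columns than the real data/sp500.csv, so the column positions differ from those used above.

```python
import io
import pandas as pd

# hypothetical miniature of the file; the real one has more columns
csv_text = """Symbol,Name,Sector,Price
MMM,3M Co.,Industrials,189.09
ABT,Abbott Laboratories,Health Care,45.00
"""

# keep columns 0, 2 and 3 (skipping Name) and use Symbol as the index
sample = pd.read_csv(io.StringIO(csv_text),
                     index_col='Symbol',
                     usecols=[0, 2, 3])
print(sample)
```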

In [16]:
sp500.tail()

                        Sector   Price  Book Value
Symbol                                            
YHOO    Information Technology   45.73       32.50
YUM     Consumer Discretionary   64.02      -15.93
ZBH                Health Care  117.07       48.20
ZION                Financials   45.28       34.10
ZTS                Health Care   53.07        3.02

In [17]:
# how many rows of data?
len(sp500)

505

In [18]:
# examine the index
sp500.index

Index(['MMM', 'ABT', 'ABBV', 'ACN', 'ATVI', 'AYI', 'ADBE', 'AAP', 'AES', 'AET',
       ...
       'XEL', 'XRX', 'XLNX', 'XL', 'XYL', 'YHOO', 'YUM', 'ZBH', 'ZION', 'ZTS'],
      dtype='object', name='Symbol', length=505)

In [19]:
# get the columns
sp500.columns

Index(['Sector', 'Price', 'Book Value'], dtype='object')

<h2>Selecting columns of a DataFrame</h2>

In [20]:
sp500

                        Sector   Price  Book Value
Symbol                                            
MMM                Industrials  189.09       17.26
ABT                Health Care   45.00       13.94
ABBV               Health Care   63.69        2.91
ACN     Information Technology  124.14       11.95
ATVI    Information Technology   48.06       12.23
...                        ...     ...         ...
YHOO    Information Technology   45.73       32.50
YUM     Consumer Discretionary   64.02      -15.93
ZBH                Health Care  117.07       48.20
ZION                Financials   45.28       34.10
ZTS                Health Care   53.07        3.02

[505 rows x 3 columns]

In [21]:
# create a new DataFrame with integers as the column names
# make sure to use .copy() or change will be in-place
df = sp500.copy()
df.columns=[0, 1, 2]
df.head()

                             0       1      2
Symbol                                       
MMM                Industrials  189.09  17.26
ABT                Health Care   45.00  13.94
ABBV               Health Care   63.69   2.91
ACN     Information Technology  124.14  11.95
ATVI    Information Technology   48.06  12.23

In [22]:
# the integer is treated as a column label, so this does not raise an exception
df[1]

Symbol
MMM     189.09
ABT      45.00
ABBV     63.69
ACN     124.14
ATVI     48.06
         ...  
YHOO     45.73
YUM      64.02
ZBH     117.07
ZION     45.28
ZTS      53.07
Name: 1, Length: 505, dtype: float64

In [23]:
# this is a Series not a DataFrame
type(df[1])

pandas.core.series.Series

In [24]:
# get price column by name
# result is a Series
sp500['Price']

Symbol
MMM     189.09
ABT      45.00
ABBV     63.69
ACN     124.14
ATVI     48.06
         ...  
YHOO     45.73
YUM      64.02
ZBH     117.07
ZION     45.28
ZTS      53.07
Name: Price, Length: 505, dtype: float64

In [25]:
# get Price and Sector columns
# since a list is passed, the result is a DataFrame
sp500[['Price', 'Sector']]

         Price                  Sector
Symbol                                
MMM     189.09             Industrials
ABT      45.00             Health Care
ABBV     63.69             Health Care
ACN     124.14  Information Technology
ATVI     48.06  Information Technology
...        ...                     ...
YHOO     45.73  Information Technology
YUM      64.02  Consumer Discretionary
ZBH     117.07             Health Care
ZION     45.28              Financials
ZTS      53.07             Health Care

[505 rows x 2 columns]

In [26]:
# attribute access of the column by name
sp500.Price

Symbol
MMM     189.09
ABT      45.00
ABBV     63.69
ACN     124.14
ATVI     48.06
         ...  
YHOO     45.73
YUM      64.02
ZBH     117.07
ZION     45.28
ZTS      53.07
Name: Price, Length: 505, dtype: float64

Note that this will not work for the Book Value column, as the name contains a space.

If you want to find the zero-based location of a column using the name of
the column (technically, the value of the index entry of the column), use the .get_loc()
method of the columns index:

In [27]:
# get the position of the column with the value of Price
loc = sp500.columns.get_loc('Price')
loc

1
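
As a small illustration of the two points above, the following sketch uses a two-row frame with values taken from the first rows of sp500. A column whose name contains a space cannot be written with the dotted syntax but is still reachable with [], and the position returned by .get_loc() can be passed to .iloc to select the same column by location.

```python
import pandas as pd

df = pd.DataFrame({'Price': [189.09, 45.00],
                   'Book Value': [17.26, 13.94]},
                  index=['MMM', 'ABT'])

# dotted access works only for names that are valid Python identifiers
price = df.Price
# a name with a space needs the [] operator
book = df['Book Value']

# look up the column position, then select that column by position
loc = df.columns.get_loc('Book Value')
by_position = df.iloc[:, loc]
print(loc)
print(by_position)
```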

In [28]:
sp500[['Price', 'Sector', 'Book Value']]

         Price                  Sector  Book Value
Symbol                                            
MMM     189.09             Industrials       17.26
ABT      45.00             Health Care       13.94
ABBV     63.69             Health Care        2.91
ACN     124.14  Information Technology       11.95
ATVI     48.06  Information Technology       12.23
...        ...                     ...         ...
YHOO     45.73  Information Technology       32.50
YUM      64.02  Consumer Discretionary      -15.93
ZBH     117.07             Health Care       48.20
ZION     45.28              Financials       34.10
ZTS      53.07             Health Care        3.02

[505 rows x 3 columns]

<h2>Selecting rows and values of a DataFrame
using the index</h2>

Elements of an array or Series are selected using the [] operator. DataFrame overloads []
to select columns instead of rows, except for the specific case of slicing. Therefore, most
operations that select one or more rows in a DataFrame require an alternative to
using [].

Understanding this is important in pandas, as a common mistake is to try to select rows
using [] due to familiarity with other languages or data structures. Doing so often
produces errors that can be difficult to diagnose without realizing that [] is working
along a completely different axis than with a Series object.
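
The difference in axis can be seen on a two-by-two frame. This is a minimal sketch with made-up labels: a single label passed to [] is looked up in the columns, a slice works along the rows, and a row label passed to [] raises a KeyError because it is not a column name.

```python
import pandas as pd

df = pd.DataFrame({'c1': [0, 2], 'c2': [1, 3]}, index=['r1', 'r2'])

# a single label passed to [] is looked up in the columns
col = df['c1']

# a slice passed to [] works along the rows instead
first_row = df[:1]

# a row label is not a column name, so this lookup fails
try:
    df['r1']
    raised = False
except KeyError:
    raised = True

print(col)
print(first_row)
print(raised)
```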

<h2>Slicing using the [] operator</h2>

Slicing a DataFrame across its index is syntactically identical to performing the same on a
Series. Because of this, we will not go into the details of the various permutations of
slices in this section, and only give representative examples applied to a DataFrame.

In [29]:
# first five rows
sp500[:5]

                        Sector   Price  Book Value
Symbol                                            
MMM                Industrials  189.09       17.26
ABT                Health Care   45.00       13.94
ABBV               Health Care   63.69        2.91
ACN     Information Technology  124.14       11.95
ATVI    Information Technology   48.06       12.23

In [30]:
# ABT through ACN labels
sp500['ABT':'ACN']

                        Sector   Price  Book Value
Symbol                                            
ABT                Health Care   45.00       13.94
ABBV               Health Care   63.69        2.91
ACN     Information Technology  124.14       11.95
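
One detail worth noticing in the two examples above is that a positional slice excludes its end point, while a label slice includes it. A short sketch on a four-row frame (prices copied from the rows shown earlier) makes the difference explicit:

```python
import pandas as pd

df = pd.DataFrame({'Price': [189.09, 45.00, 63.69, 124.14]},
                  index=['MMM', 'ABT', 'ABBV', 'ACN'])

# positions 1 and 2 only; position 3 is excluded
by_position = df[1:3]

# ABT through ACN; the end label is included
by_label = df['ABT':'ACN']

print(by_position)
print(by_label)
```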

<h2>Selecting rows by index label and location: .loc[]
and .iloc[]</h2>

In [31]:
# get row with label MMM
# returned as a Series
sp500.loc['MMM']

Sector        Industrials
Price              189.09
Book Value          17.26
Name: MMM, dtype: object

In [32]:
# rows with label MMM and MSFT
# this is a DataFrame result
sp500.loc[['MMM', 'MSFT']]

                        Sector   Price  Book Value
Symbol                                            
MMM                Industrials  189.09       17.26
MSFT    Information Technology   64.40        8.90

In [33]:
# get rows in locations 0 and 2
sp500.iloc[[0, 2]]

             Sector   Price  Book Value
Symbol                                 
MMM     Industrials  189.09       17.26
ABBV    Health Care   63.69        2.91

It is possible to look up the location in the index of a specific label value, which can then
be used to retrieve the row(s):

In [34]:
# get the location of MMM and MSFT in the index
i1 = sp500.index.get_loc('MMM')
i2 = sp500.index.get_loc('MSFT')
"{0} {1}".format(i1, i2)

'0 307'

In [35]:
# and get the rows
sp500.iloc[[i1, i2]]

                        Sector   Price  Book Value
Symbol                                            
MMM                Industrials  189.09       17.26
MSFT    Information Technology   64.40        8.90
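
The following sketch summarizes the lookups in this section on a three-row frame (values copied from the first rows of sp500). It shows that a single label passed to .loc yields a Series while a list of labels yields a DataFrame, and that a position obtained from .get_loc() selects the same row through .iloc.

```python
import pandas as pd

df = pd.DataFrame({'Price': [189.09, 45.00, 63.69]},
                  index=['MMM', 'ABT', 'ABBV'])

# a single label returns the row as a Series
row = df.loc['ABT']
# a list of labels returns a DataFrame, even for one label
rows = df.loc[['ABT']]

# translate a label into its position, then select by position
pos = df.index.get_loc('ABBV')
same_row = df.iloc[pos]

print(type(row).__name__, type(rows).__name__)
print(pos)
print(same_row)
```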

<h2>Scalar lookup by label or location using .at[] and
.iat[]</h2>

Scalar values can be looked up by label using .at, by passing both the row label and then
the column name/value:

In [36]:
# by label in both the index and column
sp500.at['MMM', 'Price']

189.09

Scalar values can also be looked up by location using .iat by passing both the row
location and then the column location. This is the preferred method of accessing single
values and gives the highest performance.

In [37]:
# by location. Row 0, column 1
sp500.iat[0, 1]

189.09
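
Both accessors can also be used on the left side of an assignment to change a single cell. The sketch below uses a two-row frame with values from the first rows of sp500. The replacement price of 46.00 is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Sector': ['Industrials', 'Health Care'],
                   'Price': [189.09, 45.00]},
                  index=['MMM', 'ABT'])

# row label, then column label
by_label = df.at['MMM', 'Price']
# row position, then column position
by_position = df.iat[0, 1]

# .at can also assign a single cell in place
df.at['ABT', 'Price'] = 46.00

print(by_label, by_position)
print(df)
```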

<h2>Selecting rows of a DataFrame by Boolean
selection</h2>

Rows can also be selected by using Boolean selection, using an array calculated from the
result of applying a logical condition on the values in any of the columns. This allows us
to build more complicated selections than those based simply upon index labels or
positions. Consider the following, which identifies all companies that have a price below 100.0.

In [38]:
sp500

                        Sector   Price  Book Value
Symbol                                            
MMM                Industrials  189.09       17.26
ABT                Health Care   45.00       13.94
ABBV               Health Care   63.69        2.91
ACN     Information Technology  124.14       11.95
ATVI    Information Technology   48.06       12.23
...                        ...     ...         ...
YHOO    Information Technology   45.73       32.50
YUM     Consumer Discretionary   64.02      -15.93
ZBH                Health Care  117.07       48.20
ZION                Financials   45.28       34.10
ZTS                Health Care   53.07        3.02

[505 rows x 3 columns]

In [39]:
# what rows have a price < 100?
sp500.Price < 100

Symbol
MMM     False
ABT      True
ABBV     True
ACN     False
ATVI     True
        ...  
YHOO     True
YUM      True
ZBH     False
ZION     True
ZTS      True
Name: Price, Length: 505, dtype: bool

The result is a Boolean Series that can be used to select the rows where the value is True,
exactly the same way as was done with a Series or a NumPy array:

In [40]:
# now get the rows with Price < 100
sp500[sp500.Price < 100]

                        Sector  Price  Book Value
Symbol                                           
ABT                Health Care  45.00       13.94
ABBV               Health Care  63.69        2.91
ATVI    Information Technology  48.06       12.23
AES                  Utilities  11.33        4.24
AFL                 Financials  71.95       50.47
...                        ...    ...         ...
XYL                Industrials  48.78       12.21
YHOO    Information Technology  45.73       32.50
YUM     Consumer Discretionary  64.02      -15.93
ZION                Financials  45.28       34.10
ZTS                Health Care  53.07        3.02

[364 rows x 3 columns]

In [41]:
# get only the Price where Price is < 10 and > 0
r = sp500[(sp500.Price < 10) & (sp500.Price > 0)][['Price']]
r

        Price
Symbol       
CHK      5.26
FTR      2.62
SWN      7.61
SPLS     8.78
XRX      7.36
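
Conditions are combined with the element-wise operators & (and), | (or), and ~ (not). Each comparison needs its own parentheses because & and | bind more tightly than comparison operators in Python. The following is a minimal sketch on a made-up four-row frame (the labels A to D are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'Price': [189.09, 45.00, 5.26, 2.62]},
                  index=['A', 'B', 'C', 'D'])

# both conditions must hold
cheap = df[(df.Price < 10) & (df.Price > 0)]

# either condition may hold
outside = df[(df.Price < 10) | (df.Price > 100)]

# negate a condition
not_cheap = df[~(df.Price < 10)]

print(cheap)
print(outside)
print(not_cheap)
```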

<h2>Modifying the structure and content of
DataFrame</h2>

The structure and content of a DataFrame can be mutated in several ways. Rows and
columns can be added and removed, and data within either can be modified to take on new
values. Additionally, columns, as well as index labels, can also be renamed. Each of these
will be described in the following sections.

<h3>Renaming columns</h3>

In [42]:
# rename the Book Value column to not have a space
# this returns a copy with the column renamed
df = sp500.rename(columns= {'Book Value': 'BookValue'})

# print first 2 rows
df[:2]

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  189.09      17.26
ABT     Health Care   45.00      13.94

In [43]:
# verify the columns in the original did not change
sp500.columns

Index(['Sector', 'Price', 'Book Value'], dtype='object')

To modify the DataFrame without making a copy, we can use the inplace=True parameter
to .rename():

In [44]:
sp500.rename(columns= {'Book Value': 'BookValue'},  inplace=True)

# we can see the column is changed
sp500.columns

Index(['Sector', 'Price', 'BookValue'], dtype='object')

In [45]:
# and now we can use .BookValue
sp500.BookValue[:5]

Symbol
MMM     17.26
ABT     13.94
ABBV     2.91
ACN     11.95
ATVI    12.23
Name: BookValue, dtype: float64
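
Index labels can be renamed with the same method by passing a mapping to its index parameter. The sketch below renames a column and a row label in one call on a small frame. The new label MMM_US is a made-up name for illustration. As with the column example above, the original object is left unchanged unless inplace=True is given.

```python
import pandas as pd

df = pd.DataFrame({'Book Value': [17.26, 13.94]}, index=['MMM', 'ABT'])

# rename one column and one index label; a new DataFrame is returned
renamed = df.rename(columns={'Book Value': 'BookValue'},
                    index={'MMM': 'MMM_US'})

print(renamed)
# the original is untouched
print(df)
```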

<h3>Adding and inserting columns</h3>

In [46]:
# make a copy
copy = sp500.copy()

# add a new column to the copy
copy['TwicePrice'] = sp500.Price * 2
copy

                        Sector   Price  BookValue  TwicePrice
Symbol                                                       
MMM                Industrials  189.09      17.26      378.18
ABT                Health Care   45.00      13.94       90.00
ABBV               Health Care   63.69       2.91      127.38
ACN     Information Technology  124.14      11.95      248.28
ATVI    Information Technology   48.06      12.23       96.12
...                        ...     ...        ...         ...
YHOO    Information Technology   45.73      32.50       91.46
YUM     Consumer Discretionary   64.02     -15.93      128.04
ZBH                Health Care  117.07      48.20      234.14
ZION                Financials   45.28      34.10       90.56
ZTS                Health Care   53.07       3.02      106.14

[505 rows x 4 columns]

<p>This process is actually selecting the Price column out of the sp500 object, then creating
another Series with each value of the Price multiplied by two. The DataFrame then
aligns this new Series by label, copies the data at the appropriate labels, and adds the
column at the end of the columns index.</p>
<p>If you want to add the column at a different location in the DataFrame object, instead of at
the rightmost position, use the .insert() method of the DataFrame. The following code
inserts the TwicePrice column between Price and BookValue:</p>

In [47]:
copy = sp500.copy()
# insert sp500.Price * 2 as the
# second column in the DataFrame
copy.insert(1, 'TwicePrice', sp500.Price * 2)
copy

                        Sector  TwicePrice   Price  BookValue
Symbol                                                       
MMM                Industrials      378.18  189.09      17.26
ABT                Health Care       90.00   45.00      13.94
ABBV               Health Care      127.38   63.69       2.91
ACN     Information Technology      248.28  124.14      11.95
ATVI    Information Technology       96.12   48.06      12.23
...                        ...         ...     ...        ...
YHOO    Information Technology       91.46   45.73      32.50
YUM     Consumer Discretionary      128.04   64.02     -15.93
ZBH                Health Care      234.14  117.07      48.20
ZION                Financials       90.56   45.28      34.10
ZTS                Health Care      106.14   53.07       3.02

[505 rows x 4 columns]

It is important to remember that this is not simply inserting a column into the DataFrame.
The alignment process used here is performing a left join of the DataFrame and the Series
by their index labels, and then creating the column and populating the data in the
appropriate cells in the DataFrame from matching entries in the Series. If an index label
in the DataFrame is not matched in the Series, the value used will be NaN. Items in the
Series that do not have a matching label will be ignored.

The following example demonstrates this operation:

In [48]:
# extract the first three rows and just the Price column
rcopy = sp500[0:3][['Price']].copy()
rcopy

         Price
Symbol        
MMM     189.09
ABT      45.00
ABBV     63.69

In [49]:
# create a new Series to merge as a column
# one label exists in rcopy (MMM), and MSFT does not
s = pd.Series( {'MMM': 'Is in the DataFrame', 'MSFT': 'Not in the DataFrame'} )
s

MMM      Is in the DataFrame
MSFT    Not in the DataFrame
dtype: object

In [50]:
# add s to rcopy as a column named 'Comment'
rcopy['Comment'] = s
rcopy

         Price              Comment
Symbol                             
MMM     189.09  Is in the DataFrame
ABT      45.00                  NaN
ABBV     63.69                  NaN

<h2>Replacing the contents of a column</h2>

In general, assignment of a Series to a column using the [] operator will either create a
new column if the column does not already exist, or replace the contents of a column if it
already exists. To demonstrate replacement, the following code replaces the Price column
with the result of the multiplication, instead of creating a new column:

In [51]:
copy = sp500.copy()
# replace the Price column data with the new values
# instead of adding a new column
copy.Price = sp500.Price * 2
copy[:5]

                        Sector   Price  BookValue
Symbol                                           
MMM                Industrials  378.18      17.26
ABT                Health Care   90.00      13.94
ABBV               Health Care  127.38       2.91
ACN     Information Technology  248.28      11.95
ATVI    Information Technology   96.12      12.23

To emphasize that this is also doing an alignment, we can change the sample slightly. The
following code only utilizes the prices from three of the first four rows. The remaining
502 symbols have no matching label to align with, resulting in NaN values:

In [52]:
# copy all 505 rows
copy = sp500.copy()

# this copies the prices at positions 3, 1 and 0, in that order
prices = sp500.iloc[[3, 1, 0]].Price.copy()

# examine the extracted prices
prices

Symbol
ACN    124.14
ABT     45.00
MMM    189.09
Name: Price, dtype: float64

In [53]:
# now replace the Price column with prices
copy.Price = prices
# it's not really simple insertion, it is alignment
# values are put in the correct place according to labels
copy

                        Sector   Price  BookValue
Symbol                                           
MMM                Industrials  189.09      17.26
ABT                Health Care   45.00      13.94
ABBV               Health Care     NaN       2.91
ACN     Information Technology  124.14      11.95
ATVI    Information Technology     NaN      12.23
...                        ...     ...        ...
YHOO    Information Technology     NaN      32.50
YUM     Consumer Discretionary     NaN     -15.93
ZBH                Health Care     NaN      48.20
ZION                Financials     NaN      34.10
ZTS                Health Care     NaN       3.02

[505 rows x 3 columns]
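
Alignment applies only when the assigned value carries an index. If the values should instead be placed by position, a plain NumPy array (or list) of the same length as the DataFrame can be assigned. The following sketch contrasts the two on a made-up three-row frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Price': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])

# a Series is aligned by label: order is irrelevant, gaps become NaN
s = pd.Series([30.0, 10.0], index=['c', 'a'])
aligned = df.copy()
aligned['Price'] = s

# an array has no labels, so values are placed by position
# and the length must match the number of rows
positional = df.copy()
positional['Price'] = np.array([10.0, 20.0, 30.0])

print(aligned)
print(positional)
```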

<h2>Deleting columns in a DataFrame</h2>

<p>Columns can be deleted from a DataFrame by using the del keyword, the pop(column)
method of the DataFrame, or by calling the drop() method of the DataFrame.</p>
<p>The behavior of each of these differs slightly:</p>
<ul>
<li>
del will simply delete the Series from the DataFrame (in-place)</li>
<li>pop() will both delete the Series and return the Series as a result (also in-place)</li>
<li>drop(labels, axis=1) will return a new DataFrame with the column(s) removed
(the original DataFrame object is not modified)</li>
</ul>
<p>The following code demonstrates using del to delete the BookValue column from a copy
of the sp500 data:</p>

In [54]:
# Example of using del to delete a column
# make a copy of a subset of the data frame
copy = sp500[:2].copy()
copy

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  189.09      17.26
ABT     Health Care   45.00      13.94

In [55]:
# delete the BookValue column
# deletion is in-place
del copy['BookValue']
copy

             Sector   Price
Symbol                     
MMM     Industrials  189.09
ABT     Health Care   45.00

The following code demonstrates using the .pop() method to remove a column:

In [56]:
# Example of using pop to remove a column from a DataFrame
# first make a copy of a subset of the data frame
# pop works in-place
copy = sp500[:2].copy()

# this will remove Sector and return it as a series
popped = copy.pop('Sector')

# Sector column removed in-place
copy

         Price  BookValue
Symbol                   
MMM     189.09      17.26
ABT      45.00      13.94

In [57]:
popped

Symbol
MMM    Industrials
ABT    Health Care
Name: Sector, dtype: object

The .drop() method can be used to remove both rows and columns. To use it to remove a
column, specify axis=1:

In [58]:
# Example of using drop to remove a column
# make a copy of a subset of the DataFrame
copy = sp500[:2].copy()

# this will return a new DataFrame with 'Sector' removed
# the copy DataFrame is not modified
afterdrop = copy.drop(['Sector'], axis = 1)
afterdrop

         Price  BookValue
Symbol                   
MMM     189.09      17.26
ABT      45.00      13.94

In [59]:
# Example of using drop to remove a row
# make a copy of a subset of the DataFrame
copy = sp500[:2].copy()

# this will return a new DataFrame with 'MMM' row removed
# the copy DataFrame is not modified
afterdrop = copy.drop(['MMM'], axis = 0)
afterdrop

             Sector  Price  BookValue
Symbol                               
ABT     Health Care   45.0      13.94
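
The three approaches can be compared side by side on one small frame. This is a minimal sketch with made-up column names a, b, and c. Only drop() leaves the original object unchanged.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# drop returns a new DataFrame; df still has all three columns
dropped = df.drop(['a'], axis=1)
columns_after_drop = list(df.columns)

# pop removes the column in place and returns it as a Series
popped = df.pop('b')

# del removes the column in place and returns nothing
del df['c']

print(dropped)
print(popped)
print(df)
```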

<h2>Adding rows to a DataFrame</h2>

Rows can be added to a DataFrame object via several different operations:
<ul>
<li>
Appending a DataFrame to another</li>
<li>Concatenation of two DataFrame objects</li>
<li>Setting with enlargement</li>
</ul>

<h3>Appending rows with .append()</h3>
<p>
Appending is performed using the .append() method of the DataFrame. The process of
appending returns a new DataFrame with the rows from the original DataFrame first,
followed by the rows from the second. <b>Appending does not perform alignment and can result in
    duplicate index values.</b></p>

In [60]:
# copy the first three rows of sp500
df1 = sp500.iloc[0:3].copy()
# copy the rows at positions 10, 11 and 2
df2 = sp500.iloc[[10, 11, 2]]

In [61]:
df1

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  189.09      17.26
ABT     Health Care   45.00      13.94
ABBV    Health Care   63.69       2.91

In [62]:
df2

             Sector   Price  BookValue
Symbol                                
AMG      Financials  166.56      63.84
AFL      Financials   71.95      50.47
ABBV    Health Care   63.69       2.91

In [63]:
# append df1 and df2
appended = df1.append(df2)
# the result is the rows of the first followed by
# those of the second
appended

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  189.09      17.26
ABT     Health Care   45.00      13.94
ABBV    Health Care   63.69       2.91
AMG      Financials  166.56      63.84
AFL      Financials   71.95      50.47
ABBV    Health Care   63.69       2.91

The DataFrame objects being appended do not need to have the same set of columns.
The resulting DataFrame will consist of the union of the columns in both, and where either
did not have a column, NaN will be used as the value. The following code demonstrates
this by creating a third DataFrame using the same index as df1, but having a single column
with a unique column name:

In [64]:
# DataFrame using df1.index and just a PER column
# also a good example of using a scalar value
# to initialize multiple rows
df3 = pd.DataFrame(0.0, index=df1.index, columns=['PER'])
df3

        PER
Symbol     
MMM     0.0
ABT     0.0
ABBV    0.0

In [65]:
df1

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  189.09      17.26
ABT     Health Care   45.00      13.94
ABBV    Health Care   63.69       2.91

In [66]:
# append df1 and df3
# each has three rows, so 6 rows is the result
# df1 had no PER column, so NaN for those rows
# df3 had no BookValue, Price or Sector, so NaN values
df1.append(df3)

             Sector   Price  BookValue  PER
Symbol                                     
MMM     Industrials  189.09      17.26  NaN
ABT     Health Care   45.00      13.94  NaN
ABBV    Health Care   63.69       2.91  NaN
MMM             NaN     NaN        NaN  0.0
ABT             NaN     NaN        NaN  0.0
ABBV            NaN     NaN        NaN  0.0

To append without forcing the index to be taken from either DataFrame, you can use the
ignore_index=True parameter. This is useful when the index values are not of significant
meaning, and you just want concatenated data with sequentially increasing integers as
indexes:

In [67]:
# ignore index labels, create default index
df1.append(df3, ignore_index=True)

        Sector   Price  BookValue  PER
0  Industrials  189.09      17.26  NaN
1  Health Care   45.00      13.94  NaN
2  Health Care   63.69       2.91  NaN
3          NaN     NaN        NaN  0.0
4          NaN     NaN        NaN  0.0
5          NaN     NaN        NaN  0.0
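
Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so the cells above only run on older versions. On current versions the same results can be produced with pd.concat(). The following sketch uses a two-row stand-in for df1, not the sp500 data.

```python
import pandas as pd

df1 = pd.DataFrame({'Price': [189.09, 45.00]}, index=['MMM', 'ABT'])
df3 = pd.DataFrame(0.0, index=df1.index, columns=['PER'])

# equivalent of df1.append(df3): rows stacked, index labels kept
stacked = pd.concat([df1, df3])

# equivalent of df1.append(df3, ignore_index=True)
renumbered = pd.concat([df1, df3], ignore_index=True)

print(stacked)
print(renumbered)
```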

<h3>Concatenating DataFrame objects with pd.concat()</h3>

A DataFrame can be concatenated to another using the pd.concat() function. This
function works similarly to the .append() method, but also adds the ability to specify
an axis (concatenation can be row or column based), as well as being able to perform several
join operations between the objects. Also, the function takes a list of pandas objects to
concatenate, so you can concatenate more than two objects in a single call.

The default operation of pd.concat() on two DataFrame objects operates in the same way
as the .append() method. This can be demonstrated by reconstructing the two datasets
from the earlier example and concatenating them. This is shown in the following example:

In [68]:
# copy the first three rows of sp500
df1 = sp500.iloc[0:3].copy()
df1

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  189.09      17.26
ABT     Health Care   45.00      13.94
ABBV    Health Care   63.69       2.91

In [69]:
# copy the rows at positions 10, 11 and 2
df2 = sp500.iloc[[10, 11, 2]]
df2

             Sector   Price  BookValue
Symbol                                
AMG      Financials  166.56      63.84
AFL      Financials   71.95      50.47
ABBV    Health Care   63.69       2.91

In [70]:
# copy the first three rows of sp500
df1 = sp500.iloc[0:3].copy()

# copy the rows at positions 10, 11 and 2
df2 = sp500.iloc[[10, 11, 2]]

# pass them as a list
pd.concat([df1, df2])

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  189.09      17.26
ABT     Health Care   45.00      13.94
ABBV    Health Care   63.69       2.91
AMG      Financials  166.56      63.84
AFL      Financials   71.95      50.47
ABBV    Health Care   63.69       2.91

A slight variant of this example adds an additional column to one of the DataFrame objects
and then performs the concatenation:

In [71]:
# copy df2
df2_2 = df2.copy()
# add a column to df2_2 that is not in df1
df2_2.insert(3, 'Foo', pd.Series(0, index=df2.index))
# see what it looks like
df2_2

             Sector   Price  BookValue  Foo
Symbol                                     
AMG      Financials  166.56      63.84    0
AFL      Financials   71.95      50.47    0
ABBV    Health Care   63.69       2.91    0

In [72]:
# now concatenate
pd.concat([df1, df2_2])

             Sector   Price  BookValue  Foo
Symbol                                     
MMM     Industrials  189.09      17.26  NaN
ABT     Health Care   45.00      13.94  NaN
ABBV    Health Care   63.69       2.91  NaN
AMG      Financials  166.56      63.84  0.0
AFL      Financials   71.95      50.47  0.0
ABBV    Health Care   63.69       2.91  0.0

Using the keys parameter, it is possible to differentiate the pandas objects from which the
rows originated. The following code adds a level to the index which represents the source
object:

In [73]:
# specify keys
r = pd.concat([df1, df2_2], keys=['df1', 'df2'])
r

                 Sector   Price  BookValue  Foo
    Symbol                                     
df1 MMM     Industrials  189.09      17.26  NaN
    ABT     Health Care   45.00      13.94  NaN
    ABBV    Health Care   63.69       2.91  NaN
df2 AMG      Financials  166.56      63.84  0.0
    AFL      Financials   71.95      50.47  0.0
    ABBV    Health Care   63.69       2.91  0.0
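
Once the keys are in place, the added index level can be used to pull the rows of one source back out with .loc. The sketch below uses two small stand-in frames with a single Price column.

```python
import pandas as pd

df1 = pd.DataFrame({'Price': [189.09, 45.00]}, index=['MMM', 'ABT'])
df2 = pd.DataFrame({'Price': [166.56, 63.69]}, index=['AMG', 'ABBV'])

# the keys become the outer level of a two-level index
r = pd.concat([df1, df2], keys=['df1', 'df2'])

# select every row that came from the second object
from_df2 = r.loc['df2']

# or a single cell, using (key, label) as the row label
mmm_price = r.loc[('df1', 'MMM'), 'Price']

print(r)
print(from_df2)
print(mmm_price)
```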

We can change the axis of the concatenation to work along the columns by specifying
axis=1, which will calculate the sorted union of the distinct index labels from the rows
and then append columns and their data from the specified objects.
To demonstrate, the following splits the sp500 data into two DataFrame objects, each with
a different set of columns, and then concatenates along axis=1:

In [74]:
# first three rows, columns 0 and 1
df3 = sp500[:3][['Sector', 'Price']]
df3

             Sector   Price
Symbol                     
MMM     Industrials  189.09
ABT     Health Care   45.00
ABBV    Health Care   63.69

In [75]:
# first three rows, column 2
df4 = sp500[:3][['BookValue']]
df4

        BookValue
Symbol           
MMM         17.26
ABT         13.94
ABBV         2.91

In [76]:
# put them back together
pd.concat([df3, df4], axis=1)

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  189.09      17.26
ABT     Health Care   45.00      13.94
ABBV    Health Care   63.69       2.91

In [77]:
# VS this
pd.concat([df3, df4], axis=0)

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  189.09        NaN
ABT     Health Care   45.00        NaN
ABBV    Health Care   63.69        NaN
MMM             NaN     NaN      17.26
ABT             NaN     NaN      13.94
ABBV            NaN     NaN       2.91

We can further examine this operation by adding a column to the second DataFrame whose
name duplicates that of a column in the first. The result will have duplicate columns, as
the columns are appended without regard to columns that already exist:

In [78]:
# make a copy of df4
df4_2 = df4.copy()
# add a column to df4_2, that is also in df3
df4_2.insert(1, 'Sector', pd.Series(1, index=df4_2.index))
df4_2

        BookValue  Sector
Symbol                   
MMM         17.26       1
ABT         13.94       1
ABBV         2.91       1

In [79]:
# demonstrate duplicate columns
pd.concat([df3, df4_2], axis=1)

             Sector   Price  BookValue  Sector
Symbol                                        
MMM     Industrials  189.09      17.26       1
ABT     Health Care   45.00      13.94       1
ABBV    Health Care   63.69       2.91       1

To be very specific, pandas is performing an outer join along the labels of the specified
axis. An inner join can be specified using the join='inner' parameter, which changes the
operation from being a sorted union of distinct labels to the distinct values of the
intersection of the labels. To demonstrate, the following selects two subsets of the
financial data with one row in common and performs an inner join:

In [80]:
# first three rows and first two columns
df5 = sp500[:3][['Sector', 'Price']]
df5

             Sector   Price
Symbol                     
MMM     Industrials  189.09
ABT     Health Care   45.00
ABBV    Health Care   63.69

In [81]:
# row 2 through 4 and first two columns
df6 = sp500[2:5][['Sector', 'Price']]
df6

                        Sector   Price
Symbol                                
ABBV               Health Care   63.69
ACN     Information Technology  124.14
ATVI    Information Technology   48.06

In [82]:
# inner join on index labels will result in only one row
pd.concat([df5, df6], join='inner', axis=1)

             Sector  Price       Sector  Price
Symbol                                        
ABBV    Health Care  63.69  Health Care  63.69
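
The join parameter applies to whichever axis is not being concatenated. As a small sketch under assumed data (the frames a and b are made up for illustration), concatenating along the rows with join='inner' keeps only the columns that both frames share:

```python
import pandas as pd

# hypothetical frames that share only the Price column
a = pd.DataFrame({'Sector': ['Industrials'], 'Price': [189.09]}, index=['MMM'])
b = pd.DataFrame({'Price': [166.56], 'Foo': [0]}, index=['AMG'])

# default outer join: union of the columns, with NaN where data is missing
outer = pd.concat([a, b])

# inner join: only the columns present in both frames survive
inner = pd.concat([a, b], join='inner')
inner
```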

<h3>Adding rows (and columns) via setting with enlargement</h3>

Rows can also be added to a DataFrame through the .loc property. This technique is
referred to as setting with enlargement. The parameter for .loc specifies the index label
where the row is to be placed. If the label does not exist, the values are appended to the
DataFrame using the given index label. If it does exist, then the values in the specified row
are replaced.


In [83]:
# get a small subset of the sp500
# copy the slice so that later changes do not affect sp500
ss = sp500[:3].copy()
ss

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  189.09      17.26
ABT     Health Care   45.00      13.94
ABBV    Health Care   63.69       2.91

In [84]:
# create a new row with index label FOO
# and assign some values to the columns via a list
ss.loc['FOO'] = ['the sector', 100, 110]
ss

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  189.09      17.26
ABT     Health Care   45.00      13.94
ABBV    Health Care   63.69       2.91
FOO      the sector  100.00     110.00

Note that the change is made in place. If FOO had already existed as an index label, the
values in that row would have been replaced. This is one of the means of updating data in a
DataFrame in-place, as .loc not only retrieves row(s), but also lets you modify the rows that
are selected.

It is also possible to add columns in this manner. The following code demonstrates by
adding a new column to a subset of sp500 using .loc. Note that we pass a colon in the row
position to select all rows, followed by the name of the new column:

In [85]:
# copy of subset / slice
ss = sp500[:3].copy()

# add the new column initialized to 0
ss.loc[:,'PER'] = 0

# take a look at the results
ss

             Sector   Price  BookValue  PER
Symbol                                     
MMM     Industrials  189.09      17.26    0
ABT     Health Care   45.00      13.94    0
ABBV    Health Care   63.69       2.91    0
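
The append-or-replace behavior described above can be checked on a small frame. This is a minimal sketch with made-up values; only the .loc assignment itself mirrors the text:

```python
import pandas as pd

# small illustrative frame (hypothetical subset)
ss = pd.DataFrame({'Sector': ['Industrials', 'Health Care'],
                   'Price': [189.09, 45.00]},
                  index=['MMM', 'ABT'])

# label FOO does not exist yet, so a row is appended
ss.loc['FOO'] = ['the sector', 100.0]
rows_after_append = len(ss)

# label FOO now exists, so the same statement replaces the row
ss.loc['FOO'] = ['another sector', 200.0]
rows_after_replace = len(ss)
```

The row count grows on the first assignment only; the second one overwrites the existing FOO row.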

<h2>Removing rows from a DataFrame</h2>

<p>Removing rows from a DataFrame object is normally performed using one of three
techniques:</p>
<ul>
<li>Using the .drop() method</li>
<li>Boolean selection</li>
<li>Selection using a slice</li>
</ul>
<p>Technically, only the .drop() method can remove rows in-place on the source object, and
only when it is called with inplace=True; by default it returns a copy. The other techniques
either create a copy without specific rows, or a view into the rows that are not to be
dropped. Details of each are given in the following sections.</p>

<h3>Removing rows using .drop()</h3>

To remove rows from a DataFrame by the index label, you can use the .drop() method of
the DataFrame. The .drop() method takes a list of index labels and will return a copy of
the DataFrame with the rows for the specified labels removed. The source DataFrame
remains unmodified.

In [86]:
# get a copy of the first 5 rows of sp500
ss = sp500[:5].copy()
ss

                        Sector   Price  BookValue
Symbol                                           
MMM                Industrials  189.09      17.26
ABT                Health Care   45.00      13.94
ABBV               Health Care   63.69       2.91
ACN     Information Technology  124.14      11.95
ATVI    Information Technology   48.06      12.23

In [87]:
# drop rows with labels ABT and ACN
afterdrop = ss.drop(['ABT', 'ACN'])
afterdrop

                        Sector   Price  BookValue
Symbol                                           
MMM                Industrials  189.09      17.26
ABBV               Health Care   63.69       2.91
ATVI    Information Technology   48.06      12.23
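
To make the difference between the default behavior and an in-place drop concrete, here is a short sketch on a hand-built frame (the data is illustrative; inplace is a standard parameter of .drop()):

```python
import pandas as pd

ss = pd.DataFrame({'Price': [189.09, 45.00, 63.69]},
                  index=['MMM', 'ABT', 'ABBV'])

# default: a new DataFrame is returned and ss is left alone
after = ss.drop(['ABT'])
unchanged_len = len(ss)

# inplace=True: ss itself is modified and the call returns None
ss.drop(['ABT'], inplace=True)
```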

<h3>Removing rows using Boolean selection</h3>

In [88]:
# determine the rows where Price > 300
selection = sp500.Price > 300
selection

Symbol
MMM     False
ABT     False
ABBV    False
ACN     False
ATVI    False
        ...  
YHOO    False
YUM     False
ZBH     False
ZION    False
ZTS     False
Name: Price, Length: 505, dtype: bool

In [89]:
# to make the output shorter, report the # of rows returned (505),
# and the sum of those where Price > 300 (which is 13)
"{0} {1}".format(len(selection), selection.sum())

'505 13'

We now know both the rows that match this criterion (the 13 with True values) and those
that do not (the other 492). To remove the rows now, select out the complement of the
previous result. This gives us a new DataFrame containing only the rows where we had a
False value from the previous selection:

In [90]:
# select the complement
withPriceLessThan300 = sp500[~selection]
withPriceLessThan300

                        Sector   Price  BookValue
Symbol                                           
MMM                Industrials  189.09      17.26
ABT                Health Care   45.00      13.94
ABBV               Health Care   63.69       2.91
ACN     Information Technology  124.14      11.95
ATVI    Information Technology   48.06      12.23
...                        ...     ...        ...
YHOO    Information Technology   45.73      32.50
YUM     Consumer Discretionary   64.02     -15.93
ZBH                Health Care  117.07      48.20
ZION                Financials   45.28      34.10
ZTS                Health Care   53.07       3.02

[492 rows x 3 columns]
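
The same pattern on a tiny frame makes the bookkeeping easy to verify. This is a sketch with invented data (the symbol XXX and its price are hypothetical):

```python
import pandas as pd

# hypothetical prices; XXX is a made-up symbol above the threshold
prices = pd.DataFrame({'Price': [189.09, 45.00, 350.00, 124.14]},
                      index=['MMM', 'ABT', 'XXX', 'ACN'])

# True for the rows we want to remove
selection = prices.Price > 300

# ~ inverts the Boolean Series, keeping every row that did not match
kept = prices[~selection]
kept
```

The number of True values plus the number of kept rows always equals the original row count.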

<h3>Removing rows using a slice</h3>

Slicing is also often used to remove records from a DataFrame. It is a process similar to
Boolean selection, where we select all of the rows except for the ones we want
deleted.

Suppose we want to remove all but the first three records from sp500. The slice to perform
this task is [:3]:

In [91]:
# get only the first three rows
onlyFirstThree = sp500[:3]
onlyFirstThree

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  189.09      17.26
ABT     Health Care   45.00      13.94
ABBV    Health Care   63.69       2.91

Remember that this result is a slice, so it may be a view into the DataFrame rather than a
copy. Data has not been removed from the sp500 object, and, depending on the pandas version,
changes to these three rows can change the data in sp500. To prevent this from occurring,
the proper action is to make a copy of the slice, as follows:

In [92]:
# first three, but a copy of them
onlyFirstThree = sp500[:3].copy()
onlyFirstThree

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  189.09      17.26
ABT     Health Care   45.00      13.94
ABBV    Health Care   63.69       2.91
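
A quick way to confirm that the copy is independent is to modify it and inspect the source. The frame below is a hand-built stand-in for sp500, used only for illustration:

```python
import pandas as pd

# stand-in for the source data (hypothetical frame)
source = pd.DataFrame({'Price': [189.09, 45.00, 63.69, 124.14]},
                      index=['MMM', 'ABT', 'ABBV', 'ACN'])

# slice, then copy, so the result owns its data
first_three = source[:3].copy()
first_three.loc['MMM', 'Price'] = 0.0

# the source still holds the original value and all four rows
source.loc['MMM', 'Price']
```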

<h2>Changing scalar values in a DataFrame</h2>

In [93]:
# get a subset / copy of the data
subset = sp500[:3].copy()
subset

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  189.09      17.26
ABT     Health Care   45.00      13.94
ABBV    Health Care   63.69       2.91

In [94]:
subset = sp500[:3].copy()
subset.loc['MMM', 'Price'] = 10
subset.loc['ABBV', 'Price'] = 20
subset

             Sector  Price  BookValue
Symbol                               
MMM     Industrials   10.0      17.26
ABT     Health Care   45.0      13.94
ABBV    Health Care   20.0       2.91

.loc may perform worse than .iloc, because the label values may first need to be mapped
to locations. The following example gets the location of the specific row and column
to be changed and then uses .iloc to execute the change (the example changes only one
price for brevity):

In [95]:
# subset of the first three rows
subset = sp500[:3].copy()


In [96]:
# get the location of the Price column
price_loc = sp500.columns.get_loc('Price')
price_loc

1

In [97]:
# get the location of the ABT row
abt_row_loc = sp500.index.get_loc('ABT')
abt_row_loc

1

In [98]:
# change the price
subset.iloc[abt_row_loc, price_loc] = 1000
subset

             Sector    Price  BookValue
Symbol                                 
MMM     Industrials   189.09      17.26
ABT     Health Care  1000.00      13.94
ABBV    Health Care    63.69       2.91
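
For single scalar values, pandas also provides the .at (label-based) and .iat (position-based) accessors. The sketch below uses a hand-built frame; whether these accessors are faster in a given situation is something to measure rather than assume:

```python
import pandas as pd

subset = pd.DataFrame({'Sector': ['Industrials', 'Health Care', 'Health Care'],
                       'Price': [189.09, 45.00, 63.69]},
                      index=['MMM', 'ABT', 'ABBV'])

# set one value by row and column label
subset.at['MMM', 'Price'] = 10.0

# set one value by integer position (row 1, the Price column)
price_loc = subset.columns.get_loc('Price')
subset.iat[1, price_loc] = 20.0
subset
```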

<h2>Arithmetic on a DataFrame</h2>

Arithmetic operations using scalar values will be applied to every element of a DataFrame.
To demonstrate, we will use a DataFrame object initialized with random values:

In [99]:
# set the seed to allow reproducible results
np.random.seed(123456)

# create the DataFrame
df = pd.DataFrame(np.random.randn(5, 4), columns=['A', 'B', 'C', 'D'])
df

          A         B         C         D
0  0.469112 -0.282863 -1.509059 -1.135632
1  1.212112 -0.173215  0.119209 -1.044236
2 -0.861849 -2.104569 -0.494929  1.071804
3  0.721555 -0.706771 -1.039575  0.271860
4 -0.424972  0.567020  0.276232 -1.087401

By default, any arithmetic operation will be applied across all rows and columns of a
DataFrame and will return a new DataFrame with the results (leaving the original
unchanged):

In [100]:
# multiply everything by 2
df * 2

          A         B         C         D
0  0.938225 -0.565727 -3.018117 -2.271265
1  2.424224 -0.346429  0.238417 -2.088472
2 -1.723698 -4.209138 -0.989859  2.143608
3  1.443110 -1.413542 -2.079150  0.543720
4 -0.849945  1.134041  0.552464 -2.174801

<p>When performing an operation between a DataFrame and a Series, pandas will align the
Series index along the DataFrame columns, performing what is referred to as a row-wise
broadcast.</p>
<p>The following example retrieves the first row of the DataFrame, and then subtracts this
from each row of the DataFrame. pandas is broadcasting the Series to each row of the
DataFrame, which aligns each series item with the DataFrame item of the same index label
and then applies the minus operator on the matched values:</p>

In [101]:
# get first row
s = df.iloc[0]
s

A    0.469112
B   -0.282863
C   -1.509059
D   -1.135632
Name: 0, dtype: float64

In [102]:
# subtract first row from every row of the DataFrame
diff = df - s
diff

          A         B         C         D
0  0.000000  0.000000  0.000000  0.000000
1  0.743000  0.109649  1.628267  0.091396
2 -1.330961 -1.821706  1.014129  2.207436
3  0.252443 -0.423908  0.469484  1.407492
4 -0.894085  0.849884  1.785291  0.048232

This also works when reversing the order by subtracting the DataFrame from the Series
object:

In [103]:
# subtract DataFrame from Series
diff2 = s - df
diff2

          A         B         C         D
0  0.000000  0.000000  0.000000  0.000000
1 -0.743000 -0.109649 -1.628267 -0.091396
2  1.330961  1.821706 -1.014129 -2.207436
3 -0.252443  0.423908 -0.469484 -1.407492
4  0.894085 -0.849884 -1.785291 -0.048232

The set of columns returned will be the union of the labels in the index of the Series
and the columns index of the DataFrame object. If a label in the result is missing from
either the Series or the DataFrame object, then the values in that column will be NaN.
The following code demonstrates this by creating a Series with an index representing a
subset of the columns in the DataFrame, but also with an additional label:

In [104]:
# B, C (copied, so that adding a label below does not touch s)
s2 = s[1:3].copy()
s2

B   -0.282863
C   -1.509059
Name: 0, dtype: float64

In [105]:
s2['E'] = 0
s2

B   -0.282863
C   -1.509059
E    0.000000
Name: 0, dtype: float64

In [106]:
# see how alignment is applied in math
df + s2

    A         B         C   D   E
0 NaN -0.565727 -3.018117 NaN NaN
1 NaN -0.456078 -1.389850 NaN NaN
2 NaN -2.387433 -2.003988 NaN NaN
3 NaN -0.989634 -2.548633 NaN NaN
4 NaN  0.284157 -1.232826 NaN NaN

An arithmetic operation between two DataFrame objects will align by both the column and
index labels. The following extracts a small portion of df and subtracts it from df. The
result demonstrates that the aligned values subtract to 0, while the others are set to NaN:

In [107]:
# get rows 1 through 3, and only the B and C columns
subframe = df[1:4][['B', 'C']]
subframe

          B         C
1 -0.173215  0.119209
2 -2.104569 -0.494929
3 -0.706771 -1.039575

In [108]:
# demonstrate the alignment of the subtraction
df - subframe

    A    B    C   D
0 NaN  NaN  NaN NaN
1 NaN  0.0  0.0 NaN
2 NaN  0.0  0.0 NaN
3 NaN  0.0  0.0 NaN
4 NaN  NaN  NaN NaN
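
When NaN is not the desired result for unmatched positions, the .sub() method accepts a fill_value that stands in for the missing operand. The following sketch uses a small deterministic frame rather than the random one above; the data is illustrative:

```python
import numpy as np
import pandas as pd

# values 0.0 .. 11.0 laid out in 3 rows and 4 columns
df = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=['A', 'B', 'C', 'D'])

# rows labelled 1 and 2, columns B and C
subframe = df.loc[1:2, ['B', 'C']]

# plain operator: positions missing from subframe become NaN
plain = df - subframe

# flex method: positions missing from subframe are treated as 0
filled = df.sub(subframe, fill_value=0)
filled
```

With fill_value=0 the matched cells still subtract to zero, while every other cell keeps its original value.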

Additional control of an arithmetic operation can be gained using the arithmetic methods
provided by the DataFrame object. These methods allow the axis of the operation to be
specified. The following demonstrates matching a Series against the row labels (axis=0)
by using the DataFrame object's .sub() method, subtracting the A column from every column:

In [109]:
# get the A column
a_col = df['A']
a_col

0    0.469112
1    1.212112
2   -0.861849
3    0.721555
4   -0.424972
Name: A, dtype: float64

In [110]:
df.sub(a_col, axis=0)

     A         B         C         D
0  0.0 -0.751976 -1.978171 -1.604745
1  0.0 -1.385327 -1.092903 -2.256348
2  0.0 -1.242720  0.366920  1.933653
3  0.0 -1.428326 -1.761130 -0.449695
4  0.0  0.991993  0.701204 -0.662428
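
The effect of the axis parameter is easiest to see with small integers. This is a minimal sketch on an illustrative two-by-two frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})

# axis=1: the Series index is matched to the column labels,
# which is what the plain - operator does
by_row = df.sub(df.iloc[0], axis=1)

# axis=0: the Series index is matched to the row labels,
# so column A is subtracted from every column
by_col = df.sub(df['A'], axis=0)
```

Here by_row subtracts the first row from every row, while by_col subtracts the A column from every column.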

<h2>Resetting and reindexing</h2>

A DataFrame can have its index reset by using the .reset_index() method. A common use of
this is to move the contents of a DataFrame object’s index into one or more columns. The
following code moves the symbols in the index of sp500 into a column and replaces the
index with a default integer index. The result is a new DataFrame, not an in-place update.
The code is as follows:

In [111]:
# reset the index, moving it into a column
reset_sp500 = sp500.reset_index()
reset_sp500

    Symbol                  Sector   Price  BookValue
0      MMM             Industrials  189.09      17.26
1      ABT             Health Care   45.00      13.94
2     ABBV             Health Care   63.69       2.91
3      ACN  Information Technology  124.14      11.95
4     ATVI  Information Technology   48.06      12.23
..     ...                     ...     ...        ...
500   YHOO  Information Technology   45.73      32.50
501    YUM  Consumer Discretionary   64.02     -15.93
502    ZBH             Health Care  117.07      48.20
503   ZION              Financials   45.28      34.10
504    ZTS             Health Care   53.07       3.02

[505 rows x 4 columns]

One or more columns can also be moved into the index. A common scenario is
exhibited by the reset_sp500 variable we just created, as this may have been data read in
from a file with the symbols in a column when we really would like them in the index. To do
this, we can utilize the .set_index() method. The following code moves Symbol into the
index of a new DataFrame:

In [112]:
# move the Symbol column into the index
reset_sp500.set_index('Symbol')

                        Sector   Price  BookValue
Symbol                                           
MMM                Industrials  189.09      17.26
ABT                Health Care   45.00      13.94
ABBV               Health Care   63.69       2.91
ACN     Information Technology  124.14      11.95
ATVI    Information Technology   48.06      12.23
...                        ...     ...        ...
YHOO    Information Technology   45.73      32.50
YUM     Consumer Discretionary   64.02     -15.93
ZBH                Health Care  117.07      48.20
ZION                Financials   45.28      34.10
ZTS                Health Care   53.07       3.02

[505 rows x 3 columns]
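
The two methods are inverses of each other when the index has a name. A small round-trip sketch on a hand-built frame (illustrative data) shows this:

```python
import pandas as pd

original = pd.DataFrame({'Price': [189.09, 45.00]},
                        index=pd.Index(['MMM', 'ABT'], name='Symbol'))

# Symbol moves out of the index and becomes an ordinary column
flat = original.reset_index()

# and moving it back restores the original layout
restored = flat.set_index('Symbol')
```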

A DataFrame can be made to conform to a new index using the .reindex() method. This method,
given a list of values representing the new index, will create a new DataFrame using the
specified values, and then align the data from the source into the new object. The following
code demonstrates this, by using a subset of sp500 and assigning a new index that contains a
subset of the existing labels and an additional label FOO:

In [113]:
# get first four rows
subset = sp500[:4].copy()
subset

                        Sector   Price  BookValue
Symbol                                           
MMM                Industrials  189.09      17.26
ABT                Health Care   45.00      13.94
ABBV               Health Care   63.69       2.91
ACN     Information Technology  124.14      11.95

In [114]:
# reindex to have MMM, ABBV, and FOO index labels
reindexed = subset.reindex(index=['MMM', 'ABBV', 'FOO'])
# note that ABT and ACN are dropped and FOO has NaN values
reindexed

             Sector   Price  BookValue
Symbol                                
MMM     Industrials  189.09      17.26
ABBV    Health Care   63.69       2.91
FOO             NaN     NaN        NaN

Reindexing can also be done upon the columns. The following reindexes the columns of
subset:

In [115]:
# reindex columns
subset.reindex(columns=['Price', 'BookValue', 'NewCol'])

         Price  BookValue  NewCol
Symbol                           
MMM     189.09      17.26     NaN
ABT      45.00      13.94     NaN
ABBV     63.69       2.91     NaN
ACN     124.14      11.95     NaN

Finally, a DataFrame can also be reindexed on rows and columns at the same time:

In [116]:
subset.reindex(index=['MMM', 'ABBV', 'FOO'], columns=['Price', 'BookValue', 'NewCol'])

         Price  BookValue  NewCol
Symbol                           
MMM     189.09      17.26     NaN
ABBV     63.69       2.91     NaN
FOO        NaN        NaN     NaN
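
If NaN is not wanted for labels that have no source data, .reindex() accepts a fill_value. A minimal sketch on an illustrative frame:

```python
import pandas as pd

subset = pd.DataFrame({'Price': [189.09, 63.69]}, index=['MMM', 'ABBV'])

# FOO has no source row, so it is filled with NaN by default
with_nan = subset.reindex(index=['MMM', 'FOO'])

# the same reindex, but new labels are filled with 0.0 instead
with_zero = subset.reindex(index=['MMM', 'FOO'], fill_value=0.0)
with_zero
```

In both results ABBV is dropped, because it is not in the requested index.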

<h2>Hierarchical indexing</h2>

<p>Hierarchical indexing is a feature of pandas that allows specifying two or more index
levels on an axis. The specification of multiple levels in an index allows for efficient
selection of subsets of data. A pandas index that has multiple levels of hierarchy is
referred to as a MultiIndex.</p>

<p>We can demonstrate creating a MultiIndex using the sp500 data. Suppose we want to
organize this data by both the Sector and Symbol. We can accomplish this with the
following code:</p>

In [117]:
# first, push symbol into a column
reindexed = sp500.reset_index()
reindexed

    Symbol                  Sector   Price  BookValue
0      MMM             Industrials  189.09      17.26
1      ABT             Health Care   45.00      13.94
2     ABBV             Health Care   63.69       2.91
3      ACN  Information Technology  124.14      11.95
4     ATVI  Information Technology   48.06      12.23
..     ...                     ...     ...        ...
500   YHOO  Information Technology   45.73      32.50
501    YUM  Consumer Discretionary   64.02     -15.93
502    ZBH             Health Care  117.07      48.20
503   ZION              Financials   45.28      34.10
504    ZTS             Health Care   53.07       3.02

[505 rows x 4 columns]

In [118]:
# and now index sp500 by sector and symbol
multi_fi = reindexed.set_index(['Sector', 'Symbol'])
multi_fi

                                Price  BookValue
Sector                 Symbol                   
Industrials            MMM     189.09      17.26
Health Care            ABT      45.00      13.94
                       ABBV     63.69       2.91
Information Technology ACN     124.14      11.95
                       ATVI     48.06      12.23
...                               ...        ...
                       YHOO     45.73      32.50
Consumer Discretionary YUM      64.02     -15.93
Health Care            ZBH     117.07      48.20
Financials             ZION     45.28      34.10
Health Care            ZTS      53.07       3.02

[505 rows x 2 columns]

We can now examine the .index property and check whether it is a MultiIndex object:

In [119]:
# the index is a MultiIndex
type(multi_fi.index)

pandas.core.indexes.multi.MultiIndex

In [120]:
# examine the index
print(multi_fi.index)

MultiIndex([(           'Industrials',  'MMM'),
            (           'Health Care',  'ABT'),
            (           'Health Care', 'ABBV'),
            ('Information Technology',  'ACN'),
            ('Information Technology', 'ATVI'),
            (           'Industrials',  'AYI'),
            ('Information Technology', 'ADBE'),
            ('Consumer Discretionary',  'AAP'),
            (             'Utilities',  'AES'),
            (           'Health Care',  'AET'),
            ...
            (             'Utilities',  'XEL'),
            ('Information Technology',  'XRX'),
            ('Information Technology', 'XLNX'),
            (            'Financials',   'XL'),
            (           'Industrials',  'XYL'),
            ('Information Technology', 'YHOO'),
            ('Consumer Discretionary',  'YUM'),
            (           'Health Care',  'ZBH'),
            (            'Financials', 'ZION'),
            (           'Health Care',  'ZTS')],
           names=['Sect

A MultiIndex contains two or more levels:

In [121]:
# this has two levels
len(multi_fi.index.levels)

2

Also, each level is a distinct Index object:

In [122]:
# each index level is an index
multi_fi.index.levels[0]

Index(['Consumer Discretionary', 'Consumer Staples', 'Energy', 'Financials',
       'Health Care', 'Industrials', 'Information Technology', 'Materials',
       'Real Estate', 'Telecommunications Services', 'Utilities'],
      dtype='object', name='Sector')

In [123]:
# each index level is an index
multi_fi.index.levels[1]

Index(['A', 'AAL', 'AAP', 'AAPL', 'ABBV', 'ABC', 'ABT', 'ACN', 'ADBE', 'ADI',
       ...
       'XLNX', 'XOM', 'XRAY', 'XRX', 'XYL', 'YHOO', 'YUM', 'ZBH', 'ZION',
       'ZTS'],
      dtype='object', name='Symbol', length=505)

Values of the index, at a specific level for every row, can be retrieved by the
.get_level_values() method:

In [124]:
# values of the index level 0
multi_fi.index.get_level_values(0)

Index(['Industrials', 'Health Care', 'Health Care', 'Information Technology',
       'Information Technology', 'Industrials', 'Information Technology',
       'Consumer Discretionary', 'Utilities', 'Health Care',
       ...
       'Utilities', 'Information Technology', 'Information Technology',
       'Financials', 'Industrials', 'Information Technology',
       'Consumer Discretionary', 'Health Care', 'Financials', 'Health Care'],
      dtype='object', name='Sector', length=505)

Access of elements via a hierarchical index is performed using the .xs() method.

In [125]:
# get all stocks that are Industrials
# note the result drops level 0 of the index
multi_fi.xs('Industrials')

         Price  BookValue
Symbol                   
MMM     189.09      17.26
AYI     205.41      39.50
ALK      95.12      23.77
ALLE     72.86       1.19
AAL      44.84       7.46
...        ...        ...
URI     128.15      19.57
UTX     112.28      34.10
VRSK     81.84       7.98
WM       72.70      12.06
XYL      48.78      12.21

[66 rows x 2 columns]

To select the rows with a specific value of the index at level 1, use the level parameter.

In [126]:
# select rows where level 1 (Symbol) is ALLE
# note that the Sector level is dropped from the result
multi_fi.xs('ALLE', level=1)

             Price  BookValue
Sector                       
Industrials  72.86       1.19

To prevent levels from being dropped, you can use the drop_level=False option:

In [127]:
# Industrials, without dropping the level
multi_fi.xs('Industrials', drop_level=False)

                     Price  BookValue
Sector      Symbol                   
Industrials MMM     189.09      17.26
            AYI     205.41      39.50
            ALK      95.12      23.77
            ALLE     72.86       1.19
            AAL      44.84       7.46
...                    ...        ...
            URI     128.15      19.57
            UTX     112.28      34.10
            VRSK     81.84       7.98
            WM       72.70      12.06
            XYL      48.78      12.21

[66 rows x 2 columns]

To select from a hierarchy of indexes you can chain .xs() calls with different levels
together. The following code selects the row with Industrials at level 0 and UPS at level
1:

In [128]:
# drill through the levels
multi_fi.xs('Industrials').xs('UPS')

Price        105.64
BookValue      0.47
Name: UPS, dtype: float64

An alternate syntax is to pass the values of each level of the hierarchical index as a tuple:

In [129]:
# drill through using tuples
multi_fi.xs(('Industrials', 'UPS'))

Price        105.64
BookValue      0.47
Name: (Industrials, UPS), dtype: float64

Note that .xs() can only be used for getting, not setting, values.
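
When a value does need to be set, .loc accepts the same kind of tuple key and supports assignment. The sketch below builds a small two-level frame by hand (illustrative rows) and sorts the index first, which is a common precaution with a MultiIndex:

```python
import pandas as pd

multi = pd.DataFrame(
    {'Price': [189.09, 105.64, 45.00]},
    index=pd.MultiIndex.from_tuples(
        [('Industrials', 'MMM'), ('Industrials', 'UPS'), ('Health Care', 'ABT')],
        names=['Sector', 'Symbol'])
).sort_index()

# reading: .xs() and .loc agree for a full tuple key
via_xs = multi.xs(('Industrials', 'UPS'))['Price']
via_loc = multi.loc[('Industrials', 'UPS'), 'Price']

# writing: only .loc can be used on the left-hand side
multi.loc[('Industrials', 'UPS'), 'Price'] = 100.0
```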

<strong>Note</strong>
<p>
One of the things I’d like to point out about indexing in pandas is that a pandas index is
its own set of data, not a set of references to data in the Series or DataFrame. This is
different from how indexes are used in SQL databases, where the index is built upon the actual
data in the table. The values in a pandas index can be completely different from the data in the
row that it references, and it can be changed as needed to support much more interactive
analysis than can be done with SQL.</p>

<h2>Summarized data and descriptive statistics</h2>

<p>Pandas provides several classes of statistical operations that can be applied to a Series or
DataFrame object. These reductive methods, when applied to a Series, result in a single
value. When applied to a DataFrame, an axis can be specified and the method will then be
either applied to each column or row and results in a Series.</p>
<p>The average value is calculated using .mean(). The following calculates the average of the
prices in a four-row subset of sp500:</p>

In [130]:
subset = sp500[:4].copy()
subset

                        Sector   Price  BookValue
Symbol                                           
MMM                Industrials  189.09      17.26
ABT                Health Care   45.00      13.94
ABBV               Health Care   63.69       2.91
ACN     Information Technology  124.14      11.95

Applied to a single column, .mean() returns a single value:

In [131]:
subset.Price.mean()

105.47999999999999

In [132]:
# calc the mean of the numeric values in each row
subset.mean(axis=1, numeric_only=True)

Symbol
MMM     103.175
ABT      29.470
ABBV     33.300
ACN      68.045
dtype: float64
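
The direction of the axis parameter is easy to mix up, so here is a small sketch with round numbers (the symbols AAA and BBB and their values are made up):

```python
import pandas as pd

# hypothetical all-numeric frame
nums = pd.DataFrame({'Price': [100.0, 50.0], 'BookValue': [20.0, 10.0]},
                    index=['AAA', 'BBB'])

# axis=0 (the default): one result per column
per_column = nums.mean()

# axis=1: one result per row
per_row = nums.mean(axis=1)
```

per_column is indexed by the column names and per_row by the row labels.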

Variance is calculated using the .var() method. The following code calculates the
variance of each numeric column in the subset:

In [133]:
subset.var(numeric_only=True)

Price        4247.687400
BookValue      37.706967
dtype: float64

In [134]:
# calc the median of the values in each numeric column
subset.median(numeric_only=True)

Price        93.915
BookValue    12.945
dtype: float64

The minimum and maximum values of each column can be found with the .min() and .max()
methods. For a column of strings such as Sector, the comparison is lexicographic:

In [135]:
subset.min()

Sector       Health Care
Price                 45
BookValue           2.91
dtype: object

In [136]:
subset.max()

Sector       Information Technology
Price                        189.09
BookValue                     17.26
dtype: object

Some pandas statistical methods are referred to as indirect statistics. For example,
.idxmin() and .idxmax() return the index labels where the minimum and maximum
values exist, respectively. The following code determines the labels of the minimum and
maximum prices in the subset:

In [137]:
subset.Price.idxmin()

'ABT'

In [138]:
subset.Price.idxmax()

'MMM'
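
The label returned by .idxmin() can be fed straight back into the Series, and .argmin() gives the integer position instead. A short sketch using the four prices shown above:

```python
import pandas as pd

prices = pd.Series([189.09, 45.00, 63.69, 124.14],
                   index=['MMM', 'ABT', 'ABBV', 'ACN'])

# index label of the smallest value
label = prices.idxmin()

# integer position of the smallest value
position = prices.argmin()

# the label leads back to the minimum itself
lowest = prices[label]
```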

In [139]:
subset

                        Sector   Price  BookValue
Symbol                                           
MMM                Industrials  189.09      17.26
ABT                Health Care   45.00      13.94
ABBV               Health Care   63.69       2.91
ACN     Information Technology  124.14      11.95

The mode of a set of values is the value that appears most often. It can be multiple values.


In [140]:
# find the mode of this Series
s = pd.Series([1, 2, 3, 3, 5])
s.mode()

0    3
dtype: int64

In [141]:
# there can be more than one mode
s = pd.Series([1, 2, 3, 3, 5, 1])
s.mode()

0    1
1    3
dtype: int64

Accumulations in pandas are statistical methods that determine a value by continuously
applying the next value in a Series to the current result. Good examples are the
cumulative product and cumulative sum of a Series. To demonstrate, the
following code calculates both on a simple Series of data:

In [142]:
# calculate a cumulative product
pd.Series([1, 2, 3, 4]).cumprod()

0     1
1     2
2     6
3    24
dtype: int64

In [143]:
# calculate a cumulative sum
pd.Series([1, 2, 3, 4]).cumsum()

0     1
1     3
2     6
3    10
dtype: int64
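
One way to sanity-check an accumulation is to compare its final element with the corresponding one-shot reduction. A minimal sketch on the same four values:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

running_sum = s.cumsum()
running_prod = s.cumprod()

# the last running value equals the plain reduction over the whole Series
last_sum = running_sum.iloc[-1]
last_prod = running_prod.iloc[-1]
```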

<p>The .describe() method returns a simple set of summary statistics about a Series or
DataFrame. For a Series, the result is itself a Series whose index labels are the names of
the statistics that are computed; for a DataFrame, the result is a DataFrame with one column
per summarized column. This function is handy if you want to get a quick
and easy overview of the important statistics of a Series or DataFrame.</p>

<p>The following code returns summary statistics on the subset of stock data, including the
count of items that are not NaN; the mean and standard deviation; minimum and
maximum values; and the values of the 25th, 50th, and 75th percentiles. The code is as follows:</p>

In [144]:
# summary statistics
subset.describe()

            Price  BookValue
count    4.000000     4.0000
mean   105.480000    11.5150
std     65.174285     6.1406
min     45.000000     2.9100
25%     59.017500     9.6900
50%     93.915000    12.9450
75%    140.377500    14.7700
max    189.090000    17.2600

In [145]:
# get summary stats on non-numeric data
s = pd.Series(['a', 'a', 'b', 'c', np.nan])
s.describe()

count     4
unique    3
top       a
freq      2
dtype: object

This has given us the count of items that are not NaN, the number of
unique items that are not NaN, the most common item (top), and the number of
times the most frequent item occurred (freq).

In [146]:
s

0      a
1      a
2      b
3      c
4    NaN
dtype: object

In [147]:
# count the items that are not NaN
s.count()

4

In [148]:
# return an array of the unique items (NaN is included)
s.unique()

array(['a', 'b', 'c', nan], dtype=object)

In [149]:
# number of occurrences of each unique value
s.value_counts()

a    2
c    1
b    1
dtype: int64
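
Note that NaN appears in the result of .unique() but not in the result of .value_counts(). A short sketch of how the dropna parameter changes that, using the same Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', 'a', 'b', 'c', np.nan])

# default: NaN is left out of the tally
without_nan = s.value_counts()

# dropna=False: NaN gets its own entry
with_nan = s.value_counts(dropna=False)

# number of distinct non-NaN values
distinct = s.nunique()
```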