### Pandas Views and Copies

In [1]:
import pandas as pd
import numpy as np

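In pandas, whether an indexing operation gives you a view or a copy depends on the structure of the DataFrame and, if you are modifying a slice, on the type of modification you make. The behavior shown below is that of pandas without Copy-on-Write; with Copy-on-Write enabled (optional in pandas 2.x and the default from pandas 3.0), a slice never reflects later changes to its parent.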

To illustrate, below is an example where a slice returns a view, such that changes in the original dataframe df propagate to the slice my_slice:



In [2]:
df = pd.DataFrame({'a':np.arange(4), 'b':np.arange(4)})
df

Unnamed: 0,a,b
0,0,0
1,1,1
2,2,2
3,3,3


In [3]:
my_slice = df.iloc[1:3,]
my_slice

Unnamed: 0,a,b
1,1,1
2,2,2


In [4]:
df.iloc[1,1] = -1 # make an assignment (update) to the df
df

Unnamed: 0,a,b
0,0,0
1,1,-1
2,2,2
3,3,3


In [5]:
my_slice # the slice displays the update

Unnamed: 0,a,b
1,1,-1
2,2,2


In [6]:
df.iloc[1,0] = 3.14  # make a different change to the df
df

Unnamed: 0,a,b
0,0.0,0
1,3.14,-1
2,2.0,2
3,3.0,3


In [7]:
my_slice  # the change is not reflected in the slice

Unnamed: 0,a,b
1,1,-1
2,2,2


The reason for this difference in behavior is that in the first modification one integer replaced another, so the operation could be done in the existing integer array.

In the second case, a floating point number was assigned into an integer array. This triggered the creation of a new floating point array, which replaced the old one as column a in the original DataFrame, breaking the “view” connection.
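
The underlying mechanism can be sketched with plain NumPy arrays, which keeps the illustration independent of the pandas version. This is only an analogy for what happens to the column's backing array; the names `ints`, `window`, and `floats` are made up for the example.

```python
import numpy as np

ints = np.arange(4)          # integer buffer, playing the role of column 'a'
window = ints[1:3]           # basic NumPy slicing returns a view

ints[1] = -1                 # same dtype: written into the existing buffer
print(window)                # the view sees the write

floats = ints.astype(float)  # holding 3.14 requires a new float buffer
floats[1] = 3.14

print(np.shares_memory(ints, window))    # True: still tied to the old buffer
print(np.shares_memory(floats, window))  # False: the new buffer is unrelated
```

Once the column is backed by `floats` instead of `ints`, anything that still points at the old integer buffer no longer follows updates.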

 This behavior applies to column slices as well.

In [8]:
df

Unnamed: 0,a,b
0,0.0,0
1,3.14,-1
2,2.0,2
3,3.0,3


In [9]:
column_a = df['a']
df.iloc[0,0] = -100 # this change propagates to the view
column_a

0   -100.00
1      3.14
2      2.00
3      3.00
Name: a, dtype: float64

In [10]:
# But this does not
df.iloc[0,0] = "a"
df

Unnamed: 0,a,b
0,a,0
1,3.14,-1
2,2,2
3,3,3


In [11]:
column_a

0   -100.00
1      3.14
2      2.00
3      3.00
Name: a, dtype: float64

To help address this issue, pandas has a built-in alert system to inform you if you try to modify something that might be a view. For example:

In [12]:
df = pd.DataFrame({'a':np.arange(4), 'b':['w', 'x', 'y', 'z']})
df

Unnamed: 0,a,b
0,0,w
1,1,x
2,2,y
3,3,z


In [13]:
my_slice = df.iloc[1:3,]
my_slice

Unnamed: 0,a,b
1,1,x
2,2,y


In [14]:
my_slice.iloc[0,1] = 2

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  isetter(loc, value)


This alert is meant to inform you whenever you’re making a modification to something that might (or might not) be a view. 

Generally speaking, whenever you see this warning, the solution is to make a copy of the thing that might be a view, so that you know it is not one.

In [15]:
my_slice = my_slice.copy()
my_slice.iloc[0,1] = 2
my_slice

Unnamed: 0,a,b
1,1,2
2,2,y


Changing the datatype of a column is not the only situation in which a column can lose its “view-ness”.

In the examples above, each column was its own object, and so behaved independently. But this is not always the case in pandas. If a DataFrame is created from a single numpy matrix with multiple columns, pandas will try to be efficient by just keeping that matrix intact.

As a result, if you do something to one of the columns tied to that matrix (like change its type), pandas will create new arrays to back all the columns that were once tied to the matrix. A view of a single column can therefore stop being a view due to changes to a different column. For example:

In [16]:
my_matrix = np.arange(6).reshape(3,2)
my_matrix

array([[0, 1],
       [2, 3],
       [4, 5]])

In [17]:
df = pd.DataFrame(my_matrix, columns=['a', 'b'])
df

Unnamed: 0,a,b
0,0,1
1,2,3
2,4,5


In [18]:
# column_a starts its life as a view
column_a = df['a']
column_a

0    0
1    2
2    4
Name: a, dtype: int32

In [19]:
df.iloc[0, 0] = -100 # The change propagates to the view
column_a

0   -100
1      2
2      4
Name: a, dtype: int32

In [20]:
# Now make a change to column b...
df.loc[0, 'b'] = "eggs"
df

Unnamed: 0,a,b
0,-100,eggs
1,2,3
2,4,5


In [21]:
df.iloc[0, 0] = 52 # we do not see this update reflected in the view
column_a

0   -100
1      2
2      4
Name: a, dtype: int32

If in doubt about whether you are working with a view or a copy of a subset of a DataFrame, it is best to explicitly make a copy with .copy().
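
As a minimal sketch of that advice, the snippet below takes a subset with `.copy()` and then modifies both objects. The variable name `independent` is made up for the example; the point is only that neither change leaks into the other object.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(4), 'b': np.arange(4)})

# .copy() gives the subset its own data, so neither object can affect the other
independent = df.iloc[1:3].copy()

independent.iloc[0, 0] = 99   # change the copy
df.iloc[2, 1] = -5            # change the original

print(df)
print(independent)
```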

### Hierarchical indexing

This section covers MultiIndex objects; considerations when indexing, slicing, and computing statistics across multiply indexed data; and useful routines for converting between simple and hierarchically indexed representations of your data. We start with a Series indexed by plain Python tuples:

In [22]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

With this indexing scheme, you can index or slice the series based on this multiple index:

In [23]:
pop[('California', 2010):('Texas', 2000)]

(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64

Suppose we wanted to select all values from 2010. With tuple keys, we would need a more complicated, messy (and potentially slow) approach:

In [24]:
pop[[i for i in pop.index if i[1] == 2010]]

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64

The Pandas MultiIndex makes such indexing easier. 

We can create a multi-index from the tuples as follows:

In [25]:
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

This MultiIndex contains multiple levels of indexing (in this case, the state names and the years), as well as multiple labels for each data point which encode these levels.

If we re-index our series with this MultiIndex, we can see the hierarchical representation of the data:

In [26]:
pop = pop.reindex(index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

Here the first two columns of the Series representation show the multiple index values.

The third column shows the data. 

Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.

Now to access all data for which the second index is 2010, we can simply use the Pandas slicing notation:

In [27]:
pop[:, 2010]

California    37253956
New York      19378102
Texas         25145561
dtype: int64

In [28]:
pop.loc[:, 2010]

California    37253956
New York      19378102
Texas         25145561
dtype: int64
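
The same selection can also be written with the `xs` (cross-section) method, which takes the key and the level to match it against. The sketch below rebuilds the population Series so that it runs on its own; `by_loc` and `by_xs` are names made up for the comparison.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

by_loc = pop.loc[:, 2010]       # slicing notation, as above
by_xs = pop.xs(2010, level=1)   # cross-section on the second index level

print(by_xs)
```

Both forms drop the level that was selected on, leaving a Series indexed by state only.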

We could have stored the same data using a simple DataFrame with index and column labels. 

Pandas is built with this equivalence in mind. 

The unstack() method will convert a multiply indexed Series into a conventionally indexed DataFrame:

In [29]:
pop_df = pop.unstack()
pop_df

Unnamed: 0,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


The stack() method provides the opposite operation:

In [30]:
pop_df.stack()

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
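
A quick way to convince yourself that `unstack()` and `stack()` are inverses for this data is a round trip. The sketch below rebuilds the Series so that it is self-contained; `wide` and `long_again` are names made up for the example. Depending on the pandas version, `stack()` may emit a warning about its newer implementation, which does not affect this result.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

wide = pop.unstack()        # years become columns
long_again = wide.stack()   # columns go back into the index

print(wide.shape)
print(long_again.tolist() == pop.tolist())
```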

Multi-indexing can represent two-dimensional data within a one-dimensional Series. 

We can also use it to represent data of three or more dimensions in a Series or DataFrame. Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property provides more flexibility in the types of data we can represent.
 
For example, we might want to add another column of demographic data for each state at each year (say, population under 18):

In [31]:
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df

Unnamed: 0,Unnamed: 1,total,under18
California,2000,33871648,9267089
California,2010,37253956,9284094
New York,2000,18976457,4687374
New York,2010,19378102,4318033
Texas,2000,20851820,5906301
Texas,2010,25145561,6879014


All the ufuncs and other functionality in Pandas work with hierarchical indices.

Here we compute the fraction of people under 18 by year, given the above data:

In [32]:
f_u18 = pop_df['under18'] / pop_df['total']
f_u18

California  2000    0.273594
            2010    0.249211
New York    2000    0.247010
            2010    0.222831
Texas       2000    0.283251
            2010    0.273568
dtype: float64

In [33]:
f_u18.unstack()

Unnamed: 0,2000,2010
California,0.273594,0.249211
New York,0.24701,0.222831
Texas,0.283251,0.273568


The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor. For example:

In [34]:
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df


Unnamed: 0,Unnamed: 1,data1,data2
a,1,0.425836,0.698968
a,2,0.008803,0.38735
b,1,0.586552,0.837918
b,2,0.593063,0.906752


Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default:

In [35]:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

You can also construct the MultiIndex explicitly from a simple list of arrays giving the index values within each level:

In [36]:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

You can construct it from a list of tuples giving the multiple index values of each point:

In [37]:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

You can even construct it from a Cartesian product of single indices:

In [38]:
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )
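
As a small check, the three constructors used above can be compared directly with `MultiIndex.equals`. The variable names are made up for the example.

```python
import pandas as pd

from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# All three describe the same four (letter, number) pairs in the same order
print(from_arrays.equals(from_tuples))
print(from_tuples.equals(from_product))
```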

Sometimes it is convenient to name the levels of the MultiIndex. This can be accomplished by passing the names argument to any of the above MultiIndex constructors, or by setting the names attribute of the index after the fact:

In [39]:
pop.index.names = ['state', 'year']
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
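
For the first option, a sketch of passing `names` directly to a constructor is shown below, together with `get_level_values`, which accepts a level name once the levels are named. The Series is rebuilt here so that the snippet runs on its own.

```python
import pandas as pd

named = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]],
                                   names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=named)

print(list(pop.index.names))

# Named levels can be referred to by name instead of by position
years = pop.index.get_level_values('year')
print(years.tolist())
```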

In a DataFrame, the rows and columns are completely symmetric: just as the rows can have multiple levels of indices, the columns can have multiple levels as well. Consider the following, which is a mock-up of some medical data:

In [40]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['John', 'Bob', 'Mary'], ['HR', 'Temp']],
                                     names=['subject', 'diagnostic'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

Unnamed: 0_level_0,subject,John,John,Bob,Bob,Mary,Mary
Unnamed: 0_level_1,diagnostic,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,36.0,35.5,34.0,38.4,35.0,36.9
2013,2,23.0,35.4,35.0,37.2,33.0,36.4
2014,1,29.0,37.7,41.0,36.8,29.0,37.9
2014,2,57.0,38.7,31.0,37.1,49.0,38.1


Here we see multi-indexing for both rows and columns.

This is fundamentally four-dimensional data, where the dimensions are the subject, the diagnostic measurement, the year, and the visit number. 

With this in place we can, for example, index the top-level column by the person's name and get a full DataFrame containing just that person's information:

In [41]:
health_data['Bob']

Unnamed: 0_level_0,diagnostic,HR,Temp
year,visit,Unnamed: 2_level_1,Unnamed: 3_level_1
2013,1,34.0,38.4
2013,2,35.0,37.2
2014,1,41.0,36.8
2014,2,31.0,37.1


### Indexing and Slicing a MultiIndex

Indexing and slicing on a MultiIndex is designed to be intuitive, and it helps if you think about the indices as added dimensions. We'll first look at indexing multiply indexed Series, and then multiply-indexed DataFrames.

In [42]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [43]:
pop['California', 2000]

33871648

The MultiIndex also supports partial indexing, or indexing just one of the levels in the index. The result is another Series, with the lower-level indices maintained:

In [44]:
pop['California']

year
2000    33871648
2010    37253956
dtype: int64

Partial slicing is available as well, as long as the MultiIndex is sorted:

In [45]:
pop.loc['California':'New York']

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
dtype: int64

A multiply indexed DataFrame behaves in a similar manner. Consider our toy medical DataFrame from before:

In [46]:
health_data['Bob','HR']

year  visit
2013  1        34.0
      2        35.0
2014  1        41.0
      2        31.0
Name: (Bob, HR), dtype: float64

Also, as with the single-index case, we can use the loc and iloc indexers:

In [47]:
health_data.iloc[:2, :2]

Unnamed: 0_level_0,subject,John,John
Unnamed: 0_level_1,diagnostic,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2
2013,1,36.0,35.5
2013,2,23.0,35.4


These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in loc or iloc can be passed a tuple of multiple indices. For example:

In [48]:
health_data.loc[:, ('Bob', 'HR')]

year  visit
2013  1        34.0
      2        35.0
2014  1        41.0
      2        31.0
Name: (Bob, HR), dtype: float64

Working with slices within these index tuples is not especially convenient; trying to create a slice within a tuple will lead to a syntax error:

In [49]:
health_data.loc[(:, 1), (:, 'HR')]

SyntaxError: invalid syntax (<ipython-input-49-fb34fa30ac09>, line 1)

You could get around this by building the desired slice explicitly using Python's built-in slice() function, but a better way in this context is to use an IndexSlice object, which Pandas provides for precisely this situation. For example:

In [50]:
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]

Unnamed: 0_level_0,subject,John,Bob,Mary
Unnamed: 0_level_1,diagnostic,HR,HR,HR
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
2013,1,36.0,34.0,35.0
2014,1,29.0,41.0,29.0
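
For comparison, the version with Python's built-in `slice()` might look like the sketch below. It uses a small frame with deterministic placeholder values (not the random data above) and alphabetically ordered subjects, both of which are assumptions made for the example.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'John', 'Mary'], ['HR', 'Temp']],
                                     names=['subject', 'diagnostic'])
# Placeholder values 0..23 so the result is easy to check by eye
frame = pd.DataFrame(np.arange(24.0).reshape(4, 6), index=index, columns=columns)

idx = pd.IndexSlice
with_index_slice = frame.loc[idx[:, 1], idx[:, 'HR']]

# slice(None) is what a bare ":" means inside an indexing expression
with_builtin = frame.loc[(slice(None), 1), (slice(None), 'HR')]

print(with_builtin)
```

The two selections are the same; `IndexSlice` simply lets you keep the familiar colon syntax inside the tuples.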


### Rearranging Multi-Indices

One of the keys to working with multiply indexed data is knowing how to effectively transform the data. There are a number of operations that will preserve all the information in the dataset, but rearrange it for the purposes of various computations. We saw a brief example of this in the stack() and unstack() methods, but there are many more ways to finely control the rearrangement of data between hierarchical indices and columns.

#### Sorted and unsorted indices

Many of the MultiIndex slicing operations will fail if the index is not sorted. 

Start by creating some simple multiply indexed data where the indices are not lexicographically sorted:

In [51]:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data

char  int
a     1      0.668750
      2      0.777128
c     1      0.865914
      2      0.774575
b     1      0.736382
      2      0.315600
dtype: float64


If we try to take a partial slice of this index, it will result in an error:

In [52]:
try:
    data['a':'b']
except KeyError as e:
    print(type(e))
    print(e)

<class 'pandas.errors.UnsortedIndexError'>
'Key length (1) was greater than MultiIndex lexsort depth (0)'


Although it is not entirely clear from the error message, this is the result of the MultiIndex not being sorted. For various reasons, partial slices and other similar operations require the levels in the MultiIndex to be in sorted (i.e., lexicographical) order. Pandas provides convenience routines to perform this type of sorting, chiefly the sort_index() method (older versions also offered sortlevel(), which has since been removed). We'll use sort_index() here:

In [53]:
data = data.sort_index()
data

char  int
a     1      0.668750
      2      0.777128
b     1      0.736382
      2      0.315600
c     1      0.865914
      2      0.774575
dtype: float64

With the index sorted in this way, partial slicing will work as expected:

In [54]:
data['a':'b']

char  int
a     1      0.668750
      2      0.777128
b     1      0.736382
      2      0.315600
dtype: float64
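
One way to check up front whether an index is in sorted order, rather than waiting for the error, is the `is_monotonic_increasing` property of the index. The sketch below uses fixed values instead of random ones so the result is reproducible; `data_sorted` and `partial` are names made up for the example.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
data = pd.Series(np.arange(6.0), index=index)

print(data.index.is_monotonic_increasing)         # False: 'c' comes before 'b'

data_sorted = data.sort_index()
print(data_sorted.index.is_monotonic_increasing)  # True

# Partial slicing now works on the sorted copy
partial = data_sorted.loc['a':'b']
print(partial)
```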

#### Stacking and unstacking indices

As we saw briefly before, it is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use:

In [55]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [56]:
pop.unstack(level=0)

state,California,New York,Texas
year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2000,33871648,18976457,20851820
2010,37253956,19378102,25145561


In [57]:
pop.unstack(level=1)

year,2000,2010
state,Unnamed: 1_level_1,Unnamed: 2_level_1
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561
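
Because the levels of pop are named, the level can also be given by name rather than by position, which tends to read more clearly. A self-contained sketch, with the Series rebuilt for the purpose:

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]],
                                   names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

by_number = pop.unstack(level=0)
by_name = pop.unstack(level='state')   # same level, referred to by name

print(by_name)
```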


### Index setting and resetting

Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the reset_index method. Calling this on the population Series will result in a DataFrame with a state and year column holding the information that was formerly in the index. For clarity, we can optionally specify the name of the data for the column representation:

In [58]:
pop_flat = pop.reset_index(name='population')
pop_flat

Unnamed: 0,state,year,population
0,California,2000,33871648
1,California,2010,37253956
2,New York,2000,18976457
3,New York,2010,19378102
4,Texas,2000,20851820
5,Texas,2010,25145561


Often when working with data in the real world, the raw input data looks like this and it's useful to build a MultiIndex from the column values. This can be done with the set_index method of the DataFrame, which returns a multiply indexed DataFrame:

In [59]:
pop_flat.set_index(['state', 'year'])

Unnamed: 0_level_0,Unnamed: 1_level_0,population
state,year,Unnamed: 2_level_1
California,2000,33871648
California,2010,37253956
New York,2000,18976457
New York,2010,19378102
Texas,2000,20851820
Texas,2010,25145561
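
The two methods undo each other, which the following round-trip sketch illustrates. The Series is rebuilt so the snippet runs on its own, and `restored` is a name made up for the example.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'New York', 'Texas'],
                                    [2000, 2010]],
                                   names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Index levels become ordinary columns...
pop_flat = pop.reset_index(name='population')

# ...and set_index turns them back into a MultiIndex
restored = pop_flat.set_index(['state', 'year'])['population']

print(list(pop_flat.columns))
print(restored)
```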


### Data Aggregations on Multi-Indices

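We've previously seen that Pandas has built-in data aggregation methods, such as mean(), sum(), and max(). For hierarchically indexed data, these could be passed a level parameter that controls which subset of the data the aggregate is computed on. (This parameter was deprecated in pandas 1.3 and removed in pandas 2.0; in newer versions the same result is obtained with groupby(level=...).)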

For example:

In [60]:
health_data

Unnamed: 0_level_0,subject,John,John,Bob,Bob,Mary,Mary
Unnamed: 0_level_1,diagnostic,HR,Temp,HR,Temp,HR,Temp
year,visit,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
2013,1,36.0,35.5,34.0,38.4,35.0,36.9
2013,2,23.0,35.4,35.0,37.2,33.0,36.4
2014,1,29.0,37.7,41.0,36.8,29.0,37.9
2014,2,57.0,38.7,31.0,37.1,49.0,38.1


Perhaps we'd like to average-out the measurements in the two visits each year. We can do this by naming the index level we'd like to explore, in this case the year:

In [61]:
data_mean = health_data.mean(level='year')
data_mean

subject,John,John,Bob,Bob,Mary,Mary
diagnostic,HR,Temp,HR,Temp,HR,Temp
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
2013,29.5,35.45,34.5,37.8,34.0,36.65
2014,43.0,38.2,36.0,36.95,39.0,38.0


In [68]:
data_mean.mean(axis=1, level='diagnostic')

diagnostic,HR,Temp
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2013,32.666667,36.633333
2014,39.333333,37.716667
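
On pandas versions where aggregation methods no longer accept `level`, the two steps above can be sketched with `groupby(level=...)` instead. The values below are copied from the health_data output shown earlier; handling the column level by transposing, grouping, and transposing back is one possible approach, not the only one.

```python
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['John', 'Bob', 'Mary'], ['HR', 'Temp']],
                                     names=['subject', 'diagnostic'])
# Values copied from the health_data table above
values = [[36.0, 35.5, 34.0, 38.4, 35.0, 36.9],
          [23.0, 35.4, 35.0, 37.2, 33.0, 36.4],
          [29.0, 37.7, 41.0, 36.8, 29.0, 37.9],
          [57.0, 38.7, 31.0, 37.1, 49.0, 38.1]]
health_data = pd.DataFrame(values, index=index, columns=columns)

# Average the two visits within each year
data_mean = health_data.groupby(level='year').mean()

# Average across subjects for each diagnostic: transpose so the column
# levels become row levels, group on 'diagnostic', then transpose back
by_diagnostic = data_mean.T.groupby(level='diagnostic').mean().T

print(data_mean)
print(by_diagnostic)
```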


### Views and copies: returning a view versus a copy

When setting values in a pandas object, care must be taken to avoid what is called chained indexing. Here is an example.

In [17]:
dfmi = pd.DataFrame([list('abcd'),
                     list('efgh'),
                     list('ijkl'),
                     list('mnop')],
                    columns=pd.MultiIndex.from_product([['one', 'two'],
                                                        ['first', 'second']]))

In [70]:
dfmi

Unnamed: 0_level_0,one,one,two,two
Unnamed: 0_level_1,first,second,first,second
0,a,b,c,d
1,e,f,g,h
2,i,j,k,l
3,m,n,o,p


Compare these two access methods:

In [74]:
dfmi['one']['second']

0    b
1    f
2    j
3    n
Name: second, dtype: object

In [72]:
dfmi.loc[:, ('one', 'second')]

0    b
1    f
2    j
3    n
Name: (one, second), dtype: object

These both yield the same results, so which should you use? It is instructive to understand the order of operations on these and why method 2 (.loc) is much preferred over method 1 (chained []).

dfmi['one'] selects the first level of the columns and returns a DataFrame that is singly-indexed. Then another Python operation, dfmi_with_one['second'], selects the series indexed by 'second'. The intermediate result is written here as the variable dfmi_with_one because pandas sees these operations as separate events, i.e. separate calls to __getitem__, so it has to treat them as linear operations that happen one after another.

In contrast, dfmi.loc[:, ('one', 'second')] passes a nested tuple of (slice(None), ('one', 'second')) to a single call to __getitem__. This allows pandas to deal with this as a single entity.

This order of operations can be significantly faster, and allows one to index both axes if so desired.
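
The difference in how Python dispatches the two forms can be made visible with a tiny stand-in object. `KeyLogger` is a hypothetical class written for this illustration (it is not part of pandas); it only records the key passed to each `__getitem__` call.

```python
class KeyLogger:
    """Hypothetical stand-in for an indexable object; records every key it receives."""

    def __init__(self, log):
        self.log = log

    def __getitem__(self, key):
        self.log.append(key)
        return self   # return something indexable so calls can be chained


chained_calls = []
KeyLogger(chained_calls)['one']['second']       # two separate __getitem__ calls

single_call = []
KeyLogger(single_call)[:, ('one', 'second')]    # one call with a nested tuple

print(chained_calls)
print(single_call)
```

With chained indexing, the second call only ever sees whatever the first call returned, so pandas has no way to treat the pair as one operation.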

In [18]:
d1 = dfmi['one']['second']

If we assign a new object (d1) as a subset of dfmi using chained indexing, then make an assignment to this new object, a SettingWithCopyWarning is generated:

In [21]:
d1.loc[:2] = 0
d1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


0    0
1    0
2    0
3    n
Name: second, dtype: object

If we create a new object (d1) as a subset of dfmi, using the second method, then make an assignment to this new object, there is no SettingWithCopyWarning message.

In [22]:
d1 = dfmi.loc[:, ('one', 'second')]

In [23]:
d1.loc[:2] = 0
d1

0    0
1    0
2    0
3    n
Name: (one, second), dtype: object
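
If the goal is to change dfmi itself rather than a derived object, a single `.loc` assignment on the parent avoids the view-or-copy question altogether. The sketch below assigns the string 'z' rather than 0, an assumption made so that the value matches the column's string data.

```python
import pandas as pd

dfmi = pd.DataFrame([list('abcd'),
                     list('efgh'),
                     list('ijkl'),
                     list('mnop')],
                    columns=pd.MultiIndex.from_product([['one', 'two'],
                                                        ['first', 'second']]))

# One indexing call on the parent: rows 0 through 2 (label-based, inclusive),
# column ('one', 'second')
dfmi.loc[:2, ('one', 'second')] = 'z'

print(dfmi)
```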