<h1><b><code>SettingWithCopyWarning</code> in Pandas: Views vs Copies</b></h1>
<p>In this notebook, I cover an important concept that users run into once they get acquainted with Pandas: the <code>SettingWithCopyWarning</code>. As a beginner, I would often disregard this warning because the outcome of my code did not seem to change. The warning is closely related to two notions, <b>shallow copies</b> and <b>deep copies</b>. I'll discuss the difference between them and assess the impact of their use in data analysis.</p>
<p>By the end of this notebook, I hope to have covered the following topics:</p>
<ul>
    <li>The definition of <b>views</b> and <b>copies</b> in NumPy and Pandas</li>
    <li>How to work with views and copies in these libraries</li>
    <li>Why <code>SettingWithCopyWarning</code> happens in Pandas</li>
    <li>How to avoid getting a <code>SettingWithCopyWarning</code> in Pandas</li>
</ul>

<h2><b>Table of contents:</b></h2>
<ul>
    <li>Example of a <code>SettingWithCopyWarning</code></li>
    <li>Views and Copies in NumPy and Pandas</li>
        <ul>
            <li>Understanding views and copies in NumPy</li>
            <li>Understanding views and copies in Pandas</li>
        </ul>
    <li>Indices and Slices in NumPy and Pandas</li>
        <ul>
            <li>Indexing in NumPy: copies and views</li>
            <li>Indexing in Pandas: copies and views</li>
        </ul>
<li>Use of Views and Copies in Pandas</li>
    <ul>
        <li>Chained Indexing and <code>SettingWithCopyWarning</code></li>
        <li>Impact of Data Types on Views, Copies, and the <code>SettingWithCopyWarning</code></li>
        <li>Hierarchical Indexing and <code>SettingWithCopyWarning</code></li>
    </ul>
<li>Change the Default <code>SettingWithCopyWarning</code> Behavior</li>
<li>Conclusion</li>
</ul>

<p>Let's start by importing the required modules and checking their versions. Then, we can move on and proceed with our discussion on this notebook's topic.</p>



In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
np.__version__

'1.24.2'

In [3]:
pd.__version__

'1.5.3'

<h3><b>1. Example of a <code>SettingWithCopyWarning</code></b></h3>
<p>As previously mentioned, I tended to ignore this issue because it is <i>not</i> an <i>error</i>, but a <i>warning</i>. It might sound obvious now, but what Pandas is doing is <i>warning</i> you that you might get unwanted behavior in your code.</p>
<p>Let's create a Pandas DataFrame and observe this issue in practice:</p>

In [4]:
data = {"x": 2**np.arange(5),
        "y": 3**np.arange(5), 
        "z": np.array([45, 98, 24, 11, 64])
        }

index = ["a", "b", "c", "d", "e"]

df = pd.DataFrame(data=data, index=index)

In [5]:
df

Unnamed: 0,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24
d,8,27,11
e,16,81,64


<p>Now, we have a dictionary referenced by the variable <code>data</code>, which contains:</p>
<ul>
    <li>Keys - <code>x</code>, <code>y</code>, and <code>z</code> - they are our column labels in the DataFrame</li>
    <li>Three NumPy arrays, which are our observations</li>
    <ul>
        <li><code>np.arange(5)</code> returns evenly spaced values within a given interval. Since we passed <code>5</code>, the result is <code>[0, 1, 2, 3, 4]</code>. The expression <code>2**np.arange(5)</code> then raises 2 to each of these powers, giving <code>[1, 2, 4, 8, 16]</code>. The same idea applies to the second array, this time with 3 as the base.</li>
        <li><code>np.array()</code> creates an array.</li>
    </ul>
    <li>Finally, a list, which will be used to provide an index to our DataFrame</li>
</ul>
<p>We are now ready to deal with a <code>SettingWithCopyWarning</code>. Let's start by creating a mask with Pandas boolean operators:</p>

In [6]:
mask = df["z"] < 50
mask

a     True
b    False
c     True
d     True
e    False
Name: z, dtype: bool

In [7]:
df[mask]

Unnamed: 0,x,y,z
a,1,1,45
c,4,9,24
d,8,27,11


<p>How does the mask work?</p>
<ul>
    <li><code>True</code> - <b>rows</b> in which the value of <code>z</code> is <i>less</i> than <code>50</code>.</li>
    <li><code>False</code> - <b>rows</b> in which the value of <code>z</code> is <i>not less</i> than <code>50</code>.</li>
</ul>
<p>Calling <code>df[mask]</code> returns the rows of the original DataFrame for which the mask is <code>True</code>. We got <code>a</code>, <code>c</code>, and <code>d</code> back.</p>
<p>Now, if we try to change <code>df</code> by selecting those three rows with <code>mask</code> and assigning to column <code>z</code>, we get a <code>SettingWithCopyWarning</code>, and <code>df</code> remains the <b>same</b>:</p>

In [8]:
df[mask]["z"] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[mask]["z"] = 0


In [9]:
df

Unnamed: 0,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24
d,8,27,11
e,16,81,64


<p>What just happened?</p>
<ul>
    <li><code>df[mask]</code> returns a <b>brand new DataFrame</b>, which is a <b>copy</b> of the data from the original <code>df</code>, containing only the rows that correspond to the <code>True</code> values in <code>mask</code>.</li>
    <li><code>df[mask]["z"] = 0</code> sets the column <code>z</code> of that new DataFrame to <b>zeros</b>, leaving <code>df</code> unchanged.</li>
</ul>

<p>Pandas issues this warning to remind you that the original DataFrame was not changed. <b>If you want to modify it</b>, you can apply one of the following <b>accessors</b>:</p>
<ul>
    <li><code>.loc[]</code></li>
    <li><code>.iloc[]</code></li>
    <li><code>.at[]</code></li>
    <li><code>.iat[]</code></li>
</ul>
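<p>The four accessors could be used along the following lines. This is only a minimal sketch on a separate DataFrame (<code>demo</code>, a hypothetical name that is not part of this notebook), so that <code>df</code> is left untouched:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame mirroring the one used in this notebook
demo = pd.DataFrame(
    {"x": 2 ** np.arange(5), "y": 3 ** np.arange(5), "z": [45, 98, 24, 11, 64]},
    index=["a", "b", "c", "d", "e"],
)

demo.loc["a", "z"] = 0   # label-based: row "a", column "z"
demo.iloc[1, 2] = 0      # position-based: second row, third column
demo.at["c", "z"] = 0    # label-based access to a single value
demo.iat[3, 2] = 0       # position-based access to a single value

print(demo["z"].tolist())
```

<p>Each statement performs one indexing operation directly on <code>demo</code>, which is why the first four values of <code>z</code> end up as zero.</p>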

In [10]:
df.loc[mask, "z"] = 0
df

Unnamed: 0,x,y,z
a,1,1,0
b,2,3,98
c,4,9,0
d,8,27,0
e,16,81,64


<p>We passed two arguments (<code>mask</code> and <code>"z"</code>) to the <code>.loc</code> accessor and assigned the value directly to the DataFrame. Alternatively, we can <b>change the evaluation order</b>:</p>

In [11]:
df = pd.DataFrame(data=data, index=index)
df["z"]

a    45
b    98
c    24
d    11
e    64
Name: z, dtype: int32

In [12]:
df["z"][mask] = 0
df

Unnamed: 0,x,y,z
a,1,1,0
b,2,3,98
c,4,9,0
d,8,27,0
e,16,81,64


<p>Here's what happened:</p>
<ul>
    <li><code>df["z"]</code> returns a <code>Series</code> that this time pointed to the <b>original data</b> and not its copy.</li>
    <li><code>df["z"][mask] = 0</code> modifies this <code>Series</code> object by using <b>chained assignment</b> to set the masked values to <b>zero</b>.</li>
    <li>Now, <code>df</code> is also modified as the <code>Series</code> object <code>df["z"]</code> holds the same data as <code>df</code>.</li>
</ul>
<p>So, while <code>df[mask]</code> contains a <b>copy</b> of the data, <code>df["z"]</code> points to the <b>same data</b> as <code>df</code>. Because the outcome depends on which of the two you happen to get, the best practice for avoiding a <code>SettingWithCopyWarning</code> is to <b>use accessors</b> in a single statement. Why?</p>
<ul>
    <li>Clearer intention to modify <code>df</code> when using a single method.</li>
    <li>Cleaner code for readers.</li>
    <li>Better performance.</li>
</ul>
<p>Still, accessors might return copies, as the code below demonstrates:</p>

In [13]:
df = pd.DataFrame(data=data, index=index)

In [14]:
df.loc[mask]["z"] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[mask]["z"] = 0


In [15]:
df

Unnamed: 0,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24
d,8,27,11
e,16,81,64


<p>We observe this behavior because <code>df.loc[mask]</code> returns a <b>new DataFrame</b> holding a <b>copy</b> of the selected rows of <code>df</code>. Then, <code>df.loc[mask]["z"] = 0</code> modifies that <b>copy</b>, not <code>df</code>. To avoid the warning, we must:</p>
<ul>
    <li><b>Avoid chained assignments</b> that combine <b>two or more</b> indexing operations such as <code>df["z"][mask] = 0</code> and <code>df.loc[mask]["z"] = 0</code>.</li>
    <li><b>Apply single statements</b> with <b>just one indexing operation</b> like <code>df.loc[mask, "z"] = 0</code>.</li>
</ul>
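<p>As a small sketch of the single-statement style, the example below reads and writes through one <code>.loc</code> call each. The names <code>df_demo</code> and <code>mask_demo</code> are hypothetical and only used for illustration:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame mirroring the one used in this notebook
df_demo = pd.DataFrame(
    {"x": 2 ** np.arange(5), "y": 3 ** np.arange(5), "z": [45, 98, 24, 11, 64]},
    index=["a", "b", "c", "d", "e"],
)
mask_demo = df_demo["z"] < 50

# Read and write with one indexing operation on each side of the assignment
df_demo.loc[mask_demo, "z"] = df_demo.loc[mask_demo, "z"] * 2

# Several columns can be updated in the same single-step way
df_demo.loc[mask_demo, ["x", "y"]] = 0

print(df_demo)
```

<p>Rows and columns are named in the same call, so there is no intermediate object that could turn out to be a copy.</p>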
<hr>
<h3><b>2. Views and Copies in NumPy and Pandas</b></h3>
<h4>2.1. Understanding Views and Copies in NumPy</h4>
<p>We can start by creating a NumPy array:</p>

In [16]:
arr = np.array([1, 2, 4, 8, 16, 32])
arr

array([ 1,  2,  4,  8, 16, 32])

<p>Now, let's create other arrays by extracting the second and fourth elements of <code>arr</code> as a new array:</p>

In [17]:
arr[1:4:2]

array([2, 8])

In [18]:
arr[[1, 3]]

array([2, 8])

<p>While both statements return the same array, their behavior is not the same:</p>

In [19]:
arr[1:4:2].base

array([ 1,  2,  4,  8, 16, 32])

In [20]:
arr[1:4:2].flags.owndata

False

In [21]:
arr[[1, 3]].base

In [22]:
arr[[1, 3]].flags.owndata

True

<p>Why does this happen? Well, <code>arr[1:4:2]</code> returns a <b>shallow copy</b>, while <code>arr[[1, 3]]</code> returns a <b>deep copy</b>. We are now going to differentiate these two concepts.</p>
<h4>2.2. Views in NumPy</h4>
<p>In NumPy, a <b>shallow copy</b> or <b>view</b> is an array that <b>does not have its own data</b>. It is a representation of the data contained in the original array. A view of an array can be created using <code>.view()</code>:</p>

In [23]:
view_of_arr = arr.view()
view_of_arr

array([ 1,  2,  4,  8, 16, 32])

In [24]:
view_of_arr.base

array([ 1,  2,  4,  8, 16, 32])

In [25]:
view_of_arr.base is arr

True

<p>Explanation:</p>
<ul>
    <li><code>view_of_arr</code> represents a view/shallow copy of the original array <code>arr</code>.</li>
    <li>Accessing <code>.base</code> on <code>view_of_arr</code> returns the original <code>arr</code>.</li>
    <li><code>view_of_arr</code> doesn't own any data, as it only uses data belonging to <code>arr</code>, a fact that can be verified by using the attribute <code>.flags</code>:</li>
</ul>

In [26]:
view_of_arr.flags.owndata

False
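<p>Besides <code>.base</code> and <code>.flags.owndata</code>, NumPy offers <code>np.shares_memory()</code> to check whether two arrays overlap in memory. A minimal sketch, using the hypothetical names <code>arr_demo</code>, <code>view_demo</code>, and <code>copy_demo</code>:</p>

```python
import numpy as np

arr_demo = np.array([1, 2, 4, 8, 16, 32])
view_demo = arr_demo.view()   # shares the buffer of arr_demo
copy_demo = arr_demo.copy()   # gets its own buffer

view_shares = np.shares_memory(arr_demo, view_demo)
copy_shares = np.shares_memory(arr_demo, copy_demo)

print(view_shares, copy_shares)
```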

<h4>2.3. Copies in NumPy</h4>
<p>A <b>deep copy</b>, also known simply as a <b>copy</b>, is a separate NumPy array that <b>has its own data</b>, obtained by copying the elements of the original array into the new array. The original and the copy are <b>two separate instances</b>. You can create a copy of an array with <code>.copy()</code>:</p>

In [27]:
copy_of_arr = arr.copy()
copy_of_arr

array([ 1,  2,  4,  8, 16, 32])

In [28]:
copy_of_arr.base is None

True

In [29]:
copy_of_arr.flags.owndata

True

<p>Aha! As we can see, <code>copy_of_arr</code> has no base array, that is, it is not a shallow copy of another array: the value of <code>copy_of_arr.base</code> is <code>None</code>. Also, <code>.flags.owndata</code> is <code>True</code>, which means that <code>copy_of_arr</code> owns its data.</p>
<h4>2.4. Differences Between Views and Copies</h4>
<p>We can now state two major <b>differences</b> between views and copies:</p>
<ol>
    <li>Views <b>do not require additional storage</b> for data; copies <b>do</b>.</li>
    <li>Modifying the original array <b>affects its views</b>, while changing the original array <b>will not affect its copy</b>.</li>
</ol>
<p>We can verify these differences by comparing the sizes of views and copies using <code>.nbytes</code>, which returns the memory consumed by the elements of the array:</p>

In [30]:
arr.nbytes

24

In [31]:
view_of_arr.nbytes

24

In [32]:
copy_of_arr.nbytes

24

<p>Apparently, there is no difference in terms of memory used. However, if we apply <code>sys.getsizeof()</code> to get the amount of memory <i>directly</i> attributed to each array, we get to see the difference:</p>

In [33]:
from sys import getsizeof

getsizeof(arr)

136

In [34]:
getsizeof(view_of_arr)

112

In [35]:
getsizeof(copy_of_arr)

136

<p>Because it doesn't have its own data elements, <code>view_of_arr</code> holds only 112 bytes, which are used for other attributes. The other two variables hold the previous 24 bytes <i>and</i> those attributes.</p>
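<p>The relationship between the two measurements can be checked directly. The sketch below assumes that <code>sys.getsizeof()</code> counts the data buffer only for arrays that own it, which matches the numbers shown above; the names <code>arr_demo</code> and <code>view_demo</code> are hypothetical:</p>

```python
import sys

import numpy as np

arr_demo = np.array([1, 2, 4, 8, 16, 32])
view_demo = arr_demo.view()

# Assumption: the owner's size includes the data buffer, the view's does not
overhead_difference = sys.getsizeof(arr_demo) - sys.getsizeof(view_demo)

print(overhead_difference, arr_demo.nbytes)
```

<p>The absolute sizes depend on the platform and on the integer type, but the difference between the owner and its view should equal <code>.nbytes</code>.</p>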
<p>We can modify any element of the original array to observe difference #2:</p>

In [36]:
arr[1] = 64
arr

array([ 1, 64,  4,  8, 16, 32])

In [37]:
view_of_arr

array([ 1, 64,  4,  8, 16, 32])

In [38]:
copy_of_arr

array([ 1,  2,  4,  8, 16, 32])

<p>Interesting! Because <code>view_of_arr</code> holds no data of its own and looks at the elements of <code>arr</code> (its <code>.base</code>), it is <b>modified</b>. The copy, however, remains <b>unchanged</b>, as it <i>does not</i> share data with the original.</p>
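<p>The sharing works in the opposite direction as well: writing through a view changes the original, while writing to a copy does not. A minimal sketch with hypothetical names (<code>original</code>, <code>as_view</code>, <code>as_copy</code>):</p>

```python
import numpy as np

original = np.array([1, 2, 4, 8, 16, 32])
as_view = original.view()
as_copy = original.copy()

as_view[2] = -1   # writes into the buffer owned by `original`
as_copy[3] = -1   # writes into the copy's own buffer

print(original)
print(as_copy)
```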
<h4>2.5. Understanding Views and Copies in Pandas</h4>
<p>Pandas also differentiates views from copies. To create a view or copy of a DataFrame, use <code>.copy()</code>. Its parameter <code>deep</code> determines whether one wants a <b>view</b> (<code>deep=False</code>) or a <b>copy</b> (<code>deep=True</code>). By default, <code>deep</code> is <code>True</code> and will return a copy.</p>


In [39]:
df = pd.DataFrame(data=data, index=index)
df

Unnamed: 0,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24
d,8,27,11
e,16,81,64


In [40]:
view_of_df = df.copy(deep=False)
view_of_df

Unnamed: 0,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24
d,8,27,11
e,16,81,64


In [41]:
copy_of_df = df.copy()
copy_of_df

Unnamed: 0,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24
d,8,27,11
e,16,81,64


<p>While at first sight there is no apparent difference between the view and the copy, if their NumPy representations are compared, there is a subtle difference:</p>

In [42]:
# Convert DataFrame to NumPy array
view_of_df.to_numpy().base is df.to_numpy().base

True

In [43]:
# Convert DataFrame to NumPy array
copy_of_df.to_numpy().base is df.to_numpy().base

False

<p>Again, we observe that <code>copy_of_df</code> holds its own data, while <code>view_of_df</code> shares the same data with <code>df</code>. We can modify <code>df</code> to verify this behavior:</p>

In [44]:
df["z"] = 0
df

Unnamed: 0,x,y,z
a,1,1,0
b,2,3,0
c,4,9,0
d,8,27,0
e,16,81,0


In [45]:
view_of_df

Unnamed: 0,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24
d,8,27,11
e,16,81,64


In [46]:
df_array_addr = id(df.values)
view_array_addr = id(view_of_df.values)

if df_array_addr == view_array_addr:
    print("view_of_df is a shallow copy of df")
else:
    print("view_of_df is a deep copy of it")

view_of_df is a shallow copy of df


In [47]:
copy_of_df

Unnamed: 0,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24
d,8,27,11
e,16,81,64


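<p>That looks weird at first: the view was supposed to reflect the changes in the original DataFrame, but it didn't. The most likely reason is that <code>df["z"] = 0</code> does not write into the existing data. It replaces the whole column <code>z</code> of <code>df</code> with a new array, so <code>view_of_df</code> keeps pointing at the old values. Note also that comparing <code>id()</code> values of the temporary arrays returned by <code>.values</code> is not reliable evidence of sharing, because the address of a discarded array can be reused. Let's move on.</p>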
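<p>The column-replacement behavior can be reproduced in isolation. The following is a sketch under the assumption that assigning a whole column swaps in a new array rather than writing into shared memory; <code>df_demo</code> and <code>shallow</code> are hypothetical names:</p>

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {"x": 2 ** np.arange(5), "y": 3 ** np.arange(5), "z": [45, 98, 24, 11, 64]},
    index=["a", "b", "c", "d", "e"],
)
shallow = df_demo.copy(deep=False)

# Whole-column assignment: df_demo gets a new "z", the shallow copy keeps the old one
df_demo["z"] = 0

print(df_demo["z"].tolist())
print(shallow["z"].tolist())
```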
<p>Row and column labels should also be shared between <code>df</code> and its view:</p>

In [48]:
view_of_df.index is df.index

True

In [49]:
view_of_df.columns is df.columns

True

<p>With <code>deep=False</code>, neither the indices nor the data are copied.</p>

In [50]:
copy_of_df.index is df.index

False

In [51]:
copy_of_df.columns is df.columns

False

<p><code>df</code> and <code>view_of_df</code> share the same row and column labels, while <code>copy_of_df</code> has separate index instances. Remember that you <i>cannot modify</i> particular elements of <code>.index</code> and <code>.columns</code>, as they are <b>immutable objects</b>.</p>
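<p>The immutability of the labels can be seen in a short sketch. The DataFrame <code>df_demo</code> and the replacement label <code>"k"</code> are hypothetical:</p>

```python
import pandas as pd

df_demo = pd.DataFrame({"x": [1, 2, 4]}, index=["a", "b", "c"])

try:
    df_demo.index[0] = "k"   # element-wise assignment on an Index is rejected
    raised = False
except TypeError:
    raised = True

# Changing labels goes through a new Index object instead
renamed = df_demo.rename(index={"a": "k"})

print(raised, list(renamed.index), list(df_demo.index))
```

<p>Because labels can only be replaced as a whole, sharing the same index object between a DataFrame and its shallow copy is safe.</p>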
<hr>
<h2><b>3. Indices and Slices in NumPy and Pandas</b></h2>
<h3>3.1. Indexing in NumPy: Copies and Views</h3>
<p>You will get views or copies of the original data depending on the selected indexing approach: slicing, integer indexing, or Boolean indexing.</p>
<h4>3.1.1. One-Dimensional Arrays</h4>
<p>When you slice a NumPy array, you get a <b>view</b> of the array:</p>

In [52]:
arr = np.array([1, 2, 4, 8, 16, 32])

In [53]:
# Get second and third values from the array
a = arr[1:3]
a

array([2, 4])

In [54]:
a.base

array([ 1,  2,  4,  8, 16, 32])

In [55]:
a.base is arr

True

In [56]:
a.flags.owndata

False

In [57]:
# Get values starting in index 1 until index 3 (4 is exclusive) with a step of 2
b = arr[1:4:2]
b

array([2, 8])

In [58]:
b.base

array([ 1,  2,  4,  8, 16, 32])

In [59]:
b.base is arr

True

In [60]:
b.flags.owndata

False

<p>Neither <code>a</code> nor <code>b</code>, which are slices of <code>arr</code>, has its own data. They look at the data of <code>arr</code>.</p>
<p>When you select elements with an <b>array or list of indices</b>, however, you'll get a <b>copy</b>. For instance, indexing an array with a list of integers returns a new array that contains copies of the elements of the original array whose indices are present in the list.</p>

In [61]:
c = arr[[1, 3]]
c

array([2, 8])

In [62]:
c.base is None

True

In [63]:
c.flags.owndata

True

<p>As we can see, <code>c</code> contains the elements from <code>arr</code> with the indices <code>1</code> and <code>3</code>, whose respective values are <code>2</code> and <code>8</code>. <code>c</code> is therefore a copy of data from <code>arr</code>; its <code>.base</code> is <code>None</code>, and it has its own data. The two arrays are independent of each other.</p>
<p>Indexing can also be used with mask arrays or lists. Masks are Boolean arrays/lists of the same shape as the original. You will get a <b>copy</b> containing only the elements of the original array that correspond to the <code>True</code> values of the mask.</p>

In [64]:
mask = [False, True, False, True, False, False]
d = arr[mask]
d

array([2, 8])

In [65]:
d.base is None

True

In [66]:
d.flags.owndata

True

<p>As a result, we obtained the same values as before.</p>
<p><b>Summary:</b></p>
<ul>
    <li><i>Referencing</i>: returns <b>views</b> when slicing arrays and <b>copies</b> when using index and mask arrays</li>
    <li><i>Assigning</i>: <b>always modifies</b> the original data of the array</li>
</ul>
<p>We can now see what happens when the original array is altered:</p>

In [67]:
arr[1] = 64
arr

array([ 1, 64,  4,  8, 16, 32])

In [68]:
a

array([64,  4])

In [69]:
b

array([64,  8])

In [70]:
c

array([2, 8])

In [71]:
d

array([2, 8])

<p>As expected, <code>c</code> and <code>d</code> remained unchanged, since they don't share data with <code>arr</code>.</p>
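<p>The three indexing styles of this section can be compared side by side with <code>np.shares_memory()</code>. This is a minimal sketch with hypothetical names (<code>arr_demo</code>, <code>selections</code>, <code>shares</code>):</p>

```python
import numpy as np

arr_demo = np.array([1, 2, 4, 8, 16, 32])

# The same two elements, selected in three different ways
selections = {
    "slice": arr_demo[1:4:2],
    "integer list": arr_demo[[1, 3]],
    "boolean mask": arr_demo[[False, True, False, True, False, False]],
}

# True means the selection still looks at the memory of arr_demo
shares = {name: bool(np.shares_memory(arr_demo, sel)) for name, sel in selections.items()}

print(shares)
```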
<h4>3.1.2. Chained Indexing in NumPy</h4>
<p>To understand this concept, let's take a look at the following example:</p>

In [72]:
arr = np.array([1, 2, 4, 8, 16, 32])
# Get values starting in index 1 until index 3 (4 is exclusive) with a step of 2 and then replace the first value
arr[1:4:2][0] = 64
arr

array([ 1, 64,  4,  8, 16, 32])

In [73]:
arr = np.array([1, 2, 4, 8, 16, 32])
# Select the elements with indices 1 and 3 and return them as a new array and select only the element with index 0 from it
arr[[1, 3]][0] = 64
arr

array([ 1,  2,  4,  8, 16, 32])

<p>In the first example, the result is a view that references the data of <code>arr</code> and contains the elements <code>2</code> and <code>8</code>. <code>arr[1:4:2][0] = 64</code> modifies the first of these elements to <code>64</code>, and the change is visible in <code>arr</code>.</p>
<p>The second example returns a copy that also contains <code>2</code> and <code>8</code>, but they are not the same as in <code>arr</code>: they are new ones. <code>arr[[1, 3]][0] = 64</code> modifies the copy and leaves the original array unchanged.</p>
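<p>If the goal is to change the original array through a list of indices, a single indexing step on the left-hand side does the job, since assignment always writes into the array being indexed. A sketch with the hypothetical names <code>arr_demo</code> and <code>after_chained</code>:</p>

```python
import numpy as np

arr_demo = np.array([1, 2, 4, 8, 16, 32])

# Chained: the list index builds a copy first, so the write is lost
arr_demo[[1, 3]][0] = 64
after_chained = arr_demo.copy()

# Single step: the assignment goes through the list index into arr_demo itself
arr_demo[[1, 3]] = 64

print(after_chained)
print(arr_demo)
```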
<h4>3.1.3. Multidimensional Arrays</h4>
<p>We can apply the same principles as one-dimensional arrays:</p>
<ul>
    <li>Slicing returns <b>views</b></li>
    <li>Indexing and masking return <b>copies</b></li>
</ul>
<p>Let's take a look at a few examples:</p>

In [74]:
arr = np.array([[1, 2, 4, 8],
               [16, 32, 64, 128],
               [256, 512, 1024, 2048]])
arr

array([[   1,    2,    4,    8],
       [  16,   32,   64,  128],
       [ 256,  512, 1024, 2048]])

In [75]:
# Select columns 1 and 2
a = arr[:, 1:3]
a

array([[   2,    4],
       [  32,   64],
       [ 512, 1024]])

In [76]:
a.base

array([[   1,    2,    4,    8],
       [  16,   32,   64,  128],
       [ 256,  512, 1024, 2048]])

In [77]:
a.base is arr

True

In [78]:
# Select columns 1 and 3
b = arr[:, 1:4:2]
b

array([[   2,    8],
       [  32,  128],
       [ 512, 2048]])

In [79]:
b.base

array([[   1,    2,    4,    8],
       [  16,   32,   64,  128],
       [ 256,  512, 1024, 2048]])

In [80]:
b.base is arr

True

In [81]:
# Get columns 1 and 3
c = arr[:, [1, 3]]
c

array([[   2,    8],
       [  32,  128],
       [ 512, 2048]])

In [82]:
c.base

array([[   2,   32,  512],
       [   8,  128, 2048]])

In [83]:
c.base is arr

False

In [84]:
# Select columns 1 and 3
d = arr[:, [False, True, False, True]]
d

array([[   2,    8],
       [  32,  128],
       [ 512, 2048]])

In [85]:
d.base

array([[   2,   32,  512],
       [   8,  128, 2048]])

In [86]:
d.base is arr

False

<p>As expected, slicing returned views, while indexing and masking returned copies that are independent of the original <code>arr</code>.</p>
<p>Likewise, when you modify the original, the views will reflect the changes, while the copies will remain the same.</p>

In [87]:
# Change the second value of the first row to 100
arr[0, 1] = 100
arr

array([[   1,  100,    4,    8],
       [  16,   32,   64,  128],
       [ 256,  512, 1024, 2048]])

In [88]:
a

array([[ 100,    4],
       [  32,   64],
       [ 512, 1024]])

In [89]:
b

array([[ 100,    8],
       [  32,  128],
       [ 512, 2048]])

In [90]:
c

array([[   2,    8],
       [  32,  128],
       [ 512, 2048]])

In [91]:
d

array([[   2,    8],
       [  32,  128],
       [ 512, 2048]])

<p>As we can observe, the original value <code>2</code> remained unchanged for the copies (<code>c</code> and <code>d</code>). That was not the case for the views, whose values changed to <code>100</code>.</p>
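<p>Note that in the outputs above <code>c.base</code> and <code>d.base</code> are not <code>None</code>, even though <code>c</code> and <code>d</code> are copies of data from <code>arr</code>. For two-dimensional arrays, <code>np.shares_memory()</code> is therefore a more direct check. A minimal sketch, with <code>grid</code>, <code>sliced</code>, and <code>picked</code> as hypothetical names:</p>

```python
import numpy as np

grid = np.array([[1, 2, 4, 8],
                 [16, 32, 64, 128],
                 [256, 512, 1024, 2048]])

sliced = grid[:, 1:4:2]    # basic slicing of columns 1 and 3
picked = grid[:, [1, 3]]   # integer-list indexing of the same columns

sliced_shares = np.shares_memory(grid, sliced)
picked_shares = np.shares_memory(grid, picked)

print(sliced_shares, picked_shares)
```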
<h3><b>3.2. Indexing in Pandas: Copies and Views</b></h3>
<p>Pandas is more flexible than NumPy and offers more functionality. As a consequence, the rules for returning views and copies are more complex and not as straightforward.</p>
<p>We can start by showing examples of how Pandas behaves similarly to NumPy. First, by accessing the first three rows of <code>df</code> with a <b>slice</b>, we'll get a <b>view</b> in return:</p>

In [92]:
df = pd.DataFrame(data=data, index=index)
# Slice the data with the index labels
df["a": "c"]

Unnamed: 0,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24


In [93]:
# Convert to an array to check its base
df["a": "c"].to_numpy().base

array([[ 1,  2,  4,  8, 16],
       [ 1,  3,  9, 27, 81],
       [45, 98, 24, 11, 64]], dtype=int32)

In [94]:
# Check if the slice returns a view by seeing if it shares the same data as df
df["a": "c"].to_numpy().base is df.to_numpy().base

True

<p>We can observe that the view is looking at the same data as <code>df</code>. This will not be the case when accessing the first two columns of <code>df</code> with a <b>list</b> of labels, which will return a <b>copy</b>.</p>

In [95]:
df[["x", "y"]]

Unnamed: 0,x,y
a,1,1
b,2,3
c,4,9
d,8,27
e,16,81


In [96]:
df[["x", "y"]].to_numpy().base

array([[ 1,  2,  4,  8, 16],
       [ 1,  3,  9, 27, 81]], dtype=int32)

In [97]:
df[["x", "y"]].to_numpy().base is df.to_numpy().base

False

<p>The copy is looking at different base data than <code>df</code>.</p>
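<p>Since the view/copy rules in Pandas are hard to predict, one option when an independent subset is needed is to ask for it explicitly with <code>.copy()</code>. The sketch below uses the hypothetical names <code>df_demo</code> and <code>subset</code>:</p>

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {"x": 2 ** np.arange(5), "y": 3 ** np.arange(5), "z": [45, 98, 24, 11, 64]},
    index=["a", "b", "c", "d", "e"],
)

# Explicit copy: subset owns its data regardless of how the selection was made
subset = df_demo[["x", "y"]].copy()
subset.loc["a", "x"] = 100

print(df_demo.loc["a", "x"], subset.loc["a", "x"])
```

<p>The explicit copy costs some memory, but it makes the intention clear: later changes to <code>subset</code> are not meant to reach the original DataFrame.</p>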
<h2><b>4. Use of Views and Copies in Pandas</b></h2>
<h3><b>4.1. Chained Indexing and <code>SettingWithCopyWarning</code></b></h3>
<p>Let's revisit how <code>SettingWithCopyWarning</code> relates to chained indexing by elaborating a little on the first example of this notebook. We create a DataFrame and the mask <code>Series</code> corresponding to <code>df["z"] < 50</code>:</p>


In [98]:
df = pd.DataFrame(data=data, index=index)
df

Unnamed: 0,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24
d,8,27,11
e,16,81,64


In [99]:
mask = df["z"] < 50
mask

a     True
b    False
c     True
d     True
e    False
Name: z, dtype: bool

<p>We already know that if you apply <code>df[mask]["z"] = 0</code> you'll get a <code>SettingWithCopyWarning</code>:</p>

In [100]:
df[mask]["z"] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[mask]["z"] = 0


In [101]:
df

Unnamed: 0,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24
d,8,27,11
e,16,81,64


<p>And why does it fail? That's because <code>df[mask]</code> returns a <b>copy</b> of the selected rows of <code>df</code>, and the assignment is made on that copy, without affecting the original DataFrame.</p>
<p>Remember also that in Pandas, <b>evaluation order matters</b>, which means that if we switch the order of operations, we can successfully apply the mask without any warning being raised:</p>

In [102]:
# No SettingWithCopyWarning will be raised
df["z"][mask] = 0
df

Unnamed: 0,x,y,z
a,1,1,0
b,2,3,98
c,4,9,0
d,8,27,0
e,16,81,64


<p>You can use accessors like <code>.loc</code>, but you can get problems here too:</p>

In [103]:
df = pd.DataFrame(data=data, index=index)
df.loc[mask]["z"] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[mask]["z"] = 0


<p>In some cases, Pandas might not detect the problem and will raise no warning. The assignment is made on a copy, and you are not notified:</p>

In [104]:
# No SettingWithCopyWarning will be raised
df.loc[["a", "c", "e"]]["z"] = 0
df

Unnamed: 0,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24
d,8,27,11
e,16,81,64


<p>As a list of indices was used with the accessor, a copy was returned with no <code>SettingWithCopyWarning</code> being raised.</p>
<p>The warning is still raised in other cases, such as the following:</p>


In [105]:
df = pd.DataFrame(data=data, index=index)
# Select the first three rows with slices, which will return views, and assign a value of 0 to each of them
df[:3]["z"] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[:3]["z"] = 0


In [106]:
df

Unnamed: 0,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24
d,8,27,11
e,16,81,64


In [107]:
df.loc["a": "c"]["z"] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc["a": "c"]["z"] = 0


<p>In these cases, despite selecting the first three rows with slices and getting views, you still get a <code>SettingWithCopyWarning</code>. It is therefore recommended to avoid chained indexing and use a single accessor call instead:</p>

In [108]:
df.loc[mask, "z"] = 0
df

Unnamed: 0,x,y,z
a,1,1,0
b,2,3,98
c,4,9,0
d,8,27,0
e,16,81,64
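<p>The same single-call pattern also covers the list-of-labels case that failed silently above. A minimal sketch on a hypothetical DataFrame named <code>df_demo</code>:</p>

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {"x": 2 ** np.arange(5), "y": 3 ** np.arange(5), "z": [45, 98, 24, 11, 64]},
    index=["a", "b", "c", "d", "e"],
)

# Row labels and column label are passed to .loc together, so df_demo itself is modified
df_demo.loc[["a", "c", "e"], "z"] = 0

print(df_demo["z"].tolist())
```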


<h3><b>4.2. Impact of Data Types on Views, Copies, and the <code>SettingWithCopyWarning</code></b></h3>
<p>In Pandas, the data type <b>matters</b> when deciding whether to return a view or a copy. Let's use our example and take a look at possible outcomes:</p>

In [109]:
df = pd.DataFrame(data=data, index=index)
df

Unnamed: 0,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24
d,8,27,11
e,16,81,64


In [110]:
df.dtypes

x    int32
y    int32
z    int32
dtype: object

<p>Because all three columns have the same data type (integers), selecting rows with a slice is expected to return a view:</p>

In [111]:
df["b": "d"]["z"] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["b": "d"]["z"] = 0


In [112]:
df

Unnamed: 0,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24
d,8,27,11
e,16,81,64


In [113]:
df["b": "d"]["z"].to_numpy().base is df

False

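<p>Even though the slice is expected to be a view, <code>df</code> was not affected. A likely reason is that, in the Pandas version I'm using, assigning to a whole column with <code>["z"] = 0</code> replaces that column in the sliced DataFrame instead of writing into the shared data. Note also that the check in the previous cell compares a NumPy array against the DataFrame <code>df</code> itself, so it is <code>False</code> regardless of whether data is shared. Let's check if a different result is returned when using columns with different data types:</p>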

In [114]:
# Convert column "z" to float
df = pd.DataFrame(data=data, index=index).astype(dtype={"z": float})

In [115]:
df

Unnamed: 0,x,y,z
a,1,1,45.0
b,2,3,98.0
c,4,9,24.0
d,8,27,11.0
e,16,81,64.0


In [116]:
df.dtypes

x      int32
y      int32
z    float64
dtype: object

In [117]:
df["b": "d"]["z"] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["b": "d"]["z"] = 0


In [118]:
df

Unnamed: 0,x,y,z
a,1,1,45.0
b,2,3,98.0
c,4,9,24.0
d,8,27,11.0
e,16,81,64.0


<p>As we can see, the slicing operation behaves the same way in this case: the warning is raised and <code>df</code> is left unchanged.</p>
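<p>A single <code>.loc</code> call sidesteps the question of data types altogether. The sketch below repeats the mixed-type setup on a hypothetical DataFrame named <code>df_demo</code>:</p>

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    {"x": 2 ** np.arange(5), "y": 3 ** np.arange(5), "z": [45, 98, 24, 11, 64]},
    index=["a", "b", "c", "d", "e"],
).astype({"z": float})

# Label slice for the rows and column label in one call; the slice end "d" is inclusive
df_demo.loc["b":"d", "z"] = 0

print(df_demo["z"].tolist())
print(df_demo["z"].dtype)
```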
<h3><b>4.3. Hierarchical Indexing and <code>SettingWithCopyWarning</code></b></h3>
<p>Also known as <code>MultiIndex</code>, hierarchical indexing allows you to organize row or column indices on multiple levels based on a hierarchy. Let's give an example of this feature:</p>

In [119]:
df = pd.DataFrame(
    data={("powers", "x"): 2**np.arange(5),
          ("powers", "y"): 3**np.arange(5),
          ("random", "z"): np.array([45, 98, 24, 11, 64])},
    index=["a", "b", "c", "d", "e"]
)

In [120]:
df

Unnamed: 0_level_0,powers,powers,random
Unnamed: 0_level_1,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24
d,8,27,11
e,16,81,64


<p>Here's a description of what happened:</p>
<ul>
    <li>The <b>first level</b> contains the labels <code>powers</code> and <code>random</code></li>
    <li>The <b>second level</b> has the labels <code>x</code> and <code>y</code>, which belong to <code>powers</code>, and <code>z</code>, which belongs to <code>random</code></li>
</ul>
<p>What happens if we access <code>df["powers"]</code>? Let's take a look:</p>

In [121]:
df["powers"]

Unnamed: 0,x,y
a,1,1
b,2,3
c,4,9
d,8,27
e,16,81


<p>Aha! We just got back a DataFrame containing the columns below <code>powers</code>, as expected. To get just one of the two columns, you'd use the expression <code>df["powers", "x"]</code>:</p>

In [122]:
df["powers", "x"]

a     1
b     2
c     4
d     8
e    16
Name: (powers, x), dtype: int32

<p>Likewise, you can change its values in the same way:</p>

In [123]:
df["powers", "x"] = 0
df

Unnamed: 0_level_0,powers,powers,random
Unnamed: 0_level_1,x,y,z
a,0,1,45
b,0,3,98
c,0,9,24
d,0,27,11
e,0,81,64


<p>Additionally, you can use accessors to get or to modify the data:</p>

In [124]:
df = pd.DataFrame(
    data={("powers", "x"): 2**np.arange(5),
          ("powers", "y"): 3**np.arange(5),
          ("random", "z"): np.array([45, 98, 24, 11, 64])},
    index=["a", "b", "c", "d", "e"]
)

In [125]:
df.loc[["a", "b"], "powers"]

Unnamed: 0,x,y
a,1,1
b,2,3


<p>The returned object is a DataFrame. To get a particular column/row, use the following procedure:</p>

In [126]:
# Retrieve rows 'a' and 'b' from column 'x'
df.loc[["a", "b"], ("powers", "x")]

a    1
b    2
Name: (powers, x), dtype: int32

<p>The result is a Series object. To modify the elements of DataFrames with hierarchical indices, you can use this same approach:</p>

In [127]:
df.loc[["a", "b"], ("powers", "x")] = 0
df

Unnamed: 0_level_0,powers,powers,random
Unnamed: 0_level_1,x,y,z
a,0,1,45
b,0,3,98
c,4,9,24
d,8,27,11
e,16,81,64


<p>In this way, you avoid chained indexing with and without accessors. Remember that chained indexing can lead to a <code>SettingWithCopyWarning</code>:</p>

In [128]:
df = pd.DataFrame(
    data={("powers", "x"): 2**np.arange(5),
          ("powers", "y"): 3**np.arange(5),
          ("random", "z"): np.array([45, 98, 24, 11, 64])},
    index=["a", "b", "c", "d", "e"]
)

In [129]:
df

Unnamed: 0_level_0,powers,powers,random
Unnamed: 0_level_1,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24
d,8,27,11
e,16,81,64


In [130]:
df["powers"]

Unnamed: 0,x,y
a,1,1
b,2,3
c,4,9
d,8,27
e,16,81


In [131]:
df["powers"]["x"] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["powers"]["x"] = 0


In [132]:
df

Unnamed: 0_level_0,powers,powers,random
Unnamed: 0_level_1,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24
d,8,27,11
e,16,81,64


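<p>Here, <code>df["powers"]</code> returns a DataFrame with the columns <code>x</code> and <code>y</code>. The chained assignment raises the warning and, as the output above shows, leaves the original DataFrame untouched: the zeros were written to a temporary object. In previous versions (I guess), the output would show all rows in <code>x</code> set to <code>0</code>. This inconsistency is exactly what makes chained assignment unreliable. Running the same statement again gives the same result:</p>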

In [133]:
df["powers"]["x"] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["powers"]["x"] = 0


In [134]:
df

Unnamed: 0_level_0,powers,powers,random
Unnamed: 0_level_1,x,y,z
a,1,1,45
b,2,3,98
c,4,9,24
d,8,27,11
e,16,81,64


<p>Therefore, whenever possible, avoid chained assignment. A single indexing operation, with or without the <code>.loc</code> or <code>.iloc</code> accessors, selects and assigns in one step, so there is no intermediate object to modify by mistake.</p>

In [136]:
# Avoid chained assignment without accessors
df["powers", "x"] = 0
df

Unnamed: 0_level_0,powers,powers,random
Unnamed: 0_level_1,x,y,z
a,0,1,45
b,0,3,98
c,0,9,24
d,0,27,11
e,0,81,64


In [138]:
df = pd.DataFrame(
    data={("powers", "x"): 2**np.arange(5),
          ("powers", "y"): 3**np.arange(5),
          ("random", "z"): np.array([45, 98, 24, 11, 64], dtype=float)},
    index=["a", "b", "c", "d", "e"]
)

In [139]:
# Avoid chained assignments with accessors
df.loc[:, ("powers", "x")] = 0
df

Unnamed: 0_level_0,powers,powers,random
Unnamed: 0_level_1,x,y,z
a,0,1,45.0
b,0,3,98.0
c,0,9,24.0
d,0,27,11.0
e,0,81,64.0


<p>Both attempts returned the modified DataFrame with no warning being raised.</p>
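
<p>Sometimes the goal is the opposite: to work on the <code>powers</code> columns without touching <code>df</code> at all. A minimal sketch of that case, assuming an explicit <code>.copy()</code> is acceptable memory-wise, is shown below; the variable names <code>powers</code>, <code>original_x</code>, and <code>copied_x</code> are illustrative.</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data={("powers", "x"): 2**np.arange(5),
          ("powers", "y"): 3**np.arange(5),
          ("random", "z"): np.array([45, 98, 24, 11, 64])},
    index=["a", "b", "c", "d", "e"]
)

# Explicit copy: an independent DataFrame holding the columns x and y
powers = df["powers"].copy()

# This assignment only touches the copy
powers["x"] = 0

original_x = df["powers", "x"].tolist()
copied_x = powers["x"].tolist()
print(original_x)  # [1, 2, 4, 8, 16]
print(copied_x)    # [0, 0, 0, 0, 0]
```

<p>Calling <code>.copy()</code> makes the intent explicit: the new object is independent, so changes to it are not expected to reach <code>df</code>.</p>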
<h2><b>5. Change the Default <code>SettingWithCopyWarning</code> Behavior</b></h2>
<p>Use <code>mode.chained_assignment</code> with <code>pd.set_option()</code> to modify the default warning behavior. The possible settings are as follows:</p>
<ul>
    <li><code>pd.set_option("mode.chained_assignment", "raise")</code> - raises a <code>SettingWithCopyError</code></li>
    <li><b>(DEFAULT)</b> <code>pd.set_option("mode.chained_assignment", "warn")</code> - raises a <code>SettingWithCopyWarning</code></li>
    <li><code>pd.set_option("mode.chained_assignment", None)</code> - suppresses <i>both</i> the warning and the error</li>
</ul>
<p>Let's see some examples of the possible settings:</p>

In [140]:
df = pd.DataFrame(
    data={("powers", "x"): 2**np.arange(5),
          ("powers", "y"): 3**np.arange(5),
          ("random", "z"): np.array([45, 98, 24, 11, 64], dtype=float)},
    index=["a", "b", "c", "d", "e"]
)

In [141]:
pd.set_option("mode.chained_assignment", "raise")

In [None]:
# Will raise a SettingWithCopyError
df["powers"]["x"] = 0

<p>To get the current setting, use <code>mode.chained_assignment</code> together with <code>pd.get_option()</code>:</p>

In [143]:
pd.get_option("mode.chained_assignment")

'raise'
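
<p>Note that <code>pd.set_option()</code> changes the setting for the rest of the session. If the change is only needed for a few lines, <code>pd.option_context()</code> can apply it temporarily. The cell below is a minimal sketch of that idea, assuming the <code>mode.chained_assignment</code> option is available in the installed Pandas version; the names <code>before</code>, <code>inside</code>, and <code>after</code> are illustrative.</p>

```python
import pandas as pd

# Remember whatever the session is currently using
before = pd.get_option("mode.chained_assignment")

# Inside the block the setting is None; it is restored on exit
with pd.option_context("mode.chained_assignment", None):
    inside = pd.get_option("mode.chained_assignment")

after = pd.get_option("mode.chained_assignment")

print(before, inside, after)
```

<p>The previous value is restored automatically when the <code>with</code> block ends, even if an exception is raised inside it.</p>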

<h2><b>6. Conclusion</b></h2>
<p>This was a very useful lesson on views and copies in NumPy and Pandas and on how their behaviors differ. I also learned a great deal about the recurrent <code>SettingWithCopyWarning</code>, how it affects our datasets, and how to use the proper tools according to one's goals. These are relevant issues in the world of data, as we must be aware of how views and copies may or may not affect our analysis. As a reminder:</p>
<ul>
    <li>Indexing operations in NumPy and Pandas can return <b>views</b> or <b>copies</b> depending on the context.</li>
    <li>Both views and copies are useful, yet they present <b>different behaviors</b>.</li>
    <li>