<h1><b><code>SettingWithCopyWarning</code> in Pandas: Views vs Copies</b></h1>
<p>In this notebook, I cover an important concept that users should know once they get acquainted with Pandas: the <code>SettingWithCopyWarning</code>. As a beginner, I would often disregard this warning because the outcome of the code did not seem to change. The warning is closely related to the notions of <b>shallow copies</b> and <b>deep copies</b>. I'll discuss the difference between these concepts and assess the impact of their use in data analysis.</p>
<p>By the end of this notebook, I hope to have covered the following topics:</p>
<ul>
    <li>The definition of <b>views</b> and <b>copies</b> in NumPy and Pandas</li>
    <li>How to work with views and copies in these libraries</li>
    <li>Why <code>SettingWithCopyWarning</code> happens in Pandas</li>
    <li>How to avoid getting a <code>SettingWithCopyWarning</code> in Pandas</li>
</ul>

<h2><b>Table of contents:</b></h2>
<ul>
    <li>Example of a <code>SettingWithCopyWarning</code></li>
    <li>Views and Copies in NumPy and Pandas</li>
        <ul>
            <li>Understanding views and copies in NumPy</li>
            <li>Understanding views and copies in Pandas</li>
        </ul>
    <li>Indices and Slices in NumPy and Pandas</li>
        <ul>
            <li>Indexing in NumPy: copies and views</li>
            <li>Indexing in Pandas: copies and views</li>
        </ul>
<li>Use of Views and Copies in Pandas</li>
    <ul>
        <li>Chained Indexing and <code>SettingWithCopyWarning</code></li>
        <li>Impact of Data Types on Views, Copies, and the <code>SettingWithCopyWarning</code></li>
        <li>Hierarchical Indexing and <code>SettingWithCopyWarning</code></li>
    </ul>
<li>Change the Default <code>SettingWithCopyWarning</code> Behavior</li>
<li>Conclusion</li>
</ul>

<p>Let's start by importing the required modules and checking their versions before moving on to the notebook's topic.</p>



In [224]:
# Import libraries
import pandas as pd
import numpy as np

In [225]:
np.__version__

'1.24.2'

In [226]:
pd.__version__

'1.5.3'

<h3><b>1. Example of a <code>SettingWithCopyWarning</code></b></h3>
<p>As previously mentioned, I tended to ignore this issue because it is <i>not</i> an <i>error</i>, but a <i>warning</i>. It might sound obvious now, but what Pandas is doing is <i>warning</i> you that you might get unwanted behavior in your code.</p>
<p>Let's create a Pandas DataFrame and observe this issue in practice:</p>

In [227]:
data = {"x": 2**np.arange(5),
        "y": 3**np.arange(5), 
        "z": np.array([45, 98, 24, 11, 64])
        }

index = ["a", "b", "c", "d", "e"]

df = pd.DataFrame(data=data, index=index)

In [228]:
df

    x   y   z
a   1   1  45
b   2   3  98
c   4   9  24
d   8  27  11
e  16  81  64


<p>Now, we have a dictionary referenced by the variable <code>data</code>, which contains:</p>
<ul>
    <li>Keys - <code>x</code>, <code>y</code>, and <code>z</code> - they are our column labels in the DataFrame</li>
    <li>Three NumPy arrays, which are our observations</li>
    <ul>
        <li><code>np.arange(5)</code> returns evenly spaced values within a given interval. Since we passed <code>5</code> as the argument, the result is <code>[0, 1, 2, 3, 4]</code>. Each of these values is then used as an exponent of 2, giving <code>[1, 2, 4, 8, 16]</code>. The same idea is applied to the second column, this time with base 3.</li>
        <li><code>np.array()</code> creates an array.</li>
    </ul>
    <li>Finally, a list, which will be used to provide an index to our DataFrame</li>
</ul>
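<p>As a quick check of the arithmetic described above, the following minimal sketch evaluates the two power expressions on their own. The variable names <code>exponents</code>, <code>powers_of_two</code>, and <code>powers_of_three</code> are only illustrative and are not used elsewhere in the notebook.</p>

```python
import numpy as np

# np.arange(5) yields the exponents 0..4
exponents = np.arange(5)

# A scalar base raised to an array of exponents is computed element-wise
powers_of_two = 2 ** exponents    # 2**0, 2**1, ..., 2**4
powers_of_three = 3 ** exponents  # 3**0, 3**1, ..., 3**4

print(exponents.tolist())        # [0, 1, 2, 3, 4]
print(powers_of_two.tolist())    # [1, 2, 4, 8, 16]
print(powers_of_three.tolist())  # [1, 3, 9, 27, 81]
```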
<p>We are now ready to deal with a <code>SettingWithCopyWarning</code>. Let's start by creating a boolean mask with a comparison operator:</p>

In [229]:
mask = df["z"] < 50
mask

a     True
b    False
c     True
d     True
e    False
Name: z, dtype: bool

In [230]:
df[mask]

   x   y   z
a  1   1  45
c  4   9  24
d  8  27  11


<p>How does the mask work?</p>
<ul>
    <li><code>True</code> - <b>rows</b> in which the value of <code>z</code> is <i>less</i> than <code>50</code>.</li>
    <li><code>False</code> - <b>rows</b> in which the value of <code>z</code> is <i>not less</i> than <code>50</code>.</li>
</ul>
<p>Calling <code>df[mask]</code> returns a DataFrame that contains only the rows of <code>df</code> for which the mask is <code>True</code>. Here, we got rows <code>a</code>, <code>c</code>, and <code>d</code> back.</p>
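<p>To make the row selection explicit, here is a small self-contained sketch that rebuilds the same DataFrame and lists the labels picked by the mask and by its negation. The names <code>selected_labels</code> and <code>filtered</code> are illustrative only.</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": 2 ** np.arange(5), "y": 3 ** np.arange(5), "z": [45, 98, 24, 11, 64]},
    index=["a", "b", "c", "d", "e"],
)

mask = df["z"] < 50

# Labels for which the mask is True
selected_labels = mask[mask].index.tolist()
print(selected_labels)  # ['a', 'c', 'd']

# Boolean indexing keeps exactly those rows
filtered = df[mask]
print(filtered.index.tolist())  # ['a', 'c', 'd']

# ~mask inverts the condition and selects the remaining rows
print(df[~mask].index.tolist())  # ['b', 'e']
```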
<p>Now, if we try to change <code>df</code> by selecting those three rows with <code>mask</code> and then assigning to column <code>z</code>, we get a <code>SettingWithCopyWarning</code>, and <code>df</code> remains <b>unchanged</b>:</p>

In [231]:
df[mask]["z"] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[mask]["z"] = 0


In [232]:
df

    x   y   z
a   1   1  45
b   2   3  98
c   4   9  24
d   8  27  11
e  16  81  64


<p>What just happened?</p>
<ul>
    <li><code>df[mask]</code> returns a <b>brand-new DataFrame</b>, which holds a <b>copy</b> of the data from the original <code>df</code>, restricted to the rows where <code>mask</code> is <code>True</code>.</li>
    <li><code>df[mask]["z"] = 0</code> sets the column <code>z</code> of that new DataFrame to <b>zeros</b>, leaving <code>df</code> unchanged.</li>
</ul>
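<p>One way to check the first bullet above is <code>np.shares_memory</code>, which reports whether two arrays overlap in memory. The sketch below rebuilds the DataFrame and compares column <code>z</code> of <code>df</code> with column <code>z</code> of the boolean selection; the names <code>subset</code> and <code>shares</code> are illustrative.</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": 2 ** np.arange(5), "y": 3 ** np.arange(5), "z": [45, 98, 24, 11, 64]},
    index=["a", "b", "c", "d", "e"],
)
mask = df["z"] < 50

# Boolean-mask selection builds a new DataFrame
subset = df[mask]

# The selected column does not share a buffer with the original column
shares = np.shares_memory(df["z"].to_numpy(), subset["z"].to_numpy())
print(shares)  # False

# The selection by itself leaves the original values untouched
print(df["z"].tolist())  # [45, 98, 24, 11, 64]
```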

<p>Pandas issues this warning to remind you that the original DataFrame was not changed. <b>If you want to modify it</b>, you can apply one of the following <b>accessors</b>:</p>
<ul>
    <li><code>.loc[]</code></li>
    <li><code>.iloc[]</code></li>
    <li><code>.at[]</code></li>
    <li><code>.iat[]</code></li>
</ul>
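<p>As a rough illustration of how the four accessors listed above differ, the sketch below performs one assignment with each of them on a freshly built DataFrame. The helper variable <code>z_pos</code> and the values being assigned are arbitrary choices for this example.</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": 2 ** np.arange(5), "y": 3 ** np.arange(5), "z": [45, 98, 24, 11, 64]},
    index=["a", "b", "c", "d", "e"],
)
mask = df["z"] < 50

# .loc: label-based; accepts a boolean mask for rows and a column label
df.loc[mask, "z"] = 0

# .iloc: position-based; takes a plain boolean array and a column position
z_pos = df.columns.get_loc("z")
df.iloc[(df["x"] > 8).to_numpy(), z_pos] = -1

# .at: a single value addressed by row label and column label
df.at["b", "z"] = 100

# .iat: a single value addressed by row position and column position
df.iat[3, z_pos] = 7

print(df["z"].tolist())  # [0, 100, 0, 7, -1]
```

<p>Each statement uses a single indexing operation on <code>df</code>, so the assignment is applied to <code>df</code> itself.</p>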

In [233]:
df.loc[mask, "z"] = 0
df

    x   y   z
a   1   1   0
b   2   3  98
c   4   9   0
d   8  27   0
e  16  81  64


<p>We passed two arguments (<code>mask</code> and <code>"z"</code>) to the <code>.loc[]</code> accessor and assigned the value directly to the DataFrame. Alternatively, we can <b>change the evaluation order</b>:</p>

In [234]:
df = pd.DataFrame(data=data, index=index)
df["z"]

a    45
b    98
c    24
d    11
e    64
Name: z, dtype: int32

In [235]:
df["z"][mask] = 0
df

    x   y   z
a   1   1   0
b   2   3  98
c   4   9   0
d   8  27   0
e  16  81  64


<p>Here's what happened:</p>
<ul>
    <li><code>df["z"]</code> returns a <code>Series</code> that, this time, points to the <b>original data</b> and not to a copy of it.</li>
    <li><code>df["z"][mask] = 0</code> modifies this <code>Series</code> object by using <b>chained assignment</b> to set the masked values to <b>zero</b>.</li>
    <li>As a result, <code>df</code> is also modified, because the <code>Series</code> object <code>df["z"]</code> shares its data with <code>df</code>.</li>
</ul>
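<p>So, while <code>df[mask]</code> contains a <b>copy</b> of the data, <code>df["z"]</code> points to the <b>same data</b> as <code>df</code>. Keep in mind that this outcome is specific to the Pandas version used here (1.5.3): with Copy-on-Write, which is optional in Pandas 2.x and the default in Pandas 3.0, a chained assignment like this one does not modify <code>df</code>. Hence, the best practice to avoid a <code>SettingWithCopyWarning</code> is to <b>use accessors</b>. Why?</p>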
<ul>
    <li>Clearer intention to modify <code>df</code> when using a single method.</li>
    <li>Cleaner code for readers.</li>
    <li>Better performance.</li>
</ul>
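<p>The sketch below contrasts the chained form with the single <code>.loc</code> statement on two fresh DataFrames. It is only an illustration: the helper <code>make_df</code> and the names <code>df_chained</code>, <code>df_single</code>, and <code>chained_changed</code> are hypothetical, warnings are silenced on purpose, and the result of the chained form is printed without an expected value because it depends on the Pandas version and settings.</p>

```python
import warnings

import numpy as np
import pandas as pd


def make_df():
    """Build the example DataFrame used throughout this notebook."""
    return pd.DataFrame(
        {"x": 2 ** np.arange(5), "y": 3 ** np.arange(5), "z": [45, 98, 24, 11, 64]},
        index=["a", "b", "c", "d", "e"],
    )


df_chained = make_df()
mask = df_chained["z"] < 50

# Chained assignment: two indexing operations, outcome is version-dependent
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df_chained["z"][mask] = 0

# Single statement with an accessor: one indexing operation on the DataFrame
df_single = make_df()
df_single.loc[mask, "z"] = 0

print(df_single["z"].tolist())  # [0, 98, 0, 0, 64]

# Did the chained form reach the original DataFrame in this environment?
chained_changed = df_chained["z"].tolist() == df_single["z"].tolist()
print(chained_changed)
```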
<p>Still, accessors might return copies, as the code below demonstrates:</p>

In [236]:
df = pd.DataFrame(data=data, index=index)

In [237]:
df.loc[mask]["z"] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[mask]["z"] = 0


In [238]:
df

    x   y   z
a   1   1  45
b   2   3  98
c   4   9  24
d   8  27  11
e  16  81  64


<p>We observe this behavior because <code>df.loc[mask]</code> returns a <b>new DataFrame</b> that holds a <b>copy</b> of the selected rows of <code>df</code>. Then, <code>df.loc[mask]["z"] = 0</code> modifies that <b>new DataFrame</b>, not <code>df</code>. To avoid the warning, we must:</p>
<ul>
    <li><b>Avoid chained assignments</b> that combine <b>two or more</b> indexing operations such as <code>df["z"][mask] = 0</code> and <code>df.loc[mask]["z"] = 0</code>.</li>
    <li><b>Apply single statements</b> with <b>just one indexing operation</b> like <code>df.loc[mask, "z"] = 0</code>.</li>
</ul>
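<p>When the goal is to work on the selected rows <i>without</i> touching <code>df</code>, one common option is to request the copy explicitly, so that the intention is clear to both the reader and Pandas. The following is a minimal sketch of that idea; the name <code>subset</code> is illustrative.</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": 2 ** np.arange(5), "y": 3 ** np.arange(5), "z": [45, 98, 24, 11, 64]},
    index=["a", "b", "c", "d", "e"],
)
mask = df["z"] < 50

# Ask for an independent copy of the selected rows
subset = df.loc[mask].copy()

# Assigning to the copy is a single indexing operation on `subset`
subset["z"] = 0

print(subset["z"].tolist())  # [0, 0, 0]
print(df["z"].tolist())      # [45, 98, 24, 11, 64]
```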
<hr>
<h3><b>2. Views and Copies in NumPy and Pandas</b></h3>
<h4>2.1. Understanding Views and Copies in NumPy</h4>
<p>We can start by creating a NumPy array:</p>

In [239]:
arr = np.array([1, 2, 4, 8, 16, 32])
arr

array([ 1,  2,  4,  8, 16, 32])

<p>Now, let's create other arrays by extracting the second and fourth elements of <code>arr</code> as a new array:</p>

In [240]:
arr[1:4:2]

array([2, 8])

In [241]:
arr[[1, 3]]

array([2, 8])

<p>While both statements return the same array, their behavior is not the same:</p>

In [242]:
arr[1:4:2].base

array([ 1,  2,  4,  8, 16, 32])

In [243]:
arr[1:4:2].flags.owndata

False

In [244]:
arr[[1, 3]].base

In [245]:
arr[[1, 3]].flags.owndata

True

<p>Why does this behavior occur? Well, <code>arr[1:4:2]</code> (slicing) returns a <b>shallow copy</b>, while <code>arr[[1, 3]]</code> (indexing with a list) returns a <b>deep copy</b>. We are now going to differentiate these two concepts.</p>
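<p>Besides <code>.base</code> and <code>.flags.owndata</code>, NumPy offers <code>np.shares_memory</code> to test whether two arrays overlap in memory. The sketch below applies it to the two selections; the names <code>sliced</code> and <code>picked</code> are illustrative.</p>

```python
import numpy as np

arr = np.array([1, 2, 4, 8, 16, 32])

sliced = arr[1:4:2]   # slice: start 1, stop 4, step 2
picked = arr[[1, 3]]  # list of positions

# Same values either way
print(sliced.tolist(), picked.tolist())  # [2, 8] [2, 8]

# The slice looks at the memory of arr; the list-indexed result does not
print(np.shares_memory(arr, sliced))  # True
print(np.shares_memory(arr, picked))  # False

# Consistent with the flags shown above
print(sliced.flags.owndata, picked.flags.owndata)  # False True
```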
<h4>2.2. Views in NumPy</h4>
<p>In NumPy, a <b>shallow copy</b> or <b>view</b> is an array that <b>does not have its own data</b>. It is another way of looking at the data held by the original array. A view of an array can be created using <code>.view()</code>:</p>

In [246]:
view_of_arr = arr.view()
view_of_arr

array([ 1,  2,  4,  8, 16, 32])

In [247]:
view_of_arr.base

array([ 1,  2,  4,  8, 16, 32])

In [248]:
view_of_arr.base is arr

True

<p>Explanation:</p>
<ul>
    <li><code>view_of_arr</code> represents a view/shallow copy of the original array <code>arr</code>.</li>
    <li>When you access <code>.base</code> on <code>view_of_arr</code>, you get the original <code>arr</code> back.</li>
    <li><code>view_of_arr</code> doesn't own any data, as it only uses data belonging to <code>arr</code>, a fact that can be verified by using the attribute <code>.flags</code>:</li>
</ul>

In [249]:
view_of_arr.flags.owndata

False

<h4>2.3. Copies in NumPy</h4>
<p>A <b>deep copy</b>, also known simply as a <b>copy</b>, is a separate NumPy array that <b>has its own data</b>, obtained by copying the elements of the original array into the new array. The original and the copy are <b>two separate instances</b>. You can create a copy of an array with <code>.copy()</code>:</p>

In [250]:
copy_of_arr = arr.copy()
copy_of_arr

array([ 1,  2,  4,  8, 16, 32])

In [251]:
copy_of_arr.base is None

True

In [252]:
copy_of_arr.flags.owndata

True

<p>Aha! As we can see, <code>copy_of_arr</code> has no base array: the value of <code>copy_of_arr.base</code> is <code>None</code>, so it is not a shallow copy of another array. Also, <code>.flags.owndata</code> is <code>True</code>, which means that <code>copy_of_arr</code> owns its data.</p>
<h4>2.4. Differences Between Views and Copies</h4>
<p>We can now state two major <b>differences</b> between views and copies:</p>
<ol>
    <li>Views <b>do not require additional storage</b> for data; copies <b>do</b>.</li>
    <li>Modifying the original array <b>affects its views</b>, while changing the original array <b>will not affect its copy</b>.</li>
</ol>
<p>We can verify these differences by comparing the sizes of views and copies using <code>.nbytes</code>, which returns the memory consumed by the elements of the array:</p>

In [253]:
arr.nbytes

24

In [254]:
view_of_arr.nbytes

24

In [255]:
copy_of_arr.nbytes

24

<p>Apparently, there is no difference in terms of memory used. However, if we apply <code>sys.getsizeof()</code> to get the amount of memory <i>directly</i> attributed to each array, we get to see the difference:</p>

In [256]:
from sys import getsizeof

getsizeof(arr)

136

In [257]:
getsizeof(view_of_arr)

112

In [258]:
getsizeof(copy_of_arr)

136

<p>Because it doesn't have its own data elements, <code>view_of_arr</code> takes up only 112 bytes, which are used for its other attributes. The other two arrays take up those same 112 bytes <i>plus</i> the 24 bytes of their elements.</p>
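<p>The relationship between these numbers can be sketched as follows. The example fixes <code>dtype=np.int32</code> as an assumption, so that each element takes 4 bytes and the total matches the 24 bytes reported above; the absolute values returned by <code>getsizeof()</code> may differ between NumPy and Python versions, so only the difference is shown. The name <code>overhead_gap</code> is illustrative.</p>

```python
from sys import getsizeof

import numpy as np

# Assumption: 32-bit integers, to match the 24 bytes shown above
arr = np.array([1, 2, 4, 8, 16, 32], dtype=np.int32)
view_of_arr = arr.view()
copy_of_arr = arr.copy()

# nbytes counts only the elements: 6 elements * 4 bytes each
print(arr.size * arr.itemsize)  # 24
print(arr.nbytes)               # 24

# The view does not own the elements, so it is smaller by exactly nbytes
overhead_gap = getsizeof(arr) - getsizeof(view_of_arr)
print(overhead_gap)  # 24

# The copy owns its own elements, just like the original
print(getsizeof(copy_of_arr) == getsizeof(arr))  # True
```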
<p>We can modify any element of the original array to observe difference #2:</p>

In [259]:
arr[1] = 64
arr

array([ 1, 64,  4,  8, 16, 32])

In [260]:
view_of_arr

array([ 1, 64,  4,  8, 16, 32])

In [261]:
copy_of_arr

array([ 1,  2,  4,  8, 16, 32])

<p>Interesting! Because <code>view_of_arr</code> holds no data of its own and looks at the elements of its <code>.base</code>, which is <code>arr</code>, it is <b>modified</b> too. The copy, however, remains <b>unchanged</b>, as it <i>does not</i> share data with the original.</p>
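<p>The sharing also works in the opposite direction: writing through a view changes the original array, while writing to a copy does not. A minimal sketch, starting again from fresh arrays:</p>

```python
import numpy as np

arr = np.array([1, 2, 4, 8, 16, 32])
view_of_arr = arr.view()
copy_of_arr = arr.copy()

# Writing through the view changes the shared data, so arr changes too
view_of_arr[0] = 100
print(arr.tolist())  # [100, 2, 4, 8, 16, 32]

# Writing to the copy only touches the copy's own data
copy_of_arr[-1] = -1
print(arr.tolist())          # [100, 2, 4, 8, 16, 32]
print(copy_of_arr.tolist())  # [1, 2, 4, 8, 16, -1]
```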
<h4>2.5. Understanding Views and Copies in Pandas</h4>
<p>Pandas also differentiates views from copies. To create a view or a copy of a DataFrame, use <code>.copy()</code>. Its parameter <code>deep</code> determines whether you get a <b>view</b>, that is, a shallow copy (<code>deep=False</code>), or a <b>copy</b> (<code>deep=True</code>). By default, <code>deep</code> is <code>True</code>, so a copy is returned.</p>


In [262]:
df = pd.DataFrame(data=data, index=index)
df

    x   y   z
a   1   1  45
b   2   3  98
c   4   9  24
d   8  27  11
e  16  81  64


In [267]:
view_of_df = df.copy(deep=False)
view_of_df

    x   y   z
a   1   1  45
b   2   3  98
c   4   9  24
d   8  27  11
e  16  81  64


In [268]:
copy_of_df = df.copy()
copy_of_df

    x   y   z
a   1   1  45
b   2   3  98
c   4   9  24
d   8  27  11
e  16  81  64


<p>While at first sight there is no apparent difference between the view and the copy, if their NumPy representations are compared, there is a subtle difference:</p>

In [269]:
# Convert DataFrame to NumPy array
view_of_df.to_numpy().base is df.to_numpy().base

True

In [270]:
# Convert DataFrame to NumPy array
copy_of_df.to_numpy().base is df.to_numpy().base

False

<p>Again, we observe that <code>copy_of_df</code> holds its own data, while <code>view_of_df</code> shares the same data with <code>df</code>. We can modify <code>df</code> to verify this behavior:</p>

In [271]:
df["z"] = 0
df

    x   y  z
a   1   1  0
b   2   3  0
c   4   9  0
d   8  27  0
e  16  81  0


In [272]:
view_of_df

    x   y   z
a   1   1  45
b   2   3  98
c   4   9  24
d   8  27  11
e  16  81  64


In [273]:
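# Caveat: df.values and view_of_df.values are temporary arrays, so their
# id() values can coincide through address reuse; this check is not reliable
df_array_addr = id(df.values)
view_array_addr = id(view_of_df.values)

if df_array_addr == view_array_addr:
    print("view_of_df is a shallow copy of df")
else:
    print("view_of_df is a deep copy of it")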

view_of_df is a shallow copy of df


In [274]:
copy_of_df

    x   y   z
a   1   1  45
b   2   3  98
c   4   9  24
d   8  27  11
e  16  81  64


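<p>At first, this looks weird: the view was supposed to reflect the changes in the original DataFrame, but it didn't. The reason is that <code>df["z"] = 0</code> does not write into the existing data. It <b>replaces</b> the whole column <code>z</code> of <code>df</code> with a new array, while <code>view_of_df</code> keeps pointing to the old one. The copy is unchanged as well, as expected.</p>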
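<p>To probe data sharing without replacing a whole column, one option is to write a single cell in place with <code>.loc</code> and then compare the shallow and deep copies. This is only a sketch: the names <code>shallow</code> and <code>deep</code> are illustrative, and the value seen through the shallow copy is printed without an expected result because it depends on the Pandas version and on whether Copy-on-Write is enabled.</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": 2 ** np.arange(5), "y": 3 ** np.arange(5), "z": [45, 98, 24, 11, 64]},
    index=["a", "b", "c", "d", "e"],
)
shallow = df.copy(deep=False)
deep = df.copy(deep=True)

# In-place write to a single cell of the original
df.loc["a", "z"] = 0

print(df.loc["a", "z"])    # 0
print(deep.loc["a", "z"])  # 45, the deep copy never follows the original

# Version-dependent: follows df without Copy-on-Write, stays 45 with it
print(shallow.loc["a", "z"])
```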
<p>Row and column labels should also exhibit the same behavior.</p>

In [275]:
view_of_df.index is df.index

True

In [276]:
view_of_df.columns is df.columns

True

<p>With <code>deep=False</code>, neither the indices nor the data are copied.</p>

In [277]:
copy_of_df.index is df.index

False

In [278]:
copy_of_df.columns is df.columns

False