<h1><b><code>SettingWithCopyWarning</code> in Pandas: Views vs Copies</b></h1>
<p>In this notebook, I cover a concept that users should understand once they are acquainted with Pandas: the <code>SettingWithCopyWarning</code>. As a beginner, I often ignored this warning because the outcome of my code did not appear to change. The warning is closely related to the notions of <b>shallow copies</b> and <b>deep copies</b>. I'll discuss the difference between these concepts and assess how each one affects data analysis.</p>
<p>By the end of this notebook, I hope to have covered the following topics:</p>
<ul>
    <li>The definition of <b>views</b> and <b>copies</b> in NumPy and Pandas</li>
    <li>How to work with views and copies in these libraries</li>
    <li>Why <code>SettingWithCopyWarning</code> happens in Pandas</li>
    <li>How to avoid getting a <code>SettingWithCopyWarning</code> in Pandas</li>
</ul>

<h2><b>Table of contents:</b></h2>
<ul>
    <li>Example of a <code>SettingWithCopyWarning</code></li>
    <li>Views and Copies in NumPy and Pandas</li>
        <ul>
            <li>Understanding views and copies in NumPy</li>
            <li>Understanding views and copies in Pandas</li>
        </ul>
    <li>Indices and Slices in NumPy and Pandas</li>
        <ul>
            <li>Indexing in NumPy: copies and views</li>
            <li>Indexing in Pandas: copies and views</li>
        </ul>
<li>Use of Views and Copies in Pandas</li>
    <ul>
        <li>Chained Indexing and <code>SettingWithCopyWarning</code></li>
        <li>Impact of Data Types on Views, Copies, and the <code>SettingWithCopyWarning</code></li>
        <li>Hierarchical Indexing and <code>SettingWithCopyWarning</code></li>
    </ul>
<li>Change the Default <code>SettingWithCopyWarning</code> Behavior</li>
<li>Conclusion</li>
</ul>

<p>Let's start by importing the required modules and checking their versions. Then we can move on to this notebook's topic.</p>



In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
np.__version__

'1.24.2'

In [3]:
pd.__version__

'1.5.3'

<h3><b>1. Example of a <code>SettingWithCopyWarning</code></b></h3>
<p>As previously mentioned, I tended to ignore this issue because it is <i>not</i> an <i>error</i>, but a <i>warning</i>. It might sound obvious now, but Pandas is <i>warning</i> you that your code might not behave the way you intended.</p>
<p>Let's create a Pandas DataFrame and observe this issue in practice:</p>

In [4]:
data = {"x": 2**np.arange(5),
        "y": 3**np.arange(5), 
        "z": np.array([45, 98, 24, 11, 64])
        }

index = ["a", "b", "c", "d", "e"]

df = pd.DataFrame(data=data, index=index)

In [5]:
df

    x   y   z
a   1   1  45
b   2   3  98
c   4   9  24
d   8  27  11
e  16  81  64


<p>Now, we have a dictionary referenced by the variable <code>data</code>, plus a separate list referenced by <code>index</code>. Together they provide:</p>
<ul>
    <li>Keys - <code>x</code>, <code>y</code>, and <code>z</code> - which become the column labels of the DataFrame</li>
    <li>Three NumPy arrays, which hold our observations</li>
    <ul>
        <li><code>np.arange(5)</code> returns evenly spaced values within a given interval. Since we passed <code>5</code> as the argument, the result is <code>[0, 1, 2, 3, 4]</code>. These values are then used as exponents of 2, so <code>2**np.arange(5)</code> gives <code>[1, 2, 4, 8, 16]</code>. The same idea applies to the second column, this time with base 3, which gives <code>[1, 3, 9, 27, 81]</code>.</li>
        <li><code>np.array()</code> creates an array from the given list of values.</li>
    </ul>
    <li>Finally, the list <code>index</code>, which provides the row labels of our DataFrame</li>
</ul>
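<p>To make the exponent direction explicit, here is a minimal sketch that compares <code>2**np.arange(5)</code> with squaring the same range. The variable name <code>exponents</code> is only illustrative and does not appear elsewhere in the notebook.</p>

```python
import numpy as np

exponents = np.arange(5)   # array([0, 1, 2, 3, 4])

x = 2 ** exponents         # 2**0, 2**1, ..., 2**4
y = 3 ** exponents         # 3**0, 3**1, ..., 3**4
squares = exponents ** 2   # a different operation: each value squared

print(x.tolist())          # [1, 2, 4, 8, 16]
print(y.tolist())          # [1, 3, 9, 27, 81]
print(squares.tolist())    # [0, 1, 4, 9, 16]
```

<p>The first two results match the <code>x</code> and <code>y</code> columns shown in the DataFrame above, while the squared range does not.</p>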
<p>We are now ready to deal with a <code>SettingWithCopyWarning</code>. Let's start by creating a boolean mask with a comparison operator:</p>

In [6]:
mask = df["z"] < 50
mask

a     True
b    False
c     True
d     True
e    False
Name: z, dtype: bool

In [7]:
df[mask]

   x   y   z
a  1   1  45
c  4   9  24
d  8  27  11


<p>How does the mask work?</p>
<ul>
    <li><code>True</code> - <b>rows</b> in which the value of <code>z</code> is <i>less</i> than <code>50</code>.</li>
    <li><code>False</code> - <b>rows</b> in which the value of <code>z</code> is <i>not less</i> than <code>50</code>.</li>
</ul>
<p>Calling <code>df[mask]</code> returns only the rows of the original DataFrame for which the mask is <code>True</code>. Here, we get rows <code>a</code>, <code>c</code>, and <code>d</code> back.</p>
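<p>As a small, self-contained sketch of the same idea, the snippet below rebuilds the DataFrame and checks what the mask selects. The name <code>selected</code> is an illustrative choice, not something defined in the notebook.</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": 2 ** np.arange(5),
     "y": 3 ** np.arange(5),
     "z": np.array([45, 98, 24, 11, 64])},
    index=["a", "b", "c", "d", "e"],
)
mask = df["z"] < 50

# Boolean indexing keeps the rows whose mask entry is True.
selected = df[mask]

print(int(mask.sum()))                # 3 rows satisfy the condition
print(list(selected.index))           # ['a', 'c', 'd']
print(selected.equals(df.loc[mask]))  # True: .loc accepts the same mask
```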
<p>Now, if we try to change <code>df</code> by selecting those three rows with <code>mask</code> and then assigning to column <code>z</code>, we get a <code>SettingWithCopyWarning</code>, and <code>df</code> remains the <b>same</b>:</p>

In [8]:
df[mask]["z"] = 0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[mask]["z"] = 0


In [9]:
df

    x   y   z
a   1   1  45
b   2   3  98
c   4   9  24
d   8  27  11
e  16  81  64


<p>What just happened?</p>
<ul>
    <li><code>df[mask]</code> returns a <b>brand-new DataFrame</b>, which is a <b>copy</b> of the data from the original <code>df</code>, containing only the rows where <code>mask</code> is <code>True</code>.</li>
    <li><code>df[mask]["z"] = 0</code> sets column <code>z</code> of that new DataFrame to <b>zeros</b>, leaving <code>df</code> unchanged.</li>
</ul>
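<p>The two steps described above can be sketched separately. This is only an illustration under a few assumptions: the names <code>subset</code> and <code>shares</code> are mine, and I call <code>.copy()</code> before assigning so the sketch itself runs without triggering the warning.</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": 2 ** np.arange(5),
     "y": 3 ** np.arange(5),
     "z": np.array([45, 98, 24, 11, 64])},
    index=["a", "b", "c", "d", "e"],
)
mask = df["z"] < 50

# Step 1: the first pair of brackets builds a new DataFrame from df.
subset = df[mask]

# Boolean-mask selection copies the data, so the two objects share no memory.
shares = np.shares_memory(df["z"].to_numpy(), subset["z"].to_numpy())

# Step 2: the assignment writes to that new object only.
# An explicit copy keeps this sketch free of warnings.
subset = subset.copy()
subset["z"] = 0

print(shares)                # False
print(subset["z"].tolist())  # [0, 0, 0]
print(df["z"].tolist())      # [45, 98, 24, 11, 64], unchanged
```

<p>In the chained form <code>df[mask]["z"] = 0</code>, the intermediate object has no name, so the zeros are written and then immediately discarded.</p>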

<p>Pandas issues this warning to tell you that the assignment may not have reached the original DataFrame; here, it did not. <b>If you want to modify the original</b>, you can use one of the following <b>accessors</b>:</p>
<ul>
    <li><code>.loc[]</code></li>
    <li><code>.iloc[]</code></li>
    <li><code>.at[]</code></li>
    <li><code>.iat[]</code></li>
</ul>
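<p>Before applying this to our example, here is a rough sketch of how each accessor addresses cells. It works on a scratch DataFrame that I call <code>demo</code> (a hypothetical name), so the notebook's <code>df</code> is not affected.</p>

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {"x": 2 ** np.arange(5),
     "y": 3 ** np.arange(5),
     "z": np.array([45, 98, 24, 11, 64])},
    index=["a", "b", "c", "d", "e"],
)

demo.loc["a", "z"] = 0   # label-based: row "a", column "z"
demo.iloc[1, 2] = 0      # position-based: second row, third column
demo.at["c", "z"] = 0    # single cell, by label
demo.iat[3, 2] = 0       # single cell, by position

print(demo["z"].tolist())  # [0, 0, 0, 0, 64]
```

<p>Each statement is a single indexing operation on <code>demo</code> itself, so the assignment reaches the original object instead of a temporary one.</p>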

In [10]:
df.loc[mask, "z"] = 0
df

    x   y   z
a   1   1   0
b   2   3  98
c   4   9   0
d   8  27   0
e  16  81  64


<p>We passed two indexers (<code>mask</code> for the rows and <code>"z"</code> for the column) to the <code>.loc[]</code> accessor, so the values were assigned directly to the DataFrame. Alternatively, we can <b>change the evaluation order</b>:</p>

In [11]:
df = pd.DataFrame(data=data, index=index)
df["z"]

a    45
b    98
c    24
d    11
e    64
Name: z, dtype: int32

In [12]:
df["z"][mask] = 0
df

    x   y   z
a   1   1   0
b   2   3  98
c   4   9   0
d   8  27   0
e  16  81  64


<p>Here's what happened:</p>
<ul>
    <li><code>df["z"]</code> returns a <code>Series</code> that, this time, points to the <b>original data</b></li>