
Source: https://www.gormanalysis.com/blog/python-pandas-for-your-grandpa-2-8-series-view-vs-copy/

In this section, we’ll see when Series operations create a copy and when they create a view. Suppose you have this series, x.

In [1]:
import numpy as np
import pandas as pd

x = pd.Series(
    data=[2, 3, 5, 7, 11],
    index=[2, 11, 12, 30, 30]
)
print(x)
## 2      2
## 11     3
## 12     5
## 30     7
## 30    11
## dtype: int64

2      2
11     3
12     5
30     7
30    11
dtype: int64


Then you set a new variable y equal to x.

In [3]:
y = x

Then you modify the first element of y to be 99. Obviously this modifies y, but you might be surprised to see that it also modifies x.

In [4]:
y.iloc[0] = 99

print(x)


2     99
11     3
12     5
30     7
30    11
dtype: int64


This happens because when we set y equal to x, Pandas didn’t make a copy of x; y simply became a second name for the same Series object. In other words, the variable y points to the same block of data as x. This is known as “assignment by reference”, and some people would call y a “view” of x.
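
As a quick check of the point above, here is a minimal sketch using Python's `is` operator to confirm that both names refer to one object. The variable names `same_object` and `first_of_x` are only illustrative.

```python
import pandas as pd

x = pd.Series(
    data=[2, 3, 5, 7, 11],
    index=[2, 11, 12, 30, 30]
)
y = x  # plain assignment: no data is copied

# `is` tests object identity, so this tells us whether x and y
# are two names for the very same Series object.
same_object = y is x
print(same_object)  # True

# Writing through one name is therefore visible through the other.
y.iloc[0] = 99
first_of_x = x.iloc[0]
print(first_of_x)  # 99
```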

To avoid this, we can explicitly set y equal to a copy of x with the copy() method.

In [5]:
y = x.copy()

Now if we change y, x is unchanged because y points to a completely separate block of data.

In [6]:
y.iloc[0] = 123
print(x)
## 2     99
## 11     3
## 12     5
## 30     7
## 30    11
## dtype: int64

2     99
11     3
12     5
30     7
30    11
dtype: int64
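
The same identity check can be repeated for the copy. This is a small sketch with a fresh Series; the names `distinct_objects` and `first_of_x` are illustrative only.

```python
import pandas as pd

x = pd.Series(
    data=[2, 3, 5, 7, 11],
    index=[2, 11, 12, 30, 30]
)
y = x.copy()  # copy() makes a deep copy by default

# y is now a different object with its own data.
distinct_objects = y is not x
print(distinct_objects)  # True

# Modifying the copy leaves the original untouched.
y.iloc[0] = 123
first_of_x = x.iloc[0]
print(first_of_x)  # 2
```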


This is confusing partly because data is shared only under some circumstances, which aren’t clearly documented and aren’t always obvious. For example, suppose we have the Series

In [7]:
foo = pd.Series(['a', 'b', 'c', 'd'], dtype='string')
print(foo)
## 0    a
## 1    b
## 2    c
## 3    d
## dtype: string

0    a
1    b
2    c
3    d
dtype: string


and we set bar = foo.loc[foo <= 'b']

In [10]:
bar = foo.loc[foo <= 'b']
print(bar)
## 0    a
## 1    b
## dtype: string

0    a
1    b
dtype: string


Then we modify bar, setting its first element equal to ‘z’.

In [11]:
bar.iloc[0] = 'z'
print(bar)
## 0    z
## 1    b
## dtype: string

0    z
1    b
dtype: string


foo doesn’t get changed, which means that under the hood, Pandas copied the data in foo to create bar.

In [12]:
print(foo)
## 0    a
## 1    b
## 2    c
## 3    d
## dtype: string

0    a
1    b
2    c
3    d
dtype: string


Now suppose we set baz = foo.iloc[:2], which selects exactly the same subset we used to build bar, except that here we use slicing. Then, just like with bar, we set the first element of baz equal to ‘z’.

In [14]:
baz = foo.iloc[:2]
print(baz)
## 0    a
## 1    b
## dtype: string

baz.iloc[0] = 'z'


0    a
1    b
dtype: string


This time, in addition to baz changing, foo also gets changed.

In [15]:
print(foo)
## 0    z
## 1    b
## 2    c
## 3    d
## dtype: string

0    z
1    b
2    c
3    d
dtype: string


As far as I can tell, when it comes to Series, selecting with a boolean mask like B.loc[mask] returns a copy, while slicing like B.iloc[:2] returns a view, but this isn’t clearly documented, and the rules change when we start using DataFrames. The behavior also depends on your Pandas version: with Copy-on-Write enabled (opt-in in Pandas 2.x and the default in Pandas 3.0), modifying baz would not change foo. So I don’t recommend memorizing any hard and fast rules. Instead, experiment, use .copy() whenever you need an independent object, and be aware that this quirky behavior exists. I know it sounds weird, but this is the kind of thing you get a feel for over time.
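
One way to play around with this is to ask NumPy whether two Series are backed by overlapping memory. The sketch below is an assumption-laden helper, not part of Pandas: `shares_data` is a hypothetical name, and the check is only meaningful for plain NumPy-backed dtypes such as int64, because `to_numpy()` on other dtypes (such as 'string') may build a fresh array.

```python
import numpy as np
import pandas as pd

def shares_data(a: pd.Series, b: pd.Series) -> bool:
    """Hypothetical helper: True if the NumPy arrays behind a and b overlap."""
    return bool(np.shares_memory(a.to_numpy(), b.to_numpy()))

nums = pd.Series([2, 3, 5, 7, 11])

masked = nums.loc[nums <= 3]        # boolean-mask selection
sliced = nums.iloc[:2]              # positional slice
independent = nums.iloc[:2].copy()  # explicit copy of the slice

mask_shares = shares_data(nums, masked)
copy_shares = shares_data(nums, independent)
slice_shares = shares_data(nums, sliced)

print(mask_shares)   # False: the mask selection has its own data
print(copy_shares)   # False: copy() always gives independent data
print(slice_shares)  # result for the slice; inspect it on your own install
```

Shared memory alone doesn’t guarantee that a write to one Series shows up in the other (under Copy-on-Write, Pandas copies at write time), so treat this as a diagnostic rather than a rule.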

Understanding whether Pandas copies data also matters for pretty much any Pandas method that modifies a Series. For example, every Series has a method called replace(), which lets you replace values with other values. Suppose you have a Series of strings like this one, called zoo.

In [16]:
zoo = pd.Series(['tiger', 'lion', 'zebra', 'lion'])
print(zoo)
## 0    tiger
## 1     lion
## 2    zebra
## 3     lion
## dtype: object

0    tiger
1     lion
2    zebra
3     lion
dtype: object


If you want to replace every instance of ‘lion’ with ‘hamster’ and every instance of ‘tiger’ with ‘bunny’, you could do

In [20]:
zoo.replace({'lion':'hamster', 'tiger':'bunny'})
## 0      bunny
## 1    hamster
## 2      zebra
## 3    hamster
## dtype: object

0      bunny
1    hamster
2      zebra
3    hamster
dtype: object

In [19]:
print(zoo)    # which shows the original series is not changed

0    tiger
1     lion
2    zebra
3     lion
dtype: object


The result of this method is a copy of zoo with the replaced values. So we’re not actually modifying zoo, we’re just building a brand new Series from it.

If you wanted to update zoo with these replacements, you could just overwrite the variable with zoo = zoo.replace({'lion':'hamster', 'tiger':'bunny'}). This works, but internally Pandas creates a whole new Series, reassigns zoo to it, and then discards the old Series. As an alternative, lots of Pandas functions have a parameter called ‘inplace’ which, when True, tells Pandas to modify the object you’re operating on rather than return a modified copy of the data. Don’t count on a speedup, though: inplace=True may still copy the data internally.

So, if we wanted our replacements to stick, we could call

In [None]:
zoo.replace({'lion':'hamster', 'tiger':'bunny'}, inplace=True)
print(zoo)
## 0      bunny
## 1    hamster
## 2      zebra
## 3    hamster
## dtype: object

Now the data inside zoo actually gets updated with our replacements.
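
To compare the two approaches side by side, here is a short sketch. The names `mapping`, `zoo_a`, `zoo_b`, and `before` are illustrative; the point is only that both approaches leave the variable holding the same values.

```python
import pandas as pd

mapping = {'lion': 'hamster', 'tiger': 'bunny'}

# Approach 1: replace() returns a new Series; the original is untouched
# until we rebind the name to the result.
zoo_a = pd.Series(['tiger', 'lion', 'zebra', 'lion'])
result = zoo_a.replace(mapping)
before = zoo_a.tolist()  # still the original values
zoo_a = result           # rebind the name to the new Series

# Approach 2: inplace=True updates the existing Series object.
zoo_b = pd.Series(['tiger', 'lion', 'zebra', 'lion'])
zoo_b.replace(mapping, inplace=True)

print(before)          # ['tiger', 'lion', 'zebra', 'lion']
print(zoo_a.tolist())  # ['bunny', 'hamster', 'zebra', 'hamster']
print(zoo_b.tolist())  # ['bunny', 'hamster', 'zebra', 'hamster']
```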