In [36]:
# https://www.dataquest.io/blog/settingwithcopywarning/

In [3]:
import pandas as pd
data = pd.read_csv('./xbox-3-day-auctions.csv')
data.head()

Unnamed: 0,auctionid,bid,bidtime,bidder,bidderrate,openbid,price
0,8213034705,95.0,2.927373,jake7870,0,95.0,117.5
1,8213034705,115.0,2.943484,davidbresler2,1,95.0,117.5
2,8213034705,100.0,2.951285,gladimacowgirl,58,95.0,117.5
3,8213034705,117.5,2.998947,daysrus,10,95.0,117.5
4,8213060420,2.0,0.065266,donnie4814,5,1.0,120.0


In [4]:
# auctionid — A unique identifier of each auction.
# bid — The value of the bid.
# bidtime — The age of the auction, in days, at the time of the bid.
# bidder — eBay username of the bidder.
# bidderrate – The bidder’s eBay user rating.
# openbid — The opening bid set by the seller for the auction.
# price — The winning bid at the close of the auction.

<img src="view-vs-copy.png">

## Common issue #1: Chained assignment

In [5]:
# Assignment — Operations that set the value of something, for example 
# data = pd.read_csv('xbox-3-day-auctions.csv'). Often referred to as a set.

In [6]:
# Access — Operations that return the value of something, such as the below 
# examples of indexing and chaining. Often referred to as a get.

In [7]:
# Indexing — Any assignment or access method that references a subset of the data; 
# for example data[1:5].

data[1:5]

Unnamed: 0,auctionid,bid,bidtime,bidder,bidderrate,openbid,price
1,8213034705,115.0,2.943484,davidbresler2,1,95.0,117.5
2,8213034705,100.0,2.951285,gladimacowgirl,58,95.0,117.5
3,8213034705,117.5,2.998947,daysrus,10,95.0,117.5
4,8213060420,2.0,0.065266,donnie4814,5,1.0,120.0


In [8]:
# Chaining — The use of more than one indexing operation back-to-back; for example data[1:5][1:3].

data[1:5][1:3]

Unnamed: 0,auctionid,bid,bidtime,bidder,bidderrate,openbid,price
2,8213034705,100.0,2.951285,gladimacowgirl,58,95.0,117.5
3,8213034705,117.5,2.998947,daysrus,10,95.0,117.5
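
The difference between a single indexing operation and a chain can be reproduced without the CSV file. The sketch below uses a small hand-built frame (`sample` and its values are stand-ins for the first rows of the auction data, not the real file) and compares the row labels each operation returns.

```python
import pandas as pd

# Hypothetical stand-in for the first five rows of the auction data.
sample = pd.DataFrame({
    "bidder": ["jake7870", "davidbresler2", "gladimacowgirl", "daysrus", "donnie4814"],
    "bid": [95.0, 115.0, 100.0, 117.5, 2.0],
})

# Indexing: one operation that selects the rows at positions 1 through 4.
subset = sample[1:5]

# Chaining: a second indexing operation applied to the result of the first.
chained = sample[1:5][1:3]

print(list(subset.index))
print(list(chained.index))
```

The second slice counts positions within the result of the first slice, which is why the chained result keeps the rows labelled 2 and 3.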


In [9]:
# Chained assignment is the combination of chaining and assignment. Let’s take a quick look at an 
# example with the data set we loaded earlier. We will go over this in more detail later on.

# For the sake of this example, let’s say we have been told that the bidder rating recorded
# for the user 'parakeet2004' is incorrect and we must update it.


In [10]:
data[data.bidder == 'parakeet2004']

Unnamed: 0,auctionid,bid,bidtime,bidder,bidderrate,openbid,price
6,8213060420,3.0,0.186539,parakeet2004,5,1.0,120.0
7,8213060420,10.0,0.18669,parakeet2004,5,1.0,120.0
8,8213060420,24.99,0.187049,parakeet2004,5,1.0,120.0


In [11]:
# We have three rows to update the bidderrate field on; let’s go ahead and do that.

In [14]:
data[data.bidder == 'parakeet2004']['bidderrate']

6    5
7    5
8    5
Name: bidderrate, dtype: int64

In [12]:
data[data.bidder == 'parakeet2004']['bidderrate'] = 100

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data[data.bidder == 'parakeet2004']['bidderrate'] = 100


In [13]:
# If we take a look, we can see that in this case the values were not changed:

data[data.bidder == 'parakeet2004']

Unnamed: 0,auctionid,bid,bidtime,bidder,bidderrate,openbid,price
6,8213060420,3.0,0.186539,parakeet2004,5,1.0,120.0
7,8213060420,10.0,0.18669,parakeet2004,5,1.0,120.0
8,8213060420,24.99,0.187049,parakeet2004,5,1.0,120.0


In [15]:
# The SettingWithCopyWarning was generated because we chained two indexing operations together.
# This is easy to spot here because we used square brackets twice, but the same would be true
# with other access methods such as .bidderrate, .loc[], .iloc[] or, in pandas versions before
# 1.0, .ix[]. Our chained operations are:

#    data[data.bidder == 'parakeet2004']
#    ['bidderrate'] = 100

In [16]:
# These two chained operations execute independently, one after the other.

# The first is an access (get) operation that returns a new DataFrame containing all rows
# where bidder equals 'parakeet2004'.

# The second is an assignment (set) operation that is called on this new DataFrame.

# The assignment therefore never touches the original DataFrame.
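
To make the two steps visible, the chained assignment can be written out as two separate statements. This is only a sketch on a small hand-built frame: `df` and `matching_rows` are hypothetical names, and the warning is silenced because older pandas versions emit it at the second step.

```python
import warnings

import pandas as pd

# Hypothetical stand-in for the auction data.
df = pd.DataFrame({
    "bidder": ["parakeet2004", "parakeet2004", "daysrus"],
    "bidderrate": [5, 5, 10],
})

# Step 1 (get): boolean indexing builds a new DataFrame holding the matching rows.
matching_rows = df[df.bidder == "parakeet2004"]

# Step 2 (set): the assignment targets that new object, not df.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # older pandas versions warn at this point
    matching_rows["bidderrate"] = 100

print(df["bidderrate"].tolist())             # original values are untouched
print(matching_rows["bidderrate"].tolist())  # only the temporary object changed
```

In the one-line chained form the temporary object is never bound to a name, so the updated values are discarded as soon as the statement finishes.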

In [17]:
# The solution is simple: combine the chained operations into a single operation using loc so that
# pandas can ensure the original DataFrame is set. Pandas will always ensure that unchained set
# operations, like the one below, work.

In [18]:
# Setting the new value
data.loc[data.bidder == 'parakeet2004', 'bidderrate'] = 100

In [19]:
# Taking a look at the result
data[data.bidder == 'parakeet2004']['bidderrate']

6    100
7    100
8    100
Name: bidderrate, dtype: int64

In [20]:
# This is what the warning suggests we do, and it works perfectly in this case.
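
The same fix can be tried on a small hand-built frame. In this sketch `auctions` and `is_target` are hypothetical names; storing the boolean mask in a variable is just one way to keep the `.loc` call readable.

```python
import pandas as pd

# Hypothetical stand-in for the auction data.
auctions = pd.DataFrame({
    "bidder": ["parakeet2004", "daysrus", "parakeet2004"],
    "bidderrate": [5, 10, 5],
})

# Build the row selector once so it can be reused for the set and for later checks.
is_target = auctions["bidder"] == "parakeet2004"

# One indexing operation: the rows and the column are passed to .loc together.
auctions.loc[is_target, "bidderrate"] = 100

print(auctions["bidderrate"].tolist())
```

Because the row selector and the column label go through a single `.loc` call, the assignment is applied to `auctions` itself rather than to an intermediate object.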

## Common issue #2: Hidden chaining

In [21]:
# Let’s investigate winning bids. We will create a new DataFrame to work with them, taking care to
# use loc going forward now that we have learned our lesson about chained assignment.

In [22]:
winners = data.loc[data.bid == data.price]
winners.head()

Unnamed: 0,auctionid,bid,bidtime,bidder,bidderrate,openbid,price
3,8213034705,117.5,2.998947,daysrus,10,95.0,117.5
25,8213060420,120.0,2.999722,djnoeproductions,17,1.0,120.0
44,8213067838,132.5,2.996632,*champaignbubbles*,202,29.99,132.5
45,8213067838,132.5,2.997789,*champaignbubbles*,202,29.99,132.5
66,8213073509,114.5,2.999236,rr6kids,4,1.0,114.5


In [24]:
# By chance, we come across another mistake in our DataFrame. 

# This time the bidder value is missing from the row labelled 304.

In [25]:
winners.loc[304, 'bidder']

nan

In [26]:
# For the sake of our example, let’s say that we know the true username of this bidder and
# update our data.

winners.loc[304, 'bidder'] = 'therealname'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_column(loc, value, pi)


In [27]:
# But we used loc, so how has this happened again? To investigate, let’s take a look at the
# result of our code:

print(winners.loc[304, 'bidder'])

therealname


In [28]:
# It worked this time, so why did we get the warning?

In [29]:
# Chained indexing can occur across two lines as well as within one. Because winners was created as
# the output of a get operation (data.loc[data.bid == data.price]), it might be a view of the
# original DataFrame or a copy of it, and until we checked there was no way to know.

# When we indexed winners, we were actually using chained indexing.

# This means that we may also have modified data when we were trying to modify winners.

# In a real codebase, these lines could occur very far apart, so tracking down the source of the
# problem might be more difficult, but the situation is the same.
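
One possible way to check whether a derived frame still shares data with its source is to compare the underlying arrays with NumPy. This is not part of the original walkthrough: `np.shares_memory` and the hand-built `auctions` frame are assumptions made for illustration, and the check is shown only for a numeric column selected with a boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the auction data.
auctions = pd.DataFrame({
    "bid": [117.5, 100.0, 120.0],
    "price": [117.5, 114.5, 120.0],
})

# A get operation with a boolean mask, as in the winners example.
winners = auctions.loc[auctions["bid"] == auctions["price"]]

# Compare the memory behind the same column in both frames.
shares = np.shares_memory(winners["bid"].to_numpy(), auctions["bid"].to_numpy())
print(shares)
```

If the check reports no shared memory, writes to the derived frame cannot reach the source frame, even when pandas still warns about them.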

In [30]:
# To prevent the warning in this case, the solution is to explicitly tell pandas to make a copy
# when we create the new DataFrame:

In [31]:
winners = data.loc[data.bid == data.price].copy()
winners.loc[304, 'bidder'] = 'therealname'

In [32]:
print(winners.loc[304, 'bidder'])
print(data.loc[304, 'bidder'])

therealname
nan


In [34]:
print(winners.loc[304, 'bidder'] == data.loc[304, 'bidder'])

False
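
The explicit-copy pattern can also be run without the CSV file. In the sketch below the frame is hand-built and row label 2 plays the role of row 304; all names and values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the auction data; the last bidder is missing.
auctions = pd.DataFrame({
    "bidder": ["daysrus", "rr6kids", None],
    "bid": [117.5, 100.0, 120.0],
    "price": [117.5, 114.5, 120.0],
})

# .copy() makes winners an independent DataFrame from the start.
winners = auctions.loc[auctions["bid"] == auctions["price"]].copy()
winners.loc[2, "bidder"] = "therealname"

print(winners.loc[2, "bidder"])
print(pd.isna(auctions.loc[2, "bidder"]))  # the source frame still has the gap
```

Calling `.copy()` at creation time states the intent in the code: later edits to `winners` are meant to stay in `winners`.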
