In [1]:
import pandas as pd
movies = pd.read_csv('http://bit.ly/imdbratings')

In [2]:
movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


We will concentrate on the content_rating column and look for missing values

In [3]:
movies.content_rating.isnull().sum() # We're just counting up how many missing values there are

3

Let's look at the rows that have those missing values

In [6]:
movies[movies.content_rating.isnull()]

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
187,8.2,Butch Cassidy and the Sundance Kid,,Biography,110,"[u'Paul Newman', u'Robert Redford', u'Katharin..."
649,7.7,Where Eagles Dare,,Action,158,"[u'Richard Burton', u'Clint Eastwood', u'Mary ..."
936,7.4,True Grit,,Adventure,128,"[u'John Wayne', u'Kim Darby', u'Glen Campbell']"


NaN is a special value that stands for "Not a Number". In pandas it marks a missing value, so these three movies have no content rating
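A minimal sketch of how NaN behaves, using a made-up four-element Series rather than the real movies data (the name `ratings` and its values are assumptions for illustration). It shows why missing values are counted with isnull() rather than with an equality test.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the content_rating column.
ratings = pd.Series(['R', np.nan, 'PG-13', np.nan], name='content_rating')

# NaN is a float and never compares equal to anything, including itself.
nan_is_float = isinstance(np.nan, float)
nan_equals_itself = (np.nan == np.nan)

# An equality test therefore finds nothing; isnull() is the reliable check.
found_by_equality = int((ratings == np.nan).sum())
found_by_isnull = int(ratings.isnull().sum())

print(nan_is_float, nan_equals_itself, found_by_equality, found_by_isnull)
```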

In [8]:
movies.content_rating.value_counts() # To show us all the unique values in the content_rating column

R            460
PG-13        189
PG           123
NOT RATED     65
APPROVED      47
UNRATED       38
G             32
PASSED         7
NC-17          7
X              4
GP             3
TV-MA          1
Name: content_rating, dtype: int64

We will treat the 65 NOT RATED movies as having a missing content rating

Datasets sometimes use a flag value to mean "missing". It is best to replace such flags with NaN so that we can take advantage of the missing value functionality in pandas
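As an aside, when the flag is known before loading, one alternative (not the approach this notebook takes) is to let read_csv convert it through its na_values parameter. The sketch below uses a made-up three-row CSV held in a string, so the data and the name `sample` are assumptions.

```python
import io

import pandas as pd

# Hypothetical three-row CSV; the notebook itself loads the real file from a URL.
csv_text = """title,content_rating
12 Angry Men,NOT RATED
The Godfather,R
True Grit,
"""

# na_values adds 'NOT RATED' to the strings that pandas parses as NaN.
# The empty field on the last row is already treated as NaN by default.
sample = pd.read_csv(io.StringIO(csv_text), na_values=['NOT RATED'])

missing_count = int(sample.content_rating.isnull().sum())
print(missing_count)
```

The rest of this notebook replaces the flag after loading instead, which is what triggers the warning discussed below.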

1. Find all the movies with the NOT RATED content rating

In [9]:
movies[movies.content_rating=='NOT RATED']

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
5,8.9,12 Angry Men,NOT RATED,Drama,96,"[u'Henry Fonda', u'Lee J. Cobb', u'Martin Bals..."
6,8.9,"The Good, the Bad and the Ugly",NOT RATED,Western,161,"[u'Clint Eastwood', u'Eli Wallach', u'Lee Van ..."
41,8.5,Sunset Blvd.,NOT RATED,Drama,110,"[u'William Holden', u'Gloria Swanson', u'Erich..."
63,8.4,M,NOT RATED,Crime,99,"[u'Peter Lorre', u'Ellen Widmann', u'Inge Land..."
66,8.4,Munna Bhai M.B.B.S.,NOT RATED,Comedy,156,"[u'Sunil Dutt', u'Sanjay Dutt', u'Arshad Warsi']"
72,8.4,Rang De Basanti,NOT RATED,Drama,157,"[u'Aamir Khan', u'Soha Ali Khan', u'Siddharth']"
83,8.4,To Kill a Mockingbird,NOT RATED,Drama,129,"[u'Gregory Peck', u'John Megna', u'Frank Overt..."
87,8.4,Bicycle Thieves,NOT RATED,Drama,93,"[u'Lamberto Maggiorani', u'Enzo Staiola', u'Li..."
88,8.4,The Kid,NOT RATED,Comedy,68,"[u'Charles Chaplin', u'Edna Purviance', u'Jack..."
89,8.4,Swades,NOT RATED,Drama,189,"[u'Shah Rukh Khan', u'Gayatri Joshi', u'Kishor..."


2. Select the content_rating column of those rows. This is the Series that we want to overwrite with NaN

In [10]:
movies[movies.content_rating=='NOT RATED'].content_rating

5      NOT RATED
6      NOT RATED
41     NOT RATED
63     NOT RATED
66     NOT RATED
72     NOT RATED
83     NOT RATED
87     NOT RATED
88     NOT RATED
89     NOT RATED
93     NOT RATED
100    NOT RATED
104    NOT RATED
105    NOT RATED
108    NOT RATED
109    NOT RATED
111    NOT RATED
116    NOT RATED
122    NOT RATED
128    NOT RATED
132    NOT RATED
133    NOT RATED
134    NOT RATED
140    NOT RATED
149    NOT RATED
165    NOT RATED
167    NOT RATED
169    NOT RATED
174    NOT RATED
178    NOT RATED
         ...    
215    NOT RATED
231    NOT RATED
234    NOT RATED
246    NOT RATED
252    NOT RATED
254    NOT RATED
255    NOT RATED
263    NOT RATED
265    NOT RATED
315    NOT RATED
328    NOT RATED
343    NOT RATED
405    NOT RATED
419    NOT RATED
427    NOT RATED
453    NOT RATED
478    NOT RATED
481    NOT RATED
491    NOT RATED
528    NOT RATED
531    NOT RATED
546    NOT RATED
573    NOT RATED
592    NOT RATED
647    NOT RATED
665    NOT RATED
673    NOT RATED
763    NOT RAT

Now we have just that Series

Finally, we want to overwrite 'NOT RATED' with NaN

NaN is not a string; it is a special floating-point value provided by the NumPy library (np.nan)

In [14]:
import numpy as np

In [15]:
movies[movies.content_rating=='NOT RATED'].content_rating = np.nan

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value


In [16]:
movies.content_rating.isnull().sum()

3

The count is still 3, which tells us that the assignment did not work; otherwise the value here would be 68 (3 + 65). The SettingWithCopyWarning is just that: a warning rather than an error. After seeing it, you should check whether the operation actually worked. Ours did not

###### The solution: follow the advice given in the warning message

In [18]:
movies.loc[movies.content_rating=='NOT RATED', 'content_rating'] = np.nan

In [19]:
movies.content_rating.isnull().sum()

68

Re-running the count shows that the assignment worked this time: the 65 NOT RATED movies are now NaN, which brings the total to 68

###### What actually happened here?

The line of code that raised the warning is actually two operations. The first is a get item, ie movies[movies.content_rating=='NOT RATED']. The second is a set item, ie .content_rating = np.nan, which is applied to the result of the get item.

Here's the problem: pandas cannot guarantee whether the get item returned a view or a copy of the data. If it returned a view, the set item would modify the original dataframe. If it returned a copy, the set item would still modify something, but only the copy, which is then discarded, so the original dataframe is left unchanged. Because pandas does not know which of the two happened, it warns you. In our case the count stayed at 3, so the assignment went to a copy.

.loc solves this problem by turning the two operations into a single set item operation. That's why it does not raise the warning.

If you are trying to select rows and columns in the same line of code, use .loc. Note that .loc is an indexer used with square brackets, not a method called with parentheses
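The difference between the two forms can be sketched on a small made-up frame (the name `df` and its three rows are assumptions, not the movies data). The warning is silenced only to keep the demo output clean; the exact warning class may differ between pandas versions.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical toy frame with the same column name as the movies data.
df = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'content_rating': ['R', 'NOT RATED', 'NOT RATED'],
})

# Chained form: a get item (boolean filter) followed by a set item on its result.
# Here the filter result is a separate object, so the assignment never reaches df.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the warning for this demo only
    df[df.content_rating == 'NOT RATED'].content_rating = np.nan
missing_after_chained = int(df.content_rating.isnull().sum())

# Single set item through .loc: rows and column are chosen in one operation.
df.loc[df.content_rating == 'NOT RATED', 'content_rating'] = np.nan
missing_after_loc = int(df.content_rating.isnull().sum())

print(missing_after_chained, missing_after_loc)
```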

###### Second example of the SettingWithCopyWarning

Scenario - We only want to focus on movies with a very high star rating

In [21]:
top_movies = movies.loc[movies.star_rating >= 9, :]

This gives us all the movies in the df with a star rating of 9 or higher

In [22]:
top_movies

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."


Suppose we notice that the duration for The Shawshank Redemption is incorrect and we want to fix it

In [24]:
top_movies.loc[0, 'duration'] = 150

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Again, the suggestion is to use .loc, but that is what we used. So why are we still getting this warning?

In [25]:
top_movies

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,150,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."


This is a good example of why it is not a good idea to turn these warnings off. Sometimes the warning means the operation didn't work, and sometimes the operation does work but you still get the warning (here the duration was updated to 150). You need to see the warning so that you know to check whether your code worked

###### Why did we get the warning?

Pandas isn't sure whether top_movies is a view of movies or a copy of it, that is, whether it refers to the original movies df or is a separate object. So it warns you, because it cannot tell whether you are modifying one thing (top_movies only) or two things (top_movies and movies)

###### What is the solution?

The problem stems from the line that created top_movies: top_movies = movies.loc[movies.star_rating >= 9, :]

In [27]:
top_movies = movies.loc[movies.star_rating >= 9, :].copy()

Any time you want a df to be a copy, explicitly use the .copy() method. Pandas then knows that top_movies is a copy rather than a view, so it has no reason to warn you when you modify it

In [28]:
top_movies.loc[0, 'duration'] = 150 # Now we do not get a warning
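A small sketch of what the explicit copy buys us, again on a made-up frame (the names `df` and `top` and all values are assumptions): the edit lands in the copy and the frame it was taken from keeps its original value.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative, not the real dataset.
df = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'star_rating': [9.3, 9.2, 8.0],
    'duration': [142, 175, 120],
})

# .copy() makes the subset an independent DataFrame.
top = df.loc[df.star_rating >= 9, :].copy()
top.loc[0, 'duration'] = 150

copy_duration = int(top.loc[0, 'duration'])
original_duration = int(df.loc[0, 'duration'])
print(copy_duration, original_duration)
```

If the change is also wanted in the original frame, it has to be made there as well, for example with a .loc assignment on the original.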