---
# Chapter 2: Essential DataFrame Operations
---

In [1]:
import numpy as np
import pandas as pd

In [3]:
movies = pd.read_csv('./movie.csv')
movies.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens,8,143,,0.0,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,,,,,12.0,7.1,,0


## Selecting multiple DataFrame columns


In [4]:
# Select all the actor and director columns
movie_actor_director = movies[['director_name', 
                               'actor_1_name', 
                               'actor_2_name', 
                               'actor_3_name'
                               ]]
movie_actor_director.head()

Unnamed: 0,director_name,actor_1_name,actor_2_name,actor_3_name
0,James Cameron,CCH Pounder,Joel David Moore,Wes Studi
1,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport
2,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman
3,Christopher Nolan,Tom Hardy,Christian Bale,Joseph Gordon-Levitt
4,Doug Walker,Doug Walker,Rob Walker,


There are instances when one column of a DataFrame needs to be selected. Using
the index operation can return either a Series or a DataFrame. If we pass in a list
with a single item, we will get back a DataFrame. If we pass in just a string with
the column name, we will get a Series back:

In [5]:
type(movie_actor_director['director_name'])

pandas.core.series.Series

In [6]:
type(movie_actor_director[['director_name']])

pandas.core.frame.DataFrame
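
The difference between the two calls above is easy to check on a small frame. The following is a minimal sketch: `toy` is a hypothetical stand-in built from the first two rows shown earlier, not the full movie dataset.

```python
import pandas as pd

# Hypothetical two-row stand-in for the movie data
toy = pd.DataFrame({
    "director_name": ["James Cameron", "Gore Verbinski"],
    "actor_1_name": ["CCH Pounder", "Johnny Depp"],
})

as_series = toy["director_name"]    # a plain string key gives a Series
as_frame = toy[["director_name"]]   # a one-item list gives a DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)
```

The one-column DataFrame keeps its two-dimensional shape, which matters when later code expects column labels.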

We can also use `.loc` to pull out a column by name. Because `.loc` expects a row selector first, we pass a colon (`:`) as a slice that selects all of the rows. As with the indexing operator, a string returns a Series and a one-item list returns a DataFrame:

In [7]:
type(movie_actor_director.loc[:, 'director_name'])

pandas.core.series.Series

In [8]:
type(movie_actor_director.loc[:, ['director_name']])

pandas.core.frame.DataFrame
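
As a small sketch of the same idea, `.loc` also accepts a slice of column labels. The frame `toy` below is a hypothetical stand-in, not the movie dataset. Note that label slices with `.loc` include both endpoints.

```python
import pandas as pd

# Hypothetical stand-in with three of the name columns
toy = pd.DataFrame({
    "director_name": ["James Cameron", "Gore Verbinski"],
    "actor_1_name": ["CCH Pounder", "Johnny Depp"],
    "actor_2_name": ["Joel David Moore", "Orlando Bloom"],
})

one_col = toy.loc[:, "director_name"]          # Series
one_col_frame = toy.loc[:, ["director_name"]]  # DataFrame

# A label slice keeps every column from the start label to the stop label
span = toy.loc[:, "director_name":"actor_1_name"]
print(list(span.columns))  # ['director_name', 'actor_1_name']
```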

Passing a long list inside the indexing operator might cause readability issues. To help with
this, you may save all your column names to a list variable first.

In [9]:
# Create a list that contains the column names
cols = ['director_name', 'actor_1_name', 'actor_2_name', 'actor_3_name']
movie_actor_director = movies[cols]

## Selecting columns with methods

Some DataFrame methods make column selection easier: `.select_dtypes` selects columns by data type, and `.filter` selects them by name.

Read in the movie dataset. Shorten the column names for display. Use the `.value_counts` method to output the number of columns with each specific data type:

In [10]:
def shorten(col):
    return (
        str(col)
        .replace("facebook_likes", "fb")
        .replace("_for_reviews", "")
    )

movies.rename(columns=shorten, inplace=True)

In [11]:
# dtypes counts
movies.dtypes.value_counts()

float64    13
object     12
int64       3
dtype: int64
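
The renaming step can be tried on a small frame without touching the original. This is a sketch under stated assumptions: `toy` is a hypothetical three-column stand-in, and the non-`inplace` form of `.rename` is used here only so the original frame stays available for comparison.

```python
import pandas as pd

def shorten(col):
    # Same replacements as the shorten function above
    return str(col).replace("facebook_likes", "fb").replace("_for_reviews", "")

# Hypothetical stand-in with the long column names
toy = pd.DataFrame({
    "director_facebook_likes": [0.0, 563.0],
    "num_critic_for_reviews": [723.0, 302.0],
    "num_voted_users": [886204, 471220],
})

# Passing a function to `columns` applies it to every column label
renamed = toy.rename(columns=shorten)
print(list(renamed.columns))  # ['director_fb', 'num_critic', 'num_voted_users']

# Count how many columns share each data type
print(renamed.dtypes.value_counts())
```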

Use the `.select_dtypes` method to select only the integer columns:

In [16]:
movies.select_dtypes(include='int').head(3)

Unnamed: 0,num_voted_users,cast_total_fb,movie_fb
0,886204,4834,33000
1,471220,48350,0
2,275868,11700,85000


If you would like to select all the numeric columns, pass the string `'number'` to the `include` parameter:

In [17]:
movies.select_dtypes(include='number').head()

Unnamed: 0,num_critic,duration,director_fb,actor_3_fb,actor_1_fb,gross,num_voted_users,cast_total_fb,facenumber_in_poster,num_user,budget,title_year,actor_2_fb,imdb_score,aspect_ratio,movie_fb
0,723.0,178.0,0.0,855.0,1000.0,760505847.0,886204,4834,0.0,3054.0,237000000.0,2009.0,936.0,7.9,1.78,33000
1,302.0,169.0,563.0,1000.0,40000.0,309404152.0,471220,48350,0.0,1238.0,300000000.0,2007.0,5000.0,7.1,2.35,0
2,602.0,148.0,0.0,161.0,11000.0,200074175.0,275868,11700,1.0,994.0,245000000.0,2015.0,393.0,6.8,2.35,85000
3,813.0,164.0,22000.0,23000.0,27000.0,448130642.0,1144337,106759,0.0,2701.0,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,,131.0,,131.0,,8,143,0.0,,,,12.0,7.1,,0


If we wanted integer and string columns we could do the following:

In [18]:
movies.select_dtypes(include=['int', 'object']).head(3)

Unnamed: 0,color,director_name,actor_2_name,genres,actor_1_name,movie_title,num_voted_users,cast_total_fb,actor_3_name,plot_keywords,movie_imdb_link,language,country,content_rating,movie_fb
0,Color,James Cameron,Joel David Moore,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,English,USA,PG-13,33000
1,Color,Gore Verbinski,Orlando Bloom,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,English,USA,PG-13,0
2,Color,Sam Mendes,Rory Kinnear,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,English,UK,PG-13,85000


To exclude only the floating-point columns, do the following:

In [20]:
movies.select_dtypes(exclude='float').head(3)


Unnamed: 0,color,director_name,actor_2_name,genres,actor_1_name,movie_title,num_voted_users,cast_total_fb,actor_3_name,plot_keywords,movie_imdb_link,language,country,content_rating,movie_fb
0,Color,James Cameron,Joel David Moore,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,English,USA,PG-13,33000
1,Color,Gore Verbinski,Orlando Bloom,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,English,USA,PG-13,0
2,Color,Sam Mendes,Rory Kinnear,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,English,UK,PG-13,85000
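
The `include` and `exclude` options can be compared side by side on a small frame. This is a minimal sketch: `toy` is a hypothetical stand-in, and the explicit `'int64'` is an assumption made here because the meaning of `'int'` may depend on the platform's default integer size.

```python
import pandas as pd

# Hypothetical stand-in with one string, one integer, and one float column
toy = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre"],
    "num_voted_users": [886204, 275868],
    "imdb_score": [7.9, 6.8],
})

ints = toy.select_dtypes(include="int64")       # integer columns only
numbers = toy.select_dtypes(include="number")   # integers and floats
not_float = toy.select_dtypes(exclude="float")  # everything except floats

print(list(ints.columns))       # ['num_voted_users']
print(list(numbers.columns))    # ['num_voted_users', 'imdb_score']
print(list(not_float.columns))  # ['movie_title', 'num_voted_users']
```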


An alternative way to select columns is the `.filter` method. It is flexible and searches column names (or index labels) according to which parameter is used. Here, we use the `like` parameter, which checks for substrings in column names, to find all the Facebook columns, that is, every name that contains the string `fb`:

In [21]:
movies.filter(like='fb').head()

Unnamed: 0,director_fb,actor_3_fb,actor_1_fb,cast_total_fb,actor_2_fb,movie_fb
0,0.0,855.0,1000.0,4834,936.0,33000
1,563.0,1000.0,40000.0,48350,5000.0,0
2,0.0,161.0,11000.0,11700,393.0,85000
3,22000.0,23000.0,27000.0,106759,23000.0,164000
4,131.0,,131.0,143,12.0,0


In [22]:
cols

['director_name', 'actor_1_name', 'actor_2_name', 'actor_3_name']

In [23]:
movies.filter(items=cols).head(3)

Unnamed: 0,director_name,actor_1_name,actor_2_name,actor_3_name
0,James Cameron,CCH Pounder,Joel David Moore,Wes Studi
1,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport
2,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman


The `.filter` method also allows columns to be searched with regular expressions through the `regex` parameter. Here, we search for all columns that have a digit somewhere in their name:

In [25]:
movies.filter(regex=r'\d').head(3)

Unnamed: 0,actor_3_fb,actor_2_name,actor_1_fb,actor_1_name,actor_3_name,actor_2_fb
0,855.0,Joel David Moore,1000.0,CCH Pounder,Wes Studi,936.0
1,1000.0,Orlando Bloom,40000.0,Johnny Depp,Jack Davenport,5000.0
2,161.0,Rory Kinnear,11000.0,Christoph Waltz,Stephanie Sigman,393.0
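
The three `.filter` parameters used above can be compared on a small frame. This is a sketch, with `toy` and the label `no_such_column` as hypothetical names introduced for illustration. One behaviour worth knowing is that `items` skips labels that are not present, whereas the indexing operator would raise a `KeyError` for them.

```python
import pandas as pd

# Hypothetical one-row stand-in with shortened column names
toy = pd.DataFrame({
    "director_name": ["James Cameron"],
    "actor_1_name": ["CCH Pounder"],
    "actor_1_fb": [1000.0],
    "movie_fb": [33000],
})

by_substring = toy.filter(like="fb")   # names containing 'fb'
by_regex = toy.filter(regex=r"\d")     # names containing a digit
# 'no_such_column' is a made-up label; filter ignores it silently
by_items = toy.filter(items=["director_name", "no_such_column"])

print(list(by_substring.columns))  # ['actor_1_fb', 'movie_fb']
print(list(by_regex.columns))      # ['actor_1_name', 'actor_1_fb']
print(list(by_items.columns))      # ['director_name']
```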


## Ordering Column Names
**Guideline:**

- Classify each column as either categorical or continuous
- Group common columns within the categorical and continuous columns
- Place the most important groups of columns first, with categorical columns before continuous ones

In [26]:
# Output all the column names and scan for similar categorical and continuous
# columns
movies.columns

Index(['color', 'director_name', 'num_critic', 'duration', 'director_fb',
       'actor_3_fb', 'actor_2_name', 'actor_1_fb', 'gross', 'genres',
       'actor_1_name', 'movie_title', 'num_voted_users', 'cast_total_fb',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user', 'language', 'country', 'content_rating',
       'budget', 'title_year', 'actor_2_fb', 'imdb_score', 'aspect_ratio',
       'movie_fb'],
      dtype='object')

In [28]:
cat_core = ["movie_title", "title_year", "content_rating", "genres",]
movies[cat_core].head(1)

Unnamed: 0,movie_title,title_year,content_rating,genres
0,Avatar,2009.0,PG-13,Action|Adventure|Fantasy|Sci-Fi


In [29]:
cat_people = ["director_name", "actor_1_name", "actor_2_name", "actor_3_name"]
movies[cat_people].head(1)

Unnamed: 0,director_name,actor_1_name,actor_2_name,actor_3_name
0,James Cameron,CCH Pounder,Joel David Moore,Wes Studi


In [30]:
cat_other = ["color", "country", "language", "plot_keywords", "movie_imdb_link",]
movies[cat_other].head(1)

Unnamed: 0,color,country,language,plot_keywords,movie_imdb_link
0,Color,USA,English,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...


In [31]:
cont_fb = ["director_fb", "actor_1_fb", "actor_2_fb", "actor_3_fb",
           "cast_total_fb", "movie_fb",]
movies[cont_fb].head(3)

Unnamed: 0,director_fb,actor_1_fb,actor_2_fb,actor_3_fb,cast_total_fb,movie_fb
0,0.0,1000.0,936.0,855.0,4834,33000
1,563.0,40000.0,5000.0,1000.0,48350,0
2,0.0,11000.0,393.0,161.0,11700,85000


In [32]:
cont_finance = ["budget", "gross"]
movies[cont_finance].head(3)

Unnamed: 0,budget,gross
0,237000000.0,760505847.0
1,300000000.0,309404152.0
2,245000000.0,200074175.0


In [33]:
cont_num_reviews = ["num_voted_users", "num_user", "num_critic",]
movies[cont_num_reviews].head(3)

Unnamed: 0,num_voted_users,num_user,num_critic
0,886204,3054.0,723.0
1,471220,1238.0,302.0
2,275868,994.0,602.0


In [34]:
cont_other = ["imdb_score", "duration", "aspect_ratio", "facenumber_in_poster",]
movies[cont_other].head(3)

Unnamed: 0,imdb_score,duration,aspect_ratio,facenumber_in_poster
0,7.9,178.0,1.78,0.0
1,7.1,169.0,2.35,0.0
2,6.8,148.0,2.35,1.0


Concatenate all the lists together to get the final column order. Also, ensure that this
list contains all the columns from the original:

In [36]:
new_col_order = (
    cat_core
    + cat_people
    + cat_other
    + cont_fb
    + cont_finance
    + cont_num_reviews
    + cont_other

)
set(movies.columns) == set(new_col_order)

True
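
The set comparison confirms that no column was forgotten, but a set discards duplicates, so a column listed twice would still pass. A minimal sketch of an extra length check follows; the short lists are hypothetical stand-ins for the real groups.

```python
# Hypothetical stand-ins for the original columns and the grouped lists
original = ["movie_title", "budget", "gross", "director_name"]
cat_cols = ["movie_title", "director_name"]
cont_cols = ["budget", "gross"]
new_order = cat_cols + cont_cols

same_labels = set(original) == set(new_order)
no_duplicates = len(new_order) == len(set(new_order))
print(same_labels and no_duplicates)  # True

# A list that repeats a column still passes the set comparison alone
with_repeat = new_order + ["gross"]
repeat_same_labels = set(original) == set(with_repeat)
repeat_no_duplicates = len(with_repeat) == len(set(with_repeat))
print(repeat_same_labels)    # True
print(repeat_no_duplicates)  # False
```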

Pass the list with the new column order to the indexing operator of the DataFrame to
reorder the columns:

In [37]:
movies[new_col_order].head()

Unnamed: 0,movie_title,title_year,content_rating,genres,director_name,actor_1_name,actor_2_name,actor_3_name,color,country,language,plot_keywords,movie_imdb_link,director_fb,actor_1_fb,actor_2_fb,actor_3_fb,cast_total_fb,movie_fb,budget,gross,num_voted_users,num_user,num_critic,imdb_score,duration,aspect_ratio,facenumber_in_poster
0,Avatar,2009.0,PG-13,Action|Adventure|Fantasy|Sci-Fi,James Cameron,CCH Pounder,Joel David Moore,Wes Studi,Color,USA,English,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,0.0,1000.0,936.0,855.0,4834,33000,237000000.0,760505847.0,886204,3054.0,723.0,7.9,178.0,1.78,0.0
1,Pirates of the Caribbean: At World's End,2007.0,PG-13,Action|Adventure|Fantasy,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport,Color,USA,English,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,563.0,40000.0,5000.0,1000.0,48350,0,300000000.0,309404152.0,471220,1238.0,302.0,7.1,169.0,2.35,0.0
2,Spectre,2015.0,PG-13,Action|Adventure|Thriller,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Color,UK,English,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,0.0,11000.0,393.0,161.0,11700,85000,245000000.0,200074175.0,275868,994.0,602.0,6.8,148.0,2.35,1.0
3,The Dark Knight Rises,2012.0,PG-13,Action|Thriller,Christopher Nolan,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Color,USA,English,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,22000.0,27000.0,23000.0,23000.0,106759,164000,250000000.0,448130642.0,1144337,2701.0,813.0,8.5,164.0,2.35,0.0
4,Star Wars: Episode VII - The Force Awakens,,,Documentary,Doug Walker,Doug Walker,Rob Walker,,,,,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,131.0,131.0,12.0,,143,0,,,8,,,7.1,,,0.0


## Summarizing a DataFrame

In [39]:
movies.size

137648

In [40]:
movies.ndim

2

In [41]:
len(movies)

4916

In [42]:
movies.count()

color                   4897
director_name           4814
num_critic              4867
duration                4901
director_fb             4814
actor_3_fb              4893
actor_2_name            4903
actor_1_fb              4909
gross                   4054
genres                  4916
actor_1_name            4909
movie_title             4916
num_voted_users         4916
cast_total_fb           4916
actor_3_name            4893
facenumber_in_poster    4903
plot_keywords           4764
movie_imdb_link         4916
num_user                4895
language                4904
country                 4911
content_rating          4616
budget                  4432
title_year              4810
actor_2_fb              4903
imdb_score              4916
aspect_ratio            4590
movie_fb                4916
dtype: int64
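
These attributes are related: `.size` is the number of rows times the number of columns (above, 4916 × 28 = 137648), while `.count` reports only the non-missing values per column. The sketch below shows the relationship on a hypothetical three-row frame named `toy`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with one missing duration
toy = pd.DataFrame({
    "duration": [178.0, 169.0, np.nan],
    "imdb_score": [7.9, 7.1, 7.1],
})

n_rows, n_cols = toy.shape
print(toy.size == n_rows * n_cols)  # True
print(len(toy) == n_rows)           # True

# Subtracting the non-missing count from the row count gives
# the number of missing values in each column
non_missing = toy.count()
missing = len(toy) - non_missing
print(missing)
```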

In [43]:
movies.min()

num_critic                                                              1
duration                                                                7
director_fb                                                             0
actor_3_fb                                                              0
actor_1_fb                                                              0
gross                                                                 162
genres                                                             Action
movie_title                                                       #Horror
num_voted_users                                                         5
cast_total_fb                                                           0
facenumber_in_poster                                                    0
movie_imdb_link         http://www.imdb.com/title/tt0006864/?ref_=fn_t...
num_user                                                                1
budget                                

In [44]:
movies.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
num_critic,4867.0,137.9889,120.2394,1.0,49.0,108.0,191.0,813.0
duration,4901.0,107.0908,25.28602,7.0,93.0,103.0,118.0,511.0
director_fb,4814.0,691.0145,2832.954,0.0,7.0,48.0,189.75,23000.0
actor_3_fb,4893.0,631.2763,1625.875,0.0,132.0,366.0,633.0,23000.0
actor_1_fb,4909.0,6494.488,15106.99,0.0,607.0,982.0,11000.0,640000.0
gross,4054.0,47644510.0,67372550.0,162.0,5019656.25,25043962.0,61108412.75,760505800.0
num_voted_users,4916.0,82644.92,138322.2,5.0,8361.75,33132.5,93772.75,1689764.0
cast_total_fb,4916.0,9579.816,18164.32,0.0,1394.75,3049.0,13616.75,656730.0
facenumber_in_poster,4903.0,1.37732,2.023826,0.0,0.0,1.0,2.0,43.0
num_user,4895.0,267.6688,372.9348,1.0,64.0,153.0,320.5,5060.0


In [45]:
movies.describe(percentiles=[.01, .3, .99]).T

Unnamed: 0,count,mean,std,min,1%,30%,50%,99%,max
num_critic,4867.0,137.9889,120.2394,1.0,2.0,60.0,108.0,546.68,813.0
duration,4901.0,107.0908,25.28602,7.0,43.0,95.0,103.0,189.0,511.0
director_fb,4814.0,691.0145,2832.954,0.0,0.0,11.0,48.0,16000.0,23000.0
actor_3_fb,4893.0,631.2763,1625.875,0.0,0.0,176.0,366.0,11000.0,23000.0
actor_1_fb,4909.0,6494.488,15106.99,0.0,6.08,694.0,982.0,44920.0,640000.0
gross,4054.0,47644510.0,67372550.0,162.0,8474.8,7914068.6,25043962.0,326412800.0,760505800.0
num_voted_users,4916.0,82644.92,138322.2,5.0,53.0,11864.5,33132.5,681584.6,1689764.0
cast_total_fb,4916.0,9579.816,18164.32,0.0,6.0,1684.5,3049.0,62413.9,656730.0
facenumber_in_poster,4903.0,1.37732,2.023826,0.0,0.0,0.0,1.0,8.0,43.0
num_user,4895.0,267.6688,372.9348,1.0,1.94,80.0,153.0,1999.24,5060.0
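
The `percentiles` parameter takes fractions between 0 and 1, and each one becomes a labelled column once the result is transposed. In the output above, a 50% column also appears even though it was not requested. The following sketch uses a hypothetical one-column frame named `toy`.

```python
import pandas as pd

# Hypothetical stand-in using the first four durations shown earlier
toy = pd.DataFrame({"duration": [178.0, 169.0, 148.0, 164.0]})

# Request the 1st, 30th, and 99th percentiles, then transpose so that
# each statistic becomes a column
summary = toy.describe(percentiles=[0.01, 0.3, 0.99]).T
print(list(summary.columns))
print(summary.loc["duration", "count"])  # 4.0
```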


In [46]:
movies.min(skipna=False)

num_critic                                                            NaN
duration                                                              NaN
director_fb                                                           NaN
actor_3_fb                                                            NaN
actor_1_fb                                                            NaN
gross                                                                 NaN
genres                                                             Action
movie_title                                                       #Horror
num_voted_users                                                         5
cast_total_fb                                                           0
facenumber_in_poster                                                  NaN
movie_imdb_link         http://www.imdb.com/title/tt0006864/?ref_=fn_t...
num_user                                                              NaN
budget                                
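
The effect of `skipna` is easier to see on a small numeric frame. This is a minimal sketch: `toy` is a hypothetical stand-in, and it holds only numeric columns so the example does not depend on how a given pandas version treats string columns in `.min`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: one column with a missing value, one without
toy = pd.DataFrame({
    "duration": [178.0, np.nan, 148.0],
    "num_voted_users": [886204, 8, 275868],
})

default_min = toy.min()             # missing values are skipped
strict_min = toy.min(skipna=False)  # any missing value makes the result NaN

print(default_min["duration"])         # 148.0
print(strict_min["duration"])          # nan
print(strict_min["num_voted_users"])   # unaffected: no missing values here
```

Only columns that contain at least one missing value change, which matches the output above, where `num_voted_users` and `cast_total_fb` keep their minimums.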