---
# Chapter 2: Essential DataFrame Operations
---

In [1]:
import numpy as np
import pandas as pd

In [3]:
movies = pd.read_csv('./movie.csv')
movies.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens,8,143,,0.0,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,,,,,12.0,7.1,,0


## Selecting multiple DataFrame columns


In [4]:
# Select all the actor and director columns
movie_actor_director = movies[['director_name', 
                               'actor_1_name', 
                               'actor_2_name', 
                               'actor_3_name'
                               ]]
movie_actor_director.head()

Unnamed: 0,director_name,actor_1_name,actor_2_name,actor_3_name
0,James Cameron,CCH Pounder,Joel David Moore,Wes Studi
1,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport
2,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman
3,Christopher Nolan,Tom Hardy,Christian Bale,Joseph Gordon-Levitt
4,Doug Walker,Doug Walker,Rob Walker,


Sometimes only one column of a DataFrame needs to be selected. The indexing
operator can return either a Series or a DataFrame. If we pass in a list
with a single item, we get back a DataFrame. If we pass in just a string with
the column name, we get back a Series:

In [5]:
type(movie_actor_director['director_name'])

pandas.core.series.Series

In [6]:
type(movie_actor_director[['director_name']])

pandas.core.frame.DataFrame

We can also use `.loc` to pull out a column by name. Because this indexer expects a row selector first, we use a colon (`:`) as a slice that selects all of the rows. As with the indexing operator, a string returns a Series and a one-item list returns a DataFrame:

In [7]:
type(movie_actor_director.loc[:, 'director_name'])

pandas.core.series.Series

In [8]:
type(movie_actor_director.loc[:, ['director_name']])

pandas.core.frame.DataFrame
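
The difference between the two return types shows up in their dimensions. The following is a minimal sketch on a small stand-in DataFrame (two rows copied from the output above, not the full `movies` data); the variable names are only illustrative:

```python
import pandas as pd

# Small stand-in for the movie data: two rows copied from the output above.
df = pd.DataFrame({
    "director_name": ["James Cameron", "Gore Verbinski"],
    "actor_1_name": ["CCH Pounder", "Johnny Depp"],
})

as_series = df["director_name"]            # string -> Series
as_frame = df[["director_name"]]           # one-item list -> DataFrame
loc_series = df.loc[:, "director_name"]    # same rule applies with .loc
loc_frame = df.loc[:, ["director_name"]]

print(as_series.ndim, as_series.shape)     # 1 (2,)
print(as_frame.ndim, as_frame.shape)       # 2 (2, 1)
```

A one-column DataFrame keeps its column label and two-dimensional shape, which matters when the result is passed to code that expects a DataFrame.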

Passing a long list inside the indexing operator might cause readability issues. To help with
this, you may save all your column names to a list variable first.

In [9]:
# Create a list that contains the column names
cols = ['director_name', 'actor_1_name', 'actor_2_name', 'actor_3_name']
movie_actor_director = movies[cols]
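
Storing the names in a list also helps avoid a common slip: forgetting the inner brackets. The sketch below uses a small stand-in DataFrame (not the full `movies` data) to show that pandas then treats the comma-separated names as a single tuple key and raises a `KeyError`:

```python
import pandas as pd

# Small stand-in for the movie data.
df = pd.DataFrame({
    "director_name": ["James Cameron", "Gore Verbinski"],
    "actor_1_name": ["CCH Pounder", "Johnny Depp"],
})

cols = ["director_name", "actor_1_name"]
subset = df[cols]  # a list of labels selects those columns

try:
    # Without the inner brackets, Python builds the tuple
    # ('director_name', 'actor_1_name') and pandas looks for one
    # column with that label.
    df["director_name", "actor_1_name"]
except KeyError as err:
    tuple_key_error = err
    print("KeyError:", err)
```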

## Selecting columns with methods

Some DataFrame methods make column selection easier, such as `.select_dtypes`, which selects columns by data type, and `.filter`, which selects columns by name.

With the movie dataset loaded, shorten the column names for display. Then call the `.value_counts` method on `.dtypes` to output the number of columns with each specific data type:

In [10]:
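def shorten(col):
    # Abbreviate long column names so that wide tables are easier to read
    return (
        str(col)
        .replace("facebook_likes", "fb")
        .replace("_for_reviews", "")
    )

# Apply shorten to every column label, modifying movies in place
movies.rename(columns=shorten, inplace=True)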

In [11]:
# dtypes counts
movies.dtypes.value_counts()

float64    13
object     12
int64       3
dtype: int64
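
The `inplace=True` call above changes `movies` itself. As a hedged alternative, `.rename` without `inplace` returns a new DataFrame and leaves the original untouched. The sketch below assumes a small stand-in DataFrame with three of the original column names:

```python
import pandas as pd

def shorten(col):
    # Same renaming rule as above
    return (
        str(col)
        .replace("facebook_likes", "fb")
        .replace("_for_reviews", "")
    )

# Small stand-in with three of the original column names.
df = pd.DataFrame({
    "director_facebook_likes": [0.0, 563.0],
    "num_critic_for_reviews": [723.0, 302.0],
    "num_voted_users": [886204, 471220],
})

renamed = df.rename(columns=shorten)  # new DataFrame; df keeps its names
print(list(renamed.columns))  # ['director_fb', 'num_critic', 'num_voted_users']
print(renamed.dtypes.value_counts())
```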

Use the `.select_dtypes` method to select only the integer columns:

In [16]:
movies.select_dtypes(include='int').head(3)

Unnamed: 0,num_voted_users,cast_total_fb,movie_fb
0,886204,4834,33000
1,471220,48350,0
2,275868,11700,85000


If you would like to select all the numeric columns, you may pass the string `'number'` to the `include` parameter:

In [17]:
movies.select_dtypes(include='number').head()

Unnamed: 0,num_critic,duration,director_fb,actor_3_fb,actor_1_fb,gross,num_voted_users,cast_total_fb,facenumber_in_poster,num_user,budget,title_year,actor_2_fb,imdb_score,aspect_ratio,movie_fb
0,723.0,178.0,0.0,855.0,1000.0,760505847.0,886204,4834,0.0,3054.0,237000000.0,2009.0,936.0,7.9,1.78,33000
1,302.0,169.0,563.0,1000.0,40000.0,309404152.0,471220,48350,0.0,1238.0,300000000.0,2007.0,5000.0,7.1,2.35,0
2,602.0,148.0,0.0,161.0,11000.0,200074175.0,275868,11700,1.0,994.0,245000000.0,2015.0,393.0,6.8,2.35,85000
3,813.0,164.0,22000.0,23000.0,27000.0,448130642.0,1144337,106759,0.0,2701.0,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,,131.0,,131.0,,8,143,0.0,,,,12.0,7.1,,0


If we wanted both integer and string columns (the string columns have the `object` dtype here), we could pass a list:

In [18]:
movies.select_dtypes(include=['int', 'object']).head(3)

Unnamed: 0,color,director_name,actor_2_name,genres,actor_1_name,movie_title,num_voted_users,cast_total_fb,actor_3_name,plot_keywords,movie_imdb_link,language,country,content_rating,movie_fb
0,Color,James Cameron,Joel David Moore,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,English,USA,PG-13,33000
1,Color,Gore Verbinski,Orlando Bloom,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,English,USA,PG-13,0
2,Color,Sam Mendes,Rory Kinnear,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,English,UK,PG-13,85000


To exclude only floating-point columns, do the following:

In [20]:
movies.select_dtypes(exclude='float').head(3)


Unnamed: 0,color,director_name,actor_2_name,genres,actor_1_name,movie_title,num_voted_users,cast_total_fb,actor_3_name,plot_keywords,movie_imdb_link,language,country,content_rating,movie_fb
0,Color,James Cameron,Joel David Moore,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,English,USA,PG-13,33000
1,Color,Gore Verbinski,Orlando Bloom,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,English,USA,PG-13,0
2,Color,Sam Mendes,Rory Kinnear,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,English,UK,PG-13,85000
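
The `include` and `exclude` parameters can also be combined in one call. The following is a minimal sketch on a small stand-in DataFrame (three columns with values taken from the output above); it is meant only to illustrate how the two parameters interact:

```python
import pandas as pd

# Small stand-in: one string, one float, and one integer column.
df = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre"],
    "duration": [178.0, 148.0],
    "num_voted_users": [886204, 275868],
})

numeric = df.select_dtypes(include="number")    # floats and integers
non_float = df.select_dtypes(exclude="float")   # everything except floats
# Numeric columns that are not floats, that is, the integer columns
ints_only = df.select_dtypes(include="number", exclude="float")

print(list(numeric.columns))    # ['duration', 'num_voted_users']
print(list(non_float.columns))  # ['movie_title', 'num_voted_users']
print(list(ints_only.columns))  # ['num_voted_users']
```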


An alternative way to select columns is the `.filter` method. It is flexible: it searches column names by default (or index labels when `axis=0` is passed), and the parameter you use (`like`, `items`, or `regex`) determines how labels are matched. Here, we use the `like` parameter, which checks for substrings, to select all the Facebook columns, that is, every name that contains the exact string `fb`:

In [21]:
movies.filter(like='fb').head()

Unnamed: 0,director_fb,actor_3_fb,actor_1_fb,cast_total_fb,actor_2_fb,movie_fb
0,0.0,855.0,1000.0,4834,936.0,33000
1,563.0,1000.0,40000.0,48350,5000.0,0
2,0.0,161.0,11000.0,11700,393.0,85000
3,22000.0,23000.0,27000.0,106759,23000.0,164000
4,131.0,,131.0,143,12.0,0
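
Because `like` matches a substring anywhere in the label, it can pick up more than intended. The sketch below uses a small stand-in DataFrame indexed by title; the `fbi_flag` column is hypothetical and exists only to show an accidental match. It also shows `axis=0`, which applies the same search to the index labels:

```python
import pandas as pd

# Small stand-in; 'fbi_flag' is a hypothetical column, not part of the dataset.
df = pd.DataFrame(
    {
        "director_fb": [0.0, 0.0],
        "movie_fb": [33000, 85000],
        "fbi_flag": [0, 1],
        "gross": [760505847.0, 200074175.0],
    },
    index=["Avatar", "Spectre"],
)

by_column = df.filter(like="fb")        # substring match on column names
by_row = df.filter(like="Ava", axis=0)  # substring match on index labels

print(list(by_column.columns))  # ['director_fb', 'movie_fb', 'fbi_flag']
print(list(by_row.index))       # ['Avatar']
```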


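The `items` parameter takes a list of exact column names instead. Here, we reuse the `cols` list created earlier:

In [22]: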
cols

['director_name', 'actor_1_name', 'actor_2_name', 'actor_3_name']

In [23]:
movies.filter(items=cols).head(3)

Unnamed: 0,director_name,actor_1_name,actor_2_name,actor_3_name
0,James Cameron,CCH Pounder,Joel David Moore,Wes Studi
1,Gore Verbinski,Johnny Depp,Orlando Bloom,Jack Davenport
2,Sam Mendes,Christoph Waltz,Rory Kinnear,Stephanie Sigman
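
One practical difference from the indexing operator is how missing names are handled: `.filter(items=...)` keeps only the labels that exist, while indexing with a list raises a `KeyError`. The sketch below uses a small stand-in DataFrame; `actor_9_name` is a hypothetical name that is deliberately absent:

```python
import pandas as pd

# Small stand-in for the movie data.
df = pd.DataFrame({
    "director_name": ["James Cameron"],
    "actor_1_name": ["CCH Pounder"],
})

# 'actor_9_name' is hypothetical and does not exist in df
wanted = ["director_name", "actor_9_name"]

kept = df.filter(items=wanted)  # silently drops the missing label
print(list(kept.columns))       # ['director_name']

try:
    df[wanted]                  # the indexing operator is strict
except KeyError as err:
    strict_error = err
    print("KeyError:", err)
```

Which behavior is preferable depends on whether a missing column should be treated as an error.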


The `.filter` method also allows columns to be searched with regular expressions through the `regex` parameter. Here, we search for all columns that have a digit somewhere in their name:

In [25]:
movies.filter(regex=r'\d').head(3)

Unnamed: 0,actor_3_fb,actor_2_name,actor_1_fb,actor_1_name,actor_3_name,actor_2_fb
0,855.0,Joel David Moore,1000.0,CCH Pounder,Wes Studi,936.0
1,1000.0,Orlando Bloom,40000.0,Johnny Depp,Jack Davenport,5000.0
2,161.0,Rory Kinnear,11000.0,Christoph Waltz,Stephanie Sigman,393.0
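
The pattern is searched for anywhere in each label, so anchors such as `^` and `$` are needed to match only the start or end of a name. The following is a minimal sketch on an empty stand-in DataFrame that carries four of the shortened column names; the patterns themselves are illustrative choices:

```python
import pandas as pd

# Empty stand-in that only carries four of the shortened column names.
df = pd.DataFrame(
    columns=["actor_1_name", "actor_1_fb", "director_fb", "title_year"]
)

with_digit = df.filter(regex=r"\d")                  # a digit anywhere
ends_with_fb = df.filter(regex=r"_fb$")              # names ending in _fb
actor_names = df.filter(regex=r"^actor_\d+_name$")   # whole-name match

print(list(with_digit.columns))    # ['actor_1_name', 'actor_1_fb']
print(list(ends_with_fb.columns))  # ['actor_1_fb', 'director_fb']
print(list(actor_names.columns))   # ['actor_1_name']
```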
