This notebook looks at pandas DataFrame operations.  
In particular, it focuses on columns, which a Series does not have.

# Selecting multiple columns  
When you need to select several columns at once, you can use any of the following approaches.  
- A list of column names
- loc  
- iloc  
- select_dtypes  
- filter  

In [1]:
import pandas as pd
import numpy as np

In [2]:
movies = pd.read_csv('movie.csv')

# Using column names directly  
One option is to list the names of the columns you need and retrieve them directly.

In [3]:
movie_people_names=[
    'actor_1_name',
    'actor_2_name',
    'actor_3_name',
    'director_name'
]

First, list the names of the columns you want in a Python list.  
Then pass that list to the indexing operator.

In [4]:
movies[movie_people_names].head()

Unnamed: 0,actor_1_name,actor_2_name,actor_3_name,director_name
0,CCH Pounder,Joel David Moore,Wes Studi,James Cameron
1,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski
2,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes
3,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan
4,Doug Walker,Rob Walker,,Doug Walker


Selecting by name also works with loc.  
Let's retrieve the same columns using loc.

In [5]:
movies.loc[:,movie_people_names].head()

Unnamed: 0,actor_1_name,actor_2_name,actor_3_name,director_name
0,CCH Pounder,Joel David Moore,Wes Studi,James Cameron
1,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski
2,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes
3,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan
4,Doug Walker,Rob Walker,,Doug Walker


When selecting a single column, the type of the key decides whether the result is a DataFrame or a Series.  
For example, to select actor_1_name, pass a list to get a DataFrame or a str to get a Series.  
(The same rule applies when using loc.)

In [6]:
movies['actor_1_name'].head() #series

0        CCH Pounder
1        Johnny Depp
2    Christoph Waltz
3          Tom Hardy
4        Doug Walker
Name: actor_1_name, dtype: object

### To keep the result as a DataFrame, pass the column name inside a list.  
### The column name is then carried along as the header.

In [9]:
movies[['actor_1_name']].head()  # DataFrame, because a list is passed

Unnamed: 0,actor_1_name
0,CCH Pounder
1,Johnny Depp
2,Christoph Waltz
3,Tom Hardy
4,Doug Walker
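
The rule above (str key gives a Series, list key gives a DataFrame) can be checked for both plain indexing and loc. The sketch below uses a small hypothetical frame named `toy` rather than movie.csv, so it runs on its own.

```python
import pandas as pd

# Hypothetical two-row stand-in for the movie data (illustration only).
toy = pd.DataFrame({
    'actor_1_name': ['CCH Pounder', 'Johnny Depp'],
    'director_name': ['James Cameron', 'Gore Verbinski'],
})

as_series = toy['actor_1_name']             # str key  -> Series
as_frame = toy[['actor_1_name']]            # list key -> DataFrame
loc_series = toy.loc[:, 'actor_1_name']     # same rule with loc
loc_frame = toy.loc[:, ['actor_1_name']]

# Print the type and shape of each result to see the difference.
for obj in (as_series, as_frame, loc_series, loc_frame):
    print(type(obj).__name__, obj.shape)
```

A Series has a one-element shape such as `(2,)`, while the single-column DataFrame keeps two dimensions, `(2, 1)`.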


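# Selecting columns with methods  
Let's start with rename.  
New column names can be supplied as a dictionary (for example, one named col_map) that maps old names to new ones.  
Here, instead of a dictionary, we first tidy up the column names with a function.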

In [10]:
movies.loc[:,'actor_1_name'].value_counts().head()

Robert De Niro    48
Johnny Depp       36
Nicolas Cage      32
J.K. Simmons      29
Matt Damon        29
Name: actor_1_name, dtype: int64

In [14]:
pd.Series(movies.columns).head()

0                      color
1              director_name
2     num_critic_for_reviews
3                   duration
4    director_facebook_likes
dtype: object

The function replaces facebook_likes with fb  
and replaces _for_review with an empty string, which removes it.

In [15]:
def shorten(col):
    # Shorten verbose column names: 'facebook_likes' -> 'fb', drop '_for_review'.
    return (
        str(col)
        .replace('facebook_likes', 'fb')
        .replace('_for_review', '')
    )

movies = movies.rename(columns=shorten)
pd.Series(movies.columns).head()

0            color
1    director_name
2      num_critics
3         duration
4      director_fb
dtype: object
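
For comparison, the dictionary form of rename mentioned earlier might look like the sketch below. The frame `toy` and the mapping `col_map` are hypothetical; only the two column names are taken from the movie dataset.

```python
import pandas as pd

# Hypothetical two-column frame; the column names mirror the movie dataset.
toy = pd.DataFrame({
    'director_facebook_likes': [0.0, 563.0],
    'num_critic_for_reviews': [723.0, 302.0],
})

# col_map is an illustrative name: keys are old names, values are new names.
col_map = {
    'director_facebook_likes': 'director_fb',
    'num_critic_for_reviews': 'num_critics',
}

# rename returns a new DataFrame; toy itself is left unchanged.
renamed = toy.rename(columns=col_map)
print(list(renamed.columns))
```

A dictionary is convenient when only a few names change; a function scales better when the same rule applies to many columns.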

In [17]:
movies.dtypes.value_counts()

float64    13
object     12
int64       3
dtype: int64

# select_dtypes
Now let's use the select_dtypes method.  
### It returns only the columns of the data types you ask for.
The include parameter specifies which data types to keep; pass a list to select several types.  
To select only numeric columns, use 'number'; to drop columns of a given type instead, name that type in exclude.

In [18]:
movies.select_dtypes(include='int64').head()

Unnamed: 0,num_voted_users,cast_total_fb,movie_fb
0,886204,4834,33000
1,471220,48350,0
2,275868,11700,85000
3,1144337,106759,164000
4,8,143,0


In [19]:
movies.select_dtypes(include=['object','int64'])

Unnamed: 0,color,director_name,actor_2_name,genres,actor_1_name,movie_title,num_voted_users,cast_total_fb,actor_3_name,plot_keywords,movie_imdb_link,language,country,content_rating,movie_fb
0,Color,James Cameron,Joel David Moore,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,English,USA,PG-13,33000
1,Color,Gore Verbinski,Orlando Bloom,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,English,USA,PG-13,0
2,Color,Sam Mendes,Rory Kinnear,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,English,UK,PG-13,85000
3,Color,Christopher Nolan,Christian Bale,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,Joseph Gordon-Levitt,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,English,USA,PG-13,164000
4,,Doug Walker,Rob Walker,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens,8,143,,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,Color,Scott Smith,Daphne Zuniga,Comedy|Drama,Eric Mabius,Signed Sealed Delivered,629,2283,Crystal Lowe,fraud|postal worker|prison|theft|trial,http://www.imdb.com/title/tt3000844/?ref_=fn_t...,English,Canada,,84
4912,Color,,Valorie Curry,Crime|Drama|Mystery|Thriller,Natalie Zea,The Following,73839,1753,Sam Underwood,cult|fbi|hideout|prison escape|serial killer,http://www.imdb.com/title/tt2071645/?ref_=fn_t...,English,USA,TV-14,32000
4913,Color,Benjamin Roberds,Maxwell Moody,Drama|Horror|Thriller,Eva Boehnke,A Plague So Pleasant,38,0,David Chandler,,http://www.imdb.com/title/tt2107644/?ref_=fn_t...,English,USA,,16
4914,Color,Daniel Hsia,Daniel Henney,Comedy|Drama|Romance,Alan Ruck,Shanghai Calling,1255,2386,Eliza Coupe,,http://www.imdb.com/title/tt2070597/?ref_=fn_t...,English,USA,PG-13,660


In [20]:
movies.select_dtypes(include='number').head()

Unnamed: 0,num_critics,duration,director_fb,actor_3_fb,actor_1_fb,gross,num_voted_users,cast_total_fb,facenumber_in_poster,num_users,budget,title_year,actor_2_fb,imdb_score,aspect_ratio,movie_fb
0,723.0,178.0,0.0,855.0,1000.0,760505847.0,886204,4834,0.0,3054.0,237000000.0,2009.0,936.0,7.9,1.78,33000
1,302.0,169.0,563.0,1000.0,40000.0,309404152.0,471220,48350,0.0,1238.0,300000000.0,2007.0,5000.0,7.1,2.35,0
2,602.0,148.0,0.0,161.0,11000.0,200074175.0,275868,11700,1.0,994.0,245000000.0,2015.0,393.0,6.8,2.35,85000
3,813.0,164.0,22000.0,23000.0,27000.0,448130642.0,1144337,106759,0.0,2701.0,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,,131.0,,131.0,,8,143,0.0,,,,12.0,7.1,,0


To drop the numeric columns and keep the rest (here, the object columns), pass exclude='number'.

In [22]:
movies.select_dtypes(exclude='number').head()

Unnamed: 0,color,director_name,actor_2_name,genres,actor_1_name,movie_title,actor_3_name,plot_keywords,movie_imdb_link,language,country,content_rating
0,Color,James Cameron,Joel David Moore,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,Wes Studi,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,English,USA,PG-13
1,Color,Gore Verbinski,Orlando Bloom,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,Jack Davenport,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,English,USA,PG-13
2,Color,Sam Mendes,Rory Kinnear,Action|Adventure|Thriller,Christoph Waltz,Spectre,Stephanie Sigman,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,English,UK,PG-13
3,Color,Christopher Nolan,Christian Bale,Action|Thriller,Tom Hardy,The Dark Knight Rises,Joseph Gordon-Levitt,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,English,USA,PG-13
4,,Doug Walker,Rob Walker,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens,,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,


Next, let's apply the filter method.  
Its advantage is that it can be used more flexibly than the other approaches.


In [23]:
cols=[c for c in movies.columns if 'name' in c]
print('cols : ',cols)

cols :  ['director_name', 'actor_2_name', 'actor_1_name', 'actor_3_name']


In [24]:
movies.filter(items=cols).head()

Unnamed: 0,director_name,actor_2_name,actor_1_name,actor_3_name
0,James Cameron,Joel David Moore,CCH Pounder,Wes Studi
1,Gore Verbinski,Orlando Bloom,Johnny Depp,Jack Davenport
2,Sam Mendes,Rory Kinnear,Christoph Waltz,Stephanie Sigman
3,Christopher Nolan,Christian Bale,Tom Hardy,Joseph Gordon-Levitt
4,Doug Walker,Rob Walker,Doug Walker,


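The filter method also accepts a regular expression through regex.  
For background on regular expressions, see the following reference.  
https://www.nextree.co.kr/p4327/  
To select every column whose name contains a digit, write the call as shown below.  
The pattern \d matches a digit, so any column name containing one is kept.  
The r prefix makes the string a raw string literal, so the backslash reaches the regex engine unchanged. This pattern happens to work without it, but using r for regex patterns is a good habit.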


In [25]:
movies.filter(regex=r'\d')

Unnamed: 0,actor_3_fb,actor_2_name,actor_1_fb,actor_1_name,actor_3_name,actor_2_fb
0,855.0,Joel David Moore,1000.0,CCH Pounder,Wes Studi,936.0
1,1000.0,Orlando Bloom,40000.0,Johnny Depp,Jack Davenport,5000.0
2,161.0,Rory Kinnear,11000.0,Christoph Waltz,Stephanie Sigman,393.0
3,23000.0,Christian Bale,27000.0,Tom Hardy,Joseph Gordon-Levitt,23000.0
4,,Rob Walker,131.0,Doug Walker,,12.0
...,...,...,...,...,...,...
4911,318.0,Daphne Zuniga,637.0,Eric Mabius,Crystal Lowe,470.0
4912,319.0,Valorie Curry,841.0,Natalie Zea,Sam Underwood,593.0
4913,0.0,Maxwell Moody,0.0,Eva Boehnke,David Chandler,0.0
4914,489.0,Daniel Henney,946.0,Alan Ruck,Eliza Coupe,719.0
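
Beyond `\d`, anchors make regex more precise than a plain substring match. The sketch below is a minimal illustration on a hypothetical one-row frame named `toy`; the patterns `^actor` and `_name$` are examples, not part of the original notebook.

```python
import pandas as pd

# Hypothetical one-row frame reusing column names from the movie dataset.
toy = pd.DataFrame(
    [[1000.0, 'CCH Pounder', 'James Cameron', 178.0]],
    columns=['actor_1_fb', 'actor_1_name', 'director_name', 'duration'],
)

has_digit = toy.filter(regex=r'\d')          # name contains a digit
starts_actor = toy.filter(regex=r'^actor')   # name starts with "actor"
ends_name = toy.filter(regex=r'_name$')      # name ends with "_name"

print(list(has_digit.columns))
print(list(starts_actor.columns))
print(list(ends_name.columns))
```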


# Searching with like  
The like parameter lists every column whose name contains actor.


In [143]:
movies.filter(like='actor')

Unnamed: 0,actor_3_fb,actor_2_name,actor_1_fb,actor_1_name,actor_3_name,actor_2_fb
0,855.0,Joel David Moore,1000.0,CCH Pounder,Wes Studi,936.0
1,1000.0,Orlando Bloom,40000.0,Johnny Depp,Jack Davenport,5000.0
2,161.0,Rory Kinnear,11000.0,Christoph Waltz,Stephanie Sigman,393.0
3,23000.0,Christian Bale,27000.0,Tom Hardy,Joseph Gordon-Levitt,23000.0
4,,Rob Walker,131.0,Doug Walker,,12.0
...,...,...,...,...,...,...
4911,318.0,Daphne Zuniga,637.0,Eric Mabius,Crystal Lowe,470.0
4912,319.0,Valorie Curry,841.0,Natalie Zea,Sam Underwood,593.0
4913,0.0,Maxwell Moody,0.0,Eva Boehnke,David Chandler,0.0
4914,489.0,Daniel Henney,946.0,Alan Ruck,Eliza Coupe,719.0


4916 rows x 6 columns  
The filter method accepts items, like, or regex as its argument.  
As the name suggests, like finds columns whose names contain the given text.  
See page 89 of the pandas textbook for details.

In [27]:
movies.filter(like='director')

Unnamed: 0,director_name,director_fb
0,James Cameron,0.0
1,Gore Verbinski,563.0
2,Sam Mendes,0.0
3,Christopher Nolan,22000.0
4,Doug Walker,131.0
...,...,...
4911,Scott Smith,2.0
4912,,
4913,Benjamin Roberds,0.0
4914,Daniel Hsia,0.0


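# Ordering column names  
Sometimes you want to tidy a DataFrame by sorting its column names.
Pages 90-95 of the textbook discuss arranging columns by how meaningful they are.  
Here, we only practice changing the column order.  
Let's start with the cols list built above.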

In [36]:
cols.sort()

In [37]:
names=movies[cols].filter(like='name')
names.head()

Unnamed: 0,actor_1_name,actor_2_name,actor_3_name,director_name
0,CCH Pounder,Joel David Moore,Wes Studi,James Cameron
1,Johnny Depp,Orlando Bloom,Jack Davenport,Gore Verbinski
2,Christoph Waltz,Rory Kinnear,Stephanie Sigman,Sam Mendes
3,Tom Hardy,Christian Bale,Joseph Gordon-Levitt,Christopher Nolan
4,Doug Walker,Rob Walker,,Doug Walker


In [40]:
names=movies[['actor_2_name','director_name']]

In [41]:
names

Unnamed: 0,actor_2_name,director_name
0,Joel David Moore,James Cameron
1,Orlando Bloom,Gore Verbinski
2,Rory Kinnear,Sam Mendes
3,Christian Bale,Christopher Nolan
4,Rob Walker,Doug Walker
...,...,...
4911,Daphne Zuniga,Scott Smith
4912,Valorie Curry,
4913,Maxwell Moody,Benjamin Roberds
4914,Daniel Henney,Daniel Hsia


In [42]:
movies.head()

Unnamed: 0,color,director_name,num_critics,duration,director_fb,actor_3_fb,actor_2_name,actor_1_fb,gross,genres,...,num_users,language,country,content_rating,budget,title_year,actor_2_fb,imdb_score,aspect_ratio,movie_fb
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


Let's use sort to arrange the column names alphabetically.

In [44]:
m_cols=list(movies.columns)
m_cols.sort()
movies[m_cols].head()

Unnamed: 0,actor_1_fb,actor_1_name,actor_2_fb,actor_2_name,actor_3_fb,actor_3_name,aspect_ratio,budget,cast_total_fb,color,...,imdb_score,language,movie_fb,movie_imdb_link,movie_title,num_critics,num_users,num_voted_users,plot_keywords,title_year
0,1000.0,CCH Pounder,936.0,Joel David Moore,855.0,Wes Studi,1.78,237000000.0,4834,Color,...,7.9,English,33000,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,Avatar,723.0,3054.0,886204,avatar|future|marine|native|paraplegic,2009.0
1,40000.0,Johnny Depp,5000.0,Orlando Bloom,1000.0,Jack Davenport,2.35,300000000.0,48350,Color,...,7.1,English,0,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,Pirates of the Caribbean: At World's End,302.0,1238.0,471220,goddess|marriage ceremony|marriage proposal|pi...,2007.0
2,11000.0,Christoph Waltz,393.0,Rory Kinnear,161.0,Stephanie Sigman,2.35,245000000.0,11700,Color,...,6.8,English,85000,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,Spectre,602.0,994.0,275868,bomb|espionage|sequel|spy|terrorist,2015.0
3,27000.0,Tom Hardy,23000.0,Christian Bale,23000.0,Joseph Gordon-Levitt,2.35,250000000.0,106759,Color,...,8.5,English,164000,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,The Dark Knight Rises,813.0,2701.0,1144337,deception|imprisonment|lawlessness|police offi...,2012.0
4,131.0,Doug Walker,12.0,Rob Walker,,,,,143,,...,7.1,,0,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,Star Wars: Episode VII - The Force Awakens,,,8,,


## What if we want to move the last column, title_year, to the front?

In [48]:
movies[[m_cols[-1]]+m_cols[:-1]].head()

Unnamed: 0,title_year,actor_1_fb,actor_1_name,actor_2_fb,actor_2_name,actor_3_fb,actor_3_name,aspect_ratio,budget,cast_total_fb,...,gross,imdb_score,language,movie_fb,movie_imdb_link,movie_title,num_critics,num_users,num_voted_users,plot_keywords
0,2009.0,1000.0,CCH Pounder,936.0,Joel David Moore,855.0,Wes Studi,1.78,237000000.0,4834,...,760505847.0,7.9,English,33000,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,Avatar,723.0,3054.0,886204,avatar|future|marine|native|paraplegic
1,2007.0,40000.0,Johnny Depp,5000.0,Orlando Bloom,1000.0,Jack Davenport,2.35,300000000.0,48350,...,309404152.0,7.1,English,0,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,Pirates of the Caribbean: At World's End,302.0,1238.0,471220,goddess|marriage ceremony|marriage proposal|pi...
2,2015.0,11000.0,Christoph Waltz,393.0,Rory Kinnear,161.0,Stephanie Sigman,2.35,245000000.0,11700,...,200074175.0,6.8,English,85000,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,Spectre,602.0,994.0,275868,bomb|espionage|sequel|spy|terrorist
3,2012.0,27000.0,Tom Hardy,23000.0,Christian Bale,23000.0,Joseph Gordon-Levitt,2.35,250000000.0,106759,...,448130642.0,8.5,English,164000,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,The Dark Knight Rises,813.0,2701.0,1144337,deception|imprisonment|lawlessness|police offi...
4,,131.0,Doug Walker,12.0,Rob Walker,,,,,143,...,,7.1,,0,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,Star Wars: Episode VII - The Force Awakens,,,8,


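## And how do we move a column from the middle?  

Wrap 'color' in a list so that it can be concatenated with another list.  
The comprehension then keeps every column that is not 'color'.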

In [50]:

movies[['color']+[c for c in m_cols if c != 'color']].head()

Unnamed: 0,color,actor_1_fb,actor_1_name,actor_2_fb,actor_2_name,actor_3_fb,actor_3_name,aspect_ratio,budget,cast_total_fb,...,imdb_score,language,movie_fb,movie_imdb_link,movie_title,num_critics,num_users,num_voted_users,plot_keywords,title_year
0,Color,1000.0,CCH Pounder,936.0,Joel David Moore,855.0,Wes Studi,1.78,237000000.0,4834,...,7.9,English,33000,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,Avatar,723.0,3054.0,886204,avatar|future|marine|native|paraplegic,2009.0
1,Color,40000.0,Johnny Depp,5000.0,Orlando Bloom,1000.0,Jack Davenport,2.35,300000000.0,48350,...,7.1,English,0,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,Pirates of the Caribbean: At World's End,302.0,1238.0,471220,goddess|marriage ceremony|marriage proposal|pi...,2007.0
2,Color,11000.0,Christoph Waltz,393.0,Rory Kinnear,161.0,Stephanie Sigman,2.35,245000000.0,11700,...,6.8,English,85000,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,Spectre,602.0,994.0,275868,bomb|espionage|sequel|spy|terrorist,2015.0
3,Color,27000.0,Tom Hardy,23000.0,Christian Bale,23000.0,Joseph Gordon-Levitt,2.35,250000000.0,106759,...,8.5,English,164000,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,The Dark Knight Rises,813.0,2701.0,1144337,deception|imprisonment|lawlessness|police offi...,2012.0
4,,131.0,Doug Walker,12.0,Rob Walker,,,,,143,...,7.1,,0,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,Star Wars: Episode VII - The Force Awakens,,,8,,
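
The list-concatenation trick above can be wrapped in a small reusable function. `move_to_front` is a hypothetical helper name, and `toy` is made-up data; this is only a sketch of the same idea.

```python
import pandas as pd

def move_to_front(df, col):
    """Return df with `col` as the first column (hypothetical helper)."""
    # Keep every other column in its current order.
    rest = [c for c in df.columns if c != col]
    return df[[col] + rest]

# Hypothetical one-row frame with three of the movie column names.
toy = pd.DataFrame({'budget': [1.0], 'color': ['Color'], 'title_year': [2009.0]})
reordered = move_to_front(toy, 'color')
print(list(reordered.columns))
```

Because indexing with a list returns a new DataFrame, the original column order of `toy` is not modified.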


# Summarizing a DataFrame  
Several attributes and functions report basic information about a DataFrame.  
Here we look at shape, size, len, and ndim.

In [51]:
movies

Unnamed: 0,color,director_name,num_critics,duration,director_fb,actor_3_fb,actor_2_name,actor_1_fb,gross,genres,...,num_users,language,country,content_rating,budget,title_year,actor_2_fb,imdb_score,aspect_ratio,movie_fb
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,Color,Scott Smith,1.0,87.0,2.0,318.0,Daphne Zuniga,637.0,,Comedy|Drama,...,6.0,English,Canada,,,2013.0,470.0,7.7,,84
4912,Color,,43.0,43.0,,319.0,Valorie Curry,841.0,,Crime|Drama|Mystery|Thriller,...,359.0,English,USA,TV-14,,,593.0,7.5,16.00,32000
4913,Color,Benjamin Roberds,13.0,76.0,0.0,0.0,Maxwell Moody,0.0,,Drama|Horror|Thriller,...,3.0,English,USA,,1400.0,2013.0,0.0,6.3,,16
4914,Color,Daniel Hsia,14.0,100.0,0.0,489.0,Daniel Henney,946.0,10443.0,Comedy|Drama|Romance,...,9.0,English,USA,PG-13,,2012.0,719.0,6.3,2.35,660


In [52]:
movies.shape  # (number of rows, number of columns)

(4916, 28)

In [53]:
movies.size  # rows x columns: the total number of elements

137648

In [54]:
len(movies)  # number of rows

4916

In [55]:
movies.ndim  # a DataFrame is always 2-dimensional

2
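
The four quantities are related: size is the product of the two entries of shape, and len equals the first entry. A minimal check on a hypothetical frame named `toy`:

```python
import pandas as pd

# Hypothetical 3-row, 2-column frame (illustration only).
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

n_rows, n_cols = toy.shape
print(toy.shape, toy.size, len(toy), toy.ndim)

# size is rows x columns, and len counts rows only.
print(toy.size == n_rows * n_cols, len(toy) == n_rows)

# A single column selected with a str key is a 1-dimensional Series.
print(toy['a'].ndim)
```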

With a Series, we saw methods that reduce the data to a single value:  
count, min, max, mean, median, and std.  
Methods like these are called aggregation functions.  
Applied to a DataFrame, they are computed column by column.  
The result is a Series.

In [56]:
movies.count().head()

color            4897
director_name    4814
num_critics      4867
duration         4901
director_fb      4814
dtype: int64
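
A small sketch of column-wise aggregation, using a hypothetical frame `toy` with one missing value. It also shows `axis=1`, which aggregates across each row instead; that variant is an addition for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in column x.
toy = pd.DataFrame({'x': [1.0, 2.0, np.nan], 'y': [10.0, 20.0, 30.0]})

counts = toy.count()        # non-missing values per column
means = toy.mean()          # per-column mean; NaN is skipped by default
row_max = toy.max(axis=1)   # axis=1 aggregates across each row instead

print(counts)
print(means)
print(row_max)
```

Each result is a Series: the first two are indexed by column name, the last by row label.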

describe is a very important method.  
It summarizes the whole DataFrame in one call, which is very convenient.  
(The T below stands for transpose; it swaps rows and columns.)

In [57]:
movies.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
num_critics,4867.0,137.9889,120.2394,1.0,49.0,108.0,191.0,813.0
duration,4901.0,107.0908,25.28602,7.0,93.0,103.0,118.0,511.0
director_fb,4814.0,691.0145,2832.954,0.0,7.0,48.0,189.75,23000.0
actor_3_fb,4893.0,631.2763,1625.875,0.0,132.0,366.0,633.0,23000.0
actor_1_fb,4909.0,6494.488,15106.99,0.0,607.0,982.0,11000.0,640000.0
gross,4054.0,47644510.0,67372550.0,162.0,5019656.25,25043962.0,61108412.75,760505800.0
num_voted_users,4916.0,82644.92,138322.2,5.0,8361.75,33132.5,93772.75,1689764.0
cast_total_fb,4916.0,9579.816,18164.32,0.0,1394.75,3049.0,13616.75,656730.0
facenumber_in_poster,4903.0,1.37732,2.023826,0.0,0.0,1.0,2.0,43.0
num_users,4895.0,267.6688,372.9348,1.0,64.0,153.0,320.5,5060.0


We used include/exclude earlier with select_dtypes.  
describe accepts them as well.
#### With include, you can restrict the summary to the data types you want.

In [59]:
movies.describe(include='object').T

Unnamed: 0,count,unique,top,freq
color,4897,2,Color,4693
director_name,4814,2397,Steven Spielberg,26
actor_2_name,4903,3030,Morgan Freeman,18
genres,4916,914,Drama,233
actor_1_name,4909,2095,Robert De Niro,48
movie_title,4916,4916,Fight Valley,1
actor_3_name,4893,3519,Steve Coogan,8
plot_keywords,4764,4756,based on novel,4
movie_imdb_link,4916,4916,http://www.imdb.com/title/tt0087277/?ref_=fn_t...,1
language,4904,47,English,4582


# Method chaining  
As with a Series, several methods can be chained on a DataFrame.  
Here we use that to work with null values.

In [89]:
movies=pd.read_csv('movie.csv')
def shorten(col):
    return(col.replace('facebook_likes','fb').replace('_for_reviews',''))
movies=movies.rename(columns=shorten)

axis=0 runs down the rows, so it adds up each column (downward).  

axis=1 runs across the columns, so it adds up each row (to the right).

In [92]:
movies.isna()
# Cells that hold a value are shown as False; missing cells are True.

Unnamed: 0,color,director_name,num_critic,duration,director_fb,actor_3_fb,actor_2_name,actor_1_fb,gross,genres,...,num_user,language,country,content_rating,budget,title_year,actor_2_fb,imdb_score,aspect_ratio,movie_fb
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,True,False,True,True,False,True,False,False,True,False,...,True,True,True,True,True,True,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4911,False,False,False,False,False,False,False,False,True,False,...,False,False,False,True,True,False,False,False,True,False
4912,False,True,False,False,True,False,False,False,True,False,...,False,False,False,False,True,True,False,False,False,False
4913,False,False,False,False,False,False,False,False,True,False,...,False,False,False,True,False,False,False,False,True,False
4914,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False


4916 rows x 28 columns  
Applying sum to this result counts the missing values per column or per row  
(axis=0 or axis=1).  
Below we sum downward (axis=0); the color column, for example, has 19 missing values.

In [94]:
movies.isna().sum(axis=0)

color                    19
director_name           102
num_critic               49
duration                 15
director_fb             102
actor_3_fb               23
actor_2_name             13
actor_1_fb                7
gross                   862
genres                    0
actor_1_name              7
movie_title               0
num_voted_users           0
cast_total_fb             0
actor_3_name             23
facenumber_in_poster     13
plot_keywords           152
movie_imdb_link           0
num_user                 21
language                 12
country                   5
content_rating          300
budget                  484
title_year              106
actor_2_fb               13
imdb_score                0
aspect_ratio            326
movie_fb                  0
dtype: int64

Besides counting per column, we can count NA values per row (axis=1) and sum those counts to get the total number of NA values.

In [98]:
movies.isna().sum(axis=1).sum()

2654
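
The two axes are easier to see on a frame small enough to count by hand. The sketch below uses hypothetical data in `toy`; both routes to the total give the same number.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three missing cells in total.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': [np.nan, np.nan, 6.0],
})

per_column = toy.isna().sum(axis=0)   # sum downward: one count per column
per_row = toy.isna().sum(axis=1)      # sum to the right: one count per row
total = toy.isna().sum().sum()        # add the per-column counts together

print(per_column)
print(per_row)
print(total, per_row.sum())
```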

isnull is an alias of isna: True marks a missing value and False marks a present one.

In [65]:
movies.isnull().sum()

color                    19
director_name           102
num_critic               49
duration                 15
director_fb             102
actor_3_fb               23
actor_2_name             13
actor_1_fb                7
gross                   862
genres                    0
actor_1_name              7
movie_title               0
num_voted_users           0
cast_total_fb             0
actor_3_name             23
facenumber_in_poster     13
plot_keywords           152
movie_imdb_link           0
num_user                 21
language                 12
country                   5
content_rating          300
budget                  484
title_year              106
actor_2_fb               13
imdb_score                0
aspect_ratio            326
movie_fb                  0
dtype: int64

Applying sum once more adds up all the numbers in the resulting Series.  
Because that Series holds the missing-value count of each column, its sum is the total number of missing values in the data.

In [66]:
movies.isnull().sum().sum()

2654

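A quick way to check whether a DataFrame contains any missing values is to use any.  
As the name says, it returns True if at least one value is True.  
It is called twice, first to reduce each column and then to reduce the resulting Series.  
For complete data, isnull would return False everywhere.  
For contrast, Python's built-in all returns True only when every element is True.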

In [67]:
all([True,False,True])

False

# Checking the whole dataset for missing values  

This dataset does contain missing values, as any().any() confirms.

In [69]:
movies.isnull().any().any()

True
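
The two-step reduction can be traced on hypothetical frames: `clean` has no missing values and `dirty` has one. The first any gives one boolean per column; the second collapses those into a single answer.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: one complete, one with a single missing cell.
clean = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
dirty = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})

per_column = dirty.isna().any()             # one boolean per column
dirty_has_na = bool(per_column.any())       # True if any column has a NaN
clean_has_na = bool(clean.isna().any().any())

print(per_column)
print(dirty_has_na, clean_has_na)
```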

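# DataFrame arithmetic  
With a Series, we saw that a scalar used in an operation is broadcast across all elements. What happens when we do the same with a DataFrame?  
  
Here we use a new dataset, college.csv, which describes college enrollment and demographic composition.  
### index_col specifies which column to use as the index, so the index is no longer 0 to N.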

In [99]:
colleges=pd.read_csv('college.csv',index_col='INSTNM')
colleges.head()

Unnamed: 0_level_0,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,424.0,420.0,0.0,4206.0,...,0.0,0.0059,0.0138,0.0656,1,0.7356,0.8284,0.1049,30300,33888.0
University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,570.0,565.0,0.0,11383.0,...,0.0368,0.0179,0.01,0.2607,1,0.346,0.5214,0.2422,39700,21941.5
Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,1.0,291.0,...,0.0,0.0,0.2715,0.4536,1,0.6801,0.7795,0.854,40100,23370.0
University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,595.0,590.0,0.0,5451.0,...,0.0172,0.0332,0.035,0.2146,1,0.3072,0.4596,0.264,45500,24097.0
Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,425.0,430.0,0.0,4811.0,...,0.0098,0.0243,0.0137,0.0892,1,0.7347,0.7554,0.127,26600,33118.5


In [72]:
colleges.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7535 entries, 0 to 7534
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   INSTNM              7535 non-null   object 
 1   CITY                7535 non-null   object 
 2   STABBR              7535 non-null   object 
 3   HBCU                7164 non-null   float64
 4   MENONLY             7164 non-null   float64
 5   WOMENONLY           7164 non-null   float64
 6   RELAFFIL            7535 non-null   int64  
 7   SATVRMID            1185 non-null   float64
 8   SATMTMID            1196 non-null   float64
 9   DISTANCEONLY        7164 non-null   float64
 10  UGDS                6874 non-null   float64
 11  UGDS_WHITE          6874 non-null   float64
 12  UGDS_BLACK          6874 non-null   float64
 13  UGDS_HISP           6874 non-null   float64
 14  UGDS_ASIAN          6874 non-null   float64
 15  UGDS_AIAN           6874 non-null   float64
 16  UGDS_N

In [73]:
colleges+5

TypeError: can only concatenate str (not "int") to str

You cannot add a number to a string.  
The calculation has to differ by column type, and the expression above ignores that.  
Let's try again with this in mind.  
Here we extract only the UGDS_ columns.

In [74]:
colleges=pd.read_csv('college.csv',index_col='INSTNM')

college_ugds=colleges.filter(like='UGDS_')
college_ugds.head()

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.0333,0.9353,0.0055,0.0019,0.0024,0.0019,0.0,0.0059,0.0138
University of Alabama at Birmingham,0.5922,0.26,0.0283,0.0518,0.0022,0.0007,0.0368,0.0179,0.01
Amridge University,0.299,0.4192,0.0069,0.0034,0.0,0.0,0.0,0.0,0.2715
University of Alabama in Huntsville,0.6988,0.1255,0.0382,0.0376,0.0143,0.0002,0.0172,0.0332,0.035
Alabama State University,0.0158,0.9208,0.0121,0.0019,0.001,0.0006,0.0098,0.0243,0.0137


All of the columns appear to be numeric.  
Let's confirm with info.

In [75]:
college_ugds.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7535 entries, Alabama A & M University to Excel Learning Center-San Antonio South
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   UGDS_WHITE  6874 non-null   float64
 1   UGDS_BLACK  6874 non-null   float64
 2   UGDS_HISP   6874 non-null   float64
 3   UGDS_ASIAN  6874 non-null   float64
 4   UGDS_AIAN   6874 non-null   float64
 5   UGDS_NHPI   6874 non-null   float64
 6   UGDS_2MOR   6874 non-null   float64
 7   UGDS_NRA    6874 non-null   float64
 8   UGDS_UNKN   6874 non-null   float64
dtypes: float64(9)
memory usage: 588.7+ KB


In [76]:
college_ugds+1

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,1.0333,1.9353,1.0055,1.0019,1.0024,1.0019,1.0000,1.0059,1.0138
University of Alabama at Birmingham,1.5922,1.2600,1.0283,1.0518,1.0022,1.0007,1.0368,1.0179,1.0100
Amridge University,1.2990,1.4192,1.0069,1.0034,1.0000,1.0000,1.0000,1.0000,1.2715
University of Alabama in Huntsville,1.6988,1.1255,1.0382,1.0376,1.0143,1.0002,1.0172,1.0332,1.0350
Alabama State University,1.0158,1.9208,1.0121,1.0019,1.0010,1.0006,1.0098,1.0243,1.0137
...,...,...,...,...,...,...,...,...,...
SAE Institute of Technology San Francisco,,,,,,,,,
Rasmussen College - Overland Park,,,,,,,,,
National Personal Training Institute of Cleveland,,,,,,,,,
Bay Area Medical Academy - San Jose Satellite Location,,,,,,,,,
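
Another way to make scalar arithmetic safe on a mixed frame is to keep only the numeric columns with select_dtypes before broadcasting. This is an alternative sketch, not the approach used above; the frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type frame: one text column, two numeric columns.
toy = pd.DataFrame({
    'CITY': ['Normal', 'Montgomery'],
    'UGDS_WHITE': [0.0333, np.nan],
    'UGDS_BLACK': [0.9353, 0.4192],
})

# Drop the text column first, then broadcast the scalar over what is left.
numeric = toy.select_dtypes(include='number')
shifted = numeric + 1

print(shifted)
```

Missing values stay missing after the addition, which matches the NaN rows at the bottom of the output above.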


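When a value lies exactly halfway between two candidates (a trailing .5), pandas rounds it to the nearest even one. Python's built-in round, shown below, follows the same rule.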

In [103]:
round(1.5)

2

In [105]:
round(0.5)

0

In [78]:
round(0.5+0.000000001)

1
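
The same behavior can be observed on a pandas Series. The sketch below uses a hypothetical Series `s` of exact halves and then nudges each value by a small amount so that the halves round up.

```python
import pandas as pd

# Hypothetical values that sit exactly halfway between two integers.
s = pd.Series([0.5, 1.5, 2.5])

half_even = s.round()             # halves go to the nearest even integer
nudged = (s + 0.00001).round()    # a small nudge pushes every half upward

print(half_even.tolist())
print(nudged.tolist())
```

The nudge is only safe when it is much smaller than the precision you care about, since it shifts every value and not just the exact halves.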

Let's apply this to the ugds data,  
using a school named Northwest-Shoals Community College.  
Looking it up with filter gives the following.

In [115]:
college_ugds.filter(like='Northwest-Shoals',axis=0).T

INSTNM,Northwest-Shoals Community College
UGDS_WHITE,0.7912
UGDS_BLACK,0.125
UGDS_HISP,0.0339
UGDS_ASIAN,0.0036
UGDS_AIAN,0.0088
UGDS_NHPI,0.0006
UGDS_2MOR,0.0012
UGDS_NRA,0.0033
UGDS_UNKN,0.0324


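We only want two decimal places; round(2) rounds at the third decimal place. Adding a small value first nudges exact halves such as 0.125 upward, so they round to 0.13 rather than to the even neighbor 0.12.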

In [112]:
college_ugds.filter(like='Northwest-Shoals',axis=0).T.add(.0001).round(2)

INSTNM,Northwest-Shoals Community College
UGDS_WHITE,0.79
UGDS_BLACK,0.13
UGDS_HISP,0.03
UGDS_ASIAN,0.00
UGDS_AIAN,0.01
UGDS_NHPI,0.00
UGDS_2MOR,0.00
UGDS_NRA,0.00
UGDS_UNKN,0.03


In [113]:
(college_ugds+0.00001).round(2)

Unnamed: 0_level_0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Alabama A & M University,0.03,0.94,0.01,0.00,0.00,0.0,0.00,0.01,0.01
University of Alabama at Birmingham,0.59,0.26,0.03,0.05,0.00,0.0,0.04,0.02,0.01
Amridge University,0.30,0.42,0.01,0.00,0.00,0.0,0.00,0.00,0.27
University of Alabama in Huntsville,0.70,0.13,0.04,0.04,0.01,0.0,0.02,0.03,0.04
Alabama State University,0.02,0.92,0.01,0.00,0.00,0.0,0.01,0.02,0.01
...,...,...,...,...,...,...,...,...,...
SAE Institute of Technology San Francisco,,,,,,,,,
Rasmussen College - Overland Park,,,,,,,,,
National Personal Training Institute of Cleveland,,,,,,,,,
Bay Area Medical Academy - San Jose Satellite Location,,,,,,,,,


# NA values cannot be compared  
An NA value does not compare equal even to another NA value.  

In [85]:
print(np.nan !=np.nan)
print(np.nan>1)
print(np.nan<1)
print(np.nan !=1)

True
False
False
True


Only the not-equal (!=) comparisons are True; every other comparison is False.  
So comparing Series or DataFrames without handling null values first causes problems.  
See below.  

In [86]:
import pandas as pd
a = pd.Series([1, 2, 3])
b = pd.Series([1, 2, 3])
c = pd.Series([1, np.nan, 3])
d = pd.Series([1, np.nan, 3])


In [119]:
# a == b holds at every position
print(a == b)
print()
# Look closely at index 1.
print(c == d)


0    True
1    True
2    True
dtype: bool

0     True
1    False
2     True
dtype: bool


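The equals method checks whether two objects hold the same elements; NaN values in matching positions count as equal, so it returns True here as well.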

In [120]:
print(a.equals(b))
print(c.equals(d))

True
True


# HOMEWORK
1. Select only the columns of the movies DataFrame whose names contain the word 'movie'. Consider both a for ~ if list comprehension and the filter method.
2. Find the column of the movies DataFrame with the most NA values. Use the hint below to work out the code.
 - pd.Series([5, 2, 3, 1, 7]).sort_values(ascending=False)  
3. What happens if you change ascending to True in sort_values?

In [121]:
movies.filter(like='movie')

Unnamed: 0,movie_title,movie_imdb_link,movie_fb
0,Avatar,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,33000
1,Pirates of the Caribbean: At World's End,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,0
2,Spectre,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,85000
3,The Dark Knight Rises,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,164000
4,Star Wars: Episode VII - The Force Awakens,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,0
...,...,...,...
4911,Signed Sealed Delivered,http://www.imdb.com/title/tt3000844/?ref_=fn_t...,84
4912,The Following,http://www.imdb.com/title/tt2071645/?ref_=fn_t...,32000
4913,A Plague So Pleasant,http://www.imdb.com/title/tt2107644/?ref_=fn_t...,16
4914,Shanghai Calling,http://www.imdb.com/title/tt2070597/?ref_=fn_t...,660


In [152]:
cols=[c for c in movies.columns if 'movie' in c]
movies[cols]

Unnamed: 0,movie_title,movie_imdb_link,movie_fb
0,Avatar,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,33000
1,Pirates of the Caribbean: At World's End,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,0
2,Spectre,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,85000
3,The Dark Knight Rises,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,164000
4,Star Wars: Episode VII - The Force Awakens,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,0
...,...,...,...
4911,Signed Sealed Delivered,http://www.imdb.com/title/tt3000844/?ref_=fn_t...,84
4912,The Following,http://www.imdb.com/title/tt2071645/?ref_=fn_t...,32000
4913,A Plague So Pleasant,http://www.imdb.com/title/tt2107644/?ref_=fn_t...,16
4914,Shanghai Calling,http://www.imdb.com/title/tt2070597/?ref_=fn_t...,660


In [139]:
movies.isna().sum(axis=0).sort_values(ascending=False)

gross                   862
budget                  484
aspect_ratio            326
content_rating          300
plot_keywords           152
title_year              106
director_name           102
director_fb             102
num_critic               49
actor_3_name             23
actor_3_fb               23
num_user                 21
color                    19
duration                 15
facenumber_in_poster     13
actor_2_name             13
actor_2_fb               13
language                 12
actor_1_name              7
actor_1_fb                7
country                   5
movie_fb                  0
genres                    0
movie_title               0
num_voted_users           0
movie_imdb_link           0
imdb_score                0
cast_total_fb             0
dtype: int64
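
If only the name of the column with the most missing values is needed, idxmax on the per-column counts is an alternative to sorting. The sketch below uses a hypothetical three-column frame `toy`; it also shows what ascending=True does to the order.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: gross has 2 missing values, budget 1, genres 0.
toy = pd.DataFrame({
    'gross': [np.nan, np.nan, 1.0],
    'budget': [np.nan, 2.0, 3.0],
    'genres': ['Drama', 'Comedy', 'Action'],
})

na_counts = toy.isna().sum()
worst = na_counts.idxmax()                         # label of the largest count
ascending = na_counts.sort_values(ascending=True)  # smallest count first

print(worst)
print(list(ascending.index))
```

With ascending=True the column with the most missing values moves to the end of the sorted Series instead of the start.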