# ***Project: Data Wrangling on IMDb Dataset***



***Project Type*** - *Data Wrangling*

*Data Wrangling is the process of transforming and structuring data from one raw form into a desired format with the intent of improving data quality and making it more consumable and useful for analytics or machine learning.*

## A movie producer has shared a list of 3000 movies with their details and needs recommendations, based on a thorough analysis of that data, to help him decide what type of movies to produce and which actors to cast.

## We have to first explore the data and check its sanity.

## Further, we have to answer the following questions:
1. ### <b>Which movie made the highest profit? Who were its producer and director? Identify the actors in that film.</b>
2. ### <b>This data has information about movies made in different languages. Which language has the highest average ROI (return on investment)?</b>
3. ### <b>Find out the unique genres of movies in this dataset.</b>
4. ### <b>Make a table of all the producers and directors of each movie. Find the top 3 producers who have produced movies with the highest average ROI.</b>
5. ### <b>Which actor has acted in the most number of movies?</b>



In [9]:
#Let's get started
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [10]:
# Importing libraries
import numpy as np
import pandas as pd

In [11]:
df = pd.read_csv('/content/drive/MyDrive/Python-Colab/imdb_data.csv')

In [12]:
df.head(1)

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,release_date,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue
0,1,"[{'id': 313576, 'name': 'Hot Tub Time Machine ...",14000000,"[{'id': 35, 'name': 'Comedy'}]",,tt2637294,en,Hot Tub Time Machine 2,"When Lou, who has become the ""father of the In...",6.575393,...,2/20/15,93.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Laws of Space and Time are About to be Vio...,Hot Tub Time Machine 2,"[{'id': 4379, 'name': 'time travel'}, {'id': 9...","[{'cast_id': 4, 'character': 'Lou', 'credit_id...","[{'credit_id': '59ac067c92514107af02c8c8', 'de...",12314651


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   id                     3000 non-null   int64  
 1   belongs_to_collection  604 non-null    object 
 2   budget                 3000 non-null   int64  
 3   genres                 2993 non-null   object 
 4   homepage               946 non-null    object 
 5   imdb_id                3000 non-null   object 
 6   original_language      3000 non-null   object 
 7   original_title         3000 non-null   object 
 8   overview               2992 non-null   object 
 9   popularity             3000 non-null   float64
 10  poster_path            2999 non-null   object 
 11  production_companies   2844 non-null   object 
 12  production_countries   2945 non-null   object 
 13  release_date           3000 non-null   object 
 14  runtime                2998 non-null   float64
 15  spok

In [14]:
df.columns

Index(['id', 'belongs_to_collection', 'budget', 'genres', 'homepage',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'runtime', 'spoken_languages',
       'status', 'tagline', 'title', 'Keywords', 'cast', 'crew', 'revenue'],
      dtype='object')

## After reading all the questions, we subset the columns necessary for answering them in order to get proper insights



In [15]:
# Columns needed to answer the questions
Neces_Col = ['budget','genres','original_language','original_title','cast','crew','revenue']

In [16]:
# select the genres values of all rows for which genres is not null
df.loc[~df['genres'].isna(),'genres']

0                          [{'id': 35, 'name': 'Comedy'}]
1       [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...
2                           [{'id': 18, 'name': 'Drama'}]
3       [{'id': 53, 'name': 'Thriller'}, {'id': 18, 'n...
4       [{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...
                              ...                        
2995    [{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...
2996    [{'id': 18, 'name': 'Drama'}, {'id': 10402, 'n...
2997    [{'id': 80, 'name': 'Crime'}, {'id': 28, 'name...
2998    [{'id': 35, 'name': 'Comedy'}, {'id': 10749, '...
2999    [{'id': 53, 'name': 'Thriller'}, {'id': 28, 'n...
Name: genres, Length: 2993, dtype: object

In [17]:
type(df.loc[0,'cast'])

str

## Converting the string values to proper lists for the columns cast, crew and genres

In [18]:
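# Parse a string holding a Python literal (here a list of dicts) into a real list.
# ast.literal_eval only evaluates literals, unlike eval, which would run arbitrary code.
import ast

def convert_to_list(value):
  return ast.literal_eval(value)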

In [19]:
#applying the above fn only on the non null values in genres, cast & crew column
df.loc[~df['genres'].isna(),'genres']= df.loc[~df['genres'].isna(),'genres'].apply(convert_to_list)
df.loc[~df['crew'].isna(),'crew'] = df.loc[~df['crew'].isna(),'crew'].apply(convert_to_list)

In [20]:
df.loc[~df['cast'].isna(),'cast']= df.loc[~df['cast'].isna(),'cast'].apply(convert_to_list)
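A minimal sketch of the string-to-list conversion on a made-up sample value (the `raw` string is illustrative and only shaped like the `genres` entries shown above). The helper name `parse_literal` is an assumption introduced here; it passes non-string values such as NaN through unchanged, so it could be applied to a whole column without masking first.

```python
import ast

# Illustrative sample shaped like one cell of the 'genres' column (not read from the dataset)
raw = "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]"

def parse_literal(value):
    """Parse a string holding a Python literal; return non-strings (e.g. NaN) unchanged."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    return value

parsed = parse_literal(raw)
names = [item['name'] for item in parsed]
print(type(parsed).__name__, names)
```

Whether every row of the real dataset parses cleanly this way has not been checked here.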

In [21]:
#creating a copy of the original df
cdf = df.copy()

In [22]:
cdf.head(2)

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,release_date,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue
0,1,"[{'id': 313576, 'name': 'Hot Tub Time Machine ...",14000000,"[{'id': 35, 'name': 'Comedy'}]",,tt2637294,en,Hot Tub Time Machine 2,"When Lou, who has become the ""father of the In...",6.575393,...,2/20/15,93.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Laws of Space and Time are About to be Vio...,Hot Tub Time Machine 2,"[{'id': 4379, 'name': 'time travel'}, {'id': 9...","[{'cast_id': 4, 'character': 'Lou', 'credit_id...","[{'credit_id': '59ac067c92514107af02c8c8', 'de...",12314651
1,2,"[{'id': 107674, 'name': 'The Princess Diaries ...",40000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0368933,en,The Princess Diaries 2: Royal Engagement,Mia Thermopolis is now a college graduate and ...,8.248895,...,8/6/04,113.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,It can take a lifetime to find true love; she'...,The Princess Diaries 2: Royal Engagement,"[{'id': 2505, 'name': 'coronation'}, {'id': 42...","[{'cast_id': 1, 'character': 'Mia Thermopolis'...","[{'credit_id': '52fe43fe9251416c7502563d', 'de...",95149435


# Q1. Which movie made the highest profit? Who were its producer and director? Identify the actors in that film.

In [23]:
#checking for sanity in budget columns (outliers,vague values etc)
cdf.describe()

Unnamed: 0,id,budget,popularity,runtime,revenue
count,3000.0,3000.0,3000.0,2998.0,3000.0
mean,1500.5,22531330.0,8.463274,107.856571,66725850.0
std,866.169729,37026090.0,12.104,22.086434,137532300.0
min,1.0,0.0,1e-06,0.0,1.0
25%,750.75,0.0,4.018053,94.0,2379808.0
50%,1500.5,8000000.0,7.374861,104.0,16807070.0
75%,2250.25,29000000.0,10.890983,118.0,68919200.0
max,3000.0,380000000.0,294.337037,338.0,1519558000.0


In [24]:
cdf[cdf['budget']==0].head(2)

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,release_date,runtime,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue
4,5,,0,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",,tt1380152,ko,마린보이,Marine Boy is the story of a former national s...,1.14807,...,2/5/09,118.0,"[{'iso_639_1': 'ko', 'name': '한국어/조선말'}]",Released,,Marine Boy,,"[{'cast_id': 3, 'character': 'Chun-soo', 'cred...","[{'credit_id': '52fe464b9251416c75073b43', 'de...",3923970
7,8,,0,"[{'id': 99, 'name': 'Documentary'}]",,tt0391024,en,Control Room,A chronicle which provides a rare window into ...,1.949044,...,1/15/04,84.0,"[{'iso_639_1': 'ar', 'name': 'العربية'}, {'iso...",Released,Different channels. Different truths.,Control Room,"[{'id': 917, 'name': 'journalism'}, {'id': 163...","[{'cast_id': 2, 'character': 'Himself', 'credi...","[{'credit_id': '52fe47a69251416c750a0daf', 'de...",2586511


In [25]:
cdf['budget'].median()

8000000.0

In [26]:
#To replace extremely low values of budget and revenue column with median values of budget, revenue
cdf.loc[cdf['budget']<1000,'budget']=cdf['budget'].median()

In [27]:
cdf.loc[cdf['revenue']<1000,'revenue']=cdf['revenue'].median()
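Note that the median used above is computed over the whole column, placeholder zeros included. As a hedged variation (not what the notebook does), the sketch below computes the median from the plausible rows only and uses that as the fill value. The toy numbers are made up.

```python
import pandas as pd

# Toy data (made up): zeros stand in for missing budgets
toy = pd.DataFrame({'budget': [0.0, 0.0, 5_000_000.0, 20_000_000.0, 40_000_000.0]})

threshold = 1000                      # same cut-off as used above
too_low = toy['budget'] < threshold   # boolean mask of implausible rows

median_all = toy['budget'].median()                  # placeholders included
median_valid = toy.loc[~too_low, 'budget'].median()  # placeholders excluded

# Fill the implausible rows with the median of the plausible ones
toy.loc[too_low, 'budget'] = median_valid
print(median_all, median_valid)
print(toy['budget'].tolist())
```

Which of the two medians is the better fill value depends on how many placeholder rows there are; this sketch only shows that the two can differ.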

In [28]:
cdf.describe()

Unnamed: 0,id,budget,popularity,runtime,revenue
count,3000.0,3000.0,3000.0,2998.0,3000.0
mean,1500.5,24744670.0,8.463274,107.856571,67045180.0
std,866.169729,35832540.0,12.104,22.086434,137396400.0
min,1.0,2500.0,1e-06,0.0,1404.0
25%,750.75,8000000.0,4.018053,94.0,2947600.0
50%,1500.5,8000000.0,7.374861,104.0,16808730.0
75%,2250.25,29000000.0,10.890983,118.0,68919200.0
max,3000.0,380000000.0,294.337037,338.0,1519558000.0


In [29]:
cdf['genres'].isnull().sum()

7

In [30]:
#To create profit and ROI column
cdf['profit'] = cdf['revenue'] - cdf['budget']
cdf['profit']

0        -1685349
1        55149435
2         9792000
3        14800000
4        -4076030
          ...    
2995     -6403313
2996     -7819410
2997     24456761
2998    129963386
2999     47087155
Name: profit, Length: 3000, dtype: int64

In [31]:
cdf['ROI'] = (cdf['profit']/cdf['budget'])*100
cdf['ROI']

0        -12.038207
1        137.873588
2        296.727273
3       1233.333333
4        -50.950375
           ...     
2995     -80.041412
2996     -97.742625
2997      37.625786
2998     309.436633
2999     134.534729
Name: ROI, Length: 3000, dtype: float64
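The two new columns follow profit = revenue - budget and ROI = profit / budget * 100. As a quick check, the sketch below recomputes both for row 0 using the budget and revenue displayed earlier; the function name `roi_percent` is introduced here only for illustration.

```python
def roi_percent(budget, revenue):
    """Return on investment as a percentage of the budget."""
    profit = revenue - budget
    return profit / budget * 100

# Budget and revenue of row 0 as displayed above
budget, revenue = 14_000_000, 12_314_651
profit = revenue - budget
print(profit, round(roi_percent(budget, revenue), 6))
```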

In [32]:
cdf.head(2)

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue,profit,ROI
0,1,"[{'id': 313576, 'name': 'Hot Tub Time Machine ...",14000000,"[{'id': 35, 'name': 'Comedy'}]",,tt2637294,en,Hot Tub Time Machine 2,"When Lou, who has become the ""father of the In...",6.575393,...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Laws of Space and Time are About to be Vio...,Hot Tub Time Machine 2,"[{'id': 4379, 'name': 'time travel'}, {'id': 9...","[{'cast_id': 4, 'character': 'Lou', 'credit_id...","[{'credit_id': '59ac067c92514107af02c8c8', 'de...",12314651,-1685349,-12.038207
1,2,"[{'id': 107674, 'name': 'The Princess Diaries ...",40000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0368933,en,The Princess Diaries 2: Royal Engagement,Mia Thermopolis is now a college graduate and ...,8.248895,...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,It can take a lifetime to find true love; she'...,The Princess Diaries 2: Royal Engagement,"[{'id': 2505, 'name': 'coronation'}, {'id': 42...","[{'cast_id': 1, 'character': 'Mia Thermopolis'...","[{'credit_id': '52fe43fe9251416c7502563d', 'de...",95149435,55149435,137.873588


In [33]:
#maximum profit
cdf['profit'].max()

1316249360

In [34]:
#To find the row which has the max profit using .idxmax()
#.idxmax()-->> returns the index label of the max value of the column (with the default RangeIndex this equals the row position)
cdf['profit'].idxmax()

1761
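`idxmax()` returns an index label, which is meant for `.loc`. In `cdf` labels and row positions coincide because of the default RangeIndex, but that stops being true once rows are filtered or reordered. A small sketch on a made-up frame with a non-default index shows the difference:

```python
import pandas as pd

# Toy frame (made up) whose index labels do not match the row positions
toy = pd.DataFrame({'profit': [10, 50, 30]}, index=[7, 3, 9])

label = toy['profit'].idxmax()                 # index label of the largest value
position = toy['profit'].to_numpy().argmax()   # 0-based row position of the largest value

print(label, position)
print(toy.loc[label, 'profit'], toy.iloc[position]['profit'])
```

Here the label is 3 while the position is 1, so `toy.iloc[label]` would raise an IndexError on this three-row frame.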

**The movie which made the highest profit is:**

In [35]:
cdf.loc[cdf['profit'].idxmax(),'original_title']

'Furious 7'

In [96]:
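# Keep the row as a Series so that the cells below can look up fields such as 'cast' and 'crew' by name
max_profit_movie = cdf.loc[cdf['profit'].idxmax()]
max_profit_movie.to_frame()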

Unnamed: 0,1761
id,1762
belongs_to_collection,"[{'id': 9485, 'name': 'The Fast and the Furiou..."
budget,190000000
genres,"[{'id': 28, 'name': 'Action'}]"
homepage,http://www.furious7.com/
imdb_id,tt2820852
original_language,en
original_title,Furious 7
overview,Deckard Shaw seeks revenge against Dominic Tor...
popularity,27.275687


In [37]:
max_profit_movie.head(4)

id                                                                    1762
belongs_to_collection    [{'id': 9485, 'name': 'The Fast and the Furiou...
budget                                                           190000000
genres                                      [{'id': 28, 'name': 'Action'}]
Name: 1761, dtype: object

In [38]:
max_profit_movie.loc['cast'][1]['name']

'Paul Walker'

In [39]:
max_profit_movie.loc['cast'][0]['name']

'Vin Diesel'

In [40]:
crew_list = max_profit_movie.loc['crew']

In [41]:
crew_list[0:2]

[{'credit_id': '52fe4cc8c3a36847f823e681',
  'department': 'Production',
  'gender': 2,
  'id': 12835,
  'job': 'Producer',
  'name': 'Vin Diesel',
  'profile_path': '/7rwSXluNWZAluYMOEWBxkPmckES.jpg'},
 {'credit_id': '52fe4cc8c3a36847f823e687',
  'department': 'Production',
  'gender': 2,
  'id': 11874,
  'job': 'Producer',
  'name': 'Neal H. Moritz',
  'profile_path': '/cNcsEYmoS4niCz3UkVAA09dUIob.jpg'}]

### Q1. Names of the director and producers of the movie which made the highest profit:

In [42]:
producers = []
directors = []
for i in crew_list:
  if i['job']=='Producer':
    producers.append(i['name'])
  if i['job']=='Director':
    directors.append(i['name'])

In [43]:
print(f'Producers of the movie: {producers}')
print(f'Director of the movie: {directors}')

Producers of the movie: ['Vin Diesel', 'Neal H. Moritz', 'Michael Fottrell', 'Brandon Birtell']
Director of the movie: ['James Wan']
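The same filtering by job title can be sketched as a small reusable helper. The name `names_with_job` and the sample entries are made up for illustration; only the `'job'` and `'name'` keys come from the crew records shown above.

```python
# Made-up crew entries with the same keys as the records shown above
sample_crew = [
    {'job': 'Producer', 'name': 'Person A'},
    {'job': 'Director', 'name': 'Person B'},
    {'job': 'Editor', 'name': 'Person C'},
    {'name': 'Person D'},  # entry without a 'job' key
]

def names_with_job(crew, job):
    """Return the names of crew members whose 'job' equals the given job title."""
    return [member['name'] for member in crew if member.get('job') == job]

print(names_with_job(sample_crew, 'Producer'))
print(names_with_job(sample_crew, 'Director'))
```

Using `.get('job')` instead of `['job']` means an entry without that key is skipped rather than raising a KeyError.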


In [44]:
cast_list = max_profit_movie['cast']

In [45]:
cast_list[0:2]

[{'cast_id': 17,
  'character': 'Dominic Toretto',
  'credit_id': '5431dfd10e0a265915002c34',
  'gender': 2,
  'id': 12835,
  'name': 'Vin Diesel',
  'order': 0,
  'profile_path': '/7rwSXluNWZAluYMOEWBxkPmckES.jpg'},
 {'cast_id': 19,
  'character': "Brian O'Conner",
  'credit_id': '5431dfe4c3a3681143002b98',
  'gender': 2,
  'id': 8167,
  'name': 'Paul Walker',
  'order': 1,
  'profile_path': '/iqvYezRoEY5k8wnlfHriHQfl5dX.jpg'}]

Actors in the maximum profit movie:


In [46]:
actors = []
for i in cast_list:
  actors.append(i['name'])

In [47]:
#list of actors
print(f'Actors of the movie:')
actors[0:17]

Actors of the movie:


['Vin Diesel',
 'Paul Walker',
 'Dwayne Johnson',
 'Michelle Rodriguez',
 'Tyrese Gibson',
 'Ludacris',
 'Jordana Brewster',
 'Djimon Hounsou',
 'Tony Jaa',
 'Ronda Rousey',
 'Nathalie Emmanuel',
 'Kurt Russell',
 'Jason Statham',
 'Sung Kang',
 'Gal Gadot',
 'Lucas Black',
 'Elsa Pataky']

# Q2. This data has information about movies made in different languages. Which language has the highest average ROI (return on investment)?

In [48]:
# We have already calculated ROI

In [49]:
# Using groupby fn on movie language & ROI, then finding mean
cdf.groupby('original_language')['ROI'].mean().reset_index().sort_values(by='ROI',ascending=False).head(4)

Unnamed: 0,original_language,ROI
18,ko,11309.685605
6,el,5198.013245
28,sr,3261.4136
7,en,952.119148


In [50]:
#Language with highest average roi
cdf.groupby('original_language')['ROI'].mean().reset_index().sort_values(by='ROI',ascending=False).iloc[0]

original_language              ko
ROI                  11309.685605
Name: 18, dtype: object
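A group mean can rest on very few movies, so it may help to show the number of movies next to the average. The sketch below does this on made-up data using named aggregation; the column names `mean_roi` and `n_movies` are chosen here for illustration.

```python
import pandas as pd

# Toy data (made up): one language has a single, extreme film
toy = pd.DataFrame({
    'original_language': ['en', 'en', 'en', 'ko', 'fr', 'fr'],
    'ROI': [100.0, 200.0, 300.0, 5000.0, 50.0, 150.0],
})

summary = (
    toy.groupby('original_language')['ROI']
       .agg(mean_roi='mean', n_movies='count')   # average and group size side by side
       .sort_values('mean_roi', ascending=False)
)
print(summary)
```

In this toy example the top language is backed by a single movie, which the `n_movies` column makes visible.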

# Q3. Find out the unique genres of movies in this dataset.

In [51]:
#considering only those rows in genres column which have no null values
No_NaN_genres = cdf[~cdf['genres'].isna()]
#To show all data: pd.set_option('display.max_rows',None)
No_NaN_genres.head(2)

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue,profit,ROI
0,1,"[{'id': 313576, 'name': 'Hot Tub Time Machine ...",14000000,"[{'id': 35, 'name': 'Comedy'}]",,tt2637294,en,Hot Tub Time Machine 2,"When Lou, who has become the ""father of the In...",6.575393,...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Laws of Space and Time are About to be Vio...,Hot Tub Time Machine 2,"[{'id': 4379, 'name': 'time travel'}, {'id': 9...","[{'cast_id': 4, 'character': 'Lou', 'credit_id...","[{'credit_id': '59ac067c92514107af02c8c8', 'de...",12314651,-1685349,-12.038207
1,2,"[{'id': 107674, 'name': 'The Princess Diaries ...",40000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0368933,en,The Princess Diaries 2: Royal Engagement,Mia Thermopolis is now a college graduate and ...,8.248895,...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,It can take a lifetime to find true love; she'...,The Princess Diaries 2: Royal Engagement,"[{'id': 2505, 'name': 'coronation'}, {'id': 42...","[{'cast_id': 1, 'character': 'Mia Thermopolis'...","[{'credit_id': '52fe43fe9251416c7502563d', 'de...",95149435,55149435,137.873588


In [52]:
#Here we keep only the rows whose genres value is NaN
NaN_genres = cdf[cdf['genres'].isna()]
NaN_genres.head(2)

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue,profit,ROI
470,471,,2000000,,,tt0349159,en,"The Book of Mormon Movie, Volume 1: The Journey",The story of Lehi and his wife Sariah and thei...,0.079856,...,,Released,"2600 years ago, one family began a remarkable ...","The Book of Mormon Movie, Volume 1: The Journey",,"[{'cast_id': 1, 'character': 'Sam', 'credit_id...",,1672730,-327270,-16.3635
1622,1623,,400000,,,tt0261755,en,Jackpot,"Sunny Holiday, an aspiring singing star, aband...",0.218588,...,,Released,,Jackpot,,"[{'cast_id': 4, 'character': '', 'credit_id': ...","[{'credit_id': '52fe4d3c9251416c9110f319', 'de...",43719,-356281,-89.07025


In [53]:
len(No_NaN_genres)

2993

In [54]:
No_NaN_genres.loc[0,'genres']

[{'id': 35, 'name': 'Comedy'}]

In [55]:
No_NaN_genres.loc[0,'genres'][0]

{'id': 35, 'name': 'Comedy'}

In [56]:
#create a list of genre names, using the .iterrows() method to iterate over the rows
# .iterrows() --->> yields (index label, row) pairs, much like enumerate() does for a list; it is one of several ways to loop over a DataFrame
genre_list = []
for i,j in No_NaN_genres.iterrows():
  genre = No_NaN_genres.loc[i,'genres']
  for k in genre:
    genre_list.append(k['name'])


In [57]:
#unique list of genres are:
pd.DataFrame(set(genre_list), columns=['Unique Genres'])

Unnamed: 0,Unique Genres
0,Romance
1,TV Movie
2,Thriller
3,Documentary
4,Family
5,History
6,War
7,Western
8,Science Fiction
9,Action
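The same result can be sketched without an explicit loop over the DataFrame, using a set comprehension over the parsed lists. The sample rows below are made up; sorting the set gives a stable order, which a plain `set` does not guarantee.

```python
# Made-up rows shaped like the parsed 'genres' column; None stands in for a missing value
sample_genres = [
    [{'id': 35, 'name': 'Comedy'}],
    [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}],
    None,
    [{'id': 28, 'name': 'Action'}],
]

unique_genres = sorted({
    genre['name']
    for row in sample_genres
    if isinstance(row, list)   # skip missing values
    for genre in row
})
print(unique_genres)
```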


# Q4. Make a table of all the producers and directors of each movie. Find the top 3 producers who have produced movies with the highest average ROI.

In [58]:
#considering only those rows in crew column which have no null values
No_NaN_crew = cdf[~cdf['crew'].isna()]
No_NaN_crew.head(1)

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue,profit,ROI
0,1,"[{'id': 313576, 'name': 'Hot Tub Time Machine ...",14000000,"[{'id': 35, 'name': 'Comedy'}]",,tt2637294,en,Hot Tub Time Machine 2,"When Lou, who has become the ""father of the In...",6.575393,...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,The Laws of Space and Time are About to be Vio...,Hot Tub Time Machine 2,"[{'id': 4379, 'name': 'time travel'}, {'id': 9...","[{'cast_id': 4, 'character': 'Lou', 'credit_id...","[{'credit_id': '59ac067c92514107af02c8c8', 'de...",12314651,-1685349,-12.038207


In [59]:
No_NaN_crew.shape

(2984, 25)

In [60]:
#Using a simple function to extract the list of all producers for the movie at a given row position in No_NaN_crew
def producer_name(index):
  movie_index = No_NaN_crew.iloc[index]
  crew_list = movie_index.loc['crew']
  producer_list = []
  for i in crew_list:
    if i['job']=='Producer':
      producer_list.append(i['name'])
  return producer_list

In [61]:
producer_name(100)

['Jeff Young']

In [62]:
#Using another simple function to extract the list of all directors for the movie at a given row position
# A movie can have more than one director, so a list is returned
def director_name(index):
  movie_index = No_NaN_crew.iloc[index]
  crew_list = movie_index.loc['crew']
  director_list = []
  for i in crew_list:
    if i['job']=='Director':
      director_list.append(i['name'])
  return director_list

In [63]:
director_name(1)

['Garry Marshall']

In [77]:
#creating an empty DataFrame with required Column names in which we will append data later
Table = pd.DataFrame(columns=['Movie Title', 'Producers', 'Directors', 'ROI'])

#### Now filling the Table DataFrame with one row per movie that has a non-null crew



In [65]:
# producer_name() and director_name() use .iloc, so pass the row position, not the index label
# (labels and positions differ once the rows with a null crew have been dropped)
rows = []
for pos, (_, row) in enumerate(No_NaN_crew.iterrows()):
  rows.append({'Movie Title': row['original_title'], 'Producers': producer_name(pos),
               'Directors': director_name(pos), 'ROI': row['ROI']})
Table = pd.DataFrame(rows, columns=['Movie Title', 'Producers', 'Directors', 'ROI'])

### Table containing columns of Movie Title, its Producers, Directors and ROI

In [154]:
T3 = No_NaN_crew.sort_values(by='ROI',ascending=False).head(3)
pd.set_option('display.max_colwidth', 20)
T3

Unnamed: 0,id,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,popularity,...,spoken_languages,status,tagline,title,Keywords,cast,crew,revenue,profit,ROI
1230,1231,"[{'id': 41437, '...",15000,"[{'id': 27, 'nam...",http://www.paran...,tt1179904,en,Paranormal Activity,"After a young, m...",12.706424,...,[{'iso_639_1': '...,Released,What Happens Whe...,Paranormal Activity,"[{'id': 10224, '...","[{'cast_id': 3, ...",[{'credit_id': '...,193355800,193340800,1288939.0
1679,1680,"[{'id': 64750, '...",60000,"[{'id': 27, 'nam...",http://www.blair...,tt0185937,en,The Blair Witch ...,In October of 19...,14.838386,...,[{'iso_639_1': '...,Released,The scariest mov...,The Blair Witch ...,"[{'id': 616, 'na...","[{'cast_id': 41,...",[{'credit_id': '...,248000000,247940000,413233.3
2610,2611,,5000,"[{'id': 28, 'nam...",,tt5066556,ko,대호,While the Kingdo...,3.447894,...,[{'iso_639_1': '...,Released,,The Tiger: An Ol...,"[{'id': 414, 'na...","[{'cast_id': 0, ...",[{'credit_id': '...,11083449,11078449,221569.0
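The cell above ranks movies by ROI. To rank producers by average ROI instead, one possible approach is to give every (movie, producer) pair its own row with `explode` and then average per producer. The sketch below uses a made-up table with the `Movie Title`, `Producers` and `ROI` columns of `Table`; the names `per_producer`, `mean_roi` and `n_movies` are illustrative.

```python
import pandas as pd

# Made-up table with three of the columns used in Table above
toy = pd.DataFrame({
    'Movie Title': ['M1', 'M2', 'M3', 'M4'],
    'Producers': [['P1', 'P2'], ['P1'], ['P3'], []],
    'ROI': [100.0, 300.0, 50.0, 900.0],
})

per_producer = (
    toy.explode('Producers')            # one row per (movie, producer) pair
       .dropna(subset=['Producers'])    # an empty producer list explodes to NaN
       .groupby('Producers')['ROI']
       .agg(mean_roi='mean', n_movies='count')
       .sort_values('mean_roi', ascending=False)
)
print(per_producer.head(3))
```

Applying the same steps to the real `Table` would give the top 3 producers, but that result has not been computed here.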


In [144]:
#T4 = pd.DataFrame(cdf.iloc[cdf['ROI'].idxmax()])
# C3 = pd.DataFrame(T3.loc['crew'])
# pd.set_option('display.max_colwidth', None)
# C3.reset_index(inplace=True)
#P0 = [i['name'] for i in C3['crew'] if i.get('job') == 'Producer']

# Q5. Which actor has acted in the most number of movies?

In [67]:
#We are considering only the non-null rows in the cast column
No_NaN_cast = cdf[~cdf['cast'].isna()]

In [68]:
No_NaN_cast.loc[0,'cast'][0]['name']

'Rob Corddry'

In [69]:
actor_list = []
for index,rows in No_NaN_cast.iterrows():
  for member in No_NaN_cast.loc[index,'cast']:
    # skip any entry that is not a dict
    if isinstance(member, dict):
      actor = member['name']
      actor_list.append(actor)

In [70]:
#creating a  DataFrame with actor list
Actor_Table = pd.DataFrame(actor_list, columns=['Name of Actor'])

In [71]:
Actor_Table.shape

(61811, 1)

In [72]:
#counting the appearances of each actor using value_counts(), which sorts them in descending order
Actor_Table.value_counts().reset_index().head(4)

Unnamed: 0,Name of Actor,0
0,Samuel L. Jackson,30
1,Robert De Niro,30
2,Morgan Freeman,27
3,Liam Neeson,25
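The same count can also be sketched with the standard library's `collections.Counter`, without building an intermediate DataFrame. The cast lists below are made up for illustration.

```python
from collections import Counter

# Made-up cast lists, one per movie
sample_casts = [
    [{'name': 'Actor A'}, {'name': 'Actor B'}],
    [{'name': 'Actor A'}, {'name': 'Actor C'}],
    [{'name': 'Actor A'}, {'name': 'Actor B'}],
]

appearances = Counter(
    member['name']
    for cast in sample_casts
    for member in cast
)
print(appearances.most_common(2))
```

When several actors tie for first place, as in the table above, it is worth looking at more than the single top entry.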


# ***Conclusion:***

1. Which movie made the highest profit? [Furious 7]

2. Who were its producer and director? Producers of the movie: ['Vin Diesel', 'Neal H. Moritz', 'Michael Fottrell', 'Brandon Birtell']; Director of the movie: ['James Wan']
Identify the actors in that film (first 17 cast entries): ['Vin Diesel', 'Paul Walker', 'Dwayne Johnson', 'Michelle Rodriguez', 'Tyrese Gibson', 'Ludacris', 'Jordana Brewster', 'Djimon Hounsou', 'Tony Jaa', 'Ronda Rousey', 'Nathalie Emmanuel', 'Kurt Russell', 'Jason Statham', 'Sung Kang', 'Gal Gadot', 'Lucas Black', 'Elsa Pataky']

3. This data has information about movies made in different languages. Which language has the highest average ROI (return on investment)? [ko : Korean]

4. Find out the unique genres of movies in this dataset: [War, Fantasy, History, Crime, Comedy, Horror, Thriller, TV Movie, Mystery, Drama, Adventure, Foreign, Romance, Music, Animation, Western, Family, Documentary, Science Fiction, Action]

5. Which actor has acted in the most number of movies? [Samuel L. Jackson : 30, Robert De Niro : 30]