# Project: Investigate a TMDb Movie Dataset 

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> This data set contains information about roughly 10,000 movies collected from The Movie Database (TMDb), including user ratings, revenue, cast, and release year.

> Brief Description of each column in the dataset

> **1.id** - unique ID for each row

> **2.imdb_id** - system-generated unique ID

> **3.popularity** - popularity score

> **4.budget** - budget in dollars

> **5.revenue** - revenue in dollars

> **6.original_title** - movie title

> **7.cast** - cast members who performed in the movie

> **8.homepage** - website link of the movie

> **9.director** - director name

> **10.tagline** - tagline of the movie

> **11.keywords** - words that convey the main ideas of the movie

> **12.overview** - brief summary of the movie

> **13.runtime** - movie duration in minutes

> **14.genres** - categories of the movie

> **15.production_companies** - names of the companies that manage the movie process from start to finish

> **16.release_date** - movie release date

> **17.vote_count** - number of votes given by viewers

> **18.vote_average** - average of the votes given by viewers

> **19.release_year** - movie release year

> **20.budget_adj** - budget in 2010 dollars, accounting for inflation

> **21.revenue_adj** - revenue in 2010 dollars, accounting for inflation

**Questions that I planned on exploring over the course of the report.**

<ul>
    
<li><a href="#q1">**1. What kinds of movie genres are made the most?**</a></li>

<li><a href="#q2">**2. Which genres are most popular from year to year?**</a></li>

<li><a href="#q3">**3. In which year were the most movies released?**</a></li>

<li><a href="#q4">**4. What kinds of properties are associated with movies that have high revenues?**</a></li>

<li><a href="#q5">**5. High gross profit movies from year to year**</a></li>

<li><a href="#q6">**6. High budget movies from year to year**</a></li>

<li><a href="#q7">**7. In which months are most movies released?**</a></li>

<li><a href="#q8">**8. As per popularity score, which actor/actress is most famous?**</a></li>

<li><a href="#q9">**9. As per popularity score, which movie is most famous?**</a></li>

<li><a href="#q10">**10. As per average vote score, which movie is most famous?**</a></li>

<li><a href="#q11">**11. As per average vote score, which actor/actress is most famous?**</a></li>

<li><a href="#q12">**12. Movies which have earned the highest revenue**</a></li>

<li><a href="#q13">**13. Most frequent runtime of movies**</a></li>

<li><a href="#q14">**14. Most famous action directors**</a></li>

<li><a href="#q15">**15. Most famous actors in the Action genre**</a></li>

<li><a href="#q16">**16. Which production companies have made the most movies?**</a></li>

<li><a href="#q17">**17. Most frequent keywords used in every genre**</a></li>

<li><a href="#q18">**18. Do older or newer movies get a higher popularity score?**</a></li>

<li><a href="#q19">**19. Visualization**</a></li>
</ul>

In [None]:
import pandas as pd
import numpy as np
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS

<a id='wrangling'></a>
## Data Wrangling

In [None]:
#Read TMDb movie csv file
tmdb_movies = pd.read_csv('tmdb-movies.csv')
tmdb_movies.head()

In [None]:
tmdb_movies.shape

### Let's get a feel for the data by checking its types, null values, and duplicates.

In [None]:
tmdb_movies.dtypes

### We need to convert the release_date column to datetime because it is stored as object type.

In [None]:
tmdb_movies['release_date'] = pd.to_datetime(tmdb_movies['release_date'])
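One thing worth checking after this conversion: if the dates are stored with two-digit years (for example `11/15/66`, which is an assumption about the CSV and not something shown above), the parser may place older movies in the future (2066 instead of 1966), and the weekday derived later would then be wrong. A minimal sketch of a correction, where `parse_two_digit_dates` and `latest_year` are hypothetical names introduced only for illustration:

```python
import pandas as pd

def parse_two_digit_dates(dates, latest_year=2015):
    """Parse m/d/yy strings and pull back any date that lands after latest_year.

    latest_year is an assumed cutoff: the newest release year expected in the data.
    """
    parsed = pd.to_datetime(dates, format="%m/%d/%y")
    in_future = parsed.dt.year > latest_year
    # Shift only the misparsed dates back by one century
    return parsed.where(~in_future, parsed - pd.DateOffset(years=100))

# Toy input, not taken from the real file
sample = pd.Series(["6/9/15", "11/15/66", "3/24/77"])
fixed = parse_two_digit_dates(sample)
print(fixed.dt.year.tolist())
```

Comparing `fixed.dt.year` with the existing `release_year` column would be a quick way to confirm whether the correction is needed at all.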

### Confirm that the change was applied as intended.

In [None]:
tmdb_movies.dtypes

### Let's get the release month and weekday from the release_date column.

In [None]:
tmdb_movies['release_month'] = tmdb_movies['release_date'].dt.month
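# dt.day_name() returns the weekday name; the older dt.weekday_name accessor is not available in pandas 1.0+
tmdb_movies['release_day'] = tmdb_movies['release_date'].dt.day_name()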

In [None]:
tmdb_movies.describe()

<a id='cleaning'></a>
## Data Cleaning 

>## Dropping Columns:
>#### I have already extracted the month and weekday from release_date, and release_year is a separate column, so the release_date column no longer holds information we need.
>#### The imdb_id column contains system-generated IDs that are of little use for this analysis.
>#### Similarly, the homepage data is not very useful.

In [None]:
tmdb_movies.drop(labels = ['imdb_id','homepage','release_date'],axis =1,inplace =True)

In [None]:
tmdb_movies.shape

### Let us move on and check whether our data contains duplicates and null values.

In [None]:
sum(tmdb_movies.duplicated())

In [None]:
tmdb_movies.drop_duplicates(keep= 'first',inplace = True)

In [None]:
total = tmdb_movies.isnull().sum().sort_values(ascending = False)
percent = ((tmdb_movies.isnull().sum() / tmdb_movies.shape[0])*100).sort_values(ascending = False)
pd.concat([total,percent],axis = 1,keys = ['total','percent'])
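The same missing-value summary can be written a little more compactly, because the mean of a boolean mask is already the fraction of missing entries. This is only a sketch on toy data; `missing_summary` is a hypothetical helper name, not part of the notebook above.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    is_missing = df.isna()
    summary = pd.DataFrame({
        "total": is_missing.sum(),           # number of missing entries
        "percent": is_missing.mean() * 100,  # share of missing entries in percent
    })
    return summary.sort_values("total", ascending=False)

# Toy frame standing in for tmdb_movies
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "tagline": ["a", None, None, "d"],
    "runtime": [90, 100, np.nan, 120],
})
summary = missing_summary(toy)
print(summary)
```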

In [None]:
tmdb_movies.isnull().sum()

### It is better to drop null values.

In [None]:
tmdb_movies.dropna(how = 'any',inplace = True)

In [None]:
tmdb_movies.shape

### Checking whether our dataset contains movies with a runtime of zero

In [None]:
is_runtime_zero = tmdb_movies['runtime'] == 0
tmdb_movies[is_runtime_zero].shape , tmdb_movies[is_runtime_zero].index.values

In [None]:
tmdb_movies[is_runtime_zero]

### These rows are probably fake or contain typos, because their budget and revenue are also zero.

In [None]:
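# Drop the zero-runtime rows found above (index labels 334, 1289, 1293) without hard-coding them
tmdb_movies.drop(index = tmdb_movies[is_runtime_zero].index, inplace = True)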

In [None]:
tmdb_movies.shape

### Our dataset contains several columns whose values are separated by the '|' character. For better analysis I have decided to split them up.

In [None]:
def separate(data):
    return data.str.split('|',expand = True)
genres = separate(tmdb_movies['genres'])
keywords = separate(tmdb_movies['keywords'])
cast = separate(tmdb_movies['cast'])
production_companies = separate(tmdb_movies['production_companies'])

In [None]:
genres.head()

### My idea here is to split the data by the '|' character and then melt it into one single column. This way I get a separate DataFrame for each such column ('genres', 'cast', 'keywords', 'production_companies'), and I can merge any two of them to answer a question.

### To achieve this, I first copy the id column from the tmdb_movies DataFrame so that I can later merge the other DataFrames on 'id'.

In [None]:
genres['id'] = tmdb_movies['id']
cast['id'] = tmdb_movies['id']
keywords['id'] = tmdb_movies['id']
production_companies['id'] = tmdb_movies['id']
    

In [None]:
genres.isnull().sum() , cast.isnull().sum(),keywords.isnull().sum(), production_companies.isnull().sum()

In [None]:
def melt_df(data ):
    return pd.melt(data,col_level = 0,id_vars = 'id',value_vars = [0,1,2,3,4]) #keep the id column and melt the five split columns into one

genres=melt_df(genres)
cast = melt_df(cast)
keywords = melt_df(keywords)
production_companies = melt_df(production_companies)
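As an alternative to splitting into numbered columns and melting them, the same long format can be reached with `Series.str.split` followed by `DataFrame.explode`. The sketch below runs on toy data; `split_to_long` is a hypothetical helper name. One possible advantage is that it does not depend on each column splitting into exactly five parts.

```python
import pandas as pd

def split_to_long(df, column, sep="|"):
    """Turn a delimiter-separated column into one row per (id, value) pair."""
    long = df[["id", column]].copy()
    long[column] = long[column].str.split(sep)  # each cell becomes a list
    long = long.explode(column)                 # one row per list element
    return long.dropna(subset=[column]).reset_index(drop=True)

# Toy frame standing in for tmdb_movies
toy = pd.DataFrame({
    "id": [1, 2],
    "genres": ["Action|Adventure|Thriller", "Drama"],
})
genres_long = split_to_long(toy, "genres")
print(genres_long)
```

The result already has the two columns (`id` and the value column) that the melt, drop, and rename steps produce, so it could be merged on `id` in the same way.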

In [None]:
genres.head()

### As we can see above, the 'variable' column only holds the names of the columns before melting, which is of no use to us now, so I have decided to drop it.

In [None]:
grouped_df = [genres,cast,keywords,production_companies]
for df in grouped_df:
    df.columns = ['id','variable','value']
    df.drop(df.columns[1],axis=1, inplace=True)


### Rename column of DataFrames

In [None]:
genres.rename(columns = {'value':'genres'},inplace = True)
cast.rename(columns = {'value':'cast'},inplace = True)
production_companies.rename(columns = {'value':'production_companies'},inplace = True)
keywords.rename(columns = {'value':'keywords'},inplace = True)

### Dropping all the null values

In [None]:
genres.dropna(inplace = True)
cast.dropna(inplace = True)
keywords.dropna(inplace = True)
production_companies.dropna(inplace = True)


### We now have a separate DataFrame for each of the columns below, so it is a good idea to drop them from the primary DataFrame.

In [None]:
movies = tmdb_movies.drop(columns = ['cast','keywords','production_companies', 'genres'])

In [None]:
cast.head()

In [None]:
cast['cast'].value_counts()[0:10]

### For future reference, save the DataFrames we just created as CSV files.

In [None]:
genres.to_csv('genres.csv',index = False)
cast.to_csv('cast.csv',index = False)
keywords.to_csv('keywords.csv',index = False)
production_companies.to_csv('production_companies.csv',index = False)



### To be sure, let us check whether our DataFrames still contain any null values.

In [None]:
genres.isnull().sum() , cast.isnull().sum(), keywords.isnull().sum(),production_companies.isnull().sum()

### Let us also check for any duplicates.

In [None]:
sum(genres.duplicated()) , sum(cast.duplicated()) , sum(keywords.duplicated()) , sum(production_companies.duplicated())

In [None]:
cast.shape , genres.shape

In [None]:
cast.drop_duplicates(inplace = True)
production_companies.drop_duplicates(inplace = True)

In [None]:
movies.isnull().sum()

### Here I have organised all the columns.

In [None]:
movies = movies[['id','original_title','tagline', 'overview','runtime','release_day', 'release_month', 'release_year','popularity','vote_count','vote_average','director','budget','revenue','budget_adj','revenue_adj']]

In [None]:
movies.to_csv('clean_movies.csv',index = False)

In [None]:
movies.head()

#### Final check for duplicates and null values.

In [None]:
movies.isnull().sum()

In [None]:
sum(movies.duplicated())

<a id='eda'></a>
## Exploratory Data Analysis


<a id='q1'></a>
> ### 1. What kinds of movie genres are made the most?

In [None]:
genres['genres'].value_counts()

### As we can see, Drama is the most frequently made genre.

<a id='q2'></a>
> ### 2.Which genres are most popular from year to year?

#### I merged the two DataFrames, genres and movies, used groupby to get the value counts of genres per year, and then used groupby again to pick the genre with the highest count in each year.

In [None]:
genres_movies = pd.merge(movies , genres , how = 'inner' , on = 'id')

In [None]:
grouped_data = genres_movies.groupby('release_year')['genres'].value_counts().reset_index(name = 'counts')

In [None]:
grouped_data.set_index('genres',inplace =True)

In [None]:
grouped_data.head()

In [None]:
grouped_data.groupby(['release_year'])['counts'].idxmax()
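To make the count-then-idxmax idea easier to follow, here is a minimal sketch on toy data. Note that "most popular" in this step means the genre with the most movies in a year, not the highest popularity score. All values below are invented for illustration.

```python
import pandas as pd

# Toy stand-in for genres_movies: one row per (movie, genre) pair
toy = pd.DataFrame({
    "release_year": [2014, 2014, 2014, 2015, 2015, 2015],
    "genres": ["Drama", "Drama", "Comedy", "Comedy", "Comedy", "Drama"],
})

# Count how many movies each genre has in each year
counts = toy.groupby(["release_year", "genres"]).size().reset_index(name="counts")

# idxmax gives the row label of the largest count within each year
top_rows = counts.loc[counts.groupby("release_year")["counts"].idxmax()]
top_genre = top_rows.set_index("release_year")["genres"]
print(top_genre)
```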

### Drama,Comedy are most popular genres from year to year.

<a id='q3'></a>
> ### 3. In which year were the most movies released?

In [None]:
movies['release_year'].value_counts()[0:10]

<a id='q4'></a>
> ### 4. What kinds of properties are associated with movies that have high revenues?

### I am going to check whether high revenue depends on a particular cast, genre, or production company.

In [None]:
# to calculate gross profit, subtract budget from revenue
movies['gross_profit_adj'] = movies['revenue_adj'].sub(movies['budget_adj'],axis = 'index')

### To calculate gross profit, I subtracted budget_adj from revenue_adj. Because gross profit earned in 2006 does not have the same value as in 2015, I decided to standardize gross_profit_adj, revenue_adj, and budget_adj to put them all on the same scale.

In [None]:
#Set ddof=0 in .std() so that the population standard deviation is used (the pandas default is ddof=1, the sample standard deviation)

def standardize(df):
    return (df - df.mean()) / df.std(ddof = 0) 

In [None]:
movies[['budget_adj', 'revenue_adj', 'gross_profit_adj']]=movies[['budget_adj',
                                                                  'revenue_adj', 'gross_profit_adj']].apply(standardize)
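The standardization above is the usual z-score, $z = (x - \mu) / \sigma$, where $\mu$ is the column mean and $\sigma$ is the population standard deviation (`ddof=0`). A quick sanity check on toy numbers shows that each standardized column ends up with mean 0 and standard deviation 1; the values here are invented.

```python
import numpy as np
import pandas as pd

def standardize(column):
    """Z-score a column using the population standard deviation (ddof=0)."""
    return (column - column.mean()) / column.std(ddof=0)

# Toy values, not taken from the real dataset
toy = pd.DataFrame({
    "budget_adj": [10.0, 20.0, 30.0],
    "revenue_adj": [5.0, 5.0, 35.0],
})
z = toy.apply(standardize)

print(z.mean().round(6).tolist())       # close to 0 for every column
print(z.std(ddof=0).round(6).tolist())  # close to 1 for every column
```

After this step the values are unitless, so a "gross profit" of 2.0 means two standard deviations above the average movie, not two dollars.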

In [None]:
movies.loc[:,['budget_adj' , 'revenue_adj' , 'gross_profit_adj']].head()

In [None]:
movies['gross_profit_adj'].isnull().sum()

### Since we have added the new column gross_profit_adj, I am merging again.

In [None]:
genres_movies = pd.merge(movies , genres , how = 'inner' , on = 'id')

> ### 4a. Which genres contribute to higher revenues?

In [None]:
grouped_by_genres=genres_movies.groupby('genres').sum() #group data by genres

In [None]:
#sort according to gross profit but in a descending order to get highest on the top.
grouped_by_genres.sort_values(by = 'gross_profit_adj',ascending = False).loc[:,'gross_profit_adj'][0:10] 

### I believe these genres come out on top partly because their budgets are higher as well. I will also check which genres have the highest budgets.

In [None]:
grouped_by_genres.sort_values(by ='budget_adj',ascending = False).loc[:,'budget_adj'][0:10]

### This suggests my assumption is largely right. We will also check the popularity score of the top genres.

### Drama is the most frequently made genre over the years, as we just saw, but it is interesting that it did not even make the top 10 genres by revenue.

In [None]:
#top genres by popularity score
grouped_by_genres.sort_values(by = 'popularity',ascending = False).loc[:,'popularity'][0:10]

### Is there any relationship between popularity and gross profit?

In [None]:
#lmplot creates its own figure, so its size is set through height and aspect instead of plt.figure
sns.lmplot(x = 'gross_profit_adj' , y = 'popularity', data = genres_movies, height = 9, aspect = 12/9);
plt.xlabel('Gross Profit' , fontsize = 18);
plt.ylabel('Popularity', fontsize =18);
plt.title('Gross Profit Vs popularity',fontsize = 18);

### No strong relationship between gross profit & popularity scores.

> ### 4b. Who are the directors contributing to high revenue movies?

In [None]:
movies.groupby('director').sum().sort_values(by = 'gross_profit_adj',ascending = False)['gross_profit_adj'][0:10]

> ### 4c. Which production companies were able to make high revenues?

In [None]:
movies_prod_companies = pd.merge(movies,production_companies , how = 'inner', on = 'id')

In [None]:
movies_prod_companies.groupby('production_companies').sum().sort_values(by= 'gross_profit_adj',ascending = False)['gross_profit_adj'][0:10]

> ### 4d. Which actors/actresses are able to make high revenues?

In [None]:
movies_cast = pd.merge(movies,cast,how = 'inner' , on = 'id')

In [None]:
movies_cast.groupby('cast').sum().sort_values(by = 'gross_profit_adj',ascending = False)['gross_profit_adj'][0:10]

<a id='q5'></a>
> ### 5. High Gross profit movies from year to year.

In [None]:
def sort_by_gross_profit(df):
    return df.sort_values(by = 'gross_profit_adj',ascending = False)['original_title'].head(1)


In [None]:
movies.groupby('release_year').apply(sort_by_gross_profit)
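The same "top movie per year" lookup can also be sketched without sorting every group, by asking each group for the row label of its maximum with `idxmax`. The toy titles and numbers below are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the movies DataFrame
toy = pd.DataFrame({
    "release_year": [1977, 1977, 2015, 2015],
    "original_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "gross_profit_adj": [2.5, 0.3, 1.1, 1.9],
})

# Row label of the highest gross profit within each release year
best_idx = toy.groupby("release_year")["gross_profit_adj"].idxmax()
top_by_year = toy.loc[best_idx].set_index("release_year")["original_title"]
print(top_by_year)
```

Swapping `gross_profit_adj` for `budget_adj` would give the highest-budget movie per year in the same way.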

<a id='q6'></a>
> ### 6. High Budget Movies from year to year.

In [None]:
def sort_by_budget(df):
    return df.sort_values(by = 'budget_adj',ascending = False)['original_title'].head(1)


In [None]:
movies.groupby('release_year').apply(sort_by_budget)

<a id='q7'></a>
> ### 7. In which months are most movies released?

In [None]:
movies['release_month'].value_counts()[0:10]

<a id='q8'></a>
> ### 8. As per popularity score which actor / actress is most famous?

In [None]:
movies_cast = pd.merge(movies , cast , how = 'inner' , on= 'id') #Merge movies cast DataFrame on id

In [None]:
movies_cast.groupby('cast').sum().sort_values(by = 'popularity' , ascending = False ).loc[:,'popularity'][0:10]

<a id='q9'></a>
> ### 9. As per popularity score which movie is most famous?

In [None]:
movies.sort_values(by = 'popularity', ascending = False).loc[:,['popularity','original_title']][0:10]

<a id='q10'></a>
> ### 10. As per Average vote score which movie is most famous?

In [None]:
movies.sort_values(by = 'vote_average' ,ascending = False).loc[:,['vote_average','original_title']][0:10]

<a id='q11'></a>
> ### 11. As per Average vote score which Actor/Actress is most famous?

In [None]:
movies_cast.sort_values(by = 'vote_average' ,ascending = False).loc[:,['vote_average','cast']][0:10]

<a id='q12'></a>
> ### 12. Movies which have earned highest revenue.

In [None]:
movies.sort_values(by = 'gross_profit_adj',ascending = False).loc[:,['gross_profit_adj', 'original_title']][0:10]

<a id='q13'></a>
> ### 13. Most Frequent Runtime of movies 

In [None]:
sns.kdeplot(movies['runtime'], fill = True, color = 'r'); #fill replaces the deprecated shade argument
plt.xlabel('Runtime in minutes',fontsize = 18);

plt.title('Most Frequent Runtime of movies', fontsize = 18);


### Typically movies have a runtime of 90 to 150 minutes.

In [None]:
binval = np.arange(0,200,5)
plt.hist(movies['runtime'],bins = binval);
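Alongside the plots, the typical runtime can be read off numerically with the mode and the quartiles. This is a small sketch on invented values (in minutes), not on the real column.

```python
import pandas as pd

# Toy runtimes in minutes
runtime = pd.Series([88, 90, 90, 95, 100, 100, 100, 118, 150], name="runtime")

most_common = runtime.mode().iloc[0]     # most frequent single value
q1, q3 = runtime.quantile([0.25, 0.75])  # middle 50% of the runtimes

print(most_common, q1, q3)
```

On the real data, `movies['runtime']` would take the place of the toy series.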


<a id='q14'></a>
> ### 14. Most famous action directors

In [None]:
action=genres_movies[genres_movies['genres'] == 'Action'] #filter dataframe by Action Genre

In [None]:
#Use groupby to sum up by director & then sort by gross profit in descending order 
action.groupby('director').sum().sort_values(by = 'gross_profit_adj',ascending = False)['gross_profit_adj'][0:10]

<a id='q15'></a>
> ### 15. Most famous Actors in Action genre. 

In [None]:
genres_movies_cast = pd.merge(genres_movies,cast,how = 'inner', on = 'id') #merge dataframe genre_movies & cast

In [None]:
#filter data by Action genre
action_cast=genres_movies_cast[genres_movies_cast['genres'] == 'Action']

In [None]:
#group the data by cast, take the sum & then sort it according to popularity score in descending order.
action_cast.groupby('cast').sum().sort_values(by = 'popularity', ascending = False)['popularity'][0:10]

In [None]:
genres_cast = pd.merge(genres,cast, how = 'inner', on = 'id')

<a id='q16'></a>
> ### 16. Which production companies have made most of the movies

In [None]:
movies_prod_companies['production_companies'].value_counts()[0:10]

<a id='q17'></a>
> ### 17. Most frequent keywords used in every genre

In [None]:
movies_key = pd.merge(movies , keywords , how = 'inner' , on = 'id')

### I have merged two DataFrames for answering my question.

In [None]:
genres_keywords = pd.merge(genres , keywords , how = 'inner', on = 'id')

In [None]:
genres_keywords.head(10)


### Most common keywords used across all the genres

In [None]:
keywords['keywords'].value_counts()[0:10]
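The counts above are taken over all genres together. To get the most frequent keywords within each genre, the merged genres/keywords data can be grouped by both columns. A minimal sketch on toy data follows; `top_keywords_per_genre` is a hypothetical helper name.

```python
import pandas as pd

def top_keywords_per_genre(genres_keywords, n=3):
    """Return the n most frequent keywords for every genre."""
    counts = (
        genres_keywords.groupby(["genres", "keywords"])
        .size()
        .reset_index(name="counts")
    )
    # Highest counts first within each genre, then keep the first n rows per genre
    counts = counts.sort_values(["genres", "counts"], ascending=[True, False])
    return counts.groupby("genres").head(n).reset_index(drop=True)

# Toy stand-in for genres_keywords
toy = pd.DataFrame({
    "genres": ["Drama", "Drama", "Drama", "Action", "Action", "Action"],
    "keywords": ["biography", "biography", "prison", "sequel", "sequel", "dystopia"],
})
top = top_keywords_per_genre(toy, n=1)
print(top)
```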

### It would be a good idea to show the keywords in a visualization for better understanding. I will do that in a later section.

<a id='q18'></a>
> ### 18. Older or newer movies get more popularity score 

In [None]:
#group data by release year and take the median of the numeric columns only (text columns have no median)
groupby_year=movies.groupby('release_year').median(numeric_only = True)

In [None]:
#get release year column back into dataframe for visualization
groupby_year['release_year'] = groupby_year.index.get_level_values(0)

In [None]:
#plot a scatter plot with a fitted regression line; lmplot creates its own figure, so its size is set through height and aspect
sns.lmplot(x = 'release_year', y = 'popularity', data = groupby_year, height = 9, aspect = 13/9);
#gives names to x-axis , y-axis & the title
plt.xlabel('Release Year', fontsize = 15); 
plt.ylabel('Popularity', fontsize = 15);
plt.title('Release Year Vs Popularity ',fontsize = 15);
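The upward trend in the plot can also be backed by a single number, the Pearson correlation between release year and the yearly median popularity. The sketch below uses invented values; on the real data `groupby_year` would take the place of `median_by_year`.

```python
import pandas as pd

# Toy stand-in for the movies DataFrame
toy = pd.DataFrame({
    "release_year": [1960, 1960, 1990, 1990, 2015, 2015],
    "popularity": [0.2, 0.4, 0.5, 0.7, 0.9, 1.3],
})

# Median popularity per year; reset_index brings release_year back as a column
median_by_year = toy.groupby("release_year")["popularity"].median().reset_index()

# Pearson correlation: values near +1 mean popularity rises with release year
year_corr = median_by_year["release_year"].corr(median_by_year["popularity"])
print(round(year_corr, 3))
```

A strong correlation here only describes the trend in the scores; it does not by itself explain why newer movies score higher.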

### This visualization clearly shows that recent movies are more popular than older movies.

<a id='q19'></a>
## Visualization

> ### Plot Most Famous genres over the years

In [None]:
plt.figure(figsize = (10,6)) #set a figure size
ax = sns.countplot(x="genres", data= genres, order = genres['genres'].value_counts().index) #plots bar graph for genres
#rotate x-tick labels
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right",fontsize = 13)
plt.tight_layout()
#gives names to x-axis , y-axis & the title
plt.xlabel('Genres' , fontsize = 30)
plt.ylabel('Counts' , fontsize = 30)
plt.title('Genres counts over the years',fontsize = 30)
plt.show()

### The graph shows that Drama, Comedy, Thriller, Action & Horror are the most popular genres over the years.

#### Using wordcloud

In [None]:
plt.figure(figsize = (9,9))
stopwords = set(STOPWORDS)
tagline_cloud = WordCloud(width=800, height=400,background_color="white",max_words=30,stopwords=stopwords).generate(' '.join(genres['genres']))

# Generate plot

plt.axis("off")
plt.imshow(tagline_cloud)
plt.show()


### We can observe here that the most popular genres are Drama, Comedy, Thriller, Action, Adventure, Romance, Horror.

> ### Most famous movies by popularity score.

In [None]:
movies_sorted_by_popularity = movies.sort_values(by = 'popularity', ascending = False)[0:10]

In [None]:
plt.figure(figsize = (10,6)) #set a figure size
ax=sns.barplot(x = 'original_title' , y = 'popularity', data =movies_sorted_by_popularity) #plots bar graph for movies title by popularity
#rotate xtick labels by 40 degrees
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right",fontsize = 12)
plt.tight_layout()
#gives names to x-axis , y-axis & the title
plt.xlabel('Movies' , fontsize = 30)
plt.ylabel('Popularity Score' , fontsize = 30)
plt.title('Most famous Movies',fontsize = 30)
plt.show()

### We can observe that Jurassic World is the most famous movie as per popularity score.

In [None]:
movies_sortedby_popularity = movies.sort_values(by = 'popularity', ascending = False).tail(10)

#### I am treating the movies with the lowest popularity scores as flop movies.

In [None]:
plt.figure(figsize = (10,6))
#plots a bar plot for movie title by popularity score
ax=sns.barplot(x = 'original_title' , y = 'popularity', data =movies_sortedby_popularity)
#rotate Xtick labels
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right",fontsize = 12)
plt.tight_layout()
plt.xlabel('Movies' , fontsize = 30)
plt.ylabel('Popularity Score' , fontsize = 30)
plt.title('Flop Movies',fontsize = 30)
plt.show()

### We can observe that the least popular movie is 'The Hospital'.

> ### Graph for Average votes received by each genres

In [None]:
#set a figure size
fig, ax = plt.subplots() 
fig.set_size_inches(12, 8)
sns.violinplot(x = 'genres' , y = 'vote_average' , data =genres_movies);

#gives names to x-axis , y-axis & the title
plt.xlabel('Genres',fontsize = 18);
plt.ylabel('Average Votes',fontsize = 18);
plt.title('Genres Vs Average Votes', fontsize = 18)
plt.xticks(rotation = 25);

### Most average votes fall between 5 and 7.5. Average votes for Documentary lie between 6.5 and 7.5, the highest among all genres, and Horror has the most spread-out votes.

> ### Plot a graph for Number of movies released every month

In [None]:
#set a figure size
plt.figure(figsize = (12,8))
#Bar plot for no.of movies released every month
ax=sns.countplot(x = 'release_month', data = movies,order = movies['release_month'].value_counts().index);
#rotate x ticklabels
ax.set_xticklabels(ax.get_xticklabels(),rotation = 40, ha= 'right', fontsize = 12);
#gives names to x-axis , y-axis & the title
plt.xlabel('Release Month', fontsize = 18);
plt.ylabel('Frequency',fontsize = 18);
plt.title('No.of Movies released every month',fontsize = 18);

### The top 5 release months are September, October, December, August, and June.

> ### Plot a graph for number of movies released every year

In [None]:
#set a figure size
plt.figure(figsize = (12,8))
#bar plot for no.movies released every year
ax=sns.countplot(x = 'release_year', data = movies, order = movies['release_year'].value_counts().index);
#Rotate xticklabels
ax.set_xticklabels(ax.get_xticklabels(),rotation = 40, ha= 'right', fontsize = 7.6);
#Give label to x-axis,y-axis & a title
plt.xlabel('Release year', fontsize = 18);
plt.ylabel('Frequency',fontsize = 18);
plt.title('No.of Movies released every year',fontsize = 18);

### The most movies were released in 2014, followed by 2013, 2015, 2009, and 2011; the fewest were released in 1969. This graph also suggests that the number of movies released per year has grown over time.


> ### Graph for High profit earning cast.

In [None]:
#group data frame by cast & sort by gross profit in descending order.
movies_cast_sortedby_grossprofit=movies_cast.groupby('cast').sum().sort_values(by = 'gross_profit_adj' , ascending = False )[0:10]

In [None]:
#get cast column back in dataframe
movies_cast_sortedby_grossprofit['cast'] = movies_cast_sortedby_grossprofit.index.get_level_values(0)

In [None]:
#set figure size
plt.figure(figsize=(10,8))
ab=sns.barplot(x = 'cast', y ='gross_profit_adj',data =movies_cast_sortedby_grossprofit );
#Rotate Xtick labels
ab.set_xticklabels(ab.get_xticklabels(),rotation = 40, ha ='right', fontsize = 12 );
#Give labels to axis
plt.xlabel('Cast', fontsize = 18);
plt.ylabel('Gross Profit', fontsize = 18);
plt.title('High Profit earning cast', fontsize = 18);

### Harrison Ford has the highest gross profit among all cast members.

> ### Plot a graph for Most Famous Actors by Popularity score

In [None]:
#group data frame by cast sort it according to popularity score in descending score
movies_cast_sortedby_popularity=movies_cast.groupby('cast').sum().sort_values(by = 'popularity' , ascending = False )[0:10]

In [None]:
#get cast column back in dataframe 
movies_cast_sortedby_popularity['cast'] = movies_cast_sortedby_popularity.index.get_level_values(0)

In [None]:
#set figure size
plt.figure(figsize=(10,8))
ab=sns.barplot(x = 'cast', y ='popularity',data =movies_cast_sortedby_popularity );
#Rotate x tick labels
ab.set_xticklabels(ab.get_xticklabels(),rotation = 40, ha ='right', fontsize = 12 );
#Give labels to axis
plt.xlabel('Cast', fontsize = 18);
plt.ylabel('Popularity Score', fontsize = 18);
plt.title('Most Famous Actors by Popularity Score', fontsize = 18);

### The top 3 most popular actors are Samuel L. Jackson, Michael Caine, and Harrison Ford.

In [None]:
genres_movies_cast.head(1)

### As Drama is the most popular genre, I will pull the actors, directors, and production companies who have done the most drama movies, ranked by popularity score.

In [None]:
drama = genres_movies_cast[genres_movies_cast['genres'] == 'Drama']

In [None]:
drama_sorted_by_popularity=drama.groupby('cast').sum().sort_values(by = 'popularity', ascending = False)[0:10]

In [None]:
drama_sorted_by_popularity['cast'] = drama_sorted_by_popularity.index.get_level_values(0)

In [None]:
#set figure size
plt.figure(figsize=(10,8))
ab=sns.barplot(x = 'cast', y ='popularity',data= drama_sorted_by_popularity);
#Rotate x tick labels
ab.set_xticklabels(ab.get_xticklabels(),rotation = 40, ha ='right', fontsize = 12 );
#Give labels to axis
plt.xlabel('Cast', fontsize = 18);
plt.ylabel('Popularity Score', fontsize = 18);
plt.title('Most Famous Actors in Drama by Popularity Score', fontsize = 18);

### The top 3 most popular actors in Drama are Michael Caine, Leonardo DiCaprio, and Brad Pitt.

In [None]:
sorted_by_popularity=drama.groupby('original_title').sum().sort_values(by = 'popularity', ascending = False)[0:10]

In [None]:
sorted_by_popularity['original_title'] =sorted_by_popularity.index.get_level_values(0)

In [None]:
#set figure size
plt.figure(figsize=(10,8))
ab=sns.barplot(x = 'original_title', y ='popularity',data= sorted_by_popularity);
#Rotate Xtick labels
ab.set_xticklabels(ab.get_xticklabels(),rotation = 40, ha ='right', fontsize = 12 );
#Give labels to axis
plt.xlabel('Movies', fontsize = 18);
plt.ylabel('Popularity Score', fontsize = 18);
plt.title('Most Famous Movies in Drama by Popularity Score', fontsize = 18);

### This graph shows Most Popular Movie in Drama is Interstellar

In [None]:
director_sorted_by_popularity=drama.groupby('director').sum().sort_values(by = 'popularity', ascending = False)[0:10]

In [None]:
director_sorted_by_popularity['director'] =director_sorted_by_popularity.index.get_level_values(0)

In [None]:
#set figure size
plt.figure(figsize=(10,8))
ab=sns.barplot(x = 'director', y ='popularity',data=director_sorted_by_popularity);
#Rotate xtick labels
ab.set_xticklabels(ab.get_xticklabels(),rotation = 40, ha ='right', fontsize = 12 );
#Give labels to axis
plt.xlabel('Directors', fontsize = 18);
plt.ylabel('Popularity Score', fontsize = 18);
plt.title('Most Famous Directors in Drama by Popularity Score', fontsize = 18);

### Christopher Nolan is Most Popular Director in Drama among all.

In [None]:
#set figure size
plt.figure(figsize = (17,6))
ag=sns.pointplot(x="release_year", y="gross_profit_adj", data=movies);
#Rotate xtick labels
ag.set_xticklabels(ag.get_xticklabels(), rotation = 40,ha='right', fontsize = 9);
#Give labels to axis
plt.xlabel('Years',fontsize = 18);
plt.ylabel('Gross Profit',fontsize = 18);
plt.title('Gross Profit Over the years',fontsize = 18);

### The year with the highest profit was 1977, and the year with the lowest was 1966.

In [None]:
#set a figure size
plt.figure(figsize = (17,6));

sns.boxplot(x = 'release_month', y = 'vote_average' , data = movies);
#Give labels to axis
plt.xlabel('Months',fontsize = 18);
plt.ylabel('Average Vote',fontsize = 18);
plt.title('Votes received over Months',fontsize = 18);

### About 75% of the average votes fall between 3.8 and 6.7. More votes were received for movies released in December and January.

### As per our analysis, the top 5 genres are Drama, Comedy, Action, Thriller, and Romance, so it is a good idea to find the most used keywords in those genres.

> ### Wordcloud for Most common keywords used in Drama

In [None]:
stopwords = set(STOPWORDS)
cloud = WordCloud(width=800, height=400,background_color="white", max_words=50,stopwords=stopwords)
plt.figure( figsize=(20,10) );


positive_cloud = cloud.generate(genres_keywords.loc[genres_keywords['genres'] == 'Drama', 'keywords'].str.cat(sep='\n'));

plt.imshow(positive_cloud);
plt.axis("off");
plt.imshow(cloud);
plt.show();

### The wordcloud shows that independent, novel woman, biography, new york, murder, sex prison, world war, and secret are common keywords used in Drama.
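Pairs such as "novel woman" or "sex prison" may appear because `generate` works on one long text and re-tokenizes it, so multi-word keywords can be split and neighbouring keywords can be joined. One way to keep every keyword intact is to count the keywords first and hand the counts to the wordcloud. The sketch below only builds the frequency dictionary, on invented data; `keyword_frequencies` is a hypothetical helper name.

```python
import pandas as pd

def keyword_frequencies(genres_keywords, genre):
    """Count how often each keyword occurs within one genre."""
    in_genre = genres_keywords.loc[genres_keywords["genres"] == genre, "keywords"]
    return in_genre.value_counts().to_dict()

# Toy stand-in for genres_keywords
toy = pd.DataFrame({
    "genres": ["Drama", "Drama", "Drama", "Action"],
    "keywords": ["new york", "based on novel", "new york", "sequel"],
})
drama_freqs = keyword_frequencies(toy, "Drama")
print(drama_freqs)
```

The resulting dictionary could then be passed to `WordCloud.generate_from_frequencies` in place of the joined text given to `generate`.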

> ### Wordcloud for Most common keywords used in Action.

In [None]:
stopwords = set(STOPWORDS)
#set width height background color maximum words on wordcloud
cloud = WordCloud(width=800, height=400,background_color="white", max_words=50,stopwords=stopwords)
#set figure size
plt.figure( figsize=(20,10))
#select Action genre from dataframe,generate a wordcloud of keywords used in action genre
action_cloud = cloud.generate(genres_keywords.loc[genres_keywords['genres'] == 'Action', 'keywords'].str.cat(sep='\n'));

plt.imshow(action_cloud);
#remove axis
plt.axis("off");
plt.imshow(cloud);
plt.show();

### Above wordcloud shows us most commonly used words in Action Genre.

> ### Wordcloud for Most common keywords used in Adventure.

In [None]:
stopwords = set(STOPWORDS)
#set width height background color maximum words on wordcloud
cloud = WordCloud(width=800, height=400,background_color="white",stopwords=stopwords)
#set figure size
plt.figure( figsize=(20,10) );

#select Adventure genre from dataframe,generate a wordcloud of keywords used in adventure genre
adventure_cloud = cloud.generate(genres_keywords.loc[genres_keywords['genres'] == 'Adventure', 'keywords'].str.cat(sep='\n'))

#Generate Plot
plt.imshow(adventure_cloud);
plt.axis("off");
plt.imshow(cloud);
plt.show();

### The most frequently used words in the Adventure genre are magic, novel, space, alien, and dystopia.

> ### Most Popular words used in a tagline

In [None]:
#set width height background color maximum words on wordcloud
plt.figure(figsize = (20,10))
stopwords = set(STOPWORDS)
tagline_cloud = WordCloud(width=800, height=400,background_color="white",stopwords=stopwords).generate(movies['tagline'].str.cat(sep = '\n'))

# Generate plot
plt.axis("off")
plt.imshow(tagline_cloud)
plt.show()


### Above wordcloud shows most frequently used words in tagline

<a id='conclusions'></a>
## Conclusions

> **As per my overall analysis I have found**

**The number of movies released per year has increased over time.**

**Frequent runtimes range from 92 to 150 minutes.**

**Drama, Action, Thriller, Romance, and Horror are the most popular genres.**

**By popularity Score**

> Most Popular Movie is Jurassic World

> Most Famous Genre is Drama

**Highest gross Profit earned**

>**Director** Steven Spielberg 

>**Movie** Star Wars

>**Cast** Harrison Ford

>**Production Company** Twentieth Century Fox Film Corporation

**By Average Votes**

>**Popular Movie** The Jinx: The Life and Deaths of Robert Durst

>**Popular Cast** Tina Weymouth

**I have found the popular movies and popular cast based on popularity score, budget, revenue, and average votes received, but the analysis could have been more informative if the data also included details of the awards received by actors and actresses.**

**The data does not contain information such as whether a movie is recommended for kids. Having this data would have been useful for finding the movies, cast, and genres that are famous among kids.**

**We can do additional research on the tagline and overview of movies to understand how the emotion of movies changes across genres.**

**Finally, the dataset given to me had 10886 rows in total, but it contained many missing values, and some movies had a runtime, budget, or revenue of zero, so I had to delete such rows. After all the cleaning, 7028 rows were left, which means I lost more than 3,800 rows that would definitely have helped my findings if the information they contained had been reliable.**

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_TMDb_Dataset.ipynb'])