## Here's an outline of the goals, objectives, and methodology for the movie prediction project. I will try to build the following:

# Introduction:

Overview of the project and its objectives
Importance of revenue prediction and hit/flop classification in the movie industry
Description of the data used in the project

# Goals and Objectives:
Build a Regressor for Movie Revenue Prediction
Collect and preprocess movie data
Feature engineering to extract relevant information
Train and optimize a regressor model
Evaluate model performance and tune hyperparameters
Deploy the model for revenue prediction
Build a Classifier for Hit/Flop Prediction
Train and optimize a classification model
# Methodology:
For data collection, describe sources of data.
For preprocessing and feature engineering, describe the steps taken to clean and transform the data, and how features were selected and engineered.
For model training and optimization, describe the algorithms and techniques used for model selection, training, and optimization.


# Importing libraries

In [996]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
import datetime 
from wordcloud import WordCloud, STOPWORDS
import ast
import plotly.offline as py
from IPython.display import display,Image,HTML
%matplotlib inline

In [997]:
df =pd.read_csv('E:/New Project/datasets/movies_metadata.csv')


Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.



In [998]:
df.shape

(45466, 24)

# Understanding the Dataset:
The dataset was acquired via the TMDB API and contains the movies that are also present in the MovieLens Latest Full Dataset. This covers 45,466 movies that have been rated by over 270,000 users, for a total of more than 26 million ratings. This rich source of data offers an abundance of features for analysis and modeling.


In [999]:
df.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

# Attributes
1 adult: Indicates if the movie is X-Rated or Adult.

2 belongs_to_collection: A string format of dictionary that gives information on the movie series the particular film belongs to.

3 budget: The budget of the movie in dollars.

4 genres: A list of dictionaries that list out all the genres associated with the movie.

5 homepage: The Official Homepage of the movie.

6 id: The ID of the movie.

7 imdb_id: The IMDB ID of the movie.

8 original_language: The language in which the movie was originally shot in.

9 original_title: The original title of the movie.

10 overview: A brief summary of the movie.

11 popularity: The Popularity Score assigned by TMDB.

12 poster_path: The URL of the poster image.

13 production_companies: A list of production companies involved with the making of the movie.

14 production_countries: A list of countries where the movie was shot/produced in.

15 release_date: Release Date of the movie.

16 revenue: The total revenue of the movie in dollars.

17 runtime: The runtime of the movie in minutes.

18 spoken_languages: A list of spoken languages in the film.

19 status: The status of the movie (Released, To Be Released, Announced, etc.)

20 tagline: The tagline of the movie.

21 title: The Official Title of the movie.

22 video: Indicates if there is a video present of the movie with TMDB.

23 vote_average: The average rating of the movie.

24 vote_count: The number of votes by users, as counted by TMDB.

# Data preprocessing:  
The dataset appears to be relatively clean; however, it would still be beneficial to gain a better understanding of the features within the data and perform appropriate data preprocessing to create a more suitable form for analysis.

In [1000]:
df['budget'] = pd.to_numeric(df['budget'],errors='coerce')
#When 'errors' is set to 'coerce', it means that if an error occurs during conversion 
# (e.g., if pandas encounters a non-numeric value when trying to convert a column of data into numeric format),
#  pandas will replace the problematic value(s) with NaN instead of raising an error or exception.

In [1001]:
df['budget'] = df['budget'].replace(0,np.nan)
df[df['budget'].isnull()].shape

(36576, 24)

The budget feature contains unclean values that caused pandas to load it as a generic object type. To rectify this, the budget feature has been converted into a numeric variable, and all non-numeric values have been replaced with NaN. In addition, all values of 0 have been replaced with NaN to indicate a lack of information about the budget; the same treatment is applied to the revenue feature below.
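
As a small illustration of this cleaning step, the sketch below applies the same two calls to a made-up Series. The values are invented for the example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw 'budget' column: numbers stored as strings,
# one junk entry, and one zero that really means "unknown".
raw_budget = pd.Series(['30000000', '0', '/poster.jpg', '65000000'])

# Non-numeric strings become NaN instead of raising an error
budget = pd.to_numeric(raw_budget, errors='coerce')

# Treat a budget of 0 as missing information
budget = budget.replace(0, np.nan)

print(budget.isnull().sum())      # number of missing budgets
print(budget.dropna().tolist())   # the budgets that survive cleaning
```

Only the two genuine budgets remain; the junk string and the zero are both counted as missing.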

In [1002]:
df[df['revenue'] == 0].shape

(38052, 24)

In [1003]:
df['revenue'] = df['revenue'].replace(0,np.nan)

While analyzing the dataset, we observed that a large proportion of movies (38,052 of 45,466) have a recorded revenue of 0, which suggests that total revenue is not available for these films. These zeros are replaced with NaN. Revenue is nevertheless a crucial feature for our analysis, and we will continue to use it for the remaining 7,414 movies with revenue data available.

In [1004]:
df = df.drop(['imdb_id'], axis=1)

In [1005]:

df[df['original_title'] != df['title']][['title', 'original_title']].head()

Unnamed: 0,title,original_title
28,The City of Lost Children,La Cité des Enfants Perdus
29,Shanghai Triad,摇啊摇，摇到外婆桥
32,Wings of Courage,"Guillaumet, les ailes du courage"
57,The Postman,Il postino
58,The Confessional,Le confessionnal


The analysis will use the translated, Anglicized movie title instead of the original title, which is the title in the language the movie was originally shot in. To accomplish this, the original title is dropped from the dataset. It will still be possible to identify a foreign-language film from the 'original_language' feature, so no significant information is lost.

In [1006]:
df = df.drop(['original_title'],axis=1)

#### Moving forward with the analysis, it is necessary to create certain features that are appropriate for answering specific questions. Two crucial features that we will create are:

## year: indicating the year of movie release
## return: representing the revenue to budget ratio.
The return feature is highly informative since it can provide a more precise assessment of a movie's financial success. Currently, our dataset is unable to distinguish between a $200 million budget movie that earned $100 million and a $50,000 budget movie that earned $200,000. The return feature will address this limitation and allow for a more accurate comparison of movie profitability.

### A return value greater than 1 would indicate profit, while a return value less than 1 would signify a loss.
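
The two movies from the comparison above can be worked through as a quick check. This is only a sketch; `revenue_to_budget_ratio` is a hypothetical helper written for this example and is not part of the notebook.

```python
def revenue_to_budget_ratio(revenue, budget):
    """Return revenue / budget, or None when either value is missing or zero."""
    if not revenue or not budget:
        return None
    return revenue / budget

# $200 million budget, $100 million revenue
big_movie = revenue_to_budget_ratio(100_000_000, 200_000_000)
# $50,000 budget, $200,000 revenue
small_movie = revenue_to_budget_ratio(200_000, 50_000)

print(big_movie)    # 0.5 -> below 1, a loss
print(small_movie)  # 4.0 -> above 1, a profit
```

Although the first movie earned far more in absolute terms, its ratio shows a loss, while the small movie returned four times its budget.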

In [1007]:
df['return'] = df['revenue']/df['budget']
df[df['return'].isnull()].shape

(40085, 23)

Out of the entire dataset, a revenue-to-budget ratio is available for 5,381 movies. This may seem like a small fraction, accounting for only about 12% of the 45,466 movies. However, it is sufficient to carry out meaningful analyses and gain valuable insights into the movie industry.
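
The share of movies with a usable return value follows directly from the two shapes printed earlier. A minimal check, using only those reported row counts:

```python
total_movies = 45466      # rows reported by df.shape
missing_return = 40085    # rows where 'return' is NaN

with_return = total_movies - missing_return
share = with_return / total_movies

print(with_return)        # 5381
print(f"{share:.1%}")     # 11.8%
```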

In [1008]:
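# Note: `x != np.nan` is always True, so unparseable dates become the string 'NaT' here; they are replaced with NaN later on.
df['year'] = pd.to_datetime(df['release_date'],errors='coerce').apply(lambda x: str(x).split('-')[0] if x != np.nan else np.nan)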
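
For comparison, here is a hedged sketch of an alternative way to extract the year, using the `.dt.year` accessor on toy data. The sample dates are made up, and this is not the approach the notebook itself takes; it yields floats with NaN rather than strings.

```python
import numpy as np
import pandas as pd

# NaN is never equal to anything, itself included, so this comparison is always True
print(np.nan != np.nan)

release_date = pd.Series(['1995-10-30', 'not a date', None])
parsed = pd.to_datetime(release_date, errors='coerce')

# .dt.year gives the year directly and leaves unparseable dates as NaN
year = parsed.dt.year
print(year.tolist())
```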

In [1009]:
df['adult'].value_counts()

False                                                                                                                             45454
True                                                                                                                                  9
 - Written by Ørnås                                                                                                                   1
 Rune Balot goes to a casino connected to the October corporation to try to wrap up her case once and for all.                        1
 Avalanche Sharks tells the story of a bikini contest that turns into a horrifying affair when it is hit by a shark avalanche.        1
Name: adult, dtype: int64

Since there are only a negligible number of adult movies in the dataset, the 'adult' feature does not hold much significance for the analysis and can be safely removed from the dataset.

In [1010]:
df = df.drop(['adult'],axis=1)  # assign the result; drop() returns a new DataFrame
df

Unnamed: 0,belongs_to_collection,budget,genres,homepage,id,original_language,overview,popularity,poster_path,production_companies,...,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,return,year
0,"{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg',...",30000000.0,"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]",http://toystory.disney.com/toy-story,862,en,"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear o...",21.946943,/rhIRbceoE9lR4veEXuwCC2wARtG.jpg,"[{'name': 'Pixar Animation Studios', 'id': 3}]",...,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,12.451801,1995
1,,65000000.0,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 10751, 'name': 'Family'}]",,8844,en,When siblings Judy and Peter discover an enchanted board game that opens the door to a magical w...,17.015539,/vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg,"[{'name': 'TriStar Pictures', 'id': 559}, {'name': 'Teitler Film', 'id': 2550}, {'name': 'Inters...",...,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'fr', 'name': 'Français'}]",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,4.043035,1995
2,"{'id': 119050, 'name': 'Grumpy Old Men Collection', 'poster_path': '/nLvUdqgPgm3F85NMCii9gVFUcet...",,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, 'name': 'Comedy'}]",,15602,en,A family wedding reignites the ancient feud between next-door neighbors and fishing buddies John...,11.7129,/6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg,"[{'name': 'Warner Bros.', 'id': 6194}, {'name': 'Lancaster Gate', 'id': 19464}]",...,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for Love.,Grumpier Old Men,False,6.5,92.0,,1995
3,,16000000.0,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}, {'id': 10749, 'name': 'Romance'}]",,31357,en,"Cheated on, mistreated and stepped on, the women are holding their breath, waiting for the elusi...",3.859495,/16XOMpEaLWkrcPqSQqhTmeJuqQl.jpg,"[{'name': 'Twentieth Century Fox Film Corporation', 'id': 306}]",...,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself... and never let you forget it.,Waiting to Exhale,False,6.1,34.0,5.090760,1995
4,"{'id': 96871, 'name': 'Father of the Bride Collection', 'poster_path': '/nts4iOmNnq7GNicycMJ9pSA...",,"[{'id': 35, 'name': 'Comedy'}]",,11862,en,"Just when George Banks has recovered from his daughter's wedding, he receives the news that she'...",8.387519,/e64sOI48hQXyru7naBFyssKFxVd.jpg,"[{'name': 'Sandollar Productions', 'id': 5842}, {'name': 'Touchstone Pictures', 'id': 9195}]",...,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's In For The Surprise Of His Life!,Father of the Bride Part II,False,5.7,173.0,,1995
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,,,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'name': 'Family'}]",http://www.imdb.com/title/tt6209470/,439050,fa,Rising and falling between a man and woman.,0.072051,/jldsYflnId4tTWPx8es3uzsB1I8.jpg,[],...,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0,,NaT
45462,,,"[{'id': 18, 'name': 'Drama'}]",,111109,tl,An artist struggles to finish his work while a storyline about a cult plays in his head.,0.178241,/xZkmxsNmYXJbKVsTRLLx3pqGHx7.jpg,"[{'name': 'Sine Olivia', 'id': 19653}]",...,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0,,2011
45463,,,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'name': 'Drama'}, {'id': 53, 'name': 'Thriller'}]",,67758,en,"When one of her hits goes wrong, a professional assassin ends up with a suitcase full of a milli...",0.903007,/d5bX92nDsISNhu3ZT69uHwmfCGw.jpg,"[{'name': 'American World Pictures', 'id': 6165}]",...,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0,,2003
45464,,,[],,227506,en,"In a small town live two brothers, one a minister and the other one a hunchback painter of the c...",0.003503,/aorBPO7ak8e8iJKT5OcqYxU3jlK.jpg,"[{'name': 'Yermoliev', 'id': 88753}]",...,87.0,[],Released,,Satan Triumphant,False,0.0,0.0,,1917


In [1011]:
url = 'http://image.tmdb.org/t/p/w185/'
df['poster_path'] = df['poster_path'].apply(lambda x: f"<img src='{url}{x}' style='height:100px;'>")


# Exploratory Data Analysis

### Movies Production Countries

In [1012]:
df['production_countries'] = df['production_countries'].fillna('[]').apply(ast.literal_eval)
df['production_countries'] = df['production_countries'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
a = df.apply(lambda x: pd.Series(x['production_countries']),axis=1).stack().reset_index(level=1, drop=True)
a.name = 'countries'
#These two lines of code first convert the 'production_countries' column from string representation
#  to a list of dictionaries using the ast.literal_eval() method. Then, it extracts the names of the countries
#  from the list of dictionaries using a lambda function and stores them in a new 'production_countries' column. 
# This process allows us to work with the 'production_countries' feature as a list of countries.
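
The parsing step can be tried on its own with a few hand-written cells. The sketch below uses only the standard library; the sample strings are invented in the same shape as the 'production_countries' column, and `country_names` is a hypothetical helper.

```python
import ast
from collections import Counter

# Made-up cells in the same format as 'production_countries'
raw_cells = [
    "[{'iso_3166_1': 'US', 'name': 'United States of America'}]",
    "[{'iso_3166_1': 'FR', 'name': 'France'}, {'iso_3166_1': 'US', 'name': 'United States of America'}]",
    "[]",
]

def country_names(cell):
    """Parse one stringified list of dicts and return the country names."""
    parsed = ast.literal_eval(cell)
    return [entry['name'] for entry in parsed] if isinstance(parsed, list) else []

# Count how many movies list each country
counts = Counter(name for cell in raw_cells for name in country_names(cell))
print(counts.most_common())
```

A movie that lists several countries contributes one count to each of them, which is also how the stacked Series in the notebook behaves.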





In [1013]:
new_df = df.drop('production_countries', axis=1).join(a)
new_df = pd.DataFrame(new_df['countries'].value_counts())
new_df['country'] = new_df.index
new_df.columns = ['num_movies', 'country']
new_df = new_df.reset_index(drop=True)
new_df.head(10)

Unnamed: 0,num_movies,country
0,21153,United States of America
1,4094,United Kingdom
2,3940,France
3,2254,Germany
4,2169,Italy
5,1765,Canada
6,1648,Japan
7,964,Spain
8,912,Russia
9,828,India


In [1014]:
new_df = new_df[new_df['country'] != 'United States of America']

In [1015]:
data = [dict(
        type = 'choropleth',
        locations = new_df['country'],
        locationmode ='country names',
        z = new_df['num_movies'],
        text = new_df['country'],
        colorscale = 'Reds',
        autocolorscale = False,
        reversescale = False,
        marker = dict(
            line = dict (
                color = 'rgb(255, 255, 255)',
                width = 0.5
            ) ),
        colorbar = dict(
            autotick = False,
            tickprefix = '',
            title = 'Countries Production '),
      ) ]
layout = dict(
    title = 'Take a look at the countries where the movies in the MovieLens dataset were produced',
    geo = dict(
        showframe = False,
        showcoastlines = False,
        projection = dict(
            type = 'Mercator'
        )
    )
)

fig = dict( data=data, layout=layout )
py.iplot( fig, validate=False, filename='d3-world-map' )

It is not surprising to find out that the United States is the most popular destination for movie production in the dataset, considering the majority of the movies are in English. Among the top 5 countries, Europe stands out as a highly popular location with the UK, France, Germany, and Italy. Additionally, Japan and India are the two most favored Asian countries for movie production.

# Movies Production Companies

In [1016]:
df['production_companies'] = df['production_companies'].fillna('[]').apply(ast.literal_eval)
df['production_companies'] = df['production_companies'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [1017]:
b = df.apply(lambda x: pd.Series(x['production_companies']),axis=1).stack().reset_index(level=1, drop=True)
b.name = 'companies'
com_df = df.drop('production_companies', axis=1).join(b)

In [None]:
new_df = df.drop('production_companies', axis=1).join(b)
new_df = pd.DataFrame(new_df['companies'].value_counts())
new_df['company'] = new_df.index
# Two columns exist at this point (the count and the name), so two labels are needed
new_df.columns = ['num_movies', 'company']
new_df = new_df.reset_index(drop=True)
new_df.head(20)

In [None]:
com_sum = pd.DataFrame(com_df.groupby('companies')['revenue'].sum().sort_values(ascending=False))
com_sum.columns = ['Total']
com_mean = pd.DataFrame(com_df.groupby('companies')['revenue'].mean().sort_values(ascending=False))
com_mean.columns = ['Average']
com_count = pd.DataFrame(com_df.groupby('companies')['revenue'].count().sort_values(ascending=False))
com_count.columns = ['Number']
com_pivot = pd.concat((com_sum, com_mean, com_count), axis=1)
com_pivot.sort_values('Total', ascending=False).head(10)

# Highest revenue

The production company that has earned the highest revenue of all time is Warner Bros, with an impressive total of 63.5 billion dollars from nearly 500 movies. Following closely behind are Universal Pictures and Paramount Pictures, with revenues of 55 billion dollars and 48 billion dollars respectively, making them the second and third highest earning production companies.

In [None]:
com_pivot[com_pivot['Number'] >= 10].sort_values('Average', ascending=False).head(10)

When ranking production companies by average gross, only companies that have produced at least 10 movies with revenue data are considered, so that a single hit does not dominate the ranking.
By this measure, Pixar Animation Studios takes the top spot. Marvel Studios follows in second place with an average gross of 615 million dollars, thanks to blockbuster hits like Iron Man and The Avengers.
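
The effect of the minimum-count filter can be seen on a toy table. This is a sketch with invented numbers; the company names 'A' and 'B' and the `min_movies` threshold are placeholders, and the summary is built with a single `groupby(...).agg(...)` call rather than the three separate frames used above.

```python
import pandas as pd

# Invented revenues for two hypothetical companies
toy = pd.DataFrame({
    'companies': ['A', 'A', 'A', 'B'],
    'revenue':   [100.0, 300.0, None, 900.0],
})

# Total, mean, and number of movies with revenue data, per company
summary = toy.groupby('companies')['revenue'].agg(Total='sum', Average='mean', Number='count')

# Keep only companies with at least `min_movies` films that have revenue data
min_movies = 2
ranked = summary[summary['Number'] >= min_movies].sort_values('Average', ascending=False)
print(ranked)
```

Company B has the higher average, but it rests on a single film, so the filter removes it from the ranking.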

### Movie series are a huge part of the film industry, with many long-running and successful series. Let's explore the dataset to uncover some insights into the world of franchise movies, including the most successful and longest-running franchises.

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when the column is overwritten below
df_movie = df[df['belongs_to_collection'].notnull()].copy()
df_movie['belongs_to_collection'] = df_movie['belongs_to_collection'].apply(ast.literal_eval).apply(lambda x: x['name'] if isinstance(x, dict) else np.nan)
df_movie = df_movie[df_movie['belongs_to_collection'].notnull()]
# This creates a new DataFrame df_movie that only includes rows where the 'belongs_to_collection' column is not null
# (i.e., where the movie belongs to a collection/franchise). The string representation of a dictionary in the
# 'belongs_to_collection' column is converted to an actual dictionary using ast.literal_eval(). A lambda function then
# extracts the name of the franchise from the dictionary, so the column contains only the franchise name
# (or NaN if the value is not a dictionary).

In [None]:
movie_pivot = df_movie.pivot_table(index='belongs_to_collection', values='revenue', aggfunc={'revenue': ['mean', 'sum', 'count']}).reset_index()

# Successful Movie series:

In [None]:
movie_pivot.sort_values('sum', ascending=False).head(10)

When it comes to the most successful movie franchises, the Harry Potter series takes the top spot, having earned over 7.707 billion dollars from just 8 films. The Star Wars franchise isn't far behind, coming in a close second with earnings of 7.403 billion dollars from 8 movies. However, it's worth noting that the James Bond franchise has significantly more movies compared to the others on the list, resulting in a much lower average gross despite earning a respectable third place overall.

# Average Successful Movie series:

In [None]:
movie_pivot.sort_values('mean', ascending=False).head(10)

The Avatar Collection, currently comprising only one movie, has the highest average gross, with its single film earning nearly 3 billion dollars. Among franchises with at least five movies, however, the Harry Potter series remains the most successful.

# Duration:

In [None]:
movie_pivot.sort_values('count', ascending=False).head(10)

The James Bond franchise is the largest franchise in the dataset, with over 26 movies released under its banner. Friday the 13th and Pokemon follow in second and third place, with 12 and 11 movies respectively, but they are still quite far behind James Bond.

## Moving on to the budget, it is a known fact that it is often a skewed quantity and highly influenced by inflation. However, despite this, analyzing budget can provide valuable insights as it is a crucial factor in predicting the success and revenue of a movie. In order to begin the analysis, let's gather the summary statistics for budget.

In [None]:
df['budget'].describe()

These statistics show that the budget data is right-skewed, with a median budget of 8 million dollars and a mean budget of 21.6 million dollars. The standard deviation of 34.3 million indicates a wide range of budget values, with some movies having budgets as low as 1 dollar and as high as 380 million dollars.
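
A mean that sits well above the median is the usual quick sign of right skew. The sketch below shows the effect on a handful of invented budgets (in millions of dollars), using only the standard library.

```python
import statistics

# Invented budgets in millions of dollars: many small films, one very large one
budgets = [1, 2, 3, 5, 8, 12, 20, 250]

mean_budget = statistics.mean(budgets)
median_budget = statistics.median(budgets)

print(mean_budget)     # 37.625
print(median_budget)   # 6.5

# A single very expensive film pulls the mean far above the median
print(mean_budget > median_budget)   # True
```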

In [None]:
sns.set_theme()
sns.displot(df[df['budget'].notnull()]['budget'])

The plot suggests that most movies have budgets below $50 million, with a significant number of movies having budgets between $0 and $20 million. The plot also shows that the distribution is heavily skewed towards lower budget values, with only a few movies having budgets above $100 million.

In [None]:
df['budget'].plot(logy=True, kind='hist')

The logarithmic scale on the y-axis (`logy=True`) keeps sparsely populated bins visible even though the counts differ greatly in magnitude. The `kind='hist'` argument specifies that the plot should be a histogram, where the data is divided into intervals and the frequency of values within each interval is plotted.

# Movie ranking:

Analyzing the status of movies based on their release can provide insights into the nature of the movies in the dataset. It is interesting to determine the frequency of each status type.

In [None]:
df['status'].value_counts()

This code is counting the number of movies for each status of release in the dataset. It shows that the vast majority of movies (45014) have a status of "Released", while there are smaller numbers of movies in other categories such as "Rumored", "Post Production", "In Production", "Planned", and "Canceled".

# Total films per year

In [None]:
year_count = df.groupby('year')['title'].count()
plt.figure(figsize=(18,5))
year_count.plot()

# Analyze the distribution of movie releases over time.

In [None]:
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

# Define functions to extract month and day from release date
def get_month(x):
    try:
        return month_order[int(str(x).split('-')[1]) - 1]
    except (IndexError, ValueError):
        return np.nan

def get_day(x):
    try:
        # Distinct local names keep the day_order list from being shadowed
        y, m, d = (int(i) for i in str(x).split('-'))
        return day_order[datetime.date(y, m, d).weekday()]
    except (IndexError, ValueError):
        return np.nan

# Create day and month columns using the defined functions
df['day'] = df['release_date'].apply(get_day)
df['month'] = df['release_date'].apply(get_month)
plt.figure(figsize=(12,6))
plt.title("Number of Movies Released by Month")
sns.countplot(x='month', data=df, order=month_order)
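
As a cross-check on the helper functions, the same month and weekday labels can be derived with pandas' datetime accessors. This is a sketch on made-up dates; `month_abbr` and `day_abbr` are names chosen for this example only.

```python
import pandas as pd

# Two valid dates and one unparseable entry
dates = pd.to_datetime(pd.Series(['1995-10-30', '2009-12-10', 'unknown']), errors='coerce')

# First three letters of the English month and weekday names; NaT stays missing
month_abbr = dates.dt.month_name().str[:3]
day_abbr = dates.dt.day_name().str[:3]

print(month_abbr.tolist())
print(day_abbr.tolist())
```

The abbreviations match the labels used for the plot ordering ('Jan' to 'Dec', 'Mon' to 'Sun'), so either approach can feed the count plots.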

It seems that January is the most preferred month for releasing movies, although it's commonly referred to as the "dump month" in the movie industry. This is when studios tend to release movies that are not expected to perform well, in large quantities.

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x='day', data=df, order=day_order)
plt.title("Number of Movies Released on Each Day of the Week")
plt.xlabel('Day of the Week')
plt.ylabel('Number of Movies Released')


The data shows that Friday is the most popular day for movie releases, which is not surprising since it marks the start of the weekend when many people are looking for entertainment options. On the other hand, Sunday and Monday are the least popular days for movie releases, likely due to the fact that people are usually busy with work or school during those days and are less likely to go out for leisure activities.

In [None]:
# Filter for movies with revenue greater than 100 million
blockbuster_movies = df[df['revenue'] > 1e8]

# Compute the average revenue by month for blockbuster movies
month_mean = blockbuster_movies.groupby('month')['revenue'].mean().reset_index()

# Sort the dataframe by month order
month_mean = month_mean.set_index('month').loc[month_order].reset_index()

# Create a bar plot of average revenue by month for blockbuster movies
plt.figure(figsize=(12,6))
sns.barplot(x='month', y='revenue', data=month_mean)
plt.title("Average Gross by the Month for Blockbuster Movies")


It is interesting to note that the months of April, May, and June show the highest average gross among high-grossing movies. This can be explained by the tendency of studios to release their blockbuster films during the summer months, when audiences have more free time and are more willing to spend their money on entertainment. As a result, these months see a higher concentration of big-budget movies with high revenue numbers.

In [None]:
fig, ax = plt.subplots(figsize=(15, 8))
sns.boxplot(x='month', y='return', data=df[df['return'].notnull()], palette="muted", ax=ax, order=month_order)
ax.set_ylim([0, 12])
ax.set_title('Distribution of Return on Investment by Month')
ax.set_xlabel('Month')
ax.set_ylabel('Return on Investment')


By examining the plot, we can see that the median ROI for movies released in the months of May, June, and July appears to be higher than the median ROI for movies released in other months. This suggests that movies released during these summer months tend to have a better return on investment. However, it's important to note that there is significant overlap between the distributions for each month, and outliers can also have a significant impact on the results. Therefore, further analysis may be required to confirm these findings.

## Films that have earned the most revenue worldwide, ever:

In [None]:
df['revenue'].describe()

The movie with the lowest revenue earned only 1 dollar, while the highest-grossing movie of all time made a mind-boggling 2.78 billion dollars. On average, a movie earns 68.7 million dollars; however, the median gross is significantly lower at 16.8 million dollars, indicating that the revenue distribution is highly skewed.

In [None]:
plt.hist(df[df['revenue'].notnull()]['revenue'], bins=50)
plt.xlabel('Revenue')
plt.ylabel('Frequency')
plt.title('Distribution of Movie Revenue')
plt.show()

The histogram shows the distribution of revenue for movies in the dataset. The majority of the movies have a revenue less than $200 million, and very few movies have revenue above $1 billion. The distribution is heavily skewed to the right, indicating that most movies do not make a significant amount of revenue, while a few movies generate extremely high revenue.

In [None]:
top_profit = df[['poster_path', 'title', 'budget', 'revenue', 'year']].sort_values('revenue', ascending=False).head(10)
pd.set_option('display.max_colwidth', 100)
HTML(top_profit.to_html(escape=False))


# Successful Movies:


In [None]:
# Selecting movies that have a valid return on investment and a budget greater than 10 million (10e6)
selected_movies = df[(df['return'].notnull()) & (df['budget'] > 10e6)]

# Sorting the selected movies by return on investment in descending order and selecting the top 10
top_10_movies = selected_movies[['title', 'budget', 'revenue', 'return', 'year']].sort_values('return', ascending=False).head(10)
top_10_movies


## Big-Budget Blockbusters: 

In [None]:
# Select rows where the budget is not null
budget_df = df[df['budget'].notnull()]

# Sort the data by budget in descending order and select the top 10 entries
top_budget_movies = budget_df[['title', 'budget', 'revenue', 'return', 'year']].sort_values('budget', ascending=False).head(10)
top_budget_movies



The top two spots on this list of the most expensive movies of all time are held by the Pirates of the Caribbean franchise, with a budget of over 300 million dollars each. Interestingly, all of the top 10 movies managed to make a profit on their investment except for The Lone Ranger, which had a budget of 255 million dollars but earned back only about 35% of that amount, bringing in roughly 90 million dollars.

In [None]:
sns.jointplot(x='budget', y='revenue', data=df[df['return'].notnull()])


The plot shows the relationship between these two variables and also the distribution of the data points. The joint plot also displays the marginal distributions of budget and revenue separately on the top and right sides of the plot. This plot can help to identify any patterns or trends between the budget and revenue of movies, and can give insights into the potential profitability of movies based on their budget.

# Biggest Box Office Flops:

In [None]:
# Filter the dataframe to select movies with non-null return and budget greater than 5 million and revenue greater than 10,000
filtered_df = df[(df['return'].notnull()) & (df['budget'] > 5e6) & (df['revenue'] > 10000)]

# Select the columns of interest and sort by return in ascending order to get the top 10 worst box office disasters
disasters_df = filtered_df[['title', 'budget', 'revenue', 'return', 'year']].sort_values('return').head(10)
disasters_df


In [None]:
# Replace 'NaT' values with NaN in the 'year' column
df['year'] = df['year'].replace('NaT', np.nan)

# Apply the 'numeric' function to the 'year' column
def numeric(x):
    try:
        return int(x)
    except (TypeError, ValueError):
        return np.nan
    
df['year'] = df['year'].apply(numeric)


# Compute correlation matrix and create a mask to hide upper triangle
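# Restrict to numeric columns first; recent pandas versions raise an error on object columns
corr = df.select_dtypes(include='number').corr()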
mask = np.triu(np.ones_like(corr, dtype=bool))

# Plot heatmap with annotations
with sns.axes_style("white"):
    plt.figure(figsize=(9, 9))
    ax = sns.heatmap(corr, mask=mask, vmax=.3, square=True, annot=True, cmap='BuPu')
    ax.set_title("Correlation matrix of numeric features")
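
The masking step is easier to follow on a small matrix. The sketch below builds the same kind of mask for an invented 3x3 correlation matrix and shows which entries remain visible.

```python
import numpy as np

# Invented symmetric correlation matrix
corr = np.array([
    [1.0, 0.7, 0.2],
    [0.7, 1.0, 0.4],
    [0.2, 0.4, 1.0],
])

# True on and above the diagonal: these cells are hidden by the heatmap
mask = np.triu(np.ones_like(corr, dtype=bool))
print(mask)

# The cells that stay visible are the strictly lower triangle
visible = corr[~mask]
print(visible.tolist())   # [0.7, 0.2, 0.4]
```

Because the matrix is symmetric, hiding the upper triangle and the diagonal removes only redundant values.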


# Spoken Languages

In [None]:
# Convert string representation of list to actual list and count its length
df['spoken_languages'] = df['spoken_languages'].apply(lambda x: len(ast.literal_eval(x)) if isinstance(x, str) else np.nan)
spoken_lang_counts = df['spoken_languages'].value_counts()
spoken_lang_counts


The majority of movies, according to the dataset, feature only one spoken language throughout the movie. However, there is at least one movie that stands out with a record-breaking 19 spoken languages in a single film.

In [None]:
df[df['spoken_languages'] >= 5][['title', 'year', 'spoken_languages']].sort_values('spoken_languages', ascending=False)

In [None]:
# Create a scatter plot to visualize the relationship between the number of spoken languages and the return on investment
plt.figure(figsize=(8, 6))
sns.scatterplot(x="spoken_languages", y="return", data=df, color="m", alpha=0.7)

# Add a regression line and correlation coefficient to the plot
sns.regplot(x="spoken_languages", y="return", data=df, scatter=False, color="k")
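valid = df[['spoken_languages', 'return']].dropna()  # spearmanr returns NaN if any input is NaN
corr_coef, p_value = stats.spearmanr(valid['spoken_languages'], valid['return'])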
plt.text(0.1, 0.9, f"Spearman's correlation: {corr_coef:.2f}, p-value: {p_value:.2e}", 
         transform=plt.gca().transAxes, fontsize=12)

# Add labels and a title to the plot
plt.xlabel("Number of spoken languages")
plt.ylabel("Return on investment")
plt.title("Relationship between Number of Spoken Languages and Return on Investment")


<!-- The correlation coefficient (Spearman's rho) between the number of spoken languages and the return on investment is a measure of how strong and in what direction the relationship is between these two variables. The correlation coefficient is a number between -1 and 1, where -1 indicates a perfect negative correlation (as one variable increases, the other decreases), 0 indicates no correlation, and 1 indicates a perfect positive correlation (as one variable increases, the other increases as well). -->

<!-- The p-value is a measure of the strength of evidence against the null hypothesis (that there is no correlation between the two variables) and in favor of the alternative hypothesis (that there is a correlation). A low p-value indicates strong evidence against the null hypothesis and in favor of the alternative hypothesis. -->

In this case, Spearman's correlation coefficient between the number of spoken languages and the return on investment is negative and statistically significant (-0.12, p < 0.05): as the number of spoken languages increases, the return on investment tends to decrease. The correlation is weak, however, so the number of languages explains little of the variation in return, and other factors clearly influence the return on investment of a movie.
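To make the statistic less of a black box: Spearman's rho is the Pearson correlation computed on the ranks of the two variables. The sketch below implements that idea with the standard library only, on made-up numbers. The helpers `average_ranks`, `pearson`, and `spearman` are hypothetical names for illustration. This is not the SciPy implementation, and it does not compute a p-value.

```python
import math

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def spearman(x, y):
    """Spearman's rho: drop pairs containing NaN, then correlate the ranks."""
    pairs = [(a, b) for a, b in zip(x, y) if not (math.isnan(a) or math.isnan(b))]
    xs, ys = zip(*pairs)
    return pearson(average_ranks(xs), average_ranks(ys))

# Made-up data: number of spoken languages and return on investment.
languages = [1, 1, 2, 3, 5, float("nan")]
returns = [4.0, 2.5, 3.0, 1.2, 0.8, 9.9]

rho = spearman(languages, returns)
print(f"rho = {rho:.2f}")  # rho = -0.82
```

Because only the ranks matter, a few extreme return values cannot dominate the result the way they would with a plain Pearson correlation.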

# Original Language

In [None]:
df['original_language'].drop_duplicates().shape[0]

In [None]:
language_df = pd.DataFrame(df['original_language'].value_counts())
language_df['language'] = language_df.index
language_df.columns = ['number', 'language']
language_df.head(10)

The dataset includes movies in more than 93 different languages. As anticipated, the majority of movies are in English, while French and Italian films come in second and third place respectively, although significantly behind. To better visualize the popularity of languages other than English, let's create a bar plot.

In [None]:
plt.figure(figsize=(12,5))
sns.barplot(x='language', y='number', data=language_df.iloc[1:30])
# iloc[1:30] skips the first row (English) and keeps the next 29 most frequent languages.
plt.show()

After English, French and Italian are the two most frequently occurring languages in the dataset. However, when it comes to Asian languages, Japanese and Hindi are the most commonly represented.

## This section will explore the popularity, vote average, and vote count metrics provided by TMDB users. The aim is to gain a better understanding of these features and identify any relationships with other numerical features such as budget and revenue.

In [None]:
def convert_numeric(x):
    if isinstance(x, (int, float)):
        return x
    try:
        return float(x)
    except (ValueError, TypeError):
        return np.nan

df['popularity'] = df['popularity'].apply(convert_numeric).astype('float')
df['vote_count'] = df['vote_count'].apply(convert_numeric).astype('float')
df['vote_average'] = df['vote_average'].apply(convert_numeric).astype('float')


In [None]:
df['popularity'].describe()

In [None]:
sns.histplot(df['popularity'].fillna(df['popularity'].median()))
plt.title("Distribution of Movie Popularity Scores")
plt.xlabel("Popularity Score")
plt.ylabel("Frequency")
plt.show()


In [None]:
df['popularity'].plot(logy=True, kind='hist')

Since the popularity values are skewed towards the lower end, a logarithmic y-axis makes it easier to visualize the distribution and to identify any patterns or outliers.
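A small sketch of why the logarithmic axis helps, using made-up scores rather than the real popularity column. The values, the bin width, and the number of bins are all assumptions chosen for illustration. The counts per bin span several orders of magnitude, so on a linear axis every bar except the first would be almost invisible.

```python
import math

# Made-up, right-skewed scores: almost everything is small, a few values are large.
scores = [5.0] * 9000 + [150.0] * 90 + [250.0] * 9 + [550.0]

bin_width = 100.0
n_bins = 6
counts = [0] * n_bins
for s in scores:
    counts[int(s // bin_width)] += 1

tallest = max(counts)
for i, c in enumerate(counts):
    lo, hi = i * bin_width, (i + 1) * bin_width
    share = c / tallest                      # bar height relative to the tallest bar
    log_count = f"{math.log10(c):.2f}" if c > 0 else "-"
    print(f"[{lo:5.0f}, {hi:5.0f})  count={c:5d}  linear share={share:.4f}  log10={log_count}")
```

On the linear scale the second bin reaches only 1% of the tallest bar, while on the log scale the bins are spaced by their order of magnitude, which keeps the sparse tail readable.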

In [None]:
df[['title', 'popularity', 'year']].sort_values('popularity', ascending=False).head(10)

According to the TMDB Popularity Score, Minions is the most popular movie. Wonder Woman and Beauty and the Beast, both of which are highly successful movies with female leads, come in second and third place, respectively.

In [None]:
df['vote_count'].describe()

In [None]:
df[['title', 'vote_count', 'year']].sort_values('vote_count', ascending=False).head(10)

The top three movies on the list are Inception, The Dark Knight, and Avatar, each with over 12,000 votes. The list includes popular movies from different genres and time periods.

In [None]:
df['vote_average'] = df['vote_average'].replace(0, np.nan)
df['vote_average'].describe()

In [None]:
plt.figure(figsize=(8,6))
sns.histplot(data=df['vote_average'].fillna(df['vote_average'].median()), kde=True)
plt.title('Distribution of Vote Average Scores')
plt.xlabel('Vote Average')
plt.ylabel('Frequency')
plt.show()


The plot suggests that the vote average scores are roughly normally distributed. The tallest bars lie between 5.5 and 6.5, and the bulk of the movies score somewhere between 5.5 and 7.5 out of 10. Very few movies have a vote average below 3 or above 9. Note that missing values were filled with the median before plotting, which adds extra mass to the bin containing the median and makes the peak look sharper than it is in the observed ratings.
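The histogram is drawn after `fillna(...median())`, so it is worth checking how much the filled values contribute to the peak. The sketch below uses made-up ratings (not the real column) to show the effect. `None` stands in for a missing vote average.

```python
import statistics

# Made-up ratings; None marks a movie with no usable vote average.
ratings = [4.0, 5.5, 6.0, 6.0, 6.5, 7.0, 8.0, None, None, None, None]

observed = [r for r in ratings if r is not None]
median = statistics.median(observed)
filled = [median if r is None else r for r in ratings]

# Share of values sitting exactly at the median, before and after filling.
share_before = observed.count(median) / len(observed)
share_after = filled.count(median) / len(filled)

print(f"median = {median}")
print(f"share at median before filling: {share_before:.2f}")
print(f"share at median after filling:  {share_after:.2f}")
```

If the share rises noticeably, as it does here, plotting `df['vote_average'].dropna()` alongside the filled version is a simple way to see the shape of the observed ratings on their own.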

# Top-rated movies based on user ratings

In [None]:
df[df['vote_count'] > 2000][['title', 'vote_average', 'vote_count' ,'year']].sort_values('vote_average', ascending=False).head(10)

The table shows the 10 highest-rated movies by vote average among those with more than 2,000 votes from TMDB users. The list includes classic movies like The Shawshank Redemption, The Godfather, and Psycho, as well as more recent films like Spirited Away, The Dark Knight, and Fight Club. These movies have received high ratings from a large number of users, indicating their enduring popularity and appeal.
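The vote-count threshold matters because a vote average based on a handful of votes is not comparable to one based on thousands. Here is a minimal sketch with made-up records. The titles, numbers, and the helper `top_rated` are hypothetical. It applies the same filter-then-sort logic as the cell above.

```python
# Made-up records for illustration only.
movies = [
    {"title": "Film A", "vote_average": 10.0, "vote_count": 1},
    {"title": "Film B", "vote_average": 8.5, "vote_count": 8000},
    {"title": "Film C", "vote_average": 8.3, "vote_count": 12000},
    {"title": "Film D", "vote_average": 9.1, "vote_count": 40},
]

def top_rated(records, min_votes):
    """Keep records with more than min_votes votes, best vote average first."""
    eligible = [m for m in records if m["vote_count"] > min_votes]
    return sorted(eligible, key=lambda m: m["vote_average"], reverse=True)

unfiltered = [m["title"] for m in top_rated(movies, min_votes=0)]
filtered = [m["title"] for m in top_rated(movies, min_votes=2000)]

print(unfiltered)  # ['Film A', 'Film D', 'Film B', 'Film C']
print(filtered)    # ['Film B', 'Film C']
```

Without the threshold, a film with a single perfect vote tops the list. With it, only widely rated films compete.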

In [None]:
sns.jointplot(x='vote_average', y='popularity', data=df)

The jointplot shows a moderate positive correlation between vote_average and popularity. This means that movies with a higher vote_average tend to be more popular among the users of TMDB.

In [None]:
sns.jointplot(x='vote_average', y='vote_count', data=df)

The plot shows that there is a positive correlation between vote_average and vote_count, indicating that movies with higher average ratings tend to receive more votes. This makes sense as popular movies are often talked about and discussed more, leading to a larger number of people rating the movie.

The analysis aims to explore if certain words appear more frequently in movie titles and blurbs, indicating their perceived potency and worthiness for a title.

In [None]:
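# Replace missing values with empty strings so the literal word "nan" does not enter the text
df['title'] = df['title'].fillna('').astype('str')
df['overview'] = df['overview'].fillna('').astype('str')

# Join with a space so the last word of one entry does not fuse with the first word of the next
title_body = ' '.join(df['title'])
overview_body = ' '.join(df['overview'])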

In [None]:
title_wordcloud = WordCloud(stopwords= STOPWORDS,background_color="white",height=2000,width=4500).generate(title_body)
plt.figure(figsize=(16,10))
plt.imshow(title_wordcloud)
plt.axis('off')
plt.show()

The analysis reveals that "Man" is the most frequently used word in movie titles. It is followed by "Love," "Day," and "Last".
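A word cloud sizes each word by how often it occurs, so the ranking can be cross-checked with a plain frequency count. The sketch below does this for a few made-up titles with a tiny, hypothetical stopword set (not the `STOPWORDS` set from the `wordcloud` package). It also shows that the separator used when joining the titles matters: with an empty separator, the last word of one title fuses with the first word of the next.

```python
import re
from collections import Counter

# Made-up titles and a tiny hypothetical stopword set, for illustration only.
titles = ["The Last Man", "Man on Fire", "Love Actually", "The Last Day", "A Man Apart"]
stopwords = {"the", "a", "on"}

def word_counts(text):
    """Lowercase the text, split it into words, and count the non-stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

glued_counts = word_counts("".join(titles))    # titles run together
spaced_counts = word_counts(" ".join(titles))  # titles separated by a space

print(spaced_counts.most_common(2))  # [('man', 3), ('last', 2)]
print(glued_counts["man"], glued_counts["manman"])  # 1 1
```

With the space separator "man" is counted three times. With the empty separator two of those occurrences are lost to fused tokens such as "manman".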

In [None]:
overview_wordcloud = WordCloud(stopwords= STOPWORDS,background_color="white",height=2000,width=4500).generate(overview_body)
plt.figure(figsize=(16,10))
plt.imshow(overview_wordcloud)
plt.axis('off')
plt.show()

In the overviews, "Love" is the most commonly used word, followed by "Life", "Find", and "One". Together with the title word cloud, this suggests that the themes of life, love, and relationships are among the most popular in the world of movies.