## ***Reel Insights: A Comprehensive Analysis of Movie Recommendation Systems - EDA***

**CAS Applied Data Science Final Project**

**By: Avisek Regmi, Avenue de la Foretaille 27b, Chambesy, 1292 CH**

![](https://camo.githubusercontent.com/68abad8a66113eb3c56dd584fa9b0b1fe4aab28200b3dfc61d3b00d40dba440c/68747470733a2f2f7261772e67697468756275736572636f6e74656e742e636f6d2f6a616c616a7468616e616b692f4d6f7669655f7265636f6d6d656e646174696f6e5f656e67696e652f6d61737465722f696d672f325f332e6a7067)

In [None]:
from google.colab import drive
drive.mount('/content/drive')






![](https://media.giphy.com/media/3ohhwDMC187JqL69DG/giphy.gif)

The advent of the motion picture camera in the late 19th century heralded the dawn of what would become one of the most captivating forms of entertainment: cinema. From the mesmerizing one-second clips of galloping horses in the 1890s to the groundbreaking addition of sound in the 1920s, the vibrant burst of color in the 1930s, and the immersive experience of mainstream 3D movies in the early 2010s, films have consistently fascinated audiences worldwide.

Cinema's beginnings were modest, with simplistic plots, rudimentary direction, and basic acting, largely due to the brevity of early films. However, the industry has since evolved dramatically, nurtured by the talents of visionary directors, ingenious screenwriters, charismatic actors, innovative sound designers, and skilled cinematographers. This evolution has given rise to a myriad of genres, spanning from romance and comedy to science fiction and horror.

Like many children of the past century, I was spellbound by movies. This passion turned into a deep-seated desire to unravel the mysteries of the cinematic world. In this notebook, I aim to explore these mysteries through data analysis. We have a dataset comprising approximately 45,000 movies, with metadata sourced from TMDB. This data will help us delve into various questions I've long pondered about the film industry.

Moreover, this notebook will guide us through the development of:

* A **Regressor** capable of predicting a movie's revenue to a reasonable degree.
* A **Classifier** that can determine whether a movie will be a financial hit or a loss for its producers.




## **Importing Libraries and Loading Data**

In [None]:
! pip install plotly


In [None]:
# Installing chart-studio
try:
    import chart_studio.plotly as py
except ModuleNotFoundError:
    !pip install chart-studio
    import chart_studio.plotly as py

# Import statements
%matplotlib inline
import ast
import datetime
import json

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import chart_studio.plotly as py  # `py` is the chart_studio plotting interface
from IPython.display import Image, HTML
from scipy import stats
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud, STOPWORDS
from xgboost import XGBClassifier, XGBRegressor

# Plotly credentials setup
import chart_studio.tools as tls
tls.set_credentials_file(username='rounakbanik', api_key='xTLaHBy9MVv5szF4Pwan')

# Seaborn settings
sns.set_style('whitegrid')
sns.set(font_scale=1.25)

# Pandas settings
pd.set_option('display.max_colwidth', 50)


In [None]:
# Below is the path with the actual CSV file
file_path = '/content/drive/MyDrive/CAS DS Final Project - Movie Recommendation System - Avisek Regmi/movies_metadata.csv'

df = pd.read_csv(file_path)
df.head().transpose()


## **Understanding the Dataset**


The dataset I am working with has been sourced via the TMDB API and aligns with the **MovieLens Latest Full Dataset**, which includes 26 million ratings on 45,000 films from 270,000 users. Now, let's delve into the various features available in this extensive dataset.



In [None]:
df.columns

### **Features**

* **adult:** Indicates if the movie is X-Rated or Adult.
* **belongs_to_collection:** A stringified dictionary that gives information on the movie series the particular film belongs to.
* **budget:** The budget of the movie in dollars.
* **genres:** A stringified list of dictionaries that list out all the genres associated with the movie.
* **homepage:** The Official Homepage of the movie.
* **id:** The ID of the movie.
* **imdb_id:** The IMDB ID of the movie.
* **original_language:** The language in which the movie was originally shot.
* **original_title:** The original title of the movie.
* **overview:** A brief blurb of the movie.
* **popularity:** The Popularity Score assigned by TMDB.
* **poster_path:** The URL of the poster image.
* **production_companies:** A stringified list of production companies involved with the making of the movie.
* **production_countries:** A stringified list of countries where the movie was shot/produced in.
* **release_date:** Theatrical Release Date of the movie.
* **revenue:** The total revenue of the movie in dollars.
* **runtime:** The runtime of the movie in minutes.
* **spoken_languages:** A stringified list of spoken languages in the film.
* **status:** The status of the movie (Released, To Be Released, Announced, etc.)
* **tagline:** The tagline of the movie.
* **title:** The Official Title of the movie.
* **video:** Indicates if there is a video present of the movie with TMDB.
* **vote_average:** The average rating of the movie.
* **vote_count:** The number of votes by users, as counted by TMDB.

In [None]:
df.shape

In [None]:
df.info()

The dataset encompasses **45,466 movies**, each described by **24 features**. Most of these features are largely complete, with minimal NaN values, except for the **homepage** and **tagline** fields. In the next section, we will focus on refining this dataset, transforming it into a format that's ready for detailed analysis.




## **Data Wrangling**

The initial data was sourced as a JSON file, which we manually converted into a CSV format to facilitate easy loading into a Pandas DataFrame. Consequently, the dataset we now possess is already in a relatively clean state. Nevertheless, we will further investigate our features and apply necessary data wrangling techniques to refine the dataset for optimal analysis.

To begin, let's eliminate the features that are not pertinent to our analysis.



In [None]:
df = df.drop(['imdb_id'], axis=1)

In [None]:
df[df['original_title'] != df['title']][['title', 'original_title']].head()

In this analysis, I'll opt for the translated or Anglicized titles over the original titles, which represent the movie's name in its native language. Consequently, we'll exclude the original titles from our dataset. By leveraging the **original_language** feature, we can still discern whether a movie is in a foreign language, ensuring no substantive information is sacrificed in this process.



In [None]:
df = df.drop('original_title', axis=1)

In [None]:
df[df['revenue'] == 0].shape

A notable observation is that the vast majority of movies in our dataset have a recorded revenue of **0**, implying a lack of information regarding their total earnings. Despite this prevalence among the available movies, we'll continue to regard revenue as a pivotal feature as we proceed, particularly focusing on the subset of approximately 7,000 movies with recorded revenue data.

In [None]:
df['revenue'] = df['revenue'].replace(0, np.nan)

Upon closer examination, it becomes evident that the **budget** feature contains some irregular entries, which causes Pandas to read it as a generic object column. To rectify this, we'll convert it to a numeric variable, replacing all non-numeric values with NaN (Not a Number). As with **revenue**, we'll also replace all zeros with NaN to denote the absence of budget information.

In [None]:
df['budget'] = pd.to_numeric(df['budget'], errors='coerce')
df['budget'] = df['budget'].replace(0, np.nan)
df[df['budget'].isnull()].shape
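
To make the coercion step concrete, here is a minimal sketch on a made-up series. The values below are illustrative assumptions, not rows taken from the dataset; the point is only to show how `errors='coerce'` and the zero replacement interact.

```python
import numpy as np
import pandas as pd

# Toy budget column (hypothetical values): numeric strings, a zero, and a stray non-numeric entry.
raw_budget = pd.Series(['30000000', '0', '/poster.jpg', '65000000'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN.
clean_budget = pd.to_numeric(raw_budget, errors='coerce')

# A budget of 0 means "unknown" here, so it is treated as missing as well.
clean_budget = clean_budget.replace(0, np.nan)

n_missing = int(clean_budget.isnull().sum())
print(clean_budget.tolist())
print(n_missing)
```

Both the unparseable entry and the zero end up as NaN, so later ratios are never computed from placeholder values.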

As we progress in addressing specific questions, it is essential to develop several features tailored to each query. At this point, I will introduce two crucial features:

* **Year:** The year the movie was released.
* **Return:** The ratio of revenue to budget.

The return feature is particularly insightful as it provides a clearer understanding of a movie's financial success. For example, without this feature, our data cannot accurately compare the performance of a USD 200 million budget movie that earned USD 100 million to a USD 50,000 budget movie that grossed USD 200,000. The return ratio allows us to capture this critical information.

A return value greater than 1 indicates a profit, while a return value less than 1 signifies a loss.
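
The two movies from the example above can be compared with a few lines of arithmetic. This is only a worked illustration of the ratio; the variable names are made up for the sketch.

```python
# Figures from the example above, in USD.
big_budget, big_revenue = 200_000_000, 100_000_000
small_budget, small_revenue = 50_000, 200_000

# Return = revenue / budget.
big_return = big_revenue / big_budget        # below 1: the movie lost money
small_return = small_revenue / small_budget  # above 1: the movie made a profit

print(big_return, small_return)
```

Although the first movie earned far more in absolute terms, its return is lower, which is exactly the information the raw revenue column hides.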



In [None]:
df['return'] = df['revenue'] / df['budget']
df[df['return'].isnull()].shape

Return values are available for nearly **5,000 movies**, or about **10% of the entire dataset**; the remaining rows lack a usable revenue or budget figure. While this proportion may appear modest, it is enough to support meaningful analysis.


In [None]:
# Extract the release year as a string; dates that cannot be parsed become NaN
df['year'] = pd.to_datetime(df['release_date'], errors='coerce').apply(lambda x: str(x.year) if pd.notnull(x) else np.nan)

In [None]:
df['adult'].value_counts()


Given that there are virtually no adult movies in this dataset, the **adult** feature holds little relevance and can be safely omitted.



In [None]:
df = df.drop('adult', axis=1)

In [None]:
base_poster_url = 'http://image.tmdb.org/t/p/w185/'
df['poster_path'] = "<img src='" + base_poster_url + df['poster_path'] + "' style='height:100px;'>"

## **Exploratory Data Analysis**
### **Title and Overview Wordclouds**

Do certain words appear more frequently in movie titles and blurbs? I believe there are specific terms deemed more impactful and worthy of a title. Let's investigate and uncover the truth!

In [None]:
df['title'] = df['title'].astype('str')
df['overview'] = df['overview'].astype('str')

In [None]:
title_corpus = ' '.join(df['title'])
overview_corpus = ' '.join(df['overview'])

In [None]:
title_wordcloud = WordCloud(stopwords=STOPWORDS, background_color='lightyellow', height=2000, width=4000).generate(title_corpus)
plt.figure(figsize=(16,8))
plt.imshow(title_wordcloud)
plt.axis('off')
plt.show()


The word **Love** is the most frequently used term in movie titles, followed closely by **Girl**, **Day**, and **Man**. This prevalence underscores the pervasive theme of romance in films.



In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

overview_wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', height=2000, width=4000, colormap='ocean').generate(overview_corpus)
plt.figure(figsize=(16, 8))
plt.imshow(overview_wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()



The word **Life** appears most frequently in movie overviews, while **One** and **Find** are also prevalent in these blurbs. Alongside **Love**, **Man** and **Girl**, these terms in our word clouds provide a clear picture of the most common themes in films.


### **Production Countries**

The Full MovieLens Dataset predominantly features movies in the English language, with over 31,000 titles. However, these films are often shot in diverse locations worldwide. It would be fascinating to explore which countries are the top filming destinations, particularly for filmmakers from the United States and the United Kingdom.



In [None]:
import pandas as pd
import ast

def safe_literal_eval(x):
    try:
        return ast.literal_eval(x)
    except (SyntaxError, ValueError):
        return None  # or any default value you prefer



In [None]:
# Ensure only valid strings are passed to ast.literal_eval
df['production_countries'] = df['production_countries'].fillna('[]').apply(safe_literal_eval)

# Apply lambda function to extract country names
df['production_countries'] = df['production_countries'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [None]:
# Transform the DataFrame as required
s = df.apply(lambda x: pd.Series(x['production_countries']), axis=1).stack().reset_index(level=1, drop=True)
s.name = 'countries'

In [None]:
con_df = df.drop('production_countries', axis=1).join(s)
con_df = pd.DataFrame(con_df['countries'].value_counts())
con_df['country'] = con_df.index
con_df.columns = ['num_movies', 'country']

# Reset index without attempting to drop a non-existent column
con_df = con_df.reset_index(drop=True)
con_df.head(10)
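
The `apply(pd.Series).stack()` pattern used above can be slow on tens of thousands of rows. As a possible alternative, assuming pandas 0.25 or newer, `Series.explode` produces the same one-row-per-country layout. The frame below is a hypothetical stand-in for `df`, used only to show the idea.

```python
import pandas as pd

# Toy frame (hypothetical rows) with one list of country names per movie.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'production_countries': [['France', 'Germany'], ['France'], []],
})

# explode() emits one row per list element; an empty list becomes NaN,
# which value_counts() skips by default.
country_counts = toy['production_countries'].explode().value_counts()

print(country_counts)
```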

In [None]:
# Filter out 'United States of America'
con_df = con_df[con_df['country'] != 'United States of America']


Unsurprisingly, the **United States** tops the list as the most popular filming destination, reflecting the dominance of English-language movies in our dataset. **Europe** also stands out, with the UK, France, Germany, and Italy ranking among the top five. In Asia, **Japan** and **India** emerge as the leading countries for movie production.

### **Franchise Movies**

Now, let's turn our attention to franchise movies. I'm eager to uncover the longest-running and most successful franchises, among other intriguing details. Let's dive into the data and see what we can discover!

In [None]:
# Work on a copy so the column assignment below does not trigger SettingWithCopyWarning
df_fran = df[df['belongs_to_collection'].notnull()].copy()
df_fran['belongs_to_collection'] = df_fran['belongs_to_collection'].apply(ast.literal_eval).apply(lambda x: x['name'] if isinstance(x, dict) else np.nan)
df_fran = df_fran[df_fran['belongs_to_collection'].notnull()]
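
To illustrate what the parsing step does, here is a small sketch on a made-up value shaped like the stringified dictionaries described in the feature list. The helper name `collection_name` and the sample string are assumptions for the example, not part of the notebook.

```python
import ast
import numpy as np

# Hypothetical raw value, shaped like a belongs_to_collection entry.
raw = "{'id': 1, 'name': 'Example Collection', 'poster_path': None}"

def collection_name(value):
    """Return the collection name from a stringified dict, or NaN if it cannot be used."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return np.nan
    # Valid literals that are not dictionaries (e.g. a bare number) are also treated as missing.
    return parsed['name'] if isinstance(parsed, dict) else np.nan

print(collection_name(raw))
print(collection_name('not a dict'))
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings read from a file.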

In [None]:
# Count, mean and total revenue per franchise, with flat column names ('count', 'mean', 'sum')
fran_pivot = df_fran.groupby('belongs_to_collection')['revenue'].agg(['count', 'mean', 'sum']).reset_index()

#### **Highest Grossing Movie Franchises**

In [None]:
fran_pivot.sort_values('sum', ascending=False).head(10)

The **Harry Potter** franchise tops the list as the most successful movie series, earning over USD 7.707 billion from its 8 films. Close behind, the **Star Wars** franchise has garnered USD 7.403 billion from its 8 movies. The **James Bond** series ranks third, but with a significantly larger number of films, its average gross per movie is considerably lower.

#### **Most Successful Movie Franchises (by Average Gross)**


I will utilize the average gross per movie as a metric to assess the success of a movie franchise. However, it's important to note that this metric may not be entirely reliable, as the revenue data in our dataset hasn't been adjusted for inflation. Consequently, revenue statistics are likely to heavily favor franchises from more recent times.

In [None]:
fran_pivot.sort_values('mean', ascending=False).head(10)

The **Avatar** Collection, though currently limited to a single installment, tops the list by average gross: its one film earned nearly USD 3 billion, more than any other single movie in the dataset. The **Harry Potter** franchise has far more films, but on a per-movie basis Avatar's single entry places it first.

#### **Longest Running Franchises**


In this segment, we turn our attention to the enduring franchises that have weathered the passage of time, continually delivering a multitude of cinematic experiences under a singular banner. This metric, unaffected by inflation, serves as a robust indicator of a franchise's longevity and creative prowess. However, it's crucial to note that the quantity of films within a franchise doesn't inherently correlate with its success. Certain franchises, like Harry Potter, possess a meticulously crafted storyline with a finite arc, rendering further film productions unnecessary despite their monumental success.

In [None]:
fran_pivot.sort_values('count', ascending=False).head(10)

The **James Bond** film series reigns supreme as the epitome of longevity in the realm of cinema, boasting an impressive catalog of over 26 thrilling adventures under its iconic banner. Following in its wake, but at a considerable distance, are franchises like **Friday the 13th** and **Pokemon**, occupying the second and third spots with 12 and 11 films respectively. This illustrious lineup underscores not only the enduring popularity of these franchises but also their ability to captivate audiences across generations with their distinct narratives and characters.

### **Production Companies**

In [None]:
df['production_companies'] = df['production_companies'].fillna('[]').apply(ast.literal_eval)
df['production_companies'] = df['production_companies'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [None]:
s = df.apply(lambda x: pd.Series(x['production_companies']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'companies'

In [None]:
com_df = df.drop('production_companies', axis=1).join(s)

In [None]:
com_sum = pd.DataFrame(com_df.groupby('companies')['revenue'].sum().sort_values(ascending=False))
com_sum.columns = ['Total']
com_mean = pd.DataFrame(com_df.groupby('companies')['revenue'].mean().sort_values(ascending=False))
com_mean.columns = ['Average']
com_count = pd.DataFrame(com_df.groupby('companies')['revenue'].count().sort_values(ascending=False))
com_count.columns = ['Number']

com_pivot = pd.concat((com_sum, com_mean, com_count), axis=1)

#### **Highest Earning Production Companies**

Let's delve into the realm of cinema and uncover the top-grossing production companies, those titans of the silver screen whose creations have not only captivated audiences but also reaped substantial financial rewards in the dynamic world of filmmaking.

In [None]:
com_pivot.sort_values('Total', ascending=False).head(10)

**Warner Bros** leads film production by total revenue, with USD 63.5 billion earned from nearly 500 films. **Universal Pictures** and **Paramount Pictures** follow, with USD 55 billion and USD 48 billion in revenue respectively.


#### **Most Successful Production Companies**

Next, let's see which production companies deliver the most successful movies on average. To keep the averages from being dominated by companies with only a handful of releases, I restrict the comparison to companies that have produced at least 15 films.

In [None]:
com_pivot[com_pivot['Number'] >= 15].sort_values('Average', ascending=False).head(10)


**Pixar Animation Studios** emerges as the leader in crafting consistently successful films, a feat undoubtedly owed to their remarkable portfolio spanning several decades. With iconic gems like "Up," "Finding Nemo," "Inside Out," "Wall-E," "Ratatouille," and the beloved "Toy Story" and "Cars" franchises, Pixar has etched its name in cinematic history. Following closely behind is **Marvel Studios**, renowned for its superhero epics like "Iron Man" and "The Avengers," with an average gross of USD 615 million per film.

### **Original Language**

In this segment, we'll delve into the linguistic landscape of the movies within our dataset. Having already established that English dominates as the primary language across productions from various countries, it's now intriguing to explore the spectrum of other major languages present.

In [None]:
df['original_language'].drop_duplicates().shape[0]

In [None]:
lang_df = pd.DataFrame(df['original_language'].value_counts())
lang_df['language'] = lang_df.index
lang_df.columns = ['number', 'language']
lang_df.head()

The dataset contains over 93 distinct languages. Unsurprisingly, English dominates by a wide margin. Beyond English, French and Italian are the most common, albeit far behind in terms of representation. To visualize this diversity, let's draw a bar plot of the most common languages, excluding English.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 5))
sns.barplot(x='language', y='number', data=lang_df.iloc[1:11], palette='Set2')
plt.show()



As we delve deeper into our dataset, it becomes evident that after English, **French** and **Italian** stand out as prominent languages. Within the realm of Asian languages, **Japanese** and **Hindi** emerge as frontrunners, embodying a significant presence. This linguistic landscape not only reflects the global reach of cinema but also underscores the rich diversity of cultural narratives being shared on screen.

### **Popularity, Vote Average and Vote Count**

In this section, my focus shifts to the metrics provided by TMDB users. The aim is to understand the popularity, vote average, and vote count features, to look for relationships among them, and to explore their correlations with other numeric attributes such as budget and revenue.

In [None]:
def clean_numeric(x):
    # Convert to float; values that cannot be parsed become NaN
    try:
        return float(x)
    except (TypeError, ValueError):
        return np.nan

In [None]:
df['popularity'] = df['popularity'].apply(clean_numeric).astype('float')
df['vote_count'] = df['vote_count'].apply(clean_numeric).astype('float')
df['vote_average'] = df['vote_average'].apply(clean_numeric).astype('float')

Let's delve into the summary statistics and meticulously explore the distribution of each individual feature. By dissecting each aspect separately, we gain a comprehensive understanding that illuminates the nuances and intricacies within the dataset.

In [None]:
df['popularity'].describe()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")
# kdeplot with fill=True replaces the deprecated distplot(hist=False, kde_kws={'shade': True})
sns.kdeplot(x=df['popularity'].fillna(df['popularity'].median()), color='purple', fill=True)
plt.xlabel('Popularity')
plt.ylabel('Density')
plt.title('Density Plot of Popularity')
plt.show()


In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
df['popularity'].plot(logy=True, kind='hist', color='red')
plt.xlabel('Popularity')
plt.ylabel('Frequency (log scale)')
plt.title('Histogram of Popularity (log scale)')
plt.grid(True)
plt.show()


The Popularity score is heavily skewed: the mean is only **2.9**, while the maximum reaches 547, nearly 190 times the mean. The distribution plot shows that the overwhelming majority of movies have a popularity score below 10; even the 75th percentile sits at just 3.678902.
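
The effect of a long right tail on summary statistics can be seen on a small made-up sample. The numbers below are illustrative assumptions, not values from the dataset; they simply mimic a column where one extreme value sits far above the rest.

```python
import pandas as pd

# Made-up scores: mostly small values plus one extreme outlier.
scores = pd.Series([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 500.0])

mean_score = scores.mean()      # pulled far upward by the single outlier
median_score = scores.median()  # barely affected by it
skewness = scores.skew()        # large positive value indicates a long right tail

print(mean_score, median_score, skewness)
```

When the mean sits far above the median like this, the median and percentiles are usually the more informative summaries.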

#### **Most Popular Movies by Popularity Score**

In [None]:
df[['title', 'popularity', 'year']].sort_values('popularity', ascending=False).head(10)

**Minions** tops the TMDB Popularity Score leaderboard. **Wonder Woman** and **Beauty and the Beast**, two female-led films, follow in second and third place.

In [None]:
df['vote_count'].describe()

Similar to popularity scores, the distribution of vote counts reveals a stark skewness, with the median count resting at a modest 10 votes. However, there exists a considerable outlier, with one movie amassing a staggering 14,075 votes, accentuating the vast spectrum of engagement among TMDB users. While TMDB Votes may not wield the same influence and depth as its IMDb counterpart, it still serves as a noteworthy metric for gauging audience interaction. Let us now embark on a journey to uncover the most celebrated and widely-voted movies on the platform.

#### **Most Voted on Movies**

In [None]:
df[['title', 'vote_count', 'year']].sort_values('vote_count', ascending=False).head(10)

At the summit of our chart stand two masterpieces from the visionary director Christopher Nolan: **Inception** and **The Dark Knight**. Both films have garnered widespread acclaim from critics and audiences alike, transcending mere box office success to become emblematic of cinematic excellence. Nolan's unparalleled storytelling prowess and cinematic craftsmanship shine through in these timeless classics, solidifying their place at the pinnacle of our rankings.


In [None]:
df['vote_average'] = df['vote_average'].replace(0, np.nan)
df['vote_average'].describe()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")
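# histplot with kde=True and stat='density' replaces the deprecated distplot
sns.histplot(x=df['vote_average'].fillna(df['vote_average'].median()), kde=True, stat='density', color='skyblue')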
plt.xlabel('Vote Average')
plt.ylabel('Density')
plt.title('Density Plot of Vote Average')
plt.show()


TMDB users are strict raters: the mean rating is only **5.6** on a 10-point scale, and half of the movies have a rating of 6 or lower. To identify the most critically acclaimed movies according to TMDB, we apply a minimum-vote cutoff, similar in spirit to the vote threshold IMDb uses for its Top 250 list, and consider only movies with more than 2000 votes.

#### **Most Critically Acclaimed Movies**

In [None]:
df[df['vote_count'] > 2000][['title', 'vote_average', 'vote_count' ,'year']].sort_values('vote_average', ascending=False).head(10)

In the realm of cinematic acclaim, **The Shawshank Redemption** and **The Godfather** stand as pillars of excellence within the TMDB Database. Remarkably, their eminence extends beyond TMDB, as they reign supreme as the top two movies on IMDb's prestigious Top 250 Movies list. Garnering ratings exceeding 9 on IMDb, these timeless classics exude a level of reverence and admiration that transcends platforms, despite their slightly lower but still commendable TMDB scores of 8.5.

Are popularity and vote average interconnected in a tangible manner? In essence, do these metrics exhibit a robust positive correlation? To unravel this inquiry, let's embark on a visual exploration by crafting a scatterplot to depict their relationship. Through this graphical representation, we aim to discern any discernible patterns or trends that may shed light on the extent of their correlation.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="whitegrid")
sns.jointplot(x='vote_average', y='popularity', data=df, cmap='viridis')
plt.show()



Surprisingly, the Pearson coefficient between the two variables is only **0.097**, indicating a **lack of substantial correlation**. In other words, popularity and vote average are largely independent quantities. It would be interesting to know how TMDB computes its popularity score.
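
Depending on the seaborn version, the joint plot may not display the coefficient itself. A Pearson coefficient like the one quoted above can be computed directly with pandas; the sketch below uses a hypothetical toy frame in place of `df` so that it runs on its own.

```python
import pandas as pd

# Toy stand-in for the two columns (hypothetical values). With the real data,
# df['vote_average'].corr(df['popularity']) performs the same calculation.
toy = pd.DataFrame({
    'vote_average': [5.0, 6.0, 7.0, 8.0, None],
    'popularity':   [2.0, 1.0, 4.0, 3.0, 9.0],
})

# Series.corr defaults to Pearson and ignores rows where either value is missing.
r = toy['vote_average'].corr(toy['popularity'])
print(round(r, 3))
```

Note that the last row is dropped from the calculation because its `vote_average` is missing, which mirrors how NaN ratings are handled in the real columns.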

In [None]:
sns.jointplot(x='vote_average', y='vote_count', data=df)

The correlation between Vote Count and Vote Average is notably modest. It's important to recognize that a high number of votes for a movie doesn't automatically translate to its quality.

### **Movie Release Dates**

The timing of a movie's release can wield significant influence over its success and financial returns. In the upcoming analysis, we aim to unravel the significance of release dates, exploring trends across years, months, and even days of the week.

As part of our preliminary data preparation, we've already integrated the **year** feature. Now, our focus shifts to extracting the month and day components for each movie's release date.

In [None]:
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

In [None]:
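def get_month(x):
    # Map the month number in a 'YYYY-MM-DD' string to its abbreviation; NaN if unavailable
    try:
        return month_order[int(str(x).split('-')[1]) - 1]
    except (IndexError, ValueError):
        return np.nan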

In [None]:
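def get_day(x):
    # Map a 'YYYY-MM-DD' string to its weekday abbreviation; NaN for missing or malformed dates
    try:
        year, month, day = (int(i) for i in x.split('-'))
        answer = datetime.date(year, month, day).weekday()
        return day_order[answer]
    except (AttributeError, ValueError):
        return np.nan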

In [None]:
df['day'] = df['release_date'].apply(get_day)
df['month'] = df['release_date'].apply(get_month)
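
As a possible alternative to the two helper functions, pandas can derive the same labels in a vectorized way through the `.dt` accessor. The sketch below is an illustration on hypothetical dates, not the notebook's own method; it assumes the default English names returned by `month_name()` and `day_name()`.

```python
import pandas as pd

# Toy release dates (hypothetical), including one unparseable value.
dates = pd.Series(['1995-10-30', '2009-12-10', 'not a date'])
parsed = pd.to_datetime(dates, errors='coerce')

# month_name()/day_name() return full English names by default; keeping the first
# three letters matches labels such as 'Oct' and 'Mon'. Unparseable dates stay missing.
month = parsed.dt.month_name().str[:3]
day = parsed.dt.day_name().str[:3]

print(month.tolist())
print(day.tolist())
```

The three-letter labels line up with `month_order` and `day_order`, so the plotting code would not need to change.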

Armed with these crucial features, it's time to delve into the analysis and uncover which months and days emerge as the hotspots for popularity and success in the realm of movie releases.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))
plt.title("Number of Movies released in a particular month.")
sns.countplot(x='month', data=df, order=month_order, palette='Set2')
plt.show()



**January** emerges as the reigning champion in terms of movie releases, a phenomenon often associated with Hollywood's "dump month," notorious for the influx of subpar releases.

**Now, shifting our focus, I will aim to pinpoint the months favored by blockbuster releases. To achieve this, I will examine movies exceeding the $100 million mark and analyze the average gross for each month.**


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

month_mean = pd.DataFrame(df[df['revenue'] > 1e8].groupby('month')['revenue'].mean())
month_mean['mon'] = month_mean.index

plt.figure(figsize=(12, 6))
plt.title("Average Gross by the Month for Blockbuster Movies")
sns.barplot(x='mon', y='revenue', data=month_mean, order=month_order, palette='Set2')
plt.show()



**April**, **May** and **June** emerge as the prime contenders with the highest average gross among high-grossing films. This trend finds its roots in the strategic timing of blockbuster releases, often coinciding with the summer season. With school holidays in session and families embarking on vacations, the audience demographic is primed for entertainment, leading to increased spending on leisure activities.

**Now, turning our attention to the broader picture, I will aim to explore whether certain months consistently outshine others in terms of success. To achieve this, I will visualize the relationship between returns and months through box plots, offering insights into the variability and distribution of returns across different months.**



In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1,figsize=(15, 8))
sns.boxplot(x='month', y='return', data=df[df['return'].notnull()], palette="muted", ax =ax, order=month_order)
ax.set_ylim([0, 12])

**June** and **July** consistently stand out with the highest median returns, underscoring their status as peak months for successful movie releases. Conversely, **September** trails behind as the least successful month across these metrics. Once more, the buoyancy of June and July releases aligns with the summer season and vacation periods, fostering a conducive environment for movie attendance. In contrast, September marks the onset of the academic semester, leading to a dip in movie consumption as individuals refocus their priorities.
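
The medians read off the box plot can also be computed numerically with a group-by. The frame below is a toy example with made-up returns, meant only to show the aggregation and why the median is preferred over the mean for this kind of data.

```python
import pandas as pd

# Toy data (hypothetical returns) to show the aggregation only.
toy = pd.DataFrame({
    'month':  ['Jun', 'Jun', 'Jun', 'Sep', 'Sep', 'Sep'],
    'return': [1.0, 3.0, 50.0, 0.5, 1.0, 1.5],
})

# The median is the line drawn inside each box; unlike the mean,
# it is not dragged upward by a single runaway hit.
median_return = toy.groupby('month')['return'].median()
mean_return = toy.groupby('month')['return'].mean()

print(median_return)
print(mean_return)
```

On the real data, the same call on the rows with a non-null `return` would give one median per month to compare against the plot.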

**Having examined the monthly landscape, I will now shift my focus to the popularity trends across days of the week, mirroring our previous analysis.**



In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 5))
plt.title("Number of Movies released on a particular day.")
sns.countplot(x='day', data=df, order=day_order, palette='Greens')
plt.show()



**Friday** stands as the undisputed champion of movie releases, reigning supreme as the favored day for cinematic debuts. It's a logical choice, marking the gateway to the weekend, when audiences are primed for entertainment. Conversely, **Sunday** and **Monday** see minimal activity in comparison, a trend easily traced back to their roles as the tail end and kickoff of the workweek, respectively. The rhythm of the week dictates the ebb and flow of audience attendance, with Friday shining brightest in the spotlight of movie magic.

#### **Number of Movies by the year**

Although our dataset of 45,000 movies may not encompass the entirety of cinematic history, it does offer a substantial glimpse into the evolution of the film industry. While it's not exhaustive, it's fair to infer that it covers a significant portion of major releases from Hollywood and other prominent global film hubs like Bollywood. Keeping this assumption in perspective, we can delve into the trends of movie production over the years.

In [None]:
import matplotlib.pyplot as plt

year_count = df.groupby('year')['title'].count()
plt.figure(figsize=(18, 5))
year_count.plot(color='red')
plt.show()


In examining the dataset, a notable trend emerges: a pronounced surge in the quantity of films **beginning from the 1990s**. While this spike is intriguing, it's prudent to exercise caution in drawing conclusions, considering the potential for oversampling of contemporary films within the dataset.

**Moving forward, our focus shifts to the cinematic pioneers, delving into the earliest entries within this repository of cinematic history.**




#### **Earliest Movies Represented**

In [None]:
df[df['year'] != 'NaT'][['title', 'year']].sort_values('year').head(10)

The earliest film, **Passage of Venus**, captures a pivotal celestial event: the transit of Venus across the Sun in 1874. Assembled from a series of photographs, this historic visual record was made in Japan by the renowned French astronomer Pierre Janssen using his groundbreaking 'photographic revolver.' Its significance extends beyond its chronological primacy: "Passage of Venus" holds the distinction of being the oldest film listed on both IMDB and TMDB.


**To conclude this section, I will create a heatmap of movie releases by month and year for the current century. It gives a panoramic view of the industry's release calendar, showing which months see the most releases and which are comparatively quiet.**

In [None]:
months = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, 'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

In [None]:
df_21 = df.copy()
df_21['year'] = df_21[df_21['year'] != 'NaT']['year'].astype(int)
df_21 = df_21[df_21['year'] >=2000]
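hmap_21 = pd.pivot_table(data=df_21, index='month', columns='year', aggfunc='count', values='title')
# Put the rows in calendar order so they line up with yticklabels=month_order in the heatmap
hmap_21 = hmap_21.reindex(month_order).fillna(0)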

In [None]:
sns.set(font_scale=1)
f, ax = plt.subplots(figsize=(16, 8))
sns.heatmap(hmap_21, annot=True, linewidths=.5, ax=ax, fmt='n', yticklabels=month_order)

In [None]:
sns.set(font_scale=1.25)

### **Movie Status**

While diverging slightly from our primary focus on movie analysis, delving into the categorization of films based on their release status holds promise for uncovering intriguing facets of our dataset. Initially, my intuition led me to anticipate that the majority of films would carry the **Released** status.

**Yet, as I embark on this investigative journey, I aim to shed light on the distribution and proportions of various release statuses within my dataset, potentially unraveling unexpected trends and nuances that enrich my understanding of the cinematic landscape encapsulated within my data.**


In [None]:
df['status'].value_counts()

While the film industry churns out releases, MovieLens stands out by featuring user ratings for movies still in development. This data could be invaluable for enhancing our collaborative filtering recommendation system.

### **Spoken Languages**


**Can the number of languages spoken in a movie affect its success? It's an intriguing question worth exploring.**

To delve into this, I plan to transform the **spoken_languages** feature of my dataset into a numerical representation, indicating the count of languages spoken in each film. This approach could shed light on the potential correlation between linguistic diversity and a movie's reception, offering valuable insights for my analysis.
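As a small illustration of that transformation, here is a sketch on a hand-written sample string. The helper name `count_languages` and the sample value are assumptions made for this example; the actual column is processed in the next cell.

```python
import ast

def count_languages(raw):
    """Return the number of entries in a stringified list, or NaN if it cannot be parsed."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Malformed strings (or non-string inputs) are treated as missing.
        return float("nan")
    return len(parsed) if isinstance(parsed, list) else float("nan")

# Hypothetical sample in the same shape as a TMDB spoken_languages entry.
sample = "[{'iso_639_1': 'en', 'name': 'English'}, {'iso_639_1': 'de', 'name': 'Deutsch'}]"

print(count_languages(sample))        # two language entries
print(count_languages("[]"))          # empty list, zero languages
print(count_languages("not a list"))  # unparseable, NaN
```

`ast.literal_eval` is used instead of `eval` because it only accepts Python literals, so a corrupted cell cannot execute arbitrary code.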

In [None]:
df['spoken_languages'] = df['spoken_languages'].fillna('[]').apply(ast.literal_eval).apply(lambda x: len(x) if isinstance(x, list) else np.nan)

In [None]:
df['spoken_languages'].value_counts()

While most films stick to one language, there are exceptions. One remarkable film features dialogue in **19** languages. Let's now focus on movies with 10 or more spoken languages.

In [None]:
df[df['spoken_languages'] >= 10][['title', 'year', 'spoken_languages']].sort_values('spoken_languages', ascending=False)

**Visions of Europe,** the movie boasting the highest number of languages, isn't a singular narrative but rather a compilation of 25 short films, each helmed by a different European director. This anthology format accounts for its remarkable linguistic diversity.

### **Runtime**

Movies have evolved significantly in terms of length, from humble one-minute silent, black-and-white clips to epic three-hour visual masterpieces. In this section, we aim to delve deeper into the evolution of movie lengths, uncovering insights about their nature and historical trends.

In [None]:
df['runtime'].describe()

The average movie length hovers around 1 hour and 30 minutes. However, the longest film in this dataset clocks in at a **staggering 1256 minutes, equivalent to nearly 21 hours of runtime**.

In [None]:
df['runtime'] = df['runtime'].astype('float')

Recognizing that the majority of movies fall under the 5-hour mark, **I aim to visualize the distribution of these mainstream films through plotting**.

In [None]:
plt.figure(figsize=(12,6))
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(df[(df['runtime'] < 300) & (df['runtime'] > 0)]['runtime'], kde=True, stat='density')

Let's explore whether there's a significant correlation between a movie's runtime and its return.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

df_mat = df[(df['return'].notnull()) & (df['runtime'] > 0) & (df['return'] < 10)]
sns.jointplot(x='return', y='runtime', data=df_mat, color='lightgreen')
plt.show()


While it appears that **a movie's duration is unrelated to its success**, there's a suspicion that this might not hold true for duration and budget. Intuitively, longer films would require a larger budget.


**Let's investigate whether this hypothesis holds up in my analysis.**


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

df_mat = df[(df['budget'].notnull()) & (df['runtime'] > 0)]
sns.jointplot(x='budget', y='runtime', data=df_mat, color='green')
plt.show()


Surprisingly, the correlation between the two variables is weaker than anticipated. A film's genre plausibly plays a more substantial role in determining its budget. For instance, a lengthy art film tends to be cheaper to produce than a shorter Sci-Fi flick.

Additionally, I'm curious about the average durations of films spanning from the 1890s to the 2010s. Examining these trends could offer insights into filmmakers' perceptions of ideal movie lengths throughout history.

In [None]:
plt.figure(figsize=(18,5))
year_runtime = df[df['year'] != 'NaT'].groupby('year')['runtime'].mean()
plt.plot(year_runtime.index, year_runtime)
plt.xticks(np.arange(1874, 2024, 10.0))
plt.show()

It's fascinating to observe that films reached the **60-minute mark** as early as 1914. By **1924**, the standard 90-minute duration emerged and has persisted ever since.

**Moving on, let's explore the longest and shortest films in our dataset.**

#### **Shortest Movies**

In [None]:
df[df['runtime'] > 0][['runtime', 'title', 'year']].sort_values('runtime').head(10)


With the exception of **A Gathering of Cats**, all the films on this list were shot in the late 1890s and early 20th century, each lasting only one minute.


#### **Longest Movies**

In [None]:
df[df['runtime'] > 0][['runtime', 'title', 'year']].sort_values('runtime', ascending=False).head(10)

The majority of entries in the chart are miniseries, not feature-length films. Therefore, it's challenging to draw meaningful insights since our dataset lacks a clear distinction between the two without manual sorting.

### **Budget**

Now, let's shift focus to budget. Acknowledging its skewed nature and susceptibility to inflation, delving into budget data still promises valuable insights. Budget plays a crucial role in predicting a movie's revenue and success. To begin, let's gather summary statistics for our budget data.

In [None]:
df['budget'].describe()

The mean budget for a film stands at USD 21.6 million, contrasting sharply with the median of USD 8 million. This disparity strongly indicates the influence of outliers on the mean value.
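To make the effect of outliers concrete, here is a minimal sketch with invented budget figures (in millions of USD, not taken from our dataset). A single very expensive film is enough to pull the mean far above the median.

```python
import statistics

# Made-up budgets in millions of USD; the last value plays the role of a blockbuster outlier.
budgets_musd = [1, 2, 3, 5, 8, 10, 250]

mean_budget = statistics.mean(budgets_musd)
median_budget = statistics.median(budgets_musd)

print(f"mean:   {mean_budget:.1f}")
print(f"median: {median_budget:.1f}")
```

This is why the median is the more representative "typical" budget for a right-skewed distribution like ours.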

In [None]:
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(df[df['budget'].notnull()]['budget'], kde=True, stat='density', color='red')

In [None]:
df['budget'].plot(logy=True, kind='hist', color='yellow')
plt.show()


The distribution of movie budgets follows an exponential decay, with over 75% of films having budgets below $25 million. Now, let's examine the highest-budget movies of all time and the revenue and returns they generated.

#### **Most Expensive Movies of all Time**

In [None]:
df[df['budget'].notnull()][['title', 'budget', 'revenue', 'return', 'year']].sort_values('budget', ascending=False).head(10)

Two **Pirates of the Caribbean** films lead the list with budgets exceeding **USD 300 million** each. Among the top 10 most expensive films, all but **The Lone Ranger** turned a profit, with the latter recouping less than 35% of its **USD 255 million** budget.

**The strength of the budget-revenue correlation matters here: the stronger it is, the more useful budget will be for forecasting revenue.**



In [None]:
sns.jointplot(x='budget', y='revenue', data=df[df['return'].notnull()], color='purple')

The Pearson r value of **0.73** between the two quantities indicates a strong positive correlation.
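For readers who want to see what this coefficient measures, here is a minimal sketch that computes Pearson's r from its definition on toy arrays (the numbers are invented, and `pearson_r` is a helper written for this example) and checks it against `np.corrcoef`. As a rule of thumb, r squared gives the share of variance in one variable that a linear fit on the other accounts for, so r = 0.73 corresponds to roughly 53%.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x_c = x - x.mean()
    y_c = y - y.mean()
    return float((x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum()))

# Toy values standing in for budget and revenue.
budget = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
revenue = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

r = pearson_r(budget, revenue)
print(round(r, 3), round(r ** 2, 3))
```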

### **Revenue**

Let's delve into revenue, a pivotal metric in movie analysis. Predicting revenue based on various features is our next endeavor. Similar to budget analysis, we'll commence by examining summary statistics.

In [None]:
df['revenue'].describe()


With a mean gross of **USD 68.7 million** and a considerably lower median of **USD 16.8 million**, it's evident that movie revenue is skewed. The spectrum ranges from a mere **USD 1** to a staggering **USD 2.78 billion**, showcasing the vast potential in this industry.

In [None]:
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(df[df['revenue'].notnull()]['revenue'], kde=True, stat='density', color='cyan')


Revenue distribution mirrors that of budget, displaying exponential decay. Moreover, we observed a strong correlation between the two. Now, let's examine both the highest and lowest grossing movies ever.

#### **Highest Grossing Films of All Time**

In [None]:
gross_top = df[['poster_path', 'title', 'budget', 'revenue', 'year']].sort_values('revenue', ascending=False).head(10)
pd.set_option('display.max_colwidth', 100)
HTML(gross_top.to_html(escape=False))

In [None]:
pd.set_option('display.max_colwidth', 50)

These figures aren't adjusted for inflation, leading to a bias towards recent movies in the top 10 list. To grasp revenue trends over time, let's plot the maximum revenue across the years.
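If one did want to adjust for inflation, the usual approach is to rescale each gross by a price index for its release year. The sketch below shows only the mechanics: the `price_index` values are hypothetical placeholders, not real CPI figures, and the films are invented.

```python
import pandas as pd

# HYPOTHETICAL price index (base year 2017 = 100). These are NOT real CPI values.
price_index = {1997: 60.0, 2009: 85.0, 2017: 100.0}

# Invented films with nominal revenue in millions of USD.
toy = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "year": [1997, 2009],
    "revenue": [600.0, 850.0],
})

# Express every gross in base-year money: nominal * (base index / index in release year).
toy["revenue_adj"] = toy["revenue"] * 100.0 / toy["year"].map(price_index)
print(toy)
```

With these placeholder numbers the two films, which differ in nominal gross, end up with the same adjusted gross, which is exactly the kind of reordering an inflation adjustment can produce.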

In [None]:
plt.figure(figsize=(18,5))
year_revenue = df[(df['revenue'].notnull()) & (df['year'] != 'NaT')].groupby('year')['revenue'].max()
plt.plot(year_revenue.index, year_revenue)
plt.xticks(np.arange(1874, 2024, 10.0))
plt.show()

The graph illustrates a consistent rise in maximum gross over the years. **Titanic** became the industry's first billion-dollar film in 1997, and 12 years later, in 2009, **Avatar** became the first to surpass the USD 2 billion mark. Both were directed by James Cameron.


### **Returns**


For now, I won't delve into returns. Instead, **let's focus on identifying the least and most successful movies ever. I will narrow our scope to films with a budget exceeding $5 million.**

#### **Most Successful Movies**

In [None]:
df[(df['return'].notnull()) & (df['budget'] > 5e6)][['title', 'budget', 'revenue', 'return', 'year']].sort_values('return', ascending=False).head(10)

#### **Worst Box Office Disasters**

In [None]:
df[(df['return'].notnull()) & (df['budget'] > 5e6) & (df['revenue'] > 10000)][['title', 'budget', 'revenue', 'return', 'year']].sort_values('return').head(10)

With these analyses completed, we're well-prepared to build our correlation matrix.

## **Correlation Matrix**

In [None]:
df['year'] = df['year'].replace('NaT', np.nan)

In [None]:
df['year'] = df['year'].apply(clean_numeric)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Select only numeric columns from the DataFrame
numeric_df = df.select_dtypes(include=np.number)

# Compute the correlation matrix
corr = numeric_df.corr()

# Create a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
plt.figure(figsize=(9, 9))

# Plot the heatmap
sns.heatmap(corr, mask=mask, vmax=0.3, square=True, annot=True, cmap='coolwarm')

plt.title('Correlation Heatmap')
plt.show()


In [None]:
sns.set(font_scale=1.25)

### **Genres**

In [None]:
df['genres'] = df['genres'].fillna('[]').apply(ast.literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [None]:
s = df.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'

In [None]:
gen_df = df.drop('genres', axis=1).join(s)

In [None]:
gen_df['genre'].value_counts().shape[0]

The roughly 45,000 movies in the dataset span 32 distinct genre labels on TMDB.
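The one-row-per-genre table built in the cells above can also be produced with `DataFrame.explode`. The following is a sketch of that alternative on a toy frame (the titles and genres are invented), not a replacement for the cells above.

```python
import pandas as pd

# Toy frame with a list-valued genres column, like df['genres'] after parsing.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["Drama", "Romance"], ["Comedy"], []],
})

# explode() repeats each row once per list element; an empty list becomes a single NaN row.
gen_toy = toy.explode("genres").rename(columns={"genres": "genre"})

# value_counts() ignores the NaN row, so only real genre labels are counted.
counts = gen_toy["genre"].value_counts()
print(gen_toy)
print(counts)
```

Like the join-based approach, this keeps films without any genre as a row with a missing `genre`, so they still appear in the table but not in the genre counts.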

**Now let's delve into the top genres shaping cinematic landscapes.**

In [None]:
pop_gen = pd.DataFrame(gen_df['genre'].value_counts()).reset_index()
pop_gen.columns = ['genre', 'movies']
pop_gen.head(10)

In [None]:
plt.figure(figsize=(18, 8))
sns.barplot(x='genre', y='movies', data=pop_gen.head(15), palette='Paired')
plt.show()


**Drama** dominates, with nearly half of all films falling into this genre, while **Comedy** follows at a distance with 25% offering laughs. Among the top 10 genres are also Action, Horror, Crime, Mystery, Science Fiction, Animation, and Fantasy.

**Now, let's delve into global genre trends.**
- **Are Science Fiction flicks gaining traction?**
- **Do certain years favor Animation?**


**I will focus on data from 2000 onwards, examining the top 15 genres while excluding Documentaries, Family, and Foreign Films from our analysis.**


In [None]:
# 'Mystery' is listed once only; a duplicate entry would plot the same column twice
genres = ['Drama', 'Comedy', 'Thriller', 'Romance', 'Action', 'Horror', 'Crime', 'Adventure', 'Science Fiction', 'Mystery', 'Fantasy', 'Animation']

In [None]:
pop_gen_movies = gen_df[(gen_df['genre'].isin(genres)) & (gen_df['year'] >= 2000) & (gen_df['year'] <= 2017)]
ctab = pd.crosstab([pop_gen_movies['year']], pop_gen_movies['genre']).apply(lambda x: x/x.sum(), axis=1)
ctab[genres].plot(kind='bar', stacked=True, colormap='jet', figsize=(12,8)).legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.title("Stacked Bar Chart of Movie Proportions by Genre")
plt.show()

In [None]:
ctab[genres].plot(kind='line', stacked=False, colormap='jet', figsize=(12,8)).legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

Genre proportions have generally held steady since the turn of the century, except for **Drama**, which saw a decline of over 5%. Conversely, **Thrillers** experienced a slight uptick in representation.
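The per-year proportions behind these charts can be reproduced in miniature. This is a sketch on invented rows, using `pd.crosstab` with `normalize='index'` as an assumed shortcut for the row-wise `x / x.sum()` used above.

```python
import pandas as pd

# Toy genre table: one row per (film, genre) pair, as in gen_df.
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2000, 2001, 2001],
    "genre": ["Drama", "Drama", "Comedy", "Thriller", "Drama", "Thriller"],
})

# normalize="index" divides each row by its total, so every year sums to 1.
shares = pd.crosstab(toy["year"], toy["genre"], normalize="index")
print(shares)
```

Note that these are shares of genre tags rather than of films: a movie tagged with three genres contributes three rows, so a shift in one genre's share can partly reflect how many tags films carry.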

One lingering question concerns genre success metrics. Science Fiction and Fantasy films often boast high revenue figures.

**Do they maintain profitability when adjusted for budget? To shed light on this, I will visualize genre performance using box plots, comparing revenue and returns.**

In [None]:
violin_genres = ['Drama', 'Comedy', 'Thriller', 'Romance', 'Action', 'Horror', 'Crime', 'Science Fiction', 'Fantasy', 'Animation']
violin_movies = gen_df[(gen_df['genre'].isin(violin_genres))]

In [None]:
# Reset index of the DataFrame to remove duplicate labels
violin_movies = violin_movies.reset_index(drop=True)

# Create the figure and axes (plt.subplots already creates the figure)
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15, 8))

# Plot the boxplot
sns.boxplot(x='genre', y='revenue', data=violin_movies, palette="muted", ax=ax)

# Set the y-axis limit
ax.set_ylim([0, 3e8])

# Show the plot
plt.show()



**Animation** films shine with the widest revenue range from the 25th to the 75th percentile, alongside boasting the highest median revenue across all genres. Following closely behind are **Fantasy** and **Science Fiction**, claiming the second and third spots for median revenue, respectively.

In [None]:
# plt.subplots already creates the figure, so no separate plt.figure call is needed
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15, 8))
sns.boxplot(x='genre', y='return', data=violin_movies, palette="muted", ax=ax)
ax.set_ylim([0, 10])
plt.show()

Based on the boxplot analysis, **Animation** movies emerge with the highest typical returns, with **Horror** movies also showing promise. Horror's strong showing is partly attributable to its lower budgets compared with Fantasy films, which lets even moderate revenue translate into a high return.

### **Cast and Crew**

Now, let's shift our focus to the cast and crew of our films. While this information isn't within our primary dataset, I have a separate file containing comprehensive cast and crew credits for all the MovieLens movies.

**Let's dive into this credits data**

In [None]:
credits_df = pd.read_csv('/content/drive/MyDrive/CAS DS Final Project - Movie Recommendation System - Avisek Regmi/credits.csv')
credits_df.head()

#### Credits Dataset



In the credits dataset, we have two main components: **cast** which includes cast names and their respective characters, and **crew** which contains crew names and their roles. Additionally, there's an **id** field representing the TMDB ID of the movie.



**My task involves joining this data with our original movies metadata dataframe on the TMDB Movie ID (the `merge` call below performs an inner join, so only movies present in both tables are kept). Before this join, I must ensure the ID column in our main dataframe is clean and of integer type. I will attempt an integer conversion and replace any problematic IDs with NaN, followed by dropping these rows from our dataframe.**
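The same clean-then-join sequence can be sketched on toy tables. Here `pd.to_numeric(errors='coerce')` is an assumed alternative to a hand-written conversion function, and the `movies` and `credits` frames are invented. Note that `DataFrame.merge` defaults to an inner join; `how='left'` with `indicator=True` is shown because it makes unmatched movies visible.

```python
import pandas as pd

# Toy metadata with one malformed id (a date that leaked into the id column).
movies = pd.DataFrame({
    "id": ["1", "2", "1997-08-20"],
    "title": ["Film A", "Film B", "bad row"],
})
# Toy credits table that only covers one of the two valid films.
credits = pd.DataFrame({"id": [1], "cast": ["[]"]})

# Coerce unparseable ids to NaN, drop those rows, then cast to int.
movies = (
    movies.assign(id=pd.to_numeric(movies["id"], errors="coerce"))
          .dropna(subset=["id"])
          .astype({"id": int})
)

# how="left" keeps every movie; the _merge column records whether credits were found.
merged = movies.merge(credits, on="id", how="left", indicator=True)
print(merged[["id", "title", "_merge"]])
```

With an inner join, "Film B" would silently disappear because it has no credits row; the indicator column is a cheap way to check how many films that affects.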

In [None]:
def convert_int(x):
    # Return x as an integer, or NaN when it cannot be converted
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan

In [None]:
df['id'] = df['id'].apply(convert_int)

In [None]:
df[df['id'].isnull()]

In [None]:
# Drop the rows whose id could not be parsed (the ones listed by the previous cell)
df = df.dropna(subset=['id'])

In [None]:
df['id'] = df['id'].astype('int')

In [None]:
df = df.merge(credits_df, on='id')
df.shape

In [None]:
df['cast'] = df['cast'].apply(ast.literal_eval)
df['crew'] = df['crew'].apply(ast.literal_eval)

In [None]:
df['cast_size'] = df['cast'].apply(lambda x: len(x))
df['crew_size'] = df['crew'].apply(lambda x: len(x))

In [None]:
df['cast'] = df['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [None]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [None]:
df['director'] = df['crew'].apply(get_director)

In [None]:
s = df.apply(lambda x: pd.Series(x['cast']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'actor'
cast_df = df.drop('cast', axis=1).join(s)

**Now, let's delve into the top-earning actors and directors in the film industry.**

#### **Actors with the Highest Total Revenue**

In [None]:
sns.set_style('whitegrid')
plt.title('Actors with the Highest Total Revenue')
cast_df.groupby('actor')['revenue'].sum().sort_values(ascending=False).head(10).plot(kind='bar', color='red')
plt.show()


#### **Directors with the Highest Total Revenue**

In [None]:
plt.title('Directors with the Highest Total Revenue')
df.groupby('director')['revenue'].sum().sort_values(ascending=False).head(10).plot(kind='bar', color='lightgreen')
plt.show()


**I will only factor in actors and directors with a portfolio of at least five films each when assessing average revenues.**
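The reason for this threshold is easiest to see on a toy example. The sketch below uses invented directors and revenues, with named aggregation as an assumed convenience: a director with only two films can post a huge average that says little about consistency.

```python
import pandas as pd

# Invented data: director X has five modest films, director Y has two very large ones.
toy = pd.DataFrame({
    "director": ["X"] * 5 + ["Y"] * 2,
    "revenue": [10.0, 20.0, 30.0, 40.0, 50.0, 500.0, 700.0],
})

# Count films and average revenue per director in one pass.
stats = toy.groupby("director")["revenue"].agg(films="count", mean_revenue="mean")

# Keep only directors with at least five films before ranking by average revenue.
eligible = stats[stats["films"] >= 5].sort_values("mean_revenue", ascending=False)
print(stats)
print(eligible)
```

As in the cell below, `count` only counts non-missing revenue values, so films without revenue data do not help anyone reach the threshold.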

In [None]:
actor_list = cast_df.groupby('actor')['revenue'].count().sort_values(ascending=False)
actor_list = list(actor_list[actor_list >= 5].index)
director_list = df.groupby('director')['revenue'].count().sort_values(ascending=False)
director_list = list(director_list[director_list >= 5].index)

#### **Actors with Highest Average Revenue**

In [None]:
plt.title("Actors with Highest Average Revenue")
cast_df[cast_df['actor'].isin(actor_list)].groupby('actor')['revenue'].mean().sort_values(ascending=False).head(10).plot(kind='bar', color='yellow')
plt.show()


#### **Directors with Highest Average Revenue**

In [None]:
plt.title("Directors with Highest Average Revenue")
df[df['director'].isin(director_list)].groupby('director')['revenue'].mean().sort_values(ascending=False).head(10).plot(kind='bar', color='lightblue')
plt.show()


Who are the most reliable actors and directors? We'll gauge this by looking at the average return generated by their projects. Only movies grossing at least $10 million will be included, and we'll focus on individuals with a track record of at least five films each.


#### **Most Successful Actors**

In [None]:
success_df = cast_df[(cast_df['return'].notnull()) & (cast_df['revenue'] > 1e7) & (cast_df['actor'].isin(actor_list))]
pd.DataFrame(success_df.groupby('actor')['return'].mean().sort_values(ascending=False).head(10))

#### **Most Successful Directors**

In [None]:
success_df = df[(df['return'].notnull()) & (df['revenue'] > 1e7) & (df['director'].isin(director_list))]
pd.DataFrame(success_df.groupby('director')['return'].mean().sort_values(ascending=False).head(10))

**John G. Avildsen** stands out with an exceptionally high return. None of the other directors on the list come close to matching his level of success. Let's examine his filmography.

In [None]:
df[(df['director'] == 'John G. Avildsen') & (df['return'].notnull())][['title', 'budget', 'revenue', 'return', 'year']]

**The Karate Kid, Part II** is listed with a budget of only USD 113, which is clearly a data error given its official cost of **USD 13 million**. Avildsen has directed remarkable films, but this discrepancy inflates his average return and disqualifies him from our list.
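Anomalies like this can be surfaced systematically rather than by eye. The sketch below assumes, as the rest of the analysis suggests, that return is revenue divided by budget. The films are invented and the `MIN_PLAUSIBLE_BUDGET` threshold is a hypothetical cut-off chosen for illustration.

```python
import pandas as pd

# Invented films; the first has an implausibly small budget entry.
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "budget": [113.0, 13_000_000.0, 40_000_000.0],
    "revenue": [115_000_000.0, 115_000_000.0, 60_000_000.0],
})

# Assumption: return = revenue / budget.
toy["return"] = toy["revenue"] / toy["budget"]

# HYPOTHETICAL threshold: budgets below this are treated as probable data-entry errors.
MIN_PLAUSIBLE_BUDGET = 100_000
suspicious = toy[toy["budget"] < MIN_PLAUSIBLE_BUDGET]
print(suspicious[["title", "budget", "return"]])
```

A tiny budget in the denominator produces a return in the millions, which is enough to distort any average it enters.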

**With that, I will conclude my Exploratory Data Analysis. Let's leverage the insights gained to develop valuable predictive models.**

## **Regression: Predicting Movie Revenues**

In this section, I will build a **regression model** to forecast movie revenues. **Recognizing it's not my primary focus, I won't delve deeply into feature engineering or hyperparameter tuning.**

Predicting movie revenues is a well-explored area in Machine Learning, with extensive literature available. Many models leverage potent features like Facebook Page Likes, Twitter activity, YouTube Trailer metrics, and various rating systems. Since we lack these, we'll use TMDB's **Popularity Score** and **Vote Average** as proxies for popularity. However, it's crucial to note that these metrics won't be available for unreleased movies in real-world scenarios.


In [None]:
rgf = df[df['return'].notnull()]
rgf.shape

My dataset for this task comprises **5393 records**.

**Let's review our features and eliminate any unnecessary ones.**

In [None]:
rgf.columns

In [None]:
rgf = rgf.drop(['id', 'overview', 'poster_path', 'release_date', 'status', 'tagline', 'video', 'return', 'crew'], axis=1)

We will perform the following feature engineering tasks:

1. **belongs_to_collection** will be turned into a Boolean variable. 1 indicates a movie is part of a collection whereas 0 indicates it is not.
2. **genres** will be converted into number of genres, alongside one indicator column per genre.
3. **homepage** will be converted into a Boolean variable that will indicate if a movie has a homepage or not.
4. **original_language** will be replaced by a feature called **is_english** to denote if a particular film is in English or a Foreign Language.
5. **production_companies** will be replaced with just the number of production companies collaborating to make the movie.
6. **production_countries** will be replaced with the number of countries the film was shot in.
7. **day** will be converted into a binary feature to indicate if the film was released on a Friday.
8. **month** will be converted into a variable that indicates if the month was a holiday season.
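Several of the steps listed above can be sketched on a two-row toy frame (the values are invented). One detail worth illustrating is the missing-value check: NaN never compares equal to anything, including itself, so presence flags are built here with `notnull()` rather than an `== np.nan` comparison.

```python
import numpy as np
import pandas as pd

# Two invented films with complementary missing values.
toy = pd.DataFrame({
    "belongs_to_collection": ["Some Collection", np.nan],
    "homepage": [np.nan, "http://example.com"],
    "original_language": ["en", "fr"],
    "day": ["Fri", "Tue"],
    "month": ["Jun", "Sep"],
})

# NaN == NaN is False, which is why equality checks cannot detect missing values.
nan_equals_itself = (np.nan == np.nan)

features = pd.DataFrame({
    "belongs_to_collection": toy["belongs_to_collection"].notnull().astype(int),
    "homepage": toy["homepage"].notnull().astype(int),
    "is_english": (toy["original_language"] == "en").astype(int),
    "is_Friday": (toy["day"] == "Fri").astype(int),
    "is_Holiday": toy["month"].isin(["Apr", "May", "Jun", "Nov"]).astype(int),
})
print(features)
```

The vectorised comparisons give the same 0/1 columns as row-wise `apply` calls, just without the Python-level loop.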

In [None]:
s = rgf.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_rgf = rgf.drop('genres', axis=1).join(s)
genres_train = gen_rgf['genre'].drop_duplicates()

In [None]:
def feature_engineering(df):
    # x == np.nan is always False, so test for missing values with notnull() instead
    df['belongs_to_collection'] = df['belongs_to_collection'].notnull().astype(int)
    for genre in genres_train:
        df['is_' + str(genre)] = df['genres'].apply(lambda x: 1 if genre in x else 0)
    df['genres'] = df['genres'].apply(lambda x: len(x))
    df['homepage'] = df['homepage'].notnull().astype(int)
    df['is_english'] = df['original_language'].apply(lambda x: 1 if x=='en' else 0)
    df = df.drop('original_language', axis=1)
    df['production_companies'] = df['production_companies'].apply(lambda x: len(x))
    df['production_countries'] = df['production_countries'].apply(lambda x: len(x))
    df['is_Friday'] = df['day'].apply(lambda x: 1 if x=='Fri' else 0)
    df = df.drop('day', axis=1)
    df['is_Holiday'] = df['month'].apply(lambda x: 1 if x in ['Apr', 'May', 'Jun', 'Nov'] else 0)
    df = df.drop('month', axis=1)
    df = df.drop(['title', 'cast', 'director'], axis=1)
    df = pd.get_dummies(df, prefix='is')
    df['runtime'] = df['runtime'].fillna(df['runtime'].mean())
    df['vote_average'] = df['vote_average'].fillna(df['vote_average'].mean())
    return df

In [None]:
X, y = rgf.drop('revenue', axis=1), rgf['revenue']

In [None]:
X = feature_engineering(X)

In [None]:
train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.75, test_size=0.25)

In [None]:
X.shape

In [None]:
reg = GradientBoostingRegressor()
reg.fit(train_X, train_y)
reg.score(test_X, test_y)

Our model achieves a coefficient of determination of **0.77**, indicating a strong performance for our basic model. Now, let's compare its score to that of a Dummy Regressor.

In [None]:
dummy = DummyRegressor()
dummy.fit(train_X, train_y)
dummy.score(test_X, test_y)


**My model significantly outperforms the Dummy Regressor.**
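To put the comparison in perspective: scikit-learn's `DummyRegressor` predicts the training mean by default, and `score` reports the coefficient of determination. The sketch below recomputes that score by hand on invented numbers (`r2_score_manual` is a helper written for this example) to show why such a baseline lands at or below zero.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Invented targets for a train/test split.
train_y = np.array([1.0, 2.0, 3.0, 4.0])
test_y = np.array([2.0, 3.0, 4.0, 7.0])

# The baseline ignores all features and always predicts the training mean.
baseline_pred = np.full_like(test_y, train_y.mean())
baseline_r2 = r2_score_manual(test_y, baseline_pred)
print(round(baseline_r2, 3))
```

A score of 0 means "no better than predicting the test mean", so any positive R² from the real model is a gain over this baseline.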

**Now, let's visualize feature importances with a bar plot to identify the most influential features in our predictions.**

In [None]:
# Assuming reg.feature_importances_ and X.columns are defined

sns.set_style('whitegrid')
plt.figure(figsize=(10, 12))

# Define a list of colors, one for each bar
colors = ['skyblue', 'lightgreen', 'salmon', 'gold', 'orchid', 'cornflowerblue', 'lightcoral', 'limegreen', 'dodgerblue', 'hotpink']

# Plot the bar plot with the specified colors
sns.barplot(x=reg.feature_importances_, y=X.columns, palette=colors)
plt.show()



The most influential feature in our Gradient Boosting Model turns out to be **vote_count**, affirming the significance of popularity metrics in revenue prediction. Budget follows as the second most important, succeeded by **Popularity** (an explicit popularity metric) and **Crew Size**.



## **Classification: Predicting Movie Success**

What determines whether a movie will recoup its investment? To answer this, I will build a binary classifier to predict a movie's profitability. Like our regression model, this classifier will use some idealized features due to the lack of real-world popularity metrics.

While I have extensively analyzed our data, I haven't yet pinpointed the factors that drive a movie's success. In this section, I will identify these factors and then construct our model.

In [None]:
cls = df[df['return'].notnull()]
cls.shape

In [None]:
cls.columns

In [None]:
cls = cls.drop(['id', 'overview', 'poster_path', 'release_date', 'status', 'tagline', 'revenue'], axis=1)

Let's convert our **return** feature into a binary variable: **0** for a flop and **1** for a hit.

In [None]:
cls['return'] = cls['return'].apply(lambda x: 1 if x >=1 else 0)

In [None]:
cls['return'].value_counts()

Our classes are fairly balanced, so no additional methods are needed to address class imbalance. Now, let's focus on our features.
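The labelling rule and the balance check can be illustrated on a handful of invented return values. This is a sketch; the threshold of 1 mirrors the cell above, where a film counts as a hit once it earns back at least its budget.

```python
import pandas as pd

# Invented return values (revenue divided by budget).
returns = pd.Series([0.2, 0.9, 1.0, 1.5, 3.0, 12.0])

# A return of at least 1 means the film recouped its budget: label it 1 (hit), else 0 (flop).
is_hit = (returns >= 1).astype(int)

# Share of each class; values near 0.5 would indicate balanced classes.
class_share = is_hit.value_counts(normalize=True)
print(class_share)
```

Looking at shares rather than raw counts makes it easy to compare the balance against the accuracy of a trivial classifier later on.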

In [None]:
cls['belongs_to_collection'] = cls['belongs_to_collection'].fillna('').apply(lambda x: 0 if x == '' else 1)

In [None]:
sns.set(style="whitegrid")
g = sns.PairGrid(data=cls, x_vars=['belongs_to_collection'], y_vars='return', height=5)
g.map(sns.pointplot, color=sns.xkcd_rgb["plum"])
g.set(ylim=(0, 1))


Movies that are part of a franchise tend to have a higher chance of success.

In [None]:
# Binarize homepage (1 if the film has one, 0 otherwise); popularity stays numeric for the model
cls['homepage'] = cls['homepage'].fillna('').apply(lambda x: 0 if x == '' else 1)

In [None]:
sns.set(style="whitegrid")
g = sns.PairGrid(data=cls, x_vars=['homepage'], y_vars='return', height=5)
g.map(sns.pointplot, color=sns.xkcd_rgb["plum"])
g.set(ylim=(0, 1))

Having a homepage makes little difference to the probability of success. To avoid the curse of dimensionality, I will eliminate this feature since it's not particularly useful.

In [None]:
s = cls.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_cls = cls.drop('genres', axis=1).join(s)

In [None]:
ctab = pd.crosstab([gen_cls['genre']], gen_cls['return'], dropna=False).apply(lambda x: x/x.sum(), axis=1)
ctab.plot(kind='bar', stacked=True, legend=False)

**TV movies** show a 0% failure rate, likely due to their small sample size. **Foreign films**, however, have a higher than average failure rate. Since no genre shows a drastic pattern, we'll proceed with one-hot encoding for all genres.

In [None]:
cls.columns

In [None]:
def classification_engineering(df):
    for genre in genres_train:
        df['is_' + str(genre)] = df['genres'].apply(lambda x: 1 if genre in x else 0)
    df['genres'] = df['genres'].apply(lambda x: len(x))
    df = df.drop('homepage', axis=1)
    df['is_english'] = df['original_language'].apply(lambda x: 1 if x=='en' else 0)
    df = df.drop('original_language', axis=1)
    df['production_companies'] = df['production_companies'].apply(lambda x: len(x))
    df['production_countries'] = df['production_countries'].apply(lambda x: len(x))
    df['is_Friday'] = df['day'].apply(lambda x: 1 if x=='Fri' else 0)
    df = df.drop('day', axis=1)
    df['is_Holiday'] = df['month'].apply(lambda x: 1 if x in ['Apr', 'May', 'Jun', 'Nov'] else 0)
    df = df.drop('month', axis=1)
    df = df.drop(['title', 'cast', 'director'], axis=1)
    #df = pd.get_dummies(df, prefix='is')
    df['runtime'] = df['runtime'].fillna(df['runtime'].mean())
    df['vote_average'] = df['vote_average'].fillna(df['vote_average'].mean())
    df = df.drop('crew', axis=1)
    return df


In [None]:
cls = classification_engineering(cls)

In [None]:
cls.columns

In [None]:
X, y = cls.drop('return', axis=1), cls['return']

In [None]:
train_X, test_X, train_y, test_y = train_test_split(X, y, train_size=0.75, test_size=0.25, stratify=y)

In [None]:
clf = GradientBoostingClassifier()
clf.fit(train_X, train_y)
clf.score(test_X, test_y)

My basic **Gradient Boosting Classifier** achieves **80%** accuracy. Although hyperparameter tuning and advanced feature engineering could improve the model, I will skip these steps as they aren't the primary focus of this project.
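An accuracy figure only means something relative to a trivial baseline. The sketch below uses invented labels to show what a most-frequent-class baseline does: it always predicts whichever class was most common in training.

```python
from collections import Counter

# Invented labels: 1 = hit, 0 = flop.
train_labels = [1, 1, 1, 0, 0, 1]
test_labels = [1, 0, 1, 1]

# The baseline memorises the most common training class and predicts it for every film.
majority_class = Counter(train_labels).most_common(1)[0][0]

# Its accuracy equals the share of test films that happen to belong to that class.
baseline_accuracy = sum(label == majority_class for label in test_labels) / len(test_labels)
print(majority_class, baseline_accuracy)
```

If the classes were heavily skewed, this baseline alone could reach a high accuracy, which is why it is worth comparing against before reading too much into the classifier's score.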

In [None]:
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(train_X, train_y)
dummy.score(test_X, test_y)

In [None]:

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 12))

# Define a color palette with different colors
colors = sns.color_palette("husl", len(X.columns))

# Plot the bar plot with the specified palette
sns.barplot(x=clf.feature_importances_, y=X.columns, palette=colors)

plt.show()



**Vote Count** is our classifier's most significant feature, followed by **Budget**, **Popularity**, and **Year**. This concludes our discussion on the classification model.

Next, I will build a **Hybrid Recommendation System** combining popularity, content, and collaborative filtering, using both the MovieLens dataset and the TMDB Movies Metadata.