# AHJIN STUDIOS: Box Office Success Blueprint

## Project Overview

Ahjin Studios is taking a bold leap into the world of original video content. With major players in the industry producing record-breaking films, it's time we carve our own path to the silver screen. But before the cameras roll, we need to ground our creativity in strategy. This project explores which types of films are dominating the box office - genres, themes, release seasons, production budgets, and more - to identify what’s *actually working* in today’s film market.

## Business Problem

The entertainment industry is undergoing a massive transformation. Streaming giants and traditional studios alike are pouring billions into original content, resulting in a saturated, competitive, and fast-evolving market. Ahjin Studios, a newcomer in this arena, wants to make a strong, strategic entrance. But without prior experience in filmmaking or content production, the studio lacks a grounded understanding of what drives box office success.

While creativity is the soul of cinema, data is its compass. The financial risk of producing a film is substantial: production budgets often range from millions to hundreds of millions of dollars, with no guaranteed return. Choosing the wrong genre, misreading audience interests, or releasing at the wrong time can spell disaster. Conversely, aligning a film's concept with market demand can lead to runaway hits, brand recognition, and long-term profitability.

Ahjin Studios needs clear, evidence-based insights to answer critical questions:

- What types of movies are **worth betting on**?
- Where can we find the **sweet spot between budget and revenue**?
- Which trends are **passing fads**, and which are **sustainable opportunities**?
- How can a **new studio** stand out in a market dominated by legacy franchises and big-name talent?

By conducting a comprehensive analysis of recent box office performance, this project aims to **remove the guesswork** from movie production decisions and provide Ahjin Studios with a **strategic blueprint** for launching commercially viable, audience-ready films that can hold their own in today’s high-stakes entertainment landscape.

## Objective

To analyze recent box office trends and translate key findings into **actionable, data-driven recommendations** that will guide Ahjin Studios in developing high-performing original films.

## Key Questions

- Which **genres** are consistently earning the highest revenue?
- What **budget range** yields the best ROI?
- How do **release dates** affect performance?
- Do **star power** and **director reputation** play a measurable role?
- Are **franchise films** outperforming standalones?
- What **audience demographics** are driving ticket sales?

## Deliverables

- A cleaned dataset and an exploratory analysis of recent box office films
- Visual breakdowns of top-performing genres, budgets, and seasons
- A concise summary report with **strategic recommendations** for Ahjin Studios

## Final Goal

To provide the leadership team at Ahjin Studios with a **clear roadmap for movie production** - one that maximizes commercial success while carving out a unique space in the entertainment industry.

> Lights, camera... data! Let's get to work.

## INITIAL DATA UNDERSTANDING

### 1. BUDGET DATASET

In [122]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [123]:
# Load the dataset
budget_df = pd.read_csv('C:\\Users\\lenovo\\OneDrive\\Desktop\\DS\\PROJECTS\\Data-s-Cut3\\tn.movie_budgets.csv', encoding = 'Latin1')
budget_df.head()

Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [124]:
# Display the shape
print(f"The dataset has {budget_df.shape[0]} rows and {budget_df.shape[1]} columns.")

The dataset has 5782 rows and 6 columns.


In [125]:
# Display column names
budget_df.columns

Index(['id', 'release_date', 'movie', 'production_budget', 'domestic_gross',
       'worldwide_gross'],
      dtype='object')

In [126]:
# Get metadata
budget_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [127]:
# Display descriptive statistics for object-dtype columns (the monetary columns are still stored as strings here)
budget_df.describe(include = 'O').T

Unnamed: 0,count,unique,top,freq
release_date,5782,2418,"Dec 31, 2014",24
movie,5782,5698,King Kong,3
production_budget,5782,509,"$20,000,000",231
domestic_gross,5782,5164,$0,548
worldwide_gross,5782,5356,$0,367


In [128]:
# Check for duplicates and null values
print("Duplicates:", budget_df.duplicated().sum())
print("\nNull Values:\n", budget_df.isna().sum())

Duplicates: 0

Null Values:
 id                   0
release_date         0
movie                0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64
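
The `info()` output above shows `production_budget`, `domestic_gross`, and `worldwide_gross` stored as `object`, and the `head()` output shows why: the values are strings such as `"$425,000,000"`. The sketch below is one possible way to make them numeric, parse `release_date`, and derive a simple ROI. It runs on a small stand-in frame built from three rows of the `head()` output rather than on the full file. The names `sample_budget`, `clean_money`, `cleaned_budget`, and `roi` are illustrative assumptions, not part of the original notebook, and the ROI definition (worldwide gross minus budget, divided by budget) is one common convention rather than a method the notebook has fixed.

```python
import pandas as pd

# Stand-in frame: three rows copied from the head() output above
sample_budget = pd.DataFrame({
    "release_date": ["Dec 18, 2009", "May 20, 2011", "Jun 7, 2019"],
    "movie": ["Avatar", "Pirates of the Caribbean: On Stranger Tides", "Dark Phoenix"],
    "production_budget": ["$425,000,000", "$410,600,000", "$350,000,000"],
    "domestic_gross": ["$760,507,625", "$241,063,875", "$42,762,350"],
    "worldwide_gross": ["$2,776,345,279", "$1,045,663,875", "$149,762,350"],
})


def clean_money(series):
    """Strip '$' and ',' from currency strings and return 64-bit integers."""
    # int64 is needed: worldwide grosses above ~2.1 billion overflow int32
    return series.str.replace(r"[$,]", "", regex=True).astype("int64")


money_cols = ["production_budget", "domestic_gross", "worldwide_gross"]
cleaned_budget = sample_budget.copy()
for col in money_cols:
    cleaned_budget[col] = clean_money(cleaned_budget[col])

# Parse dates written like "Dec 18, 2009"
cleaned_budget["release_date"] = pd.to_datetime(
    cleaned_budget["release_date"], format="%b %d, %Y"
)

# Assumed ROI definition: (worldwide gross - budget) / budget
cleaned_budget["roi"] = (
    cleaned_budget["worldwide_gross"] - cleaned_budget["production_budget"]
) / cleaned_budget["production_budget"]

print(cleaned_budget[["movie", "release_date", "production_budget", "roi"]])
```

Applied to `budget_df`, the same steps would also turn the frequent `$0` gross values seen in the `describe` output into numeric zeros. Whether those zeros mean "no revenue" or "not recorded" is a separate question that this sketch does not settle.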


### 2. GROSS DATASET

In [129]:
# Load gross dataset
gross_df = pd.read_csv('C:\\Users\\lenovo\\OneDrive\\Desktop\\DS\\PROJECTS\\Data-s-Cut3\\bom.movie_gross.csv')
gross_df.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [130]:
# Display the shape 
print(f"The dataset has {gross_df.shape[0]} rows and {gross_df.shape[1]} columns.")

The dataset has 3387 rows and 5 columns.


In [131]:
# Get metadata 
gross_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [132]:
# Get basic statistics for numerical columns
gross_df.describe()

Unnamed: 0,domestic_gross,year
count,3359.0,3387.0
mean,28745850.0,2013.958075
std,66982500.0,2.478141
min,100.0,2010.0
25%,120000.0,2012.0
50%,1400000.0,2014.0
75%,27900000.0,2016.0
max,936700000.0,2018.0


In [133]:
# Get basic statistics for categorical columns
gross_df.describe(include = 'O').T

Unnamed: 0,count,unique,top,freq
title,3387,3386,Bluebeard,2
studio,3382,257,IFC,166
foreign_gross,2037,1204,1200000,23


In [134]:
# Check for duplicates and null values
print("Duplicates:", gross_df.duplicated().sum())
print("\nNull Values:\n", gross_df.isna().sum())

Duplicates: 0

Null Values:
 title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64
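
In the output above, `foreign_gross` is read as `object` while `domestic_gross` is `float64`, and 1350 of the 3387 `foreign_gross` values are missing. The sketch below shows one way to coerce the column to numbers and build a combined total without treating a missing value as zero. It assumes the `object` dtype comes from thousands separators in some values; that is a guess to confirm against the real file. The rows labelled "Hypothetical" are invented for illustration, as are the names `sample_gross`, `cleaned_gross`, and `total_gross`.

```python
import pandas as pd

# Stand-in frame: two rows from the head() output plus two hypothetical rows
sample_gross = pd.DataFrame({
    "title": ["Toy Story 3", "Inception", "Hypothetical A", "Hypothetical B"],
    "domestic_gross": [415000000.0, 292600000.0, 1200000.0, None],
    # "1,200" is a made-up value with a thousands separator (assumption)
    "foreign_gross": ["652000000", "535700000", "1,200", None],
})

cleaned_gross = sample_gross.copy()

# Remove commas, then convert; anything still unparseable becomes NaN
cleaned_gross["foreign_gross"] = pd.to_numeric(
    cleaned_gross["foreign_gross"].str.replace(",", "", regex=False),
    errors="coerce",
)

# min_count=2 keeps the total as NaN unless both components are present,
# so a missing foreign gross is not silently counted as zero
cleaned_gross["total_gross"] = cleaned_gross[
    ["domestic_gross", "foreign_gross"]
].sum(axis=1, min_count=2)

print(cleaned_gross)
```

Keeping the total missing when either component is missing is a conservative choice. With roughly 40% of `foreign_gross` absent, filling with zero would understate worldwide revenue for many titles.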


### 3. ROTTEN TOMATOES MOVIE INFORMATION DATASET

In [135]:
# Load Rotten Tomatoes movie info dataset
rt_info_df = pd.read_csv('C:\\Users\\lenovo\\OneDrive\\Desktop\\DS\\PROJECTS\\Data-s-Cut3\\rt.movie_info.tsv', sep = '\t')
rt_info_df.tail()

Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
1555,1996,Forget terrorists or hijackers -- there's a ha...,R,Action and Adventure|Horror|Mystery and Suspense,,,"Aug 18, 2006","Jan 2, 2007",$,33886034.0,106 minutes,New Line Cinema
1556,1997,The popular Saturday Night Live sketch was exp...,PG,Comedy|Science Fiction and Fantasy,Steve Barron,Terry Turner|Tom Davis|Dan Aykroyd|Bonnie Turner,"Jul 23, 1993","Apr 17, 2001",,,88 minutes,Paramount Vantage
1557,1998,"Based on a novel by Richard Powell, when the l...",G,Classics|Comedy|Drama|Musical and Performing Arts,Gordon Douglas,,"Jan 1, 1962","May 11, 2004",,,111 minutes,
1558,1999,The Sandlot is a coming-of-age story about a g...,PG,Comedy|Drama|Kids and Family|Sports and Fitness,David Mickey Evans,David Mickey Evans|Robert Gunter,"Apr 1, 1993","Jan 29, 2002",,,101 minutes,
1559,2000,"Suspended from the force, Paris cop Hubert is ...",R,Action and Adventure|Art House and Internation...,,Luc Besson,"Sep 27, 2001","Feb 11, 2003",,,94 minutes,Columbia Pictures


In [136]:
# Display the shape
print(f"The dataset has {rt_info_df.shape[0]} rows and {rt_info_df.shape[1]} columns.")

The dataset has 1560 rows and 12 columns.


In [137]:
# Display column names
rt_info_df.columns

Index(['id', 'synopsis', 'rating', 'genre', 'director', 'writer',
       'theater_date', 'dvd_date', 'currency', 'box_office', 'runtime',
       'studio'],
      dtype='object')

In [138]:
# Get metadata
rt_info_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB


In [139]:
# Display descriptive statistics for categorical columns
rt_info_df.describe(include = 'O').T

Unnamed: 0,count,unique,top,freq
synopsis,1498,1497,A group of air crash survivors are stranded in...,2
rating,1557,6,R,521
genre,1552,299,Drama,151
director,1361,1125,Steven Spielberg,10
writer,1111,1069,Woody Allen,4
theater_date,1201,1025,"Jan 1, 1987",8
dvd_date,1201,717,"Jun 1, 2004",11
currency,340,1,$,340
box_office,340,336,200000,2
runtime,1530,142,90 minutes,72


In [140]:
# Check for duplicates and null values
print("Duplicates:", rt_info_df.duplicated().sum())
print("\nNull Values:\n", rt_info_df.isna().sum())

Duplicates: 0

Null Values:
 id                 0
synopsis          62
rating             3
genre              8
director         199
writer           449
theater_date     359
dvd_date         359
currency        1220
box_office      1220
runtime           30
studio          1066
dtype: int64
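
Two columns in this dataset need reshaping before any genre-level analysis: `genre` packs several genres into one pipe-separated string (299 unique combinations in the `describe` output), and `runtime` is text such as `"90 minutes"`. The sketch below shows one possible approach on a stand-in frame built from rows in the `tail()` output, plus one hypothetical row with missing values. The names `sample_info`, `info_clean`, `runtime_min`, `genres_long`, and `genre_counts` are illustrative assumptions.

```python
import pandas as pd

# Stand-in frame: ids 1997-1999 come from the tail() output above;
# the last row (id -1) is hypothetical and only exercises the missing-value path
sample_info = pd.DataFrame({
    "id": [1997, 1998, 1999, -1],
    "genre": [
        "Comedy|Science Fiction and Fantasy",
        "Classics|Comedy|Drama|Musical and Performing Arts",
        "Comedy|Drama|Kids and Family|Sports and Fitness",
        None,
    ],
    "runtime": ["88 minutes", "111 minutes", "101 minutes", None],
})

info_clean = sample_info.copy()

# Pull the leading digits out of strings like "88 minutes"
info_clean["runtime_min"] = pd.to_numeric(
    info_clean["runtime"].str.extract(r"(\d+)", expand=False),
    errors="coerce",
)

# One row per (film, genre): split on "|" and explode the resulting lists
genres_long = (
    info_clean.dropna(subset=["genre"])
    .assign(genre=lambda d: d["genre"].str.split("|"))
    .explode("genre")
    .reset_index(drop=True)
)

genre_counts = genres_long["genre"].value_counts()
print(genre_counts)
```

The long format counts a film once under each of its genres, so genre totals will sum to more than the number of films. That is usually what a "which genres perform best" question wants, but it should be stated alongside any chart built from it.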


### 4. ROTTEN TOMATOES MOVIE REVIEWS DATASET

In [141]:
# Load Rotten Tomatoes reviews dataset  
rt_reviews_df = pd.read_csv('C:\\Users\\lenovo\\OneDrive\\Desktop\\DS\\PROJECTS\\Data-s-Cut3\\rt.reviews.tsv', sep = '\t', encoding = 'Latin1')
rt_reviews_df.tail()

Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
54427,2000,The real charm of this trifle is the deadpan c...,,fresh,Laura Sinagra,1,Village Voice,"September 24, 2002"
54428,2000,,1/5,rotten,Michael Szymanski,0,Zap2it.com,"September 21, 2005"
54429,2000,,2/5,rotten,Emanuel Levy,0,EmanuelLevy.Com,"July 17, 2005"
54430,2000,,2.5/5,rotten,Christopher Null,0,Filmcritic.com,"September 7, 2003"
54431,2000,,3/5,fresh,Nicolas Lacroix,0,Showbizz.net,"November 12, 2002"


In [142]:
# Display the column names
rt_reviews_df.columns

Index(['id', 'review', 'rating', 'fresh', 'critic', 'top_critic', 'publisher',
       'date'],
      dtype='object')

In [143]:
# Display the shape
print(f'The dataset has {rt_reviews_df.shape[0]} rows and {rt_reviews_df.shape[1]} columns.')

The dataset has 54432 rows and 8 columns.


In [144]:
# Get metadata
rt_reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB


In [145]:
# Display descriptive statistics for numerical columns
rt_reviews_df.describe()

Unnamed: 0,id,top_critic
count,54432.0,54432.0
mean,1045.706882,0.240594
std,586.657046,0.427448
min,3.0,0.0
25%,542.0,0.0
50%,1083.0,0.0
75%,1541.0,0.0
max,2000.0,1.0


In [146]:
# Display descriptive statistics for categorical columns
rt_reviews_df.describe(include = 'O').T

Unnamed: 0,count,unique,top,freq
review,48869,48682,Parental Content Review,24
rating,40915,186,3/5,4327
fresh,54432,2,fresh,33035
critic,51710,3496,Emanuel Levy,595
publisher,54123,1281,eFilmCritic.com,673
date,54432,5963,"January 1, 2000",4303


In [147]:
# Check for duplicates and null values
print('Duplicates:', rt_reviews_df.duplicated().sum())
print('\nNull Values:\n', rt_reviews_df.isna().sum())

Duplicates: 9

Null Values:
 id                0
review         5563
rating        13517
fresh             0
critic         2722
top_critic        0
publisher       309
date              0
dtype: int64
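
The checks above report 9 fully duplicated rows, and the `rating` column holds 186 distinct text values such as `3/5` and `2.5/5`. The sketch below shows one way to drop the exact duplicates and turn fraction-style ratings into a score between 0 and 1. It only handles the `x/y` form seen in the `tail()` output; any other form among the 186 values is left as `NaN` rather than guessed at. The stand-in frame reuses rows from the `tail()` output, with the last row duplicated on purpose, and the names `sample_reviews`, `deduped`, `rating_score`, and `is_fresh` are illustrative assumptions.

```python
import pandas as pd

# Stand-in frame: rows from the tail() output; the final row is a deliberate
# exact duplicate added for illustration
sample_reviews = pd.DataFrame({
    "id": [2000, 2000, 2000, 2000, 2000],
    "rating": [None, "1/5", "2.5/5", "3/5", "3/5"],
    "fresh": ["fresh", "rotten", "rotten", "fresh", "fresh"],
    "critic": [
        "Laura Sinagra",
        "Michael Szymanski",
        "Christopher Null",
        "Nicolas Lacroix",
        "Nicolas Lacroix",
    ],
})

# Remove rows that are identical in every column
deduped = sample_reviews.drop_duplicates().reset_index(drop=True)

# Capture numerator and denominator of ratings written as "x/y"
parts = deduped["rating"].str.extract(
    r"^\s*(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)\s*$"
).astype(float)

# Normalise to the 0-1 range; a zero or missing denominator yields NaN
deduped["rating_score"] = (parts[0] / parts[1]).where(parts[1] > 0)

# Boolean flag that is easier to aggregate than the "fresh"/"rotten" strings
deduped["is_fresh"] = deduped["fresh"].eq("fresh")

print(deduped)
```

Because `fresh` has no missing values while `rating` is missing in 13517 rows, the boolean flag is likely the more complete signal for per-film aggregation, with `rating_score` as a supplement where it exists.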


### 5. TMDB MOVIES DATASET

In [148]:
# Load TMDB movies dataset
tmdb_df = pd.read_csv('C:\\Users\\lenovo\\OneDrive\\Desktop\\DS\\PROJECTS\\Data-s-Cut3\\tmdb.movies.csv', index_col = 0)
tmdb_df.head()

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [149]:
# Display the shape
print(f'The dataset has {tmdb_df.shape[0]} rows and {tmdb_df.shape[1]} columns.')

The dataset has 26517 rows and 9 columns.


In [150]:
# Display column names
tmdb_df.columns

Index(['genre_ids', 'id', 'original_language', 'original_title', 'popularity',
       'release_date', 'title', 'vote_average', 'vote_count'],
      dtype='object')

In [151]:
# Get metadata
tmdb_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26517 entries, 0 to 26516
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   genre_ids          26517 non-null  object 
 1   id                 26517 non-null  int64  
 2   original_language  26517 non-null  object 
 3   original_title     26517 non-null  object 
 4   popularity         26517 non-null  float64
 5   release_date       26517 non-null  object 
 6   title              26517 non-null  object 
 7   vote_average       26517 non-null  float64
 8   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 2.0+ MB


In [152]:
# Display descriptive statistics for numerical columns
tmdb_df.describe()

Unnamed: 0,id,popularity,vote_average,vote_count
count,26517.0,26517.0,26517.0,26517.0
mean,295050.15326,3.130912,5.991281,194.224837
std,153661.615648,4.355229,1.852946,960.961095
min,27.0,0.6,0.0,1.0
25%,157851.0,0.6,5.0,2.0
50%,309581.0,1.374,6.0,5.0
75%,419542.0,3.694,7.0,28.0
max,608444.0,80.773,10.0,22186.0


In [153]:
# Display descriptive statistics for categorical columns
tmdb_df.describe(include = 'O').T

Unnamed: 0,count,unique,top,freq
genre_ids,26517,2477,[99],3700
original_language,26517,76,en,23291
original_title,26517,24835,Eden,7
release_date,26517,3433,2010-01-01,269
title,26517,24688,Eden,7


In [154]:
# Check for duplicates and null values
print('Duplicates:', tmdb_df.duplicated().sum())
print('\nNull Values:\n', tmdb_df.isna().sum())

Duplicates: 1020

Null Values:
 genre_ids            0
id                   0
original_language    0
original_title       0
popularity           0
release_date         0
title                0
vote_average         0
vote_count           0
dtype: int64
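
This dataset has no missing values but 1020 fully duplicated rows, and `genre_ids` is an `object` column whose entries look like `[12, 14, 10751]`. Since the file is read from CSV, those entries are presumably strings rather than Python lists; the sketch below relies on that assumption. It shows one way to drop the duplicates, parse the lists, and parse `release_date`. The stand-in frame uses two rows from the `head()` output, one of them repeated on purpose, and the names `sample_tmdb`, `tmdb_clean`, and `genre_long` are illustrative.

```python
import ast

import pandas as pd

# Stand-in frame: rows from the head() output; "Iron Man 2" is repeated
# deliberately to mimic a fully duplicated row
sample_tmdb = pd.DataFrame({
    "genre_ids": ["[12, 14, 10751]", "[12, 28, 878]", "[12, 28, 878]"],
    "id": [12444, 10138, 10138],
    "title": [
        "Harry Potter and the Deathly Hallows: Part 1",
        "Iron Man 2",
        "Iron Man 2",
    ],
    "release_date": ["2010-11-19", "2010-05-07", "2010-05-07"],
})

# Drop duplicates while genre_ids is still a string: lists are unhashable,
# so drop_duplicates would fail after parsing
tmdb_clean = sample_tmdb.drop_duplicates().reset_index(drop=True)

# Safely evaluate the list literals (assumes every entry is a valid literal)
tmdb_clean["genre_ids"] = tmdb_clean["genre_ids"].apply(ast.literal_eval)

# ISO-formatted dates parse directly
tmdb_clean["release_date"] = pd.to_datetime(
    tmdb_clean["release_date"], format="%Y-%m-%d"
)

# One row per (film, genre id) for later grouping
genre_long = tmdb_clean.explode("genre_ids")

print(genre_long[["title", "genre_ids", "release_date"]])
```

The numeric genre ids still need a lookup table to become readable names; that mapping is not part of the data loaded here, so it is left out of the sketch.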


## OBSERVATIONS AND EARLY INSIGHTS