![imdb logo](imdb%20logo.jpg)

# **Introduction**
This project uses the IMDb Movies dataset in long format, with one row per film and genre. Each row records the movie title, release date, runtime, rating, number of voters, approval index, budget, domestic and international revenue, the difference between revenue and budget, and an indicator that marks the film as a profit or a loss.

The analysis is written as SQL queries in a Jupyter Notebook and explores how movie genres relate to other aspects of the film industry: genre-specific trends, profitability across genres, audience preferences, and budget allocation.

Join me as I navigate the cinematic landscape, guided by data, and see what the IMDb Movies dataset in its long format can tell us.

The data visualization dashboard (Tableau) and the dataset can be accessed at the following links:
1. Tableau Public : https://tinyurl.com/moviesimdbdashbord
2. Dataset : https://tinyurl.com/imdbdatasets

- First, I will display the first 10 rows of the dataset :

In [1]:
SELECT * FROM 'imdb_movies_dataset - long_data_genre.csv'
LIMIT 10;

Unnamed: 0,movie_title,genre,release_date,year,runtime,rating,number_of_voters,approval_index,budget,domestic_revenue,international_revenue,difference_of_revenue,indicator
0,80 for Brady (2023),Comedy,2023-02-03 00:00:00+00:00,2023,98,5.2,359,2.119074,28000000,37688000,37688000,47376000,Profit
1,80 for Brady (2023),Drama,2023-02-03 00:00:00+00:00,2023,98,5.2,359,2.119074,28000000,37688000,37688000,47376000,Profit
2,80 for Brady (2023),Sport,2023-02-03 00:00:00+00:00,2023,98,5.2,359,2.119074,28000000,37688000,37688000,47376000,Profit
3,A Man Called Otto (2022),Comedy,2022-12-25 00:00:00+00:00,2022,126,7.6,13059,5.167562,50000000,62802303,104002303,116804606,Profit
4,A Man Called Otto (2022),Drama,2022-12-25 00:00:00+00:00,2022,126,7.6,13059,5.167562,50000000,62802303,104002303,116804606,Profit
5,Amsterdam (2022),Comedy,2022-10-06 00:00:00+00:00,2022,134,6.1,62180,4.82202,80000000,14947969,29400826,-35651205,Loss
6,Amsterdam (2022),Drama,2022-10-06 00:00:00+00:00,2022,134,6.1,62180,4.82202,80000000,14947969,29400826,-35651205,Loss
7,Amsterdam (2022),History,2022-10-06 00:00:00+00:00,2022,134,6.1,62180,4.82202,80000000,14947969,29400826,-35651205,Loss
8,Armageddon Time (2022),Drama,2022-09-14 00:00:00+00:00,2022,114,6.6,8652,4.270361,15000000,1860050,6076886,-7063064,Loss
9,Avatar: The Way of Water (2022),Action,2022-12-09 00:00:00+00:00,2022,192,7.8,277543,7.061101,460000000,667830256,2265935552,2473765808,Profit


The data above shows that many titles are repeated: each film appears once for every genre it belongs to.
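Because a film is repeated once per genre, counts and averages taken over raw rows weight multi-genre films more heavily. The following is a minimal sketch of that effect, assuming Python's built-in `sqlite3` module and a small in-memory table named `movies` (a hypothetical stand-in for the CSV) filled with a few rows from the output above.

```python
import sqlite3

# Sample rows copied from the first query's output: (title, genre, rating).
rows = [
    ("80 for Brady (2023)", "Comedy", 5.2),
    ("80 for Brady (2023)", "Drama", 5.2),
    ("80 for Brady (2023)", "Sport", 5.2),
    ("A Man Called Otto (2022)", "Comedy", 7.6),
    ("A Man Called Otto (2022)", "Drama", 7.6),
    ("Amsterdam (2022)", "Comedy", 6.1),
    ("Amsterdam (2022)", "Drama", 6.1),
    ("Amsterdam (2022)", "History", 6.1),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE movies (movie_title TEXT, genre TEXT, rating REAL)")
con.executemany("INSERT INTO movies VALUES (?, ?, ?)", rows)

# Rows vs. distinct films: the long format repeats a film once per genre.
row_count, film_count = con.execute(
    "SELECT COUNT(*), COUNT(DISTINCT movie_title) FROM movies"
).fetchone()

# Averaging over raw rows counts a three-genre film three times.
avg_over_rows = con.execute("SELECT AVG(rating) FROM movies").fetchone()[0]

# Collapsing to one row per film first gives every film the same weight.
avg_over_films = con.execute(
    """
    SELECT AVG(rating)
    FROM (SELECT movie_title, MAX(rating) AS rating
          FROM movies
          GROUP BY movie_title)
    """
).fetchone()[0]

print(row_count, film_count)
print(round(avg_over_rows, 4), round(avg_over_films, 4))
```

Grouping by `movie_title` before aggregating, as several later queries do, is one way to avoid this double counting.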

In the first analysis, I will query the highest rating recorded in each year :

In [2]:
SELECT
	i.year AS year,
	MAX(i.rating) AS max_rating
FROM 'imdb_movies_dataset - long_data_genre.csv' AS i
GROUP BY i.year
ORDER BY year DESC;

Unnamed: 0,year,max_rating
0,2023,5.2
1,2022,8.3
2,2021,8.2
3,2020,8.2
4,2019,8.4
...,...,...
91,1929,5.6
92,1925,7.9
93,1920,4.3
94,1916,6.1
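The query above returns the highest rating per year but not the film that earned it. One possible way to recover the title is to join the rows back to the per-year maximum. This is only a sketch, assuming Python's `sqlite3` module and a hypothetical in-memory `movies` table holding the ten sample rows shown earlier, so its 2022 result reflects the sample rather than the full dataset.

```python
import sqlite3

# The ten sample rows from the first output: (title, genre, year, rating).
rows = [
    ("80 for Brady (2023)", "Comedy", 2023, 5.2),
    ("80 for Brady (2023)", "Drama", 2023, 5.2),
    ("80 for Brady (2023)", "Sport", 2023, 5.2),
    ("A Man Called Otto (2022)", "Comedy", 2022, 7.6),
    ("A Man Called Otto (2022)", "Drama", 2022, 7.6),
    ("Amsterdam (2022)", "Comedy", 2022, 6.1),
    ("Amsterdam (2022)", "Drama", 2022, 6.1),
    ("Amsterdam (2022)", "History", 2022, 6.1),
    ("Armageddon Time (2022)", "Drama", 2022, 6.6),
    ("Avatar: The Way of Water (2022)", "Action", 2022, 7.8),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE movies (movie_title TEXT, genre TEXT, year INTEGER, rating REAL)"
)
con.executemany("INSERT INTO movies VALUES (?, ?, ?, ?)", rows)

# Join each row to its year's maximum rating; DISTINCT collapses the
# per-genre duplicates, and ties within a year are all kept.
best_per_year = con.execute(
    """
    SELECT DISTINCT m.year, m.movie_title, m.rating
    FROM movies AS m
    JOIN (SELECT year, MAX(rating) AS max_rating
          FROM movies
          GROUP BY year) AS best
      ON m.year = best.year AND m.rating = best.max_rating
    ORDER BY m.year DESC, m.movie_title
    """
).fetchall()

for year, title, rating in best_per_year:
    print(year, title, rating)
```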


Next, I will analyze films by title, genre, and rating.
First, I will show the top 10 films by rating, grouped by title so that each film appears once regardless of genre :

In [3]:
SELECT
	i.movie_title AS title,
	ROUND(AVG(i.rating),1) AS rating,
	AVG(i.number_of_voters) AS voters_total,
	AVG(i.approval_index) AS approval_total
FROM 'imdb_movies_dataset - long_data_genre.csv' AS i
GROUP BY i.movie_title
ORDER BY rating DESC
LIMIT 10;

Unnamed: 0,title,rating,voters_total,approval_total
0,The Shawshank Redemption (1994),9.3,2695887.0,10.0
1,The Godfather (1972),9.2,1870922.0,9.64379
2,The Dark Knight (2008),9.0,2669915.0,9.666757
3,Solitude (2005),9.0,15.0,1.661377
4,The Lord of the Rings: The Return of the King ...,9.0,1857170.0,9.426391
5,Pulp Fiction (1994),8.9,2069502.0,9.391076
6,Inception (2010),8.8,2368570.0,9.37147
7,The Lord of the Rings: The Two Towers (2002),8.8,1677003.0,9.147916
8,The Lord of the Rings: The Fellowship of the R...,8.8,1886597.0,9.224165
9,Fight Club (1999),8.8,2141276.0,9.306151
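The list above includes Solitude (2005), whose 9.0 rating rests on only 15 voters. A common way to keep such titles out of a ranking is to require a minimum number of voters. The sketch below assumes Python's `sqlite3` module, a hypothetical in-memory `movies` table with five rows from the output above, and an arbitrary cutoff `MIN_VOTERS` that is not part of the original analysis.

```python
import sqlite3

# Rows taken from the output above: (title, rating, number_of_voters).
# The real table has one row per film and genre, hence the GROUP BY below.
rows = [
    ("The Shawshank Redemption (1994)", 9.3, 2695887),
    ("The Godfather (1972)", 9.2, 1870922),
    ("The Dark Knight (2008)", 9.0, 2669915),
    ("Solitude (2005)", 9.0, 15),
    ("Pulp Fiction (1994)", 8.9, 2069502),
]

MIN_VOTERS = 1000  # hypothetical cutoff, chosen only for illustration

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE movies (movie_title TEXT, rating REAL, number_of_voters INTEGER)"
)
con.executemany("INSERT INTO movies VALUES (?, ?, ?)", rows)

# Same shape as the notebook query, plus a HAVING filter on the vote count.
top_rated = con.execute(
    """
    SELECT movie_title,
           ROUND(AVG(rating), 1) AS avg_rating,
           AVG(number_of_voters) AS avg_voters
    FROM movies
    GROUP BY movie_title
    HAVING AVG(number_of_voters) >= ?
    ORDER BY avg_rating DESC, avg_voters DESC
    """,
    (MIN_VOTERS,),
).fetchall()

titles = [title for title, _, _ in top_rated]
print(titles)
```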


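Based on the query results above, The Shawshank Redemption (9.3) and The Godfather (9.2) hold the two highest ratings, followed by three films tied at 9.0: The Dark Knight (2008), Solitude (2005, with only 15 voters), and The Lord of the Rings: The Return of the King (2003). The 3 films highlighted here are: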
1. The Shawshank Redemption (1994)
2. The Godfather (1972)
3. The Lord of the Rings: The Return of the King (2003)

### **The Shawshank Redemption (1994)**
The Shawshank Redemption (1994) has the highest rating, 9.3. It is a classic drama directed by Frank Darabont, renowned for its compelling storyline, memorable characters, and its exploration of hope and redemption within the confines of a prison.


### **The Godfather (1972)**
The Godfather (1972) comes second with a 9.2 rating. It is a classic crime drama directed by Francis Ford Coppola and based on the novel of the same name by Mario Puzo. The film revolves around the Corleone family, a powerful Italian-American mafia clan, and their patriarch, Vito Corleone, played by Marlon Brando. It explores the themes of power, family, loyalty, and the consequences of a life in organized crime. It also spawned two sequels, "The Godfather Part II" (1974) and "The Godfather Part III" (1990), which continue the Corleone family's saga.


### **The Lord of the Rings: The Return of the King (2003)**
The Lord of the Rings: The Return of the King (2003) has a rating of 9.0, which it shares with The Dark Knight (2008) and Solitude (2005) in the results above. It is a fantasy film directed by Peter Jackson and the third and final installment of "The Lord of the Rings" film trilogy, based on the third volume of J.R.R. Tolkien's epic fantasy novel "The Lord of the Rings". The film concludes the epic tale set in the fictional realm of Middle-earth.

With the 10 highest-rated films identified and the 3 films above briefly described, I will now analyze which film genres are most popular with audiences.

I will merge a few genres into an 'Others' group because of formatting errors in the dataset.

In [4]:
SELECT
	CASE WHEN i.genre IN ('Film-Noir', 'News', 'Criminal') THEN 'Others'
		 ELSE i.genre END AS movies_genre,
	ROUND(AVG(i.number_of_voters),0) AS voters_number,
	COUNT(*) AS total_number_of_productions
FROM 'imdb_movies_dataset - long_data_genre.csv' AS i
-- Group by the derived label so the merged genres form a single 'Others' row
GROUP BY movies_genre
ORDER BY voters_number DESC, total_number_of_productions DESC;

Unnamed: 0,movies_genre,voters_number,total_number_of_productions
0,Sci-Fi,271237.0,378
1,Adventure,200065.0,966
2,Action,196265.0,1206
3,Animation,164876.0,240
4,Fantasy,160763.0,346
5,Mystery,152757.0,408
6,Thriller,147413.0,661
7,Crime,144504.0,776
8,War,134574.0,85
9,Drama,129363.0,2176
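When several raw genres are folded into one label with `CASE`, the grouping has to happen on that label; grouping on the raw column would leave one 'Others' row per original genre. The sketch below illustrates the difference, assuming Python's `sqlite3` module and a hypothetical `movies` table whose voter counts are made up purely for illustration.

```python
import sqlite3

# Hypothetical rows (genre, number_of_voters); the numbers are invented.
rows = [
    ("Film-Noir", 100),
    ("News", 200),
    ("Criminal", 300),
    ("Drama", 400),
    ("Drama", 600),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE movies (genre TEXT, number_of_voters INTEGER)")
con.executemany("INSERT INTO movies VALUES (?, ?)", rows)

select_list = """
    SELECT CASE WHEN genre IN ('Film-Noir', 'News', 'Criminal') THEN 'Others'
                ELSE genre END AS movies_genre,
           AVG(number_of_voters) AS voters_number,
           COUNT(*) AS total_number_of_productions
    FROM movies
"""

# Grouping by the raw column: each merged genre still gets its own row.
by_raw = con.execute(
    select_list + " GROUP BY genre ORDER BY movies_genre"
).fetchall()

# Grouping by the derived label: the merged genres collapse into one row.
by_label = con.execute(
    select_list + " GROUP BY movies_genre ORDER BY movies_genre"
).fetchall()

print(by_raw)
print(by_label)
```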


Based on the query results above, the 3 genres with the highest average number of votes per film are:
1. Sci-Fi (271,237 votes)
2. Adventure (200,065 votes)
3. Action (196,265 votes)

Based on this, I will find the 10 films with the most votes on IMDb in each of these 3 genres.

# **Sci-Fi Movies**

In [5]:
SELECT
	i.movie_title,
	i.genre,
	i.number_of_voters AS voters_number
FROM 'imdb_movies_dataset - long_data_genre.csv' AS i
WHERE i.genre = 'Sci-Fi'
ORDER BY voters_number DESC
LIMIT 10;

Unnamed: 0,movie_title,genre,voters_number
0,Inception (2010),Sci-Fi,2368570
1,The Matrix (1999),Sci-Fi,1924457
2,Interstellar (2014),Sci-Fi,1852238
3,The Avengers (2012),Sci-Fi,1398299
4,The Prestige (2006),Sci-Fi,1342184
5,Back to the Future (1985),Sci-Fi,1214789
6,Terminator 2: Judgment Day (1991),Sci-Fi,1106289
7,Avengers: Infinity War (2018),Sci-Fi,1091968
8,Iron Man (2008),Sci-Fi,1066239
9,Eternal Sunshine of the Spotless Mind (2004),Sci-Fi,1015223


# **Action Movies**

In [6]:
SELECT
	i.movie_title,
	i.genre,
	i.number_of_voters AS voters_number
FROM 'imdb_movies_dataset - long_data_genre.csv' AS i
WHERE i.genre = 'Action'
ORDER BY voters_number DESC
LIMIT 10;

Unnamed: 0,movie_title,genre,voters_number
0,The Dark Knight (2008),Action,2669915
1,Inception (2010),Action,2368570
2,The Matrix (1999),Action,1924457
3,The Lord of the Rings: The Fellowship of the R...,Action,1886597
4,The Lord of the Rings: The Return of the King ...,Action,1857170
5,The Dark Knight Rises (2012),Action,1715219
6,The Lord of the Rings: The Two Towers (2002),Action,1677003
7,Gladiator (2000),Action,1510184
8,Batman Begins (2005),Action,1480348
9,The Avengers (2012),Action,1398299


# **Adventure Movies**

In [7]:
SELECT
	i.movie_title,
	i.genre,
	i.number_of_voters AS voters_number
FROM 'imdb_movies_dataset - long_data_genre.csv' AS i
WHERE i.genre = 'Adventure'
ORDER BY voters_number DESC
LIMIT 10;

Unnamed: 0,movie_title,genre,voters_number
0,Inception (2010),Adventure,2368570
1,The Lord of the Rings: The Fellowship of the R...,Adventure,1886597
2,The Lord of the Rings: The Return of the King ...,Adventure,1857170
3,Interstellar (2014),Adventure,1852238
4,The Lord of the Rings: The Two Towers (2002),Adventure,1677003
5,Gladiator (2000),Adventure,1510184
6,Inglourious Basterds (2009),Adventure,1461682
7,Avatar (2009),Adventure,1316701
8,Back to the Future (1985),Adventure,1214789
9,Guardians of the Galaxy (2014),Adventure,1187949


After the film-level analysis, I will focus on trends in the number of film productions and on the budget and revenue of the films in this dataset.

For the first analysis, I will aggregate the data to display the number of films, the average number of voters, and the average approval index by genre for each year after 2000, in hierarchical form :

In [8]:
SELECT
	i.year AS year,
	CASE WHEN i.genre IS NULL THEN 'Total of Calculation ='
		ELSE i.genre END AS genre,
	COUNT(i.genre) AS number_of_genre,
	ROUND(AVG(i.number_of_voters),2) AS number_of_voters,
	ROUND(AVG(i.approval_index),2) AS number_approval
FROM 'imdb_movies_dataset - long_data_genre.csv' AS i
GROUP BY ROLLUP (i.year,i.genre)
HAVING i.year > 2000
ORDER BY year DESC, genre ASC;

Unnamed: 0,year,genre,number_of_genre,number_of_voters,number_approval
0,2023,Comedy,1,359.00,2.12
1,2023,Drama,1,359.00,2.12
2,2023,Sport,1,359.00,2.12
3,2023,Total of Calculation =,3,359.00,2.12
4,2022,Action,21,185202.71,5.29
...,...,...,...,...,...
454,2001,Sport,4,48285.50,4.53
455,2001,Thriller,21,149643.76,5.19
456,2001,Total of Calculation =,384,124889.71,4.82
457,2001,War,2,156699.00,6.49


Based on the query results above, 2023 contains only 3 genre rows, all belonging to a single film (80 for Brady), because the dataset only covers releases up to February 2023.

The SQL code above produces a report that aggregates data by year and genre for films released after 2000. For each combination of year and genre it shows the number of films, the average number of voters, and the average approval index, and the ROLLUP adds a 'Total of Calculation =' subtotal row for each year.
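To make the per-year subtotal rows less mysterious, the sketch below reproduces them without `ROLLUP`, by stacking a per-year aggregate under the per-year-and-genre aggregate with `UNION ALL`. It assumes Python's `sqlite3` module (SQLite has no `ROLLUP`, which is why the emulation is used here) and a hypothetical in-memory `movies` table with a few rows from the first output. The grand-total row that `ROLLUP` also produces is left out, since the notebook's `HAVING` filter drops it as well.

```python
import sqlite3

# Rows from the first output: (year, genre, number_of_voters).
rows = [
    (2023, "Comedy", 359),
    (2023, "Drama", 359),
    (2023, "Sport", 359),
    (2022, "Comedy", 13059),   # A Man Called Otto
    (2022, "Drama", 13059),
    (2022, "Comedy", 62180),   # Amsterdam
    (2022, "Drama", 62180),
    (2022, "History", 62180),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE movies (year INTEGER, genre TEXT, number_of_voters INTEGER)"
)
con.executemany("INSERT INTO movies VALUES (?, ?, ?)", rows)

report = con.execute(
    """
    -- Detail level: one row per (year, genre)
    SELECT year, genre, COUNT(*) AS number_of_genre,
           AVG(number_of_voters) AS avg_voters
    FROM movies
    GROUP BY year, genre
    UNION ALL
    -- Subtotal level: one row per year, labelled like the notebook's output
    SELECT year, 'Total of Calculation =', COUNT(*), AVG(number_of_voters)
    FROM movies
    GROUP BY year
    ORDER BY year DESC, genre ASC
    """
).fetchall()

for row in report:
    print(row)
```

Because the subtotal label is sorted together with the genre names, it lands between genres alphabetically, which is the same placement seen in the output above.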

I want to show the films with the highest approval index in each genre :

In [9]:
SELECT 
	genre, 
	movie_title,
	approval_index
FROM (
    	SELECT 
			genre, 
			movie_title, 
			approval_index, 
			ROW_NUMBER() OVER (PARTITION BY genre ORDER BY approval_index DESC) AS rn
    	FROM 'imdb_movies_dataset - long_data_genre.csv' AS i
) AS ranked
WHERE rn = 1;


Unnamed: 0,genre,movie_title,approval_index
0,Animation,The Lion King (1994),8.548207
1,News,An Inconvenient Truth (2006),6.040821
2,Biography,Goodfellas (1990),8.811903
3,Horror,Alien (1979),8.434864
4,Documentary,Bowling for Columbine (2002),6.865744
5,Musical,Singin' in the Rain (1952),7.447506
6,Mystery,Se7en (1995),8.932272
7,Music,Whiplash (2014),8.424185
8,History,Braveheart (1995),8.432932
9,Comedy,Back to the Future (1985),8.629912


I want to show the films with the highest number of votes in each genre:

In [10]:
SELECT 
	genre, 
	movie_title,
	number_of_voters
FROM (
    	SELECT 
			i.genre, 
			i.movie_title, 
			i.number_of_voters, 
			ROW_NUMBER() OVER (PARTITION BY i.genre ORDER BY i.number_of_voters DESC) AS rn
    	FROM 'imdb_movies_dataset - long_data_genre.csv' AS i
) AS ranked
WHERE rn = 1;


Unnamed: 0,genre,movie_title,number_of_voters
0,Animation,The Lion King (1994),1066010
1,News,An Inconvenient Truth (2006),83892
2,Biography,The Wolf of Wall Street (2013),1426046
3,Horror,The Shining (1980),1030407
4,Documentary,Bowling for Columbine (2002),145581
5,Musical,The Greatest Showman (2017),286676
6,Mystery,Se7en (1995),1664484
7,Music,Whiplash (2014),874250
8,History,Braveheart (1995),1043556
9,Comedy,The Wolf of Wall Street (2013),1426046


Comparing the two results, films with a high approval index tend to have a high number of voters too. To check this, I will compute the correlation between these 2 columns for each genre :

In [11]:
-- CORR is computed per genre directly; no pre-aggregated CTE is needed
SELECT
    i.genre,
    ROUND(CORR(i.approval_index, i.number_of_voters),3) AS correlation
FROM 'imdb_movies_dataset - long_data_genre.csv' AS i
GROUP BY i.genre;

Unnamed: 0,genre,correlation
0,Comedy,0.732
1,Drama,0.702
2,Sport,0.676
3,History,0.719
4,Action,0.755
5,Adventure,0.758
6,Fantasy,0.746
7,Horror,0.774
8,Mystery,0.772
9,Thriller,0.761


Based on the correlation results above, every genre shown has a positive correlation between the approval index and the number of voters, ranging from 0.676 (Sport) to 0.774 (Horror). Films with a higher approval index therefore tend to have more voters, although correlation alone does not show which one drives the other.
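For readers who want to see what a correlation function such as `CORR` calculates, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The helper name `pearson` is hypothetical, and the five sample pairs are taken from the top-10 rating output earlier in this notebook, so the result describes only that tiny sample and not any genre in the table above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# (approval_index, number_of_voters) pairs from the top-10 rating output.
approval_index = [10.0, 9.64379, 9.666757, 1.661377, 9.391076]
number_of_voters = [2695887, 1870922, 2669915, 15, 2069502]

r = pearson(approval_index, number_of_voters)
print(round(r, 3))
```

The coefficient always lies between -1 and 1; values near 1 mean the two columns rise together along a roughly straight line.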

In the next analysis, I will group films by runtime into the following 3 categories, matching the conditions used in the query:

1. Epic Movie : runtime of more than 180 minutes
2. Feature Movie : runtime of more than 70 and up to 180 minutes
3. Short Movie : runtime of more than 30 and up to 70 minutes

In [12]:
SELECT
	i.movie_title AS title,
	ROUND(AVG(i.runtime),0) AS duration,
	CASE WHEN duration > 180 THEN 'Epic Movie'
		 WHEN duration > 70 THEN 'Feature Movie'
		 WHEN duration > 30 THEN 'Short Movie' END AS duration_category
FROM 'imdb_movies_dataset - long_data_genre.csv' AS i
GROUP BY i.movie_title
ORDER BY duration DESC;

Unnamed: 0,title,duration,duration_category
0,Gettysburg (1993),271.0,Epic Movie
1,The Greatest Story Ever Told (1965),260.0,Epic Movie
2,Hamlet (1996),242.0,Epic Movie
3,Gone with the Wind (1939),238.0,Epic Movie
4,Once Upon a Time in America (1984),229.0,Epic Movie
...,...,...,...
4307,The Land Before Time (1988),69.0,Short Movie
4308,Bambi (1942),69.0,Short Movie
4309,Pooh's Heffalump Movie (2005),68.0,Short Movie
4310,She Done Him Wrong (1933),66.0,Short Movie
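The `CASE` expression in the query above uses strict `>` comparisons, so films that sit exactly on a boundary fall into the lower category. The sketch below mirrors that logic as a small Python function (the name `duration_category` is hypothetical) to make the boundary behaviour easy to check.

```python
def duration_category(minutes):
    """Mirror of the query's CASE expression, evaluated top to bottom."""
    if minutes > 180:
        return "Epic Movie"
    if minutes > 70:
        return "Feature Movie"
    if minutes > 30:
        return "Short Movie"
    # A CASE without an ELSE branch yields NULL when nothing matches.
    return None


# Runtimes from the output above plus the exact boundary values.
for minutes in (271, 181, 180, 71, 70, 69, 30):
    print(minutes, duration_category(minutes))
```

If boundary films should instead belong to the higher category, the comparisons would need to be `>=`, which could shift the counts reported in the next query.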


In [13]:
WITH summary AS (
		SELECT
			i.movie_title AS title,
			ROUND(AVG(i.runtime),0) AS duration,
			CASE WHEN duration > 180 THEN 'Epic Movie'
				 WHEN duration > 70 THEN 'Feature Movie'
				 WHEN duration > 30 THEN 'Short Movie' END AS duration_category
		FROM 'imdb_movies_dataset - long_data_genre.csv' AS i
		GROUP BY i.movie_title
		ORDER BY duration DESC)
SELECT
	duration_category,
	COUNT(*) AS number_of_time_category
FROM summary
GROUP BY duration_category;

Unnamed: 0,duration_category,number_of_time_category
0,Feature Movie,4262
1,Short Movie,5
2,Epic Movie,45


For the results above, I used a common table expression to count the titles in each duration category. The dataset is dominated by feature-length films: 4,262 titles fall in the Feature Movie category, compared with 45 Epic Movies and 5 Short Movies.

One possible explanation is that a long, highly detailed plot does not necessarily attract a larger audience, so producers favor a tight plot and a moderate runtime, which may be why many scenes end up being cut.

In the next analysis, I will look at the budget that producers spent on their films.

The first query displays the average budget, domestic revenue, and international revenue for each genre per year (2020 - 2023), with a subtotal per year, using the ROLLUP operator :

In [14]:
SELECT
	i.year AS year,
	CASE WHEN i.genre IS NULL THEN 'Total of Calculation ='
		ELSE i.genre END AS genre,
	COUNT(i.genre) AS number_of_genre,
	ROUND(AVG(i.budget),2) AS average_budget,
	ROUND(AVG(i.domestic_revenue),2) AS average_dom_revenue,
	ROUND(AVG(i.international_revenue),2) AS average_intl_revenue
FROM 'imdb_movies_dataset - long_data_genre.csv' AS i
GROUP BY ROLLUP (i.year,i.genre)
HAVING (i.year) >= 2020
ORDER BY year DESC, genre ASC;

Unnamed: 0,year,genre,number_of_genre,average_budget,average_dom_revenue,average_intl_revenue
0,2023,Comedy,1,28000000.0,37688000.0,37688000.0
1,2023,Drama,1,28000000.0,37688000.0,37688000.0
2,2023,Sport,1,28000000.0,37688000.0,37688000.0
3,2023,Total of Calculation =,3,28000000.0,37688000.0,37688000.0
4,2022,Action,21,135233300.0,191161200.0,458053000.0
5,2022,Adventure,16,157437500.0,204447400.0,535609500.0
6,2022,Animation,6,96666670.0,114650800.0,270566600.0
7,2022,Biography,2,44500000.0,86068360.0,151354500.0
8,2022,Comedy,16,71181250.0,97512960.0,218230500.0
9,2022,Crime,3,106666700.0,145091900.0,308982100.0


In the following analysis, I will show the film with the highest budget in each genre :

In [15]:
SELECT genre, movie_title, budget
FROM (SELECT
		i.genre,
		i.movie_title,
		i.budget,
		ROW_NUMBER () OVER (PARTITION BY i.genre ORDER BY i.budget DESC) AS rn
		FROM 'imdb_movies_dataset - long_data_genre.csv' AS i) AS ranked
WHERE rn = 1;

Unnamed: 0,genre,movie_title,budget
0,Animation,Tangled (2010),260000000
1,News,An Inconvenient Truth (2006),1000000
2,Biography,Alexander (2004),155000000
3,Horror,World War Z (2013),190000000
4,Documentary,Jackass 3D (2010),20000000
5,Musical,West Side Story (2021),100000000
6,Mystery,Quantum of Solace (2008),230000000
7,Music,Elvis (2022),85000000
8,History,Deepwater Horizon (2016),156000000
9,Comedy,Tangled (2010),260000000


In the following analysis, I will show the film with the highest international revenue in each year :

In [16]:
SELECT 
	year, 
	movie_title, 
	international_revenue
FROM (SELECT
		i.year,
		i.movie_title,
		i.international_revenue,
		ROW_NUMBER () OVER (PARTITION BY i.year ORDER BY i.international_revenue DESC) AS rn
		FROM 'imdb_movies_dataset - long_data_genre.csv' AS i) AS ranked
WHERE rn = 1
ORDER BY year DESC;

Unnamed: 0,year,movie_title,international_revenue
0,2023,80 for Brady (2023),37688000
1,2022,Avatar: The Way of Water (2022),2265935552
2,2021,Spider-Man: No Way Home (2021),1910048245
3,2020,Tenet (2020),360240189
4,2019,Avengers: Endgame (2019),2794731755
...,...,...,...
91,1929,The Broadway Melody (1929),4358000
92,1925,The Big Parade (1925),22000000
93,1920,Over the Hill to the Poorhouse (1920),3000000
94,1916,"20,000 Leagues Under the Sea (1916)",8000000


Line Chart Experiment


That concludes this portfolio project on IMDb film data analysis. Hopefully it provides useful insights, information, and recommendations for choosing films to watch.



Feby Renaldi (Data Analytics Enthusiast)