# IMDb Movie Data Analysis
## Author: Andrew Mayfield
## IMDB Dataset Release Date: 3/18/24
## Project Publish Date: 2/10/24

---

# 1. **Project Overview and Dataset Selection** 

## Introduction
This is my first official data analysis project, so I intend to document my entire process in detail to ensure a thorough understanding of each step involved. My goal is not just to analyze IMDb movie data but also to develop a strong foundation in data handling, cleaning, visualization, and interpretation. By explaining every decision I make, from data preparation to final insights, I hope to build a structured approach that I can refine and apply to future projects.  

For this project, I will be working with multiple datasets from IMDb, which provide extensive information about movies, TV shows, ratings, crew members, genres, and voting trends. This data presents an opportunity to explore key questions such as how movie ratings have changed over time, whether runtime affects a film’s success, and the correlation between votes and popularity. Given the variety and complexity of the dataset, my approach will involve multiple tools and techniques to ensure a comprehensive analysis.  

I will primarily be using **Python within VS Code** for coding and data manipulation, leveraging libraries such as **Pandas** for data munging, **NumPy** for numerical operations, and **Matplotlib** and **Seaborn** for visualization. The flexibility of **Jupyter Notebook** will allow me to document and test my code in an interactive format, making it easier to track my workflow and present my findings clearly.  

Beyond Python, I will incorporate **Tableau** for interactive visualizations and dashboards, helping to uncover patterns in the data more intuitively. **MySQL** will be used for structured querying and database management, allowing me to practice handling large datasets efficiently. Additionally, I plan to experiment with **Microsoft Excel** for quick exploratory analysis and possibly utilize **Power BI** and **SAS** for further insights, depending on the complexity of my findings.  

By using a mix of programming, database management, and visualization tools, I aim to gain a well-rounded understanding of data analysis in a real-world setting. This project will not only strengthen my technical skills but also improve my ability to interpret and communicate data-driven insights effectively. Since this is my first deep dive into an extensive dataset, I anticipate challenges along the way, from data cleaning to ensuring my visualizations accurately reflect meaningful trends. However, by documenting my thought process and methodology step by step, I hope to create a structured and insightful analysis that can serve as a strong portfolio project.

## Disclaimer  
While this dataset contains a large amount of data within many different variables, this project is focused on making general observations and analyses of the IMDb movie dataset. The scope of this analysis is limited to examining overall film data, such as runtime, release years, and average ratings. It does not include an in-depth exploration of specific genres, television shows, video games, or individual actors and directors. While these aspects may be considered in future projects, they are not within the scope of this study. 

## Why and How I Chose This Specific Dataset
When selecting a dataset, I explored numerous sources, including **Kaggle, Google Dataset Search, Reddit, and GitHub**. While these platforms offer a vast selection of datasets across various domains, I encountered a recurring issue: many datasets contained inconsistencies, inaccuracies, and missing information that only became apparent after I had already begun working on them. While data cleaning is an essential skill in data analysis, I did not want to spend the majority of my time correcting errors from unknown sources rather than conducting meaningful analysis.  

To ensure reliability and minimize time spent fixing fundamental data quality issues, I decided to look for an official dataset from a well-established source. After extensive searching, I came across the **IMDb dataset**, which is released and maintained by IMDb itself. The dataset is updated regularly and provides structured, well-documented information about movies, TV shows, ratings, cast and crew, genres, and more. Since IMDb is one of the most reputable and widely used databases for film and television data, I felt confident that this dataset would provide a solid foundation for analysis without the risk of major inconsistencies or unreliable data entries.  

By selecting an official dataset that is structured and relatively clean, I can focus more on **exploratory data analysis, trend identification, and visualization**, rather than spending excessive time on fixing errors from an unknown origin. This decision aligns with my goal of developing a structured and insightful analysis while ensuring that the dataset I work with is both credible and valuable for meaningful conclusions.

## Why Analyze Movie Data?
Movies have been a significant part of global culture, shaping entertainment, storytelling, and even societal trends. With the rise of online platforms like IMDb (Internet Movie Database), movie ratings and popularity have become **quantifiable** through user reviews, votes, and engagement metrics. This dataset, containing thousands of movies spanning decades, provides an opportunity to explore **how movies have evolved over time** and **what makes a film well-received**.

## Objectives of This Analysis
This project aims to explore IMDb movie and TV show data by analyzing key factors such as ratings, runtime, popularity, and trends over time. Using the seven datasets provided, we will investigate the following questions:

- Are movies getting better or worse over time? Analyzing `averageRating` from `title.ratings.tsv.gz` against `startYear` from `title.basics.tsv.gz` to see whether IMDb ratings have risen or declined across decades.  
- How has movie length evolved? Examining `title.basics.tsv.gz` (`runtimeMinutes`) to determine whether movies have become longer or shorter over time.   
- Do longer movies get better ratings? Exploring `runtimeMinutes` vs. `averageRating` to find the optimal movie length.  
- Do highly rated movies receive more votes? Comparing `numVotes` to `averageRating` to determine if popular movies are also critically well-received.  
- How have IMDb voting patterns changed over time? Using `title.ratings.tsv.gz` and `startYear` to analyze whether newer movies receive more votes than older classics.  
- Are movies being produced at a higher rate than in the past? Analyzing the number of unique `tconst` entries per year in `title.basics.tsv.gz`.   
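
As a rough illustration of how the questions above translate into pandas operations, the sketch below groups a tiny hand-made table by decade. The rows are invented for illustration and are not taken from the IMDb files; only the column names (`tconst`, `startYear`, `averageRating`) match the dataset, and the name `per_decade` is my own.

```python
import pandas as pd

# Toy rows invented for illustration; only the column names match the IMDb files
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5"],
    "startYear": [1994, 1999, 2003, 2008, 2015],
    "averageRating": [7.0, 8.0, 6.0, 7.0, 5.5],
})

# Bucket each release year into its decade (1994 -> 1990, 2008 -> 2000)
toy["decade"] = (toy["startYear"] // 10) * 10

# Number of unique titles and mean rating per decade
per_decade = toy.groupby("decade").agg(
    titles=("tconst", "nunique"),
    mean_rating=("averageRating", "mean"),
)
print(per_decade)
```

The same pattern (derive a grouping column, then aggregate) should carry over to runtime and vote counts, although the real data will first need the cleaning described later.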

## Dataset Information
The dataset consists of seven interconnected tables from IMDb, each containing specific information about movies, TV shows, and the people involved in their creation. Below is a breakdown of each dataset:

### 1. title.akas.tsv.gz – Alternative Titles  
This dataset contains different titles used for the same movie across various regions and languages.  
- **titleId** – Unique identifier for the title (linked to other datasets)  
- **ordering** – A number to uniquely identify rows for a given titleId  
- **title** – The localized title of the movie  
- **region** – The country or region where this title is used  
- **language** – The language of the title  
- **types** – Categorization of the title (e.g., "alternative", "dvd", "festival", etc.)  
- **attributes** – Additional details about the alternative title  
- **isOriginalTitle** – A boolean flag indicating if this is the movie’s original title  

### 2. title.basics.tsv.gz – Movie and TV Show Information  
This dataset provides basic information about each movie, TV series, or video.  
- **tconst** – Unique identifier for the title  
- **titleType** – Format of the title (e.g., movie, short, tvSeries, video, etc.)  
- **primaryTitle** – The most commonly used title  
- **originalTitle** – The title in its original language  
- **isAdult** – A flag indicating whether the title is adult content (0 = no, 1 = yes)  
- **startYear** – The release year of the title  
- **endYear** – The final year for TV series (otherwise NULL)  
- **runtimeMinutes** – The length of the movie in minutes  
- **genres** – Up to three genres associated with the title  

### 3. title.crew.tsv.gz – Directors & Writers  
This dataset links movies and TV shows to their directors and writers.  
- **tconst** – Unique identifier for the title  
- **directors** – List of `nconst` IDs for the directors  
- **writers** – List of `nconst` IDs for the writers  

### 4. title.episode.tsv.gz – TV Series Episode Details  
This dataset connects individual TV episodes to their respective series.  
- **tconst** – Unique identifier for the episode  
- **parentTconst** – Unique identifier for the parent TV series  
- **seasonNumber** – Season number the episode belongs to  
- **episodeNumber** – Episode number within the season  

### 5. title.principals.tsv.gz – Cast & Crew Information  
This dataset provides details on key people involved in movies and TV shows, including actors, directors, and other roles.  
- **tconst** – Unique identifier for the title  
- **ordering** – A ranking number for importance in credits  
- **nconst** – Unique identifier for the person  
- **category** – The role category (e.g., actor, director, cinematographer)  
- **job** – Specific job title (if available)  
- **characters** – Names of the characters played (if applicable)  

### 6. title.ratings.tsv.gz – IMDb Ratings & Votes  
This dataset contains IMDb ratings and vote counts for each movie or TV show.  
- **tconst** – Unique identifier for the title  
- **averageRating** – Weighted IMDb user rating (out of 10)  
- **numVotes** – Total number of votes received  

### 7. name.basics.tsv.gz – People Information  
This dataset provides biographical details about actors, directors, and other industry professionals.  
- **nconst** – Unique identifier for the person  
- **primaryName** – The person’s most commonly credited name  
- **birthYear** – Birth year (YYYY format)  
- **deathYear** – Death year (if applicable, otherwise NULL)  
- **primaryProfession** – Up to three primary professions (e.g., actor, director, writer)  
- **knownForTitles** – List of `tconst` values representing the person’s most well-known works  

By combining these datasets through shared identifiers (`tconst` for movies/TV shows and `nconst` for people), we can analyze trends in **ratings, runtime, release year, etc**.
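
To make the idea of joining on a shared identifier concrete, here is a minimal sketch that merges a few rows of title information with their ratings on `tconst`. The three titles and the two ratings are copied from the sample rows shown later in this notebook; leaving `tt0000002` out of the ratings table is my own assumption, made only to show what a left join does with unmatched titles.

```python
import pandas as pd

# A few rows in the shape of title.basics
basics = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primaryTitle": ["Carmencita", "Le clown et ses chiens", "Poor Pierrot"],
    "startYear": [1894, 1892, 1892],
})

# A few rows in the shape of title.ratings (tt0000002 deliberately omitted)
ratings = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000003"],
    "averageRating": [5.7, 6.5],
    "numVotes": [2128, 2163],
})

# Left join keeps every title; titles without a rating get NaN in the rating columns.
# validate="one_to_one" raises an error if tconst is duplicated on either side.
merged = basics.merge(ratings, on="tconst", how="left", validate="one_to_one")
print(merged)
```

Whether a left or an inner join is more appropriate depends on the question: an inner join would silently drop unrated titles, which matters when counting releases per year.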


## Expected Insights

Through this analysis, I aim to uncover meaningful patterns and trends within the IMDb dataset that provide valuable insights into the movie industry. By examining different aspects of the data, I expect to answer key questions related to movie ratings, popularity, release trends, and runtime distributions.  

- **How does movie runtime correlate with ratings?**  
- **How has the number of movie releases changed over time?**  
- **Have IMDb ratings remained consistent over time, or have they fluctuated?**  
- **What is the distribution of votes across movies?**  
- **What factors contribute to a movie being highly rated?**  
- **How does the release of movies compare to the release of short films?**  
- **Are there specific years or periods when more highly rated movies were produced?**  

These expected insights will guide my analysis as I work through the dataset, helping me uncover compelling trends and relationships within the IMDb data.


## Project Outline  

This project follows a structured workflow to ensure a thorough and systematic analysis of the IMDb dataset. The key sections of the project are as follows:  

### 1. **Project Overview and Dataset Selection**  
   - Introduction to the project  
   - Tools and technologies used  
   - Objectives of the analysis   
   - Reasons for selecting the IMDb dataset  
   - Breakdown of the different IMDb dataset files  
   - Explanation of key variables  

### 2. **Data Cleaning and Preprocessing**  
   - Handling missing or inconsistent data  
   - Data type conversions and transformations  

### 3. **Exploratory Data Analysis (EDA)**  
   - Distribution and trends of key variables  
   - Visualizations to identify patterns  
   - Examination of trends based on runtime, release year, and ratings  
   - Insights drawn from different visualizations

### 4. **Findings and Conclusion**  
   - Summary of key insights  
   - Limitations of the analysis  
   - Potential areas for further exploration  

***

# 2. **Data Cleaning and Preprocessing** 

Before starting anything, I need to import the required libraries to be able to analyze the data.

In [2]:
#This imports all of the libraries that I will use
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Next, I load each of the dataset files into its own variable, named after the information the file contains.

In [3]:
# This reads all of the files on my computer and assigns them to different dataframes through pandas which allows me to work with tabular data
# Since the files are tsv (Tab Separated Values) files and not csv (Comma Separated Values) files, I include at the end that the separator is a tab and not a comma
# **Warning** These are also very large files and will require a decent amount of time for your computer to run through them
df_title_akas = pd.read_csv(r"C:\Users\andre\Desktop\The Movie Project\Official IMDB Movie Data Set 6\title.akas.tsv", sep="\t")
df_title_basics = pd.read_csv(r"C:\Users\andre\Desktop\The Movie Project\Official IMDB Movie Data Set 6\title.basics.tsv", sep="\t")
df_title_crew = pd.read_csv(r"C:\Users\andre\Desktop\The Movie Project\Official IMDB Movie Data Set 6\title.crew.tsv", sep="\t")
df_title_episode = pd.read_csv(r"C:\Users\andre\Desktop\The Movie Project\Official IMDB Movie Data Set 6\title.episode.tsv", sep="\t")
df_title_principals = pd.read_csv(r"C:\Users\andre\Desktop\The Movie Project\Official IMDB Movie Data Set 6\title.principals.tsv", sep="\t")
df_title_ratings = pd.read_csv(r"C:\Users\andre\Desktop\The Movie Project\Official IMDB Movie Data Set 6\title.ratings.tsv", sep="\t")
df_name_basics = pd.read_csv(r"C:\Users\andre\Desktop\The Movie Project\Official IMDB Movie Data Set 6\name.basics.tsv", sep="\t")



You will notice instantly that we get a warning about `mixed types` in one of the columns of one of the files. We will find and fix this issue in a bit.
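
One possible way to track down a mixed-type column is to list the Python types that actually occur in each column. The sketch below does this on a toy DataFrame. The helper name `python_types_per_column` is hypothetical, and using `isAdult` as the mixed column is only an assumption for the example, not a statement about which column triggers the warning in the real file.

```python
import pandas as pd

# Toy frame with one column mixing integers and strings (invented for illustration)
toy = pd.DataFrame({
    "isAdult": [0, 1, "0", "1", 0],
    "primaryTitle": ["A", "B", "C", "D", "E"],
})

def python_types_per_column(df):
    """Return, for each column, the sorted names of the Python types found in it."""
    return {
        col: sorted(df[col].map(lambda v: type(v).__name__).unique())
        for col in df.columns
    }

type_report = python_types_per_column(toy)

# A column is "mixed" when more than one Python type appears in it
mixed_columns = [col for col, kinds in type_report.items() if len(kinds) > 1]
print(type_report)
print(mixed_columns)
```

On the full files this check touches every cell, so it may be slow; running it on one suspicious column at a time is likely more practical.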

First, let's look at how the data is formatted across the different datasets. We will use the `.head()` method to look at the column names and the first 5 rows of each dataset.

I will then leave a description of each table, using information provided on the IMDb Developer website (__[IMDb Developer](https://developer.imdb.com/non-commercial-datasets/)__).

The website also states that all missing values within the dataset are represented by `\N`, which will be helpful when cleaning.
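
As a small illustration of why the `\N` marker matters, the sketch below reads a two-row, in-memory stand-in for an IMDb file and tells pandas to treat `\N` as missing. The rows are invented; `na_values` and `keep_default_na` are standard `pd.read_csv` parameters.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for an IMDb TSV file (rows invented for illustration)
sample_tsv = (
    "tconst\tprimaryTitle\tstartYear\tendYear\n"
    "tt1\tNA\t1999\t\\N\n"
    "tt2\tSome Title\t\\N\t\\N\n"
)

df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    na_values=["\\N"],      # treat IMDb's \N marker as missing
    keep_default_na=False,  # keep literal text such as "NA" as a real title
)
print(df)
print(df.isna().sum())
```

Without `keep_default_na=False`, pandas would also read the title `NA` as a missing value, which would wrongly add nulls to a column that should have none.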

In [4]:
df_title_akas.head()

Unnamed: 0,titleId,ordering,title,region,language,types,attributes,isOriginalTitle
0,tt0000001,1,Carmencita,\N,\N,original,\N,1
1,tt0000001,2,Carmencita,DE,\N,\N,literal title,0
2,tt0000001,3,Carmencita,US,\N,imdbDisplay,\N,0
3,tt0000001,4,Carmencita - spanyol tánc,HU,\N,imdbDisplay,\N,0
4,tt0000001,5,Καρμενσίτα,GR,\N,imdbDisplay,\N,0


File Name: 
`title.akas.tsv.gz`

Definition of Each Column:
- `titleId` (string) - a tconst, an alphanumeric unique identifier of the title
- `ordering` (integer) – a number to uniquely identify rows for a given titleId
- `title` (string) – the localized title
- `region` (string) - the region for this version of the title
- `language` (string) - the language of the title
- `types` (array) - Enumerated set of attributes for this alternative title. One or more of the following: "alternative", "dvd", "festival", "tv", "video", "working", "original", "imdbDisplay". New values may be added in the future without warning
- `attributes` (array) - Additional terms to describe this alternative title, not enumerated
- `isOriginalTitle` (boolean) – 0: not original title; 1: original title

In [5]:
df_title_basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Poor Pierrot,Pauvre Pierrot,0,1892,\N,5,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,Short


File Name:
`title.basics.tsv.gz`

Definition of Each Column:
- `tconst` (string) - alphanumeric unique identifier of the title
- `titleType` (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
- `primaryTitle` (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- `originalTitle` (string) - original title, in the original language
- `isAdult` (boolean) - 0: non-adult title; 1: adult title
- `startYear` (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
- `endYear` (YYYY) – TV Series end year. '\N' for all other title types
- `runtimeMinutes` – primary runtime of the title, in minutes
- `genres` (string array) – includes up to three genres associated with the title

In [6]:
df_title_crew.head()

Unnamed: 0,tconst,directors,writers
0,tt0000001,nm0005690,\N
1,tt0000002,nm0721526,\N
2,tt0000003,nm0721526,\N
3,tt0000004,nm0721526,\N
4,tt0000005,nm0005690,\N


File Name:
`title.crew.tsv.gz`

Definition of Each Column:
- `tconst` (string) - alphanumeric unique identifier of the title
- `directors` (array of nconsts) - director(s) of the given title
- `writers` (array of nconsts) – writer(s) of the given title

In [7]:
df_title_episode.head()

Unnamed: 0,tconst,parentTconst,seasonNumber,episodeNumber
0,tt0031458,tt32857063,\N,\N
1,tt0041951,tt0041038,1,9
2,tt0042816,tt0989125,1,17
3,tt0042889,tt0989125,\N,\N
4,tt0043426,tt0040051,3,42


File Name:
`title.episode.tsv.gz`

Definition of Each Column:
- `tconst` (string) - alphanumeric identifier of episode
- `parentTconst` (string) - alphanumeric identifier of the parent TV Series
- `seasonNumber` (integer) – season number the episode belongs to
- `episodeNumber` (integer) – episode number of the tconst in the TV series

In [8]:
df_title_principals.head()

Unnamed: 0,tconst,ordering,nconst,category,job,characters
0,tt0000001,1,nm1588970,self,\N,"[""Self""]"
1,tt0000001,2,nm0005690,director,\N,\N
2,tt0000001,3,nm0005690,producer,producer,\N
3,tt0000001,4,nm0374658,cinematographer,director of photography,\N
4,tt0000002,1,nm0721526,director,\N,\N


File Name:
`title.principals.tsv.gz`

Definition of Each Column:
- `tconst` (string) - alphanumeric unique identifier of the title
- `ordering` (integer) – a number to uniquely identify rows for a given titleId
- `nconst` (string) - alphanumeric unique identifier of the name/person
- `category` (string) - the category of job that person was in
- `job` (string) - the specific job title if applicable, else '\N'
- `characters` (string) - the name of the character played if applicable, else '\N'

In [9]:
df_title_ratings.head()

Unnamed: 0,tconst,averageRating,numVotes
0,tt0000001,5.7,2128
1,tt0000002,5.6,286
2,tt0000003,6.5,2163
3,tt0000004,5.3,183
4,tt0000005,6.2,2891


File Name:
`title.ratings.tsv.gz`

Definition of Each Column:
- `tconst` (string) - alphanumeric unique identifier of the title
- `averageRating` – weighted average of all the individual user ratings
- `numVotes` - number of votes the title has received

In [10]:
df_name_basics.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm0000001,Fred Astaire,1899,1987,"actor,miscellaneous,producer","tt0072308,tt0050419,tt0027125,tt0031983"
1,nm0000002,Lauren Bacall,1924,2014,"actress,soundtrack,archive_footage","tt0037382,tt0075213,tt0117057,tt0038355"
2,nm0000003,Brigitte Bardot,1934,\N,"actress,music_department,producer","tt0057345,tt0049189,tt0056404,tt0054452"
3,nm0000004,John Belushi,1949,1982,"actor,writer,music_department","tt0072562,tt0077975,tt0080455,tt0078723"
4,nm0000005,Ingmar Bergman,1918,2007,"writer,director,actor","tt0050986,tt0069467,tt0050976,tt0083922"


File Name:
`name.basics.tsv.gz`

Definition of Each Column:
- `nconst` (string) - alphanumeric unique identifier of the name/person
- `primaryName` (string)– name by which the person is most often credited
- `birthYear` – in YYYY format
- `deathYear` – in YYYY format if applicable, else '\N'
- `primaryProfession` (array of strings)– the top-3 professions of the person
- `knownForTitles` (array of tconsts) – titles the person is known for



For this part of the data cleaning section, I will focus on cleaning and analyzing a subset of the data extracted from the files `title.basics.tsv.gz` and `title.ratings.tsv.gz`. The data from these two files contains essential information about the movies, including basic attributes and user ratings, which will serve as the foundation for further analysis. 

The `title.basics.tsv.gz` file includes details such as the movie’s title, genre, release year, runtime, and title type (e.g., movie, short film, etc.). This data is critical for understanding the nature of the movies and their release trends over time. I will clean this data to ensure accuracy and remove any irrelevant or missing values, such as entries for TV shows and Video Games, which are not needed for this section of the analysis.

The `title.ratings.tsv.gz` file contains user ratings for each movie, including the average rating and the number of votes it received. This information will allow for analysis of how movie ratings correlate with factors like release year, genre, or runtime. I will clean this data by handling missing or erroneous ratings and ensuring the proper alignment of ratings with the corresponding movies in the `title.basics.tsv.gz` dataset.

By narrowing my focus to these two files, I can better analyze trends in movie ratings and the distribution of movies over time, making the analysis more manageable and relevant to the overall project. This step will provide a solid foundation for deeper exploration into other variables in the dataset as the project progresses.

## **Cleaning and Analyzing `title.basics.tsv.gz` and `title.ratings.tsv.gz`**

Since we've seen how the file is formatted above, let's get a quick summary of the data with `.info()`.

In [11]:
# This provides a quick summary of the data columns and their Datatype
df_title_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11429975 entries, 0 to 11429974
Data columns (total 9 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          object
 1   titleType       object
 2   primaryTitle    object
 3   originalTitle   object
 4   isAdult         object
 5   startYear       object
 6   endYear         object
 7   runtimeMinutes  object
 8   genres          object
dtypes: object(9)
memory usage: 784.8+ MB


We see that there are 11,429,975 total entries in this set and 9 columns, each currently stored with the `object` data type. Since the IMDb documentation lists the variables in each dataset along with their types, we can go ahead and assign each column its proper data type.

In [12]:
# Target data type for each column, following IMDb's documentation
dtype_mapping = {
    "tconst": "string",
    "titleType": "string",
    "primaryTitle": "string",
    "originalTitle": "string",
    "isAdult": "bool",
    "startYear": "Int64",  # Int64 (pandas nullable integer) can hold missing values
    "endYear": "Int64",  # Same reason as startYear
    "runtimeMinutes": "Int64",  # Nullable integer for missing runtimes
    "genres": "string"  # Treat as a string; can be split into lists later if needed
}

# Only the string columns are typed while reading; the rest are converted below
string_dtypes = {col: dtype for col, dtype in dtype_mapping.items() if dtype == "string"}

# Reads the file with dtype specifications
df_title_basics = pd.read_csv(
    r"C:\Users\andre\Desktop\The Movie Project\Official IMDB Movie Data Set 6\title.basics.tsv", 
    sep="\t", 
    dtype=string_dtypes,
    na_values=["\\N"],  # IMDb marks missing values as `\N`; this converts them to NaN so pandas treats them as missing
    keep_default_na=False  # Keeps literal text such as "NA" as a real value instead of a missing one
)

# Converts the boolean and integer columns.
# isAdult is parsed as a number first: calling .astype(bool) on text would turn the string "0" into True.
# Any value other than 1 (including values that cannot be parsed) becomes False.
df_title_basics["isAdult"] = pd.to_numeric(df_title_basics["isAdult"], errors="coerce").eq(1)
for col in ["startYear", "endYear", "runtimeMinutes"]:
    df_title_basics[col] = pd.to_numeric(df_title_basics[col], errors="coerce").astype("Int64")



Now that we have assigned each column its respective data type, we can use `.info()` again to confirm.

In [13]:
df_title_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11429975 entries, 0 to 11429974
Data columns (total 9 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   tconst          string
 1   titleType       string
 2   primaryTitle    string
 3   originalTitle   string
 4   isAdult         bool  
 5   startYear       Int64 
 6   endYear         Int64 
 7   runtimeMinutes  Int64 
 8   genres          string
dtypes: Int64(3), bool(1), string(5)
memory usage: 741.2 MB


Perfect! Now let's see how many missing values are in each column with `.isnull().sum()`, which counts the null values in each column.

In [14]:
# Counts the total number of null entries for each column and displays their totals
df_title_basics.isnull().sum()

tconst                   0
titleType                0
primaryTitle             0
originalTitle            0
isAdult                  0
startYear          1424418
endYear           11294536
runtimeMinutes     7805734
genres              503429
dtype: int64

We see that the columns `tconst`, `titleType`, `primaryTitle`, `originalTitle` and `isAdult` contain no null values. However, the columns `startYear`, `endYear`, `runtimeMinutes` and `genres` each contain a large number of null values. Below are possible reasons that each column would contain null values.

### **1. `startYear` (Release Year)**
**Possible reasons for missing values:**
- **Unreleased or upcoming projects**: Some titles may not have a confirmed release year.
- **Incomplete or missing data**: Older or lesser-known films might not have a recorded release year.
- **Errors in data entry**: Human errors or database inconsistencies may result in missing values.
- **TV episodes without a start year**: If the dataset includes episodes but not their parent series, they might lack a recorded start year.

### **2. `endYear` (TV Series End Year)**
**Possible reasons for missing values:**
- **Movies and standalone productions**: This column is primarily for TV series, so movies, shorts, and other standalone titles will have `NaN` values.
- **Ongoing TV series**: If a TV series is still airing, its `endYear` will be missing.
- **Missing data in IMDb’s records**: Some older series may not have an accurately recorded `endYear`.

### **3. `runtimeMinutes` (Duration)**
**Possible reasons for missing values:**
- **Missing or incomplete metadata**: Some titles, especially lesser-known films, may not have a recorded runtime.
- **TV series without a single runtime**: If the dataset includes TV series as a whole rather than individual episodes, their runtime might not be well-defined.
- **Special formats (e.g., web series, experimental films)**: Some media formats might not have a standardized runtime.
- **New releases without runtime information**: Recently announced or upcoming movies might not have a final runtime.

### **4. `genres` (List of Genres)**
**Possible reasons for missing values:**
- **Uncategorized or unknown genres**: Some films or experimental works may not fit into IMDb’s predefined genres.
- **Incomplete metadata**: Older or obscure movies may lack genre classification.
- **Non-traditional content**: Certain types of content, such as test footage, trailers, or placeholders, may not be assigned genres.
- **Errors or inconsistencies in IMDb’s data**: Some records might have been entered incorrectly or left incomplete.
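
Several of the explanations above (for example, that `endYear` is missing for movies and shorts) can be checked by computing the share of null values per `titleType`. The sketch below shows the idea on a toy table. The rows are invented, so the resulting percentages say nothing about the real dataset, and the name `null_share` is my own.

```python
import pandas as pd

# Toy rows invented for illustration; Int64 columns mirror the nullable integers used in this notebook
toy = pd.DataFrame({
    "titleType": ["movie", "movie", "tvSeries", "tvSeries", "short"],
    "endYear": pd.array([pd.NA, pd.NA, 2005, pd.NA, pd.NA], dtype="Int64"),
    "runtimeMinutes": pd.array([90, pd.NA, 30, pd.NA, 5], dtype="Int64"),
})

# isna() gives True/False per cell; the mean of booleans is the share of missing values
null_share = (
    toy[["endYear", "runtimeMinutes"]]
    .isna()
    .groupby(toy["titleType"])
    .mean()
)
print(null_share)
```

If the reasoning holds for the real data, `endYear` should show a null share of 1.0 for `movie` and `short`, and a lower share for `tvSeries`.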

Since we aren't interested in every type of title in this dataset, let's see what values are contained within `titleType` using `.value_counts()`.

In [15]:
# Counts how many times each unique value occurs in the titleType column
title_type_counts = df_title_basics["titleType"].value_counts()

# Print the unique values and their totals
print(title_type_counts)

titleType
tvEpisode       8787841
short           1042439
movie            705351
video            305178
tvSeries         276363
tvMovie          149673
tvMiniSeries      59446
tvSpecial         51260
videoGame         41891
tvShort           10532
tvPilot               1
Name: count, dtype: Int64


The `titleType` column categorizes titles based on their format or medium. Below is an explanation of each `titleType` found in the dataset:

### **1. `tvEpisode` (8,787,841 entries)**
   - Represents individual episodes of a TV series.
   - Each `tvEpisode` entry is associated with a `tvSeries` or `tvMiniSeries` through the `parentTconst` in `title.episode.tsv.gz`.
   - Since TV shows have multiple episodes, this category has the highest count.

### **2. `short` (1,042,439 entries)**
   - Short films, typically under 40 minutes in runtime.
   - Can include animated shorts, student films, independent projects, and experimental films.

### **3. `movie` (705,351 entries)**
   - Full-length feature films intended for theatrical release or streaming.
   - This category includes Hollywood productions, international films, and independent films.

### **4. `video` (305,178 entries)**
   - Represents titles released directly to home video formats (e.g., VHS, DVD, Blu-ray) rather than in theaters or on television.
   - Commonly referred to as direct-to-video releases.

### **5. `tvSeries` (276,363 entries)**
   - Represents an entire television series rather than individual episodes.
   - Each `tvSeries` entry serves as a parent for multiple `tvEpisode` entries in the dataset.

### **6. `tvMovie` (149,673 entries)**
   - Feature-length movies made specifically for television.
   - Includes Lifetime movies, Hallmark films, BBC TV films, and streaming platform-exclusive movies.

### **7. `tvMiniSeries` (59,446 entries)**
   - A limited-run TV series with a predefined number of episodes.
   - Typically tells a single story arc across multiple episodes (e.g., *Chernobyl*, *Band of Brothers*).

### **8. `tvSpecial` (51,260 entries)**
   - One-time television events, such as holiday specials, award shows, or stand-up comedy specials.
   - Examples: *The Oscars*, *Super Bowl Halftime Show*, *Christmas at Rockefeller Center*.

### **9. `videoGame` (41,891 entries)**
   - Represents video games listed on IMDb.
   - Includes both console and PC games, often with voice acting or cinematic cutscenes.

### **10. `tvShort` (10,532 entries)**
   - Short-form content created for television.
   - Includes animated shorts, skits, and mini-documentaries.

### **11. `tvPilot` (1 entry)**
   - A pilot episode created to test a potential TV series.
   - The single entry suggests IMDb may not track pilots separately, or this is an error in the dataset.

There are 11 different title types in this column. Since this first analysis covers only films, not television content or video games, we will create a subset of the data that contains only titles of type `short` or `movie`. We will use the `.isin()` method on the `titleType` column to keep each row whose type is `short` or `movie`.

In [16]:
# Filters out everything in the data that isn't a short or movie
df_title_basics_short_or_movie = df_title_basics[df_title_basics["titleType"].isin(["short", "movie"])]

Now that we have a dataset containing only the entries for our first analysis, let's use `.isnull().sum()` again to see how many null values remain. I also want to know how many entries this new subset has in total, so I will use `.shape[0]` to count its rows.

In [17]:
# Counts the Number of rows
df_title_basics_short_or_movie_num_rows = df_title_basics_short_or_movie.shape[0]

# Prints the Number of rows
print(f"Total number of Movies and Shorts: {df_title_basics_short_or_movie_num_rows}")

# Counts the total number of null entries for each column and displays their totals
df_title_basics_short_or_movie.isnull().sum()

Total number of Movies and Shorts: 1747790


tconst                  0
titleType               0
primaryTitle            0
originalTitle           0
isAdult                 0
startYear          143119
endYear           1747790
runtimeMinutes     634358
genres              76271
dtype: int64

While we still see several issues with null entries in the data, we can start by removing the `endYear` column, which is only used for TV series and is therefore empty for every row in this subset. We will use the `.drop()` method with `axis=1` so that a column, not a row, is removed.

In [18]:
# This removes the `endYear` column from the data
df_title_basics_short_or_movie = df_title_basics_short_or_movie.drop('endYear', axis=1)


We now need to look into explanations for the null values in the other three columns (`startYear`, `runtimeMinutes`, and `genres`). I will first test whether a trend appears depending on whether the film is a `short` or a `movie`. I will do this by creating two separate subsets, one for each type. Then I will use `.isnull().sum()` again to find each subset's null counts. I will also create new variables that hold the total number of entries in each.

In [19]:
# Filters out everything in the data that isn't a short
df_title_basics_short = df_title_basics_short_or_movie[df_title_basics_short_or_movie["titleType"].isin(["short"])]

# Filters out everything in the data that isn't a movie
df_title_basics_movie = df_title_basics_short_or_movie[df_title_basics_short_or_movie["titleType"].isin(["movie"])]

# Counts the Number of rows
df_title_basics_short_num_rows = df_title_basics_short.shape[0]

# Prints the Number of rows
print(f"Total number of Shorts: {df_title_basics_short_num_rows}")

# Counts the total number of null entries for each column and displays their totals
df_title_basics_short.isnull().sum()

Total number of Shorts: 1042439


tconst                 0
titleType              0
primaryTitle           0
originalTitle          0
isAdult                0
startYear          39588
runtimeMinutes    373076
genres                 1
dtype: int64

In [20]:
# Counts the Number of rows
df_title_basics_movie_num_rows = df_title_basics_movie.shape[0]

# Prints the Number of rows
print(f"Total number of Movies: {df_title_basics_movie_num_rows}")

# Counts the total number of null entries for each column and displays their totals
df_title_basics_movie.isnull().sum()

Total number of Movies: 705351


tconst                 0
titleType              0
primaryTitle           0
originalTitle          0
isAdult                0
startYear         103531
runtimeMinutes    261282
genres             76270
dtype: int64

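The null counts do not single out one title type as the source of the missing data. `runtimeMinutes` is missing at a similar rate in both subsets (about 36% of shorts and 37% of movies), although missing `startYear` values are more common among movies (about 15% versus 4% of shorts) and all but one of the missing `genres` values belong to movies. I will leave these values in and keep them in consideration when analyzing the data later.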
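
Because the two subsets differ in size, null rates are easier to compare than raw counts. Below is a minimal sketch of that comparison on a small made-up frame; `toy` and its values are illustrative assumptions, not rows from the IMDb files.

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for df_title_basics_short_or_movie (values are made up)
toy = pd.DataFrame({
    "titleType": ["short", "short", "short", "short", "movie", "movie"],
    "startYear": [1894, np.nan, 1892, 1893, np.nan, 2001],
    "runtimeMinutes": [1, np.nan, 5, np.nan, 90, np.nan],
    "genres": ["Short", "Short", "Animation", "Short", np.nan, "Drama"],
})

value_columns = ["startYear", "runtimeMinutes", "genres"]

# isnull() gives True/False per cell; the mean of those booleans within each
# titleType group is the fraction of missing values for that column
null_rate_by_type = (
    toy[value_columns].isnull().groupby(toy["titleType"]).mean().mul(100).round(1)
)
print(null_rate_by_type)
```

On the real data, the same expression could be applied to `df_title_basics_short_or_movie` in place of `toy`.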

Now we will want to check for any duplicate entries within our data set. We can check for the total number of duplicates using the `.duplicated().sum()` command.

In [21]:
# Counts the total number of fully duplicated rows and displays the total
df_title_basics_short_or_movie.duplicated().sum()

np.int64(0)

Perfect! Our dataset contains no duplicated rows. Now we can connect the movie entries with their respective ratings from `title.ratings.tsv.gz`. We will use the `.merge()` method to combine the two datasets. Since both contain the `tconst` identifier, which stays consistent across all entries, we can use it as the join key with `on="tconst"`. We will also use `how="left"` so that every row of the title data is kept, with the rating columns added on the right; titles that have no rating get null values in those columns.

In [22]:
# This merges together the two data sets, based off their "tconst" value
df_merged_data = df_title_basics_short_or_movie.merge(df_title_ratings, on="tconst", how="left")

df_merged_data.head()

| | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | runtimeMinutes | genres | averageRating | numVotes |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0000001 | short | Carmencita | Carmencita | False | 1894 | 1 | Documentary,Short | 5.7 | 2128.0 |
| 1 | tt0000002 | short | Le clown et ses chiens | Le clown et ses chiens | False | 1892 | 5 | Animation,Short | 5.6 | 286.0 |
| 2 | tt0000003 | short | Poor Pierrot | Pauvre Pierrot | False | 1892 | 5 | Animation,Comedy,Romance | 6.5 | 2163.0 |
| 3 | tt0000004 | short | Un bon bock | Un bon bock | False | 1892 | 12 | Animation,Short | 5.3 | 183.0 |
| 4 | tt0000005 | short | Blacksmith Scene | Blacksmith Scene | False | 1893 | 1 | Short | 6.2 | 2891.0 |
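
After a left merge it can be useful to check how many titles ended up without a rating. The sketch below shows one way to do that on toy frames; `basics_toy` and `ratings_toy` are made up for illustration, and `validate` and `indicator` are optional `merge` parameters that the cell above does not use.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (values are made up)
basics_toy = pd.DataFrame({
    "tconst": ["tt01", "tt02", "tt03"],
    "primaryTitle": ["A", "B", "C"],
})
ratings_toy = pd.DataFrame({
    "tconst": ["tt01", "tt03"],
    "averageRating": [5.7, 6.5],
    "numVotes": [2128, 2163],
})

# how="left" keeps every row of basics_toy; validate raises an error if
# tconst is repeated on either side; indicator adds a "_merge" column
merged_toy = basics_toy.merge(
    ratings_toy, on="tconst", how="left", validate="one_to_one", indicator=True
)

# Rows marked "left_only" are titles with no matching rating
unrated_count = (merged_toy["_merge"] == "left_only").sum()
print(merged_toy)
print(f"Titles without a rating: {unrated_count}")
```

This also explains why `numVotes` appears as a float (`2128.0`) in the output above: pandas converts an integer column to float once missing values are introduced.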


Now that we have our cleaned dataset, we can export it as a CSV file and begin analysis in other software.

In [23]:
# This exports the cleaned data to a CSV file; `index=False` prevents the row index from being written as an extra column
# df_merged_data.to_csv(r"C:\Users\andre\Desktop\The Movie Project\Official IMDB Movie Data Set 6\df_title_basics_short_or_movie_cleaned_data.csv", index=False)

***
# 3. **Exploratory Data Analysis (EDA)**  

In this step, I will conduct an in-depth exploration of the dataset to uncover patterns, trends, and potential insights. The goal of EDA is to develop a strong understanding of the data by summarizing its main characteristics, identifying missing values, detecting anomalies, and analyzing distributions.  

Since I will be using **Tableau** for all visualizations, I will generate relevant plots and charts in Tableau and incorporate them into this notebook for further analysis. This approach allows for a more interactive and dynamic exploration of the data. Key aspects of the EDA process will include:  

- **Summary Statistics:** Examining measures such as count, mean, median, min/max values, and standard deviation for numerical columns.  
- **Data Distributions:** Understanding how different variables are distributed to detect any skewness or outliers.  
- **Relationships Between Variables:** Analyzing correlations and patterns between key features such as runtime, rating, number of votes, and release year.  
- **Categorical Data Analysis:** Exploring the distribution of categorical variables.
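
Although the charts themselves are built in Tableau, the same checks can be sketched in pandas first. The frame below is a made-up example (`eda_toy` and its values are assumptions), with one call for each of the four aspects listed above.

```python
import pandas as pd

# Hypothetical sample of the merged data (values are made up)
eda_toy = pd.DataFrame({
    "titleType": ["short", "short", "movie", "movie", "movie"],
    "runtimeMinutes": [5, 12, 90, 110, 95],
    "averageRating": [6.8, 7.1, 6.0, 6.6, 5.9],
    "numVotes": [40, 75, 3000, 12000, 800],
})
numeric_columns = ["runtimeMinutes", "averageRating", "numVotes"]

# Summary statistics: count, mean, std, min, quartiles (50% is the median), max
summary = eda_toy[numeric_columns].describe()

# Data distributions: a positive skew value indicates a long right tail
skewness = eda_toy[numeric_columns].skew()

# Relationships between variables: pairwise Pearson correlations
correlations = eda_toy[numeric_columns].corr()

# Categorical data analysis: number of titles in each category
type_counts = eda_toy["titleType"].value_counts()

print(summary, skewness, correlations, type_counts, sep="\n\n")
```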

To gain a deeper understanding of the data presented in my analysis, I conducted independent research on the history and evolution of the movie industry. This research helped contextualize the trends observed in the dataset, such as the dominance of short films in total count, the rise of highly rated films over time, and variations in audience engagement through ratings and votes.

## **Research Approach**  

1. **Exploring Film Industry Trends**  
   - I studied the evolution of film production across different decades, identifying key historical events that influenced the number of films released per era.  
   - This included examining the Golden Age of Hollywood, the rise of independent cinema, and the impact of streaming services in the 21st century.  

2. **Understanding Ratings and Audience Engagement**  
   - I researched how audience preferences and rating behaviors have changed over time, especially with the advent of online review platforms like IMDb, Rotten Tomatoes, and Letterboxd.  
   - I analyzed how short films tend to receive fewer votes but higher average ratings due to a more niche and appreciative audience.  

3. **Studying Film Technology and Runtime Trends**  
   - I explored how film technology advancements have influenced movie runtime and content.  
   - The shift from traditional theatrical releases to streaming platforms has impacted both film length and production volume.

## **Resources for Researching the History of the Movie Industry**

### **Websites on Movie Industry History**
- **[American Film Institute (AFI)](https://www.afi.com/)**
- **[British Film Institute (BFI)](https://www.bfi.org.uk/)**  
- **[Academy of Motion Picture Arts and Sciences](https://www.oscars.org/)**
- **[The Library of Congress: National Film Registry](https://www.loc.gov/programs/national-film-preservation-board/film-registry/)**
- **[The Hollywood Reporter](https://www.hollywoodreporter.com/)**
- **[Variety](https://variety.com/)**
- **[Turner Classic Movies (TCM)](https://www.tcm.com/)**  

### **Articles on Film History**
- **["A Brief History of the Movie Industry"](https://www.filmsite.org/filmh.html)**
- **["The Golden Age of Hollywood"](https://www.history.com/news/golden-age-of-hollywood-stars-studios#:~:text=The%20Golden%20Age%20of%20Hollywood,1930s%20and%201940s.)**
- **["The History of Independent Cinema"](https://www.indiewire.com/)** 
- **["New Hollywood: The 1970s Film Revolution"](https://www.criterion.com/current/posts/7479-new-hollywood-and-the-rise-of-the-director-as-auteur)** 
- **["How Streaming Changed the Film Industry"](https://www.nytimes.com/2020/07/01/movies/movie-industry-streaming.html)**

All insights derived from these analyses will help guide the next steps in the project for addressing trends and answering general questions about the data.


![alt text](Basic_Data.png)

## Basic Data Analysis

This table provides key statistics about movies and short films, including **average ratings, total number of films, high-rated films, number of votes, and average runtime**.

### **Key Observations**

### **1. Average Ratings**
- The **overall average rating** across all titles is **6.383**.
- **Movies have a slightly lower average rating (6.158)** compared to **short films (6.820)**.  
  - This suggests that short films may receive more favorable ratings on average, potentially due to **a more niche audience** or **higher appreciation among viewers**.

### **2. Total Number of Films**
- There are **1,747,790 total films**, categorized into:
  - **705,351 movies**  
  - **1,042,439 short films**  
- Short films **outnumber movies** significantly, making up **nearly 60% of all titles**.

### **3. Distribution of Highly Rated Films**
- **157,089 films (≈9% of all films) have a rating above 7.0**.  
- **57,631 films (≈3.3% of all films) have a rating above 8.0**.  
- **12,847 films (≈0.7% of all films) have a rating above 9.0**, making them the **elite tier** of highly rated films.  
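- Interestingly, short films have **more films rated above 9.0 (9,268) than movies (3,579)**.  
  - This holds as a share of each category too (about 0.89% of all shorts versus 0.51% of all movies), which is consistent with the earlier observation that short films are rated more favorably and suggests that **short films receive more extreme positive ratings**.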

### **4. Average Number of Votes**
- **Movies receive significantly more votes (3,629 per movie)** compared to **short films (76 per short film)**.  
  - This suggests that movies **garner more widespread attention and engagement** from viewers.  
  - Short films may be **highly rated but receive fewer votes**, potentially due to **a smaller, dedicated audience**.

### **5. Average Runtime**
- The **overall average runtime** is **43.649 minutes**, but it differs drastically between categories:
  - **Movies average 89.710 minutes (~1.5 hours).**
  - **Short films average just 13.092 minutes.**  
  - The shorter length of short films might contribute to **higher ratings**, as shorter content often has **tighter storytelling and higher engagement**.

### **Conclusions**
- **Short films dominate in quantity but receive far fewer votes per title.**  
- **Movies attract more audience engagement**, reflected in their significantly higher number of votes.  
- **High ratings are more common among short films**, but this may be due to **a more selective and appreciative audience**.  
- **Very few films (0.7%) achieve a rating above 9.0**, making them the **top-tier masterpieces**.  
- The difference in runtime strongly **distinguishes movies from short films**, with short films being significantly shorter but often higher rated.
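
The figures in the table come from Tableau, but a similar per-type summary could be computed in pandas. This is only a sketch on made-up rows: `stats_toy` is an assumption, and the strict `> 9.0` cut-off is my reading of "above 9.0".

```python
import pandas as pd

# Hypothetical sample of the merged data (values are made up)
stats_toy = pd.DataFrame({
    "titleType": ["short", "short", "short", "movie", "movie"],
    "averageRating": [9.2, 6.8, 7.4, 6.1, 9.5],
    "numVotes": [10, 80, 60, 4000, 3000],
    "runtimeMinutes": [10, 15, 14, 88, 92],
})

# Flag highly rated titles so they can be counted per group
stats_toy["above_9"] = stats_toy["averageRating"] > 9.0

# Named aggregation: one output column per statistic
per_type = stats_toy.groupby("titleType").agg(
    avg_rating=("averageRating", "mean"),
    total_films=("averageRating", "size"),
    films_above_9=("above_9", "sum"),
    avg_votes=("numVotes", "mean"),
    avg_runtime=("runtimeMinutes", "mean"),
)

# Share of each type rated above 9.0, so groups of different size compare fairly
per_type["pct_above_9"] = 100 * per_type["films_above_9"] / per_type["total_films"]
print(per_type.round(2))
```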

![alt text](Graph_1.png)

## Graph 1: Average Rating & Number of Films vs. Runtime in Minutes

### Overview
This graph visualizes the relationship between film runtime and two key metrics:
- **Average rating** (blue line) on the left y-axis.
- **Number of films released** (red bars) on the right y-axis.

### Key Observations
1. **Initial Increase in Average Ratings:**
   - Films with runtimes between 0 and 15 minutes show a rapid increase in average ratings.
   - This suggests that, within this range, the very shortest titles are rated lowest, with ratings improving as runtime approaches 15 minutes.

2. **Peak in Average Ratings (~20-40 minutes):**
   - The ratings stabilize around 6.8 to 7.0 within this range before experiencing fluctuations.
   - This could indicate a sweet spot where films are neither too short nor too long for audience engagement.

3. **Decline in Ratings (~40-80 minutes):**
   - A notable downward trend is observed between 40 and 80 minutes, with ratings gradually declining.
   - This could be attributed to lower production quality or engagement issues in mid-length films.

4. **Volatility Beyond 80 Minutes:**
   - The ratings fluctuate significantly beyond this point, with peaks and troughs.
   - Some longer films (~100+ minutes) show higher ratings, which may reflect well-produced feature-length films.

5. **Distribution of Films by Runtime:**
   - The highest concentration of films appears to be in the 80-100 minute range.
   - This aligns with standard feature-length films, suggesting that most productions fall within this duration.
   - However, there are occasional spikes at specific runtimes (e.g., around 120 minutes and 180 minutes), possibly indicating common runtime choices for major productions.

6. **Rare Occurrences of Extremely Long Films:**
   - Films longer than 150 minutes are scarce but tend to have relatively high average ratings.
   - This suggests that longer films might be more carefully crafted and well-received by audiences.

### Statistical Insights
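- **Typical Feature Runtime:** Feature-length films cluster around 80-100 minutes, although the overall mean runtime across all titles (43.649 minutes, from the Basic Data table) is pulled far lower by the large number of shorts.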
- **Skewness:** The distribution of the number of films appears to be right-skewed, meaning shorter films are more common.
- **Potential Outliers:** Some extremely long films with high ratings suggest either cult classics or high-budget productions.

### Conclusion
This graph highlights an important trade-off between runtime and audience reception. While shorter films generally receive favorable ratings, feature-length films (~90 minutes) dominate the industry. Longer films, while fewer in number, tend to be rated highly by IMDb users, possibly due to higher production quality.
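
The two series in this chart (average rating and number of films per runtime) could be derived as sketched below. `runtime_toy` is a made-up example, and converting `runtimeMinutes` with `pd.to_numeric` is an assumption about how the column is stored.

```python
import pandas as pd

# Hypothetical sample with runtimes stored as text (values are made up)
runtime_toy = pd.DataFrame({
    "runtimeMinutes": ["10", "10", "90", "90", "90", None],
    "averageRating": [7.0, 6.6, 6.0, 5.8, 6.2, 7.5],
})

# Convert runtimes to numbers; unparseable or missing values become NaN
runtime_toy["runtimeMinutes"] = pd.to_numeric(
    runtime_toy["runtimeMinutes"], errors="coerce"
)

# One row per runtime: mean rating (the line) and film count (the bars)
by_runtime = (
    runtime_toy.dropna(subset=["runtimeMinutes"])
    .groupby("runtimeMinutes")
    .agg(avg_rating=("averageRating", "mean"), num_films=("averageRating", "size"))
)
print(by_runtime)
```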



![alt text](Graph_2.png)

## Graph 2: Average Rating & Number of Films vs. Runtime in Minutes (5 Min Intervals)

### **Overview**
This graph visualizes the relationship between film runtime (in 5-minute intervals) and two key metrics:
- **Average rating** (blue line) on the left y-axis.
- **Number of films released** (red bars) on the right y-axis.

### **Key Observations**
1. **Short Films (0-30 minutes)**:
   - The number of films is high in this range, peaking around the **5-10 minute mark**.
   - Films with runtimes of **5-15 minutes tend to have high average ratings (~6.8 to 7.1)**.
   - After 15 minutes, there is a **gradual decline in average rating**.

2. **Moderate-Length Films (30-90 minutes)**:
   - There is a **notable dip in average ratings**, reaching the lowest point (~5.8) around **80 minutes**.
   - The number of films also decreases significantly in this range.
   - The ratings trend suggests that shorter films (below 30 minutes) or longer ones (above 90 minutes) tend to be rated higher.

3. **Feature-Length Films (90-120 minutes)**:
   - Average ratings start to **increase steadily after 90 minutes**.
   - The number of films in this range is lower compared to shorter films but still significant.
   - Films around **120 minutes** have noticeably **better average ratings (~6.5 to 6.7).**

4. **Long Films (Above 120 minutes)**:
   - The average rating continues **rising beyond 120 minutes**, reaching **7.0+ in the 160-170 minute range**.
   - The number of films released in this range is significantly lower.
   - This suggests that **longer films tend to be rated higher**, possibly due to factors like higher production budgets or greater storytelling depth.

### **Statistical Trends**
- **U-Shaped Rating Distribution**: 
  - Ratings are higher for very short and very long films, with a dip in mid-range runtimes.
- **Inverse Relationship Between Quantity and Rating**:
  - The **most common runtimes (short films)** do not necessarily have the best ratings.
  - **Longer films, though less frequent, tend to receive better ratings.**

### **Conclusion**
- The data suggests that **short films (under 15 minutes) and long films (over 120 minutes) are generally well-received**.
- There is a **clear dip in ratings for films between 60-90 minutes**, which could indicate lower production value or weaker narratives.
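
Grouping runtimes into 5-minute intervals can be done with integer division. Below is a minimal sketch on made-up values; `bin_toy` and the choice to label each bin by its lower edge are assumptions, since the Tableau binning settings are not shown.

```python
import pandas as pd

# Hypothetical sample (values are made up)
bin_toy = pd.DataFrame({
    "runtimeMinutes": [3, 7, 9, 12, 88, 91, 94],
    "averageRating": [6.5, 7.0, 7.2, 6.9, 5.8, 6.1, 6.3],
})

bin_width = 5

# Floor each runtime to the start of its 5-minute bin: 7 -> 5, 94 -> 90
bin_toy["runtime_bin"] = (bin_toy["runtimeMinutes"] // bin_width) * bin_width

# Mean rating and film count per bin
by_bin = bin_toy.groupby("runtime_bin").agg(
    avg_rating=("averageRating", "mean"),
    num_films=("averageRating", "size"),
)
print(by_bin)
```

Wider bins smooth out the minute-by-minute noise seen in Graph 1, at the cost of hiding spikes at specific runtimes.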




![alt text](Graph_3.png)

## Graph 3: Average Rating & Number of Films vs. Release Year

### **Overview**
This graph examines the relationship between the release year of films and two key metrics:
- **Average rating** (blue line) on the left y-axis.
- **Number of films released** (green bars) on the right y-axis.

Additionally, the mean (**6.382**) and median (**6.070**) ratings are indicated with dashed horizontal lines.

### **Key Observations**
1. **Early Cinema (Pre-1930s)**
   - The number of films produced was relatively low compared to later years.
   - Average ratings increased steadily from around **4.5 in the 1890s to 6.0 by the 1920s**.
   - The **1910s-1920s** show a peak in average ratings, possibly reflecting the rise of silent film masterpieces.

2. **Mid-20th Century (1930-1980)**
   - The number of films remained relatively stable but did not grow significantly.
   - The **average rating hovered between 6.0 and 6.2**, with minor fluctuations.
   - A **slight decline in ratings** is observed in the **1950s and 1960s**, possibly due to an industry shift towards more commercialized productions.

3. **Modern Cinema Boom (1980-Present)**
   - The **number of films released began to rise significantly in the 1990s** and exploded in the 2000s.
   - The average rating **remained stable** around the mean (~6.3), with minor fluctuations.
   - In the **2010s and 2020s**, there was a noticeable increase in the **number of films released**, peaking around 2018-2019.
   - **Recent films (2020s) have higher average ratings**, potentially influenced by selective film releases, digital streaming platforms, or improvements in production quality.

### **Statistical Trends**
- **Early films (before 1930) had a steady improvement in ratings**, likely due to evolving filmmaking techniques.
- **A plateau in ratings (6.0-6.2) occurred from 1930 to 1980**, despite changes in technology and storytelling.
- **A surge in film production after 2000 did not negatively impact average ratings**, suggesting that quality remained consistent despite the increase in quantity.
- **Recent films (2020s) have higher ratings**, which may be due to a shift in viewing habits or improved production standards.

### **Conclusion**
- The film industry has grown significantly, with **a dramatic increase in film production after 2000**.
- **Older films (before 1930) tended to have lower ratings**, while modern films maintain a **stable rating around 6.3-6.4**.
- The **number of films released peaked around 2018-2019**, but ratings did not decline.



![alt text](Graph_4.png)

## Graph 4: Average Rating & Number of Films vs. Release Decade

### **Overview**
This graph presents the **average rating of films** by decade, summarizing the trends observed in individual years. The mean (**6.382**) and median (**6.070**) ratings are marked with dashed horizontal lines.

### **Key Observations**
1. **Early Cinema (1900s-1920s)**
   - The **1900s had the lowest average rating (~5.0)**, likely due to the infancy of filmmaking techniques.
   - Ratings improved significantly in the **1910s (~5.6)** and further in the **1920s (~6.0)**, reflecting advancements in storytelling, cinematography, and silent film innovations.

2. **Golden Age of Hollywood (1930s-1950s)**
   - The **1940s and 1950s saw the highest ratings of the mid-century (~6.2)**.
   - This period aligns with the rise of **classic Hollywood storytelling, influential directors, and technical improvements** such as sound and color cinematography.

3. **Ratings Dip (1960s-1980s)**
   - The **1960s to 1980s show a slight decline**, with average ratings falling below **6.0**.
   - This could be attributed to the transition from classical Hollywood to more experimental filmmaking and commercialized productions.

4. **Modern Cinema (1990s-Present)**
   - The **1990s show a return to a 6.1+ average rating**, possibly due to the impact of digital technology and globalization.
   - The **2000s, 2010s, and 2020s show a sharp increase in ratings, reaching above 6.4**, the highest in the dataset.
   - The **2020s have the highest average rating (~6.6)**, which may be due to selective film releases, streaming platform curation, or changes in audience rating behavior.

### **Statistical Trends**
- **Early cinema (before 1930) saw steady improvements in ratings** as the industry developed.
- **The 1940s and 1950s had the highest-rated classic films**, aligning with Hollywood's golden era.
- **A decline in ratings from the 1960s to 1980s** may reflect a shift in filmmaking styles and industry changes.
- **A resurgence in ratings since the 1990s** suggests that modern films are generally better received.
- **The highest-rated films belong to the 2020s**, possibly due to changes in distribution methods, audience engagement, and review biases.

### **Conclusion**
- The **1900s had the lowest-rated films**, while the **2020s have the highest-rated films**.
- **Mid-century cinema (1930s-1950s) had strong ratings**, but **the 1960s-1980s saw a decline**.
- **Ratings have been rising since the 1990s**, with modern films performing well.
- The increase in **recent ratings (2020s) may warrant further analysis** to determine whether this is a long-term trend or an anomaly.
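
A decade can be derived from `startYear` with integer division, as in the sketch below. `decade_toy` is a made-up example. How the dashed mean and median lines were computed in Tableau is not stated, so treating them as statistics over individual titles is an assumption.

```python
import pandas as pd

# Hypothetical sample (values are made up)
decade_toy = pd.DataFrame({
    "startYear": [1905, 1909, 1994, 1999, 2021, None],
    "averageRating": [5.0, 5.2, 6.0, 6.2, 6.6, 7.0],
})

# Rows without a release year cannot be assigned to a decade
dated = decade_toy.dropna(subset=["startYear"]).copy()

# 1994 -> 1990, 2021 -> 2020
dated["decade"] = (dated["startYear"] // 10 * 10).astype(int)

avg_by_decade = dated.groupby("decade")["averageRating"].mean()

# Candidate reference lines: mean and median of the individual title ratings
overall_mean = dated["averageRating"].mean()
overall_median = dated["averageRating"].median()
print(avg_by_decade, overall_mean, overall_median, sep="\n")
```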



![alt text](Graph_5.png)

## Graph 5: Average Rating & Number of Movies and Shorts vs. Release Year

### **Overview**
This graph presents the **average rating of movies and shorts** over time, along with the **number of films released per year**. It distinguishes between **movies (blue) and shorts (orange)** in both rating trends and volume of releases.

### **Key Observations**
1. **Early Film Era (1900-1920s)**
   - Early short films had significantly **lower ratings (~4.5-5.0) before 1910**, but gradually improved.
   - Movies began to emerge with higher ratings (~5.5) in the 1910s.
   - The number of films released shows an **early peak around the 1910s**, primarily due to the dominance of silent shorts.
   - By the 1920s, movies had an **average rating above 6.0**, surpassing shorts.

2. **Golden Age of Cinema (1930s-1950s)**
   - Both movies and shorts stabilized at **ratings between 6.0 and 6.5**.
   - The number of short films decreased dramatically, while movies became the dominant format.
   - This period saw the rise of **Hollywood studio productions**, with consistent quality improvements.

3. **Transition and Decline (1960s-1980s)**
   - The **average rating of movies declined slightly (~6.0-6.2)** during this period.
   - The production of both movies and shorts remained relatively **low compared to later decades**.
   - The decline in shorts reflects a shift away from traditional short films in theaters.

4. **Modern Film Boom (1990s-Present)**
   - The **number of films released surged significantly after 2000**, driven by digital technology and the rise of independent productions.
   - Both movies and shorts saw an increase in releases, with **shorts growing faster** in recent years.
   - The **average rating of movies remained stable (~6.0-6.3)** but has seen a slight decline post-2015.
   - **Short films' average rating surpassed movies around 2005**, possibly due to higher-quality curated releases on platforms like YouTube and film festivals.

5. **Recent Trends (2010s-2020s)**
   - The **number of movies and shorts released peaked in the 2010s**, with **over 70,000 films per year** at its height.
   - The **rating of movies slightly declined**, while shorts continued to improve, reaching **above 7.0 in recent years**.
   - The decline in film releases after 2020 may be due to **industry disruptions (e.g., COVID-19 pandemic, streaming shifts)**.

### **Statistical Trends**
- **Early shorts had lower ratings but improved over time**, eventually surpassing movies.
- **Movies peaked in ratings in the mid-20th century** before slightly declining.
- **Film production exploded in the 2000s**, with short films making a strong comeback.
- **Modern short films have higher ratings than movies**, possibly due to selective distribution and niche content.

### **Conclusion**
- **Movies dominated ratings historically**, but **short films now rate higher on average**.
- **The number of films released surged post-2000**, driven by independent filmmaking and digital platforms.
- **Short films have seen a resurgence**, potentially reflecting changes in viewer preferences and accessibility.
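
Splitting the yearly averages by title type, as this chart does, can be sketched with a pivot table. `split_toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample (values are made up)
split_toy = pd.DataFrame({
    "startYear": [1910, 1910, 1910, 2010, 2010, 2010],
    "titleType": ["short", "short", "movie", "short", "movie", "movie"],
    "averageRating": [4.8, 5.0, 5.5, 7.1, 6.0, 6.2],
})

# Rows = release year, columns = title type, cells = mean rating
avg_rating = split_toy.pivot_table(
    index="startYear", columns="titleType", values="averageRating", aggfunc="mean"
)

# Same layout, but counting rated titles instead of averaging ratings
num_films = split_toy.pivot_table(
    index="startYear", columns="titleType", values="averageRating", aggfunc="count"
)

# Years in which shorts have a higher average rating than movies
shorts_lead = avg_rating[avg_rating["short"] > avg_rating["movie"]].index.tolist()
print(avg_rating, num_films, shorts_lead, sep="\n\n")
```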


![alt text](Graph_6.png)

## Graph 6: Average Runtime of Movies and Shorts vs. Release Year

### **Overview**
This graph examines the **average runtime of movies and shorts** over time, along with the **number of films released per year**. It differentiates between **movies (blue) and shorts (orange)** in both duration trends and volume of releases.

### **Key Observations**
1. **Early Film Era (1895-1920s)**
   - Short films dominated, with **average runtimes gradually increasing from 1-2 minutes (1895) to around 20 minutes (1915)**.
   - Movies started appearing in the **1910s**, with a rapid increase in runtime, reaching **over 60 minutes by the mid-1910s**.
   - A **peak in short film production around 1915** coincides with their highest average runtimes (~20 minutes).

2. **Golden Age of Cinema (1930s-1950s)**
   - **Movies stabilized at runtimes between 80-90 minutes**, reflecting the standardization of feature-length films.
   - **Short films declined in both runtime (~10-12 minutes) and production volume**, as full-length movies became the primary format.
   - The emergence of **Hollywood studios and feature-length storytelling** reinforced the preference for longer movies.

3. **Transition Period (1960s-1980s)**
   - **Movies maintained an average runtime of 90 minutes**, showing little fluctuation.
   - Short films persisted with **consistent runtimes (~10-12 minutes)** but in significantly fewer numbers.
   - The decline in shorts may reflect **the rise of television and changes in audience preferences**.

4. **Modern Film Boom (1990s-Present)**
   - The **number of films released surged significantly after 2000**, driven by digital technology and independent filmmaking.
   - Movies saw a **gradual increase in runtime, peaking at around 100 minutes in the early 2010s** before slightly declining.
   - Short films became more prevalent again post-2000, but their **average runtime remained stable (~10 minutes)**.
   - The increase in movie runtimes may reflect trends like **epic storytelling, franchise films, and streaming services** allowing for longer formats.

5. **Recent Trends (2010s-2020s)**
   - **Movie runtimes have stabilized around 95 minutes**, with a slight decline in recent years.
   - **Short film runtimes have slightly decreased**, possibly due to the rise of **online short-form content (e.g., YouTube, TikTok, Vimeo)**.
   - The **number of movie and short releases peaked around 2015-2019**, followed by a decline, possibly due to **COVID-19 disruptions** and changes in the industry.

### **Statistical Trends**
- **Early short films increased in runtime until 1915, then declined.**
- **Movies quickly adopted feature-length runtimes (60+ minutes) by the 1910s and stabilized around 90-100 minutes.**
- **Short film production dwindled mid-century but resurged post-2000, though runtimes remained steady (~10 minutes).**
- **Modern movies are slightly longer than historical averages, reflecting shifts in storytelling preferences.**

### **Conclusion**
- **Movies have maintained a relatively stable runtime (~90 minutes), with minor increases over time.**
- **Short films dominated early cinema but declined after the 1920s, resurging in volume post-2000 but not in runtime.**
- **The explosion of films post-2000 coincides with digital distribution, independent cinema, and streaming services.**




![alt text](Graph_7.png)

## Graph 7: Total Number of Votes and Average Number of Votes per Film vs. Release Year

### **Overview**
This graph explores the relationship between the **total number of votes** received by films and the **average number of votes per film** over time. It highlights key peaks and trends in audience engagement with movies across different eras.

### **Key Observations**
1. **Early Film Period (1920s-1950s)**
   - The **total number of votes remained low**, reflecting limited audience reach and accessibility of films.
   - The **average number of votes per film was relatively stable** but low, suggesting that films from this era received relatively little modern audience engagement.

2. **Growth in Film Popularity (1960s-1980s)**
   - A gradual **increase in both total votes and average votes per film** indicates growing audience engagement.
   - This period coincides with the **rise of television and home video**, which likely contributed to increased interest in older films.
   - The **late 1970s show an uptick**, possibly due to the emergence of blockbuster films like *Star Wars* (1977).

3. **Blockbuster Era and Digital Expansion (1990s-2000s)**
   - The **1990s saw a dramatic rise in average votes per film**, peaking in **1994 with an average of 6,421 votes per film**. This may be due to the **popularity of films like *The Shawshank Redemption*, *Pulp Fiction*, and *Forrest Gump***, which remain highly rated and frequently reviewed.
   - The **total number of votes increased sharply in the 2000s**, likely driven by the **expansion of the internet, IMDb, and online film communities**.
   - The **turn of the millennium marks a significant shift**, as the number of votes per film started to decline while the total number of votes continued rising.

4. **Peak Engagement and Decline (2010s-Present)**
   - The **highest total number of votes occurred in 2014, reaching 47.6 million votes**.
   - However, the **average number of votes per film declined** post-2000, possibly due to:
     - A massive increase in the number of films produced, diluting votes across more titles.
     - Shifts in audience behavior, where newer films receive fragmented attention across multiple platforms.
   - The decline in total votes in the **late 2010s-2020s** may reflect:
     - The **rise of streaming services**, where engagement isn't necessarily measured by votes.
     - **Fewer theatrical releases during the COVID-19 pandemic**, impacting audience interaction.

### **Statistical Trends**
- **The 1990s saw the highest average votes per film, peaking in 1994.**
- **The 2010s had the highest total votes, peaking in 2014.**
- **Average votes per film declined sharply after 2000, even as total votes continued to rise.**
- **Recent years (2020s) show a drop in total votes, potentially due to changes in viewing habits.**
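
The two peaks above answer different questions (total engagement versus engagement per film), so it helps to compute them side by side. The following is a minimal sketch in Pandas, not the exact workflow behind Graph 7: the column names `startYear` and `numVotes` follow IMDb's public files and are assumed to survive cleaning, and the sample rows are invented for illustration.

```python
import pandas as pd

# Invented sample rows; column names are an assumption about the cleaned table.
films = pd.DataFrame(
    {
        "startYear": [1994, 1994, 2014, 2014, 2014, 2014],
        "numVotes": [9000, 3000, 5000, 4000, 3000, 2000],
    }
)

# One row per release year: film count, total votes, and mean votes per film.
votes_by_year = (
    films.groupby("startYear")["numVotes"]
    .agg(film_count="count", total_votes="sum", avg_votes="mean")
    .reset_index()
)

# Look up each peak separately, since they need not fall in the same year.
peak_total_year = int(
    votes_by_year.loc[votes_by_year["total_votes"].idxmax(), "startYear"]
)
peak_avg_year = int(
    votes_by_year.loc[votes_by_year["avg_votes"].idxmax(), "startYear"]
)

print(votes_by_year)
print("Year with most total votes:", peak_total_year)
print("Year with highest average votes per film:", peak_avg_year)
```

In this toy sample the year with more films has the larger total but the smaller average, which is the same dilution pattern described for the post-2000 period.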

### **Conclusion**
- **Audience engagement with films surged in the 1990s and peaked in the 2010s but has since declined.**
- **The rise of online film platforms contributed to the increase in total votes.**
- **A growing number of films post-2000 may have led to vote dilution, reducing average votes per film.**
- **Streaming services and digital consumption likely contribute to recent declines in total votes.**


![alt text](Graph_8.png)

## Graph 8: Number of Votes vs. Average Rating

### **Overview**
This graph examines the relationship between a film's **number of votes** and its **average rating**. It provides insight into how audience engagement (as measured by the number of votes) correlates with perceived film quality (as measured by average ratings).

### **Key Observations**
1. **Extreme Ratings Among Films with Few Votes**
   - Films with very few votes tend to have extreme ratings (either very high or very low).
   - Some films with fewer than 10,000 votes have **ratings close to 10**, likely due to niche appeal or dedicated fanbases.
   - Conversely, some low-vote films have **ratings near 1**, possibly due to poor quality or review bombing.

2. **Majority of Films Cluster Around 6-8 Ratings**
   - The majority of films, especially those with moderate vote counts (10K - 500K), tend to have ratings between **6 and 8**.
   - This suggests that as a film receives more votes, its rating stabilizes within a predictable range.

3. **Films with Extremely High Votes**
   - Movies with over **1 million votes** generally have ratings above **7**, with very few exceptions.
   - This indicates that widely popular films tend to be well-received, as films with poor reception rarely accumulate such high vote counts.

4. **Lower-Rated Films and Vote Count**
   - The lowest-rated films (ratings below **5**) tend to have fewer votes.
   - This suggests that poorly rated films may not be widely watched, limiting audience engagement.

5. **Vote Count Thresholds**
   - Dashed vertical lines highlight key vote count thresholds (**250K, 500K, 1M**), marking different levels of audience engagement.
   - A clear trend emerges where higher vote counts are associated with higher average ratings.

### **Statistical Trends**
- Films with very few votes show high variance in ratings.
- Most films fall within the **6-8 rating range**, especially those with moderate votes.
- The highest-rated films tend to have higher vote counts, while the lowest-rated films remain relatively obscure.
- The scatter plot suggests a **positive association** between a film’s popularity (vote count) and the likelihood of a higher rating, although the strength of that relationship is not quantified in this graph.
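
One way to put numbers on these trends is to compute a correlation coefficient and the rating spread within each vote bucket. The sketch below is illustrative only: `numVotes` and `averageRating` are the column names in IMDb's ratings file and are assumed here, the sample rows are invented, and the bucket edges simply reuse the 250K, 500K, and 1M thresholds drawn on the graph.

```python
import numpy as np
import pandas as pd

# Invented sample rows; column names are an assumption about the cleaned table.
ratings = pd.DataFrame(
    {
        "numVotes": [12, 40, 900, 15_000, 120_000, 300_000, 800_000, 2_500_000],
        "averageRating": [9.8, 1.4, 5.1, 6.2, 6.9, 7.4, 7.9, 8.8],
    }
)

# Vote counts span several orders of magnitude, so correlate on a log scale.
log_votes = np.log10(ratings["numVotes"])
pearson_log = log_votes.corr(ratings["averageRating"])

# Rank-based (Spearman-style) correlation: Pearson correlation of the ranks.
spearman = ratings["numVotes"].rank().corr(ratings["averageRating"].rank())

# Rating spread per vote bucket, using the thresholds marked on the graph.
bins = [0, 250_000, 500_000, 1_000_000, float("inf")]
labels = ["<250K", "250K-500K", "500K-1M", ">1M"]
ratings["vote_bucket"] = pd.cut(ratings["numVotes"], bins=bins, labels=labels)
spread = ratings.groupby("vote_bucket", observed=True)["averageRating"].agg(
    ["count", "mean", "std"]
)

print(f"Pearson (log10 votes): {pearson_log:.3f}")
print(f"Spearman (ranks):      {spearman:.3f}")
print(spread)
```

A larger standard deviation in the lowest bucket would support the observation that low-vote films have the most extreme ratings, while the coefficients indicate how strong the overall association really is.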

### **Conclusion**
- Films with low vote counts can be highly polarizing, leading to extreme ratings.
- As a film receives more votes, its rating tends to stabilize, converging towards the **6-8 range**.
- Highly popular films generally receive higher ratings, suggesting that mainstream movies tend to be better received or more widely appreciated.
- Niche or independent films may receive high ratings but often remain underrepresented in total vote counts.


![alt text](Graph_9.png)

## Graph 9: Number of Highly Rated Movies vs. Release Year

### **Overview**
This graph tracks the number of highly rated films (with ratings above **7.0, 8.0, and 9.0**) over time, categorized by their **release year**. It highlights trends in film production quality and audience reception over more than a century of cinema history.

### **Key Observations**
1. **Significant Growth in Highly Rated Films Over Time**
   - The number of films rated **above 7.0, 8.0, and 9.0** has increased dramatically over the years.
   - The trend is particularly noticeable from the **1990s onward**, likely due to an increase in film production, global accessibility, and online rating platforms.

2. **Peak Release Years for Highly Rated Films**
   - The **year with the most films rated > 7.0**: **2017** (7,505 films).
   - The **year with the most films rated > 8.0**: **2017** (3,342 films).
   - The **year with the most films rated > 9.0**: **2023** (1,125 films).
   - The peak in **2017** suggests a high volume of well-received films that year, while the high count for **2023** indicates a recent surge in films rated exceptionally high.

3. **Pre-1980s Era: Relatively Low Numbers of Highly Rated Films**
   - From **1910 to 1980**, the number of highly rated films remained relatively low.
   - This could be due to:
     - Lower overall film production.
     - Limited audience reach and fewer rating sources.
     - A smaller, niche audience rating older films today.

4. **Post-2000 Boom in Highly Rated Films**
   - The number of films rated **> 7.0 and > 8.0** saw **rapid growth after 2000**.
   - This surge could be attributed to:
     - The rise of digital filmmaking, streaming platforms, and independent films.
     - More widespread internet access, allowing a larger number of people to rate films.
     - The globalization of film industries leading to more diverse content.

5. **Steady Growth in Films Rated Above 9.0**
   - The number of films rated **above 9.0** has also increased over time, but the growth is **slower** compared to films rated **above 7.0 or 8.0**.
   - The highest count of **films with > 9.0 ratings** was recorded in **2023** (1,125 films).
   - This suggests that **exceptional films are being produced at a higher rate in recent years**, though they remain a rare subset.
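
The per-year counts behind this graph come down to a threshold filter followed by a group-by. The sketch below shows that idea in Pandas under stated assumptions: the column names `startYear` and `averageRating` follow IMDb's public files, the sample rows are invented, and the chart itself may have been produced differently.

```python
import pandas as pd

# Invented sample rows; column names are an assumption about the cleaned table.
films = pd.DataFrame(
    {
        "startYear": [2016, 2017, 2017, 2017, 2017, 2023, 2023, 2023],
        "averageRating": [7.4, 7.1, 8.3, 8.6, 8.1, 9.5, 9.1, 6.8],
    }
)

thresholds = [7.0, 8.0, 9.0]

# For each threshold, count films per release year whose rating exceeds it.
# Years with no qualifying film come out as NaN after alignment, so fill with 0.
counts = (
    pd.DataFrame(
        {
            f"above_{t:g}": films.loc[films["averageRating"] > t]
            .groupby("startYear")
            .size()
            for t in thresholds
        }
    )
    .fillna(0)
    .astype(int)
)

# Release year with the highest count for each threshold.
peak_years = counts.idxmax()

print(counts)
print(peak_years)
```

Because these are raw counts, they rise whenever more films are released. Dividing each column by the total number of films per year would show whether the share of highly rated films is also growing.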

### **Possible Explanations for Trends**
- **Increase in Film Production**: More movies are being made each year, leading to a natural increase in the number of highly rated films.
- **Changing Audience Preferences**: Modern audiences might be more generous with ratings, especially on online platforms.
- **Streaming & Accessibility**: More people now have access to international films, leading to wider appreciation and higher ratings.
- **Rating Coverage of Older Films**: Older films have fewer ratings overall, which limits how many of them appear in the higher-rating categories.

### **Conclusion**
- The number of highly rated films has **significantly increased over the past two decades**, with peaks around 2017 and 2023.
- Films rated **above 7.0 and 8.0** have grown rapidly, while films rated **above 9.0** remain relatively rare but are increasing.
- Modern trends in **film production, distribution, and audience engagement** are likely driving this shift.


![alt text](Graph_10.png)

## Graph 10: Percentage of Total Number of Films by Decade

### **Overview**
This pie chart represents the **distribution of total films produced across different decades**, showing the relative proportion of films released in each era. The visualization helps identify trends in film production growth over time.

### **Key Observations**
1. **Dominance of the 2010s and 2020s**
   - The largest portion of films were released in the **2010s**, making up the most significant percentage of total films.
   - The **2020s** also contribute a substantial portion, despite the decade not being complete yet.

2. **Accelerating Growth in Film Production Over Time**
   - Film production has increased **sharply** in recent decades.
   - Earlier decades (pre-1950s) account for a **small fraction** of total films, reflecting the **early years of cinema** when film production was limited.
   - The trend indicates a **continuous rise in global film production**, likely due to the expansion of the industry, technology improvements, and accessibility to filmmaking.

3. **Small Share for Older Decades**
   - The **1900s–1940s** have a relatively **tiny share** of the total films, as expected.
   - The growth in film production appears to take off in the **1950s and beyond**, with each subsequent complete decade producing more films than the last.

4. **Explosive Growth from the 1990s Onward**
   - The **1990s, 2000s, 2010s, and 2020s** make up the bulk of the chart.
   - This rapid expansion is likely due to:
     - The rise of **digital filmmaking**.
     - Growth in **independent film production**.
     - Expansion of **streaming platforms** leading to more diverse film production worldwide.
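
The decade shares shown in the pie chart can be derived by flooring each release year to its decade and normalizing the counts. This is a minimal sketch with invented rows; the `startYear` column name is an assumption based on IMDb's public files.

```python
import pandas as pd

# Invented sample rows; the column name is an assumption about the cleaned table.
films = pd.DataFrame(
    {"startYear": [1925, 1987, 1994, 2003, 2011, 2015, 2018, 2019, 2021, 2022]}
)

# Floor each year to the start of its decade, e.g. 2017 -> 2010.
films["decade"] = (films["startYear"] // 10) * 10

# Share of all films per decade, in percent, ordered chronologically.
decade_share = (
    films["decade"].value_counts(normalize=True).sort_index().mul(100).round(1)
)

print(decade_share)
```

Since the 2020s are incomplete, comparing films per year within each decade (count divided by the number of years covered) would give a fairer view of the current production rate.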

### **Possible Explanations for Trends**
- **Technological Advancements**: The shift from film reels to digital technology has made filmmaking **cheaper and more accessible**.
- **Streaming and Online Platforms**: Services like **Netflix, Amazon Prime, and YouTube** have encouraged **mass production** of films and content.
- **Globalization of Film Industry**: More countries now produce films at a higher rate, contributing to the growth seen in the **2000s–2020s**.
- **Increased Audience Demand**: The expansion of **entertainment markets** worldwide has fueled the need for more films across genres and languages.

### **Conclusion**
- The **2010s and 2020s** dominate in terms of film production.
- **Film production has increased over time**, with **especially rapid growth from the 1990s onward**.
- Advances in **technology, distribution platforms, and global access to filmmaking** have driven this trend.


![alt text](Graph_11.png)

## Graph 11: Top 10 Films of all Time by Rating Weight (Bayesian Average)

Using a Bayesian approach ensures that films with a small number of votes do not artificially reach the top rankings. This list is dominated by classics and widely acclaimed films.
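
The exact formula and parameters behind this ranking are not restated here. A common choice is the weighted rating associated with IMDb's Top 250, `WR = (v / (v + m)) * R + (m / (v + m)) * C`, where `R` is a film's average rating, `v` its vote count, `C` the mean rating across all films, and `m` a prior weight expressed in votes. The sketch below assumes that formula; the film titles and the values of `C` and `m` are hypothetical.

```python
def bayesian_rating(avg_rating, num_votes, global_mean, min_votes):
    """Weighted rating: WR = v/(v+m) * R + m/(v+m) * C."""
    v, m = num_votes, min_votes
    return (v / (v + m)) * avg_rating + (m / (v + m)) * global_mean


# Hypothetical films: (title, average rating R, number of votes v).
films = [
    ("Obscure Gem", 9.8, 50),
    ("Widely Seen Classic", 9.2, 2_000_000),
    ("Solid Hit", 8.0, 400_000),
]

GLOBAL_MEAN = 6.9   # C: assumed mean rating across all films
MIN_VOTES = 25_000  # m: assumed prior weight, not the value used for Graph 11

# Rank films by their adjusted rating, highest first.
ranked = sorted(
    ((title, bayesian_rating(r, v, GLOBAL_MEAN, MIN_VOTES)) for title, r, v in films),
    key=lambda pair: pair[1],
    reverse=True,
)

for title, score in ranked:
    print(f"{title}: {score:.4f}")
```

With few votes, a film's adjusted rating stays close to the global mean no matter how high its raw average is. As votes accumulate, the adjusted rating approaches the raw average, which is why heavily voted films dominate the list.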

### **Observations**
- The highest-rated film is *The Shawshank Redemption* (1994) with a Bayesian rating of 9.2976.
- *The Godfather* (1972) and *The Chaos Class* (1975) follow closely behind.
- *The Dark Knight* (2008) is the highest-ranked film from the 21st century.
- The *Lord of the Rings* series has two entries, showing its strong cultural and critical impact.

### **Statistical Insights**
- **Era-based Popularity**: The majority of these films are from the late 20th century; the post-2000 entries are *The Dark Knight* and the two *Lord of the Rings* films.
- **Genre Influence**: Most films are drama or crime-based, suggesting these genres tend to receive higher audience ratings.
- **Bayesian Adjustment**: The use of Bayesian ratings dampens the effect of vote count variations, so films with only a few votes cannot outrank films backed by a broad consensus.

***

# 4. **Findings and Conclusion**

## **1. Growth in Film Production Over Time**
- The number of films produced has significantly increased over the past few decades.
- Advancements in **technology, digital filmmaking, and streaming platforms** have contributed to the surge in film production.
- Older decades have a smaller share of total films, suggesting that **barriers to entry were higher in the past**, and the industry has become more accessible over time.

## **2. Film Ratings and Audience Perception**
- Most films receive **moderate ratings**, with only a small percentage achieving extremely high scores.
- While more films are being produced, **quality does not necessarily scale with quantity**—exceptional films remain rare.
- Audience engagement varies widely, with some films receiving significantly more attention and votes than others.
- The **Bayesian Average approach** ensures that the most highly rated films have a strong consensus among viewers.

## **3. The Enduring Legacy of Classic Films**
- Despite the rapid growth of new films, **older, critically acclaimed films remain highly rated** and continue to be regarded as some of the best in history.
- Films that **feature strong storytelling, cultural impact, and critical acclaim** tend to maintain their status over time.
- Some of the highest-rated films belong to well-known directors or franchises, suggesting that reputation and branding play a role in audience reception.

## **4. The Rise of Streaming and Digital Media**
- The rapid increase in film production in recent decades aligns with the growth of **online streaming platforms, independent filmmaking, and international cinema**.
- Short films outnumber feature-length films, likely due to the ease of production and distribution via **digital and social media**.
- The industry has shifted from traditional theatrical releases to **on-demand and streaming-first distribution models**.

## **5. Hollywood’s Influence and Global Trends**
- A significant portion of the most acclaimed films come from **Hollywood**, reinforcing its dominance in global cinema.
- However, with increasing accessibility, **international and independent films are also gaining recognition**.
- The industry’s focus on **franchise films and large-scale productions** reflects trends in audience demand and commercial success.

## **6. Audience Engagement and Popularity**
- Some films receive a vastly higher number of votes and reviews, indicating **wider cultural impact and audience engagement**.
- **Blockbuster films and franchises tend to dominate** ratings and public attention, suggesting that branding and familiarity contribute to long-term popularity.
- The length and format of a film also play a role in audience engagement, with feature-length films generally receiving more attention than short films.

## **Final Thoughts**
This project involved an in-depth analysis of IMDb movie data, focusing on general film characteristics such as runtime, release year trends, and average ratings. Using a structured data analysis workflow, the dataset was cleaned, processed, and explored to uncover meaningful insights. Tableau was utilized for visualizing trends, allowing for a clearer understanding of the relationships between different variables.  

Key observations from the data include variations in film runtimes over the decades, shifts in the volume of movie releases, and how these factors correlate with audience ratings. By analyzing patterns in the data, broad trends in filmmaking over time were identified, such as the increasing prevalence of certain runtime categories and the distribution of ratings across different time periods.  

Overall, this project highlights the importance of structured data analysis in understanding large datasets and deriving insights from them. By applying data cleaning, visualization, and interpretation techniques, meaningful observations about the film industry were extracted, setting the stage for further studies in related areas.  

Some key points to take away from this project are:
- The movie industry has **expanded tremendously** in both production volume and accessibility.
- Despite the growth, **only a small percentage of films achieve widespread critical and audience acclaim**.
- The **rise of digital platforms, independent filmmaking, and international cinema** is reshaping how films are created, distributed, and consumed.
- Classic films continue to hold **strong cultural significance**, proving that certain elements of storytelling and filmmaking have timeless appeal.