![alt text](../movie-3057394_1280.jpg)

# Business Understanding

The objective of this project is to analyze historical movie data to generate actionable insights for a new movie studio venture. Specifically, the analysis aims to identify the key factors that contribute to the commercial success of movies, focusing on profitability, audience reception, and production efficiency. This will help the studio make data-driven decisions regarding budget allocation, genre selection, and release strategies to maximize return on investment (ROI) and minimize financial risks.

## Key Business Questions:

- **Which genres and types of movies yield the highest ROI and profitability?**
- **How does production budget correlate with movie revenue and audience ratings?**
- **Which studios and production companies have a consistent track record of success?**
- **What factors (e.g., runtime, release year, genre) influence a movie’s box office performance?**


# Data Understanding 

The project analyzes five datasets collected from different sources to uncover trends and patterns in the movie industry. Each dataset is summarized below with its shape, its key columns, and the data issues that must be addressed before analysis:

---

## 1. **Box Office Mojo Data**

**Overview**: This dataset provides box office revenue and studio-related details.

- **Shape**: 3,387 rows and 5 columns.

### Columns:
- **title**: Movie title (non-null).
- **studio**: Studio responsible for the movie (5 missing values).
- **domestic_gross**: Domestic gross earnings (28 missing values).
- **foreign_gross**: Foreign gross earnings (1,350 missing values, stored as strings).
- **year**: Release year (non-null).

### Key Issues:
- `foreign_gross` is stored as strings, requiring conversion to numeric format.
- Missing values in `studio` (5), `domestic_gross` (28), and `foreign_gross` (1,350).
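
A minimal sketch of the `foreign_gross` conversion described above, using pandas. The rows are made up for illustration, and the assumption that the strings contain thousands separators is not confirmed by the summary above:

```python
import pandas as pd

# Toy rows standing in for the Box Office Mojo table (values are made up).
bom = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "foreign_gross": ["652000000", "1,250,000", None],
})

# Remove thousands separators, then parse; unparseable entries become NaN.
bom["foreign_gross"] = pd.to_numeric(
    bom["foreign_gross"].str.replace(",", "", regex=False),
    errors="coerce",
)
```

Using `errors="coerce"` keeps the conversion from failing on unexpected strings, at the cost of silently turning them into missing values, so it is worth comparing the missing count before and after.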

---

## 2. **The Numbers Data**

**Overview**: Covers production budgets and domestic and worldwide gross revenue.

- **Shape**: 5,782 rows and 6 columns.

### Columns:
- **id**: Unique identifier for each movie (non-null).
- **release_date**: Movie release date (non-null).
- **movie**: Movie title (non-null).
- **production_budget**: Production budget (stored as strings with commas, requires conversion).
- **domestic_gross** and **worldwide_gross**: Revenue columns stored as strings with commas.

### Key Issues:
- All revenue columns and budgets are in string format, requiring numeric conversion.
- No missing values, but the format needs cleaning for analysis.
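
One possible way to clean these columns is sketched below. The helper name `to_number` and the sample values are hypothetical. Stripping a leading currency symbol is also an assumption, since the summary above only mentions commas:

```python
import pandas as pd

def to_number(col: pd.Series) -> pd.Series:
    """Drop every character except digits and '.', then parse as numbers.

    Note: this also drops minus signs, which is acceptable only because
    budgets and grosses are assumed to be non-negative.
    """
    cleaned = col.astype(str).str.replace(r"[^\d.]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Toy rows standing in for The Numbers table (values are made up).
tn = pd.DataFrame({
    "movie": ["Movie A", "Movie B"],
    "production_budget": ["$425,000,000", "110,000"],
    "domestic_gross": ["760,507,625", "0"],
    "worldwide_gross": ["2,776,345,279", "0"],
})

money_cols = ["production_budget", "domestic_gross", "worldwide_gross"]
tn[money_cols] = tn[money_cols].apply(to_number)
```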

---

## 3. **Rotten Tomatoes Reviews Data**

**Overview**: Contains reviews, ratings, and publisher information for movies.

- **Shape**: 54,432 rows and 8 columns.

### Columns:
- **id**: Unique identifier for movies (non-null).
- **review**: Textual review (5,563 missing values).
- **rating**: Rating given by critics (13,517 missing values).
- **fresh**: Whether the review is "fresh" or "rotten" (non-null).
- **critic**: Name of the critic (2,722 missing values).
- **top_critic**: Binary flag for top critics (non-null).
- **publisher**: Publisher of the review (309 missing values).
- **date**: Date of the review (non-null).

### Key Issues:
- Missing values in the `review`, `rating`, `critic`, and `publisher` columns.
- Some columns may not directly impact the analysis depending on objectives.

---

## 4. **Rotten Tomatoes Movie Info Data**

**Overview**: Provides additional metadata such as genres, directors, runtime, and box office data.

- **Shape**: 1,560 rows and 12 columns.

### Columns:
- **id**: Unique identifier (non-null).
- **synopsis**: Movie synopsis (62 missing values).
- **rating**: MPAA rating (3 missing values).
- **genre**: Movie genre (8 missing values).
- **director**: Director name (199 missing values).
- **writer**: Writer name (449 missing values).
- **theater_date**: Theater release date (359 missing values).
- **dvd_date**: DVD release date (359 missing values).
- **currency** and **box_office**: Currency type and box office earnings (mostly missing).
- **runtime**: Runtime of the movie (30 missing values).
- **studio**: Studio responsible (mostly missing).

### Key Issues:
- High number of missing values in `studio`, `currency`, and `box_office`.
- Sparse data may limit the usability of certain columns in the analysis.

---

## 5. **TheMovieDB Data**

![alt text](../movie_data_erd.jpeg)

**Overview**: Covers popularity metrics, genre information, vote averages, and counts.

- **Shape**: 26,517 rows and 10 columns.

### Columns:
- **id**: Unique identifier (non-null).
- **genre_ids**: List of genre IDs associated with the movie (non-null).
- **original_language**: Language of the movie (non-null).
- **original_title** and **title**: Original and common titles (non-null).
- **popularity**: Popularity score (non-null).
- **release_date**: Release date (non-null).
- **vote_average** and **vote_count**: Average rating and number of votes (non-null).

### Key Issues:
- No missing values, but `genre_ids` may require decoding for interpretability.
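
Decoding `genre_ids` could look roughly like the sketch below. It assumes the column holds list literals stored as strings (for example `"[12, 28]"`), which should be checked against the actual file. The `GENRE_NAMES` lookup is a partial, illustrative mapping, and the full table should be taken from TMDB's own genre list:

```python
import ast
import pandas as pd

# Partial, illustrative lookup (assumption); load the full list from TMDB.
GENRE_NAMES = {28: "Action", 12: "Adventure", 16: "Animation",
               35: "Comedy", 18: "Drama"}

def decode_genres(raw):
    """Turn '[12, 28]' (or an actual list of ints) into a list of genre names."""
    ids = ast.literal_eval(raw) if isinstance(raw, str) else raw
    # Unknown IDs are kept visible instead of being dropped silently.
    return [GENRE_NAMES.get(i, f"unknown_{i}") for i in ids]

# Toy rows standing in for the TheMovieDB table (values are made up).
tmdb = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genre_ids": ["[12, 28]", "[18]"],
})
tmdb["genres"] = tmdb["genre_ids"].apply(decode_genres)
```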


## Dataframes for Analysis

Based on the analysis goals, the following four datasets will be most useful for deriving insights:

### 1. **Box Office Mojo Data**  
- **Key Variables**: `title`, `studio`, `domestic_gross`, `foreign_gross`, `year`  
- **Usage**: Analyze box office performance, studio performance, and trends in domestic vs. international earnings.

### 2. **The Numbers Data**  
- **Key Variables**: `movie`, `production_budget`, `domestic_gross`, `worldwide_gross`, `release_date`  
- **Usage**: Analyze the correlation between production budgets and box office revenues to assess profitability.
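
As a sketch of the profitability measures, the snippet below assumes the budget and gross columns have already been converted to numbers. ROI is defined here as profit divided by production budget, which is one common convention and an assumption of this sketch, not a definition given elsewhere in this document:

```python
import pandas as pd

# Toy rows with already-numeric columns (values are made up).
tn = pd.DataFrame({
    "movie": ["Movie A", "Movie B", "Movie C"],
    "production_budget": [100_000_000, 20_000_000, 50_000_000],
    "worldwide_gross": [250_000_000, 15_000_000, 100_000_000],
})

# Profit and ROI per movie (ROI of 1.5 means a 150% return on the budget).
tn["profit"] = tn["worldwide_gross"] - tn["production_budget"]
tn["roi"] = tn["profit"] / tn["production_budget"]

# Pearson correlation between budget and worldwide gross.
budget_gross_corr = tn["production_budget"].corr(tn["worldwide_gross"])
```

Rows with a zero budget would produce infinite ROI values, so they should be filtered or inspected before aggregating.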

### 3. **Rotten Tomatoes Movie Info Data**  
- **Key Variables**: `rating`, `genre`, `director`, `runtime`, `box_office`  
- **Usage**: Analyze how different factors like genre, director, and runtime impact box office performance.

### 4. **TheMovieDB Data**  
- **Key Variables**: `title`, `popularity`, `vote_average`, `vote_count`, `release_date`  
- **Usage**: Investigate how popularity, audience ratings, and vote counts correlate with box office success.


---
## Combined Issues and Cleaning Plan

### 1. **Data Consistency**
- Movie titles may be written differently across datasets, so titles (or IDs, where datasets share them) must be normalized and matched carefully when merging.
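
One way to approach title matching is to merge on a normalized key plus the release year. The helper `title_key`, the sample titles, and the presence of a `year` column in both tables are assumptions made for this sketch:

```python
import re
import pandas as pd

def title_key(title: str) -> str:
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    key = re.sub(r"[^a-z0-9 ]", "", title.lower())
    return re.sub(r"\s+", " ", key).strip()

# Toy rows (made up); `year` in `tn` is assumed to be derived from release_date.
bom = pd.DataFrame({"title": ["Movie A: The Return", "Movie B"],
                    "year": [2015, 2016]})
tn = pd.DataFrame({"movie": ["movie a - the return", "Movie  B"],
                   "year": [2015, 2016]})

bom["key"] = bom["title"].map(title_key)
tn["key"] = tn["movie"].map(title_key)

# validate raises if a key/year pair is not unique on either side.
merged = bom.merge(tn, on=["key", "year"], how="inner", validate="one_to_one")
```

Including the year in the join key helps keep remakes and re-releases with the same title from being matched to each other.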

### 2. **Formatting**
- Several revenue and budget columns (e.g., `domestic_gross`, `worldwide_gross`, `production_budget`) are stored as strings containing special characters such as commas; these must be cleaned and converted to numeric values.

### 3. **Missing Data**
- Sparse columns (e.g., `foreign_gross` in Box Office Mojo; `studio`, `currency`, and `box_office` in Rotten Tomatoes Movie Info) will be either imputed or excluded, depending on their relevance to the analysis.
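
A simple way to decide between imputation and exclusion is to look at the share of missing values per column first. In the sketch below, the helper `missing_share`, the sample rows, and the 50% cutoff `MAX_MISSING` are all hypothetical choices:

```python
import pandas as pd

def missing_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)

# Toy rows standing in for the Rotten Tomatoes Movie Info table (made up).
info = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Action", "Drama"],
    "runtime": ["104 minutes", None, "90 minutes", "120 minutes"],
    "box_office": [None, None, None, "600,000"],
})

share = missing_share(info)

MAX_MISSING = 0.5  # hypothetical cutoff; tune to the analysis
to_drop = share[share > MAX_MISSING].index
info_trimmed = info.drop(columns=to_drop)
```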

### 4. **Deduplication**
- After merging, check for duplicate entries by movie title or ID and remove them so that no movie is counted more than once.
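
The duplicate check might be sketched as follows. Pairing the title with the year as the duplicate key is an assumption of this example, chosen so that different movies sharing a title are not collapsed:

```python
import pandas as pd

# Toy merged table (values are made up).
merged = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B", "Movie B"],
    "year": [2015, 2015, 2016, 1990],
    "domestic_gross": [100.0, 100.0, 50.0, 20.0],
})

key = ["title", "year"]

# Inspect every row involved in a duplicate before dropping anything.
dupes = merged[merged.duplicated(subset=key, keep=False)]

# Keep the first occurrence of each title/year pair.
deduped = merged.drop_duplicates(subset=key, keep="first").reset_index(drop=True)
```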

### 5. **Alignment**
- Date formats, genre categories, and identifiers (like movie IDs) need to be standardized across datasets to ensure consistent analysis.
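
For date alignment, each source can be parsed with its own explicit format and then compared on a common type. The two formats shown here (`"Dec 18, 2009"` and `"2009-12-18"`) are assumptions about how the sources store dates and should be checked against the actual files:

```python
import pandas as pd

# Toy date strings in two assumed source formats (values are made up).
tn_dates = pd.Series(["Dec 18, 2009", "May 20, 2011"])
tmdb_dates = pd.Series(["2009-12-18", "2011-05-20"])

# Parse with explicit formats; strings that do not match become NaT.
tn_parsed = pd.to_datetime(tn_dates, format="%b %d, %Y", errors="coerce")
tmdb_parsed = pd.to_datetime(tmdb_dates, format="%Y-%m-%d", errors="coerce")

# A shared release-year column can then serve as part of a merge key.
tn_year = tn_parsed.dt.year
```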

---


