In [1]:

# Import libraries


# Data manipulation and analysis
import pandas as pd  # pandas is used for handling and processing data in DataFrame structures
import numpy as np  # numpy is useful for numerical computations and handling arrays
import gzip  # gzip is for handling compressed files

# Data visualization
import matplotlib.pyplot as plt  # matplotlib is used for creating static, interactive, and animated visualizations
import seaborn as sns  # seaborn provides a high-level interface for drawing attractive statistical graphics

# Database interaction
import sqlite3  # sqlite3 is used to connect to SQLite databases

# Notebook and file-system utilities
import nbconvert  # nbconvert is used to convert Jupyter Notebooks into various formats
import os  # os is used for working with file paths


# Set visualization style
sns.set_theme(style="whitegrid")



## 1. **Box Office Mojo Data**

**Overview**: This dataset provides box office revenue and studio-related details.

- **Shape**: 3,387 rows and 5 columns.

### Columns:
- **title**: Movie title (non-null).
- **studio**: Studio responsible for the movie (5 missing values).
- **domestic_gross**: Domestic gross earnings (28 missing values).
- **foreign_gross**: Foreign gross earnings (1,350 missing values, stored as strings).
- **year**: Release year (non-null).

### Key Issues:
- `foreign_gross` is stored as strings, requiring conversion to numeric format.
- Missing values in the `studio` and revenue columns.
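
Before converting `foreign_gross`, it can help to see which strings would not parse as plain numbers. The sketch below uses a small hypothetical sample (not rows taken from the dataset) and `pd.to_numeric` with `errors='coerce'` to surface them; the real column may contain other formats.

```python
import pandas as pd

# Hypothetical sample values, chosen only to illustrate the check
sample = pd.Series(['652000000', '1,131.6', None, '513900000'], name='foreign_gross')

# Values that are present but cannot be parsed become NaN after coercion
coerced = pd.to_numeric(sample, errors='coerce')
unparseable = sample[sample.notna() & coerced.isna()]

print(unparseable.tolist())
```

Any strings listed here need explicit cleaning before `astype(float)` will succeed.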

In [2]:
# Define the path to your raw zipped data
file_path = 'C:/Users/USER/Desktop/Movie-Project/data/raw/zippedData/bom.movie_gross.csv.gz'

# Load the gzipped CSV directly
bom_gross = pd.read_csv(file_path, compression='gzip')

# Display the first few rows of the data
display(bom_gross.head())
bom_gross.dtypes

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


title              object
studio             object
domestic_gross    float64
foreign_gross      object
year                int64
dtype: object
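
The load cells in this notebook use an absolute path that exists on only one machine. A minimal sketch of building the path relative to the project root with `pathlib` follows; `DATA_DIR` and `data_path` are hypothetical names, and the folder layout is assumed from the path used above.

```python
from pathlib import Path

# Assumed layout: the notebook is run from the project root (Movie-Project)
DATA_DIR = Path('data') / 'raw' / 'zippedData'

def data_path(filename):
    """Return the path to a raw data file inside DATA_DIR."""
    return DATA_DIR / filename

bom_path = data_path('bom.movie_gross.csv.gz')
print(bom_path.as_posix())
```

`pd.read_csv` accepts a `Path` object directly, so `pd.read_csv(bom_path, compression='gzip')` would work unchanged.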

## 2. **The Numbers Data**

**Overview**: Covers production budgets along with domestic and worldwide gross revenue.

- **Shape**: 5,782 rows and 6 columns.

### Columns:
- **id**: Unique identifier for each movie (non-null).
- **release_date**: Movie release date (non-null).
- **movie**: Movie title (non-null).
- **production_budget**: Production budget (stored as strings with a `$` prefix and comma separators, requires conversion).
- **domestic_gross** and **worldwide_gross**: Revenue columns stored as strings in the same `$` and comma format.

### Key Issues:
- All revenue columns and budgets are in string format, requiring numeric conversion.
- No missing values, but the format needs cleaning for analysis.

---

In [3]:
# Load The Numbers (movie budgets) dataset
tn_budgets = pd.read_csv('C:/Users/USER/Desktop/Movie-Project/data/raw/zippedData/tn.movie_budgets.csv.gz', compression='gzip') 
print("The Numbers Data:")
print(tn_budgets.info())  # Get an overview of the dataset
display(tn_budgets.head())  # Display the first few rows

The Numbers Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB
None


Unnamed: 0,id,release_date,movie,production_budget,domestic_gross,worldwide_gross
0,1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
1,2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
2,3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
3,4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
4,5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


## 3. **Rotten Tomatoes Reviews Data**

**Overview**: Contains reviews, ratings, and publisher information for movies.

- **Shape**: 54,432 rows and 8 columns.

### Columns:
- **id**: Unique identifier for movies (non-null).
- **review**: Textual review (5,563 missing values).
- **rating**: Rating given by critics (13,517 missing values).
- **fresh**: Whether the review is "fresh" or "rotten" (non-null).
- **critic**: Name of the critic (2,722 missing values).
- **top_critic**: Binary flag for top critics (non-null).
- **publisher**: Publisher of the review (309 missing values).
- **date**: Date of the review (non-null).

### Key Issues:
- Missing values in the `review`, `rating`, `critic`, and `publisher` columns.
- Some columns may not directly impact the analysis depending on objectives.
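
Missing-value counts like the ones listed above can be produced with a small helper. This is a sketch on a hypothetical toy frame standing in for `rt_reviews`; `missing_summary` is an illustrative name, not part of the notebook.

```python
import pandas as pd

def missing_summary(df):
    """Return missing-value counts and percentages per column, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': (counts / len(df) * 100).round(2),
    })
    return summary.sort_values('missing', ascending=False)

# Hypothetical toy frame; the values are made up for illustration
toy = pd.DataFrame({
    'review': ['good', None, 'bad', None],
    'rating': ['3/5', None, None, None],
    'fresh': ['fresh', 'rotten', 'rotten', 'fresh'],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(rt_reviews)` after loading would give the same kind of table for the real data.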

---

In [4]:
# Load Rotten Tomatoes Reviews dataset
rt_reviews = pd.read_csv('C:/Users/USER/Desktop/Movie-Project/data/raw/zippedData/rt.reviews.tsv.gz', compression='gzip', sep='\t', encoding='latin-1') 
print("Rotten Tomatoes Reviews Data:")
print(rt_reviews.info())  # Get an overview of the dataset
display(rt_reviews.head(), "\n")  # Display the first few rows

Rotten Tomatoes Reviews Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54432 entries, 0 to 54431
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          54432 non-null  int64 
 1   review      48869 non-null  object
 2   rating      40915 non-null  object
 3   fresh       54432 non-null  object
 4   critic      51710 non-null  object
 5   top_critic  54432 non-null  int64 
 6   publisher   54123 non-null  object
 7   date        54432 non-null  object
dtypes: int64(2), object(6)
memory usage: 3.3+ MB
None


Unnamed: 0,id,review,rating,fresh,critic,top_critic,publisher,date
0,3,A distinctly gallows take on contemporary fina...,3/5,fresh,PJ Nabarro,0,Patrick Nabarro,"November 10, 2018"
1,3,It's an allegory in search of a meaning that n...,,rotten,Annalee Newitz,0,io9.com,"May 23, 2018"
2,3,... life lived in a bubble in financial dealin...,,fresh,Sean Axmaker,0,Stream on Demand,"January 4, 2018"
3,3,Continuing along a line introduced in last yea...,,fresh,Daniel Kasman,0,MUBI,"November 16, 2017"
4,3,... a perverse twist on neorealism...,,fresh,,0,Cinema Scope,"October 12, 2017"


'\n'

## 4. **Rotten Tomatoes Movie Info Data**

**Overview**: Provides additional metadata such as genres, directors, runtime, and box office data.

- **Shape**: 1,560 rows and 12 columns.

### Columns:
- **id**: Unique identifier (non-null).
- **synopsis**: Movie synopsis (62 missing values).
- **rating**: MPAA rating (3 missing values).
- **genre**: Movie genre (8 missing values).
- **director**: Director name (199 missing values).
- **writer**: Writer name (449 missing values).
- **theater_date**: Theater release date (359 missing values).
- **dvd_date**: DVD release date (359 missing values).
- **currency** and **box_office**: Currency type and box office earnings (1,220 missing values each; only 340 rows are populated).
- **runtime**: Runtime of the movie (30 missing values).
- **studio**: Studio responsible (1,066 missing values).

### Key Issues:
- High number of missing values in `studio`, `currency`, and `box_office`.
- Sparse data may limit the usability of certain columns in the analysis.
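
To make "sparse" concrete, the non-null counts reported by `rt_info.info()` can be turned into fill rates. The 50% cutoff below is an arbitrary assumption for illustration, not a rule used elsewhere in this notebook.

```python
# Non-null counts as reported by rt_info.info() for the 1,560-row dataset
TOTAL_ROWS = 1560
non_null = {
    'synopsis': 1498, 'rating': 1557, 'genre': 1552, 'director': 1361,
    'writer': 1111, 'theater_date': 1201, 'dvd_date': 1201,
    'currency': 340, 'box_office': 340, 'runtime': 1530, 'studio': 494,
}

THRESHOLD = 0.5  # assumed cutoff; adjust to the needs of the analysis

# Share of rows that actually hold a value, per column
fill_rate = {col: n / TOTAL_ROWS for col, n in non_null.items()}
sparse_columns = [col for col, rate in fill_rate.items() if rate < THRESHOLD]

print(sparse_columns)
```

On the DataFrame itself, `rt_info.notna().mean()` gives the same fill rates without typing the counts by hand.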

---

In [5]:
# Load Rotten Tomatoes Movie Info dataset
rt_info = pd.read_csv('C:/Users/USER/Desktop/Movie-Project/data/raw/zippedData/rt.movie_info.tsv.gz', compression='gzip', sep='\t') 
print("Rotten Tomatoes Movie Info Data:")
print(rt_info.info())  # Get an overview of the dataset
display(rt_info.head(), "\n")  # Display the first few rows

Rotten Tomatoes Movie Info Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1560 entries, 0 to 1559
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   id            1560 non-null   int64 
 1   synopsis      1498 non-null   object
 2   rating        1557 non-null   object
 3   genre         1552 non-null   object
 4   director      1361 non-null   object
 5   writer        1111 non-null   object
 6   theater_date  1201 non-null   object
 7   dvd_date      1201 non-null   object
 8   currency      340 non-null    object
 9   box_office    340 non-null    object
 10  runtime       1530 non-null   object
 11  studio        494 non-null    object
dtypes: int64(1), object(11)
memory usage: 146.4+ KB
None


Unnamed: 0,id,synopsis,rating,genre,director,writer,theater_date,dvd_date,currency,box_office,runtime,studio
0,1,"This gritty, fast-paced, and innovative police...",R,Action and Adventure|Classics|Drama,William Friedkin,Ernest Tidyman,"Oct 9, 1971","Sep 25, 2001",,,104 minutes,
1,3,"New York City, not-too-distant-future: Eric Pa...",R,Drama|Science Fiction and Fantasy,David Cronenberg,David Cronenberg|Don DeLillo,"Aug 17, 2012","Jan 1, 2013",$,600000.0,108 minutes,Entertainment One
2,5,Illeana Douglas delivers a superb performance ...,R,Drama|Musical and Performing Arts,Allison Anders,Allison Anders,"Sep 13, 1996","Apr 18, 2000",,,116 minutes,
3,6,Michael Douglas runs afoul of a treacherous su...,R,Drama|Mystery and Suspense,Barry Levinson,Paul Attanasio|Michael Crichton,"Dec 9, 1994","Aug 27, 1997",,,128 minutes,
4,7,,NR,Drama|Romance,Rodney Bennett,Giles Cooper,,,,,200 minutes,


'\n'

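## 5. **TMDB Dataset**

**Overview**: Movie metadata from TheMovieDB (`tmdb.movies.csv.gz`), including popularity and vote statistics.

- **Shape**: 26,517 rows and 10 columns.

### Columns:
- `Unnamed: 0`, `genre_ids`, `id`, `original_language`, `original_title`, `popularity`, `release_date`, `title`, `vote_average`, `vote_count` (all non-null).

### Key Issues:
- `Unnamed: 0` appears to be an unnecessary index column.
- `release_date` is stored as a string and likely needs conversion to datetime.

---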

In [6]:
# Load TMDB dataset
tmdb_movies = pd.read_csv('C:/Users/USER/Desktop/Movie-Project/data/raw/zippedData/tmdb.movies.csv.gz', compression='gzip') 
print("TheMovieDB Data:")
print(tmdb_movies.info())  # Get an overview of the dataset
display(tmdb_movies.head(), "\n")  # Display the first few rows

TheMovieDB Data:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB
None


Unnamed: 0.1,Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


'\n'

## **DATA CLEANING**

### 1. **Box Office Mojo Data**

In [7]:
# Replace missing 'studio' with 'Unknown'
bom_gross['studio'] = bom_gross['studio'].fillna('Unknown')

# Convert 'foreign_gross' to numeric: strip thousands separators and other
# non-numeric characters, but keep the decimal point so fractional values survive
bom_gross['foreign_gross'] = bom_gross['foreign_gross'].replace(r'[^0-9.]', '', regex=True).astype(float)
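
One thing to watch when stripping characters before a numeric conversion: a pattern that removes every non-digit, such as `[^0-9]`, also removes decimal points and silently changes the magnitude of the value. The sketch below compares two patterns on a hypothetical string; both helper names are illustrative.

```python
import re

def strip_non_digits(value):
    """Remove everything except digits (decimal points are lost)."""
    return float(re.sub(r'[^0-9]', '', value))

def strip_keep_decimal(value):
    """Remove everything except digits and the decimal point."""
    return float(re.sub(r'[^0-9.]', '', value))

# Hypothetical value with a thousands separator and a decimal point
raw = '1,131.6'
print(strip_non_digits(raw))    # 11316.0
print(strip_keep_decimal(raw))  # 1131.6
```

Whether the real column contains such values is worth checking before choosing a pattern.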


In [8]:
# Convert 'year' column to integer type, handling non-convertible values
bom_gross['year'] = pd.to_numeric(bom_gross['year'], errors='coerce').astype('Int64')

# Drop duplicate rows
bom_gross = bom_gross.drop_duplicates()



In [9]:
# Display the first few rows of the data
display(bom_gross.head())
# Check for missing values
print(bom_gross.isnull().sum())

bom_gross.info()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000.0,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000.0,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000.0,2010
3,Inception,WB,292600000.0,535700000.0,2010
4,Shrek Forever After,P/DW,238700000.0,513900000.0,2010


title                0
studio               0
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3387 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   float64
 4   year            3387 non-null   Int64  
dtypes: Int64(1), float64(2), object(2)
memory usage: 135.7+ KB


### 2. **The Numbers Data**

In [10]:
# Remove '$' and ',' from financial columns and convert them to numeric
for col in ['production_budget', 'domestic_gross', 'worldwide_gross']:
    tn_budgets[col] = tn_budgets[col].replace(r'[\$,]', '', regex=True).astype(float)

# Convert 'release_date' to datetime
tn_budgets['release_date'] = pd.to_datetime(tn_budgets['release_date'], errors='coerce')
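
Because `errors='coerce'` turns unparseable dates into `NaT` without raising, it is worth counting them after conversion. A small sketch using two rows from the `head()` output shown earlier:

```python
import pandas as pd

# Two rows copied from the head() output of The Numbers data
sample = pd.DataFrame({
    'release_date': ['Dec 18, 2009', 'May 20, 2011'],
    'production_budget': ['$425,000,000', '$410,600,000'],
})

# Same cleaning steps as in the cell above, applied to the sample
sample['production_budget'] = (
    sample['production_budget'].replace(r'[\$,]', '', regex=True).astype(float)
)
sample['release_date'] = pd.to_datetime(sample['release_date'], errors='coerce')

# Coerced failures would show up as NaT, so count them explicitly
failed_dates = int(sample['release_date'].isna().sum())
print(failed_dates)
print(sample.dtypes)
```

Running the same `isna().sum()` check on `tn_budgets['release_date']` confirms whether any real rows were coerced.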


In [11]:
tn_budgets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   id                 5782 non-null   int64         
 1   release_date       5782 non-null   datetime64[ns]
 2   movie              5782 non-null   object        
 3   production_budget  5782 non-null   float64       
 4   domestic_gross     5782 non-null   float64       
 5   worldwide_gross    5782 non-null   float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 271.2+ KB


### 3. **TMDB Dataset**

In [12]:
# Drop the unnecessary 'Unnamed: 0' column
tmdb_movies.drop(columns=['Unnamed: 0'], inplace=True)

# Convert 'release_date' to datetime
tmdb_movies['release_date'] = pd.to_datetime(tmdb_movies['release_date'], errors='coerce')
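
`genre_ids` has dtype `object` in the `info()` output and its values print like Python lists, which suggests they are stored as strings. If genre-level analysis is needed, one option is to parse them into real lists. `parse_genre_ids` is a hypothetical helper, not part of the notebook, and the string format is an assumption.

```python
import ast

def parse_genre_ids(raw):
    """Convert a string such as '[12, 14, 10751]' into a list of ints."""
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # Not a valid Python literal (or not a string at all)
        return []
    if not isinstance(parsed, list):
        return []
    return [int(g) for g in parsed]

print(parse_genre_ids('[12, 14, 10751]'))
print(parse_genre_ids('not a list'))
```

It could be applied with `tmdb_movies['genre_ids'].apply(parse_genre_ids)`.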


In [20]:
tmdb_movies.head()

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


### 4. **Rotten Tomatoes Reviews Data**

In [14]:
# Drop rows where `review` or `rating` is missing
rt_reviews.dropna(subset=['review', 'rating'], inplace=True)

# Fill missing `critic` and `publisher` with "Unknown"
rt_reviews['critic'] = rt_reviews['critic'].fillna('Unknown')
rt_reviews['publisher'] = rt_reviews['publisher'].fillna('Unknown')


# Convert `date` column to datetime
rt_reviews['date'] = pd.to_datetime(rt_reviews['date'], errors='coerce')





In [15]:
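# Convert Data Types
# Parse `rating` to extract the numerator of fractional scores (e.g., '3/5' -> 3.0).
# Ratings without a '/' become None.
def parse_rating(rating):
    try:
        return float(rating.split('/')[0]) if '/' in rating else None
    except (AttributeError, TypeError, ValueError):
        # Non-string input, or a numerator that is not a number
        return None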

rt_reviews['rating'] = rt_reviews['rating'].apply(parse_rating)
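
Keeping only the numerator means that '3/5' and '3/10' both become 3.0. If scores need to be comparable across scales, one option is to divide by the denominator instead. This is a sketch of that alternative, not the method used in this notebook; `parse_rating_fraction` is a hypothetical name.

```python
def parse_rating_fraction(rating):
    """Return numerator/denominator for ratings like '3/5', else None."""
    if not isinstance(rating, str) or '/' not in rating:
        return None
    numerator, _, denominator = rating.partition('/')
    try:
        num, den = float(numerator), float(denominator)
    except ValueError:
        return None
    # Guard against a zero or negative denominator
    return num / den if den > 0 else None

print(parse_rating_fraction('3/5'))
print(parse_rating_fraction('B+'))
```

The result is a value between 0 and 1 for well-formed ratings, and `None` for anything that is not a fraction.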

In [16]:
# Simplify `fresh` to binary (1 = fresh, 0 = anything else)
rt_reviews['fresh'] = (rt_reviews['fresh'] == 'fresh').astype('int64')


In [17]:
# Remove Duplicates
rt_reviews.drop_duplicates(inplace=True)

In [18]:
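# Rename columns to more descriptive names (they are already snake_case)
# Note: `id` identifies the movie, so `review_id` is shared by all reviews of the same movie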
rt_reviews.rename(columns={
    'id': 'review_id',
    'review': 'review_text',
    'rating': 'rating_score',
    'fresh': 'is_fresh',
    'critic': 'critic_name',
    'top_critic': 'is_top_critic',
    'publisher': 'publisher_name',
    'date': 'review_date'
}, inplace=True)

In [19]:
print("Rotten Tomatoes Reviews Data:")
print(rt_reviews.info())  # Get an overview of the dataset
display(rt_reviews.head(), "\n")  # Display the first few rows

Rotten Tomatoes Reviews Data:
<class 'pandas.core.frame.DataFrame'>
Index: 35378 entries, 0 to 54424
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   review_id       35378 non-null  int64         
 1   review_text     35378 non-null  object        
 2   rating_score    28760 non-null  float64       
 3   is_fresh        35378 non-null  int64         
 4   critic_name     35378 non-null  object        
 5   is_top_critic   35378 non-null  int64         
 6   publisher_name  35378 non-null  object        
 7   review_date     35378 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(3), object(3)
memory usage: 2.4+ MB
None


Unnamed: 0,review_id,review_text,rating_score,is_fresh,critic_name,is_top_critic,publisher_name,review_date
0,3,A distinctly gallows take on contemporary fina...,3.0,1,PJ Nabarro,0,Patrick Nabarro,2018-11-10
6,3,"Quickly grows repetitive and tiresome, meander...",,0,Eric D. Snider,0,EricDSnider.com,2013-07-17
7,3,Cronenberg is not a director to be daunted by ...,2.0,0,Matt Kelemen,0,Las Vegas CityLife,2013-04-21
11,3,"While not one of Cronenberg's stronger films, ...",,1,Emanuel Levy,0,EmanuelLevy.Com,2013-02-03
12,3,Robert Pattinson works mighty hard to make Cos...,2.0,0,Christian Toto,0,Big Hollywood,2013-01-15


'\n'