# INFO 2950 Phase 2

## Research question

What are the key factors influencing a movie's box office success, and how do these factors differ across various genres? 

## Data collection and cleaning

In [1]:
import numpy as np
import pandas as pd
import duckdb
import matplotlib.pyplot as plt
import seaborn as sns
import ast

Datasets: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset

In [2]:
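# Create data frames from CSV files
# low_memory=False reads each column in a single pass, which avoids the mixed-type DtypeWarning
movies_data = pd.read_csv("data/movies_metadata.csv", low_memory=False)
ratings_data = pd.read_csv("data/ratings_small.csv")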


In [3]:
print(movies_data.shape)
movies_data.head()

(45466, 24)


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [4]:
print(ratings_data.shape)
ratings_data.head()

(100004, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


### Clean Data

In [5]:
# Drop irrelevant columns
movies_data.drop(columns=['homepage', 'tagline', 'overview', 'poster_path'], inplace=True)

# Convert release dates to datetime object
print(movies_data['release_date'].dtype)
movies_data['release_date'] = pd.to_datetime(movies_data['release_date'], format='mixed', errors='coerce')
print(movies_data['release_date'].dtype)

# Convert id from object to float; ids that cannot be parsed become NaN
print(movies_data['id'].dtype)
movies_data['id'] = pd.to_numeric(movies_data['id'], errors='coerce')
print(movies_data['id'].dtype)

object
datetime64[ns]
object
float64


In [6]:
# Drop rows with missing data
movies_data.dropna(inplace=True)
print(movies_data.shape)

(4477, 20)
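
The row count falls from 45,466 to 4,477 here because `dropna()` removes a row if any column is missing. The previews above show `NaN` in `belongs_to_collection` for several movies, so a sparse column like that may account for much of the loss. The sketch below, on a made-up toy frame, shows one way to check which columns drive the loss and to drop rows only on the columns the analysis needs. The `needed_columns` list is a hypothetical choice, not a decision the project has made.

```python
import numpy as np
import pandas as pd

# Toy stand-in for movies_data; the values are made up for illustration.
toy_movies = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "belongs_to_collection": [np.nan, "{'id': 1}", np.nan, np.nan],
    "revenue": [100.0, 200.0, np.nan, 50.0],
    "runtime": [90.0, 101.0, 88.0, np.nan],
})

# Count missing values per column to see which column drives the row loss.
missing_counts = toy_movies.isna().sum().sort_values(ascending=False)
print(missing_counts)

# Dropping on every column keeps only rows with no gaps anywhere.
strict = toy_movies.dropna()

# Dropping only on the columns the analysis needs keeps more rows.
needed_columns = ["revenue", "runtime"]  # hypothetical choice
relaxed = toy_movies.dropna(subset=needed_columns)

print(strict.shape, relaxed.shape)
```

On the real data, the same `isna().sum()` call on `movies_data` before the drop would show whether a single column is responsible for most of the removed rows.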


In [7]:
# Find average of ratings for each movie
ratings_data = duckdb.sql("SELECT movieId, AVG(rating) AS average_rating FROM ratings_data GROUP BY movieId").df()
ratings_data.head()

Unnamed: 0,movieId,average_rating
0,1343,3.74359
1,3671,3.935484
2,50,4.370647
3,150,3.9025
4,266,3.486301


### Create Joined Dataset

In [8]:
# Merge the data frames
data = duckdb.sql("SELECT * FROM movies_data INNER JOIN ratings_data ON movies_data.id = ratings_data.movieId").df()
print(data.shape)
data.head()

(538, 22)


Unnamed: 0,adult,belongs_to_collection,budget,genres,id,imdb_id,original_language,original_title,popularity,production_companies,...,revenue,runtime,spoken_languages,status,title,video,vote_average,vote_count,movieId,average_rating
0,False,"{'id': 93295, 'name': '48 Hrs. Collection', 'p...",12000000,"[{'id': 53, 'name': 'Thriller'}, {'id': 28, 'n...",150.0,tt0083511,en,48 Hrs.,15.297121,"[{'name': 'Paramount Pictures', 'id': 4}]",...,78868508.0,96.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,48 Hrs.,False,6.5,364.0,150,3.9025
1,False,"{'id': 528, 'name': 'The Terminator Collection...",200000000,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",296.0,tt0181852,en,Terminator 3: Rise of the Machines,20.818907,"[{'name': 'Columbia Pictures', 'id': 5}, {'nam...",...,435000000.0,109.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Terminator 3: Rise of the Machines,False,5.9,2177.0,296,4.256173
2,False,"{'id': 300546, 'name': 'Once were Warriors Col...",0,"[{'id': 18, 'name': 'Drama'}]",527.0,tt0110729,en,Once Were Warriors,4.025276,"[{'name': 'Avalon Studios', 'id': 293}, {'name...",...,2201126.0,99.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Once Were Warriors,False,7.6,106.0,527,4.303279
3,False,"{'id': 295862, 'name': 'The BRD Collection', '...",0,"[{'id': 10752, 'name': 'War'}, {'id': 18, 'nam...",661.0,tt0079095,de,Die Ehe der Maria Braun,2.617272,"[{'name': 'Westdeutscher Rundfunk (WDR)', 'id'...",...,0.0,116.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,The Marriage of Maria Braun,False,7.1,39.0,661,3.632075
4,False,"{'id': 123720, 'name': 'Frankenstein (Hammer S...",0,"[{'id': 27, 'name': 'Horror'}, {'id': 878, 'na...",3104.0,tt0061683,en,Frankenstein Created Woman,2.302582,"[{'name': 'Hammer Film Productions', 'id': 1314}]",...,0.0,92.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Frankenstein Created Woman,False,5.9,33.0,3104,3.965517
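
One caveat about the join above: it matches `movies_data.id` directly against `ratings_data.movieId`. As far as we can tell from the dataset's Kaggle page, the ratings files use MovieLens movie IDs while `id` in `movies_metadata.csv` is a TMDB ID, and the dataset includes a links file that maps one to the other. If that is right, the direct join can pair a movie with another movie's ratings. The sketch below shows a two-step join through such a mapping on toy data. The column names `movieId` and `tmdbId` for the links file are assumptions to verify against the actual file, and all rating values are made up.

```python
import pandas as pd

# Toy stand-ins; the rating values and movieId values are illustrative only.
toy_movies = pd.DataFrame({
    "id": [862.0, 8844.0],
    "title": ["Toy Story", "Jumanji"],
})
toy_ratings = pd.DataFrame({
    "movieId": [1, 2, 3],
    "average_rating": [3.9, 3.4, 3.2],
})
# Assumed layout of the links file: one row per MovieLens movieId with its TMDB id.
toy_links = pd.DataFrame({
    "movieId": [1, 2, 3],
    "tmdbId": [862.0, 8844.0, 15602.0],
})

# Step 1: attach the TMDB id to each averaged rating.
ratings_with_tmdb = toy_ratings.merge(toy_links, on="movieId", how="inner")

# Step 2: join to the movie metadata on the TMDB id instead of the MovieLens id.
joined = toy_movies.merge(
    ratings_with_tmdb, left_on="id", right_on="tmdbId", how="inner"
)
print(joined[["title", "movieId", "average_rating"]])
```

The same two joins could be written in the DuckDB query used above by adding the links table as a third relation.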


### Data description

Have an initial draft of your data description section (details below). Your data description should be about your analysis-ready data.

- What are the observations (rows) and the attributes (columns)?
- Why was this dataset created?
- Who funded the creation of the dataset?
- What processes might have influenced what data was observed and recorded and what was not?
- What preprocessing was done, and how did the data come to be in the form that you are using?
- If people are involved, were they aware of the data collection and if so, what purpose did they expect the data to be used for?
- Where can your raw source data be found, if applicable? Provide a link to the raw data (hosted on Github, in a Cornell Google Drive or Cornell Box)

The raw `movies_metadata.csv` file contains metadata for 45,466 movies in 24 columns, with each row representing one movie. After cleaning and joining, our analysis-ready data has 538 rows and 22 columns. Attributes include the movie's title and original title, original language, release date, runtime, genre(s), collection membership, production companies, budget, revenue, popularity, vote average, vote count, and the average user rating that we computed from the ratings file (`average_rating`).

This dataset was created to provide comprehensive metadata that facilitates analyses related to film performance, audience reception, and industry trends, making it a valuable resource for researchers, marketers, and film enthusiasts. The datasets were created by two individuals and posted for public use on Kaggle, where they have been downloaded over 300,000 times. The raw data is available at the Kaggle link given under "Data collection and cleaning".

Preprocessing so far consists of the following steps. We dropped four columns that we do not need (`homepage`, `tagline`, `overview`, `poster_path`). We converted `release_date` to a datetime and `id` to a numeric type, with unparseable values becoming missing. We then dropped every row with a missing value in any remaining column, which reduced the data from 45,466 to 4,477 movies. For the ratings file, we averaged the ratings for each `movieId`, then inner joined the result to the movie metadata on `id = movieId`, leaving 538 movies. The `genres` column is still stored as a string-encoded list of genres, so splitting multiple genres into separate entries remains to be done before any genre-level analysis.
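
Because the research question compares genres, the `genres` column needs to be split into one entry per genre. The previews above show it stored as a string that looks like a Python list of dictionaries. A minimal sketch of one way to parse it, assuming `ast.literal_eval` can read every value and using hypothetical movie titles, is:

```python
import ast
import pandas as pd

# Toy rows in the same string-encoded format as the genres column shown above.
# The titles are hypothetical.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": [
        "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
        "[{'id': 35, 'name': 'Comedy'}]",
    ],
})

def genre_names(raw):
    """Parse one genres string and return the list of genre names."""
    return [entry["name"] for entry in ast.literal_eval(raw)]

toy["genre"] = toy["genres"].apply(genre_names)

# explode() produces one row per (movie, genre) pair.
by_genre = toy.explode("genre")[["title", "genre"]].reset_index(drop=True)
print(by_genre)
```

A movie with several genres then appears once per genre, so genre-level averages will count such movies more than once. That is a design choice worth stating when results are reported.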

## Data limitations

The dataset that we use will inherently contain bias because movie ratings are user-generated and subjective. Well-known movies may attract more votes, while lesser-known films may receive only a few ratings, which makes their average rating less reliable. There could also be temporal bias: older movies may have fewer ratings, or be perceived differently, compared to newer films that have had more exposure. Genre classifications can be subjective, leading to inconsistencies in how movies are categorized. Ratings can differ significantly by region or demographic, which is not captured in the dataset. Finally, movie review systems like IMDb are constantly updated, so the values in our snapshot may no longer match the current ones, and files collected at different times may not be synchronized with each other.
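
A further limitation is visible in the previews above: several movies have a `budget` or `revenue` of 0, which most likely means the value was not recorded, not that the film cost or earned nothing. Since revenue is our measure of box office success, these rows would distort any average. The sketch below shows one possible treatment on made-up values: convert zeros to missing and keep only rows where both figures are present. It assumes `budget` may have been read as text, which should be checked with `movies_data['budget'].dtype`.

```python
import numpy as np
import pandas as pd

# Toy stand-in; titles are hypothetical and values are illustrative.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "budget": ["30000000", "0", "16000000"],  # assumed to be read as text
    "revenue": [373554033.0, 0.0, 81452156.0],
})

# Convert budget to a number; values that cannot be parsed become NaN.
toy["budget"] = pd.to_numeric(toy["budget"], errors="coerce")

# Treat a recorded 0 as "not recorded" in both money columns.
for col in ["budget", "revenue"]:
    toy[col] = toy[col].replace(0, np.nan)

# Keep only movies where both figures are available.
usable = toy.dropna(subset=["budget", "revenue"])
print(usable)
```

Whether to drop these rows or keep them with missing values depends on the analysis; the number of rows affected should be reported in either case.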

## Exploratory data analysis

## Questions for reviewers

1. Do you find the research question(s) clearly stated? Are they specific enough to guide our analysis?
2. Do you think the research question(s) are complex enough to yield interesting insights? If not, what suggestions do you have for enhancing them?
3. Do you see any potential issues with the data quality or completeness that we should address before proceeding?
4. Do you feel that the data cleaning process is comprehensive enough for the analyses we plan to conduct? Are there any areas you think require more attention?
5. Do you have any suggestions for best practices in data cleaning that we may have overlooked?