## Exploratory Data Analysis for Microsoft's New Movie Studio

## Overview

In this project, we perform exploratory data analysis (EDA) to help Microsoft decide what types of films 
to produce for its new movie studio. We analyze trends across several movie datasets and turn the 
findings into actionable recommendations.


## Business Problem

Microsoft sees all the big companies creating original video content and wants to get in on the fun. 
It has decided to create a new movie studio but has no experience making movies. Our task is to explore 
which types of films are currently performing best at the box office and translate those findings into actionable insights 
for Microsoft's new movie studio.

## Objectives

The objectives of this project are to:
1. Identify the top-performing movie genres at the box office by analyzing historical box office data.
2. Determine the relationship between movie ratings and box office success to recommend the ideal quality standard for new films.
3. Analyze production budgets and their impact on box office returns to suggest optimal budget ranges for new movie projects.

## Loading Data

In [2]:
import pandas as pd 
import sqlite3

In [3]:
bom_movie_gross = pd.read_csv('data/bom.movie_gross.csv')
rt_movie_info = pd.read_csv('data/rt.movie_info.tsv', sep='\t')
rt_reviews = pd.read_csv('data/rt.reviews.tsv', sep='\t', encoding='ISO-8859-1')
tmdb_movies = pd.read_csv('data/tmdb.movies.csv')
tn_movie_budgets = pd.read_csv('data/tn.movie_budgets.csv')

conn = sqlite3.connect('data/im.db')
imdb_movie_basics = pd.read_sql_query("SELECT * FROM movie_basics", conn)
imdb_movie_ratings = pd.read_sql_query("SELECT * FROM movie_ratings", conn)

Display the first few rows of each dataframe:

In [12]:

print("Movie Gross Data:")
print(bom_movie_gross.head())
print("-------------------------------------------------------------")

print("Movie Info Data:")
print(rt_movie_info.head())
print("-------------------------------------------------------------")

print("Reviews Data:")
print(rt_reviews.head())
print("-------------------------------------------------------------")

print("TMDB Movies Data:")
print(tmdb_movies.head())
print("-------------------------------------------------------------")

print("Movie Budgets Data:")
print(tn_movie_budgets.head())
print("-------------------------------------------------------------")

print("Movie Basics Data:")
print(imdb_movie_basics.head())
print("-------------------------------------------------------------")

print("Movie Ratings Data:")
print(imdb_movie_ratings.head())

Movie Gross Data:
                                         title studio  domestic_gross  \
0                                  Toy Story 3     BV     415000000.0   
1                   Alice in Wonderland (2010)     BV     334200000.0   
2  Harry Potter and the Deathly Hallows Part 1     WB     296000000.0   
3                                    Inception     WB     292600000.0   
4                          Shrek Forever After   P/DW     238700000.0   

  foreign_gross  year  
0     652000000  2010  
1     691300000  2010  
2     664300000  2010  
3     535700000  2010  
4     513900000  2010  
-------------------------------------------------------------
Movie Info Data:
   id                                           synopsis rating  \
0   1  This gritty, fast-paced, and innovative police...      R   
1   3  New York City, not-too-distant-future: Eric Pa...      R   
2   5  Illeana Douglas delivers a superb performance ...      R   
3   6  Michael Douglas runs afoul of a treacherous s

Upon initial examination of the dataframes, the following datasets were selected as they contain the necessary information to meet the project objectives:

- bom_movie_gross
- tn_movie_budgets
- imdb_movie_basics
- imdb_movie_ratings
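
The per-dataset checks in this section (`info()`, `isna().sum()`, `duplicated().sum()`) can also be collected into one comparison table. This is a rough sketch using small toy frames rather than the project data; the helper name `summarize` and the frames `toy_gross` and `toy_ratings` are hypothetical.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Return basic data-quality metrics for one dataframe."""
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "null_cells": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Toy stand-ins for the selected datasets (values are made up for illustration)
toy_frames = {
    "toy_gross": pd.DataFrame(
        {"title": ["A", "B", "B", "C"], "domestic_gross": [1.0, 2.0, 2.0, None]}
    ),
    "toy_ratings": pd.DataFrame(
        {"movie_id": ["tt1", "tt2"], "averagerating": [7.1, 6.4]}
    ),
}

# One row per dataframe, one column per metric
summary = pd.DataFrame({name: summarize(df) for name, df in toy_frames.items()}).T
print(summary)
```

Replacing `toy_frames` with a dictionary of the four selected dataframes would give the same overview for the real data.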

#### Inspecting the selected datasets 

#### 1. bom_movie_gross

In [13]:
# Examine the row and column counts, non-null counts, and dtypes of the dataframe
bom_movie_gross.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [15]:
# Checking for null values
bom_movie_gross.isna().sum()

title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64

In [16]:
# Checking for duplicate rows
bom_movie_gross.duplicated().sum()

0
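
Two things stand out for `bom_movie_gross`: `foreign_gross` is missing in 1350 of 3387 rows (about 39.9%), and it is stored as `object` rather than a numeric type. The sketch below shows one way to compute a null share and coerce such a column to numbers. It assumes the `object` dtype comes from thousands separators in some values, which the output above does not confirm; the toy values are made up.

```python
import pandas as pd

# Toy stand-in; the comma-formatted value is an assumption about the real column
toy = pd.DataFrame({"foreign_gross": ["652000000", "1,131.6", None]})

# Fraction of missing values in the column
null_share = toy["foreign_gross"].isna().mean()

# Strip thousands separators, then convert; unparseable values become NaN
cleaned = toy["foreign_gross"].str.replace(",", "", regex=False)
toy["foreign_gross_num"] = pd.to_numeric(cleaned, errors="coerce")

print(f"null share: {null_share:.1%}")
print(toy)
```

Using `errors="coerce"` keeps the conversion from failing on unexpected strings, at the cost of silently turning them into missing values, so the null count is worth re-checking afterwards.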

#### 2. tn_movie_budgets

In [17]:
# Examine the row and column counts, non-null counts, and dtypes of the dataframe
tn_movie_budgets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In [18]:
# Checking for null values
tn_movie_budgets.isna().sum()

id                   0
release_date         0
movie                0
production_budget    0
domestic_gross       0
worldwide_gross      0
dtype: int64

In [19]:
# Checking for duplicate rows
tn_movie_budgets.duplicated().sum()

0
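
Although `tn_movie_budgets` has no nulls, `production_budget`, `domestic_gross` and `worldwide_gross` are all `object` columns, so they cannot be used in arithmetic as they are. A minimal sketch of a conversion follows, assuming the values are currency strings such as `$1,000`; that format is an assumption, since the values are not visible in the output above.

```python
import pandas as pd

# Toy stand-in with assumed currency formatting (made-up values)
toy_budgets = pd.DataFrame(
    {
        "production_budget": ["$1,000", "$250"],
        "worldwide_gross": ["$3,500", "$0"],
    }
)

money_cols = ["production_budget", "worldwide_gross"]
for col in money_cols:
    # Remove dollar signs and commas, then cast to integers
    toy_budgets[col] = (
        toy_budgets[col].str.replace(r"[$,]", "", regex=True).astype("int64")
    )

print(toy_budgets.dtypes)
print(toy_budgets)
```

Casting with `astype("int64")` raises an error on any value that is still not a clean integer, which makes formatting surprises visible immediately.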

#### 3. imdb_movie_basics

In [22]:
# Examine the row and column counts, non-null counts, and dtypes of the dataframe
imdb_movie_basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146144 entries, 0 to 146143
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   movie_id         146144 non-null  object 
 1   primary_title    146144 non-null  object 
 2   original_title   146123 non-null  object 
 3   start_year       146144 non-null  int64  
 4   runtime_minutes  114405 non-null  float64
 5   genres           140736 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 6.7+ MB


In [21]:
# Checking for null values
imdb_movie_basics.isna().sum()

movie_id               0
primary_title          0
original_title        21
start_year             0
runtime_minutes    31739
genres              5408
dtype: int64

In [23]:
# Checking for duplicate rows
imdb_movie_basics.duplicated().sum()

0
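
Called without arguments, `duplicated()` flags only rows that match on every column. If `movie_id` is unique per row (an assumption here), a result of 0 does not rule out the same film appearing more than once under different IDs. The sketch below checks a subset of columns instead, using made-up rows; treating `primary_title` plus `start_year` as the identifying key is an illustrative choice.

```python
import pandas as pd

# Toy stand-in: two rows share a title and year but have different IDs
toy_basics = pd.DataFrame(
    {
        "movie_id": ["tt1", "tt2", "tt3"],
        "primary_title": ["Alpha", "Alpha", "Beta"],
        "start_year": [2015, 2015, 2016],
    }
)

# Full-row comparison: every row is distinct because movie_id differs
full_row_dupes = int(toy_basics.duplicated().sum())

# Subset comparison: keep=False marks every member of a duplicate group
key_cols = ["primary_title", "start_year"]
subset_mask = toy_basics.duplicated(subset=key_cols, keep=False)
subset_dupes = int(subset_mask.sum())

print(full_row_dupes, subset_dupes)
print(toy_basics[subset_mask])
```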

#### 4. imdb_movie_ratings

In [24]:
# Examine the row and column counts, non-null counts, and dtypes of the dataframe
imdb_movie_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73856 entries, 0 to 73855
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movie_id       73856 non-null  object 
 1   averagerating  73856 non-null  float64
 2   numvotes       73856 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 1.7+ MB


In [25]:
# Checking for null values
imdb_movie_ratings.isna().sum()

movie_id         0
averagerating    0
numvotes         0
dtype: int64

In [26]:
# Checking for duplicate rows
imdb_movie_ratings.duplicated().sum()

0
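
`imdb_movie_ratings` has 73856 rows against 146144 in `imdb_movie_basics`, so many titles will have no rating once the two tables are joined on `movie_id`. One way to measure that coverage is a left merge with `indicator=True`, sketched here on made-up rows; `toy_basics` and `toy_ratings` are hypothetical stand-ins.

```python
import pandas as pd

# Toy stand-ins sharing the movie_id key (made-up values)
toy_basics = pd.DataFrame(
    {"movie_id": ["tt1", "tt2", "tt3"], "genres": ["Drama", None, "Action,Comedy"]}
)
toy_ratings = pd.DataFrame(
    {"movie_id": ["tt1", "tt3"], "averagerating": [7.0, 5.5], "numvotes": [100, 20]}
)

# indicator=True adds a _merge column: 'both' or 'left_only' for a left join
merged = toy_basics.merge(toy_ratings, on="movie_id", how="left", indicator=True)

# Share of titles that found a matching rating
coverage = (merged["_merge"] == "both").mean()

print(merged)
print(f"rated share: {coverage:.1%}")
```

A left join keeps every title from the basics table, which makes the unrated share explicit instead of dropping those rows silently as an inner join would.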

## Data Cleaning