## Final Project Submission

Please fill out:
* Student name: `Ilyas Suleman Bourzat`
* Student pace: part time
* Scheduled project review date/time: Monday June 3 at 2.59pm
* Instructor name: Samuel Karu, Samuel G Mwangi, Winnie Anyoso
* Blog post URL:


In [None]:
# Add image of Microsoft 

# From Data to Movies: Guiding Microsoft's Entry into the Movie Industry
___

## Overview

___

This project aims to provide strategic insights to Microsoft's new movie studio through an in-depth Exploratory Data Analysis (EDA). We examine various datasets from [IMDB](https://www.imdb.com/) and [Box Office Mojo](https://www.boxofficemojo.com/) to identify Box Office trends and impactful movie performance metrics. The analysis focuses on what types of films are highly successful and translates these findings into actionable recommendations for Microsoft's film production strategy.

## Introduction

___

In today's highly competitive entertainment industry, companies continuously aim to create captivating content while also delivering high financial returns. Microsoft has been inspired by the success of other major companies, such as Amazon, in creating original films and plans to establish its own movie studio. However, entering the film production industry poses a significant challenge, since it requires a comprehensive understanding of which movies not only resonate with the majority of audiences around the world but also perform well financially.

This project aims to address this challenge by leveraging EDA to analyse various movie datasets. This will help us identify current Box Office trends, movie performance metrics and audience preferences. The insights derived from this data will provide Microsoft's new movie studio with actionable recommendations on what types of films to create.


The main stakeholders of this project include the head of Microsoft's new movie studio, the executives, the marketing team, the film production team, investors, financial analysts, and the data science and analytics team. These stakeholders will use the project's findings to make data-driven decisions on movie production, marketing strategies, and financial investments, ensuring the new studio succeeds in this competitive space.

## Business Goal

___

The business goal of this project is to provide actionable insights to Microsoft's new movie studio to guide their production strategy. Specifically, the project aims to identify the types of films that are currently performing well at the box office, allowing the studio to make data-driven decisions on which genres, themes, and characteristics of films to focus on. This will help ensure the studio's success in the competitive film industry by maximizing their chances of producing commercially successful movies.

## Data Understanding

___

The movie datasets used in this project were obtained from:
1. [IMDB](https://www.imdb.com/)
    * `IMDB` is a reputable extensive database for movies, TV shows and celebrities. It has been offering detailed information for over 3 decades and remains an authority in the movie space. The `im.db` file is an SQL database containing information about movies, cast and more that are essential to determine various factors which contribute to a movie's success in this project.
    
2. [Box Office Mojo](https://www.boxofficemojo.com/)
    * `Box Office Mojo` is a website that tracks revenue generated by Box Office movies and provides detailed statistics about them. It is mainly used to analyze and track the financial success of films. Our `bom.movie_gross.csv` file provides critical information that will help assess the financial performance of films. This will help the new studio head, company executives and investors determine what types of films to invest in.

3. [Rotten Tomatoes](https://www.rottentomatoes.com/)
    * `Rotten Tomatoes` is a popular movie site that collects and aggregates reviews from movie critics and audiences. It provides a score known as the "Tomatometer", which summarises how favourably critics reviewed a film. Both the `rt.reviews.tsv` and the `rt.movie_info.tsv` files will be used to gauge opinion of and satisfaction with various types of films. This will help in identifying patterns in what types of movies are well rated and in tailoring production in that direction.

4. [TheMovieDB](https://www.themoviedb.org/)
    * `The Movie Database` is a database driven by users and provides comprehensive metadata on films including genres, crew, cast and production details. The `tmdb.movies.csv` file allows us to utilise the information from this database that will in turn allow us to uncover meaningful insights about genres and release dates which will lead to movie success for the studio.


5. [The Numbers](https://www.the-numbers.com/)
    * `The Numbers` is a resource for movie industry data, offering detailed information about box office performance, movie budgets and more. The `tn.movie_budgets.csv` file contains information about the financial performance of movies that will help us understand the economic aspects of movie production.

### Importing relevant libraries

In [13]:
# Importing pandas, numpy, matplotlib with their alias 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Importing chardet to find out type of encoding
import chardet

# Importing sqlite3 to connect to database
import sqlite3

# Adding magic command
%matplotlib inline

### Importing the datasets
We should first start by loading all the relevant datasets into the notebook so we can analyze them. We also need to connect to the `im.db` database. 

In [14]:
# Determining the file encoding for the tsv files
with open('Data/rt.reviews.tsv', 'rb') as f:
    result = chardet.detect(f.read())
    print(result)

{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}


Determining the encoding was a key step, since loading the tsv files repeatedly raised an encoding error until the file encoding was specified explicitly.
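Because `chardet` reported a confidence of only 0.73, it can be worth double-checking a detected encoding by actually decoding the file. The following is a minimal sketch using only the standard library; the helper name `first_working_encoding` and the sample file are hypothetical and not part of the project. Note that `ISO-8859-1` can decode any byte sequence, so it works as a last-resort fallback rather than as proof that the guess is right.

```python
from pathlib import Path
import tempfile


def first_working_encoding(path, candidates=("utf-8", "ISO-8859-1")):
    """Return the first candidate encoding that decodes the whole file, or None."""
    raw = Path(path).read_bytes()
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            # This candidate cannot decode the file; try the next one
            continue
    return None


# Demo on a throwaway file that contains a byte which is invalid in UTF-8
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.tsv"
    sample_path.write_bytes("id\treview\n1\tcaf\xe9\n".encode("ISO-8859-1"))
    detected = first_working_encoding(sample_path)

print(detected)
```

Listing `utf-8` first means the stricter encoding is tried before the permissive fallback.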

In [19]:
# Importing all datasets
# Importing csv files
box_office_movie_gross = pd.read_csv("Data/bom.movie_gross.csv")
numbers_movie_budgets = pd.read_csv("Data/tn.movie_budgets.csv")
tmdb_movies_dataset = pd.read_csv("Data/tmdb.movies.csv")

# Importing tsv files
rtn_tomatoes_mv_reviews = pd.read_csv("Data/rt.reviews.tsv", sep='\t', encoding='ISO-8859-1')
rtn_tomatoes_mv_info = pd.read_csv("Data/rt.movie_info.tsv", sep='\t', encoding='ISO-8859-1')

# Instantiate a connection to the im.db database
conn = sqlite3.connect("Data/im.db")
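Once a connection exists, a common first step is to list the tables the database contains. The sketch below shows one way to do this by querying SQLite's built-in `sqlite_master` catalogue. It runs against an in-memory database with made-up table names (`demo_movies`, `demo_ratings`), since the actual tables in `im.db` are not shown here; the helper `list_tables` is likewise hypothetical.

```python
import sqlite3


def list_tables(connection):
    """Return the names of all tables in a SQLite database, sorted alphabetically."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name;"
    ).fetchall()
    return [name for (name,) in rows]


# Demo on an in-memory database; the table names are placeholders
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_movies (movie_id TEXT, title TEXT)")
demo_conn.execute("CREATE TABLE demo_ratings (movie_id TEXT, rating REAL)")

tables = list_tables(demo_conn)
print(tables)

demo_conn.close()
```

The same helper could be called with the `conn` object created above to see what `im.db` actually holds.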

### Overview of the datasets

In this section, the aim is to comprehensively understand the data we are working with. There are several key steps in this process: 
1. **Statistical Analysis** - Here we will summarize the main characteristics of the data such as the mean, median and standard deviation.
2. **Dataset size** - Here we will determine the size of our datasets by looking at how many records they contain.
3. **Feature Inclusion** - We will go through the available features in the dataset in order to determine which ones will be relevant for our goal.
4. **Identifying Limitations** - Identifying any limitations that may have implications on the project in the future.

We will use various pandas methods such as `.describe()` and `.info()` to gain some insights on what type of data we are dealing with. We will also find the size of the datasets and features we would potentially want to include in our analysis and why.
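The size, missing-value and data-type checks described above can also be gathered in one place. The following is a rough sketch of such a helper; the function name `overview` and the small demo DataFrame are assumptions made for illustration and are not part of the project data.

```python
import pandas as pd


def overview(df):
    """Collect basic facts about a DataFrame in one dictionary."""
    return {
        "n_rows": df.shape[0],
        "n_columns": df.shape[1],
        # Number of missing values in each column
        "missing_per_column": df.isna().sum().to_dict(),
        # Data type of each column, as a readable string
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Demo on a tiny made-up DataFrame
demo = pd.DataFrame({"title": ["A", "B", "C"], "gross": [1.5, None, 3.0]})
summary = overview(demo)
print(summary)
```

Calling `overview(box_office_movie_gross)` would return the same kind of summary for one of the loaded datasets.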

#### Overview of the box_office_movie_gross DataFrame
##### 1. Statistical Analysis

In [30]:
# Viewing the first 5 records of our DataFrame
box_office_movie_gross.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010


In [32]:
# Using .describe() and .info() on our data to get a statistical overview
box_office_movie_gross.describe()

Unnamed: 0,domestic_gross,year
count,3359.0,3387.0
mean,28745850.0,2013.958075
std,66982500.0,2.478141
min,100.0,2010.0
25%,120000.0,2012.0
50%,1400000.0,2014.0
75%,27900000.0,2016.0
max,936700000.0,2018.0


In [33]:
box_office_movie_gross.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


We can see that the `box_office_movie_gross` DataFrame contains some missing values and that `foreign_gross` is of type object instead of float or integer, which is why `.describe()` returns no statistics for this feature. This will be sorted out later during data preparation.

##### 2. Dataset size
This DataFrame contains `3387` records. This is a fair amount of data we can use for our analysis.

##### 3. Feature Inclusion
   * Features included: `title`, `domestic_gross`, `foreign_gross`, `year`
   * Reasons: 
       * **The Gross Earnings**: This helps us measure financial success of the movies.
       * **Title**: This will help us in identifying which movies had financial success and ones that didn't.
       * **Year**: Knowing when the movies were released helps identify patterns between earnings and release year.

##### 4. Identifying Limitations
In this dataset, the only noticeable limitations are the missing values in some of the features (`domestic_gross`, `foreign_gross`), the `foreign_gross` feature being of type object instead of float, and the `year` column being of type int64 instead of datetime.
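As an illustration of how these limitations might be handled during data preparation, the sketch below converts a `foreign_gross`-style column to numbers and derives a datetime column from `year`. It uses a made-up DataFrame and assumes that the non-numeric entries are caused by thousands separators, which has not been verified against the real file; the column name `release_year` is also hypothetical.

```python
import pandas as pd

# Made-up rows that mimic the structure of the gross dataset
demo = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "foreign_gross": ["652000000", "1,131.6", None],
    "year": [2010, 2015, 2018],
})

# Remove thousands separators, then convert; unparseable entries become NaN
demo["foreign_gross"] = pd.to_numeric(
    demo["foreign_gross"].str.replace(",", "", regex=False),
    errors="coerce",
)

# Represent the release year as a datetime (1 January of that year)
demo["release_year"] = pd.to_datetime(demo["year"].astype(str), format="%Y")

print(demo.dtypes)
```

Using `errors="coerce"` keeps the conversion from failing on unexpected strings, at the cost of silently turning them into missing values, so the missing-value count should be re-checked afterwards.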

In [24]:
numbers_movie_budgets.describe()

Unnamed: 0,id
count,5782.0
mean,50.372363
std,28.821076
min,1.0
25%,25.0
50%,50.0
75%,75.0
max,100.0


In [25]:
numbers_movie_budgets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5782 entries, 0 to 5781
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   id                 5782 non-null   int64 
 1   release_date       5782 non-null   object
 2   movie              5782 non-null   object
 3   production_budget  5782 non-null   object
 4   domestic_gross     5782 non-null   object
 5   worldwide_gross    5782 non-null   object
dtypes: int64(1), object(5)
memory usage: 271.2+ KB


In the `numbers_movie_budgets` DataFrame, we see that the columns `production_budget`, `domestic_gross` and `worldwide_gross` are objects instead of floats or integers, hence the `.describe()` method could only summarize the `id` column and gave no statistics for these features.
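One possible way to make such monetary columns usable is to strip the formatting characters and convert the result to numbers. The sketch below assumes the values are strings such as `$425,000,000`; that format, the sample figures and the helper name `currency_to_number` are assumptions for illustration, so the real column contents should be inspected first.

```python
import pandas as pd


def currency_to_number(series):
    """Strip '$' and ',' from string values and convert them to numbers."""
    cleaned = series.astype(str).str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


# Made-up values in the assumed "$1,234" format
demo = pd.DataFrame({
    "production_budget": ["$425,000,000", "$1,100,000"],
    "worldwide_gross": ["$2,776,345,279", "$0"],
})

# Apply the conversion to every monetary column
money_columns = ["production_budget", "worldwide_gross"]
demo[money_columns] = demo[money_columns].apply(currency_to_number)

print(demo.describe())
```

After this conversion, `.describe()` reports statistics for the monetary columns as well.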

In [26]:
tmdb_movies_dataset.describe()

Unnamed: 0.1,Unnamed: 0,id,popularity,vote_average,vote_count
count,26517.0,26517.0,26517.0,26517.0,26517.0
mean,13258.0,295050.15326,3.130912,5.991281,194.224837
std,7654.94288,153661.615648,4.355229,1.852946,960.961095
min,0.0,27.0,0.6,0.0,1.0
25%,6629.0,157851.0,0.6,5.0,2.0
50%,13258.0,309581.0,1.374,6.0,5.0
75%,19887.0,419542.0,3.694,7.0,28.0
max,26516.0,608444.0,80.773,10.0,22186.0


In [27]:
tmdb_movies_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


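Here we can see that in the `tmdb_movies_dataset` DataFrame, the mean `vote_average` is 5.99 and the mean `vote_count` is 194.22. The median `vote_count` is only 5, however, so the mean is pulled up by a small number of heavily voted films (the maximum is 22,186).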
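Since many films have very few votes, an average rating computed over all rows may be dominated by ratings based on one or two votes. The sketch below illustrates one option: restricting the data to films with a minimum number of votes before averaging. The demo rows and the threshold `MIN_VOTES` are hypothetical and would need to be chosen with the real distribution in mind.

```python
import pandas as pd

# Made-up rows with the same column names as the TMDB dataset
demo = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "vote_average": [10.0, 7.5, 6.0, 2.0],
    "vote_count": [1, 1200, 300, 2],
})

MIN_VOTES = 50  # hypothetical threshold, not taken from the project

# Keep only films whose rating is based on enough votes
well_voted = demo[demo["vote_count"] >= MIN_VOTES]

mean_all = demo["vote_average"].mean()
mean_well_voted = well_voted["vote_average"].mean()

print(mean_all, mean_well_voted)
```

Comparing the two means gives a quick sense of how much sparsely voted films influence the overall picture.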