# Unleashing the Power of Data: Analyzing Box Office Success to Shape Microsoft's New Movie Studio Strategy

## 1. Business Understanding

## Introduction

>In today's dynamic entertainment landscape, the demand for original video content is skyrocketing, with major companies establishing their own movie studios to captivate audiences. Inspired by this trend, Microsoft has embarked on a mission to venture into filmmaking by creating its own movie studio. However, given its limited knowledge and experience in the domain, Microsoft is confronted with a crucial challenge: determining the types of films that are currently enjoying significant success at the box office. This is where data analysis and exploration come into play.

>As a data analyst, my role is pivotal in unraveling the intricate dynamics of the film industry, focusing specifically on box office performance. By delving into the realm of cinema, I will uncover valuable insights that can inform the head of Microsoft's new movie studio in making informed decisions about the types of films to produce. This project aims to leverage the power of data to provide actionable recommendations that will guide Microsoft in creating compelling content, tailored to capture the hearts and minds of moviegoers, and position the company as a major player in the ever-evolving world of film production.

## Problem Statement

>Microsoft sees all the big companies creating original video content and wants to get in on the fun. It has decided to create a new movie studio, but it doesn't know anything about creating movies. You are charged with exploring what types of films are currently doing the best at the box office. You must then translate those findings into actionable insights that the head of Microsoft's new movie studio can use to help decide what type of films to create.

## Main Objective

>The main objective of this project is to conduct exploratory data analysis to identify the types of films that are currently performing exceptionally well at the box office. By analyzing relevant data and trends in the film industry, the goal is to provide actionable insights to the head of Microsoft's new movie studio. These insights will assist in making informed decisions regarding the types of films to create, ensuring alignment with audience preferences and increasing the studio's chances of achieving commercial success.

## Experimental Design

>1. Data Collection
>2. Data Processing
>3. Exploratory Data Analysis
>4. Findings and Insights
>5. Decision Making

## Data Understanding

>The data used is from the `im.db` database. The database has 8 tables:

>>- movie_basics
>>- directors
>>- known_for
>>- movie_akas
>>- movie_ratings
>>- persons
>>- principals
>>- writers

>The `movie_ratings` and `movie_basics` tables will be used in the analysis. The `movie_ratings` table provides information on movie ratings, while the `movie_basics` table provides information on various movie attributes such as start_year, runtime and genres.

## 2. Reading the Data

In [None]:
import zipfile
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
%matplotlib inline
warnings.simplefilter(action='ignore', category=Warning)

In [None]:
with zipfile.ZipFile("Data/im.db.zip", 'r') as zip_file:
    zip_file.extractall("Data")

    
conn = sqlite3.connect("Data/im.db")
sql_query = """SELECT name FROM sqlite_master WHERE type = 'table';"""
tables = pd.read_sql(sql_query,conn)
tables



## 3. Checking the Data

In [None]:
# Previewing the movie_basics table
pd.read_sql("""
SELECT * FROM movie_basics
""", conn).head()

In [None]:
# Previewing the known_for table
pd.read_sql("""
SELECT * FROM known_for
""", conn).head()

In [None]:
# Previewing the movie_ratings table
pd.read_sql("""
SELECT * FROM movie_ratings
""", conn).head()

In [None]:
#Joining the movie_basics and movie_ratings tables

df = pd.read_sql("""
SELECT * FROM movie_basics
JOIN movie_ratings
USING (movie_id)
""", conn)
df.head()
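
>The join above is an inner join, so a movie appears in the result only if it has a row in both tables. The sketch below illustrates this on a toy in-memory database. The table and column names mirror `im.db`, but the rows, the `toy_conn` connection and the `toy_df` name are invented for illustration and are not part of the original data.

```python
import sqlite3
import pandas as pd

# Toy in-memory database; rows are invented purely for illustration.
toy_conn = sqlite3.connect(":memory:")
toy_conn.executescript("""
CREATE TABLE movie_basics (movie_id TEXT, primary_title TEXT, genres TEXT);
CREATE TABLE movie_ratings (movie_id TEXT, averagerating REAL, numvotes INTEGER);
INSERT INTO movie_basics VALUES
    ('tt1', 'Film A', 'Action,Drama'),
    ('tt2', 'Film B', 'Comedy'),
    ('tt3', 'Film C', NULL);
INSERT INTO movie_ratings VALUES
    ('tt1', 7.1, 1200),
    ('tt3', 5.4, 80);
""")

# Same query shape as the notebook's join.
toy_df = pd.read_sql("""
SELECT * FROM movie_basics
JOIN movie_ratings
USING (movie_id)
""", toy_conn)
toy_conn.close()

# 'tt2' has no rating row, so the inner join leaves it out.
print(toy_df)
```

>In this toy case only two of the three movies survive the join, which is why the joined dataframe can have fewer rows than `movie_basics`.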

## 4. Tidying the Dataset

In [None]:
#Checking the shape of the dataframe
df.shape

In [None]:
#Checking for missing values and their percentages

(df.isna().sum()*100/df.shape[0]).round(2)

In [None]:
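#Dropping irrelevant columns (positions 1 to 4) and rows that contain null values
#The axis is passed by keyword because newer pandas versions reject a positional axis
df.drop(columns=df.columns[1:5], inplace=True)
df.dropna(inplace=True)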

In [None]:
# Checking the final dataframe; shape and any missing values
print(df.shape)
df.isna().any()

In [None]:
# Checking for any duplicates
df[df.duplicated()]

In [None]:
# A glance of the final dataframe
df.head()

## 6. Exploratory Analysis

In [None]:
#Split the genres column
df['genres'] = df.genres.str.split(',')
df = df.explode('genres')
df.head()
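
>To make the effect of the split-and-explode step concrete, here is a minimal sketch on invented data (the `toy` and `exploded` names are hypothetical). It shows that a movie with several genres becomes several rows, and that those rows share the original index label.

```python
import pandas as pd

# Invented rows; only the column names follow the notebook.
toy = pd.DataFrame({
    "movie_id": ["tt1", "tt2"],
    "genres": ["Action,Drama", "Comedy"],
    "averagerating": [7.1, 6.3],
})

# Turn the comma-separated string into a list, then one row per genre.
toy["genres"] = toy["genres"].str.split(",")
exploded = toy.explode("genres")

print(exploded)
# Rows that came from the same movie keep the same index label.
print(exploded.index.tolist())
```

>Two consequences follow: each movie is counted once per genre in every later aggregate, and index-based operations act on all rows of a movie at once. Calling `reset_index(drop=True)` after `explode` is one way to get unique labels if they are needed.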

In [None]:
df['genres'].value_counts()

In [None]:
count = df['genres'].value_counts(ascending = True)
x, y =  count.index.tolist(), count.values.tolist()

fig, ax = plt.subplots(figsize=(10,6))
ax.barh(x, y, color = 'steelblue')
ax.set_title('Number of movies per genre')
# barh puts the counts on the x-axis and the genres on the y-axis
ax.set_xlabel('Count')
ax.set_ylabel('Genre')
ax.grid(axis='x', linestyle='--')

plt.tight_layout()

In [None]:
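# Data Reduction (dropping genres with a count below 500)
# A boolean mask is used because the exploded index repeats for each movie,
# so dropping by index label would also remove a movie's other genres.
genre_counts = df['genres'].value_counts()
low_genres = genre_counts[genre_counts < 500].index.tolist()
df = df[~df['genres'].isin(low_genres)]
df['genres'].value_counts()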

>Excluding genres with low counts is essential to obtain reliable ratings. The genre `short`, for example, has a count of one and a rating of 9.0, which would give a misleading picture of the genre.
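
>The following sketch shows, on invented data, how a single-movie genre can top a ranking by mean rating and how a count threshold removes it. The `MIN_COUNT` value of 3 is a hypothetical cut-off chosen for the toy data, not the threshold of 500 used in this notebook.

```python
import pandas as pd

# Invented ratings: two well-populated genres and one single-movie genre.
toy = pd.DataFrame({
    "genres": ["Drama"] * 4 + ["Comedy"] * 3 + ["Short"],
    "averagerating": [6.0, 7.0, 6.5, 5.5, 6.0, 6.4, 5.9, 9.0],
})
MIN_COUNT = 3  # hypothetical cut-off for the toy data

# Before filtering, 'Short' leads on the strength of one movie.
print(toy.groupby("genres")["averagerating"].mean().sort_values(ascending=False))

# Keep only genres that appear at least MIN_COUNT times.
counts = toy["genres"].value_counts()
keep = counts[counts >= MIN_COUNT].index
filtered = toy[toy["genres"].isin(keep)]

print(filtered["genres"].value_counts())
```

>Filtering with a boolean mask keeps the decision tied to the genre value of each row, which matters when index labels are not unique.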

In [None]:
plt.figure(figsize = (10, 6))
ax = sns.boxplot(x='genres', y='averagerating', data=df)
plt.setp(ax.artists, alpha=.5, linewidth=2, edgecolor="k")
plt.xticks(rotation=45);

>In movie ratings data, one measure of variation that is useful is the standard deviation. The standard deviation measures the dispersion or spread of the ratings within the dataset. It provides an indication of how much the ratings deviate from the average or mean rating.

>By calculating the standard deviation of movie ratings, you can assess the level of agreement or disagreement among viewers. A smaller standard deviation indicates that the ratings tend to be close to the mean, suggesting a higher level of consensus among viewers. On the other hand, a larger standard deviation implies a wider range of ratings, indicating a greater diversity of opinions and preferences among viewers.
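
>As a small illustration of this idea, the sketch below compares two invented genres, `A` and `B`, that have the same mean rating but very different spreads. The data and names are made up. Note that pandas computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Invented ratings: both genres average 7.0, but B is far more divisive.
ratings = pd.DataFrame({
    "genres": ["A"] * 4 + ["B"] * 4,
    "averagerating": [6.9, 7.0, 7.1, 7.0, 4.0, 9.0, 5.5, 9.5],
})

# Mean and sample standard deviation per genre.
spread = ratings.groupby("genres")["averagerating"].agg(["mean", "std"])
print(spread)
```

>The mean alone would not distinguish the two genres here; the standard deviation shows that ratings in `B` are spread much more widely.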

In [None]:
df.groupby('genres')['averagerating'].std().sort_values(ascending = False)

>The standard deviations are close to one another across genres, which means ratings are spread to a similar degree within each genre. Viewers seem to disagree about movies roughly as much in one genre as in another, so no genre stands out as dramatically more divisive, although small differences remain. One possible reading is that filmmakers and studios deliver a broadly similar range of storytelling, production value and artistic merit across genres.

In [None]:
#Plotting a pie chart of the total number of votes for the nine genres with the most votes

vote_count = df.groupby('genres')['numvotes'].sum().sort_values(ascending = False)
labels = vote_count.index.tolist()[:9]
votes = vote_count.values.tolist()[:9]
fig, ax = plt.subplots()
ax.pie(votes, labels=labels);

>The number of votes in each genre can provide insight into audience engagement, popularity and the level of interest in that genre. A larger number of votes, as with the Action and Drama genres, indicates a higher level of engagement or interest from the audience. It implies that viewers are actively seeking out, rating or reviewing movies from that genre. Additionally, if a genre consistently receives more votes than others, it suggests that it appeals to a larger segment of the audience.
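
>Because pie slices can be hard to compare by eye, the vote totals can also be expressed as percentage shares. The sketch below does this on invented vote counts; `toy` and `vote_share` are hypothetical names.

```python
import pandas as pd

# Invented vote counts, one row per movie-genre pair.
toy = pd.DataFrame({
    "genres": ["Action", "Action", "Drama", "Comedy"],
    "numvotes": [5000, 3000, 1500, 500],
})

# Total votes per genre, largest first.
vote_count = toy.groupby("genres")["numvotes"].sum().sort_values(ascending=False)

# Share of all votes, in percent.
vote_share = (vote_count / vote_count.sum() * 100).round(1)
print(vote_share)
```

>One caveat when reading such shares in this notebook: after the genres column is exploded, the votes of a multi-genre movie are counted once in each of its genres, so the totals overlap.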

## 7. Findings and Insight

>Before deciding on a movie genre to move into, several factors should be considered to make an informed decision. Some of the factors to consider are:

>>1. Target Audience
>>2. Storytelling Potential
>>3. Market Saturation and Trends

>From the exploratory data analysis, Action, Drama, Comedy and Documentary are the preferred genres that the new Microsoft movie studio should prioritize. These four genres have a large audience, as shown in the pie chart. However, the Action genre has more variation in ratings, which could indicate a lack of consensus among the audience. Biography movies have the least variation in their ratings but a smaller audience. Animation, Horror and Thriller movies have a high variation in ratings and a small audience, which makes them less desirable.
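
>The comparison above weighs three quantities per genre: median rating, spread of ratings and audience size. A minimal sketch of putting them side by side in one table is shown below. The data is invented, and the output column names (`median_rating`, `rating_std`, `total_votes`) are hypothetical labels chosen for this example.

```python
import pandas as pd

# Invented rows; only the input column names follow the notebook.
toy = pd.DataFrame({
    "genres": ["Action"] * 3 + ["Documentary"] * 3,
    "averagerating": [4.0, 6.0, 8.0, 7.0, 7.5, 8.0],
    "numvotes": [9000, 12000, 15000, 300, 500, 400],
})

# One row per genre with the three quantities used in the comparison.
summary = (
    toy.groupby("genres")
    .agg(
        median_rating=("averagerating", "median"),
        rating_std=("averagerating", "std"),
        total_votes=("numvotes", "sum"),
    )
    .sort_values("total_votes", ascending=False)
)
print(summary)
```

>A table like this makes the trade-off explicit: in the toy data, one genre has far more votes while the other has a higher median rating and a narrower spread.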

## 8. Recommendations

>The new movie studio has a variety of genres to choose from. However, the most suitable genres are Action, Comedy and Documentary. Action movies have a large audience and a higher variation in ratings, which may leave more room for storytelling. This could allow for the production of compelling stories that resonate with the audience.

>The Comedy genre has a relatively large audience and a relatively large standard deviation in ratings. The larger audience makes it more suitable for production. The target audience plays a crucial role in the selection of a movie genre: the audience is the primary consumer of movies, and its preferences and interests greatly influence the success and reception of a film. The larger audience for the Action and Comedy genres should improve the reception of movies produced by the studio.

>Documentary films, despite having a smaller audience, have lower variation in ratings and the highest median rating. The high median rating suggests that documentaries resonate well with viewers and have a higher likelihood of attracting and engaging the audience.

>Overall, the three movie genres have a higher likelihood of success.