# Box Office - data investigations

Here we investigate the box office data. Some of the findings are useful for our conclusions; others simply explore the data.

In [1]:
import os
import numpy as np
import pandas as pd

### Load data

We load the pickled dataframe.

In [2]:
filename = "boxOffice"
data_dir = os.path.join(os.getcwd(), 'data')

# Read the pickled dataframe from <current working directory>/data/boxOffice.pkl
df_boxOffice = pd.read_pickle(os.path.join(data_dir, f"{filename}.pkl"))
df_boxOffice.head()

Unnamed: 0,days,dow,rank,daily,theaters,special events,movie
0,2019-05-24,Friday,1,31358935.0,4476,,Aladdin
1,2019-05-25,Saturday,1,30013295.0,4476,,Aladdin
2,2019-05-26,Sunday,1,30128699.0,4476,,Aladdin
3,2019-05-27,Monday,1,25305033.0,4476,Memorial Day,Aladdin
4,2019-05-28,Tuesday,1,12014982.0,4476,,Aladdin


### Initial analysis of Box Office data

We want to do a simple investigation to see whether we need to handle outliers etc. in the box office data. We start by describing the continuous attributes.

In [3]:
df_boxOffice.describe()

Unnamed: 0,rank,daily,theaters
count,6905.0,6905.0,6905.0
mean,14.968573,2958313.0,1746.85286
std,12.610773,8340898.0,1592.607404
min,1.0,60.0,5.0
25%,4.0,32178.0,221.0
50%,12.0,266139.0,1238.0
75%,24.0,1842660.0,3323.0
max,63.0,157461600.0,4802.0
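
The summary above shows a mean daily box office (about 2.96 million) far above the median (about 266 thousand), which points to a strongly right-skewed column. As a minimal sketch of how one might probe this, the block below flags outliers with the common 1.5 × IQR rule on raw and on log-transformed values. The numbers are made up for illustration, and `iqr_outliers` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Made-up daily values spanning several orders of magnitude (illustration only)
daily = pd.Series([60.0, 1_000.0, 30_000.0, 250_000.0, 2_000_000.0, 150_000_000.0])

def iqr_outliers(s):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (hypothetical helper)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# On the raw scale the largest value is flagged as an outlier
raw_flags = iqr_outliers(daily)

# On a log10 scale the same values are spread far more evenly
log_flags = iqr_outliers(np.log10(daily))

print("Flagged on raw scale:", int(raw_flags.sum()))
print("Flagged on log scale:", int(log_flags.sum()))
```

If the real data behaves similarly, working on a log scale may be a gentler option than dropping the largest days, which are likely genuine opening weekends rather than errors.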


Check for any NaN values. They only occur in the special events column, which is fine: most days simply have no special event.

In [4]:
df_boxOffice.isnull().any()

days              False
dow               False
rank              False
daily             False
theaters          False
special events     True
movie             False
dtype: bool
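
`isnull().any()` only tells us whether a column contains missing values at all. A small sketch of how to also see how many there are, using a toy frame with made-up values in place of `df_boxOffice`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_boxOffice (values are made up for illustration)
toy = pd.DataFrame({
    "daily": [100.0, 200.0, 300.0, 400.0],
    "special events": [np.nan, "Memorial Day", np.nan, np.nan],
})

# Number of missing values per column
nan_counts = toy.isnull().sum()

# Share of rows with a missing value per column
nan_share = toy.isnull().mean()

print(nan_counts)
print(nan_share)
```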

Well, this is all good. Let us do a sanity check to see that the cumulative box office for a movie in this dataset matches the total USA box office in the IMDb data:

In [81]:
%store -r movies
# IMDb dataset from IMDb_investigation.

boxOffice_total = {}  # Store in dicts because they are super useful
movies_total = {}
for i in df_boxOffice.movie.unique():  # Loop through all movies, collecting total box office from both datasets
    boxOffice_total[i] = df_boxOffice[df_boxOffice.movie==i].daily.sum()
    movies_total[i] = movies[movies.original_title==i].usa_gross_income
# Both dicts are filled with the same keys, so this difference is 0 by construction:
print("Difference in lengths: ",len(boxOffice_total)-len(movies_total)) #Prints 0

# Caution: all()/any() on a dict only test whether its keys are truthy,
# so this prints True without comparing the totals themselves:
print("True if both dicts are equal: ",all(boxOffice_total)==any(movies_total)) #Prints True

Difference in lengths:  0
True if both dicts are equal:  True
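
A stricter version of this check would compare the two totals movie by movie. Below is a minimal sketch on toy data with made-up numbers. It assumes that `usa_gross_income` is already numeric and that each title occurs only once in the IMDb table; the 1% relative tolerance is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df_boxOffice and movies (numbers are made up)
toy_box = pd.DataFrame({
    "movie": ["A", "A", "B", "B"],
    "daily": [10.0, 5.0, 7.0, 3.0],
})
toy_movies = pd.DataFrame({
    "original_title": ["A", "B"],
    "usa_gross_income": [15.0, 12.0],  # assumed to be numeric already
})

# Sum of the daily box office per movie
box_total = toy_box.groupby("movie")["daily"].sum()

# One total per title from the IMDb-style table (assumes unique titles)
imdb_total = toy_movies.set_index("original_title")["usa_gross_income"]

# Align both totals on the title; titles missing on one side become NaN
compare = pd.concat(
    [box_total.rename("box_office"), imdb_total.rename("imdb")], axis=1
)

# Element-wise comparison with a relative tolerance of 1%
compare["match"] = np.isclose(compare["box_office"], compare["imdb"], rtol=0.01)

print(compare)
print("All movies match:", compare["match"].all())
```

In this toy example movie A matches and movie B does not, so the mismatching rows can be inspected directly with `compare[~compare["match"]]`.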


With that out of the way, let's see what a single movie looks like when plotted (in this case Spectre, since we are massive Bond fans):

In [28]:
import plotly.express as px

# Filter the rows for Spectre once and reuse them for both axes
spectre = df_boxOffice[df_boxOffice.movie=='Spectre']
fig = px.bar(x=spectre.days, y=spectre.daily, labels={'x':'date', 'y':'box office'},
            color_discrete_sequence=['indianred'],
            title='boxOffice for Spectre')
fig.show()

After toying around with this for a while, we realised that all the movies seem to follow the same type of curve, roughly a power law. Below is an interactive plot where we can zoom in on different parts. Beware that the plot looks empty at first; that is only because there is so much data in a single figure.

In [61]:
fig = px.bar(x=df_boxOffice.days, y=df_boxOffice.daily, labels={'x':'date', 'y':'box office'},
            color=df_boxOffice.movie,
            title='boxOffice for all movies')
fig.show()
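
The power-law impression is based on eyeballing the plots. One way to probe it, sketched here on synthetic data rather than on the real box office numbers, is to fit a straight line in log-log space: if daily revenue behaves like `a * t**b` with `t` the number of days since release, then `log(daily)` is linear in `log(t)` with slope `b`. The values `1000` and `-1.5` below are made up for the example.

```python
import numpy as np

# Synthetic daily revenue following daily = 1000 * t**(-1.5) (made-up numbers)
t = np.arange(1, 31, dtype=float)   # days since release, starting at 1
daily = 1000.0 * t ** -1.5

# A power law is a straight line in log-log space:
# log(daily) = log(a) + b * log(t)
slope, intercept = np.polyfit(np.log(t), np.log(daily), 1)

print("Estimated exponent b:", slope)
print("Estimated scale a:", np.exp(intercept))
```

On real data the weekend spikes would add noise around such a line, and an exponential decay would instead look straight when only the revenue axis is logarithmic, so comparing both fits may help tell the two apart.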

There appears to be a lot of activity at the start of 2016. Let's zoom in:

In [44]:
fig = px.bar(x=df_boxOffice.days, y=df_boxOffice.daily, labels={'x':'date', 'y':'box office'},
            color=df_boxOffice.movie,
            title='boxOffice for movies in first half of 2016')
fig.update_layout(yaxis_range=[0,1e8], xaxis_range=[pd.Timestamp(2016,1,1),pd.Timestamp(2016,6,1)])
fig.show()

It looks like the spikes occur at the same dates no matter the movie. This may be the weekends:

In [84]:
fig = px.histogram(x=df_boxOffice.dow, y=df_boxOffice.daily, labels={'x':'day of the week', 'y':'box office'},
            #color=df_boxOffice.movie,
            title='boxOffice for all movies on days of the week')
fig.show()

Yep, it looks like people are more likely to go to the theaters during the weekend.
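
The histogram sums the daily box office per weekday. To put a number on the weekend effect, one could also compare averages. The sketch below does this on a toy frame with made-up values; counting Friday to Sunday as the weekend is an assumption of this example.

```python
import pandas as pd

# Toy frame standing in for df_boxOffice (values are made up for illustration)
toy = pd.DataFrame({
    "dow": ["Friday", "Saturday", "Sunday", "Monday", "Tuesday", "Saturday"],
    "daily": [30.0, 40.0, 35.0, 10.0, 8.0, 20.0],
})

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Mean daily box office per weekday, in calendar order
# (weekdays absent from the data show up as NaN)
mean_by_dow = toy.groupby("dow")["daily"].mean().reindex(day_order)

# Assumption: Friday, Saturday and Sunday count as the weekend
is_weekend = toy["dow"].isin(["Friday", "Saturday", "Sunday"])
weekend_mean = toy.loc[is_weekend, "daily"].mean()
weekday_mean = toy.loc[~is_weekend, "daily"].mean()

print(mean_by_dow)
print("Weekend mean:", weekend_mean)
print("Weekday mean:", weekday_mean)
```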