**Question/need:** Can we predict the success (gross revenue, revenue per budget dollar, and/or Oscar nominations) of proposed low-budget films (< $10 million) from their characteristics? Do different characteristics predict success in gross revenue versus Oscar nominations?

**Movie data:** I plan to use all movies (1980 - 2016) from Box Office Mojo to investigate how various characteristics affect success. If I have time, I would also like to incorporate other variables, such as Google searches and Wikipedia page views, as well as critic and audience ratings from Rotten Tomatoes.

**Characteristics of each movie and/or other entities:** On my first pass I'd like to investigate as many characteristics as possible to determine which have the greatest predictive impact, and then dig deeper into those. The features I'm most interested in are: genre, release date (month, whether it falls before a holiday weekend, during the Christmas holiday, or during summer), star power (a score that accounts for actors, director, and producers), production budget, and whether the movie has some preexisting popularity, either as a novel/play adaptation or as a follow-up to an earlier film. I'll also investigate other features such as runtime, rating, and franchise, but I have a hunch that these will have a smaller impact.
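The release-date features listed above could be derived along the following lines. This is a minimal sketch, not the final feature set: the column names (`release_month`, `is_summer`, `is_christmas`), the June to August definition of summer, and the December 18 to 31 Christmas window are all assumptions chosen for illustration, and the dates are illustrative examples.

```python
import pandas as pd

# Illustrative release dates (assumed examples, not the full dataset)
release = pd.Series(pd.to_datetime([
    '2010-06-11', '2013-10-04', '2001-06-29', '2007-11-30', '2015-12-25',
]))

features = pd.DataFrame({'release_date': release})

# Month of release (1-12)
features['release_month'] = release.dt.month

# Assumption: "summer" means a June, July, or August release
features['is_summer'] = release.dt.month.isin([6, 7, 8])

# Assumption: "Christmas holiday" means a release on December 18-31
features['is_christmas'] = (release.dt.month == 12) & (release.dt.day >= 18)

print(features)
```

The holiday-weekend flag is left out because it needs a calendar of holiday dates, which would have to be chosen separately.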

In [13]:
# storing
import pickle

# analysis 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline



First, let's load the movie data into a dataframe and check the quality of our data.

In [40]:
with open('pickled_data/all-movies-data.pkl', 'rb') as picklefile:  # pickle files should be opened in binary mode
    all_movies_data = pickle.load(picklefile)

In [41]:
with open('pickled_data/failed-urls.pkl', 'rb') as picklefile:  # pickle files should be opened in binary mode
    failed_urls = pickle.load(picklefile)

In [42]:
print(failed_urls)

['http://www.boxofficemojo.com/movies/?id=longshot2013.htm']


In [43]:
all_movies = pd.DataFrame(all_movies_data)

In [47]:
all_movies.head()

| | 1-title | 2-release_date | 3-closing_date | actors | budget | director | distributor | dom_total_gross | genre | intl_total_gross | oscar_noms | oscar_wins | producers | rating | runtime_(mins) | theaters | url | writers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The A-Team | 2010-06-11 | 2010-09-16 | [Liam Neeson, Bradley Cooper, Sharlto Copley, ... | 110000000.0 | [Joe Carnahan] | Fox | 77222099 | Action | 177238796.0 | 0 | 0 | [Ridley Scott, Tony Scott, Alex Young] | PG-13 | 117.0 | 3544 | ateam.htm | [Skip Woods] |
| 1 | A.C.O.D. | 2013-10-04 | 2013-11-07 | [Adam Scott, Catherine O'Hara, Richard Jenkins... | | | The Film Arcade | 175705 | Comedy | | 0 | 0 | [Teddy Schwarzman] | PG-13 | 88.0 | 42 | acod.htm | |
| 2 | A.I. Artificial Intelligence | 2001-06-29 | NaT | [Haley Joel Osment, Frances O'Connor, Jude Law... | 100000000.0 | [Steven Spielberg] | Warner Bros. | 78616689 | Sci-Fi | 235926552.0 | 2 | 0 | [Kathleen Kennedy, Steven Spielberg] | PG-13 | 145.0 | 3242 | ai.htm | |
| 3 | Aaja Nachle | 2007-11-30 | 2007-12-20 | | | | Yash Raj | 484108 | Foreign | 6773493.0 | 0 | 0 | | Unrated | 145.0 | 66 | aajanachle.htm | |
| 4 | Aarakshan | 2011-08-12 | 2011-09-22 | | | | Reliance Big Pictures | 651096 | Foreign | 651096.0 | 0 | 0 | | Unrated | | 91 | aarakshan.htm | |


In [52]:
all_movies.describe()

| | budget | dom_total_gross | intl_total_gross | oscar_noms | oscar_wins | runtime_(mins) | theaters |
|---|---|---|---|---|---|---|---|
| count | 2874.0 | 14063.0 | 6476.0 | 16100.0 | 16100.0 | 15163.0 | 13009.0 |
| mean | 43968190.0 | 18001540.0 | 70712390.0 | 0.196646 | 0.043106 | 104.045242 | 783.702052 |
| std | 43601490.0 | 42897840.0 | 150081100.0 | 0.89873 | 0.362035 | 22.042827 | 1102.616644 |
| min | 1100000.0 | 30.0 | 1.0 | 0.0 | 0.0 | 35.0 | 1.0 |
| 25% | 14000000.0 | 69021.0 | 903002.0 | 0.0 | 0.0 | 91.0 | 7.0 |
| 50% | 30000000.0 | 1110522.0 | 12592950.0 | 0.0 | 0.0 | 100.0 | 74.0 |
| 75% | 60000000.0 | 16296910.0 | 73832560.0 | 0.0 | 0.0 | 112.0 | 1395.0 |
| max | 300000000.0 | 863148200.0 | 2787965000.0 | 10.0 | 9.0 | 729.0 | 4468.0 |


Since we're interested in budget data, let's check how many of the 16,100 movies have it.

In [51]:
all_movies.budget.count()

2874
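The same check can be run for every column at once rather than one column at a time. Below is a minimal sketch, assuming a toy DataFrame named `sample` (a hypothetical name) that holds a few columns of the five rows shown in the `all_movies.head()` output; on the real data the same expression would be applied to `all_movies`.

```python
import numpy as np
import pandas as pd

# Toy frame built from the five rows in the head() preview (subset of columns)
sample = pd.DataFrame({
    '1-title': ['The A-Team', 'A.C.O.D.', 'A.I. Artificial Intelligence',
                'Aaja Nachle', 'Aarakshan'],
    'budget': [110000000.0, np.nan, 100000000.0, np.nan, np.nan],
    'dom_total_gross': [77222099, 175705, 78616689, 484108, 651096],
    'intl_total_gross': [177238796.0, np.nan, 235926552.0, 6773493.0, 651096.0],
    'runtime_(mins)': [117.0, 88.0, 145.0, 145.0, np.nan],
})

# notnull() gives a True/False frame; the column mean of booleans is the
# share of non-missing values, reported here as a percentage
coverage = (sample.notnull().mean() * 100).round(1)
print(coverage)
```

In this toy sample only two of the five rows have a budget, so the `budget` coverage is 40%.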

Only 2,874 of the 16,100 movies have budget data, just 17.9% of the dataset. Let's take a closer look at those movies.

In [59]:
# keep only movies with budget data; .copy() returns an independent DataFrame,
# so the column assignments below do not trigger a SettingWithCopyWarning
only_budget = all_movies[pd.notnull(all_movies['budget'])].copy()

# add gross-to-budget ratios (revenue per budget dollar)
only_budget['dom_roi'] = only_budget['dom_total_gross'] / only_budget['budget']
only_budget['intl_roi'] = only_budget['intl_total_gross'] / only_budget['budget']
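To make the two ratios concrete, here is a minimal sketch that applies the same filter and division to three rows taken from the `all_movies.head()` output. The names `rows` and `with_budget` are hypothetical and used only for this illustration.

```python
import numpy as np
import pandas as pd

# Three rows from the head() preview; A.C.O.D. has no budget or international gross
rows = pd.DataFrame({
    '1-title': ['The A-Team', 'A.C.O.D.', 'A.I. Artificial Intelligence'],
    'budget': [110000000.0, np.nan, 100000000.0],
    'dom_total_gross': [77222099, 175705, 78616689],
    'intl_total_gross': [177238796.0, np.nan, 235926552.0],
})

# Keep rows with a budget and take an explicit copy before adding columns
with_budget = rows[rows['budget'].notnull()].copy()

# Gross revenue per budget dollar
with_budget['dom_roi'] = with_budget['dom_total_gross'] / with_budget['budget']
with_budget['intl_roi'] = with_budget['intl_total_gross'] / with_budget['budget']

print(with_budget[['1-title', 'dom_roi', 'intl_roi']].round(2))
```

A value below 1 means the gross did not cover the production budget: The A-Team's domestic ratio is about 0.70. These columns are gross-to-budget ratios rather than net returns, since the budget is not subtracted from the gross.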


Now let's look only at low-budget films, which we define as films with budgets under $10 million, and evaluate the quality of that data.

In [58]:
# get only low budget (<$10m) films
low_budget = only_budget[only_budget['budget'] < 10000000]
low_budget.count()

1-title             469
2-release_date      463
3-closing_date      285
actors              335
budget              469
director            252
distributor         469
dom_total_gross     442
genre               469
intl_total_gross    263
oscar_noms          469
oscar_wins          469
producers           192
rating              469
runtime_(mins)      464
theaters            401
url                 469
writers             175
dtype: int64

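There are 469 movies in our low-budget dataset, but coverage is uneven: 442 have a domestic gross and only 263 have an international gross, and fewer than half list producers (192) or writers (175).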
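The counts above are easier to compare as percentages of the 469 films. A minimal sketch, with the non-null counts copied by hand from the `low_budget.count()` output (the variable names are hypothetical):

```python
import pandas as pd

# Non-null counts copied from the low_budget.count() output
counts = pd.Series({
    'dom_total_gross': 442,
    'theaters': 401,
    'actors': 335,
    'intl_total_gross': 263,
    'director': 252,
    'producers': 192,
    'writers': 175,
})
total = 469  # number of low-budget films

# Percentage of films with a value in each column, highest first
coverage = (counts / float(total) * 100).round(1).sort_values(ascending=False)
print(coverage)
```

On the real DataFrame the same figures could be obtained directly with `low_budget.notnull().mean() * 100`.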