**Question/need:** Can we predict the success (gross revenue, revenue per budget dollar, and/or Oscar nominations) of proposed low-budget films (< $10 million) from their characteristics? Do different characteristics predict success in terms of gross revenue versus Oscar nominations?

**Movie data:** I plan to use all movies (1980-2016) from Box Office Mojo (boxofficemojo.com) to investigate the impact of various characteristics on success. If I have time, I would also like to incorporate other variables, such as Google searches and Wikipedia page views, as well as critic and audience ratings from Rotten Tomatoes.

**Characteristics of each movie and/or other entities:** I'd like to investigate as many characteristics as possible during my first pass to determine which ones have the greatest predictive impact, and then dig deeper into those. The features I'm most interested in are: genre, release date (month, whether it falls before a holiday weekend, during the Christmas holiday, or during summer), star power (a score that accounts for actors, director, and producers), production budget, and whether the movie has some preexisting popularity, either as a novel/play adaptation or because a prequel exists. I'll also investigate other features such as runtime, rating, and franchise, but I have a hunch that these will have a smaller impact.
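
A few of the release-date features listed above can be derived directly from a release date column. The sketch below is one possible encoding, not the notebook's final feature set: the function name `add_release_features` is hypothetical, summer is taken as May through August (the same months the season lookup later in this notebook labels 'summer'), and the Christmas window starting on December 15 is an assumed cutoff.

```python
import pandas as pd


def add_release_features(df, date_col='release_date'):
    """Return a copy of df with simple release-date features added.

    Assumptions (not from the original notebook): summer = May-August,
    Christmas window = December 15 onward.
    """
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out['release_month'] = dates.dt.month
    out['is_summer'] = dates.dt.month.isin([5, 6, 7, 8])
    out['is_christmas_window'] = (dates.dt.month == 12) & (dates.dt.day >= 15)
    return out


# Small illustrative frame (dates only; not the real dataset).
sample = pd.DataFrame({'release_date': ['2010-06-11', '2007-12-20', '2013-10-04']})
features = add_release_features(sample)
```

Holiday-weekend and star-power features need external inputs (a holiday calendar, a scoring rule for people), so they are left out of this sketch.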

In [None]:
'''TO DOS:
- clean up genres and rating
- adjust budget for inflation'''

In [22]:
# storing
import pickle

# analysis 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

from pprint import pprint

%matplotlib inline

First, let's load the movie data into a dataframe and check the quality of our data.

In [3]:
with open('pickled_data/all-movies-data.pkl', 'rb') as picklefile:  # pickles are binary, so read in 'rb' mode
    all_movies_data = pickle.load(picklefile)

In [144]:
# with open('pickled_data/failed-urls.pkl', 'r') as picklefile:
#     failed_urls = pickle.load(picklefile)

In [145]:
# print failed_urls

['http://www.boxofficemojo.com/movies/?id=romero.htm&adjust_yr=2016']


In [4]:
movies = pd.DataFrame(all_movies_data)

In [5]:
movies.shape

(16100, 19)

In [6]:
movies.head()

Unnamed: 0,1-title,2-release_date,3-closing_date,actors,budget,director,distributor,dom_total_gross,domestic_total_adj_gross,genre,intl_total_gross,oscar_noms,oscar_wins,producers,rating,runtime_mins,theaters,url,writers
0,The A-Team,2010-06-11,2010-09-16,"[Liam Neeson, Bradley Cooper, Sharlto Copley, ...",110000000.0,[Joe Carnahan],Fox,77222099,84507200,Action,177238796.0,0,0,"[Ridley Scott, Tony Scott, Alex Young]",PG-13,117.0,3544,ateam.htm,[Skip Woods]
1,A.C.O.D.,2013-10-04,2013-11-07,"[Adam Scott, Catherine O'Hara, Richard Jenkins...",,,The Film Arcade,175705,183100,Comedy,,0,0,[Teddy Schwarzman],PG-13,88.0,42,acod.htm,
2,A.I. Artificial Intelligence,2001-06-29,NaT,"[Haley Joel Osment, Frances O'Connor, Jude Law...",100000000.0,[Steven Spielberg],Warner Bros.,78616689,120841900,Sci-Fi,235926552.0,2,0,"[Kathleen Kennedy, Steven Spielberg]",PG-13,145.0,3242,ai.htm,
3,Aaja Nachle,2007-11-30,2007-12-20,,,,Yash Raj,484108,612200,Foreign,6773493.0,0,0,,Unrated,145.0,66,aajanachle.htm,
4,Aarakshan,2011-08-12,2011-09-22,,,,Reliance Big Pictures,651096,713400,Foreign,651096.0,0,0,,Unrated,,91,aarakshan.htm,
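
One way to check data quality beyond eyeballing `head()` is to tabulate how much of each column is missing. The following is a minimal sketch; `missing_summary` is a hypothetical helper name, and the toy frame only reuses the title and budget values from the five rows shown above.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return non-null counts and percent missing per column, worst first."""
    summary = pd.DataFrame({
        'non_null': df.notnull().sum(),
        'pct_missing': (df.isnull().mean() * 100).round(1),
    })
    return summary.sort_values('pct_missing', ascending=False)


# Title and budget values copied from the head() output above.
toy = pd.DataFrame({
    'title': ['The A-Team', 'A.C.O.D.', 'A.I. Artificial Intelligence',
              'Aaja Nachle', 'Aarakshan'],
    'budget': [110000000.0, np.nan, 100000000.0, np.nan, np.nan],
})
summary = missing_summary(toy)
```

Calling `missing_summary(movies)` on the full frame would give the same table for all 19 columns.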


Time to do some cleaning...

In [7]:
# rename cols
movies.rename(columns = {'1-title': 'title', 
                         '2-release_date': 'release_date', 
                         '3-closing_date': 'closing_date'}, inplace = True)

In [172]:
# split actor, director, producer, writer columns
def get_person(peoplelist, i):
    '''Take a list of people and an index; return the person at that index.
    Return blank if there is no list or the index is out of range.'''
    try:
        return peoplelist[i]
    except (TypeError, IndexError):
        return ''

def sep_people(df, person_type, num_people):
    '''For each column with list of people, split into n columns with one person in each column.
    Name new columns as person_type + n.
    
    Args:
    df -- name of dataframe
    person_type -- name of column (str)
    num_people -- number of columns to generate
    '''
    for n in range(num_people):
        df.loc[:, person_type + str(n+1)] = df.loc[:, person_type].apply(get_person, args=(n,))

# add seasons; rationale from http://www.boxofficemojo.com/seasonal/?chart=byseason
def get_season(date):
    '''Return the box-office season for a release date, based on its month.'''
    months = range(1, 13)
    seasons = ['winter', 'winter', 'spring', 'spring',
               'summer', 'summer', 'summer', 'summer',
               'fall', 'fall', 'holiday', 'holiday']
    seasons = dict(zip(months, seasons))
    try:
        return seasons[date.month]
    except (AttributeError, KeyError, TypeError):
        # missing or invalid dates get a blank season
        return ''
       
def add_season(df):
    '''Add corresponding season in new column based on release date.'''
    df['season'] = df.loc[:, 'release_date'].apply(get_season) 

# map raw genres to simple genres
def get_simple_genre(g):
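    '''Recategorize raw genre into simple genre.

    Relies on a global `by_genre` groupby over the 'genre' column existing
    before this is called; simple_genres is matched to the sorted raw genres
    by position, so the two sequences must stay in the same order.'''
    raw_genres = tuple(by_genre.count().index)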
    simple_genres = ['Act_Adven', 'Act_Adven', 'Act_Adven', 'Act_Adven', 'Act_Adven', 
                     'Act_Adven', 'Act_Adven', 'Act_Adven', 'Act_Adven', 'Act_Adven', 
                     'Animation', 'Comedy', 'Comedy', 'Comedy', 'Other', 'Crime_Thriller', 
                     'Crime_Thriller', 'Crime_Thriller', 'Crime_Thriller', 'Documentary', 
                     'Drama', 'Drama', 'Family', 'Family', 'Family', 'Fantasy_Scifi', 
                     'Fantasy_Scifi', 'Fantasy_Scifi', 'Foreign', 'Foreign', 'Foreign', 
                     'Historical', 'Historical', 'Horror', 'Horror', 'Horror', 'Other', 
                     'Other', 'Other', '', 'Historical', 'Historical', 'Historical', 
                     'Historical', 'Historical', 'Romance', 'Romance', 'Romance', 
                     'Romance', 'Fantasy_Scifi', 'Fantasy_Scifi', 'Fantasy_Scifi', 
                     'Fantasy_Scifi', 'Fantasy_Scifi', 'Fantasy_Scifi', 'Fantasy_Scifi', 
                     'Other', 'Other', 'Other', 'Other', 'Crime_Thriller', '', 'Other', 
                     'Other', 'Other', 'Other', 'Other', ]
    genres = dict(zip(raw_genres, simple_genres))
    try:
        return genres[g]
    except:
        return ''

def add_simple_genre(df):
    df['new_genre'] = df.loc[:, 'genre'].apply(get_simple_genre)
    
# clean up ratings
def get_clean_rating(r):
    rm_ratings = ['Unknown', 'Unrated', 'None', 'Not Yet Rated']
    return '' if r in rm_ratings else r

def clean_rating(df):
    df['clean_rating'] = df.loc[:, 'rating'].apply(get_clean_rating)
    
# add roi
def add_roi(df):
    df['dom_roi'] = df.loc[:, 'dom_total_gross'] / df.loc[:, 'budget']
    df['intl_roi'] = df.loc[:, 'intl_total_gross'] / df.loc[:, 'budget']
    
# add total time in theaters - not necessary if we have theaters?
def add_in_theaters_time(df):
    try:
        df['time_in_theaters'] = df.loc[:, 'closing_date'] - df.loc[:, 'release_date']
    except:
        df['time_in_theaters'] = ''
        


In [85]:
sep_people(movies, 'director', 3)
sep_people(movies, 'actors', 5)
sep_people(movies, 'writers', 2)
sep_people(movies, 'producers', 3)
add_season(movies)
add_simple_genre(movies) # try not to rerun this guy, takes a while
clean_rating(movies)

In [None]:
### ONLY APPLICABLE TO MOVIES WITH BUDGETS

# adjust to 2015 dollars 
def get_infl_rate(startyear, endyear):
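    '''Return the multiplier that converts startyear dollars to endyear dollars.
    Args:
    startyear (float) -- start year
    endyear (int) -- end year (2015 in this notebook)
    Returns:
    (float) -- 1 + inflation rate between the two years; NaN if a CPI value is unavailable
    '''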
    try:
        cpiend = float(cpi.loc[endyear, 'AVE.'])
        cpistart = float(cpi.loc[int(startyear), 'AVE.'])
        infl = ((cpiend - cpistart) / cpistart) 
        return (1.0 + infl)
    except:
        return float('NaN')
    
def add_infl_rate(df):
    # add release year
    try:
        df['release_year'] = df['release_date'].apply(lambda x: x.year)
    except:
        df['release_year'] = None
    df['infl_rate'] = df['release_year'].apply(get_infl_rate, args=(2015,))
    
def add_adj_budget(df):
    df['budget_adj'] = df.loc[:, 'budget'] * df.loc[:, 'infl_rate']
    # NOTE: movies with no budget end up with budget_adj == 0 rather than NaN
    df['budget_adj'] = df['budget_adj'].fillna(0.0).astype(int)

# add roi calculations
def get_dom_roi(df):
    df['dom_roi'] = (df['dom_total_gross'] - df['budget'])/ df['budget']

def get_intl_roi(df):
    df['intl_roi'] = (df['intl_total_gross'] - df['budget'])/ df['budget']

In [None]:
add_infl_rate(movies)
add_adj_budget(movies)
get_dom_roi(movies)
get_intl_roi(movies)

In [385]:
movies.head()

Unnamed: 0,title,release_date,closing_date,actors,budget,director,distributor,dom_total_gross,domestic_total_adj_gross,genre,...,producers2,producers3,season,new_genre,clean_rating,release_year,infl_rate,budget_adj,dom_roi,intl_roi
0,The A-Team,2010-06-11,2010-09-16,"[Liam Neeson, Bradley Cooper, Sharlto Copley, ...",110000000.0,[Joe Carnahan],Fox,77222099,84507200,Action,...,Tony Scott,Alex Young,summer,Act_Adven,PG-13,2010,1.086955,119565019,-0.297981,0.611262
1,A.C.O.D.,2013-10-04,2013-11-07,"[Adam Scott, Catherine O'Hara, Richard Jenkins...",,,The Film Arcade,175705,183100,Comedy,...,,,fall,Comedy,PG-13,2013,1.017428,0,,
2,A.I. Artificial Intelligence,2001-06-29,NaT,"[Haley Joel Osment, Frances O'Connor, Jude Law...",100000000.0,[Steven Spielberg],Warner Bros.,78616689,120841900,Sci-Fi,...,Steven Spielberg,,summer,Fantasy_Scifi,PG-13,2001,1.338323,133832298,-0.213833,1.359266
3,Aaja Nachle,2007-11-30,2007-12-20,,,,Yash Raj,484108,612200,Foreign,...,,,holiday,Foreign,,2007,1.143121,0,,
4,Aarakshan,2011-08-12,2011-09-22,,,,Reliance Big Pictures,651096,713400,Foreign,...,,,summer,Foreign,,2011,1.053695,0,,
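
Note that this notebook computes two different quantities under the `dom_roi` / `intl_roi` names: gross divided by budget (revenue per budget dollar) in some cells, and (gross - budget) / budget (net return) in others. The sketch below separates them under hypothetical names, `revenue_multiple` and `net_roi`, and checks them against the A-Team row above; the two always differ by exactly 1.

```python
def revenue_multiple(gross, budget):
    """Revenue per budget dollar: gross / budget."""
    return gross / budget


def net_roi(gross, budget):
    """Net return on budget: (gross - budget) / budget."""
    return (gross - budget) / budget


# The A-Team: domestic gross and budget from the table above.
a_team_gross = 77222099.0
a_team_budget = 110000000.0

multiple = revenue_multiple(a_team_gross, a_team_budget)
roi = net_roi(a_team_gross, a_team_budget)
```

Here `roi` reproduces the `dom_roi` value of about -0.297981 shown in the table, while `multiple` is about 0.702. Whichever definition is used for modeling, it helps to keep it consistent across cells.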


#### cpi workspace

In [178]:
cpi = pd.read_csv('cpi-data.csv')

In [181]:
cpi.set_index('YEAR', inplace = True)

In [182]:
cpi.head()

Unnamed: 0_level_0,AVE.
YEAR,Unnamed: 1_level_1
2016,
2015,237.017
2014,236.736
2013,232.957
2012,229.594
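
The inflation multiplier used for `infl_rate` is CPI(end year) / CPI(start year), which equals 1 + (CPI_end - CPI_start) / CPI_start. As a hedged alternative to applying a function row by row, the sketch below computes it for a whole column with `Series.map`. The name `inflation_multiplier` is hypothetical, and the CPI values are only the four non-missing rows visible in the `cpi.head()` output above.

```python
import pandas as pd

# Annual average CPI values copied from the cpi.head() output above.
cpi_ave = pd.Series({2015: 237.017, 2014: 236.736, 2013: 232.957, 2012: 229.594})


def inflation_multiplier(years, cpi_series, target_year=2015):
    """Return the factor that converts each year's dollars to target_year dollars.

    Years missing from cpi_series map to NaN.
    """
    return cpi_series.loc[target_year] / years.map(cpi_series)


years = pd.Series([2012, 2013, 2015, 1999])
mult = inflation_multiplier(years, cpi_ave)
```

For 2012 this gives about 1.03233, consistent with the 0.0323309842592 inflation rate printed further down in this workspace, and for 2013 about 1.017428, matching the `infl_rate` column. The 1999 entry is NaN only because that year is not in the four-row excerpt.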


In [350]:
# def get_infl_rate(startyear, endyear):
#     '''Return inflation rate to 2015 dollars. 
#     Args:
#     startyear (int) -- start year
#     endyear (int) -- end year
#     Returns:
#     infl (float) -- inflation rate
#     '''
#     cpiend = float(cpi.loc[endyear, 'AVE.'])
#     cpistart = float(cpi.loc[startyear, 'AVE.'])
#     infl = ((cpiend - cpistart) / cpistart) 
#     return infl

In [334]:
type(float('NaN'))

float

In [353]:
movies.head()

Unnamed: 0,title,release_date,closing_date,actors,budget,director,distributor,dom_total_gross,domestic_total_adj_gross,genre,...,writers2,producers1,producers2,producers3,season,new_genre,clean_rating,release_year,infl_rate,budget_adj
0,The A-Team,2010-06-11,2010-09-16,"[Liam Neeson, Bradley Cooper, Sharlto Copley, ...",110000000.0,[Joe Carnahan],Fox,77222099,84507200,Action,...,,Ridley Scott,Tony Scott,Alex Young,summer,Act_Adven,PG-13,2010,1.086955,165000000.0
1,A.C.O.D.,2013-10-04,2013-11-07,"[Adam Scott, Catherine O'Hara, Richard Jenkins...",,,The Film Arcade,175705,183100,Comedy,...,,Teddy Schwarzman,,,fall,Comedy,PG-13,2013,1.017428,
2,A.I. Artificial Intelligence,2001-06-29,NaT,"[Haley Joel Osment, Frances O'Connor, Jude Law...",100000000.0,[Steven Spielberg],Warner Bros.,78616689,120841900,Sci-Fi,...,,Kathleen Kennedy,Steven Spielberg,,summer,Fantasy_Scifi,PG-13,2001,1.338323,150000000.0
3,Aaja Nachle,2007-11-30,2007-12-20,,,,Yash Raj,484108,612200,Foreign,...,,,,,holiday,Foreign,,2007,1.143121,
4,Aarakshan,2011-08-12,2011-09-22,,,,Reliance Big Pictures,651096,713400,Foreign,...,,,,,summer,Foreign,,2011,1.053695,


In [None]:
# def get_adj_budget(budget):
#     adj_budget = 

In [377]:
add_adj_budget(movies)

In [378]:
movies.loc[0, 'budget_adj']

119565019

In [379]:
movies.head()

Unnamed: 0,title,release_date,closing_date,actors,budget,director,distributor,dom_total_gross,domestic_total_adj_gross,genre,...,writers2,producers1,producers2,producers3,season,new_genre,clean_rating,release_year,infl_rate,budget_adj
0,The A-Team,2010-06-11,2010-09-16,"[Liam Neeson, Bradley Cooper, Sharlto Copley, ...",110000000.0,[Joe Carnahan],Fox,77222099,84507200,Action,...,,Ridley Scott,Tony Scott,Alex Young,summer,Act_Adven,PG-13,2010,1.086955,119565019
1,A.C.O.D.,2013-10-04,2013-11-07,"[Adam Scott, Catherine O'Hara, Richard Jenkins...",,,The Film Arcade,175705,183100,Comedy,...,,Teddy Schwarzman,,,fall,Comedy,PG-13,2013,1.017428,0
2,A.I. Artificial Intelligence,2001-06-29,NaT,"[Haley Joel Osment, Frances O'Connor, Jude Law...",100000000.0,[Steven Spielberg],Warner Bros.,78616689,120841900,Sci-Fi,...,,Kathleen Kennedy,Steven Spielberg,,summer,Fantasy_Scifi,PG-13,2001,1.338323,133832298
3,Aaja Nachle,2007-11-30,2007-12-20,,,,Yash Raj,484108,612200,Foreign,...,,,,,holiday,Foreign,,2007,1.143121,0
4,Aarakshan,2011-08-12,2011-09-22,,,,Reliance Big Pictures,651096,713400,Foreign,...,,,,,summer,Foreign,,2011,1.053695,0


In [322]:
type(movies.loc[0, 'infl_rate'])

float

In [321]:
movies.head(15)

Unnamed: 0,title,release_date,closing_date,actors,budget,director,distributor,dom_total_gross,domestic_total_adj_gross,genre,...,writers2,producers1,producers2,producers3,season,new_genre,clean_rating,release_year,infl_rate,budget_adj
0,The A-Team,2010-06-11,2010-09-16,"[Liam Neeson, Bradley Cooper, Sharlto Copley, ...",110000000.0,[Joe Carnahan],Fox,77222099.0,84507200.0,Action,...,,Ridley Scott,Tony Scott,Alex Young,summer,Act_Adven,PG-13,2010,0.0869547,165000000.0
1,A.C.O.D.,2013-10-04,2013-11-07,"[Adam Scott, Catherine O'Hara, Richard Jenkins...",,,The Film Arcade,175705.0,183100.0,Comedy,...,,Teddy Schwarzman,,,fall,Comedy,PG-13,2013,0.0174281,
2,A.I. Artificial Intelligence,2001-06-29,NaT,"[Haley Joel Osment, Frances O'Connor, Jude Law...",100000000.0,[Steven Spielberg],Warner Bros.,78616689.0,120841900.0,Sci-Fi,...,,Kathleen Kennedy,Steven Spielberg,,summer,Fantasy_Scifi,PG-13,2001,0.338323,150000000.0
3,Aaja Nachle,2007-11-30,2007-12-20,,,,Yash Raj,484108.0,612200.0,Foreign,...,,,,,holiday,Foreign,,2007,0.143121,
4,Aarakshan,2011-08-12,2011-09-22,,,,Reliance Big Pictures,651096.0,713400.0,Foreign,...,,,,,summer,Foreign,,2011,0.0536946,
5,Aashayein,2010-08-27,NaT,,,,Reliance Big Pictures,,,Foreign,...,,,,,summer,Foreign,,2010,0.0869547,
6,Aatsinki: The Story of Arctic Cowboys,2014-01-24,2014-03-06,,,,Argot Pictures,6601.0,7200.0,Documentary,...,,,,,winter,Documentary,,2014,0.00118698,
7,Abandon,2002-10-18,2002-12-19,"[Katie Holmes, Benjamin Bratt, Zooey Deschanel]",25000000.0,,Paramount,10727683.0,16063800.0,Thriller,...,,Gary Barber,Roger Birnbaum,Lynda Obst,fall,Crime_Thriller,PG-13,2002,0.31764,37500000.0
8,The Abandoned,2007-02-23,2007-03-08,,,,After Dark Films,1331137.0,1683300.0,Horror,...,,,,,winter,Horror,R,2007,0.143121,
9,ABC Africa,2002-05-03,2002-08-29,,,,New Yorker,9075.0,13600.0,Documentary,...,,,,,summer,Documentary,,2002,0.31764,


In [235]:
print get_infl_rate(2012, 2015)

0.0323309842592


In [223]:
total = 10.0 * (1.0 + get_infl_rate(2012))
total

10.323309842591705

#### begin workspace for testing genre categorization

In [18]:
# original genre categories
raw_genres = tuple(by_genre.count().index)
print raw_genres

('Action', 'Action / Adventure', 'Action / Crime', 'Action Comedy', 'Action Drama', 'Action Fantasy', 'Action Horror', 'Action Thriller', 'Adventure', 'Adventure Comedy', 'Animation', 'Comedy', 'Comedy / Drama', 'Comedy Thriller', 'Concert', 'Crime', 'Crime Comedy', 'Crime Drama', 'Crime Thriller', 'Documentary', 'Drama', 'Drama / Thriller', 'Family', 'Family Adventure', 'Family Comedy', 'Fantasy', 'Fantasy Comedy', 'Fantasy Drama', 'Foreign', 'Foreign / Action', 'Foreign / Horror', 'Historical Drama', 'Historical Epic', 'Horror', 'Horror Comedy', 'Horror Thriller', 'IMAX', 'Music Drama', 'Musical', 'None', 'Period Action', 'Period Adventure', 'Period Comedy', 'Period Drama', 'Period Horror', 'Romance', 'Romantic Adventure', 'Romantic Comedy', 'Romantic Thriller', 'Sci-Fi', 'Sci-Fi Action', 'Sci-Fi Adventure', 'Sci-Fi Comedy', 'Sci-Fi Fantasy', 'Sci-Fi Horror', 'Sci-Fi Thriller', 'Sports', 'Sports Action', 'Sports Comedy', 'Sports Drama', 'Thriller', 'Unknown', 'War', 'War Drama', 'War

In [33]:
print len(raw_genres)

67


In [78]:
# code simple genres 
simple_genres = ['Act_Adven', 'Act_Adven', 'Act_Adven', 'Act_Adven', 'Act_Adven', 'Act_Adven', 
                 'Act_Adven', 'Act_Adven', 'Act_Adven', 'Act_Adven', 
                 'Animation', 'Comedy', 'Comedy', 'Comedy', 'Other', 
                 'Crime_Thriller', 'Crime_Thriller', 'Crime_Thriller', 'Crime_Thriller', 
                 'Documentary', 'Drama', 'Drama', 'Family', 'Family', 'Family', 'Fantasy_Scifi', 'Fantasy_Scifi', 
                 'Fantasy_Scifi', 'Foreign', 'Foreign', 'Foreign', 'Historical', 'Historical', 
                 'Horror', 'Horror', 'Horror', 'Other', 'Other', 'Other', '', 
                 'Historical', 'Historical', 'Historical', 'Historical', 'Historical', 
                 'Romance', 'Romance', 'Romance', 'Romance', 'Fantasy_Scifi', 'Fantasy_Scifi', 
                 'Fantasy_Scifi', 'Fantasy_Scifi', 'Fantasy_Scifi', 'Fantasy_Scifi', 
                 'Fantasy_Scifi', 'Other', 'Other', 'Other', 'Other', 'Crime_Thriller', 
                 '', 'Other', 'Other', 'Other', 'Other', 'Other', ]

In [79]:
# count how many genres
print set(simple_genres)
print len(set(simple_genres))

set(['', 'Romance', 'Family', 'Horror', 'Other', 'Act_Adven', 'Foreign', 'Drama', 'Animation', 'Historical', 'Crime_Thriller', 'Fantasy_Scifi', 'Comedy', 'Documentary'])
14


In [80]:
genres = dict(zip(raw_genres, simple_genres))

In [81]:
pprint(genres)

{'Action': 'Act_Adven',
 'Action / Adventure': 'Act_Adven',
 'Action / Crime': 'Act_Adven',
 'Action Comedy': 'Act_Adven',
 'Action Drama': 'Act_Adven',
 'Action Fantasy': 'Act_Adven',
 'Action Horror': 'Act_Adven',
 'Action Thriller': 'Act_Adven',
 'Adventure': 'Act_Adven',
 'Adventure Comedy': 'Act_Adven',
 'Animation': 'Animation',
 'Comedy': 'Comedy',
 'Comedy / Drama': 'Comedy',
 'Comedy Thriller': 'Comedy',
 'Concert': 'Other',
 'Crime': 'Crime_Thriller',
 'Crime Comedy': 'Crime_Thriller',
 'Crime Drama': 'Crime_Thriller',
 'Crime Thriller': 'Crime_Thriller',
 'Documentary': 'Documentary',
 'Drama': 'Drama',
 'Drama / Thriller': 'Drama',
 'Family': 'Family',
 'Family Adventure': 'Family',
 'Family Comedy': 'Family',
 'Fantasy': 'Fantasy_Scifi',
 'Fantasy Comedy': 'Fantasy_Scifi',
 'Fantasy Drama': 'Fantasy_Scifi',
 'Foreign': 'Foreign',
 'Foreign / Action': 'Foreign',
 'Foreign / Horror': 'Foreign',
 'Historical Drama': 'Historical',
 'Historical Epic': 'Historical',
 'Horror': '

#### </ end genre testing workspace >
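
The positional `zip(raw_genres, simple_genres)` approach works only while both sequences stay in the same order, and rebuilding the dictionary for every row is slow. A sketch of an alternative is an explicit dictionary applied once with `Series.map`. The names `GENRE_MAP` and `simplify_genre` are hypothetical, and the mapping below is deliberately partial: it contains only pairs copied from the `pprint(genres)` output above.

```python
import pandas as pd

# Partial mapping; every pair is taken from the pprint(genres) output above.
GENRE_MAP = {
    'Action': 'Act_Adven',
    'Adventure': 'Act_Adven',
    'Comedy': 'Comedy',
    'Comedy / Drama': 'Comedy',
    'Concert': 'Other',
    'Crime Drama': 'Crime_Thriller',
    'Documentary': 'Documentary',
    'Drama': 'Drama',
    'Foreign': 'Foreign',
    'Historical Drama': 'Historical',
}


def simplify_genre(genre_series, mapping=GENRE_MAP, default=''):
    """Map raw genres to simple genres; unmapped or missing genres get `default`."""
    return genre_series.map(mapping).fillna(default)


raw = pd.Series(['Action', 'Foreign', 'Comedy / Drama', 'Not In Map', None])
simple = simplify_genre(raw)
```

With the full 67-entry mapping filled in, `movies['new_genre'] = simplify_genre(movies['genre'])` would replace the per-row `apply` call.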

Since we're interested in budget data, let's check to see how many movies of the 16,100 total movies have budget data.

In [150]:
all_movies.budget.count()

2874

Only 2,874 out of 16,100 movies have budget data. That's only 17.9% of the entire dataset! Let's take a closer look at movies with budget data.

In [61]:
# drop movies without budget data; .copy() gives an independent frame so the
# column assignments below do not trigger SettingWithCopyWarning
only_budget = all_movies[pd.notnull(all_movies['budget'])].copy()

# add some roi calculations (gross per budget dollar)
only_budget['dom_roi'] = only_budget['dom_total_gross'] / only_budget['budget']
only_budget['intl_roi'] = only_budget['intl_total_gross'] / only_budget['budget']


In [65]:
only_budget.sort_values(by='dom_roi', ascending=False).head()

Unnamed: 0,1-title,2-release_date,3-closing_date,actors,budget,director,distributor,dom_total_gross,genre,intl_total_gross,oscar_noms,oscar_wins,producers,rating,runtime_(mins),theaters,url,writers,dom_roi,intl_roi
9634,My Big Fat Greek Wedding,2002-04-19,2003-04-13,"[John Corbett, Nia Vardalos]",5000000,,IFC,241438208,Romantic Comedy,368744044.0,1,0,"[Paul Brooks, Gary Goetzman, Tom Hanks]",PG,95,2016.0,mybigfatgreekwedding.htm,[Nia Vardalos],48.287642,73.748809
12187,Saw,2004-10-29,2004-12-30,"[Cary Elwes, Danny Glover]",1200000,[James Wan],Lions Gate,55185045,Horror,103911669.0,0,0,,R,100,2467.0,saw.htm,"[James Wan, Leigh Whannell]",45.987538,86.593058
12871,Snow White and the Seven Dwarfs,1937-12-21,NaT,,1499000,,Disney,66596803,Animation,,0,0,,G,83,,snowwhite.htm,,44.427487,
7166,Jaws,1975-06-20,NaT,,7000000,[Steven Spielberg],Universal,260000000,Horror Thriller,470653000.0,0,0,,PG,125,675.0,jaws.htm,,37.142857,67.236143
10316,One Flew Over the Cuckoo's Nest,1975-11-20,NaT,"[Jack Nicholson, Danny DeVito]",3000000,,United Artists,108981275,Drama,,0,0,[Michael Douglas],R,133,,oneflewoverthecuckoosnest.htm,,36.327092,


Let's look only at low-budget films, which we're defining as films with budgets under $10m, and evaluate the quality of the data.

In [151]:
# get only low budget (<$10m) films
low_budget = only_budget[only_budget['budget'] < 10000000]
low_budget.shape

(469, 20)

In [76]:
# pickle for later/challenge use
with open('pickled_data/low-budget-movies.pkl', 'wb') as picklefile:  # pickles are binary, so write in 'wb' mode
    pickle.dump(low_budget, picklefile)

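Looks like there are 469 total movies in our low-budget films dataset. Some of these go back to 1925. The project scope is 1980 onward, but note that the filter below uses a stricter cutoff and keeps only movies released after January 1, 1990.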

In [99]:
newer_movies = low_budget[low_budget['2-release_date'] > datetime.strptime('1990', '%Y')]
newer_movies.count()

1-title             376
2-release_date      376
3-closing_date      283
actors              274
budget              376
director            205
distributor         376
dom_total_gross     372
genre               376
intl_total_gross    254
oscar_noms          376
oscar_wins          376
producers           159
rating              376
runtime_(mins)      372
theaters            370
url                 376
writers             145
dom_roi             372
intl_roi            254
dtype: int64

In [70]:
# top 10 movies by oscar nominations
low_budget.sort_values(by='oscar_noms', ascending=False).head(10)

Unnamed: 0,1-title,2-release_date,3-closing_date,actors,budget,director,distributor,dom_total_gross,genre,intl_total_gross,oscar_noms,oscar_wins,producers,rating,runtime_(mins),theaters,url,writers,dom_roi,intl_roi
4482,Fargo,1996-03-08,NaT,"[Steve Buscemi, William H. Macy, Frances McDor...",7000000,[Joel Coen],Gramercy,24611975,Crime Comedy,60611975.0,8,2,"[Tim Bevan, Ethan Coen, Eric Fellner]",R,95,716.0,fargo.htm,"[Ethan Coen, Joel Coen]",3.515996,8.658854
11936,A Room with a View,1986-03-07,NaT,"[Helena Bonham Carter, Daniel Day-Lewis, Judi ...",3000000,[James Ivory],Cinecom,20966644,Romance,,8,3,,Unrated,117,150.0,roomwithaview.htm,[Ruth Prawer Jhabvala],6.988881,
10941,Platoon,1986-12-19,NaT,"[Tom Berenger, Willem Dafoe, Charlie Sheen, Fo...",6000000,[Oliver Stone],Orion Pictures,138530565,War,,8,4,[Arnold Kopelson],R,120,1564.0,platoon.htm,,23.088428,
11268,Pulp Fiction,1994-10-14,NaT,"[John Travolta, Uma Thurman, Samuel L. Jackson...",8000000,[Quentin Tarantino],Miramax,107928762,Crime,213928762.0,7,1,"[Lawrence Bender, Michael Shamberg, Stacey Sher]",R,154,1494.0,pulpfiction.htm,"[Roger Avary, Quentin Tarantino]",13.491095,26.741095
2976,Crash,2005-05-06,2006-04-06,"[Sandra Bullock, Don Cheadle, Keith David, Lor...",6500000,[Paul Haggis],Lions Gate,54580300,Drama,98410061.0,6,3,"[Paul Haggis, Cathy Schulman, Bob Yari]",R,100,1905.0,crash05.htm,[Paul Haggis],8.396969,15.140009
5513,"Good Night, and Good Luck.",2005-10-07,2006-03-16,"[David Strathairn, Patricia Clarkson, George C...",7000000,[George Clooney],Warner Independent,31558003,Period Drama,54641191.0,6,0,"[Jennifer Fox, Samuel Hadida, Grant Heslov, Je...",PG,90,929.0,goodnightandgoodluck.htm,"[George Clooney, Grant Heslov]",4.508286,7.805884
1793,Boyhood,2014-07-11,2015-03-22,"[Ethan Hawke, Patricia Arquette]",4000000,[Richard Linklater],IFC,25352281,Drama,44495281.0,6,1,"[Richard Linklater, John Sloss]",R,166,775.0,boyhood.htm,[Richard Linklater],6.33807,11.12382
9184,Midnight Express,1978-10-27,NaT,"[John Hurt, Randy Quaid]",2300000,,Columbia,35000000,Drama / Thriller,,6,2,,R,122,,midnightexpress.htm,,15.217391,
3145,Dallas Buyers Club,2013-11-01,2014-05-01,"[Matthew McConaughey, Jared Leto, Jennifer Gar...",5000000,[Jean-Marc Vallee],Focus Features,27298285,Drama,55198285.0,6,3,,R,117,1110.0,dallasbuyersclub.htm,,5.459657,11.039657
472,Amour,2012-12-19,2013-05-16,,8900000,[Michael Haneke],Sony Classics,6739492,Foreign,19839492.0,5,1,,PG-13,125,333.0,amour.htm,[Michael Haneke],0.757246,2.229156


In [71]:
# top 10 movies by dom total gross 
low_budget.sort_values(by='dom_total_gross', ascending=False).head(10)

Unnamed: 0,1-title,2-release_date,3-closing_date,actors,budget,director,distributor,dom_total_gross,genre,intl_total_gross,oscar_noms,oscar_wins,producers,rating,runtime_(mins),theaters,url,writers,dom_roi,intl_roi
7166,Jaws,1975-06-20,NaT,,7000000,[Steven Spielberg],Universal,260000000,Horror Thriller,470653000.0,0,0,,PG,125,675.0,jaws.htm,,37.142857,67.236143
9634,My Big Fat Greek Wedding,2002-04-19,2003-04-13,"[John Corbett, Nia Vardalos]",5000000,,IFC,241438208,Romantic Comedy,368744044.0,1,0,"[Paul Brooks, Gary Goetzman, Tom Hanks]",PG,95,2016.0,mybigfatgreekwedding.htm,[Nia Vardalos],48.287642,73.748809
5610,Grease,1978-06-16,NaT,"[John Travolta, Olivia Newton-John]",6000000,,Paramount,159978870,Musical,394955690.0,1,0,,PG,110,862.0,grease.htm,,26.663145,65.825948
13006,The Sound of Music,1965-03-02,NaT,"[Julie Andrews, Christopher Plummer, Eleanor P...",8199999,[Robert Wise],Fox,158671368,Musical,,0,0,"[Robert Wise, Saul Chaplin]",G,174,,soundofmusic.htm,"[Ernest Lehman, Howard Lindsay, Russel Crouse]",19.350169,
7359,Juno,2007-12-05,2008-06-19,"[Ellen Page, Michael Cera, Jennifer Garner, Ja...",7500000,[Jason Reitman],Fox Searchlight,143495265,Comedy / Drama,231411584.0,4,1,[Mason Novick],PG-13,91,2534.0,juno.htm,[Diablo Cody],19.132702,30.854878
10941,Platoon,1986-12-19,NaT,"[Tom Berenger, Willem Dafoe, Charlie Sheen, Fo...",6000000,[Oliver Stone],Orion Pictures,138530565,War,,8,4,[Arnold Kopelson],R,120,1564.0,platoon.htm,,23.088428,
5415,The Godfather,1972-03-15,NaT,"[Marlon Brando, Al Pacino, James Caan, Robert ...",6000000,,Paramount,133698921,Crime Drama,245066411.0,0,0,,R,175,,godfather.htm,,22.283154,40.844402
4384,Fahrenheit 9/11,2004-06-23,2004-10-28,,6000000,[Michael Moore],Lions Gate,119194771,Documentary,222446882.0,0,0,,R,116,2011.0,fahrenheit911.htm,,19.865795,37.07448
8705,Magic Mike,2012-06-29,2012-09-27,"[Riley Keough, Channing Tatum, Alex Pettyfer, ...",7000000,[Steven Soderbergh],Warner Bros.,113721571,Comedy / Drama,167221571.0,0,0,"[Reid Carolin, Gregory Jacobs, Channing Tatum,...",R,110,3120.0,magicmike.htm,[Reid Carolin],16.245939,23.888796
10316,One Flew Over the Cuckoo's Nest,1975-11-20,NaT,"[Jack Nicholson, Danny DeVito]",3000000,,United Artists,108981275,Drama,,0,0,[Michael Douglas],R,133,,oneflewoverthecuckoosnest.htm,,36.327092,


In [72]:
# top 10 movies by intl total gross 
low_budget.sort_values(by='intl_total_gross', ascending=False).head(10)

Unnamed: 0,1-title,2-release_date,3-closing_date,actors,budget,director,distributor,dom_total_gross,genre,intl_total_gross,oscar_noms,oscar_wins,producers,rating,runtime_(mins),theaters,url,writers,dom_roi,intl_roi
7166,Jaws,1975-06-20,NaT,,7000000,[Steven Spielberg],Universal,260000000,Horror Thriller,470653000,0,0,,PG,125,675.0,jaws.htm,,37.142857,67.236143
5610,Grease,1978-06-16,NaT,"[John Travolta, Olivia Newton-John]",6000000,,Paramount,159978870,Musical,394955690,1,0,,PG,110,862.0,grease.htm,,26.663145,65.825948
9634,My Big Fat Greek Wedding,2002-04-19,2003-04-13,"[John Corbett, Nia Vardalos]",5000000,,IFC,241438208,Romantic Comedy,368744044,1,0,"[Paul Brooks, Gary Goetzman, Tom Hanks]",PG,95,2016.0,mybigfatgreekwedding.htm,[Nia Vardalos],48.287642,73.748809
5075,The Full Monty,1997-08-15,1998-05-14,[Tom Wilkinson],3500000,,Fox Searchlight,45950122,Comedy,257938649,4,1,,R,91,783.0,fullmonty.htm,[Simon Beaufoy],13.128606,73.696757
553,Annabelle,2014-10-03,2014-12-18,[Alfre Woodard],6500000,[John R. Leonetti],Warner Bros. (New Line),84273813,Horror,256873813,0,0,"[Peter Safran, James Wan]",R,95,3215.0,newlinehorror.htm,[Gary Dauberman],12.965202,39.519048
4919,Four Weddings and a Funeral,1994-03-11,NaT,[Hugh Grant],4400000,[Mike Newell],Gramercy,52700832,Romantic Comedy,245700832,2,0,"[Tim Bevan, Eric Fellner]",R,118,1069.0,fourweddingsandafuneral.htm,,11.977462,55.841098
5415,The Godfather,1972-03-15,NaT,"[Marlon Brando, Al Pacino, James Caan, Robert ...",6000000,,Paramount,133698921,Crime Drama,245066411,0,0,,R,175,,godfather.htm,,22.283154,40.844402
7359,Juno,2007-12-05,2008-06-19,"[Ellen Page, Michael Cera, Jennifer Garner, Ja...",7500000,[Jason Reitman],Fox Searchlight,143495265,Comedy / Drama,231411584,4,1,[Mason Novick],PG-13,91,2534.0,juno.htm,[Diablo Cody],19.132702,30.854878
4384,Fahrenheit 9/11,2004-06-23,2004-10-28,,6000000,[Michael Moore],Lions Gate,119194771,Documentary,222446882,0,0,,R,116,2011.0,fahrenheit911.htm,,19.865795,37.07448
3598,Dirty Dancing,1987-08-21,NaT,"[Patrick Swayze, Jennifer Grey]",6000000,,Vestron,63446382,Romance,213954274,1,1,,PG-13,105,1012.0,dirtydancing.htm,,10.574397,35.659046


In [82]:
# genre? 

by_genre = newer_movies.groupby('genre')

In [97]:
ls = []
for i in list(newer_movies['genre']):
    # debugging: print the first genre and stop; the lines after break never run
    print i
    break
    if i[:6] == 'Action':
       ls.append('Action')
    else:
        ls.append(i)

Drama


In [None]:
newer_movies['new_genre'] = ls

In [91]:
zip(list(by_genre.count().index),ls)

[('Action', 'Action'),
 ('Action / Adventure', 'Action'),
 ('Action / Crime', 'Action'),
 ('Action Comedy', 'Action'),
 ('Action Drama', 'Action'),
 ('Action Thriller', 'Action'),
 ('Adventure', 'Adventure'),
 ('Animation', 'Animation'),
 ('Comedy', 'Comedy'),
 ('Comedy / Drama', 'Comedy / Drama'),
 ('Comedy Thriller', 'Comedy Thriller'),
 ('Concert', 'Concert'),
 ('Crime', 'Crime'),
 ('Crime Comedy', 'Crime Comedy'),
 ('Crime Drama', 'Crime Drama'),
 ('Crime Thriller', 'Crime Thriller'),
 ('Documentary', 'Documentary'),
 ('Drama', 'Drama'),
 ('Drama / Thriller', 'Drama / Thriller'),
 ('Family', 'Family'),
 ('Fantasy', 'Fantasy'),
 ('Fantasy Comedy', 'Fantasy Comedy'),
 ('Fantasy Drama', 'Fantasy Drama'),
 ('Foreign', 'Foreign'),
 ('Historical Drama', 'Historical Drama'),
 ('Horror', 'Horror'),
 ('Horror Comedy', 'Horror Comedy'),
 ('Horror Thriller', 'Horror Thriller'),
 ('IMAX', 'IMAX'),
 ('Music Drama', 'Music Drama'),
 ('Musical', 'Musical'),
 ('Period Adventure', 'Period Adventure

In [93]:
pd.get_dummies(newer_movies, columns = ['genre']).head()

Unnamed: 0,1-title,2-release_date,3-closing_date,actors,budget,director,distributor,dom_total_gross,intl_total_gross,oscar_noms,...,genre_Romantic Thriller,genre_Sci-Fi Action,genre_Sci-Fi Horror,genre_Sci-Fi Thriller,genre_Sports Comedy,genre_Sports Drama,genre_Thriller,genre_Unknown,genre_War,genre_Western
123,Affliction,1998-12-30,NaT,[Willem Dafoe],6000000,[Paul Schrader],Lions Gate,6330054,,2,...,0,0,0,0,0,0,0,0,0,0
200,Airplane!,1980-06-27,NaT,[Leslie Nielsen],3500000,"[Jim Abrahams, David Zucker, Jerry Zucker]",Paramount,83453539,,0,...,0,0,0,0,0,0,0,0,0,0
207,Akeelah and the Bee,2006-04-28,2006-07-20,"[Laurence Fishburne, Angela Bassett, Keke Palmer]",8000000,,Lionsgate,18848430,18948425.0,0,...,0,0,0,0,0,0,0,0,0,0
223,Albino Alligator,1997-01-17,1997-02-13,"[Matt Dillon, Viggo Mortensen]",6000000,,Miramax,339379,,0,...,0,0,0,0,0,0,0,1,0,0
289,All or Nothing,2002-10-25,NaT,"[Taraji P. Henson, Sally Hawkins]",9000000,[Mike Leigh],United Artists,201546,201546.0,0,...,0,0,0,0,0,0,0,1,0,0
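
One-hot encoding the raw `genre` column creates one column per raw genre, which is a lot of sparse features for a few hundred low-budget films. A possible alternative, sketched below, is to encode the collapsed genre instead. The toy frame and its genre values are made up for illustration and are not rows from the dataset.

```python
import pandas as pd

# Hypothetical rows; titles and genres are placeholders, not real data.
toy = pd.DataFrame({
    'title': ['film_a', 'film_b', 'film_c'],
    'new_genre': ['Drama', 'Comedy', 'Drama'],
})

# One indicator column per simplified genre, prefixed with 'genre_'.
dummies = pd.get_dummies(toy, columns=['new_genre'], prefix='genre')
```

The result keeps `title` and adds `genre_Comedy` and `genre_Drama`. With the 14 simplified categories this yields at most 14 indicator columns.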
