# Optimization Function
This notebook investigates a simple model for maximizing the net profit of a movie. The goal is to recommend a movie genre and associated personnel that will maximize that net profit. To keep the model simple, it makes the following assumptions; relaxing them is outside the scope of this investigation:
* The only budget for a film is the cost of the writer, actor, and director.
* There is only one of each of the persons mentioned above.
* Each of the persons mentioned above uses 1/3 of the budget.

The objective function will be to maximize the net profit for a genre, modeled as:
* net profit = genre_gross - writer_budget - actor_budget - director_budget
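
As a quick illustration of the formula above, here is a minimal sketch that applies it to made-up numbers. The function name `net_profit` and all dollar figures are hypothetical and are not taken from the dataset.

```python
def net_profit(genre_gross, writer_budget, actor_budget, director_budget):
    """Net profit under the simplified model: gross minus the three salaries."""
    return genre_gross - writer_budget - actor_budget - director_budget


# Hypothetical figures, not taken from the dataset.
production_budget = 30_000_000
salary = production_budget / 3  # each person is assumed to take 1/3 of the budget
example_profit = net_profit(50_000_000, salary, salary, salary)
print(example_profit)  # 20000000.0
```

Under the equal-split assumption, the three salaries always add up to the full production budget, so the net profit for a single film is simply gross minus budget.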

In [1]:
# import libraries
from helper_functions import get_clean_df
import pandas as pd
from scipy.optimize import minimize
import sqlite3

## Load data

In [2]:
# get clean df from helper_functions.py
df = get_clean_df()
df.head()

Unnamed: 0,movie_id,averagerating,numvotes,primary_title,original_title,year,runtime_minutes,genres,ordering,title,...,id,release_date,production_budget,domestic_gross_movie_budgets,worldwide_gross_movie_budgets,studio,domestic_gross_movie_gross,foreign_gross_movie_gross,clean_domestic_gross,clean_worldwide_gross
0,tt1043726,4.2,50352,The Legend of Hercules,The Legend of Hercules,2014,99.0,"Action,Adventure,Fantasy",20,thelegendofhercules,...,42.0,"Jan 10, 2014",70000000.0,18848538.0,58953319.0,,,,18848538.0,58953319.0
1,tt1171222,5.1,8296,Baggage Claim,Baggage Claim,2013,96.0,Comedy,5,baggageclaim,...,38.0,"Sep 27, 2013",8500000.0,21569509.0,22885836.0,,,,21569509.0,22885836.0
2,tt1210166,7.6,326657,Moneyball,Moneyball,2011,133.0,"Biography,Drama,Sport",14,moneyball,...,15.0,"Sep 23, 2011",50000000.0,75605492.0,111300835.0,,,,75605492.0,111300835.0
3,tt1212419,6.5,87288,Hereafter,Hereafter,2010,129.0,"Drama,Fantasy,Romance",4,hereafter,...,61.0,"Oct 15, 2010",50000000.0,32746941.0,108660270.0,,,,32746941.0,108660270.0
4,tt1232829,7.2,477771,21 Jump Street,21 Jump Street,2012,109.0,"Action,Comedy,Crime",26,21jumpstreet,...,44.0,"Mar 16, 2012",42000000.0,138447667.0,202812429.0,,,,138447667.0,202812429.0


## Load additional data on movie personnel from the SQL database

In [3]:
# connect to the SQL db and get the staff personnel info
conn = sqlite3.connect("../Data/im.db")
sql_query = """
SELECT per.primary_name, per.death_year, prin.category, prin.movie_id FROM persons per
JOIN principals prin USING (person_id)
"""
staff = pd.read_sql(sql_query,conn)

# close the db connection
conn.close()

staff.head()

Unnamed: 0,primary_name,death_year,category,movie_id
0,Mary Ellen Bauder,,producer,tt2398241
1,Joseph Bauer,,composer,tt0433397
2,Joseph Bauer,,composer,tt1681372
3,Joseph Bauer,,composer,tt2281215
4,Joseph Bauer,,composer,tt2387710


### Clean the staff data to merge in with the movie revenue figures

In [4]:
# see types of staff
staff["category"].unique()

array(['producer', 'composer', 'actor', 'cinematographer',
       'production_designer', 'director', 'actress', 'writer', 'editor',
       'self', 'archive_footage', 'archive_sound'], dtype=object)

In [5]:
# view the death years
# we only want to include people who are still alive
staff["death_year"].unique()

array([  nan, 2013., 2004., 2017., 1965., 2003., 2018., 2012., 1937.,
       1976., 2019., 1971., 1994., 2008., 2009., 1986., 2011., 2010.,
       1890., 2015., 1918., 2016., 1916., 1995., 1985., 2014., 1779.,
       1997., 1961., 1840., 2002., 1900., 1688., 1707., 1969., 1948.,
       1990., 1870., 1960., 1991., 2005., 1981., 1968., 1989., 2006.,
       1951., 1967., 1938., 1926., 1944., 1975., 1898., 1970., 1883.,
       1972., 1974., 1998., 1993., 1959., 1979., 1999., 1987., 2000.,
       1925., 1992., 1978., 1878., 1902., 1942., 1954., 1935., 2001.,
       1940., 1996., 1815., 2007., 1982., 1893., 1933., 1983., 1946.,
       1988., 1803., 1939., 1980., 1984., 1872., 1880., 1924., 1934.,
       1932., 1966., 1949., 1947., 1774., 1873., 1593., 1956., 1952.,
       1855., 1912., 1963., 1919., 1915., 1950., 1957., 1962., 1955.,
       1894., 1892., 1929., 1977., 1843., 1876., 1812., 1901., 1817.,
       1828., 1964., 1031., 1851., 1864., 1973., 1943., 1831., 1904.,
       1838., 1931.,

We will assume that having a missing value in the death year column indicates that they are still alive.
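
To make this assumption concrete, the sketch below applies the same kind of filter to a toy table. The names and rows are hypothetical; only the column names and category labels come from the staff data shown above.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical names) mirroring the columns of the staff table.
toy_staff = pd.DataFrame(
    {
        "primary_name": ["Person A", "Person B", "Person C", "Person D", "Person E"],
        "death_year": [np.nan, 2013.0, np.nan, np.nan, np.nan],
        "category": ["actor", "director", "producer", "writer", "actress"],
    }
)

wanted_roles = ["writer", "actor", "director"]
is_wanted_role = toy_staff["category"].isin(wanted_roles)
is_assumed_alive = toy_staff["death_year"].isna()  # NaN is read as "still alive"

kept = toy_staff[is_wanted_role & is_assumed_alive]
print(kept["primary_name"].tolist())  # ['Person A', 'Person D']
```

Note that a role list of `writer`, `actor`, and `director` also drops rows labeled `actress`, since that is a separate category in the data.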

In [6]:
# keep only writers, actors, and directors who do not have a death year
staff = staff[(staff["category"].isin(["writer", "actor", "director"])) & (staff["death_year"].isna())]

In [7]:
staff["category"].unique()

array(['actor', 'director', 'writer'], dtype=object)

In [8]:
staff["death_year"].unique()

array([nan])

In [9]:
# for this simple model we will only keep 1 actor, 1 director, and 1 writer per movie
staff = staff.drop_duplicates(subset=["movie_id", "category"])
staff.sort_values("movie_id").head(25)

Unnamed: 0,primary_name,death_year,category,movie_id
254622,Dilip Kumar,,actor,tt0063540
47248,Gulzar,,writer,tt0063540
65265,Arun Khopkar,,actor,tt0066787
197027,Peter Bogdanovich,,actor,tt0069049
47249,Gulzar,,writer,tt0069204
353801,Asrani,,actor,tt0069204
31499,Francisco Reyes,,actor,tt0100275
109688,Valeria Sarmiento,,director,tt0100275
447129,Pía Rey,,writer,tt0100275
259167,Frank Howson,,director,tt0111414


In [10]:
# merge staff into df with gross revenues
df = staff.merge(df, how="left", on="movie_id")
df

Unnamed: 0,primary_name,death_year,category,movie_id,averagerating,numvotes,primary_title,original_title,year,runtime_minutes,...,id,release_date,production_budget,domestic_gross_movie_budgets,worldwide_gross_movie_budgets,studio,domestic_gross_movie_gross,foreign_gross_movie_gross,clean_domestic_gross,clean_worldwide_gross
0,Bruce Baum,,actor,tt6463956,,,,,,,...,,,,,,,,,,
1,Ruel S. Bayani,,director,tt1592569,,,,,,,...,,,,,,,,,,
2,Ruel S. Bayani,,director,tt2057445,,,,,,,...,,,,,,,,,,
3,Ruel S. Bayani,,director,tt2590280,,,,,,,...,,,,,,,,,,
4,Ruel S. Bayani,,director,tt8421806,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
270916,Zheng Wei,,director,tt8697720,,,,,,,...,,,,,,,,,,
270917,Elina Gakou Gomba,,writer,tt8081326,,,,,,,...,,,,,,,,,,
270918,Rama Narayanan,,director,tt8715016,,,,,,,...,,,,,,,,,,
270919,Rama Narayanan,,director,tt8919136,,,,,,,...,,,,,,,,,,


In [11]:
df.columns

Index(['primary_name', 'death_year', 'category', 'movie_id', 'averagerating',
       'numvotes', 'primary_title', 'original_title', 'year',
       'runtime_minutes', 'genres', 'ordering', 'title', 'region', 'language',
       'types', 'attributes', 'is_original_title', 'id', 'release_date',
       'production_budget', 'domestic_gross_movie_budgets',
       'worldwide_gross_movie_budgets', 'studio', 'domestic_gross_movie_gross',
       'foreign_gross_movie_gross', 'clean_domestic_gross',
       'clean_worldwide_gross'],
      dtype='object')

In [12]:
# remove missing values
df.dropna(subset=["clean_domestic_gross", "production_budget"], inplace=True)
df.head()

Unnamed: 0,primary_name,death_year,category,movie_id,averagerating,numvotes,primary_title,original_title,year,runtime_minutes,...,id,release_date,production_budget,domestic_gross_movie_budgets,worldwide_gross_movie_budgets,studio,domestic_gross_movie_gross,foreign_gross_movie_gross,clean_domestic_gross,clean_worldwide_gross
74,Matt Bomer,,actor,tt3799694,7.4,240337.0,The Nice Guys,The Nice Guys,2016.0,116.0,...,56.0,"May 20, 2016",50000000.0,36261763.0,59596747.0,,,,36261763.0,59596747.0
117,David Bowers,,director,tt1650043,6.6,23135.0,Diary of a Wimpy Kid: Rodrick Rules,Diary of a Wimpy Kid: Rodrick Rules,2011.0,99.0,...,80.0,"Mar 25, 2011",18000000.0,52698535.0,73695194.0,,,,52698535.0,73695194.0
118,David Bowers,,director,tt2023453,6.3,19571.0,Diary of a Wimpy Kid: Dog Days,Diary of a Wimpy Kid: Dog Days,2012.0,94.0,...,17.0,"Aug 3, 2012",22000000.0,49008662.0,77229695.0,,,,49008662.0,77229695.0
119,David Bowers,,director,tt6003368,4.4,5635.0,Diary of a Wimpy Kid: The Long Haul,Diary of a Wimpy Kid: The Long Haul,2017.0,91.0,...,27.0,"May 19, 2017",22000000.0,20738724.0,35609577.0,,,,20738724.0,35609577.0
125,Dan Bradley,,director,tt1234719,5.4,69599.0,Red Dawn,Red Dawn,2012.0,93.0,...,1.0,"Nov 21, 2012",65000000.0,44806783.0,48164150.0,,,,44806783.0,48164150.0


In [13]:
# keep only the columns needed for the model and confirm there are no missing values
df = df[["movie_id", "primary_title", "genres", "primary_name", "category", "production_budget", "clean_domestic_gross"]]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2897 entries, 74 to 270098
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   movie_id              2897 non-null   object 
 1   primary_title         2897 non-null   object 
 2   genres                2897 non-null   object 
 3   primary_name          2897 non-null   object 
 4   category              2897 non-null   object 
 5   production_budget     2897 non-null   float64
 6   clean_domestic_gross  2897 non-null   float64
dtypes: float64(2), object(5)
memory usage: 181.1+ KB


In [14]:
# for this simple model we will assume that each movie has 1 writer, 1 main actor, and 1 director, each receiving 1/3 of the budget
df["salary"] = df["production_budget"]/3
df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["salary"] = df["production_budget"]/3


Unnamed: 0,movie_id,primary_title,genres,primary_name,category,production_budget,clean_domestic_gross,salary
74,tt3799694,The Nice Guys,"Action,Comedy,Crime",Matt Bomer,actor,50000000.0,36261763.0,16666670.0
117,tt1650043,Diary of a Wimpy Kid: Rodrick Rules,"Comedy,Family",David Bowers,director,18000000.0,52698535.0,6000000.0
118,tt2023453,Diary of a Wimpy Kid: Dog Days,"Comedy,Family",David Bowers,director,22000000.0,49008662.0,7333333.0
119,tt6003368,Diary of a Wimpy Kid: The Long Haul,"Comedy,Family",David Bowers,director,22000000.0,20738724.0,7333333.0
125,tt1234719,Red Dawn,"Action,Sci-Fi,Thriller",Dan Bradley,director,65000000.0,44806783.0,21666670.0


In [15]:
# we will use an average of the 1/3 budgets for each person as their "cost" in this model
# get average salary and gross revenue per person
# (select the numeric columns before .mean() so non-numeric columns do not raise in pandas >= 2.0)
actors = df[df["category"]=="actor"].groupby("primary_name")[["salary", "clean_domestic_gross"]].mean().reset_index()
directors = df[df["category"]=="director"].groupby("primary_name")[["salary", "clean_domestic_gross"]].mean().reset_index()
writers = df[df["category"]=="writer"].groupby("primary_name")[["salary", "clean_domestic_gross"]].mean().reset_index()

In [16]:
# get average gross revenue per genre
# a movie with several genres counts toward the average of each of them
df["genres"] = df["genres"].str.split(',')
genres_df = df.explode("genres").groupby("genres")[["clean_domestic_gross"]].mean().reset_index()
genres_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["genres"] = df["genres"].str.split(',')


Unnamed: 0,genres,clean_domestic_gross
0,Action,79434350.0
1,Adventure,115516200.0
2,Animation,131582200.0
3,Biography,48345470.0
4,Comedy,62658830.0
5,Crime,44332920.0
6,Documentary,10474970.0
7,Drama,39070830.0
8,Family,80786640.0
9,Fantasy,75472540.0
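
The per-genre averages above rely on splitting the comma-separated genre string and exploding it into one row per genre. A minimal sketch with two hypothetical movies shows the effect: a multi-genre movie contributes its full gross to every genre it lists.

```python
import pandas as pd

# Two hypothetical movies; the first belongs to two genres.
toy = pd.DataFrame(
    {
        "genres": ["Action,Comedy", "Comedy"],
        "clean_domestic_gross": [100.0, 50.0],
    }
)

toy["genres"] = toy["genres"].str.split(",")  # string -> list of genres
exploded = toy.explode("genres")              # one row per (movie, genre)
genre_means = exploded.groupby("genres")["clean_domestic_gross"].mean()
print(genre_means.to_dict())  # {'Action': 100.0, 'Comedy': 75.0}
```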


## Define function to optimize

In [17]:
# objective function: maximize profit
# genre_gross - writer_budget - actor_budget - director_budget
def objective(x):
    # the optimizer passes floats, so each input is rounded and cast to int before being used as a row index
    genre_gross = genres_df["clean_domestic_gross"][int(round(x[0],0))]
    writer_budget = writers["salary"][int(round(x[1],0))]
    actor_budget = actors["salary"][int(round(x[2],0))]
    director_budget = directors["salary"][int(round(x[3],0))]
    
    net = genre_gross - writer_budget - actor_budget - director_budget
    
    # return net * -1 because the optimizer will try to minimize the function
    return net * -1

In [18]:
# test an input to make sure the function outputs properly
objective([1.1,2.56,7.899,3.45])

-99816213.38733432
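
For reference, the sketch below reproduces only the float-to-index conversion used inside `objective`, so we can see which rows the test input selects. The helper name `to_index` is hypothetical.

```python
def to_index(value):
    """Round a float to the nearest integer row index, as objective() does."""
    return int(round(value, 0))


test_input = [1.1, 2.56, 7.899, 3.45]
indices = [to_index(v) for v in test_input]
print(indices)  # [1, 3, 8, 3]

# Python rounds exact halves to the nearest even integer.
print(to_index(2.5), to_index(3.5))  # 2 4
```

So the test call evaluates genre 1, writer 3, actor 8, and director 3. Because of the rounding, the objective is piecewise constant: every float within a rounding interval maps to the same row and returns the same value.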

### Get ranges of the index of variable dataframes to use as bounds for the optimizer

In [19]:
(genres_df.index.min(), genres_df.index.max())

(0, 20)

In [20]:
(writers.index.min(), writers.index.max())

(0, 715)

In [21]:
(actors.index.min(), actors.index.max())

(0, 651)

In [22]:
(directors.index.min(), directors.index.max())

(0, 745)

### Define bounds

In [23]:
# bounds: must have 1 genre, 1 actor, 1 writer, 1 director from the list
# each variable is a row index into its dataframe, so its bounds are the smallest and largest index
genre_bound = (genres_df.index.min(), genres_df.index.max())
writer_bound = (writers.index.min(), writers.index.max())
actor_bound = (actors.index.min(), actors.index.max())
director_bound = (directors.index.min(), directors.index.max())
bounds = (genre_bound, writer_bound, actor_bound, director_bound)

In [24]:
# initial guess to start the optimizer
# (not passed to dual_annealing in the call below, which then picks its own starting point)
x0 = [2, 5, 7, 9]

### Create the optimizer
Use SciPy's `dual_annealing` optimizer because it performs a global minimum search rather than a purely local one.
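
Before running it on the real data, here is a minimal sketch of the same pattern on a toy problem: a hypothetical 5 x 5 "profit" table, an objective that rounds its float inputs to indices and returns the negative profit, and a `dual_annealing` call with index bounds. The table values and `maxiter=200` are assumptions chosen only to keep the example small.

```python
import numpy as np
from scipy.optimize import dual_annealing

# Hypothetical "profit" table indexed by two integer choices (5 x 5 options).
profit = np.array(
    [
        [3.0, 1.0, 4.0, 1.0, 5.0],
        [9.0, 2.0, 6.0, 5.0, 3.0],
        [5.0, 8.0, 20.0, 7.0, 9.0],
        [3.0, 2.0, 3.0, 8.0, 4.0],
        [6.0, 2.0, 6.0, 4.0, 3.0],
    ]
)


def toy_objective(x):
    # Round the optimizer's floats to integer indices, as in the notebook.
    i = int(round(x[0], 0))
    j = int(round(x[1], 0))
    # Negate because dual_annealing minimizes.
    return -profit[i, j]


toy_bounds = [(0, 4), (0, 4)]  # smallest and largest valid index per variable
result = dual_annealing(toy_objective, bounds=toy_bounds, maxiter=200)

best_i, best_j = (int(round(v, 0)) for v in result.x)
print(best_i, best_j, -result.fun)
```

The search is stochastic, so the exact `x` values differ between runs; only the rounded indices and the objective value are meaningful.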

In [25]:
# create the optimizer and check the results
# due to local computing constraints this is run with the default limit of 1000 iterations (maxiter)
from scipy.optimize import dual_annealing
ret = dual_annealing(objective, bounds=bounds)
ret

     fun: -134953466.52631575
 message: ['Maximum number of iteration reached']
    nfev: 8051
    nhev: 0
     nit: 1000
    njev: 10
  status: 0
 success: True
       x: array([ 12.60304697, 407.8939274 , 484.8661791 , 375.05781299])

The results shown in `x` are indices (to be rounded) that select the corresponding genre or staff member from their respective dataframes.

* The optimizer stopped after reaching the maximum number of iterations, so there is no guarantee that it found the global minimum. This is acceptable for the purposes of this demonstration.
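
One way to sanity-check an optimizer result for this particular model: because the objective is a sum of independent terms (one gross term and three salary terms), the exact maximum is the highest-grossing genre combined with the lowest-salary writer, actor, and director. The sketch below illustrates this on hypothetical toy tables that only borrow the column names used above; it is a check on the simplified model, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy tables with the same column names as the notebook's dataframes.
genres_toy = pd.DataFrame(
    {"genres": ["Comedy", "Animation", "Drama"], "clean_domestic_gross": [60.0, 130.0, 40.0]}
)
writers_toy = pd.DataFrame({"primary_name": ["W1", "W2"], "salary": [5.0, 2.0]})
actors_toy = pd.DataFrame({"primary_name": ["A1", "A2"], "salary": [1.0, 7.0]})
directors_toy = pd.DataFrame({"primary_name": ["D1", "D2"], "salary": [4.0, 3.0]})

# Each term can be optimized on its own: maximize gross, minimize each salary.
genre_idx = genres_toy["clean_domestic_gross"].idxmax()
writer_idx = writers_toy["salary"].idxmin()
actor_idx = actors_toy["salary"].idxmin()
director_idx = directors_toy["salary"].idxmin()

exact_best = (
    genres_toy.loc[genre_idx, "clean_domestic_gross"]
    - writers_toy.loc[writer_idx, "salary"]
    - actors_toy.loc[actor_idx, "salary"]
    - directors_toy.loc[director_idx, "salary"]
)
print(genres_toy.loc[genre_idx, "genres"], exact_best)  # Animation 124.0
```

Comparing this exact value with `-ret.fun` would show how close the annealing run came to the true optimum of the simplified model.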

## Results

In [26]:
best_genre = int(round(ret.x[0],0))
best_writer = int(round(ret.x[1],0))
best_actor = int(round(ret.x[2],0))
best_director = int(round(ret.x[3],0))
net_profit = round(ret.fun * -1, 2)

In [27]:
print(genres_df["genres"][best_genre])
print(writers["primary_name"][best_writer])
print(actors["primary_name"][best_actor])
print(directors["primary_name"][best_director])

Musical
Kirsten Elms
R.L. Mann
Joseph Mazzella


In [28]:
print("In order to maximize profits, this simple model recommends creating a film with the following characteristics:")
print(f"Genre: {genres_df['genres'][best_genre]}")
print(f"Writer: {writers['primary_name'][best_writer]}")
print(f"Actor: {actors['primary_name'][best_actor]}")
print(f"Director: {directors['primary_name'][best_director]}")
print(f"This combination will result in a net profit of: ${net_profit}")

In order to maximize profits, this simple model recommends creating a film with the following characteristics:
Genre: Musical
Writer: Kirsten Elms
Actor: R.L. Mann
Director: Joseph Mazzella
This combination will result in a net profit of: $134953466.53


## An expanded version of this model can be used to make future business decisions.