# 1. Bisection


One of the most common algorithms for numerical root-finding is *bisection*.

To understand the idea, recall the well-known game where:

- Player A thinks of a secret number between 1 and 100  
- Player B asks if it’s less than 50  
  
  - If yes, B asks if it’s less than 25  
  - If no, B asks if it’s less than 75  
  

And so on.

This is bisection, a relative of [binary search](https://en.wikipedia.org/wiki/Binary_search_algorithm). It works for all sufficiently well-behaved, increasing, continuous functions $f$ on an interval $[a, b]$ with $f(a) < 0 < f(b)$.

Write an implementation of the bisection algorithm, `bisect(f, lower, upper, tol)`, which, given a function `f`, a lower bound `lower`, and an upper bound `upper`, finds the point `x` where `f(x) = 0`. The parameter `tol` is a numerical tolerance: you should stop once your step size is smaller than `tol`.


Use it to find the root of the function:

$$
f(x) = \sin(4 (x - 1/4)) + x + x^{20} - 1 \tag{2}
$$

In Python: `lambda x: np.sin(4 * (x - 1/4)) + x + x**20 - 1`

The value where f(x) = 0 should be around `0.408`. See derivates_optimization notes.

In [10]:
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
%matplotlib inline

f = lambda x: np.sin(4 * (x - 1/4)) + x + x**20 - 1

#bisect code from https://www.math.ubc.ca/~pwalls/math-python/roots-optimization/bisection/

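def bisect(f, lower, upper, tol):
    # NOTE: in this implementation `tol` is the number of halvings to perform,
    # not a numerical tolerance; after `tol` steps the bracket has width
    # (upper - lower) / 2**tol.
    if f(lower)*f(upper) >= 0:
        print("Bisection method fails.")
        return None
    lower_n = lower
    upper_n = upper
    for n in range(1, tol+1):
        m_n = (lower_n + upper_n)/2
        f_m_n = f(m_n)
        if f(lower_n)*f_m_n < 0:
            # sign change in the left half: keep [lower_n, m_n]
            upper_n = m_n
        elif f(upper_n)*f_m_n < 0:
            # sign change in the right half: keep [m_n, upper_n]
            lower_n = m_n
        elif f_m_n == 0:
            print("Found exact solution.")
            return m_n
        else:
            print("Bisection method fails.")
            return None
    return (lower_n + upper_n) / 2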

bisection = bisect(f, -1, 2, 25)
print(bisection)

0.4082935303449631
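
The solution above treats its fourth argument as an iteration count. A minimal sketch of a variant that instead stops once the bracket is narrower than a tolerance, as the exercise statement describes, could look like the following. The name `bisect_tol` and the `max_iter` safeguard are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np

def bisect_tol(f, lower, upper, tol=1e-8, max_iter=200):
    """Hypothetical tolerance-based bisection on the bracket [lower, upper]."""
    f_lower, f_upper = f(lower), f(upper)
    # An endpoint may already be a root.
    if f_lower == 0:
        return lower
    if f_upper == 0:
        return upper
    if f_lower * f_upper > 0:
        raise ValueError("f(lower) and f(upper) must have opposite signs")
    for _ in range(max_iter):
        mid = (lower + upper) / 2
        f_mid = f(mid)
        # Stop on an exact root or once the half-width drops below tol.
        if f_mid == 0 or (upper - lower) / 2 < tol:
            return mid
        if f_lower * f_mid < 0:
            upper, f_upper = mid, f_mid   # root is in the left half
        else:
            lower, f_lower = mid, f_mid   # root is in the right half
    return (lower + upper) / 2

f = lambda x: np.sin(4 * (x - 1/4)) + x + x**20 - 1
root = bisect_tol(f, -1, 2, tol=1e-8)
print(root)
```

Caching `f_lower` and `f_upper` means each step costs one function evaluation instead of three.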


# 1.2 (stretch) Recursive Bisect

Write a recursive version of the bisection algorithm.
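
One possible shape for the recursive version is sketched below. It assumes the caller passes a bracket whose endpoints have opposite signs; the name `bisect_recursive` is hypothetical, and the test function is the one from section 1.

```python
import math

def bisect_recursive(f, lower, upper, tol=1e-8):
    """Hypothetical recursive bisection; assumes f(lower) and f(upper) differ in sign."""
    mid = (lower + upper) / 2
    f_mid = f(mid)
    # Base case: the bracket is narrow enough, or mid is an exact root.
    if (upper - lower) / 2 < tol or f_mid == 0:
        return mid
    # Recurse into the half whose endpoints still bracket the sign change.
    if f(lower) * f_mid < 0:
        return bisect_recursive(f, lower, mid, tol)
    return bisect_recursive(f, mid, upper, tol)

f = lambda x: math.sin(4 * (x - 1/4)) + x + x**20 - 1
root = bisect_recursive(f, 0, 1, tol=1e-8)
print(root)
```

Each call halves the bracket, so the recursion depth grows only logarithmically in `(upper - lower) / tol` (about 27 levels here), well below Python's default recursion limit.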

# 2.1 Movies Regression

Write the best linear regression model you can on the [Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings.csv) to predict the profitability of a movie (revenue - budget). Maintain the interpretability of the model.

A few notes:

1. Clean your data! Movies where the budget or revenue are invalid should be thrown out.

2. Be creative with feature engineering. You can include processing to one-hot encode the type of movie, etc.

3. The model should be useful for someone **who is thinking about making a movie**. So features like the popularity can't be used. You could, however, use the ratings to figure out if making "good" or "Oscar bait" movies is a profitable strategy.
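
As a small illustration of the one-hot encoding mentioned in note 2, here is a sketch on a four-row toy frame (budgets and genres copied from the first rows of the dataset). The column name `genre` and the `genre_` prefix are assumptions for the example only.

```python
import pandas as pd

# Toy frame: four movies with a numeric feature and a categorical one.
toy = pd.DataFrame({
    "budget": [30e6, 65e6, 16e6, 60e6],
    "genre": ["Animation", "Adventure", "Comedy", "Action"],
})

# One column per genre; drop_first removes one level ("Action", the first
# alphabetically) so the dummies are not collinear with the intercept.
genre_dummies = pd.get_dummies(toy["genre"], prefix="genre",
                               drop_first=True, dtype=float)
features = pd.concat([toy[["budget"]], genre_dummies], axis=1)
print(features.columns.tolist())
```

With `drop_first=True`, each genre coefficient in the regression reads as a difference relative to the dropped baseline genre, which keeps the model interpretable.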

In [13]:
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('/Users/mike_stein612/Desktop/3-5-optimization/archive/movies_metadata.csv')
df

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,...,,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45462,False,,0,"[{'id': 18, 'name': 'Drama'}]",,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,...,2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0
45463,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",...,2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0


In [14]:
#Light cleaning
df.drop(['adult', 'belongs_to_collection', 'homepage',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path',
       'production_countries', 'release_date',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_count'],
           axis = 1,
           inplace = True)

In [15]:
df = df.dropna(subset=['genres'],axis=0)
df = df.dropna(subset=['production_companies'],axis=0)

In [16]:
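#Reading the genres properly
import ast

df['genres'] = df.genres.apply(ast.literal_eval)  # parse the stringified list safely instead of using eval
df['genres'] = df.genres.apply(lambda x: x[0]['name'] if len(x) > 0 else '')  # keep only the first listed genre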
df.head()

Unnamed: 0,budget,genres,id,production_companies,revenue,runtime,vote_average
0,30000000,Animation,862,"[{'name': 'Pixar Animation Studios', 'id': 3}]",373554033.0,81.0,7.7
1,65000000,Adventure,8844,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...",262797249.0,104.0,6.9
2,0,Romance,15602,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...",0.0,101.0,6.5
3,16000000,Comedy,31357,[{'name': 'Twentieth Century Fox Film Corporat...,81452156.0,127.0,6.1
4,0,Comedy,11862,"[{'name': 'Sandollar Productions', 'id': 5842}...",76578911.0,106.0,5.7


In [17]:
#help from Kaleb
s = pd.Series(df['production_companies'], dtype= str)
s2 = s.str.split(pat="'",expand=True)
df['prod_companies'] = s2[3]
df

Unnamed: 0,budget,genres,id,production_companies,revenue,runtime,vote_average,prod_companies
0,30000000,Animation,862,"[{'name': 'Pixar Animation Studios', 'id': 3}]",373554033.0,81.0,7.7,Pixar Animation Studios
1,65000000,Adventure,8844,"[{'name': 'TriStar Pictures', 'id': 559}, {'na...",262797249.0,104.0,6.9,TriStar Pictures
2,0,Romance,15602,"[{'name': 'Warner Bros.', 'id': 6194}, {'name'...",0.0,101.0,6.5,Warner Bros.
3,16000000,Comedy,31357,[{'name': 'Twentieth Century Fox Film Corporat...,81452156.0,127.0,6.1,Twentieth Century Fox Film Corporation
4,0,Comedy,11862,"[{'name': 'Sandollar Productions', 'id': 5842}...",76578911.0,106.0,5.7,Sandollar Productions
...,...,...,...,...,...,...,...,...
45461,0,Drama,439050,[],0.0,90.0,4.0,
45462,0,Drama,111109,"[{'name': 'Sine Olivia', 'id': 19653}]",0.0,360.0,9.0,Sine Olivia
45463,0,Action,67758,"[{'name': 'American World Pictures', 'id': 6165}]",0.0,90.0,3.8,American World Pictures
45464,0,,227506,"[{'name': 'Yermoliev', 'id': 88753}]",0.0,87.0,0.0,Yermoliev
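
Splitting the raw string on apostrophes relies on every row being quoted the same way, and it can misfire when a company name itself contains an apostrophe. A sketch of an alternative that parses the string into Python objects first is shown below; the helper name `first_company` and the three sample strings are assumptions for illustration.

```python
import ast
import pandas as pd

def first_company(raw):
    """Return the first company name from a stringified list of dicts, or '' if unavailable."""
    try:
        companies = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return ''   # not a parseable literal (e.g. NaN or free text)
    if isinstance(companies, list) and companies and isinstance(companies[0], dict):
        return companies[0].get('name', '')
    return ''

sample = pd.Series([
    "[{'name': 'Pixar Animation Studios', 'id': 3}]",
    "[]",
    "not a list",
])
print(sample.apply(first_company).tolist())
```

On the real data this would be applied as `df['production_companies'].apply(first_company)`.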


In [18]:
df.drop(['production_companies'], axis = 1, inplace = True)

In [19]:
df['id'] = df['id'].astype(str)  # ids as strings so they match the credits table when merging

In [20]:
#Fixing the budget column
df = df[(df != 0).all(axis=1)]  # drop rows with a numeric 0 in any column (e.g. revenue or runtime)

#Help from Kaleb

# budget is still a string column here, so compare against strings:
# '0' means the budget is unknown, and the .jpg values come from malformed rows
bad_budgets = ['0',
               '/ff9qCepilowshEtG2GYWwzt2bs4.jpg',
               '/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg',
               '/zaSf5OG7V8X8gqFvly88zDdRm46.jpg']
df = df[~df['budget'].isin(bad_budgets)]

df

Unnamed: 0,budget,genres,id,revenue,runtime,vote_average,prod_companies
0,30000000,Animation,862,373554033.0,81.0,7.7,Pixar Animation Studios
1,65000000,Adventure,8844,262797249.0,104.0,6.9,TriStar Pictures
3,16000000,Comedy,31357,81452156.0,127.0,6.1,Twentieth Century Fox Film Corporation
5,60000000,Action,949,187436818.0,170.0,7.7,Regency Enterprises
8,35000000,Action,9091,64350171.0,106.0,5.5,Universal Pictures
...,...,...,...,...,...,...,...
45167,11000000,Action,395834,184770205.0,111.0,7.4,Thunder Road Pictures
45250,12000000,Action,24049,19000000.0,185.0,6.9,AVM Productions
45409,800000,Comedy,62757,1328612.0,100.0,5.8,
45412,2000000,Romance,63281,1268793.0,107.0,4.0,Profit
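
Rather than listing each malformed budget string by hand, the column can be coerced to numbers so that anything unparseable becomes `NaN` and is filtered out along with the zeros. The sketch below uses a toy frame; the revenue value `12.0` is made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "budget": ["30000000", "0", "/ff9qCepilowshEtG2GYWwzt2bs4.jpg", "16000000"],
    "revenue": [373554033.0, 0.0, 12.0, 81452156.0],
})

# errors="coerce" turns non-numeric strings into NaN instead of raising.
toy["budget"] = pd.to_numeric(toy["budget"], errors="coerce")

# NaN > 0 evaluates to False, so malformed rows are dropped together with zeros.
clean = toy[(toy["budget"] > 0) & (toy["revenue"] > 0)]
print(clean)
```

This also leaves `budget` as a float column, so no separate `astype(float)` step is needed afterwards.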


In [21]:
#define the profit
df.budget = df.budget.astype(float)
df['profit'] = df['revenue'] - df['budget']
df = df[~(df['profit'] < 0)]  # keep only movies that did not lose money
df

Unnamed: 0,budget,genres,id,revenue,runtime,vote_average,prod_companies,profit
0,30000000.0,Animation,862,373554033.0,81.0,7.7,Pixar Animation Studios,343554033.0
1,65000000.0,Adventure,8844,262797249.0,104.0,6.9,TriStar Pictures,197797249.0
3,16000000.0,Comedy,31357,81452156.0,127.0,6.1,Twentieth Century Fox Film Corporation,65452156.0
5,60000000.0,Action,949,187436818.0,170.0,7.7,Regency Enterprises,127436818.0
8,35000000.0,Action,9091,64350171.0,106.0,5.5,Universal Pictures,29350171.0
...,...,...,...,...,...,...,...,...
45014,60000000.0,Action,353491,71000000.0,95.0,5.7,Imagine Entertainment,11000000.0
45139,50000000.0,Comedy,378236,66913939.0,86.0,5.8,Columbia Pictures,16913939.0
45167,11000000.0,Action,395834,184770205.0,111.0,7.4,Thunder Road Pictures,173770205.0
45250,12000000.0,Action,24049,19000000.0,185.0,6.9,AVM Productions,7000000.0


In [22]:
df_director = pd.read_csv('archive/credits.csv')
df_director

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862
...,...,...,...
45471,"[{'cast_id': 0, 'character': '', 'credit_id': ...","[{'credit_id': '5894a97d925141426c00818c', 'de...",439050
45472,"[{'cast_id': 1002, 'character': 'Sister Angela...","[{'credit_id': '52fe4af1c3a36847f81e9b15', 'de...",111109
45473,"[{'cast_id': 6, 'character': 'Emily Shaw', 'cr...","[{'credit_id': '52fe4776c3a368484e0c8387', 'de...",67758
45474,"[{'cast_id': 2, 'character': '', 'credit_id': ...","[{'credit_id': '533bccebc3a36844cf0011a7', 'de...",227506


In [23]:
#extract the name of the director.
import ast

df_director.crew = df_director.crew.apply(ast.literal_eval)  # parse the stringified crew list safely

def director(crew):
    # return the first crew member whose job is 'Director' (None if there is none)
    for person in crew:
        if person['job'] == 'Director':
            return person['name']
    return None

df_director['Director'] = df_director.crew.apply(director)
df_director['id'] = df_director['id'].astype(str)  # match the string ids in df for the merge

In [24]:
df_new = pd.merge(df, df_director, on = 'id')
df_new

Unnamed: 0,budget,genres,id,revenue,runtime,vote_average,prod_companies,profit,cast,crew,Director
0,30000000.0,Animation,862,373554033.0,81.0,7.7,Pixar Animation Studios,343554033.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",John Lasseter
1,65000000.0,Adventure,8844,262797249.0,104.0,6.9,TriStar Pictures,197797249.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",Joe Johnston
2,16000000.0,Comedy,31357,81452156.0,127.0,6.1,Twentieth Century Fox Film Corporation,65452156.0,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",Forest Whitaker
3,60000000.0,Action,949,187436818.0,170.0,7.7,Regency Enterprises,127436818.0,"[{'cast_id': 25, 'character': 'Lt. Vincent Han...","[{'credit_id': '52fe4292c3a36847f802916d', 'de...",Michael Mann
4,35000000.0,Action,9091,64350171.0,106.0,5.5,Universal Pictures,29350171.0,"[{'cast_id': 1, 'character': 'Darren Francis T...","[{'credit_id': '52fe44dbc3a36847f80ae0f1', 'de...",Peter Hyams
...,...,...,...,...,...,...,...,...,...,...,...
3771,60000000.0,Action,353491,71000000.0,95.0,5.7,Imagine Entertainment,11000000.0,"[{'cast_id': 9, 'character': 'Roland Deschain'...","[{'credit_id': '5912cf71c3a36864d40533b7', 'de...",Nikolaj Arcel
3772,50000000.0,Comedy,378236,66913939.0,86.0,5.8,Columbia Pictures,16913939.0,"[{'cast_id': 2, 'character': 'Gene (voice)', '...","[{'credit_id': '5952d8f1c3a368151c025d0e', 'de...",Anthony Leondis
3773,11000000.0,Action,395834,184770205.0,111.0,7.4,Thunder Road Pictures,173770205.0,"[{'cast_id': 9, 'character': 'Cory Lambert', '...","[{'credit_id': '572815d0c3a3687a00001314', 'de...",Taylor Sheridan
3774,12000000.0,Action,24049,19000000.0,185.0,6.9,AVM Productions,7000000.0,"[{'cast_id': 14, 'character': 'Sivaji Arumugam...","[{'credit_id': '52fe447ec3a368484e02663b', 'de...",S. Shankar


In [25]:
#Created a copy of the dataset so that genre and director don't revert back to the messy data they were before.
df_good = df_new.copy()
df_good

Unnamed: 0,budget,genres,id,revenue,runtime,vote_average,prod_companies,profit,cast,crew,Director
0,30000000.0,Animation,862,373554033.0,81.0,7.7,Pixar Animation Studios,343554033.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",John Lasseter
1,65000000.0,Adventure,8844,262797249.0,104.0,6.9,TriStar Pictures,197797249.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",Joe Johnston
2,16000000.0,Comedy,31357,81452156.0,127.0,6.1,Twentieth Century Fox Film Corporation,65452156.0,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",Forest Whitaker
3,60000000.0,Action,949,187436818.0,170.0,7.7,Regency Enterprises,127436818.0,"[{'cast_id': 25, 'character': 'Lt. Vincent Han...","[{'credit_id': '52fe4292c3a36847f802916d', 'de...",Michael Mann
4,35000000.0,Action,9091,64350171.0,106.0,5.5,Universal Pictures,29350171.0,"[{'cast_id': 1, 'character': 'Darren Francis T...","[{'credit_id': '52fe44dbc3a36847f80ae0f1', 'de...",Peter Hyams
...,...,...,...,...,...,...,...,...,...,...,...
3771,60000000.0,Action,353491,71000000.0,95.0,5.7,Imagine Entertainment,11000000.0,"[{'cast_id': 9, 'character': 'Roland Deschain'...","[{'credit_id': '5912cf71c3a36864d40533b7', 'de...",Nikolaj Arcel
3772,50000000.0,Comedy,378236,66913939.0,86.0,5.8,Columbia Pictures,16913939.0,"[{'cast_id': 2, 'character': 'Gene (voice)', '...","[{'credit_id': '5952d8f1c3a368151c025d0e', 'de...",Anthony Leondis
3773,11000000.0,Action,395834,184770205.0,111.0,7.4,Thunder Road Pictures,173770205.0,"[{'cast_id': 9, 'character': 'Cory Lambert', '...","[{'credit_id': '572815d0c3a3687a00001314', 'de...",Taylor Sheridan
3774,12000000.0,Action,24049,19000000.0,185.0,6.9,AVM Productions,7000000.0,"[{'cast_id': 14, 'character': 'Sivaji Arumugam...","[{'credit_id': '52fe447ec3a368484e02663b', 'de...",S. Shankar


In [26]:
df_good.drop(['cast',
             'crew'],
            axis = 1,
            inplace = True)

In [27]:
df_good

Unnamed: 0,budget,genres,id,revenue,runtime,vote_average,prod_companies,profit,Director
0,30000000.0,Animation,862,373554033.0,81.0,7.7,Pixar Animation Studios,343554033.0,John Lasseter
1,65000000.0,Adventure,8844,262797249.0,104.0,6.9,TriStar Pictures,197797249.0,Joe Johnston
2,16000000.0,Comedy,31357,81452156.0,127.0,6.1,Twentieth Century Fox Film Corporation,65452156.0,Forest Whitaker
3,60000000.0,Action,949,187436818.0,170.0,7.7,Regency Enterprises,127436818.0,Michael Mann
4,35000000.0,Action,9091,64350171.0,106.0,5.5,Universal Pictures,29350171.0,Peter Hyams
...,...,...,...,...,...,...,...,...,...
3771,60000000.0,Action,353491,71000000.0,95.0,5.7,Imagine Entertainment,11000000.0,Nikolaj Arcel
3772,50000000.0,Comedy,378236,66913939.0,86.0,5.8,Columbia Pictures,16913939.0,Anthony Leondis
3773,11000000.0,Action,395834,184770205.0,111.0,7.4,Thunder Road Pictures,173770205.0,Taylor Sheridan
3774,12000000.0,Action,24049,19000000.0,185.0,6.9,AVM Productions,7000000.0,S. Shankar


In [29]:
#We want the best, even at the expense of bias (not really, but I'm doing it anyway)! Keep only the movies whose profit is at least the mean profit.
df_good = df_good[~(df_good['profit'] < df_good['profit'].mean())]
df_good

Unnamed: 0,budget,genres,id,revenue,runtime,vote_average,prod_companies,profit,Director,director_company_genre
0,30000000.0,Animation,862,3.735540e+08,81.0,7.7,Pixar Animation Studios,343554033.0,John Lasseter,John Lasseter_Pixar Animation Studios_Animation
1,65000000.0,Adventure,8844,2.627972e+08,104.0,6.9,TriStar Pictures,197797249.0,Joe Johnston,Joe Johnston_TriStar Pictures_Adventure
3,60000000.0,Action,949,1.874368e+08,170.0,7.7,Regency Enterprises,127436818.0,Michael Mann,Michael Mann_Regency Enterprises_Action
5,58000000.0,Adventure,710,3.521940e+08,130.0,6.6,United Artists,294194034.0,Martin Campbell,Martin Campbell_United Artists_Adventure
8,16500000.0,Drama,4584,1.350000e+08,136.0,7.2,Columbia Pictures Corporation,118500000.0,Ang Lee,Ang Lee_Columbia Pictures Corporation_Drama
...,...,...,...,...,...,...,...,...,...,...
3755,80000000.0,Action,324852,1.020063e+09,96.0,6.2,Illumination Entertainment,940063384.0,Kyle Balda,Kyle Balda_Illumination Entertainment_Action
3760,152000000.0,Drama,281338,3.699080e+08,140.0,6.7,Chernin Entertainment,217907963.0,Matt Reeves,Matt Reeves_Chernin Entertainment_Drama
3765,100000000.0,Action,374720,5.198769e+08,107.0,7.5,Canal+,419876949.0,Christopher Nolan,Christopher Nolan_Canal+_Action
3767,260000000.0,Action,335988,6.049421e+08,149.0,6.2,Paramount Pictures,344942143.0,Michael Bay,Michael Bay_Paramount Pictures_Action


In [30]:
#Let's get creative and combine 3 values together!
df_good['director_company_genre'] = df_good.Director.astype(str) + '_' + df_good['prod_companies'].astype(str) + '_' + df_good['genres']
director_company_genre_dum = pd.get_dummies(df_good['director_company_genre'], drop_first = True)

df_new3 = pd.concat((director_company_genre_dum, df_good), axis = 1)
X = sm.add_constant(df_new3)
X = X.drop(['genres', 'profit', 'revenue', 'Director', 'id', 'vote_average', 'runtime', 'prod_companies', 'director_company_genre'],
                     axis = 1
                    )
y = df_good.profit
est = sm.OLS(y, X).fit()
est.summary()

0,1,2,3
Dep. Variable:,profit,R-squared:,0.95
Model:,OLS,Adj. R-squared:,0.51
Method:,Least Squares,F-statistic:,2.158
Date:,"Mon, 01 Feb 2021",Prob (F-statistic):,9.86e-07
Time:,19:31:29,Log-Likelihood:,-19804.0
No. Observations:,1035,AIC:,41470.0
Df Residuals:,106,BIC:,46060.0
Df Model:,928,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,1.378e+08,1.57e+08,0.879,0.381,-1.73e+08,4.49e+08
Adam McKay_Paramount Pictures_Comedy,2.151e+06,1.9e+08,0.011,0.991,-3.74e+08,3.78e+08
Adam Shankman_Hyde Park Films_Comedy,-1.208e+07,2.19e+08,-0.055,0.956,-4.47e+08,4.23e+08
Adam Shankman_Walt Disney Pictures_Fantasy,4.731e+07,2.18e+08,0.217,0.829,-3.86e+08,4.8e+08
Adrian Lyne_Paramount Pictures_Drama,1.156e+08,2.19e+08,0.529,0.598,-3.18e+08,5.49e+08
Adrian Lyne_Paramount Pictures_Romance,1.775e+08,2.19e+08,0.809,0.421,-2.58e+08,6.13e+08
Alan J. Pakula_Mirage Enterprises_Drama,7.585e+07,2.19e+08,0.346,0.730,-3.59e+08,5.1e+08
Alan J. Pakula_Warner Bros._Drama,4.184e+07,2.19e+08,0.191,0.849,-3.92e+08,4.75e+08
Alan Taylor_Marvel Studios_Action,4.478e+08,2.21e+08,2.022,0.046,8.82e+06,8.87e+08

0,1,2,3
Omnibus:,273.196,Durbin-Watson:,2.003
Prob(Omnibus):,0.0,Jarque-Bera (JB):,18356.094
Skew:,0.05,Prob(JB):,0.0
Kurtosis:,23.631,Cond. No.,91600000000.0


In [None]:
#Yes, this was crazy, but now we can easily recommend the director, the genre, and which studio could make the most money. For 2.2, we'll use Alfonso Cuarón.

# 2.2 Movies Manual Regression

Use your `X` and `y` matrix from 2.1 to calculate the linear regression yourself using the normal equation $\hat{\beta} = (X^T X)^{-1}X^Ty$.

Verify that the coefficients are the same.
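
To make the normal equation concrete, here is a sketch on synthetic data (the arrays `X_demo`, `y_demo` and the coefficients in `true_beta` are invented for the example). It solves the linear system $X^TX\beta = X^Ty$ with `np.linalg.solve` rather than forming the inverse explicitly, and checks the result against NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Design matrix: a constant column plus two random features.
X_demo = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_beta + 0.1 * rng.normal(size=n)

# Normal equation: solve (X^T X) beta = X^T y.
beta_normal = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Reference solution from NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(beta_normal)
print(np.allclose(beta_normal, beta_lstsq))
```

Solving the system is usually preferable to calling `np.linalg.inv` because it is cheaper and less sensitive to rounding when $X^TX$ is poorly conditioned.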

In [31]:
X = X[[s for s in X.columns if 'Alfonso' in s]]
X

Unnamed: 0,Alfonso Cuarón_1492 Pictures_Adventure,Alfonso Cuarón_Warner Bros._Science Fiction
0,0,0
1,0,0
3,0,0
5,0,0
8,0,0
...,...,...
3755,0,0
3760,0,0
3765,0,0
3767,0,0


In [32]:
X = sm.add_constant(X)
sm.OLS(y, X).fit().summary()

0,1,2,3
Dep. Variable:,profit,R-squared:,0.006
Model:,OLS,Adj. R-squared:,0.004
Method:,Least Squares,F-statistic:,2.945
Date:,"Mon, 01 Feb 2021",Prob (F-statistic):,0.053
Time:,19:34:17,Log-Likelihood:,-21349.0
No. Observations:,1035,AIC:,42700.0
Df Residuals:,1032,BIC:,42720.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.583e+08,6.85e+06,37.723,0.000,2.45e+08,2.72e+08
Alfonso Cuarón_1492 Pictures_Adventure,4.015e+08,2.2e+08,1.823,0.069,-3.06e+07,8.34e+08
Alfonso Cuarón_Warner Bros._Science Fiction,3.531e+08,2.2e+08,1.604,0.109,-7.9e+07,7.85e+08

0,1,2,3
Omnibus:,764.905,Durbin-Watson:,1.905
Prob(Omnibus):,0.0,Jarque-Bera (JB):,15416.677
Skew:,3.204,Prob(JB):,0.0
Kurtosis:,20.788,Cond. No.,32.2


In [35]:
import numpy as np

np.linalg.inv(X.T @ X) @ X.T @ y

0    2.583119e+08
1    4.014926e+08
2    3.530808e+08
dtype: float64

In [None]:
#The numbers match!

# 2.3 Movies gradient descent regression

Use your `X` and `y` matrix from 2.1 to calculate the linear regression yourself using **gradient descent**. 

Hint: use `scipy.optimize` and remember we're finding the $\beta$ that minimizes the squared loss function of linear regression: $f(\beta) = \lVert X\beta - y \rVert^2$. This will look like part 3 of this lecture.

Verify your coefficients are similar to the ones in 2.1 and 2.2. They won't necessarily be exactly the same, but should be roughly similar.

In [36]:
from scipy.optimize import minimize

betas = np.random.rand(X.shape[1])

def fun(betas):
    return sum((y - X @ betas) ** 2)

minimize(fun, betas, method = 'Powell')

   direc: array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
     fun: array(4.9987633e+19)
 message: 'Optimization terminated successfully.'
    nfev: 85
     nit: 2
  status: 0
 success: True
       x: array([2.58313368e+08, 4.01490280e+08, 3.53079378e+08])

In [None]:
#Only method = 'Powell' (a derivative-free method, so not strictly gradient descent) reproduced the coefficients. All the other methods I tried failed miserably.
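
One plausible reason the gradient-based methods struggle here, not verified against the original data, is scale: profits are on the order of 1e8 and the loss is around 1e19, so finite-difference gradients and default tolerances behave poorly. The sketch below uses synthetic data shaped like the 2.2 design matrix (a constant plus two dummies that are 1 for a single row each). It rescales `y`, supplies the analytic gradient $2X^T(X\beta - y)$, and runs BFGS. All names and numbers in it are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 1000
X_demo = np.zeros((n, 3))
X_demo[:, 0] = 1.0   # constant column
X_demo[0, 1] = 1.0   # dummy that is 1 for a single row
X_demo[1, 2] = 1.0   # second single-row dummy
y_demo = 2.5e8 + 1e8 * rng.normal(size=n)   # synthetic "profit"

# Rescale the target so the loss and its gradient are of moderate size.
scale = 1e8
y_scaled = y_demo / scale

def loss(beta):
    r = X_demo @ beta - y_scaled
    return r @ r

def grad(beta):
    # Analytic gradient of the squared loss.
    return 2 * X_demo.T @ (X_demo @ beta - y_scaled)

res = minimize(loss, np.zeros(3), jac=grad, method="BFGS")
beta_gd = res.x * scale   # undo the rescaling

beta_exact, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print(beta_gd)
print(beta_exact)
```

Because the model is linear in $\beta$, dividing `y` by a constant simply divides the fitted coefficients by the same constant, which is why multiplying `res.x` by `scale` recovers them in the original units.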