Dylan Hastings

# 1. Bisection


One of the most common algorithms for numerical root-finding is *bisection*.

To understand the idea, recall the well-known game where:

- Player A thinks of a secret number between 1 and 100  
- Player B asks if it’s less than 50  
  
  - If yes, B asks if it’s less than 25  
  - If no, B asks if it’s less than 75  
  

And so on.

This is bisection, a relative of [binary search](https://en.wikipedia.org/wiki/Binary_search_algorithm). It works for any continuous function on $[a, b]$ with $ f(a) < 0 < f(b) $: by the intermediate value theorem a root exists in the interval, and if $f$ is also increasing that root is unique.
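The guessing game can be sketched in a few lines. The function name `guess_number` and the exact choice of threshold are illustrative assumptions, not part of the exercise; the point is only that each yes/no answer halves the remaining range.

```python
def guess_number(secret, low=1, high=100):
    """Find `secret` in [low, high] by repeatedly asking 'is it less than mid?'."""
    questions = 0
    while low < high:
        mid = (low + high + 1) // 2   # threshold that splits the range in two
        questions += 1
        if secret < mid:              # answer "yes": secret lies in [low, mid - 1]
            high = mid - 1
        else:                         # answer "no": secret lies in [mid, high]
            low = mid
    return low, questions

found, asked = guess_number(37)
print(found, asked)
```

Because the range is halved at every question, at most 7 questions are needed for 100 candidates. Bisection applies the same idea to a continuous interval, using the sign of `f` at the midpoint as the answer.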

Write an implementation of the bisection algorithm, `bisect(f, lower, upper, tol)`, which, given a function `f`, a lower bound `lower`, and an upper bound `upper`, finds the point `x` where `f(x) = 0`. The parameter `tol` is a numerical tolerance: you should stop once your step size is smaller than `tol`.


Use it to find the root of the function:

$$
f(x) = \sin(4 (x - 1/4)) + x + x^{20} - 1 \tag{2}
$$

In Python: `lambda x: np.sin(4 * (x - 1/4)) + x + x**20 - 1`

The value where `f(x) = 0` should be around `0.408`.

In [1]:
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
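def bisect(f, lower, upper, tol, max_iter=50):
    '''
    Given a function 'f', a lower bound 'lower', and an upper bound 'upper',
    finds the point 'x' where f(x) = 0.
    Assumes f(lower) and f(upper) have opposite signs.
    '''
    f_lower = f(lower)

    for _ in range(max_iter):  # Limit number of iterations to prevent infinite loop
        x = (lower + upper) / 2  # New midpoint
        f_x = f(x)

        if f_x == 0 or (upper - lower) / 2 < tol:  # Solution found
            return x

        if np.sign(f_x) == np.sign(f_lower):
            lower, f_lower = x, f_x  # Root lies in the upper half
        else:
            upper = x  # Root lies in the lower half

    return x  # Best estimate if max_iter is reached before tol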

In [3]:
def f(x): 
    return np.sin(4 * (x - 1/4)) + x + x**20 - 1

In [4]:
x_min = bisect(f, -0.5, 0.5, 0.0000001)
x_min

0.40829354524612427

In [5]:
import scipy.optimize as opt

In [6]:
x_min = opt.bisect(f, -0.5, 0.5)
x_min

0.4082935042806639
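The two results above differ by roughly `4.1e-8`, which is below the tolerance of `1e-7` passed to the hand-written version. As a rough check that the 50-iteration cap is not what stops the loop, the sketch below counts how many halvings the stopping rule `(upper - lower) / 2 < tol` requires. The helper name `halvings_needed` is an assumption made for this illustration.

```python
def halvings_needed(lower, upper, tol):
    """Count interval halvings until the half-width (upper - lower) / 2 drops below tol."""
    width = upper - lower
    k = 0
    while width / 2 >= tol:   # same stopping rule as the bisect loop
        width /= 2
        k += 1
    return k

n_halvings = halvings_needed(-0.5, 0.5, 1e-7)
gap = abs(0.40829354524612427 - 0.4082935042806639)  # the two outputs shown above
print(n_halvings, gap < 1e-7)
```

For the bracket `[-0.5, 0.5]` this gives 23 halvings, well under the cap, so the returned midpoint is within `tol` of the true root.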

# 1.2 (stretch) Recursive Bisect

Write a recursive version of the bisection algorithm

In [7]:
def bisect(f, lower, upper, tol):
    '''
    Given a function 'f', a lower bound 'lower', and an upper bound 'upper', 
    finds the point 'x' where f(x) = 0.
    '''
        
    x = (lower + upper) / 2 #New midpoint
        
    if (f(x) == 0) or (((upper - lower) / 2) < tol): #Solution found
        return x
        
    if np.sign(f(x)) == np.sign(f(lower)):
        return bisect(f, x, upper, tol)
            
    else:
        return bisect(f, lower, x, tol)

In [8]:
x_min = bisect(f, -0.5, 0.5, 0.0000001)
x_min

0.40829354524612427

# 2.1 Movies Regression

Write the best linear regression model you can on the [Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings.csv) to predict the profitability of a movie (revenue - budget). Maintain the interpretability of the model.

A few notes:

1. Clean your data! Movies where the budget or revenue are invalid should be thrown out

2. Be creative with feature engineering. You can include processing to one-hot encode the type of movie, etc.

3. The model should be useful for someone **who is thinking about making a movie**. So features like the popularity can't be used. You could, however, use the ratings to figure out if making "good" or "oscar bait" movies is a profitable strategy.
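One possible reading of note 1 is sketched below on a toy frame. Treating non-numeric and zero values of `budget` or `revenue` as invalid is an assumption made for this illustration, and the helper name `clean_financials` is hypothetical; the toy rows stand in for the much wider `movies_metadata` table.

```python
import pandas as pd

# Toy stand-in for movies_metadata (values chosen for illustration only).
raw = pd.DataFrame({
    "title":   ["A", "B", "C", "D"],
    "budget":  ["30000000", "0", "not_a_number", "16000000"],
    "revenue": [373554033.0, 0.0, 1000.0, 81452156.0],
})

def clean_financials(df):
    """Keep rows whose budget and revenue are both numeric and strictly positive."""
    out = df.copy()
    for col in ("budget", "revenue"):
        # Unparseable entries become NaN instead of raising
        out[col] = pd.to_numeric(out[col], errors="coerce")
    valid = (out["budget"] > 0) & (out["revenue"] > 0)   # NaN compares as False
    out = out[valid].copy()
    out["profit"] = out["revenue"] - out["budget"]
    return out.reset_index(drop=True)

cleaned = clean_financials(raw)
print(cleaned)
```

Only rows `A` and `D` survive here. Filtering on `notna()` alone would keep the zero-budget, zero-revenue row, which then enters the regression with a profit of exactly 0.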

In [9]:
import statsmodels.api as sm
import pandas as pd
import json

In [63]:
movies_url = {
"movies_metadata": "1RLvh6rhzYiDDjPaudDgyS9LmqjbKH-wh",
"keywords": "1YLOIxb-EPC_7QpkmRqkq9E6j7iqmoEh3",
"ratings": "1_5HNurSOMnU0JIcXBJ5mv1NaXCx9oCVG",
"credits": "1bX9othXfLu5NZbVZtIPGV5Hbn8b5URPf",
"ratings_small": "1fCWT69efrj4Oxdm8ZNoTeSahCOy6_u6w",
"links_small": "1fh6pS7XuNgnZk2J3EmYk_9jO_Au_6C15",
"links": "1hWUSMo_GwkfmhehKqs8Rs6mWIauklkbP",
}

def read_gdrive(url):
    """
    Reads file from Google Drive sharing link
    """
    path = 'https://drive.google.com/uc?export=download&id='+url
    return pd.read_csv(path)

df = read_gdrive(movies_url["movies_metadata"])


In [64]:
df

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,...,,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45462,False,,0,"[{'id': 18, 'name': 'Drama'}]",,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,...,2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0
45463,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",...,2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0


In [65]:
df = df[df['budget'].notna()]
df = df[df['revenue'].notna()]
df

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,...,,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45462,False,,0,"[{'id': 18, 'name': 'Drama'}]",,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,...,2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0
45463,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",...,2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0


In [66]:
df['budget'] = df['budget'].astype('float')

In [67]:
df['profit'] = df.revenue - df.budget

In [68]:
df1 = df[['adult','genres', 'runtime', 'vote_average','profit', 'budget']]
df1

Unnamed: 0,adult,genres,runtime,vote_average,profit,budget
0,False,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",81.0,7.7,343554033.0,30000000.0
1,False,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",104.0,6.9,197797249.0,65000000.0
2,False,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",101.0,6.5,0.0,0.0
3,False,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",127.0,6.1,65452156.0,16000000.0
4,False,"[{'id': 35, 'name': 'Comedy'}]",106.0,5.7,76578911.0,0.0
...,...,...,...,...,...,...
45461,False,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",90.0,4.0,0.0,0.0
45462,False,"[{'id': 18, 'name': 'Drama'}]",360.0,9.0,0.0,0.0
45463,False,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",90.0,3.8,0.0,0.0
45464,False,[],87.0,0.0,0.0,0.0


In [69]:
(df.genres
   # Could be optimized
   .str.replace("'", '"', regex=False)
   .apply(json.loads)
   .apply(lambda row: [d['name'] for d in row])
   .astype(str)
   # Could be done in a single regex
   .str.replace("[", "", regex=False)
   .str.replace("]", "", regex=False)
   .str.replace("'", "", regex=False)
   .str.replace(" ", "", regex=False)
   .str.get_dummies(',')
).columns

Index(['Action', 'Adventure', 'Animation', 'Comedy', 'Crime', 'Documentary',
       'Drama', 'Family', 'Fantasy', 'Foreign', 'History', 'Horror', 'Music',
       'Mystery', 'Romance', 'ScienceFiction', 'TVMovie', 'Thriller', 'War',
       'Western'],
      dtype='object')
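The comments in the cell above note that the parsing could be optimized. A minimal alternative sketch, assuming the `genres` strings are valid Python literals (as the displayed rows suggest), parses them with `ast.literal_eval` and joins the names before one-hot encoding. The toy series below is illustrative data, not rows taken from the dataset.

```python
import ast
import pandas as pd

# Toy genre strings in the same shape as the `genres` column.
genres = pd.Series([
    "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
    "[{'id': 878, 'name': 'Science Fiction'}]",
    "[]",
])

# Parse each string into a list of dicts, then keep only the names
# (spaces removed so the column names match the ones used above).
names = genres.apply(ast.literal_eval).apply(
    lambda row: [d["name"].replace(" ", "") for d in row]
)

# Join each list with '|' (the default separator of get_dummies) and one-hot encode.
dummies = names.str.join("|").str.get_dummies()
print(dummies)
```

This avoids the quote replacement and the chain of string substitutions, at the cost of assuming every entry parses cleanly.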

In [70]:
df1 = pd.concat(
    [
        df1,
        (df.genres
           # Could be optimized
           .str.replace("'", '"', regex=False)
           .apply(json.loads)
           .apply(lambda row: [d['name'] for d in row])
           .astype(str)
           # Could be done in a single regex
           .str.replace("[", "", regex=False)
           .str.replace("]", "", regex=False)
           .str.replace("'", "", regex=False)
           .str.replace(" ", "", regex=False)
           .str.get_dummies(',')
        ),
    ],
    axis=1,
)

In [72]:
del df1['genres']
df1

Unnamed: 0,adult,runtime,vote_average,profit,budget,Action,Adventure,Animation,Comedy,Crime,...,History,Horror,Music,Mystery,Romance,ScienceFiction,TVMovie,Thriller,War,Western
0,False,81.0,7.7,343554033.0,30000000.0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,False,104.0,6.9,197797249.0,65000000.0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,False,101.0,6.5,0.0,0.0,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
3,False,127.0,6.1,65452156.0,16000000.0,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
4,False,106.0,5.7,76578911.0,0.0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,False,90.0,4.0,0.0,0.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
45462,False,360.0,9.0,0.0,0.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
45463,False,90.0,3.8,0.0,0.0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
45464,False,87.0,0.0,0.0,0.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [73]:
df1['adult'] = df1.adult.replace(('True', 'False'), (1, 0))

In [104]:
X = df1.copy()
X = X.dropna()
X = X.reset_index(drop=True)

In [105]:
y = X.profit
y

0        343554033.0
1        197797249.0
2                0.0
3         65452156.0
4         76578911.0
            ...     
45198            0.0
45199            0.0
45200            0.0
45201            0.0
45202            0.0
Name: profit, Length: 45203, dtype: float64

In [106]:
X = X.drop('profit', axis = 1)

In [107]:
sm.OLS(
    y,
    X
).fit().summary()

0,1,2,3
Dep. Variable:,profit,R-squared (uncentered):,0.392
Model:,OLS,Adj. R-squared (uncentered):,0.392
Method:,Least Squares,F-statistic:,1215.0
Date:,"Fri, 05 Feb 2021",Prob (F-statistic):,0.0
Time:,18:28:14,Log-Likelihood:,-856650.0
No. Observations:,45203,AIC:,1713000.0
Df Residuals:,45179,BIC:,1714000.0
Df Model:,24,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
adult,7.898e+05,1.37e+07,0.058,0.954,-2.61e+07,2.77e+07
runtime,-8151.5219,4751.134,-1.716,0.086,-1.75e+04,1160.780
vote_average,5.952e+05,8.51e+04,6.995,0.000,4.28e+05,7.62e+05
budget,1.8232,0.012,155.163,0.000,1.800,1.846
Action,-3.09e+06,6.16e+05,-5.013,0.000,-4.3e+06,-1.88e+06
Adventure,5.002e+06,8.03e+05,6.226,0.000,3.43e+06,6.58e+06
Animation,2.156e+06,1.05e+06,2.048,0.041,9.28e+04,4.22e+06
Comedy,-1.958e+06,4.61e+05,-4.250,0.000,-2.86e+06,-1.06e+06
Crime,-1.514e+06,7.07e+05,-2.142,0.032,-2.9e+06,-1.29e+05

0,1,2,3
Omnibus:,72868.511,Durbin-Watson:,1.967
Prob(Omnibus):,0.0,Jarque-Bera (JB):,185130015.682
Skew:,10.061,Prob(JB):,0.0
Kurtosis:,315.87,Cond. No.,1270000000.0


In [108]:
X['Other'] = (X['Fantasy'] | X['Foreign'] | X['Horror'] | X['Music'] | X['Mystery'] 
             | X['Romance'] | X['ScienceFiction'] | X['TVMovie'])
X

Unnamed: 0,adult,runtime,vote_average,budget,Action,Adventure,Animation,Comedy,Crime,Documentary,...,Horror,Music,Mystery,Romance,ScienceFiction,TVMovie,Thriller,War,Western,Other
0,0,81.0,7.7,30000000.0,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,104.0,6.9,65000000.0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,101.0,6.5,0.0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
3,0,127.0,6.1,16000000.0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
4,0,106.0,5.7,0.0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45198,0,90.0,4.0,0.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
45199,0,360.0,9.0,0.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
45200,0,90.0,3.8,0.0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
45201,0,87.0,0.0,0.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [109]:
X = X.drop(columns=['Fantasy', 'Foreign', 'Horror', 'Music', 'Mystery',
                    'Romance', 'ScienceFiction', 'TVMovie'])
X

Unnamed: 0,adult,runtime,vote_average,budget,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,History,Thriller,War,Western,Other
0,0,81.0,7.7,30000000.0,0,0,1,1,0,0,0,1,0,0,0,0,0
1,0,104.0,6.9,65000000.0,0,1,0,0,0,0,0,1,0,0,0,0,1
2,0,101.0,6.5,0.0,0,0,0,1,0,0,0,0,0,0,0,0,1
3,0,127.0,6.1,16000000.0,0,0,0,1,0,0,1,0,0,0,0,0,1
4,0,106.0,5.7,0.0,0,0,0,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45198,0,90.0,4.0,0.0,0,0,0,0,0,0,1,1,0,0,0,0,0
45199,0,360.0,9.0,0.0,0,0,0,0,0,0,1,0,0,0,0,0,0
45200,0,90.0,3.8,0.0,1,0,0,0,0,0,1,0,0,1,0,0,0
45201,0,87.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [110]:
X.dtypes

adult             int64
runtime         float64
vote_average    float64
budget          float64
Action            int64
Adventure         int64
Animation         int64
Comedy            int64
Crime             int64
Documentary       int64
Drama             int64
Family            int64
History           int64
Thriller          int64
War               int64
Western           int64
Other             int64
dtype: object

In [111]:
sm.OLS(
    y,
    X
).fit().summary()

0,1,2,3
Dep. Variable:,profit,R-squared (uncentered):,0.392
Model:,OLS,Adj. R-squared (uncentered):,0.392
Method:,Least Squares,F-statistic:,1715.0
Date:,"Fri, 05 Feb 2021",Prob (F-statistic):,0.0
Time:,18:28:14,Log-Likelihood:,-856650.0
No. Observations:,45203,AIC:,1713000.0
Df Residuals:,45186,BIC:,1713000.0
Df Model:,17,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
adult,6.602e+05,1.37e+07,0.048,0.962,-2.62e+07,2.75e+07
runtime,-8935.5562,4747.864,-1.882,0.060,-1.82e+04,370.336
vote_average,5.775e+05,8.48e+04,6.809,0.000,4.11e+05,7.44e+05
budget,1.8230,0.012,156.377,0.000,1.800,1.846
Action,-3.131e+06,6.04e+05,-5.181,0.000,-4.32e+06,-1.95e+06
Adventure,5.007e+06,7.96e+05,6.291,0.000,3.45e+06,6.57e+06
Animation,2.121e+06,1.05e+06,2.025,0.043,6.81e+04,4.17e+06
Comedy,-1.775e+06,4.52e+05,-3.928,0.000,-2.66e+06,-8.89e+05
Crime,-1.513e+06,6.98e+05,-2.166,0.030,-2.88e+06,-1.44e+05

0,1,2,3
Omnibus:,72867.64,Durbin-Watson:,1.966
Prob(Omnibus):,0.0,Jarque-Bera (JB):,184994933.845
Skew:,10.061,Prob(JB):,0.0
Kurtosis:,315.755,Cond. No.,1270000000.0


# 2.2 Movies Manual Regression

Use your `X` matrix and `y` vector from 2.1 to calculate the linear regression yourself using the normal equation $\hat{\beta} = (X^T X)^{-1}X^Ty$.

Verify that the coefficients are the same.

In [112]:
np.linalg.inv(X.T @ X) @ X.T @ y

0     6.601799e+05
1    -8.935556e+03
2     5.774740e+05
3     1.822961e+00
4    -3.131160e+06
5     5.007351e+06
6     2.121242e+06
7    -1.774638e+06
8    -1.512503e+06
9    -2.036216e+06
10   -2.697167e+06
11    1.918922e+06
12   -6.462149e+06
13   -3.507474e+06
14   -2.563437e+06
15   -3.974388e+06
16    1.555328e+05
dtype: float64
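The OLS summaries above report a condition number around `1.27e9`, so forming an explicit inverse can lose precision. The sketch below, on synthetic data (the names `X_demo`, `y_demo`, and `beta_true` are assumptions for illustration), compares the explicit inverse with `np.linalg.solve` and `np.linalg.lstsq`, which solve the same least-squares problem without inverting $X^T X$.

```python
import numpy as np

# Synthetic, well-conditioned regression problem for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ beta_true + 0.01 * rng.normal(size=200)

# Normal equation with an explicit inverse, as in the cell above.
beta_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Same system solved directly: (X^T X) beta = X^T y.
beta_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Least squares on X itself, without forming X^T X at all.
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(beta_inv, beta_solve, beta_lstsq)
```

On this small problem all three agree closely; on poorly scaled data such as the movies matrix, `lstsq` is usually the more robust choice.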

# 2.3 Movies gradient descent regression

Use your `X` matrix and `y` vector from 2.1 to calculate the linear regression yourself using **gradient descent**.

Hint: use `scipy.optimize` and remember we're finding the $\beta$ that minimizes the squared loss function of linear regression: $f(\beta) = \lVert X\beta - y \rVert^2$. This will look like part 3 of this lecture.

Verify your coefficients are similar to the ones in 2.1 and 2.2. They won't necessarily be exactly the same, but should be roughly similar.

In [113]:
import numpy as np

def CostFunction(betas, y, x):
    """
    Sum of squared residuals for linear regression
    Very slow naive Python version
    Input:
        betas is a np.array of parameters
        y is a one dimensional np.array of endogenous data
        x is a 2 dimensional np.array of exogenous data
    returns:
        sum of squared residuals (scalar)
    """
    result = 0
    #Sum operation
    for i in range(0, len(y)):
        #Get X_i * Beta value
        bx = np.dot(betas, x[i])
        #Add squared residual to the total
        result += (bx - y[i]) ** 2
    return result

In [114]:
import numpy as np

def CostFunction(betas, y, x):
    """
    Sum of squared residuals for linear regression
    Vectorized version
    Input:
        betas is a np.array of parameters
        y is a one dimensional np.array of endogenous data
        x is a 2 dimensional np.array of exogenous data
    returns:
        sum of squared residuals (scalar)
    """
    #Fitted values X * Beta for all rows at once
    xb = x @ betas
    #Sum of squared residuals
    return np.sum((xb - y) ** 2)

In [115]:
from scipy.optimize import minimize

In [116]:
X = X.to_numpy()

In [127]:
#create initial beta hat vector to start the minimization from
#the optimizer returns the least-squares beta parameters
#Arbitrarily initialized to all ones
bhat = np.ones(len(X[0]))

In [128]:
#Least-squares fit by numerical minimization of CostFunction
#Note: Powell's method is derivative-free, so this is not strictly gradient descent
probit_est = minimize(CostFunction, bhat, args=(y,X), method='powell', options = {'maxiter': 1000})

In [129]:
probit_est['x']

array([ 5.28950039e+05, -6.49500913e+03,  5.04559771e+05,  1.82175794e+00,
       -3.18476731e+06,  5.07486962e+06,  2.39556705e+06, -1.66339279e+06,
       -1.46606556e+06, -1.83596179e+06, -2.52300631e+06,  1.95433037e+06,
       -6.44940167e+06, -3.39691787e+06, -2.54519625e+06, -3.88736403e+06,
        1.34136086e+05])
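The `minimize` call above uses `method='powell'`, which does not use derivatives. For comparison, here is a minimal sketch of plain gradient descent on the squared loss, using the gradient $2 X^T (X\beta - y)$. It runs on synthetic, well-scaled data; the names `X_demo`, `y_demo`, and `gradient_descent`, the fixed step size, and the step count are assumptions for illustration. On the movies matrix, where `budget` is many orders of magnitude larger than the 0/1 genre columns, the features would likely need rescaling before this converges in a reasonable number of steps.

```python
import numpy as np

def squared_loss_grad(beta, X, y):
    """Gradient of ||X beta - y||^2 with respect to beta."""
    return 2 * X.T @ (X @ beta - y)

def gradient_descent(X, y, n_steps=500):
    """Fixed-step gradient descent starting from beta = 0."""
    # Step 1/L, where L = 2 * sigma_max(X)^2 bounds the curvature of the loss
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        beta = beta - step * squared_loss_grad(beta, X, y)
    return beta

# Synthetic regression problem for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=200)

beta_gd = gradient_descent(X_demo, y_demo)
beta_ref, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print(beta_gd, beta_ref)
```

The same gradient function could also be passed as `jac=` to `scipy.optimize.minimize` with a gradient-based method such as `'BFGS'`.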