# 1. Bisection


One of the most common algorithms for numerical root-finding is *bisection*.

To understand the idea, recall the well-known game where:

- Player A thinks of a secret number between 1 and 100  
- Player B asks if it’s less than 50  
  
  - If yes, B asks if it’s less than 25  
  - If no, B asks if it’s less than 75  
  

And so on.

This is bisection, a relative of [binary search](https://en.wikipedia.org/wiki/Binary_search_algorithm). It works for any continuous, increasing function $ f $ on an interval $ [a, b] $ with $ f(a) < 0 < f(b) $: each step halves the interval that must contain the root.
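
The guessing game can be sketched in a few lines. This is only an illustration: the function name `guess_number` and its parameters are made up for this example, and the question asked is "is the secret less than the midpoint?", with the midpoint chosen so that each answer discards about half of the remaining numbers.

```python
def guess_number(secret, low=1, high=100):
    """Find `secret` in [low, high] using only "is it less than m?" questions.

    Hypothetical helper for illustration; returns (number found, questions asked).
    """
    questions = 0
    while low < high:
        mid = (low + high + 1) // 2  # e.g. 51 for the range 1..100
        questions += 1
        if secret < mid:
            high = mid - 1  # the secret is in the lower half
        else:
            low = mid       # the secret is in the upper half
    return low, questions


print(guess_number(37))
```

Because the candidate range halves with every question, at most 7 questions are needed for 100 numbers.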

Write an implementation of the bisection algorithm, `bisect(f, lower, upper, tol)`, which, given a function `f`, a lower bound `lower`, and an upper bound `upper`, finds the point `x` where `f(x) = 0`. The parameter `tol` is a numerical tolerance: stop once your step size is smaller than `tol`.


Use it to find a root of the function:

$$
f(x) = \sin(4 (x - 1/4)) + x + x^{20} - 1 \tag{2}
$$

In Python: `lambda x: np.sin(4 * (x - 1/4)) + x + x**20 - 1`

The value where `f(x) = 0` should be around `0.408`.

In [14]:
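import numpy as np

def f(x):
    return np.sin(4 * (x - 1/4)) + x + x**20 - 1

def bisect(f, lower, upper, tol):
    # The root must be bracketed: f changes sign between lower and upper
    assert f(lower) * f(upper) < 0

    # Defined up front so a value is returned even if the loop never runs
    midpoint = (lower + upper) / 2
    while (upper - lower) / 2 > tol:
        midpoint = (lower + upper) / 2
        f_mid = f(midpoint)
        if f_mid == 0:
            return midpoint
        elif f(upper) * f_mid < 0:
            # Sign change between midpoint and upper: keep the upper half
            lower = midpoint
        else:
            upper = midpoint

    return midpoint

bisect(f, -1, 1, 0.001)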

0.408203125

# 1.2 (stretch) Recursive Bisect

Write a recursive version of the bisection algorithm
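
One possible shape for a recursive version is sketched below. It is a minimal sketch, not a reference solution: the name `bisect_recursive` is made up here, `math.sin` stands in for `np.sin` to keep the block self-contained, and the function assumes the caller passes a bracket where `f` changes sign.

```python
import math


def f(x):
    # Same function as in part 1, written with the standard library
    return math.sin(4 * (x - 1 / 4)) + x + x**20 - 1


def bisect_recursive(f, lower, upper, tol):
    """Recursive bisection; assumes f(lower) and f(upper) have opposite signs."""
    midpoint = (lower + upper) / 2
    f_mid = f(midpoint)
    # Base case: the half-width is within tolerance, or we hit the root exactly
    if (upper - lower) / 2 <= tol or f_mid == 0:
        return midpoint
    # Recursive case: keep the half in which the sign change occurs
    if f(upper) * f_mid < 0:
        return bisect_recursive(f, midpoint, upper, tol)
    return bisect_recursive(f, lower, midpoint, tol)


root = bisect_recursive(f, -1, 1, 1e-6)
print(root)
```

Each call halves the interval, so the recursion depth grows only logarithmically with `(upper - lower) / tol`; for the values above it is about 20 calls.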

# 2.1 Movies Regression

Write the best linear regression model you can on the [Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings.csv) to predict the profitability of a movie (revenue - budget). Maintain the interpretability of the model.

A few notes:

1. Clean your data! Movies where the budget or revenue is invalid should be thrown out.

2. Be creative with feature engineering. You can include processing to one-hot encode the type of movie, etc.

3. The model should be useful for someone **who is thinking about making a movie**. So features like the popularity can't be used. You could, however, use the ratings to figure out if making "good" or "Oscar bait" movies is a profitable strategy.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from ast import literal_eval

df = pd.read_csv('archive/movies_metadata.csv')


In [2]:
#DATA CLEANUP
#Drop unnecessary columns 

df = df.drop(columns=['belongs_to_collection', 'homepage', 'imdb_id', 'original_title', 'overview', 'poster_path',
                     'spoken_languages', 'tagline', 'video', 'production_countries', 'popularity'])

In [3]:
#Remove movies that are unreleased or cancelled

df = df[df['status'] == 'Released']

#Clean genres column and take the first genre of each movie for simplicity

df.genres = df.genres.apply(literal_eval)
df.genres = df.genres.apply(lambda x: x[0]['name'] if len(x) > 0 else '')

#Remove blank genre columns
df = df[(df.genres != '')]

#Create return column
df.budget = df.budget.astype('float')
df = df[(df.budget != 0) & (df.revenue != 0)]
df.adult = df.adult.replace('False', 0).replace('True', 1)
df['return'] = df.revenue - df.budget
df = df.reset_index().drop(columns=['index'])

#Get dummies for genre and language
genres = pd.get_dummies(df.genres)
lang = pd.get_dummies(df.original_language)

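#New DF for the regression variables that are to be concatenated with the dummies
#Note: 'revenue' is a component of the target (return = revenue - budget), so keeping it
#as a predictor leaks the target and is not known before a movie is made
Vars = ['adult', 'revenue', 'runtime', 'vote_average', 'return']
Var = df[Vars]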

#Concatenate everything
df = pd.concat([Var, genres, lang], axis=1)

#Get rid of duplicate columns (transpose, keep the first of each name, transpose back)
df = df.T.groupby(level=0).first().T

#One NAN found so just drop it
df = df.dropna()

In [4]:
#Finally make the model

X = df.drop(columns=['return'])
y = df['return']
x = sm.add_constant(X)

est = sm.OLS(y, x).fit()
est.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | return | R-squared: | 0.967 |
| Model: | OLS | Adj. R-squared: | 0.967 |
| Method: | Least Squares | F-statistic: | 2572.0 |
| Date: | Fri, 05 Feb 2021 | Prob (F-statistic): | 0.0 |
| Time: | 16:44:55 | Log-Likelihood: | -99039.0 |
| No. Observations: | 5364 | AIC: | 198200.0 |
| Df Residuals: | 5302 | BIC: | 198600.0 |
| Df Model: | 61 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.797e+07 | 3.92e+06 | -4.584 | 0.000 | -2.57e+07 | -1.03e+07 |
| Action | -9.51e+06 | 1.68e+06 | -5.666 | 0.000 | -1.28e+07 | -6.22e+06 |
| Adventure | -1.577e+07 | 1.92e+06 | -8.214 | 0.000 | -1.95e+07 | -1.2e+07 |
| Animation | -1.977e+07 | 2.55e+06 | -7.739 | 0.000 | -2.48e+07 | -1.48e+07 |
| Comedy | 3.144e+06 | 1.66e+06 | 1.895 | 0.058 | -1.08e+05 | 6.4e+06 |
| Crime | -2.622e+05 | 2.11e+06 | -0.124 | 0.901 | -4.41e+06 | 3.88e+06 |
| Documentary | 8.501e+06 | 3.85e+06 | 2.206 | 0.027 | 9.45e+05 | 1.61e+07 |
| Drama | 3.008e+06 | 1.66e+06 | 1.816 | 0.069 | -2.39e+05 | 6.26e+06 |
| Family | -9.124e+06 | 3.62e+06 | -2.522 | 0.012 | -1.62e+07 | -2.03e+06 |

| | | | |
|---|---|---|---|
| Omnibus: | 1372.049 | Durbin-Watson: | 1.839 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 14747.551 |
| Skew: | -0.91 | Prob(JB): | 0.0 |
| Kurtosis: | 10.916 | Cond. No. | 9.92e+21 |


In [5]:
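#Tried a model after removing all high p-value coefficients

#Collect predictors whose p-value in the first fit exceeds 0.05
#(the intercept is skipped because 'const' is not a column of X)
high_pvalue = [name for name, pvalue in est.pvalues.items()
               if pvalue > 0.05 and name != 'const']

X = X.drop(columns=high_pvalue)
y = df['return']
x = sm.add_constant(X)

est = sm.OLS(y, x).fit()
est.summary()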

| | | | |
|---|---|---|---|
| Dep. Variable: | return | R-squared: | 0.967 |
| Model: | OLS | Adj. R-squared: | 0.967 |
| Method: | Least Squares | F-statistic: | 9834.0 |
| Date: | Fri, 05 Feb 2021 | Prob (F-statistic): | 0.0 |
| Time: | 16:44:55 | Log-Likelihood: | -99053.0 |
| No. Observations: | 5364 | AIC: | 198100.0 |
| Df Residuals: | 5347 | BIC: | 198300.0 |
| Df Model: | 16 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.561e+07 | 3.19e+06 | -4.890 | 0.000 | -2.19e+07 | -9.35e+06 |
| Action | -1.186e+07 | 9.6e+05 | -12.348 | 0.000 | -1.37e+07 | -9.97e+06 |
| Adventure | -1.799e+07 | 1.36e+06 | -13.218 | 0.000 | -2.07e+07 | -1.53e+07 |
| Animation | -2.156e+07 | 2.24e+06 | -9.640 | 0.000 | -2.59e+07 | -1.72e+07 |
| Documentary | 6.133e+06 | 3.74e+06 | 1.640 | 0.101 | -1.2e+06 | 1.35e+07 |
| Family | -1.115e+07 | 3.49e+06 | -3.192 | 0.001 | -1.8e+07 | -4.3e+06 |
| Fantasy | -1.421e+07 | 2.2e+06 | -6.450 | 0.000 | -1.85e+07 | -9.89e+06 |
| Horror | 7.271e+06 | 1.5e+06 | 4.838 | 0.000 | 4.32e+06 | 1.02e+07 |
| Science Fiction | -9.354e+06 | 2.54e+06 | -3.680 | 0.000 | -1.43e+07 | -4.37e+06 |

| | | | |
|---|---|---|---|
| Omnibus: | 1379.766 | Durbin-Watson: | 1.842 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 14735.533 |
| Skew: | -0.919 | Prob(JB): | 0.0 |
| Kurtosis: | 10.909 | Cond. No. | 2890000000.0 |


# 2.2 Movies Manual Regression

Use your `X` and `y` matrix from 2.1 to calculate the linear regression yourself using the normal equation $(X^T X)^{-1}X^Ty$.

Verify that the coefficients are the same.
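
As a small illustration of the normal equation, the sketch below uses synthetic data rather than the movies dataset. The names `X_demo`, `y_demo`, and `beta_true` are made up for this example. It compares the literal formula with two alternatives that avoid forming an explicit inverse, which tends to be safer numerically when $X^T X$ is badly conditioned.

```python
import numpy as np

# Synthetic regression problem (assumed data, for illustration only)
rng = np.random.default_rng(0)
n = 200
X_demo = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
beta_true = np.array([1.0, 2.0, -3.0])
y_demo = X_demo @ beta_true + rng.normal(scale=0.1, size=n)

# Literal normal equation: (X^T X)^{-1} X^T y
beta_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Same solution from the linear system (X^T X) beta = X^T y, without an explicit inverse
beta_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Least-squares solver that works on X directly
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(beta_inv)
print(beta_solve)
print(beta_lstsq)
```

On a well-conditioned problem like this one, all three agree to floating-point precision and land close to `beta_true`.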

In [6]:
import numpy as np

np.linalg.inv(x.T @ x) @ x.T @ y

#Coefficients match!

0    -1.560550e+07
1    -1.185606e+07
2    -1.799015e+07
3    -2.156394e+07
4     6.133159e+06
5    -1.114811e+07
6    -1.420969e+07
7     7.270747e+06
8    -9.354353e+06
9    -1.191073e+07
10   -1.258946e+07
11   -6.643560e+06
12    1.044707e+07
13    8.407762e-01
14   -2.557538e+05
15    1.498028e+07
16    6.750593e+06
dtype: float64

# 2.3 Movies gradient descent regression

Use your `X` and `y` matrix from 2.1 to calculate the linear regression yourself using **gradient descent**. 

Hint: use `scipy.optimize` and remember we're finding the $\beta$ that minimizes the squared loss function of linear regression: $f(\beta) = \lVert X\beta - y \rVert^2$. This will look like part 3 of this lecture.

Verify your coefficients are similar to the ones in 2.1 and 2.2. They won't necessarily be exactly the same, but should be roughly similar.
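
For reference, a hand-written gradient descent loop on the squared loss might look like the sketch below. It runs on synthetic data, not the movies dataset, and every name in it (`X_demo`, `y_demo`, `mse_grad`, `step`) is an assumption of this example. The loss is divided by the number of rows, which leaves the minimizer unchanged but makes a fixed step size easier to choose.

```python
import numpy as np

# Synthetic regression problem (assumed data, for illustration only)
rng = np.random.default_rng(0)
n = 200
X_demo = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
beta_true = np.array([1.0, 2.0, -3.0])
y_demo = X_demo @ beta_true + rng.normal(scale=0.1, size=n)


def mse_loss(beta, X, y):
    """Mean squared loss: ||X beta - y||^2 / n."""
    residual = X @ beta - y
    return residual @ residual / len(y)


def mse_grad(beta, X, y):
    """Gradient of the mean squared loss: (2 / n) X^T (X beta - y)."""
    return 2 * X.T @ (X @ beta - y) / len(y)


beta_gd = np.zeros(X_demo.shape[1])
step = 0.1  # fixed step size, small enough for these roughly unit-scale features
for _ in range(500):
    beta_gd = beta_gd - step * mse_grad(beta_gd, X_demo, y_demo)

# Closed-form least-squares solution for comparison
beta_ls, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print(beta_gd)
print(beta_ls)
```

A fixed step like this only behaves well when the features are on similar scales. With columns as different in magnitude as revenue and 0/1 dummies, the features would likely need to be standardized first, or the same loss and gradient could be handed to `scipy.optimize.minimize` through its `fun` and `jac` arguments.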

In [13]:
from scipy.optimize import minimize

In [15]:
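bhat = np.zeros(x.shape[1])

#Squared loss of the linear model; minimize passes the parameter vector first, then args
def squared_loss(bhat, x, y):
    return np.sum((x @ bhat - y) ** 2)

#Powell is a derivative-free optimizer, so this minimizes the loss without using gradients
ols_est = minimize(squared_loss, bhat, args=(x, y), method='Powell')

ols_est['x']

#Coefficients do not match exactly but are roughly similar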

array([-1.73112427e+07, -1.15207434e+07, -1.77953009e+07, -2.18456926e+07,
        6.00496754e+06, -1.12555085e+07, -1.40249812e+07,  7.51478250e+06,
       -9.12604347e+06, -1.15524242e+07, -1.22085709e+07, -6.35160876e+06,
        1.15438536e+07,  8.40420332e-01, -2.72148045e+05,  1.57955460e+07,
        7.24684792e+06])