# 1. Bisection


One of the most common algorithms for numerical root-finding is *bisection*.

To understand the idea, recall the well-known game where:

- Player A thinks of a secret number between 1 and 100  
- Player B asks if it’s less than 50  
  
  - If yes, B asks if it’s less than 25  
  - If no, B asks if it’s less than 75  
  

And so on.

This is bisection, a relative of [binary search](https://en.wikipedia.org/wiki/Binary_search_algorithm). It works for any sufficiently well-behaved, increasing, continuous function $f$ on an interval $[a, b]$ with $f(a) < 0 < f(b)$.
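As a rough illustration of the guessing game, the sketch below finds a secret integer using only "is it less than m?" questions. The function name `guess_secret` and its parameters are made up for this example, and the split points differ slightly from the 50/25/75 questions in the description.

```python
def guess_secret(is_less_than, low=1, high=100):
    """Find a secret integer in [low, high] by repeated halving.

    `is_less_than(m)` must answer the question "is the secret less than m?".
    Returns the secret and the number of questions asked.
    """
    questions = 0
    while low < high:
        mid = (low + high + 1) // 2  # split the remaining range in two
        questions += 1
        if is_less_than(mid):
            high = mid - 1           # secret lies in [low, mid - 1]
        else:
            low = mid                # secret lies in [mid, high]
    return low, questions


secret = 37
found, n_questions = guess_secret(lambda m: secret < m)
print(found, n_questions)
```

Each question halves the remaining range, so any number between 1 and 100 is pinned down in at most 7 questions.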

Write an implementation of the bisection algorithm, `bisect(f, lower, upper, tol)`, which, given a function `f`, a lower bound `lower`, and an upper bound `upper`, finds the point `x` where `f(x) = 0`. The parameter `tol` is a numerical tolerance: you should stop once your step size is smaller than `tol`.


Use it to find a root of the function:

$$
f(x) = \sin(4 (x - 1/4)) + x + x^{20} - 1 \tag{2}
$$

In Python: `lambda x: np.sin(4 * (x - 1/4)) + x + x**20 - 1`

The root, where `f(x) = 0`, should be around `0.408`.

In [1]:
# Help from this link https://www.math.ubc.ca/~pwalls/math-python/roots-optimization/bisection/

In [80]:
import numpy as np
import pandas as pd
from sklearn import linear_model
import statsmodels.api as sm
import matplotlib.pyplot as plt

In [2]:
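def bisect(f, lower, upper, tol):
    # NOTE: in this implementation `tol` is used as the number of
    # bisection steps to take, not as a step-size tolerance.

    # Bisection needs a sign change on [lower, upper].
    if f(lower)*f(upper) >= 0:
        print("Bisection method fails")
        return None

    a = lower
    b = upper

    for n in range(1, tol+1):
        m_n = (a + b)/2  # midpoint of the current bracket
        f_m_n = f(m_n)

        if f(a) * f_m_n < 0:
            # Sign change in [a, m_n]: keep the left half.
            b = m_n

        elif f(b) * f_m_n < 0:
            # Sign change in [m_n, b]: keep the right half.
            a = m_n

        elif f_m_n == 0:
            print("Found exact solution.")
            return m_n

        else:
            print("Bisection method fails.")
            return None

    return (a + b)/2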

f = lambda x: np.sin(4 * (x - 1/4)) + x + x**20 - 1
res = bisect(f,-1,2, 100)
res


Found exact solution.


0.40829350427936706
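The `bisect` cell above passes `tol` to `range`, so it behaves as an iteration count. A variant that stops once the step size falls below `tol`, as the exercise statement asks, might look like the sketch below. The name `bisect_tol` and the `max_iter` safety cap are assumptions added for this example, and `math.sin` stands in for `np.sin` to keep the block self-contained.

```python
import math


def bisect_tol(f, lower, upper, tol=1e-8, max_iter=200):
    """Bisection that stops when half the bracket width drops below `tol`."""
    f_a, f_b = f(lower), f(upper)
    if f_a == 0:
        return lower
    if f_b == 0:
        return upper
    if f_a * f_b > 0:
        raise ValueError("f(lower) and f(upper) must have opposite signs")

    a, b = lower, upper
    for _ in range(max_iter):
        mid = (a + b) / 2
        f_mid = f(mid)
        # Stop on an exact hit or once the step size is below the tolerance.
        if f_mid == 0 or (b - a) / 2 < tol:
            return mid
        if f_a * f_mid < 0:
            b = mid               # sign change in [a, mid]
        else:
            a, f_a = mid, f_mid   # sign change in [mid, b]
    return (a + b) / 2


f = lambda x: math.sin(4 * (x - 1/4)) + x + x**20 - 1
root = bisect_tol(f, -1, 2, tol=1e-8)
print(root)
```

Caching `f_a` avoids re-evaluating the function at the left endpoint on every step.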

# 1.2 (stretch) Recursive Bisect

Write a recursive version of the bisection algorithm
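One possible shape for a recursive version is sketched below. The name `bisect_recursive` and the default `tol` are assumptions for this example; each call halves the bracket and recurses into the half that still contains a sign change.

```python
import math


def bisect_recursive(f, lower, upper, tol=1e-8):
    """Recursive bisection on a bracket [lower, upper] with a sign change."""
    f_lo, f_hi = f(lower), f(upper)
    if f_lo == 0:
        return lower
    if f_hi == 0:
        return upper
    if f_lo * f_hi > 0:
        raise ValueError("f(lower) and f(upper) must have opposite signs")

    mid = (lower + upper) / 2
    f_mid = f(mid)
    # Base case: exact root, or the step size is below the tolerance.
    if f_mid == 0 or (upper - lower) / 2 < tol:
        return mid
    # Recursive case: keep the half whose endpoints differ in sign.
    if f_lo * f_mid < 0:
        return bisect_recursive(f, lower, mid, tol)
    return bisect_recursive(f, mid, upper, tol)


f = lambda x: math.sin(4 * (x - 1/4)) + x + x**20 - 1
root_rec = bisect_recursive(f, -1, 2)
print(root_rec)
```

The bracket halves on every call, so the recursion depth grows only logarithmically in `(upper - lower) / tol` (about 28 levels here).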

# 2.1 Movies Regression

Write the best linear regression model you can on the [Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings.csv) to predict the profitability of a movie (revenue - budget). Maintain the interpretability of the model.

A few notes:

1. Clean your data! Movies where the budget or revenue is invalid should be thrown out.

2. Be creative with feature engineering. You can include processing to one-hot encode the type of movie, etc.

3. The model should be useful for someone **who is thinking about making a movie**. So features like the popularity can't be used. You could, however, use the ratings to figure out if making "good" or "oscar bait" movies is a profitable strategy.

In [None]:
# Help in various places from this link, mainly data clean/prep
# https://www.kaggle.com/hyejinjeon/predicting-movie-success-dense-neural-network

# Data Prep

In [179]:
df = pd.read_csv('/Users/kalebmckenzie/Documents/GitHub/3-5-optimization/movies_metadata.csv')

zero_rev = df[df['revenue'] == 0.0].index
df.drop(zero_rev, inplace = True)

zero_bud = df[df['budget'] == '0'].index
df.drop(zero_bud, inplace = True)

#Getting rid of random values in budget column

ran_bud = df[df['budget'] == '/ff9qCepilowshEtG2GYWwzt2bs4.jpg'].index
ran_bud2 = df[df['budget'] == '/zV8bHuSL6WXoD6FWogP9j4x80bL.jpg'].index
ran_bud3 = df[df['budget'] == '/zaSf5OG7V8X8gqFvly88zDdRm46.jpg'].index

df.drop(ran_bud, inplace = True)
df.drop(ran_bud2, inplace = True)
df.drop(ran_bud3, inplace = True)

#Dropping columns I don't need

df = df.drop(['id','belongs_to_collection', 'homepage', 'imdb_id', 
                'original_language', 'original_title', 'overview',
                'poster_path', 'production_countries', 'release_date', 
                'spoken_languages', 'status', 'tagline', 'title', 'video', 'popularity','adult'], axis=1)

#Cleaning and prep

df['budget'] = df['budget'].astype(float)

#Creation of profit col

df['profit'] = df['revenue'] - df['budget']

df['vote_average'] = df['vote_average'].fillna(0)

df['runtime'] = df['runtime'].fillna(0)

#Separating genres (keeps the first listed genre name)
s = pd.Series(df['genres'], dtype= str)

s1 = s.str.split(pat="'",expand=True)

df['genre_ed'] = s1[5]

#Separating companies (keeps the first listed company name)

c = pd.Series(df['production_companies'], dtype= str)

c1 = c.str.split(pat="'",expand=True)

df['prod_comp'] = c1[3]

#Dropping again

df = df.drop(['production_companies'], axis=1)

df = df.drop(['genres'], axis=1)

#Creating mean variables to filter by

mean_run = df['runtime'].mean()

mean_vote = df['vote_average'].mean()

mean_count = df['vote_count'].mean()

mean_profit = df['profit'].mean()

#Cutting out genres with less than 100 observations

df = df[~df['genre_ed'].isin(['Mystery', 'Family', 'Documentary', 'War', 'Music', 'Western', 'History', 'Foreign', 'TV Movie'])]
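Splitting on `'` and picking a fixed position, as the cell above does, depends on the exact layout of each string and can break when a name contains an apostrophe. A possibly more robust sketch parses the cell with `ast.literal_eval`. The helper name `first_name` and the sample strings are illustrative assumptions about the column format, not values taken from the dataset.

```python
import ast


def first_name(cell):
    """Return the 'name' of the first entry in a stringified list of dicts.

    Returns None when the cell is missing, malformed, or empty.
    """
    try:
        items = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return None
    if isinstance(items, list) and items and isinstance(items[0], dict):
        return items[0].get('name')
    return None


# Illustrative strings in the assumed format of the `genres` column.
example = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(first_name(example))
```

Under these assumptions, it could be applied column-wise with something like `df['genres'].apply(first_name)`.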

In [180]:
#Filtering data to only be looking at movies above the mean

df.drop(df[df['runtime'] < mean_run].index, inplace=True)
df.drop(df[df['vote_average'] < mean_vote].index, inplace=True)
df.drop(df[df['vote_count'] < mean_count].index, inplace=True)
df.drop(df[df['profit'] < mean_profit].index, inplace=True)
df

| | budget | revenue | runtime | vote_average | vote_count | profit | genre_ed | prod_comp |
|---|---|---|---|---|---|---|---|---|
| 5 | 60000000.0 | 1.874368e+08 | 170.0 | 7.7 | 1886.0 | 127436818.0 | Action | Regency Enterprises |
| 9 | 58000000.0 | 3.521940e+08 | 130.0 | 6.6 | 1194.0 | 294194034.0 | Adventure | United Artists |
| 15 | 52000000.0 | 1.161124e+08 | 178.0 | 7.8 | 1343.0 | 64112375.0 | Drama | Universal Pictures |
| 31 | 29500000.0 | 1.688400e+08 | 129.0 | 7.4 | 2470.0 | 139340000.0 | Science Fiction | Universal Pictures |
| 46 | 33000000.0 | 3.273119e+08 | 127.0 | 8.1 | 5915.0 | 294311859.0 | Crime | New Line Cinema |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 42168 | 40000000.0 | 1.715399e+08 | 122.0 | 6.7 | 2924.0 | 131539887.0 | Thriller | Thunder Road Pictures |
| 42170 | 97000000.0 | 6.168018e+08 | 137.0 | 7.6 | 6310.0 | 519801808.0 | Action | Twentieth Century Fox Film Corporation |
| 43255 | 250000000.0 | 1.238765e+09 | 136.0 | 6.8 | 3803.0 | 988764765.0 | Action | Universal Pictures |
| 43644 | 34000000.0 | 2.245113e+08 | 113.0 | 7.2 | 2083.0 | 190511319.0 | Action | Big Talk Productions |


In [181]:
df2 = df.copy()
df2 = df2.drop(['budget','revenue','genre_ed','prod_comp'],axis=1)

# Regression model with votes and runtimes

In [203]:
# Decided to leave out genres and production companies: they weren't
# significant in the model and were a little too messy to work with.
# As an illustration, the regression after this one includes them, to
# show a potential filmmaker which genres and production companies
# predict profitability.

# Decided to leave out the constant, as it yields a better R-squared
# (statsmodels then reports the uncentered R-squared),
# and the data fits well without one.

x = df2[['vote_count','vote_average','runtime']]
y = df2['profit']

m_model = sm.OLS(y,x).fit()

m_pred = m_model.predict(x)

m_model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | profit | R-squared (uncentered): | 0.76 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.758 |
| Method: | Least Squares | F-statistic: | 429.4 |
| Date: | Mon, 01 Feb 2021 | Prob (F-statistic): | 1.14e-125 |
| Time: | 19:27:36 | Log-Likelihood: | -8414.8 |
| No. Observations: | 410 | AIC: | 16840.0 |
| Df Residuals: | 407 | BIC: | 16850.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| vote_count | 7.769e+04 | 4406.835 | 17.629 | 0.000 | 6.9e+04 | 8.64e+04 |
| vote_average | -5.59e+07 | 1.06e+07 | -5.249 | 0.000 | -7.68e+07 | -3.5e+07 |
| runtime | 3.384e+06 | 5.76e+05 | 5.873 | 0.000 | 2.25e+06 | 4.52e+06 |

| | | | |
|---|---|---|---|
| Omnibus: | 209.922 | Durbin-Watson: | 1.988 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 2379.343 |
| Skew: | 1.902 | Prob(JB): | 0.0 |
| Kurtosis: | 14.172 | Cond. No. | 4280.0 |
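The summary above is labelled "R-squared (uncentered)", which is what gets reported when the model has no constant. The uncentered version compares the residuals with the raw sum of squares of `y` rather than with the variation around its mean, so it is not directly comparable to the usual R-squared of a model with an intercept. The sketch below uses synthetic data (the names `x_demo` and `y_demo` are made up) to show how far the two can diverge for the same fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a large intercept (illustrative only).
x_demo = rng.uniform(1, 10, size=200)
y_demo = 50.0 + 2.0 * x_demo + rng.normal(0, 1, size=200)

# Least-squares fit through the origin (no constant term).
beta = (x_demo @ y_demo) / (x_demo @ x_demo)
resid = y_demo - beta * x_demo
ssr = resid @ resid

# Uncentered R-squared: residuals compared with the raw sum of squares of y.
r2_uncentered = 1 - ssr / (y_demo @ y_demo)

# Centered R-squared: residuals compared with the variation around the mean.
dev = y_demo - y_demo.mean()
r2_centered = 1 - ssr / (dev @ dev)

print(r2_uncentered, r2_centered)
```

For the same residuals, the uncentered value is never smaller than the centered one, so a higher R-squared after dropping the constant does not by itself indicate a better fit.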


# Regression with genres and production companies

In [198]:
x1 = pd.get_dummies(df[['runtime','vote_average','vote_count','genre_ed','prod_comp']])
y1 = df['profit']

m_model = sm.OLS(y1,x1).fit()

m_pred = m_model.predict(x1)

m_model.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | profit | R-squared: | 0.625 |
| Model: | OLS | Adj. R-squared: | 0.464 |
| Method: | Least Squares | F-statistic: | 3.879 |
| Date: | Mon, 01 Feb 2021 | Prob (F-statistic): | 2.54e-21 |
| Time: | 19:19:56 | Log-Likelihood: | -8349.1 |
| No. Observations: | 410 | AIC: | 16950.0 |
| Df Residuals: | 286 | BIC: | 17440.0 |
| Df Model: | 123 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| runtime | 3.917e+06 | 8.59e+05 | 4.561 | 0.000 | 2.23e+06 | 5.61e+06 |
| vote_average | -9.862e+07 | 2.91e+07 | -3.389 | 0.001 | -1.56e+08 | -4.13e+07 |
| vote_count | 7.567e+04 | 7080.493 | 10.687 | 0.000 | 6.17e+04 | 8.96e+04 |
| genre_ed_Action | 2.369e+08 | 1.94e+08 | 1.219 | 0.224 | -1.45e+08 | 6.19e+08 |
| genre_ed_Adventure | 1.902e+08 | 1.97e+08 | 0.963 | 0.336 | -1.98e+08 | 5.79e+08 |
| genre_ed_Animation | 1.796e+08 | 2.35e+08 | 0.763 | 0.446 | -2.84e+08 | 6.43e+08 |
| genre_ed_Comedy | 2.274e+08 | 1.96e+08 | 1.160 | 0.247 | -1.58e+08 | 6.13e+08 |
| genre_ed_Crime | 1.479e+08 | 2.06e+08 | 0.717 | 0.474 | -2.58e+08 | 5.54e+08 |
| genre_ed_Drama | 1.844e+08 | 2.09e+08 | 0.882 | 0.378 | -2.27e+08 | 5.96e+08 |

| | | | |
|---|---|---|---|
| Omnibus: | 205.157 | Durbin-Watson: | 1.997 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 2261.922 |
| Skew: | 1.854 | Prob(JB): | 0.0 |
| Kurtosis: | 13.893 | Cond. No. | 7.64e+19 |
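The very large condition number (7.64e+19) suggests the design matrix is close to exactly collinear. One plausible contributor, offered here as an assumption rather than a diagnosis, is that two full sets of dummy columns each sum to one in every row. The sketch below uses a made-up toy frame (`toy`, `genre`, `studio` are hypothetical names) to show the effect and one common remedy: `drop_first=True` plus an explicit constant.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    'runtime': [120.0, 95.0, 130.0, 110.0, 101.0, 142.0],
    'genre': ['Action', 'Drama', 'Action', 'Comedy', 'Drama', 'Comedy'],
    'studio': ['A', 'B', 'B', 'A', 'A', 'B'],
})

# Full one-hot encoding: the genre columns and the studio columns both sum
# to 1 in every row, so the columns are linearly dependent.
full = pd.get_dummies(toy, columns=['genre', 'studio'], dtype=float)

# Dropping one level per category and adding a constant removes the
# redundancy; each coefficient is then relative to the dropped level.
reduced = pd.get_dummies(toy, columns=['genre', 'studio'],
                         drop_first=True, dtype=float)
reduced.insert(0, 'const', 1.0)

rank_full = np.linalg.matrix_rank(full.to_numpy())
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())
print(full.shape[1], rank_full)
print(reduced.shape[1], rank_reduced)
```

A design matrix whose rank is below its column count has no unique least-squares solution, which also makes individual dummy coefficients hard to interpret.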


# 2.2 Movies Manual Regression

Use your `X` and `y` matrix from 2.1 to calculate the linear regression yourself using the normal equation $(X^T X)^{-1}X^Ty$.

Verify that the coefficients are the same.

In [183]:
import numpy as np

In [184]:
manual_reg = np.linalg.inv(x.T.dot(x)).dot(x.T).dot(y)
manual_reg

array([    77687.61309268, -55903642.59426895,   3383939.7566597 ])

In [None]:
#Same coefs found!
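Forming the explicit inverse works here, but solving the linear system $X^T X \beta = X^T y$ directly is usually preferred numerically. The sketch below compares the explicit inverse with `np.linalg.solve` and `np.linalg.lstsq` on synthetic data. The names `X_demo` and `y_demo` are made up so the example does not overwrite the notebook's `x` and `y`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem (illustrative only).
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Normal equation with an explicit inverse, as in the cell above.
beta_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# Same equation, solved as a linear system without forming the inverse.
beta_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)

# Least-squares solver that works on X directly.
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(beta_inv, beta_solve, beta_lstsq)
```

All three agree on a well-conditioned problem; the differences only start to matter when the columns of `X` are nearly collinear.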

# 2.3 Movies gradient descent regression

Use your `X` and `y` matrix from 2.1 to calculate the linear regression yourself using **gradient descent**. 

Hint: use `scipy.optimize` and remember we're finding the $\beta$ that minimizes the squared loss function of linear regression: $f(\beta) = \lVert X\beta - y \rVert^2$. This will look like part 3 of this lecture.

Verify your coefficients are similar to the ones in 2.1 and 2.2. They won't necessarily be exactly the same, but should be roughly similar.

In [199]:
from scipy.optimize import minimize

In [200]:
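beta0 = np.zeros(x.shape[1])  # starting guess for the coefficients

def squared_loss(beta, X, y):
    # Sum of squared residuals for the coefficient vector `beta`.
    return np.sum((X @ beta - y) ** 2)

# Powell's method is derivative-free: it minimizes the loss without
# evaluating its gradient.
G = minimize(squared_loss, beta0, args=(x, y), method='Powell')

G['x']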

array([    77687.61312404, -55903642.38171674,   3383939.75623857])

In [None]:
# Coefficients found to be similar but not the same
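The cell above uses Powell's method, which does not use gradients. For comparison, a plain gradient descent loop on the squared loss might look like the sketch below. It runs on synthetic data (`X_demo` and `y_demo` are made-up names), and the step size rule is an assumption chosen for this example. On the movie data the features have very different scales, so the columns would probably need rescaling before a fixed step size converges in a reasonable number of iterations.

```python
import numpy as np


def gradient_descent_ols(X, y, n_iter=500):
    """Minimize ||X beta - y||^2 with fixed-step gradient descent."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # The gradient 2 X^T (X beta - y) has Lipschitz constant
    # 2 * lambda_max(X^T X); its reciprocal is a safe step size.
    step = 1.0 / (2.0 * np.linalg.eigvalsh(X.T @ X).max())
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - y)
        beta = beta - step * grad
    return beta


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

beta_gd = gradient_descent_ols(X_demo, y_demo)
beta_ls, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print(beta_gd, beta_ls)
```

On this well-conditioned example the gradient descent estimate matches the least-squares solution closely.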