In [2]:
import numpy as np 
import pandas as pd 
import statsmodels.api as sm

# 1. Bisection


One of the most common algorithms for numerical root-finding is *bisection*.

To understand the idea, recall the well-known game where:

- Player A thinks of a secret number between 1 and 100  
- Player B asks if it’s less than 50  
  
  - If yes, B asks if it’s less than 25  
  - If no, B asks if it’s less than 75  
  

And so on.

This is bisection, a relative of [binary search](https://en.wikipedia.org/wiki/Binary_search_algorithm). It works for all sufficiently well behaved increasing continuous functions with $ f(a) < 0 < f(b) $. 

Write an implementation of the bisection algorithm, `bisect(f, lower, upper, tol)`, which, given a function `f`, a lower bound `lower`, and an upper bound `upper`, finds the point `x` where `f(x) = 0`. The parameter `tol` is a numerical tolerance: you should stop once your step size is smaller than `tol`.


Use it to find the root of the function:

$$
f(x) = \sin(4 (x - 1/4)) + x + x^{20} - 1 \tag{2}
$$

In Python: `lambda x: np.sin(4 * (x - 1/4)) + x + x**20 - 1`

The value where `f(x) = 0` should be around `0.408`.

In [3]:
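def bisection(f, lower, upper, tol):
    # Assumes f(lower) and f(upper) have opposite signs
    mid = (lower + upper) / 2
    # Stops once |f(mid)| <= tol (a test on the residual, not on the step size)
    while abs(f(mid)) > tol:
        mid = (lower + upper) / 2
        if f(upper) * f(mid) < 0:
            # The sign change lies in [mid, upper]
            lower = mid
        else:
            # The sign change lies in [lower, mid]
            upper = mid
    return mid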


f = lambda x: np.sin(4*(x-1/4))+x+x**20-1
tol=0.00001
lower=-1
upper=10
bisection(f,lower,upper,tol)

0.4082913398742676
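
The task statement asks to stop once the step size drops below `tol`, whereas the cell above stops on `|f(mid)|`. A minimal sketch of the step-size variant follows; the sign check, the default `tol`, and the bracket `[0, 1]` are assumptions added for illustration, and `math.sin` stands in for `np.sin` so the block is self-contained.

```python
import math

def bisect(f, lower, upper, tol=1e-6):
    """Return x in [lower, upper] with f(x) close to 0.

    Assumes f is continuous and changes sign on [lower, upper].
    """
    f_lower, f_upper = f(lower), f(upper)
    if f_lower * f_upper > 0:
        raise ValueError("f(lower) and f(upper) must have opposite signs")
    # Halve the bracket until it is narrower than tol.
    while upper - lower > tol:
        mid = (lower + upper) / 2
        f_mid = f(mid)
        if f_mid == 0:
            return mid
        if f_lower * f_mid < 0:
            upper = mid                    # sign change is in [lower, mid]
        else:
            lower, f_lower = mid, f_mid    # sign change is in [mid, upper]
    return (lower + upper) / 2

f = lambda x: math.sin(4 * (x - 1/4)) + x + x**20 - 1
root = bisect(f, 0.0, 1.0, tol=1e-8)
print(round(root, 3))  # 0.408
```

Caching `f_lower` means each iteration evaluates `f` only once, which matters when `f` is expensive.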

# 1.2 (stretch) Recursive Bisect

Write a recursive version of the bisection algorithm

In [4]:
#recursive version: the result of the recursive call is returned,
#so the refined midpoint propagates back to the first caller
def bisection2(f,lower,upper,tol):
    mid = (lower + upper)/2
    if abs(f(mid)) > tol:
        if f(upper) * f(mid) < 0:
            lower = mid
        else:
            upper = mid
        root = bisection2(f,lower,upper,tol)
        #trace of the midpoints, printed from the deepest call outward
        print(mid)
        return root
    return mid
tol=0.0001
lower=-1
upper=1
bisection2(f,lower,upper,tol)

0.40826416015625
0.4083251953125
0.408447265625
0.40869140625
0.4091796875
0.408203125
0.41015625
0.4140625
0.421875
0.40625
0.4375
0.375
0.25
0.5
0.0


0.408294677734375

# 2.1 Movies Regression

Write the best linear regression model you can on the [Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset?select=ratings.csv) to predict the profitability of a movie (revenue - budget). Maintain the interpretability of the model.

A few notes:

1. Clean your data! Movies where the budget or revenue are invalid should be thrown out

2. Be creative with feature engineering. You can include processing to one-hot encode the type of movie, etc.

3. The model should be useful for someone **who is thinking about making a movie**. So features like the popularity can't be used. You could, however, use the ratings to figure out if making "good" or "oscar bait" movies is a profitable strategy.
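
Note 2 mentions one-hot encoding the type of movie. A minimal sketch of that step, assuming the `genres` column holds a stringified list of dicts with a `name` key (the sample rows below are hypothetical stand-ins, and `parse_genres` is an illustrative helper, not part of the dataset or of pandas):

```python
import ast

def parse_genres(raw):
    """Turn a stringified list of {'id', 'name'} dicts into a list of genre names."""
    try:
        items = ast.literal_eval(raw)   # safer than eval: literals only
    except (ValueError, SyntaxError, TypeError):
        return []
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items
            if isinstance(item, dict) and "name" in item]

# Hypothetical sample rows in the assumed format.
rows = [
    "[{'id': 16, 'name': 'Animation'}, {'id': 10751, 'name': 'Family'}]",
    "[{'id': 878, 'name': 'Science Fiction'}, {'id': 10770, 'name': 'TV Movie'}]",
    "[]",
]

parsed = [parse_genres(r) for r in rows]
genres = sorted({g for names in parsed for g in names})
# One row per movie, one 0/1 column per genre.
one_hot = [[int(g in names) for g in genres] for names in parsed]

print(genres)
print(one_hot)
```

Working on lists of names rather than on a space-separated string keeps multi-word genres such as "TV Movie" in a single column.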

In [5]:
df=pd.read_csv('data/movies_metadata.csv')


In [6]:

#some of the rows contain 'jpg' in the budget column
df_meta=df.drop(df.loc[df.budget.str.contains('jpg')].index)

#df_meta=df_meta.loc[df_meta.revenue>0]
df_meta=df_meta.reset_index()
#object-->float (could also be int)
df_meta.budget=df_meta.budget.astype(float)
df_meta.revenue=df_meta.revenue.astype(float)


#making a new genre column
df_meta['genre']=0

for i in range(len(df_meta)):
    if len(eval(df_meta.genres[i]))>0:
        df_meta['genre'].iloc[i]=pd.DataFrame(eval(df_meta.genres[i])).name.values
    else:
        df_meta['genre'].iloc[i]=np.array(['no_genre'])

df_meta2=df_meta.genre.astype(str)
df_meta2=df_meta2.str.replace('[','')
df_meta2=df_meta2.str.replace(']','')

#Science Fiction is one genre, not two
df_meta2=df_meta2.str.replace('Science Fiction','Science-Fiction')
#putting a * where there are spaces
#(note: this also splits 'TV Movie' into separate 'TV' and 'Movie' columns)
df_meta2=df_meta2.str.replace(' ','*')
df_meta2=df_meta2.str.replace("'",'')
df_meta2=df_meta2.str.replace("\n",'')
d=df_meta2.str.get_dummies(sep='*')

In [7]:
df.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

In [18]:
import itertools
df2=df_meta[['original_title','adult','budget','runtime']]

df_movie=df2.join(d)
df_meta['revenue']-df_meta['budget']


#not keeping the language because movies can
#be translated
df_movie.loc[df_movie.adult=='False','adult']=0
df_movie.loc[df_movie.adult=='True','adult']=1

df_movie['profit']=df_meta['revenue']-df_meta['budget']
#df_movie=df_movie.loc[df_meta['budget']>5000000]

df_movie=df_movie.dropna()

y=df_movie['profit']
X=df_movie.drop(['original_title','profit'],axis=1)
X = sm.add_constant(X)

X=X.drop(['no_genre'],axis=1)


cols=X.columns.drop(['const','budget', 'runtime'])
#cols=X.columns.drop(['const', 'runtime'])

#number of genres combined in each interaction term
n=2

#The interactions between the genres. 
interaction=False
L=list(itertools.combinations(cols, n))

if interaction==True:
    if n==2:
        for (a,b) in L:
            X[(a+'_'+b)]=X[a]*X[b]

    if n==3:
        for (a,b,c) in L:
            X[(a+'_'+b+'_'+c)]=X[a]*X[b]*X[c]

# If I want both 2- and 3-genre combinations
'''            
L=list(itertools.combinations(cols, 2))
for (a,b) in L:
    X[(a+'_'+b)]=X[a]*X[b]
L=list(itertools.combinations(cols, 3))
for (a,b,c) in L:
    X[(a+'_'+b+'_'+c)]=X[a]*X[b]*X[c]
'''

# Drop the columns whose p-value is > 0.1, one by one,
# except the constant term
col_list=[]

reg=sm.OLS(y, X).fit()
#loop while more than one non-constant p-value is above 0.1
#(note: the `!= 1` test leaves one such column in the model)
while len(reg.pvalues.loc[(reg.pvalues>0.1)&(reg.pvalues.index!='const')])!= 1:

    #the column with the highest p-value, which we will drop
    col=reg.pvalues.loc[(reg.pvalues==reg.pvalues.max())&(reg.pvalues.index!='const') ].index[0]
    col_list.append(col)
    X=X.drop([col],axis=1)
    reg=sm.OLS(y, X).fit()
    print('droped column:',col)


# When running the loop, skip the next line and set
# interaction to False, otherwise it takes a long time to run.
# Note: the loop needs a fix for the last column only, so I
# removed that column manually by adding it at the end of
# the col_list list.


reg=sm.OLS(y, X).fit()
reg.summary()




droped column: adult
droped column: Movie
droped column: TV
droped column: Music
droped column: Documentary
droped column: Mystery
droped column: Foreign
droped column: Horror
droped column: Science-Fiction
droped column: Fantasy


| | | | |
|---|---|---|---|
| Dep. Variable: | profit | R-squared: | 0.381 |
| Model: | OLS | Adj. R-squared: | 0.38 |
| Method: | Least Squares | F-statistic: | 1983.0 |
| Date: | Wed, 03 Feb 2021 | Prob (F-statistic): | 0.0 |
| Time: | 13:15:39 | Log-Likelihood: | -856670.0 |
| No. Observations: | 45203 | AIC: | 1713000.0 |
| Df Residuals: | 45188 | BIC: | 1714000.0 |
| Df Model: | 14 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -8.179e+05 | 5.76e+05 | -1.419 | 0.156 | -1.95e+06 | 3.12e+05 |
| budget | 1.8222 | 0.012 | 155.968 | 0.000 | 1.799 | 1.845 |
| runtime | 1.614e+04 | 5337.763 | 3.023 | 0.003 | 5674.059 | 2.66e+04 |
| Action | -2.781e+06 | 6.04e+05 | -4.605 | 0.000 | -3.96e+06 | -1.6e+06 |
| Adventure | 5.256e+06 | 7.95e+05 | 6.608 | 0.000 | 3.7e+06 | 6.82e+06 |
| Animation | 4.226e+06 | 1.04e+06 | 4.050 | 0.000 | 2.18e+06 | 6.27e+06 |
| Comedy | -6.852e+05 | 4.55e+05 | -1.507 | 0.132 | -1.58e+06 | 2.06e+05 |
| Crime | -1.179e+06 | 6.96e+05 | -1.693 | 0.090 | -2.54e+06 | 1.86e+05 |
| Drama | -1.738e+06 | 4.23e+05 | -4.112 | 0.000 | -2.57e+06 | -9.09e+05 |

| | | | |
|---|---|---|---|
| Omnibus: | 72754.958 | Durbin-Watson: | 1.965 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 184316172.182 |
| Skew: | 10.027 | Prob(JB): | 0.0 |
| Kurtosis: | 315.183 | Cond. No. | 122000000.0 |


In [21]:
reg.params.nlargest(5)

Adventure    5.256484e+06
Animation    4.225695e+06
Family       2.272466e+06
Romance      9.828068e+05
runtime      1.613616e+04
dtype: float64
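
The elimination loop in cell `In [18]` stops via a `!= 1` test, which leaves one column with a p-value above 0.1 in the model. The sketch below shows one way to write backward elimination so it runs until every droppable p-value is at or below the threshold. It is a NumPy-only illustration on synthetic data, not the notebook's method: `ols_pvalues` and `backward_eliminate` are hypothetical helpers, and the p-values use a normal approximation to the t distribution (reasonable for large samples) instead of the exact values statsmodels reports.

```python
import math
import numpy as np

def ols_pvalues(X, y):
    """OLS coefficients with two-sided p-values (normal approximation, large n)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)                        # residual variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))  # standard errors
    pvals = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in beta / se])
    return beta, pvals

def backward_eliminate(X, y, names, keep=("const",), alpha=0.1):
    """Drop the least significant column until all droppable p-values are <= alpha."""
    names = list(names)
    while True:
        beta, pvals = ols_pvalues(X, y)
        droppable = [i for i, name in enumerate(names) if name not in keep]
        if not droppable:
            break
        worst = max(droppable, key=lambda i: pvals[i])
        if pvals[worst] <= alpha:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return names, beta

# Synthetic data (hypothetical): y depends on x1 only, x2 is pure noise.
rng = np.random.default_rng(0)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 2.0 + 3.0 * x1 + rng.normal(size=n)

kept, beta = backward_eliminate(X, y, ["const", "x1", "x2"])
print(kept)
```

Selecting the worst column only among the droppable ones avoids the case where the constant holds the largest p-value and no column can be picked.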

In [29]:

import itertools
df2=df_meta[['original_title','adult','budget','runtime']]


df_movie=df2.join(d)
df_meta['revenue']-df_meta['budget']


#not keeping the language because movies can
#be translated
df_movie.loc[df_movie.adult=='False','adult']=0
df_movie.loc[df_movie.adult=='True','adult']=1

df_movie['profit']=df_meta['revenue']-df_meta['budget']
#df_movie=df_movie.loc[df_meta['budget']>5000000]

df_movie=df_movie.dropna()

y=df_movie['profit']
X=df_movie.drop(['original_title','profit'],axis=1)
X = sm.add_constant(X)

#X=X.drop(['adult','Movie','TV','Documentary','Music',\
#    'Mystery','Fantasy','Foreign','no_genre','Horror',\
#        'Science-Fiction','Comedy','Romance'],axis=1)

X=X.drop(['no_genre'],axis=1)


cols=X.columns.drop(['const','budget', 'runtime'])
#cols=X.columns.drop(['const', 'runtime'])

#number of genres combined in each interaction term
#example: n=2 makes combinations
#of 2 genres
n=2

interaction=True
L=list(itertools.combinations(cols, n))

if interaction==True:
    if n==2:
        for (a,b) in L:
            X[(a+'_'+b)]=X[a]*X[b]

    if n==3:
        for (a,b,c) in L:
            X[(a+'_'+b+'_'+c)]=X[a]*X[b]*X[c]
'''            
L=list(itertools.combinations(cols, 2))
for (a,b) in L:
    X[(a+'_'+b)]=X[a]*X[b]
L=list(itertools.combinations(cols, 3))
for (a,b,c) in L:
    X[(a+'_'+b+'_'+c)]=X[a]*X[b]*X[c]
'''



# The columns whose p-value is > 0.1 are dropped one by one,
# except the constant term. This is the list the loop
# produced for the combinations of 2 genres.
col_list=['adult_Drama',
 'Documentary_Mystery',
 'adult_Romance',
 'Animation_Crime',
 'Foreign_TV',
 'Foreign_Movie',
 'Fantasy_Foreign',
 'Family_Foreign',
 'adult_Music',
 'adult_Crime',
 'adult_Comedy',
 'adult',
 'Documentary_Fantasy',
 'Crime_Western',
 'Horror_War',
 'TV',
 'Movie',
 'Movie_TV',
 'Documentary_War',
 'Music_Western',
 'adult_Movie',
 'adult_Adventure',
 'Documentary_Western',
 'Animation_War',
 'Drama_TV',
 'Drama_Movie',
 'adult_Thriller',
 'Horror_Western',
 'Documentary_Science-Fiction',
 'adult_Action',
 'adult_Family',
 'adult_Mystery',
 'adult_Foreign',
 'adult_History',
 'adult_Documentary',
 'adult_War',
 'adult_Animation',
 'adult_Western',
 'adult_TV',
 'adult_Fantasy',
 'Science-Fiction',
 'Foreign_Horror',
 'Documentary_Foreign',
 'Mystery',
 'Animation_History',
 'adult_Horror',
 'Foreign_War',
 'Family_History',
 'Mystery_War',
 'Movie_Western',
 'TV_Western',
 'Documentary_Movie',
 'Documentary_TV',
 'Documentary_Horror',
 'Drama_History',
 'Comedy_TV',
 'Comedy_Movie',
 'Drama_Romance',
 'Movie_Mystery',
 'Mystery_TV',
 'Romance_Western',
 'Comedy_Western',
 'Crime_Documentary',
 'Action_Mystery',
 'Fantasy_War',
 'Horror',
 'Documentary_Thriller',
 'adult_Science-Fiction',
 'Foreign_Mystery',
 'Family_Romance',
 'Family_Horror',
 'Action_Crime',
 'Crime_History',
 'Music_Mystery',
 'Crime_Movie',
 'Crime_TV',
 'Science-Fiction_TV',
 'Movie_Science-Fiction',
 'Foreign_Science-Fiction',
 'Horror_Music',
 'Music_War',
 'Horror_Movie',
 'Horror_TV',
 'Fantasy_History',
 'History_Science-Fiction',
 'Foreign_Western',
 'Action',
 'Comedy_Music',
 'Crime_Music',
 'Foreign',
 'Foreign_Thriller',
 'Family_Thriller',
 'Family_Movie',
 'Family_TV',
 'Animation_Music',
 'Comedy_War',
 'Horror_Romance',
 'Drama_Horror',
 'Documentary_Drama',
 'Documentary_Romance',
 'Animation_Documentary',
 'Family_Mystery',
 'Comedy_Mystery',
 'History_Music',
 'Documentary',
 'History_Movie',
 'History_TV',
 'Comedy_Foreign',
 'Foreign_Music',
 'Horror_Science-Fiction',
 'Crime_Science-Fiction',
 'Comedy_Documentary',
 'War',
 'Comedy_Romance',
 'Romance_TV',
 'Movie_Romance',
 'Documentary_Music',
 'Fantasy_Music',
 'Family_Western',
 'Action_Movie',
 'Action_TV',
 'History_Thriller',
 'Comedy_Horror',
 'Adventure_Romance',
 'Family_War',
 'Drama_Foreign',
 'Foreign_Romance',
 'Mystery_Western',
 'Crime_Drama',
 'Crime',
 'Adventure_Documentary',
 'History_Horror',
 'Mystery_Science-Fiction',
 'Action_Documentary',
 'Adventure_Crime',
 'Thriller_Western',
 'Fantasy_TV',
 'Fantasy_Movie',
 'History_Mystery',
 'Action_War',
 'Movie_Music',
 'Music_TV',
 'Animation_Mystery',
 'Fantasy_Mystery',
 'Fantasy_Horror',
 'Foreign_History',
 'Crime_Foreign',
 'Adventure_War',
 'Romance_War',
 'Western',
 'Animation_Foreign',
 'TV_Thriller',
 'Movie_Thriller',
 'Adventure_Mystery',
 'Crime_Mystery',
 'Mystery_Thriller',
 'Horror_Mystery',
 'Comedy_History',
 'Drama_War',
 'Comedy',
 'Comedy_Drama',
 'Music_Thriller',
 'Movie_War',
 'TV_War',
 'Music_Romance',
 'Music_Science-Fiction',
 'Animation_Horror',
 'Comedy_Science-Fiction',
 'Animation_Movie',
 'Animation_TV',
 'Animation_Thriller',
 'Crime_Family',
 'Animation_Romance',
 'Crime_War',
 'Thriller_War',
 'Music',
 'Mystery_Romance',
 'History_Western',
 'Crime_Romance',
 'Family_Music',
 'Documentary_History',
 'Comedy_Crime',
 'Action_Science-Fiction',
 'Action_Horror',
 'Thriller',
 'Drama_Thriller',
 'Fantasy_Science-Fiction',
 'Drama_Mystery',
 'Crime_Horror',
 'Drama_Music',
 'Action_Foreign',
 'Adventure_Foreign',
 'Drama_Science-Fiction',
 'Action_Drama',
 'History',
 'Fantasy_Western',
 'Comedy_Thriller',
 'Adventure_TV',
 'Adventure_Movie']

'''
# I drop the columns where the p_value is >0.1 one by one 
#except the constant term 
col_list=[]

reg=sm.OLS(y, X).fit()
#while there are p values higher than 0.1
while len(reg.pvalues.loc[(reg.pvalues>0.1)&(reg.pvalues.index!='const')])!= 1:

    #the column with the highest p values that we will drop
    col=reg.pvalues.loc[(reg.pvalues==reg.pvalues.max())&(reg.pvalues.index!='const') ].index[0]
    col_list.append(col)
    X=X.drop([col],axis=1)
    reg=sm.OLS(y, X).fit()
    print(col)
'''
# When running the loop, skip the next line and set
# interaction to False, otherwise it takes a long time to run.
# Note: the loop needs a fix for the last column only, so I
# removed that column manually by adding it at the end of
# the col_list list.

# To try the loop, remove the next line
#  ( X=X.drop(col_list,axis=1) )
X=X.drop(col_list,axis=1)

reg=sm.OLS(y, X).fit()
reg.summary()



| | | | |
|---|---|---|---|
| Dep. Variable: | profit | R-squared: | 0.389 |
| Model: | OLS | Adj. R-squared: | 0.388 |
| Method: | Least Squares | F-statistic: | 513.2 |
| Date: | Wed, 03 Feb 2021 | Prob (F-statistic): | 0.0 |
| Time: | 13:42:47 | Log-Likelihood: | -856370.0 |
| No. Observations: | 45203 | AIC: | 1713000.0 |
| Df Residuals: | 45146 | BIC: | 1713000.0 |
| Df Model: | 56 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.386e+06 | 5.49e+05 | -2.523 | 0.012 | -2.46e+06 | -3.09e+05 |
| budget | 1.7821 | 0.012 | 149.010 | 0.000 | 1.759 | 1.806 |
| runtime | 1.421e+04 | 5290.000 | 2.687 | 0.007 | 3843.465 | 2.46e+04 |
| Adventure | 4.965e+06 | 1.75e+06 | 2.839 | 0.005 | 1.54e+06 | 8.39e+06 |
| Animation | -3.617e+06 | 1.82e+06 | -1.986 | 0.047 | -7.19e+06 | -4.71e+04 |
| Drama | -1.367e+06 | 4.39e+05 | -3.112 | 0.002 | -2.23e+06 | -5.06e+05 |
| Family | -7.528e+06 | 1.96e+06 | -3.839 | 0.000 | -1.14e+07 | -3.68e+06 |
| Fantasy | 3.939e+06 | 1.9e+06 | 2.071 | 0.038 | 2.11e+05 | 7.67e+06 |
| Romance | 1.108e+06 | 6.1e+05 | 1.817 | 0.069 | -8.7e+04 | 2.3e+06 |

| | | | |
|---|---|---|---|
| Omnibus: | 72129.62 | Durbin-Watson: | 1.967 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 176937980.471 |
| Skew: | 9.85 | Prob(JB): | 0.0 |
| Kurtosis: | 308.868 | Cond. No. | 1470000000.0 |


In [11]:
reg.params.nlargest(5)

Adventure_Science-Fiction    1.971267e+07
Animation_Family             1.923102e+07
Action_Music                 1.760739e+07
Animation_Comedy             1.546798e+07
Crime_Fantasy                1.450973e+07
dtype: float64

# Profitability of a movie

The linear regression with y = profit gives the following coefficients (estimate $\pm$ standard error):

budget: (1.7821 $\pm$ 0.012) 

runtime: (1.421e+04 $\pm$ 5290.000)

Holding the other features fixed, each additional \$ of budget is associated with (1.7821 $\pm$ 0.012) \$ of additional profit, and each extra minute of runtime with (1.421e+04 $\pm$ 5290.000) \$ of additional profit.



For the movie's genre, the best combination of genres is Adventure_Science-Fiction: it has the highest coefficient, which means that this pairing of two genres adds the most to the predicted profit.

Here are the top 5 genre combinations with their coefficients:




Adventure_Science-Fiction:    1.971267e+07

Animation_Family:             1.923102e+07

Action_Music:                 1.760739e+07

Animation_Comedy:             1.546798e+07

Crime_Fantasy:                1.450973e+07


So, in conclusion, a high-profit movie should have a high budget, a long runtime, and the genres Adventure_Science-Fiction (or Animation_Family, since their coefficients are close).
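
One caveat when reading interaction coefficients: in a model with both main effects and interactions, the predicted shift for a movie tagged with two genres is the sum of the two main effects and the interaction term. A rough worked example with the fitted coefficients above, assuming the movie carries exactly those two genre tags and treating main effects that were dropped from the model (such as Science-Fiction) as zero:

$$
\Delta_{\text{Adventure, Science-Fiction}} = \beta_{\text{Adventure}} + \beta_{\text{Adventure\_Science-Fiction}} \approx 4.965 \times 10^{6} + 1.971 \times 10^{7} \approx 2.47 \times 10^{7}
$$

$$
\Delta_{\text{Animation, Family}} = \beta_{\text{Animation}} + \beta_{\text{Family}} + \beta_{\text{Animation\_Family}} \approx -3.617 \times 10^{6} - 7.528 \times 10^{6} + 1.923 \times 10^{7} \approx 8.09 \times 10^{6}
$$

Under these assumptions the two interaction coefficients are close, but the combined shifts differ because Animation and Family each carry a negative main effect.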

# 2.2 Movies Manual Regression

Use your `X` and `y` matrices from 2.1 to calculate the linear regression yourself using the normal equation $\hat{\beta} = (X^T X)^{-1}X^Ty$.

Verify that the coefficients are the same.
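
A minimal sketch of this check on synthetic data (the data, seed, and variable names are assumptions for illustration). It solves the linear system $(X^T X)\beta = X^T y$ with `np.linalg.solve` instead of forming the inverse explicitly, which is generally the more numerically stable choice, and compares the result with NumPy's least-squares routine.

```python
import numpy as np

# Synthetic design matrix with an intercept column (hypothetical data).
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + 0.1 * rng.normal(size=n)

# Normal equation: solve (X^T X) beta = X^T y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))
```

With real `X` and `y` from 2.1, the same comparison could be made against `reg.params.to_numpy()`.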

In [30]:
#Pandas DataFrame--> array
X2=X.to_numpy()
Y2=y.to_numpy()

#Normal equation
Coef=np.linalg.inv((X2.T@X2))@X2.T@Y2

#Coefficients fitted by statsmodels in 2.1
Coef1=reg.params.to_numpy()

print('with statsmodels ')
print(Coef1,'\n')

print('with the equation')
print(Coef)

#The coefficients are the same once all the columns that are
#statistically insignificant have been removed



with statsmodels 
[-1.38609245e+06  1.78213586e+00  1.42119534e+04  4.96455201e+06
 -3.61725874e+06 -1.36709018e+06 -7.52802537e+06  3.93898958e+06
  1.10836079e+06  5.79173316e+06  4.63224555e+06 -4.27026005e+06
 -7.47700544e+06 -5.22899377e+06 -1.24047474e+07  1.76073946e+07
 -4.76727400e+06 -3.08918741e+06 -5.58425819e+06 -9.94648595e+06
 -4.39754930e+06 -8.10671219e+06  5.22188751e+06  1.01890072e+07
 -8.23979563e+06 -1.14764509e+07 -1.89502563e+07  1.97126721e+07
 -8.54898046e+06 -9.80286842e+06  1.54679769e+07  9.85178064e+06
  1.92310180e+07 -4.83057365e+06 -9.33561833e+06 -3.85069537e+07
  6.15884014e+06 -6.75562392e+06  1.45097290e+07 -3.78632982e+06
  9.91426781e+06  8.45433057e+06 -6.70738460e+06  7.07662674e+06
  5.59015507e+06 -2.24494430e+07  1.36584635e+07 -1.54550529e+07
 -7.07346086e+06 -5.80528722e+06  1.81922566e+06 -1.13660424e+07
  3.23440687e+06 -5.04412524e+06 -2.00116236e+07 -3.48496643e+07
 -1.42114134e+07] 

with the equation
[-1.38609245e+06  1.78213586e+00  

# 2.3 Movies gradient descent regression

Use your `X` and `y` matrices from 2.1 to calculate the linear regression yourself using **gradient descent**. 

Hint: use `scipy.optimize` and remember we're finding the $\beta$ that minimizes the squared loss function of linear regression: $f(\beta) = \lVert X\beta - y \rVert^2$. This will look like part 3 of this lecture.

Verify your coefficients are similar to the ones in 2.1 and 2.2. They won't necessarily be exactly the same, but should be roughly similar.
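
For comparison with a `scipy.optimize` solution, here is a hand-written gradient descent sketch on synthetic data. The learning rate, iteration count, and standardized features are assumptions chosen so the plain update converges. It minimizes the mean squared error, which has the same minimizer as $\lVert X\beta - y \rVert^2$. On the raw movie features, whose condition number is very large, a fixed step size like this would likely need feature scaling first.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iter=2000):
    """Minimize ||X @ beta - y||^2 / n by plain gradient descent."""
    n, k = X.shape
    beta = np.zeros(k)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - y) / n   # gradient of the mean squared error
        beta = beta - lr * grad
    return beta

# Synthetic, standardized features plus an intercept column (hypothetical data).
rng = np.random.default_rng(0)
n = 200
features = rng.normal(size=(n, 2))
features = (features - features.mean(axis=0)) / features.std(axis=0)
X = np.column_stack([np.ones(n), features])
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=n)

beta_gd = gradient_descent(X, y)
beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_gd, beta_ref, atol=1e-6))
```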

In [31]:
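from scipy.optimize import minimize
#initial guess: one coefficient per column of X2
B=np.zeros(len(X2[0]))
#sum of squared residuals
def f(B,X,y):
    return (X@B-y)@(X@B-y)
#Powell is a derivative-free method: it only needs values of f
est = minimize(f, B, args=(X2,Y2), method='Powell',tol=1e-9)

display(est.x)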


array([-1.38606329e+06,  1.78215007e+00,  1.41979663e+04,  4.96962167e+06,
       -3.61105115e+06, -1.36612348e+06, -7.53119324e+06,  3.93751211e+06,
        1.10885786e+06,  5.78468940e+06,  4.63275748e+06, -4.26930988e+06,
       -7.47263611e+06, -5.22688968e+06, -1.24030937e+07,  1.76063587e+07,
       -4.76669873e+06, -3.08872641e+06, -5.58306947e+06, -9.94911921e+06,
       -4.39721973e+06, -8.10804730e+06,  5.22069693e+06,  1.01881525e+07,
       -8.24077190e+06, -1.14772358e+07, -1.89514801e+07,  1.97127331e+07,
       -8.54804683e+06, -9.80341100e+06,  1.54632199e+07,  9.84737231e+06,
        1.92284698e+07, -4.83246438e+06, -9.33879022e+06, -3.85085349e+07,
        6.16115652e+06, -6.75480907e+06,  1.45090527e+07, -3.78593725e+06,
        9.91776385e+06,  8.45613104e+06, -6.70622722e+06,  7.07656836e+06,
        5.59265647e+06, -2.24484712e+07,  1.36586655e+07, -1.54547210e+07,
       -7.07366215e+06, -5.80468484e+06,  1.82003072e+06, -1.13663459e+07,
        3.23408743e+06, -