# Optional Assessment: Compare linear regression models

In this assessment, you will build several linear regression models that explain box office revenues from social media data. The dataset, *box_office_social_media.csv*, contains each movie's Facebook likes and comments, its Twitter followers and replies, and its total gross revenue.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#Read in the data 
boxoffice = pd.read_csv("box_office_social_media.csv")
boxoffice.describe()

| | NrOfLikesFb | NrofCommentsFb | NrOfFollowersTw | NrofRepliesTw | Revenue |
|---|---|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| mean | 763636.5 | 11862.95 | 15696.42 | 432.34 | 42679510.0 |
| std | 1760275.0 | 17170.871568 | 42940.057244 | 488.068306 | 55759490.0 |
| min | 2346.0 | 8.0 | 123.0 | 11.0 | 24716.0 |
| 25% | 64107.25 | 1611.75 | 1817.5 | 137.25 | 3448721.0 |
| 50% | 189559.0 | 5644.0 | 4482.0 | 314.0 | 19896310.0 |
| 75% | 611935.8 | 14808.0 | 13319.0 | 568.75 | 59529010.0 |
| max | 9628565.0 | 119309.0 | 386992.0 | 3185.0 | 234168100.0 |


First of all, we see that the spread in the dependent variable is quite large. Also, most box office revenue models log-transform the dependent variable. Since revenue contains no zeros (the minimum is 24,716), we can take the natural log directly. We will not do any other pre-processing of the data (e.g., removing outliers, normalizing the data, or transforming the independent variables).

In [2]:
#Make a log-transformation
boxoffice["Revenue"] = np.log(boxoffice.Revenue)
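The choice between a plain natural log and a zero-safe variant can be made explicit. The sketch below is only an illustration: `log_transform` is a hypothetical helper that is not part of the assessment, and the four revenue values are the minimum, first quartile, median, and maximum as displayed by `describe()`.

```python
import numpy as np
import pandas as pd


def log_transform(values: pd.Series) -> pd.Series:
    """Hypothetical helper: natural log for strictly positive data, log(1 + x) if zeros occur."""
    if (values < 0).any():
        raise ValueError("log transform is undefined for negative values")
    if (values == 0).any():
        # log1p maps 0 to 0 instead of -inf
        return np.log1p(values)
    return np.log(values)


# Minimum, first quartile, median, and maximum of Revenue as shown by describe()
revenue = pd.Series([24716.0, 3448721.0, 19896310.0, 234168100.0])
log_revenue = log_transform(revenue)

# Ratio of largest to smallest raw value versus the range on the log scale
print(revenue.max() / revenue.min())
print(log_revenue.max() - log_revenue.min())
```

On the log scale the gap between the smallest and largest revenue shrinks from a ratio in the thousands to a difference of a few units, which is why the transformed variable is easier to model with a straight line.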

# Simple linear regression

First, you will build a simple linear regression model for each predictor. I will first make the train-test split that you will use throughout the exercise.

In [3]:
np.random.seed(40) 

from sklearn.model_selection import train_test_split
X = boxoffice[["NrOfLikesFb", "NrofCommentsFb", "NrOfFollowersTw", "NrofRepliesTw"]]
y = boxoffice.Revenue

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Now, you should write a function that fits one simple linear regression model per independent variable. You should use the *scikit-learn* package and build the models on the training set. The function should return all the regression models in a list.


In [4]:
def make_SLR(X,y): 
    from sklearn.linear_model import LinearRegression
    
    variables = ['NrOfLikesFb','NrofCommentsFb','NrOfFollowersTw','NrofRepliesTw']
    models = [None] * 4
    
    ### BEGIN SOLUTION
    for i, col in enumerate(variables):
        #Double brackets keep x two-dimensional, as scikit-learn expects
        x = X[[col]]
        lm = LinearRegression()
        models[i] = lm.fit(x, y)
    ### END SOLUTION
    
    #Note that the returned linear regression models should be in the same order as variables.
    #So the first model has to be for NrOfLikesFb, the second for NrofCommentsFb, and so on.
    
    return models

Verify your results:

In [5]:
slr_models = make_SLR(X= X_train, y = y_train)

assert np.allclose((slr_models[0].intercept_, slr_models[0].coef_[0]),(15.609488173698871, 5.1671695107909227e-07))


### BEGIN HIDDEN TESTS


assert np.allclose((slr_models[1].intercept_, slr_models[1].coef_[0]),(15.265318469966811, 6.0628001418242879e-05))


assert np.allclose((slr_models[2].intercept_, slr_models[2].coef_[0]),(15.769624755714137, 1.376551838293093e-05))


assert np.allclose((slr_models[3].intercept_, slr_models[3].coef_[0]),(15.126197837099062, 0.0020155533841367981))


### END HIDDEN TESTS

You should now have built all the simple linear regression models. Have a look at the intercepts and the slopes. What do you notice? Which predictor has the strongest effect on box office revenues? What is the R-squared of each simple linear regression model? Write these conclusions down, since you will have to discuss them in the next activity.
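One possible way to collect the intercept, slope, and training R-squared side by side is sketched below. It runs on synthetic stand-in data, so the numbers it prints say nothing about the real dataset; in the notebook you would use `X_train` and `y_train` instead. The names `X_demo`, `y_demo`, and `summary` are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (NOT the real dataset); only the column names match
rng = np.random.default_rng(0)
variables = ["NrOfLikesFb", "NrofCommentsFb", "NrOfFollowersTw", "NrofRepliesTw"]
X_demo = pd.DataFrame(rng.lognormal(mean=8.0, sigma=1.0, size=(70, 4)), columns=variables)
y_demo = 15.0 + 0.0001 * X_demo["NrofCommentsFb"] + rng.normal(scale=1.0, size=70)

rows = []
for col in variables:
    model = LinearRegression().fit(X_demo[[col]], y_demo)
    rows.append({
        "predictor": col,
        "intercept": model.intercept_,
        "slope": model.coef_[0],
        # score() returns R-squared on the data it is given (here: the training data)
        "r2_train": model.score(X_demo[[col]], y_demo),
    })

summary = pd.DataFrame(rows).set_index("predictor")
print(summary)
```

Keep in mind that each slope is expressed per unit of its own predictor, so the raw slopes of variables measured on very different scales are not directly comparable.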

# Multiple linear regression

Now you will build multiple linear regression models. You have to build the following models: one model containing the likes and followers, one model containing the comments and replies, one model for each platform (the Facebook variables and the Twitter variables), and a model with all predictors. Again, it is up to you to interpret the coefficients and write down your conclusions.

In [6]:
def make_MLR(X,y, variables):
    
    #This function only returns 1 model with a specific set of variables
    
    from sklearn.linear_model import LinearRegression
    
    x = X[variables]
    
    ### BEGIN SOLUTION 
    
    lm = LinearRegression()
    model = lm.fit(x,y)
    
    ### END SOLUTION
    
    return model

Verify your answers: 

In [7]:
#We will make all the models:
lm_likes = make_MLR(X= X_train, y= y_train, variables= ['NrOfLikesFb','NrOfFollowersTw'])
lm_talking = make_MLR(X= X_train, y= y_train, variables= ['NrofCommentsFb','NrofRepliesTw'])
lm_Fb = make_MLR(X= X_train, y= y_train, variables= ['NrOfLikesFb','NrofCommentsFb'])
lm_Tw = make_MLR(X= X_train, y= y_train, variables= ['NrOfFollowersTw','NrofRepliesTw'])
lm_all = make_MLR(X= X_train, y= y_train, variables= ['NrOfLikesFb','NrofCommentsFb','NrOfFollowersTw','NrofRepliesTw'])


assert np.allclose((lm_likes.intercept_, lm_likes.coef_[0], lm_likes.coef_[1]), 
                   (15.526154303083253, 4.4391961176389638e-07, 8.4075527425921782e-06))


### BEGIN HIDDEN TESTS


assert np.allclose((lm_talking.intercept_, lm_talking.coef_[0], lm_talking.coef_[1]), 
                   (14.99132819232663, 4.8143518699883124e-05, 0.0009842254399353783))


assert np.allclose((lm_Fb.intercept_, lm_Fb.coef_[0], lm_Fb.coef_[1]), 
                   (15.277025943868374, 4.1442541480199368e-08, 5.7086227050407943e-05))


assert np.allclose((lm_Tw.intercept_, lm_Tw.coef_[0], lm_Tw.coef_[1]), 
                   (15.166024532079879, 2.3401659335220929e-06, 0.0018342747618596952))


assert np.allclose((lm_all.intercept_, lm_all.coef_[0], lm_all.coef_[1], lm_all.coef_[2], lm_all.coef_[3]), 
                   (15.053790882473528,1.0608497234696554e-07,3.8910849178628834e-05,3.4027697660251417e-06,0.00078424597337231996))


### END HIDDEN TESTS

Check the intercept and the coefficients of the multiple linear regression models and write down your conclusions for the next discussion section. 
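When interpreting the coefficients, recall that the dependent variable is log revenue, so a coefficient b means that one additional unit of the predictor multiplies the predicted revenue by exp(b), holding the other predictors fixed. Because the predictors are on very different scales, one rough way to compare them is to look at a one-standard-deviation increase. The sketch below does this with the `lm_all` coefficients from the assertion above and the standard deviations from the `describe()` output. Those standard deviations describe the full dataset rather than the training set, so treat the result as an approximation; the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

variables = ["NrOfLikesFb", "NrofCommentsFb", "NrOfFollowersTw", "NrofRepliesTw"]

# Coefficients of lm_all, copied from the assertion above
coefs = pd.Series(
    [1.0608497234696554e-07, 3.8910849178628834e-05,
     3.4027697660251417e-06, 0.00078424597337231996],
    index=variables,
)

# Standard deviations from describe() (full dataset, used as an approximation)
stds = pd.Series([1760275.0, 17170.871568, 42940.057244, 488.068306], index=variables)

# Change in log revenue for a one-standard-deviation increase in each predictor
log_effect = coefs * stds

# Multiplicative change in predicted revenue, holding the other predictors fixed
revenue_factor = np.exp(log_effect)

print(pd.DataFrame({"log_effect": log_effect, "revenue_factor": revenue_factor}))
```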

Finally, we want to assess the predictive performance of all these models. All the models were fitted on the training set, so you will now evaluate their performance on the test set.

In [8]:
def evaluate_performance(model, X, y):

    #Return the RMSE, MAE, and R2 of the model as a list [RMSE, MAE, R2]
    from sklearn.metrics import mean_squared_error as mse
    from sklearn.metrics import mean_absolute_error as mae
    from sklearn.metrics import r2_score as r2
    from math import sqrt
    
    ### BEGIN SOLUTION
    predictions = model.predict(X)
    RMSE = sqrt(mse(y, predictions))
    MAE = mae(y, predictions)
    R2 = r2(y, predictions)
    
    measures = [RMSE, MAE, R2]
    
    ### END SOLUTION
    
    return measures
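For reference, the three measures can also be computed by hand with NumPy, which makes their definitions explicit. This is a sketch on made-up numbers, and `manual_metrics` is a hypothetical name that is not part of the assessment.

```python
import numpy as np


def manual_metrics(y_true, y_pred):
    """Hypothetical helper returning [RMSE, MAE, R2] computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    rmse = np.sqrt(np.mean(residuals ** 2))          # root of the mean squared error
    mae = np.mean(np.abs(residuals))                 # mean absolute error
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
    r2 = 1.0 - ss_res / ss_tot
    return [rmse, mae, r2]


# Made-up log revenues and predictions
y_true = [14.0, 16.0, 17.0, 15.0]
y_pred = [15.0, 16.0, 16.0, 15.0]
print(manual_metrics(y_true, y_pred))
```

Note that R2 computed on a test set can be negative: this happens when the model predicts worse than simply using the mean of the test targets.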

Verify your answers:

In [9]:
evaluate_performance(model=lm_likes, X= X_test[['NrOfLikesFb','NrOfFollowersTw']], y=  y_test)

assert np.allclose(evaluate_performance(model=lm_likes, X= X_test[['NrOfLikesFb','NrOfFollowersTw']], y=  y_test),
                  (1.7007832761032557, 1.3282081135523005, 0.17475328369066268))


### BEGIN HIDDEN TESTS


assert np.allclose(evaluate_performance(model=lm_talking, X= X_test[['NrofCommentsFb','NrofRepliesTw']], y=  y_test),
                  (1.6609099779256515, 1.3194545903400885, 0.21299400667403368))


assert np.allclose(evaluate_performance(model=lm_Fb, X= X_test[['NrOfLikesFb','NrofCommentsFb']], y=  y_test),
                  (1.6253601512775537, 1.2372103620416932, 0.2463233377969366))


assert np.allclose(evaluate_performance(model=lm_Tw, X= X_test[['NrOfFollowersTw', 'NrofRepliesTw']], y=  y_test),
                  (1.8506008910050995, 1.4795681849169782, 0.022962157064807309))


assert np.allclose(evaluate_performance(model=lm_all, 
                                        X= X_test[['NrOfLikesFb','NrofCommentsFb','NrOfFollowersTw','NrofRepliesTw']], y=  y_test),
                  (1.6311219570599929, 1.2803922157424248, 0.24097038809117377))


### END HIDDEN TESTS
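To compare the models at a glance, the test-set measures can be gathered in one table. The sketch below shows one way to do this on synthetic stand-in data, so its numbers are not those of the real dataset; in the notebook you would loop over the fitted models and call `evaluate_performance` on `X_test` and `y_test`. The names `X_demo`, `y_demo`, `variable_sets`, and `comparison` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the real dataset); only the column names match
rng = np.random.default_rng(1)
cols = ["NrOfLikesFb", "NrofCommentsFb", "NrOfFollowersTw", "NrofRepliesTw"]
X_demo = pd.DataFrame(rng.lognormal(mean=8.0, sigma=1.0, size=(100, 4)), columns=cols)
y_demo = 15.0 + 0.0001 * X_demo["NrofCommentsFb"] + rng.normal(scale=1.0, size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=40)

# Variable sets mirroring the models built above
variable_sets = {
    "lm_likes": ["NrOfLikesFb", "NrOfFollowersTw"],
    "lm_talking": ["NrofCommentsFb", "NrofRepliesTw"],
    "lm_Fb": ["NrOfLikesFb", "NrofCommentsFb"],
    "lm_Tw": ["NrOfFollowersTw", "NrofRepliesTw"],
    "lm_all": cols,
}

results = {}
for name, variables in variable_sets.items():
    # Fit on the training part, evaluate on the held-out part
    model = LinearRegression().fit(X_tr[variables], y_tr)
    pred = model.predict(X_te[variables])
    results[name] = {
        "RMSE": np.sqrt(mean_squared_error(y_te, pred)),
        "MAE": mean_absolute_error(y_te, pred),
        "R2": r2_score(y_te, pred),
    }

# One row per model, best (lowest) RMSE first
comparison = pd.DataFrame(results).T.sort_values("RMSE")
print(comparison)
```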