# Model evaluation: Cross validation
## Use “Movie.csv”

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn import model_selection
from sklearn.model_selection import KFold

### 1. Run 5-fold cross-validation for the following two OLS models. Calculate the MSE, MAE, and R-squared for each model (change the unit of Revenue to \$ million)

- **M1: Rev = b0 + b1 Ad + b2 Budget + b3 Rating + b4 Avgscreen + b5 Sequel + e**
- **M2: Rev = b0 + b1 Budget + b2 Rating + b3 Avgscreen + b4 Sequel + e**

In [2]:
# Read data

df = pd.read_csv("Movie.csv")
df.head()

Unnamed: 0,title,open,rev,sequel,avgscreen,mpaa,budget,rating,ad,genre,studio
0,"10,000 b.c.",07Mar2008,94784201,0,1329,PG-13,105000000,4.9,36000000.0,PERIOD,WB
1,12 rounds,27Mar2009,12234694,0,600,PG-13,22000000,5.5,8180700.0,ACTION,FOX
2,12 years a slave,08Nov2013,56671993,0,554,R,20000000,8.1,14281400.0,DRAMA,FoxS
3,127 hours,28Jan2011,18335230,0,216,R,18000000,7.7,12000000.0,DRAMA,FOXS
4,1408,22Jun2007,71985628,0,877,PG-13,25000000,6.9,22900000.0,HORROR,MGM


- `pd.read_csv("Movie.csv")` reads data from a CSV (Comma-Separated Values) file named "Movie.csv" and loads it into a pandas DataFrame.

In [3]:
# Change the unit of Revenue to $ million

df['rev'] = df.rev/1000000  # Change Revenue to $ million unit
df.head()

Unnamed: 0,title,open,rev,sequel,avgscreen,mpaa,budget,rating,ad,genre,studio
0,"10,000 b.c.",07Mar2008,94.784201,0,1329,PG-13,105000000,4.9,36000000.0,PERIOD,WB
1,12 rounds,27Mar2009,12.234694,0,600,PG-13,22000000,5.5,8180700.0,ACTION,FOX
2,12 years a slave,08Nov2013,56.671993,0,554,R,20000000,8.1,14281400.0,DRAMA,FoxS
3,127 hours,28Jan2011,18.33523,0,216,R,18000000,7.7,12000000.0,DRAMA,FOXS
4,1408,22Jun2007,71.985628,0,877,PG-13,25000000,6.9,22900000.0,HORROR,MGM


- `df.rev/1000000`: Divides each value in the 'rev' column by one million.
- `df['rev'] = ...`: Assigns the result of the division back to the 'rev' column of the DataFrame, so revenue is now measured in $ million.

In [4]:
#  Define linear regression and a 5-fold CV

m = LinearRegression()
cv5 = model_selection.KFold(n_splits=5) 

- `m = LinearRegression()`: This line initializes a linear regression model.
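- `cv5 = model_selection.KFold(n_splits=5)`: This line creates a 5-fold cross-validation scheme. With the default `shuffle=False`, the folds are consecutive blocks of rows.
    - Passing `cv=cv5` to `cross_val_score` is optional here, because 5-fold CV is its default.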
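To make the splitting concrete, the following minimal sketch shows which rows `KFold(n_splits=5)` assigns to each test fold. It uses a hypothetical 10-row toy array (`X_toy`) rather than Movie.csv, and the names `cv5_demo` and `folds` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 observations (hypothetical, not Movie.csv)
X_toy = np.arange(10).reshape(-1, 1)

cv5_demo = KFold(n_splits=5)  # shuffle=False by default

# Each split yields the row positions used for training and for testing
folds = list(cv5_demo.split(X_toy))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Because `shuffle` defaults to `False`, the test folds are consecutive blocks of rows (rows 0 and 1, then 2 and 3, and so on). If the rows of a dataset are sorted in some meaningful order, passing `shuffle=True` together with a `random_state` may be worth considering.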

In [5]:
# Define x, y for M1

y = df.rev
x1 = df[['sequel', 'avgscreen', 'budget', 'rating', 'ad']]

- `y = df.rev`: y is the dependent variable or target variable. In this case, it is set to the 'rev' column of the DataFrame df.
- `x1 = df[['sequel', 'avgscreen', 'budget', 'rating', 'ad']]`: x1 is a DataFrame that includes a set of independent variables or features that you want to use in your regression analysis.

In [6]:
# Obtain MSE, MAE, R2 for M1

mse1 = -(model_selection.cross_val_score(m, x1, y, cv=cv5, scoring='neg_mean_squared_error'))
 # 'neg_mean_squared_error': negative value of MSE
 # cv=cv5 can be omitted for 5-fold (default)
mae1 = -(model_selection.cross_val_score(m, x1, y, cv=cv5, scoring='neg_mean_absolute_error'))
rsq1 = model_selection.cross_val_score(m, x1, y, cv=cv5, scoring='r2')

- `mse1 = -(model_selection.cross_val_score(m, x1, y, cv=cv5, scoring='neg_mean_squared_error'))`: Calculates the Mean Squared Error (MSE) using 5-fold cross-validation for the linear regression model m.
    - `model_selection.cross_val_score` performs cross-validation
    - `scoring='neg_mean_squared_error'` selects MSE as the evaluation metric. scikit-learn scorers follow a higher-is-better convention, so the function returns the negated MSE; the leading minus sign converts it back to a positive MSE value.
- `mae1 = -(model_selection.cross_val_score(m, x1, y, cv=cv5, scoring='neg_mean_absolute_error'))`: This line calculates the Mean Absolute Error (MAE) using 5-fold cross-validation for the linear regression model m.
- `rsq1 = model_selection.cross_val_score(m, x1, y, cv=cv5, scoring='r2')`: This line calculates R-squared using 5-fold cross-validation for the linear regression model m. 
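What `cross_val_score` does internally can be sketched as an explicit loop: fit on the training folds, predict the held-out fold, and score that prediction. The example below is a minimal sketch on synthetic data (`X_demo` and `y_demo` are made up for illustration and are not the Movie.csv columns) and checks that the manual loop matches `cross_val_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumption: stands in for x1 and y)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)

cv = KFold(n_splits=5)
model = LinearRegression()

# Manual 5-fold CV: one MSE per held-out fold
manual_mse = []
for train_idx, test_idx in cv.split(X_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    pred = model.predict(X_demo[test_idx])
    manual_mse.append(mean_squared_error(y_demo[test_idx], pred))
manual_mse = np.array(manual_mse)

# Same quantity via cross_val_score (sign flipped back to positive)
auto_mse = -cross_val_score(LinearRegression(), X_demo, y_demo,
                            cv=cv, scoring='neg_mean_squared_error')

print(np.allclose(manual_mse, auto_mse))
```

Each entry of the resulting array is the error on one held-out fold, which is why the notebook obtains five values per metric.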

In [7]:
# Scores of each run of CV

print('mse1', mse1)
print('mae1', mae1)
print('rsq1', rsq1)

mse1 [3887.66959301 2757.30645036 2797.14349192 4783.4122652  3210.71316873]
mae1 [36.99070036 36.25083943 32.87925187 39.46699928 39.07316148]
rsq1 [0.49248169 0.63405075 0.47337333 0.52848915 0.53249441]


In [8]:
# Average scores of all the runs of CV

print('mse1', mse1.mean())
print('mae1', mae1.mean())
print('r-square1', rsq1.mean())

mse1 3487.248993844189
mae1 36.93219048364908
r-square1 0.5321778638496653


- `mean()`: This code computes the mean value of the MSE, MAE, and R-squared scores across the different folds of cross-validation.
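Because revenue is measured in \$ million, the MSE is in squared \$ million, which is hard to read directly. One optional way to interpret it (not part of the original exercise) is to take the square root of the average MSE, giving an RMSE in \$ million that can be set beside the MAE. The sketch below copies the per-fold MSE values printed above into an array; the name `mse1_folds` is illustrative.

```python
import numpy as np

# Per-fold MSE values for M1, copied from the printed output above
mse1_folds = np.array([3887.66959301, 2757.30645036, 2797.14349192,
                       4783.4122652, 3210.71316873])

mean_mse = mse1_folds.mean()   # average MSE across the five folds
rmse = np.sqrt(mean_mse)       # back on the scale of revenue ($ million)

print('mean MSE:', round(mean_mse, 2))
print('RMSE ($ million):', round(rmse, 2))
```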

In [9]:
# Define x for M2

x2 = x1.drop('ad', axis=1)
# x2 = df[['sequel', 'avgscreen', 'budget', 'rating']]  # Alternative

- `x2 = x1.drop('ad', axis=1)`: Creates a new DataFrame x2 by removing the 'ad' column from the existing DataFrame x1; x1 itself is unchanged.
- `axis=1` indicates that a column (rather than a row) is removed.
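As a small illustration of how `drop` behaves, the sketch below uses a hypothetical two-row DataFrame (`x_demo`, with made-up values) instead of the movie data. It also shows the equivalent `columns=` spelling, which some readers find clearer than `axis=1`.

```python
import pandas as pd

# Hypothetical toy features (values are made up for illustration)
x_demo = pd.DataFrame({'sequel': [0, 1],
                       'budget': [22.0, 105.0],
                       'ad': [8.2, 36.0]})

x_small = x_demo.drop('ad', axis=1)       # same call pattern as in the notebook
x_small_alt = x_demo.drop(columns='ad')   # equivalent spelling

print(list(x_small.columns))
print(list(x_demo.columns))  # the original DataFrame still contains 'ad'
```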

In [10]:
# Obtain MSE, MAE, R2 for M2

mse2 = -(model_selection.cross_val_score(m, x2, y, cv=cv5, scoring='neg_mean_squared_error'))
mae2 = -(model_selection.cross_val_score(m, x2, y, cv=cv5, scoring='neg_mean_absolute_error'))
rsq2 = model_selection.cross_val_score(m, x2, y, cv=cv5, scoring='r2')
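The three separate `cross_val_score` calls refit the model once per metric. As an alternative sketch, scikit-learn's `cross_validate` accepts several scorers in one call. The example below runs on synthetic data (`X_demo` and `y_demo` are made up for illustration, not the Movie.csv columns).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data (assumption: stands in for x2 and y)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=60)

scores = cross_validate(
    LinearRegression(), X_demo, y_demo,
    cv=KFold(n_splits=5),
    scoring=['neg_mean_squared_error', 'neg_mean_absolute_error', 'r2'],
)

# Results are stored under 'test_<scorer name>'; flip the sign of the error metrics
mse_demo = -scores['test_neg_mean_squared_error']
mae_demo = -scores['test_neg_mean_absolute_error']
rsq_demo = scores['test_r2']

print(mse_demo.mean(), mae_demo.mean(), rsq_demo.mean())
```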

In [11]:
# scores of each run of CV

print(mse2)
print(mae2)
print(rsq2)

[3976.36418064 2545.41365857 2841.12102393 4834.44477207 3277.74066427]
[37.64107041 35.34183014 32.83597082 40.46607438 39.37165164]
[0.480903   0.66217312 0.46509354 0.52345877 0.52273467]


In [12]:
# Average scores of all the runs of CV

print('mse2', mse2.mean())
print('mae2', mae2.mean())
print('r-square2', rsq2.mean())

mse2 3495.01685989574
mae2 37.13131947702719
r-square2 0.5308726199800551


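- Model 1 shows slightly better predictive performance than Model 2: averaged over the five folds, Model 1 has lower error measures (MSE 3487.2 vs. 3495.0; MAE 36.93 vs. 37.13) and a higher R-squared (0.532 vs. 0.531). The differences are small, which suggests that Ad adds only modest out-of-sample predictive power.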
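The averages hide some fold-to-fold variation. A quick check, using the per-fold MSE values printed above (copied by hand into arrays; the names `mse1_folds` and `mse2_folds` are illustrative), counts in how many folds Model 1 has the lower error.

```python
import numpy as np

# Per-fold MSE values copied from the printed output of the two models
mse1_folds = np.array([3887.66959301, 2757.30645036, 2797.14349192,
                       4783.4122652, 3210.71316873])
mse2_folds = np.array([3976.36418064, 2545.41365857, 2841.12102393,
                       4834.44477207, 3277.74066427])

diff = mse2_folds - mse1_folds     # positive means M1 has the lower MSE
m1_wins = int((diff > 0).sum())    # number of folds where M1 is better

print('MSE difference per fold (M2 - M1):', diff.round(2))
print('folds where M1 is better:', m1_wins)
```

On these printed values Model 1 has the lower MSE in four of the five folds; Model 2 is better in the second fold.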

### 2. Run Leave-one-out cross validations for the two previous OLS models. Calculate MSEs and MAEs.

In [13]:
# Define linear regression and leave-one-out cross validation

m = LinearRegression()
loo = model_selection.LeaveOneOut()

- `loo = model_selection.LeaveOneOut()`: This line sets up a Leave-One-Out (LOO) cross-validation scheme using scikit-learn's LeaveOneOut class, which is part of the model_selection module. In LOO cross-validation, each data point is held out once: the model is trained on all the other data points and then tested on that single point, so the number of folds equals the number of observations.
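The splitting pattern of LOO can be inspected directly. The minimal sketch below uses a hypothetical 4-row toy array (`X_toy`) and also checks that LOO produces the same test folds as an unshuffled `KFold` whose number of splits equals the number of observations.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Toy data: 4 observations (hypothetical, not Movie.csv)
X_toy = np.arange(4).reshape(-1, 1)

loo_demo = LeaveOneOut()
splits = list(loo_demo.split(X_toy))
for train_idx, test_idx in splits:
    print(f"train={train_idx.tolist()} test={test_idx.tolist()}")

# One fold per observation
n_splits = loo_demo.get_n_splits(X_toy)

# LOO test folds match KFold with n_splits equal to the sample size
kfold_splits = list(KFold(n_splits=len(X_toy)).split(X_toy))
same_folds = all(
    np.array_equal(a_test, b_test)
    for (_, a_test), (_, b_test) in zip(splits, kfold_splits)
)
print(n_splits, same_folds)
```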

In [14]:
#  Obtain MSE, MAE for M1

mse1 = -(model_selection.cross_val_score(m, x1, y, cv=loo, scoring='neg_mean_squared_error'))
mae1 = -(model_selection.cross_val_score(m, x1, y, cv=loo, scoring='neg_mean_absolute_error'))
print('mse1', mse1.mean())
print('mae1', mae1.mean())

mse1 3477.149152965851
mae1 36.64109897183518


- `cv=loo` tells `cross_val_score` to perform LOO cross-validation on the linear regression model (m) with the given set of independent variables (x1) and target variable (y).
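With LOO, each fold's score is based on a single prediction, so the returned array has one entry per observation, and each fold's MSE is simply the square of that fold's absolute error. The sketch below illustrates this on synthetic data (`X_demo` and `y_demo` are made up for illustration, not the Movie.csv columns).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic regression data (assumption: stands in for x1 and y)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(30, 2))
y_demo = X_demo @ np.array([1.0, -0.5]) + rng.normal(scale=0.4, size=30)

loo_demo = LeaveOneOut()
mse_loo = -cross_val_score(LinearRegression(), X_demo, y_demo,
                           cv=loo_demo, scoring='neg_mean_squared_error')
mae_loo = -cross_val_score(LinearRegression(), X_demo, y_demo,
                           cv=loo_demo, scoring='neg_mean_absolute_error')

print(len(mse_loo))                       # one score per observation
print(np.allclose(mse_loo, mae_loo ** 2)) # squared error of a single prediction
print(mse_loo.mean(), mae_loo.mean())
```

This is also one reason R-squared is not requested for LOO: R-squared is not well defined on a test fold that contains a single observation.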

In [15]:
#  Obtain MSE, MAE for M2

mse2 = -(model_selection.cross_val_score(m, x2, y, cv=loo, scoring='neg_mean_squared_error'))
mae2 = -(model_selection.cross_val_score(m, x2, y, cv=loo, scoring='neg_mean_absolute_error'))
print('mse2', mse2.mean())
print('mae2', mae2.mean())

mse2 3494.6700327413237
mae2 36.9558248708298


- Model 1 again shows slightly better predictive performance than Model 2: under LOO cross-validation, Model 1 has lower error measures (MSE 3477.1 vs. 3494.7; MAE 36.64 vs. 36.96).