## WS 2019/20 – The Empirical Evaluation of Management Practices I

# Exam structure

**Total duration**: 1 hour 30 minutes

**Working time**: approx. 1 hour

**Exam materials**: 

On the day of the exam, we will provide you with:

- A dataset
- A Jupyter Notebook template, which contains the general instructions and the exam questions
- These will be either downloadable online or distributed via a memory stick

**Deliverables**: At the end of the exam, you will be asked to hand in your solutions in the prepared Jupyter Notebook (both programming code and written explanations in Markdown) via upload or on a memory stick. We will give specific instructions on how to name and upload your files.

**Auxiliary materials**: open book exam, i.e., all materials are allowed. 

***However, communication between students is not permitted and if detected will result in failing the exam.***

**General structure**: 

The exam will consist of a combination of three types of questions:

i. Questions on conceptual understanding which require only short answers ~ *20 %*;

ii. Programming task which analyzes the data provided ~ *40 %*;

iii. Interpretation and conceptual questions based on the results of your previous analysis ~ *40 %*

# Mock Exam

## Assignment 1 (30 points)

__a)__ Use the dataset provided in class from Bloom et al. (2012) [AMP_Data.csv] to estimate the following OLS regression and show its output using Python (remember to cluster the standard errors at the "account_id" level):

**Output 1**: $$ ln(sales) = \alpha + \beta_{1} * management + \beta_{2} * ln(emp) + \beta_{3} * ln(cap) + \epsilon $$

Please give a precise verbal interpretation of the coefficient for management and its statistical significance.   

*Note: In the dataset, capital (cap) can be found in the column "ppent".*

<div style="text-align: right"> <b>6 points</b> </div>

In [1]:
# import packages
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import seaborn as sns

# read in dataset
#path_to_data = 'C:\Data\AMP_Data.csv'
path_to_data = '~/AMP_Data.csv'
df = pd.read_csv(path_to_data)

In [2]:
# output 1
output1=smf.ols('np.log(sales)~management+np.log(emp)+np.log(ppent)',data=df).fit(cov_type='cluster',cov_kwds={'groups':df['account_id']})

print(output1.summary())

                            OLS Regression Results                            
Dep. Variable:          np.log(sales)   R-squared:                       0.684
Model:                            OLS   Adj. R-squared:                  0.684
Method:                 Least Squares   F-statistic:                     1304.
Date:                Wed, 27 Nov 2019   Prob (F-statistic):               0.00
Time:                        10:43:02   Log-Likelihood:                -8999.5
No. Observations:                7094   AIC:                         1.801e+04
Df Residuals:                    7090   BIC:                         1.803e+04
Df Model:                           3                                         
Covariance Type:              cluster                                         
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.9957      0.143     20.931

'# Insert your answer here:





__b)__	Use the same data set to estimate the following regression:

**Output 2**: $$ ln(cap) =  \alpha + \beta_{1} * management + \beta_{2} * ln(emp) + \epsilon $$

Please give a verbal interpretation of the coefficient of management.

<div style="text-align: right"> <b>4 points</b> </div>

In [3]:
output2=smf.ols('np.log(ppent)~management+np.log(emp)',data=df).fit(cov_type='cluster',cov_kwds={'groups':df['account_id']})

print(output2.summary())

                            OLS Regression Results                            
Dep. Variable:          np.log(ppent)   R-squared:                       0.507
Model:                            OLS   Adj. R-squared:                  0.507
Method:                 Least Squares   F-statistic:                     1164.
Date:                Wed, 27 Nov 2019   Prob (F-statistic):               0.00
Time:                        10:43:35   Log-Likelihood:                -11126.
No. Observations:                7094   AIC:                         2.226e+04
Df Residuals:                    7091   BIC:                         2.228e+04
Df Model:                           2                                         
Covariance Type:              cluster                                         
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       2.7883      0.149     18.691      

'# Insert your answer here:





__c)__ Suppose you did not control for capital in the first regression (i.e., the one that generated Output 1). How can you compute the coefficient of "management" that such a regression would yield **without** running this additional regression, using only regression Outputs 1 and 2? Please explain your answer and show how you computed the coefficient. 

<div style="text-align: right"> <b>6 points</b> </div>

Consider capital as an omitted variable in a regression of log sales on the management score and log employment. We can apply the omitted variable bias formula:

$$ \tilde{\rho} = \rho + \gamma * \frac{Cov(C,X)}{V(X)} $$

The coefficient of management in the short regression of ln(sales) on management and ln(emp) (i.e. without ln(cap)) is equal to the coefficient of management in the long regression (Output 1) + the coefficient of ln(cap) in the long regression * the coefficient of management in the regression of ln(cap) on management and ln(emp) (Output 2), i.e. the difference in log capital associated with a one unit higher management score, holding ln(emp) fixed. Because ln(emp) is included as a control, this coefficient from Output 2 takes the place of Cov(C,X)/V(X).
That is:

$$ \tilde{\rho} = 0.3982 + 0.4300 * 0.3032 $$

In [4]:
0.3982+ 0.4300*0.3032

0.528576
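The decomposition used above holds exactly in any sample, not only approximately. The following is a minimal sketch on simulated data, where the variables `x`, `w`, `c` and `y` are hypothetical stand-ins for management, ln(emp), ln(cap) and ln(sales), and all data-generating coefficients are arbitrary assumptions. It only illustrates the mechanics and does not reproduce the AMP results.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical stand-ins: x ~ management, w ~ ln(emp), c ~ ln(cap), y ~ ln(sales)
x = rng.normal(size=n)
w = 0.5 * x + rng.normal(size=n)
c = 0.3 * x + 0.6 * w + rng.normal(size=n)
y = 1.0 + 0.4 * x + 0.2 * w + 0.45 * c + rng.normal(size=n)

def ols(y, *regressors):
    """Return OLS coefficients [intercept, slopes...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

long_coef = ols(y, x, w, c)   # analogue of Output 1 (long regression)
aux_coef = ols(c, x, w)       # analogue of Output 2 (omitted on included)
short_coef = ols(y, x, w)     # analogue of the cross-check regression

rho, gamma = long_coef[1], long_coef[3]  # coefficients of x and c in the long regression
delta = aux_coef[1]                      # coefficient of x in the auxiliary regression
rho_short_formula = rho + gamma * delta

print("formula:         ", rho_short_formula)
print("short regression:", short_coef[1])
```

Both printed numbers agree up to floating-point error, which is why the hand calculation and the cross-check regression must match.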

In [5]:
## cross check:
output3=smf.ols('np.log(sales)~management+np.log(emp)',data=df).fit(cov_type='cluster',cov_kwds={'groups':df['account_id']})

print(output3.summary())

                            OLS Regression Results                            
Dep. Variable:          np.log(sales)   R-squared:                       0.578
Model:                            OLS   Adj. R-squared:                  0.578
Method:                 Least Squares   F-statistic:                     1688.
Date:                Wed, 27 Nov 2019   Prob (F-statistic):               0.00
Time:                        10:45:30   Log-Likelihood:                -10029.
No. Observations:                7094   AIC:                         2.006e+04
Df Residuals:                    7091   BIC:                         2.008e+04
Df Model:                           2                                         
Covariance Type:              cluster                                         
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       4.1947      0.128     32.863      

__d)__ What does this analysis reveal about the role of capital if you want to estimate the effect of management practices on sales?

<div style="text-align: right"> <b>3 points</b> </div>

Capital is an important control variable that serves to reduce OVB in the regression of sales on the management score. If the regression is interpreted causally, not controlling for capital leads to an overestimation of the effect of the management score on sales. The reason is that firms with higher management scores have more capital, and capital is associated with sales not only through the management score.

__e)__ How do you interpret the R2 in regression Output 1?

<div style="text-align: right"> <b>3 points</b> </div>

'# Insert your answer here:





__f)__ What are reasons to use the logarithm of the dependent variable sales?

<div style="text-align: right"> <b>3 points</b> </div>

'# Insert your answer here:





__g)__ Please investigate empirically whether the strength of the association between sales and management score depends on the firm size (as measured by the number of employees). Run a regression and interpret the results with respect to this research question.

<div style="text-align: right"> <b>5 points</b> </div>

In [6]:
# Option 1: construct the interaction term by hand
df['lnemp_man']=np.log(df.emp)*df.management

output4=smf.ols('np.log(sales)~management+np.log(emp)+np.log(ppent)+lnemp_man',data=df).fit(cov_type='cluster',cov_kwds={'groups':df['account_id']})

print(output4.summary())

# Option 2 (equivalent): let the formula create the interaction; a*b expands to a + b + a:b
output4=smf.ols('np.log(sales)~management*np.log(emp)+np.log(ppent)',data=df).fit(cov_type='cluster',cov_kwds={'groups':df['account_id']})

print(output4.summary())

                            OLS Regression Results                            
Dep. Variable:          np.log(sales)   R-squared:                       0.684
Model:                            OLS   Adj. R-squared:                  0.684
Method:                 Least Squares   F-statistic:                     1229.
Date:                Wed, 27 Nov 2019   Prob (F-statistic):               0.00
Time:                        10:46:35   Log-Likelihood:                -8999.3
No. Observations:                7094   AIC:                         1.801e+04
Df Residuals:                    7089   BIC:                         1.804e+04
Df Model:                           4                                         
Covariance Type:              cluster                                         
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         3.1824      0.532      5.977

The point estimate of the interaction term is positive, which would indicate that the management score has a stronger impact on sales in larger firms. But the standard error is very large such that the effect is statistically insignificant. Moreover, the magnitude of the estimate is very small. Hence, it is unlikely that the effect of the management score on sales strongly depends on firm size as measured by the number of employees.
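To see why the interaction coefficient answers the question, it can help to write the estimated model out. The coefficient labels below, in particular $\beta_{4}$ for the interaction term, are notation assumed here for illustration and do not appear in the regression output:

$$ ln(sales) = \alpha + \beta_{1} * management + \beta_{2} * ln(emp) + \beta_{3} * ln(cap) + \beta_{4} * management * ln(emp) + \epsilon $$

$$ \frac{\partial \, ln(sales)}{\partial \, management} = \beta_{1} + \beta_{4} * ln(emp) $$

The association between management and log sales therefore varies with firm size only through $\beta_{4}$. If $\beta_{4}$ is close to zero, the association is roughly the same for small and large firms.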

## Assignment 2 (30 points)

*Your task in this assignment is to use your knowledge of Machine Learning to perform model selection and/or model assessment on the AMP dataset. In particular, you will create and evaluate linear regression models which aim at predicting log(sales) based on 18 management practices. In one instance you will use the 'management' column only, which is the average over all management practice dimensions. In another instance, you will use each of the 18 dimensions explicitly in a regression.*

In [7]:
cols = ['ln_sales', 'management', 'lean1', 'lean2', 'perf1', 'perf2', 'perf3',
        'perf4','perf5', 'perf6', 'perf7', 'perf8', 'perf9', 'perf10', 'talent1',
        'talent2','talent3', 'talent4', 'talent5', 'talent6']

__a)__ Subset the data such that the resulting dataframe only contains the columns defined in the list "cols" above. From the resulting dataframe, remove all rows which contain 'NaNs' (i.e. missing values) in any of the columns.
<div style="text-align: right"> <b>2 points</b> </div>

In [8]:
df['ln_sales'] = np.log(df['sales'])
df = df[cols]
df = df.dropna()

*Now you want to get an unbiased estimate of the mean squared error of the simple linear regression using only the 'management' column to predict 'log(sales)':* 
\begin{equation}
ln(sales) = \alpha + \beta_1* management + \epsilon
\end{equation}

__b)__ Describe the validation set approach (without model selection) and explain why this approach is preferable to using the whole dataset for both model estimation and assessment.
<div style="text-align: right"> <b>3 points</b> </div>

The validation set approach (without model selection) involves splitting our dataset into two sets: a training set and a test set. We estimate the model on the training set and evaluate the model's performance on the test set.

In general, we want an estimate of the model's MSE on the population, i.e. on an independent test set. If we use the whole dataset for both model estimation and model assessment, we evaluate the model's predictions on data it has already seen. Because the model was fit to this data by minimizing the MSE on it, the resulting estimate will often be too optimistic.
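The optimism of the in-sample error can be illustrated with a small simulation. This is only a sketch: the data-generating line `y = 2 + 0.5 * x + noise`, the sample sizes and the helper names `simulate` and `mse` are assumptions made for illustration. Because OLS picks the line that minimizes the training MSE, the fitted line can never do worse on the training data than the true line itself.

```python
import numpy as np

rng = np.random.default_rng(181)

def simulate(n):
    """Draw n observations from a hypothetical model y = 2 + 0.5 * x + noise."""
    x = rng.normal(size=n)
    y = 2.0 + 0.5 * x + rng.normal(size=n)
    return x, y

def mse(y, y_hat):
    """Mean squared error between observed and predicted values."""
    return float(np.mean((y - y_hat) ** 2))

x_train, y_train = simulate(50)
x_test, y_test = simulate(5_000)

# Fit the simple linear regression on the training data only
slope, intercept = np.polyfit(x_train, y_train, deg=1)

train_mse_fitted = mse(y_train, intercept + slope * x_train)
# Training-set MSE of the true (data-generating) line, for comparison
train_mse_truth = mse(y_train, 2.0 + 0.5 * x_train)
# MSE of the fitted line on data it has never seen
test_mse_fitted = mse(y_test, intercept + slope * x_test)

print("training MSE, fitted line:", train_mse_fitted)
print("training MSE, true line:  ", train_mse_truth)
print("test MSE, fitted line:    ", test_mse_fitted)
```

The first number is never larger than the second, which shows that part of the good training fit comes from fitting the noise in that particular sample.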

__c)__ Split the data into a training set containing 75% of the observations and a test set containing 25% of the observations. Use 181 as the random state to allow for reproducibility and name the variables: X_train, X_test, y_train, y_test.
<div style="text-align: right"> <b>2 points</b> </div>

In [9]:
# Insert your code here:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[cols].drop(columns='ln_sales'), df['ln_sales'],
                                                   train_size=0.75, random_state=181)


__d)__ Perform the linear regression on the training set and calculate and print the model's mean squared error on both the training set and the test set.
<div style="text-align: right"> <b>4 points</b> </div>

In [10]:
# Insert your code here:
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(X_train[['management']],y_train)

from sklearn.metrics import mean_squared_error as MSE

y_pred_tr = reg.predict(X_train[['management']])
y_pred_te = reg.predict(X_test[['management']])

print('training MSE:', MSE(y_train, y_pred_tr))

print('test MSE:', MSE(y_test, y_pred_te))


training MSE: 2.002843880448642
test MSE: 1.941813174995886


__e)__ Please comment on the two error estimates: Does the relationship between the MSE on the training set and test set correspond to your answer in b)? Which of the two is the better estimate of the model's generalization error? Explain your answer.
<div style="text-align: right"> <b>4 points</b> </div>

Here we see that the training MSE is larger than the test MSE. This is the opposite of what our answer to b) would lead us to expect.

This can happen because of a drawback of the validation set approach: the estimates may depend heavily on the particular split of the data.

It could also reflect that a linear model with a single variable has little capacity to overfit and therefore generalizes well. A training error far below the test error is expected when a method easily overfits the training data and generalizes poorly. This usually, but not always, happens with flexible methods that have many parameters.

The test MSE is our unbiased estimate of the model's generalization error, because the test data were not used to fit the model.
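How strongly a single split can drive the comparison is easy to check by repeating the split many times. The sketch below uses simulated data with an assumed data-generating line and a hypothetical helper `split_mse`. It does not use the AMP dataset, so the numbers it prints say nothing about the results above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.4, size=n)  # hypothetical data

def split_mse(seed, train_share=0.75):
    """Training and test MSE of a simple linear regression for one random split."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train = int(train_share * n)
    tr, te = idx[:n_train], idx[n_train:]
    slope, intercept = np.polyfit(x[tr], y[tr], deg=1)
    mse_tr = np.mean((y[tr] - (intercept + slope * x[tr])) ** 2)
    mse_te = np.mean((y[te] - (intercept + slope * x[te])) ** 2)
    return mse_tr, mse_te

# Repeat the 75/25 split with 200 different random states
results = np.array([split_mse(seed) for seed in range(200)])
share_reversed = np.mean(results[:, 0] > results[:, 1])

print("test MSE across splits: min %.3f, max %.3f"
      % (results[:, 1].min(), results[:, 1].max()))
print("share of splits with training MSE > test MSE: %.2f" % share_reversed)
```

For a model with only two parameters, a noticeable share of splits will typically show a training MSE above the test MSE, so a single reversed comparison is not surprising.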

*Now you want to perform both model selection and model assessment on the data. In particular, you would like to (i) find the best model among the simple linear regression using only the 'management' column you used above and a multivariate regression using each management practice dimension explicitly:*
\begin{align}
ln(sales) =  &  \alpha + \beta_1\times lean1 + \beta_2\times lean2 + \beta_3\times perf1 + \beta_4\times perf2 + \beta_5\times perf3\\
&+ \beta_6\times perf4 + \beta_7\times perf5 + \beta_8\times perf6 + \beta_9\times perf7 + \beta_{10}\times perf8 \\
& + \beta_{11}\times perf9 + \beta_{12}\times perf10 + \beta_{13}\times talent1 + \beta_{14}\times talent2 + \beta_{15}\times talent3 \\
& + \beta_{16}\times talent4 + \beta_{17}\times talent5 + \beta_{18}\times talent6 + \epsilon
\end{align}
*and (ii) get an unbiased estimate of the error of the better of these two models.*

__f)__ Describe the validation set approach (with model selection) and explain why this approach is preferable to using the validation set approach you described in b) when performing both model selection and assessment.
<div style="text-align: right"> <b>3 points</b> </div>

The validation set approach (with model selection) involves splitting our dataset into three sets: a training set, a validation set and a test set. We estimate the models on the training set, perform model selection on the validation set and evaluate the best model's performance on the test set.

If we want to perform both model selection and assessment for a list of models, we will usually get a too optimistic estimate if we use the validation set approach (without model selection). The reason is that the same set would be used both to choose among the models and to estimate the error of the chosen model. The choice of model then acts like a hyperparameter that is tuned on this set, so the selected model is likely to fit this set "too well" for its error on it to be an unbiased estimate.
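The selection effect can be made visible with an artificial example in which all candidate models are equally uninformative by construction. The setup is an assumption chosen purely for illustration: 30 candidate regressions, each using one feature that is unrelated to the outcome. The winner on the validation set is then simply the luckiest model, and only the untouched test set gives a fair picture of its error.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_val, n_test, n_models = 100, 50, 5_000, 30

def draw(n):
    """Hypothetical data: the outcome is unrelated to every candidate feature."""
    X = rng.normal(size=(n, n_models))
    y = rng.normal(size=n)
    return X, y

X_tr, y_tr = draw(n_train)
X_va, y_va = draw(n_val)
X_te, y_te = draw(n_test)

def fit_predict(j, X_new):
    """Fit y ~ feature j on the training set and predict for the rows of X_new."""
    slope, intercept = np.polyfit(X_tr[:, j], y_tr, deg=1)
    return intercept + slope * X_new[:, j]

val_mse = np.array([np.mean((y_va - fit_predict(j, X_va)) ** 2)
                    for j in range(n_models)])
test_mse = np.array([np.mean((y_te - fit_predict(j, X_te)) ** 2)
                     for j in range(n_models)])

best = int(np.argmin(val_mse))  # model selection on the validation set

print("validation MSE of selected model:        ", val_mse[best])
print("average validation MSE of all candidates:", val_mse.mean())
print("test MSE of selected model:              ", test_mse[best])
```

By construction the selected model's validation MSE is the smallest of the 30, although no candidate is truly better than the others. This is the optimism that the separate test set guards against.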

In [11]:
X_train_new, X_val, y_train_new, y_val = train_test_split(X_train,
                                                          y_train,
                                                          train_size=2/3,
                                                          random_state=181)

__g)__ The code cell above splits up the training data previously defined into a new training set and a validation set (assuming train_test_split has been imported). Please execute the cell. Now perform the validation set approach (with model selection) in order to choose the best of the two models. For this, please print both models' validation mean squared error. Furthermore, print an unbiased estimate of the best model's mean squared error.
<div style="text-align: right"> <b>6 points</b> </div>

In [12]:
# Insert your code below:
cols.remove('ln_sales')
cols.remove('management')

reg2 = LinearRegression().fit(X_train_new[['management']],y_train_new)
reg3 = LinearRegression().fit(X_train_new[cols], y_train_new)

y_pred2 = reg2.predict(X_val[['management']])
y_pred3 = reg3.predict(X_val[cols])
print('Val MSE 1:', MSE(y_val, y_pred2), '\f Val MSE 2:', MSE(y_val, y_pred3))

Val MSE 1: 2.138539385519372  Val MSE 2: 2.105679794400939


In [13]:
reg3_ = LinearRegression().fit(X_train[cols], y_train)
y_pred3_ = reg3_.predict(X_test[cols])
print('test MSE:', MSE(y_test, y_pred3_))

test MSE: 1.9102719949480327


__h)__ Comment on your results: Which model appears to be the better of the two? Give a potential reason for this.
<div style="text-align: right"> <b>3 points</b> </div>

The multivariate regression appears to be the better of the two models as it has a lower MSE on the validation set. 

A potential reason for this could be that using the 18 dimensions instead of the simple mean of these dimensions provides more information, which in turn helps predict ln sales.

*Note that it is plausible for the validation error to be higher than the test error here: We initially train the model on 50% of the data and predict on 25% to get the validation error. Afterwards, we fit the model on 75% of the data and predict on 25% to get the test error, so the final model is estimated on more observations.*

__i)__ What is the potential drawback of the validation set approach? Describe a method you have learned about which may remedy this.
<div style="text-align: right"> <b>3 points</b> </div>

When using the validation set approach, both with and without model selection, the results we get may depend heavily on the particular split(s) of the data. Thus, if we chose a different split, we might end up with very different estimates.

A method to remedy this is K-Fold Cross Validation. Here we divide the data into K parts, train the model K times, each time on K-1 parts of the data, and evaluate it on the remaining part. To get the error estimate for a particular model, we take the average of these K error estimates.
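The procedure can be sketched in a few lines. The function `k_fold_mse` below is a hypothetical helper written with NumPy only and applied to simulated data. In practice, scikit-learn's `KFold` and `cross_val_score` from `sklearn.model_selection` provide the same functionality.

```python
import numpy as np

def k_fold_mse(x, y, k=5, seed=181):
    """Average held-out MSE of a simple linear regression over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)  # k disjoint parts of (almost) equal size
    fold_mse = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the other k-1 parts, evaluate on the held-out part
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        residuals = y[test_idx] - (intercept + slope * x[test_idx])
        fold_mse.append(float(np.mean(residuals ** 2)))
    return float(np.mean(fold_mse)), fold_mse

# Hypothetical data for illustration
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = 2.0 + 0.5 * x + rng.normal(size=300)

cv_mse, fold_mse = k_fold_mse(x, y, k=5)
print("fold MSEs:", np.round(fold_mse, 3))
print("5-fold CV estimate of the MSE:", round(cv_mse, 3))
```

Every observation is used for evaluation exactly once, so the averaged estimate depends much less on any single split than the validation set approach does.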

*Note that after performing model selection, we still need an independent test set to get an unbiased estimate of the performance of the best model. Alternative methods such as Nested Cross Validation exist to further remedy the problem, but we did not explicitly cover these in class.*