# 1253BMEE00 FA MP MB Econometric Evaluation of Management Practices
## Examiner: Prof. Dr. Dirk Sliwka
## Date: 02.12.2019

## Instructions:

Please follow the instructions below, such that we will be able to correctly identify your solutions to the exam.

**1. Please make a copy of this Jupyter notebook and save it as a separate file in the following format:**

*WS1920_EEMP_exam_PT1_matriculationnumber_initials.ipynb*

- i.e., the final file name should look like this: *WS1920_EEMP_exam_PT1_1234567_MM.ipynb*

**2. Please also enter your matriculation number and your initials in the following cell:**

### Matriculation number:
### Initials:

## Background information

The datasets provided on the memory sticks contain data from a study by Bloom et al. (2015): *Does Working from Home Work? Evidence from a Chinese Experiment*, where the authors evaluate the performance effect of giving Chinese call-center employees the opportunity to work from home. To do this, they first asked the employees whether they would generally be willing to work from home. Of those employees who volunteered to work from home, they **<u>randomly</u>** chose a **subgroup** which was actually given the **opportunity to work from home** (**treatment group**). Those employees who **volunteered**, but were **not given the opportunity to work from home**, serve as the **control group**.

The code cell below imports the standard module *pandas*. It also imports the two datasets relevant for this exam, provided that the specified paths are correct (this depends on where you saved the files on your laptop). Please execute this cell before you start your work.

In [None]:
import pandas as pd

path_data_task1 = 'https://raw.githubusercontent.com/armoutihansen/EEMP2020/main/datasets/data_task1.csv'
df1 = pd.read_csv(path_data_task1)

path_data_task2 = 'https://raw.githubusercontent.com/armoutihansen/EEMP2020/main/datasets/data_task2.csv'
df2 = pd.read_csv(path_data_task2)

In [None]:
# further imports (pandas is already imported in the cell above)
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
from statsmodels.iolib.summary2 import summary_col
import matplotlib.pyplot as plt

*Good luck!*

## Assignment 1 (30 points)

The dataset *data_task1.csv* contains the following variables from the experimental period (that is the time frame in which the treatment group worked from home):
- *personid*: individual employee identifier
- *calllength*: performance measure, indicating the weekly sum of minutes on the phone
- *treatment*: treatment dummy, indicating whether the employee was part of the treatment group
- *commute120*: commuting dummy, indicating whether the employee has to commute more than 120 minutes in total
- *year_week*: indicator for year and calendar week

__a)__ Using *data_task1.csv*, estimate the following OLS regression and show its output using Python (remember to cluster the standard errors at the "personid" level):

**Regression 1**: $$ \ln(calllength) = \alpha + \beta_{1} \cdot treatment + \beta_{t} \cdot year\_week_{t} + \epsilon $$

*Note:* To account for seasonal variation, $\beta_{t}$ denotes the coefficients on the full set of weekly time dummies.


Please give a precise verbal interpretation of the coefficient for treatment and its statistical significance.   



<div style="text-align: right"> <b>10 points</b> </div>
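
A minimal sketch of how a specification like Regression 1 can be estimated with clustered standard errors in *statsmodels*. It runs on synthetic data generated in the cell itself, not on *data_task1.csv*; the data-generating numbers and the names `demo`, `ln_calllength` and `reg1` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the exam data (NOT data_task1.csv): 60 employees x 8 weeks.
rng = np.random.default_rng(0)
n_persons, n_weeks = 60, 8
demo = pd.DataFrame({
    "personid": np.repeat(np.arange(n_persons), n_weeks),
    "year_week": np.tile(np.arange(201101, 201101 + n_weeks), n_persons),
})
# Treatment is assigned per person and stays constant over the weeks.
demo["treatment"] = (demo["personid"] % 2 == 0).astype(int)
# A person-specific shock makes the errors correlated within an employee,
# which is the reason for clustering on personid.
person_shock = rng.normal(0, 0.2, n_persons)[demo["personid"].to_numpy()]
noise = rng.normal(0, 0.1, len(demo))
demo["calllength"] = np.exp(6.0 + 0.10 * demo["treatment"] + person_shock + noise)
demo["ln_calllength"] = np.log(demo["calllength"])

# C(year_week) expands into the full set of weekly dummies (one week is the baseline);
# cov_type="cluster" requests standard errors clustered on the given groups.
reg1 = smf.ols("ln_calllength ~ treatment + C(year_week)", data=demo).fit(
    cov_type="cluster", cov_kwds={"groups": demo["personid"]}
)
print(reg1.summary())
```

Because the outcome is in logs, a coefficient $b$ on a dummy variable corresponds to a relative difference of $e^{b} - 1$ in the outcome, which is close to $b$ itself when $b$ is small.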

In [None]:
# Insert your code here:

'# Give the verbal answer here:

__b)__ As a next step, please explore in another regression (Regression 2) whether the size of the treatment effect depends on the commuting distance (remember to cluster the standard errors at the "personid" level and, as before, to include the full set of weekly time dummies).


Please give a precise verbal interpretation of the results and the respective magnitudes of your estimates. Explain what this means for the effectiveness of the working from home treatment intervention and elaborate on potential reasons. 



<div style="text-align: right"> <b>10 points</b> </div>
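
One possible way to write such a specification, offered as an assumption rather than as the examiner's intended model, is to add the commuting dummy and its interaction with the treatment dummy to Regression 1. The coefficient labels $\beta_{2}$ and $\beta_{3}$ are introduced here for illustration:

$$ \ln(calllength) = \alpha + \beta_{1} \cdot treatment + \beta_{2} \cdot commute120 + \beta_{3} \cdot (treatment \times commute120) + \beta_{t} \cdot year\_week_{t} + \epsilon $$

Under this sketch, $\beta_{1}$ is the treatment effect for employees with $commute120 = 0$, and $\beta_{1} + \beta_{3}$ is the treatment effect for employees with $commute120 = 1$.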

In [None]:
# Insert your code here:

'# Give the verbal answer here:

__c)__ As explained above, the researchers first explored which employees would be willing to work from home and then randomly selected a subgroup amongst these employees who would take part in the treatment. Explain why this is an essential step to estimate the causal effect of the treatment. 

<div style="text-align: right"> <b>5 points</b> </div>

'# Give the verbal answer here:

__d)__ Now assume that working from home had not been randomly assigned, i.e., employees could decide individually whether they want to take part in working from home or not. Which alternative methods could help to estimate the causal effect of the management practice in this case? Please also explain verbally which assumption(s) you would have to impose to give a causal interpretation of the results.

<div style="text-align: right"> <b>5 points</b> </div>

'# Give the verbal answer here:

## Assignment 2 (30 points)

_Your general task in this assignment is to use employee features listed below to predict employees' performance. In the first part of the exercise, you will perform data cleaning. In the second part, you are tasked with (i) finding the optimal Random Forest regressor to predict performance (i.e., model selection) and (ii) estimating the general performance of the selected model (i.e. model assessment)._

The dataset data_task2.csv contains the following variables from a pre-experimental period on a subset of the employees:

- *personid*: individual employee identifier
- *age*: age in years
- *tenure*: tenure in months
- *wage*: gross wage
- *children*: children dummy, indicating whether the employee has children
- *bedroom*: bedroom dummy, indicating whether the employee has a bedroom
- *commute*: commuting time in minutes
- *men*: gender dummy, indicating whether the employee is male
- *married*: marriage dummy, indicating whether the employee is married
- *volunteer*: volunteering dummy, indicating whether the employee volunteered for working from home in the experiment
- *high_educ*: education dummy, indicating whether the employee has a higher education
- *z_performance*: performance measure, which indicates the standardized performance of the employee (i.e. the mean is subtracted and the result is divided by the standard deviation).

In [None]:
# Insert your code here:

**a)** Using *data_task2.csv*, remove the 'wage' and 'personid' columns from the dataframe and remove any row that contains missing values (i.e. 'NaN's).
<div style="text-align: right"> <b>2 points</b> </div>
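
A small sketch of the two cleaning steps with *pandas*, using a tiny hypothetical dataframe (`demo`, with made-up values) instead of *data_task2.csv*:

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame with a few of the column names (NOT the exam data).
demo = pd.DataFrame({
    "personid": [1, 2, 3, 4],
    "wage": [3000.0, 2800.0, np.nan, 3100.0],
    "age": [25, np.nan, 31, 28],
    "z_performance": [0.3, -0.1, 0.8, np.nan],
})

# Drop the two columns first, then drop every row that still contains a NaN.
demo_clean = demo.drop(columns=["wage", "personid"]).dropna()
print(demo_clean)
```

The order matters in this sketch: the third row is missing only its *wage* value, so it survives when the column is dropped before `dropna()` is called.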

In [None]:
# Insert your code here:

**b)** Split the data into a training set containing 75% of the observations and a test set containing 25% of the observations. Use 181 as the random state to allow for reproducibility.
<div style="text-align: right"> <b>2 points</b> </div>
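
A sketch of a 75/25 split with a fixed random state, assuming that *scikit-learn* is available (the notebook itself does not import it). The dataframe `demo` is synthetic, and the names `X` and `y` are chosen for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data with 100 rows (NOT the exam data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "age": rng.integers(20, 40, size=100),
    "commute": rng.integers(10, 180, size=100),
    "z_performance": rng.normal(size=100),
})

# Separate the target from the features.
X = demo.drop(columns="z_performance")
y = demo["z_performance"]

# test_size=0.25 leaves 75% for training; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=181
)
print(X_train.shape, X_test.shape)
```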

In [None]:
# Insert your code here:

_In the following, you wish to apply the Cross Validation (CV) technique on the training set to find the optimal Random Forest regressor that can predict performance based on all the other features._

__c)__ Before you perform the model selection, please state and justify your choice of (i) number of folds in the Cross Validation (CV), (ii) hyperparameters, and (iii) parameter grid (i.e. the dictionary containing the hyperparameter candidates).
<div style="text-align: right"> <b>6 points</b> </div>

'# Give the verbal answer here:

__d)__ Based on your answer in c), perform the model selection and print the optimal Random Forest regressor.
<div style="text-align: right"> <b>8 points</b> </div>
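
A sketch of cross-validated model selection for a Random Forest regressor, assuming *scikit-learn*'s `GridSearchCV`. The training data is synthetic, and the number of folds, the tuned hyperparameters and the candidate values in `param_grid` are placeholders, not recommended answers to c):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic training data with 4 features (NOT the exam data).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 4))
y_train = X_train[:, 0] + 0.5 * X_train[:, 1] ** 2 + rng.normal(scale=0.3, size=120)

# Hypothetical grid: every combination of these candidates is evaluated.
param_grid = {"max_depth": [2, 4, None], "max_features": [1, 2, 4]}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=181),
    param_grid=param_grid,
    cv=5,                               # 5-fold CV on the training set only
    scoring="neg_mean_squared_error",   # higher (closer to 0) is better
)
search.fit(X_train, y_train)

# By default the best configuration is refit on the whole training set.
print(search.best_estimator_)
print(search.best_params_)
```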

In [None]:
# Insert your code here:

**e)** Print out the feature importance of all the features of the optimal Random Forest regressor you found in d). Which three features are most predictive of performance? Provide a potential reason for this.
<div style="text-align: right"> <b>8 points</b> </div>
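
A sketch of how the feature importances of a fitted forest can be listed with their column names, assuming *scikit-learn*'s `feature_importances_` attribute. The data is synthetic and the feature names `feat_a`, `feat_b` and `feat_c` are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data in which only feat_a drives the target (NOT the exam data).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["feat_a", "feat_b", "feat_c"])
y = 2.0 * X["feat_a"] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=181).fit(X, y)

# Pair each impurity-based importance with its column name and sort descending.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

These importances are normalized to sum to one, so they describe relative contributions within the fitted model rather than causal effects.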

In [None]:
# Insert your code here:

'# Give the verbal answer here:

**f)** Now get an unbiased estimate of the mean squared error of the optimal Random Forest regressor you found in d). Explain why this estimate is better than calculating the mean squared error on the training set.
<div style="text-align: right"> <b>4 points</b> </div>
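
A sketch comparing the training error with the error on held-out data, assuming *scikit-learn*. The data is synthetic, and the forest shown is an arbitrary one, not the model selected in d):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with substantial noise (NOT the exam data).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] + rng.normal(scale=1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=181
)
forest = RandomForestRegressor(n_estimators=50, random_state=181).fit(X_train, y_train)

# The training error is computed on observations the forest has already seen.
mse_train = mean_squared_error(y_train, forest.predict(X_train))
# The test error uses observations that played no role in fitting or tuning.
mse_test = mean_squared_error(y_test, forest.predict(X_test))
print(f"training MSE: {mse_train:.3f}, test MSE: {mse_test:.3f}")
```

In this sketch the training MSE comes out well below the test MSE, because the trees partly fit the noise in the observations they were trained on.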

In [None]:
# Insert your code here:

'# Give the verbal answer here: