# 1253M.BPAE1 MP BM People Analytics & Econometrics
## Examiner: Prof. Dr. Dirk Sliwka
## Date: 18.12.2021

## Instructions:

Please follow the instructions below so that we can correctly identify your exam solutions.

**1. Please rename this Jupyter notebook and save it under a file name in the following format:**

*matriculation_number_WS2122_EEMP_exam_PT1.ipynb*

- i.e., the final file name should look like this: *1234567_WS2122_EEMP_exam_PT1.ipynb*

**2. Before the exam ends, please save the notebook and share it with jeshan49@gmail.com.**

**3. Please also enter your matriculation number and your initials in the following cell:**

### Matriculation number:
### Initials:

## Background information
Please consider the following simulated two-period panel data, in which an employee's sales (in euros) are determined by the following equation:

\begin{equation}
sales_{it} = 10000 + 100*ability_i + 2000*year_t + 5000*mentoring_{it} + 250*age_{it} - 2*age_{it}^2 + 500 * norm\_satis_{it} - 500*home_{it} + 2500*fulltime_{it} + ϵ_{it}
\end{equation}

where $ϵ_{it}\sim N(0,16000000)$
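
Note that the normal distribution here is written in terms of its variance, whereas `np.random.normal` takes a standard deviation as its second argument. A quick check of the correspondence (the variable names are chosen here for illustration only):

```python
import math

# Variance of the error term as stated in the model.
error_variance = 16_000_000

# np.random.normal expects the standard deviation, i.e. the square root.
error_sd = math.sqrt(error_variance)
print(error_sd)  # 4000.0
```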

The independent variables can be described as follows:

- $ability_i$: Individual $i$'s time-fixed ability.

- $year_t$: Year indicator that takes the value 1 or 2.

- $mentoring_{it}$: A dummy variable that indicates whether individual $i$ received mentoring in period $t$. The variable only enters at $t=2$.

- $norm\_satis_{it}$: Normalized self-reported job satisfaction of individual $i$ in period $t$ (based on the aggregation of multiple items in a survey).

- $age_{it}$: The age in years of individual $i$ in period $t$.

- $home_{it}$: dummy variable taking the value 1 if individual $i$ works from home in period $t$.

- $fulltime_{it}$: dummy variable taking the value 1 if individual $i$ works fulltime in period $t$.
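
As a minimal sketch, the deterministic part of the sales equation (with the error term set to zero) can be written as a small Python function. The function name `expected_sales` and the example employee are assumptions made here purely for illustration; they are not part of the exam code.

```python
def expected_sales(ability, year, mentoring, age, norm_satis, home, fulltime):
    """Deterministic part of the sales equation (error term set to zero)."""
    return (10000 + 100 * ability + 2000 * year + 5000 * mentoring
            + 250 * age - 2 * age ** 2 + 500 * norm_satis
            - 500 * home + 2500 * fulltime)

# Hypothetical employee, chosen only to illustrate how the terms add up.
example = expected_sales(ability=110, year=2, mentoring=1, age=30,
                         norm_satis=0.5, home=0, fulltime=1)
print(example)  # 38450.0
```

Because of the $-2*age^2$ term, the contribution of age is concave: it rises at a decreasing rate and eventually declines.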

The code cell below (i) imports numpy and pandas, (ii) simulates the data described above, and (iii) stores the data in the dataframe __df__. Please execute the cell before you work on the exercises.

In [None]:
import numpy as np
import pandas as pd

n = 2000  # number of employees

# Period 1: draw time-fixed ability and the period-1 covariates
df1 = pd.DataFrame(index=range(n))
df1['ability'] = np.random.normal(100, 15, n)
df1['abilityTest'] = df1.ability + np.random.normal(0, 15, n)  # noisy measure of ability
df1['mentoring'] = 0  # nobody receives mentoring in period 1
df1['norm_satis'] = np.random.normal(0, 1, n)
df1['age'] = np.random.uniform(18, 70, n)
df1['home'] = np.random.choice([0, 1], size=n, p=[0.85, 0.15])
df1['fulltime'] = np.random.choice([0, 1], size=n, p=[0.23, 0.77])
df1['year'] = 1
df1['persnr'] = df1.index  # person identifier

# Period 2: same employees, one year older; mentoring depends on ability plus noise
df2 = df1.copy()
df2['year'] = 2
df2['mentoring'] = (df2.ability + np.random.normal(0, 10, n) <= 87)
df2['age'] = df1['age'] + 1
df2['norm_satis'] = df1['norm_satis'] + 0.5 * df2['mentoring']

# Stack both periods and generate sales according to the equation above
df = pd.concat([df1, df2], sort=False)
df['sales'] = 10000 + 100*df['ability'] + 2000*df['year'] + 5000*df['mentoring'] + 250*df['age'] - 2*df['age']**2 + 500*df['norm_satis'] - 500*df['home'] + 2500*df['fulltime'] + np.random.normal(0, 4000, 2*n)

## Assignment 1 (30 points)
Suppose that a researcher (who does not know the true conditional expectation function (CEF)) has received this data set and wants to learn about the role of the mentoring program. 

a) Please run two OLS regressions, regressing sales on

- the mentoring dummy and year
- the mentoring dummy, year, age and the home and fulltime dummies

and display the results. Please state precisely how the researcher should interpret the coefficient and the standard error of the mentoring dummy. What is the purpose of including the age, home and fulltime variables here?

<div style="text-align: right"> <b>6 points</b> </div>
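
Any OLS routine can be used for this task. To make explicit what such a routine computes, the following sketch implements OLS coefficients and conventional (homoskedastic) standard errors with plain NumPy on toy data. The helper `ols`, the toy variables, and the seed are assumptions for illustration and are unrelated to the exam data.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and conventional (homoskedastic) standard errors.

    X must already contain a column of ones for the intercept.
    """
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)          # residual variance estimate
    se = np.sqrt(np.diag(sigma2 * XtX_inv))   # standard errors of the coefficients
    return beta, se

# Toy data: outcome = 1 + 2 * dummy + noise
rng = np.random.default_rng(0)
n = 1000
treated = rng.integers(0, 2, n)
y = 1.0 + 2.0 * treated + rng.normal(0, 1, n)
X = np.column_stack([np.ones(n), treated])

beta, se = ols(X, y)
print(beta, se)
```

The coefficient on the dummy estimates the difference in conditional means between the two groups; the standard error measures how much that estimate would vary across repeated samples.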

In [None]:
# Insert your code here

'# Give the verbal answer here:

b) Generate a variable $age^2$ and run a further regression that also controls for this new variable. Comment on what the regression results tell the researcher about the age profile of sales.

<div style="text-align: right"> <b>4 points</b> </div>

In [None]:
# Insert your code here

'# Give the verbal answer here:

c) Suppose now that the researcher suspects an omitted variable bias because the likelihood that an employee receives mentoring is correlated with ability. What is the regression result if the researcher can directly include ability as a control variable? Please explain what the results tell you about the direction of the omitted variable bias and provide a plausible explanation the researcher could offer.
<div style="text-align: right"> <b>8 points</b> </div>
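
For reference, the textbook omitted-variable relationship behind this question can be written as follows. The symbols $\hat{β}^{short}$, $\hat{β}^{long}$, $\hat{γ}$ and $\hat{δ}$ are notation introduced here for illustration: $\hat{β}^{short}$ and $\hat{β}^{long}$ are the mentoring coefficients without and with ability as a control, $\hat{γ}$ is the ability coefficient in the long regression, and $\hat{δ}$ is the mentoring coefficient from an auxiliary regression of ability on mentoring and the remaining controls.

\begin{equation}
\hat{β}^{short}_{mentoring} = \hat{β}^{long}_{mentoring} + \hat{γ}_{ability} * \hat{δ}_{mentoring}
\end{equation}

The sign of the bias is therefore the sign of the product $\hat{γ}_{ability} * \hat{δ}_{mentoring}$.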

In [None]:
# Insert your code here

'# Give the verbal answer here:

d) Suppose now that the researcher cannot measure ability directly but observes the results of an ability test conducted in period 1. Please estimate another regression where you replace $ability$ with $abilityTest$. Please compare the coefficient of $abilityTest$ with the coefficient of $ability$ in your previous regression and explain the difference. Please also compare the $mentoring$ coefficients between the two regressions and explain the difference.
<div style="text-align: right"> <b>8 points</b> </div>
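
The effect of a noisy regressor can be illustrated in a simplified single-regressor setting. The sketch below is not the exam regression: it uses toy data with no further controls, and the names `x_true`, `x_noisy`, `slope` and `reliability` are assumptions for illustration. With additional control variables the numbers will generally differ.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
x_true = rng.normal(100, 15, n)           # regressor measured without error
x_noisy = x_true + rng.normal(0, 15, n)   # same regressor with classical measurement error
y = 100 * x_true + rng.normal(0, 4000, n)

def slope(x, y):
    """Slope of a simple regression of y on x (with intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Reliability ratio: var(signal) / (var(signal) + var(noise))
reliability = 15**2 / (15**2 + 15**2)

print(slope(x_true, y), slope(x_noisy, y), 100 * reliability)
```

In this toy setting the slope on the noisy regressor shrinks toward zero by roughly the reliability ratio, which is the classical attenuation result.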

In [None]:
# Insert your code here

'# Give the verbal answer here:

e) Which other method could help to overcome omitted variable bias when neither ability nor a proxy for ability can be measured? Please explain why the proposed method should solve the OVB problem here. (Note: you do not need to run the corresponding regression.)
<div style="text-align: right"> <b>4 points</b> </div>

'# Give the verbal answer here:

## Assignment 2 (30 points)
Now the researcher is mainly interested in predicting sales.

**a)** Please formally state the conditional expectation function (CEF), $f(sales_{it})$. Furthermore, state and explain the three types of prediction errors that the expected squared error of any estimated regression model $\hat{f}(sales_{it})$ can be decomposed into.
<div style="text-align: right"> <b>7 points</b> </div>
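
For orientation, the standard decomposition of the expected squared prediction error at a fixed point can be written as below. The symbols $x_0$ (a fixed vector of regressor values), $y_0$ (the outcome at that point) and $f$ (the true CEF) are notation assumed here for illustration.

\begin{equation}
E\left[(y_0 - \hat{f}(x_0))^2\right] = \left(E[\hat{f}(x_0)] - f(x_0)\right)^2 + E\left[\left(\hat{f}(x_0) - E[\hat{f}(x_0)]\right)^2\right] + Var(ϵ)
\end{equation}

The expectations are taken over repeated training samples and the error term at $x_0$.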


'# Give the verbal answer here:

**b)** Based on your answer in a), please state the irreducible error of any estimated regression model $\hat{f}(sales_{it})$. Furthermore, calculate the expected sales of an individual in period $t=1$ with ability of 100, satisfaction of 0 and an age of 50, who works fulltime from home without mentoring.
<div style="text-align: right"> <b>5 points</b> </div>

'# Give the verbal answer here:

The code in the following cell defines a function that lets us simulate additional samples of sales data. Please execute the code before you work on the following tasks.

In [None]:
def sample(size=4000):
  """Simulate a new two-period panel with `size` rows in total (size/2 employees per period)."""
  n=int(size/2)
  df1=pd.DataFrame(index=range(n))
  df1['ability'] = np.random.normal(100,15,n)
  df1['mentoring']=0
  df1['norm_satis'] = np.random.normal(0,1,n)
  df1['age'] = np.random.uniform(18, 70, n)
  df1['home'] = np.random.choice([0,1], size=n, p=[0.85,0.15])
  df1['fulltime'] = np.random.choice([0,1], size=n, p=[0.23,0.77])
  df1['year']=1
  df1['persnr']=df1.index
  df2=df1.copy()
  df2['year']=2
  df2['mentoring']=(df2.ability+np.random.normal(0,10,n)<=87)
  df2['age']=df1['age']+1
  df2['norm_satis'] = df1['norm_satis']+0.5*df2['mentoring']
  df=pd.concat([df1,df2], sort=False)
  df['sales'] = 10000 + 100*df['ability'] + 2000*df['year'] + 5000*df['mentoring'] + 250*df['age'] - 2*df['age']**2 + 500*df['norm_satis'] - 500*df['home'] + 2500*df['fulltime'] + np.random.normal(0,4000,2*n)
  return df

**c)** The code in the cell below contains some of the code needed to estimate (i) the (squared) bias of $\hat{f}(sales_{it})$ and (ii) the variance of $\hat{f}(sales_{it})$ for an individual in period $t=1$ with ability of 100, satisfaction of 0 and an age of 50, who works fulltime from home without mentoring, by resampling 1000 times. In this task, $\hat{f}(sales_{it})$ is a linear regression given by the following equation:

\begin{equation}
\hat{f}(sales_{it})=\hat{β}_0 + \hat{β}_1*ability_i + \hat{β}_2*year_t+\hat{β}_3*mentoring_{it}+\hat{β}_4*age_{it}+\hat{β}_5*norm\_satis_{it}+\hat{β}_6*home_{it}+\hat{β}_7*fulltime_{it}
\end{equation}

Please finalize and execute the code to estimate the bias and the variance. Comment on the relative magnitudes of the bias and the variance.
<div style="text-align: right"> <b>8 points</b> </div>
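
The general resampling idea can be sketched on a deliberately simple toy problem that is unrelated to the sales data: a straight line is fitted to data generated from a quadratic function, and the predictions at one fixed point are collected across many simulated samples. The function `true_f`, the point `x0`, the seed and the sample sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def true_f(x):
    """True conditional expectation function of the toy problem."""
    return x ** 2

x0 = 0.0      # point at which predictions are evaluated
preds = []
for _ in range(500):
    # Draw a fresh training sample in every iteration.
    x = rng.uniform(-1, 1, 200)
    y = true_f(x) + rng.normal(0, 0.1, 200)
    # Fit a (misspecified) straight line; polyfit returns [slope, intercept].
    slope, intercept = np.polyfit(x, y, deg=1)
    preds.append(intercept + slope * x0)

preds = np.array(preds)
bias_sq = (preds.mean() - true_f(x0)) ** 2   # squared bias at x0
variance = preds.var()                       # variance of the predictions at x0
print(bias_sq, variance)
```

In this toy case the linear model cannot capture the curvature, so the squared bias dominates the variance. Whether the same holds for the sales data depends on how well the estimated model approximates the true CEF at the chosen point.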

In [None]:
# Finalize the code in this cell
from sklearn.linear_model import LinearRegression
f_hats = []
for i in range(1000):
  df_ = sample(4000)
  X = df_[['ability','year','mentoring','age','norm_satis','home','fulltime']].values
  y = df_['sales']
  lr = 
  f_hat = lr.predict([[100, 1, 0, 50, 0, 1, 1]])
  f_hats.append(f_hat)
bias = 
variance = 
print('Bias: ', bias, '\n','Variance: ', variance)

'# Give the verbal answer here:

**d)** Now perform the same task as in c), but with an unrestricted regression tree instead of a linear regression. Comment on your results. In particular, compare the magnitude of the bias and variance of the decision tree to that of the linear regression in c), and explain the differences.
<div style="text-align: right"> <b>10 points</b> </div>

In [None]:
# Insert your code here

'# Give the verbal answer here: