# 1253M.BPAE1 MP BM People Analytics & Econometrics
## Examiner: Prof. Dr. Dirk Sliwka
## Date: 18.03.2022

## Instructions:

Please follow the instructions below so that we can correctly identify your solutions to the exam.

**1. Please rename this Jupyter notebook and save it as a file in the following format:**

*matriculation_number_WS2122_EEMP_exam_PT2.ipynb*

- i.e., the final file name should look like this: *1234567_WS2122_EEMP_exam_PT2.ipynb*

**2. Before the exam ends, please save the notebook and share it with jeshan49@gmail.com.**

**3. Please also enter your matriculation number and your initials in the following cell:**

### Matriculation number:
### Initials:

## Background information
Please consider the following simulated two-period panel data, in which a given employee's sales (in euros) is given by the following equation:

\begin{equation}
sales_{it} = 10000 + 100 \cdot ability_i + 100 \cdot self\_control_i + 2000 \cdot year_t + 0 \cdot WFH_{it} + 180 \cdot age_{it} - 2 \cdot age_{it}^2 + 2500 \cdot fulltime_{it} + \epsilon_{it}
\end{equation}

where $\epsilon_{it}\sim N(0,16000000)$, i.e. the error term has a standard deviation of 4000 euros.

The independent variables can be described as follows:

- $ability_i$: Individual $i$'s time-fixed ability.

- $self\_control_i$: Individual $i$'s time-fixed self-control.

- $year_t$: Year indicator that takes the value 1 or 2.

- $WFH_{it}$: A dummy variable that indicates whether individual $i$ decided to work from home in period $t$. The variable can only take the value 1 in $t=2$, because the option to work from home does not exist in $t=1$.

- $age_{it}$: The age in years of individual $i$ in period $t$.

- $fulltime_{it}$: A dummy variable that takes the value 1 if individual $i$ works full-time in period $t$.

The code cell below (i) imports numpy and pandas (plus `expit` from scipy and matplotlib), (ii) installs linearmodels, (iii) simulates the data described above, and (iv) stores the data in the dataframe __df__ and displays summary statistics for the included variables. Please execute the cell before you work on the exercises.

In [None]:
import numpy as np
import pandas as pd
from scipy.special import expit
import matplotlib.pyplot as plt
!pip install linearmodels
seed = 10
rng = np.random.default_rng(seed)
n=8000
df1=pd.DataFrame(index=range(n))
df1['ability'] = rng.normal(100,15,n)
df1['self_control'] = rng.normal(100,10,n)
df1['WFH']=0
df1['age'] = rng.uniform(18, 70, n)
df1['fulltime'] = rng.choice([0,1], size=n, p=[0.23,0.77])
df1['year']=1
df1['persnr']=df1.index
stand_self_control = (df1['self_control']-np.mean(df1['self_control']))/np.std(df1['self_control'])
stand_ability = (df1['ability']-np.mean(df1['ability']))/np.std(df1['ability'])
p_WFH = expit(stand_self_control-0.2*stand_ability-0.1*df1['fulltime'])
df2=df1.copy()
df2['WFH']=[rng.choice([1,0],1,p=[p,1-p]).item() for p in p_WFH]
df2['year']=2
df2['age']=df1['age']+1
df=pd.concat([df1,df2], sort=False)
df['sales'] = 10000 + 100*df['ability'] +100*df['self_control'] + 2000*df['year'] + 0*df['WFH'] + 180*df['age'] - 2*df['age']**2 + 2500*df['fulltime'] + rng.normal(0,4000,2*n)
df.describe()

## Assignment 1 (30 points)

Suppose a researcher uses the data set created above to study the effects of working from home on productivity.

a) First, run a simple OLS regression of sales on the WFH dummy, which indicates whether an employee is working from home, controlling for a dummy variable that takes the value 1 if the observation comes from the second year. Please interpret the size and level of significance of the WFH regression coefficient.

**5 points**

In [None]:
# Insert your code here

'# Give verbal answer here
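
One possible way to set up the regression in a) is sketched below. It is not the official solution. It assumes `statsmodels` is available (the notebook itself does not import it) and, so that it runs on its own, it re-simulates a stand-in for __df__ from the sales equation in the background section. The names `toy`, `make_year`, `year2` and `ols_a` are made up for this illustration, and the random draws differ from those of the setup cell, so the numbers will not match the exam data exactly.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in for the exam's `df` (assumption: re-simulated here so the sketch
# runs on its own; the draws differ from the notebook's setup cell).
rng = np.random.default_rng(10)
n = 8000
ability = rng.normal(100, 15, n)
self_control = rng.normal(100, 10, n)
age = rng.uniform(18, 70, n)
fulltime = rng.choice([0, 1], size=n, p=[0.23, 0.77])
z_sc = (self_control - self_control.mean()) / self_control.std()
z_ab = (ability - ability.mean()) / ability.std()
p_wfh = 1 / (1 + np.exp(-(z_sc - 0.2 * z_ab - 0.1 * fulltime)))
wfh_year2 = rng.binomial(1, p_wfh)


def make_year(year, wfh, age_now):
    """Build one period of the panel; time-fixed traits are shared."""
    return pd.DataFrame({
        "persnr": np.arange(n), "year": year, "WFH": wfh, "age": age_now,
        "ability": ability, "self_control": self_control, "fulltime": fulltime,
    })


toy = pd.concat([make_year(1, 0, age), make_year(2, wfh_year2, age + 1)],
                ignore_index=True)
toy["sales"] = (10000 + 100 * toy["ability"] + 100 * toy["self_control"]
                + 2000 * toy["year"] + 0 * toy["WFH"] + 180 * toy["age"]
                - 2 * toy["age"] ** 2 + 2500 * toy["fulltime"]
                + rng.normal(0, 4000, 2 * n))

# Dummy for the second year, then OLS of sales on WFH and that dummy.
toy["year2"] = (toy["year"] == 2).astype(int)
ols_a = smf.ols("sales ~ WFH + year2", data=toy).fit()
print(ols_a.params)    # point estimates
print(ols_a.pvalues)   # two-sided p-values
```

Since the sales equation gives WFH a true coefficient of 0, any clearly nonzero estimate in such a regression has to come from who selects into working from home rather than from a causal effect.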


b) Suppose the researcher suspects that there may be an omitted variable bias. The first candidate variable she investigates is age. She wants to control for age by estimating a regression model that also includes a quadratic term for age. First, explain the purpose of including a quadratic age term.
Then estimate the respective model and interpret your findings on the association between age and sales. Comparing the results of regression 1 and regression 2, is it likely that age is an important omitted variable in the first regression?

**8 points**

In [None]:
# Insert your code here

'# Give verbal answer here
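
To see what a quadratic term buys, the following stripped-down sketch looks only at the age part of the sales equation. It is an illustration under simplifying assumptions, not the regression asked for in b): all other regressors are left out, the data are freshly simulated, and the names `sales_age_part` and `turning_point` are hypothetical. With a linear and a squared term, the fitted age profile can bend, and its peak lies where the derivative $b_1 + 2 b_2 \cdot age$ equals zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8000
age = rng.uniform(18, 70, n)

# Only the age part of the sales equation plus noise (assumption: all other
# regressors are left out to keep the illustration small).
sales_age_part = 180 * age - 2 * age**2 + rng.normal(0, 4000, n)

# np.polyfit returns coefficients from the highest power down: [b2, b1, b0].
b2, b1, b0 = np.polyfit(age, sales_age_part, deg=2)

# Age at which the fitted profile peaks: b1 + 2 * b2 * age = 0.
turning_point = -b1 / (2 * b2)
print(b1, b2, turning_point)
```

With the true coefficients 180 and -2, the peak of the age profile is at 180 / 4 = 45 years, so the estimate should land in that neighborhood.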


c) The next candidate for an omitted variable bias that the researcher explores is the employee's ability. Please now include ability as an additional control variable in the regression model and interpret your findings. When you compare the WFH coefficient from this regression with the result from regression 2, what would you conclude about the association between ability and the likelihood that someone is working from home?


**8 points**

In [None]:
# Insert your code here

'# Give verbal answer here
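
The comparison asked for in c) can be organized with the standard omitted-variable decomposition. The notation below is introduced only for this note and does not appear elsewhere in the exam: $\hat\beta^{(2)}_{WFH}$ and $\hat\beta^{(3)}_{WFH}$ are the WFH coefficients without and with ability, $\hat\gamma_{ability}$ is the ability coefficient in the longer regression, and $\hat\delta_{WFH}$ is the WFH coefficient from an auxiliary regression of ability on all regressors of regression 2.

\begin{equation}
\hat\beta^{(2)}_{WFH} - \hat\beta^{(3)}_{WFH} = \hat\gamma_{ability} \cdot \hat\delta_{WFH}
\end{equation}

If $\hat\gamma_{ability}$ is positive, the direction in which the WFH coefficient moves once ability is added therefore reveals the sign of $\hat\delta_{WFH}$, that is, whether employees working from home have higher or lower ability on average.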


d) The researcher now thinks that it is hard to keep track of all potential candidates for omitted variable bias, but she is confident that any important omitted variable is stable over time. Which method allows her to "control for" time-constant omitted variables? Please run the respective regression (you do not need to control for age) and interpret your findings.

**9 points**

In [None]:
# Insert your code here

'# Give verbal answer here
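
The notebook installs linearmodels, presumably so that a fixed-effects panel regression can be run for d). The sketch below deliberately swaps in a different but closely related technique, first differencing in plain numpy: with exactly two periods, regressing the change in sales on the change in WFH (plus an intercept) gives the same WFH slope as a regression with person fixed effects and a year dummy. The data are re-simulated from the sales equation, so the numbers will not match the exam data, and the names `simulate_sales`, `d_sales`, `d_wfh`, `year_effect` and `beta_wfh` are hypothetical.

```python
import numpy as np

# Re-simulated stand-in for the two-period panel (assumption: same equation
# as the background section, but different random draws than the setup cell).
rng = np.random.default_rng(10)
n = 8000
ability = rng.normal(100, 15, n)
self_control = rng.normal(100, 10, n)
age = rng.uniform(18, 70, n)
fulltime = rng.choice([0, 1], size=n, p=[0.23, 0.77])
z_sc = (self_control - self_control.mean()) / self_control.std()
z_ab = (ability - ability.mean()) / ability.std()
p_wfh = 1 / (1 + np.exp(-(z_sc - 0.2 * z_ab - 0.1 * fulltime)))
wfh_2 = rng.binomial(1, p_wfh)


def simulate_sales(year, wfh, age_now):
    """Sales equation from the background section, one period at a time."""
    return (10000 + 100 * ability + 100 * self_control + 2000 * year
            + 0 * wfh + 180 * age_now - 2 * age_now**2 + 2500 * fulltime
            + rng.normal(0, 4000, n))


sales_1 = simulate_sales(1, 0, age)
sales_2 = simulate_sales(2, wfh_2, age + 1)

# Differencing removes everything that is constant within a person
# (ability, self_control, fulltime in this simulation).
d_sales = sales_2 - sales_1
d_wfh = wfh_2 - 0  # nobody works from home in year 1

# OLS of the change in sales on a constant and the change in WFH.
# The constant picks up the year effect (plus the small average age change).
X = np.column_stack([np.ones(n), d_wfh])
(year_effect, beta_wfh), *_ = np.linalg.lstsq(X, d_sales, rcond=None)
print(year_effect, beta_wfh)
```

Because ability and self-control drop out of the differenced equation, the WFH estimate here should be close to the true value of 0, up to sampling noise.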

## Assignment 2 (30 points)
In this assignment, we are mainly interested in predicting whether a given employee decided to work from home (WFH=1) in year 2 based on her year 1 characteristics.

**a)** Please define $y$ to consist of the *WFH* column of the stored dataframe __df2__ and $X$ to consist of the columns *ability, self\_control, age and fulltime* of the stored dataframe __df1__ in the code cell below. Afterwards, please briefly state and explain (i) whether this is a regression or classification problem, and (ii) whether a linear or logistic regression is more appropriate for this type of problem.
<div style="text-align: right"> <b>4 points</b> </div>

In [None]:
# Insert your code here

'# Give verbal answer here

**b)** Before we start fitting models, please split the data ($X,y$) into a training set (*X\_train,y\_train*) consisting of 75% of the observations and a test set (*X\_test,y\_test*) consisting of the remaining 25%. In addition, please explain why splitting the data into a training set and test set is important for model assessment.
<div style="text-align: right"> <b>4 points</b> </div>

In [None]:
# Insert your code here

'# Give verbal answer here
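
A 75/25 split as described in b) can be produced with scikit-learn's `train_test_split`. The sketch below uses small random placeholder arrays instead of the exam data, and `random_state=0` is an arbitrary choice made here for reproducibility, not something the exam prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 200 observations with 4 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 2, size=200)

# test_size=0.25 keeps 75% of the rows for training and 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```

The test rows are never shown to the model during fitting, which is what makes the error measured on them an honest estimate of performance on new data.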

**c)** Based on your choice of the most appropriate regression method in a), please fit it to the training set and print the misclassification rate on both the training set and test set. Furthermore, please explain whether the difference between the two misclassification rates implies that the model is overfitting the data.
<div style="text-align: right"> <b>5 points</b> </div>

In [None]:
# Insert your code here

'# Give verbal answer here
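
Assuming logistic regression was the method chosen in a), the misclassification rates asked for in c) could be computed along the following lines. This is a sketch on re-simulated stand-ins for `X` and `y`, so the numbers will differ from the exam data. Standardizing the features in a pipeline is an addition made here to help the solver converge, scikit-learn's `LogisticRegression` applies an L2 penalty by default, and the names `logit`, `err_train` and `err_test` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Re-simulated year-1 features and year-2 WFH decision (assumption: same
# recipe as the setup cell, different random draws).
rng = np.random.default_rng(10)
n = 8000
ability = rng.normal(100, 15, n)
self_control = rng.normal(100, 10, n)
age = rng.uniform(18, 70, n)
fulltime = rng.choice([0, 1], size=n, p=[0.23, 0.77])
z_sc = (self_control - self_control.mean()) / self_control.std()
z_ab = (ability - ability.mean()) / ability.std()
p_wfh = 1 / (1 + np.exp(-(z_sc - 0.2 * z_ab - 0.1 * fulltime)))
y = rng.binomial(1, p_wfh)
X = np.column_stack([ability, self_control, age, fulltime])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Standardize, then fit a logistic regression on the training set only.
logit = make_pipeline(StandardScaler(), LogisticRegression())
logit.fit(X_train, y_train)

# score() returns accuracy, so 1 - accuracy is the misclassification rate.
err_train = 1 - logit.score(X_train, y_train)
err_test = 1 - logit.score(X_test, y_test)
print(err_train, err_test)
```

A test error that is only slightly above the training error suggests little overfitting, whereas a large gap would point the other way.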

**d)** To evaluate whether the model you fitted and evaluated in c) performs well, you decide to compare it to a so-called "Dummy classifier", that is, a classification method that always predicts the majority class. Please complete the code cell below in order to print the misclassification rate on both the training set and test set of the Dummy classifier. Compare (in words) its performance to the model you fitted and evaluated in c).
<div style="text-align: right"> <b>4 points</b> </div>

In [None]:
# Complete the code here
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier().fit(X_train, y_train)




'# Give verbal answer here
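
How the misclassification rate of a majority-class baseline comes about can be seen on a tiny hand-made example. The labels below are invented for illustration and are not the exam data. The sketch passes `strategy="most_frequent"` explicitly so that the behavior does not depend on the default of the installed scikit-learn version.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented labels (assumption): 45 of 75 training labels are 0, so the
# baseline always predicts 0. Features are ignored by the dummy classifier.
X_train = np.zeros((75, 1))
y_train = np.array([0] * 45 + [1] * 30)
X_test = np.zeros((25, 1))
y_test = np.array([0] * 14 + [1] * 11)

dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Misclassification rate = 1 - accuracy = share of labels that are not 0.
err_train = 1 - dummy.score(X_train, y_train)
err_test = 1 - dummy.score(X_test, y_test)
print(err_train, err_test)
```

The baseline's error is simply the share of observations outside the training majority class, which gives a floor that any useful model should beat.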

**e)** We now wish to investigate whether more complex methods are able to predict the decision to work from home even better. In order to do this, please fit an unrestricted random forest classifier to the training data and print the misclassification rate on both the training and test set. Afterwards, please compare the misclassification rate of the random forest to that of the model you fitted in c). Which one performs better? Furthermore, explain whether the difference in the misclassification rate between the training and test set of the random forest indicates that it is overfitting the data. Finally, please explain how the performance of the random forest could be improved given additional time and resources.
<div style="text-align: right"> <b>8 points</b> </div>

In [None]:
# Insert your code here

'# Give verbal answer here
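
A sketch of the fit in e), under the reading that "unrestricted" means scikit-learn's defaults (fully grown trees, `max_depth=None`). The data are again a re-simulated stand-in, so the numbers will differ from the exam data, and `random_state=0` as well as the names `forest`, `err_train` and `err_test` are choices made for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Re-simulated year-1 features and year-2 WFH decision (assumption: same
# recipe as the setup cell, different random draws).
rng = np.random.default_rng(10)
n = 8000
ability = rng.normal(100, 15, n)
self_control = rng.normal(100, 10, n)
age = rng.uniform(18, 70, n)
fulltime = rng.choice([0, 1], size=n, p=[0.23, 0.77])
z_sc = (self_control - self_control.mean()) / self_control.std()
z_ab = (ability - ability.mean()) / ability.std()
p_wfh = 1 / (1 + np.exp(-(z_sc - 0.2 * z_ab - 0.1 * fulltime)))
y = rng.binomial(1, p_wfh)
X = np.column_stack([ability, self_control, age, fulltime])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Default settings: 100 trees, each grown until its leaves are pure.
forest = RandomForestClassifier(random_state=0).fit(X_train, y_train)

err_train = 1 - forest.score(X_train, y_train)
err_test = 1 - forest.score(X_test, y_test)
print(err_train, err_test)
```

Fully grown trees can memorize the training set, so a training error near zero combined with a much higher test error is the typical overfitting pattern. Restricting the trees (for example via `max_depth` or `min_samples_leaf`) and choosing those settings by cross-validation is one common way to look for a better forest.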

**f)** Print out the feature importances of all features of the random forest classifier you fitted in e). Which feature is most important for predicting the decision to work from home? Provide a potential reason for this.
<div style="text-align: right"> <b>5 points</b> </div>

In [None]:
# Insert your code here

'# Give verbal answer here
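
Feature importances of a fitted scikit-learn forest are available in its `feature_importances_` attribute. The sketch below prints them for a forest fitted on a re-simulated stand-in for the exam data, so the numbers will differ from the exam data. The names `forest`, `feature_names` and `importances` are hypothetical, and the forest uses default settings rather than any tuned configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Re-simulated year-1 features and year-2 WFH decision (assumption: same
# recipe as the setup cell, different random draws).
rng = np.random.default_rng(10)
n = 8000
ability = rng.normal(100, 15, n)
self_control = rng.normal(100, 10, n)
age = rng.uniform(18, 70, n)
fulltime = rng.choice([0, 1], size=n, p=[0.23, 0.77])
z_sc = (self_control - self_control.mean()) / self_control.std()
z_ab = (ability - ability.mean()) / ability.std()
p_wfh = 1 / (1 + np.exp(-(z_sc - 0.2 * z_ab - 0.1 * fulltime)))
y = rng.binomial(1, p_wfh)

feature_names = ["ability", "self_control", "age", "fulltime"]
X = np.column_stack([ability, self_control, age, fulltime])

forest = RandomForestClassifier(random_state=0).fit(X, y)

# Impurity-based importances, one per feature, normalized to sum to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

In the simulation code, self-control enters the WFH probability with the largest weight, so one would expect it to rank high. Impurity-based importances tend to favor continuous features over binary ones, though, so the ranking should be read with some caution.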