# Problem Set 5: Multivariate Regressions 2: Categorical Variables and Interaction Terms

### Summary and Motivation
This problem set continues our practice with multivariate regression. It focuses on deepening students' understanding by incorporating categorical variables, dummy variables, and interaction terms. Using the NLSY97 dataset, students will learn to handle complex data structures and interpret the effects of categorical variables and their interactions on the dependent variable. The problem set provides practical experience in Python and reinforces the theoretical concepts learned in class.


### Instructions
The dataset, data_NLSY97v2, is sourced from the NLSY97, a nationally representative database of individuals born in the early 1980s in the United States. Below are some of the variables we will use:
- educ: number of years of education completed
- Annual_Income: annual income that these individuals earned as adults
- TotalWeeksExp: Total weeks of work experience of the individual
- gender: denotes the gender of the individual
- minority: 1 if the individual belongs to a minority group, 0 otherwise.
- m_college: 1 if the individual’s mother has a college degree, 0 otherwise.
- family income: Annual family income when these individuals were teenagers, reported in thousands.
- GPA grade 8: GPA in 8th grade.
- School Retention: 1 if the individual was required to repeat a grade during middle school, 0 otherwise.

Columns of scores from the AFQT test. The values are measured in standard deviations from the mean. A value of '0' means the individual earned the average score. A value of '-1' means the individual earned a score one standard deviation below the average.
- asvabAR: skills in arithmetic reasoning
- asvabMK: skills in mathematical knowledge
- asvabPC: skills in paragraph comprehension
- asvabWK: skills in word knowledge
- AFQT: average of the above 4 scores

Please follow the questions and instructions below to complete this problem set. For some questions, please write and execute Python code for data analysis in Cell mode. Comment your code to explain each step. Some questions need text discussion. Please provide a detailed discussion of your results, including interpretations and answers to questions in Raw mode.

Once you have completed the assignment, save your Jupyter notebook with the following naming convention: ECN310_ProblemSetX_LastName_FirstName.ipynb (replace X with the assignment number).

### Exercise 1: Load the Dataset and Libraries, Data Cleaning and Engineering

1. Load the needed libraries and the dataset data_NLSY97v2.xlsx, check the info about the dataframe, and get descriptive statistics.

In [1]:
# Write your code here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_excel("data_NLSY97v2.xlsx", index_col=0)
df

PUBID - YTH ID CODE 1997,educ,GPA grade 8,School Retention,Annual_Income,TotalWeeksExp,Black,Hispanic,White,m_college,minority,gender,asvabAR,asvabMK,asvabPC,asvabWK,AFQT
1,16.0,3.0,0.0,,,0.0,0.0,1.0,0.0,0.0,female,0.066,0.707,-0.507,-0.772,-0.12650
2,14.0,3.5,0.0,115000.0,965.0,0.0,1.0,0.0,0.0,1.0,male,-0.238,0.259,1.080,-0.059,0.26050
3,14.0,3.0,0.0,,776.0,0.0,1.0,0.0,0.0,1.0,female,-1.009,-0.415,0.299,-0.703,-0.45700
4,12.0,4.0,0.0,45000.0,1008.0,0.0,1.0,0.0,0.0,1.0,female,-0.598,0.646,-0.236,-0.542,-0.18250
5,12.0,2.5,0.0,150000.0,890.0,0.0,1.0,0.0,0.0,1.0,male,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9018,8.0,2.0,0.0,90000.0,942.0,0.0,0.0,1.0,0.0,0.0,female,-0.281,-1.036,-1.826,-0.981,-1.03100
9019,13.0,1.5,1.0,43000.0,824.0,0.0,1.0,0.0,0.0,1.0,male,-0.538,-0.254,-0.259,-1.153,-0.55100
9020,14.0,3.5,0.0,,494.0,0.0,0.0,1.0,0.0,0.0,male,1.434,1.153,1.683,1.096,1.34150
9021,13.0,1.5,0.0,47000.0,965.0,0.0,0.0,1.0,0.0,0.0,male,0.301,-0.392,0.139,0.609,0.16425
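The question also asks for the dataframe info and descriptive statistics. A minimal sketch of those two calls follows; it uses a tiny made-up frame named `demo` (an assumption for illustration, not the NLSY97 data) so that it runs on its own. On the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with made-up values (hypothetical, not the NLSY97 data)
demo = pd.DataFrame({
    "educ": [16.0, 14.0, 12.0, np.nan],
    "Annual_Income": [np.nan, 115000.0, 45000.0, 150000.0],
    "gender": ["female", "male", "female", "male"],
})

demo.info()                  # dtypes and non-null counts for each column
summary = demo.describe()    # count, mean, std, min, quartiles, max (numeric columns only)
missing = demo.isna().sum()  # number of missing values per column

print(summary)
print(missing)
```

The non-null counts matter later: a regression silently drops any row with a missing value in one of its variables, which is why the number of observations changes from model to model.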


2. Add a column `college` to the data set that is an indicator variable that takes on a value of 1 if students have more than 13 years of education.  _Note_: This must be correct for the next question.  Please post in the discussion if you have trouble creating this variable.




In [2]:
# Please write your executable code here
def college(years):
    return (1 if years > 13 else 0)

# takes value 1 if the individual completed more than 13 years of education, 0 otherwise
df['college'] = df['educ'].apply(college)
df

PUBID - YTH ID CODE 1997,educ,GPA grade 8,School Retention,Annual_Income,TotalWeeksExp,Black,Hispanic,White,m_college,minority,gender,asvabAR,asvabMK,asvabPC,asvabWK,AFQT,college
1,16.0,3.0,0.0,,,0.0,0.0,1.0,0.0,0.0,female,0.066,0.707,-0.507,-0.772,-0.12650,1
2,14.0,3.5,0.0,115000.0,965.0,0.0,1.0,0.0,0.0,1.0,male,-0.238,0.259,1.080,-0.059,0.26050,1
3,14.0,3.0,0.0,,776.0,0.0,1.0,0.0,0.0,1.0,female,-1.009,-0.415,0.299,-0.703,-0.45700,1
4,12.0,4.0,0.0,45000.0,1008.0,0.0,1.0,0.0,0.0,1.0,female,-0.598,0.646,-0.236,-0.542,-0.18250,0
5,12.0,2.5,0.0,150000.0,890.0,0.0,1.0,0.0,0.0,1.0,male,,,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9018,8.0,2.0,0.0,90000.0,942.0,0.0,0.0,1.0,0.0,0.0,female,-0.281,-1.036,-1.826,-0.981,-1.03100,0
9019,13.0,1.5,1.0,43000.0,824.0,0.0,1.0,0.0,0.0,1.0,male,-0.538,-0.254,-0.259,-1.153,-0.55100,0
9020,14.0,3.5,0.0,,494.0,0.0,0.0,1.0,0.0,0.0,male,1.434,1.153,1.683,1.096,1.34150,1
9021,13.0,1.5,0.0,47000.0,965.0,0.0,0.0,1.0,0.0,0.0,male,0.301,-0.392,0.139,0.609,0.16425,0
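The same indicator can be built without a helper function. The sketch below, on a small made-up series, shows a vectorized version and a variant that keeps missing education as missing; note that both the `apply` approach and the plain comparison code a missing `educ` as 0, which may or may not be what is wanted.

```python
import numpy as np
import pandas as pd

# Made-up education values, including one missing entry
educ = pd.Series([16.0, 13.0, 14.0, np.nan], name="educ")

# Vectorized comparison: NaN > 13 evaluates to False, so missing becomes 0
college = (educ > 13).astype(int)

# Variant that leaves the indicator missing when education is missing
college_keep_na = (educ > 13).astype(float).where(educ.notna())

print(college.tolist())
print(college_keep_na.tolist())
```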


3. Later in the problem set, we will take the log of experience. Since we cannot take the log of 0, we add one week to every worker's experience, so that workers who report 0 weeks still have a positive value.
Create another variable called `exper` that transforms the variable TotalWeeksExp into years of experience by adding 1 to TotalWeeksExp and dividing by 52.14.

Hint: `df['exper'] = (df['TotalWeeksExp'] + 1)/52.14` 

In [3]:
# Please write your executable code here
df['exper'] = (df['TotalWeeksExp'] + 1)/52.14
df

PUBID - YTH ID CODE 1997,educ,GPA grade 8,School Retention,Annual_Income,TotalWeeksExp,Black,Hispanic,White,m_college,minority,gender,asvabAR,asvabMK,asvabPC,asvabWK,AFQT,college,exper
1,16.0,3.0,0.0,,,0.0,0.0,1.0,0.0,0.0,female,0.066,0.707,-0.507,-0.772,-0.12650,1,
2,14.0,3.5,0.0,115000.0,965.0,0.0,1.0,0.0,0.0,1.0,male,-0.238,0.259,1.080,-0.059,0.26050,1,18.527043
3,14.0,3.0,0.0,,776.0,0.0,1.0,0.0,0.0,1.0,female,-1.009,-0.415,0.299,-0.703,-0.45700,1,14.902186
4,12.0,4.0,0.0,45000.0,1008.0,0.0,1.0,0.0,0.0,1.0,female,-0.598,0.646,-0.236,-0.542,-0.18250,0,19.351745
5,12.0,2.5,0.0,150000.0,890.0,0.0,1.0,0.0,0.0,1.0,male,,,,,,0,17.088608
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9018,8.0,2.0,0.0,90000.0,942.0,0.0,0.0,1.0,0.0,0.0,female,-0.281,-1.036,-1.826,-0.981,-1.03100,0,18.085923
9019,13.0,1.5,1.0,43000.0,824.0,0.0,1.0,0.0,0.0,1.0,male,-0.538,-0.254,-0.259,-1.153,-0.55100,0,15.822785
9020,14.0,3.5,0.0,,494.0,0.0,0.0,1.0,0.0,0.0,male,1.434,1.153,1.683,1.096,1.34150,1,9.493671
9021,13.0,1.5,0.0,47000.0,965.0,0.0,0.0,1.0,0.0,0.0,male,0.301,-0.392,0.139,0.609,0.16425,0,18.527043
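To see why the shift by one week is needed, the sketch below compares the log of unshifted and shifted experience on three made-up values of weeks worked. The numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up weeks of experience, including a worker with zero weeks
weeks = pd.Series([0.0, 51.14, 965.0])

# Without the shift, the zero-experience worker gets log(0) = -inf
with np.errstate(divide="ignore"):
    raw_log = np.log(weeks / 52.14)

# With the shift used in this problem set, every value is finite
exper = (weeks + 1) / 52.14
log_exper = np.log(exper)

print(raw_log.tolist())
print(log_exper.tolist())
```

An infinite value in a regressor would make the regression fail or drop the row, so the shift keeps zero-experience workers in the sample.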


4. Create a new variable in the dataframe called `female` that takes on a value of 1 if the individual is female and 0 otherwise.

In [4]:
# write your code here
df['female'] = df['gender'].map({'female': 1, 'male': 0})
df

PUBID - YTH ID CODE 1997,educ,GPA grade 8,School Retention,Annual_Income,TotalWeeksExp,Black,Hispanic,White,m_college,minority,gender,asvabAR,asvabMK,asvabPC,asvabWK,AFQT,college,exper,female
1,16.0,3.0,0.0,,,0.0,0.0,1.0,0.0,0.0,female,0.066,0.707,-0.507,-0.772,-0.12650,1,,1
2,14.0,3.5,0.0,115000.0,965.0,0.0,1.0,0.0,0.0,1.0,male,-0.238,0.259,1.080,-0.059,0.26050,1,18.527043,0
3,14.0,3.0,0.0,,776.0,0.0,1.0,0.0,0.0,1.0,female,-1.009,-0.415,0.299,-0.703,-0.45700,1,14.902186,1
4,12.0,4.0,0.0,45000.0,1008.0,0.0,1.0,0.0,0.0,1.0,female,-0.598,0.646,-0.236,-0.542,-0.18250,0,19.351745,1
5,12.0,2.5,0.0,150000.0,890.0,0.0,1.0,0.0,0.0,1.0,male,,,,,,0,17.088608,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9018,8.0,2.0,0.0,90000.0,942.0,0.0,0.0,1.0,0.0,0.0,female,-0.281,-1.036,-1.826,-0.981,-1.03100,0,18.085923,1
9019,13.0,1.5,1.0,43000.0,824.0,0.0,1.0,0.0,0.0,1.0,male,-0.538,-0.254,-0.259,-1.153,-0.55100,0,15.822785,0
9020,14.0,3.5,0.0,,494.0,0.0,0.0,1.0,0.0,0.0,male,1.434,1.153,1.683,1.096,1.34150,1,9.493671,0
9021,13.0,1.5,0.0,47000.0,965.0,0.0,0.0,1.0,0.0,0.0,male,0.301,-0.392,0.139,0.609,0.16425,0,18.527043,0


### Exercise 2: A linear probability model
In earlier problem sets, we have seen that years of education is a statistically significant determinant of annual income. We can also use the tools of multivariate regression to make inferences about the probability that an individual attends college.

1. Using statsmodels formula notation, perform a univariate regression with `college` as the dependent variable and `female` as the independent variable. Print the results and interpret the constant and the coefficient on `female`.

In [5]:
# Write your executable code here

# C() marks female as a categorical variable. Because female is already coded 0/1, the model produces the same estimates without C().

results = smf.ols("college ~ C(female)", data=df).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                college   R-squared:                       0.009
Model:                            OLS   Adj. R-squared:                  0.009
Method:                 Least Squares   F-statistic:                     84.41
Date:                Tue, 03 Dec 2024   Prob (F-statistic):           4.92e-20
Time:                        22:21:11   Log-Likelihood:                -6121.9
No. Observations:                8984   AIC:                         1.225e+04
Df Residuals:                    8982   BIC:                         1.226e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          0.3166      0.007     44.
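In a linear probability model with a single 0/1 regressor, the intercept is the share of the base group (here, men) that attended college, and the coefficient on the dummy is the difference in shares between the two groups. The sketch below checks this on simulated data; the attendance rates (30% for men, 40% for women) are made up and are not estimates from the NLSY97.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000

# Simulated data with hypothetical attendance rates: 30% men, 40% women
demo = pd.DataFrame({"female": rng.integers(0, 2, n)})
demo["college"] = (rng.random(n) < 0.30 + 0.10 * demo["female"]).astype(int)

fit = smf.ols("college ~ female", data=demo).fit()
group_means = demo.groupby("female")["college"].mean()

# Intercept equals the male share; the slope equals the female-male gap
print(fit.params)
print(group_means)
```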

2. Now include the indicator variables `minority` and `m_college` in your regression. Your regression should have `college` as the dependent variable and `female`, `minority`, and `m_college` as independent variables. Run the regression and interpret the coefficients.

In your interpretation, answer the following: what does the model suggest about the probability that a minority male whose mother attended college has also attended college?

In [6]:
# Write your code here
results2 = smf.ols("college ~ C(female) + C(minority) + C(m_college)", data=df).fit()
print(results2.summary())

                            OLS Regression Results                            
Dep. Variable:                college   R-squared:                       0.108
Model:                            OLS   Adj. R-squared:                  0.108
Method:                 Least Squares   F-statistic:                     324.3
Date:                Tue, 03 Dec 2024   Prob (F-statistic):          9.09e-199
Time:                        22:21:11   Log-Likelihood:                -5054.9
No. Observations:                8014   AIC:                         1.012e+04
Df Residuals:                    8010   BIC:                         1.015e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept               0.2956    
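The predicted probability for a specific profile is the intercept plus the coefficients of the dummies that equal 1 for that profile. For a minority male whose mother attended college, that is the intercept plus the `minority` and `m_college` coefficients (the `female` term drops out). The sketch below illustrates the arithmetic on simulated data with made-up effect sizes, and checks it against `predict`. It enters the dummies without `C()`, so the coefficient names differ from those in the summary above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 3000

# Simulated 0/1 regressors and a hypothetical attendance probability
demo = pd.DataFrame({
    "female": rng.integers(0, 2, n),
    "minority": rng.integers(0, 2, n),
    "m_college": rng.integers(0, 2, n),
})
p = 0.30 + 0.08 * demo["female"] - 0.05 * demo["minority"] + 0.30 * demo["m_college"]
demo["college"] = (rng.random(n) < p).astype(int)

fit = smf.ols("college ~ female + minority + m_college", data=demo).fit()
b = fit.params

# Minority male (female = 0) whose mother attended college
by_hand = b["Intercept"] + b["minority"] + b["m_college"]
profile = pd.DataFrame({"female": [0], "minority": [1], "m_college": [1]})
by_predict = fit.predict(profile).iloc[0]

print(by_hand, by_predict)
```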

3. What happens when you add the variable `White` to the regression? Do you think it makes sense to include this variable? Why or why not?

In [7]:
# Write code here (optional) you can answer the question without code.

results_ = smf.ols("college ~ C(female) + C(minority) + C(m_college) + C(White)", data=df).fit()
print(results_.summary())

                            OLS Regression Results                            
Dep. Variable:                college   R-squared:                       0.108
Model:                            OLS   Adj. R-squared:                  0.108
Method:                 Least Squares   F-statistic:                     242.6
Date:                Tue, 03 Dec 2024   Prob (F-statistic):          5.19e-197
Time:                        22:21:11   Log-Likelihood:                -5056.0
No. Observations:                8014   AIC:                         1.012e+04
Df Residuals:                    8009   BIC:                         1.016e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept            1.222e+11   7
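An intercept on the order of 1e+11 in a model whose dependent variable is 0 or 1 is a typical symptom of perfect multicollinearity (the dummy variable trap). The sketch below assumes, as the displayed rows suggest but do not prove, that `White` is the exact complement of `minority`. Under that assumption the constant, `minority`, and `White` columns are linearly dependent, so the design matrix loses a rank and the individual coefficients are not identified.

```python
import numpy as np
import pandas as pd

# Made-up indicator; the complement relationship is an assumption
minority = pd.Series([0, 1, 1, 0, 1, 0])
white = 1 - minority

# Design matrix: constant, minority, White
X = np.column_stack([np.ones(len(minority)), minority, white])

# constant = minority + White, so one column is redundant
rank = np.linalg.matrix_rank(X)
print(f"columns: {X.shape[1]}, rank: {rank}")
```

With a rank-deficient design, statsmodels still returns numbers, but they are arbitrary and can be enormous; dropping one of the redundant dummies restores a well-defined model.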

### Exercise 3: Multivariate Regression Analysis

#### Linear versus logged dependent variable

1. Using statsmodels formula notation, perform a multivariate regression analysis with `Annual_Income` as the dependent variable and  `educ`, `exper`, `female`, and `AFQT` as the independent variables. A constant should be included by default.

Interpret the coefficients and comment on the significance of the coefficients in the space below.

Write your code for fitting the model and print the summary of the regression model. 

In [8]:
# Please write your executable code here
results3 = smf.ols("Annual_Income ~ educ + exper + AFQT + C(female)", data=df).fit()
print(results3.summary())

                            OLS Regression Results                            
Dep. Variable:          Annual_Income   R-squared:                       0.186
Model:                            OLS   Adj. R-squared:                  0.185
Method:                 Least Squares   F-statistic:                     217.3
Date:                Tue, 03 Dec 2024   Prob (F-statistic):          3.83e-168
Time:                        22:21:11   Log-Likelihood:                -47083.
No. Observations:                3812   AIC:                         9.418e+04
Df Residuals:                    3807   BIC:                         9.421e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept      -1.873e+04   6711.088     -2.

Provide your comments here

The coefficient on the intercept is -18,730, which means the predicted annual income when all other variables are zero is -$18,730. It is statistically significant (p = 0.005).

The coefficient on female is -25,130, which means being female is associated with earning $25,130 less annually than males, holding other factors constant. It is statistically significant (p < 0.001).

The coefficient on education is 5,481, which means each additional year of education is associated with earning $5,481 more annually, holding other factors constant. It is statistically significant (p < 0.001).

The coefficient on experience is 1,436, which means each additional year of work experience is associated with earning $1,436 more annually, holding other factors constant. It is statistically significant (p < 0.001).

The coefficient on AFQT is 13,170, which means a one-unit increase in the AFQT score (one standard deviation, since the scores are standardized) is associated with earning $13,170 more annually, holding other factors constant. It is statistically significant (p < 0.001).

#### Logged variables

2a. Run the regression with `log(Annual_Income)` as the dependent variable and the same independent variables, that is `educ`, `exper`, `female`, and `AFQT`. Interpret each of the coefficients and note their significance.

In [9]:
# write your code here
# Use a new name so the linear-income model in results3 is not overwritten
results3_log = smf.ols("np.log(Annual_Income) ~ educ + exper + AFQT + C(female)", data=df).fit()
print(results3_log.summary())

                              OLS Regression Results                             
Dep. Variable:     np.log(Annual_Income)   R-squared:                       0.244
Model:                               OLS   Adj. R-squared:                  0.243
Method:                    Least Squares   F-statistic:                     307.2
Date:                   Tue, 03 Dec 2024   Prob (F-statistic):          2.49e-229
Time:                           22:21:11   Log-Likelihood:                -4831.4
No. Observations:                   3812   AIC:                             9673.
Df Residuals:                       3807   BIC:                             9704.
Df Model:                              4                                         
Covariance Type:               nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept     
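With a logged dependent variable, a coefficient b on a regressor in levels is read as an approximate 100·b percent change in income per one-unit change in the regressor. The approximation is good for small b; for larger coefficients (a dummy such as `female` is a common case) the exact conversion is 100·(exp(b) − 1). The sketch below applies both to two hypothetical coefficient values, which are placeholders and not numbers from the truncated output above.

```python
import math

def pct_change_from_log_coef(beta: float) -> float:
    """Exact percent change in y implied by a coefficient in a log(y) regression."""
    return 100.0 * (math.exp(beta) - 1.0)

# Hypothetical coefficients for illustration only
for name, beta in [("small coefficient", 0.08), ("large negative dummy", -0.40)]:
    approx = 100.0 * beta
    exact = pct_change_from_log_coef(beta)
    print(f"{name}: approx {approx:.1f}%, exact {exact:.1f}%")
```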

2b. Run the regression again, but this time also take the log of `exper`. That is, use `educ`, `log(exper)`, `female`, and `AFQT` as independent variables and `log(Annual_Income)` as the dependent variable.

Interpret the coefficient of `log(exper)`. Compare the $R^2$ values of the model with logged experience and the model with linear experience. Which do you think is the better choice?

In [10]:
# write your code here
results4 = smf.ols("np.log(Annual_Income) ~ educ + np.log(exper) + AFQT + C(female)", data=df).fit()
print(results4.summary())

                              OLS Regression Results                             
Dep. Variable:     np.log(Annual_Income)   R-squared:                       0.201
Model:                               OLS   Adj. R-squared:                  0.200
Method:                    Least Squares   F-statistic:                     239.8
Date:                   Tue, 03 Dec 2024   Prob (F-statistic):          6.35e-184
Time:                           22:21:11   Log-Likelihood:                -4936.3
No. Observations:                   3812   AIC:                             9883.
Df Residuals:                       3807   BIC:                             9914.
Df Model:                              4                                         
Covariance Type:               nonrobust                                         
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept     
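When both the dependent variable and a regressor are logged, the coefficient is an elasticity. Writing the model from 2b with generic coefficient symbols (the $\beta$ labels are notation introduced here, not values from the output):

$$
\log(\mathrm{income}) = \beta_0 + \beta_1\,\mathrm{educ} + \beta_2 \log(\mathrm{exper}) + \beta_3\,\mathrm{AFQT} + \beta_4\,\mathrm{female} + u
$$

$$
\beta_2 = \frac{\partial \log(\mathrm{income})}{\partial \log(\mathrm{exper})} \approx \frac{\%\Delta\,\mathrm{income}}{\%\Delta\,\mathrm{exper}}
$$

So a 1% increase in experience is associated with roughly a $\beta_2$% change in annual income, holding the other variables fixed. Because the two experience specifications share the same dependent variable and the same number of observations, their $R^2$ values can be compared directly.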

#### Interaction Terms:

4a. Create a new variable, `female_educ`, that is equal to the column `female` multiplied by the column `educ`. This will allow us to add an interaction term to our model.

In [11]:
# Write your executable code here
df['female_educ'] = df['female'] * df['educ']
df

PUBID - YTH ID CODE 1997,educ,GPA grade 8,School Retention,Annual_Income,TotalWeeksExp,Black,Hispanic,White,m_college,minority,gender,asvabAR,asvabMK,asvabPC,asvabWK,AFQT,college,exper,female,female_educ
1,16.0,3.0,0.0,,,0.0,0.0,1.0,0.0,0.0,female,0.066,0.707,-0.507,-0.772,-0.12650,1,,1,16.0
2,14.0,3.5,0.0,115000.0,965.0,0.0,1.0,0.0,0.0,1.0,male,-0.238,0.259,1.080,-0.059,0.26050,1,18.527043,0,0.0
3,14.0,3.0,0.0,,776.0,0.0,1.0,0.0,0.0,1.0,female,-1.009,-0.415,0.299,-0.703,-0.45700,1,14.902186,1,14.0
4,12.0,4.0,0.0,45000.0,1008.0,0.0,1.0,0.0,0.0,1.0,female,-0.598,0.646,-0.236,-0.542,-0.18250,0,19.351745,1,12.0
5,12.0,2.5,0.0,150000.0,890.0,0.0,1.0,0.0,0.0,1.0,male,,,,,,0,17.088608,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9018,8.0,2.0,0.0,90000.0,942.0,0.0,0.0,1.0,0.0,0.0,female,-0.281,-1.036,-1.826,-0.981,-1.03100,0,18.085923,1,8.0
9019,13.0,1.5,1.0,43000.0,824.0,0.0,1.0,0.0,0.0,1.0,male,-0.538,-0.254,-0.259,-1.153,-0.55100,0,15.822785,0,0.0
9020,14.0,3.5,0.0,,494.0,0.0,0.0,1.0,0.0,0.0,male,1.434,1.153,1.683,1.096,1.34150,1,9.493671,0,0.0
9021,13.0,1.5,0.0,47000.0,965.0,0.0,0.0,1.0,0.0,0.0,male,0.301,-0.392,0.139,0.609,0.16425,0,18.527043,0,0.0


4b. Return to the linear form of `exper` and add `female_educ` to the regression with logged annual income. Comment on the significance and interpretation of the coefficient on `female_educ`.

Your dependent variable is `log(Annual_Income)` and your independent variables should be `educ`, `exper`, `AFQT`, `female`, `minority`, and `female_educ`. Interpret the coefficients on `educ`, `female`, and `female_educ`. _Note:_ significance is important.

In [12]:
# Please write your executable code here
results5 = smf.ols("np.log(Annual_Income) ~ educ + exper + AFQT + C(female) + C(minority) + female_educ", data=df).fit()
print(results5.summary())

                              OLS Regression Results                             
Dep. Variable:     np.log(Annual_Income)   R-squared:                       0.245
Model:                               OLS   Adj. R-squared:                  0.244
Method:                    Least Squares   F-statistic:                     198.5
Date:                   Tue, 03 Dec 2024   Prob (F-statistic):          9.24e-220
Time:                           22:21:12   Log-Likelihood:                -4652.9
No. Observations:                   3680   AIC:                             9320.
Df Residuals:                       3673   BIC:                             9363.
Df Model:                              6                                         
Covariance Type:               nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Interc
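With the interaction term in the model, the coefficient on `educ` is the return to a year of education for men, and the return for women is the sum of the `educ` and `female_educ` coefficients. The sketch below illustrates this on simulated data with made-up coefficients, and also shows that the hand-built product gives the same estimate as the formula shorthand `educ * female` (which expands to both main effects plus `educ:female`). The column name `log_income` is a hypothetical stand-in for `np.log(Annual_Income)`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 500

# Simulated data with hypothetical coefficients
demo = pd.DataFrame({
    "educ": rng.integers(8, 21, n).astype(float),
    "female": rng.integers(0, 2, n),
})
demo["log_income"] = (
    9.0 + 0.08 * demo["educ"] - 0.60 * demo["female"]
    + 0.02 * demo["educ"] * demo["female"] + rng.normal(0, 0.5, n)
)

# Interaction built by hand, as in this problem set
demo["female_educ"] = demo["female"] * demo["educ"]
manual = smf.ols("log_income ~ educ + female + female_educ", data=demo).fit()

# Same model using the formula interaction operator
formula = smf.ols("log_income ~ educ * female", data=demo).fit()

return_men = manual.params["educ"]
return_women = manual.params["educ"] + manual.params["female_educ"]
print(return_men, return_women, formula.params["educ:female"])
```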

4c. Create a new variable, `minority_educ`, that is the column `minority` multiplied by the column `educ`.

In [13]:
# Write your executable code here
df['minority_educ'] = df['minority'] * df['educ']
df

PUBID - YTH ID CODE 1997,educ,GPA grade 8,School Retention,Annual_Income,TotalWeeksExp,Black,Hispanic,White,m_college,minority,...,asvabAR,asvabMK,asvabPC,asvabWK,AFQT,college,exper,female,female_educ,minority_educ
1,16.0,3.0,0.0,,,0.0,0.0,1.0,0.0,0.0,...,0.066,0.707,-0.507,-0.772,-0.12650,1,,1,16.0,0.0
2,14.0,3.5,0.0,115000.0,965.0,0.0,1.0,0.0,0.0,1.0,...,-0.238,0.259,1.080,-0.059,0.26050,1,18.527043,0,0.0,14.0
3,14.0,3.0,0.0,,776.0,0.0,1.0,0.0,0.0,1.0,...,-1.009,-0.415,0.299,-0.703,-0.45700,1,14.902186,1,14.0,14.0
4,12.0,4.0,0.0,45000.0,1008.0,0.0,1.0,0.0,0.0,1.0,...,-0.598,0.646,-0.236,-0.542,-0.18250,0,19.351745,1,12.0,12.0
5,12.0,2.5,0.0,150000.0,890.0,0.0,1.0,0.0,0.0,1.0,...,,,,,,0,17.088608,0,0.0,12.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9018,8.0,2.0,0.0,90000.0,942.0,0.0,0.0,1.0,0.0,0.0,...,-0.281,-1.036,-1.826,-0.981,-1.03100,0,18.085923,1,8.0,0.0
9019,13.0,1.5,1.0,43000.0,824.0,0.0,1.0,0.0,0.0,1.0,...,-0.538,-0.254,-0.259,-1.153,-0.55100,0,15.822785,0,0.0,13.0
9020,14.0,3.5,0.0,,494.0,0.0,0.0,1.0,0.0,0.0,...,1.434,1.153,1.683,1.096,1.34150,1,9.493671,0,0.0,0.0
9021,13.0,1.5,0.0,47000.0,965.0,0.0,0.0,1.0,0.0,0.0,...,0.301,-0.392,0.139,0.609,0.16425,0,18.527043,0,0.0,0.0


4d. Add `minority_educ` to the regression. Your independent variables should be `educ`, `exper`, `AFQT`, `female`, `minority`, `female_educ`, and `minority_educ`. Your dependent variable is `log(Annual_Income)`.


Print the results and interpret the coefficients on `minority` and `minority_educ`. What do the coefficients suggest about the return to education for minorities?


In [14]:
# Write your code here
results6 = smf.ols("np.log(Annual_Income) ~ educ + exper + AFQT + C(female) + C(minority) + female_educ + minority_educ", data=df).fit()
print(results6.summary())

                              OLS Regression Results                             
Dep. Variable:     np.log(Annual_Income)   R-squared:                       0.246
Model:                               OLS   Adj. R-squared:                  0.245
Method:                    Least Squares   F-statistic:                     171.2
Date:                   Tue, 03 Dec 2024   Prob (F-statistic):          7.90e-220
Time:                           22:21:12   Log-Likelihood:                -4650.0
No. Observations:                   3680   AIC:                             9316.
Df Residuals:                       3672   BIC:                             9366.
Df Model:                              7                                         
Covariance Type:               nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Interc
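With both interaction terms included, the return to education differs by group. Using $\beta_{e}$, $\beta_{fe}$, and $\beta_{me}$ as shorthand (introduced here) for the coefficients on `educ`, `female_educ`, and `minority_educ`, the marginal effect of a year of education on log income is:

$$
\frac{\partial \log(\mathrm{income})}{\partial\,\mathrm{educ}} = \beta_{e} + \beta_{fe}\,\mathrm{female} + \beta_{me}\,\mathrm{minority}
$$

For a non-minority man the return is $\beta_{e}$; for a minority man it is $\beta_{e} + \beta_{me}$; for a minority woman it is $\beta_{e} + \beta_{fe} + \beta_{me}$. The coefficient on `minority` by itself is the income gap at zero years of education, so the gap at a realistic education level combines it with $\beta_{me}$ times that level. Whether the return really differs for minorities depends on whether $\beta_{me}$ is statistically different from zero.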

### Exercise 4: Model Specification

1. Compare all model specifications (e.g., logged dependent variable, logged independent variables, interactions, and which variables to include; you can add more from the dataset if you choose), and select the model that you think does the best job of describing the effect of education on annual income. Run the regression and print the summary. _Create and run as many as you like, but only include one in your final submission_. Explain why you think it is the best.

In [15]:
# Please write your executable code here
results_final = smf.ols("np.log(Annual_Income) ~ np.log(educ) + exper + AFQT + C(female) + C(minority) + female_educ + minority_educ", data=df).fit()
print(results_final.summary())

                              OLS Regression Results                             
Dep. Variable:     np.log(Annual_Income)   R-squared:                       0.244
Model:                               OLS   Adj. R-squared:                  0.243
Method:                    Least Squares   F-statistic:                     169.3
Date:                   Tue, 03 Dec 2024   Prob (F-statistic):          1.20e-217
Time:                           22:21:12   Log-Likelihood:                -4655.1
No. Observations:                   3680   AIC:                             9326.
Df Residuals:                       3672   BIC:                             9376.
Df Model:                              7                                         
Covariance Type:               nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Interc
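One way to compare candidate specifications side by side is to collect adjusted $R^2$, AIC, BIC, and the number of observations into a single table. The sketch below does this for two hypothetical specifications on simulated data; the column `log_income` and all coefficients are made up. These statistics are only comparable across models that share the same dependent variable and the same sample, so it is worth checking that `N` matches before ranking models.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 400

# Simulated data with hypothetical coefficients
demo = pd.DataFrame({
    "educ": rng.integers(8, 21, n).astype(float),
    "exper": rng.uniform(1, 20, n),
})
demo["log_income"] = 9 + 0.08 * demo["educ"] + 0.03 * demo["exper"] + rng.normal(0, 0.5, n)

specs = {
    "linear educ": "log_income ~ educ + exper",
    "log educ": "log_income ~ np.log(educ) + exper",
}

rows = {}
for name, spec in specs.items():
    fit = smf.ols(spec, data=demo).fit()
    rows[name] = {"adj_R2": fit.rsquared_adj, "AIC": fit.aic, "BIC": fit.bic, "N": int(fit.nobs)}

table = pd.DataFrame(rows).T  # one row per specification
print(table)
```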

### Exercise 5: Omitted variables and Measurement errors

_Research goal_: We want to understand how education affects annual income.

1. (a) Provide an example of an omitted variable that we want to include in the regression, but we don't have in the current dataset.  It can be something that data could be collected on, but was left out, or something that is difficult or impossible to collect data on.   

1. (b) Do you think that by omitting this variable our conclusions about the effect of education on annual income are biased? What direction is the bias (upward or downward)?

2. Annual income is self-reported, so it contains measurement error. What is your best guess about the direction of the resulting bias?
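A small simulation can help build intuition for both questions. The sketch below assumes a hypothetical unobserved variable, `ability`, that raises both schooling and income; every number in it is made up. Omitting `ability` pushes the education coefficient upward, because the omitted variable is positively related to both education and income. It also adds purely random (classical) noise to the dependent variable, which leaves the coefficient roughly unchanged and only makes it less precise. Systematic misreporting, such as high earners understating income, is a different case that this sketch does not cover.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000

# Hypothetical data-generating process
ability = rng.normal(size=n)
educ = 13 + 1.5 * ability + rng.normal(size=n)  # ability raises schooling
log_income = 9 + 0.08 * educ + 0.20 * ability + rng.normal(scale=0.5, size=n)

def ols(y, *cols):
    """Least-squares coefficients for y on a constant plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_full = ols(log_income, educ, ability)[1]  # ability included
b_short = ols(log_income, educ)[1]          # ability omitted

# Classical measurement error in the dependent variable
noisy_income = log_income + rng.normal(scale=0.5, size=n)
b_noisy = ols(noisy_income, educ, ability)[1]

print(f"true 0.08 | full {b_full:.3f} | omitted {b_short:.3f} | noisy y {b_noisy:.3f}")
```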

In [16]:
# Before submitting, run the entire notebook by clicking the >> button at the top. Make sure everything runs before you turn it in.