### C3 P5
Confirm the partialling-out interpretation of the OLS estimates by explicitly doing the partialling out for Example 3.2:
- Data: WAGE1
  - Estimated equation: log(wage) = 0.284 + 0.092 educ + 0.0041 exper + 0.022 tenure
- First, regress educ on exper and tenure and save the residuals r1_hat.
- Second, regress log(wage) on r1_hat.
- Compare the coefficient on r1_hat with the coefficient on educ in the regression of log(wage) on educ, exper, and tenure.
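
The three steps above can be sketched on synthetic data before touching WAGE1. This is a minimal illustration using only NumPy; the variables `x1`, `x2`, `x3` and the helper `ols` are made-up stand-ins for educ, exper, tenure and are not part of the exercise.

```python
import numpy as np

def ols(y, X):
    """Return OLS coefficients of y on X (X must already contain a constant column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-ins for exper, tenure, educ and log(wage); not the WAGE1 data.
x2 = rng.normal(size=n)
x3 = 0.5 * x2 + rng.normal(size=n)
x1 = 1.0 - 0.7 * x2 + 0.3 * x3 + rng.normal(size=n)
y = 0.3 + 0.09 * x1 + 0.004 * x2 + 0.02 * x3 + rng.normal(scale=0.4, size=n)
const = np.ones(n)

# Step 1: residuals from regressing x1 on the other regressors.
Z = np.column_stack([const, x2, x3])
r1 = x1 - Z @ ols(x1, Z)

# Step 2: simple regression of y on those residuals.
slope_partial = ols(y, np.column_stack([const, r1]))[1]

# Step 3: coefficient on x1 in the full multiple regression.
slope_full = ols(y, np.column_stack([const, x1, x2, x3]))[1]

print(slope_partial, slope_full)
```

The two printed slopes agree up to floating-point error, which is the algebraic result the exercise asks to confirm with real data.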


In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

path = '/Users/mouyasushi/Desktop/學校課程/econometrics/DateSets/Excel/wage1.xls'
# Read the data

column_names = {
    'wage': 'average hourly earnings',
    'educ': 'years of education',
    'exper': 'years potential experience',
    'tenure': 'years with current employer',
    'nonwhite': '=1 if nonwhite',
    'female': '=1 if female',
    'married': '=1 if married',
    'numdep': 'number of dependents',
    'smsa': '=1 if live in SMSA',
    'northcen': '=1 if live in north central U.S',
    'south': '=1 if live in southern region',
    'west': '=1 if live in western region',
    'construc': '=1 if work in construc. indus.',
    'ndurman': '=1 if in nondur. manuf. indus.',
    'trcommpu': '=1 if in trans, commun, pub ut',
    'trade': '=1 if in wholesale or retail',
    'services': '=1 if in services indus.',
    'profserv': '=1 if in prof. serv. indus.',
    'profocc': '=1 if in profess. occupation',
    'clerocc': '=1 if in clerical occupation',
    'servocc': '=1 if in service occupation',
    'lwage': 'log(wage)',
    'expersq': 'exper^2',
    'tenursq': 'tenure^2'
}

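# Read the Excel file.
# Caution: read_excel defaults to header=0, so the first row of the sheet is consumed
# as a header even though `names` is supplied. WAGE1 is documented as having 526
# observations, while the output below shows 525, so if the file has no header row,
# passing header=None would keep every observation. The same applies to the later reads.
df = pd.read_excel(path, names=list(column_names.keys()))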

In [2]:
# Step 1: Regress educ on exper & tenure
# Create X matrix with constant term
X = df[['exper', 'tenure']]
X = sm.add_constant(X)
y = df['educ']

# Fit the model
model = sm.OLS(y, X).fit()

# Save residuals (r1_hat)
r1_hat = model.resid

# Print regression results
print("Step 1: Regression of Education on Experience and Tenure")
print("=" * 100)
print(model.summary().tables[1])
print("=" * 100)
print("Saved r1_hat: ")
print(r1_hat)

Step 1: Regression of Education on Experience and Tenure
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         13.5861      0.185     73.540      0.000      13.223      13.949
exper         -0.0741      0.010     -7.588      0.000      -0.093      -0.055
tenure         0.0475      0.018      2.593      0.010       0.012       0.084
Saved r1_hat: 
0     -0.050383
1     -2.437860
2     -3.655791
3     -1.162312
4      2.700659
         ...   
520    3.356588
521   -3.437860
522    1.521698
523    2.736978
524    0.594335
Length: 525, dtype: float64


In [3]:
# Step 2: Regress log(wage) on r1_hat
X2 = sm.add_constant(r1_hat)     
y2 = df['lwage']
model2 = sm.OLS(y2, X2).fit()

print("Step 2: Regression of Log(wage) on Residuals (r1_hat)")
print("===================================================")
print(model2.summary().tables[1])

Step 2: Regression of Log(wage) on Residuals (r1_hat)
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.6242      0.021     78.489      0.000       1.584       1.665
0              0.0919      0.008     11.650      0.000       0.076       0.107


In [4]:
# Original full regression
X_full = df[['educ', 'exper', 'tenure']]
X_full = sm.add_constant(X_full)
y_full = df['lwage']

model_full = sm.OLS(y_full, X_full).fit()
print("Full Regression: log(wage) on educ, exper, and tenure")
print("=" * 80)
print(model_full.summary().tables[1])

Full Regression: log(wage) on educ, exper, and tenure
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2867      0.104      2.745      0.006       0.082       0.492
educ           0.0919      0.007     12.519      0.000       0.077       0.106
exper          0.0041      0.002      2.367      0.018       0.001       0.007
tenure         0.0221      0.003      7.126      0.000       0.016       0.028


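#### Step 3: compare the coefficient on r1_hat with the full-regression coefficient on educ
- Coefficient on r1_hat: 0.0919
- Coefficient on educ: 0.0919
- The two coefficients are identical, as the partialling-out result predicts. The standard errors differ (0.008 vs 0.007) because the step 2 regression omits exper and tenure, so its residual variance is larger.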

### C3 P6
Use the data set WAGE2 (all of the following regressions contain an intercept):
- Run a simple regression of IQ on educ to obtain the slope coefficient (delta1_tilde).
- Run a simple regression of log(wage) on educ to obtain the slope coefficient (Beta1_tilde).
- Run a multiple regression of log(wage) on educ and IQ to obtain the slope coefficients (beta1_hat, beta2_hat).
- Verify that Beta1_tilde = beta1_hat + beta2_hat * delta1_tilde.
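
The identity in the last bullet holds exactly in any sample, not only approximately. A minimal sketch on synthetic data, assuming only NumPy; the helper `slopes` and the simulated `educ`, `iq`, `lwage` arrays are illustrative and are not the WAGE2 data:

```python
import numpy as np

def slopes(y, *regressors):
    """OLS slope coefficients of y on the given regressors plus an intercept."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]  # drop the intercept

rng = np.random.default_rng(1)
n = 400
# Simulated stand-ins; the parameter values are arbitrary.
educ = rng.normal(13, 2, size=n)
iq = 50 + 3.5 * educ + rng.normal(0, 12, size=n)
lwage = 5.6 + 0.04 * educ + 0.006 * iq + rng.normal(0, 0.4, size=n)

(delta1_tilde,) = slopes(iq, educ)              # IQ on educ
(beta1_tilde,) = slopes(lwage, educ)            # log(wage) on educ
beta1_hat, beta2_hat = slopes(lwage, educ, iq)  # log(wage) on educ and IQ

# Both sides of the identity should agree up to floating-point error.
print(beta1_tilde, beta1_hat + beta2_hat * delta1_tilde)
```

Any gap seen when plugging in coefficients from a summary table therefore comes from rounding the displayed values, not from the identity itself.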

In [5]:
path = '/Users/mouyasushi/Desktop/學校課程/econometrics/DateSets/Excel/wage2.xls'

# Column names and descriptions for WAGE2 dataset
column_names = {
    'wage': 'monthly earnings',
    'hours': 'average weekly hours',
    'IQ': 'IQ score',
    'KWW': 'knowledge of world work score',
    'educ': 'years of education',
    'exper': 'years of work experience',
    'tenure': 'years with current employer',
    'age': 'age in years',
    'married': '=1 if married',
    'black': '=1 if black',
    'south': '=1 if live in south',
    'urban': '=1 if live in SMSA',
    'sibs': 'number of siblings',
    'brthord': 'birth order',
    'meduc': "mother's education",
    'feduc': "father's education",
    'lwage': 'natural log of wage'
}

# Using this to read the data
df = pd.read_excel(path, names=list(column_names.keys()))

df.head()

Unnamed: 0,wage,hours,IQ,KWW,educ,exper,tenure,age,married,black,south,urban,sibs,brthord,meduc,feduc,lwage
0,808,50,119,41,18,11,16,37,1,0,0,1,1,.,14,14,6.694562
1,825,40,108,46,14,11,9,33,1,0,0,1,1,2,14,14,6.715384
2,650,40,96,32,12,13,7,32,1,0,0,1,4,3,12,12,6.476973
3,562,40,74,27,11,14,5,34,1,0,0,1,10,6,6,11,6.331502
4,1400,40,116,43,16,14,2,35,1,1,0,1,1,2,8,.,7.244227


In [6]:
# Step1 : Simple regression of IQ on education
X = df['educ']
X = sm.add_constant(X)
y = df['IQ']

model = sm.OLS(y, X).fit()
print("Regression of IQ on Education")
print("=" * 80)
print(model.summary().tables[1])
print(f"\nSlope coefficient (delta1_tilde): {model.params['educ']:.4f}")

Regression of IQ on Education
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         53.7041      2.625     20.457      0.000      48.552      58.856
educ           3.5328      0.192     18.366      0.000       3.155       3.910

Slope coefficient (delta1_tilde): 3.5328


In [7]:
# Simple regression of log(wage) on education
X = df['educ']
X = sm.add_constant(X)
y = df['lwage']

model_wage = sm.OLS(y, X).fit()
print("Regression of Log(wage) on Education")
print("=" * 80)
print(model_wage.summary().tables[1])
print(f"\nSlope coefficient (Beta1_tilde): {model_wage.params['educ']:.4f}")

Regression of Log(wage) on Education
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.9733      0.081     73.341      0.000       5.813       6.133
educ           0.0598      0.006     10.025      0.000       0.048       0.072

Slope coefficient (Beta1_tilde): 0.0598


In [8]:
# Multiple regression of log(wage) on education and IQ
X = df[['educ', 'IQ']]
X = sm.add_constant(X)
y = df['lwage']

model_multi = sm.OLS(y, X).fit()
print("Multiple Regression of Log(wage) on Education and IQ")
print("=" * 80)
print(model_multi.summary().tables[1])
print(f"\nSlope coefficients:")
print(f"beta1_hat (Education): {model_multi.params['educ']:.4f}")
print(f"beta2_hat (IQ): {model_multi.params['IQ']:.4f}")

Multiple Regression of Log(wage) on Education and IQ
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.6585      0.096     58.743      0.000       5.469       5.848
educ           0.0391      0.007      5.716      0.000       0.026       0.053
IQ             0.0059      0.001      5.872      0.000       0.004       0.008

Slope coefficients:
beta1_hat (Education): 0.0391
beta2_hat (IQ): 0.0059


#### Verify
- Beta1_tilde: 0.0598
- beta1_hat: 0.0391
- beta2_hat: 0.0059
- delta1_tilde: 3.5328
- Result: 0.0391 + 0.0059 * 3.5328 ≈ 0.0599, which matches Beta1_tilde = 0.0598 up to rounding of the displayed coefficients.

### C4 P5
- Data: MLB1
- Model: log(salary) = Beta0 + Beta1 * years + Beta2 * gamesyr + Beta3 * bavg + Beta4 * hrunsyr + Beta5 * rbisyr + u
- Q1: Drop rbisyr. What happens to the statistical significance of hrunsyr? What about the size of the coefficient on hrunsyr?
- Q2: Add the following variables to the model from Q1:
    - runsyr: runs per year
    - fldperc: fielding percentage
    - sbasesyr: stolen bases per year
    - Which of these factors are individually significant?
- Q3: In the model from Q2, test the joint significance of bavg, fldperc, and sbasesyr.

In [9]:

path = '/Users/mouyasushi/Desktop/學校課程/econometrics/DateSets/Excel/mlb1.xls'

column_names = {
    'salary': '1993 season salary',
    'teamsal': 'team payroll',
    'nl': '=1 if national league',
    'years': 'years in major leagues',
    'games': 'career games played',
    'atbats': 'career at bats',
    'runs': 'career runs scored',
    'hits': 'career hits',
    'doubles': 'career doubles',
    'triples': 'career triples',
    'hruns': 'career home runs',
    'rbis': 'career runs batted in',
    'bavg': 'career batting average',
    'bb': 'career walks',
    'so': 'career strike outs',
    'sbases': 'career stolen bases',
    'fldperc': 'career fielding perc',
    'frstbase': '=1 if first base',
    'scndbase': '=1 if second base',
    'shrtstop': '=1 if shortstop',
    'thrdbase': '=1 if third base',
    'outfield': '=1 if outfield',
    'catcher': '=1 if catcher',
    'yrsallst': 'years as all-star',
    'hispan': '=1 if hispanic',
    'black': '=1 if black',
    'whitepop': 'white pop. in city',
    'blackpop': 'black pop. in city',
    'hisppop': 'hispanic pop. in city',
    'pcinc': 'city per capita income',
    'gamesyr': 'games per year in league',
    'hrunsyr': 'home runs per year',
    'atbatsyr': 'at bats per year',
    'allstar': 'perc. of years an all-star',
    'slugavg': 'career slugging average',
    'rbisyr': 'rbis per year',
    'sbasesyr': 'stolen bases per year',
    'runsyr': 'runs scored per year',
    'percwhte': 'percent white in city',
    'percblck': 'percent black in city',
    'perchisp': 'percent hispanic in city',
    'blckpb': 'black*percblck',
    'hispph': 'hispan*perchisp',
    'whtepw': 'white*percwhte',
    'blckph': 'black*perchisp',
    'hisppb': 'hispan*percblck',
    'lsalary': 'log(salary)'
}


mlb = pd.read_excel(path, names=list(column_names.keys()))
mlb


Unnamed: 0,salary,teamsal,nl,years,games,atbats,runs,hits,doubles,triples,...,runsyr,percwhte,percblck,perchisp,blckpb,hispph,whtepw,blckph,hisppb,lsalary
0,3375000,38407380,1,8,918,3333,407,863,156,38,...,50.87500,70.27797,18.84423,10.8778,18.84423,0,0,10.8778,0,15.03191
1,3100000,38407380,1,5,751,2807,370,840,148,18,...,74.00000,70.27797,18.84423,10.8778,0,0,70.27797,0,0,14.94691
2,2900000,38407380,1,8,1056,3337,405,816,143,18,...,50.62500,70.27797,18.84423,10.8778,0,0,70.27797,0,0,14.88022
3,1650000,38407380,1,12,1196,3603,437,928,19,16,...,36.41667,70.27797,18.84423,10.8778,18.84423,0,0,10.8778,0,14.31629
4,700000,38407380,1,17,2032,7489,1136,2145,270,142,...,66.82353,70.27797,18.84423,10.8778,18.84423,0,0,10.8778,0,13.45884
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
347,312000,35586456,0,5,439,1098,150,260,41,8,...,30.00000,73.14964,13.87162,12.97875,0,0,73.14964,0,0,12.65076
348,275000,35586456,0,2,211,700,63,183,32,1,...,31.50000,73.14964,13.87162,12.97875,0,12.97875,0,0,13.87162,12.52453
349,250000,35586456,0,3,249,828,112,176,36,2,...,37.33333,73.14964,13.87162,12.97875,0,0,73.14964,0,0,12.42922
350,200000,35586456,0,6,667,2087,217,510,92,5,...,36.16667,73.14964,13.87162,12.97875,0,0,73.14964,0,0,12.20607


In [18]:
import pandas as pd
import statsmodels.api as sm
import numpy as np
from scipy import stats

# Original Model
X1 = mlb[['years', 'gamesyr', 'bavg', 'hrunsyr', 'rbisyr']]
X1 = sm.add_constant(X1)
y = mlb['lsalary']

model1 = sm.OLS(y, X1).fit()

# Model without rbisyr
X2 = mlb[['years', 'gamesyr', 'bavg', 'hrunsyr']]
X2 = sm.add_constant(X2)

model2 = sm.OLS(y, X2).fit()

print("Q1: Comparison of hrunsyr coefficient and significance")
print("\nWith rbisyr:")
print(f"hrunsyr coefficient: {model1.params['hrunsyr']:.4f}")
print(f"hrunsyr p-value: {model1.pvalues['hrunsyr']:.4f}")

print("\nWithout rbisyr:")
print(f"hrunsyr coefficient: {model2.params['hrunsyr']:.4f}")
print(f"hrunsyr p-value: {model2.pvalues['hrunsyr']:.4f}")
print("Ans: the coefficient on hrunsyr gets larger, and hrunsyr becomes statistically significant")

Q1: Comparison of hrunsyr coefficient and significance

With rbisyr:
hrunsyr coefficient: 0.0135
hrunsyr p-value: 0.4035

Without rbisyr:
hrunsyr coefficient: 0.0357
hrunsyr p-value: 0.0000
Ans: the coefficient on hrunsyr gets larger, and hrunsyr becomes statistically significant


In [23]:
# Q2: Extended model with additional variables
X3 = mlb[['years', 'gamesyr', 'bavg', 'hrunsyr', 'runsyr', 'fldperc', 'sbasesyr']]
X3 = sm.add_constant(X3)

model3 = sm.OLS(y, X3).fit()

print("\nQ2: Significance of new variables")
print("\nCoefficients and p-values for all variables:")
for var in model3.pvalues.index:
    print(f"{var}:")
    print(f"Coefficient: {model3.params[var]:.4f}")
    print(f"P-value: {model3.pvalues[var]:.4f}")

print("Ans: years, gamesyr, hrunsyr, and runsyr are individually significant; of the added variables, only runsyr is")
print(model3.summary().tables[1])  # Print coefficient table



Q2: Significance of new variables

Coefficients and p-values for all variables:
const:
Coefficient: 10.4320
P-value: 0.0000
years:
Coefficient: 0.0698
P-value: 0.0000
gamesyr:
Coefficient: 0.0080
P-value: 0.0032
bavg:
Coefficient: 0.0005
P-value: 0.6328
hrunsyr:
Coefficient: 0.0231
P-value: 0.0080
runsyr:
Coefficient: 0.0173
P-value: 0.0008
fldperc:
Coefficient: 0.0010
P-value: 0.6151
sbasesyr:
Coefficient: -0.0065
P-value: 0.2141
Ans: years, gamesyr, hrunsyr, and runsyr are individually significant; of the added variables, only runsyr is
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         10.4320      2.007      5.199      0.000       6.485      14.379
years          0.0698      0.012      5.816      0.000       0.046       0.093
gamesyr        0.0080      0.003      2.967      0.003       0.003       0.013
bavg           0.0005      0.001      0.478      0.633      -0.002       0.003
hrunsyr        0.023

In [12]:
# Q3: Joint test for bavg, fldperc, sbasesyr
hypotheses = '(bavg = 0), (fldperc = 0), (sbasesyr = 0)'
f_test = model3.f_test(hypotheses)

print("\nQ3: Joint significance test results")
print(f"F-statistic: {f_test.statistic:.4f}")
print(f"P-value: {f_test.pvalue:.4f}")


Q3: Joint significance test results
F-statistic: 0.6867
P-value: 0.5607
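
The F statistic reported by `f_test` for exclusion restrictions can also be built by hand from the restricted and unrestricted sums of squared residuals, F = [(SSR_r - SSR_ur) / q] / [SSR_ur / (n - k - 1)]. The sketch below shows that construction on synthetic data with NumPy only; the arrays `X_keep` and `X_test` and the helper `ssr` are hypothetical and do not use the MLB1 data.

```python
import numpy as np

def ssr(y, X):
    """Sum of squared OLS residuals from regressing y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

rng = np.random.default_rng(2)
n = 350
# Constant plus 4 regressors that stay in the model (arbitrary simulated values).
X_keep = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
# 3 regressors whose joint significance is tested; they are irrelevant by construction.
X_test = rng.normal(size=(n, 3))
y = X_keep @ np.array([10.0, 0.07, 0.01, 0.02, 0.02]) + rng.normal(size=n)

X_ur = np.column_stack([X_keep, X_test])  # unrestricted design matrix
q = X_test.shape[1]                       # number of restrictions
df_resid = n - X_ur.shape[1]              # n - k - 1 in the unrestricted model

ssr_r = ssr(y, X_keep)   # restricted model: tested regressors dropped
ssr_ur = ssr(y, X_ur)    # unrestricted model
F = ((ssr_r - ssr_ur) / q) / (ssr_ur / df_resid)
print(F)
```

Under the null hypothesis the statistic follows an F distribution with (q, n - k - 1) degrees of freedom, so a small value such as the 0.6867 above leads to not rejecting joint insignificance.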


### C4 P6
- Data: WAGE2
- Model: log(wage) = Beta0 + Beta1 * educ + Beta2 * exper + Beta3 * tenure + u
- Q1: State the null hypothesis that another year of general workforce experience (exper) has the same effect on log(wage) as another year of tenure with the current employer.
- Q2: Test the null hypothesis in Q1 against a two-sided alternative at the 5% significance level by constructing a 95% confidence interval. What do you conclude?

In [13]:
path = '/Users/mouyasushi/Desktop/學校課程/econometrics/DateSets/Excel/wage2.xls'

column_names = {
    'wage': 'monthly earnings',
    'hours': 'average weekly hours',
    'IQ': 'IQ score',
    'KWW': 'knowledge of world work score',
    'educ': 'years of education',
    'exper': 'years of work experience',
    'tenure': 'years with current employer',
    'age': 'age in years',
    'married': '=1 if married',
    'black': '=1 if black',
    'south': '=1 if live in south',
    'urban': '=1 if live in SMSA',
    'sibs': 'number of siblings',
    'brthord': 'birth order',
    'meduc': "mother's education",
    'feduc': "father's education",
    'lwage': 'natural log of wage'
}

wage2 = pd.read_excel(path, names=list(column_names.keys()))
wage2

Unnamed: 0,wage,hours,IQ,KWW,educ,exper,tenure,age,married,black,south,urban,sibs,brthord,meduc,feduc,lwage
0,808,50,119,41,18,11,16,37,1,0,0,1,1,.,14,14,6.694562
1,825,40,108,46,14,11,9,33,1,0,0,1,1,2,14,14,6.715384
2,650,40,96,32,12,13,7,32,1,0,0,1,4,3,12,12,6.476973
3,562,40,74,27,11,14,5,34,1,0,0,1,10,6,6,11,6.331502
4,1400,40,116,43,16,14,2,35,1,1,0,1,1,2,8,.,7.244227
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
929,520,40,79,28,16,6,1,30,1,1,1,0,0,1,11,.,6.253829
930,1202,40,102,32,13,10,3,31,1,0,1,1,7,7,8,6,7.091742
931,538,45,77,22,12,12,10,28,1,1,1,0,9,.,7,.,6.287858
932,873,44,109,25,12,12,12,28,1,0,1,0,1,1,.,11,6.771935


#### Q1
- H₀: β₂ = β₃ (The coefficient on exper equals the coefficient on tenure)
- H₁: β₂ ≠ β₃ (The coefficients are different)

#### Q2

In [27]:
# Estimate the model
X = wage2[['educ', 'exper', 'tenure']]
X = sm.add_constant(X)
y = wage2['lwage']
model = sm.OLS(y, X).fit()
print(model.summary().tables[1])

                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.4962      0.111     49.677      0.000       5.279       5.713
educ           0.0749      0.007     11.490      0.000       0.062       0.088
exper          0.0153      0.003      4.548      0.000       0.009       0.022
tenure         0.0134      0.003      5.169      0.000       0.008       0.018


In [28]:
# Get coefficients and standard errors
beta_exper = model.params['exper']
beta_tenure = model.params['tenure']
se_exper = model.bse['exper']    # bse: std error 
se_tenure = model.bse['tenure']

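# Calculate standard error of the difference.
# Caution: this treats the two estimators as uncorrelated. The exact formula is
# sqrt(se_exper**2 + se_tenure**2 - 2*Cov(b_exper, b_tenure)); the covariance is
# available from model.cov_params(), so the interval below is only approximate.
se_diff = np.sqrt(se_exper**2 + se_tenure**2)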

# Calculate difference between coefficients
diff = beta_exper - beta_tenure

# Construct 95% CI for the difference
ci_lower = diff - 1.96 * se_diff
ci_upper = diff + 1.96 * se_diff

print("Regression Results:")
print(f"exper coefficient: {beta_exper:.4f}")
print(f"tenure coefficient: {beta_tenure:.4f}")
print("\n95% Confidence Interval for (β₂ - β₃):")
print(f"({ci_lower:.4f}, {ci_upper:.4f})")
print("Ans: since 0 lies inside the 95% confidence interval, we fail to reject H0 at the 5% significance level")

Regression Results:
exper coefficient: 0.0153
tenure coefficient: 0.0134

95% Confidence Interval for (β₂ - β₃):
(-0.0064, 0.0103)
Ans: since 0 lies inside the 95% confidence interval, we fail to reject H0 at the 5% significance level
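
The interval above uses sqrt(se_exper² + se_tenure²), which leaves out the covariance between the two estimators. One way to get the exact standard error of θ = β₂ - β₃ is to reparameterize: substituting β₂ = θ + β₃ gives log(wage) = β₀ + β₁ educ + θ exper + β₃ (exper + tenure) + u, so the coefficient on exper in that regression is θ with its correct standard error. The sketch below checks this on synthetic data with NumPy only; `ols_fit` and the simulated arrays are assumptions for illustration, not the WAGE2 data.

```python
import numpy as np

def ols_fit(y, X):
    """OLS coefficients and their estimated covariance matrix (homoskedastic formula)."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    coef = XtX_inv @ X.T @ y
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - k)
    return coef, sigma2 * XtX_inv

rng = np.random.default_rng(3)
n = 900
educ = rng.normal(13, 2, size=n)
exper = rng.normal(11, 4, size=n)
tenure = 0.4 * exper + rng.normal(3, 4, size=n)  # correlated with exper by construction
y = 5.5 + 0.075 * educ + 0.015 * exper + 0.013 * tenure + rng.normal(0, 0.4, size=n)
const = np.ones(n)

# Original parameterization: columns are const, educ, exper, tenure.
b, V = ols_fit(y, np.column_stack([const, educ, exper, tenure]))
theta_hat = b[2] - b[3]
se_theta = np.sqrt(V[2, 2] + V[3, 3] - 2 * V[2, 3])  # includes the covariance term
se_naive = np.sqrt(V[2, 2] + V[3, 3])                # drops the covariance term

# Reparameterized model: tenure replaced by (exper + tenure).
b_rep, V_rep = ols_fit(y, np.column_stack([const, educ, exper, exper + tenure]))
se_rep = np.sqrt(V_rep[2, 2])

print(theta_hat, b_rep[2])
print(se_theta, se_rep, se_naive)
```

In the notebook itself, something like `model.t_test('exper - tenure = 0')` should give the same estimate and standard error directly from the fitted statsmodels results, since it uses the full covariance matrix. When exper and tenure are positively correlated, as in this simulation, the covariance term is negative and the naive standard error is too small.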


### C4 P11
- Data: HTV
- Model: educ = Beta0 + Beta1 * motheduc + Beta2 * fatheduc + Beta3 * abil + Beta4 * abil^2 + u
- Q1: Estimate the model by OLS and report the results in the usual form. Test the null hypothesis that educ is linearly related to abil against the alternative that the relationship is quadratic.
- Q2: Using the equation from Q1, test H0: Beta1 = Beta2 against a two-sided alternative. What is the p-value?
- Q3: Add the two college tuition variables (tuit17 and tuit18) to the regression from Q1 and determine whether they are jointly significant.
- Q4: What is the correlation between tuit17 and tuit18? Explain why using the average of the tuition over the two years might be preferred to adding each separately. What happens when you use the average?
- Q5: Do the findings for the average tuition variable in Q4 make sense when interpreted causally? What might be going on?

In [35]:
path = '/Users/mouyasushi/Desktop/學校課程/econometrics/DateSets/Excel/htv.xls'

column_names = {
    'wage': 'hourly wage, 1991',
    'abil': 'abil. measure, not standardized',
    'educ': 'highest grade completed by 1991',
    'ne': '=1 if in northeast, 1991',
    'nc': '=1 if in nrthcntrl, 1991',
    'west': '=1 if in west, 1991',
    'south': '=1 if in south, 1991',
    'exper': 'potential experience',
    'motheduc': 'highest grade, mother',
    'fatheduc': 'highest grade, father',
    'brkhme14': '=1 if broken home, age 14',
    'sibs': 'number of siblings',
    'urban': '=1 if in urban area, 1991',
    'ne18': '=1 if in NE, age 18',
    'nc18': '=1 if in NC, age 18',
    'south18': '=1 if in south, age 18',
    'west18': '=1 if in west, age 18',
    'urban18': '=1 if in urban area, age 18',
    'tuit17': 'college tuition, age 17',
    'tuit18': 'college tuition, age 18',
    'lwage': 'log(wage)',
    'expersq': 'exper^2',
    'ctuit': 'tuit18 - tuit17'
}

htv = pd.read_excel(path, names=list(column_names.keys()))
htv


Unnamed: 0,wage,abil,educ,ne,nc,west,south,exper,motheduc,fatheduc,...,ne18,nc18,south18,west18,urban18,tuit17,tuit18,lwage,expersq,ctuit
0,8.912656,2.037170,13,1,0,0,0,8,12,10,...,1,0,0,0,1,8.595144,9.499537,2.187472,64,0.904392
1,15.514330,2.475895,15,1,0,0,0,11,12,16,...,1,0,0,0,1,7.311346,7.311346,2.741764,121,0.000000
2,13.333330,3.609240,15,1,0,0,0,6,12,12,...,1,0,0,0,1,9.499537,10.162070,2.590267,36,0.662534
3,11.070110,2.636546,13,1,0,0,0,15,12,15,...,1,0,0,0,1,7.311346,7.311346,2.404249,225,0.000000
4,17.482520,3.474334,18,1,0,0,0,8,12,12,...,1,0,0,0,1,7.311346,7.311346,2.861201,64,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1224,7.735584,2.803173,12,0,0,0,1,9,12,12,...,0,0,1,0,1,3.895709,3.810777,2.045831,81,-0.084932
1225,91.309220,4.164562,19,0,0,1,0,6,13,14,...,0,0,0,1,0,0.000000,0.000000,4.514252,36,0.000000
1226,12.980770,0.893115,16,0,0,0,1,11,14,16,...,0,0,1,0,0,2.444079,2.444079,2.563469,121,0.000000
1227,12.500000,-0.633061,8,0,0,0,1,19,6,10,...,0,1,0,0,1,7.582914,7.582914,2.525729,361,0.000000


#### Q1

In [39]:
# Create abil squared term
htv['abil_sq'] = htv['abil']**2

# Estimate the model
X = htv[['motheduc', 'fatheduc', 'abil', 'abil_sq']]
X = sm.add_constant(X)
y = htv['educ']
model1 = sm.OLS(y, X).fit()

print("Model Results:")
print(model1.summary().tables[1])  # Print coefficient table

# Test H0: β₄ = 0 (--> linear relationship)
# H1: β₄ ≠ 0 (--> quadratic relationship)
print("\nTest for quadratic term:")
print(f"abil² coefficient: {model1.params['abil_sq']:.4f}")
print(f"Standard Error: {model1.bse['abil_sq']:.4f}")
print(f"t-statistic: {model1.tvalues['abil_sq']:.4f}")
print(f"p-value: {model1.pvalues['abil_sq']:.4f}")
print('Ans: since the p-value is very small, we reject H0 that educ is linearly related to abil')

Model Results:
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          8.2405      0.288     28.657      0.000       7.676       8.805
motheduc       0.1901      0.028      6.763      0.000       0.135       0.245
fatheduc       0.1089      0.020      5.554      0.000       0.070       0.147
abil           0.4015      0.030     13.250      0.000       0.342       0.461
abil_sq        0.0506      0.008      6.087      0.000       0.034       0.067

Test for quadratic term:
abil² coefficient: 0.0506
Standard Error: 0.0083
t-statistic: 6.0866
p-value: 0.0000
Ans: since the p-value is very small, we reject H0 that educ is linearly related to abil


#### Q2

In [41]:
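# Test H0: β₁ = β₂
# Calculate difference and its standard error.
# Caution: this standard error omits the -2*Cov(b_motheduc, b_fatheduc) term, so the
# t-statistic and p-value below are approximate; model1.cov_params() supplies the covariance.
diff = model1.params['motheduc'] - model1.params['fatheduc']
se_diff = np.sqrt(model1.bse['motheduc']**2 + model1.bse['fatheduc']**2)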

# Calculate t-statistic
t_stat = diff/se_diff
# Calculate p-value for two-sided test
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=len(y)-len(X.columns)))

print("\nTest for equality of parents' education effects:")
print(f"Difference (mother - father): {diff:.4f}")
print(f"Ans: p-value: {p_value:.4f}")


Test for equality of parents' education effects:
Difference (mother - father): 0.0812
Ans: p-value: 0.0180
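
The p-value above comes from a standard error that ignores the covariance between the motheduc and fatheduc estimators. For a general linear restriction r'β = 0, the t statistic is r'b / sqrt(r'Vr), where V is the estimated covariance matrix of the coefficients. The sketch below shows how much the covariance term can matter; the helper `t_stat_linear` and all numbers are hypothetical and chosen for illustration only, not taken from the HTV estimates.

```python
import numpy as np

def t_stat_linear(coef, cov, r):
    """t statistic for H0: r'beta = 0, using the full coefficient covariance matrix."""
    r = np.asarray(r, dtype=float)
    return float(r @ coef) / float(np.sqrt(r @ cov @ r))

# Hypothetical estimates and covariance matrix for two coefficients.
coef = np.array([0.20, 0.10])
cov = np.array([[0.0009, -0.0003],
                [-0.0003, 0.0004]])
r = [1, -1]  # restriction: first coefficient minus second coefficient equals zero

t_full = t_stat_linear(coef, cov, r)
# Shortcut that drops the covariance term, as in the cell above.
t_naive = (coef[0] - coef[1]) / np.sqrt(cov[0, 0] + cov[1, 1])

print(round(t_full, 3), round(t_naive, 3))
```

With a negative covariance, which is typical when the two regressors are positively correlated, the shortcut overstates the t statistic and understates the p-value. In the notebook, passing `model1.cov_params()` and `model1.params` to a helper like this, or calling `model1.t_test('motheduc = fatheduc')`, should give the covariance-aware answer.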


#### Q3

In [47]:
# Add tuition variables to original model
X2 = htv[['motheduc', 'fatheduc', 'abil', 'abil_sq', 'tuit17', 'tuit18']]
X2 = sm.add_constant(X2)
model2 = sm.OLS(y, X2).fit()

# Test joint significance of tuition variables
hypotheses = '(tuit17 = 0), (tuit18 = 0)'
f_test = model2.f_test(hypotheses)

print("Joint test for tuition variables:")
print(f"F-statistic: {f_test.statistic:.4f}")
print(f"p-value: {f_test.pvalue:.4f}")
print('Ans: since the p-value is large, we fail to reject H0, so the tuition variables are not jointly significant')

Joint test for tuition variables:
F-statistic: 0.8377
p-value: 0.4329
Ans: since the p-value is large, we fail to reject H0, so the tuition variables are not jointly significant


#### Q4

In [56]:
# Q4

corr = htv['tuit17'].corr(htv['tuit18'])
print(f"Correlation between tuit17 and tuit18: {corr:.4f}")

# Create average tuition and test its significance
htv['avg_tuit'] = (htv['tuit17'] + htv['tuit18'])/2

X3 = htv[['motheduc', 'fatheduc', 'abil', 'abil_sq', 'avg_tuit']]
X3 = sm.add_constant(X3)
y = htv['educ']
model3 = sm.OLS(y, X3).fit()

print(f"\navg_tuit coefficient: {model3.params['avg_tuit']:.4f}")
print(f"p-value: {model3.pvalues['avg_tuit']:.4f}")
print('ans: high corr suggests using their avg is preferred, because: \
it reduces multicollinearity, provides more precise estimate, and it better represents overall college cost effect')

Correlation between tuit17 and tuit18: 0.9808

avg_tuit coefficient: 0.0160
p-value: 0.1977
ans: high corr suggests using their avg is preferred, because: it reduces multicollinearity, provides more precise estimate, and it better represents overall college cost effect
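
To put a rough number on the multicollinearity argument, the pairwise correlation reported above can be turned into a variance inflation factor, VIF = 1 / (1 - r²). This is only a lower bound for the VIF in the full regression, because the other regressors are ignored here; it is a back-of-the-envelope sketch, not part of the original exercise.

```python
# Pairwise correlation between tuit17 and tuit18, copied from the output above.
corr = 0.9808

# Variance inflation factor implied by that correlation alone.
vif = 1.0 / (1.0 - corr**2)

# Standard errors scale with the square root of the VIF.
se_inflation = vif ** 0.5

print(round(vif, 1), round(se_inflation, 1))  # → 26.3 5.1
```

So each tuition coefficient has a standard error at least about five times larger than it would have if the two tuition variables were uncorrelated, which is why the individual effects are hard to pin down.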


#### Q5

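The estimated coefficient on avg_tuit is positive (0.0160) and statistically insignificant (p-value 0.1977). Read causally, a positive coefficient would mean that higher tuition raises completed schooling, which is implausible. The estimate may not represent a causal effect because:
- Students choose colleges based on ability and family background.
- College quality (unobserved) affects both tuition and education.
- Local economic conditions affect both tuition and education choices.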