## Background / Motivation

As college students, we want to know the return to schooling. Our project addresses common questions that college students have about the value of education and work experience for future earnings. We examine whether earning a college degree is associated with higher earnings, or whether gaining work experience immediately after high school is more beneficial. We also explore whether a person's intelligence plays a role in future earnings, and whether family background matters, that is, whether a mother's education level is a reliable predictor of future financial success.

These are complex issues that require careful analysis. By examining the relationship between these factors and future earnings, we hope to provide evidence-based answers to these questions and to better understand the factors that contribute to success in the workforce. Our aim is to provide insights that help individuals, schools, employers, and the government make informed decisions about education and career paths.

## Problem statement 

We want to know which aspects of human capital contribute to people's future success, measured here by future earnings.

## Data sources

The data come from the National Longitudinal Survey of Youth 1979 (NLSY79), run by the U.S. Bureau of Labor Statistics. Source: https://www.bls.gov/nls/nlsy79.html.

We use a subsample of adults aged 28 to 38 with positive income. We believe that restricting the sample to this narrow age band limits the influence of hard-to-measure historical effects on the data.

We look at 7 variables from this survey:
1. AFQT: “Armed Forces Qualification Test”, a standardized test (similar to the SAT or an IQ test) given to everybody who was part of the survey in 1979. Scores range from 0 to 140.
2. Wage: Weekly wage, in dollars.
3. Logwage: Natural log of Wage.
4. Educ: Years of education.
5. Exper: Years of work experience.
6. Meduc: Mother’s years of education.
7. Age: Age in years.
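
The subsample described above can be sketched as follows. This is a minimal illustration on a tiny made-up table, not our actual extract: the `raw` frame and its values are hypothetical, and we assume the age bounds 28 and 38 are inclusive.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw survey extract (all values are made up).
raw = pd.DataFrame({
    "Age":  [25, 28, 33, 38, 41, 30],
    "Wage": [400.0, 550.0, 0.0, 720.0, 900.0, 610.0],
})

# Keep respondents aged 28 to 38 (assumed inclusive) with a positive weekly wage.
subset = raw[raw["Age"].between(28, 38) & (raw["Wage"] > 0)].copy()

# Logwage is the natural log of Wage.
subset["Logwage"] = np.log(subset["Wage"])
print(subset)
```

Filtering on positive wage before taking the log avoids undefined values in Logwage.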


## Stakeholders

Our stakeholders are primarily young people and parents who are deciding how to allocate resources to maximize future earnings. We wish to identify which elements of human capital, among education, experience, age, and AFQT, are most influential in predicting wages. If we are successful, young people will have a clearer picture of which characteristics influence future wages and which do not. For example, we might recommend that they prioritize experience over education and take opportunities to gain workforce experience rather than going to college.

Other stakeholders include employers and the government. Employers can use the significance of predictors to form better expectations of what wages they should pay people with particular characteristics. For example, they may consider an expected change in wage when workers gain experience or pursue further education.

The government is the final stakeholder. In general, it can use the significant predictors to decide which areas of human capital to invest in to make certain populations better off. For example, it might consider a policy that increases education subsidies to minority communities in order to raise their average wages.

## Data quality check / cleaning / preparation 

![Screen%20Shot%202023-03-07%20at%209.08.56%20PM.png](attachment:Screen%20Shot%202023-03-07%20at%209.08.56%20PM.png)

The data were fully quantitative and contained no NaN or otherwise invalid values, so they were complete on import and required minimal cleaning.
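
A check of this kind can be sketched in a few lines of pandas. The frame below is a hypothetical three-row stand-in with made-up values, and treating a non-positive wage as invalid is our assumption for this sketch.

```python
import pandas as pd

# Hypothetical stand-in for the imported data (values are made up).
df = pd.DataFrame({
    "AFQT": [35.0, 88.5, 120.0],
    "Wage": [410.0, 655.0, 980.0],
    "Educ": [12, 16, 18],
})

# Count missing values in each column.
missing_per_column = df.isna().sum()

# List any columns that are not numeric.
non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

# Count rows with a non-positive wage (assumed invalid here).
non_positive_wage = int((df["Wage"] <= 0).sum())

print(missing_per_column)
print("Non-numeric columns:", non_numeric)
print("Rows with non-positive wage:", non_positive_wage)
```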

## Exploratory data analysis

### Relationship Analysis
We begin our EDA by looking for strong relationships of predictors with Wage. Further, we begin to identify areas where predictor transformations may be necessary.

![Screen%20Shot%202023-03-07%20at%209.15.48%20PM.png](attachment:Screen%20Shot%202023-03-07%20at%209.15.48%20PM.png)

![Screen%20Shot%202023-03-07%20at%209.16.08%20PM.png](attachment:Screen%20Shot%202023-03-07%20at%209.16.08%20PM.png)

We observe:
- Pronounced quadratic relationship between Wage and AFQT
- Possible quadratic relationship between Wage and Educ

![Screen%20Shot%202023-03-11%20at%2011.31.07%20AM.png](attachment:Screen%20Shot%202023-03-11%20at%2011.31.07%20AM.png)

We observe strongest (and similar) correlations with Wage in:
- AFQT
- Educ

We also note that the correlations are quite low for every predictor. We attribute this to the high variance of the Wage measurements in the dataset.
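
The correlation ranking can be reproduced with a short pandas sketch. The data below are synthetic (generated with made-up coefficients purely so the code runs), so the printed numbers do not correspond to our results.

```python
import numpy as np
import pandas as pd

# Synthetic data using the report's column names (all values are made up).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "AFQT": rng.uniform(0, 140, n),
    "Educ": rng.integers(8, 21, n).astype(float),
    "Exper": rng.integers(0, 20, n).astype(float),
    "Age": rng.integers(28, 39, n).astype(float),
    "Meduc": rng.integers(6, 19, n).astype(float),
})
df["Wage"] = (30 * df["Educ"] + 10 * df["Exper"]
              + 0.01 * df["AFQT"] ** 2 + rng.normal(0, 100, n))

# Pearson correlation of each predictor with Wage, strongest first.
corr_with_wage = df.corr()["Wage"].drop("Wage").sort_values(ascending=False)
print(corr_with_wage.round(2))
```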

To confirm the potential significance of the relationships between the predictors and Wage, we then fit a naive model with all predictors entered linearly.

#### Wage~AFQT+Educ+Exper+Age+Meduc-1

![Screen%20Shot%202023-03-11%20at%2011.29.04%20AM.png](attachment:Screen%20Shot%202023-03-11%20at%2011.29.04%20AM.png)
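
For reference, a model with this formula can be fit with the statsmodels formula API roughly as follows. The data are synthetic and made up, so the coefficients and p-values printed here are illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data using the report's column names (all values are made up).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "AFQT": rng.uniform(0, 140, n),
    "Educ": rng.integers(8, 21, n).astype(float),
    "Exper": rng.integers(0, 20, n).astype(float),
    "Age": rng.integers(28, 39, n).astype(float),
    "Meduc": rng.integers(6, 19, n).astype(float),
})
df["Wage"] = (30 * df["Educ"] + 10 * df["Exper"]
              + 0.01 * df["AFQT"] ** 2 + rng.normal(0, 100, n))

# "- 1" drops the intercept, as in the formula in the heading above.
naive = smf.ols("Wage ~ AFQT + Educ + Exper + Age + Meduc - 1", data=df).fit()

print(naive.params.round(3))   # estimated coefficients
print(naive.pvalues.round(4))  # p-values used to judge significance
```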

With some evidence of which predictors may be most conducive to modeling Wage, we now turn to analyzing how the predictors relate to each other.

Looking for correlations between predictors:
![Screen%20Shot%202023-03-07%20at%209.27.04%20PM.png](attachment:Screen%20Shot%202023-03-07%20at%209.27.04%20PM.png)

We see high correlations (>0.5) between: 
- Educ and AFQT
- Exper and Age

We then look for potential multicollinearity, as measured by the variance inflation factor (VIF).

The VIF of the constant is extremely high, which is expected and needs no further interpretation.

The VIFs of Logwage and Wage are high, which also makes sense because Logwage is simply the log of Wage. We should therefore keep only one of these two variables in the model.

Other than that, almost all VIFs are relatively close to 1, which means there is no severe multicollinearity among the relevant predictors (excluding Logwage).

![Screenshot%202023-03-08%20at%2012.44.16%20AM.png](attachment:Screenshot%202023-03-08%20at%2012.44.16%20AM.png)

## Approach


We use a multiple linear regression model to answer our problem statement. Since we were more interested in inference than in prediction, our group decided to focus on optimizing the R-squared value of our final model. Judging from the context of our data alone, all predictors seemed likely to be significant. We therefore anticipated difficulties with best subset and stepwise selection methods, since they might not eliminate any predictors. Although these selection methods took a long time to run, we found them helpful for building our final model.

The first model, which we based on our EDA and fit without an intercept term, performed well, with a high R-squared value. However, we wanted to increase the R-squared value further through model analysis.

We did not find any solutions to our problems elsewhere on the internet. 

## Developing the model

#### Step 1: Initial Model
- The first step we took was to develop an initial model based on our exploratory data analysis (EDA).
    - Looking at our naive model with all predictors and at the predictors' correlations with Wage, we selected AFQT, Educ, and Exper as the predictors for our initial model.
    - Age and Meduc showed non-significant relationships with Wage. Exper is not significant at the 95% level but is significant at the 94% level (p-value between 0.05 and 0.06), which we considered close enough to include it in our model.
    - We also observed apparent non-linear trends in AFQT and Educ, so we decided to include a quadratic term of AFQT in our model.

![Screen%20Shot%202023-03-11%20at%2011.48.05%20AM.png](attachment:Screen%20Shot%202023-03-11%20at%2011.48.05%20AM.png)

#### Step 2: RMSE Check

- Next, we checked the RMSE values for our initial model on the train and test datasets.
    - RMSE train: 372.1630333735138
    - RMSE test: 414.8114189856499
        - The train and test RMSE values are similar, which suggests that our model does not overfit the data.
        - We also consider the RMSE values relatively low, which suggests that the model makes fairly good predictions of Wage and generalizes to new data.
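
The train/test RMSE comparison can be sketched as below. The data are synthetic, the 80/20 split is an assumption, and the formula is our guess at an initial model with AFQT, a quadratic AFQT term, Educ, and Exper and no intercept. The sketch also computes the RMSE of always predicting the training mean, which is one hedged way to judge whether an RMSE is "low".

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data using the report's column names (all values are made up).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "AFQT": rng.uniform(0, 140, n),
    "Educ": rng.integers(8, 21, n).astype(float),
    "Exper": rng.integers(0, 20, n).astype(float),
})
df["Wage"] = (30 * df["Educ"] + 10 * df["Exper"]
              + 0.01 * df["AFQT"] ** 2 + rng.normal(0, 100, n))

# Assumed 80/20 random split into train and test sets.
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

# Assumed form of the initial model (no intercept).
model = smf.ols("Wage ~ AFQT + I(AFQT ** 2) + Educ + Exper - 1", data=train).fit()

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

rmse_train = rmse(train["Wage"], model.predict(train))
rmse_test = rmse(test["Wage"], model.predict(test))

# Baseline: always predict the mean training wage.
rmse_baseline = rmse(test["Wage"], np.full(len(test), train["Wage"].mean()))

print(f"train RMSE: {rmse_train:.1f}")
print(f"test RMSE: {rmse_test:.1f}")
print(f"baseline test RMSE: {rmse_baseline:.1f}")
```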

#### Step 3: Model Assumption Violations Check

- We then checked for model assumption violations by plotting the residuals vs. fitted values for the training and testing datasets.
    - Linearity assumption: the residuals vs. fitted values plots shown below indicate that the linearity assumption is met. We do not observe a strong pattern in the residuals around the line Residuals = 0, and the residuals are distributed in a similar manner on both sides of the blue line for all fitted values.
        - Given that the linearity assumption is not violated, we conclude that there is no need for further non-linear transformations of the predictors.
    - Constant-variance assumption: the same plots indicate that the constant-variance assumption is met, as the variance of the errors seems to stay constant as the fitted values increase.
        - Given that the constant-variance assumption is not violated, we conclude that there is no need for a non-linear transformation of the response variable.


![Screen%20Shot%202023-03-11%20at%205.55.27%20PM.png](attachment:Screen%20Shot%202023-03-11%20at%205.55.27%20PM.png)

![Screen%20Shot%202023-03-11%20at%205.55.32%20PM.png](attachment:Screen%20Shot%202023-03-11%20at%205.55.32%20PM.png)
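
A residuals vs. fitted values plot of this kind can be drawn as in the sketch below. The data are synthetic and the model formula is an assumed stand-in for our initial model, so the resulting figure is illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data using the report's column names (all values are made up).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "AFQT": rng.uniform(0, 140, n),
    "Educ": rng.integers(8, 21, n).astype(float),
    "Exper": rng.integers(0, 20, n).astype(float),
})
df["Wage"] = (30 * df["Educ"] + 10 * df["Exper"]
              + 0.01 * df["AFQT"] ** 2 + rng.normal(0, 100, n))

# Assumed form of the model being checked.
model = smf.ols("Wage ~ AFQT + I(AFQT ** 2) + Educ + Exper - 1", data=df).fit()
fitted = model.fittedvalues
residuals = model.resid

# Scatter residuals against fitted values with a reference line at zero.
fig, ax = plt.subplots()
ax.scatter(fitted, residuals, s=10, alpha=0.6)
ax.axhline(0, color="blue")
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
```

A curved band of points would hint at non-linearity, and a funnel shape would hint at non-constant variance.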

#### Step 4: Outliers and Influential Points

- Next, we looked for outliers and influential points in our model.
    - Outliers: As shown in our Project Code file, we identified 4 outliers in our data. We categorized outliers as observations whose studentized residuals have a magnitude greater than 3.
    - High Leverage Points: As shown in our Project Code file, we identified 0 high leverage points in our data. We categorized observations with more than four times the average leverage as high leverage points.
    - Influential Points: We define influential points as observations that are both outliers and high leverage points. Since there are 0 high leverage points, there are 0 influential points, so we do not need to remove any observations from our data.

![Screen%20Shot%202023-03-11%20at%206.01.06%20PM.png](attachment:Screen%20Shot%202023-03-11%20at%206.01.06%20PM.png)
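
The three definitions above translate directly into code. The sketch below applies them with statsmodels' influence diagnostics on synthetic data and an assumed model formula. We use externally studentized residuals here, which is an assumption, since the report does not say which variant was used.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data using the report's column names (all values are made up).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "AFQT": rng.uniform(0, 140, n),
    "Educ": rng.integers(8, 21, n).astype(float),
    "Exper": rng.integers(0, 20, n).astype(float),
})
df["Wage"] = (30 * df["Educ"] + 10 * df["Exper"]
              + 0.01 * df["AFQT"] ** 2 + rng.normal(0, 100, n))

# Assumed form of the model being diagnosed.
model = smf.ols("Wage ~ AFQT + I(AFQT ** 2) + Educ + Exper - 1", data=df).fit()
influence = model.get_influence()

student_resid = influence.resid_studentized_external  # studentized residuals
leverage = influence.hat_matrix_diag                  # diagonal of the hat matrix

outliers = np.abs(student_resid) > 3          # |studentized residual| > 3
high_leverage = leverage > 4 * leverage.mean()  # more than 4x the average leverage
influential = outliers & high_leverage          # both at once

print("outliers:", int(outliers.sum()))
print("high leverage points:", int(high_leverage.sum()))
print("influential points:", int(influential.sum()))
```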

#### Step 5: Best subset and stepwise selection (interactions)

- We also created four different models to compare with our initial model for improvement.
    - Best subset selection without interaction terms: R-squared of 0.186
    - Best subset selection with interaction terms: R-squared of 0.192
    - Forward stepwise selection with interaction terms: R-squared of 0.198
    - Backward stepwise selection with interaction terms: R-squared of 0.192
- Although all 4 new models, which included an intercept term, had very low R-squared values, we found interaction terms that were significant in multiple models.
    - The interaction terms AFQT\*Educ and Educ\*Exper were significant in both the best subset selection with interaction terms and the backward stepwise selection with interaction terms. We considered these interaction terms when developing our final model.
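
To illustrate the idea behind forward stepwise selection with interaction terms, here is a minimal sketch. It is not the code from our Project Code file: the data are synthetic, the candidate list is hypothetical, and using BIC as the selection criterion is an assumption made for this example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data using the report's column names (all values are made up).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "AFQT": rng.uniform(0, 140, n),
    "Educ": rng.integers(8, 21, n).astype(float),
    "Exper": rng.integers(0, 20, n).astype(float),
    "Age": rng.integers(28, 39, n).astype(float),
    "Meduc": rng.integers(6, 19, n).astype(float),
})
df["Wage"] = (30 * df["Educ"] + 10 * df["Exper"]
              + 0.01 * df["AFQT"] ** 2 + rng.normal(0, 100, n))

# Hypothetical candidate terms: main effects plus two interactions.
candidates = ["AFQT", "Educ", "Exper", "Age", "Meduc", "AFQT:Educ", "Educ:Exper"]

selected = []
best_bic = smf.ols("Wage ~ 1", data=df).fit().bic  # start from the intercept-only model

while len(selected) < len(candidates):
    # Try adding each remaining term to the current model and record its BIC.
    trials = {}
    for term in candidates:
        if term in selected:
            continue
        formula = "Wage ~ " + " + ".join(selected + [term])
        trials[term] = smf.ols(formula, data=df).fit().bic
    best_term = min(trials, key=trials.get)
    if trials[best_term] >= best_bic:
        break  # no remaining term lowers BIC, so stop
    selected.append(best_term)
    best_bic = trials[best_term]

print("selected terms:", selected)
```

Backward stepwise selection works the same way in reverse: it starts from the full model and drops one term at a time.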

Model summary for best subset selection, without interaction terms: 

![Screen%20Shot%202023-03-13%20at%205.56.49%20PM.png](attachment:Screen%20Shot%202023-03-13%20at%205.56.49%20PM.png)

Model summary for best subset selection, with interaction terms: 

![Screen%20Shot%202023-03-13%20at%205.57.07%20PM.png](attachment:Screen%20Shot%202023-03-13%20at%205.57.07%20PM.png)

Model summary for forward stepwise selection, with interaction terms:

![Screen%20Shot%202023-03-13%20at%205.57.56%20PM.png](attachment:Screen%20Shot%202023-03-13%20at%205.57.56%20PM.png)

Model summary for backward stepwise selection, with interaction terms:

![Screen%20Shot%202023-03-13%20at%205.58.31%20PM.png](attachment:Screen%20Shot%202023-03-13%20at%205.58.31%20PM.png)

#### Step 6: Our final model

- Using the results and insights from our previous steps, we developed a final model as shown below. 

![Screen%20Shot%202023-03-13%20at%205.59.48%20PM.png](attachment:Screen%20Shot%202023-03-13%20at%205.59.48%20PM.png)

Wage is the dependent variable, and there are 4 independent variables: I(AFQT ** 2), Educ, I(AFQT ** 2)*Educ, and Exper. 
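
The sketch below shows how a formula of this shape expands into four coefficients in the statsmodels formula API. The data are synthetic and the exact formula string is our reconstruction from the variable list above, so treat it as an assumption.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data using the report's column names (all values are made up).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "AFQT": rng.uniform(0, 140, n),
    "Educ": rng.integers(8, 21, n).astype(float),
    "Exper": rng.integers(0, 20, n).astype(float),
})
df["Wage"] = (30 * df["Educ"] + 10 * df["Exper"]
              + 0.01 * df["AFQT"] ** 2 + rng.normal(0, 100, n))

# a * b expands to a + b + a:b, so this formula yields four terms and,
# because of "- 1", no intercept.
final = smf.ols("Wage ~ I(AFQT ** 2) * Educ + Exper - 1", data=df).fit()
print(final.params)
```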

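The R-squared value is high (0.881), indicating that the model explains a large proportion of the variation in the dependent variable. Note that this model has no intercept term, so its R-squared is not directly comparable with the R-squared values of the intercept models in Step 5. The p-values for all predictors are very low, so all of them are statistically significant predictors of Wage.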

In our formula, the I(AFQT ** 2) term represents the squared value of AFQT. We carried it over from the initial model because the relationship between AFQT and Wage is nonlinear and AFQT is significant to Wage. The I(AFQT ** 2)\*Educ term adds an interaction between these two variables, which we saw in the best subset selection with interaction terms and in the correlation analysis. The interaction means that the effect of AFQT on Wage depends on the level of Educ, and vice versa. Writing the term with \* also keeps both I(AFQT ** 2) and Educ in the model as independent variables. We added Exper as a predictor because experience is also statistically significant to Wage.

Finally, the -1 term in the formula indicates that we do not include an intercept term in the model. This is because including an intercept would imply that the expected value of Wage is non-zero when all predictor variables are zero, which may not make sense in our context.
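
One consequence of dropping the intercept is worth illustrating. When a model has no constant, statsmodels reports an uncentered R-squared (1 - SSR / sum of squared y) instead of the usual centered one (1 - SSR / sum of squared deviations from the mean), so the two kinds of value are not directly comparable. The sketch below uses synthetic, made-up data to show the size of the difference. It may explain part of the gap between our no-intercept and intercept models, but it is not a re-analysis of our data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a large mean wage and noisy outcomes (values are made up).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Educ": rng.integers(8, 21, n).astype(float),
    "Exper": rng.integers(0, 20, n).astype(float),
})
df["Wage"] = 30 * df["Educ"] + 10 * df["Exper"] + rng.normal(0, 300, n)

with_int = smf.ols("Wage ~ Educ + Exper", data=df).fit()
no_int = smf.ols("Wage ~ Educ + Exper - 1", data=df).fit()

# Reported R-squared: centered with an intercept, uncentered without one.
print("with intercept:", round(with_int.rsquared, 3))
print("no intercept (uncentered):", round(no_int.rsquared, 3))

# Centered R-squared of the no-intercept model, for a like-for-like comparison.
centered_r2_no_int = 1 - no_int.ssr / no_int.centered_tss
print("no intercept (centered):", round(centered_r2_no_int, 3))
```

Comparing the centered values of both models, or comparing test RMSE, gives a fairer picture of which specification fits better.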

Thus, we have a final model with a high R-squared that shows no severe multicollinearity, which makes its significant predictors useful for making predictions.

#### Step 7: Final Model Analysis

Now that we have developed a final model, we analyze it to make sure that no further improvements are needed.

1. RMSE check:
    - Train RMSE: 370.1348
    - Test RMSE: 410.8099

   The train and test RMSEs are similar, which suggests no overfitting. They are also close to those of the initial model and remain low. Since our goal is to maximize the R-squared value, we disregard the small change in RMSE.

2. Linearity assumption: satisfied, as we do not observe a strong pattern in the residuals around the line Residuals = 0. The residuals are distributed in a similar manner on both sides of the blue line for all fitted values.
   
   Constant-variance assumption: satisfied, as the variance of the errors seems to stay constant as the fitted values increase.

![Screen%20Shot%202023-03-13%20at%205.40.30%20PM.png](attachment:Screen%20Shot%202023-03-13%20at%205.40.30%20PM.png)

![Screen%20Shot%202023-03-13%20at%205.41.03%20PM.png](attachment:Screen%20Shot%202023-03-13%20at%205.41.03%20PM.png)

3. Errors: approximately normally distributed

![Screen%20Shot%202023-03-13%20at%205.41.16%20PM.png](attachment:Screen%20Shot%202023-03-13%20at%205.41.16%20PM.png)

4. From the code, we can tell that there are 4 outliers but no influential points.

#### Did we succeed in achieving our goal?
Yes, we were able to attain a high R-squared without clear evidence of overfitting. This provided us with a small subset of significant predictors with informative coefficients to use in our recommendations.

## Limitations of the model with regard to inference 

Most importantly, our inferences about predictor significance are limited to people aged 28 to 38. For people outside that age range, we cannot conclude that the same predictors are significant. For example, we might expect experience to have a stronger relationship with wage beyond age 38, while the importance of education could fade. Thus, our recommendations are limited to the factors that contribute most to the wages of adults in this age range and should not be taken as a general rule for all ages without further analysis of a more comprehensive dataset.

## Conclusions and Recommendations to stakeholders


From our final model, we conclude that years of education and work experience are two important predictors of future earnings, and that the effect of years of education is larger than that of work experience. The AFQT score (similar to an IQ test) by itself does not really predict future earnings: its effect is not statistically significant, which may be because that effect is already explained by years of education. Family background, measured by mother’s years of education, does not appear to matter. We also re-emphasize that none of these effects are causal, because the survey is observational rather than an experiment. The effects reported here only describe relationships.

Our analysis can create value for stakeholders.

As long as the education system and job market do not change drastically, our findings should remain useful.

For families, a child who wants to earn more in the future can simply try to spend more years in school. If that is not possible, don’t panic: workplace experience also contributes to future earnings.

For employers, the high correlation between AFQT and Educ means that if hiring smarter employees is the main goal and a candidate's intelligence is hard to measure, it is easier to simply hire employees with a higher level of education.

For the government, the state should create more opportunities for children to pursue higher education, especially those from lower-income families. This is a way to provide mobility opportunities for lower-income families, and it could improve equality and provide positive externalities to society.

However, a limitation is that a linear regression does not establish a causal effect, so education may not be the real reason behind earnings and success.

## GitHub and individual contribution {-}

Put the **Github link** for the project repository.

https://github.com/alandaz/DataScienceRepo

Add details of each team member's contribution in the table below.

<html>
<style>
table, td, th {
  border: 1px solid black;
}

table {
  border-collapse: collapse;
  width: 100%;
}

th {
  text-align: left;
}
    

</style>
<body>

<h2>Individual contribution</h2>

<table style="width:100%">
     <colgroup>
       <col span="1" style="width: 15%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 50%;">
       <col span="1" style="width: 15%;"> 
    </colgroup>
  <tr>
    <th>Team member</th>
    <th>Contributed aspects</th>
    <th>Details</th>
    <th>Number of GitHub commits</th>
  </tr>
  <tr>
    <td>Alanda Zong</td>
    <td>Stepwise Selection code and final model analysis</td>
    <td>Best subset selection, best subset selection with interaction terms, forward stepwise, backwards stepwise, and wrote the code for the graphs of the final model analysis (errors, homoscedasticity, and residual scatter plot) </td>
    <td>14</td>
  </tr>
  <tr>
    <td>Luke Lilienthal</td>
    <td>EDA and Initial Model Creation</td>
    <td>Data Validation. Found variable relationships in the raw data. Including: Scatterplotting, Correlation analysis. Created initial model with EDA findings.</td>
    <td>15</td>
  </tr>
    <tr>
    <td>Junho Park</td>
    <td>Outlier and influential points treatment, Model Assumptions Check, Final Model Analysis</td>
    <td>Identified outliers/influential points and analyzed their effect on the model. Checked for non-linearity and constant-variance assumption. Analyzed metrics for the final model. </td>
    <td>15</td>    
  </tr>
    <tr>
    <td>Diqiao Wang</td>
    <td>Data selection, Collinearity Check, and Final Model Analysis</td>
    <td>Selected the clean data set that is suitable for running regression on, checked the VIF part, and ran the final model assumption.</td>
    <td>11</td>    
  </tr>
</table>

List the **challenges** you faced when collaborating with the team on GitHub. Are you comfortable using GitHub? 
Do you feel GitHub made collaboration easier? If not, then why? *(Individual team members can put their opinion separately, if different from the rest of the team)*

At first, we felt GitHub was hard to use because it required so many steps to change the code compared to simply writing code in Jupyter Notebook. I felt it was hard to manage different versions, and I did not know how to preserve the original version of a file other than by creating a copy of it. However, GitHub's value became clear when we reached the final report. Last quarter, our group had to work on one computer to run every piece of code, which was time-consuming. This quarter, every group member can easily see what is going on through GitHub even though we are not sitting together physically.