# Introduction

The National Longitudinal Survey of Youth 1997-2011 dataset is one of the most important databases available to social scientists working with US data. 

It allows researchers to study the determinants of earnings and educational attainment, which makes it highly relevant for government policy. It can also shed light on politically sensitive issues, such as how educational attainment and salaries differ by ethnicity, sex, and other factors. A better understanding of how these variables affect education and earnings helps in formulating more suitable government policies.

<center><img src=https://i.imgur.com/cxBpQ3I.png height=400></center>


### Upgrade Plotly

In [92]:
%pip install --upgrade plotly

Note: you may need to restart the kernel to use updated packages.


###  Import Statements


In [93]:
import pandas as pd
import numpy as np

import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

## Notebook Presentation

In [94]:
pd.options.display.float_format = '{:,.2f}'.format

# Load the Data



In [95]:
df_data = pd.read_csv('NLSY97_subset.csv')

### Understand the Dataset

Have a look at the file entitled `NLSY97_Variable_Names_and_Descriptions.csv`. 

---------------------------

    :Key Variables:  
      1. S           Years of schooling (highest grade completed as of 2011)
      2. EXP         Total out-of-school work experience (years) as of the 2011 interview.
      3. EARNINGS    Current hourly earnings in $ reported at the 2011 interview

# Preliminary Data Exploration 🔎

**Challenge**

* What is the shape of `df_data`? 
* How many rows and columns does it have?
* What are the column names?
* Are there any NaN values or duplicates?

In [96]:
# Keep only the three key variables used in this notebook
df_data = df_data[['S', 'EXP', 'EARNINGS']]
print(f'It has {df_data.shape[0]} rows and {df_data.shape[1]} columns')
print("The column names are: " + ', '.join(df_data.columns))
print(f'Are there any NaNs - {df_data.isna().values.any()}')
# Note: duplicates are detected on these three columns only
print(f'Are there any duplicates - {df_data.duplicated().values.any()}')

It has 2000 rows and 3 columns
The column names are: S, EXP, EARNINGS
Are there any NaNs - False
Are there any duplicates - True


## Data Cleaning - Check for Missing Values and Duplicates

Find and remove any duplicate rows.

In [97]:
df_data = df_data.drop_duplicates()
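
One caveat about this step: because `df_data` was already narrowed to three columns, `drop_duplicates()` compares only `S`, `EXP`, and `EARNINGS`, so two different respondents who happen to share those values are treated as one row. The sketch below illustrates the effect on a tiny made-up frame. The `ID` column is a hypothetical respondent identifier assumed for illustration; it is not confirmed to exist in `NLSY97_subset.csv`.

```python
import pandas as pd

# Made-up data: respondents 1 and 2 are different people with identical values
demo = pd.DataFrame({
    "ID": [1, 2, 3],  # hypothetical identifier column (an assumption)
    "S": [12, 12, 16],
    "EXP": [5.0, 5.0, 3.0],
    "EARNINGS": [15.0, 15.0, 20.0],
})

# With the identifier included, no row is an exact duplicate
dupes_all_columns = int(demo.duplicated().sum())

# After narrowing to three columns, respondent 2 looks like a duplicate
dupes_three_columns = int(demo[["S", "EXP", "EARNINGS"]].duplicated().sum())

# Deduplicating on the identifier keeps every distinct respondent
deduped = demo.drop_duplicates(subset=["ID"])

print(dupes_all_columns, dupes_three_columns, len(deduped))
```

If an identifier is available, deduplicating before selecting the modelling columns avoids discarding distinct observations.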

## Descriptive Statistics

In [98]:
df_data.describe()

| | S | EXP | EARNINGS |
|---|---|---|---|
| count | 1486.0 | 1486.0 | 1486.0 |
| mean | 14.56 | 6.7 | 18.81 |
| std | 2.77 | 2.86 | 12.0 |
| min | 6.0 | 0.0 | 2.0 |
| 25% | 12.0 | 4.66 | 11.41 |
| 50% | 15.0 | 6.63 | 15.75 |
| 75% | 16.0 | 8.71 | 22.6 |
| max | 20.0 | 14.73 | 132.89 |


## Visualise the Features

In [99]:
df_data.columns

Index(['S', 'EXP', 'EARNINGS'], dtype='object')
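
Listing the column names does not yet show how the features are distributed or how they relate to earnings. A minimal sketch of one way to plot them follows. It uses synthetic data with the same three column names (the generating formula and the `demo` frame are assumptions made so the example runs on its own); in the notebook, `df_data` would take the place of `demo`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for df_data (values are made up for illustration)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "S": rng.integers(6, 21, size=200),
    "EXP": rng.uniform(0, 15, size=200),
})
demo["EARNINGS"] = 2 + 1.2 * demo["S"] + 0.8 * demo["EXP"] + rng.normal(0, 8, size=200)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Distribution of the target variable
axes[0].hist(demo["EARNINGS"], bins=30)
axes[0].set(title="Distribution of earnings", xlabel="Hourly earnings ($)", ylabel="Count")

# Target against each feature
axes[1].scatter(demo["S"], demo["EARNINGS"], alpha=0.4)
axes[1].set(title="Earnings vs schooling", xlabel="Years of schooling (S)", ylabel="Hourly earnings ($)")

axes[2].scatter(demo["EXP"], demo["EARNINGS"], alpha=0.4)
axes[2].set(title="Earnings vs experience", xlabel="Years of experience (EXP)", ylabel="Hourly earnings ($)")

fig.tight_layout()
```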

# Split Training & Test Dataset

We *can't* use all the entries in our dataset to train our model. Keep 20% of the data for later as a testing dataset (out-of-sample data).  

In [100]:
train_data, test_data = train_test_split(df_data, test_size=0.2)
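
Without a fixed seed, `train_test_split` shuffles differently on every run, so coefficients and R-squared values change each time the notebook is re-executed. A minimal sketch of a reproducible split is shown below. The `demo` frame and the seed value 42 are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with 100 rows
demo = pd.DataFrame({"S": range(100), "EARNINGS": range(100, 200)})

# The same random_state gives the same split on every run
train_a, test_a = train_test_split(demo, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(demo, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))           # 80 20
print(test_a.index.equals(test_b.index))   # True
```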

# Simple Linear Regression

Only use the years of schooling to predict earnings. Use sklearn to run the regression on the training dataset. How high is the r-squared for the regression on the training data? 

In [101]:
reg = LinearRegression()
X_train = train_data[['S']]
y_train = train_data[['EARNINGS']]

X_test = test_data[['S']]
y_test = test_data[['EARNINGS']]

reg.fit(X_train, y_train)
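
The question above asks for the R-squared on the training data, which `LinearRegression.score` returns when it is given the same features and target used for fitting. The sketch below shows the pattern on synthetic data (the `demo` frame and its generating formula are assumptions); in the notebook the equivalent call would be `reg.score(X_train, y_train)`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: earnings rise with schooling, plus noise (made up)
rng = np.random.default_rng(1)
schooling = rng.integers(8, 21, size=200)
demo = pd.DataFrame({
    "S": schooling,
    "EARNINGS": 2 + 1.2 * schooling + rng.normal(0, 3, size=200),
})

demo_train, demo_test = train_test_split(demo, test_size=0.2, random_state=1)
demo_reg = LinearRegression().fit(demo_train[["S"]], demo_train["EARNINGS"])

# In-sample fit versus out-of-sample fit
r2_train = demo_reg.score(demo_train[["S"]], demo_train["EARNINGS"])
r2_test = demo_reg.score(demo_test[["S"]], demo_test["EARNINGS"])

print(f"Training R-squared: {r2_train:.3f}")
print(f"Test R-squared: {r2_test:.3f}")
```

Reporting both numbers side by side makes it easier to tell whether a model only fits the data it was trained on.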

### Evaluate the Coefficients of the Model

Here we do a sense check on our regression coefficients. The first thing to look for is if the coefficients have the expected sign (positive or negative). 

Interpret the regression. How many extra dollars can one expect to earn for an additional year of schooling?

In [102]:
slope = reg.coef_[0][0]
print(f'Coefficient is {round(slope, 2)} - {"positive" if slope > 0 else "negative"}')
# Each additional year of schooling is associated with about 1.25 dollars more per hour

Coefficient is 1.25 - positive
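
For a single feature, the coefficient that `LinearRegression` reports is the ordinary least squares slope: the sum of cross-deviations of schooling and earnings divided by the sum of squared deviations of schooling. The sketch below works through that arithmetic on four made-up observations (not values from the dataset).

```python
# Made-up observations: years of schooling and hourly earnings
schooling = [10, 12, 14, 16]
earnings = [10.0, 14.0, 15.0, 21.0]

mean_s = sum(schooling) / len(schooling)
mean_e = sum(earnings) / len(earnings)

# Sum of cross-deviations and sum of squared deviations
s_xy = sum((s - mean_s) * (e - mean_e) for s, e in zip(schooling, earnings))
s_xx = sum((s - mean_s) ** 2 for s in schooling)

slope = s_xy / s_xx                   # dollars per extra year of schooling
intercept = mean_e - slope * mean_s   # predicted earnings at zero years

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
# slope = 1.70, intercept = -7.10
```

A positive slope means that earnings rise with schooling in the sample. The intercept is an extrapolation to zero years of schooling and usually has no practical interpretation.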


### Analyse the Estimated Values & Regression Residuals

How good our regression is also depends on the residuals - the difference between the model's predictions ($\hat{y}_i$) and the true values ($y_i$) inside `y_train`. Do you see any patterns in the distribution of the residuals?

In [103]:
y_pred = reg.predict(X_test)
print(f'R-squared is {reg.score(X_test, y_test)}')
# About 4.5% on the test set: schooling alone explains very little of the variation in earnings

R-squared is 0.044563814160560655
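
The R-squared that `score` returns is built from the residuals: it is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below computes both the residuals and R-squared by hand. The true values and predictions are hypothetical numbers chosen for illustration.

```python
# Hypothetical true values and model predictions
y_true = [10.0, 14.0, 15.0, 21.0]
y_hat = [9.9, 13.3, 16.7, 20.1]

# Residual = true value minus prediction
residuals = [t - p for t, p in zip(y_true, y_hat)]

mean_y = sum(y_true) / len(y_true)
ss_res = sum(r ** 2 for r in residuals)           # unexplained variation
ss_tot = sum((t - mean_y) ** 2 for t in y_true)   # total variation

r_squared = 1 - ss_res / ss_tot

print([round(r, 2) for r in residuals])
print(f"R-squared = {r_squared:.3f}")
```

Residuals that cluster on one side of zero, or that fan out as predictions grow, suggest the linear form is missing something. On out-of-sample data R-squared can even be negative, which means the model predicts worse than the mean.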


# Multivariable Regression

Now use both years of schooling and years of work experience to predict earnings. How high is the r-squared for the regression on the training data? 

In [104]:
X_train = train_data[['S', 'EXP']]
X_test = test_data[['S', 'EXP']]

model = LinearRegression().fit(X_train, y_train)

In [106]:
print(f'R-squared is {model.score(X_train, y_train)}')
# About 11% on the training data - still low, though higher than the single-variable model's test score

R-squared is 0.11489964952018827


### Evaluate the Coefficients of the Model

In [118]:
print(f'Each extra year of schooling adds about {round(model.coef_[0][0], 2)} dollars per hour; each extra year of experience adds about {round(model.coef_[0][1], 2)}')

Each extra year of schooling adds about 1.72 dollars per hour; each extra year of experience adds about 0.79


# Use Your Model to Make a Prediction

How much can someone with a bachelor's degree (12 + 4 years of schooling) and 5 years of work experience expect to earn in 2011?

In [121]:
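# Predict for S = 16 and EXP = 5; [0][0] extracts the scalar from the 2D result
est_earnings = model.predict(pd.DataFrame({'S': [16], 'EXP': [5]}))[0][0]

In [123]:
print(f'Roughly {est_earnings:.2f} dollars per hour')

Roughly 19.84 dollars per hour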
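
A prediction from a linear model is just the intercept plus each coefficient multiplied by its feature value. The sketch below checks this against `predict` on synthetic data. The generating formula and the `demo_model` name are assumptions for illustration, so the printed numbers will not match the notebook's model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features: column 0 is schooling, column 1 is experience (made up)
rng = np.random.default_rng(42)
X = np.column_stack([
    rng.integers(8, 21, size=300),
    rng.uniform(0, 15, size=300),
])
y = -10 + 1.7 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(0, 5, size=300)

demo_model = LinearRegression().fit(X, y)

# Someone with 16 years of schooling and 5 years of experience
new_person = np.array([[16, 5]])
predicted = demo_model.predict(new_person)[0]

# The same number by hand: intercept + coefficients . features
manual = demo_model.intercept_ + demo_model.coef_ @ new_person[0]

print(f"predict(): {predicted:.4f}")
print(f"by hand:   {manual:.4f}")
```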


# Experiment and Investigate Further

Which other features could you consider adding to further improve the regression to better predict earnings? 