In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Build a regression tree to predict the target variable (WAGE_YEAR4).

train_df = pd.read_csv('earnings_train.csv') 
test_df = pd.read_csv('earnings_test_features.csv')

# DISTRICT_CODE maps one-to-one to DISTRICT_NAME (confirmed in Part 1) and has
# missing values, so drop it to simplify the one-hot encoding.
train_df = train_df.drop(columns=["DISTRICT_CODE"])
test_df = test_df.drop(columns=["DISTRICT_CODE"])

target = "WAGE_YEAR4"
features = [col for col in train_df.columns if col != target]

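# One-hot encode the categorical columns.
train_dum = pd.get_dummies(train_df[features])
test_dum = pd.get_dummies(test_df)
# Align test columns to the training columns so categories that appear in only
# one file do not break model.predict (missing columns are filled with 0).
test_dum = test_dum.reindex(columns=train_dum.columns, fill_value=0)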

# Preliminary check: split the training data into a training part and a validation part.
# If the model works well, it is refit on the whole training set.
x_train, x_val, y_train, y_val = train_test_split(train_dum, train_df[target], test_size=0.2)

model = DecisionTreeRegressor(max_depth=10, min_samples_split=10)
model.fit(x_train, y_train)

y_prediction = model.predict(x_val)
print("R squared:", r2_score(y_val, y_prediction))




R squared: 0.9906216833271853
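
When the training and test files are one-hot encoded separately, a category that appears in only one of them produces mismatched columns. The following is a minimal sketch of aligning the test columns to the training columns with `reindex`. The two toy frames and their values (including the district names "Alpha", "Beta", and "Gamma") are made up for illustration and are not taken from the earnings data.

```python
import pandas as pd

# Toy frames with made-up values; "Gamma" appears only in the test data
# and "Alpha" appears only in the training data.
train = pd.DataFrame({
    "WAGE_YEAR3": [30000, 42000, 51000],
    "DISTRICT_NAME": ["Alpha", "Beta", "Alpha"],
})
test = pd.DataFrame({
    "WAGE_YEAR3": [38000, 47000],
    "DISTRICT_NAME": ["Beta", "Gamma"],
})

train_dum = pd.get_dummies(train)
test_dum = pd.get_dummies(test)

# Keep exactly the training columns, in training order: categories missing
# from test become all-zero columns, and categories unseen in training are dropped.
test_aligned = test_dum.reindex(columns=train_dum.columns, fill_value=0)

print(list(train_dum.columns))
print(list(test_aligned.columns))
```

Dropping unseen test categories is a design choice: a tree fitted on the training columns has no split for them anyway, so the aligned frame simply matches what the fitted model expects.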


In [None]:
# Final fit on the whole training data.
model.fit(train_dum, train_df[target])
Final_prediction = model.predict(test_dum)
pd.DataFrame(Final_prediction, columns=["WAGE_YEAR4"]).to_csv("preds.csv", index=False, header=True)


In [None]:

# Looking for the most important features that predict the WAGE_YEAR4 variable.
feature_names = train_dum.columns
importances = model.feature_importances_
feature_importance_df = pd.DataFrame({'Feature': feature_names,'Importance': importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
feature_importance_df.head(10)

     Feature                                            Importance
2    WAGE_YEAR3                                         0.996077
1    WAGE_YEAR2                                         0.000702
0    WAGE_YEAR1                                         0.000686
208  DISTRICT_NAME_Dixon Unified                        0.000171
56   DISTRICT_NAME_Assembly District 31                 0.000116
540  DISTRICT_NAME_San Ramon Valley Unified             0.000112
706  STUDENT_POPULATION_Asian                           0.000109
309  DISTRICT_NAME_La Canada Unified                    0.000102
720  AWARD_CATEGORY_Bachelor's Degree - Did Not Tra...  0.000101
695  DISTRICT_NAME_Yuba City Unified                    0.000097


Part 3
1. Which features best predict the target outcome (WAGE_YEAR4)?

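   Income in the third year (WAGE_YEAR3) is by far the best predictor of income in the fourth year. It accounts for about 0.996 of the tree's total feature importance, and no other feature exceeds 0.001.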

2. What does your model say about the people or populations whose data is provided?

   Based on my regression tree model and the simple linear regression model I applied in Part 1, foster youth, students who experienced homelessness, and students from certain ethnic groups are less likely to earn a high income in the first four years of their careers. Their expected income also grows more slowly.

3. What features, if any, would you like to have had to make a better model?

   In the current regression tree, I tuned the maximum tree depth and the minimum number of samples per split by hand. For a better model, I would use 10-fold cross-validation to choose these two values instead.
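
The 10-fold cross-validated tuning described in the last answer could be sketched roughly as follows. This is only an illustration under stated assumptions: the data comes from `make_regression` as a synthetic stand-in for `train_dum` and `train_df[target]`, and the candidate values in `param_grid` are hypothetical, not values tested in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for train_dum / train_df[target]; not the earnings data.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Hypothetical candidate values for the two hand-tuned hyperparameters.
param_grid = {
    "max_depth": [3, 5, 10],
    "min_samples_split": [2, 10, 20],
}

# 10-fold cross-validation, shuffled with a fixed seed for reproducibility.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="r2",  # same metric as the preliminary check
    cv=cv,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Mean cross-validated R squared:", search.best_score_)
```

With the real data, `X` and `y` would be replaced by `train_dum` and `train_df[target]`. By default `GridSearchCV` refits the best configuration on all the data passed to `fit`, so `search.predict(test_dum)` could then produce the final predictions.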