# Step 3: Predictive Modeling
We will build a predictive model that takes the columns of `master_df` as features and predicts a school's MCAS score. `master_df` is the dataframe we created in the previous notebook.

## 3.0 Import Libraries

In [332]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import Lasso, Ridge

## 3.1 Load Data

In [333]:
master_df = pd.read_csv("../output/master_df.csv")
master_df.columns

Index(['distname', 'schoolname', 'flag_nerds', 'flag_f33', 'ncesenroll',
       'gradespan', 'pp_stloc_raw_MA', 'pp_fed_raw_MA', 'pp_total_raw_MA',
       'schoolstloc_raw_MA', 'schoolfed_raw_MA', 'schooltot_raw_MA',
       'distname_lower', 'schoolname_lower', 'School Name', 'School Code',
       'ELA_elem', 'MATH_elem', 'SCI', 'avg_score_elem', 'ELA_hs', 'MATH_hs',
       'avg_score_hs', 'avg_score', 'African American', 'Asian', 'Hispanic',
       'White', 'Native American', 'Native Hawaiian, Pacific Islander',
       'Multi-Race, Non-Hispanic', 'Males', 'Females', '% Graduated',
       'First Language Not English #', 'First Language Not English %',
       'English Language Learner #', 'English Language Learner %',
       'Students With Disabilities #', 'Students With Disabilities %',
       'High Needs #', 'High Needs #.1', 'Economically Disadvantaged #',
       'Economically Disadvantaged %', 'african_american_staff', 'asian_staff',
       'hispanic_staff', 'white_staff', 'native_a

## 3.2 Create Train/Test Split
Here, we will first identify which columns we want to use as features and which column we want to use as the target. We will then create a train/test split of the data.

In [334]:
# Let's drop non-feature columns (identifiers, flags, and the score columns themselves)
non_feature_cols = [
    "distname", "schoolname", "flag_nerds", "flag_f33", "gradespan",
    "distname_lower", "schoolname_lower", "School Name", "School Code",
    "District Name", "District Code",
    "ELA_elem", "MATH_elem", "SCI", "avg_score_elem",
    "ELA_hs", "MATH_hs", "avg_score_hs", "avg_score",
    "salary_totals_teachers", "student_teacher_ratio_nan_flag", "no_salary_flag",
]
feature_cols = master_df.drop(columns=non_feature_cols).columns.to_list()

# Get rid of columns with '#' in the name, we only want '%' columns
feature_cols = [col for col in feature_cols if "#" not in col]

# # Let's also make independent demographics, finance, and school info feature lists
# demo_feature_cols = ["African American", "Asian", "Hispanic", "White", "Native American", "Native Hawaiian, Pacific Islander", "Multi-Race, Non-Hispanic", "african_american_staff", "asian_staff", "hispanic_staff", "white_staff", "native_american_staff", "hawaiian_pacific_staff", "multi_race_staff", "Males", "Females", "males_staff", "females_staff"]

# finance_feature_cols = ["pp_stloc_raw_MA", "pp_fed_raw_MA", "pp_total_raw_MA", "schoolstloc_raw_MA", "schoolfed_raw_MA", "schooltot_raw_MA"]

# school_info_feature_cols = ["ncesenroll", "% Graduated", "fte_staff", "student_teacher_ratio", "avg_salary_teachers"]

Let's inspect our features.

In [335]:
feature_df = master_df[feature_cols]
# demo_feature_df = master_df[demo_feature_cols]
# finance_feature_df = master_df[finance_feature_cols]
# school_info_feature_df = master_df[school_info_feature_cols]
feature_df.head()

Unnamed: 0,ncesenroll,pp_stloc_raw_MA,pp_fed_raw_MA,pp_total_raw_MA,schoolstloc_raw_MA,schoolfed_raw_MA,schooltot_raw_MA,African American,Asian,Hispanic,...,white_staff,native_american_staff,hawaiian_pacific_staff,multi_race_staff,females_staff,males_staff,fte_staff,student_teacher_ratio,avg_salary_teachers,FTE Count
0,545.0,14880.475672,20.812074,14901.287746,8052652.0,11262.569511,8063915.0,3.9,2.4,7.9,...,95.7,0.0,0.0,1.6,72.5,27.5,62,14.4,93861.0,11469788.0
1,672.0,13057.825084,83.417661,13141.242745,8839857.0,56471.903107,8896329.0,2.1,1.8,7.1,...,92.5,0.0,0.0,0.0,77.3,22.7,73,17.6,93861.0,11469788.0
2,294.0,13201.700857,499.583587,13701.284443,3841768.0,145381.59918,3987150.0,2.7,3.1,7.1,...,100.0,0.0,0.0,0.0,88.8,11.2,39,18.0,93861.0,11469788.0
3,1837.0,14634.333856,191.007988,14825.341844,26766280.0,349354.671134,27115630.0,2.0,32.5,3.8,...,96.1,0.5,0.0,0.0,76.1,23.9,195,14.7,86677.0,32945978.0
4,464.0,16005.357328,289.306328,16294.663656,7481971.0,135241.064602,7617212.0,1.3,31.5,5.0,...,95.9,0.0,0.0,0.0,91.4,8.6,79,14.3,86677.0,32945978.0


Let's normalize our columns so that they are all floats between 0 and 1 for easier comparison in our model.

In [336]:
feature_df = feature_df.apply(lambda x: (x - x.min()) / (x.max() - x.min()))
# demo_feature_df = demo_feature_df.apply(lambda x: (x - x.min()) / (x.max() - x.min()))
# finance_feature_df = finance_feature_df.apply(lambda x: (x - x.min()) / (x.max() - x.min()))
# school_info_feature_df = school_info_feature_df.apply(lambda x: (x - x.min()) / (x.max() - x.min()))
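
The min-max formula used above maps each column's minimum to 0 and its maximum to 1. The following is a minimal sketch on made-up data (the `toy` frame and its column names are illustrative assumptions, not part of `master_df`). It also shows one edge case worth knowing about: a constant column has a zero denominator and comes out as NaN, which is the kind of problem the null check in the next cell would surface.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
toy = pd.DataFrame({
    "enroll": [100.0, 300.0, 500.0],
    "constant": [7.0, 7.0, 7.0],
})

# Same min-max formula as in the cell above.
toy_scaled = toy.apply(lambda x: (x - x.min()) / (x.max() - x.min()))

print(toy_scaled["enroll"].tolist())        # smallest value maps to 0, largest to 1
print(toy_scaled["constant"].isna().all())  # 0 / 0 yields NaN for a constant column
```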

In [337]:
# Look for any rows that contain null values
feature_df[feature_df.isnull().any(axis=1)]

Unnamed: 0,ncesenroll,pp_stloc_raw_MA,pp_fed_raw_MA,pp_total_raw_MA,schoolstloc_raw_MA,schoolfed_raw_MA,schooltot_raw_MA,African American,Asian,Hispanic,...,white_staff,native_american_staff,hawaiian_pacific_staff,multi_race_staff,females_staff,males_staff,fte_staff,student_teacher_ratio,avg_salary_teachers,FTE Count


Now that we have our features, let's create our train/test split.

In [338]:
X = feature_df
y = master_df["avg_score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# X_demo = demo_feature_df
# X_finance = finance_feature_df
# X_school_info = school_info_feature_df

# X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(X_demo, y, test_size=0.2, random_state=42)
# X_train_finance, X_test_finance, y_train_finance, y_test_finance = train_test_split(X_finance, y, test_size=0.2, random_state=42)
# X_train_school_info, X_test_school_info, y_train_school_info, y_test_school_info = train_test_split(X_school_info, y, test_size=0.2, random_state=42)
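
To make the effect of `test_size=0.2` and `random_state=42` concrete, here is a small sketch on made-up arrays (the names `X_toy` and `y_toy` are assumptions for illustration). It shows the 80/20 row split and that a fixed `random_state` reproduces the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)  # 8 training rows and 2 test rows

# Re-running with the same random_state gives the same partition.
_, X_te_again, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print(np.array_equal(X_te, X_te_again))
```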

Let's also set up a k-fold cross-validation object to use in the grid searches below.

In [339]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)
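
As a rough illustration of what this object does, the sketch below runs the same `KFold` settings over ten made-up rows (`X_toy` and `kf_toy` are hypothetical names). Each fold holds out a different fifth of the data, and every row is used for validation exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

kf_toy = KFold(n_splits=5, shuffle=True, random_state=42)
X_toy = np.arange(10).reshape(10, 1)  # hypothetical toy data

val_indices = []
for train_idx, val_idx in kf_toy.split(X_toy):
    # Each fold trains on 8 rows and validates on the remaining 2.
    print(len(train_idx), len(val_idx))
    val_indices.extend(val_idx.tolist())

# Every row is held out exactly once across the 5 folds.
print(sorted(val_indices) == list(range(10)))
```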

## 3.3 Lasso Model

### 3.3.0 Lasso Model Creation

Let's create a Lasso model and see how it performs.

In [340]:
lasso = Lasso()

param_grid_lasso = {
    'alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'max_iter': [10, 25, 50, 75, 100, 1000]
}

Let's use a GridSearchCV to find the best `alpha` and `max_iter` values for our Lasso model.

In [341]:
lasso_model = GridSearchCV(lasso, param_grid_lasso, cv=kf)

lasso_model.fit(X_train, y_train)

  model = cd_fast.enet_coordinate_descent(

Let's print the best hyperparameters.

In [342]:
print(lasso_model.best_params_)

{'alpha': 0.01, 'max_iter': 50}
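
Very small `max_iter` values stop coordinate descent before it has converged, which is most likely what the repeated `cd_fast.enet_coordinate_descent` lines in the fit output point to. One alternative, sketched here under stated assumptions, is to fix `max_iter` at a generous value and tune only `alpha`, while wrapping the scaler in a `Pipeline` so it is refit on each training fold. The synthetic data from `make_regression` is a stand-in for the school data, and this is not the setup used in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the school data (an assumption for illustration).
X_syn, y_syn = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

pipe = Pipeline([
    ("scale", MinMaxScaler()),          # refit on each training fold
    ("lasso", Lasso(max_iter=10_000)),  # fixed, generous iteration cap
])

# Only alpha is tuned; "lasso__alpha" addresses the Lasso step of the pipeline.
alphas = [0.001, 0.01, 0.1, 1, 10]
grid = GridSearchCV(
    pipe,
    {"lasso__alpha": alphas},
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
grid.fit(X_syn, y_syn)

print(grid.best_params_)
print(round(grid.best_score_, 3))
```

Treating `max_iter` as a convergence budget rather than a hyperparameter keeps the search focused on the regularization strength.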


### 3.3.1 Lasso Model Evaluation

Let's see how our Lasso model performs on the training and test sets. For a regressor, `score` returns the R^2 value.

In [343]:
print("Training set score: {:.2f}".format(lasso_model.score(X_train, y_train)))
print("Test set score: {:.2f}".format(lasso_model.score(X_test, y_test)))

Training set score: 0.75
Test set score: 0.73


Let's look at the features with the largest positive coefficients.

In [344]:
# Get model's top features
coefs_lasso = pd.DataFrame(lasso_model.best_estimator_.coef_, index=X.columns, columns=["coef"])
coefs_lasso.sort_values(by="coef", ascending=False).head(10)

Unnamed: 0,coef
Asian,13.479807
Hispanic,7.722445
fte_staff,7.173833
% Graduated,4.327504
females_staff,2.824928
"Multi-Race, Non-Hispanic",2.506272
avg_salary_teachers,2.117763
pp_total_raw_MA,1.610514
white_staff,1.24159
ncesenroll,0.0
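
Sorting by the raw coefficient only surfaces the largest positive values, so strongly negative features do not appear in this view. A possible complement, sketched with made-up coefficients (`coefs_toy` and its feature names are hypothetical and not taken from the fitted model), is to rank by absolute value and list the features that were shrunk to exactly zero.

```python
import pandas as pd

# Hypothetical coefficients, not taken from the fitted model.
coefs_toy = pd.DataFrame(
    {"coef": [3.0, -12.5, 0.0, 7.2]},
    index=["feature_a", "feature_b", "feature_c", "feature_d"],
)

# Rank by magnitude so strong negative effects are not hidden.
order = coefs_toy["coef"].abs().sort_values(ascending=False).index
ranked = coefs_toy.reindex(order)
print(ranked)

# Features whose coefficient is exactly zero (dropped by the L1 penalty).
zeroed = coefs_toy.index[coefs_toy["coef"] == 0].tolist()
print(zeroed)
```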


Let's evaluate the mean squared error and r-squared value for our Lasso model.

In [345]:
y_pred_lasso = lasso_model.predict(X_test)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
rmse_lasso = np.sqrt(mse_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)

print('MSE:', mse_lasso)
print('RMSE:', rmse_lasso)
print('R^2:', r2_lasso)

MSE: 27.58828154236
RMSE: 5.252454811072629
R^2: 0.7277915357441027
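
For reference, the three metrics above can be computed by hand. The sketch below uses four made-up values (`y_true` and `y_hat` are illustrative assumptions) and checks the manual formulas against the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# MSE is the mean squared residual; RMSE is its square root (same units as the target).
mse_manual = np.mean((y_true - y_hat) ** 2)
rmse_manual = np.sqrt(mse_manual)

# R^2 compares the residual sum of squares to the variance around the mean.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```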


## 3.4 Ridge Regression Model
Let's compare our Lasso model with a Ridge model.

### 3.4.0 Ridge Model Creation

In [346]:
ridge = Ridge()

param_grid_ridge = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}

Let's use a GridSearchCV to find the best hyperparameters for our Ridge model.

In [347]:
ridge_model = GridSearchCV(ridge, param_grid_ridge, cv=kf)

ridge_model.fit(X_train, y_train)

Let's inspect the best hyperparameters and the corresponding mean cross-validated R^2 score.

In [348]:
print("Best hyperparameters: ", ridge_model.best_params_)
print("Best cross-validated R^2 score: ", ridge_model.best_score_)

Best hyperparameters:  {'alpha': 1.0}
Best cross-validated R^2 score:  0.7252196287353905
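
Note that `best_score_` is not a classification accuracy: with the default scoring for a regressor, it is the mean R^2 over the validation folds for the best parameter setting. A minimal sketch on synthetic data (the `make_regression` data and the `_syn` names are assumptions for illustration) confirms this by recomputing the score with `cross_val_score` on the same folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in for the school data (an assumption for illustration).
X_syn, y_syn = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=1)
kf_syn = KFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}, cv=kf_syn)
grid.fit(X_syn, y_syn)

# Recompute the per-fold R^2 for the winning alpha on the same folds.
best_alpha = grid.best_params_["alpha"]
manual_scores = cross_val_score(Ridge(alpha=best_alpha), X_syn, y_syn, cv=kf_syn)

print(np.isclose(grid.best_score_, manual_scores.mean()))
```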


### 3.4.1 Ridge Model Evaluation

Let's test the model on training and test data.

In [349]:
print("Training set score: {:.2f}".format(ridge_model.score(X_train, y_train)))
print("Test set score: {:.2f}".format(ridge_model.score(X_test, y_test)))

Training set score: 0.75
Test set score: 0.73


Let's look at the features with the largest positive coefficients.

In [350]:
# Get model's top features
coefs_ridge = pd.DataFrame(ridge_model.best_estimator_.coef_, index=X.columns, columns=["coef"])
coefs_ridge.sort_values(by="coef", ascending=False).head(10)

Unnamed: 0,coef
Asian,10.714379
fte_staff,5.895303
% Graduated,5.285671
Females,5.030708
avg_salary_teachers,3.315996
pp_fed_raw_MA,3.117219
Hispanic,3.053644
schoolstloc_raw_MA,2.247147
native_american_staff,2.062131
females_staff,1.976519


Let's evaluate the mean squared error and r-squared value for our Ridge model.

In [351]:
y_pred_ridge = ridge_model.predict(X_test)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
rmse_ridge = np.sqrt(mse_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)

print('MSE:', mse_ridge)
print('RMSE:', rmse_ridge)
print('R^2:', r2_ridge)

MSE: 27.35194812097305
RMSE: 5.229908997389252
R^2: 0.7301233938407843


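Overall, the Ridge and Lasso models performed almost identically on the test set (R^2 of 0.730 vs. 0.728, RMSE of 5.23 vs. 5.25). The main difference is in the coefficients: Lasso shrinks some of them to exactly zero (for example `ncesenroll`), while the Ridge coefficients are a bit more evenly dispersed across features.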
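
The different ways the two penalties treat coefficients can be reproduced on synthetic data. The sketch below is an illustration only (the `make_regression` data, the `_syn` names, and `alpha=5.0` are assumptions, not values from this notebook): it fits both models on a problem where most features are uninformative and counts how many coefficients end up exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only 5 of the 20 features actually drive the target.
X_syn, y_syn = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

lasso_syn = Lasso(alpha=5.0).fit(X_syn, y_syn)
ridge_syn = Ridge(alpha=5.0).fit(X_syn, y_syn)

# The L1 penalty sets some coefficients exactly to zero; the L2 penalty only shrinks them.
n_zero_lasso = int(np.sum(lasso_syn.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_syn.coef_ == 0))

print("Zero coefficients (Lasso):", n_zero_lasso)
print("Zero coefficients (Ridge):", n_zero_ridge)
```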