# School District Characteristics and Poverty - Machine Learning Models
## Name: Derek Castleman
## Master of Science in Data Science - Thesis Project

In this jupyter notebook I will be presenting the machine learning models that I have developed throughout the project to address the question whether the percent of students living in poverty can be predicted using different features or characterstics of school districts. The data that I am using in this was prepared in a R Markdown notebook in which I determined features of a school district that correlate with the proportion of students living in poverty.

The first thing that I will do is import the csv file that was created in R which will allow for you to see the columns of the dataset.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# I did the latin encoding from reading about an error that I got and why I got it.
# I got the information from https://stackoverflow.com/questions/5552555/unicodedecodeerror-invalid-continuation-byte
project_data = pd.read_csv('ProjectDataTableNewData.csv', encoding='latin-1')
project_data

Unnamed: 0.1,Unnamed: 0,ID,Lea State.x,LEA.x,Area_Population,Children_Poverty,Poverty_Proportion,Student_Prop,District_Population,Non_White_Students,...,Alg2,Geo,AdvMath,Calc,Bio,Chem,Phys,Early_Pass,Late_Pass,Grad_Rate
0,1,100190,AL,Alabaster City School District,34015,860,0.128301,0.178274,6064,0.418371,...,0.000330,0.001484,0.005607,0.000660,0.003133,0.002968,0.000495,0.977716,1.000000,97
1,2,100005,AL,Albertville City School District,21786,1546,0.375699,0.249885,5444,0.557494,...,0.002204,0.001837,0.002572,0.000735,0.007348,0.001653,0.002388,0.954955,0.666667,94
2,3,100030,AL,Alexander City City School District,17073,832,0.312900,0.177590,3032,0.450528,...,0.001649,0.002968,0.000989,0.000000,0.002639,0.001649,0.000000,0.980132,0.000000,89
3,4,100060,AL,Andalusia City School District,8854,386,0.267313,0.200248,1773,0.365482,...,0.003384,0.003384,0.001128,0.000000,0.003384,0.001692,0.000564,0.975410,0.000000,95
4,5,100090,AL,Anniston City School District,22350,1106,0.347362,0.091723,2050,0.933171,...,0.002927,0.002927,0.001463,0.000000,0.003415,0.000976,0.000000,0.868132,0.000000,79
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7572,7573,5602760,WY,Uinta County School District 1,14192,364,0.122683,0.189543,2690,0.202602,...,0.002230,0.002230,0.001487,0.000743,0.005576,0.002230,0.000372,0.697561,0.800000,84
7573,7574,5604260,WY,Uinta County School District 6,3152,30,0.040486,0.234454,739,0.067659,...,0.002706,0.004060,0.000000,0.006766,0.005413,0.001353,0.000000,0.981481,0.000000,79
7574,7575,5606240,WY,Washakie County School District 1,7372,181,0.130970,0.174173,1284,0.295171,...,0.004673,0.006231,0.001558,0.000779,0.007788,0.003894,0.001558,1.000000,0.000000,89
7575,7576,5604830,WY,Weston County School District 1,5465,118,0.145499,0.138884,759,0.117260,...,0.003953,0.006588,0.002635,0.000000,0.003953,0.010540,0.001318,1.000000,1.000000,89


I am going to select the columns from the dataframe that were found to correlate with poverty and set those as the features. And then I will set the poverty proportion as the target values.

I will then scale the data to prepare for the models. And then I will follow this with doing a train and test split on the data, keeping 20% of the data as the testing set and the other 80% as the training set.

In [3]:
features_that_correlate = project_data[['New_Teachers_Proportion', 'Absent_Prop', 'Test_Prop', 'Teams_Prop', 
                                       'Athletes_Prop', 'Takers_Proportion', 'AP_Prop', 
                                        'Expulsion_Prop', 'AVG_Suspenion', 'Alg1', 'Alg2', 'Geo', 'AdvMath', 
                                       'Calc', 'Bio', 'Chem', 'Phys', 'Grad_Rate', 'Non_White_Students']]
target_values = project_data[['Poverty_Proportion']]

In [4]:
from sklearn.preprocessing import StandardScaler

model_scaler = StandardScaler()
scaled_data = model_scaler.fit_transform(features_that_correlate)

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(scaled_data, target_values, test_size=0.2, random_state=42)

## Scoring

To score how well my models perform on the data I will be using the root mean squared error since this is a common way in which you can test how well regression models perform. I also like this form of scoring since it gives insight into the mean of how far your data is off in its predictions with it being in the same units as the target data rather then having the mean squared error which has you dealing with the square of how far the data is off. Also by using the RMSE I can think of the results as being percentages that they are off because the target that they are trying to predict is the proportion (which is basically the percentage) of students that are living in poverty within the district.

I am adapting a function that we used in the California house data to score the models. I am doing a cross validation with five of them since the data set is not extremely large, I do not want to get it so small that it can throw off values.

In [6]:
from sklearn.model_selection import cross_val_score
def display_scores(mode, X_train, y_train):
    model_scores = cross_val_score(model, X_train, y_train,
                         scoring="neg_mean_squared_error", cv=5)
    scores = np.sqrt(-model_scores)
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

## Regression Models with Correlated Data

One of the main reasons I tried this project is I wanted to get practice with the different models that we had learning throughout the program but now use them on regression tasks instead of classification. For this portion I will be looking at the correlated data and I will create basic models of Linear Regression, Random Forest, Gradient Boosting Regressor, KNN Regressor and an SVM Regressor. I will create models with all of the default values and then run them through the display_score function to see how they perform on the data so that I can narrow down what works the best.

In [7]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
display_scores(model, X_train, y_train)

Scores: [0.0678586  0.0673573  0.06655528 0.07158801 0.06822013]
Mean: 0.06831586349100643
Standard deviation: 0.0017287973465112056


In [8]:
x = y_train.to_numpy() #Have to change training data into an array and the shape of it for use in model.
y_train = x.ravel()

In [9]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
display_scores(model, X_train, y_train)

Scores: [0.06138858 0.06015371 0.05997024 0.06141469 0.06202952]
Mean: 0.06099134956159584
Standard deviation: 0.0007948714006737337


In [10]:
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor()
display_scores(model, X_train, y_train)

Scores: [0.06171191 0.05948456 0.0595773  0.06181173 0.06167221]
Mean: 0.06085154084261927
Standard deviation: 0.001079631831262314


In [11]:
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor()
display_scores(model, X_train, y_train)

Scores: [0.06821734 0.06769894 0.0672935  0.06786742 0.06937197]
Mean: 0.06808983505480777
Standard deviation: 0.0007066166815090703


In [12]:
from sklearn.svm import SVR
model = SVR()
display_scores(model, X_train, y_train)

Scores: [0.06980428 0.0685354  0.06806783 0.06892182 0.07049095]
Mean: 0.06916405746328477
Standard deviation: 0.0008747739729145447


### Conclusions

It looks like all the models are performing similar with a root mean squared error that is between 6 and 7 percent. However, it appears that the Random Forest and the Gradient Boosting Regressor are performing a little bit better then the other models with values in the lower 6 percent while the other three are in the upper.

## All Features

For this next portion I want to see if using all the features (even the ones that are not correlated) could increase the performance of each of the models. Once again I will test it on each of the models in the same manner that I did above looking for not only which models work the best on the data but also determing if it is better to go ahead with looking at all of the features and not just the correlated ones. I will first drop the values from the dataset that do not correspond to proportions that were calculated as features of the district and then I will prepare the data in a similar manner as above.

In [13]:
project_data = pd.read_csv('ProjectDataTableNewData.csv', encoding='latin-1')
project_data

Unnamed: 0.1,Unnamed: 0,ID,Lea State.x,LEA.x,Area_Population,Children_Poverty,Poverty_Proportion,Student_Prop,District_Population,Non_White_Students,...,Alg2,Geo,AdvMath,Calc,Bio,Chem,Phys,Early_Pass,Late_Pass,Grad_Rate
0,1,100190,AL,Alabaster City School District,34015,860,0.128301,0.178274,6064,0.418371,...,0.000330,0.001484,0.005607,0.000660,0.003133,0.002968,0.000495,0.977716,1.000000,97
1,2,100005,AL,Albertville City School District,21786,1546,0.375699,0.249885,5444,0.557494,...,0.002204,0.001837,0.002572,0.000735,0.007348,0.001653,0.002388,0.954955,0.666667,94
2,3,100030,AL,Alexander City City School District,17073,832,0.312900,0.177590,3032,0.450528,...,0.001649,0.002968,0.000989,0.000000,0.002639,0.001649,0.000000,0.980132,0.000000,89
3,4,100060,AL,Andalusia City School District,8854,386,0.267313,0.200248,1773,0.365482,...,0.003384,0.003384,0.001128,0.000000,0.003384,0.001692,0.000564,0.975410,0.000000,95
4,5,100090,AL,Anniston City School District,22350,1106,0.347362,0.091723,2050,0.933171,...,0.002927,0.002927,0.001463,0.000000,0.003415,0.000976,0.000000,0.868132,0.000000,79
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7572,7573,5602760,WY,Uinta County School District 1,14192,364,0.122683,0.189543,2690,0.202602,...,0.002230,0.002230,0.001487,0.000743,0.005576,0.002230,0.000372,0.697561,0.800000,84
7573,7574,5604260,WY,Uinta County School District 6,3152,30,0.040486,0.234454,739,0.067659,...,0.002706,0.004060,0.000000,0.006766,0.005413,0.001353,0.000000,0.981481,0.000000,79
7574,7575,5606240,WY,Washakie County School District 1,7372,181,0.130970,0.174173,1284,0.295171,...,0.004673,0.006231,0.001558,0.000779,0.007788,0.003894,0.001558,1.000000,0.000000,89
7575,7576,5604830,WY,Weston County School District 1,5465,118,0.145499,0.138884,759,0.117260,...,0.003953,0.006588,0.002635,0.000000,0.003953,0.010540,0.001318,1.000000,1.000000,89


In [14]:
# I am just dropping the columns that do not represent the proportions that were calculated as features.
features = project_data.drop(['Unnamed: 0', 'ID', 'Lea State.x', 'LEA.x', 'Area_Population', 'District_Population', 
                              'Children_Poverty', 'Poverty_Proportion'], axis=1)

In [15]:
features #Checking the features to make sure all are here.

Unnamed: 0,Student_Prop,Non_White_Students,New_Teachers_Proportion,Absent_Teacher_Proportion,Counselor_Ratio,Student_Teacher_Ratio,Absent_Prop,Test_Prop,Teams_Prop,Athletes_Prop,...,Alg2,Geo,AdvMath,Calc,Bio,Chem,Phys,Early_Pass,Late_Pass,Grad_Rate
0,0.178274,0.418371,0.024893,0.181719,0.002309,15.095091,0.101418,0.082124,0.004947,0.118734,...,0.000330,0.001484,0.005607,0.000660,0.003133,0.002968,0.000495,0.977716,1.000000,97
1,0.249885,0.557494,0.157051,0.288462,0.002572,17.448718,0.134276,0.066679,0.003490,0.050514,...,0.002204,0.001837,0.002572,0.000735,0.007348,0.001653,0.002388,0.954955,0.666667,94
2,0.177590,0.450528,0.134293,0.450839,0.002639,14.541966,0.110158,0.074868,0.007586,0.130937,...,0.001649,0.002968,0.000989,0.000000,0.002639,0.001649,0.000000,0.980132,0.000000,89
3,0.200248,0.365482,0.044248,0.221239,0.002256,15.690265,0.151720,0.071066,0.002820,0.090243,...,0.003384,0.003384,0.001128,0.000000,0.003384,0.001692,0.000564,0.975410,0.000000,95
4,0.091723,0.933171,0.082065,0.246195,0.003659,15.293942,0.182439,0.052195,0.009756,0.119024,...,0.002927,0.002927,0.001463,0.000000,0.003415,0.000976,0.000000,0.868132,0.000000,79
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7572,0.189543,0.202602,0.051765,0.263529,0.003439,12.658824,0.018959,0.050929,0.008178,0.178810,...,0.002230,0.002230,0.001487,0.000743,0.005576,0.002230,0.000372,0.697561,0.800000,84
7573,0.234454,0.067659,0.044998,0.119994,0.002706,11.084446,0.017591,0.060893,0.010825,0.217862,...,0.002706,0.004060,0.000000,0.006766,0.005413,0.001353,0.000000,0.981481,0.000000,79
7574,0.174173,0.295171,0.041203,0.214091,0.003559,10.580964,0.017134,0.032710,0.014798,0.186137,...,0.004673,0.006231,0.001558,0.000779,0.007788,0.003894,0.001558,1.000000,0.000000,89
7575,0.138884,0.117260,0.068027,0.384626,0.003953,10.326531,0.018445,0.127800,0.002635,0.093544,...,0.003953,0.006588,0.002635,0.000000,0.003953,0.010540,0.001318,1.000000,1.000000,89


In [16]:
target_values = project_data[['Poverty_Proportion']]

In [17]:
scaled_data = model_scaler.fit_transform(features)
X_train, X_test, y_train, y_test = train_test_split(scaled_data, target_values, test_size=0.2, random_state=42)

## Regression Model using All Features

In [18]:
model = LinearRegression()
display_scores(model, X_train, y_train)

Scores: [0.06771566 0.06681499 0.06935738 0.07114569 0.0681126 ]
Mean: 0.06862926373813431
Standard deviation: 0.001500382897369316


In [19]:
x = y_train.to_numpy() 
y_train = x.ravel()

In [20]:
model = RandomForestRegressor()
display_scores(model, X_train, y_train)

Scores: [0.06118011 0.05885639 0.05918467 0.06071037 0.06114498]
Mean: 0.060215304984647486
Standard deviation: 0.0009948957736487668


In [21]:
model = GradientBoostingRegressor()
display_scores(model, X_train, y_train)

Scores: [0.06041076 0.05885083 0.05885599 0.06067022 0.06080572]
Mean: 0.05991870321667767
Standard deviation: 0.0008790229257199773


In [22]:
model = KNeighborsRegressor()
display_scores(model, X_train, y_train)

Scores: [0.06903249 0.06801224 0.06714035 0.06693313 0.06892291]
Mean: 0.06800822513072499
Standard deviation: 0.0008711832394946761


In [23]:
model = SVR()
display_scores(model, X_train, y_train)

Scores: [0.06878402 0.06841306 0.06766052 0.0676359  0.06890413]
Mean: 0.06827952341444632
Standard deviation: 0.0005403521235203433


### Conclusions:

The first main thing that is obvious in this second round is that once again the Random Forest Regressor and the Gradient Boosting Regressor are the models that perform the best on the data. There is also a slight increase in performance in each of these two models with the use of all of the features (both in the mean and the standard of deviation). Therefore, as I move forward with the project I will look at all of the features instead of just the ones that were correlated with it.

## Hyperparameter Fine Tuning - Using All Features

The two best models that I came across were the Random Forest Regressor and the Gradient Boosting Regressor. Now I am going to fine tune the hyperparameters for each of these models to improve on their performance. I will be using a Randomized Search for the best hyperparameters (I had done a closer Grid Search in previous versions of this but found that they do not lead to large enough improvement in model performance). I will use the neg_mean_squared_error as the scoring since that relates to the performance measure of the models that I am looking at. I will also run the same scoring function that I used above on each one of them to see how the models perform with the hyperparameters that are found to be best.

At the end I will also do a Voting Regressor using both of the models to see if that can improve the performance. I am mainly doing this because I want to gain practice with using a Voting Regressor.

## Random Forest Regressor

In [24]:
from sklearn.model_selection import RandomizedSearchCV

In [25]:
# Creating a range of parameters to feed to the RandomizedSearch for the different hyperparameters.
n_estimators = [int(x) for x in np.linspace(start = 50, stop = 2000, num = 50)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(2, 110, num = 40)]
max_depth.append(None)
min_samples_split = [2, 5, 10, 15, 20]
min_samples_leaf = [1, 2, 4, 6, 10]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [26]:
rf = RandomForestRegressor()
# Random search of parameters, using 3 fold cross validation, 
# search across 150 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 150, cv = 3, verbose=2,
                               random_state=42, n_jobs = -1, scoring = "neg_mean_squared_error")# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_params_

Fitting 3 folds for each of 150 candidates, totalling 450 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed: 15.3min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed: 40.4min
[Parallel(n_jobs=-1)]: Done 450 out of 450 | elapsed: 54.0min finished


{'n_estimators': 1960,
 'min_samples_split': 2,
 'min_samples_leaf': 2,
 'max_features': 'auto',
 'max_depth': 93,
 'bootstrap': True}

In [29]:
# Instantiating an object of Random Forest Regressor using the best hyperparameters
rfr_optimal_model = RandomForestRegressor(**rf_random.best_params_, random_state = 42)

In [30]:
rfr_optimal_model.fit(X_train, y_train)
display_scores(rfr_random_model, X_train, y_train)

Scores: [0.06092507 0.05890451 0.0594177  0.0608506  0.06105405]
Mean: 0.060230384321092255
Standard deviation: 0.0008904011277949548


In [49]:
from sklearn.metrics import mean_squared_error

In [31]:
# Looking at how the optimal model performs on the test data.
y_pred = rfr_optimal_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
rmse

0.05718448417941566

In [32]:
rfr_optimal_model.feature_importances_

array([0.02440351, 0.25794706, 0.01514752, 0.01489145, 0.01786203,
       0.02375396, 0.14410893, 0.02365963, 0.01341498, 0.01674786,
       0.0139819 , 0.13899307, 0.07968096, 0.01104817, 0.0081146 ,
       0.00350912, 0.00406568, 0.00477322, 0.00466723, 0.01583635,
       0.0134614 , 0.01209124, 0.01485711, 0.0242965 , 0.01390099,
       0.01858088, 0.02434451, 0.01882153, 0.00883027, 0.01420838])

### Conclusions:

The improvement on the model with the optimal hyperparameters is very slight from 0.06099 to 0.06023. However, the model did perform slightly better when used on the testing data with an error of 0.05718. What is fascinating is looking at the features importance to see the dominance of the proportion of minority students (0.2579), chronic absenteeism (0.1441) and AP students (0.1390) which are far ahead of all of the other features in importance. This gives an understanding of key differences that influence the model between districts based off poverty.

## Gradient Boosting Regressor

In [33]:
n_estimators = [int(x) for x in np.linspace(start = 50, stop = 2000, num = 50)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(2, 110, num = 40)]
max_depth.append(None)
min_samples_split = [2, 5, 10, 15, 20]
min_samples_leaf = [1, 2, 4, 6, 10]
learning_rate = [0.01, 0.05, 0.1, 0.25, 0.40, 0.50, 0.75, 1.0]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'learning_rate': learning_rate}

In [152]:
gb = GradientBoostingRegressor()
gb_random = RandomizedSearchCV(estimator = gb, param_distributions = random_grid, n_iter = 150, cv = 3, verbose=2,
                               random_state=42, n_jobs = -1, scoring = "neg_mean_squared_error")
gb_random.fit(X_train, y_train)
gb_random.best_params_

Fitting 3 folds for each of 150 candidates, totalling 450 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed: 18.2min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed: 49.6min
[Parallel(n_jobs=-1)]: Done 450 out of 450 | elapsed: 63.9min finished


{'n_estimators': 845,
 'min_samples_split': 5,
 'min_samples_leaf': 10,
 'max_features': 'sqrt',
 'max_depth': 71,
 'learning_rate': 0.01}

In [192]:
gbr_random_model = GradientBoostingRegressor(**gb_random.best_params_, random_state = 42)

In [193]:
gbr_random_model.fit(X_train, y_train)
display_scores(gbr_random_model, X_train, y_train)

Scores: [0.06080159 0.05899833 0.05938647 0.06070843 0.06110608]
Mean: 0.060200180060421715
Standard deviation: 0.0008422883789050026


In [194]:
y_pred = gbr_random_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
rmse

0.05653963717971819

In [195]:
gbr_random_model.feature_importances_

array([0.02516741, 0.15891325, 0.01786374, 0.01667344, 0.01963381,
       0.02564509, 0.11722523, 0.02462765, 0.02165939, 0.03621287,
       0.02917294, 0.07792015, 0.09843921, 0.01373602, 0.00889157,
       0.0048129 , 0.00530371, 0.00673861, 0.00562041, 0.02378027,
       0.01547811, 0.01517903, 0.01696589, 0.05294159, 0.01704297,
       0.02508646, 0.04286709, 0.02334704, 0.01170981, 0.04134434])

### Conclusions:

The Gradient Boosting Regressor also showed a slight decrease in performance from 0.05991 to 0.06020 but it very little. However, like the Random Forest it showed a better performance on the test data at 0.05654. It once again showed that minority students and chronic absenteeism were the most important features but their values were much lower and the third important feature was the suspension proportion instead of AP students.

## Voting Regressor

In [189]:
from sklearn.ensemble import VotingRegressor
rfr_voting = RandomForestRegressor(**rf_random.best_params_, random_state = 42)
gbr_voting = GradientBoostingRegressor(**gb_random.best_params_, random_state = 42)

In [190]:
voting_full_data = VotingRegressor(estimators = [('rfr', rfr_voting), ('gbr', gbr_voting)])
voting_full_data.fit(X_train, y_train)
display_scores(voting_full_data, X_train, y_train)

Scores: [0.06094624 0.05854554 0.05953887 0.06086675 0.06097258]
Mean: 0.06017399698276429
Standard deviation: 0.0009766536384941866


In [191]:
y_pred = voting_full_data.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
rmse

0.05643399443244786

### Conclusions:

The voting regressor led to a slight increase in the final models performance. It had a final error on the testing data of 0.05643 which represents it being off by roughly 5.64% on average when trying to predict the students living in poverty. With a range that goes from the low 1% to the upper 60% for poverty this mean could be good or bad based off the districts that you are looking at.


## Generating Regional Data

Beyond looking how machine learning models work on the national level I wanted to see if performance could be increased by looking at regional and state levels as well trying to get to smaller and smaller levels since school districts are generally funded by state and local taxes. In this next part I will define each of the four major regions (Northeast, Midwest, South and West) as well as generate csv files for each one to perform correlations on them using R.

In [25]:
project_data = pd.read_csv('ProjectDataTableNewData.csv', encoding='latin-1')
project_data

Unnamed: 0.1,Unnamed: 0,ID,Lea State.x,LEA.x,Area_Population,Children_Poverty,Poverty_Proportion,Student_Prop,District_Population,Non_White_Students,...,Alg2,Geo,AdvMath,Calc,Bio,Chem,Phys,Early_Pass,Late_Pass,Grad_Rate
0,1,100190,AL,Alabaster City School District,34015,860,0.128301,0.178274,6064,0.418371,...,0.000330,0.001484,0.005607,0.000660,0.003133,0.002968,0.000495,0.977716,1.000000,97
1,2,100005,AL,Albertville City School District,21786,1546,0.375699,0.249885,5444,0.557494,...,0.002204,0.001837,0.002572,0.000735,0.007348,0.001653,0.002388,0.954955,0.666667,94
2,3,100030,AL,Alexander City City School District,17073,832,0.312900,0.177590,3032,0.450528,...,0.001649,0.002968,0.000989,0.000000,0.002639,0.001649,0.000000,0.980132,0.000000,89
3,4,100060,AL,Andalusia City School District,8854,386,0.267313,0.200248,1773,0.365482,...,0.003384,0.003384,0.001128,0.000000,0.003384,0.001692,0.000564,0.975410,0.000000,95
4,5,100090,AL,Anniston City School District,22350,1106,0.347362,0.091723,2050,0.933171,...,0.002927,0.002927,0.001463,0.000000,0.003415,0.000976,0.000000,0.868132,0.000000,79
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7572,7573,5602760,WY,Uinta County School District 1,14192,364,0.122683,0.189543,2690,0.202602,...,0.002230,0.002230,0.001487,0.000743,0.005576,0.002230,0.000372,0.697561,0.800000,84
7573,7574,5604260,WY,Uinta County School District 6,3152,30,0.040486,0.234454,739,0.067659,...,0.002706,0.004060,0.000000,0.006766,0.005413,0.001353,0.000000,0.981481,0.000000,79
7574,7575,5606240,WY,Washakie County School District 1,7372,181,0.130970,0.174173,1284,0.295171,...,0.004673,0.006231,0.001558,0.000779,0.007788,0.003894,0.001558,1.000000,0.000000,89
7575,7576,5604830,WY,Weston County School District 1,5465,118,0.145499,0.138884,759,0.117260,...,0.003953,0.006588,0.002635,0.000000,0.003953,0.010540,0.001318,1.000000,1.000000,89


In [26]:
southern_data = project_data.loc[(project_data['Lea State.x'] == 'MD') |(project_data['Lea State.x'] == 'GA')|
                                (project_data['Lea State.x'] == 'DE')|(project_data['Lea State.x'] == 'VA')|
                                (project_data['Lea State.x'] == 'WV')|(project_data['Lea State.x'] == 'KY')|
                                (project_data['Lea State.x'] == 'TN')|(project_data['Lea State.x'] == 'NC')|
                                (project_data['Lea State.x'] == 'SC')|(project_data['Lea State.x'] == 'FL')|
                                (project_data['Lea State.x'] == 'AL')|(project_data['Lea State.x'] == 'MS')|
                                (project_data['Lea State.x'] == 'LA')|(project_data['Lea State.x'] == 'AR')|
                                (project_data['Lea State.x'] == 'TX')|(project_data['Lea State.x'] == 'OK')]
southern_data

Unnamed: 0.1,Unnamed: 0,ID,Lea State.x,LEA.x,Area_Population,Children_Poverty,Poverty_Proportion,Student_Prop,District_Population,Non_White_Students,...,Alg2,Geo,AdvMath,Calc,Bio,Chem,Phys,Early_Pass,Late_Pass,Grad_Rate
0,1,100190,AL,Alabaster City School District,34015,860,0.128301,0.178274,6064,0.418371,...,0.000330,0.001484,0.005607,0.000660,0.003133,0.002968,0.000495,0.977716,1.000000,97
1,2,100005,AL,Albertville City School District,21786,1546,0.375699,0.249885,5444,0.557494,...,0.002204,0.001837,0.002572,0.000735,0.007348,0.001653,0.002388,0.954955,0.666667,94
2,3,100030,AL,Alexander City City School District,17073,832,0.312900,0.177590,3032,0.450528,...,0.001649,0.002968,0.000989,0.000000,0.002639,0.001649,0.000000,0.980132,0.000000,89
3,4,100060,AL,Andalusia City School District,8854,386,0.267313,0.200248,1773,0.365482,...,0.003384,0.003384,0.001128,0.000000,0.003384,0.001692,0.000564,0.975410,0.000000,95
4,5,100090,AL,Anniston City School District,22350,1106,0.347362,0.091723,2050,0.933171,...,0.002927,0.002927,0.001463,0.000000,0.003415,0.000976,0.000000,0.868132,0.000000,79
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7235,7236,5401530,WV,Webster County School District,8372,477,0.372365,0.162685,1362,0.019824,...,0.004405,0.004405,0.004405,0.001468,0.007342,0.001468,0.000734,0.857143,0.300000,95
7236,7237,5401560,WV,Wetzel County School District,15437,587,0.262875,0.165188,2550,0.032549,...,0.005098,0.004706,0.004706,0.001569,0.005882,0.002745,0.000392,0.891892,0.000000,95
7237,7238,5401590,WV,Wirt County School District,5794,229,0.242842,0.178460,1034,0.023211,...,0.005803,0.007737,0.002901,0.000967,0.004836,0.004836,0.000967,0.750000,0.000000,89
7238,7239,5401620,WV,Wood County School District,85104,2965,0.227482,0.148982,12679,0.065857,...,0.004574,0.005205,0.002524,0.000552,0.006704,0.003155,0.001735,0.838235,0.833333,89


In [27]:
southern_data.to_csv('southern_data.csv')

In [28]:
northeast_data = project_data.loc[(project_data['Lea State.x'] == 'ME') |(project_data['Lea State.x'] == 'NH')|
                                (project_data['Lea State.x'] == 'VT')|(project_data['Lea State.x'] == 'NY')|
                                (project_data['Lea State.x'] == 'MA')|(project_data['Lea State.x'] == 'CT')|
                                (project_data['Lea State.x'] == 'RI')|(project_data['Lea State.x'] == 'PA')|
                                (project_data['Lea State.x'] == 'NJ')]
northeast_data

Unnamed: 0.1,Unnamed: 0,ID,Lea State.x,LEA.x,Area_Population,Children_Poverty,Poverty_Proportion,Student_Prop,District_Population,Non_White_Students,...,Alg2,Geo,AdvMath,Calc,Bio,Chem,Phys,Early_Pass,Late_Pass,Grad_Rate
818,819,900060,CT,Ansonia School District,19210,605,0.196940,0.118220,2271,0.625716,...,0.003523,0.004403,0.002642,0.000440,0.003963,0.002642,0.000881,0.872549,1.00,89
819,820,900120,CT,Avon School District,18118,122,0.033370,0.177006,3207,0.331462,...,0.004054,0.004677,0.005301,0.001871,0.004365,0.003742,0.004989,0.972973,1.00,96
820,821,900210,CT,Berlin School District,19903,147,0.047115,0.139828,2783,0.186849,...,0.003953,0.005031,0.007186,0.001078,0.005749,0.005390,0.002156,0.838710,1.00,94
821,822,900270,CT,Bethel School District,19249,211,0.065548,0.158294,3047,0.307516,...,0.003938,0.003610,0.004923,0.001313,0.004923,0.004266,0.002626,0.989305,1.00,93
822,823,900330,CT,Bloomfield School District,20511,320,0.122794,0.104042,2134,0.902062,...,0.004217,0.004217,0.004686,0.002343,0.007029,0.004686,0.002343,0.755396,0.60,89
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5889,5890,4401020,RI,South Kingstown School District,30378,298,0.083357,0.101751,3091,0.168877,...,0.003235,0.003559,0.001618,0.003235,0.008735,0.003559,0.001618,0.902439,0.00,94
5890,5891,4401050,RI,Tiverton School District,15842,161,0.079901,0.118672,1880,0.094149,...,0.002128,0.004255,0.002660,0.001596,0.003723,0.004787,0.001064,0.967033,1.00,89
5891,5892,4401110,RI,Warwick School District,81502,1094,0.104569,0.106967,8718,0.210943,...,0.003900,0.003671,0.000229,0.000574,0.004932,0.004588,0.001491,0.867890,1.00,83
5892,5893,4401140,RI,West Warwick School District,29045,626,0.176686,0.122293,3552,0.251971,...,0.002534,0.003378,0.000845,0.001971,0.005349,0.001408,0.000845,0.941489,0.80,85


In [29]:
northeast_data.to_csv('northeast_data.csv')

In [30]:
midwest_data = project_data.loc[(project_data['Lea State.x'] == 'ND') |(project_data['Lea State.x'] == 'SD')|
                                (project_data['Lea State.x'] == 'NE')|(project_data['Lea State.x'] == 'KS')|
                                (project_data['Lea State.x'] == 'MO')|(project_data['Lea State.x'] == 'IA')|
                                (project_data['Lea State.x'] == 'MN')|(project_data['Lea State.x'] == 'WI')|
                                (project_data['Lea State.x'] == 'MI')|(project_data['Lea State.x'] == 'IL')|
                                (project_data['Lea State.x'] == 'IN')|(project_data['Lea State.x'] == 'OH')]
midwest_data

Unnamed: 0.1,Unnamed: 0,ID,Lea State.x,LEA.x,Area_Population,Children_Poverty,Poverty_Proportion,Student_Prop,District_Population,Non_White_Students,...,Alg2,Geo,AdvMath,Calc,Bio,Chem,Phys,Early_Pass,Late_Pass,Grad_Rate
1230,1231,1700105,IL,A-C Central Community Unit School District 262,2398,47,0.119593,0.193495,464,0.025862,...,0.004310,0.006466,0.004310,0.002155,0.004310,0.004310,0.004310,0.941176,0.000000,90
1231,1232,1701413,IL,Abingdon-Avon Community Unit School District 276,5845,234,0.263811,0.165269,966,0.109731,...,0.004141,0.005176,0.003106,0.004141,0.004141,0.002070,0.001035,0.659574,1.000000,90
1232,1233,1703300,IL,Alden-Hebron School District 19,3098,57,0.122318,0.135894,421,0.273159,...,0.004751,0.004751,0.002375,0.002375,0.007126,0.002375,0.004751,0.866667,0.000000,80
1233,1234,1703600,IL,Alton Community Unit School District 11,48443,1812,0.240223,0.126334,6120,0.443464,...,0.002124,0.002778,0.002614,0.000490,0.003431,0.002614,0.001144,0.000000,0.000000,81
1234,1235,1703810,IL,Annawan Community Unit School District 226,2037,35,0.103245,0.186058,379,0.105541,...,0.002639,0.007916,0.002639,0.000000,0.002639,0.002639,0.002639,0.884615,0.000000,80
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7530,7531,5501230,WI,Wisconsin Heights School District,6436,101,0.099312,0.120261,774,0.100775,...,0.002584,0.005168,0.002584,0.001292,0.011628,0.003876,0.001292,0.956522,0.000000,90
7531,7532,5517070,WI,Wisconsin Rapids School District,34243,685,0.127347,0.148527,5086,0.155525,...,0.005702,0.005309,0.000393,0.002556,0.007668,0.002163,0.004326,0.849498,0.166667,93
7532,7533,5517100,WI,Wittenberg-Birnamwood School District,7776,170,0.127055,0.149434,1162,0.160069,...,0.002582,0.002582,0.004303,0.000861,0.004303,0.005164,0.000861,0.757576,0.636364,94
7533,7534,5517130,WI,Wonewoc-Union Center School District,3011,78,0.164905,0.115908,349,0.097421,...,0.000000,0.000000,0.000000,0.002865,0.008596,0.002865,0.000000,0.000000,0.000000,79


In [31]:
midwest_data.to_csv('midwest_data.csv')

In [32]:
west_data = project_data.loc[(project_data['Lea State.x'] == 'CO') |(project_data['Lea State.x'] == 'ID')|
                                (project_data['Lea State.x'] == 'MT')|(project_data['Lea State.x'] == 'NV')|
                                (project_data['Lea State.x'] == 'UT')|(project_data['Lea State.x'] == 'WY')|
                                (project_data['Lea State.x'] == 'AK')|(project_data['Lea State.x'] == 'CA')|
                                (project_data['Lea State.x'] == 'HI')|(project_data['Lea State.x'] == 'OR')|
                                (project_data['Lea State.x'] == 'WA')|(project_data['Lea State.x'] == 'NM')|
                                (project_data['Lea State.x'] == 'AZ')]
west_data

Unnamed: 0.1,Unnamed: 0,ID,Lea State.x,LEA.x,Area_Population,Children_Poverty,Poverty_Proportion,Student_Prop,District_Population,Non_White_Students,...,Alg2,Geo,AdvMath,Calc,Bio,Chem,Phys,Early_Pass,Late_Pass,Grad_Rate
124,125,200050,AK,Alaska Gateway School District,2295,85,0.212500,0.172985,397,0.670025,...,0.005038,0.007557,0.002519,0.000000,0.007557,0.007557,0.002519,0.909091,0.714286,59
125,126,200007,AK,Aleutians East Borough School District,3370,22,0.106796,0.061721,208,0.903846,...,0.014423,0.009615,0.004808,0.000000,0.014423,0.004808,0.000000,1.000000,1.000000,80
126,127,200180,AK,Anchorage School District,294356,5553,0.109750,0.160992,47389,0.582308,...,0.002047,0.003672,0.001456,0.000549,0.005402,0.002933,0.001773,0.640334,0.371429,81
127,128,200525,AK,Annette Island School District,1525,54,0.192171,0.212459,324,0.944444,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,80
128,129,200020,AK,Bering Strait School District,6164,522,0.306518,0.324789,2002,0.993506,...,0.004496,0.006993,0.001499,0.000000,0.003996,0.000999,0.000999,0.946809,0.714286,79
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7572,7573,5602760,WY,Uinta County School District 1,14192,364,0.122683,0.189543,2690,0.202602,...,0.002230,0.002230,0.001487,0.000743,0.005576,0.002230,0.000372,0.697561,0.800000,84
7573,7574,5604260,WY,Uinta County School District 6,3152,30,0.040486,0.234454,739,0.067659,...,0.002706,0.004060,0.000000,0.006766,0.005413,0.001353,0.000000,0.981481,0.000000,79
7574,7575,5606240,WY,Washakie County School District 1,7372,181,0.130970,0.174173,1284,0.295171,...,0.004673,0.006231,0.001558,0.000779,0.007788,0.003894,0.001558,1.000000,0.000000,89
7575,7576,5604830,WY,Weston County School District 1,5465,118,0.145499,0.138884,759,0.117260,...,0.003953,0.006588,0.002635,0.000000,0.003953,0.010540,0.001318,1.000000,1.000000,89


In [33]:
west_data.to_csv('west_data.csv')

## Regional Data Models

For the next portion of my project I wanted to see if the predictions could be better if we broke the country up into the four distinct regions of Northeast, South, Midwest and Western states. I wanted to see if I could get better model performance since these regions often have quite similar characteristics. For this portion I will replicate what I did with the entire United States by using a Random Forest Regressor, Gradient Boosting Regressor and then trying out a Voting Regressor that incorporates both. I will once again use all the features as I did above and do a RandomizedSearch on each model to find the hyperparameters that would work best on them. I will do this for each of the four regions.


## Northeast Data

### Random Forest Regressor

In [34]:
northeast_data

Unnamed: 0.1,Unnamed: 0,ID,Lea State.x,LEA.x,Area_Population,Children_Poverty,Poverty_Proportion,Student_Prop,District_Population,Non_White_Students,...,Alg2,Geo,AdvMath,Calc,Bio,Chem,Phys,Early_Pass,Late_Pass,Grad_Rate
818,819,900060,CT,Ansonia School District,19210,605,0.196940,0.118220,2271,0.625716,...,0.003523,0.004403,0.002642,0.000440,0.003963,0.002642,0.000881,0.872549,1.00,89
819,820,900120,CT,Avon School District,18118,122,0.033370,0.177006,3207,0.331462,...,0.004054,0.004677,0.005301,0.001871,0.004365,0.003742,0.004989,0.972973,1.00,96
820,821,900210,CT,Berlin School District,19903,147,0.047115,0.139828,2783,0.186849,...,0.003953,0.005031,0.007186,0.001078,0.005749,0.005390,0.002156,0.838710,1.00,94
821,822,900270,CT,Bethel School District,19249,211,0.065548,0.158294,3047,0.307516,...,0.003938,0.003610,0.004923,0.001313,0.004923,0.004266,0.002626,0.989305,1.00,93
822,823,900330,CT,Bloomfield School District,20511,320,0.122794,0.104042,2134,0.902062,...,0.004217,0.004217,0.004686,0.002343,0.007029,0.004686,0.002343,0.755396,0.60,89
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5889,5890,4401020,RI,South Kingstown School District,30378,298,0.083357,0.101751,3091,0.168877,...,0.003235,0.003559,0.001618,0.003235,0.008735,0.003559,0.001618,0.902439,0.00,94
5890,5891,4401050,RI,Tiverton School District,15842,161,0.079901,0.118672,1880,0.094149,...,0.002128,0.004255,0.002660,0.001596,0.003723,0.004787,0.001064,0.967033,1.00,89
5891,5892,4401110,RI,Warwick School District,81502,1094,0.104569,0.106967,8718,0.210943,...,0.003900,0.003671,0.000229,0.000574,0.004932,0.004588,0.001491,0.867890,1.00,83
5892,5893,4401140,RI,West Warwick School District,29045,626,0.176686,0.122293,3552,0.251971,...,0.002534,0.003378,0.000845,0.001971,0.005349,0.001408,0.000845,0.941489,0.80,85


In [35]:
# Dropping the columns that will not be used and setting the targer data.
northeast_features = northeast_data.drop(['Unnamed: 0', 'ID', 'Lea State.x', 'LEA.x', 'Area_Population', 
                                        'District_Population','Poverty_Proportion', 'Children_Poverty'], axis=1)
target_values = northeast_data[['Poverty_Proportion']]
northeast_features

Unnamed: 0,Student_Prop,Non_White_Students,New_Teachers_Proportion,Absent_Teacher_Proportion,Counselor_Ratio,Student_Teacher_Ratio,Absent_Prop,Test_Prop,Teams_Prop,Athletes_Prop,...,Alg2,Geo,AdvMath,Calc,Bio,Chem,Phys,Early_Pass,Late_Pass,Grad_Rate
818,0.118220,0.625716,0.098079,0.398085,0.002686,13.102175,0.170410,0.083223,0.006165,0.097754,...,0.003523,0.004403,0.002642,0.000440,0.003963,0.002642,0.000881,0.872549,1.00,89
819,0.177006,0.331462,0.019616,0.149078,0.002495,12.581404,0.033053,0.116308,0.005613,0.276271,...,0.004054,0.004677,0.005301,0.001871,0.004365,0.003742,0.004989,0.972973,1.00,96
820,0.139828,0.186849,0.102714,0.428392,0.002875,11.620042,0.035214,0.121452,0.014732,0.174991,...,0.003953,0.005031,0.007186,0.001078,0.005749,0.005390,0.002156,0.838710,1.00,94
821,0.158294,0.307516,0.085397,0.370054,0.003282,12.390712,0.050213,0.097801,0.012799,0.235313,...,0.003938,0.003610,0.004923,0.001313,0.004923,0.004266,0.002626,0.989305,1.00,93
822,0.104042,0.902062,0.075922,0.439262,0.003327,11.572668,0.062793,0.138707,0.011246,0.143861,...,0.004217,0.004217,0.004686,0.002343,0.007029,0.004686,0.002343,0.755396,0.60,89
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5889,0.101751,0.168877,0.011485,0.578101,0.002265,11.833844,0.103850,0.113555,0.007117,0.155937,...,0.003235,0.003559,0.001618,0.003235,0.008735,0.003559,0.001618,0.902439,0.00,94
5890,0.118672,0.094149,0.056683,0.463768,0.003723,12.109501,0.167021,0.066489,0.015957,0.225000,...,0.002128,0.004255,0.002660,0.001596,0.003723,0.004787,0.001064,0.967033,1.00,89
5891,0.106967,0.210943,0.042669,0.359904,0.002007,10.782265,0.200849,0.076050,0.010209,0.197178,...,0.003900,0.003671,0.000229,0.000574,0.004932,0.004588,0.001491,0.867890,1.00,83
5892,0.122293,0.251971,0.028307,0.516112,0.003238,11.691518,0.231419,0.091216,0.007038,0.064189,...,0.002534,0.003378,0.000845,0.001971,0.005349,0.001408,0.000845,0.941489,0.80,85


In [36]:
# Preparing the data for use in the models.
scaled_data = model_scaler.fit_transform(northeast_features)
X_train, X_test, y_train, y_test = train_test_split(scaled_data, target_values, test_size=0.2, random_state=42)
x = y_train.to_numpy()
y_train = x.ravel()

In [37]:
n_estimators = [int(x) for x in np.linspace(start = 50, stop = 2000, num = 50)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(2, 110, num = 40)]
max_depth.append(None)
min_samples_split = [2, 5, 10, 15, 20]
min_samples_leaf = [1, 2, 4, 6, 10]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [43]:
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 150, cv = 3, verbose=2,
                               random_state=42, n_jobs = -1, scoring = "neg_mean_squared_error")# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_params_

Fitting 3 folds for each of 150 candidates, totalling 450 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   30.3s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed:  7.0min
[Parallel(n_jobs=-1)]: Done 450 out of 450 | elapsed:  9.2min finished


{'n_estimators': 1243,
 'min_samples_split': 5,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 38,
 'bootstrap': False}

In [47]:
rfr_random_northeast = RandomForestRegressor(**rf_random.best_params_, random_state = 42)
rfr_random_northeast.fit(X_train, y_train)
display_scores(rfr_random_northeast, X_train, y_train)

Scores: [0.04539376 0.04229477 0.04051636 0.04796381 0.04304344]
Mean: 0.04384242962208257
Standard deviation: 0.002588001995133253


In [50]:
y_pred = rfr_random_northeast.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
rmse

0.04070685834757105

In [51]:
rfr_random_northeast.feature_importances_

array([0.02676597, 0.06770747, 0.01148449, 0.01074299, 0.01346386,
       0.01142325, 0.17021549, 0.06103894, 0.0168863 , 0.02860451,
       0.01784962, 0.10705596, 0.09233305, 0.00893481, 0.01030818,
       0.00651335, 0.00690985, 0.0060186 , 0.00634336, 0.01549659,
       0.01510164, 0.01289555, 0.02800722, 0.0205296 , 0.01084249,
       0.02704609, 0.02439082, 0.02402861, 0.0064704 , 0.13459093])

### Gradient Boosting Regressor

In [52]:
n_estimators = [int(x) for x in np.linspace(start = 50, stop = 2000, num = 50)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(2, 110, num = 40)]
max_depth.append(None)
min_samples_split = [2, 5, 10, 15, 20]
min_samples_leaf = [1, 2, 4, 6, 10]
learning_rate = [0.01, 0.05, 0.1, 0.25, 0.40, 0.50, 0.75, 1.0]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'learning_rate': learning_rate}

In [53]:
gb = GradientBoostingRegressor()
gb_random = RandomizedSearchCV(estimator = gb, param_distributions = random_grid, n_iter = 150, cv = 3, verbose=2,
                               random_state=42, n_jobs = -1, scoring = "neg_mean_squared_error")
gb_random.fit(X_train, y_train)
gb_random.best_params_

Fitting 3 folds for each of 150 candidates, totalling 450 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   43.5s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  7.6min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed: 21.1min
[Parallel(n_jobs=-1)]: Done 450 out of 450 | elapsed: 26.5min finished


{'n_estimators': 1562,
 'min_samples_split': 2,
 'min_samples_leaf': 6,
 'max_features': 'sqrt',
 'max_depth': 26,
 'learning_rate': 0.01}

In [56]:
gbr_random_northeast = GradientBoostingRegressor(**gb_random.best_params_, random_state = 42)

In [57]:
gbr_random_northeast.fit(X_train, y_train)
display_scores(gbr_random_northeast, X_train, y_train)

Scores: [0.04590611 0.04291144 0.04074596 0.04736831 0.04353037]
Mean: 0.0440924391858176
Standard deviation: 0.0023207917042230805


In [58]:
y_pred = gbr_random_northeast.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
rmse

0.03975759706495273

In [59]:
gbr_random_northeast.feature_importances_

array([0.02647925, 0.06896381, 0.01001843, 0.0105898 , 0.01265398,
       0.01178187, 0.18074507, 0.07167093, 0.01839191, 0.0293364 ,
       0.01940335, 0.09665616, 0.0834317 , 0.00784626, 0.01159813,
       0.00506447, 0.00640981, 0.00528123, 0.00607987, 0.01465687,
       0.01290427, 0.01301172, 0.02874632, 0.01841697, 0.01012945,
       0.02857994, 0.02268237, 0.0304719 , 0.00659591, 0.13140186])

### Voting Regressor

In [66]:
rfr_voting = RandomForestRegressor(**rf_random.best_params_, random_state = 42)
gbr_voting = GradientBoostingRegressor(**gb_random.best_params_, random_state = 42)

In [67]:
voting_eastern = VotingRegressor(estimators = [('rfr', rfr_voting), ('gbr', gbr_voting)])
voting_eastern.fit(X_train, y_train)
display_scores(voting_eastern, X_train, y_train)

Scores: [0.04555842 0.04309762 0.04050442 0.04715188 0.04349958]
Mean: 0.043962385979225596
Standard deviation: 0.0022644967343165364


In [68]:
y_pred = voting_eastern.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
rmse

0.04008576260964692

### Conclusions:

When it comes to looking at the Northeastern data there is an increase in the performance of the model. Where the United States showed a mean error of being off by about 5.64% when it comes to predicting the poverty level, I found with the end model results using the Voting Regressor that the Northeastern data produced models that were only off by 4.01%. There is also a change in feature importances with Chronic Absenteeism being the most important followed by the Grad Rate and then AP students, showing that the proportion of minority students which was the top nationally is not as important in this region.

## Southern Data

In [69]:
southern_data

Unnamed: 0.1,Unnamed: 0,ID,Lea State.x,LEA.x,Area_Population,Children_Poverty,Poverty_Proportion,Student_Prop,District_Population,Non_White_Students,...,Alg2,Geo,AdvMath,Calc,Bio,Chem,Phys,Early_Pass,Late_Pass,Grad_Rate
0,1,100190,AL,Alabaster City School District,34015,860,0.128301,0.178274,6064,0.418371,...,0.000330,0.001484,0.005607,0.000660,0.003133,0.002968,0.000495,0.977716,1.000000,97
1,2,100005,AL,Albertville City School District,21786,1546,0.375699,0.249885,5444,0.557494,...,0.002204,0.001837,0.002572,0.000735,0.007348,0.001653,0.002388,0.954955,0.666667,94
2,3,100030,AL,Alexander City City School District,17073,832,0.312900,0.177590,3032,0.450528,...,0.001649,0.002968,0.000989,0.000000,0.002639,0.001649,0.000000,0.980132,0.000000,89
3,4,100060,AL,Andalusia City School District,8854,386,0.267313,0.200248,1773,0.365482,...,0.003384,0.003384,0.001128,0.000000,0.003384,0.001692,0.000564,0.975410,0.000000,95
4,5,100090,AL,Anniston City School District,22350,1106,0.347362,0.091723,2050,0.933171,...,0.002927,0.002927,0.001463,0.000000,0.003415,0.000976,0.000000,0.868132,0.000000,79
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7235,7236,5401530,WV,Webster County School District,8372,477,0.372365,0.162685,1362,0.019824,...,0.004405,0.004405,0.004405,0.001468,0.007342,0.001468,0.000734,0.857143,0.300000,95
7236,7237,5401560,WV,Wetzel County School District,15437,587,0.262875,0.165188,2550,0.032549,...,0.005098,0.004706,0.004706,0.001569,0.005882,0.002745,0.000392,0.891892,0.000000,95
7237,7238,5401590,WV,Wirt County School District,5794,229,0.242842,0.178460,1034,0.023211,...,0.005803,0.007737,0.002901,0.000967,0.004836,0.004836,0.000967,0.750000,0.000000,89
7238,7239,5401620,WV,Wood County School District,85104,2965,0.227482,0.148982,12679,0.065857,...,0.004574,0.005205,0.002524,0.000552,0.006704,0.003155,0.001735,0.838235,0.833333,89


In [70]:
southern_features = southern_data.drop(['Unnamed: 0', 'ID', 'Lea State.x', 'LEA.x', 'Area_Population', 
                                        'District_Population','Poverty_Proportion', 'Children_Poverty'], axis=1)
target_values = southern_data[['Poverty_Proportion']]

In [71]:
southern_features

Unnamed: 0,Student_Prop,Non_White_Students,New_Teachers_Proportion,Absent_Teacher_Proportion,Counselor_Ratio,Student_Teacher_Ratio,Absent_Prop,Test_Prop,Teams_Prop,Athletes_Prop,...,Alg2,Geo,AdvMath,Calc,Bio,Chem,Phys,Early_Pass,Late_Pass,Grad_Rate
0,0.178274,0.418371,0.024893,0.181719,0.002309,15.095091,0.101418,0.082124,0.004947,0.118734,...,0.000330,0.001484,0.005607,0.000660,0.003133,0.002968,0.000495,0.977716,1.000000,97
1,0.249885,0.557494,0.157051,0.288462,0.002572,17.448718,0.134276,0.066679,0.003490,0.050514,...,0.002204,0.001837,0.002572,0.000735,0.007348,0.001653,0.002388,0.954955,0.666667,94
2,0.177590,0.450528,0.134293,0.450839,0.002639,14.541966,0.110158,0.074868,0.007586,0.130937,...,0.001649,0.002968,0.000989,0.000000,0.002639,0.001649,0.000000,0.980132,0.000000,89
3,0.200248,0.365482,0.044248,0.221239,0.002256,15.690265,0.151720,0.071066,0.002820,0.090243,...,0.003384,0.003384,0.001128,0.000000,0.003384,0.001692,0.000564,0.975410,0.000000,95
4,0.091723,0.933171,0.082065,0.246195,0.003659,15.293942,0.182439,0.052195,0.009756,0.119024,...,0.002927,0.002927,0.001463,0.000000,0.003415,0.000976,0.000000,0.868132,0.000000,79
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7235,0.162685,0.019824,0.102260,0.322119,0.002937,13.927804,0.112335,0.181351,0.010279,0.145374,...,0.004405,0.004405,0.004405,0.001468,0.007342,0.001468,0.000734,0.857143,0.300000,95
7236,0.165188,0.032549,0.106195,0.097345,0.003922,11.283186,0.211373,0.000000,0.020392,0.186667,...,0.005098,0.004706,0.004706,0.001569,0.005882,0.002745,0.000392,0.891892,0.000000,95
7237,0.178460,0.023211,0.173333,0.333333,0.002901,13.786667,0.219536,0.049323,0.011605,0.191489,...,0.005803,0.007737,0.002901,0.000967,0.004836,0.004836,0.000967,0.750000,0.000000,89
7238,0.148982,0.065857,0.094681,0.431327,0.002839,14.820573,0.193864,0.013250,0.005284,0.119568,...,0.004574,0.005205,0.002524,0.000552,0.006704,0.003155,0.001735,0.838235,0.833333,89


In [72]:
scaled_data = model_scaler.fit_transform(southern_features)
X_train, X_test, y_train, y_test = train_test_split(scaled_data, target_values, test_size=0.2, random_state=42)
x = y_train.to_numpy()
y_train = x.ravel()

### Random Forest Regressor

In [73]:
n_estimators = [int(x) for x in np.linspace(start = 50, stop = 2000, num = 50)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(2, 110, num = 40)]
max_depth.append(None)
min_samples_split = [2, 5, 10, 15, 20]
min_samples_leaf = [1, 2, 4, 6, 10]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [74]:
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 150, cv = 3, verbose=2,
                               random_state=42, n_jobs = -1, scoring = "neg_mean_squared_error")# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_params_

Fitting 3 folds for each of 150 candidates, totalling 450 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   36.9s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  3.8min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed: 10.0min
[Parallel(n_jobs=-1)]: Done 450 out of 450 | elapsed: 13.3min finished


{'n_estimators': 1960,
 'min_samples_split': 2,
 'min_samples_leaf': 2,
 'max_features': 'auto',
 'max_depth': 93,
 'bootstrap': True}

In [76]:
rfr_random_south = RandomForestRegressor(**rf_random.best_params_, random_state = 42)
rfr_random_south.fit(X_train, y_train)
display_scores(rfr_random_northeast, X_train, y_train)

Scores: [0.06337912 0.05867786 0.06026664 0.06138791 0.06411416]
Mean: 0.0615651396147472
Standard deviation: 0.001992080646067411


In [77]:
y_pred = rfr_random_south.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
rmse

0.05707193863155079

In [78]:
rfr_random_south.feature_importances_

array([0.02753703, 0.27320309, 0.01872817, 0.01656723, 0.01969712,
       0.02051394, 0.16028931, 0.01801589, 0.01502365, 0.01483126,
       0.01888642, 0.11453873, 0.05057767, 0.00777198, 0.00915607,
       0.00322943, 0.00346192, 0.00402955, 0.00349494, 0.01816114,
       0.01333056, 0.0152572 , 0.01677047, 0.01638146, 0.01811796,
       0.02026093, 0.0419304 , 0.01724738, 0.00900176, 0.01398731])

### Gradient Boosting Regressor

In [79]:
n_estimators = [int(x) for x in np.linspace(start = 50, stop = 2000, num = 50)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(2, 110, num = 40)]
max_depth.append(None)
min_samples_split = [2, 5, 10, 15, 20]
min_samples_leaf = [1, 2, 4, 6, 10]
learning_rate = [0.01, 0.05, 0.1, 0.25, 0.40, 0.50, 0.75, 1.0]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'learning_rate': learning_rate}

In [80]:
gb = GradientBoostingRegressor()
gb_random = RandomizedSearchCV(estimator = gb, param_distributions = random_grid, n_iter = 150, cv = 3, verbose=2,
                               random_state=42, n_jobs = -1, scoring = "neg_mean_squared_error")
gb_random.fit(X_train, y_train)
gb_random.best_params_

Fitting 3 folds for each of 150 candidates, totalling 450 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed: 12.6min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed: 34.1min
[Parallel(n_jobs=-1)]: Done 450 out of 450 | elapsed: 43.8min finished


{'n_estimators': 1960,
 'min_samples_split': 20,
 'min_samples_leaf': 10,
 'max_features': 'sqrt',
 'max_depth': 96,
 'learning_rate': 0.01}

In [83]:
gbr_random_south = GradientBoostingRegressor(**gb_random.best_params_, random_state = 42)

In [84]:
gbr_random_south.fit(X_train, y_train)
display_scores(gbr_random_northeast, X_train, y_train)

Scores: [0.06325054 0.05933254 0.0600307  0.06172446 0.06396904]
Mean: 0.061661454735762275
Standard deviation: 0.0017853490387982674


In [85]:
y_pred = gbr_random_south.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
rmse

0.05644494966919101

In [86]:
gbr_random_south.feature_importances_

array([0.03210308, 0.18836163, 0.02142173, 0.01879516, 0.02218069,
       0.02411323, 0.11556877, 0.02117226, 0.01720419, 0.01706938,
       0.03318631, 0.07315899, 0.08830602, 0.00954214, 0.01082503,
       0.00540054, 0.00705561, 0.00848679, 0.00756911, 0.02249968,
       0.01761537, 0.02245924, 0.02291412, 0.03396711, 0.01962045,
       0.0226335 , 0.04388988, 0.02406945, 0.01126946, 0.0375411 ])

### Voting Regressor

In [87]:
rfr_voting = RandomForestRegressor(**rf_random.best_params_, random_state = 42)
gbr_voting = GradientBoostingRegressor(**gb_random.best_params_, random_state = 42)

In [88]:
voting_south = VotingRegressor(estimators = [('rfr', rfr_voting), ('gbr', gbr_voting)])
voting_south.fit(X_train, y_train)
display_scores(voting_south, X_train, y_train)

Scores: [0.06380172 0.05951913 0.06041445 0.06135584 0.06378553]
Mean: 0.06177533334537997
Standard deviation: 0.0017473150543086014


In [89]:
y_pred = voting_south.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
rmse

0.056063298980915624

### Conclusions:

When it comes to the Southern data it did not improve in the same manner that the Northeastern data did but it generated results that were similar to that of the United States with a 5.61% compared to 5.65% respectively. However, the minority students is once again the most important feature followed by chronic absenteeism then suspension proportion and AP students.

## Midwest

In [90]:
midwest_data

Unnamed: 0.1,Unnamed: 0,ID,Lea State.x,LEA.x,Area_Population,Children_Poverty,Poverty_Proportion,Student_Prop,District_Population,Non_White_Students,...,Alg2,Geo,AdvMath,Calc,Bio,Chem,Phys,Early_Pass,Late_Pass,Grad_Rate
1230,1231,1700105,IL,A-C Central Community Unit School District 262,2398,47,0.119593,0.193495,464,0.025862,...,0.004310,0.006466,0.004310,0.002155,0.004310,0.004310,0.004310,0.941176,0.000000,90
1231,1232,1701413,IL,Abingdon-Avon Community Unit School District 276,5845,234,0.263811,0.165269,966,0.109731,...,0.004141,0.005176,0.003106,0.004141,0.004141,0.002070,0.001035,0.659574,1.000000,90
1232,1233,1703300,IL,Alden-Hebron School District 19,3098,57,0.122318,0.135894,421,0.273159,...,0.004751,0.004751,0.002375,0.002375,0.007126,0.002375,0.004751,0.866667,0.000000,80
1233,1234,1703600,IL,Alton Community Unit School District 11,48443,1812,0.240223,0.126334,6120,0.443464,...,0.002124,0.002778,0.002614,0.000490,0.003431,0.002614,0.001144,0.000000,0.000000,81
1234,1235,1703810,IL,Annawan Community Unit School District 226,2037,35,0.103245,0.186058,379,0.105541,...,0.002639,0.007916,0.002639,0.000000,0.002639,0.002639,0.002639,0.884615,0.000000,80
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7530,7531,5501230,WI,Wisconsin Heights School District,6436,101,0.099312,0.120261,774,0.100775,...,0.002584,0.005168,0.002584,0.001292,0.011628,0.003876,0.001292,0.956522,0.000000,90
7531,7532,5517070,WI,Wisconsin Rapids School District,34243,685,0.127347,0.148527,5086,0.155525,...,0.005702,0.005309,0.000393,0.002556,0.007668,0.002163,0.004326,0.849498,0.166667,93
7532,7533,5517100,WI,Wittenberg-Birnamwood School District,7776,170,0.127055,0.149434,1162,0.160069,...,0.002582,0.002582,0.004303,0.000861,0.004303,0.005164,0.000861,0.757576,0.636364,94
7533,7534,5517130,WI,Wonewoc-Union Center School District,3011,78,0.164905,0.115908,349,0.097421,...,0.000000,0.000000,0.000000,0.002865,0.008596,0.002865,0.000000,0.000000,0.000000,79


In [91]:
midwest_features = midwest_data.drop(['Unnamed: 0', 'ID', 'Lea State.x', 'LEA.x', 'Area_Population', 
                                        'District_Population','Poverty_Proportion', 'Children_Poverty'], axis=1)
target_values = midwest_data[['Poverty_Proportion']]
midwest_features

Unnamed: 0,Student_Prop,Non_White_Students,New_Teachers_Proportion,Absent_Teacher_Proportion,Counselor_Ratio,Student_Teacher_Ratio,Absent_Prop,Test_Prop,Teams_Prop,Athletes_Prop,...,Alg2,Geo,AdvMath,Calc,Bio,Chem,Phys,Early_Pass,Late_Pass,Grad_Rate
1230,0.193495,0.025862,0.105263,0.175439,0.004310,16.280702,0.161638,0.064655,0.000000,0.000000,...,0.004310,0.006466,0.004310,0.002155,0.004310,0.004310,0.004310,0.941176,0.000000,90
1231,0.165269,0.109731,0.114754,0.377049,0.002588,15.836066,0.177019,0.000000,0.000000,0.000000,...,0.004141,0.005176,0.003106,0.004141,0.004141,0.002070,0.001035,0.659574,1.000000,90
1232,0.135894,0.273159,0.128205,0.076923,0.003563,10.794872,0.049881,0.061758,0.016627,0.206651,...,0.004751,0.004751,0.002375,0.002375,0.007126,0.002375,0.004751,0.866667,0.000000,80
1233,0.126334,0.443464,0.095982,0.071429,0.000654,13.660714,0.255392,0.093954,0.007026,0.087582,...,0.002124,0.002778,0.002614,0.000490,0.003431,0.002614,0.001144,0.000000,0.000000,81
1234,0.186058,0.105541,0.166667,0.023810,0.005277,9.023810,0.063325,0.073879,0.052770,0.300792,...,0.002639,0.007916,0.002639,0.000000,0.002639,0.002639,0.002639,0.884615,0.000000,80
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7530,0.120261,0.100775,0.278201,0.141006,0.002584,11.798780,0.090439,0.074935,0.018088,0.279070,...,0.002584,0.005168,0.002584,0.001292,0.011628,0.003876,0.001292,0.956522,0.000000,90
7531,0.148527,0.155525,0.041889,0.570732,0.002930,15.217521,0.143531,0.075501,0.009241,0.110106,...,0.005702,0.005309,0.000393,0.002556,0.007668,0.002163,0.004326,0.849498,0.166667,93
7532,0.149434,0.160069,0.187573,0.164127,0.002582,13.622509,0.088640,0.061102,0.000000,0.000000,...,0.002582,0.002582,0.004303,0.000861,0.004303,0.005164,0.000861,0.757576,0.636364,94
7533,0.115908,0.097421,0.367279,0.567613,0.008596,11.652755,0.106017,0.091691,0.037249,0.183381,...,0.000000,0.000000,0.000000,0.002865,0.008596,0.002865,0.000000,0.000000,0.000000,79


In [92]:
scaled_data = model_scaler.fit_transform(midwest_features)
X_train, X_test, y_train, y_test = train_test_split(scaled_data, target_values, test_size=0.2, random_state=42)
x = y_train.to_numpy()
y_train = x.ravel()

### Random Forest Regressor

In [93]:
n_estimators = [int(x) for x in np.linspace(start = 50, stop = 2000, num = 50)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(2, 110, num = 40)]
max_depth.append(None)
min_samples_split = [2, 5, 10, 15, 20]
min_samples_leaf = [1, 2, 4, 6, 10]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [94]:
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 150, cv = 3, verbose=2,
                               random_state=42, n_jobs = -1, scoring = "neg_mean_squared_error")# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_params_

Fitting 3 folds for each of 150 candidates, totalling 450 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   49.8s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  5.4min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed: 13.8min
[Parallel(n_jobs=-1)]: Done 450 out of 450 | elapsed: 18.0min finished


{'n_estimators': 965,
 'min_samples_split': 5,
 'min_samples_leaf': 2,
 'max_features': 'sqrt',
 'max_depth': 79,
 'bootstrap': False}

In [96]:
rfr_random_midwest = RandomForestRegressor(**rf_random.best_params_, random_state = 42)
rfr_random_midwest.fit(X_train, y_train)
display_scores(rfr_random_midwest, X_train, y_train)

Scores: [0.05508896 0.05437857 0.05310073 0.05533299 0.05568088]
Mean: 0.05471642477431458
Standard deviation: 0.0009136420562234676


In [97]:
y_pred = rfr_random_midwest.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
rmse

0.051002081952420224

In [98]:
rfr_random_midwest.feature_importances_

array([0.04391353, 0.10388446, 0.0172484 , 0.0187883 , 0.02362163,
       0.03105456, 0.15747325, 0.02415963, 0.02075902, 0.03020311,
       0.0257452 , 0.0685091 , 0.12164874, 0.00657592, 0.01097225,
       0.00475209, 0.00535391, 0.00683749, 0.00735204, 0.01972703,
       0.01742188, 0.0178545 , 0.018206  , 0.02787365, 0.01702143,
       0.0292888 , 0.03395432, 0.02078356, 0.0105808 , 0.05843538])

### Gradient Boosting Regressor

In [99]:
n_estimators = [int(x) for x in np.linspace(start = 50, stop = 2000, num = 50)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(2, 110, num = 40)]
max_depth.append(None)
min_samples_split = [2, 5, 10, 15, 20]
min_samples_leaf = [1, 2, 4, 6, 10]
learning_rate = [0.01, 0.05, 0.1, 0.25, 0.40, 0.50, 0.75, 1.0]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'learning_rate': learning_rate}

In [100]:
gb = GradientBoostingRegressor()
gb_random = RandomizedSearchCV(estimator = gb, param_distributions = random_grid, n_iter = 150, cv = 3, verbose=2,
                               random_state=42, n_jobs = -1, scoring = "neg_mean_squared_error")
gb_random.fit(X_train, y_train)
gb_random.best_params_

Fitting 3 folds for each of 150 candidates, totalling 450 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  2.0min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed: 19.1min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed: 49.3min
[Parallel(n_jobs=-1)]: Done 450 out of 450 | elapsed: 63.2min finished


{'n_estimators': 1960,
 'min_samples_split': 20,
 'min_samples_leaf': 10,
 'max_features': 'sqrt',
 'max_depth': 96,
 'learning_rate': 0.01}

In [103]:
gbr_random_midwest = GradientBoostingRegressor(**gb_random.best_params_, random_state = 42)

In [104]:
gbr_random_midwest.fit(X_train, y_train)
display_scores(gbr_random_northeast, X_train, y_train)

Scores: [0.05488125 0.05477076 0.05306143 0.05601784 0.05565422]
Mean: 0.05487710041382734
Standard deviation: 0.00102109538165938


In [105]:
y_pred = gbr_random_midwest.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
rmse

0.05059919043049363

In [106]:
gbr_random_midwest.feature_importances_

array([0.04257151, 0.10844789, 0.0168514 , 0.0198167 , 0.02590994,
       0.03323057, 0.15994363, 0.02588635, 0.02141345, 0.03276974,
       0.02833408, 0.0610491 , 0.1083025 , 0.00494688, 0.01165372,
       0.00338698, 0.00328534, 0.00561899, 0.00546952, 0.02136995,
       0.0173877 , 0.01889093, 0.0185002 , 0.02969409, 0.01578397,
       0.02879328, 0.03723446, 0.01985242, 0.0115654 , 0.06203933])

### Voting Regressor

In [107]:
rfr_voting = RandomForestRegressor(**rf_random.best_params_, random_state = 42)
gbr_voting = GradientBoostingRegressor(**gb_random.best_params_, random_state = 42)

In [108]:
voting_midwest = VotingRegressor(estimators = [('rfr', rfr_voting), ('gbr', gbr_voting)])
voting_midwest.fit(X_train, y_train)
display_scores(voting_midwest, X_train, y_train)

Scores: [0.05585236 0.0550502  0.05298796 0.0555045  0.05598302]
Mean: 0.05507560655456277
Standard deviation: 0.0010926656204318875


In [109]:
y_pred = voting_midwest.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
rmse

0.05049985184153155

### Conclusions:

The Midwest data performed better when compared to the United States with about a half a percent increase with 5.04% and 5.65% respectively. So looking at this area regionally produced a slightly better model. Chronic absenteeism was the most important feature in this region and the proportion of minority students and suspensions were about the same.

### West

In [157]:
west_data

Unnamed: 0.1,Unnamed: 0,ID,Lea State.x,LEA.x,Area_Population,Children_Poverty,Poverty_Proportion,Student_Prop,District_Population,Non_White_Students,...,Alg2,Geo,AdvMath,Calc,Bio,Chem,Phys,Early_Pass,Late_Pass,Grad_Rate
124,125,200050,AK,Alaska Gateway School District,2295,85,0.212500,0.172985,397,0.670025,...,0.005038,0.007557,0.002519,0.000000,0.007557,0.007557,0.002519,0.909091,0.714286,59
125,126,200007,AK,Aleutians East Borough School District,3370,22,0.106796,0.061721,208,0.903846,...,0.014423,0.009615,0.004808,0.000000,0.014423,0.004808,0.000000,1.000000,1.000000,80
126,127,200180,AK,Anchorage School District,294356,5553,0.109750,0.160992,47389,0.582308,...,0.002047,0.003672,0.001456,0.000549,0.005402,0.002933,0.001773,0.640334,0.371429,81
127,128,200525,AK,Annette Island School District,1525,54,0.192171,0.212459,324,0.944444,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,80
128,129,200020,AK,Bering Strait School District,6164,522,0.306518,0.324789,2002,0.993506,...,0.004496,0.006993,0.001499,0.000000,0.003996,0.000999,0.000999,0.946809,0.714286,79
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7572,7573,5602760,WY,Uinta County School District 1,14192,364,0.122683,0.189543,2690,0.202602,...,0.002230,0.002230,0.001487,0.000743,0.005576,0.002230,0.000372,0.697561,0.800000,84
7573,7574,5604260,WY,Uinta County School District 6,3152,30,0.040486,0.234454,739,0.067659,...,0.002706,0.004060,0.000000,0.006766,0.005413,0.001353,0.000000,0.981481,0.000000,79
7574,7575,5606240,WY,Washakie County School District 1,7372,181,0.130970,0.174173,1284,0.295171,...,0.004673,0.006231,0.001558,0.000779,0.007788,0.003894,0.001558,1.000000,0.000000,89
7575,7576,5604830,WY,Weston County School District 1,5465,118,0.145499,0.138884,759,0.117260,...,0.003953,0.006588,0.002635,0.000000,0.003953,0.010540,0.001318,1.000000,1.000000,89


In [158]:
west_features = west_data.drop(['Unnamed: 0', 'ID', 'Lea State.x', 'LEA.x', 'Area_Population', 
                                        'District_Population','Poverty_Proportion', 'Children_Poverty'], axis=1)
target_values = west_data[['Poverty_Proportion']]

In [159]:
scaled_data = model_scaler.fit_transform(west_features)
X_train, X_test, y_train, y_test = train_test_split(scaled_data, target_values, test_size=0.2, random_state=42)

In [160]:
x = y_train.to_numpy()
y_train = x.ravel()

### Random Forest Regressor

In [161]:
n_estimators = [int(x) for x in np.linspace(start = 50, stop = 2000, num = 50)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(2, 110, num = 40)]
max_depth.append(None)
min_samples_split = [2, 5, 10, 15, 20]
min_samples_leaf = [1, 2, 4, 6, 10]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [162]:
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 150, cv = 3, verbose=2,
                               random_state=42, n_jobs = -1, scoring = "neg_mean_squared_error")# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_params_

Fitting 3 folds for each of 150 candidates, totalling 450 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   22.8s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed:  4.9min
[Parallel(n_jobs=-1)]: Done 450 out of 450 | elapsed:  6.6min finished


{'n_estimators': 1960,
 'min_samples_split': 2,
 'min_samples_leaf': 2,
 'max_features': 'auto',
 'max_depth': 93,
 'bootstrap': True}

In [165]:
rfr_random_west = RandomForestRegressor(**rf_random.best_params_, random_state = 42)
rfr_random_west.fit(X_train, y_train)
display_scores(rfr_random_west, X_train, y_train)

Scores: [0.07466746 0.07004547 0.06737747 0.06684394 0.06120167]
Mean: 0.06802720174981888
Standard deviation: 0.0043949196687138


In [166]:
y_pred = rfr_random_west.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
rmse

0.058920299743652184

In [167]:
rfr_random_west.feature_importances_

array([0.02623201, 0.28305686, 0.02473102, 0.0185911 , 0.01933846,
       0.03890984, 0.04415113, 0.02182399, 0.022117  , 0.01854119,
       0.01265831, 0.12082231, 0.08101422, 0.01116011, 0.01437472,
       0.0052047 , 0.0063615 , 0.0093801 , 0.00840788, 0.01572657,
       0.01547499, 0.01479862, 0.02396329, 0.01654909, 0.01511159,
       0.0194958 , 0.02233009, 0.0213049 , 0.01413352, 0.03423512])

### Gradient Boosting Regressor

In [168]:
n_estimators = [int(x) for x in np.linspace(start = 50, stop = 2000, num = 50)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(2, 110, num = 40)]
max_depth.append(None)
min_samples_split = [2, 5, 10, 15, 20]
min_samples_leaf = [1, 2, 4, 6, 10]
learning_rate = [0.01, 0.05, 0.1, 0.25, 0.40, 0.50, 0.75, 1.0]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'learning_rate': learning_rate}

In [169]:
gb = GradientBoostingRegressor()
gb_random = RandomizedSearchCV(estimator = gb, param_distributions = random_grid, n_iter = 150, cv = 3, verbose=2,
                               random_state=42, n_jobs = -1, scoring = "neg_mean_squared_error")
gb_random.fit(X_train, y_train)
gb_random.best_params_

Fitting 3 folds for each of 150 candidates, totalling 450 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   25.7s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  4.6min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed: 12.2min
[Parallel(n_jobs=-1)]: Done 450 out of 450 | elapsed: 15.6min finished


{'n_estimators': 845,
 'min_samples_split': 5,
 'min_samples_leaf': 10,
 'max_features': 'sqrt',
 'max_depth': 71,
 'learning_rate': 0.01}

In [170]:
gbr_random_west = GradientBoostingRegressor(**gb_random.best_params_, random_state = 42)

In [171]:
gbr_random_west.fit(X_train, y_train)
display_scores(gbr_random_northeast, X_train, y_train)

Scores: [0.07474101 0.0698128  0.06757734 0.06517224 0.06212999]
Mean: 0.06788667595963242
Standard deviation: 0.004273058666669731


In [172]:
y_pred = gbr_random_west.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
rmse

0.05877389475288692

In [173]:
gbr_random_west.feature_importances_

array([0.03085171, 0.19200635, 0.03427397, 0.02389036, 0.02186437,
       0.0359589 , 0.06383905, 0.02956256, 0.02398824, 0.01957509,
       0.0194659 , 0.05608992, 0.07267743, 0.00930965, 0.02188696,
       0.00792282, 0.00662471, 0.01109023, 0.01205212, 0.02588113,
       0.02122559, 0.01937   , 0.02730446, 0.03901786, 0.01988597,
       0.02828661, 0.03664959, 0.0266433 , 0.01575211, 0.04705305])

### Voting Regressor

In [174]:
rfr_voting = RandomForestRegressor(**rf_random.best_params_, random_state = 42)
gbr_voting = GradientBoostingRegressor(**gb_random.best_params_, random_state = 42)

In [175]:
voting_west = VotingRegressor(estimators = [('rfr', rfr_voting), ('gbr', gbr_voting)])
voting_west.fit(X_train, y_train)
display_scores(voting_west, X_train, y_train)

Scores: [0.07470255 0.06974327 0.06700701 0.06555241 0.06153806]
Mean: 0.0677086568669156
Standard deviation: 0.004387987394324367


In [176]:
y_pred = voting_west.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
rmse

0.05828500911397467

### Conclusions:

The Western models performed the worse on the data out of all of the regions. It did a little bit worse then the United States model with 5.83% mean error on the predictions. What is fascinating is that the minority students were the most importance feature by far being almost three times more important than chronic absenteeism (which was second) was.

## State Based Models

Seeing that performance of some of the models improved on the data with regions, I decided next to look at each state individually. Seeing if the model can work better on the states if it trained on them alone. I will create a function that will first group up the states, transform and split the data, and then loop through them running the Random Forest, Gradient Boosting and Voting Regressor on each one. It will cross validate and put out a score for the model and then produce a score based off the testing data.

I will do this in sections using the best hyperparameters that worked on each of the states based off the region that they are assigned to. I will instantiate an object of each of the model classes and then use them as an input into the function.

In [196]:
def state_models (state_data, rgr, gbr, voting):
    grouped = state_data.groupby('Lea State.x') #Grouping the data by state.
    for name, group in grouped: #Creating a for loop for each state.
        # Need to get the features and target values for each state.
        features = group.drop(['Unnamed: 0', 'ID', 'Lea State.x', 'LEA.x', 'Area_Population', 
                                        'District_Population','Poverty_Proportion', 'Children_Poverty'], axis=1)
        target_values = group[['Poverty_Proportion']]
        model_scaler = StandardScaler() 
        scaled_data = model_scaler.fit_transform(features) #Scale the data
        X_train, X_test, y_train, y_test = train_test_split(scaled_data, target_values,
                                                            test_size=0.2, random_state=42) #Split the data
        x = y_train.to_numpy()
        y_train_ravel = x.ravel()
        for clf in (rgr, gbr, voting): #Run each model and generate a score and mse
            clf.fit(X_train, y_train_ravel)
            model_scores = cross_val_score(clf, X_train, y_train_ravel,
                         scoring="neg_mean_squared_error", cv=5)
            scores = np.sqrt(-model_scores)
            w = scores.mean()
            y_pred = clf.predict(X_test)
            mse = mean_squared_error(y_test, y_pred)
            rmse = np.sqrt(mse)
            print(name, clf.__class__.__name__, 'Mean RMSE', round(w, 5), 'RMSE', round(rmse, 5))

### Northeastern

In [197]:
rfr = RandomForestRegressor(n_estimators = 1243, min_samples_split = 5, min_samples_leaf = 1,
                                             max_features = 'sqrt', max_depth  = 38, 
                                             bootstrap = False, random_state = 42)
gbr = GradientBoostingRegressor(n_estimators = 1562, min_samples_split = 2, min_samples_leaf = 6,
                                             max_features = 'sqrt', max_depth  = 26, 
                                             learning_rate = 0.01, random_state = 42)
voting = VotingRegressor(estimators = [('rfr', rfr), ('gbr', gbr)])

In [198]:
state_models(northeast_data, rfr, gbr, voting)

CT RandomForestRegressor Mean RMSE 0.03402 RMSE 0.02059
CT GradientBoostingRegressor Mean RMSE 0.02969 RMSE 0.01724
CT VotingRegressor Mean RMSE 0.03132 RMSE 0.01833
MA RandomForestRegressor Mean RMSE 0.0289 RMSE 0.05252
MA GradientBoostingRegressor Mean RMSE 0.03007 RMSE 0.04828
MA VotingRegressor Mean RMSE 0.02923 RMSE 0.05024
ME RandomForestRegressor Mean RMSE 0.0455 RMSE 0.04793
ME GradientBoostingRegressor Mean RMSE 0.04526 RMSE 0.04887
ME VotingRegressor Mean RMSE 0.04464 RMSE 0.04696
NH RandomForestRegressor Mean RMSE 0.04419 RMSE 0.04295
NH GradientBoostingRegressor Mean RMSE 0.04122 RMSE 0.03729
NH VotingRegressor Mean RMSE 0.04227 RMSE 0.03966
NJ RandomForestRegressor Mean RMSE 0.04802 RMSE 0.04953
NJ GradientBoostingRegressor Mean RMSE 0.04592 RMSE 0.05009
NJ VotingRegressor Mean RMSE 0.04657 RMSE 0.04957
NY RandomForestRegressor Mean RMSE 0.04286 RMSE 0.03411
NY GradientBoostingRegressor Mean RMSE 0.04321 RMSE 0.03542
NY VotingRegressor Mean RMSE 0.04283 RMSE 0.0345
PA Rand

### Southern

In [199]:
rfr = RandomForestRegressor(max_depth = 93, max_features = 'auto', min_samples_split = 2,
                                         n_estimators = 1960, bootstrap = True, 
                            min_samples_leaf = 2, random_state = 42)
gbr = GradientBoostingRegressor(max_depth = 96, max_features = 'sqrt', min_samples_split = 20,
                                         n_estimators = 1960, learning_rate = 0.01, 
                                min_samples_leaf = 10, random_state = 42)
state_models(southern_data, rfr, gbr, voting)

AL RandomForestRegressor Mean RMSE 0.06479 RMSE 0.06032
AL GradientBoostingRegressor Mean RMSE 0.06288 RMSE 0.06314
AL VotingRegressor Mean RMSE 0.06572 RMSE 0.06136
AR RandomForestRegressor Mean RMSE 0.0742 RMSE 0.07242
AR GradientBoostingRegressor Mean RMSE 0.07683 RMSE 0.07814
AR VotingRegressor Mean RMSE 0.07394 RMSE 0.07635
DE RandomForestRegressor Mean RMSE 0.04822 RMSE 0.05494
DE GradientBoostingRegressor Mean RMSE 0.04189 RMSE 0.06689
DE VotingRegressor Mean RMSE 0.04668 RMSE 0.06259
FL RandomForestRegressor Mean RMSE 0.0574 RMSE 0.05026
FL GradientBoostingRegressor Mean RMSE 0.05692 RMSE 0.05078
FL VotingRegressor Mean RMSE 0.05645 RMSE 0.04639
GA RandomForestRegressor Mean RMSE 0.05653 RMSE 0.04964
GA GradientBoostingRegressor Mean RMSE 0.05631 RMSE 0.0476
GA VotingRegressor Mean RMSE 0.05471 RMSE 0.04633
KY RandomForestRegressor Mean RMSE 0.07377 RMSE 0.05742
KY GradientBoostingRegressor Mean RMSE 0.07244 RMSE 0.05675
KY VotingRegressor Mean RMSE 0.07338 RMSE 0.05195
LA Rand

### Midwest

In [201]:
rfr = RandomForestRegressor(max_depth = 79, max_features = 'sqrt', min_samples_split = 5,
                            n_estimators = 965, bootstrap = False, 
                            min_samples_leaf = 2, random_state = 42)
gbr = GradientBoostingRegressor(max_depth = 96, max_features = 'sqrt', min_samples_split = 20,
                                n_estimators = 1960, learning_rate = 0.01, 
                                min_samples_leaf = 10, random_state = 42)
state_models(midwest_data, rfr, gbr, voting)

IA RandomForestRegressor Mean RMSE 0.03221 RMSE 0.03526
IA GradientBoostingRegressor Mean RMSE 0.03231 RMSE 0.03526
IA VotingRegressor Mean RMSE 0.03186 RMSE 0.03475
IL RandomForestRegressor Mean RMSE 0.04701 RMSE 0.07882
IL GradientBoostingRegressor Mean RMSE 0.04476 RMSE 0.07203
IL VotingRegressor Mean RMSE 0.04594 RMSE 0.07496
IN RandomForestRegressor Mean RMSE 0.05017 RMSE 0.03229
IN GradientBoostingRegressor Mean RMSE 0.05083 RMSE 0.03783
IN VotingRegressor Mean RMSE 0.04926 RMSE 0.0346
KS RandomForestRegressor Mean RMSE 0.04704 RMSE 0.03539
KS GradientBoostingRegressor Mean RMSE 0.04768 RMSE 0.03709
KS VotingRegressor Mean RMSE 0.04708 RMSE 0.03581
MI RandomForestRegressor Mean RMSE 0.05041 RMSE 0.04918
MI GradientBoostingRegressor Mean RMSE 0.05106 RMSE 0.05026
MI VotingRegressor Mean RMSE 0.05054 RMSE 0.04927
MN RandomForestRegressor Mean RMSE 0.03703 RMSE 0.0408
MN GradientBoostingRegressor Mean RMSE 0.03759 RMSE 0.04203
MN VotingRegressor Mean RMSE 0.03628 RMSE 0.04017
MO Ran

### West

In [202]:
west_data.drop(west_data.loc[west_data['Lea State.x']== 'HI'].index, inplace=True) #Drop Hawaii due to one district
rfr = RandomForestRegressor(max_depth = 93, max_features = 'auto', min_samples_split = 2,
                            n_estimators = 1960, bootstrap = True, 
                            min_samples_leaf = 2, random_state = 42)
gbr = GradientBoostingRegressor(max_depth = 71, max_features = 'sqrt', min_samples_split = 5,
                                n_estimators = 845, learning_rate = 0.01, 
                                min_samples_leaf = 10, random_state = 42)
state_models(west_data, rfr, gbr, voting)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


AK RandomForestRegressor Mean RMSE 0.07367 RMSE 0.08822
AK GradientBoostingRegressor Mean RMSE 0.06875 RMSE 0.09306
AK VotingRegressor Mean RMSE 0.07517 RMSE 0.09622
AZ RandomForestRegressor Mean RMSE 0.06916 RMSE 0.06799
AZ GradientBoostingRegressor Mean RMSE 0.07173 RMSE 0.0617
AZ VotingRegressor Mean RMSE 0.06984 RMSE 0.06757
CA RandomForestRegressor Mean RMSE 0.06161 RMSE 0.05512
CA GradientBoostingRegressor Mean RMSE 0.06002 RMSE 0.05457
CA VotingRegressor Mean RMSE 0.06018 RMSE 0.05499
CO RandomForestRegressor Mean RMSE 0.0692 RMSE 0.05749
CO GradientBoostingRegressor Mean RMSE 0.06611 RMSE 0.06622
CO VotingRegressor Mean RMSE 0.06592 RMSE 0.06024
ID RandomForestRegressor Mean RMSE 0.04217 RMSE 0.04846
ID GradientBoostingRegressor Mean RMSE 0.04206 RMSE 0.04876
ID VotingRegressor Mean RMSE 0.04183 RMSE 0.04963
MT RandomForestRegressor Mean RMSE 0.08872 RMSE 0.08865
MT GradientBoostingRegressor Mean RMSE 0.09386 RMSE 0.09332
MT VotingRegressor Mean RMSE 0.09091 RMSE 0.09952
NM Ran

### Conclusions: 

The perfomance of the models were quite varied. I had some extremely good results such as with Connecticut get down to 1.83% but I had some bad ones as well with New Mexico being off by 11.4%. For the Northeast there were five out of eight states that performed better then the regional. For the south seven out of sixteen performed better then the region. Eight out of twelve of the Midwest states performed better then the region. Only four out of eleven of the models for the West performed better then the region.

## State Models with Individualized Hyperparameters

For the final part I created another function that would do something similar as the last part but instead of using regional hyperparameter, this function will do a RandomizedSearch on each state to find the best hyperparameters for that particular state and then run a Random Forest Regressor, Gradient Boosting Regressor, and a Voting Regressor on each one with the optimal hyperparameters. A score for each model based off cross validation will be printed along with the performance based on the testing data. I want to do this to see if there can be an improvement on the last section if I individualized for each state.

In [211]:
def state_models_hyper (state_data):
    grouped = state_data.groupby('Lea State.x') #Grouping the data by state.
    for name, group in grouped: #Creating a for loop for each state.
        # Need to get the features and target values for each state.
        features = group.drop(['Unnamed: 0', 'ID', 'Lea State.x', 'LEA.x', 'Area_Population', 
                                        'District_Population','Poverty_Proportion', 'Children_Poverty'], axis=1)
        target_values = group[['Poverty_Proportion']]
        model_scaler = StandardScaler() 
        scaled_data = model_scaler.fit_transform(features) #Scale the data
        X_train, X_test, y_train, y_test = train_test_split(scaled_data, target_values,
                                                            test_size=0.2, random_state=42) #Split the data
        x = y_train.to_numpy()
        y_train_ravel = x.ravel()
        n_estimators = [int(x) for x in np.linspace(start = 50, stop = 2000, num = 50)]
        max_features = ['auto', 'sqrt']
        max_depth = [int(x) for x in np.linspace(2, 110, num = 40)]
        max_depth.append(None)
        min_samples_split = [2, 5, 10, 15, 20]
        min_samples_leaf = [1, 2, 4, 6, 10]
        bootstrap = [True, False]
        random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
        rf = RandomForestRegressor()
        rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 150, cv = 3,
                               random_state=42, n_jobs = -1, scoring = "neg_mean_squared_error")
        rf_random.fit(X_train, y_train_ravel)
        rfr = RandomForestRegressor(**rf_random.best_params_, random_state = 42)
        n_estimators = [int(x) for x in np.linspace(start = 50, stop = 2000, num = 50)]
        max_features = ['auto', 'sqrt']
        max_depth = [int(x) for x in np.linspace(2, 110, num = 40)]
        max_depth.append(None)
        min_samples_split = [2, 5, 10, 15, 20]
        min_samples_leaf = [1, 2, 4, 6, 10]
        learning_rate = [0.01, 0.05, 0.1, 0.25, 0.40, 0.50, 0.75, 1.0]
        random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'learning_rate': learning_rate}
        gb = GradientBoostingRegressor()
        gb_random = RandomizedSearchCV(estimator = gb, param_distributions = random_grid, n_iter = 150, cv = 3,
                               random_state=42, n_jobs = -1, scoring = "neg_mean_squared_error")
        gb_random.fit(X_train, y_train_ravel)
        gbr = GradientBoostingRegressor(**gb_random.best_params_, random_state = 42)
        voting = VotingRegressor(estimators = [('rfr', rfr), ('gbr', gbr)])
        for clf in (rfr, gbr, voting): #Run each model and generate a score and mse
            clf.fit(X_train, y_train_ravel)
            model_scores = cross_val_score(clf, X_train, y_train_ravel,
                         scoring="neg_mean_squared_error", cv=5)
            scores = np.sqrt(-model_scores)
            w = scores.mean()
            y_pred = clf.predict(X_test)
            mse = mean_squared_error(y_test, y_pred)
            rmse = np.sqrt(mse)
            print(name, clf.__class__.__name__, 'Mean RMSE', round(w, 5), 'RMSE', round(rmse, 5))

In [232]:
# I have to remove Hawaii and DC since these do not have enough districts to perform this on

project_data = pd.read_csv('ProjectDataTableNewData.csv', encoding='latin-1')
project_data.drop(project_data.loc[(project_data['Lea State.x']== 'HI')|(project_data['Lea State.x']== 'DC')].index, inplace=True)

In [215]:
state_models_hyper(project_data)

AK RandomForestRegressor Mean RMSE 0.06607 RMSE 0.08901
AK GradientBoostingRegressor Mean RMSE 0.07538 RMSE 0.09954
AK VotingRegressor Mean RMSE 0.06961 RMSE 0.09388
AL RandomForestRegressor Mean RMSE 0.06461 RMSE 0.06094
AL GradientBoostingRegressor Mean RMSE 0.06548 RMSE 0.06392
AL VotingRegressor Mean RMSE 0.06368 RMSE 0.0616
AR RandomForestRegressor Mean RMSE 0.07227 RMSE 0.07533
AR GradientBoostingRegressor Mean RMSE 0.07427 RMSE 0.08036
AR VotingRegressor Mean RMSE 0.07278 RMSE 0.07759
AZ RandomForestRegressor Mean RMSE 0.06924 RMSE 0.06937
AZ GradientBoostingRegressor Mean RMSE 0.07137 RMSE 0.07229
AZ VotingRegressor Mean RMSE 0.06938 RMSE 0.07042
CA RandomForestRegressor Mean RMSE 0.06082 RMSE 0.05477
CA GradientBoostingRegressor Mean RMSE 0.06057 RMSE 0.05818
CA VotingRegressor Mean RMSE 0.05988 RMSE 0.05571
CO RandomForestRegressor Mean RMSE 0.06721 RMSE 0.05894
CO GradientBoostingRegressor Mean RMSE 0.0665 RMSE 0.0562
CO VotingRegressor Mean RMSE 0.06606 RMSE 0.05634
CT Rand

## Best Model Scores for States

Throughout doing all of the state scores I had noticed that there were differences in how the different models peroformed on each state, whether it was using the regional parameters or the individualized ones as well as the models itself. For this final portion I am going to create a function that will group each state and then run the regional hyperparameters that it belongs to as well as doing an individualized one as in the previous section for each of the three types of models (Random Forest, Gradient Boosting, and Voting Regressor). I will have the results of each model compared to each other then print out the best results as well as the model that produced it.

In [243]:
# Create a model scoring function to score the models within the best_state_model function
def state_model_scoring (clf, X_train, y_train):
    model_scores = cross_val_score(model, X_train, y_train,
                         scoring="neg_mean_squared_error", cv=5)
    scores = np.sqrt(-model_scores)
    w = scores.mean()
    return w

def best_state_model (state_data):
    grouped = state_data.groupby('Lea State.x') #Grouping the data by state.
    for name, group in grouped: #Creating a for loop for each state.
        # Need to get the features and target values for each state.
        features = group.drop(['Unnamed: 0', 'ID', 'Lea State.x', 'LEA.x', 'Area_Population', 
                                        'District_Population','Poverty_Proportion', 'Children_Poverty'], axis=1)
        target_values = group[['Poverty_Proportion']]
        model_scaler = StandardScaler() 
        scaled_data = model_scaler.fit_transform(features) #Scale the data
        X_train, X_test, y_train, y_test = train_test_split(scaled_data, target_values,
                                                            test_size=0.2, random_state=42) #Split the data
        x = y_train.to_numpy()
        y_train_ravel = x.ravel()
        # Generating the regions and running the optimal hyperparameters on them.
        if name == 'ME' or 'NH' or 'VT' or 'NY' or 'MA' or 'CT' or 'RI' or 'PA' or 'NJ' :
            rfregion = RandomForestRegressor(n_estimators = 1243, min_samples_split = 5, min_samples_leaf = 1,
                                             max_features = 'sqrt', max_depth  = 38, 
                                             bootstrap = False, random_state = 42)
            gbregion = GradientBoostingRegressor(n_estimators = 1562, min_samples_split = 2, min_samples_leaf = 6,
                                             max_features = 'sqrt', max_depth  = 26, 
                                             learning_rate = 0.01, random_state = 42)
            votingregion = VotingRegressor(estimators = [('rfr', rfregion), ('gbr', gbregion)])
        elif name == 'MD' or 'GA' or 'DE' or 'VA' or 'WV' or 'KY' or 'TN' or 'NC' or 'SC' or 'FL' or 'AL' or 'MS' or 'LA' or 'AR' or 'TX' or 'OK':
            rfregion = RandomForestRegressor(max_depth = 93, max_features = 'auto', min_samples_split = 2,
                                         n_estimators = 1960, bootstrap = True, 
                            min_samples_leaf = 2, random_state = 42)
            gbregion = GradientBoostingRegressor(max_depth = 96, max_features = 'sqrt', min_samples_split = 20,
                                         n_estimators = 1960, learning_rate = 0.01, 
                                min_samples_leaf = 10, random_state = 42)
            votingregion = VotingRegressor(estimators = [('rfr', rfregion), ('gbr', gbregion)])
        elif name  == 'ND' or 'SD' or 'NE' or 'KS' or 'MO' or 'IA' or 'MN' or 'WI' or 'MI' or 'IL' or 'IN' or 'OH':
            rfregion = RandomForestRegressor(max_depth = 79, max_features = 'sqrt', min_samples_split = 5,
                            n_estimators = 965, bootstrap = False, 
                            min_samples_leaf = 2, random_state = 42)
            gbregion = GradientBoostingRegressor(max_depth = 96, max_features = 'sqrt', min_samples_split = 20,
                                n_estimators = 1960, learning_rate = 0.01, 
                                min_samples_leaf = 10, random_state = 42)
            votingregion = VotingRegressor(estimators = [('rfr', rfregion), ('gbr', gbregion)])
        else:
            rfregion = RandomForestRegressor(max_depth = 93, max_features = 'auto', min_samples_split = 2,
                            n_estimators = 1960, bootstrap = True, 
                            min_samples_leaf = 2, random_state = 42)
            gbregion = GradientBoostingRegressor(max_depth = 71, max_features = 'sqrt', min_samples_split = 5,
                                n_estimators = 845, learning_rate = 0.01, 
                                min_samples_leaf = 10, random_state = 42)
            votingregion = VotingRegressor(estimators = [('rfr', rfregion), ('gbr', gbregion)])
        # Setting up the parameters for a search for the random forest
        n_estimators = [int(x) for x in np.linspace(start = 50, stop = 2000, num = 50)]
        max_features = ['auto', 'sqrt']
        max_depth = [int(x) for x in np.linspace(2, 110, num = 40)]
        max_depth.append(None)
        min_samples_split = [2, 5, 10, 15, 20]
        min_samples_leaf = [1, 2, 4, 6, 10]
        bootstrap = [True, False]
        random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
        rf = RandomForestRegressor()
        rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 150, cv = 3,
                               random_state=42, n_jobs = -1, scoring = "neg_mean_squared_error")
        rf_random.fit(X_train, y_train_ravel)
        rfr = RandomForestRegressor(**rf_random.best_params_, random_state = 42)
        # Setting up the parameters for a search for the Gradient Boosting
        n_estimators = [int(x) for x in np.linspace(start = 50, stop = 2000, num = 50)]
        max_features = ['auto', 'sqrt']
        max_depth = [int(x) for x in np.linspace(2, 110, num = 40)]
        max_depth.append(None)
        min_samples_split = [2, 5, 10, 15, 20]
        min_samples_leaf = [1, 2, 4, 6, 10]
        learning_rate = [0.01, 0.05, 0.1, 0.25, 0.40, 0.50, 0.75, 1.0]
        random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'learning_rate': learning_rate}
        gb = GradientBoostingRegressor()
        gb_random = RandomizedSearchCV(estimator = gb, param_distributions = random_grid, n_iter = 150, cv = 3,
                               random_state=42, n_jobs = -1, scoring = "neg_mean_squared_error")
        gb_random.fit(X_train, y_train_ravel)
        gbr = GradientBoostingRegressor(**gb_random.best_params_, random_state = 42)
        voting = VotingRegressor(estimators = [('rfr', rfr), ('gbr', gbr)])
        for clf in (rfr, gbr, voting, rfregion, gbregion, votingregion): #Run each model 
            clf.fit(X_train, y_train_ravel)
        # Generate a score for each of the models
        a = state_model_scoring (rfr, X_train, y_train_ravel)
        b = state_model_scoring (gbr, X_train, y_train_ravel)
        c = state_model_scoring (voting, X_train, y_train_ravel)
        d = state_model_scoring (rfregion, X_train, y_train_ravel)
        e = state_model_scoring (gbregion, X_train, y_train_ravel)
        f = state_model_scoring (votingregion, X_train, y_train_ravel)
        g = rfr.predict(X_test)
        h = gbr.predict(X_test)
        i = voting.predict(X_test)
        j = rfregion.predict(X_test)
        k = gbregion.predict(X_test)
        l = votingregion.predict(X_test)
        m = np.sqrt(mean_squared_error(y_test, g))
        n = np.sqrt(mean_squared_error(y_test, h))
        o = np.sqrt(mean_squared_error(y_test, i))
        p = np.sqrt(mean_squared_error(y_test, j))
        q = np.sqrt(mean_squared_error(y_test, k))
        r = np.sqrt(mean_squared_error(y_test, l))
        #Comparing the results of each of the models.
        if m < n and m < o and m < p and m < q and m < r:
            print(name, 'Random Forest Regressor', 'Score', round(a, 5), 'MSE', round(m, 5))
        elif n < m and n < o and n < p and n < q and n < r:
            print(name, 'Gradient Boosting Regressor', 'Score', round(b, 5), 'MSE', round(n, 5))
        elif o < m and o < n and o < p and o < q and o < r:
            print(name, 'Voting Regressor', 'Score', round(c, 5), 'MSE', round(o, 5))
        elif p < m and p < n and p < o and p < q and p < r:
            print(name, 'Random Forest Regressor Region', 'Score', round(d, 5), 'MSE', round(p, 5))
        elif q < m and q < n and q < o and q < p and q < r:
            print(name, 'Gradient Boosting Regressor Region', 'Score', round(e, 5), 'MSE', round(q, 5))
        else:
            print(name, 'Voting Regressor Region', 'Score', round(f, 5), 'MSE', round(r, 5))

In [244]:
best_state_model(project_data)

AK Random Forest Regressor Score 0.07292 MSE 0.08899
AL Random Forest Regressor Score 0.06518 MSE 0.06128
AR Voting Regressor Score 0.0748 MSE 0.07189
AZ Gradient Boosting Regressor Region Score 0.06816 MSE 0.06634
CA Random Forest Regressor Score 0.06248 MSE 0.05448
CO Random Forest Regressor Region Score 0.06969 MSE 0.05896
CT Voting Regressor Region Score 0.03464 MSE 0.01833
DE Gradient Boosting Regressor Score 0.04717 MSE 0.02977
FL Gradient Boosting Regressor Region Score 0.05795 MSE 0.04596
GA Gradient Boosting Regressor Region Score 0.05511 MSE 0.04592
IA Voting Regressor Region Score 0.03189 MSE 0.03475
ID Random Forest Regressor Score 0.04361 MSE 0.04733
IL Voting Regressor Region Score 0.04707 MSE 0.07496
IN Random Forest Regressor Score 0.05027 MSE 0.03277
KS Random Forest Regressor Region Score 0.04778 MSE 0.03515
KY Voting Regressor Region Score 0.07479 MSE 0.05195
LA Random Forest Regressor Region Score 0.07545 MSE 0.05918
MA Gradient Boosting Regressor Score 0.03087 MSE 