# Predicting School District Performance
The Data Schoolers - Ashwin Deo, Aasta Frascati-Robinson, Bhanu Kanna, Brendan Law

### Overview and Motivation
What factors affect school district performance?  We seek to learn whether we can predict graduation rate from numerous school district characteristics, and to understand which factors have little or no impact on performance.  We also strive to classify school districts into custom, team-built peer groups based on factors such as total students, student/teacher ratio, percent of children in poverty, district type, and location, rather than relying solely on geographical grouping by nation, state, and school district.  We used the most current national graduation data we could find, which was for the 2009-2010 school year, and we kept the dataset years consistent across data sources.

The goal of predicting school district performance based on school environment is to inform parents and interested citizens of what factors in school districts influence key success indicators such as graduation rate.  Identifying these factors would help school districts look at potential opportunities to improve.  This topic was selected because of a passion for using technology to enhance education and desire to give back to the education communities that have helped shape us.  One team member would love to work in educational data science in the future.

Open education data is now being provided via several national, state, and local government portals.  It is often up to the end user to piece together datasets across these portals to answer their questions, which is not something that a typical parent or interested citizen has the time or expertise to pursue.  Instead, the data science community can support these users by melding these datasets and answering important education questions.

### Related Work
Dekker, Pechenizkiy, and Vleeshouwers built multiple models to predict Eindhoven University of Technology freshman dropout (2009, <i>Educational Data Mining</i>).  We referenced this work to identify what types of models might be applicable for interpreting education data.

# Workbooks

<font color=blue size=5><b>Grouping Workbook</b></font> <br><br>

We wanted the ability to compare a school district against similar school districts as well as against its state as a whole. This notebook creates the groupings. The output files for these groupings were used in the Tableau visualization.
<br><br>The columns for these groupings were chosen based on the New York State Education Department's definition of similar schools. Link: http://www.p12.nysed.gov/repcrd2004/information/similar-schools/guide.shtml

[Grouping](grouping.ipynb)
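
As a rough illustration of how a peer grouping of this kind can be built, the sketch below bands each district by size and poverty level and combines those bands with its locale. The field names, cutoffs, and records are hypothetical and chosen only for illustration; the actual columns and thresholds are defined in the grouping notebook.

```python
from collections import defaultdict

# Hypothetical district records; names, fields, and values are illustrative only.
districts = [
    {"name": "District A", "total_students": 500, "pct_poverty": 8.0, "locale": "rural"},
    {"name": "District B", "total_students": 800, "pct_poverty": 9.5, "locale": "rural"},
    {"name": "District C", "total_students": 25000, "pct_poverty": 31.0, "locale": "city"},
    {"name": "District D", "total_students": 4000, "pct_poverty": 18.0, "locale": "suburb"},
]

def band(value, cutoffs, labels):
    """Return the label of the first cutoff that value does not exceed."""
    for cutoff, label in zip(cutoffs, labels):
        if value <= cutoff:
            return label
    return labels[-1]  # value is above every cutoff

def peer_group(district):
    """Build a peer-group key from locale, size band, and poverty band."""
    size = band(district["total_students"], [1000, 10000], ["small", "medium", "large"])
    need = band(district["pct_poverty"], [10.0, 25.0], ["low-need", "mid-need", "high-need"])
    return (district["locale"], size, need)

# Collect district names under their peer-group key.
groups = defaultdict(list)
for district in districts:
    groups[peer_group(district)].append(district["name"])

for key, names in groups.items():
    print(key, names)
```

Districts that share a key can then be compared with one another, independently of which state they are in.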

<font color=blue size=5><b>Lasso Workbook</b></font> <br><br>


To visualize the factors that school districts could readily change versus those that would be more difficult to change, we needed two more runs of our best model, logistic regression with a lasso penalty, without the gender and ethnicity features: one classifying high graduation and one classifying low graduation.

* [Lasso Classifiers without gender/ethnicity](lastlasso.ipynb#Classifiers-without-gender/ethnicity)
    * [1. High Graduation - No Gender/Ethnicity](lastlasso.ipynb#High-Graduation---No-Gender/Ethnicity)
    * [2. Low Graduation - No Gender/Ethnicity](lastlasso.ipynb#Low-Graduation---No-Gender/Ethnicity)
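
A minimal sketch of this kind of run is shown below, assuming scikit-learn's `LogisticRegression` with an L1 (lasso) penalty. The column names, the synthetic data, and the value of `C` are assumptions made for illustration and are not taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical column names; the real dataset's names may differ.
columns = ["student_teacher_ratio", "pct_poverty", "pct_female", "pct_hispanic"]
excluded = {"pct_female", "pct_hispanic"}  # gender/ethnicity features to drop

# Synthetic stand-in data: 200 districts, one row each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, len(columns)))
# Synthetic label: 1 = "high graduation", driven here by low poverty plus noise.
y = (X[:, 1] + 0.5 * rng.normal(size=200) < 0).astype(int)

# Keep only the columns that are not gender/ethnicity features.
keep_idx = [i for i, name in enumerate(columns) if name not in excluded]
kept_columns = [columns[i] for i in keep_idx]

# Standardize so the L1 penalty treats every feature on the same scale.
X_kept = StandardScaler().fit_transform(X[:, keep_idx])

# L1-penalized logistic regression; liblinear is a solver that supports L1.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X_kept, y)

# Pair each remaining feature with its fitted coefficient.
coefs = dict(zip(kept_columns, clf.coef_[0]))
for name, value in coefs.items():
    print(f"{name}: {value:+.3f}")
```

Because the lasso penalty can shrink coefficients all the way to zero, the surviving nonzero coefficients give a compact list of candidate factors to visualize.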



<font color=blue size=5><b>Regression Workbook</b></font> <br><br>

<b><font size=4>HTML Link to Workbook:<a href="https://github.com/ashwindeo/dataschoolers/blob/master/processbook_NumericalGradRate.ipynb">processbook_NumericalGradRate.ipynb</a></font></b>

Based on feedback from our TF on 12/7, we attempted several forms of regression to see how well we could predict numerical graduation rate with school district characteristics alone, then with previous years' graduation rates, then with school district characteristics and previous years' graduation rates together.

One opportunity in this space is that the U.S. Department of Education typically delays making graduation rate data available.  For instance, it is the 2015-2016 school year, and the most current graduation rate data available is for the year 2009-2010.  If we could build a model to predict graduation rate for the years of missing data, organizations that rely on graduation rate data to provide services to schools could use this approximation until newer graduation rate data becomes available.

For this notebook, we pulled three previous years of graduation rate data (2006-2007, 2007-2008, and 2008-2009).  We then built regression models to predict numerical graduation rate in four stages:

* using school district data alone;
* using historic graduation rate data alone;
* using school district data together with historic graduation rate data;
* using 2006-2007 school district data to train a model, then feeding it new 2009-2010 school district data to see how well it would predict the 2009-2010 numerical graduation rate.
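
The last step described above, training on one school year and predicting a later one, can be sketched as follows. This is a sketch under stated assumptions: the two features, the coefficients, and the noise-free synthetic data are invented for illustration, and ordinary least squares via NumPy stands in for whichever regression the notebook selects.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_year(n_districts):
    """Generate synthetic district features and graduation rates for one year."""
    X = rng.uniform(0.0, 1.0, size=(n_districts, 2))  # two hypothetical features
    y = 90.0 - 30.0 * X[:, 0] + 5.0 * X[:, 1]         # noise-free, for clarity
    return X, y

def add_intercept(X):
    """Prepend a column of ones so the model can learn an intercept."""
    return np.column_stack([np.ones(len(X)), X])

# Train on the earlier year only.
X_0607, y_0607 = make_year(50)
coef, *_ = np.linalg.lstsq(add_intercept(X_0607), y_0607, rcond=None)

# Predict the later year from its features, then score against the truth.
X_0910, y_0910 = make_year(50)
pred_0910 = add_intercept(X_0910) @ coef
mse_0910 = float(np.mean((y_0910 - pred_0910) ** 2))

print("coefficients:", np.round(coef, 3))
print("2009-2010 MSE:", mse_0910)
```

With real data the relationship between features and graduation rate may drift between years, so the error on the later year would not be near zero as it is in this noise-free example.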

We compared the models using mean squared error; the lower the mean squared error, the better the model.
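
As a small worked example of this comparison, the sketch below scores two simple predictors by mean squared error and picks the lower one. The graduation rates are made-up numbers rather than project data, and the two predictors (a constant baseline and carrying forward the prior year's rate) are assumptions chosen to keep the arithmetic easy to follow.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Illustrative numbers only, not project data.
actual_0910 = [80.0, 90.0, 70.0, 60.0]   # rates we want to predict
rate_0809 = [78.0, 91.0, 72.0, 63.0]     # prior-year rates for the same districts

# Candidate 1: predict the prior-year mean for every district.
prior_mean = sum(rate_0809) / len(rate_0809)
# Candidate 2: carry each district's prior-year rate forward unchanged.
candidates = {
    "constant baseline": [prior_mean] * len(actual_0910),
    "carry forward 2008-2009": rate_0809,
}

scores = {name: mean_squared_error(actual_0910, preds)
          for name, preds in candidates.items()}
best = min(scores, key=scores.get)  # lowest MSE wins

for name, score in scores.items():
    print(f"{name}: MSE = {score}")
print("best:", best)
```

Here the constant baseline scores 126.0 and the carry-forward predictor scores 4.5, so the latter is preferred.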

* [1. Predicting Numerical Graduation Rate](processbook_NumericalGradRate.ipynb#Predicting-Numerical-Graduation-Rate)
* [2. Table of Contents](processbook_NumericalGradRate.ipynb#Table-of-Contents)
    * [1. Acquiring graduation rate from previous years and adding it to schools districts dataset](processbook_NumericalGradRate.ipynb#Acquiring-graduation-rate-from-previous-years-and-adding-it-to-schools-districts-dataset)
    * [2. Data Analysis](processbook_NumericalGradRate.ipynb#Data-Analysis)
    * [3. Baseline Comparison :](processbook_NumericalGradRate.ipynb#Baseline-Comparison-:)
    * [4. Predictive Modelling using the current school districts datasets.](processbook_NumericalGradRate.ipynb#Predictive-Modelling-using-the-current-school-districts-datasets.)
        * [1. Linear Regression](processbook_NumericalGradRate.ipynb#Linear-Regression)
        * [2. Lasso Regression](processbook_NumericalGradRate.ipynb#Lasso-Regression)
        * [3. Elastic Net Regression](processbook_NumericalGradRate.ipynb#Elastic-Net-Regression)
        * [4. Decision Tree & Random Forests](processbook_NumericalGradRate.ipynb#Decision-Tree-&-Random-Forests)
    * [5. Predicting graduation based on previous year graduation](processbook_NumericalGradRate.ipynb#Predicting-graduation-based-on-previous-year-graduation)
        * [1. Linear Regression - 0607, 0708, and 0809](processbook_NumericalGradRate.ipynb#Linear-Regression---0607,-0708,-and-0809)
        * [2. Linear Regression - 0708 and 0809](processbook_NumericalGradRate.ipynb#Linear-Regression---0708-and-0809)
        * [3. Linear Regression - 0809 Only](processbook_NumericalGradRate.ipynb#Linear-Regression---0809-Only)
        * [4. Linear Regression - 0708 Only](processbook_NumericalGradRate.ipynb#Linear-Regression---0708-Only)
        * [5. Linear Regression - 0607 Only](processbook_NumericalGradRate.ipynb#Linear-Regression---0607-Only)
    * [6. Predicting graduation based on previous year graduation and all other factors](processbook_NumericalGradRate.ipynb#Predicting-graduation-based-on-previous-year-graduation-and-all-other-factors)
        * [1. Linear Regression](processbook_NumericalGradRate.ipynb#Linear-Regression)
        * [2. Lasso Regression](processbook_NumericalGradRate.ipynb#Lasso-Regression)
        * [3. Elastic Net Regression](processbook_NumericalGradRate.ipynb#Elastic-Net-Regression)
        * [4. Decision Trees and Random Forests](processbook_NumericalGradRate.ipynb#Decision-Trees-and-Random-Forests)
        * [5. Creating a best model](processbook_NumericalGradRate.ipynb#Creating-a-best-model)
        * [6. Findings](processbook_NumericalGradRate.ipynb#Findings)
    * [7. Trying 2009-2010 predictions using 2006-2007 data](processbook_NumericalGradRate.ipynb#Trying-2009-2010-predictions-using-2006-2007-data)
        * [1. Linear Regression](processbook_NumericalGradRate.ipynb#Linear-Regression)
        * [2. Lasso Regression](processbook_NumericalGradRate.ipynb#Lasso-Regression)
        * [3. Elastic Net](processbook_NumericalGradRate.ipynb#Elastic-Net)
        * [4. Creating a best model](processbook_NumericalGradRate.ipynb#Creating-a-best-model)
        * [5. Predicting 2009-2010 graduation rate from 2006-2007 data.](processbook_NumericalGradRate.ipynb#Predicting-2009-2010-graduation-rate-from-2006-2007-data.)



## Additional Regression (numerical_part2)

* [Numerical Part 2](numerical_part2.ipynb#Part-2)
    * [1. Baseline Comparison :](numerical_part2.ipynb#Baseline-Comparison-:)
    * [2. Building a Predictive model using the school districts dataset and previous years graduation rates.](numerical_part2.ipynb#Building-a-Predictive-model-using-the-school-districts-dataset-and-previous-years-graduation-rates.)
    * [3. Building a model using School Districts](numerical_part2.ipynb#Building-a-model-using-School-Districts)


## Previous Graduation Rate Analysis

* [Previous Graduation analysis](previousgrad.ipynb)

## School Data Processbook (processbook)

This is the process book that covers how we loaded and cleaned the schools data for the year 2009-2010. We use the schools data in the visualization.

We did not use the schools data in our models because graduation rate data is not available nationwide at the school level. We found individual states or cities that made graduation rate data publicly available, yet it would have been too time-consuming to download from many different places.

* [School Data Processbook](processbook.ipynb)

## District Data Processbook (processbook_schooldistricts)

After rigorously cleaning the data and creating indicators, we try a variety of classifiers to determine the best approach to predicting graduation rate. We also use similar methods to find the dropout rate.

* [1. Predicting School District Performance](processbook_HiLoGradRate.ipynb#Predicting-School-District-Performance)
    * [1. Overview and Motivation](processbook_HiLoGradRate.ipynb#Overview-and-Motivation)
    * [2. Related Work](processbook_HiLoGradRate.ipynb#Related-Work)
* [2. Data Sources](processbook_HiLoGradRate.ipynb#Data-Sources)
    * [1. Schools](processbook_HiLoGradRate.ipynb#Schools)
    * [2. School Districts](processbook_HiLoGradRate.ipynb#School-Districts)
        * [1. Data Loading](processbook_HiLoGradRate.ipynb#Data-Loading)
        * [2. Data Cleaning](processbook_HiLoGradRate.ipynb#Data-Cleaning)
        * [3. Data Derivation](processbook_HiLoGradRate.ipynb#Data-Derivation)
        * [4. Data Filtering](processbook_HiLoGradRate.ipynb#Data-Filtering)
        * [5. Feature Engineering](processbook_HiLoGradRate.ipynb#Feature-Engineering)
        * [6. Test and Training Sets and Standardization](processbook_HiLoGradRate.ipynb#Test-and-Training-Sets-and-Standardization)
* [3. High Graduation Rate](processbook_HiLoGradRate.ipynb#High-Graduation-Rate)
    * [1. Exploratory Data Analysis](processbook_HiLoGradRate.ipynb#Exploratory-Data-Analysis)
    * [2. Writing Classifiers](processbook_HiLoGradRate.ipynb#Writing-Classifiers)
        * [1. Linear SVM](processbook_HiLoGradRate.ipynb#Linear-SVM)
        * [2. Log Regression](processbook_HiLoGradRate.ipynb#Log-Regression)
        * [3. Feature Selection](processbook_HiLoGradRate.ipynb#Feature-Selection)
        * [4. Kernalized SVM](processbook_HiLoGradRate.ipynb#Kernalized-SVM)
        * [5. Decision Trees, Random Forest, ADA & Gradient Boost :](processbook_HiLoGradRate.ipynb#Decision-Trees,-Random-Forest,-ADA-&-Gradient-Boost-:)
        * [6. Decision Trees 1](processbook_HiLoGradRate.ipynb#Decision-Trees-1)
            * [1. Random Forests](processbook_HiLoGradRate.ipynb#Random-Forests)
            * [2. ADA Booster](processbook_HiLoGradRate.ipynb#ADA-Booster)
            * [3. Gradient Boosting](processbook_HiLoGradRate.ipynb#Gradient-Boosting)
            * [4. Decision Tree - No Gender or Ethnicity](processbook_HiLoGradRate.ipynb#Decision-Tree---No-Gender-or-Ethnicity)
            * [5. Random Forests - No Gender/Ethnicity](processbook_HiLoGradRate.ipynb#Random-Forests---No-Gender/Ethnicity)
            * [6. ADA Booster - No Gender/Ethnicity](processbook_HiLoGradRate.ipynb#ADA-Booster---No-Gender/Ethnicity)
            * [7. Gradient Boosting - No Gender/Ethnicity](processbook_HiLoGradRate.ipynb#Gradient-Boosting---No-Gender/Ethnicity)
        * [7. Final Comparison of All Models](processbook_HiLoGradRate.ipynb#Final-Comparison-of-All-Models)
* [4. Low Graduation](processbook_HiLoGradRate.ipynb#Low-Graduation)
    * [1. Exploratory Data Analysis](processbook_HiLoGradRate.ipynb#Exploratory-Data-Analysis)
    * [2. Writing Classifiers](processbook_HiLoGradRate.ipynb#Writing-Classifiers)
        * [1. Linear SVM](processbook_HiLoGradRate.ipynb#Linear-SVM)
        * [2. Log Regression](processbook_HiLoGradRate.ipynb#Log-Regression)
        * [3. Feature Selection](processbook_HiLoGradRate.ipynb#Feature-Selection)
        * [4. Kernalized SVM](processbook_HiLoGradRate.ipynb#Kernalized-SVM)
        * [5. Decision Trees](processbook_HiLoGradRate.ipynb#Decision-Trees)
        * [6. Random Forests](processbook_HiLoGradRate.ipynb#Random-Forests)
        * [7. ADA Booster](processbook_HiLoGradRate.ipynb#ADA-Booster)
        * [8. Gradient Boosting](processbook_HiLoGradRate.ipynb#Gradient-Boosting)
        * [9. Decision Tree - No Gender or Ethnicity](processbook_HiLoGradRate.ipynb#Decision-Tree---No-Gender-or-Ethnicity)
        * [10. Random Forests - No Gender/Ethnicity](processbook_HiLoGradRate.ipynb#Random-Forests---No-Gender/Ethnicity)
        * [11. ADA Booster - No Gender/Ethnicity](processbook_HiLoGradRate.ipynb#ADA-Booster---No-Gender/Ethnicity)
        * [12. Gradient Boosting - No Gender/Ethnicity](processbook_HiLoGradRate.ipynb#Gradient-Boosting---No-Gender/Ethnicity)
        * [13. Final Comparison of All Models](processbook_HiLoGradRate.ipynb#Final-Comparison-of-All-Models)
* [5. Model of Numerical Graduation Rate](processbook_HiLoGradRate.ipynb#Model-of-Numerical-Graduation-Rate)


## Visualization School Cleanup

* [Visualization School Cleanup](visualizationSchoolCleanupEDA.ipynb)

## Visualization

### Grouping Creation
We created our grouping information for Tableau visualization in a separate process book.

We created custom groupings so that an end user could compare a school district to the nation, to its state, and to similar school districts, the last being a peer grouping that we defined ourselves.<br/>
Link: <a href="https://github.com/ashwindeo/dataschoolers/blob/master/grouping.ipynb">Grouping Process Book</a>

### Tableau
We built the visualization for our website in Tableau so an end user would have a rich, dynamic experience.

Link: <a href="">Visualization Link</a>

## Conclusion