# M5: Implementing and Evaluating a Series of Regression Models
# DAV 6150

- Group Members: Qing Dou, Ruoyu Chen, Zhengnan Li
- Repository: https://github.com/Zhengnan817/DAV-6150/tree/1f6efa951f5b5984564fb500ce7e1d4271e0e147/Project_1

# 1. Introduction

This project analyzes a dataset of NY State high school graduation metrics for the 2018-2019 school year. Our goal is to develop one or more regression models that predict the number of dropouts for a school district or for a specific group of students. The predictions will be based on characteristics of the district and of the student grouping, and possibly on other relevant factors such as students' socioeconomic status, the district's geographic location, and school resources and capacity.

In short, we will apply the full data science project lifecycle to implement and evaluate a series of regression models that predict the number of student “dropouts” from the properties of a given school district and its associated student subgrouping.


### 1.1 Approach
The analysis follows these steps:
- __Exploratory Data Analysis:__ Get a better understanding of the dataset and its variables.
- __Data Preparation:__ Correct data quality problems, then apply feature selection and domain knowledge to create new attributes.
- __Prepped Data Review:__ Re-run the EDA on the prepared data to check that the variables are ready for model building.
- __Regression Model and Evaluation:__ Build several regression models and evaluate each one.
- __Model Selection:__ State the model selection criteria and choose a model based on performance.
- __Conclusion:__ Summarize the findings and conclusions of the analysis.
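
The steps above can be sketched end to end on toy data. This is only an illustration of the workflow, not the models built in this project: the synthetic data, the single predictor, the 80/20 split, the ordinary least squares fit, and the helper names `make_toy_data`, `design_matrix`, and `rmse` are all assumptions introduced here.

```python
import numpy as np
import pandas as pd


def make_toy_data(n_rows: int = 200, seed: int = 0) -> pd.DataFrame:
    """Synthetic stand-in for the real file (all values are made up)."""
    rng = np.random.default_rng(seed)
    enroll_cnt = rng.integers(20, 500, size=n_rows)
    # Assumption for illustration: about 8% of enrolled students drop out.
    dropout_cnt = rng.poisson(0.08 * enroll_cnt)
    return pd.DataFrame({"enroll_cnt": enroll_cnt, "dropout_cnt": dropout_cnt})


def design_matrix(x: np.ndarray) -> np.ndarray:
    """Prepend an intercept column to a 1-D predictor."""
    return np.column_stack([np.ones(len(x)), x])


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error between observed and predicted values."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


toy = make_toy_data()

# Exploratory data analysis: summary statistics of each variable.
print(toy.describe())

# Data preparation and review: hold out 20% of the rows for evaluation.
train = toy.sample(frac=0.8, random_state=0)
test = toy.drop(train.index)
x_train = train["enroll_cnt"].to_numpy(dtype=float)
y_train = train["dropout_cnt"].to_numpy(dtype=float)
x_test = test["enroll_cnt"].to_numpy(dtype=float)
y_test = test["dropout_cnt"].to_numpy(dtype=float)

# Regression model: ordinary least squares with an intercept.
coef, *_ = np.linalg.lstsq(design_matrix(x_train), y_train, rcond=None)

# Evaluation and selection: compare against a predict-the-mean baseline.
model_rmse = rmse(y_test, design_matrix(x_test) @ coef)
baseline_rmse = rmse(y_test, np.full(len(y_test), y_train.mean()))
print(f"model RMSE: {model_rmse:.2f}, baseline RMSE: {baseline_rmse:.2f}")
```

Comparing each candidate against a trivial baseline on held-out rows is one simple way to make the later model selection step concrete.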

### 1.2 Dataset
The dataset comprises more than 73,000 observations, each of which pertains to a particular NY State
school district and an associated subgrouping/categorization of high school students who had been enrolled for
at least 4 years as of the end of the 2018-2019 school year. A data dictionary describing the attributes
in the file is provided below.

The response variable we will model is the dataset’s “dropout_cnt” attribute, which represents the number
of enrolled students who discontinued their enrollment (i.e., “dropped out”) within the indicated school
district and student subgroup.

| Attribute                       | Description                                                                                   |
|---------------------------------|-----------------------------------------------------------------------------------------------|
| report_school_year              | Indicates school year for which high school graduation info is being reported                 |
| aggregation_index               | Numeric code identifying manner in which high school graduation data has been aggregated      |
| aggregation_type                | Text description of how high school graduation data has been aggregated                       |
| nrc_code                        | Numeric code identifying "needs / resource capacity", an indicator of the type of school district |
| nrc_desc                        | Text description of the type of school district                                               |
| county_code                     | Numeric code for county name                                                                  |
| county_name                     | Full name of applicable NY State county                                                       |
| nyc_ind                         | Indicates whether or not the school district resides within the borders of NYC                |
| membership_desc                 | Indicates school year in which students first enrolled in High School                         |
| subgroup_code                   | Numeric code identifying student subgrouping                                                  |
| subgroup_name                   | Text description of student subgrouping. Note: a student may belong to more than one subgrouping |
| enroll_cnt                      | How many students of the indicated subgrouping were enrolled during the given school year     |
| grad_cnt                        | How many enrolled students of the indicated subgrouping graduated at the end of the given school year |
| grad_pct                        | What percentage of enrolled students of the indicated subgrouping graduated at the end of the given school year |
| local_cnt                       | How many enrolled students of the indicated subgrouping were awarded a "Local" diploma        |
| local_pct                       | What percentage of enrolled students of the indicated subgrouping were awarded a "Local" diploma |
| reg_cnt                         | How many enrolled students of the indicated subgrouping were awarded a "Regents" diploma      |
| reg_pct                         | What percentage of enrolled students of the indicated subgrouping were awarded a "Regents" diploma |
| reg_adv_cnt                     | How many enrolled students of the indicated subgrouping were awarded a "Regents Advanced" diploma |
| reg_adv_pct                     | What percentage of enrolled students of the indicated subgrouping were awarded a "Regents Advanced" diploma |
| non_diploma_credential_cnt      | How many enrolled students of the indicated subgrouping achieved a non-diploma credential     |
| non_diploma_credential_pct      | What percentage of enrolled students of the indicated subgrouping achieved a non-diploma credential |
| still_enr_cnt                   | How many enrolled students of the indicated subgrouping did not graduate but were still enrolled |
| still_enr_pct                   | What percentage of enrolled students of the indicated subgrouping did not graduate but were still enrolled |
| ged_cnt                         | How many enrolled students of the indicated subgrouping were awarded a "GED" diploma         |
| ged_pct                         | What percentage of enrolled students of the indicated subgrouping were awarded a "GED" diploma |
| __dropout_cnt (Target)__        | __How many enrolled students of the indicated subgrouping discontinued their high school enrollment during the school year__ |
| dropout_pct                     | What percentage of enrolled students of the indicated subgrouping discontinued their high school enrollment during the school year |
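
As the data dictionary describes them, each `_pct` attribute expresses the matching `_cnt` attribute as a share of `enroll_cnt`. A minimal sketch of that relationship follows. The counts are made up, the helper `pct_of_enrollment` is hypothetical, and how the published file rounds its percentages is not stated in the dictionary, so no rounding is applied here.

```python
import pandas as pd


def pct_of_enrollment(cnt: pd.Series, enroll: pd.Series) -> pd.Series:
    """Return cnt as a percentage of enroll; rows with zero enrollment become NaN."""
    safe_enroll = enroll.where(enroll > 0)  # avoid dividing by zero
    return cnt * 100 / safe_enroll


# Made-up counts, not taken from the dataset.
toy = pd.DataFrame({"enroll_cnt": [200, 50, 0], "dropout_cnt": [30, 5, 0]})
toy["dropout_pct"] = pct_of_enrollment(toy["dropout_cnt"], toy["enroll_cnt"])
print(toy)
```

Because the percentage columns are derived from the counts and enrollment, a model for `dropout_cnt` that uses `dropout_pct` together with `enroll_cnt` as predictors would effectively be given the answer.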


# 2. Exploratory Data Analysis

### 2.1 Data Overview

Read the data from our GitHub repository. The resulting dataframe is shown below:

In [2]:
import warnings
import pandas as pd
# Silence pandas' chained-assignment warning. pd.errors is its public location (pandas 1.5 and later).
warnings.filterwarnings("ignore", category=pd.errors.SettingWithCopyWarning)

In [3]:
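# Load the raw CSV directly from the project's GitHub repository.
dropouts = pd.read_csv("https://raw.githubusercontent.com/Zhengnan817/DAV-6150/main/Project_1/src/Project1_Data.csv")
# Display the dataframe (the notebook truncates long output).
dropouts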

|  | report_school_year | aggregation_index | aggregation_type | aggregation_name | nrc_code | nrc_desc | county_code | county_name | nyc_ind | membership_desc | ... | reg_adv_cnt | reg_adv_pct | non_diploma_credential_cnt | non_diploma_credential_pct | still_enr_cnt | still_enr_pct | ged_cnt | ged_pct | dropout_cnt | dropout_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-19 | 3 | District | ALBANY CITY SCHOOL DISTRICT | 3 | Urban-Suburban High Needs | 1 | ALBANY | 0 | 2013 Total Cohort - 6 Year Outcome | ... | 91 | 14% | 16 | 2% | 30 | 5% | 0 | 0% | 148 | 22% |
| 1 | 2018-19 | 3 | District | ALBANY CITY SCHOOL DISTRICT | 3 | Urban-Suburban High Needs | 1 | ALBANY | 0 | 2013 Total Cohort - 6 Year Outcome | ... | 47 | 15% | 2 | 1% | 11 | 3% | 0 | 0% | 65 | 20% |
| 2 | 2018-19 | 3 | District | ALBANY CITY SCHOOL DISTRICT | 3 | Urban-Suburban High Needs | 1 | ALBANY | 0 | 2013 Total Cohort - 6 Year Outcome | ... | 44 | 13% | 14 | 4% | 19 | 6% | 0 | 0% | 83 | 25% |
| 3 | 2018-19 | 3 | District | ALBANY CITY SCHOOL DISTRICT | 3 | Urban-Suburban High Needs | 1 | ALBANY | 0 | 2013 Total Cohort - 6 Year Outcome | ... | - | - | - | - | - | - | - | - | - | - |
| 4 | 2018-19 | 3 | District | ALBANY CITY SCHOOL DISTRICT | 3 | Urban-Suburban High Needs | 1 | ALBANY | 0 | 2013 Total Cohort - 6 Year Outcome | ... | 23 | 6% | 10 | 3% | 18 | 5% | 0 | 0% | 91 | 25% |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 73147 | 2018-19 | 3 | District | DUNDEE CENTRAL SCHOOL DISTRICT | 4 | Rural High Needs | 68 | YATES | 0 | 2013 Total Cohort - 6 Year Outcome - August 2019 | ... | - | - | - | - | - | - | - | - | - | - |
| 73148 | 2018-19 | 3 | District | DUNDEE CENTRAL SCHOOL DISTRICT | 4 | Rural High Needs | 68 | YATES | 0 | 2013 Total Cohort - 6 Year Outcome - August 2019 | ... | - | - | - | - | - | - | - | - | - | - |
| 73149 | 2018-19 | 3 | District | DUNDEE CENTRAL SCHOOL DISTRICT | 4 | Rural High Needs | 68 | YATES | 0 | 2013 Total Cohort - 6 Year Outcome - August 2019 | ... | - | - | - | - | - | - | - | - | - | - |
| 73150 | 2018-19 | 3 | District | DUNDEE CENTRAL SCHOOL DISTRICT | 4 | Rural High Needs | 68 | YATES | 0 | 2013 Total Cohort - 6 Year Outcome - August 2019 | ... | - | - | - | - | - | - | - | - | - | - |
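
The preview above shows two things that keep the count and percentage columns from being numeric as loaded: some rows hold `-` in place of a value, and percentages carry a trailing `%`. One possible way to convert such columns is sketched below, using a few values copied from the preview. The helper `to_numeric_columns` is hypothetical, and treating `-` as a missing value is an assumption about what the placeholder means.

```python
import pandas as pd

# A few values copied from the first four rows of the preview.
preview = pd.DataFrame({
    "reg_adv_cnt": ["91", "47", "44", "-"],
    "reg_adv_pct": ["14%", "15%", "13%", "-"],
    "dropout_cnt": ["148", "65", "83", "-"],
    "dropout_pct": ["22%", "20%", "25%", "-"],
})


def to_numeric_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Strip a trailing '%' and turn the '-' placeholder into a missing value."""
    out = df.copy()
    for col in out.columns:
        cleaned = out[col].astype(str).str.strip().str.rstrip("%")
        # Assumption: a bare "-" means the value was not reported.
        cleaned = cleaned.mask(cleaned == "-")
        # Any other non-numeric string raises, so surprises are not silently hidden.
        out[col] = pd.to_numeric(cleaned)
    return out


clean = to_numeric_columns(preview)
print(clean)
print(clean.isna().sum())
```

Counting the missing values per column afterwards gives a quick measure of how many rows carry the placeholder.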


Preview the dataframe: