# Education data prep **Part 2**

### This script combines the following 3 datasets, aggregates them by county, redesigns column naming structure, and re-calculates rates:
1. District Student Mobility/Stability Statistics 2011-2012 **by Instructional Program/Service Type**
2. District Student Mobility/Stability Statistics 2011-2012 **by Gender & Race/Ethnicity**
3. District Graduation Data Statistics 2011-2012 **by Instructional Program Service Type**
## Reference: Column Naming conventions

- This dataset is designed so you should never have to look at the columns to find the name of one (since there are around 140 columns). Just look here for reference instead.
- For instance, to get the rate for any variable, just use `_rate` after a variable. So `graduated` becomes `graduated_rate`

| Type | Naming | Example |
| - | - | - |
| County Total | variable | `stable` |
| Count | group + variable | `disabled_stable` |
| Rate | group + variable + "rate" | `disabled_stable_rate` |
| Group Total | group + group total | `disabled_pupil_total` |

<br>

#### Mobility/Stability columns

| GROUPS | VARIABLES | GROUP TOTALS |
| - | - | - |
| disabled | stable | pupil_total |
| limited_eng | mobile | 
| poor | mobile_instances |
| migrant | 
| title_1 | 
| homeless |
| gifted |
| male |
| female |
| white |
| asian |
| black |
| hispanic |

<br>

#### Graduation columns

| GROUPS | VARIABLES | GROUP TOTALS |
| - | - | - |
| disabled | graduated | grad_base_total |
| limited_eng | completed |
| poor |
| migrant |
| title_1 |
| homeless |
| gifted |

<br>

**What are group totals?**
- Notice they aren't just called "total". This is because, for graduation data, we don't care about the total number of students. We care about the total number of students who are actually in the pool for graduation. So, we call it `grad_base_total` and use that when calculating rate

**Rates are calculated by dividing a variable by its group total, then multiplying by 100**

---
---
---

In [49]:
def head(*args, n=3):
    for df in [*args]:
        print("COLS: ", df.shape[1])
        print("ROWS: ", df.shape[0])
        display(df.head(n))

In [50]:
import sys
import pandas as pd, numpy as np
sys.path.append('../geo')
from geo_df import GeoDF

# These 3 datasets have each been cleaned already, and had their county names standardized so they can be joined
grad_raw = pd.read_csv("../education/output/dist_grad_rate.csv", index_col=0)
mob_raw = pd.read_csv("../education/output/dist_mobility_rate.csv", index_col=0)
mob_dem_raw = pd.read_csv("../education/output/dist_mobility_rate_demographics.csv", index_col=0)
geo = pd.read_csv('../geo/output/geo_county_school.csv')
geo = geo[['county', 'dist', 'geo_county', 'geo_dist', 'geo_county_point', 'geo_dist_point']]
head(grad_raw, mob_raw, mob_dem_raw, geo)

COLS:  37
ROWS:  184


Unnamed: 0,county,school_dist,students_with_disabilities_final_grad_base,students_with_disabilities_graduates_total,students_with_disabilities_graduation_rate,students_with_disabilities_completers_total,students_with_disabilities_completion_rate,limited_english_proficient_final_grad_base,limited_english_proficient_graduates_total,limited_english_proficient_graduation_rate,...,homeless_final_grad_base,homeless_graduates_total,homeless_graduation_rate,homeless_completers_total,homeless_completion_rate,gifted_talented_final_grad_base,gifted_talented_graduates_total,gifted_talented_graduation_rate,gifted_talented_completers_total,gifted_talented_completion_rate
0,STATE TOTAL,STATE TOTAL,5775,3099,53.7,3222,55.8,6171,3289,53.3,...,2394,1175,49.1,1262,52.7,6604,6048,91.6,6156,93.2
2,ADAMS,MAPLETON 1,49,18,36.7,19,38.8,219,73,33.3,...,41,12,29.3,16,39.0,44,27,61.4,27,61.4
3,ADAMS,ADAMS 12 FIVE STAR SCHOOLS,250,118,47.2,127,50.8,379,257,67.8,...,106,62,58.5,65,61.3,227,201,88.5,208,91.6


COLS:  58
ROWS:  184


Unnamed: 0,county,school_dist,total_pupil_count_all_students,total_stable_pupil_count_all_students,total_stability_rate_all_students,total_mobile_student_count_all_students,total_student_mobility_rate_all_students,total_instances_of_mobility_all_students,total_mobility_incidence_rate_all_students,students_with_disabilities_pupil_count,...,homeless_student_mobility_rate,homeless_instances_of_mobility,homeless_mobility_incidence_rate,gifted_talented_pupil_count,gifted_talented_stable_student_count,gifted_talented_stability_rate,gifted_talented_mobile_student_count,gifted_talented_student_mobility_rate,gifted_talented_instances_of_mobility,gifted_talented_mobility_incidence_rate
0,STATE TOTAL,STATE TOTAL,939283,705064,75.1,231706,24.7,253577,27.0,84121,...,45.3,11558,54.2,73344,66620,90.8,6641,9.1,7366,10.0
1,ADAMS,MAPLETON 1,9037,5077,56.2,3919,43.4,4133,45.7,735,...,32.7,79,36.9,250,205,82.0,44,17.6,47,18.8
2,ADAMS,ADAMS 12 FIVE STAR SCHOOLS,49889,34283,68.7,15424,30.9,16854,33.8,4339,...,57.2,481,68.2,3590,3225,89.8,361,10.1,404,11.3


COLS:  72
ROWS:  184


Unnamed: 0,county,school_dist,total_pupil_count,total_stable_student_count,total_stability_rate,total_mobile_student_count,total_student_mobility_rate,total_instances_of_mobility,total_mobility_incidence_rate,total_female_pupil_count,...,total_native_hawaiian_or_other_pacific_islander_student_mobility_rate,total_native_hawaiian_or_other_pacific_islander_instances_of_mobility,total_native_hawaiian_or_other_pacific_islander_mobility_incidence_rate,total_two_or_more_races_pupil_count,total_two_or_more_races_stable_student_count,total_two_or_more_races_stability_rate,total_two_or_more_races_mobile_student_count,total_two_or_more_races_student_mobility_rate,total_two_or_more_races_instances_of_mobility,total_two_or_more_races_mobility_incidence_rate
0,STATE TOTAL,STATE TOTAL,939283,705064,75.1,231706,24.7,253577,27.0,458512,...,34.8,840,38.0,29329,21501,73.3,7718,26.3,8433,28.8
2,ADAMS,MAPLETON 1,9037,5077,56.2,3919,43.4,4133,45.7,4450,...,70.8,17,70.8,219,129,58.9,90,41.1,91,41.6
3,ADAMS,ADAMS 12 FIVE STAR SCHOOLS,49889,34283,68.7,15424,30.9,16854,33.8,24340,...,45.3,42,48.8,662,455,68.7,203,30.7,222,33.5


COLS:  6
ROWS:  183


Unnamed: 0,county,dist,geo_county,geo_dist,geo_county_point,geo_dist_point
0,ADAMS,MAPLETON 1,MULTIPOLYGON (((-103.70574149517748 39.9999110...,MULTIPOLYGON (((-105.01581612299998 39.8144774...,POINT (-104.1930918 39.8398269),POINT (-104.9187196 39.8415103)
1,ADAMS,ADAMS 12 FIVE STAR SCHOOLS,MULTIPOLYGON (((-103.70574149517748 39.9999110...,MULTIPOLYGON (((-105.05310614499996 39.9302934...,POINT (-104.1930918 39.8398269),POINT (-104.9668135 39.9262994)
2,ADAMS,ADAMS COUNTY 14,MULTIPOLYGON (((-103.70574149517748 39.9999110...,MULTIPOLYGON (((-104.96883410999999 39.7910064...,POINT (-104.1930918 39.8398269),POINT (-104.9268419 39.8059605)


### Merge and group by county

In [51]:
# Remove the columns duplicated across mobility demographics and mobility datasets
mob_dem = mob_dem_raw.drop(columns=[
    'total_pupil_count', 'total_stable_student_count', 'total_stability_rate', 'total_mobile_student_count',
    'total_student_mobility_rate', 'total_instances_of_mobility', 'total_mobility_incidence_rate'])

# Combine the two mobility datasets
mob = mob_raw.merge(mob_dem, on=['county', 'school_dist'])

# Combined the mobility and graduate data into one df
df_raw_dist = mob.merge(grad_raw, on=['county', 'school_dist'])

df_raw_dist.to_csv('output/all_education_raw.csv', index=False)

head(df_raw_dist)

COLS:  156
ROWS:  184


Unnamed: 0,county,school_dist,total_pupil_count_all_students,total_stable_pupil_count_all_students,total_stability_rate_all_students,total_mobile_student_count_all_students,total_student_mobility_rate_all_students,total_instances_of_mobility_all_students,total_mobility_incidence_rate_all_students,students_with_disabilities_pupil_count,...,homeless_final_grad_base,homeless_graduates_total,homeless_graduation_rate,homeless_completers_total,homeless_completion_rate,gifted_talented_final_grad_base,gifted_talented_graduates_total,gifted_talented_graduation_rate,gifted_talented_completers_total,gifted_talented_completion_rate
0,STATE TOTAL,STATE TOTAL,939283,705064,75.1,231706,24.7,253577,27.0,84121,...,2394,1175,49.1,1262,52.7,6604,6048,91.6,6156,93.2
1,ADAMS,MAPLETON 1,9037,5077,56.2,3919,43.4,4133,45.7,735,...,41,12,29.3,16,39.0,44,27,61.4,27,61.4
2,ADAMS,ADAMS 12 FIVE STAR SCHOOLS,49889,34283,68.7,15424,30.9,16854,33.8,4339,...,106,62,58.5,65,61.3,227,201,88.5,208,91.6


### Functions

In [52]:
INDEX = []

def cols(df):
    """ Get cols as list instead of index object. Exclude county """
    return [c for c in df.columns if c not in INDEX]

def separate_by(df, text) -> (pd.DataFrame, pd.DataFrame):
    """
    Given a df and a substring, return two dfs:
    - df with county + all columns whose name does NOT contain substring
    - df with county + all columns whose name DOES contain substring
    """
    names = [c for c in df.columns if text in c]
    return (
        df.copy().drop(columns=names), # Cols without text
        df.copy()[INDEX + names], # Cols with text
    )

def rename_all(df, text, replacement) -> pd.DataFrame:
    """ Bulk replace a substring in the name of all columns """
    for c in df.columns:
        df = df.rename(columns={c: c.replace(text, replacement)})
    return df

def merge(df1, df2) -> pd.DataFrame:
    """ Shorthand for pandas merging, since they will always be inner joined on county"""
    return df1.merge(df2, how='inner', on=INDEX)

## Column Name Manipulation
---

In [53]:
df = df_raw_dist.copy()

### Remove all rates. They got messed up when we aggregated by county

In [54]:
df, _ = separate_by(df, "rate")

#### Remove native american and native hawaiian because the group sizes are very small and values are 0 for a lot of counties. Remove "two_or_more_races" because it's inconsistent, and difficult to compare groups

In [55]:
df, _ = separate_by(df, "american_indian")
df, _ = separate_by(df, "native_hawaiian")
df, _ = separate_by(df, "two_or_more")

### Standardize group names, then shorten group names
- Graduation data has `limited_english_proficient` and `econ_disadvant` 
- Mobility data `english_language_learners` and `economically_disadvantaged`

**Standardize these to `limited_english` and `econ_disadvant`, and shorten the others**

In [56]:
# Mobility/Stability groups
df = rename_all(df, "limited_english_proficient", "limited_eng")
df = rename_all(df, "english_language_learners", "limited_eng")
df = rename_all(df, "economically_disadvantaged", "poor")
df = rename_all(df, "econ_disadvant", "poor")
df = rename_all(df, "students_with_disabilities", "disabled")
df = rename_all(df, "gifted_talented", "gifted")

# Demographics
df = rename_all(df, "black_or_african_american", "black")
df = rename_all(df, "hispanic_or_latino", "hispanic")

### Rename more stuff for readability/consistency

In [57]:
# Graduation data
df = rename_all(df, "final_grad_base", "grad_base_total")
df = rename_all(df, "graduates_total", "graduated")
df = rename_all(df, "completers_total", "completed")

# Mobility/Stability data
df = rename_all(df, "instances_of_mobility", "mobile_instances")
df = rename_all(df, "pupil_count", "pupil_total")
df = rename_all(df, "_student_count", "")

# Variable totals
df = rename_all(df, "_all_students", "")
df = rename_all(df, "total_", "")
df = df.rename(columns={'stable_pupil_total': 'stable'})

In [58]:
df_dist_counts = df.copy()
df_county_counts = df.copy().groupby('county').sum().reset_index()

head(df_dist_counts)

COLS:  79
ROWS:  184


Unnamed: 0,county,school_dist,pupil_total,stable,mobile,mobile_instances,disabled_pupil_total,disabled_stable,disabled_mobile,disabled_mobile_instances,...,migrant_completed,title_1_grad_base_total,title_1_graduated,title_1_completed,homeless_grad_base_total,homeless_graduated,homeless_completed,gifted_grad_base_total,gifted_graduated,gifted_completed
0,STATE TOTAL,STATE TOTAL,939283,705064,231706,253577,84121,65593,18348,20784,...,223,7398,3853,4129,2394,1175,1262,6604,6048,6156
1,ADAMS,MAPLETON 1,9037,5077,3919,4133,735,469,261,279,...,5,218,118,124,41,12,16,44,27,27
2,ADAMS,ADAMS 12 FIVE STAR SCHOOLS,49889,34283,15424,16854,4339,3001,1325,1501,...,12,224,80,98,106,62,65,227,201,208


## Calculate Rates
---

In [59]:
def get_rates(df, INDEX):
    df = df.copy()
    df_rates = df.copy()[INDEX]

    for c in ['stable', 'mobile', 'mobile_instances']:
        group_rate = (df[c] / df['pupil_total'] * 100).round(2).fillna(0)
        df_rates[f"{c}_rate"] = group_rate
        df[f"{c}_rate"] = group_rate

    # Calculate rates dynamically
    for group in [
            'disabled', 'limited_eng', 'poor', 'migrant', 'title_1', 'homeless', 'gifted',
            'male', 'female', 'white', 'black', 'hispanic', 'asian']:

        for c in [c for c in df.columns if group in c and "total" not in c]:
            var = c.replace(f"{group}_", '')

            if var in ['graduated', 'completed']:
                new = df[c] / df[f"{group}_grad_base_total"]
            else:
                new = df[c] / df[f"{group}_pupil_total"]
            
            new = (new * 100).round(2).fillna(0)
            df_rates[f"{c}_rate"] = new
            df[f"{c}_rate"] = new

    return df, df_rates

In [60]:
df_dist_all, df_dist_rates = get_rates(df_dist_counts, ['county', 'school_dist'])
df_county_all, df_county_rates = get_rates(df_county_counts, ['county'])

In [61]:
head(df_dist_all, df_dist_counts, df_dist_rates, df_county_all, df_county_counts, df_county_rates)

COLS:  138
ROWS:  184


Unnamed: 0,county,school_dist,pupil_total,stable,mobile,mobile_instances,disabled_pupil_total,disabled_stable,disabled_mobile,disabled_mobile_instances,...,white_mobile_instances_rate,black_stable_rate,black_mobile_rate,black_mobile_instances_rate,hispanic_stable_rate,hispanic_mobile_rate,hispanic_mobile_instances_rate,asian_stable_rate,asian_mobile_rate,asian_mobile_instances_rate
0,STATE TOTAL,STATE TOTAL,939283,705064,231706,253577,84121,65593,18348,20784,...,23.98,64.76,34.59,38.9,72.93,26.74,30.02,77.21,22.68,24.42
1,ADAMS,MAPLETON 1,9037,5077,3919,4133,735,469,261,279,...,52.08,48.04,51.4,53.63,60.31,39.19,42.19,47.22,52.78,53.7
2,ADAMS,ADAMS 12 FIVE STAR SCHOOLS,49889,34283,15424,16854,4339,3001,1325,1501,...,32.27,55.98,43.94,47.41,67.3,32.23,36.81,81.07,18.84,21.15


COLS:  79
ROWS:  184


Unnamed: 0,county,school_dist,pupil_total,stable,mobile,mobile_instances,disabled_pupil_total,disabled_stable,disabled_mobile,disabled_mobile_instances,...,migrant_completed,title_1_grad_base_total,title_1_graduated,title_1_completed,homeless_grad_base_total,homeless_graduated,homeless_completed,gifted_grad_base_total,gifted_graduated,gifted_completed
0,STATE TOTAL,STATE TOTAL,939283,705064,231706,253577,84121,65593,18348,20784,...,223,7398,3853,4129,2394,1175,1262,6604,6048,6156
1,ADAMS,MAPLETON 1,9037,5077,3919,4133,735,469,261,279,...,5,218,118,124,41,12,16,44,27,27
2,ADAMS,ADAMS 12 FIVE STAR SCHOOLS,49889,34283,15424,16854,4339,3001,1325,1501,...,12,224,80,98,106,62,65,227,201,208


COLS:  61
ROWS:  184


Unnamed: 0,county,school_dist,stable_rate,mobile_rate,mobile_instances_rate,disabled_stable_rate,disabled_mobile_rate,disabled_mobile_instances_rate,disabled_graduated_rate,disabled_completed_rate,...,white_mobile_instances_rate,black_stable_rate,black_mobile_rate,black_mobile_instances_rate,hispanic_stable_rate,hispanic_mobile_rate,hispanic_mobile_instances_rate,asian_stable_rate,asian_mobile_rate,asian_mobile_instances_rate
0,STATE TOTAL,STATE TOTAL,75.06,24.67,27.0,77.97,21.81,24.71,53.66,55.79,...,23.98,64.76,34.59,38.9,72.93,26.74,30.02,77.21,22.68,24.42
1,ADAMS,MAPLETON 1,56.18,43.37,45.73,63.81,35.51,37.96,36.73,38.78,...,52.08,48.04,51.4,53.63,60.31,39.19,42.19,47.22,52.78,53.7
2,ADAMS,ADAMS 12 FIVE STAR SCHOOLS,68.72,30.92,33.78,69.16,30.54,34.59,47.2,50.8,...,32.27,55.98,43.94,47.41,67.3,32.23,36.81,81.07,18.84,21.15


COLS:  137
ROWS:  64


Unnamed: 0,county,pupil_total,stable,mobile,mobile_instances,disabled_pupil_total,disabled_stable,disabled_mobile,disabled_mobile_instances,limited_eng_pupil_total,...,white_mobile_instances_rate,black_stable_rate,black_mobile_rate,black_mobile_instances_rate,hispanic_stable_rate,hispanic_mobile_rate,hispanic_mobile_instances_rate,asian_stable_rate,asian_mobile_rate,asian_mobile_instances_rate
0,ADAMS,98546,67272,31222,33925,8848,6263,2588,2896,20773,...,32.45,54.66,45.08,47.91,67.49,32.71,36.4,78.55,21.37,23.54
1,ALAMOSA,2775,1882,885,950,223,159,63,66,368,...,35.54,57.14,42.86,42.86,70.15,29.47,32.69,64.0,36.0,36.0
2,ARAPAHOE,124639,94109,30134,32269,11842,9461,2354,2568,25370,...,21.16,67.47,32.03,34.57,72.76,26.75,29.2,78.59,21.3,22.56


COLS:  78
ROWS:  64


Unnamed: 0,county,pupil_total,stable,mobile,mobile_instances,disabled_pupil_total,disabled_stable,disabled_mobile,disabled_mobile_instances,limited_eng_pupil_total,...,migrant_completed,title_1_grad_base_total,title_1_graduated,title_1_completed,homeless_grad_base_total,homeless_graduated,homeless_completed,gifted_grad_base_total,gifted_graduated,gifted_completed
0,ADAMS,98546,67272,31222,33925,8848,6263,2588,2896,20773,...,33,935,529,559,360,190,204,402,337,345
1,ALAMOSA,2775,1882,885,950,223,159,63,66,368,...,4,28,22,23,6,6,6,0,0,0
2,ARAPAHOE,124639,94109,30134,32269,11842,9461,2354,2568,25370,...,9,488,202,213,243,96,102,909,820,828


COLS:  60
ROWS:  64


Unnamed: 0,county,stable_rate,mobile_rate,mobile_instances_rate,disabled_stable_rate,disabled_mobile_rate,disabled_mobile_instances_rate,disabled_graduated_rate,disabled_completed_rate,limited_eng_stable_rate,...,white_mobile_instances_rate,black_stable_rate,black_mobile_rate,black_mobile_instances_rate,hispanic_stable_rate,hispanic_mobile_rate,hispanic_mobile_instances_rate,asian_stable_rate,asian_mobile_rate,asian_mobile_instances_rate
0,ADAMS,68.26,31.68,34.43,70.78,29.25,32.73,47.54,50.1,69.99,...,32.45,54.66,45.08,47.91,67.49,32.71,36.4,78.55,21.37,23.54
1,ALAMOSA,67.82,31.89,34.23,71.3,28.25,29.6,86.67,93.33,72.01,...,35.54,57.14,42.86,42.86,70.15,29.47,32.69,64.0,36.0,36.0
2,ARAPAHOE,75.51,24.18,25.89,79.89,19.88,21.69,51.26,52.06,74.94,...,21.16,67.47,32.03,34.57,72.76,26.75,29.2,78.59,21.3,22.56


## Save
---

In [62]:
df_dist_all.to_csv("output/education_dist.csv", index=False)
df_dist_counts.to_csv("output/education_dist_counts.csv", index=False)
df_dist_rates.to_csv("output/education_dist_rates.csv", index=False)

df_county_all.to_csv("output/education_county.csv", index=False)
df_county_counts.to_csv("output/education_county_counts.csv", index=False)
df_county_rates.to_csv("output/education_county_rates.csv", index=False)