### This script combines the following three datasets, aggregates them by county, renames the columns to a consistent scheme, and recalculates rates:
1. District Student Mobility/Stability Statistics 2011-2012 **by Instructional Program/Service Type**
2. District Student Mobility/Stability Statistics 2011-2012 **by Gender & Race/Ethnicity**
3. District Graduation Data Statistics 2011-2012 **by Instructional Program/Service Type**
## Reference: Column Naming conventions

- The dataset has around 140 columns, so the names follow a fixed scheme: you should never have to scan the columns to find one. Use this section as the reference instead.
- For instance, to get the rate for any variable, append `_rate` to its name, so `graduated` becomes `graduated_rate`.

| Type | Naming | Example |
| - | - | - |
| County Total | variable | `stable` |
| Count | group + variable | `disabled_stable` |
| Rate | group + variable + "rate" | `disabled_stable_rate` |
| Group Total | group + group total | `disabled_pupil_total` |
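
The naming rules in the table can be sketched as a small helper. `column_name` is a hypothetical function used only for illustration; it is not part of this notebook.

```python
from typing import Optional


def column_name(variable: str, group: Optional[str] = None, rate: bool = False) -> str:
    """Build a column name following the naming table (hypothetical helper)."""
    parts = [group, variable] if group else [variable]
    if rate:
        parts.append("rate")
    return "_".join(parts)


print(column_name("stable"))                               # stable
print(column_name("stable", group="disabled"))             # disabled_stable
print(column_name("stable", group="disabled", rate=True))  # disabled_stable_rate
print(column_name("pupil_total", group="disabled"))        # disabled_pupil_total
```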

<br>

#### Mobility/Stability columns

| GROUPS | VARIABLES | GROUP TOTALS |
| - | - | - |
| disabled | stable | pupil_total |
| limited_eng | mobile | |
| poor | mobile_instances | |
| migrant | | |
| title_1 | | |
| homeless | | |
| gifted | | |
| male | | |
| female | | |
| white | | |
| asian | | |
| black | | |
| hispanic | | |
<br>

#### Graduation columns

| GROUPS | VARIABLES | GROUP TOTALS |
| - | - | - |
| disabled | graduated | grad_base_total |
| limited_eng | completed | |
| poor | | |
| migrant | | |
| title_1 | | |
| homeless | | |
| gifted | | |

<br>

**What are group totals?**
- Notice they aren't just called "total". For graduation data, the total number of students is not the relevant denominator; what matters is the number of students who are actually in the pool for graduation. That count is called `grad_base_total`, and it is the denominator for graduation rates. Mobility/stability rates use `pupil_total` instead.

**Rates are calculated by dividing a variable by its group total, then multiplying by 100**
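
As a worked example of that formula, here is a minimal sketch on made-up counts for two hypothetical counties. The `.round(2).fillna(0)` step mirrors what the rate cells later in this notebook do, so a group with no students ends up with a rate of 0 rather than NaN.

```python
import pandas as pd

# Made-up counts for two hypothetical counties
toy = pd.DataFrame({
    "county": ["A", "B"],
    "disabled_graduated": [45, 0],
    "disabled_grad_base_total": [60, 0],
})

# variable / group total * 100; 0 / 0 gives NaN, which is then replaced by 0
rate = toy["disabled_graduated"] / toy["disabled_grad_base_total"] * 100
toy["disabled_graduated_rate"] = rate.round(2).fillna(0)

print(toy["disabled_graduated_rate"].tolist())  # [75.0, 0.0]
```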

---
---
---

In [16]:
import pandas as pd
# These 3 datasets have each been cleaned already, and had their county names standardized so they can be joined
grad_raw = pd.read_csv("../education/output/dist_grad_rate.csv", index_col=0)
mob_raw = pd.read_csv("../education/output/dist_mobility_rate.csv", index_col=0)
mob_dem_raw = pd.read_csv("../education/output/dist_mobility_rate_demographics.csv", index_col=0)

### Merge and group by county

In [17]:
# Remove the columns duplicated across mobility demographics and mobility datasets
mob_dem_raw = mob_dem_raw.drop(columns=[
    'total_pupil_count', 'total_stable_student_count', 'total_stability_rate', 'total_mobile_student_count',
    'total_student_mobility_rate', 'total_instances_of_mobility', 'total_mobility_incidence_rate'])

# Combine the two mobility datasets
mob_raw = mob_raw.merge(mob_dem_raw, on=['county', 'school_dist'])

# Combine the mobility and graduation data into one df
df_raw = mob_raw.merge(grad_raw, on=['county', 'school_dist'])

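# Group by county. Sum only numeric columns so text columns such as
# school_dist are not concatenated into the result
df_raw_county = df_raw.groupby('county').sum(numeric_only=True).reset_index()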

df = df_raw_county.copy()
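
Both merges above are inner joins (the pandas default), so a district missing from either dataset is dropped without warning. A quick way to check for that, sketched here on made-up data, is an outer merge with `indicator=True`. The frames `mob_toy` and `grad_toy` and the `grad_count` column are invented for illustration.

```python
import pandas as pd

# Made-up district rows; district A2 has no graduation record
mob_toy = pd.DataFrame({
    "county": ["A", "A", "B"],
    "school_dist": ["A1", "A2", "B1"],
    "total_pupil_count": [100, 50, 80],
})
grad_toy = pd.DataFrame({
    "county": ["A", "B"],
    "school_dist": ["A1", "B1"],
    "grad_count": [20, 15],  # hypothetical column name
})

# An outer join keeps unmatched rows; `_merge` says which side each row came from.
# validate="one_to_one" raises if a (county, school_dist) key is duplicated.
check = mob_toy.merge(
    grad_toy, on=["county", "school_dist"],
    how="outer", indicator=True, validate="one_to_one")

unmatched = check[check["_merge"] != "both"]
print(unmatched[["county", "school_dist", "_merge"]])
```

An empty `unmatched` frame would mean the inner join loses no districts.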

### Functions

In [18]:
def cols(df):
    """ Get cols as list instead of index object. Exclude county """
    return [c for c in df.columns if c != 'county']

def separate_by_str(df, text) -> (pd.DataFrame, pd.DataFrame):
    """
    Given a df and a substring, return two dfs:
    - df with county + all columns whose name does NOT contain substring
    - df with county + all columns whose name DOES contain substring
    """
    names = [c for c in df.columns if text in c]
    return (
        df.copy().drop(columns=names), # Cols without text
        df.copy()[['county'] + names], # Cols with text
    )

def rename(df, text, replacement) -> pd.DataFrame:
    """ Bulk replace a substring in the name of all columns """
    # One rename call with a callable instead of one call per column
    return df.rename(columns=lambda c: c.replace(text, replacement))

def merge(df1, df2) -> pd.DataFrame:
    """ Shorthand for pandas merging, since they will always be inner joined on county"""
    return df1.merge(df2, how='inner', on='county')
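
A short usage sketch of the helpers on a made-up frame. Compact copies of `separate_by_str` and `rename` are repeated so the snippet runs on its own; the toy column names only imitate the style of the raw data.

```python
import pandas as pd


# Compact copies of the helpers above, so the snippet is self-contained
def separate_by_str(df, text):
    names = [c for c in df.columns if text in c]
    return df.drop(columns=names), df[['county'] + names]


def rename(df, text, replacement):
    return df.rename(columns=lambda c: c.replace(text, replacement))


# Made-up data
toy = pd.DataFrame({
    "county": ["A", "B"],
    "gifted_talented_pupil_count": [10, 20],
    "gifted_talented_stability_rate": [90.0, 85.0],
})

counts, rates = separate_by_str(toy, "rate")
counts = rename(counts, "gifted_talented", "gifted")

print(list(counts.columns))  # ['county', 'gifted_pupil_count']
print(list(rates.columns))   # ['county', 'gifted_talented_stability_rate']
```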

## Column Name Manipulation
---

### Remove all rates. Summing district rates by county made them meaningless, so they are recalculated from counts later

In [19]:
df, _ = separate_by_str(df, "rate")
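
To see why the aggregated rates have to go, here is a minimal sketch with two made-up districts in one county. `groupby(...).sum()` adds the district rates together, while the correct county rate has to be recomputed from the summed counts.

```python
import pandas as pd

# Made-up districts in a single county
districts = pd.DataFrame({
    "county": ["A", "A"],
    "school_dist": ["A1", "A2"],
    "total_pupil_count": [1000, 100],
    "total_stable_student_count": [900, 50],
    "total_stability_rate": [90.0, 50.0],
})

by_county = districts.groupby("county").sum(numeric_only=True)

# Summed rate: 90 + 50, which is not a valid percentage
summed_rate = by_county.loc["A", "total_stability_rate"]

# Recomputed rate: (900 + 50) / (1000 + 100) * 100
recomputed = round(
    by_county.loc["A", "total_stable_student_count"]
    / by_county.loc["A", "total_pupil_count"] * 100, 2)

print(summed_rate)  # 140.0
print(recomputed)   # 86.36
```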

#### Remove Native American and Native Hawaiian because the groups are very small and the values are 0 for many counties. Remove "two_or_more_races" because it is inconsistent, which makes the groups difficult to compare

In [20]:
df, _ = separate_by_str(df, "american_indian")
df, _ = separate_by_str(df, "native_hawaiian")
df, _ = separate_by_str(df, "two_or_more")

### Standardize group names, then shorten group names
- Graduation data has `limited_english_proficient` and `econ_disadvant`
- Mobility data has `english_language_learners` and `economically_disadvantaged`

**Standardize these to `limited_eng` and `poor`, and shorten the others**

In [21]:
# Instructional program/service groups (graduation and mobility naming variants)
df = rename(df, "limited_english_proficient", "limited_eng")
df = rename(df, "english_language_learners", "limited_eng")
df = rename(df, "economically_disadvantaged", "poor")
df = rename(df, "econ_disadvant", "poor")
df = rename(df, "students_with_disabilities", "disabled")
df = rename(df, "gifted_talented", "gifted")

# Demographics
df = rename(df, "black_or_african_american", "black")
df = rename(df, "hispanic_or_latino", "hispanic")

### Rename more stuff for readability/consistency

In [22]:
# Graduation data
df = rename(df, "final_grad_base", "grad_base_total")
df = rename(df, "graduates_total", "graduated")
df = rename(df, "completers_total", "completed")

# Mobility/Stability data
df = rename(df, "instances_of_mobility", "mobile_instances")
df = rename(df, "pupil_count", "pupil_total")
df = rename(df, "_student_count", "")

# Variable totals
df = rename(df, "_all_students", "")
df = rename(df, "total_", "")
df = df.rename(columns={'stable_pupil_total': 'stable'})

In [23]:
# Take a real copy: the rate columns added to df below must not leak into the counts
df_counts = df.copy()
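
A snapshot taken at this point only stays counts-only if it is a real copy. Plain assignment binds a second name to the same DataFrame, so columns added later show up under both names. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up single-county frame
df = pd.DataFrame({"county": ["A"], "stable": [90], "pupil_total": [100]})

alias = df            # second name for the same object
snapshot = df.copy()  # independent copy

# Adding a column in place changes the shared object
df["stable_rate"] = df["stable"] / df["pupil_total"] * 100

print("stable_rate" in alias.columns)     # True
print("stable_rate" in snapshot.columns)  # False
```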

## Calculate Rates
---

### Rates for variables

In [24]:
for c in ['stable', 'mobile', 'mobile_instances']:
    df[f"{c}_rate"] = (df[c] / df['pupil_total'] * 100).round(2).fillna(0)

### Rates for groups

In [25]:
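# Calculate rates dynamically
for group in [
        'disabled', 'limited_eng', 'poor', 'migrant', 'title_1', 'homeless', 'gifted',
        'male', 'female', 'white', 'black', 'hispanic', 'asian']:

    # Match on the "<group>_" prefix: a substring test would also pick up
    # 'female_*' columns for the 'male' group. Skip totals and existing rates.
    group_cols = [
        c for c in df.columns
        if c.startswith(f"{group}_") and "total" not in c and not c.endswith("_rate")]

    for c in group_cols:
        var = c[len(group) + 1:]  # column name without the "<group>_" prefix

        # Graduation variables use the graduation base as the denominator
        if var in ['graduated', 'completed']:
            new = df[c] / df[f"{group}_grad_base_total"]
        else:
            new = df[c] / df[f"{group}_pupil_total"]

        new = (new * 100).round(2).fillna(0)
        df[f"{c}_rate"] = new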
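
One pitfall when selecting a group's columns by name: `male` is a substring of `female`, so a substring test picks up both groups, while a prefix test does not. A minimal sketch on a made-up column list:

```python
# Made-up column names following the group + variable scheme
columns = ["male_stable", "female_stable", "male_pupil_total", "female_pupil_total"]

# Substring test: 'male' also matches inside 'female'
by_substring = [c for c in columns if "male" in c and "total" not in c]

# Prefix test: only columns that start with 'male_'
by_prefix = [c for c in columns if c.startswith("male_") and "total" not in c]

print(by_substring)  # ['male_stable', 'female_stable']
print(by_prefix)     # ['male_stable']
```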

In [26]:
df_all = df

## Create new dataset for rates and group totals only
---

In [27]:
_, pupil_totals = separate_by_str(df_all, "pupil_total")
_, grad_bases = separate_by_str(df_all, "grad_base_total")
_, rates_only = separate_by_str(df_all, "rate")

df = merge(rates_only, merge(pupil_totals, grad_bases))

In [28]:
df_rates = df

## Save
---

In [29]:
df_all.to_csv("output/all_education.csv")
df_counts.to_csv("output/all_education_counts.csv")
df_rates.to_csv("output/all_education_rates.csv")