# Identifying schools that have inaccurate data

Many schools in this data contain inaccurate values for discipline data due to data entry errors. This notebook documents the various data entry issues. All rows printed in this notebook are removed in `clean/src/clean_crdc_data.py`.

In [1]:
import pandas as pd
import constants


This input file was copied from the output of the `clean` task, produced without calling the function `drop_data_entry_errors`.

In [2]:
df = pd.read_csv("input/crdc-referrals-arrests-cleaned.csv", low_memory=False)


## Arrest or referral rates over 100%
Per the CRDC documentation, arrest and referral totals are supposed to represent the number of unique students who were referred or arrested. In this data, however, some schools have arrest or referral totals greater than their total enrollment.

In [3]:
df.query("year == 2015")[["COMBOKEY","LEAID", "SCHID", "SCH_NAME"]]

Unnamed: 0,COMBOKEY,LEAID,SCHID,SCH_NAME
95507,20018000075,200180,75,East High School
95508,20018000064,200180,64,Clark Middle School
95509,20018000057,200180,57,Bartlett High School
95510,20018000729,200180,729,Nicholas J. Begich Middle School
95511,20018000120,200180,120,West High School
...,...,...,...,...
191862,481161000000.0,4811610,12545,BRIGHT BEGINNINGS ACADEMIC CENTER
191863,250477000000.0,2504770,2804,Adams School
191864,481173000000.0,4811730,8873,CHALLENGE ACADEMY
191865,173300000000.0,1733000,5082,Adams Co Juvenile Detention Cntr


In [4]:
df.query("total_arrests > total_enrollment | total_referrals > total_enrollment")


Unnamed: 0,COMBOKEY,LEA_STATE,LEAID,LEA_NAME,SCHID,SCH_NAME,JJ,SCH_STATUS_ALT,SCH_ENR_HI_M,SCH_ENR_HI_F,...,total_enrollment_hp,total_arrests_tr,total_referrals_tr,total_enrollment_tr,total_arrests_idea,total_arrests_nondis,total_referrals_idea,total_referrals_nondis,total_enrollment_idea,total_enrollment_nondis
79217,69103710670.0,CA,691037,SHASTA COUNTY OFFICE OF EDUCATION,10670,OASIS COMMUNITY,No,Yes,11.0,2.0,...,4.0,0.0,4.0,4.0,0.0,0.0,30.0,74.0,,93.0
79974,61995012357.0,CA,619950,KLAMATH-TRINITY JOINT UNIFIED,12357,RIVER'S EDGE COMMUNITY DAY,No,Yes,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,,7.0
87925,482247007799.0,TX,4822470,HARLANDALE ISD,7799,HAC DAEP MIDDLE,No,Yes,5.0,5.0,...,0.0,0.0,0.0,0.0,5.0,7.0,5.0,13.0,,10.0
87929,482247008703.0,TX,4822470,HARLANDALE ISD,8703,HARLANDALE ALTERNATIVE CENTER BOOT MIDDLE,No,Yes,5.0,5.0,...,0.0,0.0,0.0,0.0,5.0,7.0,5.0,13.0,,10.0
177825,482850000000.0,TX,4828500,LUBBOCK ISD,3208,PRIORITY INTERVENTION ACADEMY,No,Yes,50.0,26.0,...,0.0,0.0,0.0,4.0,41.0,145.0,0.0,0.0,,119.0
178795,483432000000.0,TX,4834320,PASADENA ISD,10775,THE SUMMIT (HIGH SCHOOL),No,Yes,35.0,20.0,...,0.0,0.0,0.0,0.0,16.0,60.0,16.0,60.0,,74.0
179108,484428000000.0,TX,4844280,WACO ISD,7397,CHALLENGE ACADEMY,No,Yes,5.0,0.0,...,0.0,0.0,0.0,0.0,2.0,14.0,2.0,14.0,,14.0
179280,63021006249.0,CA,630210,Perris Union High,6249,The Academy Community Day,No,Yes,29.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,5.0,37.0,,37.0
180974,481970000000.0,TX,4819700,FORT WORTH ISD,5472,METRO OPPORTUNITY,No,Yes,8.0,2.0,...,0.0,0.0,0.0,2.0,2.0,37.0,0.0,0.0,,38.0
181349,482850000000.0,TX,4828500,LUBBOCK ISD,7677,LUBBOCK CO J J A E P,No,Yes,5.0,2.0,...,0.0,0.0,0.0,0.0,4.0,12.0,4.0,12.0,,12.0
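
As an illustration, the same comparison can be tried on a small made-up frame. This is a minimal sketch: the column names mirror the ones used in this notebook, but the schools and counts are invented. Comparing counts directly, rather than computing a rate first, also flags schools that report zero enrollment without dividing by zero.

```python
import pandas as pd

# Toy data: column names follow the notebook, values are invented.
toy = pd.DataFrame(
    {
        "SCH_NAME": ["A", "B", "C", "D"],
        "total_arrests": [2, 12, 0, 3],
        "total_referrals": [5, 12, 40, 3],
        "total_enrollment": [100, 10, 30, 0],
    }
)

# Same expression as the cell above: flag a school if either count
# exceeds its enrollment (equivalent to a rate above 100%).
over_enrollment = toy.query(
    "total_arrests > total_enrollment | total_referrals > total_enrollment"
)

print(over_enrollment.SCH_NAME.tolist())
```

School D has no enrollment at all, so a rate would be undefined, but the count comparison still catches it.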


## More arrests than referrals
Per the CRDC documentation, all arrests are to be counted as referrals, but not all referrals are arrests. The number of arrests should therefore never be greater than the number of referrals.

In [5]:
df.query("total_arrests > total_referrals")


Unnamed: 0,COMBOKEY,LEA_STATE,LEAID,LEA_NAME,SCHID,SCH_NAME,JJ,SCH_STATUS_ALT,SCH_ENR_HI_M,SCH_ENR_HI_F,...,total_enrollment_hp,total_arrests_tr,total_referrals_tr,total_enrollment_tr,total_arrests_idea,total_arrests_nondis,total_referrals_idea,total_referrals_nondis,total_enrollment_idea,total_enrollment_nondis
85,320048000545,NV,3200480,WASHOE COUNTY SCHOOL DISTRICT,545,SPANISH SPRINGS HIGH SCHOOL,No,No,335.0,311.0,...,16.0,4.0,2.0,91.0,17.0,50.0,2.0,4.0,263.0,2296.0
144,550852000925,WI,5508520,MADISON METROPOLITAN SCHOOL DISTRICT,925,EAST HIGH,No,No,122.0,122.0,...,2.0,4.0,0.0,160.0,8.0,0.0,2.0,2.0,323.0,1613.0
154,170993003505,IL,1709930,CITY OF CHICAGO SD 299,3505,CHICAGO INTERNATIONAL CHARTER,No,No,1172.0,1121.0,...,7.0,0.0,0.0,94.0,0.0,8.0,0.0,5.0,1211.0,8563.0
171,180363000548,IN,1803630,FORT WAYNE COMMUNITY SCHOOLS,548,MIAMI MIDDLE SCHOOL,No,No,83.0,95.0,...,2.0,0.0,0.0,52.0,4.0,2.0,0.0,0.0,148.0,777.0
172,180363000550,IN,1803630,FORT WAYNE COMMUNITY SCHOOLS,550,NORTH SIDE HIGH SCHOOL,No,No,86.0,92.0,...,2.0,0.0,0.0,118.0,4.0,2.0,2.0,2.0,283.0,1667.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
283671,180192000249,IN,1801920,Clarksville Community School Corp,249,Clarksville Senior High School,No,No,23.0,18.0,...,0.0,0.0,0.0,37.0,0.0,3.0,0.0,0.0,85.0,424.0
283674,180192000248,IN,1801920,Clarksville Community School Corp,248,Clarksville Middle School,No,No,23.0,18.0,...,0.0,0.0,0.0,37.0,0.0,3.0,0.0,0.0,85.0,424.0
283847,180951001555,IN,1809510,Richmond Community Schools,1555,Dennis Intermediate School,No,No,35.0,31.0,...,0.0,3.0,4.0,92.0,3.0,11.0,2.0,11.0,133.0,649.0
284003,181110001798,IN,1811100,M S D Steuben County,1798,Angola Middle School,No,No,46.0,29.0,...,0.0,0.0,0.0,14.0,0.0,12.0,0.0,10.0,113.0,660.0


## Schools with very high totals and near-identical arrest and referral rates
After we reached out to Del Valle ISD outside Austin, TX, the district told us its data was incorrect and that the person who did the data entry for the district misunderstood the definitions in the discipline section. All of its schools reported extremely high arrest and referral totals, with arrest totals equal or nearly equal to their referral totals. Several other districts appear to have made the same error: their arrest totals are unusually high and identical, or nearly identical, to their referral totals.

In [6]:
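# Bucket each school by the highest grade it offers (max_grade).
df = df.assign(
    grade_category=lambda d: d.apply(
        lambda row: "high school"
        if row.max_grade in range(10, 13)
        else "middle school"
        if row.max_grade in range(7, 10)
        else "elementary school"
        if row.max_grade in range(1, 7)
        else "other",
        axis=1,
    )
)

# 99.9th percentile of total_referrals_arrests within each
# grade category and survey year; used below as the "very high" cutoff.
threshold_df = (
    df.groupby(["grade_category", "year"])
    .total_referrals_arrests.quantile(0.999)
    .to_frame("threshold")
)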

threshold_df


grade_category,year,threshold
elementary school,2013,58.73
elementary school,2015,33.596
elementary school,2017,29.26
high school,2013,221.608
high school,2015,218.482
high school,2017,193.498
middle school,2013,193.8
middle school,2015,130.248
middle school,2017,137.282
other,2013,44.737
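
The row-wise `apply` above can be slow on a frame of this size. A vectorized alternative is sketched below using `numpy.select`. It is not the code used in the `clean` task, and it assumes `max_grade` holds whole-number grades (or NaN); fractional values would be bucketed differently from the `range` checks. The toy frame is invented for the example.

```python
import numpy as np
import pandas as pd

# Toy data: only max_grade is needed for the bucketing.
toy = pd.DataFrame({"max_grade": [5, 8, 12, 0, np.nan, 13]})

# Conditions are checked in order; the first match wins.
# Series.between is inclusive on both ends and is False for NaN.
conditions = [
    toy["max_grade"].between(10, 12),
    toy["max_grade"].between(7, 9),
    toy["max_grade"].between(1, 6),
]
choices = ["high school", "middle school", "elementary school"]

# Anything outside grades 1-12, including missing values, becomes "other".
toy["grade_category"] = np.select(conditions, choices, default="other")

print(toy)
```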


In [7]:
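# Arrest and referral totals count as "near-identical" if they differ by 0, 1 or 2.
close_vals = list(range(0, 3))
# Keep schools above their category/year threshold whose totals are near-identical.
df.merge(threshold_df, left_on=["grade_category", "year"], right_index=True).query(
    "total_referrals_arrests > threshold & abs(total_arrests - total_referrals) in @close_vals"
)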


Unnamed: 0,COMBOKEY,LEA_STATE,LEAID,LEA_NAME,SCHID,SCH_NAME,JJ,SCH_STATUS_ALT,SCH_ENR_HI_M,SCH_ENR_HI_F,...,total_referrals_tr,total_enrollment_tr,total_arrests_idea,total_arrests_nondis,total_referrals_idea,total_referrals_nondis,total_enrollment_idea,total_enrollment_nondis,grade_category,threshold
231,271242000610,MN,2712420,FRIDLEY PUBLIC SCHOOL DISTRICT,610,FRIDLEY MIDDLE,No,No,50.0,50.0,...,17.0,28.0,80.0,275.0,80.0,275.0,124.0,824.0,middle school,193.800
14314,290462002517,MO,2904620,BELTON 124,2517,YEOKUM MIDDLE,No,No,53.0,38.0,...,12.0,19.0,53.0,361.0,53.0,361.0,84.0,719.0,middle school,193.800
70510,291376002659,MO,2913760,HARRISONVILLE R-IX,2659,HARRISONVILLE MIDDLE,No,No,8.0,11.0,...,2.0,7.0,15.0,93.0,15.0,93.0,47.0,582.0,middle school,193.800
3614,271242000611,MN,2712420,FRIDLEY PUBLIC SCHOOL DISTRICT,611,FRIDLEY SENIOR HIGH,No,No,38.0,35.0,...,20.0,25.0,89.0,408.0,89.0,408.0,140.0,869.0,high school,221.608
74157,251050001684,MA,2510500,SAUGUS,1684,SAUGUS HIGH,No,No,32.0,50.0,...,4.0,7.0,28.0,167.0,28.0,167.0,47.0,709.0,high school,221.608
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
232963,481662012337,TX,4816620,DEL VALLE ISD,12337,DAILEY MIDDLE,No,No,275.0,254.0,...,7.0,12.0,92.0,259.0,92.0,258.0,98.0,678.0,middle school,137.282
232965,481662009527,TX,4816620,DEL VALLE ISD,9527,JOHN P OJEDA MIDDLE,No,No,411.0,398.0,...,4.0,11.0,133.0,314.0,133.0,313.0,140.0,905.0,middle school,137.282
232972,481662001425,TX,4816620,DEL VALLE ISD,1425,DEL VALLE MIDDLE,No,No,436.0,398.0,...,4.0,12.0,104.0,277.0,104.0,277.0,110.0,962.0,middle school,137.282
274333,260769004354,MI,2607690,Public Schools of Calumet Laurium & Keweenaw,4354,Washington Middle School,No,No,1.0,1.0,...,2.0,5.0,25.0,47.0,25.0,48.0,44.0,315.0,middle school,137.282
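
The three checks in this notebook can be combined into a single boolean mask. The sketch below shows one way to do that. The function name `flag_data_entry_errors` and its parameters `threshold_quantile` and `max_gap` are hypothetical, and this is not the `drop_data_entry_errors` function from the `clean` task, whose implementation is not shown here. The toy data is invented, and the quantile is lowered to 0.5 in the example call because a 99.9th percentile is not meaningful on five rows.

```python
import pandas as pd


def flag_data_entry_errors(df, threshold_quantile=0.999, max_gap=2):
    """Hypothetical helper: True for rows caught by any of the three checks."""
    # Check 1: more arrests or referrals than enrolled students.
    over_enrollment = (df["total_arrests"] > df["total_enrollment"]) | (
        df["total_referrals"] > df["total_enrollment"]
    )
    # Check 2: more arrests than referrals.
    arrests_exceed_referrals = df["total_arrests"] > df["total_referrals"]
    # Check 3: very high totals with (near-)identical arrests and referrals.
    # The threshold is computed per grade category and year, then
    # broadcast back to every row of the group.
    threshold = df.groupby(["grade_category", "year"])[
        "total_referrals_arrests"
    ].transform(lambda s: s.quantile(threshold_quantile))
    gap = (df["total_arrests"] - df["total_referrals"]).abs()
    near_identical = (df["total_referrals_arrests"] > threshold) & (gap <= max_gap)
    return over_enrollment | arrests_exceed_referrals | near_identical


# Toy data: all values are invented for illustration.
toy = pd.DataFrame(
    {
        "SCH_NAME": ["A", "B", "C", "D", "E"],
        "grade_category": ["high school"] * 5,
        "year": [2017] * 5,
        "total_arrests": [1, 0, 300, 4, 2],
        "total_referrals": [5, 3, 301, 2, 50],
        "total_enrollment": [500, 400, 900, 300, 40],
        "total_referrals_arrests": [6, 3, 601, 6, 52],
    }
)

flags = flag_data_entry_errors(toy, threshold_quantile=0.5)
cleaned = toy[~flags]
print(cleaned.SCH_NAME.tolist())
```

For whole-number counts, `gap <= max_gap` with `max_gap=2` matches the `in @close_vals` test used above.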
