# 3.1 Preprocessing
In this notebook, we build upon 3.0-Preprocessing using updated methods. Because we acknowledge that some of the features used for KMeans are categorical, we produce a second dataset for a separate KMeans model. For that dataset, each row's features are written out as a sentence, and a BERT-based sentence-embedding model interprets those sentences to create an embedding of the unlabeled features, which is then used for a KMeans model.

For each model, we selected columns based on their importance to the model and tried to avoid features that have a high Pearson correlation with other columns. Specifically, we selected the following features:
- Estimated Total Population
- Child Poverty Ratio: the ratio of children in poverty
- Child to Adult Ratio: the ratio of children to adults
- Percent Free Or Reduced Lunches: the percentage of students receiving free or reduced lunches
- All Expense Columns Per Pupil: instruction, support, community, and other
- All Demographic Information
- Achievement, Growth, and Overall Directions: indicators that specify how a school is improving
- School Grade: an overall measure of performance with grades 1-13, where 1 is the lowest and 13 is the highest
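
The correlation check mentioned above can be sketched as follows. This is a minimal illustration, not the code in `src.features`: the helper name `drop_highly_correlated`, the 0.9 threshold, and the toy values are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later column of every pair whose |Pearson r| exceeds threshold.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    corr = df.corr(method="pearson").abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Toy data (made up): est_total_child is almost a linear function of est_total_pop.
toy = pd.DataFrame({
    "est_total_pop": [100.0, 200.0, 300.0, 400.0, 500.0],
    "est_total_child": [21.0, 39.0, 61.0, 80.0, 99.0],
    "pct_fr": [0.9, 0.1, 0.5, 0.8, 0.2],
})
reduced = drop_highly_correlated(toy, threshold=0.9)
print(list(reduced.columns))
```

Only the second column of each correlated pair is dropped, so the result depends on column order; in practice one would decide which of the two columns to keep based on its importance to the model.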

For the high school dataset, we added the following features:
- Percent Remediation: the percentage of students who have to retake high-school-level courses in college
- Graduation Rate: the percentage of students who graduate
- All College Readiness Indicators: a boolean feature for each ACT subject indicating whether the score is sufficient for college

Furthermore, for each model, we fill NA values as follows. First, we fill each missing value with the same school's value from the most recent year, then with that school's value from the previous year, where available. Any remaining NA values are filled with the column median.
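
A sketch of this fill strategy in pandas is shown below. It reads the two per-school steps as a backward fill followed by a forward fill within each school's rows sorted by year; that ordering, the helper name `fill_missing`, and the toy values are assumptions, and the actual implementation in `src.features` may differ.

```python
import numpy as np
import pandas as pd


def fill_missing(df: pd.DataFrame, value_cols) -> pd.DataFrame:
    """Hypothetical helper: per-school fills first, then a global median fill."""
    out = df.sort_values(["school_id", "year"]).copy()
    # Step 1: borrow the value from a later year of the same school.
    out[value_cols] = out.groupby("school_id")[value_cols].bfill()
    # Step 2: borrow the value from an earlier year of the same school.
    out[value_cols] = out.groupby("school_id")[value_cols].ffill()
    # Step 3: whatever is still missing gets the column median.
    out[value_cols] = out[value_cols].fillna(out[value_cols].median())
    return out


# Toy data (made up); school 3 has no observed value at all.
toy = pd.DataFrame({
    "school_id": [1, 1, 1, 2, 2, 3],
    "year": [2010, 2011, 2012, 2010, 2011, 2010],
    "pct_remediation": [np.nan, 0.4, np.nan, 0.2, np.nan, np.nan],
})
filled = fill_missing(toy, ["pct_remediation"])
print(filled["pct_remediation"].tolist())
```

The median is computed after the per-school fills, so schools with more years of data weigh more heavily in it. Computing the median before step 1 is an equally plausible choice.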

After performing transformations, each model also removes outliers using the ECOD detector from the pyod.models.ecod module.
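
To give a feel for what this outlier step does, here is a simplified NumPy illustration of the idea behind ECOD: each row is scored by how far it sits in the empirical tails of every feature. It is not pyod's implementation (the full method also uses a skewness-corrected aggregate), and the function names and the `contamination` value are assumptions made for the example.

```python
import numpy as np


def tail_scores(X: np.ndarray) -> np.ndarray:
    """Simplified ECOD-style score: larger means further into the empirical tails."""
    n, d = X.shape
    left = np.empty((n, d), dtype=float)
    right = np.empty((n, d), dtype=float)
    for j in range(d):
        col = X[:, j]
        sorted_col = np.sort(col)
        # Left-tail ECDF: fraction of samples <= x (always >= 1/n).
        left[:, j] = np.searchsorted(sorted_col, col, side="right") / n
        # Right-tail ECDF: fraction of samples >= x (always >= 1/n).
        right[:, j] = (n - np.searchsorted(sorted_col, col, side="left")) / n
    # Sum negative log tail probabilities across features, keep the worse tail.
    o_left = -np.log(left).sum(axis=1)
    o_right = -np.log(right).sum(axis=1)
    return np.maximum(o_left, o_right)


def remove_outliers(X: np.ndarray, contamination: float = 0.05):
    """Drop the rows whose score lies above the (1 - contamination) quantile."""
    scores = tail_scores(X)
    keep = scores <= np.quantile(scores, 1.0 - contamination)
    return X[keep], keep


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[0] = [10.0, 10.0, 10.0]  # planted outlier: extreme in every feature
scores = tail_scores(X)
cleaned, keep = remove_outliers(X, contamination=0.05)
print(cleaned.shape, bool(keep[0]))
```

In pyod itself, the detector is the `ECOD` class in `pyod.models.ecod`; it is fitted on the feature matrix and then exposes per-row outlier labels, so the filtering step is analogous to the boolean mask used here.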

There is no need to split the data into train and test sets, since we will only be using unsupervised learning.
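
As a small illustration of why no split is needed, the sketch below fits scikit-learn's `KMeans` on a feature matrix alone; no target column is involved at any point. The toy blobs and the choice of two clusters are assumptions made for the example, not values used in this project.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated toy blobs stand in for the processed feature matrix.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
])

# fit_predict takes only X: there are no labels to hold out for testing.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)
print(labels)
```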

In [10]:
import pandas as pd

import importlib
import sys

# setting path
sys.path.append('..')
# importing
from src.features import preprocessors
from src.features import build_features
importlib.reload(preprocessors);
importlib.reload(build_features);

In [2]:
input_filepath = '../data/interim'
output_filepath = '../data/processed'

In [3]:
all_data, high_school = build_features.load_interim_data(input_filepath)

In [4]:
all_data.head()

Unnamed: 0,district_id,est_child_poverty,est_total_child,est_total_pop,year,child_pov_ratio,child_adult_ratio,county,instruction_total,support_total,...,pct_amind,pct_asian,pct_black,pct_hisp,pct_white,pct_pi,pct_2ormore,pct_fr,district_name,school
0,880.0,22978.0,85901.0,604414.0,2010,0.267494,0.165668,DENVER,438251412.0,369798031.0,...,0.009326,0.029016,0.009845,0.91658,0.031606,0.000518,0.003109,0.93,DENVER COUNTY 1,ABRAHAM LINCOLN HIGH SCHOOL
1,880.0,21750.0,88725.0,619968.0,2011,0.245139,0.167014,DENVER,445458597.0,384559117.0,...,0.009326,0.029016,0.009845,0.91658,0.031606,0.000518,0.003109,0.93,DENVER COUNTY 1,ABRAHAM LINCOLN HIGH SCHOOL
2,880.0,26358.0,90920.0,634265.0,2012,0.289903,0.167334,DENVER,474517651.0,382873858.0,...,0.007357,0.0268,0.009984,0.913295,0.037835,0.001051,0.003678,0.9296,DENVER COUNTY 1,ABRAHAM LINCOLN HIGH SCHOOL
3,880.0,22978.0,85901.0,604414.0,2010,0.267494,0.165668,DENVER,438251412.0,369798031.0,...,0.015015,0.015015,0.27027,0.351351,0.348348,0.0,0.0,1.0,DENVER COUNTY 1,RIDGE VIEW ACADEMY CHARTER SCHOOL
4,880.0,21750.0,88725.0,619968.0,2011,0.245139,0.167014,DENVER,445458597.0,384559117.0,...,0.015015,0.015015,0.27027,0.351351,0.348348,0.0,0.0,1.0,DENVER COUNTY 1,RIDGE VIEW ACADEMY CHARTER SCHOOL


In [5]:
high_school.head()

Unnamed: 0,district_id,school_id,eng_yn,math_yn,read_yn,sci_yn,year,pct_remediation,est_child_poverty,est_total_child,...,pct_amind,pct_asian,pct_black,pct_hisp,pct_white,pct_pi,pct_2ormore,pct_fr,district_name,school
0,10,309,0.0,0.0,0.0,0.0,2010,,1069.0,6160.0,...,0.003215,0.032154,0.012862,0.710611,0.234727,0.0,0.006431,0.733,MAPLETON 1,SKYVIEW ACADEMY HIGH SCHOOL
1,20,15,1.0,0.0,0.0,0.0,2010,,4446.0,41735.0,...,0.003063,0.025268,0.006891,0.250383,0.706738,0.001531,0.006126,0.155,ADAMS 12 FIVE STAR,ACADEMY OF CHARTER SCHOOLS
2,20,4108,1.0,0.0,0.0,0.0,2010,0.386667,4446.0,41735.0,...,0.001103,0.008274,0.002758,0.241589,0.745174,0.0,0.001103,0.099,ADAMS 12 FIVE STAR,HORIZON HIGH SCHOOL
3,20,5043,1.0,0.0,0.0,0.0,2010,0.23506,4446.0,41735.0,...,0.00346,0.056846,0.010875,0.156698,0.767672,0.000494,0.003955,0.152,ADAMS 12 FIVE STAR,LEGACY HIGH SCHOOL
4,20,5816,1.0,0.0,0.0,0.0,2010,0.433566,4446.0,41735.0,...,0.001703,0.010783,0.010216,0.56924,0.403519,0.000568,0.003973,0.379,ADAMS 12 FIVE STAR,THORNTON HIGH SCHOOL


## KMeans Model

In [6]:
processed_all, processed_high = build_features.build_kmeans(output_filepath, all_data, high_school)

In [7]:
processed_all.head()

Unnamed: 0,school_id,year,achievement_dir,growth_dir,overall_dir,school_grade,est_total_pop,child_pov_ratio,child_adult_ratio,instruction_per_pupil,support_per_pupil,community_per_pupil,other_per_pupil,pct_amind,pct_asian,pct_black,pct_hisp,pct_white,pct_2ormore,pct_fr
0,10.0,2010.0,1.0,0.0,-1.0,5.0,0.999873,4.425115e-07,2.740622e-07,0.009962,0.008407,0.000423,0.009105,1.542857e-08,4.8e-08,1.628572e-08,1.516286e-06,5.228572e-08,5.142857e-09,2e-06
1,10.0,2011.0,1.0,0.0,0.0,4.0,0.999887,3.95362e-07,2.693609e-07,0.009578,0.008269,0.000518,0.008087,1.50417e-08,4.67964e-08,1.587735e-08,1.478265e-06,5.097465e-08,5.013899e-09,1e-06
2,10.0,2012.0,1.0,0.0,1.0,5.0,0.999803,4.569794e-07,2.637712e-07,0.009683,0.007812,0.000542,0.015475,1.159666e-08,4.224496e-08,1.573832e-08,1.439642e-06,5.963995e-08,5.798329e-09,1e-06
3,40.0,2010.0,0.0,-1.0,-1.0,5.0,0.999873,4.425115e-07,2.740622e-07,0.009962,0.008407,0.000423,0.009105,2.483913e-08,2.483913e-08,4.471043e-07,5.812355e-07,5.762677e-07,0.0,2e-06
4,40.0,2011.0,0.0,1.0,1.0,5.0,0.999887,3.95362e-07,2.693609e-07,0.009578,0.008269,0.000518,0.008087,2.421628e-08,2.421628e-08,4.358931e-07,5.66661e-07,5.618177e-07,0.0,2e-06


In [8]:
processed_high.head()

Unnamed: 0,school_id,year,achievement_dir,growth_dir,overall_dir,school_grade,eng_yn,math_yn,read_yn,sci_yn,...,other_per_pupil,pct_amind,pct_asian,pct_black,pct_hisp,pct_white,pct_2ormore,pct_fr,pct_remediation,graduation_rate
0,309.0,2010.0,1.0,1.0,1.0,5.0,0.0,0.0,0.0,0.0,...,0.024333,1.028156e-07,1.028156e-06,4.112625e-07,2.272225e-05,8e-06,2.056312e-07,2.343816e-05,1e-05,0.001982
1,15.0,2010.0,1.0,1.0,0.0,6.0,1.0,0.0,0.0,0.0,...,0.009896,1.452304e-08,1.198151e-07,3.267684e-08,1.187259e-06,3e-06,2.904608e-08,7.349748e-07,2e-06,0.000372
2,4108.0,2010.0,0.0,0.0,0.0,6.0,1.0,0.0,0.0,0.0,...,0.009896,5.230859e-09,3.923144e-08,1.307715e-08,1.145558e-06,4e-06,5.230859e-09,4.694355e-07,2e-06,0.000404
3,5043.0,2010.0,1.0,0.0,0.0,9.0,1.0,0.0,0.0,0.0,...,0.009896,1.640752e-08,2.695521e-07,5.156649e-08,7.430262e-07,4e-06,1.875145e-08,7.207495e-07,1e-06,0.000411
4,5816.0,2010.0,0.0,0.0,0.0,6.0,1.0,0.0,0.0,0.0,...,0.009896,8.073395e-09,5.113149e-08,4.844036e-08,2.699205e-06,2e-06,1.883792e-08,1.797132e-06,2e-06,0.000332


## LLM + KMeans
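
The idea described in the introduction, writing each row out as a sentence and embedding it, can be sketched as below. The sentence template, the helper names, and the stand-in `toy_encode` function are all assumptions for illustration; the real logic lives in `build_features.build_llm_kmeans`.

```python
import numpy as np
import pandas as pd


def row_to_sentence(row: pd.Series) -> str:
    """Render one row as 'column is value' clauses joined into a single sentence."""
    return ", ".join(f"{col} is {val}" for col, val in row.items()) + "."


def embed_rows(df: pd.DataFrame, encode) -> pd.DataFrame:
    """Turn every row into a sentence and pass the sentences to `encode`.

    `encode` is any callable mapping a list of strings to a 2-D array.
    """
    sentences = [row_to_sentence(row) for _, row in df.iterrows()]
    vectors = np.asarray(encode(sentences))
    return pd.DataFrame(vectors, index=df.index)


def toy_encode(sentences):
    # Stand-in encoder so the sketch runs offline. This is NOT a language model.
    return np.array([[len(s), s.count(",")] for s in sentences], dtype=float)


# Toy rows (made up) mixing a categorical and a numeric feature.
toy = pd.DataFrame({"county": ["DENVER", "ADAMS"], "school_grade": [5, 9]})
print(row_to_sentence(toy.iloc[0]))
embedded = embed_rows(toy, toy_encode)
print(embedded.shape)
```

With the sentence-transformers package installed, the `encode` method of a `SentenceTransformer` model could be passed in place of `toy_encode`. A model such as `all-MiniLM-L6-v2` returns 384-dimensional vectors, which would be consistent with the 384 embedding columns in the output below, although the notebook does not state which model is used.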

In [11]:
processed_all, processed_high = build_features.build_llm_kmeans(output_filepath, all_data, high_school)


In [12]:
processed_all.head()

Unnamed: 0,school_id,year,0,1,2,3,4,5,6,7,...,374,375,376,377,378,379,380,381,382,383
0,10.0,2010.0,0.0334,-0.00231,0.040984,-0.00671,-0.097035,-0.011135,0.028651,-0.017524,...,0.100542,0.00446,0.044892,-0.028527,-0.067499,0.022077,0.06574,0.031074,-0.085655,0.02421
1,10.0,2011.0,0.033669,0.000999,0.038963,-0.012745,-0.107526,-0.013695,0.013141,-0.017488,...,0.101296,-0.002628,0.049775,-0.028609,-0.070565,0.016917,0.073487,0.022821,-0.081952,0.019
2,10.0,2012.0,0.036114,-0.019987,0.047702,-0.010298,-0.098111,-0.005213,0.009907,-0.005605,...,0.092061,0.012806,0.060415,-0.026126,-0.093822,0.013348,0.060561,0.035823,-0.067749,0.022978
3,40.0,2010.0,0.033696,-0.005182,0.043652,-0.010673,-0.098483,-0.013081,0.025171,-0.015598,...,0.099338,0.008753,0.048566,-0.026638,-0.066445,0.024296,0.068583,0.029949,-0.08561,0.022008
4,40.0,2011.0,0.032874,-0.000128,0.038368,-0.010974,-0.107373,-0.013452,0.015049,-0.016279,...,0.100055,-0.001117,0.050061,-0.028589,-0.071818,0.018591,0.072918,0.021729,-0.082519,0.020876


In [13]:
processed_high.head()

Unnamed: 0,school_id,year,0,1,2,3,4,5,6,7,...,374,375,376,377,378,379,380,381,382,383
0,309.0,2010.0,0.02641,-0.003775,0.031414,-0.009450681,-0.096812,-0.014821,0.017311,-0.011133,...,0.100162,0.00713,0.050775,-0.038824,-0.086213,0.024621,0.071451,0.023761,-0.069915,0.030733
1,15.0,2010.0,0.03255,0.000404,0.043422,-0.001408776,-0.094958,-0.014704,0.01686,-0.019745,...,0.094588,-0.005528,0.055173,-0.026067,-0.074812,0.024736,0.065939,0.026397,-0.06947,0.020479
2,4108.0,2010.0,0.031531,0.002112,0.045239,5.76269e-07,-0.095794,-0.015627,0.014491,-0.020178,...,0.096915,-0.006267,0.052589,-0.023986,-0.071446,0.02377,0.060952,0.027229,-0.068795,0.019685
3,5043.0,2010.0,0.030433,0.000194,0.04461,-0.0002191796,-0.0948,-0.014412,0.016068,-0.019942,...,0.095253,-0.005503,0.053885,-0.026467,-0.073168,0.023903,0.064077,0.02725,-0.06886,0.020527
4,5816.0,2010.0,0.031531,0.002112,0.045239,5.76269e-07,-0.095794,-0.015627,0.014491,-0.020178,...,0.096915,-0.006267,0.052589,-0.023986,-0.071446,0.02377,0.060952,0.027229,-0.068795,0.019685
