# 3.0 Preprocessing

In this notebook, we prepare the data for KMeans clustering. We selected columns based on their importance to our model and tried to avoid features that are highly Pearson-correlated with other columns. Specifically, we selected the following features:
- Estimated Total Population
- Child Poverty Ratio: the ratio of children in poverty
- Child to Adult Ratio: the ratio of children to adults
- Percent Free Or Reduced Lunches: the percentage of students receiving free or reduced lunches
- All Expense Columns Per Pupil: instruction, support, community, and other
- All Demographic Information
- Achievement, Growth, and Overall Directions: indicators that specify how a school is improving
- School Grade: an overall measure of performance with grades 1-13 where 1 is the lowest and 13 is the highest
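
The correlation screening mentioned above can be sketched as follows. This is a minimal illustration on toy data, not the selection procedure actually used for this project: the helper `highly_correlated_pairs`, the 0.8 threshold, and the `feature_*` column names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for column pairs whose |Pearson r| exceeds threshold."""
    corr = df.corr(method="pearson")
    cols = corr.columns
    pairs = []
    # Walk the upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Toy data with hypothetical column names
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "feature_a": x,
    "feature_b": 2 * x + rng.normal(scale=0.01, size=200),  # nearly collinear with feature_a
    "feature_c": rng.normal(size=200),                      # independent of the others
})

pairs = highly_correlated_pairs(toy, threshold=0.8)
print(pairs)
```

One column from each reported pair would then be a candidate for removal before clustering.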

For the high school dataset, we added the following features:
- Percent Remediation: the percentage of students who have to retake high-school-level courses in college
- Graduation Rate: the percentage of students who graduate
- All College Readiness Indicators: a boolean feature for each ACT subject indicating whether performance in that subject is sufficient for college

In [1]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import Pipeline
from pyod.models.ecod import ECOD

import importlib
import sys

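# Add the project root to the import path so that `src` can be imported
sys.path.append('..')
# Import the project's feature-building helpers and reload them to pick up local edits
from src.features import build_features
importlib.reload(build_features);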


In [2]:
all_data = pd.read_csv('../data/interim/all_data.csv')
high_school = pd.read_csv('../data/interim/high_school.csv')

In [3]:
# Columns used to index the data so that additional information (county, district,
# or other features of interest) can be joined later
index_cols = ['school_id', 'year']

all_cols = ['est_total_pop',
            'child_pov_ratio',
            'child_adult_ratio',
            'instruction_per_pupil',
            'support_per_pupil',
            'community_per_pupil',
            'other_per_pupil',
            'pct_amind',
            'pct_asian',
            'pct_black',
            'pct_hisp',
            'pct_white',
            'pct_2ormore',
            'pct_fr',
            'achievement_dir',
            'growth_dir',
            'overall_dir',
            'school_grade']

all_num_cols = ['est_total_pop',
                'child_pov_ratio',
                'child_adult_ratio',
                'instruction_per_pupil',
                'support_per_pupil',
                'community_per_pupil',
                'other_per_pupil',
                'pct_amind',
                'pct_asian',
                'pct_black',
                'pct_hisp',
                'pct_white',
                'pct_2ormore',
                'pct_fr']

high_cols = ['pct_remediation',
             'graduation_rate',
             'eng_yn',
             'math_yn',
             'read_yn',
             'sci_yn']

high_num_cols = ['pct_remediation',
                 'graduation_rate']

In [4]:
# Fills NA values with the most recent values, then the previous values, and then the median
filler = build_features.FillBackForward()
filled_df = filler.transform(all_data[all_cols + index_cols])
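
The source of `build_features.FillBackForward` is not shown in this notebook. As a rough illustration of the strategy described in the comment above, the sketch below back-fills, then forward-fills, then falls back to the column median. The function `fill_back_forward`, the grouping by `school_id`, and the ordering by `year` are assumptions for this example and may differ from the project's implementation.

```python
import numpy as np
import pandas as pd

def fill_back_forward(df, group_col="school_id", order_col="year"):
    """Hypothetical stand-in: back-fill, forward-fill, then column median."""
    out = df.sort_values([group_col, order_col]).copy()
    value_cols = [c for c in out.columns if c not in (group_col, order_col)]
    # Within each school, take the next available year first, then the previous one
    out[value_cols] = out.groupby(group_col)[value_cols].bfill()
    out[value_cols] = out.groupby(group_col)[value_cols].ffill()
    # Schools with no observed value at all receive the column median
    out[value_cols] = out[value_cols].fillna(out[value_cols].median())
    return out

# Toy data: school 1 has gaps at both ends, school 2 has no values at all
toy = pd.DataFrame({
    "school_id": [1, 1, 1, 1, 2, 2],
    "year": [2010, 2011, 2012, 2013, 2010, 2011],
    "pct_fr": [np.nan, 0.4, 0.6, np.nan, np.nan, np.nan],
})

filled = fill_back_forward(toy)
print(filled)
```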

In [5]:
# The pipeline that will process numerical features.
# Note: Normalizer rescales each row (sample) to unit L2 norm; it does not scale columns.
num = Pipeline(steps=[
    ('encoder', Normalizer())
])

# The processor that will build features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num, all_num_cols)
    ], remainder='passthrough')

# The pipeline that will enact the feature building
pipeline = Pipeline(
    steps=[('preprocessor', preprocessor)])
pipe_fit = pipeline.fit(filled_df)
data = pd.DataFrame(pipe_fit.transform(filled_df), columns=pipe_fit.get_feature_names_out())

In [6]:
data.head()

Unnamed: 0,num__est_total_pop,num__child_pov_ratio,num__child_adult_ratio,num__instruction_per_pupil,num__support_per_pupil,num__community_per_pupil,num__other_per_pupil,num__pct_amind,num__pct_asian,num__pct_black,num__pct_hisp,num__pct_white,num__pct_2ormore,num__pct_fr,remainder__achievement_dir,remainder__growth_dir,remainder__overall_dir,remainder__school_grade,remainder__school_id,remainder__year
0,0.999873,4.425115e-07,2.740622e-07,0.009962,0.008407,0.000423,0.009105,1.542857e-08,4.8e-08,1.628572e-08,1.516286e-06,5.228572e-08,5.142857e-09,2e-06,1.0,0.0,-1.0,5.0,10.0,2010.0
1,0.999887,3.95362e-07,2.693609e-07,0.009578,0.008269,0.000518,0.008087,1.50417e-08,4.67964e-08,1.587735e-08,1.478265e-06,5.097465e-08,5.013899e-09,1e-06,1.0,0.0,0.0,4.0,10.0,2011.0
2,0.999803,4.569794e-07,2.637712e-07,0.009683,0.007812,0.000542,0.015475,1.159666e-08,4.224496e-08,1.573832e-08,1.439642e-06,5.963995e-08,5.798329e-09,1e-06,1.0,0.0,1.0,5.0,10.0,2012.0
3,0.999873,4.425115e-07,2.740622e-07,0.009962,0.008407,0.000423,0.009105,2.483913e-08,2.483913e-08,4.471043e-07,5.812355e-07,5.762677e-07,0.0,2e-06,0.0,-1.0,-1.0,5.0,40.0,2010.0
4,0.999887,3.95362e-07,2.693609e-07,0.009578,0.008269,0.000518,0.008087,2.421628e-08,2.421628e-08,4.358931e-07,5.66661e-07,5.618177e-07,0.0,2e-06,0.0,1.0,1.0,5.0,40.0,2011.0
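
In the output above, `num__est_total_pop` is close to 1 in every row while the other numerical columns are very small. That pattern is consistent with how scikit-learn's `Normalizer` works: it divides each row by that row's L2 norm, so the largest-magnitude feature dominates. The sketch below shows this on made-up values and contrasts it with `StandardScaler`, which scales each column instead. `StandardScaler` is not used in this notebook and appears here only for comparison.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Toy rows: a large population count next to small ratio features (values are made up)
X = np.array([
    [35000.0, 0.15, 0.30],
    [52000.0, 0.10, 0.28],
    [8000.0, 0.22, 0.35],
])

row_scaled = Normalizer().fit_transform(X)      # each row divided by its own L2 norm
col_scaled = StandardScaler().fit_transform(X)  # each column to zero mean, unit variance

print(row_scaled.round(6))
print(col_scaled.round(3))
```

Because KMeans relies on Euclidean distances, the choice between row-wise and column-wise scaling changes which features drive the clusters.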


In [7]:
data.to_csv('../data/processed/kmeans_all_data.csv')

In [12]:
# Fills NA values with the most recent values, then the previous values, and then the median
filler = build_features.FillBackForward()
filled_df = filler.transform(high_school[all_cols + high_cols + index_cols])

In [16]:
# The transformer that will build features using the same numerical processor
feature_builder = ColumnTransformer(
    transformers=[
        ('num', num, all_num_cols+high_num_cols)
    ], remainder='passthrough')

# The full pipeline: fill missing values, build features, then remove outliers
preprocessor = Pipeline(
    steps=[
        ('NA_filler', build_features.FillBackForward()),
        ('preprocessor', feature_builder),
        ('outlier_remover', build_features.OutlierRemover(cols=(all_cols + high_cols + index_cols)))])
pipe_fit = preprocessor.fit(high_school[all_cols + high_cols + index_cols])
# pipe_fit[1] is the fitted ColumnTransformer, which supplies the output column names
processed_df = pd.DataFrame(pipe_fit.transform(high_school[all_cols + high_cols + index_cols]), columns=pipe_fit[1].get_feature_names_out())

In [14]:
processed_df.head()

Unnamed: 0,num__est_total_pop,num__child_pov_ratio,num__child_adult_ratio,num__instruction_per_pupil,num__support_per_pupil,num__community_per_pupil,num__other_per_pupil,num__pct_amind,num__pct_asian,num__pct_black,...,remainder__achievement_dir,remainder__growth_dir,remainder__overall_dir,remainder__school_grade,remainder__eng_yn,remainder__math_yn,remainder__read_yn,remainder__sci_yn,remainder__school_id,remainder__year
0,0.978103,5.549022e-06,8e-06,0.177657,0.105616,0.001055,0.024333,1.028156e-07,1.028156e-06,4.112625e-07,...,1.0,1.0,1.0,5.0,0.0,0.0,0.0,0.0,309.0,2010.0
1,0.999495,5.051377e-07,1e-06,0.025994,0.015382,0.000109,0.009896,1.452304e-08,1.198151e-07,3.267684e-08,...,1.0,1.0,0.0,6.0,1.0,0.0,0.0,0.0,15.0,2010.0
2,0.999495,5.051377e-07,1e-06,0.025994,0.015382,0.000109,0.009896,5.230859e-09,3.923144e-08,1.307715e-08,...,0.0,0.0,0.0,6.0,1.0,0.0,0.0,0.0,4108.0,2010.0
3,0.999495,5.051377e-07,1e-06,0.025994,0.015382,0.000109,0.009896,1.640752e-08,2.695521e-07,5.156649e-08,...,1.0,0.0,0.0,9.0,1.0,0.0,0.0,0.0,5043.0,2010.0
4,0.999495,5.051378e-07,1e-06,0.025994,0.015382,0.000109,0.009896,8.073395e-09,5.113149e-08,4.844036e-08,...,0.0,0.0,0.0,6.0,1.0,0.0,0.0,0.0,5816.0,2010.0


In [15]:
processed_df.to_csv('../data/processed/kmeans_high_school.csv')
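
Both `to_csv` calls in this notebook write the DataFrame index as an unnamed first column. When the processed files are read back, that column appears as `Unnamed: 0` unless it is treated as the index. The sketch below demonstrates this with an in-memory buffer and a toy frame; the values are made up, and whether downstream notebooks pass `index_col=0` is not shown here.

```python
import io
import pandas as pd

# Toy frame with column names in the same style as the processed data
toy = pd.DataFrame({"num__pct_fr": [0.1, 0.2], "remainder__year": [2010.0, 2011.0]})

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as an unnamed first column
csv_text = buffer.getvalue()

naive = pd.read_csv(io.StringIO(csv_text))                  # index comes back as a data column
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)  # index is restored as the index

print(list(naive.columns))
print(list(restored.columns))
```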