#### Feature Transformation and Imputation

In [1]:
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer

Read in the initial dataset.

In [2]:
df = pd.read_csv('../data/dataset.csv')

Set column names to be lowercase.

In [3]:
df.columns = df.columns.str.lower()

Set index as district ID. 

In [4]:
df.set_index('leaid', inplace=True)

Inspect the first few rows.

In [5]:
df.head()

leaid,name,stabbr,agchrt,v33,totalrev,tfedrev,c14,c15,c16,c17,...,w01,w31,w61,v95,v02,k14,ce1,ce2,ce3,graduation rate
2700001,MOUNTAIN IRON-BUHL,MN,3,507.0,8146000.0,442000.0,175000.0,0.0,13000.0,0.0,...,214000.0,0.0,14622000.0,199000.0,47000.0,92000.0,-1.0,-1.0,0.0,0.9355
2700005,UNITED SOUTH CENTRAL,MN,3,707.0,12242000.0,554000.0,170000.0,0.0,29000.0,0.0,...,1114000.0,0.0,4264000.0,182000.0,126000.0,237000.0,-1.0,-1.0,0.0,0.881
2700006,MAPLE RIVER,MN,3,927.0,13103000.0,489000.0,146000.0,0.0,31000.0,0.0,...,0.0,0.0,5541000.0,318000.0,195000.0,147000.0,-1.0,-1.0,0.0,0.9747
2700007,KINGSLAND,MN,3,557.0,8078000.0,374000.0,160000.0,0.0,52000.0,0.0,...,626000.0,0.0,2405000.0,235000.0,55000.0,66000.0,-1.0,-1.0,0.0,0.9677
2700008,ST LOUIS COUNTY,MN,3,2007.0,39951000.0,1860000.0,506000.0,0.0,10000.0,0.0,...,4616000.0,0.0,8073000.0,589000.0,705000.0,0.0,-1.0,-1.0,0.0,0.8607


Select the numeric columns to transform, leaving out the descriptive columns, the student population (`v33`), and the target:

In [6]:
num_cols = df.drop(columns=['name', 'stabbr', 'agchrt', 'v33', 'graduation rate']).columns

Remove rows whose student population (`v33`) is zero or negative, since they cannot serve as a divisor:

In [7]:
no_pop = df[df['v33'] <= 0].index

In [8]:
df.drop(no_pop, inplace=True)

Divide each numeric column by the student population to get per-student values:

In [9]:
for col in num_cols:
    df[col] = df[col] / df['v33']
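
The loop above can also be written as a single vectorized call. The following is a minimal sketch on a made-up two-row frame (the values and the `toy`, `cols`, and `per_student` names are illustrative assumptions, not taken from the dataset); it relies on `DataFrame.div` with `axis=0` to align the divisor on the row index.

```python
import pandas as pd

# Toy frame; 'v33' plays the role of student population, as in the notebook.
toy = pd.DataFrame(
    {"v33": [100.0, 200.0], "totalrev": [5000.0, 8000.0], "tfedrev": [300.0, 0.0]},
    index=pd.Index([1, 2], name="leaid"),
)
cols = ["totalrev", "tfedrev"]

# axis=0 aligns the divisor with the row index,
# so each row is divided by its own population.
per_student = toy[cols].div(toy["v33"], axis=0)
print(per_student)
```

Because the division is aligned on the index, this should give the same result as the column-by-column loop while avoiding repeated assignment.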

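Take the natural log of the positive per-student values. Zeroes and negatives stand in for missing data, so they are set to zero as a temporary placeholder. Note that `np.where` evaluates `np.log` on every entry before choosing, so NumPy may emit runtime warnings for the non-positive ones; those results are discarded.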

In [10]:
df[num_cols] = np.where(df[num_cols] <= 0, 0, np.log(df[num_cols]))
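
One caveat with a zero placeholder is that a per-student value of exactly 1 also has a log of 0, so the later `replace(0, np.nan)` step cannot tell it apart from a missing value. The sketch below shows an alternative, not what the cell above does: it masks non-positive entries as NaN before taking the log. The `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy column: one ordinary value, two "missing" markers, and a value of exactly 1.
toy = pd.DataFrame({"a": [np.e, 0.0, -1.0, 1.0]})

# Keep positive entries, turn everything else into NaN, then take the log.
# NaN passes through np.log unchanged, so no placeholder zero is needed.
logged = np.log(toy.where(toy > 0))
print(logged)
```

With this approach the entry that was exactly 1 keeps its legitimate log value of 0, and only the zero and negative entries end up as NaN.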

Then convert the placeholder zeroes to NaN and fill them with each feature's mean.

In [11]:
df[num_cols] = df[num_cols].replace(0, np.nan)

imputer = SimpleImputer(strategy='mean')
df[num_cols] = imputer.fit_transform(df[num_cols])
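
To make the effect of mean imputation concrete, here is a pandas-only sketch of the same idea on a made-up frame (the `toy`, `col_means`, and `filled` names are illustrative). It is not a call to `SimpleImputer`, but for columns with at least one observed value the filled numbers should match what `strategy='mean'` produces.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})

# DataFrame.mean skips NaN by default, so each column mean
# is computed from the observed values only.
col_means = toy.mean()

# Fill each column's gaps with that column's own mean.
filled = toy.fillna(col_means)
print(filled)
```

Here the gap in `a` is filled with 2.0 and the gap in `b` with 6.0, the means of the remaining values in each column.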

Finally, clip the target to the range 0.05 to 0.95, because some states report only whether a district's graduation rate is above 95 percent or below 5 percent.

In [12]:
df['graduation rate'] = np.where(df['graduation rate'] >= .95, .95,
                                np.where(df['graduation rate'] <= .05, .05, 
                                         df['graduation rate']))
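
The nested `np.where` can also be expressed with `Series.clip`. The sketch below, on made-up rates, checks that the two forms agree, including for a missing value; the `rates`, `nested`, and `clipped` names are illustrative.

```python
import numpy as np
import pandas as pd

rates = pd.Series([0.99, 0.50, 0.02, np.nan], name="graduation rate")

# Nested np.where, in the same shape as the notebook cell.
nested = np.where(rates >= .95, .95, np.where(rates <= .05, .05, rates))

# Equivalent one-liner: bound values below by 0.05 and above by 0.95.
clipped = rates.clip(lower=0.05, upper=0.95)

# Both forms leave NaN untouched, so compare with equal_nan=True.
print(np.allclose(nested, clipped.to_numpy(), equal_nan=True))
```

Both versions leave in-range values and NaN unchanged; `clip` simply states the bounds more directly.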

Then save the new data:

In [13]:
df.to_csv('../data/log_per_student.csv', index=True)