# (prototype) Preprocess NYT SHSAT table

In this notebook we preprocess the NYT table so that we can join it with our other data and test hypotheses.

In [1]:
import pandas as pd

df = pd.read_csv('../data/raw/nyt_table.csv')
df.head()

Unnamed: 0,DBN,school_name_number,school_name_details,borough,testers,offers,offers_per_student,pct_hispanic_black
0,20K187,Intermediate School 187,The Christa McAuliffe School,Brooklyn,251,205,75%,8%
1,21K239,Intermediate School 239,The Mark Twain Intermediate School for the Gif...,Brooklyn,336,196,46%,13%
2,03M054,Junior High School 54,The Booker T. Washington School,Manhattan,257,150,53%,23%
3,15K051,Midde School 51,The William Alexander School,Brooklyn,280,122,33%,28%
4,02M312,,New York City Lab Middle School for Collaborat...,Manhattan,163,113,62%,8%


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 589 entries, 0 to 588
Data columns (total 8 columns):
DBN                    589 non-null object
school_name_number     231 non-null object
school_name_details    589 non-null object
borough                589 non-null object
testers                589 non-null object
offers                 589 non-null object
offers_per_student     589 non-null object
pct_hispanic_black     589 non-null object
dtypes: object(8)
memory usage: 36.9+ KB
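Every column is reported as `object`, including the count and percentage columns, which suggests they hold non-numeric placeholders. One way to list such placeholders is sketched below. The `sample` frame and the helper `non_numeric_values` are made up for illustration and are not rows or functions from this notebook.

```python
import pandas as pd

EM_DASH = '\u2014'  # assumed placeholder character for this illustration

# Hypothetical sample mimicking two raw columns; values are illustrative only.
sample = pd.DataFrame({
    'testers': ['251', '336', EM_DASH],
    'offers': ['205', EM_DASH, EM_DASH],
})

def non_numeric_values(series):
    """Return the distinct strings in `series` that do not parse as numbers."""
    parsed = pd.to_numeric(series, errors='coerce')
    # Keep entries that were present in the input but failed to parse.
    return sorted(series[parsed.isna() & series.notna()].unique())

placeholders = {col: non_numeric_values(sample[col]) for col in sample.columns}
print(placeholders)
```

Running the same helper on the real `df` columns would show which strings need special handling before conversion.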


In [3]:
import numpy as np


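# Missing values are marked with an em dash ('—', U+2014),
# which looks similar to, but is not equal to, the hyphen '-'.

def parse_int(x):
    # Convert a count string to an int; the em dash placeholder becomes NaN.
    return np.nan if x == '—' else int(x)

def parse_pct(x):
    # Convert a percentage string such as '75%' to a fraction (0.75).
    return np.nan if x == '—' else float(x[:-1]) / 100.0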

df['testers'] = df['testers'].apply(parse_int)
df['offers'] = df['offers'].apply(parse_int)

df['offers_per_student'] = df['offers_per_student'].apply(parse_pct)
df['pct_hispanic_black'] = df['pct_hispanic_black'].apply(parse_pct)

df.head()

Unnamed: 0,DBN,school_name_number,school_name_details,borough,testers,offers,offers_per_student,pct_hispanic_black
0,20K187,Intermediate School 187,The Christa McAuliffe School,Brooklyn,251.0,205.0,0.75,0.08
1,21K239,Intermediate School 239,The Mark Twain Intermediate School for the Gif...,Brooklyn,336.0,196.0,0.46,0.13
2,03M054,Junior High School 54,The Booker T. Washington School,Manhattan,257.0,150.0,0.53,0.23
3,15K051,Midde School 51,The William Alexander School,Brooklyn,280.0,122.0,0.33,0.28
4,02M312,,New York City Lab Middle School for Collaborat...,Manhattan,163.0,113.0,0.62,0.08
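The same parsing can also be written without `apply`. The following is a minimal sketch of a vectorized alternative, not the approach used in this notebook. The helper names `parse_counts` and `parse_percents` and the `sample` frame are hypothetical. It masks the em dash explicitly instead of coercing all errors, so any other unexpected string would still raise.

```python
import pandas as pd

EM_DASH = '\u2014'  # the placeholder character; distinct from the ASCII hyphen '-'

# Hypothetical sample; values are illustrative only.
sample = pd.DataFrame({
    'testers': ['251', EM_DASH],
    'offers_per_student': ['75%', EM_DASH],
})

def parse_counts(series):
    """Replace the em dash with NaN, then convert the strings to numbers."""
    return pd.to_numeric(series.mask(series == EM_DASH))

def parse_percents(series):
    """Replace the em dash with NaN, strip '%', and scale to a fraction."""
    cleaned = series.mask(series == EM_DASH).str.rstrip('%')
    return pd.to_numeric(cleaned) / 100.0

testers = parse_counts(sample['testers'])
pct = parse_percents(sample['offers_per_student'])
print(testers.tolist(), pct.tolist())
```

As with `parse_int`, the count column comes back as floats, because NaN cannot be stored in a plain integer column.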


In [4]:
df.shape

(589, 8)

In [5]:
df.isnull().sum()

DBN                      0
school_name_number     358
school_name_details      0
borough                  0
testers                 52
offers                 469
offers_per_student     469
pct_hispanic_black       0
dtype: int64

In [6]:
df.isnull().mean()

DBN                    0.000000
school_name_number     0.607810
school_name_details    0.000000
borough                0.000000
testers                0.088285
offers                 0.796265
offers_per_student     0.796265
pct_hispanic_black     0.000000
dtype: float64

A missing value in this dataset means that there are *5 or fewer students* in the category.

We can see that about 80% of the schools in this dataset (469 of 589) received fewer than 6 SPHS offers, and about 8.8% (52 of 589) had fewer than 6 test takers.
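Because a missing value here means "between 0 and 5" rather than "unknown", it may be useful to keep that information explicit before joining with other data. The sketch below is one possible way to do so. The `*_suppressed` and `*_upper_bound` column names and the toy `sample` frame are assumptions for illustration, not part of the original table.

```python
import numpy as np
import pandas as pd

SUPPRESSION_LIMIT = 5  # per the note above: missing means 5 or fewer students

# Hypothetical sample in the already-parsed format; values are illustrative only.
sample = pd.DataFrame({
    'testers': [251.0, 40.0, np.nan],
    'offers': [205.0, np.nan, np.nan],
})

for col in ['testers', 'offers']:
    # Flag rows where the count was withheld.
    sample[col + '_suppressed'] = sample[col].isnull()
    # Largest value the withheld count could take; known counts are kept as-is.
    sample[col + '_upper_bound'] = sample[col].fillna(SUPPRESSION_LIMIT)

# Share of rows with a withheld count, per column.
share_suppressed = sample[['testers_suppressed', 'offers_suppressed']].mean()
print(share_suppressed)
```

Keeping the flag separate from the bound leaves the original columns untouched, so later analyses can choose whether to drop, bound, or model the suppressed rows.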