# Clean data from SCHMA

In this notebook, we will clean data from the <a href="http://steinhardt.nyu.edu/research_alliance/research/schma">School-Level Master File</a> (SCHMA). This dataset is produced by the Research Alliance for New York City Schools at NYU. They describe it as follows:

<i>The School-Level Master File (SCHMA) is a dataset developed by the Research Alliance for New York City Schools at New York University. To create the file, we compiled publicly available data from the New York City Department of Education (DOE) and the U.S. Department of Education. The result is a consistent, accessible document that can be used to investigate characteristics of individual New York City schools or groups of schools and how they have changed over time.</i>

We will use the SCHMA to obtain each school's latitude and longitude (so we can create spatial plots).

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

raw_SCHMA = pd.read_csv('../../data/SCHMA/schma19962013.csv', usecols=['YEAR', 'BNLONG','LCGGEOX','LCGGEOY'], low_memory=False)

Next, keep only records from 2006 through 2012 and drop everything else.

In [3]:
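# Keep records from 2006 through 2012 (inclusive) using a single boolean mask.
# Copy so that the in-place dropna later acts on an independent DataFrame.
in_range = (raw_SCHMA.YEAR >= 2006) & (raw_SCHMA.YEAR <= 2012)
proc_SCHMA = raw_SCHMA.loc[in_range].copy()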
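
The year filter is easy to check on a small frame. The sketch below uses made-up years (not SCHMA data) and `Series.between`, which is inclusive on both ends by default, as one possible equivalent of the two-sided comparison above.

```python
import pandas as pd

# Toy stand-in for raw_SCHMA; these years are made up for illustration.
toy = pd.DataFrame({"YEAR": [2004, 2006, 2009, 2012, 2013]})

# between() keeps both endpoints, so 2006 and 2012 survive the filter.
kept = toy.loc[toy["YEAR"].between(2006, 2012)]
print(kept["YEAR"].tolist())
```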

In [4]:
proc_SCHMA.describe()

| | YEAR | LCGGEOX | LCGGEOY |
|---|---|---|---|
| count | 11035.0 | 4651.0 | 4651.0 |
| mean | 2009.070322 | -73.919906 | 40.735262 |
| std | 1.99386 | 0.080696 | 0.0861 |
| min | 2006.0 | -74.2441 | 40.50822 |
| 25% | 2007.0 | -73.9651 | 40.67208 |
| 50% | 2009.0 | -73.9214 | 40.72682 |
| 75% | 2011.0 | -73.8795 | 40.81732 |
| max | 2012.0 | -73.7091 | 40.90353 |


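Note that the minimum and maximum values of all three numeric fields are reasonable: the years span 2006 to 2012, and the coordinates (LCGGEOX for longitude, LCGGEOY for latitude) are consistent with locations in New York City. However, only 4,651 of the 11,035 records have coordinates, so we proceed by dropping records with missing values.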
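
The "reasonable" judgment can be made explicit with a bounding-box test. This is a minimal sketch: the box limits below are rough, assumed values for New York City, not figures taken from SCHMA, and the frame holds only the per-column extremes reported by `describe()`.

```python
import pandas as pd

# Approximate bounding box for New York City (assumed values, not from SCHMA).
LON_MIN, LON_MAX = -74.27, -73.69
LAT_MIN, LAT_MAX = 40.47, 40.93

# Per-column minimum and maximum coordinates from the describe() output.
extremes = pd.DataFrame({
    "LCGGEOX": [-74.2441, -73.7091],  # longitude
    "LCGGEOY": [40.50822, 40.90353],  # latitude
})

# True for each row whose longitude and latitude both fall inside the box.
inside = (
    extremes["LCGGEOX"].between(LON_MIN, LON_MAX)
    & extremes["LCGGEOY"].between(LAT_MIN, LAT_MAX)
)
print(inside.all())
```

The same mask could be applied to the full `proc_SCHMA` frame to flag individual outliers instead of relying on the extremes alone.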

In [5]:
proc_SCHMA.dropna(inplace=True)
proc_SCHMA.shape

(4651, 4)
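
Before dropping rows, it can help to see which columns carry the missing values. The sketch below uses a made-up frame (the `BNLONG` codes and coordinates are hypothetical) to show `isna().sum()` next to the row count that `dropna()` leaves behind.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical school codes; one row lacks coordinates.
toy = pd.DataFrame({
    "YEAR": [2008, 2010, 2011],
    "BNLONG": ["A001", "A001", "B002"],
    "LCGGEOX": [np.nan, -73.95, -73.90],
    "LCGGEOY": [np.nan, 40.70, 40.80],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# dropna() removes every row that has at least one missing value.
rows_kept = len(toy.dropna())

print(missing_per_column)
print(rows_kept)
```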

Let's see how many schools have latitudes and longitudes:

In [6]:
len(proc_SCHMA.BNLONG.unique())

1592

Thus, we have latitudes and longitudes for 1,592 schools. Let's see how these are distributed across years:

In [7]:
proc_SCHMA.YEAR.value_counts()

2012    1565
2011    1555
2010    1531
dtype: int64
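
Because rows without coordinates were already dropped, years with no coordinates simply vanish from these counts. One way to see coverage directly is to compute the share of records with a coordinate per year before dropping anything. The sketch below does this on made-up data, not on SCHMA itself.

```python
import numpy as np
import pandas as pd

# Toy data: coordinates are missing in the early years.
toy = pd.DataFrame({
    "YEAR": [2008, 2009, 2010, 2010, 2011],
    "LCGGEOX": [np.nan, np.nan, -73.95, np.nan, -73.90],
})

# Fraction of records per year that have a longitude value.
coverage = toy["LCGGEOX"].notna().groupby(toy["YEAR"]).mean()
print(coverage)
```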

It appears that geographic coordinates are only available for 2010 through 2012, so in future analyses we'll limit our mapping to these three years. Finally, now that we've cleaned the data, let's save it:

In [8]:
proc_SCHMA.to_csv('../../data/clean_SCHMA.csv')
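
Note that `to_csv` writes the DataFrame index by default, so reading the file back yields an extra `Unnamed: 0` column unless `index_col=0` is passed to `read_csv` or the file is written with `index=False`. The sketch below illustrates both behaviours on a toy frame, using an in-memory buffer in place of a file on disk.

```python
import io
import pandas as pd

# Toy frame with a non-default index, as left behind by filtering and dropna.
toy = pd.DataFrame(
    {"YEAR": [2010, 2011], "LCGGEOX": [-73.95, -73.90]},
    index=[7, 9],
)

# Default behaviour: the index is written as an unnamed first column.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
default_cols = pd.read_csv(buffer).columns.tolist()

# With index=False, only the data columns are written.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
no_index_cols = pd.read_csv(buffer).columns.tolist()

print(default_cols)
print(no_index_cols)
```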