# Data Cleaning 

## Our Research Question 

To what extent is the language offered in bilingual programs at New York City schools influenced by each school's ethnic/racial composition and by economic factors?

Read more about our research question and listen to the podcast that inspired us: https://www.nytimes.com/2020/07/23/podcasts/nice-white-parents-serial.html

## Data Description 

This dataset was sourced from NYC OpenData. It lists all bilingual programs in New York City for the 2020-2021 school year, broken down by program type and language for each school, and is organized by borough and school district. NYC OpenData was developed as part of an initiative to make the NYC government more accessible, transparent, and accountable. The open, free, public data is produced by various city agencies, including the Department of Education, the Department of Buildings, the New York City Taxi and Limousine Commission, and the Board of Correction, among others. By centralizing data from these agencies, the initiative aims to give citizens a more efficient way to find useful, machine-readable data.

This particular dataset is owned and provided by the New York City Department of Education. Regarding influences on data collection: because this is meant to be a comprehensive list of schools with bilingual programs in the city, no schools should be excluded, at least not on purpose. Whether a school has a bilingual program is public information and should be relatively easy to collect, so we do not expect significant collection or recording bias. The stakeholders here are schools rather than individuals, and the data contains no personal information, so the risk of misuse is also low.

Two columns appear to have been derived from other columns. The translated-language column gives the program language written in that language itself; for example, rows whose language is Spanish have Español in the translated column. The school DBN column appears to have been derived from the school name, since people are more likely to know a school by its name than by its DBN.

There are 538 rows, each corresponding to one bilingual program offering at a NYC school. A school appears in more than one row when it has several offerings, for example in more than one language or in both general and special education. There are 11 columns: borough, borough/citywide office, district, school (DBN), school name, school category (K-8, elementary, early childhood, etc.), program, program language, program language translated into the respective language, whether the program is general or special education, and special education model. The NYC OpenData website does not document any preprocessing of the data. However, the data is already quite clean, so some cleaning may have been done before publication, primarily identifying which schools had bilingual programs and filling in the corresponding school information so that the dataset has no null or NaN values.
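
The row-level and missing-value claims above can be checked directly. The sketch below is a minimal illustration, assuming a hypothetical `sample` frame that copies the first few rows shown in this notebook's output and uses the renamed `DBN` and `school_name` columns; in the notebook, the same calls would run on the full table instead.

```python
import pandas as pd

# Hypothetical sample copied from the first rows of the notebook output;
# replace `sample` with the full DataFrame when running in the notebook.
sample = pd.DataFrame(
    {
        "DBN": ["01M020", "01M020", "01M184", "01M184", "01M188"],
        "school_name": [
            "P.S. 020 Anna Silver",
            "P.S. 020 Anna Silver",
            "P.S. 184m Shuang Wen",
            "P.S. 184m Shuang Wen",
            "P.S. 188 The Island School",
        ],
        "Program": ["Dual Language"] * 5,
        "Language": ["Chinese", "Spanish", "Chinese", "Chinese", "Spanish"],
        "General/Special Education": [
            "General Education",
            "General Education",
            "General Education",
            "Special Education",
            "General Education",
        ],
    }
)

# Number of rows versus number of distinct schools (one DBN per school)
n_rows = len(sample)
n_schools = sample["DBN"].nunique()

# Count of missing values in each column
missing_per_column = sample.isna().sum()

# How many rows each school contributes
rows_per_school = sample.groupby("DBN").size()

print(f"{n_rows} rows cover {n_schools} distinct schools")
print(missing_per_column)
print(rows_per_school)
```

If the number of distinct DBNs is smaller than the number of rows, the rows describe program offerings rather than schools, which matters later when aggregating to the school level.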


In [4]:
# Load libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

In [26]:
# Load the bilingual program data and standardize the key column names
bilingual_data = pd.read_csv("bilingual_rates.csv")
bilingual_data = bilingual_data.rename(columns={"School Name": "school_name", "School": "DBN"})

# Drop the columns that are not needed for the analysis
bilingual_data = bilingual_data.drop(columns=["Language (Translated)", "Special Education Model"])

# COLUMNS
display(bilingual_data)

# FIXING DATA TYPES
# Convert DBN and school name to the pandas "string" dtype; District is already an int
bilingual_str_columns = ["DBN", "school_name"]
bilingual_data[bilingual_str_columns] = bilingual_data[bilingual_str_columns].astype("string")

# OTHER DATA CLEANING
# Confirm the dimensions after dropping the two columns
print(bilingual_data.shape)

| | Borough | Borough/Citywide Office (B/CO) | District | DBN | school_name | School Category | Program | Language | General/Special Education |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Manhattan | Manhattan | 1 | 01M020 | P.S. 020 Anna Silver | Elementary | Dual Language | Chinese | General Education |
| 1 | Manhattan | Manhattan | 1 | 01M020 | P.S. 020 Anna Silver | Elementary | Dual Language | Spanish | General Education |
| 2 | Manhattan | Manhattan | 1 | 01M184 | P.S. 184m Shuang Wen | K-8 | Dual Language | Chinese | General Education |
| 3 | Manhattan | Manhattan | 1 | 01M184 | P.S. 184m Shuang Wen | K-8 | Dual Language | Chinese | Special Education |
| 4 | Manhattan | Manhattan | 1 | 01M188 | P.S. 188 The Island School | K-8 | Dual Language | Spanish | General Education |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 533 | Bronx | District 75 | 75 | 75X811 | P.S. X811 | Secondary School | Transitional Bilingual Education | Spanish | Special Education |
| 534 | Bronx | District 75 | 75 | 75X811 | P.S. X811 | Secondary School | Transitional Bilingual Education | Spanish | Special Education |
| 535 | Bronx | District 75 | 75 | 75X811 | P.S. X811 | Secondary School | Transitional Bilingual Education | Spanish | Special Education |
| 536 | Manhattan | ACCESS | 79 | 79M973 | Restart Academy | Ungraded | Transitional Bilingual Education | Spanish | General Education |


(538, 9)
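
Two follow-up checks are suggested by the output above. The sketch below is a minimal illustration on a hypothetical `sample` frame built from rows visible in that output. It assumes, based only on the displayed values, that a DBN is a two-digit district, a borough letter, and a three-character school number; that layout is not stated by the data source. It also counts rows that are now identical, such as the three `75X811` rows. One possible explanation, not confirmed here, is that such rows differed only in a dropped column.

```python
import pandas as pd

# Hypothetical sample copied from rows visible in the notebook output
sample = pd.DataFrame(
    {
        "District": [1, 1, 75, 75, 75, 79],
        "DBN": ["01M020", "01M184", "75X811", "75X811", "75X811", "79M973"],
        "Program": [
            "Dual Language",
            "Dual Language",
            "Transitional Bilingual Education",
            "Transitional Bilingual Education",
            "Transitional Bilingual Education",
            "Transitional Bilingual Education",
        ],
        "Language": ["Chinese", "Chinese", "Spanish", "Spanish", "Spanish", "Spanish"],
        "General/Special Education": [
            "General Education",
            "General Education",
            "Special Education",
            "Special Education",
            "Special Education",
            "General Education",
        ],
    }
)

# Assumed DBN layout: 2-digit district, 1 borough letter, 3-character school number
dbn_parts = sample["DBN"].str.extract(
    r"^(?P<district>\d{2})(?P<borough>[A-Z])(?P<number>\w{3})$"
)

# Rows where the district encoded in the DBN disagrees with the District column
district_matches = dbn_parts["district"].astype(int).eq(sample["District"])
n_mismatched = int((~district_matches).sum())

# Rows that repeat an earlier row across all remaining columns
n_duplicates = int(sample.duplicated().sum())

print(f"District mismatches: {n_mismatched}")
print(f"Fully duplicated rows: {n_duplicates}")
```

Whether to keep or drop fully duplicated rows depends on the analysis: they inflate program counts per school, but they do not affect which languages a school offers.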
