# Analyzing NYC High School Data

## Data Cleaning

The topic that has been chosen is analize data about New York City public schools .

This project focuses on exploring and analyzing a dataset, developing data cleaning and storytelling skills, which enables us to complete projects on our own.

The main focus is primarily on data exploration in this lesson and combine several messy data sets into a single clean one to make analysis easier.

One of the most controversial issues in the U.S. educational system is the efficacy of standardized tests and whether they're unfair to certain groups.

The test consists of three sections, each of which has 800 possible points. The combined score is out of 2,400 possible points (while this number has changed a few times, the dataset for our project is based on 2,400 total points). Organizations often rank high schools by their average SAT scores. The scores are also considered a measure of overall school district quality.

Investigating the correlations between **SAT** scores and demographics might be an interesting angle to take. We could correlate SAT scores with factors like race, gender, income, and more.


New York City makes its [data on high school SAT scores](https://data.cityofnewyork.us/Education/SAT-Results/f9bf-2cp4) available online, as well as the [demographics for each high school](https://data.cityofnewyork.us/Education/DOE-High-School-Directory-2014-2015/n3p6-zve2). 

Unfortunately, combining both of the datasets won't give us all of the demographic information we want to use. We'll need to supplement our data with other sources to do our full analysis.

<br>

<li><a href="https://data.cityofnewyork.us/Education/SAT-Results/f9bf-2cp4" target="_blank">SAT scores by school</a> - SAT scores for each high school in New York City</li>
<li><a href="https://data.cityofnewyork.us/Education/School-Attendance-and-Enrollment-Statistics-by-Dis/7z8d-msnt" target="_blank">School attendance</a> - Attendance information for each school in New York City</li>
<li><a href="https://data.cityofnewyork.us/Education/2010-2011-Class-Size-School-level-detail/urz7-pzb3" target="_blank">Class size</a> - Information on class size for each school</li>
<li><a href="https://data.cityofnewyork.us/Education/AP-College-Board-2010-School-Level-Results/itfs-ms3e" target="_blank">AP test results</a> - Advanced Placement (AP) exam results for each high school (passing an optional AP exam in a particular subject can earn a student college credit in that subject)</li>
<li><a href="https://data.cityofnewyork.us/Education/Graduation-Outcomes-Classes-Of-2005-2010-School-Le/vh2h-md7a" target="_blank">Graduation outcomes</a> - The percentage of students who graduated, and other outcome information</li>
<li><a href="https://data.cityofnewyork.us/Education/School-Demographics-and-Accountability-Snapshot-20/ihfw-zy9j" target="_blank">Demographics</a> - Demographic information for each school</li>
<li><a href="https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8" target="_blank">School survey</a> - Surveys of parents, teachers, and students at each school</li>


All of these datasets are interrelated. We'll need to combine them into a single dataset before we can find correlations.

Understanding the meaning of data we will helps us avoid costly mistakes, such as thinking that a column represents something other than what it does. **Background research** gives us a better understanding of how to combine and analyze the data.


<ul>
<li><a href="https://en.wikipedia.org/wiki/New_York_City" target="_blank">New York City</a></li>
<li><a href="https://en.wikipedia.org/wiki/SAT" target="_blank">The SAT</a></li>
<li><a href="https://en.wikipedia.org/wiki/List_of_high_schools_in_New_York_City" target="_blank">Schools in New York City</a></li>
<li><a href="https://data.cityofnewyork.us/browse?category=Education" target="_blank">Our data</a></li>
</ul>


We can learn a few different things from these resources. For example:

- Only high school students take the SAT, so **we'll want to focus on high schools**.

- **New York City is made up of five boroughs**, which are essentially distinct regions.

- New York City schools fall within several different school districts, each of which can contain dozens of schools.

- Our datasets include **several different types of schools. We'll need to clean them** so that we can focus on high schools only.

- Each school in New York City has a unique code called a **DBN** or district borough number.

- **Aggregating data by district allows us to use the district mapping data to plot district-by-district differences**.

Once we've done our background research, we're ready to read in the data. 


<ul>
<li>Data on <a href="https://data.cityofnewyork.us/Education/AP-College-Board-2010-School-Level-Results/itfs-ms3e" target="_blank">AP test results</a></li>
<li>Data on <a href="https://data.cityofnewyork.us/Education/2010-2011-Class-Size-School-level-detail/urz7-pzb3" target="_blank">class size</a></li>
<li>Data on <a href="https://data.cityofnewyork.us/Education/School-Demographics-and-Accountability-Snapshot-20/ihfw-zy9j" target="_blank">demographics</a></li>
<li>Data on <a href="https://data.cityofnewyork.us/Education/Graduation-Outcomes-Classes-Of-2005-2010-School-Le/vh2h-md7a" target="_blank">graduation outcomes</a></li>
<li>A directory of <a href="https://data.cityofnewyork.us/Education/DOE-High-School-Directory-2014-2015/n3p6-zve2" target="_blank">high schools</a></li>
<li>Data on <a href="https://data.cityofnewyork.us/Education/SAT-Results/f9bf-2cp4" target="_blank">SAT scores</a></li>
<li>Data on <a href="https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8" target="_blank">surveys</a> from all schools</li>
<li>Data on <a href="https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8" target="_blank">surveys</a> from New York City <a href="https://www.schools.nyc.gov/learning/special-education/school-settings/district-75" target="_blank">district 75</a></li>
</ul>


<br>

`survey_all.txt` and `survey_d75.txt` are in more complicated formats than the other files. For now, we'll focus on reading in the CSV files only, and then explore them.


We'll read each file into a pandas dataframe and store all of the dataframes in a dictionary. This will gives us a convenient way to store and a quick way to reference them later on.


In [1]:
import pandas as pd

data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

data = {}

for files in data_files:
    dfs = pd.read_csv("schools/{}".format(files))
    idx = files.replace(".csv","")
    data[idx]=dfs

- dataset names are now dictionary keys

In [2]:
data.keys()

dict_keys(['ap_2010', 'class_size', 'demographics', 'graduation', 'hs_directory', 'sat_results'])

### Let's explore `sat_results` (in particular)

**Exploring the dataframe helps us understand the structure of the data** and make it easier for us to analyze it.

What we're mainly interested in is the SAT dataset, which corresponds to the dictionary key `sat_results`. 

This dataset contains the SAT scores for each high school in New York City. We eventually want to correlate selected information from this dataset with information in the other datasets.

In [3]:
data['sat_results'].head(5)

Unnamed: 0,DBN,SCHOOL NAME,Num of SAT Test Takers,SAT Critical Reading Avg. Score,SAT Math Avg. Score,SAT Writing Avg. Score
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366
2,01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370
3,01M458,FORSYTH SATELLITE ACADEMY,7,414,401,359
4,01M509,MARTA VALLE HIGH SCHOOL,44,390,433,384


We can make a few observations based on this output:

- The DBN appears to be a unique ID for each school.

- We can tell from the first few rows of names that we only have data about high schools.

- There's only a single row for each high school, so each DBN is unique in the SAT data.

- We may eventually want to combine the three columns that contain SAT scores -- SAT Critical Reading Avg. Score, SAT Math Avg. Score, and SAT Writing Avg. Score -- into a single column to make the scores easier to analyze.

Given these observations... 

### Let's explore the other datasets 

To see if we can gain any insight into how to combine them.

In [4]:
keys = [ 'ap_2010', 'class_size', 'demographics',
        'graduation','hs_directory','sat_results']

for k in keys:
    print(k)
    print(data[k].head(5))
    print("- - -\n")

ap_2010
      DBN                             SchoolName AP Test Takers   \
0  01M448           UNIVERSITY NEIGHBORHOOD H.S.              39   
1  01M450                 EAST SIDE COMMUNITY HS              19   
2  01M515                    LOWER EASTSIDE PREP              24   
3  01M539         NEW EXPLORATIONS SCI,TECH,MATH             255   
4  02M296  High School of Hospitality Management               s   

  Total Exams Taken Number of Exams with scores 3 4 or 5  
0                49                                   10  
1                21                                    s  
2                26                                   24  
3               377                                  191  
4                 s                                    s  
- - -

class_size
   CSD BOROUGH SCHOOL CODE                SCHOOL NAME GRADE  PROGRAM TYPE  \
0    1       M        M015  P.S. 015 Roberto Clemente     0K       GEN ED   
1    1       M        M015  P.S. 015 Roberto Clemente    

We can make some observations based on the first few rows of each one.

- Each dataset appears to either have a **DBN** column or the information we need to create one. That means we can use a **DBN** column to combine the datasets. 

- First we'll pinpoint matching rows from different datasets by looking for identical **DBN**, then group all of their columns together in a single dataset.

- Some **fields look interesting for mapping** -- particularly Location 1, which contains coordinates inside a larger string.

- Some of the datasets appear to contain multiple rows for each school (because the rows have duplicate **DBN** values). That means we’ll have to do some preprocessing to ensure that each **DBN** is unique within each dataset. 

If we don't do this, we'll run into problems when we combine the datasets, because we might be merging two rows in one data set with one row in another dataset.

Before we proceed with the merge, we should make sure we have all of the data we want to unify. 

We mentioned the survey data earlier (`survey_all.txt` and `survey_d75.txt`), but we didn't read those files in because they're in a slightly more complex format.

In [5]:
import chardet

txt_list = ["survey_all.txt","survey_d75.txt"]

for txt in txt_list:
    with open("schools/{0}".format(txt),"rb") as file:
        print(chardet.detect(file.read()))

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
{'encoding': 'ascii', 'confidence': 1.0, 'language': ''}


In [6]:
! file schools/survey_all.txt

schools/survey_all.txt: ASCII text, with very long lines, with CRLF line terminators


In [7]:
! file schools/survey_d75.txt

schools/survey_d75.txt: ASCII text, with very long lines, with CRLF line terminators


In [8]:
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t",encoding='Windows-1252')
all_survey.head(5)

Unnamed: 0,dbn,bn,schoolname,d75,studentssurveyed,highschool,schooltype,rr_s,rr_t,rr_p,...,s_N_q14e_3,s_N_q14e_4,s_N_q14f_1,s_N_q14f_2,s_N_q14f_3,s_N_q14f_4,s_N_q14g_1,s_N_q14g_2,s_N_q14g_3,s_N_q14g_4
0,01M015,M015,P.S. 015 Roberto Clemente,0,No,0.0,Elementary School,,88,60,...,,,,,,,,,,
1,01M019,M019,P.S. 019 Asher Levy,0,No,0.0,Elementary School,,100,60,...,,,,,,,,,,
2,01M020,M020,P.S. 020 Anna Silver,0,No,0.0,Elementary School,,88,73,...,,,,,,,,,,
3,01M034,M034,P.S. 034 Franklin D. Roosevelt,0,Yes,0.0,Elementary / Middle School,89.0,73,50,...,20.0,16.0,23.0,54.0,33.0,29.0,31.0,46.0,16.0,8.0
4,01M063,M063,P.S. 063 William McKinley,0,No,0.0,Elementary School,,100,60,...,,,,,,,,,,


In [9]:
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t",encoding='ascii')
d75_survey.head(5)

Unnamed: 0,dbn,bn,schoolname,d75,studentssurveyed,highschool,schooltype,rr_s,rr_t,rr_p,...,s_q14_2,s_q14_3,s_q14_4,s_q14_5,s_q14_6,s_q14_7,s_q14_8,s_q14_9,s_q14_10,s_q14_11
0,75K004,K004,P.S. K004,1,Yes,0.0,District 75 Special Education,38.0,90,72,...,29.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,75K036,K036,P.S. 36,1,Yes,,District 75 Special Education,70.0,69,44,...,20.0,27.0,19.0,9.0,2.0,6.0,1.0,2.0,0.0,0.0
2,75K053,K053,P.S. K053,1,Yes,,District 75 Special Education,94.0,97,53,...,14.0,12.0,12.0,10.0,21.0,13.0,11.0,2.0,0.0,0.0
3,75K077,K077,P.S. K077,1,Yes,,District 75 Special Education,95.0,65,55,...,14.0,14.0,7.0,11.0,16.0,10.0,6.0,4.0,7.0,7.0
4,75K140,K140,P.S. K140,1,Yes,0.0,District 75 Special Education,77.0,70,42,...,35.0,34.0,17.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0


- Combine `d75_survey` and `all_survey` into a single dataframe.

In [10]:
survey = pd.concat([all_survey, d75_survey], axis=0)
survey

Unnamed: 0,dbn,bn,schoolname,d75,studentssurveyed,highschool,schooltype,rr_s,rr_t,rr_p,...,s_q14_2,s_q14_3,s_q14_4,s_q14_5,s_q14_6,s_q14_7,s_q14_8,s_q14_9,s_q14_10,s_q14_11
0,01M015,M015,P.S. 015 Roberto Clemente,0,No,0.0,Elementary School,,88,60,...,,,,,,,,,,
1,01M019,M019,P.S. 019 Asher Levy,0,No,0.0,Elementary School,,100,60,...,,,,,,,,,,
2,01M020,M020,P.S. 020 Anna Silver,0,No,0.0,Elementary School,,88,73,...,,,,,,,,,,
3,01M034,M034,P.S. 034 Franklin D. Roosevelt,0,Yes,0.0,Elementary / Middle School,89.0,73,50,...,,,,,,,,,,
4,01M063,M063,P.S. 063 William McKinley,0,No,0.0,Elementary School,,100,60,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51,75X352,X352,The Vida Bogart School for All Children,1,Yes,0.0,District 75 Special Education,90.0,58,48,...,38.0,24.0,21.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
52,75X721,X721,P.S. X721 - Stephen McSweeney School,1,Yes,,District 75 Special Education,84.0,90,48,...,1.0,1.0,9.0,21.0,31.0,15.0,9.0,5.0,5.0,2.0
53,75X723,X723,X723,1,Yes,,District 75 Special Education,77.0,74,20,...,24.0,27.0,11.0,11.0,3.0,5.0,0.0,0.0,0.0,0.0
54,75X754,X754,J. M. Rapport School Career Development,1,Yes,,District 75 Special Education,63.0,93,22,...,0.0,0.0,5.0,15.0,13.0,17.0,18.0,16.0,10.0,6.0


There are two immediate facts that we can see in the data:

- There are over 2000 columns, nearly all of which we don't need. We'll **have to filter** the data to remove the unnecessary ones. Working with fewer columns makes it easier to print the dataframe out and find correlations within it.

- The survey data has a **dbn** column that we'll want to **convert to uppercase (DBN)**. The conversion makes the column name consistent with the other data sets.

First, we'll need to filter the columns to remove the ones we don't need. Luckily, there's a data dictionary at the [original data download location](https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8). 

The dictionary tells us what each column represents. 

Based on our knowledge of the problem and the analysis we're trying to do, we can use the data dictionary to determine which columns to use.

In [11]:
pd.set_option("display.max_columns", None)
path=("schools/Survey_Data_Dictionary.xls")
dictionary = pd.ExcelFile(path)
print(dictionary.sheet_names)

['Sheet1']


In [12]:
dictionary = dictionary.parse('Sheet1')
dictionary

Unnamed: 0,2011 NYC School Survey\nData Dictionary,Unnamed: 1
0,This data dictionary can be used with the scho...,
1,,
2,Field Name,Field Description
3,dbn,School identification code (district borough n...
4,sch_type,"School type (Elementary, Middle, High, etc)"
5,location,School name
6,enrollment,Enrollment size
7,borough,Borough
8,principal,Principal name
9,studentsurvey,Only students in grades 6-12 partipate in the ...


- Based on the dictionary, it looks like these are the relevant columns:

In [13]:
["dbn", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]

['dbn',
 'rr_s',
 'rr_t',
 'rr_p',
 'N_s',
 'N_t',
 'N_p',
 'saf_p_11',
 'com_p_11',
 'eng_p_11',
 'aca_p_11',
 'saf_t_11',
 'com_t_11',
 'eng_t_11',
 'aca_t_11',
 'saf_s_11',
 'com_s_11',
 'eng_s_11',
 'aca_s_11',
 'saf_tot_11',
 'com_tot_11',
 'eng_tot_11',
 'aca_tot_11']

These columns give us aggregate survey data about how parents, teachers, and students feel about school safety, academic performance, and more. It also gives us the DBN, which allows us to uniquely identify the school.

Before we filter columns out, we'll want to copy the data from the dbn column into a new column called DBN. We can copy columns like this:

In [14]:
survey['DBN'] = survey['dbn']

- Filtering columns

In [15]:
survey.shape

(1702, 2774)

In [16]:
survey_fields = ["DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]

In [17]:
survey = survey.loc[:,survey_fields]

In [18]:
data['survey']=survey

In [19]:
data['survey']

Unnamed: 0,DBN,rr_s,rr_t,rr_p,N_s,N_t,N_p,saf_p_11,com_p_11,eng_p_11,aca_p_11,saf_t_11,com_t_11,eng_t_11,aca_t_11,saf_s_11,com_s_11,eng_s_11,aca_s_11,saf_tot_11,com_tot_11,eng_tot_11,aca_tot_11
0,01M015,,88,60,,22.0,90.0,8.5,7.6,7.5,7.8,7.5,7.8,7.6,7.9,,,,,8.0,7.7,7.5,7.9
1,01M019,,100,60,,34.0,161.0,8.4,7.6,7.6,7.8,8.6,8.5,8.9,9.1,,,,,8.5,8.1,8.2,8.4
2,01M020,,88,73,,42.0,367.0,8.9,8.3,8.3,8.6,7.6,6.3,6.8,7.5,,,,,8.2,7.3,7.5,8.0
3,01M034,89.0,73,50,145.0,29.0,151.0,8.8,8.2,8.0,8.5,7.0,6.2,6.8,7.8,6.2,5.9,6.5,7.4,7.3,6.7,7.1,7.9
4,01M063,,100,60,,23.0,90.0,8.7,7.9,8.1,7.9,8.4,7.3,7.8,8.1,,,,,8.5,7.6,7.9,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51,75X352,90.0,58,48,38.0,46.0,160.0,8.9,8.3,7.9,8.2,6.4,5.5,5.7,5.8,6.8,6.0,7.8,7.6,7.4,6.6,7.1,7.2
52,75X721,84.0,90,48,237.0,82.0,239.0,8.6,7.6,7.5,7.7,7.6,6.4,6.7,7.0,7.8,7.2,7.8,7.9,8.0,7.1,7.3,7.6
53,75X723,77.0,74,20,103.0,69.0,74.0,8.4,7.8,7.8,7.8,7.7,7.2,6.7,7.6,6.7,7.2,7.7,7.7,7.6,7.4,7.4,7.7
54,75X754,63.0,93,22,336.0,82.0,124.0,8.3,7.5,7.5,7.8,6.7,6.5,6.6,7.1,6.8,6.6,7.6,7.7,7.2,6.9,7.3,7.5


* * *
### Do we have all the DBN columns in the dataframes?


When we explored all of the datasets, we noticed that some of them, like `class_size` and `hs_directory`, don't have a **DBN** column. `hs_directory` does have a **dbn** column, though, so **we can just rename it**.

However, `class_size` doesn't appear to have the column at all. Here are the first few rows of the data set:

In [20]:
data["class_size"].head(3)

Unnamed: 0,CSD,BOROUGH,SCHOOL CODE,SCHOOL NAME,GRADE,PROGRAM TYPE,CORE SUBJECT (MS CORE and 9-12 ONLY),CORE COURSE (MS CORE and 9-12 ONLY),SERVICE CATEGORY(K-9* ONLY),NUMBER OF STUDENTS / SEATS FILLED,NUMBER OF SECTIONS,AVERAGE CLASS SIZE,SIZE OF SMALLEST CLASS,SIZE OF LARGEST CLASS,DATA SOURCE,SCHOOLWIDE PUPIL-TEACHER RATIO
0,1,M,M015,P.S. 015 Roberto Clemente,0K,GEN ED,-,-,-,19.0,1.0,19.0,19.0,19.0,ATS,
1,1,M,M015,P.S. 015 Roberto Clemente,0K,CTT,-,-,-,21.0,1.0,21.0,21.0,21.0,ATS,
2,1,M,M015,P.S. 015 Roberto Clemente,01,GEN ED,-,-,-,17.0,1.0,17.0,17.0,17.0,ATS,


In [21]:
data["sat_results"].head(5)

Unnamed: 0,DBN,SCHOOL NAME,Num of SAT Test Takers,SAT Critical Reading Avg. Score,SAT Math Avg. Score,SAT Writing Avg. Score
0,01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363
1,01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366
2,01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370
3,01M458,FORSYTH SATELLITE ACADEMY,7,414,401,359
4,01M509,MARTA VALLE HIGH SCHOOL,44,390,433,384


From looking at these rows, we can tell that the **DBN** in the `sat_results` data is just a combination of the **CSD** and **SCHOOL CODE** columns in the `class_size` data. 

The main difference is that the **DBN** have a fill space, so that the **CSD** portion of it always consists of two digits. 

That means we'll need to add a leading **0** to the **CSD**, if the **CSD** is less than two digits long.

- Whenever the **CSD** is less than two digits long, we need to **add a leading 0**. 

We can accomplish this using the `pandas.Series.apply()` method, along with a custom function that:

- Takes in a number.

- Converts the number to a string using the str() function.

- Check the length of the string using the len() function.

- If the string is two digits long, returns the string.

- If the string is one digit long, adds a 0 to the front of the string, then returns it.

- Use the string method zfill() to do this.

Once we've padded the **CSD**, we can use the addition operator (+) to combine the values in the **CSD** and **SCHOOL CODE** columns to create **DBN column**.


In [22]:
data['hs_directory']['DBN'] = data['hs_directory']['dbn']

In [23]:
def pad_csd(num):
    return str(num).zfill(2)

In [24]:
data["class_size"]["padded_csd"]=data["class_size"]['CSD'].apply(pad_csd)
data["class_size"]["padded_csd"].unique()

array(['01', '02', '21', '27', '06', '03', '05', '07', '04', '08', '09',
       '10', '11', '12', '13', '14', '15', '16', '17', '19', '18', '20',
       '22', '23', '24', '25', '26', '28', '29', '30', '31', '32'],
      dtype=object)

In [25]:
data["class_size"]['DBN'] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"] 

In [26]:
data["class_size"]['DBN'].head(5)

0    01M015
1    01M015
2    01M015
3    01M015
4    01M015
Name: DBN, dtype: object

In [27]:
data["class_size"]['DBN'].tail(5)

27606    32K564
27607    32K564
27608    32K564
27609    32K564
27610    32K564
Name: DBN, dtype: object

Before we combine our datasets, let's take some time to calculate variables that are useful in our analysis.

We've already discussed one such variable -- a column that totals up the **SAT** scores for the different sections of the exam. This makes it much easier to correlate scores with demographic factors because we'll be working with a single number, rather than three different ones.

Before we can generate this column, we'll need to convert:

 - `SAT Math Avg. Score`
 
 
 - `SAT Critical Reading Avg. Score`
 
 
 - `SAT Writing Avg. Score` columns in the `sat_results`
 
 convert the dataset from type (string) to data type to a numeric data type.

In [28]:
data['sat_results'][['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']].head(5)

Unnamed: 0,SAT Math Avg. Score,SAT Critical Reading Avg. Score,SAT Writing Avg. Score
0,404,355,363
1,423,383,366
2,402,377,370
3,401,414,359
4,433,390,384


In [29]:
data['sat_results'][['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 3 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   SAT Math Avg. Score              478 non-null    object
 1   SAT Critical Reading Avg. Score  478 non-null    object
 2   SAT Writing Avg. Score           478 non-null    object
dtypes: object(3)
memory usage: 11.3+ KB


In [30]:
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']

for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")

data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]
data['sat_results']['sat_score'].head()

0    1122.0
1    1172.0
2    1149.0
3    1174.0
4    1207.0
Name: sat_score, dtype: float64

* * *

### Let's parse the latitude and longitude coordinates for each school. 

We want to extract the latitude, `40.8276026690005`, and the longitude, `-73.90447525699966`. 

Taken together, latitude and longitude make up a pair of coordinates that allows us to pinpoint any location on Earth.

In [31]:
pd.set_option("display.max_columns", None)

In [32]:
data['hs_directory']['Location 1']

0      883 Classon Avenue\nBrooklyn, NY 11225\n(40.67...
1      1110 Boston Road\nBronx, NY 10456\n(40.8276026...
2      1501 Jerome Avenue\nBronx, NY 10452\n(40.84241...
3      411 Pearl Street\nNew York, NY 10038\n(40.7106...
4      160-20 Goethals Avenue\nJamaica, NY 11432\n(40...
                             ...                        
430    2225 Webster Avenue\nBronx, NY 10457\n(40.8546...
431    925 Astor Avenue\nBronx, NY 10469\n(40.8596983...
432    800 East Gun Hill Road\nBronx, NY 10467\n(40.8...
433    26 Broadway\nNew York, NY 10004\n(40.705234939...
434    149-11 Melbourne Avenue\nFlushing, NY 11367\n(...
Name: Location 1, Length: 435, dtype: object

### Extracting the Latitude

In [33]:
import re

def find_lat(loc):
    coords = re.findall("\(.+\)", loc)
     # coords ['(40.601989336000486, -73.76283432299965)']
             #['(40.83416725000046, -73.90403294799967)']
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)

data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lat"]

0      40.670299
1      40.827603
2      40.842414
3      40.710679
4      40.718810
         ...    
430    40.854647
431    40.859698
432    40.875754
433    40.705235
434    40.734408
Name: lat, Length: 435, dtype: float64

### Extracting the Longitude


In [34]:
def find_long(loc):
    coords = re.findall("\(.+\)", loc)
     # coords ['(40.601989336000486, -73.76283432299965)']
             #['(40.83416725000046, -73.90403294799967)']
    lon = coords[0].split(",")[1].replace("(", "")
    return lon

data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_long)

data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")

In [35]:
data["hs_directory"].head(3)

Unnamed: 0,dbn,school_name,boro,building_code,phone_number,fax_number,grade_span_min,grade_span_max,expgrade_span_min,expgrade_span_max,bus,subway,primary_address_line_1,city,state_code,zip,website,total_students,campus_name,school_type,overview_paragraph,program_highlights,language_classes,advancedplacement_courses,online_ap_courses,online_language_courses,extracurricular_activities,psal_sports_boys,psal_sports_girls,psal_sports_coed,school_sports,partner_cbo,partner_hospital,partner_highered,partner_cultural,partner_nonprofit,partner_corporate,partner_financial,partner_other,addtl_info1,addtl_info2,start_time,end_time,se_services,ell_programs,school_accessibility_description,number_programs,priority01,priority02,priority03,priority04,priority05,priority06,priority07,priority08,priority09,priority10,Location 1,DBN,lat,lon
0,17K548,Brooklyn School for Music & Theatre,Brooklyn,K440,718-230-6250,718-230-6262,9,12,,,"B41, B43, B44-SBS, B45, B48, B49, B69","2, 3, 4, 5, F, S to Botanic Garden ; B, Q to P...",883 Classon Avenue,Brooklyn,NY,11225,Bkmusicntheatre.com,399.0,Prospect Heights Educational Campus,,Brooklyn School for Music & Theatre (BSMT) use...,We offer highly competitive positions in our D...,Spanish,"English Language and Composition, United State...",,,"Variety of clubs: Chess, The Step Team, Fashio...","Baseball, Basketball & JV Basketball, Cross Co...","Basketball, Cross Country, Indoor Track, Outdo...",,,F.Y.R.EZONE (Finding Your Rhythm thru Educatio...,,,"In 2002, Roundabout Theatre was selected by Ne...",One To World‘s Global Classroom connects New Y...,,,,,,8:10 AM,3:00 PM,This school will provide students with disabil...,ESL,Functionally Accessible,1,Priority to Brooklyn students or residents,Then to New York City residents,,,,,,,,,"883 Classon Avenue\nBrooklyn, NY 11225\n(40.67...",17K548,40.670299,
1,09X543,High School for Violin and Dance,Bronx,X400,718-842-0687,718-589-9849,9,12,,,"Bx13, Bx15, Bx17, Bx21, Bx35, Bx4, Bx41, Bx4A,...","2, 5 to Intervale Ave",1110 Boston Road,Bronx,NY,10456,www.hsvd.org,378.0,Morris Educational Campus,,The High School for Violin and Dance (HSVD) is...,Freshmen take both violin and dance; College N...,Spanish,,,,Advancement via Individual Determination (AVID...,"Baseball, Basketball & JV Basketball, Volleyball","Basketball, Softball, Volleyball",,Morris Educational Campus Basketball and Volle...,McGraw Hill - Big Brother Big Sister,,"Hostos Community College, Monroe College, Teac...",Bronx Arts Ensemble,buildOn,Print International,,Bronx Cares,Our students are required to take four years o...,"Student Summer Orientation, Summer Internship ...",8:00 AM,3:00 PM,This school will provide students with disabil...,ESL,Functionally Accessible,1,Priority to Bronx students or residents who at...,Then to New York City residents who attend an ...,Then to Bronx students or residents,Then to New York City residents,,,,,,,"1110 Boston Road\nBronx, NY 10456\n(40.8276026...",09X543,40.827603,
2,09X327,Comprehensive Model School Project M.S. 327,Bronx,X240,718-294-8111,718-294-8109,6,12,,,"Bx1, Bx11, Bx13, Bx18, Bx2, Bx3, Bx32, Bx35, Bx36","4 to Mt Eden Ave ; B, D to 170th St",1501 Jerome Avenue,Bronx,NY,10452,http://schools.nyc.gov/schoolportals/09/X327,543.0,DOE New Settlement Community Campus,,At the Comprehensive Model School Project (CMS...,"After-school and Saturday Tutoring, Advisory, ...",Spanish,"Biology, Chemistry, United States History",,,"Choir, Gaming, Girls Club, Newspaper, Spanish,...",,,,"As we expand, we plan to offer PSAL sports.",New Settlement Community Center,Montefiore Hospital,,,,,,,Dress Code Required: white or baby blue button...,,8:00 AM,4:00 PM,This school will provide students with disabil...,ESL,Functionally Accessible,1,Priority to continuing 8th graders,Then to Bronx students or residents who attend...,Then to New York City residents who attend an ...,Then to Bronx students or residents,Then to New York City residents,,,,,,"1501 Jerome Avenue\nBronx, NY 10452\n(40.84241...",09X327,40.842414,


## Data_Cleaning and Combining_the_Data


. The first thing we'll need to do in preparation for the merge is condense some of the datasets. In the last lesson, we noticed that the values in the DBN column were unique in the sat_results data set. Other data sets like class_size had duplicate DBN values.

**Condense**: means each value in the **DBN** column in every dataset is unique. If not, we'll run into issues when it comes time to combine the datasets.

While the main dataset we want to analyze, `sat_results`, has unique **DBN** values for every high school in New York City, other datasets aren't as clean.

A single row in the `sat_results` dataset may match multiple rows in the `class_size` dataset, this situation creates problems, because we don't know which of the multiple entries in the `class_size` dataset we should combine with the single matching entry in `sat_results`.

Here's a diagram that illustrates the problem:

|||||
|:--|:--|:--|:--|
|sat_results||class_size||
|DBN|...|DBN|...|
|01M022|...|01M022|...|
|05M345|...|01M022|...|
|02M456|...|05M345|...|
|99M520|...|05M345|...|

In the diagram above, we can't combine the rows from both datasets, because there are several cases where multiple rows in `class_size` match a single row in `sat_results`.

To resolve this issue, we'll condense the `class_size`, `graduation` and `demographics` datasets so that each **DBN** is unique.

The first dataset that we'll condense is `class_size`.