In [1]:
! ls -l


total 21388
drwxrwxr-x 3 ion ion     4096 jul  9 17:04  2011_School_Survey
-rw-rw-r-- 1 ion ion   136258 jul 11 10:33  Analyzing_NYC_High_School_Data.ipynb
-rw-rw-r-- 1 ion ion    20055 jul 10 18:24  handle_missing.png
-rw-rw-r-- 1 ion ion        1 jul  9 17:04  md
-rw-rw-r-- 1 ion ion    12344 jul 10 13:00  no_combine.png
-rw-rw-r-- 1 ion ion 10167295 jul 11 10:32 'NYC High School Analyzing and Visualizing the Data.ipynb'
-rw-rw-r-- 1 ion ion    23815 jul 11 10:57  NYC_1.csv
-rw-rw-r-- 1 ion ion  1338703 jul 11 10:59  NYC_2.csv
-rw-rw-r-- 1 ion ion 10159717 jul 11 11:01 'NYC_High_School Combining the Data.ipynb'
-rw-rw-r-- 1 ion ion    10533 jul  9 17:04  padded.png
-rw-rw-r-- 1 ion ion       15 jul  9 17:04  README.md
drwxrwxr-x 2 ion ion     4096 jul  9 17:04  schools


In [2]:
!pwd

/home/ion/Documentos/GitHub/Data-science-projects/03_Data_Cleaning/Analyzing NYC High School Data


# Data Cleaning Walkthrough


Data science projects usually consist of one of two things:

- An **exploration and analysis of a set of data**. One example might involve analyzing donors to political campaigns, creating a plot, and then sharing an analysis of the plot with others.

- An **operational system that generates predictions based on data that updates continually**. An algorithm that pulls in daily stock ticker data and predicts which stock prices rise and fall would be one example.

<br>

This project walks through the first part of a complete data science project: acquiring raw data, exploring and analyzing it, and combining several messy datasets into a single one to make the analysis easier.

The first step is to decide on a topic. There are two ways to find a good one:

- Think about what sectors or angles you're really interested in, then find data sets relating to those sectors.

- Review several datasets and find one that seems interesting enough to explore.

A good place to start might be:

<ul>
<li><a href="https://www.data.gov" target="_blank">Data.gov</a> - A directory of government data downloads</li>
<li><a href="https://reddit.com/r/datasets" target="_blank">/r/datasets</a> - A subreddit that has hundreds of interesting datasets</li>
<li><a href="https://github.com/caesar0301/awesome-public-datasets" target="_blank">Awesome datasets</a> - A list of datasets hosted on GitHub</li>
<li><a href="http://rs.io/100-interesting-data-sets-for-statistics/" target="_blank">rs.io</a> - A great blog post with hundreds of interesting datasets</li>
</ul>

In real-world data science, you may not find an ideal dataset.

Once you have chosen a topic, e.g.:

### New York City public schools


you'll want to choose an angle to investigate that has enough depth to analyze but isn't so complicated that it's difficult to get started. You want to finish the project, and you want your results to be interesting to others.


### Angle to investigate

One of the most controversial issues in the U.S. educational system is the efficacy of standardized tests and whether they're unfair to certain groups. Given our prior knowledge of this topic, investigating the correlations between SAT scores and demographics might be an interesting angle to take. We could correlate SAT scores with factors like race, gender, income, and more.

**The SAT, or Scholastic Aptitude Test**, is an exam that U.S. high school students take before applying to college. Colleges take the test scores into account when deciding who to admit, so it's important to perform well.

The test consists of three sections, each of which has 800 possible points. The combined score is out of 2,400 possible points (while this number has changed a few times, the dataset for our project is based on 2,400 total points). Organizations often rank high schools by their average SAT scores. The scores are also considered a measure of overall school district quality.

We need the SAT scores and the demographic data associated with those scores:

- [2012 SAT Results](https://data.cityofnewyork.us/Education/2012-SAT-Results/f9bf-2cp4)

- [2014 - 2015 DOE High School Directory](https://data.cityofnewyork.us/Education/2014-2015-DOE-High-School-Directory/n3p6-zve2)


<br>

Unfortunately, combining both of the datasets won't give us all of the demographic information we want to use.

The same website has several related datasets covering demographic information and test scores. Here are the links to all of the datasets we'll be using:

<ul>
<li><a href="https://data.cityofnewyork.us/Education/SAT-Results/f9bf-2cp4" target="_blank">SAT scores by school</a> - SAT scores for each high school in New York City</li>
<li><a href="https://data.cityofnewyork.us/Education/School-Attendance-and-Enrollment-Statistics-by-Dis/7z8d-msnt" target="_blank">School attendance</a> - Attendance information for each school in New York City</li>
<li><a href="https://data.cityofnewyork.us/Education/2010-2011-Class-Size-School-level-detail/urz7-pzb3" target="_blank">Class size</a> - Information on class size for each school</li>
<li><a href="https://data.cityofnewyork.us/Education/AP-College-Board-2010-School-Level-Results/itfs-ms3e" target="_blank">AP test results</a> - Advanced Placement (AP) exam results for each high school (passing an optional AP exam in a particular subject can earn a student college credit in that subject)</li>
<li><a href="https://data.cityofnewyork.us/Education/Graduation-Outcomes-Classes-Of-2005-2010-School-Le/vh2h-md7a" target="_blank">Graduation outcomes</a> - The percentage of students who graduated, and other outcome information</li>
<li><a href="https://data.cityofnewyork.us/Education/School-Demographics-and-Accountability-Snapshot-20/ihfw-zy9j" target="_blank">Demographics</a> - Demographic information for each school</li>
<li><a href="https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8" target="_blank">School survey</a> - Surveys of parents, teachers, and students at each school</li>
</ul>


All of these datasets are interrelated. We'll need to combine them into a single dataset before we can find correlations.


The next step to take before moving on to coding is to do a background check. A deep understanding of data helps avoid costly mistakes, such as thinking that a column represents something different from what it does. Background research gives us a better understanding of how to combine and analyze data.

<ul>
<li><a href="https://en.wikipedia.org/wiki/New_York_City" target="_blank">The neighborhoods that make up New York City</a></li>
<li><a href="https://en.wikipedia.org/wiki/SAT" target="_blank">The SAT scoring system</a></li>
<li><a href="https://en.wikipedia.org/wiki/List_of_high_schools_in_New_York_City" target="_blank">List of high schools in New York City</a></li>
<li><a href="https://data.cityofnewyork.us/browse?category=Education" target="_blank">The dataset source</a></li>
</ul>

Knowing the context allows us to know the following things:

- **Only high school students take the SAT**, so we'll want to focus on high schools.

- **New York City is made up of five boroughs**, which are essentially distinct regions.

- New York City schools fall within several different school districts, each of which can contain dozens of schools.

- **Our datasets include several different types of schools. We'll need to clean them** so that we can focus on high schools only.

- **Each school in New York City has a unique code called a DBN or district borough number**.

- Aggregating data by district allows us to use the district mapping data to plot district-by-district differences.
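
As a rough illustration of how a DBN can be taken apart for district-level work, the sketch below splits a code such as `01M292` into its pieces. It assumes the layout that the datasets in this project suggest (two district digits, one borough letter, then the school number). The helper name `split_dbn` and the `BOROUGHS` lookup are hypothetical and exist only for this example.

```python
# Hypothetical helper: split a DBN into district, borough, and school code.
# Assumed layout: two district digits, one borough letter, then the school number.
BOROUGHS = {
    "M": "Manhattan",
    "X": "Bronx",
    "K": "Brooklyn",
    "Q": "Queens",
    "R": "Staten Island",
}

def split_dbn(dbn):
    district = dbn[:2]                          # e.g. "01"
    borough = BOROUGHS.get(dbn[2], "unknown")   # e.g. "M" -> "Manhattan"
    school_code = dbn[2:]                       # e.g. "M292"
    return district, borough, school_code

parts = split_dbn("01M292")
print(parts)
```

Grouping rows by the first element of that tuple is one way to aggregate by district.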

Once we've done our background research, we're ready to read in the data.


To avoid surprises, I prefer to check the nature of each file using two methods:

- The `chardet` library

- The Linux `file` command


In [4]:
! ls schools/

ap_2010.csv	  graduation.csv    survey_all.txt
class_size.csv	  hs_directory.csv  survey_d75.txt
demographics.csv  sat_results.csv   Survey_Data_Dictionary.xls


In [5]:
import chardet

datafiles = ['ap_2010.csv',
         'graduation.csv',
         'survey_all.txt',
         'class_size.csv', 
         'hs_directory.csv', 
         'survey_d75.txt',
         'demographics.csv',
         'sat_results.csv', 
         'Survey_Data_Dictionary.xls']

In [6]:
'''
datafiles_NonUTF = []

for f in datafiles:
    with open("schools/{}".format(f), 'rb') as file:
        result = chardet.detect(file.read())  # read and analyze each file only once
    print("file name: {nme} format name: {frt}".format(nme=f, frt=result))
    # detect() returns a dict, so compare its 'encoding' value
    if result['encoding'] not in ('utf-8', 'ascii'):
        datafiles_NonUTF.append(f)
'''

- chardet takes a long time to analyze each file, so the cell above is disabled. This is the output it produced:

    file name: ap_2010.csv format name: {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
    file name: graduation.csv format name: {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
    file name: survey_all.txt format name: {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
    file name: class_size.csv format name: {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
    file name: hs_directory.csv format name: {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
    file name: survey_d75.txt format name: {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
    file name: demographics.csv format name: {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
    file name: sat_results.csv format name: {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
    file name: Survey_Data_Dictionary.xls format name: {'encoding': 'Windows-1254', 'confidence': 0.3643929081009458, 'language': 'Turkish'}

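- Two files (`survey_all.txt` and `sat_results.csv`) are detected as **'Windows-1252'** with a confidence of only 0.73, and the **'Windows-1254'** guess for `Survey_Data_Dictionary.xls` has very low confidence (0.36), which is unsurprising for a binary Excel file. So for now we will work with the CSV files, most of which are detected as **ASCII** or **UTF-8**.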

In [7]:
! file -i schools/ap_2010.csv

schools/ap_2010.csv: application/csv; charset=us-ascii


In [8]:
! file -i schools/graduation.csv

schools/graduation.csv: application/octet-stream; charset=binary


In [9]:
! file -i schools/class_size.csv

schools/class_size.csv: application/csv; charset=us-ascii


In [10]:
! file -i schools/hs_directory.csv

schools/hs_directory.csv: application/csv; charset=utf-8


In [11]:
! file -i schools/demographics.csv

schools/demographics.csv: application/csv; charset=us-ascii


In [12]:
! file -i schools/sat_results.csv

schools/sat_results.csv: application/csv; charset=utf-8


|**file name:**|`ap_2010.csv`|`graduation.csv`|`class_size.csv`|`hs_directory.csv`|`demographics.csv`|`sat_results.csv`|
|:|:|:|:|:|:|:|
|**chardet lib:**|'ascii'|'utf-8'|'utf-8'|'utf-8'|'ascii'|**'Windows-1252'**|
|**file -i:**|charset=us-ascii|charset=binary|charset=us-ascii|charset=utf-8|charset=us-ascii|charset=utf-8|


- For `sat_results.csv`, the chardet result (Windows-1252) does not match the one reported by the `file` command (utf-8). However, chardet reports a confidence of only 0.73, so let's see whether the file can be loaded.
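
When two tools disagree like this, one cheap cross-check is to try decoding the raw bytes strictly with a few candidate encodings and see which ones succeed. The sketch below uses only the standard library. The function name `first_working_encoding` and the candidate order are assumptions made for illustration, and a successful decode only shows that the bytes are valid in that encoding, not that it is the encoding the author used.

```python
def first_working_encoding(raw, candidates=("ascii", "utf-8", "cp1252")):
    """Return the first candidate that decodes `raw` without errors, or None."""
    for name in candidates:
        try:
            raw.decode(name)  # strict decoding raises on invalid byte sequences
            return name
        except UnicodeDecodeError:
            continue
    return None

# Small in-memory samples instead of the real files
samples = {
    "plain": b"DBN,SCHOOL NAME",
    "utf8": "CAFÉ".encode("utf-8"),
    "cp1252": "CAFÉ".encode("cp1252"),
}
results = {label: first_working_encoding(raw) for label, raw in samples.items()}
print(results)
```

For a real file, the same function could be applied to the bytes returned by `open(path, "rb").read()`.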

In [13]:
import pandas as pd

In [14]:
datafiles = ['ap_2010.csv',
         'graduation.csv',
         'class_size.csv', 
         'hs_directory.csv', 
         'demographics.csv',
         'sat_results.csv']

In [15]:
data = {}

for f in datafiles:
    file = f.replace(".csv","")
    if ".txt" not in file:                            # skip .txt files to avoid loading errors
        data[file]= pd.read_csv("schools/{0}".format(f))
data

{'ap_2010':         DBN                                         SchoolName  \
 0    01M448                       UNIVERSITY NEIGHBORHOOD H.S.   
 1    01M450                             EAST SIDE COMMUNITY HS   
 2    01M515                                LOWER EASTSIDE PREP   
 3    01M539                     NEW EXPLORATIONS SCI,TECH,MATH   
 4    02M296              High School of Hospitality Management   
 ..      ...                                                ...   
 253  31R605                         STATEN ISLAND TECHNICAL HS   
 254  32K545                      EBC-HS FOR PUB SERVICE (BUSH)   
 255  32K552                          Academy of Urban Planning   
 256  32K554               All City Leadership Secondary School   
 257  32K556  Bushwick Leaders High School for Academic Exce...   
 
     AP Test Takers  Total Exams Taken Number of Exams with scores 3 4 or 5  
 0                39                49                                   10  
 1                19       

Of all the datasets, the most important is `sat_results`, because it contains the SAT scores for the high schools of New York City. It will be our main dataset: we will correlate its information with the other datasets.

In [16]:
print(data['sat_results'].head(5))

      DBN                                    SCHOOL NAME  \
0  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES   
1  01M448            UNIVERSITY NEIGHBORHOOD HIGH SCHOOL   
2  01M450                     EAST SIDE COMMUNITY SCHOOL   
3  01M458                      FORSYTH SATELLITE ACADEMY   
4  01M509                        MARTA VALLE HIGH SCHOOL   

  Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score  \
0                     29                             355                 404   
1                     91                             383                 423   
2                     70                             377                 402   
3                      7                             414                 401   
4                     44                             390                 433   

  SAT Writing Avg. Score  
0                    363  
1                    366  
2                    370  
3                    359  
4                    38


<div>
<p>We can make a few observations based on this output:</p>
<ul>
<li>The <code>DBN</code> appears to be a unique ID for each school.</li>
<li>We can tell from the first few rows of names that we only have data about high schools.  </li>
<li>There's only a single row for each high school, so each <code>DBN</code> is unique in the SAT data.  </li>
<li>We may eventually want to combine the three columns that contain SAT scores -- <code>SAT Critical Reading Avg. Score</code>, <code>SAT Math Avg. Score</code>, and <code>SAT Writing Avg. Score</code> -- into a single column to make the scores easier to analyze.</li>
</ul>
<p>Given these observations, let's explore the other datasets to see if we can gain any insight into how to combine them.</p></div>

In [17]:
for idx in data.keys():
    print("dataset: {ndt}".format(ndt=idx))
    print("\n")
    print("{dt}".format(dt=data[idx].head(5)))

dataset: ap_2010


      DBN                             SchoolName AP Test Takers   \
0  01M448           UNIVERSITY NEIGHBORHOOD H.S.              39   
1  01M450                 EAST SIDE COMMUNITY HS              19   
2  01M515                    LOWER EASTSIDE PREP              24   
3  01M539         NEW EXPLORATIONS SCI,TECH,MATH             255   
4  02M296  High School of Hospitality Management               s   

  Total Exams Taken Number of Exams with scores 3 4 or 5  
0                49                                   10  
1                21                                    s  
2                26                                   24  
3               377                                  191  
4                 s                                    s  
dataset: graduation


    Demographic     DBN                            School Name    Cohort  \
0  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL      2003   
1  Total Cohort  01M292  HENRY STREET SCHOOL

- Observing the first rows of the dataframes gives us an idea of the elements the datasets have in common:

<ul>
<li>Each dataset appears to either have a <code>DBN</code> column or the information we need to create one.  That means we can use a <code>DBN</code> column to combine the datasets.  First we'll pinpoint matching rows from different datasets by looking for identical <code>DBN</code>s, then group all of their columns together in a single dataset.</li>
<li>Some fields look interesting for mapping -- particularly <code>Location 1</code>, which contains coordinates inside a larger string.</li>
<li>Some of the datasets appear to contain multiple rows for each school (because the rows have duplicate <code>DBN</code> values).  That means we’ll have to do some preprocessing to ensure that each <code>DBN</code> is unique within each dataset.  If we don't do this, we'll run into problems when we combine the datasets, because we might be merging two rows in one data set with one row in another dataset.</li>
</ul>
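
To make the duplicate-`DBN` problem concrete, here is a minimal sketch on made-up toy data. The column name `avg_class_size` and its values are hypothetical, and averaging per school is only one possible way to condense the rows, not necessarily the approach this project ends up using.

```python
import pandas as pd

# Toy data: `avg_class_size` is a hypothetical column with made-up values.
sat = pd.DataFrame({"DBN": ["01M292", "01M448"], "sat_score": [1122, 1172]})
class_size = pd.DataFrame({
    "DBN": ["01M292", "01M292", "01M448"],
    "avg_class_size": [22.0, 26.0, 24.0],
})

# Check whether each DBN appears only once in each dataset.
print(sat["DBN"].is_unique, class_size["DBN"].is_unique)

# Merging on a key with duplicates repeats the SAT row once per matching row.
merged_raw = sat.merge(class_size, on="DBN", how="left")

# Condensing to one row per DBN first keeps the merge one-to-one.
per_school = class_size.groupby("DBN", as_index=False)["avg_class_size"].mean()
merged_clean = sat.merge(per_school, on="DBN", how="left")

print(len(merged_raw), len(merged_clean))
```

The raw merge ends up with more rows than there are schools, which would distort any correlation computed afterwards.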

Before we proceed with the merge, we should make sure we have all of the data we want to unify. We mentioned the survey data earlier (`survey_all.txt` and `survey_d75.txt`), but we didn't read those files in because they're in a slightly more complex format.

The files are **tab delimited** and **encoded with Windows-1252 encoding**. An encoding defines how a computer stores the contents of a file in binary.
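
A small standard-library sketch of both ideas follows: the same text stored under two encodings, and a tiny in-memory tab-delimited sample being decoded and parsed. The sample text and variable names are illustrative assumptions. The real survey files are read with pandas.

```python
import csv
import io

# The same character becomes different bytes under different encodings.
text = "Café"
as_cp1252 = text.encode("cp1252")
as_utf8 = text.encode("utf-8")
print(as_cp1252, as_utf8)

# Decoding with the wrong encoding can silently produce garbled text.
garbled = as_utf8.decode("cp1252")
print(garbled)

# A one-row tab-delimited sample, decoded and then parsed with the csv module.
raw = "dbn\tschoolname\n01M015\tP.S. 015 Roberto Clemente\n".encode("cp1252")
rows = list(csv.DictReader(io.StringIO(raw.decode("cp1252")), delimiter="\t"))
print(rows[0]["dbn"], rows[0]["schoolname"])
```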

In [18]:
! file  schools/survey_all.txt

schools/survey_all.txt: ASCII text, with very long lines


In [19]:
! file  schools/survey_d75.txt

schools/survey_d75.txt: ASCII text, with very long lines


In [20]:
survey_all = pd.read_csv("schools/survey_all.txt",encoding='Windows-1252',delimiter='\t')

In [21]:
survey_all.head(5)

Unnamed: 0,dbn,bn,schoolname,d75,studentssurveyed,highschool,schooltype,rr_s,rr_t,rr_p,...,s_N_q14e_3,s_N_q14e_4,s_N_q14f_1,s_N_q14f_2,s_N_q14f_3,s_N_q14f_4,s_N_q14g_1,s_N_q14g_2,s_N_q14g_3,s_N_q14g_4
0,01M015,M015,P.S. 015 Roberto Clemente,0,No,0.0,Elementary School,,88,60,...,,,,,,,,,,
1,01M019,M019,P.S. 019 Asher Levy,0,No,0.0,Elementary School,,100,60,...,,,,,,,,,,
2,01M020,M020,P.S. 020 Anna Silver,0,No,0.0,Elementary School,,88,73,...,,,,,,,,,,
3,01M034,M034,P.S. 034 Franklin D. Roosevelt,0,Yes,0.0,Elementary / Middle School,89.0,73,50,...,20.0,16.0,23.0,54.0,33.0,29.0,31.0,46.0,16.0,8.0
4,01M063,M063,P.S. 063 William McKinley,0,No,0.0,Elementary School,,100,60,...,,,,,,,,,,


In [22]:
survey_d75 = pd.read_csv("schools/survey_d75.txt",encoding='Windows-1252',delimiter='\t')

In [23]:
survey_d75.head(5)

Unnamed: 0,dbn,bn,schoolname,d75,studentssurveyed,highschool,schooltype,rr_s,rr_t,rr_p,...,s_q14_2,s_q14_3,s_q14_4,s_q14_5,s_q14_6,s_q14_7,s_q14_8,s_q14_9,s_q14_10,s_q14_11
0,75K004,K004,P.S. K004,1,Yes,0.0,District 75 Special Education,38.0,90,72,...,29.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,75K036,K036,P.S. 36,1,Yes,,District 75 Special Education,70.0,69,44,...,20.0,27.0,19.0,9.0,2.0,6.0,1.0,2.0,0.0,0.0
2,75K053,K053,P.S. K053,1,Yes,,District 75 Special Education,94.0,97,53,...,14.0,12.0,12.0,10.0,21.0,13.0,11.0,2.0,0.0,0.0
3,75K077,K077,P.S. K077,1,Yes,,District 75 Special Education,95.0,65,55,...,14.0,14.0,7.0,11.0,16.0,10.0,6.0,4.0,7.0,7.0
4,75K140,K140,P.S. K140,1,Yes,0.0,District 75 Special Education,77.0,70,42,...,35.0,34.0,17.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0


- Let's concatenate them.

In [24]:
survey = pd.concat([survey_all,survey_d75],axis=0)

In [25]:
print(survey.head(5))

      dbn    bn                      schoolname  d75 studentssurveyed  \
0  01M015  M015       P.S. 015 Roberto Clemente    0               No   
1  01M019  M019             P.S. 019 Asher Levy    0               No   
2  01M020  M020            P.S. 020 Anna Silver    0               No   
3  01M034  M034  P.S. 034 Franklin D. Roosevelt    0              Yes   
4  01M063  M063       P.S. 063 William McKinley    0               No   

   highschool                  schooltype  rr_s  rr_t  rr_p  ...  s_q14_2  \
0         0.0           Elementary School   NaN    88    60  ...      NaN   
1         0.0           Elementary School   NaN   100    60  ...      NaN   
2         0.0           Elementary School   NaN    88    73  ...      NaN   
3         0.0  Elementary / Middle School  89.0    73    50  ...      NaN   
4         0.0           Elementary School   NaN   100    60  ...      NaN   

   s_q14_3  s_q14_4  s_q14_5  s_q14_6  s_q14_7  s_q14_8  s_q14_9  s_q14_10  \
0      NaN      NaN 



There are two immediate facts that we can see in the data:

<ul>
<li>There are over <code>2000</code> columns, nearly all of which we don't need.  We'll have to filter the data to remove the unnecessary ones.  Working with fewer columns makes it easier to print the dataframe out and find correlations within it.</li>
    
<br>
<li>The survey data has a <code>dbn</code> column that we'll want to convert to uppercase (<code>DBN</code>).  The conversion makes the column name consistent with the other data sets.</li>
</ul>

Therefore, we need to filter the columns, and for this we will use the data dictionary.

In [26]:
datadictionary = pd.read_excel(r"schools/Survey_Data_Dictionary.xls",index_col=0)
print(datadictionary)

                                                                                           Unnamed: 1
2011 NYC School Survey\nData Dictionary                                                              
This data dictionary can be used with the schoo...                                                NaN
NaN                                                                                               NaN
Field Name                                                                          Field Description
dbn                                                 School identification code (district borough n...
sch_type                                                  School type (Elementary, Middle, High, etc)
location                                                                                  School name
enrollment                                                                            Enrollment size
borough                                                                           

Based on the dictionary, it looks like these are the relevant columns:

    ["dbn", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]


Before we filter columns out, we'll want to copy the data from the `dbn` column into a new column called `DBN`. We can copy columns like this:

In [27]:
survey["DBN"] = survey["dbn"]

In [28]:
survey_fields=["DBN","rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]

In [29]:
survey.loc[:,survey_fields]

Unnamed: 0,DBN,rr_s,rr_t,rr_p,N_s,N_t,N_p,saf_p_11,com_p_11,eng_p_11,...,eng_t_11,aca_t_11,saf_s_11,com_s_11,eng_s_11,aca_s_11,saf_tot_11,com_tot_11,eng_tot_11,aca_tot_11
0,01M015,,88,60,,22.0,90.0,8.5,7.6,7.5,...,7.6,7.9,,,,,8.0,7.7,7.5,7.9
1,01M019,,100,60,,34.0,161.0,8.4,7.6,7.6,...,8.9,9.1,,,,,8.5,8.1,8.2,8.4
2,01M020,,88,73,,42.0,367.0,8.9,8.3,8.3,...,6.8,7.5,,,,,8.2,7.3,7.5,8.0
3,01M034,89.0,73,50,145.0,29.0,151.0,8.8,8.2,8.0,...,6.8,7.8,6.2,5.9,6.5,7.4,7.3,6.7,7.1,7.9
4,01M063,,100,60,,23.0,90.0,8.7,7.9,8.1,...,7.8,8.1,,,,,8.5,7.6,7.9,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51,75X352,90.0,58,48,38.0,46.0,160.0,8.9,8.3,7.9,...,5.7,5.8,6.8,6.0,7.8,7.6,7.4,6.6,7.1,7.2
52,75X721,84.0,90,48,237.0,82.0,239.0,8.6,7.6,7.5,...,6.7,7.0,7.8,7.2,7.8,7.9,8.0,7.1,7.3,7.6
53,75X723,77.0,74,20,103.0,69.0,74.0,8.4,7.8,7.8,...,6.7,7.6,6.7,7.2,7.7,7.7,7.6,7.4,7.4,7.7
54,75X754,63.0,93,22,336.0,82.0,124.0,8.3,7.5,7.5,...,6.6,7.1,6.8,6.6,7.6,7.7,7.2,6.9,7.3,7.5


<div><p>When we explored all of the datasets, we noticed that some of them, like <code>class_size</code> and <code>hs_directory</code>, don't have a <code>DBN</code> column.  <code>hs_directory</code> does have a <code>dbn</code> column, though, so we can just rename it.</p>
<p>However, <code>class_size</code> doesn't appear to have the column at all.  Here are the first few rows of the data set:</p>
</div>

    CSD BOROUGH SCHOOL CODE                SCHOOL NAME GRADE  PROGRAM TYPE  \
    0    1       M        M015  P.S. 015 Roberto Clemente     0K       GEN ED
    1    1       M        M015  P.S. 015 Roberto Clemente     0K          CTT
    2    1       M        M015  P.S. 015 Roberto Clemente     01       GEN ED
    3    1       M        M015  P.S. 015 Roberto Clemente     01          CTT
    4    1       M        M015  P.S. 015 Roberto Clemente     02       GEN ED



<div>
<p>Here are the first few rows of the <code>sat_results</code> data, which does have a <code>DBN</code> column:</p>
</div>

    DBN                                    SCHOOL NAME  \
    0  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES
    1  01M448            UNIVERSITY NEIGHBORHOOD HIGH SCHOOL
    2  01M450                     EAST SIDE COMMUNITY SCHOOL
    3  01M458                      FORSYTH SATELLITE ACADEMY
    4  01M509                        MARTA VALLE HIGH SCHOOL
    

<p>From looking at these rows, we can tell that the <code>DBN</code> in the <code>sat_results</code> data is just a combination of the <code>CSD</code> and <code>SCHOOL CODE</code> columns in the <code>class_size</code> data.  The main difference is that the <code>DBN</code> is padded, so that the <code>CSD</code> portion of it always consists of two digits. That means we'll need to add a leading <code>0</code> to the <code>CSD</code> if the <code>CSD</code> is less than two digits long.  Here's a diagram illustrating what we need to do:</p>

|CSD|Padded CSD|
|:|:|
|1|01|
|19|19|
|2|02|
|99|99|

- Whenever the **CSD** is less than two digits long, we need to add a leading 0

We can do this with the `pandas.Series.apply()` method and a custom function that:

 - Takes in a number.

 - Converts the number to a string using the `str()` function.

 - Checks the length of the string using the `len()` function.

 - If the string is **two digits long**, returns the string.

 - If the string is **one digit long**, adds a 0 to the front of the string (the string method `zfill()` can do this), then returns it.
 
 <p>Once we've padded the <code>CSD</code>, we can use the addition operator (<code>+</code>) to combine the values in the <code>CSD</code> and <code>SCHOOL CODE</code> columns.  Here's an example of how we would do this:</p>
 
           dataframe["new_column"] = dataframe["column_one"] + dataframe["column_two"]
    
This is the basic concept:


![padded](padded.png)
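
Before applying this to the whole column, the idea can be tried on plain Python values. This is a minimal sketch in which the helper name `pad_csd` is hypothetical. It relies on `str.zfill(2)`, which pads on the left with zeros up to a width of two and leaves longer strings untouched.

```python
def pad_csd(csd):
    # Convert to text, then left-pad with "0" up to two characters.
    return str(csd).zfill(2)

padded = [pad_csd(c) for c in (1, 19, 2, 99)]
print(padded)

# Combining the padded CSD with a school code gives a DBN-style identifier.
dbn = pad_csd(1) + "M015"
print(dbn)
```

For integer input, a format specification such as `format(csd, "02d")` is an equivalent alternative.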

In [30]:
data['class_size'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27611 entries, 0 to 27610
Data columns (total 16 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   CSD                                   27611 non-null  int64  
 1   BOROUGH                               27611 non-null  object 
 2   SCHOOL CODE                           27611 non-null  object 
 3   SCHOOL NAME                           27611 non-null  object 
 4   GRADE                                 26127 non-null  object 
 5   PROGRAM TYPE                          26127 non-null  object 
 6   CORE SUBJECT (MS CORE and 9-12 ONLY)  26127 non-null  object 
 7   CORE COURSE (MS CORE and 9-12 ONLY)   26127 non-null  object 
 8   SERVICE CATEGORY(K-9* ONLY)           26127 non-null  object 
 9   NUMBER OF STUDENTS / SEATS FILLED     26127 non-null  float64
 10  NUMBER OF SECTIONS                    26127 non-null  float64
 11  AVERAGE CLASS S

In [31]:
def padding(serie):
    # zfill(2) left-pads one-digit values with a 0 and leaves longer strings unchanged
    return str(serie).zfill(2)

In [32]:
data['class_size']['Padded CSD'] = data['class_size']['CSD'].apply(padding)

In [33]:
data['class_size']['Padded CSD']

0        01
1        01
2        01
3        01
4        01
         ..
27606    32
27607    32
27608    32
27609    32
27610    32
Name: Padded CSD, Length: 27611, dtype: object

In [34]:
data['class_size']['DBN'] = data['class_size']['Padded CSD'] + data['class_size']['SCHOOL CODE']
data['class_size']['DBN']

0        01M015
1        01M015
2        01M015
3        01M015
4        01M015
          ...  
27606    32K564
27607    32K564
27608    32K564
27609    32K564
27610    32K564
Name: DBN, Length: 27611, dtype: object

* * *

Let's take some time to calculate variables that are useful in our analysis. We've already discussed one such variable: a column that totals up the SAT scores for the different sections of the exam.

This makes it much easier to correlate scores with demographic factors because we'll be working with a single number, rather than three different ones.

In [35]:
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']

In [36]:
data['sat_results'].loc[:,cols]

Unnamed: 0,SAT Math Avg. Score,SAT Critical Reading Avg. Score,SAT Writing Avg. Score
0,404,355,363
1,423,383,366
2,402,377,370
3,401,414,359
4,433,390,384
...,...,...,...
473,s,s,s
474,s,s,s
475,s,s,s
476,400,496,426


In [37]:
for c in cols:
    data['sat_results'][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
    
data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]

print(data['sat_results']['sat_score'].head())

0    1122.0
1    1172.0
2    1149.0
3    1174.0
4    1207.0
Name: sat_score, dtype: float64


In [38]:
data['sat_results'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 7 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   DBN                              478 non-null    object 
 1   SCHOOL NAME                      478 non-null    object 
 2   Num of SAT Test Takers           478 non-null    object 
 3   SAT Critical Reading Avg. Score  421 non-null    float64
 4   SAT Math Avg. Score              421 non-null    float64
 5   SAT Writing Avg. Score           421 non-null    float64
 6   sat_score                        421 non-null    float64
dtypes: float64(4), object(3)
memory usage: 26.3+ KB


* * *

Next, we'll want to parse the latitude and longitude coordinates for each school. This enables us to map the schools and uncover any geographic patterns in the data. The coordinates are currently in the text field `Location 1` in the `hs_directory` dataset.

In [39]:
data['hs_directory']['Location 1']

0      883 Classon Avenue\nBrooklyn, NY 11225\n(40.67...
1      1110 Boston Road\nBronx, NY 10456\n(40.8276026...
2      1501 Jerome Avenue\nBronx, NY 10452\n(40.84241...
3      411 Pearl Street\nNew York, NY 10038\n(40.7106...
4      160-20 Goethals Avenue\nJamaica, NY 11432\n(40...
                             ...                        
430    2225 Webster Avenue\nBronx, NY 10457\n(40.8546...
431    925 Astor Avenue\nBronx, NY 10469\n(40.8596983...
432    800 East Gun Hill Road\nBronx, NY 10467\n(40.8...
433    26 Broadway\nNew York, NY 10004\n(40.705234939...
434    149-11 Melbourne Avenue\nFlushing, NY 11367\n(...
Name: Location 1, Length: 435, dtype: object

We want to extract the coordinates, which are in parentheses at the end of the field. Here's an example:

        1110 Boston Road\nBronx, NY 10456\n(40.8276026690005, -73.90447525699966)
        
We can do the extraction with a regular expression. The following expression pulls out everything inside the parentheses:

        import re
        re.findall(r"\(.+\)", "1110 Boston Road\nBronx, NY 10456\n(40.8276026690005, -73.90447525699966)")
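
As a quick check of what that expression returns, the sketch below runs it on the example string. The pattern is written as a raw string so the backslashes reach the regex engine unchanged; the variable names `loc` and `matches` are illustrative.

```python
import re

loc = "1110 Boston Road\nBronx, NY 10456\n(40.8276026690005, -73.90447525699966)"

# findall returns a list of every match; "." does not match "\n" by default,
# so the match stays on the last line of the address
matches = re.findall(r"\(.+\)", loc)
print(matches)  # ['(40.8276026690005, -73.90447525699966)']
```

The result is a list containing one string, parentheses included, which is why the next step needs some string manipulation to isolate each coordinate.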

So let's build a function that:

- Takes in a string

- Uses the regular expression above to extract the coordinates

- Uses string manipulation functions to pull out the latitude

- Returns the latitude

In [40]:
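import re

def find_lat(loc):
    # Grab the "(lat, lon)" text at the end of the field
    coords = re.findall(r"\(.+\)", loc)
    # Keep the part before the comma and drop the opening parenthesis
    lat = coords[0].split(",")[0].replace("(", "")
    return lat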

data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)

print(data["hs_directory"].head())

      dbn                                        school_name       boro  \
0  17K548                Brooklyn School for Music & Theatre   Brooklyn   
1  09X543                   High School for Violin and Dance      Bronx   
2  09X327        Comprehensive Model School Project M.S. 327      Bronx   
3  02M280     Manhattan Early College School for Advertising  Manhattan   
4  28Q680  Queens Gateway to Health Sciences Secondary Sc...     Queens   

  building_code    phone_number    fax_number grade_span_min  grade_span_max  \
0          K440    718-230-6250  718-230-6262              9              12   
1          X400    718-842-0687  718-589-9849              9              12   
2          X240    718-294-8111  718-294-8109              6              12   
3          M520  718-935-3477             NaN              9              10   
4          Q695    718-969-3155  718-969-3552              6              12   

  expgrade_span_min  expgrade_span_max  ...  \
0               NaN  

In the previous step, we parsed the latitude from the `Location 1` column. Now we need to do the same for the longitude.

Once we have both coordinates, we'll need to convert them to numeric values. We can use the `pandas.to_numeric()` function for the conversion.

In [41]:
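import re

def find_lon(loc):
    # Grab the "(lat, lon)" text at the end of the field
    coords = re.findall(r"\(.+\)", loc)
    # Keep the part after the comma, then drop the closing parenthesis and spaces
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon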

data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)

data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")


print(data["hs_directory"].head())

      dbn                                        school_name       boro  \
0  17K548                Brooklyn School for Music & Theatre   Brooklyn   
1  09X543                   High School for Violin and Dance      Bronx   
2  09X327        Comprehensive Model School Project M.S. 327      Bronx   
3  02M280     Manhattan Early College School for Advertising  Manhattan   
4  28Q680  Queens Gateway to Health Sciences Secondary Sc...     Queens   

  building_code    phone_number    fax_number grade_span_min  grade_span_max  \
0          K440    718-230-6250  718-230-6262              9              12   
1          X400    718-842-0687  718-589-9849              9              12   
2          X240    718-294-8111  718-294-8109              6              12   
3          M520  718-935-3477             NaN              9              10   
4          Q695    718-969-3155  718-969-3552              6              12   

  expgrade_span_min  expgrade_span_max  ...  \
0               NaN  
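
As an alternative to two separate `apply` calls, both coordinates could be pulled out in one pass with `Series.str.extract`. This is a sketch on a made-up two-row Series, not the approach used in this notebook. The pattern, the second sample address, and the names `locations` and `coords` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample in the same shape as the "Location 1" field
locations = pd.Series([
    "1110 Boston Road\nBronx, NY 10456\n(40.8276026690005, -73.90447525699966)",
    "1 Example Street\nBrooklyn, NY 11201\n(40.7, -73.9)",
])

# Two named groups capture the numbers on either side of the comma
pattern = r"\((?P<lat>-?\d+\.?\d*),\s*(?P<lon>-?\d+\.?\d*)\)"
coords = locations.str.extract(pattern)

# The captured groups are strings, so convert them to floats
coords = coords.apply(pd.to_numeric, errors="coerce")
print(coords)
```

The named groups become the column names of the result, so `coords` could be assigned back to the `lat` and `lon` columns directly. Rows that do not match the pattern come back as `NaN` rather than raising an `IndexError`.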

In [42]:
data["hs_directory"].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 435 entries, 0 to 434
Data columns (total 60 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   dbn                               435 non-null    object 
 1   school_name                       435 non-null    object 
 2   boro                              435 non-null    object 
 3   building_code                     435 non-null    object 
 4   phone_number                      435 non-null    object 
 5   fax_number                        423 non-null    object 
 6   grade_span_min                    435 non-null    object 
 7   grade_span_max                    435 non-null    int64  
 8   expgrade_span_min                 33 non-null     object 
 9   expgrade_span_max                 33 non-null     float64
 10  bus                               434 non-null    object 
 11  subway                            358 non-null    object 
 12  primary_

In [43]:
df = pd.DataFrame([data],index=[0])

In [44]:
df.to_csv('NYC_1.csv')
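
Note that `data` is a dictionary of DataFrames, so wrapping it in a single-row DataFrame stores each dataset as one cell of text. If the goal is to reload the cleaned tables later, one option is to write each DataFrame to its own CSV file. The sketch below assumes that goal. The stand-in `data` dictionary, the temporary output directory, and the `NYC_1_<name>.csv` naming scheme are all hypothetical.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the notebook's `data` dictionary (hypothetical values)
data = {
    "sat_results": pd.DataFrame({"DBN": ["00X000"], "sat_score": [1122.0]}),
    "hs_directory": pd.DataFrame({"dbn": ["00X000"], "lat": [40.7], "lon": [-73.9]}),
}

out_dir = tempfile.mkdtemp()  # hypothetical output location

# One CSV per dataset, named after its dictionary key
for name, frame in data.items():
    frame.to_csv(os.path.join(out_dir, f"NYC_1_{name}.csv"), index=False)

# Reading a file back restores a regular table
restored = pd.read_csv(os.path.join(out_dir, "NYC_1_sat_results.csv"))
print(restored)
```

Passing `index=False` keeps the row index out of the file, which avoids an extra `Unnamed: 0` column when the CSV is read back.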