## Creating location_table_dimension

I will use each school's DBN (District Borough Number) to extract its district number, borough, and school identifier. A DBN such as `02M475` breaks down as follows:

| Part           | Meaning             | Example |
| -------------- | ------------------- | ------- |
| First 2 digits | **District Number** | `02`    |
| Letter         | **Borough Code**    | `M`     |
| Last 3 digits  | School identifier   | `475`   |

***

| Code | Borough                  |
| ---- | ------------------------ |
| M    | Manhattan                |
| X    | Bronx                    |
| K    | Brooklyn (Kings)         |
| Q    | Queens                   |
| R    | Staten Island (Richmond) |
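
The layout in the two tables above can be captured in a small parser. This is only a sketch: `parse_dbn`, `DBN_PATTERN`, and `BOROUGHS` are hypothetical names that do not appear in the notebook, and it assumes every DBN is exactly two digits, one of the five borough letters, and three digits.

```python
import re

# Hypothetical helper (not part of the notebook): splits a DBN into its parts.
# Assumed layout: 2 digits, 1 borough letter, 3 digits.
DBN_PATTERN = re.compile(r"^(\d{2})([MXKQR])(\d{3})$")

BOROUGHS = {
    "M": "Manhattan",
    "X": "Bronx",
    "K": "Brooklyn",
    "Q": "Queens",
    "R": "Staten Island",
}


def parse_dbn(dbn):
    """Return (district, borough, school_id) as strings, or None if malformed."""
    match = DBN_PATTERN.match(dbn.strip().upper())
    if match is None:
        return None
    district, code, school_id = match.groups()
    return district, BOROUGHS[code], school_id


print(parse_dbn("02M475"))  # ('02', 'Manhattan', '475')
print(parse_dbn("2M475"))   # None (district is not two digits)
```

Keeping the parts as strings preserves leading zeros such as `02`.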


In [38]:
import pandas as pd
import numpy as np

In [39]:
# loading in the DBN Excel file
path = 'DBN.xlsx'
df = pd.read_excel(path)
df.head(10)

Unnamed: 0,DBN,School Name
0,01M292,Orchard Collegiate Academy
1,01M448,University Neighborhood High School
2,01M450,East Side Community School
3,01M539,"New Explorations into Science, Technology and ..."
4,01M696,Bard High School Early College
5,02M047,47 The American Sign Language and English Seco...
6,02M135,The Urban Assembly Early College High School o...
7,02M139,Stephen T. Mather Building Arts & Craftsmanshi...
8,02M260,The Clinton School
9,02M280,Manhattan Early College School for Advertising


In [40]:
# extracting the letter from DBN (represents the borough)
df['boro_short'] = df['DBN'].str[2]

# creating a boro map
boro_map_long = {
    'M': 'Manhattan',
    'X': 'Bronx',
    'K': 'Brooklyn',
    'Q': 'Queens',
    'R': 'Staten Island'
    }

# mapping borough names from letter
df['borough'] = df['boro_short'].map(boro_map_long)
df.tail(10)

Unnamed: 0,DBN,School Name,boro_short,borough
496,84X611,AECI II: NYC Charter High School for Computer ...,X,Bronx
497,84X614,Dream Charter School Mott Haven,X,Bronx
498,84X617,KIPP Bronx Charter School III,X,Bronx
499,84X627,Capital Preparatory Bronx Charter School,X,Bronx
500,84X635,Earl Monroe New Renaissance Basketball Charter...,X,Bronx
501,84X640,Family Life Academy Charter Schools High School,X,Bronx
502,84X648,Girls Preparatory Charter School of New York,X,Bronx
503,84X649,Democracy Preparatory Endurance Charter School,X,Bronx
504,84X703,Bronx Preparatory Charter School,X,Bronx
505,84X704,KIPP Academy Charter School,X,Bronx
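
Since `Series.map` returns `NaN` for any key missing from the dictionary, a quick check for unmapped borough letters can be useful here. The sketch below uses a toy frame in place of `df`; the DBN `99Z001` is made up to show what an unknown code looks like.

```python
import pandas as pd

# Toy frame standing in for df; "99Z001" is an invented DBN with an unknown code.
toy = pd.DataFrame({"DBN": ["01M292", "84X704", "99Z001"]})

boro_map_long = {
    "M": "Manhattan",
    "X": "Bronx",
    "K": "Brooklyn",
    "Q": "Queens",
    "R": "Staten Island",
}

toy["boro_short"] = toy["DBN"].str[2]
toy["borough"] = toy["boro_short"].map(boro_map_long)

# Series.map leaves NaN wherever the letter is not a key in the dict.
unmapped = toy.loc[toy["borough"].isna(), "DBN"].tolist()
print(unmapped)  # ['99Z001']
```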


In [41]:
# now extracting the district code from DBN
df['district'] = df['DBN'].str[0:2]
df.tail(10)

Unnamed: 0,DBN,School Name,boro_short,borough,district
496,84X611,AECI II: NYC Charter High School for Computer ...,X,Bronx,84
497,84X614,Dream Charter School Mott Haven,X,Bronx,84
498,84X617,KIPP Bronx Charter School III,X,Bronx,84
499,84X627,Capital Preparatory Bronx Charter School,X,Bronx,84
500,84X635,Earl Monroe New Renaissance Basketball Charter...,X,Bronx,84
501,84X640,Family Life Academy Charter Schools High School,X,Bronx,84
502,84X648,Girls Preparatory Charter School of New York,X,Bronx,84
503,84X649,Democracy Preparatory Endurance Charter School,X,Bronx,84
504,84X703,Bronx Preparatory Charter School,X,Bronx,84
505,84X704,KIPP Academy Charter School,X,Bronx,84


In [42]:
# let's also pull the school identifier number (the last 3 digits)
df['school_indentifier'] = df['DBN'].str[3:6]
df.head(10)

Unnamed: 0,DBN,School Name,boro_short,borough,district,school_indentifier
0,01M292,Orchard Collegiate Academy,M,Manhattan,1,292
1,01M448,University Neighborhood High School,M,Manhattan,1,448
2,01M450,East Side Community School,M,Manhattan,1,450
3,01M539,"New Explorations into Science, Technology and ...",M,Manhattan,1,539
4,01M696,Bard High School Early College,M,Manhattan,1,696
5,02M047,47 The American Sign Language and English Seco...,M,Manhattan,2,47
6,02M135,The Urban Assembly Early College High School o...,M,Manhattan,2,135
7,02M139,Stephen T. Mather Building Arts & Craftsmanshi...,M,Manhattan,2,139
8,02M260,The Clinton School,M,Manhattan,2,260
9,02M280,Manhattan Early College School for Advertising,M,Manhattan,2,280


In [43]:
df

Unnamed: 0,DBN,School Name,boro_short,borough,district,school_indentifier
0,01M292,Orchard Collegiate Academy,M,Manhattan,01,292
1,01M448,University Neighborhood High School,M,Manhattan,01,448
2,01M450,East Side Community School,M,Manhattan,01,450
3,01M539,"New Explorations into Science, Technology and ...",M,Manhattan,01,539
4,01M696,Bard High School Early College,M,Manhattan,01,696
...,...,...,...,...,...,...
501,84X640,Family Life Academy Charter Schools High School,X,Bronx,84,640
502,84X648,Girls Preparatory Charter School of New York,X,Bronx,84,648
503,84X649,Democracy Preparatory Endurance Charter School,X,Bronx,84,649
504,84X703,Bronx Preparatory Charter School,X,Bronx,84,703


In [44]:
# let's ensure all the data types are correct (probably won't matter if going to Excel, but good practice)
df.dtypes
# district and school_indentifier are still strings (object), so they need converting to numeric

DBN                   object
School Name           object
boro_short            object
borough               object
district              object
school_indentifier    object
dtype: object

In [45]:
# converting necessary columns to numeric (note: this drops leading zeros, e.g. '01' becomes 1)
numeric_cols = ['district', 'school_indentifier']
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce')
df.dtypes

DBN                   object
School Name           object
boro_short            object
borough               object
district               int64
school_indentifier     int64
dtype: object
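
One side effect of the numeric conversion is that leading zeros disappear (`01` becomes `1`, `047` becomes `47`). If the zero-padded form is needed later, for display or to rebuild the DBN, it can be recovered with `str.zfill`. A minimal sketch on a toy frame; the `district_str`, `school_str`, and `rebuilt` names are illustrative only.

```python
import pandas as pd

# Toy frame standing in for df, using two DBNs from the data above.
toy = pd.DataFrame({"DBN": ["01M292", "02M047"]})
toy["district"] = pd.to_numeric(toy["DBN"].str[0:2], errors="coerce")
toy["school_indentifier"] = pd.to_numeric(toy["DBN"].str[3:6], errors="coerce")

# Rebuild zero-padded strings from the integer columns.
toy["district_str"] = toy["district"].astype(str).str.zfill(2)
toy["school_str"] = toy["school_indentifier"].astype(str).str.zfill(3)

# Reassembling the parts should give back the original DBN.
rebuilt = toy["district_str"] + toy["DBN"].str[2] + toy["school_str"]
print(rebuilt.tolist())  # ['01M292', '02M047']
```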

In [46]:
df

Unnamed: 0,DBN,School Name,boro_short,borough,district,school_indentifier
0,01M292,Orchard Collegiate Academy,M,Manhattan,1,292
1,01M448,University Neighborhood High School,M,Manhattan,1,448
2,01M450,East Side Community School,M,Manhattan,1,450
3,01M539,"New Explorations into Science, Technology and ...",M,Manhattan,1,539
4,01M696,Bard High School Early College,M,Manhattan,1,696
...,...,...,...,...,...,...
501,84X640,Family Life Academy Charter Schools High School,X,Bronx,84,640
502,84X648,Girls Preparatory Charter School of New York,X,Bronx,84,648
503,84X649,Democracy Preparatory Endurance Charter School,X,Bronx,84,649
504,84X703,Bronx Preparatory Charter School,X,Bronx,84,703


In [48]:
# finally, send it back to xlsx
#df.to_excel('DBN_mapped.xlsx', index=False)

## Next Part: Mapping Coordinates to DBNs

In [49]:
# importing geopandas to read in file
import geopandas as gpd

# loading gdf
gdf = gpd.read_file('SchoolPoints_APS_2024_08_28/SchoolPoints_APS_2024_08_28.shp')
gdf.head(5)

Unnamed: 0,ATS,Building_C,Location_C,Name,Geographic,Latitude,Longitude,geometry
0,01M015,M015,M015,P.S. 015 Roberto Clemente,1,40.722075,-73.978747,POINT (-8235276.446 4971433.816)
1,01M020,M020,M020,P.S. 020 Anna Silver,1,40.721305,-73.986312,POINT (-8236118.578 4971320.718)
2,01M034,M034,M034,P.S. 034 Franklin D. Roosevelt,1,40.726008,-73.975058,POINT (-8234865.788 4972011.521)
3,01M063,M063,M063,The STAR Academy - P.S.63,1,40.72444,-73.986214,POINT (-8236107.668 4971781.199)
4,01M064,M064,M064,P.S. 064 Robert Simon,1,40.72313,-73.981597,POINT (-8235593.706 4971588.778)
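
The `geometry` values are clearly in projected metres rather than degrees. They appear to be Web Mercator (EPSG:3857), but that is an inference from the numbers; `gdf.crs` would confirm it. As a rough check, the sketch below applies the standard spherical Mercator formula to the `Latitude` and `Longitude` of the first row and prints the result for comparison with its `POINT`. The `to_web_mercator` helper is hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (EPSG:3857)


def to_web_mercator(lat_deg, lon_deg):
    """Project latitude/longitude in degrees to spherical Mercator metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# First row of gdf.head(): P.S. 015 Roberto Clemente,
# whose geometry is POINT (-8235276.446 4971433.816).
x, y = to_web_mercator(40.722075, -73.978747)
print(round(x, 1), round(y, 1))
```

If the printed pair agrees with the `POINT` to within about a metre, the `Latitude`/`Longitude` columns and the `geometry` column carry the same location in two coordinate systems.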


In [50]:
# merging gdf with df on ATS (gdf) and DBN (df); a right merge keeps every school in df, even those with no match in gdf
merged_gdf = gdf.merge(df, left_on='ATS', right_on='DBN', how='right')
merged_gdf.head(5)

Unnamed: 0,ATS,Building_C,Location_C,Name,Geographic,Latitude,Longitude,geometry,DBN,School Name,boro_short,borough,district,school_indentifier
0,01M292,M056,M292,Orchard Collegiate Academy,1.0,40.713362,-73.986051,POINT (-8236089.523 4970154.116),01M292,Orchard Collegiate Academy,M,Manhattan,1,292
1,01M448,M446,M448,University Neighborhood High School,1.0,40.712269,-73.984128,POINT (-8235875.456 4969993.596),01M448,University Neighborhood High School,M,Manhattan,1,448
2,01M450,M060,M450,East Side Community School,1.0,40.729152,-73.982472,POINT (-8235691.111 4972473.356),01M450,East Side Community School,M,Manhattan,1,450
3,01M539,M022,M539,"New Explorations into Science, Technology and ...",1.0,40.719416,-73.979581,POINT (-8235369.286 4971043.264),01M539,"New Explorations into Science, Technology and ...",M,Manhattan,1,539
4,01M696,M097,M696,Bard High School Early College,1.0,40.718276,-73.976093,POINT (-8234981.004 4970875.827),01M696,Bard High School Early College,M,Manhattan,1,696
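
A right merge keeps unmatched schools but does not say which ones they are. Passing `indicator=True` to `merge` adds a `_merge` column that flags them directly. The sketch below uses toy stand-ins for `gdf` and `df`; the coordinates are placeholders, not real data.

```python
import pandas as pd

# Toy stand-ins for gdf (points) and df (schools); latitudes are placeholders.
points = pd.DataFrame({"ATS": ["01M292", "01M448"], "Latitude": [40.71, 40.72]})
schools = pd.DataFrame({"DBN": ["01M292", "01M448", "84X648"]})

merged = points.merge(
    schools, left_on="ATS", right_on="DBN", how="right", indicator=True
)

# 'right_only' marks schools that found no matching point.
unmatched = merged.loc[merged["_merge"] == "right_only", "DBN"].tolist()
print(unmatched)  # ['84X648']
```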


In [51]:
# creating new gdf with only the columns we want (.copy() so later assignments don't affect merged_gdf)
new_gdf = merged_gdf[['DBN', 'School Name', 'borough', 'district', 'school_indentifier', 'Latitude', 'Longitude', 'geometry']].copy()
new_gdf.head(5)

Unnamed: 0,DBN,School Name,borough,district,school_indentifier,Latitude,Longitude,geometry
0,01M292,Orchard Collegiate Academy,Manhattan,1,292,40.713362,-73.986051,POINT (-8236089.523 4970154.116)
1,01M448,University Neighborhood High School,Manhattan,1,448,40.712269,-73.984128,POINT (-8235875.456 4969993.596)
2,01M450,East Side Community School,Manhattan,1,450,40.729152,-73.982472,POINT (-8235691.111 4972473.356)
3,01M539,"New Explorations into Science, Technology and ...",Manhattan,1,539,40.719416,-73.979581,POINT (-8235369.286 4971043.264)
4,01M696,Bard High School Early College,Manhattan,1,696,40.718276,-73.976093,POINT (-8234981.004 4970875.827)


In [60]:
# finding School name of missing Latitude values
missing_latitude = new_gdf[new_gdf['Latitude'].isnull()]
print(missing_latitude[['DBN', 'School Name', 'Latitude']])

        DBN                                     School Name  Latitude
502  84X648    Girls Preparatory Charter School of New York       NaN
503  84X649  Democracy Preparatory Endurance Charter School       NaN


In [53]:
# Let's merge on 'Name' and 'School Name' instead
merged_gdf_name = gdf.merge(df, left_on='Name', right_on='School Name', how='right')
merged_gdf_name.head(5)

Unnamed: 0,ATS,Building_C,Location_C,Name,Geographic,Latitude,Longitude,geometry,DBN,School Name,boro_short,borough,district,school_indentifier
0,01M292,M056,M292,Orchard Collegiate Academy,1.0,40.713362,-73.986051,POINT (-8236089.523 4970154.116),01M292,Orchard Collegiate Academy,M,Manhattan,1,292
1,01M448,M446,M448,University Neighborhood High School,1.0,40.712269,-73.984128,POINT (-8235875.456 4969993.596),01M448,University Neighborhood High School,M,Manhattan,1,448
2,01M450,M060,M450,East Side Community School,1.0,40.729152,-73.982472,POINT (-8235691.111 4972473.356),01M450,East Side Community School,M,Manhattan,1,450
3,,,,,,,,,01M539,"New Explorations into Science, Technology and ...",M,Manhattan,1,539
4,01M696,M097,M696,Bard High School Early College,1.0,40.718276,-73.976093,POINT (-8234981.004 4970875.827),01M696,Bard High School Early College,M,Manhattan,1,696


### Merging on Name left even more schools without coordinates (93, checked below), so I'll keep the DBN merge and fill its two gaps from gdf

In [54]:
# creating new gdf (again) with only columns we want
new_gdf2 = merged_gdf_name[['DBN', 'School Name', 'borough', 'district', 'school_indentifier', 'Latitude', 'Longitude', 'geometry']]
new_gdf2.head(5)

Unnamed: 0,DBN,School Name,borough,district,school_indentifier,Latitude,Longitude,geometry
0,01M292,Orchard Collegiate Academy,Manhattan,1,292,40.713362,-73.986051,POINT (-8236089.523 4970154.116)
1,01M448,University Neighborhood High School,Manhattan,1,448,40.712269,-73.984128,POINT (-8235875.456 4969993.596)
2,01M450,East Side Community School,Manhattan,1,450,40.729152,-73.982472,POINT (-8235691.111 4972473.356)
3,01M539,"New Explorations into Science, Technology and ...",Manhattan,1,539,,,
4,01M696,Bard High School Early College,Manhattan,1,696,40.718276,-73.976093,POINT (-8234981.004 4970875.827)


In [61]:
# double check for missing Latitude values in the name-based merge
missing_latitude_name = new_gdf2[new_gdf2['Latitude'].isnull()]
print(missing_latitude_name[['DBN', 'School Name']])

        DBN                                        School Name
3    01M539  New Explorations into Science, Technology and ...
5    02M047  47 The American Sign Language and English Seco...
6    02M135  The Urban Assembly Early College High School o...
7    02M139  Stephen T. Mather Building Arts & Craftsmanshi...
15   02M300  Urban Assembly School of Design and Constructi...
..      ...                                                ...
488  84X482  Dr. Richard Izquierdo Health and Science Chart...
491  84X539  United Charter High School for Advanced Math a...
492  84X553      United Charter High School for the Humanities
496  84X611  AECI II: NYC Charter High School for Computer ...
500  84X635  Earl Monroe New Renaissance Basketball Charter...

[93 rows x 2 columns]


In [68]:
# That didn't work (93 schools unmatched by name), so back to the DBN merge: let's look up the two missing schools by name in the original gdf
# Girls Preparatory Charter School of New York
# Democracy Preparatory Endurance Charter School

missing_schools = ['Girls Preparatory Charter School of New York', 'Democracy Preparatory Endurance Charter School']
missing_schools_gdf = gdf[gdf['Name'].isin(missing_schools)]
print(missing_schools_gdf[['Name', 'Latitude', 'Longitude', 'geometry']])

                                                Name   Latitude  Longitude  \
1732  Democracy Preparatory Endurance Charter School  40.801547 -73.935335   
1750    Girls Preparatory Charter School of New York  40.729152 -73.982472   

                              geometry  
1732  POINT (-8230443.844 4983113.812)  
1750  POINT (-8235691.111 4972473.356)  


In [69]:
# Filling in the missing coordinates in new_gdf by matching on the school name
for school in missing_schools:
    school_info = missing_schools_gdf[missing_schools_gdf['Name'] == school]
    mask = new_gdf['School Name'] == school  # rows in new_gdf for this school
    new_gdf.loc[mask, 'Latitude'] = school_info['Latitude'].values[0]
    new_gdf.loc[mask, 'Longitude'] = school_info['Longitude'].values[0]
    new_gdf.loc[mask, 'geometry'] = school_info['geometry'].values[0]

# double check for missing Latitude values again
missing_latitude_final = new_gdf[new_gdf['Latitude'].isnull()]
print(missing_latitude_final[['DBN', 'School Name']])

Empty DataFrame
Columns: [DBN, School Name]
Index: []
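
The loop above works well for two schools. With a longer list of gaps, one alternative is to build a name-to-value lookup and fill only the missing rows with `fillna`. This is a sketch on toy data, with `target` standing in for `new_gdf` and `source` for `gdf`; the school names and latitudes are invented.

```python
import pandas as pd

# Toy stand-ins: `target` plays the role of new_gdf, `source` the role of gdf.
target = pd.DataFrame({
    "School Name": ["School A", "School B"],
    "Latitude": [40.70, None],
})
source = pd.DataFrame({
    "Name": ["School B", "School C"],
    "Latitude": [40.80, 40.90],
})

# Build a name -> latitude lookup (one row per name), then fill only the gaps.
lookup = source.drop_duplicates("Name").set_index("Name")["Latitude"]
target["Latitude"] = target["Latitude"].fillna(target["School Name"].map(lookup))
print(target["Latitude"].tolist())  # [40.7, 40.8]
```

Existing values are left untouched because `fillna` only writes where the column is `NaN`. The same pattern would apply to `Longitude`.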


In [71]:
new_gdf.isna().sum()

DBN                   0
School Name           0
borough               0
district              0
school_indentifier    0
Latitude              0
Longitude             0
geometry              0
dtype: int64

In [73]:
# convert new gdf to csv
new_gdf.to_csv('location_dimension.csv', index=False)