# Table of contents
* [Introduction](#section1)
* [Read in the data](#section2)
    - [School data](#section3)
    - [Survey data](#section4)
* [Cleaning](#section5)
    - [Add DBN columns](#section6)
    - [Convert columns to numeric](#section7)
    - [Condense datasets](#section8)
    - [Convert AP scores to numeric](#section9)
    - [Combine the datasets](#section10)
    - [Add a school district column for mapping](#section11)
* [Analysis](#section12)
    - [Find correlations](#section13)
    - [Plotting survey correlations](#section14)
    - [Geographic data](#section15)
    - [Race and gender](#section16)
* [Schools vs. Neighborhoods](#section17)
* [High School Ranking by SAT Score](#section18)
    
    
by @antosnj


 <a id='section1'></a>
# Introduction
This project aims to analyze NYC High School Data. The starting datasets have been taken from https://opendata.cityofnewyork.us/.

In [None]:
%autosave 2

import pandas as pd
import numpy
import re
import warnings
import matplotlib.pyplot as plt

%matplotlib inline

warnings.filterwarnings('ignore') 

 <a id='section2'></a>
# Read in the data
 <a id='section3'></a>
## School data

In [None]:
data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

data = {}

for f in data_files:
    d = pd.read_csv("../input/nyc-data/nyc_highschool_data/schools/{0}".format(f))

    data[f.replace(".csv", "")] = d

 <a id='section4'></a>
## Survey data

In [None]:
all_survey = pd.read_csv("../input/nyc-data/nyc_highschool_data/schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("../input/nyc-data/nyc_highschool_data/schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)

survey["DBN"] = survey["dbn"]

survey_fields = [
    "DBN", 
    "rr_s", 
    "rr_t", 
    "rr_p", 
    "N_s", 
    "N_t", 
    "N_p", 
    "saf_p_11", 
    "com_p_11", 
    "eng_p_11", 
    "aca_p_11", 
    "saf_t_11", 
    "com_t_11", 
    "eng_t_11", 
    "aca_t_11", 
    "saf_s_11", 
    "com_s_11", 
    "eng_s_11", 
    "aca_s_11", 
    "saf_tot_11", 
    "com_tot_11", 
    "eng_tot_11", 
    "aca_tot_11",
]
survey = survey.loc[:,survey_fields]
data["survey"] = survey

 <a id='section5'></a>
# Cleaning
 <a id='section6'></a>
## Add DBN columns

In [None]:
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
    
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
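
As a side note, the same padding could be sketched with the built-in `str.zfill`. The small frame below is made-up data that only borrows the column names of `class_size.csv`, to illustrate how a DBN is assembled from the district number and the school code.

```python
import pandas as pd

# Hypothetical sample rows; only the column names mirror class_size.csv.
sample = pd.DataFrame({
    "CSD": [1, 12, 31],
    "SCHOOL CODE": ["M015", "X245", "R440"],
})

# zfill(2) left-pads with zeros up to two characters, like pad_csd does.
sample["padded_csd"] = sample["CSD"].astype(str).str.zfill(2)
sample["DBN"] = sample["padded_csd"] + sample["SCHOOL CODE"]

print(sample["DBN"].tolist())
```

Either version works; `zfill` simply avoids writing a helper function for a one-line transformation.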

 <a id='section7'></a>
## Convert columns to numeric

In [None]:
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")

data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]

def find_lat(loc):
    # Raw string, so \( and \) reach the regex engine as literal parentheses.
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

def find_lon(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon

data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)

data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
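
The extraction above could also be done in a vectorized way with `Series.str.extract`. This is only a sketch: the two addresses are hypothetical values shaped like the `Location 1` column (street, city, then a `(lat, lon)` pair), and the regex assumes plain signed decimals.

```python
import pandas as pd

# Hypothetical addresses in the same shape as the "Location 1" column.
locations = pd.Series([
    "883 Classon Avenue\nBrooklyn, NY 11225\n(40.670299, -73.961648)",
    "1110 Boston Road\nBronx, NY 10456\n(40.827603, -73.904475)",
])

# Two capture groups: signed decimals separated by a comma, inside parentheses.
coords = locations.str.extract(r"\((-?\d+\.\d+),\s*(-?\d+\.\d+)\)")
coords.columns = ["lat", "lon"]

# Convert both columns to floats; anything unparsable would become NaN.
coords = coords.apply(pd.to_numeric, errors="coerce")

print(coords)
```

Rows without a coordinate pair would end up as `NaN` here instead of raising an `IndexError`, which is the main practical difference from `find_lat` and `find_lon`.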

 <a id='section8'></a>
## Condense datasets

In [None]:
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]

# Average only the numeric columns per school; text columns cannot be averaged.
class_size = class_size.groupby("DBN").mean(numeric_only=True)
class_size.reset_index(inplace=True)
data["class_size"] = class_size

data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]

data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]

 <a id='section9'></a>
## Convert AP scores to numeric

In [None]:
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']

for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")

 <a id='section10'></a>
## Combine the datasets

In [None]:
combined = data["sat_results"]

combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")

to_merge = ["class_size", "demographics", "survey", "hs_directory"]

for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")

# Fill missing numeric values with the column mean (text columns are skipped).
combined = combined.fillna(combined.mean(numeric_only=True))
combined = combined.fillna(0)
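
To see what the choice between `left` and `inner` merges does to the row count, here is a minimal sketch on two made-up frames. The DBNs, scores and the `ap_takers` column are hypothetical; `indicator=True` is a standard `merge` option that adds a `_merge` column describing where each row was found.

```python
import pandas as pd

# Hypothetical miniature versions of two of the datasets.
sat = pd.DataFrame({"DBN": ["01M015", "02M400", "03M500"],
                    "sat_score": [1200, 1500, 1350]})
ap = pd.DataFrame({"DBN": ["01M015", "03M500", "09X999"],
                   "ap_takers": [20, 35, 10]})

# A left merge keeps every SAT row; an inner merge keeps only matching DBNs.
left = sat.merge(ap, on="DBN", how="left", indicator=True)
inner = sat.merge(ap, on="DBN", how="inner")

# "_merge" tells whether a row was found in both frames or only in the left one.
print(left["_merge"].value_counts())
print(len(sat), len(left), len(inner))
```

Running the same check on the real frames before each merge would show how many schools each `inner` merge drops.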

 <a id='section11'></a>
## Add a school district column for mapping

In [None]:
def get_first_two_chars(dbn):
    return dbn[0:2]

combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
school_dist = combined["school_dist"]

 <a id='section12'></a>
# Analysis
 <a id='section13'></a>
## Find correlations

In [None]:
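# Correlations are only defined for numeric columns, so select those first.
correlations = combined.select_dtypes(include="number").corr()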
correlations = correlations["sat_score"]
print(correlations)
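
For reference, `DataFrame.corr` computes the Pearson coefficient by default. The sketch below recomputes it by hand on a few made-up values (the numbers are for illustration only, not taken from the dataset) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Made-up numbers, for illustration only.
toy = pd.DataFrame({
    "saf_s_11": [5.5, 6.0, 6.8, 7.2, 8.1],
    "sat_score": [1100, 1180, 1250, 1400, 1650],
})

# Deviations from the mean of each column.
x = toy["saf_s_11"] - toy["saf_s_11"].mean()
y = toy["sat_score"] - toy["sat_score"].mean()

# Pearson r: covariance divided by the product of the standard deviations.
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
r_pandas = toy["saf_s_11"].corr(toy["sat_score"])

print(r_manual, r_pandas)
```

The coefficient ranges from -1 to 1 and only measures linear association, so it says nothing about which variable, if any, drives the other.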

 <a id='section14'></a>
## Plotting survey correlations

In [None]:
# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")

In [None]:
fig = plt.figure()
# Reuse the sat_score correlations computed above instead of recomputing them.
correlations[survey_fields].plot.bar()
plt.ylabel('Correlation with sat_score')

We can observe that many of the fields are correlated with the SAT scores. In particular, the number of student, teacher and parent respondents, as well as the safety score from students and teachers present a strong correlation.

Let's take a closer look at how the way students perceive safety at their schools affects SAT scores.

In [None]:
combined.plot.scatter(x='saf_s_11', y='sat_score')

We can see that the less safe students think their school is, the lower their SAT scores tend to be, and vice versa.

 <a id='section15'></a>
## Geographic data
In order to dig into this relationship a bit more, let's map out safety scores by geographic area in NYC, along with the SAT score.

In [None]:
import numpy as np
from mpl_toolkits.basemap import Basemap

grouped = combined.groupby('school_dist')
# Average the numeric columns (including lat and lon) per district.
av_school_dist = grouped.mean(numeric_only=True)

m = Basemap(
    projection='merc', 
    llcrnrlat=40.496044, 
    urcrnrlat=40.915256, 
    llcrnrlon=-74.255735, 
    urcrnrlon=-73.700272,
    resolution='i'
    )

longitudes = av_school_dist['lon'].tolist()
latitudes = av_school_dist['lat'].tolist()

fig = plt.figure(figsize=(15,10))

ax1 = fig.add_subplot(1,2,1)
ax1.set_title('Safety Score (Students) by District')
m.drawmapboundary(fill_color='#85A6D9')
m.drawcoastlines(color='#6D5F47', linewidth=.4)
m.drawrivers(color='#6D5F47', linewidth=.4)
m.scatter(longitudes, latitudes, s=50, zorder=2, latlon=True, c=av_school_dist['saf_s_11'], cmap='summer')

ax2 = fig.add_subplot(1,2,2)
ax2.set_title('SAT Scores by District')
m.drawmapboundary(fill_color='#85A6D9')
m.drawcoastlines(color='#6D5F47', linewidth=.4)
m.drawrivers(color='#6D5F47', linewidth=.4)
m.scatter(longitudes, latitudes, s=50, zorder=2, latlon=True, c=av_school_dist['sat_score'], cmap='summer')

plt.show()



The map shows Brooklyn and The Bronx have for the most part a lower safety score based on student surveys, whereas Queens, Staten Island and Manhattan have a higher one. SAT scores can be observed to be higher in Queens and Staten Island, followed by Manhattan, while Brooklyn and The Bronx present a lower one. 

Therefore, we could say there is indeed a positive correlation between how safe students feel at their schools and the SAT scores they get.

 <a id='section16'></a>
## Race and gender

Now, let's analyse how the percentage of each race at a given school affects SAT scores.

In [None]:
races = ['white_per', 'asian_per', 'black_per', 'hispanic_per']

fig = plt.figure()
correlations[races].plot.bar()
plt.ylabel('Correlation with sat_score')

The plot shows that the percentages of white and Asian students have a strong positive correlation with SAT scores, while the percentages of Black and Hispanic students have a negative one.

Let's dig deeper into the results for Hispanic students.

In [None]:
combined.plot.scatter(x='hispanic_per', y='sat_score')

At first, we can see that the higher the percentage of Hispanic students in a school, the lower the SAT score tends to be, given how most of the scatter points are concentrated in the lower right corner of the plot.

However, a significant number of schools with a low percentage of Hispanic students also show low SAT scores, which tells us that the share of Hispanic students is not by itself what explains low SAT scores.

In [None]:
high_hispanic = combined[combined['hispanic_per']>95]
print(high_hispanic['SCHOOL NAME'])

In [None]:
low_hispanic = combined[combined['hispanic_per']<10]
high_SAT_hisp = low_hispanic[low_hispanic['sat_score']>1800]

print(high_SAT_hisp['SCHOOL NAME'])

In [None]:
gender = ['male_per', 'female_per']

fig = plt.figure()
correlations[gender].plot.bar()
plt.ylabel('Correlation with sat_score')

We can see that neither gender percentage shows a significant correlation with SAT scores: the correlation for the male percentage is negative, whereas the one for the female percentage is positive.

In [None]:
combined.plot.scatter(x='female_per', y='sat_score')

In [None]:
high_female = combined[combined['female_per']>60]
high_SAT_female = high_female[high_female['sat_score']>1700]

print(high_SAT_female['SCHOOL NAME'])

In addition, it could be an interesting next step to analyse how the percentage of AP Test takers might potentially affect SAT scores.

In [None]:
combined['ap_per'] = combined['AP Test Takers '] / combined['total_enrollment']
combined.plot.scatter(x='ap_per', y='sat_score')
print("Pearson coefficient (r):", combined['ap_per'].corr(combined['sat_score']))

Based on the Pearson coefficient, we can conclude that the percentage of students in each school that took an AP exam is not correlated with the SAT scores.

Let's find out if the class size is.

In [None]:
combined.plot.scatter(x='AVERAGE CLASS SIZE', y='sat_score')
print("Pearson coefficient (r):", combined['AVERAGE CLASS SIZE'].corr(combined['sat_score']))

We would expect class size to be negatively correlated with the SAT score: the more students in a class, the less personal attention each of them can get from the teacher, and the lower the potential SAT score. The sign of the coefficient above, computed between `AVERAGE CLASS SIZE` and `sat_score`, shows whether the data support that expectation.

<a id='section17'></a>
# Schools vs. Neighborhoods
It would be interesting to figure out which neighborhoods have the best schools. In order to do that, let's first read a dataset containing property values in NYC, which can be found at https://www.kaggle.com/new-york-city/nyc-property-sales#nyc-rolling-sales.csv.

In [None]:
properties = pd.read_csv('../input/nyc-data/nyc_highschool_data/nyc_properties/nyc-rolling-sales.csv')
data['properties'] = properties

print(properties.columns)
properties.head()

Let's now see some general stats.

In [None]:
properties.describe()

Values in the 'BOROUGH' column go from 1 to 5, such that:

* 1 - Manhattan
* 2 - The Bronx
* 3 - Brooklyn
* 4 - Queens
* 5 - Staten Island

Let's replace numbers with each borough name.

In [None]:
boroughs = {
            '1': 'Manhattan',
            '2': 'The Bronx',
            '3': 'Brooklyn',
            '4': 'Queens',
            '5': 'Staten Island'
            }

def number2borough(number):
    return boroughs[str(number)]

properties['BOROUGH'] = properties['BOROUGH'].apply(number2borough)

Neighborhoods are often categorized by property valuation, so for this particular case let's break them down that way.

Since we are only interested in studying the property sale price/value, let's take a look at the 'SALE PRICE' column:

In [None]:
properties['SALE PRICE'].value_counts()

It seems that a significant number of properties have a value that does not make much sense, in particular those with a sale price of \$0 or \$10, as well as those without a value.

Let's get rid of those.

In [None]:
# Non-numeric entries become NaN.
properties['SALE PRICE'] = pd.to_numeric(properties['SALE PRICE'], errors='coerce')
# Keep only sales above $10; the comparison is False for NaN, so missing prices are dropped too.
properties = properties[properties['SALE PRICE'] > 10]

properties['SALE PRICE'].value_counts()
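
The cleaning rule described above (discard prices of \$0 and \$10 and entries without a value) can be sketched on a handful of made-up strings. The placeholder `" -  "` is an assumption about how a missing price might appear in the raw file.

```python
import pandas as pd

# Hypothetical raw values, stored as strings, including a non-numeric placeholder.
raw = pd.Series(["450000", "0", "10", " -  ", "1250000"], name="SALE PRICE")

# Non-numeric text becomes NaN instead of raising an error.
prices = pd.to_numeric(raw, errors="coerce")

# NaN > 10 evaluates to False, so missing prices are filtered out as well.
kept = prices[prices > 10]

print(kept.tolist())
```

A single boolean mask handles both the implausible prices and the missing ones, so no separate `dropna` call is needed.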

In [None]:
# Average the numeric columns per borough; text columns cannot be averaged.
properties = properties.groupby('BOROUGH').mean(numeric_only=True)
properties.reset_index(inplace=True)

properties.plot.bar(x='BOROUGH', y='SALE PRICE', rot=30, legend=False)
plt.ylabel('SALE PRICE ($)')


It is clear that Manhattan has the highest average property value, at a price close to \$3.5M, followed by Brooklyn, valued at close to \$1.5M, and The Bronx, Queens and Staten Island, priced between \$500k and \$1M.

Earlier in this project, I showed that, apart from some exceptions, SAT scores tended to be lower in Brooklyn and The Bronx and higher in the rest of the boroughs. Even though Manhattan also presents good scores, the best ones were found in Queens and Staten Island.

In conclusion, at the borough level the best schools are, for the most part, found in the less expensive boroughs. The Bronx is the exception: it is also among the least expensive, yet its SAT scores are low.
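
One way to put this comparison on a numeric footing is to average SAT scores per borough and line them up with the average sale price. The sketch below relies on the third character of a DBN being the borough letter (M, X, K, Q, R); the schools, scores and prices are hypothetical placeholders, not results from the datasets.

```python
import pandas as pd

# The third character of a DBN identifies the borough.
borough_codes = {"M": "Manhattan", "X": "The Bronx", "K": "Brooklyn",
                 "Q": "Queens", "R": "Staten Island"}

# Hypothetical schools and prices, for illustration only.
schools = pd.DataFrame({
    "DBN": ["01M015", "02M400", "07X221", "13K430", "25Q525", "31R605"],
    "sat_score": [1300, 1500, 1150, 1200, 1450, 1400],
})
prices = pd.DataFrame({
    "BOROUGH": ["Manhattan", "The Bronx", "Brooklyn", "Queens", "Staten Island"],
    "SALE PRICE": [3_000_000, 600_000, 1_400_000, 800_000, 550_000],
})

# Map each school to its borough, then average the SAT score per borough.
schools["BOROUGH"] = schools["DBN"].str[2].map(borough_codes)
by_borough = schools.groupby("BOROUGH", as_index=False)["sat_score"].mean()

# One row per borough with both the mean SAT score and the mean sale price.
summary = by_borough.merge(prices, on="BOROUGH")

print(summary.sort_values("sat_score", ascending=False))
print("r =", summary["sat_score"].corr(summary["SALE PRICE"]))
```

With only five boroughs, any coefficient computed this way rests on five points, so it should be read as a rough indication at most.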

<a id='section18'></a>
# High School Ranking by SAT Score
To conclude, let's rank schools by their SAT scores to check that the observations and conclusions above are consistent.

In [None]:
sorted_SAT = combined.sort_values('sat_score', ascending=False)
# Show each school next to its score, renumbered from 0 in ranking order.
sorted_SAT[['SCHOOL NAME', 'sat_score']].reset_index(drop=True)