The purpose of this analysis is to examine the efficacy of standardized college admissions tests and whether they are unfair to particular demographic groups. To that end, we will investigate potential correlations between SAT/AP test scores and demographic factors such as race, gender, and income.

The data on NYC high schools was taken from the NYC Open Data website. The files are as follows:

    * 'ap_scores' - Data on AP test scores
    * 'class_size' - Data on class size
    * 'demographics' - Data on high school demographics
    * 'graduation_outcomes' - Data on graduation rates, etc.
    * 'hs_directory' - A general directory of the high schools
    * 'sat_results' - Data on SAT scores
    * 'all_survey' - Data on surveys from all NYC high schools
    * 'd75_survey' - Data on surveys from NYC district 75

In [66]:
import pandas as pd
data = {}

## We'll read in the .csv files first. Survey results will be read in during the following step using a different pandas tool.
file_list = ['ap_scores.csv','hs_directory.csv','class_size.csv','demographics.csv','graduation_outcomes.csv','sat_results.csv']
print(sorted(file_list))
for file_name in sorted(file_list):
    ## Use the file name without its '.csv' extension as the dictionary key.
    string_name = file_name.replace('.csv','')
    data[string_name] = pd.read_csv(file_name)
    print(data[string_name].columns)

['ap_scores.csv', 'class_size.csv', 'demographics.csv', 'graduation_outcomes.csv', 'hs_directory.csv', 'sat_results.csv']
Index(['DBN', 'SchoolName', 'AP Test Takers ', 'Total Exams Taken',
       'Number of Exams with scores 3 4 or 5'],
      dtype='object')
Index(['CSD', 'BOROUGH', 'SCHOOL CODE', 'SCHOOL NAME', 'GRADE ',
       'PROGRAM TYPE', 'CORE SUBJECT (MS CORE and 9-12 ONLY)',
       'CORE COURSE (MS CORE and 9-12 ONLY)', 'SERVICE CATEGORY(K-9* ONLY)',
       'NUMBER OF STUDENTS / SEATS FILLED', 'NUMBER OF SECTIONS',
       'AVERAGE CLASS SIZE', 'SIZE OF SMALLEST CLASS', 'SIZE OF LARGEST CLASS',
       'DATA SOURCE', 'SCHOOLWIDE PUPIL-TEACHER RATIO'],
      dtype='object')
Index(['DBN', 'Name', 'schoolyear', 'fl_percent', 'frl_percent',
       'total_enrollment', 'prek', 'k', 'grade1', 'grade2', 'grade3', 'grade4',
       'grade5', 'grade6', 'grade7', 'grade8', 'grade9', 'grade10', 'grade11',
       'grade12', 'ell_num', 'ell_percent', 'sped_num', 'sped_percent',
       'ctt_nu

So far, each of the data sets appears to have a 'DBN' column (with the exception of 'class_size' and 'hs_directory'), which provides a unique identifier for each school. We can use it as the key when we clean and merge the data sets.
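The observation above can be checked programmatically rather than by reading the printed column lists. The following is a minimal sketch, assuming a dictionary of DataFrames shaped like `data`; the `toy_data` frames, their values, and the `dbn_report` helper are made up for illustration and are not part of the original notebook. It also shows that a 'DBN' column can be present without being unique per row, which is worth confirming before merging.

```python
import pandas as pd

# Toy stand-ins for the real data sets. The column names mirror the printed
# output above, but every value here is invented for illustration.
toy_data = {
    "ap_scores": pd.DataFrame({"DBN": ["01M448", "01M450"], "SchoolName": ["A", "B"]}),
    "class_size": pd.DataFrame({"CSD": [1, 1], "SCHOOL CODE": ["M015", "M015"]}),
    "demographics": pd.DataFrame({"DBN": ["01M015", "01M015"], "schoolyear": [20052006, 20062007]}),
}

def dbn_report(datasets):
    """Hypothetical helper: report whether each data set has a 'DBN' column
    and, if it does, whether the values in it are unique."""
    rows = []
    for name, df in sorted(datasets.items()):
        has_dbn = "DBN" in df.columns
        # Uniqueness is only meaningful when the column exists.
        is_unique = bool(df["DBN"].is_unique) if has_dbn else None
        rows.append({"dataset": name, "has_DBN": has_dbn, "DBN_unique": is_unique})
    return pd.DataFrame(rows)

report = dbn_report(toy_data)
print(report)
```

Running `dbn_report(data)` on the real dictionary would flag the data sets that need a 'DBN' column created or deduplicated before they can be merged.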

In [70]:
## Now that we've read in our .csv files, we can read in the two survey files, which are .xlsx files. We'll also combine them
## into a single DataFrame before adding them to our 'data' dictionary, so we only have to clean the survey data once.

## pd.read_excel is the public entry point for reading Excel files.
all_survey = pd.read_excel('all_survey.xlsx')
d75_survey = pd.read_excel('d75_survey.xlsx')
## Stack the rows of the two survey files on top of each other.
survey = pd.concat([all_survey,d75_survey],axis=0)

In [71]:
## The survey data contains far more columns of information than we need for our analysis. Rather than adding the entire data set
## into our dictionary, we can examine the data dictionary that accompanied the data sets to see what is most useful to us. After
## a look at the dictionary, it seems as though the following columns are most relevant.
columns = ["DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]
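## The survey files name the identifier column 'dbn'; copy it to 'DBN' so it matches the selection list above.
survey['DBN'] = survey['dbn']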

survey = survey[columns]
data['survey'] = survey
data['survey'].shape

(1702, 23)
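The combine-and-select steps above can be illustrated on a small scale. This is a minimal sketch under stated assumptions: `toy_all` and `toy_d75` are invented miniature frames, `extra_a`, `extra_b`, and `not_in_file` are hypothetical column names, and the check for missing columns is an addition that the original cells do not perform. Unlike the cell above, it also passes `ignore_index=True` so the combined frame gets a fresh row index.

```python
import pandas as pd

# Invented miniature versions of the two survey files. Only the lowercase
# 'dbn' column and two of the selected columns are reproduced here.
toy_all = pd.DataFrame({"dbn": ["01M015"], "rr_s": [80], "saf_p_11": [8.1], "extra_a": [1]})
toy_d75 = pd.DataFrame({"dbn": ["75K004"], "rr_s": [60], "saf_p_11": [7.4], "extra_b": [2]})

# Stack the rows. Columns that exist in only one frame are kept and
# filled with NaN for the rows that came from the other frame.
toy_survey = pd.concat([toy_all, toy_d75], axis=0, ignore_index=True)

# Copy the lowercase identifier to 'DBN' to match the other data sets.
toy_survey["DBN"] = toy_survey["dbn"]

# 'not_in_file' is a hypothetical name used to show the missing-column check.
wanted = ["DBN", "rr_s", "saf_p_11", "not_in_file"]
missing = [c for c in wanted if c not in toy_survey.columns]
present = [c for c in wanted if c in toy_survey.columns]

# Keep only the columns that actually exist, in the requested order.
toy_survey = toy_survey[present]

print(missing)
print(toy_survey.shape)
```

Listing the missing names before selecting makes a typo in the column list easy to spot, instead of surfacing as a `KeyError` from the selection itself.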