In [1]:
import pandas as pd
import numpy as np
import json
import requests
import os
from dotenv import load_dotenv

In [2]:
load_dotenv()
API_KEY = os.getenv('API_KEY')
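
`os.getenv` returns `None` when the variable is missing, and the requests that follow would then fail with a less obvious error. A minimal sketch of an early check, assuming the key lives in a `.env` file as above; the helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os

def require_env(name, environ=os.environ):
    """Return the value of an environment variable, or fail with a clear message."""
    value = environ.get(name)
    if not value:
        # Hypothetical error message; adjust to the project's conventions.
        raise RuntimeError(f"{name} is not set; add it to the .env file")
    return value

# Demonstration with an explicit mapping so the sketch runs without a real key.
demo_key = require_env('API_KEY', {'API_KEY': 'abc123'})
print(demo_key)
```

In the notebook this could replace the bare `os.getenv('API_KEY')` call after `load_dotenv()`.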

<a href='https://web.postman.co/workspace/Springboard~8fb65eb5-3826-4258-92f7-fd1474949256/collection/24117868-5badc386-8106-46be-9093-38903d50c5e1?tab=authorization'>Postman collection with explanation of queries used to access desired data</a>

Access the first page of results from <a href='https://api.data.gov/ed/collegescorecard/v1/schools'>api.data.gov</a>
<br>A data-cleaning pipeline will be demonstrated on this page and then applied to all pages from the latest collection.
<br>If needed, this process will be repeated for multiple years to effectively resample the data and increase the number of observations.

The information from the school object will be requested separately, so that these values are not repeated if it becomes necessary to pull in data from multiple years.

Data descriptions are located in the references directory or can be downloaded directly from the <a href='https://collegescorecard.ed.gov/data/documentation/'>College Scorecard website</a>. API documentation can be viewed <a href='https://github.com/RTICWDT/open-data-maker/blob/master/API.md'>here</a>.

In [3]:
url = 'https://api.data.gov/ed/collegescorecard/v1/schools.json'

To access the data, 10 sets of paginated API calls are made to the College Scorecard database:
- school info for public schools
- school info for private schools
- student body for public schools
- student body for private schools
- cost data for public schools
- cost data for private schools
- aid and admissions for public schools
- aid and admissions for private schools
- academics data for public schools
- academics data for private schools
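
The ten calls listed above differ only in the requested fields and the ownership filter, so the shared logic can be sketched as one helper. This is an illustrative sketch, not the notebook's own code: `build_params`, `fetch_results`, and `get_json` are hypothetical names, and the network call is replaced by a stand-in so the example runs offline.

```python
def build_params(fields, ownership, page, api_key, per_page=100):
    """Assemble the query parameters shared by every call in this notebook."""
    return {
        'fields': fields,
        'school.degrees_awarded.predominant': '3',
        'school.ownership': ownership,
        'keys_nested': 'true',
        'per_page': str(per_page),
        'page': page,
        'api_key': api_key,
    }

def fetch_results(get_json, fields, ownership, n_pages, api_key):
    """Collect the 'results' arrays from pages 0..n_pages-1 into one list."""
    records = []
    for page in range(n_pages):
        payload = get_json(build_params(fields, ownership, page, api_key))
        records.extend(payload['results'])
    return records

# Stand-in for the network call (assumption); with requests it could be
# lambda params: requests.get(url, params=params).json()
def fake_get_json(params):
    return {'results': [{'id': params['page'] * 100 + k} for k in range(2)]}

records = fetch_results(fake_get_json, 'id,latest.school', '1', 3, 'DEMO_KEY')
print(len(records))
```

Passing the fetch function in as an argument keeps the paging logic testable without an API key.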

My initial reason for separating the calls this way was to avoid receiving null values for private-school cost fields when requesting public schools, and vice versa. I also chose to keep the results separate and join the resulting DataFrames in pandas in case patterns of missing values correlate heavily with whether a university is public or private. If any columns are clearly problematic, it may be easier to identify the source this way.
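
One way to look for ownership-related missingness is to compare the share of nulls per column between the two groups. The sketch below uses toy frames with made-up values and hypothetical column names; `missing_share_by_group` is an illustrative helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the public and private frames (values are made up).
df_public = pd.DataFrame({'id': [1, 2], 'endowment': [np.nan, 5.0], 'size': [100, 200]})
df_private = pd.DataFrame({'id': [3, 4], 'endowment': [7.0, 9.0], 'size': [np.nan, np.nan]})

def missing_share_by_group(frames):
    """Return the fraction of null values per column, one output column per group."""
    return pd.DataFrame({label: frame.isna().mean() for label, frame in frames.items()})

missing = missing_share_by_group({'public': df_public, 'private': df_private})
print(missing)
```

Columns whose null share differs sharply between the two groups are candidates for closer inspection.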

There are 6 pages of results for public institutions, and 15 pages of results for private institutions as indicated by the responses from the API. (See collections <a href='https://www.postman.com/louis-fortunato/workspace/springboard/collection/24117868-5badc386-8106-46be-9093-38903d50c5e1?ctx=documentation&tab=authorization'>here</a>.)
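
The page counts follow from the total number of matching schools and the page size. A minimal sketch of that arithmetic, assuming the response carries a `metadata` object with `total` and `per_page` keys (check the API documentation for the exact shape); the totals used here are the row counts reported later in this notebook.

```python
import math

def pages_needed(total, per_page=100):
    """Number of pages required to cover `total` records."""
    return math.ceil(total / per_page)

# Assumed shape of the metadata returned alongside 'results'.
metadata_public = {'total': 581, 'page': 0, 'per_page': 100}
metadata_private = {'total': 1436, 'page': 0, 'per_page': 100}

print(pages_needed(metadata_public['total'], metadata_public['per_page']))    # 6
print(pages_needed(metadata_private['total'], metadata_private['per_page']))  # 15
```

Deriving the loop bound this way avoids hard-coding page counts that may change when the dataset is updated.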

In [4]:
# Each response from the API returns data for up to 100 schools
# The pages are packaged into a list of DataFrames, which is vertically concatenated into the full list of public schools
# The same is done for private schools, and the two resulting full lists are also vertically concatenated

# Initialize empty list of DFs
dfs_school_public = []
# API call school info for public schools
for i in range(6):
    response = requests.get(url,
                            params={'fields':'id,{year}.school'.format(year='latest'),
                                    'school.degrees_awarded.predominant':'3',
                                    'school.ownership':'1',
                                    'keys_nested':'true',
                                    'per_page':'100',
                                    'page':i,
                                    'api_key':API_KEY}
                           )
    res = response.json()
    # Select key for array of results
    data_school_public = res['results']
    # Serializing json
    json_object = json.dumps(data_school_public, indent=4)
    # Writing to file
    with open("../data/raw/school_public_{}_page_{}.json".format('latest',str(i)), "w") as outfile:
        outfile.write(json_object)
    # Add dataframe for this page to list of DFs
    dfs_school_public.append(pd.json_normalize(data_school_public))
    
# Concat all pages to one DF
df_school_public = pd.concat(dfs_school_public)
print(df_school_public.shape)
df_school_public.head()

(581, 49)


Unnamed: 0,id,latest.school.name,latest.school.city,latest.school.state,latest.school.zip,latest.school.accreditor,latest.school.school_url,latest.school.price_calculator_url,latest.school.degrees_awarded.predominant_recoded,latest.school.degrees_awarded.predominant,...,latest.school.institutional_characteristics.level,latest.school.open_admissions_policy,latest.school.accreditor_code,latest.school.title_iv.approval_date,latest.school.title_iv.eligibility_type,latest.school.ownership_peps,latest.school.endowment.begin,latest.school.endowment.end,latest.school.dolflag,latest.school.search
0,100654,Alabama A & M University,Normal,AL,35762,Southern Association of Colleges and Schools C...,www.aamu.edu/,www.aamu.edu/admissions-aid/tuition-fees/net-p...,3.0,3,...,1,2.0,SACSCC,12/12/1965,1,1.0,,,0,
1,100663,University of Alabama at Birmingham,Birmingham,AL,35294-0110,Southern Association of Colleges and Schools C...,https://www.uab.edu/,https://tcc.ruffalonl.com/University of Alabam...,3.0,3,...,1,2.0,SACSCC,12/1/1965,1,1.0,537349307.0,539858544.0,0,
2,100706,University of Alabama in Huntsville,Huntsville,AL,35899,Southern Association of Colleges and Schools C...,www.uah.edu/,finaid.uah.edu/,3.0,3,...,1,2.0,SACSCC,12/1/1965,1,1.0,77250279.0,75837207.0,1,
3,100724,Alabama State University,Montgomery,AL,36104-0271,Southern Association of Colleges and Schools C...,www.alasu.edu/,www.alasu.edu/cost-aid/tuition-costs/net-price...,3.0,3,...,1,2.0,SACSCC,12/1/1965,1,1.0,94536751.0,111315175.0,0,
4,100751,The University of Alabama,Tuscaloosa,AL,35487-0100,Southern Association of Colleges and Schools C...,www.ua.edu/,financialaid.ua.edu/net-price-calculator/,3.0,3,...,1,2.0,SACSCC,12/1/1965,1,1.0,946690144.0,939393269.0,1,
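
Because each page is normalized into its own DataFrame, every page starts its row index at 0, and a plain `pd.concat` keeps those labels. A small sketch of the effect and of the `ignore_index=True` option, using a few of the school ids shown above; whether to renumber is a design choice the original notebook does not make.

```python
import pandas as pd

# Two tiny "pages" built the same way as in the loop above.
page_0 = pd.json_normalize([{'id': 100654}, {'id': 100663}])
page_1 = pd.json_normalize([{'id': 100706}, {'id': 100724}])

stacked = pd.concat([page_0, page_1])                        # keeps per-page labels
reindexed = pd.concat([page_0, page_1], ignore_index=True)   # renumbers rows 0..n-1

print(stacked.index.tolist())    # [0, 1, 0, 1]
print(reindexed.index.tolist())  # [0, 1, 2, 3]
```

Repeated labels are harmless for `head()` and `shape`, but label-based selection such as `.loc[0]` would return one row per page.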


In [5]:
# Each response from the API returns data for up to 100 schools
# The pages are packaged into a list of DataFrames, which is vertically concatenated into the full list of private schools
# The same was done for public schools, and the two resulting full lists are also vertically concatenated

# Initialize empty list of DFs
dfs_school_private = []
# API call school info for private schools
for i in range(15):
    response = requests.get(url,
                            params={'fields':'id,{year}.school'.format(year='latest'),
                                    'school.degrees_awarded.predominant':'3',
                                    'school.ownership':'2,3',
                                    'keys_nested':'true',
                                    'per_page':'100',
                                    'page':i,
                                    'api_key':API_KEY}
                           )
    res = response.json()
    # Select key for array of results
    data_school_private = res['results']
    # Serializing json
    json_object = json.dumps(data_school_private, indent=4)
    # Writing to file
    with open("../data/raw/school_private_{}_page_{}.json".format('latest',str(i)), "w") as outfile:
        outfile.write(json_object)
    dfs_school_private.append(pd.json_normalize(data_school_private))
df_school_private = pd.concat(dfs_school_private)
print(df_school_private.shape)
df_school_private.head()

(1436, 49)


Unnamed: 0,id,latest.school.name,latest.school.city,latest.school.state,latest.school.zip,latest.school.accreditor,latest.school.school_url,latest.school.price_calculator_url,latest.school.degrees_awarded.predominant_recoded,latest.school.degrees_awarded.predominant,...,latest.school.institutional_characteristics.level,latest.school.open_admissions_policy,latest.school.accreditor_code,latest.school.title_iv.approval_date,latest.school.title_iv.eligibility_type,latest.school.ownership_peps,latest.school.endowment.begin,latest.school.endowment.end,latest.school.dolflag,latest.school.search
0,100937,Birmingham-Southern College,Birmingham,AL,35254,Southern Association of Colleges and Schools C...,www.bsc.edu/,www.bsc.edu/fp/np-calculator.cfm,3.0,3,...,1,2.0,SACSCC,8/23/1976,1,2.0,53117638.0,49403369.0,0,
1,101116,South University-Montgomery,Montgomery,AL,36116,Southern Association of Colleges and Schools C...,www.southuniversity.edu/,www.southuniversity.edu/montgomery/net-price-c...,3.0,3,...,1,1.0,SACSCC,5/8/1998,1,3.0,,,0,
2,101189,Faulkner University,Montgomery,AL,36109-3390,Southern Association of Colleges and Schools C...,www.faulkner.edu/,www.faulkner.edu/net-price-calculator/,3.0,3,...,1,2.0,SACSCC,12/2/1968,1,2.0,18492390.0,17838794.0,0,
3,101435,Huntingdon College,Montgomery,AL,36106-2148,Southern Association of Colleges and Schools C...,www.huntingdon.edu/,hawk.huntingdon.edu/oiac/netpricecalculator/in...,3.0,3,...,1,2.0,SACSCC,12/19/1965,1,2.0,50818280.0,51910937.0,0,
4,101453,Heritage Christian University,Florence,AL,35630-9977,Association for Bibical Higher Educaiton,www.hcu.edu/,www.hcu.edu/netpricecalculator/,3.0,3,...,1,2.0,ABHE,8/26/1981,1,2.0,11825721.0,12502663.0,0,


In [6]:
df_school = pd.concat([df_school_public, df_school_private])
print(df_school.shape)
df_school.head()

(2017, 49)


Unnamed: 0,id,latest.school.name,latest.school.city,latest.school.state,latest.school.zip,latest.school.accreditor,latest.school.school_url,latest.school.price_calculator_url,latest.school.degrees_awarded.predominant_recoded,latest.school.degrees_awarded.predominant,...,latest.school.institutional_characteristics.level,latest.school.open_admissions_policy,latest.school.accreditor_code,latest.school.title_iv.approval_date,latest.school.title_iv.eligibility_type,latest.school.ownership_peps,latest.school.endowment.begin,latest.school.endowment.end,latest.school.dolflag,latest.school.search
0,100654,Alabama A & M University,Normal,AL,35762,Southern Association of Colleges and Schools C...,www.aamu.edu/,www.aamu.edu/admissions-aid/tuition-fees/net-p...,3.0,3,...,1,2.0,SACSCC,12/12/1965,1,1.0,,,0,
1,100663,University of Alabama at Birmingham,Birmingham,AL,35294-0110,Southern Association of Colleges and Schools C...,https://www.uab.edu/,https://tcc.ruffalonl.com/University of Alabam...,3.0,3,...,1,2.0,SACSCC,12/1/1965,1,1.0,537349307.0,539858544.0,0,
2,100706,University of Alabama in Huntsville,Huntsville,AL,35899,Southern Association of Colleges and Schools C...,www.uah.edu/,finaid.uah.edu/,3.0,3,...,1,2.0,SACSCC,12/1/1965,1,1.0,77250279.0,75837207.0,1,
3,100724,Alabama State University,Montgomery,AL,36104-0271,Southern Association of Colleges and Schools C...,www.alasu.edu/,www.alasu.edu/cost-aid/tuition-costs/net-price...,3.0,3,...,1,2.0,SACSCC,12/1/1965,1,1.0,94536751.0,111315175.0,0,
4,100751,The University of Alabama,Tuscaloosa,AL,35487-0100,Southern Association of Colleges and Schools C...,www.ua.edu/,financialaid.ua.edu/net-price-calculator/,3.0,3,...,1,2.0,SACSCC,12/1/1965,1,1.0,946690144.0,939393269.0,1,


In [9]:
# Each response from the API returns data for up to 100 schools
# The pages are packaged into a list of DataFrames, which is vertically concatenated into the full list of public schools
# The same is done for private schools, and the two resulting full lists are also vertically concatenated

# Initialize empty list of DFs
dfs_student_public = []
# API call student info for public schools
for i in range(6):
    response = requests.get(url,
                            params={'fields':'id,{year}.student,{year}.completion.consumer_rate'.format(year='latest'),
                                    'school.degrees_awarded.predominant':'3',
                                    'school.ownership':'1',
                                    'keys_nested':'true',
                                    'per_page':'100',
                                    'page':i,
                                    'api_key':API_KEY}
                           )
    res = response.json()
    # Select key for array of results
    data_student_public = res['results']
    # Serializing json
    json_object = json.dumps(data_student_public, indent=4)
    # Writing to file
    with open("../data/raw/student_public_{}_page_{}.json".format('latest',str(i)), "w") as outfile:
        outfile.write(json_object)
    dfs_student_public.append(pd.json_normalize(data_student_public))
df_student_public = pd.concat(dfs_student_public)
print(df_student_public.shape)
df_student_public.head()

(581, 120)


Unnamed: 0,id,latest.completion.consumer_rate,latest.student.size,latest.student.enrollment.all,latest.student.enrollment.undergrad_12_month,latest.student.enrollment.grad_12_month,latest.student.demographics.race_ethnicity.white,latest.student.demographics.race_ethnicity.black,latest.student.demographics.race_ethnicity.hispanic,latest.student.demographics.race_ethnicity.asian,...,latest.student.students_with_pell_grant,latest.student.undergrads_with_pell_grant_or_federal_student_loan,latest.student.undergrads_non_degree_seeking,latest.student.grad_students,latest.student.retention_rate_suppressed.four_year.full_time_pooled,latest.student.retention_rate_suppressed.four_year.part_time_pooled,latest.student.retention_rate_suppressed.lt_four_year.full_time_pooled,latest.student.retention_rate_suppressed.lt_four_year.part_time_pooled,latest.student.ftft_undergrads_with_pell_grant_or_federal_student_loan,latest.student.ftft_undergrads_with_pell_grant_or_federal_student_loan_pooled
0,100654,0.3098,5090,,5479,1081.0,0.0159,0.9022,0.0116,0.0012,...,0.852793,5273.0,3.0,884.0,0.5533,0.283,,,1688.0,3090.0
1,100663,0.5615,13549,,14969,10874.0,0.5496,0.2401,0.061,0.0704,...,0.62493,13836.0,329.0,8685.0,0.8477,0.5417,,,2304.0,4549.0
2,100706,0.5362,7825,,8898,2414.0,0.7173,0.0907,0.0599,0.0354,...,0.557137,7987.0,202.0,1972.0,0.8234,0.2059,,,1489.0,2917.0
3,100724,0.3196,3603,,4127,513.0,0.0167,0.9265,0.013,0.0019,...,0.874601,3750.0,11.0,458.0,0.6164,0.25,,,1000.0,2025.0
4,100751,0.6758,30610,,35872,6224.0,0.7695,0.1024,0.0512,0.0131,...,0.452632,32795.0,1060.0,6170.0,0.8708,0.5246,,,6734.0,13366.0


In [10]:
# Each response from the API returns data for up to 100 schools
# The pages are packaged into a list of DataFrames, which is vertically concatenated into the full list of private schools
# The same was done for public schools, and the two resulting full lists are also vertically concatenated

# Initialize empty list of DFs
dfs_student_private = []
# API call student info for private schools
for i in range(15):
    response = requests.get(url,
                            params={'fields':'id,{year}.student,{year}.completion.consumer_rate'.format(year='latest'),
                                    'school.degrees_awarded.predominant':'3',
                                    'school.ownership':'2,3',
                                    'keys_nested':'true',
                                    'per_page':'100',
                                    'page':i,
                                    'api_key':API_KEY}
                           )
    res = response.json()
    # Select key for array of results
    data_student_private = res['results']
    # Serializing json
    json_object = json.dumps(data_student_private, indent=4)
    # Writing to file
    with open("../data/raw/student_private_{}_page_{}.json".format('latest',str(i)), "w") as outfile:
        outfile.write(json_object)
    dfs_student_private.append(pd.json_normalize(data_student_private))
df_student_private = pd.concat(dfs_student_private)
print(df_student_private.shape)
df_student_private.head()

(1436, 120)


Unnamed: 0,id,latest.completion.consumer_rate,latest.student.size,latest.student.enrollment.all,latest.student.enrollment.undergrad_12_month,latest.student.enrollment.grad_12_month,latest.student.demographics.race_ethnicity.white,latest.student.demographics.race_ethnicity.black,latest.student.demographics.race_ethnicity.hispanic,latest.student.demographics.race_ethnicity.asian,...,latest.student.students_with_pell_grant,latest.student.undergrads_with_pell_grant_or_federal_student_loan,latest.student.undergrads_non_degree_seeking,latest.student.grad_students,latest.student.retention_rate_suppressed.four_year.full_time_pooled,latest.student.retention_rate_suppressed.four_year.part_time_pooled,latest.student.retention_rate_suppressed.lt_four_year.full_time_pooled,latest.student.retention_rate_suppressed.lt_four_year.part_time_pooled,latest.student.ftft_undergrads_with_pell_grant_or_federal_student_loan,latest.student.ftft_undergrads_with_pell_grant_or_federal_student_loan_pooled
0,100937,0.6647,1129.0,,1237.0,,0.7653,0.1435,0.0452,0.0204,...,0.466063,1209.0,,,0.797,,,,332.0,660.0
1,101116,0.1848,289.0,,487.0,107.0,0.3114,0.6505,0.0173,0.0,...,0.777258,251.0,,60.0,0.4118,,,,13.0,48.0
2,101189,0.4178,1834.0,,2645.0,1125.0,0.4564,0.4215,0.03,0.0076,...,0.777463,2240.0,264.0,863.0,0.6074,0.3529,,,258.0,512.0
3,101435,0.4857,917.0,,1109.0,,0.6696,0.1832,0.0654,0.0087,...,0.628698,1008.0,3.0,,0.6587,,,,256.0,504.0
4,101453,0.2157,70.0,,85.0,46.0,0.9,0.0429,0.0286,0.0,...,,59.0,,40.0,,,,,1.0,12.0


In [11]:
df_student = pd.concat([df_student_public, df_student_private])
print(df_student.shape)
df_student.head()

(2017, 120)


Unnamed: 0,id,latest.completion.consumer_rate,latest.student.size,latest.student.enrollment.all,latest.student.enrollment.undergrad_12_month,latest.student.enrollment.grad_12_month,latest.student.demographics.race_ethnicity.white,latest.student.demographics.race_ethnicity.black,latest.student.demographics.race_ethnicity.hispanic,latest.student.demographics.race_ethnicity.asian,...,latest.student.students_with_pell_grant,latest.student.undergrads_with_pell_grant_or_federal_student_loan,latest.student.undergrads_non_degree_seeking,latest.student.grad_students,latest.student.retention_rate_suppressed.four_year.full_time_pooled,latest.student.retention_rate_suppressed.four_year.part_time_pooled,latest.student.retention_rate_suppressed.lt_four_year.full_time_pooled,latest.student.retention_rate_suppressed.lt_four_year.part_time_pooled,latest.student.ftft_undergrads_with_pell_grant_or_federal_student_loan,latest.student.ftft_undergrads_with_pell_grant_or_federal_student_loan_pooled
0,100654,0.3098,5090.0,,5479.0,1081.0,0.0159,0.9022,0.0116,0.0012,...,0.852793,5273.0,3.0,884.0,0.5533,0.283,,,1688.0,3090.0
1,100663,0.5615,13549.0,,14969.0,10874.0,0.5496,0.2401,0.061,0.0704,...,0.62493,13836.0,329.0,8685.0,0.8477,0.5417,,,2304.0,4549.0
2,100706,0.5362,7825.0,,8898.0,2414.0,0.7173,0.0907,0.0599,0.0354,...,0.557137,7987.0,202.0,1972.0,0.8234,0.2059,,,1489.0,2917.0
3,100724,0.3196,3603.0,,4127.0,513.0,0.0167,0.9265,0.013,0.0019,...,0.874601,3750.0,11.0,458.0,0.6164,0.25,,,1000.0,2025.0
4,100751,0.6758,30610.0,,35872.0,6224.0,0.7695,0.1024,0.0512,0.0131,...,0.452632,32795.0,1060.0,6170.0,0.8708,0.5246,,,6734.0,13366.0


In [12]:
# Each response from the API returns data for up to 100 schools
# The pages are packaged into a list of DataFrames, which is vertically concatenated into the full list of public schools
# The same is done for private schools, and the two resulting full lists are also vertically concatenated

# Initialize empty list of DFs
dfs_cost_public = []
# API call cost info for public schools
for i in range(6):
    response = requests.get(url,
                            params={'fields':'id,{year}.cost.avg_net_price.public,{year}.cost.avg_net_price.consumer,{year}.cost.avg_net_price.overall,{year}.cost.net_price.public,{year}.cost.net_price.consumer,{year}.cost.title_iv.public,{year}.cost.attendance.academic_year,{year}.cost.tuition.in_state,{year}.cost.tuition.out_of_state,{year}.cost.booksupply,{year}.cost.roomboard,{year}.cost.otherexpense'.format(year='latest'),
                                    'school.degrees_awarded.predominant':'3',
                                    'school.ownership':'1',
                                    'keys_nested':'true',
                                    'per_page':'100',
                                    'page':i,
                                    'api_key':API_KEY}
                           )
    res = response.json()
    # Select key for array of results
    data_cost_public = res['results']
    # Serializing json
    json_object = json.dumps(data_cost_public, indent=4)
    # Writing to file
    with open("../data/raw/cost_public_{}_page_{}.json".format('latest',str(i)), "w") as outfile:
        outfile.write(json_object)
    dfs_cost_public.append(pd.json_normalize(data_cost_public))
df_cost_public = pd.concat(dfs_cost_public)
print(df_cost_public.shape)
df_cost_public.head()

(581, 36)


Unnamed: 0,id,latest.cost.avg_net_price.public,latest.cost.avg_net_price.overall,latest.cost.avg_net_price.consumer.median_by_pred_degree,latest.cost.avg_net_price.consumer.overall_median,latest.cost.attendance.academic_year,latest.cost.tuition.in_state,latest.cost.tuition.out_of_state,latest.cost.booksupply,latest.cost.net_price.public.by_income_level.0-30000,...,latest.cost.title_iv.public.by_income_level.0-30000,latest.cost.title_iv.public.by_income_level.30001-48000,latest.cost.title_iv.public.by_income_level.48001-75000,latest.cost.title_iv.public.by_income_level.75001-110000,latest.cost.title_iv.public.by_income_level.110001-plus,latest.cost.roomboard.oncampus,latest.cost.roomboard.offcampus,latest.cost.otherexpense.oncampus,latest.cost.otherexpense.offcampus,latest.cost.otherexpense.withfamily
0,100654,15529.0,15529.0,19525.5,15950.5,23445.0,10024.0,18634.0,1600.0,14694.0,...,420.0,135.0,76.0,33.0,26.0,9240.0,9240.0,3090.0,3090.0,3440.0
1,100663,16530.0,16530.0,19525.5,15950.5,25542.0,8568.0,20400.0,1200.0,13443.0,...,378.0,216.0,202.0,204.0,232.0,12307.0,12307.0,5555.0,5555.0,5555.0
2,100706,17208.0,17208.0,19525.5,15950.5,24861.0,11338.0,23734.0,2200.0,13631.0,...,117.0,89.0,97.0,107.0,128.0,10652.0,10652.0,4076.0,4076.0,4076.0
3,100724,19534.0,19534.0,19525.5,15950.5,21892.0,11068.0,19396.0,1600.0,19581.0,...,326.0,121.0,38.0,17.0,6.0,6050.0,7320.0,3392.0,4228.0,4228.0
4,100751,20917.0,20917.0,19525.5,15950.5,30016.0,11620.0,31090.0,1000.0,17523.0,...,373.0,206.0,181.0,177.0,335.0,13810.0,13810.0,4620.0,4620.0,5692.0


As mentioned, retrieving the cost features specific to public schools was the driving reason for separating the calls. To concatenate these DataFrames, the '.public' segment must be removed from the column labels so that they match the corresponding columns from the private-school call, which receive the same treatment.

In [13]:
public_columns = [feature for feature in df_cost_public.columns if 'public' in feature]
# Escape the dot so it matches a literal '.' rather than any character
df_cost_public.columns = df_cost_public.columns.str.replace(r'\.public','',regex=True)
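
The renaming step depends on removing the literal substring `.public`. In a regular expression an unescaped dot matches any character, so the pattern should escape it. A minimal sketch with the standard `re` module; the last column name is hypothetical and included only to show where the two patterns differ.

```python
import re

columns = [
    'latest.cost.avg_net_price.public',
    'latest.cost.net_price.public.by_income_level.0-30000',
    'latest.cost.tuition.in_state',
    'latest.cost.xpublic_share',  # hypothetical label, not an API field
]

# Escaped dot: only the literal '.public' segment is removed.
escaped = [re.sub(r'\.public', '', name) for name in columns]
# Unescaped dot: any character followed by 'public' is removed.
unescaped = [re.sub(r'.public', '', name) for name in columns]

print(escaped[3])    # latest.cost.xpublic_share
print(unescaped[3])  # latest.cost._share
```

For the column labels in this notebook both patterns give the same result, but the escaped form states the intent and stays correct if new fields are added.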

In [14]:
# Each response from the API returns data for up to 100 schools
# The pages are packaged into a list of DataFrames, which is vertically concatenated into the full list of private schools
# The same was done for public schools, and the two resulting full lists are also vertically concatenated

# Initialize empty list of DFs
dfs_cost_private = []
# API call cost info for private schools
for i in range(15):
    response = requests.get(url,
                            params={'fields':'id,{year}.cost.avg_net_price.private,{year}.cost.avg_net_price.consumer,{year}.cost.avg_net_price.overall,{year}.cost.net_price.private,{year}.cost.net_price.consumer,{year}.cost.title_iv.private,{year}.cost.attendance.academic_year,{year}.cost.tuition.in_state,{year}.cost.tuition.out_of_state,{year}.cost.booksupply,{year}.cost.roomboard,{year}.cost.otherexpense'.format(year='latest'),
                                    'school.degrees_awarded.predominant':'3',
                                    'school.ownership':'2,3',
                                    'keys_nested':'true',
                                    'per_page':'100',
                                    'page':i,
                                    'api_key':API_KEY}
                           )
    res = response.json()
    # Select key for array of results
    data_cost_private = res['results']
    # Serializing json
    json_object = json.dumps(data_cost_private, indent=4)
    # Writing to file
    with open("../data/raw/cost_private_{}_page_{}.json".format('latest',str(i)), "w") as outfile:
        outfile.write(json_object)
    dfs_cost_private.append(pd.json_normalize(data_cost_private))
df_cost_private = pd.concat(dfs_cost_private)
print(df_cost_private.shape)
df_cost_private.head()

(1436, 36)


Unnamed: 0,id,latest.cost.avg_net_price.private,latest.cost.avg_net_price.overall,latest.cost.avg_net_price.consumer.median_by_pred_degree,latest.cost.avg_net_price.consumer.overall_median,latest.cost.attendance.academic_year,latest.cost.tuition.in_state,latest.cost.tuition.out_of_state,latest.cost.booksupply,latest.cost.net_price.private.by_income_level.0-30000,...,latest.cost.title_iv.private.by_income_level.0-30000,latest.cost.title_iv.private.by_income_level.30001-48000,latest.cost.title_iv.private.by_income_level.48001-75000,latest.cost.title_iv.private.by_income_level.75001-110000,latest.cost.title_iv.private.by_income_level.110001-plus,latest.cost.roomboard.oncampus,latest.cost.roomboard.offcampus,latest.cost.otherexpense.oncampus,latest.cost.otherexpense.offcampus,latest.cost.otherexpense.withfamily
0,100937,19808.0,19808.0,19525.5,15950.5,32514.0,18900.0,18900.0,1260.0,17186.0,...,49.0,13.0,32.0,26.0,62.0,12900.0,8000.0,2770.0,4690.0,4690.0
1,101116,20518.0,20518.0,19525.5,15950.5,28761.0,17014.0,17014.0,1500.0,20485.0,...,8.0,2.0,1.0,0.0,1.0,,7108.0,,6562.0,4593.0
2,101189,20500.0,20500.0,19525.5,15950.5,34835.0,22990.0,22990.0,1800.0,19201.0,...,80.0,22.0,42.0,39.0,29.0,8100.0,8200.0,4400.0,4800.0,4800.0
3,101435,21632.0,21632.0,19525.5,15950.5,37483.0,27900.0,27900.0,300.0,17960.0,...,52.0,23.0,30.0,35.0,61.0,10150.0,10150.0,2068.0,2068.0,2035.0
4,101453,,,19525.5,15950.5,,11532.0,11532.0,1000.0,,...,,,,,,5745.0,13932.0,2466.0,3897.0,3492.0


We must do the same for the private schools, removing the string '.private' from all column labels containing it. The columns labeled as private or public hold the same information, but the API returns them as two separate objects, and the object for the ownership category that does not apply to a school is populated entirely with null values.

In [15]:
private_columns = [feature for feature in df_cost_private.columns if 'private' in feature]
# Escape the dot so it matches a literal '.' rather than any character
df_cost_private.columns = df_cost_private.columns.str.replace(r'\.private','',regex=True)
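
Once both frames have had their ownership segment removed, their column labels should line up exactly, which is what lets the concatenation stack rows without introducing new null columns. A small sketch of that check on toy frames, using two of the net price values shown above; the use of `regex=False` and `ignore_index=True` here is an illustrative choice, not the notebook's.

```python
import pandas as pd

public = pd.DataFrame({'id': [100654], 'latest.cost.avg_net_price.public': [15529.0]})
private = pd.DataFrame({'id': [100937], 'latest.cost.avg_net_price.private': [19808.0]})

# Literal (non-regex) replacement avoids any wildcard behavior of the dot.
public.columns = public.columns.str.replace('.public', '', regex=False)
private.columns = private.columns.str.replace('.private', '', regex=False)

# Confirm the labels match before stacking the rows.
columns_match = list(public.columns) == list(private.columns)
combined = pd.concat([public, private], ignore_index=True)

print(columns_match)
print(combined.shape)
```

If the labels did not match, `pd.concat` would still succeed but would pad the unmatched columns with nulls, so the explicit comparison is a cheap safeguard.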

In [16]:
df_cost = pd.concat([df_cost_public, df_cost_private])
print(df_cost.shape)
df_cost.head()

(2017, 36)


Unnamed: 0,id,latest.cost.avg_net_price,latest.cost.avg_net_price.overall,latest.cost.avg_net_price.consumer.median_by_pred_degree,latest.cost.avg_net_price.consumer.overall_median,latest.cost.attendance.academic_year,latest.cost.tuition.in_state,latest.cost.tuition.out_of_state,latest.cost.booksupply,latest.cost.net_price.by_income_level.0-30000,...,latest.cost.title_iv.by_income_level.0-30000,latest.cost.title_iv.by_income_level.30001-48000,latest.cost.title_iv.by_income_level.48001-75000,latest.cost.title_iv.by_income_level.75001-110000,latest.cost.title_iv.by_income_level.110001-plus,latest.cost.roomboard.oncampus,latest.cost.roomboard.offcampus,latest.cost.otherexpense.oncampus,latest.cost.otherexpense.offcampus,latest.cost.otherexpense.withfamily
0,100654,15529.0,15529.0,19525.5,15950.5,23445.0,10024.0,18634.0,1600.0,14694.0,...,420.0,135.0,76.0,33.0,26.0,9240.0,9240.0,3090.0,3090.0,3440.0
1,100663,16530.0,16530.0,19525.5,15950.5,25542.0,8568.0,20400.0,1200.0,13443.0,...,378.0,216.0,202.0,204.0,232.0,12307.0,12307.0,5555.0,5555.0,5555.0
2,100706,17208.0,17208.0,19525.5,15950.5,24861.0,11338.0,23734.0,2200.0,13631.0,...,117.0,89.0,97.0,107.0,128.0,10652.0,10652.0,4076.0,4076.0,4076.0
3,100724,19534.0,19534.0,19525.5,15950.5,21892.0,11068.0,19396.0,1600.0,19581.0,...,326.0,121.0,38.0,17.0,6.0,6050.0,7320.0,3392.0,4228.0,4228.0
4,100751,20917.0,20917.0,19525.5,15950.5,30016.0,11620.0,31090.0,1000.0,17523.0,...,373.0,206.0,181.0,177.0,335.0,13810.0,13810.0,4620.0,4620.0,5692.0


In [17]:
# Each response from the API returns data for up to 100 schools
# The pages are packaged into a list of DataFrames, which is vertically concatenated into the full list of public schools
# The same is done for private schools, and the two resulting full lists are also vertically concatenated

# Initialize empty list of DFs
dfs_aid_admit_public = []
# API call aid/admissions info for public schools
for i in range(6):
    response = requests.get(url,
                            params={'fields':'id,{year}.aid,{year}.admissions'.format(year='latest'),
                                    'school.degrees_awarded.predominant':'3',
                                    'school.ownership':'1',
                                    'keys_nested':'true',
                                    'per_page':'100',
                                    'page':i,
                                    'api_key':API_KEY}
                           )
    res = response.json()
    # Select key for array of results
    data_aid_admit_public = res['results']
    # Serializing json
    json_object = json.dumps(data_aid_admit_public, indent=4)
    # Writing to file
    with open("../data/raw/aid_admit_public_{}_page_{}.json".format('latest',str(i)), "w") as outfile:
        outfile.write(json_object)
    dfs_aid_admit_public.append(pd.json_normalize(data_aid_admit_public))
df_aid_admit_public = pd.concat(dfs_aid_admit_public)
print(df_aid_admit_public.shape)
df_aid_admit_public.head()

(581, 134)


Unnamed: 0,id,latest.aid.pell_grant_rate,latest.aid.federal_loan_rate,latest.aid.loan_principal,latest.aid.median_debt.completers.overall,latest.aid.median_debt.completers.monthly_payments,latest.aid.median_debt.noncompleters,latest.aid.median_debt.income.0_30000,latest.aid.median_debt.income.30001_75000,latest.aid.median_debt.income.greater_than_75000,...,latest.admissions.act_scores.25th_percentile.writing,latest.admissions.act_scores.75th_percentile.cumulative,latest.admissions.act_scores.75th_percentile.english,latest.admissions.act_scores.75th_percentile.math,latest.admissions.act_scores.75th_percentile.writing,latest.admissions.act_scores.midpoint.cumulative,latest.admissions.act_scores.midpoint.english,latest.admissions.act_scores.midpoint.math,latest.admissions.act_scores.midpoint.writing,latest.admissions.test_requirements
0,100654,0.7095,0.7504,15250.0,31000.0,309.897388,10221.0,16000.0,15159.0,14463.0,...,,20.0,20.0,18.0,,18.0,17.0,17.0,,1.0
1,100663,0.3397,0.4688,15085.0,22250.0,222.426351,9500.0,16219.0,15000.0,14591.0,...,,30.0,33.0,27.0,,26.0,28.0,24.0,,1.0
2,100706,0.2403,0.3855,14000.0,21450.0,214.428999,9500.0,14126.0,14639.0,13500.0,...,,31.0,33.0,29.0,,28.0,29.0,26.0,,1.0
3,100724,0.7368,0.7805,17500.0,31000.0,309.897388,10489.0,17827.0,15875.0,16500.0,...,,20.0,20.0,20.0,,17.0,17.0,17.0,,1.0
4,100751,0.1718,0.3644,17671.0,23072.0,230.64363,9500.0,17500.0,18000.0,17500.0,...,7.0,31.0,33.0,29.0,8.0,27.0,28.0,25.0,8.0,1.0


In [19]:
# Each response from the API returns data for up to 100 schools
# The pages are packaged into a list of DataFrames, which is vertically concatenated into the full list of private schools
# The same was done for public schools, and the two resulting full lists are also vertically concatenated

# Initialize empty list of DFs
dfs_aid_admit_private = []
# API call aid/admissions info for private schools
for i in range(15):
    response = requests.get(url,
                            params={'fields':'id,{year}.aid,{year}.admissions'.format(year='latest'),
                                    'school.degrees_awarded.predominant':'3',
                                    'school.ownership':'2,3',
                                    'keys_nested':'true',
                                    'per_page':'100',
                                    'page':i,
                                    'api_key':API_KEY}
                           )
    res = response.json()
    # Select key for array of results
    data_aid_admit_private = res['results']
    # Serializing json
    json_object = json.dumps(data_aid_admit_private, indent=4)
    # Writing to file
    with open("../data/raw/aid_admit_private_{}_page_{}.json".format('latest',str(i)), "w") as outfile:
        outfile.write(json_object)
    dfs_aid_admit_private.append(pd.json_normalize(data_aid_admit_private))
df_aid_admit_private = pd.concat(dfs_aid_admit_private)
print(df_aid_admit_private.shape)
df_aid_admit_private.head()

(1436, 134)


Unnamed: 0,id,latest.aid.pell_grant_rate,latest.aid.federal_loan_rate,latest.aid.loan_principal,latest.aid.median_debt.completers.overall,latest.aid.median_debt.completers.monthly_payments,latest.aid.median_debt.noncompleters,latest.aid.median_debt.income.0_30000,latest.aid.median_debt.income.30001_75000,latest.aid.median_debt.income.greater_than_75000,...,latest.admissions.act_scores.25th_percentile.writing,latest.admissions.act_scores.75th_percentile.cumulative,latest.admissions.act_scores.75th_percentile.english,latest.admissions.act_scores.75th_percentile.math,latest.admissions.act_scores.75th_percentile.writing,latest.admissions.act_scores.midpoint.cumulative,latest.admissions.act_scores.midpoint.english,latest.admissions.act_scores.midpoint.math,latest.admissions.act_scores.midpoint.writing,latest.admissions.test_requirements
0,100937,0.2258,0.4615,16000.0,25800.0,257.9146,7500.0,14750.0,17875.0,16250.0,...,,28.0,30.0,27.0,,25.0,26.0,24.0,,5.0
1,101116,0.5936,0.6773,13516.0,25379.0,253.705994,9500.0,12667.0,15444.0,15277.0,...,,,,,,,,,,
2,101189,0.5009,0.6384,14250.0,23000.0,229.923869,8250.0,15105.0,12945.0,14250.0,...,,23.0,24.0,22.0,,21.0,21.0,19.0,,1.0
3,101435,0.4077,0.7252,17500.0,27000.0,269.910628,6500.0,20000.0,17303.0,15000.0,...,,24.0,25.0,24.0,,22.0,22.0,21.0,,1.0
4,101453,0.4915,0.1017,,,,,,,,...,11.0,,,,12.0,,,,12.0,1.0
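
The paginated loops above all share the same shape, so the pattern could be factored into one helper. The following is a minimal sketch rather than the code used in this notebook: `collect_pages` and `fetch_page` are hypothetical names, and `fetch_page(page)` is assumed to return the `results` list for one zero-based page (for example by wrapping the `requests.get` call above). Stopping at the first empty page avoids hard-coding the number of pages.

```python
import pandas as pd

def collect_pages(fetch_page, max_pages=50):
    """Concatenate paginated API results into one DataFrame.

    fetch_page is a hypothetical callable: it takes a zero-based page
    number and returns that page's list of result dicts.
    """
    frames = []
    for page in range(max_pages):
        results = fetch_page(page)
        if not results:  # an empty page means there is nothing left to read
            break
        frames.append(pd.json_normalize(results))
    if not frames:
        return pd.DataFrame()
    # ignore_index gives a clean 0..n-1 index instead of restarting at each page
    return pd.concat(frames, ignore_index=True)


# Stand-in for the API: three pages of made-up records, then empty pages
fake_pages = [
    [{"id": 1, "latest": {"aid": {"pell_grant_rate": 0.5}}}],
    [{"id": 2, "latest": {"aid": {"pell_grant_rate": 0.3}}}],
    [{"id": 3, "latest": {"aid": {"pell_grant_rate": 0.7}}}],
]
df_pages = collect_pages(lambda page: fake_pages[page] if page < len(fake_pages) else [])
print(df_pages.shape)  # (3, 2)
```

In the real calls, `fetch_page` would also be the natural place to check the HTTP status before parsing the JSON body.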


In [20]:
df_aid_admit = pd.concat([df_aid_admit_public, df_aid_admit_private])
print(df_aid_admit.shape)
df_aid_admit.head()

(2017, 134)


Unnamed: 0,id,latest.aid.pell_grant_rate,latest.aid.federal_loan_rate,latest.aid.loan_principal,latest.aid.median_debt.completers.overall,latest.aid.median_debt.completers.monthly_payments,latest.aid.median_debt.noncompleters,latest.aid.median_debt.income.0_30000,latest.aid.median_debt.income.30001_75000,latest.aid.median_debt.income.greater_than_75000,...,latest.admissions.act_scores.25th_percentile.writing,latest.admissions.act_scores.75th_percentile.cumulative,latest.admissions.act_scores.75th_percentile.english,latest.admissions.act_scores.75th_percentile.math,latest.admissions.act_scores.75th_percentile.writing,latest.admissions.act_scores.midpoint.cumulative,latest.admissions.act_scores.midpoint.english,latest.admissions.act_scores.midpoint.math,latest.admissions.act_scores.midpoint.writing,latest.admissions.test_requirements
0,100654,0.7095,0.7504,15250.0,31000.0,309.897388,10221.0,16000.0,15159.0,14463.0,...,,20.0,20.0,18.0,,18.0,17.0,17.0,,1.0
1,100663,0.3397,0.4688,15085.0,22250.0,222.426351,9500.0,16219.0,15000.0,14591.0,...,,30.0,33.0,27.0,,26.0,28.0,24.0,,1.0
2,100706,0.2403,0.3855,14000.0,21450.0,214.428999,9500.0,14126.0,14639.0,13500.0,...,,31.0,33.0,29.0,,28.0,29.0,26.0,,1.0
3,100724,0.7368,0.7805,17500.0,31000.0,309.897388,10489.0,17827.0,15875.0,16500.0,...,,20.0,20.0,20.0,,17.0,17.0,17.0,,1.0
4,100751,0.1718,0.3644,17671.0,23072.0,230.64363,9500.0,17500.0,18000.0,17500.0,...,7.0,31.0,33.0,29.0,8.0,27.0,28.0,25.0,8.0,1.0
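
One thing to keep in mind with `pd.concat`: by default it keeps each input's index, so row labels repeat across the stacked frames. A small sketch with made-up data shows the effect, the `ignore_index=True` option, and a check that no school `id` appears twice. The two toy frames are stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-ins for the public and private frames (values are made up)
public = pd.DataFrame({"id": [100654, 100663], "rate": [0.71, 0.34]})
private = pd.DataFrame({"id": [100937, 101116], "rate": [0.23, 0.59]})

stacked_default = pd.concat([public, private])
print(stacked_default.index.tolist())  # labels repeat: [0, 1, 0, 1]

stacked = pd.concat([public, private], ignore_index=True)
print(stacked.index.tolist())          # [0, 1, 2, 3]

# A school should belong to only one ownership group
assert stacked["id"].is_unique
```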


In [22]:
# Each response from the API returns data for up to 100 schools
# The pages are packaged into a list of DataFrames, which is vertically concatenated into the full list of public schools
# The same will be done for private schools, and the two resulting full lists will also be vertically concatenated

# Initialize empty list of DFs
dfs_academics_public = []
# API call academics info for public schools
for i in range(6):
    response = requests.get(url,
                            params={'fields':'id,{year}.academics.program_percentage,{year}.academics.program.assoc,{year}.academics.program.bachelors,{year}.academics.program.degree_or_certificate'.format(year='latest'),
                                    'school.degrees_awarded.predominant':'3',
                                    'school.ownership':'1',
                                    'keys_nested':'true',
                                    'per_page':'100',
                                    'page':i,
                                    'api_key':API_KEY}
                           )
    res = response.json()
    # Select key for array of results
    data_academics_public = res['results']
    # Serializing json
    json_object = json.dumps(data_academics_public, indent=4)
    # Writing to file
    with open("../data/raw/academics_public_{}_page_{}.json".format('latest',str(i)), "w") as outfile:
        outfile.write(json_object)
    dfs_academics_public.append(pd.json_normalize(data_academics_public))
df_academics_public = pd.concat(dfs_academics_public)
print(df_academics_public.shape)
df_academics_public.head()

(581, 153)


Unnamed: 0,id,latest.academics.program_percentage.agriculture,latest.academics.program_percentage.resources,latest.academics.program_percentage.architecture,latest.academics.program_percentage.ethnic_cultural_gender,latest.academics.program_percentage.communication,latest.academics.program_percentage.communications_technology,latest.academics.program_percentage.computer,latest.academics.program_percentage.personal_culinary,latest.academics.program_percentage.education,...,latest.academics.program.degree_or_certificate.public_administration_social_service,latest.academics.program.degree_or_certificate.social_science,latest.academics.program.degree_or_certificate.construction,latest.academics.program.degree_or_certificate.mechanic_repair_technology,latest.academics.program.degree_or_certificate.precision_production,latest.academics.program.degree_or_certificate.transportation,latest.academics.program.degree_or_certificate.visual_performing,latest.academics.program.degree_or_certificate.health,latest.academics.program.degree_or_certificate.business_marketing,latest.academics.program.degree_or_certificate.history
0,100654,0.0274,0.0085,0.0051,0.0017,0.0,0.0393,0.0342,0.0,0.0393,...,1,1,0,0,0,0,1,0,1,0
1,100663,0.0,0.0,0.0,0.0022,0.0323,0.0,0.0258,0.0,0.0577,...,1,2,0,0,0,0,1,1,1,1
2,100706,0.0,0.0,0.0,0.0,0.0155,0.0,0.0667,0.0,0.0155,...,0,1,0,0,0,0,1,1,1,1
3,100724,0.0,0.0,0.0,0.0,0.0963,0.0,0.057,0.0,0.1123,...,1,1,0,0,0,0,1,1,1,1
4,100751,0.0,0.0052,0.0,0.0028,0.0931,0.0,0.014,0.0,0.0603,...,1,1,0,0,0,0,1,1,1,1


In [23]:
# Each response from the API returns data for up to 100 schools
# The pages are packaged into a list of DataFrames, which is vertically concatenated into the full list of private schools
# The same was done for public schools, and the two resulting full lists will also be vertically concatenated

# Initialize empty list of DFs
dfs_academics_private = []
# API call academics info for private schools
for i in range(15):
    response = requests.get(url,
                            params={'fields':'id,{year}.academics.program_percentage,{year}.academics.program.assoc,{year}.academics.program.bachelors,{year}.academics.program.degree_or_certificate'.format(year='latest'),
                                    'school.degrees_awarded.predominant':'3',
                                    'school.ownership':'2,3',
                                    'keys_nested':'true',
                                    'per_page':'100',
                                    'page':i,
                                    'api_key':API_KEY}
                           )
    res = response.json()
    # Select key for array of results
    data_academics_private = res['results']
    # Serializing json
    json_object = json.dumps(data_academics_private, indent=4)
    # Writing to file
    with open("../data/raw/academics_private_{}_page_{}.json".format('latest',str(i)), "w") as outfile:
        outfile.write(json_object)
    dfs_academics_private.append(pd.json_normalize(data_academics_private))
df_academics_private = pd.concat(dfs_academics_private)
print(df_academics_private.shape)
df_academics_private.head()

(1436, 153)


Unnamed: 0,id,latest.academics.program_percentage.agriculture,latest.academics.program_percentage.resources,latest.academics.program_percentage.architecture,latest.academics.program_percentage.ethnic_cultural_gender,latest.academics.program_percentage.communication,latest.academics.program_percentage.communications_technology,latest.academics.program_percentage.computer,latest.academics.program_percentage.personal_culinary,latest.academics.program_percentage.education,...,latest.academics.program.degree_or_certificate.public_administration_social_service,latest.academics.program.degree_or_certificate.social_science,latest.academics.program.degree_or_certificate.construction,latest.academics.program.degree_or_certificate.mechanic_repair_technology,latest.academics.program.degree_or_certificate.precision_production,latest.academics.program.degree_or_certificate.transportation,latest.academics.program.degree_or_certificate.visual_performing,latest.academics.program.degree_or_certificate.health,latest.academics.program.degree_or_certificate.business_marketing,latest.academics.program.degree_or_certificate.history
0,100937,0.0,0.0083,0.0041,0.0041,0.0207,0.0,0.0207,0.0,0.029,...,0,1,0,0,0,0,1,1,1,1
1,101116,0.0,0.0,0.0,0.0,0.0,0.0,0.0282,0.0,0.0,...,0,0,0,0,0,0,0,1,1,0
2,101189,0.0,0.0,0.0,0.0,0.0,0.0,0.0131,0.0,0.0174,...,0,1,0,0,0,0,1,2,2,0
3,101435,0.0,0.0,0.0,0.0,0.0429,0.0,0.0,0.0,0.0644,...,0,1,0,0,0,0,1,1,1,1
4,101453,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
df_academics = pd.concat([df_academics_public, df_academics_private])
print(df_academics.shape)
df_academics.head()

(2017, 153)


Unnamed: 0,id,latest.academics.program_percentage.agriculture,latest.academics.program_percentage.resources,latest.academics.program_percentage.architecture,latest.academics.program_percentage.ethnic_cultural_gender,latest.academics.program_percentage.communication,latest.academics.program_percentage.communications_technology,latest.academics.program_percentage.computer,latest.academics.program_percentage.personal_culinary,latest.academics.program_percentage.education,...,latest.academics.program.degree_or_certificate.public_administration_social_service,latest.academics.program.degree_or_certificate.social_science,latest.academics.program.degree_or_certificate.construction,latest.academics.program.degree_or_certificate.mechanic_repair_technology,latest.academics.program.degree_or_certificate.precision_production,latest.academics.program.degree_or_certificate.transportation,latest.academics.program.degree_or_certificate.visual_performing,latest.academics.program.degree_or_certificate.health,latest.academics.program.degree_or_certificate.business_marketing,latest.academics.program.degree_or_certificate.history
0,100654,0.0274,0.0085,0.0051,0.0017,0.0,0.0393,0.0342,0.0,0.0393,...,1,1,0,0,0,0,1,0,1,0
1,100663,0.0,0.0,0.0,0.0022,0.0323,0.0,0.0258,0.0,0.0577,...,1,2,0,0,0,0,1,1,1,1
2,100706,0.0,0.0,0.0,0.0,0.0155,0.0,0.0667,0.0,0.0155,...,0,1,0,0,0,0,1,1,1,1
3,100724,0.0,0.0,0.0,0.0,0.0963,0.0,0.057,0.0,0.1123,...,1,1,0,0,0,0,1,1,1,1
4,100751,0.0,0.0052,0.0,0.0028,0.0931,0.0,0.014,0.0,0.0603,...,1,1,0,0,0,0,1,1,1,1


In [42]:
# Inner-join the five tables on the school id
df_latest = (df_school
             .merge(df_student, on='id')
             .merge(df_cost, on='id')
             .merge(df_aid_admit, on='id')
             .merge(df_academics, on='id'))
print(df_latest.shape)
df_latest.head()

(2017, 488)


Unnamed: 0,id,latest.school.name,latest.school.city,latest.school.state,latest.school.zip,latest.school.accreditor,latest.school.school_url,latest.school.price_calculator_url,latest.school.degrees_awarded.predominant_recoded,latest.school.degrees_awarded.predominant,...,latest.academics.program.degree_or_certificate.public_administration_social_service,latest.academics.program.degree_or_certificate.social_science,latest.academics.program.degree_or_certificate.construction,latest.academics.program.degree_or_certificate.mechanic_repair_technology,latest.academics.program.degree_or_certificate.precision_production,latest.academics.program.degree_or_certificate.transportation,latest.academics.program.degree_or_certificate.visual_performing,latest.academics.program.degree_or_certificate.health,latest.academics.program.degree_or_certificate.business_marketing,latest.academics.program.degree_or_certificate.history
0,100654,Alabama A & M University,Normal,AL,35762,Southern Association of Colleges and Schools C...,www.aamu.edu/,www.aamu.edu/admissions-aid/tuition-fees/net-p...,3.0,3,...,1,1,0,0,0,0,1,0,1,0
1,100663,University of Alabama at Birmingham,Birmingham,AL,35294-0110,Southern Association of Colleges and Schools C...,https://www.uab.edu/,https://tcc.ruffalonl.com/University of Alabam...,3.0,3,...,1,2,0,0,0,0,1,1,1,1
2,100706,University of Alabama in Huntsville,Huntsville,AL,35899,Southern Association of Colleges and Schools C...,www.uah.edu/,finaid.uah.edu/,3.0,3,...,0,1,0,0,0,0,1,1,1,1
3,100724,Alabama State University,Montgomery,AL,36104-0271,Southern Association of Colleges and Schools C...,www.alasu.edu/,www.alasu.edu/cost-aid/tuition-costs/net-price...,3.0,3,...,1,1,0,0,0,0,1,1,1,1
4,100751,The University of Alabama,Tuscaloosa,AL,35487-0100,Southern Association of Colleges and Schools C...,www.ua.edu/,financialaid.ua.edu/net-price-calculator/,3.0,3,...,1,1,0,0,0,0,1,1,1,1
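
The chain of inner merges on `id` can also be written as a fold over a list of frames. The sketch below uses toy frames with made-up values and adds `validate="one_to_one"`, which makes pandas raise a `MergeError` if any `id` is duplicated on either side. That check is an added safeguard, not something the cell above does, and `merge_on_id` is a hypothetical helper name.

```python
from functools import reduce
import pandas as pd

# Toy stand-ins for df_school, df_student and df_cost (values are made up)
school = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
student = pd.DataFrame({"id": [1, 2, 3], "size": [100, 200, 300]})
cost = pd.DataFrame({"id": [1, 2, 3], "tuition": [5000, 6000, 7000]})

def merge_on_id(left, right):
    # Inner join on 'id'; validate raises MergeError if ids repeat on either side
    return pd.merge(left, right, on="id", validate="one_to_one")

merged = reduce(merge_on_id, [school, student, cost])
print(merged.shape)  # (3, 4)
```

An inner join silently drops any `id` that is missing from one of the frames, so comparing the row count before and after the merge (2017 here) is a quick sanity check.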


Now we can look at the school features from the first set of API calls and filter out unnecessary information such as the school URL and the price calculator URL.

In [43]:
df_school.columns

Index(['id', 'latest.school.name', 'latest.school.city', 'latest.school.state',
       'latest.school.zip', 'latest.school.accreditor',
       'latest.school.school_url', 'latest.school.price_calculator_url',
       'latest.school.degrees_awarded.predominant_recoded',
       'latest.school.degrees_awarded.predominant',
       'latest.school.degrees_awarded.highest',
       'latest.school.under_investigation', 'latest.school.main_campus',
       'latest.school.branches', 'latest.school.ownership',
       'latest.school.state_fips', 'latest.school.region_id',
       'latest.school.locale', 'latest.school.degree_urbanization',
       'latest.school.carnegie_basic', 'latest.school.carnegie_undergrad',
       'latest.school.carnegie_size_setting',
       'latest.school.minority_serving.historically_black',
       'latest.school.minority_serving.predominantly_black',
       'latest.school.minority_serving.annh',
       'latest.school.minority_serving.tribal',
       'latest.school.minority_s

We will drop 'latest.school.city', 'latest.school.state' (the FIPS code will be used instead), 'latest.school.zip', 'latest.school.accreditor' (the accreditor code will be used instead), 'latest.school.school_url', 'latest.school.price_calculator_url', 'latest.school.degrees_awarded.predominant_recoded' (these recodes don't apply to universities), 'latest.school.degree_urbanization' (DISC), 'latest.school.alias', 'latest.school.ownership_peps' (redundant with the other ownership column), and 'latest.school.search'.

In [44]:
df_latest = df_latest.drop(columns=['latest.school.city', 'latest.school.state', 'latest.school.zip', 'latest.school.accreditor', 'latest.school.school_url', 'latest.school.price_calculator_url', 'latest.school.degrees_awarded.predominant_recoded', 'latest.school.degree_urbanization', 'latest.school.alias', 'latest.school.ownership_peps', 'latest.school.search'])
df_latest.shape

(2017, 477)

Certain demographic features are discontinued (info in data dictionary included in references directory). These columns should be dropped as well.

In [50]:
# These columns are null for every school: the features are either discontinued or not relevant to 4-year colleges.
df_student.columns[df_student.isna().all()]

Index(['latest.student.enrollment.all',
       'latest.student.demographics.race_ethnicity.white_non_hispanic',
       'latest.student.demographics.race_ethnicity.black_non_hispanic',
       'latest.student.demographics.race_ethnicity.asian_pacific_islander',
       'latest.student.demographics.race_ethnicity.aian_prior_2009',
       'latest.student.demographics.race_ethnicity.hispanic_prior_2009',
       'latest.student.demographics.race_ethnicity.unknown_2000',
       'latest.student.demographics.race_ethnicity.white_2000',
       'latest.student.demographics.race_ethnicity.black_2000',
       'latest.student.demographics.race_ethnicity.api_2000',
       'latest.student.demographics.race_ethnicity.aian_2000',
       'latest.student.demographics.race_ethnicity.hispanic_2000',
       'latest.student.demographics.non_resident_aliens_2000',
       'latest.student.demographics.age_entry_squared',
       'latest.student.demographics.avg_family_income_log',
       'latest.student.demographi

In [52]:
df_latest = df_latest.drop(columns=['latest.student.enrollment.all',
       'latest.student.demographics.race_ethnicity.white_non_hispanic',
       'latest.student.demographics.race_ethnicity.black_non_hispanic',
       'latest.student.demographics.race_ethnicity.asian_pacific_islander',
       'latest.student.demographics.race_ethnicity.aian_prior_2009',
       'latest.student.demographics.race_ethnicity.hispanic_prior_2009',
       'latest.student.demographics.race_ethnicity.unknown_2000',
       'latest.student.demographics.race_ethnicity.white_2000',
       'latest.student.demographics.race_ethnicity.black_2000',
       'latest.student.demographics.race_ethnicity.api_2000',
       'latest.student.demographics.race_ethnicity.aian_2000',
       'latest.student.demographics.race_ethnicity.hispanic_2000',
       'latest.student.demographics.non_resident_aliens_2000',
       'latest.student.demographics.age_entry_squared',
       'latest.student.demographics.avg_family_income_log',
       'latest.student.demographics.avg_family_income_independents_log',
       'latest.student.part_time_share_2000',
       'latest.student.retention_rate.lt_four_year.full_time',
       'latest.student.retention_rate.lt_four_year.part_time',
       'latest.student.retention_rate.lt_four_year.full_time_pooled',
       'latest.student.retention_rate.lt_four_year.part_time_pooled',
       'latest.student.retention_rate.cohort.lt_four_year.full_time_pooled',
       'latest.student.retention_rate.cohort.lt_four_year.part_time_pooled',
       'latest.student.fafsa_sent.overall',
       'latest.student.fafsa_sent.1_college',
       'latest.student.fafsa_sent.2_colleges',
       'latest.student.fafsa_sent.3_college',
       'latest.student.fafsa_sent.4_colleges',
       'latest.student.fafsa_sent.5_or_more_colleges',
       'latest.student.retention_rate_suppressed.lt_four_year.full_time_pooled',
       'latest.student.retention_rate_suppressed.lt_four_year.part_time_pooled'])
df_latest.shape

(2017, 446)
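
Instead of pasting the column names by hand, the fully null columns could be selected and dropped programmatically. This is a minimal sketch on a toy frame (column names and values are made up). On the real data it would also drop any other column that happens to be entirely null, which may or may not be desired.

```python
import numpy as np
import pandas as pd

# Toy frame: 'old_feature' is null for every row, 'rate' only for some
df_toy = pd.DataFrame({
    "id": [1, 2, 3],
    "rate": [0.5, np.nan, 0.7],
    "old_feature": [np.nan, np.nan, np.nan],
})

# Columns where every value is missing
all_null = df_toy.columns[df_toy.isna().all()]
print(list(all_null))  # ['old_feature']

# Equivalent one-step drop of the fully null columns
df_trimmed = df_toy.dropna(axis=1, how="all")
print(df_trimmed.shape)  # (3, 2)
```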

Within admissions, we will remove any column related to ACT or SAT writing, as these sections are no longer included on the test, and these features account for a high volume of missing values.

In [59]:
writing_columns = [feature for feature in df_latest.columns if 'writing' in feature]
df_latest = df_latest.drop(columns=writing_columns)
df_latest.shape

(2017, 440)

Those were the columns that were obvious to drop. Let's take a closer look at where the remaining missing values lie.

In [62]:
# Check and get a sense of where missing values lie
pd.set_option('display.max_rows',300)
df_latest.isna().sum()[df_latest.isna().sum() > 0].sort_values(ascending=False)

latest.aid.plus_debt.nostafford_any_school.eval_inst.median                      1715
latest.aid.plus_debt.stafford_any_school.eval_inst.median                        1715
latest.aid.plus_debt.nostafford_this_school.eval_inst.median                     1685
latest.aid.plus_debt.stafford_this_school.eval_inst.median                       1685
latest.aid.plus_debt.stafford_any_school.all_inst.median                         1568
latest.aid.plus_debt.nostafford_any_school.all_inst.median                       1568
latest.aid.plus_debt.stafford_any_school.eval_inst.count                         1516
latest.aid.plus_debt.nostafford_any_school.eval_inst.count                       1516
latest.aid.plus_debt.nostafford_this_school.eval_inst.count                      1445
latest.aid.plus_debt.stafford_this_school.eval_inst.count                        1445
latest.student.demographics.veteran                                              1442
latest.student.retention_rate_suppressed.four_year.par