# COGS 108 - Data Checkpoint

# Names

- Alex Franz
- Bryant Tan
- Cole Carter
- Henri Schulz

# Research Question

Does a higher Center for World University Rankings (CWUR) ranking increase financial return on investment (ROI) at a US university in 2018-2019, where financial ROI is measured by the cost of tuition, loan repayment rate, average debt, and post-graduation earnings?



## Background and Prior Work

### Introduction to the Topic


The influence of the prestige of a university on career success is a significant area of research, especially in fields like engineering where educational pedigree may impact job opportunities and earnings potential. The discussion typically revolves around whether the advantages of attending a top-tier university, such as enhanced learning environments, better networking opportunities, and elevated prestige, translate into tangible career benefits like higher salaries, job satisfaction, and overall career advancement. Generally, attending a prestigious institution has been associated with higher success rates, but the question remains: does the cost of tuition justify these outcomes?


### Prior Research and Findings


An article from NBC News notes significant differences in starting salaries among graduates from universities of varying prestige. For instance, graduates from the University of Southern California have a notable starting salary advantage over those from lesser-known institutions, and graduates from Yale have an average starting salary of `$68,300`, significantly higher than the `$32,000` starting salary for graduates from Mississippi Valley State University. <sup id="cite_ref-1">[1](#cite_note-1)</sup> This highlights the economic benefit of attending more prestigious universities.


Further research indicates that while the cost of attending college has dramatically increased, the financial benefits such as earnings growth have not kept pace. Specifically, college costs have risen by 169% over the past four decades, while earnings for workers between the ages of 22 and 27 have increased by just 19%, according to an analysis of U.S. Census, Bureau of Labor Statistics, and National Center for Education Statistics data. <sup id="cite_ref-2">[2](#cite_note-2)</sup> This disparity raises questions about the economic return on investment of higher education, particularly from less prestigious institutions.


Additionally, a study exploring the quality of working life among university academics reveals that work-life balance, job and career satisfaction, and working conditions significantly affect employee commitment and stress levels, ultimately impacting overall well-being. It emphasizes that higher-ranked universities tend to offer better job security and more supportive working conditions, which not only enhance personal well-being but also improve job satisfaction. <sup id="cite_ref-3">[3](#cite_note-3)</sup> This research underscores the importance of university environment and job security, particularly in contrasting permanent versus temporary academic roles, and highlights the subtle effects of university prestige not only on earnings but also on job satisfaction and personal well-being.


### Relevance to Current Project


This previous work sets a foundational understanding that while the reputation of an educational institution may not universally guarantee better job prospects or earnings, it does provide certain groups with measurable advantages. The current project can build on these findings by more specifically analyzing how these outcomes vary among engineering graduates, where the impact of a university's rank might be more pronounced due to the technical and often competitive nature of the field.


### References
- <sup id="cite_note-1">1</sup> [NBC News](https://www.nbcnews.com/business/business-news/does-it-even-matter-where-you-go-college-here-s-n982851) Report on the influence of university prestige on starting salaries and career success.
- <sup id="cite_note-2">2</sup> [CNBC News](https://www.cnbc.com/2021/11/02/the-gap-in-college-costs-and-earnings-for-young-workers-since-1980.html) College costs have increased by 169% since 1980—but pay for young workers is up by just 19%: Georgetown report
- <sup id="cite_note-3">3</sup> [Research Gate](https://www.researchgate.net/publication/305689177_Quality_of_working_life_of_academics_and_researchers_in_the_UK_the_roles_of_contract_type_tenure_and_university_ranking) Fontinha, R., Van Laar, D., & Easton, S. (2018). *Quality of working life of academics and researchers in the UK: the roles of contract type, tenure and university ranking*. Studies in Higher Education, 43(4), 786–806.


# Hypothesis



We hypothesize that a higher university rank will correlate to a higher return on investment because higher ranked colleges hold more merit in industry and academia, resulting in better opportunities for graduates from highly ranked colleges as compared to lower ranked colleges.
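One way such a relationship could be checked is a rank correlation between a school's rank and its ROI. The sketch below is illustrative only: the values are made up, and the `roi` column is a hypothetical placeholder for whatever ROI metric is eventually used. Because rank 1 is the best rank, a negative coefficient would be consistent with the hypothesis.

```python
import pandas as pd

# Toy values for illustration only; a lower rank number means a better-ranked school
df = pd.DataFrame({
    "rank": [1.0, 2.0, 3.0, 4.0, 5.0],
    "roi": [0.9, 0.7, 0.8, 0.5, 0.4],  # hypothetical ROI values
})

# Spearman correlation equals the Pearson correlation of the ranked values
rho = df["rank"].rank().corr(df["roi"].rank())
print(round(rho, 3))
```

A correlation of this kind describes association only and does not by itself show that rank causes a higher ROI.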

# Data

## Data overview

- Dataset #1
  - Dataset Name: Department of Education College Scorecard 2018-2019
  - Link to the dataset: [https://collegescorecard.ed.gov/data/](https://collegescorecard.ed.gov/data/)
  - Number of observations: 6694
  - Number of variables: 3244
- Dataset #2
  - Dataset Name: Center for World University Rankings World University Rankings 2018-2019
  - Link to the dataset: [https://www.kaggle.com/datasets/erfansobhaei/ultimate-university-ranking](https://www.kaggle.com/datasets/erfansobhaei/ultimate-university-ranking)
  - Number of observations: 1000
  - Number of variables: 12

__Dataset #1 Description__

The Department of Education College Scorecard 2018-2019 includes 3244 variables from 6694 different colleges in the US regarding admissions, financial aid, earnings after college, student demographics, graduation rates, and student outcomes. We plan to use this dataset to extract the median cost of attendance, debt repayment rates in percentages at 1, 4, and 5 years after college, and median earnings at 1, 4, and 10 years after college. With these metrics, we will be able to establish a quantitative heuristic for the return on investment of a given institution. To perform this analysis, we will have to filter our dataset to include only columns of potential interest, discard any resulting columns of interest with an excessive proportion of NaN values, and combine duplicate metrics across different columns to generate an accurate depiction of the metrics described above.

__Dataset #2 Description__

The CWUR World University Rankings 2018-2019 will be used to identify the rankings of colleges. The most important metrics found in this dataset are string-formatted names, integer-formatted world and national rankings, and float-formatted overall scores of the 1000 universities within the dataset. This dataset will be used to select the schools from the College Scorecard data that we want to include in our study using a DataFrame merge, as well as to include ranking data in our overall combined dataset. The combined dataset will be sorted according to the institutions' world rank as determined by the CWUR rankings.
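The merge described above could look roughly like the following sketch. The two toy frames and their values are made up for illustration; only the column names `INSTNM`, `COSTT4_A`, `Institution`, and `National Rank` come from the datasets.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are illustrative, not real data)
scorecard = pd.DataFrame({
    "INSTNM": ["exampleuniversityb", "exampleuniversitya", "unrankedcollege"],
    "COSTT4_A": [30000.0, 50000.0, 20000.0],
})
cwur = pd.DataFrame({
    "Institution": ["exampleuniversitya", "exampleuniversityb"],
    "National Rank": [1, 2],
})

# An inner merge keeps only institutions present in both datasets
merged = scorecard.merge(cwur, left_on="INSTNM", right_on="Institution", how="inner")

# Sort by rank so the best-ranked institution comes first
merged = merged.sort_values("National Rank").reset_index(drop=True)
print(merged[["INSTNM", "COSTT4_A", "National Rank"]])
```

An inner merge carries the rank column along with each matched row, so ranks stay attached to the right institution even when some schools have no match.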


__Combined Dataset Description__

The combined dataset focuses on US institutions present in both the College Scorecard and CWUR datasets. It includes the following key variables:

- Institution (institution): The name of the college or university.
- Average Cost (avg_cost): The average annual cost of attendance.
- Supply Cost (supply_cost): The cost of books and supplies.
- Average Debt (avg_debt): The median debt of graduates.
- 75% Earnings 8 Years After Enrollment (75%_earnings_8_yrs): The earnings at the 75th percentile 8 years after enrollment.
- 25% Earnings 8 Years After Enrollment (25%_earnings_8_yrs): The earnings at the 25th percentile 8 years after enrollment.
- Earnings 6 Years After Enrollment (earnings_6_yrs): The median earnings 6 years after enrollment.
- Earnings 10 Years After Enrollment (earnings_10_yrs): The median earnings 10 years after enrollment.
- National Rank (rank): The national ranking of the institution according to CWUR.
- Return on Investment (roi): A calculated metric representing the ROI based on median earnings and average cost (not yet computed at this checkpoint).
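The exact ROI formula is not specified here. As a minimal sketch, assuming ROI is defined as median earnings 10 years after enrollment divided by four years of average annual cost, the calculation could look like this. The function name `simple_roi` and the four-year assumption are hypothetical; the two rows are taken from the combined dataset's `head()` output.

```python
import pandas as pd

# Two rows copied from the combined dataset's head() output
df = pd.DataFrame({
    "institution": ["Harvard University", "University of California-Berkeley"],
    "avg_cost": [71135.0, 36739.0],
    "earnings_10_yrs": [84918.0, 80364.0],
})

YEARS_OF_STUDY = 4  # assumption: a four-year degree


def simple_roi(frame, years=YEARS_OF_STUDY):
    """Hypothetical ROI: median 10-year earnings per dollar of total cost of attendance."""
    return frame["earnings_10_yrs"] / (frame["avg_cost"] * years)


df["roi"] = simple_roi(df)
print(df[["institution", "roi"]])
```

Under this assumed definition a lower-cost school can score a higher ROI than a higher-ranked one, which is the kind of pattern the research question is meant to probe. Other definitions, for example ones that subtract debt, would be equally defensible.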

__Data Cleaning Performed__

- Standardization: Institution names were standardized to ensure consistency across datasets.
- Filtering: The dataset was filtered to include only institutions present in the CWUR dataset.
- Handling Missing Values: Rows with missing values in key variables were dropped to ensure data quality.
- Numeric Conversion: All relevant columns were converted to numeric types for analysis.
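The missing-value and numeric-conversion steps can be sketched as follows. This is an illustration on toy rows, not the notebook's actual cleaning code. It assumes that withheld Scorecard values appear as the string `PrivacySuppressed`, and it uses `pd.to_numeric` with `errors="coerce"` so that such markers become NaN before rows are dropped.

```python
import pandas as pd

# Toy rows; "PrivacySuppressed" is assumed to mark withheld values
raw = pd.DataFrame({
    "INSTNM": ["exampleuniversitya", "exampleuniversityb", "exampleuniversityc"],
    "GRAD_DEBT_MDN": ["22500", "PrivacySuppressed", "19500"],
    "MD_EARN_WNE_P10": [46990.0, 54361.0, None],
})

numeric_cols = ["GRAD_DEBT_MDN", "MD_EARN_WNE_P10"]

# Coerce non-numeric markers to NaN instead of raising an error
raw[numeric_cols] = raw[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Drop any row that is missing a key variable
clean = raw.dropna(subset=numeric_cols).reset_index(drop=True)
print(clean)
```

Converting before dropping matters: a plain `astype(float)` would fail on a non-numeric marker, while coercion turns it into a missing value that `dropna` then removes.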

## Data Imports and Standardization Function

In [8]:
# Imports 
import pandas as pd 
import numpy as np 
import seaborn as sns 
import re 
import matplotlib.pyplot as plt

# Standardizes institution names, used for filtering Scorecard data based on ranking data
def institution_standardize(string):
    string = string.lower()
    string = string.replace(',', '')
    string = string.replace(' ', '')
    string = re.sub(r'[^a-zA-Z0-9]', '', string)
    return string
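
As a quick illustration of what the standardization does, the sketch below uses a compact, equivalent form of the function and applies it to two spellings of the same school. The second spelling is a hypothetical variant, chosen to show that punctuation differences collapse to the same key.

```python
import re


def institution_standardize(string):
    # Same normalization as above: lowercase, then keep only letters and digits
    return re.sub(r'[^a-z0-9]', '', string.lower())


scorecard_style = institution_standardize("University of California-Berkeley")
variant_style = institution_standardize("University of California, Berkeley")

print(scorecard_style)
print(variant_style)
print(scorecard_style == variant_style)
```

Names that differ by whole words rather than punctuation (for example, a leading "The") would still produce different keys and would not match.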

## Center for World University Rankings 2018-2019

In [9]:
# Importing Data
rankings = pd.read_csv("CWUR_2018-2019.csv")

# Gathering Ranking data we need to filter the Scorecard data
rankings = rankings[rankings['Location'] == 'USA']
rankings = rankings.loc[:,['Institution','National Rank']]
rankings = rankings.reset_index(drop=True)

# Standardizing ranking data
rankings['Institution'] = rankings['Institution'].apply(institution_standardize)
rankings['Institution'] = rankings["Institution"].astype('string')

rankings.head()

| | Institution | National Rank |
|---|---|---|
| 0 | harvarduniversity | 1 |
| 1 | stanforduniversity | 2 |
| 2 | massachusettsinstituteoftechnology | 3 |
| 3 | universityofcaliforniaberkeley | 4 |
| 4 | princetonuniversity | 5 |


## Department of Education College Scorecard 

In [10]:
# Importing Data 
college_data = pd.read_csv("2018_2019_College_Data.csv")

# Creating copy dataframe, so we can use this to filter our original data with the original institution names
college_data_filter = college_data.copy()

# Standardizing copy Scorecard data
college_data_filter['INSTNM'] = college_data_filter['INSTNM'].apply(institution_standardize)
college_data_filter['INSTNM'] = college_data_filter['INSTNM'].astype('string')

# Filtering dataset to include only universities in the rankings dataset
filtered_college_data = college_data_filter[college_data_filter['INSTNM'].isin(rankings['Institution'])]

# Filtering dataset to only include desired columns
desired_cols = [
    'INSTNM','COSTT4_A','BOOKSUPPLY','GRAD_DEBT_MDN','PCT75_EARN_WNE_P8','PCT25_EARN_WNE_P8','MD_EARN_WNE_P6','MD_EARN_WNE_P10'
]
filtered_college_data = filtered_college_data.loc[:, desired_cols]
filtered_college_data.head()



| | INSTNM | COSTT4_A | BOOKSUPPLY | GRAD_DEBT_MDN | PCT75_EARN_WNE_P8 | PCT25_EARN_WNE_P8 | MD_EARN_WNE_P6 | MD_EARN_WNE_P10 |
|---|---|---|---|---|---|---|---|---|
| 1 | universityofalabamaatbirmingham | 24347.0 | 1200.0 | 22500 | 63552.0 | 28738.0 | 39271.0 | 46990.0 |
| 3 | universityofalabamainhuntsville | 23441.0 | 2034.0 | 21607 | 73997.0 | 29122.0 | 47533.0 | 54361.0 |
| 9 | auburnuniversity | 31282.0 | 1200.0 | 21281 | 76935.0 | 36231.0 | 49695.0 | 56933.0 |
| 58 | universityofalaskafairbanks | 18510.0 | 2000.0 | 19500 | 59768.0 | 19798.0 | 35456.0 | 43728.0 |
| 74 | universityofarizona | 26712.0 | 800.0 | 20171 | 76918.0 | 32696.0 | 43784.0 | 55205.0 |
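
Because the filter relies on exact matches between standardized names, it can be useful to check which ranked institutions found no Scorecard match and which standardized names appear more than once. The following is a minimal sketch on toy names (all values are illustrative), not part of the original pipeline.

```python
import pandas as pd

# Toy standardized names (illustrative only)
rankings = pd.DataFrame({
    "Institution": ["exampleuniversitya", "exampleuniversityb", "exampleuniversityc"],
})
scorecard = pd.DataFrame({
    "INSTNM": ["exampleuniversitya", "exampleuniversityc", "exampleuniversityc", "unrankedcollege"],
})

# Ranked institutions with no Scorecard row under the same standardized name
unmatched = rankings.loc[~rankings["Institution"].isin(scorecard["INSTNM"]), "Institution"]

# Standardized names that occur more than once in the Scorecard (e.g. multiple campuses)
dupes = scorecard.loc[scorecard["INSTNM"].duplicated(keep=False), "INSTNM"].unique()

print(list(unmatched))
print(list(dupes))
```

Unmatched names point to spelling differences between the two sources, and duplicated names would add extra rows for a single ranked institution.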


## Combining Datasets

In [11]:
# Convert the list of institutions to a categorical data type with the desired order
filtered_college_data['INSTNM'] = pd.Categorical(filtered_college_data['INSTNM'], categories=rankings['Institution'], ordered=True)

# Sort the dataframe based on the ordered 'INSTNM' column
college_sorted = filtered_college_data.sort_values(by='INSTNM')

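# Extracting indices to use to pull observations from the college_data dataframe we need
indices = college_sorted.index.to_list()

# Filtering out universities we aren't including in the study
# (.loc is used because these are index labels, not positions)
college_data = college_data.loc[indices]

# Gathering columns we need for EDA and adding rank column
college_data = college_data.loc[:, desired_cols]

# Look up each school's CWUR national rank by its standardized name, so ranks stay
# correct even when some ranked schools are unmatched or dropped for missing values
rank_lookup = dict(zip(rankings['Institution'], rankings['National Rank']))
college_data['NationalRank'] = college_sorted['INSTNM'].astype(str).map(rank_lookup)
college_data = college_data.dropna().reset_index(drop=True)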

# Converting all columns to numeric values for EDA purposes
numeric_cols = ['COSTT4_A','BOOKSUPPLY','GRAD_DEBT_MDN', 'PCT75_EARN_WNE_P8', 'PCT25_EARN_WNE_P8','MD_EARN_WNE_P6','MD_EARN_WNE_P10', 'NationalRank']
college_data[numeric_cols] = college_data[numeric_cols].astype(float)

# Renaming columns 
college_data.rename(
    columns={
        'INSTNM': 'institution',
        'COSTT4_A': 'avg_cost', 
        'BOOKSUPPLY': 'supply_cost',
        'GRAD_DEBT_MDN':'avg_debt',
        'PCT75_EARN_WNE_P8':'75%_earnings_8_yrs',
        'PCT25_EARN_WNE_P8':'25%_earnings_8_yrs',
        'MD_EARN_WNE_P6': 'earnings_6_yrs',
        'MD_EARN_WNE_P10':'earnings_10_yrs',
        'NationalRank':'rank'
        }
        , inplace=True
        )

# Final dataframe
college_data.head()

| | institution | avg_cost | supply_cost | avg_debt | 75%_earnings_8_yrs | 25%_earnings_8_yrs | earnings_6_yrs | earnings_10_yrs | rank |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Harvard University | 71135.0 | 1000.0 | 13750.0 | 135753.0 | 45980.0 | 77816.0 | 84918.0 | 1.0 |
| 1 | Stanford University | 69109.0 | 1455.0 | 11750.0 | 166805.0 | 58448.0 | 88873.0 | 97798.0 | 2.0 |
| 2 | Massachusetts Institute of Technology | 67430.0 | 800.0 | 12500.0 | 169114.0 | 75080.0 | 112623.0 | 111222.0 | 3.0 |
| 3 | University of California-Berkeley | 36739.0 | 849.0 | 13478.0 | 117722.0 | 44547.0 | 65914.0 | 80364.0 | 4.0 |
| 4 | Princeton University | 66950.0 | 1050.0 | 10750.0 | 147835.0 | 56354.0 | 84713.0 | 95689.0 | 5.0 |


In [13]:
college_data.shape

(146, 9)

# Ethics & Privacy

In the CWUR rankings dataset, the main source of bias is the determination of what constitutes a university's ranking. It is nearly impossible to holistically declare one school better than another, so the CWUR instead turns to certain metrics to rank these universities quantitatively. The ranking method is not explicitly provided, so biases may originate in the metrics themselves or in the weights assigned to them when assessing the overall quality of an institution. However, despite these potential biases, the Center for World University Rankings' frequent updates to its data, its frequent use in other studies, and its overall reputation suggest that while these rankings may not be in perfect order, they generally reflect the prestige of schools and are therefore sufficient to draw conclusions about the effect of rank across a large range of rankings.

As for the College Scorecard data, we may not have an equal number of data points for determining the financial metrics of each university, so this data may not cover all schools equally. There could be factors such as geographical bias based on where the writers of the reports and datasets are located. There may also be socioeconomic biases, since some economic factors, such as the ability to afford college, can be obscured in the data and not depict situations fairly. However, the fact that the dataset contains no individuals but rather generalized statistics leads to some normalization and increased accuracy of the metrics across a large number of students at each institution. Conversely, this brings the method of data collection into question. Considering that the data found in the College Scorecard is sourced from government data and individual institutions, the authority of these sources increases the likelihood that they accurately depict our desired metrics, reducing the likelihood of bias.

It is important to note that any general conclusions or patterns presented in this report should not be viewed as absolute statements or actionable information. While our analysis is conducted in a way that minimizes bias, it remains impossible to eliminate all potential sources of bias, and confounding factors such as socioeconomic class, which influences college decisions, financial standing, and career opportunities, could be strong contributors to our results. Even if we were able to ensure unbiased, purely objective analyses, this ranking is in no way definitive of salary outcomes. While students at certain universities may be more likely on average to meet certain financial metrics, attendance at a specific university does not guarantee these outcomes and should not be interpreted as such.

# Team Expectations 


* Respond to messages from other team members within 24 hours
* Weekly check-in to update team progress
* If a member has to miss a meeting, message the other members
* Treat all members with respect
* Always be willing to help out members if they ask
* Complete the work assigned to you in a timely manner
* Communicate with the rest of the team if you're stuck

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 5/1  |  5 PM | Brainstorm ideas for projects  | Determine our research question and split up Project Proposal sections | 
| 5/8  |  5 PM | Identify potential data sources  | Finalize data sources and plan for data schema. Assign data wrangling tasks | 
| 5/15  | 5 PM  | Complete data wrangling necessary for EDA  | Discuss goals of EDA and what we are looking for | 
| 5/22  | 5 PM  | Complete EDA  | Compare results for EDA and do writeup. Discuss plans for analysis and assign individual tasks | 
| 5/29  | 5 PM  | Start individual analysis tasks  | Compare analysis results | 
| 6/4  | 5 PM  | Refine analysis and start drafting final write-up sections  | Finalize analysis and focus on writeup | 
| 6/11 | 5 PM  | Finish writeup second drafts | Finalize project and submit |