**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, and make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Alex Franz
- Bryant Tan
- Cole Carter
- Henri Schulz

# Research Question

Does a higher Center for World University Rankings (CWUR) ranking increase the financial return on investment (ROI) at a US university in 2018-2019, where financial ROI is measured by the cost of tuition, loan repayment rate, average debt, and post-graduation earnings?



## Background and Prior Work

### Introduction to the Topic


The influence of the prestige of a university on career success is a significant area of research, especially in fields like engineering where educational pedigree may impact job opportunities and earnings potential. The discussion typically revolves around whether the advantages of attending a top-tier university, such as enhanced learning environments, better networking opportunities, and elevated prestige, translate into tangible career benefits like higher salaries, job satisfaction, and overall career advancement. Generally, attending a prestigious institution has been associated with higher success rates, but the question remains: does the cost of tuition justify these outcomes?


### Prior Research and Findings


An NBC News article notes significant differences in starting salaries among graduates of universities of varying prestige. For instance, graduates of the University of Southern California have a notable starting salary advantage over those from lesser-known institutions, and Yale graduates have an average starting salary of \$68,300, significantly higher than the \$32,000 starting salary for graduates of Mississippi Valley State University. <sup id="cite_ref-1">[1](#cite_note-1)</sup> This highlights the economic benefit of attending more prestigious universities.


Further research indicates that while the cost of attending college has dramatically increased, the financial benefits such as earnings growth have not kept pace. Specifically, college costs have risen by 169% over the past four decades, while earnings for workers between the ages of 22 and 27 have increased by just 19%, according to an analysis of U.S. Census, Bureau of Labor Statistics, and National Center for Education Statistics data. <sup id="cite_ref-2">[2](#cite_note-2)</sup> This disparity raises questions about the economic return on investment of higher education, particularly from less prestigious institutions.


Additionally, a study exploring the quality of working life among university academics reveals that work-life balance, job and career satisfaction, and working conditions significantly affect employee commitment and stress levels, ultimately impacting overall well-being. It emphasizes that higher-ranked universities tend to offer better job security and more supportive working conditions, which enhance both personal well-being and job satisfaction. <sup id="cite_ref-3">[3](#cite_note-3)</sup> This research underscores the importance of university environment and job security, particularly in contrasting permanent versus temporary academic roles, and highlights the subtle effects of university prestige not only on earnings but also on job satisfaction and personal well-being.


### Relevance to Current Project


This previous work sets a foundational understanding that while the reputation of an educational institution may not universally guarantee better job prospects or earnings, it does provide certain groups with measurable advantages. The current project can build on these findings by more specifically analyzing how these outcomes vary among engineering graduates, where the impact of a university's rank might be more pronounced due to the technical and often competitive nature of the field.


### References
- <sup id="cite_note-1">1</sup> [NBC News](https://www.nbcnews.com/business/business-news/does-it-even-matter-where-you-go-college-here-s-n982851) Report on the influence of university prestige on starting salaries and career success.
- <sup id="cite_note-2">2</sup> [CNBC News](https://www.cnbc.com/2021/11/02/the-gap-in-college-costs-and-earnings-for-young-workers-since-1980.html) College costs have increased by 169% since 1980—but pay for young workers is up by just 19%: Georgetown report
- <sup id="cite_note-3">3</sup> [Research Gate](https://www.researchgate.net/publication/305689177_Quality_of_working_life_of_academics_and_researchers_in_the_UK_the_roles_of_contract_type_tenure_and_university_ranking) Fontinha, R., Van Laar, D., & Easton, S. (2018). *Quality of working life of academics and researchers in the UK: the roles of contract type, tenure and university ranking*. Studies in Higher Education, 43(4), 786–806.


# Hypothesis



We hypothesize that a higher university rank will correlate with a higher return on investment because higher-ranked colleges hold more merit in industry and academia, resulting in better opportunities for their graduates than for graduates of lower-ranked colleges.
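
One possible way to check a hypothesis of this form is a rank correlation between world rank and an ROI measure. The sketch below is only an illustration under stated assumptions: the `world_rank` and `roi_score` columns and all of their values are hypothetical toy data, not taken from either dataset, and Spearman correlation is one choice among several. Because rank 1 is the best rank, support for the hypothesis would appear as a negative correlation between the rank number and ROI.

```python
import pandas as pd

# Hypothetical toy values; NOT taken from either dataset.
toy = pd.DataFrame({
    "world_rank": [1, 2, 3, 4, 5],            # 1 = best-ranked school
    "roi_score": [9.0, 7.5, 8.0, 5.0, 4.0],   # assumed composite ROI measure
})

# Spearman correlation, computed as the Pearson correlation of the ranks.
rho = toy["world_rank"].rank().corr(toy["roi_score"].rank())

# A negative rho means better-ranked schools (smaller rank number) have higher ROI.
print(round(rho, 2))
```

A rank-based measure is used here because world rank is ordinal, so only the ordering of schools is compared rather than the size of the gaps between them.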

# Data

## Data overview

- Dataset #1
  - Dataset Name: Department of Education College Scorecard
  - Link to the dataset: [https://collegescorecard.ed.gov/data/](https://collegescorecard.ed.gov/data/)
  - Number of observations: 6694
  - Number of variables: 3244
- Dataset #2
  - Dataset Name: Center for World University Rankings 2018-2019
  - Link to the dataset: [https://www.kaggle.com/datasets/erfansobhaei/ultimate-university-ranking](https://www.kaggle.com/datasets/erfansobhaei/ultimate-university-ranking)
  - Number of observations: 1000
  - Number of variables: 12

__Dataset #1 Description__

This dataset includes a wide range of data from 6694 different US colleges on admissions, financial aid, earnings after college, and more. We plan to use this dataset to extract the median cost of attendance, debt repayment rates (as percentages) at 1, 4, and 5 years after college, and median earnings at 1, 4, and 10 years after college. With these metrics, we will be able to establish a quantitative heuristic for the return on investment of a given institution. To perform this analysis, we will filter the dataset to include only columns of potential interest, discard any of those columns with an excessive proportion of NaN values, and combine duplicate metrics across different columns to generate an accurate depiction of our desired metrics.

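
The cleaning steps described above could be sketched roughly as follows. This is a minimal illustration on a toy frame, not our final pipeline: the column names `COST_A`, `COST_B`, `EARN_10YR`, and `MOSTLY_EMPTY` are hypothetical stand-ins rather than actual Scorecard variables, and the 50% NaN threshold is an assumed value.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Scorecard frame; column names and values are hypothetical.
toy = pd.DataFrame({
    "INSTNM": ["School A", "School B", "School C", "School D"],
    "COST_A": [30000.0, np.nan, 45000.0, np.nan],  # cost reported in one column...
    "COST_B": [np.nan, 22000.0, np.nan, np.nan],   # ...or in another, depending on the school
    "EARN_10YR": [60000.0, 48000.0, 75000.0, 52000.0],
    "MOSTLY_EMPTY": [np.nan, np.nan, np.nan, 1.0],
})

# 1. Keep only the columns of potential interest.
columns_of_interest = ["INSTNM", "COST_A", "COST_B", "EARN_10YR", "MOSTLY_EMPTY"]
subset = toy[columns_of_interest].copy()

# 2. Combine duplicate metrics: take COST_A where present, otherwise COST_B.
subset["COST"] = subset["COST_A"].combine_first(subset["COST_B"])
subset = subset.drop(columns=["COST_A", "COST_B"])

# 3. Drop columns whose share of NaN values exceeds the threshold (assumed: 50%).
max_nan_share = 0.5
nan_share = subset.isna().mean()
cleaned = subset.loc[:, nan_share <= max_nan_share]

print(cleaned)
```

In this sketch the duplicate columns are combined before the NaN filter is applied, so that two partially filled columns measuring the same thing are not discarded separately. Whether that ordering suits the real data is a judgment call.
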
__Dataset #2 Description__

Our second dataset will be used to identify the rankings of colleges. It contains important variables such as world rank and national rank. We will use it to select which schools from the College Scorecard data to include in our study, and to add ranking data to our combined dataset. The combined dataset will be sorted by each institution's world rank as determined by the Center for World University Rankings (CWUR).
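
A rough sketch of how the two datasets might be combined is shown below. The column names `World Rank`, `Institution`, `Location`, and `INSTNM` appear in the data previews in this notebook; the toy rows, the `EARNINGS` column, and the assumption that institution names match exactly across the two sources are ours for illustration only.

```python
import pandas as pd

# Toy stand-ins for the two datasets; the EARNINGS values are placeholders, not real data.
rankings_toy = pd.DataFrame({
    "World Rank": [1, 4, 2],
    "Institution": ["Harvard University", "University of Cambridge", "Stanford University"],
    "Location": ["USA", "United Kingdom", "USA"],
})
scorecard_toy = pd.DataFrame({
    "INSTNM": ["Stanford University", "Harvard University", "Alabama State University"],
    "EARNINGS": [2.0, 1.0, 3.0],
})

# Keep only US institutions from the global ranking.
us_rankings = rankings_toy[rankings_toy["Location"] == "USA"]

# An inner join keeps only schools present in both tables.
# Assumption: names match exactly; real data may need name normalization first.
combined = us_rankings.merge(
    scorecard_toy, left_on="Institution", right_on="INSTNM", how="inner"
)

# Sort the combined data by CWUR world rank.
combined = combined.sort_values("World Rank").reset_index(drop=True)
print(combined[["World Rank", "Institution", "EARNINGS"]])
```

An inner join silently drops ranked schools whose names are spelled differently in the Scorecard data, so the number of matched rows would be worth checking against the number of US schools in the ranking.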


## Department of Education College Scorecard

In [1]:
import pandas as pd

# Import data
scorecard = pd.read_csv('Data/College_Scorecard_Data.csv') 

print("Shape", scorecard.shape) # Display shape
scorecard.head() # Display some data


Shape (6694, 3244)


| Unnamed: 0 | UNITID | OPEID | OPEID6 | INSTNM | CITY | STABBR | ZIP | ACCREDAGENCY | INSTURL | NPCURL | ... | DCS_PELL_LOAN | PCTPELL_DCS_POOLED_SUPP | PCTFLOAN_DCS_POOLED_SUPP | DCS_PELL_LOAN_POOLED | POOLYRS_DCS | SATVR50 | SATMT50 | ACTCM50 | ACTEN50 | ACTMT50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100654 | 100200 | 1002 | Alabama A & M University | Normal | AL | 35762 |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | 100663 | 105200 | 1052 | University of Alabama at Birmingham | Birmingham | AL | 35294-0110 |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 2 | 100690 | 2503400 | 25034 | Amridge University | Montgomery | AL | 36117-3553 |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 3 | 100706 | 105500 | 1055 | University of Alabama in Huntsville | Huntsville | AL | 35899 |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |
| 4 | 100724 | 100500 | 1005 | Alabama State University | Montgomery | AL | 36104-0271 |  |  |  | ... |  |  |  |  |  |  |  |  |  |  |


## Center for World University Rankings 2018-2019

In [84]:
# Read the original CSV file
rankings = pd.read_csv('Data/CWUR_2018-2019.csv')

print("Shape", rankings.shape) # Display shape
rankings.head() # Display some data

Shape (1000, 12)


| Unnamed: 0 | World Rank | Institution | Location | National Rank | Quality of Education | Alumni Employment | Quality of Faculty | Research Output | Quality Publications | Influence | Citations | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Harvard University | USA | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 100.0 |
| 1 | 2 | Stanford University | USA | 2 | 10 | 3 | 2 | 10 | 4 | 3 | 2 | 96.7 |
| 2 | 3 | Massachusetts Institute of Technology | USA | 3 | 3 | 11 | 3 | 30 | 15 | 2 | 6 | 95.1 |
| 3 | 4 | University of Cambridge | United Kingdom | 1 | 5 | 19 | 6 | 12 | 8 | 6 | 19 | 94.0 |
| 4 | 5 | University of Oxford | United Kingdom | 2 | 9 | 25 | 10 | 9 | 5 | 7 | 4 | 93.2 |


# Ethics & Privacy

To address ethics and privacy, we sought data from an authoritative source. Our primary dataset, the College Scorecard, comes from a .gov website run by the US Department of Education. It is an enormous dataset that tracks many different metrics about schools.

Next, we considered what biases might be present in the datasets. Starting with the simpler case, dataset #2 ranks colleges, and we took only the US schools from its global ranking. Possible biases are that it places private and public schools in the same list, that the metrics used to rank the schools might not be ones everyone agrees with, and that it might not account for factors such as the demographics or socioeconomic background that one might need in order to attend one of these colleges. Despite these potential biases, the Center for World University Rankings' frequent updates to its data, its frequent use in other studies, and its overall reputation suggest that, while these rankings may not be in perfect order, they generally reflect the prestige of schools and are therefore sufficient to draw conclusions about the effect of rank across a large range of rankings.

As for the College Scorecard data, although we will look only at the top schools from our ranking list, the data may not cover all of the schools equally. More popular schools may have more information available for the government to assess. There could be geographical bias based on where the authors of the reports and datasets are located. There could also be socioeconomic bias, because some economic factors can be hidden in the data, so the data may not portray schools fairly. One advantage is that the dataset contains no individuals; it is all aggregated and somewhat normalized, which we hope leads to a more accurate representation of the schools. On the other hand, this could hide the fact that the underlying samples misrepresent individual schools. We could also add a small amount of noise to each college's values to try to even this out further.

Finally, we will do our best to keep our analysis as unbiased as possible. We will not state anything as fact beyond what our analysis found, and we will be transparent about where we got our data. It is never our intention to reach conclusions that could reflect badly on, or be misused against, any college or group of people.

# Team Expectations 


* Respond to messages from other team members within 24 hours
* Weekly check-in to update team progress
* If a member has to miss a meeting, message the other members
* Treat all members with respect
* Always be willing to help out members if they ask
* Complete the work assigned to you in a timely manner
* Communicate with the rest of the team if you're stuck

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 5/1  |  5 PM | Brainstorm ideas for projects  | Determine our research question and split up Project Proposal sections | 
| 5/8  |  5 PM | Identify potential data sources  | Finalize data sources and plan for data schema. Assign data wrangling tasks | 
| 5/15  | 5 PM  | Complete data wrangling necessary for EDA  | Discuss goals of EDA and what we are looking for | 
| 5/22  | 5 PM  | Complete EDA  | Compare results for EDA and do writeup. Discuss plans for analysis and assign individual tasks | 
| 5/29  | 5 PM  | Start individual analysis tasks  | Compare analysis results | 
| 6/4  | 5 PM  | Refine analysis and start drafting final write-up sections  | Finalize analysis and focus on writeup | 
| 6/11 | 5 PM  | Finish writeup second drafts | Finalize project and submit |