# COGS 108 - Data Checkpoint

# Names

- Alex Franz
- Bryant Tan
- Cole Carter
- Henri Schulz

# Research Question

How does college rank affect the financial return of attendance at a 4-year undergraduate university in 2018-2019 where financial ROI is measured by the cost of tuition, post-grad salary, loan repayment rate, average debt, and ten-year earnings?



## Background and Prior Work

### Introduction to the Topic

The influence of the prestige of a university on career success is a significant area of research, especially in fields like engineering where educational pedigree may impact job opportunities and earnings potential. The discussion typically revolves around whether the advantages of attending a top-tier university, such as enhanced learning environments, better networking opportunities, and elevated prestige, translate into tangible career benefits like higher salaries, job satisfaction, and overall career advancement.

### Prior Research and Findings

An NBC News article notes significant differences in starting salaries among graduates of universities of varying prestige. For instance, graduates of the University of Southern California have a notable starting-salary advantage over those from lesser-known institutions, and Yale graduates have an average starting salary of \$68,300, significantly higher than the \$32,000 average for graduates of Mississippi Valley State University. <sup id="cite_ref-1">[1](#cite_note-1)</sup> This highlights the potential economic benefit of attending a more prestigious university.

Previous research has provided mixed insights into the relationship between university ranking and career success. A study by Dale and Krueger (2014) found that for most students, the selectivity of the college they attend does not affect their earnings significantly after controlling for the competitiveness of the student—defined as their academic and extracurricular achievements prior to college. However, for subgroups such as students from disadvantaged backgrounds, the prestige of the institution did have a positive impact on earnings potential. <sup id="cite_ref-2">[2](#cite_note-2)</sup>

Adding to this picture, research from the National Center for Education Statistics (NCES) suggests that college quality, as measured by resources and educational conditions, does correlate with higher earnings post-graduation. <sup id="cite_ref-3">[3](#cite_note-3)</sup> This correlation persists even when accounting for student backgrounds and labor market conditions, indicating that the institution's characteristics independently contribute to the career outcomes of its graduates.

Additionally, a study exploring the quality of working life among university academics reveals that work-life balance, job and career satisfaction, and working conditions significantly affect employee commitment and stress levels, ultimately impacting overall well-being. It emphasizes that higher-ranked universities tend to offer better job security and more supportive working conditions, which not only enhance personal well-being but also improve job satisfaction. <sup id="cite_ref-4">[4](#cite_note-4)</sup> This research underscores the importance of university environment and job security, particularly in contrasting permanent versus temporary academic roles, and highlights the subtle effects of university prestige not only on earnings but also on job satisfaction and personal well-being.

### Relevance to Current Project

This previous work sets a foundational understanding that while the reputation of an educational institution may not universally guarantee better job prospects or earnings, it does provide certain groups with measurable advantages. The current project can build on these findings by more specifically analyzing how these outcomes vary among engineering graduates, where the impact of a university's rank might be more pronounced due to the technical and often competitive nature of the field.

### References

- <sup id="cite_note-1">1</sup> [NBC News](https://www.nbcnews.com/business/business-news/does-it-even-matter-where-you-go-college-here-s-n982851). Report on the influence of university prestige on starting salaries and career success.
- <sup id="cite_note-2">2</sup> Dale, S. B., & Krueger, A. B. (2014). *Estimating the Return to College Selectivity over the Career Using Administrative Earnings Data*. Journal of Human Resources.
- <sup id="cite_note-3">3</sup> National Center for Education Statistics (NCES). *College Quality and the Earnings of Recent College Graduates*. U.S. Department of Education.
- <sup id="cite_note-4">4</sup> Fontinha, R., Van Laar, D., & Easton, S. (2018). *Quality of working life of academics and researchers in the UK: the roles of contract type, tenure and university ranking*. Studies in Higher Education, 43(4), 786–806.


# Hypothesis



We hypothesize that a higher college rank will correlate with a higher return on investment because higher-ranked colleges carry more weight in industry and academia, resulting in better opportunities for their graduates than for graduates of lower-ranked colleges.
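One way to examine this hypothesis is with a rank correlation, since a college's rank is ordinal. The sketch below is only illustrative: the `world_rank` and `roi` values are made-up placeholders, not results from our data, and Spearman's rho (computed here as the Pearson correlation of the rank-transformed series) is just one possible choice of test.

```python
import pandas as pd

# Hypothetical toy values for illustration only (not real results).
toy = pd.DataFrame({
    "world_rank": [1, 2, 3, 4, 5],      # 1 = highest-ranked school
    "roi": [3.0, 2.5, 2.6, 1.9, 1.2],   # placeholder ROI values
})

# Spearman's rho is the Pearson correlation of the two rank-transformed series.
rho = toy["world_rank"].rank().corr(toy["roi"].rank())
print(f"Spearman rho: {rho:.2f}")
```

Because a smaller rank number means a better-ranked school, a negative rho would be the direction the hypothesis predicts.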

# Data

## Data overview

- Dataset #1
  - Dataset Name: Department of Education College Scorecard
  - Link to the dataset: [https://collegescorecard.ed.gov/data/](https://collegescorecard.ed.gov/data/)
  - Number of observations: 6694
  - Number of variables: 3244
- Dataset #2
  - Dataset Name: Center for World University Ranking 2018-2019
  - Link to the dataset: [https://www.kaggle.com/datasets/erfansobhaei/ultimate-university-ranking](https://www.kaggle.com/datasets/erfansobhaei/ultimate-university-ranking)
  - Number of observations: 1000
  - Number of variables: 12


__Dataset #1 Description__

This dataset includes a wide range of data on 6694 colleges in the US, covering admissions, financial aid, earnings after college, and more. We plan to use it to extract important variables such as cost of attendance, debt repayment rates, earnings after college, and other metrics that will help us calculate an approximate ROI. We'll have to filter out the columns that are not relevant to our hypothesis and handle any NaNs, which we will do using Pandas.
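As a rough sketch of what an approximate ROI calculation could look like, the example below derives two simple proxies from Scorecard columns in our list of interest. The definitions are assumptions rather than a settled method: `add_roi_metrics`, `EARN_TO_COST`, and `DEBT_TO_EARN` are hypothetical names, total cost is assumed to be four years of `COSTT4_A`, and the input values are made up.

```python
import pandas as pd

def add_roi_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with two illustrative ROI proxies added."""
    out = df.copy()
    # Assumption: total cost is four years of the average annual cost of attendance.
    total_cost = 4 * out["COSTT4_A"]
    # Median earnings relative to the assumed total cost.
    out["EARN_TO_COST"] = out["MD_EARN_WNE_4YR"] / total_cost
    # Median debt at graduation relative to those same earnings.
    out["DEBT_TO_EARN"] = out["GRAD_DEBT_MDN"] / out["MD_EARN_WNE_4YR"]
    return out

# Made-up numbers, for illustration only.
toy = pd.DataFrame({
    "INSTNM": ["School A", "School B"],
    "COSTT4_A": [20000.0, 50000.0],
    "GRAD_DEBT_MDN": [16000.0, None],
    "MD_EARN_WNE_4YR": [80000.0, 100000.0],
})
metrics = add_roi_metrics(toy)
print(metrics[["INSTNM", "EARN_TO_COST", "DEBT_TO_EARN"]])
```

Missing inputs propagate as NaN, so a school with no reported debt simply has no `DEBT_TO_EARN` value instead of being dropped.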

__Dataset #2 Description__

Our second dataset will be used to identify the rankings of colleges. It contains important variables such as world ranking and national ranking. We will use it to select which schools from the College Scorecard data to include in our study, and to add ranking data to our overall combined dataset. The combined dataset will be sorted according to the institutions' world rank as determined by the CWUR rankings.
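Because the two sources are matched on institution name, spelling differences can leave a ranked school without any Scorecard data. A minimal sketch of one possible normalization step follows; `normalize_name` is a hypothetical helper, and the Scorecard spelling used below is an assumed example, not a value copied from the file.

```python
import re

def normalize_name(name: str) -> str:
    """Lowercase a name and collapse punctuation and whitespace to single spaces."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", name.lower())
    return cleaned.strip()

# Illustrative spellings; the Scorecard spelling here is an assumption.
cwur_name = "University of California, Berkeley"
scorecard_name = "University of California-Berkeley"

# The raw strings differ, but the normalized forms agree.
print(cwur_name == scorecard_name)
print(normalize_name(cwur_name) == normalize_name(scorecard_name))
```

Matching on a normalized key like this would still miss schools whose names differ by more than punctuation, so any remaining unmatched names would need a manual check.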


## Department of Education College Scorecard

In [1]:
import pandas as pd

# Import data
scorecard = pd.read_csv('Data/College_Scorecard_Data.csv') 

print("Shape", scorecard.shape) # Display shape
scorecard.head() # Display some data



Shape (6694, 3244)


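| | UNITID | OPEID | OPEID6 | INSTNM | CITY | STABBR | ZIP | ACCREDAGENCY | INSTURL | NPCURL | ... | DCS_PELL_LOAN | PCTPELL_DCS_POOLED_SUPP | PCTFLOAN_DCS_POOLED_SUPP | DCS_PELL_LOAN_POOLED | POOLYRS_DCS | SATVR50 | SATMT50 | ACTCM50 | ACTEN50 | ACTMT50 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100654 | 100200 | 1002 | Alabama A & M University | Normal | AL | 35762 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 100663 | 105200 | 1052 | University of Alabama at Birmingham | Birmingham | AL | 35294-0110 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 100690 | 2503400 | 25034 | Amridge University | Montgomery | AL | 36117-3553 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 100706 | 105500 | 1055 | University of Alabama in Huntsville | Huntsville | AL | 35899 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100724 | 100500 | 1005 | Alabama State University | Montgomery | AL | 36104-0271 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |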
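With more than 3,000 columns, loading the full file is slow and can trigger mixed-type warnings. The sketch below shows one way to load only the needed columns; it runs on a tiny made-up CSV rather than the real file, and treating the string `PrivacySuppressed` as missing reflects our understanding of how the Scorecard marks suppressed values, which should be checked against the data dictionary.

```python
import io
import pandas as pd

# Tiny stand-in for the real file; the values are made up.
toy_csv = io.StringIO(
    "UNITID,INSTNM,COSTT4_A,MD_EARN_WNE_4YR,CITY\n"
    "1,School A,20000,PrivacySuppressed,Town A\n"
    "2,School B,50000,100000,Town B\n"
)

wanted = ["UNITID", "INSTNM", "COSTT4_A", "MD_EARN_WNE_4YR"]
subset = pd.read_csv(
    toy_csv,
    usecols=wanted,                    # load only the columns we need
    na_values=["PrivacySuppressed"],   # treat suppressed cells as missing
    low_memory=False,                  # infer each column's dtype from the whole file
)
print(subset.dtypes)
```

Reading the suppressed marker as missing keeps the earnings column numeric instead of falling back to strings.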


## Center for World University Rankings 2018-2019

In [84]:
# Read the original CSV file
rankings = pd.read_csv('Data/CWUR_2018-2019.csv')

print("Shape", rankings.shape) # Display shape
rankings.head() # Display some data

Shape (1000, 12)


| | World Rank | Institution | Location | National Rank | Quality of Education | Alumni Employment | Quality of Faculty | Research Output | Quality Publications | Influence | Citations | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Harvard University | USA | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 100.0 |
| 1 | 2 | Stanford University | USA | 2 | 10 | 3 | 2 | 10 | 4 | 3 | 2 | 96.7 |
| 2 | 3 | Massachusetts Institute of Technology | USA | 3 | 3 | 11 | 3 | 30 | 15 | 2 | 6 | 95.1 |
| 3 | 4 | University of Cambridge | United Kingdom | 1 | 5 | 19 | 6 | 12 | 8 | 6 | 19 | 94.0 |
| 4 | 5 | University of Oxford | United Kingdom | 2 | 9 | 25 | 10 | 9 | 5 | 7 | 4 | 93.2 |


# Combined Dataset

In [3]:
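# Keep only U.S. institutions; reset_index() stores each school's original row position in an 'index' column
rankings = rankings[rankings['Location'] == 'USA'].reset_index()

# Filter scorecard to only contain institutions whose names appear in rankings
scorecard = scorecard[
  scorecard['INSTNM']
  .isin(rankings['Institution'])
  ]

# Rename institution column to match scorecard for the join
rankings = rankings.rename({'Institution' : 'INSTNM'},axis=1)

# Outer join on institution name: ranked schools without an exact name match are kept with missing Scorecard values
# 'index' (zero-based position in the world ranking) becomes the RANK index
combined = (
  pd.merge(scorecard, rankings, how='outer', on='INSTNM')
  .rename({'index' : 'RANK'}, axis=1)
  .set_index('RANK')
  .sort_index()
)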

# Generating list of columns of interest
cols_of_interest = """RANK
INSTNM
NPT41_PUB
NPT42_PUB
NPT43_PUB
NPT44_PUB
NPT45_PUB
NPT41_PRIV
NPT42_PRIV
NPT43_PRIV
NPT44_PRIV
NPT45_PRIV
NPT41_PROG
NPT42_PROG
NPT43_PROG
NPT44_PROG
NPT45_PROG
NPT41_OTHER
NPT42_OTHER
NPT43_OTHER
NPT44_OTHER
NPT45_OTHER
NPT4_048_PUB
NPT4_048_PRIV
NPT4_048_PROG
NPT4_048_OTHER
NPT4_3075_PUB
NPT4_3075_PRIV
NPT4_75UP_PUB
NPT4_75UP_PRIV
NPT4_3075_PROG
NPT4_3075_OTHER
NPT4_75UP_PROG
NPT4_75UP_OTHER
COSTT4_A
COSTT4_P
TUITIONFEE_IN
TUITIONFEE_OUT
TUITIONFEE_PROG
GRAD_DEBT_MDN
LO_INC_DEBT_MDN
MD_INC_DEBT_MDN
HI_INC_DEBT_MDN
COUNT_NWNE_3YR
COUNT_WNE_3YR
CNTOVER150_3YR
DBRR4_FED_UG_DEN
DBRR4_FED_UG_RT
DBRR5_FED_UG_NUM
DBRR5_FED_UG_DEN
DBRR10_FED_UG_NUM
DBRR10_FED_UG_DEN
DBRR20_FED_UG_NUM
DBRR20_FED_UG_DEN
DBRR1_FED_UGCOMP_NUM
DBRR1_FED_UGCOMP_DEN
BBRR1_FED_UGCOMP_MAKEPROG
BBRR1_FED_UGCOMP_PAIDINFULL
BBRR2_FED_UGCOMP_MAKEPROG
BBRR2_FED_UGCOMP_PAIDINFULL
MDCOST_ALL
MDEARN_PD
MD_EARN_WNE_1YR
MD_EARN_WNE_4YR
MDEARN_PD
MD_EARN_WNE_1YR
MD_EARN_WNE_4YR
""".split(sep='\n')

# Determine which of our interested columns are in the combined data
selected_cols = [col for col in cols_of_interest if col in combined.columns.to_list()]

# Keep only columns of interest and remove any variables that are all null
combined = combined[selected_cols].dropna(how='all', axis=0).dropna(how='all', axis=1)
print("Combined dataset shape:", combined.shape)
combined.head()

Combined dataset shape: (213, 40)


| RANK | INSTNM | NPT41_PUB | NPT42_PUB | NPT43_PUB | NPT44_PUB | NPT45_PUB | NPT41_PRIV | NPT42_PRIV | NPT43_PRIV | NPT44_PRIV | ... | DBRR1_FED_UGCOMP_NUM | DBRR1_FED_UGCOMP_DEN | BBRR1_FED_UGCOMP_MAKEPROG | BBRR1_FED_UGCOMP_PAIDINFULL | BBRR2_FED_UGCOMP_MAKEPROG | BBRR2_FED_UGCOMP_PAIDINFULL | MD_EARN_WNE_1YR | MD_EARN_WNE_4YR | MD_EARN_WNE_1YR | MD_EARN_WNE_4YR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Harvard University | NaN | NaN | NaN | NaN | NaN | 2973.0 | 1010.0 | 3411.0 | 15553.0 | ... | 3816106.0 | 4234187.0 | 0.30-0.34 | 0.15-0.19 | 0.30-0.34 | 0.20-0.24 | 86549.0 | 104267.0 | 86549.0 | 104267.0 |
| 1 | Stanford University | NaN | NaN | NaN | NaN | NaN | -1387.0 | 1145.0 | 1959.0 | 10631.0 | ... | 4817005.0 | 6155327.0 | 0.35-0.39 | 0.25-0.29 | 0.35-0.39 | 0.30-0.34 | 77677.0 | 109840.0 | 77677.0 | 109840.0 |
| 2 | Massachusetts Institute of Technology | NaN | NaN | NaN | NaN | NaN | 4535.0 | 1820.0 | 6049.0 | 14381.0 | ... | 4060250.0 | 6282860.0 | 0.30-0.34 | 0.40-0.44 | 0.35-0.39 | 0.45-0.49 | 102988.0 | 129392.0 | 102988.0 | 129392.0 |
| 5 | University of California, Berkeley | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | Princeton University | NaN | NaN | NaN | NaN | NaN | 1386.0 | 2044.0 | 7576.0 | 16989.0 | ... | 1695765.0 | 2047959.0 | 0.30-0.39 | 0.20-0.29 | 0.40-0.49 | 0.20-0.29 | 64287.0 | 93603.0 | 64287.0 | 93603.0 |
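The preview repeats `MD_EARN_WNE_1YR` and `MD_EARN_WNE_4YR` because those names appear twice in `cols_of_interest`, and the `BBRR` columns hold binned strings such as `0.30-0.34` instead of numbers. The sketch below shows one possible cleanup on a made-up frame: `range_midpoint` is a hypothetical helper, and using the bin midpoint is an assumption about how to make those columns numeric, not an established convention.

```python
import pandas as pd

def range_midpoint(value):
    """Convert a string such as '0.30-0.34' to the midpoint of its bounds."""
    if not isinstance(value, str) or "-" not in value:
        return float("nan")
    low, high = value.split("-", 1)
    return (float(low) + float(high)) / 2

# Made-up frame that mimics the duplicated and binned columns.
toy = pd.DataFrame(
    [["School A", 50000.0, "0.30-0.34", 50000.0],
     ["School B", None, None, None]],
    columns=["INSTNM", "MD_EARN_WNE_1YR", "BBRR1_FED_UGCOMP_MAKEPROG", "MD_EARN_WNE_1YR"],
)

# Keep only the first occurrence of each column name.
deduped = toy.loc[:, ~toy.columns.duplicated()].copy()

# Replace the binned strings with numeric midpoints; missing values stay NaN.
deduped["BBRR1_FED_UGCOMP_MAKEPROG"] = deduped["BBRR1_FED_UGCOMP_MAKEPROG"].map(range_midpoint)
print(deduped)
```

Applied to the combined data, the same two steps would remove the repeated earnings columns and make the repayment columns usable in numeric comparisons.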


# Ethics & Privacy

In managing ethics and privacy, we prioritized getting data from an authoritative source, so we went to the body that oversees education in the US. Our data come from the College Scorecard on a .gov website, an enormous dataset that tracks many different metrics about schools.

Now, when looking at what kinds of biases might have come forth from the datasets. First to start with the easier one, for dataset #2, 

When weighing the questions we might ask, we considered a range of moral factors. We wanted to make sure that the question we decided on did not conflict with the ethics or morals we believe in, so we looked at the questions we were proposing as objectively as possible. With the question we landed on, we wanted to use the metrics that people commonly use to measure success. These included satisfaction, work-life balance, and income, the typical but perhaps less ethical measure.

The dataset we plan to use undoubtedly has some biases, as it is very challenging to accurately sample the group we identified: United States university engineering graduates. However, in our analysis we will be open about where we got our data and how representative it is of the population we claim it describes. Regardless, we will do our best to create an analysis that is as equitable as possible by trimming the dataset so that it is not skewed toward any one sub-population. This will happen during the data manipulation stage of our project, before analysis.

The strong suit of our question in terms of equitability is that people who attend a university for an engineering degree are a subset that, while perhaps not representative of the United States as a whole, is on a semi-even playing field. Of course, everyone's upbringing and the challenges they faced along the way vary; however, they all made it to university. This, in some ways, makes the population more equitable than if we had considered a broader group.

# Team Expectations 


* Respond to messages from other team members within 24 hours
* Weekly check-in to update team progress
* If a member has to miss a meeting, message the other members
* Treat all members with respect
* Always be willing to help out members if they ask
* Complete the work assigned to you in a timely manner
* Communicate with the rest of the team if you're stuck

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 5/1  |  5 PM | Brainstorm ideas for projects  | Determine our research question and split up Project Proposal sections | 
| 5/8  |  5 PM | Identify potential data sources  | Finalize data sources and plan for data schema. Assign data wrangling tasks | 
| 5/15  | 5 PM  | Complete data wrangling necessary for EDA  | Discuss goals of EDA and what we are looking for | 
| 5/22  | 5 PM  | Complete EDA  | Compare results for EDA and do writeup. Discuss plans for analysis and assign individual tasks | 
| 5/29  | 5 PM  | Start individual analysis tasks  | Compare analysis results | 
| 6/4  | 5 PM  | Refine analysis and start drafting final write-up sections  | Finalize analysis and focus on writeup | 
| 6/11 | 5 PM  | Finish writeup second drafts | Finalize project and submit |