# Exploring the data

Ideas:
- Train a model whose output eventually matches real-world rankings
- Use the model's weights to determine which features have the most impact

In [2]:
import pandas as pd
import altair as alt

## Mapping labels with string classification

![](https://www.startpage.com/av/proxy-image?piurl=https%3A%2F%2Ftse3.explicit.bing.net%2Fth%3Fid%3DOIP.8wHctLr_Dq71EwdJ8ydV3gHaEK%26pid%3DApi&sp=1741834617Ta555c5c0c05aefccefb18821829f8a1fad2051ba2c2f2699f7b018685c719e62)

Undo the categorical encoding the data already comes with, to get a better understanding of what each value means.

In [3]:
idcAboutStates_df = pd.read_csv("Labels.csv")
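
To make the decoding idea concrete, here is a minimal sketch in plain Python. It assumes the label file has the `VariableName`, `Value` and `ValueLabel` columns used later in this notebook; the sample rows and the helper names `build_label_maps` and `decode` are hypothetical and only there for illustration.

```python
from collections import defaultdict

# Hypothetical rows in the shape of Labels.csv: (VariableName, Value, ValueLabel)
label_rows = [
    ("Tribal college (HD2023)", 1, "Yes"),
    ("Tribal college (HD2023)", 2, "No"),
    ("Institution grants a medical degree (HD2023)", 1, "Yes"),
    ("Institution grants a medical degree (HD2023)", 2, "No"),
]


def build_label_maps(rows):
    """Group the label rows into {variable: {code: label}}."""
    maps = defaultdict(dict)
    for variable, value, label in rows:
        maps[variable][int(value)] = str(label)
    return dict(maps)


def decode(variable, code, maps):
    """Return the label for a code, or the code itself when no label is known."""
    return maps.get(variable, {}).get(code, code)


label_maps = build_label_maps(label_rows)
print(decode("Tribal college (HD2023)", 1, label_maps))
```

A nested mapping of this shape (`{column: {old: new}}`) is also a form that pandas' `DataFrame.replace` accepts, so it could likely replace the row-by-row loop.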

## I LOVE CLEANING UP DATA

Some of the columns in `data_df` have extra spaces in random places, which causes `replace()` to fail for those columns.
So annoying.

![](https://media1.tenor.com/m/z4Pjus3zEcsAAAAd/dad-coraline.gif)
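
One way to deal with the stray spaces is to normalize whitespace before matching names. This is a minimal sketch, assuming whitespace is the only difference between the label names and the column names; `normalize_label` is a hypothetical helper, not something from the notebook.

```python
def normalize_label(name: str) -> str:
    """Trim both ends and collapse every run of whitespace into one space."""
    return " ".join(name.split())


# A column name from this dataset that contains a double space
messy = "U.S. Nonresident total (EF2023A  All students total)"
print(normalize_label(messy))
```

The same function would have to be applied to both sides, for example `data_df.rename(columns=normalize_label)` and `idcAboutStates_df["VariableName"].map(normalize_label)`, otherwise the names still won't line up.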


In [4]:
# Filter out state abbreviations from our label data because it's irrelevant
idcAboutStates_df = (
    idcAboutStates_df[
        ~idcAboutStates_df["VariableName"]
        .str.contains("abbreviation", regex=True, case=False)]
        .astype({"Value": int, "ValueLabel": str}))

unique_variables = idcAboutStates_df["VariableName"].unique()

In [128]:
data_df = pd.read_csv("data.csv").drop(columns="UnitID")

In [129]:
for row in idcAboutStates_df.itertuples():
    try:
        # Assign the result back to the column. The chained form
        # `data_df[col].replace(..., inplace=True)` acts on a temporary copy,
        # raises a pandas FutureWarning, and stops working in pandas 3.0.
        data_df[row.VariableName] = data_df[row.VariableName].replace({row.Value: row.ValueLabel})
    # Skip over labels that don't exist on the data
    except KeyError as e:
        print(f"wtf man: {e}")


In [108]:
data_df

| | Institution Name | Institution (entity) name (HD2023) | Historically Black College or University (HD2023) | Tribal college (HD2023) | Carnegie Classification 2021: Basic (HD2023) | Institution grants a medical degree (HD2023) | State abbreviation (HD2023) | Carnegie Classification 2021: Undergraduate Profile (HD2023) | Primary public control (IC2023) | Yellow Ribbon Program (officially known as Post-9/11 GI Bill, Yellow Ribbon Program) (IC2023) | ... | All students enrolled (EF2023A_DIST Undergraduate total) | Students enrolled exclusively in distance education courses (EF2023A_DIST Undergraduate total) | Students enrolled in some but not all distance education courses (EF2023A_DIST Undergraduate total) | Student not enrolled in any distance education courses (EF2023A_DIST Undergraduate total) | Total library FTE staff (AL2023) | Total physical library circulations (books and media) (AL2023) | Total library circulations (physical and digital/electronic) (AL2023) | Total digital/electronic circulations (books and media) (AL2023) | Full-time retention rate 2023 (EF2023D) | Student-to-faculty ratio (EF2023D) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aaniiih Nakoda College | Aaniiih Nakoda College | 2 | 1 | 33 | 2 | MT | 3 | 1 | 0 | ... | 133 | | | 133 | 1.00 | 85.0 | 805.0 | 720.0 | | 10 |
| 1 | Abraham Baldwin Agricultural College | Abraham Baldwin Agricultural College | 2 | 2 | 23 | 2 | GA | 6 | 2 | 1 | ... | 3768 | 400.0 | 1289.0 | 2079 | 16.50 | 12594.0 | 236340.0 | 223746.0 | 69.0 | 23 |
| 2 | Adams State University | Adams State University | 2 | 2 | 18 | 2 | CO | 7 | 2 | 1 | ... | 1576 | 266.0 | 193.0 | 1117 | 11.00 | 11461.0 | 277461.0 | 266000.0 | 55.0 | 14 |
| 3 | Aims Community College | Aims Community College | 2 | 2 | 3 | 2 | CO | 1 | 8 | 0 | ... | 7529 | 988.0 | 1478.0 | 5063 | 18.00 | 2635.0 | 46411.0 | 43776.0 | 63.0 | 17 |
| 4 | Alabama A & M University | Alabama A & M University | 1 | 2 | 18 | 2 | AL | 10 | 2 | 1 | ... | 5845 | 265.0 | 2407.0 | 3173 | 26.00 | 544.0 | 135329.0 | 134785.0 | 64.0 | 19 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 751 | Wright State University-Main Campus | Wright State University-Main Campus | 2 | 2 | 16 | 1 | OH | 11 | 2 | 1 | ... | 7012 | 521.0 | 3911.0 | 2580 | 38.13 | 24379.0 | 510523.0 | 486144.0 | 63.0 | 15 |
| 752 | Yakima Valley College | Yakima Valley College | 2 | 2 | 14 | 2 | WA | 6 | 2 | 0 | ... | 3523 | 996.0 | 1214.0 | 1313 | 5.25 | 2000.0 | 33000.0 | 31000.0 | | 14 |
| 753 | Yavapai College | Yavapai College | 2 | 2 | 2 | 2 | AZ | 1 | 4 | 0 | ... | 7200 | 2074.0 | 1289.0 | 3837 | 16.71 | 11103.0 | 75327.0 | 64224.0 | | 19 |
| 754 | Youngstown State University | Youngstown State University | 2 | 2 | 18 | 2 | OH | 10 | 2 | 0 | ... | 8436 | 678.0 | 4063.0 | 3695 | 17.39 | 31918.0 | 686379.0 | 654461.0 | 77.0 | 18 |


In [8]:
count_carnegie = data_df['Carnegie Classification 2021: Undergraduate Profile (HD2023)'].value_counts().reset_index()
count_carnegie.rename(columns={count_carnegie.columns[0] : "category"}, inplace=True)

### Carnegie Classification

The majority of institutions in the list are four-year institutions; the most common profile is four-year, full-time with a higher transfer rate.

In [9]:
alt.Chart(count_carnegie.nlargest(5, 'count')).mark_arc().encode(
    theta="count",
    color=alt.Color("category", legend=alt.Legend(labelLimit=0, labelFontSize=15)),
    tooltip=[
        alt.Tooltip("category", title="Carnegie Classification"),
        alt.Tooltip("count", title="Count")
    ]
).properties(title="Carnegie Classification")

In [38]:
data_df.nlargest(40, "U.S. Nonresident total (EF2023A  All students total)")

| | UnitID | Institution Name | Institution (entity) name (HD2023) | Historically Black College or University (HD2023) | Tribal college (HD2023) | Carnegie Classification 2021: Basic (HD2023) | Institution grants a medical degree (HD2023) | State abbreviation (HD2023) | Carnegie Classification 2021: Undergraduate Profile (HD2023) | Primary public control (IC2023) | ... | All students enrolled (EF2023A_DIST Undergraduate total) | Students enrolled exclusively in distance education courses (EF2023A_DIST Undergraduate total) | Students enrolled in some but not all distance education courses (EF2023A_DIST Undergraduate total) | Student not enrolled in any distance education courses (EF2023A_DIST Undergraduate total) | Total library FTE staff (AL2023) | Total physical library circulations (books and media) (AL2023) | Total library circulations (physical and digital/electronic) (AL2023) | Total digital/electronic circulations (books and media) (AL2023) | Full-time retention rate 2023 (EF2023D) | Student-to-faculty ratio (EF2023D) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 174 | 139755 | Georgia Institute of Technology-Main Campus | Georgia Institute of Technology-Main Campus | No | No | Doctoral Universities: Very High Research Acti... | No | GA | Four-year, full-time, more selective, higher t... | State | ... | 19505 | 1036.0 | 3656.0 | 14813 | 101.0 | 32543.0 | 3394555.0 | 3362012.0 | 98.0 | 18 |
| 597 | 145637 | University of Illinois Urbana-Champaign | University of Illinois Urbana-Champaign | No | No | Doctoral Universities: Very High Research Acti... | Yes | IL | Four-year, full-time, more selective, lower tr... | State | ... | 35564 | 436.0 | 21645.0 | 13483 | 395.18 | 138967.0 | 9550228.0 | 9411261.0 | 94.0 | 20 |
| 13 | 104151 | Arizona State University Campus Immersion | Arizona State University Campus Immersion | No | No | Doctoral Universities: Very High Research Acti... | No | AZ | Four-year, full-time, selective, higher transf... | State | ... | 65174 | 3586.0 | 39453.0 | 22135 | 194.0 | 38971.0 | 4001681.0 | 3962710.0 | 85.0 | 18 |
| 396 | 243780 | Purdue University-Main Campus | Purdue University-Main Campus | No | No | Doctoral Universities: Very High Research Acti... | Yes | IN | Four-year, full-time, selective, lower transfe... | State | ... | 39864 | 756.0 | 13012.0 | 26096 | 175.66 | 23090.0 | 3909175.0 | 3886085.0 | 92.0 | 14 |
| 620 | 170976 | University of Michigan-Ann Arbor | University of Michigan-Ann Arbor | No | No | Doctoral Universities: Very High Research Acti... | Yes | MI | Four-year, full-time, more selective, lower tr... | State | ... | 33730 | 18.0 | 3421.0 | 30291 | 574.0 | 110300.0 | 9292905.0 | 9182605.0 | 98.0 | 11 |
| 653 | 227216 | University of North Texas | University of North Texas | No | No | Doctoral Universities: Very High Research Acti... | No | TX | Four-year, full-time, selective, higher transf... | State | ... | 33858 | 3141.0 | 21803.0 | 8914 | 404.0 | 50836.0 | 24287310.0 | 24236474.0 | 77.0 | 23 |
| 682 | 236948 | University of Washington-Seattle Campus | University of Washington-Seattle Campus | No | No | Doctoral Universities: Very High Research Acti... | Yes | WA | Four-year, full-time, more selective, lower tr... | State | ... | 39515 | 634.0 | 7080.0 | 31801 | 347.0 | 79582.0 | 5255413.0 | 5175831.0 | 95.0 | 20 |
| 565 | 110680 | University of California-San Diego | University of California-San Diego | No | No | Doctoral Universities: Very High Research Acti... | Yes | CA | Four-year, full-time, more selective, higher t... | State | ... | 33792 | 203.0 | 7198.0 | 26391 | 228.72 | 32542.0 | 7305662.0 | 7273120.0 | 94.0 | 19 |
| 559 | 110635 | University of California-Berkeley | University of California-Berkeley | No | No | Doctoral Universities: Very High Research Acti... | No | CA | Four-year, full-time, more selective, higher t... | State | ... | 33078 | 49.0 | 7082.0 | 25947 | 444.0 | 253850.0 | 4524685.0 | 4270835.0 | 97.0 | 18 |
| 360 | 214777 | Pennsylvania State University-Main Campus | Pennsylvania State University-Main Campus | No | No | Doctoral Universities: Very High Research Acti... | Yes | PA | Not applicable, not in Carnegie universe (not ... | State | ... | 42223 | 207.0 | 18817.0 | 23199 | 530.0 | 144963.0 | 5833996.0 | 5689033.0 | 92.0 | 15 |


## Column Groupings

To avoid the possibility of carpal tunnel, we group the columns.
For more information about the groupings, see the Notes notebook.

| | Grand total (EF2023 All students total) | Grand total (EF2023 All students Undergraduate total) | Grand total (EF2023 All students Graduate and First professional) |
|---|---|---|---|
| 0 | 133 | 133 | |
| 1 | 3768 | 3768 | |
| 2 | 2950 | 1576 | 1374.0 |
| 3 | 7529 | 7529 | |
| 4 | 6614 | 5845 | 769.0 |
| ... | ... | ... | ... |
| 751 | 9884 | 7012 | 2872.0 |
| 752 | 3523 | 3523 | |
| 753 | 7200 | 7200 | |
| 754 | 11040 | 8436 | 2604.0 |


## Add real world rankings

Add columns recording where each institution places in real-world rankings such as:
- U.S. News Best Colleges
- Wall Street Journal
- Princeton Review
- Forbes
- Washington Monthly

Then analyze each ranking and look for patterns against the variables from the original data.
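
One possible way to look for such patterns is a rank correlation between a ranking and a single variable. The sketch below computes Spearman's correlation in plain Python; it is an assumption about how the analysis could start, not something the notebook does, and the sample numbers are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: ranking position vs. retention rate for five schools
ranking = [1, 2, 3, 4, 5]
retention = [98.0, 95.0, 94.0, 90.0, 85.0]
print(spearman(ranking, retention))
```

A value near -1 here would mean that better-ranked schools (smaller rank number) tend to have higher retention. With pandas, `Series.corr(other, method="spearman")` should give the same kind of number without the helper functions.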

### External College Rankings

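External rankings use different naming conventions for institutions, so an exact merge is not always possible.
Use fuzzy matching with a score threshold of 85 for accuracy (the `simple_fuzzy_merge` function later in this section defaults to a stricter 97).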
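
To show what a thresholded fuzzy match does, here is a minimal sketch using only the standard library. It swaps in `difflib.SequenceMatcher` as a stand-in for `thefuzz`, so the scores will not be identical to `fuzz.token_sort_ratio`; `token_sort_score` and `best_match` are hypothetical helper names.

```python
from difflib import SequenceMatcher


def token_sort_score(a: str, b: str) -> int:
    """Similarity from 0 to 100 after lowercasing and sorting the words."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, norm_a, norm_b).ratio())


def best_match(name, candidates, threshold=85):
    """Return (candidate, score) for the best match, or (None, score) below the threshold."""
    if not candidates:
        return None, 0
    score, candidate = max((token_sort_score(name, c), c) for c in candidates)
    return (candidate, score) if score >= threshold else (None, score)


print(best_match("Towson University", ["Towson University", "Salisbury University"]))
print(best_match("Towson University", ["Yale University"]))
```

A higher threshold drops more true matches whose names are written differently, while a lower one lets in more wrong pairs, so it is worth spot-checking the matched names either way.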

#### Niche

https://www.niche.com/api/renaissance/results/?type=private&type=public&listURL=best-colleges&page=1&searchType=college&limit=800

TU ranks 341. 

I'm cooked ![](https://cdn.betterttv.net/emote/59f27b3f4ebd8047f54dee29/3x.webp)

#### Forbes 2025

TU ranks 174

#### Times Higher Ed

TU ranks 391

In [49]:
from json import load
with open("Ranking_datasets/niche-800.json", "r") as f:
    niche_json = load(f)

with open("Ranking_datasets/Forbes-Ranking-2025.json") as f:
    forbes_json = load(f)

with open("Ranking_datasets/timeshighered-2022.json") as f:
    times_json = load(f)

niche_rankings = {}
forbes_rankings = {}
highered_rankings = {}
for idx, university in enumerate(niche_json["entities"]):
    niche_rankings[university["content"]["entity"]["name"]] = idx + 1

niche_df = pd.DataFrame(niche_rankings.items(), columns=["Institution", "Niche_Ranking"])

for university in forbes_json["organizationList"]["organizationsLists"]:
    forbes_rankings[university["organizationName"]] = university["rank"]

forbes_df = pd.DataFrame(forbes_rankings.items(), columns=["Institution", "Forbes_Ranking"])

for university in times_json["data"]:
    highered_rankings[university["name"]] = university["rank_order"]

highered_df = pd.DataFrame(highered_rankings.items(), columns=["Institution", "HigherEd_Ranking"])

In [130]:
from thefuzz import fuzz

# ngl, I had to use Claude because laziness and it's pretty common anyway
def simple_fuzzy_merge(df_left, df_right, left_on="Institution Name", right_on = "Institution", threshold=97):
    """
    Performs a simplified fuzzy merge between two dataframes using thefuzz.
    
    Parameters:
    -----------
    df_left : pandas DataFrame
        Left dataframe to merge
    df_right : pandas DataFrame
        Right dataframe to merge
    left_on : str
        Column name from left dataframe to match on
    right_on : str
        Column name from right dataframe to match on
    threshold : int, default 97
        Minimum similarity score to consider a match (0-100)
        
    Returns:
    --------
    pandas DataFrame
        Merged dataframe with additional column showing match score
    """
    # Create a copy of dataframes to avoid modifying originals
    left_df = df_left.copy().reset_index().rename(columns={'index': 'left_idx'})
    right_df = df_right.copy().reset_index().rename(columns={'index': 'right_idx'})
    
    # Create an empty list to store matches
    matches = []
    
    # For each value in the left dataframe
    for _, left_row in left_df.iterrows():
        left_value = str(left_row[left_on]).lower()
        left_idx = left_row['left_idx']
        
        # Calculate similarity scores with all values in the right dataframe
        right_df['score'] = right_df[right_on].apply(
            lambda x: fuzz.token_sort_ratio(left_value, str(x).lower())
        )
        
        # Get the best match if it's above the threshold
        best_match = right_df[right_df['score'] >= threshold].sort_values('score', ascending=False).head(1)
        
        if not best_match.empty:
            matches.append({
                'left_idx': left_idx,
                'right_idx': best_match['right_idx'].values[0],
                'score': best_match['score'].values[0]
            })
    
    # If no matches were found, return empty dataframe
    if not matches:
        return pd.DataFrame()
    
    # Create a dataframe from matches
    matches_df = pd.DataFrame(matches)
    
    # Merge the original dataframes based on the indices from matches_df
    result = pd.merge(
        df_left.loc[matches_df['left_idx']].reset_index(drop=True),
        df_right.loc[matches_df['right_idx']].reset_index(drop=True),
        left_index=True, 
        right_index=True,
        suffixes=('_left', '_right')
    )
    
    # Add the match score
    result['fuzzy_score'] = matches_df['score'].values
    
    return result

merged_niche = simple_fuzzy_merge(data_df, niche_df)
merged_forbes = simple_fuzzy_merge(data_df, forbes_df)
merged_highered = simple_fuzzy_merge(data_df, highered_df)

In [46]:
niche_df

| | Institution | Niche_Ranking | fuzz_score |
|---|---|---|---|
| 0 | Massachusetts Institute of Technology | 1 | 79 |
| 1 | Yale University | 2 | 81 |
| 2 | Stanford University | 3 | 86 |
| 3 | Harvard University | 4 | 82 |
| 4 | Dartmouth College | 5 | 71 |
| ... | ... | ... | ... |
| 795 | Commonwealth University - Bloomsburg | 796 | 69 |
| 796 | High Point University | 797 | 76 |
| 797 | West Virginia University at Parkersburg | 798 | 100 |
| 798 | Austin Peay State University | 799 | 100 |


#### How TU compares to other Maryland colleges

In [None]:
column_ranges = {
    "Degrees Conferred": (16, 21),
    "Financial Aid": (21, 27),
    "Student Success": (28, 32),
    "School Finance": (33, 63),
    "Library": (64, 71),
    "Admissions": (72, 74),
    "Race": (87, 96),
    "Population": (81, 84),
}

# Prepend the first column (Institution Name) to every column group
column_groups = {
    name: data_df.columns[start:end].insert(0, data_df.columns[0])
    for name, (start, end) in column_ranges.items()
}

column_ranges.items()


dict_items([('Degrees Conferred', (16, 21)), ('Financial Aid', (21, 27)), ('Student Success', (28, 32)), ('School Finance', (33, 63)), ('Library', (64, 71)), ('Admissions', (72, 74)), ('Race', (87, 96)), ('Population', (81, 84))])

In [132]:
merged_forbes[merged_forbes["State abbreviation (HD2023)"] == "MD"].sort_values(by="Forbes_Ranking")[column_groups["Degrees Conferred"]]

| | Institution Name | Grand total (C2023_A First major Grand total Bachelor's degree) | Grand total (C2023_A First major Grand total Master's degree) | Grand total (C2023_A First major Grand total Doctor's degree - research/scholarship ) | Grand total (C2023_A First major Grand total Doctor's degree - professional practice ) | Grand total (C2023_A First major Grand total Doctor's degree - other ) |
|---|---|---|---|---|---|---|
| 134 | University of Maryland-College Park | 8075.0 | 3078.0 | 631.0 | 39.0 | |
| 99 | Towson University | 4064.0 | 844.0 | 12.0 | 38.0 | |
| 133 | University of Maryland-Baltimore County | 2419.0 | 1000.0 | 100.0 | | |
| 79 | Salisbury University | 1605.0 | 287.0 | 5.0 | 8.0 | |
| 87 | St. Mary's College of Maryland | 291.0 | 14.0 | | | |
| 132 | University of Maryland Global Campus | 7843.0 | 3582.0 | | 41.0 | |
| 107 | University of Baltimore | 373.0 | 341.0 | 7.0 | 196.0 | |


## Data Preprocessing

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

numerical_col = [col for col in merged_forbes.select_dtypes(include=["number"]).columns if col not in ["Forbes_Ranking", "fuzzy_score"]]
merged_forbes[numerical_col] = scaler.fit_transform(merged_forbes[numerical_col])
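
What `MinMaxScaler` does to each numeric column with its default settings can be sketched by hand: every value becomes `(x - min) / (max - min)`, so the column ends up in the range 0 to 1. The helper `min_max_scale` below is hypothetical and skips things the real scaler handles, such as missing values; the sample numbers are undergraduate totals from the first rows shown earlier, used only for illustration.

```python
def min_max_scale(values):
    """Rescale numbers to the range [0, 1] with (x - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # A constant column has no spread, so every value maps to 0
        return [0.0 for _ in values]
    return [(v - lowest) / span for v in values]


undergrad_totals = [133, 3768, 1576, 7529, 5845]
print(min_max_scale(undergrad_totals))
```

Because the result depends on the minimum and maximum of whatever rows are passed in, scaling `merged_forbes` gives different numbers than scaling the full `data_df` would.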

In [142]:
merged_forbes[column_groups["Population"]]

| | Institution Name | Grand total (EF2023 All students total) | Grand total (EF2023 All students Undergraduate total) | Grand total (EF2023 All students Graduate and First professional) |
|---|---|---|---|---|
| 0 | Appalachian State University | 0.270365 | 0.315671 | 0.087570 |
| 1 | Auburn University | 0.425330 | 0.441788 | 0.292105 |
| 2 | Boise State University | 0.341735 | 0.384665 | 0.150984 |
| 3 | California Polytechnic State University-San Lu... | 0.286597 | 0.352042 | 0.043642 |
| 4 | California State Polytechnic University-Pomona | 0.344080 | 0.400942 | 0.113536 |
| ... | ... | ... | ... | ... |
| 176 | West Virginia University | 0.309192 | 0.302332 | 0.265615 |
| 177 | Western Washington University | 0.183384 | 0.219408 | 0.044642 |
| 178 | Wichita State University | 0.210919 | 0.205089 | 0.184621 |
| 179 | William & Mary | 0.118971 | 0.105584 | 0.132879 |
