## An Investigation into Classifying Alumni’s College Regions Using Their Salaries

## INTRODUCTION

Our data set groups many U.S. colleges by the region in which they are located. For each college, it also records the starting median salary and the mid-career median salary, which we use as the basis for classification.

**Classification question: Given a person’s entry-level and mid-career salary, in which region did they most likely attend college?**

The data set we use is “Salaries for colleges by region” from the collection “Where it pays to attend college”, collected from The Wall Street Journal and Payscale Inc. This data frame provides the median entry-level salary and the median mid-career salary for colleges in the U.S., grouped by the region in which each college is located. We intend to use K-nearest neighbors classification: a new observation (a college that is not in our data set) is assigned to a region based on its entry-level and mid-career salaries, using the regions of the K colleges in the data set whose two salaries are closest to its own.
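To make the idea concrete, here is a minimal base-R sketch of the K-nearest neighbors rule: standardize the two salaries, measure Euclidean distance, and take a majority vote among the K closest colleges. The six training rows, the column names `starting` and `mid_career`, and the helper `knn_predict()` are hypothetical and chosen only for illustration; they are not taken from the data set.

```r
# Hypothetical training observations (salaries in USD), for illustration only
train <- data.frame(
  starting   = c(42000, 43000, 44000, 50000, 51000, 52000),
  mid_career = c(75000, 77000, 79000, 95000, 97000, 99000),
  region     = c("Southern", "Southern", "Southern",
                 "California", "California", "California")
)

# Hypothetical helper: classify one new observation by majority vote
# among its k nearest training observations
knn_predict <- function(train, new_point, k = 3) {
  predictors <- c("starting", "mid_career")

  # Standardize both predictors so neither salary dominates the distance
  centers <- colMeans(train[predictors])
  scales  <- apply(train[predictors], 2, sd)
  scaled_train <- scale(train[predictors], center = centers, scale = scales)
  scaled_new   <- (unlist(new_point[predictors]) - centers) / scales

  # Euclidean distance from the new observation to every training row
  distances <- sqrt(rowSums(sweep(scaled_train, 2, scaled_new)^2))

  # Majority vote among the k closest rows
  nearest <- order(distances)[seq_len(k)]
  votes <- table(train$region[nearest])
  names(votes)[which.max(votes)]
}

prediction <- knn_predict(train, data.frame(starting = 49500, mid_career = 94000), k = 3)
prediction
```

The analysis itself relies on tidymodels rather than a hand-written function; this sketch only illustrates what "K closest" means.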



In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
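# tests.R and cleanup.R are helper scripts; source them only if they are present,
# because source() stops with "cannot open the connection" when a file is missing
for (helper in c("tests.R", "cleanup.R")) {
    if (file.exists(helper)) source(helper)
}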



## EXPLORATORY DATA ANALYSIS

In [None]:
# download the data set and read it in
download.file("https://raw.githubusercontent.com/colinyee9935/DSCI100_Group_Project/main/data/salaries_by_region.csv", "salary_region.csv")
salary_region <- read_csv("salary_region.csv")

salary_region <- salary_region |>
    # reformat the column names: lower case, with each run of non-alphanumeric characters replaced by an underscore
    rename_with(~ str_to_lower(.) |> str_replace_all("[^[:alnum:]]+", "_"), .cols = everything()) |>
    # mark the "N/A" strings in the 10th and 90th percentile columns as missing
    mutate(mid_career_10th_percentile_salary = na_if(mid_career_10th_percentile_salary, "N/A"),
           mid_career_90th_percentile_salary = na_if(mid_career_90th_percentile_salary, "N/A")) |>
    # convert those two columns to dbl, then drop the original character versions
    mutate(mid_career_10th_percentile_salary_d = as.numeric(mid_career_10th_percentile_salary),
           mid_career_90th_percentile_salary_d = as.numeric(mid_career_90th_percentile_salary)) |>
    select(-mid_career_10th_percentile_salary, -mid_career_90th_percentile_salary) |>
    # rearrange the columns so the percentiles run from 10th to 90th
    select(1:4, mid_career_10th_percentile_salary_d, everything(), mid_career_90th_percentile_salary_d) |>
    # shorten the percentile column names
    rename(midc_10th_salary = mid_career_10th_percentile_salary_d,
           midc_25th_salary = mid_career_25th_percentile_salary,
           midc_75th_salary = mid_career_75th_percentile_salary,
           midc_90th_salary = mid_career_90th_percentile_salary_d)


#    select(school_name:mid_career_median_salary)
salary_region

In [None]:
#scatter plot of starting median salary against mid-career median salary,
#the two variables we will use as predictors

options(repr.plot.width = 10, repr.plot.height = 10)

salary_region_graph<-salary_region|>
    ggplot(aes(x=starting_median_salary, y=mid_career_median_salary, color=region))+
    geom_point(alpha = 0.4)+
    labs(x="Starting Median Salary($)",
        y="Mid Career Median Salary($)",
        color="Regions")+
    scale_color_brewer(palette = "Dark2")+
    theme(text=element_text(size=15))

salary_region_graph

#The distribution has a distinct feature: a large group of data points is clustered in the box bounded by
#starting salaries of $40,000 to $50,000 and mid-career salaries of $70,000 to $90,000.
#Classification within this range may be unreliable, because the data points belonging to different regions lie very close together.
#If we use only these two columns and leave out the percentile columns, we should therefore be cautious and re-test our model on different subsets of the data.

In [None]:
#investigate the number of observations in each region
#this is important in finding whether each group has sufficient data points for reference

region_count<-salary_region|>
    group_by(region)|>
    summarize(count=n())
region_count

#From the table below, we can see that the number of observations is unbalanced between regions.
#The count goes as low as 28 in California, which makes sense since it is only one state.
#Because the socioeconomic conditions in California differ considerably from those of many other U.S. states,
#it is understandable that it stands on its own: it has nearly the highest median income and some of the best educational resources.
#The count goes as high as 100 in Northeastern. This disparity in the number of observations could bias our analysis,
#because it affects the model and ultimately the predictions. We need to proceed with caution and be critical of our findings.

In [None]:
#count the missing data points in the percentile columns of the data table
#this is significant in deciding whether to include the percentile columns in our classification model to increase accuracy

columns_to_check <- c("midc_10th_salary", "midc_25th_salary", "midc_75th_salary", "midc_90th_salary")

missing_counts <- salary_region |>
    summarize(across(all_of(columns_to_check), ~ sum(is.na(.), na.rm = TRUE)))
missing_counts

#The following table shows the number of missing values in each percentile salary column.
#The 10th and 90th percentile columns each have 47 missing values; the other two columns have none.
#This is about 15% of the entire set, a large enough share to distort our model if we used these columns.
#The issue is most prominent for regions with fewer observations, such as California, where each missing value removes a larger share of the region's data.
#Nonetheless, we could include the 25th and 75th percentile columns to increase the accuracy of the model, as they contain no missing data.

## METHODS

We will use three columns from the data set: region, starting median salary, and mid-career median salary. Region is the class label we want to predict, and the two salaries are our predictors. Together, the two salaries give a more holistic view of alumni earnings than either one alone: graduates of top colleges are likely to earn a higher average salary from the start, while graduates of other colleges may start lower but eventually reach the same level. This produces a distinctive salary pattern for each type of college, which is also shaped by the region the college is in (higher-paying versus lower-paying places), and we expect that pattern to improve accuracy when the model is tested.

As stated above, we will build a K-nearest neighbors classification model with region as the class label and the two salaries as the predictors. To choose K, we will follow the convention of 5-fold cross-validation: our data set is divided into 5 subsets of equal proportion, and each subset in turn is used to validate a model trained on the remaining 80% of the data. The K with the best average validation accuracy is the one we keep. We will then use this K to classify newly observed data provided in another sheet from the website.
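The tuning procedure described above could be sketched with tidymodels roughly as follows. This is only an outline under stated assumptions: the tibble `toy` is randomly generated stand-in data (not the real salaries), the 75/25 train/test split and the grid of odd K values from 1 to 15 are illustrative choices, and the `kknn` package is assumed to be installed as the engine.

```r
library(tidymodels)

set.seed(2024)

# Hypothetical stand-in data; the real analysis would use salary_region instead
toy <- tibble(
  region = factor(rep(c("California", "Northeastern", "Southern"), each = 30)),
  starting_median_salary = c(rnorm(30, 52000, 4000), rnorm(30, 48000, 4000), rnorm(30, 43000, 4000)),
  mid_career_median_salary = c(rnorm(30, 95000, 8000), rnorm(30, 90000, 8000), rnorm(30, 78000, 8000))
)

# Hold out a test set (assumed proportion), stratified by region
toy_split <- initial_split(toy, prop = 0.75, strata = region)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Standardize both salary predictors so they contribute equally to distances
knn_recipe <- recipe(region ~ starting_median_salary + mid_career_median_salary, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K-nearest neighbors specification with K left open for tuning
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation: each fold validates a model trained on the other 80%
folds  <- vfold_cv(toy_train, v = 5, strata = region)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Keep the K with the highest mean cross-validated accuracy
best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Refit on all training data with the chosen K and predict the held-out rows
final_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
              set_engine("kknn") |>
              set_mode("classification")) |>
  fit(data = toy_train)

test_predictions <- predict(final_fit, toy_test) |>
  bind_cols(toy_test)
```

Stratifying the split and the folds by `region` is a design choice that keeps small regions such as California represented in every subset.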

### Visualization
We will generate a confusion matrix to visualize the number of correct and incorrect predictions for each region. From it, we can assess the model’s overall accuracy and, by designating one region as the positive class, its precision and recall.
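As a small worked illustration of these metrics, the base-R sketch below builds a confusion matrix from eight made-up labels and computes accuracy, precision, and recall. The `truth` and `predicted` vectors are hypothetical, and treating "California" as the positive class is an assumption made only for this example.

```r
# Hypothetical true and predicted regions for eight colleges
regions <- c("California", "Northeastern", "Southern")
truth <- factor(c("California", "California", "California",
                  "Northeastern", "Northeastern", "Northeastern",
                  "Southern", "Southern"), levels = regions)
predicted <- factor(c("California", "California", "Northeastern",
                      "Northeastern", "Northeastern", "California",
                      "Southern", "Southern"), levels = regions)

# Confusion matrix: rows are predicted regions, columns are true regions
confusion <- table(predicted = predicted, truth = truth)
confusion

# Accuracy: share of all predictions that fall on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)

# Precision and recall for one region treated as the positive class
positive <- "California"
true_positives <- confusion[positive, positive]
precision <- true_positives / sum(confusion[positive, ])  # of those predicted California
recall    <- true_positives / sum(confusion[, positive])  # of those truly California

c(accuracy = accuracy, precision = precision, recall = recall)
```

Here 6 of the 8 predictions are correct (accuracy 0.75), and 2 of the 3 colleges predicted as California are truly in California, as are 2 of the 3 true California colleges recovered (precision and recall both 2/3).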


## EXPECTED OUTCOMES AND SIGNIFICANCE

### Expected Outcomes
We expect to find that some regions have higher starting salaries than others, so that a higher starting or mid-career salary points toward those specific regions.

### Impact of findings
These findings could inform strategies for educational investment and career guidance that support economic growth. Furthermore, providing students with region-specific salary data will allow them to make better-informed decisions about their education and take a proactive approach toward career planning for long-term success.

### Future question
Future research could explore the effects of industry distribution, local economies, and cost of living on regional salary outcomes for graduates.


