# World University Rankings 2023 #

## STAT 301 Group Project 

### Introduction

Start with relevant background information on the topic to prepare those unfamiliar for the rest of your proposal.

Formulate one or two questions for investigation and detail the dataset that will be utilized to address these questions.

Additionally, align your question/objectives with the existing literature. To contextualize your study, include a minimum of two scientific publications (these should be listed in the References section).



### Methods and Results

In this section, you will include:

**a) “Exploratory Data Analysis (EDA)”**

- Demonstrate that the dataset can be read into R.
- Clean and wrangle your data into a tidy format.
- Plot the relevant raw data, tailoring your plot to address your question.
  - Make sure to explore the association of the explanatory variables with the response.
- Any summary tables that are relevant to your analysis.
- Be sure not to print output that takes up a lot of screen space.
- Your EDA must be comprehensive with high quality plots.

**b) “Methods: Plan”**

- Describe in written English the methods you used to perform your analysis from beginning to end, and narrate the code that does the analysis.
- If included, describe the “Feature Selection” process and how and why you choose the covariates of your final model.
- Make sure to interpret/explain the results you obtain. It’s not enough to just say, “I fitted a linear model with these covariates, and my R-square is 0.87”.
  - If inference is the aim of your project, a detailed interpretation of your fitted model is required, as well as a discussion of relevant quantities (e.g., are the coefficients significant? How does the model fit the data)?
  - A careful model assessment must be conducted.
  - If prediction is the project's aim, describe the test data used or how it was created.
- Ensure your tables and/or figures are labelled with a figure/table number.

In [6]:
library(tidyverse)
library(repr)
library(broom)
library(GGally)
library(car)
library(rsample)
library(leaps)
library(glmnet)

In [106]:
# part a
#loading and fixing column names
university_data <- read_csv("uni_rankings_2023.csv")
colnames(university_data) <- c("university_rank", "name_of_university", "location", "no_of_student", "no_of_student_per_staff", 
                               "international_student", "female_male_ratio", "overall_score", "teaching_score", "research_score",
                               "citations_score","industry_income_score", "international_outlook_score")
head(university_data)

# changing data from chr to dbl
university_data_cleaned <- university_data |>
mutate(international_student = as.numeric(gsub("%", "", international_student)) / 100,
      female_male_ratio = as.numeric(sub(":.*", "", female_male_ratio))/as.numeric(sub(".*:", "", female_male_ratio)),
      overall_score = as.numeric(overall_score),
      teaching_score = as.numeric(teaching_score),
      research_score = as.numeric(research_score),
      citations_score = as.numeric(citations_score),
      industry_income_score = as.numeric(industry_income_score),
      international_outlook_score = as.numeric(international_outlook_score)) |>
drop_na()

head(university_data_cleaned)

[1mRows: [22m[34m2341[39m [1mColumns: [22m[34m13[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (11): University Rank, Name of University, Location, International Stude...
[32mdbl[39m  (1): No of student per staff
[32mnum[39m  (1): No of student

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


university_rank,name_of_university,location,no_of_student,no_of_student_per_staff,international_student,female_male_ratio,overall_score,teaching_score,research_score,citations_score,industry_income_score,international_outlook_score
<chr>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,University of Oxford,United Kingdom,20965,10.6,42%,48 : 52,96.4,92.3,99.7,99.0,74.9,96.2
2,Harvard University,United States,21887,9.6,25%,50 : 50,95.2,94.8,99.0,99.3,49.5,80.5
3,University of Cambridge,United Kingdom,20185,11.3,39%,47 : 53,94.8,90.9,99.5,97.0,54.2,95.8
3,Stanford University,United States,16164,7.1,24%,46 : 54,94.8,94.2,96.7,99.8,65.0,79.8
5,Massachusetts Institute of Technology,United States,11415,8.2,33%,40 : 60,94.2,90.7,93.6,99.8,90.9,89.3
6,California Institute of Technology,United States,2237,6.2,34%,37 : 63,94.1,90.9,97.0,97.3,89.8,83.6


[1m[22m[36mℹ[39m In argument: `overall_score = as.numeric(overall_score)`.
[33m![39m NAs introduced by coercion


university_rank,name_of_university,location,no_of_student,no_of_student_per_staff,international_student,female_male_ratio,overall_score,teaching_score,research_score,citations_score,industry_income_score,international_outlook_score
<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,University of Oxford,United Kingdom,20965,10.6,0.42,0.9230769,96.4,92.3,99.7,99.0,74.9,96.2
2,Harvard University,United States,21887,9.6,0.25,1.0,95.2,94.8,99.0,99.3,49.5,80.5
3,University of Cambridge,United Kingdom,20185,11.3,0.39,0.8867925,94.8,90.9,99.5,97.0,54.2,95.8
3,Stanford University,United States,16164,7.1,0.24,0.8518519,94.8,94.2,96.7,99.8,65.0,79.8
5,Massachusetts Institute of Technology,United States,11415,8.2,0.33,0.6666667,94.2,90.7,93.6,99.8,90.9,89.3
6,California Institute of Technology,United States,2237,6.2,0.34,0.5873016,94.1,90.9,97.0,97.3,89.8,83.6


### part b

We will first fit the full linear model.

In [107]:
# Main Developer: Kaichi

# fitting the linear model
university_lm <- lm(overall_score ~ . - university_rank - name_of_university - location, data = university_data_cleaned)
uni_tidy_lm <- tidy(university_lm)
uni_tidy_lm

term,estimate,std.error,statistic,p.value
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
(Intercept),0.01850202,0.03456206,0.53532757,0.5933669
no_of_student,-8.625404e-08,2.270879e-07,-0.37982662,0.7047136
no_of_student_per_staff,-0.0003538851,0.0003015935,-1.17338432,0.2428551
international_student,0.00129407,0.0371465,0.03483693,0.9722649
female_male_ratio,-0.01823315,0.009996851,-1.82388907,0.07053866
teaching_score,0.2988413,0.0005299324,563.92339403,3.234071e-216
research_score,0.3008783,0.0004628687,650.029511,5.452255999999999e-224
citations_score,0.3005338,0.0003213449,935.23775731,6.825761e-244
industry_income_score,0.02487489,0.0001831504,135.81670636,1.727166e-138
international_outlook_score,0.07482483,0.0002889544,258.95029499,1.142898e-173


**Table 1**: Estimated Coefficients of `university_lm`

Now we will verify their vifs to identify if multicollinearity exists

In [108]:
# Main Developer: Kaichi

# check their vifs
university_vif <- vif(university_lm)
university_vif

Research score and teaching score have concerningly high variance inflation factors, indicating we should remove these variables from the model. However, upon closer inspection, the two variables are highly correlated to each other, thus, removing the variable with the highest vif should suffice.

In [117]:
# Main Developer: Kaichi

teaching_removed_lm <- lm(overall_score ~ . - university_rank - name_of_university - location - teaching_score, 
                          data = university_data_cleaned)
teaching_removed_vif <- vif(teaching_removed_lm)
teaching_removed_vif

However, using only the vif may not result in the best fitted model; thus, we will use regularization techniques to select the best variables to include in our final model. Since this data has a large number of covariates where multicollinearity could potentially be an issue and we want to find the best variables to predict future values of `overall_score`, we will run Lasso regularization to select our varaiables for the final model. 

First we will split our data into training(70%) and testing(30%) data.

In [121]:
# Main Developer: Kaichi

data_split <- initial_split(university_data_cleaned, prop = 0.7) #splits the dataset into a 7:3 ratio
uni_training <- training(data_split) #creates training data which includes 70% of original data
uni_testing <- testing(data_split) #creates testing data which includes 30% of original data
head(uni_training)

university_rank,name_of_university,location,no_of_student,no_of_student_per_staff,international_student,female_male_ratio,overall_score,teaching_score,research_score,citations_score,industry_income_score,international_outlook_score
<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
21,"University of California, Los Angeles",United States,42434,9.7,0.16,1.2727273,85.8,80.4,88.9,95.4,58.8,65.0
117,Aarhus University,Denmark,26475,13.5,0.1,1.2222222,61.2,40.6,60.0,77.6,74.1,77.9
139,University of Würzburg,Germany,23754,40.7,0.1,1.3809524,58.4,40.7,45.2,87.8,79.4,57.3
44,Monash University,Australia,58725,42.5,0.41,1.3255814,73.6,56.9,68.7,90.4,78.4,91.0
91,Korea Advanced Institute of Science and Technology (KAIST),South Korea,9946,10.7,0.09,0.2658228,64.2,64.5,66.0,65.7,100.0,38.2
48,University of Illinois at Urbana-Champaign,United States,48674,17.4,0.22,0.9230769,72.7,67.1,78.9,78.1,50.1,56.2


**Table 2**: First 5 Rows of Training Data

In [122]:
head(uni_testing)

university_rank,name_of_university,location,no_of_student,no_of_student_per_staff,international_student,female_male_ratio,overall_score,teaching_score,research_score,citations_score,industry_income_score,international_outlook_score
<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,University of Oxford,United Kingdom,20965,10.6,0.42,0.9230769,96.4,92.3,99.7,99.0,74.9,96.2
5,Massachusetts Institute of Technology,United States,11415,8.2,0.33,0.6666667,94.2,90.7,93.6,99.8,90.9,89.3
7,Princeton University,United States,8279,8.0,0.23,0.8518519,92.4,87.6,95.9,99.1,66.0,80.3
10,Imperial College London,United Kingdom,18545,11.2,0.61,0.6666667,90.4,82.8,90.8,98.3,59.8,97.5
13,The University of Chicago,United States,15366,6.0,0.36,0.8867925,88.9,86.5,88.8,97.7,56.2,74.2
14,University of Pennsylvania,United States,21453,6.3,0.23,1.1276596,88.8,86.0,88.8,97.0,75.8,71.5


**Table 3**: First 5 Rows of Testing Data

Now we will run Lasso regularization on `uni_training` since the training data is used to train the model to fit the data and penalize large coefficients. By running lasso on the training data we will be able to find the value of the L1 penalty term ($\lambda$) that provides the lowest cross-validation MSE.

In [123]:
# Main Developer: Kaichi

#run the lasso regression on the training set
uni_lasso<-
    cv.glmnet(uni_training %>% select(-university_rank, -name_of_university, -location, -overall_score) %>% as.matrix(), 
              uni_training$overall_score, 
              alpha=1)

lasso_model


Call:  cv.glmnet(x = uni_testing %>% select(-university_rank, -name_of_university,      -location, -overall_score) %>% as.matrix(), y = uni_testing$overall_score,      alpha = 1) 

Measure: Mean-Squared Error 

    Lambda Index Measure      SE Nonzero
min 0.2205    44  0.1915 0.04553       5
1se 0.2420    43  0.2268 0.05453       5

Now that we ran the model, we will extract the coefficients of the model with the lowest cross-validation MSE and the names of the covariates.

In [124]:
# Main Developer: Kaichi

lasso_coef <-
    coef(uni_lasso, s = uni_lasso$lambda.min)

lasso_selected_covariates <-
    as_tibble(
        as.matrix(lasso_coef),
        rownames='covariate') %>%
        filter(covariate != '(Intercept)' & abs(s1) !=0) %>% 
        pull(covariate)
lasso_coef

10 x 1 sparse Matrix of class "dgCMatrix"
                                    s1
(Intercept)                 3.07625261
no_of_student               .         
no_of_student_per_staff     .         
international_student       .         
female_male_ratio           .         
teaching_score              0.27771788
research_score              0.31398129
citations_score             0.28652796
industry_income_score       0.01364919
international_outlook_score 0.06356860

**Table 4**: Coefficients of Model with Lowest MSE

In [126]:
# Main Developer: Kaichi

#obtain the selected covariates
lasso_selected_covariates

Lasso regression should remove variables with high multicollinearity and is able to fit models on datasets with high multicollinearity. To verify this, we will check the variance inflation factor (VIF) of the lasso model

In [127]:
# Main Developer: Kaichi

lasso_vif <- vif(lm(overall_score ~ . , data = uni_training %>% 
        select(contains(lasso_selected_covariates), overall_score)))
lasso_vif

Most of the variables selected in this model have low vif values indicating there is no issue of multicollinearity. However, `teaching_score` and `reasearch_score` both have high vif values which point to multicollinearity. Despite the high multicollinearity, the model could still be effective since lasso regression includeds variables that have high multicollinearity if they contribute significantly to the model. To verify the effectiveness of the model, we will fit an oridinary least squares model with the lasso selected variables and evaluate the $R^2$ value to determine the models effectiveness. To fit the OLS we will use the testing data which was kept separate from the training process as it provides an unbiased evaluation when comparing different models and a measure of how well the model does on unseen data. 

In [129]:
# Main Developer: Kaichi

#fit OLS model using lasso selected covariates
lasso_cov_model <- 
    lm(overall_score ~ .,
        data = uni_testing %>% 
                   select(contains(lasso_selected_covariates), overall_score))
summary(lasso_cov_model)


Call:
lm(formula = overall_score ~ ., data = uni_testing %>% select(contains(lasso_selected_covariates), 
    overall_score))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.05068 -0.02063  0.01030  0.01722  0.05708 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 -0.0366695  0.0512154  -0.716    0.479    
teaching_score               0.2998697  0.0007036 426.186   <2e-16 ***
research_score               0.2996657  0.0006858 436.970   <2e-16 ***
citations_score              0.3005105  0.0005030 597.434   <2e-16 ***
industry_income_score        0.0249281  0.0002898  86.021   <2e-16 ***
international_outlook_score  0.0754046  0.0003467 217.522   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.03133 on 35 degrees of freedom
Multiple R-squared:      1,	Adjusted R-squared:      1 
F-statistic: 1.276e+06 on 5 and 35 DF,  p-value: < 2.2e-16


**Table 5**: Summary of OLS Model Using Lasso Selected Variables

We will now compare it to the full OLS model.

In [131]:
summary(university_lm)


Call:
lm(formula = overall_score ~ . - university_rank - name_of_university - 
    location, data = university_data_cleaned)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.084303 -0.025437  0.000312  0.022858  0.080683 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  1.850e-02  3.456e-02   0.535   0.5934    
no_of_student               -8.625e-08  2.271e-07  -0.380   0.7047    
no_of_student_per_staff     -3.539e-04  3.016e-04  -1.173   0.2429    
international_student        1.294e-03  3.715e-02   0.035   0.9723    
female_male_ratio           -1.823e-02  9.997e-03  -1.824   0.0705 .  
teaching_score               2.988e-01  5.299e-04 563.923   <2e-16 ***
research_score               3.009e-01  4.629e-04 650.030   <2e-16 ***
citations_score              3.005e-01  3.213e-04 935.238   <2e-16 ***
industry_income_score        2.487e-02  1.832e-04 135.817   <2e-16 ***
international_outlook_score  7.482e

**Table 6**: Summary of Full OLS Model

The adjusted $R^2$ in the full model is the same as the model with the lasso selected variables which implies that both the full model and model with lasso selected variables explain the variance `overall_score` equally than the model with both of the $R^2$ being 1. The $R^2$ being 1 implies that both models perfectly explain the variance of `overall_score`. Since the $R^2$ between the full model and lasso model was relatively the same, it justifies the variables selected with the lasso model since it implies that the variables removed did not make a significant impact to the models ability to explain the variance in `overall_score`. This result is also consistent with the result of the $p$-values in the full model as the variables that were removed in the lasso regression are also shown as insignificant in the full model.

### Discussion

In this section, you’ll interpret the results you obtained in the previous section with respect to the main question/goal of your project.

Summarize what you found and the implications/impact of your findings.
If relevant, discuss whether your results were what you expected to find.
Discuss how your model could be improved;
Discuss future questions/research this study could lead to.

### References

At least two citations of literature relevant to the project. The citation format is your choice – just be consistent. Make sure to cite the source of your data as well.