# World University Rankings 2023 #

## STAT 301 Group Project 

### Introduction

Start with relevant background information on the topic to prepare those unfamiliar for the rest of your proposal.

Formulate one or two questions for investigation and detail the dataset that will be utilized to address these questions.

Additionally, align your question/objectives with the existing literature. To contextualize your study, include a minimum of two scientific publications (these should be listed in the References section).



### Methods and Results

In this section, you will include:

**a) “Exploratory Data Analysis (EDA)”**

- Demonstrate that the dataset can be read into R.
- Clean and wrangle your data into a tidy format.
- Plot the relevant raw data, tailoring your plot to address your question.
  - Make sure to explore the association of the explanatory variables with the response.
- Include any summary tables that are relevant to your analysis.
- Be sure not to print output that takes up a lot of screen space.
- Your EDA must be comprehensive, with high-quality plots.

**b) “Methods: Plan”**

- Describe in written English the methods you used to perform your analysis from beginning to end, and narrate the code that does the analysis.
- If included, describe the “Feature Selection” process and how and why you chose the covariates of your final model.
- Make sure to interpret/explain the results you obtain. It’s not enough to just say, “I fitted a linear model with these covariates, and my R-squared is 0.87”.
  - If inference is the aim of your project, a detailed interpretation of your fitted model is required, as well as a discussion of relevant quantities (e.g., are the coefficients significant? How well does the model fit the data?).
  - A careful model assessment must be conducted.
  - If prediction is the project's aim, describe the test data used or how it was created.
- Ensure your tables and/or figures are labelled with a figure/table number.

In [26]:
library(tidyverse)
library(repr)
library(broom)
library(GGally)
library(car)

Loading required package: carData


Attaching package: ‘car’


The following object is masked from ‘package:dplyr’:

    recode


The following object is masked from ‘package:purrr’:

    some




In [9]:
# Part a
# Load the data and replace the column names with snake_case names
university_data <- read_csv("uni_rankings_2023.csv")
colnames(university_data) <- c("university_rank", "name_of_university", "location", "no_of_student", "no_of_student_per_staff",
                               "international_student", "female_male_ratio", "overall_score", "teaching_score", "research_score",
                               "citations_score", "industry_income_score", "international_outlook_score")
head(university_data)

# Convert the character columns to numeric:
# - international_student: percentage string (e.g. "42%") to a proportion
# - female_male_ratio: "F : M" string to the female-to-male quotient
# - score columns: entries that cannot be parsed become NA (see the coercion warning in the output)
# Rows with any missing value are then dropped.
university_data_cleaned <- university_data |>
  mutate(international_student = as.numeric(gsub("%", "", international_student)) / 100,
         female_male_ratio = as.numeric(sub(":.*", "", female_male_ratio)) / as.numeric(sub(".*:", "", female_male_ratio)),
         overall_score = as.numeric(overall_score),
         teaching_score = as.numeric(teaching_score),
         research_score = as.numeric(research_score),
         citations_score = as.numeric(citations_score),
         industry_income_score = as.numeric(industry_income_score),
         international_outlook_score = as.numeric(international_outlook_score)) |>
  drop_na()

head(university_data_cleaned)

Rows: 2341 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): University Rank, Name of University, Location, International Stude...
dbl  (1): No of student per staff
num  (1): No of student

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| university_rank | name_of_university | location | no_of_student | no_of_student_per_staff | international_student | female_male_ratio | overall_score | teaching_score | research_score | citations_score | industry_income_score | international_outlook_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | University of Oxford | United Kingdom | 20965 | 10.6 | 42% | 48 : 52 | 96.4 | 92.3 | 99.7 | 99.0 | 74.9 | 96.2 |
| 2 | Harvard University | United States | 21887 | 9.6 | 25% | 50 : 50 | 95.2 | 94.8 | 99.0 | 99.3 | 49.5 | 80.5 |
| 3 | University of Cambridge | United Kingdom | 20185 | 11.3 | 39% | 47 : 53 | 94.8 | 90.9 | 99.5 | 97.0 | 54.2 | 95.8 |
| 3 | Stanford University | United States | 16164 | 7.1 | 24% | 46 : 54 | 94.8 | 94.2 | 96.7 | 99.8 | 65.0 | 79.8 |
| 5 | Massachusetts Institute of Technology | United States | 11415 | 8.2 | 33% | 40 : 60 | 94.2 | 90.7 | 93.6 | 99.8 | 90.9 | 89.3 |
| 6 | California Institute of Technology | United States | 2237 | 6.2 | 34% | 37 : 63 | 94.1 | 90.9 | 97.0 | 97.3 | 89.8 | 83.6 |


ℹ In argument: `overall_score = as.numeric(overall_score)`.
! NAs introduced by coercion


| university_rank | name_of_university | location | no_of_student | no_of_student_per_staff | international_student | female_male_ratio | overall_score | teaching_score | research_score | citations_score | industry_income_score | international_outlook_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | University of Oxford | United Kingdom | 20965 | 10.6 | 0.42 | 0.9230769 | 96.4 | 92.3 | 99.7 | 99.0 | 74.9 | 96.2 |
| 2 | Harvard University | United States | 21887 | 9.6 | 0.25 | 1.0 | 95.2 | 94.8 | 99.0 | 99.3 | 49.5 | 80.5 |
| 3 | University of Cambridge | United Kingdom | 20185 | 11.3 | 0.39 | 0.8867925 | 94.8 | 90.9 | 99.5 | 97.0 | 54.2 | 95.8 |
| 3 | Stanford University | United States | 16164 | 7.1 | 0.24 | 0.8518519 | 94.8 | 94.2 | 96.7 | 99.8 | 65.0 | 79.8 |
| 5 | Massachusetts Institute of Technology | United States | 11415 | 8.2 | 0.33 | 0.6666667 | 94.2 | 90.7 | 93.6 | 99.8 | 90.9 | 89.3 |
| 6 | California Institute of Technology | United States | 2237 | 6.2 | 0.34 | 0.5873016 | 94.1 | 90.9 | 97.0 | 97.3 | 89.8 | 83.6 |
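
The cleaning step above relies on three string-to-number conversions. The following is a minimal sketch of how each one behaves, using a few made-up strings in the same shape as the raw columns rather than the rankings file itself. The example values, including the range-style score `"51.6-55.9"`, are assumptions chosen only to show when `as.numeric()` returns `NA`.

```r
# Toy strings shaped like the raw columns (hypothetical values, not the real data)
pct   <- c("42%", "25%", "")
ratio <- c("48 : 52", "50 : 50", NA)
score <- c("96.4", "51.6-55.9", "n/a")

# Percentage string -> proportion; an empty string becomes NA
pct_num <- as.numeric(gsub("%", "", pct)) / 100

# "F : M" string -> female-to-male quotient
# as.numeric() tolerates the whitespace left around each number
female   <- as.numeric(sub(":.*", "", ratio))
male     <- as.numeric(sub(".*:", "", ratio))
fm_ratio <- female / male

# Scores that are not plain numbers become NA (normally with a coercion warning)
score_num <- suppressWarnings(as.numeric(score))

print(pct_num)
print(fm_ratio)
print(score_num)
```

Because `drop_na()` removes every row with at least one `NA`, it is worth comparing `nrow()` before and after cleaning to see how many universities remain in the analysis.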


### part b

In [14]:
# fitting the linear model
university_lm <- lm(overall_score ~ . - university_rank - name_of_university - location, data = university_data_cleaned)
uni_tidy_lm <- tidy(university_lm)
uni_tidy_lm

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| (Intercept) | 0.01850202 | 0.03456206 | 0.53532757 | 0.5933669 |
| no_of_student | -8.625404e-08 | 2.270879e-07 | -0.37982662 | 0.7047136 |
| no_of_student_per_staff | -0.0003538851 | 0.0003015935 | -1.17338432 | 0.2428551 |
| international_student | 0.00129407 | 0.0371465 | 0.03483693 | 0.9722649 |
| female_male_ratio | -0.01823315 | 0.009996851 | -1.82388907 | 0.07053866 |
| teaching_score | 0.2988413 | 0.0005299324 | 563.92339403 | 3.234071e-216 |
| research_score | 0.3008783 | 0.0004628687 | 650.029511 | 5.452256e-224 |
| citations_score | 0.3005338 | 0.0003213449 | 935.23775731 | 6.825761e-244 |
| industry_income_score | 0.02487489 | 0.0001831504 | 135.81670636 | 1.727166e-138 |
| international_outlook_score | 0.07482483 | 0.0002889544 | 258.95029499 | 1.142898e-173 |
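
The five score coefficients in the table above sum to roughly 1 and have very small standard errors, while the intercept is close to 0. That pattern is what one would expect if `overall_score` were a near-exact weighted sum of the component scores, although the output alone only suggests this. The sketch below uses simulated data, not the rankings data, with hypothetical weights chosen purely for illustration, to show how `lm()` behaves in that situation.

```r
set.seed(301)
n <- 200

# Simulated component scores (hypothetical, not the rankings data)
sim <- data.frame(
  teaching  = runif(n, 10, 100),
  research  = runif(n, 10, 100),
  citations = runif(n, 10, 100)
)

# Hypothetical weights, chosen only for illustration
w <- c(teaching = 0.3, research = 0.3, citations = 0.4)

# Response = weighted sum of the components, rounded to one decimal place
sim$overall <- round(drop(as.matrix(sim[names(w)]) %*% w), 1)

fit <- lm(overall ~ teaching + research + citations, data = sim)

# The estimates land almost exactly on the weights and R-squared is close to 1
print(round(coef(fit), 3))
print(summary(fit)$r.squared)
```

If the response is constructed from the predictors in this way, huge test statistics mainly reflect the scoring formula, so they need careful interpretation when the aim is inference.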


In [29]:
# check the variance inflation factors (VIFs) of the predictors
university_vif <- vif(university_lm)
university_vif

The research and teaching scores have concerningly high variance inflation factors (VIFs), which indicates multicollinearity and would normally suggest removing both variables from the model. However, the two variables are highly correlated with each other, so removing only the one with the highest VIF should suffice.
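
To make the reasoning about VIFs concrete, here is a minimal base-R sketch under stated assumptions. For predictor j, the VIF is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The data are simulated, and the helper `vif_by_hand()` is a hypothetical name introduced for illustration. For a model with only numeric predictors, this is the same quantity `car::vif()` reports.

```r
set.seed(301)
n <- 200

# Simulated predictors: x2 is built to be strongly correlated with x1
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 is obtained by
# regressing predictor j on the remaining predictors
vif_by_hand <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

# Both correlated predictors show inflated VIFs in the full set
vif_full <- vif_by_hand(toy)
print(vif_full)

# Dropping one member of the correlated pair brings the other back near 1
vif_reduced <- vif_by_hand(toy[c("x1", "x3")])
print(vif_reduced)
```

In this toy setting, removing one of the two correlated predictors is enough to resolve the inflation for the other, which mirrors the argument for dropping a single variable.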

In [30]:
# refit the model without teaching_score and recompute the VIFs
teaching_removed_lm <- lm(overall_score ~ . - university_rank - name_of_university - location - teaching_score, 
                          data = university_data_cleaned)
teaching_removed_lm
teaching_removed_vif <- vif(teaching_removed_lm)
teaching_removed_vif


Call:
lm(formula = overall_score ~ . - university_rank - name_of_university - 
    location - teaching_score, data = university_data_cleaned)

Coefficients:
                (Intercept)                no_of_student  
                  7.737e+00                   -2.092e-05  
    no_of_student_per_staff        international_student  
                 -3.777e-02                    9.759e-01  
          female_male_ratio               research_score  
                 -1.736e+00                    5.398e-01  
            citations_score        industry_income_score  
                  3.277e-01                    8.301e-03  
international_outlook_score  
                  1.947e-02  


### Discussion

In this section, you’ll interpret the results you obtained in the previous section with respect to the main question/goal of your project.

- Summarize what you found and the implications/impact of your findings.
- If relevant, discuss whether your results were what you expected to find.
- Discuss how your model could be improved.
- Discuss future questions/research this study could lead to.

### References

At least two citations of literature relevant to the project. The citation format is your choice – just be consistent. Make sure to cite the source of your data as well.