# Selecting Features in University Rankings

For this project, three different approaches will be explored to determine the most relevant features for a machine-learning classification model.

1. Univariate Analysis
2. Recursive Feature Elimination
3. Tree Classifier

The data for this project was obtained from Kaggle, specifically from the World University Rankings dataset: https://www.kaggle.com/mylesoneill/world-university-rankings

Two different datasets are available from this link, the Shanghai and Times rankings. We want to select features for a model that can classify whether a university ranks in the top 50.

Before loading the data, the appropriate modules must be imported.

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Then, the data needs to be cleaned. I decided to wrap the cleaning in a function so that it takes one parameter: whether we want to investigate the top 50, top 100 or top 10. The function is also written to a separate file (using the %%writefile cell magic) so that it can be easily imported into other notebooks if needed.

The function below cleans the Shanghai dataset.

In [2]:
%%writefile data_cleaning.py

# Import again so it is saved on the file

import pandas as pd
import numpy as np

# Function to clean the Shanghai dataset

def shangai_clean(x):
    
    # Read excel file and sort by total score
    shangai = pd.read_excel("shanghaiData.xlsx").sort_values(by = "total_score", ascending = False)
        
    # Filter by the latest year
    shangai = shangai[shangai["year"] == shangai["year"].max()]
    
    # Simplify the dataframe by keeping only the explanatory variables
    shangai.drop(["world_rank", "university_name", "national_rank", "year", "total_score"], axis = 1, inplace = True)
    
    # Drop null values
    shangai.dropna(inplace = True)

    # Code the top x universities as 1.0 and all others as 0.0
    array_ref = (np.arange(len(shangai)) < x)
    shangai["top_50"] = array_ref
    code = {True:1.0, False:0.0}
    shangai["top_50"] = shangai["top_50"].map(code)
    
    # Return the cleaned dataframe
    return shangai

Overwriting data_cleaning.py
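
The labelling step relies on the rows already being sorted from best to worst, so that row position stands in for rank. A minimal sketch of that idea on a toy dataframe (the scores and the column name `top_n` are invented for illustration and are not part of the real data):

```python
import numpy as np
import pandas as pd

# Toy scores, invented purely for illustration
toy = pd.DataFrame({"total_score": [55.0, 90.0, 72.5, 61.0, 80.0]})

# Sort best-first so that row position corresponds to rank
toy = toy.sort_values(by="total_score", ascending=False)

# Flag the first n rows (here n = 2) as 1.0 and the rest as 0.0
n = 2
toy["top_n"] = (np.arange(len(toy)) < n).astype(float)

print(toy)
```

Because the flag is positional, any rows dropped before this step (for example by `dropna`) shift the remaining rows up, so the flag marks the top n among the rows that survive cleaning.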


For the Times dataset, we append its cleaning function to the same file.

In [3]:
%%writefile -a data_cleaning.py

# Clean the Times dataset. The modules do not need to be imported again, since this function is appended to the previous file

def times_clean(a):
    
    # Read the csv file
    times = pd.read_csv("timesData.csv")
    
    # Convert total score to numeric first (non-numeric entries become NaN),
    # so that the sort below is numeric rather than alphabetical
    times["total_score"] = pd.to_numeric(times["total_score"], errors = "coerce")
    
    # Sort by total score in descending order
    times = times.sort_values(by = "total_score", ascending = False)
    
    # Filter by the latest year
    times = times[times["year"] == 2016] 
            
    # Simplify the table by dropping non explanatory variables
    times.drop(["world_rank", "university_name", "country", "year", "total_score"], axis = 1, inplace = True)
    
    # Times drop null values
    times.dropna(inplace = True)

    # Convert all other columns to float type
    times["international"] = pd.to_numeric(times["international"], errors = "coerce")
    times["income"] = pd.to_numeric(times["income"], errors = "coerce") 
    times["female_male_ratio"] = pd.to_numeric(times["female_male_ratio"].apply(lambda d: d.split(" : ")[0]), errors = "coerce")
    times["international_students"] = pd.to_numeric(times["international_students"].apply(lambda d: d.split("%")[0]), errors = "coerce")
    
    # Remove the comma from num_students before converting; this assumes the comma
    # is a thousands separator (e.g. "2,243" means 2243), not a decimal mark
    times["num_students"] = pd.to_numeric(times["num_students"].str.replace(",", "", regex = False), errors = "coerce")
    
    # Drop null values once again after converting all other columns to float
    times.dropna(inplace = True)
    
    # Add a new column flagging the top a universities as 1.0 and all others as 0.0
    array_ref = (np.arange(len(times)) < a)
    times["top_50"] = array_ref
    code = {True:1.0, False:0.0}
    times["top_50"] = times["top_50"].map(code)
    
    return times

Appending to data_cleaning.py
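
Several Times columns are stored as text and have to be parsed before they can be used as numbers. The sketch below shows one way to do this with pandas string methods on a toy dataframe. The example values are invented, and the formats (a ratio written as "33 : 67", a trailing percent sign, a comma as thousands separator) are assumptions about how the raw file is laid out.

```python
import pandas as pd

# Toy rows imitating the string formats assumed for the Times columns
toy = pd.DataFrame({
    "female_male_ratio": ["33 : 67", "46 : 54"],
    "international_students": ["27%", "9%"],
    "num_students": ["2,243", "19,919"],
})

# Keep the part before " : " (the female share)
female_share = pd.to_numeric(
    toy["female_male_ratio"].str.split(" : ").str[0], errors="coerce"
)

# Strip the trailing percent sign
intl_pct = pd.to_numeric(
    toy["international_students"].str.rstrip("%"), errors="coerce"
)

# Remove the comma, assumed here to be a thousands separator
n_students = pd.to_numeric(
    toy["num_students"].str.replace(",", "", regex=False), errors="coerce"
)

print(female_share.tolist(), intl_pct.tolist(), n_students.tolist())
```

With `errors="coerce"`, any value that still cannot be parsed becomes NaN, which is why a second `dropna` after the conversions is useful.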


For feature selection, we need to separate x and y in each dataset. y is the dummy variable indicating whether a university is in the top 50, and x holds all other variables.

In [22]:
# Import files with data_cleaning functions

import data_cleaning as dc


# Shanghai and Times top-50 analysis

shangai_treat = dc.shangai_clean(50)
shan_cols = shangai_treat.columns

times_treat = dc.times_clean(50)
times_cols = times_treat.columns


# Get all values from the table as an array

shan_array = shangai_treat.values
tim_array = times_treat.values


# Separate x and y

shan_x = shan_array[:,0:6]
shan_y = shan_array[:,6]

times_x = tim_array[:,0:9]
times_y = tim_array[:,9]
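
The positional slices above depend on the target being the last column and on the number of features being known in advance. As an alternative, x and y can be separated by column name. This is a minimal sketch on a toy dataframe, with invented values and only two illustrative feature columns:

```python
import pandas as pd

# Toy frame standing in for a cleaned dataset; values are invented
toy = pd.DataFrame({
    "alumni": [100.0, 40.7, 0.0],
    "award": [100.0, 89.6, 12.3],
    "top_50": [1.0, 1.0, 0.0],
})

# Every column except the target becomes x; the target column becomes y
feature_cols = toy.columns.drop("top_50")
x = toy[feature_cols].values
y = toy["top_50"].values

print(list(feature_cols), x.shape, y.shape)
```

Selecting by name keeps the split correct even if the cleaning function later adds or removes a column.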

The first feature selection method we will explore is univariate analysis.

In [46]:
# Run univariate analysis

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2 

test = SelectKBest(score_func=chi2, k=3)
uni = test.fit(shan_x,shan_y)

print()
print('\033[1m' + 'Univariate Analysis' '\033[0m')

# Automate column outputs for univariate analysis

# argsort is ascending, so these are the three highest scores, lowest of the three first
cols = uni.scores_.argsort()[-3:]
print("The top features are: " + shan_cols[cols[0]] + ", " + shan_cols[cols[1]] + ", " + shan_cols[cols[2]]) 

features = uni.transform(shan_x)
features[0:5,:]


Univariate Analysis
The top features are: hici, alumni, award


array([[100. , 100. , 100. ],
       [ 40.7,  89.6,  80.1],
       [ 68.2,  80.7,  60.6],
       [ 65.1,  79.4,  66.1],
       [ 77.1,  96.6,  50.8]])
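
Note that `transform` returns the selected columns in their original column order, not in score order, so the array columns do not necessarily line up with the order of the printed names. One way to keep names and scores together is to pair `scores_` with the column names and to read the selected columns from `get_support()`. The sketch below uses synthetic data (the column names `informative`, `noise_a` and `noise_b` are invented for illustration), since it is meant to show the mechanics rather than reproduce the results above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic data: one column tied to the label, two columns of pure noise
rng = np.random.RandomState(0)
y = np.array([1.0] * 10 + [0.0] * 30)
x = pd.DataFrame({
    "informative": np.where(y == 1, 80.0, 20.0) + rng.uniform(0, 10, size=40),
    "noise_a": rng.uniform(0, 100, size=40),
    "noise_b": rng.uniform(0, 100, size=40),
})

# chi2 requires non-negative feature values, which holds for this toy data
selector = SelectKBest(score_func=chi2, k=2).fit(x.values, y)

# Pair each score with its column name and sort from highest to lowest
scores = pd.Series(selector.scores_, index=x.columns).sort_values(ascending=False)

# Names of the selected columns, in the same order as the transformed array
selected = x.columns[selector.get_support()]

print(scores)
print(list(selected))
```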