Read in the SVM results

In [59]:
import pandas as pd
from google.colab import drive

drive.mount('/content/gdrive')
df = pd.read_csv("gdrive/My Drive/Dissertation Complete/RobertA_SVM_ranking_results.csv")

df.head()

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Unnamed: 0,text,university,RobertA_score,Predicted_rank
0,Ya honestly it depends on the interviewer. Peo...,Duke,0.582699,222076.98033
1,How to get into Harvard: do what you are passi...,Harvard,0.582248,222079.104335
2,: I am sorry that you feel this way--undergr...,Georgetown,0.582227,222079.200971
3,You clearly seem like you would be a much bett...,Stanford,0.582041,222080.079114
4,If they find out you applied elsewhere during ...,Princeton,0.581978,222080.374357


Obtain the total instances of each university

In [60]:
university_counts = df['university'].value_counts()

university_counts_df = university_counts.reset_index()
university_counts_df.columns = ['university', 'total_entries']

university_counts_df.head(50)

Unnamed: 0,university,total_entries
0,Harvard,31250
1,MIT,27126
2,Stanford,26310
3,USC,25423
4,UCLA,22539
5,Yale,22152
6,Cornell,19902
7,Columbia,18207
8,Princeton,17831
9,Brown,17822


We then sum the ranks assigned by the SVM model for each university, sort by this sum, and merge the result with the dataframe containing the total number of entries for each university.

In [61]:
# Group the DataFrame by 'university' and sum the SVM-predicted ranks
university_sentiment = df.groupby('university')['Predicted_rank'].sum().reset_index()

# Sort the DataFrame by the summed rank in descending order
university_sentiment_sorted = university_sentiment.sort_values(by='Predicted_rank', ascending=False)
university_sentiment_sorted.reset_index(inplace=True, drop=True)

# The sums exceed the 32-bit integer range, so request a 64-bit integer explicitly
university_sentiment_sorted['Predicted_rank_sum'] = university_sentiment_sorted['Predicted_rank'].astype('int64')
university_sentiment_sorted.drop(columns=["Predicted_rank"], inplace=True)

university_sentiment_sorted = pd.merge(university_counts_df, university_sentiment_sorted, on='university', how='inner')

university_sentiment_sorted.head()

Unnamed: 0,university,total_entries,Predicted_rank_sum
0,Harvard,31250,6944308901
1,MIT,27126,6030202538
2,Stanford,26310,5849792943
3,USC,25423,5652752742
4,UCLA,22539,5010897792
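
The count and the rank sum can also be produced by a single grouped aggregation, which avoids the separate merge. The following is only a minimal sketch on made-up toy data; the `toy` and `summary` names and all values in it are assumptions for illustration, not part of the notebook's results.

```python
import pandas as pd

# Toy stand-in for the SVM results; the values are made up for illustration.
toy = pd.DataFrame({
    "university": ["A", "A", "B", "B", "B"],
    "Predicted_rank": [10.0, 20.0, 5.0, 5.0, 5.0],
})

# Named aggregation: entry count and rank sum per university in one groupby.
summary = (
    toy.groupby("university")["Predicted_rank"]
       .agg(total_entries="count", Predicted_rank_sum="sum")
       .reset_index()
       .sort_values("total_entries", ascending=False)
       .reset_index(drop=True)
)

print(summary)
```

Both approaches should give the same per-university counts and sums; the single aggregation simply keeps them in one dataframe from the start.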


We obtain an overall score for each university by dividing the sum of its SVM ranks by its total number of entries. The dataframe is then re-sorted by this score, and universities with few entries are dropped.
The reasoning is that sorting by the rank sum alone would mostly order the universities by how many entries they have. Dividing each sum by the university's entry count removes that dependence, so the ordering reflects only how positive the sentiment is for each university, that is, the rank generated by the SVM. Dropping universities with a small sample size also prevents a rarely discussed university from reaching the top of the ranking list on too little data, for example one where 99 of 100 entries are positive.
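
The effect of the normalisation and of the minimum-entries filter can be seen on a small example. This is a sketch on hypothetical data: the university names "Big", "Small" and "Tiny", their values, and the threshold of 2 are all assumptions chosen to keep the example short (the notebook itself uses 50).

```python
import pandas as pd

# Hypothetical toy data: "Big" has many entries, "Small" has a few,
# and "Tiny" has a single entry. A lower Predicted_rank is treated as better.
toy = pd.DataFrame({
    "university": ["Big"] * 6 + ["Small"] * 3 + ["Tiny"],
    "Predicted_rank": [10.0] * 6 + [15.0] * 3 + [12.0],
})

stats = toy.groupby("university")["Predicted_rank"].agg(
    total_entries="count", rank_sum="sum"
)
stats["score"] = stats["rank_sum"] / stats["total_entries"]

# Sorting by the raw sum mostly reflects how many entries a university has.
by_sum = stats.sort_values("rank_sum").index.tolist()

# Sorting by the per-entry score removes the dependence on entry count.
by_score = stats.sort_values("score").index.tolist()

# A minimum-entries threshold (2 here, 50 in the notebook) removes "Tiny".
MIN_ENTRIES = 2
filtered = stats[stats["total_entries"] >= MIN_ENTRIES].sort_values("score")

print("by sum:  ", by_sum)
print("by score:", by_score)
print("filtered:", filtered.index.tolist())
```

In this toy case "Big" has the best per-entry value but the largest sum, so the two orderings disagree, and "Tiny" only drops out once the threshold is applied.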

In [62]:
university_sentiment_sorted['score'] = university_sentiment_sorted['Predicted_rank_sum'] / university_sentiment_sorted['total_entries']

# Reduce the score number for visual clarity
university_sentiment_sorted['score'] = university_sentiment_sorted['score'] / 100

# Round to 4 DP
university_sentiment_sorted['score'] = round(university_sentiment_sorted['score'], 4)


# ascending=True because a lower SVM rank is better
university_sentiment_sorted = university_sentiment_sorted.sort_values(by='score', ascending=True)
university_sentiment_sorted.reset_index(inplace=True, drop=True)

# Drop all unis with under 50 entries (low sample size)
university_sentiment_sorted = university_sentiment_sorted[university_sentiment_sorted["total_entries"] >= 50]
university_sentiment_sorted.reset_index(inplace=True, drop=True)

university_sentiment_sorted.head()

Unnamed: 0,university,total_entries,Predicted_rank_sum,score
0,Emory,13957,3101160636,2221.9393
1,Georgetown,10751,2389000796,2222.1196
2,Harvard,31250,6944308901,2222.1788
3,Duke,17718,3937728697,2222.4454
4,Johns Hopkins,2111,469212807,2222.704


Here we drop the columns that are no longer needed and prepare to merge the results dataframe with the actual ranking list by renaming the university column to 'Abbreviation', the common column name for the university abbreviations that we created earlier for the NER.

In [63]:
university_sentiment_sorted.drop(columns=["Predicted_rank_sum", "total_entries"], inplace=True)
university_sentiment_sorted['Rank'] = university_sentiment_sorted.index + 1

university_sentiment_sorted.rename(columns={'university': 'Abbreviation'}, inplace=True)

# Move 'score' to the last column
university_sentiment_sorted = university_sentiment_sorted[['Abbreviation', 'Rank', 'score']]

university_sentiment_sorted.head()

Unnamed: 0,Abbreviation,Rank,score
0,Emory,1,2221.9393
1,Georgetown,2,2222.1196
2,Harvard,3,2222.1788
3,Duke,4,2222.4454
4,Johns Hopkins,5,2222.704


Read in the actual 2017 rankings

In [64]:
from google.colab import drive
import pandas as pd

drive.mount('/content/gdrive')
dataset = pd.read_csv("gdrive/My Drive/Dissertation Complete/National Universities Rankings.csv", encoding='latin-1')

dataset.rename(columns={'Rank': 'Actual Rank'}, inplace=True)


Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Merge the two datasets, sorting by predicted rank, for comparison.
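
Because the merge is an inner join, any university whose abbreviation appears in only one of the two tables is dropped without warning. One way to check for this is an outer merge with `indicator=True`. The sketch below uses hypothetical miniature tables (`actual`, `predicted`) rather than the real data, so the unmatched names it reports are illustrative only.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
actual = pd.DataFrame({
    "Abbreviation": ["Harvard", "Yale", "MIT"],
    "Actual Rank": [2, 3, 7],
})
predicted = pd.DataFrame({
    "Abbreviation": ["Harvard", "Yale", "Emory"],
    "Rank": [3, 9, 1],
})

# An outer merge with indicator=True labels each row by where it came from:
# "both", "left_only" or "right_only" in the added "_merge" column.
check = pd.merge(actual, predicted, on="Abbreviation", how="outer", indicator=True)

# Rows that an inner merge would silently drop.
unmatched = check.loc[check["_merge"] != "both", ["Abbreviation", "_merge"]]

print(unmatched)
```

If `unmatched` is empty on the real tables, the inner merge loses nothing; otherwise the listed abbreviations point to naming mismatches worth fixing before comparing ranks.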

In [65]:
final_df = pd.merge(dataset, university_sentiment_sorted, on='Abbreviation', how='inner')

final_df.rename(columns={'Rank': 'Predicted Rank'}, inplace=True)
final_df.sort_values(by=['Predicted Rank'], inplace=True)
final_df.reset_index(inplace=True, drop=True)

final_df.head(50)

Unnamed: 0,Name,Abbreviation,Actual Rank,Predicted Rank,score
0,Emory University,Emory,20,1,2221.9393
1,Georgetown University,Georgetown,20,2,2222.1196
2,Harvard University,Harvard,2,3,2222.1788
3,Duke University,Duke,8,4,2222.4454
4,Johns Hopkins University,Johns Hopkins,10,5,2222.704
5,University of Michigan--Ann Arbor,UMich,27,6,2222.7961
6,Georgia Institute of Technology,Georgia Institute of Technology,34,7,2222.8504
7,Cornell University,Cornell,15,8,2222.8822
8,Yale University,Yale,3,9,2222.9447
9,Massachusetts Institute of Technology,MIT,7,10,2223.0342
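
One possible way to summarise how closely the predicted ordering follows the actual one is Spearman's rank correlation. The sketch below is an assumption about how such a comparison could be done, not a step from the original analysis, and it uses only the ten rows displayed above, so its value says nothing about the full merged table. The `shown` and `rho` names are introduced here for illustration.

```python
import pandas as pd

# The ten rows displayed in the output above.
shown = pd.DataFrame({
    "Abbreviation": ["Emory", "Georgetown", "Harvard", "Duke", "Johns Hopkins",
                     "UMich", "Georgia Institute of Technology", "Cornell",
                     "Yale", "MIT"],
    "Actual Rank": [20, 20, 2, 8, 10, 27, 34, 15, 3, 7],
    "Predicted Rank": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

# Spearman's rho is the Pearson correlation of the rank-transformed columns;
# rank() gives tied values (the two 20s) their average rank.
rho = shown["Actual Rank"].rank().corr(shown["Predicted Rank"].rank())

print(f"Spearman rho on the displayed rows: {rho:.3f}")
```

A value near 1 would mean the two orderings agree, a value near 0 would mean little monotonic relationship, and a negative value would mean they tend to run in opposite directions.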
