## Examining Student and Online Perceptions of LSE

### Part 1: Introduction

Before arriving at LSE, many students are likely to have heard of the university's low student satisfaction rates. These impressions often gain traction on online forums such as Reddit, where anecdotal accounts tend to amplify perceptions of dissatisfaction. Several university ranking websites also present LSE as having a comparatively low student satisfaction rate. It is therefore worth exploring how "severe" the problem is and whether the data fully aligns with the online perception of the university.

We can explore the statement above through a series of focused questions:
- How does LSE student satisfaction compare to other universities?
- How does student satisfaction within LSE vary by degree?
- How has LSE student satisfaction evolved over the last few years (2020-2023)?
- How is LSE perceived by people online?

It is worth considering what is meant by student satisfaction. Many data sources, such as the National Student Survey (NSS), break down students' views into categories such as teaching, course content, course organisation and assessment. Online, by contrast, the focus tends to be on social opportunities and societies. Because "student satisfaction" is multifaceted, we state for each question the specific type of student satisfaction under discussion, to avoid confusion. The NSS is generally better suited to 'academic' definitions of student satisfaction, such as students' views on the quality of the course and the teaching, whilst Reddit and other online platforms likely better represent students' thoughts on their social lives at particular universities.




### Part 2: Data acquisition
- See the 'Data_acquisition.ipynb' file

### Part 3: Data Preparation and Exploration

In [1]:
import pandas as pd
import pickle

### NSS data

We can answer many of the questions through the NSS. The NSS asks students about several categories, such as teaching quality, course quality and assessment quality. Most websites ignore this distinction and report only one of these metrics, such as teaching quality; we wanted to make the distinction between categories clearer. As there are multiple categories, we create multiple dataframes so that visualisation is easier. Another important consideration is the ratio of responses to population size: if it is high, the measure of student satisfaction is more likely to represent the real student body accurately, whereas if it is low the results are less robust. This is why many online displays of these metrics leave out universities like the University of Oxford, where a hefty proportion of the students boycott the NSS.
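As a minimal sketch of the response-ratio idea, the snippet below divides responses by population and flags institutions at or below a cut-off. The university labels, the counts and the 0.5 cut-off are hypothetical, chosen for illustration only; they are not NSS figures.

```python
import pandas as pd

# Hypothetical counts for illustration only (not real NSS figures)
toy = pd.DataFrame(
    {"Responses": [670, 500], "Population": [1000, 1000]},
    index=["Uni A", "Uni B"],
)

# Share of the eligible population that answered the survey
toy["Response Ratio"] = (toy["Responses"] / toy["Population"]).round(2)

# Assumed cut-off: treat results at or below this ratio as less robust
MIN_RATIO = 0.5
toy["Less robust"] = toy["Response Ratio"] <= MIN_RATIO
print(toy)
```

Where exactly to draw the line is a judgment call; the point is only that the ratio should be read alongside the positivity score.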

We also wanted to show how much students' opinions vary on these measures, using the standard deviation score.

In [2]:
file_path = './data/univ_df.pkl'

# Load the dictionary from the pickle file
with open(file_path, 'rb') as pickle_file:
    univ_df = pickle.load(pickle_file)

#refresher of the universities surveyed
for key in univ_df.keys():
    print(key)

LSE
Oxford
UCL
Birmingham
Edinburgh
Glasgow
Imperial
KCL
Manchester
Norwich
Strathclyde
Warwick


Below, we create dataframes representing the differences in student opinions on teaching quality, course quality, assessment quality, support quality and course organisation quality. We considered aggregating every metric into one 'overall satisfaction' score, but felt that this would ignore the fact that some may view certain criteria as deserving more weight; for example, course quality may be valued much more than assessment quality.
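To illustrate why a single aggregate can mislead, here is a minimal sketch that combines the five category scores under two weighting schemes. The LSE and Norwich positivity values are the ones reported in the tables later in this section; both sets of weights are hypothetical and are not part of the NSS.

```python
# Positivity measures (%) for two universities, copied from the tables in this section
scores = {
    "LSE": {"teaching": 84.7, "course": 81.94, "assessment": 71.57,
            "support": 81.47, "organisation": 84.66},
    "Norwich": {"teaching": 85.3, "course": 79.31, "assessment": 82.22,
                "support": 84.7, "organisation": 78.74},
}

# Two hypothetical weighting schemes (each sums to 1); neither is an official weighting
equal_weights = {k: 0.2 for k in scores["LSE"]}
course_heavy = {"teaching": 0.1, "course": 0.5, "assessment": 0.05,
                "support": 0.05, "organisation": 0.3}


def overall(uni_scores, weights):
    """Weighted average of the category scores."""
    return sum(uni_scores[k] * weights[k] for k in weights)


for label, weights in [("equal", equal_weights), ("course-heavy", course_heavy)]:
    ranked = sorted(scores, key=lambda uni: overall(scores[uni], weights), reverse=True)
    print(label, ranked)
```

Under equal weights Norwich comes out ahead of LSE, while the course-heavy weights reverse the order, so the ranking depends on a choice of weights that readers may not share.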

Each criterion's score is calculated by averaging the 'Positivity Measures' across a group of questions. For example, the 'Teaching Quality' dataframe takes the following questions into account:

- Q01: How good are teaching staff at explaining things?
- Q02: How often do teaching staff make the subject engaging?

Course Quality covers the following questions:

- Q03: How often is the course intellectually stimulating?
- Q04: How often does your course challenge you to achieve your best work?
- Q05: To what extent have you had the chance to explore ideas and concepts in depth?
- Q06: How well does your course introduce subjects and skills in a way that builds on what you have already learned?
- Q07: To what extent have you had the chance to bring together information and ideas from different topics?
- Q08: To what extent does your course have the right balance of directed and independent study?
- Q09: How well has your course developed your knowledge and skills that you think you will need for your future?

*Course Quality differs from Course Organisation in that Course Quality focuses more on the knowledge conveyed in the course, rather than the resources and overall organisation of the course*

Assessment Quality measures the following questions:

- Q10: How clear were the marking criteria used to assess your work?
- Q11: How fair has the marking and assessment been on your course?
- Q12: How well have assessments allowed you to demonstrate what you have learned?
- Q13: How often have you received assessment feedback on time?

Support:

- Q14: How often does feedback help you to improve your work?
- Q15: How easy was it to contact teaching staff when you needed to?
- Q16: How well have teaching staff supported your learning?

Course Organisation:

- Q17: How well organised is your course?
- Q18: How well were any changes to teaching on your course communicated?
- Q19: How well have the IT resources and facilities supported your learning?
- Q20: How well have the library resources (e.g., books, online services and learning spaces) supported your learning?
- Q21: How easy is it to access subject specific resources (e.g., equipment, facilities, software) when you need them?

Please remember that these categories represent students' *opinions* and are not an objective measure of what a university's facilities are actually like; for instance, students at highly ranked universities may expect more from their university.
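The grouping of questions into categories can be sketched with an explicit mapping from question number to category. This is an illustration only: the `Question` column, the `category` helper and the values are assumptions, and the analysis in this notebook selects questions by row position instead.

```python
import pandas as pd

# Hypothetical rows for illustration; the 'Question' column and the values are assumptions
toy = pd.DataFrame({
    "Question": ["Q01", "Q02", "Q03", "Q04"],
    "Positivity measure (%)": [80.0, 90.0, 70.0, 75.0],
})


def category(question):
    """Map a question code such as 'Q07' to the category listed above."""
    number = int(question[1:])
    if number <= 2:
        return "Teaching"
    if number <= 9:
        return "Course"
    if number <= 13:
        return "Assessment"
    if number <= 16:
        return "Support"
    return "Organisation"


toy["Category"] = toy["Question"].map(category)

# Average the positivity measure over the questions in each category
category_means = toy.groupby("Category")["Positivity measure (%)"].mean()
print(category_means)
```

Mapping by question number rather than row position makes the grouping explicit and guards against rows arriving in an unexpected order.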



In [62]:
teach_dict, course_dict, assess_dict, support_dict, organisation_dict = {}, {}, {}, {}, {}

# Row positions (within the 'All subjects' rows) of each group of questions
question_groups = [
    (teach_dict, slice(0, 2)),           # Q01-Q02
    (course_dict, slice(2, 9)),          # Q03-Q09 (7 rows)
    (assess_dict, slice(9, 13)),         # Q10-Q13
    (support_dict, slice(13, 16)),       # Q14-Q16
    (organisation_dict, slice(16, 21)),  # Q17-Q21
]

def summarise(group_df):
    """Average positivity and standard deviation, and compute the response ratio."""
    response_ratio = group_df['Responses'].sum() / group_df['Population'].sum()
    return {
        'Positivity Measure(%)': round(group_df['Positivity measure (%)'].mean(), 2),
        'Standard Deviation': round(group_df['Standard deviation'].mean(), 2),
        'Response Ratio': round(response_ratio, 2),
    }

for name, uni_df in univ_df.items():
    # We use 'All subjects' at this point as we don't want to differentiate by subject
    all_subjects = uni_df[uni_df["Subject level"] == 'All subjects']
    for target_dict, rows in question_groups:
        target_dict[name] = summarise(all_subjects.iloc[rows])

In [63]:
teach_df=pd.DataFrame(teach_dict)
print('Student Views on Teaching Quality') #a higher percentage means more students think positively of that factor
teach_df.head()

Student Views on Teaching Quality


| | LSE | Oxford | UCL | Birmingham | Edinburgh | Glasgow | Imperial | KCL | Manchester | Norwich | Strathclyde | Warwick |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Positivity Measure(%) | 84.7 | 91.45 | 83.4 | 82.9 | 84.8 | 85.5 | 86.1 | 83.4 | 81.0 | 85.3 | 87.95 | 86.8 |
| Standard Deviation | 0.95 | 0.75 | 0.5 | 0.5 | 0.6 | 0.6 | 0.75 | 0.55 | 0.45 | 1.35 | 0.65 | 0.5 |
| Response Ratio | 0.67 | 0.5 | 0.72 | 0.69 | 0.65 | 0.7 | 0.72 | 0.7 | 0.74 | 0.82 | 0.74 | 0.72 |


In [41]:
course_df=pd.DataFrame(course_dict)
print('Student Views on Course Quality')
course_df.head()

Student Views on Course Quality


| | LSE | Oxford | UCL | Birmingham | Edinburgh | Glasgow | Imperial | KCL | Manchester | Norwich | Strathclyde | Warwick |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Positivity Measure(%) | 81.94 | 85.73 | 81.14 | 81.06 | 78.36 | 82.21 | 87.23 | 79.34 | 78.59 | 79.31 | 85.26 | 84.69 |
| Standard Deviation | 1.03 | 0.87 | 0.53 | 0.57 | 0.66 | 0.67 | 0.79 | 0.6 | 0.5 | 1.53 | 0.77 | 0.6 |
| Response Ratio | 0.67 | 0.5 | 0.71 | 0.69 | 0.65 | 0.7 | 0.72 | 0.7 | 0.74 | 0.82 | 0.74 | 0.72 |


In [42]:
print('Student Views on Assessment Quality')
assess_df=pd.DataFrame(assess_dict)
assess_df.head()

Student Views on Assessment Quality


| | LSE | Oxford | UCL | Birmingham | Edinburgh | Glasgow | Imperial | KCL | Manchester | Norwich | Strathclyde | Warwick |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Positivity Measure(%) | 71.57 | 72.38 | 70.88 | 71.0 | 65.38 | 71.78 | 71.35 | 69.95 | 69.95 | 82.22 | 75.05 | 78.72 |
| Standard Deviation | 1.12 | 1.0 | 0.6 | 0.62 | 0.75 | 0.75 | 0.92 | 0.65 | 0.52 | 1.55 | 0.85 | 0.62 |
| Response Ratio | 0.67 | 0.49 | 0.71 | 0.69 | 0.64 | 0.7 | 0.72 | 0.7 | 0.74 | 0.82 | 0.74 | 0.72 |


In [43]:
print('Student Views on Academic Support Quality')
support_df=pd.DataFrame(support_dict)
support_df.head()

Student Views on Academic Support Quality


| | LSE | Oxford | UCL | Birmingham | Edinburgh | Glasgow | Imperial | KCL | Manchester | Norwich | Strathclyde | Warwick |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Positivity Measure(%) | 81.47 | 87.5 | 75.93 | 73.47 | 72.9 | 75.5 | 77.17 | 73.87 | 74.23 | 84.7 | 80.77 | 81.07 |
| Standard Deviation | 1.03 | 0.87 | 0.53 | 0.63 | 0.67 | 0.73 | 0.87 | 0.63 | 0.53 | 1.5 | 0.77 | 0.6 |
| Response Ratio | 0.67 | 0.5 | 0.71 | 0.69 | 0.64 | 0.7 | 0.72 | 0.7 | 0.74 | 0.82 | 0.74 | 0.71 |


In [45]:
print('Student Views on Course Organisation Quality')
organisation_df=pd.DataFrame(organisation_dict)
organisation_df.head()

Student Views on Course Organisation Quality


| | LSE | Oxford | UCL | Birmingham | Edinburgh | Glasgow | Imperial | KCL | Manchester | Norwich | Strathclyde | Warwick |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Positivity Measure(%) | 84.66 | 82.44 | 83.36 | 81.48 | 79.0 | 79.44 | 84.26 | 77.82 | 77.26 | 78.74 | 83.88 | 86.2 |
| Standard Deviation | 1.0 | 0.9 | 0.52 | 0.54 | 0.64 | 0.68 | 0.76 | 0.62 | 0.52 | 1.56 | 0.76 | 0.58 |
| Response Ratio | 0.65 | 0.49 | 0.7 | 0.67 | 0.63 | 0.68 | 0.71 | 0.69 | 0.73 | 0.81 | 0.72 | 0.7 |
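
To compare LSE with the other universities across all five categories at once, one option is to rank the positivity measures within each category. The sketch below re-enters the positivity values from the five tables above by hand; in the notebook itself the same frame could presumably be built from the existing dataframes instead.

```python
import pandas as pd

# Positivity measures (%) copied from the five tables above
universities = ["LSE", "Oxford", "UCL", "Birmingham", "Edinburgh", "Glasgow",
                "Imperial", "KCL", "Manchester", "Norwich", "Strathclyde", "Warwick"]
positivity = pd.DataFrame(
    [
        [84.7, 91.45, 83.4, 82.9, 84.8, 85.5, 86.1, 83.4, 81.0, 85.3, 87.95, 86.8],
        [81.94, 85.73, 81.14, 81.06, 78.36, 82.21, 87.23, 79.34, 78.59, 79.31, 85.26, 84.69],
        [71.57, 72.38, 70.88, 71.0, 65.38, 71.78, 71.35, 69.95, 69.95, 82.22, 75.05, 78.72],
        [81.47, 87.5, 75.93, 73.47, 72.9, 75.5, 77.17, 73.87, 74.23, 84.7, 80.77, 81.07],
        [84.66, 82.44, 83.36, 81.48, 79.0, 79.44, 84.26, 77.82, 77.26, 78.74, 83.88, 86.2],
    ],
    index=["Teaching", "Course", "Assessment", "Support", "Organisation"],
    columns=universities,
)

# Rank universities within each category (1 = highest positivity)
ranks = positivity.rank(axis=1, ascending=False, method="min").astype(int)
lse_ranks = ranks["LSE"]
print(lse_ranks)
```

On these figures LSE ranks 8th of 12 for teaching, 6th for both course and assessment quality, 3rd for academic support and 2nd for course organisation. The ranks ignore the standard deviations and response ratios, so small gaps between universities should be read with caution.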


Another research question concerns how student satisfaction varies by degree. We first look at this *within* LSE, then compare certain courses across universities.

In [70]:
lse_subjectdat=univ_df['LSE']
lse_subjectdat['Subject'].value_counts()

Subject
Law                                                408
Psychology                                         272
Economics                                          272
Geography, earth and environmental studies         272
Politics                                           272
Business and management                            272
Mathematical sciences                              272
Sociology, social policy and anthropology          136
Language and area studies                          136
Philosophy                                         136
Philosophy and religious studies                   136
History                                            136
History and archaeology                            136
Historical, philosophical and religious studies    136
Asian studies                                      136
Languages and area studies                         136
Accounting                                         136
Sociology                                          136
Fi

Many of these subjects overlap, so for the following analysis we ignore certain categories. For example, we focus only on 'Psychology' rather than 'Psychology (non-specific)', and on Mathematics and Statistics individually rather than 'Mathematical sciences'.

[Output truncated: the `Subject` column lists 'Psychology' repeatedly, followed by 'Psychology (non-specific)'.]
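
Dropping the overlapping subject categories described above can be sketched with a simple `isin` filter. The rows and positivity values below are hypothetical; only the subject labels come from the text, and the list of categories to drop is limited to the two examples it names.

```python
import pandas as pd

# Hypothetical rows for illustration; only the 'Subject' labels come from the text above
toy = pd.DataFrame({
    "Subject": ["Psychology", "Psychology (non-specific)", "Mathematical sciences",
                "Mathematics", "Statistics"],
    "Positivity measure (%)": [80.0, 81.0, 82.0, 83.0, 84.0],
})

# Overlapping categories to drop, following the two examples named above
overlapping = ["Psychology (non-specific)", "Mathematical sciences"]

# Keep every row whose subject is not in the overlapping list
filtered = toy[~toy["Subject"].isin(overlapping)].reset_index(drop=True)
print(filtered["Subject"].tolist())
```

Keeping the excluded categories in a named list makes it easy to review, and extend, which overlaps are being ignored.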

### Reddit data

A reminder of the format of the dataframe used to store this data:

In [48]:
reddit_data_df = pd.read_csv("Data/reddit_data.csv")
reddit_data_df.head(10)

| | Title | Score | Top Comment |
|---|---|---|---|
| 0 | The irony of LSE being a socialist institution | 327 | Don’t really want to give super identifying de... |
| 1 | Got kicked outta LSE..... | 238 | Can you share more details? Specifically.\n\nD... |
| 2 | why are masters degrees so expensive? | 207 | This doesn't really apply to most masters cour... |
| 3 | TIL some unis have worse graduate prospects th... | 200 | I think people should bear in mind that Imperi... |
| 4 | University subreddits | 190 | /r/Edinburgh_University Edinburgh, University of |
| 5 | LSE bread 😭 my DREAM uni offer! | 147 | Well done! |
| 6 | I've Ruined my Master's Degree and Ruined my F... | 141 | Get help, now. You are clearly in crisis, and ... |
| 7 | My teacher is discouraging me from applying to... | 116 | To some degree, I would agree with your coordi... |
| 8 | I realise beyond the very top, differences in ... | 113 | I love how much this sub obsesses over rankings |
| 9 | What’s with the poor student satisfaction at m... | 109 | Student satisfaction and university reputation... |


The first thing to check is the number of duplicate rows and null values. As shown below, there are none of either.
- There are no null values because the code that retrieves the top comment from each post only adds the post to the dataframe if it contains a comment.

- There are no duplicate posts because two different posts are highly unlikely to contain identical text.

In [45]:
num_duplicates = reddit_data_df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")

Number of duplicate rows: 0


In [46]:
total_nulls = reddit_data_df.isnull().sum().sum()
print(f"Number of null values: {total_nulls}")
print(reddit_data_df.shape)

Number of null values: 0
(30, 3)
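
One caveat worth illustrating: `DataFrame.duplicated()` compares whole rows by default, so a post fetched twice with a slightly different score would not be flagged. The sketch below uses hypothetical rows (with column names matching `reddit_data_df`) to show a stricter check on the text columns only.

```python
import pandas as pd

# Hypothetical posts for illustration; column names match reddit_data_df
toy = pd.DataFrame({
    "Title": ["Post A", "Post A", "Post B"],
    "Score": [120, 118, 45],
    "Top Comment": ["First comment", "First comment", "Another comment"],
})

# Whole-row check: the two 'Post A' rows differ in Score, so none are flagged
full_row_duplicates = toy.duplicated().sum()

# Text-only check: the second 'Post A' row is flagged as a repeat
text_duplicates = toy.duplicated(subset=["Title", "Top Comment"]).sum()

print(full_row_duplicates, text_duplicates)
```

Applying the `subset` form to `reddit_data_df` would be a cheap extra check on the duplicate count reported above.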


We only want to consider posts with a high number of upvotes, since posts that other users liked are more likely to contain helpful and insightful information. We can achieve this by removing posts with a score of 30 or below, of which there was only one. This threshold was chosen given that the maximum score was 327 and the mean score was 97.

It is worth noting that the posts are already sorted in descending order of score, since that is a parameter included in the Reddit API call.
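
The sort order can be checked rather than assumed. As a minimal sketch, the snippet below tests the ten scores shown in the preview above; running the same property on `reddit_data_df['Score']` would cover the full dataframe.

```python
import pandas as pd

# Scores of the ten posts shown in the preview above
scores = pd.Series([327, 238, 207, 200, 190, 147, 141, 116, 113, 109], name="Score")

# True when every score is less than or equal to the one before it
already_sorted = scores.is_monotonic_decreasing
print(already_sorted)
```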

In [47]:
print(f"Maximum score: {reddit_data_df['Score'].max()}")
print(f"Mean score: {int(reddit_data_df['Score'].mean())}")
print(f"Current number of posts: {reddit_data_df.shape[0]}")
num_low_score_posts = (reddit_data_df['Score'] <= 30).sum()
print(f"Number of posts with a score of 30 or below: {num_low_score_posts}")
reddit_data_df = reddit_data_df[reddit_data_df['Score'] > 30]
print(f"New number of posts: {reddit_data_df.shape[0]}")

Maximum score: 327
Mean score: 97
Current number of posts: 30
Number of posts with a score of 30 or below: 1
New number of posts: 29
