## Examining Student and Online Perceptions of LSE

### Part 1: Introduction

Before arriving at LSE, many students likely hear of the university's low student satisfaction rates. These impressions often gain traction on online forums such as Reddit, where anecdotal accounts tend to amplify perceptions of dissatisfaction. It is therefore worth exploring how "severe" the problem is and whether the data fully aligns with the online perception of the university.

We can explore the statement above through a series of focused questions:
- How does LSE student satisfaction compare to other universities?
- How does student satisfaction within LSE vary by degree?
- How has LSE student satisfaction evolved over the last few years (2020-2023)?
- How is LSE perceived by people online?

It is worth considering what is meant by student satisfaction. Many data sources, such as the National Student Survey (NSS), break down students' views into categories such as teaching, course content, course organisation and resources. Online, meanwhile, the focus tends to be on social opportunities and societies. Because "student satisfaction" is multifaceted, each question below states the specific type of student satisfaction under discussion, to avoid confusion.




### Part 2: Data acquisition
- See the 'Data_acquisition.ipynb' file.

### Part 3: Data Preparation and Exploration

### NSS data

Some columns of the NSS data are not relevant to this analysis, so we remove them.

For example, the NSS combines students' responses to each question into one overall "Positivity Measure", which is the proportion of respondents who gave a positive answer. We therefore only need that measure, rather than the counts for the individual responses (e.g. Option 1, Option 2, etc.). This loses some of the nuance in the data, but makes the data easier to visualise and understand.

In [1]:
import pandas as pd
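
The column selection described above might look something like the following minimal sketch. The column names (`Provider`, `Question`, `Option 1`, `Option 2`, `Positivity Measure`) and all values are hypothetical stand-ins, not the actual NSS schema or figures, so the real column names would need to be substituted.

```python
import pandas as pd

# Hypothetical miniature of the NSS table; names and values are illustrative only.
nss_sample_df = pd.DataFrame({
    "Provider": ["University A", "University B"],
    "Question": ["Q1", "Q1"],
    "Option 1": [10, 5],
    "Option 2": [30, 45],
    "Positivity Measure": [0.75, 0.90],
})

# Keep only the identifying columns and the summary measure,
# dropping the per-option response counts.
columns_to_keep = ["Provider", "Question", "Positivity Measure"]
nss_trimmed_df = nss_sample_df[columns_to_keep].copy()

print(nss_trimmed_df.columns.tolist())
```

Selecting the columns to keep, rather than listing the ones to drop, makes the remaining schema explicit and guards against unexpected extra columns in the source file.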

### Reddit data

A refresher on the format of the data frame used to store this data:

In [48]:
reddit_data_df = pd.read_csv("Data/reddit_data.csv")
reddit_data_df.head(10)

Unnamed: 0,Title,Score,Top Comment
0,The irony of LSE being a socialist institution,327,Don’t really want to give super identifying de...
1,Got kicked outta LSE.....,238,Can you share more details? Specifically.\n\nD...
2,why are masters degrees so expensive?,207,This doesn't really apply to most masters cour...
3,TIL some unis have worse graduate prospects th...,200,I think people should bear in mind that Imperi...
4,University subreddits,190,"/r/Edinburgh_University Edinburgh, University of"
5,LSE bread 😭 my DREAM uni offer!,147,Well done!
6,I've Ruined my Master's Degree and Ruined my F...,141,"Get help, now. You are clearly in crisis, and ..."
7,My teacher is discouraging me from applying to...,116,"To some degree, I would agree with your coordi..."
8,"I realise beyond the very top, differences in ...",113,I love how much this sub obsesses over rankings
9,What’s with the poor student satisfaction at m...,109,Student satisfaction and university reputation...


The first thing to check is the number of duplicate rows and null values. As shown below, there are none of either.
- There are no null values because the acquisition code only adds a post to the data frame if it has a top comment.

- There are no duplicate posts because two posts are highly unlikely to consist of identical strings of words.

In [45]:
num_duplicates = reddit_data_df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")

Number of duplicate rows: 0
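
One caveat: `duplicated()` with no arguments compares entire rows, so if the frame carries a unique index-like column such as `Unnamed: 0`, no two rows can ever match. A possible alternative is to compare only the content columns. The sketch below uses a small made-up frame (the rows are not taken from the real dataset) to show the difference.

```python
import pandas as pd

# Made-up rows for illustration; the second and third share the same content.
example_df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Title": ["Post A", "Post B", "Post B"],
    "Score": [120, 80, 80],
    "Top Comment": ["First", "Second", "Second"],
})

# Whole-row comparison: the unique "Unnamed: 0" values hide the repeat.
whole_row_duplicates = example_df.duplicated().sum()

# Content-only comparison: the repeated post is detected.
content_duplicates = example_df.duplicated(subset=["Title", "Top Comment"]).sum()

print(f"Whole-row duplicates: {whole_row_duplicates}")
print(f"Content duplicates: {content_duplicates}")
```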


In [46]:
total_nulls = reddit_data_df.isnull().sum().sum()
print(f"Number of null values: {total_nulls}")
print(reddit_data_df.shape)

Number of null values: 0
(30, 3)


We only want to consider posts with a high number of upvotes, since posts that other users liked are more likely to be helpful and insightful. We can achieve this by removing posts whose score is 30 or below, of which there was only one. This threshold was chosen given that the maximum score was 327 and the mean score was 97.

It is worth noting that the posts are already sorted in order of descending score, since that is a parameter included in the Reddit API call.

In [47]:
print(f"Maximum score: {reddit_data_df['Score'].max()}")
print(f"Mean score: {int(reddit_data_df['Score'].mean())}")
print(f"Current number of posts: {reddit_data_df.shape[0]}")
num_low_score_posts = (reddit_data_df['Score'] <= 30).sum()
print(f"Number of posts with a score of 30 or below: {num_low_score_posts}")
# Keep only posts scoring above the threshold of 30
reddit_data_df = reddit_data_df[reddit_data_df['Score'] > 30]
print(f"New number of posts: {reddit_data_df.shape[0]}")

Maximum score: 327
Mean score: 97
Current number of posts: 30
Number of posts with a score of 30 or below: 1
New number of posts: 29
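
The filtering step can also be written with the cutoff held in a named constant, so that the printed label and the filter condition cannot drift apart, together with a check of the descending-order assumption. This is a minimal sketch on made-up scores; `MIN_SCORE` and `filter_by_score` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

MIN_SCORE = 30  # posts scoring at or below this are dropped (cutoff taken from the text)


def filter_by_score(df: pd.DataFrame, min_score: int = MIN_SCORE) -> pd.DataFrame:
    """Return only the rows whose 'Score' is strictly above min_score."""
    return df[df["Score"] > min_score].reset_index(drop=True)


# Made-up scores for illustration, already in descending order.
sample_df = pd.DataFrame({
    "Title": ["Post A", "Post B", "Post C"],
    "Score": [327, 109, 12],
})

# Confirm the descending-order assumption rather than relying on the API call.
is_sorted_descending = sample_df["Score"].is_monotonic_decreasing

filtered_df = filter_by_score(sample_df)
print(f"Sorted by descending score: {is_sorted_descending}")
print(f"Posts kept: {len(filtered_df)} of {len(sample_df)}")
```

Resetting the index after filtering is a design choice: it keeps positional lookups simple later on, at the cost of discarding the original row labels.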
