### Part 2: Data Acquisition

The National Student Survey gathers feedback from students on their satisfaction with their course, teaching, and other aspects of university. It is collected from final-year undergraduate students across UK universities.

Below, we load and filter data from the National Student Survey and collect data from the website Reddit.com. We also import all of the modules we will use throughout the project.

In [11]:
import pandas as pd
import json
import requests.auth

In [15]:
#read the 'Teaching' sheet of the LSE workbook
lse_dat = pd.read_excel('./data/LSE_dat.xlsx', sheet_name="Teaching")
#the first row of the sheet holds the column names, so promote it to the header
lse_dat.columns = lse_dat.iloc[0]
lse_dat = lse_dat[1:]  #drop the top row, which would otherwise duplicate the header
lse_dat.to_csv("Data/lse_dat.csv")
lse_dat

There is also data from other universities that we process in the same way and store in dataframes. For this project, we have chosen to compare LSE to the following:

- UCL
- KCL
- SOAS
- Loughborough University
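
Since each university's sheet needs the same header fix, the step could be wrapped in a small helper. The sketch below is illustrative only: the function name `promote_first_row_to_header` and the toy values are assumptions rather than part of the project code, and it assumes every sheet shares the LSE layout (column names sitting in the first data row).

```python
import pandas as pd


def promote_first_row_to_header(raw: pd.DataFrame) -> pd.DataFrame:
    """Use the first data row as the column names and drop that row."""
    cleaned = raw.iloc[1:].copy()
    cleaned.columns = raw.iloc[0].tolist()
    return cleaned.reset_index(drop=True)


# Toy stand-in for a sheet loaded with pd.read_excel (values are invented).
raw_sheet = pd.DataFrame(
    [["Question", "Agree (%)"], ["Q1", 80], ["Q2", 75]],
    columns=["Unnamed: 0", "Unnamed: 1"],
)
tidy_sheet = promote_first_row_to_header(raw_sheet)
print(tidy_sheet)
```

The same helper could then be applied to each university's sheet in turn, provided the files really do follow the same layout.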

TODO: add more universities; SQL might be used here.

For the Reddit API data, we decided to focus on the subreddit 'UniUK', which has 151 thousand members and is in the top 1% of communities ranked by size. We noticed that few posts directly used the phrase 'student satisfaction' when discussing LSE, which we assumed was due to the informal nature of the site.

Instead, we searched for posts that contained the words 'lse' and 'opinion'. We retrieved the top thirty posts matching these search criteria and stored the 'Post Title', 'Score' (number of 'upvotes' from other redditors) and 'Top Comment' in a pandas data frame 'df'. This dataframe was then saved to a csv file "Data/reddit_data.csv" so that it can be accessed from any other file and, crucially, so that the data only needs to be fetched from Reddit.com once. Since the contents of the subreddit are likely to change frequently, it would be inefficient to perform analysis on real-time data, so we instead stored a snapshot of the data taken from the site on 12/04/2024.
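
The fetch-once idea described above can be sketched as a small caching helper. This is a minimal illustration, not the project's code: `load_or_fetch` and `fake_fetch` are hypothetical names, and the fake fetch function stands in for the real Reddit request so the example runs without network access or API keys.

```python
import os
import tempfile

import pandas as pd


def load_or_fetch(csv_path, fetch):
    """Return the cached CSV if it exists; otherwise call fetch() and cache the result."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    frame = fetch()
    frame.to_csv(csv_path, index=False)
    return frame


calls = []


def fake_fetch():
    # Hypothetical stand-in for the Reddit request; records each call.
    calls.append(1)
    return pd.DataFrame({"Title": ["example post"], "Score": [1]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reddit_data.csv")
    first = load_or_fetch(path, fake_fetch)   # no file yet: fetches and writes the CSV
    second = load_or_fetch(path, fake_fetch)  # file exists: reads it, no second fetch

print(len(calls))
```

With this pattern, other files only need the CSV path, and the snapshot stays fixed until the file is deleted.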

We chose Reddit as a data source for several reasons. Firstly, its API is easy to use, reliable and openly accessible to the public. This allowed us to efficiently access user-generated content without the computational and legal issues that web scraping would risk. Since the type of textual data we are searching for (the general opinion of LSE) is rarely provided in formatted tables online, using the Reddit API is the best way to acquire and aggregate this information in a structured way.

Furthermore, the website has a strong user base of university students and people generally interested in the subject, exemplified by the size of the subreddit 'UniUK'. Not only does this give access to a large amount of data, but the posts tend to be detailed and honest. This is because a significant part of Reddit's branding is the idea of community, where you are likely to find other users who want to contribute to the discussion under the protection of anonymity.

In [48]:
'''
Reddit API Data Collection.
Use the API key to request an access token by using the data stored in keys.json.
Make an API call to retrieve posts from 'UniUK'.
Store the data in a data frame.
'''

#access keys
with open('Data/keys.json') as f:
    keys = json.load(f)
app_id = keys['reddit']['app_id']
app_secret = keys['reddit']['app_secret']
username = keys['reddit']['username']
password = keys['reddit']['password']

#request a token
client_auth = requests.auth.HTTPBasicAuth(app_id, app_secret)
post_data = {'grant_type': 'password',
            'username': username,
            'password': password}
headers = {'User-Agent': f'new connection lse/0.0.1 by {username}'}
r = requests.post('https://www.reddit.com/api/v1/access_token',
                    auth=client_auth, data=post_data, headers=headers)
r.raise_for_status()  #stop early if the token request failed
access_token = r.json()['access_token']

#send a request
headers = {"Authorization": f"bearer {access_token}",
           'User-Agent': f'lse/0.0.1 by {username}'}

#search 'UniUK' only (restrict_sr) for 'lse' and take the top 30 posts
r = requests.get("https://oauth.reddit.com/r/UniUK/search?q=lse"
                 "&limit=30&sort=top&restrict_sr=true", headers=headers)

titles = []
scores = []
top_comments = []

#post title, score and top comment
for post in r.json()["data"]["children"]:
    post_title = post["data"]["title"]
    post_score = post["data"]["score"]
    post_permalink = post["data"]["permalink"]

    comments_request = requests.get(f"https://oauth.reddit.com{post_permalink}.json", headers=headers)
    comments_data = comments_request.json()

    if isinstance(comments_data, list) and len(comments_data) > 1:
        comments = comments_data[1]["data"]["children"]
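        #keep the post only if its first listed entry is a comment with a body
        if comments and "body" in comments[0]["data"]: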
            top_comment = comments[0]["data"]["body"]
            titles.append(post_title)
            scores.append(post_score)
            top_comments.append(top_comment)

df = pd.DataFrame({
    "Title": titles,
    "Score": scores,
    "Top Comment": top_comments
})

#save dataframe to csv file
df.to_csv("Data/reddit_data.csv", index=False)
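
The top-comment lookup in the loop above could also be pulled out into a function that can be checked without calling the API. The sketch below is a variation rather than the code used here: `first_comment_body` is a hypothetical helper, the sample payload is hand-written, and instead of taking the first entry it skips anything that is not a comment (Reddit marks comments with kind "t1", while "more" entries are placeholders with no body).

```python
def first_comment_body(thread):
    """Return the body of the first real comment in a thread response, or None."""
    if not isinstance(thread, list) or len(thread) < 2:
        return None
    for child in thread[1]["data"]["children"]:
        # kind "t1" marks a comment; "more" is a placeholder with no body
        if child.get("kind") == "t1":
            return child["data"].get("body")
    return None


# Hand-written stand-in for the JSON returned by a post's .json endpoint:
# element 0 is the post listing, element 1 is the comment listing.
sample_thread = [
    {"kind": "Listing", "data": {"children": [
        {"kind": "t3", "data": {"title": "example"}},
    ]}},
    {"kind": "Listing", "data": {"children": [
        {"kind": "more", "data": {"children": ["abc"]}},
        {"kind": "t1", "data": {"body": "example comment"}},
    ]}},
]
print(first_comment_body(sample_thread))
```

A post would then be appended to the lists only when the helper returns something other than None.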
