### <font color='black'>Introduction</font>

<font color='#404040'>The goal of this project is to analyse student comments about different universities in the U.K. The comments come from *[www.whatuni.com](https://www.whatuni.com/university-course-reviews/?pageno=1)*, a popular website where students review their universities. The research project involves (1) collecting textual data from this website; (2) applying visualization and machine learning methods to the comments; and (3) interpreting the results produced by the black-box methods.</font>

### <font color='black'> Data collection</font>

<font color='#404040'>This Jupyter notebook scrapes data from *[www.whatuni.com](https://www.whatuni.com/university-course-reviews/?pageno=1)* for three universities: *University of Oxford*, *University of Edinburgh* and *University of Warwick*. These universities were chosen based on my academic interests. Each code block comes with a short explanation of why and how the code is written.</font>

<font color='#404040'>We use the following packages for data collection: </font>

In [1]:
import pandas as pd
import time
import requests
from bs4 import BeautifulSoup

<font color='#404040'>Now, we specify the hyperlinks to the comments of the target universities: </font>

In [2]:
# Base hyperlink
hlink_basic = 'https://www.whatuni.com/university-course-reviews/{}/{}'

# Hyperlink for different universities
hlink_oxford = hlink_basic.format('university-of-oxford', '3757')
hlink_edinburgh = hlink_basic.format('university-of-edinburgh', '5508')
hlink_warwick = hlink_basic.format('university-of-warwick', '3771')

<font color='#404040'>Each rating consists of a category, a question and a review, and we store this information in a dictionary called *comment*. Since this step is repeated for many comments from different users, we wrap the code in a function called *extract_comment*.</font>

In [3]:
def extract_comment(rating_category, rating_question, rating_reviews, comment):
    category = rating_category.text

    # Check if there is indeed a question
    if rating_question is None:
        # There is no question: store None, unless the category already has an entry
        comment.setdefault(category, None)
        return comment

    # If there is no review, rating_reviews is a None-type object and is stored as None
    review = rating_reviews.text if rating_reviews is not None else None

    # Add to the category's dictionary instead of replacing it, so that
    # several questions under the same category are all kept
    if not isinstance(comment.get(category), dict):
        comment[category] = {}
    comment[category][rating_question.text] = review

    # Return the scraped comment
    return comment
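
<font color='#404040'>To make the three cases concrete, here is a minimal sketch using plain strings instead of HTML tags. The helper *nest_rating* is a hypothetical, condensed stand-in for *extract_comment*, and the reviewer, questions and reviews are made up. It only covers the simple situation of one question per category.</font>

```python
def nest_rating(category, question, review, comment):
    """Hypothetical stand-in for extract_comment that takes strings (or None) instead of tags."""
    if question is None:
        # No question: the category is recorded with no content
        comment[category] = None
    else:
        # Question present: the review may still be None
        comment[category] = {question: review}
    return comment


comment = {'name': 'Student A'}  # made-up reviewer

# Case 1: question and review are both present
nest_rating('Course', 'How do you find your course?', 'Challenging but rewarding.', comment)
# Case 2: question without any review text
nest_rating('Facilities', 'How are the facilities?', None, comment)
# Case 3: category without a question
nest_rating('Overall', None, None, comment)

print(comment)
```

<font color='#404040'>Each category becomes a key of *comment*, and its value is either a small question-to-review dictionary or *None*.</font>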

<font color='#404040'>Given a hyperlink, we request the target webpage and parse the HTML with *BeautifulSoup*. First, we check whether the page is empty. If it is not, we locate the HTML tags that hold the comments. Each comment begins with *&lt;div class="rlst_row"&gt;*.</font>
    
<font color='#404040'>Inside this tag, we can find the name, date, degree and review ratings by identifying the corresponding HTML tags. The review ratings cover multiple categories, so we use a for-loop to go through all of them, extract the questions and reviews from their tags, and call *extract_comment* for each.</font>

In [4]:
def comment_summarize(hyperlink):
    # Fetch the page (with a timeout so a stalled request cannot hang the loop) and parse its HTML
    url_source = requests.get(hyperlink, timeout = 30)
    url_prettified = BeautifulSoup(url_source.content, 'html.parser', from_encoding = 'utf-8')

    summary = []
    
    # Check whether the page has any comments, exit if no comments are found
    if url_prettified.find('div', {'class':'rlst_row'}) is None:
        return 'EMPTY'
    
    # Loop through each reply located in <div class=rlst_row> element
    for case in url_prettified.find_all('div', {'class':'rlst_row'}):
        # Create an empty dictionary, corresponding to each observation
        comment = dict()
        
        # Collect user information
        name = case.find('div', {'class': 'rev_name'}).text
        date = case.find('div', {'class':'rev_dte'}).text
        degree = case.find('h3')
        
        # Some users do not give info about their degrees, it leads to 'degree' as a None object
        if degree is not None:
            degree = degree.text
        
        # Update
        comment['name'] = name
        comment['date'] = date
        comment['degree'] = degree
        
        # Reviews are located in <div class=reviw_rating>
        rating = case.find('div', {'class':'reviw_rating'})

        # Each rating is located in <div class=rate_new>
        prev_category = None
        for subrating in rating.find_all('div', {'class':'rate_new'}):
            # Each rating is associated with 3 parts: category, question, reviews
            rating_category = subrating.find('span', {'class': 'cat_rat'})
            rating_question = subrating.find('div', {'class': 'rw_qus_des'})
            rating_reviews = subrating.find('p', {'class': 'rev_dec'})

            # Check if there is a category
            if rating_category is not None:
                # Update
                comment = extract_comment(rating_category, rating_question, rating_reviews, comment)
                prev_category = rating_category

            # If there is no category, the question and review belong to the previous category
            # (skipped when no category has been seen yet)
            elif prev_category is not None:
                # Update
                comment = extract_comment(prev_category, rating_question, rating_reviews, comment)
        
        # Update
        summary.append(comment)
        
    # Return the scraped comment
    return summary
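
<font color='#404040'>The tag lookups can be tried offline on a small piece of HTML. The snippet below is a sketch: the HTML is hand-written to imitate the class names used in *comment_summarize*, and the reviewer, date and texts are made up. Real pages are much larger and may be structured differently.</font>

```python
from bs4 import BeautifulSoup

# Hand-written HTML imitating the class names used above (all content is made up)
sample_html = """
<div class="rlst_row">
  <div class="rev_name">Student A</div>
  <div class="rev_dte">01 Jan 2020</div>
  <h3>Mathematics BSc</h3>
  <div class="reviw_rating">
    <div class="rate_new">
      <span class="cat_rat">Course</span>
      <div class="rw_qus_des">How do you find your course?</div>
      <p class="rev_dec">Challenging but rewarding.</p>
    </div>
    <div class="rate_new">
      <span class="cat_rat">Facilities</span>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first matching tag, find_all() returns every match
row = soup.find('div', {'class': 'rlst_row'})
name = row.find('div', {'class': 'rev_name'}).text
blocks = row.find('div', {'class': 'reviw_rating'}).find_all('div', {'class': 'rate_new'})
categories = [block.find('span', {'class': 'cat_rat'}).text for block in blocks]

# find() returns None when a tag is absent, which is why the code above tests for None
missing_question = blocks[1].find('div', {'class': 'rw_qus_des'})

print(name, categories, missing_question)
```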

<font color='#404040'>The selected universities are well known, so their comments span multiple pages. To reach a given page, we append *'?pageno={}'* to the hyperlink before passing it to *comment_summarize*. We loop over the pages until we reach an empty one, scraping the textual data from each page with the previous function.</font>

In [5]:
def get_content(hyperlink):
    # Access pages
    pagelink = hyperlink + '?pageno={}'
    total_summary = []
    
    # Loop through pages
    for i in range(1, 200): # Page numbering starts at 1, as in '?pageno=1'
        # Access i-th page, and scrape the comments by calling comment_summarize()
        page_summary = comment_summarize(pagelink.format(i))

        # In case i-th page is empty, break the loop because there are at most i - 1 pages
        if page_summary == 'EMPTY':
            print('There are at most {} pages'.format(i - 1))
            break
        
        else:
            # Update
            total_summary.extend(page_summary)
        
        # Pause between requests to avoid being blocked for visiting the website too quickly
        time.sleep(5)
        
    # Return the scraped comments
    return total_summary
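
<font color='#404040'>The stopping logic can be checked without touching the website. In the sketch below, *collect_pages*, *fake_fetch* and the *example.com* links are hypothetical: *fake_fetch* plays the role of *comment_summarize* and returns made-up records for two pages, then *'EMPTY'*. The sketch assumes that page numbers start at 1, as in the link quoted in the introduction. The delay between requests is left out because nothing is downloaded.</font>

```python
def collect_pages(fetch_page, base_link, max_pages=200):
    """Collect records page by page until fetch_page reports an empty page."""
    page_link = base_link + '?pageno={}'
    records = []

    for i in range(1, max_pages + 1):
        page = fetch_page(page_link.format(i))

        # An empty page marks the end of the comments
        if page == 'EMPTY':
            break

        records.extend(page)

    return records


# Made-up site with two pages of records
fake_site = {
    'https://example.com/reviews?pageno=1': [{'name': 'A'}, {'name': 'B'}],
    'https://example.com/reviews?pageno=2': [{'name': 'C'}],
}


def fake_fetch(link):
    """Hypothetical stand-in for comment_summarize: unknown pages count as empty."""
    return fake_site.get(link, 'EMPTY')


records = collect_pages(fake_fetch, 'https://example.com/reviews')
print(len(records))
```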

<font color='#404040'>With these functions defined, we can now start the data collection process by applying *get_content* to each university.</font>

In [None]:
# Scrape websites
content_oxford = get_content(hlink_oxford)
content_edinburgh = get_content(hlink_edinburgh)
content_warwick = get_content(hlink_warwick)

### <font color='black'> Export data</font>

<font color='#404040'>We convert the scraped data into a pandas DataFrame and export it as a CSV file to the *data* folder in the current directory. Note that we use a *relative path* here, so you do not need to change the path when running this notebook.</font>

In [None]:
import os

# Export data
def export_data(content_uni, name_uni):
    os.makedirs('./data', exist_ok = True) # Create the data folder if it is missing
    pd.DataFrame(content_uni).to_csv('./data/' + name_uni + '.csv', index = False)

In [None]:
export_data(content_oxford, 'oxford')
export_data(content_edinburgh, 'edinburgh')
export_data(content_warwick, 'warwick')
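
<font color='#404040'>One thing to keep in mind when reading the files back is that CSV stores every cell as text, so the nested question-to-review dictionaries come back as strings. The sketch below uses made-up records and an in-memory buffer instead of a file. Parsing with *ast.literal_eval* is shown as one possible way to recover the dictionaries, not necessarily the approach taken in the next notebook.</font>

```python
import ast
import io

import pandas as pd

# Made-up records in the same shape as the scraped comments
records = [
    {'name': 'Student A', 'Course': {'How do you find your course?': 'Challenging but rewarding.'}},
    {'name': 'Student B', 'Course': None},
]

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
pd.DataFrame(records).to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# The nested dictionary is now a string and has to be parsed again
first = loaded.loc[0, 'Course']
parsed = ast.literal_eval(first)

# A missing category is read back as NaN
second_is_missing = pd.isna(loaded.loc[1, 'Course'])

print(type(first).__name__, parsed, second_is_missing)
```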

<font color='#404040'>The data have been collected, and now we proceed to the data cleaning part in the next notebook.</font>