### <font color='black'>Tagging and lemmatization</font>

<font color='#404040'>In this notebook, we perform part-of-speech (POS) tagging and lemmatization. In text mining, different word forms often share the same meaning or stem from the same origin, and analyzing the raw text without accounting for this can give unsatisfactory results. Some questions that may arise are answered below:</font>

---

<font color='#404040'>**Question 1:** Why do we prefer lemmatization over stemming for this analysis?</font>

<font color='#404040'>**Answer:** Both lemmatization and stemming "normalize" text. *Stemming* is a heuristic that strips affixes to reduce a word to its *stem*, which need not be a real word, whereas *lemmatization* takes the part of speech into account and maps the word to its dictionary form (lemma). Lemmatization can therefore distinguish cases that stemming conflates, which is why this notebook also performs POS tagging.</font>
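
<font color='#404040'>The difference can be sketched with a toy example. Both functions below are illustrative assumptions and not the NLTK implementations: *toy_stem* strips a few suffixes blindly, while *toy_lemmatize* looks up a small hand-made table keyed by word and part of speech.</font>

```python
# Toy illustration only: real stemmers (e.g. Porter) and WordNet are far richer.

def toy_stem(word):
    """Strip a few common suffixes without looking at the part of speech."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical hand-made lemma table keyed by (word, part of speech)
TOY_LEMMAS = {
    ("meeting", "n"): "meeting",  # noun, as in "a meeting"
    ("meeting", "v"): "meet",     # verb, as in "we are meeting"
    ("studies", "n"): "study",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return TOY_LEMMAS.get((word, pos), word)

# The stemmer gives one answer regardless of context
print(toy_stem("meeting"), toy_stem("studies"))

# The lemmatizer depends on the part of speech
print(toy_lemmatize("meeting", "n"), toy_lemmatize("meeting", "v"))
```

<font color='#404040'>Here the stemmer turns "studies" into the non-word "studi" and cannot tell the noun "meeting" from the verb, whereas the lookup keyed by part of speech can.</font>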

---

<font color='#404040'>**Question 2:** Why do we need to convert the data types from *string* to *numeric*?</font>

<font color='#404040'>**Answer:** It is due to the statistical nature of rating scores. A higher score should imply a better rating, so the scores are ordinal. A *string* carries no such ordering and does not support arithmetic such as averaging. From a statistical point of view, *numeric* is therefore preferred over *string* for rating scores.</font>

---

<font color='#404040'>**Question 3:** Why are the data for different universities merged together in the last section?</font>

<font color='#404040'>**Answer:** For the convenience of file importing and exporting.</font>

---

<font color='#404040'>**Question 4:** Why do we need *reviews_lem* and *reviews_lem_short*?</font>

<font color='#404040'>**Answer:** When using simple word tokenization (unigrams), a clear topic pattern can be found using adjectives and nouns only. *reviews_lem* keeps all lemmatized words, whereas *reviews_lem_short* keeps only the adjectives and nouns, so we can experiment with both and compare the resulting topic models.</font>

In [1]:
import numpy as np
import pandas as pd

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet, stopwords
from nltk.stem import WordNetLemmatizer

In [2]:
# NLTK Resource
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger') # for tagging
nltk.download('stopwords') # stopwords

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### <font color='black'>Import data</font>

<font color='#404040'>First, we import data cleaned in the previous notebook with relative paths.</font>

In [3]:
# Read data
dat_oxford = pd.read_csv('./data/oxford_sum.csv')
dat_edinburgh = pd.read_csv('./data/edinburgh_sum.csv')
dat_warwick = pd.read_csv('./data/warwick_sum.csv')

### <font color='black'>Lemmatization</font>

<font color='#404040'>As we use the WordNet lemmatizer, we need to convert the *[Penn Treebank part-of-speech tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)* returned by NLTK into *WordNet part-of-speech tags*. We therefore define the function *get_wordnet_pos*, which does this with if-else statements.</font>

In [4]:
def get_wordnet_pos(tag):
    # Adjective
    if tag.startswith('J'):
        return wordnet.ADJ
    
    # Verb
    elif tag.startswith('V'):
        return wordnet.VERB

    # Noun
    elif tag.startswith('N'):
        return wordnet.NOUN
    
    # Adverb
    elif tag.startswith('R'):
        return wordnet.ADV
    
    # If not matched, return None-type object
    else:
        return None
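
<font color='#404040'>As an aside, the same mapping can be sketched as a dictionary lookup on the first letter of the tag. This is only an alternative formulation under the assumption that the WordNet constants are the single characters 'a', 'v', 'n' and 'r' (the values of *wordnet.ADJ*, *wordnet.VERB*, *wordnet.NOUN* and *wordnet.ADV* in NLTK); the names *PENN_PREFIX_TO_WORDNET* and *penn_to_wordnet* are hypothetical.</font>

```python
# First letter of a Penn Treebank tag -> WordNet POS character
# (assumed to equal wordnet.ADJ, wordnet.VERB, wordnet.NOUN, wordnet.ADV)
PENN_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def penn_to_wordnet(tag):
    """Return the WordNet POS for a Penn Treebank tag, or None if unmapped."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1])

for tag in ["JJ", "VBD", "NNS", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

<font color='#404040'>Note that any rule based on the first letter, including the if-else chain above, also maps the particle tag *RP* to an adverb.</font>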

<font color='#404040'>We also need to define a set of stopwords, which contains irrelevant words to be removed from the text. We use stopwords from the NLTK library.</font>

In [5]:
# A set of stopwords (membership tests on a set are faster than on a list)
stop_words = set(stopwords.words('english'))

<font color='#404040'>The following function performs POS tagging, lemmatization and stopwords filtering in a certain order:</font>

<font color='#404040'>We apply POS tagging first and then lemmatize the remaining (non-stop) words using their POS tags. Stopword filtering comes after POS tagging because removing stopwords beforehand may affect the tagging results, for example when subject pronouns and reflexive pronouns are removed. Lemmatization also depends on the part of speech, so it must come after POS tagging as well. We then apply the function *lemmatization* to the data.</font>

<font color='#404040'>In addition, comments should be converted to lower case.</font>
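
<font color='#404040'>The reason for lowercasing can be sketched as follows: the entries in NLTK's English stopword list are lower case, so a capitalized token would slip through the filter. The stopword set below is a small hypothetical stand-in so that the example runs without NLTK data.</font>

```python
# Hypothetical mini stopword set; the notebook uses NLTK's English list instead
toy_stop_words = {"the", "is", "and", "a"}

tokens = ["The", "course", "is", "challenging", "and", "rewarding"]

# Without lowercasing, "The" does not match the lower-case entry "the"
kept_raw = [t for t in tokens if t not in toy_stop_words]

# Lowercasing before the comparison removes it as intended
kept_lower = [t.lower() for t in tokens if t.lower() not in toy_stop_words]

print(kept_raw)
print(kept_lower)
```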

In [6]:
def lemmatization(comment):
    # Tokenize the comment with NLTK and tag each token with its part of speech
    # Punctuation (e.g. commas and full stops) is split off by the tokenizer
    words = []
    words2 = []
    lemmatizer = WordNetLemmatizer()  # create once instead of once per word
    tagging_nltk = nltk.pos_tag(word_tokenize(comment))
    
    # Convert the POS tags into the WordNet format
    # Store them in a list of tuples: (token in **lower case**, WordNet POS)
    tagging_nltk = [(word.lower(), get_wordnet_pos(tag)) for word, tag in tagging_nltk]
    
    # Filter out the stopwords
    tagging_nltk = [s for s in tagging_nltk if s[0] not in stop_words]
        
    # Loop through each word in the comment
    for word, tag in tagging_nltk:
        # Lemmatize words that have a WordNet POS and add them to a list
        # Tokens without one (e.g. punctuation) are dropped
        if tag is not None:
            word_lem = lemmatizer.lemmatize(word, pos=tag)
            words.append(word_lem)
            
            # Another list containing adjectives and nouns only
            if tag not in (wordnet.VERB, wordnet.ADV):
                words2.append(word_lem)
                
    # Join the lemmatized words and return a single string
    return ' '.join(words), ' '.join(words2)

<font color='#404040'>Below is a function that handles the two outputs of *lemmatization*. It converts the resulting Series of tuples into a pandas DataFrame with two columns.</font>

In [7]:
def multiple_output(pd_series):
    # Apply lemmatization to each review in the Series (e.g. dat_university['reviews'])
    # Use .tolist() + pd.DataFrame() to split the 2-tuples into two columns
    return pd.DataFrame(pd_series.apply(lemmatization).tolist())
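
<font color='#404040'>The tuple-to-columns step can be sketched on toy data. The function *toy_split* is a hypothetical stand-in for *lemmatization* so that the example runs without NLTK. The sketch also passes *index=* explicitly, which is an addition not present in the function above: it keeps the new rows aligned with the original Series when its index is not the default 0, 1, 2, ...</font>

```python
import pandas as pd

def toy_split(text):
    # Hypothetical stand-in for `lemmatization`: returns a 2-tuple of strings
    words = text.lower().split()
    return ' '.join(words), ' '.join(w for w in words if len(w) > 4)

# A Series with a non-default index, e.g. after filtering rows
reviews = pd.Series(['Great lectures here', 'Poor housing'], index=[10, 11])

# apply() yields a Series of tuples; tolist() + DataFrame splits them into columns
expanded = pd.DataFrame(
    reviews.apply(toy_split).tolist(),
    index=reviews.index,          # keep alignment with the source rows
    columns=['full', 'short'],
)
print(expanded)
```

<font color='#404040'>For data read directly from a CSV file, as in this notebook, the index is already the default one, so the function above and this variant should give the same assignment.</font>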

<font color='#404040'>Lemmatize the dataset for each university. We use the *apply* method because *lemmatization* processes one entry of the *reviews* column (a single string) at a time.</font>

In [8]:
# Lemmatize the dataset
dat_oxford[['reviews_lem', 'reviews_lem_short']] = multiple_output(dat_oxford['reviews'])
dat_edinburgh[['reviews_lem', 'reviews_lem_short']] = multiple_output(dat_edinburgh['reviews'])
dat_warwick[['reviews_lem', 'reviews_lem_short']] = multiple_output(dat_warwick['reviews'])

### <font color='black'>Data type</font>

<font color='#404040'>The rating scores are stored as *string*, although they actually range from 1 to 5. Hence, we convert these features to a *numeric* data type, because a *string* does not capture the ordinal ranking. Some examples of inconsistent or undesirable data types are listed below:</font>

In [9]:
# Example of the original data type
# rat1 - rat5 (string) should be 1 to 5 (numeric)
dat_edinburgh['score_Course and Lecturers'].value_counts()

rat4    541
rat3    233
rat5    229
rat2     43
rat1      8
Name: score_Course and Lecturers, dtype: int64

In [10]:
# Example of the original data type
# ratY and ratN (string) should be encoded to 1 and 0 (numeric)
dat_edinburgh['score_Do you think your time at university this year has been value for money?'].value_counts()

ratY    18
ratN    18
Name: score_Do you think your time at university this year has been value for money?, dtype: int64

In [11]:
# Example of the original data type
# non_app should be replaced by NaN
dat_oxford['score_Job Prospects'].value_counts()

rat5       181
rat4       107
rat3        43
rat2        13
non_app     12
rat1         4
Name: score_Job Prospects, dtype: int64

<font color='#404040'>Below is a function that converts the scores to the appropriate data type. First, it checks whether the input is a missing value or *non_app* (which stands for "not applicable"). In both cases, we return a missing value.</font>

<font color='#404040'>Otherwise, we extract the last character of the input because the raw scores are *rat1*,...,*rat5*, *ratY* and *ratN* which have the same pattern <font color='blue'>rat[score]</font>. If the scores are *Y* or *N*, we encode them into 1 and 0. Otherwise, they should be numeric from 1 to 5.</font>

In [12]:
def convert_dtypes(x):
    # Check if missing value
    if pd.isnull(x):
        return x
    
    # If 'non_app', treat it as missing value
    elif x == 'non_app':
        return np.nan
    
    else:
        # Get the last character
        x = x[-1]
        
        # If 'Y'/'N', return 1/0
        if x == 'Y':
            return 1
        
        elif x == 'N':
            return 0
        
        # Otherwise, it ends with a number, convert it into float
        else:
            return float(x)
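
<font color='#404040'>For comparison, the same conversion can be sketched as a vectorized dictionary lookup with *Series.map*. This is an alternative under stated assumptions and not the method used in this notebook: the name *SCORE_MAP* is hypothetical, and the toy Series below only imitates the raw labels shown earlier.</font>

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table: 'rat1'..'rat5' -> 1.0..5.0, 'ratY'/'ratN' -> 1.0/0.0
SCORE_MAP = {f'rat{i}': float(i) for i in range(1, 6)}
SCORE_MAP.update({'ratY': 1.0, 'ratN': 0.0})

raw = pd.Series(['rat4', 'ratY', 'non_app', np.nan, 'ratN', 'rat1'])

# Values that are not keys of the dict (here 'non_app' and NaN) become NaN
converted = raw.map(SCORE_MAP)
print(converted)
```

<font color='#404040'>One design difference: with a lookup table, any unexpected label silently becomes a missing value, whereas *convert_dtypes* above raises an error when the last character is neither *Y*, *N* nor a digit.</font>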

<font color='#404040'>Since this pattern appears in every column whose name starts with *score_*, we use a *for-loop* over these columns for each university.</font>

In [13]:
def score_rating(dat_uni, uni_name):
    # All columns with rating scores
    score_columns = dat_uni.columns[dat_uni.columns.str.startswith('score_')]
    
    # Loop through each column and convert its data type
    # (uni_name is currently unused; it is kept so that existing calls still work)
    for column in score_columns:
        dat_uni[column] = dat_uni[column].apply(convert_dtypes)
    
    return dat_uni

In [14]:
dat_oxford = score_rating(dat_oxford, 'oxford')
dat_edinburgh = score_rating(dat_edinburgh, 'edinburgh')
dat_warwick = score_rating(dat_warwick, 'warwick')

<font color='#404040'>We concatenate the data from the different universities using *pd.concat*. The function *concat_dat_uni* below first creates a column called *University*, which can be used as an identifier.</font>

In [15]:
# Concatenate as the same dataframe
def concat_dat_uni(dat_uni, dat_uni2, dat_uni3):
    # Create a column to specify university
    dat_uni['University'] = 'oxford'
    dat_uni2['University'] = 'edinburgh'
    dat_uni3['University'] = 'warwick'
    
    # Concatenate (row-wise, equivalently along axis = 0)
    return pd.concat([dat_uni, dat_uni2, dat_uni3])

In [16]:
dat_uni = concat_dat_uni(dat_oxford, dat_edinburgh, dat_warwick)
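
<font color='#404040'>One detail of row-wise concatenation can be sketched on toy frames: by default *pd.concat* keeps the original row labels, so they repeat across universities, while *ignore_index=True* renumbers them. The frames below are made-up stand-ins, and the use of *assign* (which returns a copy instead of modifying the input) is an assumption of this sketch, not what *concat_dat_uni* does.</font>

```python
import pandas as pd

# Toy stand-ins for two university datasets
a = pd.DataFrame({'reviews': ['x', 'y']})
b = pd.DataFrame({'reviews': ['z']})

# assign() returns a copy, so the input frames are left unchanged
frames = [a.assign(University='oxford'), b.assign(University='edinburgh')]

stacked = pd.concat(frames)                        # row labels: 0, 1, 0
renumbered = pd.concat(frames, ignore_index=True)  # row labels: 0, 1, 2

print(list(stacked.index))
print(list(renumbered.index))
```

<font color='#404040'>Repeated row labels matter only if rows are later selected by label; an export that omits the index is not affected.</font>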

### <font color='black'>Export</font>

<font color='#404040'>Our dataset is now ready for analysis and is exported as *training_data.csv*. Topic modeling and text classification are covered in the next notebook.</font>

In [17]:
dat_uni.to_csv('./data/training_data.csv', index = False)