# Using Natural Language Processing to derive insights from BSc Data Science personal statements

This notebook analyses 10 BSc Data Science personal statements, representing 50% of the cohort starting in 2022.

Using NLP, the following will be identified from the personal statements:
- Main topics (Using Topic Modelling)
- Top 10 most frequent words. This may be more insightful than topic modelling: because topic modelling is an unsupervised machine learning algorithm, it is not always as effective as desired.
- Key people who have been mentioned - this could be used to provide a basis for potential students to do further research to explore their interests
- Readability levels. These will be calculated using the Flesch-Kincaid Grade Level formula, which estimates the US school grade level needed to understand a text from the average number of syllables per word and the average number of words per sentence.
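
As a worked illustration of the readability formula, the grade level is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below plugs in made-up counts (not taken from the statements) to show the arithmetic; `fk_grade` is a hypothetical helper used only for this example.

```python
def fk_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid Grade Level computed from raw counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

# Illustrative (made-up) counts: 100 words, 5 sentences, 150 syllables
# 0.39 * 20 + 11.8 * 1.5 - 15.59
example_grade = fk_grade(100, 5, 150)
print(round(example_grade, 2))
```

Longer sentences and longer words both push the grade level up, with syllables per word carrying the larger weight.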

In [16]:
import pandas as pd

# Reading the text file
with open('ST115_personal_statements.txt', 'r') as file:
    all_ps = file.read()

# Displaying the first 1000 characters of the text file
all_ps[:1000]

'Personal statement:\nTaking part in a coding competition and several UKMT challenges exposed me to new ways of approaching problems from what I was used to in school. The challenge of having to think differently about problems and eventually arrive at solutions after reaching dead-ends was something I relished doing, which convinced me that a quantitative degree was for me.\nI have continued to challenge myself and explore the interplay between computer science and mathematics by solving Project Euler problems. For example, I used recursion to compute the convergents of continued fractions of the square root of 2. These problems nurtured my interest in pure mathematics, leading me to read Kevin Houston\'s "How to Think like a Mathematician". This introduced me to more sophisticated inductive proof techniques than the ones I learned at A-level. To supplement this, I explored how Lean, a programming language, can be taught to inductively prove a simple addition from first principles. Wh

Looking at the first 1000 characters shown above, some initial data cleaning is needed: the newline characters ('\n') and other escape sequences must be removed.
We also want to store the text in two forms: all the personal statements combined into one text for topic modelling, and each statement stored separately for calculating readability and identifying key people.
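
A minimal sketch of that plan on a toy string (the sample text and the names `raw`, `cleaned`, `statements`, and `combined` are illustrative, not from the notebook). It assumes newlines and non-breaking spaces can be replaced by a single space, which keeps adjacent sentences from being fused together.

```python
import re

# Toy stand-in for the contents of the text file
raw = "Personal statement:\nI like maths.\nI like data.\nPersonal statement:\nCoding is\xa0fun."

# Replace non-breaking spaces with a space, then collapse all whitespace runs
cleaned = re.sub(r"\s+", " ", raw.replace("\xa0", " ")).strip()

# One entry per statement; drop the empty piece before the first marker
statements = [s.strip() for s in cleaned.split("Personal statement:") if s.strip()]

# All statements as a single text, e.g. for topic modelling
combined = " ".join(statements)

print(statements)
print(combined)
```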

In [17]:
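# Initial removal of newlines ('\n'), non-breaking spaces ('\xa0') and apostrophes
# Note: replacing '\n' with an empty string joins the end of one paragraph to the start of the next (e.g. 'me.I')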
all_ps = all_ps.replace("\n", "")
all_ps = all_ps.replace("\xa0", "")
all_ps = all_ps.replace("\'", "")

In [18]:
full_text = all_ps.replace("Personal statement:", " ")

In [19]:
all_ps

'Personal statement:Taking part in a coding competition and several UKMT challenges exposed me to new ways of approaching problems from what I was used to in school. The challenge of having to think differently about problems and eventually arrive at solutions after reaching dead-ends was something I relished doing, which convinced me that a quantitative degree was for me.I have continued to challenge myself and explore the interplay between computer science and mathematics by solving Project Euler problems. For example, I used recursion to compute the convergents of continued fractions of the square root of 2. These problems nurtured my interest in pure mathematics, leading me to read Kevin Houstons "How to Think like a Mathematician". This introduced me to more sophisticated inductive proof techniques than the ones I learned at A-level. To supplement this, I explored how Lean, a programming language, can be taught to inductively prove a simple addition from first principles. What ent

<b> STEP 1: Readability </b>
- Creating a dataframe where each row accounts for a single personal statement. This will be used to calculate the individual readability scores.

In [36]:
# This phrase has been used in the text file to indicate the start of a different personal statement
individual_statements = all_ps.split('Personal statement:')

ps_df = pd.DataFrame({'Personal Statements': individual_statements})

#The first row is empty so it is removed
ps_df = ps_df.drop(index=0).reset_index(drop=True)


Creating the Flesch-Kincaid function to calculate the readability scores 

In [35]:
import re

def count_syllables(word):
    '''
    Helper function to count the number of syllables in a given word
    Input: 
    word : string
    
    Output:
    num_vowels : int
    '''
    num_vowels = len(re.findall(r'[aeiouy]+', word))

    # Subtract the number of silent e's at the end of the word
    if re.search(r'e$', word):
        num_vowels -= 1

    # Subtract one for each diphthong
    num_vowels -= len(re.findall(r'[aeiouy]{2}', word))

    # Add one if the word ends in "-le"
    if re.search(r'le$', word):
        num_vowels += 1

    # Add one if the word is one syllable and ends in a consonant followed by "y"
    if len(re.findall(r'^[^aeiouy]+[aeiouy]+[^aeiouy]+y$', word)):
        num_vowels += 1

    return max(1, num_vowels)

def flesch_kincaid_grade(text):
    '''
    Uses the Flesch-Kincaid formula to calculate the grade level required to read a specified text
    
    Inputs: 
    text : string
    The text to calculate the readability level of 
    
    Outputs: 
    grade_level : float
    Grade level required to read the text
    '''
    sentences = text.split('.')
    words = text.split()

    # Calculate the average number of words per sentence
    words_per_sentence = len(words) / len(sentences)

    # Calculate the average number of syllables per word
    syllables = 0
    for word in words:
        syllables += count_syllables(word)
    syllables_per_word = syllables / len(words)

    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    
    grade_level = round(grade_level, 1)

    return grade_level
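
The function above splits sentences on full stops only, so the empty string after a final full stop is counted as a sentence, and sentences ending in '?' or '!' are not separated. A hedged alternative sketch follows; `count_sentences` is a hypothetical helper, not part of the notebook, and swapping it in would change the scores reported below.

```python
import re

def count_sentences(text):
    """Count non-empty chunks separated by '.', '!' or '?'."""
    chunks = re.split(r"[.!?]+", text)
    return sum(1 for chunk in chunks if chunk.strip())

sample = "I enjoy data. Do you? Yes!"
print(len(sample.split('.')))   # naive split on full stops only
print(count_sentences(sample))  # punctuation-aware count
```

This is still a rough heuristic: abbreviations and decimal numbers containing full stops would be counted as sentence boundaries.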

Applying the function to all the personal statements

In [37]:
ps_df['Readability Age Grade'] = ps_df['Personal Statements'].apply(flesch_kincaid_grade)


In [38]:
ps_df

| | Personal Statements | Readability Age Grade |
|---|---|---|
| 0 | Taking part in a coding competition and severa... | 12.5 |
| 1 | Mathematics and the social sciences are seen a... | 14.5 |
| 2 | Data Science appealed to me as it offers a ne... | 14.0 |
| 3 | My interest in data analytics stems from an ex... | 13.4 |
| 4 | Growing up on the football pitch, I learnt th... | 10.6 |
| 5 | I have always been drawn to figures and data, ... | 12.1 |
| 6 | Data is one of the most valuable tools that th... | 11.8 |
| 7 | I couldn’t believe what I saw. After looking a... | 14.0 |
| 8 | I believe that the intersection of mathematic... | 13.7 |
| 9 | Seeing how influential big data is in shaping... | 13.7 |


<b> STEP 2: NLP Preprocessing tasks</b> 

The following tasks will be carried out in this section: 
- Tokenisation
- Stop word removal 
- POS tagging
- Named Entity Recognition 
- TF-IDF
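
To make two of these steps concrete, here is a minimal pure-Python sketch of stop word removal and TF-IDF on toy documents. The stop word set, the documents, and the helper names `preprocess` and `tf_idf` are all illustrative assumptions. The TF-IDF variant is the plain textbook one (term frequency times log(N / document frequency)), which differs from the smoothed variants used by libraries such as scikit-learn.

```python
import math
from collections import Counter

STOP_WORDS = {"i", "the", "and", "to", "of", "a", "in"}  # tiny illustrative set

def preprocess(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    return [t for t in tokens if t and t not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: score} dict per tokenised document."""
    n_docs = len(docs)
    # Number of documents each term appears in
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = [preprocess("I enjoy the study of data."),
        preprocess("Data and mathematics."),
        preprocess("I enjoy mathematics and coding.")]
scores = tf_idf(docs)
print(docs[0])
print(scores[0])
```

Terms that appear in only one document (such as "study" here) score higher than terms shared across documents, which is what makes TF-IDF useful for finding words distinctive to a single statement.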
    

Tokenisation:

In [49]:
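import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data, downloaded once via nltk.download
# (the resource is named 'punkt' or 'punkt_tab' depending on the NLTK version)

full_text = full_text.lower()

# Creating a dataframe to store all the steps of preprocessing
# The text is wrapped in a list so the dataframe has one row; passing a bare string
# raises "ValueError: If using all scalar values, you must pass an index"
npl_df = pd.DataFrame({'Original text': [full_text]})

tokens = word_tokenize(full_text)
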

Checking if there are any specific words that should be added to the list of stop words 
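
One way to do this check is to rank tokens by frequency and inspect the top of the list for words that carry little meaning. A minimal sketch with `collections.Counter` on a toy token list follows (the tokens are illustrative; in the notebook the input would be the `tokens` list). The notebook's own loop-based cell comes after it.

```python
from collections import Counter

# Toy tokens standing in for the tokenised personal statements
toy_tokens = ["data", "and", "science", "and", "data", "i", "and", "."]

token_counts = Counter(toy_tokens)

# Most frequent tokens first
for token, count in token_counts.most_common(3):
    print(token, count)
```

Calling `most_common(10)` on the real counts would give a top-10 list directly, although punctuation and stop words would need filtering out first for it to be informative.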

In [48]:
full_token_dict = {}

for token in tokens:
    if token in full_token_dict: 
        full_token_dict[token] += 1
    else: 
        full_token_dict[token] = 1

full_token_dict


{'taking': 5,
 'part': 5,
 'in': 127,
 'a': 118,
 'coding': 2,
 'competition': 2,
 'and': 189,
 'several': 2,
 'ukmt': 1,
 'challenges': 2,
 'exposed': 2,
 'me': 69,
 'to': 268,
 'new': 5,
 'ways': 2,
 'of': 185,
 'approaching': 2,
 'problems': 11,
 'from': 27,
 'what': 12,
 'i': 191,
 'was': 44,
 'used': 21,
 'school': 9,
 '.': 219,
 'the': 285,
 'challenge': 2,
 'having': 4,
 'think': 13,
 'differently': 1,
 'about': 24,
 'eventually': 1,
 'arrive': 1,
 'at': 18,
 'solutions': 6,
 'after': 6,
 'reaching': 1,
 'dead-ends': 1,
 'something': 3,
 'relished': 1,
 'doing': 4,
 ',': 246,
 'which': 22,
 'convinced': 1,
 'that': 84,
 'quantitative': 4,
 'degree': 3,
 'for': 48,
 'me.i': 1,
 'have': 30,
 'continued': 3,
 'myself': 5,
 'explore': 7,
 'interplay': 1,
 'between': 8,
 'computer': 13,
 'science': 36,
 'mathematics': 15,
 'by': 41,
 'solving': 2,
 'project': 11,
 'euler': 1,
 'example': 5,
 'recursion': 1,
 'compute': 3,
 'convergents': 1,
 'fractions': 1,
 'square': 1,
 'root': 1,
